
A Famous Theorem in Machine Learning: the No Free Lunch Theorem, Explained in Detail

source link: https://blog.csdn.net/qq_38406029/article/details/123061417

鬼道2022 · last edited 2022-02-22 21:56:39 · 257 reads

The No Free Lunch Theorem

Theorem (No Free Lunch): Let $\mathcal{A}$ be any learning algorithm for binary classification over a domain $\mathcal{X}$ with the $0\text{-}1$ loss, and let the training-set size be $n \le |\mathcal{X}|/2$. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ such that:
(1) there exists a function $f:\mathcal{X}\rightarrow\{0,1\}$ with $L_\mathcal{D}(f)=0$;
(2) for a sample $\mathcal{S}\sim\mathcal{D}^n$, the probability bound $P\big(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8 : \mathcal{S}\sim\mathcal{D}^n\big)\ge 1/7$ holds.
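Before working through the proof, the finite construction behind the theorem can be checked by brute force. The sketch below is not from the original article; the "memorize the sample, otherwise predict 0" learner is an arbitrary illustrative choice of $\mathcal{A}$. It enumerates all labelings of a 4-point domain ($n=2$) and verifies that some labeling forces this learner's expected risk to at least $1/4$:

```python
from itertools import product

X = [0, 1, 2, 3]          # domain C, |C| = 2n with n = 2
n = len(X) // 2           # training-set size
labelings = list(product([0, 1], repeat=len(X)))  # all T = 2^{2n} target functions

def learner(sample):
    """A concrete algorithm A: memorize the sample, predict 0 elsewhere."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

def true_error(h, f):
    """L_{D_i}(h) under the uniform distribution on X labeled by f."""
    return sum(h(x) != f[x] for x in X) / len(X)

# For every target f_i, compute E_{S ~ D_i^n}[L_{D_i}(A(S))] exactly by
# enumerating all k = (2n)^n equally likely index sequences.
expected_losses = []
for f in labelings:
    total, k = 0.0, 0
    for idxs in product(range(len(X)), repeat=n):
        sample = [(X[i], f[X[i]]) for i in idxs]
        total += true_error(learner(sample), f)
        k += 1
    expected_losses.append(total / k)

print(max(expected_losses))   # the NFL bound says this is >= 1/4
```

Here the worst case is the all-ones labeling, under which this learner errs on every unseen point, so the maximum comfortably exceeds the $1/4$ guaranteed by the theorem.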

Proof:
(1) Let $C$ be a subset of $\mathcal{X}$ of size $2n$. The main idea of the proof is that seeing the labels of only half of the points of $C$ gives no information about the labels of the other half. Let $\mathcal{H}=\{f_1,\cdots,f_T\}$ be the set of all possible functions from $C$ to the label set $\{0,1\}$, so its cardinality is $T=2^{2n}$. For each $f_i\in\mathcal{H}$, define a distribution $\mathcal{D}_i$ over $C\times\{0,1\}$ by
$$\mathcal{D}_i(\{(x,y)\})=\begin{cases}1/(2n) & \text{if } y=f_i(x)\\ 0 & \text{otherwise}\end{cases}$$
Then the function $f_i$ itself achieves zero risk under its own distribution, i.e. $L_{\mathcal{D}_i}(f_i)=0$.
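The construction in part (1) is easy to verify mechanically. A minimal sketch, again for $n=2$, assuming the same uniform-mass definition of $\mathcal{D}_i$ (the names `D_i` and `loss` are illustrative, not from the article):

```python
from itertools import product

C = [0, 1, 2, 3]                                  # subset of X with |C| = 2n, n = 2
labelings = list(product([0, 1], repeat=len(C)))  # T = 2^{2n} functions f_i

def D_i(f):
    """The distribution D_i: mass 1/(2n) on (x, f(x)), zero elsewhere."""
    mass = 1 / len(C)
    return {(x, y): (mass if y == f[x] else 0.0)
            for x in C for y in (0, 1)}

def loss(h, f):
    """0-1 risk of hypothesis h under the distribution D_i built from f."""
    return sum(p * (h(x) != y) for (x, y), p in D_i(f).items())

# For every f_i, the labeling function itself has zero risk: L_{D_i}(f_i) = 0.
assert all(loss(lambda x, f=f: f[x], f) == 0.0 for f in labelings)
print("all", len(labelings), "labelings achieve zero risk")  # 16 labelings
```

This confirms that each $\mathcal{D}_i$ is realizable: part (1) of the theorem holds for every choice of $f_i$.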
(2) The key step is to show that for any learning algorithm $\mathcal{A}$,
$$\max_{i\in[T]}\mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}\big[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))\big]\ge \frac{1}{4}.$$
Draw $n$ instances from $C$ with replacement to form a training sequence; there are $k=(2n)^n$ possible instance sequences, denoted $\mathcal{S}_1,\mathcal{S}_2,\cdots,\mathcal{S}_k$. Write $\mathcal{S}_j^i=((x_1,f_i(x_1)),\cdots,(x_n,f_i(x_n)))$ for the sequence $\mathcal{S}_j$ labeled by $f_i$. Since the $k$ sequences are equally likely under $\mathcal{D}_i^n$,
$$\mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}\big[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))\big]=\frac{1}{k}\sum_{j=1}^{k}L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i)).$$
Because $\text{maximum}\ge\text{average}\ge\text{minimum}$,
$$\max_{i\in[T]}\frac{1}{k}\sum_{j=1}^{k}L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\ge\frac{1}{T}\sum_{i=1}^{T}\frac{1}{k}\sum_{j=1}^{k}L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))=\frac{1}{k}\sum_{j=1}^{k}\frac{1}{T}\sum_{i=1}^{T}L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\ge\min_{j\in[k]}\frac{1}{T}\sum_{i=1}^{T}L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i)).$$
Now fix $j\in[k]$. Let $\mathcal{S}_j=\{x_1,\cdots,x_n\}$ and $C\backslash\mathcal{S}_j=\{v_1,\cdots,v_p\}$, where $p\ge n$ is the number of points of $C$ not sampled. For every function $h:C\rightarrow\{0,1\}$ and every $i\in[T]$,
$$L_{\mathcal{D}_i}(h)=\frac{1}{2n}\sum_{x\in C}\mathbb{I}(h(x)\ne f_i(x))=\frac{1}{2n}\left(\sum_{l=1}^{n}\mathbb{I}(h(x_l)\ne f_i(x_l))+\sum_{r=1}^{p}\mathbb{I}(h(v_r)\ne f_i(v_r))\right)\ge\frac{1}{2n}\sum_{r=1}^{p}\mathbb{I}(h(v_r)\ne f_i(v_r))\ge\frac{1}{2p}\sum_{r=1}^{p}\mathbb{I}(h(v_r)\ne f_i(v_r)),$$
where the last inequality uses $p\ge n$. Therefore
$$\frac{1}{T}\sum_{i=1}^{T}L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\ge\frac{1}{T}\sum_{i=1}^{T}\frac{1}{2p}\sum_{r=1}^{p}\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))=\frac{1}{2p}\sum_{r=1}^{p}\frac{1}{T}\sum_{i=1}^{T}\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))\ge\frac{1}{2}\min_{r\in[p]}\frac{1}{T}\sum_{i=1}^{T}\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r)).$$
For a fixed $r\in[p]$, since $\mathcal{H}$ contains all $T$ functions from $C$ to $\{0,1\}$, the functions can be matched into $T/2$ pairs $(f_a,f_b)$ that agree everywhere on $C$ except at $v_r$. Because $v_r\notin\mathcal{S}_j$, such a pair produces identical labeled samples, $\mathcal{S}_j^a=\mathcal{S}_j^b$, so $\mathcal{A}$ outputs the same hypothesis on both, and exactly one indicator in each pair fires:
$$\mathbb{I}(\mathcal{A}(\mathcal{S}_j^a)(v_r)\ne f_a(v_r))+\mathbb{I}(\mathcal{A}(\mathcal{S}_j^b)(v_r)\ne f_b(v_r))=1.$$
It follows that
$$\max_{i\in[T]}\mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}\big[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))\big]\ge\min_{j\in[k]}\frac{1}{T}\sum_{i=1}^{T}L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\ge\frac{1}{2}\cdot\frac{1}{T}\cdot\frac{T}{2}=\frac{1}{4}.$$
Finally, we use a reversed form of Markov's inequality. For a random variable $X\in[0,1]$ and a constant $a\in[0,1]$,
$$\mathbb{E}[X]=\int_0^1 xp(x)dx=\int_0^a xp(x)dx+\int_a^1 xp(x)dx\le a\int_0^a p(x)dx+\int_a^1 p(x)dx=a(1-P\{X\ge a\})+P\{X\ge a\}=a+(1-a)P\{X\ge a\},$$
which rearranges to
$$P\{X\ge a\}\ge\frac{\mathbb{E}[X]-a}{1-a}.$$
Applying this with $X=L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))$ for the maximizing index $i$ and $a=1/8$ gives
$$P\big(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8:\mathcal{S}\sim\mathcal{D}^n\big)\ge\frac{\mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]-1/8}{1-1/8}\ge\frac{1/4-1/8}{1-1/8}=\frac{1}{7}.\qquad\blacksquare$$
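The reversed Markov inequality used in the last step can also be checked empirically. A quick sketch, where the Beta(2, 5) variable is an arbitrary example of a random variable taking values in $[0,1]$:

```python
import random

random.seed(0)
# Empirical check of the reversed Markov inequality for X in [0, 1]:
#   P{X >= a} >= (E[X] - a) / (1 - a).
samples = [random.betavariate(2, 5) for _ in range(100_000)]  # X in [0, 1]
a = 1 / 8
mean = sum(samples) / len(samples)
tail = sum(s >= a for s in samples) / len(samples)
bound = (mean - a) / (1 - a)
print(tail, bound)   # the empirical tail probability dominates the bound
```

Plugging in $\mathbb{E}[X]\ge 1/4$ and $a=1/8$ recovers the $(1/4-1/8)/(7/8)=1/7$ constant in the theorem.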

