
Whiteboard Machine Learning Notes | Basics I

Kevin 吴嘉文 · Knowledge Notes · Machine Learning

Machine learning fundamentals. These notes follow the structure of the shuhuai-白板机器学习 (whiteboard machine learning) tutorial series.

Basics

Maximum Likelihood Estimation

Maximum likelihood estimation can be understood as follows: given observations known to come from a distribution $f(\theta)$, find the $\theta$ that maximizes the probability of the observed data. To solve a maximum likelihood problem, first write down the likelihood function, take its logarithm, set the derivative to zero, and solve the resulting likelihood equation.

For example, suppose $X$ contains $N$ observed samples drawn from a Gaussian distribution $f(\mu, \sigma)$. First write the log-likelihood under the Gaussian assumption:

$$
\begin{aligned}
\log P(X \mid \theta) &= \log \prod_{i=1}^{N} P\left(x_{i} \mid \theta\right) \\
&= \sum_{i=1}^{N} \log \left(\frac{1}{\sqrt{2 \pi}\, \sigma} \exp \left(-\frac{(x_{i}-\mu)^{2}}{2 \sigma^{2}}\right)\right) \\
&= \sum_{i=1}^{N}\left[\log \frac{1}{\sqrt{2 \pi}}+\log \frac{1}{\sigma}-\frac{(x_{i}-\mu)^{2}}{2 \sigma^{2}}\right]
\end{aligned}
$$

Solve for the optimal $\mu$:

$$
\begin{aligned}
\mu_{MLE} &= \arg \max_{\mu} \log P(X \mid \theta) \\
&= \arg \max_{\mu} \sum_{i=1}^{N}-\frac{\left(x_{i}-\mu\right)^{2}}{2 \sigma^{2}} \\
&= \arg \min_{\mu} \sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}
\end{aligned}
$$

Then differentiate and set the derivative to zero:

$$
\begin{aligned}
\frac{\partial}{\partial \mu} \sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2} &= \sum_{i=1}^{N} 2\left(x_{i}-\mu\right)(-1)=0 \\
\sum_{i=1}^{N}\left(x_{i}-\mu\right) &= 0 \\
\sum_{i=1}^{N} x_{i}-N \mu &= 0 \\
\mu_{MLE} &= \frac{1}{N} \sum_{i=1}^{N} x_{i}
\end{aligned}
$$

In the same way, the optimal value of $\sigma$ can be found:

$$
\begin{aligned}
\sigma_{MLE}^{2} &= \arg \max_{\sigma} \sum_{i=1}^{N}\left[\log \frac{1}{\sigma}-\frac{\left(x_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right] \\
&= \arg \max_{\sigma} \sum_{i=1}^{N}\left[-\log \sigma-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right]
\end{aligned}
$$

Set the derivative to zero:

$$
\begin{aligned}
\frac{\partial}{\partial \sigma} \sum_{i=1}^{N}\left[-\log \sigma-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right] &= \sum_{i=1}^{N}\left[-\frac{1}{\sigma}-\frac{1}{2}\left(x_{i}-\mu\right)^{2}(-2) \frac{1}{\sigma^{3}}\right]=0 \\
\sum_{i=1}^{N}\left[-\frac{1}{\sigma}+\left(x_{i}-\mu\right)^{2} \sigma^{-3}\right] &= 0 \\
\sum_{i=1}^{N}\left[-\sigma^{2}+\left(x_{i}-\mu\right)^{2}\right] &= 0 \\
-N \sigma^{2}+\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2} &= 0 \\
\sigma_{MLE}^{2} &= \frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\mu_{MLE}\right)^{2}
\end{aligned}
$$
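
As a numerical sanity check, the two closed-form solutions above are just the sample mean and the (biased) sample variance. A minimal numpy sketch, with arbitrary illustrative values for the true parameters and the sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, N = 2.0, 1.5, 10_000

# N observations assumed to come from a Gaussian f(mu, sigma).
x = rng.normal(mu_true, sigma_true, size=N)

# Closed-form MLE solutions derived above.
mu_mle = x.sum() / N                         # (1/N) * sum x_i
sigma2_mle = ((x - mu_mle) ** 2).sum() / N   # (1/N) * sum (x_i - mu_MLE)^2

# They coincide with numpy's mean and biased (ddof=0) variance.
assert np.isclose(mu_mle, x.mean())
assert np.isclose(sigma2_mle, x.var(ddof=0))
print(mu_mle, sigma2_mle)   # close to 2.0 and 1.5**2 = 2.25
```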

Biased and Unbiased Estimators

An estimator is unbiased if its expectation equals the true parameter: $E[\hat{\mu}]=\mu$, $E[\hat{\sigma}^2]=\sigma^2$.

From the results above, $\mu_{MLE}$ is unbiased:

$$
\begin{aligned}
E\left[\mu_{MLE}\right] &= E\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}\right] \\
&= \frac{1}{N} \sum_{i=1}^{N} E\left[x_{i}\right] \\
&= \frac{1}{N} \sum_{i=1}^{N} \mu \\
&= \mu
\end{aligned}
$$

The optimal solution $\sigma_{MLE}^{2}=\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\mu_{MLE}\right)^{2}$, however, is a biased estimator:

$$
\begin{aligned}
E\left[\sigma_{MLE}^{2}\right] &= E\left[\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\mu_{MLE}\right)^{2}\right] \\
&= E\left[\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}^{2}-2 x_{i} \mu_{MLE}+\mu_{MLE}^{2}\right)\right] \\
&= E\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}^{2}-2 \mu_{MLE} \frac{1}{N} \sum_{i=1}^{N} x_{i}+\frac{1}{N} \sum_{i=1}^{N} \mu_{MLE}^{2}\right] \\
&= E\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}^{2}-2 \mu_{MLE}^{2}+\mu_{MLE}^{2}\right] \\
&= E\left[\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}^{2}-\mu_{MLE}^{2}\right)\right] \\
&= \frac{1}{N} \sum_{i=1}^{N}\left(E\left[x_{i}^{2}\right]-E\left[\mu_{MLE}^{2}\right]\right) \\
&= \frac{1}{N} \sum_{i=1}^{N}\left(D\left[x_{i}\right]+E\left[x_{i}\right]^{2}-D\left[\mu_{MLE}\right]-E\left[\mu_{MLE}\right]^{2}\right) \\
&= \frac{1}{N} \sum_{i=1}^{N}\left(\sigma^{2}+\mu^{2}-D\left[\frac{1}{N} \sum_{j=1}^{N} x_{j}\right]-\mu^{2}\right) \\
&= \frac{1}{N} \sum_{i=1}^{N}\left(\sigma^{2}-\frac{1}{N^{2}} N \sigma^{2}\right) \\
&= \frac{N-1}{N} \sigma^{2}
\end{aligned}
$$

The unbiased estimator is $\hat{\sigma}^{2}=\frac{1}{N-1} \sum_{i=1}^{N}\left(x_{i}-\mu_{MLE}\right)^{2}$. The result above shows that maximum likelihood estimation introduces some bias: it systematically underestimates the variance by a factor of $\frac{N-1}{N}$.
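
The $\frac{N-1}{N}$ factor can also be observed empirically: averaging $\sigma_{MLE}^{2}$ over many small samples converges to $\frac{N-1}{N}\sigma^{2}$ rather than $\sigma^{2}$. A short simulation sketch (the trial count and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true, N, trials = 4.0, 5, 200_000

# Each row is one experiment with N samples.
x = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))

sigma2_mle = x.var(axis=1, ddof=0)        # biased MLE estimator
sigma2_unbiased = x.var(axis=1, ddof=1)   # 1/(N-1) correction

print(sigma2_mle.mean())       # ~ (N-1)/N * 4.0 = 3.2
print(sigma2_unbiased.mean())  # ~ 4.0
```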

Multivariate Gaussian Distribution

$$
x \sim N(\mu, \Sigma)=\frac{1}{(2 \pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right)
$$

The quadratic form in the exponent can be understood as the squared Mahalanobis distance from $x$ to $\mu$.

Σ\Sigma 进行特征值分解: Σ=UΛUT\Sigma=U \Lambda U^{T} ;其中 UUT=UTU=IU U^{T}=U^{T} U=I (两者正交) U=(u1,u2,,up)p×pΛ=diagi=1,,p(λi)U=\left(u_{1}, u_{2}, \cdots, u_{p}\right)_{p \times p} \Lambda=\operatorname{diag}_{i=1, \cdots, p}\left(\lambda_{i}\right) (对角阵)

The eigendecomposition lets $\Sigma^{-1}$ be written as the sum $\Sigma^{-1}=\sum_{i=1}^{p} u_{i} \frac{1}{\lambda_{i}} u_{i}^{T}$, which breaks the quadratic form into additive terms:

$$
\begin{aligned}
\Delta &= (x-\mu)^{T} \Sigma^{-1}(x-\mu) \\
&= (x-\mu)^{T} \sum_{i=1}^{p}\left(u_{i} \frac{1}{\lambda_{i}} u_{i}^{T}\right)(x-\mu) \\
&= \sum_{i=1}^{p}\left((x-\mu)^{T} u_{i} \frac{1}{\lambda_{i}} u_{i}^{T}(x-\mu)\right) \\
&= \sum_{i=1}^{p}\left(y_{i} \frac{1}{\lambda_{i}} y_{i}^{T}\right) \\
&= \sum_{i=1}^{p} \frac{y_{i}^{2}}{\lambda_{i}}
\end{aligned}
$$

where $y_{i}=(x-\mu)^{T} u_{i}$ is the projection of $(x-\mu)$ onto $u_{i}$. The Mahalanobis distance is therefore a weighted sum over the eigen-directions: the squared projection onto each eigenvector $u_{i}$ is weighted by the inverse eigenvalue $1/\lambda_{i}$.
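
This decomposition is easy to verify numerically: the quadratic form computed directly with $\Sigma^{-1}$ matches the sum $\sum_{i} y_{i}^{2}/\lambda_{i}$ obtained from the eigendecomposition. A sketch with a randomly generated positive-definite $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
mu = rng.normal(size=p)
x = rng.normal(size=p)

# Random symmetric positive-definite covariance matrix.
B = rng.normal(size=(p, p))
Sigma = B @ B.T + p * np.eye(p)

# Direct squared Mahalanobis distance.
delta_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# Via eigendecomposition Sigma = U diag(lam) U^T (eigh: symmetric case).
lam, U = np.linalg.eigh(Sigma)
y = U.T @ (x - mu)                 # y_i = u_i^T (x - mu)
delta_eig = np.sum(y**2 / lam)

assert np.isclose(delta_direct, delta_eig)
```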

Geometrically, when $p=2$, the level set $\Delta=\frac{y_{1}^{2}}{\lambda_{1}}+\frac{y_{2}^{2}}{\lambda_{2}}=r$ is an ellipse; the value of $\Delta$ determines the size of the ellipse.

Since $p(x)=\frac{1}{(2 \pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2} \Delta\right)$, fixing a value of $p(x)$ fixes $\Delta$. For $p=2$, the set of points $(x_1, x_2)$ with a given density is therefore an elliptical contour: a horizontal cross-section of the density surface in the 3D space $(x_1, x_2, p(x))$.

Since $\Sigma$ has $O(p^2)$ parameters, it is often assumed to be diagonal to simplify computation (as in PCA). The elliptical cross-sections are then axis-aligned: the symmetry axes of the ellipse are parallel to the coordinate axes.

If all the $\lambda_i$ in the diagonal $\Sigma$ are equal, the cross-sections become circles; this case is called **isotropic**.

Marginal and Conditional Distributions

Given the joint probability distribution

$$
x \sim N(\mu, \Sigma)=\frac{1}{(2 \pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right)
$$

where $x$ is partitioned into blocks $x=\left(\begin{array}{c} x_{a} \\ x_{b} \end{array}\right)$ with $\mu=\left(\begin{array}{c} \mu_{a} \\ \mu_{b} \end{array}\right)$ and $\Sigma=\left(\begin{array}{cc} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{array}\right)$, find the marginal distributions $p(x_a), p(x_b)$ and the conditional distributions $p(x_b \mid x_a), p(x_a \mid x_b)$. (The standard approach is completing the square.) The video uses a different method:

Theorem: if $x \sim N(\mu, \Sigma)$ and $y=Ax+B$, then $y \sim N\left(A \mu+B, A \Sigma A^{T}\right)$.

p(xa)p(x_a)

$$
x_{a}=\underbrace{\left(\begin{array}{ll} I_{m} & 0 \end{array}\right)}_{A} \underbrace{\left(\begin{array}{c} x_{a} \\ x_{b} \end{array}\right)}_{x}+\underbrace{0}_{B}
$$

$$
\begin{aligned}
E\left[x_{a}\right] &= A \mu+B=\left(\begin{array}{ll} I_{m} & 0 \end{array}\right)\left(\begin{array}{c} \mu_{a} \\ \mu_{b} \end{array}\right)+0=\mu_{a} \\
Var\left[x_{a}\right] &= A \Sigma A^{T}=\left(\begin{array}{ll} I_{m} & 0 \end{array}\right)\left(\begin{array}{cc} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{array}\right)\left(\begin{array}{c} I_{m} \\ 0 \end{array}\right)=\left(\begin{array}{ll} \Sigma_{aa} & \Sigma_{ab} \end{array}\right)\left(\begin{array}{c} I_{m} \\ 0 \end{array}\right)=\Sigma_{aa}
\end{aligned}
$$

Thus $x_a \sim N(\mu_a, \Sigma_{aa})$.
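
This marginalization result means that, in code, obtaining $p(x_a)$ from samples of the joint distribution amounts to simply dropping the $x_b$ coordinates. A Monte Carlo sketch with made-up block sizes and parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 3                        # dimensions of x_a and x_b
mu = np.arange(m + n, dtype=float)

B = rng.normal(size=(m + n, m + n))
Sigma = B @ B.T + np.eye(m + n)    # positive-definite joint covariance

z = rng.multivariate_normal(mu, Sigma, size=200_000)
xa = z[:, :m]                      # marginalize: keep only the x_a block

print(xa.mean(axis=0))             # ~ mu_a = mu[:m]
print(np.cov(xa.T))                # ~ Sigma_aa = Sigma[:m, :m]
```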

p(xbxa)p(x_b|x_a),先定义:

$$
\begin{aligned}
x_{b \cdot a} &= x_{b}-\Sigma_{b a} \Sigma_{a a}^{-1} x_{a} \\
\mu_{b \cdot a} &= \mu_{b}-\Sigma_{b a} \Sigma_{a a}^{-1} \mu_{a} \\
\Sigma_{b b \cdot a} &= \Sigma_{b b}-\Sigma_{b a} \Sigma_{a a}^{-1} \Sigma_{a b}
\end{aligned}
$$

The second and third expressions can be derived from the first. Hint: write the first as:

$$
x_{b \cdot a}=\left(\begin{array}{cc} -\Sigma_{b a} \Sigma_{a a}^{-1} & I \end{array}\right)\left(\begin{array}{l} x_{a} \\ x_{b} \end{array}\right)
$$

Related background: the Schur complement.

Applying the theorem to the first expression gives $x_{b \cdot a} \sim N(\mu_{b \cdot a}, \Sigma_{bb \cdot a})$ (one can also check that $x_{b \cdot a}$ is uncorrelated with, hence independent of, $x_a$). Since $x_{b}=x_{b \cdot a}+\Sigma_{b a} \Sigma_{a a}^{-1} x_{a}$, where the second term is a constant once $x_a$ is given, it follows that:

$$
\begin{aligned}
E\left[x_{b} \mid x_{a}\right] &= \mu_{b \cdot a}+\Sigma_{b a} \Sigma_{a a}^{-1} x_{a} \\
Var\left[x_{b} \mid x_{a}\right] &= Var\left[x_{b \cdot a}\right]=\Sigma_{b b \cdot a}
\end{aligned}
$$

This gives:

$$
x_{b} \mid x_{a} \sim N\left(\mu_{b \cdot a}+\Sigma_{b a} \Sigma_{a a}^{-1} x_{a},\ \Sigma_{b b \cdot a}\right) \tag{1}
$$
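
Formula $(1)$ maps directly to code: the conditional mean is $\mu_{b \cdot a}+\Sigma_{ba}\Sigma_{aa}^{-1}x_a$ and the conditional covariance is the Schur complement $\Sigma_{bb \cdot a}$. A minimal sketch (the helper name and the split-after-the-first-$m$-dimensions convention are my own):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, m, x_a):
    """Parameters of p(x_b | x_a) for a joint Gaussian N(mu, Sigma),
    with x_a the first m dimensions and x_b the rest."""
    mu_a, mu_b = mu[:m], mu[m:]
    S_aa, S_ab = Sigma[:m, :m], Sigma[:m, m:]
    S_ba, S_bb = Sigma[m:, :m], Sigma[m:, m:]

    S_aa_inv = np.linalg.inv(S_aa)
    # mu_{b.a} + S_ba S_aa^{-1} x_a  ==  mu_b + S_ba S_aa^{-1} (x_a - mu_a)
    mean = mu_b + S_ba @ S_aa_inv @ (x_a - mu_a)
    cov = S_bb - S_ba @ S_aa_inv @ S_ab   # Schur complement S_{bb.a}
    return mean, cov
```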

Linear Gaussian Model Basics

Given $p(x)=N\left(\mu, \Lambda^{-1}\right)$ and $p(y \mid x)=N\left(A x+b, L^{-1}\right)$, find $p(y)$ and $p(x \mid y)$.

Here $\Lambda$ is the precision matrix, i.e. $\Lambda=(\text{covariance matrix})^{-1}$, so $\Lambda^{-1}$ is the covariance of $x$ (and similarly $L^{-1}$ for the noise).

p(y)p(y)

$$
\begin{aligned}
E[y] &= E[A x+b+\epsilon]=A E[x]+b+E[\epsilon]=A \mu+b \\
Var[y] &= Var[A x+b+\epsilon]=Var[A x+b]+Var[\epsilon]=A \Lambda^{-1} A^{T}+L^{-1}
\end{aligned}
$$

Therefore:

$$
y \sim N\left(A \mu+b,\ A \Lambda^{-1} A^{T}+L^{-1}\right)
$$
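
Both moments of $p(y)$ can be checked by forward-sampling the generative process $y=Ax+b+\epsilon$. A rough Monte Carlo sketch (all shapes and parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
mu = rng.normal(size=p)
Lam_inv = 0.5 * np.eye(p)          # Lambda^{-1}: covariance of x
A = rng.normal(size=(q, p))
b = rng.normal(size=q)
L_inv = 0.1 * np.eye(q)            # L^{-1}: covariance of the noise

n = 300_000
x = rng.multivariate_normal(mu, Lam_inv, size=n)
eps = rng.multivariate_normal(np.zeros(q), L_inv, size=n)
y = x @ A.T + b + eps

print(y.mean(axis=0))              # ~ A mu + b
print(np.cov(y.T))                 # ~ A Lambda^{-1} A^T + L^{-1}
```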

p(xy)p(x|y):

Idea: first obtain the joint distribution, then apply the theorem above to get the conditional distribution.

Define $z=\left(\begin{array}{l} x \\ y \end{array}\right)$. From the results above:

$$
Var[z]=\left(\begin{array}{cc} \operatorname{cov}(x, x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & \operatorname{cov}(y, y) \end{array}\right)=\left(\begin{array}{cc} \Lambda^{-1} & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & L^{-1}+A \Lambda^{-1} A^{T} \end{array}\right)
$$

Once $\operatorname{cov}(x, y)$ is known, $p(x \mid y)$ follows from the theorem of the previous section.

$$
\begin{aligned}
\operatorname{cov}(x, y) &= E\left[(x-E[x]) \cdot(y-E[y])^{T}\right] \\
&= E\left[(x-\mu) \cdot(A x+b+\epsilon-A \mu-b)^{T}\right] \\
&= E\left[(x-\mu) \cdot(A x-A \mu+\epsilon)^{T}\right] \\
&= E\left[(x-\mu) \cdot(A x-A \mu)^{T}+(x-\mu) \cdot \epsilon^{T}\right] \\
&= E\left[(x-\mu)(x-\mu)^{T} A^{T}\right]+E\left[(x-\mu) \epsilon^{T}\right]
\end{aligned}
$$

因为 xμx-\muϵ\epsilon 独立,所以 E[(xμ)ϵT]=E[xμ]E[ϵT]=(E[x]μ)E[ϵT]=0E\left[(x-\mu) \epsilon^{T}\right]=E[x-\mu] \cdot E\left[\epsilon^{T}\right]=(E[x]-\mu) E\left[\epsilon^{T}\right]=0

Continuing:

$$
\begin{aligned}
\operatorname{cov}(x, y) &= E\left[(x-\mu)(x-\mu)^{T} A^{T}\right] \\
&= E\left[(x-\mu)(x-\mu)^{T}\right] A^{T} \\
&= D[x]\, A^{T} \\
&= \Lambda^{-1} A^{T}
\end{aligned}
$$

Therefore, combining with equation $(1)$ from the previous section:

$$
z \sim N\left(\left(\begin{array}{c} \mu \\ A \mu+b \end{array}\right),\left(\begin{array}{cc} \Lambda^{-1} & \Lambda^{-1} A^{T} \\ A \Lambda^{-1} & L^{-1}+A \Lambda^{-1} A^{T} \end{array}\right)\right)
$$

$$
x \mid y \sim N\left(\mu+\Lambda^{-1} A^{T}\left(L^{-1}+A \Lambda^{-1} A^{T}\right)^{-1}(y-A \mu-b),\ \Lambda^{-1}-\Lambda^{-1} A^{T}\left(L^{-1}+A \Lambda^{-1} A^{T}\right)^{-1} A \Lambda^{-1}\right)
$$
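
This posterior is the workhorse of linear-Gaussian inference (the same algebra appears in the Kalman filter update). A direct, unoptimized sketch of the two formulas (the function name is mine; a production version would avoid the explicit matrix inverse):

```python
import numpy as np

def posterior_x_given_y(mu, Lam_inv, A, b, L_inv, y):
    """p(x | y) for the model p(x) = N(mu, Lam_inv), p(y|x) = N(Ax + b, L_inv)."""
    S_yy = L_inv + A @ Lam_inv @ A.T            # marginal covariance of y
    K = Lam_inv @ A.T @ np.linalg.inv(S_yy)     # gain: Lambda^{-1} A^T S_yy^{-1}
    mean = mu + K @ (y - A @ mu - b)
    cov = Lam_inv - K @ A @ Lam_inv
    return mean, cov
```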

Jensen's Inequality

For a convex function $f(x)$: $E[f(x)] \geq f(E[x])$.

Proof:

Draw the tangent line $l(x)=ax+b$ at the point $(E[x], f(E[x]))$. Then $f(E[x])=l(E[x])$ and, by convexity, $f(x) \geq l(x)$ for all $x$. Therefore:

$$
\begin{aligned}
E[f(x)] & \geq E[l(x)] \\
&= E[a x+b] \\
&= a E[x]+b \\
&= f(E[x])
\end{aligned}
$$

A common variant of Jensen's inequality, for $t \in [0,1]$:

$$
t f(a)+(1-t) f(b) \geq f(t a+(1-t) b)
$$
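
A quick numerical illustration with the convex function $f(x)=x^2$: the gap between $E[f(x)]$ and $f(E[x])$ is exactly $Var[x]$, so it is always non-negative. Sketch (any distribution works; the exponential here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)

f = lambda t: t**2           # convex function
print(f(x).mean())           # E[f(x)]
print(f(x.mean()))           # f(E[x]) <= E[f(x)]; gap ~ Var[x]
```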
