Regression
Regression
: The function takes a vector as input and outputs a scalar.
Using a simple model
Prediction problem: given the view counts of previous days, predict the number of views on the following day.
Function with Unknown Parameters
Model
: $y = b + wx_1$
$y$ (label): no. of views on 2/26; $x_1$ (feature): no. of views on 2/25
$w$ (weight) and $b$ (bias) are unknown parameters (learned from data)
Define Loss from Training Data
- Loss is a function of the parameters: $L(b, w)$
- Loss: how good a set of parameter values is.
- Loss: $ L = \frac{1}{N}\sum\limits_{n=1}^N e_n$
- if $e = |y - \hat{y}|$, $L$ is the mean absolute error (MAE)
- if $e = (y-\hat{y})^2$, $L$ is the mean squared error (MSE)
- if $y$ and $\hat{y}$ are both probability distributions, use Cross-Entropy instead
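As a concrete illustration, here is a minimal sketch of computing MAE and MSE over a set of predictions; the `labels` and `preds` values are hypothetical, not real view counts.

```python
# Minimal sketch: computing MAE and MSE for a list of predictions.
# labels and preds are hypothetical example values, not real view counts.
labels = [4.9, 5.3, 6.1]   # true no. of views (in thousands)
preds  = [5.0, 5.0, 5.8]   # model outputs for the same days

n = len(labels)
mae = sum(abs(y - y_hat) for y, y_hat in zip(labels, preds)) / n   # L with e = |y - y_hat|
mse = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, preds)) / n  # L with e = (y - y_hat)^2

print(mae, mse)
```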
The contour plot of the loss computed for different parameter values is called the Error Surface.
Optimization
Find the $w$ and $b$ that minimize $L$: $ w^*, b^* = \arg\min\limits_{w, b} L$
The method for finding this minimizing $w$ and $b$ is called Gradient Descent.
A brief description of Gradient Descent
Taking a single parameter $w$ as an example, Gradient Descent proceeds as follows:
- Randomly initialize a starting point $w^0$
- Compute the derivative of $L$ at $w=w^0$: $\frac{{\partial}L}{{\partial}w}|_{w=w^0}$
- if the result is negative, increase $w$
- if the result is positive, decrease $w$
- the amount to increase or decrease is ${\color{red}\eta}\frac{{\partial}L}{{\partial}w}|_{w=w^0}$, where $\color{red}\eta$ is called the learning rate and is a hyperparameter
- written as an update rule: $ w^1 \leftarrow w^0 - {\color{red}\eta}\frac{{\partial}L}{{\partial}w}|_{w=w^0} $
- Repeat the steps above to keep updating $w$. Updating stops in two situations:
- the number of updates reaches a preset value
- the derivative is 0
With two parameters $w$ and $b$:
- Randomly initialize $w^0$, $b^0$
- Compute the two partial derivatives
- $ \frac{\partial L}{\partial w}|_{w=w^0, b=b^0} $, used to update $w$:
- $ w^1 \leftarrow w^0 - {\color{red}\eta}\frac{\partial L}{\partial w}|_{w=w^0, b=b^0} $
- $ \frac{\partial L}{\partial b}|_{w=w^0, b=b^0} $, used to update $b$:
- $ b^1 \leftarrow b^0 - {\color{red}\eta}\frac{\partial L}{\partial b}|_{w=w^0, b=b^0} $
- Update $w$ and $b$ iteratively (a minimal code sketch follows below)
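The following is a minimal sketch of this two-parameter update loop in plain Python, assuming MSE loss on the model $y = b + wx_1$; the data (`xs`, `ys`) and hyperparameters (`eta`, `steps`) are illustrative, not from the original notes.

```python
# Minimal gradient-descent sketch for the model y = b + w * x1 with MSE loss.
# xs/ys are hypothetical data; eta and steps are illustrative hyperparameters.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w, b = 0.0, 0.0   # w^0, b^0 (here simply zero instead of random)
eta = 0.01        # learning rate (hyperparameter)
steps = 1000      # preset number of updates

for _ in range(steps):
    n = len(xs)
    # partial derivatives of L = (1/N) * sum ((b + w*x) - y)^2
    grad_w = sum(2 * ((b + w * x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * ((b + w * x) - y) for x, y in zip(xs, ys)) / n
    # update rule: w <- w - eta * dL/dw, and the same for b
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)
```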
Using multiple consecutive days of history as features
Inspecting the data reveals a 7-day cycle, so the model is adjusted with new formulas that use more past days as features, giving the following results:
days | function | training loss | testing loss |
---|---|---|---|
1 | $ y = b + wx_{\color{red}1} $ | $ L = 0.48k $ | $ L' = 0.58k $ |
7 | $ y = b + \sum\limits_{j=1}^{\color{red}7}w_jx_j $ | $ L = 0.38k $ | $ L' = 0.49k $ |
28 | $ y = b + \sum\limits_{j=1}^{\color{red}28}w_jx_j $ | $ L = 0.33k $ | $ L' = 0.46k $ |
56 | $ y = b + \sum\limits_{j=1}^{\color{red}56}w_jx_j $ | $ L = 0.32k $ | $ L' = 0.46k $ |
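As a sketch of how these multi-day models consume the data, the following turns a daily view-count series into (feature vector, label) pairs with a sliding window over the past `days` values; the `views` series and the helper name `make_dataset` are hypothetical.

```python
# Sketch: build (features, label) pairs where the features are the view
# counts of the previous `days` days and the label is the next day's count.
# `views` is a hypothetical daily series, not the actual dataset.
views = [4.8, 4.9, 7.5, 3.2, 5.3, 5.4, 6.1, 4.9, 5.0, 7.7]

def make_dataset(series, days):
    pairs = []
    for t in range(days, len(series)):
        features = series[t - days:t]   # x_1 ... x_days
        label = series[t]               # y: the following day's views
        pairs.append((features, label))
    return pairs

dataset = make_dataset(views, days=7)
print(dataset[0])
```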
All of the models above share a common name: Linear Models.
Terminology
hyperparameter
: a parameter that has to be set by a person (not learned from data)
local minima
: local minimum
global minima
: global minimum
Summary: the basic steps of Machine Learning training
- function with unknown parameters
- define loss from training data
- optimization
Breaking the model's limitations
The blue lines show what Linear Models can produce with different $w$ and $b$; the red line shows a possible true trend. This kind of limitation, which comes from the Model itself, is called Model Bias.
All Piecewise Linear Curves
All Piecewise Linear Curves = constant + sum of a set of Hard Sigmoid functions
sigmoid
\begin{align*} y &= {\color{red}c}\frac{1}{1+e^{-({\color{green}b}+{\color{blue}w}x_1)}} \cr &= {\color{red}c}\,sigmoid({\color{green}b}+{\color{blue}w}x_1) \end{align*}
How the function's shape changes as $ {\color{blue}w}, {\color{green}b}, {\color{red}c} $ are adjusted
The red piecewise-linear curve can be composed from several Sigmoid functions: adding the constant (0) and the three sigmoid terms (1), (2), (3) together gives the red curve.
\begin{align*} b\tag{0}\cr {\color{red}c_1}\,sigmoid({\color{green}b_1}+{\color{blue}w_1}x_1)\tag{1}\cr {\color{red}c_2}\,sigmoid({\color{green}b_2}+{\color{blue}w_2}x_1)\tag{2}\cr {\color{red}c_3}\,sigmoid({\color{green}b_3}+{\color{blue}w_3}x_1)\tag{3}\cr \textcolor{red}{\text{red curve}}\text{ will be:} \cr y = b + \sum_{i=1}^3{\color{red}c_i}\,sigmoid({\color{green}b_i}+{\color{blue}w_i}x_1) \end{align*}
Based on the sigmoid, the original models are adjusted as follows: \begin{align*} y &= b+wx_1 \cr & \Downarrow \cr y &= b + \sum_{i=1}^n {\color{red}c_i} \, sigmoid({\color{green}b_i}+{\color{blue}w_i}x_1) \cr y &= b + \sum_{j=1}^m w_jx_j \cr & \Downarrow \cr y &= b + \sum_{i=1}^n {\color{red}c_i} \, sigmoid({\color{green}b_i}+\sum_{j=1}^m{\color{blue}w_{ij}}x_j) \end{align*}
Setting $n=3, m=3$ and expanding the expression $y = b + \sum_{i=1}^n {\color{red}c_i} \, sigmoid({\color{green}b_i}+\sum_{j=1}^m{\color{blue}w_{ij}}x_j)$: \begin{align*} r_1 &= {\color{green}b_1} + {\color{blue}w_{11}}x_1 + {\color{blue}w_{12}}x_2 + {\color{blue}w_{13}}x_3 \cr r_2 &= {\color{green}b_2} + {\color{blue}w_{21}}x_1 + {\color{blue}w_{22}}x_2 + {\color{blue}w_{23}}x_3 \cr r_3 &= {\color{green}b_3} + {\color{blue}w_{31}}x_1 + {\color{blue}w_{32}}x_2 + {\color{blue}w_{33}}x_3 \cr &\Downarrow \cr \begin{bmatrix} r_1 \cr r_2 \cr r_3 \end{bmatrix} &= \begin{bmatrix} {\color{green}b_1} \cr {\color{green}b_2} \cr {\color{green}b_3} \end{bmatrix} + \begin{bmatrix} {\color{blue}w_{11}} & {\color{blue}w_{12}} & {\color{blue}w_{13}} \cr {\color{blue}w_{21}} & {\color{blue}w_{22}} & {\color{blue}w_{23}} \cr {\color{blue}w_{31}} & {\color{blue}w_{32}} & {\color{blue}w_{33}} \cr \end{bmatrix} \begin{bmatrix} x_1 \cr x_2 \cr x_3 \end{bmatrix} \end{align*}
where $\color{red}\sigma$ denotes the sigmoid function, so the model can be written compactly as $y = b + c^T{\color{red}\sigma}({\color{green}b} + Wx)$.
As shown in the figure, $x$ is the feature vector, while all of $W$, ${\color{green}b}$, $c^T$, $b$ are unknown parameters; flattening and concatenating them gives one long one-dimensional vector, defined as $\color{red}\theta$.
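A minimal NumPy sketch of this forward computation with $n = m = 3$ might look as follows; all concrete parameter and feature values are made up for illustration.

```python
# Sketch: forward pass y = b + c^T * sigmoid(b_vec + W @ x) for n = m = 3.
# All parameter values below are made up for illustration.
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

x = np.array([5.2, 5.3, 4.9])            # features x_1..x_3
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])           # weights w_ij
b_vec = np.array([0.1, -0.2, 0.3])        # b_1..b_3 (green b)
c = np.array([1.0, -1.0, 0.5])            # c_1..c_3
b = 0.5                                   # scalar bias

r = b_vec + W @ x        # r_i = b_i + sum_j w_ij * x_j
a = sigmoid(r)           # a = sigma(r)
y = b + c @ a            # y = b + c^T a
print(y)
```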
Loss function
- Loss is a function of parameters $L(\theta)$
- Loss measures how good a set of parameter values is.
Optimization of New Model
$ \theta^* = \arg\min\limits_\theta L $
- (Randomly) Pick initial values $\theta^0$
- Compute the gradient $ g = \nabla L(\theta^0) $: differentiate $L$ with respect to every parameter in $\theta$; the resulting vector $g$ is called the gradient
- Then update the parameters: $ \theta^1 \leftarrow \theta^0 - {\color{red}\eta}g $
The loss over the entire dataset is $L$; the dataset is split into batches, whose losses are $L^1, L^2, L^3, \dots$. A batch is the unit of parameter update, i.e. each batch triggers one update; one epoch means every batch has been used for an update once.
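A minimal sketch of this batch/epoch structure follows; `compute_gradient` is a hypothetical placeholder for computing the gradient of the loss on one batch, and the default values of `eta`, `epochs`, and `batch_size` are illustrative.

```python
# Sketch of the batch / epoch structure of gradient descent.
# compute_gradient is a hypothetical placeholder that returns the gradient
# of L^k (the loss on one batch) with respect to the parameter vector theta.
import random

def train(data, theta, compute_gradient, eta=0.01, epochs=3, batch_size=2):
    for epoch in range(epochs):                     # one epoch = every batch used once
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = compute_gradient(theta, batch)      # gradient of L^k on this batch
            theta = [p - eta * gi for p, gi in zip(theta, g)]  # one update per batch
    return theta
```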
Switching from $ Sigmoid \rightarrow ReLU $
Rectified Linear Unit (ReLU)
: $ {\color{red}c}\,\max(0, {\color{green}b} + {\color{blue}w}x_1) $
- Functions such as sigmoid and ReLU are called Activation functions in machine learning
- One (hard) sigmoid shape requires two ReLUs to represent (see the sketch below)
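A small sketch of that last point: composing two ReLU terms to obtain a hard-sigmoid-shaped (piecewise linear) function. The breakpoint and slope values are arbitrary examples.

```python
# Sketch: a hard-sigmoid-shaped curve built from two ReLU terms.
# Breakpoints at x = 0 and x = 1, rising from 0 to 1; values are arbitrary examples.
def relu(z):
    return max(0.0, z)

def hard_sigmoid(x):
    # relu(x) rises with slope 1 from x = 0; subtracting relu(x - 1)
    # cancels the slope after x = 1, so the output flattens at 1.
    return relu(x) - relu(x - 1.0)

for x in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(x, hard_sigmoid(x))
```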
Using different numbers of ReLUs
- only one layer
- input features are the no. of views in the past 56 days
model | training loss | testing loss |
---|---|---|
linear | 0.32k | 0.46k |
10 ReLU | 0.32k | 0.45k |
100 ReLU | 0.28k | 0.43k |
1000 ReLU | 0.27k | 0.43k |
Using multiple layers of ReLU
- 100 ReLU for each layer
- input features are the no. of views in the past 56 days
Better on training data, worse on unseen data
: Overfitting; see the 4-layer row in the table below.
layer count | training loss | testing loss |
---|---|---|
1 | 0.28k | 0.43k |
2 | 0.18k | 0.39k |
3 | 0.14k | 0.38k |
4 | 0.10k | 0.44k |
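For reference, a minimal NumPy sketch of stacking such fully-connected ReLU layers; the layer sizes match the description above (56 input features, 100 ReLU per layer), but the random initialization and the feature values are purely illustrative.

```python
# Sketch: forward pass of a small network with several ReLU layers.
# 56 input features (past 56 days), 100 ReLU units per hidden layer;
# the random weights and input are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers, w_out, b_out):
    a = x
    for W, b in layers:          # each hidden layer: a <- relu(W a + b)
        a = relu(W @ a + b)
    return w_out @ a + b_out     # final linear layer produces the scalar y

n_in, n_hidden, n_layers = 56, 100, 3
layers = []
for i in range(n_layers):
    fan_in = n_in if i == 0 else n_hidden
    layers.append((rng.normal(0, 0.1, (n_hidden, fan_in)), np.zeros(n_hidden)))
w_out, b_out = rng.normal(0, 0.1, n_hidden), 0.0

x = rng.random(n_in)             # hypothetical feature vector (56 past days)
print(forward(x, layers, w_out, b_out))
```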
Backpropagation
Backpropagation
: an efficient way to compute $\sfrac{\partial L}{\partial w}$
Gradient Descent
- Take the partial derivative of $L$ with respect to every parameter, giving $\nabla L(\theta)$
- Use one batch of data to update the parameters $\theta$.
Chain Rule
- case 1: $\frac{dz}{dx}=\frac{dz}{dy}\frac{dy}{dx}$
- case 2: $\frac{dz}{ds}=\frac{dz}{dx}\frac{dx}{ds}+\frac{dz}{dy}\frac{dy}{ds}$
Forward and Backward pass:
Forward pass
: Compute $\sfrac{\partial z}{\partial w}$ for all parameters
Backward pass
: Compute $\sfrac{\partial C}{\partial z}$ for all activation function inputs $z$
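To make the forward/backward split concrete, here is a minimal single-neuron sketch with hypothetical values: the forward pass yields $\partial z/\partial w = x$, the backward pass yields $\partial C/\partial z$, and the chain rule combines them into $\partial C/\partial w$.

```python
# Sketch: chain rule on one sigmoid neuron, z = w*x + b, a = sigmoid(z), C = (a - y)^2.
# All numeric values are hypothetical.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, x, y = 0.5, 0.1, 2.0, 1.0

# forward pass: compute z, a, and dz/dw (which is simply x)
z = w * x + b
a = sigmoid(z)
dz_dw = x

# backward pass: dC/dz = dC/da * da/dz
dC_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)
dC_dz = dC_da * da_dz

# chain rule (case 1): dC/dw = dC/dz * dz/dw
dC_dw = dC_dz * dz_dw
print(dC_dw)
```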
Regularization
Regularization
: Background: when the raw data contains too many features and the model has a large number of weights $w$, some features may in fact be irrelevant to the final result. In that case we can include all features first, then use the idea of Regularization to optimize the weights $w$, reducing the influence of irrelevant features on the final result.
- Suppose the Model Function is $ y = b + \sum w_i x_i $
- The originally defined Loss Function is $ L = \sum\limits_n( \hat{y}^n - ( b + \sum w_i x_i ) )^2$
- Regularization adds the red term to the Loss Function above: $ L = \sum\limits_n( \hat{y}^n - ( b + \sum w_i x_i ) )^2 + \color{red}\lambda\sum(w_i)^2$
- The functions with smaller $w_i$ are better
- Why are smooth functions preferred?
- If some noise corrupts the input $x_i$ when testing, a smoother function is less affected: the smoother the function, the less sensitive its output is to noisy inputs.
- $\lambda$ here is also a hyperparameter (a minimal sketch of the regularized loss follows this list)
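A minimal sketch of computing this regularized loss; the data, weights, and $\lambda$ value are hypothetical, and the helper name `regularized_loss` is illustrative.

```python
# Sketch: squared-error loss plus the L2 regularization term lambda * sum(w_i^2).
# Data, weights, and lam are hypothetical example values.
def regularized_loss(xs, y_hats, w, b, lam):
    loss = 0.0
    for x, y_hat in zip(xs, y_hats):
        pred = b + sum(wi * xi for wi, xi in zip(w, x))
        loss += (y_hat - pred) ** 2
    penalty = lam * sum(wi ** 2 for wi in w)   # the red regularization term
    return loss + penalty

xs = [[1.0, 2.0], [2.0, 0.5]]    # feature vectors
y_hats = [3.0, 2.5]              # labels
w, b = [0.8, 0.3], 0.1
print(regularized_loss(xs, y_hats, w, b, lam=0.1))
```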
When we vary the value of $\lambda$ and observe how the Loss changes, we see the following:
- the larger $\lambda$ is, the stronger the influence of the Regularization term, and the smoother the overall function
- the Training loss increases as $\lambda$ increases
- the Testing loss first decreases and then increases as $\lambda$ increases
- We prefer smooth functions, but not too smooth.
- so $\lambda$ should be chosen around the turning point of the Testing loss