Deep Learning from Scratch

Neural Networks and Deep Learning & Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Course I - Neural Networks and Deep Learning

Week II - Neural Networks Basics

  1. Avoid explicit FOR loops => use vectorization instead
  2. Python broadcasting
    • vertical: axis=0
    • horizontal: axis=1
  3. Numpy
    • DO NOT USE
      • a = np.random.rand(5)
      • a.shape = (5, ) <= which is a “rank 1” array
    • RATHER
      • a = np.random.rand(5, 1)
      • column vector
      • a = np.random.rand(1, 5)
      • row vector
  4. Maximum Likelihood Estimation \(\max_{w,b}\left(- \sum^{m}_{i=1}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})\right) \Leftrightarrow \min_{w,b}\mathcal{J}(w, b) = \frac{1}{m}\sum^{m}_{i=1}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})\) Maximizing the log-likelihood on the left is equivalent to minimizing the cost on the right: drop the $-$ sign and scale by $\frac{1}{m}$, neither of which changes the optimal $w, b$.

    IID: Independent and Identically Distributed

  5. Reshape, e.g. X.shape = (a, b, c, d) (see the sketch after this list)
    • X' = X.reshape(X.shape[0], -1), so X'.shape = (a, b*c*d)
    • X'' = X.reshape(X.shape[0], X.shape[1], -1), so X''.shape = (a, b, c*d)
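
A minimal numpy sketch of items 3 and 5 above (avoiding rank-1 arrays, and flattening with reshape); the shapes are made-up examples:

      import numpy as np

      a = np.random.rand(5)         # rank-1 array: shape (5,), avoid this
      b = np.random.rand(5, 1)      # column vector: shape (5, 1)
      c = b.reshape(1, 5)           # row vector: shape (1, 5)
      assert a.shape == (5,) and b.shape == (5, 1) and c.shape == (1, 5)

      X = np.random.rand(10, 64, 64, 3)               # (a, b, c, d), e.g. 10 RGB images
      X_flat = X.reshape(X.shape[0], -1)              # shape (10, 64*64*3)
      X_part = X.reshape(X.shape[0], X.shape[1], -1)  # shape (10, 64, 64*3)
      assert X_flat.shape == (10, 64 * 64 * 3) and X_part.shape == (10, 64, 64 * 3)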

Week III - Shallow Neural Networks

  1. Sometimes, people do not count the Input Layer
  2. Matrix dimensions
    • vertical: hidden units (n_h)
    • horizontal: training examples (m)
      • W.shape = (n_h, n_x)
      • X.shape = (n_x, m)
      • b.shape = (n_h, 1)
  3. Activation functions \(\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \tag{1}\) \(\tanh^{\prime}(z) = 1 - \tanh^{2}(z) \tag{2}\) \(\text{sigmoid} = \sigma(z) = \frac{1}{1+e^{-z}} \tag{3}\) \(\sigma^{\prime}(z) = \sigma(z)(1-\sigma(z)) \tag{4}\)
    • Proof (2) \(\begin{align*} \frac{\text{d}}{\text{dz}}\tanh &= \frac{(e^z+e^{-z})\cdot\text{d}(e^z-e^{-z})-(e^z-e^{-z})\cdot\text{d}(e^z+e^{-z})}{(e^z+e^{-z})^2} \\ &= \frac{(e^z+e^{-z})\cdot(e^z+e^{-z})-(e^z-e^{-z})\cdot(e^z-e^{-z})}{(e^z+e^{-z})^2} \\ &= \frac{(e^z+e^{-z})^2-(e^z-e^{-z})^2}{(e^z+e^{-z})^2} \\ &= 1 - \left(\frac{(e^z-e^{-z})}{(e^z+e^{-z})}\right)^2 \\ &= 1 - \tanh^2(z) \end{align*}\)
    • Proof (4) \(\begin{align*} \frac{\text{d}}{\text{dz}}\sigma(z) &= \frac{\text{d}}{\text{dz}}\left(\frac{1}{1+e^{-z}}\right) \\ &= -\frac{-e^{-z}}{(1+e^{-z})^2} \\ &= \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} \\ &= \sigma(z)\left(1-\sigma(z)\right) \end{align*}\)
  4. Back-propagation

    Make sure the dimensions of the matrices match up.

    • $foo$ and $\text{d}(foo)$ have the same dimension
    • update notation: $x := A_1$ means $x$ is assigned the new value $A_1$
  5. Random initialization (see the sketch below)
    • initializing with zeros: all hidden units compute the same function => symmetric. $\text{d}w = \begin{bmatrix} u & v \\ u & v \end{bmatrix}$, so after $w^{[i]} := w^{[i]} - \alpha\cdot\text{d}w^{[i]}$ the first row of $w$ stays equal to the second row, forever
    • w[1] = np.random.randn(n_h, n_x) * 0.01 (the factor 0.01 keeps $z$ small, so the activation sits in the steep region of tanh/sigmoid where gradients are larger)
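
A minimal numpy sketch of items 3 and 5 above: random initialization to break symmetry, then one forward pass through the shallow network with a tanh hidden layer and a sigmoid output. The sizes n_x = 3, n_h = 4, m = 5 are made up for illustration.

      import numpy as np

      def sigmoid(z):
          return 1. / (1. + np.exp(-z))

      n_x, n_h, m = 3, 4, 5
      X = np.random.randn(n_x, m)            # one training example per column

      # random initialization breaks symmetry; 0.01 keeps z in the steep region of tanh
      W1 = np.random.randn(n_h, n_x) * 0.01
      b1 = np.zeros((n_h, 1))
      W2 = np.random.randn(1, n_h) * 0.01
      b2 = np.zeros((1, 1))

      # forward pass of the shallow network
      Z1 = W1 @ X + b1        # (n_h, m); b1 broadcasts across the m columns
      A1 = np.tanh(Z1)        # hidden activations, equation (1)
      Z2 = W2 @ A1 + b2       # (1, m)
      A2 = sigmoid(Z2)        # output probabilities, equation (3)
      assert A2.shape == (1, m)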

Course II - Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Week I - Practical Aspects of Deep Learning

1. Regularizing your Neural Networks

  1. Logistic regression: $\min_{w,b}\mathcal{J}(w,b)$ \(\mathcal{J}(w,b)=\frac{1}{m}\sum^{m}_{i=1}\mathcal{L}(\hat{y^{(i)}}, y^{(i)})+\frac{\lambda}{2m}\lVert w\rVert^2_2\)
    • L2 regularization \(\lVert w\rVert^2_2=\sum^{n_x}_{j=1}w_j^2=w^Tw\)
    • L1 regularization \(\lVert w\rVert_1=\sum^{n_x}_{j=1}\lvert w_j\rvert\)
  2. Frobenius norm formula - in neural network
    • the shape of w: $(n^{[l]}, n^{[l-1]})$
    • in the Frobenius double sum, $i$ runs over the units of the current layer ($n^{[l]}$) and $j$ over the units of the previous layer ($n^{[l-1]}$)
    • Frobenius norm (the matrix counterpart of $\lVert \cdot \rVert_2^2$): \(\mathcal{J}(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]})=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^L\lVert w^{[l]}\rVert_F^2 \\ \lVert w^{[l]}\rVert_F^2=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w^{[l]}_{i,j})^2\)
    • Parameter update (weight decay, since $\left(1-\frac{\alpha\lambda}{m}\right)<1$; see the sketch after this list) \(\text{d}w^{[l]}:=\text{(from backpropagation)}+\frac{\lambda}{m}w^{[l]} \\ w^{[l]}:=w^{[l]}-\alpha\,\text{d}w^{[l]}=\left(1-\frac{\alpha\lambda}{m}\right)w^{[l]}-\alpha\,\text{(from backpropagation)}\)
    • the bigger $\lambda$ is, the smaller $w$ becomes, and the smaller $z$ becomes $\Rightarrow$ the activation operates near its roughly linear region, so every layer, and hence the whole network, behaves more like a linear function than a complex non-linear one, which reduces overfitting
  3. Dropout regularization

     import numpy as np

     keep_prob = 0.8
     al = np.random.rand(3, 5)                                  # activations of layer l (toy shape)
     dl = np.random.rand(al.shape[0], al.shape[1]) < keep_prob  # boolean dropout mask
     al = al * dl          # shut down roughly 20% of the units
     al /= keep_prob       # inverted dropout: keeps the expected value of al unchanged
    

    Intuition: the network cannot rely on any one feature, so it has to spread out the weights.

  4. Other regularization
    • Data augmentation
    • Early stopping
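
A minimal numpy sketch of L2 regularization from item 2 above: the Frobenius penalty added to the cost and the $\frac{\lambda}{m}w^{[l]}$ term added to the gradients, which gives the weight-decay update. The shapes, lambd and the placeholder gradients are illustrative.

      import numpy as np

      lambd, alpha, m = 0.7, 0.01, 64                       # regularization strength, learning rate, examples
      W = [np.random.randn(4, 3), np.random.randn(1, 4)]    # toy weight matrices w[l]
      dW = [np.random.randn(4, 3), np.random.randn(1, 4)]   # "(from backpropagation)" placeholders

      # L2 / Frobenius penalty added to the cross-entropy part of the cost
      l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(w)) for w in W)

      # weight decay: each gradient gets an extra (lambd/m) * w[l] term before the update
      for l in range(len(W)):
          dW[l] = dW[l] + (lambd / m) * W[l]
          W[l] = W[l] - alpha * dW[l]    # same as (1 - alpha*lambd/m) * W[l] - alpha * (backprop gradient)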

2. Setting up your optimization problems

Normalize inputs to shrink the scale of each feature to speed up the training process.

  1. Subtract mean \(\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}\\ x:=x-\mu\)
  2. Normalize variance (std) \(\sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\right)^2\\ x:=x/\sigma\)
  3. Vanishing/Exploding gradients: in a very deep network ($L$ very large), repeatedly multiplying by weights slightly smaller or larger than the identity makes the values shrink towards zero or blow up:

    \[\Rightarrow \left(\begin{bmatrix}0.5&0.\\ 0.&0.5\end{bmatrix}\right)^L\rightarrow\begin{bmatrix}0.&0.\\ 0.&0.\end{bmatrix} \\ \Rightarrow\left(\begin{bmatrix}1.5&0.\\ 0.&1.5\end{bmatrix}\right)^L\rightarrow\begin{bmatrix} \infty &0.\\ 0.& \infty \end{bmatrix}\]
  4. Weight initialization

     W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(2. / n[l-1])   # He initialization, suited to ReLU
    
    • Xavier initialization (for $\tanh$): scale by \(\sqrt{\frac{1}{n^{[l-1]}}}\) instead
    • Another variant: \(\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}\)
  5. Numerical derivative (two-sided difference) \(f^{\prime}(\theta)=\lim_{\epsilon\rightarrow 0}\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}\)

  6. Gradient checking (see the sketch after this list) \(\begin{align*} \text{for each }i:\quad d\theta_{approx}[i]&=\frac{\mathcal{J}\left(\theta_1,\theta_2,\dots,\theta_i+\epsilon,\dots\right)-\mathcal{J}\left(\theta_1,\theta_2,\dots,\theta_i-\epsilon,\dots\right)}{2\epsilon}\\ &\approx d\theta_i = \frac{\partial\mathcal{J}}{\partial\theta_i} \end{align*}\\ \text{Check whether }\frac{\lVert d\theta_{approx}-d\theta\rVert_2}{\lVert d\theta_{approx}\rVert_2+\lVert d\theta\rVert_2} \approx 10^{-7} \Rightarrow \text{more confidence the gradients are correct.}\)
    • DO NOT run gradient checking during training – use it only to debug – it is too slow
    • If $d\theta_{approx}$ is far from $d\theta$, inspect the individual components to see which $dW^{[l]}$ or $db^{[l]}$ causes the mismatch
    • Remember to include the regularization term in $\mathcal{J}$
    • DO NOT use dropout while checking (set keep_prob = 1.0, then turn dropout back on afterwards)
    • Run the check after random initialization, and perhaps again after some training
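
A minimal sketch of the two-sided gradient check from items 5 and 6 above, applied to a simple quadratic cost so it stays self-contained; the cost function J and the value of epsilon are illustrative.

      import numpy as np

      def J(theta):
          return np.sum(theta ** 2)      # toy cost function (assumption, for illustration)

      def grad(theta):
          return 2 * theta               # analytic gradient of J, the thing we want to verify

      theta = np.random.randn(5)
      eps = 1e-7

      # two-sided difference, one component at a time
      dtheta_approx = np.zeros_like(theta)
      for i in range(theta.size):
          theta_plus, theta_minus = theta.copy(), theta.copy()
          theta_plus[i] += eps
          theta_minus[i] -= eps
          dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)

      dtheta = grad(theta)
      diff = np.linalg.norm(dtheta_approx - dtheta) / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))
      print(diff)   # around 1e-7 or smaller => the analytic gradient is very likely correct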

Week II - Optimization Algorithms

1. Mini-batch

Split the training set into mini-batches to speed up the training process.

  • $Z^{[l]}$: square brackets index layers
  • $X^{(i)}$: parentheses index training examples; shape $(n_x,\ 1)$
  • $X^{\{t\}}$: curly braces index mini-batches; shape $(n_x,\ \text{mini-batch size})$

  • Mini-batch size
    • $(=m)$: Batch Gradient Descent
      • each iteration takes too long
    • $(=1)$: Stochastic Gradient Descent (SGD)
      • loses the speed-up from vectorization

    $\Rightarrow \text{Proper (in-between) mini-batch size (see the sketch below):}$

    • still gets the speed-up from vectorization
    • makes progress without processing the whole training set in each step
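
A minimal numpy sketch of splitting a shuffled training set $(X, Y)$ into mini-batches $X^{\{t\}}, Y^{\{t\}}$ of size 64; the data shapes are made up.

      import numpy as np

      m, n_x, batch_size = 1000, 5, 64
      X = np.random.randn(n_x, m)            # inputs, one example per column
      Y = np.random.randint(0, 2, (1, m))    # labels

      # shuffle the columns, then slice consecutive mini-batches
      perm = np.random.permutation(m)
      X_shuf, Y_shuf = X[:, perm], Y[:, perm]

      mini_batches = []
      for t in range(0, m, batch_size):
          X_t = X_shuf[:, t:t + batch_size]  # shape (n_x, batch_size); the last one may be smaller
          Y_t = Y_shuf[:, t:t + batch_size]
          mini_batches.append((X_t, Y_t))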

2. Exponentially Weighted Averages

\[v_t=\beta v_{t-1}+(1-\beta)\theta_t\]
  • $v_t$: approximately average over $\frac{1}{1-\beta}$ days
    • $\beta=0.9\Rightarrow\text{10 days}$
    • $\beta=0.98\Rightarrow\text{50 days}$
    • $\beta=0.5\Rightarrow\text{2 days}$
\[\begin{align*} v_{100}&=0.1\theta_{100}+0.9v_{99}\\ &=0.1\theta_{100}+0.9(0.1\theta_{99}+0.9v_{98})\\ &=0.1\theta_{100}+0.9\cdot 0.1\theta_{99}+(0.9)^2(0.1\theta_{98}+0.9v_{97})\\ &=\cdots\\ \Rightarrow v_{100}&=0.1\sum_{i=0}^{99}(0.9)^{i}\theta_{100-i} \end{align*}\]
  • $(1-\epsilon)^{1/\epsilon} \approx \frac{1}{e}$: with $\epsilon = 1-\beta$, the weight on a value decays to about $\frac{1}{e}$ after $\frac{1}{1-\beta}$ steps, which is where the "average over $\frac{1}{1-\beta}$ days" rule of thumb comes from
  • Code for updating

      v_theta = 0.
      for theta_t in thetas:    # thetas: the sequence of observed values (e.g. daily temperatures)
          v_theta = beta * v_theta + (1. - beta) * theta_t
    

3. Bias Corrections

Use \(\frac{v_t}{1-\beta^t}\) instead of $v_t$. For small $t$, $1-\beta^t$ is much smaller than $1$, so the division scales up the early estimates; as $t\uparrow$, $\beta^t\rightarrow0$ and the correction fades away $\Rightarrow$ it warms up the estimate during the initial phase.
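
A minimal numpy sketch combining sections 2 and 3: an exponentially weighted average with bias correction; the daily values are made-up data.

      import numpy as np

      beta = 0.9
      thetas = 20 + 5 * np.random.randn(100)   # e.g. 100 daily temperatures (made-up)

      v = 0.
      v_corrected = []
      for t, theta_t in enumerate(thetas, start=1):
          v = beta * v + (1. - beta) * theta_t
          v_corrected.append(v / (1. - beta ** t))   # bias correction; matters mainly for small t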

4. Gradient Descent with Momentum

\(v_{dW}=v_{db}=0\\ \begin{align*} \text{On iteration }& t:\\ &\begin{cases} v_{dW}:=\beta v_{dW}+(1-\beta)dW\\ v_{db}:=\beta v_{db}+(1-\beta)db \end{cases}\\ &\begin{cases} W:=W-\alpha v_{dW}\\ b:=b-\alpha v_{db} \end{cases} \end{align*}\) $\bullet\ \text{Hyperparameters}: \alpha,\beta$

5. RMSprop - Root Mean Square prop

\(S_{dW}=S_{db}=0\\ \begin{align*} \text{On iteration }& t:\\ &\begin{cases} S_{dW}:=\beta S_{dW}+(1-\beta)dW^2\\ S_{db}:=\beta S_{db}+(1-\beta)db^2 \end{cases}\\ &\begin{cases} W:=W-\alpha\frac{dW}{\sqrt{S_{dW}}}\\ b:=b-\alpha\frac{db}{\sqrt{S_{db}}} \end{cases} \end{align*}\) $\bullet\ S\uparrow\ \Rightarrow\frac{\alpha}{\sqrt{S}}\downarrow;\ S\downarrow\ \Rightarrow\frac{\alpha}{\sqrt{S}}\uparrow$ $\bullet\ \text{If } S\rightarrow 0 \text{ the update would divide by }\approx 0\text{, so in practice a small }\epsilon\text{ (e.g. }10^{-8}\text{) is added to the denominator}$ $\bullet\ \text{Hyperparameters}: \alpha,\ \beta$

6. Adam Optimization Algorithm - Adaptive Moment Estimation

\(v_{dW}=v_{db}=S_{dW}=S_{db}=0\\ \begin{align*} \text{On iteration }& t:\\ &\begin{cases} v_{dW}:=\beta_1 v_{dW}+(1-\beta_1)dW\\ v_{db}:=\beta_1 v_{db}+(1-\beta_1)db \end{cases}\\ &\begin{cases} v_{dW}^{corrected}:=\frac{v_{dW}}{1-\beta_1^t}\\ v_{db}^{corrected}:=\frac{v_{db}}{1-\beta_1^t} \end{cases}\\ &\begin{cases} S_{dW}:=\beta_2 S_{dW}+(1-\beta_2)dW^2\\ S_{db}:=\beta_2 S_{db}+(1-\beta_2)db^2 \end{cases}\\ &\begin{cases} S_{dW}^{corrected}:=\frac{S_{dW}}{1-\beta_2^t}\\ S_{db}^{corrected}:=\frac{S_{db}}{1-\beta_2^t} \end{cases}\\ &\begin{cases} W:=W-\alpha\frac{v_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}\\ b:=b-\alpha\frac{v_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon} \end{cases} \end{align*}\) $\bullet\ \text{Hyperparameters}: \alpha,\ \beta_1(=0.9),\ \beta_2(=0.999),\ \epsilon(=10^{-8})$
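
A minimal numpy sketch of the Adam update above for a single parameter matrix W (the b terms are analogous and omitted for brevity); dW_fn is a placeholder standing in for the gradient from backpropagation.

      import numpy as np

      alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8   # default hyperparameters
      W = np.random.randn(4, 3)
      v_dW = np.zeros_like(W)
      S_dW = np.zeros_like(W)

      def dW_fn(W):
          return 2 * W    # placeholder gradient; in a real network this comes from backpropagation

      for t in range(1, 1001):
          dW = dW_fn(W)
          v_dW = beta1 * v_dW + (1 - beta1) * dW            # momentum term
          S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2       # RMSprop term
          v_corr = v_dW / (1 - beta1 ** t)                  # bias corrections
          S_corr = S_dW / (1 - beta2 ** t)
          W = W - alpha * v_corr / (np.sqrt(S_corr) + eps)  # Adam update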

7. Learning Rate Decay

Slowly reduce the learning rate.

\[\alpha=\frac{1}{1+\text{decay rate}\cdot\text{epoch number}}\alpha_0\]
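
A minimal sketch of the decay schedule above; alpha0 and decay_rate are illustrative values.

      alpha0, decay_rate = 0.2, 1.0

      for epoch_num in range(1, 6):
          alpha = alpha0 / (1. + decay_rate * epoch_num)
          print(epoch_num, round(alpha, 3))   # 0.1, 0.067, 0.05, 0.04, 0.033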

8. Local Optima

  • In high dimensions, most points with zero gradient are saddle points rather than bad local optima
  • Plateaus are the real problem: they slow learning down, and optimizers such as momentum, RMSprop and Adam help escape from the plateau faster

Week III - Hyperparameter Tuning, Batch Normalization and Programming Frameworks

1. Batch Normalization

Normalize the inputs/activations inside the network to make training more efficient, e.g. normalize $a^{[2]}$ to train $W^{[3]}$ and $b^{[3]}$ faster.

  • Implementation: given some intermediate values $Z^{(1)},\dots,Z^{(m)}$ of layer $l$ (see the sketch after this list):

    \[\mu = \frac{1}{m}\sum_{i}Z^{(i)}\\ \sigma^2 = \frac{1}{m}\sum_{i}\left(Z^{(i)}-\mu\right)^2\\ Z_{norm}^{(i)} = \frac{Z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}\\ \tilde{Z}^{(i)} = \gamma Z_{norm}^{(i)} + \beta\]
    • Use $\tilde{Z}^{(i)}$ instead of $Z^{(i)}$ for the rest of the layer
    • Both $\gamma$ and $\beta$ are learnable parameters of model
    • When $\gamma=\sqrt{\sigma^2+\epsilon}$ and $\beta=\mu$, $\tilde{Z}^{(i)}=Z^{(i)}$, so the identity mapping can be recovered
    • Set the mean and variance of the hidden layers to any value you want, not just 0 and 1

      \[X \xrightarrow{W^{[1]},\ b^{[1]}} Z^{[1]} \xrightarrow[\text{Batch Norm}]{\gamma^{[1]},\ \beta^{[1]}} \tilde{Z}^{[1]} \xrightarrow{g^{[1]}} A^{[1]} \xrightarrow{W^{[2]},\ b^{[2]}} Z^{[2]} \rightarrow \cdots\]
      All of $W^{[l]},\ b^{[l]},\ \gamma^{[l]},\ \beta^{[l]}$ are updated with the chosen optimizer using $dW^{[l]},\ db^{[l]},\ d\gamma^{[l]},\ d\beta^{[l]}$.
  • Batch Norm speeds up learning: it reduces the amount by which the distribution of a layer's inputs shifts around, so later layers have firmer ground to stand on. This weakens the coupling between what early layers do and what later layers see, so each layer can learn more on its own, which speeds up learning in the whole network.
  • Batch Norm as regularization
    • Each mini-batch is scaled by its own mean and variance
    • The mini-batch mean and variance add some noise to $Z^{[l]}$ (i.e. to $\tilde{Z}^{[l]}$), similar in spirit to how dropout adds noise to each hidden layer's activations
    • Slight regularization effect
  • Batch Norm at test

    At test time there is no mini-batch, so use estimates of the mean and variance accumulated across the training mini-batches (in practice usually an exponentially weighted average; a plain average is shown below)

    \[\bar{\mu} = \frac{1}{m/\text{mini-batch size}}\sum_{i}\mu^{\{i\}[l]}\\ \bar{\sigma}^2 = \frac{1}{m/\text{mini-batch size}}\sum_{i}\left(\sigma^{\{i\}[l]}\right)^2\]
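
A minimal numpy sketch of the Batch Norm forward step from the "Implementation" bullet above, for one layer's pre-activations Z on a mini-batch; gamma, beta and the shapes are illustrative.

      import numpy as np

      eps = 1e-8
      Z = np.random.randn(4, 64)          # (units in layer l, mini-batch size)
      gamma = np.ones((4, 1))             # learnable scale, one per unit
      beta = np.zeros((4, 1))             # learnable shift, one per unit

      mu = np.mean(Z, axis=1, keepdims=True)      # per-unit mean over the mini-batch
      sigma2 = np.var(Z, axis=1, keepdims=True)   # per-unit variance over the mini-batch
      Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)
      Z_tilde = gamma * Z_norm + beta             # used in place of Z before the activation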

2. Softmax Regression

\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\\ \log\left(\text{softmax}(z_i)\right) = z_i-\log\sum_j e^{z_j}\]
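
A minimal numpy sketch of the softmax above, with the usual subtract-the-max trick for numerical stability (an implementation detail, not from the notes); the logits are illustrative.

      import numpy as np

      z = np.array([2.0, 1.0, 0.1])             # example logits
      z_shift = z - np.max(z)                    # subtracting the max does not change the result
      softmax = np.exp(z_shift) / np.sum(np.exp(z_shift))
      log_softmax = z_shift - np.log(np.sum(np.exp(z_shift)))   # equals z_i - log(sum_j e^{z_j})
      print(softmax, softmax.sum())              # probabilities summing to 1.0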