There are many variations and tricks to deep learning.
This page is a work in progress listing a few of the terms and concepts that we will cover in this course.
Loss functions
- {$L_2$}
- log-likelihood
- {$\log\left(\prod_i p(y_i|x_i)\right)$}
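As a concrete illustration, here is a minimal NumPy sketch of these two losses; the function and array names (l2_loss, neg_log_likelihood, y_hat, probs) are illustrative choices, not part of the course material.

    import numpy as np

    def l2_loss(y_hat, y):
        # L2 loss: squared error summed over observations
        return np.sum((y_hat - y) ** 2)

    def neg_log_likelihood(probs):
        # probs[i] = p(y_i | x_i), the model's probability of the observed label;
        # maximizing log(prod_i p(y_i|x_i)) is the same as minimizing -sum_i log p(y_i|x_i)
        return -np.sum(np.log(probs))

    y = np.array([1.0, 0.0, 1.0])
    y_hat = np.array([0.9, 0.2, 0.7])
    print(l2_loss(y_hat, y))                               # 0.14
    print(neg_log_likelihood(np.array([0.9, 0.8, 0.7])))   # ~0.685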
Regularization
- early stopping
- start with small random weights and stop gradient descent before it has fully converged
- {$L_1$} or {$L_2$} penalty on the weights
- max norm ({$L_\infty$})
- constrain the weights to not exceed a given size
- dropout
- randomly drop a fraction (often half) of the units (and the weights attached to them) on each training iteration; a fresh random subset is drawn every iteration, so the same unit can be dropped repeatedly
- often not applied to the input layer (or applied there with a smaller drop rate)
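A minimal sketch of an {$L_2$} weight penalty and of dropout, assuming NumPy and a drop rate of one half; the "inverted dropout" rescaling of the surviving units is a common implementation choice, not something specified above, and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def l2_penalty(w, lam):
        # add lam * ||w||^2 to the loss; its gradient 2 * lam * w shrinks the weights
        return lam * np.sum(w ** 2)

    def dropout(activations, drop_prob=0.5):
        # zero each unit independently with probability drop_prob;
        # a fresh random mask is drawn on every training iteration
        mask = rng.random(activations.shape) >= drop_prob
        # "inverted dropout": rescale survivors so the expected activation is unchanged
        return activations * mask / (1.0 - drop_prob)

    h = np.array([0.5, 1.2, -0.3, 2.0])
    print(dropout(h))           # roughly half the entries zeroed, the rest doubled
    print(l2_penalty(h, 0.01))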
Equational forms for neurons (nodes)
- logistic
- {$ output = \frac{1}{1+ e^{-w^\top x}} $}
- hyperbolic tangent
- {$ output = \tanh(w^\top x) $}
- rectified linear unit (ReLU)
- {$ output = \max(0, w^\top x) $}
- max pooling
- {$ output = \max({\bf x}) $}
- the maximum of all the inputs to the neuron
- softmax (when the output should be a probability distribution)
- {$ \mathrm{softmax}_j = p(y=j|x) = \frac{e^{w_j^\top x}}{\sum_k e^{w_k^\top x}} $}
- where {$w_k$} are the weights to each output node {$k$}
- gated recurrent unit (GRU)
- to be covered later, when we do dynamic models
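These unit types are easy to write down directly; below is a NumPy sketch. The weight and input values are made up, and the max-subtraction inside softmax is a standard numerical-stability trick that is not part of the definitions above. The GRU is omitted since it is covered later.

    import numpy as np

    def logistic(w, x):
        return 1.0 / (1.0 + np.exp(-w @ x))

    def tanh_unit(w, x):
        return np.tanh(w @ x)

    def relu(w, x):
        return np.maximum(0.0, w @ x)

    def max_pool(x):
        # the maximum of all the inputs to the neuron
        return np.max(x)

    def softmax(W, x):
        # W has one row of weights w_k per output node k
        z = W @ x
        z = z - np.max(z)       # subtract the max for numerical stability
        e = np.exp(z)
        return e / np.sum(e)    # a probability distribution over the outputs

    x = np.array([0.2, -0.5, 1.0])
    w = np.array([0.4, 0.1, -0.3])
    W = np.array([[0.4, 0.1, -0.3],
                  [0.0, 0.2,  0.5]])
    print(logistic(w, x), tanh_unit(w, x), relu(w, x), max_pool(x))
    print(softmax(W, x))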
Architectures
- Supervised
- Fully connected
- Convolutional (CNN)
- local receptive fields
- with parameter tying
- requires picking the three dimensions of the filter's input box (width, height, depth) and the “stride” size
- max pooling
- often used with multitask learning
- the output {$y$} is a vector of labels for each observation
- Unsupervised
- minimize reconstruction error
- Semi-supervised
- use the outputs of an interior layer of a network trained on a different, larger data set
- Generative
- Dynamic
- Gated Recurrent Net
- LSTMs
- Attentional models
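To make “local receptive fields”, “parameter tying”, “stride”, and “max pooling” concrete, here is a small NumPy sketch of a single 2-D convolutional filter followed by 2x2 max pooling. Real CNN filters also span the depth (channel) dimension of the input box, which is dropped here for brevity, and the image, filter, and names are illustrative.

    import numpy as np

    def conv2d(image, filt, stride=1):
        # one filter slid over local receptive fields; the same weights are
        # reused at every position (parameter tying)
        H, W = image.shape
        fh, fw = filt.shape
        out_h = (H - fh) // stride + 1
        out_w = (W - fw) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + fh, j * stride:j * stride + fw]
                out[i, j] = np.sum(patch * filt)
        return out

    def max_pool2d(fmap, size=2):
        # keep the maximum of each size-by-size block (trailing rows/columns
        # that do not fill a block are dropped)
        H, W = fmap.shape
        out = np.zeros((H // size, W // size))
        for i in range(H // size):
            for j in range(W // size):
                out[i, j] = np.max(fmap[i * size:(i + 1) * size, j * size:(j + 1) * size])
        return out

    rng = np.random.default_rng(0)
    image = rng.normal(size=(6, 6))              # a toy single-channel "image"
    filt = np.array([[1.0, -1.0], [1.0, -1.0]])  # a crude vertical-edge filter
    fmap = conv2d(image, filt, stride=1)         # 5x5 feature map
    print(max_pool2d(fmap, size=2))              # 2x2 after pooling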
Search methods for doing gradient descent
- batch
- average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights
- stochastic gradient
- compute the gradient {$ \frac{\partial r_i}{\partial w} $} of the residual {$r_i$} for each observation {$(x_i, y_i)$}, and update the weights {$w$} one observation at a time
- minibatch
- use a small subset of observations (e.g. 50–100) for each update
- compute gradient
- analytical gradient or
- numerical gradient
- {$ \frac{dErr(w)}{dw} = \frac{Err(w+h) - Err(w-h)}{2h} $}
- sometimes use ‘gradient clipping’
- truncate gradients that get too large
- learning rate adaptation
- learning rate decay
- {$ \Delta w^t = \eta(t) \frac{\partial Err}{\partial w} $} where {$\eta(t)$} is a learning rate that decays over time (e.g. linearly, exponentially, …)
- adagrad
- {$ \Delta w_j^t = \frac{\eta}{\| \partial w_j^\tau \|_2} \frac{\partial Err}{\partial w_j} $} where {$\eta$} is the learning rate and {$\partial w_j^\tau$} is the vector of all the previous gradients of weight {$w_j$}
- momentum
- {$ \Delta w^t = \eta \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
- add a constant ({$m$}) times the preceding weight update to the standard gradient step on the error (with its learning rate {$\eta$}). This crudely approximates using the Hessian (the second derivative) and speeds up learning
- learning tricks and details
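The update rules above fit in a few lines of NumPy; the batch, minibatch, and stochastic variants differ only in how many observations the gradient is averaged over before a step is taken. The sketch below uses the descent sign convention (it subtracts {$\eta$} times the gradient), and the function names, decay schedule, clipping threshold, and constants are illustrative assumptions rather than anything prescribed in the notes.

    import numpy as np

    def numerical_gradient(err, w, h=1e-5):
        # central-difference estimate of dErr/dw, handy for checking an analytical gradient
        grad = np.zeros_like(w)
        for j in range(w.size):
            step = np.zeros_like(w)
            step[j] = h
            grad[j] = (err(w + step) - err(w - step)) / (2 * h)
        return grad

    def clip_gradient(grad, max_norm=5.0):
        # gradient clipping: truncate gradients whose norm gets too large
        norm = np.linalg.norm(grad)
        return grad * (max_norm / norm) if norm > max_norm else grad

    def decayed_sgd_step(w, grad, t, eta0=0.1):
        # learning rate that decays over time, here eta(t) = eta0 / (1 + t)
        return w - (eta0 / (1.0 + t)) * grad

    def adagrad_step(w, grad, hist, eta=0.1, eps=1e-8):
        # hist accumulates the squared past gradients of each weight,
        # so each weight gets its own effective learning rate
        hist = hist + grad ** 2
        return w - eta * grad / (np.sqrt(hist) + eps), hist

    def momentum_step(w, grad, prev_delta, eta=0.1, m=0.9):
        # add m times the preceding step to the current gradient step
        delta = -eta * grad + m * prev_delta
        return w + delta, delta

    # tiny demo on Err(w) = ||w||^2, whose analytical gradient is 2w
    err = lambda w: np.sum(w ** 2)
    w = np.array([1.0, -2.0])
    print(numerical_gradient(err, w))   # approximately [2., -4.]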
Visualizations
- compute input that maximizes the output of a neuron
- over all inputs in the training set
- over the entire range of possible inputs
- early layers do edge or color detection; later layers do object recognition
- display the pattern of hidden unit activations
- this mostly just shows that they are sparse
- show how occluding parts of an image affects classification accuracy
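The first visualization (finding the input that most activates a neuron) can be sketched for a single logistic unit, where the ascent direction has a closed form; for a full network the same idea is applied by back-propagating to the input. The function name and the unit-norm constraint below are illustrative assumptions.

    import numpy as np

    def logistic(w, x):
        return 1.0 / (1.0 + np.exp(-w @ x))

    def input_maximizing_output(w, steps=200, lr=0.5, seed=0):
        # gradient ascent on the input x (the weights w stay fixed) to find the
        # pattern this neuron responds to most strongly; x is kept on the unit
        # sphere so the ascent cannot simply inflate the input's magnitude
        x = np.random.default_rng(seed).normal(size=w.shape)
        for _ in range(steps):
            a = logistic(w, x)
            grad_x = a * (1.0 - a) * w     # d(output)/dx for a logistic unit
            x = x + lr * grad_x
            x = x / np.linalg.norm(x)      # project back onto the unit sphere
        return x

    w = np.array([0.5, -1.0, 2.0])
    print(input_maximizing_output(w))      # converges toward w / ||w||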
Constraints
- build in word vectors
- build in shift invariance (LeNet, LeCun, 1990s)