There are many variations and tricks to deep learning.
This page is a work in progress listing a few of the terms and concepts that we will cover in this course.
Loss functions
- L2
- log-likelihood
- classification error
Regularization
- early stopping
- start with small random weights and stop gradient descent before it fully converges
- L1 or L2 penalty on the weights
- max norm
- constrain the weights to not exceed a given size
- dropout
  - randomly remove a fraction (often half) of the units on each iteration (each unit dropped independently)
  - usually not applied to the inputs, or applied there with a smaller fraction
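As an illustration of the dropout idea above, here is a minimal NumPy sketch of "inverted" dropout (the rescaling by the keep probability is one common convention, assumed here, which keeps the expected activation unchanged):

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=None):
    """Inverted dropout: zero out a random fraction of units and
    rescale the survivors so the expected activation is unchanged."""
    rng = rng or np.random.default_rng(0)  # fixed seed only for reproducibility
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.ones(1000)
dropped = dropout(a, keep_prob=0.5)
# roughly half the entries are zeroed; survivors are rescaled to 1 / keep_prob
```

At test time no units are dropped; with inverted dropout no extra rescaling is needed then.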
Equational forms for neurons (nodes)
- logistic
- {$ output = \frac{1}{1+ e^{-w^\top x}} $}
- hyperbolic tangent
  - {$ output = \tanh(w^\top x) $}
- rectified linear unit (ReLU)
  - {$ output = \max(0, w^\top x) $}
- max pooling
  - {$ output = \max({\bf x}) $}
- the maximum of all the inputs to the neuron
- softmax (when the output should be a probability distribution)
  - {$ \mathrm{softmax}_i = \frac{e^{w_i^\top x}}{\sum_j e^{w_j^\top x}} $}
- where {$w_j$} are the weights to each output node {$j$}
- gated recurrent unit (GRU)
- to be covered later, when we do dynamic models
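The neuron equations above can be sketched directly in NumPy. This is an illustrative implementation, not a full neuron (the weighted input {$w^\top x$} is assumed to be precomputed); the max-subtraction in softmax is a standard numerical-stability trick not shown in the formula above:

```python
import numpy as np

def logistic(z):
    # 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # max(0, z), applied elementwise
    return np.maximum(0.0, z)

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
probs = softmax(z)  # a probability distribution over the three outputs
```

The hyperbolic tangent is available directly as `np.tanh`.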
Architectures
- Supervised
- Fully connected
- Convolutional (CNN)
- local receptive fields
- with parameter tying
  - requires picking the three dimensions of the local receptive field (width, height, depth) and the “stride” size
- max pooling
- often used with multitask learning
- the output {$y$} is a vector of labels for each observation
- Unsupervised
- minimize reconstruction error
- Semi-supervised
- use outputs from interior layer
- Generative
- Dynamic
- Gated Recurrent Net
- LSTMs
- Attentional models
Search methods for doing gradient descent
- batch
  - average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights
- stochastic gradient
  - compute the gradient {$ \frac{\partial Err}{\partial w} $} of the error on a single observation, and update the weights one point at a time
- minibatch
- use a small subset of points (e.g. 50–100) for each update
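The three update schemes above differ only in how many points contribute to each gradient. A minimal sketch of minibatch SGD on a linear model with squared (L2) loss, using an illustrative toy dataset (learning rate and batch size chosen arbitrarily for the example):

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Minibatch SGD on a linear model with squared loss.
    batch_size = len(X) gives batch gradient descent;
    batch_size = 1 gives pure stochastic gradient."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            # gradient of 0.5 * mean (Xw - y)^2 over the minibatch
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

# toy noiseless data: y = 2*x1 - 3*x2, so w should converge to (2, -3)
X = np.random.default_rng(1).normal(size=(64, 2))
y = X @ np.array([2.0, -3.0])
w = minibatch_sgd(X, y)
```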
- compute gradient
- analytical gradient or
- numerical gradient
    - {$ \frac{df(x)}{dx} \approx \frac{f(x+h) - f(x-h)}{2h} $}
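The central-difference formula above is typically used to check an analytical gradient. A minimal sketch (the test function and step size are illustrative choices):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Central-difference estimate of df/dx_i = (f(x + h e_i) - f(x - h e_i)) / 2h."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

f = lambda x: (x ** 2).sum()   # analytical gradient is 2x
x = np.array([1.0, -2.0, 3.0])
g = numerical_grad(f, x)       # should closely match 2 * x
```

This is far too slow for training (one pair of function evaluations per weight) but is a standard sanity check for backpropagation code.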
- sometimes use ‘gradient clipping’
- truncate gradients that get too large
- learning rate adaptation
  - learning rate decay
    - {$ \Delta w^t = f(\mu,t) \frac{\partial Err}{\partial w} $} where {$\mu$} is the initial learning rate and {$ f(\mu,t) $} controls the decay over time (e.g. linear, exponential…)
- adagrad
    - {$ \Delta w^t = \frac{\mu}{|| \delta w^\tau||_2} \frac{\partial Err}{\partial w} $} where {$\mu$} is the learning rate and {$ \delta w^\tau$} is a vector of all the previous gradients
- momentum
    - {$ \Delta w^t = \mu \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
    - add a constant ({$m$}) times the preceding weight update to the standard gradient step (with learning rate {$\mu$}). This crudely approximates using the Hessian (the matrix of second derivatives) and speeds up learning
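The momentum update above can be sketched in a few lines. This illustrative example minimizes the toy function {$f(w) = w^2$} (gradient {$2w$}); the learning rate and momentum constant are arbitrary demonstration values:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.05, m=0.9):
    """One momentum update: Delta w = lr * grad + m * (previous Delta w),
    then descend by that step."""
    velocity = lr * grad + m * velocity
    return w - velocity, velocity

# minimize f(w) = w^2, whose gradient is 2w, starting from w = 5
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
# w converges toward the minimum at 0
```

The accumulated velocity smooths successive gradients, so steps grow along directions where the gradient is consistent and cancel where it oscillates.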
- learning tricks and details
Visualizations
Constraints:
- build in word vectors
- build in shift invariance (LeNet, LeCun, 1990s)
References
Back to Lectures