
# Deep Learning

There are many variations of, and tricks for, deep learning. This page is a work in progress that lists a few of the terms and concepts we will cover in this course.

## Loss functions

- {$L_2$}
- log-likelihood
  - {$ \log \left( \prod_i p(y_i \mid x_i) \right) $}
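
To make these concrete, here is a minimal NumPy sketch (an illustration only, not course code) of the squared error and the negative log-likelihood; taking the log turns the product over observations into a sum:

```python
import numpy as np

def l2_loss(y, y_hat):
    # squared (L2) error between targets and predictions
    return np.sum((y - y_hat) ** 2)

def neg_log_likelihood(p_y_given_x):
    # p_y_given_x[i] is the model's probability of the observed label y_i given x_i;
    # summing logs equals the log of the product, but avoids numerical underflow
    return -np.sum(np.log(p_y_given_x))

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(l2_loss(y, y_hat))                                  # 0.14
print(neg_log_likelihood(np.array([0.9, 0.8, 0.7])))      # ~0.69
```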

## Regularization

- early stopping
  - start with small random weights and don't run gradient descent all the way to convergence

- {$L_1$} or {$L_2$} penalty on the weights
- max norm ({$L_\infty$})
  - constrain the weights so they do not exceed a given size

- dropout
  - randomly drop a fraction (often half) of the hidden units (and hence their weights) on each iteration; a fresh random subset is drawn every iteration
  - sometimes not applied to the inputs (see the sketch below)
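
A minimal sketch of a dropout mask applied to a layer's activations (this uses the "inverted dropout" rescaling convention; it is an illustration, not the course's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_prob=0.5, training=True):
    # zero each unit with probability drop_prob; a fresh mask is drawn on every call
    if not training or drop_prob == 0.0:
        return h
    mask = rng.random(h.shape) >= drop_prob
    # rescale the survivors so the expected activation matches test time
    return h * mask / (1.0 - drop_prob)

h = np.array([0.2, 1.5, -0.7, 3.0])
print(dropout(h))   # roughly half the entries zeroed, the rest doubled
```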

## Equational forms for neurons (nodes)

- logistic
  - {$ output = \frac{1}{1+ e^{-w^\top x}} $}

- hyperbolic tangent
  - {$ output = \tanh(w^\top x) $}

- rectified linear unit (ReLU)
  - {$ output = \max(0, w^\top x) $}

- max pooling
  - {$ output = \max({\bf x}) $}
  - the maximum of all the inputs to the neuron

- softmax (when the output should be a probability distribution)
  - {$ softmax_j = p(y=j \mid x) = \frac{e^{w_j^\top x}}{\sum_k e^{w_k^\top x}} $}
  - where {$w_k$} are the weights to each output node {$k$}

- gated recurrent unit (GRU)
  - to be covered later, when we do dynamic models
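
A compact NumPy sketch of these node types (illustrative only; the shapes assumed below are a single input vector {$x$} and, for softmax, one weight row per output node):

```python
import numpy as np

def logistic(w, x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))      # 1 / (1 + e^{-w^T x})

def tanh_unit(w, x):
    return np.tanh(np.dot(w, x))

def relu(w, x):
    return np.maximum(0.0, np.dot(w, x))

def max_pool(x):
    return np.max(x)                                 # maximum of all inputs to the node

def softmax(W, x):
    scores = W @ x                                   # one score per output node k
    scores = scores - scores.max()                   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()                               # a probability distribution over outputs

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
W = np.vstack([w, -w, 2.0 * w])
print(logistic(w, x), relu(w, x), softmax(W, x))
```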

## Architectures

- Supervised
  - Fully connected
  - Convolutional (CNN)
    - local receptive fields
    - with parameter tying
    - requires picking the three dimensions of the input box and the "stride" size (see the output-size sketch below)

  - max pooling

  - local receptive fields
  - often used with multitask learning
    - the output {$y$} is a vector of labels for each observation
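
For the CNN bullet above, a small sketch of how the filter dimensions, stride, and (optional, assumed here) zero padding determine a convolutional layer's output size:

```python
def conv_output_size(in_w, in_h, in_depth, filt_w, filt_h, n_filters, stride=1, pad=0):
    # spatial size follows floor((W - F + 2P) / S) + 1; the filter spans the full
    # input depth, and the output depth equals the number of filters
    out_w = (in_w - filt_w + 2 * pad) // stride + 1
    out_h = (in_h - filt_h + 2 * pad) // stride + 1
    return out_w, out_h, n_filters

# e.g. a 32x32x3 image, 16 filters of size 5x5x3, stride 1, no padding
print(conv_output_size(32, 32, 3, 5, 5, 16))   # (28, 28, 16)
```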

- Unsupervised
  - minimize reconstruction error (see the sketch below)
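
A minimal sketch of "minimize reconstruction error": encode the input into a smaller code, decode it back, and measure the squared difference (a one-hidden-layer autoencoder with made-up sizes is assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))   # decoder weights

def reconstruction_error(x):
    h = np.tanh(W_enc @ x)            # code (hidden representation)
    x_hat = W_dec @ h                 # reconstruction of the input
    return np.sum((x - x_hat) ** 2)   # this is what gradient descent minimizes

x = rng.normal(size=n_in)
print(reconstruction_error(x))
```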

- Semi-supervised
  - use outputs from an interior layer of a network trained on a different, larger data set

- Generative
- Dynamic
  - Gated Recurrent Net
  - LSTMs
  - Attentional models

## Search methods for doing gradient descent

- batch
  - average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights

- stochastic gradient
  - compute the gradient {$ \frac{\partial r_i}{\partial w} $} for the residual {$r_i$} of each observation {$(x_i, y_i)$}, and update the weights {$w$} one observation at a time

- minibatch
  - use a small subset of observations (e.g. 50-100) for each update (see the loop sketch below)
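
A sketch of all three styles in a single loop, for a linear model with squared error (batch size 1 gives stochastic gradient, the full training set gives batch; the data and step size here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(5)
eta, batch_size = 0.1, 50        # batch_size=1 -> stochastic, len(X) -> batch

for epoch in range(20):
    order = rng.permutation(len(X))                # shuffle each pass over the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]             # r_i for each observation in the batch
        grad = X[idx].T @ residual / len(idx)      # average gradient of the squared error
        w -= eta * grad                            # update the weights
print(np.round(w, 2))                              # close to [1, -2, 0.5, 0, 3]
```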

- compute the gradient
  - analytical gradient, or
  - numerical gradient
    - {$ \frac{dErr(w)}{dw} \approx \frac{Err(w+h) - Err(w-h)}{2h} $}
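
A sketch of using this central-difference formula to check an analytic gradient (the error function below is just a stand-in for a network's training loss):

```python
import numpy as np

def err(w):
    return np.sum(w ** 2) + np.sin(w[0])     # stand-in error function

def analytic_grad(w):
    g = 2.0 * w
    g[0] += np.cos(w[0])
    return g

def numerical_grad(f, w, h=1e-5):
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        g[j] = (f(w + step) - f(w - step)) / (2 * h)   # (Err(w+h) - Err(w-h)) / 2h
    return g

w = np.array([0.3, -1.2, 2.0])
print(np.max(np.abs(analytic_grad(w) - numerical_grad(err, w))))   # should be tiny
```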

- sometimes use 'gradient clipping'
  - truncate gradients that get too large (see the sketch below)
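
A two-line sketch of norm-based gradient clipping (the threshold is arbitrary):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)                            # L2 norm of the gradient
    return grad * (max_norm / norm) if norm > max_norm else grad

print(clip_gradient(np.array([30.0, 40.0])))               # rescaled to [3., 4.]
```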

- learning rate adaptation
  - learning rate decay
    - {$ \Delta w^t = \eta(t) \frac{\partial Err}{\partial w} $} where {$\eta(t)$} is a learning rate that decays over time at some rate (e.g. linearly or exponentially)

  - adagrad
    - {$ \Delta w_j^t = \frac{\eta}{\| g_j^{1:t} \|_2} \frac{\partial Err}{\partial w_j} $} where {$\eta$} is the learning rate and {$g_j^{1:t}$} is the vector of all the previous gradients of weight {$w_j$}

- weight decay
  - shrink each weight toward zero a little on every update (this corresponds to the {$L_2$} penalty under Regularization above)
- momentum
  - {$ \Delta w^t = \eta \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
  - add a constant ({$m$}) times the previous weight update to the standard gradient step (with its learning rate {$\eta$}); this crudely approximates using the Hessian (the second derivative) and speeds up learning
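
A sketch of these update rules side by side on a toy quadratic error (only the momentum step is actually applied in the loop; the other two are computed just to show their form, and all constants are made up):

```python
import numpy as np

def gradient(w):
    return 2.0 * (w - 1.0)          # gradient of the toy error ||w - 1||^2

w = np.zeros(3)
eta0, m = 0.1, 0.9
sum_sq_grads = np.zeros(3)          # adagrad: running sum of squared gradients
prev_step = np.zeros(3)             # momentum: the previous weight update

for t in range(1, 101):
    g = gradient(w)

    step_decay = (eta0 / t) * g                                   # decaying learning rate
    sum_sq_grads += g ** 2
    step_adagrad = eta0 * g / (np.sqrt(sum_sq_grads) + 1e-8)      # scale by norm of past gradients
    step_momentum = eta0 * g + m * prev_step                      # add in the previous step

    prev_step = step_momentum
    w -= step_momentum              # descend using the momentum step
print(np.round(w, 3))               # approaches the minimizer [1, 1, 1]
```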

- learning tricks and details
  - run "sanity checks"

## Visualizations

- compute the input that maximizes the output of a neuron
  - over all inputs in the training set, or
  - over the entire range of possible inputs (see the sketch below)
  - early layers do edge or color detection; later layers do object recognition
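
A sketch of the second variant: start from a blank input and take gradient-ascent steps on the input itself to maximize one neuron's output (a single logistic unit stands in for a neuron inside a real network; the step size and bounds are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)                        # weights of the neuron to visualize

def neuron(x):
    return 1.0 / (1.0 + np.exp(-w @ x))        # logistic unit

x = np.zeros(10)
for _ in range(100):
    out = neuron(x)
    grad_x = out * (1.0 - out) * w             # d(output)/dx for the logistic unit
    x += 0.5 * grad_x                          # gradient *ascent* on the input
    x = np.clip(x, -1.0, 1.0)                  # keep the input in a valid range
print(np.round(x, 2))                          # x ends up aligned with the signs of w
```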

- display the pattern of hidden-unit activations
  - this mostly just shows that they are sparse

- show how occluding parts of an image affects classification accuracy

## Constraints

- build in word vectors
- build in shift invariance (LeNet, Yann LeCun, 1990s)

## References

- TensorFlow
- Theano tutorial
- NVIDIA GPUs
- NLP deep learning tutorial
- vision with CNNs and vision deep learning courses
- see also the deep learning book