There are many variations and tricks to deep learning.
This page is a work in progress listing a few of the terms and concepts that we will cover in this course.
Loss functions
- {$L_2$}
- log-likelihood
- {$\log\left(\prod_i p(y_i|x_i)\right)$}
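As a concrete illustration, here is a minimal NumPy sketch of these two losses; the function and array names (l2_loss, neg_log_likelihood, y_hat, probs) are illustrative choices, not part of the course material.

    import numpy as np

    def l2_loss(y_hat, y):
        # L2 loss: squared error summed over observations
        return np.sum((y_hat - y) ** 2)

    def neg_log_likelihood(probs):
        # probs[i] = p(y_i | x_i), the model's probability of the observed label;
        # maximizing log(prod_i p(y_i|x_i)) is the same as minimizing -sum_i log p(y_i|x_i)
        return -np.sum(np.log(probs))

    y = np.array([1.0, 0.0, 1.0])
    y_hat = np.array([0.9, 0.2, 0.7])
    print(l2_loss(y_hat, y))                               # 0.14
    print(neg_log_likelihood(np.array([0.9, 0.8, 0.7])))   # ~0.685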
Regularization
- early stopping
- start with small random weights and stop gradient descent before it has fully converged
- {$L_1$} or {$L_2$} penalty on the weights
- max norm ({$L_\infty$})
- constrain the weights to not exceed a given size
- dropout
- randomly drop a fraction (often half) of the units (and the weights attached to them) on each training iteration; a fresh random subset is drawn every iteration, so the same unit can be dropped repeatedly
- often not applied to the input layer (or applied there with a smaller drop rate)
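A minimal sketch of an {$L_2$} weight penalty and of dropout, assuming NumPy and a drop rate of one half; the "inverted dropout" rescaling of the surviving units is a common implementation choice, not something specified above, and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def l2_penalty(w, lam):
        # add lam * ||w||^2 to the loss; its gradient 2 * lam * w shrinks the weights
        return lam * np.sum(w ** 2)

    def dropout(activations, drop_prob=0.5):
        # zero each unit independently with probability drop_prob;
        # a fresh random mask is drawn on every training iteration
        mask = rng.random(activations.shape) >= drop_prob
        # "inverted dropout": rescale survivors so the expected activation is unchanged
        return activations * mask / (1.0 - drop_prob)

    h = np.array([0.5, 1.2, -0.3, 2.0])
    print(dropout(h))           # roughly half the entries zeroed, the rest doubled
    print(l2_penalty(h, 0.01))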
Equational forms for neurons (nodes)
- logistic
- {$ output = \frac{1}{1+ e^{-w^\top x}} $}
- hyperbolic tangent
- {$ output = \tanh(w^\top x) $}
- rectified linear unit (ReLU)
- {$ output = \max(0, w^\top x) $}
- max pooling
- {$ output = \max({\bf x}) $}
- the maximum of all the inputs to the neuron
- softmax (when the output should be a probability distribution)
- {$ \mathrm{softmax}_j = p(y=j|x) = \frac{e^{w_j^\top x}}{\sum_k e^{w_k^\top x}} $}
- where {$w_k$} are the weights to each output node {$k$}
- gated recurrent unit (GRU)
- to be covered later, when we do dynamic models
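These unit types are easy to write down directly; below is a NumPy sketch. The weight and input values are made up, and the max-subtraction inside softmax is a standard numerical-stability trick that is not part of the definitions above. The GRU is omitted since it is covered later.

    import numpy as np

    def logistic(w, x):
        return 1.0 / (1.0 + np.exp(-w @ x))

    def tanh_unit(w, x):
        return np.tanh(w @ x)

    def relu(w, x):
        return np.maximum(0.0, w @ x)

    def max_pool(x):
        # the maximum of all the inputs to the neuron
        return np.max(x)

    def softmax(W, x):
        # W has one row of weights w_k per output node k
        z = W @ x
        z = z - np.max(z)       # subtract the max for numerical stability
        e = np.exp(z)
        return e / np.sum(e)    # a probability distribution over the outputs

    x = np.array([0.2, -0.5, 1.0])
    w = np.array([0.4, 0.1, -0.3])
    W = np.array([[0.4, 0.1, -0.3],
                  [0.0, 0.2,  0.5]])
    print(logistic(w, x), tanh_unit(w, x), relu(w, x), max_pool(x))
    print(softmax(W, x))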
Architectures
- Supervised
- Fully connected
- Convolutional (CNN)
- local receptive fields
- with parameter tying
- requires picking the three dimensions of the filter's input box (width, height, depth) and the “stride” size
- max pooling
- often used with multitask learning
- the output {$y$} is a vector of labels for each observation
- Unsupervised
- minimize reconstruction error
- Semi-supervised
- use the outputs of an interior layer of a network trained on a different, larger data set
- Generative
- Dynamic
- Gated Recurrent Net
- LSTMs
- Attentional models
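To make “local receptive fields”, “parameter tying”, “stride”, and “max pooling” concrete, here is a small NumPy sketch of a single 2-D convolutional filter followed by 2x2 max pooling. Real CNN filters also span the depth (channel) dimension of the input box, which is dropped here for brevity, and the image, filter, and names are illustrative.

    import numpy as np

    def conv2d(image, filt, stride=1):
        # one filter slid over local receptive fields; the same weights are
        # reused at every position (parameter tying)
        H, W = image.shape
        fh, fw = filt.shape
        out_h = (H - fh) // stride + 1
        out_w = (W - fw) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + fh, j * stride:j * stride + fw]
                out[i, j] = np.sum(patch * filt)
        return out

    def max_pool2d(fmap, size=2):
        # keep the maximum of each size-by-size block (trailing rows/columns
        # that do not fill a block are dropped)
        H, W = fmap.shape
        out = np.zeros((H // size, W // size))
        for i in range(H // size):
            for j in range(W // size):
                out[i, j] = np.max(fmap[i * size:(i + 1) * size, j * size:(j + 1) * size])
        return out

    rng = np.random.default_rng(0)
    image = rng.normal(size=(6, 6))              # a toy single-channel "image"
    filt = np.array([[1.0, -1.0], [1.0, -1.0]])  # a crude vertical-edge filter
    fmap = conv2d(image, filt, stride=1)         # 5x5 feature map
    print(max_pool2d(fmap, size=2))              # 2x2 after pooling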
Search methods for doing gradient descent
- batch
- average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights
- stochastic gradient
- compute the gradient {$ \frac{\partial r_i}{\partial w} $} of the residual {$r_i$} for each observation {$(x_i, y_i)$}, and update the weights {$w$} one observation at a time
- minibatch
- use a small subset of observations (e.g. 50–100) for each update
- compute gradient
- analytical gradient or
- numerical gradient
- {$ \frac{dErr(w)}{dw} = \frac{Err(w+h) - Err(w-h)}{2h} $}
- sometimes use ‘gradient clipping’
- truncate gradients that get too large
- learning rate adaptation
- learning rate decay
- {$ \Delta w^t = \eta(t) \frac{\partial Err}{\partial w} $} where {$\eta(t)$} is a learning rate that decays over time (e.g. linearly, exponentially, …)
- adagrad
- {$ \Delta w_j^t = \frac{\eta}{\| \partial w_j^\tau \|_2} \frac{\partial Err}{\partial w_j} $} where {$\eta$} is the learning rate and {$\partial w_j^\tau$} is the vector of all the previous gradients of weight {$w_j$}
- momentum
- {$ \Delta w^t = \eta \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
- add a constant ({$m$}) times the preceding weight update to the standard gradient step on the error (with its learning rate {$\eta$}). This crudely approximates using the Hessian (the second derivative) and speeds up learning
- learning tricks and details
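The update rules above fit in a few lines of NumPy; the batch, minibatch, and stochastic variants differ only in how many observations the gradient is averaged over before a step is taken. The sketch below uses the descent sign convention (it subtracts {$\eta$} times the gradient), and the function names, decay schedule, clipping threshold, and constants are illustrative assumptions rather than anything prescribed in the notes.

    import numpy as np

    def numerical_gradient(err, w, h=1e-5):
        # central-difference estimate of dErr/dw, handy for checking an analytical gradient
        grad = np.zeros_like(w)
        for j in range(w.size):
            step = np.zeros_like(w)
            step[j] = h
            grad[j] = (err(w + step) - err(w - step)) / (2 * h)
        return grad

    def clip_gradient(grad, max_norm=5.0):
        # gradient clipping: truncate gradients whose norm gets too large
        norm = np.linalg.norm(grad)
        return grad * (max_norm / norm) if norm > max_norm else grad

    def decayed_sgd_step(w, grad, t, eta0=0.1):
        # learning rate that decays over time, here eta(t) = eta0 / (1 + t)
        return w - (eta0 / (1.0 + t)) * grad

    def adagrad_step(w, grad, hist, eta=0.1, eps=1e-8):
        # hist accumulates the squared past gradients of each weight,
        # so each weight gets its own effective learning rate
        hist = hist + grad ** 2
        return w - eta * grad / (np.sqrt(hist) + eps), hist

    def momentum_step(w, grad, prev_delta, eta=0.1, m=0.9):
        # add m times the preceding step to the current gradient step
        delta = -eta * grad + m * prev_delta
        return w + delta, delta

    # tiny demo on Err(w) = ||w||^2, whose analytical gradient is 2w
    err = lambda w: np.sum(w ** 2)
    w = np.array([1.0, -2.0])
    print(numerical_gradient(err, w))   # approximately [2., -4.]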
Visualizations
- compute input that maximizes the output of a neuron
- over all inputs in the training set
- over the entire range of possible inputs
- early layers do edge or color detection; later layers do object recognition
- display the pattern of hidden unit activations
- this mostly just shows that they are sparse
- show how occluding parts of an image affects classification accuracy
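The first visualization (finding the input that most activates a neuron) can be sketched for a single logistic unit, where the ascent direction has a closed form; for a full network the same idea is applied by back-propagating to the input. The function name and the unit-norm constraint below are illustrative assumptions.

    import numpy as np

    def logistic(w, x):
        return 1.0 / (1.0 + np.exp(-w @ x))

    def input_maximizing_output(w, steps=200, lr=0.5, seed=0):
        # gradient ascent on the input x (the weights w stay fixed) to find the
        # pattern this neuron responds to most strongly; x is kept on the unit
        # sphere so the ascent cannot simply inflate the input's magnitude
        x = np.random.default_rng(seed).normal(size=w.shape)
        for _ in range(steps):
            a = logistic(w, x)
            grad_x = a * (1.0 - a) * w     # d(output)/dx for a logistic unit
            x = x + lr * grad_x
            x = x / np.linalg.norm(x)      # project back onto the unit sphere
        return x

    w = np.array([0.5, -1.0, 2.0])
    print(input_maximizing_output(w))      # converges toward w / ||w||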
Constraints
- build in word vectors
- build in shift invariance (LeNet, LeCun, 1990s)