There are many variations and tricks to deep learning.
This page is a work in progress listing a few of the terms and concepts that we will cover in this course.
Loss functions
- L2
- log-likelihood
- classification error
Regularization
- early stopping
- start with small random weights and stop gradient descent before it fully converges
- L1 or L2 penalty on the weights
- max norm
- constrain the weights to not exceed a given size
- dropout
  - randomly remove a fraction (often half) of the units on each iteration (each unit dropped independently)
  - usually not applied to the inputs, or applied there with a smaller fraction
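As an illustration of the dropout idea above, here is a minimal NumPy sketch of "inverted" dropout (the rescaling by the keep probability is one common convention, assumed here, which keeps the expected activation unchanged):

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=None):
    """Inverted dropout: zero out a random fraction of units and
    rescale the survivors so the expected activation is unchanged."""
    rng = rng or np.random.default_rng(0)  # fixed seed only for reproducibility
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.ones(1000)
dropped = dropout(a, keep_prob=0.5)
# roughly half the entries are zeroed; survivors are rescaled to 1 / keep_prob
```

At test time no units are dropped; with inverted dropout no extra rescaling is needed then.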
Equational forms for neurons (nodes)
- logistic
- {$ output = \frac{1}{1+ e^{-w^\top x}} $}
- hyperbolic tangent
  - {$ output = \tanh(w^\top x) $}
- rectified linear unit (ReLU)
  - {$ output = \max(0, w^\top x) $}
- max pooling
  - {$ output = \max({\bf x}) $}
- the maximum of all the inputs to the neuron
- softmax (when the output should be a probability distribution)
  - {$ \mathrm{softmax}_i = \frac{e^{w_i^\top x}}{\sum_j e^{w_j^\top x}} $}
- where {$w_j$} are the weights to each output node {$j$}
- gated recurrent unit (GRU)
- to be covered later, when we do dynamic models
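The neuron equations above can be sketched directly in NumPy. This is an illustrative implementation, not a full neuron (the weighted input {$w^\top x$} is assumed to be precomputed); the max-subtraction in softmax is a standard numerical-stability trick not shown in the formula above:

```python
import numpy as np

def logistic(z):
    # 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # max(0, z), applied elementwise
    return np.maximum(0.0, z)

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
probs = softmax(z)  # a probability distribution over the three outputs
```

The hyperbolic tangent is available directly as `np.tanh`.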
Architectures
- Supervised
- Fully connected
- Convolutional (CNN)
- local receptive fields
- with parameter tying
  - requires picking the three dimensions of the local receptive field (width, height, depth) and the “stride” size
- max pooling
- often used with multitask learning
- the output {$y$} is a vector of labels for each observation
- Unsupervised
- minimize reconstruction error
- Semi-supervised
- use outputs from interior layer
- Generative
- Dynamic
- Gated Recurrent Net
- LSTMs
- Attentional models
Search methods for doing gradient descent
- batch
  - average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights
- stochastic gradient
  - compute the gradient {$ \frac{\partial Err}{\partial w} $} of the error on a single observation, and update the weights one point at a time
- minibatch
- use a small subset of points (e.g. 50–100) for each update
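The three update schemes above differ only in how many points contribute to each gradient. A minimal sketch of minibatch SGD on a linear model with squared (L2) loss, using an illustrative toy dataset (learning rate and batch size chosen arbitrarily for the example):

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Minibatch SGD on a linear model with squared loss.
    batch_size = len(X) gives batch gradient descent;
    batch_size = 1 gives pure stochastic gradient."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            # gradient of 0.5 * mean (Xw - y)^2 over the minibatch
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

# toy noiseless data: y = 2*x1 - 3*x2, so w should converge to (2, -3)
X = np.random.default_rng(1).normal(size=(64, 2))
y = X @ np.array([2.0, -3.0])
w = minibatch_sgd(X, y)
```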
- compute gradient
- analytical gradient or
- numerical gradient
    - {$ \frac{df(x)}{dx} \approx \frac{f(x+h) - f(x-h)}{2h} $}
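The central-difference formula above is typically used to check an analytical gradient. A minimal sketch (the test function and step size are illustrative choices):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Central-difference estimate of df/dx_i = (f(x + h e_i) - f(x - h e_i)) / 2h."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

f = lambda x: (x ** 2).sum()   # analytical gradient is 2x
x = np.array([1.0, -2.0, 3.0])
g = numerical_grad(f, x)       # should closely match 2 * x
```

This is far too slow for training (one pair of function evaluations per weight) but is a standard sanity check for backpropagation code.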
- sometimes use ‘gradient clipping’
- truncate gradients that get too large
- learning rate adaptation
  - learning rate decay
    - {$ \Delta w^t = f(\mu,t) \frac{\partial Err}{\partial w} $} where {$\mu$} is the initial learning rate and {$ f(\mu,t) $} controls the decay over time (e.g. linear, exponential…)
- adagrad
    - {$ \Delta w^t = \frac{\mu}{|| \delta w^\tau||_2} \frac{\partial Err}{\partial w} $} where {$\mu$} is the learning rate and {$ \delta w^\tau$} is a vector of all the previous gradients
- momentum
    - {$ \Delta w^t = \mu \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
    - add a constant ({$m$}) times the preceding weight update to the standard gradient step (with learning rate {$\mu$}). This crudely approximates using the Hessian (the matrix of second derivatives) and speeds up learning
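The momentum update above can be sketched in a few lines. This illustrative example minimizes the toy function {$f(w) = w^2$} (gradient {$2w$}); the learning rate and momentum constant are arbitrary demonstration values:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.05, m=0.9):
    """One momentum update: Delta w = lr * grad + m * (previous Delta w),
    then descend by that step."""
    velocity = lr * grad + m * velocity
    return w - velocity, velocity

# minimize f(w) = w^2, whose gradient is 2w, starting from w = 5
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
# w converges toward the minimum at 0
```

The accumulated velocity smooths successive gradients, so steps grow along directions where the gradient is consistent and cancel where it oscillates.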
- learning tricks and details
Visualizations
Constraints:
- build in word vectors
- build in shift invariance (LeNet, LeCun, 1990s)
References
Back to Lectures