
# Deep Learning

There are many variations of, and tricks for, deep learning. This page is a work in progress that lists a few of the terms and concepts we will cover in this course.

## Loss functions

- {$L_2$}
- log-likelihood
  - {$ \log \left( \prod_i p(y_i \mid x_i) \right) $}
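
To make these concrete, here is a minimal NumPy sketch (an illustration only, not course code) of the squared error and the negative log-likelihood; taking the log turns the product over observations into a sum:

```python
import numpy as np

def l2_loss(y, y_hat):
    # squared (L2) error between targets and predictions
    return np.sum((y - y_hat) ** 2)

def neg_log_likelihood(p_y_given_x):
    # p_y_given_x[i] is the model's probability of the observed label y_i given x_i;
    # summing logs equals the log of the product, but avoids numerical underflow
    return -np.sum(np.log(p_y_given_x))

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(l2_loss(y, y_hat))                                  # 0.14
print(neg_log_likelihood(np.array([0.9, 0.8, 0.7])))      # ~0.69
```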

## Regularization

- early stopping
  - start with small random weights and don't run gradient descent all the way to convergence

- {$L_1$} or {$L_2$} penalty on the weights
- max norm ({$L_\infty$})
  - constrain the weights so they do not exceed a given size

- dropout
  - randomly drop a fraction (often half) of the hidden units (and hence their weights) on each iteration; a fresh random subset is drawn every iteration
  - sometimes not applied to the inputs (see the sketch below)
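
A minimal sketch of a dropout mask applied to a layer's activations (this uses the "inverted dropout" rescaling convention; it is an illustration, not the course's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_prob=0.5, training=True):
    # zero each unit with probability drop_prob; a fresh mask is drawn on every call
    if not training or drop_prob == 0.0:
        return h
    mask = rng.random(h.shape) >= drop_prob
    # rescale the survivors so the expected activation matches test time
    return h * mask / (1.0 - drop_prob)

h = np.array([0.2, 1.5, -0.7, 3.0])
print(dropout(h))   # roughly half the entries zeroed, the rest doubled
```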

## Equational forms for neurons (nodes)

- logistic
  - {$ output = \frac{1}{1+ e^{-w^\top x}} $}

- hyperbolic tangent
  - {$ output = \tanh(w^\top x) $}

- rectified linear unit (ReLU)
  - {$ output = \max(0, w^\top x) $}

- max pooling
  - {$ output = \max({\bf x}) $}
  - the maximum of all the inputs to the neuron

- softmax (when the output should be a probability distribution)
  - {$ softmax_j = p(y=j \mid x) = \frac{e^{w_j^\top x}}{\sum_k e^{w_k^\top x}} $}
  - where {$w_k$} are the weights to each output node {$k$}

- gated recurrent unit (GRU)
  - to be covered later, when we do dynamic models
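
A compact NumPy sketch of these node types (illustrative only; the shapes assumed below are a single input vector {$x$} and, for softmax, one weight row per output node):

```python
import numpy as np

def logistic(w, x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))      # 1 / (1 + e^{-w^T x})

def tanh_unit(w, x):
    return np.tanh(np.dot(w, x))

def relu(w, x):
    return np.maximum(0.0, np.dot(w, x))

def max_pool(x):
    return np.max(x)                                 # maximum of all inputs to the node

def softmax(W, x):
    scores = W @ x                                   # one score per output node k
    scores = scores - scores.max()                   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()                               # a probability distribution over outputs

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
W = np.vstack([w, -w, 2.0 * w])
print(logistic(w, x), relu(w, x), softmax(W, x))
```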

## Architectures

- Supervised
  - Fully connected
  - Convolutional (CNN)
    - local receptive fields
    - with parameter tying
    - requires picking the three dimensions of the input box and the "stride" size (see the output-size sketch below)

  - max pooling

  - local receptive fields
  - often used with multitask learning
    - the output {$y$} is a vector of labels for each observation
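
For the CNN bullet above, a small sketch of how the filter dimensions, stride, and (optional, assumed here) zero padding determine a convolutional layer's output size:

```python
def conv_output_size(in_w, in_h, in_depth, filt_w, filt_h, n_filters, stride=1, pad=0):
    # spatial size follows floor((W - F + 2P) / S) + 1; the filter spans the full
    # input depth, and the output depth equals the number of filters
    out_w = (in_w - filt_w + 2 * pad) // stride + 1
    out_h = (in_h - filt_h + 2 * pad) // stride + 1
    return out_w, out_h, n_filters

# e.g. a 32x32x3 image, 16 filters of size 5x5x3, stride 1, no padding
print(conv_output_size(32, 32, 3, 5, 5, 16))   # (28, 28, 16)
```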

- Unsupervised
  - minimize reconstruction error (see the sketch below)
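
A minimal sketch of "minimize reconstruction error": encode the input into a smaller code, decode it back, and measure the squared difference (a one-hidden-layer autoencoder with made-up sizes is assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))   # decoder weights

def reconstruction_error(x):
    h = np.tanh(W_enc @ x)            # code (hidden representation)
    x_hat = W_dec @ h                 # reconstruction of the input
    return np.sum((x - x_hat) ** 2)   # this is what gradient descent minimizes

x = rng.normal(size=n_in)
print(reconstruction_error(x))
```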

- Semi-supervised
  - use outputs from an interior layer of a network trained on a different, larger data set

- Generative
- Dynamic
  - Gated Recurrent Net
  - LSTMs
  - Attentional models

## Search methods for doing gradient descent

- batch
  - average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights

- stochastic gradient
  - compute the gradient {$ \frac{\partial r_i}{\partial w} $} for the residual {$r_i$} of each observation {$(x_i, y_i)$}, and update the weights {$w$} one observation at a time

- minibatch
  - use a small subset of observations (e.g. 50-100) for each update (see the loop sketch below)
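
A sketch of all three styles in a single loop, for a linear model with squared error (batch size 1 gives stochastic gradient, the full training set gives batch; the data and step size here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(5)
eta, batch_size = 0.1, 50        # batch_size=1 -> stochastic, len(X) -> batch

for epoch in range(20):
    order = rng.permutation(len(X))                # shuffle each pass over the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]             # r_i for each observation in the batch
        grad = X[idx].T @ residual / len(idx)      # average gradient of the squared error
        w -= eta * grad                            # update the weights
print(np.round(w, 2))                              # close to [1, -2, 0.5, 0, 3]
```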

- compute the gradient
  - analytical gradient, or
  - numerical gradient
    - {$ \frac{dErr(w)}{dw} \approx \frac{Err(w+h) - Err(w-h)}{2h} $}
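
A sketch of using this central-difference formula to check an analytic gradient (the error function below is just a stand-in for a network's training loss):

```python
import numpy as np

def err(w):
    return np.sum(w ** 2) + np.sin(w[0])     # stand-in error function

def analytic_grad(w):
    g = 2.0 * w
    g[0] += np.cos(w[0])
    return g

def numerical_grad(f, w, h=1e-5):
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        g[j] = (f(w + step) - f(w - step)) / (2 * h)   # (Err(w+h) - Err(w-h)) / 2h
    return g

w = np.array([0.3, -1.2, 2.0])
print(np.max(np.abs(analytic_grad(w) - numerical_grad(err, w))))   # should be tiny
```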

- sometimes use 'gradient clipping'
  - truncate gradients that get too large (see the sketch below)
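
A two-line sketch of norm-based gradient clipping (the threshold is arbitrary):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)                            # L2 norm of the gradient
    return grad * (max_norm / norm) if norm > max_norm else grad

print(clip_gradient(np.array([30.0, 40.0])))               # rescaled to [3., 4.]
```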

- learning rate adaptation
  - learning rate decay
    - {$ \Delta w^t = \eta(t) \frac{\partial Err}{\partial w} $} where {$\eta(t)$} is a learning rate that decays over time at some rate (e.g. linearly or exponentially)

  - adagrad
    - {$ \Delta w_j^t = \frac{\eta}{\| g_j^{1:t} \|_2} \frac{\partial Err}{\partial w_j} $} where {$\eta$} is the learning rate and {$g_j^{1:t}$} is the vector of all the previous gradients of weight {$w_j$}

- weight decay
  - shrink each weight toward zero a little on every update (this corresponds to the {$L_2$} penalty under Regularization above)
- momentum
  - {$ \Delta w^t = \eta \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
  - add a constant ({$m$}) times the previous weight update to the standard gradient step (with its learning rate {$\eta$}); this crudely approximates using the Hessian (the second derivative) and speeds up learning
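
A sketch of these update rules side by side on a toy quadratic error (only the momentum step is actually applied in the loop; the other two are computed just to show their form, and all constants are made up):

```python
import numpy as np

def gradient(w):
    return 2.0 * (w - 1.0)          # gradient of the toy error ||w - 1||^2

w = np.zeros(3)
eta0, m = 0.1, 0.9
sum_sq_grads = np.zeros(3)          # adagrad: running sum of squared gradients
prev_step = np.zeros(3)             # momentum: the previous weight update

for t in range(1, 101):
    g = gradient(w)

    step_decay = (eta0 / t) * g                                   # decaying learning rate
    sum_sq_grads += g ** 2
    step_adagrad = eta0 * g / (np.sqrt(sum_sq_grads) + 1e-8)      # scale by norm of past gradients
    step_momentum = eta0 * g + m * prev_step                      # add in the previous step

    prev_step = step_momentum
    w -= step_momentum              # descend using the momentum step
print(np.round(w, 3))               # approaches the minimizer [1, 1, 1]
```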

- learning tricks and details
  - run "sanity checks"

## Visualizations

- compute the input that maximizes the output of a neuron
  - over all inputs in the training set, or
  - over the entire range of possible inputs (see the sketch below)
  - early layers do edge or color detection; later layers do object recognition
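
A sketch of the second variant: start from a blank input and take gradient-ascent steps on the input itself to maximize one neuron's output (a single logistic unit stands in for a neuron inside a real network; the step size and bounds are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)                        # weights of the neuron to visualize

def neuron(x):
    return 1.0 / (1.0 + np.exp(-w @ x))        # logistic unit

x = np.zeros(10)
for _ in range(100):
    out = neuron(x)
    grad_x = out * (1.0 - out) * w             # d(output)/dx for the logistic unit
    x += 0.5 * grad_x                          # gradient *ascent* on the input
    x = np.clip(x, -1.0, 1.0)                  # keep the input in a valid range
print(np.round(x, 2))                          # x ends up aligned with the signs of w
```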

- display the pattern of hidden-unit activations
  - this mostly just shows that they are sparse

- show how occluding parts of an image affects classification accuracy

## Constraints

- build in word vectors
- build in shift invariance (LeNet, Yann LeCun, 1990s)

## References

- TensorFlow
- Theano tutorial
- NVIDIA GPUs
- NLP deep learning tutorial
- vision with CNNs and vision deep learning courses
- see also the deep learning book