Warning: "continue" targeting switch is equivalent to "break". Did you mean to use "continue 2"? in /cgihome/cis520/html/dynamic/2017/wiki/pmwiki.php on line 691

Warning: "continue" targeting switch is equivalent to "break". Did you mean to use "continue 2"? in /cgihome/cis520/html/dynamic/2017/wiki/pmwiki.php on line 694

Warning: Use of undefined constant MathJaxInlineCallback - assumed 'MathJaxInlineCallback' (this will throw an Error in a future version of PHP) in /cgihome/cis520/html/dynamic/2017/wiki/cookbook/MathJax.php on line 84

Warning: Use of undefined constant MathJaxEquationCallback - assumed 'MathJaxEquationCallback' (this will throw an Error in a future version of PHP) in /cgihome/cis520/html/dynamic/2017/wiki/cookbook/MathJax.php on line 88

Warning: Use of undefined constant MathJaxLatexeqrefCallback - assumed 'MathJaxLatexeqrefCallback' (this will throw an Error in a future version of PHP) in /cgihome/cis520/html/dynamic/2017/wiki/cookbook/MathJax.php on line 94
CIS520 Machine Learning | Lectures / Deep Learning

Deep Learning

There are many variations and tricks to deep learning. This page is a work in progress, listing a few of the terms and concepts that we will cover in this course.

Loss functions

  • squared error ({$L_2$})
  • log-likelihood
    • {$\log\left(\prod_i p(y_i|x_i)\right) = \sum_i \log p(y_i|x_i)$} (see the sketch after this list)
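A minimal numpy sketch of these two losses, assuming a binary-classification setup with predicted probabilities (the array names and toy numbers are illustrative, not from the lecture):

import numpy as np

def l2_loss(y_hat, y):
    # squared-error (L2) loss, averaged over the training set
    return np.mean((y_hat - y) ** 2)

def neg_log_likelihood(p_of_observed_label):
    # -log prod_i p(y_i|x_i) = -sum_i log p(y_i|x_i); minimizing this maximizes the log-likelihood
    return -np.sum(np.log(p_of_observed_label))

# illustrative usage with assumed binary labels and predicted probabilities
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.7])              # predicted p(y_i = 1 | x_i)
p_observed = np.where(y == 1, p, 1 - p)    # probability assigned to the true label
print(l2_loss(p, y), neg_log_likelihood(p_observed))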

Regularization

  • early stopping
    • start with small random weights and stop the gradient descent before it fully converges
  • {$L_1$} or {$L_2$} penalty on the weights
  • max norm ({$L_\infty$})
    • constrain the weights to not exceed a given size
  • dropout
    • randomly drop a fraction (often half) of the hidden units (and their weights) on each training iteration; a fresh random subset is sampled each iteration
    • often not applied to the input layer (see the sketch after this list)
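As an illustration of the dropout idea, a minimal numpy sketch that zeroes out a random subset of a hidden layer's activations on each training pass (the layer size, drop rate, and the "inverted dropout" rescaling are assumptions for the example, not prescribed by the lecture):

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob=0.5, training=True):
    # randomly zero a fraction of the units; rescale the survivors so the
    # expected activation is unchanged ("inverted dropout")
    if not training:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # fresh random mask each call
    return activations * mask / keep_prob

h = rng.standard_normal((4, 8))   # a minibatch of 4 hidden-layer activation vectors
print(np.round(dropout(h), 2))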

Equational forms for neurons (nodes)

  • logistic
    • {$ output = \frac{1}{1+ e^{-w^\top x}} $}
  • hyperbolic tangent
    • {$ output = \tanh(w^\top x) $}
  • rectified linear unit (ReLU)
    • {$ output = \max(0, w^\top x) $}
  • max pooling
    • {$ output = \max(\mathbf{x}) $}
    • the maximum of all the inputs to the neuron
  • softmax (when the output should be a probability distribution)
    • {$ \mathrm{softmax}_j = p(y=j|x) = \frac{e^{w_j^\top x}}{\sum_k e^{w_k^\top x}} $}
    • where {$w_k$} are the weights to each output node {$k$} (see the sketch after this list)
  • gated recurrent unit (GRU)
    • to be covered later, when we do dynamic models
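A compact numpy sketch of these neuron output functions (the GRU is omitted since it comes later; the toy weights and inputs are illustrative assumptions):

import numpy as np

def logistic(w, x):
    return 1.0 / (1.0 + np.exp(-w @ x))

def tanh_unit(w, x):
    return np.tanh(w @ x)

def relu(w, x):
    return np.maximum(0.0, w @ x)

def max_pool(x):
    return np.max(x)              # the maximum of all inputs to the neuron

def softmax(W, x):
    # W has one row of weights w_k per output node k
    z = W @ x
    z = z - np.max(z)             # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()            # a probability distribution over the outputs

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
W = np.vstack([w, -w, 2 * w])
print(logistic(w, x), tanh_unit(w, x), relu(w, x), max_pool(x), softmax(W, x))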

Architectures

  • Supervised
    • Fully connected
    • Convolutional (CNN)
      • local receptive fields
        • with parameter tying
        • requires picking the three dimensions of the input box (the local receptive field) and the “stride” size (see the convolution sketch after this list)
      • max pooling
    • often used with multitask learning
      • the output {$y$} is a vector of labels for each observation
  • Unsupervised
    • minimize reconstruction error
  • Semi-supervised
    • use the outputs from an interior layer of a network trained on a different, larger data set
  • Generative
  • Dynamic
    • Gated Recurrent Net
    • LSTMs
    • Attentional models
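To make "local receptive fields with parameter tying" and "max pooling" concrete, here is a minimal 1-D numpy sketch (real CNNs use 2-D or 3-D filter boxes and a chosen stride; the single filter, stride of 1, and pooling width here are assumptions for illustration):

import numpy as np

def conv1d_valid(x, w):
    # local receptive field with parameter tying: the same filter weights w
    # are applied at every position of the input x (stride 1, no padding)
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

def max_pool1d(z, pool=2):
    # non-overlapping max pooling: keep the maximum of each block
    trimmed = z[: (len(z) // pool) * pool]
    return trimmed.reshape(-1, pool).max(axis=1)

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 2.0, 4.0])
w = np.array([1.0, -1.0, 0.5])                      # one shared (tied) filter
feature_map = np.maximum(0.0, conv1d_valid(x, w))   # ReLU after the convolution
print(max_pool1d(feature_map, pool=2))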

Search methods for doing gradient descent

  • batch
    • average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights
  • stochastic gradient
    • compute the gradient {$ \frac{\partial r_i}{\partial w} $} of the residual {$r_i$} for each observation {$(x_i, y_i)$}, and update the weights {$w$} one observation at a time
  • minibatch
    • use a small subset of observations (e.g. 50–100) for each update
  • compute gradient
    • analytical gradient or
    • numerical gradient
      • {$ \frac{dErr(w)}{dw} \approx \frac{Err(w+h) - Err(w-h)}{2h} $} (see the gradient-check sketch after this list)
    • sometimes use ‘gradient clipping’
      • truncate gradients that get too large
  • learning rate adaption
    • weight decay (here, a decaying learning rate)
      • {$ \Delta w^t = \eta(t) \frac{\partial Err}{\partial w} $} where {$\eta(t)$} is the learning rate, which decays at some rate (e.g. linearly or exponentially)
    • adagrad
      • {$ \Delta w_j^t = \frac{\eta}{\| g_j^{1:t} \|_2} \frac{\partial Err}{\partial w_j} $} where {$\eta$} is the learning rate and {$g_j^{1:t}$} is the vector of all the previous gradients of the weight {$w_j$}
  • momentum
    • {$ \Delta w^t = \eta \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
    • add a constant ({$m$}) times the previous weight update to the standard gradient step (with its learning rate {$\eta$}); this crudely approximates using the Hessian (the second derivative) and speeds learning (see the SGD sketch after this list)
  • learning tricks and details
    • run “sanity checks” (e.g. compare the analytical gradient to the numerical gradient before training)
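A minimal sketch of the numerical-gradient sanity check above: compare an analytical gradient to the centered-difference approximation (the quadratic error function here is just a stand-in):

import numpy as np

def numerical_gradient(err, w, h=1e-5):
    # centered difference: dErr/dw_j ≈ (Err(w + h e_j) - Err(w - h e_j)) / (2h)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        e_j = np.zeros_like(w)
        e_j[j] = h
        grad[j] = (err(w + e_j) - err(w - e_j)) / (2 * h)
    return grad

# stand-in error function with a known analytical gradient: Err(w) = ||w||^2 / 2
err = lambda w: 0.5 * np.sum(w ** 2)
analytical_gradient = lambda w: w

w = np.array([1.0, -2.0, 0.5])
print(np.allclose(numerical_gradient(err, w), analytical_gradient(w), atol=1e-6))  # True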
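And a sketch that ties together the minibatch, learning-rate-decay, and momentum ideas for a linear model with {$L_2$} loss (the synthetic data, the 1/t decay schedule, and the hyperparameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# synthetic data for a linear model y ≈ x·w_true (purely illustrative)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(200)

w = np.zeros(3)                     # small initial weights
prev_step = np.zeros(3)             # previous update, reused by momentum
eta0, m, batch_size = 0.1, 0.9, 50

for t in range(1, 101):
    eta = eta0 / t                                              # decaying learning rate eta(t)
    idx = rng.choice(len(X), size=batch_size, replace=False)    # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size    # dErr/dw on the minibatch, Err = 0.5 * mean squared residual
    step = eta * grad + m * prev_step           # momentum: add m times the previous step
    w = w - step                                # gradient-descent update
    prev_step = step

print(np.round(w, 2))               # should land close to w_true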

Visualizations

  • compute the input that maximizes the output of a neuron (see the sketch after this list)
    • over all inputs in the training set
    • over the entire range of possible inputs
    • early layers do edge or color detection; later layers do object recognition
  • display the pattern of hidden-unit activations
    • mainly shows that the activations are sparse
  • show how occluding parts of an image affects classification accuracy
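A toy sketch of "compute the input that maximizes the output of a neuron" over the range of possible inputs, using gradient ascent on the input of a single logistic neuron with fixed weights (real visualizations do this through a trained deep network; the weights, step size, and unit-norm constraint here are assumptions):

import numpy as np

w = np.array([1.0, -2.0, 0.5])           # fixed weights of the neuron to visualize

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# gradient ascent on the input x, constrained to the unit ball so a maximizer exists
x = np.zeros(3)
for _ in range(100):
    out = logistic(w @ x)
    grad_x = out * (1.0 - out) * w       # d(output)/dx for a logistic neuron
    x = x + 0.5 * grad_x
    x = x / max(1.0, np.linalg.norm(x))  # project back onto the unit ball

print(np.round(x, 2))                    # ends up proportional to w, as expected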

Constraints

  • build in word vectors
  • build in shift invariance (LeNet, LeCun, 1990s)

Back to Lectures
