CIS520 Machine Learning | Lectures / Deep Learning
Deep Learning

 

There are many variations and tricks to deep learning. This page is a work in progress listing some of the terms and concepts that we will cover in this course.

Loss functions

  • L2
  • log-likelihood
  • classification error
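As a concrete reference, the three losses above can be sketched in NumPy; the function names and the binary-label form of the log-likelihood are illustrative choices, not course code.

```python
import numpy as np

def l2_loss(y_hat, y):
    """Squared (L2) loss, averaged over examples."""
    return np.mean((y_hat - y) ** 2)

def neg_log_likelihood(p_hat, y):
    """Negative log-likelihood for binary labels y in {0, 1},
    where p_hat is the predicted P(y = 1) per example."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(p_hat + eps) + (1 - y) * np.log(1 - p_hat + eps))

def classification_error(y_hat, y):
    """Fraction of examples where the predicted label differs from the true label."""
    return np.mean(y_hat != y)
```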

Regularization

  • early stopping
    • start with small random weights and stop gradient descent before it fully converges
  • L1 or L2 penalty on the weights
  • max norm
    • constrain the weights to not exceed a given size
  • dropout
    • randomly remove a fraction (often half) of the units on each iteration (a fresh random subset is drawn for each update)
    • often not applied to the inputs
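A minimal sketch of (inverted) dropout on a layer's activations, assuming NumPy; the keep probability and scaling convention are illustrative. Scaling the surviving units by 1/keep_prob during training keeps their expected value unchanged, so no rescaling is needed at test time.

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=None):
    """Zero out each unit independently with probability 1 - keep_prob,
    scaling survivors by 1/keep_prob (inverted dropout)."""
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed for reproducibility in this sketch
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob
```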

Equational forms for neurons (nodes)

  • logistic
    • {$ output = \frac{1}{1+ e^{-w^\top x}} $}
  • hyperbolic tangent
    • {$ output = \tanh(w^\top x) $}
  • rectified linear unit (ReLU)
    • {$ output = \max(0, w^\top x) $}
  • max pooling
    • {$ output = \max(\mathbf{x}) $}
    • the maximum of all the inputs to the neuron
  • softmax (when the output should be a probability distribution)
    • {$ \mathrm{softmax}_i = \frac{e^{w_i^\top x}}{\sum_j e^{w_j^\top x}} $}
    • where {$w_j$} are the weights to each output node {$j$}
  • gated recurrent unit (GRU)
    • to be covered later, when we do dynamic models
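The neuron equations above can be written directly in NumPy; the helper names are illustrative. Subtracting the maximum inside softmax is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def logistic(w, x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def tanh_unit(w, x):
    return np.tanh(np.dot(w, x))

def relu(w, x):
    return max(0.0, np.dot(w, x))

def max_pool(x):
    """The maximum of all the inputs to the neuron."""
    return np.max(x)

def softmax(W, x):
    """W has one row of weights per output node j."""
    z = W @ x
    z = z - np.max(z)  # stability: exp of large z would overflow
    e = np.exp(z)
    return e / e.sum()
```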

Architectures

  • Supervised
    • Fully connected
    • Convolutional (CNN)
      • local receptive fields
        • with parameter tying
        • requires picking the three dimensions of the input volume and the “stride” size
      • max pooling
    • often used with multitask learning
      • the output {$y$} is a vector of labels for each observation
  • Unsupervised
    • minimize reconstruction error
  • Semi-supervised
    • use outputs from interior layer
  • Generative
  • Dynamic
    • Gated Recurrent Net
    • LSTMs
    • Attentional models

Search methods for doing gradient descent

  • batch
    • average the gradient {$ \frac{\partial Err}{\partial w} $} over all points in the training set and then update the weights
  • stochastic gradient
    • compute the gradient {$ \frac{\partial r}{\partial w} $} for the residual {$r$} of each observation, and update the weights one point at a time
  • minibatch
    • use a small subset of points (e.g. 50–100) for each update
  • compute gradient
    • analytical gradient or
    • numerical gradient
      • {$ \frac{df(x)}{dx} \approx \frac{f(x+h)-f(x-h)}{2h} $}
    • sometimes use ‘gradient clipping’
      • truncate gradients that get too large
  • learning rate adaption
    • weight decay
      • {$ \Delta w^t = f(\mu,t) \frac{\partial Err}{\partial w} $} where {$\mu$} is the learning rate and {$ f(\mu,t) $} controls the decay rate (e.g. linear, exponential…)
    • adagrad
      • {$ \Delta w^t = \frac{\mu}{\| \delta w^\tau \|_2} \frac{\partial Err}{\partial w} $} where {$\mu$} is the learning rate and {$ \delta w^\tau $} is a vector of all the previous gradients
  • momentum
    • {$ \Delta w^t = \mu \frac{\partial Err}{\partial w} + m \Delta w^{t-1} $}
    • add in a constant ({$m$}) times the preceding gradient step to the standard gradient of the error (with its learning rate {$\mu$}). This crudely approximates using the Hessian (the second derivative) and speeds up learning
  • learning tricks and details
    • run “sanity checks”
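The batch, stochastic, and minibatch schedules above differ only in how many points feed each gradient average. A sketch for squared-error linear regression (the model, names, and constants are illustrative, not course code):

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=100, batch_size=None, seed=0):
    """batch_size=None -> full batch; 1 -> stochastic; k -> minibatch of size k."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    bs = n if batch_size is None else batch_size
    for _ in range(epochs):
        idx = rng.permutation(n)          # reshuffle each epoch
        for start in range(0, n, bs):
            b = idx[start:start + bs]
            residual = X[b] @ w - y[b]
            grad = X[b].T @ residual / len(b)  # average gradient over the (mini)batch
            w -= lr * grad
    return w
```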
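The central-difference formula and gradient clipping above can be sketched as follows; the step h and the clipping threshold are illustrative values. A numerical gradient like this is a common "sanity check" against the analytical gradient.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate each partial derivative via (f(x + h e_i) - f(x - h e_i)) / (2h)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def clip_gradient(g, max_norm=5.0):
    """Rescale g if its L2 norm exceeds max_norm (truncate gradients that get too large)."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g
```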
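The learning-rate adaptations and momentum update above can be sketched as single update steps; the exponential decay schedule, epsilon term, and constants are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

def decayed_lr(mu, t, k=0.01):
    """One possible f(mu, t): exponential decay of the learning rate."""
    return mu * np.exp(-k * t)

def adagrad_step(w, grad, sum_sq_grads, mu=0.1, eps=1e-8):
    """Divide the step by the norm of all gradients seen so far (per weight)."""
    sum_sq_grads = sum_sq_grads + grad ** 2       # accumulate squared gradients
    w = w - mu * grad / (np.sqrt(sum_sq_grads) + eps)
    return w, sum_sq_grads

def momentum_step(w, grad, prev_delta, mu=0.01, m=0.9):
    """Delta w^t = mu * grad + m * Delta w^{t-1}; take a descent step."""
    delta = mu * grad + m * prev_delta
    return w - delta, delta
```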

Visualizations

Constraints:

  • build in word vectors
  • build in shift invariance (LeNet, 1990s, LeCun)

Back to Lectures

Page last modified on 12 October 2016 at 09:28 AM