# DeepLearning

There are many variations and tricks to deep learning. This page is a work in progress listing a few of the terms and concepts that we will cover in this course.

## Loss functions

• {$L_2$}
• log-likelihood
• {$\log \prod_i p(y_i|X_i)$}
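
A minimal numpy sketch of these two losses (the function names and toy data are illustrative, not from the course):

```python
import numpy as np

def l2_loss(y_true, y_pred):
    # squared (L2) loss summed over observations
    return np.sum((y_true - y_pred) ** 2)

def neg_log_likelihood(probs):
    # probs[i] = p(y_i | X_i), the probability the model assigns to the
    # correct label; log of the product = sum of the logs, negated so
    # that minimizing the loss maximizes the likelihood
    return -np.sum(np.log(probs))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(l2_loss(y_true, y_pred))        # 0.01 + 0.04 + 0.09 = 0.14

probs = np.array([0.9, 0.8, 0.7])
print(neg_log_likelihood(probs))      # -log(0.9 * 0.8 * 0.7) ~= 0.69
```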

## Regularization

• early stopping
• {$L_1$} or {$L_2$} penalty on the weights
• max norm ({$L_\infty$})
• constrain the weights to not exceed a given size
• dropout
• randomly zero out a fraction (often half) of the units on each iteration, with each unit dropped independently
• may also be applied to the inputs, typically with a smaller drop probability
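
A minimal sketch of dropout as a random mask over unit activations; the "inverted dropout" scaling used here is a common convention, not something stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, train=True):
    # Zero out each unit independently with probability p_drop.
    # "Inverted dropout": surviving activations are scaled by
    # 1/(1 - p_drop) during training so no rescaling is needed at test time.
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop   # keep with prob 1 - p_drop
    return activations * mask / (1.0 - p_drop)

h = np.array([0.5, -1.2, 3.0, 0.7])
print(dropout(h))   # roughly half the entries zeroed, survivors doubled
```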

## Equational forms for neurons (nodes)

• logistic
• {$output = \frac{1}{1+ e^{-w^\top x}}$}
• hyperbolic tangent
• {$output = \tanh(w^\top x)$}
• rectified linear unit (ReLU)
• {$output = \max(0, w^\top x)$}
• max pooling
• {$output = \max({\bf x})$}
• the maximum of all the inputs to the neuron
• softmax (when the output should be a probability distribution)
• {$\mathrm{softmax}_j = p(y=j|x) = \frac{e^{w_j^\top x}}{\sum_k e^{w_k^\top x}}$}
• where {$w_k$} are the weights to each output node {$k$}
• gated recurrent unit (GRU)
• to be covered later, when we do dynamic models
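
The first four forms in numpy (the GRU is deferred as noted); the weight shapes and test values are illustrative:

```python
import numpy as np

def logistic(w, x):
    return 1.0 / (1.0 + np.exp(-w @ x))

def tanh_unit(w, x):
    return np.tanh(w @ x)

def relu(w, x):
    return np.maximum(0.0, w @ x)

def max_pool(x):
    # the maximum of all the inputs to the neuron
    return np.max(x)

def softmax(W, x):
    # W has one weight row w_k per output node k
    z = W @ x
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([0.2, -0.5, 1.0])
w = np.array([0.1, 0.4, -0.3])
W = np.array([[0.1, 0.4, -0.3],
              [0.2, -0.1, 0.5]])
print(logistic(w, x), tanh_unit(w, x), relu(w, x), max_pool(x))
print(softmax(W, x))         # a probability distribution: sums to 1
```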

## Architectures

• Supervised
• Fully connected
• Convolutional (CNN)
• local receptive fields
• with parameter tying
• requires picking the three dimensions of the filter's local input volume (width, height, depth) and the "stride" size (see the convolution sketch after this list)
• max pooling
• often used with multitask learning
• the output {$y$} is a vector of labels for each observation
• Unsupervised
• minimize reconstruction error
• Semi-supervised
• use outputs from interior layer of a network trained on a different, larger data set
• Generative
• Dynamic
• Gated Recurrent Net
• LSTMs
• Attentional models
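
A sketch of the convolutional pieces above: one small filter (the tied parameters) slides across the input with a chosen stride, so each output sees only a local receptive field, followed by max pooling. The filter size, stride, and example image are arbitrary choices:

```python
import numpy as np

def conv2d_valid(image, filt, stride=1):
    # 2-D convolution with no padding (strictly, cross-correlation, as is
    # standard in deep nets): each output pixel sees only a local receptive
    # field, and the same filter weights are reused at every position
    fh, fw = filt.shape
    ih, iw = image.shape
    oh = (ih - fh) // stride + 1
    ow = (iw - fw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+fh, j*stride:j*stride+fw]
            out[i, j] = np.sum(patch * filt)   # parameter tying: same filt
    return out

def max_pool2(x):
    # 2x2 max pooling with stride 2 (trim odd edges)
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

image = np.arange(36.0).reshape(6, 6)
edge_filter = np.array([[1.0, -1.0],
                        [1.0, -1.0]])   # a crude vertical-edge detector
fmap = conv2d_valid(image, edge_filter, stride=1)
print(fmap.shape, max_pool2(fmap).shape)   # (5, 5) -> (2, 2)
```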

## Search methods for doing gradient descent

• batch
• average the gradient {$\frac{\partial Err}{\partial w}$} over all points in the training set and then update the weights
• stochastic (online)
• compute the gradient {$\frac{\partial r_i}{\partial w}$} for the residual {$r_i$} of each observation {$(x_i, y_i)$}, and update the weights {$w$} one observation at a time
• minibatch
• use a small subset of observations (e.g. 50-100) for each update
• gradient checking
• sanity-check the analytic gradient against the finite-difference approximation {$\frac{dErr(w)}{dw} \approx \frac{Err(w+h) - Err(w-h)}{2h}$}
• gradient clipping
• truncate gradients that get too large
• learning rate decay
• {$\Delta w^t = \eta(t) \frac{\partial Err}{\partial w}$} where {$\eta(t)$} is the learning rate, which decays at some rate (e.g. linear, exponential, ...)
• {$\Delta w_j^t = \frac{\eta}{||\partial w_j^\tau||_2} \frac{\partial Err}{\partial w_j}$} where {$\eta$} is the learning rate and {$\partial w_j^\tau$} is the vector of all previous gradients of weight {$w_j$} (this is the AdaGrad rule)
• momentum
• {$\Delta w^t = \eta \frac{\partial Err}{\partial w} + m \Delta w^{t-1}$}
• add a constant ({$m$}) times the preceding weight update to the standard gradient step on the error (with its learning rate {$\eta$}); this crudely approximates using the Hessian (the second derivative) and speeds up learning
• learning tricks and details
• run "sanity checks"

## Visualizations

• compute the input that maximizes the output of a neuron (sketched in code after this list)
• over all inputs in the training set
• over the entire range of possible inputs
• early layers do edge or color detection; later layers do object recognition
• Display pattern of hidden unit activations
• mostly this just shows that the activations are sparse
• Show how occluding parts of an image affects classification accuracy
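
A sketch of the first visualization: gradient ascent on the input to find the pattern that most excites a single unit. The one-unit logistic "network" here is a stand-in for a real trained net:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "network": one logistic unit with fixed weights
w = rng.normal(size=8)

def unit_output(x):
    return 1.0 / (1.0 + np.exp(-w @ x))

def d_output_dx(x):
    # gradient of the unit's output with respect to the input x:
    # d/dx sigma(w.x) = sigma * (1 - sigma) * w
    s = unit_output(x)
    return s * (1.0 - s) * w

# gradient ascent over norm-constrained inputs (the "range of possible
# inputs" here is the unit sphere, an arbitrary choice)
x = rng.normal(size=8)
for _ in range(200):
    x = x + 0.5 * d_output_dx(x)
    x = x / np.linalg.norm(x)      # keep ||x|| = 1 so the ascent is bounded

print(np.round(x, 2))
print(np.round(w / np.linalg.norm(w), 2))   # the maximizer aligns with w
```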

## Constraints

• build in word vectors
• build in shift invariance (LeNet, Yann LeCun, 1990s)