Reinforcement Learning

The RL lectures are covered in the slides, which include all of the algorithms.

This page is under development, but will include sections on

Model-Based vs. Model-Free

On-Policy vs. Off-Policy

{$V(s)$} and {$Q(s,a)$}

Monte Carlo Tree Search

Exploration vs. Exploitation

Deep Q-Learning, aka Deep Q-Networks (DQN)

Contextual Bandits are a special case of RL where we ignore the fact that our current action will move us to a new state that enables different future actions.
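As a rough sketch of what that looks like in code (the class and variable names here are illustrative, not from the slides), an epsilon-greedy contextual bandit just keeps a running estimate of the reward for each (context, action) pair; note that nothing in it tracks a "next state":

    import numpy as np

    # Minimal epsilon-greedy contextual bandit (illustrative, not from the slides).
    # Each (context, arm) pair keeps a running mean reward; unlike full RL, there is
    # no model of how the chosen action changes the next state.
    class EpsilonGreedyBandit:
        def __init__(self, n_contexts, n_arms, epsilon=0.1):
            self.epsilon = epsilon
            self.counts = np.zeros((n_contexts, n_arms))   # pulls per (context, arm)
            self.values = np.zeros((n_contexts, n_arms))   # running mean reward

        def select_arm(self, context):
            if np.random.rand() < self.epsilon:            # explore at random
                return np.random.randint(self.values.shape[1])
            return int(np.argmax(self.values[context]))    # exploit the best estimate

        def update(self, context, arm, reward):
            self.counts[context, arm] += 1
            n = self.counts[context, arm]
            self.values[context, arm] += (reward - self.values[context, arm]) / n

    bandit = EpsilonGreedyBandit(n_contexts=3, n_arms=4)
    arm = bandit.select_arm(context=0)
    bandit.update(context=0, arm=arm, reward=1.0)          # observed reward; no next state to track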

Response surfaces are a form of bandit problem (and hence of RL). The goal is to find the optimal control action (typically a real-valued vector of settings, {$x$}). If one knew the "response surface", i.e. the function {$f(x;w)$} that gives reward (or cost) as a function of action, then finding the best action would just be finding the optimum of that function (the maximum reward, or the minimum cost), for example with gradient descent. But we don't know {$f(x;w)$}, and so need to learn it.
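As a sketch of that two-step idea (assuming scikit-learn's Gaussian process regressor as the learned surface and made-up data; the slides do not prescribe a particular model), we can fit {$f(x;w)$} from the (action, reward) pairs seen so far and then run a gradient-based search over the fitted surface:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from scipy.optimize import minimize

    # Sketch: learn the response surface f(x; w) from the (action, reward) pairs
    # observed so far, then search the *fitted* surface for the best action.
    # Rewards are used here, so we maximize by minimizing the negative prediction.
    # The data below is made up purely for illustration.
    X_obs = np.array([[0.1], [0.4], [0.7], [0.9]])          # actions tried so far
    y_obs = np.array([0.2, 0.7, 0.5, 0.1])                  # observed rewards

    surface = GaussianProcessRegressor().fit(X_obs, y_obs)  # learned stand-in for f(x; w)

    result = minimize(lambda x: -surface.predict(x.reshape(1, -1))[0],
                      x0=np.array([0.5]), bounds=[(0.0, 1.0)])
    best_action = result.x                                   # greedy optimum of the fitted surface

The catch, picked up below, is that the fitted surface is only trustworthy near actions we have already tried.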

In standard active learning, our goal would be to learn {$f(x;w)$} everywhere (or, more precisely, everywhere in proportion to how often we observe points {$x$}). Here, we don't care about most regions of the action space, only the ones that give high reward (or low cost).

So when picking a new {$x$} to evaluate, we are trying to trade off choosing the {$x$} that maximizes our current estimate of the reward (a greedy policy) against choosing one that explores new regions of the action space that might be even better (active learning).
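One standard way to make that trade-off explicit (a sketch only; it continues from the fitted surface above and uses an upper-confidence-bound rule, with the weight kappa as an illustrative knob) is to score each candidate {$x$} by its predicted reward plus a bonus for the model's uncertainty there, so that well-understood but mediocre regions lose out to promising or unexplored ones:

    import numpy as np

    # Upper-confidence-bound acquisition over the fitted surface from the sketch above.
    # The predicted mean rewards exploitation; the predictive standard deviation rewards
    # exploring regions we have not sampled; kappa (illustrative) sets the balance.
    def ucb_pick(surface, candidates, kappa=2.0):
        mean, std = surface.predict(candidates, return_std=True)
        return candidates[np.argmax(mean + kappa * std)]

    candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
    next_x = ucb_pick(surface, candidates)   # evaluate next_x for real, refit the surface, repeat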

Back to Lectures