Date | Topic | Materials
January 6 | Introduction to reinforcement learning. Bandit algorithms | RL book, chapter 1. Intro slides
January 8 | Bandits: definition of multi-armed bandits, epsilon-greedy exploration, optimism, UCB | RL book, Sec. 2.1-2.7. Bandit slides
January 13 | Bandits: regret definition and analysis for epsilon-greedy and UCB, gradient-based bandits | RL book, chapter 2. Bandit slides. Assignment 1 posted
January 15 | Wrap-up of bandits: gradient-based bandits, Thompson sampling. Markov Decision Processes (MDPs) | RL book, chapter 2 and Sec. 3.1. Bandit slides
January 20 | Value functions. Bellman equations, policy evaluation. Policy iteration. Value iteration | RL book, Sec. 3.2-4.1. Slides
January 22 | More on dynamic programming: policy iteration, value iteration, contractions | RL book, Sec. 4.1-4.8. Slides
January 27 | Policy evaluation using Monte Carlo and temporal-difference methods | RL book, Sec. 5.1, 5.2, 6.1, 6.2. Slides
January 29 | Learning control using Monte Carlo and TD, including SARSA | RL book, Sec. 5.3, 5.4, 5.6, 5.7, 6.3, 6.4, 7.1-7.3, 7.5. Slides. Assignment 1 due; Assignment 2 posted
February 3 | Q-learning | RL book, Sec. 6.5-6.7, 9.1-9.3. Q-learning slides
February 5 | Value function approximation, DQN, eligibility traces | RL book, Sec. 10.2, 10.5, 16.5, 12.1, 12.2, 12.4, 12.5. Slides. David Silver's lecture on RL with function approximation
February 10 | More on eligibility traces and TD(λ) | RL book, chapter 12. Slides
February 12 | Planning and model-based RL | RL book, chapter 8. Slides
February 17 | Deep model-based RL and planning | RL book, end of chapter 8. PlaNet paper, Dreamer paper, MuZero paper. Slides. Assignment 2 due; project information posted
February 19 | Policy-gradient methods: Policy Gradient Theorem and REINFORCE | RL book, Sec. 13.7, 13.1-13.3. Slides
February 24 | Policy-gradient methods: actor-critic | RL book, Sec. 13.4-13.6. Slides. Assignment 3 posted
February 26 | Policy-gradient methods: Deterministic Policy Gradient, DDPG | DPG paper, DDPG paper. Slides
March 3 | Study break | |
March 5 | Study break | |
March 10 | TRPO, PPO, review | TRPO paper, PPO paper. Slides
March 12 | Hierarchical RL | Options paper, Option-Critic architecture paper. Slides. Project proposal due
March 17 | Wrap-up of HRL. Off-policy RL | Slides on HRL, slides on off-policy learning. Assignment 3 due
March 19 | Offline and batch RL | Slides. Offline RL tutorial (Levine, Kumar, Tucker and Fu, 2020)
March 24 | Where do rewards come from? Inverse RL. Learning from preferences | Slides on inverse RL and preference-based learning. Inverse RL survey
March 26 | Large Language Models (LLMs) and Reinforcement Learning from Human Feedback (RLHF) | Slides (more info to be posted) |
March 30 | More on RL for LLMs | Slides (more info to be posted) |
April 2 | Alignment for RLHF. Continual RL | Slides |
April 7 | Never-ending / continual RL | Slides. Continual RL survey
April 9 | RL applications | Slides. If we have time: "Reward is enough" paper