Date | Topic | Materials |
January 4 | Introduction to reinforcement learning. Bandit algorithms | RL book, chapter 1. Intro slides |
January 9 | Bandits: definition of multi-armed bandit, epsilon-greedy exploration, optimism, UCB | RL book, Sec. 2.1-2.7 Bandit slides |
January 11 | Bandits: regret definition and analysis for epsilon-greedy and UCB, gradient-based bandits | RL book, chapter 2 Assignment 1 posted |
January 16 | Wrap-up of bandits: gradient-based bandits, Thompson sampling. Markov Decision Processes (MDPs) | RL book, chapter 2 and 3.1 |
January 18 | Value functions. Bellman equations, policy evaluation. Policy iteration. Value iteration | RL book, chapter 3.2-4.1 Slides |
January 23 | More on dynamic programming: policy iteration, value iteration, contractions. | RL book, Chapter 4.1-4.8 Slides |
January 25 | Policy evaluation using Monte Carlo methods and temporal-difference learning | RL book, Sec. 5.1 5.2 6.1 6.2 Slides |
January 30 | Learning Control using Monte Carlo and TD, including SARSA | RL book, Sec. 5.3 5.4 5.6 5.7 6.3 6.4 7.1 7.2 7.3 7.5 Assignment 1 due Assignment 2 posted Slides |
February 1 | Q-learning | RL book, Sec. 6.5-6.7 9.1-9.3 Q-Learning slides |
February 6 | Value Function Approximation, DQN, Eligibility Traces | RL book Sec. 10.2 10.5 16.5 12.1 12.2 12.4 12.5 Slides David Silver's lecture on RL with function approximation |
February 8 | More on Eligibility Traces and TD(λ) | RL book chapter 12 Slides |
February 13 | Planning and model-based RL | Slides RL book chapter 8 |
February 15 | Deep model-based RL and planning | RL book end of chapter 8, PlaNet Paper, Dreamer Paper, MuZero Paper Slides Assignment 2 due Project information posted |
February 20 | Policy-gradient methods: Policy Gradient Theorem and REINFORCE | RL book chapter 13.7 13.1-13.3 Slides |
February 22 | Policy-gradient methods: Actor-critic | RL book chapter 13.4 13.5 13.6 Slides Assignment 3 posted |
February 27 | Policy-gradient methods: Deterministic Policy Gradient, DDPG, TRPO, PPO | DPG paper, DDPG paper, TRPO paper, PPO paper Slides |
February 29 | Review | Slides |
March 5 | Study break | |
March 7 | Study break | |
March 12 | Hierarchical RL | Slides Options paper Option-critic architecture |
March 14 | More on hierarchical RL | |
March 19 | Wrap-up of HRL. Off-policy RL | Assignment 3 due Slides on HRL Slides on off-policy learning |
March 21 | Offline and batch RL | Slides |
March 26 | Where do rewards come from? Inverse RL. Learning from preferences | Inverse RL slides (with thanks to Pieter Abbeel) Preference-based learning. RL from human feedback |
March 28 | Where do rewards come from? Learning from preferences and human feedback | Slides (more info to be posted) |
April 2 | RL from Human Feedback in LLMs | Slides |
April 4 | Never-ending / continual RL | Slides Continual RL survey |
April 9 | | Slides |