Specifically, they show the following result:

\[\begin{equation}{ \log p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) = \log p(\theta \mid \mathcal{D}_1) + \log p(\mathcal{D}_2 \mid \theta) - \log p(\mathcal{D}_2). }\end{equation}\]Looking at the three terms on the right-hand side, the last two depend only on the second dataset; the dependence on the first task enters only through \(\log p(\theta \mid \mathcal{D}_1)\), the log posterior of the parameters given the first dataset. In other words, this posterior distribution is sufficient to encapsulate all the information from the first task. This is a cool result: continual learning reduces to learning this posterior.

But how do we obtain this result? Let’s go through the derivation.

\[\begin{align} \log p(\theta | \mathcal{D}_1, \mathcal{D}_2) &= \log p(\mathcal{D}_2, \mathcal{D}_1, \theta) - \log p(\mathcal{D}_1, \mathcal{D}_2) \\ &= \log p(\mathcal{D}_2 | \theta, \mathcal{D}_1) + \log p(\theta | \mathcal{D}_1) + \log p(\mathcal{D}_1) - \log p(\mathcal{D}_1, \mathcal{D}_2) \\ &= \log p(\mathcal{D}_2 | \theta ) + \log p(\theta | \mathcal{D}_1) - \log p(\mathcal{D}_2|\mathcal{D}_1). \end{align}\]In the first row, we expand the definition of a conditional probability. In the second row, we do a similar step and rewrite the first term \(p(\mathcal{D}_2, \mathcal{D}_1, \theta)\) as a product of conditional probabilities. In the last row, the two trailing terms combine, since \(p(\mathcal{D}_1, \mathcal{D}_2) = p(\mathcal{D}_2 | \mathcal{D}_1)\, p(\mathcal{D}_1)\). But the magic really happens in the first term of that row, where we leverage the conditional independence of \(\mathcal{D}_1\) and \(\mathcal{D}_2\) given \(\theta\), i.e. \(p(\mathcal{D}_2 | \theta, \mathcal{D}_1) = p(\mathcal{D}_2 | \theta)\), to remove the tricky dependence on the data from the first task.

That’s it! Now we have an objective which can be optimized without simultaneous access to both datasets.
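As a sanity check, here is a minimal sketch that verifies the last row of the derivation numerically. The setup is a toy assumption and not something from the EWC paper: \(\theta\) is the bias of a coin restricted to a grid, the prior is uniform, and `D1` and `D2` are made-up flips that are i.i.d. given \(\theta\). All names in the code are hypothetical.

```python
import math

# Toy assumption: theta is a coin bias on a grid, with a uniform prior.
thetas = [i / 100 for i in range(1, 100)]               # 0.01 .. 0.99
log_prior = [-math.log(len(thetas))] * len(thetas)

# Made-up data for the two "tasks" (1 = heads, 0 = tails).
D1 = [1, 1, 0, 1, 1, 0, 1]
D2 = [0, 1, 1, 1, 0]

def log_lik(data, theta):
    """log p(data | theta) for i.i.d. Bernoulli flips."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_posterior(log_weights, data):
    """Return (log posterior on the grid, log normalizer) for the given prior weights."""
    joint = [w + log_lik(data, t) for w, t in zip(log_weights, thetas)]
    log_norm = logsumexp(joint)
    return [j - log_norm for j in joint], log_norm

# Left-hand side: condition on both datasets at once.
lhs, _ = log_posterior(log_prior, D1 + D2)

# Right-hand side: log p(D2 | theta) + log p(theta | D1) - log p(D2 | D1).
# The task-1 posterior plays the role of the prior for task 2.
log_post_1, _ = log_posterior(log_prior, D1)
_, log_p_D2_given_D1 = log_posterior(log_post_1, D2)
rhs = [lp1 + log_lik(D2, t) - log_p_D2_given_D1
       for lp1, t in zip(log_post_1, thetas)]

# Largest disagreement between the two sides over the whole grid.
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(f"max |lhs - rhs| = {max_gap:.2e}")
```

The right-hand side never touches `D1` directly; it only uses `log_post_1`, which is the point of the result.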

As you can see, equations (1) and (4) don’t perfectly line up: the EWC paper makes the additional assumption that \(\mathcal{D}_1\) and \(\mathcal{D}_2\) are marginally independent (rather than only conditionally independent given \(\theta\)), i.e. \(p(\mathcal{D}_2 | \mathcal{D}_1) = p(\mathcal{D}_2)\). My understanding is that you do not need this assumption, since the term does not depend on \(\theta\) anyway and therefore has no effect on the optimization. If you have any thoughts on why they do this, please let me know :)
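For what it is worth, a quick numerical sketch illustrates why the choice of normalizer should not matter for optimization. It assumes a toy model that is not from the EWC paper (a coin bias \(\theta\) on a grid, a uniform prior, and made-up flips `D1` and `D2`; every name is hypothetical). In this toy model \(p(\mathcal{D}_2 | \mathcal{D}_1)\) and \(p(\mathcal{D}_2)\) differ, yet the maximizing \(\theta\) is the same whichever constant is subtracted.

```python
import math

# Toy assumption: theta is a coin bias on a grid, with a uniform prior.
thetas = [i / 100 for i in range(1, 100)]
log_prior = [-math.log(len(thetas))] * len(thetas)
D1 = [1, 1, 0, 1, 1, 0, 1]   # made-up flips for task 1
D2 = [0, 1, 1, 1, 0]         # made-up flips for task 2

def log_lik(data, theta):
    """log p(data | theta) for i.i.d. Bernoulli flips."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_evidence(log_weights, data):
    """log of sum over theta of weight(theta) * p(data | theta)."""
    return logsumexp([w + log_lik(data, t) for w, t in zip(log_weights, thetas)])

# Posterior after task 1.
log_p_D1 = log_evidence(log_prior, D1)
log_post_1 = [lp + log_lik(D1, t) - log_p_D1 for lp, t in zip(log_prior, thetas)]

log_p_D2 = log_evidence(log_prior, D2)             # marginal, as in eq. (1)
log_p_D2_given_D1 = log_evidence(log_post_1, D2)   # conditional, as in eq. (4)

def objective(constant):
    """log p(theta | D1) + log p(D2 | theta) - constant, on the grid."""
    return [lp1 + log_lik(D2, t) - constant for lp1, t in zip(log_post_1, thetas)]

def argmax_theta(scores):
    return thetas[max(range(len(scores)), key=scores.__getitem__)]

map_eq1 = argmax_theta(objective(log_p_D2))
map_eq4 = argmax_theta(objective(log_p_D2_given_D1))
map_no_const = argmax_theta(objective(0.0))

print("log p(D2)      =", log_p_D2)
print("log p(D2 | D1) =", log_p_D2_given_D1)
print("argmax theta   =", map_eq1, map_eq4, map_no_const)
```

Only the version with \(\log p(\mathcal{D}_2 | \mathcal{D}_1)\) yields a properly normalized log posterior; the other two are shifted by a constant, which leaves the argmax and the gradient with respect to \(\theta\) unchanged.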

While prior-based methods like EWC are theoretically sound, in practice they perform quite poorly. I strongly believe this is because the assumptions we make so that computing \(p(\theta | \mathcal{D}_1)\) is tractable are way too restrictive. Maybe more flexible posterior inference is a good research direction. More on that in the next blog post!

]]>