Rahul Ramesh2023-12-12T00:42:01-05:00https://rahulramesh.infoRahul RameshDeep Reference Priors for Pre-training2022-07-14T00:00:00-04:00https://rahulramesh.info/blog/refprior.mdHow do we pre-train a model using unlabeled data? In this blog post, we tackle
this question by adopting the bayesian perspective; Probabilities are used to
represent a state of knowledge. Instead of learning a single
pre-trained model, we learn a prior -- or a probability distribution over the
model's parameters.
It is instructive to consider a specific example: estimating the bias of a
coin. A prior is used to quantify our belief about the coin's bias before we
observe any data. It is common to represent this prior using a beta
distribution. If we believe that the coin is unbiased, then we can encode this
belief using a beta-distribution with $$\alpha=\beta$$.
<div class="imgcaption">
<div style="width: 50%;" class="imgcenter">
<img src="/public/imgs/refprior/beta_distribution.gif" alt="beta distribution">
</div>
<span class="caption"> <em>The beta distribution can be used to represent the prior for
the bias of a coin. When alpha=2 and beta=5, the prior represents the belief that the coin is biased towards tails.</em></span>
</div>
## Uniformative Priors
It is hard to determine what the prior should be in many instances -- an issue
that dates back to the likes of Bayes and Laplace. Revisiting our earlier
example, we may not have any particular belief regarding the bias of the coin.
How do we represent a *state of ignorance*?
A natural choice for the prior is a uniform distribution over model's
parameters motivated by Laplace's principle for insufficient reason. This would
correspond to a uniform distribution over the interval $$[0, 1]$$, for the bias
of the coin. While the uniform prior seems like a reasonable choice, it is not
considered an uninformative prior (a prior that attempts to codify ignorance).
Ideally, an uninformative prior should be invariant to re-parameterizations,
which is the most common criticism levied against uniform priors. Let us
consider the problem of estimating the variance of a Gaussian random variable
$$X = \mathcal{N}(0, \sigma^2)$$ with a uniform prior. We can estimate both
$$\sigma$$ and $$\log \sigma$$, both of which are reasonable choices. However,
the resulting posterior will look different depending on the choice of
parameterization. A uniform prior over $$\sigma$$ assigns a larger probability
mass to values near $$\infty$$ compared to a uniform prior for $$\log \sigma$$.
The uniform prior is often informative, which is not desirable.
The key idea behind uniformative priors is to operate in probability space, as
opposed to the parameter space. If we have a probability model
$$p_{w}(z)$$, then it is easier to specify priors using conditions on
$$p_{w}$$ as opposed to conditions on $$w$$. The Jeffrey's prior is a
popular uniformative prior that is uniform in probability space. In this blog,
we will focus on another closely related uninformative prior.
## Reference Priors
Reference priors <a class="citation" href="#bernardo1979reference">(Bernardo 1979)</a> are based on the guiding
principle that *the data should dominate the posterior -- not the prior*. A
prior that maximizes the KL-divergence with the posterior allows the data to be
maximally informative. We codify this idea through a maximization problem --
the solution to which is the reference prior.
Let $$Z$$ be a random variable that represents the data, and let $$w$$
be the parameters of the model. Additionally, let $$T(Z)$$ be a sufficient
statistic for $$w$$. A sufficient statistic summarizes the information from the
entire random sample of $$Z$$. Estimating $$w$$ using the entire random sample
is identical to estimating $$w$$ using just the sufficient statistic $$T$$.
The reference prior $$\pi^*$$ maximizes the KL-divergence between the prior and
the posterior, averaged over distribution $$p(t)$$ of the sufficient statistic:
\begin{equation}
\pi^* := \argmax_{\pi} \ \ \mathbb{E}_{t} \left[ KL(p(w \mid t) \mid \pi(w)) \right]
\end{equation}
Equation (1) is also the mutual information between $$w$$ and $$t$$.
Intuitively, the prior maximizes the information that is missing between $$w$$
and $$t$$. The reference prior is invariant to re-parameterizations of $$w$$
since it is defined using the mutual information.
However, we do not know $$t$$ for most problems, making Equation (1)
computationally intractable. Bernado proposed estimating $$\pi^*$$ as a limit
of a sequence, i.e.
\begin{equation}
\pi^* := \lim_{n \rightarrow \infty} \pi^n = \lim_{n \rightarrow \infty} \argmax_{\pi} \ \ \mathbb{E}_{z^n} \left[ KL(p(w \mid z^n) \mid \pi(w)) \right]
\end{equation}
where $$z^n$$ represents $$n$$-random samples of $$Z$$. $$\pi^n$$ is called the
$$n$$-reference prior where $$n$$ is the order.
Priors are computed without any data or evidence. This is also true for the
reference prior. The objective for the $$n$$-reference prior depends on the
posterior $$p(w \mid z^n)$$. However, the posterior is computed using samples
that were synthesized by the model and does not use data. In particular, we
consider the "imagined" data distribution
\begin{equation}
p(z) = \int p(z \mid w) \pi(w) \ \text{d} w, \notag
\end{equation}
when we compute the expectation in the reference prior.
$$n$$-reference priors are supported on a finite number of points in
<a class="citation" href="#zhang1994discrete">(Zhang 1994)</a>, i.e.
\begin{equation}
\pi^n(w) = \sum_{i=1}^K p_i \delta_{w_i}. \notag
\end{equation}
This intuitively holds since the dataset $$z^n$$ is finite. We revisit the
coin-tossing example and visualize the $$n$$-reference prior for different
orders $$n$$. For $$n=1$$, the prior is a uniform probability over two points
$$p=0$$ and $$p=1$$. If we have only one sample, then the 1-reference prior assigns probability masses to the two possible outcomes, thereby maximizing the information gained from the data.
<div class="imgcaption">
<div style="width: 30%;" class="imgcenter">
<img src="/public/imgs/refprior/coin_n1.png" alt="ref prior n=1">
</div>
<span class="caption"> <em></em></span>
</div>
<div class="imgcaption">
<div style="width: 30%;" class="imgcenter">
<img src="/public/imgs/refprior/coin_n10.png" alt="ref prior n=10">
</div>
<span class="caption"> <em>We compute the n-reference prior for orders n=1 (top) and n=10 (bottom) for the coin-tossing example. The priors are supported on a discrete set of points.</em></span>
</div>
## Deep Reference Priors
How do we compute the reference prior using neural networks for the model
family.
## Resources:
* https://www.youtube.com/watch?v=pNkeDdFs0QQ
* Mattingly
## References
<ol class="bibliography"><li>
<span id="bernardo1979reference">Bernardo, Jose M. 1979. “Reference Posterior Distributions for Bayesian
Inference.” <i>Journal of the Royal Statistical Society: Series B
(Methodological)</i> 41 (2): 113–28.</span>
</li>
<li>
<span id="zhang1994discrete">Zhang, Zhongxin. 1994. “Discrete Noninformative Priors.” PhD thesis, Yale University.</span>
</li></ol>
Reference Priors for Pre-training Neural Nets2022-07-14T00:00:00-04:00https://rahulramesh.info/blog/refprior<p>We start with the problem of estimating the bias of a coin.
We know that we will receive \(n\) data points</p>
<p>We maximize the information brought by the data.</p>
<p>We want to tackle a supervised learning task with limited amounts of data. In
addition, we have unlabeled data from the same distribution. Can we learn a useful prior from unlabeled data, such that</p>
<p>In
this blog post, we seek a prior – or a probability distribution over the
model’s weights – from unlabeled data. The prior is selected to work with small amounts of labeled data</p>
<p>We want to leverage this prior on small amounts of labeled data. How do we optimally leverage this limited information. We consider the problem of estimating</p>
<p>To do so, we would like to restrict the model space</p>
<h2 id="uniformative-priors">Uniformative Priors</h2>
<h2 id="resources">Resources:</h2>
<ul>
<li>https://www.youtube.com/watch?v=pNkeDdFs0QQ</li>
<li>Mattingly</li>
</ul>
<h2 id="references">References</h2>
<ol class="bibliography"></ol>
Model Zoos for Continual Learning2022-06-24T00:00:00-04:00https://rahulramesh.info/blog/modelzoo<p>Consider two tasks: a classification problem between strawberries and
raspberries and another classification problem between oranges and bananas. The
two tasks are similar and share information relevant to classification. If we can
exploit this information, we can achieve lower generalization errors and
better sample complexities on both these tasks. This motivates us to explore
models that train on multiple tasks.</p>
<p>Caruana’s model <a class="citation" href="#caruana_mtl">(1997)</a> is the most popular
choice, when we want to learn from multiple tasks. It relies on learning a
shared representation function \(f\) across all the tasks. The output of the
function \(f\) is used as input to a task-specific classification layer
\(g_i\), to generate predictions for task \(i\). Formally, the classifier for
task \(i\) is
\begin{equation}
h_i := g_i \circ f. \nonumber
\end{equation}</p>
<p>In order to tackle our earlier example, we need to learn a single representation \(f\)
for oranges, bananas, strawberries and raspberries. For example, we can learn \(f\)
using a Resnet backbone. The learnable parameters of \(f\) usually make up most
of the parameters in the network since it helps us achieve better sample
complexity bounds <a class="citation" href="#baxter_inductive">(Baxter 2000)</a>. In addition, we need to learn
\(g_1\) and \(g_2\), which are linear layers attached to the Resnet backbone.
\(g_1\) is used to identify if the image is a strawberry or a raspberry and
\(g_2\) is used to determine if the image is a banana or orange.</p>
<div class="imgcaption">
<div style="width: 30%;" class="imgcenter">
<img src="/public/imgs/modelzoo/mtl.png" alt="Multi-task Model" />
</div>
<span class="caption"> <em>The multi-task model proposed by Rich Caruana continues
to remain the most popular architecture when we want to learn from multiple
tasks; This includes areas like multi-task learning, meta-learning or
continual learning. The model is trained to minimize the sum of losses on
all the tasks.</em></span>
</div>
<p>We assume that we know if each image belongs to task 1 or task 2. This helps us
choose the right output head for each input. The model need not distinguish
between a class from task-1 and another class from task-2. For example, the
model need not distinguish between strawberries and oranges since the two
classes are from different tasks.</p>
<p>Task-incremental continual learning <a class="citation" href="#van2019three">(Van de Ven and Tolias 2019)</a> studies the problem
of learning on a sequence of tasks. The continual learner observes data from
a new task after every “episode” of continual learning. We seek low
generalization errors on all tasks, observed until the current episode. To
achieve this goal, we should leverage information from past tasks, to generalize
better on new tasks and simultaneously use information from the new tasks, to
improve on past tasks.</p>
<p>What does a typical continual learner look like? Most methods train a single
network — one that looks like Caruana’s multi-task model — on a sequence of
tasks. At every episode, the network learns to tackle the new task while trying
to maintain its ability to address past tasks. The latter is enforced using
auxiliary losses or constraints on the gradients.</p>
<h2 id="task-competition">Task Competition</h2>
<p>Caruana’s model assumes that <em>all tasks</em> share a single representation \(f\).
This assumption requires all tasks to be similar to each other, which is less
likely to be true if we have a large number of tasks. Yet, this model continues
to remain the most popular choice architecture, even if we have to tackle many
tasks, like in continual learning.</p>
<p>Under Caruana’s multi-task model, we see that tasks can both help and hurt each
other. The below matrix is representative of distances between
every pair of tasks. It is the decrease (red) or increase (green) in
accuracy when we train two tasks together, compared to training on a single
task. Clearly, not all pairs of tasks benefit each other – evident from the
red cells in the matrix.</p>
<div class="imgcaption">
<div style="width: 55%;" class="imgcenter">
<img src="/public/imgs/modelzoo/task_corr.svg" alt="Task Distances" />
</div>
<span class="caption"> <em>The matrix is representative of task distances on 20 5-way
classification tasks constructed from the CIFAR-100 dataset; Each cell is
the relative change in accuracy when we train on two tasks as opposed to
one task. Ideally, we should use this matrix to determine which tasks
should be trained together. However, estimating this distance matrix is
expensive, particularly in continual learning, where all tasks are not
available a priori.</em></span>
</div>
<p>We <a class="citation" href="#modelzoo">(Ramesh and Chaudhari 2022)</a> formalize this idea using the task-competition
theorem. We show that dissimilar tasks fight over the model’s capacity and they
benefit from being trained on separate models. This observation inspires
us to build Model Zoo, which is capable of expanding its model capacity to
accommodate dissimilar tasks.</p>
<h2 id="model-zoo">Model Zoo</h2>
<p>Model Zoo <a class="citation" href="#modelzoo">(Ramesh and Chaudhari 2022)</a> is a set of models where each model is trained on a subset of tasks.
In the continual learning setup, we add 1 model to the zoo after every
episode. The model is trained on the new task and a subset of the past tasks
(around 2-4). After \(T\) episodes, Model Zoo is consists of an ensemble of
\(T\) models, which are each trained on a subset of tasks.</p>
<div class="imgcaption">
<div style="width: 70%;" class="imgcenter">
<img src="/public/imgs/modelzoo/zoo.svg" alt="Model Zoo" />
</div>
<span class="caption"> <em>The above figure depicts one episode of Model Zoo training.
The training loss (at the top of the figure) is computed using the first
two learners. The training loss is used to determine which tasks should be
trained together in the third model.</em></span>
</div>
<p>The zoo’s output for a particular task is computed by averaging the outputs
of all models trained on that task. For example, the output of the green task
is the average of the green output heads of the first two models. The outputs of the
orange and yellow tasks can be computed using a single model (orange head of
the second model and yellow head of the third model respectively).</p>
<p>Identifying related sets of tasks is the most important component of the
zoo; This determines which tasks are trained together. Often, figuring out the
relatedness tasks is as expensive as training on those tasks. For continual
learning, we seek an inexpensive proxy for the distance between tasks, in order
to split the tasks into related subsets.</p>
<p>Inspired by Adaboost, we use the training losses of the tasks to identify
which tasks should be trained together. We add a learner to the zoo, which is
trained on tasks with high training losses similar to how a Adaboost trains
learners on samples with high training errors. Even if the zoo performs poorly
on some tasks at episode \(T\), it has the ability to correct itself in future
episodes.</p>
<h2 id="evaluating-the-zoo">Evaluating the Zoo</h2>
<p>Continual learning has many formulations which makes it hard to compare
different methods. They vary on factors like model storage, sample storage
or compute. Methods typically either store no data from previous tasks, or use
a limited replay buffer to store a fraction (around 5-10%) of the data from
previous tasks. Catastrophic forgetting — a tendency of the model to rapidly
forget the solution to past tasks — is commonly observed in these settings and
addressing this, is the primary focus of most existing methods.</p>
<p>In our work, we show that a learner that shares no information between the
tasks i.e., it does no real continual learning, performs better than existing
methods. We call this method <em>Isolated</em> in our work. Isolated has low training times
and uses no replay buffer but grows it model capacity linearly with the number
of tasks. Existing methods scale well with respect to model size, but fail to
achieve good generalization errors.</p>
<div class="imgcaption">
<div style="width: 65%;" class="imgcenter">
<img src="/public/imgs/modelzoo/isolated_acc.svg" alt="Isolated" />
</div>
<span class="caption"> <em>In the above plot, we track the accuracy averaged over all
tasks, over 20 episodes of continual learning. Different methods are
evaluated in the single-epoch setting where the model is allowed one epoch of
training, at every episode of continual learning. Regardless of the
architecture (a small convolution network or Resnet18-S), Isolated
outperforms other methods. This indicates that existing methods fail to
transfer between different tasks. We find a similar trend in the
multi-epoch setting too.</em></span>
</div>
<p>Model Zoo on the other hand, is able to outperform the Isolated baseline. It
improves on previous tasks (backward transfer) and has an improved ability to
solve new tasks (forward transfer). The zoo has low training and inference
times compared to other methods but grows its model capacity at every
episode and uses a replay buffer.</p>
<div class="imgcaption">
<div style="width: 65%;" class="imgcenter">
<img src="/public/imgs/modelzoo/modelzoo_fwd_bwd.svg" alt="Isolated" />
</div>
<span class="caption"> <em>We track the accuracy of each task from the
Split-MiniImagenet dataset which consists of 20 5-way classification
problems. 'X', which denote the accuracy of Isolated on a new task, is
usually lower accuracy than Model Zoo's accuracy on the same task when it
is first seen. Additionally, the accuracy of past tasks improve over
episodes of continual learning.</em></span>
</div>
<p>Model Zoo shows that growing the capacity of the model, can help us build
effective continual learners. If you are interested in exploring Model Zoos,
consider checking out our <a href="https://arxiv.org/abs/2106.03027">ICLR 22 paper</a> or
this <a href="https://colab.research.google.com/github/rahul13ramesh/modelzoo_continual/blob/main/modelzoo_minimal.ipynb">jupyter
notebook</a></p>
<h2 id="references">References</h2>
<ol class="bibliography"><li>
<a class="darklink" href="https://link.springer.com/article/10.1023/A:1007379606734" target="_blank" rel="noopener noreferrer">
<span id="caruana_mtl">Caruana, Rich. 1997. “Multitask Learning.” <i>Machine Learning</i> 28 (1): 41–75.</span>
</a>
</li>
<li>
<span id="baxter_inductive">Baxter, Jonathan. 2000. “A Model of Inductive Bias Learning.” <i>Journal of Artificial Intelligence Research</i> 12: 149–98.</span>
</li>
<li>
<span id="van2019three">Ven, Gido M Van de, and Andreas S Tolias. 2019. “Three Scenarios for Continual Learning.” <i>ArXiv Preprint ArXiv:1904.07734</i>.</span>
</li>
<li>
<a class="darklink" href="https://openreview.net/forum?id=WfvgGBcgbE7" target="_blank" rel="noopener noreferrer">
<span id="modelzoo">Ramesh, Rahul, and Pratik Chaudhari. 2022. “Model Zoo: A Growing Brain That Learns Continually.” In <i>International Conference on Learning Representations</i>.</span>
</a>
</li></ol>