Low-Rank Adaptation

An important paradigm for large language models is pre-training on large-scale, general-domain data followed by fine-tuning on task-specific data. However, full fine-tuning retrains all model parameters, which becomes costly as models grow. Low-Rank Adaptation (LoRA) freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer, which drastically reduces the number of trainable parameters and the memory required for adaptation.

Consider a language modeling problem where the goal is to maximize the conditional probability of a target sequence given a task-specific prompt. Given a pre-trained autoregressive model $P_{\Phi}(y|x)$ parameterized by $\Phi$, each downstream task is represented by a training dataset of context-target pairs of token sequences $\mathcal{Z} = \{(x_1, y_1), \dots, (x_n, y_n)\}$.

Traditionally, the model is initialized to the pre-trained weights $\Phi_0$ and updated to $\Phi_0 + \Delta \Phi$ by gradient descent on the conditional language modeling objective:

$$ \max_{\Phi} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi}(y_t \mid x, y_{<t}) $$
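Concretely, this objective is the token-level log-likelihood of the target tokens conditioned on the prompt. The sketch below computes its negative (a loss to minimize) for a single $(x, y)$ pair; it assumes a Hugging Face-style causal language model whose output exposes a `.logits` tensor, which is an assumption rather than something specified above.

```python
import torch
import torch.nn.functional as F

def conditional_lm_loss(model, x_ids, y_ids):
    """Negative of sum_t log P_Phi(y_t | x, y_<t) for one (x, y) pair.

    Assumes a causal LM returning logits of shape (1, seq_len, vocab_size).
    """
    # Feed the concatenated context and target through the model.
    input_ids = torch.cat([x_ids, y_ids], dim=1)            # (1, |x| + |y|)
    logits = model(input_ids).logits                         # (1, |x| + |y|, V)

    # The logit at position i predicts token i + 1, so the logits that
    # score y_1 ... y_|y| start at index |x| - 1.
    start = x_ids.size(1) - 1
    target_logits = logits[:, start:start + y_ids.size(1), :]

    # Gather log P(y_t | x, y_<t) for each target token and sum over t.
    log_probs = F.log_softmax(target_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, y_ids.unsqueeze(-1)).squeeze(-1)
    return -token_log_probs.sum()
```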

Training on each downstream task results in a different set of parameters $\Delta \Phi$ with the same dimensions as $\Phi_0$. Low-Rank Adaptation instead encodes each task-specific update with a much smaller set of parameters $\Theta$, with $|\Theta| \ll |\Phi_0|$:

$$ \max_{\Theta} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta \Phi(\Theta)}(y_t \mid x, y_{<t}) $$
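In practice this means optimizing the same loss while keeping $\Phi_0$ frozen and exposing only $\Theta$ (the injected low-rank matrices) to the optimizer. A minimal PyTorch sketch of that setup follows; the `"lora_"` prefix used to identify the injected parameters is an assumed naming convention for illustration, not something prescribed above.

```python
import torch

def mark_only_lora_trainable(model: torch.nn.Module):
    """Freeze Phi_0 and leave only the injected low-rank parameters Theta trainable."""
    lora_params = []
    for name, param in model.named_parameters():
        if "lora_" in name:              # assumed naming convention for A and B
            param.requires_grad = True
            lora_params.append(param)
        else:
            param.requires_grad = False  # pre-trained weight: frozen
    return lora_params

# Usage sketch: gradient descent now updates Theta only; Phi_0 never changes.
# optimizer = torch.optim.AdamW(mark_only_lora_trainable(model), lr=1e-4)
```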

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the weight update is constrained to a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. $A$ is initialized with a random Gaussian and $B$ with zeros, so that $\Delta W = BA = 0$ at the beginning of training. The modified forward pass then yields:

$$ h = W_0 x + \Delta W x = W_0 x + BAx $$
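A minimal PyTorch module illustrating this forward pass might look like the following; the class name, initialization scale, and the decision to store $W_0$ as a frozen parameter are implementation choices made here for the sketch, not requirements from the text.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        # Pre-trained weight W0 (d x k); frozen, so it receives no gradients.
        self.W0 = nn.Parameter(torch.empty(d, k), requires_grad=False)
        # A (r x k) is initialized from a Gaussian, B (d x r) with zeros,
        # so that Delta W = BA = 0 at the start of training.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + B A x, with x of shape (..., k) and h of shape (..., d).
        return x @ self.W0.T + x @ self.A.T @ self.B.T
```

For example, `LoRALinear(d=768, k=768, r=8)` replaces a 768-by-768 dense update (about 590k parameters) with roughly 12k trainable parameters in $A$ and $B$, while the output is unchanged at initialization because $BA = 0$.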