An important paradigm for large language models is pre-training on large-scale general-domain data followed by fine-tuning on task-specific data. However, fine-tuning retrains all model parameters, which can become costly. Low-Rank Adaptation (LoRA) freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer, which drastically reduces the number of trainable parameters and the memory required for adaptation.
Consider a language modeling problem where the goal is to maximize conditional probabilities given a task-specific prompt. Given a pre-trained autoregressive model $P_\Phi(y \mid x)$ parametrized by $\Phi$, each downstream task is represented by a training dataset of context-target pairs of token sequences $\mathcal{Z} = \{(x_i, y_i)\}_{i=1,\dots,N}$.
Traditionally, the model is initialized to pre-trained weights $\Phi_0$ and updated to $\Phi_0 + \Delta\Phi$ with gradient descent on a conditional language modeling objective.
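Concretely, full fine-tuning maximizes the token-level log-likelihood of each target sequence given its context, where $y_{<t}$ denotes the tokens preceding position $t$:

$$\max_{\Phi} \; \sum_{(x, y) \in \mathcal{Z}} \; \sum_{t=1}^{|y|} \log\big(P_{\Phi}(y_t \mid x, y_{<t})\big)$$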
Training on each downstream task results in a different set of parameters $\Delta\Phi$ with the same dimensions as $\Phi_0$, i.e. $|\Delta\Phi| = |\Phi_0|$. Low-Rank Adaptation encodes these task-specific updates with a much smaller set of parameters $\Theta$, so that $\Delta\Phi = \Delta\Phi(\Theta)$ with $|\Theta| \ll |\Phi_0|$.
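Optimization is then carried out over the much smaller $\Theta$ rather than the full $\Phi$:

$$\max_{\Theta} \; \sum_{(x, y) \in \mathcal{Z}} \; \sum_{t=1}^{|y|} \log\big(P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\big)$$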
For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the weight update $\Delta W$ is constrained through a low-rank matrix decomposition $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. $A$ is randomly initialized with a Gaussian and $B$ is initialized with zeros, so that $\Delta W = BA$ is zero at the beginning of training. The modified forward pass yields:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$
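As a minimal sketch of this forward pass (not a reference implementation), a LoRA-augmented linear layer could look like the following in PyTorch. The class name `LoRALinear` and the initialization details (a frozen stand-in for $W_0$, a small Gaussian for $A$) are illustrative assumptions; common implementations also add a scaling factor on the update, which is omitted here to match the equation above.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W0 with a trainable low-rank update BA (h = W0 x + BAx)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8):
        super().__init__()
        # Frozen pre-trained weight W0 (d x k); only A and B receive gradients.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))  # stand-in for pre-trained weights
        # A ~ Gaussian, B = 0, so BA = 0 and the forward pass is unchanged at initialization.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + B A x
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + update


# Usage: only the low-rank matrices are trainable.
layer = LoRALinear(1024, 1024, r=8)
h = layer(torch.randn(4, 1024))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['lora_A', 'lora_B']
```

Because $B$ starts at zero, the layer reproduces the pre-trained model exactly at initialization, and only the $r \times k$ and $d \times r$ matrices add to the set of trainable parameters.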