Fine-tuning neural networks with LoRA (Low-Rank Adaptation)
Low-rank adaptation (LoRA) is a popular method for fine-tuning neural networks that trains a relatively small number of parameters while achieving performance comparable to full fine-tuning of all parameters (Hu et al., 2021). Instead of fine-tuning the original weights of the base model, a much smaller additional set of weights is trained instead. This reduces computational cost, allowing for cheaper and faster fine-tuning.
*Figure 1: The frozen base weight matrix W alongside the trainable low-rank matrices A and B.*
LoRA works by using two low-rank matrices, the down-projection matrix \( \mathbf{A} \in \mathbb{R}^{r \times d} \) and the up-projection matrix \( \mathbf{B} \in \mathbb{R}^{d \times r} \), to represent the update to an original weight matrix from the base model, \( \mathbf{W} \in \mathbb{R}^{d \times d} \) (Figure 1). \(\mathbf{W}\) represents the connections between two layers in a neural network. In LoRA, \(2dr\) parameters are fine-tuned, as opposed to \(d^2\) parameters in full fine-tuning. By choosing \(r \ll d\), the number of parameters involved in fine-tuning is drastically reduced.
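To make the savings concrete, here is a quick back-of-the-envelope check; the hidden size \(d = 4096\) and rank \(r = 8\) are illustrative values, not figures from the original discussion.

```python
d, r = 4096, 8  # illustrative hidden size and rank

full_ft_params = d * d    # full fine-tuning updates all of W
lora_params = 2 * d * r   # LoRA trains only A (r x d) and B (d x r)

print(f"full fine-tuning: {full_ft_params:,} parameters")      # 16,777,216
print(f"LoRA (r=8):       {lora_params:,} parameters")         # 65,536
print(f"reduction:        {full_ft_params // lora_params}x")   # 256x
```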
During training (i.e. fine-tuning), in addition to multiplying the input vector \(\mathbf{x}\) with \(\mathbf{W}\), \(\mathbf{x}\) is multiplied with \(\mathbf{A}\) and then \(\mathbf{B}\), and the resulting vectors are summed to obtain the output vector, \( \mathbf{h} = \mathbf{Wx} + \mathbf{BAx} \). Gradients are computed and applied only to \(\mathbf{A}\) and \(\mathbf{B}\); \(\mathbf{W}\) remains frozen throughout fine-tuning. During inference, there is no additional latency: the low-rank matrices are merged into the original weight matrix, \( \mathbf{W'} = \mathbf{W} + \mathbf{BA} \), and the forward pass is performed with the merged matrix instead, \( \mathbf{h} = \mathbf{W'} \mathbf{x} \).
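The following minimal PyTorch sketch shows these mechanics for a single layer; the class name, initialization scale, and `merge()` helper are illustrative assumptions rather than the paper's reference implementation. Following Hu et al. (2021), \(\mathbf{B}\) starts at zero so that \(\mathbf{BA} = \mathbf{0}\) and training begins from the base model's exact behavior.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, r: int):
        super().__init__()
        # Frozen base weight W (d x d); no gradients are computed for it.
        self.W = nn.Parameter(torch.randn(d, d), requires_grad=False)
        # Trainable factors: down-projection A (r x d), up-projection B (d x r).
        # B is zero-initialized so BA = 0 at the start of fine-tuning.
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Training-time forward pass: h = Wx + BAx.
        return x @ self.W.T + x @ self.A.T @ self.B.T

    @torch.no_grad()
    def merge(self) -> torch.Tensor:
        # Inference-time merge: W' = W + BA, so the forward pass
        # collapses to a single matmul with no added latency.
        return self.W + self.B @ self.A
```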
LoRA's approach of fine-tuning a separate set of weights yields other benefits as well. Firstly, we can switch dynamically between multiple LoRA models by leaving the original and LoRA weight matrices unmerged and performing the forward pass through all of them (\( \mathbf{h} = \mathbf{Wx} + \mathbf{BAx} \)). Switching between models then only requires replacing the small LoRA weight matrices, avoiding expensive I/O operations on large matrices (see the sketch after this paragraph). Secondly, as LoRA is orthogonal to other parameter-efficient fine-tuning (PEFT) techniques, it can be combined with them.
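As a sketch of the first benefit, the snippet below keeps one frozen \(\mathbf{W}\) in memory and swaps between two hypothetical task adapters; the adapter names and sizes are invented for illustration.

```python
import torch

d, r = 1024, 8
W = torch.randn(d, d)  # shared frozen base weight, loaded once

# Hypothetical per-task adapters: switching tasks moves only 2dr values,
# never the large d x d matrix W.
adapters = {
    "summarize": (torch.randn(r, d) * 0.01, torch.zeros(d, r)),
    "translate": (torch.randn(r, d) * 0.01, torch.zeros(d, r)),
}

def forward_with_adapter(x: torch.Tensor, task: str) -> torch.Tensor:
    A, B = adapters[task]
    # Unmerged forward pass h = Wx + BAx, so W is never rewritten.
    return x @ W.T + x @ A.T @ B.T

x = torch.randn(1, d)
h = forward_with_adapter(x, "summarize")
```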
References
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
[YouTube] Edward Hu - What is Low-Rank Adaptation (LoRA)
