Fine-tuning neural networks with LoRA (Low-Rank Adaptation)
Low-rank adaptation (LoRA) is a popular method for fine-tuning neural networks that trains a relatively small number of parameters while achieving performance comparable to full fine-tuning of all parameters (Hu et al., 2021). Instead of fine-tuning the original weights of the base model, a much smaller additional set of weights is trained instead. This reduces computational cost, allowing for cheaper and faster fine-tuning.
*Figure 1: The frozen base weight matrix W alongside the trainable low-rank matrices A and B.*
LoRA works by using two low-rank matrices, the down-projection matrix \( \mathbf{A} \in \mathbb{R}^{r \times d} \) and the up-projection matrix \( \mathbf{B} \in \mathbb{R}^{d \times r} \), to represent the update to an original weight matrix from the base model, \( \mathbf{W} \in \mathbb{R}^{d \times d} \) (Figure 1). \(\mathbf{W}\) represents the connections between two layers in a neural network. In LoRA, \(2dr\) parameters are fine-tuned, as opposed to \(d^2\) parameters in full fine-tuning. By choosing \(r \ll d\), the number of parameters involved in fine-tuning is drastically reduced.
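To make the savings concrete, here is a quick back-of-the-envelope check; the hidden size \(d = 4096\) and rank \(r = 8\) are illustrative values, not figures from the original discussion.

```python
d, r = 4096, 8  # illustrative hidden size and rank

full_ft_params = d * d    # full fine-tuning updates all of W
lora_params = 2 * d * r   # LoRA trains only A (r x d) and B (d x r)

print(f"full fine-tuning: {full_ft_params:,} parameters")      # 16,777,216
print(f"LoRA (r=8):       {lora_params:,} parameters")         # 65,536
print(f"reduction:        {full_ft_params // lora_params}x")   # 256x
```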
During training (i.e. fine-tuning), in addition to multiplying the input vector \(\mathbf{x}\) with \(\mathbf{W}\), \(\mathbf{x}\) is multiplied with \(\mathbf{A}\) and then \(\mathbf{B}\), and the resulting vectors are summed to obtain the output vector, \( \mathbf{h} = \mathbf{Wx} + \mathbf{BAx} \). Gradients are computed and applied only to \(\mathbf{A}\) and \(\mathbf{B}\); \(\mathbf{W}\) remains frozen throughout fine-tuning. During inference, there is no additional latency: the low-rank matrices are merged into the original weight matrix, \( \mathbf{W'} = \mathbf{W} + \mathbf{BA} \), and the forward pass is performed with the merged matrix instead, \( \mathbf{h} = \mathbf{W'} \mathbf{x} \).
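The following minimal PyTorch sketch shows these mechanics for a single layer; the class name, initialization scale, and `merge()` helper are illustrative assumptions rather than the paper's reference implementation. Following Hu et al. (2021), \(\mathbf{B}\) starts at zero so that \(\mathbf{BA} = \mathbf{0}\) and training begins from the base model's exact behavior.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, r: int):
        super().__init__()
        # Frozen base weight W (d x d); no gradients are computed for it.
        self.W = nn.Parameter(torch.randn(d, d), requires_grad=False)
        # Trainable factors: down-projection A (r x d), up-projection B (d x r).
        # B is zero-initialized so BA = 0 at the start of fine-tuning.
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Training-time forward pass: h = Wx + BAx.
        return x @ self.W.T + x @ self.A.T @ self.B.T

    @torch.no_grad()
    def merge(self) -> torch.Tensor:
        # Inference-time merge: W' = W + BA, so the forward pass
        # collapses to a single matmul with no added latency.
        return self.W + self.B @ self.A
```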
LoRA's approach of fine-tuning a separate set of weights yields other benefits as well. Firstly, we can switch dynamically between multiple LoRA models by leaving the original and LoRA weight matrices unmerged and performing the forward pass through all of them (\( \mathbf{h} = \mathbf{Wx} + \mathbf{BAx} \)). Switching between models then only requires replacing the small LoRA weight matrices, avoiding expensive I/O operations on large matrices (see the sketch after this paragraph). Secondly, as LoRA is orthogonal to other parameter-efficient fine-tuning (PEFT) techniques, it can be combined with them.
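As a sketch of the first benefit, the snippet below keeps one frozen \(\mathbf{W}\) in memory and swaps between two hypothetical task adapters; the adapter names and sizes are invented for illustration.

```python
import torch

d, r = 1024, 8
W = torch.randn(d, d)  # shared frozen base weight, loaded once

# Hypothetical per-task adapters: switching tasks moves only 2dr values,
# never the large d x d matrix W.
adapters = {
    "summarize": (torch.randn(r, d) * 0.01, torch.zeros(d, r)),
    "translate": (torch.randn(r, d) * 0.01, torch.zeros(d, r)),
}

def forward_with_adapter(x: torch.Tensor, task: str) -> torch.Tensor:
    A, B = adapters[task]
    # Unmerged forward pass h = Wx + BAx, so W is never rewritten.
    return x @ W.T + x @ A.T @ B.T

x = torch.randn(1, d)
h = forward_with_adapter(x, "summarize")
```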
References
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
[YouTube] Edward Hu - What is Low-Rank Adaptation (LoRA)
