LoRA Without Regret | Achille Triomphe

Introduction

Today’s leading language models contain upwards of a trillion parameters, pretrained on tens of trillions of tokens. Base model performance keeps improving with scale, as these trillions are necessary for learning and representing all the patterns in written-down human knowledge.

In contrast, post-training involves smaller datasets and generally focuses on narrower domains of knowledge and ranges of behavior. It seems wasteful to use a terabit of weights to represent updates from a gigabit or megabit of training data. This intuition has motivated parameter efficient fine-tuning (PEFT), which adjusts a large network by updating a much smaller set of parameters.

The leading PEFT method is low-rank adaptation, or LoRA. LoRA replaces each weight matrix $W$ from the original model with a modified version $W' = W + \gamma BA$, where $B$ and $A$ are matrices that together have far fewer parameters than $W$, and $\gamma$ is a constant scaling factor. In effect, LoRA creates a low-dimensional representation of the updates imparted by fine-tuning.
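
As a minimal sketch of this definition, the pure-Python snippet below applies an adapter in two equivalent ways: by adding the low-rank path $\gamma B(Ax)$ to the frozen output $Wx$, and by merging $\gamma BA$ into $W$ first. The helper names (`matvec`, `matmul`, `lora_forward`, `merge_adapter`) and the tiny matrices are illustrative assumptions, not part of any library.

```python
# Illustrative sketch only: tiny hand-written matrices, no external libraries.

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(P, Q):
    """Multiply matrix P by matrix Q (both lists of rows)."""
    q_cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in q_cols] for row in P]

def lora_forward(W, A, B, gamma, x):
    """Compute W x + gamma * B (A x) without ever forming the product B A."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + gamma * l for b, l in zip(base, low_rank)]

def merge_adapter(W, A, B, gamma):
    """Return the merged weight W' = W + gamma * B A."""
    BA = matmul(B, A)
    return [[w + gamma * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Hypothetical shapes: W is 2x3, rank r = 1, so A is 1x3 and B is 2x1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[1.0], [-1.0]]
gamma = 0.5
x = [1.0, 1.0, 1.0]

unmerged = lora_forward(W, A, B, gamma, x)
merged = matvec(merge_adapter(W, A, B, gamma), x)
print(unmerged, merged)
```

Both paths give the same output. Keeping the adapter unmerged is what allows a server to hold one copy of $W$ alongside many small $(A, B)$ pairs.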

LoRA may offer advantages in the cost and speed of post-training, and there are also a few operational reasons to prefer it to full fine-tuning (henceforth, FullFT):

Multi-tenant serving. Since LoRA trains an adapter (i.e., the $A$ and $B$ matrices) while keeping the original weights unchanged, a single inference server can keep many adapters (different model versions) in memory and sample from them simultaneously in a batched way. Modern inference engines such as vLLM and SGLang implement this feature. See Punica: Multi-Tenant LoRA Serving (Chen, Ye, et al, 2023).

Layout size for training. When fine-tuning the whole model, we need to store gradients and optimizer moments for all of the weights in addition to the weights themselves, and these variables are often kept in higher precision (float32) than the weights used for inference (bfloat16 or lower). As a result, FullFT usually requires an order of magnitude more accelerators than sampling from the same model does, and thus a different layout.

Ease of loading and transfer. With fewer weights to store, LoRA adapters are fast and easy to set up or transfer between machines.
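
To make the layout argument concrete, here is a back-of-the-envelope memory estimate. The byte counts are assumptions chosen for illustration (bfloat16 weights for inference; float32 weights, gradients, and two Adam moments for every trained parameter), and the parameter counts are hypothetical. Activations and framework overhead are ignored.

```python
# Rough accounting under stated assumptions; real training stacks differ.
BYTES_INFERENCE_WEIGHT = 2      # bfloat16 weight (assumption)
BYTES_TRAINED_PARAM = 4 * 4     # float32 weight + gradient + two Adam moments (assumption)

def inference_bytes(n_params):
    """Memory to hold the weights for sampling."""
    return n_params * BYTES_INFERENCE_WEIGHT

def full_ft_bytes(n_params):
    """Memory for weights, gradients, and optimizer moments of every parameter."""
    return n_params * BYTES_TRAINED_PARAM

def lora_bytes(n_base_params, n_adapter_params):
    """Frozen base weights at inference precision plus fully trained adapter parameters."""
    return inference_bytes(n_base_params) + n_adapter_params * BYTES_TRAINED_PARAM

n_base = 8_000_000_000       # hypothetical 8B-parameter model
n_adapter = 50_000_000       # hypothetical adapter size

for name, size in [
    ("inference", inference_bytes(n_base)),
    ("FullFT", full_ft_bytes(n_base)),
    ("LoRA", lora_bytes(n_base, n_adapter)),
]:
    print(f"{name:>9}: {size / 1e9:.1f} GB")
```

Under these assumptions FullFT needs eight times the weight memory of inference, while the LoRA run stays close to the inference footprint.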

These reasons are sufficient to explain the growing popularity of LoRA since the publication of the original LoRA paper in 2021 (LoRA: Low-Rank Adaptation of Large Language Models, Hu et al, 2021). However, the literature is unclear on how well LoRA performs relative to FullFT.

There is agreement that LoRA underperforms in settings that resemble pre-training (LoRA Learns Less and Forgets Less, Biderman et al, 2024), namely those with very large datasets that exceed the storage limits of LoRA parameters. But for dataset sizes that are typical in post-training, LoRA has sufficient capacity to store the essential information. However, this fact makes no guarantees regarding sample efficiency and compute efficiency. The question is: can LoRA match the performance of full fine-tuning, and if so, under which conditions?

In our experiments, we find that indeed, when we get a few key details right, LoRA learns with the same sample efficiency as FullFT and achieves the same ultimate performance.

What matters for LoRA

This article covers a series of supervised fine-tuning and reinforcement learning experiments we conducted to determine the conditions under which LoRA matches FullFT efficiency. To this end, we did a few things differently from previous experiments on LoRA:

We investigated the general relationship between training set size and number of LoRA parameters, rather than focusing on specific datasets and tasks.
In supervised learning, we measured log loss rather than employing sampling-based evals, with the same goal of generality in mind. Log loss measurement gives clean results and scaling laws over ranges of training steps and training parameters.

We find that:

For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.
For datasets that exceed LoRA capacity, LoRA underperforms FullFT. Rather than the loss reaching a distinct floor that it can’t go below, LoRA results in worse training efficiency that depends on the relationship between model capacity and dataset size.
In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning — it pays a larger penalty in loss as batch size increases beyond some point.
Even in small data settings, LoRA performs better when applied to all weight matrices, especially MLP and MoE layers. Attention-only LoRA underperforms even when we match the number of trainable parameters by using higher rank for attention-only LoRA.
LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.

Low-regret regime diagram — **Figure 4.** The low-regret regime: when adapter capacity is well-matched to dataset size, LoRA achieves the same sample efficiency as FullFT.

We also studied the impact of hyperparameters used for LoRA on its learning rate relative to full fine-tuning. We examine some invariances in hyperparameters like init scales and multipliers, and explain why the $1/r$ prefactor makes the optimal learning rate (LR) approximately independent of rank.

Methods and results

We designed our experiments to measure in detail the relative performance of LoRA compared to FullFT across a range of conditions. Here are some details of our experimental setup:

We varied the LoRA rank over three orders of magnitude, with rank between $1$ and $512$, and compared these to full fine-tuning.
To eliminate potential confounds from using a suboptimal learning rate, we swept the LR for each experimental condition. We used a constant learning rate schedule (no warmup or cooldown).
Our experiments used Llama 3 series models (The Llama 3 Herd of Models, Dubey et al, 2024) and Qwen3 models (Qwen3 Technical Report, Qwen Team, 2025), including a mixture of experts (MoE) model.
The main supervised learning experiments used the Tulu3 (Tulu 3: Pushing Frontiers in Open Language Model Post-Training, Ivison et al, 2024) and OpenThoughts3 (OpenThoughts: Data Recipes for Reasoning Models, Guha et al, 2025) datasets, focused on instruction following and reasoning, respectively.
Our RL experiments used mathematical reasoning tasks with answer correctness as the reward.
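
A learning rate sweep of this kind is typically run over a log-spaced grid. The sketch below builds such a grid and picks the best value for one condition. The function names and the loss values are hypothetical placeholders, not results from these experiments.

```python
import math

def log_spaced_lrs(lo, hi, n):
    """Return n learning rates spaced evenly in log space from lo to hi (inclusive)."""
    if n < 2:
        raise ValueError("need at least two grid points")
    step = (math.log(hi) - math.log(lo)) / (n - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(n)]

def best_lr(loss_by_lr):
    """Pick the learning rate with the lowest final loss."""
    return min(loss_by_lr, key=loss_by_lr.get)

grid = log_spaced_lrs(1e-5, 1e-3, 5)

# Placeholder losses for one (model, rank) condition; a real sweep would train once per LR.
placeholder_losses = {1e-5: 1.20, 1e-4: 0.95, 1e-3: 1.10}
print(grid)
print(best_lr(placeholder_losses))
```

Repeating the selection for every rank, and for FullFT, is what allows conditions to be compared at their own best learning rate.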

LoRA rank

We trained for a single epoch on the Tulu3 dataset and a subset of the OpenThoughts3 dataset. For each dataset and model size, we swept over LoRA rank and learning rate.

We see that FullFT and high-rank LoRAs have similar learning curves with loss decreasing linearly with the logarithm of the number of steps. Medium and low-rank LoRAs fall off the minimum-loss learning curves at some threshold of steps that correlates with rank. Intuitively, learning slows down when the adapter runs out of capacity, which in turn is determined by rank.
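
One way to check the "linear in log steps" description is to fit $\text{loss} \approx a + b \log(\text{step})$ by least squares and then look for the step at which a curve departs from the fit. The sketch below shows only the fit. The data points are synthetic, generated from an exact log-linear curve for illustration, and are not measurements from these experiments.

```python
import math

def fit_log_linear(steps, losses):
    """Least-squares fit of loss = a + b * ln(step); returns (a, b)."""
    xs = [math.log(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic curve: loss = 2.0 - 0.1 * ln(step).
steps = [10, 100, 1000, 10000]
losses = [2.0 - 0.1 * math.log(s) for s in steps]

a, b = fit_log_linear(steps, losses)
print(f"intercept={a:.3f}, slope={b:.3f}")
```

On real curves, a lower-rank run would be expected to track the fitted line early on and then sit above it once the adapter runs short of capacity.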

LoRA training curves by rank — **Figure 1.** Training curves for various LoRA ranks on Tulu3. FullFT and high-rank LoRAs follow similar trajectories; lower ranks hit a capacity floor sooner.

Batch size effects

In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning. As batch size increases beyond an optimal point, LoRA pays a larger penalty in final loss. This penalty is not mitigated by increasing rank — it is a property of the product-of-matrices parametrization.

Batch size penalty for LoRA vs FullFT — **Figure 2.** Final validation loss vs. batch size. FullFT remains stable across a wider range of batch sizes than LoRA.

Setting LoRA hyperparameters

Optimal learning rate and rank

Following Hu et al., we consider the following parametrization for LoRA:

$W' = W + \frac{\alpha}{r} BA$

where $r$ is the LoRA rank, $\alpha$ is the LoRA scaling factor, and $A$, $B$ are the LoRA weight matrices (of rank $r$). We use $\alpha = 32$ for the experiments in this article, following standard practice from other implementations.

We can partly explain why the $1/r$ prefactor makes the optimal learning rate approximately independent of rank by looking at the expected update to the adapter after the very first training step. We can think of the LoRA product $BA$ as the sum of $r$ rank-1 outer products:

$BA = \sum_{i=1}^{r} b_i a_i^T = \sum_{i=1}^{r} \Delta_i$

where we define $\Delta_i = b_i a_i^T$, with $b_i$ the $i$-th column of $B$ and $a_i^T$ the $i$-th row of $A$. Here, $\partial \text{Loss} / \partial \Delta_i$ is the same for all $i$; however, the gradients $\partial \text{Loss} / \partial b_i$ and $\partial \text{Loss} / \partial a_i$ will depend on the initialization ($\partial \text{Loss} / \partial b_i$ depends on $a_i$, for example). Since the initialization of $a_i$ and $b_i$ does not depend on rank, it follows that $\mathbb{E}[\Delta_i]$ is the same for all $i$ and does not depend on rank. At the first step of training, the expected update from each of these terms is equal and independent of the rank. It follows that $(1/r) \sum_{i=1}^{r} \Delta_i$ is just a sample average of $r$ terms with the same expectation, so the expectation of the average, i.e., the change to the adapter $(1/r) BA$, doesn’t depend on the rank.

This invariance is what makes the optimal learning rate approximately independent of rank when using the $\alpha/r$ scaling. The key insight is that the effective step size on the full weight update is controlled by $\alpha$, not by $r$ directly.
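
The first-step argument can be written out explicitly, as a sketch under two assumptions that are ours rather than the text's: $B$ starts at zero (so only $B$ moves on the first step), and the optimizer's first update to each column $b_i$, written $u_i$, depends on the direction of the gradient but not on its scale, which is approximately true of Adam's first step. Here $G = \partial \text{Loss} / \partial W'$ and $\eta$ is the learning rate.

```latex
\begin{aligned}
\frac{\partial \text{Loss}}{\partial b_i} &= \frac{\alpha}{r}\, G\, a_i,
\qquad
\frac{\partial \text{Loss}}{\partial a_i} = \frac{\alpha}{r}\, G^T b_i = 0
\quad (\text{since } b_i = 0 \text{ at initialization}), \\
u_i &\approx -\eta\, \operatorname{sign}(G\, a_i)
\quad (\text{unaffected by the positive factor } \alpha / r), \\
\Delta W' &= \frac{\alpha}{r} \sum_{i=1}^{r} u_i a_i^T, \\
\mathbb{E}[\Delta W'] &= \frac{\alpha}{r} \cdot r \cdot \mathbb{E}[u_1 a_1^T]
= \alpha\, \mathbb{E}[u_1 a_1^T].
\end{aligned}
```

Each $u_i a_i^T$ is an identically distributed draw whose distribution does not depend on $r$, so under these assumptions the expected first-step change to $W'$ is set by $\alpha$ and $\eta$ alone. With plain SGD, $u_i$ would itself carry a factor of $\alpha / r$, and the cancellation would not hold exactly.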

Learning rate sweep for various LoRA ranks — **Figure 3.** Learning rate vs. final validation loss for different ranks. The optimal LR is approximately independent of rank when using $\alpha/r$ scaling.

Geometry of the update space

To understand why LoRA behaves the way it does, it helps to think about the geometry of the parameter updates. Full fine-tuning operates in the full tangent space of the weight matrix, a space of dimension $d_{\text{out}} \times d_{\text{in}}$. LoRA, on the other hand, constrains the update to matrices of rank at most $r$, which it parametrizes with $r \times (d_{\text{out}} + d_{\text{in}})$ numbers. Strictly speaking, this set is not a linear subspace, since the sum of two rank-$r$ matrices can have rank up to $2r$.
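
For a sense of scale, the sketch below compares the number of free entries in a full update with the number of adapter parameters. The layer shape is a hypothetical example. The `lora_manifold_dim` helper uses the standard result that $d_{\text{out}} \times d_{\text{in}}$ matrices of rank exactly $r$ form a manifold of dimension $r(d_{\text{out}} + d_{\text{in}} - r)$. That is slightly below the parameter count, because $BA$ is unchanged when $B$ is multiplied by an invertible $r \times r$ matrix and $A$ by its inverse.

```python
def full_update_dim(d_out, d_in):
    """Number of free entries in an unconstrained update to a d_out x d_in matrix."""
    return d_out * d_in

def lora_param_count(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)

def lora_manifold_dim(d_out, d_in, r):
    """Dimension of the set of rank-r matrices of shape d_out x d_in."""
    return r * (d_out + d_in - r)

d_out, d_in = 4096, 4096   # hypothetical square projection
for r in (1, 16, 512):
    full = full_update_dim(d_out, d_in)
    lora = lora_param_count(d_out, d_in, r)
    print(f"r={r:>3}: full={full:,} lora={lora:,} ratio={full / lora:.1f}")
```

For this shape, even rank 512 uses a quarter of the parameters of a full update, and rank 1 uses 2048 times fewer.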

The figure below illustrates the unit balls in the tangent space for different distance measures. The geometry of the optimization landscape depends on which norm we use to measure the size of an update:

Figure 2. Inscribing unit balls in the tangent space for different distance measures. The ℓ₂ (Euclidean) unit ball is a circle while the ℓ₁ (Manhattan) unit ball is a diamond.

The $\ell_2$ norm (Euclidean distance) gives a circular unit ball, which is isotropic — it treats all directions equally. The $\ell_1$ norm (Manhattan distance) gives a diamond-shaped unit ball, which is anisotropic — it prefers updates aligned with the coordinate axes.
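
A small numeric check makes the difference between the two unit balls concrete. The points are arbitrary illustrations: an axis-aligned unit step lies on both unit balls, while a diagonal step of Euclidean length 1 has $\ell_1$ norm $\sqrt{2}$ and so falls outside the diamond.

```python
import math

def l1_norm(v):
    """Manhattan norm: sum of absolute values."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Euclidean norm: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))

axis_step = [1.0, 0.0]
diagonal_step = [math.sqrt(0.5), math.sqrt(0.5)]

for name, v in [("axis", axis_step), ("diagonal", diagonal_step)]:
    print(f"{name:>8}: l1={l1_norm(v):.4f} l2={l2_norm(v):.4f}")
```

Measured in $\ell_1$, the diagonal step is the more expensive of the two, which is the sense in which that norm favors axis-aligned updates.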

In the context of LoRA, the product $BA$ implicitly defines a geometry on the update space. Because $B$ and $A$ are initialized with different scales (typically $A$ with Kaiming init and $B$ with zeros), the effective geometry is neither purely $\ell_2$ nor $\ell_1$, but something in between that depends on the relative scaling of the two factors.

Interaction dynamics

When we think about how models interact with their environment during post-training, it’s useful to consider the temporal structure of the interaction. Rather than treating training as a sequence of discrete, independent examples, we can view it as a continuous stream of input and output:

Figure: time-aligned, micro-turn-based interaction. Interaction is grounded in time, with continuous input and output streams (video, audio, model) split into micro-turns over a window from 0 ms to 3000 ms.

The video stream provides a continuous visual signal, the audio stream provides a continuous auditory signal, and the model produces output tokens in response. By splitting the interaction into micro-turns — short segments of time where the model processes input and generates output — we can align the model’s behavior with the temporal structure of the environment.

This perspective is particularly relevant for reinforcement learning from human feedback (RLHF) and related methods, where the model’s output at one time step influences the input it receives at the next. The time-aligned view helps us understand why RL requires less capacity than supervised fine-tuning: the model only needs to learn a policy — a mapping from states to actions — rather than memorizing a large dataset of input-output pairs.

Discussion

Why LoRA might be needed on all layers

Our experiments show that applying LoRA to all weight matrices, including MLP and MoE layers, gives better results than attention-only LoRA, even when the total number of trainable parameters is matched. This suggests that the capacity of LoRA is not just about the number of parameters, but about which parameters are updated.

The attention layers are responsible for routing information between tokens, while the MLP layers are responsible for transforming the representations at each position. Both are necessary for the model to adapt to a new task. Restricting LoRA to attention layers only limits the model’s ability to change the content of the representations, even if it can change how they are mixed.
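
To see what "matching the number of trainable parameters" involves, the sketch below counts adapter parameters for one transformer block and solves for the attention-only rank that matches an all-matrices adapter. The projection shapes are illustrative assumptions (a dense block with grouped-query attention and a gated MLP), not the exact configuration used in the experiments.

```python
import math

# (d_in, d_out) for each adapted matrix in one block; shapes are assumptions.
ATTENTION = {"q": (4096, 4096), "k": (4096, 1024), "v": (4096, 1024), "o": (4096, 4096)}
MLP = {"gate": (4096, 14336), "up": (4096, 14336), "down": (14336, 4096)}

def lora_params(shapes, r):
    """Adapter parameters for a set of matrices: r * (d_in + d_out) each."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes.values())

def matching_rank(target_params, shapes):
    """Smallest rank whose adapter on `shapes` has at least target_params parameters."""
    per_rank = lora_params(shapes, 1)
    return math.ceil(target_params / per_rank)

all_layers_r16 = lora_params({**ATTENTION, **MLP}, 16)
attn_rank = matching_rank(all_layers_r16, ATTENTION)
print(all_layers_r16, attn_rank, lora_params(ATTENTION, attn_rank))
```

Under these assumed shapes, an attention-only adapter needs roughly three times the rank (50 versus 16) to reach the same parameter count, because the MLP matrices account for most of the adapter parameters.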

Layer ablation results — **Figure 5.** Validation loss for different LoRA layer configurations. Applying LoRA to all layers matches FullFT; attention-only LoRA underperforms.

How much capacity is needed?

For supervised fine-tuning, the required LoRA rank depends on the size and diversity of the dataset. For small, focused datasets (e.g., a few thousand examples of a specific task), a rank of 8 or 16 is often sufficient. For larger, more diverse datasets (e.g., hundreds of thousands of instruction-response pairs), higher ranks (64–256) may be needed to match FullFT.

For reinforcement learning, we find that surprisingly low ranks are sufficient. This is consistent with the idea that RL only requires the model to learn a preference ordering over outputs, rather than to memorize a large corpus. A rank of 1 or 4 can often match FullFT for RL tasks, even on large models.

Compute efficiency advantage of LoRA

The compute efficiency of LoRA comes from two sources:

Fewer FLOPs per step. Because LoRA only computes gradients for the adapter matrices, the backward pass is cheaper than FullFT. The exact savings depend on the rank and the fraction of layers that are adapted.

Smaller optimizer state. Adam and other adaptive optimizers store running statistics of the gradients (first and second moments). For FullFT, these statistics are as large as the model itself. For LoRA, they are proportional to the adapter size, which is much smaller.

The combined effect is that LoRA training can use smaller GPU layouts and higher batch sizes per device, which can lead to better hardware utilization and faster training.
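
A rough count of multiply-adds per token for one weight matrix illustrates the first point. The accounting is a common simplification and an assumption here: one matrix product in the forward pass and two in the backward pass (gradient with respect to the input and gradient with respect to the weights), with LoRA skipping the weight gradient for the frozen $W$. The layer shape and rank are hypothetical.

```python
def full_ft_macs(d_out, d_in):
    """Forward pass, input gradient, and weight gradient for W: three d_out x d_in products."""
    return 3 * d_out * d_in

def lora_macs(d_out, d_in, r):
    """Forward pass and input gradient through frozen W, plus all three passes for A and B."""
    frozen = 2 * d_out * d_in
    adapter = 3 * r * (d_out + d_in)
    return frozen + adapter

d_out, d_in, r = 4096, 4096, 16   # hypothetical layer and rank
ratio = lora_macs(d_out, d_in, r) / full_ft_macs(d_out, d_in)
print(f"LoRA / FullFT multiply-adds per token: {ratio:.3f}")
```

Under this accounting, LoRA costs a little over two thirds of FullFT per token for this matrix, with the adapter terms adding under one percent at rank 16.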

Closing thoughts

LoRA is a powerful tool for post-training large language models, and our experiments show that it can match full fine-tuning in a wide range of settings. The key to achieving this parity is to use the right hyperparameters, especially a learning rate tuned for LoRA itself (the $\alpha/r$ prefactor keeps its optimum roughly independent of rank), and to apply LoRA to all layers, not just attention.

The “low-regret regime” we identify covers most post-training scenarios, including supervised fine-tuning on small-to-medium datasets and reinforcement learning. In this regime, LoRA offers the operational advantages of multi-tenant serving, smaller training layouts, and easy transfer, without sacrificing performance.

For very large datasets that approach pre-training scale, LoRA does eventually underperform FullFT, as the adapter runs out of capacity. But for the vast majority of practical applications, LoRA without regret is not just possible — it’s the default choice.

Acknowledgements

This article is heavily inspired by the excellent work of John Schulman and collaborators at Thinking Machines. Their empirical and theoretical insights into LoRA have greatly advanced the community’s understanding of parameter-efficient fine-tuning.

Citation

Please cite this work as:

Achille Triomphe, "LoRA Without Regret",
Achille Triomphe, June 2026.

Or use the BibTeX citation:

@article{achilletriomphe2026lorawithoutregret,
  author    = {Achille Triomphe},
  title     = {LoRA Without Regret},
  journal   = {Achille Triomphe},
  year      = {2026},
  month     = {June},
  note      = {https://www.achilletriomphe.com/blog/lora-without-regret/},
}