LoRA Has a Memory Ceiling, and Now You Can Calculate It

A new power law estimates LoRA's parametric memory capacity, giving teams a principled ceiling in place of trial-and-error rank tuning.

Fine-tuning teams have treated rank selection as a dial to turn up when memorization fails. More rank, more capacity, better recall. That assumption holds only up to a point, and the point is measurable: LoRA memory capacity follows a power law, which means returns diminish sharply, and the ceiling can be estimated before you run a single training job.

The mechanism is cleaner than most practitioners expect. Loss reduction from LoRA fine-tuning scales predictably with two variables: the number of effective parameters (rank times the number of adapted layers) and the sequence length of the training data. The relationship is a power law, meaning early rank increases yield large gains, but past a threshold the curve flattens. Doubling rank from 4 to 8 moves you meaningfully. Doubling from 64 to 128 in the same architecture moves you almost nowhere on the loss curve, because you have already saturated the capacity the model's weight geometry can absorb through low-rank updates.
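To make the diminishing returns concrete, here is a minimal sketch of a power law over effective parameters. The functional form (a loss floor plus a decaying term) and the constants a, alpha, and floor are illustrative assumptions, not values from the paper, and sequence length is held fixed for simplicity.

```python
def effective_params(rank: int, adapted_layers: int) -> int:
    """Effective parameter proxy described above: rank times adapted layers."""
    return rank * adapted_layers


def predicted_loss(rank: int, adapted_layers: int,
                   a: float = 4.0, alpha: float = 0.5, floor: float = 0.2) -> float:
    """Hypothetical power law: loss = floor + a * N^(-alpha).

    a, alpha, and floor are made-up illustration values, not fitted results.
    """
    n = effective_params(rank, adapted_layers)
    return floor + a * n ** (-alpha)


def gain_from_doubling(rank: int, adapted_layers: int = 32) -> float:
    """Absolute loss reduction obtained by going from `rank` to 2 * `rank`."""
    return (predicted_loss(rank, adapted_layers)
            - predicted_loss(2 * rank, adapted_layers))


if __name__ == "__main__":
    for r in (4, 8, 16, 32, 64):
        print(f"rank {r:>3} -> {2 * r:>3}: loss drops by {gain_from_doubling(r):.4f}")
```

Under these assumed constants, each doubling buys a smaller absolute reduction than the one before it, which is the flattening the paragraph describes.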

The token-level analysis adds a second, more actionable finding. Fine-grained probing reveals a deterministic phase transition in memorization: once the probability the model assigns to the correct token exceeds 0.5, verbatim recall of that token under greedy decoding is guaranteed, because no competing token can hold a larger share of the distribution. Below that threshold, no amount of additional training on other tokens helps that specific token. This is not a soft gradient. It is a hard boundary. The implication is that standard uniform training wastes budget on tokens already above threshold while under-serving the tokens that actually need more signal.
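The threshold logic is simple enough to sketch. The helper names below are hypothetical, and the per-token probabilities are assumed to be the model's probability of each correct token given the correct prefix (teacher forcing). If every one of them exceeds 0.5, each correct token is the argmax at its step, so greedy decoding reproduces the sequence.

```python
def greedy_token(probs):
    """Index that greedy decoding would pick: the argmax of the distribution."""
    return max(range(len(probs)), key=probs.__getitem__)


def recall_guaranteed(target_probs, threshold=0.5):
    """True when every correct-token probability is strictly above the threshold.

    Any probability above 0.5 is necessarily the largest in its distribution,
    so greedy decoding emits the correct token at every position.
    """
    return all(p > threshold for p in target_probs)


def sub_threshold_positions(target_probs, threshold=0.5):
    """Positions whose correct-token probability has not crossed the threshold."""
    return [i for i, p in enumerate(target_probs) if p <= threshold]


if __name__ == "__main__":
    probs = [0.9, 0.4, 0.7, 0.5]  # made-up per-token probabilities
    print(recall_guaranteed(probs))
    print(sub_threshold_positions(probs))
```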

MemFT, the optimization strategy built on these findings, uses the p > 0.5 threshold as a dynamic routing signal. At each step, it identifies sub-threshold tokens and redistributes the training budget toward them, rather than applying uniform gradient updates across the full sequence. The result is faster convergence to high-fidelity recall without increasing the total compute budget. For ML engineers managing fine-tuning pipelines at scale, the takeaway is direct: you can now set rank based on a law rather than a guess, and you can stop paying for training steps that serve already-memorized tokens.
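One way to picture threshold-based routing is as a per-token loss weight. The sketch below is not MemFT's actual procedure, which this summary does not spell out. It assumes a hard mask that zeroes out tokens already above 0.5 and rescales the remaining weights so the total stays equal to the sequence length, keeping the budget constant. The function names are hypothetical.

```python
import math


def threshold_weights(target_probs, threshold=0.5):
    """Assumed routing rule: weight only sub-threshold tokens.

    Weights are rescaled to sum to the sequence length, so the total
    training signal matches uniform weighting.
    """
    mask = [1.0 if p <= threshold else 0.0 for p in target_probs]
    active = sum(mask)
    if active == 0:
        # Every token is already above threshold; nothing to route to.
        return [0.0] * len(target_probs)
    scale = len(target_probs) / active
    return [m * scale for m in mask]


def weighted_nll(target_probs, weights):
    """Mean weighted negative log-likelihood (probabilities must be > 0)."""
    total = sum(w * -math.log(p) for p, w in zip(target_probs, weights))
    return total / len(target_probs)


if __name__ == "__main__":
    probs = [0.9, 0.4, 0.7, 0.2]  # made-up per-token probabilities
    weights = threshold_weights(probs)
    print(weights, weighted_nll(probs, weights))
```

In a real training loop the weights would multiply the per-token cross-entropy before the backward pass. A softer weighting that tapers near the threshold would be an equally plausible design.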

We're thinking: We find the power law framing more useful than the MemFT method itself, at least for most teams right now. The law gives you a principled stopping point: compute your effective parameter count (rank times adapted layer count), plot it against the curve, and read off whether adding rank will meaningfully reduce loss before you spend the compute. That is a different kind of tool than a new optimizer. It is a budget constraint with a derivation behind it. The contrarian read is that this also sets a ceiling on what LoRA can do for knowledge-dense memorization tasks, which means teams expecting LoRA to absorb large factual updates reliably may need to reconsider the architecture choice entirely, not just the rank.
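In practice, reading a stopping point off the curve means fitting the law to a few cheap pilot runs first. Here is a minimal sketch, assuming the reducible loss (measured loss minus an estimated floor) follows a * N^(-alpha) and fitting it by least squares in log-log space. The pilot numbers are synthetic, generated from a known law purely to show the mechanics, and the function names are hypothetical.

```python
import math


def fit_power_law(ns, reducible_losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(n).

    Returns (a, alpha). Inputs must be positive.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in reducible_losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - slope * mean_x), -slope


def worth_doubling(n, a, alpha, min_gain):
    """True if doubling effective parameters cuts loss by at least min_gain."""
    gain = a * n ** (-alpha) - a * (2 * n) ** (-alpha)
    return gain >= min_gain


if __name__ == "__main__":
    # Synthetic pilot runs: effective parameters (rank * adapted layers).
    pilot_n = [128, 256, 512, 1024]
    pilot_loss = [3.0 * n ** -0.4 for n in pilot_n]  # generated, not measured
    a, alpha = fit_power_law(pilot_n, pilot_loss)
    print(a, alpha, worth_doubling(2048, a, alpha, min_gain=0.05))
```

The min_gain cutoff is the team's own call. The fit only tells you what the next doubling is likely to buy.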

Key takeaways:

LoRA memorization capacity follows a power law over effective parameters (rank times adapted layers) and sequence length, with a hard phase transition at p > 0.5 prediction probability marking the boundary between failed and guaranteed verbatim recall.
MemFT improves memory fidelity and training efficiency by dynamically redirecting budget to sub-threshold tokens; results are empirical on controlled memorization tasks, and generalization to diverse production fine-tuning scenarios is not yet established.
Teams running LoRA fine-tuning for knowledge injection or document memorization should use the Parametric Memory Law to set rank ceilings before training, and audit whether their training budget is concentrated on already-memorized tokens.

Source: How LoRA Remembers? A Parametric Memory Law for LLM Finetuning