What Is LoRA (Low-Rank Adaptation)?
LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models efficiently by training only a tiny fraction of the model’s parameters.
The Problem LoRA Solves
Full fine-tuning of a model like Llama 3 70B requires updating billions of parameters — demanding hundreds of gigabytes of GPU memory and significant compute time. Most teams can’t afford this.
How LoRA Works
Instead of updating all model weights, LoRA:
- Freezes the original model parameters
- Adds small trainable “adapter” matrices to specific layers
- Trains only these adapters (typically <1% of total parameters)
- At inference, the adapters can be merged back into the original weights, adding no extra latency
The result: quality comparable to full fine-tuning at a fraction of the cost.
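The steps above can be sketched in plain NumPy. This is a toy illustration of the math, not any library’s actual API; the dimensions, rank, and scaling value are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8   # r << d: the "low rank" in Low-Rank Adaptation
alpha = 16                     # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight matrix
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, initialized to zero

# Simulate the effect of some training steps on B (normally done by SGD).
B = B + rng.standard_normal((d_out, r)) * 0.01

x = rng.standard_normal(d_in)

# Training-time forward pass: frozen base output plus a scaled low-rank correction.
h = W @ x + (alpha / r) * (B @ (A @ x))

# At deployment the correction can be folded into W once, so inference
# costs exactly the same as the original model.
W_merged = W + (alpha / r) * (B @ A)
h_merged = W_merged @ x

# Only A and B are trained: 2 * r * d parameters instead of d * d.
trainable = A.size + B.size   # 8,192
total = W.size                # 262,144
```

Because `B` starts at zero, the adapter contributes nothing at initialization, so training begins from the unmodified pretrained model.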
LoRA vs. Full Fine-Tuning
| Aspect | Full Fine-Tuning | LoRA |
|---|---|---|
| Parameters trained | All (~70B) | ~0.1% (~70M) |
| GPU memory | 100+ GB | 16-24 GB |
| Training time | Days | Hours |
| Quality | Baseline | Comparable |
| Storage per adapter | Full model copy | ~100 MB |
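The table’s orders of magnitude can be sanity-checked with quick arithmetic. The trained fraction and byte widths here are assumptions for illustration, not measurements:

```python
base_params = 70e9       # a Llama-3-70B-class model
trainable_frac = 0.001   # ~0.1% of parameters trained as LoRA adapters

adapter_params = base_params * trainable_frac  # ~70 million
bytes_per_param = 2                            # assuming 16-bit weights

adapter_mb = adapter_params * bytes_per_param / 1e6  # ~140 MB adapter file
full_copy_gb = base_params * bytes_per_param / 1e9   # ~140 GB full model copy
```

The exact adapter size depends on the rank and on which layers receive adapters, which is why shared adapters often land nearer 100 MB — but either way, an adapter is roughly a thousandth the size of a full model copy.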
QLoRA
QLoRA combines LoRA with quantization — loading the frozen base model in 4-bit precision while training the LoRA adapters in higher precision (typically 16-bit). This makes fine-tuning a 65–70B-class model possible on a single 48 GB GPU, and mid-sized models feasible on consumer hardware.
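A toy illustration of the idea: keep the frozen base weights in a low-precision format while the LoRA factors stay in higher precision. Real QLoRA uses 4-bit NF4 quantization (via the bitsandbytes library); this sketch uses simple uniform 4-bit quantization just to show the memory trade-off:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 256, 8
W = rng.standard_normal((d, d)).astype(np.float32)  # frozen base weights

# Uniform 4-bit quantization: 16 integer levels, one scale per matrix.
scale = np.abs(W).max() / 7
q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)  # 4-bit codes
W_deq = q.astype(np.float32) * scale                     # dequantized on the fly

# LoRA factors stay in higher precision (float32 here).
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01
B = np.zeros((d, r), dtype=np.float32)

x = rng.standard_normal(d).astype(np.float32)
h = W_deq @ x + (B @ (A @ x))  # forward pass: quantized base + adapters

# Memory for the base weights: 4 bits each vs 32 bits in this toy setup.
base_bits = W.size * 4
full_bits = W.size * 32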
LoRA in Practice
Many community models on Hugging Face are LoRA adapters shared on top of base models. You can stack multiple LoRA adapters to combine different capabilities — one for coding, another for creative writing.
Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
Learn more about Elvean