What Is LoRA (Low-Rank Adaptation)?
LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models efficiently by training only a tiny fraction of the model’s parameters.
The Problem LoRA Solves
Full fine-tuning of a model like Llama 3 70B requires updating billions of parameters — demanding hundreds of gigabytes of GPU memory and significant compute time. Most teams can’t afford this.
How LoRA Works
Instead of updating all model weights, LoRA:
- Freezes the original model parameters
- Adds small trainable “adapter” matrices to specific layers
- Trains only these adapters (typically <1% of total parameters)
- At inference, the adapters can be merged back into the original weights, adding no extra latency
The result: quality comparable to full fine-tuning at a fraction of the cost.
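The steps above can be sketched in plain NumPy. This is a toy illustration of the math, not any library’s actual API; the dimensions, rank, and scaling value are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8   # r << d: the "low rank" in Low-Rank Adaptation
alpha = 16                     # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight matrix
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, initialized to zero

# Simulate the effect of some training steps on B (normally done by SGD).
B = B + rng.standard_normal((d_out, r)) * 0.01

x = rng.standard_normal(d_in)

# Training-time forward pass: frozen base output plus a scaled low-rank correction.
h = W @ x + (alpha / r) * (B @ (A @ x))

# At deployment the correction can be folded into W once, so inference
# costs exactly the same as the original model.
W_merged = W + (alpha / r) * (B @ A)
h_merged = W_merged @ x

# Only A and B are trained: 2 * r * d parameters instead of d * d.
trainable = A.size + B.size   # 8,192
total = W.size                # 262,144
```

Because `B` starts at zero, the adapter contributes nothing at initialization, so training begins from the unmodified pretrained model.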
LoRA vs. Full Fine-Tuning
| Aspect | Full Fine-Tuning | LoRA |
|---|---|---|
| Parameters trained | All (~70B) | ~0.1% (~70M) |
| GPU memory | 100+ GB | 16-24 GB |
| Training time | Days | Hours |
| Quality | Baseline | Comparable |
| Storage per adapter | Full model copy | ~100 MB |
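The table’s orders of magnitude can be sanity-checked with quick arithmetic. The trained fraction and byte widths here are assumptions for illustration, not measurements:

```python
base_params = 70e9       # a Llama-3-70B-class model
trainable_frac = 0.001   # ~0.1% of parameters trained as LoRA adapters

adapter_params = base_params * trainable_frac  # ~70 million
bytes_per_param = 2                            # assuming 16-bit weights

adapter_mb = adapter_params * bytes_per_param / 1e6  # ~140 MB adapter file
full_copy_gb = base_params * bytes_per_param / 1e9   # ~140 GB full model copy
```

The exact adapter size depends on the rank and on which layers receive adapters, which is why shared adapters often land nearer 100 MB — but either way, an adapter is roughly a thousandth the size of a full model copy.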
QLoRA
QLoRA combines LoRA with quantization — loading the frozen base model in 4-bit precision while training the LoRA adapters in higher precision (typically 16-bit). This makes fine-tuning a 65–70B-class model possible on a single 48 GB GPU, and mid-sized models feasible on consumer hardware.
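A toy illustration of the idea: keep the frozen base weights in a low-precision format while the LoRA factors stay in higher precision. Real QLoRA uses 4-bit NF4 quantization (via the bitsandbytes library); this sketch uses simple uniform 4-bit quantization just to show the memory trade-off:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 256, 8
W = rng.standard_normal((d, d)).astype(np.float32)  # frozen base weights

# Uniform 4-bit quantization: 16 integer levels, one scale per matrix.
scale = np.abs(W).max() / 7
q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)  # 4-bit codes
W_deq = q.astype(np.float32) * scale                     # dequantized on the fly

# LoRA factors stay in higher precision (float32 here).
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01
B = np.zeros((d, r), dtype=np.float32)

x = rng.standard_normal(d).astype(np.float32)
h = W_deq @ x + (B @ (A @ x))  # forward pass: quantized base + adapters

# Memory for the base weights: 4 bits each vs 32 bits in this toy setup.
base_bits = W.size * 4
full_bits = W.size * 32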
LoRA in Practice
Many community models on Hugging Face are LoRA adapters shared on top of base models. You can stack multiple LoRA adapters to combine different capabilities — one for coding, another for creative writing.
Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
Learn more about Elvean