import torch
from torch import Tensor
import torch.nn as nn
LoRA Explained: Fine-Tune Large Models with 90% Fewer Parameters
A hands-on guide to Parameter-Efficient Fine-Tuning using Low-Rank Adaptation
What if you could adapt a large neural network to new tasks while only training 7% of its parameters? Low-Rank Adaptation (LoRA) makes this possible by cleverly decomposing weight updates into smaller matrices.
In this tutorial, we’ll build LoRA from scratch using PyTorch, demonstrate it on MNIST classification, and show why it’s revolutionizing how we fine-tune large language models.
What you’ll learn:
- The mathematical intuition behind low-rank decomposition
- How to implement LoRA adapters in PyTorch
- Why LoRA prevents catastrophic forgetting
- Practical tips for hyperparameter tuning (rank, alpha)
- How to save and load multiple task-specific adapters
When we fully fine-tune a model, we update all of its weights, which can lead to catastrophic forgetting and overfitting. With LoRA we update only a small set of added parameters. If you are familiar with ResNet blocks, you can think of the original (frozen) model as the identity path and the LoRA weights as the residual path: during training, the model adapts to the new task entirely through these added weights.
Rank of a tensor:
The number of linearly independent rows (or columns). It gives the true dimension of the information in a matrix. For LoRA, when we say "rank=8", we're forcing our adaptation to have at most rank 8, meaning it can capture at most 8 independent patterns of change.
rank_1 = torch.tensor([[1., 2.], [2., 4.]])
rank_2 = torch.tensor([[1., 2.], [3., 4.]])
zero_matrix = torch.zeros(3, 3)

print("Rank 1 matrix:")
print(rank_1)
print(f"Actual rank: {torch.linalg.matrix_rank(rank_1)}")

print("\nRank 2 matrix:")
print(rank_2)
print(f"Actual rank: {torch.linalg.matrix_rank(rank_2)}")

print("\nZero matrix:")
print(f"Actual rank: {torch.linalg.matrix_rank(zero_matrix)}")
Rank 1 matrix:
tensor([[1., 2.],
[2., 4.]])
Actual rank: 1
Rank 2 matrix:
tensor([[1., 2.],
[3., 4.]])
Actual rank: 2
Zero matrix:
Actual rank: 0
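Notice that the rank-1 matrix can be written as an outer product of two vectors, which is exactly the idea low-rank decomposition exploits. A quick check, reusing rank_1 from above:

# [[1., 2.], [2., 4.]] is the outer product of [1, 2] with itself:
# it stores 4 numbers but carries only one independent direction.
u = torch.tensor([1., 2.])
outer = torch.outer(u, u)
print(outer)
print(torch.equal(outer, rank_1))  # True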
LoRA
Let’s consider a simple case of matrix multiplication.
W = torch.randn(512, 256)
x = torch.randn(2, 256)
y = x @ W.T

print(f"Input shape: {x.shape}")
print(f"Weight shape: {W.shape}")
print(f"Output shape: {y.shape}")
Input shape: torch.Size([2, 256])
Weight shape: torch.Size([512, 256])
Output shape: torch.Size([2, 512])
Low-rank decomposition means breaking down a large matrix into the product of two smaller matrices. Think of it this way:
1. Original matrix: 512×256 (rank could be up to 256)
2. Low-rank decomposition: A (512×8) @ B (8×256)
The “rank” is the inner dimension (8 in our case).
rank = 8
A = torch.randn(512, rank)
B = torch.randn(rank, 256)
W_decomposed = A @ B

print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"W_decomposed shape: {W_decomposed.shape}")
print(f"Parameters in original W: {W.numel()}")
print(f"Parameters in A + B: {A.numel() + B.numel()}")
A shape: torch.Size([512, 8])
B shape: torch.Size([8, 256])
W_decomposed shape: torch.Size([512, 256])
Parameters in original W: 131072
Parameters in A + B: 6144
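To see where these numbers come from: the full matrix stores 512 × 256 = 131,072 values, while the two factors store 512 × 8 + 8 × 256 = 4,096 + 2,048 = 6,144, roughly a 21× reduction.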
Now we need to understand how LoRA uses this decomposition. In LoRA, we don’t replace the original weight W. Instead, we ADD the low-rank adaptation to it.
Mathematically: y_lora = x @ (W + W_decomposed).T = x @ W.T + x @ W_decomposed.T = x @ W.T + (x @ B.T) @ A.T

y_lora = x @ (W + W_decomposed).T
assert y.shape == y_lora.shape

y_efficient = x @ W.T + (x @ B.T) @ A.T
assert y_efficient.shape == y.shape
assert torch.allclose(y_lora, y_efficient)
So, using matrix decomposition, we can represent a high-dimensional matrix with a pair of much smaller matrices, which is what makes efficient fine-tuning possible. Because the low-rank factors take far less memory, we can fine-tune a bigger model on a smaller GPU, and we can keep many such adapters for different tasks. There is a drawback, though: we have additional parameters to keep track of, and the cost of the forward pass increases slightly.
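The forward-pass overhead also disappears once training is finished: the low-rank update can be folded back into the original weight. A minimal sketch, reusing W, A, B, and x from above:

# Merge the adapter into the frozen weight so inference is a single matmul again.
W_merged = W + A @ B                      # same shape as W: (512, 256)
y_merged = x @ W_merged.T
assert torch.allclose(y_merged, x @ W.T + (x @ B.T) @ A.T, atol=1e-4)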
LoRA using PyTorch’s nn.Module
Our aim is the following:
1. Train a simple neural network on MNIST digits 3 and 4 for binary classification
2. Freeze the original model after training
3. Add LoRA adapters to the same model
4. Fine-tune only the LoRA parameters on digits 7 and 8
5. Compare performance, showing that LoRA can adapt the model to new tasks without changing the original weights
This demonstrates LoRA’s key benefit: we can reuse a trained model for new tasks by only training a small number of additional parameters, while keeping the original model intact.
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader, Subset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
mnist_train = torchvision.datasets.MNIST('./data', train=True, download=True, transform=transform)
mnist_test = torchvision.datasets.MNIST('./data', train=False, transform=transform)
len(mnist_train), len(mnist_test)
(60000, 10000)
Filtering datasets
def filter_classes(typ, classes):
    dataset = mnist_train if typ == 'train' else mnist_test
    indices = []
    for i, (_, label) in enumerate(dataset):
        if label in classes:
            indices.append(i)
    # Create new dataset with remapped labels
    remapped_data = []
    for i in indices:
        x, y = dataset[i]
        new_y = 0 if y == classes[0] else 1
        remapped_data.append((x, new_y))
    return remapped_data
train_34 = filter_classes('train', [3, 4])
test_34 = filter_classes('test', [3, 4])
len(train_34), len(test_34)
(11973, 1992)
## dataloaders
dls1 = {
    'train' : DataLoader(train_34, batch_size=64, shuffle=True),
    'valid' : DataLoader(test_34, batch_size=64)
}
the base model
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model1 = Net()
Trainer loop
class Trainer:
    def __init__(self, dls, model, lr=0.01):
        self.dls = dls
        self.model = model
        self.optim = torch.optim.Adam(self.model.parameters(), lr=lr)

    def train(self, epochs=1):
        for epoch in range(epochs):
            # training loop
            self.model.train()
            train_loss = 0
            for batch_idx, (data, target) in enumerate(self.dls['train']):
                output = self.model(data)
                loss = F.cross_entropy(output, target)
                loss.backward()
                self.optim.step()
                self.optim.zero_grad()
                train_loss += loss.item()

            # Validation inside epoch loop
            self.model.eval()
            correct = 0
            with torch.no_grad():
                valid_loss = 0
                for data, target in self.dls['valid']:
                    output = self.model(data)
                    pred = output.argmax(dim=1)
                    correct += pred.eq(target).sum().item()
                    loss = F.cross_entropy(output, target)
                    valid_loss += loss.item()

            accuracy = 100. * correct / len(self.dls['valid'].dataset)
            print(f'Epoch {epoch+1}: Train Loss: {train_loss/len(self.dls["train"]):.4f}, Valid Loss: {valid_loss/len(self.dls["valid"]):.4f} Accuracy: {accuracy:.2f}%')
model1 = Net()
t = Trainer(dls=dls1, model=model1, lr=0.01)
t.train(1)
Epoch 1: Train Loss: 0.0355, Valid Loss: 0.0131 Accuracy: 99.75%
Visualization
import matplotlib.pyplot as plt

def visualize_predictions(model, dataloader, num_samples=8):
    model.eval()
    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    axes = axes.flatten()

    with torch.no_grad():
        data, targets = next(iter(dataloader))
        outputs = model(data)
        predictions = outputs.argmax(dim=1)

    for i in range(num_samples):
        img = data[i].squeeze()
        axes[i].imshow(img, cmap='gray')
        axes[i].set_title(f'Pred: {predictions[i]}, True: {targets[i]}')
        axes[i].axis('off')

    plt.tight_layout()
    plt.show()

visualize_predictions(model1, dls1['valid'])
We trained on digits 3 and 4 and got around 99% accuracy. Let's move on to using LoRA with model1 to classify 7 and 8.
LoRA classification of 7 and 8
train_78 = filter_classes('train', [7, 8])
test_78 = filter_classes('test', [7, 8])
len(train_78), len(test_78)
(12116, 2002)
dls2 = {
    'train' : DataLoader(train_78, batch_size=64, shuffle=True),
    'valid' : DataLoader(test_78, batch_size=64)
}

x, y = next(iter(dls2['train']))
print(f"Batch shape: {x.shape}, Labels shape: {y.shape}")
model1
Batch shape: torch.Size([64, 1, 28, 28]), Labels shape: torch.Size([64])
Net(
(fc1): Linear(in_features=784, out_features=64, bias=True)
(fc2): Linear(in_features=64, out_features=2, bias=True)
)
LoRA model for Linear layers
class LoRALinear(nn.Module):
    def __init__(self, original_layer, rank=4, alpha=1):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha

        # Freeze original layer
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # LoRA parameters
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        return self.original_layer(x) + (x @ self.lora_A.T) @ self.lora_B.T * (self.alpha / self.rank)
LoRA Weight Initialization: Looking at the two parameters defined above:
self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01) # Small random values
self.lora_B = nn.Parameter(torch.zeros(out_features, rank)) # Zeros!
The key insight is that at initialization, we want LoRA to have zero effect:
- lora_A starts with small random values
- lora_B starts with zeros
- So the LoRA update lora_B @ lora_A = zeros @ small_values = zeros

This means that initially original_output + 0 = original_output, which the quick check below verifies.
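Here is a minimal sanity check of that property, using the LoRALinear class defined above (the layer shape is just illustrative):

# At initialization the wrapped layer behaves exactly like the frozen original,
# because lora_B is all zeros and the low-rank update contributes nothing.
layer = nn.Linear(784, 64)
lora_layer = LoRALinear(layer, rank=4, alpha=1)
x_check = torch.randn(5, 784)
assert torch.allclose(lora_layer(x_check), layer(x_check))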
ResNet Connection Analogy
1. Original model = identity path (stable, proven features)
2. LoRA adaptation = residual path (learns what's missing)
3. Final output = identity + residual

If both A and B started random, the initial LoRA output would be random_A @ random_B = large random values. This would immediately distort the original model's good representations, forcing the optimizer to:
- First "undo" the random noise
- Then learn the actual adaptation

Alpha Parameter: controls the "strength" of the LoRA adaptation. In the forward pass:
return original_layer(x) + (x @ lora_A.T) @ lora_B.T * (alpha / rank)
The alpha/rank scaling serves two purposes:
- Scaling independence: if you change the rank, the adaptation strength stays roughly consistent
- Learning rate control: higher alpha = stronger LoRA influence
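For intuition, the factor applied to the low-rank update is simply alpha / rank, so at a fixed alpha a larger rank spreads the same overall strength across more directions. A quick illustration (the value pairs here are arbitrary):

# The update (x @ lora_A.T) @ lora_B.T is multiplied by alpha / rank.
for r, a in [(4, 1), (8, 1), (8, 8), (16, 16)]:
    print(f"rank={r}, alpha={a} -> scale = {a / r}")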
LoRA model
class NetLoRA(nn.Module):
    def __init__(self, original_model, rank=4, alpha=1):
        super().__init__()
        self.fc1 = LoRALinear(original_model.fc1, rank, alpha)
        self.fc2 = LoRALinear(original_model.fc2, rank, alpha)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model2 = NetLoRA(model1)
x, _ = next(iter(dls2['train']))
pred = model2(x)
pred.shape
torch.Size([64, 2])
# Check which parameters require gradients
for name, param in model2.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")
fc1.lora_A: requires_grad=True
fc1.lora_B: requires_grad=True
fc1.original_layer.weight: requires_grad=False
fc1.original_layer.bias: requires_grad=False
fc2.lora_A: requires_grad=True
fc2.lora_B: requires_grad=True
fc2.original_layer.weight: requires_grad=False
fc2.original_layer.bias: requires_grad=False
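Since only the LoRA matrices are trainable, we could also hand the optimizer just those parameters. It isn't strictly necessary here (PyTorch optimizers skip parameters that never receive a gradient), but it keeps the optimizer state small; a minimal sketch:

# Optional: construct the optimizer over the trainable (LoRA) parameters only.
trainable = [p for p in model2.parameters() if p.requires_grad]
optim = torch.optim.Adam(trainable, lr=0.01)
print(sum(p.numel() for p in trainable), "parameters handed to the optimizer")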
# Let's train the new LoRA model on dls2, our 7-and-8 dataset
t2 = Trainer(dls=dls2, model=model2, lr=0.01)
t2.train(5)
Epoch 1: Train Loss: 0.2800, Valid Loss: 0.0527 Accuracy: 98.15%
Epoch 2: Train Loss: 0.0249, Valid Loss: 0.0290 Accuracy: 98.85%
Epoch 3: Train Loss: 0.0190, Valid Loss: 0.0457 Accuracy: 98.35%
Epoch 4: Train Loss: 0.0192, Valid Loss: 0.0383 Accuracy: 98.40%
Epoch 5: Train Loss: 0.0150, Valid Loss: 0.0295 Accuracy: 98.90%
# Visualizing the model's predictions on the valid set
visualize_predictions(model2, dls2['valid'])
original_params = sum(p.numel() for p in model1.parameters())
lora_params = sum(p.numel() for p in model2.parameters() if p.requires_grad)
print(f"Original model: {original_params} parameters")
print(f"LoRA adapters: {lora_params} parameters")
print(f"Efficiency: {lora_params/original_params*100:.2f}% of original")
Original model: 50370 parameters
LoRA adapters: 3656 parameters
Efficiency: 7.26% of original
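These 3,656 trainable parameters break down per layer as rank × in_features + out_features × rank: fc1 contributes 4 × 784 + 64 × 4 = 3,392 and fc2 contributes 4 × 64 + 2 × 4 = 264, giving 3,656 in total.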
Saving LoRA weights and parameters
To save the LoRA model we just trained, we have to save the following:
1. The hyperparameters (rank, alpha) together with the LoRA weights
2. Only the parameters with requires_grad=True (the adapter weights)
3. This approach allows us to keep multiple LoRA adapters for different tasks

By doing the above we can store several LoRAs for a single base model, each trained on a different task, where each LoRA file contains both the weights and the hyperparameters.
# saving base model
base_model_pth = "main_model.pth"
torch.save(model1.state_dict(), base_model_pth)

!file {base_model_pth}
main_model.pth: Zip archive data, at least v0.0 to extract, compression method=store
# saving lora weights
rank, alpha = 4, 1.
lora_state = {
    'rank': rank,
    'alpha': alpha,
    'weights': {}
}
for name, param in model2.named_parameters():
    # keep only the LoRA parameters that were added to the base model
    if 'lora' in name and param.requires_grad:
        lora_state['weights'][name] = param.data

lora_model_pth = 'lora.pth'
torch.save(lora_state, lora_model_pth)

!file {lora_model_pth}
lora.pth: Zip archive data, at least v0.0 to extract, compression method=store
Loading the LoRA model back from file
# load base model
base_model = Net()
base_model.load_state_dict(torch.load(base_model_pth))
base_model
Net(
(fc1): Linear(in_features=784, out_features=64, bias=True)
(fc2): Linear(in_features=64, out_features=2, bias=True)
)
visualize_predictions(base_model, dls1['valid'])
# loading lora params
lora_data = torch.load(lora_model_pth)
rank = lora_data['rank']
alpha = lora_data['alpha']
rank, alpha
(4, 1.0)
# Create LoRA model from the base model
lora_model = NetLoRA(base_model, rank=rank, alpha=alpha)

# Then load the LoRA weights
for name, param in lora_model.named_parameters():
    if 'lora' in name and name in lora_data['weights']:
        param.data.copy_(lora_data['weights'][name])
# Test on the 7 & 8 dataset
visualize_predictions(lora_model, dls2['valid'])
Hyperparameter analysis
analysis of rank
def test_rank(rank, alpha=1, epochs=3):
    print(f"\n=== Testing Rank {rank=} and {alpha=} ===")
    # Create LoRA model with the given rank and alpha
    lora_model = NetLoRA(model1, rank=rank, alpha=alpha)

    # Count parameters
    lora_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
    original_params = sum(p.numel() for p in model1.parameters())
    efficiency = lora_params/original_params*100

    print(f"LoRA parameters: {lora_params}")
    print(f"Efficiency: {efficiency:.2f}% of original")

    # Train the adapter (accuracy is printed by the Trainer)
    trainer = Trainer(dls=dls2, model=lora_model, lr=0.01)
    trainer.train(epochs)
    return lora_params, efficiency
# Test different ranks
ranks_to_test = [2, 4, 8, 16]
results = []

for rank in ranks_to_test:
    params, eff = test_rank(rank)
    results.append((rank, params, eff))
=== Testing Rank rank=2 and alpha=1 ===
LoRA parameters: 1828
Efficiency: 3.63% of original
Epoch 1: Train Loss: 0.3000, Valid Loss: 0.0417 Accuracy: 98.55%
Epoch 2: Train Loss: 0.0242, Valid Loss: 0.0296 Accuracy: 99.00%
Epoch 3: Train Loss: 0.0212, Valid Loss: 0.0318 Accuracy: 98.80%
=== Testing Rank rank=4 and alpha=1 ===
LoRA parameters: 3656
Efficiency: 7.26% of original
Epoch 1: Train Loss: 0.2954, Valid Loss: 0.0395 Accuracy: 98.65%
Epoch 2: Train Loss: 0.0247, Valid Loss: 0.0344 Accuracy: 98.55%
Epoch 3: Train Loss: 0.0213, Valid Loss: 0.0321 Accuracy: 98.85%
=== Testing Rank rank=8 and alpha=1 ===
LoRA parameters: 7312
Efficiency: 14.52% of original
Epoch 1: Train Loss: 0.2932, Valid Loss: 0.0576 Accuracy: 98.15%
Epoch 2: Train Loss: 0.0240, Valid Loss: 0.0306 Accuracy: 98.80%
Epoch 3: Train Loss: 0.0201, Valid Loss: 0.0273 Accuracy: 98.75%
=== Testing Rank rank=16 and alpha=1 ===
LoRA parameters: 14624
Efficiency: 29.03% of original
Epoch 1: Train Loss: 0.2780, Valid Loss: 0.0365 Accuracy: 98.50%
Epoch 2: Train Loss: 0.0271, Valid Loss: 0.0301 Accuracy: 98.65%
Epoch 3: Train Loss: 0.0197, Valid Loss: 0.0251 Accuracy: 99.05%
analysis of alpha
alphas_to_test = [4, 2, 1, 0.5]
for alpha in alphas_to_test:
    params, eff = test_rank(rank=4, alpha=alpha)
    results.append((4, params, eff))  # rank is fixed at 4 in this sweep
=== Testing Rank rank=4 and alpha=4 ===
LoRA parameters: 3656
Efficiency: 7.26% of original
Epoch 1: Train Loss: 0.1721, Valid Loss: 0.0368 Accuracy: 98.45%
Epoch 2: Train Loss: 0.0240, Valid Loss: 0.0269 Accuracy: 98.75%
Epoch 3: Train Loss: 0.0191, Valid Loss: 0.0328 Accuracy: 98.95%
=== Testing Rank rank=4 and alpha=2 ===
LoRA parameters: 3656
Efficiency: 7.26% of original
Epoch 1: Train Loss: 0.2189, Valid Loss: 0.0422 Accuracy: 98.65%
Epoch 2: Train Loss: 0.0244, Valid Loss: 0.0237 Accuracy: 99.05%
Epoch 3: Train Loss: 0.0206, Valid Loss: 0.0277 Accuracy: 98.70%
=== Testing Rank rank=4 and alpha=1 ===
LoRA parameters: 3656
Efficiency: 7.26% of original
Epoch 1: Train Loss: 0.2877, Valid Loss: 0.0422 Accuracy: 98.50%
Epoch 2: Train Loss: 0.0235, Valid Loss: 0.0306 Accuracy: 98.75%
Epoch 3: Train Loss: 0.0209, Valid Loss: 0.0376 Accuracy: 98.55%
=== Testing Rank rank=4 and alpha=0.5 ===
LoRA parameters: 3656
Efficiency: 7.26% of original
Epoch 1: Train Loss: 0.3995, Valid Loss: 0.0719 Accuracy: 97.65%
Epoch 2: Train Loss: 0.0307, Valid Loss: 0.0339 Accuracy: 98.65%
Epoch 3: Train Loss: 0.0230, Valid Loss: 0.0277 Accuracy: 99.00%
Rank Selection:
- Start small: Begin with rank=4 or 8 for most tasks
- Rule of thumb: Higher rank = more expressiveness but more parameters
- Task complexity matters:
- Simple tasks (like our digit classification): rank=4-8
- Complex tasks (large language models): rank=16-64
- Diminishing returns: Performance often plateaus after a certain rank
Alpha Selection:
- Common values: 1, 8, 16, 32 (often powers of 2)
- Higher alpha: Stronger LoRA influence, faster adaptation
- Lower alpha: More conservative, slower learning
- Starting point: Try alpha = rank (so alpha=8 for rank=8)
Practical approach:
- Fix alpha=1, try ranks [4, 8, 16]
- Pick best performing rank
- Then tune alpha [0.1, 1, 8, 16] with that rank (a sketch of this loop follows below)
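This two-stage search is easy to script with the test_rank helper from above. A sketch under one assumption: test_rank prints accuracy rather than returning it, so the "pick the best rank" step is done by reading the logs, and best_rank below is a hypothetical choice.

# Stage 1: fix alpha=1 and sweep the rank.
for r in [4, 8, 16]:
    test_rank(rank=r, alpha=1)

# Stage 2: keep the best rank from the logs (hypothetical value here) and sweep alpha.
best_rank = 8
for a in [0.1, 1, 8, 16]:
    test_rank(rank=best_rank, alpha=a)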
Conclusion
Through this hands-on exploration, we’ve demonstrated LoRA’s core value proposition: achieving strong performance on new tasks while using only a fraction of the original model’s parameters.
Key takeaways:
- LoRA adapters used only 7.26% of the original parameters yet reached about 99% accuracy (98.90% after five epochs) on a completely different classification task
- The original model weights remain frozen and unchanged, preventing catastrophic forgetting
- Multiple task-specific LoRA adapters can be saved and swapped for the same base model

Why LoRA matters:
- Memory efficient: Fine-tune large models on consumer GPUs
- Storage efficient: Store multiple task adapters instead of full model copies
- Modular: Easy to experiment with different tasks without retraining from scratch

This simple MNIST example scales to modern LLMs, where LoRA enables fine-tuning billion-parameter models with minimal computational resources, making personalized AI more accessible.