Self-Distillation Without Labels
Standard vision model training requires millions of labeled images. DINOv2 from Meta demonstrates that self-supervised learning, in which the model's own predictions serve as the supervision signal, can produce features that surpass supervised training on downstream tasks.
DINOv2 uses self-distillation: a student network learns to match the outputs of a teacher network whose weights are an exponential moving average (EMA) of the student's weights. Both networks see different augmented views of the same image, and the student must predict the teacher's output for the global view. No labels are needed.
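The two ingredients described above, an EMA teacher and a loss that pulls the student's output distribution toward the teacher's, can be sketched in plain Python. This is a simplified illustration rather than Meta's training code: the momentum of 0.99, the two temperatures, and the helper names `ema_update` and `distillation_loss` are assumptions chosen for the example, and the full recipe includes further components (such as teacher output centering and multi-crop augmentation) that are left out here.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student output distributions."""
    teacher_probs = softmax(teacher_logits, teacher_temp)  # treated as a fixed target
    student_probs = softmax(student_logits, student_temp)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def ema_update(teacher_weights, student_weights, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]

# Toy weights (assumed values): the teacher drifts slowly toward the student.
teacher = [0.0, 1.0]
student = [1.0, 3.0]
teacher = ema_update(teacher, student)

# The loss is small when the student agrees with the teacher, large otherwise.
matched = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatched = distillation_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0])
```

In a real training loop only the student receives gradients; the teacher changes solely through the EMA step, which is what keeps its targets stable.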
LVD-142M Dataset
The quality of self-supervised learning depends heavily on training data diversity. DINOv2's LVD-142M (Large-scale Visual Deduplicated) dataset was curated through:
- Starting with a large uncurated web image collection
- Self-supervised retrieval to find images similar to curated reference datasets
- Deduplication to remove near-duplicate images
This produces 142 million diverse, high-quality images without manual annotation, over 100 times more than ImageNet's 1.2 million.
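The retrieval and deduplication steps listed above can be illustrated with a toy filter over image embeddings. This is a sketch under stated assumptions, not the actual LVD-142M pipeline: the function name `curate`, both cosine-similarity thresholds, the two-dimensional embeddings, and the per-item ordering of the two checks are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def curate(candidates, references, retrieve_threshold=0.8, duplicate_threshold=0.98):
    """Keep candidates that resemble a reference, skipping near-duplicates of kept items."""
    kept = []
    for name, embedding in candidates:
        # Retrieval: the image must be close to at least one curated reference.
        if max(cosine(embedding, ref) for ref in references) < retrieve_threshold:
            continue
        # Deduplication: drop images nearly identical to one already kept.
        if any(cosine(embedding, other) >= duplicate_threshold for _, other in kept):
            continue
        kept.append((name, embedding))
    return [name for name, _ in kept]

# Toy data (assumed): two reference embeddings and four web candidates.
references = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("cat_a", [0.9, 0.1]),
    ("cat_a_copy", [0.9, 0.1001]),  # near-duplicate of cat_a
    ("noise", [-1.0, -1.0]),        # unlike any reference
    ("dog", [0.2, 0.95]),
]
selected = curate(candidates, references)
```

At web scale the pairwise comparisons would be replaced by an approximate nearest-neighbor index, but the filtering logic stays the same.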
Using DINOv2 as a Frozen Feature Extractor
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
model = AutoModel.from_pretrained("facebook/dinov2-large")
model.eval()
image = Image.open("photo.jpg").convert("RGB")  # ensure a 3-channel input
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# CLS token: global image representation
cls_features = outputs.last_hidden_state[:, 0, :] # [1, 1024]
# Patch tokens: spatial features for dense tasks
patch_features = outputs.last_hidden_state[:, 1:, :] # [1, num_patches, 1024]
print(f"Global feature shape: {cls_features.shape}")
print(f"Patch feature shape: {patch_features.shape}")
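For dense tasks the flat sequence of patch tokens usually has to be mapped back onto the image grid. The sketch below shows the arithmetic, assuming the 14-pixel patch size implied by the "/14" backbones, a 224x224 input crop (an assumption about the preprocessing), and row-major token order; the helper names `patch_grid` and `token_to_position` are hypothetical.

```python
def patch_grid(height, width, patch_size=14):
    """Return (rows, cols) of the patch grid for an image whose sides divide evenly."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch_size, width // patch_size

def token_to_position(index, cols):
    """Map a flat patch-token index to its (row, col) cell, assuming row-major order."""
    return divmod(index, cols)

rows, cols = patch_grid(224, 224)  # assumed 224x224 crop
num_patches = rows * cols          # number of patch tokens after the CLS token
row, col = token_to_position(17, cols)
```

With these values, a patch tensor of shape `[1, num_patches, 1024]` could be viewed as a spatial feature map with `patch_features.reshape(1, rows, cols, -1)`.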
ViT Backbone Variants
| Variant | Parameters | Dim | Speed |
|---|---|---|---|
| ViT-S/14 | 21M | 384 | Fastest |
| ViT-B/14 | 86M | 768 | Fast |
| ViT-L/14 | 307M | 1024 | Moderate |
| ViT-g/14 | 1.1B | 1536 | Slow |
DINOv2-large (ViT-L/14), available on HuggingFace as facebook/dinov2-large, is the standard balance point between quality and cost. ViT-g provides only marginal improvements for most use cases.
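One way to turn the variant table into a decision is to pick the largest backbone that fits a parameter budget. The helper below is a hypothetical sketch: `pick_variant` is not part of any library, and while the base and large checkpoint names appear in this article, the small and giant names are assumed to follow the same Hugging Face Hub pattern and should be checked before use.

```python
VARIANTS = [
    # (Hugging Face checkpoint, parameters in millions, embedding dimension)
    ("facebook/dinov2-small", 21, 384),    # name assumed, not from this article
    ("facebook/dinov2-base", 86, 768),
    ("facebook/dinov2-large", 307, 1024),
    ("facebook/dinov2-giant", 1100, 1536),  # name assumed, not from this article
]

def pick_variant(max_params_millions):
    """Return the largest variant whose parameter count fits the budget."""
    fitting = [v for v in VARIANTS if v[1] <= max_params_millions]
    if not fitting:
        raise ValueError("no variant fits the given parameter budget")
    return max(fitting, key=lambda v: v[1])

checkpoint, params, dim = pick_variant(400)
```

The returned `dim` is the input size a linear head on top of that backbone would need.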
Fine-Tuning for Custom Vision Tasks
DINOv2 frozen features with a linear head achieve 86.5% top-1 on ImageNet (the figure reported for the largest ViT-g/14 backbone), which is competitive with many fully supervised models. For custom classification:
import torch.nn as nn
from transformers import AutoModel
class DINOv2Classifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("facebook/dinov2-base")
        self.classifier = nn.Linear(768, num_classes)
        # Freeze backbone for fast training
        for param in self.backbone.parameters():
            param.requires_grad = False

    def forward(self, pixel_values):
        outputs = self.backbone(pixel_values=pixel_values)
        cls = outputs.last_hidden_state[:, 0]
        return self.classifier(cls)
Training only the linear head typically converges in minutes and can work well with as few as 50 labeled examples per class, because DINOv2's features generalize to new visual domains without extensive fine-tuning.
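Because the backbone is frozen, its CLS features can be computed once and the linear head trained on the cached vectors. The sketch below illustrates that loop under loud assumptions: random 768-dimensional vectors stand in for real DINOv2 features, and the function name `train_linear_head`, the AdamW optimizer, the learning rate, and the epoch count are illustrative choices rather than settings from the DINOv2 authors.

```python
import torch
import torch.nn as nn

def train_linear_head(features, labels, num_classes, epochs=50, lr=1e-2):
    """Fit a linear classifier on pre-computed (frozen) features with full-batch steps."""
    head = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(head(features), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return head, losses

torch.manual_seed(0)

# Stand-in data (assumed): 3 classes, 50 examples each, clustered around
# random centers. Real usage would cache CLS features from the backbone.
num_classes, per_class, dim = 3, 50, 768
centers = torch.randn(num_classes, dim)
labels = torch.arange(num_classes).repeat_interleave(per_class)
features = centers[labels] + 0.1 * torch.randn(len(labels), dim)

head, losses = train_linear_head(features, labels, num_classes)
with torch.no_grad():
    accuracy = (head(features).argmax(dim=1) == labels).float().mean().item()
```

Only the 768 x `num_classes` weights and the biases are optimized here, which is why this kind of training is cheap compared with updating the backbone.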
The official DINOv2 GitHub repository includes depth estimation and semantic segmentation examples that use DINOv2 patch features directly.