DINOv2: Meta's Self-Supervised Vision Features That Beat Supervised Models

DINOv2 learns visual features from 142 million curated images without labels, producing representations that outperform supervised ImageNet models as frozen feature extractors across classification, segmentation, and depth tasks.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 2, 2026

7 min read

// tags

#dinov2 #meta #self-supervised #vision #features


Using DINOv2 as a Frozen Feature Extractor

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image

# Load the preprocessing pipeline and the pretrained backbone
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
model = AutoModel.from_pretrained("facebook/dinov2-large")
model.eval()  # inference mode: the backbone stays frozen

# Convert to RGB so grayscale or RGBA files do not break preprocessing
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# No gradients are needed for feature extraction
with torch.no_grad():
    outputs = model(**inputs)

# CLS token (index 0): global image representation
cls_features = outputs.last_hidden_state[:, 0, :]  # [1, 1024]

# Patch tokens (index 1 onward): spatial features for dense tasks
patch_features = outputs.last_hidden_state[:, 1:, :]  # [1, num_patches, 1024]

print(f"Global feature shape: {cls_features.shape}")
print(f"Patch feature shape: {patch_features.shape}")
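One common use of the global CLS feature is image retrieval: embed every image once, then rank candidates by cosine similarity to a query. The following is a minimal sketch of that idea, not part of the DINOv2 API. The function name `rank_by_cosine_similarity` is hypothetical, and the tiny 2-dimensional tensors are toy stand-ins for real 1024-dimensional CLS features.

```python
import torch
import torch.nn.functional as F


def rank_by_cosine_similarity(query: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted from most to least similar to the query.

    query:   [dim] or [1, dim] feature of the query image
    gallery: [num_images, dim] features of the candidate images
    """
    query = F.normalize(query.reshape(1, -1), dim=-1)  # unit-length query
    gallery = F.normalize(gallery, dim=-1)             # unit-length rows
    similarities = (gallery @ query.T).squeeze(1)      # cosine similarity, [num_images]
    return torch.argsort(similarities, descending=True)


# Toy stand-ins for CLS features (hypothetical values, not DINOv2 outputs)
gallery = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = torch.tensor([0.9, 0.1])
ranking = rank_by_cosine_similarity(query, gallery)
print(ranking.tolist())
```

With real features, `gallery` would be the stacked `cls_features` tensors of the indexed images and `query` the `cls_features` of a new image.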

ViT Backbone Variants

Variant	Parameters	Embedding dim	Relative speed
ViT-S/14	21M	384	Fastest
ViT-B/14	86M	768	Fast
ViT-L/14	307M	1024	Moderate
ViT-g/14	1.1B	1536	Slow

DINOv2-large (facebook/dinov2-large on HuggingFace) is the standard balance point between feature quality and speed. ViT-g provides only marginal improvements for most use cases.
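To make the trade-off in the table concrete, the sketch below estimates how many tokens each backbone produces and how much memory the resulting features take per image. It rests on several assumptions: a 224x224 input after preprocessing, float32 features, and checkpoint names for the small and giant variants that follow the same `facebook/dinov2-*` pattern as the base and large checkpoints used in this article. The helper names are illustrative only.

```python
PATCH_SIZE = 14  # the "/14" in every variant name

# Embedding dims come from the table above; the "small" and "giant"
# checkpoint names are assumed to follow the same naming pattern.
VARIANTS = {
    "ViT-S/14": {"checkpoint": "facebook/dinov2-small", "dim": 384},
    "ViT-B/14": {"checkpoint": "facebook/dinov2-base", "dim": 768},
    "ViT-L/14": {"checkpoint": "facebook/dinov2-large", "dim": 1024},
    "ViT-g/14": {"checkpoint": "facebook/dinov2-giant", "dim": 1536},
}


def num_patch_tokens(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for an image whose sides are multiples of patch_size."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


def feature_bytes_per_image(variant: str, height: int = 224, width: int = 224,
                            bytes_per_value: int = 4) -> int:
    """Bytes needed to store the CLS token plus all patch tokens for one image."""
    tokens = num_patch_tokens(height, width) + 1  # +1 for the CLS token
    return tokens * VARIANTS[variant]["dim"] * bytes_per_value


for name in VARIANTS:
    print(name, num_patch_tokens(224, 224), feature_bytes_per_image(name))
```

Under these assumptions a 224x224 image yields 256 patch tokens for every variant, so the storage cost of cached features scales only with the embedding dimension.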

Fine-Tuning for Custom Vision Tasks

DINOv2 frozen features with a linear head achieve 86.5% top-1 on ImageNet (the figure reported for the largest backbone, ViT-g/14), which is competitive with many fully supervised models. For custom classification:

import torch
import torch.nn as nn
from transformers import AutoModel

class DINOv2Classifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("facebook/dinov2-base")
        # Read the embedding size from the config (768 for dinov2-base)
        self.classifier = nn.Linear(self.backbone.config.hidden_size, num_classes)
        # Freeze backbone for fast training: only the linear head is updated
        for param in self.backbone.parameters():
            param.requires_grad = False

    def forward(self, pixel_values):
        # Skip gradient tracking through the frozen backbone
        with torch.no_grad():
            outputs = self.backbone(pixel_values=pixel_values)
        cls = outputs.last_hidden_state[:, 0]  # CLS token: [batch, hidden_size]
        return self.classifier(cls)

Training only the linear head typically converges in minutes and works well with as few as 50 labeled examples per class, because DINOv2's features generalize to new visual domains without extensive fine-tuning.
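Because the backbone is frozen, its features can be computed once and cached, and the linear head can then be trained on the cached tensors alone. The sketch below shows such a loop under stated assumptions: the function `train_linear_head` is hypothetical, the optimizer settings are illustrative rather than tuned, and the features are synthetic stand-ins (two well-separated classes, 50 examples each) instead of real DINOv2 outputs.

```python
import torch
import torch.nn as nn


def train_linear_head(features, labels, num_classes, epochs=100, lr=0.1):
    """Fit a linear classifier on precomputed (frozen) features with full-batch SGD."""
    head = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(head(features), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return head, losses


# Synthetic stand-in for cached CLS features (not real DINOv2 outputs):
# 2 classes, 50 examples per class, 768 dimensions as in dinov2-base.
torch.manual_seed(0)
dim = 768
class_centers = torch.randn(2, dim)
labels = torch.arange(2).repeat_interleave(50)
features = class_centers[labels] + 0.1 * torch.randn(100, dim)

head, losses = train_linear_head(features, labels, num_classes=2)
with torch.no_grad():
    accuracy = (head(features).argmax(dim=1) == labels).float().mean().item()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, train accuracy {accuracy:.2f}")
```

With real data, `features` would hold the CLS vectors of the labeled images, and accuracy should be measured on a held-out split rather than on the training set.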

The GitHub repository includes depth estimation and semantic segmentation examples using DINOv2 patch features directly.
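Dense heads such as segmentation or depth decoders usually expect a 2D feature map, not a flat token sequence. A minimal sketch of that reshaping step follows. It assumes patch tokens are ordered row by row (the usual ViT convention), that `height` and `width` refer to the preprocessed input rather than the original photo, and that a random tensor can stand in for real patch features. The helper name `patch_tokens_to_grid` is illustrative and is not taken from the repository.

```python
import torch


def patch_tokens_to_grid(patch_features: torch.Tensor, height: int, width: int,
                         patch_size: int = 14) -> torch.Tensor:
    """Reshape [batch, num_patches, dim] patch tokens into [batch, dim, rows, cols]."""
    batch, num_patches, dim = patch_features.shape
    rows, cols = height // patch_size, width // patch_size
    if rows * cols != num_patches:
        raise ValueError(
            f"expected {rows * cols} patches for a {height}x{width} input, got {num_patches}"
        )
    grid = patch_features.reshape(batch, rows, cols, dim)  # row-major token order
    return grid.permute(0, 3, 1, 2).contiguous()           # channels-first feature map


# Random stand-in for the patch tokens of one 224x224 input (ViT-L/14: 256 x 1024)
patch_features = torch.randn(1, 256, 1024)
feature_map = patch_tokens_to_grid(patch_features, height=224, width=224)
print(feature_map.shape)
```

The resulting map can be fed to a convolutional head or upsampled with `torch.nn.functional.interpolate` to the input resolution.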
