Learning From Natural Language Supervision
CLIP (Contrastive Language-Image Pre-training) was trained on 400 million image-text pairs from the internet. Rather than predicting explicit labels, it learns to match images to their correct text descriptions using a contrastive loss: correct pairs are pulled together in embedding space, while incorrect pairs are pushed apart.
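The resulting representations are general-purpose: any visual concept you can describe in English can serve as a classification target, even if CLIP never saw a labeled example of it during training. Accuracy still depends on how well the concept is covered by the image-text pairs CLIP was trained on.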
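The contrastive objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's training code: it assumes a batch in which row i of the image embeddings matches row i of the text embeddings, and it uses a fixed temperature of 0.07, whereas CLIP learns its temperature during training. The function name `clip_contrastive_loss` is hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Assumes row i of image_emb and row i of text_emb form a matching pair.
    The temperature is a fixed argument here (an assumption of this sketch).
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(logits.shape[0])

    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: the same, taken column-wise.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Synthetic embeddings (random data, not real model output).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(text + 0.01 * rng.normal(size=(4, 8)), text)
shuffled = clip_contrastive_loss(text[::-1], text)
print(f"aligned pairs:  {aligned:.4f}")
print(f"shuffled pairs: {shuffled:.4f}")
```

The loss is low when matching pairs sit on the diagonal of the similarity matrix and high when the pairing is scrambled, which is the "pull together, push apart" behavior in numeric form.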
Zero-Shot Classification
from PIL import Image
import torch
import open_clip

# Load the OpenAI ViT-L/14 weights with the matching preprocessing and tokenizer
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

# Preprocess one image and add a batch dimension
image = preprocess(Image.open("product.jpg")).unsqueeze(0)

# Each candidate class is written as a natural-language prompt
candidate_labels = [
    "a photo of a running shoe",
    "a photo of a dress shirt",
    "a photo of a laptop bag",
    "a photo of a wristwatch",
]
text_tokens = tokenizer(candidate_labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # L2-normalize so the dot product is cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scale the similarities, then softmax over the candidate labels
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, score in zip(candidate_labels, similarity[0]):
    print(f"{label}: {score.item():.2%}")
ImageNet Zero-Shot Results
CLIP's best model, ViT-L/14 at 336-pixel input resolution, achieves 76.2% top-1 accuracy on ImageNet zero-shot, that is, without training on any of ImageNet's labeled examples. This matches the original ResNet-50 trained with full supervision on ImageNet, demonstrating that natural language supervision at scale can match specialist supervised training.
Semantic Image Search
CLIP embeddings enable semantic image search without manual tagging:
import numpy as np
import torch
from PIL import Image

# Precompute and store image embeddings for your catalog
def embed_images(image_paths, model, preprocess):
    embeddings = []
    for path in image_paths:
        img = preprocess(Image.open(path)).unsqueeze(0)
        with torch.no_grad():
            emb = model.encode_image(img)
            # L2-normalize so dot products are cosine similarities
            emb /= emb.norm(dim=-1, keepdim=True)
        embeddings.append(emb.cpu().numpy())
    return np.vstack(embeddings)  # shape (num_images, embedding_dim)

# Query with natural language
def search(query_text, image_embeddings, model, tokenizer, top_k=5):
    tokens = tokenizer([query_text])
    with torch.no_grad():
        text_emb = model.encode_text(tokens)
        text_emb /= text_emb.norm(dim=-1, keepdim=True)
    # One similarity score per catalog image
    scores = (image_embeddings @ text_emb.cpu().numpy().T)[:, 0]
    # Indices of the top_k best-matching images, best first
    return np.argsort(scores)[::-1][:top_k]
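The ranking step in `search` can be checked in isolation, without loading a model, by running it on synthetic unit vectors. The sketch below is an illustration under stated assumptions: the helper name `top_k_by_cosine` is hypothetical, the embeddings are random stand-ins for CLIP output, and `np.argpartition` replaces the full sort so that only the top candidates are ordered.

```python
import numpy as np

def top_k_by_cosine(query_emb, image_embeddings, top_k=5):
    """Return indices of the top_k catalog rows most similar to query_emb.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = image_embeddings @ query_emb  # shape (num_images,)
    top_k = min(top_k, scores.shape[0])
    # argpartition finds the top_k candidates without sorting the whole catalog.
    candidates = np.argpartition(-scores, top_k - 1)[:top_k]
    # Sort only those candidates, best first.
    return candidates[np.argsort(-scores[candidates])]

# Synthetic stand-ins for CLIP embeddings (random data, not real model output).
rng = np.random.default_rng(0)
catalog = rng.normal(size=(1000, 512))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)

# A query built to lie close to catalog item 42.
query = catalog[42] + 0.01 * rng.normal(size=512)
query /= np.linalg.norm(query)

results = top_k_by_cosine(query, catalog, top_k=5)
print(results)
```

For catalogs that fit in memory this brute-force matrix product is often enough; an approximate nearest-neighbor index becomes worthwhile only when the catalog grows large enough that the product itself is the bottleneck.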
CLIP vs ALIGN vs SigLIP
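CLIP (OpenAI): softmax-based contrastive loss, 400M pairs, ResNet and ViT image encoders up to ViT-L/14.
ALIGN (Google): the same dual-encoder contrastive approach, trained on 1.8B noisier image and alt-text pairs with minimal filtering; it showed that dataset scale can compensate for noisy supervision.
SigLIP (Google): pairwise sigmoid loss instead of softmax. Each image-text pair is scored independently, so the loss does not need the batch-wide softmax normalization that CLIP's loss requires (this is unrelated to BatchNorm layers). That decouples the loss from batch size, which makes very large batches cheaper to compute and holds up better at small batch sizes.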
For new projects, SigLIP often outperforms CLIP at comparable model sizes on published benchmarks. For existing integrations, CLIP's mature ecosystem (OpenCLIP checkpoints and widespread use with vector stores such as FAISS and Pinecone) provides practical advantages.
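The difference between the softmax and sigmoid losses is easier to see side by side. The NumPy sketch below is a simplified illustration, not either paper's implementation: both functions take a precomputed N x N logit matrix whose diagonal holds the matching pairs, the function names are hypothetical, and the bias is a plain argument here rather than a learned parameter.

```python
import numpy as np

def softmax_contrastive_loss(logits):
    """CLIP-style loss: every row and column is normalized over the whole batch."""
    diag = np.arange(logits.shape[0])

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    loss_rows = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_cols = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_rows + loss_cols) / 2

def sigmoid_pairwise_loss(logits, bias=0.0):
    """SigLIP-style loss: each (image, text) pair is an independent binary problem."""
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0  # +1 for matching pairs, -1 for all others
    # -log(sigmoid(z)) == log(1 + exp(-z)); logaddexp keeps this stable.
    per_pair = np.logaddexp(0.0, -labels * (logits + bias))
    return per_pair.sum() / n

# Synthetic logits (random data, not real model output).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4)) + 5.0 * np.eye(4)

print("softmax loss:", softmax_contrastive_loss(logits))
print("sigmoid loss:", sigmoid_pairwise_loss(logits))

# Because the sigmoid loss is a plain sum over pairs, it can be
# accumulated one block of columns at a time and gives the same total.
n = logits.shape[0]
labels = 2.0 * np.eye(n) - 1.0
chunked = sum(
    np.logaddexp(0.0, -labels[:, j:j + 2] * logits[:, j:j + 2]).sum()
    for j in range(0, n, 2)
) / n
print("sigmoid loss, chunked:", chunked)
```

The chunked sum matches the full computation, which is why the sigmoid loss can be evaluated piecewise across devices. The softmax loss cannot be split this way without first gathering every row's and column's normalizer.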
Use Cases
Common applications include content moderation (embed images, flag semantic similarity to known violating content), product catalog tagging (classify thousands of images into taxonomy nodes without labeled data), e-commerce recommendation (find visually and semantically similar products), and accessibility (make images searchable by text, or rank candidate alt-text descriptions against an image; CLIP scores text against images, it does not generate text).
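As one illustration, the content moderation case reduces to a nearest-reference check over embeddings. The sketch below makes several assumptions: the function name `flag_similar` is hypothetical, the embeddings are random stand-ins for CLIP output, and the 0.9 threshold is a placeholder that would need tuning on real data.

```python
import numpy as np

def flag_similar(candidate_embs, reference_embs, threshold=0.9):
    """Flag candidates whose best cosine similarity to any reference meets the threshold.

    Both arrays are assumed to hold L2-normalized embeddings, one per row.
    The default threshold is a placeholder, not a recommended value.
    """
    sims = candidate_embs @ reference_embs.T  # (num_candidates, num_references)
    best = sims.max(axis=1)
    return best >= threshold, best

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Synthetic stand-ins for CLIP embeddings (random data, not real model output).
rng = np.random.default_rng(0)
references = normalize(rng.normal(size=(3, 64)))

near_copy = normalize(references[0] + 0.01 * rng.normal(size=64))
unrelated = normalize(rng.normal(size=64))
candidates = np.stack([near_copy, unrelated])

flags, best_scores = flag_similar(candidates, references)
for flag, score in zip(flags, best_scores):
    print(f"flagged={flag} best_similarity={score:.3f}")
```

In practice a flag like this would usually route an image to human review rather than trigger an automatic decision, since embedding similarity captures semantic resemblance rather than policy violations as such.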