CLIP: Using OpenAI's Contrastive Model for Zero-Shot Image Classification

CLIP learns joint image-text embeddings from 400 million pairs, enabling zero-shot classification on any category you can describe in words - no labeled training data required.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 9, 2026

7 min read

// tags

#clip #openai #zero-shot #image-classification #contrastive-learning

ImageNet Zero-Shot Results

CLIP's largest model, ViT-L/14 (the variant fine-tuned at 336-pixel resolution), reaches 76.2% top-1 accuracy on ImageNet zero-shot, meaning it was never trained on ImageNet's labeled examples. That matches the ResNet-101 baseline in OpenAI's comparison, which was trained specifically on ImageNet, and shows that natural language supervision at scale can rival specialist supervised training on this benchmark.
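To make "zero-shot" concrete, here is a minimal sketch of the classification step itself, using NumPy and toy 3-dimensional vectors in place of real CLIP embeddings. The function name zero_shot_classify, the prompt strings, and the temperature of 100 are illustrative assumptions, not values taken from this article; in practice the embeddings would come from the model's image and text encoders.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names, temperature=100.0):
    """Pick the class whose text embedding is closest to the image embedding.

    temperature is an assumed logit scale; tune or read it from your model.
    """
    # L2-normalize so the dot product below is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    logits = temperature * (class_text_embs @ image_emb)
    # Numerically stable softmax over the candidate classes
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return class_names[int(np.argmax(probs))], probs

# Toy stand-ins for text embeddings of three prompts (hypothetical values)
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
class_text_embs = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
# Toy stand-in for one image embedding, closest to the "cat" direction
image_emb = np.array([0.1, 0.9, 0.2])

label, probs = zero_shot_classify(image_emb, class_text_embs, class_names)
print(label)
```

No class-specific training happens anywhere in this step: adding a category only requires embedding one more prompt.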

Semantic Image Search

CLIP embeddings enable semantic image search without manual tagging:

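import numpy as np
import torch
from PIL import Image

# Precompute and store image embeddings for your catalog
def embed_images(image_paths, model, preprocess):
    embeddings = []
    for path in image_paths:
        # preprocess returns a single image tensor; unsqueeze adds the batch dimension
        img = preprocess(Image.open(path)).unsqueeze(0)
        with torch.no_grad():
            emb = model.encode_image(img)
            # L2-normalize so a dot product equals cosine similarity
            emb /= emb.norm(dim=-1, keepdim=True)
        # Move to CPU before converting, in case the model runs on a GPU
        embeddings.append(emb.cpu().numpy())
    return np.vstack(embeddings)

# Query with natural language
def search(query_text, image_embeddings, model, tokenizer, top_k=5):
    tokens = tokenizer([query_text])
    with torch.no_grad():
        text_emb = model.encode_text(tokens)
        text_emb /= text_emb.norm(dim=-1, keepdim=True)
    # scores has shape (num_images, 1); take the single query column
    scores = image_embeddings @ text_emb.cpu().numpy().T
    # Indices of the top_k most similar images, best match first
    return np.argsort(scores[:, 0])[::-1][:top_k]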
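The search function above sorts every score in the catalog, which is fine for small collections. For larger ones, a partial sort avoids ordering scores that will never be returned. The following is a sketch under the assumption that both the query and the catalog embeddings are already L2-normalized; top_k_by_cosine is a hypothetical helper name, and the 2-dimensional vectors are toy data rather than real CLIP output.

```python
import numpy as np

def top_k_by_cosine(query_emb, image_embeddings, top_k=5):
    """Return indices of the top_k most similar rows, best match first."""
    scores = image_embeddings @ query_emb
    top_k = min(top_k, scores.shape[0])
    # argpartition selects the top_k candidates without a full sort
    candidates = np.argpartition(-scores, top_k - 1)[:top_k]
    # Only the selected candidates are then sorted by score
    return candidates[np.argsort(-scores[candidates])]

# Toy catalog of four unit-length "image embeddings" (hypothetical values)
catalog = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.6, 0.8],
                    [-1.0, 0.0]])
query = np.array([0.0, 1.0])

ranked = top_k_by_cosine(query, catalog, top_k=2)
print(ranked)
```

For catalogs too large to scan exhaustively, an approximate nearest-neighbor index is the usual next step; that is outside the scope of this sketch.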

CLIP vs ALIGN vs SigLIP

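CLIP (OpenAI): softmax-based contrastive loss, in which each image is scored against every text in the batch and vice versa; 400M image-text pairs; ResNet and ViT backbones up to ViT-L/14. ALIGN (Google): a similar dual-encoder approach trained on about 1.8B noisier image and alt-text pairs, showing that scale can compensate for minimal data filtering. SigLIP (Google): replaces the softmax with a pairwise sigmoid loss, so each image-text pair is scored independently and no normalization across the whole batch is needed. This decouples the loss from batch size, which simplifies scaling to very large batches and is reported to improve accuracy, particularly at smaller batch sizes.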
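The difference between the two losses is easier to see side by side. The sketch below implements both in NumPy on toy embeddings; it is an illustration of the general formulations, not either team's training code. The function names, the temperature values (100 for the softmax loss, 10 for the sigmoid loss), and the bias of -10 are assumptions chosen for illustration; in real training the temperature and bias are learned.

```python
import numpy as np

def log_softmax(x, axis):
    # Stable log-softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax_contrastive_loss(img, txt, temperature=100.0):
    """CLIP-style loss: each row and column is normalized over the whole batch."""
    logits = temperature * (img @ txt.T)
    idx = np.arange(len(img))
    # Matching pairs sit on the diagonal of the similarity matrix
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (image_to_text + text_to_image) / 2

def sigmoid_pairwise_loss(img, txt, temperature=10.0, bias=-10.0):
    """SigLIP-style loss: every pair is an independent binary decision."""
    logits = temperature * (img @ txt.T) + bias
    # +1 for matching pairs (diagonal), -1 for all other pairs
    labels = 2 * np.eye(len(img)) - 1
    # -log(sigmoid(z)) == log(1 + exp(-z)) == logaddexp(0, -z)
    return np.logaddexp(0, -labels * logits).sum() / len(img)

# Toy batch: four perfectly aligned pairs, then the same texts shifted by one
img = np.eye(4)
txt_aligned = np.eye(4)
txt_shuffled = np.roll(np.eye(4), 1, axis=0)

softmax_aligned = softmax_contrastive_loss(img, txt_aligned)
softmax_shuffled = softmax_contrastive_loss(img, txt_shuffled)
sigmoid_aligned = sigmoid_pairwise_loss(img, txt_aligned)
sigmoid_shuffled = sigmoid_pairwise_loss(img, txt_shuffled)
print(softmax_aligned < softmax_shuffled, sigmoid_aligned < sigmoid_shuffled)
```

Note that the softmax version needs the full similarity matrix to compute its row and column normalizers, while the sigmoid version sums independent per-pair terms, which is what makes it easier to split across devices.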

For new projects, SigLIP typically outperforms CLIP at equivalent model sizes. For existing integrations, CLIP's mature ecosystem (OpenCLIP checkpoints, and broad support in vector search tooling such as FAISS and Pinecone) provides practical advantages.

Use Cases

Common applications include content moderation (embed images and flag those semantically similar to known violating content), product catalog tagging (classify thousands of images into taxonomy nodes without labeled data), e-commerce recommendation (find visually and semantically similar products), and accessibility (rank candidate alt-text descriptions for an image and make images searchable by text; CLIP scores text against images rather than generating text itself).
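As one example, the content moderation case reduces to a nearest-reference check in embedding space. The sketch below assumes embeddings have already been computed; flag_similar is a hypothetical helper, the 2-dimensional vectors are toy data, and the 0.8 threshold is an arbitrary placeholder that would need tuning against labeled examples before any real use.

```python
import numpy as np

def flag_similar(candidate_embs, reference_embs, threshold=0.8):
    """Flag candidates whose best cosine similarity to any reference meets the threshold."""
    # Normalize both sets so the matrix product gives cosine similarities
    cand = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    ref = reference_embs / np.linalg.norm(reference_embs, axis=1, keepdims=True)
    similarities = cand @ ref.T
    # For each candidate, keep its closest match in the reference set
    best = similarities.max(axis=1)
    return best >= threshold, best

# Toy reference set (known violating content) and two new uploads
reference_embs = np.array([[1.0, 0.0]])
candidate_embs = np.array([[0.9, 0.1],
                           [0.0, 1.0]])

flags, best_scores = flag_similar(candidate_embs, reference_embs)
print(flags)
```

A flag from a check like this is best treated as a signal for human review rather than an automatic decision, since similarity scores say nothing about context.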
