SAM 2: Meta's Real-Time Video Segmentation Model

SAM 2 extends the original Segment Anything Model to video with a streaming memory bank, enabling zero-shot object tracking and segmentation across arbitrary video sequences.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 17, 2026

7 min read

Tags: #sam-2 #meta #segmentation #video #computer-vision

SAM 2 vs SAM 1: The Video Gap

The original Segment Anything Model (SAM 1) was image-only. You could apply it to video frame by frame, but with no temporal consistency: each frame started fresh, so the same object could get different masks from one frame to the next. Meta AI's SAM 2 closes this gap with a streaming memory bank that carries object state across frames.

Key architectural additions:

Memory encoder: Encodes frame features and masks into memory tokens
Memory bank: Stores recent frames and prompted frames as spatial memories
Memory attention: Cross-attends current frame features to memory bank before decoding

The result: prompt an object once at frame 0 (with a click or bounding box), and SAM 2 tracks and segments it through the rest of the video.
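To make the memory idea concrete, here is a toy sketch in plain Python. It is not SAM 2's implementation: the class name `ToyMemoryBank`, the queue size, and the flat feature vectors are all assumptions for illustration. Real memories are spatial feature maps and the real attention module is a stack of transformer layers. The sketch only shows the two ideas named above: prompted frames are always kept while recent frames rotate through a fixed-size queue, and the current frame reads from both through attention.

```python
from collections import deque
from math import exp


class ToyMemoryBank:
    """Keeps prompted frames permanently and recent frames in a FIFO queue."""

    def __init__(self, max_recent=6):  # queue size is an arbitrary choice here
        self.prompted = {}                      # frame_idx -> feature vector
        self.recent = deque(maxlen=max_recent)  # (frame_idx, feature vector)

    def add(self, frame_idx, feature, prompted=False):
        if prompted:
            self.prompted[frame_idx] = feature
        else:
            self.recent.append((frame_idx, feature))  # oldest entry drops out when full

    def memories(self):
        return list(self.prompted.values()) + [f for _, f in self.recent]


def attend(query, memories):
    """Single-query dot-product attention: softmax-weighted sum of memories."""
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memories]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * mem[i] for w, mem in zip(weights, memories)) / total
        for i in range(len(query))
    ]


bank = ToyMemoryBank(max_recent=2)
bank.add(0, [1.0, 0.0], prompted=True)   # the prompted frame is never evicted
for idx in range(1, 5):
    bank.add(idx, [0.0, float(idx)])     # only the two newest frames survive

conditioned = attend([1.0, 0.0], bank.memories())
print(len(bank.memories()), conditioned)
```

Keeping the prompted frame outside the queue is what lets the original click keep influencing the mask long after frame 0 has scrolled out of the recent window.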

Prompting Modes

SAM 2 inherits SAM 1's flexible prompting:

Points: Click to include (+) or exclude (-) regions
Bounding boxes: Coarse box → precise mask
Masks: Refine a previous mask
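Text: Not part of the released SAM 2 API. The SAM 1 paper explored CLIP-based text prompts only as a proof of concept; in practice, text-driven pipelines use a separate text-grounded detector to produce boxes for SAM 2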

For video, you prompt on one or more keyframes and the model propagates forward and backward in time.
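As a rough sketch of how keyframe prompts might be organized before they reach the predictor, the helper below turns include and exclude clicks into the parallel `points` and `labels` lists, with 1 for foreground and 0 for background. The function name `build_point_prompt`, the `keyframe_clicks` dictionary, and the coordinates are hypothetical and exist only for this example.

```python
def build_point_prompt(include, exclude=()):
    """Return (points, labels): label 1 for include clicks, 0 for exclude clicks."""
    points = [list(p) for p in include] + [list(p) for p in exclude]
    labels = [1] * len(include) + [0] * len(exclude)
    return points, labels


# Hypothetical keyframe prompts: frame index -> (include clicks, exclude clicks)
keyframe_clicks = {
    0: ([(350, 200)], []),
    40: ([(360, 210)], [(500, 100)]),  # correct a drift with one negative click
}

prompts = {
    idx: build_point_prompt(inc, exc) for idx, (inc, exc) in keyframe_clicks.items()
}
for frame_idx, (points, labels) in prompts.items():
    print(frame_idx, points, labels)
```

Each entry can then be handed to the predictor's point-prompt call for its frame index, using the same object ID so that the later click refines the existing track rather than starting a new one.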

SA-V Dataset

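Meta's SAM 2 was trained on SA-V, a video segmentation dataset released alongside the model, in addition to the SA-1B image dataset used for the original SAM. SA-V contains roughly 51,000 videos with about 643,000 masklet annotations, where a masklet is the mask of one object tracked across a video's frames. The SAM 2 paper describes this as 53x more masks than any prior video segmentation dataset. The footage covers a broad mix of indoor and outdoor scenes.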
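For a sense of annotation density, the two dataset figures quoted above give the average number of masklets per video. This is simple arithmetic on the rounded numbers in this article, not an official statistic.

```python
videos = 51_000      # rounded figure quoted in this article
masklets = 643_000   # rounded figure quoted in this article

masklets_per_video = masklets / videos
print(round(masklets_per_video, 1))  # 12.6
```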

Python API

import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names follow the SAM 2 repository; newer releases
# keep configs under configs/sam2/, so adjust both paths to your checkout.
predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml",
    "sam2_hiera_large.pt",
    device="cuda",
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Load the video from a directory of JPEG frames (00000.jpg, 00001.jpg, ...)
    state = predictor.init_state(video_path="./video_frames/")

    # Prompt the object at frame 0 with a single foreground click
    frame_idx, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[350, 200]], dtype=np.float32),  # (x, y) in pixels
        labels=np.array([1], dtype=np.int32),             # 1=foreground, 0=background
    )

    # Propagate the prompt through the rest of the video
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # threshold logits into boolean masks
        print(f"Frame {frame_idx}: {masks.sum()} pixels segmented")
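Downstream code usually needs more than a pixel count. The sketch below reduces one binary mask to its area and bounding box using only the standard library. The helper name `mask_area_and_bbox` and the toy mask are assumptions for illustration; with real output you would first pull a single 2D mask for one object out of the thresholded array, for example via `.tolist()`.

```python
def mask_area_and_bbox(mask):
    """Return (pixel_count, (x_min, y_min, x_max, y_max)); bbox is None if the mask is empty."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return 0, None
    return len(xs), (min(xs), min(ys), max(xs), max(ys))


# Toy 3x4 mask standing in for one object's mask on one frame
toy_mask = [
    [False, False, False, False],
    [False, True,  True,  False],
    [False, True,  False, False],
]
area, bbox = mask_area_and_bbox(toy_mask)
print(area, bbox)  # 3 (1, 1, 2, 2)
```

An empty mask (area 0) on a frame is a cheap signal that the track may have been lost and a new prompt is worth adding.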

Real-Time Performance

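SAM 2 processes frames at 1024x1024 resolution. On an A100 GPU, the SAM 2 paper reports about 30 FPS for Hiera-L and about 44 FPS for Hiera-B+. The repository's later benchmarks, which use torch.compile, list higher figures, up to roughly 90 FPS for the smallest variants. For real-time work on 30 fps video, a smaller variant such as Hiera-B+ or Hiera-S is the safer choice, and throughput on consumer GPUs such as an RTX 4080 should be measured rather than assumed.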
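Whether a given throughput counts as real-time comes down to the per-frame time budget. The sketch below does that arithmetic. The function names, the 80% headroom factor, and the sample throughput values are assumptions for illustration, not measurements.

```python
def frame_budget_ms(video_fps):
    """Time available per frame, in milliseconds, at a given video frame rate."""
    return 1000.0 / video_fps


def keeps_up(model_fps, video_fps, headroom=0.8):
    """True if per-frame latency fits within a fraction `headroom` of the frame budget."""
    latency_ms = 1000.0 / model_fps
    return latency_ms <= headroom * frame_budget_ms(video_fps)


print(round(frame_budget_ms(30), 1))  # 33.3
print(keeps_up(45, 30))               # True: 22.2 ms fits in the 26.7 ms allowance
print(keeps_up(32, 30))               # False: 31.25 ms leaves no room for decoding or I/O
```

The headroom factor matters because segmentation is rarely the only work per frame; video decoding, host-to-device copies, and rendering all draw on the same budget.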

Use Cases

Video editing: Rotoscope a subject for background replacement without frame-by-frame masking.

Medical imaging: Track tumors or instruments across CT/MRI scan slices.

Robotics: Identify and track objects for manipulation tasks without domain-specific training.

Sports analytics: Track individual players through brief occlusion and fast motion. Tracking can be lost across shot changes, so expect to re-prompt after camera cuts.

The SAM 2 GitHub repository includes example notebooks for image and video prediction, along with the source for a web demo, which makes it a practical starting point for prototyping.
