SAM 2 vs SAM 1: The Video Gap
The original Segment Anything Model (SAM 1) was image-only. You could apply it frame-by-frame to video, but without temporal consistency — each frame started fresh, and the same object might get different masks across frames. Meta AI's SAM 2 solves this with a streaming memory bank that maintains object state across frames.
Key architectural additions:
- Memory encoder: Encodes frame features and masks into memory tokens
- Memory bank: Stores recent frames and prompted frames as spatial memories
- Memory attention: Cross-attends current frame features to memory bank before decoding
The result: prompt an object once at frame 0 (with a click or bounding box), and SAM 2 tracks and segments it through the rest of the video.
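To make the memory bank idea concrete, here is a toy sketch in plain Python. It is not SAM 2's implementation: the class name `ToyMemoryBank`, the `max_recent` capacity, and the string "memories" are all assumptions for illustration. It only shows the bookkeeping described above, where prompted frames are kept and recent frames are held in a bounded first-in, first-out queue.

```python
from collections import deque


class ToyMemoryBank:
    """Toy stand-in for a streaming memory bank (hypothetical, not SAM 2 code)."""

    def __init__(self, max_recent=6):
        # A bounded deque silently drops its oldest entry when full (FIFO).
        self.recent = deque(maxlen=max_recent)
        # Prompted frames are kept separately so they are not evicted.
        self.prompted = {}

    def add(self, frame_idx, memory, prompted=False):
        if prompted:
            self.prompted[frame_idx] = memory
        else:
            self.recent.append((frame_idx, memory))

    def frames_to_attend(self):
        """Frame indices whose memories the current frame would attend to."""
        return sorted(set(self.prompted) | {idx for idx, _ in self.recent})


bank = ToyMemoryBank(max_recent=3)
bank.add(0, "memory-0", prompted=True)  # the user clicked on frame 0
for idx in range(1, 6):
    bank.add(idx, f"memory-{idx}")      # unprompted frames from propagation

attended = bank.frames_to_attend()
print(attended)  # [0, 3, 4, 5]
```

The prompted frame stays available however far the video advances, while only the latest few unprompted frames remain in the queue.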
Prompting Modes
SAM 2 inherits SAM 1's flexible prompting:
- Points: Click to include (+) or exclude (-) regions
- Bounding boxes: Coarse box → precise mask
- Masks: Refine a previous mask
- Text: Not supported natively. Natural-language prompting requires pairing SAM 2 with a separate model that turns text into boxes or points.
For video, you prompt on one or more keyframes and the model propagates forward and backward in time.
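As a rough illustration of how click prompts and keyframe propagation fit together, the sketch below uses plain Python only. The helper names `encode_clicks` and `propagation_order` are hypothetical, not part of the SAM 2 package. The label convention (1 for include, 0 for exclude) matches the point prompts used by SAM, and the visiting order is a schematic of "forward, then backward from the keyframe", not the library's scheduler.

```python
def encode_clicks(include, exclude=()):
    """Turn include/exclude clicks into parallel points and labels lists."""
    points = [list(xy) for xy in include] + [list(xy) for xy in exclude]
    labels = [1] * len(include) + [0] * len(exclude)
    return points, labels


def propagation_order(num_frames, keyframe):
    """Schematic visiting order: forward from the keyframe, then backward."""
    forward = list(range(keyframe, num_frames))
    backward = list(range(keyframe, -1, -1))
    return forward, backward


points, labels = encode_clicks(
    include=[(350, 200), (360, 240)],  # clicks on the object
    exclude=[(420, 210)],              # click on a region to leave out
)
print(points)  # [[350, 200], [360, 240], [420, 210]]
print(labels)  # [1, 1, 0]

forward, backward = propagation_order(num_frames=6, keyframe=2)
print(forward)   # [2, 3, 4, 5]
print(backward)  # [2, 1, 0]
```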
SA-V Dataset
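Meta AI trained SAM 2 on SA-V, a video segmentation dataset released alongside the model, in addition to image data from SA-1B (the original SAM dataset). SA-V contains roughly 51,000 videos with about 643,000 masklet annotations (objects tracked across entire video sequences), totaling around 35.5 million masks. Meta reports this as 53x more masks than any previous video segmentation dataset. The diversity spans sports, cooking, outdoor scenes, and social media content.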
Python API
```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names follow the SAM 2 repository layout; adjust paths as needed
predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml",
    "sam2_hiera_large.pt",
    device="cuda",
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Load video frames from a directory of JPEG files
    state = predictor.init_state(video_path="./video_frames/")

    # Prompt object 1 at frame 0 with a single positive click
    frame_idx, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[350, 200]], dtype=np.float32),  # (x, y) in pixels
        labels=np.array([1], dtype=np.int32),  # 1=foreground, 0=background
    )

    # Propagate through the video; logits above 0 count as mask pixels
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
        print(f"Frame {frame_idx}: {masks.sum()} pixels segmented")
```
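The thresholding step in the loop above can be examined in isolation. The following sketch uses nested Python lists in place of a tensor so it runs without a GPU or the SAM 2 package; the helper name `mask_stats` and the toy logits are made up for illustration. It applies the same "logit greater than 0.0" rule and reports the mask area and a tight bounding box.

```python
def mask_stats(logits, threshold=0.0):
    """Return (area, bbox) for a 2D grid of logits.

    bbox is (x_min, y_min, x_max, y_max) in pixel indices, or None if empty.
    """
    xs, ys = [], []
    for y, row in enumerate(logits):
        for x, value in enumerate(row):
            if value > threshold:  # same rule as `mask_logits > 0.0`
                xs.append(x)
                ys.append(y)
    if not xs:
        return 0, None
    return len(xs), (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 4x4 logit grid: positive values mark the object.
toy_logits = [
    [-4.0, -3.0, -5.0, -6.0],
    [-2.0,  1.5,  2.0, -1.0],
    [-3.0,  0.7,  3.1, -2.0],
    [-5.0, -4.0, -0.2, -6.0],
]

area, bbox = mask_stats(toy_logits)
print(area, bbox)  # 4 (1, 1, 2, 2)
```

In a real pipeline the same statistics would normally be computed with array operations on the boolean mask; the loop form is used here only to keep the example dependency-free.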
Real-Time Performance
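Meta reports roughly 30 FPS for SAM 2 Hiera-L and about 44 FPS for Hiera-B+ on an A100 GPU; the benchmarks published with the SAM 2.1 checkpoints list higher figures, with the smaller Hiera-T and Hiera-S variants running fastest. Throughput on consumer GPUs such as an RTX 4080 depends on precision, compilation settings, and the number of tracked objects, so measure on your own hardware before committing to a real-time (30 fps) pipeline.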
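Whether a given model variant keeps up with a video stream comes down to simple arithmetic on the per-frame time budget. The sketch below shows that calculation; the function names are hypothetical and the throughput values 25 and 60 FPS are placeholders, not measurements of any SAM 2 checkpoint.

```python
def frame_budget_ms(video_fps):
    """Time available per frame, in milliseconds, at a given video frame rate."""
    return 1000.0 / video_fps


def keeps_up(model_fps, video_fps=30.0):
    """True if the model processes frames at least as fast as they arrive."""
    return model_fps >= video_fps


budget = frame_budget_ms(30.0)
print(round(budget, 1))  # 33.3

# Placeholder throughputs; substitute numbers measured on your own hardware.
slow_ok = keeps_up(25.0)
fast_ok = keeps_up(60.0)
print(slow_ok, fast_ok)  # False True
```

A model measured at exactly the video frame rate leaves no headroom for decoding, I/O, or display, so in practice some margin above the target rate is desirable.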
Use Cases
Video editing: Rotoscope a subject for background replacement without frame-by-frame masking.
Medical imaging: Track tumors or instruments across CT/MRI scan slices.
Robotics: Identify and track objects for manipulation tasks without domain-specific training.
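Sports analytics: Track individual players through brief occlusion and fast motion. Camera cuts, long occlusions, and crowded scenes can cause the model to lose or confuse objects, so expect to re-prompt after shot changes.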
The SAM 2 GitHub repository includes example notebooks for image prediction, automatic mask generation, and video prediction, plus a web demo that can be run locally for rapid prototyping.