# SAM 3: Segment Anything with Concepts

SAM 3 (Segment Anything Model 3) is a unified foundation model from Meta Superintelligence Labs for promptable segmentation in images and videos. It extends the capabilities of SAM 2 by introducing the ability to exhaustively segment all instances of an open-vocabulary concept specified by text prompts or visual exemplars. SAM 3 features a DETR-based detector conditioned on text, geometry, and image exemplars, combined with a tracker that inherits the SAM 2 transformer architecture for video segmentation and interactive refinement. The model has 848M parameters and achieves 75-80% of human performance on the SA-Co benchmark, which contains 270K unique concepts.

SAM 3.1 Object Multiplex is an enhanced version that introduces a shared-memory approach for joint multi-object tracking. Instead of processing objects individually, Object Multiplex groups objects into fixed-capacity buckets and processes them jointly, drastically reducing redundant computation while maintaining accuracy. Both versions support text prompts, point prompts, box prompts, and mask refinement through an intuitive session-based API with `handle_request()` and `handle_stream_request()` methods.

## Image Segmentation with Text Prompts

The `Sam3Processor` class provides a high-level interface for image segmentation. After setting an image, you can use text prompts to detect and segment all instances of objects matching the description. The processor handles image preprocessing, model inference, and post-processing of segmentation masks.

```python
import torch
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Enable mixed precision for faster inference on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()

# Build the model (auto-downloads checkpoint from HuggingFace)
model = build_sam3_image_model()

# Create processor with confidence threshold
processor = Sam3Processor(model, confidence_threshold=0.5)

# Load image and set it for inference
image = Image.open("path/to/image.jpg")
inference_state = processor.set_image(image)

# Segment with text prompt - finds all instances of "shoe"
inference_state = processor.set_text_prompt(state=inference_state, prompt="shoe")

# Get results
masks = inference_state["masks"]               # Binary masks: (N, 1, H, W) tensor
boxes = inference_state["boxes"]               # Bounding boxes: (N, 4) tensor in [x0, y0, x1, y1] format
scores = inference_state["scores"]             # Confidence scores: (N,) tensor
mask_logits = inference_state["masks_logits"]  # Raw logits before thresholding

print(f"Found {len(scores)} objects")
for i in range(len(scores)):
    print(f"Object {i}: score={scores[i]:.3f}, box={boxes[i].tolist()}")
```
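If you only need high-confidence detections, the returned tensors can be filtered with standard PyTorch indexing. The snippet below is a minimal sketch that relies only on the output shapes documented above; the 0.7 threshold is an arbitrary example value, not a library default.

```python
# Post-processing sketch (not part of the SAM 3 API): keep detections above a
# stricter threshold and inspect their mask sizes.
keep = scores > 0.7          # (N,) boolean mask over detections
kept_masks = masks[keep]     # (K, 1, H, W) binary masks
kept_boxes = boxes[keep]     # (K, 4) boxes in [x0, y0, x1, y1]
kept_scores = scores[keep]   # (K,) confidence scores

for mask, box, score in zip(kept_masks, kept_boxes, kept_scores):
    area = int(mask.sum().item())  # Foreground pixel count for this instance
    print(f"score={score:.3f}, area={area} px, box={box.tolist()}")
```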
## Image Segmentation with Visual Box Prompts

SAM 3 supports visual prompting through bounding boxes to specify objects of interest. Boxes use normalized center coordinates (cx, cy, w, h) in the [0, 1] range. You can combine positive boxes (objects to segment) with negative boxes (objects to exclude).

```python
import torch
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from sam3.model.box_ops import box_xywh_to_cxcywh
from sam3.visualization_utils import normalize_bbox

# Build model and processor
model = build_sam3_image_model()
processor = Sam3Processor(model, confidence_threshold=0.5)

# Load and set image
image = Image.open("path/to/image.jpg")
width, height = image.size
inference_state = processor.set_image(image)

# Convert box from (x, y, w, h) pixel coordinates to normalized (cx, cy, w, h)
box_input_xywh = torch.tensor([480.0, 290.0, 110.0, 360.0]).view(-1, 4)
box_input_cxcywh = box_xywh_to_cxcywh(box_input_xywh)
norm_box_cxcywh = normalize_bbox(box_input_cxcywh, width, height).flatten().tolist()

# Add single positive box prompt
inference_state = processor.add_geometric_prompt(
    state=inference_state,
    box=norm_box_cxcywh,
    label=True,  # True for positive, False for negative
)

# Multi-box prompting with positive and negative boxes
processor.reset_all_prompts(inference_state)
boxes_xywh = [[480.0, 290.0, 110.0, 360.0], [370.0, 280.0, 115.0, 375.0]]
boxes_cxcywh = box_xywh_to_cxcywh(torch.tensor(boxes_xywh).view(-1, 4))
norm_boxes = normalize_bbox(boxes_cxcywh, width, height).tolist()
box_labels = [True, False]  # First box positive, second negative

for box, label in zip(norm_boxes, box_labels):
    inference_state = processor.add_geometric_prompt(
        state=inference_state, box=box, label=label
    )

# Get segmentation results
masks = inference_state["masks"]
boxes = inference_state["boxes"]
scores = inference_state["scores"]
```
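For reference, the normalized center format can also be worked out by hand. The sketch below is illustrative arithmetic only, assuming the usual interpretation of (x, y, w, h) as a top-left corner plus width and height in pixels and a hypothetical 1280x720 image; it is not a library call.

```python
# Illustrative arithmetic for the normalized (cx, cy, w, h) convention
img_w, img_h = 1280, 720                 # Hypothetical image size
x, y, w, h = 480.0, 290.0, 110.0, 360.0  # Top-left corner + width/height in pixels

cx, cy = x + w / 2, y + h / 2            # Box center in pixels
norm_box = [cx / img_w, cy / img_h, w / img_w, h / img_h]
print(norm_box)                          # ~ [0.418, 0.653, 0.086, 0.5]
```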
## Video Segmentation with SAM 3 Predictor

The `build_sam3_video_predictor` function creates a multi-GPU video predictor for SAM 3. It uses a session-based API where you start a session on a video, add prompts, propagate masks through the video, and manage tracked objects.

```python
import torch

from sam3.model_builder import build_sam3_video_predictor

# Build predictor using all available GPUs
gpus_to_use = range(torch.cuda.device_count())
predictor = build_sam3_video_predictor(gpus_to_use=gpus_to_use)

# Start session on video (JPEG folder or MP4 file)
video_path = "path/to/video/frames"  # Folder with 0.jpg, 1.jpg, ... or video.mp4
response = predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)
session_id = response["session_id"]

# Add text prompt on frame 0 - detects all instances of "person"
response = predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=0,
        text="person",
    )
)
initial_output = response["outputs"]
# initial_output contains: out_obj_ids, out_boxes_xywh, out_probs, out_binary_masks

# Propagate through entire video using streaming API
outputs_per_frame = {}
for response in predictor.handle_stream_request(
    request=dict(
        type="propagate_in_video",
        session_id=session_id,
    )
):
    frame_idx = response["frame_index"]
    outputs_per_frame[frame_idx] = response["outputs"]

# Remove specific object by ID
predictor.handle_request(
    request=dict(
        type="remove_object",
        session_id=session_id,
        obj_id=2,  # Remove object with ID 2
    )
)

# Reset session to clear all prompts and tracked objects
predictor.handle_request(
    request=dict(
        type="reset_session",
        session_id=session_id,
    )
)

# Close session when done
predictor.handle_request(
    request=dict(
        type="close_session",
        session_id=session_id,
    )
)

# Shutdown predictor to free multi-GPU resources
predictor.shutdown()
```

## Video Segmentation with SAM 3.1 Multiplex Predictor

SAM 3.1 Object Multiplex provides faster multi-object tracking by processing objects jointly in buckets. Use `build_sam3_multiplex_video_predictor` for the improved version with optional torch.compile acceleration.

```python
from sam3.model_builder import build_sam3_multiplex_video_predictor

# Build SAM 3.1 multiplex predictor with optional compilation for 2x speedup
predictor = build_sam3_multiplex_video_predictor(
    compile=True,               # Enable torch.compile for faster inference
    warm_up=False,              # Set True to run warm-up compilation passes
    max_num_objects=16,         # Maximum tracked objects
    multiplex_count=16,         # Objects per multiplex bucket
    use_fa3=True,               # Use Flash Attention 3
    async_loading_frames=True,  # Load frames asynchronously
)

# Start session
response = predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path="path/to/video/frames",
    )
)
session_id = response["session_id"]

# Add text prompt
response = predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=0,
        text="person",
    )
)

# Propagate through video
outputs_per_frame = {}
for response in predictor.handle_stream_request(
    request=dict(
        type="propagate_in_video",
        session_id=session_id,
    )
):
    outputs_per_frame[response["frame_index"]] = response["outputs"]

# Close session
predictor.handle_request(
    request=dict(
        type="close_session",
        session_id=session_id,
    )
)
```
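The benefit of bucketing is easiest to see as a back-of-the-envelope count of tracker passes per frame. The snippet below is a conceptual illustration only, not the library's internal implementation, and assumes a hypothetical count of 40 tracked objects with the `multiplex_count` configured above.

```python
import math

# Conceptual illustration of bucketed multi-object processing (not library code)
multiplex_count = 16  # Bucket capacity, as configured above
num_objects = 40      # Hypothetical number of tracked objects

per_object_passes = num_objects                                # One pass per object per frame
multiplexed_passes = math.ceil(num_objects / multiplex_count)  # One pass per bucket per frame

print(f"per-object: {per_object_passes}, multiplexed: {multiplexed_passes}")
# per-object: 40, multiplexed: 3
```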
## Unified Predictor Builder with Version Selection

The `build_sam3_predictor` function provides a unified entry point for both SAM 3 and SAM 3.1 predictors. It automatically downloads checkpoints from HuggingFace and configures the appropriate model architecture.

```python
from sam3.model_builder import build_sam3_predictor

# Build SAM 3.1 predictor (default, recommended for performance)
predictor = build_sam3_predictor(
    version="sam3.1",           # "sam3" or "sam3.1"
    compile=True,               # Enable torch.compile (SAM 3.1 only)
    warm_up=False,              # Run warm-up passes
    max_num_objects=16,         # Max tracked objects (SAM 3.1)
    use_fa3=True,               # Flash Attention 3
    async_loading_frames=True,  # Async frame loading
)

# Build SAM 3 predictor
predictor_v3 = build_sam3_predictor(
    version="sam3",
    compile=False,
    async_loading_frames=True,
)

# Use with custom checkpoint path
predictor_custom = build_sam3_predictor(
    checkpoint_path="path/to/custom/checkpoint.pt",
    version="sam3.1",
)

# Both use the same API
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video/path",
})
session_id = response["session_id"]

predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "dog",
})

for out in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
}):
    masks = out["outputs"]["out_binary_masks"]
    obj_ids = out["outputs"]["out_obj_ids"]
```

## Interactive Refinement with Point Prompts

After initial detection with text prompts, you can refine segmentation masks interactively using point prompts. Positive points (label=1) indicate areas to include, while negative points (label=0) exclude areas.

```python
import torch
import numpy as np

from sam3.model_builder import build_sam3_predictor

predictor = build_sam3_predictor(version="sam3.1")

# Start session and add initial text prompt
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video/frames",
})
session_id = response["session_id"]

predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",
})

# Helper to convert absolute pixel coords to relative [0,1] coords
def abs_to_rel_coords(coords, img_width, img_height):
    return [[x / img_width, y / img_height] for x, y in coords]

IMG_WIDTH, IMG_HEIGHT = 1280, 720  # Your video dimensions

# Add new object with point prompt
points_abs = np.array([[760, 550]])  # Single positive click
labels = np.array([1])
points_tensor = torch.tensor(
    abs_to_rel_coords(points_abs, IMG_WIDTH, IMG_HEIGHT),
    dtype=torch.float32,
)
labels_tensor = torch.tensor(labels, dtype=torch.int32)

response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": points_tensor,
    "point_labels": labels_tensor,
    "obj_id": 5,  # Specify object ID for new or existing object
})

# Refine existing object mask with multiple points
# Example: Select shirt only, exclude body
refine_points_abs = np.array([
    [740, 450],  # Positive - include shirt
    [760, 630],  # Negative - exclude legs
    [840, 640],  # Negative - exclude legs
    [760, 550],  # Positive - include shirt center
])
refine_labels = np.array([1, 0, 0, 1])

response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": torch.tensor(
        abs_to_rel_coords(refine_points_abs, IMG_WIDTH, IMG_HEIGHT),
        dtype=torch.float32,
    ),
    "point_labels": torch.tensor(refine_labels, dtype=torch.int32),
    "obj_id": 2,  # Refine existing object ID 2
})

# Propagate refined masks
for out in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
}):
    frame_idx = out["frame_index"]
    masks = out["outputs"]["out_binary_masks"]
```

## Batched Image Inference

For processing multiple images efficiently, use the `set_image_batch` method, which processes images in a single forward pass through the backbone.
```python
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

model = build_sam3_image_model()
processor = Sam3Processor(model, confidence_threshold=0.5)

# Load multiple images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
]

# Set image batch - processes all images through backbone at once
inference_state = processor.set_image_batch(images)

# Then process each image with text prompts
# Note: set_text_prompt works on the current backbone output
# You may need to iterate or handle indexing for batch processing
inference_state = processor.set_text_prompt(state=inference_state, prompt="car")

masks = inference_state["masks"]
boxes = inference_state["boxes"]
scores = inference_state["scores"]
```

## Visualization Utilities

SAM 3 provides comprehensive visualization utilities for displaying segmentation results on images and videos. These utilities handle mask overlay, bounding box drawing, and multi-frame visualization.

```python
import matplotlib.pyplot as plt
from PIL import Image

from sam3.visualization_utils import (
    plot_results,
    render_masklet_frame,
    save_masklet_video,
    save_masklet_image,
    prepare_masks_for_visualization,
    visualize_formatted_frame_output,
    draw_box_on_image,
    plot_mask,
    plot_bbox,
)

# Simple results plotting for image segmentation
image = Image.open("image.jpg")
# inference_state from processor.set_text_prompt(...)
plot_results(image, inference_state)
plt.show()

# Draw bounding box on image (xywh format)
box_xywh = [480.0, 290.0, 110.0, 360.0]
image_with_box = draw_box_on_image(image, box_xywh, color=(0, 255, 0))

# Render single frame with masklet overlays
# outputs dict with: out_boxes_xywh, out_probs, out_obj_ids, out_binary_masks
frame = plt.imread("frame.jpg")
overlay = render_masklet_frame(frame, outputs, frame_idx=0, alpha=0.5)
plt.imshow(overlay)

# Save visualization to image file
save_masklet_image(frame, outputs, "output.png", alpha=0.5, frame_idx=0)

# Save video with mask overlays
# outputs_per_frame: {frame_idx: outputs_dict}
video_frames = ["frame0.jpg", "frame1.jpg", "frame2.jpg", ...]
save_masklet_video(video_frames, outputs_per_frame, "output.mp4", alpha=0.5, fps=10)

# Prepare masks for multi-frame visualization
# Converts {frame_idx: {out_obj_ids, out_binary_masks, ...}} to {frame_idx: {obj_id: mask}}
formatted_outputs = prepare_masks_for_visualization(outputs_per_frame)

# Visualize formatted output with matplotlib
visualize_formatted_frame_output(
    frame_idx=0,
    video_frames=video_frames,
    outputs_list=[formatted_outputs],
    titles=["SAM 3 Dense Tracking outputs"],
    figsize=(6, 4),
    points_list=None,         # Optional point prompts to show
    points_labels_list=None,  # Optional point labels
)
```

## Model Configuration Options

When building models, you can configure various options for performance optimization, memory management, and feature selection based on your hardware and use case.
```python
from sam3.model_builder import (
    build_sam3_image_model,
    build_sam3_video_model,
    build_sam3_multiplex_video_model,
    download_ckpt_from_hf,
)

# Image model with full configuration
model = build_sam3_image_model(
    bpe_path=None,                    # Auto-detects bundled tokenizer
    device="cuda",                    # "cuda" or "cpu"
    eval_mode=True,                   # Set to evaluation mode
    checkpoint_path=None,             # Auto-downloads from HuggingFace
    load_from_HF=True,                # Download checkpoint if not provided
    enable_segmentation=True,         # Enable segmentation head
    enable_inst_interactivity=False,  # Enable SAM 1-style interactive prompts
    compile=False,                    # Enable torch.compile
)

# Video model with temporal disambiguation
video_model = build_sam3_video_model(
    checkpoint_path=None,
    load_from_HF=True,
    bpe_path=None,
    has_presence_token=True,             # Enable presence token for discrimination
    geo_encoder_use_img_cross_attn=True,
    strict_state_dict_loading=True,
    apply_temporal_disambiguation=True,  # Enable temporal disambiguation heuristics
    device="cuda",
    compile=False,
)

# SAM 3.1 multiplex model for fast multi-object tracking
multiplex_model = build_sam3_multiplex_video_model(
    checkpoint_path=None,
    load_from_HF=True,
    multiplex_count=16,    # Objects per bucket
    use_fa3=False,         # Flash Attention 3
    use_rope_real=False,   # Real-valued RoPE for compile compat
    strict_state_dict_loading=True,
    device="cuda",
    compile=False,
)

# Manually download checkpoints
sam3_ckpt = download_ckpt_from_hf(version="sam3")
sam31_ckpt = download_ckpt_from_hf(version="sam3.1")
```

## Summary

SAM 3 and SAM 3.1 provide a comprehensive foundation for promptable image and video segmentation tasks. The primary use cases include: (1) open-vocabulary object detection and segmentation using natural language descriptions, (2) interactive video object tracking with text prompts and click refinement, (3) visual prompting through bounding boxes for exemplar-based segmentation, and (4) multi-object tracking at scale using the efficient Object Multiplex architecture in SAM 3.1.

Integration patterns typically involve: building a model or predictor using the appropriate builder function, setting up inference sessions for video processing, using `handle_request()` for single operations and `handle_stream_request()` for video propagation, and managing sessions with start/reset/close operations.

For production deployments, enable `compile=True` with SAM 3.1 for significant speedups, use Flash Attention 3 (`use_fa3=True`) on supported hardware, and leverage multi-GPU inference via the `gpus_to_use` parameter. The session-based API ensures stateful tracking across video frames while providing flexibility for interactive refinement workflows.
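Putting these recommendations together, a minimal production-style setup looks like the sketch below. It reuses only the builder options and request types shown earlier in this document; the video path and prompt text are placeholders.

```python
from sam3.model_builder import build_sam3_predictor

# Production-oriented predictor: SAM 3.1 with compilation and Flash Attention 3
predictor = build_sam3_predictor(
    version="sam3.1",
    compile=True,
    use_fa3=True,
    async_loading_frames=True,
)

# One session per video: start, prompt, propagate, close
session_id = predictor.handle_request({
    "type": "start_session",
    "resource_path": "path/to/video/frames",
})["session_id"]

predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",
})

outputs_per_frame = {}
for out in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
}):
    outputs_per_frame[out["frame_index"]] = out["outputs"]

predictor.handle_request({"type": "close_session", "session_id": session_id})
```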