# SAM 3: Segment Anything with Concepts

SAM 3 (Segment Anything Model 3) is a unified foundation model from Meta Superintelligence Labs for promptable segmentation in images and videos. It extends the capabilities of SAM 2 by introducing the ability to exhaustively segment all instances of an open-vocabulary concept specified by text prompts or visual exemplars. SAM 3 features a DETR-based detector conditioned on text, geometry, and image exemplars, combined with a tracker that inherits the SAM 2 transformer architecture for video segmentation and interactive refinement. The model has 848M parameters and achieves 75-80% of human performance on the SA-Co benchmark, which contains 270K unique concepts.

SAM 3.1 Object Multiplex is an enhanced version that introduces a shared-memory approach for joint multi-object tracking. Instead of processing objects individually, Object Multiplex groups objects into fixed-capacity buckets and processes them jointly, drastically reducing redundant computation while maintaining accuracy. Both versions support text prompts, point prompts, box prompts, and mask refinement through an intuitive session-based API with `handle_request()` and `handle_stream_request()` methods.

## Image Segmentation with Text Prompts

The `Sam3Processor` class provides a high-level interface for image segmentation. After setting an image, you can use text prompts to detect and segment all instances of objects matching the description. The processor handles image preprocessing, model inference, and post-processing of segmentation masks.

```python
import torch
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Enable mixed precision for faster inference on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()

# Build the model (auto-downloads checkpoint from HuggingFace)
model = build_sam3_image_model()

# Create processor with confidence threshold
processor = Sam3Processor(model, confidence_threshold=0.5)

# Load image and set it for inference
image = Image.open("path/to/image.jpg")
inference_state = processor.set_image(image)

# Segment with text prompt - finds all instances of "shoe"
inference_state = processor.set_text_prompt(state=inference_state, prompt="shoe")

# Get results
masks = inference_state["masks"]               # Binary masks: (N, 1, H, W) tensor
boxes = inference_state["boxes"]               # Bounding boxes: (N, 4) tensor in [x0, y0, x1, y1] format
scores = inference_state["scores"]             # Confidence scores: (N,) tensor
mask_logits = inference_state["masks_logits"]  # Raw logits before thresholding

print(f"Found {len(scores)} objects")
for i in range(len(scores)):
    print(f"Object {i}: score={scores[i]:.3f}, box={boxes[i].tolist()}")
```
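If you only need high-confidence detections, the returned tensors can be filtered with standard PyTorch indexing. The snippet below is a minimal sketch that relies only on the output shapes documented above; the 0.7 threshold is an arbitrary example value, not a library default.

```python
# Post-processing sketch (not part of the SAM 3 API): keep detections above a
# stricter threshold and inspect their mask sizes.
keep = scores > 0.7          # (N,) boolean mask over detections
kept_masks = masks[keep]     # (K, 1, H, W) binary masks
kept_boxes = boxes[keep]     # (K, 4) boxes in [x0, y0, x1, y1]
kept_scores = scores[keep]   # (K,) confidence scores

for mask, box, score in zip(kept_masks, kept_boxes, kept_scores):
    area = int(mask.sum().item())  # Foreground pixel count for this instance
    print(f"score={score:.3f}, area={area} px, box={box.tolist()}")
```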
## Image Segmentation with Visual Box Prompts

SAM 3 supports visual prompting through bounding boxes to specify objects of interest. Boxes use normalized center coordinates (cx, cy, w, h) in the [0, 1] range. You can combine positive boxes (objects to segment) with negative boxes (objects to exclude).

```python
import torch
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from sam3.model.box_ops import box_xywh_to_cxcywh
from sam3.visualization_utils import normalize_bbox

# Build model and processor
model = build_sam3_image_model()
processor = Sam3Processor(model, confidence_threshold=0.5)

# Load and set image
image = Image.open("path/to/image.jpg")
width, height = image.size
inference_state = processor.set_image(image)

# Convert box from (x, y, w, h) pixel coordinates to normalized (cx, cy, w, h)
box_input_xywh = torch.tensor([480.0, 290.0, 110.0, 360.0]).view(-1, 4)
box_input_cxcywh = box_xywh_to_cxcywh(box_input_xywh)
norm_box_cxcywh = normalize_bbox(box_input_cxcywh, width, height).flatten().tolist()

# Add single positive box prompt
inference_state = processor.add_geometric_prompt(
    state=inference_state,
    box=norm_box_cxcywh,
    label=True,  # True for positive, False for negative
)

# Multi-box prompting with positive and negative boxes
processor.reset_all_prompts(inference_state)
boxes_xywh = [[480.0, 290.0, 110.0, 360.0], [370.0, 280.0, 115.0, 375.0]]
boxes_cxcywh = box_xywh_to_cxcywh(torch.tensor(boxes_xywh).view(-1, 4))
norm_boxes = normalize_bbox(boxes_cxcywh, width, height).tolist()
box_labels = [True, False]  # First box positive, second negative

for box, label in zip(norm_boxes, box_labels):
    inference_state = processor.add_geometric_prompt(
        state=inference_state, box=box, label=label
    )

# Get segmentation results
masks = inference_state["masks"]
boxes = inference_state["boxes"]
scores = inference_state["scores"]
```
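For reference, the normalized center format can also be worked out by hand. The sketch below is illustrative arithmetic only, assuming the usual interpretation of (x, y, w, h) as a top-left corner plus width and height in pixels and a hypothetical 1280x720 image; it is not a library call.

```python
# Illustrative arithmetic for the normalized (cx, cy, w, h) convention
img_w, img_h = 1280, 720                 # Hypothetical image size
x, y, w, h = 480.0, 290.0, 110.0, 360.0  # Top-left corner + width/height in pixels

cx, cy = x + w / 2, y + h / 2            # Box center in pixels
norm_box = [cx / img_w, cy / img_h, w / img_w, h / img_h]
print(norm_box)                          # ~ [0.418, 0.653, 0.086, 0.5]
```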
## Video Segmentation with SAM 3 Predictor

The `build_sam3_video_predictor` function creates a multi-GPU video predictor for SAM 3. It uses a session-based API where you start a session on a video, add prompts, propagate masks through the video, and manage tracked objects.

```python
import torch

from sam3.model_builder import build_sam3_video_predictor

# Build predictor using all available GPUs
gpus_to_use = range(torch.cuda.device_count())
predictor = build_sam3_video_predictor(gpus_to_use=gpus_to_use)

# Start session on video (JPEG folder or MP4 file)
video_path = "path/to/video/frames"  # Folder with 0.jpg, 1.jpg, ... or video.mp4
response = predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path=video_path,
    )
)
session_id = response["session_id"]

# Add text prompt on frame 0 - detects all instances of "person"
response = predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=0,
        text="person",
    )
)
initial_output = response["outputs"]
# initial_output contains: out_obj_ids, out_boxes_xywh, out_probs, out_binary_masks

# Propagate through entire video using streaming API
outputs_per_frame = {}
for response in predictor.handle_stream_request(
    request=dict(
        type="propagate_in_video",
        session_id=session_id,
    )
):
    frame_idx = response["frame_index"]
    outputs_per_frame[frame_idx] = response["outputs"]

# Remove specific object by ID
predictor.handle_request(
    request=dict(
        type="remove_object",
        session_id=session_id,
        obj_id=2,  # Remove object with ID 2
    )
)

# Reset session to clear all prompts and tracked objects
predictor.handle_request(
    request=dict(
        type="reset_session",
        session_id=session_id,
    )
)

# Close session when done
predictor.handle_request(
    request=dict(
        type="close_session",
        session_id=session_id,
    )
)

# Shutdown predictor to free multi-GPU resources
predictor.shutdown()
```

## Video Segmentation with SAM 3.1 Multiplex Predictor

SAM 3.1 Object Multiplex provides faster multi-object tracking by processing objects jointly in buckets. Use `build_sam3_multiplex_video_predictor` for the improved version with optional torch.compile acceleration.

```python
from sam3.model_builder import build_sam3_multiplex_video_predictor

# Build SAM 3.1 multiplex predictor with optional compilation for 2x speedup
predictor = build_sam3_multiplex_video_predictor(
    compile=True,               # Enable torch.compile for faster inference
    warm_up=False,              # Set True to run warm-up compilation passes
    max_num_objects=16,         # Maximum tracked objects
    multiplex_count=16,         # Objects per multiplex bucket
    use_fa3=True,               # Use Flash Attention 3
    async_loading_frames=True,  # Load frames asynchronously
)

# Start session
response = predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path="path/to/video/frames",
    )
)
session_id = response["session_id"]

# Add text prompt
response = predictor.handle_request(
    request=dict(
        type="add_prompt",
        session_id=session_id,
        frame_index=0,
        text="person",
    )
)

# Propagate through video
outputs_per_frame = {}
for response in predictor.handle_stream_request(
    request=dict(
        type="propagate_in_video",
        session_id=session_id,
    )
):
    outputs_per_frame[response["frame_index"]] = response["outputs"]

# Close session
predictor.handle_request(
    request=dict(
        type="close_session",
        session_id=session_id,
    )
)
```
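The benefit of bucketing is easiest to see as a back-of-the-envelope count of tracker passes per frame. The snippet below is a conceptual illustration only, not the library's internal implementation, and assumes a hypothetical count of 40 tracked objects with the `multiplex_count` configured above.

```python
import math

# Conceptual illustration of bucketed multi-object processing (not library code)
multiplex_count = 16  # Bucket capacity, as configured above
num_objects = 40      # Hypothetical number of tracked objects

per_object_passes = num_objects                                # One pass per object per frame
multiplexed_passes = math.ceil(num_objects / multiplex_count)  # One pass per bucket per frame

print(f"per-object: {per_object_passes}, multiplexed: {multiplexed_passes}")
# per-object: 40, multiplexed: 3
```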
## Unified Predictor Builder with Version Selection

The `build_sam3_predictor` function provides a unified entry point for both SAM 3 and SAM 3.1 predictors. It automatically downloads checkpoints from HuggingFace and configures the appropriate model architecture.

```python
from sam3.model_builder import build_sam3_predictor

# Build SAM 3.1 predictor (default, recommended for performance)
predictor = build_sam3_predictor(
    version="sam3.1",           # "sam3" or "sam3.1"
    compile=True,               # Enable torch.compile (SAM 3.1 only)
    warm_up=False,              # Run warm-up passes
    max_num_objects=16,         # Max tracked objects (SAM 3.1)
    use_fa3=True,               # Flash Attention 3
    async_loading_frames=True,  # Async frame loading
)

# Build SAM 3 predictor
predictor_v3 = build_sam3_predictor(
    version="sam3",
    compile=False,
    async_loading_frames=True,
)

# Use with custom checkpoint path
predictor_custom = build_sam3_predictor(
    checkpoint_path="path/to/custom/checkpoint.pt",
    version="sam3.1",
)

# Both use the same API
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video/path",
})
session_id = response["session_id"]

predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "dog",
})

for out in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
}):
    masks = out["outputs"]["out_binary_masks"]
    obj_ids = out["outputs"]["out_obj_ids"]
```

## Interactive Refinement with Point Prompts

After initial detection with text prompts, you can refine segmentation masks interactively using point prompts. Positive points (label=1) indicate areas to include, while negative points (label=0) exclude areas.

```python
import torch
import numpy as np

from sam3.model_builder import build_sam3_predictor

predictor = build_sam3_predictor(version="sam3.1")

# Start session and add initial text prompt
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video/frames",
})
session_id = response["session_id"]

predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",
})

# Helper to convert absolute pixel coords to relative [0,1] coords
def abs_to_rel_coords(coords, img_width, img_height):
    return [[x / img_width, y / img_height] for x, y in coords]

IMG_WIDTH, IMG_HEIGHT = 1280, 720  # Your video dimensions

# Add new object with point prompt
points_abs = np.array([[760, 550]])  # Single positive click
labels = np.array([1])
points_tensor = torch.tensor(
    abs_to_rel_coords(points_abs, IMG_WIDTH, IMG_HEIGHT),
    dtype=torch.float32,
)
labels_tensor = torch.tensor(labels, dtype=torch.int32)

response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": points_tensor,
    "point_labels": labels_tensor,
    "obj_id": 5,  # Specify object ID for new or existing object
})

# Refine existing object mask with multiple points
# Example: Select shirt only, exclude body
refine_points_abs = np.array([
    [740, 450],  # Positive - include shirt
    [760, 630],  # Negative - exclude legs
    [840, 640],  # Negative - exclude legs
    [760, 550],  # Positive - include shirt center
])
refine_labels = np.array([1, 0, 0, 1])

response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": torch.tensor(
        abs_to_rel_coords(refine_points_abs, IMG_WIDTH, IMG_HEIGHT),
        dtype=torch.float32,
    ),
    "point_labels": torch.tensor(refine_labels, dtype=torch.int32),
    "obj_id": 2,  # Refine existing object ID 2
})

# Propagate refined masks
for out in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
}):
    frame_idx = out["frame_index"]
    masks = out["outputs"]["out_binary_masks"]
```

## Batched Image Inference

For processing multiple images efficiently, use the `set_image_batch` method, which processes images in a single forward pass through the backbone.
```python
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

model = build_sam3_image_model()
processor = Sam3Processor(model, confidence_threshold=0.5)

# Load multiple images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
]

# Set image batch - processes all images through backbone at once
inference_state = processor.set_image_batch(images)

# Then process each image with text prompts
# Note: set_text_prompt works on the current backbone output
# You may need to iterate or handle indexing for batch processing
inference_state = processor.set_text_prompt(state=inference_state, prompt="car")

masks = inference_state["masks"]
boxes = inference_state["boxes"]
scores = inference_state["scores"]
```

## Visualization Utilities

SAM 3 provides comprehensive visualization utilities for displaying segmentation results on images and videos. These utilities handle mask overlay, bounding box drawing, and multi-frame visualization.

```python
import matplotlib.pyplot as plt
from PIL import Image

from sam3.visualization_utils import (
    plot_results,
    render_masklet_frame,
    save_masklet_video,
    save_masklet_image,
    prepare_masks_for_visualization,
    visualize_formatted_frame_output,
    draw_box_on_image,
    plot_mask,
    plot_bbox,
)

# Simple results plotting for image segmentation
image = Image.open("image.jpg")
# inference_state from processor.set_text_prompt(...)
plot_results(image, inference_state)
plt.show()

# Draw bounding box on image (xywh format)
box_xywh = [480.0, 290.0, 110.0, 360.0]
image_with_box = draw_box_on_image(image, box_xywh, color=(0, 255, 0))

# Render single frame with masklet overlays
# outputs dict with: out_boxes_xywh, out_probs, out_obj_ids, out_binary_masks
frame = plt.imread("frame.jpg")
overlay = render_masklet_frame(frame, outputs, frame_idx=0, alpha=0.5)
plt.imshow(overlay)

# Save visualization to image file
save_masklet_image(frame, outputs, "output.png", alpha=0.5, frame_idx=0)

# Save video with mask overlays
# outputs_per_frame: {frame_idx: outputs_dict}
video_frames = ["frame0.jpg", "frame1.jpg", "frame2.jpg", ...]
save_masklet_video(video_frames, outputs_per_frame, "output.mp4", alpha=0.5, fps=10)

# Prepare masks for multi-frame visualization
# Converts {frame_idx: {out_obj_ids, out_binary_masks, ...}} to {frame_idx: {obj_id: mask}}
formatted_outputs = prepare_masks_for_visualization(outputs_per_frame)

# Visualize formatted output with matplotlib
visualize_formatted_frame_output(
    frame_idx=0,
    video_frames=video_frames,
    outputs_list=[formatted_outputs],
    titles=["SAM 3 Dense Tracking outputs"],
    figsize=(6, 4),
    points_list=None,         # Optional point prompts to show
    points_labels_list=None,  # Optional point labels
)
```

## Model Configuration Options

When building models, you can configure various options for performance optimization, memory management, and feature selection based on your hardware and use case.
```python
from sam3.model_builder import (
    build_sam3_image_model,
    build_sam3_video_model,
    build_sam3_multiplex_video_model,
    download_ckpt_from_hf,
)

# Image model with full configuration
model = build_sam3_image_model(
    bpe_path=None,                    # Auto-detects bundled tokenizer
    device="cuda",                    # "cuda" or "cpu"
    eval_mode=True,                   # Set to evaluation mode
    checkpoint_path=None,             # Auto-downloads from HuggingFace
    load_from_HF=True,                # Download checkpoint if not provided
    enable_segmentation=True,         # Enable segmentation head
    enable_inst_interactivity=False,  # Enable SAM 1-style interactive prompts
    compile=False,                    # Enable torch.compile
)

# Video model with temporal disambiguation
video_model = build_sam3_video_model(
    checkpoint_path=None,
    load_from_HF=True,
    bpe_path=None,
    has_presence_token=True,             # Enable presence token for discrimination
    geo_encoder_use_img_cross_attn=True,
    strict_state_dict_loading=True,
    apply_temporal_disambiguation=True,  # Enable temporal disambiguation heuristics
    device="cuda",
    compile=False,
)

# SAM 3.1 multiplex model for fast multi-object tracking
multiplex_model = build_sam3_multiplex_video_model(
    checkpoint_path=None,
    load_from_HF=True,
    multiplex_count=16,    # Objects per bucket
    use_fa3=False,         # Flash Attention 3
    use_rope_real=False,   # Real-valued RoPE for compile compat
    strict_state_dict_loading=True,
    device="cuda",
    compile=False,
)

# Manually download checkpoints
sam3_ckpt = download_ckpt_from_hf(version="sam3")
sam31_ckpt = download_ckpt_from_hf(version="sam3.1")
```

## Summary

SAM 3 and SAM 3.1 provide a comprehensive foundation for promptable image and video segmentation tasks. The primary use cases include: (1) open-vocabulary object detection and segmentation using natural language descriptions, (2) interactive video object tracking with text prompts and click refinement, (3) visual prompting through bounding boxes for exemplar-based segmentation, and (4) multi-object tracking at scale using the efficient Object Multiplex architecture in SAM 3.1.

Integration patterns typically involve: building a model or predictor using the appropriate builder function, setting up inference sessions for video processing, using `handle_request()` for single operations and `handle_stream_request()` for video propagation, and managing sessions with start/reset/close operations.

For production deployments, enable `compile=True` with SAM 3.1 for significant speedups, use Flash Attention 3 (`use_fa3=True`) on supported hardware, and leverage multi-GPU inference via the `gpus_to_use` parameter. The session-based API ensures stateful tracking across video frames while providing flexibility for interactive refinement workflows.
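Putting these recommendations together, a minimal production-style setup looks like the sketch below. It reuses only the builder options and request types shown earlier in this document; the video path and prompt text are placeholders.

```python
from sam3.model_builder import build_sam3_predictor

# Production-oriented predictor: SAM 3.1 with compilation and Flash Attention 3
predictor = build_sam3_predictor(
    version="sam3.1",
    compile=True,
    use_fa3=True,
    async_loading_frames=True,
)

# One session per video: start, prompt, propagate, close
session_id = predictor.handle_request({
    "type": "start_session",
    "resource_path": "path/to/video/frames",
})["session_id"]

predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",
})

outputs_per_frame = {}
for out in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
}):
    outputs_per_frame[out["frame_index"]] = out["outputs"]

predictor.handle_request({"type": "close_session", "session_id": session_id})
```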