### Install and Initialize ACT Policy

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Initializes an ACT policy model with the specified state and action dimensions and layer configuration, then runs one forward pass. Use this to set up a basic ACT policy for action prediction. The snippet assumes the ACT policy library is already installed.

```python
import torch
from act.policy import ACTPolicy

policy = ACTPolicy(
    state_dim=14,       # bimanual: 7 DOF × 2
    action_dim=14,
    chunk_size=100,     # predict 100 future actions at once
    hidden_dim=512,
    enc_layers=4,
    dec_layers=7,
    nheads=8
)

qpos = torch.zeros(1, 14)               # current joint positions
image = torch.zeros(1, 3, 480, 640)     # camera observation

# CVAE inference: set the latent z to 0 (the prior mean) for a deterministic rollout
action_seq = policy(qpos, image)  # shape: (1, 100, 14)
```

--------------------------------

### Install vla-eval Package

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Installs the vla-eval package, a unified evaluation harness for VLAs. It is a prerequisite for running evaluations on the supported benchmarks.

```bash
# Repository: https://github.com/allenai/vla-evaluation-harness
pip install vla-eval
```

--------------------------------

### Clone vla0-trl Repository and Install Dependencies

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Clones the vla0-trl repository and installs its Python dependencies. This is the first step in setting up the environment for fine-tuning a VLA.

```bash
# Repository: https://github.com/MilkClouds/vla0-trl
git clone https://github.com/MilkClouds/vla0-trl
cd vla0-trl
pip install -r requirements.txt
```

--------------------------------

### Octo: Generalist Policy with Diffusion Head

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Runs inference with Octo, a generalist robot policy: load a pretrained model and generate actions from a language instruction and observations. Requires the `octo` library.

```python
# Paper: https://arxiv.org/abs/2405.12213
# Install from source: https://github.com/octo-models/octo (clone, then `pip install -e .`)
from octo.model.octo_model import OctoModel
import jax
import numpy as np

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")

# Build task from language instruction
task = model.create_tasks(texts=["pick up the red block"])

# Run inference
observation = {
    "image_primary": np.zeros((1, 2, 256, 256, 3), dtype=np.uint8),  # (batch, history, H, W, C)
    "timestep_pad_mask": np.ones((1, 2), dtype=bool),                # (batch, history), as in the Octo README
    "pad_mask_dict": {"image_primary": np.ones((1, 2), dtype=bool)},
}
actions = model.sample_actions(
    observation,
    task,
    rng=jax.random.PRNGKey(0)
)
# actions shape: (batch, action_chunk, action_dim), i.e. a short chunk of 7-DOF end-effector deltas
print(actions[0, 0])  # first action of the chunk, e.g., [ 0.012, -0.003, 0.021, 0.001, -0.005, 0.002, 0.98]
```

--------------------------------

### Diffusion Policy: Robot Action Prediction

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Runs inference with Diffusion Policy, which predicts robot actions through a conditional denoising diffusion process. Requires the `diffusion_policy` package. Each call predicts a chunk of future actions.

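A predicted chunk is usually consumed in receding-horizon fashion: execute only the first few actions, observe again, and re-plan. The sketch below illustrates that loop with a stand-in policy and toy 1-D dynamics. `dummy_policy` and `receding_horizon_rollout` are hypothetical names for illustration, not part of the `diffusion_policy` package.

```python
from typing import Callable, List, Tuple

Action = List[float]


def dummy_policy(obs: float, horizon: int = 16) -> List[Action]:
    """Hypothetical stand-in for a learned policy: plan `horizon` 1-D steps toward 0."""
    return [[-0.1 * obs] for _ in range(horizon)]


def receding_horizon_rollout(
    policy: Callable[[float], List[Action]],
    obs: float,
    total_steps: int,
    n_action_steps: int = 8,
) -> Tuple[List[Action], int, float]:
    """Execute the first `n_action_steps` of each predicted chunk, then re-plan."""
    executed: List[Action] = []
    n_plans = 0
    while len(executed) < total_steps:
        chunk = policy(obs)              # predict a full chunk from the latest observation
        n_plans += 1
        for action in chunk[:n_action_steps]:
            if len(executed) == total_steps:
                break
            obs += action[0]             # toy dynamics: state += action
            executed.append(action)
    return executed, n_plans, obs


executed, n_plans, final_obs = receding_horizon_rollout(dummy_policy, obs=1.0, total_steps=20)
print(len(executed), n_plans)  # 20 executed actions from 3 planning calls
```

Executing fewer actions per chunk makes the controller more reactive at the cost of more policy calls. The library-based snippet for this entry follows.
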
```python
# Paper: https://arxiv.org/abs/2303.04137
# Install: pip install diffusion_policy
import torch
from diffusion_policy.policy.diffusion_unet_image_policy import DiffusionUnetImagePolicy

policy = DiffusionUnetImagePolicy.from_pretrained("chichilicious/diffusion_policy_pusht")

# Observation: dict with image and agent_pos
obs_dict = {
    "image": torch.zeros(1, 2, 3, 96, 96),   # (B, T_obs, C, H, W)
    "agent_pos": torch.zeros(1, 2, 2),       # (B, T_obs, pos_dim)
}

# Predict action chunk via DDPM/DDIM denoising (T_action steps ahead)
with torch.no_grad():
    result = policy.predict_action(obs_dict)

action_pred = result["action"]  # shape: (1, 16, 2), a 16-step action chunk

# Execute first n_action_steps actions, re-plan at next observation
actions_to_execute = action_pred[0, :8]  # execute 8, observe, repeat
```

--------------------------------

### Fine-tune Qwen2.5-VL with vla0-trl

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This command fine-tunes the Qwen2.5-VL model on LIBERO action data using the vla0-trl script. It specifies the model, dataset, action representation, and output directory.

```bash
# Fine-tune Qwen2.5-VL on LIBERO action data
python train.py \
  --model_name Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset libero \
  --action_repr text_tokens \
  --output_dir ./checkpoints \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4
```

--------------------------------

### HIL-SERL Agent for Real-World RL with Human Intervention

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual training loop for HIL-SERL, combining human demonstrations with online RL. Human interventions are recorded as positive data. The agent can be seeded with human demonstrations before online fine-tuning.

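One way to make "interventions are recorded as positive data" concrete is to keep two buffers and draw each training batch half from demonstrations and interventions and half from online experience. The sketch below assumes that 50/50 split. `DualBuffer` and its methods are hypothetical names, not the HIL-SERL API.

```python
import random
from collections import deque


class DualBuffer:
    """Hypothetical demo + online replay buffers with balanced sampling."""

    def __init__(self, rl_capacity: int = 100_000, seed: int = 0):
        self.demo = []                        # demos and human interventions
        self.rl = deque(maxlen=rl_capacity)   # all online transitions (FIFO)
        self.rng = random.Random(seed)

    def add(self, transition, is_intervention: bool = False):
        # Every transition is online experience; interventions are also kept as demos.
        self.rl.append(transition)
        if is_intervention:
            self.demo.append(transition)

    def sample(self, batch_size: int):
        if not self.demo or not self.rl:
            raise ValueError("both buffers must be non-empty before sampling")
        half = batch_size // 2
        demo_part = self.rng.choices(self.demo, k=half)
        rl_part = self.rng.choices(list(self.rl), k=batch_size - half)
        return demo_part + rl_part


buf = DualBuffer()
buf.demo.extend(("demo", i) for i in range(20))   # seed with pre-collected demos
for i in range(100):
    buf.add(("policy", i))                        # autonomous transitions
buf.add(("human", 0), is_intervention=True)       # a recorded intervention
batch = buf.sample(8)
```

The fixed split keeps scarce human data from being drowned out as the online buffer grows. The conceptual loop for this entry follows.
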
```python
# Conceptual training loop
from hil_serl import HILSERLAgent, ReplayBuffer

agent = HILSERLAgent(
    obs_dim=env.observation_space.shape,
    act_dim=env.action_space.shape,
    demo_buffer_size=200,       # pre-collected human demos
    rl_buffer_size=100_000,
)

# Phase 1: Seed with human demonstrations
agent.load_demos(demo_trajectories)  # 10–20 teleoperated demos

# Phase 2: Online RL with human-in-the-loop
for step in range(10_000):
    obs = env.get_observation()
    action = agent.select_action(obs)

    # Human operator can intervene at any time
    if human_intervening:
        action = human_controller.get_action()
        agent.record_intervention(obs, action)  # intervention = positive data

    next_obs, reward, done, info = env.step(action)
    agent.update(obs, action, reward, next_obs, done)

# Reported in the HIL-SERL paper: near-perfect success within about 1 to 2.5 hours of real-world training
```

--------------------------------

### Evaluate a VLA Model using vla-eval

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Uses vla-eval to evaluate a specified VLA model on a chosen benchmark. It supports multiple benchmarks and can record videos of the evaluation runs.

```bash
# Evaluate any VLA on any supported benchmark
vla-eval \
  --model openvla/openvla-7b \
  --benchmark libero_goal \
  --num_episodes 100 \
  --record_video

# Supported benchmarks: LIBERO, SIMPLER, RoboMimic, MetaWorld, Calvin
# Output: JSON report with per-task success rates
# {
#   "libero_goal": {"overall": 0.72, "task_pick_up_the_alphabet_soup": 0.88, ...},
#   "total_episodes": 100,
#   "wall_time_s": 1847
# }
```

--------------------------------

### Load Open X-Embodiment Dataset with TensorFlow Datasets

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Shows how to load a constituent dataset from the Open X-Embodiment (OXE) collection using TensorFlow Datasets. This is useful for accessing standardized robot trajectories for pretraining generalist policies.

```python
# Paper: https://arxiv.org/abs/2310.08864
# Install: pip install tensorflow tensorflow-datasets
import tensorflow_datasets as tfds

# Load a specific OXE constituent dataset
ds = tfds.load(
    "bridge",                           # one constituent dataset; OXE spans 22 robot embodiments
    data_dir="gs://gresearch/robotics",
    split="train"
)

for episode in ds.take(1):
    steps = list(episode["steps"])
    first_step = steps[0]
    print(first_step["observation"]["image"].shape)     # (480, 640, 3)
    print(first_step["observation"]["state"].shape)     # (7,) joint positions
    print(first_step["action"].shape)                   # (7,) delta EEF action
    print(first_step["language_instruction"].numpy())   # b"pick up the spoon"
    print(f"Episode length: {len(steps)}")              # e.g., 87 steps
```

--------------------------------

### Load and Use SmolVLAPolicy for Action Selection

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Loads a pre-trained SmolVLA policy for inference. The snippet moves the model and inputs to CUDA, so it assumes a GPU is available. The policy expects a batch dictionary containing observations and the task instruction.

```python
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
import torch

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval().cuda()

batch = {
    "observation.images.top": torch.zeros(1, 3, 224, 224).cuda(),
    "observation.state": torch.zeros(1, 6).cuda(),
    "task": ["pick up the cup and place it on the plate"],
}

with torch.no_grad():
    action = policy.select_action(batch)

# action shape: (1, action_dim), a continuous end-effector delta
# Async inference: vision encoder runs at low freq, action head at high freq
print(action)  # tensor([[ 0.018, -0.005, 0.022, 0.001, -0.003, 0.001]])
```

--------------------------------

### OpenVLA: VLM-based Vision-Language-Action Model

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Runs inference with OpenVLA, a VLM-based vision-language-action model. Requires the `transformers` and `torch` libraries.
Loads a pretrained model and processor, then predicts robot actions based on an image and a text prompt.

```python
# Paper: https://arxiv.org/abs/2406.09246
# Install: pip install transformers torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda()

image = Image.open("robot_obs.png")
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# action: np.array([dx, dy, dz, droll, dpitch, dyaw, gripper])
# e.g., [ 0.023, -0.011, 0.017, 0.002, -0.004, 0.001, 1.0 ]
```

--------------------------------

### Behavior Transformers (BeT): Multimodal Action Discretization

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual usage of Behavior Transformers (BeT) for multimodal action discretization. This model predicts action bins (cluster indices) and continuous residual offsets.

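The bin-plus-offset decomposition can be sketched without any model: an action maps to its nearest cluster center (the bin) plus the residual from that center, and adding the two recovers the action. The helper names and the three example centers below are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def encode_action(
    action: Sequence[float], centers: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Split a continuous action into (nearest-center index, residual offset)."""

    def sq_dist(center: Sequence[float]) -> float:
        return sum((a - c) ** 2 for a, c in zip(action, center))

    idx = min(range(len(centers)), key=lambda k: sq_dist(centers[k]))
    offset = [a - c for a, c in zip(action, centers[idx])]
    return idx, offset


def decode_action(
    idx: int, offset: Sequence[float], centers: Sequence[Sequence[float]]
) -> List[float]:
    """Reconstruct the continuous action: cluster center + residual offset."""
    return [c + o for c, o in zip(centers[idx], offset)]


centers = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]   # hypothetical k-means centers
action = [0.9, 1.2]
idx, offset = encode_action(action, centers)      # nearest center is index 1
reconstructed = decode_action(idx, offset, centers)
```

Predicting a categorical bin lets the model represent several distinct action modes, while the offset restores the precision lost to binning. The conceptual model usage for this entry follows.
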
```python
# Paper: https://arxiv.org/abs/2206.11251
# Conceptual usage
from bet import BehaviorTransformer
import numpy as np

model = BehaviorTransformer(
    obs_dim=20,
    act_dim=7,
    n_clusters=24,    # k-means action bins
    n_layers=6,
    n_heads=8,
    context_len=10    # history window
)

# Input: sequence of observations
obs_seq = np.random.randn(1, 10, 20)  # (batch, context_len, obs_dim)

# Output: cluster index + continuous offset
cluster_idx, offset = model.predict(obs_seq)
# cluster_idx: int in [0, 23]
# offset: np.array of shape (7,), the residual within the cluster
action = model.cluster_centers[cluster_idx] + offset
```

--------------------------------

### Evaluate vla0-trl Checkpoint on LIBERO Benchmark

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Evaluates a fine-tuned vla0-trl model on the LIBERO simulation benchmark. It specifies the checkpoint path and the evaluation suite.

```bash
# Evaluate on LIBERO simulation benchmark
python eval.py \
  --checkpoint ./checkpoints/final \
  --suite libero_spatial \
  --num_episodes 50
# Expected: ~90% success rate on LIBERO-Spatial
```

--------------------------------

### Temporal Ensembling for ACT Policy

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Illustrates where temporal ensembling fits: overlapping predictions from the ACT policy's action sequences are averaged to reduce jitter. The loop below only replays a single, already generated action sequence; the comments describe what ensembling changes.

```python
# Without ensembling: execute one chunk open-loop and re-plan every chunk_size steps
for t in range(100):
    action_t = action_seq[0, t]  # execute action at step t
# With temporal ensembling: query the policy at every step and average all
# overlapping predictions for the current step, which reduces jitter
```

--------------------------------

### π0 (pi-zero): VLM Intermediate Features for Action Expert

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Demonstrates the π0 architecture, where an action expert transformer cross-attends to all intermediate hidden states of a VLM.
This allows for deeper integration of perception and action compared to architectures that only use the VLM's last hidden state.

```python
# Paper: https://arxiv.org/abs/2410.24164
# Conceptual architecture (Physical Intelligence)
import torch

# VLM backbone: PaliGemma processes image + language tokens
vlm_hidden_states = paligemma_backbone(
    images=[wrist_cam, overhead_cam],
    text="fold the shirt"
)  # list of hidden states from ALL transformer layers

# Action expert: lightweight transformer that cross-attends to ALL VLM layers
# (not just the last hidden state, the key difference from CogACT/GR00T)
action_expert_out = action_expert(
    proprioception=robot_joint_positions,  # (14,) bimanual joints
    vlm_features=vlm_hidden_states,        # full intermediate features
)

# Flow matching: ODE integration from noise → action
action = flow_matching_ode_solve(
    score_fn=action_expert_out,            # learned vector field
    x_init=torch.randn(action_dim),
    n_steps=10
)
# action: a chunk of 50 future actions (action horizon H = 50 in the paper)
```

--------------------------------

### Generate Visual Chain-of-Thought and Action with CoT-VLA

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This snippet demonstrates how CoT-VLA generates an intermediate visual goal state (`future_image_tokens`) and then uses it to predict a final action. It requires loading an image and an instruction.

```python
current_image = load_image("robot_obs.png")
instruction = "stack all three blocks"

# Step 1: Generate visual chain-of-thought (predicted future state)
future_image_tokens = cot_vla.generate_visual_cot(
    image=current_image,
    instruction=instruction,
    n_future_tokens=256  # 16×16 image token grid
)
# future_image_tokens represents an imagined intermediate goal state

# Step 2: Generate action conditioned on current obs + visual CoT
action = cot_vla.generate_action(
    image=current_image,
    instruction=instruction,
    visual_cot=future_image_tokens  # ground action in imagined future
)
# action: (7,) continuous end-effector delta
```

--------------------------------

### ACT/ALOHA: Action Chunking Transformer

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This section introduces the Action Chunking Transformer (ACT) and its application in the ALOHA robot. It focuses on predicting multiple future actions at once using a CVAE-based transformer for dexterous manipulation.

```python
# Paper: https://arxiv.org/abs/2304.13705
```

--------------------------------

### RT-1 Robot Policy Inference

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual inference pipeline for RT-1, a transformer-based robot policy. It takes an image observation and natural language instruction as input and outputs discretized action tokens.

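Discretizing an action dimension into 256 bins is a uniform quantization over that dimension's range. The sketch below shows one way to write the round trip, assuming per-dimension bounds `low` and `high` that are illustrative values, not RT-1's actual ranges. The function names are hypothetical.

```python
def discretize(value: float, low: float, high: float, n_bins: int = 256) -> int:
    """Map a continuous value in [low, high] to an integer token in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)          # out-of-range values saturate
    return round((clipped - low) / (high - low) * (n_bins - 1))


def undiscretize(token: int, low: float, high: float, n_bins: int = 256) -> float:
    """Map a token back to the continuous value at the center of its bin."""
    return low + token / (n_bins - 1) * (high - low)


low, high = -1.0, 1.0                 # assumed bounds for one action dimension
token = discretize(0.25, low, high)   # integer in [0, 255]
recovered = undiscretize(token, low, high)
max_error = (high - low) / 255 / 2    # half a bin width
```

The reconstruction error is bounded by half a bin width, which is the precision traded for a classification-style output head. The conceptual pipeline for this entry follows.
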
```python
# Conceptual inference pipeline for RT-1
# Paper: https://arxiv.org/abs/2212.06817
import numpy as np

# RT-1 input: image observation + natural language instruction
observation = {
    "image": np.zeros((300, 300, 3), dtype=np.uint8),  # RGB camera frame
    "instruction": "pick up the soda can and place it on the counter"
}

# RT-1 output: 11 action dimensions, each discretized into 256 bins
#   arm [x, y, z, roll, pitch, yaw, gripper] (7) + base [x, y, yaw] (3)
#   + mode switch (1): control arm, control base, or terminate episode
# Model size: 35M parameters
# Training data: 130k demonstrations, 700+ tasks, 13 robots
predicted_action_tokens = rt1_model.predict(
    image=observation["image"],
    instruction=observation["instruction"])
# Output shape: (11,) integer tokens in range [0, 255]
# Decode to continuous actions via: action = (token / 255) * action_range + action_min
```

--------------------------------

### UniVLA Training Objective and Inference

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Illustrates the UniVLA training loss, which combines action prediction and world modeling. At inference, only the action prediction head is active, so the world-model objective adds no runtime cost.

```python
# Paper: https://arxiv.org/abs/2506.19850
# Training objective (simplified)
import torch

# UniVLA loss = L_action + λ * L_world_model
#   L_action     : cross-entropy on discretized action tokens
#   L_world_model: next-frame prediction (training only, not used at inference)

B, T, action_dim = 8, 4, 7  # illustrative batch size, frames per clip, action size

# Training batch
batch = {
    "images": torch.zeros(B, T, 3, 224, 224),   # video frames
    "instructions": ["pick up the cup"] * B,
    "actions": torch.zeros(B, T, action_dim),   # ground-truth actions
}

loss_action = model.compute_action_loss(batch)
loss_world = model.compute_world_model_loss(batch)  # predicts frame t+1
total_loss = loss_action + 0.1 * loss_world         # λ = 0.1
total_loss.backward()

# Inference: world model head is INACTIVE, so speed matches a standard VLA
action = model.predict_action(
    image=current_frame,
    instruction="pick up the cup"
)  # shape: (action_dim,)
```

--------------------------------

### RT-2 VLM Robot Control Inference

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual inference for RT-2, which uses a VLM backbone and represents robot actions as text tokens. This enables emergent reasoning and generalization to novel instructions.

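Representing actions as text comes down to serializing the per-dimension bin indices into a string the language model can emit, then parsing that string back. A minimal sketch of both directions follows. The zero-padded three-digit format and the function names are assumptions made for illustration.

```python
from typing import List, Sequence


def tokens_to_text(tokens: Sequence[int]) -> str:
    """Serialize bin indices in [0, 255] as zero-padded, space-separated text."""
    if not all(0 <= t <= 255 for t in tokens):
        raise ValueError("every token must lie in [0, 255]")
    return " ".join(f"{t:03d}" for t in tokens)


def text_to_tokens(text: str, expected_len: int = 7) -> List[int]:
    """Parse generated text back into bin indices, rejecting malformed output."""
    tokens = [int(part) for part in text.strip().split()]   # int() raises ValueError on junk
    if len(tokens) != expected_len:
        raise ValueError(f"expected {expected_len} tokens, got {len(tokens)}")
    if not all(0 <= t <= 255 for t in tokens):
        raise ValueError("every token must lie in [0, 255]")
    return tokens


bins = [187, 142, 95, 210, 160, 180, 0]
text = tokens_to_text(bins)       # "187 142 095 210 160 180 000"
parsed = text_to_tokens(text)     # round-trips to the original bins
```

Validating length and range matters because a language model can emit malformed strings, and a controller should reject those rather than act on them. The conceptual inference snippet for this entry follows.
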
```python
# Conceptual RT-2 inference: actions represented as text tokens
# Paper: https://arxiv.org/abs/2307.15818

prompt = (
    "What action should the robot take to pick up the green cup?\n"
    "\n"
    "Answer with robot actions:"
)

# RT-2 treats robot actions as additional tokens in the VLM vocabulary
# Action representation: each DOF discretized to 256 bins, serialized as text
# e.g., "255 128 089 200 150 175 001" → 7-DOF end-effector delta + gripper
response = rt2_model.generate(prompt, image=camera_frame, max_new_tokens=16)
# response: "187 142 095 210 160 180 000"

action_tokens = [int(t) for t in response.strip().split()]
action_continuous = decode_action_tokens(action_tokens)
# action_continuous: np.array([dx, dy, dz, droll, dpitch, dyaw, gripper])
```

--------------------------------

### CogACT: VLM + DiT Action Head

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Illustrates the CogACT architecture, which combines a VLM with a Diffusion Transformer (DiT) action head. The VLM's output conditions the DiT for action generation. Requires a VLM backbone and a DiT action head.

```python
# Paper: https://arxiv.org/abs/2411.19650
# Architecture overview

# Stage 1: VLM encodes image + language → context vector
vlm_context = vlm_backbone(
    image=camera_frame,             # (H, W, 3)
    text="pick up the red cube"     # tokenized instruction
)  # shape: (seq_len, hidden_dim=4096)

# Stage 2: DiT action head denoises action conditioned on VLM last hidden state
noisy_action = torch.randn(1, action_dim)  # start from noise
for t in reversed(range(T_diffusion)):
    noisy_action = dit_action_head(
        x=noisy_action,
        conditioning=vlm_context[-1],  # only last hidden state used
        timestep=t
    )

# Final output: continuous action
action = noisy_action  # shape: (1, action_dim)
```
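
--------------------------------

### Temporal Ensembling Sketch for Chunked Policies

The temporal ensembling entry for ACT above only replays one action sequence. As a supplement, here is a minimal sketch of the averaging itself: the policy is queried at every step, every prediction for the current time step is collected, and the executed action is their weighted mean with weights `exp(-m * i)`, where `i = 0` is the oldest prediction (the scheme described in the ACT paper). `toy_policy` and `temporal_ensemble_rollout` are hypothetical names, and the alternating bias is an artificial stand-in for prediction jitter.

```python
import math
from typing import Callable, List

Action = List[float]


def temporal_ensemble_rollout(
    predict_chunk: Callable[[int], List[Action]],
    n_steps: int,
    chunk_size: int,
    m: float = 0.01,
) -> List[Action]:
    """Query the policy every step and average overlapping predictions per time step."""
    # predictions[s] holds every prediction made so far for time step s, oldest first
    predictions: List[List[Action]] = [[] for _ in range(n_steps + chunk_size)]
    executed: List[Action] = []
    for t in range(n_steps):
        chunk = predict_chunk(t)                     # actions for steps t .. t+chunk_size-1
        for offset, action in enumerate(chunk[:chunk_size]):
            predictions[t + offset].append(action)
        candidates = predictions[t]
        weights = [math.exp(-m * i) for i in range(len(candidates))]
        total = sum(weights)
        dim = len(candidates[0])
        executed.append(
            [sum(w * a[d] for w, a in zip(weights, candidates)) / total for d in range(dim)]
        )
    return executed


def toy_policy(t: int, chunk_size: int = 4) -> List[Action]:
    """Hypothetical policy: the ideal action at step s is s, plus a bias that flips sign each call."""
    bias = 0.2 if t % 2 else -0.2
    return [[t + j + bias] for j in range(chunk_size)]


smoothed = temporal_ensemble_rollout(toy_policy, n_steps=6, chunk_size=4, m=0.0)
```

With `m=0.0` all predictions are weighted equally, so once four chunks overlap the alternating bias cancels and the executed action at step 3 is 3.0. A larger `m` puts more weight on older predictions, and a smaller `m` lets new observations take effect faster.
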