### Initialize ACT Policy

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Initializes an ACT policy with the given state and action dimensions and transformer layer counts, then runs one forward pass to predict an action chunk. The snippet assumes the ACT code is already installed and importable as `act`; it does not perform the installation itself.

```python
import torch
from act.policy import ACTPolicy

policy = ACTPolicy(
    state_dim=14,       # bimanual: 7 DOF × 2
    action_dim=14,
    chunk_size=100,     # predict 100 future actions at once
    hidden_dim=512,
    enc_layers=4,
    dec_layers=7,
    nheads=8
)

qpos = torch.zeros(1, 14)                 # current joint positions
image = torch.zeros(1, 3, 480, 640)      # camera observation

# CVAE inference: the latent z is fixed to 0 (the prior mean) rather than sampled,
# which makes the rollout deterministic
action_seq = policy(qpos, image)          # shape: (1, 100, 14)
```

--------------------------------

### Install vla-eval Package

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This command installs the vla-eval package, a unified evaluation harness for VLAs. It is a prerequisite for running evaluations on various benchmarks.

```bash
# Repository: https://github.com/allenai/vla-evaluation-harness
pip install vla-eval
```

--------------------------------

### Clone vla0-trl Repository and Install Dependencies

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This command sequence clones the vla0-trl repository and installs its Python dependencies. It is the first step in setting up the environment for fine-tuning a VLA.

```bash
# Repository: https://github.com/MilkClouds/vla0-trl
git clone https://github.com/MilkClouds/vla0-trl
cd vla0-trl
pip install -r requirements.txt
```

--------------------------------

### Octo: Generalist Policy with Diffusion Head

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Inference with Octo, a generalist robot policy. Load a pretrained model and generate actions based on language instructions and observations. Requires installation of the 'octo' library.

```python
# Paper: https://arxiv.org/abs/2405.12213
# Install: from source, see https://github.com/octo-models/octo

from octo.model.octo_model import OctoModel
import jax
import numpy as np

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")

# Build task from language instruction
task = model.create_tasks(texts=["pick up the red block"])

# Run inference
observation = {
    "image_primary": np.zeros((1, 2, 256, 256, 3), dtype=np.uint8),  # (batch, history, H, W, C)
    "timestep_pad_mask": np.ones((1, 2), dtype=bool),                # True for real (non-padded) history steps
}
actions = model.sample_actions(
    observation,
    task,
    rng=jax.random.PRNGKey(0)
)
# actions: 7-DOF end-effector actions; Octo predicts a short chunk per batch
# element, so inspect the shape rather than assuming (1, 7)
print(actions.shape)
```

--------------------------------

### Diffusion Policy: Robot Action Prediction

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Inference with Diffusion Policy for robot action prediction. This policy uses a conditional denoising diffusion process. Requires installation of 'diffusion_policy'. Predicts a chunk of future actions.

```python
# Paper: https://arxiv.org/abs/2303.04137
# Install: pip install diffusion_policy

import torch
from diffusion_policy.policy.diffusion_unet_image_policy import DiffusionUnetImagePolicy

policy = DiffusionUnetImagePolicy.from_pretrained("chichilicious/diffusion_policy_pusht")

# Observation: dict with image and agent_pos
obs_dict = {
    "image": torch.zeros(1, 2, 3, 96, 96),      # (B, T_obs, C, H, W)
    "agent_pos": torch.zeros(1, 2, 2),            # (B, T_obs, pos_dim)
}

# Predict action chunk via DDPM/DDIM denoising (T_action steps ahead)
with torch.no_grad():
    result = policy.predict_action(obs_dict)

action_pred = result["action"]   # shape: (1, 16, 2) — 16-step action chunk
# Execute first n_action_steps actions, re-plan at next observation
actions_to_execute = action_pred[0, :8]   # execute 8, observe, repeat
```
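
The "execute 8, observe, repeat" pattern above is a receding-horizon loop. The following is a minimal sketch of that control flow only, using toy stand-ins rather than the real policy: `toy_predict_chunk` and `toy_step_env` are hypothetical helpers, and the 1-D "environment" is an assumption made purely for illustration.

```python
def receding_horizon_rollout(predict_chunk, step_env, obs, n_action_steps=8, max_steps=24):
    """Alternate between predicting a chunk and executing its first few actions."""
    executed = []
    while len(executed) < max_steps:
        chunk = predict_chunk(obs)                 # e.g. 16 predicted actions
        for action in chunk[:n_action_steps]:      # execute only the first n_action_steps
            obs = step_env(obs, action)
            executed.append(action)
            if len(executed) == max_steps:
                break
    return executed, obs


# Toy stand-ins (hypothetical): a 1-D position that should approach goal = 1.0
replans = []

def toy_predict_chunk(obs, horizon=16):
    replans.append(obs)                            # record the observation at each re-plan
    return [0.1 * (1.0 - obs)] * horizon           # constant step toward the goal

def toy_step_env(obs, action):
    return obs + action


executed, final_obs = receding_horizon_rollout(toy_predict_chunk, toy_step_env, obs=0.0)
print(len(executed), len(replans), round(final_obs, 3))
```

Executing fewer actions than the policy predicts trades off reactivity (re-planning often) against temporal consistency (committing to a plan); the 16/8 split mirrors the numbers in the snippet above.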

--------------------------------

### Fine-tune Qwen2.5-VL with vla0-trl

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This command fine-tunes the Qwen2.5-VL model on LIBERO action data using the vla0-trl script. It specifies the model, dataset, action representation, and output directory.

```bash
# Fine-tune Qwen2.5-VL on LIBERO action data
python train.py \
  --model_name Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset libero \
  --action_repr text_tokens \
  --output_dir ./checkpoints \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4
```

--------------------------------

### HIL-SERL Agent for Real-World RL with Human Intervention

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual training loop for HIL-SERL, combining human demonstrations with online RL. Human interventions are recorded as positive data. The agent can be seeded with human demonstrations before online fine-tuning.

```python
# Conceptual training loop

from hil_serl import HILSERLAgent, ReplayBuffer

agent = HILSERLAgent(
    obs_dim=env.observation_space.shape,
    act_dim=env.action_space.shape,
    demo_buffer_size=200,     # pre-collected human demos
    rl_buffer_size=100_000,
)

# Phase 1: Seed with human demonstrations
agent.load_demos(demo_trajectories)   # 10–20 teleoperated demos

# Phase 2: Online RL with human-in-the-loop
for step in range(10_000):
    obs = env.get_observation()
    action = agent.select_action(obs)

    # Human operator can intervene at any time
    if human_intervening:
        action = human_controller.get_action()
        agent.record_intervention(obs, action)   # intervention = positive data

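    next_obs, reward, done, info = env.step(action)
    agent.update(obs, action, reward, next_obs, done)
    if done:
        env.reset()   # start a new episode before the next observation

# Reported results: near-perfect success rates within roughly 1 to 2.5 hours of real-world training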
```

--------------------------------

### Evaluate a VLA Model using vla-eval

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This command uses vla-eval to evaluate a specified VLA model on a chosen benchmark. It supports multiple benchmarks and can record videos of the evaluation runs.

```bash
# Evaluate any VLA on any supported benchmark
vla-eval \
  --model openvla/openvla-7b \
  --benchmark libero_goal \
  --num_episodes 100 \
  --record_video

# Supported benchmarks: LIBERO, SIMPLER, RoboMimic, MetaWorld, Calvin
# Output: JSON report with per-task success rates
# {
#   "libero_goal": {"overall": 0.72, "task_pick_up_the_alphabet_soup": 0.88, ...},
#   "total_episodes": 100,
#   "wall_time_s": 1847
# }
```

--------------------------------

### Load Open X-Embodiment Dataset with TensorFlow Datasets

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Shows how to load a constituent dataset from the Open X-Embodiment (OXE) collection using TensorFlow Datasets. This is useful for accessing standardized robot trajectories for pretraining generalist policies.

```python
# Paper: https://arxiv.org/abs/2310.08864
# Install: pip install tensorflow tensorflow-datasets

import tensorflow_datasets as tfds

# Load a specific OXE constituent dataset
ds = tfds.load(
    "bridge",               # one constituent dataset; OXE pools 60 datasets covering 22 robot embodiments
    data_dir="gs://gresearch/robotics",
    split="train"
)

for episode in ds.take(1):
    steps = list(episode["steps"])
    first_step = steps[0]
    print(first_step["observation"]["image"].shape)    # (480, 640, 3)
    print(first_step["observation"]["state"].shape)    # (7,) joint positions
    print(first_step["action"].shape)                   # (7,) delta EEF action
    print(first_step["language_instruction"].numpy())   # b"pick up the spoon"
    print(f"Episode length: {len(steps)}")              # e.g., 87 steps
```

--------------------------------

### Load and Use SmolVLAPolicy for Action Selection

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Loads a pre-trained SmolVLA policy for inference. Ensure the model is moved to CUDA if available. The policy expects a batch dictionary containing observations and task instructions.

```python
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval().to(device)

batch = {
    "observation.images.top": torch.zeros(1, 3, 224, 224, device=device),
    "observation.state": torch.zeros(1, 6, device=device),
    "task": ["pick up the cup and place it on the plate"],
}

with torch.no_grad():
    action = policy.select_action(batch)

# action shape: (1, action_dim) — continuous end-effector delta
# Async inference: vision encoder runs at low freq, action head at high freq
print(action)   # tensor([[ 0.018, -0.005,  0.022,  0.001, -0.003,  0.001]])
```

--------------------------------

### OpenVLA: VLM-based Vision-Language-Action Model

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Inference with OpenVLA, a vision-language-action model built on a VLM backbone. Requires the 'transformers' and 'torch' libraries. Loads a pretrained model and processor, then predicts a 7-DOF robot action from an image and a text prompt.

```python
# Paper: https://arxiv.org/abs/2406.09246
# Install: pip install transformers torch

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda()

image = Image.open("robot_obs.png")
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# action: np.array([dx, dy, dz, droll, dpitch, dyaw, gripper])
# e.g., [ 0.023, -0.011,  0.017,  0.002, -0.004,  0.001,  1.0  ]
```

--------------------------------

### Behavior Transformers (BeT): Multimodal Action Discretization

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual usage of Behavior Transformers (BeT) for multimodal action discretization. This model predicts action bins (cluster indices) and continuous residual offsets.

```python
# Paper: https://arxiv.org/abs/2206.11251
# Conceptual usage

from bet import BehaviorTransformer
import numpy as np

model = BehaviorTransformer(
    obs_dim=20,
    act_dim=7,
    n_clusters=24,      # k-means action bins
    n_layers=6,
    n_heads=8,
    context_len=10      # history window
)

# Input: sequence of observations
obs_seq = np.random.randn(1, 10, 20)   # (batch, context_len, obs_dim)

# Output: cluster index + continuous offset
cluster_idx, offset = model.predict(obs_seq)
# cluster_idx: int in [0, 23]
# offset: np.array of shape (7,) — residual within cluster
action = model.cluster_centers[cluster_idx] + offset
```
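
The bin-plus-offset decomposition can be sketched without the model. This example assumes the cluster centers have already been fitted (for instance by k-means) and only shows how an action is split into a bin index and a residual, then reassembled. The `encode` and `decode` helpers and the three toy centers are hypothetical and are not part of any BeT library.

```python
def nearest_center(action, centers):
    """Index of the cluster center closest to `action` (squared Euclidean distance)."""
    dists = [sum((a - c) ** 2 for a, c in zip(action, center)) for center in centers]
    return dists.index(min(dists))

def encode(action, centers):
    """Split a continuous action into (bin index, residual offset from that bin's center)."""
    idx = nearest_center(action, centers)
    offset = [a - c for a, c in zip(action, centers[idx])]
    return idx, offset

def decode(idx, offset, centers):
    """Reassemble the continuous action: center + offset."""
    return [c + o for c, o in zip(centers[idx], offset)]


# Toy 2-D example with three pre-fitted centers (assumed values)
centers = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
action = [0.9, 1.2]

idx, offset = encode(action, centers)
reconstructed = decode(idx, offset, centers)
print(idx, offset, reconstructed)
```

Predicting the bin as a classification target lets the model represent several distinct action modes, while the offset recovers the precision lost to discretization.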

--------------------------------

### Evaluate vla0-trl Checkpoint on LIBERO Benchmark

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This command evaluates a fine-tuned vla0-trl model on the LIBERO simulation benchmark. It specifies the checkpoint path and the evaluation suite.

```bash
# Evaluate on LIBERO simulation benchmark
python eval.py \
  --checkpoint ./checkpoints/final \
  --suite libero_spatial \
  --num_episodes 50
# Expected: ~90% success rate on LIBERO-Spatial
```

--------------------------------

### Temporal Ensembling for ACT Policy

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Temporal ensembling averages the overlapping predictions that successive ACT action chunks make for the same timestep, which reduces jitter. The loop below steps through a single predicted chunk (`action_seq` from the ACT policy) as the starting point.

```python
# Open-loop execution of one predicted chunk.
# With temporal ensembling, the policy is instead queried at every step and the
# overlapping predictions for each timestep are averaged to reduce jitter.
for t in range(100):
    action_t = action_seq[0, t]           # action to execute at step t
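
The averaging step itself can be sketched in plain Python. This assumes the policy is queried at every timestep, so each timestep is covered by several overlapping chunks, and it uses exponential weights `exp(-m * i)` with `i = 0` for the oldest prediction, following the weighting scheme described in the ACT paper. The `temporal_ensemble` helper and the toy chunks are hypothetical.

```python
import math

def temporal_ensemble(chunks, t, m=0.01):
    """Weighted average of every prediction made for timestep t.

    chunks[s] is the chunk predicted at timestep s (a list of action vectors),
    so chunks[s][t - s] is that chunk's prediction for timestep t.
    Weights are exp(-m * i), with i = 0 for the oldest prediction.
    """
    preds = [chunks[s][t - s] for s in range(t + 1) if t - s < len(chunks[s])]
    weights = [math.exp(-m * i) for i in range(len(preds))]
    total = sum(weights)
    dim = len(preds[0])
    return [sum(w * p[d] for w, p in zip(weights, preds)) / total for d in range(dim)]


# Toy data: the chunk predicted at step s covers steps s..s+3 (chunk_size = 4).
# Dimension 0 agrees across chunks; dimension 1 drifts with the query time s.
chunks = [[[float(s + j), 0.1 * s] for j in range(4)] for s in range(6)]

uniform = temporal_ensemble(chunks, t=3, m=0.0)    # plain average of 4 predictions
weighted = temporal_ensemble(chunks, t=3, m=0.5)   # older predictions weigh more
print(uniform, weighted)
```

A smaller `m` makes the weights more uniform across old and new predictions; a larger `m` concentrates weight on the oldest prediction, since `i = 0` is the oldest.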

--------------------------------

### π0 (pi-zero): VLM Intermediate Features for Action Expert

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Demonstrates the π0 architecture, where an action expert transformer cross-attends to all intermediate hidden states of a VLM. This allows for deeper integration of perception and action compared to architectures that only use the VLM's last hidden state.

```python
# Paper: https://arxiv.org/abs/2410.24164
# Conceptual architecture (Physical Intelligence)

# VLM backbone: PaliGemma processes image + language tokens
vlm_hidden_states = paligemma_backbone(
    images=[wrist_cam, overhead_cam],
    text="fold the shirt"
)  # list of hidden states from ALL transformer layers

# Action expert: lightweight transformer that cross-attends to ALL VLM layers
# (not just the last hidden state — key difference from CogACT/GR00T)
action_expert_out = action_expert(
    proprioception=robot_joint_positions,   # (14,) bimanual joints
    vlm_features=vlm_hidden_states,         # full intermediate features
)

# Flow matching: integrate the learned velocity field from noise to an action chunk
action = flow_matching_ode_solve(
    velocity_fn=action_expert_out,
    x_init=torch.randn(50, action_dim),   # noise for a 50-step action chunk
    n_steps=10
)
# action: (50, action_dim), a chunk of 50 future actions
```
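
The ODE integration step can be illustrated with a forward-Euler solver. This is a toy sketch, not the π0 implementation: in the real model the velocity comes from the action expert, whereas `toy_velocity` here is a hand-written field that points along the straight line toward a fixed, assumed `target` action.

```python
import random

def euler_flow(velocity_fn, x_init, n_steps=10):
    """Integrate dx/dtau = velocity_fn(x, tau) from tau = 0 (noise) to tau = 1 (action)."""
    x = list(x_init)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt
        v = velocity_fn(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -0.2, 0.1]   # hypothetical "true" action for the toy field

def toy_velocity(x, tau):
    # Straight-line (rectified-flow style) field: head for `target`,
    # arriving exactly at tau = 1.
    return [(t - xi) / (1.0 - tau) for t, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
action = euler_flow(toy_velocity, noise, n_steps=10)
print(action)
```

With a learned velocity field the result is only approximate, and the number of Euler steps trades accuracy against inference latency.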

--------------------------------

### Generate Visual Chain-of-Thought and Action with CoT-VLA

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This snippet demonstrates how CoT-VLA generates an intermediate visual goal state (future_image_tokens) and then uses it to predict a final action. It requires loading an image and an instruction.

```python
current_image = load_image("robot_obs.png")
instruction   = "stack all three blocks"

# Step 1: Generate visual chain-of-thought (predicted future state)
future_image_tokens = cot_vla.generate_visual_cot(
    image=current_image,
    instruction=instruction,
    n_future_tokens=256    # 16×16 image token grid
)
# future_image_tokens represents an imagined intermediate goal state

# Step 2: Generate action conditioned on current obs + visual CoT
action = cot_vla.generate_action(
    image=current_image,
    instruction=instruction,
    visual_cot=future_image_tokens   # ground action in imagined future
)
# action: (7,) continuous end-effector delta
```

--------------------------------

### ACT/ALOHA: Action Chunking Transformer

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This section introduces the Action Chunking Transformer (ACT) and its application in the ALOHA robot. It focuses on predicting multiple future actions at once using a CVAE-based transformer for dexterous manipulation.

```python
# Paper: https://arxiv.org/abs/2304.13705

```

--------------------------------

### RT-1 Robot Policy Inference

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual inference pipeline for RT-1, a transformer-based robot policy. It takes an image observation and natural language instruction as input and outputs discretized action tokens.

```python
# Conceptual inference pipeline for RT-1
# Paper: https://arxiv.org/abs/2212.06817

import numpy as np

# RT-1 input: image observation + natural language instruction
observation = {
    "image": np.zeros((300, 300, 3), dtype=np.uint8),  # RGB camera frame
    "instruction": "pick up the soda can and place it on the counter"
}

# RT-1 output: 11 action dimensions, each discretized into 256 bins
# 7 arm (x, y, z, roll, pitch, yaw, gripper) + 3 base (x, y, yaw) + 1 mode (arm / base / terminate)
# Model size: 35M parameters
# Training data: 130k demonstrations, 700+ tasks, 13 robots

predicted_action_tokens = rt1_model.predict(
    image=observation["image"],
    instruction=observation["instruction"])
# Output shape: (11,) integer tokens in range [0, 255]
# Decode to continuous actions via: action = (token / 255) * action_range + action_min
```
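
The decode formula in the last comment is uniform binning. A minimal sketch of both directions follows, assuming each action dimension has a known range `[low, high]`; the `discretize` and `undiscretize` helpers and the `[-1, 1]` range are illustrative assumptions, not RT-1 code.

```python
def discretize(value, low, high, n_bins=256):
    """Map a continuous value in [low, high] to an integer token in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    return round((clipped - low) / (high - low) * (n_bins - 1))

def undiscretize(token, low, high, n_bins=256):
    """Inverse mapping: action = (token / 255) * action_range + action_min."""
    return token / (n_bins - 1) * (high - low) + low


low, high = -1.0, 1.0            # assumed per-dimension action range
value = 0.3
token = discretize(value, low, high)
recovered = undiscretize(token, low, high)
print(token, recovered)
```

The round trip is lossy: the recovered value can differ from the original by up to half a bin width, here `(high - low) / 255 / 2`.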

--------------------------------

### UniVLA Training Objective and Inference

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

This code illustrates the UniVLA training loss, which adds a world-modeling term (next-frame prediction) to the action-prediction loss. At inference only the action prediction head is active, so the world-model term adds no inference cost.

```python
# Paper: https://arxiv.org/abs/2506.19850
# Training objective (simplified)

# UniVLA loss = L_action + λ * L_world_model
# L_action     : cross-entropy on discretized action tokens
# L_world_model: next-frame prediction (training only, not used at inference)

# Training batch
batch = {
    "images":       torch.zeros(B, T, 3, 224, 224),  # video frames
    "instructions": ["pick up the cup"] * B,
    "actions":      torch.zeros(B, T, action_dim),    # ground-truth actions
}

lambda_world = 0.1   # λ: weight on the world-model term
loss_action = model.compute_action_loss(batch)
loss_world  = model.compute_world_model_loss(batch)   # predicts frame t+1
total_loss  = loss_action + lambda_world * loss_world
total_loss.backward()

# Inference: world model head is INACTIVE — same speed as a standard VLA
action = model.predict_action(
    image=current_frame,
    instruction="pick up the cup"
)  # shape: (action_dim,)
```

--------------------------------

### RT-2 VLM Robot Control Inference

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Conceptual inference for RT-2, which uses a VLM backbone and represents robot actions as text tokens. This enables emergent reasoning and generalization to novel instructions.

```python
# Conceptual RT-2 inference — actions represented as text tokens
# Paper: https://arxiv.org/abs/2307.15818

prompt = (
    "What action should the robot take to pick up the green cup?\n"
    "<image>\n"
    "Answer with robot actions:"
)

# RT-2 treats robot actions as additional tokens in the VLM vocabulary
# Action representation: each DOF discretized to 256 bins, serialized as text
# e.g., "255 128 089 200 150 175 001" → 7-DOF end-effector delta + gripper

response = rt2_model.generate(prompt, image=camera_frame, max_new_tokens=16)
# response: "187 142 095 210 160 180 000"

action_tokens = [int(t) for t in response.strip().split()]
action_continuous = decode_action_tokens(action_tokens)
# action_continuous: np.array([dx, dy, dz, droll, dpitch, dyaw, gripper])
```
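
The snippet above leaves `decode_action_tokens` undefined. One possible definition is sketched here, assuming uniform binning into 256 bins and a symmetric `[-1, 1]` range for every dimension. Both are assumptions for illustration; the actual ranges depend on the robot and dataset.

```python
def decode_action_tokens(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map integer bin indices back to continuous values in [low, high]."""
    return [low + (high - low) * t / (n_bins - 1) for t in tokens]


response = "187 142 095 210 160 180 000"     # example model output from above
action_tokens = [int(t) for t in response.strip().split()]
action_continuous = decode_action_tokens(action_tokens)
print(action_tokens)
print(action_continuous)
```

Zero-padded strings such as `"095"` parse cleanly with `int`, so the text form can keep a fixed width per token.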

--------------------------------

### CogACT: VLM + DiT Action Head

Source: https://context7.com/milkclouds/awesome-vla-study/llms.txt

Illustrates the CogACT architecture, which combines a VLM with a Diffusion Transformer (DiT) action head. The VLM's output conditions the DiT for action generation. Requires a VLM backbone and a DiT action head.

```python
# Paper: https://arxiv.org/abs/2411.19650
# Architecture overview

# Stage 1: VLM encodes image + language → context vector
vlm_context = vlm_backbone(
    image=camera_frame,          # (H, W, 3)
    text="pick up the red cube"  # tokenized instruction
)  # shape: (seq_len, hidden_dim=4096)

# Stage 2: DiT action head denoises action conditioned on VLM last hidden state
noisy_action = torch.randn(1, action_dim)   # start from noise
for t in reversed(range(T_diffusion)):
    noisy_action = dit_action_head(
        x=noisy_action,
        conditioning=vlm_context[-1],   # only last hidden state used
        timestep=t
    )

# Final output: continuous action (batch dimension kept from the initial noise)
action = noisy_action   # shape: (1, action_dim)
```
