### Install KernelBench

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Clones the KernelBench repository for problem examples.

```bash
git clone https://github.com/ScalingIntelligence/KernelBench.git
```

--------------------------------

### Pinned GB300 / SM103 Benchmark Environment Setup

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to create a conda environment and install necessary dependencies for the GB300 / SM103 benchmark, including PyTorch, CuTeDSL, Triton, and optional Quack.

```bash
conda create -y -n cute python=3.12
conda run -n cute python -m pip install --upgrade pip setuptools wheel packaging ninja
conda run -n cute python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu130 torch
conda run -n cute python -m pip install 'nvidia-cutlass-dsl==4.4.2' cuda-python triton matplotlib pytest pytest-cov
conda run -n cute python -m pip install -e '.[bench]'
conda run -n cute python -m pip install 'git+https://github.com/Dao-AILab/quack.git'  # optional comparison baseline
```

--------------------------------

### Run End-to-End Example

Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md

Executes an end-to-end test, which requires network access and a GPU.

```bash
uv run python e2e_test.py
```

--------------------------------

### Install KernelAgent-Oink

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md

Installation instructions for KernelAgent-Oink from the repository root.

```bash
pip install -e ./oink
pip install -e "./oink[bench]"  # optional benchmark/plot deps
```

--------------------------------

### Reproducible GB300 benchmark environment setup

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md

Conda environment setup for a reproducible GB300 benchmark.

```bash
conda create -y -n cute python=3.12
conda run -n cute python -m pip install --upgrade pip setuptools wheel packaging ninja
conda run -n cute python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu130 torch
conda run -n cute python -m pip install 'nvidia-cutlass-dsl==4.4.2' cuda-python triton matplotlib
conda run -n cute python -m pip install -e './oink[bench]'
```

--------------------------------

### Install KernelAgent

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Installs KernelAgent using pip.

```bash
pip install -e .
```

--------------------------------

### TensorMapArray Acquire and Load Example

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

This snippet demonstrates acquiring a TensorMapArray using `experimental_tensormap_fenceproxy_acquire` and then loading data into it using `_experimental_descriptor_load`. It includes comments explaining the offset calculation.

```python
# pyre-ignore [20]
               tl.extra.cuda.experimental_tensormap_fenceproxy_acquire(device_desc_q)


q = tl._experimental_descriptor_load(
          device_desc_q,
          [ #offset as explained in comments above (where the box starts at)
            (seq_start + start_m).to(tl.int32),
            (off_h * stride_qh).to(tl.int32),
          ],
          [BLOCK_M,BLOCK_D_Q],
          Q.dtype.element_ty,
      )
```

--------------------------------

### Verifying Platform Setup - CUDA and XPU Availability

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Python code to check for CUDA and XPU availability using PyTorch.

```python
# Check CUDA availability
import torch
print("CUDA available:", torch.cuda.is_available())

# Check XPU availability
print("XPU available:", hasattr(torch, 'xpu') and torch.xpu.is_available())
```

--------------------------------

### Clone and Install (uv)

Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md

Clones the KernelAgent repository and installs it using uv, matching CI practices.

```bash
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
```

--------------------------------

### Platform-Specific PyTorch Installation for Intel XPU

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Installs PyTorch with support for Intel XPU (Intel GPUs).

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
```

--------------------------------

### Verify XPU Installation

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Python code to verify if Intel XPU is available and count the number of Intel GPUs.

```python
import torch
print(torch.xpu.is_available())  # Should print True
print(torch.xpu.device_count())  # Number of Intel GPUs
```

--------------------------------

### Set Environment for E2E Runs

Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md

Copies the example environment file and prompts the user to edit it with API keys for end-to-end runs.

```bash
cp .env.example .env   # then edit with your API key(s)
```

--------------------------------

### Install Development Dependencies

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Installs the KernelAgent package in editable mode with development dependencies.

```bash
pip install -e .[dev]
```

--------------------------------

### Clone and Install (pip)

Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md

Clones the KernelAgent repository and installs it using pip with development dependencies.

```bash
git clone https://github.com/pytorch-labs/KernelAgent.git
cd KernelAgent

# Using pip
python -m venv .venv && source .venv/bin/activate
pip install -e .
pip install -e ".[dev]"
```

--------------------------------

### Direct KernelAgent run

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

This example shows how to bypass Fuser and provide a plain language problem description or a KernelBench snippet directly to TritonKernelAgent. It prints the kernel path and session directory on success, or an error message on failure.

```python
from triton_kernel_agent import TritonKernelAgent

agent = TritonKernelAgent(num_workers=4, max_rounds=8, model_name="gpt-5")
result = agent.generate_kernel(
    problem_description="Implement ReLU over a contiguous 1D tensor of length 1024"
)

if result["success"]:
    print("Kernel path:", result["kernel_path"])
    print("Session directory:", result["session_dir"])
else:
    print("Failure:", result["message"])
```

--------------------------------

### Direct PyTorch usage

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md

Example of using KernelAgent-Oink's RMSNorm operation directly with PyTorch.

```python
import kernelagent_oink
import torch

kernelagent_oink.register(force=True)

x = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
y = torch.ops.oink.rmsnorm(x, w, 1e-6)
```

--------------------------------

### Host-side TMA Implementation - Host Setup

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md

Demonstrates how to create TensorDescriptor objects on the host side to describe tensor layouts for TMA operations.

```python
from triton.tools.tensor_descriptor import TensorDescriptor

def matmul_with_tma(a, b):
    # Create TMA descriptors on host
    a_desc = TensorDescriptor(
        a,                                   # the tensor
        a.shape,                             # tensor shape
        a.stride(),                          # tensor strides
        [BLOCK_SIZE_M, BLOCK_SIZE_K]         # block size for TMA operations
    )

    b_desc = TensorDescriptor(
        b,
        b.shape,
        b.stride(),
        [BLOCK_SIZE_K, BLOCK_SIZE_N]
    )

    c_desc = TensorDescriptor(
        c,
        c.shape,
        c.stride(),
        [BLOCK_SIZE_M, BLOCK_SIZE_N]
    )

    # Pass descriptors to kernel
    kernel[grid](a_desc, b_desc, c_desc, ...)
```

--------------------------------

### Device-side TMA Implementation - Host Setup

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md

Configures a custom allocator on the host side for device-side TMA operations.

```python
from typing import Optional

def alloc_fn(size: int, alignment: int, stream: Optional[int]):
    return torch.empty(size, device="cuda", dtype=torch.int8)

# Set custom allocator for TMA
triton.set_allocator(alloc_fn)
```

--------------------------------

### Recommended Environment Variables

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Environment variables recommended for setting up the benchmark environment, including PyTorch allocation configuration and CUDA architecture.

```bash
export PYTORCH_ALLOC_CONF=expandable_segments:True
# GB300 / SM103 on the current CuTeDSL host:
export CUTE_DSL_ARCH=sm_103
# GB200/B200 / SM100 historical runs:
# export CUTE_DSL_ARCH=sm_100a
```

--------------------------------

### Recommended environment variables

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md

Environment variables recommended for optimizing performance and compatibility with CuTeDSL.

```bash
export PYTORCH_ALLOC_CONF=expandable_segments:True
export CUTE_DSL_ARCH=sm_103    # GB300 / SM103 on the current CuTeDSL host
# export CUTE_DSL_ARCH=sm_100a # GB200/B200 / SM100
```

--------------------------------

### Gradio UI Usage

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Command to launch the Gradio UI for KernelAgent.

```bash
python scripts/optimization_ui.py --port 8085
```

--------------------------------

### Reproduce LayerNorm backward benchmark

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md

Command to reproduce the LayerNorm backward benchmark results on GB300 (SM103).

```bash
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
  conda run -n cute python -u oink/benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
    --dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \
    --json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json
```

--------------------------------

### Softmax (forward + backward) benchmark commands

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to benchmark Softmax forward and backward passes using different configurations (quack-suite and dsv3).

```bash
python benchmarks/benchmark/benchmark_softmax_sm100.py --dtype bf16 --mode fwd_bwd --quack-suite --iters 50 --warmup-ms 25 \
  --json /tmp/oink_softmax_fwd_bwd_quack_suite.json

python benchmarks/benchmark/benchmark_softmax_sm100.py --dtype bf16 --mode fwd_bwd --dsv3 --iters 50 --warmup-ms 25 \
  --json /tmp/oink_softmax_fwd_bwd_dsv3.json
```

--------------------------------

### Run Test Suite

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Executes the test suite for the KernelAgent project.

```bash
pytest -v
```

--------------------------------

### Turn JSON artifacts into Markdown tables

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Command to summarize benchmark results into a Markdown table with geomean speedups.

```bash
conda run -n cute bash -lc 'python benchmarks/readme/summarize_results.py \
  --in-dir /tmp/oink_sm103_suite_bf16_current \
  --out /tmp/oink_sm103_suite_bf16_current_summary.md'
```

--------------------------------

### Configure LLM Providers

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Configures API keys for OpenAI and Anthropic LLM providers, or sets up a relay for self-hosted gateways.

```bash
# OpenAI (models like `o4-mini`, `gpt-5`)
OPENAI_API_KEY=sk-...

# Anthropic (default; `claude-sonnet-4-20250514` is used when `OPENAI_MODEL` is unset)
ANTHROPIC_API_KEY=sk-ant-...

# Relay configuration for self-hosted gateways
LLM_RELAY_URL=http://127.0.0.1:11434
LLM_RELAY_TIMEOUT_S=120
```

--------------------------------

### Cross-entropy (forward + backward) benchmark commands

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to benchmark Cross-entropy forward and backward passes using different configurations (quack-suite and dsv3).

```bash
python benchmarks/benchmark/benchmark_cross_entropy_sm100.py --dtype bf16 --mode fwd_bwd --quack-suite --iters 50 --warmup-ms 25 \
  --json /tmp/oink_cross_entropy_fwd_bwd_quack_suite.json

python benchmarks/benchmark/benchmark_cross_entropy_sm100.py --dtype bf16 --mode fwd_bwd --dsv3 --iters 50 --warmup-ms 25 \
  --json /tmp/oink_cross_entropy_fwd_bwd_dsv3.json
```

--------------------------------

### LayerNorm forward benchmark commands

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to benchmark LayerNorm forward pass using different configurations (quack-suite and dsv3).

```bash
python benchmarks/benchmark/benchmark_layernorm_sm100.py --dtype bf16 --quack-suite --iters 200 --warmup-ms 25 \
  --json /tmp/oink_layernorm_fwd_quack_suite.json

python benchmarks/benchmark/benchmark_layernorm_sm100.py --dtype bf16 --dsv3 --iters 200 --warmup-ms 25 \
  --json /tmp/oink_layernorm_fwd_dsv3.json
```

--------------------------------

### Include DeepSeek-V4-Flash norm workloads

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Command to run SM100 suite including DeepSeek-V4-Flash workloads with bf16 dtype.

```bash
conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
  python benchmarks/readme/run_sm100_suite.py --dtype bf16 --include-dsv4 \
  --out-dir /tmp/oink_sm103_suite_bf16_current'
```

--------------------------------

### Equivalent TMA store code

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

This section provides the equivalent TMA store code for a given non-TMA store operation. It initializes a 2D descriptor, acquires a fence, and then performs the store operation using the descriptor.

```python
# pyre-ignore [20]
tl.extra.cuda.experimental_device_tensormap_create2d(
     desc_ptr=device_desc_o,
     global_address=Out, # Out is of shape (L, H, DimV)
     load_size=[BLOCK_M, BLOCK_D_V], #box size as explained in comments above
     global_size=[seq_end.to(tl.int32), H * DimV], # this eliminates the need for `mask`, TMA automatically take care of boundaries.
     element_ty=Out.dtype.element_ty,
)
# pyre-ignore [20]
tl.extra.cuda.experimental_tensormap_fenceproxy_acquire(device_desc_o)
tl._experimental_descriptor_store(
      device_desc_o,
      acc, # acc needs to be casted to the right dtype
      [
        (seq_start + pid * BLOCK_M).to(tl.int32),
        (off_h * stride_oh).to(tl.int32),
       ],
  )
```

--------------------------------

### One-command suite for Quack-suite + DSv3

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Command to run the full Quack-suite + DSv3 set (Oink vs Quack) and write all JSON artifacts to a timestamped directory.

```bash
conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
  python benchmarks/readme/run_sm100_suite.py --dtype bf16'
```

--------------------------------

### Allocate Workspace for On-Device TMA Descriptors

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

This code snippet demonstrates how to allocate a workspace in CUDA memory to store TMA descriptors for on-device initialization. It defines constants for descriptor size and the number of descriptors per program and programs, then uses torch.empty to create the workspace.

```python
TMA_DESC_SIZE = 128 # size in bytes used by a single descriptor, tunable
NUM_DESC_PER_PROGRAM = ... # how many different tensors to load/store by each program. e.g. 3 for GEMM `C=AB`, 4 for HSTU Q,K,V,O tensors
NUM_OF_PROGRAMS = ... # same as specified in kernel `grid`. If grid size is related to auto tune config, use a reasonable upper bound by hard coding "minimal block M size" etc. for now.
workspace = torch.empty(
           TMA_DESC_SIZE * NUM_DESC_PER_PROGRAM * NUM_OF_PROGRAMS,
           dtype=torch.uint8,
           device="cuda",)

```

--------------------------------

### Programmatic API Usage - Beam Search Optimization

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Command to optimize a kernel using the programmatic API with beam search strategy.

```bash
cd examples && python run_opt_manager.py \
  --kernel-dir optimize_01_matvec/ \
  --strategy beam_search \
  --max-rounds 5
```

--------------------------------

### RMSNorm forward benchmarks

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to benchmark RMSNorm forward pass with different configurations (bf16, fp32, same weight dtype, quack-suite, dsv3, dsv4).

```bash
python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype fp32 --quack-suite --iters 200 --warmup-ms 25 \
  --json /tmp/oink_rmsnorm_fwd_quack_suite.json
```

```bash
python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype fp32 --dsv3 --iters 200 --warmup-ms 25 \
  --json /tmp/oink_rmsnorm_fwd_dsv3.json
```

```bash
# vLLM-style inference weights (weight dtype == activation dtype)
python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype same --quack-suite --iters 200 --warmup-ms 25 \
  --json /tmp/oink_rmsnorm_fwd_quack_suite_wsame.json
```

```bash
# DeepSeek-V4-Flash norm grid
python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype same --dsv4 --iters 200 --warmup-ms 25 \
  --json /tmp/oink_rmsnorm_fwd_dsv4_wsame.json
```

--------------------------------

### Equivalent TMA load code

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

This section provides the equivalent TMA load code for a given non-TMA load operation. It initializes a 2D descriptor, defines the box size and global boundaries, and then performs the load operation.

```python
# pyre-ignore [20]
tl.extra.cuda.experimental_device_tensormap_create2d(
     desc_ptr=device_desc_q,
     global_address=Q, # shape (L, H, DimQ)
     load_size=[BLOCK_M,BLOCK_D_Q], #box size as explained in comments above
     global_size=[seq_end.to(tl.int32), H * DimQ], # seq_end == seq_start + seq_len
     element_ty=Q.dtype.element_ty,
  )

```

--------------------------------

### Fused Add + RMSNorm (vLLM-style, in-place) benchmarks

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to benchmark fused add + RMSNorm for DeepSeek-V3 and DeepSeek-V4-Flash hidden-state sweeps with bf16 dtype, using in-place baseline.

```bash
# DeepSeek-V3 hidden-size sweep
PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
  python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \
    --dtype bf16 --dsv3 --iters 80 --warmup-ms 15 \
    --quack-baseline kernel_inplace \
    --json /tmp/oink_sm103_fused_add_rmsnorm_dsv3_bf16.json
```

```bash
# DeepSeek-V4-Flash hidden-state sweep (N=7168)
PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
  python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \
    --dtype bf16 --dsv4 --iters 80 --warmup-ms 15 \
    --quack-baseline kernel_inplace \
    --json /tmp/oink_sm103_fused_add_rmsnorm_dsv4_bf16.json
```

--------------------------------

### Load data from GMEM to SMEM

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

Loads data from global memory (GMEM) to shared memory (SMEM) using an initialized descriptor pointer and specified offsets and box sizes.

```python
- Load data from GMEM to SMEM:
```
x = tl._experimental_descriptor_load(
                   <in_desc_ptr>, #initialized, and acquired fence above
                   [OFFSET_0, OFFSET_1], # offset in "global box" for the 2D loading box to start from
                   [BOX_SIZE_0, BOX_SIZE_1], # keep the same as descriptor's `load_size`
                   <dtype>)
```
```

--------------------------------

### RMSNorm backward benchmarks

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to benchmark RMSNorm backward pass with different configurations (bf16, fp32, quack-suite, dsv3).

```bash
python benchmarks/benchmark/benchmark_rmsnorm_bwd_sm100.py --dtype bf16 --weight-dtype fp32 --quack-suite --iters 100 --warmup-ms 25 \
  --csv /tmp/oink_rmsnorm_bwd_quack_suite.csv
```

```bash
python benchmarks/benchmark/benchmark_rmsnorm_bwd_sm100.py --dtype bf16 --weight-dtype fp32 --dsv3 --iters 100 --warmup-ms 25 \
  --csv /tmp/oink_rmsnorm_bwd_dsv3.csv
```

--------------------------------

### Auto-route a KernelBench problem

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

This command automatically routes a KernelBench problem, choosing between the direct KernelAgent path and the full Fuser pipeline, with fallback if the first attempt fails. The `--no-router-cache` flag prevents caching or using cached results, and `--verify` ensures the final composition test runs.

```bash
python -m Fuser.auto_agent \
  --problem /abs/path/to/KernelBench/level1/19_ReLU.py \
  --no-router-cache \
  # avoid caching or using cached results
  --verify \
  # ensure final composition test runs
```

--------------------------------

### Manually run the pipeline (extract → dispatch → compose)

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

This command provides explicit control over models and concurrency when running the Fuser pipeline. The `--no-router-cache` flag can be enabled to avoid utilizing any cached router results and prevent writing to the cache. `dispatch-jobs auto` matches the number of discovered subgraphs; artifacts are placed under `.fuse/<run_id>/`.

```bash
python -m Fuser.pipeline \
  --problem /abs/path/to/problem.py \
  --extract-model gpt-5 \
  --dispatch-model o4-mini \
  --dispatch-jobs auto \
  --compose-model o4-mini \
  --workers 4 \
  --max-iters 5 \
  --verify
```

```bash
# For Intel XPU
python -m Fuser.pipeline \
  --problem /abs/path/to/problem.py \
  --target-platform xpu \
  --extract-model gpt-5 \
  --dispatch-model o4-mini \
  --dispatch-jobs auto \
  --compose-model o4-mini \
  --workers 4 \
  --max-iters 5 \
  --verify
```

--------------------------------

### vLLM plugin usage

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md

Enabling KernelAgent-Oink as a vLLM plugin.

```bash
export VLLM_USE_OINK_RMSNORM=1
```

--------------------------------

### Initialize descriptor object

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

Initializes a descriptor object for TMA operations, specifying the workspace, program ID offset, and in-program offset. It also defines the tensor to be loaded/stored, the size of the 2D box to copy, the global memory boundaries, and the element data type.

```python
- Initialize descriptor object:
```
desc_ptr = workspace + TMA_DESC_SIZE * <program id offset> + TMA_DESC_SIZE * <in program offset> # in program offset in range [0,NUM_DESC_PER_PROGRAM)


tl.extra.cuda.experimental_device_tensormap_create2d(
desc_ptr=desc_ptr,
global_address=<tensor_ptr>, # tensor to load into or store from
load_size=[BOX_SIZE_0, BOX_SIZE_1], # size of the 2D box to copy
global_size=[GLOBAL_SIZE_0, GLOBAL_SIZE_1], # this defines a "global box" in GMEM. TMA load/store won't go over this boundary if load_size is not divisble by global_size. e.g. Assuming GLOBAL_SIZE_0 == 1.5 * BLOCK_SIZE_0 and GLOBAL_SIZE_1 == BLOCK_SIZE_1, then: for TMA load, the second box will return a tensor of size (BLOCK_SIZE_0, BLOCK_SIZE_1) but the second half of the tensor is all 0; for TMA store, the second box will only have its first half written to GMEM.
element_ty=<element_ty> # usually tensor_ptr.dtype.element_ty
)
```
```

--------------------------------

### Acquire fence on a TensorMap/descriptor object

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

Acquires a fence on a TensorMap or descriptor object, which is necessary before performing load or store operations.

```python
- Acquire fence on a TensorMap/descriptor object:
```
tl.extra.cuda.experimental_tensormap_fenceproxy_acquire(<desc_ptr>)
```
```

--------------------------------

### Regenerate GB300 BF16 STREAM-like roofline

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Command to regenerate the GB300 BF16 STREAM-like roofline on the current machine, used for published results.

```bash
conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
  python benchmarks/benchmark/benchmark_hbm_roofline_sm100.py --dtype bf16 --op both --gb 1 \
  --json /tmp/oink_sm103_hbm_roofline_bf16_current.json'
```

--------------------------------

### Run Unit Tests

Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md

Executes unit tests for KernelAgent using pytest.

```bash
uv run pytest tests/ -v
```

--------------------------------

### Linting Check

Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md

Checks the codebase for PEP 8 compliance and style issues using ruff.

```bash
ruff check .
```

--------------------------------

### Generate SM103 SVGs from current JSONs and measured roofline

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to generate SM103 SVGs from JSON results, comparing Oink vs Quack with layernorm and DSv3_all.

```bash
conda run -n cute bash -lc 'python benchmarks/readme/plot_quack_style_svg.py \
  --in-dir /tmp/oink_sm103_suite_bf16_current \
  --suite quack_suite --include-layernorm \
  --roofline-json /tmp/oink_sm103_hbm_roofline_bf16_current.json \
  --arch-label "SM103 / GB300" \
  --out benchmarks/media/sm103_bf16_oink_vs_quack_with_layernorm.svg'
```

```bash
conda run -n cute bash -lc 'python benchmarks/readme/plot_quack_style_svg.py \
  --in-dir /tmp/oink_sm103_suite_bf16_current \
  --suite dsv3_all --shape-policy first \
  --roofline-json /tmp/oink_sm103_hbm_roofline_bf16_current.json \
  --arch-label "SM103 / GB300" \
  --out benchmarks/media/sm103_bf16_oink_vs_quack_dsv3_all.svg'
```

--------------------------------

### vLLM custom ops configuration

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md

Configuring vLLM to use KernelAgent-Oink custom ops when using torch.compile or CUDA graphs.

```python
from vllm import LLM

llm = LLM(
    model=...,
    tensor_parallel_size=...,
    enforce_eager=False,
    compilation_config={"custom_ops": ["none", "+rms_norm"]},
)
```

--------------------------------

### LayerNorm backward benchmark commands

Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md

Commands to benchmark LayerNorm backward pass for DeepSeek-V4-Flash and DeepSeek-V3 with CUDA graph replay.

```bash
# DeepSeek-V4-Flash hidden LayerNorm shape sweep (N=7168)
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
  conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
    --dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \
    --json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json

# DeepSeek-V3 shape sweep (N in {6144,8192})
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
  conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
    --dtype bf16 --weight-dtype same --dsv3 --iters 80 --warmup-ms 10 --cuda-graph \
    --json /tmp/oink_layernorm_bwd_sm103_dsv3_cuda_graph_seq.json
```

--------------------------------

### Configure KernelAgent Environment Variables

Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md

Sets environment variables for KernelAgent configuration, including LLM model, parallel workers, refinement rounds, and logging level.

```bash
OPENAI_MODEL=gpt-5            # default model for extraction
NUM_KERNEL_SEEDS=4            # parallel workers per kernel
MAX_REFINEMENT_ROUNDS=10      # retry budget per worker
LOG_LEVEL=INFO                # logging level
```

--------------------------------

### Run Tests with Coverage

Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md

Executes unit tests and calculates code coverage for the triton_kernel_agent module.

```bash
uv run pytest tests/ -v --cov=triton_kernel_agent
```

--------------------------------

### Store data from SMEM to GMEM

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md

Stores data from shared memory (SMEM) to global memory (GMEM) using an initialized descriptor pointer, the output tensor, and specified offsets.

```python
- Store data from SMEM to GMEM:
```
tl._experimental_descriptor_store(
                   <out_desc_ptr>, #initialized, and acquired fence above
                   <output_tensor>, #the tensor to be stored on GMEM
                   [OFFSET_0, OFFSET_1], # offset in "global box" for the 2D loading box to start from
)
```
```

--------------------------------

### Key Differences: Traditional vs TMA Memory Access

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md

Compares the manual pointer arithmetic used in traditional approaches with the descriptor-based access provided by TMA.

```python
# Traditional:
# Manual pointer arithmetic
offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
a = tl.load(a_ptrs, mask=...)

# TMA:
# Descriptor-based access
a = a_desc.load([pid_m * BLOCK_SIZE_M, k_offset])
```

--------------------------------

### Device-side TMA Implementation - Kernel Usage

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md

Illustrates creating and using TMA descriptors directly within the Triton kernel for memory operations.

```python
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, ...):
    # Create TMA descriptors in kernel
    a_desc = tl.make_tensor_descriptor(
        a_ptr,                               # pointer to tensor
        shape=[M, K],                        # tensor shape
        strides=[stride_am, stride_ak],      # tensor strides
        block_shape=[BLOCK_SIZE_M, BLOCK_SIZE_K]  # TMA block size
    )

    b_desc = tl.make_tensor_descriptor(
        b_ptr,
        shape=[K, N],
        strides=[stride_bk, stride_bn],
        block_shape=[BLOCK_SIZE_K, BLOCK_SIZE_N]
    )

    c_desc = tl.make_tensor_descriptor(
        c_ptr,
        shape=[M, N],
        strides=[stride_cm, stride_cn],
        block_shape=[BLOCK_SIZE_M, BLOCK_SIZE_N]
    )

    # Use descriptors for memory operations
    pid = tl.program_id(axis=0)
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n

    # Load blocks using TMA
    a = a_desc.load([pid_m * BLOCK_SIZE_M, 0])
    b = b_desc.load([0, pid_n * BLOCK_SIZE_N])

    # Compute and store
    result = tl.dot(a, b)
    c_desc.store([pid_m * BLOCK_SIZE_M, pid_n * BLOCK_SIZE_N], result)
```

--------------------------------

### Host-side TMA Implementation - Kernel Usage

Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md

Shows how to use TMA descriptors within a Triton kernel for loading and storing tensor blocks.

```python
@triton.jit
def matmul_kernel(a_desc, b_desc, c_desc, ...):
    pid = tl.program_id(axis=0)
    # Calculate tile positions
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n

    # Load using TMA descriptors
    a = a_desc.load([pid_m * BLOCK_SIZE_M, 0])  # offset coordinates
    b = b_desc.load([0, pid_n * BLOCK_SIZE_N])

    # Compute
    accumulator = tl.dot(a, b)

    # Store using TMA descriptor
    c_desc.store([pid_m * BLOCK_SIZE_M, pid_n * BLOCK_SIZE_N], accumulator)
```

=== COMPLETE CONTENT === This response contains all available snippets from this library. No additional content exists. Do not make further requests.