### Install KernelBench Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Clones the KernelBench repository for problem examples. ```bash git clone https://github.com/ScalingIntelligence/KernelBench.git ``` -------------------------------- ### Pinned GB300 / SM103 Benchmark Environment Setup Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to create a conda environment and install necessary dependencies for the GB300 / SM103 benchmark, including PyTorch, CuTeDSL, Triton, and optional Quack. ```bash conda create -y -n cute python=3.12 conda run -n cute python -m pip install --upgrade pip setuptools wheel packaging ninja conda run -n cute python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu130 torch conda run -n cute python -m pip install 'nvidia-cutlass-dsl==4.4.2' cuda-python triton matplotlib pytest pytest-cov conda run -n cute python -m pip install -e '.[bench]' conda run -n cute python -m pip install 'git+https://github.com/Dao-AILab/quack.git' # optional comparison baseline ``` -------------------------------- ### Run End-to-End Example Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md Executes an end-to-end test, which requires network access and a GPU. ```bash uv run python e2e_test.py ``` -------------------------------- ### Install KernelAgent-Oink Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md Installation instructions for KernelAgent-Oink from the repository root. ```bash pip install -e ./oink pip install -e "./oink[bench]" # optional benchmark/plot deps ``` -------------------------------- ### Reproducible GB300 benchmark environment setup Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md Conda environment setup for a reproducible GB300 benchmark. ```bash conda create -y -n cute python=3.12 conda run -n cute python -m pip install --upgrade pip setuptools wheel packaging ninja conda run -n cute python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu130 torch conda run -n cute python -m pip install 'nvidia-cutlass-dsl==4.4.2' cuda-python triton matplotlib conda run -n cute python -m pip install -e './oink[bench]' ``` -------------------------------- ### Install KernelAgent Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Installs KernelAgent using pip. ```bash pip install -e . ``` -------------------------------- ### TensorMapArray Acquire and Load Example Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md This snippet demonstrates acquiring a TensorMapArray using `experimental_tensormap_fenceproxy_acquire` and then loading data into it using `_experimental_descriptor_load`. It includes comments explaining the offset calculation. ```python # pyre-ignore [20] tl.extra.cuda.experimental_tensormap_fenceproxy_acquire(device_desc_q) q = tl._experimental_descriptor_load( device_desc_q, [ #offset as explained in comments above (where the box starts at) (seq_start + start_m).to(tl.int32), (off_h * stride_qh).to(tl.int32), ], [BLOCK_M,BLOCK_D_Q], Q.dtype.element_ty, ) ``` -------------------------------- ### Verifying Platform Setup - CUDA and XPU Availability Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Python code to check for CUDA and XPU availability using PyTorch. ```python # Check CUDA availability import torch print("CUDA available:", torch.cuda.is_available()) # Check XPU availability print("XPU available:", hasattr(torch, 'xpu') and torch.xpu.is_available()) ``` -------------------------------- ### Clone and Install (uv) Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md Clones the KernelAgent repository and installs it using uv, matching CI practices. ```bash uv venv && source .venv/bin/activate uv pip install -e ".[dev]" ``` -------------------------------- ### Platform-Specific PyTorch Installation for Intel XPU Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Installs PyTorch with support for Intel XPU (Intel GPUs). ```bash pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu ``` -------------------------------- ### Verify XPU Installation Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Python code to verify if Intel XPU is available and count the number of Intel GPUs. ```python import torch print(torch.xpu.is_available()) # Should print True print(torch.xpu.device_count()) # Number of Intel GPUs ``` -------------------------------- ### Set Environment for E2E Runs Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md Copies the example environment file and prompts the user to edit it with API keys for end-to-end runs. ```bash cp .env.example .env # then edit with your API key(s) ``` -------------------------------- ### Install Development Dependencies Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Installs the KernelAgent package in editable mode with development dependencies. ```bash pip install -e .[dev] ``` -------------------------------- ### Clone and Install (pip) Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md Clones the KernelAgent repository and installs it using pip with development dependencies. ```bash git clone https://github.com/pytorch-labs/KernelAgent.git cd KernelAgent # Using pip python -m venv .venv && source .venv/bin/activate pip install -e . pip install -e ".[dev]" ``` -------------------------------- ### Direct KernelAgent run Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md This example shows how to bypass Fuser and provide a plain language problem description or a KernelBench snippet directly to TritonKernelAgent. It prints the kernel path and session directory on success, or an error message on failure. ```python from triton_kernel_agent import TritonKernelAgent agent = TritonKernelAgent(num_workers=4, max_rounds=8, model_name="gpt-5") result = agent.generate_kernel( problem_description="Implement ReLU over a contiguous 1D tensor of length 1024" ) if result["success"]: print("Kernel path:", result["kernel_path"]) print("Session directory:", result["session_dir"]) else: print("Failure:", result["message"]) ``` -------------------------------- ### Direct PyTorch usage Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md Example of using KernelAgent-Oink's RMSNorm operation directly with PyTorch. ```python import kernelagent_oink import torch kernelagent_oink.register(force=True) x = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16) w = torch.randn(4096, device="cuda", dtype=torch.bfloat16) y = torch.ops.oink.rmsnorm(x, w, 1e-6) ``` -------------------------------- ### Host-side TMA Implementation - Host Setup Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md Demonstrates how to create TensorDescriptor objects on the host side to describe tensor layouts for TMA operations. ```python from triton.tools.tensor_descriptor import TensorDescriptor def matmul_with_tma(a, b): # Create TMA descriptors on host a_desc = TensorDescriptor( a, # the tensor a.shape, # tensor shape a.stride(), # tensor strides [BLOCK_SIZE_M, BLOCK_SIZE_K] # block size for TMA operations ) b_desc = TensorDescriptor( b, b.shape, b.stride(), [BLOCK_SIZE_K, BLOCK_SIZE_N] ) c_desc = TensorDescriptor( c, c.shape, c.stride(), [BLOCK_SIZE_M, BLOCK_SIZE_N] ) # Pass descriptors to kernel kernel[grid](a_desc, b_desc, c_desc, ...) ``` -------------------------------- ### Device-side TMA Implementation - Host Setup Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md Configures a custom allocator on the host side for device-side TMA operations. ```python from typing import Optional def alloc_fn(size: int, alignment: int, stream: Optional[int]): return torch.empty(size, device="cuda", dtype=torch.int8) # Set custom allocator for TMA triton.set_allocator(alloc_fn) ``` -------------------------------- ### Recommended Environment Variables Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Environment variables recommended for setting up the benchmark environment, including PyTorch allocation configuration and CUDA architecture. ```bash export PYTORCH_ALLOC_CONF=expandable_segments:True # GB300 / SM103 on the current CuTeDSL host: export CUTE_DSL_ARCH=sm_103 # GB200/B200 / SM100 historical runs: # export CUTE_DSL_ARCH=sm_100a ``` -------------------------------- ### Recommended environment variables Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md Environment variables recommended for optimizing performance and compatibility with CuTeDSL. ```bash export PYTORCH_ALLOC_CONF=expandable_segments:True export CUTE_DSL_ARCH=sm_103 # GB300 / SM103 on the current CuTeDSL host # export CUTE_DSL_ARCH=sm_100a # GB200/B200 / SM100 ``` -------------------------------- ### Gradio UI Usage Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Command to launch the Gradio UI for KernelAgent. ```bash python scripts/optimization_ui.py --port 8085 ``` -------------------------------- ### Reproduce LayerNorm backward benchmark Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md Command to reproduce the LayerNorm backward benchmark results on GB300 (SM103). ```bash env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \ conda run -n cute python -u oink/benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \ --dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \ --json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json ``` -------------------------------- ### Softmax (forward + backward) benchmark commands Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to benchmark Softmax forward and backward passes using different configurations (quack-suite and dsv3). ```bash python benchmarks/benchmark/benchmark_softmax_sm100.py --dtype bf16 --mode fwd_bwd --quack-suite --iters 50 --warmup-ms 25 \ --json /tmp/oink_softmax_fwd_bwd_quack_suite.json python benchmarks/benchmark/benchmark_softmax_sm100.py --dtype bf16 --mode fwd_bwd --dsv3 --iters 50 --warmup-ms 25 \ --json /tmp/oink_softmax_fwd_bwd_dsv3.json ``` -------------------------------- ### Run Test Suite Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Executes the test suite for the KernelAgent project. ```bash pytest -v ``` -------------------------------- ### Turn JSON artifacts into Markdown tables Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Command to summarize benchmark results into a Markdown table with geomean speedups. ```bash conda run -n cute bash -lc 'python benchmarks/readme/summarize_results.py \ --in-dir /tmp/oink_sm103_suite_bf16_current \ --out /tmp/oink_sm103_suite_bf16_current_summary.md' ``` -------------------------------- ### Configure LLM Providers Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Configures API keys for OpenAI and Anthropic LLM providers, or sets up a relay for self-hosted gateways. ```bash # OpenAI (models like `o4-mini`, `gpt-5`) OPENAI_API_KEY=sk-... # Anthropic (default; `claude-sonnet-4-20250514` is used when `OPENAI_MODEL` is unset) ANTHROPIC_API_KEY=sk-ant-... # Relay configuration for self-hosted gateways LLM_RELAY_URL=http://127.0.0.1:11434 LLM_RELAY_TIMEOUT_S=120 ``` -------------------------------- ### Cross-entropy (forward + backward) benchmark commands Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to benchmark Cross-entropy forward and backward passes using different configurations (quack-suite and dsv3). ```bash python benchmarks/benchmark/benchmark_cross_entropy_sm100.py --dtype bf16 --mode fwd_bwd --quack-suite --iters 50 --warmup-ms 25 \ --json /tmp/oink_cross_entropy_fwd_bwd_quack_suite.json python benchmarks/benchmark/benchmark_cross_entropy_sm100.py --dtype bf16 --mode fwd_bwd --dsv3 --iters 50 --warmup-ms 25 \ --json /tmp/oink_cross_entropy_fwd_bwd_dsv3.json ``` -------------------------------- ### LayerNorm forward benchmark commands Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to benchmark LayerNorm forward pass using different configurations (quack-suite and dsv3). ```bash python benchmarks/benchmark/benchmark_layernorm_sm100.py --dtype bf16 --quack-suite --iters 200 --warmup-ms 25 \ --json /tmp/oink_layernorm_fwd_quack_suite.json python benchmarks/benchmark/benchmark_layernorm_sm100.py --dtype bf16 --dsv3 --iters 200 --warmup-ms 25 \ --json /tmp/oink_layernorm_fwd_dsv3.json ``` -------------------------------- ### Include DeepSeek-V4-Flash norm workloads Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Command to run SM100 suite including DeepSeek-V4-Flash workloads with bf16 dtype. ```bash conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \ python benchmarks/readme/run_sm100_suite.py --dtype bf16 --include-dsv4 \ --out-dir /tmp/oink_sm103_suite_bf16_current' ``` -------------------------------- ### Equivalent TMA store code Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md This section provides the equivalent TMA store code for a given non-TMA store operation. It initializes a 2D descriptor, acquires a fence, and then performs the store operation using the descriptor. ```python # pyre-ignore [20] tl.extra.cuda.experimental_device_tensormap_create2d( desc_ptr=device_desc_o, global_address=Out, # Out is of shape (L, H, DimV) load_size=[BLOCK_M, BLOCK_D_V], #box size as explained in comments above global_size=[seq_end.to(tl.int32), H * DimV], # this eliminates the need for `mask`, TMA automatically take care of boundaries. element_ty=Out.dtype.element_ty, ) # pyre-ignore [20] tl.extra.cuda.experimental_tensormap_fenceproxy_acquire(device_desc_o) tl._experimental_descriptor_store( device_desc_o, acc, # acc needs to be casted to the right dtype [ (seq_start + pid * BLOCK_M).to(tl.int32), (off_h * stride_oh).to(tl.int32), ], ) ``` -------------------------------- ### One-command suite for Quack-suite + DSv3 Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Command to run the full Quack-suite + DSv3 set (Oink vs Quack) and write all JSON artifacts to a timestamped directory. ```bash conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \ python benchmarks/readme/run_sm100_suite.py --dtype bf16' ``` -------------------------------- ### Allocate Workspace for On-Device TMA Descriptors Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md This code snippet demonstrates how to allocate a workspace in CUDA memory to store TMA descriptors for on-device initialization. It defines constants for descriptor size and the number of descriptors per program and programs, then uses torch.empty to create the workspace. ```python TMA_DESC_SIZE = 128 # size in bytes used by a single descriptor, tunable NUM_DESC_PER_PROGRAM = ... # how many different tensors to load/store by each program. e.g. 3 for GEMM `C=AB`, 4 for HSTU Q,K,V,O tensors NUM_OF_PROGRAMS = ... # same as specified in kernel `grid`. If grid size is related to auto tune config, use a reasonable upper bound by hard coding "minimal block M size" etc. for now. workspace = torch.empty( TMA_DESC_SIZE * NUM_DESC_PER_PROGRAM * NUM_OF_PROGRAMS, dtype=torch.uint8, device="cuda",) ``` -------------------------------- ### Programmatic API Usage - Beam Search Optimization Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Command to optimize a kernel using the programmatic API with beam search strategy. ```bash cd examples && python run_opt_manager.py \ --kernel-dir optimize_01_matvec/ \ --strategy beam_search \ --max-rounds 5 ``` -------------------------------- ### RMSNorm forward benchmarks Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to benchmark RMSNorm forward pass with different configurations (bf16, fp32, same weight dtype, quack-suite, dsv3, dsv4). ```bash python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype fp32 --quack-suite --iters 200 --warmup-ms 25 \ --json /tmp/oink_rmsnorm_fwd_quack_suite.json ``` ```bash python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype fp32 --dsv3 --iters 200 --warmup-ms 25 \ --json /tmp/oink_rmsnorm_fwd_dsv3.json ``` ```bash # vLLM-style inference weights (weight dtype == activation dtype) python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype same --quack-suite --iters 200 --warmup-ms 25 \ --json /tmp/oink_rmsnorm_fwd_quack_suite_wsame.json ``` ```bash # DeepSeek-V4-Flash norm grid python benchmarks/benchmark/benchmark_rmsnorm_sm100.py --dtype bf16 --weight-dtype same --dsv4 --iters 200 --warmup-ms 25 \ --json /tmp/oink_rmsnorm_fwd_dsv4_wsame.json ``` -------------------------------- ### Equivalent TMA load code Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md This section provides the equivalent TMA load code for a given non-TMA load operation. It initializes a 2D descriptor, defines the box size and global boundaries, and then performs the load operation. ```python # pyre-ignore [20] tl.extra.cuda.experimental_device_tensormap_create2d( desc_ptr=device_desc_q, global_address=Q, # shape (L, H, DimQ) load_size=[BLOCK_M,BLOCK_D_Q], #box size as explained in comments above global_size=[seq_end.to(tl.int32), H * DimQ], # seq_end == seq_start + seq_len element_ty=Q.dtype.element_ty, ) ``` -------------------------------- ### Fused Add + RMSNorm (vLLM-style, in-place) benchmarks Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to benchmark fused add + RMSNorm for DeepSeek-V3 and DeepSeek-V4-Flash hidden-state sweeps with bf16 dtype, using in-place baseline. ```bash # DeepSeek-V3 hidden-size sweep PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \ python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \ --dtype bf16 --dsv3 --iters 80 --warmup-ms 15 \ --quack-baseline kernel_inplace \ --json /tmp/oink_sm103_fused_add_rmsnorm_dsv3_bf16.json ``` ```bash # DeepSeek-V4-Flash hidden-state sweep (N=7168) PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \ python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \ --dtype bf16 --dsv4 --iters 80 --warmup-ms 15 \ --quack-baseline kernel_inplace \ --json /tmp/oink_sm103_fused_add_rmsnorm_dsv4_bf16.json ``` -------------------------------- ### Load data from GMEM to SMEM Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md Loads data from global memory (GMEM) to shared memory (SMEM) using an initialized descriptor pointer and specified offsets and box sizes. ```python - Load data from GMEM to SMEM: ``` x = tl._experimental_descriptor_load( , #initialized, and acquired fence above [OFFSET_0, OFFSET_1], # offset in "global box" for the 2D loading box to start from [BOX_SIZE_0, BOX_SIZE_1], # keep the same as descriptor's `load_size` ) ``` ``` -------------------------------- ### RMSNorm backward benchmarks Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to benchmark RMSNorm backward pass with different configurations (bf16, fp32, quack-suite, dsv3). ```bash python benchmarks/benchmark/benchmark_rmsnorm_bwd_sm100.py --dtype bf16 --weight-dtype fp32 --quack-suite --iters 100 --warmup-ms 25 \ --csv /tmp/oink_rmsnorm_bwd_quack_suite.csv ``` ```bash python benchmarks/benchmark/benchmark_rmsnorm_bwd_sm100.py --dtype bf16 --weight-dtype fp32 --dsv3 --iters 100 --warmup-ms 25 \ --csv /tmp/oink_rmsnorm_bwd_dsv3.csv ``` -------------------------------- ### Auto-route a KernelBench problem Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md This command automatically routes a KernelBench problem, choosing between the direct KernelAgent path and the full Fuser pipeline, with fallback if the first attempt fails. The `--no-router-cache` flag prevents caching or using cached results, and `--verify` ensures the final composition test runs. ```bash python -m Fuser.auto_agent \ --problem /abs/path/to/KernelBench/level1/19_ReLU.py \ --no-router-cache \ # avoid caching or using cached results --verify \ # ensure final composition test runs ``` -------------------------------- ### Manually run the pipeline (extract → dispatch → compose) Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md This command provides explicit control over models and concurrency when running the Fuser pipeline. The `--no-router-cache` flag can be enabled to avoid utilizing any cached router results and prevent writing to the cache. `dispatch-jobs auto` matches the number of discovered subgraphs; artifacts are placed under `.fuse//`. ```bash python -m Fuser.pipeline \ --problem /abs/path/to/problem.py \ --extract-model gpt-5 \ --dispatch-model o4-mini \ --dispatch-jobs auto \ --compose-model o4-mini \ --workers 4 \ --max-iters 5 \ --verify ``` ```bash # For Intel XPU python -m Fuser.pipeline \ --problem /abs/path/to/problem.py \ --target-platform xpu \ --extract-model gpt-5 \ --dispatch-model o4-mini \ --dispatch-jobs auto \ --compose-model o4-mini \ --workers 4 \ --max-iters 5 \ --verify ``` -------------------------------- ### vLLM plugin usage Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md Enabling KernelAgent-Oink as a vLLM plugin. ```bash export VLLM_USE_OINK_RMSNORM=1 ``` -------------------------------- ### Initialize descriptor object Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md Initializes a descriptor object for TMA operations, specifying the workspace, program ID offset, and in-program offset. It also defines the tensor to be loaded/stored, the size of the 2D box to copy, the global memory boundaries, and the element data type. ```python - Initialize descriptor object: ``` desc_ptr = workspace + TMA_DESC_SIZE * + TMA_DESC_SIZE * # in program offset in range [0,NUM_DESC_PER_PROGRAM) tl.extra.cuda.experimental_device_tensormap_create2d( desc_ptr=desc_ptr, global_address=, # tensor to load into or store from load_size=[BOX_SIZE_0, BOX_SIZE_1], # size of the 2D box to copy global_size=[GLOBAL_SIZE_0, GLOBAL_SIZE_1], # this defines a "global box" in GMEM. TMA load/store won't go over this boundary if load_size is not divisble by global_size. e.g. Assuming GLOBAL_SIZE_0 == 1.5 * BLOCK_SIZE_0 and GLOBAL_SIZE_1 == BLOCK_SIZE_1, then: for TMA load, the second box will return a tensor of size (BLOCK_SIZE_0, BLOCK_SIZE_1) but the second half of the tensor is all 0; for TMA store, the second box will only have its first half written to GMEM. element_ty= # usually tensor_ptr.dtype.element_ty ) ``` ``` -------------------------------- ### Acquire fence on a TensorMap/descriptor object Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md Acquires a fence on a TensorMap or descriptor object, which is necessary before performing load or store operations. ```python - Acquire fence on a TensorMap/descriptor object: ``` tl.extra.cuda.experimental_tensormap_fenceproxy_acquire() ``` ``` -------------------------------- ### Regenerate GB300 BF16 STREAM-like roofline Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Command to regenerate the GB300 BF16 STREAM-like roofline on the current machine, used for published results. ```bash conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \ python benchmarks/benchmark/benchmark_hbm_roofline_sm100.py --dtype bf16 --op both --gb 1 \ --json /tmp/oink_sm103_hbm_roofline_bf16_current.json' ``` -------------------------------- ### Run Unit Tests Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md Executes unit tests for KernelAgent using pytest. ```bash uv run pytest tests/ -v ``` -------------------------------- ### Linting Check Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md Checks the codebase for PEP 8 compliance and style issues using ruff. ```bash ruff check . ``` -------------------------------- ### Generate SM103 SVGs from current JSONs and measured roofline Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to generate SM103 SVGs from JSON results, comparing Oink vs Quack with layernorm and DSv3_all. ```bash conda run -n cute bash -lc 'python benchmarks/readme/plot_quack_style_svg.py \ --in-dir /tmp/oink_sm103_suite_bf16_current \ --suite quack_suite --include-layernorm \ --roofline-json /tmp/oink_sm103_hbm_roofline_bf16_current.json \ --arch-label "SM103 / GB300" \ --out benchmarks/media/sm103_bf16_oink_vs_quack_with_layernorm.svg' ``` ```bash conda run -n cute bash -lc 'python benchmarks/readme/plot_quack_style_svg.py \ --in-dir /tmp/oink_sm103_suite_bf16_current \ --suite dsv3_all --shape-policy first \ --roofline-json /tmp/oink_sm103_hbm_roofline_bf16_current.json \ --arch-label "SM103 / GB300" \ --out benchmarks/media/sm103_bf16_oink_vs_quack_dsv3_all.svg' ``` -------------------------------- ### vLLM custom ops configuration Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/README.md Configuring vLLM to use KernelAgent-Oink custom ops when using torch.compile or CUDA graphs. ```python from vllm import LLM llm = LLM( model=..., tensor_parallel_size=..., enforce_eager=False, compilation_config={"custom_ops": ["none", "+rms_norm"]}, ) ``` -------------------------------- ### LayerNorm backward benchmark commands Source: https://github.com/meta-pytorch/kernelagent/blob/main/oink/benchmarks/README.md Commands to benchmark LayerNorm backward pass for DeepSeek-V4-Flash and DeepSeek-V3 with CUDA graph replay. ```bash # DeepSeek-V4-Flash hidden LayerNorm shape sweep (N=7168) env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \ conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \ --dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \ --json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json # DeepSeek-V3 shape sweep (N in {6144,8192}) env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \ conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \ --dtype bf16 --weight-dtype same --dsv3 --iters 80 --warmup-ms 10 --cuda-graph \ --json /tmp/oink_layernorm_bwd_sm103_dsv3_cuda_graph_seq.json ``` -------------------------------- ### Configure KernelAgent Environment Variables Source: https://github.com/meta-pytorch/kernelagent/blob/main/README.md Sets environment variables for KernelAgent configuration, including LLM model, parallel workers, refinement rounds, and logging level. ```bash OPENAI_MODEL=gpt-5 # default model for extraction NUM_KERNEL_SEEDS=4 # parallel workers per kernel MAX_REFINEMENT_ROUNDS=10 # retry budget per worker LOG_LEVEL=INFO # logging level ``` -------------------------------- ### Run Tests with Coverage Source: https://github.com/meta-pytorch/kernelagent/blob/main/CONTRIBUTING.md Executes unit tests and calculates code coverage for the triton_kernel_agent module. ```bash uv run pytest tests/ -v --cov=triton_kernel_agent ``` -------------------------------- ### Store data from SMEM to GMEM Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/experimental_tma.md Stores data from shared memory (SMEM) to global memory (GMEM) using an initialized descriptor pointer, the output tensor, and specified offsets. ```python - Store data from SMEM to GMEM: ``` tl._experimental_descriptor_store( , #initialized, and acquired fence above , #the tensor to be stored on GMEM [OFFSET_0, OFFSET_1], # offset in "global box" for the 2D loading box to start from ) ``` ``` -------------------------------- ### Key Differences: Traditional vs TMA Memory Access Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md Compares the manual pointer arithmetic used in traditional approaches with the descriptor-based access provided by TMA. ```python # Traditional: # Manual pointer arithmetic offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak a = tl.load(a_ptrs, mask=...) # TMA: # Descriptor-based access a = a_desc.load([pid_m * BLOCK_SIZE_M, k_offset]) ``` -------------------------------- ### Device-side TMA Implementation - Kernel Usage Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md Illustrates creating and using TMA descriptors directly within the Triton kernel for memory operations. ```python @triton.jit def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, stride_am, stride_ak, ...): # Create TMA descriptors in kernel a_desc = tl.make_tensor_descriptor( a_ptr, # pointer to tensor shape=[M, K], # tensor shape strides=[stride_am, stride_ak], # tensor strides block_shape=[BLOCK_SIZE_M, BLOCK_SIZE_K] # TMA block size ) b_desc = tl.make_tensor_descriptor( b_ptr, shape=[K, N], strides=[stride_bk, stride_bn], block_shape=[BLOCK_SIZE_K, BLOCK_SIZE_N] ) c_desc = tl.make_tensor_descriptor( c_ptr, shape=[M, N], strides=[stride_cm, stride_cn], block_shape=[BLOCK_SIZE_M, BLOCK_SIZE_N] ) # Use descriptors for memory operations pid = tl.program_id(axis=0) pid_m = pid // num_pid_n pid_n = pid % num_pid_n # Load blocks using TMA a = a_desc.load([pid_m * BLOCK_SIZE_M, 0]) b = b_desc.load([0, pid_n * BLOCK_SIZE_N]) # Compute and store result = tl.dot(a, b) c_desc.store([pid_m * BLOCK_SIZE_M, pid_n * BLOCK_SIZE_N], result) ``` -------------------------------- ### Host-side TMA Implementation - Kernel Usage Source: https://github.com/meta-pytorch/kernelagent/blob/main/kernel_perf_agent/kernel_opt/database/docs/tma.md Shows how to use TMA descriptors within a Triton kernel for loading and storing tensor blocks. ```python @triton.jit def matmul_kernel(a_desc, b_desc, c_desc, ...): pid = tl.program_id(axis=0) # Calculate tile positions pid_m = pid // num_pid_n pid_n = pid % num_pid_n # Load using TMA descriptors a = a_desc.load([pid_m * BLOCK_SIZE_M, 0]) # offset coordinates b = b_desc.load([0, pid_n * BLOCK_SIZE_N]) # Compute accumulator = tl.dot(a, b) # Store using TMA descriptor c_desc.store([pid_m * BLOCK_SIZE_M, pid_n * BLOCK_SIZE_N], accumulator) ``` === COMPLETE CONTENT === This response contains all available snippets from this library. No additional content exists. Do not make further requests.