### Quick Start Kernel Build

Source: https://github.com/huggingface/kernels/blob/main/nix-builder/README.md

Builds and copies a custom machine learning kernel using Nix. Use this command to quickly start building kernels from the examples directory.

```bash
cd examples/relu
nix run .#build-and-copy \
  --max-jobs 2 \
  --cores 8 \
  -L
```

--------------------------------

### Complete HuggingFace Kernels Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/SKILL.md

Run a complete example demonstrating the usage of HuggingFace Kernels.

```bash
python scripts/huggingface_kernels_example.py
```

--------------------------------

### Install and Get Kernel

Source: https://github.com/huggingface/kernels/blob/main/examples/kernels/cutlass-gemm/CARD.md

Installs the kernels library and retrieves a specific kernel module by its repository ID and version. This is the primary way to start using a kernel.

```python
# make sure `kernels` is installed: `pip install -U kernels`
from kernels import get_kernel

kernel_module = get_kernel("{{ repo_id }}", version={{ version }})
{{ functions[0] }} = kernel_module.{{ functions[0] }}

{{ functions[0] }}(...)
```

--------------------------------

### Minimal Benchmark Script Example

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-benchmark.md

A basic example of a benchmark script adhering to the `kernels` benchmark structure. It defines setup, benchmark, and verification methods.

```python
import torch

from kernels.benchmark import Benchmark


class ActivationBenchmark(Benchmark):
    seed = 0

    def setup(self):
        self.x = torch.randn(128, 1024, device=self.device, dtype=torch.float16)
        self.out = torch.empty(128, 512, device=self.device, dtype=torch.float16)

    def benchmark_silu_and_mul(self):
        self.kernel.silu_and_mul(self.out, self.x)

    def verify_silu_and_mul(self):
        # Return reference tensor; runner compares with self.out
        return torch.nn.functional.silu(self.x[..., :512]) * self.x[..., 512:]
```

--------------------------------

### Build and Install CPU Kernel

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/workflow_details.md

Build the kernel in release mode and install it using pip.

```bash
kernel-builder build --release
pip install dist/*.whl --force-reinstall
```

--------------------------------

### Complete Example: Build, Copy, and Run Injection Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-integration.md

A comprehensive example demonstrating the full workflow: building and copying kernels using kernel-builder, then running a Python script that loads a pipeline, injects custom kernels, verifies the injection, and generates a test video.

```bash
cd <your-kernel-project>     # kernel-builder project with your kernels
nix run .#build-and-copy -L  # Build kernels with kernel-builder
python path/to/skills/cuda-kernels/scripts/ltx_kernel_injection_example.py
```

--------------------------------

### Install Dependencies

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/examples/ltx-video-benchmark/README.md

Install the necessary Python packages for the benchmark. Run this command from the root of the kernels project.

```bash
python -m pip install -r skills/rocm-kernels/scripts/requirements.txt
```

--------------------------------

### Minimal Kernel Injection Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-h100.md

Run a minimal example demonstrating kernel injection, approximately 150 lines of code.

```bash
python scripts/ltx_kernel_injection_example.py
```

--------------------------------

### Install Kernels Benchmark Dependencies

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-benchmark.md

Install the necessary dependencies for the kernels benchmark tool. Use `uv` or `pip` for installation.

```bash
uv pip install 'kernels[benchmark]'
```

--------------------------------

### Install and Authenticate hf CLI

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md

Install the Hugging Face CLI tool and log in to your account to manage Hub interactions.

```shell
# install the hf CLI tool
hf skills add

# Authenticate
hf auth login
```

--------------------------------

### Minimal Transformers Integration Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/SKILL.md

Run a minimal integration example for Transformers (LLMs), approximately 120 lines of code.

```bash
python scripts/transformers_injection_example.py
```

--------------------------------

### Install nix-direnv

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/ide-setup.md

Install the `nix-direnv` package using `nix profile` on non-NixOS systems.

```bash
nix profile install nixpkgs#nix-direnv
```

--------------------------------

### Check NixOS Setup Status

Source: https://github.com/huggingface/kernels/blob/main/terraform/README.md

Verify the completion status of the NixOS first-boot setup by checking the systemd service status. The setup is complete when the service reaches the 'Finished' state.

```bash
systemctl status amazon-init
```

--------------------------------

### Kernel-Builder Configuration Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/SKILL.md

Example `build.toml` file demonstrating essential configurations for a CUDA kernel project, including general settings, hub integration, and torch extension details.

```toml
[general]
# Name MUST be dash-separated lowercase (my-kernel), never underscores —
# `kernel-builder check-config` rejects underscores. The Python package
# lives at torch-ext/<name with dashes replaced by underscores>.
name = "ltx-kernels"
backends = ["cuda"]
version = 1
license = "Apache-2.0"   # required field

[general.hub]
# Hub repo for `kernel-builder build-and-upload`; with `version` this
# selects the version branch (e.g. v1).
repo-id = "my-username/ltx-kernels"

[torch]
src = [
  "torch-ext/torch_binding.cpp",
  "torch-ext/torch_binding.h"
]

[kernel.your_kernel]
backend = "cuda"
src = ["kernel_src/your_kernel.cu"]
depends = ["torch"]
# Only constrain cuda-capabilities when the kernel truly requires it —
# do not over-specify.

```

--------------------------------

### Verify Kernel Project with Transformers Example

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md

Use this command to verify that the generated kernel project works correctly with a transformers example.

```bash
Verify the kernel project works with a transformers example.
```

--------------------------------

### Run Transformers Integration Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/transformers-integration.md

Execute this bash script to build custom kernels and run the integration example. Ensure you are in your kernel-builder project directory.

```bash
cd <your-kernel-project>     # kernel-builder project with your kernels
nix run .#build-and-copy -L  # Build kernels with kernel-builder
python path/to/skills/cuda-kernels/scripts/transformers_injection_example.py
```

--------------------------------

### Example output of kernel versions

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-versions.md

This is an example of the output you might see when listing kernel versions. Compatible versions are indicated.

```text
Version 1: torch210-metal-aarch64-darwin, torch28-cxx11-cu126-aarch64-linux, torch28-cxx11-cu129-aarch64-linux, torch28-cxx11-cu128-aarch64-linux, torch29-cxx11-cu130-x86_64-linux, torch27-cxx11-cu118-x86_64-linux, torch210-cxx11-cu130-x86_64-linux, torch29-cxx11-cu128-aarch64-linux, torch29-cxx11-cu130-aarch64-linux, torch27-cxx11-cu126-x86_64-linux, ✅ torch29-cxx11-cu126-x86_64-linux (compatible), torch27-cxx11-cu128-x86_64-linux, torch210-cxx11-cu126-x86_64-linux, torch29-metal-aarch64-darwin, torch27-cxx11-cu128-aarch64-linux, torch210-cu128-x86_64-windows, torch28-cxx11-cu128-x86_64-linux, torch28-cxx11-cu126-x86_64-linux, torch210-cxx11-cu128-x86_64-linux, torch29-cxx11-cu126-aarch64-linux, ✅ torch29-cxx11-cu128-x86_64-linux (preferred), torch28-cxx11-cu129-x86_64-linux
```

--------------------------------

### Install kernel-builder and Determinate Nix

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md

Installs Determinate Nix and kernel-builder using a curl script. This is the quickest way to set up the development environment.

```bash
curl -fsSL https://raw.githubusercontent.com/huggingface/kernels/main/install.sh | bash
```

--------------------------------

### Hugging Face Kernels Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/manifest.txt

Example script demonstrating the usage of Hugging Face kernels.

```python
import torch
from transformers.models.llama.modeling_llama import LlamaAttention


class MyModel(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.llama_attention = LlamaAttention(config)

    def forward(self, x):
        return self.llama_attention(x)[0]


if __name__ == '__main__':
    from transformers.models.llama.configuration_llama import LlamaConfig

    config = LlamaConfig(hidden_size=1024, num_attention_heads=32)
    model = MyModel(config)
    model.cuda()

    x = torch.randn(1, 1024, 1024).cuda()
    output = model(x)
    print(output.shape)

```

--------------------------------

### Install Skills for Multiple Assistants

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md

Installs skills for multiple AI coding assistants simultaneously.

```bash
# install for multiple assistants
kernel-builder skills add --claude --codex --opencode
```

--------------------------------

### Install HuggingFace Kernels and PyTorch

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/huggingface-kernels-integration.md

Install the necessary libraries using pip. This is the first step to using the HuggingFace Kernels library.

```bash
pip install kernels torch
```

--------------------------------

### Install HF Kernel Builder CLI

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md

Install the `hf-kernel-builder` CLI tool from source to build custom kernels. Use the `--claude`, `--codex`, or `--opencode` flags to install specific agent skills.

```shell
cargo install --git https://github.com/huggingface/kernels hf-kernel-builder

# Install your skills. Use --claude, --codex, or --opencode
kernel-builder skills add --claude
```

--------------------------------

### Hugging Face Kernels Example Script

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/manifest.txt

Example script demonstrating the usage of Hugging Face kernels.

```python
import torch

from kernels.torch.nn.functional import scaled_dot_product_attention

def main():
    # Example usage of scaled_dot_product_attention
    batch_size, num_heads, sequence_length, head_dim = 2, 4, 128, 64
    query = torch.randn(
        batch_size, num_heads, sequence_length, head_dim, device="xpu"
    )
    key = torch.randn(
        batch_size, num_heads, sequence_length, head_dim, device="xpu"
    )
    value = torch.randn(
        batch_size, num_heads, sequence_length, head_dim, device="xpu"
    )

    attn_output, attn_weights = scaled_dot_product_attention(
        query, key, value, is_causal=False
    )

    print("Scaled Dot Product Attention Output Shape:", attn_output.shape)
    print("Scaled Dot Product Attention Weights Shape:", attn_weights.shape)

if __name__ == "__main__":
    main()

```

--------------------------------

### Install Project Dependencies

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/README.md

Install the necessary Python dependencies for the project using pip. Ensure you have a requirements.txt file in the scripts directory.

```bash
pip install -r scripts/requirements.txt
```

--------------------------------

### Install HuggingFace Kernels

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/references/huggingface-kernels-integration.md

Install the necessary libraries for HuggingFace Kernels integration. Ensure you have PyTorch with XPU support, an Intel XPU GPU, and Python 3.8+.

```bash
pip install kernels torch numpy
```

--------------------------------

### Combined Kernel Dependencies Example

Source: https://github.com/huggingface/kernels/blob/main/docs/source/kernel-requirements.md

An example showing how to specify both general and backend-specific Python dependencies along with a version number. Dependencies are validated against the active backend.

```json
{
  "python-depends": ["einops"],
  "python-depends-backends": {
    "cuda": ["nvidia-cutlass-dsl"]
  },
  "version": 1
}
```

--------------------------------

### Example output of kernels info

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-info.md

This is an example of the human-readable output provided by the 'kernels info' command, detailing various aspects of a kernel.

```text
Repository: kernels-community/activation
Revision: v1
Name: activation
Version: 1
License: Apache-2.0
Upstream: -
Source: -
Python dependencies: -
Backends: cuda, metal
```

--------------------------------

### Transformers Injection Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/manifest.txt

Example script demonstrating how to inject transformers into ROCm kernels. Ensure transformers library is installed.

```python
python scripts/transformers_injection_example.py --help
```

--------------------------------

### Example build.toml Configuration

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md

A sample `build.toml` file for the 'mykernel' project, demonstrating general settings, backend selection, and Torch extension configuration.

```toml
[general]
backends = [
  "cuda",
]
name = "mykernel"
version = 1
edition = 5

[general.hub]
repo-id = "myorg/mykernel"

[torch]
src = [
  "torch-ext/torch_binding.cpp",
  "torch-ext/torch_binding.h",
]

[kernel.mykernel]
backend = "cuda"
depends = ["torch"]
src = ["mykernel_cuda/mykernel.cu"]
# If the kernel is only supported on specific capabilities, set the
# cuda-capabilities option:
#
# cuda-capabilities = [ "9.0", "10.0", "12.0" ]
```

--------------------------------

### Monitor NixOS First-Boot Setup

Source: https://github.com/huggingface/kernels/blob/main/terraform/README.md

Use this command inside the VM to follow the progress of the NixOS configuration application via the amazon-init service. This process typically takes 5-10 minutes.

```bash
journalctl -u amazon-init -f
```

--------------------------------

### Build Kernel with setup.py

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/local-dev.md

Builds the kernel using the generated setup.py file. The compatible variant is placed in the 'build' directory.

```bash
python setup.py build_kernel
```

--------------------------------

### Get Locked Kernel

Source: https://github.com/huggingface/kernels/blob/main/docs/source/locking.md

Load a kernel using its locked revision. The lock file is included in package metadata and is visible after installation.

```python
from kernels import get_locked_kernel

activation = get_locked_kernel("kernels-community/activation")

```

--------------------------------

### Install Kernels with uv

Source: https://github.com/huggingface/kernels/blob/main/docs/source/installation.md

Install the kernels package using the uv package installer. This requires torch>=2.5 and CUDA.

```bash
uv pip install kernels
```

--------------------------------

### Initialize and Run Phase 2 Trials

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/SKILL.md

Commands to initialize Phase 2 trials, build the kernel, benchmark, save trial data, and record results. Includes an optional profiling step.

```bash
# Initialize Phase 2 trials
python scripts/trial_manager.py init <kernel_name> baseline.py

# For each trial:
# 1. Modify kernel code (AVX512 intrinsics, or brgemm for GEMM kernels)
# 2. Build
kernel-builder build --release && pip install dist/*.whl --force-reinstall
# 3. Benchmark
python scripts/benchmark_cpu.py baseline.py --kernel-package <pkg> --op <func>
# 4. Save trial
python scripts/trial_manager.py save <kernel_name> <dir> --parent <parent_id> --strategy "description"
# 5. Record result
python scripts/trial_manager.py result <kernel_name> <trial_id> --correctness pass --speedup <float> --baseline_us <float> --kernel_us <float>
# 6. Profile (after t1, or when plateaued)
python scripts/cpu_profiler.py --kernel-package <pkg> --op <func>
```

--------------------------------

### Install Claude Skills

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md

Installs skills for Claude in the current project directory.

```bash
# install for Claude in the current project
kernel-builder skills add --claude
```

--------------------------------

### Transformers Injection Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/manifest.txt

Example script for injecting kernels into Transformers models.

```python
import torch
from transformers.models.llama.modeling_llama import LlamaAttention
from transformers.models.llama.configuration_llama import LlamaConfig
from apex.normalization.fused_normalization import FusedLayerNorm


class MyModel(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.llama_attention = LlamaAttention(config)
        self.layer_norm = FusedLayerNorm(config.hidden_size)

    def forward(self, x):
        return self.layer_norm(self.llama_attention(x)[0])


if __name__ == '__main__':
    config = LlamaConfig(hidden_size=1024, num_attention_heads=32)
    model = MyModel(config)
    model.cuda()

    x = torch.randn(1, 1024, 1024).cuda()
    output = model(x)
    print(output.shape)

```

--------------------------------

### LTX Kernel Injection Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/manifest.txt

Example script for LTX kernel injection.

```python
import torch
from transformers.models.llama.modeling_llama import LlamaAttention
from ltx.kernels.llama import LlamaFlashAttention2


class MyModel(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.llama_attention = LlamaFlashAttention2(config)

    def forward(self, x):
        return self.llama_attention(x)[0]


if __name__ == '__main__':
    from transformers.models.llama.configuration_llama import LlamaConfig

    config = LlamaConfig(hidden_size=1024, num_attention_heads=32)
    model = MyModel(config)
    model.cuda()

    x = torch.randn(1, 1024, 1024).cuda()
    output = model(x)
    print(output.shape)

```

--------------------------------

### Use RMSNorm CPU Kernel

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/huggingface-kernels-integration.md

Example demonstrating how to load and use the RMSNorm CPU kernel. CPU dispatch is handled automatically.

```python
import torch
from kernels import get_kernel, has_kernel

repo_id = "kernels-community/rmsnorm"

if has_kernel(repo_id):
    layer_norm = get_kernel(repo_id)

    x = torch.randn(2, 1024, 2048, dtype=torch.bfloat16)  # CPU tensor
    weight = torch.ones(2048, dtype=torch.bfloat16)

    # CPU dispatch happens automatically via torch_binding.cpp
    out = layer_norm.rms_norm_fn(x, weight, eps=1e-6)
```

--------------------------------

### Initialize a New Kernel Project

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md

Use `kernel-builder init` to scaffold a new kernel project with a standard directory structure and configuration files.

```bash
kernel-builder init --name my-username/my-kernel
```

--------------------------------

### Install Skills Globally for Codex

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md

Installs skills globally for the Codex assistant, making them available system-wide.

```bash
# install globally for Codex
kernel-builder skills add --codex --global
```

--------------------------------

### Install ROCm Kernels Skill for Codex

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md

Installs the ROCm kernels skill specifically for the Codex assistant.

```bash
# install ROCm kernels skill for Codex
kernel-builder skills add --skill rocm-kernels --codex
```

--------------------------------

### Initialize a new multi-backend kernel

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md

Creates a new kernel project that supports multiple backends, such as CUDA and XPU. List all desired backends with the --backends option.

```bash
$ kernel-builder init --name myorg/mykernel --backends cuda xpu
Initialized `myorg/mykernel` at /home/daniel/git/kernels/examples/kernels/mykernel
```

--------------------------------

### Install Metal Toolchain

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/metal.md

Install the Metal Toolchain component for Xcode, which is required for building Metal kernels.

```bash
xcodebuild -downloadComponent MetalToolchain
```

--------------------------------

### Install Kernels Package

Source: https://github.com/huggingface/kernels/blob/main/README.md

Install the `kernels` Python package using pip. Requires `torch>=2.5` and CUDA.

```bash
pip install kernels
```

--------------------------------

### Testing a Kernel with `get_kernel`

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md

Demonstrates how to test a kernel using `get_kernel` to load it from the development environment. This ensures the kernel is tested as it will be used.

```python
import kernels
import torch
import torch.nn.functional as F

relu = kernels.get_kernel("kernels-community/relu", version=1)

def test_relu():
    x = torch.randn(1024, 1024, dtype=torch.float32, device=torch.device("cuda"))
    y = relu.relu(x, torch.empty_like(x))
    y_ref = F.relu(x)
    torch.testing.assert_close(y_ref, y)
```

--------------------------------

### Install Skills to Custom Destination

Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md

Installs skills to a specified custom directory and forces overwriting if the skills already exist.

```bash
# install to a custom destination and overwrite if already present
kernel-builder skills add --dest ~/my-skills --force
```

--------------------------------

### Develop with Nix

Source: https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md

Sets up the full development environment for both the Python package and kernel-builder using Nix.

```bash
nix develop
```

--------------------------------

### Define Multi-Target Build Configuration

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/huggingface-kernels-integration.md

Example `build.toml` snippet showing how to define configurations for different backends, including CUDA and CPU. This allows a single repository to support multiple hardware targets.

```toml
# CUDA sections
[kernel.my_kernel_cuda]
backend = "cuda"
# ...

# CPU sections
[kernel.my_kernel_cpu]
backend = "cpu"
# ...
```

--------------------------------

### Install Kernels Python Package with Dev Extras

Source: https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md

Installs the kernels Python package in editable mode with development dependencies.

```bash
pip install -e "kernels/[dev]"
```

--------------------------------

### Initialize a new XPU kernel

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md

Creates a new kernel project specifically for the XPU backend. Use the --backends option to specify the desired backend.

```bash
$ kernel-builder init --name myorg/mykernel --backends xpu
```

--------------------------------

### RMSNorm Kernel Usage Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/huggingface-kernels-integration.md

Demonstrates how to load and use the RMSNorm kernel from the Hub. It includes checking for availability, inspecting kernel functions, and applying the kernel to a tensor.

```python
import torch
from kernels import get_kernel, has_kernel

repo_id = "kernels-community/triton-layer-norm"

if has_kernel(repo_id):
    layer_norm = get_kernel(repo_id)

    # Inspect available functions
    print([f for f in dir(layer_norm) if not f.startswith('_')])
    # e.g. ['layer_norm', 'layer_norm_fn', 'rms_norm_fn', ...]

    x = torch.randn(2, 1024, 2048, dtype=torch.bfloat16, device="cuda")
    weight = torch.ones(2048, dtype=torch.bfloat16, device="cuda")

    # Use the actual function name (rms_norm_fn in current version)
    out = layer_norm.rms_norm_fn(x, weight, eps=1e-6)
    print(f"Output shape: {out.shape}")
else:
    print("No ROCm-compatible build available")
```

--------------------------------

### Local Kernel Build for Development

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-h100.md

Generate project files with kernel-builder and build the kernel using `setup.py build_kernel`. This method avoids issues with `torch.utils.cpp_extension`/pybind11 under ABI3.

```bash
kernel-builder create-pyproject -f
python setup.py build_kernel
```

--------------------------------

### Profiler Output Interpretation Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/workflow_details.md

Example of profiler output indicating a memory-bound or dependency-bound situation and suggesting optimization strategies.

```text
>> IPC < 1.0: Memory-bound or dependency-bound.
   - Add prefetch instructions (_mm_prefetch with _MM_HINT_T0 or _MM_HINT_T1)
   - Reduce cache blocking tile size
   Reference: references/memory_patterns.yaml
```

--------------------------------

### Run Full Benchmark with Options

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-h100.md

Execute the benchmark script with various options to measure kernel performance. Ensure all necessary parameters like batch size, dimensions, and steps are configured.

```bash
python scripts/benchmark_example.py \
    --use-optimized-kernels \
    --compile \
    --batch-size 1 \
    --num-frames 161 \
    --height 512 \
    --width 768 \
    --steps 50 \
    --warmup-iterations 2
```

--------------------------------

### Pure Layer Example: SiluAndMul

Source: https://github.com/huggingface/kernels/blob/main/docs/source/kernel-requirements.md

Example of a pure PyTorch layer that does not implement backward. It uses the `has_backward` attribute to indicate this.

```python
class SiluAndMul(nn.Module):
    # This layer does not implement backward.
    has_backward: bool = False

    def forward(self, x: torch.Tensor):
        d = x.shape[-1] // 2
        output_shape = x.shape[:-1] + (d,)
        out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
        ops.silu_and_mul(out, x)
        return out
```

--------------------------------

### Example Kernel Project Structure

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md

A typical kernel-builder project structure should include CUDA kernel source, PyTorch C++ bindings, micro-benchmark scripts, and configuration files.

```text
examples/your_model/
├── kernel_src/
│   └── rmsnorm.cu              # Vectorized CUDA kernel
├── torch-ext/
│   ├── your_kernels/__init__.py
│   └── torch_binding.cpp       # PyTorch C++ bindings
├── benchmark_rmsnorm.py        # Micro-benchmark script
├── build.toml                  # kernel-builder config
├── setup.py                    # python setup.py build_kernel
└── pyproject.toml
```

--------------------------------

### LTX Video Benchmark README

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/manifest.txt

README file for the LTX video benchmark example. Provides instructions and context for the benchmark.

```markdown
# LTX Video Benchmark

This directory contains scripts and data for benchmarking LTX video processing capabilities on ROCm.

## Usage

1. Ensure you have the ROCm environment set up.
2. Install necessary Python dependencies:
   ```bash
   pip install -r ../../scripts/requirements.txt
   ```
3. Run the benchmark scripts (e.g., `benchmark_kernels.py` or `benchmark_e2e.py` from the parent directory).

## Results

Benchmark results are stored in `benchmark_results.json` and trace data in the `trace/` directory.
```

--------------------------------

### Hugging Face Kernels Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/manifest.txt

Example script for using Hugging Face kernels with ROCm. This script helps in understanding the integration.

```python
python scripts/huggingface_kernels_example.py --help
```

--------------------------------

### Triton Kernel Project Structure Example

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/huggingface-kernels-integration.md

Illustrates a typical project structure for a Triton-based kernel, including build configuration, kernel source, and PyTorch bindings.

```text
my-triton-kernel/
├── build.toml
├── kernel_src/
│   └── rmsnorm.py          # Triton kernel source
└── torch-ext/
    ├── torch_binding.cpp
    └── my_kernels/
        └── __init__.py
```

--------------------------------

### Example Usage of Optimized Diffusers Pipeline

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/diffusers-integration.md

Demonstrates the complete workflow: loading a LTXPipeline, moving it to CUDA, injecting optimized kernels, enabling CPU offload, running inference, and exporting the output to a video file.

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")  # ROCm via HIP

stats = inject_optimized_kernels(pipe)
print(f"RMSNorm modules patched: {stats['rmsnorm_modules']}")
# Expected: 168

pipe.enable_model_cpu_offload()  # AFTER injection

output = pipe(
    prompt="A cat sleeping in the sun",
    num_frames=25, height=480, width=704,
    num_inference_steps=30,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```

--------------------------------

### Build and Upload Kernel to Hub

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md

This command builds the kernel and uploads it to the Hugging Face Hub. Ensure your `build.toml` is configured correctly.

```bash
kernel-builder build-and-upload
```

--------------------------------

### Check Metal Toolchain installation

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/metal.md

Validate that the Metal Toolchain is installed and retrieve its details, including asset path, build version, and identifier.

```bash
$ xcodebuild -showComponent metalToolchain
Asset Path: /System/Library/AssetsV2/com_apple_MobileAsset_MetalToolchain/68d8db6212b48d387d071ff7b905df796658e713.asset/AssetData
Build Version: 17B54
Status: installed
Toolchain Identifier: com.apple.dt.toolchain.Metal.32023
Toolchain Search Path: /private/var/run/com.apple.security.cryptexd/mnt/com.apple.MobileAsset.MetalToolchain-v17.2.54.0.mDxgz0
```

--------------------------------

### Install AI Coding Assistant Skills

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder-cli.md

Installs skills for AI coding assistants such as Claude, Codex, and OpenCode. This command has subcommands for managing skills.

```APIDOC
## `kernel-builder skills`

Install skills for AI coding assistants (Claude, Codex, OpenCode)

**Usage:** `kernel-builder skills <COMMAND>`

###### **Subcommands:**

* `add` — Install a kernels skill for an AI assistant
```

--------------------------------

### Example Usage: Load Model and Inject Kernels

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/transformers-integration.md

Demonstrates loading a transformers model and tokenizer, injecting optimized CUDA kernels for RMSNorm layers, and then performing text generation. Ensure the model is loaded to CUDA before injecting kernels.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Import kernels
from ltx_kernels import rmsnorm

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Inject kernels
stats = inject_optimized_kernels(model)
print(f"RMSNorm modules patched: {stats['rmsnorm_modules']}")

# Generate text
inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

--------------------------------

### Install Kernels from Main Branch

Source: https://github.com/huggingface/kernels/blob/main/docs/source/installation.md

Install the latest version of the kernels package directly from the main branch of the GitHub repository. This includes the 'benchmark' extra.

```bash
pip install "kernels[benchmark] @ git+https://github.com/huggingface/kernels#subdirectory=kernels"
```

--------------------------------

### Initialize Git Submodules

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/references/workflow_details.md

Ensures that all necessary external tools and libraries are downloaded and initialized for the project.

```bash
git submodule update --init
```

--------------------------------

### Build and Copy Kernel Locally

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md

Use this command to build your kernel and copy it to the local build directory for immediate use.

```bash
kernel-builder build-and-copy -L
```

--------------------------------

### Load and Execute Basic Activation Kernel

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md

Demonstrates loading a GELU activation kernel from the Hugging Face Hub and executing it on a CUDA tensor. Ensure the output tensor is pre-allocated with the same shape and type as the input.

```python
import torch
from kernels import get_kernel

# Load activation kernels from Hub
activation = get_kernel("kernels-community/activation", version=1)

# Create test tensor
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")

# Execute kernel (output tensor must be pre-allocated)
y = torch.empty_like(x)
activation.gelu_fast(y, x)

print(y)
```

--------------------------------

### Verify ROCm Triton Installation

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/troubleshooting.md

Confirm your ROCm Triton installation by printing the Triton version and the PyTorch ROCm version. Also, check the device properties.

```python
import triton
print(triton.__version__)
import torch
print(torch.version.hip)  # Should show ROCm version
print(torch.cuda.get_device_properties(0))
```

--------------------------------

### Install Kernels with Curated XPU Extras

Source: https://github.com/huggingface/kernels/blob/main/docs/source/installation.md

Install the kernels package with the 'curated-xpu' extra for XPU (Intel GPU) platforms. This version omits CUDA-only dependencies.

```bash
pip install "kernels[curated-xpu]"
```

--------------------------------

### Load and Run Activation Kernel

Source: https://github.com/huggingface/kernels/blob/main/README.md

Demonstrates loading the 'activation' kernel from the Hugging Face Hub and running a GELU fast operation on a CUDA tensor.

```python
import torch

from kernels import get_kernel

# Download optimized kernels from the Hugging Face hub
activation = get_kernel("kernels-community/activation", version=1)

# Random tensor
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")

# Run the kernel
y = torch.empty_like(x)
activation.gelu_fast(y, x)

print(y)
```

--------------------------------

### Clean Install and Smoke Test ROCm Kernels

Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/diffusers-integration.md

Sets up a virtual environment, installs dependencies, and performs a basic smoke test with the LTXPipeline to verify ROCm functionality.

```bash
python -m venv .venv-rocm-kernels
source .venv-rocm-kernels/bin/activate
python -m pip install --upgrade pip
python -m pip install -r skills/rocm-kernels/scripts/requirements.txt
```

--------------------------------

### Build Configuration for Hub Upload

Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/build.md

Example build.toml configuration specifying the repository ID and version for uploading the kernel. The kernel will be uploaded to the repository defined by 'repo-id' and its version branch.

```toml
[general]
# ...
version = 1

[general.hub]
repo-id = "kernels-community/flash-attn4"
```