### Quick Start Kernel Build Source: https://github.com/huggingface/kernels/blob/main/nix-builder/README.md Builds and copies a custom machine learning kernel using Nix. Use this command to quickly start building kernels from the examples directory. ```bash cd examples/relu nix run .#build-and-copy \ --max-jobs 2 \ --cores 8 \ -L ``` -------------------------------- ### Complete HuggingFace Kernels Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/SKILL.md Run a complete example demonstrating the usage of HuggingFace Kernels. ```bash python scripts/huggingface_kernels_example.py ``` -------------------------------- ### Install and Get Kernel Source: https://github.com/huggingface/kernels/blob/main/examples/kernels/cutlass-gemm/CARD.md Installs the kernels library and retrieves a specific kernel module by its repository ID and version. This is the primary way to start using a kernel. ```python # make sure `kernels` is installed: `pip install -U kernels` from kernels import get_kernel kernel_module = get_kernel("{{ repo_id }}", version={{ version }}) {{ functions[0] }} = kernel_module.{{ functions[0] }} {{ functions[0] }}(...) ``` -------------------------------- ### Minimal Benchmark Script Example Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-benchmark.md A basic example of a benchmark script adhering to the `kernels` benchmark structure. It defines setup, benchmark, and verification methods. ```python import torch from kernels.benchmark import Benchmark class ActivationBenchmark(Benchmark): seed = 0 def setup(self): self.x = torch.randn(128, 1024, device=self.device, dtype=torch.float16) self.out = torch.empty(128, 512, device=self.device, dtype=torch.float16) def benchmark_silu_and_mul(self): self.kernel.silu_and_mul(self.out, self.x) def verify_silu_and_mul(self): # Return reference tensor; runner compares with self.out return torch.nn.functional.silu(self.x[..., :512]) * self.x[..., 512:] ``` -------------------------------- ### Build and Install CPU Kernel Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/workflow_details.md Build the kernel in release mode and install it using pip. ```bash kernel-builder build --release pip install dist/*.whl --force-reinstall ``` -------------------------------- ### Complete Example: Build, Copy, and Run Injection Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-integration.md A comprehensive example demonstrating the full workflow: building and copying kernels using kernel-builder, then running a Python script that loads a pipeline, injects custom kernels, verifies the injection, and generates a test video. ```bash cd # kernel-builder project with your kernels nix run .#build-and-copy -L # Build kernels with kernel-builder python path/to/skills/cuda-kernels/scripts/ltx_kernel_injection_example.py ``` -------------------------------- ### Install Dependencies Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/examples/ltx-video-benchmark/README.md Install the necessary Python packages for the benchmark. Run this command from the root of the kernels project. ```bash python -m pip install -r skills/rocm-kernels/scripts/requirements.txt ``` -------------------------------- ### Minimal Kernel Injection Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-h100.md Run a minimal example demonstrating kernel injection, approximately 150 lines of code. ```bash python scripts/ltx_kernel_injection_example.py ``` -------------------------------- ### Install Kernels Benchmark Dependencies Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-benchmark.md Install the necessary dependencies for the kernels benchmark tool. Use `uv` or `pip` for installation. ```bash uv pip install 'kernels[benchmark]' ``` -------------------------------- ### Install and Authenticate hf CLI Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md Install the Hugging Face CLI tool and log in to your account to manage Hub interactions. ```shell # install the hf CLI tool hf skills add # Authenticate hf auth login ``` -------------------------------- ### Minimal Transformers Integration Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/SKILL.md Run a minimal integration example for Transformers (LLMs), approximately 120 lines of code. ```bash python scripts/transformers_injection_example.py ``` -------------------------------- ### Install nix-direnv Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/ide-setup.md Install the `nix-direnv` package using `nix profile` on non-NixOS systems. ```bash nix profile install nixpkgs#nix-direnv ``` -------------------------------- ### Check NixOS Setup Status Source: https://github.com/huggingface/kernels/blob/main/terraform/README.md Verify the completion status of the NixOS first-boot setup by checking the systemd service status. The setup is complete when the service reaches the 'Finished' state. ```bash systemctl status amazon-init ``` -------------------------------- ### Kernel-Builder Configuration Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/SKILL.md Example `build.toml` file demonstrating essential configurations for a CUDA kernel project, including general settings, hub integration, and torch extension details. ```toml [general] # Name MUST be dash-separated lowercase (my-kernel), never underscores — # `kernel-builder check-config` rejects underscores. The Python package # lives at torch-ext/. name = "ltx-kernels" backends = ["cuda"] version = 1 license = "Apache-2.0" # required field [general.hub] # Hub repo for `kernel-builder build-and-upload`; with `version` this # selects the version branch (e.g. v1). repo-id = "my-username/ltx-kernels" [torch] src = [ "torch-ext/torch_binding.cpp", "torch-ext/torch_binding.h" ] [kernel.your_kernel] backend = "cuda" src = ["kernel_src/your_kernel.cu"] depends = ["torch"] # Only constrain cuda-capabilities when the kernel truly requires it — # do not over-specify. ``` -------------------------------- ### Verify Kernel Project with Transformers Example Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md Use this command to verify that the generated kernel project works correctly with a transformers example. ```bash Verify the kernel project works with a transformers example. ``` -------------------------------- ### Run Transformers Integration Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/transformers-integration.md Execute this bash script to build custom kernels and run the integration example. Ensure you are in your kernel-builder project directory. ```bash cd # kernel-builder project with your kernels nix run .#build-and-copy -L # Build kernels with kernel-builder python path/to/skills/cuda-kernels/scripts/transformers_injection_example.py ``` -------------------------------- ### Example output of kernel versions Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-versions.md This is an example of the output you might see when listing kernel versions. Compatible versions are indicated. ```text Version 1: torch210-metal-aarch64-darwin, torch28-cxx11-cu126-aarch64-linux, torch28-cxx11-cu129-aarch64-linux, torch28-cxx11-cu128-aarch64-linux, torch29-cxx11-cu130-x86_64-linux, torch27-cxx11-cu118-x86_64-linux, torch210-cxx11-cu130-x86_64-linux, torch29-cxx11-cu128-aarch64-linux, torch29-cxx11-cu130-aarch64-linux, torch27-cxx11-cu126-x86_64-linux, ✅ torch29-cxx11-cu126-x86_64-linux (compatible), torch27-cxx11-cu128-x86_64-linux, torch210-cxx11-cu126-x86_64-linux, torch29-metal-aarch64-darwin, torch27-cxx11-cu128-aarch64-linux, torch210-cu128-x86_64-windows, torch28-cxx11-cu128-x86_64-linux, torch28-cxx11-cu126-x86_64-linux, torch210-cxx11-cu128-x86_64-linux, torch29-cxx11-cu126-aarch64-linux, ✅ torch29-cxx11-cu128-x86_64-linux (preferred), torch28-cxx11-cu129-x86_64-linux ``` -------------------------------- ### Install kernel-builder and Determinate Nix Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md Installs Determinate Nix and kernel-builder using a curl script. This is the quickest way to set up the development environment. ```bash curl -fsSL https://raw.githubusercontent.com/huggingface/kernels/main/install.sh | bash ``` -------------------------------- ### Hugging Face Kernels Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/manifest.txt Example script demonstrating the usage of Hugging Face kernels. ```python import torch from transformers.models.llama.modeling_llama import LlamaAttention class MyModel(torch.nn.Module): def __init__(self, config): super().__init__() self.llama_attention = LlamaAttention(config) def forward(self, x): return self.llama_attention(x)[0] if __name__ == '__main__': from transformers.models.llama.configuration_llama import LlamaConfig config = LlamaConfig(hidden_size=1024, num_attention_heads=32) model = MyModel(config) model.cuda() x = torch.randn(1, 1024, 1024).cuda() output = model(x) print(output.shape) ``` -------------------------------- ### Install Skills for Multiple Assistants Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md Installs skills for multiple AI coding assistants simultaneously. ```bash # install for multiple assistants kernel-builder skills add --claude --codex --opencode ``` -------------------------------- ### Install HuggingFace Kernels and PyTorch Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/huggingface-kernels-integration.md Install the necessary libraries using pip. This is the first step to using the HuggingFace Kernels library. ```bash pip install kernels torch ``` -------------------------------- ### Install HF Kernel Builder CLI Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md Install the `hf-kernel-builder` CLI tool from source to build custom kernels. Use the `--claude`, `--codex`, or `--opencode` flags to install specific agent skills. ```shell cargo install --git https://github.com/huggingface/kernels hf-kernel-builder # Install your skills. Use --claude, --codex, or --opencode kernel-builder skills add --claude ``` -------------------------------- ### Hugging Face Kernels Example Script Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/manifest.txt Example script demonstrating the usage of Hugging Face kernels. ```python import torch from kernels.torch.nn.functional import scaled_dot_product_attention def main(): # Example usage of scaled_dot_product_attention batch_size, num_heads, sequence_length, head_dim = 2, 4, 128, 64 query = torch.randn( batch_size, num_heads, sequence_length, head_dim, device="xpu" ) key = torch.randn( batch_size, num_heads, sequence_length, head_dim, device="xpu" ) value = torch.randn( batch_size, num_heads, sequence_length, head_dim, device="xpu" ) attn_output, attn_weights = scaled_dot_product_attention( query, key, value, is_causal=False ) print("Scaled Dot Product Attention Output Shape:", attn_output.shape) print("Scaled Dot Product Attention Weights Shape:", attn_weights.shape) if __name__ == "__main__": main() ``` -------------------------------- ### Install Project Dependencies Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/README.md Install the necessary Python dependencies for the project using pip. Ensure you have a requirements.txt file in the scripts directory. ```bash pip install -r scripts/requirements.txt ``` -------------------------------- ### Install HuggingFace Kernels Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/references/huggingface-kernels-integration.md Install the necessary libraries for HuggingFace Kernels integration. Ensure you have PyTorch with XPU support, an Intel XPU GPU, and Python 3.8+. ```bash pip install kernels torch numpy ``` -------------------------------- ### Combined Kernel Dependencies Example Source: https://github.com/huggingface/kernels/blob/main/docs/source/kernel-requirements.md An example showing how to specify both general and backend-specific Python dependencies along with a version number. Dependencies are validated against the active backend. ```json { "python-depends": ["einops"], "python-depends-backends": { "cuda": ["nvidia-cutlass-dsl"] }, "version": 1 } ``` -------------------------------- ### Example output of kernels info Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-info.md This is an example of the human-readable output provided by the 'kernels info' command, detailing various aspects of a kernel. ```text Repository: kernels-community/activation Revision: v1 Name: activation Version: 1 License: Apache-2.0 Upstream: - Source: - Python dependencies: - Backends: cuda, metal ``` -------------------------------- ### Transformers Injection Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/manifest.txt Example script demonstrating how to inject transformers into ROCm kernels. Ensure transformers library is installed. ```python python scripts/transformers_injection_example.py --help ``` -------------------------------- ### Example build.toml Configuration Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md A sample `build.toml` file for the 'mykernel' project, demonstrating general settings, backend selection, and Torch extension configuration. ```toml [general] backends = [ "cuda", ] name = "mykernel" version = 1 edition = 5 [general.hub] repo-id = "myorg/mykernel" [torch] src = [ "torch-ext/torch_binding.cpp", "torch-ext/torch_binding.h", ] [kernel.mykernel] backend = "cuda" depends = ["torch"] src = ["mykernel_cuda/mykernel.cu"] # If the kernel is only supported on specific capabilities, set the # cuda-capabilities option: # # cuda-capabilities = [ "9.0", "10.0", "12.0" ] ``` -------------------------------- ### Monitor NixOS First-Boot Setup Source: https://github.com/huggingface/kernels/blob/main/terraform/README.md Use this command inside the VM to follow the progress of the NixOS configuration application via the amazon-init service. This process typically takes 5-10 minutes. ```bash journalctl -u amazon-init -f ``` -------------------------------- ### Build Kernel with setup.py Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/local-dev.md Builds the kernel using the generated setup.py file. The compatible variant is placed in the 'build' directory. ```bash python setup.py build_kernel ``` -------------------------------- ### Get Locked Kernel Source: https://github.com/huggingface/kernels/blob/main/docs/source/locking.md Load a kernel using its locked revision. The lock file is included in package metadata and is visible after installation. ```python from kernels import get_locked_kernel activation = get_locked_kernel("kernels-community/activation") ``` -------------------------------- ### Install Kernels with uv Source: https://github.com/huggingface/kernels/blob/main/docs/source/installation.md Install the kernels package using the uv package installer. This requires torch>=2.5 and CUDA. ```bash uv pip install kernels ``` -------------------------------- ### Initialize and Run Phase 2 Trials Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/SKILL.md Commands to initialize Phase 2 trials, build the kernel, benchmark, save trial data, and record results. Includes an optional profiling step. ```bash # Initialize Phase 2 trials python scripts/trial_manager.py init baseline.py # For each trial: # 1. Modify kernel code (AVX512 intrinsics, or brgemm for GEMM kernels) # 2. Build kernel-builder build --release && pip install dist/*.whl --force-reinstall # 3. Benchmark python scripts/benchmark_cpu.py baseline.py --kernel-package --op # 4. Save trial python scripts/trial_manager.py save --parent --strategy "description" # 5. Record result python scripts/trial_manager.py result --correctness pass --speedup --baseline_us --kernel_us # 6. Profile (after t1, or when plateaued) python scripts/cpu_profiler.py --kernel-package --op ``` -------------------------------- ### Install Claude Skills Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md Installs skills for Claude in the current project directory. ```bash # install for Claude in the current project kernel-builder skills add --claude ``` -------------------------------- ### Transformers Injection Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/manifest.txt Example script for injecting kernels into Transformers models. ```python import torch from transformers.models.llama.modeling_llama import LlamaAttention from transformers.models.llama.configuration_llama import LlamaConfig from apex.normalization.fused_normalization import FusedLayerNorm class MyModel(torch.nn.Module): def __init__(self, config): super().__init__() self.llama_attention = LlamaAttention(config) self.layer_norm = FusedLayerNorm(config.hidden_size) def forward(self, x): return self.layer_norm(self.llama_attention(x)[0]) if __name__ == '__main__': config = LlamaConfig(hidden_size=1024, num_attention_heads=32) model = MyModel(config) model.cuda() x = torch.randn(1, 1024, 1024).cuda() output = model(x) print(output.shape) ``` -------------------------------- ### LTX Kernel Injection Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/manifest.txt Example script for LTX kernel injection. ```python import torch from transformers.models.llama.modeling_llama import LlamaAttention from ltx.kernels.llama import LlamaFlashAttention2 class MyModel(torch.nn.Module): def __init__(self, config): super().__init__() self.llama_attention = LlamaFlashAttention2(config) def forward(self, x): return self.llama_attention(x)[0] if __name__ == '__main__': from transformers.models.llama.configuration_llama import LlamaConfig config = LlamaConfig(hidden_size=1024, num_attention_heads=32) model = MyModel(config) model.cuda() x = torch.randn(1, 1024, 1024).cuda() output = model(x) print(output.shape) ``` -------------------------------- ### Use RMSNorm CPU Kernel Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/huggingface-kernels-integration.md Example demonstrating how to load and use the RMSNorm CPU kernel. CPU dispatch is handled automatically. ```python import torch from kernels import get_kernel, has_kernel repo_id = "kernels-community/rmsnorm" if has_kernel(repo_id): layer_norm = get_kernel(repo_id) x = torch.randn(2, 1024, 2048, dtype=torch.bfloat16) # CPU tensor weight = torch.ones(2048, dtype=torch.bfloat16) # CPU dispatch happens automatically via torch_binding.cpp out = layer_norm.rms_norm_fn(x, weight, eps=1e-6) ``` -------------------------------- ### Initialize a New Kernel Project Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md Use `kernel-builder init` to scaffold a new kernel project with a standard directory structure and configuration files. ```bash kernel-builder init --name my-username/my-kernel ``` -------------------------------- ### Install Skills Globally for Codex Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md Installs skills globally for the Codex assistant, making them available system-wide. ```bash # install globally for Codex kernel-builder skills add --codex --global ``` -------------------------------- ### Install ROCm Kernels Skill for Codex Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md Installs the ROCm kernels skill specifically for the Codex assistant. ```bash # install ROCm kernels skill for Codex kernel-builder skills add --skill rocm-kernels --codex ``` -------------------------------- ### Initialize a new multi-backend kernel Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md Creates a new kernel project that supports multiple backends, such as CUDA and XPU. List all desired backends with the --backends option. ```bash $ kernel-builder init --name myorg/mykernel --backends cuda xpu Initialized `myorg/mykernel` at /home/daniel/git/kernels/examples/kernels/mykernel ``` -------------------------------- ### Install Metal Toolchain Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/metal.md Install the Metal Toolchain component for Xcode, which is required for building Metal kernels. ```bash xcodebuild -downloadComponent MetalToolchain ``` -------------------------------- ### Install Kernels Package Source: https://github.com/huggingface/kernels/blob/main/README.md Install the `kernels` Python package using pip. Requires `torch>=2.5` and CUDA. ```bash pip install kernels ``` -------------------------------- ### Testing a Kernel with `get_kernel` Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md Demonstrates how to test a kernel using `get_kernel` to load it from the development environment. This ensures the kernel is tested as it will be used. ```python import kernels import torch import torch.nn.functional as F relu = kernels.get_kernel("kernels-community/relu", version=1) def test_relu(): x = torch.randn(1024, 1024, dtype=torch.float32, device=torch.device("cuda")) y = relu.relu(x, torch.empty_like(x)) y_ref = F.relu(x) torch.testing.assert_close(y_ref, y) ``` -------------------------------- ### Install Skills to Custom Destination Source: https://github.com/huggingface/kernels/blob/main/docs/source/cli-skills.md Installs skills to a specified custom directory and forces overwriting if the skills already exist. ```bash # install to a custom destination and overwrite if already present kernel-builder skills add --dest ~/my-skills --force ``` -------------------------------- ### Develop with Nix Source: https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md Sets up the full development environment for both the Python package and kernel-builder using Nix. ```bash nix develop ``` -------------------------------- ### Define Multi-Target Build Configuration Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/huggingface-kernels-integration.md Example `build.toml` snippet showing how to define configurations for different backends, including CUDA and CPU. This allows a single repository to support multiple hardware targets. ```toml # CUDA sections [kernel.my_kernel_cuda] backend = "cuda" # ... # CPU sections [kernel.my_kernel_cpu] backend = "cpu" # ... ``` -------------------------------- ### Install Kernels Python Package with Dev Extras Source: https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md Installs the kernels Python package in editable mode with development dependencies. ```bash pip install -e "kernels/[dev]" ``` -------------------------------- ### Initialize a new XPU kernel Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/writing-kernels.md Creates a new kernel project specifically for the XPU backend. Use the --backends option to specify the desired backend. ```bash $ kernel-builder init --name myorg/mykernel --backends xpu ``` -------------------------------- ### RMSNorm Kernel Usage Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/huggingface-kernels-integration.md Demonstrates how to load and use the RMSNorm kernel from the Hub. It includes checking for availability, inspecting kernel functions, and applying the kernel to a tensor. ```python import torch from kernels import get_kernel, has_kernel repo_id = "kernels-community/triton-layer-norm" if has_kernel(repo_id): layer_norm = get_kernel(repo_id) # Inspect available functions print([f for f in dir(layer_norm) if not f.startswith('_')]) # e.g. ['layer_norm', 'layer_norm_fn', 'rms_norm_fn', ...] x = torch.randn(2, 1024, 2048, dtype=torch.bfloat16, device="cuda") weight = torch.ones(2048, dtype=torch.bfloat16, device="cuda") # Use the actual function name (rms_norm_fn in current version) out = layer_norm.rms_norm_fn(x, weight, eps=1e-6) print(f"Output shape: {out.shape}") else: print("No ROCm-compatible build available") ``` -------------------------------- ### Local Kernel Build for Development Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-h100.md Generate project files with kernel-builder and build the kernel using `setup.py build_kernel`. This method avoids issues with `torch.utils.cpp_extension`/pybind11 under ABI3. ```bash kernel-builder create-pyproject -f python setup.py build_kernel ``` -------------------------------- ### Profiler Output Interpretation Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cpu-kernels/references/workflow_details.md Example of profiler output indicating a memory-bound or dependency-bound situation and suggesting optimization strategies. ```text >> IPC < 1.0: Memory-bound or dependency-bound. - Add prefetch instructions (_mm_prefetch with _MM_HINT_T0 or _MM_HINT_T1) - Reduce cache blocking tile size Reference: references/memory_patterns.yaml ``` -------------------------------- ### Run Full Benchmark with Options Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/diffusers-h100.md Execute the benchmark script with various options to measure kernel performance. Ensure all necessary parameters like batch size, dimensions, and steps are configured. ```bash python scripts/benchmark_example.py \ --use-optimized-kernels \ --compile \ --batch-size 1 \ --num-frames 161 \ --height 512 \ --width 768 \ --steps 50 \ --warmup-iterations 2 ``` -------------------------------- ### Pure Layer Example: SiluAndMul Source: https://github.com/huggingface/kernels/blob/main/docs/source/kernel-requirements.md Example of a pure PyTorch layer that does not implement backward. It uses the `has_backward` attribute to indicate this. ```python class SiluAndMul(nn.Module): # This layer does not implement backward. has_backward: bool = False def forward(self, x: torch.Tensor): d = x.shape[-1] // 2 output_shape = x.shape[:-1] + (d,) out = torch.empty(output_shape, dtype=x.dtype, device=x.device) ops.silu_and_mul(out, x) return out ``` -------------------------------- ### Example Kernel Project Structure Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/agents-guide.md A typical kernel-builder project structure should include CUDA kernel source, PyTorch C++ bindings, micro-benchmark scripts, and configuration files. ```text examples/your_model/ ├── kernel_src/ │ └── rmsnorm.cu # Vectorized CUDA kernel ├── torch-ext/ │ ├── your_kernels/__init__.py │ └── torch_binding.cpp # PyTorch C++ bindings ├── benchmark_rmsnorm.py # Micro-benchmark script ├── build.toml # kernel-builder config ├── setup.py # python setup.py build_kernel └── pyproject.toml ``` -------------------------------- ### LTX Video Benchmark README Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/manifest.txt README file for the LTX video benchmark example. Provides instructions and context for the benchmark. ```markdown # LTX Video Benchmark This directory contains scripts and data for benchmarking LTX video processing capabilities on ROCm. ## Usage 1. Ensure you have the ROCm environment set up. 2. Install necessary Python dependencies: ```bash pip install -r ../../scripts/requirements.txt ``` 3. Run the benchmark scripts (e.g., `benchmark_kernels.py` or `benchmark_e2e.py` from the parent directory). ## Results Benchmark results are stored in `benchmark_results.json` and trace data in the `trace/` directory. ``` -------------------------------- ### Hugging Face Kernels Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/manifest.txt Example script for using Hugging Face kernels with ROCm. This script helps in understanding the integration. ```python python scripts/huggingface_kernels_example.py --help ``` -------------------------------- ### Triton Kernel Project Structure Example Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/huggingface-kernels-integration.md Illustrates a typical project structure for a Triton-based kernel, including build configuration, kernel source, and PyTorch bindings. ```text my-triton-kernel/ ├── build.toml ├── kernel_src/ │ └── rmsnorm.py # Triton kernel source └── torch-ext/ ├── torch_binding.cpp └── my_kernels/ └── __init__.py ``` -------------------------------- ### Example Usage of Optimized Diffusers Pipeline Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/diffusers-integration.md Demonstrates the complete workflow: loading a LTXPipeline, moving it to CUDA, injecting optimized kernels, enabling CPU offload, running inference, and exporting the output to a video file. ```python import torch from diffusers import LTXPipeline from diffusers.utils import export_to_video pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16) pipe.to("cuda") # ROCm via HIP stats = inject_optimized_kernels(pipe) print(f"RMSNorm modules patched: {stats['rmsnorm_modules']}") # Expected: 168 pipe.enable_model_cpu_offload() # AFTER injection output = pipe( prompt="A cat sleeping in the sun", num_frames=25, height=480, width=704, num_inference_steps=30, ) export_to_video(output.frames[0], "output.mp4", fps=24) ``` -------------------------------- ### Build and Upload Kernel to Hub Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md This command builds the kernel and uploads it to the Hugging Face Hub. Ensure your `build.toml` is configured correctly. ```bash kernel-builder build-and-upload ``` -------------------------------- ### Check Metal Toolchain installation Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/metal.md Validate that the Metal Toolchain is installed and retrieve its details, including asset path, build version, and identifier. ```bash $ xcodebuild -showComponent metalToolchain Asset Path: /System/Library/AssetsV2/com_apple_MobileAsset_MetalToolchain/68d8db6212b48d387d071ff7b905df796658e713.asset/AssetData Build Version: 17B54 Status: installed Toolchain Identifier: com.apple.dt.toolchain.Metal.32023 Toolchain Search Path: /private/var/run/com.apple.security.cryptexd/mnt/com.apple.MobileAsset.MetalToolchain-v17.2.54.0.mDxgz0 ``` -------------------------------- ### Install AI Coding Assistant Skills Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder-cli.md Installs skills for AI coding assistants such as Claude, Codex, and OpenCode. This command has subcommands for managing skills. ```APIDOC ## `kernel-builder skills` Install skills for AI coding assistants (Claude, Codex, OpenCode) **Usage:** `kernel-builder skills ` ###### **Subcommands:** * `add` — Install a kernels skill for an AI assistant ``` -------------------------------- ### Example Usage: Load Model and Inject Kernels Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/transformers-integration.md Demonstrates loading a transformers model and tokenizer, injecting optimized CUDA kernels for RMSNorm layers, and then performing text generation. Ensure the model is loaded to CUDA before injecting kernels. ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer # Import kernels from ltx_kernels import rmsnorm # Load model model = AutoModelForCausalLM.from_pretrained( "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="cuda" ) tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf") # Inject kernels stats = inject_optimized_kernels(model) print(f"RMSNorm modules patched: {stats['rmsnorm_modules']}") # Generate text inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda") outputs = model.generate(**inputs, max_new_tokens=20) print(tokenizer.decode(outputs[0])) ``` -------------------------------- ### Install Kernels from Main Branch Source: https://github.com/huggingface/kernels/blob/main/docs/source/installation.md Install the latest version of the kernels package directly from the main branch of the GitHub repository. This includes the 'benchmark' extra. ```bash pip install "kernels[benchmark] @ git+https://github.com/huggingface/kernels#subdirectory=kernels" ``` -------------------------------- ### Initialize Git Submodules Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/xpu-kernels/references/workflow_details.md Ensures that all necessary external tools and libraries are downloaded and initialized for the project. ```bash git submodule update --init ``` -------------------------------- ### Build and Copy Kernel Locally Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md Use this command to build your kernel and copy it to the local build directory for immediate use. ```bash kernel-builder build-and-copy -L ``` -------------------------------- ### Load and Execute Basic Activation Kernel Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/cuda-kernels/references/huggingface-kernels-integration.md Demonstrates loading a GELU activation kernel from the Hugging Face Hub and executing it on a CUDA tensor. Ensure the output tensor is pre-allocated with the same shape and type as the input. ```python import torch from kernels import get_kernel # Load activation kernels from Hub activation = get_kernel("kernels-community/activation", version=1) # Create test tensor x = torch.randn((10, 10), dtype=torch.float16, device="cuda") # Execute kernel (output tensor must be pre-allocated) y = torch.empty_like(x) activation.gelu_fast(y, x) print(y) ``` -------------------------------- ### Verify ROCm Triton Installation Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/troubleshooting.md Confirm your ROCm Triton installation by printing the Triton version and the PyTorch ROCm version. Also, check the device properties. ```python import triton print(triton.__version__) import torch print(torch.version.hip) # Should show ROCm version print(torch.cuda.get_device_properties(0)) ``` -------------------------------- ### Install Kernels with Curated XPU Extras Source: https://github.com/huggingface/kernels/blob/main/docs/source/installation.md Install the kernels package with the 'curated-xpu' extra for XPU (Intel GPU) platforms. This version omits CUDA-only dependencies. ```bash pip install "kernels[curated-xpu]" ``` -------------------------------- ### Load and Run Activation Kernel Source: https://github.com/huggingface/kernels/blob/main/README.md Demonstrates loading the 'activation' kernel from the Hugging Face Hub and running a GELU fast operation on a CUDA tensor. ```python import torch from kernels import get_kernel # Download optimized kernels from the Hugging Face hub activation = get_kernel("kernels-community/activation", version=1) # Random tensor x = torch.randn((10, 10), dtype=torch.float16, device="cuda") # Run the kernel y = torch.empty_like(x) activation.gelu_fast(y, x) print(y) ``` -------------------------------- ### Clean Install and Smoke Test ROCm Kernels Source: https://github.com/huggingface/kernels/blob/main/kernel-builder/skills/rocm-kernels/references/diffusers-integration.md Sets up a virtual environment, installs dependencies, and performs a basic smoke test with the LTXPipeline to verify ROCm functionality. ```bash python -m venv .venv-rocm-kernels source .venv-rocm-kernels/bin/activate python -m pip install --upgrade pip python -m pip install -r skills/rocm-kernels/scripts/requirements.txt ``` -------------------------------- ### Build Configuration for Hub Upload Source: https://github.com/huggingface/kernels/blob/main/docs/source/builder/build.md Example build.toml configuration specifying the repository ID and version for uploading the kernel. The kernel will be uploaded to the repository defined by 'repo-id' and its version branch. ```toml [general] # ... version = 1 [general.hub] repo-id = "kernels-community/flash-attn4" ```