### Install GPU Dependencies with uv

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Installs dependencies required for local GPU evaluation using uv.

```bash
uv sync --extra gpu
```

--------------------------------

### Install Base Dependencies with uv

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Installs the base dependencies for KernelBench using uv. This command works even without a local GPU.

```bash
uv sync
```

--------------------------------

### Install KernelBench Package

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Installs the KernelBench package in editable mode, allowing for direct use of its functionalities.

```bash
!pip install -e .
```

--------------------------------

### Evaluate Generated Kernels

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Evaluates all generated kernels from a specified run directory. This example uses a local dataset source, level 1, with 8 GPU devices and a 300-second timeout.

```bash
uv run python scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300
```

--------------------------------

### Install Dependencies

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Installs necessary Python packages for KernelBench, including pydra_config, modal, litellm, ninja, and tomli. This is required for running KernelBench functionalities.

```python
# Latest KernelBench uses PyTorch 2.9.0 (latest version that supports Blackwell)
# !pip install -r requirements.txt
# However, for this exercise, we will use the latest PyTorch that ships with Google Colab
import torch
torch.__version__
```

```python
# here is a min set of our requirements
!pip install pydra_config
!pip install modal
!pip install litellm
!pip install ninja
!pip install tomli
```

--------------------------------

### Install PyTorch with AMD ROCm Backend

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Adds PyTorch with ROCm support for AMD GPUs. ROCm version 7.1 or higher is required. This is typically done within a Docker image due to ROCm setup complexity.

```bash
uv add torch --index pytorch=https://download.pytorch.org/whl/rocm7.1
```

--------------------------------

### Example Naive CUDA Matrix Multiplication Kernel

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

A sample implementation of a naive matrix multiplication kernel using PyTorch's load_inline.

```python
level_1_problem_1_sample_0_kernel = '''import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Define the custom CUDA kernel for matrix multiplication
matrix_multiply_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void matrix_multiply_kernel(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int i = 0; i < N; i++) {
            sum += A[row * N + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}

torch::Tensor matrix_multiply_cuda(torch::Tensor A, torch::Tensor B) {
    auto N = A.size(0);
    auto C = torch::zeros_like(A);

    const int block_size = 16;
    const int num_blocks_x = (N + block_size - 1) / block_size;
    const int num_blocks_y = (N + block_size - 1) / block_size;

    dim3 block(block_size, block_size);
    dim3 grid(num_blocks_x, num_blocks_y);

    matrix_multiply_kernel<<<grid, block>>>(A.data_ptr<float>(), B.data_ptr<float>(), C.data_ptr<float>(), N);

    return C;
}
"""

matrix_multiply_cpp_source = (
    "torch::Tensor matrix_multiply_cuda(torch::Tensor A, torch::Tensor B);"
)

# Compile the inline CUDA code for matrix multiplication
matrix_multiply = load_inline(
    name="matrix_multiply",
    cpp_sources=matrix_multiply_cpp_source,
    cuda_sources=matrix_multiply_source,
    functions=["matrix_multiply_cuda"],
    verbose=True,
    extra_cflags=[""],
    extra_ldflags=[""],
)


class ModelNew(nn.Module):
    def __init__(self):
        super(ModelNew, self).__init__()
        self.matrix_multiply = matrix_multiply

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        return self.matrix_multiply.matrix_multiply_cuda(A, B)
'''

# uncomment if you want to use this example
target_kernel_generation = level_1_problem_1_sample_0_kernel
```

--------------------------------

### Run Single Sample Generation and Evaluation

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Generates and evaluates a single sample for a specified problem.
This example uses the Hugging Face dataset, level 2, problem 40, and Google Gemini 2.5 Flash.

```bash
uv run python scripts/generate_and_eval_single_sample.py dataset_src=huggingface level=2 problem_id=40 server_type=google model_name=gemini/gemini-2.5-flash
```

--------------------------------

### Generate Responses and Store Kernels

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Generates responses and stores the resulting kernels locally in a specified run directory. This example targets the Hugging Face dataset, level 1, using the DeepSeek model.

```bash
uv run python scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0
```

--------------------------------

### Get Representative KernelBench Dataset

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Retrieves a representative subset of KernelBench problems from a local source, suitable for quick testing. Returns the IDs of the problems in this subset.

```python
# Get representative subset for quick testing
rep_dataset = get_representative_dataset(level=1, source="local")
print(f"Representative problems: {rep_dataset.get_problem_ids()}")
```

--------------------------------

### Get Specific KernelBench Problem

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Retrieves a specific KernelBench problem by its 1-indexed ID from a local dataset. Displays the problem name and a preview of its code.

```python
# Get a specific problem by ID (1-indexed)
problem = local_dataset.get_problem_by_id(1)
print(f"Problem: {problem.name}")  # Output: 1_Square_matrix_multiplication_.py
print(f"Code preview: {problem.code[:100]}...")
```

--------------------------------

### Define Reference PyTorch Model for Evaluation

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Defines a simple PyTorch `nn.Module` for matrix multiplication and provides functions to get input tensors and initialization arguments. This serves as a reference implementation for kernel evaluation.

```python
from kernelbench.eval import (
    eval_kernel_against_ref,
    KernelExecResult,
    get_torch_dtype_from_string,
)
import torch

# Reference PyTorch code (from KernelBench problem)
original_model_src = """
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A, B):
        return torch.matmul(A, B)

def get_inputs():
    A = torch.randn(1024, 1024, device='cuda')
    B = torch.randn(1024, 1024, device='cuda')
    return [A, B]

def get_init_inputs():
    return []
"""
```

--------------------------------

### Generate and Evaluate Custom CUDA Kernel

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Defines a custom CUDA matrix multiplication kernel and evaluates its performance and correctness against a reference implementation. Ensure CUDA is available and PyTorch is installed with CUDA support.

```python
custom_model_src = """
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

cuda_source = '''
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void matmul_kernel(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
    int N = A.size(0);
    auto C = torch::zeros({N, N}, A.options());
    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    matmul_kernel<<<grid, block>>>(A.data_ptr<float>(), B.data_ptr<float>(), C.data_ptr<float>(), N);
    return C;
}
'''

# C++ declaration; load_inline generates the Python binding for every name
# listed in `functions`, so no hand-written PYBIND11_MODULE block is needed.
cpp_source = "torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B);"

matmul_ext = load_inline(
    name='matmul_cuda',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['matmul_cuda'],
    verbose=False,
)

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A, B):
        return matmul_ext.matmul_cuda(A.contiguous(), B.contiguous())
"""

# Evaluate the kernel
result: KernelExecResult = eval_kernel_against_ref(
    original_model_src=original_model_src,
    custom_model_src=custom_model_src,
    seed_num=42,
    num_correct_trials=5,  # Number of trials with random inputs
    num_perf_trials=100,  # Number of performance measurement trials
    measure_performance=True,
    timing_method="cuda_event",  # Options: cuda_event, do_bench, host_time
    verbose=True,
    device=torch.device("cuda:0"),
    backend="cuda",  # Options: cuda, triton, cute, tilelang, hip
    precision=torch.float32,  # Options: torch.float32, torch.float16, torch.bfloat16
)

# Check results
print(f"Compiled: {result.compiled}")
print(f"Correct: {result.correctness}")
print(f"Runtime (ms): {result.runtime}")
print(f"Runtime stats: {result.runtime_stats}")
print(f"Reference runtime (ms): {result.ref_runtime}")
print(f"Metadata: {result.metadata}")

# Calculate speedup
if result.correctness and result.runtime > 0:
    speedup = result.ref_runtime / result.runtime
    print(f"Speedup: {speedup:.2f}x")
```

--------------------------------

### Directly Use Timing Functions

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Demonstrates how to directly use timing functions for custom kernel benchmarking. This provides flexibility in measuring specific code segments.

```python
# Use timing functions directly
timing_fn = get_timing_function("cuda_event")  # Also: do_bench, do_bench_impl, host_time

def my_kernel(x):
    return torch.softmax(x, dim=-1)

x = torch.randn(1024, 1024, device="cuda")
elapsed_times = timing_fn(
    kernel_fn=my_kernel,
    args=[x],
    num_warmup=3,
    num_trials=10,
    discard_first=1,
    verbose=True,
    device=torch.device("cuda:0"),
)

# Get statistics from elapsed times
stats = get_timing_stats(elapsed_times, device=torch.device("cuda:0"))
print(f"Timing stats: {stats}")

# Clear L2 cache for cold-cache benchmarking
clear_l2_cache(device="cuda:0")
```

--------------------------------

### Load KernelBench Dataset Locally

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Constructs a KernelBench dataset from the local filesystem. Specify the difficulty level and source. The number of problems is printed.

```python
from kernelbench.dataset import (
    construct_kernelbench_dataset,
    LocalKernelBenchDataset,
    HuggingFaceKernelBenchDataset,
    get_representative_dataset,
)

# Load dataset from local filesystem
local_dataset = construct_kernelbench_dataset(
    level=1,
    source="local",
)
print(f"Level 1 has {len(local_dataset)} problems")  # Output: Level 1 has 100 problems
```

--------------------------------

### Execute KernelBench CLI Scripts

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Run evaluation workflows, generate samples, and analyze benchmark results using the provided CLI tools.

```bash
uv run python scripts/generate_and_eval_single_sample.py \
  dataset_src=huggingface \
  level=1 \
  problem_id=1 \
  server_type=google \
  model_name=gemini/gemini-2.5-flash \
  backend=cuda \
  precision=fp32 \
  gpu_arch=Ada \
  eval_mode=local

uv run python scripts/generate_samples.py \
  run_name=my_experiment \
  dataset_src=huggingface \
  level=1 \
  server_type=deepseek \
  model_name=deepseek-chat \
  temperature=0.7 \
  num_workers=50 \
  backend=cuda

uv run python scripts/eval_from_generations.py \
  run_name=my_experiment \
  dataset_src=local \
  level=1 \
  num_gpu_devices=8 \
  timeout=300 \
  eval_mode=local

uv run python scripts/eval_from_generations.py \
  run_name=my_experiment \
  dataset_src=huggingface \
  level=1 \
  eval_mode=modal \
  gpu=H100

uv run python scripts/benchmark_eval_analysis.py \
  run_name=my_experiment \
  level=1 \
  hardware=L40S_matx3 \
  baseline=baseline_time_torch

uv run python scripts/generate_baseline_time.py \
  level=1 \
  device=0 \
  output_dir=results/timing/my_gpu
```

--------------------------------

### Load KernelBench Dataset from HuggingFace

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Constructs a KernelBench dataset from HuggingFace. Specify the difficulty level, source, and dataset name. This allows access to datasets hosted on HuggingFace.

```python
# Load dataset from HuggingFace
hf_dataset = construct_kernelbench_dataset(
    level=1,
    source="huggingface",
    dataset_name="ScalingIntelligence/KernelBench",
)
```

--------------------------------

### Construct Prompt for Backend Generation

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

This code imports a function to construct a prompt for a backend generation task. It is intended to be used with LLMs for generating optimized code.

```python
from kernelbench.prompt_constructor_toml import get_prompt_for_backend

# Here is an example of a constructed context with hardware context
# This injects specs (memory bandwidth, cache size) from kernelbench/prompts/hardware/gpu_specs.py
```

--------------------------------

### Load KernelBench Dataset and Reference Kernel

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

This code loads the KernelBench dataset and fetches the reference PyTorch kernel code for a selected problem. It then prints the code and saves it to a file for further use.

```python
from datasets import load_dataset
import os

# Load the dataset based on the sliders above
dataset = load_dataset("ScalingIntelligence/KernelBench")
level_string = "level_" + f"{level}"

try:
    # Fetch the problem
    target_kernel_reference = dataset[level_string][problem_id]["code"]
    problem_name = dataset[level_string][problem_id]["name"]

    print(f"✅ Successfully loaded: {problem_name}")
    print("="*80)
    print(target_kernel_reference)

    # Save to file for the next steps
    with open("tmp/reference.py", "w") as f:
        f.write(target_kernel_reference)

except IndexError:
    print(f"❌ Error: Problem ID {problem_id} does not exist in Level {level}. Please adjust the slider to a lower number.")
```

--------------------------------

### Change Directory to KernelBench

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Navigates the current working directory into the cloned KernelBench repository.

```bash
%cd /content/KernelBench
```

--------------------------------

### Manage GPU Utilities and Architecture

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Helper functions for detecting GPU vendors and configuring architectures for kernel compilation.

```python
from kernelbench.utils import (
    get_gpu_vendor,
    set_gpu_arch,
    rand_mix,
    rand_mix_like,
    NVIDIA_ARCHS,
    AMD_ARCHS,
)
import torch

# Detect GPU vendor
vendor = get_gpu_vendor(device=0)
print(f"GPU vendor: {vendor}")  # Output: nvidia, amd, or unknown

# Set GPU architecture for kernel compilation
# NVIDIA architectures: Maxwell, Pascal, Volta, Turing, Ampere, Hopper, Ada, Blackwell
# AMD architectures: gfx942 (MI300), gfx950 (MI350)
set_gpu_arch(["Hopper"])  # For H100
set_gpu_arch(["Ada"])  # For L40S, RTX 4090
set_gpu_arch(["Ampere"])  # For A100, A10G

print(f"NVIDIA archs: {NVIDIA_ARCHS}")
print(f"AMD archs: {AMD_ARCHS}")

# Generate random tensors with various distributions
```

--------------------------------

### Generate Prompts for Kernel Optimization

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Constructs prompts for various backends like CUDA, Triton, and CUTE using different shot strategies and hardware-specific guidance.

```python
from kernelbench.prompt_constructor_toml import (
    get_prompt_for_backend,
    get_custom_prompt,
    render_prompt_by_option,
    PromptConfig,
)

# Reference architecture to optimize
ref_arch_src = """
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, A, B):
        return torch.matmul(A, B)

def get_inputs():
    A = torch.randn(1024, 1024, device='cuda')
    B = torch.randn(1024, 1024, device='cuda')
    return [A, B]

def get_init_inputs():
    return []
"""

# Generate one-shot CUDA prompt
cuda_prompt = get_prompt_for_backend(
    ref_arch_src=ref_arch_src,
    backend="cuda",
    option="one_shot",
    precision="fp32",
)
print(cuda_prompt[:500])

# Generate Triton prompt with few-shot examples
triton_prompt = get_prompt_for_backend(
    ref_arch_src=ref_arch_src,
    backend="triton",
    option="few_shot",
    precision="fp32",
)

# Generate prompt with hardware-specific guidance
hardware_prompt = get_prompt_for_backend(
    ref_arch_src=ref_arch_src,
    backend="cute",
    option="one_shot",
    precision="fp32",
    include_hardware=True,
    gpu_name="H100",  # Supported: H100, A100, L40S, L4, etc.
)

# Use custom prompt from prompts.toml
custom_prompt = get_custom_prompt(
    custom_key="custom",  # Key defined in [custom_prompts] section
    ref_arch_src=ref_arch_src,
    backend="triton",
    option="one_shot",
    precision="fp32",
    include_hardware=True,
    gpu_name="A100",
)

# Zero-shot prompt (no examples)
zero_shot_prompt = get_prompt_for_backend(
    ref_arch_src=ref_arch_src,
    backend="cuda",
    option="zero_shot",
    precision="fp16",
)
```

--------------------------------

### Calculate Performance Metrics

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Compute benchmark performance metrics including fast_p and geometric mean speedup using the score module.

```python
from kernelbench.score import (
    fastp,
    geometric_mean_speed_ratio_correct_only,
    geometric_mean_speed_ratio_correct_and_faster_only,
)
import numpy as np

is_correct = np.array([True, True, False, True, True, False, True, True, True, False])
baseline_speed = np.array([10.0, 8.0, 12.0, 15.0, 9.0, 11.0, 7.0, 14.0, 6.0, 13.0])  # ms
actual_speed = np.array([5.0, 10.0, 8.0, 7.0, 4.0, 9.0, 8.0, 6.0, 3.0, 10.0])  # ms
n_problems = len(is_correct)

fast_1 = fastp(is_correct, baseline_speed, actual_speed, n_problems, p=1.0)
print(f"fast_1 (correct and faster): {fast_1:.2%}")

fast_2 = fastp(is_correct, baseline_speed, actual_speed, n_problems, p=2.0)
print(f"fast_2 (correct and 2x faster): {fast_2:.2%}")

fast_0 = fastp(is_correct, baseline_speed, actual_speed, n_problems, p=0.0)
print(f"fast_0 (correctness rate): {fast_0:.2%}")

geo_mean = geometric_mean_speed_ratio_correct_only(
    is_correct, baseline_speed, actual_speed, n_problems
)
print(f"Geometric mean speedup (correct only): {geo_mean:.2f}x")

geo_mean_faster = geometric_mean_speed_ratio_correct_and_faster_only(
    is_correct, baseline_speed, actual_speed, n_problems
)
print(f"Geometric mean speedup (correct and faster): {geo_mean_faster:.2f}x")
```

--------------------------------

### Run and Check Model Generation Script

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

This command executes the `run_and_check.py` script to compare the generated model with inline CUDA against a reference PyTorch model.

```bash
!python3 scripts/run_and_check.py ref_origin=local gpu_arch="['Turing']" ref_arch_src_path=tmp/ex_add_model_ref.py kernel_src_path=tmp/ex_add_model_generation.py
```

--------------------------------

### Analyze Benchmark Performance

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Executes the evaluation analysis script to compute success rates and timing metrics.

```bash
uv run python scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
```

--------------------------------

### Generate LLM Prompt for Kernel Optimization

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Configures the prompt generation for specific hardware backends and architectures.

```python
prompt = get_prompt_for_backend(
    ref_arch_src=target_kernel_reference,
    backend="cuda",  # You can also try "triton" or "tilelang" here!
    option="one_shot",  # <--- show example of generation format via minimum example
    include_hardware=True,  # <--- Enable hardware specific context
    gpu_name="T4"  # <--- Specify your GPU (matches keys in gpu_specs.py)
)

print("="*40)
print(" ⬇️ COPY THIS PROMPT TO YOUR LLM ⬇️ ")
print("="*40)
print(prompt)
print("="*40)
```

--------------------------------

### Check GPU Runtime

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Verifies that a GPU is available in the current runtime environment.

```python
!nvidia-smi
```

--------------------------------

### Run Kernel Performance Benchmark

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Executes the evaluation script to check correctness and speedup against reference implementations.

```bash
!python scripts/run_and_check.py ref_origin=kernelbench gpu_arch="['Turing']" level={level} problem_id={problem_id} kernel_src_path=tmp/generation.py
```

--------------------------------

### Import JSON Module and Define Path

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/inspect_baseline_times.ipynb

Imports the JSON module and specifies the path to a JSON results file for performance timing data.

```python
import json

# Path to the JSON file
json_path = "/home/ubuntu/aco/KernelBench/results/timing/H100_PCIe_LambdaLabs/baseline_time_torch.json"
```

--------------------------------

### Run Commands with uv

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Executes Python scripts using uv, ensuring the correct environment is invoked.

```bash
uv run python scripts/.py ...
```

--------------------------------

### Create Temporary Directory

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Creates a temporary directory named 'tmp' if it does not already exist. This directory is used for storing generated kernels and other temporary files.

```python
import os

# temporary directory for storing kernels
directory = "tmp"
if not os.path.exists(directory):
    os.makedirs(directory)
```

--------------------------------

### Kernel Evaluation API

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Methods for evaluating LLM-generated kernels against reference PyTorch implementations.

```APIDOC
## eval_kernel_against_ref

### Description
Evaluates an LLM-generated kernel against a reference PyTorch implementation to check for correctness and performance.

### Parameters
#### Arguments
- **original_model_src** (str) - Required - The reference PyTorch implementation code.
- **custom_model_src** (str) - Required - The LLM-generated kernel code to evaluate.

### Response
- **result** (KernelExecResult) - An object containing execution metrics, correctness status, and performance speedup data.
```

--------------------------------

### Clone KernelBench Repository

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Clones the KernelBench GitHub repository to the local environment.

```bash
!git clone https://github.com/ScalingIntelligence/KernelBench.git
```

--------------------------------

### Save Reference Kernel to File

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

This snippet re-prints the reference kernel code and saves it to `tmp/reference.py`. This step is redundant if the previous snippet executed successfully.

```python
target_kernel_reference = dataset[level_string][problem_id]["code"]
print(target_kernel_reference)

with open("tmp/reference.py", "w") as f:
    f.write(target_kernel_reference)
```

--------------------------------

### Cite KernelBench

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

BibTeX entry for referencing the KernelBench paper.

```bibtex
@misc{ouyang2025kernelbenchllmswriteefficient,
      title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
      author={Anne Ouyang and Simon Guo and Simran Arora and Alex L. Zhang and William Hu and Christopher Ré and Azalia Mirhoseini},
      year={2025},
      eprint={2502.10517},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.10517},
}
```

--------------------------------

### Save Generated Kernel to File

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Writes the generated kernel string to a Python file for the benchmarking script to consume.

```python
with open("tmp/generation.py", "w") as f:
    f.write(target_kernel_generation)
```

--------------------------------

### Save PyTorch Reference for Element-wise Addition

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Defines and saves a PyTorch reference implementation for element-wise addition to a file. This serves as the ground-truth for comparison.

```python
ex_add_model_ref = '''
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, a, b):
        return a + b


def get_inputs():
    # randomly generate input tensors based on the model architecture
    a = torch.randn(1, 128).cuda()
    b = torch.randn(1, 128).cuda()
    return [a, b]


def get_init_inputs():
    # randomly generate tensors required for initialization based on the model architecture
    return []
'''

with open("tmp/ex_add_model_ref.py", "w") as f:
    f.write(ex_add_model_ref)
```

--------------------------------

### Iterate Over KernelBench Problems

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Iterates through all problems in a KernelBench dataset loaded from a local source. Accesses problem details like ID, name, code, level, path, and hash.

```python
# Iterate over all problems
for problem in local_dataset:
    print(f"Problem {problem.problem_id}: {problem.name}")
    # Access: problem.code, problem.level, problem.path, problem.hash
```

--------------------------------

### Dataset Management API

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Methods for loading and filtering KernelBench datasets from local storage or HuggingFace.

```APIDOC
## construct_kernelbench_dataset

### Description
Constructs a dataset object for a specific difficulty level from a chosen source.

### Parameters
#### Arguments
- **level** (int) - Required - Difficulty level (1-4).
- **source** (str) - Required - Data source, either 'local' or 'huggingface'.
- **dataset_name** (str) - Optional - Name of the HuggingFace dataset.
- **problem_ids** (list) - Optional - List of specific problem IDs to include.
- **id_range** (tuple) - Optional - Inclusive range of problem IDs to include.

### Response
- **dataset** (Object) - A dataset instance containing problem objects.
```

--------------------------------

### Query LLM Providers and Extract Code

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Provides a unified interface for querying LLM providers and extracting code blocks from responses.

```python
from kernelbench.utils import (
    query_server,
    create_inference_server_from_presets,
    extract_first_code,
    extract_last_code,
    SERVER_PRESETS,
)

# Query using presets (recommended)
inference_fn = create_inference_server_from_presets(
    server_type="google",  # Options: google, openai, anthropic, deepseek, together, local
    model_name="gemini/gemini-2.5-flash",
    temperature=0.7,
    max_tokens=8192,
    verbose=True,
    time_generation=True,
)

prompt = "Write a CUDA kernel for matrix multiplication"
response = inference_fn(prompt)
print(response)

# Extract code from LLM response
code = extract_first_code(response, code_language_types=["python", "cpp", "cuda"])
print(f"Extracted code:\n{code}")

# Query server directly with more control
response = query_server(
    prompt="Optimize this PyTorch code with CUDA",
    system_prompt="You are an expert GPU kernel programmer",
    temperature=0.0,
    max_tokens=4096,
    server_type="openai",
    model_name="gpt-4o",
)

# Chat-style prompts
messages = [
    {"role": "user", "content": "What is the best way to optimize matrix multiplication?"},
]
response = query_server(
    prompt=messages,
    server_type="anthropic",
    model_name="anthropic/claude-3-7-sonnet-20250219",
    max_tokens=4096,
)

# Reasoning models (o1, o3, Gemini thinking)
response = query_server(
    prompt="Solve this complex optimization problem...",
    server_type="openai",
    model_name="o1-preview",
    is_reasoning_model=True,
    reasoning_effort="high",  # Options: low, medium, high
    max_tokens=8192,
)

# Local server (SGLang, vLLM)
response = query_server(
    prompt="Generate optimized kernel",
    server_type="local",
    server_address="localhost",
    server_port=30000,
    max_tokens=4096,
)

# Available presets
print(f"Server presets: {list(SERVER_PRESETS.keys())}")
# Output: ['deepseek', 'google', 'together', 'local', 'anthropic', 'openai', 'fireworks']
```

--------------------------------

### Read JSON File

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/inspect_baseline_times.ipynb

Opens and loads data from a JSON file. Ensure the 'json_path' variable is correctly defined before use.

```python
with open(json_path, "r") as f:
    data = json.load(f)
```

--------------------------------

### Configure KernelBench Problem Selection

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

This code snippet configures the difficulty level and problem ID for selecting a KernelBench task. It includes input validation to ensure the problem ID is within the valid range for the selected level.

```python
# @title Select KernelBench Problem Configuration 🎛️
# @markdown Select the difficulty level (1-3) and specific problem ID.
# @markdown * **Level 1 & 2:** 100 Problems
# @markdown * **Level 3:** 50 Problems

level = 1  # @param {type:"slider", min:1, max:3, step:1}
problem_id = 1  # @param {type:"slider", min:1, max:100, step:1}

# Input Validation Logic
max_problems = 50 if level == 3 else 100

if problem_id > max_problems:
    print(f"⚠️ Warning: Level {level} only has {max_problems} problems.")
    print(f" -> Automatically adjusting Problem ID from {problem_id} to {max_problems}.")
    problem_id = max_problems

print(f"Selected: Level {level}, Problem ID {problem_id}")
```

--------------------------------

### Sort and Print Performance Data

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/inspect_baseline_times.ipynb

This script processes performance data, sorts it by mean execution time, and prints the results. It assumes the data is structured in a dictionary format.

```python
times = []
for level in [1, 2, 3]:
    levelx_data = data[f"level{level}"]
    for problem_name, problem_data in levelx_data.items():
        times.append((problem_data["mean"], f"level{level}/{problem_name}"))

times.sort(key=lambda x: x[0])
for time, problem_name in times:
    print(f"{problem_name}: {time}")
```

--------------------------------

### Speed Up Evaluation with Parallel Compilation

Source: https://github.com/scalingintelligence/kernelbench/blob/main/README.md

Optimizes evaluation speed by enabling parallel compilation on CPUs before GPU evaluation. Requires specifying `build_cache=True` and `num_cpu_workers`.

```bash
# add build_cache=True and num_cpu_workers= to the command
```

--------------------------------

### Filter KernelBench Dataset by Problem IDs

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Constructs a KernelBench dataset by filtering for specific problem IDs from a local source. Returns the IDs of the problems included in the subset.

```python
# Filter by specific problem IDs
subset = construct_kernelbench_dataset(
    level=1,
    source="local",
    problem_ids=[1, 3, 5, 10],
)
print(f"Subset IDs: {subset.get_problem_ids()}")  # Output: [1, 3, 5, 10]
```

--------------------------------

### Generate Test Tensors with rand_mix

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Create tensors for robustness testing using various distribution types. These utilities are available as standalone functions or as torch methods.

```python
x = rand_mix(1024, 1024, dist="random", device="cuda", dtype=torch.float32)
template = torch.randn(512, 512, device="cuda")
y = rand_mix_like(template, dist="uniform")
z = torch.rand_mix(256, 256, dist="laplace", device="cuda")
```

--------------------------------

### Compile Inline CUDA Code for Element-wise Addition

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

This snippet shows how to load and compile inline CUDA code for an element-wise addition operation. It defines a PyTorch module that uses the compiled CUDA function.

```python
elementwise_add = load_inline(
    name="elementwise_add",
    cpp_sources=elementwise_add_cpp_source,
    cuda_sources=elementwise_add_source,
    functions=["elementwise_add_cuda"],
    verbose=True,
    extra_cflags=[""],
    extra_ldflags=[""],
)


class ModelNew(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.elementwise_add = elementwise_add

    def forward(self, a, b):
        return self.elementwise_add.elementwise_add_cuda(a, b)
```

```python
with open("tmp/ex_add_model_generation.py", "w") as f:
    f.write(ex_add_model_generation)
```

--------------------------------

### Print JSON Data

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/inspect_baseline_times.ipynb

Prints the loaded JSON data to the console. This is useful for inspecting the benchmark results.

```python
print(data)
```

--------------------------------

### Define Target Kernel Generation Variable

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Placeholder variable for storing the LLM-generated kernel code.

```python
target_kernel_generation = "paste the output here"
```

--------------------------------

### Filter KernelBench Dataset by ID Range

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Constructs a KernelBench dataset from a local source, restricted to problems within an inclusive ID range, and prints the IDs of the problems in that range.

```python
# Assumed import path; the original snippet does not show the import
from kernelbench.dataset import construct_kernelbench_dataset

# Filter by ID range (inclusive)
range_subset = construct_kernelbench_dataset(
    level=1,
    source="local",
    id_range=(1, 10),
)
print(f"Range subset: {range_subset.get_problem_ids()}")
# Output: [1, 2, 3, ..., 10]
```

--------------------------------

### Measure Reference PyTorch Program Time

Source: https://context7.com/scalingintelligence/kernelbench/llms.txt

Measures the execution time of a reference PyTorch model and returns summary statistics (mean, standard deviation, minimum, maximum) in milliseconds. Supports both eager execution and `torch.compile`.
```python
from kernelbench.timing import (
    measure_ref_program_time,
    get_timing_function,
    get_timing_stats,
    fetch_baseline_time,
    clear_l2_cache,
)
import torch

# Reference PyTorch program, passed to the timing function as source text
ref_arch_src = """
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return torch.softmax(x, dim=-1)

def get_inputs():
    return [torch.randn(1024, 1024, device='cuda')]

def get_init_inputs():
    return []
"""

# Measure with PyTorch eager execution
stats = measure_ref_program_time(
    ref_arch_name="softmax_baseline",
    ref_arch_src=ref_arch_src,
    num_warmup=5,
    num_trials=100,
    discard_first=1,
    timing_method="cuda_event",
    use_torch_compile=False,
    device=torch.device("cuda:0"),
    verbose=True,
    precision="fp32",
)
print(f"Mean time: {stats['mean']} ms")
print(f"Std: {stats['std']} ms")
print(f"Min: {stats['min']} ms, Max: {stats['max']} ms")

# Measure with torch.compile optimization
compiled_stats = measure_ref_program_time(
    ref_arch_name="softmax_compiled",
    ref_arch_src=ref_arch_src,
    num_warmup=5,
    num_trials=100,
    use_torch_compile=True,
    torch_compile_backend="inductor",
    torch_compile_options="default",
    device=torch.device("cuda:0"),
    precision="fp32",
)
print(f"Compiled mean time: {compiled_stats['mean']} ms")
```

--------------------------------

### Define CUDA Kernel for Element-wise Addition

Source: https://github.com/scalingintelligence/kernelbench/blob/main/notebooks/tutorial.ipynb

Defines the CUDA kernel source and its C++ declaration for element-wise addition, as the first part of the generated model source stored in `ex_add_model_generation`. The `'''` string opened here is not closed within this snippet; it appears to continue with the `load_inline` call and `ModelNew` class shown under "Compile Inline CUDA Code for Element-wise Addition".
```python
ex_add_model_generation = '''
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.cpp_extension import load_inline

# Define the custom CUDA kernel for element-wise addition
elementwise_add_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void elementwise_add_kernel(const float* a, const float* b, float* out, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        out[idx] = a[idx] + b[idx];
    }
}

torch::Tensor elementwise_add_cuda(torch::Tensor a, torch::Tensor b) {
    auto size = a.numel();
    auto out = torch::zeros_like(a);

    const int block_size = 256;
    const int num_blocks = (size + block_size - 1) / block_size;

    elementwise_add_kernel<<<num_blocks, block_size>>>(a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), size);

    return out;
}
"""

elementwise_add_cpp_source = (
    "torch::Tensor elementwise_add_cuda(torch::Tensor a, torch::Tensor b);"
)
```
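
--------------------------------

### Sketch: Check the Kernel's Launch Arithmetic on CPU

The element-wise addition kernel above sizes its grid with `num_blocks = (size + block_size - 1) / block_size` and guards each thread with `if (idx < size)`. The following is a minimal pure-Python sketch of that indexing scheme, intended only as a way to check the arithmetic without a GPU. The helper names `launch_config` and `emulate_elementwise_add` are hypothetical and are not part of KernelBench or PyTorch.

```python
def launch_config(size: int, block_size: int = 256) -> int:
    """Return the number of blocks needed to cover `size` elements (ceiling division)."""
    return (size + block_size - 1) // block_size


def emulate_elementwise_add(a, b, block_size=256):
    """Sequentially emulate the one-thread-per-element kernel on Python lists."""
    size = len(a)
    out = [0.0] * size
    num_blocks = launch_config(size, block_size)
    for block_idx in range(num_blocks):          # plays the role of blockIdx.x
        for thread_idx in range(block_size):     # plays the role of threadIdx.x
            idx = block_idx * block_size + thread_idx
            if idx < size:  # same bounds check as the CUDA kernel
                out[idx] = a[idx] + b[idx]
    return out


if __name__ == "__main__":
    a = [float(i) for i in range(1000)]
    b = [2.0 * i for i in range(1000)]
    result = emulate_elementwise_add(a, b)
    print(launch_config(1000))  # 4 blocks of 256 threads cover 1000 elements
    print(result[:3])           # [0.0, 3.0, 6.0]
```

With 1000 elements the last block has 24 threads whose index falls past the end of the input, which is why the bounds check is needed whenever `size` is not a multiple of `block_size`.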