### Setup Development Environment

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Create and activate a virtual environment using `uv`, then upgrade pip, setuptools, and wheel. Install development dependencies from `requirements-dev.txt`.

```bash
uv venv --seed .venv  # use `python3 -m venv .venv` if you don't have `uv`

source .venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel "build[uv]"
uv pip install --requirements requirements-dev.txt
```

--------------------------------

### Install TileLang from GitHub

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs TileLang directly from its GitHub repository. Useful for getting the latest unreleased changes.

```bash
pip install git+https://github.com/tile-ai/tilelang
```

--------------------------------

### Install TileLang and Bitblas

Source: https://github.com/tile-ai/tilelang/blob/main/examples/gemm/README.md

Install TileLang together with the optional BitBLAS package (`bitblas`) using pip.

```bash
pip install tilelang bitblas
```

--------------------------------

### Install TileLang with Pip

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs the latest release of TileLang from PyPI. Use this for a quick setup.

```bash
pip install tilelang
```

--------------------------------

### Install Tilelang from a Prebuilt Wheel

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install tilelang using a specific prebuilt wheel file. This is useful for installing a particular version or a version tailored for your system.

```bash
pip install tilelang-0.0.0.dev0+ubuntu.20.4.cu120-py3-none-any.whl
```

--------------------------------

### Install OS-Level Prerequisites on Ubuntu/Debian

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install necessary system packages on Ubuntu or Debian-based Linux distributions before building tilelang from source.

```bash
apt-get update
apt-get install -y python3 python3-dev python3-setuptools gcc zlib1g-dev build-essential cmake libedit-dev
```

--------------------------------

### Install TileLang Locally

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs TileLang from a local source checkout after installing system dependencies. The `-e` option installs in editable mode and `-v` enables verbose output.

```bash
sudo apt-get update
sudo apt-get install -y python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev
pip install -e . -v
```

--------------------------------

### Verify Tilelang Installation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

After installation, run this command to confirm that tilelang is installed correctly and to check its version.

```bash
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Serve Documentation Locally

Source: https://github.com/tile-ai/tilelang/blob/main/docs/README.md

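Start a local HTTP server in the build output directory to view the documentation in a browser. `http.server` listens on port 8000 by default; pass a different port as an argument (for example, `python3 -m http.server 8080`).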

```bash
cd _build/html
python3 -m http.server
```

--------------------------------

### End-to-End Example: Before InjectFenceProxy

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/inject_fence_proxy.md

Shows a TileLang prim_func before the InjectFenceProxy pass is applied. This example includes descriptor initialization and a shared memory write, followed by an async wgmma intrinsic.

```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```

--------------------------------

### Install Dependencies with Pip

Source: https://github.com/tile-ai/tilelang/blob/main/docs/README.md

Run this command to install project dependencies using pip.

```bash
pip3 install -r requirements.txt
```

--------------------------------

### Set PYTHONPATH and Verify Installation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

After building the native extension, add the tilelang repository root to your `PYTHONPATH` environment variable. Then, verify the installation by importing tilelang and printing its version.

```bash
export PYTHONPATH=/path/to/tilelang:$PYTHONPATH
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Install the pre-commit hooks to ensure code quality and consistency before committing changes.

```bash
pre-commit install --install-hooks
```

--------------------------------

### Autotuner Usage Example

Source: https://context7.com/tile-ai/tilelang/llms.txt

Demonstrates how to use the AutoTuner to find the best configuration for a matrix multiplication kernel.

```APIDOC
## AutoTuner.from_kernel

### Description
Initializes an AutoTuner with a kernel factory and a list of configurations to explore.

### Method
`AutoTuner.from_kernel(kernel_factory, configs)`

### Parameters
- **kernel_factory**: A function that returns a kernel.
- **configs**: A list of dictionaries, where each dictionary represents a configuration to test.

## AutoTuner.set_compile_args

### Description
Sets the compilation arguments for the kernels.

### Method
`set_compile_args(out_idx, target, execution_backend, verbose)`

### Parameters
- **out_idx** (list): Indices for output tensors.
- **target** (str): Compilation target (e.g., "auto", "cuda", "hip").
- **execution_backend** (str): The backend to use for execution (e.g., "auto").
- **verbose** (bool): Whether to enable verbose output during compilation.

## AutoTuner.set_profile_args

### Description
Sets the profiling arguments for benchmarking the kernels.

### Method
`set_profile_args(warmup, rep, timeout, ref_prog, rtol, atol, max_mismatched_ratio, skip_check, backend)`

### Parameters
- **warmup** (int): Number of warmup runs.
- **rep** (int): Number of repetitions for benchmarking.
- **timeout** (int): Timeout in seconds for each benchmark run.
- **ref_prog** (callable): A reference program for correctness checking.
- **rtol** (float): Relative tolerance for correctness checks.
- **atol** (float): Absolute tolerance for correctness checks.
- **max_mismatched_ratio** (float): Maximum allowed ratio of mismatched elements.
- **skip_check** (bool): Whether to skip correctness checks.
- **backend** (str): Profiling backend (e.g., "event", "cupti", "cudagraph").

## AutoTuner.run

### Description
Runs the autotuning process to find the best kernel configuration.

### Method
`run() -> AutotuneResult`

### Returns
- **AutotuneResult**: An object containing the best configuration and kernel.

## AutotuneResult.save_to_disk

### Description
Saves the autotuning result to disk.

### Method
`save_to_disk(save_dir, verbose)`

### Parameters
- **save_dir** (Path): Directory to save the results.
- **verbose** (bool): Whether to print verbose output during saving.

## AutotuneResult.load_from_disk

### Description
Loads an autotuning result from disk.

### Method
`load_from_disk(save_dir, compile_args)`

### Parameters
- **save_dir** (Path): Directory from which to load the results.
- **compile_args** (CompileArgs): Compilation arguments used when saving.

### Returns
- **AutotuneResult**: The loaded autotuning result.
```
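
The tuning loop described above can be pictured with a toy model. This is a minimal sketch in plain Python, not the `AutoTuner` implementation: `pick_best` and `fake_latency` are hypothetical names, and the cost function stands in for a real GPU benchmark. It only illustrates the roles of `warmup` (runs that are discarded) and `rep` (runs that are averaged) when selecting the best configuration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, int]


def pick_best(
    configs: List[Config],
    measure: Callable[[Config], float],
    warmup: int = 3,
    rep: int = 20,
) -> Tuple[Optional[Config], float]:
    """Return the config with the lowest mean latency (toy model)."""
    best_cfg, best_latency = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):  # warmup runs are executed but not counted
            measure(cfg)
        latency = sum(measure(cfg) for _ in range(rep)) / rep
        if latency < best_latency:
            best_cfg, best_latency = cfg, latency
    return best_cfg, best_latency


def fake_latency(cfg: Config) -> float:
    """Hypothetical cost model; a real tuner would time the compiled kernel."""
    return abs(cfg["block_M"] - 128) + abs(cfg["block_K"] - 32) + 1.0


configs = [{"block_M": m, "block_K": k} for m in (64, 128, 256) for k in (32, 64)]
best, latency = pick_best(configs, fake_latency)
print(best, latency)  # {'block_M': 128, 'block_K': 32} 1.0
```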

--------------------------------

### Install Runtime Dependencies with Pip

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install the runtime dependencies `apache-tvm-ffi` and `z3-solver` with pip, then verify that tilelang imports. The commented-out lines are workarounds for `libtorch_cuda.so` errors and for Cython compile errors in editable builds.

```bash
pip install "apache-tvm-ffi>=0.1.6" "z3-solver>=4.13.0"
# If you already installed torch-c-dlpack-ext and hit `libtorch_cuda.so` errors:
# pip uninstall -y torch-c-dlpack-ext

# If you hit Cython compile errors like `PyLong_SHIFT`/`digit` not declared,
# disable the stable ABI (abi3) for editable builds:
# export CMAKE_ARGS="-DUSE_CUDA=OFF -DUSE_ROCM=ON -DROCM_PATH=/opt/rocm -DLLVM_CONFIG=${LLVM_CONFIG} -DSKBUILD_SABI_VERSION="
# pip install -e . -v --no-build-isolation --no-deps

# Verify
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Install Develop Version

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Install TileLang in an editable mode for development. This allows changes to be reflected immediately without reinstallation.

```bash
python3 -m pip install --no-build-isolation --verbose --editable .
```

--------------------------------

### Install TileLang Targets

Source: https://github.com/tile-ai/tilelang/blob/main/CMakeLists.txt

Installs the specified TileLang targets to their respective library, runtime, and archive destinations.

```cmake
install(
  TARGETS ${TILELANG_OUTPUT_TARGETS}
  LIBRARY DESTINATION tilelang/lib
  RUNTIME DESTINATION tilelang/lib
  ARCHIVE DESTINATION tilelang/lib
)
```

--------------------------------

### Matmul + ReLU Kernel Example

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/tensor_checks.md

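This reference example is named for a matrix multiplication followed by a ReLU activation. The excerpt shows kernel context initialization, shared and fragment memory allocation, a pipelined GEMM loop, and the copy back to `C`; the ReLU step itself does not appear in this excerpt.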

```python
@T.prim_func
def matmul_relu_kernel(
    A: T.Tensor((M, K), dtype),
    B: T.Tensor((K, N), dtype),
    C: T.Tensor((M, N), dtype),
):
    # Initialize Kernel Context
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_K), dtype)
        B_shared = T.alloc_shared((block_K, block_N), dtype)
        C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
        T.clear(C_local)
        for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=0):
            T.copy(A[by * block_M, ko * block_K], A_shared)
            T.copy(B[ko * block_K, bx * block_N], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * block_M, bx * block_N])

```

--------------------------------

### Access and Verify Installation in Container

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Execute commands inside the running Docker container to access its shell and verify the tilelang installation by checking its version.

```bash
docker exec -it tilelang_b200 /bin/zsh
# Inside the container:
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Install TileLang Nightly Version

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs the nightly build of TileLang for access to the latest features. Note that nightly builds may be less stable.

```bash
pip install tilelang -f https://tile-ai.github.io/whl/nightly
# or pip install tilelang --find-links https://tile-ai.github.io/whl/nightly
```

--------------------------------

### TileLang Layout Inference Text Output Example

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/debug_tools_for_tilelang.md

An example of the textual output generated by TileLang's layout inference tool, showing the mapping between logical indices, thread IDs, and register file locations.

```text
C_local inferenced layout:
  Shape: [32, 32] -> [8]
  Thread: _j // 16 * 64 + _i // 16 * 32 + _i % 8 * 4 + _j % 8 // 2
  Index:  [_j % 16 // 8 * 4 + _i % 16 // 8 * 2 + _j % 2]
```
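
The printed expressions can be checked by evaluating them directly. The sketch below is plain Python, independent of TileLang; it copies the `Thread` and `Index` formulas from the output above and confirms that the 32x32 fragment is spread over 128 threads with 8 register slots each, and that no two elements share a (thread, slot) pair. The function names are illustrative.

```python
def thread_of(_i: int, _j: int) -> int:
    # "Thread:" expression copied from the layout output
    return _j // 16 * 64 + _i // 16 * 32 + _i % 8 * 4 + _j % 8 // 2


def index_of(_i: int, _j: int) -> int:
    # "Index:" expression copied from the layout output
    return _j % 16 // 8 * 4 + _i % 16 // 8 * 2 + _j % 2


# Map every logical element (_i, _j) to its (thread, register slot) pair
placement = {
    (_i, _j): (thread_of(_i, _j), index_of(_i, _j))
    for _i in range(32)
    for _j in range(32)
}

threads = {t for t, _ in placement.values()}
slots = {s for _, s in placement.values()}

print(len(threads), len(slots))       # 128 8
print(len(set(placement.values())))   # 1024, so the mapping is one-to-one
print(placement[(0, 0)], placement[(31, 31)])  # (0, 0) (127, 7)
```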

--------------------------------

### Naive TileLang GEMV Implementation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/gemv.md

A naive GEMV kernel written in TileLang, adapted from a GEMM tiling strategy. This serves as a starting point for optimization.

```python
import tilelang.language as T

def naive_gemv(
    N: int,
    K: int,
    BLOCK_N: int,
    BLOCK_K: int,
    dtype: str = "float16",
    accum_dtype: str = "float",
):

    @T.prim_func
    def main(
            A: T.Buffer((K,), dtype),
            B: T.Buffer((N, K), dtype),
            C: T.Buffer((N,), dtype),
    ):
        with T.Kernel(T.ceildiv(N, BLOCK_N)) as bn:
            tn = T.get_thread_binding(0)  # tn = threadIdx.x
            A_shared = T.alloc_shared((BLOCK_K,), dtype)
            B_shared = T.alloc_shared((BLOCK_N, BLOCK_K), dtype)
            C_reg = T.alloc_local((1,), accum_dtype)
            T.clear(C_reg)
            for bk in T.serial(T.ceildiv(K, BLOCK_K)):
                for tk in T.serial(BLOCK_K):
                    A_shared[tk] = A[bk * BLOCK_K + tk]
                    B_shared[tn, tk] = B[bn * BLOCK_N + tn, bk * BLOCK_K + tk]
                for tk in T.serial(BLOCK_K):
                    C_reg[0] += A_shared[tk].astype(accum_dtype) * B_shared[tn,
                                                                            tk].astype(accum_dtype)
            C[bn * BLOCK_N + tn] = C_reg[0]

    return main
```
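
To see what the loop nest computes, the kernel can be emulated on the host. This is a sketch in plain Python, not TileLang code: it replaces the block index `bn` and thread index `tn` with ordinary loops, skips the shared-memory staging, and assumes `N` and `K` are multiples of the block sizes (the kernel above has no bounds checks). The function name `gemv_blocked` and the test data are illustrative.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer division rounded up, for positive b."""
    return (a + b - 1) // b


def gemv_blocked(A, B, N, K, BLOCK_N, BLOCK_K):
    """Host-side emulation of the naive GEMV loop structure."""
    C = [0] * N
    for bn in range(ceildiv(N, BLOCK_N)):      # one block per BLOCK_N rows
        for tn in range(BLOCK_N):              # one thread per row in the block
            acc = 0                            # plays the role of C_reg[0]
            for bk in range(ceildiv(K, BLOCK_K)):
                for tk in range(BLOCK_K):
                    k = bk * BLOCK_K + tk
                    acc += A[k] * B[bn * BLOCK_N + tn][k]
            C[bn * BLOCK_N + tn] = acc
    return C


N, K = 8, 16
A = [k % 5 for k in range(K)]
B = [[(n + k) % 3 for k in range(K)] for n in range(N)]

C = gemv_blocked(A, B, N, K, BLOCK_N=4, BLOCK_K=8)
C_ref = [sum(A[k] * B[n][k] for k in range(K)) for n in range(N)]
print(C == C_ref)  # True
```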

--------------------------------

### End-to-End Example: After InjectFenceProxy

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/inject_fence_proxy.md

Illustrates the same TileLang prim_func after the InjectFenceProxy pass. A `T.fence_proxy_async()` instruction has been inserted between the generic operations and the async `wgmma` intrinsic.

```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.fence_proxy_async()
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```
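
The before/after pair can be summarized with a toy model of the rewrite. This is a simplified sketch in plain Python and not the pass implementation: statements are reduced to a hypothetical `(kind, text)` pair, and the classification of each statement as `"generic"` or `"async"` is assumed from the description above.

```python
from typing import List, Tuple

Stmt = Tuple[str, str]  # (kind, text); kind is "generic", "async", or "fence"


def inject_fences(stmts: List[Stmt]) -> List[Stmt]:
    """Insert a fence before an async op that follows unfenced generic ops."""
    out: List[Stmt] = []
    pending_generic = False  # generic op seen since the last fence?
    for kind, text in stmts:
        if kind == "fence":
            pending_generic = False
        elif kind == "async" and pending_generic:
            out.append(("fence", "T.fence_proxy_async()"))
            pending_generic = False
        elif kind == "generic":
            pending_generic = True
        out.append((kind, text))
    return out


before = [
    ("generic", "T.initialize_wgmma_descriptor(...)"),
    ("generic", "smem[0] = T.float16(0)"),
    ("async", "T.ptx_wgmma_ss(...)"),
]
after = inject_fences(before)
for _, text in after:
    print(text)
```

Running the toy pass a second time on its own output adds nothing, because the existing fence clears the pending state.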

--------------------------------

### Run tilelang Docker Container

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Start a Docker container for tilelang with specified shared memory size, GPU access, and volume mounting. This command also assigns a name to the container for easier management.

```bash
docker run -itd \
  --shm-size 32g \
  --gpus all \
  -v /home/tilelang:/home/tilelang \
  --name tilelang_b200 \
  tilelang-cu120 \
  /bin/zsh
```

--------------------------------

### GEMM Analysis Example

Source: https://github.com/tile-ai/tilelang/blob/main/examples/analyze/README.md

Analyzes a GEMM kernel using the TVM IR Analyzer: `Analyzer.analysis` takes the `prim_func` and a `CUDA` architecture descriptor. The kernel body is elided in this excerpt; set the problem dimensions and block sizes to match your kernel.

```python
import tilelang.language as T
from tilelang.tools import Analyzer
from tilelang.carver.arch import CUDA

M = N = K = 1024

def kernel(block_M=128, block_N=128, block_K=32, num_stages=3, thread_num=128):
    @T.prim_func
    def main(A: T.Tensor((M, K), T.float16),
             B: T.Tensor((N, K), T.float16),
             C: T.Tensor((M, N), T.float)):
        # ... (kernel definition)
    return main

cuda_device = CUDA("cuda")
result = Analyzer.analysis(kernel(), cuda_device)
print(result)
```

--------------------------------

### Kernel Context Setup in TileLang

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/matmul.md

Sets up the block grid dimensions for the kernel. The `threads` argument specifies the number of threads per block, which the compiler uses to infer loop mappings.

```python
with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
    ...
```
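
The grid arithmetic can be worked through in plain Python. The sketch below is not TileLang code; the problem size is an illustrative assumption, and the tile origin `(by * block_M, bx * block_N)` follows the indexing used in the matmul examples. It shows that ceiling division still launches a block for a partial tile at the edge.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer division rounded up, for positive b."""
    return (a + b - 1) // b


M, N = 1000, 520              # illustrative sizes, not multiples of the block
block_M, block_N = 128, 128

grid_x = ceildiv(N, block_N)  # first T.Kernel argument, range of bx
grid_y = ceildiv(M, block_M)  # second T.Kernel argument, range of by
print(grid_x, grid_y)         # 5 8

# Origin (row, col) of the output tile handled by each block
origins = [
    (by * block_M, bx * block_N)
    for by in range(grid_y)
    for bx in range(grid_x)
]
print(origins[0], origins[-1])  # (0, 0) (896, 512)
```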

--------------------------------

### Build Documentation with Make

Source: https://github.com/tile-ai/tilelang/blob/main/docs/README.md

Execute this command to build the HTML documentation.

```bash
make html
```

--------------------------------

### Install Development Requirements and Pip CUDA Toolchain

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install development dependencies and specific NVIDIA CUDA packages from pip. This option is for building from source without a host CUDA installation.

```bash
git clone --recursive https://github.com/tile-ai/tilelang.git
cd tilelang
pip install -r requirements-dev.txt
pip install "nvidia-cuda-nvcc>=13" "nvidia-cuda-cccl>=13" "nvidia-cuda-nvrtc>=13"
pip install . -v --no-build-isolation
```

--------------------------------

### Run AutoTuner for Optimization

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Initialize an AutoTuner with the kernel and configurations, set compilation arguments, and run the tuning process. The result contains the optimized kernel.

```python
autotuner = AutoTuner.from_kernel(
    kernel=kernel, configs=get_configs(M, N, K, with_roller)).set_compile_args(
        out_idx=[-1],
        supply_type=tl.TensorSupplyType.Integer,
        ref_prog=ref_program,
        skip_check=False,
        target="auto",
    )
result = autotuner.run(warmup=3, rep=20)
out_c = result.kernel(a, b)

```

--------------------------------

### Build Native Extension with CMake and Make

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Build the native extension (`libtilelang.so`) by creating a build directory, configuring with CMake (enabling CUDA), and then using make for compilation. This is part of the recommended workflow for developers working directly from source.

```bash
mkdir -p build
cd build
cmake .. -DUSE_CUDA=ON
make -j
```

--------------------------------

### Generate Candidate Configurations using Combinatorial Product

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Use `itertools.product` to generate all possible combinations of parameters from predefined lists, creating a comprehensive set of candidate configurations.

```python
import itertools

block_M = [64, 128, 256]
block_N = [64, 128, 256]
block_K = [32, 64]
num_stages = [0, 1, 2, 3]
thread_num = [128, 256]
enable_rasterization = [True, False]
_configs = list(
    itertools.product(
        block_M,
        block_N,
        block_K,
        num_stages,
        thread_num,
        enable_rasterization,
    ))

configs = [
    {
        "block_M": c[0],
        "block_N": c[1],
        "block_K": c[2],
        "num_stages": c[3],
        "thread_num": c[4],
        "enable_rasteration": c[5]
    } for c in _configs
]

```
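
An equivalent construction keeps parameter names and candidate values together in one dictionary, which avoids positional indexing such as `c[0]`. This is a sketch of an alternative formulation, not code from the tutorial; the name `search_space` is an assumption, and the keys match the dictionary keys used above.

```python
import itertools

search_space = {
    "block_M": [64, 128, 256],
    "block_N": [64, 128, 256],
    "block_K": [32, 64],
    "num_stages": [0, 1, 2, 3],
    "thread_num": [128, 256],
    "enable_rasteration": [True, False],
}

# One dict per point in the Cartesian product of all candidate lists
configs = [
    dict(zip(search_space.keys(), values))
    for values in itertools.product(*search_space.values())
]

print(len(configs))  # 288 = 3 * 3 * 2 * 4 * 2 * 2
print(configs[0])
```

The candidate count grows multiplicatively with every added parameter, which is why the tutorial also offers Carver hints as a way to narrow the search.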

--------------------------------

### Apply Carver Hints to Configuration Parameters

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Iterate through recommended Carver hints and map their properties (e.g., `rstep`, `pipeline_stage`, `rasterization_plan`) to the corresponding configuration parameters for auto-tuning.

```python
# Configure candidate parameters
for hint in roller_hints:

    # ...existing code...

    config["block_M"] = block_m
    config["block_N"] = block_n
    config["block_K"] = hint.rstep[0]
    config["num_stages"] = hint.pipeline_stage
    config["thread_num"] = block_rows * block_cols * 32
    config["enable_rasteration"] = hint.rasterization_plan is not NoRasterization

```

--------------------------------

### TMEM Shapes for 128x256 Examples

Source: https://github.com/tile-ai/tilelang/blob/main/examples/blockscaled_gemm_sm100/mxfp8_illustrated.md

Illustrates the resulting TMEM shapes for SFA and SFB in the context of 128x256 examples, considering different numbers of columns for SFB.

```text
SFA_tmem: [128 lanes, 4 columns]
SFB_tmem: [128 lanes, 8 columns]  # two 128-column N chunks
```

--------------------------------

### Faster Rebuild for Developers

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

For developers who recompile frequently, use `pip install -e .` for an editable install, or compile manually with CMake and Ninja. When compiling manually, set `PYTHONPATH` so that Python can find the package.

```bash
pip install -r requirements-dev.txt

# For first time compilation
pip install -e . -v --no-build-isolation

# Or manually compile with cmake/ninja. Remember to set PYTHONPATH properly.
mkdir build
cd build
cmake .. -G Ninja
ninja

# Rebuild when you change the cpp code
cd build; ninja
```

--------------------------------

### Defining TileLang Kernel Launch Contexts

Source: https://context7.com/tile-ai/tilelang/llms.txt

Demonstrates how to define the GPU grid and thread counts for TileLang kernels using `T.Kernel`. Supports 1D, 2D, and 3D grids.

```python
import tilelang.language as T

# Single-dim grid (e.g., elementwise)
with T.Kernel(T.ceildiv(N, block), threads=256) as bx:
    for i in T.Parallel(block):
        C[bx * block + i] = A[bx * block + i] + B[bx * block + i]

# 2-D grid (e.g., GEMM)
with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
    ...

# 3-D grid with batch dimension
with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), batch, threads=128) as (bx, by, bz):
    ...
```

--------------------------------

### Install Tilelang in Development Mode

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install tilelang using pip in editable mode (`-e`). This allows changes to Python files to take effect immediately without reinstallation.

```bash
pip install -e . -v
```

--------------------------------

### Verify TileLang Kernel with PyTorch

Source: https://github.com/tile-ai/tilelang/blob/main/examples/gemm/README.md

Demonstrates how to verify the correctness of a compiled TileLang kernel by running it with PyTorch tensors and comparing the output against PyTorch's native matrix multiplication.

```python
import torch

# Suppose your compiled kernel is in rt_mod
profiler = Profiler(rt_mod, params, result_idx=[2])

A = torch.randn(1024, 1024).cuda().half()
B = torch.randn(1024, 1024).cuda().half()

C_tilelang = profiler(A, B)
C_ref = A @ B

torch.testing.assert_close(C_tilelang, C_ref, rtol=1e-2, atol=1e-2)
print("Results match!")
```
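
The tolerance check accepts an element when `|actual - expected| <= atol + rtol * |expected|`, which is the closeness condition documented for `torch.testing.assert_close`. The scalar sketch below restates that condition in plain Python; the helper name `within_tolerance` is hypothetical, and NaN, dtype, and shape handling are ignored.

```python
def within_tolerance(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Scalar form of the closeness condition used for each element."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Large values: the relative term dominates (0.5 <= 0.01 + 1.0)
print(within_tolerance(100.5, 100.0, rtol=1e-2, atol=1e-2))  # True

# Values near zero: only the absolute term helps (0.05 > 0.01)
print(within_tolerance(0.05, 0.0, rtol=1e-2, atol=1e-2))     # False
```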

--------------------------------

### CuTe Elementwise Addition Setup

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/elementwise.md

This C++ code snippet demonstrates the setup for an elementwise addition operation using CuTe. It utilizes `cute` tensors to represent global memory and defines local tiles for register buffering.

```c++
template<int NUM_ELE_PER_THREAD=8>
__global__ void elementwise_add(nv_bfloat16* C,
                                 const nv_bfloat16* A,
                                 const nv_bfloat16* B,
                                 int N) {
  using namespace cute;

  const int idx = threadIdx.x + blockIdx.x * blockDim.x;

  Tensor t_C = make_tensor(make_gmem_ptr(C), make_shape(N));
  Tensor t_A = make_tensor(make_gmem_ptr(A), make_shape(N));
  Tensor t_B = make_tensor(make_gmem_ptr(B), make_shape(N));

  Tensor t_C_tile = local_tile(t_C, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));
  Tensor t_A_tile = local_tile(t_A, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));
  Tensor t_B_tile = local_tile(t_B, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));

  Tensor reg_buffer_A = make_tensor_like(t_A_tile);
  Tensor reg_buffer_B = make_tensor_like(t_B_tile);

```

--------------------------------

### Hardware-Aware GEMM Configuration with Carver Templates

Source: https://context7.com/tile-ai/tilelang/llms.txt

Use Carver templates to automatically generate and rank tiling configurations for operators like GEMM based on problem dimensions and architecture descriptors. The `recommend_hints` method returns `RollerHint` objects that map to TileLang parameters.

```python
from tilelang.carver import CUDA, MatmulTemplate, FlashAttentionTemplate

arch = CUDA("cuda")  # auto-detects current GPU

# ── GEMM hints ─────────────────────────────────────────────────────────────────
M = N = K = 4096
matmul_template = MatmulTemplate(
    M=M, N=N, K=K,
    in_dtype="float16",
    out_dtype="float16",
    accum_dtype="float",
).with_arch(arch)

hints = matmul_template.recommend_hints(topk=10)
configs = []
for hint in hints:
    block_m = hint.block[0]
    block_n = hint.block[1]
    configs.append(dict(
        block_M=block_m,
        block_N=block_n,
        block_K=hint.rstep[0],
        num_stages=hint.pipeline_stage,
        threads=block_m * block_n // 32,   # approximate thread count
    ))
print(f"Top-1 config: {configs[0]}")
# e.g. {'block_M': 128, 'block_N': 128, 'block_K': 32, 'num_stages': 3, 'threads': 512}
```

--------------------------------

### Generate Optimization Hints with Carver

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Use the `recommend_hints` method of a Carver template to generate a list of top-k optimization hints for the specified operation and architecture.

```python
# Generate top-k optimization hints (topk=10 recommended)
roller_hints = carve_template.recommend_hints(topk=10)

```

--------------------------------

### Install System Libraries for ROCm Build

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Update package lists and install essential system libraries, development tools, and Python packages required for building tilelang within a ROCm container. This step ensures all build dependencies are met.

```bash
# Inside the container (as root)
apt-get update && apt-get install -y --no-install-recommends \
  build-essential git wget curl ca-certificates gnupg \
  libgtest-dev libgmock-dev \
  libprotobuf-dev protobuf-compiler libgflags-dev libsqlite3-dev \
  python3 python3-dev python3-setuptools python3-pip \
  gcc libtinfo-dev zlib1g-dev libedit-dev libxml2-dev \
  cmake ninja-build pkg-config libstdc++6 \
  && rm -rf /var/lib/apt/lists/*

# Prefer the container venv (avoid system pip)
export PATH="/opt/venv/bin:${PATH}"

# Build GoogleTest static libs (Ubuntu package ships sources only)
cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release
cmake --build /tmp/build-gtest -j"$(nproc)"
cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/
rm -rf /tmp/build-gtest

# Keep setuptools < 80 (compat with some base images)
pip install --upgrade "setuptools>=77.0.3,<80" wheel cmake ninja scikit-build-core

# Locate ROCm llvm-config (install LLVM 18 if missing)
LLVM_CONFIG_PATH=""
for p in /opt/rocm/llvm/bin/llvm-config /opt/rocm/llvm-*/bin/llvm-config /opt/rocm-*/llvm*/bin/llvm-config; do
  if [ -x "$p" ]; then LLVM_CONFIG_PATH="$p"; break; fi
done
if [ -z "$LLVM_CONFIG_PATH" ]; then
  echo "ROCm llvm-config not found; installing LLVM 18..."
  curl -fsSL https://apt.llvm.org/llvm.sh -o /tmp/llvm.sh
  chmod +x /tmp/llvm.sh
  /tmp/llvm.sh 18
  LLVM_CONFIG_PATH="$(command -v llvm-config-18)"
  if [ -z "$LLVM_CONFIG_PATH" ]; then
    echo "ERROR: llvm-config-18 not found after install"
    exit 1
  fi
fi
export LLVM_CONFIG="$LLVM_CONFIG_PATH"
export PATH="$(dirname "$LLVM_CONFIG"):/usr/local/bin:${PATH}"

# Optional shim for tools that expect llvm-config-16
mkdir -p /usr/local/bin
printf "#!/usr/bin/env bash\nexec \"%s\" \"\$@\"\n" "$LLVM_CONFIG_PATH" > /usr/local/bin/llvm-config-16
chmod +x /usr/local/bin/llvm-config-16

# TVM Python bits need Cython (for system Python used by the build)
pip install --no-cache-dir "cython>=0.29.36,<3.0"

# Clone + build TileLang (ROCm)
# Default location: /opt/tilelang (adjust if you prefer a different path).
git clone --recursive https://github.com/tile-ai/tilelang.git /opt/tilelang
cd /opt/tilelang
git submodule update --init --recursive
export CMAKE_ARGS="-DUSE_CUDA=OFF -DUSE_ROCM=ON -DROCM_PATH=/opt/rocm -DLLVM_CONFIG=${LLVM_CONFIG}"

# Avoid pulling CUDA wheels / reinstalling torch by skipping dependency resolution.
# Assume torch is already installed in the container.
pip install -e . -v --no-build-isolation --no-deps

# Manually install required runtime deps when using --no-deps.

```

--------------------------------

### Uninstall TileLang

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Remove the TileLang installation from the current environment.

```bash
python3 -m pip uninstall tilelang
```

--------------------------------

### Build Tilelang from Source with Host CUDA

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Clone the tilelang repository and build it from source using your host's CUDA toolchain. The `-v` flag provides verbose output.

```bash
git clone --recursive https://github.com/tile-ai/tilelang.git
cd tilelang
pip install . -v
```

--------------------------------

### Execute and Verify TileLang Kernel

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Demonstrates how to execute a compiled TileLang kernel with PyTorch tensors and verify its correctness against a reference implementation. It shows how to call the kernel directly and with overridden compilation constants.

```python
M, N, K = 1024, 1024, 1024

a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c_ref = torch.relu(a @ b)

# Call the kernel
c = matmul_relu(a, b)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)

# Call the kernel with overwritten compilation constants
c = matmul_relu(a, b, block_M=128, block_N=128, block_K=64)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)
```

--------------------------------

### Enabling Debug Logs in Python

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/logging.md

Enable debug logs for a Python run by setting the `TVM_LOG_DEBUG` environment variable for that command.

```shell
TVM_LOG_DEBUG=1 python3 code.py
```

--------------------------------

### Compile PrimFunc to JITKernel with tilelang.compile

Source: https://context7.com/tile-ai/tilelang/llms.txt

Use `tilelang.compile` to compile a single PrimFunc. Results are cached on disk. Specify compilation options like output index, target, execution backend, and optional pass configurations.

```python
import tilelang
import tilelang.language as T
import torch

def make_add_kernel(N: int, block: int = 256, dtype: str = "float32"):
    @T.prim_func
    def add(
        A: T.Tensor((N,), dtype),
        B: T.Tensor((N,), dtype),
        C: T.Tensor((N,), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
            for i in T.Parallel(block):
                gi = bx * block + i
                C[gi] = A[gi] + B[gi]
    return add

N = 1 << 20
prim = make_add_kernel(N)

# Compile with explicit options
kernel = tilelang.compile(
    prim,
    out_idx=[2],                         # index 2 (C) is the output
    target="cuda",
    execution_backend="tvm_ffi",         # "auto"|"tvm_ffi"|"cython"|"nvrtc"|"torch"|"cutedsl"
    verbose=False,
    pass_configs={                        # optional TileLang/TVM pass flags
        # tilelang.PassConfigKey.TL_ENABLE_DUMP_IR: True,
    },
)

A = torch.randn(N, device="cuda")
B = torch.randn(N, device="cuda")
C = kernel(A, B)                         # returns the output tensor at out_idx
torch.testing.assert_close(C, A + B)

# Export compiled artifact
kernel.export_library("/tmp/add_kernel.so")   # save shared library for later loading
print(kernel.get_kernel_source())             # inspect CUDA C source
```

--------------------------------

### Execute a serial loop

Source: https://github.com/tile-ai/tilelang/blob/main/docs/programming_guides/language_basics.md

Use `T.serial(start, stop[, step])` to define a standard sequential for-loop within a TileLang kernel.

```python
for i in T.serial(N):
    ...
```

--------------------------------

### Build Tilelang with Existing TVM Installation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Build tilelang from source while specifying the path to an existing TVM repository using the `TVM_ROOT` environment variable. Note that this may still rebuild TVM-related libraries and could lead to path issues.

```bash
TVM_ROOT=<your-tvm-repo> pip install . -v
```

--------------------------------

### Debug Dtype Mismatch

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/tensor_checks.md

This example demonstrates a dtype mismatch. The data type of the input tensor does not match the expected dtype for the kernel.

```python
import torch

A = torch.empty((M, K), device='cuda', dtype=torch.float32)  # should be float16
B = torch.empty((K, N), device='cuda', dtype=torch.float16)
C = torch.empty((M, N), device='cuda', dtype=torch.float16)
fn(A, B, C)
```

--------------------------------

### Execute an unrolled loop

Source: https://github.com/tile-ai/tilelang/blob/main/docs/programming_guides/language_basics.md

Use `T.unroll(start, stop[, step])` for loops that should be unrolled, which can improve performance by reducing loop overhead.

```python
for i in T.unroll(N):
    ...
```

--------------------------------

### Compile and Launch TileLang Kernel

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/elementwise.md

Compiles a TileLang program for CUDA execution using the cython backend and demonstrates how to launch the compiled kernel.

```python
program = elementwise_add(1024, threads=256, dtype=T.bfloat16)
kernel = tilelang.compile(program, out_idx=-1, target="cuda", execution_backend="cython")
```

```python
C = kernel(A, B)
```

--------------------------------

### Recursive Macros

Source: https://github.com/tile-ai/tilelang/blob/main/examples/eager_jit/eagerjit.en.ipynb

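Illustrates recursive macros. Because the recursion must terminate during macro expansion, the termination condition has to be known at compile time: here it depends on `x`, which is passed as a plain `int`, while `var` refers to tensor data. The example follows the Collatz (3n + 1) iteration.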

```python
@T.macro
def n31(x, var: T.Ref):
    if x == 1:
        pass
    elif x % 2 == 0:
        var = var // 2
        n31(x // 2, var)
    else:
        var = var * 3 + 1
        n31(x * 3 + 1, var)


@tilelang.jit
def foo(A: T.Tensor[[1], T.int32], n: int):
    with T.Kernel(1) as _:
        n31(n, A[0])
```

```python
A = torch.tensor([100], dtype=torch.int32, device="cuda")
foo(A, 5)
A
```
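
The expansion can be traced by hand with a plain-Python mirror of the macro. This sketch assumes the macro unrolls exactly as its control flow reads, with `x` steering the recursion and `var` receiving the same update at each step; `n31_trace` is a hypothetical helper, not part of TileLang.

```python
def n31_trace(x: int, var: int) -> int:
    """Plain-Python mirror of the macro: x drives the recursion, var is updated."""
    if x == 1:
        return var
    if x % 2 == 0:
        return n31_trace(x // 2, var // 2)
    return n31_trace(x * 3 + 1, var * 3 + 1)


# x: 5 -> 16 -> 8 -> 4 -> 2 -> 1, var: 100 -> 301 -> 150 -> 75 -> 37 -> 18
print(n31_trace(5, 100))  # 18
```

Under that assumption, `A[0]` would hold 18 after `foo(A, 5)` when it starts at 100.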

--------------------------------

### Tensor Must Be Non-NULL (Used)

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/tensor_checks.md

This example demonstrates a tensor that must be non-NULL because it is used within the function. Passing None will raise an error.

```python
@T.prim_func
def main(A: T.Tensor((M, K), dtype)):
    A[0] = 1
```