### Setup Development Environment

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Create and activate a virtual environment using `uv`, then upgrade pip, setuptools, and wheel. Install development dependencies from `requirements-dev.txt`.

```bash
uv venv --seed .venv  # use `python3 -m venv .venv` if you don't have `uv`
source .venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel "build[uv]"
uv pip install --requirements requirements-dev.txt
```

--------------------------------

### Install TileLang from GitHub

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs TileLang directly from its GitHub repository. Useful for getting the latest unreleased changes.

```bash
pip install git+https://github.com/tile-ai/tilelang
```

--------------------------------

### Install TileLang and Bitblas

Source: https://github.com/tile-ai/tilelang/blob/main/examples/gemm/README.md

Install the necessary libraries for TileLang and optional utilities like bitblas.

```bash
pip install tilelang bitblas
```

--------------------------------

### Install TileLang with Pip

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs the latest release of TileLang from PyPI. Use this for a quick setup.

```bash
pip install tilelang
```

--------------------------------

### Install Tilelang from a Prebuilt Wheel

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install tilelang using a specific prebuilt wheel file. This is useful for installing a particular version or a version tailored for your system.

```bash
pip install tilelang-0.0.0.dev0+ubuntu.20.4.cu120-py3-none-any.whl
```

--------------------------------

### Install OS-Level Prerequisites on Ubuntu/Debian

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install necessary system packages on Ubuntu or Debian-based Linux distributions before building tilelang from source.
```bash
apt-get update
apt-get install -y python3 python3-dev python3-setuptools gcc zlib1g-dev build-essential cmake libedit-dev
```

--------------------------------

### Install TileLang Locally

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs TileLang locally from source after installing system dependencies. Use the -e option for editable mode.

```bash
sudo apt-get update
sudo apt-get install -y python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev
pip install -e . -v
```

--------------------------------

### Verify Tilelang Installation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

After installation, run this shell command to confirm that tilelang is installed correctly and to check its version.

```bash
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Serve Documentation Locally

Source: https://github.com/tile-ai/tilelang/blob/main/docs/README.md

Start a local HTTP server to view the built documentation in a browser. The port can be customized.

```bash
cd _build/html
python3 -m http.server
```

--------------------------------

### End-to-End Example: Before InjectFenceProxy

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/inject_fence_proxy.md

Shows a TileLang prim_func before the InjectFenceProxy pass is applied. This example includes descriptor initialization and a shared memory write, followed by an async wgmma intrinsic.
```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```

--------------------------------

### Install Dependencies with Pip

Source: https://github.com/tile-ai/tilelang/blob/main/docs/README.md

Run this command to install project dependencies using pip.

```bash
pip3 install -r requirements.txt
```

--------------------------------

### Set PYTHONPATH and Verify Installation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

After building the native extension, add the tilelang repository root to your `PYTHONPATH` environment variable. Then, verify the installation by importing tilelang and printing its version.

```bash
export PYTHONPATH=/path/to/tilelang:$PYTHONPATH
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Install the pre-commit hooks to ensure code quality and consistency before committing changes.

```bash
pre-commit install --install-hooks
```

--------------------------------

### Autotuner Usage Example

Source: https://context7.com/tile-ai/tilelang/llms.txt

Demonstrates how to use the AutoTuner to find the best configuration for a matrix multiplication kernel.

```APIDOC
## AutoTuner.from_kernel

### Description
Initializes an AutoTuner with a kernel factory and a list of configurations to explore.

### Method
`AutoTuner.from_kernel(kernel_factory, configs)`

### Parameters
- **kernel_factory**: A function that returns a kernel.
- **configs**: A list of dictionaries, where each dictionary represents a configuration to test.

## AutoTuner.set_compile_args

### Description
Sets the compilation arguments for the kernels.

### Method
`set_compile_args(out_idx, target, execution_backend, verbose)`

### Parameters
- **out_idx** (list): Indices for output tensors.
- **target** (str): Compilation target (e.g., "auto", "cuda", "hip").
- **execution_backend** (str): The backend to use for execution (e.g., "auto").
- **verbose** (bool): Whether to enable verbose output during compilation.

## AutoTuner.set_profile_args

### Description
Sets the profiling arguments for benchmarking the kernels.

### Method
`set_profile_args(warmup, rep, timeout, ref_prog, rtol, atol, max_mismatched_ratio, skip_check, backend)`

### Parameters
- **warmup** (int): Number of warmup runs.
- **rep** (int): Number of repetitions for benchmarking.
- **timeout** (int): Timeout in seconds for each benchmark run.
- **ref_prog** (callable): A reference program for correctness checking.
- **rtol** (float): Relative tolerance for correctness checks.
- **atol** (float): Absolute tolerance for correctness checks.
- **max_mismatched_ratio** (float): Maximum allowed ratio of mismatched elements.
- **skip_check** (bool): Whether to skip correctness checks.
- **backend** (str): Profiling backend (e.g., "event", "cupti", "cudagraph").

## AutoTuner.run

### Description
Runs the autotuning process to find the best kernel configuration.

### Method
`run() -> AutotuneResult`

### Returns
- **AutotuneResult**: An object containing the best configuration and kernel.

## AutotuneResult.save_to_disk

### Description
Saves the autotuning result to disk.

### Method
`save_to_disk(save_dir, verbose)`

### Parameters
- **save_dir** (Path): Directory to save the results.
- **verbose** (bool): Whether to print verbose output during saving.

## AutotuneResult.load_from_disk

### Description
Loads an autotuning result from disk.

### Method
`load_from_disk(save_dir, compile_args)`

### Parameters
- **save_dir** (Path): Directory from which to load the results.
- **compile_args** (CompileArgs): Compilation arguments used when saving.

### Returns
- **AutotuneResult**: The loaded autotuning result.
```

--------------------------------

### Install TileLang Runtime Dependencies with Pip

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install the runtime dependencies `apache-tvm-ffi` and `z3-solver` with pip, then verify that TileLang imports correctly. The commented-out lines are workarounds for `libtorch_cuda.so` load errors and for Cython stable-ABI compile errors in editable builds.

```bash
pip install "apache-tvm-ffi>=0.1.6" "z3-solver>=4.13.0"

# If you already installed torch-c-dlpack-ext and hit `libtorch_cuda.so` errors:
# pip uninstall -y torch-c-dlpack-ext

# If you hit Cython compile errors like `PyLong_SHIFT`/`digit` not declared,
# disable the stable ABI (abi3) for editable builds:
# export CMAKE_ARGS="-DUSE_CUDA=OFF -DUSE_ROCM=ON -DROCM_PATH=/opt/rocm -DLLVM_CONFIG=${LLVM_CONFIG} -DSKBUILD_SABI_VERSION="
# pip install -e . -v --no-build-isolation --no-deps

# Verify
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Install Develop Version

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Install TileLang in editable mode for development. This allows changes to be reflected immediately without reinstallation.

```bash
python3 -m pip install --no-build-isolation --verbose --editable .
```

--------------------------------

### Install TileLang Targets

Source: https://github.com/tile-ai/tilelang/blob/main/CMakeLists.txt

Installs the specified TileLang targets to their respective library, runtime, and archive destinations.
```cmake
install(
  TARGETS ${TILELANG_OUTPUT_TARGETS}
  LIBRARY DESTINATION tilelang/lib
  RUNTIME DESTINATION tilelang/lib
  ARCHIVE DESTINATION tilelang/lib
)
```

--------------------------------

### Matmul + ReLU Kernel Example

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/tensor_checks.md

This reference kernel, named for a matrix multiplication followed by a ReLU activation, demonstrates kernel context initialization, shared memory allocation, and tensor operations. The listing as shown covers the pipelined GEMM and the copy back to `C`; it contains no explicit ReLU step.

```python
@T.prim_func
def matmul_relu_kernel(
    A: T.Tensor((M, K), dtype),
    B: T.Tensor((K, N), dtype),
    C: T.Tensor((M, N), dtype),
):
    # Initialize Kernel Context
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_K), dtype)
        B_shared = T.alloc_shared((block_K, block_N), dtype)
        C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
        T.clear(C_local)
        for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=0):
            T.copy(A[by * block_M, ko * block_K], A_shared)
            T.copy(B[ko * block_K, bx * block_N], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * block_M, bx * block_N])
```

--------------------------------

### Access and Verify Installation in Container

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Open a shell inside the running Docker container, then verify the tilelang installation by printing its version.

```bash
docker exec -it tilelang_b200 /bin/zsh
# Inside the container:
python -c "import tilelang; print(tilelang.__version__)"
```

--------------------------------

### Install TileLang Nightly Version

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Installs the nightly build of TileLang for access to the latest features. Note that nightly builds may be less stable.
```bash
pip install tilelang -f https://tile-ai.github.io/whl/nightly
# or
pip install tilelang --find-links https://tile-ai.github.io/whl/nightly
```

--------------------------------

### TileLang Layout Inference Text Output Example

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/debug_tools_for_tilelang.md

An example of the textual output generated by TileLang's layout inference tool, showing the mapping between logical indices, thread IDs, and register file locations.

```text
C_local inferenced layout:
  Shape: [32, 32] -> [8]
  Thread: _j // 16 * 64 + _i // 16 * 32 + _i % 8 * 4 + _j % 8 // 2
  Index: [_j % 16 // 8 * 4 + _i % 16 // 8 * 2 + _j % 2]
```

--------------------------------

### Naive TileLang GEMV Implementation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/gemv.md

A naive GEMV kernel written in TileLang, adapted from a GEMM tiling strategy. This serves as a starting point for optimization.

```python
# `T` is the TileLang language module, the same alias the other snippets import.
import tilelang.language as T


def naive_gemv(
    N: int,
    K: int,
    BLOCK_N: int,
    BLOCK_K: int,
    dtype: str = "float16",
    accum_dtype: str = "float",
):

    @T.prim_func
    def main(
        A: T.Buffer((K,), dtype),
        B: T.Buffer((N, K), dtype),
        C: T.Buffer((N,), dtype),
    ):
        with T.Kernel(T.ceildiv(N, BLOCK_N)) as bn:
            tn = T.get_thread_binding(0)  # tn = threadIdx.x
            A_shared = T.alloc_shared((BLOCK_K,), dtype)
            B_shared = T.alloc_shared((BLOCK_N, BLOCK_K), dtype)
            C_reg = T.alloc_local((1,), accum_dtype)
            T.clear(C_reg)
            for bk in T.serial(T.ceildiv(K, BLOCK_K)):
                for tk in T.serial(BLOCK_K):
                    A_shared[tk] = A[bk * BLOCK_K + tk]
                    B_shared[tn, tk] = B[bn * BLOCK_N + tn, bk * BLOCK_K + tk]
                for tk in T.serial(BLOCK_K):
                    C_reg[0] += A_shared[tk].astype(accum_dtype) * B_shared[tn, tk].astype(accum_dtype)
            C[bn * BLOCK_N + tn] = C_reg[0]

    return main
```

--------------------------------

### End-to-End Example: After InjectFenceProxy

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/inject_fence_proxy.md

Illustrates
the same TileLang prim_func after the InjectFenceProxy pass. A `T.fence_proxy_async()` instruction has been inserted between the generic operations and the async `wgmma` intrinsic.

```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.fence_proxy_async()
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```

--------------------------------

### Run tilelang Docker Container

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Start a Docker container for tilelang with specified shared memory size, GPU access, and volume mounting. This command also assigns a name to the container for easier management.

```bash
docker run -itd \
  --shm-size 32g \
  --gpus all \
  -v /home/tilelang:/home/tilelang \
  --name tilelang_b200 \
  tilelang-cu120 \
  /bin/zsh
```

--------------------------------

### GEMM Analysis Example

Source: https://github.com/tile-ai/tilelang/blob/main/examples/analyze/README.md

Analyzes a GEMM kernel using the TVM IR Analyzer. Requires importing T, Analyzer, and CUDA. Ensure kernel dimensions and block sizes are appropriately set.

```python
import tilelang.language as T
from tilelang.tools import Analyzer
from tilelang.carver.arch import CUDA

M = N = K = 1024


def kernel(block_M=128, block_N=128, block_K=32, num_stages=3, thread_num=128):

    @T.prim_func
    def main(A: T.Tensor((M, K), T.float16), B: T.Tensor((N, K), T.float16), C: T.Tensor((M, N), T.float)):
        # ... (kernel definition)

    return main


cuda_device = CUDA("cuda")
result = Analyzer.analysis(kernel(), cuda_device)
print(result)
```

--------------------------------

### Kernel Context Setup in TileLang

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/matmul.md

Sets up the block grid dimensions for the kernel. The `threads` argument specifies the number of threads per block, which the compiler uses to infer loop mappings.

```python
with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
    ...
```

--------------------------------

### Build Documentation with Make

Source: https://github.com/tile-ai/tilelang/blob/main/docs/README.md

Execute this command to build the HTML documentation.

```bash
make html
```

--------------------------------

### Install Development Requirements and Pip CUDA Toolchain

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install development dependencies and specific NVIDIA CUDA packages from pip. This option is for building from source without a host CUDA installation.

```bash
git clone --recursive https://github.com/tile-ai/tilelang.git
cd tilelang
pip install -r requirements-dev.txt
pip install "nvidia-cuda-nvcc>=13" "nvidia-cuda-cccl>=13" "nvidia-cuda-nvrtc>=13"
pip install . -v --no-build-isolation
```

--------------------------------

### Run AutoTuner for Optimization

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Initialize an AutoTuner with the kernel and configurations, set compilation arguments, and run the tuning process. The result contains the optimized kernel.
```python
autotuner = AutoTuner.from_kernel(
    kernel=kernel, configs=get_configs(M, N, K, with_roller)).set_compile_args(
        out_idx=[-1],
        supply_type=tl.TensorSupplyType.Integer,
        ref_prog=ref_program,
        skip_check=False,
        target="auto",
    )
result = autotuner.run(warmup=3, rep=20)
out_c = result.kernel(a, b)
```

--------------------------------

### Build Native Extension with CMake and Make

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Build the native extension (`libtilelang.so`) by creating a build directory, configuring with CMake (enabling CUDA), and then using make for compilation. This is part of the recommended workflow for developers working directly from source.

```bash
mkdir -p build
cd build
cmake .. -DUSE_CUDA=ON
make -j
```

--------------------------------

### Generate Candidate Configurations using Combinatorial Product

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Use `itertools.product` to generate all possible combinations of parameters from predefined lists, creating a comprehensive set of candidate configurations.

```python
import itertools

block_M = [64, 128, 256]
block_N = [64, 128, 256]
block_K = [32, 64]
num_stages = [0, 1, 2, 3]
thread_num = [128, 256]
enable_rasterization = [True, False]
_configs = list(
    itertools.product(
        block_M,
        block_N,
        block_K,
        num_stages,
        thread_num,
        enable_rasterization,
    ))

configs = [
    {
        "block_M": c[0],
        "block_N": c[1],
        "block_K": c[2],
        "num_stages": c[3],
        "thread_num": c[4],
        "enable_rasteration": c[5]
    } for c in _configs
]
```

--------------------------------

### Apply Carver Hints to Configuration Parameters

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Iterate through recommended Carver hints and map their properties (e.g., `rstep`, `pipeline_stage`, `rasterization_plan`) to the corresponding configuration parameters for auto-tuning.
```python
# Configure candidate parameters
for hint in roller_hints:

    # ...existing code...

    config["block_M"] = block_m
    config["block_N"] = block_n
    config["block_K"] = hint.rstep[0]
    config["num_stages"] = hint.pipeline_stage
    config["thread_num"] = block_rows * block_cols * 32
    config["enable_rasteration"] = hint.rasterization_plan is not NoRasterization
```

--------------------------------

### TMEM Shapes for 128x256 Examples

Source: https://github.com/tile-ai/tilelang/blob/main/examples/blockscaled_gemm_sm100/mxfp8_illustrated.md

Illustrates the resulting TMEM shapes for SFA and SFB in the context of 128x256 examples, considering different numbers of columns for SFB.

```text
SFA_tmem: [128 lanes, 4 columns]
SFB_tmem: [128 lanes, 8 columns]   # two 128-column N chunks
```

--------------------------------

### Faster Rebuild for Developers

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

For developers needing frequent recompilation, use `pip install -e .` for editable installs or manually compile with cmake/ninja. Ensure PYTHONPATH is set correctly.

```bash
pip install -r requirements-dev.txt

# For first time compilation
pip install -e . -v --no-build-isolation

# Or manually compile with cmake/ninja. Remember to set PYTHONPATH properly.
mkdir build
cd build
cmake .. -G Ninja
ninja

# Rebuild when you change the cpp code
cd build; ninja
```

--------------------------------

### Defining TileLang Kernel Launch Contexts

Source: https://context7.com/tile-ai/tilelang/llms.txt

Demonstrates how to define the GPU grid and thread counts for TileLang kernels using `T.Kernel`. Supports 1D, 2D, and 3D grids.

```python
import tilelang.language as T

# Single-dim grid (e.g., elementwise)
with T.Kernel(T.ceildiv(N, block), threads=256) as bx:
    for i in T.Parallel(block):
        C[bx * block + i] = A[bx * block + i] + B[bx * block + i]

# 2-D grid (e.g., GEMM)
with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
    ...
# 3-D grid with batch dimension
with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), batch, threads=128) as (bx, by, bz):
    ...
```

--------------------------------

### Install Tilelang in Development Mode

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Install tilelang using pip in editable mode (`-e`). This allows changes to Python files to take effect immediately without reinstallation.

```bash
pip install -e . -v
```

--------------------------------

### Verify TileLang Kernel with PyTorch

Source: https://github.com/tile-ai/tilelang/blob/main/examples/gemm/README.md

Demonstrates how to verify the correctness of a compiled TileLang kernel by running it with PyTorch tensors and comparing the output against PyTorch's native matrix multiplication.

```python
import torch

# Suppose your compiled kernel is in rt_mod
profiler = Profiler(rt_mod, params, result_idx=[2])

A = torch.randn(1024, 1024).cuda().half()
B = torch.randn(1024, 1024).cuda().half()

C_tilelang = profiler(A, B)
C_ref = A @ B

torch.testing.assert_close(C_tilelang, C_ref, rtol=1e-2, atol=1e-2)
print("Results match!")
```

--------------------------------

### CuTe Elementwise Addition Setup

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/elementwise.md

This C++ code snippet demonstrates the setup for an elementwise addition operation using CuTe. It utilizes `cute` tensors to represent global memory and defines local tiles for register buffering.
```c++
// NOTE: the template parameter list and the `Int<...>` arguments are missing
// from this listing; `NumPerThread` is a placeholder name (an assumption) for
// the number of elements each thread handles.
template <int NumPerThread>
__global__ void elementwise_add(nv_bfloat16* C,
                                const nv_bfloat16* A,
                                const nv_bfloat16* B,
                                int N) {
  using namespace cute;
  const int idx = threadIdx.x + blockIdx.x * blockDim.x;

  // Wrap the raw global-memory pointers as 1-D CuTe tensors of length N.
  Tensor t_C = make_tensor(make_gmem_ptr(C), make_shape(N));
  Tensor t_A = make_tensor(make_gmem_ptr(A), make_shape(N));
  Tensor t_B = make_tensor(make_gmem_ptr(B), make_shape(N));

  // Select this thread's tile of each tensor.
  Tensor t_C_tile = local_tile(t_C, make_shape(Int<NumPerThread>{}), make_coord(idx));
  Tensor t_A_tile = local_tile(t_A, make_shape(Int<NumPerThread>{}), make_coord(idx));
  Tensor t_B_tile = local_tile(t_B, make_shape(Int<NumPerThread>{}), make_coord(idx));

  // Register buffers shaped like the input tiles.
  Tensor reg_buffer_A = make_tensor_like(t_A_tile);
  Tensor reg_buffer_B = make_tensor_like(t_B_tile);
```

--------------------------------

### Hardware-Aware GEMM Configuration with Carver Templates

Source: https://context7.com/tile-ai/tilelang/llms.txt

Use Carver templates to automatically generate and rank tiling configurations for operators like GEMM based on problem dimensions and architecture descriptors. The `recommend_hints` method returns `RollerHint` objects that map to TileLang parameters.
```python
from tilelang.carver import CUDA, MatmulTemplate, FlashAttentionTemplate

arch = CUDA("cuda")  # auto-detects current GPU

# ── GEMM hints ─────────────────────────────────────────────────────────────────
M = N = K = 4096
matmul_template = MatmulTemplate(
    M=M, N=N, K=K,
    in_dtype="float16",
    out_dtype="float16",
    accum_dtype="float",
).with_arch(arch)

hints = matmul_template.recommend_hints(topk=10)
configs = []
for hint in hints:
    block_m = hint.block[0]
    block_n = hint.block[1]
    configs.append(dict(
        block_M=block_m,
        block_N=block_n,
        block_K=hint.rstep[0],
        num_stages=hint.pipeline_stage,
        threads=block_m * block_n // 32,  # approximate thread count
    ))

print(f"Top-1 config: {configs[0]}")
# {'block_M': 128, 'block_N': 128, 'block_K': 32, 'num_stages': 3, 'threads': 512}
```

--------------------------------

### Generate Optimization Hints with Carver

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/auto_tuning.md

Use the `recommend_hints` method of a Carver template to generate a list of top-k optimization hints for the specified operation and architecture.

```python
# Generate top-k optimization hints (topk=10 recommended)
roller_hints = carve_template.recommend_hints(topk=10)
```

--------------------------------

### Install System Libraries for ROCm Build

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Update package lists and install essential system libraries, development tools, and Python packages required for building tilelang within a ROCm container. This step ensures all build dependencies are met.
```bash
# Inside the container (as root)
apt-get update && apt-get install -y --no-install-recommends \
    build-essential git wget curl ca-certificates gnupg \
    libgtest-dev libgmock-dev \
    libprotobuf-dev protobuf-compiler libgflags-dev libsqlite3-dev \
    python3 python3-dev python3-setuptools python3-pip \
    gcc libtinfo-dev zlib1g-dev libedit-dev libxml2-dev \
    cmake ninja-build pkg-config libstdc++6 \
    && rm -rf /var/lib/apt/lists/*

# Prefer the container venv (avoid system pip)
export PATH="/opt/venv/bin:${PATH}"

# Build GoogleTest static libs (Ubuntu package ships sources only)
cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release
cmake --build /tmp/build-gtest -j"$(nproc)"
cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/
rm -rf /tmp/build-gtest

# Keep setuptools < 80 (compat with some base images)
pip install --upgrade "setuptools>=77.0.3,<80" wheel cmake ninja scikit-build-core

# Locate ROCm llvm-config (install LLVM 18 if missing)
LLVM_CONFIG_PATH=""
for p in /opt/rocm/llvm/bin/llvm-config /opt/rocm/llvm-*/bin/llvm-config /opt/rocm-*/llvm*/bin/llvm-config; do
    if [ -x "$p" ]; then LLVM_CONFIG_PATH="$p"; break; fi
done
if [ -z "$LLVM_CONFIG_PATH" ]; then
    echo "ROCm llvm-config not found; installing LLVM 18..."
    curl -fsSL https://apt.llvm.org/llvm.sh -o /tmp/llvm.sh
    chmod +x /tmp/llvm.sh
    /tmp/llvm.sh 18
    LLVM_CONFIG_PATH="$(command -v llvm-config-18)"
    if [ -z "$LLVM_CONFIG_PATH" ]; then
        echo "ERROR: llvm-config-18 not found after install"
        exit 1
    fi
fi
export LLVM_CONFIG="$LLVM_CONFIG_PATH"
export PATH="$(dirname "$LLVM_CONFIG"):/usr/local/bin:${PATH}"

# Optional shim for tools that expect llvm-config-16
mkdir -p /usr/local/bin
printf "#!/usr/bin/env bash\nexec \"%s\" \"\$@\"\n" "$LLVM_CONFIG_PATH" > /usr/local/bin/llvm-config-16
chmod +x /usr/local/bin/llvm-config-16

# TVM Python bits need Cython (for system Python used by the build)
pip install --no-cache-dir "cython>=0.29.36,<3.0"

# Clone + build TileLang (ROCm)
# Default location: /opt/tilelang (adjust if you prefer a different path).
git clone --recursive https://github.com/tile-ai/tilelang.git /opt/tilelang
cd /opt/tilelang
git submodule update --init --recursive
export CMAKE_ARGS="-DUSE_CUDA=OFF -DUSE_ROCM=ON -DROCM_PATH=/opt/rocm -DLLVM_CONFIG=${LLVM_CONFIG}"

# Avoid pulling CUDA wheels / reinstalling torch by skipping dependency resolution.
# Assume torch is already installed in the container.
pip install -e . -v --no-build-isolation --no-deps

# Manually install required runtime deps when using --no-deps.
```

--------------------------------

### Uninstall TileLang

Source: https://github.com/tile-ai/tilelang/blob/main/CONTRIBUTING.md

Remove the TileLang installation from the current environment.

```bash
python3 -m pip uninstall tilelang
```

--------------------------------

### Build Tilelang from Source with Host CUDA

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Clone the tilelang repository and build it from source using your host's CUDA toolchain. The `-v` flag provides verbose output.

```bash
git clone --recursive https://github.com/tile-ai/tilelang.git
cd tilelang
pip install . -v
```

--------------------------------

### Execute and Verify TileLang Kernel

Source: https://github.com/tile-ai/tilelang/blob/main/README.md

Demonstrates how to execute a compiled TileLang kernel with PyTorch tensors and verify its correctness against a reference implementation. It shows how to call the kernel directly and with overridden compilation constants.

```python
M, N, K = 1024, 1024, 1024
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c_ref = torch.relu(a @ b)

# Call the kernel
c = matmul_relu(a, b)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)

# Call the kernel with overridden compilation constants
c = matmul_relu(a, b, block_M=128, block_N=128, block_K=64)
torch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)
```

--------------------------------

### Enabling Debug Logs in Python

Source: https://github.com/tile-ai/tilelang/blob/main/docs/tutorials/logging.md

Example of how to enable debug logs for Python execution by setting the TVM_LOG_DEBUG environment variable.

```shell
TVM_LOG_DEBUG=1 python3 code.py
```

--------------------------------

### Compile PrimFunc to JITKernel with tilelang.compile

Source: https://context7.com/tile-ai/tilelang/llms.txt

Use `tilelang.compile` to compile a single PrimFunc. Results are cached on disk. Specify compilation options like output index, target, execution backend, and optional pass configurations.
```python
import tilelang
import tilelang.language as T
import torch


def make_add_kernel(N: int, block: int = 256, dtype: str = "float32"):

    @T.prim_func
    def add(
        A: T.Tensor((N,), dtype),
        B: T.Tensor((N,), dtype),
        C: T.Tensor((N,), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
            for i in T.Parallel(block):
                gi = bx * block + i
                C[gi] = A[gi] + B[gi]

    return add


N = 1 << 20
prim = make_add_kernel(N)

# Compile with explicit options
kernel = tilelang.compile(
    prim,
    out_idx=[2],  # index 2 (C) is the output
    target="cuda",
    execution_backend="tvm_ffi",  # "auto"|"tvm_ffi"|"cython"|"nvrtc"|"torch"|"cutedsl"
    verbose=False,
    pass_configs={  # optional TileLang/TVM pass flags
        # tilelang.PassConfigKey.TL_ENABLE_DUMP_IR: True,
    },
)

A = torch.randn(N, device="cuda")
B = torch.randn(N, device="cuda")
C = kernel(A, B)  # returns the output tensor at out_idx
torch.testing.assert_close(C, A + B)

# Export compiled artifact
kernel.export_library("/tmp/add_kernel.so")  # save shared library for later loading
print(kernel.get_kernel_source())  # inspect CUDA C source
```

--------------------------------

### Execute a serial loop

Source: https://github.com/tile-ai/tilelang/blob/main/docs/programming_guides/language_basics.md

Use T.serial(start, stop[, step]) to define a standard for-loop within a TileLang kernel.

```python
for i in T.serial(N):
    ...
```

--------------------------------

### Build Tilelang with Existing TVM Installation

Source: https://github.com/tile-ai/tilelang/blob/main/docs/get_started/Installation.md

Build tilelang from source while specifying the path to an existing TVM repository using the `TVM_ROOT` environment variable. Note that this may still rebuild TVM-related libraries and could lead to path issues.

```bash
# /path/to/tvm is a placeholder; point it at your existing TVM repository
TVM_ROOT=/path/to/tvm pip install . -v
```

--------------------------------

### Debug Dtype Mismatch

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/tensor_checks.md

This example demonstrates a dtype mismatch.
The data type of the input tensor does not match the expected dtype for the kernel.

```python
import torch

A = torch.empty((M, K), device='cuda', dtype=torch.float32)  # should be float16
B = torch.empty((K, N), device='cuda', dtype=torch.float16)
C = torch.empty((M, N), device='cuda', dtype=torch.float16)
fn(A, B, C)
```

--------------------------------

### Execute an unrolled loop

Source: https://github.com/tile-ai/tilelang/blob/main/docs/programming_guides/language_basics.md

Use T.unroll(start, stop[, step]) for loops that should be unrolled, which can improve performance by reducing loop overhead.

```python
for i in T.unroll(N):
    ...
```

--------------------------------

### Compile and Launch TileLang Kernel

Source: https://github.com/tile-ai/tilelang/blob/main/docs/deeplearning_operators/elementwise.md

Compiles a TileLang program for CUDA execution using the cython backend and demonstrates how to launch the compiled kernel.

```python
program = elementwise_add(1024, threads=256, dtype=T.bfloat16)
kernel = tilelang.compile(program, out_idx=-1, target="cuda", execution_backend="cython")
```

```python
C = kernel(A, B)
```

--------------------------------

### Recursive Macros

Source: https://github.com/tile-ai/tilelang/blob/main/examples/eager_jit/eagerjit.en.ipynb

Illustrates recursive macros where the termination condition must be known at compile time. The example implements a variation of the Collatz conjecture.
```python
@T.macro
def n31(x, var: T.Ref):
    if x == 1:
        pass
    elif x % 2 == 0:
        var = var // 2
        n31(x // 2, var)
    else:
        var = var * 3 + 1
        n31(x * 3 + 1, var)


@tilelang.jit
def foo(A: T.Tensor[[1], T.int32], n: int):
    with T.Kernel(1) as _:
        n31(n, A[0])
```

```python
A = torch.tensor([100], dtype=torch.int32, device="cuda")
foo(A, 5)
A
```

--------------------------------

### Tensor Must Be Non-NULL (Used)

Source: https://github.com/tile-ai/tilelang/blob/main/docs/compiler_internals/tensor_checks.md

This example demonstrates a tensor that must be non-NULL because it is used within the function. Passing None will raise an error.

```python
@T.prim_func
def main(A: T.Tensor((M, K), dtype)):
    A[0] = 1
```
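
--------------------------------

### Recursive Macro Expansion Traced in Plain Python (illustrative sketch)

The Recursive Macros snippet earlier in this section can be followed by hand without a GPU. The sketch below is a plain-Python stand-in, not TileLang code: `n31_trace` is a hypothetical helper that mirrors the branch structure of `n31`, with the compile-time value `x` steering the recursion and `var` standing in for the value updated through `T.Ref`. It assumes the macro expands fully at trace time, so only the arithmetic on `var` would remain in the generated kernel.

```python
def n31_trace(x: int, var: int) -> int:
    """Plain-Python stand-in for the `n31` macro (hypothetical helper).

    `x` plays the role of the compile-time argument that terminates the
    recursion; `var` plays the role of the value passed as `T.Ref`.
    """
    if x == 1:
        # Termination: the macro body is `pass`, so `var` is left unchanged.
        return var
    if x % 2 == 0:
        # Even branch: halve both the control value and the tracked value.
        return n31_trace(x // 2, var // 2)
    # Odd branch: apply 3n + 1 to both the control value and the tracked value.
    return n31_trace(x * 3 + 1, var * 3 + 1)


# Same inputs as the snippet: A[0] starts at 100 and n is 5.
result = n31_trace(5, 100)
print(result)  # 18
```

With `n = 5` the control value walks 5, 16, 8, 4, 2, 1, while the tracked value goes 100, 301, 150, 75, 37, 18. Under the stated assumption, the same sequence of updates would be applied to `A[0]` in the TileLang version.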