### Run cuTile Python Quickstart Example Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst Executes the quickstart vector addition example script from the command line. Successful execution indicates that the cuTile Python environment is set up correctly. ```bash $ python3 samples/quickstart/VectorAdd_quickstart.py ``` -------------------------------- ### Run cuTile Python FFT Sample Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst Executes the FFT sample script from the command line. This is an example of running another cuTile Python sample. ```bash $ python3 samples/FFT.py ``` -------------------------------- ### Install cupy for cuTile Python Samples Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst Installs the cupy package, which is used by some cuTile Python samples. Ensure compatibility with your CUDA version. ```bash pip install cupy-cuda13x ``` -------------------------------- ### cuTile Python Vector Addition Example Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst Demonstrates a basic vector addition kernel using cuTile for tile-based programming. It loads tiles, performs addition, and stores the result, illustrating a common cuTile kernel structure. ```python import cutile import cupy def vector_add_example(): a = cupy.array([1, 2, 3, 4, 5], dtype=cupy.float32) b = cupy.array([5, 4, 3, 2, 1], dtype=cupy.float32) c = cupy.zeros_like(a) @cutile.kernel def vector_add(a, b, c): a_tile = cutile.load_tile(a) b_tile = cutile.load_tile(b) result_tile = a_tile + b_tile cutile.store_tile(result_tile, c) vector_add(a, b, c) cupy.cuda.stream.get_current_stream().synchronize() expected = cupy.array([6, 6, 6, 6, 6], dtype=cupy.float32) assert cupy.array_equal(a + b, expected) print("✓ vector_add_example passed!") if __name__ == "__main__": vector_add_example() ``` -------------------------------- ### Install pytest and numpy for cuTile Python Samples Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst Installs the pytest and numpy packages, commonly used in conjunction with cuTile Python for testing and data manipulation in samples. ```bash pip install pytest numpy ``` -------------------------------- ### Load and Store Operation Example in Python Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/performance.rst This Python code snippet demonstrates the usage of load and store operations with performance hints like 'latency' and 'allow_tma'. These hints guide the compiler in optimizing memory traffic and Tensor Memory Accelerator usage. The example assumes the necessary cuTile library is imported and configured. ```python from cuda.tile import * from cuda.thread_scope import * # noqa def test_load_store(): @program.kernel def kernel(a, b): idx = threadIdx.x # Example of a load operation with latency hint val = load(a, idx, N, cache_hint=1, allow_tma=True) # Example of a store operation with TMA disabled store(b, idx, val, N, allow_tma=False) N = 1024 a = NDArray(N, dtype=float32) b = NDArray(N, dtype=float32) kernel[1, 1](a, b) test_load_store() ``` -------------------------------- ### Run all cuTile Python Samples with pytest Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst Executes all available cuTile Python samples using the pytest framework. This command is used for comprehensive testing of the samples. ```bash $ pytest samples ``` -------------------------------- ### Set up Python Virtual Environment for cuTile Source: https://github.com/nvidia/cutile-python/blob/main/README.md Creates and activates a Python virtual environment named 'env' to isolate cuTile installation and dependencies. Recommended for avoiding global installations. ```bash python3 -m venv env source env/bin/activate ``` -------------------------------- ### Install Ubuntu Build Dependencies for cuTile Source: https://github.com/nvidia/cutile-python/blob/main/README.md Installs essential build tools for cuTile on Ubuntu systems using APT. Includes build-essential, cmake, python3-dev, and python3-venv. ```bash sudo apt-get update && sudo apt-get install build-essential cmake python3-dev python3-venv ``` -------------------------------- ### Install cuTile Python from PyPI Source: https://github.com/nvidia/cutile-python/blob/main/README.md Installs the cuTile Python package from PyPI using pip. Requires CUDA Toolkit 13.1 or later to be pre-installed. ```bash pip install cuda-tile ``` -------------------------------- ### Profile cuTile Python Kernel with Nsight Compute Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst Generates a performance profile for the cuTile Python vector addition kernel using NVIDIA Nsight Compute. This command-line invocation creates a profile file for later analysis. ```bash ncu -o VecAddProfile --set detailed python3 VectorAdd_quickstart.py ``` -------------------------------- ### Install Test Dependencies for cuTile Source: https://github.com/nvidia/cutile-python/blob/main/README.md Installs additional Python packages required for running cuTile tests, including dependencies like PyTorch, from the specified requirements file. ```bash pip install -r test/requirements.txt ``` -------------------------------- ### Build and Install cuTile from Source (Editable Mode) Source: https://github.com/nvidia/cutile-python/blob/main/README.md Builds and installs cuTile in editable mode from the source root directory. This command invokes the CMake build process and is typically run once. ```bash pip install -e . ``` -------------------------------- ### Python: Constant Type Hinting Example Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/execution.rst This snippet illustrates the usage of constant type hinting within the Cutile Python library. It shows how to import and potentially use `ConstantAnnotation` and `Constant` for defining and annotating constants in kernel functions. ```python from cuda.tile import ConstantAnnotation, Constant # Example usage (conceptual based on provided structure) # Assuming ConstantAnnotation is used to decorate functions or classes # and Constant is used to define constant parameters within those. # @ConstantAnnotation def my_kernel(param1: Constant[int, 5]): # kernel logic using param1 as a literal value 5 pass # Or potentially defining constants directly # MY_CONST = Constant(value=10, type=ct.int32) ``` -------------------------------- ### Conditional Execution with CUDA Tiles (Masking) Source: https://context7.com/nvidia/cutile-python/llms.txt This example demonstrates conditional execution within tiles using masks provided by the cuda.tile library. It handles boundary conditions and applies a threshold to selectively compute results. Requires torch and cuda.tile. ```python import cuda.tile as ct import torch @ct.kernel def masked_operation(X, Y, threshold: float, tile_n: ct.Constant[int]): bid_m = ct.bid(0) bid_n = ct.bid(1) N = X.shape[1] # Load tile x_tile = ct.load(X, index=(bid_m, bid_n), shape=(1, tile_n), padding_mode=ct.PaddingMode.ZERO) # Create mask for valid elements indices = bid_n * tile_n + ct.arange(tile_n, dtype=ct.int32) valid_mask = indices < N # Apply threshold condition threshold_mask = x_tile > threshold combined_mask = valid_mask & threshold_mask # Conditional computation (zero out values below threshold) result = ct.where(combined_mask, x_tile * 2.0, 0.0) ct.store(Y, index=(bid_m, bid_n), tile=result) # Create input M, N = 512, 1000 # Non-tile-aligned N X = torch.randn(M, N, dtype=torch.float32, device='cuda') Y = torch.empty_like(X) tile_n = 256 grid = (M, (N + tile_n - 1) // tile_n, 1) ct.launch(torch.cuda.current_stream(), grid, masked_operation, (X, Y, 0.5, tile_n)) print("Masked operation kernel launched.") ``` -------------------------------- ### Matrix Multiplication using CUDA Tiles Source: https://context7.com/nvidia/cutile-python/llms.txt This snippet demonstrates how to perform matrix multiplication (C = A * B) using the cuda.tile library. It configures tile sizes and launches a custom CUDA kernel for the operation. Ensure torch and cuda.tile are installed. ```python import cuda.tile as ct import torch from math import ceil # Assuming matmul_kernel is defined elsewhere and accessible # For example: # @ct.kernel # def matmul_kernel(A, B, C, tm, tn, tk): # # ... kernel implementation ... # pass # Create matrices M, K, N = 512, 768, 512 A = torch.randn(M, K, dtype=torch.float16, device='cuda') B = torch.randn(K, N, dtype=torch.float16, device='cuda') C = torch.empty(M, N, dtype=torch.float16, device='cuda') # Configure tile sizes tm, tn, tk = 128, 256, 64 grid = (ceil(M / tm), ceil(N / tn), 1) # Placeholder for matmul_kernel definition if not provided externally # In a real scenario, matmul_kernel would be defined using @ct.kernel # For demonstration purposes, we assume it's available. def matmul_kernel(A, B, C, tm, tn, tk): pass # Replace with actual kernel logic # ct.launch(torch.cuda.current_stream(), grid, matmul_kernel, (A, B, C, tm, tn, tk)) print("Matrix multiplication kernel launch would occur here.") ``` -------------------------------- ### Create Development Installation Target Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt This snippet adds a custom CMake target named 'devinstall'. When executed, it creates a symbolic link from the built C extension library within the build directory to the expected location in the source directory, facilitating development workflows. ```cmake add_custom_target( devinstall COMMAND ln -fs "${CMAKE_BINARY_DIR}/cext/lib_cext.so" ${CMAKE_SOURCE_DIR}/cuda/tile/_cext.so ) ``` -------------------------------- ### Mixed Precision Computation with CUDA Tiles Source: https://context7.com/nvidia/cutile-python/llms.txt This example demonstrates data type conversion and mixed-precision computation using the cuda.tile library. It loads data in fp16, converts it to tfloat32 for computation using tensor cores, and stores the result in fp32. Requires torch and cuda.tile. ```python import cuda.tile as ct import torch @ct.kernel def mixed_precision_compute(X_fp16, Y_fp16, Z_fp32, tile_m: ct.Constant[int], tile_n: ct.Constant[int]): bid_x = ct.bid(0) bid_y = ct.bid(1) # Load fp16 tiles x = ct.load(X_fp16, index=(bid_x, bid_y), shape=(tile_m, tile_n)) y = ct.load(Y_fp16, index=(bid_x, bid_y), shape=(tile_m, tile_n)) # Convert to tfloat32 for tensor core computation x_tf32 = x.astype(ct.tfloat32) y_tf32 = y.astype(ct.tfloat32) # Compute in higher precision result = x_tf32 * y_tf32 + ct.full((tile_m, tile_n), 1.0, dtype=ct.float32) # Store as fp32 ct.store(Z_fp32, index=(bid_x, bid_y), tile=result) # Example usage setup (actual tensors would be created and launched) # tile_m, tile_n = 32, 32 # X_fp16 = torch.randn(tile_m, tile_n, dtype=torch.float16, device='cuda') # Y_fp16 = torch.randn(tile_m, tile_n, dtype=torch.float16, device='cuda') # Z_fp32 = torch.empty(tile_m, tile_n, dtype=torch.float32, device='cuda') # grid = (1, 1, 1) # Example grid # ct.launch(torch.cuda.current_stream(), grid, mixed_precision_compute, (X_fp16, Y_fp16, Z_fp32, tile_m, tile_n)) print("Mixed precision computation kernel defined. Example launch commented out.") ``` -------------------------------- ### Reduction Operations with CUDA Tiles (Row Mean) Source: https://context7.com/nvidia/cutile-python/llms.txt This example shows how to perform reduction operations, specifically calculating the row mean of a matrix, using the cuda.tile library. It utilizes accumulation across tile dimensions and reduces the result to a scalar. Requires torch and cuda.tile. ```python import cuda.tile as ct import torch @ct.kernel def row_mean(X, Mean, tile_n: ct.Constant[int]): bid_m = ct.bid(0) num_tiles = ct.num_tiles(X, axis=1, shape=(1, tile_n)) N = X.shape[1] # Initialize accumulator sum_acc = ct.full((1, tile_n), 0, dtype=ct.float32) # Accumulate across N dimension for j in range(num_tiles): tx = ct.load(X, index=(bid_m, j), shape=(1, tile_n), padding_mode=ct.PaddingMode.ZERO) sum_acc += tx # Reduce to scalar and compute mean mean = ct.sum(sum_acc, axis=1) / N ct.store(Mean, index=(bid_m,), tile=mean) # Create 2D matrix M, N = 1024, 2048 X = torch.randn(M, N, dtype=torch.float32, device='cuda') Mean = torch.empty(M, dtype=torch.float32, device='cuda') # Launch kernel tile_n = 256 grid = (M, 1, 1) ct.launch(torch.cuda.current_stream(), grid, row_mean, (X, Mean, tile_n)) print("Row mean kernel launched.") ``` -------------------------------- ### Recompile cuTile C++ Extension after Changes Source: https://github.com/nvidia/cutile-python/blob/main/README.md Recompiles the C++ extension module for cuTile after making modifications. This is a faster alternative to 'pip install -e .' for subsequent builds. ```bash make -C build ``` -------------------------------- ### Configuring Test Executable: test_vec Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt Defines the 'test_vec' executable, specifying its source files, include directories, and compiler/linker options. This setup allows for the compilation and execution of tests related to vector operations within the CUTile project, with test coverage enabled. ```cmake add_executable(test_vec test/test_vec.cpp memory.cpp) target_include_directories(test_vec PRIVATE ${cext_include_dirs}) target_compile_options(test_vec PUBLIC ${cext_compile_flags} ${test_coverage_options}) target_link_options(test_vec PRIVATE ${test_coverage_options}) target_link_libraries(test_vec PRIVATE ${Python_LIBRARIES}) ``` -------------------------------- ### Define and Launch cuTile GPU Kernel Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/index.rst Defines a GPU kernel using the @ct.kernel decorator and demonstrates how to launch it on the GPU using ct.launch(). Kernels are executed in parallel on a logical grid of blocks. Arrays and tiles are used for data manipulation within kernels. ```python from cuda.tile import kernel, launch @kernel def add_kernel(a, b, result): idx = a.num_tiles * ct.program_id() + ct.lane_id() result[idx] = a[idx] + b[idx] def main(): # ... (host code to prepare input arrays and launch kernel) ... launch(add_kernel, (input_a, input_b, output_result), num_warps=4) # ... (host code to synchronize and retrieve results) ... ``` -------------------------------- ### cuTile Environment Variables for Debugging Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/debugging.rst Details on environment variables that can be set to assist in debugging cuTile compilation and execution issues. ```APIDOC ## cuTile Debugging Environment Variables ### Description These environment variables can be set to control the behavior of the cuTile compiler and to gather more diagnostic information when debugging issues, particularly those related to `TileCompilerExecutionError`, `TileTypeError`, and `TileCompilerTimeoutError`. ### Environment Variables - **CUDA_TILE_ENABLE_CRASH_DUMP** (value: `1`) - **Description**: When set to `1`, this variable enables the dumping of an archive containing the TileIR bytecode. This is highly recommended for submitting bug reports for `TileCompilerExecutionError` or `TileCompilerTimeoutError`. - **CUDA_TILE_COMPILER_TIMEOUT_SEC** (value: `seconds`) - **Description**: Limits the maximum time (in seconds) the TileIR compiler (`tileiras`) is allowed to run. Useful for preventing excessively long compilation times and diagnosing `TileCompilerTimeoutError`. - **CUDA_TILE_LOGS** (value: `CUTILEIR`) - **Description**: Set to `CUTILEIR` to print the cuTile Python IR during compilation to stderr. This is particularly useful for debugging `TileTypeError` by inspecting the intermediate representation. - **CUDA_TILE_TEMP_DIR** (value: `directory_path`) - **Description**: Configures the directory where the cuTile compiler will store temporary files during its operation. ``` -------------------------------- ### Building the Static Library for CUTile Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt Compiles a static library named '_cext_static' from several C++ source files. It sets public compile and link options, including the conditional flags defined earlier and standard library flags. Private include directories are also specified. ```cmake # Build a static library first, so that we could reuse it for several build targets add_library(_cext_static STATIC cuda_loader.cpp cuda_helper.cpp memory.cpp py.cpp stream_buffer.cpp tile_kernel.cpp ) target_compile_options(_cext_static PUBLIC ${cext_compile_flags} ${nostdlib_flags}) target_include_directories(_cext_static PRIVATE ${cext_include_dirs}) target_link_options(_cext_static PUBLIC ${cext_link_flags} ${nostdlib_flags}) ``` -------------------------------- ### Python: Create and Combine Strictly Typed Constants Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/execution.rst Demonstrates the creation of strictly typed constants (e.g., int16) and how they interact with loosely typed constants and other strictly typed constants. The result of operations involving these constants will be a strictly typed constant, following type promotion rules. ```python import cuda.tile as ct # Creating a strictly typed int16 constant constant_int16 = ct.int16(5) # Combining a strictly typed constant with a loosely typed constant result_loose = ct.int16(5) + 2 print(f"{result_loose} (type: {type(result_loose)})") # Combining two strictly typed constants (int16 and int32) result_strict = ct.int16(5) + ct.int32(7) print(f"{result_strict} (type: {type(result_strict)})") ``` -------------------------------- ### Creating the Python Extension Module Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt Builds a Python extension module named '_cext' from 'module.cpp'. It links against the previously built static library '_cext_static' and the Python libraries. Platform-specific link options for Apple are also included to handle dynamic lookup. ```cmake # Build a Python extension add_library(_cext MODULE module.cpp) target_compile_options(_cext PUBLIC ${cext_compile_flags} ${nostdlib_flags}) target_include_directories(_cext PRIVATE ${cext_include_dirs}) target_link_libraries(_cext PUBLIC _cext_static ${Python_LIBRARIES}) target_link_options(_cext PUBLIC ${cext_link_flags} ${nostdlib_flags}) set_target_properties(_cext PROPERTIES SKIP_BUILD_RPATH TRUE) if (APPLE) target_link_options(_cext PRIVATE -undefined dynamic_lookup) endif() ``` -------------------------------- ### Building a Shared Library with No Undefined Symbols Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt Constructs a shared library '_cext_shared' also from 'module.cpp'. It links against the static library and Python libraries, and optionally the ASan library if enabled. Crucially, it uses '-Wl,--no-undefined' to ensure all symbols are resolved, preventing runtime linking errors. ```cmake # Link another shared library with --no-undefined to make sure we don't have any unresolved symbols. if (ENABLE_ASAN) set(asan_library asan) endif() add_library(_cext_shared SHARED module.cpp) target_link_libraries(_cext_shared _cext_static ${Python_LIBRARIES} ${asan_library}) target_compile_options(_cext_shared PUBLIC ${cext_compile_flags} ${nostdlib_flags}) target_include_directories(_cext_shared PRIVATE ${cext_include_dirs}) target_link_options(_cext_shared PUBLIC ${cext_link_flags} ${nostdlib_flags} -Wl,--no-undefined) ``` -------------------------------- ### Tuple Usage Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/data.rst Describes how tuples can be used within tile code but not as kernel parameters. ```APIDOC ## Tuples Tuples can be used in tile code. They cannot be kernel parameters. ``` -------------------------------- ### Matrix Addition Kernel with Tiled Load/Store in Python Source: https://context7.com/nvidia/cutile-python/llms.txt Implements a kernel for matrix addition using 2D tiled load and store operations. It loads tiles from input matrices, performs element-wise addition, and stores the resulting tile. Supports automatic bounds handling. ```python import cuda.tile as ct import torch @ct.kernel def matrix_add(a, b, c, tile_x: ct.Constant[int], tile_y: ct.Constant[int]): bid_x = ct.bid(0) bid_y = ct.bid(1) # Load 2D tiles from input matrices a_tile = ct.load(a, index=(bid_x, bid_y), shape=(tile_x, tile_y)) b_tile = ct.load(b, index=(bid_x, bid_y), shape=(tile_x, tile_y)) # Element-wise addition sum_tile = a_tile + b_tile # Store result tile ct.store(c, index=(bid_x, bid_y), tile=sum_tile) # Create 2D matrices M, N = 2048, 1024 a = torch.randn(M, N, dtype=torch.float32, device='cuda') b = torch.randn(M, N, dtype=torch.float32, device='cuda') c = torch.empty_like(a) # Launch with 2D grid tile_x, tile_y = 32, 32 grid = ((M + tile_x - 1) // tile_x, (N + tile_y - 1) // tile_y, 1) ct.launch(torch.cuda.current_stream(), grid, matrix_add, (a, b, c, tile_x, tile_y)) ``` -------------------------------- ### Define and Launch Vector Addition Kernel in Python Source: https://context7.com/nvidia/cutile-python/llms.txt Defines a GPU kernel for vector addition using the `@ct.kernel` decorator and launches it with a specified grid configuration. It utilizes `ct.load` and `ct.store` for tile-based memory operations. ```python import cuda.tile as ct import torch @ct.kernel def vector_add(a, b, c, tile_size: ct.Constant[int]): pid = ct.bid(0) a_tile = ct.load(a, index=(pid,), shape=(tile_size,)) b_tile = ct.load(b, index=(pid,), shape=(tile_size,)) result = a_tile + b_tile ct.store(c, index=(pid,), tile=result) # Create input tensors a = torch.randn(4096, dtype=torch.float32, device='cuda') b = torch.randn(4096, dtype=torch.float32, device='cuda') c = torch.empty_like(a) # Launch kernel with grid configuration tile_size = 256 grid = (16, 1, 1) # 4096 / 256 = 16 blocks ct.launch(torch.cuda.current_stream(), grid, vector_add, (a, b, c, tile_size)) ``` -------------------------------- ### Grid Configuration and Block Indexing with Swizzled Access Source: https://context7.com/nvidia/cutile-python/llms.txt Configures multi-dimensional grids and computes block-specific indices for work distribution using swizzled indexing. This kernel requires PyTorch and the cuda.tile library. It processes input tensors and stores results in output tensors. ```python import cuda.tile as ct import torch @ct.kernel def swizzled_access(Input, Output, tile_size: ct.Constant[int], group_size: ct.Constant[int]): # Get 1D block ID bid = ct.bid(0) # Calculate total dimensions M = Input.shape[0] N = Input.shape[1] # Compute 2D block coordinates with swizzling num_blocks_n = ct.cdiv(N, tile_size) group_id = bid // (group_size * num_blocks_n) first_block_m = group_id * group_size block_m = first_block_m + (bid % group_size) block_n = (bid % (group_size * num_blocks_n)) // group_size # Load and process tile tile = ct.load(Input, index=(block_m, block_n), shape=(tile_size, tile_size), padding_mode=ct.PaddingMode.ZERO) result = tile * 2.0 ct.store(Output, index=(block_m, block_n), tile=result) # Create 2D input M, N = 2048, 2048 Input = torch.randn(M, N, dtype=torch.float32, device='cuda') Output = torch.empty_like(Input) tile_size = 64 group_size = 8 num_blocks = ct.cdiv(M, tile_size) * ct.cdiv(N, tile_size) grid = (num_blocks, 1, 1) ct.launch(torch.cuda.current_stream(), grid, swizzled_access, (Input, Output, tile_size, group_size)) ``` -------------------------------- ### Launch Mixed Precision Compute Kernel Source: https://context7.com/nvidia/cutile-python/llms.txt Launches a mixed-precision computation kernel using cuTile. This function requires PyTorch and CUDA streams for execution. It takes input tensors (X_fp16, Y_fp16) and an output tensor (Z_fp32) along with tile dimensions. ```python import torch import cuda.tile as ct # Assume mixed_precision_compute is defined elsewhere def mixed_precision_compute(X_fp16, Y_fp16, Z_fp32, tile_m, tile_n): # Placeholder for actual computation logic pass # Create inputs M, N = 1024, 1024 X_fp16 = torch.randn(M, N, dtype=torch.float16, device='cuda') Y_fp16 = torch.randn(M, N, dtype=torch.float16, device='cuda') Z_fp32 = torch.empty(M, N, dtype=torch.float32, device='cuda') tile_m, tile_n = 64, 64 grid = (M // tile_m, N // tile_n, 1) ct.launch(torch.cuda.current_stream(), grid, mixed_precision_compute, (X_fp16, Y_fp16, Z_fp32, tile_m, tile_n)) ``` -------------------------------- ### Configuring Test Executable: test_hash_map Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt Sets up the 'test_hash_map' executable, linking it with necessary source files and libraries. It configures include paths, compile options including test coverage, and link options, ensuring the test can be built and run correctly. ```cmake add_executable(test_hash_map test/test_hash_map.cpp memory.cpp) target_include_directories(test_hash_map PRIVATE ${cext_include_dirs}) target_compile_options(test_hash_map PUBLIC ${cext_compile_flags} ${test_coverage_options}) target_link_options(test_hash_map PRIVATE ${test_coverage_options}) target_link_libraries(test_hash_map PRIVATE ${Python_LIBRARIES}) ``` -------------------------------- ### Run Specific cuTile Test File using Pytest Source: https://github.com/nvidia/cutile-python/blob/main/README.md Executes a particular test file (e.g., test_copy.py) within the cuTile test suite using the pytest framework. ```bash pytest test/test_copy.py ``` -------------------------------- ### Defining Include Directories for CUTile Project Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt Specifies the include directories required for compiling the CUTile project. This includes paths for DLPack, Python headers, and CUDA Toolkit headers, ensuring that all necessary header files are found during the build process. ```cmake set(cext_include_dirs ${dlpack_INCLUDE_DIR} ${Python_INCLUDE_DIRS} ${CUDAToolkit_INCLUDE_DIRS}) ``` -------------------------------- ### Enable Address Sanitizer (ASan) Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt This configuration enables the Address Sanitizer (ASan) for debugging memory errors if the ENABLE_ASAN option is ON. It adds '-fsanitize=address' compile and link flags for GCC and Clang compilers. If the compiler is not supported, it will exit with a FATAL_ERROR. ```cmake option(ENABLE_ASAN "Enable address sanitizer" OFF) if (ENABLE_ASAN) if ((CMAKE_CXX_COMPILER_ID STREQUAL "GNU") OR (CMAKE_CXX_COMPILER_ID STREQUAL "Clang")) add_compile_options(-fsanitize=address) add_link_options(-fsanitize=address) else() message(FATAL_ERROR "Not sure how to enable address sanitizer " "for the current compiler ${CMAKE_CXX_COMPILER_ID}.") endif() message(STATUS "Enabling address sanitizer") endif() ``` -------------------------------- ### Configuring Test Executable: test_stream_buffer Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt Defines an executable for testing 'stream_buffer'. It includes specific source files, sets public compile options including test coverage flags, specifies private include directories, and links against the Python libraries and test coverage options. ```cmake # Tests add_executable(test_stream_buffer test/test_stream_buffer.cpp cuda_loader.cpp memory.cpp ) target_compile_options(test_stream_buffer PUBLIC ${cext_compile_flags} ${test_coverage_options}) target_include_directories(test_stream_buffer PRIVATE ${cext_include_dirs}) target_link_libraries(test_stream_buffer PRIVATE ${Python_LIBRARIES}) target_link_options(test_stream_buffer PRIVATE ${test_coverage_options}) ``` -------------------------------- ### Vector Addition with Gather/Scatter in Python Source: https://context7.com/nvidia/cutile-python/llms.txt Demonstrates vector addition using `ct.gather` and `ct.scatter` for indirect memory access. This method handles non-tile-aligned vector sizes and performs automatic boundary checking for reads and writes. ```python import cuda.tile as ct import torch @ct.kernel def vector_add_gather(a, b, c, tile_size: ct.Constant[int]): bid = ct.bid(0) # Calculate indices for this block's tile indices = bid * tile_size + ct.arange(tile_size, dtype=torch.int32) # Gather elements (zeros out-of-bounds automatically) a_tile = ct.gather(a, indices) b_tile = ct.gather(b, indices) # Compute result sum_tile = a_tile + b_tile # Scatter result (ignores out-of-bounds writes) ct.scatter(c, indices, sum_tile) # Create vectors with non-tile-aligned size N = 10001 # Not divisible by tile_size a = torch.randn(N, dtype=torch.float32, device='cuda') b = torch.randn(N, dtype=torch.float32, device='cuda') c = torch.empty_like(a) # Launch kernel tile_size = 256 grid = ((N + tile_size - 1) // tile_size, 1, 1) ct.launch(torch.cuda.current_stream(), grid, vector_add_gather, (a, b, c, tile_size)) ``` -------------------------------- ### Scalar Typing Rules Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/data.rst Explains the typing rules for scalar values in CUTile kernels, including constant and non-constant scalars. ```APIDOC ## Scalars A *scalar* is a single immutable value of a specific data type. A *scalar* and *0D-tile* can be used interchangeably in a tile kernel. They can also be kernel parameters. Typing of a *scalar* has the following rules: - Constant scalars are loosely typed by default, for example, a literal ``2`` or a constant property like ``Tile.ndim``, ``Tile.shape``, or ``Array.ndim``. - ``Array.shape`` and ``Array.stride`` are not constant by default and have default int type `int32`. Using default `int32` makes kernels more performant at the cost of limiting max representable shape. This limitation will be lifted in the near future. ``` -------------------------------- ### Add C Extension Subdirectory Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt This command includes the 'cext' subdirectory, which presumably contains CMakeLists.txt to build the C extensions for the Python project. This is a standard way to organize and build modular components in CMake. ```cmake add_subdirectory(cext) ``` -------------------------------- ### Matrix Multiplication Kernel with Tiled Algorithm in Python Source: https://context7.com/nvidia/cutile-python/llms.txt Implements a high-performance matrix multiplication kernel using a tile-based accumulation strategy. It leverages `ct.mma` for tensor core operations and handles data types including tf32 for potential speedups. ```python import cuda.tile as ct import torch from math import ceil @ct.kernel def matmul_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int], tk: ct.Constant[int]): bid_x = ct.bid(0) bid_y = ct.bid(1) # Number of K-dimension tiles to process num_tiles_k = ct.num_tiles(A, axis=1, shape=(tm, tk)) # Initialize accumulator accumulator = ct.full((tm, tn), 0, dtype=ct.float32) zero_pad = ct.PaddingMode.ZERO # Convert fp32 to tf32 for tensor core usage dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype # Iterate over K dimension for k in range(num_tiles_k): a = ct.load(A, index=(bid_x, k), shape=(tm, tk), padding_mode=zero_pad).astype(dtype) b = ct.load(B, index=(k, bid_y), shape=(tk, tn), padding_mode=zero_pad).astype(dtype) accumulator = ct.mma(a, b, accumulator) # Store result accumulator = ct.astype(accumulator, C.dtype) ct.store(C, index=(bid_x, bid_y), tile=accumulator) ``` -------------------------------- ### cuTile Exception Types Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/debugging.rst Information about the various exception types raised by cuTile during kernel development. ```APIDOC ## cuTile Exception Types ### Description This section lists and describes the custom exception types provided by the cuTile library, which are raised during the compilation and execution of tile kernels. ### Exception Types - **TileSyntaxError**: Raised when there is a syntax error in the TileIR code. - **TileTypeError**: Raised when a type mismatch occurs during compilation. This can be further diagnosed by setting `CUDA_TILE_LOGS=CUTILEIR`. - **TileValueError**: Raised when an invalid value is encountered in the TileIR code or parameters. - **TileCompilerExecutionError**: Raised when an error occurs during the execution of the TileIR compiler. Enabling crash dumps with `CUDA_TILE_ENABLE_CRASH_DUMP=1` can provide more diagnostic information. - **TileCompilerTimeoutError**: Raised when the TileIR compiler exceeds the specified timeout limit. The timeout can be configured using `CUDA_TILE_COMPILER_TIMEOUT_SEC` and crash dumps can be enabled with `CUDA_TILE_ENABLE_CRASH_DUMP=1`. ``` -------------------------------- ### Persistent Kernels for Multi-Tile Processing Source: https://context7.com/nvidia/cutile-python/llms.txt Implements persistent kernels where GPU blocks process multiple tiles to improve occupancy. This CUDA kernel requires PyTorch and the cuda.tile library. It loads tiles, performs element-wise operations (square root of sum of squares), and stores results. ```python import cuda.tile as ct import torch @ct.kernel def persistent_process(Input, Output, tm: ct.Constant[int], tn: ct.Constant[int]): bid = ct.bid(0) M = Input.shape[0] N = Input.shape[1] # Calculate total tiles and blocks num_tiles_m = ct.cdiv(M, tm) num_tiles_n = ct.cdiv(N, tn) total_tiles = num_tiles_m * num_tiles_n num_blocks_launched = ct.num_blocks(0) # Each block processes multiple tiles for tile_id in range(bid, total_tiles, num_blocks_launched): # Compute 2D tile coordinates tile_m = tile_id // num_tiles_n tile_n = tile_id % num_tiles_n # Process tile tile = ct.load(Input, index=(tile_m, tile_n), shape=(tm, tn), padding_mode=ct.PaddingMode.ZERO) result = ct.sqrt(tile * tile + 1.0) ct.store(Output, index=(tile_m, tile_n), tile=result) # Create input M, N = 4096, 4096 Input = torch.randn(M, N, dtype=torch.float32, device='cuda') Output = torch.empty_like(Input) # Launch with fewer blocks than tiles (persistent approach) tm, tn = 128, 128 num_sms = torch.cuda.get_device_properties('cuda').multi_processor_count grid = (num_sms * 2, 1, 1) # Launch 2 blocks per SM ct.launch(torch.cuda.current_stream(), grid, persistent_process, (Input, Output, tm, tn)) ``` -------------------------------- ### Atomic Operations with CUDA Tiles Source: https://context7.com/nvidia/cutile-python/llms.txt This snippet demonstrates thread-safe accumulation using atomic operations and memory ordering with the cuda.tile library. It employs atomic compare-and-swap (CAS) for lock acquisition and atomic exchange for lock release. Requires torch and cuda.tile. ```python import cuda.tile as ct import torch @ct.kernel def atomic_accumulate(X, Partial_Sum, Locks, group_size: ct.Constant[int], tile_n: ct.Constant[int]): bid = ct.bid(0) group_id = bid % group_size # Load data for this block x_tile = ct.load(X, index=(bid,), shape=(tile_n,)) local_sum = ct.sum(x_tile, axis=0) # Acquire lock using compare-and-swap while ct.atomic_cas(Locks, group_id, 0, 1, memory_order=ct.MemoryOrder.ACQUIRE) == 1: pass # Critical section: accumulate partial sum current = ct.load(Partial_Sum, index=(group_id,), shape=(1,)) updated = current + local_sum ct.store(Partial_Sum, index=(group_id,), tile=updated) # Release lock ct.atomic_xchg(Locks, group_id, 0, memory_order=ct.MemoryOrder.RELEASE) # Create input data N = 16384 tile_n = 256 num_blocks = N // tile_n group_size = 8 X = torch.randn(N, dtype=torch.float32, device='cuda') Partial_Sum = torch.zeros(group_size, dtype=torch.float32, device='cuda') Locks = torch.zeros(group_size, dtype=torch.int32, device='cuda') grid = (num_blocks, 1, 1) ct.launch(torch.cuda.current_stream(), grid, atomic_accumulate, (X, Partial_Sum, Locks, group_size, tile_n)) print("Atomic accumulation kernel launched.") ``` -------------------------------- ### Enable Test Code Coverage Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt This section enables code coverage for tests if the ENABLE_COVERAGE_FOR_TESTS option is set to ON. It supports GCC and Clang compilers by adding the '--coverage' flag. For unsupported compilers, it raises a FATAL_ERROR. ```cmake option(ENABLE_COVERAGE_FOR_TESTS "Enable code coverage for tests" OFF) set(test_coverage_options) if (ENABLE_COVERAGE_FOR_TESTS) if ((CMAKE_CXX_COMPILER_ID STREQUAL "GNU") OR (CMAKE_CXX_COMPILER_ID STREQUAL "Clang")) set(test_coverage_options "--coverage") else() message(FATAL_ERROR "Not sure how to enable coverage " "for the current compiler ${CMAKE_CXX_COMPILER_ID}.") endif() message(STATUS "Enabling code coverage for tests") endif() ``` -------------------------------- ### Sign Git Commits with --signoff Flag Source: https://github.com/nvidia/cutile-python/blob/main/CONTRIBUTING.md This snippet demonstrates the command used to sign off on a Git commit. The `-s` or `--signoff` flag appends the contributor's name and email to the commit message, certifying their work under the DCO. This is a crucial step for all contributions to the project. ```bash git commit -s -m "Add cool feature." ``` -------------------------------- ### Manage dlpack Dependency Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt This configuration handles the dependency on the dlpack library. It checks if DLPACK_PATH is set, using a local dlpack if available. Otherwise, it fetches dlpack from its GitHub repository using FetchContent and sets the include directory accordingly. ```cmake if (DLPACK_PATH) set(dlpack_INCLUDE_DIR "${DLPACK_PATH}/include/dlpack") message(STATUS "Using local dlpack, include path: ${dlpack_INCLUDE_DIR}") else() include(FetchContent) FetchContent_Declare( dlpack GIT_REPOSITORY https://github.com/dmlc/dlpack.git GIT_TAG v1.1 ) FetchContent_MakeAvailable(dlpack) set(dlpack_INCLUDE_DIR "${dlpack_SOURCE_DIR}/include/dlpack") message(STATUS "Fetching dlpack") endif() ```