### Run cuTile Python Quickstart Example

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst

Executes the quickstart vector addition example script from the command line. Successful execution indicates that the cuTile Python environment is set up correctly.

```bash
$ python3 samples/quickstart/VectorAdd_quickstart.py
```

--------------------------------

### Run cuTile Python FFT Sample

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst

Executes the FFT sample script from the command line. This is an example of running another cuTile Python sample.

```bash
$ python3 samples/FFT.py
```

--------------------------------

### Install cupy for cuTile Python Samples

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst

Installs the cupy package, which is used by some cuTile Python samples. Ensure compatibility with your CUDA version.

```bash
pip install cupy-cuda13x
```

--------------------------------

### cuTile Python Vector Addition Example

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst

Demonstrates a basic vector addition kernel using cuTile for tile-based programming. It loads tiles, performs addition, and stores the result, illustrating a common cuTile kernel structure.

```python
import cutile
import cupy


def vector_add_example():
    a = cupy.array([1, 2, 3, 4, 5], dtype=cupy.float32)
    b = cupy.array([5, 4, 3, 2, 1], dtype=cupy.float32)
    c = cupy.zeros_like(a)

    @cutile.kernel
def vector_add(a, b, c):
        a_tile = cutile.load_tile(a)
        b_tile = cutile.load_tile(b)
        result_tile = a_tile + b_tile
        cutile.store_tile(result_tile, c)

    vector_add(a, b, c)
    cupy.cuda.stream.get_current_stream().synchronize()

    expected = cupy.array([6, 6, 6, 6, 6], dtype=cupy.float32)
    assert cupy.array_equal(a + b, expected)

    print("✓ vector_add_example passed!")

if __name__ == "__main__":
    vector_add_example()

```

--------------------------------

### Install pytest and numpy for cuTile Python Samples

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst

Installs the pytest and numpy packages, commonly used in conjunction with cuTile Python for testing and data manipulation in samples.

```bash
pip install pytest numpy
```

--------------------------------

### Load and Store Operation Example in Python

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/performance.rst

This Python code snippet demonstrates the usage of load and store operations with performance hints like 'latency' and 'allow_tma'. These hints guide the compiler in optimizing memory traffic and Tensor Memory Accelerator usage. The example assumes the necessary cuTile library is imported and configured.

```python
from cuda.tile import *
from cuda.thread_scope import * # noqa

def test_load_store():
    @program.kernel
    def kernel(a, b):
        idx = threadIdx.x
        # Example of a load operation with latency hint
        val = load(a, idx, N, cache_hint=1, allow_tma=True)
        # Example of a store operation with TMA disabled
        store(b, idx, val, N, allow_tma=False)

    N = 1024
    a = NDArray(N, dtype=float32)
    b = NDArray(N, dtype=float32)
    kernel[1, 1](a, b)

test_load_store()
```

--------------------------------

### Run all cuTile Python Samples with pytest

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst

Executes all available cuTile Python samples using the pytest framework. This command is used for comprehensive testing of the samples.

```bash
$  pytest samples
```

--------------------------------

### Set up Python Virtual Environment for cuTile

Source: https://github.com/nvidia/cutile-python/blob/main/README.md

Creates and activates a Python virtual environment named 'env' to isolate cuTile installation and dependencies. Recommended for avoiding global installations.

```bash
python3 -m venv env
source env/bin/activate
```

--------------------------------

### Install Ubuntu Build Dependencies for cuTile

Source: https://github.com/nvidia/cutile-python/blob/main/README.md

Installs essential build tools for cuTile on Ubuntu systems using APT. Includes build-essential, cmake, python3-dev, and python3-venv.

```bash
sudo apt-get update && sudo apt-get install build-essential cmake python3-dev python3-venv
```

--------------------------------

### Install cuTile Python from PyPI

Source: https://github.com/nvidia/cutile-python/blob/main/README.md

Installs the cuTile Python package from PyPI using pip. Requires CUDA Toolkit 13.1 or later to be pre-installed.

```bash
pip install cuda-tile
```

--------------------------------

### Profile cuTile Python Kernel with Nsight Compute

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/quickstart.rst

Generates a performance profile for the cuTile Python vector addition kernel using NVIDIA Nsight Compute. This command-line invocation creates a profile file for later analysis.

```bash
ncu -o VecAddProfile --set detailed python3 VectorAdd_quickstart.py
```

--------------------------------

### Install Test Dependencies for cuTile

Source: https://github.com/nvidia/cutile-python/blob/main/README.md

Installs additional Python packages required for running cuTile tests, including dependencies like PyTorch, from the specified requirements file.

```bash
pip install -r test/requirements.txt
```

--------------------------------

### Build and Install cuTile from Source (Editable Mode)

Source: https://github.com/nvidia/cutile-python/blob/main/README.md

Builds and installs cuTile in editable mode from the source root directory. This command invokes the CMake build process and is typically run once.

```bash
pip install -e .
```

--------------------------------

### Python: Constant Type Hinting Example

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/execution.rst

This snippet illustrates the usage of constant type hinting within the Cutile Python library. It shows how to import and potentially use `ConstantAnnotation` and `Constant` for defining and annotating constants in kernel functions.

```python
from cuda.tile import ConstantAnnotation, Constant

# Example usage (conceptual based on provided structure)
# Assuming ConstantAnnotation is used to decorate functions or classes
# and Constant is used to define constant parameters within those.

# @ConstantAnnotation
def my_kernel(param1: Constant[int, 5]):
    # kernel logic using param1 as a literal value 5
    pass

# Or potentially defining constants directly
# MY_CONST = Constant(value=10, type=ct.int32)

```

--------------------------------

### Conditional Execution with CUDA Tiles (Masking)

Source: https://context7.com/nvidia/cutile-python/llms.txt

This example demonstrates conditional execution within tiles using masks provided by the cuda.tile library. It handles boundary conditions and applies a threshold to selectively compute results. Requires torch and cuda.tile.

```python
import cuda.tile as ct
import torch

@ct.kernel
def masked_operation(X, Y, threshold: float, tile_n: ct.Constant[int]):
    bid_m = ct.bid(0)
    bid_n = ct.bid(1)

    N = X.shape[1]

    # Load tile
    x_tile = ct.load(X, index=(bid_m, bid_n), shape=(1, tile_n), padding_mode=ct.PaddingMode.ZERO)

    # Create mask for valid elements
    indices = bid_n * tile_n + ct.arange(tile_n, dtype=ct.int32)
    valid_mask = indices < N

    # Apply threshold condition
    threshold_mask = x_tile > threshold
    combined_mask = valid_mask & threshold_mask

    # Conditional computation (zero out values below threshold)
    result = ct.where(combined_mask, x_tile * 2.0, 0.0)

    ct.store(Y, index=(bid_m, bid_n), tile=result)

# Create input
M, N = 512, 1000  # Non-tile-aligned N
X = torch.randn(M, N, dtype=torch.float32, device='cuda')
Y = torch.empty_like(X)

tile_n = 256
grid = (M, (N + tile_n - 1) // tile_n, 1)
ct.launch(torch.cuda.current_stream(), grid, masked_operation, (X, Y, 0.5, tile_n))
print("Masked operation kernel launched.")
```

--------------------------------

### Matrix Multiplication using CUDA Tiles

Source: https://context7.com/nvidia/cutile-python/llms.txt

This snippet demonstrates how to perform matrix multiplication (C = A * B) using the cuda.tile library. It configures tile sizes and launches a custom CUDA kernel for the operation. Ensure torch and cuda.tile are installed.

```python
import cuda.tile as ct
import torch
from math import ceil

# Assuming matmul_kernel is defined elsewhere and accessible
# For example:
# @ct.kernel
# def matmul_kernel(A, B, C, tm, tn, tk):
#     # ... kernel implementation ...
#     pass

# Create matrices
M, K, N = 512, 768, 512
A = torch.randn(M, K, dtype=torch.float16, device='cuda')
B = torch.randn(K, N, dtype=torch.float16, device='cuda')
C = torch.empty(M, N, dtype=torch.float16, device='cuda')

# Configure tile sizes
tm, tn, tk = 128, 256, 64
grid = (ceil(M / tm), ceil(N / tn), 1)

# Placeholder for matmul_kernel definition if not provided externally
# In a real scenario, matmul_kernel would be defined using @ct.kernel
# For demonstration purposes, we assume it's available.
def matmul_kernel(A, B, C, tm, tn, tk):
    pass # Replace with actual kernel logic

# ct.launch(torch.cuda.current_stream(), grid, matmul_kernel, (A, B, C, tm, tn, tk))
print("Matrix multiplication kernel launch would occur here.")
```

--------------------------------

### Create Development Installation Target

Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt

This snippet adds a custom CMake target named 'devinstall'. When executed, it creates a symbolic link from the built C extension library within the build directory to the expected location in the source directory, facilitating development workflows.

```cmake
add_custom_target(
    devinstall
    COMMAND ln -fs "${CMAKE_BINARY_DIR}/cext/lib_cext.so"
            ${CMAKE_SOURCE_DIR}/cuda/tile/_cext.so
)
```

--------------------------------

### Mixed Precision Computation with CUDA Tiles

Source: https://context7.com/nvidia/cutile-python/llms.txt

This example demonstrates data type conversion and mixed-precision computation using the cuda.tile library. It loads data in fp16, converts it to tfloat32 for computation using tensor cores, and stores the result in fp32. Requires torch and cuda.tile.

```python
import cuda.tile as ct
import torch

@ct.kernel
def mixed_precision_compute(X_fp16, Y_fp16, Z_fp32, tile_m: ct.Constant[int], tile_n: ct.Constant[int]):
    bid_x = ct.bid(0)
    bid_y = ct.bid(1)

    # Load fp16 tiles
    x = ct.load(X_fp16, index=(bid_x, bid_y), shape=(tile_m, tile_n))
    y = ct.load(Y_fp16, index=(bid_x, bid_y), shape=(tile_m, tile_n))

    # Convert to tfloat32 for tensor core computation
    x_tf32 = x.astype(ct.tfloat32)
    y_tf32 = y.astype(ct.tfloat32)

    # Compute in higher precision
    result = x_tf32 * y_tf32 + ct.full((tile_m, tile_n), 1.0, dtype=ct.float32)

    # Store as fp32
    ct.store(Z_fp32, index=(bid_x, bid_y), tile=result)

# Example usage setup (actual tensors would be created and launched)
# tile_m, tile_n = 32, 32
# X_fp16 = torch.randn(tile_m, tile_n, dtype=torch.float16, device='cuda')
# Y_fp16 = torch.randn(tile_m, tile_n, dtype=torch.float16, device='cuda')
# Z_fp32 = torch.empty(tile_m, tile_n, dtype=torch.float32, device='cuda')
# grid = (1, 1, 1) # Example grid
# ct.launch(torch.cuda.current_stream(), grid, mixed_precision_compute, (X_fp16, Y_fp16, Z_fp32, tile_m, tile_n))
print("Mixed precision computation kernel defined. Example launch commented out.")
```

--------------------------------

### Reduction Operations with CUDA Tiles (Row Mean)

Source: https://context7.com/nvidia/cutile-python/llms.txt

This example shows how to perform reduction operations, specifically calculating the row mean of a matrix, using the cuda.tile library. It utilizes accumulation across tile dimensions and reduces the result to a scalar. Requires torch and cuda.tile.

```python
import cuda.tile as ct
import torch

@ct.kernel
def row_mean(X, Mean, tile_n: ct.Constant[int]):
    bid_m = ct.bid(0)
    num_tiles = ct.num_tiles(X, axis=1, shape=(1, tile_n))
    N = X.shape[1]

    # Initialize accumulator
    sum_acc = ct.full((1, tile_n), 0, dtype=ct.float32)

    # Accumulate across N dimension
    for j in range(num_tiles):
        tx = ct.load(X, index=(bid_m, j), shape=(1, tile_n), padding_mode=ct.PaddingMode.ZERO)
        sum_acc += tx

    # Reduce to scalar and compute mean
    mean = ct.sum(sum_acc, axis=1) / N
    ct.store(Mean, index=(bid_m,), tile=mean)

# Create 2D matrix
M, N = 1024, 2048
X = torch.randn(M, N, dtype=torch.float32, device='cuda')
Mean = torch.empty(M, dtype=torch.float32, device='cuda')

# Launch kernel
tile_n = 256
grid = (M, 1, 1)
ct.launch(torch.cuda.current_stream(), grid, row_mean, (X, Mean, tile_n))
print("Row mean kernel launched.")
```

--------------------------------

### Recompile cuTile C++ Extension after Changes

Source: https://github.com/nvidia/cutile-python/blob/main/README.md

Recompiles the C++ extension module for cuTile after making modifications. This is a faster alternative to 'pip install -e .' for subsequent builds.

```bash
make -C build
```

--------------------------------

### Configuring Test Executable: test_vec

Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt

Defines the 'test_vec' executable, specifying its source files, include directories, and compiler/linker options. This setup allows for the compilation and execution of tests related to vector operations within the CUTile project, with test coverage enabled.

```cmake

add_executable(test_vec test/test_vec.cpp memory.cpp)
target_include_directories(test_vec PRIVATE ${cext_include_dirs})
target_compile_options(test_vec PUBLIC ${cext_compile_flags} ${test_coverage_options})
target_link_options(test_vec PRIVATE ${test_coverage_options})
target_link_libraries(test_vec PRIVATE ${Python_LIBRARIES})

```

--------------------------------

### Define and Launch cuTile GPU Kernel

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/index.rst

Defines a GPU kernel using the @ct.kernel decorator and demonstrates how to launch it on the GPU using ct.launch(). Kernels are executed in parallel on a logical grid of blocks. Arrays and tiles are used for data manipulation within kernels.

```python
from cuda.tile import kernel, launch

@kernel
def add_kernel(a, b, result):
    idx = a.num_tiles * ct.program_id() + ct.lane_id()
    result[idx] = a[idx] + b[idx]

def main():
    # ... (host code to prepare input arrays and launch kernel) ...
    launch(add_kernel, (input_a, input_b, output_result), num_warps=4)
    # ... (host code to synchronize and retrieve results) ...

```

--------------------------------

### cuTile Environment Variables for Debugging

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/debugging.rst

Details on environment variables that can be set to assist in debugging cuTile compilation and execution issues.

```APIDOC
## cuTile Debugging Environment Variables

### Description
These environment variables can be set to control the behavior of the cuTile compiler and to gather more diagnostic information when debugging issues, particularly those related to `TileCompilerExecutionError`, `TileTypeError`, and `TileCompilerTimeoutError`.

### Environment Variables

- **CUDA_TILE_ENABLE_CRASH_DUMP** (value: `1`)
  - **Description**: When set to `1`, this variable enables the dumping of an archive containing the TileIR bytecode. This is highly recommended for submitting bug reports for `TileCompilerExecutionError` or `TileCompilerTimeoutError`.

- **CUDA_TILE_COMPILER_TIMEOUT_SEC** (value: `seconds`)
  - **Description**: Limits the maximum time (in seconds) the TileIR compiler (`tileiras`) is allowed to run. Useful for preventing excessively long compilation times and diagnosing `TileCompilerTimeoutError`.

- **CUDA_TILE_LOGS** (value: `CUTILEIR`)
  - **Description**: Set to `CUTILEIR` to print the cuTile Python IR during compilation to stderr. This is particularly useful for debugging `TileTypeError` by inspecting the intermediate representation.

- **CUDA_TILE_TEMP_DIR** (value: `directory_path`)
  - **Description**: Configures the directory where the cuTile compiler will store temporary files during its operation.
```

--------------------------------

### Building the Static Library for CUTile

Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt

Compiles a static library named '_cext_static' from several C++ source files. It sets public compile and link options, including the conditional flags defined earlier and standard library flags. Private include directories are also specified.

```cmake
# Build a static library first, so that we could reuse it for several build targets

add_library(_cext_static STATIC
    cuda_loader.cpp
    cuda_helper.cpp
    memory.cpp
    py.cpp
    stream_buffer.cpp
    tile_kernel.cpp
    )

target_compile_options(_cext_static PUBLIC ${cext_compile_flags} ${nostdlib_flags})
target_include_directories(_cext_static PRIVATE ${cext_include_dirs})
target_link_options(_cext_static PUBLIC ${cext_link_flags} ${nostdlib_flags})

```

--------------------------------

### Python: Create and Combine Strictly Typed Constants

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/execution.rst

Demonstrates the creation of strictly typed constants (e.g., int16) and how they interact with loosely typed constants and other strictly typed constants. The result of operations involving these constants will be a strictly typed constant, following type promotion rules.

```python
import cuda.tile as ct

# Creating a strictly typed int16 constant
constant_int16 = ct.int16(5)

# Combining a strictly typed constant with a loosely typed constant
result_loose = ct.int16(5) + 2
print(f"{result_loose} (type: {type(result_loose)})")

# Combining two strictly typed constants (int16 and int32)
result_strict = ct.int16(5) + ct.int32(7)
print(f"{result_strict} (type: {type(result_strict)})")
```

--------------------------------

### Creating the Python Extension Module

Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt

Builds a Python extension module named '_cext' from 'module.cpp'. It links against the previously built static library '_cext_static' and the Python libraries. Platform-specific link options for Apple are also included to handle dynamic lookup.

```cmake
# Build a Python extension

add_library(_cext MODULE module.cpp)

target_compile_options(_cext PUBLIC ${cext_compile_flags} ${nostdlib_flags})
target_include_directories(_cext PRIVATE ${cext_include_dirs})
target_link_libraries(_cext PUBLIC _cext_static ${Python_LIBRARIES})
target_link_options(_cext PUBLIC ${cext_link_flags} ${nostdlib_flags})

set_target_properties(_cext PROPERTIES SKIP_BUILD_RPATH TRUE)

if (APPLE)
    target_link_options(_cext PRIVATE -undefined dynamic_lookup)
endif()

```

--------------------------------

### Building a Shared Library with No Undefined Symbols

Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt

Constructs a shared library '_cext_shared' also from 'module.cpp'. It links against the static library and Python libraries, and optionally the ASan library if enabled. Crucially, it uses '-Wl,--no-undefined' to ensure all symbols are resolved, preventing runtime linking errors.

```cmake
# Link another shared library with --no-undefined to make sure we don't have any unresolved symbols.

if (ENABLE_ASAN)
    set(asan_library asan)
endif()

add_library(_cext_shared SHARED module.cpp)
target_link_libraries(_cext_shared _cext_static ${Python_LIBRARIES} ${asan_library})
target_compile_options(_cext_shared PUBLIC ${cext_compile_flags} ${nostdlib_flags})
target_include_directories(_cext_shared PRIVATE ${cext_include_dirs})
target_link_options(_cext_shared PUBLIC ${cext_link_flags} ${nostdlib_flags} -Wl,--no-undefined)

```

--------------------------------

### Tuple Usage

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/data.rst

Describes how tuples can be used within tile code but not as kernel parameters.

```APIDOC
## Tuples

Tuples can be used in tile code. They cannot be kernel parameters.
```

--------------------------------

### Matrix Addition Kernel with Tiled Load/Store in Python

Source: https://context7.com/nvidia/cutile-python/llms.txt

Implements a kernel for matrix addition using 2D tiled load and store operations. It loads tiles from input matrices, performs element-wise addition, and stores the resulting tile. Supports automatic bounds handling.

```python
import cuda.tile as ct
import torch

@ct.kernel
def matrix_add(a, b, c, tile_x: ct.Constant[int], tile_y: ct.Constant[int]):
    bid_x = ct.bid(0)
    bid_y = ct.bid(1)

    # Load 2D tiles from input matrices
    a_tile = ct.load(a, index=(bid_x, bid_y), shape=(tile_x, tile_y))
    b_tile = ct.load(b, index=(bid_x, bid_y), shape=(tile_x, tile_y))

    # Element-wise addition
    sum_tile = a_tile + b_tile

    # Store result tile
    ct.store(c, index=(bid_x, bid_y), tile=sum_tile)

# Create 2D matrices
M, N = 2048, 1024
a = torch.randn(M, N, dtype=torch.float32, device='cuda')
b = torch.randn(M, N, dtype=torch.float32, device='cuda')
c = torch.empty_like(a)

# Launch with 2D grid
tile_x, tile_y = 32, 32
grid = ((M + tile_x - 1) // tile_x, (N + tile_y - 1) // tile_y, 1)
ct.launch(torch.cuda.current_stream(), grid, matrix_add, (a, b, c, tile_x, tile_y))
```

--------------------------------

### Define and Launch Vector Addition Kernel in Python

Source: https://context7.com/nvidia/cutile-python/llms.txt

Defines a GPU kernel for vector addition using the `@ct.kernel` decorator and launches it with a specified grid configuration. It utilizes `ct.load` and `ct.store` for tile-based memory operations.

```python
import cuda.tile as ct
import torch

@ct.kernel
def vector_add(a, b, c, tile_size: ct.Constant[int]):
    pid = ct.bid(0)
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))
    result = a_tile + b_tile
    ct.store(c, index=(pid,), tile=result)

# Create input tensors
a = torch.randn(4096, dtype=torch.float32, device='cuda')
b = torch.randn(4096, dtype=torch.float32, device='cuda')
c = torch.empty_like(a)

# Launch kernel with grid configuration
tile_size = 256
grid = (16, 1, 1)  # 4096 / 256 = 16 blocks
ct.launch(torch.cuda.current_stream(), grid, vector_add, (a, b, c, tile_size))
```

--------------------------------

### Grid Configuration and Block Indexing with Swizzled Access

Source: https://context7.com/nvidia/cutile-python/llms.txt

Configures multi-dimensional grids and computes block-specific indices for work distribution using swizzled indexing. This kernel requires PyTorch and the cuda.tile library. It processes input tensors and stores results in output tensors.

```python
import cuda.tile as ct
import torch

@ct.kernel
def swizzled_access(Input, Output, tile_size: ct.Constant[int], group_size: ct.Constant[int]):
    # Get 1D block ID
    bid = ct.bid(0)

    # Calculate total dimensions
    M = Input.shape[0]
    N = Input.shape[1]

    # Compute 2D block coordinates with swizzling
    num_blocks_n = ct.cdiv(N, tile_size)
    group_id = bid // (group_size * num_blocks_n)
    first_block_m = group_id * group_size

    block_m = first_block_m + (bid % group_size)
    block_n = (bid % (group_size * num_blocks_n)) // group_size

    # Load and process tile
    tile = ct.load(Input, index=(block_m, block_n), shape=(tile_size, tile_size),
                   padding_mode=ct.PaddingMode.ZERO)
    result = tile * 2.0
    ct.store(Output, index=(block_m, block_n), tile=result)

# Create 2D input
M, N = 2048, 2048
Input = torch.randn(M, N, dtype=torch.float32, device='cuda')
Output = torch.empty_like(Input)

tile_size = 64
group_size = 8
num_blocks = ct.cdiv(M, tile_size) * ct.cdiv(N, tile_size)
grid = (num_blocks, 1, 1)

ct.launch(torch.cuda.current_stream(), grid, swizzled_access, (Input, Output, tile_size, group_size))
```

--------------------------------

### Launch Mixed Precision Compute Kernel

Source: https://context7.com/nvidia/cutile-python/llms.txt

Launches a mixed-precision computation kernel using cuTile. This function requires PyTorch and CUDA streams for execution. It takes input tensors (X_fp16, Y_fp16) and an output tensor (Z_fp32) along with tile dimensions.

```python
import torch
import cuda.tile as ct

# Assume mixed_precision_compute is defined elsewhere
def mixed_precision_compute(X_fp16, Y_fp16, Z_fp32, tile_m, tile_n):
    # Placeholder for actual computation logic
    pass

# Create inputs
M, N = 1024, 1024
X_fp16 = torch.randn(M, N, dtype=torch.float16, device='cuda')
Y_fp16 = torch.randn(M, N, dtype=torch.float16, device='cuda')
Z_fp32 = torch.empty(M, N, dtype=torch.float32, device='cuda')

tile_m, tile_n = 64, 64
grid = (M // tile_m, N // tile_n, 1)
ct.launch(torch.cuda.current_stream(), grid, mixed_precision_compute, (X_fp16, Y_fp16, Z_fp32, tile_m, tile_n))
```

--------------------------------

### Configuring Test Executable: test_hash_map

Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt

Sets up the 'test_hash_map' executable, linking it with necessary source files and libraries. It configures include paths, compile options including test coverage, and link options, ensuring the test can be built and run correctly.

```cmake

add_executable(test_hash_map test/test_hash_map.cpp memory.cpp)
target_include_directories(test_hash_map PRIVATE ${cext_include_dirs})
target_compile_options(test_hash_map PUBLIC ${cext_compile_flags} ${test_coverage_options})
target_link_options(test_hash_map PRIVATE ${test_coverage_options})
target_link_libraries(test_hash_map PRIVATE ${Python_LIBRARIES})

```

--------------------------------

### Run Specific cuTile Test File using Pytest

Source: https://github.com/nvidia/cutile-python/blob/main/README.md

Executes a particular test file (e.g., test_copy.py) within the cuTile test suite using the pytest framework.

```bash
pytest test/test_copy.py
```

--------------------------------

### Defining Include Directories for CUTile Project

Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt

Specifies the include directories required for compiling the CUTile project. This includes paths for DLPack, Python headers, and CUDA Toolkit headers, ensuring that all necessary header files are found during the build process.

```cmake
set(cext_include_dirs
    ${dlpack_INCLUDE_DIR}
    ${Python_INCLUDE_DIRS}
    ${CUDAToolkit_INCLUDE_DIRS})

```

--------------------------------

### Enable Address Sanitizer (ASan)

Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt

This configuration enables the Address Sanitizer (ASan) for debugging memory errors if the ENABLE_ASAN option is ON. It adds '-fsanitize=address' compile and link flags for GCC and Clang compilers. If the compiler is not supported, it will exit with a FATAL_ERROR.

```cmake
option(ENABLE_ASAN "Enable address sanitizer" OFF)
if (ENABLE_ASAN)
    if ((CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
            OR (CMAKE_CXX_COMPILER_ID STREQUAL "Clang"))
        add_compile_options(-fsanitize=address)
        add_link_options(-fsanitize=address)
    else()
        message(FATAL_ERROR "Not sure how to enable address sanitizer "
                            "for the current compiler ${CMAKE_CXX_COMPILER_ID}.")
    endif()
    message(STATUS "Enabling address sanitizer")
endif()
```

--------------------------------

### Configuring Test Executable: test_stream_buffer

Source: https://github.com/nvidia/cutile-python/blob/main/cext/CMakeLists.txt

Defines an executable for testing 'stream_buffer'. It includes specific source files, sets public compile options including test coverage flags, specifies private include directories, and links against the Python libraries and test coverage options.

```cmake
# Tests

add_executable(test_stream_buffer
    test/test_stream_buffer.cpp
    cuda_loader.cpp
    memory.cpp
    )
target_compile_options(test_stream_buffer PUBLIC ${cext_compile_flags} ${test_coverage_options})
target_include_directories(test_stream_buffer PRIVATE ${cext_include_dirs})
target_link_libraries(test_stream_buffer PRIVATE ${Python_LIBRARIES})
target_link_options(test_stream_buffer PRIVATE ${test_coverage_options})

```

--------------------------------

### Vector Addition with Gather/Scatter in Python

Source: https://context7.com/nvidia/cutile-python/llms.txt

Demonstrates vector addition using `ct.gather` and `ct.scatter` for indirect memory access. This method handles non-tile-aligned vector sizes and performs automatic boundary checking for reads and writes.

```python
import cuda.tile as ct
import torch

@ct.kernel
def vector_add_gather(a, b, c, tile_size: ct.Constant[int]):
    bid = ct.bid(0)

    # Calculate indices for this block's tile
    indices = bid * tile_size + ct.arange(tile_size, dtype=torch.int32)

    # Gather elements (zeros out-of-bounds automatically)
    a_tile = ct.gather(a, indices)
    b_tile = ct.gather(b, indices)

    # Compute result
    sum_tile = a_tile + b_tile

    # Scatter result (ignores out-of-bounds writes)
    ct.scatter(c, indices, sum_tile)

# Create vectors with non-tile-aligned size
N = 10001  # Not divisible by tile_size
a = torch.randn(N, dtype=torch.float32, device='cuda')
b = torch.randn(N, dtype=torch.float32, device='cuda')
c = torch.empty_like(a)

# Launch kernel
tile_size = 256
grid = ((N + tile_size - 1) // tile_size, 1, 1)
ct.launch(torch.cuda.current_stream(), grid, vector_add_gather, (a, b, c, tile_size))
```

--------------------------------

### Scalar Typing Rules

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/data.rst

Explains the typing rules for scalar values in CUTile kernels, including constant and non-constant scalars.

```APIDOC
## Scalars

A *scalar* is a single immutable value of a specific data type. A *scalar* and *0D-tile* can be used interchangeably in a tile kernel. They can also be kernel parameters.

Typing of a *scalar* has the following rules:

- Constant scalars are loosely typed by default, for example, a literal ``2`` or a constant property like ``Tile.ndim``, ``Tile.shape``, or ``Array.ndim``.
- ``Array.shape`` and ``Array.stride`` are not constant by default and have default int type `int32`. Using default `int32` makes kernels more performant at the cost of limiting max representable shape. This limitation will be lifted in the near future.
```

--------------------------------

### Add C Extension Subdirectory

Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt

This command includes the 'cext' subdirectory, which presumably contains CMakeLists.txt to build the C extensions for the Python project. This is a standard way to organize and build modular components in CMake.

```cmake
add_subdirectory(cext)
```

--------------------------------

### Matrix Multiplication Kernel with Tiled Algorithm in Python

Source: https://context7.com/nvidia/cutile-python/llms.txt

Implements a high-performance matrix multiplication kernel using a tile-based accumulation strategy. It leverages `ct.mma` for tensor core operations and handles data types including tf32 for potential speedups.

```python
import cuda.tile as ct
import torch
from math import ceil

@ct.kernel
def matmul_kernel(A, B, C, tm: ct.Constant[int], tn: ct.Constant[int], tk: ct.Constant[int]):
    bid_x = ct.bid(0)
    bid_y = ct.bid(1)

    # Number of K-dimension tiles to process
    num_tiles_k = ct.num_tiles(A, axis=1, shape=(tm, tk))

    # Initialize accumulator
    accumulator = ct.full((tm, tn), 0, dtype=ct.float32)
    zero_pad = ct.PaddingMode.ZERO

    # Convert fp32 to tf32 for tensor core usage
    dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype

    # Iterate over K dimension
    for k in range(num_tiles_k):
        a = ct.load(A, index=(bid_x, k), shape=(tm, tk), padding_mode=zero_pad).astype(dtype)
        b = ct.load(B, index=(k, bid_y), shape=(tk, tn), padding_mode=zero_pad).astype(dtype)
        accumulator = ct.mma(a, b, accumulator)

    # Store result
    accumulator = ct.astype(accumulator, C.dtype)
    ct.store(C, index=(bid_x, bid_y), tile=accumulator)
```

--------------------------------

### cuTile Exception Types

Source: https://github.com/nvidia/cutile-python/blob/main/docs/source/debugging.rst

Information about the various exception types raised by cuTile during kernel development.

```APIDOC
## cuTile Exception Types

### Description
This section lists and describes the custom exception types provided by the cuTile library, which are raised during the compilation and execution of tile kernels.

### Exception Types

- **TileSyntaxError**: Raised when there is a syntax error in the TileIR code.
- **TileTypeError**: Raised when a type mismatch occurs during compilation. This can be further diagnosed by setting `CUDA_TILE_LOGS=CUTILEIR`.
- **TileValueError**: Raised when an invalid value is encountered in the TileIR code or parameters.
- **TileCompilerExecutionError**: Raised when an error occurs during the execution of the TileIR compiler. Enabling crash dumps with `CUDA_TILE_ENABLE_CRASH_DUMP=1` can provide more diagnostic information.
- **TileCompilerTimeoutError**: Raised when the TileIR compiler exceeds the specified timeout limit. The timeout can be configured using `CUDA_TILE_COMPILER_TIMEOUT_SEC` and crash dumps can be enabled with `CUDA_TILE_ENABLE_CRASH_DUMP=1`.
```

--------------------------------

### Persistent Kernels for Multi-Tile Processing

Source: https://context7.com/nvidia/cutile-python/llms.txt

Implements persistent kernels where GPU blocks process multiple tiles to improve occupancy. This CUDA kernel requires PyTorch and the cuda.tile library. It loads tiles, performs element-wise operations (square root of sum of squares), and stores results.

```python
import cuda.tile as ct
import torch

@ct.kernel
def persistent_process(Input, Output, tm: ct.Constant[int], tn: ct.Constant[int]):
    bid = ct.bid(0)

    M = Input.shape[0]
    N = Input.shape[1]

    # Calculate total tiles and blocks
    num_tiles_m = ct.cdiv(M, tm)
    num_tiles_n = ct.cdiv(N, tn)
    total_tiles = num_tiles_m * num_tiles_n
    num_blocks_launched = ct.num_blocks(0)

    # Each block processes multiple tiles
    for tile_id in range(bid, total_tiles, num_blocks_launched):
        # Compute 2D tile coordinates
        tile_m = tile_id // num_tiles_n
        tile_n = tile_id % num_tiles_n

        # Process tile
        tile = ct.load(Input, index=(tile_m, tile_n), shape=(tm, tn),
                      padding_mode=ct.PaddingMode.ZERO)
        result = ct.sqrt(tile * tile + 1.0)
        ct.store(Output, index=(tile_m, tile_n), tile=result)

# Create input
M, N = 4096, 4096
Input = torch.randn(M, N, dtype=torch.float32, device='cuda')
Output = torch.empty_like(Input)

# Launch with fewer blocks than tiles (persistent approach)
tm, tn = 128, 128
num_sms = torch.cuda.get_device_properties('cuda').multi_processor_count
grid = (num_sms * 2, 1, 1)  # Launch 2 blocks per SM

ct.launch(torch.cuda.current_stream(), grid, persistent_process, (Input, Output, tm, tn))
```

--------------------------------

### Atomic Operations with CUDA Tiles

Source: https://context7.com/nvidia/cutile-python/llms.txt

This snippet demonstrates thread-safe accumulation using atomic operations and memory ordering with the cuda.tile library. It employs atomic compare-and-swap (CAS) for lock acquisition and atomic exchange for lock release. Requires torch and cuda.tile.

```python
import cuda.tile as ct
import torch

@ct.kernel
def atomic_accumulate(X, Partial_Sum, Locks, group_size: ct.Constant[int], tile_n: ct.Constant[int]):
    bid = ct.bid(0)
    group_id = bid % group_size

    # Load data for this block
    x_tile = ct.load(X, index=(bid,), shape=(tile_n,))
    local_sum = ct.sum(x_tile, axis=0)

    # Acquire lock using compare-and-swap
    while ct.atomic_cas(Locks, group_id, 0, 1, memory_order=ct.MemoryOrder.ACQUIRE) == 1:
        pass

    # Critical section: accumulate partial sum
    current = ct.load(Partial_Sum, index=(group_id,), shape=(1,))
    updated = current + local_sum
    ct.store(Partial_Sum, index=(group_id,), tile=updated)

    # Release lock
    ct.atomic_xchg(Locks, group_id, 0, memory_order=ct.MemoryOrder.RELEASE)

# Create input data
N = 16384
tile_n = 256
num_blocks = N // tile_n
group_size = 8

X = torch.randn(N, dtype=torch.float32, device='cuda')
Partial_Sum = torch.zeros(group_size, dtype=torch.float32, device='cuda')
Locks = torch.zeros(group_size, dtype=torch.int32, device='cuda')

grid = (num_blocks, 1, 1)
ct.launch(torch.cuda.current_stream(), grid, atomic_accumulate, (X, Partial_Sum, Locks, group_size, tile_n))
print("Atomic accumulation kernel launched.")
```

--------------------------------

### Enable Test Code Coverage

Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt

This section enables code coverage for tests if the ENABLE_COVERAGE_FOR_TESTS option is set to ON. It supports GCC and Clang compilers by adding the '--coverage' flag. For unsupported compilers, it raises a FATAL_ERROR.

```cmake
option(ENABLE_COVERAGE_FOR_TESTS "Enable code coverage for tests" OFF)
set(test_coverage_options)

if (ENABLE_COVERAGE_FOR_TESTS)
    if ((CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
            OR (CMAKE_CXX_COMPILER_ID STREQUAL "Clang"))
        set(test_coverage_options "--coverage")
    else()
        message(FATAL_ERROR "Not sure how to enable coverage "
                            "for the current compiler ${CMAKE_CXX_COMPILER_ID}.")
    endif()
    message(STATUS "Enabling code coverage for tests")
endif()
```

--------------------------------

### Sign Git Commits with --signoff Flag

Source: https://github.com/nvidia/cutile-python/blob/main/CONTRIBUTING.md

This snippet demonstrates the command used to sign off on a Git commit. The `-s` or `--signoff` flag appends the contributor's name and email to the commit message, certifying their work under the DCO. This is a crucial step for all contributions to the project.

```bash
git commit -s -m "Add cool feature."
```

--------------------------------

### Manage dlpack Dependency

Source: https://github.com/nvidia/cutile-python/blob/main/CMakeLists.txt

This configuration handles the dependency on the dlpack library. It checks if DLPACK_PATH is set, using a local dlpack if available. Otherwise, it fetches dlpack from its GitHub repository using FetchContent and sets the include directory accordingly.

```cmake
if (DLPACK_PATH)
    set(dlpack_INCLUDE_DIR "${DLPACK_PATH}/include/dlpack")
    message(STATUS "Using local dlpack, include path: ${dlpack_INCLUDE_DIR}")
else()
    include(FetchContent)
    FetchContent_Declare(
        dlpack
        GIT_REPOSITORY https://github.com/dmlc/dlpack.git
        GIT_TAG v1.1
    )
    FetchContent_MakeAvailable(dlpack)
    set(dlpack_INCLUDE_DIR "${dlpack_SOURCE_DIR}/include/dlpack")
    message(STATUS "Fetching dlpack")
endif()
```