### Compile Thrust Example with nvcc

Source: https://github.com/nvidia/cccl/blob/main/thrust/examples/README.md

Compile a Thrust example program like 'norm.cu' using the nvcc compiler. Ensure Thrust is installed before attempting to compile.

```bash
$ nvcc norm.cu -o norm
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads 64-bit data from global memory with L1 eviction and an L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.b64

### Description
Loads 64-bit data from global memory with L1 eviction and an L2 cache hint.

### PTX ISA
74, SM_80

### Template
```cpp
template <typename B64, enable_if_t<sizeof(B64) == 8, bool> = true>
__device__ static inline B64 ld_L1_evict_first_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B64* addr,
  uint64_t cache_policy);
```
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.v4.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads 256 bits (4x64 bits) from global memory with L1 evict first and L2 cache hint. Requires SM_100 or higher.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.v4.b64

### Description
Loads a vector of 4x64-bit elements from global memory, prioritizing L1 eviction and providing an L2 cache hint.

### Parameters
- `dest`: The destination register.
- `addr`: The memory address to load from.
- `cache_policy`: The L2 cache policy hint.
```

--------------------------------

### Get CUDA Toolkit Version

Source: https://github.com/nvidia/cccl/blob/main/cudax/examples/CMakeLists.txt

Retrieves the version of the installed CUDA toolkit.

```cmake
cccl_get_cudatoolkit()
```

--------------------------------

### Development Setup and Testing for CCCL

Source: https://github.com/nvidia/cccl/blob/main/docs/python/setup.md

Sets up the CCCL development environment by cloning the repository, installing in editable mode with test dependencies, and running tests to verify the installation.

```bash
# Clone the repository
git clone https://github.com/NVIDIA/cccl.git
cd cccl/python/cuda_cccl

# Install in development mode with test dependencies
pip install -e .[test-cu13]  # or .[test-cu12], .[test-sysctk13], .[test-sysctk12]

# Run tests to verify everything works
pytest tests/
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.b32

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads 32-bit data from global memory with L1 eviction and an L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.b32

### Description
Loads 32-bit data from global memory with L1 eviction and an L2 cache hint.

### PTX ISA
74, SM_80

### Template
```cpp
template <typename B32, enable_if_t<sizeof(B32) == 4, bool> = true>
__device__ static inline B32 ld_L1_evict_first_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B32* addr,
  uint64_t cache_policy);
```
```

--------------------------------

### Example Usage of cuda::bitfield_insert in a CUDA Kernel

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/bit/bitfield_insert.md

This example demonstrates the usage of `cuda::bitfield_insert` within a CUDA kernel. It uses assertions to verify the correctness of the bitfield insertion for different start positions and values.

```cuda
#include <cuda/bit>
#include <cuda/std/cassert>

__global__ void bitfield_insert_kernel() {
    assert(cuda::bitfield_insert(0u, 0xFFFFu, 0, 4) == 0b1111);
    assert(cuda::bitfield_insert(0u, 0xFFFFu, 3, 4) == 0b1111000);
    assert(cuda::bitfield_insert(1u, 0xFFFFu, 3, 4) == 0b1111001);
}

int main() {
    bitfield_insert_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

--------------------------------

### ld.global.L1::evict_first.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 64-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::256B.b64

### Description
Loads 64-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

### Method
__device__ static inline B64

### Parameters
- **cuda::ptx::space_global_t**: Specifies the memory space as global.
- **const B64* addr**: Pointer to the memory address to load from.

### PTX ISA
ld.space.L1::evict_first.L2::256B.b64 dest, [addr];

### Notes
- PTX ISA 74, SM_80
- .space = { .global }
- Template parameter B64 requires sizeof(B64) == 8.
```

--------------------------------

### Example Usage of cuda::device::memcpy_async_tx

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/asynchronous_operations/memcpy_async_tx.md

This example demonstrates the usage of `cuda::device::memcpy_async_tx` within a CUDA kernel. It includes setup for a barrier, electing a leader thread to initiate the copy, and synchronizing using `barrier_arrive_tx` and `bar.wait`. The code includes a static assert to ensure it's compiled for a compatible architecture.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility> // cuda::std::move

#if defined(__CUDA_MINIMUM_ARCH__) && __CUDA_MINIMUM_ARCH__ < 900
static_assert(false, "Insufficient CUDA Compute Capability: cuda::device::memcpy_async_tx is not available.");
#endif // __CUDA_MINIMUM_ARCH__

__device__ alignas(16) int gmem_x[2048];

__device__ inline bool elect_one() {
  const unsigned int tid = threadIdx.x;
  const unsigned int warp_id = tid / 32;
  const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0
  return (uniform_warp_id == 0 && cuda::ptx::elect_sync(0xFFFFFFFF)); // elect a leader thread among warp 0
}

__global__ void example_kernel() {
  alignas(16) __shared__ int smem_x[1024];
  #pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;

  // setup the mbarrier
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
  }
  __syncthreads();

  // issue the async copy from a single thread and wait for completion
  const bool is_block_leader = elect_one();
  const int tx_count = is_block_leader ? sizeof(smem_x) : 0;
  if (is_block_leader) {
    cuda::device::memcpy_async_tx(smem_x, gmem_x, cuda::aligned_size_t<16>(tx_count), bar);
  }
  auto token = cuda::device::barrier_arrive_tx(bar, 1, tx_count);
  bar.wait(cuda::std::move(token));

  // smem_x contains the contents of gmem_x[0], ..., gmem_x[1023]
  smem_x[threadIdx.x] += 1;
}

```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 64 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.L2::256B.b64

### Description
Loads 64 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80.

### Method
__device__ static inline B64 ld_L1_evict_first_L2_cache_hint_L2_256B(cuda::ptx::space_global_t, const B64* addr, uint64_t cache_policy);

### Parameters
- **cuda::ptx::space_global_t**: Specifies the memory space (global).
- **const B64* addr**: Pointer to the memory location in global memory to load from. B64 must be 8 bytes in size.
- **uint64_t cache_policy**: Specifies the cache behavior for L2.
```

--------------------------------

### CUDA API Forward Progress Guarantee Example

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/execution_model.md

Demonstrates that `cudaDeviceSynchronize` eventually ensures at least one device thread makes progress, leading to program termination. This relies on CUDA's guarantee that if the device is empty, threads will eventually start and complete.

```cuda
__global__ void hello_world() { __syncthreads(); }
int main() {
    hello_world<<<1,2>>>();
    return (int)cudaDeviceSynchronize();
}
```

--------------------------------

### ld.global.L1::evict_last.L2::cache_hint.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads a 64-bit value from global memory with L1 eviction, L2 256B cache hint.

```APIDOC
## ld.global.L1::evict_last.L2::cache_hint.L2::256B.b64

### Description
Loads a 64-bit value from global memory, evicting the last L1 cache line and providing a specific L2 256B cache hint.

### Signature
```cpp
template <typename B64, enable_if_t<sizeof(B64) == 8, bool> = true>
__device__ static inline B64 ld_L1_evict_last_L2_cache_hint_L2_256B(
  cuda::ptx::space_global_t,
  const B64* addr,
  uint64_t cache_policy);
```

### Parameters
* `cuda::ptx::space_global_t`: Specifies the global memory space.
* `const B64* addr`: Pointer to the memory address to load from.
* `uint64_t cache_policy`: Cache policy hint for L2 (256B).
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.b8

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads 8-bit data from global memory with L1 eviction and an L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.b8

### Description
Loads 8-bit data from global memory with L1 eviction and an L2 cache hint.

### PTX ISA
74, SM_80

### Template
```cpp
template <typename B8, enable_if_t<sizeof(B8) == 1, bool> = true>
__device__ static inline B8 ld_L1_evict_first_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B8* addr,
  uint64_t cache_policy);
```
```

--------------------------------

### Example Usage of cuda::bitmask in a CUDA Kernel

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/bit/bitmask.md

Demonstrates the usage of `cuda::bitmask` within a CUDA kernel to create bitmasks. Asserts are used to verify the correctness of the generated bitmasks for different types and bit positions. Ensure that `start` and `width` satisfy the function's preconditions.

```cuda
#include <cuda/bit>
#include <cuda/std/assert>
#include <cuda/std/cstdint>

__global__ void bitmask_kernel() {
    assert(cuda::bitmask(2, 4) == 0b111100u);
    assert(cuda::bitmask<uint64_t>(1, 3) == uint64_t{0b1110});
}

int main() {
    bitmask_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

--------------------------------

### Define Temporary Installation Prefix

Source: https://github.com/nvidia/cccl/blob/main/test/cmake/CMakeLists.txt

Sets a temporary installation prefix within the build directory for testing against an installed project.

```cmake
set(tmp_install_prefix "${CMAKE_CURRENT_BINARY_DIR}/test_install")
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.b16

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads 16-bit data from global memory with L1 eviction and an L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.b16

### Description
Loads 16-bit data from global memory with L1 eviction and an L2 cache hint.

### PTX ISA
74, SM_80

### Template
```cpp
template <typename B16, enable_if_t<sizeof(B16) == 2, bool> = true>
__device__ static inline B16 ld_L1_evict_first_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B16* addr,
  uint64_t cache_policy);
```
```

--------------------------------

### Launch STF Example Binary

Source: https://github.com/nvidia/cccl/blob/main/docs/cudax/stf.md

Example command to launch a compiled STF example binary. Binaries are located in the 'bin/' subdirectory after building.

```bash
./bin/cudax.cpp17.example.stf.01-axpy
```

--------------------------------

### ld.global.L1::evict_first.L2::256B.b32

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 32-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::256B.b32

### Description
Loads 32-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

### Method
__device__ static inline B32

### Parameters
- **cuda::ptx::space_global_t**: Specifies the memory space as global.
- **const B32* addr**: Pointer to the memory address to load from.

### PTX ISA
ld.space.L1::evict_first.L2::256B.b32 dest, [addr];

### Notes
- PTX ISA 74, SM_80
- .space = { .global }
- Template parameter B32 requires sizeof(B32) == 4.
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.v4.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads data from global memory with L1 eviction first and L2 cache hint. Supports 256-bit vectors.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.v4.b64

### Description
Loads a 256-bit vector from global memory, evicting the first cache line and using an L2 cache hint.

### Method
__device__ static inline B256 ld_L1_evict_first_L2_cache_hint(cuda::ptx::space_global_t, const B256* addr, uint64_t cache_policy)

### Parameters
- `cuda::ptx::space_global_t`: Specifies the global memory space.
- `const B256* addr`: Pointer to the source memory address.
- `uint64_t cache_policy`: Cache policy hint for L2.
```

--------------------------------

### Add CUDA C++ Example

Source: https://github.com/nvidia/cccl/blob/main/cub/examples/block/CMakeLists.txt

Use this snippet to add CUDA C++ examples to your project. It finds all example source files and adds them as targets.

```cmake
file(
  GLOB_RECURSE example_srcs
  RELATIVE "${CMAKE_CURRENT_LIST_DIR}"
  CONFIGURE_DEPENDS
  example_*.cu
)

foreach (example_src IN LISTS example_srcs)
  get_filename_component(example_name "${example_src}" NAME_WE)
  string(
    REGEX REPLACE
    "^example_block_"
    "block."
    example_name
    "${example_name}"
  )
  cub_add_example(target_name ${example_name} "${example_src}")
endforeach()
```

--------------------------------

### Building Examples

Source: https://github.com/nvidia/cccl/blob/main/docs/thrust/developer/cmake_options.md

Enable or disable the building of examples. Defaults to ON.

```cmake
cmake -DTHRUST_ENABLE_EXAMPLES=OFF ...
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.b128

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads 128-bit data from global memory with L1 eviction and an L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.b128

### Description
Loads 128-bit data from global memory with L1 eviction and an L2 cache hint.

### PTX ISA
83, SM_80

### Template
```cpp
template <typename B128, enable_if_t<sizeof(B128) == 16, bool> = true>
__device__ static inline B128 ld_L1_evict_first_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B128* addr,
  uint64_t cache_policy);
```
```

--------------------------------

### ld.global.nc.L1::evict_last.L2::cache_hint.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 64-bit data from global memory with L1 cache eviction to L2 (256B) and L2 cache hint.

```APIDOC
## ld.global.nc.L1::evict_last.L2::cache_hint.L2::256B.b64

### Description
Loads 64-bit data from global memory into a register, evicting the last accessed line from L1 cache to L2 (256B), with a specified cache policy.

### Signature
```cpp
template <typename B64, enable_if_t<sizeof(B64) == 8, bool> = true>
__device__ static inline B64 ld_nc_L1_evict_last_L2_cache_hint_L2_256B(
  cuda::ptx::space_global_t,
  const B64* addr,
  uint64_t cache_policy);
```

### PTX ISA
`ld.space.nc.L1::evict_last.L2::cache_hint.L2::256B.b64 dest, [addr], cache_policy;` (PTX ISA 74, SM_80)
```

--------------------------------

### Process Code Generation STF Examples

Source: https://github.com/nvidia/cccl/blob/main/cudax/examples/stf/CMakeLists.txt

Conditionally processes STF examples that rely on code generation (parallel_for or launch) if CUDASTF_CODE_GENERATION is enabled. It adds these examples as executables.

```cmake
set(
  stf_example_codegen_sources
  01-axpy-launch.cu
  01-axpy-parallel_for.cu
  binary_fhe.cu
  binary_fhe_stackable.cu
  09-dot-reduce.cu
  cfd.cu
  custom_data_interface.cu
  fdtd_mgpu.cu
  fdtd_while.cu
  fdtd_repeat_n.cu
  frozen_data_init.cu
  graph_algorithms/degree_centrality.cu
  graph_algorithms/jaccard.cu
  graph_algorithms/pagerank.cu
  graph_algorithms/pagerank_batched.cu
  graph_algorithms/pagerank_while.cu
  graph_algorithms/tricount.cu
  graph_scope.cu
  heat.cu
  heat_mgpu.cu
  jacobi.cu
  jacobi_pfor.cu
  jacobi_stackable.cu
  jacobi_stackable_raii.cu
  jacobi_update_cond.cu
  launch_histogram.cu
  launch_scan.cu
  launch_sum.cu
  launch_sum_cub.cu
  linear_algebra/burger.cu
  linear_algebra/burger_sensitivity.cu
  linear_algebra/cg_csr.cu
  linear_algebra/cg_csr_stackable.cu
  logical_gates_composition.cu
  mandelbrot.cu
  parallel_for_2D.cu
  pi.cu
  scan.cu
  sqrt_newton_stackable.cu
  standalone-launches.cu
  word_count.cu
  word_count_reduce.cu
)

if (cudax_ENABLE_CUDASTF_CODE_GENERATION)
  foreach (source IN LISTS stf_example_codegen_sources)
    cudax_add_stf_example(example_target "${source}")
  endforeach()
endif()
```

--------------------------------

### Add Install Test Fixture

Source: https://github.com/nvidia/cccl/blob/main/test/cmake/CMakeLists.txt

Defines a CMake test to perform an installation using the '--install' command, targeting a specified prefix. This test requires the 'install_tree' fixture.

```cmake
add_test(
  NAME cccl.test.cmake.install_tree.install
  # gersemi: off
  COMMAND
    "${CMAKE_COMMAND}"
      --install "${CCCL_BINARY_DIR}"
      --prefix "${tmp_install_prefix}"
  # gersemi: on
)
set_tests_properties(
  cccl.test.cmake.install_tree.install
  PROPERTIES FIXTURES_SETUP install_tree
)
```

--------------------------------

### ld.global.L1::evict_last.L2::cache_hint.L2::256B.b32

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads a 32-bit value from global memory with L1 eviction, L2 256B cache hint.

```APIDOC
## ld.global.L1::evict_last.L2::cache_hint.L2::256B.b32

### Description
Loads a 32-bit value from global memory, evicting the last L1 cache line and providing a specific L2 256B cache hint.

### Signature
```cpp
template <typename B32, enable_if_t<sizeof(B32) == 4, bool> = true>
__device__ static inline B32 ld_L1_evict_last_L2_cache_hint_L2_256B(
  cuda::ptx::space_global_t,
  const B32* addr,
  uint64_t cache_policy);
```

### Parameters
* `cuda::ptx::space_global_t`: Specifies the global memory space.
* `const B32* addr`: Pointer to the memory address to load from.
* `uint64_t cache_policy`: Cache policy hint for L2 (256B).
```

--------------------------------

### ld.global.nc.L1::evict_first.L2::256B.b8

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 8 bits (b8) from global memory with non-coherent caching, L1 evict first, and L2 256B cache hint. Requires SM 80.

```APIDOC
## ld.global.nc.L1::evict_first.L2::256B.b8

### Description
Loads 8 bits of data from a global memory address with non-coherent caching, an L1 evict first policy, and an L2 256B cache hint.

### Method
__device__ static inline B8 ld_nc_L1_evict_first_L2_256B(cuda::ptx::space_global_t, const B8* addr)

### Parameters
- `cuda::ptx::space_global_t`: Specifies the global memory space.
- `const B8* addr`: Pointer to the memory address to load from. `B8` must be 1 byte.

### PTX ISA
74, SM_80
```

--------------------------------

### Install CCCL using CMake

Source: https://github.com/nvidia/cccl/blob/main/README.md

Clone the CCCL repository, navigate to the directory, and use CMake to configure and install CCCL. The default CMake options generate installation rules.

```bash
git clone https://github.com/NVIDIA/cccl.git
cd cccl
cmake . -DCMAKE_INSTALL_PREFIX=/usr/local
make install
```

--------------------------------

### Add CUB Example Executable

Source: https://github.com/nvidia/cccl/blob/main/cub/examples/CMakeLists.txt

Defines a CMake function to add an example executable, register it with ctest, and configure its build settings. Use this to create new example targets.

```cmake
function(cub_add_example target_name_var example_name example_src)
  # The actual name of the test's target:
  set(example_target cub.example.${example_name})
  set(${target_name_var} ${example_target} PARENT_SCOPE)

  cccl_add_executable(${example_target} SOURCES "${example_src}" ADD_CTEST)
  cub_configure_cuda_target(${example_target} RDC ${CUB_FORCE_RDC})
  target_link_libraries(
    ${example_target}
    PRIVATE #
      cub.compiler_interface
      cccl.c2h
  )
  target_include_directories(
    ${example_target}
    PRIVATE "${CUB_SOURCE_DIR}/examples"
  )
endfunction()
```

--------------------------------

### ld.global.L1::no_allocate.L2::cache_hint.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 64-bit data from global memory with L1 no-allocate and L2 256B cache hint.

```APIDOC
## ld.global.L1::no_allocate.L2::cache_hint.L2::256B.b64

### Description
Loads 64-bit data from global memory with L1 no-allocate and L2 256B cache hint.

### Method
__device__ static inline B64

### Parameters
- **cuda::ptx::space_global_t**: Specifies the memory space as global.
- **const B64* addr**: Pointer to the memory address to load from.
- **uint64_t cache_policy**: Specifies the cache policy for L2.

### PTX ISA
PTX ISA 74, SM_80
```

--------------------------------

### Avoid Calling get() on Temporary mdspan-to-DLPack Conversion

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/mdspan/mdspan_to_dlpack.md

Calling `get()` on a temporary `cuda::to_dlpack_tensor` result is invalid because the temporary object is destroyed immediately after `get()` returns, leading to a dangling reference. The `DLTensor` object returned by `get()` must not outlive the `cuda::to_dlpack_tensor` return value.

```cuda
#include <dlpack/dlpack.h>
#include <cuda/mdspan>
#include <cuda/std/cstdint>

void show_invalid_usage1() {
  using extents_t = cuda::std::extents<size_t, 2, 3>;

  int data[6] = {0, 1, 2, 3, 4, 5};
  cuda::host_mdspan<int, extents_t> md{data, extents_t{}};

  // WRONG: calling get() on a temporary is deleted to prevent dangling references.
  // const DLTensor& dltensor = cuda::to_dlpack_tensor(md).get(); // compile error
}
```

--------------------------------

### ld.global.nc.L1::evict_last.L2::cache_hint.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 64-bit data from global memory with L1 cache eviction and L2 cache hint.

```APIDOC
## ld.global.nc.L1::evict_last.L2::cache_hint.b64

### Description
Loads 64-bit data from global memory into a register, evicting the last accessed line from L1 cache to L2, with a specified cache policy.

### Signature
```cpp
template <typename B64, enable_if_t<sizeof(B64) == 8, bool> = true>
__device__ static inline B64 ld_nc_L1_evict_last_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B64* addr,
  uint64_t cache_policy);
```

### PTX ISA
`ld.space.nc.L1::evict_last.L2::cache_hint.b64 dest, [addr], cache_policy;` (PTX ISA 74, SM_80)
```

--------------------------------

### Navigate to Example Directory

Source: https://github.com/nvidia/cccl/blob/main/examples/basic/README.md

Change into the example project directory after cloning the repository.

```bash
cd cccl/examples/example_project
```

--------------------------------

### ld.global.nc.L1::evict_first.L2::cache_hint.v4.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads a vector of four 64-bit elements from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint.

```APIDOC
## ld.global.nc.L1::evict_first.L2::cache_hint.v4.b64

### Description
Loads a vector of four 64-bit elements from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint.

### PTX ISA
88, SM_100

### Template Signature
```cuda
template <typename B256, enable_if_t<sizeof(B256) == 32, bool> = true>
__device__ static inline B256 ld_nc_L1_evict_first_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B256* addr,
  uint64_t cache_policy);
```
```

--------------------------------

### Install CCCL from PyPI with System CUDA Toolkit

Source: https://github.com/nvidia/cccl/blob/main/docs/python/setup.md

Installs cuda-cccl without the CUDA Toolkit, assuming a compatible CUDA toolkit is already installed and configured in the system's PATH and LD_LIBRARY_PATH.

```bash
pip install cuda-cccl[sysctk13]  # or cuda-cccl[sysctk12]
```

--------------------------------

### ld.global.nc.L1::evict_first.L2::cache_hint.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads a 64-bit value from global memory with L1 eviction and L2 cache hint.

```APIDOC
## ld.global.nc.L1::evict_first.L2::cache_hint.b64

### Description
Loads a 64-bit value from global memory, employing L1 cache with 'evict_first' policy and an L2 cache hint.

### Parameters
- `cuda::ptx::space_global_t`: Specifies the memory space as global.
- `const B64* addr`: Pointer to the memory address to load from.
- `uint64_t cache_policy`: Cache policy hint for L2 cache.
```

--------------------------------

### Enabling Installation Rules

Source: https://github.com/nvidia/cccl/blob/main/docs/thrust/developer/cmake_options.md

Control whether installation rules for Thrust are generated. Defaults to ON.

```cmake
cmake -DTHRUST_ENABLE_INSTALL_RULES=OFF ...
```

--------------------------------

### ld.global.L1::evict_first.L2::cache_hint.L2::256B.b32

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 32 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80.

```APIDOC
## ld.global.L1::evict_first.L2::cache_hint.L2::256B.b32

### Description
Loads 32 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80.

### Method
__device__ static inline B32 ld_L1_evict_first_L2_cache_hint_L2_256B(cuda::ptx::space_global_t, const B32* addr, uint64_t cache_policy);

### Parameters
- **cuda::ptx::space_global_t**: Specifies the memory space (global).
- **const B32* addr**: Pointer to the memory location in global memory to load from. B32 must be 4 bytes in size.
- **uint64_t cache_policy**: Specifies the cache behavior for L2.
```

--------------------------------

### Install Plotting Dependencies

Source: https://github.com/nvidia/cccl/blob/main/docs/cub/benchmarking.md

Installs Python packages required for plotting benchmark results.

```bash
pip install fpzip pandas matplotlib seaborn tabulate PyQt5 colorama
```

--------------------------------

### ld.global.L1::evict_last.L2::256B.b8

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads 8 bits from global memory with L1 evict last and L2 256B hint. Requires SM_80 or higher.

```APIDOC
## ld.global.L1::evict_last.L2::256B.b8

### Description
Loads 8 bits from global memory with L1 evict last policy and a hint for L2 cache to use 256B line size.

### Parameters
- `dest`: The destination register.
- `addr`: The memory address to load from.
```

--------------------------------

### ld.global.L1::evict_first.L2::256B.b8

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 8-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::256B.b8

### Description
Loads 8-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

### Method
__device__ static inline B8

### Parameters
- **cuda::ptx::space_global_t**: Specifies the memory space as global.
- **const B8* addr**: Pointer to the memory address to load from.

### PTX ISA
ld.space.L1::evict_first.L2::256B.b8 dest, [addr];

### Notes
- PTX ISA 74, SM_80
- .space = { .global }
- Template parameter B8 requires sizeof(B8) == 1.
```

--------------------------------

### Example Usage

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/mdspan/restrict_accessor.md

An example demonstrating the usage of `restrict_mdspan` in a CUDA kernel and `main` function.

```APIDOC
#include <cuda/mdspan>

using restrict_mdspan = cuda::restrict_mdspan<int, cuda::std::dims<1>>;

__host__ __device__ void
compute(restrict_mdspan a, restrict_mdspan b, restrict_mdspan c) {
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
}

int main() {
    using  dim      = cuda::std::dims<1>;
    using  mdspan   = cuda::std::mdspan<int, dim>;
    int    arrayA[] = {1, 2};
    int    arrayB[] = {5};
    int    arrayC[] = {9, 10, 11, 12, 13, 14};
    mdspan mdA{arrayA, dim{1}};
    mdspan mdB{arrayB, dim{5}};
    mdspan mdC{arrayC, dim{6}};
    compute(mdA, mdB, mdC);

    using restrict_aligned_accesor = cuda::std::restrict_accessor<cuda::std::aligned_accessor<int, 8>>;
    using restrict_aligned_mdspan  = cuda::std::mdspan<int, dim, layout_right, restrict_aligned_accesor>;
    restrict_aligned_mdspan mdD{mdC};
}

[See it on Godbolt 🔗](https://godbolt.org/z/Wjco996z8)
```

--------------------------------

### ld.global.nc.L1::evict_last.L2::cache_hint.b32

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 32-bit data from global memory with L1 cache eviction and L2 cache hint.

```APIDOC
## ld.global.nc.L1::evict_last.L2::cache_hint.b32

### Description
Loads 32-bit data from global memory into a register, evicting the last accessed line from L1 cache to L2, with a specified cache policy.

### Signature
```cpp
template <typename B32, enable_if_t<sizeof(B32) == 4, bool> = true>
__device__ static inline B32 ld_nc_L1_evict_last_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B32* addr,
  uint64_t cache_policy);
```

### PTX ISA
`ld.space.nc.L1::evict_last.L2::cache_hint.b32 dest, [addr], cache_policy;` (PTX ISA 74, SM_80)
```

--------------------------------

### Install CCCL Python Package from Source

Source: https://github.com/nvidia/cccl/blob/main/AGENTS.md

Installs the CCCL Python package from source. This involves cloning the repository and using pip with an editable install, optionally specifying test CUDA Toolkit versions.

```bash
git clone https://github.com/NVIDIA/cccl.git
cd cccl/python/cuda_cccl
pip install -e .[test-cu13] # or [test-cu12] for CTK 12.X
```

--------------------------------

### ld.global.L1::evict_last.v4.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads a 256-bit vector from global memory with L1 eviction last.

```APIDOC
## ld.global.L1::evict_last.v4.b64

### Description
Loads a 256-bit vector from global memory, evicting the last cache line.

### Method
__device__ static inline B256 ld_L1_evict_last(cuda::ptx::space_global_t, const B256* addr)

### Parameters
- `cuda::ptx::space_global_t`: Specifies the global memory space.
- `const B256* addr`: Pointer to the source memory address.
```

--------------------------------

### Process Math Library STF Examples

Source: https://github.com/nvidia/cccl/blob/main/cudax/examples/stf/CMakeLists.txt

Conditionally processes STF examples that utilize math libraries like CUBLAS and CUSOLVER if CUDASTF_MATHLIBS is enabled. These examples are added as executables and linked with math libraries.

```cmake
set(
  stf_example_mathlib_sources
  linear_algebra/06-pdgemm.cu
  linear_algebra/06-pdgemm-stackable.cu
  linear_algebra/07-cholesky.cu
  linear_algebra/07-potri.cu
  linear_algebra/cg_dense_2D.cu
  linear_algebra/strassen.cu
)

if (cudax_ENABLE_CUDASTF_MATHLIBS)
  foreach (source IN LISTS stf_example_mathlib_sources)
    cudax_add_stf_example(example_target "${source}" LINK_MATHLIBS)
  endforeach()
endif()
```

--------------------------------

### Install CCCL Python Package from Conda-forge

Source: https://github.com/nvidia/cccl/blob/main/AGENTS.md

Installs the CCCL Python package from the conda-forge channel.

```bash
conda install -c conda-forge cccl-python
```

--------------------------------

### ld.global.L1::evict_last.L2::cache_hint.v4.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads a 256-bit vector (v4.b64) from global memory with L1 eviction and L2 cache hint.

```APIDOC
## ld.global.L1::evict_last.L2::cache_hint.v4.b64

### Description
Loads a 256-bit vector (four 64-bit elements) from global memory, evicting the last L1 cache line and providing an L2 cache hint.

### Signature
```cpp
template <typename B256, enable_if_t<sizeof(B256) == 32, bool> = true>
__device__ static inline B256 ld_L1_evict_last_L2_cache_hint(
  cuda::ptx::space_global_t,
  const B256* addr,
  uint64_t cache_policy);
```

### Parameters
* `cuda::ptx::space_global_t`: Specifies the global memory space.
* `const B256* addr`: Pointer to the memory address to load from.
* `uint64_t cache_policy`: Cache policy hint for L2.
```

--------------------------------

### resource_ref Usage Example

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/memory_resource/wrappers.md

Example showing `cuda::mr::resource_ref` used with a specific property.

```APIDOC
## resource_ref

`cuda::mr::resource_ref` is the non-owning, type-erased wrapper. Prefer it when the caller controls the resource lifetime.

```cpp
struct required_alignment {
    using value_type = std::size_t;
};
void* do_allocate_with_alignment(cuda::mr::resource_ref<required_alignment> resource, cuda::stream_ref stream, std::size_t size) {
    return resource.allocate(stream, size, get_property(resource, required_alignment{}));
}
```

However, the type erasure comes with the cost that arbitrary properties cannot be queried from either wrapper:

```cpp
struct required_alignment {
    using value_type = std::size_t;
};
void* buggy_allocate_with_alignment(cuda::mr::resource_ref<> resource, cuda::stream_ref stream, std::size_t size) {
    if constexpr (cuda::has_property<decltype(resource), required_alignment>) { // BUG: This will always be false
        return resource.allocate(stream, size, get_property(resource, required_alignment{}));
    }
    else {
        return resource.allocate(stream, size, my_default_alignment);
    }
}
```

So, choose wisely. If your library has a well-defined set of fixed properties that you expect to always be available, then `cuda::mr::{synchronous_}resource_ref` is an amazing tool to improve compile times and binary size. If you need a flexible interface then constraining a template argument through `cuda::mr::{synchronous_}resource_with` is the proper solution.
```

--------------------------------

### ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads a 64-bit value from global memory with L1 eviction, L2 cache hint, and a 256B L2 cache line size.

```APIDOC
## ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64

### Description
Loads a 64-bit value from global memory, using L1 cache with 'evict_first' policy and a 256-byte L2 cache line hint.

### Parameters
- `cuda::ptx::space_global_t`: Specifies the memory space as global.
- `const B64* addr`: Pointer to the memory address to load from.
- `uint64_t cache_policy`: Cache policy hint for L2 cache.
```

--------------------------------

### ld.global.L1::evict_first.L2::256B.b16

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads 16-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

```APIDOC
## ld.global.L1::evict_first.L2::256B.b16

### Description
Loads 16-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint.

### Method
__device__ static inline B16

### Parameters
- **cuda::ptx::space_global_t**: Specifies the memory space as global.
- **const B16* addr**: Pointer to the memory address to load from.

### PTX ISA
ld.space.L1::evict_first.L2::256B.b16 dest, [addr];

### Notes
- PTX ISA 74, SM_80
- .space = { .global }
- Template parameter B16 requires sizeof(B16) == 2.
```

--------------------------------

### CountingIterator

Source: https://github.com/nvidia/cccl/blob/main/docs/python/compute_api.md

Represents a sequence of incrementing values. The iterator starts at start and increments by 1 for each advance.

```APIDOC
## class cuda.compute.iterators.CountingIterator(start: number)

Iterator representing a sequence of incrementing values.

Similar to [thrust::counting_iterator](https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html).

The iterator starts at start and increments by 1 for each advance.

### Example

The code snippet below demonstrates the usage of a `CountingIterator`
representing the sequence `[10, 11, 12]`:

```python
"""
Example showing how to use counting_iterator.
"""

import functools

import cupy as cp
import numpy as np

import cuda.compute
from cuda.compute import (
    CountingIterator,
    OpKind,
)

# Prepare the input and output arrays.
first_item = 1
num_items = 100

# Create the counting iterator.
first_it = CountingIterator(np.int32(first_item))

# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)

# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)

# Perform the reduction.
cuda.compute.reduce_into(
    d_in=first_it, d_out=d_output, num_items=num_items, op=OpKind.PLUS, h_init=h_init
)

# Verify the result.
expected_output = functools.reduce(
    lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
```

### __init__(start: number)

Create a counting iterator starting at start.

* **Parameters:**
  **start** – The initial value (must be a numpy scalar)
```

--------------------------------

### ld.global.nc.L1::evict_first.L2::cache_hint.b32

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md

Loads a 32-bit value from global memory with L1 eviction and L2 cache hint.

```APIDOC
## ld.global.nc.L1::evict_first.L2::cache_hint.b32

### Description
Loads a 32-bit value from global memory, using L1 cache with an 'evict_first' policy and providing a hint to the L2 cache.

### Parameters
- `cuda::ptx::space_global_t`: Specifies the memory space as global.
- `const B32* addr`: Pointer to the memory address to load from.
- `uint64_t cache_policy`: Cache policy hint for L2 cache.
```

--------------------------------

### Set installation location for generated code

Source: https://github.com/nvidia/cccl/blob/main/libcudacxx/codegen/CMakeLists.txt

Defines the destination directory for installing the generated atomic functions header.

```cmake
set(
  atomic_install_location
  "${libcudacxx_SOURCE_DIR}/include/cuda/std/__atomic/functions"
)
```

--------------------------------

### Install CCCL using Convenience Script

Source: https://github.com/nvidia/cccl/blob/main/README.md

Use the provided convenience script to install CCCL to a specified directory.

```bash
ci/install_cccl.sh /usr/local
```

--------------------------------

### ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads data from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint, for 64-bit data and a 256B L2 cache line size.

```APIDOC
## ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64

### Description
Loads data from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint, for 64-bit data and a 256B L2 cache line size.

### PTX ISA
74, SM_80

### Template Signature
```cuda
template <typename B64, enable_if_t<sizeof(B64) == 8, bool> = true>
__device__ static inline B64 ld_nc_L1_evict_first_L2_cache_hint_L2_256B(
  cuda::ptx::space_global_t,
  const B64* addr,
  uint64_t cache_policy);
```
```

--------------------------------

### Enabling Example FileCheck Validation

Source: https://github.com/nvidia/cccl/blob/main/docs/thrust/developer/cmake_options.md

Enable validation of example outputs using the LLVM FileCheck utility. Defaults to OFF.

```cmake
cmake -DTHRUST_ENABLE_EXAMPLE_FILECHECK=ON ...
```

--------------------------------

### ld.global.L1::evict_last.L2::256B.b64

Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md

Loads a 64-bit value from global memory with L1 eviction last and L2 256B cache hint.

```APIDOC
## ld.global.L1::evict_last.L2::256B.b64

### Description
Loads a 64-bit value from global memory, evicting the last cache line and using an L2 256B cache hint.

### Method
__device__ static inline B64 ld_L1_evict_last_L2_256B(cuda::ptx::space_global_t, const B64* addr)

### Parameters
- `cuda::ptx::space_global_t`: Specifies the global memory space.
- `const B64* addr`: Pointer to the source memory address.
```

--------------------------------

### Install CUB Tuning Dependencies

Source: https://github.com/nvidia/cccl/blob/main/docs/cub/benchmarking.md

Installs necessary Python packages for the tuning infrastructure. Ensure you are in a clean build directory.

```bash
ninja clean
pip install --user fpzip pandas scipy
```