### Compile Thrust Example with nvcc Source: https://github.com/nvidia/cccl/blob/main/thrust/examples/README.md Compile a Thrust example program like 'norm.cu' using the nvcc compiler. Ensure Thrust is installed before attempting to compile. ```bash $ nvcc norm.cu -o norm ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads 64-bit data from global memory with L1 eviction and an L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.b64 ### Description Loads 64-bit data from global memory with L1 eviction and an L2 cache hint. ### PTX ISA 74, SM_80 ### Template ```cpp template = true> __device__ static inline B64 ld_L1_evict_first_L2_cache_hint( cuda::ptx::space_global_t, const B64* addr, uint64_t cache_policy); ``` ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.v4.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads 256 bits (4x64 bits) from global memory with L1 evict first and L2 cache hint. Requires SM_100 or higher. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.v4.b64 ### Description Loads a vector of 4x64-bit elements from global memory, prioritizing L1 eviction and providing an L2 cache hint. ### Parameters - `dest`: The destination register. - `addr`: The memory address to load from. - `cache_policy`: The L2 cache policy hint. ``` -------------------------------- ### Get CUDA Toolkit Version Source: https://github.com/nvidia/cccl/blob/main/cudax/examples/CMakeLists.txt Retrieves the version of the installed CUDA toolkit. ```cmake cccl_get_cudatoolkit() ``` -------------------------------- ### Development Setup and Testing for CCCL Source: https://github.com/nvidia/cccl/blob/main/docs/python/setup.md Sets up the CCCL development environment by cloning the repository, installing in editable mode with test dependencies, and running tests to verify the installation. ```bash # Clone the repository git clone https://github.com/NVIDIA/cccl.git cd cccl/python/cuda_cccl # Install in development mode with test dependencies pip install -e .[test-cu13] # or .[test-cu12], .[test-sysctk13], .[test-sysctk12] # Run tests to verify everything works pytest tests/ ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.b32 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads 32-bit data from global memory with L1 eviction and an L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.b32 ### Description Loads 32-bit data from global memory with L1 eviction and an L2 cache hint. ### PTX ISA 74, SM_80 ### Template ```cpp template = true> __device__ static inline B32 ld_L1_evict_first_L2_cache_hint( cuda::ptx::space_global_t, const B32* addr, uint64_t cache_policy); ``` ``` -------------------------------- ### Example Usage of cuda::bitfield_insert in a CUDA Kernel Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/bit/bitfield_insert.md This example demonstrates the usage of `cuda::bitfield_insert` within a CUDA kernel. It uses assertions to verify the correctness of the bitfield insertion for different start positions and values. ```cuda #include #include __global__ void bitfield_insert_kernel() { assert(cuda::bitfield_insert(0u, 0xFFFFu, 0, 4) == 0b1111); assert(cuda::bitfield_insert(0u, 0xFFFFu, 3, 4) == 0b1111000); assert(cuda::bitfield_insert(1u, 0xFFFFu, 3, 4) == 0b1111001); } int main() { bitfield_insert_kernel<<<1, 1>>>(); cudaDeviceSynchronize(); return 0; } ``` -------------------------------- ### ld.global.L1::evict_first.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 64-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::256B.b64 ### Description Loads 64-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ### Method __device__ static inline B64 ### Parameters - **cuda::ptx::space_global_t**: Specifies the memory space as global. - **const B64* addr**: Pointer to the memory address to load from. ### PTX ISA ld.space.L1::evict_first.L2::256B.b64 dest, [addr]; ### Notes - PTX ISA 74, SM_80 - .space = { .global } - Template parameter B64 requires sizeof(B64) == 8. ``` -------------------------------- ### Example Usage of cuda::device::memcpy_async_tx Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/asynchronous_operations/memcpy_async_tx.md This example demonstrates the usage of `cuda::device::memcpy_async_tx` within a CUDA kernel. It includes setup for a barrier, electing a leader thread to initiate the copy, and synchronizing using `barrier_arrive_tx` and `bar.wait`. The code includes a static assert to ensure it's compiled for a compatible architecture. ```cuda #include #include // cuda::std::move #if defined(__CUDA_MINIMUM_ARCH__) && __CUDA_MINIMUM_ARCH__ < 900 static_assert(false, "Insufficient CUDA Compute Capability: cuda::device::memcpy_async_tx is not available."); #endif // __CUDA_MINIMUM_ARCH__ __device__ alignas(16) int gmem_x[2048]; __device__ inline bool elect_one() { const unsigned int tid = threadIdx.x; const unsigned int warp_id = tid / 32; const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0 return (uniform_warp_id == 0 && cuda::ptx::elect_sync(0xFFFFFFFF)); // elect a leader thread among warp 0 } __global__ void example_kernel() { alignas(16) __shared__ int smem_x[1024]; #pragma nv_diag_suppress static_var_with_dynamic_init __shared__ cuda::barrier bar; // setup the mbarrier if (threadIdx.x == 0) { init(&bar, blockDim.x); } __syncthreads(); // issue the async copy from a single thread and wait for completion const bool is_block_leader = elect_one(); const int tx_count = is_block_leader ? sizeof(smem_x) : 0; if (is_block_leader) { cuda::device::memcpy_async_tx(smem_x, gmem_x, cuda::aligned_size_t<16>(tx_count), bar); } auto token = cuda::device::barrier_arrive_tx(bar, 1, tx_count); bar.wait(cuda::std::move(token)); // smem_x contains the contents of gmem_x[0], ..., gmem_x[1023] smem_x[threadIdx.x] += 1; } ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 64 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.L2::256B.b64 ### Description Loads 64 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80. ### Method __device__ static inline B64 ld_L1_evict_first_L2_cache_hint_L2_256B(cuda::ptx::space_global_t, const B64* addr, uint64_t cache_policy); ### Parameters - **cuda::ptx::space_global_t**: Specifies the memory space (global). - **const B64* addr**: Pointer to the memory location in global memory to load from. B64 must be 8 bytes in size. - **uint64_t cache_policy**: Specifies the cache behavior for L2. ``` -------------------------------- ### CUDA API Forward Progress Guarantee Example Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/execution_model.md Demonstrates that `cudaDeviceSynchronize` eventually ensures at least one device thread makes progress, leading to program termination. This relies on CUDA's guarantee that if the device is empty, threads will eventually start and complete. ```cuda __global__ void hello_world() { __syncthreads(); } int main() { hello_world<<<1,2>>>(); return (int)cudaDeviceSynchronize(); } ``` -------------------------------- ### ld.global.L1::evict_last.L2::cache_hint.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads a 64-bit value from global memory with L1 eviction, L2 256B cache hint. ```APIDOC ## ld.global.L1::evict_last.L2::cache_hint.L2::256B.b64 ### Description Loads a 64-bit value from global memory, evicting the last L1 cache line and providing a specific L2 256B cache hint. ### Signature ```cpp template = true> __device__ static inline B64 ld_L1_evict_last_L2_cache_hint_L2_256B( cuda::ptx::space_global_t, const B64* addr, uint64_t cache_policy); ``` ### Parameters * `cuda::ptx::space_global_t`: Specifies the global memory space. * `const B64* addr`: Pointer to the memory address to load from. * `uint64_t cache_policy`: Cache policy hint for L2 (256B). ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.b8 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads 8-bit data from global memory with L1 eviction and an L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.b8 ### Description Loads 8-bit data from global memory with L1 eviction and an L2 cache hint. ### PTX ISA 74, SM_80 ### Template ```cpp template = true> __device__ static inline B8 ld_L1_evict_first_L2_cache_hint( cuda::ptx::space_global_t, const B8* addr, uint64_t cache_policy); ``` ``` -------------------------------- ### Example Usage of cuda::bitmask in a CUDA Kernel Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/bit/bitmask.md Demonstrates the usage of `cuda::bitmask` within a CUDA kernel to create bitmasks. Asserts are used to verify the correctness of the generated bitmasks for different types and bit positions. Ensure that `start` and `width` satisfy the function's preconditions. ```cuda #include #include #include __global__ void bitmask_kernel() { assert(cuda::bitmask(2, 4) == 0b111100u); assert(cuda::bitmask(1, 3) == uint64_t{0b1110}); } int main() { bitmask_kernel<<<1, 1>>>(); cudaDeviceSynchronize(); return 0; } ``` -------------------------------- ### Define Temporary Installation Prefix Source: https://github.com/nvidia/cccl/blob/main/test/cmake/CMakeLists.txt Sets a temporary installation prefix within the build directory for testing against an installed project. ```cmake set(tmp_install_prefix "${CMAKE_CURRENT_BINARY_DIR}/test_install") ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.b16 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads 16-bit data from global memory with L1 eviction and an L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.b16 ### Description Loads 16-bit data from global memory with L1 eviction and an L2 cache hint. ### PTX ISA 74, SM_80 ### Template ```cpp template = true> __device__ static inline B16 ld_L1_evict_first_L2_cache_hint( cuda::ptx::space_global_t, const B16* addr, uint64_t cache_policy); ``` ``` -------------------------------- ### Launch STF Example Binary Source: https://github.com/nvidia/cccl/blob/main/docs/cudax/stf.md Example command to launch a compiled STF example binary. Binaries are located in the 'bin/' subdirectory after building. ```bash ./bin/cudax.cpp17.example.stf.01-axpy ``` -------------------------------- ### ld.global.L1::evict_first.L2::256B.b32 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 32-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::256B.b32 ### Description Loads 32-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ### Method __device__ static inline B32 ### Parameters - **cuda::ptx::space_global_t**: Specifies the memory space as global. - **const B32* addr**: Pointer to the memory address to load from. ### PTX ISA ld.space.L1::evict_first.L2::256B.b32 dest, [addr]; ### Notes - PTX ISA 74, SM_80 - .space = { .global } - Template parameter B32 requires sizeof(B32) == 4. ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.v4.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads data from global memory with L1 eviction first and L2 cache hint. Supports 256-bit vectors. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.v4.b64 ### Description Loads a 256-bit vector from global memory, evicting the first cache line and using an L2 cache hint. ### Method __device__ static inline B256 ld_L1_evict_first_L2_cache_hint(cuda::ptx::space_global_t, const B256* addr, uint64_t cache_policy) ### Parameters - `cuda::ptx::space_global_t`: Specifies the global memory space. - `const B256* addr`: Pointer to the source memory address. - `uint64_t cache_policy`: Cache policy hint for L2. ``` -------------------------------- ### Add CUDA C++ Example Source: https://github.com/nvidia/cccl/blob/main/cub/examples/block/CMakeLists.txt Use this snippet to add CUDA C++ examples to your project. It finds all example source files and adds them as targets. ```cmake file( GLOB_RECURSE example_srcs RELATIVE "${CMAKE_CURRENT_LIST_DIR}" CONFIGURE_DEPENDS example_*.cu ) foreach (example_src IN LISTS example_srcs) get_filename_component(example_name "${example_src}" NAME_WE) string( REGEX REPLACE "^example_block_" "block." example_name "${example_name}" ) cub_add_example(target_name ${example_name} "${example_src}") endforeach() ``` -------------------------------- ### Building Examples Source: https://github.com/nvidia/cccl/blob/main/docs/thrust/developer/cmake_options.md Enable or disable the building of examples. Defaults to ON. ```cmake cmake -DTHRUST_ENABLE_EXAMPLES=OFF ... ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.b128 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads 128-bit data from global memory with L1 eviction and an L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.b128 ### Description Loads 128-bit data from global memory with L1 eviction and an L2 cache hint. ### PTX ISA 83, SM_80 ### Template ```cpp template = true> __device__ static inline B128 ld_L1_evict_first_L2_cache_hint( cuda::ptx::space_global_t, const B128* addr, uint64_t cache_policy); ``` ``` -------------------------------- ### ld.global.nc.L1::evict_last.L2::cache_hint.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 64-bit data from global memory with L1 cache eviction to L2 (256B) and L2 cache hint. ```APIDOC ## ld.global.nc.L1::evict_last.L2::cache_hint.L2::256B.b64 ### Description Loads 64-bit data from global memory into a register, evicting the last accessed line from L1 cache to L2 (256B), with a specified cache policy. ### Signature ```cpp template = true> __device__ static inline B64 ld_nc_L1_evict_last_L2_cache_hint_L2_256B( cuda::ptx::space_global_t, const B64* addr, uint64_t cache_policy); ``` ### PTX ISA `ld.space.nc.L1::evict_last.L2::cache_hint.L2::256B.b64 dest, [addr], cache_policy;` (PTX ISA 74, SM_80) ``` -------------------------------- ### Process Code Generation STF Examples Source: https://github.com/nvidia/cccl/blob/main/cudax/examples/stf/CMakeLists.txt Conditionally processes STF examples that rely on code generation (parallel_for or launch) if CUDASTF_CODE_GENERATION is enabled. It adds these examples as executables. ```cmake set( stf_example_codegen_sources 01-axpy-launch.cu 01-axpy-parallel_for.cu binary_fhe.cu binary_fhe_stackable.cu 09-dot-reduce.cu cfd.cu custom_data_interface.cu fdtd_mgpu.cu fdtd_while.cu fdtd_repeat_n.cu frozen_data_init.cu graph_algorithms/degree_centrality.cu graph_algorithms/jaccard.cu graph_algorithms/pagerank.cu graph_algorithms/pagerank_batched.cu graph_algorithms/pagerank_while.cu graph_algorithms/tricount.cu graph_scope.cu heat.cu heat_mgpu.cu jacobi.cu jacobi_pfor.cu jacobi_stackable.cu jacobi_stackable_raii.cu jacobi_update_cond.cu launch_histogram.cu launch_scan.cu launch_sum.cu launch_sum_cub.cu linear_algebra/burger.cu linear_algebra/burger_sensitivity.cu linear_algebra/cg_csr.cu linear_algebra/cg_csr_stackable.cu logical_gates_composition.cu mandelbrot.cu parallel_for_2D.cu pi.cu scan.cu sqrt_newton_stackable.cu standalone-launches.cu word_count.cu word_count_reduce.cu ) if (cudax_ENABLE_CUDASTF_CODE_GENERATION) foreach (source IN LISTS stf_example_codegen_sources) cudax_add_stf_example(example_target "${source}") endforeach() endif() ``` -------------------------------- ### Add Install Test Fixture Source: https://github.com/nvidia/cccl/blob/main/test/cmake/CMakeLists.txt Defines a CMake test to perform an installation using the '--install' command, targeting a specified prefix. This test requires the 'install_tree' fixture. ```cmake add_test( NAME cccl.test.cmake.install_tree.install # gersemi: off COMMAND "${CMAKE_COMMAND}" --install "${CCCL_BINARY_DIR}" --prefix "${tmp_install_prefix}" # gersemi: on ) set_tests_properties( cccl.test.cmake.install_tree.install PROPERTIES FIXTURES_SETUP install_tree ) ``` -------------------------------- ### ld.global.L1::evict_last.L2::cache_hint.L2::256B.b32 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads a 32-bit value from global memory with L1 eviction, L2 256B cache hint. ```APIDOC ## ld.global.L1::evict_last.L2::cache_hint.L2::256B.b32 ### Description Loads a 32-bit value from global memory, evicting the last L1 cache line and providing a specific L2 256B cache hint. ### Signature ```cpp template = true> __device__ static inline B32 ld_L1_evict_last_L2_cache_hint_L2_256B( cuda::ptx::space_global_t, const B32* addr, uint64_t cache_policy); ``` ### Parameters * `cuda::ptx::space_global_t`: Specifies the global memory space. * `const B32* addr`: Pointer to the memory address to load from. * `uint64_t cache_policy`: Cache policy hint for L2 (256B). ``` -------------------------------- ### ld.global.nc.L1::evict_first.L2::256B.b8 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 8 bits (b8) from global memory with non-coherent caching, L1 evict first, and L2 256B cache hint. Requires SM 80. ```APIDOC ## ld.global.nc.L1::evict_first.L2::256B.b8 ### Description Loads 8 bits of data from a global memory address with non-coherent caching, an L1 evict first policy, and an L2 256B cache hint. ### Method __device__ static inline B8 ld_nc_L1_evict_first_L2_256B(cuda::ptx::space_global_t, const B8* addr) ### Parameters - `cuda::ptx::space_global_t`: Specifies the global memory space. - `const B8* addr`: Pointer to the memory address to load from. `B8` must be 1 byte. ### PTX ISA 74, SM_80 ``` -------------------------------- ### Install CCCL using CMake Source: https://github.com/nvidia/cccl/blob/main/README.md Clone the CCCL repository, navigate to the directory, and use CMake to configure and install CCCL. The default CMake options generate installation rules. ```bash git clone https://github.com/NVIDIA/cccl.git cd cccl cmake . -DCMAKE_INSTALL_PREFIX=/usr/local make install ``` -------------------------------- ### Add CUB Example Executable Source: https://github.com/nvidia/cccl/blob/main/cub/examples/CMakeLists.txt Defines a CMake function to add an example executable, register it with ctest, and configure its build settings. Use this to create new example targets. ```cmake function(cub_add_example target_name_var example_name example_src) # The actual name of the test's target: set(example_target cub.example.${example_name}) set(${target_name_var} ${example_target} PARENT_SCOPE) cccl_add_executable(${example_target} SOURCES "${example_src}" ADD_CTEST) cub_configure_cuda_target(${example_target} RDC ${CUB_FORCE_RDC}) target_link_libraries( ${example_target} PRIVATE # cub.compiler_interface cccl.c2h ) target_include_directories( ${example_target} PRIVATE "${CUB_SOURCE_DIR}/examples" ) endfunction() ``` -------------------------------- ### ld.global.L1::no_allocate.L2::cache_hint.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 64-bit data from global memory with L1 no-allocate and L2 256B cache hint. ```APIDOC ## ld.global.L1::no_allocate.L2::cache_hint.L2::256B.b64 ### Description Loads 64-bit data from global memory with L1 no-allocate and L2 256B cache hint. ### Method __device__ static inline B64 ### Parameters - **cuda::ptx::space_global_t**: Specifies the memory space as global. - **const B64* addr**: Pointer to the memory address to load from. - **uint64_t cache_policy**: Specifies the cache policy for L2. ### PTX ISA PTX ISA 74, SM_80 ``` -------------------------------- ### Avoid Calling get() on Temporary mdspan-to-DLPack Conversion Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/mdspan/mdspan_to_dlpack.md Calling `get()` on a temporary `cuda::to_dlpack_tensor` result is invalid because the temporary object is destroyed immediately after `get()` returns, leading to a dangling reference. The `DLTensor` object returned by `get()` must not outlive the `cuda::to_dlpack_tensor` return value. ```cuda #include #include #include void show_invalid_usage1() { using extents_t = cuda::std::extents; int data[6] = {0, 1, 2, 3, 4, 5}; cuda::host_mdspan md{data, extents_t{}}; // WRONG: calling get() on a temporary is deleted to prevent dangling references. // const DLTensor& dltensor = cuda::to_dlpack_tensor(md).get(); // compile error } ``` -------------------------------- ### ld.global.nc.L1::evict_last.L2::cache_hint.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 64-bit data from global memory with L1 cache eviction and L2 cache hint. ```APIDOC ## ld.global.nc.L1::evict_last.L2::cache_hint.b64 ### Description Loads 64-bit data from global memory into a register, evicting the last accessed line from L1 cache to L2, with a specified cache policy. ### Signature ```cpp template = true> __device__ static inline B64 ld_nc_L1_evict_last_L2_cache_hint( cuda::ptx::space_global_t, const B64* addr, uint64_t cache_policy); ``` ### PTX ISA `ld.space.nc.L1::evict_last.L2::cache_hint.b64 dest, [addr], cache_policy;` (PTX ISA 74, SM_80) ``` -------------------------------- ### Navigate to Example Directory Source: https://github.com/nvidia/cccl/blob/main/examples/basic/README.md Change into the example project directory after cloning the repository. ```bash cd cccl/examples/example_project ``` -------------------------------- ### ld.global.nc.L1::evict_first.L2::cache_hint.v4.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads a vector of four 64-bit elements from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint. ```APIDOC ## ld.global.nc.L1::evict_first.L2::cache_hint.v4.b64 ### Description Loads a vector of four 64-bit elements from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint. ### PTX ISA 88, SM_100 ### Template Signature ```cuda template = true> __device__ static inline B256 ld_nc_L1_evict_first_L2_cache_hint( cuda::ptx::space_global_t, const B256* addr, uint64_t cache_policy); ``` ``` -------------------------------- ### Install CCCL from PyPI with System CUDA Toolkit Source: https://github.com/nvidia/cccl/blob/main/docs/python/setup.md Installs cuda-cccl without the CUDA Toolkit, assuming a compatible CUDA toolkit is already installed and configured in the system's PATH and LD_LIBRARY_PATH. ```bash pip install cuda-cccl[sysctk13] # or cuda-cccl[sysctk12] ``` -------------------------------- ### ld.global.nc.L1::evict_first.L2::cache_hint.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads a 64-bit value from global memory with L1 eviction and L2 cache hint. ```APIDOC ## ld.global.nc.L1::evict_first.L2::cache_hint.b64 ### Description Loads a 64-bit value from global memory, employing L1 cache with 'evict_first' policy and an L2 cache hint. ### Parameters - `cuda::ptx::space_global_t`: Specifies the memory space as global. - `const B64* addr`: Pointer to the memory address to load from. - `uint64_t cache_policy`: Cache policy hint for L2 cache. ``` -------------------------------- ### Enabling Installation Rules Source: https://github.com/nvidia/cccl/blob/main/docs/thrust/developer/cmake_options.md Control whether installation rules for Thrust are generated. Defaults to ON. ```cmake cmake -DTHRUST_ENABLE_INSTALL_RULES=OFF ... ``` -------------------------------- ### ld.global.L1::evict_first.L2::cache_hint.L2::256B.b32 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 32 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80. ```APIDOC ## ld.global.L1::evict_first.L2::cache_hint.L2::256B.b32 ### Description Loads 32 bits from global memory with L1 eviction, an L2 cache hint, and a 256B L2 line size. Available from SM_80. ### Method __device__ static inline B32 ld_L1_evict_first_L2_cache_hint_L2_256B(cuda::ptx::space_global_t, const B32* addr, uint64_t cache_policy); ### Parameters - **cuda::ptx::space_global_t**: Specifies the memory space (global). - **const B32* addr**: Pointer to the memory location in global memory to load from. B32 must be 4 bytes in size. - **uint64_t cache_policy**: Specifies the cache behavior for L2. ``` -------------------------------- ### Install Plotting Dependencies Source: https://github.com/nvidia/cccl/blob/main/docs/cub/benchmarking.md Installs Python packages required for plotting benchmark results. ```bash pip install fpzip pandas matplotlib seaborn tabulate PyQt5 colorama ``` -------------------------------- ### ld.global.L1::evict_last.L2::256B.b8 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads 8 bits from global memory with L1 evict last and L2 256B hint. Requires SM_80 or higher. ```APIDOC ## ld.global.L1::evict_last.L2::256B.b8 ### Description Loads 8 bits from global memory with L1 evict last policy and a hint for L2 cache to use 256B line size. ### Parameters - `dest`: The destination register. - `addr`: The memory address to load from. ``` -------------------------------- ### ld.global.L1::evict_first.L2::256B.b8 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 8-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::256B.b8 ### Description Loads 8-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ### Method __device__ static inline B8 ### Parameters - **cuda::ptx::space_global_t**: Specifies the memory space as global. - **const B8* addr**: Pointer to the memory address to load from. ### PTX ISA ld.space.L1::evict_first.L2::256B.b8 dest, [addr]; ### Notes - PTX ISA 74, SM_80 - .space = { .global } - Template parameter B8 requires sizeof(B8) == 1. ``` -------------------------------- ### Example Usage Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/mdspan/restrict_accessor.md An example demonstrating the usage of `restrict_mdspan` in a CUDA kernel and `main` function. ```APIDOC #include using restrict_mdspan = cuda::restrict_mdspan>; __host__ __device__ void compute(restrict_mdspan a, restrict_mdspan b, restrict_mdspan c) { c[0] = a[0] * b[0]; c[1] = a[0] * b[0]; c[2] = a[0] * b[0] * a[1]; c[3] = a[0] * a[1]; c[4] = a[0] * b[0]; c[5] = b[0]; } int main() { using dim = cuda::std::dims<1>; using mdspan = cuda::std::mdspan; int arrayA[] = {1, 2}; int arrayB[] = {5}; int arrayC[] = {9, 10, 11, 12, 13, 14}; mdspan mdA{arrayA, dim{1}}; mdspan mdB{arrayB, dim{5}}; mdspan mdC{arrayC, dim{6}}; compute(mdA, mdB, mdC); using restrict_aligned_accesor = cuda::std::restrict_accessor>; using restrict_aligned_mdspan = cuda::std::mdspan; restrict_aligned_mdspan mdD{mdC}; } [See it on Godbolt 🔗](https://godbolt.org/z/Wjco996z8) ``` -------------------------------- ### ld.global.nc.L1::evict_last.L2::cache_hint.b32 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 32-bit data from global memory with L1 cache eviction and L2 cache hint. ```APIDOC ## ld.global.nc.L1::evict_last.L2::cache_hint.b32 ### Description Loads 32-bit data from global memory into a register, evicting the last accessed line from L1 cache to L2, with a specified cache policy. ### Signature ```cpp template = true> __device__ static inline B32 ld_nc_L1_evict_last_L2_cache_hint( cuda::ptx::space_global_t, const B32* addr, uint64_t cache_policy); ``` ### PTX ISA `ld.space.nc.L1::evict_last.L2::cache_hint.b32 dest, [addr], cache_policy;` (PTX ISA 74, SM_80) ``` -------------------------------- ### Install CCCL Python Package from Source Source: https://github.com/nvidia/cccl/blob/main/AGENTS.md Installs the CCCL Python package from source. This involves cloning the repository and using pip with an editable install, optionally specifying test CUDA Toolkit versions. ```bash git clone https://github.com/NVIDIA/cccl.git cd cccl/python/cuda_cccl pip install -e .[test-cu13] # or [test-cu12] for CTK 12.X ``` -------------------------------- ### ld.global.L1::evict_last.v4.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads a 256-bit vector from global memory with L1 eviction last. ```APIDOC ## ld.global.L1::evict_last.v4.b64 ### Description Loads a 256-bit vector from global memory, evicting the last cache line. ### Method __device__ static inline B256 ld_L1_evict_last(cuda::ptx::space_global_t, const B256* addr) ### Parameters - `cuda::ptx::space_global_t`: Specifies the global memory space. - `const B256* addr`: Pointer to the source memory address. ``` -------------------------------- ### Process Math Library STF Examples Source: https://github.com/nvidia/cccl/blob/main/cudax/examples/stf/CMakeLists.txt Conditionally processes STF examples that utilize math libraries like CUBLAS and CUSOLVER if CUDASTF_MATHLIBS is enabled. These examples are added as executables and linked with math libraries. ```cmake set( stf_example_mathlib_sources linear_algebra/06-pdgemm.cu linear_algebra/06-pdgemm-stackable.cu linear_algebra/07-cholesky.cu linear_algebra/07-potri.cu linear_algebra/cg_dense_2D.cu linear_algebra/strassen.cu ) if (cudax_ENABLE_CUDASTF_MATHLIBS) foreach (source IN LISTS stf_example_mathlib_sources) cudax_add_stf_example(example_target "${source}" LINK_MATHLIBS) endforeach() endif() ``` -------------------------------- ### Install CCCL Python Package from Conda-forge Source: https://github.com/nvidia/cccl/blob/main/AGENTS.md Installs the CCCL Python package from the conda-forge channel. ```bash conda install -c conda-forge cccl-python ``` -------------------------------- ### ld.global.L1::evict_last.L2::cache_hint.v4.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads a 256-bit vector (v4.b64) from global memory with L1 eviction and L2 cache hint. ```APIDOC ## ld.global.L1::evict_last.L2::cache_hint.v4.b64 ### Description Loads a 256-bit vector (four 64-bit elements) from global memory, evicting the last L1 cache line and providing an L2 cache hint. ### Signature ```cpp template = true> __device__ static inline B256 ld_L1_evict_last_L2_cache_hint( cuda::ptx::space_global_t, const B256* addr, uint64_t cache_policy); ``` ### Parameters * `cuda::ptx::space_global_t`: Specifies the global memory space. * `const B256* addr`: Pointer to the memory address to load from. * `uint64_t cache_policy`: Cache policy hint for L2. ``` -------------------------------- ### resource_ref Usage Example Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/extended_api/memory_resource/wrappers.md Example showing `cuda::mr::resource_ref` used with a specific property. ```APIDOC ## resource_ref `cuda::mr::resource_ref` is the non-owning, type-erased wrapper. Prefer it when the caller controls the resource lifetime. ```cpp struct required_alignment { using value_type = std::size_t; }; void* do_allocate_with_alignment(cuda::mr::resource_ref resource, cuda::stream_ref stream, std::size_t size) { return resource.allocate(stream, size, get_property(resource, required_alignment{})); } ``` However, the type erasure comes with the cost that arbitrary properties cannot be queried from either wrapper: ```cpp struct required_alignment { using value_type = std::size_t; }; void* buggy_allocate_with_alignment(cuda::mr::resource_ref<> resource, cuda::stream_ref stream, std::size_t size) { if constexpr (cuda::has_property) { // BUG: This will always be false return resource.allocate(stream, size, get_property(resource, required_alignment{})); } else { return resource.allocate(stream, size, my_default_alignment); } } ``` So, choose wisely. If your library has a well-defined set of fixed properties that you expect to always be available, then `cuda::mr::{synchronous_}resource_ref` is an amazing tool to improve compile times and binary size. If you need a flexible interface then constraining a template argument through `cuda::mr::{synchronous_}resource_with` is the proper solution. ``` -------------------------------- ### ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads a 64-bit value from global memory with L1 eviction, L2 cache hint, and a 256B L2 cache line size. ```APIDOC ## ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64 ### Description Loads a 64-bit value from global memory, using L1 cache with 'evict_first' policy and a 256-byte L2 cache line hint. ### Parameters - `cuda::ptx::space_global_t`: Specifies the memory space as global. - `const B64* addr`: Pointer to the memory address to load from. - `uint64_t cache_policy`: Cache policy hint for L2 cache. ``` -------------------------------- ### ld.global.L1::evict_first.L2::256B.b16 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads 16-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ```APIDOC ## ld.global.L1::evict_first.L2::256B.b16 ### Description Loads 16-bit data from global memory, evicting the first cache line from L1 and using a 256B L2 cache hint. ### Method __device__ static inline B16 ### Parameters - **cuda::ptx::space_global_t**: Specifies the memory space as global. - **const B16* addr**: Pointer to the memory address to load from. ### PTX ISA ld.space.L1::evict_first.L2::256B.b16 dest, [addr]; ### Notes - PTX ISA 74, SM_80 - .space = { .global } - Template parameter B16 requires sizeof(B16) == 2. ``` -------------------------------- ### CountingIterator Source: https://github.com/nvidia/cccl/blob/main/docs/python/compute_api.md Represents a sequence of incrementing values. The iterator starts at start and increments by 1 for each advance. ```APIDOC ## class cuda.compute.iterators.CountingIterator(start: number) Iterator representing a sequence of incrementing values. Similar to [thrust::counting_iterator](https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html). The iterator starts at start and increments by 1 for each advance. ### Example The code snippet below demonstrates the usage of a `CountingIterator` representing the sequence `[10, 11, 12]`: ```python """ Example showing how to use counting_iterator. """ import functools import cupy as cp import numpy as np import cuda.compute from cuda.compute import ( CountingIterator, OpKind, ) # Prepare the input and output arrays. first_item = 1 num_items = 100 # Create the counting iterator. first_it = CountingIterator(np.int32(first_item)) # Prepare the initial value for the reduction. h_init = np.array([0], dtype=np.int32) # Prepare the output array. d_output = cp.empty(1, dtype=np.int32) # Perform the reduction. cuda.compute.reduce_into( d_in=first_it, d_out=d_output, num_items=num_items, op=OpKind.PLUS, h_init=h_init ) # Verify the result. expected_output = functools.reduce( lambda a, b: a + b, range(first_item, first_item + num_items) ) assert (d_output == expected_output).all() print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})") ``` ### __init__(start: number) Create a counting iterator starting at start. * **Parameters:** **start** – The initial value (must be a numpy scalar) ``` -------------------------------- ### ld.global.nc.L1::evict_first.L2::cache_hint.b32 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/ld.md Loads a 32-bit value from global memory with L1 eviction and L2 cache hint. ```APIDOC ## ld.global.nc.L1::evict_first.L2::cache_hint.b32 ### Description Loads a 32-bit value from global memory, using L1 cache with an 'evict_first' policy and providing a hint to the L2 cache. ### Parameters - `cuda::ptx::space_global_t`: Specifies the memory space as global. - `const B32* addr`: Pointer to the memory address to load from. - `uint64_t cache_policy`: Cache policy hint for L2 cache. ``` -------------------------------- ### Set installation location for generated code Source: https://github.com/nvidia/cccl/blob/main/libcudacxx/codegen/CMakeLists.txt Defines the destination directory for installing the generated atomic functions header. ```cmake set( atomic_install_location "${libcudacxx_SOURCE_DIR}/include/cuda/std/__atomic/functions" ) ``` -------------------------------- ### Install CCCL using Convenience Script Source: https://github.com/nvidia/cccl/blob/main/README.md Use the provided convenience script to install CCCL to a specified directory. ```bash ci/install_cccl.sh /usr/local ``` -------------------------------- ### ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads data from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint, for 64-bit data and a 256B L2 cache line size. ```APIDOC ## ld.global.nc.L1::evict_first.L2::cache_hint.L2::256B.b64 ### Description Loads data from global memory with L1 cache eviction policy 'evict_first' and an L2 cache hint, for 64-bit data and a 256B L2 cache line size. ### PTX ISA 74, SM_80 ### Template Signature ```cuda template = true> __device__ static inline B64 ld_nc_L1_evict_first_L2_cache_hint_L2_256B( cuda::ptx::space_global_t, const B64* addr, uint64_t cache_policy); ``` ``` -------------------------------- ### Enabling Example FileCheck Validation Source: https://github.com/nvidia/cccl/blob/main/docs/thrust/developer/cmake_options.md Enable validation of example outputs using the LLVM FileCheck utility. Defaults to OFF. ```cmake cmake -DTHRUST_ENABLE_EXAMPLE_FILECHECK=ON ... ``` -------------------------------- ### ld.global.L1::evict_last.L2::256B.b64 Source: https://github.com/nvidia/cccl/blob/main/docs/libcudacxx/ptx/instructions/generated/ld.md Loads a 64-bit value from global memory with L1 eviction last and L2 256B cache hint. ```APIDOC ## ld.global.L1::evict_last.L2::256B.b64 ### Description Loads a 64-bit value from global memory, evicting the last cache line and using an L2 256B cache hint. ### Method __device__ static inline B64 ld_L1_evict_last_L2_256B(cuda::ptx::space_global_t, const B64* addr) ### Parameters - `cuda::ptx::space_global_t`: Specifies the global memory space. - `const B64* addr`: Pointer to the source memory address. ``` -------------------------------- ### Install CUB Tuning Dependencies Source: https://github.com/nvidia/cccl/blob/main/docs/cub/benchmarking.md Installs necessary Python packages for the tuning infrastructure. Ensure you are in a clean build directory. ```bash ninja clean pip install --user fpzip pandas scipy ```