### Build and Run Rust-CUDA Docker Container Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md These shell commands provide a quickstart for setting up a Docker environment for Rust-CUDA. They cover building the Docker image from your Rust-CUDA clone and then running a container with GPU access enabled, mounting the project directory, and dropping into a bash shell for development. ```Shell docker build -t rust-cuda $RUST_CUDA docker run -it --gpus all -v $RUST_CUDA:/root/rust-cuda --entrypoint /bin/bash rust-cuda ``` -------------------------------- ### Build Rust-CUDA Project with Cargo Source: https://github.com/rust-gpu/rust-cuda/blob/main/README.md This snippet demonstrates the necessary environment variable setup for the NVIDIA OptiX SDK and the command to build the Rust-CUDA project using the `cargo build` command. Ensure the `OPTIX_ROOT` and `OPTIX_ROOT_DIR` variables point to your OptiX installation. ```bash ## setup your environment like: ### export OPTIX_ROOT=/opt/NVIDIA-OptiX-SDK-9.0.0-linux64-x86_64 ### export OPTIX_ROOT_DIR=/opt/NVIDIA-OptiX-SDK-9.0.0-linux64-x86_64 ## build proj cargo build ``` -------------------------------- ### Configure Rust Toolchain for Rust-CUDA Development Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md This `rust-toolchain.toml` configuration specifies the exact Rust toolchain required for Rust-CUDA development. It mandates a specific nightly channel and includes essential components such as `rust-src` for source code, `rustc-dev` for compiler development utilities, and `llvm-tools-preview` for LLVM-related tools, ensuring compatibility and necessary features. ```TOML # If you see this, run `rustup self update` to get rustup 1.23 or newer. # NOTE: above comment is for older `rustup` (before TOML support was added), # which will treat the first line as the toolchain name, and therefore show it # to the user in the error, instead of "error: invalid channel name '[toolchain]'". [toolchain] channel = "nightly-2021-12-04" components = ["rust-src", "rustc-dev", "llvm-tools-preview"] ``` -------------------------------- ### Configure Cargo.toml for GPU Crate Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md Modify your `Cargo.toml` to specify `cdylib` and `rlib` crate types, which are necessary for NVPTX targets and for using the crate as a dependency. Also, add `cuda_std` as a dependency, replacing `XX` with the latest version. ```toml [package] name = "name" version = "0.1.0" edition = "2021" +[lib] +crate-type = ["cdylib", "rlib"] [dependencies] +cuda_std = "XX" ``` -------------------------------- ### Enable Allocator for GPU Kernels (Optional) Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md If your GPU kernels require dynamic memory allocation or features like printing (which depend on `alloc`), you must explicitly declare `extern crate alloc;` in your `lib.rs` to make the `alloc` crate available. ```rust extern crate alloc; ``` -------------------------------- ### Initial Setup for Unsafe OptiX API in Rust Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md This snippet provides the foundational setup for interacting with the unsafe OptiX API in Rust. It demonstrates the necessary steps to initialize CUDA and OptiX contexts, retrieve a device, and prepare basic device buffers for vertex and index data. This preparation is crucial before invoking low-level functions like `accel_compute_memory_usage` and `accel_build`. ```Rust use cust::prelude as cu; use optix::prelude as ox; # fn doit() -> Result<(), Box> { # cust::init(cu::CudaFlags::empty())?; # ox::init()?; # let device = cu::Device::get_device(0)?; # let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO | # cu::ContextFlags::MAP_HOST, device)?; # let ctx = ox::DeviceContext::new(&cu_ctx, false)?; # let vertices: Vec<[f32; 3]> = Vec::new(); # let indices: Vec<[u32; 3]> = Vec::new(); ``` -------------------------------- ### Add Initial Directives to lib.rs for GPU Compilation Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md Include essential attributes in your `lib.rs` file that are conditionally applied when compiling for GPU targets (`target_os = "cuda"`). These attributes declare the crate as `no_std` and register `nvvm_internal` for codegen functionality, along with importing `cuda_std`. ```rust #![cfg_attr( target_os = "cuda", no_std, register_attr(nvvm_internal) )] use cuda_std::*; ``` -------------------------------- ### Add cuda_builder Build Dependency Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md This snippet demonstrates how to declare `cuda_builder` as a build dependency in your `Cargo.toml` file. This is essential for the CPU crate that orchestrates the compilation of GPU kernels, ensuring the `cuda_builder` utility is available during the build process. ```TOML +[build-dependencies] +cuda_builder = "XX" ``` -------------------------------- ### Invoke cuda_builder in Rust Build Script Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md This Rust code shows how to integrate `cuda_builder` into your `build.rs` file. It initializes `CudaBuilder` with the path to your GPU crate, specifies an output path for the compiled PTX file using `.copy_to()`, and then executes the build process. The `.unwrap()` handles potential errors during compilation. ```Rust use cuda_builder::CudaBuilder; fn main() { CudaBuilder::new("path/to/gpu/crate/root") .copy_to("some/path.ptx") .build() .unwrap(); } ``` -------------------------------- ### Rust OptiX Device Context Initialization Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/context.md This Rust example demonstrates the essential steps to set up an NVIDIA OptiX 7 device context. It involves initializing the CUDA driver, creating a CUDA context for a specific device, and then instantiating the OptiX `DeviceContext` linked to the CUDA context. ```Rust # fn doit() -> Result<(), Box> { use optix::prelude as ox; use cust::prelude as cu; // Initialize cuda and optix cust::init(cu::CudaFlags::empty())?; ox::init()?; // Create a cuda context for the first device let device = cu::Device::get_device(0)?; let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO | cu::ContextFlags::MAP_HOST, device)?; // Create optix device context let ctx = ox::DeviceContext::new(&cu_ctx, false)?; # Ok(()) # } ``` -------------------------------- ### APIDOC: SRT Transformation Usage Examples Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md Illustrative examples demonstrating how to apply the SRT motion transform for specific scenarios. These include setting up rotations about the origin, rotations around a custom pivot point, and scaling operations relative to a pivot point, showing the necessary mathematical adjustments. ```APIDOC Example 1 - Rotation about the origin: Q = [ 0, 0, sin(pi/6), cos(pi/6) ] (60-degree rotation about z-axis) Example 2 - Rotation about a pivot point: S' = P^-1 * S T' = T * P C = T' * R * S' Example 3 - Scaling about a pivot point: P'x = Px + (-Sx * Gx + Gx) P'y = Py + (-Sy * Gy + Gy) P'z = Pz + (-Sz * Gz + Gz) ``` -------------------------------- ### SBT Geometry-AS Index Mapping for Single-Record Build Inputs Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md This example illustrates the mapping between SBT geometry acceleration structure indices and build inputs when each build input references exactly one SBT record. The index directly corresponds to the build input's order. ```APIDOC
SBT Geometry-AS Index012
Geometry-AS build inputbuild_input[0]
built_input[1]
built_input[2]
``` -------------------------------- ### SBT Geometry-AS Index Mapping with Per-Primitive Offsets Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md This example details how per-primitive SBT index offsets (`sbt_index_offset_buffer`) are applied. These offsets are local to each build input and are combined with the base SBT geometry acceleration structure index to determine the final SBT index. ```APIDOC
SBT Geometry-AS Index0123456
build_input[0].sbt_index_offset:[0]
[1]
[2]
[3]
build_input[1].sbt_index_offset=None
build_input[1].sbt_index_offset:[0]
[1]
``` -------------------------------- ### Rust-CUDA OptiX Acceleration Structure Build Example Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md This Rust code snippet demonstrates the complete workflow for building an OptiX acceleration structure (AS). It covers initializing device buffers for vertices and indices, defining build inputs with geometry flags, setting acceleration build options, computing memory requirements, allocating temporary and output buffers, and finally executing the asynchronous AS build operation followed by stream synchronization. ```Rust # let stream = cu::Stream::new(cu::StreamFlags::DEFAULT, None)?; let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?; let buf_indices = cu::DeviceBuffer::from_slice(&indices)?; let geometry_flags = ox::GeometryFlags::None; let build_inputs = [ox::IndexedTriangleArray::new( &[&buf_vertex], &buf_indices, &[geometry_flags] )]; let accel_options = ox::AccelBuildOptions::new( ox::BuildFlags::ALLOW_COMPACTION, ox::BuildOperation::Build ); // Get the storage requirements for temporary and output buffers let sizes = accel_compute_memory_usage(ctx, accel_options, build_inputs)?; // Allocate temporary and output buffers let mut output_buffer = unsafe { DeviceBuffer::::uninitialized(sizes.output_size_in_bytes)? }; let mut temp_buffer = unsafe { DeviceBuffer::::uninitialized(sizes.temp_size_in_bytes)? }; // Build the accel let hnd = unsafe { accel_build( ctx, stream, accel_options, build_inputs, &mut temp_buffer, &mut output_buffer, &mut properties, )? }; // The accel build is asynchronous stream.synchronize()?; # Ok(()) # } ``` -------------------------------- ### Initial Stack Size Estimates for Simple Path Tracer Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md These formulas provide initial estimates for direct callable and continuation stack sizes in a basic path tracer setup. They account for ray generation, miss, and closest-hit programs, assuming a maximum trace depth of two. ```Rust let direct_callable_stack_size_from_traversable = max_dc_depth * dss_dc; let direct_callable_stack_size_from_state = max_dc_depth * dss_dc; let continuation_stack_size = css_rg + 2 * css_ch1.max(css_ch2).max(css_ms1).max(css_ms2); ``` -------------------------------- ### Rust CUDA Kernel for Element-wise Vector Addition Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md This `#[kernel]` annotated Rust function `add` performs element-wise addition of two `f32` input slices (`a` and `b`), storing the result in a mutable `f32` pointer `c`. It leverages `cuda_std::thread::index_1d` to calculate a unique global thread index, ensuring operations are within array bounds. Key considerations include the `unsafe` declaration due to evolving Rust GPU semantics and the use of raw pointers (`*mut f32`) to prevent unsound aliasing issues inherent with `&mut` references in a multi-threaded GPU context. ```Rust #[kernel] pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) { let idx = thread::index_1d() as usize; if idx < a.len() { let elem = &mut *c.add(idx); *elem = a[idx] + b[idx]; } } ``` -------------------------------- ### Rust OptiX Acceleration Structure Compaction Example Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md This Rust code snippet demonstrates the process of building and optionally compacting an OptiX acceleration structure. It shows how to set build flags for compaction, compute memory requirements, allocate buffers, perform the build, and then conditionally compact the structure based on the computed compacted size. It uses `cust` for CUDA operations and `optix` for OptiX API calls. ```Rust use cust::prelude as cu; use optix::prelude as ox; let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?; let buf_indices = cu::DeviceBuffer::from_slice(&indices)?; let geometry_flags = ox::GeometryFlags::None; let build_inputs = [ox::IndexedTriangleArray::new( &[&buf_vertex], &buf_indices, &[geometry_flags] )]; let accel_options = ox::AccelBuildOptions::new( ox::BuildFlags::ALLOW_COMPACTION, ox::BuildOperation::Build ); // Get the storage requirements for temporary and output buffers let sizes = accel_compute_memory_usage(ctx, accel_options, build_inputs)?; // Allocate temporary and output buffers let mut output_buffer = unsafe { DeviceBuffer::::uninitialized(sizes.output_size_in_bytes)? }; let mut temp_buffer = unsafe { DeviceBuffer::::uninitialized(sizes.temp_size_in_bytes)? }; // Build the accel let hnd = unsafe { accel_build( ctx, stream, accel_options, build_inputs, &mut temp_buffer, &mut output_buffer, &mut properties, )? }; stream.synchronize()?; let mut compacted_size = 0usize; compacted_size_buffer.copy_to(&mut compacted_size)?; let accel = if compacted_size < sizes.output_size_in_bytes { let mut buf = unsafe { DeviceBuffer::::uninitialized(compacted_size)? }; let hnd = unsafe { accel_compact(ctx, stream, hnd, &mut buf)? }; stream.synchronize()?; Accel::from_raw_parts(buf, hnd); } else { Accel::from_raw_parts(output_buffer, hnd) }; ``` -------------------------------- ### Example NVVM IR Module Linking with libdevice Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt This pseudo-code demonstrates the initial steps for linking an NVVM IR module with the libdevice library using the libnvvm API. It shows the declaration of a program object and a variable to store the libdevice module size, indicating where the library would be loaded or referenced during the compilation process. ```C++ nvvmProgram prog; size_t libdeviceModSize; ``` -------------------------------- ### Access SBT Data in Closest Hit Program Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md This example demonstrates how to cast the pointer returned by `optixGetSbtDataPointer` to a custom struct (`CHData`) to access material information associated with the SBT record in a closest hit program, allowing for dynamic material properties based on the SBT entry. ```C++ struct CHData { int meshIdx; // Triangle mesh build input index float3 base_color; }; CHData* material_info = (CHData*)optixGetSbtDataPointer(); ``` -------------------------------- ### SBT Geometry-AS Index Mapping for Multi-Record Build Inputs Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md This example demonstrates how SBT geometry acceleration structure indices are assigned when build inputs reference multiple SBT records. The mapping follows a prefix sum of the number of records, allowing a single build input to span multiple SBT indices. ```APIDOC
SBT Geometry-AS Index0123456
Geometry-AS build inputbuild_input[0] num=4
build_input[1] num=1
build_input[2] offset=2
``` -------------------------------- ### Create Program Group for Hitgroup in Rust Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md This example illustrates the creation of a `ProgramGroup` object in Rust, specifically for a hitgroup. It involves instantiating a `Module` and then defining the `ProgramGroupDesc` with entry function names for closest-hit and any-hit shaders. Program groups are essential for filling Shader Binding Table (SBT) records. ```Rust let (module, _log) = Module::new( &mut ctx, &module_compile_options, &pipeline_compile_options, ptx, )?; let pgdesc_hitgroup = ProgramGroupDesc::hitgroup( Some((&module, "__closesthit__radiance")), Some((&module, "__anyhit__radiance")), None, ); let (pg_hitgroup, _log) = ProgramGroup::new(&mut ctx, &[pgdesc_hitgroup])?; ``` -------------------------------- ### CUDA C++ Kernel Example for Buffer Manipulation Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/faq.md This snippet demonstrates a basic CUDA C++ program that launches a kernel to copy data between two integer buffers on the GPU. It highlights common pitfalls and areas where errors can occur due to CUDA's lack of inherent safety checks, such as buffer overflows, incorrect kernel launches, unhandled API errors, and memory management issues. ```cpp __global__ void kernel(int* buf, int* other) { int idx = threadIdx.x; buf[idx] = other[idx]; } int main(void) { int N = 50; int* a, b, d_a, d_b; a = (int*)malloc(N*sizeof(int)); b = (int*)malloc(N*sizeof(int)); cudaMalloc(&d_a, N*sizeof(int)); cudaMalloc(&d_b, N*sizeof(int)); for (int i = 0; i < N; i++) { a[i] = 0.0f; b[i] = 2.0f; } cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyHostToDevice); kernel<<<1, N>>>(d_a, d_b); cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyDeviceToHost); cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyDeviceToHost); /* do something with the data */ cudaFree(d_a); cudaFree(d_b); free(a); free(b); } ``` -------------------------------- ### nvcc and nvrtc Compilation Requirements for OptiX Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md This section outlines critical compilation requirements and recommended options for `nvcc` and `nvrtc` when generating PTX for NVIDIA OptiX. It covers aspects like target SM version, 64-bit code, output format, debugging flags, relocatable device code, fast math, and profiling setup. ```APIDOC nvcc/nvrtc Compilation Requirements: - SM Target Compatibility: Description: Input PTX program's SM target must be less than or equal to the GPU's SM version. Option: --gpu-architecture=compute_X.X (e.g., compute_50 for Maxwell) Note: OptiX rewrites code internally, so targets will work on any newer GPU. Warning Suppression: -Wno-deprecated-gpu-targets (for SM 5.0 deprecation). - 64-bit Code: Required Option: --machine=64 (-m64) Description: Only 64-bit code is supported in OptiX. - Output Type: Required Option: --ptx Description: Do not compile to obj or cubin. - Debug Flags: Avoid Options: -g, -G Description: OptiX might not handle all debugging instrumentation. Important when using Microsoft Visual Studio CUDA integration. - Relocatable Device Code: Required Option: --relocatable-device-code=true (-rdc) Alternative (nvcc only): --keep-device-functions Description: Prevents CUDA compiler from eliminating direct or continuation callables as dead code. - Fast Math: Recommended Option: --use_fast_math Description: Enables .approx instructions for trigonometric functions and reciprocals, avoiding inadvertent use of slow double-precision floats. Recommended for performance unless higher precision is required. - Profiling (Nsight Compute): Required Options: --generate-line-info Application Host Code Setting: debug_level = CompileDebugLevel::LineInfo (in ModuleCompileOptions and PipelineLinkOptions) Description: Enables line information for profiling with Nsight Compute. ``` -------------------------------- ### Rust-CUDA Stream Synchronization Example Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/safety.md This Rust code snippet demonstrates the importance of explicit synchronization when interacting with GPU kernels from the CPU. It illustrates that a kernel launched on a stream is not guaranteed to be finished immediately, and CPU code accessing shared memory must call `stream.synchronize()` to ensure the kernel's completion before safely accessing the data modified by the GPU. ```rs launch!(module.bar<<<1, 1, 0, stream>>>(foo.as_unified_ptr()))?; // 'bar' is not guaranteed to be finished executing at this point. function_that_accesses_foo(foo); stream.synchronize()?; // foo may be accessed and will see the changes that 'bar' wrote to 'foo'. 'bar' is guaranteed // to be finished executing. ``` -------------------------------- ### Allow Improper C Types Definitions for GPU Kernels (Optional) Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md To use types such as slices or arrays inside GPU kernels, you must allow `improper_ctypes_definitions`. This is because `rustc_codegen_nvvm` guarantees their passing convention, unlike standard C ABI rules for `extern "C"` functions. ```rust #![allow(improper_ctypes_definitions)] ``` -------------------------------- ### Correct Rust Kernel Launch with Slice Parameters Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/kernel_abi.md Demonstrates the correct way to launch a GPU kernel from Rust when passing a slice, by explicitly providing the device pointer and the length as separate arguments. ```Rust let mut buf = [5u8; 10].as_dbuf()?; unsafe { launch!( module.kernel<<<1, 1, 0, stream>>>(buf.as_device_ptr(), buf.len()) )?; } ``` -------------------------------- ### Correct Rust Kernel Launch with Device Pointer Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/kernel_abi.md Illustrates the correct way to launch a GPU kernel from Rust when passing a reference, by providing a device pointer to the allocated memory. ```Rust let mut val = DeviceBox::new(&5)?; unsafe { launch!( module.kernel<<<1, 1, 0, stream>>>(val.as_device_ptr()) )?; } ``` -------------------------------- ### OptiX DeviceContext API Reference Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/context.md This section provides an overview of key methods and properties available on the `DeviceContext` object in the OptiX Rust bindings, including context creation, property querying, logging configuration, and compilation cache management. ```APIDOC DeviceContext: new(cuda_context: &cust::Context, enable_debug: bool) -> Result Purpose: Creates a new NVIDIA OptiX 7 device context. Parameters: cuda_context: The associated CUDA context. enable_debug: Boolean to enable debug mode. get_property(property_type: PropertyType) -> Result Purpose: Queries various properties of the context, such as max trace depth or max traversable graph depth. Parameters: property_type: The specific property to query. Returns: The value of the queried property. set_log_callback(callback: F) Purpose: Specifies a closure for logging messages from OptiX. Parameters: callback: A closure with signature `FnMut(u32, &str, &str) + 'static`. u32: Log level (0=disable, 1=fatal, 2=error, 3=warning, 4=print). &str: Message category description (e.g., "SCENE STAT"). &str: The log message itself. set_cache_enabled(enabled: bool) -> Result<(), Error> Purpose: Enables or disables the compilation cache for the device context. Parameters: enabled: `true` to enable, `false` to disable. get_cache_enabled() -> bool Purpose: Queries the current status of the compilation cache for the device context. Returns: `true` if caching is enabled, `false` otherwise. ``` -------------------------------- ### Embed PTX as Static String in Rust Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md This Rust snippet illustrates how to statically embed the compiled PTX (Parallel Thread Execution) file directly into your Rust application. Using `include_str!`, the content of the specified PTX file is loaded at compile time as a string literal, making it available for runtime execution by libraries like `cust`. ```Rust static PTX: &str = include_str!("some/path.ptx"); ``` -------------------------------- ### NVIDIA OptiX 7 API Core Concepts and Program Model Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/introduction.md Detailed overview of the NVIDIA OptiX 7 API, covering its core characteristics, context management, multi-GPU support, and the fundamental programming model concepts including various program types that form the ray tracing pipeline. ```APIDOC NVIDIA OptiX 7 API: Overview: - CUDA-centric API, invoked by CUDA-based applications. - Designed to be stateless, multi-threaded, and asynchronous. - Provides explicit control over performance-sensitive operations (memory management, shader compilation). - Supports lightweight scene representation (instancing, vertex/transform-based motion blur, built-in triangles/swept curves, user-defined primitives). - Includes highly-tuned kernels and neural networks for machine-learning-based denoising. - Context: Controls a single GPU, manages handle objects (small host memory), automatically released on destruction. - Asynchronous Operations: All API functions employ CUDA streams and invoke GPU functions asynchronously. - Multi-GPU: Capabilities (load balancing, NVLINK memory sharing) must be handled by the application developer. - Execution Model: Allows task movement across lanes/warps/SMs; applications cannot use shared memory, synchronization, barriers, or other SM-thread-specific constructs. - Future-proofing: API allows existing programs to use new NVIDIA hardware features. Basic concepts and definitions: Program: - A block of executable code on the GPU. - Represents a particular shading operation (called a shader in DXR/Vulkan). - Programmable component in the system, capable of more than shading. Program and Data Model: Implements a single-ray programming model with eight types of programs forming the ray tracing pipeline: - Ray generation (RG): Entry point into the ray tracing pipeline, invoked in parallel for each pixel, sample, or user-defined work assignment. - Intersection (IS): Implements a ray-primitive intersection test, invoked during traversal. - Any-hit (AH): Called when a traced ray finds a new, potentially closest, intersection point (e.g., for shadow computation). - Closest-hit (CH): Called when a traced ray finds the closest intersection point (e.g., for material shading). - Miss: Called when a traced ray misses all scene geometry. - Exception: Handler invoked for conditions such as stack overflow and other errors. - Direct callables: Similar to a regular CUDA function call, called immediately. - Continuation callables: Unlike direct callables, executed by the scheduler. Pipeline: Based on the interconnected calling structure of these eight programs and their relationship to the search through geometric data (traversal). ``` -------------------------------- ### Build and Run Rust-CUDA Docker Container Source: https://github.com/rust-gpu/rust-cuda/blob/main/README.md These commands illustrate how to build a Docker image for Rust-CUDA using a provided Dockerfile (e.g., for Ubuntu 24.04) and then run a container from that image. The `docker run` command includes `--runtime=nvidia` and `--gpus all` to enable GPU access within the container. ```bash docker build -f ./container/ubuntu24/Dockerfile -t rust-cuda-ubuntu24 . docker run --rm --runtime=nvidia --gpus all -it rust-cuda-ubuntu24 ``` -------------------------------- ### Document __nv_sin CUDA libdevice function Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates the sine of the input argument `x`, which is measured in radians. Refer to the CUDA C Programming Guide for accuracy information. ```LLVM IR Function: __nv_sin Prototype: double @__nv_sin(double %x) Description: Calculate the sine of the input argument x (measured in radians). Parameters: %x (double): The angle in radians. Returns: double • __nv_sin(0) returns 0. • __nv_sin(infinity) returns NaN. Library Availability: Compute 2.0: Yes Compute 3.0: Yes Compute 3.5: Yes ``` -------------------------------- ### Compile NVVM Program with Libdevice and Custom IR Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt This C/C++ code demonstrates the process of compiling an NVVM program. It involves loading the libdevice library and custom NVVM IR, adding them to an NVVM program object, and then compiling the program with specified options. The `loadFile` function is a placeholder for client-side logic to retrieve the libdevice binary. ```C const char *libdeviceMod = loadFile('/path/to/libdevice.*.bc', &libdeviceModSize); const char *myIr = /* NVVM IR in text or binary format */; size_t myIrSize = /* size of myIr in bytes */; // Create NVVM program object nvvmCreateProgram(&prog); // Add libdevice module to program nvvmAddModuleToProgram(prog, libdeviceMod, libdeviceModSize); // Add custom IR to program nvvmAddModuleToProgram(prog, myIr, myIrSize); // Declare compile options const char *options[] = { "-ftz=1" }; // Compile the program nvvmCompileProgram(prog, 1, options); ``` -------------------------------- ### __nv_logbf Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates the floating point representation of the exponent of the input argument 'x'. For accuracy information, refer to the CUDA C Programming Guide, Appendix D.1, Table 6. ```APIDOC Prototype: float @__nv_logbf(float %x) Returns: - __nv_logbf(0) returns (negative infinity). - __nv_logbf(infinity) returns (positive infinity). ``` -------------------------------- ### __nv_logb Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates the floating point representation of the exponent of the input argument 'x'. For accuracy information, refer to the CUDA C Programming Guide, Appendix D.1, Table 7. ```APIDOC Prototype: double @__nv_logb(double %x) Returns: - __nv_logb(0) returns (negative infinity). - __nv_logb(infinity) returns (positive infinity). ``` -------------------------------- ### Libdevice __nv_cospi Function Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates the cosine of x (measured in radians). Returns 1 for specific inputs and NaN for others. Refer to CUDA C Programming Guide for accuracy information. Available for Compute 2.0, 3.0, and 3.5. ```APIDOC Function: __nv_cospi Prototype: double @__nv_cospi(double %x) Description: Calculate the cosine of x (measured in radians). Parameters: %x (double): The input argument. Returns: double: - 1 for specific inputs. - NaN for other inputs. Notes: For accuracy information, see CUDA C Programming Guide, Appendix D.1, Table 7. Availability: Compute 2.0: Yes Compute 3.0: Yes Compute 3.5: Yes ``` -------------------------------- ### Libdevice __nv_dmul_rd Function Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Multiplies two double-precision floating point values x and y, rounding the result down (towards negative infinity). Refer to CUDA C Programming Guide for accuracy information. Available for Compute 2.0, 3.0, and 3.5. ```APIDOC Function: __nv_dmul_rd Prototype: double @__nv_dmul_rd(double %x, double %y) Description: Multiplies two floating point values x and y in round-down (to negative infinity) mode. Parameters: %x (double): First floating point value. %y (double): Second floating point value. Returns: double: The product of x * y. Notes: For accuracy information, see CUDA C Programming Guide, Appendix D.1, Table 7. Availability: Compute 2.0: Yes Compute 3.0: Yes Compute 3.5: Yes ``` -------------------------------- ### Correct Rust Kernel Launch with Struct by Value Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/kernel_abi.md Illustrates the correct way to launch a GPU kernel from Rust, passing a struct directly by value as expected by the CUDA/PTX ABI for structs. ```Rust let foo = Foo { a: 5, b: 6, c: 7 }; unsafe { launch!( module.kernel<<<1, 1, 0, stream>>>(foo) )?; } ``` -------------------------------- ### Rust OptiX Shader Binding Table Record Packing Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/introduction.md This Rust code demonstrates how to define a custom data structure, `HitgroupSbtData`, and then pack it into an `SbtRecord` for use with NVIDIA OptiX 7. It utilizes `cust::prelude` for device memory operations and `optix::prelude` for OptiX-specific types, illustrating the process of preparing data records for the shader binding table. ```Rust use cust::prelude as cu; use optix::prelude as ox; #[derive(Copy, Clone, Default, cu::DeviceCopy)] struct HitgroupSbtData { object_id: u32, } type HitgroupRecord = ox::SbtRecord; let rec_hitgroup: Vec<_> = (0..num_objects) .map(|i| { let object_type = 0; let rec = HitgroupRecord::pack( HitgroupSbtData { object_id: i }, &pg_hitgroup[object_type], ) .expect("failed to pack hitgroup record"); rec }) .collect(); ``` -------------------------------- ### Document __nv_sincos CUDA libdevice function Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates both the sine and cosine of the input argument `x` (measured in radians). The results are written to the provided pointers `sptr` (for sine) and `cptr` (for cosine). Refer to the CUDA C Programming Guide for accuracy information. ```LLVM IR Function: __nv_sincos Prototype: void @__nv_sincos(double %x, double* %sptr, double* %cptr) Description: Calculate the sine and cosine of the first input argument x (measured in radians). The results for sine and cosine are written into the second argument, sptr, and, respectively, third argument, cptr. Parameters: %x (double): The angle in radians. %sptr (double*): Pointer to store the calculated sine value. %cptr (double*): Pointer to store the calculated cosine value. Returns: void • none Library Availability: Compute 2.0: Yes Compute 3.0: Yes Compute 3.5: Yes ``` -------------------------------- ### Libdevice __nv_cospif Function Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates the cosine of x (measured in radians) for single-precision floats. Returns 1 for specific inputs and NaN for others. Refer to CUDA C Programming Guide for accuracy information. Available for Compute 2.0, 3.0, and 3.5. ```APIDOC Function: __nv_cospif Prototype: float @__nv_cospif(float %x) Description: Calculate the cosine of x (measured in radians). Parameters: %x (float): The input argument. Returns: float: - 1 for specific inputs. - NaN for other inputs. Notes: For accuracy information, see CUDA C Programming Guide, Appendix D.1, Table 6. Availability: Compute 2.0: Yes Compute 3.0: Yes Compute 3.5: Yes ``` -------------------------------- ### Build OptiX Geometry Acceleration Structure (GAS) in Rust Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md This Rust snippet demonstrates the process of setting up and building a Geometry Acceleration Structure (GAS) using the `cust` and `optix` crates. It covers allocating device buffers for vertex and index data, defining a triangle mesh input, configuring acceleration build options for fast ray tracing, and asynchronously initiating the GAS build operation. ```Rust use cust::prelude as cu; use optix::prelude as ox; # fn doit() -> Result<(), Box> { # cust::init(cu::CudaFlags::empty())?; # ox::init()?; # let device = cu::Device::get_device(0)?; # let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO | # cu::ContextFlags::MAP_HOST, device)?; # let ctx = ox::DeviceContext::new(&cu_ctx, false)?; # let vertices: Vec<[f32; 3]> = Vec::new(); # let indices: Vec<[u32; 3]> = Vec::new(); # let stream = cu::Stream::new(cu::StreamFlags::DEFAULT, None)?; // Allocate buffers and copy vertex and index data to device let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?; let buf_indices = cu::DeviceBuffer::from_slice(&indices)?; // Tell OptiX the structure of our triangle mesh let geometry_flags = ox::GeometryFlags::None; let triangle_input = ox::IndexedTriangleArray::new( &[&buf_vertex], &buf_indices, &[geometry_flags] ); // Tell OptiX we'd prefer a faster traversal over a faster bvh build. let accel_options = AccelBuildOptions::new(ox::BuildFlags::PREFER_FAST_TRACE); // Build the accel asynchronously let gas = ox::Accel::build( &ctx, &stream, &[accel_options], &[triangle_input], true )?; # Ok(()) # } ``` -------------------------------- ### Count Leading Zeros (64-bit integer) Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Counts the number of consecutive leading zero bits, starting at the most significant bit (bit 63) of the input 64-bit integer. Returns a value between 0 and 64 inclusive representing the number of zero bits. ```APIDOC __nv_clzll(x: i64) -> i32 x: The input 64-bit integer. ``` -------------------------------- ### Compile CUDA Source to PTX for OptiX using nvcc Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md This command demonstrates how to compile a CUDA source file (`myprogram.cu`) into a PTX file (`myprogram.ptx`) suitable for NVIDIA OptiX 7. It includes flags for specifying the OptiX SDK include path and enabling fast math optimizations. ```bash nvcc -ptx -Ipath-to-optix-sdk/include --use_fast_math myprogram.cu -o myprogram.ptx ``` -------------------------------- ### Count Leading Zeros (32-bit integer) Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Counts the number of consecutive leading zero bits, starting at the most significant bit (bit 31) of the input 32-bit integer. Returns a value between 0 and 32 inclusive representing the number of zero bits. ```APIDOC __nv_clz(x: i32) -> i32 x: The input 32-bit integer. ``` -------------------------------- ### Document __nv_sincosf CUDA libdevice function Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates both the sine and cosine of the single-precision input argument `x` (measured in radians). The results are written to the provided pointers `sptr` (for sine) and `cptr` (for cosine). Refer to the CUDA C Programming Guide for accuracy information. ```LLVM IR Function: __nv_sincosf Prototype: void @__nv_sincosf(float %x, float* %sptr, float* %cptr) Description: Calculate the sine and cosine of the first input argument x (measured in radians). The results for sine and cosine are written into the second argument, sptr, and, respectively, third argument, cptr. Parameters: %x (float): The angle in radians. %sptr (float*): Pointer to store the calculated sine value. %cptr (float*): Pointer to store the calculated cosine value. Returns: void • none Library Availability: Compute 2.0: Yes Compute 3.0: Yes Compute 3.5: Yes ``` -------------------------------- ### Configure CMake Project for GPU Development Tools Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/examples/common/gdt/CMakeLists.txt This CMake configuration block initializes the 'GPU_Development_Tools' project, specifies the minimum required CMake version (3.5), sets the C++ standard to C++11, includes the current source directory for header resolution, and defines a static library named 'gdt'. The 'gdt' library is built from a combination of CMake configuration files and C++ source/header files. ```CMake project(GPU_Development_Tools) cmake_minimum_required(VERSION 3.5) set(CMAKE_CXX_STANDARD 11) include_directories(${CMAKE_CURRENT_SOURCE_DIR}) add_library(gdt cmake/configure_build_type.cmake cmake/configure_optix.cmake cmake/FindOptiX.cmake gdt/gdt.h gdt/math/LinearSpace.h gdt/math/AffineSpace.h gdt/gdt.cpp ) ``` -------------------------------- ### __nv_logf Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt Calculates the natural logarithm of the input argument 'x'. This function returns specific values for various inputs like 0, 1, negative x, and infinity. For accuracy information, refer to the CUDA C Programming Guide, Appendix D.1, Table 6. ```APIDOC Prototype: float @__nv_logf(float %x) Returns: - __nv_logf(0) returns (negative infinity). - __nv_logf(1) returns +0. - __nv_logf(x) returns NaN for x < 0. - __nv_logf(infinity) returns (infinity). ``` -------------------------------- ### Building OptiX Acceleration Structures with Safe Rust API Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md This snippet demonstrates the use of `Accel::build()` for creating acceleration structures. This method simplifies the process by automatically handling memory allocation and synchronization, making it ideal for rapid prototyping. However, it may incur overhead due to reallocating temporary storage and synchronizing after each build, which can be less efficient for batch operations. ```Rust use cust::prelude as cu; use optix::prelude as ox; # fn doit() -> Result<(), Box> { # cust::init(cu::CudaFlags::empty())?; # ox::init()?; # let device = cu::Device::get_device(0)?; # let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO | # cu::ContextFlags::MAP_HOST, device)?; # let ctx = ox::DeviceContext::new(&cu_ctx, false)?; # let vertices: Vec<[f32; 3]> = Vec::new(); # let indices: Vec<[u32; 3]> = Vec::new(); # let stream = cu::Stream::new(cu::StreamFlags::DEFAULT, None)?; let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?; let buf_indices = cu::DeviceBuffer::from_slice(&indices)?; let geometry_flags = ox::GeometryFlags::None; let triangle_input = ox::IndexedTriangleArray::new( &[&buf_vertex], &buf_indices, &[geometry_flags] ); let accel_options = ox::AccelBuildOptions::new( ox::BuildFlags::ALLOW_COMPACTION, ox::BuildOperation::Build ); let build_inputs = vec![triangle_input]; let gas = ox::Accel::build( &ctx, &stream, &[accel_options], &build_inputs, true )?; stream.synchronize()?; # Ok(()) # } ```