### Build and Run Rust-CUDA Docker Container

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

These shell commands provide a quickstart for setting up a Docker environment for Rust-CUDA. They cover building the Docker image from your Rust-CUDA clone and then running a container with GPU access enabled, mounting the project directory, and dropping into a bash shell for development.

```Shell
docker build -t rust-cuda $RUST_CUDA
docker run -it --gpus all -v $RUST_CUDA:/root/rust-cuda --entrypoint /bin/bash rust-cuda
```

--------------------------------

### Build Rust-CUDA Project with Cargo

Source: https://github.com/rust-gpu/rust-cuda/blob/main/README.md

This snippet demonstrates the necessary environment variable setup for the NVIDIA OptiX SDK and the command to build the Rust-CUDA project using the `cargo build` command. Ensure the `OPTIX_ROOT` and `OPTIX_ROOT_DIR` variables point to your OptiX installation.

```bash
## setup your environment like:
### export OPTIX_ROOT=/opt/NVIDIA-OptiX-SDK-9.0.0-linux64-x86_64
### export OPTIX_ROOT_DIR=/opt/NVIDIA-OptiX-SDK-9.0.0-linux64-x86_64

## build proj
cargo build
```

--------------------------------

### Configure Rust Toolchain for Rust-CUDA Development

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

This `rust-toolchain.toml` configuration specifies the exact Rust toolchain required for Rust-CUDA development. It mandates a specific nightly channel and includes essential components such as `rust-src` for source code, `rustc-dev` for compiler development utilities, and `llvm-tools-preview` for LLVM-related tools, ensuring compatibility and necessary features.

```TOML
# If you see this, run `rustup self update` to get rustup 1.23 or newer.

# NOTE: above comment is for older `rustup` (before TOML support was added),
# which will treat the first line as the toolchain name, and therefore show it
# to the user in the error, instead of "error: invalid channel name '[toolchain]'".

[toolchain]
channel = "nightly-2021-12-04"
components = ["rust-src", "rustc-dev", "llvm-tools-preview"]
```

--------------------------------

### Configure Cargo.toml for GPU Crate

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

Modify your `Cargo.toml` to specify `cdylib` and `rlib` crate types, which are necessary for NVPTX targets and for using the crate as a dependency. Also, add `cuda_std` as a dependency, replacing `XX` with the latest version.

```toml
[package]
name = "name"
version = "0.1.0"
edition = "2021"

+[lib]
+crate-type = ["cdylib", "rlib"]

[dependencies]
+cuda_std = "XX"
```

--------------------------------

### Enable Allocator for GPU Kernels (Optional)

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

If your GPU kernels require dynamic memory allocation or features like printing (which depend on `alloc`), you must explicitly declare `extern crate alloc;` in your `lib.rs` to make the `alloc` crate available.

```rust
extern crate alloc;
```

--------------------------------

### Initial Setup for Unsafe OptiX API in Rust

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md

This snippet provides the foundational setup for interacting with the unsafe OptiX API in Rust. It demonstrates the necessary steps to initialize CUDA and OptiX contexts, retrieve a device, and prepare basic device buffers for vertex and index data. This preparation is crucial before invoking low-level functions like `accel_compute_memory_usage` and `accel_build`.

```Rust
use cust::prelude as cu;
use optix::prelude as ox;
# fn doit() -> Result<(), Box<dyn std::error::Error>> {
# cust::init(cu::CudaFlags::empty())?;
# ox::init()?;
# let device = cu::Device::get_device(0)?;
# let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO |
# cu::ContextFlags::MAP_HOST, device)?;
# let ctx = ox::DeviceContext::new(&cu_ctx, false)?;
# let vertices: Vec<[f32; 3]> = Vec::new();
# let indices: Vec<[u32; 3]> = Vec::new();
```

--------------------------------

### Add Initial Directives to lib.rs for GPU Compilation

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

Include essential attributes in your `lib.rs` file that are conditionally applied when compiling for GPU targets (`target_os = "cuda"`). These attributes declare the crate as `no_std` and register `nvvm_internal` for codegen functionality, along with importing `cuda_std`.

```rust
#![cfg_attr(
    target_os = "cuda",
    no_std,
    register_attr(nvvm_internal)
)]

use cuda_std::*;
```

--------------------------------

### Add cuda_builder Build Dependency

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

This snippet demonstrates how to declare `cuda_builder` as a build dependency in your `Cargo.toml` file. This is essential for the CPU crate that orchestrates the compilation of GPU kernels, ensuring the `cuda_builder` utility is available during the build process.

```TOML
+[build-dependencies]
+cuda_builder = "XX"
```

--------------------------------

### Invoke cuda_builder in Rust Build Script

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

This Rust code shows how to integrate `cuda_builder` into your `build.rs` file. It initializes `CudaBuilder` with the path to your GPU crate, specifies an output path for the compiled PTX file using `.copy_to()`, and then executes the build process. The `.unwrap()` handles potential errors during compilation.

```Rust
use cuda_builder::CudaBuilder;

fn main() {
    CudaBuilder::new("path/to/gpu/crate/root")
        .copy_to("some/path.ptx")
        .build()
        .unwrap();
}
```

--------------------------------

### Rust OptiX Device Context Initialization

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/context.md

This Rust example demonstrates the essential steps to set up an NVIDIA OptiX 7 device context. It involves initializing the CUDA driver, creating a CUDA context for a specific device, and then instantiating the OptiX `DeviceContext` linked to the CUDA context.

```Rust
# fn doit() -> Result<(), Box<dyn std::error::Error>> {
use optix::prelude as ox;
use cust::prelude as cu;

// Initialize cuda and optix
cust::init(cu::CudaFlags::empty())?;
ox::init()?;

// Create a cuda context for the first device
let device = cu::Device::get_device(0)?;
let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO |
cu::ContextFlags::MAP_HOST, device)?;

// Create optix device context
let ctx = ox::DeviceContext::new(&cu_ctx, false)?;

# Ok(())
# }
```

--------------------------------

### APIDOC: SRT Transformation Usage Examples

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md

Illustrative examples demonstrating how to apply the SRT motion transform for specific scenarios. These include setting up rotations about the origin, rotations around a custom pivot point, and scaling operations relative to a pivot point, showing the necessary mathematical adjustments.

```APIDOC
Example 1 - Rotation about the origin:
  Q = [ 0, 0, sin(pi/6), cos(pi/6) ] (60-degree rotation about z-axis)

Example 2 - Rotation about a pivot point:
  S' = P^-1 * S
  T' = T * P
  C = T' * R * S'

Example 3 - Scaling about a pivot point:
  P'x = Px + (-Sx * Gx + Gx)
  P'y = Py + (-Sy * Gy + Gy)
  P'z = Pz + (-Sz * Gz + Gz)
```

--------------------------------

### SBT Geometry-AS Index Mapping for Single-Record Build Inputs

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md

This example illustrates the mapping between SBT geometry acceleration structure indices and build inputs when each build input references exactly one SBT record. The index directly corresponds to the build input's order.

```APIDOC
<table>
<tr><th>SBT Geometry-AS Index</th><th>0</th><th>1</th><th>2</th></tr>
<tr><td rowspan=3>Geometry-AS build input</td><td><code>build_input[0]</code></td><td></td><td></td></r>
<tr><td></td><td><code>built_input[1]</code></td><td></td></r>
<tr><td></td><td></td><td><code>built_input[2]</code></td></r>
</table>
```

--------------------------------

### SBT Geometry-AS Index Mapping with Per-Primitive Offsets

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md

This example details how per-primitive SBT index offsets (`sbt_index_offset_buffer`) are applied. These offsets are local to each build input and are combined with the base SBT geometry acceleration structure index to determine the final SBT index.

```APIDOC
<table>
<tr>
<th>SBT Geometry-AS Index</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th>
</tr>
<tr>
<td rowspan=4><code>build_input[0].sbt_index_offset:</code></td><td>[0]</td><td></td><td></td><td></td><td></td><td></td><td></td>
</tr>
<tr>
<td></td><td>[1]</td><td></td><td></td><td></td><td></td><td></td>
</tr>
<tr>
<td></td><td></td><td>[2]</td><td></td><td></td><td></td><td></td>
</tr>
<tr>
<td></td><td></td><td></td><td>[3]</td><td></td><td></td><td></td>
</tr>
<tr>
<td><code>build_input[1].sbt_index_offset=None</code></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td>
</tr>
<tr>
<td rowspan=2><code>build_input[1].sbt_index_offset:</code></td><td></td><td></td><td></td><td></td><td></td><td>[0]</td><td></td>
</tr>
<tr>
<td></td><td></td><td></td><td></td><td></td><td></td><td>[1]</td>
</tr>
</table>
```

--------------------------------

### Rust-CUDA OptiX Acceleration Structure Build Example

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md

This Rust code snippet demonstrates the complete workflow for building an OptiX acceleration structure (AS). It covers initializing device buffers for vertices and indices, defining build inputs with geometry flags, setting acceleration build options, computing memory requirements, allocating temporary and output buffers, and finally executing the asynchronous AS build operation followed by stream synchronization.

```Rust
# let stream = cu::Stream::new(cu::StreamFlags::DEFAULT, None)?;

let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?;
let buf_indices = cu::DeviceBuffer::from_slice(&indices)?;

let geometry_flags = ox::GeometryFlags::None;

let build_inputs =
    [ox::IndexedTriangleArray::new(
        &[&buf_vertex],
        &buf_indices,
        &[geometry_flags]
    )];

let accel_options =
    ox::AccelBuildOptions::new(
        ox::BuildFlags::ALLOW_COMPACTION,
        ox::BuildOperation::Build
    );

// Get the storage requirements for temporary and output buffers
let sizes = accel_compute_memory_usage(ctx, accel_options, build_inputs)?;

// Allocate temporary and output buffers
let mut output_buffer =
    unsafe { DeviceBuffer::<u8>::uninitialized(sizes.output_size_in_bytes)? };
let mut temp_buffer =
    unsafe { DeviceBuffer::<u8>::uninitialized(sizes.temp_size_in_bytes)? };

// Build the accel
let hnd = unsafe {
    accel_build(
        ctx,
        stream,
        accel_options,
        build_inputs,
        &mut temp_buffer,
        &mut output_buffer,
        &mut properties,
    )?
};

// The accel build is asynchronous
stream.synchronize()?;

# Ok(())
# }
```

--------------------------------

### Initial Stack Size Estimates for Simple Path Tracer

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md

These formulas provide initial estimates for direct callable and continuation stack sizes in a basic path tracer setup. They account for ray generation, miss, and closest-hit programs, assuming a maximum trace depth of two.

```Rust
let direct_callable_stack_size_from_traversable = max_dc_depth * dss_dc;
let direct_callable_stack_size_from_state = max_dc_depth * dss_dc;
let continuation_stack_size = css_rg + 2 * css_ch1.max(css_ch2).max(css_ms1).max(css_ms2);
```

--------------------------------

### Rust CUDA Kernel for Element-wise Vector Addition

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

This `#[kernel]` annotated Rust function `add` performs element-wise addition of two `f32` input slices (`a` and `b`), storing the result in a mutable `f32` pointer `c`. It leverages `cuda_std::thread::index_1d` to calculate a unique global thread index, ensuring operations are within array bounds. Key considerations include the `unsafe` declaration due to evolving Rust GPU semantics and the use of raw pointers (`*mut f32`) to prevent unsound aliasing issues inherent with `&mut` references in a multi-threaded GPU context.

```Rust
#[kernel]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    let idx = thread::index_1d() as usize;
    if idx < a.len() {
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];
    }
}
```

--------------------------------

### Rust OptiX Acceleration Structure Compaction Example

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md

This Rust code snippet demonstrates the process of building and optionally compacting an OptiX acceleration structure. It shows how to set build flags for compaction, compute memory requirements, allocate buffers, perform the build, and then conditionally compact the structure based on the computed compacted size. It uses `cust` for CUDA operations and `optix` for OptiX API calls.

```Rust
use cust::prelude as cu;
use optix::prelude as ox;

let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?;
let buf_indices = cu::DeviceBuffer::from_slice(&indices)?;

let geometry_flags = ox::GeometryFlags::None;

let build_inputs =
    [ox::IndexedTriangleArray::new(
        &[&buf_vertex],
        &buf_indices,
        &[geometry_flags]
    )];

let accel_options =
    ox::AccelBuildOptions::new(
        ox::BuildFlags::ALLOW_COMPACTION,
        ox::BuildOperation::Build
    );

// Get the storage requirements for temporary and output buffers
let sizes = accel_compute_memory_usage(ctx, accel_options, build_inputs)?;

// Allocate temporary and output buffers
let mut output_buffer =
    unsafe { DeviceBuffer::<u8>::uninitialized(sizes.output_size_in_bytes)? };
let mut temp_buffer =
    unsafe { DeviceBuffer::<u8>::uninitialized(sizes.temp_size_in_bytes)? };

// Build the accel
let hnd = unsafe {
    accel_build(
        ctx,
        stream,
        accel_options,
        build_inputs,
        &mut temp_buffer,
        &mut output_buffer,
        &mut properties,
    )?
};

stream.synchronize()?;

let mut compacted_size = 0usize;
compacted_size_buffer.copy_to(&mut compacted_size)?;

let accel = if compacted_size < sizes.output_size_in_bytes {
    let mut buf = unsafe { DeviceBuffer::<u8>::uninitialized(compacted_size)? };
    let hnd = unsafe { accel_compact(ctx, stream, hnd, &mut buf)? };

    stream.synchronize()?;
    Accel::from_raw_parts(buf, hnd);
} else {
    Accel::from_raw_parts(output_buffer, hnd)
};
```

--------------------------------

### Example NVVM IR Module Linking with libdevice

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

This pseudo-code demonstrates the initial steps for linking an NVVM IR module with the libdevice library using the libnvvm API. It shows the declaration of a program object and a variable to store the libdevice module size, indicating where the library would be loaded or referenced during the compilation process.

```C++
nvvmProgram prog;
size_t libdeviceModSize;
```

--------------------------------

### Access SBT Data in Closest Hit Program

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md

This example demonstrates how to cast the pointer returned by `optixGetSbtDataPointer` to a custom struct (`CHData`) to access material information associated with the SBT record in a closest hit program, allowing for dynamic material properties based on the SBT entry.

```C++
struct CHData {
    int meshIdx; // Triangle mesh build input index
    float3 base_color;
};

CHData* material_info = (CHData*)optixGetSbtDataPointer();
```

--------------------------------

### SBT Geometry-AS Index Mapping for Multi-Record Build Inputs

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/shader_binding_table.md

This example demonstrates how SBT geometry acceleration structure indices are assigned when build inputs reference multiple SBT records. The mapping follows a prefix sum of the number of records, allowing a single build input to span multiple SBT indices.

```APIDOC
<table>
<tr><th>SBT Geometry-AS Index</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th></tr>
<tr><td rowspan=3>Geometry-AS build input</td><td><code>build_input[0] num=4</code></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td></td><td></td><td></td><td><code>build_input[1] num=1</code></td><td></td><td></td><td></td></tr>
<tr><td></td><td></td><td></td><td></td><td></td><td><code>build_input[2] offset=2</code></td><td></td></tr>
</table>
```

--------------------------------

### Create Program Group for Hitgroup in Rust

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md

This example illustrates the creation of a `ProgramGroup` object in Rust, specifically for a hitgroup. It involves instantiating a `Module` and then defining the `ProgramGroupDesc` with entry function names for closest-hit and any-hit shaders. Program groups are essential for filling Shader Binding Table (SBT) records.

```Rust
let (module, _log) = Module::new(
    &mut ctx,
    &module_compile_options,
    &pipeline_compile_options,
    ptx,
)?;

let pgdesc_hitgroup = ProgramGroupDesc::hitgroup(
    Some((&module, "__closesthit__radiance")),
    Some((&module, "__anyhit__radiance")),
    None,
);

let (pg_hitgroup, _log) = ProgramGroup::new(&mut ctx, &[pgdesc_hitgroup])?;
```

--------------------------------

### CUDA C++ Kernel Example for Buffer Manipulation

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/faq.md

This snippet demonstrates a basic CUDA C++ program that launches a kernel to copy data between two integer buffers on the GPU. It highlights common pitfalls and areas where errors can occur due to CUDA's lack of inherent safety checks, such as buffer overflows, incorrect kernel launches, unhandled API errors, and memory management issues.

```cpp
__global__ void kernel(int* buf, int* other)
{
  int idx = threadIdx.x;
  buf[idx] = other[idx];
}

int main(void)
{
  int N = 50;
  int* a, b, d_a, d_b;
  a = (int*)malloc(N*sizeof(int));
  b = (int*)malloc(N*sizeof(int));

  cudaMalloc(&d_a, N*sizeof(int));
  cudaMalloc(&d_b, N*sizeof(int));

  for (int i = 0; i < N; i++) {
    a[i] = 0.0f;
    b[i] = 2.0f;
  }

  cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

  kernel<<<1, N>>>(d_a, d_b);

  cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyDeviceToHost);

  /* do something with the data */

  cudaFree(d_a);
  cudaFree(d_b);
  free(a);
  free(b);
}
```

--------------------------------

### nvcc and nvrtc Compilation Requirements for OptiX

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md

This section outlines critical compilation requirements and recommended options for `nvcc` and `nvrtc` when generating PTX for NVIDIA OptiX. It covers aspects like target SM version, 64-bit code, output format, debugging flags, relocatable device code, fast math, and profiling setup.

```APIDOC
nvcc/nvrtc Compilation Requirements:
  - SM Target Compatibility:
      Description: Input PTX program's SM target must be less than or equal to the GPU's SM version.
      Option: --gpu-architecture=compute_X.X (e.g., compute_50 for Maxwell)
      Note: OptiX rewrites code internally, so targets will work on any newer GPU.
      Warning Suppression: -Wno-deprecated-gpu-targets (for SM 5.0 deprecation).
  - 64-bit Code:
      Required Option: --machine=64 (-m64)
      Description: Only 64-bit code is supported in OptiX.
  - Output Type:
      Required Option: --ptx
      Description: Do not compile to obj or cubin.
  - Debug Flags:
      Avoid Options: -g, -G
      Description: OptiX might not handle all debugging instrumentation. Important when using Microsoft Visual Studio CUDA integration.
  - Relocatable Device Code:
      Required Option: --relocatable-device-code=true (-rdc)
      Alternative (nvcc only): --keep-device-functions
      Description: Prevents CUDA compiler from eliminating direct or continuation callables as dead code.
  - Fast Math:
      Recommended Option: --use_fast_math
      Description: Enables .approx instructions for trigonometric functions and reciprocals, avoiding inadvertent use of slow double-precision floats. Recommended for performance unless higher precision is required.
  - Profiling (Nsight Compute):
      Required Options: --generate-line-info
      Application Host Code Setting: debug_level = CompileDebugLevel::LineInfo (in ModuleCompileOptions and PipelineLinkOptions)
      Description: Enables line information for profiling with Nsight Compute.
```

--------------------------------

### Rust-CUDA Stream Synchronization Example

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/safety.md

This Rust code snippet demonstrates the importance of explicit synchronization when interacting with GPU kernels from the CPU. It illustrates that a kernel launched on a stream is not guaranteed to be finished immediately, and CPU code accessing shared memory must call `stream.synchronize()` to ensure the kernel's completion before safely accessing the data modified by the GPU.

```rs
launch!(module.bar<<<1, 1, 0, stream>>>(foo.as_unified_ptr()))?;
// 'bar' is not guaranteed to be finished executing at this point.
function_that_accesses_foo(foo);
stream.synchronize()?;
// foo may be accessed and will see the changes that 'bar' wrote to 'foo'. 'bar' is guaranteed 
// to be finished executing.
```

--------------------------------

### Allow Improper C Types Definitions for GPU Kernels (Optional)

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

To use types such as slices or arrays inside GPU kernels, you must allow `improper_ctypes_definitions`. This is because `rustc_codegen_nvvm` guarantees their passing convention, unlike standard C ABI rules for `extern "C"` functions.

```rust
#![allow(improper_ctypes_definitions)]
```

--------------------------------

### Correct Rust Kernel Launch with Slice Parameters

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/kernel_abi.md

Demonstrates the correct way to launch a GPU kernel from Rust when passing a slice, by explicitly providing the device pointer and the length as separate arguments.

```Rust
let mut buf = [5u8; 10].as_dbuf()?;

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(buf.as_device_ptr(), buf.len())
  )?;
}
```

--------------------------------

### Correct Rust Kernel Launch with Device Pointer

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/kernel_abi.md

Illustrates the correct way to launch a GPU kernel from Rust when passing a reference, by providing a device pointer to the allocated memory.

```Rust
let mut val = DeviceBox::new(&5)?;

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(val.as_device_ptr())
  )?;
}
```

--------------------------------

### OptiX DeviceContext API Reference

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/context.md

This section provides an overview of key methods and properties available on the `DeviceContext` object in the OptiX Rust bindings, including context creation, property querying, logging configuration, and compilation cache management.

```APIDOC
DeviceContext:
  new(cuda_context: &cust::Context, enable_debug: bool) -> Result<Self, Error>
    Purpose: Creates a new NVIDIA OptiX 7 device context.
    Parameters:
      cuda_context: The associated CUDA context.
      enable_debug: Boolean to enable debug mode.
  get_property(property_type: PropertyType) -> Result<PropertyValue, Error>
    Purpose: Queries various properties of the context, such as max trace depth or max traversable graph depth.
    Parameters:
      property_type: The specific property to query.
    Returns: The value of the queried property.
  set_log_callback(callback: F)
    Purpose: Specifies a closure for logging messages from OptiX.
    Parameters:
      callback: A closure with signature `FnMut(u32, &str, &str) + 'static`.
        u32: Log level (0=disable, 1=fatal, 2=error, 3=warning, 4=print).
        &str: Message category description (e.g., "SCENE STAT").
        &str: The log message itself.
  set_cache_enabled(enabled: bool) -> Result<(), Error>
    Purpose: Enables or disables the compilation cache for the device context.
    Parameters:
      enabled: `true` to enable, `false` to disable.
  get_cache_enabled() -> bool
    Purpose: Queries the current status of the compilation cache for the device context.
    Returns: `true` if caching is enabled, `false` otherwise.
```

--------------------------------

### Embed PTX as Static String in Rust

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/getting_started.md

This Rust snippet illustrates how to statically embed the compiled PTX (Parallel Thread Execution) file directly into your Rust application. Using `include_str!`, the content of the specified PTX file is loaded at compile time as a string literal, making it available for runtime execution by libraries like `cust`.

```Rust
static PTX: &str = include_str!("some/path.ptx");
```

--------------------------------

### NVIDIA OptiX 7 API Core Concepts and Program Model

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/introduction.md

Detailed overview of the NVIDIA OptiX 7 API, covering its core characteristics, context management, multi-GPU support, and the fundamental programming model concepts including various program types that form the ray tracing pipeline.

```APIDOC
NVIDIA OptiX 7 API:
  Overview:
    - CUDA-centric API, invoked by CUDA-based applications.
    - Designed to be stateless, multi-threaded, and asynchronous.
    - Provides explicit control over performance-sensitive operations (memory management, shader compilation).
    - Supports lightweight scene representation (instancing, vertex/transform-based motion blur, built-in triangles/swept curves, user-defined primitives).
    - Includes highly-tuned kernels and neural networks for machine-learning-based denoising.
    - Context: Controls a single GPU, manages handle objects (small host memory), automatically released on destruction.
    - Asynchronous Operations: All API functions employ CUDA streams and invoke GPU functions asynchronously.
    - Multi-GPU: Capabilities (load balancing, NVLINK memory sharing) must be handled by the application developer.
    - Execution Model: Allows task movement across lanes/warps/SMs; applications cannot use shared memory, synchronization, barriers, or other SM-thread-specific constructs.
    - Future-proofing: API allows existing programs to use new NVIDIA hardware features.

  Basic concepts and definitions:
    Program:
      - A block of executable code on the GPU.
      - Represents a particular shading operation (called a shader in DXR/Vulkan).
      - Programmable component in the system, capable of more than shading.

  Program and Data Model:
    Implements a single-ray programming model with eight types of programs forming the ray tracing pipeline:
      - Ray generation (RG): Entry point into the ray tracing pipeline, invoked in parallel for each pixel, sample, or user-defined work assignment.
      - Intersection (IS): Implements a ray-primitive intersection test, invoked during traversal.
      - Any-hit (AH): Called when a traced ray finds a new, potentially closest, intersection point (e.g., for shadow computation).
      - Closest-hit (CH): Called when a traced ray finds the closest intersection point (e.g., for material shading).
      - Miss: Called when a traced ray misses all scene geometry.
      - Exception: Handler invoked for conditions such as stack overflow and other errors.
      - Direct callables: Similar to a regular CUDA function call, called immediately.
      - Continuation callables: Unlike direct callables, executed by the scheduler.

    Pipeline: Based on the interconnected calling structure of these eight programs and their relationship to the search through geometric data (traversal).
```

--------------------------------

### Build and Run Rust-CUDA Docker Container

Source: https://github.com/rust-gpu/rust-cuda/blob/main/README.md

These commands illustrate how to build a Docker image for Rust-CUDA using a provided Dockerfile (e.g., for Ubuntu 24.04) and then run a container from that image. The `docker run` command includes `--runtime=nvidia` and `--gpus all` to enable GPU access within the container.

```bash
docker build -f ./container/ubuntu24/Dockerfile -t rust-cuda-ubuntu24 .
docker run --rm --runtime=nvidia --gpus all -it rust-cuda-ubuntu24
```

--------------------------------

### Document __nv_sin CUDA libdevice function

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates the sine of the input argument `x`, which is measured in radians. Refer to the CUDA C Programming Guide for accuracy information.

```LLVM IR
Function: __nv_sin
Prototype: double @__nv_sin(double %x)
Description: Calculate the sine of the input argument x (measured in radians).
Parameters:
  %x (double): The angle in radians.
Returns: double
  • __nv_sin(0) returns 0.
  • __nv_sin(infinity) returns NaN.
Library Availability:
  Compute 2.0: Yes
  Compute 3.0: Yes
  Compute 3.5: Yes
```

--------------------------------

### Compile NVVM Program with Libdevice and Custom IR

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

This C/C++ code demonstrates the process of compiling an NVVM program. It involves loading the libdevice library and custom NVVM IR, adding them to an NVVM program object, and then compiling the program with specified options. The `loadFile` function is a placeholder for client-side logic to retrieve the libdevice binary.

```C
const char *libdeviceMod = loadFile('/path/to/libdevice.*.bc',
                                    &libdeviceModSize);
const char *myIr = /* NVVM IR in text or binary format */;
size_t myIrSize = /* size of myIr in bytes */;
// Create NVVM program object
nvvmCreateProgram(&prog);
// Add libdevice module to program
nvvmAddModuleToProgram(prog, libdeviceMod, libdeviceModSize);
// Add custom IR to program
nvvmAddModuleToProgram(prog, myIr, myIrSize);
// Declare compile options
const char *options[] = { "-ftz=1" };
// Compile the program
nvvmCompileProgram(prog, 1, options);
```

--------------------------------

### __nv_logbf

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates the floating point representation of the exponent of the input argument 'x'. For accuracy information, refer to the CUDA C Programming Guide, Appendix D.1, Table 6.

```APIDOC
Prototype: float @__nv_logbf(float %x)
Returns:
  - __nv_logbf(0) returns (negative infinity).
  - __nv_logbf(infinity) returns (positive infinity).
```

--------------------------------

### __nv_logb

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates the floating point representation of the exponent of the input argument 'x'. For accuracy information, refer to the CUDA C Programming Guide, Appendix D.1, Table 7.

```APIDOC
Prototype: double @__nv_logb(double %x)
Returns:
  - __nv_logb(0) returns (negative infinity).
  - __nv_logb(infinity) returns (positive infinity).
```

--------------------------------

### Libdevice __nv_cospi Function

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates the cosine of x (measured in radians). Returns 1 for specific inputs and NaN for others. Refer to CUDA C Programming Guide for accuracy information. Available for Compute 2.0, 3.0, and 3.5.

```APIDOC
Function: __nv_cospi
Prototype: double @__nv_cospi(double %x)
Description: Calculate the cosine of x (measured in radians).
Parameters:
  %x (double): The input argument.
Returns:
  double:
    - 1 for specific inputs.
    - NaN for other inputs.
Notes:
  For accuracy information, see CUDA C Programming Guide, Appendix D.1, Table 7.
Availability:
  Compute 2.0: Yes
  Compute 3.0: Yes
  Compute 3.5: Yes
```

--------------------------------

### Libdevice __nv_dmul_rd Function

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Multiplies two double-precision floating point values x and y, rounding the result down (towards negative infinity). Refer to CUDA C Programming Guide for accuracy information. Available for Compute 2.0, 3.0, and 3.5.

```APIDOC
Function: __nv_dmul_rd
Prototype: double @__nv_dmul_rd(double %x, double %y)
Description: Multiplies two floating point values x and y in round-down (to negative infinity) mode.
Parameters:
  %x (double): First floating point value.
  %y (double): Second floating point value.
Returns:
  double: The product of x * y.
Notes:
  For accuracy information, see CUDA C Programming Guide, Appendix D.1, Table 7.
Availability:
  Compute 2.0: Yes
  Compute 3.0: Yes
  Compute 3.5: Yes
```

--------------------------------

### Correct Rust Kernel Launch with Struct by Value

Source: https://github.com/rust-gpu/rust-cuda/blob/main/guide/src/guide/kernel_abi.md

Illustrates the correct way to launch a GPU kernel from Rust, passing a struct directly by value as expected by the CUDA/PTX ABI for structs.

```Rust
let foo = Foo { 
  a: 5,
  b: 6,
  c: 7
};

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(foo)
  )?;
}
```

--------------------------------

### Rust OptiX Shader Binding Table Record Packing

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/introduction.md

This Rust code demonstrates how to define a custom data structure, `HitgroupSbtData`, and then pack it into an `SbtRecord` for use with NVIDIA OptiX 7. It utilizes `cust::prelude` for device memory operations and `optix::prelude` for OptiX-specific types, illustrating the process of preparing data records for the shader binding table.

```Rust
use cust::prelude as cu;
use optix::prelude as ox;

#[derive(Copy, Clone, Default, cu::DeviceCopy)]
struct HitgroupSbtData {
    object_id: u32,
}

type HitgroupRecord = ox::SbtRecord<HitgroupSbtData>;
let rec_hitgroup: Vec<_> = (0..num_objects)
    .map(|i| {
        let object_type = 0;
        let rec = HitgroupRecord::pack(
            HitgroupSbtData { object_id: i },
            &pg_hitgroup[object_type],
        )
        .expect("failed to pack hitgroup record");
        rec
    })
    .collect();
```

--------------------------------

### Document __nv_sincos CUDA libdevice function

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates both the sine and cosine of the input argument `x` (measured in radians). The results are written to the provided pointers `sptr` (for sine) and `cptr` (for cosine). Refer to the CUDA C Programming Guide for accuracy information.

```LLVM IR
Function: __nv_sincos
Prototype: void @__nv_sincos(double %x, double* %sptr, double* %cptr)
Description: Calculate the sine and cosine of the first input argument x (measured in radians). The results for sine and cosine are written into the second argument, sptr, and, respectively, third argument, cptr.
Parameters:
  %x (double): The angle in radians.
  %sptr (double*): Pointer to store the calculated sine value.
  %cptr (double*): Pointer to store the calculated cosine value.
Returns: void
  • none
Library Availability:
  Compute 2.0: Yes
  Compute 3.0: Yes
  Compute 3.5: Yes
```

--------------------------------

### Libdevice __nv_cospif Function

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates the cosine of x (measured in radians) for single-precision floats. Returns 1 for specific inputs and NaN for others. Refer to CUDA C Programming Guide for accuracy information. Available for Compute 2.0, 3.0, and 3.5.

```APIDOC
Function: __nv_cospif
Prototype: float @__nv_cospif(float %x)
Description: Calculate the cosine of x (measured in radians).
Parameters:
  %x (float): The input argument.
Returns:
  float:
    - 1 for specific inputs.
    - NaN for other inputs.
Notes:
  For accuracy information, see CUDA C Programming Guide, Appendix D.1, Table 6.
Availability:
  Compute 2.0: Yes
  Compute 3.0: Yes
  Compute 3.5: Yes
```

--------------------------------

### Build OptiX Geometry Acceleration Structure (GAS) in Rust

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md

This Rust snippet demonstrates the process of setting up and building a Geometry Acceleration Structure (GAS) using the `cust` and `optix` crates. It covers allocating device buffers for vertex and index data, defining a triangle mesh input, configuring acceleration build options for fast ray tracing, and asynchronously initiating the GAS build operation.

```Rust
use cust::prelude as cu;
use optix::prelude as ox;
# fn doit() -> Result<(), Box<dyn std::error::Error>> {
# cust::init(cu::CudaFlags::empty())?;
# ox::init()?;
# let device = cu::Device::get_device(0)?;
# let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO |
# cu::ContextFlags::MAP_HOST, device)?;
# let ctx = ox::DeviceContext::new(&cu_ctx, false)?;
# let vertices: Vec<[f32; 3]> = Vec::new();
# let indices: Vec<[u32; 3]> = Vec::new();
# let stream = cu::Stream::new(cu::StreamFlags::DEFAULT, None)?;

// Allocate buffers and copy vertex and index data to device
let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?;
let buf_indices = cu::DeviceBuffer::from_slice(&indices)?;

// Tell OptiX the structure of our triangle mesh
let geometry_flags = ox::GeometryFlags::None;
let triangle_input =
    ox::IndexedTriangleArray::new(
        &[&buf_vertex],
        &buf_indices,
        &[geometry_flags]
    );

// Tell OptiX we'd prefer a faster traversal over a faster bvh build.
let accel_options = AccelBuildOptions::new(ox::BuildFlags::PREFER_FAST_TRACE);

// Build the accel asynchronously
let gas = ox::Accel::build(
    &ctx,
    &stream,
    &[accel_options],
    &[triangle_input],
    true
)?;
# Ok(())
# }
```

--------------------------------

### Count Leading Zeros (64-bit integer)

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Counts the number of consecutive leading zero bits, starting at the most significant bit (bit 63) of the input 64-bit integer. Returns a value between 0 and 64 inclusive representing the number of zero bits.

```APIDOC
__nv_clzll(x: i64) -> i32
  x: The input 64-bit integer.
```

--------------------------------

### Compile CUDA Source to PTX for OptiX using nvcc

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/pipeline.md

This command demonstrates how to compile a CUDA source file (`myprogram.cu`) into a PTX file (`myprogram.ptx`) suitable for NVIDIA OptiX 7. It includes flags for specifying the OptiX SDK include path and enabling fast math optimizations.

```bash
nvcc -ptx -Ipath-to-optix-sdk/include --use_fast_math myprogram.cu -o myprogram.ptx
```

--------------------------------

### Count Leading Zeros (32-bit integer)

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Counts the number of consecutive leading zero bits, starting at the most significant bit (bit 31) of the input 32-bit integer. Returns a value between 0 and 32 inclusive representing the number of zero bits.

```APIDOC
__nv_clz(x: i32) -> i32
  x: The input 32-bit integer.
```

--------------------------------

### Document __nv_sincosf CUDA libdevice function

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates both the sine and cosine of the single-precision input argument `x` (measured in radians). The results are written to the provided pointers `sptr` (for sine) and `cptr` (for cosine). Refer to the CUDA C Programming Guide for accuracy information.

```LLVM IR
Function: __nv_sincosf
Prototype: void @__nv_sincosf(float %x, float* %sptr, float* %cptr)
Description: Calculate the sine and cosine of the first input argument x (measured in radians). The results for sine and cosine are written into the second argument, sptr, and, respectively, third argument, cptr.
Parameters:
  %x (float): The angle in radians.
  %sptr (float*): Pointer to store the calculated sine value.
  %cptr (float*): Pointer to store the calculated cosine value.
Returns: void
  • none
Library Availability:
  Compute 2.0: Yes
  Compute 3.0: Yes
  Compute 3.5: Yes
```

--------------------------------

### Configure CMake Project for GPU Development Tools

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/examples/common/gdt/CMakeLists.txt

This CMake configuration block initializes the 'GPU_Development_Tools' project, specifies the minimum required CMake version (3.5), sets the C++ standard to C++11, includes the current source directory for header resolution, and defines a static library named 'gdt'. The 'gdt' library is built from a combination of CMake configuration files and C++ source/header files.

```CMake
project(GPU_Development_Tools)
cmake_minimum_required(VERSION 3.5)

set(CMAKE_CXX_STANDARD 11)

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

add_library(gdt 
  cmake/configure_build_type.cmake
  cmake/configure_optix.cmake
  cmake/FindOptiX.cmake
  
  gdt/gdt.h
  gdt/math/LinearSpace.h
  gdt/math/AffineSpace.h
  
  gdt/gdt.cpp
  )
```

--------------------------------

### __nv_logf

Source: https://github.com/rust-gpu/rust-cuda/blob/main/scripts/data/libdevice.txt

Calculates the natural logarithm of the input argument 'x'. This function returns specific values for various inputs like 0, 1, negative x, and infinity. For accuracy information, refer to the CUDA C Programming Guide, Appendix D.1, Table 6.

```APIDOC
Prototype: float @__nv_logf(float %x)
Returns:
  - __nv_logf(0) returns (negative infinity).
  - __nv_logf(1) returns +0.
  - __nv_logf(x) returns NaN for x < 0.
  - __nv_logf(infinity) returns (infinity).
```

--------------------------------

### Building OptiX Acceleration Structures with Safe Rust API

Source: https://github.com/rust-gpu/rust-cuda/blob/main/crates/optix/src/acceleration.md

This snippet demonstrates the use of `Accel::build()` for creating acceleration structures. This method simplifies the process by automatically handling memory allocation and synchronization, making it ideal for rapid prototyping. However, it may incur overhead due to reallocating temporary storage and synchronizing after each build, which can be less efficient for batch operations.

```Rust
use cust::prelude as cu;
use optix::prelude as ox;
# fn doit() -> Result<(), Box<dyn std::error::Error>> {
# cust::init(cu::CudaFlags::empty())?;
# ox::init()?;
# let device = cu::Device::get_device(0)?;
# let cu_ctx = cu::Context::create_and_push(cu::ContextFlags::SCHED_AUTO |
# cu::ContextFlags::MAP_HOST, device)?;
# let ctx = ox::DeviceContext::new(&cu_ctx, false)?;
# let vertices: Vec<[f32; 3]> = Vec::new();
# let indices: Vec<[u32; 3]> = Vec::new();
# let stream = cu::Stream::new(cu::StreamFlags::DEFAULT, None)?;

let buf_vertex = cu::DeviceBuffer::from_slice(&vertices)?;
let buf_indices = cu::DeviceBuffer::from_slice(&indices)?;

let geometry_flags = ox::GeometryFlags::None;
let triangle_input =
    ox::IndexedTriangleArray::new(
        &[&buf_vertex],
        &buf_indices,
        &[geometry_flags]
    );

let accel_options =
    ox::AccelBuildOptions::new(
        ox::BuildFlags::ALLOW_COMPACTION,
        ox::BuildOperation::Build
    );

let build_inputs = vec![triangle_input];

let gas = ox::Accel::build(
    &ctx,
    &stream,
    &[accel_options],
    &build_inputs,
    true
)?;

stream.synchronize()?;
# Ok(())
# }
```
SBT Geometry-AS Index	0	1	2
Geometry-AS build input	`build_input[0]`
		`built_input[1]`
			`built_input[2]`
SBT Geometry-AS Index	0	1	2	3	5	6
`build_input[0].sbt_index_offset:`	[0]
		[1]
			[2]
				[3]
`build_input[1].sbt_index_offset=None`
`build_input[1].sbt_index_offset:`					[0]
`build_input[1].sbt_index_offset:`						[1]
SBT Geometry-AS Index	0	3	5
Geometry-AS build input	`build_input[0] num=4`
		`build_input[1] num=1`
			`build_input[2] offset=2`