### Install Boost Program Options (Debian/Ubuntu)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Installs the Boost program_options library on Debian/Ubuntu systems using apt.

```bash
sudo apt install libboost-program-options-dev
```

--------------------------------

### Install Boost Program Options (Fedora)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Installs the Boost program_options library on Fedora systems using dnf.

```bash
sudo dnf -y install boost-devel
```

--------------------------------

### Install Boost Program Options

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Install the Boost.Program_options library development package. Resolves 'No rule to make target libboost_program_options.a' errors.

```bash
sudo apt-get install libboost-program-options-dev
```

--------------------------------

### nvbandwidth Example Output

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Example output format for a device-to-device memcpy CE bandwidth test, showing measured GB/s between GPU pairs.

```text
Running device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0      0.00    276.07    276.36    276.14    276.29    276.48    276.55    276.33
1    276.19      0.00    276.29    276.29    276.57    276.48    276.38    276.24
2    276.33    276.29      0.00    276.38    276.50    276.50    276.29    276.31
3    276.19    276.62    276.24      0.00    276.29    276.60    276.29    276.55
4    276.03    276.55    276.45    276.76      0.00    276.45    276.36    276.62
5    276.17    276.57    276.19    276.50    276.31      0.00    276.31    276.15
6    274.89    276.41    276.38    276.67    276.41    276.26      0.00    276.33
7    276.12    276.45    276.12    276.36    276.00    276.57    276.45      0.00
```

--------------------------------

### Run debian_install.sh script

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Executes the debian_install.sh script to install generic software components and build the nvbandwidth project on Debian/Ubuntu systems.

```bash
sudo ./debian_install.sh
```

--------------------------------

### Run Multinode Bandwidth Test with Pair Sampling (4 Nodes, 8 GPUs/Node)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Example command for running multinode bandwidth tests on a system with 4 nodes and 8 GPUs per node, selecting a specific number of pairs for testing. This reduces test time while aiming for good GPU topology coverage.

```bash
mpirun --allow-run-as-root --map-by ppr:8:node --bind-to core -np 32 --report-bindings -q -mca btl_tcp_if_include enP5p9s0 --hostfile /etc/nvidia-imex/nodes_config.cfg ./nvbandwidth -p multinode --targetNumPairs 8
```

--------------------------------

### Install OpenMPI Development Packages

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Install necessary OpenMPI packages to resolve 'libmpi_cxx.so.40 => not found' errors. Package names may vary by distribution.

```bash
sudo apt-get install openmpi-bin libopenmpi-dev
```

```bash
sudo yum install openmpi-devel
```

```bash
sudo dnf install openmpi-devel
```

--------------------------------

### Check CUDA Setup and Compatibility

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify CUDA driver and runtime versions to ensure compatibility. Helps resolve 'CUDA_ERROR_NO_DEVICE' and 'cudaErrorUnsupportedPtxVersion'.

```bash
# Check CUDA driver version
$ nvidia-smi | grep "CUDA Version"
```

```bash
# Check CUDA runtime version
$ nvcc --version
```

--------------------------------

### Troubleshoot Fabric Manager Requirement

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

For multi-GPU NVSwitch systems, if the system is not yet initialized, start the NVIDIA Fabric Manager service using `sudo systemctl start nvidia-fabricmanager`.

```bash
# Error on multi-GPU NVSwitch systems:
#   [CUDA_ERROR_SYSTEM_NOT_READY] system not yet initialized
# Fix: start the Fabric Manager service
sudo systemctl start nvidia-fabricmanager
```

--------------------------------

### Start IMEX Service

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Start the NVIDIA Internode Memory Exchange Service (IMEX) on each compute tray. This service is required for GPU memory export and import operations across OS domains in an NVLink multi-node deployment.

```bash
sudo systemctl start nvidia-imex.service
```

--------------------------------

### Run Multinode Bandwidth Test (2 Nodes, 4 GPUs/Node)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Example command to run multinode bandwidth tests on a system with 2 nodes and 4 GPUs per node. It specifies MPI run options, network interface, and host configuration file.

```bash
mpirun --allow-run-as-root --map-by ppr:4:node --bind-to core -np 8 --report-bindings -q -mca btl_tcp_if_include enP5p9s0 --hostfile /etc/nvidia-imex/nodes_config.cfg  ./nvbandwidth -p multinode
```

--------------------------------

### Check CUDA Version

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify the installed CUDA version. Essential for multi-node setups requiring CUDA 12.3 or above.

```bash
# Ensure CUDA 12.3+ is in use
$ nvcc --version
```
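The 12.3 minimum can also be checked from a script. The following is a minimal sketch, assuming `nvcc --version` prints its usual `release X.Y` line; the function name `cuda_at_least` is hypothetical and not part of nvbandwidth.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the toolkit version read from stdin
# (the text printed by `nvcc --version`) is at least MAJOR.MINOR.
cuda_at_least() {
  local want_major=$1 want_minor=$2 ver major minor
  # Pull "X Y" out of a line such as "... release 12.8, V12.8.61".
  ver=$(sed -n 's/.*release \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p' | head -n 1)
  [ -n "$ver" ] || return 2          # no version line found
  major=${ver% *}
  minor=${ver#* }
  if [ "$major" -ne "$want_major" ]; then
    [ "$major" -gt "$want_major" ]
  else
    [ "$minor" -ge "$want_minor" ]
  fi
}

# Usage:
#   nvcc --version | cuda_at_least 12 3 && echo "toolkit is new enough for multinode"
```

Reading the version text from stdin keeps the comparison logic usable on machines where `nvcc` is not installed.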

--------------------------------

### Troubleshoot IMEX Channels Missing

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

For multinode systems experiencing missing IMEX channels, start the NVIDIA IMEX service and create the necessary device node.

```bash
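# Error in multinode runs (IMEX channels missing):
#   [CUDA_ERROR_NOT_PERMITTED] operation not permitted in cuMemCreate
# Fix: start the IMEX service, then create the channel device node
# (replace <major> with the driver's major number)
sudo systemctl start nvidia-imex
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major> 0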
```

--------------------------------

### Verify NVIDIA Driver Installation

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Check the NVIDIA driver status and loaded modules. Useful for diagnosing 'CUDA_ERROR_NO_DEVICE'.

```bash
# Check driver version
$ cat /proc/driver/nvidia/version
```

```bash
# Check loaded modules
$ lsmod | grep nvidia
```

```bash
# Check driver installation
$ dpkg -l | grep nvidia-driver
```

--------------------------------

### Run nvbandwidth Testcases by Prefix

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

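Runs all testcases whose names start with the specified prefix string. Useful for running a whole family of tests, such as every `host_to_device` or every `multinode` testcase. Because the CE/SM marker is a suffix, a prefix selects both variants.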

```bash
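# Run all testcases whose names start with host_to_device (CE and SM variants)
./nvbandwidth -p host_to_device

# Run all testcases whose names start with device_to_device_memcpy_read
./nvbandwidth -p device_to_device_memcpy_read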

# Run all multinode testcases (MULTINODE build only)
mpirun -n 8 ./nvbandwidth -p multinode
```

--------------------------------

### Build Multinode nvbandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Use these CMake commands to build the multinode version of nvbandwidth. MPI must be installed first; CMake finds and links against it automatically.

```bash
cmake -DMULTINODE=1 .
make
```

--------------------------------

### nvbandwidth Multinode Sampling Examples

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Demonstrates different ways to use the --targetNumPairs option for multinode tests. Use -1 for all pairs, or a specific number for intelligent sampling. Note that this option is ignored in single-node mode.

```bash
# Test all pairs (default/full coverage)
mpirun -n 8 ./nvbandwidth -p multinode --targetNumPairs -1

# Test 20 carefully selected pairs
mpirun -n 8 ./nvbandwidth -p multinode --targetNumPairs 20

# For a 4-node system with 8 GPUs per node (32 GPUs): 992 total pairs available
# Using --targetNumPairs 100 will test ~10% of pairs while covering all GPUs
```
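The 992 figure above is consistent with counting ordered pairs among 32 GPUs (32 × 31). The sketch below reproduces that arithmetic, assuming the total is N × (N − 1) for N GPUs; the helper names are hypothetical and not part of nvbandwidth.

```bash
#!/usr/bin/env bash
# Hypothetical helper: ordered GPU pairs in a cluster of NODES x GPUS_PER_NODE.
total_pairs() {
  local gpus=$(( $1 * $2 ))
  echo $(( gpus * (gpus - 1) ))   # every GPU paired with every other, both directions
}

# Hypothetical helper: whole-number percentage of pairs covered by a target.
pair_coverage_pct() {
  local target=$1 total=$2
  echo $(( target * 100 / total ))
}

total=$(total_pairs 4 8)
echo "total pairs: $total"                               # 992
echo "coverage:    $(pair_coverage_pct 100 "$total")%"   # 10%
```

Running the same calculation for your own node and GPU counts gives a quick sense of how aggressive a given `--targetNumPairs` value is.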

--------------------------------

### Troubleshoot Missing MPI Shared Library

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

If the `libmpi_cxx.so.40` library is not found, install the Open MPI package and development files using the system's package manager.

```bash
# Error:
#   libmpi_cxx.so.40 => not found
# Fix: install Open MPI and its development files (Debian/Ubuntu shown)
sudo apt-get install openmpi-bin libopenmpi-dev
```

--------------------------------

### Fix CUDA_ERROR_NOT_PERMITTED in Multi-Node Runs

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Resolve IMEX channel issues for multi-node runs by starting the IMEX daemon and creating the required device node. Look up the channel driver's major number in `/proc/devices` and use that value when creating the channel.

```bash
sudo systemctl start nvidia-imex
```

```bash
grep nvidia-caps-imex-channels /proc/devices
```

```bash
sudo mkdir /dev/nvidia-caps-imex-channels/
```

```bash
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major number> 0
```

```bash
systemctl status nvidia-imex
```
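The lookup and `mknod` steps above can be combined so the major number is not copied by hand. This is a minimal sketch, assuming `/proc/devices` lists the driver as `<major> nvidia-caps-imex-channels`; the function name `imex_major` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the major number registered for the IMEX
# channel driver. The optional argument (default /proc/devices) exists
# only so the parsing can be tried against a sample file.
imex_major() {
  awk '$2 == "nvidia-caps-imex-channels" { print $1; found = 1 }
       END { exit !found }' "${1:-/proc/devices}"
}

# Usage (requires root and a loaded NVIDIA driver):
#   major=$(imex_major) &&
#     sudo mkdir -p /dev/nvidia-caps-imex-channels &&
#     sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
```

The function exits non-zero when the driver entry is absent, so the `&&` chain stops before `mknod` runs with an empty major number.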

--------------------------------

### Run nvbandwidth (All Testcases)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Executes all available test cases for bandwidth measurement.

```bash
./nvbandwidth
```

--------------------------------

### nvbandwidth CLI Help

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Displays the help message for the nvbandwidth command-line interface, listing available options.

```bash
./nvbandwidth -h
```

--------------------------------

### Run nvbandwidth (Specific Testcase)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Executes a specific test case, 'device_to_device_memcpy_read_ce', for bandwidth measurement.

```bash
./nvbandwidth -t device_to_device_memcpy_read_ce
```

--------------------------------

### Build nvbandwidth (Single-Node)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Builds the nvbandwidth executable for single-node execution using CMake and make.

```bash
cmake .
make
```

--------------------------------

### Build nvbandwidth (Multi-Node MPI + IMEX)

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Build with MULTINODE flag for multi-node tests using MPI and IMEX. Requires CUDA Toolkit 12.3+ and driver 550+.

```bash
sudo apt-get install openmpi-bin libopenmpi-dev

cmake -DMULTINODE=1 .
make -j$(nproc)

sudo systemctl start nvidia-imex.service
sudo mkdir -p /dev/nvidia-caps-imex-channels/
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major_number> 0
```

--------------------------------

### Run All-to-Host and Host-to-All Memory Copy Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures bandwidth between each GPU and the host while all other GPUs simultaneously transfer data. Reports a 1xN bandwidth matrix for each GPU under contention. Use this to stress shared PCIe bandwidth.

```bash
./nvbandwidth -t all_to_host_memcpy_ce host_to_all_memcpy_ce

# Running all_to_host_memcpy_ce.
# memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s) [all devices active]
#           0         1         2         3
# 0     12.10     12.05     11.98     12.03
```

--------------------------------

### Run All-to-One Peer-to-Peer Bandwidth Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Aggregated peer-to-peer tests measuring total inbound bandwidth to each GPU when all peers write or read simultaneously (`all_to_one`), or total outbound bandwidth from a single GPU to all peers (`one_to_all`). Requires accessible peer pairs.

```bash
./nvbandwidth -t all_to_one_write_ce one_to_all_write_ce

# Running all_to_one_write_ce.
# memcpy CE sum of GPU(column) <- all-peers bandwidth (GB/s)
#           0         1         2         3
# 0    828.21    829.44    827.98    830.11
```

--------------------------------

### Local Testing with nvbandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Command to run the multinode build of nvbandwidth on a single-node machine with an Ampere or newer GPU. This spawns 4 processes and runs all tests prefixed with 'multinode'.

```bash
mpirun -n 4 ./nvbandwidth -p multinode
```

--------------------------------

### List Available nvbandwidth Testcases

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Prints all testcase names and their corresponding indices. Names or indices can be passed to `-t`; name prefixes can be passed to `-p`.

```bash
./nvbandwidth --list
```

--------------------------------

### Run Specific nvbandwidth Testcase(s)

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Selects one or more testcases by name or numeric index for execution. Multiple values are space-separated.

```bash
# By name
./nvbandwidth -t device_to_device_memcpy_read_ce

# By index
./nvbandwidth -t 4

# Multiple testcases
./nvbandwidth -t host_to_device_memcpy_ce device_to_host_memcpy_ce
```

--------------------------------

### Configure and Run Multinode Device-to-Device Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Family of multinode tests measuring inter-node GPU bandwidth over NVLink fabric using MPI and IMEX. Requires a `MULTINODE=1` build, IMEX daemon, and running under MPI with one process per GPU.

```bash
# 2 nodes, 4 GPUs per node (8 total processes)
mpirun --allow-run-as-root \
  --map-by ppr:4:node \
  --bind-to core \
  -np 8 \
  --hostfile /etc/nvidia-imex/nodes_config.cfg \
  ./nvbandwidth -p multinode
```

```bash
# Run only write CE tests between all pairs
mpirun -np 8 --hostfile nodes.cfg \
  ./nvbandwidth -t multinode_device_to_device_memcpy_write_ce
```

```bash
# Bisect test (rank A writes to rank (A + N/2) % N)
mpirun -np 8 --hostfile nodes.cfg \
  ./nvbandwidth -t multinode_bisect_write_ce
```
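The bisect pairing rule in the comment above, rank A writes to rank (A + N/2) % N, can be tabulated before launching a run. A minimal sketch; `bisect_peer` is a hypothetical helper for illustration, not an nvbandwidth command.

```bash
#!/usr/bin/env bash
# Peer rank for a bisect test: rank A targets (A + N/2) % N.
bisect_peer() {
  local rank=$1 nranks=$2
  echo $(( (rank + nranks / 2) % nranks ))
}

# Print the pairing for 8 ranks (for example 2 nodes x 4 GPUs).
for rank in $(seq 0 7); do
  echo "rank $rank -> rank $(bisect_peer "$rank" 8)"
done
```

With 8 ranks split evenly over 2 nodes, and assuming ranks are assigned node by node, every rank is paired with a rank on the other node.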

--------------------------------

### Display GPU Topology

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Use this command to display the topology of your GPUs, which is useful for understanding connectivity and potential bottlenecks.

```bash
nvidia-smi topo -m
```

--------------------------------

### Run Device-to-Device SM Copy Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures peer-to-peer bandwidth using SM copy kernels. Produces an N×N matrix, useful for comparing hardware DMA versus kernel-driven transfer throughput.

```bash
./nvbandwidth -t device_to_device_memcpy_read_sm device_to_device_memcpy_write_sm
```

--------------------------------

### Use Transparent Huge Pages for Buffers

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Allocates host-side buffers using `madvise(MADV_HUGEPAGE)` when Transparent Huge Pages are enabled on the system. This is not supported on Windows and requires THP to be enabled first.

```bash
# Enable THP first (if not already enabled)
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Run with huge pages
./nvbandwidth -H -t host_to_device_memcpy_ce
```

--------------------------------

### Enable JSON Output for Automated Parsing

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Switches the output formatter to emit structured JSON, suitable for automated parsing and CI integration. This is useful for integrating bandwidth test results into other systems.

```bash
./nvbandwidth --json -t host_to_device_memcpy_ce device_to_host_memcpy_ce
```

```json
{
  "nvbandwidth_version" : "v0.9",
  "git_version" : "v0.9",
  "cuda_runtime_version" : 12080,
  "driver_version" : "560.00",
  "hostname" : "node01",
  "testcases" : [
    {
      "name" : "host_to_device_memcpy_ce",
      "status" : "Passed",
      "bw_description" : "memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)",
      "bw_matrix" : [ [ 26.03, 25.94, 25.97, 26.00 ] ],
      "bw_sum" : 103.94
    },
    {
      "name" : "device_to_host_memcpy_ce",
      "status" : "Passed",
      "bw_description" : "memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)",
      "bw_matrix" : [ [ 25.80, 25.71, 25.76, 25.68 ] ],
      "bw_sum" : 102.95
    }
  ]
}
```
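For CI use, the per-testcase totals can be pulled out of this output with `jq`. The sketch below assumes the JSON layout shown above (a top-level `testcases` array with `name` and `bw_sum` fields) and that `--json` writes only JSON to stdout; `summarize_bw` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read nvbandwidth --json output on stdin and print
# "<testcase name><TAB><bw_sum>" for each testcase. Requires jq.
summarize_bw() {
  jq -r '.testcases[] | "\(.name)\t\(.bw_sum)"'
}

# Usage:
#   ./nvbandwidth --json -t host_to_device_memcpy_ce | summarize_bw
```

If the real output nests these fields differently on your nvbandwidth version, only the `jq` path needs to change.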

--------------------------------

### Run Host-to-Device and Device-to-Host SM Copy Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures host-to-device and device-to-host bandwidth using SM (Streaming Multiprocessor) copy kernels instead of hardware DMA. Copy size is adjusted to align with `threadsPerBlock × SM_count` for accurate reporting.

```bash
./nvbandwidth -t host_to_device_memcpy_sm device_to_host_memcpy_sm

# Running host_to_device_memcpy_sm.
# memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     25.11     24.98     25.03     24.89
```

--------------------------------

### Troubleshoot CUDA Device Detection

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

If no CUDA device is detected, verify GPU visibility using `nvidia-smi` and check driver modules with `lsmod | grep nvidia`.

```bash
# Error:
#   [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected
nvidia-smi          # check GPU visibility
lsmod | grep nvidia # check driver modules
```

--------------------------------

### Accumulate and Analyze Bandwidth Samples

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

The `PerformanceStatistic` class accumulates per-sample bandwidth values and computes various metrics like mean, median, standard deviation, min, and max. Use `returnAppropriateMetric()` for the default (median) or specific methods like `median()`, `mean()`, `stddev()`, `largest()`, and `smallest()`.

```cpp
PerformanceStatistic stat;
stat.recordSample(25.94);
stat.recordSample(26.03);
stat.recordSample(25.97);

double result = stat.returnAppropriateMetric(); // median (default) or mean
double med    = stat.median();    // 25.97
double avg    = stat.mean();      // 25.98
double sd     = stat.stddev();
double hi     = stat.largest();   // 26.03
double lo     = stat.smallest();  // 25.94
```

--------------------------------

### Configure Multi-Node Build

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Configure CMake to build with multi-node support. Requires CUDA 12.3+.

```bash
# Build with multi-node support
$ cmake -DMULTINODE=1 .
```

--------------------------------

### Clean Build Artifacts

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Remove CMake cache and previous build files. Recommended before rebuilding to resolve compilation issues like 'cudaErrorUnsupportedPtxVersion'.

```bash
rm -rf CMakeCache.txt CMakeFiles
make clean
```

--------------------------------

### Check for CUDA Devices

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify if any CUDA-capable devices are detected by the system. This command helps diagnose 'CUDA_ERROR_NO_DEVICE'.

```bash
$ nvidia-smi
```

--------------------------------

### Generate NVIDIA Bug Report

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Create a detailed diagnostic report for NVIDIA hardware and software issues. Useful for complex troubleshooting.

```bash
$ nvidia-bug-report.sh
```

--------------------------------

### Run Bidirectional Device-to-Device SM Copy Test

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Performs a bidirectional SM copy test where a single kernel with split-warp execution copies data in both directions simultaneously. Reports the total bandwidth (both directions summed).

```bash
./nvbandwidth -t device_to_device_bidirectional_memcpy_write_sm

# SM bidir. bandwidth = size / kernel_time  (both directions counted)
```

--------------------------------

### Measure Device-to-Device Memcpy Read/Write Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures peer-to-peer bandwidth between every accessible GPU pair using CE. 'Read' uses the target GPU's context to pull data from the peer; 'Write' uses the target's context to push data to the peer. Reports an NxN bandwidth matrix; diagonal entries are 0.

```bash
./nvbandwidth -t device_to_device_memcpy_write_ce
```

```text
# Running device_to_device_memcpy_write_ce.
# memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0      0.00    276.07    276.36    276.14
# 1    276.19      0.00    276.29    276.29
# 2    276.33    276.29      0.00    276.38
# 3    276.19    276.62    276.24      0.00
```

--------------------------------

### Set CUDA Compute Architecture

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Specify the target GPU architecture for compilation using CMake. Use this to resolve 'Unsupported gpu architecture' errors for older GPUs.

```bash
# For Hopper
cmake -DCMAKE_CUDA_ARCHITECTURES=90 .
```

--------------------------------

### Troubleshoot Unsupported GPU Architecture

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

When nvcc reports an unsupported GPU architecture, specify the desired compute architecture during the CMake configuration process using `-DCMAKE_CUDA_ARCHITECTURES`.

```bash
# Error:
#   nvcc fatal: Unsupported gpu architecture 'compute_52'
# Fix: set the target architecture explicitly, e.g. 90 for Hopper
cmake -DCMAKE_CUDA_ARCHITECTURES=90 .
```

--------------------------------

### Enable Verbose Output for Diagnostics

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Enables additional diagnostic output during test execution, including per-iteration bandwidth samples. This can be helpful for detailed analysis of performance.

```bash
./nvbandwidth -v -t device_to_device_memcpy_write_ce
```

--------------------------------

### Limit Multinode Tests with Pair Sampling

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Use the `--targetNumPairs` option to limit multinode device-to-device tests to a sampled subset of GPU pairs. The sampling ensures every GPU participates in at least one pair. Use -1 to test all pairs.

```bash
mpirun -np 8 --hostfile nodes.cfg \
  ./nvbandwidth -p multinode_device_to_device --targetNumPairs 20
```

```bash
mpirun -np 8 --hostfile nodes.cfg \
  ./nvbandwidth -p multinode_device_to_device --targetNumPairs -1
```

--------------------------------

### Set Buffer Size for nvbandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Controls the memory buffer size for copy operations. Default is 512 MiB for peak bandwidth. Latency tests use a fixed 2 MiB buffer.

```bash
# Use 1024 MiB buffer
./nvbandwidth -b 1024

# Use 128 MiB buffer (triggers advisory warning)
./nvbandwidth -b 128
```

--------------------------------

### Measure Host-to-Device and Device-to-Host Memcpy Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures unidirectional PCIe or NVLink bandwidth between host (CPU-pinned) memory and each GPU using `cuMemcpyAsync` (Copy Engine). Reports a 1xN bandwidth matrix. Use `-b` for buffer size and `-i` for sample count.

```bash
./nvbandwidth -t host_to_device_memcpy_ce device_to_host_memcpy_ce -b 512 -i 5
```

```text
# Running host_to_device_memcpy_ce.
# memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     26.03     25.94     25.97     26.00

# Running device_to_host_memcpy_ce.
# memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     25.80     25.71     25.76     25.68
```

--------------------------------

### Measure On-Device Local Memory Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures on-device memory bandwidth (within a single GPU's HBM) using `cuMemcpyAsync`. Reports a 1xN matrix, with one entry per GPU.

```bash
./nvbandwidth -t device_local_copy

# Running device_local_copy.
# memcpy CE GPU(row) local copy bandwidth (GB/s)
#           0         1         2         3
# 0   3200.11   3198.44   3201.07   3199.88
```

--------------------------------

### Set CUDA Environment Variables

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Set PATH and LD_LIBRARY_PATH for a specific CUDA toolkit version. Use this to resolve 'cudaErrorUnsupportedPtxVersion' by ensuring the correct toolkit is used.

```bash
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
```

--------------------------------

### Measure Bidirectional Host-to-Device Memcpy Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures bidirectional PCIe bandwidth where a copy in the measured direction runs simultaneously with an interfering copy in the opposite direction. Only the measured direction's bandwidth is reported.

```bash
./nvbandwidth -t host_to_device_bidirectional_memcpy_ce
```

```text
# Running host_to_device_bidirectional_memcpy_ce.
# memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     18.56     18.37     19.37     19.59
```

--------------------------------

### Measure Bidirectional Device-to-Device Memcpy Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures bidirectional peer-to-peer bandwidth between GPUs with an opposing-direction interfering copy running simultaneously. Only the primary direction bandwidth is reported. Use `-b` for buffer size.

```bash
./nvbandwidth -t device_to_device_bidirectional_memcpy_read_ce -b 512
```

--------------------------------

### Troubleshoot PTX Version Mismatch

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

If encountering a PTX version mismatch, ensure the CUDA toolkit's bin directory is in the PATH and then clean and re-run CMake and make.

```bash
# Error:
#   [cudaErrorUnsupportedPtxVersion]
# Fix: put the intended toolkit first in PATH, then reconfigure and rebuild
export PATH=/usr/local/cuda-12.8/bin:$PATH
rm -rf CMakeCache.txt CMakeFiles && cmake . && make
```

--------------------------------

### Check NVLink Status

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify the status of NVLink connections between GPUs. Helps diagnose 'CUDA_ERROR_NVLINK_UNCORRECTABLE' errors.

```bash
# Check NVLink status
nvidia-smi nvlink -s
```

--------------------------------

### Skip Data Verification to Reduce Runtime

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

By default, nvbandwidth verifies that copied data is correct after each transfer. This flag skips that check to reduce runtime, which can be useful for quick performance checks.

```bash
./nvbandwidth -s -t device_to_device_memcpy_write_ce
```

--------------------------------

### Bidirectional Host <-> Device Memcpy CE Bandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Measures the bidirectional bandwidth between CPU (host) and GPU (device) memory using the Copy Engine (CE). Copies run concurrently in both directions between host and device memory, with results shown in GB/s.

```text
Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0     18.56     18.37     19.37     19.59     18.71     18.79     18.46     18.61
```

--------------------------------

### Measure Host-Device Memory Access Latency

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures host-to-device memory access latency using a GPU pointer-chasing kernel over a fixed 2 MiB host-pinned buffer. The strided linked-list pattern prevents prefetching while keeping TLB entries warm.

```bash
./nvbandwidth -t host_device_latency_sm

# Running host_device_latency_sm.
# Latency SM CPU(row) <-> GPU(column) latency (ns)
#           0         1         2         3
# 0    220.15    218.77    219.42    221.03
```

--------------------------------

### Measure Device-to-Device Memory Access Latency

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures peer GPU memory access latency using pointer-chasing over a 2 MiB device buffer. Uses the same methodology as `host_device_latency_sm`; the `--bufferSize` flag is ignored.

```bash
./nvbandwidth -t device_to_device_latency_sm

# Running device_to_device_latency_sm.
# Latency SM GPU(row) <-> GPU(column) latency (ns)
#           0         1         2         3
# 0      0.00    610.34    612.01    609.88
# 1    611.20      0.00    609.65    611.44
```

--------------------------------

### Set Sample Count for Stable Results

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Controls the number of outer measurement iterations. Use a higher number for more stable results. The reported bandwidth is the median across all samples by default.

```bash
./nvbandwidth -t host_to_device_memcpy_ce -i 10
```

```bash
./nvbandwidth -t host_to_device_memcpy_ce -i 10 --useMean
```

--------------------------------

### Host to Device Memcpy CE Bandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Measures the bandwidth for copying data from CPU (host) memory to GPU (device) memory using the Copy Engine (CE). Results are presented as a matrix of bandwidths in GB/s.

```text
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0     26.03     25.94     25.97     26.00     26.19     25.95     26.00     25.97
```

--------------------------------

### SM Copy Size Calculation

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Formula used by SM copies to calculate the actual byte size for a copy operation. This ensures the size uniformly fits within the target device, considering threads per block and device SM count.

```plaintext
(threadsPerBlock * deviceSMCount) * floor(copySize / (threadsPerBlock * deviceSMCount))
```
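The formula rounds the requested size down to a whole multiple of `threadsPerBlock * deviceSMCount`. A minimal sketch of that arithmetic follows; the function name and the sample values (512 threads per block, 132 SMs) are illustrative assumptions, not values taken from nvbandwidth.

```bash
#!/usr/bin/env bash
# Round copySize down to a multiple of threadsPerBlock * deviceSMCount.
# Shell integer division truncates, which matches floor() for positive values.
sm_copy_size() {
  local copy_size=$1 threads_per_block=$2 sm_count=$3
  local chunk=$(( threads_per_block * sm_count ))
  echo $(( chunk * (copy_size / chunk) ))
}

# Illustrative values only: 512 MiB request, 512 threads per block, 132 SMs.
sm_copy_size $(( 512 * 1024 * 1024 )) 512 132
```

The result is never larger than the requested size, and it falls short by less than one `threadsPerBlock * deviceSMCount` chunk.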

--------------------------------

### Tune Inner Loop Count for Bandwidth Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

The inner loop count for bandwidth tests, which determines how many times a transfer is repeated within a single sample, can be tuned using the `--loopCount` option. The default is 16.

```bash
./nvbandwidth --loopCount 32 -t device_to_device_memcpy_write_ce
```

```bash
./nvbandwidth -i 3 --useMean -t host_to_device_memcpy_ce
```

--------------------------------

### SM Bidirectional Bandwidth Calculation

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Formula for calculating bidirectional bandwidth when using Streaming Multiprocessor (SM) copies. It is based on the total data size transferred and the kernel execution time.

```plaintext
SM bidir. bandwidth = size / (kernel time)
```

--------------------------------

### Disable CPU Affinity for NUMA Node Minimization

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Disables the default behavior of pinning threads to the NUMA node closest to the target GPU. This can be useful for testing scenarios where NUMA affinity is not desired or needs to be controlled manually.

```bash
./nvbandwidth -d -t all_to_host_memcpy_ce
```

--------------------------------

### CE Bidirectional Bandwidth Calculation

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Formula for calculating bidirectional bandwidth when using Copy Engine (CE) copies. It is based on the size of data on the measured stream and the time taken on that stream; the interfering copy in the opposite direction is not counted.

```plaintext
CE bidir. bandwidth = (size of data on measured stream) / (time on measured stream)
```
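As a worked instance of the formula, the sketch below divides bytes moved on the measured stream by the time on that stream. It assumes 1 GB = 1e9 bytes, which may differ from the unit nvbandwidth uses internally; `bandwidth_gbps` is a hypothetical helper and the numbers are illustrative, not measurements.

```bash
#!/usr/bin/env bash
# Hypothetical helper: bandwidth in GB/s from bytes moved and elapsed seconds
# on the measured stream. Assumes 1 GB = 1e9 bytes.
bandwidth_gbps() {
  awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", bytes / secs / 1e9 }'
}

# Illustrative numbers only: 512 MiB moved in 20 ms on the measured stream.
bandwidth_gbps $(( 512 * 1024 * 1024 )) 0.020
```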
