### Install Boost Program Options (Debian/Ubuntu)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Installs the Boost program_options library on Debian/Ubuntu systems using apt.

```bash
apt install libboost-program-options-dev
```

--------------------------------

### Install Boost Program Options (Fedora)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Installs the Boost program_options library on Fedora systems using dnf.

```bash
sudo dnf -y install boost-devel
```

--------------------------------

### Install Boost Program Options

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Install the Boost.Program_options library development package. Resolves 'No rule to make target libboost_program_options.a' errors.

```bash
sudo apt-get install libboost-program-options-dev
```

--------------------------------

### nvbandwidth Example Output

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Example output format for a device-to-device memcpy CE bandwidth test, showing measured GB/s between GPU pairs.

```text
Running device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0      0.00    276.07    276.36    276.14    276.29    276.48    276.55    276.33
1    276.19      0.00    276.29    276.29    276.57    276.48    276.38    276.24
2    276.33    276.29      0.00    276.38    276.50    276.50    276.29    276.31
3    276.19    276.62    276.24      0.00    276.29    276.60    276.29    276.55
4    276.03    276.55    276.45    276.76      0.00    276.45    276.36    276.62
5    276.17    276.57    276.19    276.50    276.31      0.00    276.31    276.15
6    274.89    276.41    276.38    276.67    276.41    276.26      0.00    276.33
7    276.12    276.45    276.12    276.36    276.00    276.57    276.45      0.00
```

--------------------------------

### Run debian_install.sh script

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Executes the debian_install.sh script to install generic software components and build the nvbandwidth project on Debian/Ubuntu systems.
```bash
sudo ./debian_install.sh
```

--------------------------------

### Run Multinode Bandwidth Test with Pair Sampling (4 Nodes, 8 GPUs/Node)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Example command for running multinode bandwidth tests on a system with 4 nodes and 8 GPUs per node (32 processes, one per GPU), limiting the run to a target number of sampled pairs. Sampling reduces test time while aiming for good GPU topology coverage.

```bash
mpirun --allow-run-as-root --map-by ppr:8:node --bind-to core -np 32 --report-bindings -q -mca btl_tcp_if_include enP5p9s0 --hostfile /etc/nvidia-imex/nodes_config.cfg ./nvbandwidth -p multinode --targetNumPairs 8
```

--------------------------------

### Install OpenMPI Development Packages

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Install the OpenMPI packages needed to resolve 'libmpi_cxx.so.40 => not found' errors. Package names may vary by distribution.

```bash
sudo apt-get install openmpi-bin libopenmpi-dev
```

```bash
sudo yum install openmpi-devel
```

```bash
sudo dnf install openmpi-devel
```

--------------------------------

### Check CUDA Setup and Compatibility

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify that the CUDA driver and runtime versions are compatible. Helps resolve 'CUDA_ERROR_NO_DEVICE' and 'cudaErrorUnsupportedPtxVersion'.

```bash
# Check CUDA driver version
$ nvidia-smi | grep "CUDA Version"
```

```bash
# Check CUDA runtime version
$ nvcc --version
```

--------------------------------

### Troubleshoot Fabric Manager Requirement

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

On multi-GPU NVSwitch systems that report the system as not yet initialized, start the NVIDIA Fabric Manager service using `sudo systemctl start nvidia-fabricmanager`.
```bash
# Fabric manager required (multi-GPU NVSwitch systems)
# Error: [CUDA_ERROR_SYSTEM_NOT_READY] system not yet initialized
sudo systemctl start nvidia-fabricmanager
```

--------------------------------

### Start IMEX Service

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Start the NVIDIA Internode Memory Exchange Service (IMEX) on each compute tray. This service is required for GPU memory export and import operations across OS domains in an NVLink multi-node deployment.

```bash
sudo systemctl start nvidia-imex.service
```

--------------------------------

### Run Multinode Bandwidth Test (2 Nodes, 4 GPUs/Node)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Example command to run multinode bandwidth tests on a system with 2 nodes and 4 GPUs per node (8 processes in total). It specifies the MPI run options, the network interface, and the host configuration file.

```bash
mpirun --allow-run-as-root --map-by ppr:4:node --bind-to core -np 8 --report-bindings -q -mca btl_tcp_if_include enP5p9s0 --hostfile /etc/nvidia-imex/nodes_config.cfg ./nvbandwidth -p multinode
```

--------------------------------

### Check CUDA Version

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify the installed CUDA version. Multi-node setups require CUDA 12.3 or above.

```bash
# Ensure CUDA 12.3+ is in use
$ nvcc --version
```

--------------------------------

### Troubleshoot IMEX Channels Missing

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

For multinode systems experiencing missing IMEX channels, start the NVIDIA IMEX service and create the necessary device node.
```bash
# IMEX channels missing (multinode)
# Error: [CUDA_ERROR_NOT_PERMITTED] operation not permitted in cuMemCreate
sudo systemctl start nvidia-imex
# The channel's major number is listed in /proc/devices; the minor number is 0
major=$(awk '$2 == "nvidia-caps-imex-channels" {print $1}' /proc/devices)
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
```

--------------------------------

### Verify NVIDIA Driver Installation

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Check the NVIDIA driver status and loaded modules. Useful for diagnosing 'CUDA_ERROR_NO_DEVICE'.

```bash
# Check driver version
$ cat /proc/driver/nvidia/version
```

```bash
# Check loaded modules
$ lsmod | grep nvidia
```

```bash
# Check driver installation
$ dpkg -l | grep nvidia-driver
```

--------------------------------

### Run nvbandwidth Testcases by Prefix

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Runs all testcases whose names start with the specified prefix string. Useful for running a family of related tests, such as all host-to-device or all multinode testcases.

```bash
# Run all testcases whose names start with host_to_device (CE and SM variants)
./nvbandwidth -p host_to_device

# Run all testcases whose names start with device_to_device_memcpy_read
./nvbandwidth -p device_to_device_memcpy_read

# Run all multinode testcases (MULTINODE build only)
mpirun -n 8 ./nvbandwidth -p multinode
```

--------------------------------

### Build Multinode nvbandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Use these CMake commands to build the multinode version of nvbandwidth. MPI must be installed; CMake finds and links against it automatically.

```bash
cmake -DMULTINODE=1 .
make
```

--------------------------------

### nvbandwidth Multinode Sampling Examples

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Demonstrates different ways to use the --targetNumPairs option for multinode tests. Use -1 for all pairs, or a specific number for intelligent sampling. Note that this option is ignored in single-node mode.
```bash
# Test all pairs (default/full coverage)
mpirun -n 8 ./nvbandwidth -p multinode --targetNumPairs -1

# Test 20 carefully selected pairs
mpirun -n 8 ./nvbandwidth -p multinode --targetNumPairs 20

# For a 4-node system with 8 GPUs per node (32 GPUs): 32 * 31 = 992 total pairs available
# Using --targetNumPairs 100 will test ~10% of pairs while covering all GPUs
```

--------------------------------

### Troubleshoot Missing MPI Shared Library

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

If the `libmpi_cxx.so.40` library is not found, install the Open MPI package and development files using the system's package manager.

```bash
# Missing MPI shared library
# Error: libmpi_cxx.so.40 => not found
sudo apt-get install openmpi-bin libopenmpi-dev
```

--------------------------------

### Fix CUDA_ERROR_NOT_PERMITTED in Multi-Node Runs

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Resolve IMEX channel issues for multi-node runs by starting the IMEX daemon and creating the necessary device nodes. When creating a channel, use the major number that `/proc/devices` reports for `nvidia-caps-imex-channels`.

```bash
sudo systemctl start nvidia-imex
```

```bash
grep nvidia-caps-imex-channels /proc/devices
```

```bash
sudo mkdir /dev/nvidia-caps-imex-channels/
```

```bash
major=$(awk '$2 == "nvidia-caps-imex-channels" {print $1}' /proc/devices)
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
```

```bash
systemctl status nvidia-imex
```

--------------------------------

### Run nvbandwidth (All Testcases)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Executes all available test cases for bandwidth measurement.

```bash
./nvbandwidth
```

--------------------------------

### nvbandwidth CLI Help

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Displays the help message for the nvbandwidth command-line interface, listing available options.
```bash
./nvbandwidth -h
```

--------------------------------

### Run nvbandwidth (Specific Testcase)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Executes a specific test case, 'device_to_device_memcpy_read_ce', for bandwidth measurement.

```bash
./nvbandwidth -t device_to_device_memcpy_read_ce
```

--------------------------------

### Build nvbandwidth (Single-Node)

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Builds the nvbandwidth executable for single-node execution using CMake and make.

```bash
cmake .
make
```

--------------------------------

### Build nvbandwidth (Multi-Node MPI + IMEX)

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Build with the MULTINODE flag for multi-node tests using MPI and IMEX. Requires CUDA Toolkit 12.3+ and driver 550+.

```bash
sudo apt-get install openmpi-bin libopenmpi-dev
cmake -DMULTINODE=1 .
make -j$(nproc)
sudo systemctl start nvidia-imex.service
sudo mkdir -p /dev/nvidia-caps-imex-channels/
# The channel's major number is listed in /proc/devices; the minor number is 0
major=$(awk '$2 == "nvidia-caps-imex-channels" {print $1}' /proc/devices)
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
```

--------------------------------

### Run All-to-Host and Host-to-All Memory Copy Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures bandwidth between each GPU and the host while all other GPUs simultaneously transfer data. Reports a 1xN bandwidth matrix with one entry per GPU under contention. Use this to stress shared PCIe bandwidth.

```bash
./nvbandwidth -t all_to_host_memcpy_ce host_to_all_memcpy_ce
# Running all_to_host_memcpy_ce.
# memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s) [all devices active]
#           0         1         2         3
# 0     12.10     12.05     11.98     12.03
```

--------------------------------

### Run All-to-One Peer-to-Peer Bandwidth Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Aggregated peer-to-peer tests measuring total inbound bandwidth to each GPU when all peers write or read simultaneously (`all_to_one`), or total outbound bandwidth from a single GPU to all peers (`one_to_all`). Requires accessible peer pairs.
```bash
./nvbandwidth -t all_to_one_write_ce one_to_all_write_ce
# Running all_to_one_write_ce.
# memcpy CE sum of GPU(column) <- all-peers bandwidth (GB/s)
#           0         1         2         3
# 0    828.21    829.44    827.98    830.11
```

--------------------------------

### Local Testing with nvbandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Command to run nvbandwidth on a single-node machine with an Ampere or newer GPU. This spawns 4 processes and runs all tests prefixed with 'multinode'.

```bash
mpirun -n 4 ./nvbandwidth -p multinode
```

--------------------------------

### List Available nvbandwidth Testcases

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Prints all testcase names and their corresponding indices, which can be used with the -t or -p flags.

```bash
./nvbandwidth --list
```

--------------------------------

### Run Specific nvbandwidth Testcase(s)

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Selects one or more testcases by name or numeric index for execution. Multiple values are space-separated.

```bash
# By name
./nvbandwidth -t device_to_device_memcpy_read_ce

# By index
./nvbandwidth -t 4

# Multiple testcases
./nvbandwidth -t host_to_device_memcpy_ce device_to_host_memcpy_ce
```

--------------------------------

### Configure and Run Multinode Device-to-Device Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Family of multinode tests measuring inter-node GPU bandwidth over NVLink fabric using MPI and IMEX. Requires a `MULTINODE=1` build, IMEX daemon, and running under MPI with one process per GPU.
```bash
# 2 nodes, 4 GPUs per node (8 total processes)
mpirun --allow-run-as-root \
    --map-by ppr:4:node \
    --bind-to core \
    -np 8 \
    --hostfile /etc/nvidia-imex/nodes_config.cfg \
    ./nvbandwidth -p multinode
```

```bash
# Run only write CE tests between all pairs
mpirun -np 8 --hostfile nodes.cfg \
    ./nvbandwidth -t multinode_device_to_device_memcpy_write_ce
```

```bash
# Bisect test (rank A writes to rank (A + N/2) % N)
mpirun -np 8 --hostfile nodes.cfg \
    ./nvbandwidth -t multinode_bisect_write_ce
```

--------------------------------

### Display GPU Topology

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Use this command to display the topology of your GPUs, which is useful for understanding connectivity and potential bottlenecks.

```bash
nvidia-smi topo -m
```

--------------------------------

### Run Device-to-Device SM Copy Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures peer-to-peer bandwidth using SM copy kernels. Produces an N×N matrix, useful for comparing hardware DMA versus kernel-driven transfer throughput.

```bash
./nvbandwidth -t device_to_device_memcpy_read_sm device_to_device_memcpy_write_sm
```

--------------------------------

### Use Transparent Huge Pages for Buffers

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Allocates host-side buffers using `madvise(MADV_HUGEPAGE)` when Transparent Huge Pages are enabled on the system. This is not supported on Windows and requires THP to be enabled first.

```bash
# Enable THP first (if not already enabled)
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Run with huge pages
./nvbandwidth -H -t host_to_device_memcpy_ce
```

--------------------------------

### Enable JSON Output for Automated Parsing

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Switches the output formatter to emit structured JSON, suitable for automated parsing and CI integration. This is useful for integrating bandwidth test results into other systems.
```bash
./nvbandwidth --json -t host_to_device_memcpy_ce device_to_host_memcpy_ce
```

```json
{
  "nvbandwidth_version" : "v0.9",
  "git_version" : "v0.9",
  "cuda_runtime_version" : 12080,
  "driver_version" : "560.00",
  "hostname" : "node01",
  "testcases" : [
    {
      "name" : "host_to_device_memcpy_ce",
      "status" : "Passed",
      "bw_description" : "memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)",
      "bw_matrix" : [ [ 26.03, 25.94, 25.97, 26.00 ] ],
      "bw_sum" : 103.94
    },
    {
      "name" : "device_to_host_memcpy_ce",
      "status" : "Passed",
      "bw_description" : "memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)",
      "bw_matrix" : [ [ 25.80, 25.71, 25.76, 25.68 ] ],
      "bw_sum" : 102.95
    }
  ]
}
```

--------------------------------

### Run Host-to-Device and Device-to-Host SM Copy Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures host-to-device and device-to-host bandwidth using SM (Streaming Multiprocessor) copy kernels instead of hardware DMA. Copy size is adjusted to align with `threadsPerBlock × SM_count` for accurate reporting.

```bash
./nvbandwidth -t host_to_device_memcpy_sm device_to_host_memcpy_sm
# Running host_to_device_memcpy_sm.
# memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     25.11     24.98     25.03     24.89
```

--------------------------------

### Troubleshoot CUDA Device Detection

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

If no CUDA device is detected, verify GPU visibility using `nvidia-smi` and check driver modules with `lsmod | grep nvidia`.

```bash
# No CUDA device detected
# Error: [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected
nvidia-smi              # check GPU visibility
lsmod | grep nvidia     # check driver modules
```

--------------------------------

### Accumulate and Analyze Bandwidth Samples

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

The `PerformanceStatistic` class accumulates per-sample bandwidth values and computes various metrics like mean, median, standard deviation, min, and max.
Use `returnAppropriateMetric()` for the default (median) or specific methods like `median()`, `mean()`, `stddev()`, `largest()`, and `smallest()`.

```cpp
PerformanceStatistic stat;
stat.recordSample(25.94);
stat.recordSample(26.03);
stat.recordSample(25.97);

double result = stat.returnAppropriateMetric();  // median (default) or mean
double med    = stat.median();                   // 25.97
double avg    = stat.mean();                     // 25.98
double sd     = stat.stddev();
double hi     = stat.largest();                  // 26.03
double lo     = stat.smallest();                 // 25.94
```

--------------------------------

### Configure Multi-Node Build

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Configure CMake to build with multi-node support. Requires CUDA 12.3+.

```bash
# Build with multi-node support (run from the source directory)
$ cmake -DMULTINODE=1 .
```

--------------------------------

### Clean Build Artifacts

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Remove the CMake cache and previous build files. Recommended before rebuilding to resolve compilation issues like 'cudaErrorUnsupportedPtxVersion'.

```bash
rm -rf CMakeCache.txt CMakeFiles
make clean
```

--------------------------------

### Check for CUDA Devices

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify whether any CUDA-capable devices are detected by the system. This command helps diagnose 'CUDA_ERROR_NO_DEVICE'.

```bash
$ nvidia-smi
```

--------------------------------

### Generate NVIDIA Bug Report

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Create a detailed diagnostic report for NVIDIA hardware and software issues. Useful for complex troubleshooting.

```bash
$ nvidia-bug-report.sh
```

--------------------------------

### Run Bidirectional Device-to-Device SM Copy Test

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Performs a bidirectional SM copy test where a single kernel with split-warp execution copies data in both directions simultaneously.
Reports the total bandwidth (both directions summed).

```bash
./nvbandwidth -t device_to_device_bidirectional_memcpy_write_sm
# SM bidir. bandwidth = size / kernel_time (both directions counted)
```

--------------------------------

### Measure Device-to-Device Memcpy Read/Write Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures peer-to-peer bandwidth between every accessible GPU pair using CE. 'Read' uses the target GPU's context to pull data from the peer; 'Write' uses the target's context to push data to the peer. Reports an NxN bandwidth matrix; diagonal entries are 0.

```bash
./nvbandwidth -t device_to_device_memcpy_write_ce
```

```text
# Running device_to_device_memcpy_write_ce.
# memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0      0.00    276.07    276.36    276.14
# 1    276.19      0.00    276.29    276.29
# 2    276.33    276.29      0.00    276.38
# 3    276.19    276.62    276.24      0.00
```

--------------------------------

### Set CUDA Compute Architecture

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Specify the target GPU architecture for compilation using CMake. Use this to resolve 'Unsupported gpu architecture' errors. `CMAKE_CUDA_ARCHITECTURES` takes the numeric compute capability (for example `90`), not an `sm_` name.

```bash
# For Hopper (compute capability 9.0)
cmake -DCMAKE_CUDA_ARCHITECTURES=90 .
```

--------------------------------

### Troubleshoot Unsupported GPU Architecture

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

When nvcc reports an unsupported GPU architecture, specify the desired compute architecture during the CMake configuration process using `-DCMAKE_CUDA_ARCHITECTURES`.

```bash
# Unsupported GPU architecture
# Error: nvcc fatal: Unsupported gpu architecture 'compute_52'
cmake -DCMAKE_CUDA_ARCHITECTURES=90 .   # e.g. for Hopper
```

--------------------------------

### Enable Verbose Output for Diagnostics

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Enables additional diagnostic output during test execution, including per-iteration bandwidth samples. This can be helpful for detailed analysis of performance.
```bash
./nvbandwidth -v -t device_to_device_memcpy_write_ce
```

--------------------------------

### Limit Multinode Tests with Pair Sampling

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Use the `--targetNumPairs` option to limit multinode device-to-device tests to a sampled subset of GPU pairs. The sampling ensures every GPU participates in at least one pair. Use -1 to test all pairs.

```bash
mpirun -np 8 --hostfile nodes.cfg \
    ./nvbandwidth -p multinode_device_to_device --targetNumPairs 20
```

```bash
mpirun -np 8 --hostfile nodes.cfg \
    ./nvbandwidth -p multinode_device_to_device --targetNumPairs -1
```

--------------------------------

### Set Buffer Size for nvbandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Controls the memory buffer size, in MiB, for copy operations. Default is 512 MiB for peak bandwidth. Latency tests use a fixed 2 MiB buffer.

```bash
# Use 1024 MiB buffer
./nvbandwidth -b 1024

# Use 128 MiB buffer (triggers advisory warning)
./nvbandwidth -b 128
```

--------------------------------

### Measure Host-to-Device and Device-to-Host Memcpy Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures unidirectional PCIe or NVLink bandwidth between host (CPU-pinned) memory and each GPU using `cuMemcpyAsync` (Copy Engine). Reports a 1xN bandwidth matrix. Use `-b` for buffer size and `-i` for sample count.

```bash
./nvbandwidth -t host_to_device_memcpy_ce device_to_host_memcpy_ce -b 512 -i 5
```

```text
# Running host_to_device_memcpy_ce.
# memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     26.03     25.94     25.97     26.00

# Running device_to_host_memcpy_ce.
# memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     25.80     25.71     25.76     25.68
```

--------------------------------

### Measure On-Device Local Memory Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures on-device memory bandwidth (within a single GPU's HBM) using `cuMemcpyAsync`. Reports a 1xN matrix, with one entry per GPU.
```bash
./nvbandwidth -t device_local_copy
# Running device_local_copy.
# memcpy CE GPU(row) local copy bandwidth (GB/s)
#           0         1         2         3
# 0   3200.11   3198.44   3201.07   3199.88
```

--------------------------------

### Set CUDA Environment Variables

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Set PATH and LD_LIBRARY_PATH for a specific CUDA toolkit version. Use this to resolve 'cudaErrorUnsupportedPtxVersion' by ensuring the correct toolkit is used.

```bash
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
```

--------------------------------

### Measure Bidirectional Host-to-Device Memcpy Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures bidirectional PCIe bandwidth where a copy in the measured direction runs simultaneously with an interfering copy in the opposite direction. Only the measured direction's bandwidth is reported.

```bash
./nvbandwidth -t host_to_device_bidirectional_memcpy_ce
```

```text
# Running host_to_device_bidirectional_memcpy_ce.
# memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
#           0         1         2         3
# 0     18.56     18.37     19.37     19.59
```

--------------------------------

### Measure Bidirectional Device-to-Device Memcpy Bandwidth

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures bidirectional peer-to-peer bandwidth between GPUs with an opposing-direction interfering copy running simultaneously. Only the primary direction bandwidth is reported. Use `-b` for buffer size.

```bash
./nvbandwidth -t device_to_device_bidirectional_memcpy_read_ce -b 512
```

--------------------------------

### Troubleshoot PTX Version Mismatch

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

If encountering a PTX version mismatch, ensure the CUDA toolkit's bin directory is in the PATH and then clean and re-run CMake and make.
```bash
# PTX version mismatch
# Error: [cudaErrorUnsupportedPtxVersion]
export PATH=/usr/local/cuda-12.8/bin:$PATH
rm -rf CMakeCache.txt CMakeFiles && cmake . && make
```

--------------------------------

### Check NVLink Status

Source: https://github.com/nvidia/nvbandwidth/blob/main/troubleshooting.md

Verify the status of NVLink connections between GPUs. Helps diagnose 'CUDA_ERROR_NVLINK_UNCORRECTABLE' errors.

```bash
# Check NVLink status
nvidia-smi nvlink -s
```

--------------------------------

### Skip Data Verification to Reduce Runtime

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

By default, nvbandwidth verifies that copied data is correct after each transfer. This flag skips that check to reduce runtime, which can be useful for quick performance checks.

```bash
./nvbandwidth -s -t device_to_device_memcpy_write_ce
```

--------------------------------

### Bidirectional Host <-> Device Memcpy CE Bandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Measures the bidirectional bandwidth between CPU (host) and GPU (device) memory using the Copy Engine (CE). Copies run concurrently in both directions between host and device memory, with results shown in GB/s.

```text
Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0     18.56     18.37     19.37     19.59     18.71     18.79     18.46     18.61
```

--------------------------------

### Measure Host-Device Memory Access Latency

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures host-to-device memory access latency using a GPU pointer-chasing kernel over a fixed 2 MiB host-pinned buffer. The strided linked-list pattern prevents prefetching while keeping TLB entries warm.

```bash
./nvbandwidth -t host_device_latency_sm
# Running host_device_latency_sm.
# Latency SM CPU(row) <-> GPU(column) latency (ns)
#           0         1         2         3
# 0    220.15    218.77    219.42    221.03
```

--------------------------------

### Measure Device-to-Device Memory Access Latency

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Measures peer GPU memory access latency using pointer-chasing over a 2 MiB device buffer. Uses the same methodology as `host_device_latency_sm`; the `--bufferSize` flag is ignored.

```bash
./nvbandwidth -t device_to_device_latency_sm
# Running device_to_device_latency_sm.
# Latency SM GPU(row) <-> GPU(column) latency (ns)
#           0         1         2         3
# 0      0.00    610.34    612.01    609.88
# 1    611.20      0.00    609.65    611.44
```

--------------------------------

### Set Sample Count for Stable Results

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Controls the number of outer measurement iterations. Use a higher number for more stable results. By default the reported bandwidth is the median across all samples; pass `--useMean` to report the mean instead.

```bash
./nvbandwidth -t host_to_device_memcpy_ce -i 10
```

```bash
./nvbandwidth -t host_to_device_memcpy_ce -i 10 --useMean
```

--------------------------------

### Host to Device Memcpy CE Bandwidth

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Measures the bandwidth for copying data from CPU (host) memory to GPU (device) memory using the Copy Engine (CE). Results are presented in a matrix format showing bandwidth in GB/s.

```text
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0     26.03     25.94     25.97     26.00     26.19     25.95     26.00     25.97
```

--------------------------------

### SM Copy Size Calculation

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Formula used by SM copies to calculate the actual byte size of a copy operation. The requested size is rounded down to a multiple of threads per block times the device SM count, so the copy divides evenly across the target device.
```plaintext
(threadsPerBlock * deviceSMCount) * floor(copySize / (threadsPerBlock * deviceSMCount))
```

--------------------------------

### Tune Inner Loop Count for Bandwidth Tests

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

The inner loop count for bandwidth tests, which determines how many times a transfer is repeated within a single sample, can be tuned using the `--loopCount` option. The default is 16.

```bash
./nvbandwidth --loopCount 32 -t device_to_device_memcpy_write_ce
```

```bash
./nvbandwidth -i 3 --useMean -t host_to_device_memcpy_ce
```

--------------------------------

### SM Bidirectional Bandwidth Calculation

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Formula for calculating bidirectional bandwidth when using Streaming Multiprocessor (SM) copies. It is based on the total data size transferred and the kernel execution time.

```plaintext
SM bidir. bandwidth = size/(kernel time);
```

--------------------------------

### Disable CPU Affinity for NUMA Node Minimization

Source: https://context7.com/nvidia/nvbandwidth/llms.txt

Disables the default behavior of pinning threads to the NUMA node closest to the target GPU. This can be useful for testing scenarios where NUMA affinity is not desired or needs to be controlled manually.

```bash
./nvbandwidth -d -t all_to_host_memcpy_ce
```

--------------------------------

### CE Bidirectional Bandwidth Calculation

Source: https://github.com/nvidia/nvbandwidth/blob/main/README.md

Formula for calculating bidirectional bandwidth when using Copy Engine (CE) copies. It is based on the size of data on the measured stream and the time taken on that stream.

```plaintext
CE bidir. bandwidth = (size of data on measured stream) / (time on measured stream)
```
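
--------------------------------

### Worked Example: SM Copy Size Alignment

The SM copy size formula quoted under "SM Copy Size Calculation" can be checked with a few lines of shell arithmetic. This is a minimal sketch, not code from nvbandwidth: `sm_copy_size` is a hypothetical helper name, and the 512 threads per block and 132 SMs in the sample call are illustrative assumptions rather than values taken from the tool or any particular GPU.

```bash
#!/usr/bin/env bash
# Sketch of the README formula:
#   (threadsPerBlock * deviceSMCount) * floor(copySize / (threadsPerBlock * deviceSMCount))
# sm_copy_size is a hypothetical helper, not part of nvbandwidth.
sm_copy_size() {
    local copy_size=$1 threads_per_block=$2 sm_count=$3
    local chunk=$(( threads_per_block * sm_count ))
    # Shell integer division truncates, which matches floor() for non-negative sizes.
    echo $(( chunk * (copy_size / chunk) ))
}

# Assumed inputs: a 512 MiB request, 512 threads per block, 132 SMs.
sm_copy_size $(( 512 * 1024 * 1024 )) 512 132
```

The result is never larger than the requested size, so bandwidth computed from it reflects only the bytes the kernel actually copied.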