### Build MSCCL++ Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/03-memory-channel.md

Build the example code for the Memory Channel tutorial using make. Navigate to the example directory first.

```bash
cd examples/tutorials/03-memory-channel
make
```

--------------------------------

### Build MSCCL++ Example with Make

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Builds the MSCCL++ example code using the make utility. Navigate to the example directory first.

```bash
cd examples/tutorials/01-basic-concepts
make
```

--------------------------------

### Build the Port Channel Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Commands to navigate to the example directory and compile the project using make.

```bash
$ cd examples/tutorials/04-port-channel
$ make
```

--------------------------------

### Build and Run MSCCL++ Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/02-bootstrap-and-communicator.md

Build the example using make and run the executable. Ensure you have at least two GPUs and they are peer-to-peer connected.

```bash
cd examples/tutorials/02-bootstrap-and-communicator
make
```

```default
# ./gpu_ping_pong_mp
GPU 1: Initializing a bootstrap ...
GPU 0: Initializing a bootstrap ...
GPU 0: Creating a connection ...
GPU 1: Creating a connection ...
GPU 0: Creating a semaphore ...
GPU 1: Creating a semaphore ...
GPU 1: Creating a channel ...
GPU 0: Creating a channel ...
GPU 1: Launching a GPU kernel ...
GPU 0: Launching a GPU kernel ...
Elapsed 4.78082 ms per iteration (100)
Succeed!
```

--------------------------------

### Install Project Directories and Files

Source: https://github.com/microsoft/mscclpp/blob/main/include/CMakeLists.txt

This snippet defines the installation rules for the MSCCLPP project, specifying where to install the header files and version information.
```cmake
install(DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/mscclpp DESTINATION include)
install(FILES ${CMAKE_CURRENT_BINARY_DIR}/mscclpp/version.hpp DESTINATION include/mscclpp)
```

--------------------------------

### Install MSCCL++ Binaries

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Final step to install the compiled headers and binaries to the system path.

```bash
$ sudo make install
```

--------------------------------

### Install System Dependencies

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Commands to install required system libraries for MSCCL++ development.

```bash
sudo apt-get install libnuma-dev
```

```bash
sudo apt-get satisfy "python3 (>=3.8), python3-dev (>=3.8)"
```

--------------------------------

### Build and Run MSCCL++ Custom Collective Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/customized-algorithm-with-nccl-api.md

Navigate to the example directory and build the custom collective algorithm. Then, run the executable using LD_PRELOAD to load the MSCCL++ NCCL library.

```bash
cd examples/customized-collective-algorithm
make
```

```bash
LD_PRELOAD=/lib/libmscclpp_nccl.so ./customized_allgather
```

--------------------------------

### Run MSCCL++ Ping-Pong Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Executes the GPU ping-pong example. Root privileges may be required in containerized environments. Ensure at least two GPUs are available and peer-to-peer accessible.

```bash
./gpu_ping_pong
```

--------------------------------

### Run MSCCL++ Memory Channel Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/03-memory-channel.md

Execute the bidirectional memory channel example. Root privileges may be required in containerized environments. Observe the performance metrics for data transfer.
```bash
./bidir_memory_channel
```

--------------------------------

### Install Python Module from Source

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Commands to install the Python bindings for MSCCL++.

```bash
# For NVIDIA platforms
$ python -m pip install .
# For AMD platforms, set the C++ compiler to HIPCC
$ CXX=/opt/rocm/bin/hipcc python -m pip install .
```

--------------------------------

### Port Channel Example Output

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Expected console output when running the bidirectional port channel example.

```default
# ./bidir_port_channel
GPU 0: Preparing for tests ...
GPU 1: Preparing for tests ...
GPU 0: [Bidir PutWithSignal] bytes 1024, elapsed 0.0204875 ms/iter, BW 0.0499818 GB/s
GPU 0: [Bidir PutWithSignal] bytes 1048576, elapsed 0.0250319 ms/iter, BW 41.8896 GB/s
GPU 0: [Bidir PutWithSignal] bytes 134217728, elapsed 0.365497 ms/iter, BW 367.219 GB/s
Succeed!
```

--------------------------------

### Run Bidirectional Port Channel Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Execute the bidirectional port channel example across two nodes. Ensure the environment meets GPUDirect RDMA prerequisites and RDMA networking is configured.

```default
./bidir_port_channel [ ]
```

```bash
$ ./bidir_port_channel 192.168.0.1:50000 0 0 IB0
```

```bash
$ ./bidir_port_channel 192.168.0.1:50000 1 0 IB0
```

--------------------------------

### Install MSCCL++

Source: https://github.com/microsoft/mscclpp/blob/main/docs/dsl/results.md

Installs the MSCCL++ library, including default algorithms. This command is necessary to access pre-generated execution plans.

```bash
python3 -m mscclpp --install
```

--------------------------------

### Install Python Dependencies

Source: https://github.com/microsoft/mscclpp/blob/main/docs/README.md

Installs Python packages required for documentation building from a requirements file.
Ensure the user's local bin directory is in the PATH if installing locally.

```bash
$ sudo python3 -m pip install -r ./requirements.txt
```

--------------------------------

### Initialize and Manage Default MSCCL++ Proxy

Source: https://github.com/microsoft/mscclpp/blob/main/README.md

Demonstrates the lifecycle of the default proxy service, including bootstrap initialization and communicator setup.

```cpp
// Bootstrap: initialize control-plane connections between all ranks
auto bootstrap = std::make_shared<mscclpp::TcpBootstrap>(rank, world_size);
// Create a communicator for connection setup
mscclpp::Communicator comm(bootstrap);
// Setup connections here using `comm`
// ...
// Construct the default proxy (no parentheses, which would declare a function)
mscclpp::ProxyService proxyService;
// Start the proxy
proxyService.startProxy();
// Run the user application, i.e., launch GPU kernels here
// ...
// Stop the proxy after the application is finished
proxyService.stopProxy();
```

--------------------------------

### Install Doxygen and Graphviz

Source: https://github.com/microsoft/mscclpp/blob/main/docs/README.md

Installs Doxygen and Graphviz, which are required for generating documentation. Use this command on Debian-based systems.

```bash
$ sudo apt-get install doxygen graphviz
```

--------------------------------

### Install MSCCL++ Libraries

Source: https://github.com/microsoft/mscclpp/blob/main/src/ext/nccl/CMakeLists.txt

This CMake command specifies the installation paths for the compiled shared libraries (mscclpp_nccl and mscclpp_audit_nccl) within the installation prefix.

```cmake
install(TARGETS mscclpp_nccl LIBRARY DESTINATION ${INSTALL_PREFIX}/lib)
install(TARGETS mscclpp_audit_nccl LIBRARY DESTINATION ${INSTALL_PREFIX}/lib)
```

--------------------------------

### Run Python Performance Benchmark

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Installs required dependencies and executes the AllReduce benchmark using MPI.

```bash
# Choose `requirements_*.txt` according to your CUDA/ROCm version.
$ python3 -m pip install -r ./python/requirements_cuda12.txt
$ mpirun -tag-output -np 8 python3 ./python/mscclpp_benchmark/allreduce_bench.py
```

--------------------------------

### Run mscclpp Tuning Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/mscclpp-torch-integration.md

Execute the mscclpp tuning script using torchrun. Be sure to set the master address and port environment variables (`MSCCLPP_MASTER_ADDR` and `MSCCLPP_MASTER_PORT`).

```bash
MSCCLPP_MASTER_ADDR= MSCCLPP_MASTER_PORT= \
torchrun --nnodes=1 --nproc_per_node=8 customized_comm_with_tuning.py
```

--------------------------------

### Run benchmarks with specific fallback operations

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Examples of running benchmarks with selective fallback configurations.

```bash
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/lib/libmscclpp_nccl.so -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=$NCCL_BUILD/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather" ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50
```

```bash
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/lib/libmscclpp_nccl.so -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=$NCCL_BUILD/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="broadcast" ./build/reduce_scatter_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50
```

--------------------------------

### Launch Development Docker Container

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Commands to start a pre-configured Docker container for NVIDIA or AMD development environments.
```bash
# For NVIDIA platforms
$ docker run -it --privileged --net=host --ipc=host --gpus all --name mscclpp-dev ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.9 bash
# For AMD platforms
$ docker run -it --privileged --net=host --ipc=host --security-opt=seccomp=unconfined --group-add=video --name mscclpp-dev ghcr.io/microsoft/mscclpp/mscclpp:base-dev-rocm6.2 bash
```

--------------------------------

### Configure InfiniBand Endpoint in C++

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Customize MSCCL++ endpoint configuration for InfiniBand transport. This example shows how to set InfiniBand-specific parameters like `maxCqSize` and `maxCqPollNum`.

```cpp
mscclpp::EndpointConfig epConfig;
epConfig.transport = mscclpp::Transport::IB0;
epConfig.device = {mscclpp::DeviceType::GPU, 0};  // GPU 0
// InfiniBand-specific parameters
epConfig.ib.maxCqSize = 8192;
epConfig.ib.maxCqPollNum = 4;
// Create an endpoint and establish a connection
auto conn = comm.connect(epConfig, remoteRank).get();
```

--------------------------------

### ProxyService Host-Side Setup for PortChannels

Source: https://context7.com/microsoft/mscclpp/llms.txt

Sets up communication using ProxyService to manage host-side operations for PortChannels. Resources like semaphores and memory must be registered before creating a PortChannel.
```cpp
#include <mscclpp/core.hpp>
#include <mscclpp/port_channel.hpp>

// Note: portChannelKernel is assumed to be a __global__ kernel defined elsewhere.
void setupPortChannelCommunication(mscclpp::Communicator& comm,
                                   mscclpp::Semaphore& sema,
                                   mscclpp::RegisteredMemory& localMem,
                                   mscclpp::RegisteredMemory& remoteMem) {
  // Create proxy service for handling PortChannel operations
  mscclpp::ProxyService proxyService;

  // Register resources with the proxy service
  mscclpp::SemaphoreId semaId = proxyService.addSemaphore(sema);
  mscclpp::MemoryId localMemId = proxyService.addMemory(localMem);
  mscclpp::MemoryId remoteMemId = proxyService.addMemory(remoteMem);

  // Create a PortChannel using the registered IDs
  mscclpp::PortChannel portChan = proxyService.portChannel(semaId, remoteMemId, localMemId);

  // Get device handle for GPU kernel
  auto devHandle = portChan.deviceHandle();

  // Copy device handle to GPU memory
  mscclpp::PortChannelDeviceHandle* d_handle;
  cudaMalloc(&d_handle, sizeof(mscclpp::PortChannelDeviceHandle));
  cudaMemcpy(d_handle, &devHandle, sizeof(devHandle), cudaMemcpyHostToDevice);

  // Start proxy thread before launching kernels
  proxyService.startProxy();

  // Launch GPU kernels that use the PortChannel
  portChannelKernel<<<1, 32>>>(d_handle, /*myRank=*/0, /*nRanks=*/8, /*numElements=*/1024);
  cudaDeviceSynchronize();

  // Stop proxy after all GPU operations complete
  proxyService.stopProxy();
  cudaFree(d_handle);
}
```

--------------------------------

### Example Output for Successful Custom Collective

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/customized-algorithm-with-nccl-api.md

This is the expected abbreviated output upon successful execution of the custom AllGather algorithm, indicating performance metrics and success.

```text
GPU 0: bytes 268435456, elapsed 7.35012 ms/iter, BW 109.564 GB/s
Succeed!
```

--------------------------------

### Fetch and Configure DLPack

Source: https://github.com/microsoft/mscclpp/blob/main/python/csrc/CMakeLists.txt

Fetches the DLPack library and adds it as a subdirectory to the build.
It excludes DLPack from being installed as part of the main project.

```cmake
FetchContent_Declare(
  dlpack
  GIT_REPOSITORY https://github.com/dmlc/dlpack.git
  GIT_TAG 5c210da409e7f1e51ddf445134a4376fdbd70d7d
)
FetchContent_GetProperties(dlpack)
if(NOT dlpack_POPULATED)
  FetchContent_Populate(dlpack)
  # Add dlpack subdirectory but exclude it from installation
  add_subdirectory(${dlpack_SOURCE_DIR} ${dlpack_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()
```

--------------------------------

### Python CommGroup and Channel Setup

Source: https://context7.com/microsoft/mscclpp/llms.txt

Initializes a communication group from MPI, makes connections to remote ranks using a specified transport, allocates a GPU buffer, and sets up proxy services and port channels for communication. Ensure MPI is initialized before calling.

```python
from mpi4py import MPI
import cupy as cp
import mscclpp
from mscclpp import ProxyService, Transport
from mscclpp.utils import GpuBuffer, KernelBuilder, pack


def setup_communication():
    # Initialize communication group from MPI
    mscclpp_group = mscclpp.CommGroup(MPI.COMM_WORLD)
    rank = mscclpp_group.my_rank
    nranks = mscclpp_group.nranks

    # Create connections to all other ranks
    remote_ranks = [r for r in range(nranks) if r != rank]

    # For intra-node (NVLink):
    transport = Transport.CudaIpc
    # For inter-node (InfiniBand):
    # transport = mscclpp_group.my_ib_device(rank % 8)
    connections = mscclpp_group.make_connection(remote_ranks, transport)

    # Allocate GPU buffer
    nelems = 1024 * 1024
    memory = GpuBuffer(nelems, dtype=cp.float32)

    # Create proxy service and port channels
    proxy_service = ProxyService()
    port_channels = mscclpp_group.make_port_channels(
        proxy_service, memory, connections
    )

    # Start proxy before kernel execution
    proxy_service.start_proxy()
    mscclpp_group.barrier()

    # Launch kernel (see kernel launching example)
    launch_custom_kernel(rank, nranks, port_channels, memory)
    cp.cuda.runtime.deviceSynchronize()
    mscclpp_group.barrier()
    proxy_service.stop_proxy()


if __name__ == "__main__":
    setup_communication()
```

--------------------------------

### Setup Communication Channel with Python API

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/python-api.md

Initializes communication channels for a mesh topology with multiple GPUs using the MSCCL++ Python API. Requires mpi4py and cupy. Ensure the correct transport type is specified.

```python
from mpi4py import MPI
import cupy as cp
import mscclpp
from mscclpp import (
    ProxyService,
    Transport,
)
from mscclpp.utils import GpuBuffer


def create_connection(group: mscclpp.CommGroup, transport: str):
    # Connect to every rank except this one (mesh topology)
    remote_nghrs = list(range(group.nranks))
    remote_nghrs.remove(group.my_rank)
    if transport == "NVLink":
        tran = Transport.CudaIpc
    elif transport == "IB":
        tran = group.my_ib_device(group.my_rank % 8)
    else:
        assert False
    connections = group.make_connection(remote_nghrs, tran)
    return connections


if __name__ == "__main__":
    mscclpp_group = mscclpp.CommGroup(MPI.COMM_WORLD)
    connections = create_connection(mscclpp_group, "NVLink")
    nelems = 1024
    memory = GpuBuffer(nelems, dtype=cp.int32)
    proxy_service = ProxyService()
    simple_channels = mscclpp_group.make_port_channels(proxy_service, memory, connections)
    proxy_service.start_proxy()
    mscclpp_group.barrier()
    # launch_kernel is assumed to be defined elsewhere
    launch_kernel(mscclpp_group.my_rank, mscclpp_group.nranks, simple_channels, memory)
    cp.cuda.runtime.deviceSynchronize()
    mscclpp_group.barrier()
```

--------------------------------

### Generate Doxygen Documents

Source: https://github.com/microsoft/mscclpp/blob/main/docs/README.md

Executes Doxygen to create the initial documentation files. This command should be run from the project root.

```bash
$ doxygen
```

--------------------------------

### Initialize TcpBootstrap with IP and Port

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/02-bootstrap-and-communicator.md

Construct a TcpBootstrap instance with the rank and the total number of ranks, then initialize it with the network interface, IP address, and port number.
The TcpBootstrap will listen on the specified port and accept connections from other processes.

```cpp
auto bootstrap = std::make_shared<mscclpp::TcpBootstrap>(myRank, nRanks);
bootstrap->initialize("lo:127.0.0.1:" PORT_NUMBER);
```

--------------------------------

### Display Allreduce Test Help

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/cpp-examples.md

View the command-line arguments and usage instructions for the allreduce_test_perf utility.

```bash
$ ./bin/allreduce_test_perf --help
USAGE: allreduce_test_perf [-b,--minbytes ] [-e,--maxbytes ] [-i,--stepbytes ] [-f,--stepfactor ] [-n,--iters ] [-w,--warmup_iters ] [-c,--check <0/1>] [-T,--timeout