### Build MSCCL++ Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/03-memory-channel.md

Build the example code for the Memory Channel tutorial using make. Navigate to the example directory first.

```bash
cd examples/tutorials/03-memory-channel
make
```

--------------------------------

### Build MSCCL++ Example with Make

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Builds the MSCCL++ example code using the make utility. Navigate to the example directory first.

```bash
cd examples/tutorials/01-basic-concepts
make
```

--------------------------------

### Build the Port Channel Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Commands to navigate to the example directory and compile the project using make.

```bash
$ cd examples/tutorials/04-port-channel
$ make
```

--------------------------------

### Build and Run MSCCL++ Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/02-bootstrap-and-communicator.md

Build the example using make, then run the executable. Make sure at least two GPUs are available and peer-to-peer accessible.

```bash
cd examples/tutorials/02-bootstrap-and-communicator
make
```

```default
# ./gpu_ping_pong_mp
GPU 1: Initializing a bootstrap ...
GPU 0: Initializing a bootstrap ...
GPU 0: Creating a connection ...
GPU 1: Creating a connection ...
GPU 0: Creating a semaphore ...
GPU 1: Creating a semaphore ...
GPU 1: Creating a channel ...
GPU 0: Creating a channel ...
GPU 1: Launching a GPU kernel ...
GPU 0: Launching a GPU kernel ...
Elapsed 4.78082 ms per iteration (100)
Succeed!
```

--------------------------------

### Install Project Directories and Files

Source: https://github.com/microsoft/mscclpp/blob/main/include/CMakeLists.txt

This snippet defines the installation rules for the MSCCLPP project, specifying where to install the header files and version information.

```cmake
install(DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/mscclpp DESTINATION include)
install(FILES ${CMAKE_CURRENT_BINARY_DIR}/mscclpp/version.hpp DESTINATION include/mscclpp)
```

--------------------------------

### Install MSCCL++ Binaries

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Final step to install the compiled headers and binaries to the system path.

```bash
$ sudo make install
```

--------------------------------

### Install System Dependencies

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Commands to install required system libraries for MSCCL++ development.

```bash
sudo apt-get install libnuma-dev
```

```bash
sudo apt-get satisfy "python3 (>=3.8), python3-dev (>=3.8)"
```

--------------------------------

### Build and Run MSCCL++ Custom Collective Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/customized-algorithm-with-nccl-api.md

Navigate to the example directory and build the custom collective algorithm. Then, run the executable using LD_PRELOAD to load the MSCCL++ NCCL library.

```bash
cd examples/customized-collective-algorithm
make
```

```bash
LD_PRELOAD=<MSCCLPP_INSTALL_DIR>/lib/libmscclpp_nccl.so ./customized_allgather
```

--------------------------------

### Run MSCCL++ Ping-Pong Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Executes the GPU ping-pong example. Root privileges may be required in containerized environments. Ensure at least two GPUs are available and peer-to-peer accessible.

```bash
./gpu_ping_pong
```

--------------------------------

### Run MSCCL++ Memory Channel Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/03-memory-channel.md

Execute the bidirectional memory channel example. Root privileges may be required in containerized environments. Observe the performance metrics for data transfer.

```bash
./bidir_memory_channel
```

--------------------------------

### Install Python Module from Source

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Commands to install the Python bindings for MSCCL++.

```bash
# For NVIDIA platforms
$ python -m pip install .
# For AMD platforms, set the C++ compiler to HIPCC
$ CXX=/opt/rocm/bin/hipcc python -m pip install .
```

--------------------------------

### Port Channel Example Output

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Expected console output when running the bidirectional port channel example.

```default
# ./bidir_port_channel
GPU 0: Preparing for tests ...
GPU 1: Preparing for tests ...
GPU 0: [Bidir PutWithSignal] bytes 1024, elapsed 0.0204875 ms/iter, BW 0.0499818 GB/s
GPU 0: [Bidir PutWithSignal] bytes 1048576, elapsed 0.0250319 ms/iter, BW 41.8896 GB/s
GPU 0: [Bidir PutWithSignal] bytes 134217728, elapsed 0.365497 ms/iter, BW 367.219 GB/s
Succeed!
```
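
The bandwidth column in the sample output above appears to be the transferred bytes divided by the elapsed time per iteration, with 1 GB taken as 10^9 bytes. The following sketch checks that reading against the three sample lines. The function name `bandwidth_gb_per_s` is hypothetical and not part of MSCCL++.

```python
def bandwidth_gb_per_s(num_bytes: int, elapsed_ms: float) -> float:
    """Return bandwidth in GB/s (assuming 1 GB = 1e9 bytes) for one iteration."""
    elapsed_s = elapsed_ms / 1e3
    return num_bytes / elapsed_s / 1e9


# (bytes, elapsed ms/iter, reported GB/s) copied from the sample output above
samples = [
    (1024, 0.0204875, 0.0499818),
    (1048576, 0.0250319, 41.8896),
    (134217728, 0.365497, 367.219),
]

for num_bytes, elapsed_ms, reported in samples:
    computed = bandwidth_gb_per_s(num_bytes, elapsed_ms)
    print(f"{num_bytes:>10} bytes: computed {computed:.4f} GB/s, reported {reported} GB/s")
```

Small differences in the last digits come from the rounding of the printed elapsed time.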

--------------------------------

### Run Bidirectional Port Channel Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Execute the bidirectional port channel example across two nodes. Ensure the environment meets GPUDirect RDMA prerequisites and RDMA networking is configured.

```default
./bidir_port_channel [<ip_port> <rank> <gpu_id> <transport>]
```

```bash
$ ./bidir_port_channel 192.168.0.1:50000 0 0 IB0
```

```bash
$ ./bidir_port_channel 192.168.0.1:50000 1 0 IB0
```

--------------------------------

### Install MSCCL++

Source: https://github.com/microsoft/mscclpp/blob/main/docs/dsl/results.md

Installs the MSCCL++ library, including default algorithms. This command is necessary to access pre-generated execution plans.

```bash
python3 -m mscclpp --install
```

--------------------------------

### Install Python Dependencies

Source: https://github.com/microsoft/mscclpp/blob/main/docs/README.md

Installs Python packages required for documentation building from a requirements file. Ensure the user's local bin directory is in the PATH if installing locally.

```bash
$ sudo python3 -m pip install -r ./requirements.txt
```

--------------------------------

### Initialize and Manage Default MSCCL++ Proxy

Source: https://github.com/microsoft/mscclpp/blob/main/README.md

Demonstrates the lifecycle of the default proxy service, including bootstrap initialization and communicator setup.

```cpp
// Bootstrap: initialize control-plane connections between all ranks
auto bootstrap = std::make_shared<mscclpp::TcpBootstrap>(rank, world_size);
// Create a communicator for connection setup
mscclpp::Communicator comm(bootstrap);
// Setup connections here using `comm`
...
// Construct the default proxy
mscclpp::ProxyService proxyService;  // no parentheses: `proxyService()` would declare a function
// Start the proxy
proxyService.startProxy();
// Run the user application, i.e., launch GPU kernels here
...
// Stop the proxy after the application is finished
proxyService.stopProxy();
```

--------------------------------

### Install Doxygen and Graphviz

Source: https://github.com/microsoft/mscclpp/blob/main/docs/README.md

Installs Doxygen and Graphviz, which are required for generating documentation. Use this command on Debian-based systems.

```bash
$ sudo apt-get install doxygen graphviz
```

--------------------------------

### Install MSCCL++ Libraries

Source: https://github.com/microsoft/mscclpp/blob/main/src/ext/nccl/CMakeLists.txt

This CMake command specifies the installation paths for the compiled shared libraries (mscclpp_nccl and mscclpp_audit_nccl) within the installation prefix.

```cmake
install(TARGETS mscclpp_nccl
    LIBRARY DESTINATION ${INSTALL_PREFIX}/lib)
install(TARGETS mscclpp_audit_nccl
    LIBRARY DESTINATION ${INSTALL_PREFIX}/lib)
```

--------------------------------

### Run Python Performance Benchmark

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Installs required dependencies and executes the AllReduce benchmark using MPI.

```bash
# Choose `requirements_*.txt` according to your CUDA/ROCm version.
$ python3 -m pip install -r ./python/requirements_cuda12.txt
$ mpirun -tag-output -np 8 python3 ./python/mscclpp_benchmark/allreduce_bench.py
```

--------------------------------

### Run mscclpp Tuning Example

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/mscclpp-torch-integration.md

Execute the mscclpp tuning script using torchrun. Set the master address and port environment variables before launching.

```bash
MSCCLPP_MASTER_ADDR=<ip> MSCCLPP_MASTER_PORT=<port> \
  torchrun --nnodes=1 --nproc_per_node=8 customized_comm_with_tuning.py
```

--------------------------------

### Run benchmarks with specific fallback operations

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Examples of running benchmarks with selective fallback configurations.

```bash
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/lib/libmscclpp_nccl.so -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=$NCCL_BUILD/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather" ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50
```

```bash
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/lib/libmscclpp_nccl.so -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=$NCCL_BUILD/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="broadcast" ./build/reduce_scatter_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50
```
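
When several fallback configurations need to be compared, it can help to assemble these long command lines programmatically. The sketch below only rebuilds the argument list shown above. The helper name `build_fallback_command` and the example paths are hypothetical, and the benchmark flags are copied verbatim from the commands above.

```python
import shlex


def build_fallback_command(benchmark, fallback_ops, mscclpp_build, nccl_build, num_procs=8):
    """Assemble the mpirun argument list used in the examples above."""
    env = {
        "LD_PRELOAD": f"{mscclpp_build}/lib/libmscclpp_nccl.so",
        "MSCCLPP_ENABLE_NCCL_FALLBACK": "TRUE",
        "MSCCLPP_NCCL_LIB_PATH": f"{nccl_build}/lib/libnccl.so",
        # Comma-separated list of operations that are forced to fall back
        "MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION": ",".join(fallback_ops),
    }
    argv = ["mpirun", "-np", str(num_procs), "--bind-to", "numa", "--allow-run-as-root"]
    for key, value in env.items():
        argv += ["-x", f"{key}={value}"]
    argv += [benchmark, "-b", "1K", "-e", "256M", "-f", "2", "-d", "half",
             "-G", "20", "-w", "10", "-n", "50"]
    return argv


# Hypothetical build locations, for illustration only
command = build_fallback_command(
    "./build/all_reduce_perf",
    ["allreduce", "allgather"],
    mscclpp_build="/opt/mscclpp/build",
    nccl_build="/opt/nccl/build",
)
print(shlex.join(command))
```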

--------------------------------

### Launch Development Docker Container

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Commands to start a pre-configured Docker container for NVIDIA or AMD development environments.

```bash
# For NVIDIA platforms
$ docker run -it --privileged --net=host --ipc=host --gpus all --name mscclpp-dev ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.9 bash
# For AMD platforms
$ docker run -it --privileged --net=host --ipc=host --security-opt=seccomp=unconfined --group-add=video --name mscclpp-dev ghcr.io/microsoft/mscclpp/mscclpp:base-dev-rocm6.2 bash
```

--------------------------------

### Configure InfiniBand Endpoint in C++

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Customize MSCCL++ endpoint configuration for InfiniBand transport. This example shows how to set InfiniBand-specific parameters like `maxCqSize` and `maxCqPollNum`.

```cpp
mscclpp::EndpointConfig epConfig;
epConfig.transport = mscclpp::Transport::IB0;
epConfig.device = {mscclpp::DeviceType::GPU, 0}; // GPU 0
// InfiniBand-specific parameters
epConfig.ib.maxCqSize = 8192;
epConfig.ib.maxCqPollNum = 4;
// Create an endpoint and establish a connection
auto conn = comm.connect(epConfig, remoteRank).get();
```

--------------------------------

### ProxyService Host-Side Setup for PortChannels

Source: https://context7.com/microsoft/mscclpp/llms.txt

Sets up communication using ProxyService to manage host-side operations for PortChannels. Resources like semaphores and memory must be registered before creating a PortChannel.

```cpp
#include <mscclpp/proxy_service.hpp>
#include <mscclpp/communicator.hpp>

void setupPortChannelCommunication(mscclpp::Communicator& comm,
                                    mscclpp::Semaphore& sema,
                                    mscclpp::RegisteredMemory& localMem,
                                    mscclpp::RegisteredMemory& remoteMem) {
    // Create proxy service for handling PortChannel operations
    mscclpp::ProxyService proxyService;

    // Register resources with the proxy service
    mscclpp::SemaphoreId semaId = proxyService.addSemaphore(sema);
    mscclpp::MemoryId localMemId = proxyService.addMemory(localMem);
    mscclpp::MemoryId remoteMemId = proxyService.addMemory(remoteMem);

    // Create a PortChannel using the registered IDs
    mscclpp::PortChannel portChan = proxyService.portChannel(semaId, remoteMemId, localMemId);

    // Get device handle for GPU kernel
    auto devHandle = portChan.deviceHandle();

    // Copy device handle to GPU memory
    mscclpp::PortChannelDeviceHandle* d_handle;
    cudaMalloc(&d_handle, sizeof(mscclpp::PortChannelDeviceHandle));
    cudaMemcpy(d_handle, &devHandle, sizeof(devHandle), cudaMemcpyHostToDevice);

    // Start proxy thread before launching kernels
    proxyService.startProxy();

    // Launch GPU kernels that use the PortChannel
    portChannelKernel<<<1, 32>>>(d_handle, /*myRank=*/0, /*nRanks=*/8, /*numElements=*/1024);
    cudaDeviceSynchronize();

    // Stop proxy after all GPU operations complete
    proxyService.stopProxy();

    cudaFree(d_handle);
}
```

--------------------------------

### Example Output for Successful Custom Collective

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/customized-algorithm-with-nccl-api.md

This is the expected abbreviated output upon successful execution of the custom AllGather algorithm, indicating performance metrics and success.

```text
GPU 0: bytes 268435456, elapsed 7.35012 ms/iter, BW 109.564 GB/s
Succeed!
```

--------------------------------

### Fetch and Configure DLPack

Source: https://github.com/microsoft/mscclpp/blob/main/python/csrc/CMakeLists.txt

Fetches the DLPack library and adds it as a subdirectory to the build. It excludes DLPack from being installed as part of the main project.

```cmake
FetchContent_Declare(
    dlpack
    GIT_REPOSITORY https://github.com/dmlc/dlpack.git
    GIT_TAG 5c210da409e7f1e51ddf445134a4376fdbd70d7d
)

FetchContent_GetProperties(dlpack)
if(NOT dlpack_POPULATED)
    FetchContent_Populate(dlpack)
    # Add dlpack subdirectory but exclude it from installation
    add_subdirectory(${dlpack_SOURCE_DIR} ${dlpack_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()
```

--------------------------------

### Python CommGroup and Channel Setup

Source: https://context7.com/microsoft/mscclpp/llms.txt

Initializes a communication group from MPI, makes connections to remote ranks using a specified transport, allocates a GPU buffer, and sets up proxy services and port channels for communication. Ensure MPI is initialized before calling.

```python
from mpi4py import MPI
import cupy as cp
import mscclpp
from mscclpp import ProxyService, Transport
from mscclpp.utils import GpuBuffer, KernelBuilder, pack

def setup_communication():
    # Initialize communication group from MPI
    mscclpp_group = mscclpp.CommGroup(MPI.COMM_WORLD)
    rank = mscclpp_group.my_rank
    nranks = mscclpp_group.nranks

    # Create connections to all other ranks
    remote_ranks = [r for r in range(nranks) if r != rank]

    # For intra-node (NVLink):
    transport = Transport.CudaIpc
    # For inter-node (InfiniBand):
    # transport = mscclpp_group.my_ib_device(rank % 8)

    connections = mscclpp_group.make_connection(remote_ranks, transport)

    # Allocate GPU buffer
    nelems = 1024 * 1024
    memory = GpuBuffer(nelems, dtype=cp.float32)

    # Create proxy service and port channels
    proxy_service = ProxyService()
    port_channels = mscclpp_group.make_port_channels(
        proxy_service, memory, connections
    )

    # Start proxy before kernel execution
    proxy_service.start_proxy()
    mscclpp_group.barrier()

    # Launch kernel (see kernel launching example)
    launch_custom_kernel(rank, nranks, port_channels, memory)

    cp.cuda.runtime.deviceSynchronize()
    mscclpp_group.barrier()
    proxy_service.stop_proxy()

if __name__ == "__main__":
    setup_communication()
```

--------------------------------

### Setup Communication Channel with Python API

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/python-api.md

Initializes communication channels for a mesh topology with multiple GPUs using the MSCCL++ Python API. Requires mpi4py and cupy. The `transport` argument must be `"NVLink"` or `"IB"`; any other value trips the assertion.

```python
from mpi4py import MPI
import cupy as cp
import mscclpp  # needed for mscclpp.CommGroup below

from mscclpp import (
    ProxyService,
    Transport,
)
from mscclpp.utils import GpuBuffer


def create_connection(group: mscclpp.CommGroup, transport: str):
    remote_nghrs = list(range(group.nranks))
    remote_nghrs.remove(group.my_rank)
    if transport == "NVLink":
        tran = Transport.CudaIpc
    elif transport == "IB":
        tran = group.my_ib_device(group.my_rank % 8)
    else:
        assert False
    connections = group.make_connection(remote_nghrs, tran)
    return connections

if __name__ == "__main__":
    mscclpp_group = mscclpp.CommGroup(MPI.COMM_WORLD)
    connections = create_connection(mscclpp_group, "NVLink")
    nelems = 1024
    memory = GpuBuffer(nelems, dtype=cp.int32)
    proxy_service = ProxyService()
    simple_channels = mscclpp_group.make_port_channels(proxy_service, memory, connections)
    proxy_service.start_proxy()
    mscclpp_group.barrier()
    launch_kernel(mscclpp_group.my_rank, mscclpp_group.nranks, simple_channels, memory)
    cp.cuda.runtime.deviceSynchronize()
    mscclpp_group.barrier()
```
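
The mesh setup above boils down to two small pieces of index arithmetic: every rank connects to all ranks except itself, and the InfiniBand device index is derived from `my_rank % 8`. The sketch below isolates that arithmetic in plain Python so it can be checked without GPUs. The function names are hypothetical, and reading the constant 8 as "GPUs per node" is an assumption, not something the source states.

```python
def remote_neighbors(my_rank: int, nranks: int) -> list:
    """Ranks this process connects to in a full mesh: everyone but itself."""
    return [r for r in range(nranks) if r != my_rank]


def ib_device_index(my_rank: int, gpus_per_node: int = 8) -> int:
    """Index passed to my_ib_device() in the example (assumes 8 GPUs per node)."""
    return my_rank % gpus_per_node


nranks = 4
for rank in range(nranks):
    print(rank, remote_neighbors(rank, nranks), ib_device_index(rank))

# A full mesh has nranks * (nranks - 1) directed connections in total
total_links = sum(len(remote_neighbors(r, nranks)) for r in range(nranks))
print("directed connections:", total_links)
```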

--------------------------------

### Generate Doxygen Documents

Source: https://github.com/microsoft/mscclpp/blob/main/docs/README.md

Executes Doxygen to create the initial documentation files. This command should be run from the project root.

```bash
$ doxygen
```

--------------------------------

### Initialize TcpBootstrap with IP and Port

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/02-bootstrap-and-communicator.md

Construct a TcpBootstrap instance with the rank and the total number of ranks, then call `initialize()` with a string that combines the network interface, IP address, and port number. The TcpBootstrap will listen on the specified port and accept connections from other processes.

```cpp
auto bootstrap = std::make_shared<mscclpp::TcpBootstrap>(myRank, nRanks);
bootstrap->initialize("lo:127.0.0.1:" PORT_NUMBER);
```
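
The argument to `initialize()` follows an `interface:ip:port` layout. As a sketch, the helpers below build and split such a string. Both function names are hypothetical, the port value in the usage lines is an arbitrary example, and the parser assumes an IPv4 address (an IPv6 address contains colons and would need different handling).

```python
def format_bootstrap_addr(interface: str, ip: str, port: int) -> str:
    """Build an 'interface:ip:port' string for a TCP bootstrap."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"{interface}:{ip}:{port}"


def parse_bootstrap_addr(addr: str):
    """Split 'interface:ip:port' into its parts (IPv4 only)."""
    interface, ip, port = addr.split(":")
    return interface, ip, int(port)


addr = format_bootstrap_addr("lo", "127.0.0.1", 50505)
print(addr)
print(parse_bootstrap_addr(addr))
```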

--------------------------------

### Display Allreduce Test Help

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/cpp-examples.md

View the command-line arguments and usage instructions for the allreduce_test_perf utility.

```bash
$ ./bin/allreduce_test_perf --help
USAGE: allreduce_test_perf
        [-b,--minbytes <min size in bytes>]
        [-e,--maxbytes <max size in bytes>]
        [-i,--stepbytes <increment size>]
        [-f,--stepfactor <increment factor>]
        [-n,--iters <iteration count>]
        [-w,--warmup_iters <warmup iteration count>]
        [-c,--check <0/1>]
        [-T,--timeout <time in seconds>]
        [-G,--cudagraph <num graph launches>]
        [-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
        [-k,--kernel_num <kernel number of commnication primitive>]
        [-o, --output_file <output file name>]
        [-h,--help]
```
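
To get a feel for how `--minbytes`, `--maxbytes`, and `--stepfactor` interact, the sketch below enumerates the message sizes of a geometric sweep. It assumes the tool starts at the minimum size and multiplies by the step factor until the maximum is exceeded, and that size suffixes such as `1K` and `256M` are powers of 1024 (the nccl-tests convention). Neither assumption is confirmed by the help text above, and both function names are hypothetical.

```python
_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert strings like '1K' or '256M' to bytes (binary multiples assumed)."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def sweep_sizes(min_bytes: int, max_bytes: int, step_factor: int = 2) -> list:
    """List the message sizes visited by a multiplicative sweep."""
    if min_bytes <= 0 or step_factor < 2:
        raise ValueError("min_bytes must be positive and step_factor at least 2")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= step_factor
    return sizes


sizes = sweep_sizes(parse_size("1K"), parse_size("256M"), step_factor=2)
print(len(sizes), "sizes from", sizes[0], "to", sizes[-1])
```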

--------------------------------

### Clone and Build MSCCL++ from Source

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Standard procedure for cloning the repository and configuring the build environment.

```bash
$ git clone https://github.com/microsoft/mscclpp.git
$ mkdir -p mscclpp/build && cd mscclpp/build
```

--------------------------------

### Configure Python and Nanobind

Source: https://github.com/microsoft/mscclpp/blob/main/python/test/CMakeLists.txt

Finds the required Python version and declares the nanobind dependency. Ensure Python 3.8+ is installed.

```cmake
find_package(Python 3.8 COMPONENTS Interpreter Development.Module REQUIRED)
include(FetchContent)
FetchContent_Declare(nanobind GIT_REPOSITORY https://github.com/wjakob/nanobind.git GIT_TAG v1.4.0)
FetchContent_MakeAvailable(nanobind)
```

--------------------------------

### Get CUDA Stream Handle

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/mscclpp-torch-integration.md

Retrieve the CUDA stream handle from the current PyTorch stream for use with MSCCL++ operations.

```python
stream_handle = torch.cuda.current_stream().cuda_stream
```

--------------------------------

### Define Common Test Libraries and Includes

Source: https://github.com/microsoft/mscclpp/blob/main/test/CMakeLists.txt

Sets up common libraries and include directories for test executables. Conditionally includes InfiniBand libraries if MSCCLPP is configured to use them.

```cmake
set(TEST_LIBS_COMMON mscclpp ${GPU_LIBRARIES} ${NUMA_LIBRARIES} Threads::Threads)
if(MSCCLPP_USE_IB)
    list(APPEND TEST_LIBS_COMMON ${IBVERBS_LIBRARIES})
endif()
set(TEST_INC_COMMON PRIVATE ${PROJECT_SOURCE_DIR}/include SYSTEM PRIVATE ${GPU_INCLUDE_DIRS})
set(TEST_INC_INTERNAL PRIVATE ${PROJECT_SOURCE_DIR}/src/core/include)
```

--------------------------------

### Implement Custom Proxy Service and GPU Triggers

Source: https://github.com/microsoft/mscclpp/blob/main/README.md

Shows how to define a custom proxy service to handle specific GPU-sent triggers and the corresponding device-side code to push those triggers.

```cpp
// Proxy FIFO is obtained from mscclpp::Proxy on the host and copied to the device.
__device__ mscclpp::FifoDeviceHandle fifo;
__global__ void gpuKernel() {
  ...
  // Only one thread is needed for the following
  mscclpp::ProxyTrigger trigger;
  // Send a custom request: "1"
  trigger.fst = 1;
  fifo.push(trigger);
  // Send a custom request: "2"
  trigger.fst = 2;
  fifo.push(trigger);
  // Send a custom request: "0xdeadbeef"
  trigger.fst = 0xdeadbeef;
  fifo.push(trigger);
  ...
}

// Host-side custom proxy service
class CustomProxyService {
private:
  mscclpp::Proxy proxy_;
public:
  CustomProxyService() : proxy_([&](mscclpp::ProxyTrigger trigger) {
                                  // Custom trigger handler
                                  if (trigger.fst == 1) {
                                    // Handle request "1"
                                  } else if (trigger.fst == 2) {
                                    // Handle request "2"
                                  } else if (trigger.fst == 0xdeadbeef) {
                                    // Handle request "0xdeadbeef"
                                  }
                                },
                                [&]() { /* Empty proxy initializer */ }) {}
  void startProxy() { proxy_.start(); }
  void stopProxy()  { proxy_.stop(); }
};
```
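
The control flow of the custom proxy above (a producer pushes triggers into a FIFO, and a host thread pops them and dispatches to a handler) can be illustrated with a host-only analogy. The sketch below uses Python's standard `queue` and `threading` modules. `ToyProxy` is a hypothetical stand-in, not the MSCCL++ API, and it involves no GPU.

```python
import queue
import threading

STOP = object()  # sentinel that tells the worker thread to exit


class ToyProxy:
    """Host-only stand-in for a proxy: one thread drains a FIFO of triggers."""

    def __init__(self, handler):
        self._handler = handler
        self._fifo = queue.Queue()
        self._thread = None

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def push(self, trigger):
        self._fifo.put(trigger)

    def stop(self):
        # The sentinel is queued behind all pending triggers, so they are handled first
        self._fifo.put(STOP)
        self._thread.join()

    def _run(self):
        while True:
            trigger = self._fifo.get()
            if trigger is STOP:
                break
            self._handler(trigger)


handled = []
proxy = ToyProxy(handled.append)
proxy.start()
for request in (1, 2, 0xDEADBEEF):
    proxy.push(request)
proxy.stop()
print([hex(t) for t in handled])
```

As in the C++ example, the proxy is started before any trigger is pushed and stopped only after the producer has finished.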

--------------------------------

### Initialize Bootstrap and Communicator

Source: https://context7.com/microsoft/mscclpp/llms.txt

Initializes TCP bootstrap for inter-process communication and creates a communicator for GPU connections. Connects to a remote rank using CUDA IPC and builds a semaphore for synchronization. A barrier synchronization across all ranks is performed.

```cpp
#include <mscclpp/communicator.hpp>
#include <mscclpp/core.hpp>

int main() {
    int myRank = 0;   // Current process rank
    int nRanks = 8;   // Total number of processes

    // Initialize TCP bootstrap for control plane communication
    auto bootstrap = std::make_shared<mscclpp::TcpBootstrap>(myRank, nRanks);
    bootstrap->initialize("eth0:192.168.1.1:50000");  // interface:ip:port

    // Create communicator for GPU connections
    mscclpp::Communicator comm(bootstrap);

    // Connect to a remote rank using CUDA IPC transport (intra-node)
    int remoteRank = (myRank + 1) % nRanks;
    auto connFuture = comm.connect({mscclpp::Transport::CudaIpc,
                                    {mscclpp::DeviceType::GPU, myRank}}, remoteRank);
    auto conn = connFuture.get();

    // Build a semaphore for synchronization
    auto semaFuture = comm.buildSemaphore(conn, remoteRank);
    auto sema = semaFuture.get();

    // Barrier synchronization across all ranks
    bootstrap->barrier();

    return 0;
}
```

--------------------------------

### Perform bidirectional data copy with get()

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/03-memory-channel.md

Reads data from a remote memory region into local memory. Does not require explicit post-copy synchronization as it is a read operation.

```cpp
// Grid-wide barrier helper used by the kernel below
__device__ mscclpp::DeviceSyncer devSyncer;

__global__ void bidirGetKernel(mscclpp::MemoryChannelDeviceHandle *devHandle, size_t copyBytes, int myRank) {
  const int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid == 0) {
    devHandle->relaxedSignal();
    devHandle->relaxedWait();
  }
  devSyncer.sync(gridDim.x);

  const int remoteRank = myRank ^ 1;
  const uint64_t srcOffset = remoteRank * copyBytes;
  const uint64_t dstOffset = srcOffset;
  devHandle->get(srcOffset, dstOffset, copyBytes, /*threadId*/ tid, /*numThreads*/ blockDim.x * gridDim.x);
}
```
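
The offset arithmetic in this kernel pairs ranks with `myRank ^ 1` and reads the slice that belongs to the peer. The plain Python sketch below mirrors those two lines so the resulting offsets can be inspected. The function names are hypothetical, and the two-rank layout (each rank owning one `copyBytes`-sized slice) is an interpretation of the kernel, not a statement from the source.

```python
def peer_rank(my_rank: int) -> int:
    """XOR with 1 flips the lowest bit, pairing ranks 0<->1, 2<->3, and so on."""
    return my_rank ^ 1


def get_offsets(my_rank: int, copy_bytes: int):
    """Source and destination byte offsets used by the get() call in the kernel."""
    src_offset = peer_rank(my_rank) * copy_bytes
    dst_offset = src_offset  # same offset on both sides
    return src_offset, dst_offset


for rank in (0, 1):
    print(f"rank {rank}: peer {peer_rank(rank)}, offsets {get_offsets(rank, 4096)}")
```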

--------------------------------

### Bidirectional Data Transfer with MemoryChannel Device API

Source: https://context7.com/microsoft/mscclpp/llms.txt

Demonstrates a bidirectional data transfer using put() and get() operations. Ensure proper synchronization before and after data transfer.

```cpp
#include <mscclpp/memory_channel_device.hpp>

// Device synchronization helper
__device__ mscclpp::DeviceSyncer devSyncer;

__global__ void bidirDataTransferKernel(
    mscclpp::MemoryChannelDeviceHandle* devHandle,
    size_t copyBytes,
    int myRank
) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const int numThreads = blockDim.x * gridDim.x;

    // Step 1: Signal readiness and wait for peer
    if (tid == 0) {
        devHandle->relaxedSignal();  // Relaxed: no memory ordering guarantee
        devHandle->relaxedWait();    // Wait for peer to signal
    }
    devSyncer.sync(gridDim.x);  // Synchronize all thread blocks

    // Step 2: Copy data to remote GPU using put()
    // All threads participate in the copy for high bandwidth
    const uint64_t srcOffset = myRank * copyBytes;
    const uint64_t dstOffset = srcOffset;
    devHandle->put(dstOffset, srcOffset, copyBytes, tid, numThreads);

    // Step 3: Synchronize completion
    devSyncer.sync(gridDim.x);
    if (tid == 0) {
        devHandle->signal();  // Full memory fence before signal
        devHandle->wait();    // Wait for peer's completion signal
    }
}

// Using get() for reading from remote memory
__global__ void getDataKernel(mscclpp::MemoryChannelDeviceHandle* devHandle,
                               size_t bytes, int tid, int numThreads) {
    // Read from remote memory into local buffer
    devHandle->get(/*offset=*/0, bytes, tid, numThreads);
}

// Direct read/write for individual values
__global__ void directAccessKernel(mscclpp::MemoryChannelDeviceHandle* devHandle) {
    if (threadIdx.x == 0) {
        // Read a single value from remote memory
        int value = devHandle->read<int>(/*index=*/0);
        // Write a value to remote memory
        devHandle->write<int>(/*index=*/1, value + 1);
    }
}
```

--------------------------------

### Define and Link MSCCLPP Python Module

Source: https://github.com/microsoft/mscclpp/blob/main/python/csrc/CMakeLists.txt

Defines the MSCCLPP Python module using nanobind, sets its output name and installation path, and links necessary libraries and include directories.

```cmake
file(GLOB_RECURSE SOURCES CONFIGURE_DEPENDS *.cpp)
nanobind_add_module(mscclpp_py ${SOURCES})
set_target_properties(mscclpp_py PROPERTIES OUTPUT_NAME _mscclpp)
set_target_properties(mscclpp_py PROPERTIES INSTALL_RPATH "$ORIGIN/lib")
target_link_libraries(mscclpp_py PRIVATE dlpack mscclpp mscclpp_collectives ${GPU_LIBRARIES})
target_include_directories(mscclpp_py SYSTEM PRIVATE ${GPU_INCLUDE_DIRS})
install(TARGETS mscclpp_py LIBRARY DESTINATION .)
```

--------------------------------

### Run Unit Tests

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Compiles and executes basic unit tests requiring a single GPU.

```bash
$ make -j unit_tests
$ ./bin/unit_tests
```

--------------------------------

### Build and Run MSCCL++ with NCCL Compatibility

Source: https://context7.com/microsoft/mscclpp/llms.txt

Commands to build the library from source and execute performance tests or training scripts using LD_PRELOAD or LD_AUDIT for NCCL replacement.

```bash
# Build MSCCL++ from source
git clone https://github.com/microsoft/mscclpp.git
mkdir -p mscclpp/build && cd mscclpp/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)

# Run nccl-tests with MSCCL++
mpirun -np 8 --bind-to numa --allow-run-as-root \
    -x LD_PRELOAD=$PWD/lib/libmscclpp_nccl.so \
    ./all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50

# Enable fallback to NCCL for unsupported operations
mpirun -np 8 --bind-to numa --allow-run-as-root \
    -x LD_PRELOAD=$PWD/lib/libmscclpp_nccl.so \
    -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE \
    -x MSCCLPP_NCCL_LIB_PATH=/usr/lib/libnccl.so \
    -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="broadcast" \
    ./all_reduce_perf -b 1K -e 256M -f 2

# Using audit shim (avoids LD_PRELOAD issues in some environments)
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_training_script.py
```

--------------------------------

### Get Tuned Configuration for Allreduce

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/mscclpp-torch-integration.md

Retrieves the best-tuned configuration for a given message size. Sizes are clamped to the range from 1 KiB to 256 MiB and otherwise rounded up to the next power of two. Returns `None` if no tuned configuration exists for the resulting size.

```python
def get_tuned_config(self, size):
    if size < 1024:
        target_size = 1024
    elif size > 256 * 1024 * 1024:
        target_size = 256 * 1024 * 1024
    else:
        target_size = 1 << (size - 1).bit_length()
    return self.best_configs.get(target_size)
```
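
The bucketing logic in `get_tuned_config` can be exercised on its own. The sketch below extracts it into a free function and looks up a placeholder table. `tuned_bucket` and the contents of `best_configs` are hypothetical and serve only to show which sizes map to which bucket.

```python
MIN_SIZE = 1024               # 1 KiB lower clamp
MAX_SIZE = 256 * 1024 * 1024  # 256 MiB upper clamp


def tuned_bucket(size: int) -> int:
    """Map a message size to the power-of-two bucket used as the lookup key."""
    if size < MIN_SIZE:
        return MIN_SIZE
    if size > MAX_SIZE:
        return MAX_SIZE
    # (size - 1).bit_length() gives the exponent of the next power of two >= size
    return 1 << (size - 1).bit_length()


# Placeholder table: the real entries come from the tuning run
best_configs = {2048: "hypothetical-config-for-2KiB"}

for size in (100, 1024, 1025, 1500, 5000, 1 << 30):
    bucket = tuned_bucket(size)
    print(size, "->", bucket, "->", best_configs.get(bucket))
```

Sizes that are already powers of two map to themselves, and a bucket without a tuned entry yields `None`, so callers need to handle that case.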

--------------------------------

### MSCCL++ Port Channel CUDA Kernel

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/python-api.md

A CUDA kernel for MSCCL++ that utilizes PortChannelDeviceHandle for inter-GPU communication. It performs putWithSignalAndFlush and wait operations. Ensure correct device handle setup.

```cuda
#include <mscclpp/packet_device.hpp>
#include <mscclpp/port_channel_device.hpp>

// Be careful: channels[my_rank] is invalid; the entry exists only to keep indexing simple.
extern "C" __global__ void __launch_bounds__(1024, 1)
    port_channel(mscclpp::PortChannelDeviceHandle* channels, int my_rank, int nranks,
                         int num_elements) {
    int tid = threadIdx.x;
    int nthreads = blockDim.x;
    uint64_t size_per_rank = (num_elements * sizeof(int)) / nranks;
    uint64_t my_offset = size_per_rank * my_rank;
    __syncthreads();
    if (tid < nranks && tid != my_rank) {
      channels[tid].putWithSignalAndFlush(my_offset, my_offset, size_per_rank);
      channels[tid].wait();
    }
}
```
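
The kernel splits the buffer evenly across ranks and has each rank send its own slice to every other rank. The Python sketch below reproduces that partitioning arithmetic on the host, assuming 4-byte `int` elements as in the kernel. The function names are hypothetical.

```python
SIZEOF_INT = 4  # bytes per element, matching sizeof(int) in the kernel


def rank_slice(my_rank: int, nranks: int, num_elements: int):
    """Return (byte offset, byte size) of the slice owned by my_rank."""
    size_per_rank = (num_elements * SIZEOF_INT) // nranks
    my_offset = size_per_rank * my_rank
    return my_offset, size_per_rank


def destinations(my_rank: int, nranks: int) -> list:
    """Thread ids that issue a put: one per remote rank, skipping my_rank."""
    return [tid for tid in range(nranks) if tid != my_rank]


nranks, num_elements = 8, 1024
for rank in range(nranks):
    offset, size = rank_slice(rank, nranks, num_elements)
    print(f"rank {rank}: bytes [{offset}, {offset + size}) -> peers {destinations(rank, nranks)}")
```

Because the division is integer division, `num_elements * sizeof(int)` should be a multiple of `nranks`; otherwise the trailing bytes are not transferred.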

--------------------------------

### Create SemaphoreStub instances

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Initialize SemaphoreStub objects using established connections from each endpoint.

```cpp
// From gpu_ping_pong.cu, lines 77 and 83
mscclpp::SemaphoreStub semaStub0(conn0);
mscclpp::SemaphoreStub semaStub1(conn1);
```

--------------------------------

### Simple AllGather Algorithm in MSCCL++ DSL

Source: https://github.com/microsoft/mscclpp/blob/main/docs/dsl/quick_start.md

A basic 2-GPU AllGather implementation demonstrating MSCCL++ DSL concepts. It requires importing necessary components from mscclpp.language.

```python
from mscclpp.language import *

def simple_allgather(name):
    """
    A simple AllGather implementation using the MSCCL++ DSL.
    
    This example demonstrates a 2-GPU AllGather where each GPU sends
    its data to all other GPUs, so all GPUs end up with everyone's data.
    
    Args:
        name: Algorithm name for identification
    """
    num_gpus = 2
    chunk_factor = 1  # Split data into num_gpus chunks
    
    # Define the collective operation
    collective = AllGather(num_gpus, chunk_factor, inplace=True)
    
    # Create the program context
    with CollectiveProgram(
        name,
        collective,
        num_gpus,
        protocol="Simple",  # Use Simple protocol (vs "LL" for low-latency)
        min_message_size=0,
        max_message_size=2**30  # 1GB
    ):
        # Loop over each source GPU rank
        for src_rank in range(num_gpus):
            # Create a Rank object for the source GPU
            rank = Rank(src_rank)
            # Get the output buffer where the data is stored
            src_buffer = rank.get_output_buffer()
            # Take a slice corresponding to this rank's data
            src_chunk = src_buffer[src_rank:src_rank + 1]
            
            # Loop over each destination GPU rank
            for dst_rank in range(num_gpus):
                # Skip sending from a rank to itself
                if src_rank != dst_rank:
                    # Create a Rank object for the destination GPU
                    dst_rank_obj = Rank(dst_rank)
                    # Get the destination buffer where data will be sent
                    dst_buffer = dst_rank_obj.get_output_buffer()
                    # Take a slice where the data will be placed
                    dst_chunk = dst_buffer[src_rank:src_rank + 1]
                    
                    # Define a channel from src_rank → dst_rank
                    channel = MemoryChannel(dst_rank, src_rank)
                    
                    # Step 1: Source signals it is ready to send data
                    channel.signal(tb=0, relaxed=True)
                    
                    # Step 2: Wait for destination to be ready
                    channel.wait(tb=0, data_sync=SyncType.after, relaxed=True)
                    
                    # Step 3: Source rank sends data to destination rank
                    channel.put(dst_chunk, src_chunk, tb=0)
                    
                    # Step 4: Signal that put operation is complete
                    channel.signal(tb=0, data_sync=SyncType.before)
                    
                    # Step 5: Wait for acknowledgment
                    channel.wait(tb=0, data_sync=SyncType.after)
            
        print(JSON())

simple_allgather("simple_allgather_2gpus")
```
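
The data movement that the DSL program describes can be mimicked with plain Python lists. This is only a sketch of the intended result, assuming one chunk per rank; it does not use the MSCCL++ DSL, and `simulate_allgather` is a hypothetical name.

```python
def simulate_allgather(num_gpus: int):
    """Model an in-place AllGather where rank r initially owns chunk r."""
    # buffers[r][c] is chunk c of rank r's output buffer.
    buffers = [
        [f"data{r}" if c == r else None for c in range(num_gpus)]
        for r in range(num_gpus)
    ]
    for src in range(num_gpus):
        for dst in range(num_gpus):
            if src != dst:
                # Equivalent of channel.put(dst_chunk, src_chunk): the source's
                # own chunk lands at the same index in the destination buffer.
                buffers[dst][src] = buffers[src][src]
    return buffers


if __name__ == "__main__":
    print(simulate_allgather(2))
```

After the loops finish, every rank holds every chunk, which is the postcondition the signal/wait pairs in the DSL version protect.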

--------------------------------

### Zero-Copy Allreduce with NVLS and Semaphores

Source: https://github.com/microsoft/mscclpp/blob/main/docs/dsl/concepts.md

An example of a two-rank allreduce operation that achieves zero-copy using NVLS and synchronizes three thread-blocks with semaphores. The thread-blocks handle data copying to/from scratch buffers and the allreduce operation itself.

```python
nvls_chan = SwitchChannel(rank_list=[0, 1], buffer_type=BufferType.scratch)
scratch_buffer = []
for i in range(nranks):
    scratch_buffer.append(Buffer(i, nranks))

for i in range(nranks):
    src_rank = i
    dst_rank = (i + 1) % nranks
    chan = MemoryChannel(dst_rank, src_rank)
    chan1 = MemoryChannel(dst_rank, src_rank)
    rank = Rank(i)
    sem0 = Semaphore(rank=i, initial_value=0)
    sem1 = Semaphore(rank=i, initial_value=0)
    input_buffer = rank.get_input_buffer()
    output_buffer = rank.get_output_buffer()

    # Define loop iteration context for processing data chunks
    with LoopIterationContext(unit=2**20, num_chunks=1):
        # Copy input data to scratch buffers
        for offset in range(nranks):
            dst_chunk = scratch_buffer[i][offset : offset + 1]
            src_chunk = input_buffer[offset : offset + 1]
            rank.copy(dst_chunk, src_chunk, tb=0)

        # Synchronize with other ranks
        chan.signal(tb=0, data_sync=SyncType.before)
        chan.wait(tb=0, data_sync=SyncType.after)
        sem0.release(tb=0)  # Release semaphore to allow next step to proceed

        # Wait for previous step completion
        sem0.acquire(tb=1, data_sync=SyncType.after)

        # Reduce operation: combine data from multiple ranks into local chunk
        nvls_chan.at_rank(src_rank).reduce(
            buffer_offset=i, size=1, dst_chunk=scratch_buffer[i][i : i + 1], tb=1
        )

        # Broadcast the reduced result to all participating ranks
        nvls_chan.at_rank(src_rank).broadcast(
            src_chunk=scratch_buffer[i][i : i + 1], buffer_offset=i, size=1, tb=1
        )

        # Signal completion of reduction stage and prepare for next stage
        chan1.signal(tb=1, data_sync=SyncType.before)
        sem1.release(tb=1)

        # Wait for previous stage completion
        sem1.acquire(tb=2)
        chan1.wait(tb=2, data_sync=SyncType.after)

        # Copy all reduced chunks from scratch buffer to final output buffer
        for index in range(nranks):
            dst_chunk = output_buffer[index : index + 1]
            src_chunk = scratch_buffer[i][index : index + 1]
            rank.copy(dst_chunk, src_chunk, tb=2)
```
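
The three stages above (copy to scratch, reduce and broadcast, copy to output) can be modeled without any GPU. The sketch below assumes a sum reduction and one chunk per rank; it models only the data flow, not the NVLS hardware path or the semaphore handoff between thread-blocks, and `simulate_allreduce` is a hypothetical name.

```python
def simulate_allreduce(inputs):
    """Model the allreduce data flow; inputs[r][c] is chunk c on rank r."""
    nranks = len(inputs)
    # Stage 1 (tb=0): each rank copies its input into its scratch buffer.
    scratch = [list(buf) for buf in inputs]
    # Stage 2 (tb=1): rank i reduces chunk i across all ranks, then
    # broadcasts the reduced chunk back to every rank's scratch buffer.
    for i in range(nranks):
        reduced = sum(scratch[r][i] for r in range(nranks))
        for r in range(nranks):
            scratch[r][i] = reduced
    # Stage 3 (tb=2): each rank copies its scratch buffer to its output.
    return [list(buf) for buf in scratch]


if __name__ == "__main__":
    print(simulate_allreduce([[1, 2], [10, 20]]))
```

Each chunk is reduced by exactly one rank, so the stages never write the same chunk concurrently once the preceding synchronization has completed.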

--------------------------------

### Run NCCL benchmarks with MSCCL++

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Execute nccl-tests using the MSCCL++ library via LD_PRELOAD.

```bash
mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/lib/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50
```
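
In nccl-tests, `-b` and `-e` set the minimum and maximum message size and `-f` is the multiplication factor between consecutive sizes. The sketch below lists the sizes such a sweep would visit, assuming binary (1024-based) unit suffixes; `parse_size` and `sweep` are hypothetical helpers, not part of nccl-tests or MSCCL++.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '1K' or '256M' to bytes (1024-based units)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def sweep(begin: str, end: str, factor: int):
    """Return every message size from begin to end, multiplying by factor."""
    size, limit = parse_size(begin), parse_size(end)
    sizes = []
    while size <= limit:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    sizes = sweep("1K", "256M", 2)
    print(len(sizes), sizes[0], sizes[-1])
```

For the command above this gives 19 message sizes, from 1024 bytes to 268435456 bytes.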

--------------------------------

### Fetch and Configure Nanobind

Source: https://github.com/microsoft/mscclpp/blob/main/python/csrc/CMakeLists.txt

Fetches and makes the nanobind library available for use in the build. Ensure nanobind is available at the specified tag.

```cmake
find_package(Python 3.8 COMPONENTS Interpreter Development.Module REQUIRED)
include(FetchContent)
FetchContent_Declare(nanobind GIT_REPOSITORY https://github.com/wjakob/nanobind.git GIT_TAG v1.9.2)
FetchContent_MakeAvailable(nanobind)
```

--------------------------------

### Initialize Semaphore objects

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Construct asymmetric Semaphore objects by pairing local and remote SemaphoreStubs.

```cpp
// From gpu_ping_pong.cu, lines 88 and 98
mscclpp::Semaphore sema0(/*localSemaphoreStub*/ semaStub0, /*remoteSemaphoreStub*/ semaStub1);
mscclpp::Semaphore sema1(/*localSemaphoreStub*/ semaStub1, /*remoteSemaphoreStub*/ semaStub0);
```

--------------------------------

### Launch GPU kernel with channel handle

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Execute the GPU kernel using the transferred channel device handle.

```cpp
// From gpu_ping_pong.cu, line 108
gpuKernel0<<<1, 1>>>(reinterpret_cast<mscclpp::BaseMemoryChannelDeviceHandle *>(devHandle0), iter);
```

--------------------------------

### Custom Proxy Handler Implementation

Source: https://context7.com/microsoft/mscclpp/llms.txt

Illustrates implementing a custom proxy handler for advanced scenarios like batching operations or custom protocols. Requires defining device-side FIFOs and host-side logic for handling custom triggers.

```cpp
#include <mscclpp/proxy.hpp>
#include <mscclpp/fifo_device.hpp>

// Device-side FIFO handle for custom triggers
__device__ mscclpp::FifoDeviceHandle fifo;

__global__ void customTriggerKernel() {
    if (threadIdx.x == 0) {
        mscclpp::ProxyTrigger trigger;

        // Send custom request type 1
        trigger.fst = 1;
        fifo.push(trigger);

        // Send custom request type 2
        trigger.fst = 2;
        fifo.push(trigger);
    }
}
```

```cpp
// Host-side custom proxy implementation
class CustomProxyService {
private:
    mscclpp::Proxy proxy_;

public:
    CustomProxyService() : proxy_(
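        // Trigger handler lambda; it must return a ProxyHandlerResult
        [this](mscclpp::ProxyTrigger trigger) {
            if (trigger.fst == 1) {
                // Handle custom operation type 1
                performBatchTransfer();
            } else if (trigger.fst == 2) {
                // Handle custom operation type 2
                performCustomSync();
            }
            // Keep the proxy thread running after handling this trigger
            return mscclpp::ProxyHandlerResult::Continue;
        },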
        // Initialization lambda (empty)
        []() {}
    ) {}

    void startProxy() { proxy_.start(); }
    void stopProxy() { proxy_.stop(); }

private:
    void performBatchTransfer() { /* ... */ }
    void performCustomSync() { /* ... */ }
};
```

--------------------------------

### Launch Single Node Custom Communicator

Source: https://github.com/microsoft/mscclpp/blob/main/docs/dsl/integration.md

Launches a single-node training job using a custom MSCCL++ communicator. Requires setting master address and port.

```bash
MSCCLPP_MASTER_ADDR=<master_ip> MSCCLPP_MASTER_PORT=<port> torchrun --nnodes=1 --nproc_per_node=8 customized_comm.py
```

--------------------------------

### Build mscclpp-test with CMake

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/cpp-examples.md

Configure and build the test suite using MPI. Ensure MPI_HOME is set to the correct path.

```bash
$ MPI_HOME=/path/to/mpi cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j allgather_test_perf allreduce_test_perf
```

--------------------------------

### Launch Kernel with Python API

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/python-api.md

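Launches a compiled CUDA kernel using MSCCL++ Python utilities. Prepares parameters including device handles and packs them for kernel execution. Requires os, Dict, PortChannel, and cp (CuPy).

```python
import os
from typing import Dict

import cupy as cp
from mscclpp import PortChannel  # assumed import path; adjust to your mscclpp version
from mscclpp.utils import KernelBuilder, pack

# simple_channels maps each peer rank to its PortChannel
def launch_kernel(my_rank: int, nranks: int, simple_channels: Dict[int, PortChannel], memory: cp.ndarray):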
    file_dir = os.path.dirname(os.path.abspath(__file__))
    kernel = KernelBuilder(file="test.cu", kernel_name="test", file_dir=file_dir).get_compiled_kernel()
    params = b""
    first_arg = next(iter(simple_channels.values()))
    size_of_channels = len(first_arg.device_handle().raw)
    device_handles = []
    for rank in range(nranks):
        if rank == my_rank:
            device_handles.append(
                bytes(size_of_channels)
            )  # just zeros for semaphores that do not exist
        else:
            device_handles.append(simple_channels[rank].device_handle().raw)
    # keep a reference to the device handles so that they don't get garbage collected
    d_channels = cp.asarray(memoryview(b"".join(device_handles)), dtype=cp.uint8)
    params = pack(d_channels, my_rank, nranks, memory.size)

    nblocks = 1
    nthreads = 512
    kernel.launch_kernel(params, nblocks, nthreads, 0, None)
```
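
The handle-packing step can be exercised without a GPU by treating each raw device handle as an opaque byte string. This is a minimal sketch under that assumption; `pack_handles` is a hypothetical helper and the 4-byte handles are placeholders, not real MSCCL++ handle layouts.

```python
from typing import Dict


def pack_handles(handles: Dict[int, bytes], my_rank: int, nranks: int) -> bytes:
    """Concatenate per-rank handles, zero-filling the slot for my_rank."""
    # All handles are assumed to have the same size, as in the example above.
    handle_size = len(next(iter(handles.values())))
    slots = [
        bytes(handle_size) if rank == my_rank else handles[rank]
        for rank in range(nranks)
    ]
    return b"".join(slots)


if __name__ == "__main__":
    fake_handles = {0: b"\x01" * 4, 2: b"\x03" * 4}  # placeholder handle bytes
    print(pack_handles(fake_handles, my_rank=1, nranks=3).hex())
```

Zero-filling the local slot keeps the array indexable by rank on the device, at the cost of one unused entry that the kernel must never dereference.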

--------------------------------

### Use NCCL audit shim library

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Load the NCCL audit shim library through LD_AUDIT instead of LD_PRELOAD to avoid environment conflicts.

```bash
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
```

--------------------------------

### Create a PortChannel

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/04-port-channel.md

Instantiate a PortChannel from the semaphore and memory IDs generated by the ProxyService.

```cpp
mscclpp::PortChannel portChan = proxyService.portChannel(semaId, remoteMemId, localMemId);
```

--------------------------------

### Run MSCCL++ Application

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/mscclpp-torch-integration.md

Executes the script using torchrun with necessary environment variables for MSCCL++ networking.

```bash
MSCCLPP_MASTER_ADDR=<ip> MSCCLPP_MASTER_PORT=<port> \
  torchrun --nnodes=1 --nproc_per_node=8 customized_comm_with_dsl.py
```

--------------------------------

### Configure MSCCL++ NCCL/RCCL fallback

Source: https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md

Environment variables that enable and configure fallback to the standard NCCL/RCCL libraries, shown as `-x` options to add to an `mpirun` command.

```bash
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl_lib/libnccl.so  # or /path_to_rccl_lib/librccl.so on AMD platforms
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="list of collective name[s]"
```

--------------------------------

### Run Allreduce Performance Test

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/cpp-examples.md

Execute the allreduce_test_perf binary with specific message sizes and algorithm selection.

```bash
$ mpirun --bind-to numa -np 8 ./bin/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5
```

--------------------------------

### Synchronize GPU Kernels with relaxedSignal and relaxedWait

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Demonstrates using BaseMemoryChannelDeviceHandle to coordinate execution between two GPUs. Use these methods when execution flow synchronization is required but memory operation ordering is not.

```cpp
__global__ void gpuKernel0(mscclpp::BaseMemoryChannelDeviceHandle *devHandle, int iter) {
  // Only the first thread of the grid drives the channel
  if (threadIdx.x + blockIdx.x * blockDim.x == 0) {
    for (int i = 0; i < iter; ++i) {
      devHandle->relaxedWait();
      // spin for a few ms
      spin_cycles(1e7);
      devHandle->relaxedSignal();
    }
  }
}

__global__ void gpuKernel1(mscclpp::BaseMemoryChannelDeviceHandle *devHandle, int iter) {
  // Only the first thread of the grid drives the channel
  if (threadIdx.x + blockIdx.x * blockDim.x == 0) {
    for (int i = 0; i < iter; ++i) {
      devHandle->relaxedSignal();
      devHandle->relaxedWait();
    }
  }
}
```
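
The ping-pong control flow of the two kernels can be imitated on the host with two threads and two counting semaphores. This is an analogy only, assuming `threading.Semaphore` as a stand-in for the channel's signal and wait operations; it says nothing about GPU memory ordering, and `ping_pong` is a hypothetical name.

```python
import threading


def ping_pong(iterations: int):
    """Run the wait/signal handshake of gpuKernel0 and gpuKernel1 on host threads."""
    to_gpu0 = threading.Semaphore(0)  # signals sent by "GPU 1" to "GPU 0"
    to_gpu1 = threading.Semaphore(0)  # signals sent by "GPU 0" to "GPU 1"
    log = []

    def gpu0():
        for i in range(iterations):
            to_gpu0.acquire()          # stands in for relaxedWait()
            log.append(("gpu0", i))    # the "work" done between wait and signal
            to_gpu1.release()          # stands in for relaxedSignal()

    def gpu1():
        for i in range(iterations):
            to_gpu0.release()          # stands in for relaxedSignal()
            to_gpu1.acquire()          # stands in for relaxedWait()
            log.append(("gpu1", i))

    threads = [threading.Thread(target=gpu0), threading.Thread(target=gpu1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(ping_pong(3))
```

Each side blocks until the other has signaled, so the log entries strictly alternate between the two threads regardless of scheduling.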

--------------------------------

### Load MSCCL++ Default Algorithms

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/mscclpp-torch-integration.md

Loads the default collection of MSCCL++ algorithms. Requires a scratch buffer for algorithm execution.

```python
import mscclpp
import mscclpp.utils as mscclpp_utils

def load_algorithms(scratch_buffer: torch.Tensor, rank: int):
    """Load MSCCL++ default algorithm collection."""
    collection_builder = mscclpp.AlgorithmCollectionBuilder()
    return collection_builder.build_default_algorithms(
        scratch_buffer=scratch_buffer.data_ptr(),
        scratch_buffer_size=scratch_buffer.nbytes,
        rank=rank
    )
```

--------------------------------

### Run Customized Communication with Default Algorithms

Source: https://github.com/microsoft/mscclpp/blob/main/docs/guide/mscclpp-torch-integration.md

Command to launch a PyTorch distributed job using torchrun. Ensure MSCCLPP environment variables are set.

```bash
MSCCLPP_MASTER_ADDR=<ip> MSCCLPP_MASTER_PORT=<port> \
  torchrun --nnodes=1 --nproc_per_node=8 customized_comm_with_default_algo.py
```

--------------------------------

### Obtain Channel device handles

Source: https://github.com/microsoft/mscclpp/blob/main/docs/tutorials/01-basic-concepts.md

Retrieve lightweight device handles from channels to enable GPU kernel operations.

```cpp
// From gpu_ping_pong.cu, lines 90 and 100
mscclpp::BaseMemoryChannelDeviceHandle memChanHandle0 = memChan0.deviceHandle();
mscclpp::BaseMemoryChannelDeviceHandle memChanHandle1 = memChan1.deviceHandle();
```