### Verify cuPyNumeric Installation

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Runs a sample Legate application (black_scholes.py) to verify that cuPyNumeric is installed and functioning correctly. This example demonstrates the performance of the library.

```sh
legate examples/black_scholes.py
```

--------------------------------

### Install cuPyNumeric via PyPI (New Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Creates a new Python virtual environment and then installs the latest version of cuPyNumeric from PyPI into it using pip.

```bash
python -m venv myenv
source myenv/bin/activate
pip install nvidia-cupynumeric
```

--------------------------------

### Install Conda and cuPyNumeric

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Installs Miniforge3 for Linux, initializes Conda for bash and zsh, and then creates a Conda environment named 'legate' with cuPyNumeric and Legate installed from the conda-forge and legate channels. This is a comprehensive setup for using cuPyNumeric with Conda.

```sh
mkdir -p ~/miniforge3
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O ~/miniforge3/miniforge.sh
bash ~/miniforge3/miniforge.sh -b -u -p ~/miniforge3
rm -rf ~/miniforge3/miniforge.sh
~/miniforge3/bin/conda init bash
~/miniforge3/bin/conda init zsh
source ~/.bashrc
conda create -n legate -c conda-forge -c legate cupynumeric
conda activate legate
```

--------------------------------

### Legate Resource Allocation Examples

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Provides command-line templates for launching applications using Legate with specific resource allocations for CPU, GPU, and OMP task variants.
These examples demonstrate how to specify the number of nodes, CPUs, GPUs, OpenMP settings, and memory for each task type.

```text
--nodes: number of nodes to be utilized for the program
--cpus: number of CPUs to be utilized for the program
--gpus: number of GPUs to be utilized for the program
--omps: number of OpenMP groups created
--ompthreads: number of threads in each OpenMP group
--sysmem: system memory (MB)
--fbmem: framebuffer memory per GPU (MB)
```

```sh
legate --cpus 8 --sysmem 40000 ./main.py
```

```sh
legate --gpus 2 --fbmem 40000 ./main.py
```

```sh
legate --omps 1 --ompthreads 4 --sysmem 40000 ./main.py
```

--------------------------------

### Install cuPyNumeric via PyPI (Existing Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the latest version of cuPyNumeric from PyPI into an existing Python environment using pip.

```bash
pip install nvidia-cupynumeric
```

--------------------------------

### Docker Environment Setup for Building Wheels

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/README.md

This snippet demonstrates how to set up a Docker container environment to build the Python binary wheels for PyPI. It involves running a container with NVIDIA GPU support, mounting the source directory, and installing necessary development tools like GCC.

```bash
docker run --rm --runtime=nvidia --gpus all -it --mount type=bind,src=.,dst=/src rapidsai/ci-wheel:cuda12.8.0-rockylinux8-py3.12 bash
cd /src
export PATH=/src/continuous_integration/scripts/tools/:$PATH
dnf install -y gcc-toolset-11-libatomic-devel
```

--------------------------------

### Run Legate Example Program with srun Launcher

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes a Python script ('main.py') using the Legate driver. It specifies using 4 GPUs, the 'srun' launcher, and 2 nodes.
The '--verbose' option can be added for more detailed output.

```sh
legate --gpus 4 --launcher srun --nodes 2 ./main.py
```

--------------------------------

### Run Black-Scholes on GPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Black-Scholes algorithm for pricing options on GPUs. This example demonstrates scaling the computation by increasing the number of GPUs and the problem size, measuring the elapsed time for each configuration.

```sh
legate --gpus 1 --sysmem 10000 --fbmem 14000 ./black_scholes.py --num 100000 --precision 32 --time
legate --gpus 2 --sysmem 10000 --fbmem 38000 ./black_scholes.py --num 1000000 --precision 32 --time
legate --gpus 4 --sysmem 10000 --fbmem 38000 ./black_scholes.py --num 2000000 --precision 32 --time
```

--------------------------------

### Run FFT on CPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Command to execute the FFT example on CPU using the Legate launcher. Ensures CPU-only execution by setting --gpus 0.

```sh
legate --cpus 1 --gpus 0 ./fft.py
```

--------------------------------

### Run cuPyNumeric Dot Product Example on Single Node with GPUs

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes a Python script named 'main.py' using the 'legate' command, specifying the use of 2 GPUs. This demonstrates running a cuPyNumeric dot-product calculation on a single workstation with multi-GPU capability.

```sh
legate --gpus 2 ./main.py
```

--------------------------------

### cuPyNumeric Dot Product Example

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

A Python script demonstrating a large-scale dot product calculation using cuPyNumeric. It generates two large random vectors, computes their dot product, and times the operation. This example requires the legate.timing module and the cuPyNumeric library.
```python
from legate.timing import time
import cupynumeric as np

# Define the size of the vectors
size = 100000000

start_time = time()

# Generate two random vectors of the specified size
vector1 = np.random.rand(size)
vector2 = np.random.rand(size)

# Compute the dot product using cuPyNumeric
dot_product = np.dot(vector1, vector2)

end_time = time()
elapsed_time = (end_time - start_time)/1000

print("Dot product:", dot_product)
print(f"Dot product took {elapsed_time:.4f} ms")
```

--------------------------------

### FFT Example Main Function Initialization (Python)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Initializes inputs and performs a GPU-accelerated batched 2D Fast Fourier Transform using cuPyNumeric. Supports dynamic shape configuration via command-line arguments.

```python
import argparse

import numpy as np
import cupy
import legate.array as cp
from legate.core import TaskContext
from legate.core import Future, VariantCode


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-f", "--file", default="fft.py", help="program file"
    )
    parser.add_argument(
        "-n",
        "--nodes",
        type=int,
        default=1,
        help="number of nodes to use for execution",
    )
    parser.add_argument(
        "-d",
        "--dims",
        default="(128, 256, 256)",
        help="dimensionalities of the input array",
    )
    args = parser.parse_args()

    # Turn the shape string, e.g. "(128, 256, 256)", into a tuple
    shape = eval(args.dims)
    cp.init(args.nodes)

    # Allocate arrays on the CPU
    A_np = np.zeros(shape, dtype=np.complex64)
    B_cpn = cp.array(shape, dtype=np.complex64)
    A_cpn = cp.array(shape, dtype=np.complex64)

    # Launch the FFT task
    fft2d_batched_gpu(A_cpn, B_cpn)

    # Wait for all tasks to complete
    cp.finish()


if __name__ == "__main__":
    main()
```

--------------------------------

### Installing the Built Binary Wheel

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/README.md

This command installs the generated binary wheel from the 'final-dist' directory using pip.
This is the same wheel that would be produced by the Continuous Integration pipeline.

```bash
pip install final-dist/*.whl
```

--------------------------------

### Install Nightly cuPyNumeric via Conda

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the latest nightly build of cuPyNumeric from the legate-nightly channel. These builds are only lightly validated and should be used at your own risk.

```bash
conda install -c conda-forge -c legate-nightly cupynumeric
```

--------------------------------

### Python: Complete module for Legate Histogram Example

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

This Python snippet provides the complete module for the Legate histogram example. It includes the necessary imports, the main function to set up the data and histogram arrays, and the 'histogram_task' function for computation. This code can be executed using the 'legate' command-line launcher.

```python
import argparse

import numpy as np
import legate.numpy as cp
from legate.core import TaskContext, REDUCTION_ADD, ReductionArray, VariantCode


@cp.task(cpu_only=True)
def histogram_task(data: cp.ndarray, hist: ReductionArray[REDUCTION_ADD]):
    ctx = TaskContext.get_context()
    if ctx.get_variant_kind() == VariantCode.GPU:
        # Use CuPy arrays on GPU
        data_view = cp.asarray(data)
        hist_view = cp.asarray(hist)
    else:
        # Use NumPy arrays on CPU
        data_view = np.asarray(data)
        hist_view = np.asarray(hist)

    # Compute local histogram for the chunk of data
    local_hist = cp.bincount(data_view, minlength=hist_view.shape[0])

    # Add local histogram results to the global 'hist' array using reduction
    hist_view += local_hist
    # The 'hist' array is a ReductionArray, so the addition is reduced automatically
    # across all tasks/devices.
    return hist_view


def main():
    parser = argparse.ArgumentParser(
        description="GPU-accelerated histogram counting."
    )
    parser.add_argument("--size", type=int, default=1000, help="Size of the input array.")
    args = parser.parse_args()

    # Create a 1D array with random integers
    data = cp.random.randint(0, 10, size=args.size)

    # Create an empty histogram array of length 10
    hist = cp.zeros(10, dtype=cp.int64)

    # Call the histogram task function to compute frequencies
    # The result is accumulated in the 'hist' array via reduction
    histogram_task(data, hist)

    # Print the computed histogram
    print(hist)
    return hist


if __name__ == "__main__":
    main()
```

--------------------------------

### Run matmul on GPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Command to run the matmul example on GPU using the Legate launcher. This command uses 2 GPUs for accelerated computation.

```sh
legate --gpus 2 ./matmul.py -m 1000 -k 1000 -n 1000
```

--------------------------------

### Run matmul on Multi-Node

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Command to execute the matmul example across multiple nodes using the Legate launcher with srun. This is for distributed execution on HPC systems.

```sh
legate --nodes 2 --launcher srun --gpus 4 --ranks-per-node 1 ./matmul.py -m 1000 -k 1000 -n 1000
```

--------------------------------

### Install Pillow Library for Edge Detection

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command installs the Pillow library from the conda-forge channel. Pillow is a necessary dependency for the edge detection script, used for image manipulation and opening image files.
```sh
conda install -c conda-forge pillow
```

--------------------------------

### Relocatable Test Installation

Source: https://github.com/nv-legate/cupynumeric/blob/main/tests/cpp/CMakeLists.txt

Installs the test executable in a relocatable manner to the binary directory, ensuring it can be found and executed after installation. It also includes the tests in the 'ALL' target.

```cmake
include(GNUInstallDirs)
rapids_test_install_relocatable(INSTALL_COMPONENT_SET testing
                                DESTINATION ${CMAKE_INSTALL_BINDIR}
                                INCLUDE_IN_ALL)
```

--------------------------------

### FFT Example Task Function (Python)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Defines a task for batched 2D FFT using Legate. Utilizes 'align' and 'broadcast' constraints for optimal partitioning and GPU execution.

```python
import cupy
import numpy as np
import legate.array as cp
from legate.core import TaskContext
from legate.core import Future, VariantCode


@cp.task(
    VariantCode.GPU,
    legate_only=True,
    domain=("src.domain", "dst.domain"),
    constraints=[
        cp.align("src", "dst"),
        cp.broadcast("src", (1, 2)),
    ],
)
def fft2d_batched_gpu(src, dst):
    ctx = TaskContext()
    # Select the array module that matches the variant being executed
    xp = cupy if ctx.get_variant_kind() == VariantCode.GPU else np

    # Convert to CuPy arrays (views without copying)
    src_cp = cupy.asarray(src)
    dst_cp = cupy.asarray(dst)

    # Apply 2D FFT for each batch independently
    for i in range(src_cp.shape[0]):
        dst_cp[i] = cupy.fft.fft2(src_cp[i])
    return dst_cp
```

--------------------------------

### Array-Based Operations vs. Loops in cuPyNumeric

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Highlights the benefits of using array-based operations over explicit loops for element-wise updates in cuPyNumeric. It provides examples for updating array elements based on indexing and conditional logic, demonstrating how to achieve the same results more concisely and efficiently.
```python
# x and y are three-dimensional arrays

# NOT recommended: Performing naive element-wise implementation
for i in range(ny):
    for j in range(nx):
        x[0, j, i] = y[3, j, i]

# Recommended: Using array-based operations
x[0] = y[3]
```

```python
# x and y are two-dimensional arrays, and we need to update x
# depending on whether y meets a condition or not.

# NOT recommended: Performing naive element-wise implementation
for i in range(ny):
    for j in range(nx):
        if y[j, i] < tol:
            x[j, i] = const
        else:
            x[j, i] = 1.0 - const

# Recommended: Using array-based operations
cond = y < tol
x[cond] = const
x[~cond] = 1.0 - const
```

--------------------------------

### Resource Scoping to GPU with Legate Core API

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/advanced.rst

Provides a Python code example using the legate.core API to restrict a block of code to run exclusively on GPUs. It involves obtaining the Legate runtime and machine information, then using a context manager for GPU-only execution.

```python
from legate.core import TaskTarget, get_legate_runtime

machine = get_legate_runtime().get_machine()
with machine.only(TaskTarget.GPU):
    # code to run only on GPUs
    pass
```

--------------------------------

### MPI4Py Stencil Operation for Multi-GPU Systems

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/examples/torchswe.ipynb

This example illustrates the complexities of parallelizing stencil operations for multi-GPU systems using MPI4Py and CuPy. It includes setting up MPI communication, determining GPU devices for each rank, and defining data types for halo boundaries. This highlights the manual domain decomposition and inter-GPU communication required.
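
The MPI example splits the global rows with `gnx // size`, which covers every row only when the rank count divides 126 evenly. As a rough illustration of the bookkeeping that manual decomposition involves, the sketch below spreads any remainder across the first ranks. The helper `local_rows` is hypothetical and is not part of TorchSWE, MPI4Py, or cuPyNumeric.

```python
def local_rows(global_rows: int, n_ranks: int, rank: int) -> int:
    """Rows owned by `rank` when `global_rows` is split as evenly as possible."""
    base, extra = divmod(global_rows, n_ranks)
    # The first `extra` ranks each take one additional row.
    return base + (1 if rank < extra else 0)


gnx = 126  # global row count used in the MPI example
for n_ranks in (2, 4):
    owned = [local_rows(gnx, n_ranks, r) for r in range(n_ranks)]
    padded = [n + 2 for n in owned]  # one halo row on each side
    print(n_ranks, owned, padded)
```

With 4 ranks, plain integer division would assign 31 rows to each rank and leave 2 rows unowned; the helper hands those 2 rows to ranks 0 and 1.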
```python
from mpi4py import MPI
import cupy as cp

num_timesteps = 10


def set_device(comm: MPI.Comm):
    # Device selection for each rank on multi-GPU nodes (TorchSWE-specific)
    n_gpus = cp.cuda.runtime.getDeviceCount()
    local_rank = comm.Get_rank() % n_gpus
    cp.cuda.runtime.setDevice(local_rank)


comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Determine grid size and decompose domain
gnx, gny = 126, 126  # global grid dimensions
local_nx, local_ny = gnx // size, gny  # local grid dimensions per rank
local_grid = cp.ones((local_nx + 2, local_ny + 2))  # with halo boundaries

# Set up MPI data types and boundaries
send_type, recv_type = (
    MPI.DOUBLE.Create_subarray(
        (local_nx + 2, local_ny + 2), (local_nx, local_ny), (1, 1)
    ),
    MPI.DOUBLE.Create_subarray(
        (local_nx + 2, local_ny + 2), (local_nx, local_ny), (1, 1)
    ),
)
send_type.Commit()
recv_type.Commit()
```

--------------------------------

### Install cuPyNumeric via Conda (Existing Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the cuPyNumeric package into an existing Conda environment. Requires conda version >= 24.1. It installs from the conda-forge and legate channels.

```bash
conda install -c conda-forge -c legate cupynumeric
```

--------------------------------

### Install cuPyNumeric via Conda (New Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the cuPyNumeric package into a new Conda environment. Requires conda version >= 24.1. It installs from the conda-forge and legate channels.

```bash
conda create -n myenv -c conda-forge -c legate cupynumeric
```

--------------------------------

### Run Black-Scholes on CPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Black-Scholes algorithm for pricing options on the CPU using the 'legate' command.
This command-line execution is designed for a specific number of options and precision, measuring the elapsed time for the computation.

```sh
legate --cpus 1 --sysmem 10000 ./black_scholes.py --num 10000 --precision 32 --time
```

--------------------------------

### Force GPU cuPyNumeric Installation with Conda

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Forces Conda to install a version of cuPyNumeric with GPU support, overriding the default system detection. Specify the desired CUDA version. This is useful when you need to ensure GPU acceleration is enabled.

```sh
CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge -c legate cupynumeric
```

--------------------------------

### Configure Install Directory for Libraries

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/CMakeLists.txt

Sets the installation subdirectory for libraries. This determines where libraries will be placed when the project is installed, commonly used for system-wide or package installations.

```cmake
set(CMAKE_INSTALL_LIBDIR lib64)
```

--------------------------------

### Preparing Wheel Dependency

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/README.md

This code shows how to prepare a compatible binary wheel for 'legate' to build against. It involves creating a 'wheel' subdirectory and copying the necessary wheel file into it.

```bash
mkdir wheel
cp /path/to/legate.whl wheel/
```

--------------------------------

### Set Install RPATH for Target

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/CMakeLists.txt

Applies the defined runtime search paths (RPATH) to the 'cupynumeric' target during installation. This ensures that the installed libraries can find their dependencies.
```cmake
set_property(
  TARGET cupynumeric
  PROPERTY INSTALL_RPATH ${rpaths}
  APPEND
)
```

--------------------------------

### Run Matrix Multiplication with Legate on GPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the simple_mm.py script using Legate on GPUs. This command specifies the number of GPUs, the launcher, number of nodes, and memory allocations. It's used to solve matrix multiplication problems of varying sizes.

```sh
legate --gpus 2 --launcher srun --nodes 1 --sysmem 2000 --fbmem 24000 --eager-alloc-percentage 10 ./simple_mm.py
```

```sh
legate --gpus 4 --launcher srun --nodes 1 --sysmem 2000 --fbmem 38000 --eager-alloc-percentage 10 ./simple_mm.py
```

```sh
legate --gpus 4 --launcher srun --nodes 2 --sysmem 2000 --fbmem 38000 --eager-alloc-percentage 10 ./simple_mm.py
```

--------------------------------

### Allocate 2 GPU Nodes for Legate Execution

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command allocates computing resources on a cluster. It requests 2 nodes, each with 1 task and 4 GPUs, for an interactive session lasting 1 hour and 30 minutes. Ensure the Legate environment is activated before proceeding.

```sh
salloc --nodes 2 --ntasks-per-node 1 --qos interactive --time 01:30:00 --constraint gpu --gpus-per-node 4 --account=
```

--------------------------------

### Run Conjugate Gradient Method on CPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) method script on CPUs using Legate. This command is used to solve a 10,000 x 10,000 2-d adjacency system, with options to control the number of CPUs, memory, iterations, and verification.
```sh
legate --cpus 1 --sysmem 16000 ./cg.py --num 100 --check --time
```

--------------------------------

### Python: Main function for Histogram Example

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

This Python snippet defines the main function for the histogram example. It initializes a 1D array with random integers and an empty 'hist' array for storing counts. It then calls the 'histogram_task' function to compute frequencies. The input array size can be adjusted via the '--size' command-line argument.

```python
import argparse

import numpy as np
import legate.numpy as cp


def main():
    parser = argparse.ArgumentParser(
        description="GPU-accelerated histogram counting."
    )
    parser.add_argument("--size", type=int, default=1000, help="Size of the input array.")
    args = parser.parse_args([
        "--size", "10000000"
    ])

    # Create a 1D array with random integers
    data = cp.random.randint(0, 10, size=args.size)

    # Create an empty histogram array of length 10
    hist = cp.zeros(10, dtype=cp.int64)

    # Call the histogram task function to compute frequencies
    # The result is accumulated in the 'hist' array via reduction
    histogram_task(data, hist)

    # Print the computed histogram
    print(hist)
    return hist
```

--------------------------------

### Run CG model with 1 GPU using Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) model with 1 GPU, system memory of 48000MB, and framebuffer memory of 14000MB. It solves a 22500x22500 adjacency system and outputs timing information.

```sh
legate --gpus 1 --sysmem 48000 --fbmem 14000 ./cg.py --num 150 --check --time
```

--------------------------------

### Legate Program Output (Duplicated Ranks)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This is sample output from a Legate program run on Perlmutter, demonstrating duplicated output from each rank.
To show output only from the first rank, set the environment variable LEGATE_LIMIT_STDOUT=1.

```text
Dot product: 25001932.012924932
Dot product took 141.2350 ms
Dot product: 25001932.012924932
Dot product took 141.2350 ms
```

--------------------------------

### Run Edge Detection on GPU with Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes an edge detection script ('./edge.py') on a single GPU, allocating specific amounts of system and framebuffer memory. It's designed for environments like Perlmutter and assumes the script is available locally.

```sh
legate --gpus 1 --sysmem 16000 --fbmem 38000 ./edge.py
```

--------------------------------

### Run Jacobi Stencil on CPU with Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on the CPU using Legate. It specifies the number of CPUs, system memory, and the grid size and number of iterations for the Jacobi stencil computation. The output shows the grid generation and the elapsed time in milliseconds.

```sh
legate --cpus 1 --sysmem 16000 ./jacobi_stencil.py --size 10000 --iterations 100
```

--------------------------------

### Run Jacobi Stencil on GPU with Legate (2 GPUs)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on two GPUs using Legate, demonstrating scalability. It configures system and framebuffer memory and sets a larger grid size and more iterations. The output shows the performance improvement with multiple GPUs for a larger problem size.
```sh
legate --gpus 2 --sysmem 16000 --fbmem 38000 ./jacobi_stencil.py --size 30000 --iterations 200
```

--------------------------------

### Run cuPyNumeric on GPU with Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Execute cuPyNumeric scripts utilizing GPU resources. This command specifies the number of GPUs to allocate for execution. Ensure Legate is installed and accessible in your environment.

```sh
legate --gpus 2 ./fft.py
```

--------------------------------

### Replace NumPy Import with cuPyNumeric (Python)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/usage.rst

This snippet shows how to replace the standard NumPy import statement with cuPyNumeric to leverage GPU acceleration. No external dependencies are required beyond the cuPyNumeric installation.

```python
import numpy as np
```

```python
import cupynumeric as np
```

--------------------------------

### Run CG model with 2 GPUs using Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) model with 2 GPUs, system memory of 40000MB, and framebuffer memory of 38000MB. It solves a 50625x50625 adjacency system and outputs timing information.

```sh
legate --gpus 2 --sysmem 40000 --fbmem 38000 ./cg.py --num 225 --check --time
```

--------------------------------

### Run CG model with 4 GPUs using Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) model with 4 GPUs, system memory of 40000MB, and framebuffer memory of 38000MB. It solves a 75625x75625 adjacency system and outputs timing information.
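
The system dimensions quoted for the CG runs are consistent with the square of the `--num` value (for instance, `--num 275` and a 75625x75625 system). The sketch below only reproduces that arithmetic. `system_size` is a hypothetical helper, and the relationship is inferred from the figures in this tutorial rather than from `cg.py` itself.

```python
def system_size(num: int) -> int:
    """Hypothetical helper: system dimension implied by a given --num value.

    Assumption: the dimension equals num * num, as the quoted figures suggest.
    """
    return num * num


# --num values used in the CPU, 1-GPU, 2-GPU and 4-GPU CG runs
for num in (100, 150, 225, 275):
    n = system_size(num)
    print(f"--num {num} -> {n} x {n} system")
```

Under this assumption, going from `--num 100` to `--num 275` grows the system dimension by roughly 7.6 times, which may help explain the larger memory settings in the multi-GPU commands.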
```sh
legate --gpus 4 --sysmem 40000 --fbmem 38000 ./cg.py --num 275 --check --time
```

--------------------------------

### Conditional CUDA Setup

Source: https://github.com/nv-legate/cupynumeric/blob/main/tests/cpp/CMakeLists.txt

Conditionally finds the CUDA Toolkit and enables the CUDA language if the Legion_USE_CUDA flag is set. This allows for building tests with GPU support when required.

```cmake
if(Legion_USE_CUDA)
  find_package(CUDAToolkit REQUIRED)
  enable_language(CUDA)
endif()
```

--------------------------------

### Run Jacobi Stencil on GPU with Legate (1 GPU)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on a single GPU using Legate. It specifies the number of GPUs, system memory, and framebuffer memory, along with the grid size and number of iterations. The output indicates the elapsed time for the GPU computation, which is significantly faster than CPU.

```sh
legate --gpus 1 --sysmem 16000 --fbmem 15000 ./jacobi_stencil.py --size 15000 --iterations 100
```

--------------------------------

### cuPyNumeric Detailed API Coverage Report Format

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/howtos/measuring.rst

Example format of a detailed CSV coverage report. It lists each NumPy function called, its location in the source code, and a boolean indicating if it's implemented by cuPyNumeric.
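
A report with these three columns can be summarized with Python's standard `csv` module. The following is a minimal sketch: `summarize_report` and the inline sample rows are illustrative and are not part of cuPyNumeric.

```python
import csv
import io
from collections import Counter

# Illustrative rows in the documented column layout
SAMPLE_REPORT = """\
function_name,location,implemented
numpy.array,tests/dot.py:27,True
numpy.ndarray.dot,tests/dot.py:31,True
numpy.allclose,tests/dot.py:33,True
"""


def summarize_report(report_text: str):
    """Return (total calls, implemented calls, Counter of unimplemented functions)."""
    rows = list(csv.DictReader(io.StringIO(report_text)))
    implemented = sum(1 for row in rows if row["implemented"] == "True")
    missing = Counter(
        row["function_name"] for row in rows if row["implemented"] != "True"
    )
    return len(rows), implemented, missing


total, implemented, missing = summarize_report(SAMPLE_REPORT)
print(f"{implemented}/{total} calls implemented")
for name, count in missing.most_common():
    print(f"not implemented: {name} ({count} call sites)")
```

To process a real report, read the file contents and pass the text to `summarize_report` in place of `SAMPLE_REPORT`.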
```csv
function_name,location,implemented
numpy.array,tests/dot.py:27,True
numpy.ndarray.__init__,tests/dot.py:27,True
numpy.array,tests/dot.py:28,True
numpy.ndarray.__init__,tests/dot.py:28,True
numpy.ndarray.dot,tests/dot.py:31,True
numpy.ndarray.__init__,tests/dot.py:31,True
numpy.allclose,tests/dot.py:33,True
numpy.ndarray.__init__,tests/dot.py:33,True
```

--------------------------------

### Run Jacobi Stencil on GPU with Legate (4 GPUs)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on four GPUs using Legate, showcasing performance with more parallel resources. It uses substantial system and framebuffer memory and processes a very large grid with many iterations. The output highlights the efficiency of distributed GPU computation.

```sh
legate --gpus 4 --sysmem 16000 --fbmem 38000 ./jacobi_stencil.py --size 50000 --iterations 300
```

--------------------------------

### Multi-node Execution with Manual Task Manager

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/advanced.rst

Shows how to initiate multi-node execution of cuPyNumeric programs using a manual task manager like 'mpirun'. This approach allows direct control over process distribution.

```sh
mpirun -np N legate script.py