### Verify cuPyNumeric Installation

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Runs a sample Legate application (black_scholes.py) to verify that cuPyNumeric is installed and functioning correctly. This example demonstrates the performance of the library.

```sh
legate examples/black_scholes.py
```

--------------------------------

### Install cuPyNumeric via PyPI (New Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Creates a new Python virtual environment and then installs the latest version of cuPyNumeric from PyPI into it using pip.

```bash
python -m venv myenv
source myenv/bin/activate
pip install nvidia-cupynumeric
```

--------------------------------

### Install Conda and cuPyNumeric

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Installs Miniforge3 for Linux (x86_64), initializes Conda for bash and zsh, and then creates a Conda environment named 'legate' with cuPyNumeric and Legate installed from the conda-forge and legate channels. This is a complete setup for using cuPyNumeric with Conda.

```sh
mkdir -p ~/miniforge3
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O ~/miniforge3/miniforge.sh
bash ~/miniforge3/miniforge.sh -b -u -p ~/miniforge3
rm -rf ~/miniforge3/miniforge.sh
~/miniforge3/bin/conda init bash
~/miniforge3/bin/conda init zsh
source ~/.bashrc
conda create -n legate -c conda-forge -c legate cupynumeric
conda activate legate
```

--------------------------------

### Legate Resource Allocation Examples

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Provides command-line templates for launching applications using Legate with specific resource allocations for CPU, GPU, and OMP task variants. These examples demonstrate how to specify the number of nodes, CPUs, GPUs, OpenMP settings, and memory for each task type.

```text
--nodes: number of Nodes to be utilized for the program

--cpus: number of CPUs to be utilized for the program
--gpus: number of GPUs to be utilized for the program

--omps: number of OpenMP groups created
--ompthreads: number of threads in each OpenMP group

--sysmem: system memory (MB)
--fbmem: framebuffer memory per GPU (MB)
```

```sh
legate --cpus 8 --sysmem 40000 ./main.py <main.py options>
```

```sh
legate --gpus 2 --fbmem 40000 ./main.py <main.py options>
```

```sh
legate --omps 1 --ompthreads 4 --sysmem 40000 ./main.py <main.py options>
```
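The templates above can also be assembled programmatically, for example when sweeping over resource settings in a benchmark script. The following is a minimal sketch, not part of Legate: `build_legate_command` is a hypothetical helper, and it accepts only the flags listed above.

```python
import shlex


def build_legate_command(script, script_args=(), **resources):
    """Build a `legate` argument list from resource settings.

    Hypothetical helper: only the flags documented above are accepted.
    """
    allowed = ("nodes", "cpus", "gpus", "omps", "ompthreads", "sysmem", "fbmem")
    unknown = set(resources) - set(allowed)
    if unknown:
        raise ValueError(f"unsupported resource flags: {sorted(unknown)}")

    argv = ["legate"]
    # Emit flags in a fixed order so the generated command is reproducible.
    for name in allowed:
        if name in resources:
            argv += [f"--{name}", str(resources[name])]
    argv.append(script)
    argv.extend(script_args)
    return argv


cmd = build_legate_command("./main.py", gpus=2, fbmem=40000)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues when script options contain spaces.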

--------------------------------

### Install cuPyNumeric via PyPI (Existing Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the latest version of cuPyNumeric from PyPI into an existing Python environment using pip.

```bash
pip install nvidia-cupynumeric
```

--------------------------------

### Docker Environment Setup for Building Wheels

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/README.md

This snippet sets up a Docker container for building the Python PyPI binary wheels. It runs a container with NVIDIA GPU support, bind-mounts the source directory at /src, adds the CI tool scripts to PATH, and installs the libatomic development package from GCC Toolset 11.

```bash
docker run --rm --runtime=nvidia --gpus all -it --mount type=bind,src=.,dst=/src rapidsai/ci-wheel:cuda12.8.0-rockylinux8-py3.12 bash
cd /src
export PATH=/src/continuous_integration/scripts/tools/:$PATH
dnf install -y gcc-toolset-11-libatomic-devel
```

--------------------------------

### Run Legate Example Program with srun Launcher

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes a Python script ('main.py') using the Legate driver. It specifies using 4 GPUs, the 'srun' launcher, and 2 nodes. The '--verbose' option can be added for more detailed output.

```sh
legate --gpus 4 --launcher srun --nodes 2 ./main.py
```

--------------------------------

### Run Black-Scholes on GPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Black-Scholes algorithm for pricing options on GPUs. This example demonstrates scaling the computation by increasing the number of GPUs and the problem size, measuring the elapsed time for each configuration.

```sh
legate --gpus 1 --sysmem 10000 --fbmem 14000  ./black_scholes.py --num 100000 --precision 32 --time
legate --gpus 2 --sysmem 10000 --fbmem 38000  ./black_scholes.py --num 1000000 --precision 32 --time
legate --gpus 4 --sysmem 10000 --fbmem 38000  ./black_scholes.py --num 2000000 --precision 32 --time
```

--------------------------------

### Run FFT on CPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Command to execute the FFT example on CPU using the Legate launcher. Ensures CPU-only execution by setting --gpus 0.

```sh
legate --cpus 1 --gpus 0 ./fft.py
```

--------------------------------

### Run cuPyNumeric Dot Product Example on Single Node with GPUs

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes a Python script named 'main.py' using the 'legate' command, specifying the use of 2 GPUs. This demonstrates running a cuPyNumeric dot-product calculation on a single workstation with multi-GPU capability.

```sh
legate --gpus 2 ./main.py
```

--------------------------------

### cuPyNumeric Dot Product Example

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

A Python script demonstrating a large-scale dot product calculation using cuPyNumeric. It generates two large random vectors, computes their dot product, and times the operation. This example requires the legate.timing module and cuPyNumeric library.

```python
from legate.timing import time

import cupynumeric as np

# Define the size of the vectors
size = 100000000

start_time = time()

# Generate two random vectors of the specified size
vector1 = np.random.rand(size)
vector2 = np.random.rand(size)

# Compute the dot product using cuPyNumeric
dot_product = np.dot(vector1, vector2)

end_time = time()

# legate.timing.time() reports microseconds by default; convert to milliseconds
elapsed_time = (end_time - start_time) / 1000
print("Dot product:", dot_product)
print(f"Dot product took {elapsed_time:.4f} ms")
```

--------------------------------

### FFT Example Main Function Initialization (Python)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Initializes inputs and performs a GPU-accelerated batched 2D Fast Fourier Transform using cuPyNumeric. Supports dynamic shape configuration via command-line arguments.

```python
import argparse

import numpy as np
import cupy
import legate.array as cp
from legate.core import TaskContext
from legate.core import Future, VariantCode

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-f", "--file", default="fft.py", help="program file"
    )
    parser.add_argument(
        "-n",
        "--nodes",
        type=int,
        default=1,
        help="number of nodes to use for execution",
    )
    parser.add_argument(
        "-d",
        "--dims",
        default="(128, 256, 256)",
        help="dimensions of the input array",
    )
    args = parser.parse_args()

    shape = eval(args.dims)
    cp.init(args.nodes)

    # Allocate arrays on the CPU
    A_np = np.zeros(shape, dtype=np.complex64)
    B_cpn = cp.array(shape, dtype=np.complex64)
    A_cpn = cp.array(shape, dtype=np.complex64)

    # Launch the FFT task
    fft2d_batched_gpu(A_cpn, B_cpn)

    # Wait for all tasks to complete
    cp.finish()


if __name__ == "__main__":
    main()

```

--------------------------------

### Installing the Built Binary Wheel

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/README.md

This command installs the generated binary wheel from the 'final-dist' directory using pip. This is the same wheel that would be produced by the Continuous Integration pipeline.

```bash
pip install final-dist/*.whl
```

--------------------------------

### Install Nightly cuPyNumeric via Conda

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the latest nightly build of cuPyNumeric from the legate-nightly channel. These builds are only lightly validated and should be used at your own risk.

```bash
conda install -c conda-forge -c legate-nightly cupynumeric
```

--------------------------------

### Python: Complete module for Legate Histogram Example

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

This Python snippet provides the complete module for the Legate histogram example. It includes the necessary imports, the main function to set up the data and histogram arrays, and the 'histogram_task' function for computation. This code can be executed using the 'legate' command-line launcher.

```python
import numpy as np

import legate.numpy as cp
from legate.core import TaskContext, REDUCTION_ADD, ReductionArray, VariantCode

@cp.task(cpu_only=True)
def histogram_task(data: cp.ndarray, hist: ReductionArray[REDUCTION_ADD]):
    ctx = TaskContext.get_context()
    if ctx.get_variant_kind() == VariantCode.GPU:
        # Use CuPy arrays on GPU
        data_view = cp.asarray(data)
        hist_view = cp.asarray(hist)
    else:
        # Use NumPy arrays on CPU
        data_view = np.asarray(data)
        hist_view = np.asarray(hist)

    # Compute local histogram for the chunk of data
    local_hist = cp.bincount(data_view, minlength=hist_view.shape[0])

    # Add local histogram results to the global 'hist' array using reduction
    hist_view += local_hist

    # The 'hist' array is a ReductionArray, so the addition is reduced automatically
    # across all tasks/devices.
    return hist_view

def main():
    parser = cp.argparse.ArgumentParser(
        description="GPU-accelerated histogram counting."
    )
    parser.add_argument("--size", type=int, default=1000, help="Size of the input array.")
    args = parser.parse_args()

    # Create a 1D array with random integers
    data = cp.random.randint(0, 10, size=args.size)

    # Create an empty histogram array of length 10
    hist = cp.zeros(10, dtype=cp.int64)

    # Call the histogram task function to compute frequencies
    # The result is accumulated in the 'hist' array via reduction
    histogram_task(data, hist)

    # Print the computed histogram
    print(hist)

    return hist

if __name__ == "__main__":
    main()
```

--------------------------------

### Run matmul on GPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Command to run the matmul example on GPU using the Legate launcher. This command utilizes a specified number of GPUs for accelerated computation.

```sh
legate --gpus 2 ./matmul.py -m 1000 -k 1000 -n 1000
```

--------------------------------

### Run matmul on Multi-Node

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Command to execute the matmul example across multiple nodes using the Legate launcher with srun. This is for distributed execution on HPC systems.

```sh
legate --nodes 2 --launcher srun --gpus 4 --ranks-per-node 1 ./matmul.py -m 1000 -k 1000 -n 1000
```

--------------------------------

### Install Pillow Library for Edge Detection

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command installs the Pillow library from the conda-forge channel. Pillow is a necessary dependency for the edge detection script, used for image manipulation and opening image files.

```sh
conda install -c conda-forge pillow
```

--------------------------------

### Relocatable Test Installation

Source: https://github.com/nv-legate/cupynumeric/blob/main/tests/cpp/CMakeLists.txt

Installs the test executable in a relocatable manner to the binary directory, ensuring it can be found and executed after installation. It also includes the tests in the 'ALL' target.

```cmake
include(GNUInstallDirs)

rapids_test_install_relocatable(INSTALL_COMPONENT_SET testing
                                DESTINATION ${CMAKE_INSTALL_BINDIR} INCLUDE_IN_ALL)
```

--------------------------------

### FFT Example Task Function (Python)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Defines a task for batched 2D FFT using Legate. Utilizes 'align' and 'broadcast' constraints for optimal partitioning and GPU execution.

```python
from legate.core import TaskContext
from legate.core import Future, VariantCode

@cp.task(
    VariantCode.GPU,
    legate_only=True,
    domain=("src.domain", "dst.domain"),
    constraints=[
        cp.align("src", "dst"),
        cp.broadcast("src", (1, 2)),
    ],
)
def fft2d_batched_gpu(src, dst):
    # This task is registered for the GPU variant only, so the body
    # uses CuPy directly rather than choosing a backend at run time.

    # Convert to CuPy arrays (views without copying)
    src_cp = cupy.asarray(src)
    dst_cp = cupy.asarray(dst)

    # Apply 2D FFT for each batch independently
    for i in range(src_cp.shape[0]):
        dst_cp[i] = cupy.fft.fft2(src_cp[i])

    return dst_cp

```

--------------------------------

### Array-Based Operations vs. Loops in cuPyNumeric

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Highlights the benefits of using array-based operations over explicit loops for element-wise updates in cuPyNumeric. It provides examples for updating array elements based on indexing and conditional logic, demonstrating how to achieve the same results more concisely and efficiently.

```python
# x and y are three-dimensional arrays

# NOT recommended: Performing naive element-wise implementation
for i in range(ny):
    for j in range(nx):
        x[0, j, i] = y[3, j, i]

# Recommended: Using array-based operations
x[0] = y[3]
```

```python
# x and y are two-dimensional arrays, and we need to update x
# depending on whether y meets a condition or not.

# NOT recommended: Performing naive element-wise implementation
for i in range(ny):
    for j in range(nx):
        if (y[j, i] < tol):
            x[j, i] = const
        else:
            x[j, i] = 1.0 - const

# Recommended: Using array-based operations
cond = y < tol
x[cond] = const
x[~cond] = 1.0 - const
```
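To check that the loop and array-based forms really agree, the conditional update can be run on a small array. This is a minimal sketch that uses plain NumPy so it runs anywhere; the array sizes and the values of `tol` and `const` are illustrative assumptions. It also shows `np.where` as a single-expression alternative to the two masked assignments.

```python
import numpy as np  # assumption: swap for `import cupynumeric as np` under Legate

nx, ny = 4, 3            # illustrative sizes
tol, const = 0.5, 0.25   # illustrative threshold and constant

y = np.linspace(0.0, 1.0, nx * ny).reshape(nx, ny)

# Element-wise loop version (the form to avoid)
x_loop = np.zeros((nx, ny))
for i in range(ny):
    for j in range(nx):
        if y[j, i] < tol:
            x_loop[j, i] = const
        else:
            x_loop[j, i] = 1.0 - const

# Boolean-mask version
x_mask = np.zeros((nx, ny))
cond = y < tol
x_mask[cond] = const
x_mask[~cond] = 1.0 - const

# Single-expression alternative
x_where = np.where(y < tol, const, 1.0 - const)

assert np.array_equal(x_loop, x_mask)
assert np.array_equal(x_mask, x_where)
```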

--------------------------------

### Resource Scoping to GPU with Legate Core API

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/advanced.rst

Provides a Python code example using the legate.core API to restrict a block of code to run exclusively on GPUs. It involves obtaining the Legate runtime and machine information, then using a context manager for GPU-only execution.

```python
from legate.core import TaskTarget, get_legate_runtime

machine = get_legate_runtime().get_machine()
with machine.only(TaskTarget.GPU):
    # code to run only on GPUs
    pass
```

--------------------------------

### MPI4Py Stencil Operation for Multi-GPU Systems

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/examples/torchswe.ipynb

This example illustrates the complexities of parallelizing stencil operations for multi-GPU systems using MPI4Py and CuPy. It includes setting up MPI communication, determining GPU devices for each rank, and defining data types for halo boundaries. This highlights the manual domain decomposition and inter-GPU communication required.

```python
from mpi4py import MPI
import cupy as cp

num_timesteps = 10


def set_device(comm: MPI.Comm):
    # Device selection for each rank on multi-GPU nodes (TorchSWE-specific)
    n_gpus = cp.cuda.runtime.getDeviceCount()
    local_rank = comm.Get_rank() % n_gpus
    cp.cuda.runtime.setDevice(local_rank)


comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Determine grid size and decompose domain
gnx, gny = 126, 126  # global grid dimensions
local_nx, local_ny = gnx // size, gny  # local grid dimensions per rank
local_grid = cp.ones((local_nx + 2, local_ny + 2))  # with halo boundaries

# Set up MPI data types and boundaries
send_type, recv_type = (
    MPI.DOUBLE.Create_subarray(
        (local_nx + 2, local_ny + 2), (local_nx, local_ny), (1, 1)
    ),
    MPI.DOUBLE.Create_subarray(
        (local_nx + 2, local_ny + 2), (local_nx, local_ny), (1, 1)
    ),
)
send_type.Commit()
recv_type.Commit()

```
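The decomposition above uses `gnx // size`, which silently drops rows when the rank count does not divide the grid evenly (with 4 ranks, 4 * 31 = 124 of the 126 rows are covered). A minimal sketch of a remainder-aware 1D block split follows; `block_range` is a hypothetical helper, not part of TorchSWE or mpi4py, and it runs without MPI.

```python
def block_range(n, size, rank):
    """Return the half-open [start, stop) range of the n rows owned by `rank`.

    The first `n % size` ranks each take one extra row.
    """
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


gnx = 126  # global row count, as in the example above
for size in (3, 4):
    parts = [block_range(gnx, size, rank) for rank in range(size)]
    print(size, parts)
    # The blocks tile [0, gnx) with no gaps or overlaps.
    assert parts[0][0] == 0 and parts[-1][1] == gnx
    assert all(a[1] == b[0] for a, b in zip(parts, parts[1:]))
```

Bookkeeping of this kind is part of the manual work that the MPI version requires for every decomposed array.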

--------------------------------

### Install cuPyNumeric via Conda (Existing Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the cuPyNumeric package into an existing Conda environment. Requires conda version >= 24.1. It installs from the conda-forge and legate channels.

```bash
conda install -c conda-forge -c legate cupynumeric
```

--------------------------------

### Install cuPyNumeric via Conda (New Environment)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Installs the cuPyNumeric package into a new Conda environment. Requires conda version >= 24.1. It installs from the conda-forge and legate channels.

```bash
conda create -n myenv -c conda-forge -c legate cupynumeric
```

--------------------------------

### Run Black-Scholes on CPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Black-Scholes algorithm for pricing options on the CPU using the 'legate' command. This command-line execution is designed for a specific number of options and precision, measuring the elapsed time for the computation.

```sh
legate --cpus 1 --sysmem 10000 ./black_scholes.py --num 10000 --precision 32 --time
```

--------------------------------

### Force GPU cuPyNumeric Installation with Conda

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/installation.rst

Forces Conda to install a version of cuPyNumeric with GPU support, overriding the default system detection. Specify the desired CUDA version. This is useful when you need to ensure GPU acceleration is enabled.

```sh
CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge -c legate cupynumeric
```

--------------------------------

### Configure Install Directory for Libraries

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/CMakeLists.txt

Sets the installation subdirectory for libraries. This determines where libraries will be placed when the project is installed, commonly used for system-wide or package installations.

```cmake
set(CMAKE_INSTALL_LIBDIR lib64)
```

--------------------------------

### Preparing Wheel Dependency

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/README.md

This code shows how to prepare a compatible binary wheel for 'legate' to build against. It involves creating a 'wheel' subdirectory and copying the necessary wheel file into it.

```bash
mkdir wheel
cp /path/to/legate.whl wheel/
```

--------------------------------

### Set Install RPATH for Target

Source: https://github.com/nv-legate/cupynumeric/blob/main/scripts/build/python/cupynumeric/CMakeLists.txt

Applies the defined runtime search paths (RPATH) to the 'cupynumeric' target during installation. This ensures that the installed libraries can find their dependencies.

```cmake
set_property(
  TARGET cupynumeric
  PROPERTY INSTALL_RPATH ${rpaths}
  APPEND
)
```

--------------------------------

### Run Matrix Multiplication with Legate on GPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the simple_mm.py script using Legate on GPUs. This command specifies the number of GPUs, the launcher, number of nodes, and memory allocations. It's used to solve matrix multiplication problems of varying sizes.

```sh
legate --gpus 2 --launcher srun --nodes 1 --sysmem 2000 --fbmem 24000 --eager-alloc-percentage 10 ./simple_mm.py
```

```sh
legate --gpus 4 --launcher srun --nodes 1 --sysmem 2000 --fbmem 38000 --eager-alloc-percentage 10 ./simple_mm.py
```

```sh
legate --gpus 4 --launcher srun --nodes 2 --sysmem 2000 --fbmem 38000 --eager-alloc-percentage 10 ./simple_mm.py
```

--------------------------------

### Allocate 2 GPU Nodes for Legate Execution

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command allocates computing resources on a cluster. It requests 2 nodes, each with 1 task and 4 GPUs, for an interactive session lasting 1 hour and 30 minutes. Ensure the Legate environment is activated before proceeding.

```sh
salloc --nodes 2 --ntasks-per-node 1 --qos interactive --time 01:30:00 --constraint gpu --gpus-per-node 4 --account=<acct_name>
```

--------------------------------

### Run Conjugate Gradient Method on CPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) method script on CPUs using Legate. This command is used to solve a 10,000 x 10,000 2-d adjacency system, with options to control the number of CPUs, memory, iterations, and verification.

```sh
legate --cpus 1 --sysmem 16000 ./cg.py --num 100 --check --time
```
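The CG commands in this document pair `--num` values of 100, 150, 225, and 275 with system sizes of 10,000, 22,500, 50,625, and 75,625. Those figures are consistent with the system dimension being the square of `--num`. The sketch below only reproduces that arithmetic; the relationship is inferred from the documented numbers rather than from the `cg.py` source, and `cg_system_size` is a hypothetical name.

```python
def cg_system_size(num):
    """System dimension for a given --num, assuming a num x num grid."""
    return num * num


for num in (100, 150, 225, 275):
    n = cg_system_size(num)
    print(f"--num {num} -> {n} x {n} system")
```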

--------------------------------

### Python: Main function for Histogram Example

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

This Python snippet defines the main function for the histogram example. It creates a 1D array of random integers and a zero-initialized 'hist' array for storing counts, then calls the 'histogram_task' function to compute frequencies. The parser defines a '--size' argument, but this snippet passes '--size 10000000' to parse_args explicitly, so the input size is fixed at 10,000,000 elements.

```python
import numpy as np

import legate.numpy as cp

def main():
    parser = cp.argparse.ArgumentParser(
        description="GPU-accelerated histogram counting."
    )
    parser.add_argument("--size", type=int, default=1000, help="Size of the input array.")
    args = parser.parse_args([
        "--size", "10000000"
    ])

    # Create a 1D array with random integers
    data = cp.random.randint(0, 10, size=args.size)

    # Create an empty histogram array of length 10
    hist = cp.zeros(10, dtype=cp.int64)

    # Call the histogram task function to compute frequencies
    # The result is accumulated in the 'hist' array via reduction
    histogram_task(data, hist)

    # Print the computed histogram
    print(hist)

    return hist
```

--------------------------------

### Run CG model with 1 GPU using Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) model with 1 GPU, system memory of 48000MB, and framebuffer memory of 14000MB. It solves a 22500x22500 adjacency system and outputs timing information.

```sh
legate --gpus 1 --sysmem 48000 --fbmem 14000 ./cg.py --num 150 --check --time
```

--------------------------------

### Legate Program Output (Duplicated Ranks)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This is sample output from a Legate program run on Perlmutter, demonstrating duplicated output from each rank. To show output only from the first rank, set the environment variable LEGATE_LIMIT_STDOUT=1.

```text
Dot product: 25001932.012924932
Dot product took 141.2350 ms
Dot product: 25001932.012924932
Dot product took 141.2350 ms
```

--------------------------------

### Run Edge Detection on GPU with Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes an edge detection script ('./edge.py') on a single GPU, allocating specific amounts of system and framebuffer memory. It's designed for environments like Perlmutter and assumes the script is available locally.

```sh
legate --gpus 1 --sysmem 16000 --fbmem 38000 ./edge.py
```

--------------------------------

### Run Jacobi Stencil on CPU with Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on the CPU using Legate. It specifies the number of CPUs, system memory, and the grid size and number of iterations for the Jacobi stencil computation. The output shows the grid generation and the elapsed time in milliseconds.

```sh
legate --cpus 1 --sysmem 16000 ./jacobi_stencil.py --size 10000 --iterations 100
```

--------------------------------

### Run Jacobi Stencil on GPU with Legate (2 GPUs)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on two GPUs using Legate, demonstrating scalability. It configures system and framebuffer memory and sets a larger grid size and more iterations. The output shows the performance improvement with multiple GPUs for a larger problem size.

```sh
legate --gpus 2 --sysmem 16000 --fbmem 38000 ./jacobi_stencil.py --size 30000 --iterations 200
```

--------------------------------

### Run cuPyNumeric on GPU with Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/task.rst

Executes a cuPyNumeric script on GPUs. The command specifies the number of GPUs to allocate for execution. Ensure Legate is installed and accessible in your environment.

```sh
legate --gpus 2 ./fft.py
```

--------------------------------

### Replace NumPy Import with cuPyNumeric (Python)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/usage.rst

This snippet shows how to replace the standard NumPy import statement with cuPyNumeric to use GPU acceleration. No dependencies are required beyond a cuPyNumeric installation.

```python
import numpy as np
```

```python
import cupynumeric as np
```
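Because the swap is a single import line, a script can be written to run on machines with or without cuPyNumeric. The following is a minimal sketch of that pattern; falling back to plain NumPy is an assumption about what a given project wants, not a recommendation from the cuPyNumeric documentation.

```python
try:
    import cupynumeric as np  # accelerated backend, if installed
except ImportError:
    import numpy as np  # assumption: plain NumPy is an acceptable fallback

# The rest of the script is identical for both backends.
a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.ones((3, 2))
c = a @ b

print("backend:", np.__name__)
print("sum of a @ b:", float(c.sum()))
```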

--------------------------------

### Run CG model with 2 GPUs using Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) model with 2 GPUs, system memory of 40000MB, and framebuffer memory of 38000MB. It solves a 50625x50625 adjacency system and outputs timing information.

```sh
legate --gpus 2 --sysmem 40000 --fbmem 38000 ./cg.py --num 225 --check --time
```

--------------------------------

### Run CG model with 4 GPUs using Legate

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Executes the Conjugate Gradient (CG) model with 4 GPUs, system memory of 40000MB, and framebuffer memory of 38000MB. It solves a 75625x75625 adjacency system and outputs timing information.

```sh
legate --gpus 4 --sysmem 40000 --fbmem 38000 ./cg.py --num 275 --check --time
```

--------------------------------

### Conditional CUDA Setup

Source: https://github.com/nv-legate/cupynumeric/blob/main/tests/cpp/CMakeLists.txt

Conditionally finds the CUDA Toolkit and enables the CUDA language if the Legion_USE_CUDA flag is set. This allows for building tests with GPU support when required.

```cmake
if(Legion_USE_CUDA)
  find_package(CUDAToolkit REQUIRED)
  enable_language(CUDA)
endif()
```

--------------------------------

### Run Jacobi Stencil on GPU with Legate (1 GPU)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on a single GPU using Legate. It specifies the number of GPUs, system memory, and framebuffer memory, along with the grid size and number of iterations. The output indicates the elapsed time for the GPU computation, which is significantly faster than CPU.

```sh
legate --gpus 1 --sysmem 16000 --fbmem 15000 ./jacobi_stencil.py --size 15000 --iterations 100
```

--------------------------------

### cuPyNumeric Detailed API Coverage Report Format

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/howtos/measuring.rst

Example format of a detailed CSV coverage report. It lists each NumPy function called, its location in the source code, and a boolean indicating if it's implemented by cuPyNumeric.

```csv
function_name,location,implemented
numpy.array,tests/dot.py:27,True
numpy.ndarray.__init__,tests/dot.py:27,True
numpy.array,tests/dot.py:28,True
numpy.ndarray.__init__,tests/dot.py:28,True
numpy.ndarray.dot,tests/dot.py:31,True
numpy.ndarray.__init__,tests/dot.py:31,True
numpy.allclose,tests/dot.py:33,True
numpy.ndarray.__init__,tests/dot.py:33,True
```
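A report in this format can be summarized with the standard library. The sketch below parses a few rows copied from the sample above and counts calls, distinct functions, and any functions marked as not implemented. It assumes the `implemented` column holds the literal strings `True` and `False`, and a real report would be read from a file rather than an inline string.

```python
import csv
import io
from collections import Counter

# Rows copied from the sample report above.
report = """\
function_name,location,implemented
numpy.array,tests/dot.py:27,True
numpy.ndarray.__init__,tests/dot.py:27,True
numpy.ndarray.dot,tests/dot.py:31,True
numpy.allclose,tests/dot.py:33,True
"""

rows = list(csv.DictReader(io.StringIO(report)))

# Number of call sites per function name
calls = Counter(row["function_name"] for row in rows)

# Distinct functions that the report marks as not implemented
missing = sorted({r["function_name"] for r in rows if r["implemented"] != "True"})

# Fraction of distinct functions that are implemented
coverage = 1.0 - len(missing) / len(calls)

print(f"{len(rows)} calls, {len(calls)} distinct functions, coverage {coverage:.0%}")
print("not implemented:", missing or "none")
```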

--------------------------------

### Run Jacobi Stencil on GPU with Legate (4 GPUs)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command executes the jacobi_stencil.py script on four GPUs using Legate, showcasing performance with more parallel resources. It uses substantial system and framebuffer memory and processes a very large grid with many iterations. The output highlights the efficiency of distributed GPU computation.

```sh
legate --gpus 4 --sysmem 16000 --fbmem 38000 ./jacobi_stencil.py --size 50000 --iterations 300
```

--------------------------------

### Multi-node Execution with Manual Task Manager

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/advanced.rst

Shows how to initiate multi-node execution of cuPyNumeric programs using a manual task manager like 'mpirun'. This approach allows direct control over process distribution.

```sh
mpirun -np N legate script.py <script options>
```

--------------------------------

### Import Libraries for Edge Detection (NumPy, SciPy, Matplotlib, PIL)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/examples/edge_detection.ipynb

Imports necessary libraries for image manipulation, convolution, and plotting. This setup is required for the edge detection process. Dependencies include NumPy, SciPy, Matplotlib, and PIL.

```python
import numpy as np
from numpy import ndarray
from scipy.signal import convolve
from matplotlib import pyplot as plt
from PIL import Image
```

--------------------------------

### Black-Scholes Model Formulas

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

Mathematical formulas for the Black-Scholes model, used to calculate theoretical prices of stock options. Includes formulas for call and put options (C and P) and intermediate values d1 and d2.

```mathematica
C = S_0*N(d_1) - K*exp(-r*T)*N(d_2)
P = K*exp(-r*T)*N(-d_2) - S_0*N(-d_1)

d_1 = (ln(S_0/K) + (r + sigma^2/2)*T)/(sigma*sqrt(T))
d_2 = d_1 - sigma*sqrt(T)
```
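The formulas translate directly into code. The following is a minimal scalar sketch using only the Python standard library, with the standard normal CDF N(x) written in terms of the error function. It is not the `black_scholes.py` example script, and the input values are illustrative assumptions.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal CDF N(x), expressed with the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes(s0, k, r, sigma, t):
    """Return (call, put) prices from the formulas above."""
    d1 = (log(s0 / k) + (r + sigma**2 / 2.0) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    call = s0 * norm_cdf(d1) - k * exp(-r * t) * norm_cdf(d2)
    put = k * exp(-r * t) * norm_cdf(-d2) - s0 * norm_cdf(-d1)
    return call, put


# Illustrative inputs: spot 100, strike 100, 5% rate, 20% volatility, 1 year
call, put = black_scholes(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0)
print(f"call={call:.4f} put={put:.4f}")

# Sanity check via put-call parity: C - P = S_0 - K*exp(-r*T)
assert abs((call - put) - (100.0 - 100.0 * exp(-0.05))) < 1e-9
```

An array version would replace the `math` functions with their array counterparts so that many options are priced in one call.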

--------------------------------

### Google Test Integration with CPM

Source: https://github.com/nv-legate/cupynumeric/blob/main/tests/cpp/CMakeLists.txt

Integrates Google Test using the CPM (CMake Package Manager) by including the necessary CMake script and configuring it for export sets. This ensures that Google Test is available and properly installed for the project.

```cmake
include(${rapids-cmake-dir}/cpm/gtest.cmake)

# BUILD_EXPORT_SET and INSTALL_EXPORT_SET are crucial, otherwise gtest does not get
# installed
rapids_cpm_gtest(BUILD_EXPORT_SET cupynumeric-exports
                 INSTALL_EXPORT_SET cupynumeric-exports)
```

--------------------------------

### Multi-node Execution with Legate Driver

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/advanced.rst

Demonstrates how to run cuPyNumeric programs in parallel across multiple nodes using the 'legate' driver and specifying a task launcher like 'srun'. This requires a task launcher for 2 or more nodes.

```sh
legate --launcher srun --nodes 2 script.py <script options>
```

--------------------------------

### Execute cuPyNumeric Matrix Multiplication on CPU

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/tutorial.rst

This command runs the 'simple_mm.py' script, which uses cuPyNumeric for matrix multiplication, on the CPU. It allocates 1 CPU core and 8000 MB of system memory.

```sh
legate --cpus 1 --sysmem 8000 ./simple_mm.py
```

--------------------------------

### Example NumPy Code (Python)

Source: https://github.com/nv-legate/cupynumeric/blob/main/docs/cupynumeric/source/user/howtos/patching.rst

A simple Python script demonstrating the use of NumPy functions, specifically np.eye for creating an identity matrix and np.linalg.cholesky for performing a Cholesky decomposition. This code can be used unchanged with lgpatch to test cuPyNumeric.

```python
# test.py

import numpy as np
input = np.eye(10, dtype=np.float32)
np.linalg.cholesky(input)
```