### CuPy Distributed Computing

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.signal.windows.blackmanharris

Initialization and backend setup for distributed GPU computing.

```APIDOC
## Distributed Computing API

### Description
This API facilitates distributed computation across multiple GPUs and nodes by providing initialization functions and backend support for communication libraries.

### Endpoints
- `cupyx.distributed.init_process_group`
- `cupyx.distributed.NCCLBackend`

### Details
- **init_process_group**: Starts `cupyx.distributed` for the calling process and returns a communicator (backend object) used for collective operations.
- **NCCLBackend**: Backend class that performs the communication through NVIDIA's NCCL library.
```

--------------------------------

### Launch and Synchronize CUDA Graph on a Stream (CuPy)

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.ExternalStream

Demonstrates launching a CUDA graph 'g' on a specific stream 's1' and then synchronizing that stream to ensure completion. It also shows how to use a context manager for launching on another stream 's2'.

```python
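import cupy as cp

s1 = cp.cuda.Stream(non_blocking=True)
with s1:
    s1.begin_capture()
    # ... perform operations to construct a graph ...
    g = s1.end_capture()

g.launch(stream=s1)  # launch on an explicit stream
s1.synchronize()
s2 = cp.cuda.Stream()
with s2:  # s2 is the current stream, so no stream argument is needed
    g.launch()
s2.synchronize()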
```

--------------------------------

### cupy.cuda.runtime.driverGetVersion

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.signal.savgol_filter

Gets the version of the installed NVIDIA driver.

```APIDOC
## cupy.cuda.runtime.driverGetVersion

### Description
Gets the version of the installed NVIDIA driver.

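### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
`cupy.cuda.runtime.driverGetVersion`

### Parameters
None

### Request Example
N/A

### Response
#### Return Value
- **version** (int) - The driver version, encoded as a single integer (`1000 * major + 10 * minor`).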

#### Response Example
N/A

```

--------------------------------

### Get CUDA Driver Version (Python)

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.runtime.driverGetVersion

Retrieves the version of the NVIDIA CUDA driver installed on the system. This function does not take any arguments and returns an integer representing the driver version.

```python
import cupy

driver_version = cupy.cuda.runtime.driverGetVersion()
print(f"CUDA Driver Version: {driver_version}")
```
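
The integer follows the CUDA convention of packing the version as `1000 * major + 10 * minor`, so CUDA 11.7 is reported as `11070`. The sketch below decodes it; `decode_cuda_version` is a hypothetical helper name used for illustration and is not part of CuPy. The driver query is wrapped in a `try` block so the snippet also runs on a machine without CuPy or a GPU.

```python
def decode_cuda_version(version):
    """Split a CUDA-style integer version (1000*major + 10*minor) into (major, minor)."""
    major, rest = divmod(version, 1000)
    return major, rest // 10


# Pure-Python check of the decoding; no GPU is needed for this line.
print(decode_cuda_version(11070))  # (11, 7)

try:
    import cupy
    print(decode_cuda_version(cupy.cuda.runtime.driverGetVersion()))
except Exception as exc:  # CuPy missing, or no usable driver
    print(f"Could not query the driver: {exc}")
```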

--------------------------------

### CuPy JIT Kernel Syntax

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.polynomial.polynomial.polycompanion

API documentation for Just-In-Time (JIT) kernel construction utilities.

```APIDOC
## cupyx.jit

### Description
Utilities for Just-In-Time (JIT) compilation of CUDA kernels, providing access to thread and block dimensions, synchronization primitives, and shared memory.

### Endpoints

*   `cupyx.jit.rawkernel`
*   `cupyx.jit.threadIdx`
*   `cupyx.jit.blockDim`
*   `cupyx.jit.blockIdx`
*   `cupyx.jit.gridDim`
*   `cupyx.jit.grid`
*   `cupyx.jit.gridsize`
*   `cupyx.jit.laneid`
*   `cupyx.jit.warpsize`
*   `cupyx.jit.range`
*   `cupyx.jit.syncthreads`
*   `cupyx.jit.syncwarp`
*   `cupyx.jit.shfl_sync`
*   `cupyx.jit.shfl_up_sync`
*   `cupyx.jit.shfl_down_sync`
*   `cupyx.jit.shfl_xor_sync`
*   `cupyx.jit.shared_memory`
*   `cupyx.jit.atomic_add`
*   `cupyx.jit.atomic_sub`
*   `cupyx.jit.atomic_exch`
*   `cupyx.jit.atomic_min`
*   `cupyx.jit.atomic_max`
*   `cupyx.jit.atomic_inc`
*   `cupyx.jit.atomic_dec`
*   `cupyx.jit.atomic_cas`
*   `cupyx.jit.atomic_and`
*   `cupyx.jit.atomic_or`
*   `cupyx.jit.atomic_xor`
*   `cupyx.jit.cg.this_grid`
*   `cupyx.jit.cg.this_thread_block`
*   `cupyx.jit.cg.sync`
*   `cupyx.jit.cg.memcpy_async`
*   `cupyx.jit.cg.wait`
*   `cupyx.jit.cg.wait_prior`
*   `cupyx.jit._interface._JitRawKernel`
```
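
As a rough illustration of how these names fit together, the sketch below follows the grid-stride copy kernel from the CuPy user guide. The `HAVE_GPU` guard is an addition made here (an assumption, not part of the CuPy API) so the snippet can be imported on machines without CUDA; the array size and launch configuration are arbitrary.

```python
try:
    import cupy as cp
    from cupyx import jit
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    HAVE_GPU = False

if HAVE_GPU:
    @jit.rawkernel()
    def elementwise_copy(x, y, size):
        # Grid-stride loop: each thread handles indices tid, tid + ntid, ...
        tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
        ntid = jit.gridDim.x * jit.blockDim.x
        for i in range(tid, size, ntid):
            y[i] = x[i]

    n = 2 ** 16
    size = cp.uint32(n)
    x = cp.random.normal(size=(n,), dtype=cp.float32)
    y = cp.empty((n,), dtype=cp.float32)

    # Launch with 128 blocks of 1024 threads each.
    elementwise_copy[128, 1024](x, y, size)
    copied_ok = bool((x == y).all())
    print(copied_ok)
```

The kernel body is ordinary Python syntax, but it is compiled to CUDA, so only the subset of Python supported by `cupyx.jit` can be used inside it.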

--------------------------------

### CuPy CUDA Kernel Launch API

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.jit.cg.this_grid

Documentation for launching custom kernels (ElementwiseKernel, ReductionKernel, RawKernel) and managing kernel modules.

```APIDOC
## Cupy CUDA Kernel Launch API

### Description
Provides classes and functions for defining, compiling, and launching custom CUDA kernels, including element-wise, reduction, and raw kernels.

### Classes

- **`cupy.ElementwiseKernel(in_params, out_params, operation, name='kernel', ...)`**
  - A user-defined kernel that applies `operation` element-wise, with broadcasting, to its input arrays.

- **`cupy.ReductionKernel(in_params, out_params, map_expr, reduce_expr, post_map_expr, identity, name='reduce_kernel', ...)`**
  - A user-defined kernel that maps each input element, reduces the mapped values, and post-processes the result.

- **`cupy.RawKernel(code, name, options=(), backend='nvrtc', ...)`**
  - A kernel defined by raw CUDA C/C++ source; `name` selects the kernel function inside `code`.

- **`cupy.RawModule(code=None, *, path=None, options=(), backend='nvrtc', ...)`**
  - A module built from CUDA source or a compiled binary that can contain multiple raw kernels.

### Functions

- **`cupy.fuse(*args, **kwargs)`**
  - Decorator that fuses the element-wise and reduction operations inside a Python function into a single kernel.

### JIT (Just-In-Time) Compilation API

- **`cupyx.jit.rawkernel(*, mode='cuda', device=False)`**
  - Decorator that compiles a Python function into a CUDA kernel.

- **`cupyx.jit.threadIdx`**, **`cupyx.jit.blockDim`**, **`cupyx.jit.blockIdx`**, **`cupyx.jit.gridDim`**
  - Thread and block indexing variables for JIT kernels.

- **`cupyx.jit.grid`**, **`cupyx.jit.gridsize`**
  - Functions that compute the calling thread's index in the grid and the total grid size, respectively.

- **`cupyx.jit.laneid`**, **`cupyx.jit.warpsize`**
  - The lane ID of the calling thread within its warp, and the warp size.

- **`cupyx.jit.range(*args, unroll=None)`**
  - A `range` for JIT kernels with optional loop unrolling.

- **`cupyx.jit.syncthreads()`**, **`cupyx.jit.syncwarp()`**
  - Synchronization primitives within a thread block or warp.

- **`cupyx.jit.shfl_sync`**, **`cupyx.jit.shfl_up_sync`**, **`cupyx.jit.shfl_down_sync`**, **`cupyx.jit.shfl_xor_sync`**
  - Warp shuffle operations.

- **`cupyx.jit.shared_memory`**
  - Access to shared memory in JIT kernels.

- **`cupyx.jit.atomic_add`**, **`cupyx.jit.atomic_sub`**, **`cupyx.jit.atomic_exch`**, **`cupyx.jit.atomic_min`**, **`cupyx.jit.atomic_max`**, **`cupyx.jit.atomic_inc`**, **`cupyx.jit.atomic_dec`**, **`cupyx.jit.atomic_cas`**, **`cupyx.jit.atomic_and`**, **`cupyx.jit.atomic_or`**, **`cupyx.jit.atomic_xor`**
  - Atomic operations for JIT kernels.

- **`cupyx.jit.cg.this_grid`**, **`cupyx.jit.cg.this_thread_block`**, **`cupyx.jit.cg.sync`**, **`cupyx.jit.cg.memcpy_async`**, **`cupyx.jit.cg.wait`**, **`cupyx.jit.cg.wait_prior`**
  - Cooperative Groups APIs for JIT kernels.

### Internal Interfaces

- **`cupyx.jit._interface._JitRawKernel`**
  - Internal base class for JIT raw kernels.

### Memoization

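- **`cupy.memoize(for_each_device=False)`**
  - Decorator that memoizes a function's result for each set of arguments (and, optionally, each device).

- **`cupy.clear_memo()`**
  - Clears the memoized results of all functions decorated with `cupy.memoize`.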

```
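
To make the parameter strings concrete, here is a minimal sketch based on the `squared_diff` and `l2norm` examples from the CuPy user guide. The `HAVE_GPU` guard is an assumption added here so the snippet can be imported without a CUDA device; the input shapes are arbitrary.

```python
try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    HAVE_GPU = False

if HAVE_GPU:
    squared_diff = cp.ElementwiseKernel(
        'float32 x, float32 y',   # input parameters
        'float32 z',              # output parameters
        'z = (x - y) * (x - y)',  # per-element operation
        'squared_diff')           # kernel name

    l2norm = cp.ReductionKernel(
        'T x',          # input parameters
        'T y',          # output parameters
        'x * x',        # map expression
        'a + b',        # reduce expression
        'y = sqrt(a)',  # post-reduction map
        '0',            # identity value
        'l2norm')       # kernel name

    x = cp.arange(10, dtype=cp.float32).reshape(2, 5)
    y = cp.arange(5, dtype=cp.float32)
    z = squared_diff(x, y)     # y is broadcast against each row of x
    norms = l2norm(x, axis=1)  # one L2 norm per row
    print(z)
    print(norms)
```

Both kernels are compiled on first call and cached, so repeated calls with the same argument types avoid recompilation.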

--------------------------------

### Start Nested Profiling Range with RangePush and RangePop (Python)

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.nvtx.RangePush

This example demonstrates how to use `RangePush` and `RangePop` to create nested profiling ranges, which mark sections of code so they appear as named spans in NVTX-aware profilers. The `message` parameter names the range, and the optional `id_color` parameter selects a color for visualization.

```python
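import cupy
from cupy.cuda.nvtx import RangePush, RangePop

N = 4             # number of doublings (illustrative value)
A = cupy.ones(8)  # array to double (illustrative value)

RangePush("Nested Powers of A")
for i in range(N):
    RangePush("Iter {}: Double A".format(i))
    A = 2*A
    RangePop()
RangePop()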

```

--------------------------------

### cupy.show_config

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.tril_indices

Print the configuration of the current CuPy environment.

```APIDOC
## cupy.show_config

### Description
Print the configuration of the current CuPy environment.

### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
N/A (Function within the library)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```python
import cupy as cp
cp.show_config()
```

### Response
#### Output
Prints the CuPy configuration to standard output.

#### Response Example (illustrative; the fields printed depend on the CuPy version)
```
Platform:       linux
Arch:           x86_64
OS:             linux

CUDA available: True
CUDA root:      /usr/local/cuda-11.7
CUDA version:   11.7

cuDNN available: True
cuDNN version:   8.4

BLAS vendor:    OpenBLAS
BLAS version:   0.3.19
```
```

--------------------------------

### cupy.get_array_module

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.sparse.linalg.minres

Gets the array module.

```APIDOC
## cupy.get_array_module

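### Description
Returns the array module for the given arguments: `cupy` if at least one argument is a `cupy.ndarray`, otherwise `numpy`. It is used to write code that runs on both CPU and GPU.

### Method
N/A (Function)

### Endpoint
N/A (Function)

### Parameters
- **args** - Values used to decide whether NumPy or CuPy should be used.

### Request Example
N/A

### Response
- **module** - `cupy` or `numpy`, depending on the types of the arguments.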
```
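
A typical use is CPU/GPU-generic code: look up the module once, then call array functions through it. The sketch below is one way to do this; `get_xp` and `softplus` are hypothetical helper names, and the NumPy fallback when CuPy is not installed is an assumption added so the example runs anywhere.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # CuPy not installed: only NumPy arrays can be handled
    cp = None


def get_xp(*arrays):
    """Return cupy if CuPy is installed and any argument is a CuPy array, else numpy."""
    if cp is None:
        return np
    return cp.get_array_module(*arrays)


def softplus(x):
    """Numerically stable log(1 + exp(x)) for NumPy or CuPy arrays."""
    xp = get_xp(x)
    return xp.maximum(x, 0) + xp.log1p(xp.exp(-xp.abs(x)))


out = softplus(np.array([0.0, 1.0]))
print(type(out).__module__, out)
```

Passing a `cupy.ndarray` to `softplus` runs the same expression on the GPU without any change to the function body.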

--------------------------------

### Launching and Synchronizing CUDA Graphs on Streams

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.Stream

Demonstrates how to launch a CUDA graph on a specific stream and synchronize its execution. This allows for fine-grained control over asynchronous operations and ensures that subsequent code waits for the graph to complete.

```python
# `g` is a graph returned by Stream.end_capture(); `s1` is a cupy.cuda.Stream.
g.launch(stream=s1)
s1.synchronize()
```

```python
import cupy as cp

s2 = cp.cuda.Stream()
with s2:
    g.launch()
s2.synchronize()
```

--------------------------------

### cupyx.scipy.get_array_module

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.sparse.linalg.minres

Gets the array module for SciPy.

```APIDOC
## cupyx.scipy.get_array_module

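### Description
Returns the SciPy-compatible module for the given arguments: `cupyx.scipy` if at least one argument is a `cupy.ndarray`, otherwise `scipy`.

### Method
N/A (Function)

### Endpoint
N/A (Function)

### Parameters
- **args** - Values used to decide whether SciPy or CuPy should be used.

### Request Example
N/A

### Response
- **module** - `cupyx.scipy` or `scipy`, depending on the types of the arguments.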
```

--------------------------------

### cupy.tril_indices_from

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.tril_indices_from

Get the indices for the lower-triangle of an array.

```APIDOC
## cupy.tril_indices_from

### Description
Returns the indices for the lower-triangle of arr.

### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
N/A

### Parameters
- **arr** (cupy.ndarray) - The returned indices are valid for square arrays with the same dimensions as `arr`.
- **k** (int, optional) - Diagonal offset. Defaults to 0.

### Request Example
```python
import cupy

arr = cupy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
tril_indices = cupy.tril_indices_from(arr, k=0)
print(tril_indices)
# Output: (array([0, 1, 1, 2, 2, 2]), array([0, 0, 1, 0, 1, 2]))
```

### Response
#### Success Response (200)
N/A (This is a function call, not a REST API endpoint)

#### Response Example
N/A
```

--------------------------------

### CuPy Distributed Backend

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.signal.vectorstrength

Initialization for distributed computing with NCCL backend.

```APIDOC
## Cupy Distributed Backend

### Description
Functions for initializing distributed process groups, primarily using the NCCL backend for multi-GPU communication.

### API
- `cupyx.distributed.init_process_group`: Initializes a distributed process group.
- `cupyx.distributed.NCCLBackend`: Specifies the NCCL backend for distributed operations.
```

--------------------------------

### CuPy SciPy Sparse Matrix Constructors

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.sparse.linalg.SuperLU

Documentation for creating sparse matrices using various formats like COO, CSC, and CSR.

```APIDOC
## CuPy SciPy Sparse Matrix Constructors

### Description
Provides constructors for creating sparse matrices in different formats, including Coordinate (COO), Compressed Sparse Row (CSR), and Compressed Sparse Column (CSC).

### Method
Various (primarily class constructors)

### Endpoint
N/A (Class instantiations)

### Parameters
Refer to individual sparse matrix class documentation for specific parameters.

### Request Example
N/A

### Response
N/A (Returns sparse matrix objects)

#### Success Response (N/A)
N/A

#### Response Example
N/A

**Sparse Matrix Constructors:**
- `coo_matrix`
- `csc_matrix`
- `csr_matrix`
- `dia_matrix`
```

--------------------------------

### cupy.cuda.runtime.deviceGetMemPool

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.runtime.deviceGetMemPool

Get the current mempool on the current device.

```APIDOC
## cupy.cuda.runtime.deviceGetMemPool

### Description
Get the current mempool on the current device.

### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
`cupy.cuda.runtime.deviceGetMemPool`

### Parameters
- **device** (int) - Required - The device ID for which to retrieve the memory pool.

### Request Example
```python
import cupy

mempool_ptr = cupy.cuda.runtime.deviceGetMemPool(0)
```

### Response
#### Return Value
- **mempool_ptr** (intptr_t) - A pointer to the memory pool.

#### Response Example
The pointer is returned as a Python integer, for example `140707073071488`.
```

--------------------------------

### cupy.cuda.runtime.getDeviceCount

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.linalg.eigvalsh

Describes the getDeviceCount function to get the number of devices.

```APIDOC
## cupy.cuda.runtime.getDeviceCount

### Description
Retrieves the number of CUDA devices available.

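### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
`cupy.cuda.runtime.getDeviceCount`

### Parameters
None

### Request Example
N/A

### Response
- **count** (int) - The number of CUDA devices visible to the process.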
```
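
One common use of the device count is deciding at startup whether GPU code paths can be taken. The sketch below wraps the call so that a missing CuPy installation or a missing driver is reported as zero devices; `usable_gpu_count` is a hypothetical helper name, not a CuPy function.

```python
def usable_gpu_count():
    """Return the number of visible CUDA devices, or 0 if CuPy/CUDA is unavailable."""
    try:
        import cupy
        return cupy.cuda.runtime.getDeviceCount()
    except Exception:  # ImportError, or a CUDA runtime error when no driver is present
        return 0


n = usable_gpu_count()
print(f"{n} CUDA device(s) visible")
```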

--------------------------------

### cupyx.scipy.sparse Matrix Constructors and Utilities

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.ndimage.uniform_filter

Documentation for sparse matrix constructors and utility functions in CuPy's SciPy-compatible `cupyx.scipy.sparse` module.

```APIDOC
## cupyx.scipy.sparse Matrix Constructors and Utilities

### Description
Provides functionalities for creating and manipulating sparse matrices, including various formats and utility functions.

### Method
N/A (These are function/class calls, not API endpoints)

### Endpoint
N/A

### Parameters
Refer to individual function or class documentation for specific parameters.

### Request Example
N/A

### Response
N/A

### Available Functions/Classes:
- **Constructors:**
  - `coo_matrix`
  - `csc_matrix`
  - `csr_matrix`
  - `dia_matrix`
- **Utilities:**
  - `spmatrix` (Base class)
  - `eye`
  - `identity`
  - `kron`
  - `kronsum`
  - `diags`
  - `spdiags`
  - `tril`
  - `triu`
  - `bmat`
  - `hstack`
  - `vstack`
  - `rand`
  - `random`
  - `find`
  - `issparse`
  - `isspmatrix`
  - `isspmatrix_csc`
  - `isspmatrix_csr`
  - `isspmatrix_coo`
  - `isspmatrix_dia`
```
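
As a small illustration of the utilities listed above, the sketch below builds a tridiagonal matrix with `diags` and checks it with `issparse`. Because `cupyx.scipy.sparse` mirrors `scipy.sparse`, the same code is written against whichever module is usable; the `load_sparse_backend` helper and the SciPy fallback are assumptions added here so the example also runs on a machine without a GPU.

```python
def load_sparse_backend():
    """Return (array_module, sparse_module): CuPy if a GPU is usable, else NumPy/SciPy."""
    try:
        import cupy as cp
        from cupyx.scipy import sparse
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp, sparse
    except Exception:  # CuPy not installed, or no usable CUDA device
        pass
    import numpy as np
    from scipy import sparse
    return np, sparse


xp, sparse = load_sparse_backend()


def laplacian_1d(n):
    """Tridiagonal matrix with 2 on the main diagonal and -1 next to it, in CSR format."""
    main = xp.full(n, 2.0)
    off = xp.full(n - 1, -1.0)
    return sparse.diags([off, main, off], offsets=[-1, 0, 1], format="csr")


L = laplacian_1d(4)
dense = L.toarray()
print(sparse.issparse(L), dense.shape)
```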

--------------------------------

### cupyx.scipy.sparse Matrix Constructors and Operations

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.tanh

Documentation for sparse matrix constructors and basic operations in the cupyx.scipy.sparse module.

```APIDOC
## cupyx.scipy.sparse Matrix Constructors and Operations

### Description
Provides tools for creating and manipulating sparse matrices, which are matrices where most elements are zero.

### Methods

- **coo_matrix**: Constructor for the COOrdinate format sparse matrix.
- **csc_matrix**: Constructor for the Compressed Sparse Column format sparse matrix.
- **csr_matrix**: Constructor for the Compressed Sparse Row format sparse matrix.
- **dia_matrix**: Constructor for the DIAgonal format sparse matrix.
- **spmatrix**: Base class for sparse matrices.
- **eye**: Creates a sparse matrix with ones on a diagonal.
- **identity**: Creates an identity matrix in sparse format.
- **kron**: Kronecker product of two sparse matrices.
- **kronsum**: Kronecker sum of two sparse matrices.
- **diags**: Construct a sparse matrix from diagonals.
- **spdiags**: Construct a sparse matrix from diagonals stored row-wise in a 2-D array.
- **tril**: Return the lower triangular part of a matrix.
- **triu**: Return the upper triangular part of a matrix.
- **bmat**: Construct a sparse matrix from a list of blocks.
- **hstack**: Stack sparse matrices horizontally.
- **vstack**: Stack sparse matrices vertically.
- **rand**: Create a random sparse matrix with uniformly distributed values.
- **random**: Create a random sparse matrix of a given shape and density.
- **find**: Return the indices and values of the non-zero elements.
- **issparse**: Check if an object is a sparse matrix.
- **isspmatrix**: Check if an object is a sparse matrix (generic).
- **isspmatrix_csc**: Check if an object is a CSC sparse matrix.
- **isspmatrix_csr**: Check if an object is a CSR sparse matrix.
- **isspmatrix_coo**: Check if an object is a COO sparse matrix.
- **isspmatrix_dia**: Check if an object is a DIA sparse matrix.

### Example
```python
import cupy as cp
from cupyx.scipy.sparse import csr_matrix

# Create a sparse matrix
row = cp.array([0, 0, 1, 2, 2, 2])
col = cp.array([0, 2, 2, 0, 1, 2])
data = cp.array([1, 2, 3, 4, 5, 6])
s = csr_matrix((data, (row, col)), shape=(3, 3))

print(s.toarray()) # Convert to dense array for printing
```
```

--------------------------------

### cupy.cuda.runtime.getDeviceProperties

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.linalg.eigvalsh

Describes the getDeviceProperties function to get the device properties.

```APIDOC
## cupy.cuda.runtime.getDeviceProperties

### Description
Retrieves properties of a CUDA device.

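### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
`cupy.cuda.runtime.getDeviceProperties`

### Parameters
- **device** (int) - Required - The device ID to query.

### Request Example
N/A

### Response
- **properties** (dict) - The properties of the device.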
```
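
The sketch below shows one way to summarize every visible device from the returned dictionary. It assumes the dictionary contains the `name` (bytes), `totalGlobalMem`, `major`, and `minor` entries that mirror the CUDA `cudaDeviceProp` fields; `describe_devices` is a hypothetical helper name, and it returns an empty list when CuPy or a CUDA device is unavailable.

```python
def describe_devices():
    """Return one summary line per visible CUDA device (empty list if none)."""
    try:
        from cupy.cuda import runtime
        count = runtime.getDeviceCount()
    except Exception:  # CuPy not installed, or no usable CUDA driver
        return []

    lines = []
    for dev in range(count):
        props = runtime.getDeviceProperties(dev)
        name = props["name"].decode()               # device name is stored as bytes
        mem_gib = props["totalGlobalMem"] / 2**30   # bytes -> GiB
        cc = f'{props["major"]}.{props["minor"]}'   # compute capability
        lines.append(f"GPU {dev}: {name}, {mem_gib:.1f} GiB, compute capability {cc}")
    return lines


summaries = describe_devices()
for line in summaries:
    print(line)
```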

--------------------------------

### Distributed Computing

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.linalg.companion

Utilities for initializing distributed training environments.

```APIDOC
## Distributed Computing

### Description
Provides functionalities to initialize distributed communication backends, essential for multi-GPU and multi-node training.

### Endpoints

#### `cupyx.distributed.init_process_group`

- **Description**: Initializes a process group for distributed communication.

#### `cupyx.distributed.NCCLBackend`

- **Description**: Specifies the NCCL backend for distributed communication.
```

--------------------------------

### cupy.cuda.runtime.getDevice

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.linalg.eigvalsh

Describes the getDevice function to get the current device.

```APIDOC
## cupy.cuda.runtime.getDevice

### Description
Retrieves the currently selected device.

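### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
`cupy.cuda.runtime.getDevice`

### Parameters
None

### Request Example
N/A

### Response
- **device** (int) - The ID of the currently selected device.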
```

--------------------------------

### cupy.cuda.nvtx.RangePop

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.nvtx.RangePop

Ends a nested range started by a RangePush*() call.

```APIDOC
## cupy.cuda.nvtx.RangePop

### Description
Ends a nested range started by a `RangePush*()` call.

### Method
`cupy.cuda.nvtx.RangePop()`

### Endpoint
N/A (This is a Python function, not a REST endpoint)

### Parameters
This function does not take any parameters.

### Request Example
```python
import cupy

# Assuming a RangePush was made previously
cupy.cuda.nvtx.RangePop()
```

### Response
This function does not return a value. Its effect is to mark the end of a range in NVTX profiling.

#### Success Response (200)
N/A

#### Response Example
N/A
```

--------------------------------

### cupyx.scipy.sparse Matrix Constructors and Utilities

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.signal.argrelmax

Documentation for creating and manipulating sparse matrices using CuPy.

```APIDOC
## cupyx.scipy.sparse Matrix Operations

### Description
This section covers the creation of various sparse matrix formats and utility functions for sparse matrices.

### Method
Various (primarily class instantiation and function calls)

### Endpoints
- `cupyx.scipy.sparse.coo_matrix`
- `cupyx.scipy.sparse.csc_matrix`
- `cupyx.scipy.sparse.csr_matrix`
- `cupyx.scipy.sparse.dia_matrix`
- `cupyx.scipy.sparse.spmatrix`
- `cupyx.scipy.sparse.eye`
- `cupyx.scipy.sparse.identity`
- `cupyx.scipy.sparse.kron`
- `cupyx.scipy.sparse.kronsum`
- `cupyx.scipy.sparse.diags`
- `cupyx.scipy.sparse.spdiags`
- `cupyx.scipy.sparse.tril`
- `cupyx.scipy.sparse.triu`
- `cupyx.scipy.sparse.bmat`
- `cupyx.scipy.sparse.hstack`
- `cupyx.scipy.sparse.vstack`
- `cupyx.scipy.sparse.rand`
- `cupyx.scipy.sparse.random`
- `cupyx.scipy.sparse.find`
- `cupyx.scipy.sparse.issparse`
- `cupyx.scipy.sparse.isspmatrix`
- `cupyx.scipy.sparse.isspmatrix_csc`
- `cupyx.scipy.sparse.isspmatrix_csr`
- `cupyx.scipy.sparse.isspmatrix_coo`
- `cupyx.scipy.sparse.isspmatrix_dia`

### Parameters
Parameters vary by class constructor or function. Common parameters include:
- `data` (cupy.ndarray): The non-zero elements of the matrix.
- `indices` (cupy.ndarray): Indices for the elements (column indices for CSR, row indices for CSC).
- `indptr` (cupy.ndarray): Pointers to the start of each row (CSR) or each column (CSC) within `indices` and `data`.
- `shape` (tuple): The dimensions of the sparse matrix.
- `format` (str): The desired sparse matrix format.

### Request Example
```python
import cupy as cp
from cupyx.scipy.sparse import csr_matrix

# Create a CSR sparse matrix
row = cp.array([0, 0, 1, 2, 2, 2])
col = cp.array([0, 2, 2, 0, 1, 2])
data = cp.array([1, 2, 3, 4, 5, 6])
s = csr_matrix((data, (row, col)), shape=(3, 3))

print(s)
print(s.toarray()) # Convert to dense array for printing
```

### Response
Returns a sparse matrix object of the specified format or a boolean for utility functions.

#### Success Response (200)
- **sparse_matrix** (cupyx.scipy.sparse.spmatrix subclass) - A sparse matrix object.
- **boolean** - For utility functions like `issparse`.
```

--------------------------------

### CUDA Profiler

Source: https://docs.cupy.dev/en/stable/reference/array_api

Functions to start and stop the CUDA profiler.

```APIDOC
## CUDA Profiler

### Description
Provides basic control for starting and stopping the CUDA profiler.

### Endpoints
- `cupy.cuda.runtime.profilerStart`
- `cupy.cuda.runtime.profilerStop`
```
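
Start and stop calls are easy to leave unbalanced when the profiled code raises, so they are often wrapped in a context manager. The sketch below is one such wrapper; `cuda_profiler_region` and the `runtime` parameter are hypothetical names introduced here, and the recording stand-in exists only to show the call order on a machine without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def cuda_profiler_region(runtime=None):
    """Call profilerStart() on entry and profilerStop() on exit, even on error."""
    if runtime is None:
        from cupy.cuda import runtime  # imported lazily so the module loads without CuPy
    runtime.profilerStart()
    try:
        yield
    finally:
        runtime.profilerStop()


class RecordingRuntime:
    """Stand-in for cupy.cuda.runtime that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def profilerStart(self):
        self.calls.append("start")

    def profilerStop(self):
        self.calls.append("stop")


fake = RecordingRuntime()
with cuda_profiler_region(fake):
    fake.calls.append("work")
print(fake.calls)  # ['start', 'work', 'stop']
```

With a real GPU, `with cuda_profiler_region():` around the code of interest limits collection to that region, provided the external profiler is configured to honor the profiler start/stop API.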

--------------------------------

### Distributed Computing

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.signal.hilbert

Utilities for initializing distributed training environments using NCCL backend.

```APIDOC
## Distributed Computing

### Description
Tools for setting up and managing distributed training processes, leveraging the NCCL backend for communication.

### Endpoints
- `cupyx.distributed.init_process_group`: Initializes a distributed process group.
- `cupyx.distributed.NCCLBackend`: Represents the NCCL backend for distributed communication.
```

--------------------------------

### CuPy JIT (Just-In-Time) Compilation Helpers

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.fft.ifftn

Utilities for writing JIT-compiled kernels, including thread/block/grid dimensions and synchronization primitives.

```APIDOC
## cupyx.jit

### Description
Provides JIT compilation capabilities and CUDA-specific built-in functions for custom kernels.

### Endpoints
- `cupyx.jit.rawkernel`
- `cupyx.jit.threadIdx`
- `cupyx.jit.blockDim`
- `cupyx.jit.blockIdx`
- `cupyx.jit.gridDim`
- `cupyx.jit.grid`
- `cupyx.jit.gridsize`
- `cupyx.jit.laneid`
- `cupyx.jit.warpsize`
- `cupyx.jit.range`
- `cupyx.jit.syncthreads`
- `cupyx.jit.syncwarp`
- `cupyx.jit.shfl_sync`
- `cupyx.jit.shfl_up_sync`
- `cupyx.jit.shfl_down_sync`
- `cupyx.jit.shfl_xor_sync`
- `cupyx.jit.shared_memory`
- Atomic Operations:
  - `cupyx.jit.atomic_add`
  - `cupyx.jit.atomic_sub`
  - `cupyx.jit.atomic_exch`
  - `cupyx.jit.atomic_min`
  - `cupyx.jit.atomic_max`
  - `cupyx.jit.atomic_inc`
  - `cupyx.jit.atomic_dec`
  - `cupyx.jit.atomic_cas`
  - `cupyx.jit.atomic_and`
  - `cupyx.jit.atomic_or`
  - `cupyx.jit.atomic_xor`
- Cooperative Groups:
  - `cupyx.jit.cg.this_grid`
  - `cupyx.jit.cg.this_thread_block`
  - `cupyx.jit.cg.sync`
  - `cupyx.jit.cg.memcpy_async`
  - `cupyx.jit.cg.wait`
  - `cupyx.jit.cg.wait_prior`
```

--------------------------------

### CUDA Version

Source: https://docs.cupy.dev/en/stable/reference/cuda

Retrieves information about the installed CUDA runtime version.

```APIDOC
## cupy.cuda.get_local_runtime_version

### Description
Returns the version of the CUDA Runtime installed in the environment.

### Method
`cupy.cuda.get_local_runtime_version`

### Parameters
None

### Response
- **version** (int): The version of the CUDA Runtime, encoded as an integer.
```

--------------------------------

### CUDA Profiler Control

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.sparse.linalg.svds

APIs to start and stop the CUDA profiler.

```APIDOC
## CUDA Profiler Control API

### Description
APIs to control the CUDA profiler, enabling and disabling profiling of GPU execution.

### Endpoints

- `cupy.cuda.runtime.profilerStart`
- `cupy.cuda.runtime.profilerStop`
```

--------------------------------

### Optimize kernel launch parameters with cupyx.optimizing.optimize in Python

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.optimizing.optimize

This code snippet demonstrates how to use the `cupyx.optimizing.optimize` context manager in Python to optimize kernel launch parameters for CuPy routines. It requires Optuna to be installed and currently supports reduction operations. The manager automatically finds and caches optimal values for thread and block counts, reusing them for subsequent calls with similar array characteristics.

```python
import cupy
from cupyx import optimizing

x = cupy.arange(100)
with optimizing.optimize():
    cupy.sum(x)

```

--------------------------------

### CuPy LineProfileHook Usage Example

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.memory_hooks.LineProfileHook

This example demonstrates how to use the LineProfileHook to profile CuPy memory usage. It initializes the hook, wraps a block of CuPy code with the hook using a 'with' statement, and then prints the profiling report. This hook is useful for identifying memory hotspots in CuPy applications.

```python
import cupy
from cupy.cuda import memory_hooks

hook = memory_hooks.LineProfileHook()

with hook:
    # Allocations made inside this block are recorded per source line.
    x = cupy.arange(1_000_000, dtype=cupy.float32)
    y = x * 2

hook.print_report()
```

--------------------------------

### cupyx.scipy.sparse Matrix Constructors and Utilities

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.fft.fftshift

Documentation for creating and manipulating sparse matrices, including various formats and utility functions.

```APIDOC
## Sparse Matrix Operations

### Description
Provides tools for creating, manipulating, and querying sparse matrices, supporting formats like COO, CSC, CSR, and DIA.

### Constructors
- `coo_matrix`
- `csc_matrix`
- `csr_matrix`
- `dia_matrix`
- `spmatrix`

### Utility Functions
- `eye`
- `identity`
- `kron`
- `kronsum`
- `diags`
- `spdiags`
- `tril`
- `triu`
- `bmat`
- `hstack`
- `vstack`
- `rand`
- `random`
- `find`
- `issparse`
- `isspmatrix`
- `isspmatrix_csc`
- `isspmatrix_csr`
- `isspmatrix_coo`
- `isspmatrix_dia`

### Endpoints
All functions and constructors are accessed via `cupyx.scipy.sparse`.
```

--------------------------------

### Start Nested Range with NVTX

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.nvtx.RangePushC

Demonstrates how to start a nested profiling range using RangePushC and end it with RangePop. This function is useful for marking specific sections of code for performance analysis, especially when using tools like Nsight Systems. The color parameter allows for visual differentiation of ranges.

```python
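import cupy
from cupy.cuda.nvtx import RangePushC, RangePop

N = 4             # number of doublings (illustrative value)
A = cupy.ones(8)  # array to double (illustrative value)

# Colors are ARGB integers; the values below are arbitrary choices.
RangePushC("Nested Powers of A", 0xFF0000FF)
for i in range(N):
    RangePushC("Iter {}: Double A".format(i), 0xFF00FF00)
    A = 2*A
    RangePop()
RangePop()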
```

--------------------------------

### Distributed Computing

Source: https://docs.cupy.dev/en/stable/reference/scipy_linalg

Utilities for initializing distributed training environments using NCCL.

```APIDOC
## Distributed Computing

### Description
APIs for initializing distributed process groups, primarily for use with the NCCL backend for multi-node, multi-GPU training.

### Endpoints

- `cupyx.distributed.init_process_group`: Initializes a distributed process group.
- `cupyx.distributed.NCCLBackend`: Represents the NCCL backend for distributed communication.
```

--------------------------------

### STFT and iSTFT Example with CuPy

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.signal.istft

Demonstrates the process of generating a test signal, computing its Short Time Fourier Transform (STFT) using CuPy, and then reconstructing the original signal using the inverse STFT (iSTFT). This example showcases the typical workflow for time-frequency analysis and signal reconstruction with CuPy.

```python
import cupy
from cupyx.scipy.signal import stft, istft
import matplotlib.pyplot as plt

# Generate a test signal: a 2 Vrms sine wave at 50Hz with white noise
fs = 1024
nperseg = 512
amp = 2 * cupy.sqrt(2)
noise_power = 0.001 * fs / 2
N = 10 * fs
time = cupy.arange(N) / float(fs)
carrier = amp * cupy.sin(2 * cupy.pi * 50 * time)
noise = cupy.random.normal(scale=cupy.sqrt(noise_power),
                         size=time.shape)
x = carrier + noise

# Compute the STFT
# Note: in a real scenario, choose nperseg (and optionally noverlap) based on your analysis needs.
# Here nperseg is defined above and noverlap is left at its default.
freqs, times, Zxx = stft(x, fs=fs, nperseg=nperseg)

# Reconstruct the signal using iSTFT
t_reconstructed, x_reconstructed = istft(Zxx, fs=fs, nperseg=nperseg)

# Plotting (requires matplotlib). CuPy arrays must be copied to the host first,
# for example with cupy.asnumpy:
# plt.plot(cupy.asnumpy(time), cupy.asnumpy(x), label='Original Signal')
# plt.plot(cupy.asnumpy(t_reconstructed), cupy.asnumpy(x_reconstructed), label='Reconstructed Signal')
# plt.xlabel('Time [s]')
# plt.ylabel('Amplitude')
# plt.title('Original vs Reconstructed Signal')
# plt.legend()
# plt.show()

```

--------------------------------

### CuPy Distributed Initialization and NCCL Backend

Source: https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.stats.zmap

Supports distributed computing setups using CuPy, primarily with the NCCL backend. `cupyx.distributed.init_process_group` starts the distributed environment for the calling process and returns a communicator. `cupyx.distributed.NCCLBackend` is the backend class that performs collective operations through NCCL.

```python
import cupy as cp
from cupyx.distributed import init_process_group

# Every participating process runs this with its own rank; a launcher that
# starts one process per GPU is assumed and not shown here.
n_devices = 2  # total number of devices (processes) taking part
rank = 0       # rank of the current process, in [0, n_devices)

cp.cuda.Device(rank).use()
# The backend defaults to 'nccl'; the call returns a communicator object.
comm = init_process_group(n_devices, rank)

# After initialization, collective operations are available on `comm`.
array = cp.ones(1)
comm.broadcast(array, 0)  # broadcast from rank 0 to all ranks
```

--------------------------------

### CuPy CUDA Pointer Attributes

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.tri

Function to get attributes of a pointer on the device.

```APIDOC
## CUDA Pointer Attributes API

### Description
Retrieves detailed attributes associated with a given device pointer.

### Endpoints
- `cupy.cuda.runtime.pointerGetAttributes`: Gets attributes of a pointer.
```
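
A minimal sketch of querying the attributes of a CuPy allocation follows. It assumes the returned object exposes `device` and `devicePointer` attributes, which is how recent CuPy releases define `PointerAttributes`; check the reference for the version in use. The `HAVE_GPU` guard is an addition so the snippet can be imported without a CUDA device.

```python
try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    HAVE_GPU = False

if HAVE_GPU:
    x = cp.arange(4, dtype=cp.float32)
    # x.data.ptr is the raw device address of the array's first element.
    attrs = cp.cuda.runtime.pointerGetAttributes(x.data.ptr)
    same_device = attrs.device == x.device.id
    print(attrs.device, hex(attrs.devicePointer), same_device)
```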

--------------------------------

### CuPy JIT (Just-In-Time) Compilation Helpers

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.tri

Helper functions and variables for writing JIT-compiled kernels.

```APIDOC
## JIT Compilation Helpers API

### Description
Provides intrinsic functions and variables usable within JIT-compiled kernels for thread and block coordination, synchronization, and shared memory access.

### Functions and Variables
- `cupyx.jit.threadIdx`: Thread index within a block.
- `cupyx.jit.blockDim`: Dimensions of a block.
- `cupyx.jit.blockIdx`: Block index within a grid.
- `cupyx.jit.gridDim`: Dimensions of a grid.
- `cupyx.jit.grid`: Computes the calling thread's index in the grid.
- `cupyx.jit.gridsize`: Total grid size.
- `cupyx.jit.laneid`: Lane ID within a warp.
- `cupyx.jit.warpsize`: Warp size.
- `cupyx.jit.range`: Creates a range iterator.
- `cupyx.jit.syncthreads`: Synchronizes all threads in a block.
- `cupyx.jit.syncwarp`: Synchronizes all threads in a warp.
- `cupyx.jit.shfl_sync`: Shuffle operation with synchronization.
- `cupyx.jit.shfl_up_sync`: Upward shuffle operation with synchronization.
- `cupyx.jit.shfl_down_sync`: Downward shuffle operation with synchronization.
- `cupyx.jit.shfl_xor_sync`: XOR shuffle operation with synchronization.
- `cupyx.jit.shared_memory`: Accesses shared memory.
- `cupyx.jit.atomic_add`, `cupyx.jit.atomic_sub`, `cupyx.jit.atomic_exch`, `cupyx.jit.atomic_min`, `cupyx.jit.atomic_max`, `cupyx.jit.atomic_inc`, `cupyx.jit.atomic_dec`, `cupyx.jit.atomic_cas`, `cupyx.jit.atomic_and`, `cupyx.jit.atomic_or`, `cupyx.jit.atomic_xor`: Atomic operations.
- `cupyx.jit.cg.this_grid`: Current grid.
- `cupyx.jit.cg.this_thread_block`: Current thread block.
- `cupyx.jit.cg.sync`: Synchronizes threads.
- `cupyx.jit.cg.memcpy_async`: Asynchronous memory copy.
- `cupyx.jit.cg.wait`: Waits for operations.
- `cupyx.jit.cg.wait_prior`: Waits for prior operations.
```

--------------------------------

### CuPy Distributed Computing

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.runtime.deviceSetLimit

Documentation for initializing distributed environments and using communication backends in CuPy.

```APIDOC
## Distributed Computing API

### Description
Tools for setting up and managing distributed computations across multiple processes and nodes using CuPy.

### Endpoints

- **Initialization**
  - **cupyx.distributed.init_process_group**
    - Description: Initializes a distributed process group.

- **Communication Backends**
  - **cupyx.distributed.NCCLBackend**
    - Description: The NCCL backend for distributed communication.
```

--------------------------------

### cupy.cuda.runtime.deviceGetDefaultMemPool

Source: https://docs.cupy.dev/en/stable/reference/generated/cupy.linalg.eigvalsh

Describes the deviceGetDefaultMemPool function to get the default memory pool for a device.

```APIDOC
## cupy.cuda.runtime.deviceGetDefaultMemPool

### Description
Retrieves the default memory pool for a device.

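### Method
N/A (This is a function call, not a REST API endpoint)

### Endpoint
`cupy.cuda.runtime.deviceGetDefaultMemPool`

### Parameters
- **device** (int) - Required - The device ID whose default memory pool is retrieved.

### Request Example
N/A

### Response
- **mempool_ptr** (intptr_t) - A pointer to the default memory pool.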
```