### CUDA Host API - Device Management

Source: https://nvidia.github.io/numba-cuda/reference/index

Functions and classes for detecting, managing, and interacting with CUDA-enabled devices.

```APIDOC
## CUDA Host API - Device Management

### Description
APIs for device detection, enquiry, and context management.

#### Device Detection and Enquiry
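* `is_available()`: Returns a boolean indicating whether a CUDA GPU is available.
* `detect()`: Detects supported CUDA hardware, prints a summary, and returns a boolean indicating whether any supported device was found.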

#### Context Management
* `Context`: Represents a CUDA context bound to a device.
  * `get_memory_info()`: Returns the free and total memory, in bytes, for the context.
  * `pop()`: Pops this context off the current CPU thread's context stack.
  * `push()`: Pushes this context onto the current CPU thread's context stack.
  * `reset()`: Cleans up all resources owned by this context.
* `current_context()`: Returns the current CUDA context.
* `require_context()`: Decorator that ensures a CUDA context is available when the wrapped function runs.
* `synchronize()`: Synchronizes the current context.
* `close()`: Explicitly clears all contexts in the current thread.

#### Device Management
* `gpus`: An indexable list of supported CUDA devices, indexed by integer device ID.
* `current`: The currently selected CUDA device.
* `_DeviceContextManager`: A context manager for running operations on a specific device; instances are typically obtained from `gpus`.
* `select_device()`: Makes the context associated with a given device ID the current context.
* `get_current_device()`: Gets the device associated with the current thread.
* `list_devices()`: Returns a list of all detected CUDA devices.
* `Device`: Represents a CUDA device.
  * `compute_capability`: The compute capability of the device as a `(major, minor)` tuple.
  * `id`: The integer ID of the device.
  * `name`: The name of the device.
  * `uuid`: The UUID of the device.
  * `reset()`: Deletes the context for the device, destroying all associated allocations, events, and streams.
  * `supports_float16`: `True` if the device supports float16 operations, `False` otherwise.
```

--------------------------------

### CUDA Vector Types

Source: https://nvidia.github.io/numba-cuda/reference/types

Describes how to use and construct CUDA vector types in Numba, including recommended naming conventions and examples.

```APIDOC
## CUDA Vector Types

### Description
CUDA Vector Types are usable in kernels. They are formatted as `<base_type>x<N>`, where `base_type` is the base type of the vector, and `N` is the number of elements. Examples include `int64x3`, `uint16x4`, `float32x4`.

### Aliases for Convenience
Aliases consistent with CUDA C/C++ namings are available for convenience (e.g., `float3` aliases `float32x3`).

### Construction
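Vector types are constructed by calling their constructor directly. The arguments may be scalars, vectors, or a mix of both, provided the total number of components matches the target type.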

### Request Example
```python
from numba import uint32
from numba.cuda import float32x3, uint32x2, uint32x3, uint32x4

# In kernel
f3 = float32x3(0.0, -1.0, 1.0)

zero = uint32(0)
u2 = uint32x2(1, 2)

# Construct a 3-component vector from a scalar and a 2-component vector
u3 = uint32x3(zero, u2)

# Construct a 4-component vector from two 2-component vectors
u4 = uint32x4(u2, u2)
```

### Component Access
Components can be accessed through fields `x`, `y`, `z`, and `w`. Components are immutable after construction in the current version.

### Response Example
```python
from numba.cuda import float32x2

# In kernel
v1 = float32x2(1.0, 1.0)
v2 = float32x2(1.0, -1.0)
dotprod = v1.x * v2.x + v1.y * v2.y
```
```

--------------------------------

### Configuring and Launching CUDA Kernels

Source: https://nvidia.github.io/numba-cuda/reference/kernel

This section details how to configure and launch CUDA kernels using the `numba.cuda.jit` decorator and the returned Dispatcher object. It covers different ways to specify grid and block dimensions, streams, and shared memory.

```APIDOC
## Launching CUDA Kernels with Dispatcher

### Description
Decorating a function with `@cuda.jit` returns a Dispatcher object. Subscripting the Dispatcher with launch parameters (grid dimensions, block dimensions, stream, shared memory) yields a configured kernel, which is then called with the kernel arguments.

### Method
Configuration and Launch

### Endpoint
N/A (In-process function call)

### Parameters
#### Kernel Configuration Parameters (Subscripting the Dispatcher)
- **griddim** (int or tuple) - Specifies the grid dimensions (up to 3D).
- **blockdim** (int or tuple) - Specifies the block dimensions (up to 3D).
- **stream** (optional) - The CUDA stream on which the kernel will be launched.
- **sharedmem** (optional, int) - The size of dynamic shared memory in bytes.

#### Kernel Arguments (Calling the Configured Dispatcher)
- **x, y, z, ...** (various types) - Arguments required by the kernel function.

### Request Example
```
# Configure and launch in one statement
func[griddim, blockdim, stream, sharedmem](x, y, z)

# Or, configure first, then call
configured = func[griddim, blockdim, stream, sharedmem]
configured(x, y, z)
```

### Response
#### Success Response
N/A (Kernel execution on device)

#### Response Example
N/A

### Notes
- The order of `stream` and `sharedmem` is reversed compared to CUDA C/C++ (`func<<<griddim, blockdim, sharedmem, stream>>>`).
- The Dispatcher compiles a specialization for the types of the arguments and the compute capability of the current device when it is first called with them.
```
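The `griddim` value is usually derived from the problem size and the chosen `blockdim`. The following is a minimal host-side sketch of that calculation; the helper name `blocks_needed` is hypothetical and not part of Numba, and the problem sizes are illustrative.

```python
def blocks_needed(problem_shape, blockdim):
    """Return grid dimensions that cover `problem_shape` with blocks of `blockdim`."""
    if isinstance(problem_shape, int):
        problem_shape = (problem_shape,)
    if isinstance(blockdim, int):
        blockdim = (blockdim,)
    if len(problem_shape) != len(blockdim):
        raise ValueError("problem_shape and blockdim must have the same number of dimensions")
    # Integer ceiling division per axis, so a partially filled final block is still launched
    return tuple((n + b - 1) // b for n, b in zip(problem_shape, blockdim))


griddim_1d = blocks_needed(1000, 256)
griddim_2d = blocks_needed((1080, 1920), (16, 16))
print(griddim_1d, griddim_2d)
```

Because the last block along each axis may extend past the data, a kernel launched with these dimensions should check its index against the array bounds before reading or writing.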

--------------------------------

### Get CUDA Device Name and UUID (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the name and universally unique identifier (UUID) of a CUDA device. This information can be used for device identification and logging.

```python
from numba import cuda

# Uses the device bound to the current thread (device 0 unless another was selected)
current_device = cuda.get_current_device()
print(f"Device Name: {current_device.name}")
print(f"Device UUID: {current_device.uuid}")
```

--------------------------------

### Get Current CUDA Device (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the Device object associated with the current thread's CUDA context. This is useful for checking which device is currently active.

```python
from numba import cuda

current_device = cuda.get_current_device()
print(f"Current device: {current_device.name} (ID: {current_device.id})")
```

--------------------------------

### CUDA Host API - Measurement

Source: https://nvidia.github.io/numba-cuda/reference/index

APIs for profiling and event management in CUDA.

```APIDOC
## CUDA Host API - Measurement

### Description
Tools for profiling CUDA operations and managing timing events.

#### Profiling
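* `profile_start()`: Enables profile collection in the current context.
* `profile_stop()`: Disables profile collection in the current context.
* `profiling()`: Context manager that enables profiling on entry and disables it on exit.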

#### Events
* `event()`: Creates a CUDA event.
* `event_elapsed_time()`: Computes the elapsed time, in milliseconds, between two CUDA events.
* `Event`: Represents a CUDA event.
  * `query()`: Returns `True` if the event has completed, `False` otherwise.
  * `record()`: Records the event at the current point in a stream.
  * `synchronize()`: Blocks the host thread until the event has completed.
  * `wait()`: Makes all future work submitted to a stream wait until the event has completed.
```

--------------------------------

### Get CUDA Device Compute Capability (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the compute capability of a CUDA device, represented as a tuple of (major, minor) integers. This indicates the hardware features supported by the GPU.

```python
from numba import cuda

# Uses the device bound to the current thread (device 0 unless another was selected)
current_device = cuda.get_current_device()
compute_cap = current_device.compute_capability  # (major, minor) tuple
print(f"Compute capability of device {current_device.id}: {compute_cap[0]}.{compute_cap[1]}")
```
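Because the compute capability is a `(major, minor)` tuple, feature checks can rely on Python's component-wise tuple comparison rather than on string or float comparison. A minimal sketch follows; the helper names `meets_minimum` and `format_cc` and the example values are illustrative assumptions, not part of Numba.

```python
def meets_minimum(compute_capability, minimum):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(compute_capability) >= tuple(minimum)


def format_cc(compute_capability):
    """Format a (major, minor) tuple as 'major.minor' for logging."""
    major, minor = compute_capability
    return f"{major}.{minor}"


# Hypothetical values standing in for Device.compute_capability
print(meets_minimum((8, 6), (7, 0)))
print(meets_minimum((6, 1), (7, 0)))
# Tuples compare numerically per component, so (10, 0) ranks above (9, 0),
# whereas the strings "10.0" and "9.0" would compare the other way round.
print(meets_minimum((10, 0), (9, 0)))
print(format_cc((8, 6)))
```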

--------------------------------

### Manage CUDA Context with Device Context Manager (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Demonstrates how to use a device as a context manager to execute code on a specific CUDA device. This allows for easy selection of a device for operations like memory transfers.

```python
from numba import cuda
import numpy as np

a = np.array([1, 2, 3])

# Assumes at least three devices are present, so that device ID 2 exists
with cuda.gpus[2]:
    d_a = cuda.to_device(a)
    print(f"Array copied to device 2: {d_a}")
```

--------------------------------

### Get CUDA Context Memory Info (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the amount of free and total memory (in bytes) available within the current CUDA context. This is useful for monitoring GPU memory usage.

```python
from numba import cuda

if cuda.is_available():
    # get_memory_info() is a method of the Context object, not a module-level function
    ctx = cuda.current_context()
    free_mem, total_mem = ctx.get_memory_info()
    print(f"Memory Info - Free: {free_mem} bytes, Total: {total_mem} bytes")
else:
    print("CUDA not available.")
```
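Raw byte counts are hard to read in logs. The sketch below turns a `(free, total)` pair into GiB figures and a used fraction; the helper name `summarize_memory` and the sample readings are hypothetical, and the function works on plain integers so it can be tried without a GPU.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def summarize_memory(free_bytes, total_bytes):
    """Summarize a (free, total) byte pair as GiB figures and a used fraction."""
    if total_bytes <= 0 or not 0 <= free_bytes <= total_bytes:
        raise ValueError("expected total_bytes > 0 and 0 <= free_bytes <= total_bytes")
    used_bytes = total_bytes - free_bytes
    return {
        "used_gib": used_bytes / GIB,
        "free_gib": free_bytes / GIB,
        "total_gib": total_bytes / GIB,
        "used_fraction": used_bytes / total_bytes,
    }


# Hypothetical readings: 6 GiB free on an 8 GiB device
summary = summarize_memory(6 * GIB, 8 * GIB)
print(f"{summary['used_gib']:.1f} GiB used of {summary['total_gib']:.1f} GiB "
      f"({summary['used_fraction']:.0%})")
```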

--------------------------------

### CUDA Host API - Compilation

Source: https://nvidia.github.io/numba-cuda/reference/index

Functions for compiling CUDA kernels.

```APIDOC
## CUDA Host API - Compilation

### Description
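Functions for compiling Python functions to PTX or LTO-IR.

* `compile()`: Compiles a Python function to PTX or LTO-IR for a given signature.
* `compile_all()`: Like `compile()`, but returns a list of PTX codes or LTO-IRs for the compiled function and any external functions it depends on.
* `compile_for_current_device()`: Like `compile()`, but targets the compute capability of the current device.
* `compile_ptx()`: Compiles a Python function to PTX for a given signature.
* `compile_ptx_for_current_device()`: Like `compile_ptx()`, but targets the compute capability of the current device.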
```

--------------------------------

### Numba CUDA fabsf: Single Precision Absolute Value

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Exposes the `__nv_fabsf` libdevice function for the absolute value of single-precision floating-point numbers. It accepts a float32 input and returns a float32 output, providing a performant way to get the magnitude of a number.

```python
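from numba import cuda
from numba.cuda import libdevice

# A kernel cannot return a value, so the wrapper is compiled as a device function
@cuda.jit(device=True)
def fabsf_wrapper(x):
    return libdevice.fabsf(x)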
```

--------------------------------

### Numba CUDA fast mathematical functions

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

This section covers the fast approximate versions of common float32 math functions: cosine, sine, tangent, combined sine/cosine, base-e and base-10 exponentials, natural, base-2 and base-10 logarithms, power, and division.

```APIDOC
## numba.cuda.libdevice.fast_cosf(x)

### Description
Computes the fast cosine of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_cosf(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The cosine of x.

#### Response Example
```json
{
  "return_value": 0.5403023058681398
}
```

## numba.cuda.libdevice.fast_exp10f(x)

### Description
Computes the fast base-10 exponentiation of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_exp10f(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The result of 10 raised to the power of x.

#### Response Example
```json
{
  "return_value": 1000.0
}
```

## numba.cuda.libdevice.fast_expf(x)

### Description
Computes the fast natural exponentiation (e^x) of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_expf(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The result of e raised to the power of x.

#### Response Example
```json
{
  "return_value": 2.718281828459045
}
```

## numba.cuda.libdevice.fast_fdividef(x, y)

### Description
Performs fast floating-point division of two float32 arguments.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_fdividef(x, y)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The result of x divided by y.

#### Response Example
```json
{
  "return_value": 2.0
}
```

## numba.cuda.libdevice.fast_log10f(x)

### Description
Computes the fast base-10 logarithm of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_log10f(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The base-10 logarithm of x.

#### Response Example
```json
{
  "return_value": 1.0
}
```

## numba.cuda.libdevice.fast_log2f(x)

### Description
Computes the fast base-2 logarithm of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_log2f(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The base-2 logarithm of x.

#### Response Example
```json
{
  "return_value": 1.0
}
```

## numba.cuda.libdevice.fast_logf(x)

### Description
Computes the fast natural logarithm (ln) of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_logf(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The natural logarithm of x.

#### Response Example
```json
{
  "return_value": 0.0
}
```

## numba.cuda.libdevice.fast_powf(x, y)

### Description
Computes the fast power of a float32 base to a float32 exponent.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_powf(x, y)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The result of x raised to the power of y.

#### Response Example
```json
{
  "return_value": 8.0
}
```

## numba.cuda.libdevice.fast_sincosf(x)

### Description
Computes the fast sine and cosine of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_sincosf(x)
```

### Response
#### Success Response (UniTuple(float32 x 2))
- **return_value** (UniTuple(float32 x 2)) - A tuple containing the sine and cosine of x.

#### Response Example
```json
{
  "return_value": [0.8414709848078965, 0.5403023058681398]
}
```

## numba.cuda.libdevice.fast_sinf(x)

### Description
Computes the fast sine of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_sinf(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The sine of x.

#### Response Example
```json
{
  "return_value": 0.8414709848078965
}
```

## numba.cuda.libdevice.fast_tanf(x)

### Description
Computes the fast tangent of a float32 argument.

### Method
Call

### Endpoint
N/A (Library function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
```
numba.cuda.libdevice.fast_tanf(x)
```

### Response
#### Success Response (float32)
- **return_value** (float32) - The tangent of x.

#### Response Example
```json
{
  "return_value": 1.5574077246549023
}
```
```
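The fast functions trade accuracy for speed, so it can be useful to measure the error on the inputs that matter for a given application. The sketch below compares `libdevice.fast_sinf` with NumPy's `sin` on a small float32 range. The function name `compare_fast_sin`, the input range, and the import guard are assumptions made so that the example degrades gracefully on machines without Numba or a GPU.

```python
try:
    import numpy as np
    from numba import cuda
    from numba.cuda import libdevice
    HAVE_CUDA = cuda.is_available()
except ImportError:
    HAVE_CUDA = False


def compare_fast_sin(n=1024):
    """Return the max abs difference between fast_sinf and NumPy's sin, or None without CUDA."""
    if not HAVE_CUDA:
        return None

    @cuda.jit
    def fast_sin_kernel(x, out):
        i = cuda.grid(1)
        if i < x.size:                          # guard threads past the end of the array
            out[i] = libdevice.fast_sinf(x[i])  # float32 in, float32 out

    x = np.linspace(-3.0, 3.0, n).astype(np.float32)
    out = np.zeros_like(x)
    threads = 128
    blocks = (n + threads - 1) // threads
    # Host arrays passed to a kernel are copied to the device and back automatically
    fast_sin_kernel[blocks, threads](x, out)
    return float(np.max(np.abs(out - np.sin(x))))


print(compare_fast_sin())
```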

--------------------------------

### Configure and Launch CUDA Kernel with Numba

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Demonstrates how to configure and launch a Numba CUDA kernel by specifying grid dimensions, block dimensions, stream, and shared memory. It shows both a two-step configuration and call, and a combined single-statement approach. This is analogous to CUDA C/C++ kernel launch syntax.

```python
# Two-step form: configure the launch, then call with the kernel arguments
configured = func[griddim, blockdim, stream, sharedmem]
configured(x, y, z)

# Idiomatic single-statement call:
func[griddim, blockdim, stream, sharedmem](x, y, z)
```

```cuda
// Equivalent CUDA C/C++ launch; note that sharedmem precedes stream here
func<<<griddim, blockdim, sharedmem, stream>>>(x, y, z);
```
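For a concrete version of the pattern above, the following sketch launches a small kernel on a non-default stream with zero bytes of dynamic shared memory. The kernel `add_one`, the wrapper `launch_demo`, and the sizes are illustrative assumptions; the import guard only exists so the example returns `None` instead of failing where Numba or a GPU is missing.

```python
try:
    import numpy as np
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:
    HAVE_CUDA = False


def launch_demo(n=1000):
    """Add 1.0 to every element on the GPU; returns None when CUDA is unavailable."""
    if not HAVE_CUDA:
        return None

    @cuda.jit
    def add_one(arr):
        i = cuda.grid(1)
        if i < arr.size:      # guard threads past the end of the array
            arr[i] += 1.0

    blockdim = 128
    griddim = (n + blockdim - 1) // blockdim
    stream = cuda.stream()

    d_arr = cuda.to_device(np.zeros(n, dtype=np.float32), stream=stream)
    # Numba order: grid, block, stream, dynamic shared memory in bytes
    add_one[griddim, blockdim, stream, 0](d_arr)
    result = d_arr.copy_to_host(stream=stream)
    stream.synchronize()      # wait for the copy back to the host to finish
    return result


result = launch_demo()
print(None if result is None else result[:4])
```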

--------------------------------

### Perform Arithmetic Operations on bfloat16 in Numba CUDA

Source: https://nvidia.github.io/numba-cuda/reference/types

Shows examples of supported arithmetic and logical operations on bfloat16 data types within Numba CUDA kernels. This includes standard arithmetic, assignment, comparison, and unary operations.

```python
from numba.cuda.types import bfloat16

# Assume bf16_a and bf16_b are bfloat16 variables
result_add = bf16_a + bf16_b
result_mul = bf16_a * bf16_b
bf16_a += bf16_b
is_equal = bf16_a == bf16_b
negative_val = -bf16_a
```

--------------------------------

### Measurement and Profiling

Source: https://nvidia.github.io/numba-cuda/reference/host

Tools for profiling CUDA kernel execution and measuring event timing.

```APIDOC
## Measurement and Profiling

### Description
This section covers tools for profiling CUDA operations and measuring time.

### Profiling

*   **`profile_start()`**
    *   Description: Enables profile collection in the current context.

*   **`profile_stop()`**
    *   Description: Disables profile collection in the current context.

*   **`profiling()`**
    *   Description: Context manager that enables profiling on entry and disables it on exit.

## Events

### Description
Provides functionalities for creating and managing CUDA events for timing.

### Functions

*   **`event(timing=True)`**
    *   Description: Creates a new CUDA event. Timing data is only recorded if the event is created with `timing=True`.
    *   Returns: An `Event` object.

*   **`event_elapsed_time(evtstart, evtend)`**
    *   Description: Computes the time elapsed between two CUDA events.
    *   Parameters:
        *   **`evtstart`** (Event) - The starting event.
        *   **`evtend`** (Event) - The ending event.
    *   Returns: Elapsed time in milliseconds (float).

## Event Object

### Description
Represents a CUDA event used for timing and synchronization.

### Methods

*   **`record()`**
    *   Description: Records the event, marking the current time on the CUDA stream.

*   **`query()`**
    *   Description: Checks if the event has been completed.
    *   Returns: `True` if completed, `False` otherwise.

*   **`synchronize()`**
    *   Description: Blocks the host until the event is completed.

*   **`wait()`**
    *   Description: Causes a stream to wait until the event is completed.
```
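The event functions are typically combined to time a kernel: record one event before the launch, one after, synchronize on the second, and read the elapsed time. A minimal sketch under stated assumptions: the kernel `scale`, the wrapper `time_kernel_ms`, and the problem size are hypothetical, and the import guard makes the function return `None` where Numba or a GPU is unavailable.

```python
try:
    import numpy as np
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:
    HAVE_CUDA = False


def time_kernel_ms(n=1_000_000):
    """Return the time of one kernel launch in milliseconds, or None without CUDA."""
    if not HAVE_CUDA:
        return None

    @cuda.jit
    def scale(arr, factor):
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] *= factor

    d_arr = cuda.to_device(np.ones(n, dtype=np.float32))
    threads = 256
    blocks = (n + threads - 1) // threads

    # Warm-up launch so that JIT compilation is not included in the measurement
    scale[blocks, threads](d_arr, 2.0)

    start = cuda.event(timing=True)
    end = cuda.event(timing=True)
    start.record()                      # mark the point before the launch
    scale[blocks, threads](d_arr, 2.0)
    end.record()                        # mark the point after the launch
    end.synchronize()                   # block the host until `end` has completed
    return cuda.event_elapsed_time(start, end)


print(time_kernel_ms())
```

Kernel launches are asynchronous, which is why the host must synchronize on the second event before reading the elapsed time.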

--------------------------------

### Get Kernel PTX Assembly Code (Numba CUDA)

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Retrieves the PTX assembly code of a compiled Numba CUDA kernel for the device in the current context. Given a signature, it returns the PTX for that signature; otherwise it returns a dictionary of PTX codes for all previously encountered signatures. Requires the `numba_cuda` library.

```python
# Signature of the Dispatcher method (self omitted)
def inspect_asm(signature=None):
    """Return this kernel's PTX assembly code for the device in the current context.

    Parameters:
        signature: A tuple of argument types.

    Returns:
        The PTX code for the given signature, or a dict of PTX codes for all
        previously-encountered signatures.
    """
```

--------------------------------

### Get Kernel LLVM IR (Numba CUDA)

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Retrieves the LLVM Intermediate Representation (IR) for a compiled Numba CUDA kernel given a specific signature. If no signature is provided, it returns LLVM IR for all previously encountered signatures. This function is part of the `numba_cuda` library.

```python
# Signature of the Dispatcher method (self omitted)
def inspect_llvm(signature=None):
    """Return the LLVM IR for this kernel.

    Parameters:
        signature: A tuple of argument types.

    Returns:
        The LLVM IR for the given signature, or a dict of LLVM IR for all
        previously-encountered signatures.
    """
```

--------------------------------

### numba.cuda.libdevice.ldexp

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Computes x * 2^y.

```APIDOC
## numba.cuda.libdevice.ldexp

### Description
Computes x * 2^y.

### Method
N/A (Library Function)

### Endpoint
N/A (Library Function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
- **return_value** (float64) - The result of the calculation.

#### Response Example
None
```
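As a host-side reference for the formula, Python's standard `math.ldexp` computes the same quantity for a float `x` and an integer `y`. This sketch only illustrates the arithmetic on the CPU with arbitrarily chosen values; it does not call the libdevice function.

```python
import math

x, y = 0.75, 4
result = math.ldexp(x, y)      # 0.75 * 2**4
print(result)                  # 12.0

# math.frexp performs the inverse decomposition into mantissa and exponent
mantissa, exponent = math.frexp(result)
print(mantissa, exponent)      # 0.75 4
```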

--------------------------------

### Get Statically Allocated Shared Memory Size (Numba CUDA)

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Returns the size in bytes of statically allocated shared memory for a compiled Numba CUDA kernel. The `signature` parameter specifies the compiled kernel, and it can be omitted for specialized kernels. The returned size is specific to the current device.

```python
# Signature of the Dispatcher method (self omitted)
def get_shared_mem_per_block(signature=None):
    """Return the size in bytes of statically allocated shared memory for this kernel.

    Parameters:
        signature: The signature of the compiled kernel to get shared memory
            usage for. This may be omitted for a specialized kernel.

    Returns:
        The amount of shared memory allocated by the compiled variant of the
        kernel for the given signature and current device.
    """
```

--------------------------------

### numba.cuda.libdevice.llmin

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Computes the minimum of two signed 64-bit integers.

```APIDOC
## numba.cuda.libdevice.llmin

### Description
Computes the minimum of two signed 64-bit integers.

### Method
N/A (Library Function)

### Endpoint
N/A (Library Function)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None

### Request Example
None

### Response
#### Success Response (200)
- **return_value** (int64) - The minimum of the two arguments.

#### Response Example
None
```

--------------------------------

### Device Management

Source: https://nvidia.github.io/numba-cuda/reference/host

APIs for accessing and managing CUDA-capable devices, including listing, selecting, and obtaining device information.

```APIDOC
## Device Management

### Description
APIs for managing and querying CUDA-capable devices supported by Numba.

### Device Lists and Properties

#### `numba.cuda.gpus`

- **Description**: An indexable list of supported CUDA devices, indexed by integer device ID.

#### `numba.cuda.gpus.current`

- **Description**: The currently-selected device.

### Device Context Managers

#### `numba.cuda.cudadrv.devices._DeviceContextManager`

- **Description**: Provides a context manager for executing operations within the context of a specific device. Instances are typically obtained from `numba.cuda.gpus`.

- **Usage Example**:
```python
with numba.cuda.gpus[2]:
    d_a = numba.cuda.to_device(a)
```

### Device Selection and Information Functions

#### `numba.cuda.select_device(device_id)`

- **Description**: Makes the context associated with `device_id` the current context.
- **Parameters**:
  - `device_id` (int): The ID of the device to select.
- **Returns**: `numba.cuda.cudadrv.driver.Device` instance.
- **Raises**: Exception on error.

#### `numba.cuda.get_current_device()`

- **Description**: Gets the device associated with the current thread.
- **Returns**: `numba.cuda.cudadrv.driver.Device` instance.

#### `numba.cuda.list_devices()`

- **Description**: Returns a list of all detected CUDA devices.
- **Returns**: `list` of `numba.cuda.cudadrv.driver.Device` instances.

### `numba.cuda.cudadrv.driver.Device` Class

- **Description**: Represents a CUDA device and allows querying its functionality.

- **Properties**:
  - `compute_capability` (tuple): A tuple `(major, minor)` indicating the supported compute capability.
  - `id` (int): The integer ID of the device.
  - `name` (str): The name of the device (e.g., "GeForce GTX 970").
  - `uuid` (str): The UUID of the device (e.g., "GPU-e6489c45-5b68-3b03-bab7-0e7c8e809643").
  - `supports_float16` (bool): `True` if the device supports float16 operations, `False` otherwise.

- **Methods**:
  - `reset()`: Deletes the context for the device, destroying all associated allocations, events, and streams.
```

--------------------------------

### Get Kernel SASS Assembly Code (Numba CUDA)

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Retrieves the SASS (Streaming Assembler) assembly code for a compiled Numba CUDA kernel for the current device and signature. It returns a dictionary of SASS codes for all encountered signatures if no signature is specified. Requires `nvdisasm` to be available on the PATH. This function is provided by `numba_cuda`.

```python
# Signature of the Dispatcher method (self omitted)
def inspect_sass(signature=None):
    """Return this kernel's SASS assembly code for the device in the current context.

    Requires nvdisasm to be available on the PATH.

    Parameters:
        signature: A tuple of argument types.

    Returns:
        The SASS code for the given signature, or a dict of SASS codes for all
        previously-encountered signatures.
    """
```

--------------------------------

### Numba CUDA Thread Indexing Functions

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Provides access to thread and block indices within a CUDA grid. These are usable only inside a kernel or device function. They cover the thread's index within its block (`threadIdx`), the block's index within the grid (`blockIdx`), the dimensions of a thread block (`blockDim`), the dimensions of the grid (`gridDim`), the thread's index within its warp (`laneid`), the size of a warp (`warpsize`), the absolute position of the current thread in the grid (`grid`), and the total size of the grid in threads (`gridsize`).

```python
from numba import cuda


@cuda.jit
def indexing_demo(out):
    # The names below are only valid inside a kernel or device function.

    # Thread index within its block
    thread_x = cuda.threadIdx.x
    thread_y = cuda.threadIdx.y
    thread_z = cuda.threadIdx.z

    # Block index within the grid
    block_x = cuda.blockIdx.x
    block_y = cuda.blockIdx.y
    block_z = cuda.blockIdx.z

    # Block and grid dimensions
    block_dim_x = cuda.blockDim.x
    block_dim_y = cuda.blockDim.y
    block_dim_z = cuda.blockDim.z

    grid_dim_x = cuda.gridDim.x
    grid_dim_y = cuda.gridDim.y
    grid_dim_z = cuda.gridDim.z

    # Warp information
    lane_id = cuda.laneid
    warp_size = cuda.warpsize

    # Absolute thread position and total grid size, in threads
    grid_pos_1d = cuda.grid(1)        # a single integer
    grid_size_1d = cuda.gridsize(1)
    grid_pos_2d = cuda.grid(2)        # an (x, y) tuple
    grid_size_2d = cuda.gridsize(2)
    grid_pos_3d = cuda.grid(3)        # an (x, y, z) tuple
    grid_size_3d = cuda.gridsize(3)

    # cuda.grid(1) and cuda.gridsize(1) are shorthand for:
    absolute_x = thread_x + block_x * block_dim_x
    grid_total_x = block_dim_x * grid_dim_x

    # Write the position so that the kernel has an observable effect
    if absolute_x < out.size:
        out[absolute_x] = absolute_x
```
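To see how the 1D formulas map threads onto data, the arithmetic can be emulated on the host in plain Python. This is only an illustration of the index calculation, with hypothetical function and variable names and a deliberately tiny launch configuration; it does not run anything on a GPU.

```python
def emulate_grid_1d(griddim, blockdim):
    """Return every value cuda.grid(1) would take, plus the cuda.gridsize(1) value."""
    positions = []
    for block_idx in range(griddim):
        for thread_idx in range(blockdim):
            # Same expression as threadIdx.x + blockIdx.x * blockDim.x
            positions.append(thread_idx + block_idx * blockdim)
    gridsize = blockdim * griddim
    return positions, gridsize


positions, gridsize = emulate_grid_1d(griddim=3, blockdim=4)
print(positions)   # every thread gets a unique position from 0 to 11
print(gridsize)    # 12

# With a grid-stride loop, one thread handles every gridsize-th element.
n = 30
handled_by_thread_5 = list(range(positions[5], n, gridsize))
print(handled_by_thread_5)   # [5, 17, 29]
```

The grid-stride pattern at the end is a common use of `gridsize`: it lets a fixed-size grid process an array of any length.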

--------------------------------

### numba.cuda.compile_all

Source: https://nvidia.github.io/numba-cuda/reference/host

Similar to `compile()`, but returns a list of PTX codes/LTO-IRs for the compiled function and any external functions it depends on.

```APIDOC
## numba.cuda.compile_all

### Description
Compiles a Python function and all its external dependencies to PTX or LTO-IR. If external functions are CUDA C++ source, they will be compiled with NVRTC. Other external function code types are added directly to the return list.

### Method
`numba.cuda.compile_all`

### Parameters

#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **pyfunc** (function) - Required - The Python function to compile.
- **sig** (tuple) - Required - The signature representing the function’s input and output types.
- **debug** (bool) - Optional - Whether to include debug info in the compiled code.
- **lineinfo** (bool) - Optional - Whether to include a line mapping from the compiled code to the source code.
- **device** (bool) - Optional - Whether to compile a device function. Defaults to `True`.
- **fastmath** (bool) - Optional - Whether to enable fast math flags. Defaults to `False`.
- **cc** (tuple) - Optional - Compute capability to compile for, as a tuple `(MAJOR, MINOR)`. Defaults to the value specified by `NUMBA_CUDA_DEFAULT_PTX_CC` or `(5, 0)`.
- **opt** (bool) - Optional - Whether to enable optimizations in the compiled code.
- **abi** (str) - Optional - The ABI for a compiled function - either `"numba"` or `"c"`. Defaults to `"c"`.
- **abi_info** (dict) - Optional - A dict of ABI-specific options.
- **output** (str) - Optional - Type of output to generate, either `"ptx"` or `"ltoir"`. Defaults to `"ltoir"`.
- **forceinline** (bool) - Optional - Enables inlining at the NVVM IR level when set to `True`. Only valid when `output` is `"ltoir"`.
- **launch_bounds** (int or tuple of ints) - Optional - Kernel launch bounds.

### Returns
- **list** - A list containing the compiled PTX or LTO-IR codes for the function and its dependencies.

### Request Example
```python
# Example usage would depend on the specific Python function and signature
# compiled_modules = numba.cuda.compile_all(my_python_func, signature)
```

### Response
#### Success Response (200)
- **list** - A list of strings, where each string is a compiled PTX or LTO-IR module.

#### Response Example
```json
[
  "// PTX code for main function...",
  "// PTX code for dependency1..."
]
```
```

--------------------------------

### List All Detected CUDA Devices (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Returns a list containing Device objects for all CUDA-capable devices detected on the system. Each object provides details about a specific device.

```python
from numba import cuda

devices = cuda.list_devices()
print("Detected CUDA devices:")
for device in devices:
    print(f"- ID: {device.id}, Name: {device.name}, Compute Capability: {device.compute_capability}")
```

--------------------------------

### numba.cuda.libdevice.hadd

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Computes the average of two int32 arguments.

```APIDOC
## numba.cuda.libdevice.hadd

### Description
Computes the average of two signed int32 arguments, `x` and `y`, as `(x + y) >> 1`, avoiding overflow in the intermediate sum.

### Method
N/A (This is a library function call, not an HTTP endpoint)

### Endpoint
N/A

### Parameters
#### Path Parameters
N/A

#### Query Parameters
N/A

#### Request Body
N/A

### Request Example
N/A

### Response
#### Success Response (200)
- **return_value** (int32) - The average of `x` and `y`, rounded toward negative infinity.

#### Response Example
N/A

### See Also
https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_hadd.html
```
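According to the libdevice user's guide linked under See Also, `__nv_hadd` is a halving add: it computes the average of its signed inputs as `(x + y) >> 1` without overflowing in the intermediate sum. The sketch below is a host-side reference for that behaviour in plain Python, whose integers do not overflow; the name `hadd_reference` is hypothetical and the function does not call libdevice.

```python
INT32_MIN, INT32_MAX = -2 ** 31, 2 ** 31 - 1


def hadd_reference(x, y):
    """Average two int32 values as (x + y) >> 1, rounding toward negative infinity."""
    for value in (x, y):
        if not INT32_MIN <= value <= INT32_MAX:
            raise ValueError("arguments must fit in a signed 32-bit integer")
    # Python integers are unbounded, so the sum cannot overflow here
    return (x + y) >> 1


print(hadd_reference(3, 5))                  # 4
print(hadd_reference(3, 4))                  # 3
print(hadd_reference(-3, 0))                 # -2
print(hadd_reference(INT32_MAX, INT32_MAX))  # 2147483647
```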

--------------------------------

### CUDA Kernel API - Dispatcher Objects

Source: https://nvidia.github.io/numba-cuda/reference/index

Objects and methods for interacting with compiled CUDA kernels.

```APIDOC
## CUDA Kernel API - Dispatcher Objects

### Description
Provides an interface to interact with compiled CUDA kernels, allowing inspection and dispatch.

* `CUDADispatcher`: The object returned by `@cuda.jit`; it compiles, configures, and launches the kernel.
  * `extensions`: A list of objects whose `prepare_args` function is applied to each argument when a specialized kernel is called.
  * `forall()`: Returns a 1D-configured dispatcher for a given number of tasks.
  * `get_const_mem_size()`: Returns the size in bytes of constant memory used by the kernel.
  * `get_local_mem_per_thread()`: Returns the size in bytes of local memory per thread.
  * `get_max_threads_per_block()`: Returns the maximum allowable number of threads per block for the kernel.
  * `get_regs_per_thread()`: Returns the number of registers used by each thread.
  * `get_shared_mem_per_block()`: Returns the size in bytes of statically allocated shared memory.
  * `inspect_asm()`: Returns the kernel's PTX assembly code.
  * `inspect_llvm()`: Returns the kernel's LLVM IR.
  * `inspect_sass()`: Returns the kernel's SASS assembly code.
  * `inspect_types()`: Dumps the Python source annotated with the corresponding Numba IR and type information.
  * `specialize()`: Creates a new dispatcher specialized for the given arguments.
  * `specialized`: `True` if the dispatcher has been specialized.
```

--------------------------------

### Device Detection and Enquiry

Source: https://nvidia.github.io/numba-cuda/reference/host

Functions for detecting CUDA GPU availability and summarizing supported hardware.

```APIDOC
## Device Detection and Enquiry

### Description
Functions for querying available hardware, including GPU availability and detection of supported CUDA hardware.

### Functions

#### `numba.cuda.is_available()`

- **Description**: Returns a boolean to indicate the availability of a CUDA GPU. Initializes the driver if not already initialized.
- **Method**: None (Python function)
- **Returns**: `bool`

#### `numba.cuda.detect()`

- **Description**: Detects supported CUDA hardware and prints a summary. Returns a boolean indicating if any supported devices were detected.
- **Method**: None (Python function)
- **Returns**: `bool`
```
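
A minimal sketch of using `is_available()` as a guard before any GPU work is attempted. The helper name `cuda_status` and the returned strings are hypothetical; the `ImportError` branch is an assumption added so the snippet also runs where Numba is not installed.

```python
def cuda_status():
    """Return a short, human-readable description of CUDA availability."""
    try:
        from numba import cuda
    except ImportError:
        return "numba is not installed"
    # is_available() initializes the driver if needed and returns a bool
    if not cuda.is_available():
        return "no CUDA GPU available"
    return "CUDA GPU available"


print(cuda_status())
```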

--------------------------------

### Floor Functions

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Functions for calculating the floor of floating-point numbers.

```APIDOC
## numba.cuda.libdevice.floor

### Description
Computes the largest integer value not greater than the input float64.

### Usage
numba.cuda.libdevice.floor(f)

### Parameters
- **f** (float64) - The input value.

### Returns
- (float64) - The floor of the input float64. For example, an input of 5.7 gives 5.0.
```

```APIDOC
## numba.cuda.libdevice.floorf

### Description
Computes the largest integer value not greater than the input float32.

### Usage
numba.cuda.libdevice.floorf(f)

### Parameters
- **f** (float32) - The input value.

### Returns
- (float32) - The floor of the input float32. For example, an input of 5.7 gives 5.0.
```
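
The libdevice wrappers are meant to be called from inside a kernel rather than from host code. The sketch below shows one way `libdevice.floor` might be applied element-wise. The names `floor_on_gpu` and `floor_kernel` are hypothetical, the block size of 64 is an arbitrary choice, and the function returns `None` when Numba or a GPU is unavailable.

```python
def floor_on_gpu(values):
    """Apply libdevice.floor to each float64 value; None if CUDA is unavailable."""
    try:
        import numpy as np
        from numba import cuda
        from numba.cuda import libdevice
    except ImportError:
        return None
    if not cuda.is_available():
        return None

    @cuda.jit
    def floor_kernel(src, dst):
        i = cuda.grid(1)
        if i < src.size:
            dst[i] = libdevice.floor(src[i])  # float64 in, float64 out

    src = np.asarray(values, dtype=np.float64)
    dst = np.empty_like(src)
    threads = 64
    blocks = (src.size + threads - 1) // threads
    # Host arrays passed to a kernel are copied to the device and back by Numba
    floor_kernel[blocks, threads](src, dst)
    return dst


print(floor_on_gpu([5.7, -1.2]))
```

For float32 data, `libdevice.floorf` would be the matching call.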

--------------------------------

### Libdevice Functions

Source: https://nvidia.github.io/numba-cuda/reference/index

Wrapped libdevice functions available for use in Numba CUDA kernels.

```APIDOC
## Libdevice Functions

### Description

This section lists various functions from the CUDA libdevice library that are wrapped and available for use within Numba CUDA kernels. These include mathematical functions, bit manipulation functions, and more.

### Wrapped Functions

**Mathematical Functions:**

- `abs()`: Computes the absolute value.
- `acos()`: Computes the arc cosine.
- `acosf()`: Single-precision (float32) variant of `acos()`.
- `acosh()`: Computes the inverse hyperbolic cosine.
- `acoshf()`: Single-precision (float32) variant of `acosh()`.
- `asin()`: Computes the arc sine.
- `asinf()`: Single-precision (float32) variant of `asin()`.
- `asinh()`: Computes the inverse hyperbolic sine.
- `asinhf()`: Single-precision (float32) variant of `asinh()`.
- `atan()`: Computes the arc tangent.
- `atan2()`: Computes the arc tangent of y/x.
- `atan2f()`: Single-precision (float32) variant of `atan2()`.
- `atanf()`: Single-precision (float32) variant of `atan()`.
- `atanh()`: Computes the inverse hyperbolic tangent.
- `atanhf()`: Single-precision (float32) variant of `atanh()`.
- `cbrt()`: Computes the cube root.
- `cbrtf()`: Single-precision (float32) variant of `cbrt()`.
- `ceil()`: Computes the smallest integer value not less than the argument.
- `ceilf()`: Single-precision (float32) variant of `ceil()`.
- `copysign()`: Returns a number with the magnitude of the first argument and the sign of the second.
- `copysignf()`: Single-precision (float32) variant of `copysign()`.
- `cos()`: Computes the cosine.
- `cosf()`: Single-precision (float32) variant of `cos()`.
- `cosh()`: Computes the hyperbolic cosine.
- `coshf()`: Single-precision (float32) variant of `cosh()`.
- `cospi()`: Computes the cosine of pi times the argument.
- `cospif()`: Single-precision (float32) variant of `cospi()`.

**Bit Manipulation Functions:**

- `brev()`: Reverses the bits of a 32-bit integer.
- `brevll()`: Reverses the bits of a 64-bit integer.
- `byte_perm()`: Permutes bytes within a 32-bit integer.
- `clz()`: Counts the number of leading zeros in a 32-bit integer.
- `clzll()`: Counts the number of leading zeros in a 64-bit integer.

**Other Functions:**

- `dadd_rd()`: Adds two double-precision values, rounding toward negative infinity.
- `dadd_rn()`: Adds two double-precision values, rounding to nearest even.
```

--------------------------------

### Device Management

Source: https://nvidia.github.io/numba-cuda/reference/host

Provides functionalities for managing CUDA devices, including listing, selecting, and querying device properties.

```APIDOC
## Device Management Functions

### Description
This section details functions for managing CUDA devices.

### Functions

*   **`list_devices()`**
    *   Description: Lists all available CUDA devices.
    *   Returns: A list of `Device` objects.

*   **`get_current_device()`**
    *   Description: Gets the currently active CUDA device.
    *   Returns: A `Device` object representing the current device.

*   **`select_device(device_id)`**
    *   Description: Makes the context associated with device `device_id` the current context.
    *   Parameters:
        *   **`device_id`** (int) - The ID of the device to select.
    *   Returns: A `Device` object for the selected device.

## Device Object

### Description
Represents a CUDA device with its properties and methods.

### Properties

*   **`id`** (int) - The unique identifier of the device.
*   **`name`** (str) - The name of the device.
*   **`uuid`** (str) - The universally unique identifier of the device.
*   **`compute_capability`** (tuple) - The compute capability of the device (e.g., (7, 5)).
*   **`supports_float16`** (bool) - Indicates if the device supports float16 operations.

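### Methods

*   **`reset()`**
    *   Description: Deletes the context for the device, destroying all memory allocations, events, and streams created within it.

### Related Module Attributes

*   **`numba.cuda.gpus`**
    *   Description: The list of detected devices. Entries are indexed by device ID, and each entry can be used as a context manager to select that device.

*   **`numba.cuda.gpus.current`**
    *   Description: The currently active device, or `None` if no device is active.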
```
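
To show how the listing function and the `Device` properties combine, here is a small sketch that collects `(id, name, compute_capability)` for each detected device. The helper name `describe_devices` is hypothetical, and returning an empty list when Numba or a GPU is missing is an assumption made for portability.

```python
def describe_devices():
    """Return (id, name, compute_capability) for each detected CUDA device."""
    try:
        from numba import cuda
    except ImportError:
        return []
    if not cuda.is_available():
        return []

    summary = []
    for dev in cuda.list_devices():
        # compute_capability is a (major, minor) tuple such as (7, 5)
        summary.append((dev.id, dev.name, dev.compute_capability))
    return summary


for entry in describe_devices():
    print(entry)
```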

--------------------------------

### Detect Supported CUDA Hardware (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Scans the system for supported CUDA-capable hardware and prints a summary. It returns a boolean indicating whether any compatible devices were found.

```python
from numba import cuda

# detect() prints a hardware summary and returns True if a supported device exists
if cuda.detect():
    print("Supported CUDA devices detected.")
else:
    print("No supported CUDA devices found.")
```

--------------------------------

### Memory Management

Source: https://nvidia.github.io/numba-cuda/reference/index

Functions for managing memory allocation and data transfer between host and device.

```APIDOC
## Memory Management

### Description

Functions for managing memory on the CUDA device, including allocating arrays, transferring data, and handling different memory types like pinned, mapped, and managed memory.

### Allocation Functions

- `to_device()`: Allocates memory on the device and copies data to it.
- `device_array()`: Allocates an uninitialized array on the device.
- `device_array_like()`: Allocates an array on the device with the same shape and dtype as another array.
- `pinned_array()`: Allocates a pinned (page-locked) array on the host.
- `pinned_array_like()`: Allocates a pinned array on the host like another array.
- `mapped_array()`: Allocates a mapped array: pinned host memory that is also mapped into the device address space.
- `mapped_array_like()`: Allocates a mapped array like another array.
- `managed_array()`: Allocates a managed array (unified memory).

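### Context Managers

- `pinned(*arylist)`: Context manager that temporarily pins (page-locks) a sequence of existing host arrays.
- `mapped(*arylist, **kws)`: Context manager that temporarily maps a sequence of existing host arrays into the device address space.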

### Device Objects

**`DeviceNDArray`**

Represents an N-dimensional array residing on the CUDA device.

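- `copy_to_device()`: Copies data from another array (host or device) into this array.
- `copy_to_host()`: Copies the array from the device to the host.
- `is_c_contiguous()`: Checks if the array is C-contiguous.
- `is_f_contiguous()`: Checks if the array is Fortran-contiguous.
- `ravel()`: Flattens a contiguous array without changing its contents.
- `reshape()`: Reshapes the array without changing its contents.
- `split()`: Splits the array into partitions of a given section size; the last partition is smaller if the array does not divide evenly.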

**`DeviceRecord`**

Represents a record (struct-like) residing on the CUDA device.

- `copy_to_device()`: Copies data from another record (host or device) into this record.
- `copy_to_host()`: Copies the record from the device to the host.

**`MappedNDArray`**

Represents an N-dimensional array that is mapped between host and device memory.

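- `copy_to_device()`: Copies data from another array (host or device) into this array.
- `copy_to_host()`: Copies the array from the device to the host.
- `split()`: Splits the array into partitions of a given section size; the last partition is smaller if the array does not divide evenly.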
```
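
The allocation and transfer calls above can be combined into a simple host-to-device-to-host round trip. This is only a sketch: `roundtrip` is a hypothetical helper, float32 is an arbitrary dtype choice, and the function returns `None` when Numba or a GPU is unavailable.

```python
def roundtrip(values):
    """Copy values to the GPU, duplicate them on the device, and copy back."""
    try:
        import numpy as np
        from numba import cuda
    except ImportError:
        return None
    if not cuda.is_available():
        return None

    host = np.asarray(values, dtype=np.float32)
    d_in = cuda.to_device(host)            # allocate on the device and copy host -> device
    d_out = cuda.device_array_like(host)   # uninitialized, same shape and dtype as host
    d_out.copy_to_device(d_in)             # copies d_in INTO d_out (device -> device)
    return d_out.copy_to_host()            # new NumPy array on the host


print(roundtrip([1, 2, 3]))
```

Note the direction of `copy_to_device()`: the argument is the source and the object it is called on is the destination.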

--------------------------------

### CUDA Kernel API - Kernel Declaration

Source: https://nvidia.github.io/numba-cuda/reference/index

Function for declaring and compiling JIT kernels.

```APIDOC
## CUDA Kernel API - Kernel Declaration

### Description
Decorator for Just-In-Time (JIT) compilation of CUDA kernels.

* `jit()`: Decorator to compile Python functions into CUDA kernels.
```
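
A minimal sketch of `jit()` in use, assuming the common pattern of a device helper (`device=True`) called from a kernel. The names `scaled_add`, `axpy`, and `axpy_kernel` are hypothetical, the block size of 64 is arbitrary, and the function returns `None` when Numba or a GPU is unavailable.

```python
def scaled_add(a, x, y):
    """Compute a * x + y element-wise on the GPU; None if CUDA is unavailable."""
    try:
        import numpy as np
        from numba import cuda
    except ImportError:
        return None
    if not cuda.is_available():
        return None

    @cuda.jit(device=True)
    def axpy(a, xi, yi):
        # Device function: callable only from kernels or other device functions
        return a * xi + yi

    @cuda.jit
    def axpy_kernel(a, x, y, out):
        i = cuda.grid(1)
        if i < out.size:  # guard against threads past the end of the array
            out[i] = axpy(a, x[i], y[i])

    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    out = np.empty_like(x)
    threads = 64
    blocks = (x.size + threads - 1) // threads
    axpy_kernel[blocks, threads](a, x, y, out)  # compiled on first call
    return out


print(scaled_add(2.0, [1, 2], [3, 4]))
```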

--------------------------------

### Stream Management

Source: https://nvidia.github.io/numba-cuda/reference/host

Utilities for managing CUDA streams, including creation, synchronization, and callback registration.

```APIDOC
## Stream Management

### Description
This section details functions and classes for managing CUDA streams.

### Stream Object

### Description
Represents a CUDA stream, which is a sequence of operations that execute in order.

### Methods

*   **`synchronize()`**
    *   Description: Blocks the host until all operations in the stream have completed.

*   **`auto_synchronize()`**
    *   Description: Context manager that waits for all commands in the stream to finish, and commits any pending memory transfers, when the `with` block exits.

*   **`add_callback(callback, arg=None)`**
    *   Description: Registers a function that a CUDA driver thread calls once all preceding operations in the stream are complete. The callback must not call CUDA API functions and should be short, because it blocks later work in the stream.
    *   Parameters:
        *   **`callback`** (function) - Called as `callback(stream, status, arg)`.
        *   **`arg`** - Optional user data passed to the callback.

*   **`async_done()`**
    *   Description: Returns an awaitable (an `asyncio` future) that resolves once all preceding operations in the stream are complete. Its result is the stream itself.

### Stream Functions

*   **`stream()`**
    *   Description: Creates a new, empty CUDA stream.
    *   Returns: A `Stream` object.

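*   **`default_stream()`**
    *   Description: Gets the default CUDA stream. In CUDA generally this is either the legacy or the per-thread default stream, depending on the APIs in use; Numba currently uses the legacy default stream APIs.
    *   Returns: A `Stream` object.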

*   **`legacy_default_stream()`**
    *   Description: Gets the legacy default CUDA stream.
    *   Returns: A `Stream` object.

*   **`per_thread_default_stream()`**
    *   Description: Gets the default CUDA stream associated with the current thread.
    *   Returns: A `Stream` object.

*   **`external_stream(ptr)`**
    *   Description: Creates a `Stream` object that wraps a CUDA stream allocated outside Numba.
    *   Parameters:
        *   **`ptr`** (int) - Pointer to the external stream to wrap.
    *   Returns: A `Stream` object.
```
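
One way the stream calls might fit together is sketched below: a transfer, a kernel launch, and a copy back are queued on the same stream, and `auto_synchronize()` waits for them when the `with` block exits. The names `double_on_stream` and `double` are hypothetical, and the function returns `None` when Numba or a GPU is unavailable.

```python
def double_on_stream(values):
    """Double each value on a non-default stream; None if CUDA is unavailable."""
    try:
        import numpy as np
        from numba import cuda
    except ImportError:
        return None
    if not cuda.is_available():
        return None

    @cuda.jit
    def double(arr):
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] *= 2

    host = np.asarray(values, dtype=np.float32)
    out = np.empty_like(host)
    stream = cuda.stream()
    with stream.auto_synchronize():  # synchronizes the stream on exit
        d_arr = cuda.to_device(host, stream=stream)
        threads = 64
        blocks = (host.size + threads - 1) // threads
        double[blocks, threads, stream](d_arr)  # third launch item is the stream
        d_arr.copy_to_host(out, stream=stream)
    return out  # safe to read: the stream has been synchronized


print(double_on_stream([1, 2, 3]))
```

For the transfers to overlap with other work, the host buffers would normally need to be pinned; plain NumPy arrays are used here only to keep the sketch short.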

--------------------------------

### Profiling Control - numba.cuda.profile_start, numba.cuda.profile_stop, numba.cuda.profiling

Source: https://nvidia.github.io/numba-cuda/reference/host

Functions to control and manage CUDA profiling within the current context. `profile_start()` enables profiling, `profile_stop()` disables it, and `profiling()` acts as a context manager to automatically handle these states.

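```python
from numba import cuda

cuda.profile_start()  # enable profile collection in the current context
```

```python
from numba import cuda

cuda.profile_stop()  # disable profile collection in the current context
```

```python
from numba import cuda

with cuda.profiling():
    # Code to profile goes here; profiling is enabled on entry and disabled on exit
    pass
```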

--------------------------------

### CUDA Host API - Stream Management

Source: https://nvidia.github.io/numba-cuda/reference/index

Functions and classes for managing CUDA streams.

```APIDOC
## CUDA Host API - Stream Management

### Description
APIs for creating, managing, and synchronizing CUDA streams.

* `Stream`: Represents a CUDA stream.
  * `add_callback()`: Adds a callback that runs once all preceding operations in the stream are complete.
  * `async_done()`: Returns an awaitable that resolves once all preceding operations in the stream are complete.
  * `auto_synchronize()`: Context manager that synchronizes the stream when the `with` block exits.
  * `synchronize()`: Synchronizes the stream.
* `stream()`: Creates a new CUDA stream.
* `default_stream()`: Gets the default CUDA stream.
* `legacy_default_stream()`: Gets the legacy default CUDA stream.
* `per_thread_default_stream()`: Gets the per-thread default CUDA stream.
* `external_stream()`: Creates a stream from an external handle.
```