### CUDA Host API - Device Management Source: https://nvidia.github.io/numba-cuda/reference/index Functions and classes for detecting, managing, and interacting with CUDA-enabled devices. ```APIDOC ## CUDA Host API - Device Management ### Description APIs for device detection, enquiry, and context management. #### Device Detection and Enquiry * `is_available()`: Checks if a CUDA-enabled GPU is available. * `detect()`: Detects available CUDA devices. #### Context Management * `Context`: Represents a CUDA context. * `get_memory_info()`: Retrieves memory information for the current context. * `pop()`: Pops the current context. * `push()`: Pushes a context onto the context stack. * `reset()`: Resets the current context. * `current_context()`: Returns the current CUDA context. * `require_context()`: Ensures a CUDA context is available. * `synchronize()`: Synchronizes all streams in the current context. * `close()`: Closes the current CUDA context. #### Device Management * `gpus`: A collection of available CUDA GPUs. * `current`: The currently selected CUDA device. * `_DeviceContextManager`: A context manager for device selection. * `select_device()`: Selects a specific CUDA device. * `get_current_device()`: Gets the currently active device. * `list_devices()`: Lists all available CUDA devices. * `Device`: Represents a CUDA device. * `compute_capability`: The compute capability of the device. * `id`: The unique identifier of the device. * `name`: The name of the device. * `uuid`: The UUID of the device. * `reset()`: Resets the device. * `supports_float16`: Indicates if the device supports float16 operations. ``` -------------------------------- ### CUDA Vector Types Source: https://nvidia.github.io/numba-cuda/reference/types Describes how to use and construct CUDA vector types in Numba, including recommended naming conventions and examples. ```APIDOC ## CUDA Vector Types ### Description CUDA Vector Types are usable in kernels. 
They are named `<base_type>x<N>`, where `base_type` is the base type of the vector and `N` is the number of elements. Examples include `int64x3`, `uint16x4`, and `float32x4`.

### Aliases for Convenience
Aliases consistent with CUDA C/C++ naming are available for convenience (e.g., `float3` is an alias for `float32x3`).

### Construction
Vector types are constructed directly with their constructor. A vector can also be built from a mix of scalars and smaller vectors, as long as the total number of components matches.

### Request Example
```python
from numba import uint32
from numba.cuda import float32x3, uint32x2, uint32x3, uint32x4

# In kernel
f3 = float32x3(0.0, -1.0, 1.0)

zero = uint32(0)
u2 = uint32x2(1, 2)
# Construct a 3-component vector from a scalar and a 2-component vector
u3 = uint32x3(zero, u2)
# Construct a 4-component vector from two 2-component vectors
u4 = uint32x4(u2, u2)
```

### Component Access
Components can be accessed through the fields `x`, `y`, `z`, and `w`. Components are immutable after construction in the current version.

### Response Example
```python
from numba.cuda import float32x2

# In kernel
v1 = float32x2(1.0, 1.0)
v2 = float32x2(1.0, -1.0)
dotprod = v1.x * v2.x + v1.y * v2.y
```
```

--------------------------------

### Configuring and Launching CUDA Kernels

Source: https://nvidia.github.io/numba-cuda/reference/kernel

This section details how to configure and launch CUDA kernels using the `numba.cuda.jit` decorator and the returned Dispatcher object. It covers how to specify grid and block dimensions, the stream, and dynamic shared memory.

```APIDOC
## Launching CUDA Kernels with Dispatcher

### Description
Applying `@cuda.jit` to a function returns a Dispatcher object. Subscripting the Dispatcher with launch parameters (grid dimensions, block dimensions, stream, shared memory) yields a configured kernel, which is then called with the kernel arguments.

### Method
Configuration and Launch

### Endpoint
N/A (In-process function call)

### Parameters
#### Kernel Configuration Parameters (Subscripting the Dispatcher)
- **griddim** (int or tuple) - Specifies the grid dimensions (up to 3D).
- **blockdim** (int or tuple) - Specifies the block dimensions (up to 3D).
- **stream** (optional) - The CUDA stream on which the kernel will be launched.
- **sharedmem** (optional, int) - The size of dynamic shared memory in bytes.

#### Kernel Arguments (Calling the Configured Dispatcher)
- **x, y, z, ...** (various types) - Arguments required by the kernel function.

### Request Example
```
# Configure and launch in one statement
func[griddim, blockdim, stream, sharedmem](x, y, z)

# Or, configure first, then call
configured = func[griddim, blockdim, stream, sharedmem]
configured(x, y, z)
```

### Response
#### Success Response
N/A (Kernel execution on device)

#### Response Example
N/A

### Notes
- The order of `stream` and `sharedmem` is reversed compared to CUDA C/C++ (`func<<<griddim, blockdim, sharedmem, stream>>>(x, y, z)`).
- When called, the Dispatcher compiles a specialization for the argument types and for the compute capability of the current device.
```

--------------------------------

### Get CUDA Device Name and UUID (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the name and universally unique identifier (UUID) of a CUDA device. This information can be used for device identification and logging.

```python
from numba import cuda

# Assumes a device (e.g. device 0) is selected or available
current_device = cuda.get_current_device()
print(f"Device Name: {current_device.name}")
print(f"Device UUID: {current_device.uuid}")
```

--------------------------------

### Get Current CUDA Device (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the Device object associated with the current thread's CUDA context. This is useful for checking which device is currently active.

```python
from numba import cuda

current_device = cuda.get_current_device()
print(f"Current device: {current_device.name} (ID: {current_device.id})")
```

--------------------------------

### CUDA Host API - Measurement

Source: https://nvidia.github.io/numba-cuda/reference/index

APIs for profiling and event management in CUDA.
```APIDOC
## CUDA Host API - Measurement

### Description
Tools for profiling CUDA operations and managing timing events.

#### Profiling
* `profile_start()`: Enables profile collection in the current context.
* `profile_stop()`: Disables profile collection in the current context.
* `profiling()`: Context manager that enables profiling on entry and disables it on exit.

#### Events
* `event()`: Creates a CUDA event.
* `event_elapsed_time()`: Calculates the elapsed time between two CUDA events.
* `Event`: Represents a CUDA event.
  * `query()`: Checks whether the event has completed.
  * `record()`: Records the event on a stream.
  * `synchronize()`: Blocks the host until the event has completed.
  * `wait()`: Makes a stream wait until the event has completed.
```

--------------------------------

### Get CUDA Device Compute Capability (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the compute capability of a CUDA device, represented as a tuple of (major, minor) integers. This indicates the hardware features supported by the GPU.

```python
from numba import cuda

# Assumes a device (e.g. device 0) is selected or available
current_device = cuda.get_current_device()
compute_cap = current_device.compute_capability
print(f"Compute capability of device {current_device.id}: {compute_cap[0]}.{compute_cap[1]}")
```

--------------------------------

### Manage CUDA Context with Device Context Manager (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Demonstrates how to use a device as a context manager to execute code on a specific CUDA device. This makes it easy to select a device for operations such as memory transfers.

```python
import numpy as np
from numba import cuda

a = np.array([1, 2, 3])

# Assumes a device with ID 2 is available
with cuda.gpus[2]:
    d_a = cuda.to_device(a)
    print(f"Array copied to device 2: {d_a}")
```

--------------------------------

### Get CUDA Context Memory Info (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Retrieves the amount of free and total memory (in bytes) available within the current CUDA context. This is useful for monitoring GPU memory usage.
```python
from numba import cuda

if cuda.is_available():
    # get_memory_info() is a method of the Context object;
    # current_context() also ensures a context is active
    free_mem, total_mem = cuda.current_context().get_memory_info()
    print(f"Memory Info - Free: {free_mem} bytes, Total: {total_mem} bytes")
else:
    print("CUDA not available.")
```

--------------------------------

### CUDA Host API - Compilation

Source: https://nvidia.github.io/numba-cuda/reference/index

Functions for compiling Python functions to PTX or LTO-IR.

```APIDOC
## CUDA Host API - Compilation

### Description
Functions for compiling Python functions to PTX or LTO-IR.

* `compile()`: Compiles a Python function to PTX or LTO-IR for a given signature.
* `compile_all()`: Like `compile()`, but returns a list of PTX codes/LTO-IRs for the compiled function and any external functions it depends on.
* `compile_for_current_device()`: Like `compile()`, but targets the compute capability of the current device.
* `compile_ptx()`: Compiles a Python function to PTX for a given signature.
* `compile_ptx_for_current_device()`: Like `compile_ptx()`, but targets the compute capability of the current device.
```

--------------------------------

### Numba CUDA fabsf: Single Precision Absolute Value

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Exposes the `__nv_fabsf` libdevice function for the absolute value of single-precision floating-point numbers. It accepts a float32 input and returns a float32 output.

```python
from numba import cuda
from numba.cuda import libdevice

# A kernel cannot return a value, so the wrapper is compiled as a device function
@cuda.jit(device=True)
def fabsf_wrapper(x):
    return libdevice.fabsf(x)
```

--------------------------------

### Numba CUDA fast mathematical functions

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

This section covers fast approximations of common mathematical functions: sine, cosine, tangent, exponentiation, logarithms, division, and power.

```APIDOC
## numba.cuda.libdevice.fast_cosf(x)

### Description
Computes a fast approximation of the cosine of a float32 argument.
### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_cosf(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The cosine of x. #### Response Example ```json { "return_value": 0.5403023058681398 } ``` ## numba.cuda.libdevice.fast_exp10f(x) ### Description Computes the fast base-10 exponentiation of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_exp10f(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The result of 10 raised to the power of x. #### Response Example ```json { "return_value": 1000.0 } ``` ## numba.cuda.libdevice.fast_expf(x) ### Description Computes the fast natural exponentiation (e^x) of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_expf(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The result of e raised to the power of x. #### Response Example ```json { "return_value": 2.718281828459045 } ``` ## numba.cuda.libdevice.fast_fdividef(x, y) ### Description Performs fast floating-point division of two float32 arguments. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_fdividef(x, y) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The result of x divided by y. 
#### Response Example ```json { "return_value": 2.0 } ``` ## numba.cuda.libdevice.fast_log10f(x) ### Description Computes the fast base-10 logarithm of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_log10f(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The base-10 logarithm of x. #### Response Example ```json { "return_value": 1.0 } ``` ## numba.cuda.libdevice.fast_log2f(x) ### Description Computes the fast base-2 logarithm of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_log2f(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The base-2 logarithm of x. #### Response Example ```json { "return_value": 1.0 } ``` ## numba.cuda.libdevice.fast_logf(x) ### Description Computes the fast natural logarithm (ln) of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_logf(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The natural logarithm of x. #### Response Example ```json { "return_value": 0.0 } ``` ## numba.cuda.libdevice.fast_powf(x, y) ### Description Computes the fast power of a float32 base to a float32 exponent. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_powf(x, y) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The result of x raised to the power of y. 
#### Response Example ```json { "return_value": 8.0 } ``` ## numba.cuda.libdevice.fast_sincosf(x) ### Description Computes the fast sine and cosine of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_sincosf(x) ``` ### Response #### Success Response (UniTuple(float32 x 2)) - **return_value** (UniTuple(float32 x 2)) - A tuple containing the sine and cosine of x. #### Response Example ```json { "return_value": [0.8414709848078965, 0.5403023058681398] } ``` ## numba.cuda.libdevice.fast_sinf(x) ### Description Computes the fast sine of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_sinf(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The sine of x. #### Response Example ```json { "return_value": 0.8414709848078965 } ``` ## numba.cuda.libdevice.fast_tanf(x) ### Description Computes the fast tangent of a float32 argument. ### Method Call ### Endpoint N/A (Library function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example ``` numba.cuda.libdevice.fast_tanf(x) ``` ### Response #### Success Response (float32) - **return_value** (float32) - The tangent of x. #### Response Example ```json { "return_value": 1.5574077246549023 } ``` ``` -------------------------------- ### Configure and Launch CUDA Kernel with Numba Source: https://nvidia.github.io/numba-cuda/reference/kernel Demonstrates how to configure and launch a Numba CUDA kernel by specifying grid dimensions, block dimensions, stream, and shared memory. It shows both a two-step configuration and call, and a combined single-statement approach. This is analogous to CUDA C/C++ kernel launch syntax. 
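Before a kernel can be launched, values for `griddim` and `blockdim` have to be chosen. A common approach is to fix the block size and round the number of blocks up so that every element is covered. The following is a minimal host-side sketch in plain Python; the helper names `blocks_needed` and `launch_dims` are hypothetical and are not part of Numba.

```python
def blocks_needed(n_items, threads_per_block):
    """Smallest block count such that blocks * threads_per_block >= n_items."""
    if n_items < 0 or threads_per_block <= 0:
        raise ValueError("n_items must be >= 0 and threads_per_block must be > 0")
    # Integer ceiling division
    return (n_items + threads_per_block - 1) // threads_per_block


def launch_dims(shape, blockdim):
    """Return (griddim, blockdim) tuples covering an array of the given shape.

    Hypothetical helper: one grid dimension per array dimension (up to 3).
    """
    if len(shape) != len(blockdim) or not 1 <= len(shape) <= 3:
        raise ValueError("shape and blockdim must have the same length (1 to 3)")
    griddim = tuple(blocks_needed(n, t) for n, t in zip(shape, blockdim))
    return griddim, tuple(blockdim)


if __name__ == "__main__":
    print(launch_dims((1000,), (256,)))
    print(launch_dims((1080, 1920), (16, 16)))
```

Because the block count is rounded up, some threads may fall outside the array, so a kernel launched this way would normally check its index against the array bounds. The resulting tuples are what would be passed as `func[griddim, blockdim]`.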
```python
# Subscripting the Dispatcher returns a configured kernel, which is then called
configured = func[griddim, blockdim, stream, sharedmem]
configured(x, y, z)

# Idiomatic single-statement call:
func[griddim, blockdim, stream, sharedmem](x, y, z)
```

```cudahacks
func<<<griddim, blockdim, sharedmem, stream>>>(x, y, z)
```

--------------------------------

### Perform Arithmetic Operations on bfloat16 in Numba CUDA

Source: https://nvidia.github.io/numba-cuda/reference/types

Shows examples of supported arithmetic and logical operations on bfloat16 data types within Numba CUDA kernels. This includes standard arithmetic, in-place assignment, comparison, and unary operations.

```python
from numba.cuda.types import bfloat16

# Assume bf16_a and bf16_b are bfloat16 variables
result_add = bf16_a + bf16_b
result_mul = bf16_a * bf16_b
bf16_a += bf16_b
is_equal = bf16_a == bf16_b
negative_val = -bf16_a
```

--------------------------------

### Measurement and Profiling

Source: https://nvidia.github.io/numba-cuda/reference/host

Tools for profiling CUDA kernel execution and measuring event timing.

```APIDOC
## Measurement and Profiling

### Description
This section covers tools for profiling CUDA operations and measuring time.

### Profiling
* **`profile_start()`**
  * Description: Enables profile collection in the current context.
* **`profile_stop()`**
  * Description: Disables profile collection in the current context.
* **`profiling()`**
  * Description: Context manager that enables profiling on entry and disables it on exit.

## Events

### Description
Provides functionality for creating and managing CUDA events for timing.

### Functions
* **`event()`**
  * Description: Creates a new CUDA event.
  * Returns: An `Event` object.
* **`event_elapsed_time(start_event, end_event)`**
  * Description: Calculates the time elapsed between two CUDA events.
  * Parameters:
    * **`start_event`** (Event) - The starting event.
    * **`end_event`** (Event) - The ending event.
  * Returns: Elapsed time in milliseconds (float).

## Event Object

### Description
Represents a CUDA event used for timing and synchronization.
### Methods
* **`record()`**
  * Description: Records the event, marking the current point in the CUDA stream.
* **`query()`**
  * Description: Checks whether the event has completed.
  * Returns: `True` if completed, `False` otherwise.
* **`synchronize()`**
  * Description: Blocks the host until the event has completed.
* **`wait()`**
  * Description: Causes a stream to wait until the event has completed.
```

--------------------------------

### Get Kernel PTX Assembly Code (Numba CUDA)

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Retrieves the PTX assembly code of a compiled Numba CUDA kernel for the device in the current context and a given signature. If no signature is provided, it returns a dictionary of PTX code for all previously encountered signatures. Requires the `numba_cuda` library.

```python
inspect_asm(signature=None)

Return this kernel’s PTX assembly code for the device in the current context.

Parameters:
    signature: A tuple of argument types.

Returns:
    The PTX code for the given signature, or a dict of PTX codes
    for all previously-encountered signatures.
```

--------------------------------

### Get Kernel LLVM IR (Numba CUDA)

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Retrieves the LLVM Intermediate Representation (IR) of a compiled Numba CUDA kernel for a given signature. If no signature is provided, it returns the LLVM IR for all previously encountered signatures. This function is part of the `numba_cuda` library.

```python
inspect_llvm(signature=None)

Return the LLVM IR for this kernel.

Parameters:
    signature: A tuple of argument types.

Returns:
    The LLVM IR for the given signature, or a dict of LLVM IR
    for all previously-encountered signatures.
```

--------------------------------

### numba.cuda.libdevice.ldexp

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Computes x * 2^y.

```APIDOC
## numba.cuda.libdevice.ldexp

### Description
Computes x * 2^y.
### Method N/A (Library Function) ### Endpoint N/A (Library Function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example None ### Response #### Success Response (200) - **return_value** (float64) - The result of the calculation. #### Response Example None ``` -------------------------------- ### Get Statically Allocated Shared Memory Size (Numba CUDA) Source: https://nvidia.github.io/numba-cuda/reference/kernel Returns the size in bytes of statically allocated shared memory for a compiled Numba CUDA kernel. The `signature` parameter specifies the compiled kernel, and it can be omitted for specialized kernels. The returned size is specific to the current device. ```python Returns the size in bytes of statically allocated shared memory for this kernel. Parameters: **signature** – The signature of the compiled kernel to get shared memory usage for. This may be omitted for a specialized kernel. Returns: The amount of shared memory allocated by the compiled variant of the kernel for the given signature and current device. ``` -------------------------------- ### numba.cuda.libdevice.llmin Source: https://nvidia.github.io/numba-cuda/reference/libdevice Computes the minimum of two signed 64-bit integers. ```APIDOC ## numba.cuda.libdevice.llmin ### Description Computes the minimum of two signed 64-bit integers. ### Method N/A (Library Function) ### Endpoint N/A (Library Function) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None ### Request Example None ### Response #### Success Response (200) - **return_value** (int64) - The minimum of the two arguments. #### Response Example None ``` -------------------------------- ### Device Management Source: https://nvidia.github.io/numba-cuda/reference/host APIs for accessing and managing CUDA-capable devices, including listing, selecting, and obtaining device information. 
```APIDOC ## Device Management ### Description APIs for managing and querying CUDA-capable devices supported by Numba. ### Device Lists and Properties #### `numba.cuda.gpus` - **Description**: An indexable list of supported CUDA devices, indexed by integer device ID. #### `numba.cuda.gpus.current` - **Description**: The currently-selected device. ### Device Context Managers #### `numba.cuda.cudadrv.devices._DeviceContextManager` - **Description**: Provides a context manager for executing operations within the context of a specific device. Instances are typically obtained from `numba.cuda.gpus`. - **Usage Example**: ```python with numba.cuda.gpus[2]: d_a = numba.cuda.to_device(a) ``` ### Device Selection and Information Functions #### `numba.cuda.select_device(device_id)` - **Description**: Makes the context associated with `device_id` the current context. - **Parameters**: - `device_id` (int): The ID of the device to select. - **Returns**: `numba.cuda.cudadrv.driver.Device` instance. - **Raises**: Exception on error. #### `numba.cuda.get_current_device()` - **Description**: Gets the device associated with the current thread. - **Returns**: `numba.cuda.cudadrv.driver.Device` instance. #### `numba.cuda.list_devices()` - **Description**: Returns a list of all detected CUDA devices. - **Returns**: `list` of `numba.cuda.cudadrv.driver.Device` instances. ### `numba.cuda.cudadrv.driver.Device` Class - **Description**: Represents a CUDA device and allows querying its functionality. - **Properties**: - `compute_capability` (tuple): A tuple `(major, minor)` indicating the supported compute capability. - `id` (int): The integer ID of the device. - `name` (str): The name of the device (e.g., "GeForce GTX 970"). - `uuid` (str): The UUID of the device (e.g., "GPU-e6489c45-5b68-3b03-bab7-0e7c8e809643"). - `supports_float16` (bool): `True` if the device supports float16 operations, `False` otherwise. 
- **Methods**:
  - `reset()`: Deletes the context for the device, destroying all associated allocations, events, and streams.
```

--------------------------------

### Get Kernel SASS Assembly Code (Numba CUDA)

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Retrieves the SASS (Streaming Assembler) assembly code of a compiled Numba CUDA kernel for the device in the current context and a given signature. If no signature is specified, it returns a dictionary of SASS code for all previously encountered signatures. Requires `nvdisasm` to be available on the PATH. This function is provided by `numba_cuda`.

```python
inspect_sass(signature=None)

Return this kernel’s SASS assembly code for the device in the current context.

Parameters:
    signature: A tuple of argument types.

Returns:
    The SASS code for the given signature, or a dict of SASS codes
    for all previously-encountered signatures.

SASS for the device in the current context is returned.
Requires nvdisasm to be available on the PATH.
```

--------------------------------

### Numba CUDA Thread Indexing Functions

Source: https://nvidia.github.io/numba-cuda/reference/kernel

Provides access to thread and block indices within a CUDA grid. These intrinsics are only usable inside a kernel or device function. They expose:

- `threadIdx`: the current thread's index within its block.
- `blockIdx`: the current block's index within the grid.
- `blockDim`: the dimensions of a thread block.
- `gridDim`: the dimensions of the grid, in blocks.
- `laneid`: the thread's index within its warp.
- `warpsize`: the size of a warp, in threads.
- `grid(ndim)` and `gridsize(ndim)`: the absolute position of the current thread in the grid, and the absolute size of the grid in threads.
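The relationship between the per-block indices and the absolute position can be checked on the host without a GPU. The sketch below emulates the 1D index arithmetic (`threadIdx.x + blockIdx.x * blockDim.x` for the position, `blockDim.x * gridDim.x` for the size) in plain Python. The function names are hypothetical stand-ins, not Numba APIs.

```python
def grid_1d(thread_idx, block_idx, block_dim):
    """Host-side emulation of the absolute 1D thread position."""
    return thread_idx + block_idx * block_dim


def gridsize_1d(block_dim, grid_dim):
    """Host-side emulation of the total number of threads in a 1D grid."""
    return block_dim * grid_dim


def all_positions(grid_dim, block_dim):
    """Absolute positions of every thread in a 1D launch, block by block."""
    return [
        grid_1d(t, b, block_dim)
        for b in range(grid_dim)
        for t in range(block_dim)
    ]


if __name__ == "__main__":
    # A launch with 3 blocks of 4 threads covers positions 0..11 exactly once
    print(all_positions(grid_dim=3, block_dim=4))
    print(gridsize_1d(block_dim=4, grid_dim=3))
```

Each thread gets a unique position in the range from 0 to the grid size minus one, which is why the absolute position is commonly used directly as an array index.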
```Python
from numba import cuda

# The indexing intrinsics are only valid inside a @cuda.jit function
@cuda.jit
def show_indexing():
    # Accessing thread and block indices
    thread_x = cuda.threadIdx.x
    thread_y = cuda.threadIdx.y
    thread_z = cuda.threadIdx.z
    block_x = cuda.blockIdx.x
    block_y = cuda.blockIdx.y
    block_z = cuda.blockIdx.z

    # Accessing dimensions
    block_dim_x = cuda.blockDim.x
    block_dim_y = cuda.blockDim.y
    block_dim_z = cuda.blockDim.z
    grid_dim_x = cuda.gridDim.x
    grid_dim_y = cuda.gridDim.y
    grid_dim_z = cuda.gridDim.z

    # Accessing warp information
    lane_id = cuda.laneid
    warp_size = cuda.warpsize

    # Absolute grid position and size
    # For a 1D grid (scalars)
    grid_pos_1d = cuda.grid(1)
    grid_size_1d = cuda.gridsize(1)
    # For a 2D grid (2-tuples)
    grid_pos_2d = cuda.grid(2)
    grid_size_2d = cuda.gridsize(2)
    # For a 3D grid (3-tuples)
    grid_pos_3d = cuda.grid(3)
    grid_size_3d = cuda.gridsize(3)

    # Equivalent manual computation for the x dimension
    absolute_x = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    grid_total_x = cuda.blockDim.x * cuda.gridDim.x
```

--------------------------------

### numba.cuda.compile_all

Source: https://nvidia.github.io/numba-cuda/reference/host

Similar to `compile()`, but returns a list of PTX codes/LTO-IRs for the compiled function and any external functions it depends on.

```APIDOC
## numba.cuda.compile_all

### Description
Compiles a Python function and all its external dependencies to PTX or LTO-IR. External functions provided as CUDA C++ source are compiled with NVRTC. Other external function code types are added directly to the returned list.

### Method
`numba.cuda.compile_all`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **pyfunc** (function) - Required - The Python function to compile.
- **sig** (tuple) - Required - The signature representing the function’s input and output types.
- **debug** (bool) - Optional - Whether to include debug info in the compiled code.
- **lineinfo** (bool) - Optional - Whether to include a line mapping from the compiled code to the source code. - **device** (bool) - Optional - Whether to compile a device function. Defaults to `True`. - **fastmath** (bool) - Optional - Whether to enable fast math flags. Defaults to `False`. - **cc** (tuple) - Optional - Compute capability to compile for, as a tuple `(MAJOR, MINOR)`. Defaults to the value specified by `NUMBA_CUDA_DEFAULT_PTX_CC` or `(5, 0)`. - **opt** (bool) - Optional - Whether to enable optimizations in the compiled code. - **abi** (str) - Optional - The ABI for a compiled function - either `"numba"` or `"c"`. Defaults to `"c"`. - **abi_info** (dict) - Optional - A dict of ABI-specific options. - **output** (str) - Optional - Type of output to generate, either `"ptx"` or `"ltoir"`. Defaults to `"ltoir"`. - **forceinline** (bool) - Optional - Enables inlining at the NVVM IR level when set to `True`. Only valid when `output` is `"ltoir"`. - **launch_bounds** (int or tuple of ints) - Optional - Kernel launch bounds. ### Returns - **list** - A list containing the compiled PTX or LTO-IR codes for the function and its dependencies. ### Request Example ```python # Example usage would depend on the specific Python function and signature # compiled_modules = numba.cuda.compile_all(my_python_func, signature) ``` ### Response #### Success Response (200) - **list** - A list of strings, where each string is a compiled PTX or LTO-IR module. #### Response Example ```json [ "// PTX code for main function...", "// PTX code for dependency1..." ] ``` ``` -------------------------------- ### List All Detected CUDA Devices (Python) Source: https://nvidia.github.io/numba-cuda/reference/host Returns a list containing Device objects for all CUDA-capable devices detected on the system. Each object provides details about a specific device. 
```python
from numba import cuda

devices = cuda.list_devices()
print("Detected CUDA devices:")
for device in devices:
    print(f"- ID: {device.id}, Name: {device.name}, Compute Capability: {device.compute_capability}")
```

--------------------------------

### numba.cuda.libdevice.hadd

Source: https://nvidia.github.io/numba-cuda/reference/libdevice

Computes the average of two int32 arguments.

```APIDOC
## numba.cuda.libdevice.hadd

### Description
Computes the average of two signed int32 arguments, `x` and `y`, as `(x + y) >> 1`, avoiding overflow in the intermediate sum.

### Method
N/A (This is a library function call, not an HTTP endpoint)

### Endpoint
N/A

### Parameters
#### Path Parameters
N/A

#### Query Parameters
N/A

#### Request Body
N/A

### Request Example
N/A

### Response
#### Success Response (200)
- **return_value** (int32) - The value `(x + y) >> 1`.

#### Response Example
N/A

### See Also
https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_hadd.html
```

--------------------------------

### CUDA Kernel API - Dispatcher Objects

Source: https://nvidia.github.io/numba-cuda/reference/index

Objects and methods for interacting with compiled CUDA kernels.

```APIDOC
## CUDA Kernel API - Dispatcher Objects

### Description
Provides an interface to interact with compiled CUDA kernels, allowing inspection and dispatch.

* `CUDADispatcher`: Represents a CUDA kernel or device function created with `@cuda.jit`. The entries below are its members.
* `extensions`: Access to kernel extensions.
* `forall()`: Returns a 1D-configured kernel for a given number of tasks.
* `get_const_mem_size()`: Gets the constant memory size used by the kernel.
* `get_local_mem_per_thread()`: Gets the local memory size per thread.
* `get_max_threads_per_block()`: Gets the maximum number of threads per block for the kernel.
* `get_regs_per_thread()`: Gets the number of registers used per thread.
* `get_shared_mem_per_block()`: Gets the statically allocated shared memory size per block.
* `inspect_asm()`: Returns the generated PTX assembly code.
* `inspect_llvm()`: Returns the generated LLVM IR.
* `inspect_sass()`: Returns the generated SASS code.
* `inspect_types()`: Inspects the types used by the kernel.
* `specialize()`: Specializes the kernel for specific types. * `specialized`: Access to specialized versions of the kernel. ``` -------------------------------- ### Device Detection and Enquiry Source: https://nvidia.github.io/numba-cuda/reference/host Functions for detecting CUDA GPU availability and summarizing supported hardware. ```APIDOC ## Device Detection and Enquiry ### Description Functions for querying available hardware, including GPU availability and detection of supported CUDA hardware. ### Functions #### `numba.cuda.is_available()` - **Description**: Returns a boolean to indicate the availability of a CUDA GPU. Initializes the driver if not already initialized. - **Method**: None (Python function) - **Returns**: `bool` #### `numba.cuda.detect()` - **Description**: Detects supported CUDA hardware and prints a summary. Returns a boolean indicating if any supported devices were detected. - **Method**: None (Python function) - **Returns**: `bool` ``` -------------------------------- ### Floor Functions Source: https://nvidia.github.io/numba-cuda/reference/libdevice Functions for calculating the floor of floating-point numbers. ```APIDOC ## numba.cuda.libdevice.floor ### Description Computes the largest integer value not greater than the input float64. ### Method N/A (Function call) ### Endpoint N/A ### Parameters #### Path Parameters N/A #### Query Parameters N/A #### Request Body N/A ### Request Example ```python numba.cuda.libdevice.floor(f) ``` ### Response #### Success Response (200) - **return value** (float64) - The floor of the input float64. #### Response Example ```json { "return_value": 5.0 } ``` ``` ```APIDOC ## numba.cuda.libdevice.floorf ### Description Computes the largest integer value not greater than the input float32. 

### Method
N/A (Function call)

### Endpoint
N/A

### Parameters
#### Path Parameters
N/A

#### Query Parameters
N/A

#### Request Body
N/A

### Request Example
```python
numba.cuda.libdevice.floorf(f)
```

### Response
#### Success Response (200)
- **return value** (float32) - The floor of the input float32.

#### Response Example
```json
{
  "return_value": 5.0
}
```
```

--------------------------------

### Libdevice Functions

Source: https://nvidia.github.io/numba-cuda/reference/index

Wrapped libdevice functions available for use in Numba CUDA kernels.

```APIDOC
## Libdevice Functions

### Description
This section lists functions from the CUDA libdevice library that are wrapped and available for use within Numba CUDA kernels. They include mathematical functions, bit manipulation functions, and more. Functions with an `f` suffix operate on float32 values; their unsuffixed counterparts operate on float64 values.

### Wrapped Functions

**Mathematical Functions:**
- `abs()`: Computes the absolute value of an int32.
- `acos()`: Computes the arc cosine.
- `acosf()`: Computes the arc cosine for float32 input.
- `acosh()`: Computes the inverse hyperbolic cosine.
- `acoshf()`: Computes the inverse hyperbolic cosine for float32 input.
- `asin()`: Computes the arc sine.
- `asinf()`: Computes the arc sine for float32 input.
- `asinh()`: Computes the inverse hyperbolic sine.
- `asinhf()`: Computes the inverse hyperbolic sine for float32 input.
- `atan()`: Computes the arc tangent.
- `atan2()`: Computes the arc tangent of the ratio of the two arguments.
- `atan2f()`: Computes the arc tangent of the ratio of the two arguments for float32 input.
- `atanf()`: Computes the arc tangent for float32 input.
- `atanh()`: Computes the inverse hyperbolic tangent.
- `atanhf()`: Computes the inverse hyperbolic tangent for float32 input.
- `cbrt()`: Computes the cube root.
- `cbrtf()`: Computes the cube root for float32 input.
- `ceil()`: Computes the smallest integer value not less than the input.
- `ceilf()`: Computes the smallest integer value not less than the float32 input.
- `copysign()`: Returns a value with the magnitude of the first argument and the sign of the second.
- `copysignf()`: Computes `copysign` for float32 input.
- `cos()`: Computes the cosine.
- `cosf()`: Computes the cosine for float32 input.
- `cosh()`: Computes the hyperbolic cosine.
- `coshf()`: Computes the hyperbolic cosine for float32 input.
- `cospi()`: Computes the cosine of pi times the argument.
- `cospif()`: Computes the cosine of pi times the float32 argument.

**Bit Manipulation Functions:**
- `brev()`: Reverses the bits of a 32-bit integer.
- `brevll()`: Reverses the bits of a 64-bit integer.
- `byte_perm()`: Selects four bytes from two 32-bit integers according to a selector argument.
- `clz()`: Counts the number of leading zeros in a 32-bit integer.
- `clzll()`: Counts the number of leading zeros in a 64-bit integer.

**Other Functions:**
- `dadd_rd()`: Adds two float64 values, rounding down (toward negative infinity).
- `dadd_rn()`: Adds two float64 values, rounding to nearest even.
```

--------------------------------

### Device Management

Source: https://nvidia.github.io/numba-cuda/reference/host

Provides functionality for managing CUDA devices, including listing, selecting, and querying device properties.

```APIDOC
## Device Management Functions

### Description
This section details functions for managing CUDA devices.

### Functions

* **`list_devices()`**
  * Description: Lists all available CUDA devices.
  * Returns: A list of `Device` objects.

* **`get_current_device()`**
  * Description: Gets the device associated with the current thread.
  * Returns: A `Device` object representing the current device.

* **`select_device(device_id)`**
  * Description: Makes the context associated with device `device_id` the current context.
  * Parameters:
    * **`device_id`** (int) - The ID of the device to select.
  * Returns: A `Device` object for the selected device.

## Device Object

### Description
Represents a CUDA device with its properties and methods.

### Properties
* **`id`** (int) - The integer ID of the device.
* **`name`** (str) - The name of the device.
* **`uuid`** (str) - The universally unique identifier (UUID) of the device.
* **`compute_capability`** (tuple) - The compute capability of the device as `(major, minor)`, e.g., `(7, 5)`.
* **`supports_float16`** (bool) - Indicates whether the device supports float16 operations.

### Methods
* **`reset()`**
  * Description: Deletes the context for the device, destroying all memory allocations, events, and streams created within it.

### Module-Level Attributes
* **`gpus`**
  * Description: `numba.cuda.gpus` is an indexable list of supported CUDA devices, indexed by integer device ID. Each entry can be used as a context manager to execute in the context of that device.
* **`current`**
  * Description: `numba.cuda.gpus.current` is the currently selected device.
```

--------------------------------

### Detect Supported CUDA Hardware (Python)

Source: https://nvidia.github.io/numba-cuda/reference/host

Scans the system for supported CUDA-capable hardware and prints a summary. It returns a boolean indicating whether any compatible devices were found.

```python
from numba import cuda

# detect() prints a hardware summary and returns True if a supported device exists.
if cuda.detect():
    print("Supported CUDA devices detected.")
else:
    print("No supported CUDA devices found.")
```

--------------------------------

### Memory Management

Source: https://nvidia.github.io/numba-cuda/reference/index

Functions for managing memory allocation and data transfer between host and device.

```APIDOC
## Memory Management

### Description
Functions for managing memory on the CUDA device, including allocating arrays, transferring data, and handling different memory types such as pinned, mapped, and managed memory.

### Allocation Functions
- `to_device()`: Allocates memory on the device and copies a host array to it.
- `device_array()`: Allocates an uninitialized array on the device, similar to `numpy.empty()`.
- `device_array_like()`: Allocates an uninitialized array on the device with the same shape and dtype as another array.
- `pinned_array()`: Allocates an array on the host with a pinned (page-locked) buffer.
- `pinned_array_like()`: Allocates a pinned host array with the same shape and dtype as another array.
- `mapped_array()`: Allocates a host array whose buffer is pinned and mapped into the device address space.
- `mapped_array_like()`: Allocates a mapped array with the same shape and dtype as another array.
- `managed_array()`: Allocates an array with a managed (unified memory) buffer.

### Temporary Pinning and Mapping
- `pinned()`: Context manager that temporarily pins a sequence of host arrays.
- `mapped()`: Context manager that temporarily maps a sequence of host arrays.

### Device Objects

**`DeviceNDArray`**
Represents an N-dimensional array residing on the CUDA device.
- `copy_to_device()`: Copies data from another array into this device array.
- `copy_to_host()`: Copies the array from the device to the host, into a given array or a newly created NumPy array.
- `is_c_contiguous()`: Checks whether the array is C-contiguous.
- `is_f_contiguous()`: Checks whether the array is Fortran-contiguous.
- `ravel()`: Flattens a contiguous array without changing its contents.
- `reshape()`: Reshapes the array without changing its contents.
- `split()`: Splits the array into partitions of a given section size; the last partition is smaller if the array does not divide evenly.

**`DeviceRecord`**
Represents a record (struct-like) residing on the CUDA device.
- `copy_to_device()`: Copies data from another record into this device record.
- `copy_to_host()`: Copies the record from the device to the host.

**`MappedNDArray`**
Represents a host N-dimensional array whose buffer is pinned and mapped into the device address space.
- `copy_to_device()`: Copies data from another array into this array.
- `copy_to_host()`: Copies the array to the host.
- `split()`: Splits the array into partitions of a given section size.
```

--------------------------------

### CUDA Kernel API - Kernel Declaration

Source: https://nvidia.github.io/numba-cuda/reference/index

Function for declaring and compiling JIT kernels.

```APIDOC
## CUDA Kernel API - Kernel Declaration

### Description
Decorator for Just-In-Time (JIT) compilation of CUDA kernels.

* `jit()`: Decorator that compiles a Python function into a CUDA kernel or device function.
```

--------------------------------

### Stream Management

Source: https://nvidia.github.io/numba-cuda/reference/host

Utilities for managing CUDA streams, including creation, synchronization, and callback registration.

```APIDOC
## Stream Management

### Description
This section details functions and classes for managing CUDA streams.

### Stream Object

### Description
Represents a CUDA stream, which is a sequence of operations that execute in order.

### Methods

* **`synchronize()`**
  * Description: Blocks the host until all operations in the stream have completed, committing any pending memory transfers.

* **`auto_synchronize()`**
  * Description: Context manager that waits for all commands in the stream to execute, and commits any pending memory transfers, when the context exits.

* **`add_callback(callback, arg=None)`**
  * Description: Adds a callback that is called from a CUDA driver thread once all preceding operations in the stream are complete.
  * Parameters:
    * **`callback`** (function) - The function to call, invoked as `callback(stream, status, arg)`.
    * **`arg`** - Optional user data passed to the callback.

* **`async_done()`**
  * Description: Returns an awaitable that resolves once all preceding operations in the stream are complete. The result of the awaitable is the stream itself.

### Stream Functions

* **`stream()`**
  * Description: Creates a new CUDA stream, which represents a command queue for the device.
  * Returns: A `Stream` object.

* **`default_stream()`**
  * Description: Gets the default CUDA stream.
  * Returns: A `Stream` object.

* **`legacy_default_stream()`**
  * Description: Gets the legacy default CUDA stream.
  * Returns: A `Stream` object.

* **`per_thread_default_stream()`**
  * Description: Gets the per-thread default CUDA stream.
  * Returns: A `Stream` object.

* **`external_stream(ptr)`**
  * Description: Creates a `Stream` object that wraps a stream allocated outside Numba.
  * Parameters:
    * **`ptr`** (int) - Pointer to the external stream to wrap.
  * Returns: A `Stream` object.
```

--------------------------------

### Profiling Control - numba.cuda.profile_start, numba.cuda.profile_stop, numba.cuda.profiling

Source: https://nvidia.github.io/numba-cuda/reference/host

Functions to control CUDA profiling within the current context. `profile_start()` enables profile collection, `profile_stop()` disables it, and `profiling()` is a context manager that enables profiling on entry and disables it on exit.

```python
from numba import cuda
cuda.profile_start()  # Enable profile collection in the current context
```

```python
from numba import cuda
cuda.profile_stop()  # Disable profile collection in the current context
```

```python
from numba import cuda

# Profiling is enabled on entry and disabled on exit.
with cuda.profiling():
    pass  # Code to profile goes here
```

--------------------------------

### CUDA Host API - Stream Management

Source: https://nvidia.github.io/numba-cuda/reference/index

Functions and classes for managing CUDA streams.

```APIDOC
## CUDA Host API - Stream Management

### Description
APIs for creating, managing, and synchronizing CUDA streams.

* `Stream`: Represents a CUDA stream.
  * `add_callback()`: Adds a callback that runs once all preceding operations in the stream are complete.
  * `async_done()`: Returns an awaitable that resolves once all preceding operations in the stream are complete.
  * `auto_synchronize()`: Context manager that synchronizes the stream when the context exits.
  * `synchronize()`: Blocks until all operations in the stream have completed.
* `stream()`: Creates a new CUDA stream.
* `default_stream()`: Gets the default CUDA stream.
* `legacy_default_stream()`: Gets the legacy default CUDA stream.
* `per_thread_default_stream()`: Gets the per-thread default CUDA stream.
* `external_stream()`: Creates a stream object from an externally allocated stream.
```