### Standard Installation with Configure Script

Source: https://github.com/inducer/pycuda/blob/main/README_SETUP.txt

Use this method for a standard installation. Run `configure.py` with `--help` to list the available options, or pass the options you need, then build and install.

```bash
./configure.py --help
./configure.py --some-options
make
sudo make install
```

--------------------------------

### start_profiler

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Starts the CUDA profiler.

```APIDOC
## start_profiler()

### Description
Starts the CUDA profiler.
```

--------------------------------

### Basic CUDA Kernel Execution with PyCUDA

Source: https://github.com/inducer/pycuda/blob/main/doc/index.rst

This example compiles a simple CUDA kernel, defines input and output arrays, and executes the kernel on the GPU. It uses `pycuda.autoinit` for automatic context management and `pycuda.driver` for low-level CUDA operations. The `drv.In` and `drv.Out` wrappers copy the inputs to the device before the launch and copy the result back to the host afterwards.

```python
import pycuda.autoinit  # creates and activates a CUDA context on import
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule

# Each thread multiplies one pair of elements.
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
# One block of 400 threads: one thread per array element.
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

# Difference between the GPU result and the NumPy result.
print(dest-a*b)
```

--------------------------------

### InclusiveScanKernel Usage Example

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Demonstrates the usage of InclusiveScanKernel for performing a prefix sum operation on an array of integers.
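For orientation, an inclusive scan with the expression `a+b` yields running totals: output element *i* is the combination of input elements 0 through *i*. The host-only sketch below illustrates that definition with the standard library. The helper name `inclusive_scan` is hypothetical and is not part of PyCUDA.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """Return [v0, v0 op v1, v0 op v1 op v2, ...] (hypothetical host-side helper)."""
    return list(accumulate(values, op))


data = [3, 1, 4, 1, 5]
print(inclusive_scan(data))  # [3, 4, 8, 9, 14]
```

The GPU example in this section checks its result against `np.cumsum`, which computes the same running totals for addition.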
```Python
import pycuda.autoinit  # a CUDA context must exist before data is moved to the GPU
import pycuda.gpuarray as gpuarray
import numpy as np
from pycuda.scan import InclusiveScanKernel

knl = InclusiveScanKernel(np.int32, "a+b")

n = 2**20-2**18+5
host_data = np.random.randint(0, 10, n).astype(np.int32)
# to_gpu takes the host array directly; there is no command-queue argument in PyCUDA.
dev_data = gpuarray.to_gpu(host_data)

knl(dev_data)  # the scan result overwrites dev_data
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
```

--------------------------------

### Texture and Surface Declarations and Usage

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Example demonstrating the declaration and usage of texture and surface references in CUDA C++ with PyCUDA helper functions. The comments list the texture element types for other dtypes, and the kernel body shows the 3D fetch, write, and read helpers.

```c++
#include <pycuda-helpers.hpp>

texture<fp_tex_double, cudaTextureType3D, cudaReadModeElementType> my_tex; // complex128: fp_tex_cdouble
                                                                          // complex64 : fp_tex_cfloat
                                                                          // float64   : fp_tex_double
surface<void, cudaSurfaceType3D> my_surf; // Surfaces in 2D need 'cudaSurfaceType2DLayered'

__global__ void f()
{
  ...
  fp_tex3D(my_tex, i, j, k);
  fp_surf3Dwrite(myvar, my_surf, i, j, k, cudaBoundaryModeClamp); // fp extensions don't need width in bytes
  fp_surf3Dread(&myvar, my_surf, i, j, k, cudaBoundaryModeClamp);
  ...
}
```

--------------------------------

### Distutils Installation without Configure

Source: https://github.com/inducer/pycuda/blob/main/README_SETUP.txt

Install PyCUDA directly using Python's distutils. Configuration is read from default files in a specific order.

```bash
python setup.py build
sudo python setup.py install
```

--------------------------------

### ReductionKernel Usage Example

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Demonstrates a basic usage of ReductionKernel for calculating a dot product of two arrays.
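A reduction of this kind can be read as two steps: a map expression is evaluated for every index `i`, and the mapped values are then combined with a reduce expression, starting from a neutral element. The host-only sketch below models the dot product from this section with `x[i]*y[i]` as the map step and `a+b` as the reduce step. The function name `map_reduce` is hypothetical and is not part of PyCUDA.

```python
from functools import reduce


def map_reduce(xs, ys, neutral=0.0):
    """Host-side model of map_expr="x[i]*y[i]" combined with reduce_expr="a+b"."""
    mapped = (x * y for x, y in zip(xs, ys))            # map step
    return reduce(lambda a, b: a + b, mapped, neutral)  # reduce step


x = [float(i) for i in range(400)]
y = [float(i) for i in range(400)]
print(map_reduce(x, y))  # sum of i*i for i in 0..399
```

On the GPU the values are combined in parallel rather than left to right, and the example uses `float32`, so its result can differ slightly from this double-precision reference.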
```Python import pycuda.gpuarray as gpuarray import numpy from pycuda.reduction import ReductionKernel a = gpuarray.arange(400, dtype=numpy.float32) b = gpuarray.arange(400, dtype=numpy.float32) krnl = ReductionKernel(numpy.float32, neutral="0", reduce_expr="a+b", map_expr="x[i]*y[i]", arguments="float *x, float *y") my_dot_prod = krnl(a, b).get() ``` -------------------------------- ### Execute CUDA Kernel for Array Doubling Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst Shows how to get a compiled CUDA kernel function and execute it on device memory. This example doubles elements in arrays managed by the DoubleOpStruct wrapper. ```python func = mod.get_function("double_array") func(struct_arr, block = (32, 1, 1), grid=(2, 1)) print("doubled arrays", array1, array2) ``` -------------------------------- ### Device.name() Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Get the name of the CUDA device. ```APIDOC ## Device.name() ### Description Get the name of the CUDA device. ### Returns - str: The name of the device. ``` -------------------------------- ### Vector Addition with Jinja 2 Templating Source: https://github.com/inducer/pycuda/blob/main/doc/metaprog.rst Generates CUDA C code for vector addition using Jinja 2 templating. This allows for dynamic configuration of block sizes and data types at runtime. Ensure Jinja 2 is installed (`pip install Jinja2`). 
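The template in this section unrolls a loop so that each thread handles `block_size` elements, spaced `thread_block_size` apart. A host-only model of the index arithmetic, written as a sketch to check that the generated kernel touches every element exactly once, is shown below. The function name `covered_indices` and the small example sizes are assumptions for illustration.

```python
def covered_indices(num_blocks, thread_block_size, block_size):
    """Indices written by the generated kernel (hypothetical host-side model)."""
    out = []
    for block_idx in range(num_blocks):
        for thread_idx in range(thread_block_size):
            # Same base index as in the template.
            idx = thread_idx + thread_block_size * block_size * block_idx
            for i in range(block_size):  # this loop is unrolled by the template
                out.append(idx + i * thread_block_size)
    return out


indices = covered_indices(num_blocks=2, thread_block_size=4, block_size=3)
print(sorted(indices) == list(range(2 * 4 * 3)))  # True
```

With this layout, a launch needs one thread block per `thread_block_size * block_size` elements, and the vector length should be a multiple of that product.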
```python
import pycuda.autoinit  # creates a CUDA context so the module can be compiled and loaded
from pycuda.compiler import SourceModule
from jinja2 import Template

# Example values (assumptions; the original snippet leaves these undefined).
block_size = 4            # elements handled by each thread
thread_block_size = 32    # threads per block

tpl = Template("""
    __global__ void add(
            {{ type_name }} *tgt,
            {{ type_name }} *op1,
            {{ type_name }} *op2)
    {
      int idx = threadIdx.x +
        {{ thread_block_size }} * {{block_size}}
        * blockIdx.x;

      {% for i in range(block_size) %}
          {% set offset = i*thread_block_size %}
          tgt[idx + {{ offset }}] =
            op1[idx + {{ offset }}]
            + op2[idx + {{ offset }}];
      {% endfor %}
    }""")

rendered_tpl = tpl.render(
    type_name="float", block_size=block_size,
    thread_block_size=thread_block_size)

mod = SourceModule(rendered_tpl)
```

--------------------------------

### prepare

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Prepares the invocation of a kernel by setting up argument types and texture references.

```APIDOC
## prepare

### Description
This method configures the kernel for invocation by specifying the expected types of its arguments and registering any texture references that will be used. Because the argument types are known in advance, subsequent calls avoid the per-call type guessing of the regular call interface.

### Parameters
* `arg_types` (iterable): An iterable containing type characters understood by the :mod:`struct` module or :class:`numpy.dtype` objects. 'F' and 'D' are understood for single- and double-precision floating point numbers, respectively.
* `shared` (int, optional): The number of bytes available for *extern __shared__* arrays.
* `texrefs` (list, optional): A list of :class:`TextureReference` objects to be registered for use with this function. These references will be bound at invocation time.

### Returns
* self: Returns the instance itself, allowing for method chaining.
```

--------------------------------

### Import and Initialize PyCUDA

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Import the necessary PyCUDA modules and initialize the driver. `pycuda.autoinit` handles context creation and cleanup automatically.
```python import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule ``` -------------------------------- ### pycuda.gl.autoinit Source: https://github.com/inducer/pycuda/blob/main/doc/gl.rst Provides automatic initialization for OpenGL interoperability. ```APIDOC ## pycuda.gl.autoinit ### Description Importing this module will attempt to automatically initialize OpenGL interoperability. ### Warning Importing :mod:`pycuda.gl.autoinit` will fail with a rather unhelpful error message if you don't already have a GL context created and active. ### Data * **device**: The automatically initialized CUDA device. * **context**: The automatically initialized CUDA context. ``` -------------------------------- ### Create and Transfer Host Data to Device Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst Create a NumPy array on the host, ensure it's single-precision float, allocate memory on the device, and transfer the data. ```python import numpy a = numpy.random.randn(4,4) a = a.astype(numpy.float32) a_gpu = cuda.mem_alloc(a.nbytes) cuda.memcpy_htod(a_gpu, a) ``` -------------------------------- ### Device.compute_capability() Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Get the compute capability of the CUDA device. ```APIDOC ## Device.compute_capability() ### Description Get the compute capability of the CUDA device. ### Returns - tuple: A tuple of two integers (major, minor) representing the compute capability. ``` -------------------------------- ### Create and Fill Managed Memory Array Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Demonstrates creating a managed memory array using `managed_empty` and filling it with data on the host. 
```python from pycuda.autoinit import context import pycuda.driver as cuda import numpy as np a = cuda.managed_empty(shape=10, dtype=np.float32, mem_flags=cuda.mem_attach_flags.GLOBAL) a[:] = np.linspace(0, 9, len(a)) # Fill array on host ``` -------------------------------- ### Device.pci_bus_id() Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Get the PCI bus ID of the CUDA device. CUDA 4.1 and newer. ```APIDOC ## Device.pci_bus_id() ### Description Get the PCI bus ID of the CUDA device. Available in CUDA 4.1 and newer. ### Returns - str: The PCI bus ID of the device. ``` -------------------------------- ### pycuda.autoprimaryctx Source: https://github.com/inducer/pycuda/blob/main/doc/util.rst Similar to pycuda.autoinit, but retains the device's primary context instead of creating a new one. ```APIDOC ## Module: pycuda.autoprimaryctx ### Description Similar to :mod:`pycuda.autoinit`, but retains the device primary context instead of creating a new context. It also has ``device`` and ``context`` attributes. ### Attributes * **device** (:class:`pycuda.driver.Device`): The device associated with the primary context. * **context** (:class:`pycuda.driver.Context`): The retained primary context. ``` -------------------------------- ### pycuda.tools.get_default_device Source: https://github.com/inducer/pycuda/blob/main/doc/util.rst Deprecated function to get a default CUDA device, use make_default_context instead. ```APIDOC ## Function: pycuda.tools.get_default_device(default=0) ### Description Deprecated. Use :func:`pycuda.tools.make_default_context`. Returns a :class:`pycuda.driver.Device` instance chosen based on environment variables, configuration files, or a default value. ### Rules for Device Selection 1. If the environment variable ``CUDA_DEVICE`` is set, its integer value is used. 2. If the file :file:`.cuda-device` exists in the user's home directory, its integer content is used. 3. 
Otherwise, the `default` parameter is used as the device number. ### Parameters * **default** (int): The default device number to use if other methods fail. Defaults to 0. ### Returns - :class:`pycuda.driver.Device`: The selected CUDA device. ``` -------------------------------- ### pycuda.autoinit Source: https://github.com/inducer/pycuda/blob/main/doc/util.rst Automatically initializes CUDA and creates a compute context upon import. It provides access to the initialized device and context. ```APIDOC ## Module: pycuda.autoinit ### Description Automatically performs all steps necessary to get CUDA ready for submission of compute kernels by creating a compute context using `pycuda.tools.make_default_context`. ### Attributes * **device** (:class:`pycuda.driver.Device`): The device used for automatic initialization. * **context** (:class:`pycuda.driver.Context`): A default-constructed context on the initialized device. ``` -------------------------------- ### init(flags=0) Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Initialize CUDA. This must be called before any other function in this module. See also pycuda.autoinit. ```APIDOC ## init(flags=0) ### Description Initialize CUDA. This must be called before any other function in this module. ### Parameters #### Query Parameters - **flags** (int) - Optional - Flags for initialization. Defaults to 0. ``` -------------------------------- ### Get CURAND Version Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst Retrieves the version of the CURAND library PyCUDA was compiled against. Returns a 3-tuple of integers. ```python pycuda.curandom.get_curand_version() ``` -------------------------------- ### initialize_profiler Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Initializes the CUDA profiler with a configuration file, output file, and output mode. 
```APIDOC ## initialize_profiler(config_file, output_file, output_mode) ### Description *output_mode* is one of the attributes of :class:`profiler_output_mode`. ### Parameters #### Path Parameters - **config_file** (str) - Required - Path to the profiler configuration file. - **output_file** (str) - Required - Path to the output file for profiler data. - **output_mode** (profiler_output_mode) - Required - The output mode for the profiler. ``` -------------------------------- ### Kernel Invocation Interface Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst A convenience interface for invoking CUDA kernels. It allows direct passing of arguments and handles setup automatically. It can be slower than prepared calls due to argument type guessing. ```APIDOC ## Kernel Invocation ### Description This method provides a high-level interface for launching CUDA kernels with specified grid and block dimensions, along with kernel arguments. It handles the transfer of arguments and execution, returning the execution time if requested. ### Parameters * `grid` (tuple): A tuple of up to three integers specifying the number of thread blocks in the multi-dimensional grid. * `stream` (pycuda.driver.Stream, optional): A Stream instance to serialize the copying of input arguments, execution, and copying of output arguments. * `shared` (int): The number of bytes available to the kernel in *extern __shared__* arrays. * `texrefs` (list): A list of :class:`TextureReference` instances that the function will have access to. * `time_kernel` (bool): If True, the function returns the number of seconds spent executing the kernel. * `*args`: Variable number of arguments to be passed to the kernel. Supported types include subclasses of :class:`numpy.number`, :class:`DeviceAllocation` instances, instances of :class:`ArgumentHandler` subclasses, objects supporting the Python buffer protocol, and :class:`~pycuda.gpuarray.GPUArray` instances. 
### Returns * None or float: Returns the number of seconds spent executing the kernel if `time_kernel` is True, otherwise None. ``` -------------------------------- ### pycuda.gl.init Source: https://github.com/inducer/pycuda/blob/main/doc/gl.rst Enables GL interoperability for an existing CUDA context (deprecated). ```APIDOC ## pycuda.gl.init() ### Description Enable GL interoperability for the already-created (so far non-GL) and currently active :class:`pycuda.driver.Context`. ### Warning This function is deprecated since CUDA 3.0 and PyCUDA 2011.1. ### Warning This will fail with a rather unhelpful error message if you don't already have a GL context created and active. ``` -------------------------------- ### module_from_buffer Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Creates a Module by loading PTX or CUBIN code from a buffer. ```APIDOC ## module_from_buffer(buffer, options=[], message_handler=None) ### Description Creates a `Module` by loading PTX or CUBIN code from a buffer. Supports custom options and message handlers for PTX compilation (CUDA 2.1+). ### Parameters - **buffer** (bytes-like object): The PTX or CUBIN code. - **options** (list of tuples, optional): JIT compilation options. - **message_handler** (callable, optional): A function to process compiler messages. ``` -------------------------------- ### Prepared Kernel Call for Reduced Overhead Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst Prepare a kernel for invocation with specific argument types to reduce overhead compared to the standard __call__ method. ```python grid = (1, 1) block = (4, 4, 1) func.prepare("P") func.prepared_call(grid, block, a_gpu) ``` -------------------------------- ### module_from_file Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Creates a Module by loading a CUBIN file from the specified filename. 
```APIDOC ## module_from_file(filename) ### Description Creates a `Module` by loading a CUBIN file from the specified filename. ### Parameters - **filename** (string) - The path to the CUBIN file. ``` -------------------------------- ### prepared_call Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Invokes a prepared kernel with specified grid and block dimensions and arguments. ```APIDOC ## prepared_call ### Description Invokes a kernel that has been previously prepared using the :meth:`prepare` method. It uses the configured argument types and texture references for execution. ### Parameters * `grid` (tuple): A tuple specifying the grid dimensions (number of thread blocks). * `block` (tuple): A tuple specifying the block dimensions (number of threads per block). * `*args`: Variable arguments to be passed to the kernel. These should match the types configured during preparation. * `shared_size` (int, optional): The size of shared memory to allocate for the kernel. Defaults to 0. ``` -------------------------------- ### ReductionKernel with Output Specification Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst Shows how to specify an output array for the ReductionKernel, useful for accumulating results or storing them in specific locations. ```Python from pycuda.curandom import rand as curand a = curand((10, 200), dtype=np.float32) red = ReductionKernel(np.float32, neutral=0, reduce_expr="a+b", arguments="float *in") a_sum = gpuarray.empty(10, dtype=np.float32) for i in range(10): red(a[i], out=a_sum[i]) assert(np.allclose(a_sum.get(), a.get().sum(axis=1))) ``` -------------------------------- ### launch_grid_async Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Launches a grid of thread blocks asynchronously, sequenced by a stream. Deprecated. ```APIDOC ## launch_grid_async(width, height, stream) ### Description Launch a width*height grid of thread blocks of *self*, sequenced by the :class:`Stream` *stream*. .. 
warning:: Deprecated as of version 2011.1. ### Parameters #### Path Parameters - **width** (int) - Required - The width of the grid. - **height** (int) - Required - The height of the grid. - **stream** (Stream) - Required - The stream to sequence the launch by. ``` -------------------------------- ### Context Flags Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Constants for configuring CUDA contexts using :meth:`Device.make_context`. ```APIDOC ## Context Flags (`ctx_flags`) ### Description Flags for :meth:`Device.make_context`. CUDA 2.0 and above only. ### Flags - **SCHED_AUTO**: Scheduling mode: yield if more contexts than processors, otherwise spin. - **SCHED_SPIN**: Scheduling mode: spin while waiting for CUDA calls to complete. - **SCHED_YIELD**: Scheduling mode: yield to other threads while waiting. - **SCHED_MASK**: Mask of valid scheduling flags. - **SCHED_BLOCKING_SYNC**: Use blocking synchronization (CUDA 2.2+). - **MAP_HOST**: Support mapped pinned allocations (CUDA 2.2+). - **LMEM_RESIZE_TO_MAX**: Keep local memory allocation after launch (CUDA 3.2+). - **FLAGS_MASK**: Mask of valid flags. ``` -------------------------------- ### launch_grid Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Launches a grid of thread blocks for the function. Deprecated. ```APIDOC ## launch_grid(width, height) ### Description Launch a width*height grid of thread blocks of *self*. .. warning:: Deprecated as of version 2011.1. ### Parameters #### Path Parameters - **width** (int) - Required - The width of the grid. - **height** (int) - Required - The height of the grid. ``` -------------------------------- ### Execute GPU Kernel with Managed Memory Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Shows how to compile and execute a simple GPU kernel that modifies a managed memory array, followed by host-side computation. 
```python
from pycuda.compiler import SourceModule

# Kernel source reconstructed for this example: each thread doubles one element.
mod = SourceModule("""
__global__ void doublify(float *a)
{
    a[threadIdx.x] *= 2;
}
""")
doublify = mod.get_function("doublify")
doublify(a, grid=(1,1), block=(len(a),1,1))
context.synchronize() # Wait for kernel completion before host access

median = np.median(a) # Computed on host!
```

--------------------------------

### prepared_timed_call

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Invokes a prepared kernel and returns a callable to query the GPU execution time.

```APIDOC
## prepared_timed_call

### Description
Invokes a prepared kernel and provides a mechanism to measure its execution time on the GPU. It returns a callable that, when invoked, blocks until the kernel completes and returns the elapsed time in seconds.

### Parameters
* `grid` (tuple): A tuple specifying the grid dimensions (number of thread blocks).
* `block` (tuple): A tuple specifying the block dimensions (number of threads per block).
* `*args`: Variable arguments to be passed to the kernel. These should match the types configured during preparation.
* `shared_size` (int, optional): The size of shared memory to allocate for the kernel. Defaults to 0.

### Returns
* callable: A 0-ary callable that, when called, returns the GPU time consumed by the kernel invocation in seconds.
```

--------------------------------

### prepared_async_call

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously invokes a prepared kernel on a specified stream.

```APIDOC
## prepared_async_call

### Description
Asynchronously launches a prepared kernel onto a specified CUDA stream. If no stream is provided, it behaves like :meth:`prepared_call`. This allows for overlapping kernel execution with data transfers or other operations.

### Parameters
* `grid` (tuple): A tuple specifying the grid dimensions (number of thread blocks).
* `block` (tuple): A tuple specifying the block dimensions (number of threads per block).
* `stream` (:class:`pycuda.driver.Stream`): The CUDA stream on which to launch the kernel. If None, the default stream is used. * `*args`: Variable arguments to be passed to the kernel. These should match the types configured during preparation. * `shared_size` (int, optional): The size of shared memory to allocate for the kernel. Defaults to 0. ``` -------------------------------- ### take Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst Returns elements from an array at specified indices. ```APIDOC ## take(a, indices, stream=None) ### Description Returns the GPUArray ``[a[indices[0]], ..., a[indices[n]]]``. ### Parameters - **a**: The input array. Must be a type that can be bound to a texture. - **indices**: The indices of the elements to take. - **stream**: The CUDA stream to use for the operation. ``` -------------------------------- ### Device Methods Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Methods for retrieving information about and managing CUDA devices. ```APIDOC ## Device Methods ### Description Methods for retrieving information about and managing CUDA devices. ### Methods - **compute_capability()**: Return a 2-tuple indicating the compute capability version of this device. - **total_memory()**: Return the total amount of memory on the device in bytes. - **get_attribute(attr)**: Return the (numeric) value of the attribute *attr*. - **get_attributes()**: Return all device attributes in a :class:`dict`. - **make_context(flags=ctx_flags.SCHED_AUTO)**: Create a :class:`Context` on this device. - **retain_primary_context()**: Return the :class:`Context` obtained by retaining the device's primary context. - **can_access_peer(dev)**: Check if peer access is possible between devices (CUDA 4.0+). - **__hash__()**: Returns the hash of the device object. - **__eq__()**: Checks for equality between two device objects. - **__ne__()**: Checks for inequality between two device objects. 
```

--------------------------------

### empty_like

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates an uninitialized GPUArray with the same properties as another array.

```APIDOC
## empty_like(other_ary, dtype=None, order="K")

### Description
Creates a new, uninitialized GPUArray having the same properties as *other_ary*. The *dtype* and *order* attributes allow these aspects to be set independently of their values in *other_ary*.

### Parameters
- **other_ary**: The array whose properties will be used.
- **dtype**: The data type for the new array (default: None, inferred from other_ary).
- **order**: The memory layout ('C', 'F', 'A', or 'K', default: 'K'). 'K' tries to match the strides of *other_ary* as closely as possible.
```

--------------------------------

### Compile and Execute CUDA Kernel

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Define a CUDA C kernel to double array elements, compile it using SourceModule, and execute it on the device.

```python
# The kernel is written for a single 4x4 block: idx enumerates its 16 threads.
mod = SourceModule("""
  __global__ void doublify(float *a)
  {
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
  }
  """)

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))
```

--------------------------------

### pycuda.gl.make_context

Source: https://github.com/inducer/pycuda/blob/main/doc/gl.rst

Creates and returns a PyCUDA driver Context with GL interoperability enabled. Requires an existing active GL context.

```APIDOC
## pycuda.gl.make_context(dev, flags=0)

### Description
Create and return a :class:`pycuda.driver.Context` that has GL interoperability enabled.

### Parameters
* **dev** - The CUDA device to create the context on.
* **flags** - Optional flags for context creation. Defaults to 0.

### Warning
This will fail with a rather unhelpful error message if you don't already have a GL context created and active.
``` -------------------------------- ### launch Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Launches a single thread block of the function. Deprecated. ```APIDOC ## launch() ### Description Launch a single thread block of *self*. .. warning:: Deprecated as of version 2011.1. ``` -------------------------------- ### Module Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Represents a loaded CUBIN module on the device. Allows retrieval of functions, global variables, and texture/surface references. ```APIDOC ## Module ### Description Represents a loaded CUBIN module on the device. Allows retrieval of functions, global variables, and texture/surface references. ### Methods - `get_function(name)`: Returns a `Function` handle by name. - `get_global(name)`: Returns the device address and size of a global variable. - `get_texref(name)`: Returns a `TextureReference` handle by name. - `get_surfref(name)`: Returns a `SurfaceReference` handle by name (CUDA 3.1+). ``` -------------------------------- ### Context Methods Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst Methods for managing CUDA contexts, which represent processes on a compute device. ```APIDOC ## Context Methods ### Description Methods for managing CUDA contexts, which represent processes on a compute device. ### Methods - **detach()**: Decrease the reference count on this context. If the reference count hits zero, the context is deleted. - **push()**: Make *self* the active context, pushing it on top of the context stack (CUDA 2.0+). - **pop()**: Remove any context from the top of the context stack, deactivating it (CUDA 2.0+). - **get_device()**: Return the device that the current context is working on. - **synchronize()**: Wait for all activity in the current context to cease. - **set_limit(limit, value)**: Set a context limit (CUDA 3.1+). - **get_limit(limit)**: Get a context limit (CUDA 3.1+). 
- **set_cache_config(cc)**: Set the cache configuration (CUDA 3.2+).
- **get_cache_config()**: Get the cache configuration (CUDA 3.2+).
- **set_shared_config(sc)**: Set the shared memory configuration (CUDA 4.2+).
- **get_shared_config()**: Get the shared memory configuration (CUDA 4.2+).
- **get_api_version()**: Return an integer API version number (CUDA 3.2+).
- **enable_peer_access(peer, flags=0)**: Enable peer access between contexts (CUDA 4.0+).
- **disable_peer_access(peer, flags=0)**: Disable peer access between contexts (CUDA 4.0+).
```

--------------------------------

### ones_like

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray initialized with ones, having the same properties as another array.

```APIDOC
## ones_like(other_ary, dtype=None, order="K")

### Description
Creates a new, ones-initialized GPUArray having the same properties as *other_ary*. The *dtype* and *order* attributes allow these aspects to be set independently of their values in *other_ary*.

### Parameters
- **other_ary**: The array whose properties will be used.
- **dtype**: The data type for the new array (default: None, inferred from other_ary).
- **order**: The memory layout ('C', 'F', 'A', or 'K', default: 'K'). 'K' tries to match the strides of *other_ary* as closely as possible.

### Version Added
2017.2
```

--------------------------------

### memcpy_dtoa, memcpy_atod, memcpy_htoa, memcpy_atoh, memcpy_atoa

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Functions for copying data between device arrays and host/device memory.

```APIDOC
## memcpy_dtoa, memcpy_atod, memcpy_htoa, memcpy_atoh, memcpy_atoa

### Description
These functions transfer data between CUDA arrays (`ary`) and host memory or linear device memory. In the function names, `h` stands for host memory, `d` for linear device memory, and `a` for an array, so `dtoa` reads as "device to array".

* `memcpy_dtoa(ary, index, src, len)`: Copies *len* bytes from device memory (*src*) to a device array (*ary*) at a given *index*.
* `memcpy_atod(dest, ary, index, len)`: Copies *len* bytes from a device array (*ary*) at a given *index* to device memory (*dest*).
* `memcpy_htoa(ary, index, src)`: Copies from host memory (*src*) to a device array (*ary*) at a given *index*.
* `memcpy_atoh(dest, ary, index)`: Copies from a device array (*ary*) at a given *index* to host memory (*dest*).
* `memcpy_atoa(dest, dest_index, src, src_index, len)`: Copies *len* bytes between two device arrays with specified source and destination indices.

### Parameters
* **ary** (Array) - The PyCUDA Array object.
* **index** (integer) - The starting index for the transfer within the array.
* **src** (pointer or buffer) - The source memory.
* **dest** (pointer or buffer) - The destination memory.
* **len** (integer) - The number of bytes to transfer.
* **dest_index** (integer) - The starting index for the destination array.
* **src_index** (integer) - The starting index for the source array.
```

--------------------------------

### Define and Use a CUDA Struct Wrapper

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Demonstrates creating a Python wrapper for a CUDA struct to manage device memory and data copying. This is useful for passing structured data to CUDA kernels.
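Before the wrapper code, it may help to see the byte layout that `struct.pack("ixP", ...)` produces, since the wrapper copies exactly those bytes to the device. The sketch below is host-only and uses just the standard `struct` module. The device-side struct is not shown in this section, so the field names in the comments (`datalen`, `ptr`) and the example values are assumptions for illustration.

```python
import struct

# Native-alignment format: a C int, one explicit pad byte, then a pointer.
# With native alignment, struct adds further padding so that the pointer
# starts on a pointer-aligned offset.
FMT = "ixP"

pointer_size = struct.calcsize("P")   # 8 on 64-bit platforms
record_size = struct.calcsize(FMT)    # int + padding + pointer

# Hypothetical values: 3 elements stored at a made-up device address.
packed = struct.pack(FMT, 3, 0x1000)
datalen, ptr = struct.unpack(FMT, packed)  # pad bytes produce no value

print(record_size, pointer_size, datalen, hex(ptr))
```

The record size equals 8 bytes plus one pointer, which is the same quantity the wrapper computes as `mem_size = 8 + numpy.intp(0).nbytes`.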
```python
import struct

import numpy
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as cuda


class DoubleOpStruct:
    # int plus padding (8 bytes), followed by one pointer
    mem_size = 8 + numpy.intp(0).nbytes

    def __init__(self, array, struct_arr_ptr):
        self.data = cuda.to_device(array)
        self.shape, self.dtype = array.shape, array.dtype
        # Pack the element count and the device pointer with native alignment.
        packed_args = struct.pack("ixP", array.size, int(self.data))
        cuda.memcpy_htod(struct_arr_ptr, packed_args)

    def __str__(self):
        return str(cuda.from_device(self.data, self.shape, self.dtype))


# Room for two structs; the second one starts mem_size bytes after the first.
struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size

array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)

print("original arrays", array1, array2)
```

--------------------------------

### zeros

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Allocates and initializes a GPUArray with zeros.

```APIDOC
## zeros(shape, dtype=np.float64, *, allocator=None, order="C")

### Description
Same as :func:`empty`, but the :class:`GPUArray` is zero-initialized before being returned.

### Parameters
* **shape**: The shape of the array.
* **dtype**: The data type of the array elements (defaults to np.float64).
* **allocator**: Optional allocator function.
* **order**: The memory order ('C' or 'F').

### Returns
- **GPUArray**: A GPUArray initialized with zeros.
```

--------------------------------

### memcpy_peer, memcpy_peer_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data between device pointers in different contexts.

```APIDOC
## memcpy_peer, memcpy_peer_async

### Description
Copies a specified number of bytes (*size*) between device pointers in potentially different CUDA contexts. The asynchronous version allows serialization via a CUDA stream.

### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Device Pointer or int) - The source device pointer.
* **size** (integer) - The number of bytes to copy.
* **dest_context** (CUDA Context, optional) - The destination CUDA context.
* **src_context** (CUDA Context, optional) - The source CUDA context.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the asynchronous operation with.

### Note
Requires CUDA 4.0 and above.

### Version Added
2011.1
```

--------------------------------

### memset_d8_async, memset_d16_async, memset_d32_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously fills device memory arrays with a specified data value.

```APIDOC
## memset_d8_async, memset_d16_async, memset_d32_async

### Description
Asynchronously fills a device memory array with the specified *data* value. Operations can optionally be serialized via a CUDA stream.

### Parameters
* **dest** (Device Pointer) - The destination device pointer.
* **data** (value) - The data value to fill the array with.
* **count** (integer) - The number of elements to fill (not bytes).
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the operation with.

### Note
*count* specifies the number of elements, not bytes.

### Version Added
2015.1
```

--------------------------------

### Floating Point Assembly

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Assembles floating-point numbers from significands and exponents. Computes `significand * 2**exponent`.

```python
pycuda.cumath.ldexp(significand, exponent, stream=None)
```

--------------------------------

### memcpy_htod_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously copies data from a Python buffer to a device pointer.

```APIDOC
## memcpy_htod_async

### Description
Asynchronously copies data from a Python buffer (*src*) to a device pointer (*dest*). The size of the copy is determined by the size of the Python buffer. The operation can optionally be serialized via a CUDA stream.
### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Python Buffer) - The source Python buffer object.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the operation with.

### Note
*src* must be page-locked memory.

### Version Added
0.93
```

--------------------------------

### Transfer Data from Device to Host

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Retrieve the modified data from the GPU back to a NumPy array on the host and print both the original and doubled arrays.

```python
# `a` (host array) and `a_gpu` (device allocation) come from the earlier tutorial steps.
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
```

--------------------------------

### ones

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray filled with ones. It is similar to the empty function but initializes the array with ones.

```APIDOC
## ones(shape, dtype=np.float64, *, allocator=None, order="C")

### Description
Creates a GPUArray filled with ones. It is similar to the empty function but initializes the array with ones.

### Parameters
- **shape**: The dimensions of the array.
- **dtype**: The data type of the array elements (default: np.float64).
- **allocator**: Optional custom allocator.
- **order**: The memory layout of the array ('C' or 'F', default: 'C').
```

--------------------------------

### Function

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Represents a kernel function in a loaded module. Allows launching the kernel with specified arguments, block size, grid size, and stream.

```APIDOC
## Function

### Description
Represents a kernel function in a loaded module. Allows launching the kernel with specified arguments, block size, grid size, and stream.

### Methods
- `__call__(arg1, ..., argn, block=block_size, grid=(1,1), stream=None, shared=0, texrefs=[], time_kernel=False)`: Launches the kernel. `block` must be a 3-tuple of integers.
  `arg1` through `argn` are the positional C arguments to the kernel.
```

--------------------------------

### zeros_like

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray initialized with zeros, having the same properties as another array.

```APIDOC
## zeros_like(other_ary, dtype=None, order="K")

### Description
Creates a new, zero-initialized GPUArray having the same properties as *other_ary*. The *dtype* and *order* attributes allow these aspects to be set independently of their values in *other_ary*.

### Parameters
- **other_ary**: The array whose properties will be used.
- **dtype**: The data type for the new array (default: None, inferred from other_ary).
- **order**: The memory layout ('C', 'F', 'A', or 'K', default: 'K'). 'K' tries to match the strides of *other_ary* as closely as possible.
```

--------------------------------

### memcpy_htod

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data from a Python buffer to a device pointer.

```APIDOC
## memcpy_htod

### Description
Copies data from a Python buffer (*src*) to a device pointer (*dest*). The size of the copy is determined by the size of the Python buffer.

### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Python Buffer) - The source Python buffer object.
```

--------------------------------

### memcpy_dtod, memcpy_dtod_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data between device pointers.

```APIDOC
## memcpy_dtod, memcpy_dtod_async

### Description
Copies a specified number of bytes (*size*) from one device pointer (*src*) to another (*dest*). The asynchronous version allows serialization via a CUDA stream.

### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Device Pointer or int) - The source device pointer.
* **size** (integer) - The number of bytes to copy.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the asynchronous operation with.

### Note
Requires CUDA 3.0 and above.

### Version Added
0.94
```

--------------------------------

### memcpy_dtoh_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously copies data from a device pointer to a Python buffer.

```APIDOC
## memcpy_dtoh_async

### Description
Asynchronously copies data from a device pointer (*src*) to a Python buffer (*dest*). The size of the copy is determined by the size of the Python buffer. The operation can optionally be serialized via a CUDA stream.

### Parameters
* **dest** (Python Buffer) - The destination Python buffer object.
* **src** (Device Pointer or int) - The source device pointer.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the operation with.

### Note
*dest* must be page-locked memory.

### Version Added
0.93
```

--------------------------------

### Floating Point Decomposition and Assembly

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Functions for decomposing floating-point numbers into their components and assembling them, including fmod, frexp, ldexp, and modf.

```APIDOC
## Floating Point Decomposition and Assembly

### Description
Functions for decomposing floating-point numbers into their components and assembling them.

### Functions
- `fmod(arg, mod, stream=None)`: Return the floating point remainder of the division `arg/mod`.
- `frexp(arg, stream=None)`: Return a tuple `(significands, exponents)` such that `arg == significand * 2**exponent`.
- `ldexp(significand, exponent, stream=None)`: Return a new array of floating point values composed from `significand` and `exponent` as `result = significand * 2**exponent`.
- `modf(arg, stream=None)`: Return a tuple `(fracpart, intpart)` of arrays containing the fractional and integer parts of `arg`.

### Parameters
- **arg**: The input array for floating-point operations.
- **mod**: The modulus for `fmod`.
- **significand**: The significand part for `ldexp`.
- **exponent**: The exponent part for `ldexp`.
- **stream**: Optional CUDA stream for asynchronous execution.
```

--------------------------------

### any

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Checks if any element in the array is true.

```APIDOC
## any(a, stream=None, allocator=None)

### Description
Checks if any element in the array is true.

### Parameters
- **a**: The input array.
- **stream**: The CUDA stream to use for the operation.
- **allocator**: Optional custom allocator.
```

--------------------------------

### SourceModule Class

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Creates a pycuda.driver.Module from CUDA source code. It handles compilation using nvcc and provides options for customization.

```APIDOC
## class:: SourceModule(source, nvcc="nvcc", options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])

### Description
Create a :class:`pycuda.driver.Module` from the CUDA source code *source*. The Nvidia compiler *nvcc* is assumed to be on the ``PATH`` if no path to it is specified, and is invoked with *options* to compile the code. If *keep* is *True*, the compiler output directory is kept, and a line indicating its location in the file system is printed for debugging purposes.

Unless *no_extern_c* is *True*, the given source code is wrapped in *extern "C" { ... }* to prevent C++ name mangling.

`arch` and `code` specify the values to be passed for the ``-arch`` and ``-code`` options on the :program:`nvcc` command line. If `arch` is `None`, it defaults to the current context's device's compute capability. If `code` is `None`, it will not be specified.

`cache_dir` gives the directory used for compiler caching. If `None` then `cache_dir` is taken to be ``PYCUDA_CACHE_DIR`` if set or a sensible per-user default. If passed as `False`, caching is disabled.
If the environment variable ``PYCUDA_DISABLE_CACHE`` is set to any value then caching is disabled. This preference overrides any value of `cache_dir` and can be used to disable caching globally.

This class exhibits the same public interface as :class:`pycuda.driver.Module`, but does not inherit from it.

*Change note:* :class:`SourceModule` was moved from :mod:`pycuda.driver` to :mod:`pycuda.compiler` in version 0.93.
```

--------------------------------

### arange

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray filled with numbers spaced apart by a given step.

```APIDOC
## arange(start, stop, step, dtype=None, stream=None)

### Description
Creates a GPUArray filled with numbers spaced `step` apart, starting from `start` and ending at `stop`.

### Parameters
- **start**: The starting value of the sequence.
- **stop**: The end value of the sequence.
- **step**: The spacing between values.
- **dtype**: The data type of the array elements (default: None, inferred from start, stop, and step).
- **stream**: The CUDA stream to use for the operation.
```

--------------------------------

### compile Function

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Compiles CUDA source code and returns the resulting cubin file as a string without uploading it to the GPU.

```APIDOC
## function:: compile(source, nvcc="nvcc", options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])

### Description
Perform the same compilation as the corresponding :class:`SourceModule` constructor, but only return the resulting *cubin* file as a string. In particular, do not upload the code to the GPU.
```

--------------------------------

### memcpy_dtoh

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data from a device pointer to a Python buffer.

```APIDOC
## memcpy_dtoh

### Description
Copies data from a device pointer (*src*) to a Python buffer (*dest*).
The size of the copy is determined by the size of the Python buffer.

### Parameters
* **dest** (Python Buffer) - The destination Python buffer object.
* **src** (Device Pointer or int) - The source device pointer.
```

--------------------------------

### stack

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Joins a sequence of arrays along a new axis.

```APIDOC
## stack(arrays, axis=0, allocator=None)

### Description
Joins a sequence of arrays along a new axis.

### Parameters
- **arrays**: A sequence of arrays to join.
- **axis**: The axis along which to join the arrays (default: 0).
- **allocator**: Optional custom allocator.
```

--------------------------------

### Device(number | pci_bus_id)

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

A handle to the *number*'th CUDA device or a device specified by its PCI bus ID. See also pycuda.autoinit.

```APIDOC
## Device(number | pci_bus_id)

### Description
A handle to a CUDA device. Can be constructed using the device number or its PCI bus ID.

### Parameters
- **number** (int) - The device number.
- **pci_bus_id** (str) - The PCI bus ID of the device (CUDA 4.1 and newer).
```

--------------------------------

### Using GPUArray for Simplified Operations

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Utilize GPUArray for high-level operations, including data transfer, arithmetic, and retrieval of results.

```python
import numpy
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray

# Upload a random 4x4 float32 matrix, double it on the GPU, and fetch the result.
a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print(a_doubled)
print(a_gpu)
```

--------------------------------

### Memcpy2D Class

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Provides methods to configure and perform two-dimensional memory copies.

```APIDOC
## Memcpy2D Class

### Description
The `Memcpy2D` class allows for the configuration and execution of two-dimensional memory transfers.
It provides attributes to define offsets, pitch, and methods to set the source and destination of the copy.

### Attributes
* **src_x_in_bytes** (integer) - X offset of the origin of the copy (initialized to 0).
* **src_y** (integer) - Y offset of the origin of the copy (initialized to 0).
* **src_pitch** (integer) - Size of a row in bytes at the origin of the copy.
* **dst_x_in_bytes** (integer) - X offset of the destination of the copy (initialized to 0).
* **dst_y** (integer) - Y offset of the destination of the copy (initialized to 0).

### Methods
* **set_src_host(buffer)**: Sets the *buffer* (Python object adhering to buffer interface) as the origin of the copy.
* **set_src_array(array)**: Sets the :class:`Array` *array* as the origin of the copy.
* **set_src_device(devptr)**: Sets the device address *devptr* (an :class:`int` or :class:`DeviceAllocation`) as the origin of the copy.
* **set_src_unified(buffer)**: Sets *buffer* as the origin, which can correspond to host or device memory (requires unified addressing, CUDA 4.0+).

### Version Added
2011.1 (for `set_src_unified`)
```
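
The offset and pitch attributes listed for `Memcpy2D` amount to plain byte arithmetic: row *r* of the copy starts at byte `(src_y + r) * src_pitch + src_x_in_bytes` in the source, and at the analogous position in the destination. The sketch below emulates that addressing on the host with Python's `memoryview`; it does not call PyCUDA and needs no GPU. The helper name `emulate_memcpy2d` is hypothetical, and the `dst_pitch`, `width_in_bytes`, and `height` arguments are assumed here to mirror the corresponding `Memcpy2D` attributes, which this section does not list.

```python
from array import array


def emulate_memcpy2d(src, dst, *, src_pitch, dst_pitch, width_in_bytes, height,
                     src_x_in_bytes=0, src_y=0, dst_x_in_bytes=0, dst_y=0):
    """Hypothetical host-only stand-in for a pitched 2D copy (no GPU involved)."""
    src_bytes = memoryview(src).cast("B")  # view both buffers as raw bytes
    dst_bytes = memoryview(dst).cast("B")
    for row in range(height):
        # Each row starts at (y offset + row) * pitch + x offset, all in bytes.
        s = (src_y + row) * src_pitch + src_x_in_bytes
        d = (dst_y + row) * dst_pitch + dst_x_in_bytes
        dst_bytes[d:d + width_in_bytes] = src_bytes[s:s + width_in_bytes]


# A 3x4 row-major matrix of C floats holding 0..11, and a 2x2 destination.
itemsize = array("f").itemsize
src = array("f", range(12))
dst = array("f", [0.0] * 4)

# Copy the 2x2 block whose top-left corner is row 1, column 2.
emulate_memcpy2d(
    src, dst,
    src_pitch=4 * itemsize,       # bytes per source row
    dst_pitch=2 * itemsize,       # bytes per destination row
    width_in_bytes=2 * itemsize,  # two floats per copied row
    height=2,
    src_x_in_bytes=2 * itemsize,
    src_y=1,
)
print(dst.tolist())  # [6.0, 7.0, 10.0, 11.0]
```

Note that the X offsets and the copy width are expressed in bytes while the Y offsets and the height count rows, which is why the example multiplies column counts by the item size but passes row counts unchanged.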