### Standard Installation with Configure Script

Source: https://github.com/inducer/pycuda/blob/main/README_SETUP.txt

Use this method for a standard installation. Run configure with help or specific options, then build and install.

```bash
./configure.py --help
./configure.py --some-options
make
sudo make install
```

--------------------------------

### start_profiler

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Starts the CUDA profiler.

```APIDOC
## start_profiler()

### Description
Starts the CUDA profiler.
```

--------------------------------

### Basic CUDA Kernel Execution with PyCUDA

Source: https://github.com/inducer/pycuda/blob/main/doc/index.rst

This example demonstrates compiling a simple CUDA kernel, defining input and output arrays, and executing the kernel on the GPU. It utilizes pycuda.autoinit for automatic context management and pycuda.driver for low-level CUDA operations. The data is copied to the device, the kernel is launched, and the result is copied back.

```python
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
  __global__ void multiply_them(float *dest, float *a, float *b)
  {
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
  }
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print(dest-a*b)
```

--------------------------------

### InclusiveScanKernel Usage Example

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Demonstrates the usage of InclusiveScanKernel for performing a prefix sum operation on an array of integers.

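```python
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
import numpy as np
from pycuda.scan import InclusiveScanKernel

# Inclusive prefix sum over int32 values; "a+b" is the binary scan operation.
knl = InclusiveScanKernel(np.int32, "a+b")

n = 2**20-2**18+5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = gpuarray.to_gpu(host_data)  # copy the host array to the GPU

knl(dev_data)  # the result overwrites dev_data
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()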
```
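For intuition about what the scan computes, the following host-only sketch models an inclusive scan in plain Python. It does not use PyCUDA: `inclusive_scan` is a hypothetical helper, and `operator.add` stands in for the `"a+b"` scan expression.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """Return running results; output element i combines values[0..i]."""
    return list(accumulate(values, op))


prefix_sums = inclusive_scan([3, 1, 4, 1, 5])
print(prefix_sums)
```

Element `i` of the output includes `values[i]` itself, which is what distinguishes an inclusive scan from an exclusive one.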

--------------------------------

### Texture and Surface Declarations and Usage

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Example demonstrating the declaration and usage of texture and surface references in CUDA C++ with PyCUDA helper functions. Includes examples for different data types and 3D operations.

```c++
#include <pycuda-helpers.hpp>

texture<fp_tex_double, 3, cudaReadModeElementType> my_tex; // complex128: fp_tex_cdouble
                                                               // complex64 : fp_tex_cfloat
                                                               // float64   : fp_tex_double
surface<void, 3, cudaReadModeElementType> my_surf;         // Surfaces in 2D need 'cudaSurfaceType2DLayered'

__global__ void f()
{
  ...
  fp_tex3D(my_tex, i, j, k);
  fp_surf3Dwrite(myvar, my_surf, i, j, k, cudaBoundaryModeClamp); // fp extensions don't need width in bytes
  fp_surf3Dread(&myvar, my_surf, i, j, k, cudaBoundaryModeClamp);
  ...
}
```

--------------------------------

### Distutils Installation without Configure

Source: https://github.com/inducer/pycuda/blob/main/README_SETUP.txt

Install PyCUDA directly using Python's distutils. Configuration is read from default files in a specific order.

```bash
python setup.py build
sudo python setup.py install
```

--------------------------------

### ReductionKernel Usage Example

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Demonstrates a basic usage of ReductionKernel for calculating a dot product of two arrays.

```python
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
import numpy
from pycuda.reduction import ReductionKernel

a = gpuarray.arange(400, dtype=numpy.float32)
b = gpuarray.arange(400, dtype=numpy.float32)

# map_expr is evaluated per element; reduce_expr combines the mapped values.
krnl = ReductionKernel(numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="float *x, float *y")

my_dot_prod = krnl(a, b).get()
```
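A host-only reference can be handy for checking the GPU result. The sketch below models the same map and reduce steps in plain Python; `reference_dot` is a hypothetical helper, not part of PyCUDA.

```python
from functools import reduce


def reference_dot(x, y, neutral=0):
    """Host-side model of the kernel: map to x[i]*y[i], then fold with a+b."""
    mapped = (xi * yi for xi, yi in zip(x, y))          # map_expr="x[i]*y[i]"
    return reduce(lambda a, b: a + b, mapped, neutral)  # reduce_expr="a+b"


expected = reference_dot(range(400), range(400))
print(expected)
```

Because the kernel accumulates in `float32`, the GPU value may differ slightly from this exact integer result, so a tolerance-based comparison is the safer check.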

--------------------------------

### Execute CUDA Kernel for Array Doubling

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Shows how to get a compiled CUDA kernel function and execute it on device memory. This example doubles elements in arrays managed by the DoubleOpStruct wrapper.

```python
func = mod.get_function("double_array")
func(struct_arr, block = (32, 1, 1), grid=(2, 1))
print("doubled arrays", array1, array2)
```

--------------------------------

### Device.name()

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Get the name of the CUDA device.

```APIDOC
## Device.name()

### Description
Get the name of the CUDA device.

### Returns
- str: The name of the device.
```

--------------------------------

### Vector Addition with Jinja 2 Templating

Source: https://github.com/inducer/pycuda/blob/main/doc/metaprog.rst

Generates CUDA C code for vector addition using Jinja 2 templating. This allows for dynamic configuration of block sizes and data types at runtime. Ensure Jinja 2 is installed (`pip install Jinja2`).

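```python
import pycuda.autoinit  # creates a CUDA context on import
from pycuda.compiler import SourceModule
from jinja2 import Template

# Example values, assumed here for illustration.
thread_block_size = 32  # threads per block
block_size = 4          # elements processed by each thread

tpl = Template("""
    __global__ void add(
            {{ type_name }} *tgt,
            {{ type_name }} *op1,
            {{ type_name }} *op2)
    {
      int idx = threadIdx.x +
        {{ thread_block_size }} * {{ block_size }}
        * blockIdx.x;

      {% for i in range(block_size) %}
          {% set offset = i*thread_block_size %}
          tgt[idx + {{ offset }}] =
            op1[idx + {{ offset }}]
            + op2[idx + {{ offset }}];
      {% endfor %}
    }""")

# Substitute the type and sizes, unrolling the per-thread loop in the template.
rendered_tpl = tpl.render(
    type_name="float", block_size=block_size,
    thread_block_size=thread_block_size)

mod = SourceModule(rendered_tpl)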
```
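The index arithmetic in the template can be checked on the host. This sketch replays the kernel's `idx` and `offset` computation in plain Python and confirms that every element is written exactly once; `covered_indices` and the sizes used are illustrative assumptions, not part of PyCUDA.

```python
def covered_indices(num_blocks, thread_block_size, block_size):
    """List every array index the generated kernel would write."""
    indices = []
    for block_idx in range(num_blocks):
        for thread_idx in range(thread_block_size):
            # Same base index as the kernel's `idx`.
            idx = thread_idx + thread_block_size * block_size * block_idx
            for i in range(block_size):
                # Same stride as the template's `offset`.
                indices.append(idx + i * thread_block_size)
    return indices


num_blocks, thread_block_size, block_size = 3, 4, 2  # small example sizes
indices = covered_indices(num_blocks, thread_block_size, block_size)
total = num_blocks * thread_block_size * block_size
print(sorted(indices) == list(range(total)))
```

Each thread block therefore handles `thread_block_size * block_size` consecutive elements, so the arrays are assumed to hold a multiple of that many entries.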

--------------------------------

### prepare

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Prepares the invocation of a kernel by setting up argument types and texture references.

```APIDOC
## prepare

### Description
This method configures the kernel for invocation by specifying the expected types of its arguments and registering any texture references that will be used. This allows for more efficient subsequent calls.

### Parameters
* `arg_types` (iterable): An iterable containing type characters understood by the :mod:`struct` module or :class:`numpy.dtype` objects. 'F' and 'D' are understood for single- and double-precision floating point numbers, respectively.
* `shared` (int, optional): The number of bytes available for *extern __shared__* arrays.
* `texrefs` (list, optional): A list of :class:`TextureReference` objects to be registered for use with this function. These references will be bound at invocation time.

### Returns
* self: Returns the instance itself, allowing for method chaining.
```

--------------------------------

### Import and Initialize PyCUDA

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Import the necessary PyCUDA modules and initialize the driver. `pycuda.autoinit` handles context creation and cleanup automatically.

```python
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
```

--------------------------------

### pycuda.gl.autoinit

Source: https://github.com/inducer/pycuda/blob/main/doc/gl.rst

Provides automatic initialization for OpenGL interoperability.

```APIDOC
## pycuda.gl.autoinit

### Description
Importing this module will attempt to automatically initialize OpenGL interoperability.

### Warning
Importing :mod:`pycuda.gl.autoinit` will fail with a rather unhelpful error message if you don't already have a GL context created and active.

### Data
* **device**: The automatically initialized CUDA device.
* **context**: The automatically initialized CUDA context.
```

--------------------------------

### Create and Transfer Host Data to Device

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Create a NumPy array on the host, ensure it's single-precision float, allocate memory on the device, and transfer the data.

```python
import numpy
a = numpy.random.randn(4,4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
```

--------------------------------

### Device.compute_capability()

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Get the compute capability of the CUDA device.

```APIDOC
## Device.compute_capability()

### Description
Get the compute capability of the CUDA device.

### Returns
- tuple: A tuple of two integers (major, minor) representing the compute capability.
```
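Since the result is a `(major, minor)` tuple, a feature check can rely on Python's lexicographic tuple comparison. The helper below is a hypothetical sketch that takes the tuple as an argument, so it runs without a GPU.

```python
def meets_compute_capability(actual, required):
    """Return True if the (major, minor) pair `actual` is at least `required`."""
    return tuple(actual) >= tuple(required)


print(meets_compute_capability((7, 5), (3, 5)))
print(meets_compute_capability((3, 0), (3, 5)))
```

In real code, `actual` would come from `Device.compute_capability()`.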

--------------------------------

### Create and Fill Managed Memory Array

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Demonstrates creating a managed memory array using `managed_empty` and filling it with data on the host.

```python
from pycuda.autoinit import context
import pycuda.driver as cuda
import numpy as np

a = cuda.managed_empty(shape=10, dtype=np.float32, mem_flags=cuda.mem_attach_flags.GLOBAL)
a[:] = np.linspace(0, 9, len(a)) # Fill array on host
```

--------------------------------

### Device.pci_bus_id()

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Get the PCI bus ID of the CUDA device. Requires CUDA 4.1 or newer.

```APIDOC
## Device.pci_bus_id()

### Description
Get the PCI bus ID of the CUDA device. Available in CUDA 4.1 and newer.

### Returns
- str: The PCI bus ID of the device.
```

--------------------------------

### pycuda.autoprimaryctx

Source: https://github.com/inducer/pycuda/blob/main/doc/util.rst

Similar to pycuda.autoinit, but retains the device's primary context instead of creating a new one.

```APIDOC
## Module: pycuda.autoprimaryctx

### Description
Similar to :mod:`pycuda.autoinit`, but retains the device primary context instead of creating a new context. It also has ``device`` and ``context`` attributes.

### Attributes
* **device** (:class:`pycuda.driver.Device`): The device associated with the primary context.
* **context** (:class:`pycuda.driver.Context`): The retained primary context.
```

--------------------------------

### pycuda.tools.get_default_device

Source: https://github.com/inducer/pycuda/blob/main/doc/util.rst

Deprecated function that returns a default CUDA device; use `make_default_context` instead.

```APIDOC
## Function: pycuda.tools.get_default_device(default=0)

### Description
Deprecated. Use :func:`pycuda.tools.make_default_context`.
Returns a :class:`pycuda.driver.Device` instance chosen based on environment variables, configuration files, or a default value.

### Rules for Device Selection
1. If the environment variable ``CUDA_DEVICE`` is set, its integer value is used.
2. If the file :file:`.cuda-device` exists in the user's home directory, its integer content is used.
3. Otherwise, the `default` parameter is used as the device number.

### Parameters
* **default** (int): The default device number to use if other methods fail. Defaults to 0.

### Returns
- :class:`pycuda.driver.Device`: The selected CUDA device.
```
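The selection rules above can be sketched as plain Python. This is an illustration of the documented order only, not PyCUDA's implementation: `select_device_number` is a hypothetical helper that returns the device number rather than a `Device`, and its `environ` and `home` parameters exist purely to make the sketch testable.

```python
import os
import tempfile
from pathlib import Path


def select_device_number(default=0, environ=None, home=None):
    """Return a device number following the three documented rules."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    # Rule 1: the CUDA_DEVICE environment variable wins if it is set.
    if "CUDA_DEVICE" in environ:
        return int(environ["CUDA_DEVICE"])

    # Rule 2: otherwise use the integer stored in ~/.cuda-device, if present.
    device_file = home / ".cuda-device"
    if device_file.is_file():
        return int(device_file.read_text().strip())

    # Rule 3: fall back to the caller-supplied default.
    return default


# Exercise each rule against a throwaway home directory.
with tempfile.TemporaryDirectory() as fake_home:
    from_default = select_device_number(default=3, environ={}, home=fake_home)
    Path(fake_home, ".cuda-device").write_text("1\n")
    from_file = select_device_number(default=3, environ={}, home=fake_home)
    from_env = select_device_number(
        default=3, environ={"CUDA_DEVICE": "2"}, home=fake_home)

print(from_default, from_file, from_env)
```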

--------------------------------

### pycuda.autoinit

Source: https://github.com/inducer/pycuda/blob/main/doc/util.rst

Automatically initializes CUDA and creates a compute context upon import. It provides access to the initialized device and context.

```APIDOC
## Module: pycuda.autoinit

### Description
Automatically performs all steps necessary to get CUDA ready for submission of compute kernels by creating a compute context using `pycuda.tools.make_default_context`.

### Attributes
* **device** (:class:`pycuda.driver.Device`): The device used for automatic initialization.
* **context** (:class:`pycuda.driver.Context`): A default-constructed context on the initialized device.
```

--------------------------------

### init(flags=0)

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Initialize CUDA. This must be called before any other function in this module. See also pycuda.autoinit.

```APIDOC
## init(flags=0)

### Description
Initialize CUDA. This must be called before any other function in this module.

### Parameters
- **flags** (int) - Optional - Flags for initialization. Defaults to 0.
```

--------------------------------

### Get CURAND Version

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Retrieves the version of the CURAND library PyCUDA was compiled against. Returns a 3-tuple of integers.

```python
pycuda.curandom.get_curand_version()
```

--------------------------------

### initialize_profiler

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Initializes the CUDA profiler with a configuration file, output file, and output mode.

```APIDOC
## initialize_profiler(config_file, output_file, output_mode)

### Description
Initializes the CUDA profiler with the given configuration file, output file, and output mode. *output_mode* is one of the attributes of :class:`profiler_output_mode`.

### Parameters
- **config_file** (str) - Required - Path to the profiler configuration file.
- **output_file** (str) - Required - Path to the output file for profiler data.
- **output_mode** (profiler_output_mode) - Required - The output mode for the profiler.
```

--------------------------------

### Kernel Invocation Interface

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

A convenience interface for invoking CUDA kernels. It allows direct passing of arguments and handles setup automatically. It can be slower than prepared calls due to argument type guessing.

```APIDOC
## Kernel Invocation

### Description
This method provides a high-level interface for launching CUDA kernels with specified grid and block dimensions, along with kernel arguments. It handles the transfer of arguments and execution, returning the execution time if requested.

### Parameters
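* `block` (tuple): A tuple specifying the block dimensions (number of threads per block). Required.
* `grid` (tuple): A tuple of up to three integers specifying the number of thread blocks in the multi-dimensional grid.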
* `stream` (pycuda.driver.Stream, optional): A Stream instance to serialize the copying of input arguments, execution, and copying of output arguments.
* `shared` (int): The number of bytes available to the kernel in *extern __shared__* arrays.
* `texrefs` (list): A list of :class:`TextureReference` instances that the function will have access to.
* `time_kernel` (bool): If True, the function returns the number of seconds spent executing the kernel.
* `*args`: Variable number of arguments to be passed to the kernel. Supported types include subclasses of :class:`numpy.number`, :class:`DeviceAllocation` instances, instances of :class:`ArgumentHandler` subclasses, objects supporting the Python buffer protocol, and :class:`~pycuda.gpuarray.GPUArray` instances.

### Returns
* None or float: Returns the number of seconds spent executing the kernel if `time_kernel` is True, otherwise None.
```
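When an array is larger than one thread block, the `grid` value is usually derived from the element count with a ceiling division. The following host-only sketch shows that arithmetic; `launch_dims` and the default of 256 threads per block are illustrative assumptions, not PyCUDA API.

```python
def launch_dims(n, threads_per_block=256):
    """Return (block, grid) tuples covering n elements in one dimension."""
    # Ceiling division: enough blocks so that blocks * threads >= n.
    blocks = (n + threads_per_block - 1) // threads_per_block
    return (threads_per_block, 1, 1), (blocks, 1)


block, grid = launch_dims(400)
print(block, grid)
```

The last block may contain threads past the end of the array, so a kernel launched this way is assumed to compute a global index such as `blockIdx.x * blockDim.x + threadIdx.x` and skip indices that are not less than `n`.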

--------------------------------

### pycuda.gl.init

Source: https://github.com/inducer/pycuda/blob/main/doc/gl.rst

Enables GL interoperability for an existing CUDA context (deprecated).

```APIDOC
## pycuda.gl.init()

### Description
Enable GL interoperability for the already-created (so far non-GL) and currently active :class:`pycuda.driver.Context`.

### Warning
This function has been deprecated since CUDA 3.0 and PyCUDA 2011.1.

### Warning
This will fail with a rather unhelpful error message if you don't already have a GL context created and active.
```

--------------------------------

### module_from_buffer

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Creates a Module by loading PTX or CUBIN code from a buffer.

```APIDOC
## module_from_buffer(buffer, options=[], message_handler=None)

### Description

Creates a `Module` by loading PTX or CUBIN code from a buffer. Supports custom options and message handlers for PTX compilation (CUDA 2.1+).

### Parameters

- **buffer** (bytes-like object): The PTX or CUBIN code.
- **options** (list of tuples, optional): JIT compilation options.
- **message_handler** (callable, optional): A function to process compiler messages.
```

--------------------------------

### Prepared Kernel Call for Reduced Overhead

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Prepare a kernel for invocation with specific argument types to reduce overhead compared to the standard `__call__` method.

```python
grid = (1, 1)
block = (4, 4, 1)
func.prepare("P")
func.prepared_call(grid, block, a_gpu)
```

--------------------------------

### module_from_file

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Creates a Module by loading a CUBIN file from the specified filename.

```APIDOC
## module_from_file(filename)

### Description

Creates a `Module` by loading a CUBIN file from the specified filename.

### Parameters

- **filename** (string) - The path to the CUBIN file.
```

--------------------------------

### prepared_call

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Invokes a prepared kernel with specified grid and block dimensions and arguments.

```APIDOC
## prepared_call

### Description
Invokes a kernel that has been previously prepared using the :meth:`prepare` method. It uses the configured argument types and texture references for execution.

### Parameters
* `grid` (tuple): A tuple specifying the grid dimensions (number of thread blocks).
* `block` (tuple): A tuple specifying the block dimensions (number of threads per block).
* `*args`: Variable arguments to be passed to the kernel. These should match the types configured during preparation.
* `shared_size` (int, optional): The size of shared memory to allocate for the kernel. Defaults to 0.
```

--------------------------------

### ReductionKernel with Output Specification

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Shows how to specify an output array for the ReductionKernel, useful for accumulating results or storing them in specific locations.

```Python
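import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
import numpy as np
from pycuda.curandom import rand as curand
from pycuda.reduction import ReductionKernel

a = curand((10, 200), dtype=np.float32)
red = ReductionKernel(np.float32, neutral=0,
                      reduce_expr="a+b",
                      arguments="float *in")
a_sum = gpuarray.empty(10, dtype=np.float32)
for i in range(10):
    red(a[i], out=a_sum[i])  # write the sum of row i into a_sum[i]
assert np.allclose(a_sum.get(), a.get().sum(axis=1))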
```

--------------------------------

### launch_grid_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Launches a grid of thread blocks asynchronously, sequenced by a stream. Deprecated.

```APIDOC
## launch_grid_async(width, height, stream)

### Description
Launch a width*height grid of thread blocks of *self*, sequenced by the :class:`Stream` *stream*.

.. warning:: Deprecated as of version 2011.1.

### Parameters
- **width** (int) - Required - The width of the grid.
- **height** (int) - Required - The height of the grid.
- **stream** (Stream) - Required - The stream to sequence the launch by.
```

--------------------------------

### Context Flags

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Constants for configuring CUDA contexts using :meth:`Device.make_context`.

```APIDOC
## Context Flags (`ctx_flags`)

### Description
Flags for :meth:`Device.make_context`. CUDA 2.0 and above only.

### Flags
- **SCHED_AUTO**: Scheduling mode: yield if more contexts than processors, otherwise spin.
- **SCHED_SPIN**: Scheduling mode: spin while waiting for CUDA calls to complete.
- **SCHED_YIELD**: Scheduling mode: yield to other threads while waiting.
- **SCHED_MASK**: Mask of valid scheduling flags.
- **SCHED_BLOCKING_SYNC**: Use blocking synchronization (CUDA 2.2+).
- **MAP_HOST**: Support mapped pinned allocations (CUDA 2.2+).
- **LMEM_RESIZE_TO_MAX**: Keep local memory allocation after launch (CUDA 3.2+).
- **FLAGS_MASK**: Mask of valid flags.
```

--------------------------------

### launch_grid

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Launches a grid of thread blocks for the function. Deprecated.

```APIDOC
## launch_grid(width, height)

### Description
Launch a width*height grid of thread blocks of *self*.

.. warning:: Deprecated as of version 2011.1.

### Parameters
- **width** (int) - Required - The width of the grid.
- **height** (int) - Required - The height of the grid.
```

--------------------------------

### Execute GPU Kernel with Managed Memory

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Shows how to compile and execute a simple GPU kernel that modifies a managed memory array, followed by host-side computation.

```python
from pycuda.compiler import SourceModule
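# Each thread doubles one element of the managed array `a`.
mod = SourceModule("""
__global__ void doublify(float *a)
{
    a[threadIdx.x] *= 2;
}
""")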
doublify = mod.get_function("doublify")

doublify(a, grid=(1,1), block=(len(a),1,1))
context.synchronize() # Wait for kernel completion before host access

median = np.median(a) # Computed on host!
```

--------------------------------

### prepared_timed_call

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Invokes a prepared kernel and returns a callable to query the GPU execution time.

```APIDOC
## prepared_timed_call

### Description
Invokes a prepared kernel and provides a mechanism to measure its execution time on the GPU. It returns a callable that, when invoked, blocks until the kernel completes and returns the elapsed time in seconds.

### Parameters
* `grid` (tuple): A tuple specifying the grid dimensions (number of thread blocks).
* `block` (tuple): A tuple specifying the block dimensions (number of threads per block).
* `*args`: Variable arguments to be passed to the kernel. These should match the types configured during preparation.
* `shared_size` (int, optional): The size of shared memory to allocate for the kernel. Defaults to 0.

### Returns
* callable: A 0-ary callable that, when called, returns the GPU time consumed by the kernel invocation in seconds.
```

--------------------------------

### prepared_async_call

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously invokes a prepared kernel on a specified stream.

```APIDOC
## prepared_async_call

### Description
Asynchronously launches a prepared kernel onto a specified CUDA stream. If no stream is provided, it behaves like :meth:`prepared_call`. This allows for overlapping kernel execution with data transfers or other operations.

### Parameters
* `grid` (tuple): A tuple specifying the grid dimensions (number of thread blocks).
* `block` (tuple): A tuple specifying the block dimensions (number of threads per block).
* `stream` (:class:`pycuda.driver.Stream`): The CUDA stream on which to launch the kernel. If None, the default stream is used.
* `*args`: Variable arguments to be passed to the kernel. These should match the types configured during preparation.
* `shared_size` (int, optional): The size of shared memory to allocate for the kernel. Defaults to 0.
```

--------------------------------

### take

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Returns elements from an array at specified indices.

```APIDOC
## take(a, indices, stream=None)

### Description
Returns the GPUArray ``[a[indices[0]], ..., a[indices[n]]]``.

### Parameters
- **a**: The input array. Must be a type that can be bound to a texture.
- **indices**: The indices of the elements to take.
- **stream**: The CUDA stream to use for the operation.
```

--------------------------------

### Device Methods

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Methods for retrieving information about and managing CUDA devices.

```APIDOC
## Device Methods

### Description
Methods for retrieving information about and managing CUDA devices.

### Methods

- **compute_capability()**: Return a 2-tuple indicating the compute capability version of this device.
- **total_memory()**: Return the total amount of memory on the device in bytes.
- **get_attribute(attr)**: Return the (numeric) value of the attribute *attr*.
- **get_attributes()**: Return all device attributes in a :class:`dict`.
- **make_context(flags=ctx_flags.SCHED_AUTO)**: Create a :class:`Context` on this device.
- **retain_primary_context()**: Return the :class:`Context` obtained by retaining the device's primary context.
- **can_access_peer(dev)**: Check if peer access is possible between devices (CUDA 4.0+).
- **__hash__()**: Returns the hash of the device object.
- **__eq__()**: Checks for equality between two device objects.
- **__ne__()**: Checks for inequality between two device objects.
```

--------------------------------

### empty_like

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates an uninitialized GPUArray with the same properties as another array.

```APIDOC
## empty_like(other_ary, dtype=None, order="K")

### Description
Creates a new, uninitialized GPUArray having the same properties as *other_ary*. The *dtype* and *order* attributes allow these aspects to be set independently of their values in *other_ary*.

### Parameters
- **other_ary**: The array whose properties will be used.
- **dtype**: The data type for the new array (default: None, inferred from other_ary).
- **order**: The memory layout ('C', 'F', 'A', or 'K', default: 'K'). 'K' tries to match the strides of *other_ary* as closely as possible.
```

--------------------------------

### Compile and Execute CUDA Kernel

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Define a CUDA C kernel to double array elements, compile it using SourceModule, and execute it on the device.

```python
mod = SourceModule("""
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)
func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))
```
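To see why `threadIdx.x + threadIdx.y*4` touches every element of a 4x4 array once, the mapping can be replayed on the host. This is a plain-Python sketch; `doublify_indices` is a hypothetical helper and does not call PyCUDA.

```python
def doublify_indices(block=(4, 4, 1), row_length=4):
    """Map each (threadIdx.x, threadIdx.y) pair to the flat index it updates."""
    mapping = {}
    for ty in range(block[1]):
        for tx in range(block[0]):
            mapping[(tx, ty)] = tx + ty * row_length  # same formula as the kernel
    return mapping


mapping = doublify_indices()
print(sorted(mapping.values()) == list(range(16)))
```

With a row-major 4x4 `float32` array, thread `(x, y)` therefore updates the element in row `y`, column `x`.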

--------------------------------

### pycuda.gl.make_context

Source: https://github.com/inducer/pycuda/blob/main/doc/gl.rst

Creates and returns a PyCUDA driver Context with GL interoperability enabled. Requires an existing active GL context.

```APIDOC
## pycuda.gl.make_context(dev, flags=0)

### Description
Create and return a :class:`pycuda.driver.Context` that has GL interoperability enabled.

### Parameters
* **dev** - The CUDA device to create the context on.
* **flags** - Optional flags for context creation. Defaults to 0.

### Warning
This will fail with a rather unhelpful error message if you don't already have a GL context created and active.
```

--------------------------------

### launch

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Launches a single thread block of the function. Deprecated.

```APIDOC
## launch()

### Description
Launch a single thread block of *self*.

.. warning:: Deprecated as of version 2011.1.
```

--------------------------------

### Module

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Represents a loaded CUBIN module on the device. Allows retrieval of functions, global variables, and texture/surface references.

```APIDOC
## Module

### Description

Represents a loaded CUBIN module on the device. Allows retrieval of functions, global variables, and texture/surface references.

### Methods

- `get_function(name)`: Returns a `Function` handle by name.
- `get_global(name)`: Returns the device address and size of a global variable.
- `get_texref(name)`: Returns a `TextureReference` handle by name.
- `get_surfref(name)`: Returns a `SurfaceReference` handle by name (CUDA 3.1+).
```

--------------------------------

### Context Methods

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Methods for managing CUDA contexts, which represent processes on a compute device.

```APIDOC
## Context Methods

### Description
Methods for managing CUDA contexts, which represent processes on a compute device.

### Methods

- **detach()**: Decrease the reference count on this context. If the reference count hits zero, the context is deleted.
- **push()**: Make *self* the active context, pushing it on top of the context stack (CUDA 2.0+).
- **pop()**: Remove any context from the top of the context stack, deactivating it (CUDA 2.0+).
- **get_device()**: Return the device that the current context is working on.
- **synchronize()**: Wait for all activity in the current context to cease.
- **set_limit(limit, value)**: Set a context limit (CUDA 3.1+).
- **get_limit(limit)**: Get a context limit (CUDA 3.1+).
- **set_cache_config(cc)**: Set the cache configuration (CUDA 3.2+).
- **get_cache_config()**: Get the cache configuration (CUDA 3.2+).
- **set_shared_config(sc)**: Set the shared memory configuration (CUDA 4.2+).
- **get_shared_config()**: Get the shared memory configuration (CUDA 4.2+).
- **get_api_version()**: Return an integer API version number (CUDA 3.2+).
- **enable_peer_access(peer, flags=0)**: Enable peer access between contexts (CUDA 4.0+).
- **disable_peer_access(peer, flags=0)**: Disable peer access between contexts (CUDA 4.0+).
```

--------------------------------

### ones_like

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray initialized with ones, having the same properties as another array.

```APIDOC
## ones_like(other_ary, dtype=None, order="K")

### Description
Creates a new, ones-initialized GPUArray having the same properties as *other_ary*. The *dtype* and *order* attributes allow these aspects to be set independently of their values in *other_ary*.

### Parameters
- **other_ary**: The array whose properties will be used.
- **dtype**: The data type for the new array (default: None, inferred from other_ary).
- **order**: The memory layout ('C', 'F', 'A', or 'K', default: 'K'). 'K' tries to match the strides of *other_ary* as closely as possible.

### Version Added
2017.2
```

--------------------------------

### memcpy_dtoa, memcpy_atod, memcpy_htoa, memcpy_atoh, memcpy_atoa

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Functions for copying data between device arrays and host/device memory.

```APIDOC
## memcpy_dtoa, memcpy_atod, memcpy_htoa, memcpy_atoh, memcpy_atoa

### Description
These functions facilitate memory transfers involving device arrays (`ary`) and host/device memory. They allow copying data to/from specific indices and with specified lengths.

*   `memcpy_dtoa(ary, index, src, len)`: Copies *len* bytes from device memory (*src*) to a device array (*ary*) at a given *index*.
*   `memcpy_atod(dest, ary, index, len)`: Copies *len* bytes from a device array (*ary*) at a given *index* to device memory (*dest*).
*   `memcpy_htoa(ary, index, src)`: Copies from host memory (*src*) to a device array (*ary*) at a given *index*.
*   `memcpy_atoh(dest, ary, index)`: Copies from a device array (*ary*) at a given *index* to host memory (*dest*).
*   `memcpy_atoa(dest, dest_index, src, src_index, len)`: Copies between two device arrays with specified source and destination indices and length.

### Parameters
* **ary** (Array) - The PyCUDA Array object.
* **index** (integer) - The starting index for the transfer within the array.
* **src** (pointer or buffer) - The source memory.
* **dest** (pointer or buffer) - The destination memory.
* **len** (integer) - The number of bytes to transfer.
* **dest_index** (integer) - The starting index for the destination array.
* **src_index** (integer) - The starting index for the source array.
```

--------------------------------

### Define and Use a CUDA Struct Wrapper

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Demonstrates creating a Python wrapper for a CUDA struct to manage device memory and data copying. This is useful for passing structured data to CUDA kernels.

```python
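import struct
import numpy
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

class DoubleOpStruct:
    mem_size = 8 + numpy.intp(0).nbytes  # int plus padding (8 bytes), then one pointer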
    def __init__(self, array, struct_arr_ptr):
        self.data = cuda.to_device(array)
        self.shape, self.dtype = array.shape, array.dtype
        packed_args = struct.pack("ixP", array.size, numpy.uintp(self.data))
        cuda.memcpy_htod(struct_arr_ptr, packed_args)

    def __str__(self):
        return str(cuda.from_device(self.data, self.shape, self.dtype))

struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size

array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
print("original arrays", array1, array2)
```
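
The `mem_size` expression above can be checked on the host without a GPU. The following is a minimal sketch that uses only Python's `struct` module; `fake_count` and `fake_device_ptr` are made-up stand-ins for `array.size` and the device address, not values taken from PyCUDA.

```python
import struct

# Native-alignment layout used by the wrapper: a C int, one explicit pad
# byte, then a pointer. struct adds further padding so that the pointer
# starts on its natural alignment boundary.
LAYOUT = "ixP"

pointer_size = struct.calcsize("P")
struct_size = struct.calcsize(LAYOUT)

# Placeholder values (assumptions for illustration only).
fake_count = 3
fake_device_ptr = 0xDEADBEEF

packed = struct.pack(LAYOUT, fake_count, fake_device_ptr)
count, ptr = struct.unpack(LAYOUT, packed)

# The packed size should match the 8 + pointer-size formula used for mem_size.
print(struct_size, 8 + pointer_size)
print(count, hex(ptr))
```

If the two printed sizes ever differed on some platform, the second struct written at `do2_ptr` would overlap or leave a gap after the first.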

--------------------------------

### zeros

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Allocates and initializes a GPUArray with zeros.

```APIDOC
## zeros(shape, dtype=np.float64, *, allocator=None, order="C")

### Description
Same as :func:`empty`, but the :class:`GPUArray` is zero-initialized before being returned.

### Method
Function in the `pycuda.gpuarray` module.

### Parameters
* **shape**: The shape of the array.
* **dtype**: The data type of the array elements (defaults to np.float64).
* **allocator**: Optional allocator function.
* **order**: The memory order ('C' or 'F').

### Response
#### Success Response
- **GPUArray**: A GPUArray initialized with zeros.
```
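
To make the *order* parameter concrete, the sketch below computes the byte strides that a row-major (`"C"`) or column-major (`"F"`) layout implies for a given shape. `strides_for` is a hypothetical helper written for this illustration; it is not part of PyCUDA and only assumes the usual NumPy-style meaning of the two orders.

```python
def strides_for(shape, itemsize, order="C"):
    """Return byte strides for a dense array of the given shape and order."""
    if order not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    # In C order the last axis varies fastest; in F order the first one does.
    dims = list(reversed(shape)) if order == "C" else list(shape)
    strides = []
    step = itemsize
    for extent in dims:
        strides.append(step)
        step *= extent
    if order == "C":
        strides.reverse()
    return tuple(strides)


# A 2x3 array of 4-byte floats in both layouts.
c_strides = strides_for((2, 3), 4, "C")
f_strides = strides_for((2, 3), 4, "F")
print(c_strides, f_strides)
```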

--------------------------------

### memcpy_peer, memcpy_peer_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data between device pointers in different contexts.

```APIDOC
## memcpy_peer, memcpy_peer_async

### Description
Copies a specified number of bytes (*size*) between device pointers in potentially different CUDA contexts. The asynchronous version allows serialization via a CUDA stream.

### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Device Pointer or int) - The source device pointer.
* **size** (integer) - The number of bytes to copy.
* **dest_context** (CUDA Context, optional) - The destination CUDA context.
* **src_context** (CUDA Context, optional) - The source CUDA context.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the asynchronous operation with.

### Note
Requires CUDA 4.0 and above.

### Version Added
2011.1
```

--------------------------------

### memset_d8_async, memset_d16_async, memset_d32_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously fills device memory arrays with a specified data value.

```APIDOC
## memset_d8_async, memset_d16_async, memset_d32_async

### Description
Asynchronously fills a device memory array with the specified *data* value. Operations can optionally be serialized via a CUDA stream.

### Parameters
* **dest** (Device Pointer) - The destination device pointer.
* **data** (value) - The data value to fill the array with.
* **count** (integer) - The number of elements to fill (not bytes).
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the operation with.

### Note
*count* specifies the number of elements, not bytes.

### Version Added
2015.1
```
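
Because *count* is measured in elements, a buffer size in bytes has to be divided by the element width before it is passed to one of these functions. The helper below is a small host-side sketch of that conversion; `memset_count` is a hypothetical name, not a PyCUDA function.

```python
def memset_count(nbytes, element_bits):
    """Return the element count that covers exactly nbytes of memory."""
    if element_bits not in (8, 16, 32):
        raise ValueError("element_bits must be 8, 16, or 32")
    element_bytes = element_bits // 8
    if nbytes % element_bytes:
        raise ValueError("buffer size is not a multiple of the element size")
    return nbytes // element_bytes


# The same 1024-byte buffer needs a different count for each variant.
counts = {bits: memset_count(1024, bits) for bits in (8, 16, 32)}
print(counts)
```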

--------------------------------

### Floating Point Assembly

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Assembles floating-point numbers from significands and exponents. Computes `significand * 2**exponent`.

```python
pycuda.cumath.ldexp(significand, exponent, stream=None)
```

--------------------------------

### memcpy_htod_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously copies data from a Python buffer to a device pointer.

```APIDOC
## memcpy_htod_async

### Description
Asynchronously copies data from a Python buffer (*src*) to a device pointer (*dest*). The size of the copy is determined by the size of the Python buffer. The operation can optionally be serialized via a CUDA stream.

### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Python Buffer) - The source Python buffer object.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the operation with.

### Note
*src* must be page-locked memory.

### Version Added
0.93
```

--------------------------------

### Transfer Data from Device to Host

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Retrieve the modified data from the GPU back to a NumPy array on the host and print both the original and doubled arrays.

```python
# `a` (host array) and `a_gpu` (device allocation) come from the earlier tutorial steps.
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
```

--------------------------------

### ones

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray filled with ones. It is similar to the empty function but initializes the array with ones.

```APIDOC
## ones(shape, dtype=np.float64, *, allocator=None, order="C")

### Description
Creates a GPUArray filled with ones. It is similar to the empty function but initializes the array with ones.

### Parameters
- **shape**: The dimensions of the array.
- **dtype**: The data type of the array elements (default: np.float64).
- **allocator**: Optional custom allocator.
- **order**: The memory layout of the array ('C' or 'F', default: 'C').
```

--------------------------------

### Function

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Represents a kernel function in a loaded module. Allows launching the kernel with specified arguments, block size, grid size, and stream.

```APIDOC
## Function

### Description

Represents a kernel function in a loaded module. Allows launching the kernel with specified arguments, block size, grid size, and stream.

### Methods

- `__call__(arg1, ..., argn, block=block_size, grid=(1,1), stream=None, shared=0, texrefs=[], time_kernel=False)`: Launches the kernel. `block` must be a 3-tuple of integers. `arg1` through `argn` are the positional C arguments to the kernel.
```
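
A common way to choose the `block` and `grid` arguments for a one-dimensional problem is ceiling division of the element count by the block size. The sketch below shows only that arithmetic; `launch_config` and its default of 256 threads per block are assumptions made for illustration, and a kernel launched this way still needs its own bounds check for the last, partially filled block.

```python
def launch_config(n, threads_per_block=256):
    """Return (block, grid) tuples that cover n elements in one dimension."""
    if n <= 0 or threads_per_block <= 0:
        raise ValueError("n and threads_per_block must be positive")
    # Ceiling division: enough blocks so that blocks * threads >= n.
    blocks = (n + threads_per_block - 1) // threads_per_block
    return (threads_per_block, 1, 1), (blocks, 1)


block, grid = launch_config(1000)
print(block, grid)
```

The resulting tuples could then be passed as `block=block, grid=grid` when calling the kernel.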

--------------------------------

### zeros_like

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray initialized with zeros, having the same properties as another array.

```APIDOC
## zeros_like(other_ary, dtype=None, order="K")

### Description
Creates a new, zero-initialized GPUArray having the same properties as *other_ary*. The *dtype* and *order* attributes allow these aspects to be set independently of their values in *other_ary*.

### Parameters
- **other_ary**: The array whose properties will be used.
- **dtype**: The data type for the new array (default: None, inferred from other_ary).
- **order**: The memory layout ('C', 'F', 'A', or 'K', default: 'K'). 'K' tries to match the strides of *other_ary* as closely as possible.
```

--------------------------------

### memcpy_htod

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data from a Python buffer to a device pointer.

```APIDOC
## memcpy_htod

### Description
Copies data from a Python buffer (*src*) to a device pointer (*dest*). The size of the copy is determined by the size of the Python buffer.

### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Python Buffer) - The source Python buffer object.
```

--------------------------------

### memcpy_dtod, memcpy_dtod_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data between device pointers.

```APIDOC
## memcpy_dtod, memcpy_dtod_async

### Description
Copies a specified number of bytes (*size*) from one device pointer (*src*) to another (*dest*). The asynchronous version allows serialization via a CUDA stream.

### Parameters
* **dest** (Device Pointer or int) - The destination device pointer.
* **src** (Device Pointer or int) - The source device pointer.
* **size** (integer) - The number of bytes to copy.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the asynchronous operation with.

### Note
Requires CUDA 3.0 and above.

### Version Added
0.94
```

--------------------------------

### memcpy_dtoh_async

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Asynchronously copies data from a device pointer to a Python buffer.

```APIDOC
## memcpy_dtoh_async

### Description
Asynchronously copies data from a device pointer (*src*) to a Python buffer (*dest*). The size of the copy is determined by the size of the Python buffer. The operation can optionally be serialized via a CUDA stream.

### Parameters
* **dest** (Python Buffer) - The destination Python buffer object.
* **src** (Device Pointer or int) - The source device pointer.
* **stream** (CUDA Stream, optional) - The CUDA stream to serialize the operation with.

### Note
*dest* must be page-locked memory.

### Version Added
0.93
```

--------------------------------

### Floating Point Decomposition and Assembly

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Functions for decomposing floating-point numbers into their components and assembling them, including fmod, frexp, ldexp, and modf.

```APIDOC
## Floating Point Decomposition and Assembly

### Description
Functions for decomposing floating-point numbers into their components and assembling them.

### Functions
- `fmod(arg, mod, stream=None)`: Return the floating point remainder of the division `arg/mod`.
- `frexp(arg, stream=None)`: Return a tuple `(significands, exponents)` such that `arg == significand * 2**exponent`.
- `ldexp(significand, exponent, stream=None)`: Return a new array of floating point values composed from `significand` and `exponent` as `result = significand * 2**exponent`.
- `modf(arg, stream=None)`: Return a tuple `(fracpart, intpart)` of arrays containing the fractional and integer parts of `arg`.

### Parameters
- **arg**: The input array for floating-point operations.
- **mod**: The modulus for `fmod`.
- **significand**: The significand part for `ldexp`.
- **exponent**: The exponent part for `ldexp`.
- **stream**: Optional CUDA stream for asynchronous execution.
```
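
The conventions above can be tried on the host with scalars. This sketch uses Python's `math` module as a reference, on the assumption that the GPU functions follow the same C library conventions elementwise; it does not call PyCUDA.

```python
import math

# frexp splits a float into a significand and a power-of-two exponent.
significand, exponent = math.frexp(8.0)

# ldexp reverses the split: significand * 2**exponent.
rebuilt = math.ldexp(significand, exponent)

# modf returns the fractional part first, then the integer part.
fracpart, intpart = math.modf(3.25)

# fmod is the floating point remainder of the division 7.5 / 2.0.
remainder = math.fmod(7.5, 2.0)

print(significand, exponent, rebuilt)
print(fracpart, intpart, remainder)
```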

--------------------------------

### any

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Checks if any element in the array is true.

```APIDOC
## any(a, stream=None, allocator=None)

### Description
Checks if any element in the array is true.

### Parameters
- **a**: The input array.
- **stream**: The CUDA stream to use for the operation.
- **allocator**: Optional custom allocator.
```

--------------------------------

### SourceModule Class

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Creates a pycuda.driver.Module from CUDA source code. It handles compilation using nvcc and provides options for customization.

```APIDOC
## class:: SourceModule(source, nvcc="nvcc", options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])

### Description
Create a :class:`pycuda.driver.Module` from the CUDA source code *source*. The Nvidia compiler *nvcc* is assumed to be on the ``PATH`` if no path to it is specified, and is invoked with *options* to compile the code. If *keep* is *True*, the compiler output directory is kept, and a line indicating its location in the file system is printed for debugging purposes.

Unless *no_extern_c* is *True*, the given source code is wrapped in *extern "C" { ... }* to prevent C++ name mangling.

`arch` and `code` specify the values to be passed for the ``-arch`` and ``-code`` options on the :program:`nvcc` command line. If `arch` is `None`, it defaults to the current context's device's compute capability. If `code` is `None`, it will not be specified.

`cache_dir` gives the directory used for compiler caching. If `None` then `cache_dir` is taken to be ``PYCUDA_CACHE_DIR`` if set or a sensible per-user default. If passed as `False`, caching is disabled.

If the environment variable ``PYCUDA_DISABLE_CACHE`` is set to any value then caching is disabled. This preference overrides any value of `cache_dir` and can be used to disable caching globally.

This class exhibits the same public interface as :class:`pycuda.driver.Module`, but does not inherit from it.

*Change note:* :class:`SourceModule` was moved from :mod:`pycuda.driver` to :mod:`pycuda.compiler` in version 0.93.
```
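
The caching rules described above can be summarized as a small decision function. This is a sketch of the documented precedence only, not PyCUDA's actual implementation; `resolve_cache_dir` and its `default_dir` argument are hypothetical, with `default_dir` standing in for the unspecified per-user default.

```python
def resolve_cache_dir(cache_dir, environ, default_dir):
    """Return the cache directory to use, or None when caching is disabled."""
    # PYCUDA_DISABLE_CACHE set to any value overrides everything else.
    if "PYCUDA_DISABLE_CACHE" in environ:
        return None
    # Passing False disables caching for this call.
    if cache_dir is False:
        return None
    # None means: use PYCUDA_CACHE_DIR if set, else the per-user default.
    if cache_dir is None:
        return environ.get("PYCUDA_CACHE_DIR", default_dir)
    return cache_dir


# Example environments, passed as plain dicts instead of os.environ.
print(resolve_cache_dir(None, {}, "/tmp/default-cache"))
print(resolve_cache_dir(None, {"PYCUDA_CACHE_DIR": "/tmp/env-cache"}, "/tmp/default-cache"))
print(resolve_cache_dir("/tmp/mine", {"PYCUDA_DISABLE_CACHE": "1"}, "/tmp/default-cache"))
```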

--------------------------------

### arange

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Creates a GPUArray filled with numbers spaced apart by a given step.

```APIDOC
## arange(start, stop, step, dtype=None, stream=None)

### Description
Creates a GPUArray filled with numbers spaced `step` apart, starting from `start` and stopping before `stop` (as with `numpy.arange`, `stop` itself is not included).

### Parameters
- **start**: The starting value of the sequence.
- **stop**: The end value of the sequence.
- **step**: The spacing between values.
- **dtype**: The data type of the array elements (default: None, inferred from start, stop, and step).
- **stream**: The CUDA stream to use for the operation.
```
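
The number of elements produced by such a sequence can be worked out on the host. The sketch below assumes the same length rule as `numpy.arange`, namely `ceil((stop - start) / step)`; `arange_length` and `arange_values` are hypothetical helpers, not PyCUDA functions.

```python
import math


def arange_length(start, stop, step):
    """Return how many values lie in [start, stop) when spaced step apart."""
    if step == 0:
        raise ValueError("step must be non-zero")
    return max(0, math.ceil((stop - start) / step))


def arange_values(start, stop, step):
    """Return the sequence as a plain Python list."""
    return [start + i * step for i in range(arange_length(start, stop, step))]


print(arange_length(0, 10, 3), arange_values(0, 10, 3))
```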

--------------------------------

### compile Function

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Compiles CUDA source code and returns the resulting cubin file as a string without uploading it to the GPU.

```APIDOC
## function:: compile(source, nvcc="nvcc", options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])

### Description
Performs the same compilation as the corresponding :class:`SourceModule` constructor, but returns only the resulting *cubin* file as a string. In particular, the code is not uploaded to the GPU.
```

--------------------------------

### memcpy_dtoh

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Copies data from a device pointer to a Python buffer.

```APIDOC
## memcpy_dtoh

### Description
Copies data from a device pointer (*src*) to a Python buffer (*dest*). The size of the copy is determined by the size of the Python buffer.

### Parameters
* **dest** (Python Buffer) - The destination Python buffer object.
* **src** (Device Pointer or int) - The source device pointer.
```

--------------------------------

### stack

Source: https://github.com/inducer/pycuda/blob/main/doc/array.rst

Joins a sequence of arrays along a new axis.

```APIDOC
## stack(arrays, axis=0, allocator=None)

### Description
Joins a sequence of arrays along a new axis.

### Parameters
- **arrays**: A sequence of arrays to join.
- **axis**: The index of the new axis in the result (default: 0).
- **allocator**: Optional custom allocator.
```
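
Joining along a new axis changes the result shape in a predictable way. The sketch below computes that shape under the assumption that the semantics match `numpy.stack` (all inputs share one shape, and *axis* is a non-negative index into the result); `stacked_shape` is a hypothetical helper.

```python
def stacked_shape(input_shape, n_arrays, axis=0):
    """Return the shape obtained by stacking n_arrays arrays of input_shape."""
    if not 0 <= axis <= len(input_shape):
        raise ValueError("axis out of range for the result")
    shape = list(input_shape)
    # The new axis has one entry per stacked array.
    shape.insert(axis, n_arrays)
    return tuple(shape)


print(stacked_shape((3, 4), 2, axis=0))
print(stacked_shape((3, 4), 2, axis=1))
```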

--------------------------------

### Device(number | pci_bus_id)

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

A handle to the *number*'th CUDA device or a device specified by its PCI bus ID. See also pycuda.autoinit.

```APIDOC
## Device(number | pci_bus_id)

### Description
A handle to a CUDA device. Can be constructed using the device number or its PCI bus ID.

### Parameters
- **number** (int) - The device number.
- **pci_bus_id** (str) - The PCI bus ID of the device (CUDA 4.1 and newer).
```

--------------------------------

### Using GPUArray for Simplified Operations

Source: https://github.com/inducer/pycuda/blob/main/doc/tutorial.rst

Utilize GPUArray for high-level operations, including data transfer, arithmetic, and retrieval of results.

```python
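import numpy
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray

a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
a_doubled = (2 * a_gpu).get()  # multiply on the GPU, then copy back to the host
print(a_doubled)
print(a_gpu)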
```

--------------------------------

### Memcpy2D Class

Source: https://github.com/inducer/pycuda/blob/main/doc/driver.rst

Provides methods to configure and perform two-dimensional memory copies.

```APIDOC
## Memcpy2D Class

### Description
The `Memcpy2D` class allows for the configuration and execution of two-dimensional memory transfers. It provides attributes to define offsets, pitch, and methods to set the source and destination of the copy.

### Attributes
* **src_x_in_bytes** (integer) - X Offset of the origin of the copy (initialized to 0).
* **src_y** (integer) - Y offset of the origin of the copy (initialized to 0).
* **src_pitch** (integer) - Size of a row in bytes at the origin of the copy.
* **dst_x_in_bytes** (integer) - X offset of the destination of the copy (initialized to 0).
* **dst_y** (integer) - Y offset of the destination of the copy (initialized to 0).

### Methods
* **set_src_host(buffer)**: Sets the *buffer* (Python object adhering to buffer interface) as the origin of the copy.
* **set_src_array(array)**: Sets the :class:`Array` *array* as the origin of the copy.
* **set_src_device(devptr)**: Sets the device address *devptr* (an :class:`int` or :class:`DeviceAllocation`) as the origin of the copy.
* **set_src_unified(buffer)**: Sets *buffer* as the origin, which can correspond to host or device memory (requires unified addressing, CUDA 4.0+).

### Version Added
2011.1 (for `set_src_unified`)
```
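
The offset and pitch attributes combine into a single byte offset for the first element copied. The sketch below shows that arithmetic, assuming the addressing rule from the CUDA driver API's 2D copy documentation (row index times pitch, plus the X offset in bytes); `origin_byte_offset` is a hypothetical helper, not part of PyCUDA.

```python
def origin_byte_offset(x_in_bytes, y, pitch):
    """Return the byte offset of the copy origin from the start of the buffer."""
    if x_in_bytes < 0 or y < 0 or pitch < 0:
        raise ValueError("offsets and pitch must be non-negative")
    # Skip y full rows of `pitch` bytes each, then x_in_bytes into that row.
    return y * pitch + x_in_bytes


# Row 2 of a buffer with 256-byte rows, starting 8 bytes into the row.
print(origin_byte_offset(x_in_bytes=8, y=2, pitch=256))
```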