### Installing and Running Pre-commit Hooks

Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md

Steps to install pre-commit hooks and run formatting and linting checks on all files. This ensures code adheres to project-specific style guidelines.

```shell
pip install pre-commit && pre-commit install
```

```shell
pre-commit --files all
```

--------------------------------

### Build Options

Source: https://github.com/nvidia/cuembed/blob/main/CMakeLists.txt

Defines boolean options to control the building of tests, benchmarks, and examples. Defaults are set to ON for tests and benchmarks, and OFF for examples.

```cmake
option(BUILD_TESTS "Build the tests" ON)
option(BUILD_BENCHMARKS "Build the benchmarks" ON)
option(BUILD_EXAMPLES "Build examples" OFF)
```

--------------------------------

### Link Libraries and Include CUDA Headers for Utils

Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt

Configures the 'utils' library to use CUDA include directories and links against CUDA runtime, Google Test, and Abseil libraries. This ensures all dependencies are resolved.

```cmake
target_include_directories(utils PRIVATE ${CUDAToolkit_INCLUDE_DIRS} absl::log absl::check gtest)
target_link_libraries(utils PRIVATE CUDA::cudart absl::log absl::check gtest)
```

--------------------------------

### Dependency and Library Setup

Source: https://github.com/nvidia/cuembed/blob/main/CMakeLists.txt

Finds the CUDAToolkit package and sets up interface libraries for headers. It conditionally includes subdirectories for third-party libraries (like Abseil and GTest) and project-specific modules (utils, tests, benchmarks, examples) based on build options.

```cmake
find_package(CUDAToolkit)

set(CUEMBED_PROJECT_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR})

add_library(cuembed_hdrs INTERFACE ${cuembed_source_files})
target_include_directories(cuembed_hdrs INTERFACE ${CUEMBED_PROJECT_SOURCE_DIR})
add_library(cuembed::hdrs ALIAS cuembed_hdrs)

if (BUILD_TESTS OR BUILD_BENCHMARKS)
    # TODO(zejiaz): move to CPM instead of submodule
    add_subdirectory(third_party/abseil-cpp)

    # Utility library for benchmarking and testing.
    add_subdirectory(utils)
endif()

# Setup tests
if(BUILD_TESTS)
    add_subdirectory(third_party/gtest)
    add_subdirectory(tests)
endif()

# Benchmarks.
if (BUILD_BENCHMARKS)
    add_subdirectory(benchmarks)
endif()

# Examples
if (BUILD_EXAMPLES)
    add_subdirectory(examples/pytorch)
endif()
```

--------------------------------

### Create Utils Library with CUDA and C++ Sources

Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt

Compiles multiple CUDA (.cu) and C++ (.cpp) source files into an OBJECT library named 'utils'. This library contains the core embedding functionalities.

```cmake
add_library(utils OBJECT src/embedding_allocation.cu src/embedding_gpu_forward.cu src/embedding_gpu_transpose.cu src/embedding_gpu_backward.cu src/embedding_cpu.cu src/datagen.cpp)
```

--------------------------------

### Forward Propagation Example in C++ CUDA

Source: https://github.com/nvidia/cuembed/blob/main/README.md

Provides an example of running forward propagation for embedding lookups using the cuEmbed host API. It takes embedding data, indices, offsets, and weights as input and writes the results to an output vector.

```cpp
template <typename ElemT, typename IndexT, typename OffsetT, bool fp16_math>
void RunForward(const utils::AllocationOptions& options,
                const thrust::device_vector<ElemT>& embedding,
                const thrust::device_vector<IndexT>& indices,
                const thrust::device_vector<OffsetT>& offsets,
                const thrust::device_vector<ElemT>& weights,
                thrust::device_vector<ElemT>* result) {
  const int* offsets_ptr = nullptr;
  int hotness = options.hotness();
  if (options.is_csr()) {
    offsets_ptr = offsets.data().get();
    hotness = 0;
  }
  const ElemT* weight_ptr = nullptr;
  if (options.is_weighted()) {
    weight_ptr = weights.data().get();
  }
  using InputT = ElemT;
  using OutputT = ElemT;
  EmbeddingForward<InputT, OutputT, IndexT, OffsetT, fp16_math>(
      embedding.data().get(),
      options.embed_width(),
      indices.data().get(),
      offsets_ptr,
      weight_ptr,
      options.batch_size(),
      hotness,
      options.combine_mode(),
      result->data().get());
}
```

--------------------------------

### Include Directories with CMake

Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt

Includes directories from the cuEmbed project source directory for compilation. This is a standard CMake command.

```cmake
include_directories(${CUEMBED_PROJECT_SOURCE_DIR})
```

--------------------------------

### Find CUDA Toolkit Package

Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt

Locates the CUDAToolkit package using CMake's find_package command. This makes CUDA compiler and libraries available for use.

```cmake
find_package(CUDAToolkit)
```

--------------------------------

### Find CUDA Toolkit in CMake

Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt

This command searches for the CUDA Toolkit installation on the system. It's a prerequisite for using CUDA-specific features in CMake. Requires the CUDA Toolkit to be installed and discoverable by CMake.

```cmake
find_package(CUDAToolkit REQUIRED)
```

--------------------------------

### Manual cuEmbed Benchmark Execution

Source: https://github.com/nvidia/cuembed/blob/main/README.md

Illustrates how to manually run a specific cuEmbed benchmark test case using the `manual_benchmark` binary. It includes various command-line arguments to configure the benchmark parameters such as embedding dimensions, batch size, and data types.

```bash
./bin/benchmarks/manual_benchmark --num_categories 10000000 --embed_width 256 --batch_size 65536 --alpha=1.15 --hotness=64 --csr_input=false --half_embedding_type=true --weighted_sum=false --compressed_grad=true
```

--------------------------------

### Run cuEmbed Benchmarks (Bash)

Source: https://github.com/nvidia/cuembed/blob/main/README.md

Shows how to execute the cuEmbed benchmark suite by navigating to the benchmarks directory and running the sweep_parameters.sh script. This script is used for comprehensive benchmark testing.

```bash
cd benchmarks/
./sweep_parameters.sh
```

--------------------------------

### Running Performance Benchmarks

Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md

Command to navigate to the benchmarks directory and execute the sweep_parameters.sh script for performance testing. This helps identify performance regressions.

```shell
cd benchmarks ; ./sweep_parameters.sh
```

--------------------------------

### Build cuEmbed from Source (Bash)

Source: https://github.com/nvidia/cuembed/blob/main/README.md

Provides the bash commands to clone the cuEmbed repository recursively, create a build directory, and configure the build using CMake with release type. This is for building tests and benchmarks.

```bash
git clone --recursive https://gitlab-master.nvidia.com/compute/psx/recommender/cuembed
cd cuembed
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
```

--------------------------------

### Building Correctness Tests with CMake

Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md

Instructions on how to build correctness tests for the cuEmbed project by enabling the BUILD_TESTS option in CMake. This ensures that new code additions have corresponding tests.

```cmake
set(BUILD_TESTS ON CACHE BOOL "Build correctness tests")
```

--------------------------------

### Configure and Link Test Executables (CMake)

Source: https://github.com/nvidia/cuembed/blob/main/tests/CMakeLists.txt

This CMake code iterates through each identified test source file. For each file, it creates an executable, sets include directories (including Googletest and CUDA), and links it with essential libraries such as `cuembed_hdrs`, `gtest`, `cuda`, and `absl::log`. It also specifies the output directory for the executables.

```cmake
foreach(test_file ${test_source_files})
    get_filename_component(test_name ${test_file} NAME_WE)
    add_executable(${test_name} ${test_file})
    target_include_directories(${test_name} PRIVATE googletest ${CUDAToolkit_INCLUDE_DIRS})
    target_link_libraries(
        ${test_name} PRIVATE
        cuembed_hdrs
        gtest
        gtest_main
        cuda
        utils
        absl::log 
        absl::check)
    set_target_properties(${test_name}
        PROPERTIES
        RUNTIME_OUTPUT_DIRECTORY "${PROJECT_BINARY_DIR}/bin/test"
    )
endforeach(test_file)
```

--------------------------------

### Project Configuration

Source: https://github.com/nvidia/cuembed/blob/main/CMakeLists.txt

Sets the minimum CMake version, C++ standard, project name, and enables CUDA compilation. It also defines CUDA architectures and compiler flags for optimized performance.

```cmake
cmake_minimum_required(VERSION 3.23)
set(CMAKE_CXX_STANDARD 17)
project(cuembed CXX CUDA)
enable_language(CUDA)

set(CMAKE_CUDA_ARCHITECTURES 70 75 80 90)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --ptxas-options=-v")
```

--------------------------------

### Find and Compile Test Source Files (CMake)

Source: https://github.com/nvidia/cuembed/blob/main/tests/CMakeLists.txt

This snippet uses CMake's `GLOB` command to find all `.cpp` and `.cu` files within the 'tests' directory. It then combines these into a single list named `test_source_files` for subsequent processing.

```cmake
file(GLOB test_source_files_cpp "${CMAKE_SOURCE_DIR}/tests/*.cpp")
file(GLOB test_source_files_cu  "${CMAKE_SOURCE_DIR}/tests/*.cu")
set(test_source_files ${test_source_files_cpp} ${test_source_files_cu})
```

--------------------------------

### Signing Commits with Git

Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md

Instructions for signing git commits using the -s or --signoff flags. This action certifies the contribution's eligibility for open-source licensing.

```git
git commit -s
```

```git
git commit --signoff
```

--------------------------------

### Link Libraries for CUDA Executable in CMake

Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt

Links the 'manual_benchmark' target against several libraries, including CUDA-specific libraries ('cuda', 'cuembed_hdrs') and other dependencies like Abseil. This ensures all necessary components are available at runtime and link time. Requires libraries to be correctly defined and available.

```cmake
target_link_libraries(
  manual_benchmark PRIVATE 
  cuembed_hdrs
  cuda 
  utils 
  absl::log 
  absl::log_initialize
  absl::check
  absl::flags 
  absl::flags_parse)
```

--------------------------------

### Add cuEmbed to CMake Project

Source: https://github.com/nvidia/cuembed/blob/main/README.md

Demonstrates how to add the cuEmbed library to a CMake project using the CPM.cmake package manager. It fetches the library from a GitLab repository and links it to the target library.

```cmake
CPMAddPackage(
  NAME cuembed
  GIT_REPOSITORY https://rep_ro:${GITLAB_TOKEN}@gitlab-master.nvidia.com/compute/psx/recommender/cuembed.git
  GIT_TAG main
  OPTIONS
    "BUILD_TESTS OFF"
    "BUILD_BENCHMARKS OFF"
)

target_link_libraries(my_library ${cuembed_SOURCE_DIR})
```

--------------------------------

### Set Runtime Output Directory in CMake

Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt

Configures the 'manual_benchmark' target to place its executable output in a specific subdirectory ('${CMAKE_BINARY_DIR}/bin/benchmarks'). This helps organize build artifacts. It sets a property on the target.

```cmake
set_target_properties(manual_benchmark
  PROPERTIES 
  RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/bin/benchmarks")
```

--------------------------------

### Transpose

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Reorders indices from sample-id-first to table-index-first ordering for backward passes. Output indices are in coordinate (COO) format.

```APIDOC
## Transpose

### Description
Reorders indices from sample-id-first ordering as is needed during forward to table-index-first ordering needed for backward. Output indices are produced in coordinate (COO) format.

### Method
`template <typename IndexT, typename WeightT> void Transpose(...)`

### Endpoint
N/A (This is a C++ library function)

### Parameters
#### Input Parameters
- **rows** (*const IndexT**) - Pointer to the lookup indices.
- **cols** (*const IndexT**) - Pointer to the offsets (CSR format) used during forward. Must be nullptr when launching for fixed hotness.
- **weights** (*const WeightT**) - Pointer to the weight array used during forward. If nullptr, will not produce transposed weights.
- **nnz** (*const int*) - Number of nonzeros.
- **work** (*char**) - Pointer to scratch workspace. Set to nullptr for workspace query.
- **lwork** (*size_t**) - Pointer to size of scratch workspace.
- **stream** (*const cudaStream_t*) - Optional. The cudaStream to launch the kernel asynchronously. If not specified, will launch the kernel on default stream.

#### Output Parameters
- **transpose_rows** (*IndexT**) - Pointer to the output transposed table indices.
- **transpose_cols** (*IndexT**) - Pointer to the output transposed sparse indices.
- **transpose_weights** (*WeightT**) - Pointer to the transposed weight array. If input weights is nullptr, then will not produce transposed weights.

### Usage Notes
- The function should first be called with `work` set to nullptr to perform a workspace query. The required size of `work` array in bytes will be returned in `lwork`. Then the function should be called a second time with `work` pointing to allocated workspace of size `lwork`.
- If input `weights` are set to nullptr, then output `transpose_weights` will not be set.
- For the embedding use case, `transpose_rows` contains the embedding lookup indices which are now in sorted order. `transpose_cols` and `transpose_weights` contain the sample IDs and optionally the weights corresponding to the reordered rows.

### Example
(Code example not provided in the source text for this function)
```

--------------------------------

### Convert Dense to Compressed Gradient Indices (C++)

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Remaps dense embedding row IDs to compressed embedding row IDs, optimizing gradient storage when only a subset of embedding rows are referenced. The function requires an initial workspace query to determine the necessary buffer size.

```cpp
template <typename IndexT>
void ComputeCompressedGradIndices(const IndexT* indices,
                              const int nnz,
                              IndexT* remapped_indices,
                              char* work,
                              size_t* lwork,
                              const cudaStream_t stream = 0)
```

--------------------------------

### cuEmbed Forward Propagation Kernel

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Implements the forward pass for embedding lookups. It accepts an embedding table, lookup indices (fixed-hotness or CSR), and a reduction mode. The kernel aims to maximize memory bandwidth utilization. The 'fp16_math' parameter controls whether fp16 math operations are used for performance.

```cpp
template <typename InputT,
          typename OutputT,
          typename IndexT,
          typename OffsetT,
          bool fp16_math>
void EmbeddingForward(const InputT* params,
                      const int embed_width,
                      const IndexT* indices,
                      const OffsetT* offsets,
                      const typename GetElemT<InputT>* weights,
                      const int batch_size,
                      const int num_hots,
                      const CombineMode mode,
                      OutputT* ret,
                      const cudaStream_t stream = 0) 
```

--------------------------------

### cuEmbed Backward Operation Signature (C++)

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Defines the function signature for the backward embedding operation. It accepts incoming gradients, embedding dimensions, indices, and optional weights to compute gradients with respect to the embedding table. Supports both full and compressed gradient calculations.

```cpp
template <typename GradT, typename IndexT>
void EmbeddingBackward(const GradT* grad_y,
                       const int embed_width,
                       const int num_grad_embedding_rows,
                       const int nnz,
                       const IndexT* transpose_indices,
                       const IndexT* transpose_sample_ids,
                       const IndexT* transpose_remapped_indices,
                       const GradT* transpose_weights,
                       const bool skip_grad_init,
                       GradT* grad_embedding,
                       IndexT* inverse_mapping,
                       const cudaStream_t stream = 0)
```

--------------------------------

### Create Executable with CUDA Source in CMake

Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt

Defines an executable target named 'manual_benchmark' using a CUDA source file ('manual_benchmark.cu'). CMake automatically handles the compilation of CUDA code. The target is built as an executable.

```cmake
add_executable(manual_benchmark manual_benchmark.cu)
```

--------------------------------

### Backward Embedding Operation

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Handles the backward pass for embeddings, calculating gradients based on incoming gradients and indices. It supports both dense and compressed gradient outputs.

```APIDOC
## void EmbeddingBackward

### Description
Performs the backward pass for embedding lookups. It computes gradients with respect to the embedding table using incoming gradients (`grad_y`) and transposed indices. Supports summation of gradients for duplicate indices and can output either full (dense) or compressed gradients.

### Method
```
void EmbeddingBackward<
    typename GradT,
    typename IndexT
>(
    const GradT* grad_y,
    const int embed_width,
    const int num_grad_embedding_rows,
    const int nnz,
    const IndexT* transpose_indices,
    const IndexT* transpose_sample_ids,
    const IndexT* transpose_remapped_indices,
    const GradT* transpose_weights,
    const bool skip_grad_init,
    GradT* grad_embedding,
    IndexT* inverse_mapping,
    const cudaStream_t stream = 0
);
```

### Parameters
#### Path Parameters
* None

#### Query Parameters
* None

#### Request Body
* **grad_y** (const GradT*) - Pointer to the incoming gradient tensor.
* **embed_width** (int) - The dimensionality of each embedding vector.
* **num_grad_embedding_rows** (int) - The number of rows to allocate for the output gradient embedding table.
* **nnz** (int) - The total number of non-zero indices (elements) in the coordinate format inputs.
* **transpose_indices** (const IndexT*) - Pointer to the transposed lookup indices in COO format. Must group repeating indices contiguously.
* **transpose_sample_ids** (const IndexT*) - Pointer to the transposed sample IDs in COO format.
* **transpose_remapped_indices** (const IndexT*) - Pointer to remapped indices for compressed gradients (from `ComputeCompressedGradIndices`). Set to `nullptr` for dense gradients.
* **transpose_weights** (const GradT*) - Pointer to weights to be multiplied with `grad_y` before accumulation. Can be `nullptr` for unweighted accumulation.
* **skip_grad_init** (bool) - If `true`, skips the initialization of `grad_embedding` to zero.
* **grad_embedding** (GradT*) - Pointer to the output gradient with respect to the embedding table. Allocation size depends on dense vs. compressed output.
* **inverse_mapping** (IndexT*) - Pointer to store the mapping from compressed gradient rows to original embedding table IDs. Only used for compressed gradients.
* **stream** (cudaStream_t) - The CUDA stream to launch the kernel on. Defaults to the null stream.

### Request Example
```json
{
  "grad_y": null, 
  "embed_width": 128,
  "num_grad_embedding_rows": 10000,
  "nnz": 50000,
  "transpose_indices": null,
  "transpose_sample_ids": null,
  "transpose_remapped_indices": null,
  "transpose_weights": null,
  "skip_grad_init": false,
  "grad_embedding": null,
  "inverse_mapping": null,
  "stream": null
}
```

### Response
#### Success Response (void)
* **grad_embedding** (GradT*) - Output gradient w.r.t. the embedding table.
* **inverse_mapping** (IndexT*) - Output mapping for compressed gradients.

#### Response Example
*No direct response body; gradients are written to the provided `grad_embedding` and `inverse_mapping` pointers.*

### Notes
- For dense gradients: `transpose_remapped_indices` should be `nullptr`, `num_grad_embedding_rows` should be the total number of categories, and `inverse_mapping` is unused.
- For compressed gradients: `transpose_remapped_indices` must be provided, `num_grad_embedding_rows` should be the number of unique lookup indices, and `inverse_mapping` must be allocated.
```

--------------------------------

### EmbeddingForward

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Performs the forward pass of the embedding lookup operation. It accepts an embedding table and lookup indices in either fixed-hotness or CSR format, and supports various reduction modes.

```APIDOC
## EmbeddingForward

### Description
Performs the forward pass of the embedding lookup operation. It accepts an embedding table and lookup indices in either fixed-hotness or CSR format, and supports various reduction modes.

### Method
`void EmbeddingForward(...)`

### Endpoint
N/A (This is a C++ function signature, not a REST endpoint)

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
None (Function parameters are used)

**Function Signature:**
```cpp
template <typename InputT,
          typename OutputT,
          typename IndexT,
          typename OffsetT,
          bool fp16_math>
void EmbeddingForward(
    const InputT* params,              // Pointer to the embedding table data.
    const int embed_width,             // Number of elements in each embedding row.
    const IndexT* indices,             // Pointer to the lookup indices.
    const OffsetT* offsets,            // Pointer to the offsets (CSR format). Must be nullptr when launching for fixed hotness.
    const typename GetElemT<InputT>* weights, // Pointer to the weight array. Can be nullptr for plain reduction.
    const int batch_size,              // Batch size of the embedding lookup workload.
    const int num_hots,                // Number of rows to lookup for each sample. Must be 0 for CSR indices.
    const CombineMode mode,            // Reduction mode: kSum, kMean, or kConcat.
    OutputT* ret,                      // Pointer to the output location.
    const cudaStream_t stream = 0      // Optional CUDA stream for asynchronous execution.
)
```

**Notes on Parameters:**
- For fixed hotness indices, `num_hots` specifies the hotness value, and `offsets` must be `nullptr`.
- For CSR indices, `num_hots` must be 0, and `offsets` points to the explicit offset array.
- If `weights` is `nullptr`, a plain reduction is performed. Otherwise, weights are applied before reduction.
- When `fp16_math` is true, fp16 math operations are performed in fp16.

### Request Example
None (This is a C++ function)

### Response
#### Success Response
Outputs are written to the location pointed to by `ret`.

#### Response Example
None (This is a C++ function)
```

--------------------------------

### Transpose Indices for Embedding Backward Pass (C++)

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Reorders indices from sample-id-first to table-index-first ordering, essential for the backward pass in embedding lookups. Outputs indices in Coordinate (COO) format. It requires an initial workspace query before the actual computation.

```cpp
template <typename IndexT, typename WeightT>
void Transpose(const IndexT* rows,
               const IndexT* cols,
               const WeightT* weights,
               const int nnz,
               IndexT* transpose_rows,
               IndexT* transpose_cols,
               WeightT* transpose_weights,
               char* work,
               size_t* lwork,
               const cudaStream_t stream = 0);
```

--------------------------------

### Set CUDA Include Directories in CMake

Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt

Adds the include directories provided by the CUDA Toolkit to the build target 'manual_benchmark'. This allows the compiler to find CUDA headers. Depends on `find_package(CUDAToolkit)` having been called successfully.

```cmake
target_include_directories(manual_benchmark PRIVATE ${CUDAToolkit_INCLUDE_DIRS})
```

--------------------------------

### Extract Row IDs from CSR Format to COO

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Transforms Compressed Sparse Row (CSR) formatted indices into Coordinate (COO) format. It uses an 'offsets' array to determine the row IDs for each non-zero element, mapping them to the correct sample index. This is crucial for backward passes requiring vertical summation of gradients. It requires CUDA streams for execution.

```cpp
template <typename IndexT, typename OffsetT>
void ExtractRowIdsFromCSR(const OffsetT* offsets,
                          const int batch_size,
                          IndexT* row_ids,
                          const cudaStream_t stream = 0);
```

--------------------------------

### Extract Row IDs from Fixed-Hotness Format to COO

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Converts indices from a fixed-hotness format to a Coordinate (COO) format. It generates a 'row_ids' array where each entry corresponds to the sample ID. This is useful when the input data is structured with a fixed number of embedding categories per sample. Dependencies include CUDA streams for asynchronous execution.

```cpp
template <typename IndexT>
void ExtractRowIdsFromFixed(const int batch_size,
                            const int num_hots,
                            IndexT* row_ids,
                            const cudaStream_t stream = 0);
```

--------------------------------

### ComputeCompressedGradIndices

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Remaps indices from dense embedding row IDs to compressed embedding row IDs, useful when the number of referenced rows is small.

```APIDOC
## ComputeCompressedGradIndices

### Description
Remaps indices from dense embedding row IDs to compressed embedding row IDs. This is advantageous when the number of embedding rows actually referenced by the indices is much smaller than the total number of rows, allowing for a compressed gradient storage.

### Method
`template <typename IndexT> void ComputeCompressedGradIndices(...)`

### Endpoint
N/A (This is a C++ library function)

### Parameters
#### Input Parameters
- **indices** (*const IndexT**) - Pointer to the lookup indices, grouped by index.
- **nnz** (*const int*) - Length of the indices array.
- **work** (*char**) - Temporary workspace. Set to nullptr for workspace query.
- **lwork** (*size_t**) - Size of workspace in bytes (input/output).
- **stream** (*const cudaStream_t*) - Optional. The cudaStream to launch the kernel asynchronously. If not specified, will launch the kernel on default stream.

#### Output Parameters
- **remapped_indices** (*IndexT**) - Pointer to the remapped lookup indices (output).

### Usage Notes
- The function should first be called with `work` set to nullptr to perform a workspace query. The required size of `work` array in bytes will be returned in `lwork`. Then the function should be called a second time with `work` pointing to allocated workspace of size `lwork`.
- The value `num_unique` (the number of unique indices) can be attained from `remapped_indices.back() + 1` after calling this function.

### Example
(Code example not provided in the source text for this function)
```

--------------------------------

### Extract Row IDs for Concatenation (COO)

Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md

Generates a sequence of row IDs for concatenation operations in Coordinate (COO) format. It produces a simple sequential array of row IDs from 0 to nnz-1, where nnz is the number of non-zero elements. This function is typically used when preparing data for operations that require flattened or concatenated index structures. It utilizes CUDA streams for asynchronous processing.

```cpp
template <typename IndexT>
void ExtractRowIdsForConcat(const int nnz, 
                            IndexT* row_ids,
                            const cudaStream_t stream = 0);
```

=== COMPLETE CONTENT === This response contains all available snippets from this library. No additional content exists. Do not make further requests.