### Installing and Running Pre-commit Hooks Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md Steps to install pre-commit hooks and run formatting and linting checks on all files. This ensures code adheres to project-specific style guidelines. ```shell pip install pre-commit && pre-commit install ``` ```shell pre-commit --files all ``` -------------------------------- ### Build Options Source: https://github.com/nvidia/cuembed/blob/main/CMakeLists.txt Defines boolean options to control the building of tests, benchmarks, and examples. Defaults are set to ON for tests and benchmarks, and OFF for examples. ```cmake option(BUILD_TESTS "Build the tests" ON) option(BUILD_BENCHMARKS "Build the benchmarks" ON) option(BUILD_EXAMPLES "Build examples" OFF) ``` -------------------------------- ### Link Libraries and Include CUDA Headers for Utils Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt Configures the 'utils' library to use CUDA include directories and links against CUDA runtime, Google Test, and Abseil libraries. This ensures all dependencies are resolved. ```cmake target_include_directories(utils PRIVATE ${CUDAToolkit_INCLUDE_DIRS} absl::log absl::check gtest) target_link_libraries(utils PRIVATE CUDA::cudart absl::log absl::check gtest) ``` -------------------------------- ### Dependency and Library Setup Source: https://github.com/nvidia/cuembed/blob/main/CMakeLists.txt Finds the CUDAToolkit package and sets up interface libraries for headers. It conditionally includes subdirectories for third-party libraries (like Abseil and GTest) and project-specific modules (utils, tests, benchmarks, examples) based on build options. ```cmake find_package(CUDAToolkit) set(CUEMBED_PROJECT_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}) add_library(cuembed_hdrs INTERFACE ${cuembed_source_files}) target_include_directories(cuembed_hdrs INTERFACE ${CUEMBED_PROJECT_SOURCE_DIR}) add_library(cuembed::hdrs ALIAS cuembed_hdrs) if (BUILD_TESTS OR BUILD_BENCHMARKS) # TODO(zejiaz): move to CPM instead of submodule add_subdirectory(third_party/abseil-cpp) # Utility library for benchmarking and testing. add_subdirectory(utils) endif() # Setup tests if(BUILD_TESTS) add_subdirectory(third_party/gtest) add_subdirectory(tests) endif() # Benchmarks. if (BUILD_BENCHMARKS) add_subdirectory(benchmarks) endif() # Examples if (BUILD_EXAMPLES) add_subdirectory(examples/pytorch) endif() ``` -------------------------------- ### Create Utils Library with CUDA and C++ Sources Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt Compiles multiple CUDA (.cu) and C++ (.cpp) source files into an OBJECT library named 'utils'. This library contains the core embedding functionalities. ```cmake add_library(utils OBJECT src/embedding_allocation.cu src/embedding_gpu_forward.cu src/embedding_gpu_transpose.cu src/embedding_gpu_backward.cu src/embedding_cpu.cu src/datagen.cpp) ``` -------------------------------- ### Forward Propagation Example in C++ CUDA Source: https://github.com/nvidia/cuembed/blob/main/README.md Provides an example of running forward propagation for embedding lookups using the cuEmbed host API. It takes embedding data, indices, offsets, and weights as input and writes the results to an output vector. ```cpp template void RunForward(const utils::AllocationOptions& options, const thrust::device_vector& embedding, const thrust::device_vector& indices, const thrust::device_vector& offsets, const thrust::device_vector& weights, thrust::device_vector* result) { const int* offsets_ptr = nullptr; int hotness = options.hotness(); if (options.is_csr()) { offsets_ptr = offsets.data().get(); hotness = 0; } const ElemT* weight_ptr = nullptr; if (options.is_weighted()) { weight_ptr = weights.data().get(); } using InputT = ElemT; using OutputT = ElemT; EmbeddingForward( embedding.data().get(), options.embed_width(), indices.data().get(), offsets_ptr, weight_ptr, options.batch_size(), hotness, options.combine_mode(), result->data().get()); } ``` -------------------------------- ### Include Directories with CMake Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt Includes directories from the cuEmbed project source directory for compilation. This is a standard CMake command. ```cmake include_directories(${CUEMBED_PROJECT_SOURCE_DIR}) ``` -------------------------------- ### Find CUDA Toolkit Package Source: https://github.com/nvidia/cuembed/blob/main/utils/CMakeLists.txt Locates the CUDAToolkit package using CMake's find_package command. This makes CUDA compiler and libraries available for use. ```cmake find_package(CUDAToolkit) ``` -------------------------------- ### Find CUDA Toolkit in CMake Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt This command searches for the CUDA Toolkit installation on the system. It's a prerequisite for using CUDA-specific features in CMake. Requires the CUDA Toolkit to be installed and discoverable by CMake. ```cmake find_package(CUDAToolkit REQUIRED) ``` -------------------------------- ### Manual cuEmbed Benchmark Execution Source: https://github.com/nvidia/cuembed/blob/main/README.md Illustrates how to manually run a specific cuEmbed benchmark test case using the `manual_benchmark` binary. It includes various command-line arguments to configure the benchmark parameters such as embedding dimensions, batch size, and data types. ```bash ./bin/benchmarks/manual_benchmark --num_categories 10000000 --embed_width 256 --batch_size 65536 --alpha=1.15 --hotness=64 --csr_input=false --half_embedding_type=true --weighted_sum=false --compressed_grad=true ``` -------------------------------- ### Run cuEmbed Benchmarks (Bash) Source: https://github.com/nvidia/cuembed/blob/main/README.md Shows how to execute the cuEmbed benchmark suite by navigating to the benchmarks directory and running the sweep_parameters.sh script. This script is used for comprehensive benchmark testing. ```bash cd benchmarks/ ./sweep_parameters.sh ``` -------------------------------- ### Running Performance Benchmarks Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md Command to navigate to the benchmarks directory and execute the sweep_parameters.sh script for performance testing. This helps identify performance regressions. ```shell cd benchmarks ; ./sweep_parameters.sh ``` -------------------------------- ### Build cuEmbed from Source (Bash) Source: https://github.com/nvidia/cuembed/blob/main/README.md Provides the bash commands to clone the cuEmbed repository recursively, create a build directory, and configure the build using CMake with release type. This is for building tests and benchmarks. ```bash git clone --recursive https://gitlab-master.nvidia.com/compute/psx/recommender/cuembed cd cuembed mkdir build cd build cmake -DCMAKE_BUILD_TYPE=Release .. make ``` -------------------------------- ### Building Correctness Tests with CMake Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md Instructions on how to build correctness tests for the cuEmbed project by enabling the BUILD_TESTS option in CMake. This ensures that new code additions have corresponding tests. ```cmake set(BUILD_TESTS ON CACHE BOOL "Build correctness tests") ``` -------------------------------- ### Configure and Link Test Executables (CMake) Source: https://github.com/nvidia/cuembed/blob/main/tests/CMakeLists.txt This CMake code iterates through each identified test source file. For each file, it creates an executable, sets include directories (including Googletest and CUDA), and links it with essential libraries such as `cuembed_hdrs`, `gtest`, `cuda`, and `absl::log`. It also specifies the output directory for the executables. ```cmake foreach(test_file ${test_source_files}) get_filename_component(test_name ${test_file} NAME_WE) add_executable(${test_name} ${test_file}) target_include_directories(${test_name} PRIVATE googletest ${CUDAToolkit_INCLUDE_DIRS}) target_link_libraries( ${test_name} PRIVATE cuembed_hdrs gtest gtest_main cuda utils absl::log absl::check) set_target_properties(${test_name} PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${PROJECT_BINARY_DIR}/bin/test" ) endforeach(test_file) ``` -------------------------------- ### Project Configuration Source: https://github.com/nvidia/cuembed/blob/main/CMakeLists.txt Sets the minimum CMake version, C++ standard, project name, and enables CUDA compilation. It also defines CUDA architectures and compiler flags for optimized performance. ```cmake cmake_minimum_required(VERSION 3.23) set(CMAKE_CXX_STANDARD 17) project(cuembed CXX CUDA) enable_language(CUDA) set(CMAKE_CUDA_ARCHITECTURES 70 75 80 90) set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --ptxas-options=-v") ``` -------------------------------- ### Find and Compile Test Source Files (CMake) Source: https://github.com/nvidia/cuembed/blob/main/tests/CMakeLists.txt This snippet uses CMake's `GLOB` command to find all `.cpp` and `.cu` files within the 'tests' directory. It then combines these into a single list named `test_source_files` for subsequent processing. ```cmake file(GLOB test_source_files_cpp "${CMAKE_SOURCE_DIR}/tests/*.cpp") file(GLOB test_source_files_cu "${CMAKE_SOURCE_DIR}/tests/*.cu") set(test_source_files ${test_source_files_cpp} ${test_source_files_cu}) ``` -------------------------------- ### Signing Commits with Git Source: https://github.com/nvidia/cuembed/blob/main/CONTRIBUTING.md Instructions for signing git commits using the -s or --signoff flags. This action certifies the contribution's eligibility for open-source licensing. ```git git commit -s ``` ```git git commit --signoff ``` -------------------------------- ### Link Libraries for CUDA Executable in CMake Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt Links the 'manual_benchmark' target against several libraries, including CUDA-specific libraries ('cuda', 'cuembed_hdrs') and other dependencies like Abseil. This ensures all necessary components are available at runtime and link time. Requires libraries to be correctly defined and available. ```cmake target_link_libraries( manual_benchmark PRIVATE cuembed_hdrs cuda utils absl::log absl::log_initialize absl::check absl::flags absl::flags_parse) ``` -------------------------------- ### Add cuEmbed to CMake Project Source: https://github.com/nvidia/cuembed/blob/main/README.md Demonstrates how to add the cuEmbed library to a CMake project using the CPM.cmake package manager. It fetches the library from a GitLab repository and links it to the target library. ```cmake CPMAddPackage( NAME cuembed GIT_REPOSITORY https://rep_ro:${GITLAB_TOKEN}@gitlab-master.nvidia.com/compute/psx/recommender/cuembed.git GIT_TAG main OPTIONS "BUILD_TESTS OFF" "BUILD_BENCHMARKS OFF" ) target_link_libraries(my_library ${cuembed_SOURCE_DIR}) ``` -------------------------------- ### Set Runtime Output Directory in CMake Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt Configures the 'manual_benchmark' target to place its executable output in a specific subdirectory ('${CMAKE_BINARY_DIR}/bin/benchmarks'). This helps organize build artifacts. It sets a property on the target. ```cmake set_target_properties(manual_benchmark PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/bin/benchmarks") ``` -------------------------------- ### Transpose Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Reorders indices from sample-id-first to table-index-first ordering for backward passes. Output indices are in coordinate (COO) format. ```APIDOC ## Transpose ### Description Reorders indices from sample-id-first ordering as is needed during forward to table-index-first ordering needed for backward. Output indices are produced in coordinate (COO) format. ### Method `template void Transpose(...)` ### Endpoint N/A (This is a C++ library function) ### Parameters #### Input Parameters - **rows** (*const IndexT**) - Pointer to the lookup indices. - **cols** (*const IndexT**) - Pointer to the offsets (CSR format) used during forward. Must be nullptr when launching for fixed hotness. - **weights** (*const WeightT**) - Pointer to the weight array used during forward. If nullptr, will not produce transposed weights. - **nnz** (*const int*) - Number of nonzeros. - **work** (*char**) - Pointer to scratch workspace. Set to nullptr for workspace query. - **lwork** (*size_t**) - Pointer to size of scratch workspace. - **stream** (*const cudaStream_t*) - Optional. The cudaStream to launch the kernel asynchronously. If not specified, will launch the kernel on default stream. #### Output Parameters - **transpose_rows** (*IndexT**) - Pointer to the output transposed table indices. - **transpose_cols** (*IndexT**) - Pointer to the output transposed sparse indices. - **transpose_weights** (*WeightT**) - Pointer to the transposed weight array. If input weights is nullptr, then will not produce transposed weights. ### Usage Notes - The function should first be called with `work` set to nullptr to perform a workspace query. The required size of `work` array in bytes will be returned in `lwork`. Then the function should be called a second time with `work` pointing to allocated workspace of size `lwork`. - If input `weights` are set to nullptr, then output `transpose_weights` will not be set. - For the embedding use case, `transpose_rows` contains the embedding lookup indices which are now in sorted order. `transpose_cols` and `transpose_weights` contain the sample IDs and optionally the weights corresponding to the reordered rows. ### Example (Code example not provided in the source text for this function) ``` -------------------------------- ### Convert Dense to Compressed Gradient Indices (C++) Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Remaps dense embedding row IDs to compressed embedding row IDs, optimizing gradient storage when only a subset of embedding rows are referenced. The function requires an initial workspace query to determine the necessary buffer size. ```cpp template void ComputeCompressedGradIndices(const IndexT* indices, const int nnz, IndexT* remapped_indices, char* work, size_t* lwork, const cudaStream_t stream = 0) ``` -------------------------------- ### cuEmbed Forward Propagation Kernel Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Implements the forward pass for embedding lookups. It accepts an embedding table, lookup indices (fixed-hotness or CSR), and a reduction mode. The kernel aims to maximize memory bandwidth utilization. The 'fp16_math' parameter controls whether fp16 math operations are used for performance. ```cpp template void EmbeddingForward(const InputT* params, const int embed_width, const IndexT* indices, const OffsetT* offsets, const typename GetElemT* weights, const int batch_size, const int num_hots, const CombineMode mode, OutputT* ret, const cudaStream_t stream = 0) ``` -------------------------------- ### cuEmbed Backward Operation Signature (C++) Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Defines the function signature for the backward embedding operation. It accepts incoming gradients, embedding dimensions, indices, and optional weights to compute gradients with respect to the embedding table. Supports both full and compressed gradient calculations. ```cpp template void EmbeddingBackward(const GradT* grad_y, const int embed_width, const int num_grad_embedding_rows, const int nnz, const IndexT* transpose_indices, const IndexT* transpose_sample_ids, const IndexT* transpose_remapped_indices, const GradT* transpose_weights, const bool skip_grad_init, GradT* grad_embedding, IndexT* inverse_mapping, const cudaStream_t stream = 0) ``` -------------------------------- ### Create Executable with CUDA Source in CMake Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt Defines an executable target named 'manual_benchmark' using a CUDA source file ('manual_benchmark.cu'). CMake automatically handles the compilation of CUDA code. The target is built as an executable. ```cmake add_executable(manual_benchmark manual_benchmark.cu) ``` -------------------------------- ### Backward Embedding Operation Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Handles the backward pass for embeddings, calculating gradients based on incoming gradients and indices. It supports both dense and compressed gradient outputs. ```APIDOC ## void EmbeddingBackward ### Description Performs the backward pass for embedding lookups. It computes gradients with respect to the embedding table using incoming gradients (`grad_y`) and transposed indices. Supports summation of gradients for duplicate indices and can output either full (dense) or compressed gradients. ### Method ``` void EmbeddingBackward< typename GradT, typename IndexT >( const GradT* grad_y, const int embed_width, const int num_grad_embedding_rows, const int nnz, const IndexT* transpose_indices, const IndexT* transpose_sample_ids, const IndexT* transpose_remapped_indices, const GradT* transpose_weights, const bool skip_grad_init, GradT* grad_embedding, IndexT* inverse_mapping, const cudaStream_t stream = 0 ); ``` ### Parameters #### Path Parameters * None #### Query Parameters * None #### Request Body * **grad_y** (const GradT*) - Pointer to the incoming gradient tensor. * **embed_width** (int) - The dimensionality of each embedding vector. * **num_grad_embedding_rows** (int) - The number of rows to allocate for the output gradient embedding table. * **nnz** (int) - The total number of non-zero indices (elements) in the coordinate format inputs. * **transpose_indices** (const IndexT*) - Pointer to the transposed lookup indices in COO format. Must group repeating indices contiguously. * **transpose_sample_ids** (const IndexT*) - Pointer to the transposed sample IDs in COO format. * **transpose_remapped_indices** (const IndexT*) - Pointer to remapped indices for compressed gradients (from `ComputeCompressedGradIndices`). Set to `nullptr` for dense gradients. * **transpose_weights** (const GradT*) - Pointer to weights to be multiplied with `grad_y` before accumulation. Can be `nullptr` for unweighted accumulation. * **skip_grad_init** (bool) - If `true`, skips the initialization of `grad_embedding` to zero. * **grad_embedding** (GradT*) - Pointer to the output gradient with respect to the embedding table. Allocation size depends on dense vs. compressed output. * **inverse_mapping** (IndexT*) - Pointer to store the mapping from compressed gradient rows to original embedding table IDs. Only used for compressed gradients. * **stream** (cudaStream_t) - The CUDA stream to launch the kernel on. Defaults to the null stream. ### Request Example ```json { "grad_y": null, "embed_width": 128, "num_grad_embedding_rows": 10000, "nnz": 50000, "transpose_indices": null, "transpose_sample_ids": null, "transpose_remapped_indices": null, "transpose_weights": null, "skip_grad_init": false, "grad_embedding": null, "inverse_mapping": null, "stream": null } ``` ### Response #### Success Response (void) * **grad_embedding** (GradT*) - Output gradient w.r.t. the embedding table. * **inverse_mapping** (IndexT*) - Output mapping for compressed gradients. #### Response Example *No direct response body; gradients are written to the provided `grad_embedding` and `inverse_mapping` pointers.* ### Notes - For dense gradients: `transpose_remapped_indices` should be `nullptr`, `num_grad_embedding_rows` should be the total number of categories, and `inverse_mapping` is unused. - For compressed gradients: `transpose_remapped_indices` must be provided, `num_grad_embedding_rows` should be the number of unique lookup indices, and `inverse_mapping` must be allocated. ``` -------------------------------- ### EmbeddingForward Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Performs the forward pass of the embedding lookup operation. It accepts an embedding table and lookup indices in either fixed-hotness or CSR format, and supports various reduction modes. ```APIDOC ## EmbeddingForward ### Description Performs the forward pass of the embedding lookup operation. It accepts an embedding table and lookup indices in either fixed-hotness or CSR format, and supports various reduction modes. ### Method `void EmbeddingForward(...)` ### Endpoint N/A (This is a C++ function signature, not a REST endpoint) ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body None (Function parameters are used) **Function Signature:** ```cpp template void EmbeddingForward( const InputT* params, // Pointer to the embedding table data. const int embed_width, // Number of elements in each embedding row. const IndexT* indices, // Pointer to the lookup indices. const OffsetT* offsets, // Pointer to the offsets (CSR format). Must be nullptr when launching for fixed hotness. const typename GetElemT* weights, // Pointer to the weight array. Can be nullptr for plain reduction. const int batch_size, // Batch size of the embedding lookup workload. const int num_hots, // Number of rows to lookup for each sample. Must be 0 for CSR indices. const CombineMode mode, // Reduction mode: kSum, kMean, or kConcat. OutputT* ret, // Pointer to the output location. const cudaStream_t stream = 0 // Optional CUDA stream for asynchronous execution. ) ``` **Notes on Parameters:** - For fixed hotness indices, `num_hots` specifies the hotness value, and `offsets` must be `nullptr`. - For CSR indices, `num_hots` must be 0, and `offsets` points to the explicit offset array. - If `weights` is `nullptr`, a plain reduction is performed. Otherwise, weights are applied before reduction. - When `fp16_math` is true, fp16 math operations are performed in fp16. ### Request Example None (This is a C++ function) ### Response #### Success Response Outputs are written to the location pointed to by `ret`. #### Response Example None (This is a C++ function) ``` -------------------------------- ### Transpose Indices for Embedding Backward Pass (C++) Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Reorders indices from sample-id-first to table-index-first ordering, essential for the backward pass in embedding lookups. Outputs indices in Coordinate (COO) format. It requires an initial workspace query before the actual computation. ```cpp template void Transpose(const IndexT* rows, const IndexT* cols, const WeightT* weights, const int nnz, IndexT* transpose_rows, IndexT* transpose_cols, WeightT* transpose_weights, char* work, size_t* lwork, const cudaStream_t stream = 0); ``` -------------------------------- ### Set CUDA Include Directories in CMake Source: https://github.com/nvidia/cuembed/blob/main/benchmarks/CMakeLists.txt Adds the include directories provided by the CUDA Toolkit to the build target 'manual_benchmark'. This allows the compiler to find CUDA headers. Depends on `find_package(CUDAToolkit)` having been called successfully. ```cmake target_include_directories(manual_benchmark PRIVATE ${CUDAToolkit_INCLUDE_DIRS}) ``` -------------------------------- ### Extract Row IDs from CSR Format to COO Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Transforms Compressed Sparse Row (CSR) formatted indices into Coordinate (COO) format. It uses an 'offsets' array to determine the row IDs for each non-zero element, mapping them to the correct sample index. This is crucial for backward passes requiring vertical summation of gradients. It requires CUDA streams for execution. ```cpp template void ExtractRowIdsFromCSR(const OffsetT* offsets, const int batch_size, IndexT* row_ids, const cudaStream_t stream = 0); ``` -------------------------------- ### Extract Row IDs from Fixed-Hotness Format to COO Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Converts indices from a fixed-hotness format to a Coordinate (COO) format. It generates a 'row_ids' array where each entry corresponds to the sample ID. This is useful when the input data is structured with a fixed number of embedding categories per sample. Dependencies include CUDA streams for asynchronous execution. ```cpp template void ExtractRowIdsFromFixed(const int batch_size, const int num_hots, IndexT* row_ids, const cudaStream_t stream = 0); ``` -------------------------------- ### ComputeCompressedGradIndices Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Remaps indices from dense embedding row IDs to compressed embedding row IDs, useful when the number of referenced rows is small. ```APIDOC ## ComputeCompressedGradIndices ### Description Remaps indices from dense embedding row IDs to compressed embedding row IDs. This is advantageous when the number of embedding rows actually referenced by the indices is much smaller than the total number of rows, allowing for a compressed gradient storage. ### Method `template void ComputeCompressedGradIndices(...)` ### Endpoint N/A (This is a C++ library function) ### Parameters #### Input Parameters - **indices** (*const IndexT**) - Pointer to the lookup indices, grouped by index. - **nnz** (*const int*) - Length of the indices array. - **work** (*char**) - Temporary workspace. Set to nullptr for workspace query. - **lwork** (*size_t**) - Size of workspace in bytes (input/output). - **stream** (*const cudaStream_t*) - Optional. The cudaStream to launch the kernel asynchronously. If not specified, will launch the kernel on default stream. #### Output Parameters - **remapped_indices** (*IndexT**) - Pointer to the remapped lookup indices (output). ### Usage Notes - The function should first be called with `work` set to nullptr to perform a workspace query. The required size of `work` array in bytes will be returned in `lwork`. Then the function should be called a second time with `work` pointing to allocated workspace of size `lwork`. - The value `num_unique` (the number of unique indices) can be attained from `remapped_indices.back() + 1` after calling this function. ### Example (Code example not provided in the source text for this function) ``` -------------------------------- ### Extract Row IDs for Concatenation (COO) Source: https://github.com/nvidia/cuembed/blob/main/cuembed/README.md Generates a sequence of row IDs for concatenation operations in Coordinate (COO) format. It produces a simple sequential array of row IDs from 0 to nnz-1, where nnz is the number of non-zero elements. This function is typically used when preparing data for operations that require flattened or concatenated index structures. It utilizes CUDA streams for asynchronous processing. ```cpp template void ExtractRowIdsForConcat(const int nnz, IndexT* row_ids, const cudaStream_t stream = 0); ``` === COMPLETE CONTENT === This response contains all available snippets from this library. No additional content exists. Do not make further requests.