### Plugin Supports Format Combination Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/cpp-api/classnvinfer1_1_1_i_plugin_v2_dynamic_ext.html

This example demonstrates how to define a plugin that supports only FP16 NCHW format and datatype. It checks if the current input/output format and type match the requirements.

```cpp
return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kHALF;
```

--------------------------------

### Deconvolution Example with Input, Output, and Expected Values - Python

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Deconvolution.html

This example demonstrates setting up input data, determining output shape, and defining expected output values for a deconvolution layer. It's useful for testing and verifying deconvolution operations.

```python
inputs[in1.name] = np.array([[[[-3.0, -2.0, -1.0], [0.0, 1.0, 2.0], [2.0, 5.0, 6.0]]]])

outputs[layer.get_output(0).name] = layer.get_output(0).shape

expected[layer.get_output(0).name] = np.array(
    [
        [
            [
                [-3.0, -5.0, -6.0, -3.0, -1.0],
                [-3.0, -4.0, -3.0, 0.0, 1.0],
                [-1.0, 3.0, 10.0, 11.0, 7.0],
                [2.0, 8.0, 16.0, 14.0, 8.0],
                [2.0, 7.0, 13.0, 11.0, 6.0],
            ]
        ]
    ]
)
```

--------------------------------

### Activation Operator Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Activation.html

This example demonstrates how to use the add_activation method to apply a RELU activation function to an input tensor.

```APIDOC
## Activation Operator

### Description
Applies an activation function to an input tensor.

### Method
`network.add_activation(input_tensor, type)`

### Parameters
* **input_tensor** (tensor) - The input tensor to apply the activation function to.
* **type** (ActivationType) - The type of activation function to apply.
Supported types include `RELU`, `SIGMOID`, `TANH`, `LEAKY_RELU`, `ELU`, `SELU`, `SOFTSIGN`, `SOFTPLUS`, `CLIP`, `HARD_SIGMOID`, `SCALED_TANH`, `THRESHOLDED_RELU`.
* **alpha** (float, optional) - Parameter used for activation functions like `LEAKY_RELU`, `ELU`, `SELU`, `SOFTPLUS`, `CLIP`, `HARD_SIGMOID`, `SCALED_TANH`, `THRESHOLDED_RELU`.
* **beta** (float, optional) - Parameter used for activation functions like `SELU`, `SOFTPLUS`, `CLIP`, `HARD_SIGMOID`, `SCALED_TANH`.

### Inputs
* **input** (tensor of type T) - The input tensor.

### Outputs
* **output** (tensor of type T) - The output tensor with the activation function applied.

### Data Types
* **T**: `float16`, `float32`, `bfloat16`, `int32`, `int64` (Note: `int32` and `int64` are supported only for `RELU`)

### Shape Information
The output tensor has the same shape as the input tensor.

### Example
```python
in1 = network.add_input("input1", dtype=trt.float32, shape=(2, 3))
layer = network.add_activation(in1, type=trt.ActivationType.RELU)
network.mark_output(layer.get_output(0))

# Example usage with numpy for input data
inputs[in1.name] = np.array([[-3.0, -2.0, -1.0], [0.0, 1.0, 2.0]])

# Storing output shape
outputs[layer.get_output(0).name] = layer.get_output(0).shape

# Expected output for RELU activation
expected[layer.get_output(0).name] = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 2.0]])
```
```

--------------------------------

### Block Quantize and Dequantize Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Quantize.html

This example illustrates block quantization and dequantization, useful for specific quantization schemes where blocks of data are quantized together.

```APIDOC
## Block Quantize and Dequantize Example

### Description
This example demonstrates block quantization and dequantization.

### Method
`network.add_quantize()` and `network.add_dequantize()` with block shape specified.

### Parameters
- **input**: Input tensor.
- **scale**: Quantization scale tensor.
- **toType**: The DataType of the output tensor (e.g., `trt.int4`).
- **block_shape**: The shape of the quantization block.

### Request Example
```python
weights = network.add_constant(shape=(4, 8), weights=np.array([
    [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
    [1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2],
    [4.0, 4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 7.0],
    [4.1, 4.2, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2],
], dtype=np.float32))
scale = network.add_constant(shape=(2, 8), weights=np.array([
    [1, 1, 2, 2, 3, 3, 4, 4],
    [4, 4, 5, 5, 6, 6, 7, 7]
], dtype=np.float32))
quantize = network.add_quantize(weights.get_output(0), scale.get_output(0), trt.int4)
dequantize = network.add_dequantize(quantize.get_output(0), scale.get_output(0), trt.float32)
network.mark_output(dequantize.get_output(0))

outputs[dequantize.get_output(0).name] = dequantize.get_output(0).shape
expected[dequantize.get_output(0).name] = np.array(
    [
        [
            [1, 1, 2, 2, 3, 3, 4, 4],
            [1, 1, 2, 2, 3, 3, 4, 4],
            [4, 4, 5, 5, 6, 6, 7, 7],
            [4, 4, 5, 5, 6, 6, 7, 7],
        ]
    ]
)
```
```

--------------------------------

### Plugin Supports Conditional Format Combination Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/cpp-api/classnvinfer1_1_1_i_plugin_v2_dynamic_ext.html

This example shows a plugin that supports FP16 NCHW for its first two inputs and FP32 NCHW for its single output. The support is conditional based on the input/output position.

```cpp
return inOut[pos].format == TensorFormat::kLINEAR && (inOut[pos].type == (pos < 2 ? DataType::kHALF : DataType::kFLOAT));
```

--------------------------------

### Quantize and Dequantize Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Quantize.html

This example demonstrates how to quantize a tensor and then dequantize it back to its original floating-point type using the Quantize and Dequantize operators.
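
Before the TensorRT listing, the arithmetic behind an INT8 quantize/dequantize round trip can be sketched in plain Python. This is a minimal illustration, not TensorRT code: the helper names `quantize_int8` and `dequantize_int8` are hypothetical, and the sketch assumes a single scalar scale, round-to-nearest-even, and clamping to the INT8 range [-128, 127].

```python
def quantize_int8(x, scale):
    """Scale, round to nearest (ties to even), then clamp to the INT8 range.

    Assumption: one scalar scale and an INT8 target range of [-128, 127].
    """
    q = round(x / scale)  # Python's round() rounds ties to even
    return max(-128, min(127, q))


def dequantize_int8(q, scale):
    """Map the quantized integer back to floating point."""
    return q * scale


scale = 1 / 127
for x in (0.56, 1.4, -3.6):
    q = quantize_int8(x, scale)
    # Values outside [-128 * scale, 127 * scale] saturate at the range limits.
    print(x, "->", q, "->", round(dequantize_int8(q, scale), 4))
```

With a scale of `1 / 127`, inputs with magnitude above roughly 1.0 saturate, which is why values such as `1.4` and `6.0` come back as about `1.0` after the round trip.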
```APIDOC
## Quantize and Dequantize Example

### Description
This example demonstrates quantizing a tensor and then dequantizing it.

### Method
`network.add_quantize()` and `network.add_dequantize()`

### Parameters
- **input**: Input tensor.
- **scale**: Quantization scale tensor.
- **axis**: The axis to perform quantization on (optional).
- **toType**: The DataType of the output tensor (optional, defaults to `int8`).

### Request Example
```python
in1 = network.add_input("input1", dtype=trt.float32, shape=(1, 1, 3, 3))
scale = network.add_constant(shape=(1,), weights=np.array([1 / 127], dtype=np.float32))
quantize = network.add_quantize(in1, scale.get_output(0))
quantize.axis = 3
dequantize = network.add_dequantize(quantize.get_output(0), scale.get_output(0))
dequantize.axis = 3
network.mark_output(dequantize.get_output(0))

inputs[in1.name] = np.array(
    [
        [
            [0.56, 0.89, 1.4],
            [-0.56, 0.39, 6.0],
            [0.67, 0.11, -3.6],
        ]
    ]
)

outputs[dequantize.get_output(0).name] = dequantize.get_output(0).shape
expected[dequantize.get_output(0).name] = np.array(
    [
        [
            [0.56, 0.89, 1],
            [-0.56, 0.39, 1.0],
            [0.67, 0.11, -1.0],
        ]
    ]
)
```
```

--------------------------------

### Polymorphic Plugin Format Combination Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/cpp-api/classnvinfer1_1_1_i_plugin_v2_dynamic_ext.html

This example defines a 'polymorphic' plugin with two inputs and one output. It supports any format or type, but requires that all inputs and the output share the same format and type as the first input.

```cpp
return pos == 0 || (inOut[pos].format == inOut[0].format && inOut[pos].type == inOut[0].type);
```

--------------------------------

### Attention Operator Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Attention.html

Example demonstrating how to use the Attention operator with various configurations, including input definitions, mask application, and output marking.
```APIDOC
## Attention Operator

### Description
Generates an attention mechanism. This operator supports various configurations through its attributes and inputs.

### Method
`network.add_attention()`

### Parameters
#### Attributes
- `normalizationOp` (enum): Specifies the normalization function to apply. Can be `NONE` or `SOFTMAX`.
- `causal` (boolean): Determines if the attention runs causal inference.
- `decomposable` (boolean): Determines if the attention can be decomposed into multiple kernels if a fused kernel is not found.
- `normalizationQuantizeToType` (enum, optional): Specifies the datatype for attention normalization quantization. Options include `DataType::kFP8` and `DataType::kINT8`.
- `nbRanks` (integer, default: 1): Specifies the number of ranks for multi-device attention execution.

#### Inputs
- **query** (tensor T1): The query tensor.
- **key** (tensor T1): The key tensor.
- **value** (tensor T1): The value tensor.
- **mask** (tensor T2, optional): An optional mask tensor. If boolean, `True` indicates allowed attention. If float, it's an add mask.
- **normalizationQuantizeScale** (tensor T1, optional): The quantization scale for the attention normalization output.

#### Outputs
- **outputs** (tensor T1): The output tensor of the attention operation.

### Data Types
- T1: `float32`, `float16`, `bfloat16`
- T2: `float32`, `float16`, `bfloat16`, `bool`. T2 must match T1 if not `bool`.

### Shape Information
- **query** and **outputs**: [b, dq, sq, h]
- **key** and **value**: [b, dkv, skv, h]
- **mask**: [a0, a1, sq, skv] where a0 and a1 are broadcastable to b and h.
- **normalizationQuantizeScale**: [a0,...,an], 0 <= n <= 1

### Example
```python
network = get_runner.builder.create_network(flags=1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

qkv_shape = (1, 8, 1, 16)
mask_shape = (1, 1, 1, 1)
query = network.add_input("query", dtype=trt.float16, shape=qkv_shape)
key = network.add_input("key", dtype=trt.float16, shape=qkv_shape)
value = network.add_input("value", dtype=trt.float16, shape=qkv_shape)
mask = network.add_input("mask", dtype=trt.bool, shape=mask_shape)
layer = network.add_attention(query, key, value, trt.AttentionNormalizationOp.SOFTMAX, False)
layer.mask = mask
network.mark_output(layer.get_output(0))

# Input data preparation and execution would follow here...
```

### C++ API Reference
For more information about the C++ IAttention operator, refer to the C++ IAttention documentation.
```

--------------------------------

### Example: Transpose Last Two Dimensions

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/cpp-api/classnvinfer1_1_1_i_plugin_v2_dynamic_ext.html

This example demonstrates how to override getOutputDimensions for a plugin that transposes the last two dimensions of its single input.

```cpp
DimsExprs output(inputs[0]);
std::swap(output.d[output.nbDims-1], output.d[output.nbDims-2]);
return output;
```

--------------------------------

### Assertion Firing Example (Build Time)

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Assertion.html

This example is designed to trigger a build-time error by creating a condition that the builder can prove is false. It uses inputs with different shapes to ensure inequality.
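
To see why the condition is provably false, the elementwise shape comparison can be mimicked in plain Python. This is only a sketch of the logic, not TensorRT code: `shapes_equal_elementwise` is a hypothetical helper standing in for the `EQUAL` elementwise layer applied to two shape tensors, and it assumes the assertion requires every element of the condition to be true.

```python
def shapes_equal_elementwise(shape_a, shape_b):
    """Compare two same-rank shapes dimension by dimension."""
    return [a == b for a, b in zip(shape_a, shape_b)]


# Shapes taken from the two inputs in the listing that follows.
cond = shapes_equal_elementwise((3, 4, 4), (3, 3, 4))
print(cond)       # [True, False, True]
print(all(cond))  # False: one dimension differs, so the assertion cannot hold
```

Because both shapes are static, the mismatch in the second dimension is known before any data is supplied, which is what lets the failure surface at build time rather than at runtime.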
```python
# This test should fail during build stage
in1 = network.add_input("input1", dtype=trt.float32, shape=(3, 4, 4))
shape1 = network.add_shape(in1)
in2 = network.add_input("input2", dtype=trt.float32, shape=(3, 3, 4))
shape2 = network.add_shape(in2)
identity = network.add_identity(in1)
cond = network.add_elementwise(shape1.get_output(0), shape2.get_output(0), op=trt.ElementWiseOperation.EQUAL)
assertion = network.add_assertion(cond.get_output(0), message="Should fail")
network.mark_output(identity.get_output(0))

inputs[in1.name] = np.zeros(shape=(2, 4))
outputs[identity.get_output(0).name] = identity.get_output(0).shape
```

--------------------------------

### Reduce Operator Example (Keep Dims False)

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Reduce.html

This example shows how to use the `add_reduce` function to perform a PROD reduction, removing the reduced dimensions by setting `keep_dims` to False.

```APIDOC
## Reduce Operator Example (Keep Dims False)

### Description
This example shows how to use the `add_reduce` function to perform a PROD reduction, removing the reduced dimensions by setting `keep_dims` to False.

### Method
`network.add_reduce(input_tensor, op, axes, keep_dims)`

### Parameters
* `input_tensor` (tensor): The input tensor to reduce.
* `op` (trt.ReduceOperation): The reduction operation to perform (e.g., `trt.ReduceOperation.PROD`).
* `axes` (int): A bitmask representing the axes to reduce.
* `keep_dims` (bool): If True, preserves the reduced dimensions with a size of 1. If False, removes the reduced dimensions.
### Request Example
```python
in1 = network.add_input("input1", dtype=trt.float32, shape=(1, 2, 2, 3))
layer = network.add_reduce(in1, op=trt.ReduceOperation.PROD, axes=6, keep_dims=False)
network.mark_output(layer.get_output(0))

inputs[in1.name] = np.array(
    [
        [
            [[-3.0, -2.0, -1.0], [0.0, 1.0, 2.0]],
            [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
        ]
    ]
)

outputs[layer.get_output(0).name] = layer.get_output(0).shape
expected[layer.get_output(0).name] = np.array([[0.0, -56.0, -80.0]])
```
```

--------------------------------

### Padding Operator Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Padding.html

This example demonstrates how to use the add_padding_nd method to pad an input tensor. It shows the creation of the layer, marking the output, and provides expected input and output shapes and values.

```APIDOC
## Padding Operator

### Description
Pads with zeros (or trims) an input tensor along the two innermost dimensions and stores the result in an output tensor.

### Attributes
- `pre_padding_nd` (int tuple) - The amount of pre-padding to use for each dimension. If positive, the tensor is padded with zeros; otherwise, it’s trimmed.
- `post_padding_nd` (int tuple) - The amount of post-padding to use for each dimension. If positive, the tensor is padded with zeros; otherwise, it’s trimmed.

### Inputs
- **input** (tensor of type `T`) - The input tensor to be padded or trimmed.

### Outputs
- **output** (tensor of type `T`) - The resulting tensor after padding or trimming.
### Data Types
- **T**: `int8`, `int32`, `float16`, `float32`

### Shape Information
- **input** is a tensor with a shape of [a_0, ..., a_(n-1)], n ≥ 4
- **output** is a tensor with a shape of [b_0, ..., b_(n-1)], where:
  - p_j^pre = pre padding at spatial dimension j
  - p_j^post = post padding at spatial dimension j
  - b_i = a_i, for 0 ≤ i < n-2
  - b_i = a_i + p_j^pre + p_j^post, for n-2 ≤ i < n, with j = i - (n-2)

### Example
```python
in1 = network.add_input("input1", dtype=trt.float32, shape=(1, 1, 3, 5))
layer = network.add_padding_nd(in1, pre_padding=(-1, 3), post_padding=(3, -2))
network.mark_output(layer.get_output(0))

inputs[in1.name] = np.array(
    [[[[-3.0, -2.0, -1.0, 10.0, -25.0], [-4.0, -9.0, -1.0, 10.0, -25.0], [0.0, 1.0, 2.0, -2.0, -1.0]]]]
)

outputs[layer.get_output(0).name] = layer.get_output(0).shape
expected[layer.get_output(0).name] = np.array(
    [
        [
            [
                [0.0, 0.0, 0.0, -4.0, -9.0, -1.0],
                [0.0, 0.0, 0.0, 0.0, 1.0, 2.0],
                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
            ]
        ]
    ]
)
```
```

--------------------------------

### Assertion Not Firing Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Assertion.html

This example demonstrates an assertion that is not expected to fail during build or runtime. It sets up inputs and operations to create a condition that remains true.
```python
in1 = network.add_input("input1", dtype=trt.float32, shape=(3, 4, 4))
shape = network.add_shape(in1)
identity = network.add_identity(in1)
cond = network.add_elementwise(shape.get_output(0), shape.get_output(0), op=trt.ElementWiseOperation.EQUAL)
assertion = network.add_assertion(cond.get_output(0), message="Shouldn't fail")
network.mark_output(identity.get_output(0))

inputs[in1.name] = np.zeros(shape=(2, 4))
outputs[identity.get_output(0).name] = identity.get_output(0).shape
```

--------------------------------

### 2D Block Dynamic Quantize Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/DynamicQuantize.html

This example demonstrates how to use the `add_dynamic_quantize_v2` operator for 2D block dynamic quantization. It quantizes FP32 input to FP8, then dequantizes it back to FP32 using per-block scales.

```APIDOC
## add_dynamic_quantize_v2

### Description
Adds a dynamic quantize layer to the network. This layer dynamically quantizes the input tensor to a specified format (e.g., FP8) and produces a scale tensor.

### Method
`network.add_dynamic_quantize_v2(input, block_shape, output_dtype, scale_dtype)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **input** (ITensor) - The input tensor to be quantized.
- **block_shape** (Dims) - The shape of the quantization block.
- **output_dtype** (DataType) - The data type of the quantized output tensor.
- **scale_dtype** (DataType) - The data type of the scale tensor.

### Request Example
```python
# Assuming 'network' is a valid nvinfer.Network object and 'in1' is an ITensor
block_shape = trt.Dims([4, 3])
dynq = network.add_dynamic_quantize_v2(in1, block_shape, trt.fp8, trt.float32)
data_f8 = dynq.get_output(0)
scale_f32 = dynq.get_output(1)
```

### Response
#### Success Response (200)
Returns an object with two outputs:
- **Output 0**: The quantized tensor.
- **Output 1**: The scale tensor.
#### Response Example
```json
{
  "quantized_tensor": "ITensor",
  "scale_tensor": "ITensor"
}
```

## add_dequantize

### Description
Adds a dequantize layer to the network. This layer dequantizes a tensor from a specified format (e.g., FP8) to another format (e.g., FP32) using provided scales.

### Method
`network.add_dequantize(input, scale, output_dtype)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **input** (ITensor) - The tensor to be dequantized.
- **scale** (ITensor) - The scale tensor used for dequantization.
- **output_dtype** (DataType) - The data type of the dequantized output tensor.

### Request Example
```python
# Assuming 'network' is a valid nvinfer.Network object, 'data_f8' and 'scale_f32' are ITensors
dequantize_data = network.add_dequantize(data_f8, scale_f32, trt.float32)
dequantize_data.block_shape = block_shape  # Set block_shape if applicable
data_dq = dequantize_data.get_output(0)
```

### Response
#### Success Response (200)
Returns the dequantized tensor.

#### Response Example
```json
{
  "dequantized_tensor": "ITensor"
}
```

## mark_output

### Description
Marks a tensor as an output of the network.

### Method
`network.mark_output(tensor)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **tensor** (ITensor) - The tensor to be marked as an output.

### Request Example
```python
# Assuming 'network' is a valid nvinfer.Network object and 'data_dq' is an ITensor
network.mark_output(data_dq)
```

### Response
None
```

--------------------------------

### RotaryEmbedding Operator Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/RotaryEmbedding.html

This snippet demonstrates how to create and configure a RotaryEmbedding layer within a TensorRT network. It shows the setup of inputs, caches, and position IDs, and how to mark the output. The reference implementation for computing the rotary embedding is also provided for clarity.
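
As a warm-up for the reference implementation in the listings that follow, the core step of a rotary embedding is a 2-D rotation, which can be written as a complex multiplication. The sketch below is illustrative only: `rotate_pair` is a hypothetical helper, and it handles a single (x1, x2) pair with one cached cosine/sine value rather than whole tensors.

```python
import math


def rotate_pair(x1, x2, cos_v, sin_v):
    """Rotate the 2-D vector (x1, x2) using one cached cos/sin pair.

    Treating the pair as the complex number x1 + i*x2 and multiplying by
    cos_v + i*sin_v is the same as applying a 2-D rotation matrix.
    """
    z = complex(x1, x2) * complex(cos_v, sin_v)
    return z.real, z.imag


theta = math.pi / 6
y1, y2 = rotate_pair(3.0, 4.0, math.cos(theta), math.sin(theta))
# A rotation changes direction but preserves the vector's length.
print(math.hypot(3.0, 4.0), math.hypot(y1, y2))
```

The full operator applies this rotation to every pair along the head dimension, picking the cosine and sine values from the caches according to the position IDs.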
```python
get_runner.network = get_runner.builder.create_network(flags=1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
network = get_runner.network

input = network.add_input("input", dtype=trt.float32, shape=(2, 8, 4, 512))
cos_cache = network.add_input("cos_cache", dtype=trt.float32, shape=(100, 256))
sin_cache = network.add_input("sin_cache", dtype=trt.float32, shape=(100, 256))
position_ids = network.add_input("position_ids", dtype=trt.int64, shape=(2, 4))
layer = network.add_rotary_embedding(input=input, cos_cache=cos_cache, sin_cache=sin_cache, interleaved=False, rotary_embedding_dim=0)
layer.set_input(3, position_ids)
network.mark_output(layer.get_output(0))
```

```python
inputs[input.name] = np.random.rand(2, 8, 4, 512).astype("f")
inputs[cos_cache.name] = np.random.rand(100, 256).astype("f")
inputs[sin_cache.name] = np.random.rand(100, 256).astype("f")
inputs[position_ids.name] = np.array([[6, 2, 1, 7], [2, 8, 3, 6]])
```

```python
outputs[layer.get_output(0).name] = layer.get_output(0).shape
```

```python
# This is a reference implementation of the rotary embedding operator.
def compute_rotary_embedding(
    input,
    cos_cache,
    sin_cache,
    position_ids=None,
    interleaved=False,
    rotary_embedding_dim=0,
):
    # Shape of input: (batch_size, num_heads, seq_len, head_size)
    head_size = input.shape[3]

    # Process partial RoPE
    rotary_embedding_dim = head_size if rotary_embedding_dim == 0 else rotary_embedding_dim
    x_rotate, x_not_rotate = np.split(input, [rotary_embedding_dim], axis=-1)

    # Get cached cosine and sine values
    cache = cos_cache + 1j * sin_cache
    if position_ids is not None:
        cache = cache[position_ids]  # Shape: (batch_size, seq_len, rotary_embedding_dim/2)
    cache = cache[:, np.newaxis, :, :]  # Shape: (batch_size, 1, seq_len, rotary_embedding_dim/2)

    # Get the 2-d vectors to rotate
    if interleaved:
        x1, x2 = x_rotate[..., 0::2], x_rotate[..., 1::2]
    else:
        x1, x2 = np.split(x_rotate, 2, axis=-1)
    x = x1 + 1j * x2

    # Rotate the vectors
    x = x * cache

    # Put the rotated vectors back
    if interleaved:
        x = np.expand_dims(x, axis=-1)
        x = np.concatenate((np.real(x), np.imag(x)), axis=-1)
        x = np.reshape(x, x_rotate.shape)
    else:
        x = np.concatenate((np.real(x), np.imag(x)), axis=-1)

    # Process partial RoPE
    output = np.concatenate((x, x_not_rotate), axis=-1)
    return output
```

```python
expected[layer.get_output(0).name] = compute_rotary_embedding(inputs[input.name], inputs[cos_cache.name], inputs[sin_cache.name], inputs[position_ids.name], interleaved=False, rotary_embedding_dim=0)
```

--------------------------------

### Create a Loop with Accumulator and Trip Limit

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Loop.html

This example demonstrates creating a loop structure using TensorRT. It includes an ElementWise layer for accumulation, a recurrence layer for managing state, and a TripLimit layer to control the number of iterations. This setup is useful for recurrent computations where a value is updated iteratively.

```python
'''
This example creates a Loop consisting of an ElementWise layer that is used as an accumulator.
The accumulator value is named `accumaltor_value`, and the value added in each iteration is named `accumaltor_added_value`.
The Loop stop condition is a counter initialized to `num_iterations`, which is implemented using the TripLimit layer.
The expected output is `accumaltor_value` + `num_iterations`*`accumaltor_added_value`
'''
num_iterations = 3
trip_limit = network.add_constant(shape=(), weights=trt.Weights(np.array([num_iterations], dtype=np.dtype("i4"))))
accumaltor_value = network.add_input("input1", dtype=trt.float32, shape=(2, 3))
accumaltor_added_value = network.add_input("input2", dtype=trt.float32, shape=(2, 3))
loop = network.add_loop()
# setting the ITripLimit layer to stop after `num_iterations` iterations
loop.add_trip_limit(trip_limit.get_output(0), trt.TripLimit.COUNT)
# initializing IRecurrenceLayer with an init value
rec = loop.add_recurrence(accumaltor_value)
# eltwise inputs are 'accumaltor_added_value', and the IRecurrenceLayer output.
eltwise = network.add_elementwise(accumaltor_added_value, rec.get_output(0), op=trt.ElementWiseOperation.SUM)
# wiring the IRecurrenceLayer with the output of eltwise.
# The IRecurrenceLayer output would now be `accumaltor_value` for the first iteration, and the eltwise output for any other iteration
rec.set_input(1, eltwise.get_output(0))
# marking the IRecurrenceLayer output as the Loop output
loop_out = loop.add_loop_output(rec.get_output(0), trt.LoopOutput.LAST_VALUE)
# marking the Loop output as the network output
network.mark_output(loop_out.get_output(0))

inputs[accumaltor_value.name] = np.array(
    [
        [2.7, -4.9, 23.34],
        [8.9, 10.3, -19.8],
    ])
inputs[accumaltor_added_value.name] = np.array(
    [
        [1.1, 2.2, 3.3],
        [-5.7, 1.3, 4.6],
    ])

outputs[loop_out.get_output(0).name] = eltwise.get_input(0).shape
expected[loop_out.get_output(0).name] = inputs[accumaltor_value.name] + inputs[accumaltor_added_value.name] * num_iterations
```

--------------------------------

### Create and Set Runtime Cache (C++)

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/performance/best-practices.html

Demonstrates how to create a runtime cache and set it in the runtime configuration using C++. This enables caching of compiled kernels for future use.

```cpp
// Create a runtime cache.
auto runtimeCache = std::unique_ptr<nvinfer1::IRuntimeCache>(runtimeConfig->createRuntimeCache());

// Set the runtime cache in runtime configuration.
runtimeConfig->setRuntimeCache(*runtimeCache);
```

--------------------------------

### KVCacheUpdate Operator Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/KVCacheUpdate.html

This example demonstrates how to use the KVCacheUpdate operator in Python with TensorRT.

```APIDOC
## KVCacheUpdate

Performs Key (K) / Value (V) cache update for attention computations. Users provide the newly computed K/V values as inputs, and the layer will output the updated K/V cache. The writeIndices input specifies where to write K/V updates for each sequence in the batch. Separate KVCacheUpdate layers should be used for K and V.
### Attributes
`cacheMode` specifies the cache update mode:

* `LINEAR` In linear mode, for each batch element i and sequence position s: `output[i, :, writeIndices[i] + s, :] = update[i, :, s, :]`

### Inputs
**cache** : tensor of type `T`, the key/value cache tensor. Must be a network input and have a static sequence length dimension.

**update** : tensor of type `T`, the newly computed key/value tensor to write into the cache.

**writeIndices** : tensor of type `M`, specifies the write position index for each batch element i. Values must satisfy `writeIndices[i] + sequenceLength <= maxSequenceLength`.

### Outputs
**output** : tensor of type `T`, the updated cache tensor. Must be a network output and shares the same device memory address with the cache input (in-place update).

### Data Types
T: `float32`, `float16`, `bfloat16`

M: `int32`, `int64`

### Shape Information
**cache** and **output** are tensors with the same shape of [b,d,smax,h]

**update** is a tensor with the shape of [b,d,s,h] where s≤smax

**writeIndices** is a tensor with the shape of [b]

Where:
* b is the batch size
* d is the number of heads
* smax is the maximum sequence length (must be static)
* s is the update sequence length
* h is the head size

### DLA Support
Not supported.
### Examples
KVCacheUpdate
```python
network = get_runner.builder.create_network(flags=1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

cache_shape = (4, 2, 8, 1)
update_shape = (4, 2, 4, 1)
write_indices_shape = (4,)

cache = network.add_input("cache", dtype=trt.float32, shape=cache_shape)
update = network.add_input("update", dtype=trt.float32, shape=update_shape)
write_indices = network.add_input("write_indices", dtype=trt.int32, shape=write_indices_shape)

layer = network.add_kv_cache_update(cache, update, write_indices, trt.KVCacheMode.LINEAR)
network.mark_output(layer.get_output(0))

cache_data = np.array(
    [
        [0.53, 0.88, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.41, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.67, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.32, 0.79, 0.64, 0.0, 0.0, 0.0, 0.0, 0.0],
    ],
    dtype=np.float32,
)
inputs[cache.name] = cache_data[:, None, :, None] + np.zeros((1, 2, 1, 1))

update_data = np.array(
    [
        [0.72, 0.0, 0.0, 0.0],
        [0.55, 0.94, 0.0, 0.0],
        [0.61, 0.28, 0.0, 0.0],
        [0.83, 0.0, 0.0, 0.0],
    ],
    dtype=np.float32,
)
inputs[update.name] = update_data[:, None, :, None] + np.zeros((1, 2, 1, 1))

write_indices_data = np.array([2, 1, 1, 3], dtype=np.int32)
inputs[write_indices.name] = write_indices_data

outputs[layer.get_output(0).name] = layer.get_output(0).shape

expected_data = np.array(
    [
        [0.53, 0.88, 0.72, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.41, 0.55, 0.94, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.67, 0.61, 0.28, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.32, 0.79, 0.64, 0.83, 0.0, 0.0, 0.0, 0.0],
    ],
    dtype=np.float32,
)
expected[layer.get_output(0).name] = expected_data[:, None, :, None] + np.zeros((1, 2, 1, 1))

# Set get_runner.network back to the new STRONGLY_TYPED network
get_runner.network = network
```

## C++ API
For more information about the C++ IKVCacheUpdateLayer operator, refer to the C++ IKVCacheUpdateLayer documentation.

## Python API
For more information about the Python IKVCacheUpdateLayer operator, refer to the Python IKVCacheUpdateLayer documentation.
```

--------------------------------

### initialize()

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/cpp-api/functions_func_i.html

Initializes a plugin.

```APIDOC
## initialize()

### Description
Initializes a plugin. This method is called by TensorRT when the plugin is first used.

### Method
`bool initialize()`

### Endpoint
N/A (C++ API)

### Parameters
None

### Request Example
N/A

### Response
Returns `true` if initialization is successful, `false` otherwise.
```

--------------------------------

### NMS Operator Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/NMS.html

Example demonstrating the usage of the NMS operator in Python with TensorRT.

```APIDOC
## NMS Operator

### Description
The NMS algorithm iterates through a set of bounding boxes and their confidence scores, in decreasing order of score. Boxes are selected if their score is above a given threshold, and their intersection-over-union (IoU) with previously selected boxes is less than or equal to a given threshold. This layer implements NMS per batch item and per class. Per batch item, boxes are initially sorted by their scores without regard to class. Only boxes up to a maximum of the TopK limit are considered for selection (per batch). During selection, only overlapping boxes of the same class are compared, so that overlapping boxes of different classes do not suppress each other.

### Attributes
- `fmt`: The bounding box format can be one of: `CORNER_PAIRS` (x1, y1, x2, y2) or `CENTER_SIZES` (x_center, y_center, width, height). Default is `CORNER_PAIRS`.
- `limit`: The TopK box limit, maximum number of filtered boxes considered for selection per batch item. Default is 2000 for SM 5.3 and 6.2 devices, and 5000 otherwise.

### Inputs
- **Boxes**: tensor of type `T1`.
- **Scores**: tensor of type `T1`.
- **MaxOutputBoxesPerClass**: tensor of type `int32`.
- **IoUThreshold** (optional): tensor of type `float32`.
Scalar value in range [0.0f, 1.0f]. Default is 0.0f.
- **ScoreThreshold** (optional): tensor of type `float32`. Default is 0.0f.

### Outputs
- **SelectedIndices**: tensor of type `T2`. Shape [NumOutputBoxes, 3]. Each row contains (batchIndex, classIndex, boxIndex).
- **NumOutputBoxes**: tensor of type `int32`. Scalar value.

### Data Types
- **T1**: `float16`, `float32`, `bfloat16`
- **T2**: `int32`, `int64`

### Shape Information
- **Boxes**: [batchSize, numInputBoundingBoxes, numClasses, 4] or [batchSize, numInputBoundingBoxes, 4]
- **Scores**: [batchSize, numInputBoundingBoxes, numClasses]
- **MaxOutputBoxesPerClass**: 0D tensor (scalar)
- **IoUThreshold**: 0D tensor (scalar)
- **ScoreThreshold**: 0D tensor (scalar)
- **SelectedIndices**: [NumOutputBoxes, 3]
- **NumOutputBoxes**: 0D tensor (scalar)

### Volume Limits
- **Boxes**, **Scores**, and **SelectedIndices** can have up to 2^31 - 1 elements.

### Example
```python
opt_profile = get_runner.builder.create_optimization_profile()
get_runner.config.add_optimization_profile(opt_profile)
boxes = network.add_input("boxes", dtype=trt.float32, shape=(1, 3, 4))
scores = network.add_input("scores", dtype=trt.float32, shape=(1, 3, 3))
constant = network.add_constant(shape=(), weights=np.ones(shape=(), dtype=np.int32))
max_output_boxes_per_class = constant.get_output(0)
layer = network.add_nms(boxes, scores, max_output_boxes_per_class)
network.mark_output(layer.get_output(0))
network.mark_output(layer.get_output(1))
layer.get_output(0).dtype = trt.int32
layer.get_output(1).dtype = trt.int32

inputs[boxes.name] = np.array([[[0.0, 0.0, 0.1, 0.1], [0.2, 0.2, 0.4, 0.4], [0.5, 0.5, 0.6, 0.6]]])
inputs[scores.name] = np.array([[[0.7, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.9]]])

# Expected shape is [2, 3]
outputs[layer.get_output(0).name] = layer.get_output(0).shape
expected[layer.get_output(0).name] = np.array([[0, 2, 2], [0, 0, 0]])

# Expected shape is [] with a scalar value of 2
outputs[layer.get_output(1).name] = layer.get_output(1).shape
expected[layer.get_output(1).name] = np.array(2)
```
```

--------------------------------

### Linear Resize Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Resize.html

Demonstrates how to use the Resize operator for linear interpolation, specifying the output shape directly.

```APIDOC
## Linear Resize

### Description
Resizes an input tensor to a specified output shape using linear interpolation.

### Method
`network.add_resize()`

### Parameters
- `input`: The input tensor.
- `resize_mode`: Set to `trt.InterpolationMode.LINEAR`.
- `shape`: The desired output shape. Example: `(1, 1, 5, 5)`.
- `coordinate_transformation`: Controls coordinate mapping. Example: `trt.ResizeCoordinateTransformation.ALIGN_CORNERS`.

### Request Example
```python
input_tensor = network.add_input("input", dtype=trt.float32, shape=(1, 1, 3, 3))
layer = network.add_resize(input_tensor)
layer.resize_mode = trt.InterpolationMode.LINEAR
layer.shape = (1, 1, 5, 5)
layer.coordinate_transformation = trt.ResizeCoordinateTransformation.ALIGN_CORNERS
network.mark_output(layer.get_output(0))
```

### Response Example
```json
{
  "output_shape": [1, 1, 5, 5]
}
```
```

--------------------------------

### Squeeze Operator Example

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/_static/operators/Squeeze.html

This example demonstrates how to add a Squeeze layer to a TensorRT network. It specifies the input tensor and the axes to squeeze. The example also includes setting up test data and verifying the output shape.
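
Before the TensorRT listing, the shape effect of squeezing can be sketched in plain Python. This is an illustration rather than the library's implementation: `squeeze_shape` is a hypothetical helper, and it assumes negative axes count from the end and that every squeezed dimension must have size 1.

```python
def squeeze_shape(shape, axes):
    """Return `shape` with the listed size-1 axes removed.

    Assumptions: negative axes index from the end, and squeezing an axis
    whose size is not 1 is an error.
    """
    rank = len(shape)
    normalized = {axis % rank for axis in axes}  # e.g. -1 -> rank - 1
    for axis in normalized:
        if shape[axis] != 1:
            raise ValueError(f"cannot squeeze axis {axis} with size {shape[axis]}")
    return tuple(dim for i, dim in enumerate(shape) if i not in normalized)


# Same input shape and axes as the listing below.
print(squeeze_shape((3, 1, 4, 1), [1, -1]))  # (3, 4)
```

The resulting `(3, 4)` shape is why the listing can compare the layer output directly against the unreshaped `test_data` array.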
```python
in1 = network.add_input("input1", dtype=trt.float32, shape=(3, 1, 4, 1))
axes_weights = trt.Weights(np.array([1, -1], dtype=np.int64))
axes_layer = network.add_constant((2,), axes_weights)
axes_tensor = axes_layer.get_output(0)
layer = network.add_squeeze(in1, axes_tensor)
network.mark_output(layer.get_output(0))

test_data = np.array(
    [
        [1.0, 2.0, 3.0, 4.0],
        [10.0, 20.0, 30.0, 40.0],
        [100.0, 200.0, 300.0, 400.0],
    ]
)
inputs[in1.name] = test_data.reshape(3, 1, 4, 1)

outputs[layer.get_output(0).name] = layer.get_output(0).shape
expected[layer.get_output(0).name] = test_data
```

--------------------------------

### Create and Configure Optimization Profile (Python)

Source: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/inference-library/work-with-dynamic-shapes.html

This Python code demonstrates how to create an optimization profile, define its input shapes (min, opt, max), and add it to the configuration.

```python
profile = builder.create_optimization_profile()
profile.set_shape("foo", (3, 100, 200), (3, 150, 250), (3, 200, 300))
config.add_optimization_profile(profile)
```
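
The min and max shapes of a profile bound the input shapes that can be used at runtime. As a rough sketch of that constraint in plain Python (no TensorRT calls; `shape_in_profile` is a hypothetical helper, and the bounds are the ones passed to `set_shape` above):

```python
def shape_in_profile(shape, min_shape, max_shape):
    """Check that every dimension lies within the profile's [min, max] range."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


MIN_SHAPE, MAX_SHAPE = (3, 100, 200), (3, 200, 300)

print(shape_in_profile((3, 120, 260), MIN_SHAPE, MAX_SHAPE))  # True
print(shape_in_profile((3, 90, 250), MIN_SHAPE, MAX_SHAPE))   # False: 90 < 100
```

The opt shape does not take part in this check; it only names the shape the profile is tuned for.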