### Install Docs Dependencies and Serve

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Install optional documentation dependencies and serve the documentation locally using `mkdocs serve`.

```bash
pip install -e ".[docs]"
mkdocs serve
```

--------------------------------

### Verify Local Installation

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Verify that the Model Optimizer ONNX tool has been installed correctly by running its help command.

```bash
modelopt-onnx-ptq --help
```

--------------------------------

### Install Model Optimizer Locally and Verify

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Installs the NVIDIA Model Optimizer with ONNX support using pip and verifies the installation by importing necessary modules. Also checks for available ONNX Runtime execution providers.

```bash
conda create -n modelopt python=3.12 pip && conda activate modelopt
pip install -U "nvidia-modelopt[onnx]==0.43.0" --extra-index-url https://pypi.nvidia.com
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
```

```python
import modelopt.onnx.quantization; print('OK')
import onnxruntime as ort; print(ort.get_available_providers())
```

--------------------------------

### modelopt-onnx-ptq eval-trt Examples

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Examples demonstrating the usage of modelopt-onnx-ptq eval-trt for different output formats and layouts.

````APIDOC
## modelopt-onnx-ptq eval-trt Examples

### Auto layout (DeepStream-Yolo single [B,N,6] export)
Evaluates a TensorRT engine with `--output-format auto`; the ONNX model passed via `--onnx` is used to infer the output layout.

```bash
modelopt-onnx-ptq eval-trt --output-format auto --onnx model.onnx --engine model.engine \
  --images data/coco/val2017 --annotations data/coco/annotations/instances_val2017.json
```

### Ultralytics single tensor (explicit name if needed)
Evaluates an engine whose model uses the Ultralytics single-tensor output, naming the output tensor explicitly with `--output-tensor`.

```bash
modelopt-onnx-ptq eval-trt --output-format ultralytics --output-tensor output0 --engine model.engine \
  --images data/coco/val2017 --annotations data/coco/annotations/instances_val2017.json
```

### DeepStream-Yolo layout (NMS in eval)
Evaluates an engine with the DeepStream-Yolo output layout; NMS (Non-Maximum Suppression) is applied during the evaluation step.

```bash
modelopt-onnx-ptq eval-trt --output-format deepstream_yolo --output-tensor output --engine model.engine \
  --images data/coco/val2017 --annotations data/coco/annotations/instances_val2017.json
```
````

--------------------------------

### Install CLI (Editable Mode)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Installs the modelopt-onnx-ptq CLI in editable mode from the repository root. This registers the 'modelopt-onnx-ptq' command.

```bash
pip install -e .
```

--------------------------------

### Clone Repository and Navigate

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Clones the Model-Optimizer-ONNX repository and changes the current directory into it. This is the first step for Docker installation.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
```

--------------------------------

### Clone and Install Locally

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Clone the repository and install the Model Optimizer ONNX tool in editable mode using pip. This method requires a matching CUDA/TensorRT/ONNX Runtime stack on the host.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
pip install -e .
```

--------------------------------

### Example Workflow with Environment Configuration

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Demonstrates a typical workflow where environment variables for `SESSION_ID` and `MODELOPT_ARTIFACTS_ROOT` are set before running a pipeline command. Logs and artifacts will automatically use these configurations.

```bash
export SESSION_ID=yolo-ptq-$(date +%Y%m%d)
export MODELOPT_ARTIFACTS_ROOT=/data/experiments

modelopt-onnx-ptq pipeline-e2e --onnx models/yolo.onnx --quant-matrix int8.all
```

--------------------------------

### Install Model Optimizer with ONNX Support

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Ensure the ONNX extras are installed to resolve ImportError: modelopt.onnx.

```bash
pip install -U "nvidia-modelopt[onnx]==0.43.0" --extra-index-url https://pypi.nvidia.com
```

--------------------------------

### Install Model Optimizer ONNX Locally

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Clone the repository and install the package in editable mode. Ensure a matching CUDA, TensorRT, and ONNX Runtime stack is present on the host.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
pip install -e .
modelopt-onnx-ptq --help
```

--------------------------------

### Install ONNX Runtime for CUDA 13

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Installs or upgrades onnxruntime-gpu to a CUDA 13 nightly build. Use this for local installations without Docker when targeting CUDA 13.

```bash
pip install --upgrade --force-reinstall --pre \
  --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-13-nightly/pypi/simple/ \
  onnxruntime-gpu --no-deps
```

--------------------------------

### Verify TensorRT Installation

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Check if the TensorRT library is correctly installed and accessible in the Python environment.

```bash
python -c "import tensorrt; print(tensorrt.__version__)"
```

--------------------------------

### Download COCO Dataset and Build Calibration Tensors

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Installs the modelopt-onnx-ptq command and uses it to download the COCO val2017 dataset. Then, it generates calibration tensors from the dataset images for quantization.

```bash
pip install -e .   # once: installs the `modelopt-onnx-ptq` command
modelopt-onnx-ptq download-coco --output-dir data/coco
```

```bash
modelopt-onnx-ptq calib \
  --images_dir=data/coco/val2017 \
  --calibration_data_size=500 \
  --img_size=640
```

--------------------------------

### Custom Quantization Profile YAML Example

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

An example YAML profile for `modelopt-onnx-ptq` that customizes quantization behavior. It includes whitelisting specific node patterns and excluding certain operator types and nodes.

```yaml
# Example: modelopt_onnx_ptq/profiles/custom_profile.yaml
# Quantize only backbone convolutions, exclude sensitive ops

defaults:
  autotune: false

include_nodes:
  # Whitelist backbone conv nodes (regex patterns)
  - "node_conv2d_[0-9]$"
  - "node_conv2d_[1-3][0-9]$"

exclude_op_types:
  - Sigmoid
  - Softmax
  - Resize
  - Concat

exclude_nodes:
  # Exclude SiLU activations
  - "node_silu.*"
```
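
How the include and exclude rules interact is easier to see on a few concrete node names. The sketch below is an illustration only: it assumes patterns are applied with `re.search` and that exclusions take precedence over inclusions, neither of which is confirmed by the source. The `select_nodes` helper and the sample node names are hypothetical.

```python
import re

# Rules copied from the example profile; kept as a dict so the sketch
# needs no YAML parser.
PROFILE = {
    "include_nodes": [r"node_conv2d_[0-9]$", r"node_conv2d_[1-3][0-9]$"],
    "exclude_op_types": ["Sigmoid", "Softmax", "Resize", "Concat"],
    "exclude_nodes": [r"node_silu.*"],
}


def select_nodes(nodes, profile):
    """Return names of nodes that would be quantized under the assumed semantics.

    `nodes` is a list of (name, op_type) tuples. Assumptions (not confirmed by
    the tool's docs): patterns use re.search, and exclusion wins over inclusion.
    """
    include = [re.compile(p) for p in profile.get("include_nodes", [])]
    exclude = [re.compile(p) for p in profile.get("exclude_nodes", [])]
    excluded_ops = set(profile.get("exclude_op_types", []))

    selected = []
    for name, op_type in nodes:
        if op_type in excluded_ops:
            continue
        if any(p.search(name) for p in exclude):
            continue
        # An empty include list is treated as "include everything".
        if include and not any(p.search(name) for p in include):
            continue
        selected.append(name)
    return selected


# Hypothetical node names, chosen to exercise each rule.
nodes = [
    ("node_conv2d_5", "Conv"),
    ("node_conv2d_27", "Conv"),
    ("node_conv2d_45", "Conv"),  # outside the 0-39 range the patterns cover
    ("node_silu_3", "Mul"),
    ("node_sigmoid_1", "Sigmoid"),
]
print(select_nodes(nodes, PROFILE))
```

Under these assumptions only `node_conv2d_5` and `node_conv2d_27` are selected; the trailing `$` in the include patterns is what keeps `node_conv2d_45` out.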

--------------------------------

### Run Docker Container (Edit Mode)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Starts a Docker container for development, mounting the current directory as the workspace. This allows for live code changes.

```bash
docker run --gpus all --rm -it \
  -w /workspace/modelopt-onnx-ptq \
  -v "$(pwd)":/workspace/modelopt-onnx-ptq \
  modelopt-onnx-ptq
```

--------------------------------

### Set Autotune Warmup Runs

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Configure the number of warmup iterations to perform before starting timed runs for latency measurements.

```bash
--autotune_warmup_runs 5
```
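
The idea behind warmup runs can be shown with a generic timing loop. This is a minimal sketch of the measurement pattern, not the autotuner's code; `measure_latency_ms` and its `timed_runs` parameter are hypothetical names.

```python
import time


def measure_latency_ms(fn, warmup_runs=5, timed_runs=20):
    """Call `fn` warmup_runs times untimed, then return the mean latency in ms."""
    for _ in range(warmup_runs):
        fn()  # results discarded: lets caches, clocks, and lazy init settle

    start = time.perf_counter()
    for _ in range(timed_runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / timed_runs


calls = []
mean_ms = measure_latency_ms(lambda: calls.append(1), warmup_runs=5, timed_runs=20)
print(f"{len(calls)} calls total, mean {mean_ms:.4f} ms per timed run")
```

Only the timed iterations contribute to the reported mean, so one-time startup costs absorbed by the warmup calls do not skew it.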

--------------------------------

### Comment Style Guidelines in Python

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Examples demonstrating preferred commenting practices, focusing on explaining non-obvious logic rather than narrating code execution.

```python
# bad — narrates the obvious
img = img / 255.0  # normalize image to 0-1 range

# bad — teaches a lesson
# We use Path here because pathlib provides a more Pythonic way to handle
# filesystem paths compared to os.path, allowing operator overloading for
# path concatenation and better cross-platform compatibility.
output = base_dir / "model.onnx"

# good — explains a non-obvious choice
# TRT requires even spatial dims; pad asymmetrically to preserve alignment
padded = np.pad(tensor, ((0,0), (0,0), (0,1), (0,1)))

# good — no comment needed, code is clear
output = base_dir / "model.onnx"
```

--------------------------------

### Execution Provider Resolution

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/reference.md

Example output of the internal execution provider resolution utility.

```python
_prepare_ep_list(["cpu", "cuda:0", "trt"])
```

```python
["CPUExecutionProvider", ("CUDAExecutionProvider", {"device_id": 0}), "TensorrtExecutionProvider"]
```
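
A mapping that reproduces the input and output shown above could look like the following. This is a sketch, not the repository's `_prepare_ep_list`; the `resolve_eps` name and the lookup table are assumptions. The provider strings themselves are standard ONNX Runtime execution provider names.

```python
# Short aliases mapped to ONNX Runtime execution provider names.
_EP_NAMES = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "trt": "TensorrtExecutionProvider",
}


def resolve_eps(short_names):
    """Map short names such as "cuda:0" to ONNX Runtime provider entries."""
    providers = []
    for item in short_names:
        # "cuda:0" -> name "cuda", device "0"; "cpu" -> name "cpu", device ""
        name, _, device = item.partition(":")
        provider = _EP_NAMES[name.lower()]
        if device:
            # ONNX Runtime accepts (provider_name, options_dict) tuples.
            providers.append((provider, {"device_id": int(device)}))
        else:
            providers.append(provider)
    return providers


print(resolve_eps(["cpu", "cuda:0", "trt"]))
```

The resulting list has the shape accepted by the `providers` argument of `onnxruntime.InferenceSession`.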

--------------------------------

### Verify Model Optimizer Version

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Check the installed version of the modelopt package to ensure compatibility.

```bash
python -c "import modelopt; print(modelopt.__version__)"
```

--------------------------------

### Run Diagnostic Commands

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Execute these commands to inspect the environment, including installed versions of Model Optimizer, ONNX Runtime, and CUDA library paths.

```bash
# Model Optimizer version
python -c "import modelopt; print(modelopt.__version__)"

# ORT version and available EPs
python -c "import onnxruntime as ort; print(ort.__version__); print(ort.get_available_providers())"

# Check what CUDA libs ORT was built against
ldd $(python -c "import onnxruntime as ort, os; print(os.path.join(os.path.dirname(ort.__file__), 'capi/libonnxruntime_providers_cuda.so'))") | grep -E 'cublas|cudnn'

# System CUDA toolkit
ls /usr/local/cuda*/lib64/libcublas.so* 2>/dev/null
nvcc --version 2>/dev/null

# LD_LIBRARY_PATH
echo $LD_LIBRARY_PATH
```
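
The version checks can also be gathered in one script that does not stop at the first missing package. The sketch below is a convenience wrapper under assumed names (`collect_versions`, `collect_report`); it covers only the Python-level checks, not the `ldd` or `nvcc` steps.

```python
import importlib
import json
import os


def collect_versions(modules=("modelopt", "onnxruntime", "tensorrt")):
    """Return {module: version or None} without raising on missing packages."""
    report = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Missing package, or a native library failed to load.
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


def collect_report():
    """Combine module versions with the current LD_LIBRARY_PATH."""
    report = collect_versions()
    report["LD_LIBRARY_PATH"] = os.environ.get("LD_LIBRARY_PATH", "")
    return report


print(json.dumps(collect_report(), indent=2))
```

A `null` entry in the output means the import failed, which is a cue to run the corresponding shell check by hand.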

--------------------------------

### Build FP16 TensorRT Engine

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Build an FP16 TensorRT engine from the original ONNX model. This engine serves as the reference for comparison against quantized builds.

```bash
modelopt-onnx-ptq build-trt --onnx models/your.onnx --mode fp16
```

--------------------------------

### Display Modelopt ONNX Quantization Help

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Run this command to view the full list of available options and their descriptions for the modelopt.onnx.quantization module.

```bash
python -m modelopt.onnx.quantization --help
```

--------------------------------

### Run Full Quantization Grid (Baseline)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/yolo26n-end-to-end-ptq-workflow.md

Execute the pipeline-e2e command with '--quant-matrix all' to establish a baseline performance and accuracy across various quantization settings without a profile. This helps identify the best initial quantization strategy.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo26n_no_nms_e2e.onnx \
  --calibration-data-size 1000 \
  --input-name input \
  --output-format deepstream_yolo \
  --high-precision-dtype fp16 \
  --quant-matrix all \
  --session-id yolo26n_quant_baseline
```

--------------------------------

### Fix `No module named 'utils'` in TREx (Manual Install)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

This sequence of commands, including a `sed` patch, resolves the `No module named 'utils'` error for manual installations. The patch puts the parent of `SCRIPT_DIR` at the front of `sys.path`, and the package is then installed in editable mode.

```bash
TREX_ROOT=/workspace/TREx/tools/experimental/trt-engine-explorer
touch "${TREX_ROOT}/bin/__init__.py"
sed -i 's|sys.path.append(os.path.join(os.path.dirname(SCRIPT_DIR), "utils"))|sys.path.insert(0, os.path.dirname(SCRIPT_DIR))|' \
  "${TREX_ROOT}/bin/trex.py"
cd "${TREX_ROOT}"
pip install --no-cache-dir -e ".[notebook]"
trex --help
```
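
To make the effect of the `sed` substitution explicit, the same replacement is sketched in Python below. `patch_source` is a hypothetical helper, and the sample text contains only the one line the patch targets, not the real contents of `trex.py`. One difference worth noting: `str.replace` rewrites every occurrence, while the `sed` expression without a `g` flag rewrites only the first one per line.

```python
OLD = 'sys.path.append(os.path.join(os.path.dirname(SCRIPT_DIR), "utils"))'
NEW = 'sys.path.insert(0, os.path.dirname(SCRIPT_DIR))'


def patch_source(text):
    """Apply the same literal substitution as the sed command to file contents."""
    return text.replace(OLD, NEW)


# Stand-in for the relevant line of bin/trex.py (the rest of the file is not shown).
sample = OLD + "\n"
patched = patch_source(sample)
print(patched, end="")
```

Before the patch, the `utils` directory itself is appended to the search path; afterwards its parent is searched first, so `utils` can presumably be imported as a package.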

--------------------------------

### Import Autotune Q/DQ Baseline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Specify a pre-quantized ONNX model to import Q/DQ patterns from, used for warm-starting the autotuner.

```bash
--autotune_qdq_baseline ./baseline.onnx
```

--------------------------------

### Install ORT Nightly for CUDA 13

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Use this command to resolve libcublas version mismatches by installing the nightly build of ONNX Runtime compatible with CUDA 13.

```bash
pip uninstall onnxruntime onnxruntime-gpu -y
pip install --pre --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-13-nightly/pypi/simple/ onnxruntime-gpu
```

--------------------------------

### Basic Quantization with Autotune

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Use this command for post-training quantization with default autotuning enabled. Ensure calibration data and the ONNX model path are provided.

```bash
modelopt-onnx-ptq quantize --calibration_data ... --onnx_path ... --autotune default
```

--------------------------------

### Execute manual profile tuning steps

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Commands for manual quantization, TensorRT engine building, and performance evaluation.

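```bash
# Subcommands in the order they are run; arguments other than those shown are omitted
modelopt-onnx-ptq calib
modelopt-onnx-ptq quantize --profile … --suffix …
modelopt-onnx-ptq build-trt --mode strongly-typed
modelopt-onnx-ptq eval-trt
modelopt-onnx-ptq trt-bench
modelopt-onnx-ptq trex-analyze
```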

--------------------------------

### Initialize TREx Environment

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/ptq-trt-performance/SKILL.md

Commands to set up the environment and launch the Jupyter interface for TREx tools.

```bash
# Optional: export TREX_VENV=/workspace/TREx/tools/experimental/trt-engine-explorer/env_trex
cd /workspace/TREx/tools/experimental/trt-engine-explorer/notebooks
jupyter lab   # or: jupyter notebook
```

--------------------------------

### Use Autotune Pattern Cache

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Provide a YAML file for caching patterns to speed up autotuning via warm-starting.

```bash
--autotune_pattern_cache ./patterns.yaml
```

--------------------------------

### Compare Two ONNX Models

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Enable two-plan comparison by providing a primary ONNX model and a second ONNX model using `--compare-onnx`. Specify a different builder mode for the second model with `--compare-onnx-mode` if needed. Requires the ONNX model path, a build mode, and image size.

```bash
modelopt-onnx-ptq trex-analyze --onnx models/yolo_fp16.onnx --mode fp16 --img-size 640 \
  --compare --compare-onnx artifacts/quantized/yolo.int8.entropy.quant.onnx \
  --compare-onnx-mode strongly-typed
```

--------------------------------

### Build Docker Image

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Builds the Docker image for modelopt-onnx-ptq. Ensure you are in the repository root directory.

```bash
docker build -f docker/Dockerfile -t modelopt-onnx-ptq .
```

--------------------------------

### End-to-End PTQ Pipeline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Run the complete Post-Training Quantization (PTQ) pipeline, including calibration, FP16 baseline generation, quantization, and reporting, with a single command. Use optional flags like `--img-size`, `--input-name`, and `--output-format` as needed. Exclude the FP16 baseline comparison by adding `--no-fp16-baseline`.

```bash
modelopt-onnx-ptq pipeline-e2e --onnx models/your.onnx
```

--------------------------------

### Enable Verbose Autotuner Logging

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Activate detailed logging for the autotuning process to provide more insights.

```bash
--autotune_verbose
```

--------------------------------

### PTQ Workflow Commands

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/quantization-performance-workflow.md

A collection of CLI commands for the full PTQ lifecycle, including calibration, quantization, engine building, and performance analysis.

```bash
# 1) Calibration
modelopt-onnx-ptq calib --images_dir data/coco/val2017/ ... --output_path artifacts/calibration/calib_coco.npy

# 2) Quantize (first iteration — no custom profile)
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib_coco.npy \
  --onnx_path models/yolo.onnx \
  --quantize_mode int8 --calibration_method entropy

# 2b) Quantize with a hand-tuned profile (no autotune) — try shipped YOLO26 examples under modelopt_onnx_ptq/profiles/
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib_coco.npy \
  --onnx_path models/yolo.onnx \
  --quantize_mode int8 --calibration_method entropy \
  --profile yolo26n_no_nms_e2e_perf \
  --suffix .v1.quant.onnx

# 3–4) Build, evaluate, benchmark (paths follow your artifacts layout)
modelopt-onnx-ptq build-trt --onnx artifacts/quantized/<stem>.int8.entropy.quant.onnx --mode strongly-typed --img-size 640 --batch 1
modelopt-onnx-ptq eval-trt --engine artifacts/trt_engine/<stem>.b1_i640.engine ...
modelopt-onnx-ptq trt-bench --engine artifacts/trt_engine/<stem>.b1_i640.engine ...

# 5) Profile the quantized plan
modelopt-onnx-ptq trex-analyze \
  --onnx artifacts/quantized/<stem>.int8.entropy.quant.onnx \
  --mode strongly-typed --img-size 640

# 5b) Compare layer timings: FP16 TensorRT plan vs quantized ONNX (two builds — see --compare)
#     Primary = baseline ONNX + --mode fp16; second = PTQ ONNX + --compare-onnx-mode strongly-typed.
modelopt-onnx-ptq trex-analyze \
  --onnx models/yolo.onnx \
  --mode fp16 \
  --compare \
  --compare-onnx artifacts/quantized/<stem>.int8.entropy.quant.onnx \
  --compare-onnx-mode strongly-typed \
  --img-size 640 --input-name <onnx_input_name>
```

--------------------------------

### modelopt-onnx-ptq trex-analyze

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Analyzes TensorRT engine builds using `trtexec` for profiling. Requires TREx or Docker TREx to be installed. Supports comparison, graph analysis, and reporting modes.

````APIDOC
## modelopt-onnx-ptq trex-analyze

### Description
`trtexec` build + profile; pick one of `--compare`, `--graph`, `--report`, or none.

### Method
CLI Command

### Endpoint
N/A (CLI Tool)

### Parameters
#### Command Line Arguments
- **`--onnx`** (string) - Required - Path to the ONNX model.
- **`--engine`** (string) - Optional - Path to an existing TensorRT engine for comparison.
- **`--compare`** (boolean) - Optional - Enable comparison mode.
- **`--graph`** (boolean) - Optional - Enable graph analysis mode.
- **`--report`** (boolean) - Optional - Enable detailed reporting mode.
- **`--trex`** (string) - Optional - Path to TREx executable or Docker image.

### Request Example
```bash
modelopt-onnx-ptq trex-analyze --onnx ./models/model.onnx --mode fp16 --img-size 640 --report
```

### Response
#### Success Response (CLI Output)
- Analysis results, profiling data, or comparison reports.

#### Response Example
```text
TREx analysis complete. Profile data saved to ./artifacts/trex_analysis/
```
````

--------------------------------

### Manual PTQ workflow

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/ptq-trt-performance/SKILL.md

Perform quantization and TensorRT engine building manually. This involves calibrating, quantizing the ONNX model, building the TensorRT engine, evaluating mAP, and benchmarking QPS/latency. Run `trt-bench` sequentially for accurate QPS/latency comparisons.

```bash
# 1. Calibrate
modelopt-onnx-ptq calib --img_size <img_size> --output_path calib.npy

# 2. Quantize
modelopt-onnx-ptq quantize --profile <profile> --suffix .v2.quant.onnx \
  --output_path model.v2.quant.onnx \
  --onnx_path model.onnx \
  --calibration_data calib.npy

# 3. Build TensorRT engine
modelopt-onnx-ptq build-trt --onnx model.v2.quant.onnx \
  --mode strongly-typed \
  --engine model.v2.quant.engine

# 4. Evaluate mAP
modelopt-onnx-ptq eval-trt --onnx model.v2.quant.onnx \
  --engine model.v2.quant.engine \
  --output-format auto

# 5. Benchmark QPS/latency
modelopt-onnx-ptq trt-bench --engine model.v2.quant.engine \
  --output-format auto
```
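
Running benchmarks strictly one after another can be scripted. The sketch below is one way to do it, assuming hypothetical engine file names and a hypothetical `run_steps` helper; it defaults to a dry run that only prints the commands. Presumably the reason for the sequential rule is that concurrent runs would compete for the same GPU and distort the numbers.

```python
import shlex
import subprocess


def run_steps(steps, dry_run=True):
    """Run commands one after another, stopping at the first failure.

    Returns the list of commands that were run (or would run in dry-run mode).
    """
    done = []
    for argv in steps:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError, which halts the sequence.
            subprocess.run(argv, check=True)
        done.append(argv)
    return done


# Hypothetical engine file names, used for illustration only.
ENGINES = ["model.v1.quant.engine", "model.v2.quant.engine"]
bench_steps = [
    ["modelopt-onnx-ptq", "trt-bench", "--engine", engine] for engine in ENGINES
]
completed = run_steps(bench_steps)  # dry run: prints the commands only
```

Pass `dry_run=False` to execute the commands; each one finishes before the next starts.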

--------------------------------

### Quantize ONNX Model using CLI

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

The command-line interface provides an alternative for ONNX model quantization. Use `--onnx_path`, `--quantize_mode`, `--calibration_data_path`, `--calibration_method`, and `--output_path`. Autotune can be set to 'default'.

```bash
python -m modelopt.onnx.quantization \
  --onnx_path=model.onnx \
  --quantize_mode=int8 \
  --calibration_data_path=calib.npy \
  --calibration_method=entropy \
  --output_path=model.quant.onnx \
  --autotune=default
```

--------------------------------

### Verify ONNX Runtime CUDA Provider Libraries

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Check the CUDA version compatibility of the ONNX Runtime GPU provider by inspecting the linked libraries. This helps diagnose version mismatches between the system CUDA toolkit and the installed ONNX Runtime.

```bash
ldd $(python -c "import onnxruntime; print(onnxruntime.__file__.replace('__init__.py','capi/libonnxruntime_providers_cuda.so'))")
```

--------------------------------

### Run Docker Container (Default)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Runs the modelopt-onnx-ptq Docker container with specified volumes for data persistence. Mounts models, data, and artifacts directories.

```bash
export DATA_ROOT="$HOME/modelopt-onnx-ptq"
mkdir -p "$DATA_ROOT/models" "$DATA_ROOT/data" "$DATA_ROOT/artifacts"

docker run --gpus all --rm -it \
  -w /workspace/modelopt-onnx-ptq \
  -v "$DATA_ROOT/models:/workspace/modelopt-onnx-ptq/models" \
  -v "$DATA_ROOT/data:/workspace/modelopt-onnx-ptq/data" \
  -v "$DATA_ROOT/artifacts:/workspace/modelopt-onnx-ptq/artifacts" \
  modelopt-onnx-ptq
```

--------------------------------

### Build and Run Docker Container for Model Optimizer

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Builds the Docker image for the Model Optimizer and runs it with volume mounts for models, data, and artifacts. The `--gpus all` flag exposes the host GPUs to the container and requires the NVIDIA Container Toolkit on the host.

```bash
docker build -f docker/Dockerfile -t modelopt-onnx-ptq .
export DATA_ROOT="$HOME/modelopt-onnx-ptq"
mkdir -p "$DATA_ROOT/models" "$DATA_ROOT/data" "$DATA_ROOT/artifacts"
docker run --gpus all --rm -it -w /workspace/modelopt-onnx-ptq \
  -v "$DATA_ROOT/models:/workspace/modelopt-onnx-ptq/models" \
  -v "$DATA_ROOT/data:/workspace/modelopt-onnx-ptq/data" \
  -v "$DATA_ROOT/artifacts:/workspace/modelopt-onnx-ptq/artifacts" \
  modelopt-onnx-ptq
```

--------------------------------

### Configure Autotune Preset

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Select a preset for the autotuner: 'quick', 'default', or 'extensive'. This tunes Q/DQ placement for TensorRT.

```bash
--autotune quick
```

--------------------------------

### Run End-to-End Pipeline

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Orchestrates the full optimization workflow from calibration to report generation.

```bash
# Basic end-to-end with single INT8 quantization
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --img-size 640 \
  --output-format auto

# Full 6-combo comparison (int8, fp8, int4 with different methods)
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --img-size 640 \
  --session-id full-comparison-001

# INT8 with autotune optimization
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix int8.all \
  --autotune default \
  --img-size 640 \
  --session-id autotune-test

# Custom quantization matrix with profile
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix "int8.entropy,fp8.max" \
  --quantize-profile yolo26n_no_nms_e2e_perf \
  --img-size 640 \
  --batch 1 \
  --session-id profile-comparison

# Skip FP16 baseline, continue on errors
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --no-fp16-baseline \
  --continue-on-error \
  --session-id ptq-only-run

# Full pipeline with all options
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --autotune default \
  --quantize-profile yolo26n_no_nms_e2e_perf \
  --high-precision-dtype fp16 \
  --img-size 640 \
  --batch 1 \
  --input-name images \
  --output-format auto \
  --session-id comprehensive-test
```
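
The `--quant-matrix` values above follow a `mode.method` pattern separated by commas. A minimal sketch of splitting such a string is shown below; `parse_quant_matrix` is a hypothetical helper, and how the tool itself expands `all` is not shown in the source, so the sketch leaves wildcards unexpanded.

```python
def parse_quant_matrix(spec):
    """Split "mode.method,mode.method" into (mode, method) tuples.

    Wildcards such as "all" are kept verbatim, because the combinations the
    tool expands them to are not documented here.
    """
    pairs = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        mode, sep, method = entry.partition(".")
        # A bare token such as "all" has no method part.
        pairs.append((mode, method if sep else None))
    return pairs


for spec in ("int8.entropy,fp8.max", "int8.all", "all"):
    print(spec, "->", parse_quant_matrix(spec))
```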

--------------------------------

### Quantize with Custom Profile

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/quantization-performance-workflow.md

Run the quantization process using a specific YAML profile file to control node inclusion or exclusion.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib_coco.npy \
  --onnx_path models/yolo.onnx \
  --quantize_mode int8 --calibration_method entropy \
  --profile profiles/my_yolo_rules.yaml
```

--------------------------------

### Build the Docker image

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Builds the container image from the provided Dockerfile after cloning the repository.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
docker build -f docker/Dockerfile -t modelopt-onnx-ptq .
```

--------------------------------

### Enable Weight-Only Quantization Style

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Use a weight-only quantization style in which weights are stored in quantized form and only DQ (dequantize) nodes are kept in the graph for them.

```bash
--dq_only
```
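
The arithmetic behind weight-only quantization can be sketched in a few lines. This is a simplified illustration with symmetric per-tensor INT8 scaling, which is an assumption; the tool's actual scaling granularity is not described here. In ONNX terms, a `DequantizeLinear` node performs the second step at inference time.

```python
def quantize_weights_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """DequantizeLinear with a zero zero-point: q * scale."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.03, 0.99]  # illustrative values
q, scale = quantize_weights_int8(weights)
restored = dequantize(q, scale)
print("stored integers:", q)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Only the integers and the scale need to be stored; the round-trip error stays within half a quantization step.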

--------------------------------

### CLI: modelopt-onnx-ptq quantize

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/reference.md

Command line interface for performing quantization on ONNX models.

````APIDOC
## CLI: modelopt-onnx-ptq quantize

### Description
Command line utility to quantize ONNX models. Use --help for additional wrapper options.

### Parameters
- **--onnx_path** (str) - Required - Input ONNX path.
- **--quantize_mode** (str) - Optional - Quantization mode (fp8, int8, int4).
- **--calibration_data_path** (str) - Optional - Path to .npy or .npz calibration data.
- **--output_path** (str) - Optional - Output ONNX path.

### Request Example
```bash
python -m modelopt.onnx.quantization --onnx_path model.onnx --quantize_mode int8 --output_path model_quant.onnx
```
````

--------------------------------

### Run pipeline-e2e for quantization comparison

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/ptq-trt-performance/SKILL.md

Use `pipeline-e2e` to compare int8, fp8, and int4 quantization methods. It automates multiple runs and reporting. Ensure `--input-name` is set for `build-trt` if the ONNX input name is not the default. The `--output-format auto` and `--onnx` flags are passed to `eval-trt` to correctly infer the output layout.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --quant-matrix all \
  --quantize-profile <name> \
  --input-name <input_name> \
  --build-mode strongly-typed \
  --session-id my_run \
  --continue-on-error \
  --no-fp16-baseline
```

--------------------------------

### Model Optimizer PTQ Pipeline Steps

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Illustrates the sequence of operations for post-training quantization using the ONNX model optimizer. Autotune is a flag within the quantize step and supports int8/fp8.

```text
ONNX FP32 → calib → quantize [--autotune] → build-trt → eval-trt → trt-bench
```

--------------------------------

### Re-run PTQ with Graph Simplification

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

Use this command to re-run post-training quantization with graph simplification enabled. This can help resolve issues related to QuantizeLinear nodes by potentially altering the Q/DQ layout.

```bash
modelopt-onnx-ptq quantize ... -- --simplify
```

--------------------------------

### Enable Zero-Point Quantization

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Activate zero-point quantization techniques, such as those used in awq_lite.

```bash
--use_zero_point
```
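
What a zero point adds can be seen in a small asymmetric quantization example. The sketch below is generic and does not reproduce awq_lite; the 4-bit unsigned range and the `quantize_asymmetric` helper are assumptions made for illustration.

```python
def quantize_asymmetric(values, num_bits=4):
    """Asymmetric quantization to [0, 2**num_bits - 1] with a zero point."""
    qmax = (1 << num_bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # The zero point is the integer that represents the real value 0.0.
    zero_point = max(0, min(qmax, round(-lo / scale)))
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Inverse mapping: (q - zero_point) * scale."""
    return [(v - zero_point) * scale for v in q]


values = [-0.5, 0.0, 0.3, 1.0]  # illustrative, not centered on zero
q, scale, zero_point = quantize_asymmetric(values)
restored = dequantize(q, scale, zero_point)
print("quantized:", q, "zero point:", zero_point)
```

Because the value range here is not centered on zero, shifting by a zero point lets all 16 levels cover the actual range instead of wasting levels on unused negative values.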

--------------------------------

### Run ONNX Post-Training Quantization (INT8)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Performs INT8 quantization on ONNX models using pre-generated calibration data. Specify the calibration data path, the ONNX model file(s), quantization mode, and calibration method.

```bash
# Without autotune
modelopt-onnx-ptq quantize \
  --calibration_data=artifacts/calibration/calib_coco.npy \
  --onnx_glob="models/*.onnx" \
  --quantize_mode=int8 \
  --calibration_method=entropy
```

--------------------------------

### Run End-to-End Pipeline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/workflow.md

Executes the full quantization and optimization pipeline for multiple modes using the specified autotune configuration.

```bash
modelopt-onnx-ptq pipeline-e2e --onnx models/yolo.onnx --quant-matrix all --autotune default --continue-on-error
```

--------------------------------

### Run End-to-End Quantization Pipeline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Execute a full pipeline covering multiple quantization combinations with autotune enabled.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --autotune default \
  --continue-on-error
```

--------------------------------

### Specify Log File

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Direct the quantization process logs to a specific file, separate from the main tool's log file.

```bash
--log_file quantization.log
```

--------------------------------

### Quantize Model with Built-in Profile

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Use this command to perform post-training quantization using a specified built-in profile. Ensure the calibration data and ONNX model path are correctly provided.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib.npy \
  --onnx_path models/yolo.onnx \
  --profile yolo26n_no_nms_e2e_perf_backbone_neck
```

--------------------------------

### Benchmark with trtexec

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Use the `trtexec` command-line tool for benchmarking instead of the TensorRT Python API during autotuning.

```bash
--autotune_use_trtexec
```

--------------------------------

### Quantize Model with Autotune

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/workflow.md

Performs quantization on an ONNX model using a specified calibration dataset and autotune preset.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib.npy \
  --onnx_path models/yolo.onnx \
  --autotune default
```

--------------------------------

### Activate TREx Environment and Show Help

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/docker-reference.md

Activates the TensorRT Engine Explorer (TREx) virtual environment and displays its help message. This is used inside the container for model profiling.

```bash
source /workspace/TREx/tools/experimental/trt-engine-explorer/env_trex/bin/activate
trex --help
```

--------------------------------

### Quantize ONNX Model via CLI

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Quantize a model either through the `modelopt-onnx-ptq` CLI or by calling the `modelopt.onnx.quantization` module directly.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data=artifacts/calibration/calib_coco.npy \
  --onnx_path=models/yolov8n.onnx \
  --quantize_mode=int8 \
  --calibration_method=entropy \
  --autotune default
```

```bash
python -m modelopt.onnx.quantization \
  --onnx_path=models/yolov8n.onnx \
  --quantize_mode=int8 \
  --calibration_data_path=artifacts/calibration/calib_coco.npy \
  --calibration_method=entropy \
  --output_path=artifacts/quantized/yolov8n.int8.entropy.quant.onnx \
  --autotune=default
```

--------------------------------

### Run Pipeline with Backbone Quantization Profile

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/yolo26n-end-to-end-ptq-workflow.md

Execute the end-to-end pipeline with a specified quantization profile for the backbone. Ensure the profile name matches your YAML configuration. The `--high-precision-dtype fp16` flag is optional if relying on defaults.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo26n_no_nms_e2e.onnx \
  --calibration-data-size 1000 \
  --input-name input \
  --output-format deepstream_yolo \
  --quantize-profile yolo26n_no_nms_e2e_backbone \
  --high-precision-dtype fp16 \
  --quant-matrix all \
  --session-id yolo26n_prof_backbone
```

--------------------------------

### modelopt-onnx-ptq quantize

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Wraps the ONNX PTQ process using the Model Optimizer. It supports various quantization modes (FP8, INT8, INT4), calibration methods, and optional autotuning for optimal performance.

```APIDOC
## modelopt-onnx-ptq quantize

### Description
Runs ONNX PTQ via Model Optimizer.

### Method
CLI Command

### Endpoint
N/A (CLI Tool)

### Parameters
#### Common Arguments
- **`--calibration_data`** (string) - Required - Path to `calib.npy`
- **`--onnx_path`** (string) - Optional - Single ONNX file
- **`--onnx_glob`** (string) - Optional - Glob (e.g. `models/*.onnx`) — mutually exclusive with `--onnx_path`
- **`--output_dir`** (string) - Optional - Output directory (default: `<artifacts root>/quantized`; root is `cwd/artifacts` or `MODELOPT_ARTIFACTS_ROOT`)
- **`--quantize_mode`** (string) - Optional - `fp8`, `int8`, `int4`
- **`--calibration_method`** (string) - Optional - e.g. `entropy`, `max` (mode-dependent)
- **`--high_precision_dtype`** (string) - Optional - Default `fp16` (aligns with TensorRT mixed precision); use `fp32` if PTQ `shape_inference` fails on your graph
- **`--autotune`** (string) - Optional - Q/DQ placement tuning via TensorRT timing (`quick` | `default` | `extensive`). Use with `int8`; `fp8` + autotune often fails on some detection ONNX graphs. Needs GPU + TensorRT.
- **`--profile`** (string) - Optional - YAML file (built-in name or path) with Model Optimizer include/exclude rules. Requires PyYAML.
- **`--suffix`** (string) - Optional - Output suffix (default `.quant.onnx`)

### Hardware Requirements
- **FP8 hardware:** `--quantize_mode fp8` requires a CUDA GPU with compute capability ≥ 8.9 (e.g., Ada Lovelace, Hopper, Blackwell).

### Pass-through Arguments
Arguments after a lone `--` are appended to the `python -m modelopt.onnx.quantization` command. Do not repeat arguments already set by `modelopt-onnx-ptq quantize`.

### Request Example
```bash
modelopt-onnx-ptq quantize --calibration_data ./artifacts/calibration/calib.npy --onnx_path ./models/model.onnx --quantize_mode int8 --autotune extensive
```

### Response
#### Success Response (File Output)
- Quantized ONNX model file(s).

#### Response Example
```
# Output file: artifacts/quantized/model.quant.onnx
```
```
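
The FP8 hardware requirement above can be checked before choosing `--quantize_mode fp8`. This is a minimal sketch: `supports_fp8` is a hypothetical helper that compares a compute capability string against the documented 8.9 minimum. The commented usage assumes a recent NVIDIA driver whose `nvidia-smi` supports the `compute_cap` query field.

```bash
# Hypothetical helper, not part of modelopt-onnx-ptq.
# Succeeds when a "major.minor" compute capability is at least 8.9.
supports_fp8() {
  major=${1%%.*}
  minor=${1#*.}
  [ "$major" -gt 8 ] || { [ "$major" -eq 8 ] && [ "$minor" -ge 9 ]; }
}

# Example (needs an NVIDIA GPU and driver, so it is left commented out):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   supports_fp8 "$cap" || echo "compute capability $cap is below 8.9; fp8 is unavailable" >&2
```

When the check fails, `int8` remains available as a quantize mode.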

--------------------------------

### Compare FP16 Baseline vs Quantized Engine

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Compare an FP16 baseline model against a quantized engine using `trex-analyze` with the `--compare` and `--compare-onnx` flags.

```bash
modelopt-onnx-ptq trex-analyze \
  --onnx models/yolo.onnx \
  --mode fp16 \
  --compare \
  --compare-onnx artifacts/quantized/yolo.int8.entropy.quant.onnx \
  --compare-onnx-mode strongly-typed \
  --img-size 640 \
  --input-name images
```

--------------------------------

### Specify Autotune State File

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Provide a state file for resuming autotuning from a previous checkpoint or for crash recovery. Defaults to a file within the output directory.

```bash
--autotune_state_file ./autotune_state.json
```

--------------------------------

### modelopt-onnx-ptq report-runs

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Scans log directories and writes a Markdown report with tables and PNG charts. It supports session IDs for specific run analysis and merging global logs.

```APIDOC
## `modelopt-onnx-ptq report-runs`

### Description
Scans log directories and writes a Markdown report with tables plus PNG charts next to the .md file. With `--session-id`, charts are `chart_ips_latency_<id>.png` and `chart_eval_<id>.png`; without a session, the same pattern uses the output file stem as `<id>`. Used standalone or at the end of `pipeline-e2e`.

The report starts with an FP16 baseline table when a `*.fp16` engine row exists, followed by the Best configuration table. When an FP16 row is present, each Best configuration row is the best quantized (non-baseline) engine for that metric, and "vs FP16" compares that engine to the baseline (never the baseline to itself). Charts order the series as FP16 baseline → int4 → int8 → fp8. The Eval and Throughput tables list FP16 first when present, then sort by mAP or QPS. The Comparison sections, Environment & versions, and Data sources follow.

### Method
Not applicable (CLI command)

### Endpoint
Not applicable (CLI command)

### Parameters
#### Command Arguments
- **`--session-id`** (string) - Optional - Shortcut: set `--trt-logs-dir` and `--eval-logs-dir` to `artifacts/pipeline_e2e/sessions/<id>/trt_engine/logs` and `…/predictions/logs` (unless you override them). With `-o` omitted, writes `artifacts/pipeline_e2e/sessions/<id>/report_<id>.md`. Same layout as `pipeline-e2e`. If omitted, `SESSION_ID` in the environment is used. Use this (or explicit session log paths) so the report sees `pipeline-e2e` outputs — the default without `--session-id` is the global flat `artifacts/trt_engine/logs`, not the session folder.
- **`--merge-global-logs`** (boolean) - Optional - Also read global `<artifacts>/trt_engine/logs` and `<artifacts>/predictions/logs` and merge with the primary dirs (newest timestamp per config). `pipeline-e2e` enables this when invoking `report-runs`.
- **`--trt-logs-dir`** (string) - Optional - Folder with `trt_bench_*.log` (default without `--session-id`: `<artifacts>/trt_engine/logs` — often not where `pipeline-e2e` writes)
- **`--eval-logs-dir`** (string) - Optional - Folder with `eval_*.log` (default without `--session-id`: `<artifacts>/predictions/logs`)
- **`-o`, `--output`** (string) - Optional - Output `.md` path (default: `artifacts/reports/trt_eval_report_<timestamp>.md`, or `…/sessions/<id>/report_<id>.md` when `--session-id` or `SESSION_ID` selects a session)

### Environment Variables
- **`MODELOPT_ONNX_PTQ_LOGLEVEL` or `LOGLEVEL`** (string) - Controls logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` (`MODELOPT_ONNX_LOGLEVEL` / `MODELOPT_YOLO_LOGLEVEL` are deprecated).
- **`MODELOPT_ARTIFACTS_ROOT`** (string) - Sets the root directory for artifacts. Defaults to `<cwd>/artifacts` and is created if it doesn't exist.
- **`SESSION_ID`** (string) - Default session ID for various commands (`pipeline-e2e`, `build-trt`, `eval-trt`, `trt-bench`, `report-runs`) when `--session-id` is not explicitly passed. CLI arguments take precedence over this variable.
```
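
The default locations described above can be summarized in a small sketch. `report_paths` is a hypothetical helper that only echoes paths and never calls the tool. It assumes the `artifacts/...` paths in this reference resolve under `MODELOPT_ARTIFACTS_ROOT`, and it lets an explicit argument take precedence over `SESSION_ID`, mirroring the documented rule that CLI arguments win over the environment.

```bash
# Hypothetical helper; echoes the documented default locations for report-runs.
# Usage: report_paths [session-id]
report_paths() {
  root=${MODELOPT_ARTIFACTS_ROOT:-$PWD/artifacts}
  # An explicit argument wins over SESSION_ID, as CLI flags do.
  id=${1:-${SESSION_ID:-}}
  if [ -n "$id" ]; then
    base=$root/pipeline_e2e/sessions/$id
    echo "trt_logs=$base/trt_engine/logs"
    echo "eval_logs=$base/predictions/logs"
    echo "report=$base/report_$id.md"
  else
    # Global (flat) layout used when no session is selected.
    echo "trt_logs=$root/trt_engine/logs"
    echo "eval_logs=$root/predictions/logs"
    echo "report=$root/reports/trt_eval_report_<timestamp>.md"
  fi
}
```

Calling `report_paths my-experiment-001` lists the session folders that `report-runs --session-id my-experiment-001` is documented to read, which makes it easier to spot when logs were written to the global folders instead.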

--------------------------------

### Quantize with High Precision FP32

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

If quantization fails because of FP16 post-processing on dynamic detection heads, retry with FP32 as the high-precision dtype.

```bash
modelopt-onnx-ptq quantize ... --high_precision_dtype fp32
```
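
The retry can be scripted. This is a minimal sketch, assuming any non-zero exit is worth one retry (it does not inspect the error to confirm the failure came from FP16 handling) and that the original arguments do not already set `--high_precision_dtype`. `quantize_with_fp32_fallback` and the `PTQ_CLI` override are hypothetical names introduced here.

```bash
# Hypothetical wrapper: run quantize with the default high-precision dtype,
# then retry once with fp32 if the first attempt exits non-zero.
# PTQ_CLI can be overridden (for example with a stub); it defaults to the real tool.
quantize_with_fp32_fallback() {
  cli=${PTQ_CLI:-modelopt-onnx-ptq}
  if "$cli" quantize "$@"; then
    return 0
  fi
  echo "quantize failed; retrying with --high_precision_dtype fp32" >&2
  "$cli" quantize "$@" --high_precision_dtype fp32
}
```

A call such as `quantize_with_fp32_fallback --calibration_data artifacts/calibration/calib.npy --onnx_path models/yolo.onnx` then behaves like the plain command unless the first attempt fails.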

--------------------------------

### Build TensorRT Engine with Best Mode

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

Attempt to build a TensorRT engine using the 'best' mode. This can be a workaround for parsing issues with strictly-typed Q/DQ patterns, though accuracy may differ from strict PTQ.

```bash
modelopt-onnx-ptq build-trt --onnx ... --mode best
```

--------------------------------

### Build TensorRT Engine with Session Tracking

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Builds a TensorRT engine and enables session tracking for report aggregation. Specify the ONNX model path, image size, batch size, and a session ID.

```bash
modelopt-onnx-ptq build-trt \
  --onnx artifacts/quantized/yolo.int8.entropy.quant.onnx \
  --img-size 640 \
  --batch 1 \
  --session-id my-experiment-001
```
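
Because `SESSION_ID` in the environment is used when `--session-id` is not passed, one exported value can be shared across `build-trt`, `eval-trt`, `trt-bench`, and `report-runs`. A minimal sketch follows; `new_session_id` is a hypothetical helper, and the label-plus-timestamp format is only a suggestion, not a naming rule of the tool.

```bash
# Hypothetical helper: derive a session ID from a label and the current UTC time.
new_session_id() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%d-%H%M%S)"
}

SESSION_ID=$(new_session_id yolo-int8)
export SESSION_ID
echo "SESSION_ID=$SESSION_ID"

# With SESSION_ID exported, --session-id can be omitted, for example:
#   modelopt-onnx-ptq build-trt --onnx artifacts/quantized/yolo.int8.entropy.quant.onnx --img-size 640 --batch 1
#   modelopt-onnx-ptq report-runs
```

An explicit `--session-id` on any single command still takes precedence over the exported value.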

--------------------------------

### Execute Automated PTQ Performance Grid

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Run the end-to-end pipeline for PTQ performance measurement using the specified ONNX model.

```bash
modelopt-onnx-ptq pipeline-e2e --onnx models/…onnx
```

--------------------------------

### Specify Autotune Output Directory

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Set the directory where autotuning artifacts, such as state and logs, will be stored.

```bash
--autotune_output_dir ./autotune_results
```

--------------------------------

### Generate Performance Report

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Aggregates benchmark and evaluation logs into a Markdown report.

```bash
# Generate report from session logs
modelopt-onnx-ptq report-runs --session-id my-experiment-001

# Generate report from custom log directories
modelopt-onnx-ptq report-runs \
  --trt-logs-dir artifacts/trt_engine/logs \
  --eval-logs-dir artifacts/predictions/logs \
  -o artifacts/reports/comparison_report.md

# Session report with global log merging
modelopt-onnx-ptq report-runs \
  --session-id my-experiment-001 \
  --merge-global-logs \
  -o artifacts/pipeline_e2e/sessions/my-experiment-001/full_report.md
```

--------------------------------

### Analyze Quantized Models with TREx

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/yolo26n-end-to-end-ptq-workflow.md

Run the trex-analyze command on the quantized ONNX models to inspect performance bottlenecks, such as slow layers or fusion issues. Optionally, compare results against the FP16 baseline.

```bash
modelopt-onnx-ptq trex-analyze
```

--------------------------------

### Enable External Data Format

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Write large weights to separate `.onnx_data` files when necessary, keeping the main ONNX model smaller.

```bash
--use_external_data_format
```
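
This snippet shows only the flag. As a sketch of where it could sit in a full command, the hypothetical helper below prints (without running) a `quantize` invocation with one flag placed after a lone `--`, the documented pass-through separator. It is an assumption that `--use_external_data_format` is forwarded to `python -m modelopt.onnx.quantization` this way rather than being accepted by the wrapper directly; `modelopt-onnx-ptq quantize --help` is the place to confirm.

```bash
# Hypothetical dry-run helper; it prints a command and does not execute it.
# First argument: the pass-through flag. Remaining arguments: wrapper arguments.
# Arguments are joined with spaces, so paths containing spaces are not handled.
quantize_cmd_with_passthrough() {
  extra=$1
  shift
  printf '%s\n' "modelopt-onnx-ptq quantize $* -- $extra"
}

quantize_cmd_with_passthrough --use_external_data_format \
  --calibration_data artifacts/calibration/calib.npy \
  --onnx_path models/yolo.onnx
```

Everything before the `--` is interpreted by `modelopt-onnx-ptq quantize`; arguments already set there should not be repeated after it.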