### Install Docs Dependencies and Serve

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Install optional documentation dependencies and serve the documentation locally using `mkdocs serve`.

```bash
pip install -e ".[docs]"
mkdocs serve
```

--------------------------------

### Verify Local Installation

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Verify that the Model Optimizer ONNX tool has been installed correctly by running its help command.

```bash
modelopt-onnx-ptq --help
```

--------------------------------

### Install Model Optimizer Locally and Verify

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Installs the NVIDIA Model Optimizer with ONNX support using pip and verifies the installation by importing the necessary modules. Also lists the available ONNX Runtime execution providers.

```bash
conda create -n modelopt python=3.12 pip && conda activate modelopt
pip install -U "nvidia-modelopt[onnx]==0.43.0" --extra-index-url https://pypi.nvidia.com
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
```

```python
import modelopt.onnx.quantization; print('OK')
import onnxruntime as ort; print(ort.get_available_providers())
```

--------------------------------

### modelopt-onnx-ptq eval-trt Examples

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Examples demonstrating the usage of `modelopt-onnx-ptq eval-trt` for different output formats and layouts.

```APIDOC
## modelopt-onnx-ptq eval-trt Examples

### Auto layout (DeepStream-Yolo single [B,N,6] export)

Evaluates a TensorRT engine against COCO annotations with the output layout detected automatically, here for a DeepStream-Yolo export with a single [B,N,6] output tensor.

```bash
modelopt-onnx-ptq eval-trt --output-format auto --onnx model.onnx --engine model.engine \
  --images data/coco/val2017 --annotations data/coco/annotations/instances_val2017.json
```

### Ultralytics single tensor (explicit name if needed)

Evaluates an engine whose output uses the Ultralytics single-tensor layout; pass `--output-tensor` to name the output tensor explicitly if needed.

```bash
modelopt-onnx-ptq eval-trt --output-format ultralytics --output-tensor output0 --engine model.engine \
  --images data/coco/val2017 --annotations data/coco/annotations/instances_val2017.json
```

### DeepStream-Yolo layout (NMS in eval)

Evaluates an engine with the DeepStream-Yolo output layout; NMS (Non-Maximum Suppression) is applied during the evaluation step.

```bash
modelopt-onnx-ptq eval-trt --output-format deepstream_yolo --output-tensor output --engine model.engine \
  --images data/coco/val2017 --annotations data/coco/annotations/instances_val2017.json
```
```

--------------------------------

### Install CLI (Editable Mode)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Installs the modelopt-onnx-ptq CLI in editable mode from the repository root. This registers the `modelopt-onnx-ptq` command.

```bash
pip install -e .
```

--------------------------------

### Clone Repository and Navigate

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Clones the Model-Optimizer-ONNX repository and changes the current directory into it. This is the first step for Docker installation.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
```

--------------------------------

### Clone and Install Locally

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Clone the repository and install the Model Optimizer ONNX tool in editable mode using pip. This method requires a matching CUDA/TensorRT/ONNX Runtime stack on the host.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
pip install -e .
```

--------------------------------

### Example Workflow with Environment Configuration

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Demonstrates a typical workflow where the `SESSION_ID` and `MODELOPT_ARTIFACTS_ROOT` environment variables are set before running a pipeline command. Logs and artifacts then use these settings automatically.

```bash
export SESSION_ID=yolo-ptq-$(date +%Y%m%d)
export MODELOPT_ARTIFACTS_ROOT=/data/experiments
modelopt-onnx-ptq pipeline-e2e --onnx models/yolo.onnx --quant-matrix int8.all
```

--------------------------------

### Install Model Optimizer with ONNX Support

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Ensure the ONNX extras are installed to resolve an `ImportError` for `modelopt.onnx`.

```bash
pip install -U "nvidia-modelopt[onnx]==0.43.0" --extra-index-url https://pypi.nvidia.com
```

--------------------------------

### Install Model Optimizer ONNX Locally

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Clone the repository and install the package in editable mode. Ensure a matching CUDA, TensorRT, and ONNX Runtime stack is present on the host.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
pip install -e .
modelopt-onnx-ptq --help
```

--------------------------------

### Install ONNX Runtime for CUDA 13

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Installs or upgrades onnxruntime-gpu to a CUDA 13 nightly build. Use this for local installations without Docker when targeting CUDA 13.

```bash
pip install --upgrade --force-reinstall --pre \
  --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-13-nightly/pypi/simple/ \
  onnxruntime-gpu --no-deps
```

--------------------------------

### Verify TensorRT Installation

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Check if the TensorRT library is correctly installed and accessible in the Python environment.

```bash
python -c "import tensorrt; print(tensorrt.__version__)"
```

--------------------------------

### Download COCO Dataset and Build Calibration Tensors

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Installs the `modelopt-onnx-ptq` command and uses it to download the COCO val2017 dataset. Then generates calibration tensors from the dataset images for quantization.

```bash
pip install -e .   # once: installs the `modelopt-onnx-ptq` command
modelopt-onnx-ptq download-coco --output-dir data/coco
```

```bash
modelopt-onnx-ptq calib \
  --images_dir=data/coco/val2017 \
  --calibration_data_size=500 \
  --img_size=640
```

--------------------------------

### Custom Quantization Profile YAML Example

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

An example YAML profile for `modelopt-onnx-ptq` that customizes quantization behavior. It whitelists specific node patterns and excludes certain operator types and nodes.

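
Which node names an include or exclude rule selects depends on how the regex is anchored. The sketch below is an illustration only, not the tool's implementation: it assumes patterns are applied with Python's `re.match` (anchored at the start of the name), the node names are made up, and `select_nodes` is a hypothetical helper. The two patterns are of the kind used under `include_nodes` in the example profile.

```python
import re


def select_nodes(node_names, patterns):
    """Return the names matched by at least one pattern (hypothetical helper)."""
    compiled = [re.compile(p) for p in patterns]
    return [n for n in node_names if any(c.match(n) for c in compiled)]


# Hypothetical node names; real names depend on how the ONNX graph was exported.
nodes = ["node_conv2d_3", "node_conv2d_27", "node_conv2d_45", "node_silu_3"]

# The trailing `$` stops "[0-9]" from matching only the first digit of "27" or "45".
include = [r"node_conv2d_[0-9]$", r"node_conv2d_[1-3][0-9]$"]

selected = select_nodes(nodes, include)
print(selected)
```

Under these assumptions only the conv nodes numbered 0 to 39 are selected, which is why the whitelist needs one pattern for single-digit and one for two-digit indices.
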
```yaml
# Example: modelopt_onnx_ptq/profiles/custom_profile.yaml
# Quantize only backbone convolutions, exclude sensitive ops
defaults:
  autotune: false
include_nodes:
  # Whitelist backbone conv nodes (regex patterns)
  - "node_conv2d_[0-9]$"
  - "node_conv2d_[1-3][0-9]$"
exclude_op_types:
  - Sigmoid
  - Softmax
  - Resize
  - Concat
exclude_nodes:
  # Exclude SiLU activations
  - "node_silu.*"
```

--------------------------------

### Run Docker Container (Edit Mode)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Starts a Docker container for development, mounting the current directory as the workspace. This allows for live code changes.

```bash
docker run --gpus all --rm -it \
  -w /workspace/modelopt-onnx-ptq \
  -v "$(pwd)":/workspace/modelopt-onnx-ptq \
  modelopt-onnx-ptq
```

--------------------------------

### Set Autotune Warmup Runs

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Configure the number of warmup iterations to perform before starting timed runs for latency measurements.

```bash
--autotune_warmup_runs 5
```

--------------------------------

### Comment Style Guidelines in Python

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Examples demonstrating preferred commenting practices, focusing on explaining non-obvious logic rather than narrating code execution.

```python
# bad: narrates the obvious
img = img / 255.0  # normalize image to 0-1 range

# bad: teaches a lesson
# We use Path here because pathlib provides a more Pythonic way to handle
# filesystem paths compared to os.path, allowing operator overloading for
# path concatenation and better cross-platform compatibility.
output = base_dir / "model.onnx"

# good: explains a non-obvious choice
# TRT requires even spatial dims; pad asymmetrically to preserve alignment
padded = np.pad(tensor, ((0,0), (0,0), (0,1), (0,1)))

# good: no comment needed, code is clear
output = base_dir / "model.onnx"
```

--------------------------------

### Execution Provider Resolution

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/reference.md

Example call and output of the internal execution provider resolution utility.

```python
_prepare_ep_list(["cpu", "cuda:0", "trt"])
```

```python
["CPUExecutionProvider", ("CUDAExecutionProvider", {"device_id": 0}), "TensorrtExecutionProvider"]
```

--------------------------------

### Verify Model Optimizer Version

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Check the installed version of the modelopt package to ensure compatibility.

```bash
python -c "import modelopt; print(modelopt.__version__)"
```

--------------------------------

### Run Diagnostic Commands

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Execute these commands to inspect the environment, including installed versions of Model Optimizer, ONNX Runtime, and CUDA library paths.

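
The package versions and `LD_LIBRARY_PATH` can also be collected from Python in one pass, which is convenient when attaching diagnostics to a bug report. This is a minimal sketch using only the standard library; the distribution names passed in are assumptions (check `pip list` for the names on your system), and `environment_report` is a hypothetical helper, not part of the tool.

```python
import os
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(dist_names):
    """Collect package versions plus the LD_LIBRARY_PATH entries into one dict."""
    report = {name: installed_version(name) for name in dist_names}
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    report["LD_LIBRARY_PATH"] = [p for p in ld_path.split(":") if p]
    return report


# Distribution names are assumptions; a value of None means "not installed".
report = environment_report(["nvidia-modelopt", "onnxruntime-gpu", "onnxruntime", "tensorrt"])
for key, value in report.items():
    print(f"{key}: {value}")
```

Seeing both `onnxruntime` and `onnxruntime-gpu` installed at the same time is worth noting, since the troubleshooting steps in this document uninstall both before reinstalling the GPU build.
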
```bash
# Model Optimizer version
python -c "import modelopt; print(modelopt.__version__)"

# ORT version and available EPs
python -c "import onnxruntime as ort; print(ort.__version__); print(ort.get_available_providers())"

# Check what CUDA libs ORT was built against
ldd $(python -c "import onnxruntime as ort, os; print(os.path.join(os.path.dirname(ort.__file__), 'capi/libonnxruntime_providers_cuda.so'))") | grep -E 'cublas|cudnn'

# System CUDA toolkit
ls /usr/local/cuda*/lib64/libcublas.so* 2>/dev/null
nvcc --version 2>/dev/null

# LD_LIBRARY_PATH
echo $LD_LIBRARY_PATH
```

--------------------------------

### Build FP16 TensorRT Engine

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Build an FP16 TensorRT engine from the original ONNX model. This serves as a reference for comparison. The output engine file path is shown as an example.

```bash
modelopt-onnx-ptq build-trt --onnx models/your.onnx --mode fp16
```

--------------------------------

### Display Modelopt ONNX Quantization Help

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Run this command to view the full list of available options and their descriptions for the `modelopt.onnx.quantization` module.

```bash
python -m modelopt.onnx.quantization --help
```

--------------------------------

### Run Full Quantization Grid (Baseline)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/yolo26n-end-to-end-ptq-workflow.md

Execute the `pipeline-e2e` command with `--quant-matrix all` and no profile to establish baseline performance and accuracy across the quantization settings. This helps identify the best initial quantization strategy.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo26n_no_nms_e2e.onnx \
  --calibration-data-size 1000 \
  --input-name input \
  --output-format deepstream_yolo \
  --high-precision-dtype fp16 \
  --quant-matrix all \
  --session-id yolo26n_quant_baseline
```

--------------------------------

### Fix `No module named 'utils'` in TREx (Manual Install)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

This sequence of commands, including a `sed` patch, resolves the `No module named 'utils'` error for manual installations. It ensures the correct directory is added to `sys.path` and installs the package in editable mode.

```bash
TREX_ROOT=/workspace/TREx/tools/experimental/trt-engine-explorer
touch "${TREX_ROOT}/bin/__init__.py"
sed -i 's|sys.path.append(os.path.join(os.path.dirname(SCRIPT_DIR), "utils"))|sys.path.insert(0, os.path.dirname(SCRIPT_DIR))|' \
  "${TREX_ROOT}/bin/trex.py"
cd "${TREX_ROOT}"
pip install --no-cache-dir -e ".[notebook]"
trex --help
```

--------------------------------

### Import Autotune Q/DQ Baseline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Specify a pre-quantized ONNX model to import Q/DQ patterns from, used for warm-starting the autotuner.

```bash
--autotune_qdq_baseline ./baseline.onnx
```

--------------------------------

### Install ORT Nightly for CUDA 13

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-troubleshooting/SKILL.md

Use this command to resolve libcublas version mismatches by installing the nightly build of ONNX Runtime compatible with CUDA 13.

```bash
pip uninstall onnxruntime onnxruntime-gpu -y
pip install --pre --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-13-nightly/pypi/simple/ onnxruntime-gpu
```

--------------------------------

### Basic Quantization with Autotune

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Use this command for post-training quantization with the `default` autotune preset. Ensure calibration data and the ONNX model path are provided.

```bash
modelopt-onnx-ptq quantize --calibration_data ... --onnx_path ... --autotune default
```

--------------------------------

### Execute manual profile tuning steps

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Commands for manual quantization, TensorRT engine building, and performance evaluation.

```bash
calib
```

```bash
quantize --profile … --suffix …
```

```bash
build-trt --mode strongly-typed
```

```bash
eval-trt
```

```bash
trt-bench
```

```bash
trex-analyze
```

--------------------------------

### Initialize TREx Environment

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/ptq-trt-performance/SKILL.md

Commands to set up the environment and launch the Jupyter interface for TREx tools.

```bash
# Optional: export TREX_VENV=/workspace/TREx/tools/experimental/trt-engine-explorer/env_trex
cd /workspace/TREx/tools/experimental/trt-engine-explorer/notebooks
jupyter lab   # or: jupyter notebook
```

--------------------------------

### Use Autotune Pattern Cache

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Provide a YAML file for caching patterns to speed up autotuning via warm-starting.

```bash
--autotune_pattern_cache ./patterns.yaml
```

--------------------------------

### Compare Two ONNX Models

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Enable two-plan comparison by providing a primary ONNX model and a second ONNX model using `--compare-onnx`. Specify a different builder mode for the second model with `--compare-onnx-mode` if needed. Requires the ONNX model path, a build mode, and image size.

```bash
modelopt-onnx-ptq trex-analyze --onnx models/yolo_fp16.onnx --mode fp16 --img-size 640 \
  --compare --compare-onnx artifacts/quantized/yolo.int8.entropy.quant.onnx \
  --compare-onnx-mode strongly-typed
```

--------------------------------

### Build Docker Image

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Builds the Docker image for modelopt-onnx-ptq. Ensure you are in the repository root directory.

```bash
docker build -f docker/Dockerfile -t modelopt-onnx-ptq .
```

--------------------------------

### End-to-End PTQ Pipeline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Run the complete Post-Training Quantization (PTQ) pipeline, including calibration, FP16 baseline generation, quantization, and reporting, with a single command. Use optional flags like `--img-size`, `--input-name`, and `--output-format` as needed. Exclude the FP16 baseline comparison by adding `--no-fp16-baseline`.

```bash
modelopt-onnx-ptq pipeline-e2e --onnx models/your.onnx
```

--------------------------------

### Enable Verbose Autotuner Logging

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Activate detailed logging for the autotuning process.

```bash
--autotune_verbose
```

--------------------------------

### PTQ Workflow Commands

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/quantization-performance-workflow.md

A collection of CLI commands for the full PTQ lifecycle, including calibration, quantization, engine building, and performance analysis.

```bash
# 1) Calibration
modelopt-onnx-ptq calib --images_dir data/coco/val2017/ ... --output_path artifacts/calibration/calib_coco.npy

# 2) Quantize (first iteration, no custom profile)
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib_coco.npy \
  --onnx_path models/yolo.onnx \
  --quantize_mode int8 --calibration_method entropy

# 2b) Quantize with a hand-tuned profile (no autotune); try shipped YOLO26 examples under modelopt_onnx_ptq/profiles/
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib_coco.npy \
  --onnx_path models/yolo.onnx \
  --quantize_mode int8 --calibration_method entropy \
  --profile yolo26n_no_nms_e2e_perf \
  --suffix .v1.quant.onnx

# 3–4) Build, evaluate, benchmark (paths follow your artifacts layout)
modelopt-onnx-ptq build-trt --onnx artifacts/quantized/.int8.entropy.quant.onnx --mode strongly-typed --img-size 640 --batch 1
modelopt-onnx-ptq eval-trt --engine artifacts/trt_engine/.b1_i640.engine ...
modelopt-onnx-ptq trt-bench --engine artifacts/trt_engine/.b1_i640.engine ...

# 5) Profile the quantized plan
modelopt-onnx-ptq trex-analyze \
  --onnx artifacts/quantized/.int8.entropy.quant.onnx \
  --mode strongly-typed --img-size 640

# 5b) Compare layer timings: FP16 TensorRT plan vs quantized ONNX (two builds; see --compare)
# Primary = baseline ONNX + --mode fp16; second = PTQ ONNX + --compare-onnx-mode strongly-typed.
modelopt-onnx-ptq trex-analyze \
  --onnx models/yolo.onnx \
  --mode fp16 \
  --compare \
  --compare-onnx artifacts/quantized/.int8.entropy.quant.onnx \
  --compare-onnx-mode strongly-typed \
  --img-size 640 --input-name
```

--------------------------------

### modelopt-onnx-ptq trex-analyze

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Analyzes TensorRT engine builds using `trtexec` for profiling. Requires TREx or Docker TREx to be installed. Supports comparison, graph analysis, and reporting modes.

```APIDOC
## modelopt-onnx-ptq trex-analyze

### Description
`trtexec` build + profile; pick one of `--compare`, `--graph`, `--report`, or none.

### Method
CLI Command

### Endpoint
N/A (CLI Tool)

### Parameters
#### Command Line Arguments
- **`--onnx`** (string) - Required - Path to the ONNX model.
- **`--engine`** (string) - Optional - Path to an existing TensorRT engine for comparison.
- **`--compare`** (boolean) - Optional - Enable comparison mode.
- **`--graph`** (boolean) - Optional - Enable graph analysis mode.
- **`--report`** (boolean) - Optional - Enable detailed reporting mode.
- **`--trex`** (string) - Optional - Path to TREx executable or Docker image.

### Request Example
```bash
modelopt-onnx-ptq trex-analyze --onnx ./models/model.onnx --graph --report
```

### Response
#### Success Response (CLI Output)
- Analysis results, profiling data, or comparison reports.

#### Response Example
```
TREx analysis complete. Profile data saved to ./artifacts/trex_analysis/
```
```

--------------------------------

### Manual PTQ workflow

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/ptq-trt-performance/SKILL.md

Perform quantization and TensorRT engine building manually. This involves calibrating, quantizing the ONNX model, building the TensorRT engine, evaluating mAP, and benchmarking QPS/latency. Run `trt-bench` sequentially for accurate QPS/latency comparisons.

```bash
# 1.
# Calibrate
modelopt-onnx-ptq calib --img_size --output calib.npy

# 2. Quantize
modelopt-onnx-ptq quantize --profile --suffix .v2.quant.onnx \
  --output model.v2.quant.onnx \
  --input model.onnx \
  --calib calib.npy

# 3. Build TensorRT engine
modelopt-onnx-ptq build-trt --onnx model.v2.quant.onnx \
  --mode strongly-typed \
  --engine model.v2.quant.engine

# 4. Evaluate mAP
modelopt-onnx-ptq eval-trt --onnx model.v2.quant.onnx \
  --engine model.v2.quant.engine \
  --output-format auto

# 5. Benchmark QPS/latency
modelopt-onnx-ptq trt-bench --engine model.v2.quant.engine \
  --output-format auto
```

--------------------------------

### Quantize ONNX Model using CLI

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

The `modelopt.onnx.quantization` module can be invoked directly from the command line to quantize an ONNX model. Use `--onnx_path`, `--quantize_mode`, `--calibration_data_path`, `--calibration_method`, and `--output_path`. Autotune can be set to `default`.

```bash
python -m modelopt.onnx.quantization \
  --onnx_path=model.onnx \
  --quantize_mode=int8 \
  --calibration_data_path=calib.npy \
  --calibration_method=entropy \
  --output_path=model.quant.onnx \
  --autotune=default
```

--------------------------------

### Verify ONNX Runtime CUDA Provider Libraries

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Check the CUDA version compatibility of the ONNX Runtime GPU provider by inspecting the linked libraries. This helps diagnose version mismatches between the system CUDA toolkit and the installed ONNX Runtime.

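
The `ldd` output is easier to compare against the system toolkit once the library major versions are extracted. The sketch below assumes the usual `name => path (address)` line format of `ldd`; the sample text is illustrative rather than captured from a real system, and `cuda_lib_majors` is a hypothetical helper.

```python
import re

# Matches sonames such as "libcublas.so.12" or "libcudnn.so.9" at the start of an ldd line.
SONAME = re.compile(r"^\s*(lib(?:cublas|cublasLt|cudnn|cudart|cufft)\.so)\.(\d+)")


def cuda_lib_majors(ldd_output):
    """Map each CUDA-related soname found in ldd output to its major version."""
    majors = {}
    for line in ldd_output.splitlines():
        m = SONAME.match(line)
        if m:
            majors[m.group(1)] = int(m.group(2))
    return majors


# Illustrative sample only; ldd reports an unresolved library as "not found".
sample = """\
\tlibcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f0000000000)
\tlibcudnn.so.9 => not found
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000100000)
"""
majors = cuda_lib_majors(sample)
print(majors)
```

A major version here that differs from the one under `/usr/local/cuda*/lib64` points to the kind of mismatch described above.
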
```bash
ldd $(python -c "import onnxruntime; print(onnxruntime.__file__.replace('__init__.py','capi/libonnxruntime_providers_cuda.so'))")
```

--------------------------------

### Run Docker Container (Default)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/installation.md

Runs the modelopt-onnx-ptq Docker container with volumes for data persistence. Mounts the models, data, and artifacts directories.

```bash
export DATA_ROOT="$HOME/modelopt-onnx-ptq"
mkdir -p "$DATA_ROOT/models" "$DATA_ROOT/data" "$DATA_ROOT/artifacts"
docker run --gpus all --rm -it \
  -w /workspace/modelopt-onnx-ptq \
  -v "$DATA_ROOT/models:/workspace/modelopt-onnx-ptq/models" \
  -v "$DATA_ROOT/data:/workspace/modelopt-onnx-ptq/data" \
  -v "$DATA_ROOT/artifacts:/workspace/modelopt-onnx-ptq/artifacts" \
  modelopt-onnx-ptq
```

--------------------------------

### Build and Run Docker Container for Model Optimizer

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Builds the Docker image for the Model Optimizer and runs it with the necessary volume mounts for models, data, and artifacts. Ensure CUDA is enabled for GPU acceleration.

```bash
docker build -f docker/Dockerfile -t modelopt-onnx-ptq .
export DATA_ROOT="$HOME/modelopt-onnx-ptq"
mkdir -p "$DATA_ROOT/models" "$DATA_ROOT/data" "$DATA_ROOT/artifacts"
docker run --gpus all --rm -it -w /workspace/modelopt-onnx-ptq \
  -v "$DATA_ROOT/models:/workspace/modelopt-onnx-ptq/models" \
  -v "$DATA_ROOT/data:/workspace/modelopt-onnx-ptq/data" \
  -v "$DATA_ROOT/artifacts:/workspace/modelopt-onnx-ptq/artifacts" \
  modelopt-onnx-ptq
```

--------------------------------

### Configure Autotune Preset

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Select a preset for the autotuner: `quick`, `default`, or `extensive`. This tunes Q/DQ placement for TensorRT.

```bash
--autotune quick
```

--------------------------------

### Run End-to-End Pipeline

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Orchestrates the full optimization workflow from calibration to report generation.

```bash
# Basic end-to-end with single INT8 quantization
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --img-size 640 \
  --output-format auto

# Full 6-combo comparison (int8, fp8, int4 with different methods)
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --img-size 640 \
  --session-id full-comparison-001

# INT8 with autotune optimization
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix int8.all \
  --autotune default \
  --img-size 640 \
  --session-id autotune-test

# Custom quantization matrix with profile
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix "int8.entropy,fp8.max" \
  --quantize-profile yolo26n_no_nms_e2e_perf \
  --img-size 640 \
  --batch 1 \
  --session-id profile-comparison

# Skip FP16 baseline, continue on errors
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --no-fp16-baseline \
  --continue-on-error \
  --session-id ptq-only-run

# Full pipeline with all options
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --autotune default \
  --quantize-profile yolo26n_no_nms_e2e_perf \
  --high-precision-dtype fp16 \
  --img-size 640 \
  --batch 1 \
  --input-name images \
  --output-format auto \
  --session-id comprehensive-test
```

--------------------------------

### Quantize with Custom Profile

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/quantization-performance-workflow.md

Run the quantization process using a specific YAML profile file to control node inclusion or exclusion.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib_coco.npy \
  --onnx_path models/yolo.onnx \
  --quantize_mode int8 --calibration_method entropy \
  --profile profiles/my_yolo_rules.yaml
```

--------------------------------

### Build the Docker image

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/README.md

Builds the container image from the provided Dockerfile after cloning the repository.

```bash
git clone https://github.com/levipereira/Model-Optimizer-ONNX.git
cd Model-Optimizer-ONNX
docker build -f docker/Dockerfile -t modelopt-onnx-ptq .
```

--------------------------------

### Enable Weight-Only Quantization Style

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Use a weight-only quantization style: weights are stored in quantized form and only the DQ nodes are kept on them (the Q nodes are removed).

```bash
--dq_only
```

--------------------------------

### CLI: modelopt-onnx-ptq quantize

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/reference.md

Command line interface for performing quantization on ONNX models.

```APIDOC
## CLI: modelopt-onnx-ptq quantize

### Description
Command line utility to quantize ONNX models. Use --help for additional wrapper options.

### Parameters
- **--onnx_path** (str) - Required - Input ONNX path.
- **--quantize_mode** (str) - Optional - Quantization mode (fp8, int8, int4).
- **--calibration_data_path** (str) - Optional - Path to .npy or .npz calibration data.
- **--output_path** (str) - Optional - Output ONNX path.

### Request Example
```bash
python -m modelopt.onnx.quantization --onnx_path model.onnx --quantize_mode int8 --output_path model_quant.onnx
```
```

--------------------------------

### Run pipeline-e2e for quantization comparison

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/ptq-trt-performance/SKILL.md

Use `pipeline-e2e` to compare int8, fp8, and int4 quantization methods.
It automates multiple runs and reporting. Ensure `--input-name` is set for `build-trt` if the ONNX input name is not the default. The `--output-format auto` and `--onnx` flags are passed to `eval-trt` to correctly infer the output layout.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --quant-matrix all \
  --quantize-profile \
  --input-name \
  --build-mode strongly-typed \
  --session-id my_run \
  --continue-on-error \
  --no-fp16-baseline
```

--------------------------------

### Model Optimizer PTQ Pipeline Steps

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Illustrates the sequence of operations for post-training quantization using the ONNX model optimizer. Autotune is a flag within the quantize step and supports int8/fp8.

```text
ONNX FP32 → calib → quantize [--autotune] → build-trt → eval-trt → trt-bench
```

--------------------------------

### Re-run PTQ with Graph Simplification

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

Use this command to re-run post-training quantization with graph simplification enabled. This can help resolve issues related to QuantizeLinear nodes by potentially altering the Q/DQ layout.

```bash
modelopt-onnx-ptq quantize ... -- --simplify
```

--------------------------------

### Enable Zero-Point Quantization

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Activate zero-point quantization techniques, such as those used in awq_lite.

```bash
--use_zero_point
```

--------------------------------

### Run ONNX Post-Training Quantization (INT8)

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Performs INT8 quantization on ONNX models using pre-generated calibration data. Specify the calibration data path, the ONNX model file(s), quantization mode, and calibration method.

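
When the quantize step is driven from a script, the same arguments can be assembled programmatically. This is a minimal sketch: `build_quantize_argv` is a hypothetical helper, it emits only flags that appear in this document, and it treats `--onnx_path` and `--onnx_glob` as mutually exclusive, as the CLI reference states. Requiring exactly one of the two is an assumption, and the sketch builds the command without running it.

```python
import shlex


def build_quantize_argv(calibration_data, onnx_path=None, onnx_glob=None,
                        quantize_mode="int8", calibration_method="entropy"):
    """Build the argument list for `modelopt-onnx-ptq quantize` (hypothetical helper)."""
    # Assumption: exactly one of the two model selectors must be given.
    if (onnx_path is None) == (onnx_glob is None):
        raise ValueError("pass exactly one of onnx_path or onnx_glob")
    argv = ["modelopt-onnx-ptq", "quantize", "--calibration_data", calibration_data]
    if onnx_path is not None:
        argv += ["--onnx_path", onnx_path]
    else:
        argv += ["--onnx_glob", onnx_glob]
    argv += ["--quantize_mode", quantize_mode, "--calibration_method", calibration_method]
    return argv


argv = build_quantize_argv("artifacts/calibration/calib_coco.npy", onnx_glob="models/*.onnx")
# shlex.join quotes the glob so a shell would not expand it before the CLI sees it.
print(shlex.join(argv))
```

The list form can be passed to `subprocess.run(argv, check=True)` as is; the quoted string is only needed when the command is pasted into a shell.
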
```bash
# Without autotune
modelopt-onnx-ptq quantize \
  --calibration_data=artifacts/calibration/calib_coco.npy \
  --onnx_glob="models/*.onnx" \
  --quantize_mode=int8 \
  --calibration_method=entropy
```

--------------------------------

### Run End-to-End Pipeline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/workflow.md

Executes the full quantization and optimization pipeline for multiple modes using the specified autotune configuration.

```bash
modelopt-onnx-ptq pipeline-e2e --onnx models/yolo.onnx --quant-matrix all --autotune default --continue-on-error
```

--------------------------------

### Run End-to-End Quantization Pipeline

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Execute a full pipeline covering multiple quantization combinations with autotune enabled.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo.onnx \
  --quant-matrix all \
  --autotune default \
  --continue-on-error
```

--------------------------------

### Specify Log File

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Direct the quantization process logs to a specific file, separate from the main tool's log file.

```bash
--log_file quantization.log
```

--------------------------------

### Quantize Model with Built-in Profile

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Use this command to perform post-training quantization using a specified built-in profile. Ensure the calibration data and ONNX model path are correctly provided.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib.npy \
  --onnx_path models/yolo.onnx \
  --profile yolo26n_no_nms_e2e_perf_backbone_neck
```

--------------------------------

### Benchmark with trtexec

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Use the `trtexec` command-line tool for benchmarking instead of the TensorRT Python API during autotuning.

```bash
--autotune_use_trtexec
```

--------------------------------

### Quantize Model with Autotune

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/workflow.md

Performs quantization on an ONNX model using a specified calibration dataset and autotune preset.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data artifacts/calibration/calib.npy \
  --onnx_path models/yolo.onnx \
  --autotune default
```

--------------------------------

### Activate TREx Environment and Show Help

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/docker-reference.md

Activates the TensorRT Engine Explorer (TREx) virtual environment and displays its help message. This is used inside the container for model profiling.

```bash
source /workspace/TREx/tools/experimental/trt-engine-explorer/env_trex/bin/activate
trex --help
```

--------------------------------

### Quantize ONNX Model via CLI

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/onnx-ptq/SKILL.md

Perform model quantization using the modelopt-onnx-ptq CLI or the direct modelopt module.

```bash
modelopt-onnx-ptq quantize \
  --calibration_data=artifacts/calibration/calib_coco.npy \
  --onnx_path=models/yolov8n.onnx \
  --quantize_mode=int8 \
  --calibration_method=entropy \
  --autotune default
```

```bash
python -m modelopt.onnx.quantization \
  --onnx_path=models/yolov8n.onnx \
  --quantize_mode=int8 \
  --calibration_data_path=artifacts/calibration/calib_coco.npy \
  --calibration_method=entropy \
  --output_path=artifacts/quantized/yolov8n.int8.entropy.quant.onnx \
  --autotune=default
```

--------------------------------

### Run Pipeline with Backbone Quantization Profile

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/yolo26n-end-to-end-ptq-workflow.md

Execute the end-to-end pipeline with a specified quantization profile for the backbone. Ensure the profile name matches your YAML configuration. The `--high-precision-dtype fp16` flag is optional if relying on defaults.

```bash
modelopt-onnx-ptq pipeline-e2e \
  --onnx models/yolo26n_no_nms_e2e.onnx \
  --calibration-data-size 1000 \
  --input-name input \
  --output-format deepstream_yolo \
  --quantize-profile yolo26n_no_nms_e2e_backbone \
  --high-precision-dtype fp16 \
  --quant-matrix all \
  --session-id yolo26n_prof_backbone
```

--------------------------------

### modelopt-onnx-ptq quantize

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Wraps the ONNX PTQ process using the Model Optimizer. It supports various quantization modes (FP8, INT8, INT4), calibration methods, and optional autotuning for optimal performance.

```APIDOC
## modelopt-onnx-ptq quantize

### Description
Runs ONNX PTQ via Model Optimizer.

### Method
CLI Command

### Endpoint
N/A (CLI Tool)

### Parameters
#### Common Arguments
- **`--calibration_data`** (string) - Required - Path to `calib.npy`
- **`--onnx_path`** (string) - Optional - Single ONNX file
- **`--onnx_glob`** (string) - Optional - Glob (e.g. `models/*.onnx`); mutually exclusive with `--onnx_path`
- **`--output_dir`** (string) - Optional - Output directory (default: `/quantized`; root is `cwd/artifacts` or `MODELOPT_ARTIFACTS_ROOT`)
- **`--quantize_mode`** (string) - Optional - `fp8`, `int8`, `int4`
- **`--calibration_method`** (string) - Optional - e.g. `entropy`, `max` (mode-dependent)
- **`--high_precision_dtype`** (string) - Optional - Default `fp16` (aligns with TensorRT mixed precision); use `fp32` if PTQ `shape_inference` fails on your graph
- **`--autotune`** (string) - Optional - Q/DQ placement tuning via TensorRT timing (`quick` | `default` | `extensive`). Use with `int8`; `fp8` with autotune often fails on some detection ONNX graphs. Needs GPU + TensorRT.
- **`--profile`** (string) - Optional - YAML file (built-in name or path) with Model Optimizer include/exclude rules. Requires PyYAML.
- **`--suffix`** (string) - Optional - Output suffix (default `.quant.onnx`)

### Hardware Requirements
- **FP8 hardware:** `--quantize_mode fp8` requires a CUDA GPU with compute capability ≥ 8.9 (e.g., Ada Lovelace, Hopper, Blackwell).

### Pass-through Arguments
Arguments after a lone `--` are appended to the `python -m modelopt.onnx.quantization` command. Do not repeat arguments already set by `modelopt-onnx-ptq quantize`.

### Request Example
```bash
modelopt-onnx-ptq quantize --calibration_data ./artifacts/calibration/calib.npy --onnx_path ./models/model.onnx --quantize_mode int8 --autotune extensive
```

### Response
#### Success Response (File Output)
- Quantized ONNX model file(s).

#### Response Example
```
# Output file: artifacts/quantized/model.quant.onnx
```
```

--------------------------------

### Compare FP16 Baseline vs Quantized Engine

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Compare an FP16 baseline model against a quantized engine using `trex-analyze` with the `--compare` and `--compare-onnx` flags.
```bash
modelopt-onnx-ptq trex-analyze \
  --onnx models/yolo.onnx \
  --mode fp16 \
  --compare \
  --compare-onnx artifacts/quantized/yolo.int8.entropy.quant.onnx \
  --compare-onnx-mode strongly-typed \
  --img-size 640 \
  --input-name images
```

--------------------------------

### Specify Autotune State File

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Provide a state file for resuming autotuning from a previous checkpoint or for crash recovery. Defaults to a file within the output directory.

```bash
--autotune_state_file ./autotune_state.json
```

--------------------------------

### modelopt-onnx-ptq report-runs

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Scans log directories and writes a Markdown report with tables and PNG charts. It supports session IDs for specific run analysis and merging global logs.

```APIDOC
## `modelopt-onnx-ptq report-runs`

### Description
Scans log directories and writes a Markdown report with tables plus PNG charts next to the .md file. With `--session-id`, charts are `chart_ips_latency_.png` and `chart_eval_.png`; without a session, the same pattern uses the output file stem instead. Used standalone or at the end of `pipeline-e2e`.

The report starts with an FP16 baseline table when a `*.fp16` engine row exists, followed by "Best configuration". With FP16 present, each "Best configuration" row is the best quantized (non-baseline) engine per metric, and "vs FP16" compares that engine to the baseline (never the baseline to itself). Charts use series order FP16 baseline → int4 → int8 → fp8. Eval and Throughput tables put FP16 first when present, then sort by mAP or QPS. Comparison sections, Environment & versions, and Data sources follow.

### Method
Not applicable (CLI command)

### Endpoint
Not applicable (CLI command)

### Parameters
#### Command Arguments
- **`--session-id`** (string) - Optional - Shortcut: set `--trt-logs-dir` and `--eval-logs-dir` to `artifacts/pipeline_e2e/sessions//trt_engine/logs` and `…/predictions/logs` (unless you override them). With `-o` omitted, writes `artifacts/pipeline_e2e/sessions//report_.md`. Same layout as `pipeline-e2e`. If omitted, `SESSION_ID` in the environment is used. Use this (or explicit session log paths) so the report sees `pipeline-e2e` outputs; the default without `--session-id` is the global flat `artifacts/trt_engine/logs`, not the session folder.
- **`--merge-global-logs`** (boolean) - Optional - Also read global `/trt_engine/logs` and `/predictions/logs` and merge with the primary dirs (newest timestamp per config). `pipeline-e2e` enables this when invoking `report-runs`.
- **`--trt-logs-dir`** (string) - Optional - Folder with `trt_bench_*.log` (default without `--session-id`: `/trt_engine/logs`, which is often not where `pipeline-e2e` writes)
- **`--eval-logs-dir`** (string) - Optional - Folder with `eval_*.log` (default without `--session-id`: `/predictions/logs`)
- **`-o`, `--output`** (string) - Optional - Output `.md` path (default: `artifacts/reports/trt_eval_report_.md`, or `…/sessions//report_.md` when `--session-id` or `SESSION_ID` selects a session)

### Environment Variables
- **`MODELOPT_ONNX_PTQ_LOGLEVEL` or `LOGLEVEL`** (string) - Controls logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` (`MODELOPT_ONNX_LOGLEVEL` / `MODELOPT_YOLO_LOGLEVEL` are deprecated).
- **`MODELOPT_ARTIFACTS_ROOT`** (string) - Sets the root directory for artifacts. Defaults to `/artifacts` and is created if it doesn't exist.
- **`SESSION_ID`** (string) - Default session ID for various commands (`pipeline-e2e`, `build-trt`, `eval-trt`, `trt-bench`, `report-runs`) when `--session-id` is not explicitly passed. CLI arguments take precedence over this variable.
```

--------------------------------

### Quantize with High Precision FP32

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

If quantization fails due to issues with FP16 post-processing on dynamic detection heads, retry quantization with FP32 as the high-precision dtype.

```bash
modelopt-onnx-ptq quantize ... --high_precision_dtype fp32
```

--------------------------------

### Build TensorRT Engine with Best Mode

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/troubleshooting.md

Attempt to build a TensorRT engine using the 'best' mode. This can be a workaround for parsing issues with strictly-typed Q/DQ patterns, though accuracy may differ from strict PTQ.

```bash
modelopt-onnx-ptq build-trt --onnx ... --mode best
```

--------------------------------

### Build TensorRT Engine with Session Tracking

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Builds a TensorRT engine and enables session tracking for report aggregation. Specify the ONNX model path, image size, batch size, and a session ID.

```bash
modelopt-onnx-ptq build-trt \
  --onnx artifacts/quantized/yolo.int8.entropy.quant.onnx \
  --img-size 640 \
  --batch 1 \
  --session-id my-experiment-001
```

--------------------------------

### Execute automated PTQ performance grid

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/skills/modelopt-onnx-ptq-dev/SKILL.md

Run the end-to-end pipeline for PTQ performance measurement using the specified ONNX model.

```bash
modelopt-onnx-ptq pipeline-e2e --onnx models/…onnx
```

--------------------------------

### Specify Autotune Output Directory

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Set the directory where autotuning artifacts, such as state and logs, will be stored.
```bash
--autotune_output_dir ./autotune_results
```

--------------------------------

### Generate Performance Report

Source: https://context7.com/levipereira/model-optimizer-onnx/llms.txt

Aggregates benchmark and evaluation logs into a Markdown report.

```bash
# Generate report from session logs
modelopt-onnx-ptq report-runs --session-id my-experiment-001

# Generate report from custom log directories
modelopt-onnx-ptq report-runs \
  --trt-logs-dir artifacts/trt_engine/logs \
  --eval-logs-dir artifacts/predictions/logs \
  -o artifacts/reports/comparison_report.md

# Session report with global log merging
modelopt-onnx-ptq report-runs \
  --session-id my-experiment-001 \
  --merge-global-logs \
  -o artifacts/pipeline_e2e/sessions/my-experiment-001/full_report.md
```

--------------------------------

### Analyze Quantized Models with TREx

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/yolo26n-end-to-end-ptq-workflow.md

Run the trex-analyze command on the quantized ONNX models to inspect performance bottlenecks, such as slow layers or fusion issues. Optionally, compare results against the FP16 baseline.

```bash
modelopt-onnx-ptq trex-analyze
```

--------------------------------

### Enable External Data Format

Source: https://github.com/levipereira/model-optimizer-onnx/blob/master/docs/cli-reference.md

Write large weights to separate `.onnx_data` files when necessary, keeping the main ONNX model smaller.

```bash
--use_external_data_format
```
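
--------------------------------

### Chain Quantize, Build, and Report (Dry-Run Sketch)

The `quantize`, `build-trt`, and `report-runs` snippets above can be chained under one session ID. The following is a minimal sketch, not a script from the repository: the `run` and `ptq_flow` helpers and the `DRY_RUN` variable are hypothetical, and the quantized file name is an assumption taken from the example paths above, so check it against your own output directory. By default the script only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the chained commands unless DRY_RUN=0 is set.
set -euo pipefail

SESSION_ID="${SESSION_ID:-my-experiment-001}"  # example ID reused from the snippets above
DRY_RUN="${DRY_RUN:-1}"                        # hypothetical switch, not a CLI feature

# Assumption: quantize writes this file name; verify it in your output directory.
QUANT_ONNX="artifacts/quantized/yolo.int8.entropy.quant.onnx"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Quantize, build an engine, then aggregate logs for the same session.
ptq_flow() {
  run modelopt-onnx-ptq quantize \
    --calibration_data artifacts/calibration/calib.npy \
    --onnx_path models/yolo.onnx \
    --quantize_mode int8 \
    --calibration_method entropy
  run modelopt-onnx-ptq build-trt \
    --onnx "$QUANT_ONNX" \
    --img-size 640 \
    --batch 1 \
    --session-id "$SESSION_ID"
  run modelopt-onnx-ptq report-runs --session-id "$SESSION_ID"
}

ptq_flow
```

Set `DRY_RUN=0` to execute the commands instead of printing them. A full run would normally also benchmark and evaluate the engine before reporting, since `report-runs` reads `trt_bench_*.log` and `eval_*.log` files.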