### Install Dependencies and Run Tests

Source: https://github.com/nvidia/model-optimizer/blob/main/tests/examples/README.md

Installs necessary dependencies and executes tests for a specific example. Ensure you are in the root of the repository and have mounted the local modelopt directory.

```bash
cd /workspace/Model-Optimizer
pip install -e ".[all,dev-test]"
pytest tests/examples/$TEST
```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/speculative_decoding/recipes/train_eagle_head_cosmos_reason2.ipynb

Installs the necessary model optimization library and project requirements. Use this at the beginning of the setup process.

```bash
%%bash
pip install -U nvidia-modelopt[hf]
pip install -r ../requirements.txt
```

--------------------------------

### Install Model-Optimizer and Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/diffusers/distillation/README.md

Installs Model-Optimizer and all required dependencies for distillation training. Ensure you are in the distillation example directory before running.

```bash
cd examples/diffusers/distillation

pip install -e ../../../

pip install -r requirements.txt
```

--------------------------------

### Deploy QAT Checkpoint on SGLang

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md

Start the SGLang server with a specified model path and tensor parallelism size. Refer to the SGLang setup guide for installation instructions.

```bash
python3 -m sglang.launch_server --model <model-path> --tp <tp_size>
```

--------------------------------

### QAT Workflow Example with ModelOpt

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/cnn_qat/README.md

This Python code demonstrates the core steps of Quantization-Aware Training (QAT) using NVIDIA ModelOpt. It includes model quantization, calibration, QAT fine-tuning, and saving/restoring the model. Ensure necessary imports and model/loader setup are done prior to this.

```python
import modelopt.torch.quantization as mtq
import modelopt.torch.opt as mto

# ... build model, loaders, optimizer, scheduler ...

def calibrate_fn(m):
    m.eval()
    seen = 0
    for x, _ in calib_loader:
        m(x.to(device))
        seen += x.size(0)
        if seen >= 512:
            break

# 1. PTQ quantization + calibration
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, calibrate_fn)

# 2. QAT fine-tuning
for epoch in range(1, epochs + 1):
    train(model, train_loader, ...)
    scheduler.step()

# 3. Save final QAT model (weights + quantizer state)
mto.save(model, "cnn_qat_best.pth")

# 4. To reload for inference or further training:
model = build_model()
mto.restore(model, "cnn_qat_best.pth")
model.to(device)
```

--------------------------------

### Verify Puzzletron Installation

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/README.md

Run GPU tests to confirm the puzzletron installation. This example specifically checks the Qwen3-8B model.

```bash
python -m pytest tests/gpu/torch/puzzletron/test_puzzletron.py -k "Qwen3-8B"
```

--------------------------------

### Install Model Optimizer with Hugging Face Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_distill/README.md

Install Model Optimizer with specific dependencies for Hugging Face models and then install example requirements.

```bash
pip install -U nvidia-modelopt[hf]
pip install -r requirements.txt
```

--------------------------------

### Install Dependencies and Login

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/dataset/README.md

Installs the necessary package and logs into Hugging Face Hub. A token is required for gated datasets.

```bash
pip install nvidia-modelopt[hf]
hf auth login --token <your token> # required for gated datasets
```

--------------------------------

### Install Base Requirements

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/kl_divergence_metrics/README.md

Install the necessary Python packages for the toolkit. Consider installing PyTorch with CUDA support for improved performance.

```bash
pip install -r requirements.txt
```

--------------------------------

### Simple Flat Directory Structure Example

Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/README.md

Illustrates a basic file organization for an experimental technique, including the main script, tests, and examples.

```text
experimental/my_technique/
├── README.md
├── requirements.txt
├── my_technique.py
├── test_my_technique.py
└── example.py
```

--------------------------------

### Version Summary Report

Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/contributing.md

This is an example of the version summary that is printed at the start of every run. It helps in identifying the versions of different components used.

```text
============================================================
Version Report
============================================================
  Launcher                       d28acd33     (main)
  Megatron-LM                    1e064f361    (main)
  Model-Optimizer                69c0d479     (main)
============================================================
```
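
As a rough sketch of how such a report can be rendered, the snippet below formats `(name, commit, branch)` rows into a fixed-width table. The function name `format_version_report` and the column widths are assumptions made for illustration; this is not the launcher's own code, and the spacing may differ slightly from the real output.

```python
def format_version_report(components, width=60):
    """Render (name, commit, branch) rows as a fixed-width text report."""
    bar = "=" * width
    lines = [bar, "Version Report", bar]
    for name, commit, branch in components:
        # Left-align the name and commit columns; the widths are illustrative.
        lines.append(f"  {name:<30} {commit:<12} ({branch})")
    lines.append(bar)
    return "\n".join(lines)


# Values copied from the example report above.
report = format_version_report(
    [
        ("Launcher", "d28acd33", "main"),
        ("Megatron-LM", "1e064f361", "main"),
        ("Model-Optimizer", "69c0d479", "main"),
    ]
)
print(report)
```

Keeping the formatting in a pure function makes it easy to log the same report to a file or compare it between runs.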

--------------------------------

### Run PTQ Example with Recipe via CLI

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/10_recipes.rst

Execute a PTQ example script using a specified recipe via the command line. This bypasses format-specific flags.

```bash
python examples/llm_ptq/hf_ptq.py \
    --model Qwen/Qwen3-8B \
    --recipe general/ptq/fp8_default-fp8_kv \
    --export_path build/fp8 \
    --calib_size 512 \
    --export_fmt hf
```

--------------------------------

### Install vLLM Fork with AnyModel Support

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/README.md

Clone and install a specific vLLM fork that includes AnyModel support for deploying compressed models. Ensure you follow the vLLM installation guide for building from source.

```bash
git clone https://github.com/askliar/vllm.git
cd vllm
git checkout feature/add_anymodel_to_vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto
```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/diffusers/qad_example/README.md

Create a virtual environment and install the required dependencies using pip. This includes LTX packages from source and NVIDIA ModelOpt from PyPI.

```bash
python -m venv .venv
.venv\Scripts\activate   # Windows
# source .venv/bin/activate   # Linux/macOS

pip install -r requirements.txt
```

```bash
pip install torch accelerate safetensors pyyaml
```

--------------------------------

### Install Dependencies with requirements.txt

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb

Installs all necessary dependencies for the notebook using a requirements.txt file. Ensure this file is present in the same directory.

```python
!pip install -r requirements.txt
```

--------------------------------

### Install ModelOpt-Windows with Olive

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/README.md

Installs ModelOpt-Windows through its integration with Microsoft's Olive framework, then installs ONNX Runtime GenAI with CUDA support.

```bash
pip install olive-ai[nvmo]
```

```bash
pip install onnxruntime-genai-cuda
```

--------------------------------

### Run Hugging Face Example Script

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_ptq/README.md

Example bash script to run the Hugging Face quantization example for LLM models like Llama-3.

```bash
#!/bin/bash
# For LLM models like [Llama-3](https://huggingface.co/meta-llama):

```

--------------------------------

### Install ModelOpt-Windows Standalone Toolkit

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/README.md

Installs ModelOpt-Windows as a standalone toolkit for CUDA 12.x systems.

```bash
pip install nvidia-modelopt[onnx]
```

--------------------------------

### Install nvidia-modelopt with all optional dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/getting_started/_installation_for_Linux.rst

Use this command to install the package with all optional dependencies included. This ensures full functionality across all modules.

```bash
pip install -U "nvidia-modelopt[all]"
```

--------------------------------

### Package Directory Structure Example

Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/README.md

Shows a more structured approach for an experimental technique using a package layout, separating core logic from examples and tests.

```text
experimental/my_technique/
├── README.md
├── requirements.txt
├── my_technique/
│   ├── __init__.py
│   ├── core.py
│   └── config.py
├── tests/
│   └── test_core.py
└── examples/
    └── example_usage.py
```
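
The layout above can be scaffolded with a few lines of standard-library Python. This is a minimal sketch: the `scaffold` helper and the `PACKAGE_LAYOUT` list are hypothetical names introduced here, and the files are created empty inside a temporary directory so the snippet is safe to run anywhere.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the package layout shown above.
PACKAGE_LAYOUT = [
    "README.md",
    "requirements.txt",
    "my_technique/__init__.py",
    "my_technique/core.py",
    "my_technique/config.py",
    "tests/test_core.py",
    "examples/example_usage.py",
]


def scaffold(root, layout=PACKAGE_LAYOUT):
    """Create empty files for the layout under <root>/experimental/my_technique."""
    base = Path(root) / "experimental" / "my_technique"
    for rel in layout:
        path = base / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)
    return base


with tempfile.TemporaryDirectory() as tmp:
    base = scaffold(tmp)
    # Collect the created files as POSIX-style relative paths.
    created = sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )
    print("\n".join(created))
```

To scaffold a real technique, pass the repository root instead of the temporary directory.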

--------------------------------

### Install Dependencies and Run Tests Locally

Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/testing.md

Installs necessary Python packages using uv and then executes pytest for local testing. Ensure you are in the Model-Optimizer/tools/launcher directory.

```bash
cd Model-Optimizer/tools/launcher
uv pip install -e . pytest
uv run pytest -v
```

--------------------------------

### Sequential Quantization Configuration (W4A8 Example)

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/_quant_cfg.rst

Configure sequential quantization where a TensorQuantizer is replaced by a SequentialQuantizer that applies formats in sequence. This example shows W4A8 quantization.

```python
{
    "quantizer_name": "*weight_quantizer",
    "cfg": [
        {"num_bits": 4, "block_sizes": {-1: 128, "type": "static"}},
        {"num_bits": (4, 3)},  # FP8
    ],
}
```

--------------------------------

### Install Model Optimizer from Source

Source: https://github.com/nvidia/model-optimizer/blob/main/README.md

Install Model Optimizer from source in editable mode to use the latest features or for development. This requires cloning the repository first.

```bash
# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer

pip install -e .[dev]
```

--------------------------------

### Install DMS Package

Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/dms/README.md

Clone the repository and install the DMS package in editable mode. This provides all necessary components for training and evaluation.

```bash
git clone https://github.com/NVIDIA/Model-Optimizer
cd Model-Optimizer/experimental/dms
pip install -e .
```

--------------------------------

### ModelOpt Launcher Documentation Guides

Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/README.md

Table outlining the available documentation guides for the ModelOpt Launcher, including Configuration, Architecture, Testing, Claude Code, and Contributing.

```markdown
| Guide | Description |
|---|---|
| [Configuration](docs/configuration.md) | YAML formats, CLI overrides, flags, `hf_local` |
| [Architecture](docs/architecture.md) | Shared core, factory system, typed tasks, mount mechanism |
| [Testing](docs/testing.md) | Running tests locally and in CI |
| [Claude Code](docs/claude_code.md) | Submit, monitor, diagnose workflows |
| [Contributing](docs/contributing.md) | Adding models, typed tasks, bug reporting |
```

--------------------------------

### Custom PTQ Recipe Example

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/10_recipes.rst

An example of a custom PTQ recipe configuration for INT8 per-channel weight quantization. Modify the 'quantize' section for specific needs.

```yaml
# my_int8_recipe.yml
metadata:
  recipe_type: ptq
  description: INT8 per-channel weight, per-tensor activation.

quantize:
  algorithm: max
  quant_cfg:
    - quantizer_name: '*'
      enable: false
    - quantizer_name: '*weight_quantizer'
      cfg:
        num_bits: 8
        axis: 0
    - quantizer_name: '*input_quantizer'
      cfg:
        num_bits: 8
        axis:
    - quantizer_name: '*lm_head*'
      enable: false
    - quantizer_name: '*output_layer*'
      enable: false
```

--------------------------------

### Dataset Mix Configuration Example

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/dataset/README.md

YAML configuration for mixing datasets, specifying repository IDs, splits, and augmentation settings. 'cap_per_split' limits the number of examples.

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false   # multilingual — skip language-redirect augmentation
```
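
One way to read `cap_per_split` is "take at most this many examples from each listed split". The sketch below illustrates that reading with in-memory data. `mix_datasets` and `fake_load_split` are hypothetical helpers, the cap is lowered to 2 so the effect is visible, and the `augment` flag is carried along but not interpreted because its behavior is not described here.

```python
from itertools import islice


def mix_datasets(config, load_split):
    """Yield (repo_id, split, example), honoring an optional cap_per_split."""
    for entry in config["datasets"]:
        cap = entry.get("cap_per_split")  # None means no cap
        for split in entry["splits"]:
            examples = load_split(entry["repo_id"], split)
            # islice(iterable, None) yields everything, so one code path suffices.
            for example in islice(examples, cap):
                yield entry["repo_id"], split, example


# Same shape as the YAML above, with a tiny cap for demonstration.
config = {
    "datasets": [
        {
            "repo_id": "nvidia/Nemotron-Math-v2",
            "splits": ["high_part00", "high_part01"],
            "cap_per_split": 2,
            "augment": True,
        },
        {
            "repo_id": "nvidia/OpenMathReasoning-mini",
            "splits": ["train"],
            "augment": False,
        },
    ]
}


def fake_load_split(repo_id, split):
    """Stand-in loader: five placeholder examples per split."""
    return (f"{split}-{i}" for i in range(5))


mixed = list(mix_datasets(config, fake_load_split))
print(len(mixed))
```

With this configuration the two capped splits contribute two examples each and the uncapped `train` split contributes all five.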

--------------------------------

### QAD Example Workflow

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/README.md

Sets up and runs Quantization Aware Distillation (QAD) using a QADTrainer. This involves configuring a teacher model and a distillation criterion.

```python
import modelopt.torch.opt as mto
import modelopt.torch.distill as mtd
import modelopt.torch.quantization as mtq
from modelopt.torch.distill.plugins.huggingface import LMLogitsLoss
from modelopt.torch.quantization.plugins.transformers_trainer import QADTrainer


... 

# [Not shown] load model, tokenizer, data loaders etc
# Create the distillation config
distill_config = {
   "teacher_model": teacher_model,
   "criterion": LMLogitsLoss(),
}

trainer = QADTrainer(
   model=model,
   processing_class=tokenizer,
   args=training_args,
   quant_args=quant_args,
   distill_config=distill_config,
   **data_module,
)

trainer.train()  # Train the quantized model using distillation (i.e., QAD)

# Save the final student model weights
trainer.save_model()
```

--------------------------------

### Verify ModelOpt Installation

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/getting_started/windows/_installation_standalone.rst

Run this command from a shell after setting up the environment to confirm that ModelOpt, specifically its ONNX quantization module, imports successfully. The command exits silently on success and prints an import error otherwise.

```bash
python -c "import modelopt.onnx.quantization"
```
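
The same check can be made from inside Python without aborting on failure. The sketch below uses the standard-library `importlib.util.find_spec`; the helper name `is_importable` is an assumption for illustration. Note that `find_spec` only locates the target module (importing its parent packages along the way), so it is a lighter probe than actually importing it.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the module can be located, False otherwise."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing or broken parent package raises instead of returning None.
        return False


for name in ("modelopt", "modelopt.onnx.quantization"):
    status = "found" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, rerun the installation step before continuing.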

--------------------------------

### Example Workflow: Improve Existing Quantization

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/onnx_ptq/autotune/README.md

This workflow demonstrates how to first create an initial quantized model using modelopt's quantize function and then use that quantized model as a baseline for further autotuning to find improved Q/DQ placements.

```python
import numpy as np
from modelopt.onnx.quantization import quantize

# Step 1: Create an initial quantized model to use as the baseline
# Dummy calibration data (replace with real data for production)
dummy_input = np.random.randn(128, 3, 224, 224).astype(np.float32)
quantize(
    'resnet50_Opset17_bs128.onnx',
    calibration_data=dummy_input,
    calibration_method='entropy',
    output_path='resnet50_quantized.onnx'
)
```

```bash
# Step 2: Use the quantized baseline for autotuning
# The autotuner will try to find better Q/DQ placements than the initial quantization
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50_Opset17_bs128.onnx \
    --output_dir ./resnet50_autotuned \
    --qdq_baseline resnet50_quantized.onnx \
    --schemes_per_region 50
```

--------------------------------

### Deploy QAT Checkpoint on TensorRT-LLM

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md

Launch an OpenAI-compatible endpoint using TensorRT-LLM with a quantized checkpoint. Ensure TensorRT-LLM is installed and follow the official guide for setup.

```bash
trtllm-serve path/to/quantized/checkpoint \
    --tokenizer /path/to/tokenizer \
    --max_batch_size <max_batch_size> \
    --max_num_tokens <max_num_tokens> \
    --max_seq_len <max_seq_len> \
    --tp_size <tp_size> \
    --pp_size <pp_size> \
    --host <host_ip_address> \
    --port <port> \
    --kv_cache_free_gpu_memory_fraction 0.95
```

--------------------------------

### Install Model Optimizer Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/chained_optimizations/README.md

Install Model Optimizer with the optional Hugging Face (`hf`) dependencies, then install the example's `requirements.txt`.

```bash
pip install "nvidia-modelopt[hf]"
pip install -r requirements.txt
```

--------------------------------

### Launch DFlash Example

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/speculative_decoding/README.md

Execute a complete end-to-end example for DFlash (Block Diffusion for Speculative Decoding) training and evaluation using the provided launcher script. Ensure you have the necessary YAML configuration file.

```bash
uv run launch.py --yaml examples/Qwen/Qwen3-8B/hf_online_dflash.yaml --yes
```

--------------------------------

### Knowledge Distillation Setup

Source: https://context7.com/nvidia/model-optimizer/llms.txt

Enables training smaller student models to mimic larger teacher models. Requires loading both teacher and student models, and freezing the teacher's parameters.

```python
import torch.nn as nn
import modelopt.torch.distill as mtd
from transformers import AutoModelForCausalLM

# Load teacher and student models
teacher = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf").cuda()
student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()

# Freeze teacher model
for param in teacher.parameters():
    param.requires_grad = False
```
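
To make "mimic the teacher" concrete, the sketch below computes a common logit-matching objective, the KL divergence between the teacher's and the student's softmax distributions, for a single token. It is a dependency-free illustration with hypothetical helper names (`softmax`, `kd_kl_loss`); it does not call `modelopt.torch.distill`, and ModelOpt's own loss classes may differ in scaling and reduction.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over one token's vocabulary logits."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [2.0, 1.0, 0.1]  # illustrative values
student_logits = [1.0, 1.0, 1.0]

print(kd_kl_loss(student_logits, teacher_logits))  # positive: distributions differ
print(kd_kl_loss(teacher_logits, teacher_logits))  # zero: perfect match
```

The loss only produces gradients for the student, which is why the teacher's parameters are frozen above.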

--------------------------------

### Perform QAT with SFTTrainer

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md

Launch a full parameter Supervised Finetuning (SFT) with Quantization Aware Training (QAT) on the GPT-OSS 20B model using `accelerate launch`. This command utilizes specific configuration files and quantization settings.

```bash
# Other supported quantization configs include NVFP4_MLP_WEIGHT_ONLY_CFG, NVFP4_MLP_ONLY_CFG etc.
# [Optional] For faster FlashAttention3, add '--attn_implementation kernels-community/vllm-flash-attn3'
accelerate launch --config_file configs/zero3.yaml sft.py \
    --config configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b \
    --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG \
    --output_dir gpt-oss-20b-qat
```

--------------------------------

### Deploy with TensorRT-LLM

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/deployment/3_unified_hf.rst

Example of loading and running inference with a quantized Hugging Face model using TensorRT-LLM. Ensure TensorRT-LLM v0.17.0 or later is installed. This example uses an FP8 quantized Llama-3.1 model.

```python
from tensorrt_llm import LLM, SamplingParams

def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == '__main__':
    main()
```

--------------------------------

### Example Python Import Statement

Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/README.md

Demonstrates how to import a custom optimization function from an experimental module. Ensure the experimental module is correctly installed or accessible.

```python
from experimental.my_technique import my_optimize
...
```

--------------------------------

### Multi-task Pipeline Example

Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/configuration.md

Configure sequential tasks where one task starts only after the previous one completes. It demonstrates sharing values across tasks using `global_vars`.

```yaml
job_name: Qwen3-8B_quantize_export
pipeline:
  global_vars:
    hf_model: /hf-local/Qwen/Qwen3-8B

  task_0:
    script: common/megatron_lm/quantize/quantize.sh
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1

  task_1:
    script: common/megatron_lm/export/export.sh
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
```
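
The `<<global_vars.NAME>>` placeholders can be thought of as simple string substitution applied to every task. The sketch below shows one way to resolve them over an already-parsed config. The `resolve` helper and the regular expression are assumptions for illustration, not the launcher's actual implementation.

```python
import re

# Matches placeholders of the form <<global_vars.some_name>>.
PLACEHOLDER = re.compile(r"<<global_vars\.([A-Za-z_][A-Za-z0-9_]*)>>")


def resolve(value, global_vars):
    """Recursively replace placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        # An unknown variable name raises KeyError, surfacing typos early.
        return PLACEHOLDER.sub(lambda m: str(global_vars[m.group(1)]), value)
    if isinstance(value, dict):
        return {k: resolve(v, global_vars) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, global_vars) for v in value]
    return value


# The pipeline from the YAML above, written as a Python dict.
pipeline = {
    "global_vars": {"hf_model": "/hf-local/Qwen/Qwen3-8B"},
    "task_0": {
        "script": "common/megatron_lm/quantize/quantize.sh",
        "environment": [{"HF_MODEL_CKPT": "<<global_vars.hf_model>>"}],
    },
    "task_1": {
        "script": "common/megatron_lm/export/export.sh",
        "environment": [{"HF_MODEL_CKPT": "<<global_vars.hf_model>>"}],
    },
}

global_vars = pipeline["global_vars"]
tasks = {
    name: resolve(task, global_vars)
    for name, task in pipeline.items()
    if name.startswith("task_")
}
print(tasks["task_0"]["environment"])
```

Because both tasks read the same variable, changing `hf_model` once updates the checkpoint path everywhere.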

--------------------------------

### Start Ray Server and Deploy Model

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/evaluation/nemo_evaluator_instructions.md

Starts the Ray server and deploys the Hugging Face model using the `deploy_ray_hf.py` script. Configure GPU, CPU, and port settings as needed.

```bash
# Start the server (blocks while running — use a separate terminal)
ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path path/to/checkpoint \
    --model_id anymodel-hf \
    --num_gpus 2 --num_gpus_per_replica 2 --num_cpus_per_replica 16 \
    --trust_remote_code --port 8083 --device_map "auto" --cuda_visible_devices "0,1"
```

--------------------------------

### Migrate Legacy Dict Format to New List Format (Full Config)

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/_quant_cfg.rst

This example demonstrates the conversion of a legacy flat dictionary-based quant_cfg to the new list format. The deny-all-then-configure pattern is achieved by placing a default disable entry at the start.

```python
"quant_cfg": [
    {"quantizer_name": "*",
     "enable": False},
    {"quantizer_name": "*weight_quantizer",
     "cfg": {"num_bits": 8, "axis": 0}},
    {"quantizer_name": "*input_quantizer",
     "cfg": {"num_bits": 8, "axis": None}},
]
```
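
The deny-all-then-configure pattern relies on later entries overriding earlier ones. The sketch below illustrates that ordering with standard-library glob matching. It rests on two assumptions that should be checked against the ModelOpt documentation: that the last matching entry wins, and that an entry with a `cfg` but no explicit `enable` is treated as enabled. `resolve_quantizer` and the quantizer names are hypothetical.

```python
from fnmatch import fnmatchcase

# The list-format config shown above.
quant_cfg = [
    {"quantizer_name": "*", "enable": False},
    {"quantizer_name": "*weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}},
    {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8, "axis": None}},
]


def resolve_quantizer(name, entries):
    """Return the settings of the last entry whose pattern matches name."""
    match = None
    for entry in entries:
        if fnmatchcase(name, entry["quantizer_name"]):
            match = entry  # later entries override earlier ones
    if match is None:
        return None
    # Assumption: supplying a cfg without an explicit flag implies enabled.
    return {"enable": match.get("enable", "cfg" in match), "cfg": match.get("cfg")}


for name in (
    "layer1.conv.weight_quantizer",
    "layer1.conv.input_quantizer",
    "layer1.conv.output_quantizer",
):
    print(name, resolve_quantizer(name, quant_cfg))
```

Under these assumptions the output quantizer stays disabled because only the leading `*` entry matches it.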

--------------------------------

### Basic QAT/QAD Training with FSDP

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/llama_factory/README.md

Launches LLaMA Factory for QAT/QAD training using FSDP. The script automatically installs LLaMA Factory if not present.

```bash
./launch_llamafactory.sh llama_config.yaml
```

--------------------------------

### Deploy QAT Checkpoint on vLLM

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md

Start the vLLM server with the quantized model path. Follow the OpenAI Cookbook instructions for deploying with vLLM.

```bash
vllm serve <model_path>
```

--------------------------------

### Start AutoQuantization Search

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_ptq/notebooks/3_PTQ_AutoQuantization.ipynb

Wraps the model's native loss function and automatically searches for the best per-layer quantization format mapping. Constraints guide average bit precision, and loss is evaluated across candidate formats to preserve accuracy. `disabled_layers` can keep specific layers unquantized.

```python
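import time

import modelopt.torch.quantization as mtq

# model, calib_loader, EFFECTIVE_BITS, QUANT_CFG, and Q_FORMATS come from earlier notebook cells.


def loss_fn(out, batch):
    return out.loss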


print("🚧  Launching auto_quantize ...")
t0 = time.time()

model, _ = mtq.auto_quantize(
    model,
    constraints={"effective_bits": EFFECTIVE_BITS},
    data_loader=calib_loader,
    forward_step=lambda m, b: m(**b),
    loss_func=loss_fn,
    quantization_formats=[QUANT_CFG[q] for q in Q_FORMATS.split(",")],
    num_calib_steps=len(calib_loader),
    num_score_steps=len(calib_loader),
    verbose=True,
    disabled_layers=["*lm_head*"]  # keep LM head in fp16
)
print(f"✅ Done in {time.time() - t0:.1f}s")
```
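
An `effective_bits` budget can be pictured as a parameter-weighted average bit width across the quantized layers. The sketch below computes that average for a hypothetical per-layer assignment. The helper name, layer names, and parameter counts are invented for illustration, and ModelOpt's own accounting may differ (for example, by including scale overhead).

```python
def effective_bits(layers):
    """Parameter-weighted average bit width over (num_params, bits) pairs."""
    total_params = sum(num_params for num_params, _ in layers.values())
    total_bits = sum(num_params * bits for num_params, bits in layers.values())
    return total_bits / total_params


# Hypothetical search result: a small layer kept at 8 bits, a large one at 4.
assignment = {
    "attn.qkv": (1_000, 8),
    "mlp.up": (3_000, 4),
}

budget = 6.0  # illustrative constraint value
average = effective_bits(assignment)
print(f"average bits: {average}, within budget: {average <= budget}")
```

Large layers dominate the average, so pushing them to a lower-precision format frees budget for keeping sensitive layers at higher precision.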

--------------------------------

### Hugging Face Example Script for PTQ

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/vlm_ptq/README.md

This script demonstrates an all-in-one, step-by-step model quantization example for supported Hugging Face multi-modal models. The quantization format and number of GPUs are provided as inputs.

```bash
scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
```

--------------------------------

### Install Requirements

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/fvd_metrics/README.md

Installs the necessary Python packages for the FVD tool. For GPU support, ensure PyTorch with CUDA is installed.

```bash
pip install -r requirements.txt
```

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu129
```

--------------------------------

### Launch Distillation Training for HuggingFace Models

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_distill/README.md

Example command to launch a knowledge distillation training process for HuggingFace models using `accelerate launch`. This command specifies teacher and student models, output directory, and training parameters.

```bash
accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
    main.py \
    --teacher_name_or_path 'meta-llama/Llama-3.2-3B-Instruct' \
    --student_name_or_path 'meta-llama/Llama-3.2-1B' \
    --output_dir ./llama3.2-distill \
    --max_length 2048 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --max_steps 200 \
    --logging_steps 5
```

--------------------------------

### Get Autotuner Help via Command Line

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/onnx_ptq/autotune/README.md

Use this command to display help information and available options for the ONNX PTQ autotuner when running from the command line.

```bash
python3 -m modelopt.onnx.quantization.autotune --help
```

--------------------------------

### Install Model Optimizer

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/getting_started/_installation_for_Linux.rst

Install Model Optimizer using pip. This command will also download and install necessary third-party open-source software.

```bash
pip install nvidia-modelopt
```

--------------------------------

### QAT/QAD Training using CLI

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/llama_factory/README.md

Initiates QAT/QAD training via the llamafactory_cli.

```bash
./launch_llamafactory.sh train llama_config.yaml
```

--------------------------------

### Install NVIDIA Model Optimizer

Source: https://context7.com/nvidia/model-optimizer/llms.txt

Install the Model Optimizer library from PyPI with all dependencies or from source for the latest features. Development dependencies are included when installing from source.

```bash
pip install -U nvidia-modelopt[all]
```

```bash
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer
pip install -e .[dev]
```

--------------------------------

### Autotune with Pattern Cache (Cold Start)

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/9_autotune.rst

Perform an initial optimization run to generate a pattern cache. This cache stores the best Q/DQ schemes for reuse in subsequent optimizations.

```bash
python -m modelopt.onnx.quantization.autotune \
    --onnx_path model_v1.onnx \
    --output_dir ./run1
```

--------------------------------

### Navigate to SpecDec Benchmark Directory

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/specdec_bench/README.md

Change the current directory to the SpecDec benchmark examples.

```bash
cd examples/specdec_bench
```

--------------------------------

### Install Model Optimizer with Pip

Source: https://github.com/nvidia/model-optimizer/blob/main/README.md

Install the stable release of Model Optimizer using pip. This command also installs additional third-party open source software.

```bash
pip install -U nvidia-modelopt[all]
```

--------------------------------

### Install Model Optimizer with ONNX and HF

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/diffusers/README.md

Install Model Optimizer with ONNX and Hugging Face dependencies. Also install requirements specific to subsections like evaluation.

```bash
pip install nvidia-modelopt[onnx,hf]
pip install -r requirements.txt
```

--------------------------------

### Factory System Registration Example

Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/architecture.md

YAML configuration demonstrating how to reference a factory by name, such as `slurm_factory`, and set its parameters like `nodes`.

```yaml
slurm_config:
  _factory_: "slurm_factory"
  nodes: 1
```
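
Referencing a factory by name usually implies a registry that maps names to callables. The sketch below shows that general pattern: the `_factory_` key selects a registered function and the remaining keys are passed as keyword arguments. Everything here (`FACTORIES`, `register_factory`, `build`, and the returned dict shape) is a hypothetical illustration, not the launcher's actual factory system.

```python
FACTORIES = {}


def register_factory(name):
    """Decorator that registers a callable under the given name."""
    def decorator(fn):
        FACTORIES[name] = fn
        return fn
    return decorator


@register_factory("slurm_factory")
def slurm_factory(nodes=1, **overrides):
    # Stand-in for building a real executor config object.
    return {"executor": "slurm", "nodes": nodes, **overrides}


def build(config):
    """Look up the factory named by '_factory_' and call it with the other keys."""
    params = dict(config)  # copy so the caller's dict is not mutated
    factory_name = params.pop("_factory_")
    return FACTORIES[factory_name](**params)


# Equivalent of the YAML fragment above.
slurm_config = build({"_factory_": "slurm_factory", "nodes": 1})
print(slurm_config)
```

With this pattern, supporting a new cluster type only requires registering another function under a new name.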

--------------------------------

### PTQ Recipe - Single File Example

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/10_recipes.rst

A single YAML file defining a PTQ recipe with FP8 quantization for weights and activations, and FP8 KV cache.

```yaml
# modelopt_recipes/general/ptq/fp8_default-fp8_kv.yml

metadata:
  recipe_type: ptq
  description: FP8 per-tensor weight and activation (W8A8), FP8 KV cache, max calibration.

quantize:
  algorithm: max
  quant_cfg:
    - quantizer_name: '*'
      enable: false
    - quantizer_name: '*input_quantizer'
      cfg:
        num_bits: e4m3
        axis:
    - quantizer_name: '*weight_quantizer'
      cfg:
        num_bits: e4m3
        axis:
    - quantizer_name: '*[kv]_bmm_quantizer'
      enable: true
      cfg:
        num_bits: e4m3
    # ... standard exclusions omitted for brevity

```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/qat-finetune-transformers.ipynb

Installs or upgrades the necessary libraries for transformers and trl.

```python
%pip install --upgrade transformers trl
```

--------------------------------

### Install Evaluation Requirements

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/diffusers/README.md

Install the necessary Python packages for evaluation by running this command.

```bash
pip install -r eval/requirements.txt
```

--------------------------------

### Verify TensorRT-Edge-LLM Installation

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/torch_onnx/README.md

Check if the CLI tools are installed correctly by running their help commands.

```bash
tensorrt-edgellm-quantize-llm --help
tensorrt-edgellm-export-llm --help
```

--------------------------------

### Download and Tokenize Nemotron-SFT-Instruction-Following-Chat-v2

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/dataset/MEGATRON_DATA_PREP.md

Downloads the Nemotron-SFT-Instruction-Following-Chat-v2 dataset and then tokenizes its data directory. Ensure the tokenizer and output directory are set.

```bash
hf download nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 \
    --repo-type dataset \
    --local-dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --input_dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/data/ \
    --json_keys messages \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 96 \
    --max_sequence_length 256_000 \
    --reasoning_content inline
```

--------------------------------

### Install PyTorch Packages

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/README.md

Install specific versions of PyTorch, Torchvision, and Torchaudio compatible with CUDA 12.8.

```powershell
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
```
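
After installing, the pins can be checked against what is actually present in the environment. This is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and the handling of a local version suffix such as `+cu128` is an assumption about how the CUDA wheels report their version.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above
EXPECTED = {"torch": "2.7.0", "torchvision": "0.22.0", "torchaudio": "2.7.0"}


def check_pins(expected):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Assumption: CUDA wheels may carry a local suffix (e.g. "+cu128"),
        # so only the part before "+" is compared with the pin.
        matches = installed is not None and installed.split("+")[0] == pin
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_pins(EXPECTED).items():
    print(f"{name}: installed={installed} pinned={EXPECTED[name]} ok={ok}")
```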

--------------------------------

### Install Model Optimizer and Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/onnx_ptq/README.md

Install the nvidia-modelopt package with ONNX dependencies and other requirements using pip.

```bash
pip install -U "nvidia-modelopt[onnx]"
pip install -r requirements.txt
```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/evaluation/nemo_evaluator_instructions.md

Installs the required Python packages from the Puzzletron requirements file. The path is relative to the repository root, so run the command from there.

```bash
pip install -r examples/puzzletron/requirements.txt
```

--------------------------------

### View All ONNX PTQ Parameters

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/onnx_ptq/genai_llm/README.md

Run this command to display all available command-line parameters for the ONNX PTQ example script.

```bash
python quantize.py --help
```

--------------------------------

### Install ONNX Runtime DirectML Packages

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/README.md

Install ONNX Runtime with DirectML support for hardware acceleration on Windows.

```powershell
pip install onnxruntime-directml==1.21.1
pip install onnxruntime-genai-directml==0.6.0
```

--------------------------------

### Install ONNX Runtime GenAI

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/perplexity_metrics/README.md

Install the ONNX Runtime GenAI package, which is necessary for evaluating ONNX models.

```bash
pip install onnxruntime-genai
```

--------------------------------

### YAML Configuration Example

Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/CLAUDE.md

Illustrates the structure of a ModelOpt YAML configuration file, including job name, pipeline tasks, global variables, script arguments, environment settings, and Slurm configurations.

```yaml
job_name: Qwen3-8B_NVFP4_DEFAULT_CFG
pipeline:
  global_vars:
    hf_local: /hf-local/
  task_0:
    script: common/megatron_lm/quantize/quantize.sh
    args:
      - --calib-dataset-path-or-name <<global_vars.hf_local>>abisee/cnn_dailymail
    environment:
      - MLM_MODEL_CFG: Qwen/Qwen3-8B
      - HF_MODEL_CKPT: <<global_vars.hf_local>>Qwen/Qwen3-8B
      - TP: 4
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 4
      gpus_per_node: 4
```
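
The `<<global_vars.hf_local>>` tokens in the configuration above are placeholders that refer to entries under `global_vars`. The launcher's own resolution logic is not shown here; the following is a hypothetical sketch of how such placeholders could be expanded, with the `expand` helper and the regular expression being illustrative assumptions.

```python
import re

# Matches placeholders of the form <<global_vars.NAME>> (assumed syntax)
PLACEHOLDER = re.compile(r"<<global_vars\.(\w+)>>")


def expand(value, global_vars):
    """Replace each <<global_vars.NAME>> in a string with its value."""
    return PLACEHOLDER.sub(lambda m: str(global_vars[m.group(1)]), value)


# Values copied from the YAML example above
global_vars = {"hf_local": "/hf-local/"}
print(expand("<<global_vars.hf_local>>Qwen/Qwen3-8B", global_vars))
# prints: /hf-local/Qwen/Qwen3-8B
```

Under this reading, changing `hf_local` in one place redirects every path that references it, which is the usual reason for factoring shared prefixes into global variables.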

--------------------------------

### Install Model Optimizer and Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/pruning/cifar_resnet.ipynb

Installs the necessary libraries for using Model Optimizer, including torchvision and torchprofile.

```python
! pip install nvidia-modelopt torchvision torchprofile
```

--------------------------------

### Quantization Aware Training (QAT) Setup and Loop

Source: https://context7.com/nvidia/model-optimizer/llms.txt

Fine-tunes a quantized model to recover accuracy lost to quantization. Enabling HuggingFace checkpointing makes `save_pretrained` and `from_pretrained` save and restore the modelopt state automatically.

```python
import torch
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Enable automatic save/load of modelopt state with HuggingFace checkpointing
mto.enable_huggingface_checkpointing()

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def calibrate(model):
    # Calibration only needs forward passes, so gradients are disabled
    with torch.no_grad():
        for text in ["Sample calibration text 1", "Sample calibration text 2"]:
            inputs = tokenizer(text, return_tensors="pt").to("cuda")
            model(**inputs)

# Quantize the model with NVFP4 configuration
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)

# Placeholder training data (an assumption for illustration; substitute your own dataset)
train_texts = ["Sample training text 1", "Sample training text 2"]
train_dataloader = DataLoader(
    [{"input_ids": tokenizer(t, return_tensors="pt")["input_ids"][0]} for t in train_texts],
    batch_size=1,  # batch size 1 avoids padding sequences of different lengths
)

# QAT training loop
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

for epoch in range(2):
    for batch in train_dataloader:
        inputs = batch["input_ids"].cuda()
        # Causal LM loss: the model shifts the labels internally
        outputs = model(input_ids=inputs, labels=inputs)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Save quantized model (modelopt state saved automatically)
model.save_pretrained("./qat_model")
tokenizer.save_pretrained("./qat_model")
```