### Install Dependencies and Run Tests

Source: https://github.com/nvidia/model-optimizer/blob/main/tests/examples/README.md

Installs necessary dependencies and executes tests for a specific example. Ensure you are in the root of the repository and have mounted the local modelopt directory.

```bash
cd /workspace/Model-Optimizer
pip install -e ".[all,dev-test]"
pytest tests/examples/$TEST
```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/speculative_decoding/recipes/train_eagle_head_cosmos_reason2.ipynb

Installs the necessary model optimization library and project requirements. Use this at the beginning of the setup process.

```bash
%%bash
pip install -U nvidia-modelopt[hf]
pip install -r ../requirements.txt
```

--------------------------------

### Install Model-Optimizer and Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/diffusers/distillation/README.md

Installs Model-Optimizer and all required dependencies for distillation training. Ensure you are in the distillation example directory before running.

```bash
cd examples/diffusers/distillation
pip install -e ../../../
pip install -r requirements.txt
```

--------------------------------

### Deploy QAT Checkpoint on SGLang

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md

Start the SGLang server with a specified model path and tensor parallelism size. Refer to the SGLang setup guide for installation instructions.

```bash
python3 -m sglang.launch_server --model --tp
```

--------------------------------

### QAT Workflow Example with ModelOpt

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/cnn_qat/README.md

This Python code demonstrates the core steps of Quantization-Aware Training (QAT) using NVIDIA ModelOpt. It includes model quantization, calibration, QAT fine-tuning, and saving/restoring the model.
Ensure necessary imports and model/loader setup are done prior to this.

```python
import modelopt.torch.quantization as mtq
import modelopt.torch.opt as mto

# ... build model, loaders, optimizer, scheduler ...


def calibrate_fn(m):
    # Run a few hundred samples through the model to collect calibration statistics
    m.eval()
    seen = 0
    for x, _ in calib_loader:
        m(x.to(device))
        seen += x.size(0)
        if seen >= 512:
            break


# 1. PTQ quantization + calibration
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, calibrate_fn)

# 2. QAT fine-tuning
for epoch in range(1, epochs + 1):
    train(model, train_loader, ...)
    scheduler.step()

# 3. Save final QAT model (weights + quantizer state)
mto.save(model, "cnn_qat_best.pth")

# 4. To reload for inference or further training:
model = build_model()
mto.restore(model, "cnn_qat_best.pth")
model.to(device)
```

--------------------------------

### Verify Puzzletron Installation

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/README.md

Run GPU tests to confirm the puzzletron installation. This example specifically checks the Qwen3-8B model.

```bash
python -m pytest tests/gpu/torch/puzzletron/test_puzzletron.py -k "Qwen3-8B"
```

--------------------------------

### Install Model Optimizer with Hugging Face Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_distill/README.md

Install Model Optimizer with specific dependencies for Hugging Face models and then install example requirements.

```bash
pip install -U nvidia-modelopt[hf]
pip install -r requirements.txt
```

--------------------------------

### Install Dependencies and Login

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/dataset/README.md

Installs the necessary package and logs into Hugging Face Hub. A token is required for gated datasets.
```bash pip install nvidia-modelopt[hf] hf auth login --token # required for gated datasets ``` -------------------------------- ### Install Base Requirements Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/kl_divergence_metrics/README.md Install the necessary Python packages for the toolkit. Consider installing PyTorch with CUDA support for improved performance. ```bash pip install -r requirements.txt ``` -------------------------------- ### Simple Flat Directory Structure Example Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/README.md Illustrates a basic file organization for an experimental technique, including the main script, tests, and examples. ```text experimental/my_technique/ ├── README.md ├── requirements.txt ├── my_technique.py ├── test_my_technique.py └── example.py ``` -------------------------------- ### Version Summary Report Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/contributing.md This is an example of the version summary that is printed at the start of every run. It helps in identifying the versions of different components used. ```text ============================================================ Version Report ============================================================ Launcher d28acd33 (main) Megatron-LM 1e064f361 (main) Model-Optimizer 69c0d479 (main) ============================================================ ``` -------------------------------- ### Run PTQ Example with Recipe via CLI Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/10_recipes.rst Execute a PTQ example script using a specified recipe via the command line. This bypasses format-specific flags. 
```bash python examples/llm_ptq/hf_ptq.py \ --model Qwen/Qwen3-8B \ --recipe general/ptq/fp8_default-fp8_kv \ --export_path build/fp8 \ --calib_size 512 \ --export_fmt hf ``` -------------------------------- ### Install vLLM Fork with AnyModel Support Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/README.md Clone and install a specific vLLM fork that includes AnyModel support for deploying compressed models. Ensure you follow the vLLM installation guide for building from source. ```bash git clone https://github.com/askliar/vllm.git cd vllm git checkout feature/add_anymodel_to_vllm VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto ``` -------------------------------- ### Install Dependencies Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/diffusers/qad_example/README.md Create a virtual environment and install the required dependencies using pip. This includes LTX packages from source and NVIDIA ModelOpt from PyPI. ```bash python -m venv .venv .venv\Scripts\activate # Windows # source .venv/bin/activate # Linux/macOS pip install -r requirements.txt ``` ```bash pip install torch accelerate safetensors pyyaml ``` -------------------------------- ### Install Dependencies with requirements.txt Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb Installs all necessary dependencies for the notebook using a requirements.txt file. Ensure this file is present in the same directory. ```python !pip install -r requirements.txt ``` -------------------------------- ### Install ModelOpt-Windows with Olive Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/README.md Installs the ModelOpt-Windows integrated into Microsoft's Olive framework. Also installs ONNX Runtime with CUDA support. 
```bash pip install olive-ai[nvmo] ``` ```bash pip install onnxruntime-genai-cuda ``` -------------------------------- ### Run Hugging Face Example Script Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_ptq/README.md Example bash script to run the Hugging Face quantization example for LLM models like Llama-3. ```bash #!/bin/bash # For LLM models like [Llama-3](https://huggingface.co/meta-llama): ``` -------------------------------- ### Install ModelOpt-Windows Standalone Toolkit Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/README.md Installs the ModelOpt-Windows as a standalone toolkit for CUDA 12.x systems. ```bash pip install nvidia-modelopt[onnx] ``` -------------------------------- ### Install nvidia-modelopt with all optional dependencies Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/getting_started/_installation_for_Linux.rst Use this command to install the package with all optional dependencies included. This ensures full functionality across all modules. ```bash pip install -U "nvidia-modelopt[all]" ``` -------------------------------- ### Package Directory Structure Example Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/README.md Shows a more structured approach for an experimental technique using a package layout, separating core logic from examples and tests. ```text experimental/my_technique/ ├── README.md ├── requirements.txt ├── my_technique/ │ ├── __init__.py │ ├── core.py │ └── config.py ├── tests/ │ └── test_core.py └── examples/ └── example_usage.py ``` -------------------------------- ### Install Dependencies and Run Tests Locally Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/testing.md Installs necessary Python packages using uv and then executes pytest for local testing. Ensure you are in the Model-Optimizer/tools/launcher directory. ```bash cd Model-Optimizer/tools/launcher uv pip install -e . 
pytest uv run pytest -v ``` -------------------------------- ### Sequential Quantization Configuration (W4A8 Example) Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/_quant_cfg.rst Configure sequential quantization where a TensorQuantizer is replaced by a SequentialQuantizer that applies formats in sequence. This example shows W4A8 quantization. ```python { "quantizer_name": "*weight_quantizer", "cfg": [ {"num_bits": 4, "block_sizes": {-1: 128, "type": "static"}}, {"num_bits": (4, 3)}, # FP8 ], } ``` -------------------------------- ### Install Model Optimizer from Source Source: https://github.com/nvidia/model-optimizer/blob/main/README.md Install Model Optimizer from source in editable mode to use the latest features or for development. This requires cloning the repository first. ```bash # Clone the Model Optimizer repository git clone git@github.com:NVIDIA/Model-Optimizer.git cd Model-Optimizer pip install -e .[dev] ``` -------------------------------- ### Install DMS Package Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/dms/README.md Clone the repository and install the DMS package in editable mode. This provides all necessary components for training and evaluation. ```bash git clone https://github.com/NVIDIA/Model-Optimizer cd Model-Optimizer/experimental/dms pip install -e . ``` -------------------------------- ### ModelOpt Launcher Documentation Guides Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/README.md Table outlining the available documentation guides for the ModelOpt Launcher, including Configuration, Architecture, Testing, Claude Code, and Contributing. 
```markdown | Guide | Description | |---|---| | [Configuration](docs/configuration.md) | YAML formats, CLI overrides, flags, `hf_local` | | [Architecture](docs/architecture.md) | Shared core, factory system, typed tasks, mount mechanism | | [Testing](docs/testing.md) | Running tests locally and in CI | | [Claude Code](docs/claude_code.md) | Submit, monitor, diagnose workflows | | [Contributing](docs/contributing.md) | Adding models, typed tasks, bug reporting | ``` -------------------------------- ### Custom PTQ Recipe Example Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/10_recipes.rst An example of a custom PTQ recipe configuration for INT8 per-channel weight quantization. Modify the 'quantize' section for specific needs. ```yaml # my_int8_recipe.yml metadata: recipe_type: ptq description: INT8 per-channel weight, per-tensor activation. quantize: algorithm: max quant_cfg: - quantizer_name: '*' enable: false - quantizer_name: '*weight_quantizer' cfg: num_bits: 8 axis: 0 - quantizer_name: '*input_quantizer' cfg: num_bits: 8 axis: - quantizer_name: '*lm_head*' enable: false - quantizer_name: '*output_layer*' enable: false ``` -------------------------------- ### Dataset Mix Configuration Example Source: https://github.com/nvidia/model-optimizer/blob/main/examples/dataset/README.md YAML configuration for mixing datasets, specifying repository IDs, splits, and augmentation settings. 'cap_per_split' limits the number of examples. ```yaml datasets: - repo_id: nvidia/Nemotron-Math-v2 splits: [high_part00, high_part01] cap_per_split: 200000 augment: true - repo_id: nvidia/OpenMathReasoning-mini splits: [train] augment: false # multilingual — skip language-redirect augmentation ``` -------------------------------- ### QAD Example Workflow Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/README.md Sets up and runs Quantization Aware Distillation (QAD) using a QADTrainer. 
This involves configuring a teacher model and a distillation criterion.

```python
import modelopt.torch.opt as mto
import modelopt.torch.distill as mtd
import modelopt.torch.quantization as mtq
from modelopt.torch.distill.plugins.huggingface import LMLogitsLoss
from modelopt.torch.quantization.plugins.transformers_trainer import QADTrainer

...

# [Not shown] load model, tokenizer, data loaders etc

# Create the distillation config
distill_config = {
    "teacher_model": teacher_model,
    "criterion": LMLogitsLoss(),
}

trainer = QADTrainer(
    model=model,
    processing_class=tokenizer,
    args=training_args,
    quant_args=quant_args,
    distill_config=distill_config,
    **data_module,
)

trainer.train()  # Train the quantized model using distillation (i.e., QAD)

# Save the final student model weights; An example usage
trainer.save_model()
```

--------------------------------

### Verify ModelOpt Installation

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/getting_started/windows/_installation_standalone.rst

Run this command from a shell to confirm that the ModelOpt library, specifically its ONNX quantization module, has been installed successfully. Perform this check after setting up the environment.

```bash
python -c "import modelopt.onnx.quantization"
```

--------------------------------

### Example Workflow: Improve Existing Quantization

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/onnx_ptq/autotune/README.md

This workflow demonstrates how to first create an initial quantized model using modelopt's quantize function and then use that quantized model as a baseline for further autotuning to find improved Q/DQ placements.

```python
# Step 1: Create an initial quantized baseline
import numpy as np
from modelopt.onnx.quantization import quantize

# Create dummy calibration data (replace with real data for production)
dummy_input = np.random.randn(128, 3, 224, 224).astype(np.float32)

quantize(
    'resnet50_Opset17_bs128.onnx',
    calibration_data=dummy_input,
    calibration_method='entropy',
    output_path='resnet50_quantized.onnx'
)
```

```bash
# Step 2: Use the quantized baseline for autotuning
# The autotuner will try to find better Q/DQ placements than the initial quantization
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50_Opset17_bs128.onnx \
    --output_dir ./resnet50_autotuned \
    --qdq_baseline resnet50_quantized.onnx \
    --schemes_per_region 50
```

--------------------------------

### Deploy QAT Checkpoint on TensorRT-LLM

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md

Launch an OpenAI-compatible endpoint using TensorRT-LLM with a quantized checkpoint. Ensure TensorRT-LLM is installed and follow the official guide for setup.

```bash
trtllm-serve path/to/quantized/checkpoint --tokenizer /path/to/tokenizer --max_batch_size --max_num_tokens --max_seq_len --tp_size --pp_size --host --port --kv_cache_free_gpu_memory_fraction 0.95
```

--------------------------------

### Install Model Optimizer Dependencies

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/chained_optimizations/README.md

Install Model Optimizer with optional torch and huggingface dependencies. Also install requirements.txt for the example.

```bash
pip install "nvidia-modelopt[hf]"
pip install -r requirements.txt
```

--------------------------------

### Launch DFlash Example

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/speculative_decoding/README.md

Execute a complete end-to-end example for DFlash (Block Diffusion for Speculative Decoding) training and evaluation using the provided launcher script. Ensure you have the necessary YAML configuration file.

```bash
uv run launch.py --yaml examples/Qwen/Qwen3-8B/hf_online_dflash.yaml --yes
```

--------------------------------

### Knowledge Distillation Setup

Source: https://context7.com/nvidia/model-optimizer/llms.txt

Enables training smaller student models to mimic larger teacher models. Requires loading both teacher and student models, and freezing the teacher's parameters.

```python
import torch.nn as nn
import modelopt.torch.distill as mtd
from transformers import AutoModelForCausalLM

# Load teacher and student models
teacher = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf").cuda()
student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()

# Freeze teacher model
for param in teacher.parameters():
    param.requires_grad = False
```

--------------------------------

### Perform QAT with SFTTrainer

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md

Launch a full parameter Supervised Finetuning (SFT) with Quantization Aware Training (QAT) on the GPT-OSS 20B model using `accelerate launch`. This command utilizes specific configuration files and quantization settings.

```bash
# Other supported quantization configs include NVFP4_MLP_WEIGHT_ONLY_CFG, NVFP4_MLP_ONLY_CFG etc.
# [Optional] For faster FlashAttention3, add '--attn_implementation kernels-community/vllm-flash-attn3'
accelerate launch --config_file configs/zero3.yaml sft.py \
    --config configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b \
    --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG \
    --output_dir gpt-oss-20b-qat
```

--------------------------------

### Deploy with TensorRT-LLM

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/deployment/3_unified_hf.rst

Example of loading and running inference with a quantized Hugging Face model using TensorRT-LLM. Ensure TensorRT-LLM v0.17.0 or later is installed. This example uses an FP8 quantized Llama-3.1 model.

```python from tensorrt_llm import LLM, SamplingParams def main(): prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8") outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") if __name__ == '__main__': main() ``` -------------------------------- ### Example Python Import Statement Source: https://github.com/nvidia/model-optimizer/blob/main/experimental/README.md Demonstrates how to import a custom optimization function from an experimental module. Ensure the experimental module is correctly installed or accessible. ```python from experimental.my_technique import my_optimize ... ``` -------------------------------- ### Multi-task Pipeline Example Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/configuration.md Configure sequential tasks where one task starts only after the previous one completes. It demonstrates sharing values across tasks using `global_vars`. ```yaml job_name: Qwen3-8B_quantize_export pipeline: global_vars: hf_model: /hf-local/Qwen/Qwen3-8B task_0: script: common/megatron_lm/quantize/quantize.sh environment: - HF_MODEL_CKPT: <> slurm_config: _factory_: "slurm_factory" nodes: 1 task_1: script: common/megatron_lm/export/export.sh environment: - HF_MODEL_CKPT: <> slurm_config: _factory_: "slurm_factory" nodes: 1 ``` -------------------------------- ### Start Ray Server and Deploy Model Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/evaluation/nemo_evaluator_instructions.md Starts the Ray server and deploys the Hugging Face model using the `deploy_ray_hf.py` script. Configure GPU, CPU, and port settings as needed. 
```bash # Start the server (blocks while running — use a separate terminal) ray start --head --num-gpus 2 --port 6379 --disable-usage-stats python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \ --model_path path/to/checkpoint \ --model_id anymodel-hf \ --num_gpus 2 --num_gpus_per_replica 2 --num_cpus_per_replica 16 \ --trust_remote_code --port 8083 --device_map "auto" --cuda_visible_devices "0,1" ``` -------------------------------- ### Migrate Legacy Dict Format to New List Format (Full Config) Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/_quant_cfg.rst This example demonstrates the conversion of a legacy flat dictionary-based quant_cfg to the new list format. The deny-all-then-configure pattern is achieved by placing a default disable entry at the start. ```python "quant_cfg": [ {"quantizer_name": "*", "enable": False}, {"quantizer_name": "*weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}}, {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8, "axis": None}}, ] ``` -------------------------------- ### Basic QAT/QAD Training with FSDP Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/llama_factory/README.md Launches LLaMA Factory for QAT/QAD training using FSDP. The script automatically installs LLaMA Factory if not present. ```bash ./launch_llamafactory.sh llama_config.yaml ``` -------------------------------- ### Deploy QAT Checkpoint on vLLM Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/README.md Start the vLLM server with the quantized model path. Follow the OpenAI Cookbook instructions for deploying with vLLM. ```bash vllm serve ``` -------------------------------- ### Start AutoQuantization Search Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_ptq/notebooks/3_PTQ_AutoQuantization.ipynb Wraps the model's native loss function and automatically searches for the best per-layer quantization format mapping. 
Constraints guide average bit precision, and loss is evaluated across candidate formats to preserve accuracy. `disabled_layers` can keep specific layers unquantized.

```python
import time

import modelopt.torch.quantization as mtq

# `model`, `calib_loader`, `EFFECTIVE_BITS`, `QUANT_CFG`, and `Q_FORMATS` are
# assumed to be defined in earlier notebook cells.


def loss_fn(out, batch):
    return out.loss


print("🚧 Launching auto_quantize ...")
t0 = time.time()

model, _ = mtq.auto_quantize(
    model,
    constraints={"effective_bits": EFFECTIVE_BITS},
    data_loader=calib_loader,
    forward_step=lambda m, b: m(**b),
    loss_func=loss_fn,
    quantization_formats=[QUANT_CFG[q] for q in Q_FORMATS.split(",")],
    num_calib_steps=len(calib_loader),
    num_score_steps=len(calib_loader),
    verbose=True,
    disabled_layers=["*lm_head*"],  # keep LM head in fp16
)

print(f"✅ Done in {time.time() - t0:.1f}s")
```

--------------------------------

### Hugging Face Example Script for PTQ

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/vlm_ptq/README.md

This script demonstrates an all-in-one, step-by-step model quantization example for supported Hugging Face multi-modal models. The quantization format and number of GPUs are provided as inputs.

```bash
scripts/huggingface_example.sh --model --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
```

--------------------------------

### Install Requirements

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/fvd_metrics/README.md

Installs the necessary Python packages for the FVD tool. For GPU support, ensure PyTorch with CUDA is installed.

```bash
pip install -r requirements.txt
```

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu129
```

--------------------------------

### Launch Distillation Training for HuggingFace Models

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_distill/README.md

Example command to launch a knowledge distillation training process for HuggingFace models using `accelerate launch`. This command specifies teacher and student models, output directory, and training parameters.

```bash
accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
    main.py \
    --teacher_name_or_path 'meta-llama/Llama-3.2-3B-Instruct' \
    --student_name_or_path 'meta-llama/Llama-3.2-1B' \
    --output_dir ./llama3.2-distill \
    --max_length 2048 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --max_steps 200 \
    --logging_steps 5
```

--------------------------------

### Get Autotuner Help via Command Line

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/onnx_ptq/autotune/README.md

Use this command to display help information and available options for the ONNX PTQ autotuner when running from the command line.

```bash
python3 -m modelopt.onnx.quantization.autotune --help
```

--------------------------------

### Install Model Optimizer

Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/getting_started/_installation_for_Linux.rst

Install Model Optimizer using pip. This command will also download and install necessary third-party open-source software.

```bash
pip install nvidia-modelopt
```

--------------------------------

### QAT/QAD Training using CLI

Source: https://github.com/nvidia/model-optimizer/blob/main/examples/llm_qat/llama_factory/README.md

Initiates QAT/QAD training via the llamafactory_cli.

```bash
./launch_llamafactory.sh train llama_config.yaml
```

--------------------------------

### Install NVIDIA Model Optimizer

Source: https://context7.com/nvidia/model-optimizer/llms.txt

Install the Model Optimizer library from PyPI with all dependencies or from source for the latest features. Development dependencies are included when installing from source.

```bash pip install -U nvidia-modelopt[all] ``` ```bash git clone git@github.com:NVIDIA/Model-Optimizer.git cd Model-Optimizer pip install -e .[dev] ``` -------------------------------- ### Autotune with Pattern Cache (Cold Start) Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/9_autotune.rst Perform an initial optimization run to generate a pattern cache. This cache stores the best Q/DQ schemes for reuse in subsequent optimizations. ```bash python -m modelopt.onnx.quantization.autotune \ --onnx_path model_v1.onnx \ --output_dir ./run1 ``` -------------------------------- ### Navigate to SpecDec Benchmark Directory Source: https://github.com/nvidia/model-optimizer/blob/main/examples/specdec_bench/README.md Change the current directory to the SpecDec benchmark examples. ```bash cd examples/specdec_bench ``` -------------------------------- ### Install Model Optimizer with Pip Source: https://github.com/nvidia/model-optimizer/blob/main/README.md Install the stable release of Model Optimizer using pip. This command also installs additional third-party open source software. ```bash pip install -U nvidia-modelopt[all] ``` -------------------------------- ### Install Model Optimizer with ONNX and HF Source: https://github.com/nvidia/model-optimizer/blob/main/examples/diffusers/README.md Install Model Optimizer with ONNX and Hugging Face dependencies. Also install requirements specific to subsections like evaluation. ```bash pip install nvidia-modelopt[onnx,hf] pip install -r requirements.txt ``` -------------------------------- ### Factory System Registration Example Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/docs/architecture.md YAML configuration demonstrating how to reference a factory by name, such as `slurm_factory`, and set its parameters like `nodes`. 
```yaml slurm_config: _factory_: "slurm_factory" nodes: 1 ``` -------------------------------- ### PTQ Recipe - Single File Example Source: https://github.com/nvidia/model-optimizer/blob/main/docs/source/guides/10_recipes.rst A single YAML file defining a PTQ recipe with FP8 quantization for weights and activations, and FP8 KV cache. ```yaml # modelopt_recipes/general/ptq/fp8_default-fp8_kv.yml metadata: recipe_type: ptq description: FP8 per-tensor weight and activation (W8A8), FP8 KV cache, max calibration. quantize: algorithm: max quant_cfg: - quantizer_name: '*' enable: false - quantizer_name: '*input_quantizer' cfg: num_bits: e4m3 axis: - quantizer_name: '*weight_quantizer' cfg: num_bits: e4m3 axis: - quantizer_name: '*[kv]_bmm_quantizer' enable: true cfg: num_bits: e4m3 # ... standard exclusions omitted for brevity ``` -------------------------------- ### Install Dependencies Source: https://github.com/nvidia/model-optimizer/blob/main/examples/gpt-oss/qat-finetune-transformers.ipynb Installs or upgrades the necessary libraries for transformers and trl. ```python %pip install --upgrade transformers trl ``` -------------------------------- ### Install Evaluation Requirements Source: https://github.com/nvidia/model-optimizer/blob/main/examples/diffusers/README.md Install the necessary Python packages for evaluation by running this command. ```bash pip install -r eval/requirments.txt ``` -------------------------------- ### Verify TensorRT-Edge-LLM Installation Source: https://github.com/nvidia/model-optimizer/blob/main/examples/torch_onnx/README.md Check if the CLI tools are installed correctly by running their help commands. 
```bash tensorrt-edgellm-quantize-llm --help tensorrt-edgellm-export-llm --help ``` -------------------------------- ### Download and Tokenize Nemotron-SFT-Instruction-Following-Chat-v2 Source: https://github.com/nvidia/model-optimizer/blob/main/examples/dataset/MEGATRON_DATA_PREP.md Downloads the Nemotron-SFT-Instruction-Following-Chat-v2 dataset and then tokenizes its data directory. Ensure the tokenizer and output directory are set. ```bash hf download nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 \ --repo-type dataset \ --local-dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \ --input_dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/data/ \ --json_keys messages \ --tokenizer ${TOKENIZER} \ --output_dir ${OUTPUT_DIR} \ --workers 96 \ --max_sequence_length 256_000 \ --reasoning_content inline ``` -------------------------------- ### Install PyTorch Packages Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/README.md Install specific versions of PyTorch, Torchvision, and Torchaudio compatible with CUDA 12.8. ```powershell pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128 ``` -------------------------------- ### Install Model Optimizer and Dependencies Source: https://github.com/nvidia/model-optimizer/blob/main/examples/onnx_ptq/README.md Install the nvidia-modelopt package with ONNX dependencies and other requirements using pip. ```bash pip install -U nvidia-modelopt[onnx] pip install -r requirements.txt ``` -------------------------------- ### Install Dependencies Source: https://github.com/nvidia/model-optimizer/blob/main/examples/puzzletron/evaluation/nemo_evaluator_instructions.md Installs necessary Python packages from the provided requirements file. Ensure you are in the correct directory. 
```bash pip install -r examples/puzzletron/requirements.txt ``` -------------------------------- ### View All ONNX PTQ Parameters Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/onnx_ptq/genai_llm/README.md Run this command to display all available command-line parameters for the ONNX PTQ example script. ```bash python quantize.py --help ``` -------------------------------- ### Install ONNX Runtime DirectML Packages Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/README.md Install ONNX Runtime with DirectML support for hardware acceleration on Windows. ```powershell pip install onnxruntime-directml==1.21.1 pip install onnxruntime-genai-directml==0.6.0 ``` -------------------------------- ### Install ONNX Runtime GenAI Source: https://github.com/nvidia/model-optimizer/blob/main/examples/windows/accuracy_benchmark/perplexity_metrics/README.md Install the ONNX Runtime GenAI package, which is necessary for evaluating ONNX models. ```bash pip install onnxruntime-genai ``` -------------------------------- ### YAML Configuration Example Source: https://github.com/nvidia/model-optimizer/blob/main/tools/launcher/CLAUDE.md Illustrates the structure of a ModelOpt YAML configuration file, including job name, pipeline tasks, global variables, script arguments, environment settings, and Slurm configurations. 
```yaml job_name: Qwen3-8B_NVFP4_DEFAULT_CFG pipeline: global_vars: hf_local: /hf-local/ task_0: script: common/megatron_lm/quantize/quantize.sh args: - --calib-dataset-path-or-name <>abisee/cnn_dailymail environment: - MLM_MODEL_CFG: Qwen/Qwen3-8B - HF_MODEL_CKPT: <>Qwen/Qwen3-8B - TP: 4 slurm_config: _factory_: "slurm_factory" nodes: 1 ntasks_per_node: 4 gpus_per_node: 4 ``` -------------------------------- ### Install Model Optimizer and Dependencies Source: https://github.com/nvidia/model-optimizer/blob/main/examples/pruning/cifar_resnet.ipynb Installs the necessary libraries for using Model Optimizer, including torchvision and torchprofile. ```python ! pip install nvidia-modelopt torchvision torchprofile ``` -------------------------------- ### Quantization Aware Training (QAT) Setup and Loop Source: https://context7.com/nvidia/model-optimizer/llms.txt Fine-tunes a quantized model to recover accuracy loss. Enables automatic saving/loading of modelopt state with HuggingFace checkpointing. 
```python
import torch
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.optim import AdamW

# Enable automatic save/load of modelopt state with HuggingFace checkpointing
mto.enable_huggingface_checkpointing()

# Load and quantize model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")


def calibrate(model):
    for text in ["Sample calibration text 1", "Sample calibration text 2"]:
        inputs = tokenizer(text, return_tensors="pt").to("cuda")
        model(**inputs)


# Quantize the model with NVFP4 configuration
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)

# QAT training loop
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

for epoch in range(2):
    for batch in train_dataloader:
        inputs = batch["input_ids"].cuda()
        outputs = model(input_ids=inputs, labels=inputs)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Save quantized model (modelopt state saved automatically)
model.save_pretrained("./qat_model")
tokenizer.save_pretrained("./qat_model")
```
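
QAT fine-tunes a model through simulated ("fake") quantization: values are rounded to a low-precision grid and mapped back to floating point, so training sees the rounding error it must recover from. The sketch below illustrates that quantize-dequantize step in plain Python for a symmetric signed-integer grid with max calibration. It is an illustration only, not ModelOpt's implementation: `fake_quantize` and its parameters are hypothetical names, and NVFP4 uses a block-scaled floating-point format rather than this integer grid.

```python
def fake_quantize(values, num_bits=8):
    """Quantize-dequantize `values` with one symmetric scale (hypothetical helper).

    The scale comes from the largest absolute value ("max" calibration).
    """
    qmax = 2 ** (num_bits - 1) - 1        # e.g. 127 for 8 bits
    amax = max(abs(v) for v in values)    # calibration statistic
    if amax == 0:
        return list(values)               # nothing to scale
    scale = amax / qmax
    dequantized = []
    for v in values:
        q = round(v / scale)              # snap to the integer grid
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        dequantized.append(q * scale)     # map back to floating point
    return dequantized


weights = [0.5, -1.27, 0.003, 1.0]
dequant = fake_quantize(weights)

# The rounding error per value is bounded by half a quantization step.
max_err = max(abs(w, ) if False else abs(w - d) for w, d in zip(weights, dequant))
print(f"max abs error: {max_err:.6f}")
```

Lowering `num_bits` coarsens the grid and increases the error, which is the accuracy loss that QAT fine-tuning aims to recover.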