### Install Required Packages (Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Installs the necessary Python packages listed in the `requirements.txt` file located in the current directory. This ensures all dependencies for running the example scripts are met.

```bash
pip install -r requirements.txt
```

--------------------------------

### Navigate to Examples Directory (Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Changes the current directory to the 'examples' subdirectory. This is a prerequisite step to run the example scripts provided for the Llama3.2 Multimodal implementation.

```bash
cd examples
```

--------------------------------

### Running Inference Demo Process 1 (TP=64, 32 Local Ranks, Node 1) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

This bash command launches the first inference process for a two-node setup with a total tensor parallelism of 64. It starts at rank ID 0 (`--start_rank_id 0`) with 32 local ranks (`--local_ranks 32`) for the first node and specifies common Neuron Runtime environment variables and model inference parameters for a Llama model, directing output to a node-specific log file (`rank_0.log`).
```bash
NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path TinyLLama-v0 \
    --compiled-model-path traced_models/TinyLLama-v0-multi-node_0/ \
    --torch-dtype bfloat16 \
    --start_rank_id 0 \
    --local_ranks 32 \
    --tp-degree 64 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" 2>&1 | tee rank_0.log
```

--------------------------------

### Running Inference Demo Process 2 (TP=64, 32 Local Ranks, Node 2) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

This bash command launches the second inference process for a two-node setup with a total tensor parallelism of 64. It starts at rank ID 32 (`--start_rank_id 32`) with 32 local ranks (`--local_ranks 32`) for the second node, continuing the rank numbering from the first. It specifies the same model and inference parameters as the first process for the two-node example and directs output to a node-specific log file (`rank_1.log`).
```bash
NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path TinyLLama-v0 \
    --compiled-model-path traced_models/TinyLLama-v0-multi-node_1/ \
    --torch-dtype bfloat16 \
    --start_rank_id 32 \
    --local_ranks 32 \
    --tp-degree 64 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" 2>&1 | tee rank_1.log
```

--------------------------------

### Running Inference Demo Process 2 (TP=32, 16 Local Ranks) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

This bash command launches the second of two inference processes on the same single Trn1 node, configured for a total tensor parallelism of 32. It uses Neuron cores 16-31 (`NEURON_RT_VISIBLE_CORES=16-31`), starts at rank ID 16 (`--start_rank_id 16`) with 16 local ranks (`--local_ranks_size 16`), and specifies the same model and inference parameters as the first process, redirecting output to a log file.
```bash
NEURON_RT_VISIBLE_CORES=16-31 NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path TinyLLama-v0 \
    --compiled-model-path traced_models/TinyLLama-v0-multi-node-1/ \
    --torch-dtype bfloat16 \
    --start_rank_id 16 \
    --local_ranks_size 16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" 2>&1 | tee log
```

--------------------------------

### Install NeuronX Distributed Inference for Testing

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Installs the `neuronx-distributed-inference` package in editable mode (`-e`) along with its testing dependencies (`.[test]`). This is the first step required to run or write tests for the package on a Neuron device.

```Bash
pip install -e .[test]
```

--------------------------------

### Running Inference Demo Process 1 (TP=32, 16 Local Ranks) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

This bash command launches the first of two inference processes on a single Trn1 node configured for a total tensor parallelism of 32 across two processes. It uses Neuron cores 0-15 (`NEURON_RT_VISIBLE_CORES=0-15`), starts at rank ID 0 (`--start_rank_id 0`) with 16 local ranks (`--local_ranks_size 16`), and specifies model and inference parameters for a Llama model, redirecting output to a log file.
```bash
MASTER_PORT=65111 NEURON_RT_VISIBLE_CORES=0-15 NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path TinyLLama-v0 \
    --compiled-model-path traced_models/TinyLLama-v0-multi-node-0/ \
    --torch-dtype bfloat16 \
    --start_rank_id 0 \
    --local_ranks_size 16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" 2>&1 | tee log
```

--------------------------------

### Run Llama Inference with Quantization - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This command illustrates running a quantized Llama model using the inference demo. It specifies the compiled model path, enables the `--quantized` flag, and provides the path to the quantized model checkpoints and the quantization type.

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --quantized \
    --quantized-checkpoints-path /home/ubuntu/model_hf/Llama-2-7b/model_quant.pt \
    --quantization-type per_channel_symmetric \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```

--------------------------------

### Run Llama Inference with Speculation - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This example demonstrates running Llama inference with speculative decoding enabled. It specifies both the main model and a smaller draft model, along with their compiled paths.
It also includes token matching accuracy checks and benchmarking.

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/open_llama_7b/ \
    --compiled-model-path /home/ubuntu/traced_model/open_llama_7b/ \
    --draft-model-path /home/ubuntu/model_hf/open_llama_3b/ \
    --compiled-draft-model-path /home/ubuntu/traced_model/open_llama_3b/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 1 \
    --max-context-length 32 \
    --seq-len 64 \
    --enable-bucketing \
    --speculation-length 5 \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --check-accuracy-mode token-matching \
    --benchmark
```

--------------------------------

### Run Llama Inference with Token Matching - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This command executes the inference demo script for a Llama model. It sets up tensor parallelism, batch size, sequence lengths, and enables on-device sampling and bucketing. It performs an accuracy check by matching generated tokens against a reference output and includes a benchmarking flag.

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/Llama-3.1-8B-Instruct/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-3.1-8B-Instruct/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode token-matching \
    --benchmark
```

--------------------------------

### Run Llama Inference with Custom Logit Matching Tolerances - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This example shows how to perform logit matching accuracy checks with custom divergence difference and absolute/relative tolerances.
It uses `--divergence-difference-tol` and `--tol-map` to specify the acceptable error bounds.

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --check-accuracy-mode logit-matching \
    --divergence-difference-tol 0.005 \
    --tol-map "{5: (1e-5, 0.02)}" \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```

--------------------------------

### Run DBRX Inference with Logit Matching - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This command runs inference for a DBRX model using the demo script. It configures tensor parallelism, batch size, and sequence lengths. It performs an accuracy check by comparing logits (unnormalized output scores) with a reference and enables bucketing.

```bash
inference_demo \
    --model-type dbrx \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/dbrx-1layer/ \
    --compiled-model-path /home/ubuntu/traced_model/dbrx-1layer-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 1024 \
    --seq-len 1152 \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 0 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode logit-matching
```

--------------------------------

### Running Llama Causal LM Inference with NeuronX Distributed (Bash)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/slora.md

This snippet provides a command-line example for running a distributed inference demo using the `inference_demo` tool.
It's configured for a Llama model performing causal language modeling, with tensor parallelism (`--tp-degree 32`), a batch size of 2 (`--batch-size 2`), on-device sampling, bucketing, and LoRA. It specifies the model and compiled model paths, the data type (`bfloat16`), the sequence lengths, and two sample prompts.

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /dev/shm/tiny_llama \
    --compiled-model-path /dev/shm/traced_model/tiny_llama/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 16 \
    --seq-len 16 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --enable-lora \
    --max-loras 2 \
    --max-lora-rank 16 \
    --target-modules embed_tokens q_proj k_proj v_proj o_proj up_proj down_proj
```

--------------------------------

### Validate Module Accuracy on Neuron vs CPU

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Provides an example PyTest test case (`test_validate_accuracy_basic_module`) demonstrating how to use `build_module` to compile a PyTorch module (`ExampleModule`) for Neuron and then `validate_accuracy` to compare its output against the same module run on CPU. It showcases how to initialize the module differently for Neuron (with distribution) and CPU.

```Python
# Import paths are assumed; the original snippet omits them.
import torch
from neuronx_distributed.parallel_layers.layers import ColumnParallelLinear
from neuronx_distributed_inference.utils.testing import build_module, validate_accuracy

SAMPLE_SIZE = 4  # Illustrative value; not defined in the original snippet.


# Module to test.
class ExampleModule(torch.nn.Module):
    def __init__(self, distributed):
        super().__init__()
        if distributed:
            # Sharded linear layer used when compiling for Neuron.
            self.linear = ColumnParallelLinear(
                input_size=SAMPLE_SIZE,
                output_size=SAMPLE_SIZE,
                bias=False,
                dtype=torch.float32,
            )
        else:
            # Plain linear layer used as the CPU reference.
            self.linear = torch.nn.Linear(
                in_features=SAMPLE_SIZE,
                out_features=SAMPLE_SIZE,
                bias=False,
                dtype=torch.float32,
            )

    def forward(self, x):
        return self.linear(x)


def test_validate_accuracy_basic_module():
    inputs = [(torch.arange(0, SAMPLE_SIZE, dtype=torch.float32),)]
    example_inputs = [(torch.zeros((SAMPLE_SIZE), dtype=torch.float32),)]

    module_cpu = ExampleModule(distributed=False)
    neuron_model = build_module(ExampleModule, example_inputs, module_init_kwargs={"distributed": True})

    validate_accuracy(neuron_model, inputs, cpu_callable=module_cpu)
```

--------------------------------

### Validate Function Accuracy on Neuron vs CPU

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Provides an example PyTest test case (`test_validate_accuracy_basic_function`) demonstrating how to use `build_function` to compile a Python function (`example_sum`) for Neuron and then `validate_accuracy` to compare its output against the same function run on CPU. This is a simpler case compared to module testing, focusing purely on a tensor-in/tensor-out function.
```Python
# Import path is assumed to match the `utils.testing` module named in this document.
import torch
from neuronx_distributed_inference.utils.testing import build_function, validate_accuracy


def example_sum(tensor):
    return torch.sum(tensor)


def test_validate_accuracy_basic_function():
    inputs = [(torch.tensor([1, 2, 3], dtype=torch.float32),)]
    example_inputs = [(torch.zeros((3), dtype=torch.float32),)]

    neuron_model = build_function(example_sum, example_inputs)

    validate_accuracy(neuron_model, inputs, cpu_callable=example_sum)
```

--------------------------------

### Run Inference and Capture Output (Python/Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Executes the main Python script for compiling and running Llama3.2 Multimodal inference on Neuron devices. The output (including standard output and standard error) is piped to the `tee` command, saving it to `out.txt` while also displaying it on the console.

```bash
python generation_mllama.py |&tee out.txt
```

--------------------------------

### Build and Run Function on Neuron using build_function with partial

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Demonstrates how to use `neuronx_distributed_inference.utils.testing.build_function` to compile a Python function into a Neuron model and run it. It shows how to handle functions with non-tensor arguments (like `k` and `dim` in `top_k`) by using `functools.partial` to bind non-tensor arguments before passing the function to `build_function`. Requires `torch` and `functools.partial`.
```Python
from functools import partial

import torch
from neuronx_distributed_inference.utils.testing import build_function


def top_k(input: torch.Tensor, k: int, dim: int):
    return torch.topk(input, k, dim)


# Bind the non-tensor arguments by keyword so that `input` stays free.
top_k_partial = partial(top_k, k=1, dim=0)

# Each example input is a tuple of positional tensor arguments.
model = build_function(top_k_partial, example_inputs=[(torch.rand(4),)])
output = model(torch.rand(4))
```

--------------------------------

### Convert PyTorch Checkpoint to Neuron Format (Python/Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Executes a Python script to convert a Llama3.2 Multimodal model checkpoint from Meta's PyTorch format to the Neuron format. Requires specifying the input directory of the PyTorch checkpoint and the desired output directory for the Neuron checkpoint.

```bash
python checkpoint_conversion_utils/convert_mllama_weights_to_neuron.py --input-dir --output-dir --instruct
```

--------------------------------

### Run NeuronX Distributed Inference Integration Tests

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Executes the integration tests located in the `test/integration/` directory using pytest. These tests validate end-to-end model functionality on a Neuron device, often using generated weights without external network dependencies. The `--forked` flag helps isolate test runs.

```Bash
pytest test/integration/ --forked
```

--------------------------------

### Run NeuronX Distributed Inference Unit Tests

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Executes the unit tests located in the `test/unit/` directory using pytest. The `--forked` flag is often used in testing environments to isolate tests and prevent state leakage between them, especially important on Neuron devices.

```Bash
pytest test/unit/ --forked
```
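
--------------------------------

### Sketch: Rank Layout for Multi-Process Runs (Python)

The multi-process `inference_demo` commands in this document give each process a contiguous block of ranks: `--start_rank_id` is the first rank in the block, and the local rank count is the block size, so the blocks together cover the tensor-parallel degree. The following is a minimal sketch of that arithmetic, assuming equal-sized blocks. The helper `rank_layout` is hypothetical and is not part of `neuronx-distributed-inference`.

```python
from typing import List, Tuple


def rank_layout(tp_degree: int, num_processes: int) -> List[Tuple[int, int]]:
    """Return (start_rank_id, local_rank_count) for each process.

    Hypothetical helper: assumes every process owns an equal,
    contiguous block of ranks, as in the examples in this document.
    """
    if tp_degree <= 0 or num_processes <= 0:
        raise ValueError("tp_degree and num_processes must be positive")
    if tp_degree % num_processes != 0:
        raise ValueError("tp_degree must be a multiple of num_processes")
    local = tp_degree // num_processes
    # Process i starts right after the last rank of process i - 1.
    return [(i * local, local) for i in range(num_processes)]


if __name__ == "__main__":
    # Two nodes sharing TP=64, then two processes sharing TP=32 on one node.
    print(rank_layout(64, 2))  # [(0, 32), (32, 32)]
    print(rank_layout(32, 2))  # [(0, 16), (16, 16)]
```

The two printed layouts match the start rank and local rank values used in the TP=64 and TP=32 commands.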