### Install Required Packages (Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Installs the necessary Python packages listed in the `requirements.txt` file located in the current directory. This ensures all dependencies for running the example scripts are met.

```bash
pip install -r requirements.txt
```

--------------------------------

### Navigate to Examples Directory (Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Changes the current directory to the `examples` subdirectory. Run this step before launching the example scripts provided for the Llama3.2 Multimodal implementation.

```bash
cd examples
```

--------------------------------

### Running Inference Demo Process 1 (TP=64, 32 Local Ranks, Node 1) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

Launches the first of two inference processes in a two-node setup with a total tensor parallelism of 64 (`--tp-degree 64`). This process runs on the first node and owns ranks 0 through 31: it starts at rank ID 0 (`--start_rank_id 0`) and runs 32 local ranks (`--local_ranks 32`). The command sets the Neuron Runtime environment variables `NEURON_CPP_LOG_LEVEL` and `NEURON_RT_ROOT_COMM_ID`, passes the model and inference parameters for a Llama model, and uses `tee` to copy output to a node-specific log file (`rank_0.log`).

```bash
NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
              --model-type llama \
              --task-type causal-lm \
              run \
                --model-path TinyLLama-v0 \
                --compiled-model-path  traced_models/TinyLLama-v0-multi-node_0/ \
                --torch-dtype bfloat16 \
                --start_rank_id 0 \
                --local_ranks 32 \
                --tp-degree 64 \
                --batch-size 2 \
                --max-context-length 32 \
                --seq-len 64 \
                --on-device-sampling \
                --enable-bucketing \
                --top-k 1 \
                --do-sample \
                --pad-token-id 2 \
                --prompt "I believe the meaning of life is" \
                --prompt "The color of the sky is" 2>&1 | tee rank_0.log
```

--------------------------------

### Running Inference Demo Process 2 (TP=64, 32 Local Ranks, Node 2) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

Launches the second of two inference processes in a two-node setup with a total tensor parallelism of 64 (`--tp-degree 64`). This process runs on the second node and owns ranks 32 through 63: it starts at rank ID 32 (`--start_rank_id 32`) and runs 32 local ranks (`--local_ranks 32`), continuing the numbering from the first node. The `NEURON_RT_ROOT_COMM_ID` value and the model and inference parameters match the first process; only the compiled model path and the log file (`rank_1.log`) differ.

```bash
NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
              --model-type llama \
              --task-type causal-lm \
              run \
                --model-path TinyLLama-v0 \
                --compiled-model-path  traced_models/TinyLLama-v0-multi-node_1/ \
                --torch-dtype bfloat16 \
                --start_rank_id 32 \
                --local_ranks 32 \
                --tp-degree 64 \
                --batch-size 2 \
                --max-context-length 32 \
                --seq-len 64 \
                --on-device-sampling \
                --enable-bucketing \
                --top-k 1 \
                --do-sample \
                --pad-token-id 2 \
                --prompt "I believe the meaning of life is" \
                --prompt "The color of the sky is" 2>&1 | tee rank_1.log
```
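The rank arithmetic behind the two-node commands can be sketched in a few lines of Python. The helper `rank_layout` is hypothetical and not part of `inference_demo`; it only assumes that every process owns an equally sized, contiguous block of ranks and that the blocks together cover the tensor-parallel degree.

```python
def rank_layout(tp_degree: int, local_ranks: int) -> list[tuple[int, int]]:
    """Return (start_rank_id, local_ranks) for each process.

    Hypothetical helper for illustration only: assumes equally sized,
    contiguous rank blocks that tile the tensor-parallel degree.
    """
    if local_ranks <= 0 or tp_degree % local_ranks != 0:
        raise ValueError("tp_degree must be a positive multiple of local_ranks")
    return [(start, local_ranks) for start in range(0, tp_degree, local_ranks)]


# Two nodes with 32 ranks each and a tensor parallelism of 64.
for start, size in rank_layout(tp_degree=64, local_ranks=32):
    print(f"--start_rank_id {start} --local_ranks {size}")
```

Each printed line corresponds to the rank flags of one process; the remaining flags are identical across processes.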

--------------------------------

### Running Inference Demo Process 2 (TP=32, 16 Local Ranks) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

Launches the second of two inference processes on a single Trn1 node, configured for a total tensor parallelism of 32. It is restricted to Neuron cores 16-31 (`NEURON_RT_VISIBLE_CORES=16-31`), starts at rank ID 16 (`--start_rank_id 16`) with 16 local ranks (`--local_ranks_size 16`), and uses the same model and inference parameters as the first process. Output is copied to a file named `log` through `tee`. Because the first process also writes to `log`, run the two commands from different directories or change one file name to keep their output separate.

```bash
NEURON_RT_VISIBLE_CORES=16-31 NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
       --model-type llama \
       --task-type causal-lm \
        run \
         --model-path TinyLLama-v0 \
         --compiled-model-path traced_models/TinyLLama-v0-multi-node-1/ \
         --torch-dtype bfloat16 \
         --start_rank_id 16 \
         --local_ranks_size 16 \
         --tp-degree 32 \
         --batch-size 2 \
         --max-context-length 32 \
         --seq-len 64 \
         --on-device-sampling \
         --enable-bucketing \
         --top-k 1 \
         --do-sample \
         --pad-token-id 2 \
         --prompt "I believe the meaning of life is" \
         --prompt "The color of the sky is" 2>&1 | tee log
```

--------------------------------

### Install NeuronX Distributed Inference for Testing

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Installs the `neuronx-distributed-inference` package in editable mode (`-e`) along with its testing dependencies (`.[test]`). This is the first step required to run or write tests for the package on a Neuron device.

```Bash
pip install -e ".[test]"
```

--------------------------------

### Running Inference Demo Process 1 (TP=32, 16 Local Ranks) - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/multi_node.md

Launches the first of two inference processes on a single Trn1 node, configured for a total tensor parallelism of 32 across the two processes. It sets `MASTER_PORT=65111`, is restricted to Neuron cores 0-15 (`NEURON_RT_VISIBLE_CORES=0-15`), starts at rank ID 0 (`--start_rank_id 0`) with 16 local ranks (`--local_ranks_size 16`), and passes the model and inference parameters for a Llama model. Output is copied to a file named `log` through `tee`.

```bash
MASTER_PORT=65111  NEURON_RT_VISIBLE_CORES=0-15  NEURON_CPP_LOG_LEVEL=1 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 inference_demo \
       --model-type llama \
       --task-type causal-lm \
        run \
         --model-path TinyLLama-v0 \
         --compiled-model-path traced_models/TinyLLama-v0-multi-node-0/ \
         --torch-dtype bfloat16 \
         --start_rank_id 0 \
         --local_ranks_size 16 \
         --tp-degree 32 \
         --batch-size 2 \
         --max-context-length 32 \
         --seq-len 64 \
         --on-device-sampling \
         --enable-bucketing \
         --top-k 1 \
         --do-sample \
         --pad-token-id 2 \
         --prompt "I believe the meaning of life is" \
         --prompt "The color of the sky is" 2>&1 | tee log
```
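The core and rank split for processes that share one node can be sketched as follows. The helper `per_process_settings` is hypothetical, and the sketch assumes that ranks map one-to-one onto consecutive Neuron cores, which matches the single-node commands in this document but is not stated in the source.

```python
def per_process_settings(tp_degree: int, num_processes: int) -> list[dict]:
    """Derive the per-process core range and rank flags on a single node.

    Hypothetical helper: assumes rank i runs on Neuron core i and that
    the ranks are split evenly across the processes.
    """
    if num_processes <= 0 or tp_degree % num_processes != 0:
        raise ValueError("tp_degree must be a positive multiple of num_processes")
    size = tp_degree // num_processes
    settings = []
    for index in range(num_processes):
        start = index * size
        settings.append(
            {
                "NEURON_RT_VISIBLE_CORES": f"{start}-{start + size - 1}",
                "start_rank_id": start,
                "local_ranks_size": size,
            }
        )
    return settings


for entry in per_process_settings(tp_degree=32, num_processes=2):
    print(entry)
```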

--------------------------------

### Run Llama Inference with Quantization - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This command illustrates running a quantized Llama model using the inference demo. It specifies the compiled model path, enables the `--quantized` flag, and provides the path to the quantized model checkpoints and the quantization type.

```bash
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --quantized \
    --quantized-checkpoints-path /home/ubuntu/model_hf/Llama-2-7b/model_quant.pt \
    --quantization-type per_channel_symmetric \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```
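For readers unfamiliar with the term, the following is a minimal pure-Python sketch of what per-channel symmetric quantization generally means: each output channel (here, each row) gets its own scale, and the zero point is fixed at 0. The function names are hypothetical, and whether the library computes its scales exactly this way is an assumption.

```python
def quantize_per_channel_symmetric(weight_rows, num_bits=8):
    """Quantize each row with its own symmetric scale (zero point is 0)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8-bit integers
    quantized, scales = [], []
    for row in weight_rows:
        max_abs = max(abs(value) for value in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric range.
        quantized.append([max(-qmax, min(qmax, round(value / scale))) for value in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights from integers and per-row scales."""
    return [[q * scale for q in row] for row, scale in zip(quantized, scales)]


weights = [[0.5, -1.0, 0.25], [10.0, 5.0, -2.5]]
q, s = quantize_per_channel_symmetric(weights)
print(q)
print(dequantize(q, s))
```

Using one scale per row keeps a row with small weights from being crushed by a row with large weights, which is the usual motivation for per-channel over per-tensor scaling.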

--------------------------------

### Run Llama Inference with Speculation - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This example demonstrates running Llama inference with speculative decoding enabled. It specifies both the main model and a smaller draft model, along with their compiled paths. It also includes token matching accuracy checks and benchmarking.

```bash
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/open_llama_7b/ \
    --compiled-model-path /home/ubuntu/traced_model/open_llama_7b/ \
    --draft-model-path /home/ubuntu/model_hf/open_llama_3b/ \
    --compiled-draft-model-path /home/ubuntu/traced_model/open_llama_3b/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 1 \
    --max-context-length 32 \
    --seq-len 64 \
    --enable-bucketing \
    --speculation-length 5 \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --check-accuracy-mode token-matching \
    --benchmark
```
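The idea behind speculative decoding with greedy sampling can be illustrated with a toy sketch. The function `speculative_step` and the two toy "models" below are hypothetical and do not reflect the library's implementation; a real target model would score all proposed positions in a single forward pass rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, speculation_length: int) -> List[int]:
    """Run one greedy speculation round and return the accepted tokens."""
    # The draft model proposes tokens autoregressively.
    proposed: List[int] = []
    for _ in range(speculation_length):
        proposed.append(draft_next(prefix + proposed))

    # The target model verifies the proposals in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every proposal matched, so the target adds one more token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft(seq: List[int]) -> int:
    # Agrees with the target except after token 3.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([0], draft, target, speculation_length=5))  # [1, 2, 3, 4]
```

The accepted tokens always equal what the target model alone would have produced greedily, so the draft model affects speed but not output.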

--------------------------------

### Run Llama Inference with Token Matching - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This command executes the inference demo script for a Llama model. It sets up tensor parallelism, batch size, sequence lengths, and enables on-device sampling and bucketing. It performs an accuracy check by matching generated tokens against a reference output and includes a benchmarking flag.

```bash
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/Llama-3.1-8B-Instruct/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-3.1-8B-Instruct/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode token-matching \
    --benchmark
```
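Conceptually, a token-matching check compares the generated token IDs with a reference sequence position by position. The sketch below is a hypothetical illustration of that idea, not the library's implementation; `first_divergence` is an invented name.

```python
from typing import List, Optional


def first_divergence(generated: List[int], expected: List[int]) -> Optional[int]:
    """Return the index of the first mismatching token, or None if equal."""
    for index, (got, want) in enumerate(zip(generated, expected)):
        if got != want:
            return index
    if len(generated) != len(expected):
        # One sequence is a strict prefix of the other.
        return min(len(generated), len(expected))
    return None


reference = [1, 15, 42, 7, 2]
print(first_divergence([1, 15, 42, 7, 2], reference))  # None
print(first_divergence([1, 15, 40, 7, 2], reference))  # 2
```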

--------------------------------

### Run Llama Inference with Custom Logit Matching Tolerances - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This example shows how to run the logit-matching accuracy check with custom tolerances. `--divergence-difference-tol` sets the divergence difference tolerance, and `--tol-map` supplies absolute/relative tolerance pairs keyed by top-k value rather than by layer; in the example, the key `5` maps to the pair `(1e-5, 0.02)`.

```bash
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --check-accuracy-mode logit-matching \
    --divergence-difference-tol 0.005 \
    --tol-map "{5: (1e-5, 0.02)}" \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```
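An absolute/relative tolerance pair such as `(1e-5, 0.02)` is commonly applied with the rule used by `numpy.allclose`. The sketch below shows that rule in plain Python; the assumption that the pair is ordered as (absolute, relative), and that the accuracy check applies exactly this formula, is not confirmed by the source. The function name is hypothetical.

```python
from typing import Sequence


def within_tolerance(actual: Sequence[float], expected: Sequence[float],
                     atol: float, rtol: float) -> bool:
    """Check |actual - expected| <= atol + rtol * |expected| element-wise."""
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


expected_logits = [2.0, -1.0]
print(within_tolerance([2.03, -1.01], expected_logits, atol=1e-5, rtol=0.02))  # True
print(within_tolerance([2.05, -1.00], expected_logits, atol=1e-5, rtol=0.02))  # False
```

The relative term scales the allowed error with the magnitude of the reference value, while the absolute term keeps the check meaningful for values near zero.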

--------------------------------

### Run DBRX Inference with Logit Matching - Bash

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/README.md

This command runs inference for a DBRX model using the demo script. It configures tensor parallelism, batch size, and sequence lengths, and enables bucketing. It performs an accuracy check by comparing logits (the unnormalized scores the model produces before softmax) with a reference.

```bash
inference_demo \
  --model-type dbrx \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/dbrx-1layer/ \
    --compiled-model-path /home/ubuntu/traced_model/dbrx-1layer-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 1024 \
    --seq-len 1152 \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 0 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode logit-matching
```

--------------------------------

### Running Llama Causal LM Inference with NeuronX Distributed (Bash)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/examples/slora.md

This snippet provides a command-line example for running a distributed inference demo using the `inference_demo` tool. It's configured for a Llama model performing causal language modeling with specific settings like tensor parallelism (`--tp-degree 32`), batch size (`--batch-size 2`), and features like on-device sampling, bucketing, and LoRA. It specifies model and compiled model paths, data type (`bfloat16`), sequence lengths, and includes sample prompts.

```bash
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /dev/shm/tiny_llama \
    --compiled-model-path /dev/shm/traced_model/tiny_llama/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 16 \
    --seq-len 16 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --enable-lora \
    --max-loras 2 \
    --max-lora-rank 16 \
    --target-modules embed_tokens q_proj k_proj v_proj o_proj up_proj down_proj \
```

--------------------------------

### Validate Module Accuracy on Neuron vs CPU

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Provides an example pytest test case (`test_validate_accuracy_basic_module`) demonstrating how to use `build_module` to compile a PyTorch module (`ExampleModule`) for Neuron and then `validate_accuracy` to compare its output against the same module run on CPU. The module is initialized differently for each target: with a `ColumnParallelLinear` layer on Neuron (`distributed=True`) and a plain `torch.nn.Linear` layer on CPU.

```Python
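import torch
from neuronx_distributed.parallel_layers.layers import ColumnParallelLinear
from neuronx_distributed_inference.utils.testing import build_module, validate_accuracy

SAMPLE_SIZE = 4  # Assumed value for illustration; the source snippet does not define it.


# Module to test.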
class ExampleModule(torch.nn.Module):
    def __init__(self, distributed):
        super().__init__()
        if distributed:
            self.linear = ColumnParallelLinear(
                input_size=SAMPLE_SIZE,
                output_size=SAMPLE_SIZE,
                bias=False,
                dtype=torch.float32,
            )
        else:
            self.linear = torch.nn.Linear(
                in_features=SAMPLE_SIZE,
                out_features=SAMPLE_SIZE,
                bias=False,
                dtype=torch.float32,
            )

    def forward(self, x):
        return self.linear(x)


def test_validate_accuracy_basic_module():
    inputs = [(torch.arange(0, SAMPLE_SIZE, dtype=torch.float32),)]
    example_inputs = [(torch.zeros((SAMPLE_SIZE), dtype=torch.float32),)]

    module_cpu = ExampleModule(distributed=False)
    neuron_model = build_module(ExampleModule, example_inputs, module_init_kwargs={"distributed": True})

    validate_accuracy(neuron_model, inputs, cpu_callable=module_cpu)
```

--------------------------------

### Validate Function Accuracy on Neuron vs CPU

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Provides an example pytest test case (`test_validate_accuracy_basic_function`) demonstrating how to use `build_function` to compile a Python function (`example_sum`) for Neuron and then `validate_accuracy` to compare its output against the same function run on CPU. This case is simpler than module testing because it covers only a tensor-in/tensor-out function.

```Python
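import torch
from neuronx_distributed_inference.utils.testing import build_function, validate_accuracy


def example_sum(tensor):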
    return torch.sum(tensor)

def test_validate_accuracy_basic_function():
    inputs = [(torch.tensor([1, 2, 3], dtype=torch.float32),)]
    example_inputs = [(torch.zeros((3), dtype=torch.float32),)]

    neuron_model = build_function(example_sum, example_inputs)
    validate_accuracy(neuron_model, inputs, cpu_callable=example_sum)
```

--------------------------------

### Run Inference and Capture Output (Python/Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Executes the main Python script for compiling and running Llama3.2 Multimodal inference on Neuron devices. The output (including standard output and standard error) is piped to the `tee` command, saving it to `out.txt` while also displaying it on the console.

```bash
python generation_mllama.py 2>&1 | tee out.txt
```

--------------------------------

### Build and Run Function on Neuron using build_function with partial

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Demonstrates how to use `neuronx_distributed_inference.utils.testing.build_function` to compile a Python function into a Neuron model and run it. It shows how to handle functions with non-tensor arguments (like `k` and `dim` in `top_k`) by using `functools.partial` to bind non-tensor arguments before passing the function to `build_function`. Requires `torch` and `functools.partial`.

```Python
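from functools import partial

import torch
from neuronx_distributed_inference.utils.testing import build_function


def top_k(input: torch.Tensor, k: int, dim: int):
    return torch.topk(input, k, dim)

# Bind the non-tensor arguments by keyword so `input` stays the only free parameter.
top_k_partial = partial(top_k, k=1, dim=0)
model = build_function(top_k_partial, example_inputs=[(torch.rand(4),)])
output = model(torch.rand(4))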
```
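The binding behavior of `functools.partial` can be checked without any Neuron dependencies. The sketch below uses a hypothetical stand-in function, `scale_and_shift`, with one data argument followed by two configuration arguments. Note that positional arguments given to `partial` fill parameters from left to right, so non-leading parameters need to be bound by keyword.

```python
from functools import partial


def scale_and_shift(values, factor, offset):
    """Stand-in for a function with one data argument and two settings."""
    return [value * factor + offset for value in values]


# Keyword binding fixes `factor` and `offset`; `values` stays free.
bound = partial(scale_and_shift, factor=2, offset=1)
print(bound([1, 2, 3]))  # [3, 5, 7]

# Positional binding fills parameters left to right, so here `values`
# becomes 2 and `factor` becomes 1, which is not what was intended.
misbound = partial(scale_and_shift, 2, 1)
try:
    misbound([1, 2, 3])
except TypeError as error:
    print(f"positional binding failed: {error}")
```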

--------------------------------

### Convert PyTorch Checkpoint to Neuron Format (Python/Shell)

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/src/neuronx_distributed_inference/models/mllama/README.md

Executes a Python script to convert a Llama3.2 Multimodal model checkpoint from Meta's PyTorch format to the Neuron format. Requires specifying the input directory of the PyTorch checkpoint and the desired output directory for the Neuron checkpoint.

```bash
python checkpoint_conversion_utils/convert_mllama_weights_to_neuron.py --input-dir <path_to_meta_pytorch_checkpoint> --output-dir <path_to_neuron_checkpoint> --instruct
```

--------------------------------

### Run NeuronX Distributed Inference Integration Tests

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Executes the integration tests located in the `test/integration/` directory using pytest. These tests validate end-to-end model functionality on a Neuron device, often using generated weights without external network dependencies. The `--forked` flag, provided by the pytest-forked plugin, runs each test in its own forked subprocess to isolate test runs.

```Bash
pytest test/integration/ --forked
```

--------------------------------

### Run NeuronX Distributed Inference Unit Tests

Source: https://github.com/aws-neuron/neuronx-distributed-inference/blob/main/test/README.md

Executes the unit tests located in the `test/unit/` directory using pytest. The `--forked` flag, provided by the pytest-forked plugin, runs each test in its own forked subprocess, which prevents state from leaking between tests that share a Neuron device.

```Bash
pytest test/unit/ --forked
```
