### Complete NeuronX Distributed Utility Example

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/utilities-api.md

This example demonstrates a full setup using various NeuronX distributed utilities, including logging, random seed initialization, model creation, activation checkpointing, device placement, and checkpointing with CPU tensors.

```python
import torch
import torch_xla.core.xla_model as xm
from neuronx_distributed.utils.logger import get_logger
from neuronx_distributed.utils.model_utils import (
    get_model_sequential,
    init_on_device,
    is_hf_pretrained_model
)
from neuronx_distributed.utils.activation_checkpoint import apply_activation_checkpointing
from neuronx_distributed.parallel_layers.random import model_parallel_xla_manual_seed
from neuronx_distributed.parallel_layers.utils import move_all_tensor_to_cpu

# Setup logging and RNG
logger = get_logger(rank0_only=True)
model_parallel_xla_manual_seed(42)

# Create model
logger.info("Creating model...")
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

# Apply activation checkpointing
logger.info("Applying activation checkpointing...")
apply_activation_checkpointing(
    model,
    check_fn=lambda m: hasattr(m, 'forward') and 'transformer' in str(type(m))
)

# Move to device
logger.info("Moving model to device...")
device = xm.xla_device()
model = get_model_sequential(model, device, sequential_move_factor=12)

# Verify model is HF
logger.info(f"Is HF model: {is_hf_pretrained_model(model)}")

# Training loop
for step, batch in enumerate(train_loader):
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()

    if (step + 1) % 100 == 0:
        # Save checkpoint with CPU tensors
        state = move_all_tensor_to_cpu(model.state_dict())
        logger.info(f"Checkpointing at step {step+1}")
```

--------------------------------

### Full-Feature Configuration Example

Source:
https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

A comprehensive configuration combining LoRA, pipeline parallelism, ZeRO-1, mixed precision, and meta device settings.

```python
from neuronx_distributed import neuronx_distributed_config
from neuronx_distributed.modules.lora import LoraConfig

lora_config = LoraConfig(
    lora_rank=16,
    target_modules=["q_proj", "v_proj"],
    use_rslora=True
)

pipeline_config = {
    "num_microbatches": 8,
    "virtual_pipeline_size": 2,
    "input_names": ["input_ids", "attention_mask"],
    "output_loss_value_spec": (True, False)
}

nxd_config = neuronx_distributed_config(
    # Parallelism
    tensor_parallel_size=4,
    pipeline_parallel_size=4,
    context_parallel_size=1,

    # Pipeline config
    pipeline_config=pipeline_config,

    # Optimizer config
    optimizer_config={
        "zero_one_enabled": True,
        "grad_clipping": True,
        "max_grad_norm": 1.0
    },

    # Mixed precision
    mixed_precision_config={
        "use_master_weights": True,
        "use_fp32_grad_acc": True,
        "use_master_weights_in_ckpt": True
    },

    # Model init
    model_init_config={
        "sequential_move_factor": 12,
        "meta_device_init": False
    },

    # LoRA
    lora_config=lora_config,

    # Checkpointing
    activation_checkpoint_config="full",

    # Other
    pad_model=True,
    sequence_parallel=False
)
```

--------------------------------

### ZeRO-1 with Mixed Precision

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/optimizer-api.md

Example demonstrating how to configure and initialize the parallel optimizer with ZeRO-1 and mixed precision settings.

```APIDOC
## ZeRO-1 with Mixed Precision

### Description
This example shows how to set up the `neuronx_distributed_config` for ZeRO-1 optimization with mixed precision features like using master weights, FP32 gradient accumulation, and saving FP32 weights in checkpoints.

### Code
```python
import torch
from neuronx_distributed import neuronx_distributed_config, initialize_parallel_optimizer

# Create config with ZeRO-1 and mixed precision
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    optimizer_config={
        "zero_one_enabled": True,
        "grad_clipping": True,
        "max_grad_norm": 1.0
    },
    mixed_precision_config={
        "use_master_weights": True,         # FP32 master copy
        "use_fp32_grad_acc": True,          # FP32 gradient accumulation
        "use_master_weights_in_ckpt": True  # Save FP32 weights
    }
)

optimizer = initialize_parallel_optimizer(
    nxd_config,
    torch.optim.AdamW,
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01
)
```
```

--------------------------------

### Basic Pipeline Parallelism Setup

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/pipeline-parallelism-api.md

Configures and initializes a model for pipeline parallelism. Requires `neuronx_distributed` and `transformers` libraries.

```python
import torch
from neuronx_distributed import neuronx_distributed_config, initialize_parallel_model
from transformers import AutoModelForCausalLM

# Configure with pipeline parallelism
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=2,
    pipeline_parallel_size=4,
    pipeline_config={
        "num_microbatches": 4,
        "virtual_pipeline_size": 2,
        "output_loss_value_spec": (True, False),  # First output is loss
    }
)

def model_fn():
    return AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

nxd_model = initialize_parallel_model(nxd_config, model_fn)

# Forward pass (automatic pipeline scheduling)
input_ids = torch.randint(0, 32000, (batch_size, seq_len))
outputs = nxd_model(input_ids=input_ids)
loss = outputs[0]
```

--------------------------------

### Trace Model with Example Inputs

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trace-and-compile-api.md

Trace a PyTorch model using example inputs to capture its computation graph and parameters for compilation.
Supports multiple traces with different input shapes for dynamic batching by providing unique tags.

```python
import torch
from neuronx_distributed.trace import ModelBuilder

model = MyTransformer()
builder = ModelBuilder(model)

# Trace with different input shapes for dynamic batching
input_ids = torch.randint(0, 32000, (batch_size, seq_len))
builder.trace(
    kwargs={"input_ids": input_ids},
    tag="batch_1_seq_128"
)

builder.trace(
    kwargs={"input_ids": torch.randint(0, 32000, (batch_size, seq_len*2))},
    tag="batch_1_seq_256"
)
```

--------------------------------

### Checkpoint Save/Load with ZeRO-1

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/optimizer-api.md

Example illustrating how to save and load model and optimizer states with ZeRO-1 enabled.

```APIDOC
## Checkpoint Save/Load with ZeRO-1

### Description
This example demonstrates the usage of `save_checkpoint` and `load_checkpoint` functions when using the ZeRO-1 optimizer, ensuring that the sharded optimizer state is correctly saved and restored.

### Code
```python
import torch
from neuronx_distributed import save_checkpoint, load_checkpoint

# Save checkpoint with ZeRO-1 optimizer
save_checkpoint(
    checkpoint_dir_str="./checkpoints",
    tag="step_5000",
    model=model,
    optimizer=zero1_optimizer,
    num_kept_ckpts=3,
    zero1_optimizer=True  # Indicate ZeRO-1 sharded state
)

# Load checkpoint
model_state, optim_state, _, _ = load_checkpoint(
    checkpoint_dir_str="./checkpoints",
    model=model,
    optimizer=zero1_optimizer,
    zero1_optimizer=True
)

model.load_state_dict(model_state)
zero1_optimizer.load_state_dict(optim_state)
```
```

--------------------------------

### Basic ZeRO-1 Training Example

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/optimizer-api.md

Demonstrates how to initialize and use the NeuronZero1Optimizer within a typical training loop. It shows gradient clipping being automatically handled and how to access the computed gradient norm.

```python
import torch
from neuronx_distributed.optimizer import NeuronZero1Optimizer

# Create optimizer with ZeRO-1
zero1_optimizer = NeuronZero1Optimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    grad_clipping=True,
    max_grad_norm=1.0,
    lr=1e-4,
    weight_decay=0.01
)

# Training loop
for step, (input_ids, labels) in enumerate(train_loader):
    outputs = model(input_ids=input_ids, labels=labels)
    loss = outputs.loss

    loss.backward()

    # Gradient norm automatically clipped
    zero1_optimizer.step()
    zero1_optimizer.zero_grad()

    if step % 100 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}, Grad Norm: {zero1_optimizer.grad_norm}")
```

--------------------------------

### Run Tensor Capture Example on CPU

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/utils/tensor_capture/README.md

This command demonstrates the tensor capture functionality on a CPU environment. It's useful for initial debugging and verification before running on Neuron.

```bash
python examples/inference/tensor_capture/tensor_capture_example.py demo
```

--------------------------------

### Llama-2 LoRA Configuration Example

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/lora-and-modules-api.md

Demonstrates how to instantiate LoraConfig for a Llama-2 model, specifying target modules, alpha, dropout, and additional modules to save.

```python
from neuronx_distributed.modules.lora import LoraConfig, LoraModel

lora_config = LoraConfig(
    lora_rank=16,
    target_modules=["q_proj", "v_proj"],  # Adapt only query and value projections
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,
    init_lora_weights="gaussian",
    modules_to_save=["lm_head"]  # Also train the output head
)
```

--------------------------------

### Typical Import Pattern

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/quickstart-guide.md

Import necessary libraries for distributed training with neuronx-distributed.
Ensure AWS Neuron SDK, torch-neuronx, and torch-xla are installed.

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm

from neuronx_distributed import (
    neuronx_distributed_config,
    initialize_parallel_model,
    initialize_parallel_optimizer,
    save_checkpoint,
    load_checkpoint,
    has_checkpoint,
    parallel_layers,
)
from neuronx_distributed.modules.lora import LoraConfig, LoraModel
from neuronx_distributed.optimizer import NeuronZero1Optimizer
```

--------------------------------

### Install NeuronX Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Installs the neuronx-distributed package using pip. Ensure you have Python and pip installed.

```bash
pip install neuronx-distributed
```

--------------------------------

### Multi-Node Distributed Training Setup

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Configures environment variables for multi-node distributed training. Each node requires its rank and world size to be set, along with the master node's address and port. The training script is then launched using torch.distributed.launch.

```bash
# On each node, export the rank offset and the world size
# (num_nodes must already be set in the shell)
export RANK=$(($(hostname -s | grep -oE '[0-9]+$') * 8))  # Node rank * 8
export WORLD_SIZE=$((num_nodes * 8))
export MASTER_ADDR=
export MASTER_PORT=12355

# Run training
python -m torch.distributed.launch --nproc_per_node=8 train.py
```

--------------------------------

### Checkpointing

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Provides examples for saving and loading model and optimizer states using the `save_checkpoint` and `load_checkpoint` functions.

```APIDOC
## Checkpointing

### Description
Save and load model and optimizer states.

### Code
```python
from neuronx_distributed import save_checkpoint, load_checkpoint

# Save
save_checkpoint(
    "./checkpoints",
    f"step_{step}",
    model=nxd_model,
    optimizer=nxd_optimizer
)

# Load
model_state, optim_state, _, _ = load_checkpoint(
    "./checkpoints",
    model=nxd_model,
    optimizer=nxd_optimizer
)
```
```

--------------------------------

### Initialize Minimal NeuronX Distributed Config

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Initializes the NeuronX Distributed configuration using all default settings. This is suitable for single-device setups without explicit parallelism.

```python
from neuronx_distributed import neuronx_distributed_config

nxd_config = neuronx_distributed_config()
# Uses all defaults: TP=1, PP=1, no parallelism, no optimization
```

--------------------------------

### Typical Training Loop with NeuronX Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/optimizer-api.md

Demonstrates a standard training loop using NeuronX Distributed. It covers initialization of the distributed environment, model and optimizer setup with specific configurations for tensor parallelism, optimizer options, and mixed precision. The loop includes forward and backward passes, optimizer steps with automatic gradient clipping, and periodic checkpoint saving.

```python
import torch
from neuronx_distributed import (
    neuronx_distributed_config,
    initialize_parallel_model,
    initialize_parallel_optimizer,
    save_checkpoint
)

# Setup
torch.distributed.init_process_group("xla")

nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    optimizer_config={"zero_one_enabled": True, "grad_clipping": True},
    mixed_precision_config={"use_master_weights": True}
)

model = initialize_parallel_model(nxd_config, model_fn)
optimizer = initialize_parallel_optimizer(
    nxd_config,
    torch.optim.AdamW,
    model.parameters(),
    lr=1e-4
)

# Training
for step, batch in enumerate(train_loader):
    # Forward
    outputs = model(**batch)
    loss = outputs.loss

    # Backward
    loss.backward()

    # Optimizer step with automatic grad clipping
    optimizer.step()
    optimizer.zero_grad()

    # Checkpoint periodically
    if (step + 1) % 1000 == 0:
        save_checkpoint(
            "./ckpts",
            f"step_{step+1}",
            model=model,
            optimizer=optimizer,
            num_kept_ckpts=3
        )
```

--------------------------------

### Compile Model using Mock Distributed Environment

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/inference/llama/README.md

Compile models using a simulated distributed environment. This is useful for testing compilation without a full distributed setup.

```bash
python run.py compile_with_mock \
    --tp-degree 32 \
    --batch-size 2 \
    --seq-len 128 \
    --model-path ~/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
    --output-path ~/neuron_models/Llama3.2-1B-Instruct \
    --shard-on-load True
```

--------------------------------

### Inference with Compiled Models

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/quickstart-guide.md

Trace, compile, and run inference with a model using NeuronX Distributed. This example demonstrates compiling a PyTorch model to a Neuron-compatible format for efficient inference.

```python
from neuronx_distributed import load_checkpoint
from neuronx_distributed.trace import ModelBuilder
import torch

# Create model
model = MyModel()

# Trace and compile to Neuron
builder = ModelBuilder(model)

# Trace for different input shapes (optional)
input_tensor = torch.randint(0, 32000, (8, 128))
builder.trace(kwargs={"input_ids": input_tensor}, tag="seq_128")

# Compile to NEFF
nxd_model = builder.compile(
    priority_model_key="seq_128",
    compiler_args="-O2"
)

# Load weights from checkpoint
# (world_size is assumed to be defined by the caller)
checkpoint = load_checkpoint("./checkpoints", tag="final")
weights = [checkpoint["model"] for _ in range(world_size)]
nxd_model.set_weights(weights)

# Initialize on Neuron device
nxd_model.to_neuron()

# Run inference
input_ids = torch.randint(0, 32000, (8, 128))
output = nxd_model(input_ids)
```

--------------------------------

### Configure LoRA Fine-tuning with NeuronX Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/quickstart-guide.md

Configure LoRA settings and initialize a distributed model with NeuronX Distributed. This setup is suitable for efficient fine-tuning of large language models.

```python
from neuronx_distributed import (
    neuronx_distributed_config,
    initialize_parallel_model,
    initialize_parallel_optimizer,
    save_checkpoint,
)
from neuronx_distributed.modules.lora import LoraConfig
from transformers import AutoModelForCausalLM
import torch

# Configure LoRA
lora_config = LoraConfig(
    lora_rank=16,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True
)

# Create distributed config with LoRA
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    lora_config=lora_config,
    optimizer_config={"zero_one_enabled": True}
)

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

# Initialize with distributed training
nxd_model = initialize_parallel_model(
    nxd_config,
    lambda: model
)

# Check trainable parameter ratio
trainable, total = nxd_model.get_trainable_parameters()
print(f"Trainable parameters: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")

# Training proceeds normally - only LoRA params are updated
optimizer = initialize_parallel_optimizer(
    nxd_config,
    torch.optim.AdamW,
    nxd_model.parameters(),
    lr=5e-4
)

# After training, save LoRA adapter
save_checkpoint(
    "./checkpoints",
    "final",
    model=nxd_model,
    optimizer=optimizer
)
```

--------------------------------

### NeuronX Distributed Config - Full Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Initialize NeuronX Distributed configuration for a fully distributed setup, combining tensor and pipeline parallelism with advanced optimizer and mixed precision settings.

```python
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=8,
    pipeline_parallel_size=4,
    pipeline_config={"num_microbatches": 8, "virtual_pipeline_size": 2},
    optimizer_config={"zero_one_enabled": True},
    mixed_precision_config={"use_master_weights": True},
    activation_checkpoint_config="full"
)
```

--------------------------------

### Get Pipeline Model Method

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/pipeline-parallelism-api.md

Returns the entire model as a traced ScriptModule.

```python
def get_pp_model(self) -> torch.jit.ScriptModule
```

--------------------------------

### Model and Optimizer Initialization

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Demonstrates how to create a distributed configuration, initialize a parallel model, and set up a parallel optimizer using the NeuronX Distributed library.

```APIDOC
## Model and Optimizer Initialization

### Description
Initialize distributed configuration, parallel model, and parallel optimizer.

### Code
```python
import torch
from neuronx_distributed import (
    neuronx_distributed_config,
    initialize_parallel_model,
    initialize_parallel_optimizer,
)

# 1. Create distributed config
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    optimizer_config={"zero_one_enabled": True}
)

# 2. Initialize model
nxd_model = initialize_parallel_model(nxd_config, model_fn)

# 3. Initialize optimizer
nxd_optimizer = initialize_parallel_optimizer(
    nxd_config,
    torch.optim.AdamW,
    nxd_model.parameters(),
    lr=1e-4
)
```
```

--------------------------------

### Get Stage to Rank Mapping Method

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/pipeline-parallelism-api.md

Returns the mapping from pipeline stage to rank.

```python
def get_stage_to_rank_map(self) -> Dict[int, int]
```

--------------------------------

### Get Local Parameters Method

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/pipeline-parallelism-api.md

Retrieves the local parameters for the current pipeline parallelism rank.

```python
def local_parameters(self) -> Iterator[torch.nn.Parameter]
```

--------------------------------

### Configure Meta Device Initialization

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Configure meta device initialization with a custom parameter initialization function and a specified sequential move factor.

```python
import torch

def param_init_fn(module):
    """Initialize parameters for meta device initialization."""
    if hasattr(module, 'weight'):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    if hasattr(module, 'bias') and module.bias is not None:
        torch.nn.init.zeros_(module.bias)

nxd_config = neuronx_distributed_config(
    tensor_parallel_size=8,
    model_init_config={
        "meta_device_init": True,
        "param_init_fn": param_init_fn,
        "sequential_move_factor": 15  # Larger for very large models
    }
)
```

--------------------------------

### Configure Optimizer Settings

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Set up optimizer configurations, including ZeRO-1 state sharding and gradient clipping. This snippet shows how to enable and configure these features for stable and efficient training.

```python
optimizer_config = {
    # Enable ZeRO-1 optimizer state sharding
    "zero_one_enabled": False,  # bool

    # Gradient clipping to prevent gradient explosion
    "grad_clipping": True,  # bool

    # Maximum gradient norm (when grad_clipping=True)
    "max_grad_norm": 1.0,  # float
}

nxd_config = neuronx_distributed_config(
    optimizer_config=optimizer_config
)
```

--------------------------------

### Get Available NEFF Keys

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trace-and-compile-api.md

Retrieves a list of all unique identifiers (keys) for the NEFF artifacts that have been added to the model.

```python
def get_available_keys(self) -> List[str]
```

--------------------------------

### Get Local Named Parameters Method

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/pipeline-parallelism-api.md

Retrieves the local named parameters for the current pipeline parallelism rank.

```python
def local_named_parameters(self, *args, **kwargs) -> Iterator[Tuple[str, torch.nn.Parameter]]
```

--------------------------------

### Initialize Model and Optimizer with NeuronX Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Configure distributed settings, initialize the model, and set up the optimizer. Supports features like ZeRO-1 optimization.

```python
import torch
from neuronx_distributed import (
    neuronx_distributed_config,
    initialize_parallel_model,
    initialize_parallel_optimizer,
)

# 1. Create distributed config
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    optimizer_config={"zero_one_enabled": True}
)

# 2. Initialize model
nxd_model = initialize_parallel_model(nxd_config, model_fn)

# 3. Initialize optimizer
nxd_optimizer = initialize_parallel_optimizer(
    nxd_config,
    torch.optim.AdamW,
    nxd_model.parameters(),
    lr=1e-4
)
```

--------------------------------

### Configure 70B Parameters (Tensor + Pipeline)

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Recommended configuration for a 70B parameter model using tensor and pipeline parallelism. Enables the ZeRO-1 optimizer and master weights for mixed precision.

```python
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    optimizer_config={"zero_one_enabled": True},
    mixed_precision_config={"use_master_weights": True}
)
```

--------------------------------

### get_logger

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/utilities-api.md

Get a configured logger for distributed training. This function allows for selective logging, either from all ranks or only from rank 0.

```APIDOC
## get_logger

### Description
Get configured logger for distributed training.

### Method
```python
def get_logger(
    rank0_only: bool = False,
    name: str = "neuronx_distributed"
) -> logging.Logger
```

### Parameters
- **rank0_only** (bool) - Optional - Only log from rank 0
- **name** (str) - Optional - Logger name

### Return Value
Python logger configured for distributed training.

### Example
```python
from neuronx_distributed.utils.logger import get_logger

logger = get_logger(rank0_only=True)  # Only rank 0 logs
logger.info("Starting training")
```
```

--------------------------------

### Configure Training with Pipeline Parallelism

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Set up training with pipeline parallelism, specifying microbatching, virtual pipeline size, and input/output configurations.

```python
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    pipeline_parallel_size=8,
    pipeline_config={
        "num_microbatches": 8,
        "virtual_pipeline_size": 2,
        "input_names": ["input_ids", "attention_mask"],
        "output_loss_value_spec": (True, False)
    },
    optimizer_config={"zero_one_enabled": True}
)
```

--------------------------------

### Initialize Parallel Model with NeuronX Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trainer-api.md

Demonstrates how to initialize a HuggingFace model with tensor parallelism and optimizer configurations using `initialize_parallel_model`.

```python
from neuronx_distributed import neuronx_distributed_config, initialize_parallel_model
from transformers import AutoModelForCausalLM

nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    optimizer_config={"zero_one_enabled": True}
)

def model_fn():
    return AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

nxd_model = initialize_parallel_model(
    nxd_config,
    model_fn
)
```

--------------------------------

### NeuronX Distributed Config - Single Device

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Initialize NeuronX Distributed configuration for a single device setup, typically for smaller models.

```python
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=1
)
```

--------------------------------

### ModelBuilder.trace

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trace-and-compile-api.md

Traces the model with example inputs for subsequent compilation. This method can be called multiple times to capture different input shapes or configurations.

```APIDOC
## trace

Trace model with example inputs for compilation.

```python
def trace(
    self,
    args: Union[None, torch.Tensor, Tuple[torch.Tensor, ...]] = None,
    kwargs: Optional[Dict[str, torch.Tensor]] = None,
    tag: Optional[str] = None,
    spmd: bool = True
) -> ModelBuilder
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| args | Union[None, torch.Tensor, Tuple] | None | Example input tensors as positional arguments |
| kwargs | Optional[Dict[str, torch.Tensor]] | None | Example input tensors as keyword arguments |
| tag | Optional[str] | None | Unique identifier for trace. Auto-generated if None |
| spmd | bool | True | Use SPMD (Single Program Multiple Data) tracing |

### Return Value

Returns self for method chaining.

### Behavior

- Traces model using torch_neuronx symbolic execution
- Records HLO (High Level Optimizer) IR and computation graph
- Preserves model parameters for later compilation
- Supports multiple traces for different input shapes/specializations
- Auto-generates tag based on HLO hash if not provided

### Example

```python
import torch
from neuronx_distributed.trace import ModelBuilder

model = MyTransformer()
builder = ModelBuilder(model)

# Trace with different input shapes for dynamic batching
input_ids = torch.randint(0, 32000, (batch_size, seq_len))
builder.trace(
    kwargs={"input_ids": input_ids},
    tag="batch_1_seq_128"
)

builder.trace(
    kwargs={"input_ids": torch.randint(0, 32000, (batch_size, seq_len*2))},
    tag="batch_1_seq_256"
)
```
```

--------------------------------

### Instantiate MoEConfig

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/lora-and-modules-api.md

Create an instance of MoEConfig with specified parameters for MoE layers. Ensure the parameters align with your model architecture and hardware capabilities.

```python
from neuronx_distributed.modules.moe import MoEConfig

moe_config = MoEConfig(
    num_experts=16,
    num_experts_per_tok=2,
    expert_dim=2048
)
```

--------------------------------

### Build NeuronxDistributed from Source

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/README.md

Build the NeuronxDistributed library from its source code. The resulting wheel file will be placed in the 'build/' directory.

```bash
bash ./build.sh
```

--------------------------------

### Get Configured Logger

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/utilities-api.md

Retrieve a logger instance configured for distributed training. Set `rank0_only` to True to restrict logging to the rank 0 process.

```python
from neuronx_distributed.utils.logger import get_logger

logger = get_logger(rank0_only=True)  # Only rank 0 logs
logger.info("Starting training")
```

--------------------------------

### Get Local World Size

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/utilities-api.md

Returns the number of processes (ranks) running on the current node. This is typically used in distributed environments to understand the local parallelism.

```python
from neuronx_distributed.parallel_layers.utils import get_local_world_size

local_size = get_local_world_size()
print(f"Number of local ranks: {local_size}")
```

--------------------------------

### Configure Model Initialization Parameters

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Set parameters for sequential device parameter transfer and deferred initialization. Use `sequential_move_factor` to balance speed and memory usage. `meta_device_init` requires `param_init_fn` if set to True.

```python
model_init_config = {
    # Factor for sequential device parameter transfer
    # Higher = slower but uses less memory
    # Default 11 works for ~20B parameter models
    "sequential_move_factor": 11,  # int (1-100+)

    # Initialize model on meta device (deferred initialization)
    # Requires param_init_fn if True
    "meta_device_init": False,  # bool

    # Custom parameter initialization for meta device
    # Example: functools.partial(transformers.modeling_utils.init_weights_gpt2)
    "param_init_fn": None,  # Optional[Callable[[Module], None]]
}

nxd_config = neuronx_distributed_config(
    model_init_config=model_init_config,
    sequential_move_factor=11  # Or set via this parameter
)
```

--------------------------------

### Save and Load Checkpoint with ZeRO-1

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/optimizer-api.md

Demonstrates how to save and load model and optimizer states when using ZeRO-1 optimization. Be sure to specify `zero1_optimizer=True` for both saving and loading so that the sharded optimizer state is handled correctly.

```python
import torch
from neuronx_distributed import save_checkpoint, load_checkpoint

# Save checkpoint with ZeRO-1 optimizer
save_checkpoint(
    checkpoint_dir_str="./checkpoints",
    tag="step_5000",
    model=model,
    optimizer=zero1_optimizer,
    num_kept_ckpts=3,
    zero1_optimizer=True  # Indicate ZeRO-1 sharded state
)

# Load checkpoint
model_state, optim_state, _, _ = load_checkpoint(
    checkpoint_dir_str="./checkpoints",
    model=model,
    optimizer=zero1_optimizer,
    zero1_optimizer=True
)

model.load_state_dict(model_state)
zero1_optimizer.load_state_dict(optim_state)
```

--------------------------------

### Get XLA RNG Tracker

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/utilities-api.md

Retrieves the XLA random number generator tracker. This is essential for ensuring reproducibility in distributed training by seeding XLA random operations.

```python
def get_xla_rng_tracker() -> torch_xla.RngTracker
```

--------------------------------

### Model Initialization Configuration Dictionary

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/types-reference.md

Defines a dictionary for model initialization parameters, including sequential move factor and meta-device initialization.

```python
# Model init configuration
model_init_config: Dict[str, Any] = {
    "sequential_move_factor": 11,
    "meta_device_init": False,
    "param_init_fn": None
}
```

--------------------------------

### Deferred Parameter Initialization on Device

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/utilities-api.md

Employ `init_on_device` as a context manager to defer parameter initialization until parameters are accessed. This is particularly useful for initializing models on a meta device to conserve memory before a later materialization to an actual device.

```python
import torch
import torch_xla.core.xla_model as xm
from neuronx_distributed.utils.model_utils import get_model_sequential, init_on_device

# LargeTransformer and init_fn are assumed to be defined by the user
with init_on_device(torch.device("meta")):
    # Model created with all parameters on meta device
    model = LargeTransformer()

# Later, materialize to actual device
model = get_model_sequential(model, xm.xla_device(), param_init_fn=init_fn)
```

--------------------------------

### Configure 200B+ Parameters (Full Distributed)

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Recommended configuration for large models (200B+ parameters) with full distributed settings. Includes tensor, pipeline, expert, and context parallelism, along with advanced optimizer and mixed precision settings.
```python
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=8,
    pipeline_parallel_size=4,
    expert_parallel_size=1,
    context_parallel_size=1,
    pipeline_config={
        "num_microbatches": 8,
        "virtual_pipeline_size": 2
    },
    optimizer_config={"zero_one_enabled": True},
    mixed_precision_config={
        "use_master_weights": True,
        "use_fp32_grad_acc": True
    },
    activation_checkpoint_config="full"
)
```

--------------------------------

### Check for Existing Checkpoint

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trainer-api.md

Checks whether a valid, completed checkpoint exists in the given directory. This is useful for deciding whether to load an existing state or start training from scratch.

```python
from neuronx_distributed import has_checkpoint, load_checkpoint

if has_checkpoint("./checkpoints"):
    model_state, optim_state, _, _ = load_checkpoint(
        "./checkpoints",
        model=nxd_model,
        optimizer=nxd_optimizer
    )
else:
    print("No checkpoint found, starting from scratch")
```

--------------------------------

### Initialize Distributed Training

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/quickstart-guide.md

Initialize the distributed training process group using the 'xla' backend. This sets up communication channels between distributed processes.

```python
import torch
import torch.distributed as dist
import torch_xla.distributed.xla_backend  # Registers the "xla" backend with torch.distributed

# Initialize distributed training
dist.init_process_group("xla")
world_size = dist.get_world_size()
rank = dist.get_rank()
```

--------------------------------

### NxDOptimizer Methods

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/optimizer-api.md

Provides essential methods for the NxDOptimizer, including step, state_dict, load_state_dict, and zero_grad. These methods handle distributed training aspects.
```python
def step(self, closure=None) -> Optional[float]
# Optimization step with distributed handling
```

```python
def state_dict(self) -> Dict
# Get optimizer state
```

```python
def load_state_dict(self, state_dict: Dict) -> None
# Load optimizer state
```

```python
def zero_grad(self, set_to_none: bool = False) -> None
# Reset gradients
```

--------------------------------

### LoraConfig Instance Method

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/lora-and-modules-api.md

Returns a dictionary representation of the configuration, suitable for saving to a checkpoint.

```python
def selected_fields_to_save(self) -> Dict:
```

--------------------------------

### Configure MoE Model Training

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Set up configuration for Mixture of Experts (MoE) model training with expert and pipeline parallelism.

```python
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    expert_parallel_size=2,
    pipeline_parallel_size=2,
    optimizer_config={
        "zero_one_enabled": True,
        "grad_clipping": True
    }
)
```

--------------------------------

### Configure Pipeline Parallelism

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/quickstart-guide.md

Configure NeuronX Distributed for pipeline parallelism with specified tensor and pipeline parallel sizes. This setup is suitable for training large models that do not fit into a single device's memory.
```python
import torch
from neuronx_distributed import (
    neuronx_distributed_config,
    initialize_parallel_model,
    initialize_parallel_optimizer,
)
from transformers import AutoModelForCausalLM

# Configure with pipeline parallelism
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
    pipeline_config={
        "num_microbatches": 4,  # 4 microbatches per training step
        "virtual_pipeline_size": 1,  # No interleaving (set to 2+ for interleaved)
        "input_names": ["input_ids", "attention_mask"],
        "output_loss_value_spec": (True, False)  # First output is loss
    }
)

# Create model - NxDPPModel automatically wraps for pipeline parallelism
nxd_model = initialize_parallel_model(
    nxd_config,
    lambda: AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
)

# Training loop - same interface, but with pipeline scheduling
optimizer = initialize_parallel_optimizer(
    nxd_config,
    torch.optim.AdamW,
    nxd_model.parameters(),
    lr=1e-4
)

# train_loader is assumed to be a DataLoader defined elsewhere
for batch in train_loader:
    outputs = nxd_model(input_ids=batch["input_ids"])
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

--------------------------------

### Configure 7B Parameters (Tensor Parallel)

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/configuration-guide.md

Recommended configuration for a 7B parameter model using tensor parallelism. Sets the tensor parallel size to 2 and enables the ZeRO-1 optimizer.

```python
nxd_config = neuronx_distributed_config(
    tensor_parallel_size=2,
    optimizer_config={"zero_one_enabled": True}
)
```

--------------------------------

### Set Up Distributed Training Environment

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Configures the environment variables for distributed training on a Neuron instance. NEURON_RT_NUM_CORES should be set according to the instance type (e.g., 64 for trn1).
```bash
# Install NeuronX distributed
pip install neuronx-distributed

# Set up distributed training environment
export NEURON_RT_NUM_CORES=64  # For trn1 instance
export NEURON_FRAMEWORK_DEBUG=0
```

--------------------------------

### Initialize Model on Neuron Device

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trace-and-compile-api.md

Prepares the model for inference by allocating memory on the Neuron device, initializing communication groups, and applying transformations. This must be called before any inference.

```python
def to_neuron(self) -> None
```

```python
nxd_model.set_weights(weights)
nxd_model.to_neuron()  # Initialize on hardware
# Now ready for inference
```

--------------------------------

### Model Compilation Workflow

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trace-and-compile-api.md

Use this workflow to trace and compile PyTorch models for Neuron. Initialize ModelBuilder, trace with various input shapes, compile the model, load weights, and prepare for inference.

```python
import torch
from neuronx_distributed.trace import ModelBuilder

# 1. Create model
model = MyLargeModel()

# 2. Initialize ModelBuilder
builder = ModelBuilder(
    model,
    weights_to_skip_layout_optimization={"embeddings.weight"}
)

# 3. Trace for different input shapes
input_ids_128 = torch.randint(0, 32000, (8, 128))
input_ids_256 = torch.randint(0, 32000, (8, 256))

builder.trace(kwargs={"input_ids": input_ids_128}, tag="seq_128")
builder.trace(kwargs={"input_ids": input_ids_256}, tag="seq_256")

# 4. Compile to NEFF
nxd_model = builder.compile(
    priority_model_key="seq_128",
    compiler_args="-O2"
)

# 5. Load weights
checkpoint = load_distributed_checkpoint("model_weights")
nxd_model.set_weights(checkpoint)

# 6. Initialize on Neuron
nxd_model.to_neuron()

# 7. Run inference
outputs = run_inference_with_nxd_model(nxd_model)
```

--------------------------------

### NxDModel Initialization

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trace-and-compile-api.md

Initializes the NxDModel wrapper. Configure distributed execution parameters and optional state/layout transformers.

```python
class NxDModel:
    def __init__(
        self,
        world_size: int,
        start_rank: int = 0,
        state_initializer: Optional[StateInitializer] = None,
        layout_transformer: Optional[torch.classes.neuron.LayoutTransformation] = None
    )
```

--------------------------------

### Load and Resume Training from Checkpoint with NeuronX Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/quickstart-guide.md

Load model and optimizer states from a checkpoint to resume training. Use this when you need to continue a previously interrupted or completed training session.

```python
from neuronx_distributed import load_checkpoint, has_checkpoint

# Check if checkpoint exists
if has_checkpoint("./checkpoints"):
    model_state, optim_state, _, _ = load_checkpoint(
        "./checkpoints",
        model=nxd_model,
        optimizer=optimizer
    )
    nxd_model.load_state_dict(model_state)
    optimizer.load_state_dict(optim_state)
    print("Resumed from checkpoint")
else:
    print("No checkpoint found, starting from scratch")

# Continue training from loaded state
for step, batch in enumerate(train_loader):
    # ... training loop continues ...
    pass
```

--------------------------------

### Optimizer Configuration

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/README.md

Configure optimizer settings such as ZeRO-1 optimizer state sharding and gradient clipping, along with mixed precision settings such as FP32 master weights.
```python
optimizer_config = {
    "zero_one_enabled": True,  # Enable ZeRO-1 optimizer state sharding
    "grad_clipping": True,  # Clip gradients
    "max_grad_norm": 1.0  # Clipping threshold
}

mixed_precision_config = {
    "use_master_weights": True,  # Keep FP32 copy
    "use_fp32_grad_acc": True,  # Accumulate in FP32
    "use_master_weights_in_ckpt": True  # Save FP32 weights
}
```

--------------------------------

### Traced Model Building and Compilation

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/types-reference.md

Shows how to build and compile a model using ModelBuilder for tracing, specifying input arguments.

```python
import torch
from neuronx_distributed.trace import ModelBuilder, NxDModel
from typing import Callable

def build_and_compile(
    model_fn: Callable,
    input_tensor: torch.Tensor,  # Example input used for tracing
) -> NxDModel:
    builder: ModelBuilder = ModelBuilder(model_fn())
    builder.trace(args=None, kwargs={"input_ids": input_tensor})
    nxd_model: NxDModel = builder.compile()
    return nxd_model
```

--------------------------------

### Basic Tensor Capture Usage

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/utils/tensor_capture/README.md

Demonstrates the basic workflow of enabling tensor capture for specific modules, running the model, retrieving captured tensors, and disabling capture. Ensure necessary imports are present before use.
```python
from neuronx_distributed.utils.tensor_capture import (
    enable_tensor_capture,
    disable_tensor_capture,
    get_available_modules,
    register_tensor,
    get_captured_tensors_dict
)

# Create a model
model = create_model()

# Find available modules
available_modules = get_available_modules(model)
print(f"Available modules: {available_modules}")

# Define which modules to monitor
modules_to_capture = ["layers.0", "layers.1", "output_layer"]

# Enable tensor capture (outputs only)
model = enable_tensor_capture(model, modules_to_capture, max_tensors=5)

# Or enable tensor capture for both inputs and outputs
# model = enable_tensor_capture(model, modules_to_capture, max_tensors=5, capture_inputs=True)

# Run the model
inputs = create_inputs()
outputs = model(inputs)

# Get captured tensors as an ordered dictionary
captured_tensors_dict = get_captured_tensors_dict()

# Process the captured tensors
for name, tensor in captured_tensors_dict.items():
    print(f"Tensor {name} shape: {tensor.shape}")

# Disable tensor capture when done
model = disable_tensor_capture(model)
```

--------------------------------

### Load Checkpoint from Directory

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trainer-api.md

Loads model, optimizer, and scheduler states from a specified directory. Ensure the model, optimizer, and scheduler objects are initialized before calling this function.

```python
from neuronx_distributed import load_checkpoint

model_state, optim_state, sched_state, user_data = load_checkpoint(
    checkpoint_dir_str="./checkpoints",
    model=nxd_model,
    optimizer=nxd_optimizer,
    scheduler=scheduler
)

# Load into model and optimizer
nxd_model.load_state_dict(model_state)
nxd_optimizer.load_state_dict(optim_state)
```

--------------------------------

### Run Llama3.2 1B Model on CPU

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/inference/llama/README.md

Use this command to run the Llama3.2 1B model on your CPU.
Ensure you have the model and tokenizer downloaded and specify their paths correctly. Adjust batch size and sequence length as needed.

```bash
python run.py generate_cpu \
    --batch-size 2 \
    --seq-len 128 \
    --model-path ~/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
    --tokenizer-path ~/.llama/checkpoints/Llama3.2-1B-Instruct/tokenizer.model \
    --prompts "['How tall is the Space Needle?','What is the capital of France?']"
```

--------------------------------

### NxDModel Initialization

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trace-and-compile-api.md

Initializes the NxDModel wrapper with distributed execution parameters.

```APIDOC
## NxDModel Constructor

### Description
Initializes the NxDModel wrapper for distributed execution.

### Parameters
#### Path Parameters
- **world_size** (int) - Required - Number of ranks in distributed execution (TP * PP * DP)
- **start_rank** (int) - Optional - Starting rank for multi-node setup (Default: 0)
- **state_initializer** (Optional[StateInitializer]) - Optional - Module to initialize state buffers on Neuron (Default: None)
- **layout_transformer** (Optional[LayoutTransformation]) - Optional - Weight layout transformation module (Default: None)
```

--------------------------------

### Format Code with Pre-commit

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/README.md

Format all files in the project using the pre-commit framework. This command ensures code style consistency across the project.

```bash
pre-commit run --all-files
```

--------------------------------

### init_on_device

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/utilities-api.md

A context manager for deferring parameter initialization to a specific device, useful for memory-efficient meta-device initialization.

```APIDOC
## init_on_device

### Description
Context manager for deferred parameter initialization on specific device.

### Method Signature
```python
@contextmanager
def init_on_device(
    device: torch.device,
    include_buffers: bool = False,
    force_custom_init_on_device: bool = False
)
```

### Parameters
#### Path Parameters
- **device** (torch.device) - Required - Device to initialize on (meta or xla).
- **include_buffers** (bool) - Optional - Include buffers in deferred initialization. Defaults to False.
- **force_custom_init_on_device** (bool) - Optional - Force custom initialization on device. Defaults to False.

### Behavior
- Defers model initialization until parameters are accessed.
- Useful for meta device initialization to save memory.
- Yields context for model creation/initialization code.

### Example
```python
import torch
import torch_xla.core.xla_model as xm
from neuronx_distributed.utils.model_utils import get_model_sequential, init_on_device

with init_on_device(torch.device("meta")):
    # Model created with all parameters on meta device
    # (LargeTransformer is a placeholder for your own model class)
    model = LargeTransformer()

# Later, materialize to actual device
# (init_fn is a placeholder for your own parameter initialization function)
model = get_model_sequential(model, xm.xla_device(), param_init_fn=init_fn)
```
```

--------------------------------

### Compile Model using Actual Torch Distributed

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/inference/llama/README.md

Compile models by launching actual distributed processes using `torchrun`. This method requires a properly configured distributed environment.

```bash
python run.py compile_no_mock \
    --tp-degree 32 \
    --batch-size 2 \
    --seq-len 128 \
    --model-path ~/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
    --output-path ~/neuron_models/Llama3.2-1B-Instruct \
    --shard-on-load True
```

--------------------------------

### Initialize Parallel Optimizer

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/trainer-api.md

Initializes a PyTorch optimizer with distributed training optimizations like ZeRO-1 sharding. Use this when setting up your training loop with NeuronX Distributed.
```python
import torch
from neuronx_distributed import initialize_parallel_optimizer, neuronx_distributed_config

nxd_config = neuronx_distributed_config(
    tensor_parallel_size=4,
    optimizer_config={"zero_one_enabled": True, "grad_clipping": True}
)

# nxd_model is assumed to have been created earlier, e.g. via initialize_parallel_model
nxd_optimizer = initialize_parallel_optimizer(
    nxd_config,
    torch.optim.AdamW,
    nxd_model.parameters(),
    lr=1e-4,
    weight_decay=0.01
)
```

--------------------------------

### Standard Optimizer Configuration (No ZeRO-1)

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/optimizer-api.md

Configuration for a standard optimizer without ZeRO-1 enabled. This includes settings for gradient clipping and mixed precision, with master weights and FP32 gradient accumulation disabled.

```python
optimizer_config = {
    "zero_one_enabled": False,
    "grad_clipping": True,
    "max_grad_norm": 1.0
}

mixed_precision_config = {
    "use_master_weights": False,
    "use_fp32_grad_acc": False,
    "use_master_weights_in_ckpt": False
}
```

--------------------------------

### Column Parallel Linear Layer Initialization

Source: https://github.com/aws-neuron/neuronx-distributed/blob/main/_autodocs/types-reference.md

Demonstrates the type usage for initializing a ColumnParallelLinear layer with specified input and output sizes.

```python
from neuronx_distributed.parallel_layers import (
    ColumnParallelLinear,
    RowParallelLinear,
    ParallelEmbedding
)
import torch

# Type usage
linear_layer: torch.nn.Module = ColumnParallelLinear(
    input_size=4096,
    output_size=8192
)
```
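
--------------------------------

### Column-Parallel Sharding Sketch (Illustrative)

To make the idea behind a column-parallel linear layer concrete, the following is a minimal plain-Python sketch, not the library's implementation. It assumes the usual tensor-parallel scheme in which the output features of the weight are split evenly across ranks, each rank computes its slice of the output, and the slices are gathered in rank order. The helpers `linear`, `shard_columns`, and `column_parallel_linear` are hypothetical names introduced here for illustration only; whether the real `ColumnParallelLinear` gathers its output depends on how it is configured.

```python
# Illustrative only: a plain-Python model of column-parallel sharding.
# None of these helpers are part of neuronx_distributed.


def linear(x, weight):
    """Compute y = weight @ x for one input vector; weight is (out_features, in_features)."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weight]


def shard_columns(weight, tp_size):
    """Split the output features (rows in this layout) evenly across tp_size ranks."""
    out_features = len(weight)
    if out_features % tp_size != 0:
        raise ValueError("output_size must be divisible by the tensor parallel size")
    per_rank = out_features // tp_size
    return [weight[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


def column_parallel_linear(x, weight, tp_size):
    """Each simulated rank computes its output slice; slices are gathered in rank order."""
    shards = shard_columns(weight, tp_size)
    partial_outputs = [linear(x, shard) for shard in shards]  # one list per rank
    return [y for part in partial_outputs for y in part]


# output_size=4, input_size=3
weight = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [2, 2, 2]]
x = [1, 2, 3]

full = linear(x, weight)
sharded = column_parallel_linear(x, weight, tp_size=2)
print(full == sharded)  # the gathered result matches the unsharded layer
```

The divisibility check mirrors a common constraint in tensor-parallel setups: with `output_size=8192` and a tensor parallel size of 4, each rank would hold 2048 output features under this scheme.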