PyTorch
https://github.com/pytorch/pytorch
# PyTorch

PyTorch is an open-source deep learning framework that provides multi-dimensional tensor computation with strong GPU acceleration and a tape-based automatic differentiation engine for building and training neural networks. It was originally developed by Meta AI and is now governed by the PyTorch Foundation, with contributions from a large open-source community. The library is deeply integrated with Python, offering a NumPy-like API for tensors that works on both CPU and GPU. PyTorch's dynamic computation graph (define-by-run) suits research workflows well, because the network architecture can be modified at runtime without a recompilation step.

At its core, PyTorch centers on the `torch.Tensor` type: a multi-dimensional array that can track gradients, can be placed on any supported device, and supports hundreds of mathematical operations. Around this primitive, the library provides:

- `torch.nn` for composable neural-network building blocks
- `torch.optim` for gradient-based optimization algorithms
- `torch.utils.data` for efficient data loading pipelines
- `torch.autograd` for automatic differentiation
- `torch.amp` for mixed-precision training
- `torch.distributed` for multi-device/multi-node training
- `torch.compile` for graph-mode compilation
- `torch.fx` for programmatic model transformation
- `torch.profiler` for performance analysis

These subsystems share one API that covers the lifecycle from data ingestion through model definition, training, evaluation, serialization, and deployment.

---

## `torch.Tensor` — Core Tensor Creation and Operations

`torch.Tensor` is the central data structure in PyTorch. Tensors are n-dimensional arrays that support CPU and GPU computation, automatic differentiation, broadcasting, and a comprehensive set of mathematical operations. Key factory functions include `torch.tensor`, `torch.zeros`, `torch.ones`, `torch.rand`, `torch.randn`, `torch.arange`, and `torch.empty`.
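
Broadcasting and a few of the factory functions named above are easiest to see on small inputs. The following is a minimal CPU-only sketch; the variable names are illustrative and not part of the original summary.

```python
import torch

# Factory functions from the list above, on small shapes.
steps = torch.arange(0, 10, 2)   # values 0, 2, 4, 6, 8 (integer dtype)
ones = torch.ones(2, 3)          # float32 by default
buf = torch.empty(2, 3)          # uninitialized memory: contents are arbitrary

# Broadcasting: shapes are aligned from the trailing dimension,
# and size-1 dimensions are stretched to match.
col = torch.tensor([[1.0], [2.0]])        # shape [2, 1]
row = torch.tensor([10.0, 20.0, 30.0])    # shape [3]
grid = col + row                          # shape [2, 3]
print(grid)
# tensor([[11., 21., 31.],
#         [12., 22., 32.]])
```

Because `torch.empty` does not initialize its memory, it should only be used when every element will be overwritten before being read.
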
```python
import torch

# ── Creation ──────────────────────────────────────────────
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # shape [2, 3]
z = torch.zeros(3, 4, dtype=torch.float32)
r = torch.randn(2, 3, device="cuda" if torch.cuda.is_available() else "cpu")

# ── Indexing and slicing ──────────────────────────────────
first_row = x[0]     # tensor([1., 2., 3.])
col = x[:, 1]        # tensor([2., 5.])
sub = x[0:2, 1:3]    # tensor([[2., 3.], [5., 6.]])

# ── Math operations ───────────────────────────────────────
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)                  # tensor([5., 7., 9.])
print(a @ b)                  # dot product → tensor(32.)
print(torch.matmul(x, x.T))   # [2,3] @ [3,2] → [2,2]

# ── Shape manipulation ────────────────────────────────────
flat = x.view(-1)               # tensor([1.,2.,3.,4.,5.,6.])
col3d = x.unsqueeze(0)          # [1,2,3]
stacked = torch.stack([a, b])   # [2,3]
cat = torch.cat([a, b])         # [6]

# ── Device movement ───────────────────────────────────────
if torch.cuda.is_available():
    x_gpu = x.to("cuda")
    x_cpu = x_gpu.cpu()

# ── Gradient tracking ─────────────────────────────────────
w = torch.randn(3, requires_grad=True)
loss = (w * a).sum()
loss.backward()
print(w.grad)  # tensor([1., 2., 3.])
```

---

## `torch.nn.Module` — Base Class for All Neural Network Modules

`nn.Module` is the foundation for every neural network component in PyTorch. Subclasses define learnable parameters in `__init__` and implement the forward computation in `forward`. The base class handles parameter registration, device movement (`.to()`), serialization, gradient zeroing, train/eval mode toggling, and hooks.
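
Parameter registration, buffers, and hooks can be seen on a very small module. This is a sketch only: `TinyNet`, `save_output_shape`, and the buffer name `calls` are hypothetical names introduced for illustration.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)                      # submodule, registered on assignment
        self.scale = nn.Parameter(torch.ones(2))       # learnable parameter
        self.register_buffer("calls", torch.zeros(1))  # stored in state_dict, not trained

    def forward(self, x):
        return self.fc(x) * self.scale


net = TinyNet()
param_names = sorted(name for name, _ in net.named_parameters())
print(param_names)                   # ['fc.bias', 'fc.weight', 'scale']
print("calls" in net.state_dict())   # True

# A forward hook observes a submodule's output without editing forward().
captured = {}

def save_output_shape(module, inputs, output):
    captured["fc"] = tuple(output.shape)

handle = net.fc.register_forward_hook(save_output_shape)
net(torch.randn(3, 4))
handle.remove()                      # detach the hook once it is no longer needed
print(captured)                      # {'fc': (3, 2)}
```
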
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvClassifier(nn.Module):
    """Simple CNN for MNIST-like 28×28 grayscale images."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.bn1 = nn.BatchNorm2d(32)
        self.dropout = nn.Dropout(0.25)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # [B,32,14,14]
        x = self.pool(F.relu(self.conv2(x)))            # [B,64,7,7]
        x = self.dropout(x.flatten(1))                  # [B, 3136]
        x = F.relu(self.fc1(x))                         # [B, 128]
        return self.fc2(x)                              # [B, 10]


model = ConvClassifier()

# Inspect parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total_params:,}")  # Parameters: 421,706

# Move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Forward pass
imgs = torch.randn(8, 1, 28, 28, device=device)  # batch of 8
logits = model(imgs)                             # [8, 10]
print(logits.shape)                              # torch.Size([8, 10])

# Switch between training and evaluation modes
model.train()  # dropout active; batch-norm uses batch statistics
model.eval()   # dropout disabled; batch-norm uses running statistics
```

---

## `torch.nn.Sequential` — Ordered Layer Container

`nn.Sequential` chains modules in order so that the output of each layer feeds into the next. It is the simplest way to build feedforward networks without writing an explicit `forward` method.
```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)
logits = mlp(x)       # [32, 10]
print(logits.shape)   # torch.Size([32, 10])

# Named children for introspection
for name, layer in mlp.named_children():
    print(name, "->", layer)
# 0 -> Linear(in_features=784, out_features=512, bias=True)
# 1 -> ReLU() ...
```

---

## `torch.nn.Linear` — Fully Connected Layer

`nn.Linear(in_features, out_features)` applies an affine transformation `y = xAᵀ + b`. It is the building block of multilayer perceptrons, attention projections, and output heads.

```python
import torch
import torch.nn as nn

fc = nn.Linear(in_features=128, out_features=64, bias=True)
print(fc.weight.shape)  # torch.Size([64, 128])
print(fc.bias.shape)    # torch.Size([64])

x = torch.randn(16, 128)  # batch_size=16, in_features=128
y = fc(x)
print(y.shape)            # torch.Size([16, 64])

# Custom weight initialization
nn.init.kaiming_uniform_(fc.weight, nonlinearity="relu")
nn.init.zeros_(fc.bias)
```

---

## `torch.nn.Conv2d` — 2D Convolutional Layer

`nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups)` applies a 2-D cross-correlation over a 4-D input tensor of shape `(N, C_in, H, W)`. It supports grouped and depthwise convolutions via the `groups` argument.
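
The spatial output size of a convolution follows `floor((size + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1` along each axis. The sketch below checks that formula against `nn.Conv2d`; the helper `conv2d_out_size` is a hypothetical name, not a PyTorch function.

```python
import torch
import torch.nn as nn


def conv2d_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis (hypothetical helper)."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Strided 3x3 convolution with padding 1 halves a 224-pixel axis.
h_strided = conv2d_out_size(224, kernel_size=3, stride=2, padding=1)
strided = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
y_strided = strided(torch.randn(1, 3, 224, 224))
print(h_strided, tuple(y_strided.shape))   # 112 (1, 8, 112, 112)

# Dilated 3x3 convolution without padding shrinks a 32-pixel axis by 4.
h_dilated = conv2d_out_size(32, kernel_size=3, dilation=2)
dilated = nn.Conv2d(3, 8, kernel_size=3, dilation=2)
y_dilated = dilated(torch.randn(1, 3, 32, 32))
print(h_dilated, tuple(y_dilated.shape))   # 28 (1, 8, 28, 28)
```
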
```python
import torch
import torch.nn as nn

# Standard 3×3 convolution with same-padding
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(4, 3, 224, 224)  # NCHW image batch
y = conv(x)
print(y.shape)  # torch.Size([4, 64, 224, 224])

# Depthwise separable convolution (groups=in_channels)
dw = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)
pw = nn.Conv2d(64, 128, kernel_size=1)
out = pw(dw(y))
print(out.shape)  # torch.Size([4, 128, 224, 224])

# Strided convolution for spatial downsampling
down = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
print(down(out).shape)  # torch.Size([4, 256, 112, 112])
```

---

## `torch.nn.LSTM` / `torch.nn.GRU` — Recurrent Layers

`nn.LSTM` and `nn.GRU` implement multi-layer gated recurrent networks. They accept sequences of shape `(seq_len, batch, input_size)` (or `(batch, seq_len, input_size)` with `batch_first=True`) and return output sequences plus final hidden states.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(
    input_size=64,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    dropout=0.2,
    bidirectional=True,
)

# x: (batch=8, seq_len=20, input_size=64)
x = torch.randn(8, 20, 64)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([8, 20, 256])  (128*2 bidirectional)
print(h_n.shape)     # torch.Size([4, 8, 128])   (num_layers*2, batch, hidden)

# GRU variant
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
out, h_n = gru(x)
print(out.shape)  # torch.Size([8, 20, 128])

# Packing variable-length sequences
lengths = torch.tensor([20, 15, 10, 20, 18, 12, 20, 20])  # per-sample lengths
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
unpacked, _ = pad_packed_sequence(packed_out, batch_first=True)
print(unpacked.shape)  # torch.Size([8, 20, 256])
```

---

## `torch.nn.MultiheadAttention` — Scaled Dot-Product Attention
`nn.MultiheadAttention(embed_dim, num_heads)` implements multi-head scaled dot-product attention as described in "Attention Is All You Need". It is the core building block for Transformer models.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, dropout=0.1, batch_first=True)

# Self-attention: Q = K = V = x
x = torch.randn(4, 32, 512)  # (batch=4, seq_len=32, embed_dim=512)
out, weights = mha(x, x, x)
print(out.shape)      # torch.Size([4, 32, 512])
print(weights.shape)  # torch.Size([4, 32, 32])  (averaged over heads)

# Causal (autoregressive) attention mask: True marks positions that may NOT be attended to
seq_len = 32
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out_causal, _ = mha(x, x, x, attn_mask=mask)
print(out_causal.shape)  # torch.Size([4, 32, 512])

# Key padding mask for variable-length inputs
key_padding = torch.zeros(4, 32, dtype=torch.bool)
key_padding[:, 20:] = True  # last 12 positions are padding
out_padded, _ = mha(x, x, x, key_padding_mask=key_padding)
```

---

## `torch.nn.LayerNorm` / `torch.nn.BatchNorm2d` — Normalization Layers

Normalization layers stabilize and accelerate training. `LayerNorm` normalizes each sample over the last D dimensions of the input (used in Transformers), while `BatchNorm2d` normalizes each channel over the batch and spatial `(N, H, W)` dimensions (used in CNNs).
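
The difference between the two normalization axes can be checked by recomputing both layers by hand. This is a sketch under two assumptions: both layers are freshly initialized (weight 1, bias 0 for `BatchNorm2d`), and `BatchNorm2d` is in its default training mode so it uses batch statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# LayerNorm: statistics per sample, over the last dimension.
x = torch.randn(4, 32, 512)
ln = nn.LayerNorm(512)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased variance
manual_ln = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
ln_matches = torch.allclose(ln(x), manual_ln, atol=1e-4)

# BatchNorm2d (training mode): statistics per channel, over N, H, W.
imgs = torch.randn(8, 64, 28, 28)
bn = nn.BatchNorm2d(64)
mean_c = imgs.mean(dim=(0, 2, 3), keepdim=True)
var_c = imgs.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
manual_bn = (imgs - mean_c) / torch.sqrt(var_c + bn.eps)
bn_matches = torch.allclose(bn(imgs), manual_bn, atol=1e-4)

print(ln_matches, bn_matches)
```

Since `LayerNorm` never looks across the batch, its output for one sample does not depend on the other samples, which is one reason it is common with small or variable batch sizes.
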
```python
import torch
import torch.nn as nn

# LayerNorm: Transformer style
ln = nn.LayerNorm(normalized_shape=512)
x = torch.randn(4, 32, 512)  # (batch, seq_len, embed_dim)
print(ln(x).shape)           # torch.Size([4, 32, 512])

# RMSNorm: LLaMA style (no mean subtraction, no bias); needs a recent PyTorch release
rms = nn.RMSNorm(normalized_shape=512)
print(rms(x).shape)          # torch.Size([4, 32, 512])

# BatchNorm2d: CNN style
bn = nn.BatchNorm2d(num_features=64)
imgs = torch.randn(8, 64, 28, 28)
print(bn(imgs).shape)        # torch.Size([8, 64, 28, 28])

# GroupNorm: works with any batch size (e.g., batch=1)
gn = nn.GroupNorm(num_groups=8, num_channels=64)
print(gn(imgs).shape)        # torch.Size([8, 64, 28, 28])
```

---

## Loss Functions — `torch.nn` Losses

PyTorch provides the standard loss functions as `nn.Module` subclasses. Depending on the loss, they take raw logits (`CrossEntropyLoss`, `BCEWithLogitsLoss`) or predicted values (`MSELoss`, `HuberLoss`) together with targets, and return a scalar (or a per-element tensor with `reduction='none'`).

```python
import torch
import torch.nn as nn

# ── Cross-entropy (multi-class classification) ────────────
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 10)          # [batch, classes]
labels = torch.randint(0, 10, (8,))  # integer class indices
loss = ce(logits, labels)
print(f"CE loss: {loss.item():.4f}")

# ── Binary cross-entropy with logits ──────────────────────
bce = nn.BCEWithLogitsLoss()
logits_bin = torch.randn(8, 1)
targets = torch.randint(0, 2, (8, 1)).float()
print(f"BCE loss: {bce(logits_bin, targets).item():.4f}")

# ── Mean squared error (regression) ───────────────────────
mse = nn.MSELoss()
pred = torch.randn(16, 1)
truth = torch.randn(16, 1)
print(f"MSE loss: {mse(pred, truth).item():.4f}")

# ── Huber / SmoothL1 (robust regression) ──────────────────
huber = nn.HuberLoss(delta=1.0)
print(f"Huber loss: {huber(pred, truth).item():.4f}")

# ── Per-element loss for custom weighting ─────────────────
ce_none = nn.CrossEntropyLoss(reduction="none")
per_sample = ce_none(logits, labels)  # shape [8]
weighted = (per_sample * torch.rand(8)).mean()
```

---

## `torch.optim` — Optimizers

`torch.optim` provides standard gradient-based optimization algorithms. All optimizers follow the same interface: construct with `model.parameters()`, call `optimizer.zero_grad()`, compute loss, call `loss.backward()`, then `optimizer.step()`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(128, 10)

# Three alternatives are shown; each assignment replaces the previous one,
# so the training loop below runs with SGD.

# ── Adam (default for most tasks) ─────────────────────────
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# ── AdamW (Adam with decoupled weight decay, preferred for Transformers) ──
optimizer = optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999),
                        eps=1e-8, weight_decay=0.01)

# ── SGD with momentum ─────────────────────────────────────
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                      nesterov=True, weight_decay=1e-4)

# ── Training loop ─────────────────────────────────────────
criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    x = torch.randn(32, 128)
    labels = torch.randint(0, 10, (32,))

    optimizer.zero_grad()  # clear previous gradients
    logits = model(x)
    loss = criterion(logits, labels)
    loss.backward()        # compute gradients
    optimizer.step()       # update parameters
    print(f"Epoch {epoch}: loss={loss.item():.4f}")
```

---

## `torch.optim.lr_scheduler` — Learning Rate Scheduling

Learning rate schedulers adjust the learning rate during training according to predefined rules, which often improves convergence. They wrap an optimizer and are stepped after each epoch or each batch, depending on the scheduler.
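
As a small worked case of a "predefined rule", step decay has the closed form `lr = base_lr * gamma ** (epoch // step_size)`. The sketch below compares that expression with what `StepLR` actually sets; the toy hyperparameters are chosen only to keep the numbers readable.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

base_lr, step_size, gamma = 0.1, 3, 0.5
optimizer = SGD(nn.Linear(4, 2).parameters(), lr=base_lr)
scheduler = StepLR(optimizer, step_size=step_size, gamma=gamma)

observed = []
for epoch in range(7):
    observed.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients exist here; called only to keep the step order valid
    scheduler.step()   # one scheduler step per epoch

expected = [base_lr * gamma ** (epoch // step_size) for epoch in range(7)]
print(observed)
print(expected)
```

Both lists hold the base rate for epochs 0 to 2, half of it for epochs 3 to 5, and a quarter of it at epoch 6.
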
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (
    CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau, StepLR
)

model = nn.Linear(128, 10)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# The schedulers below share one optimizer purely for illustration;
# in a real run, attach a single scheduler or combine them deliberately.

# ── Step decay: multiply LR by gamma every step_size epochs ──
scheduler_step = StepLR(optimizer, step_size=30, gamma=0.1)

# ── Cosine annealing ──────────────────────────────────────
scheduler_cos = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# ── OneCycleLR (recommended for super-convergence) ────────
num_epochs = 10
steps_per_epoch = 100
scheduler_1cycle = OneCycleLR(
    optimizer, max_lr=1e-2,
    steps_per_epoch=steps_per_epoch, epochs=num_epochs
)

# ── ReduceLROnPlateau (validation-loss-aware) ─────────────
scheduler_plateau = ReduceLROnPlateau(optimizer, mode="min",
                                      factor=0.5, patience=5, min_lr=1e-6)

# ── Usage in training loop ────────────────────────────────
for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        optimizer.zero_grad()
        loss = (model(torch.randn(32, 128)) - torch.randn(32, 10)).pow(2).mean()
        loss.backward()
        optimizer.step()
        scheduler_1cycle.step()       # step per batch for OneCycleLR

    val_loss = loss.item()            # stand-in for a real validation metric
    scheduler_plateau.step(val_loss)  # step per epoch with val metric
    print(f"Epoch {epoch}: lr={optimizer.param_groups[0]['lr']:.2e}")
```

---

## `torch.utils.data.Dataset` and `DataLoader` — Data Pipelines

`Dataset` defines how to access individual samples; `DataLoader` wraps it to provide batching, shuffling, and parallel data loading via worker processes.
```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset, random_split


class TabularDataset(Dataset):
    """Custom map-style dataset for tabular data."""

    def __init__(self, X: np.ndarray, y: np.ndarray):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx: int):
        return self.X[idx], self.y[idx]


# Create dataset
X = np.random.randn(1000, 20).astype(np.float32)
y = np.random.randint(0, 5, 1000)
dataset = TabularDataset(X, y)

# Train / validation split
train_ds, val_ds = random_split(dataset, [800, 200])

# DataLoaders with parallel workers
# (on platforms that spawn workers, run this under `if __name__ == "__main__":`)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True,
                          num_workers=4, pin_memory=True, drop_last=True)
val_loader = DataLoader(val_ds, batch_size=128, shuffle=False, num_workers=2)

for batch_X, batch_y in train_loader:
    # batch_X: [64, 20], batch_y: [64]
    print(batch_X.shape, batch_y.shape)
    break
# torch.Size([64, 20]) torch.Size([64])

# TensorDataset shortcut for in-memory data
tensor_ds = TensorDataset(torch.randn(100, 20), torch.randint(0, 5, (100,)))
loader = DataLoader(tensor_ds, batch_size=32, shuffle=True)
```

---

## `torch.autograd` — Automatic Differentiation

`torch.autograd` computes gradients of scalar outputs with respect to leaf tensors that have `requires_grad=True`. `torch.no_grad()` and `torch.inference_mode()` disable gradient tracking for inference to save memory and compute.
```python
import torch
import torch.autograd as autograd

# ── Basic gradient computation ────────────────────────────
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # y = 4 + 9 = 13
y.backward()
print(x.grad)       # tensor([4., 6.]), since dy/dx = 2x

# ── autograd.grad for functional-style gradients ──────────
w = torch.randn(3, 3, requires_grad=True)
out = (w @ w).trace()
grads = autograd.grad(out, w)
print(grads[0].shape)  # torch.Size([3, 3])

# ── Disable gradient tracking for inference ───────────────
model = torch.nn.Linear(10, 1)
x_val = torch.randn(5, 10)
with torch.no_grad():
    pred = model(x_val)  # no gradient graph built: faster and less memory

# torch.inference_mode() is slightly faster than no_grad for inference
with torch.inference_mode():
    pred = model(x_val)


# ── Custom autograd Function ──────────────────────────────
class Swish(autograd.Function):
    @staticmethod
    def forward(ctx, x):
        sig = torch.sigmoid(x)
        ctx.save_for_backward(sig, x)
        return x * sig

    @staticmethod
    def backward(ctx, grad_output):
        sig, x = ctx.saved_tensors
        # d/dx [x * sigmoid(x)] = sig * (1 + x * (1 - sig))
        return grad_output * (sig * (1 + x * (1 - sig)))


x = torch.randn(4, requires_grad=True)
y = Swish.apply(x)
y.sum().backward()
print(x.grad.shape)  # torch.Size([4])
```

---

## `torch.amp` — Automatic Mixed Precision (AMP)

`torch.amp.autocast` automatically casts operations to float16 (or bfloat16) where safe, reducing memory usage and speeding up computation on modern GPUs. `torch.amp.GradScaler` prevents gradient underflow when training in float16.
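
The underflow that `GradScaler` guards against can be reproduced on CPU with plain tensors. This is an illustration of the idea only, not of the scaler's internals; the scale factor `2**16` is used here as an example value.

```python
import torch

tiny_grad = torch.tensor(1e-8)           # a small float32 "gradient"
as_half = tiny_grad.to(torch.float16)    # below float16's smallest subnormal (about 6e-8)
print(as_half.item())                    # 0.0: the value is lost

scale = 65536.0                          # 2**16, example loss-scale factor
scaled_half = (tiny_grad * scale).to(torch.float16)   # about 6.55e-4, representable
recovered = scaled_half.to(torch.float32) / scale     # unscale in float32
print(recovered.item())                  # close to 1e-8 again
```

Scaling the loss multiplies every gradient by the same factor, so small gradients survive the float16 backward pass and are divided back down before the optimizer step.
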
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import autocast, GradScaler

model = nn.Linear(1024, 512).cuda()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler("cuda")

for _ in range(3):
    x = torch.randn(64, 1024, device="cuda")
    y = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast("cuda"):
        pred = model(x)  # computed in float16
        loss = nn.functional.mse_loss(pred, y)

    # Scale gradients to avoid float16 underflow
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    print(f"loss={loss.item():.4f}, scale={scaler.get_scale()}")

# bfloat16 (preferred on Ampere/Hopper GPUs; no scaling needed)
with autocast("cuda", dtype=torch.bfloat16):
    out = model(torch.randn(64, 1024, device="cuda"))
    print(out.dtype)  # torch.bfloat16
```

---

## `torch.save` / `torch.load` — Model Serialization

`torch.save` serializes Python objects (state dicts, tensors, entire models) to disk using pickle. `torch.load` deserializes them. The recommended practice is to save only the `state_dict` (parameter dictionary) rather than the whole model object.
```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── Save a training checkpoint ────────────────────────────
checkpoint = {
    "epoch": 42,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "loss": 0.1234,
}
torch.save(checkpoint, "checkpoint.pt")

# ── Load and resume training ──────────────────────────────
ckpt = torch.load("checkpoint.pt", weights_only=True)
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1
print(f"Resuming from epoch {start_epoch}")

# ── Save only weights (inference) ─────────────────────────
torch.save(model.state_dict(), "model_weights.pt")

new_model = nn.Linear(128, 10)
new_model.load_state_dict(torch.load("model_weights.pt", weights_only=True))
new_model.eval()

# ── Cross-device loading ──────────────────────────────────
# Load GPU-trained model on CPU
state = torch.load("model_weights.pt", map_location="cpu", weights_only=True)
new_model.load_state_dict(state)
```

---

## `torch.compile` — Graph-Mode Compilation

`torch.compile` (introduced in PyTorch 2.0) compiles a model or function into optimized kernels using TorchDynamo (tracing) and TorchInductor (backend). It usually requires no changes to the model definition. Speedups on modern GPUs can be substantial (the original summary cites 1.5–3×), but the actual gain depends on the model, hardware, and input shapes.
```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)


model = ResBlock(512).cuda()

# ── Compile with default backend (inductor) ───────────────
compiled_model = torch.compile(model)

x = torch.randn(64, 512, device="cuda")
# First call triggers compilation (slow), subsequent calls use compiled kernel
with torch.no_grad():
    out = compiled_model(x)
print(out.shape)  # torch.Size([64, 512])

# ── Compilation modes ─────────────────────────────────────
# "default"         : balance of speed and compile time
# "reduce-overhead" : minimize Python overhead (best for small batches)
# "max-autotune"    : exhaustive kernel tuning (longest compile, fastest runtime)
fast_model = torch.compile(model, mode="max-autotune")


# ── Compile a standalone function ─────────────────────────
@torch.compile
def fused_gelu(x: torch.Tensor) -> torch.Tensor:
    # sigmoid-based approximation of GELU
    return x * torch.sigmoid(1.702 * x)


print(fused_gelu(torch.randn(1024, device="cuda")).shape)  # torch.Size([1024])
```

---

## `torch.fx` — Symbolic Tracing and Graph Transformation

`torch.fx` provides a Python-level intermediate representation (IR) for `nn.Module` instances. It symbolically traces a module to produce a `GraphModule` that can be inspected, transformed, and re-emitted as valid Python code, which enables optimizations like operator fusion, quantization, and custom backends.

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace, GraphModule


class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TwoLayerNet()
traced: GraphModule = symbolic_trace(model)

# Print the IR graph
print(traced.graph)
# graph():
#   %x    : ... = placeholder[target=x]
#   %fc1  : ... = call_module[target=fc1](args=(%x,))
#   %relu : ... = call_function[target=torch.relu](args=(%fc1,))
#   %fc2  : ... = call_module[target=fc2](args=(%relu,))
#   return fc2

# Print the generated Python code
print(traced.code)

# Transform: replace all relu with sigmoid
for node in traced.graph.nodes:
    if node.op == "call_function" and node.target is torch.relu:
        node.target = torch.sigmoid
traced.recompile()

x = torch.randn(8, 64)
out = traced(x)
print(out.shape)  # torch.Size([8, 16])
```

---

## `torch.distributed` — Distributed Training

`torch.distributed` provides primitives for multi-process/multi-GPU training. `DistributedDataParallel` (DDP) is the recommended high-level API that wraps a model to synchronize gradients across GPUs automatically.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def train_worker(rank: int, world_size: int):
    # Initialize process group. "env://" reads MASTER_ADDR and MASTER_PORT
    # from the environment (torchrun sets them; set them yourself with mp.spawn).
    dist.init_process_group(
        backend="nccl",  # "gloo" for CPU; "nccl" for GPU
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    # Single-node assumption: the global rank doubles as the local GPU index.
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    model = nn.Linear(128, 10).to(device)
    model = DDP(model, device_ids=[rank])

    dataset = TensorDataset(
        torch.randn(256, 128), torch.randint(0, 10, (256,))
    )
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # important for correct shuffling
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()  # gradients averaged across GPUs
            optimizer.step()

    dist.destroy_process_group()


# Launch with torchrun (preferred); the script then reads RANK and WORLD_SIZE
# from the environment and calls train_worker with them:
#   torchrun --nproc_per_node=2 script.py
# Or programmatically (for 2 GPUs):
#   mp.spawn(train_worker, args=(2,), nprocs=2, join=True)
```

---

## `torch.profiler` — Performance Profiling

`torch.profiler.profile` records CPU/GPU operator timings, memory usage, and kernel traces. Results can be exported to TensorBoard for visualization or inspected in-process with `key_averages()`.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

model = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
).cuda()
x = torch.randn(64, 1024, device="cuda")

# ── One-shot profile ──────────────────────────────────────
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for _ in range(10):
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Illustrative layout only; real columns and timings vary by version and hardware:
# ┌───────────────────────┬────────────┬─────────────┬────────────┐
# │ Name                  │ CPU total  │ CUDA total  │ # Calls    │
# ├───────────────────────┼────────────┼─────────────┼────────────┤
# │ aten::linear          │ 1.234 ms   │ 0.823 ms    │ 30         │
# └───────────────────────┴────────────┴─────────────┴────────────┘

# ── Scheduled profile with TensorBoard export ─────────────
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
) as prof:
    for step in range(5):
        model(x)
        prof.step()
# View with: tensorboard --logdir=./log/profiler
```

---

## `torch.cuda` — CUDA Device Management

`torch.cuda` exposes CUDA device selection, memory management, stream synchronization, and GPU information queries. It is lazily initialized and safe to import on CPU-only machines.
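
Because `torch.cuda` can be imported and queried without a GPU, device selection can be written once and run anywhere. A minimal sketch follows; `pick_device` is a hypothetical helper name, not part of PyTorch.

```python
import torch


def pick_device() -> torch.device:  # hypothetical helper
    """Return the current CUDA device if one is usable, otherwise the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")


device = pick_device()
x = torch.ones(2, 2, device=device)
print(device.type, torch.cuda.device_count(), x.sum().item())
```

On a CPU-only machine `torch.cuda.device_count()` returns 0 and the tensor is created on the CPU; the same script uses the GPU when one is present.
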
```python
import torch

# ── Device availability and selection ─────────────────────
print(torch.cuda.is_available())      # True / False
print(torch.cuda.device_count())      # e.g. 4
print(torch.cuda.get_device_name(0))  # 'NVIDIA A100-SXM4-80GB'

torch.cuda.set_device(0)
device = torch.device("cuda:0")

# ── Memory management ─────────────────────────────────────
x = torch.randn(1000, 1000, device=device)
print(torch.cuda.memory_allocated())  # bytes currently allocated
print(torch.cuda.memory_reserved())   # bytes reserved by caching allocator
torch.cuda.empty_cache()              # release unused cached memory

# ── Streams for concurrent kernel execution ───────────────
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
with torch.cuda.stream(s1):
    a = torch.randn(512, 512, device=device)
    b = a @ a.T  # runs asynchronously on stream s1
with torch.cuda.stream(s2):
    c = torch.randn(512, 512, device=device)
    d = c @ c.T  # may run concurrently on stream s2
torch.cuda.synchronize()  # wait for all streams to complete

# ── Seed for reproducibility ──────────────────────────────
torch.cuda.manual_seed_all(42)
torch.manual_seed(42)
```

---

## Complete Training Loop — Putting It All Together

A full supervised training workflow integrating `Dataset`, `DataLoader`, `nn.Module`, `optim`, `lr_scheduler`, AMP, gradient clipping, and checkpointing.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
from torch.amp import autocast, GradScaler
from torch.optim.lr_scheduler import CosineAnnealingLR

# ── Reproducibility ───────────────────────────────────────
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ── Data ──────────────────────────────────────────────────
X = torch.randn(2000, 256)
y = torch.randint(0, 10, (2000,))
train_ds, val_ds = random_split(TensorDataset(X, y), [1600, 400])
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=2, pin_memory=True)
val_dl = DataLoader(val_ds, batch_size=128, shuffle=False, num_workers=2)

# ── Model ─────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(256, 512), nn.LayerNorm(512), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(512, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(256, 10),
).to(device)

# ── Optimizer, scheduler, AMP ─────────────────────────────
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = CosineAnnealingLR(optimizer, T_max=20, eta_min=1e-5)
scaler = GradScaler("cuda") if device.type == "cuda" else None
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

best_val_acc = 0.0
for epoch in range(20):
    # ── Training ──────────────────────────────────────────
    model.train()
    total_loss = 0.0
    for xb, yb in train_dl:
        xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
        optimizer.zero_grad()
        if scaler is not None:
            with autocast("cuda"):
                logits = model(xb)
                loss = criterion(logits, yb)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
        else:
            logits = model(xb)
            loss = criterion(logits, yb)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
        total_loss += loss.item()
    scheduler.step()  # once per epoch, matching T_max=20

    # ── Validation ────────────────────────────────────────
    model.eval()
    correct = 0
    with torch.inference_mode():
        for xb, yb in val_dl:
            xb, yb = xb.to(device), yb.to(device)
            correct += (model(xb).argmax(1) == yb).sum().item()
    val_acc = correct / len(val_ds)
    print(f"Epoch {epoch:02d} | loss={total_loss/len(train_dl):.4f} | val_acc={val_acc:.3f}")

    # ── Checkpoint best model ─────────────────────────────
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pt")
```

---

PyTorch is widely used for both research prototyping and production deployment of deep learning models across computer vision (CNNs, ViTs), natural language processing (Transformers, LLMs), reinforcement learning (policy gradient, Q-learning), and scientific computing. Its core design philosophy of immediate execution, Python-native debugging, and gradual graph capture makes it accessible to researchers who need to iterate quickly and to engineers who need production-grade performance. The same codebase that runs interactively in a Jupyter notebook can be compiled with `torch.compile`, quantized with `torch.ao.quantization`, exported via `torch.export`, and served through TorchScript or ONNX.

PyTorch integrates naturally into the broader ML ecosystem: Hugging Face Transformers, TorchVision, TorchAudio, and Lightning all build on its `nn.Module` / `DataLoader` foundations. Distributed training with DDP or FSDP scales from a single workstation GPU to multi-node clusters, and the AMP + `torch.compile` stack narrows the gap between research prototypes and inference-optimized deployments. Its permissive BSD license, extensive documentation, and active open-source community have made it one of the most widely adopted deep learning frameworks in academia and industry.