https://github.com/pytorch/pytorch
# PyTorch

PyTorch is an open-source deep learning framework that provides multi-dimensional tensor computation with strong GPU acceleration and a tape-based automatic differentiation engine for building and training neural networks. Originally developed by Meta AI, it is now governed by the PyTorch Foundation and maintained together with a large open-source community. The library is deeply integrated with Python and offers a NumPy-like tensor API that runs on both CPU and GPU. Because PyTorch builds its computation graph dynamically as the code executes (define-by-run), a network's architecture can change from one call to the next without a recompilation step, which suits research workflows well.

At its core, PyTorch centers on the `torch.Tensor` type: a multi-dimensional array that can track gradients, live on any supported device, and take part in hundreds of mathematical operations. Around this primitive, the library provides `torch.nn` for composable neural-network building blocks, `torch.optim` for gradient-based optimization algorithms, `torch.utils.data` for efficient data loading pipelines, `torch.autograd` for automatic differentiation, `torch.amp` for mixed-precision training, `torch.distributed` for multi-device/multi-node training, `torch.compile` for graph-mode compilation, `torch.fx` for programmatic model transformation, and `torch.profiler` for performance analysis. These subsystems share the same tensor and module abstractions, so one API covers the lifecycle from data ingestion through model definition, training, evaluation, serialization, and deployment.
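
The define-by-run behaviour described above can be illustrated with a module whose forward pass uses ordinary Python control flow. This is a minimal sketch; `DynamicDepthNet` and its `depth` argument are hypothetical names chosen for illustration, not part of PyTorch.

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Hypothetical module that applies one layer a caller-chosen number of times."""
    def __init__(self, dim: int = 4):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        # Plain Python loop: the autograd graph is recorded afresh on each call,
        # so `depth` may differ between calls without any recompilation.
        for _ in range(depth):
            x = torch.tanh(self.layer(x))
        return x

torch.manual_seed(0)
net = DynamicDepthNet()
x = torch.randn(2, 4)

out_shallow = net(x, depth=1)
out_deep = net(x, depth=3)
out_deep.sum().backward()            # backward follows the graph just recorded

print(out_shallow.shape, out_deep.shape)   # torch.Size([2, 4]) torch.Size([2, 4])
print(net.layer.weight.grad.shape)         # torch.Size([4, 4])
```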

---

## `torch.Tensor` — Core Tensor Creation and Operations

`torch.Tensor` is the central data structure in PyTorch. Tensors are n-dimensional arrays that support CPU and GPU computation, automatic differentiation, broadcasting, and a comprehensive set of mathematical operations. Key factory functions include `torch.tensor`, `torch.zeros`, `torch.ones`, `torch.rand`, `torch.randn`, `torch.arange`, and `torch.empty`.

```python
import torch

# ── Creation ──────────────────────────────────────────────────────────────────
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # shape [2, 3]
z = torch.zeros(3, 4, dtype=torch.float32)
r = torch.randn(2, 3, device="cuda" if torch.cuda.is_available() else "cpu")

# ── Indexing and slicing ──────────────────────────────────────────────────────
first_row = x[0]                   # tensor([1., 2., 3.])
col       = x[:, 1]                # tensor([2., 5.])
sub       = x[0:2, 1:3]           # tensor([[2., 3.], [5., 6.]])

# ── Math operations ───────────────────────────────────────────────────────────
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)         # tensor([5., 7., 9.])
print(a @ b)         # dot product → tensor(32.)
print(torch.matmul(x, x.T))  # [2,3] @ [3,2] → [2,2]

# ── Shape manipulation ────────────────────────────────────────────────────────
flat   = x.view(-1)            # tensor([1.,2.,3.,4.,5.,6.])
col3d  = x.unsqueeze(0)        # [1,2,3]
stacked = torch.stack([a, b])  # [2,3]
cat     = torch.cat([a, b])    # [6]

# ── Device movement ───────────────────────────────────────────────────────────
if torch.cuda.is_available():
    x_gpu = x.to("cuda")
    x_cpu = x_gpu.cpu()

# ── Gradient tracking ─────────────────────────────────────────────────────────
w = torch.randn(3, requires_grad=True)
loss = (w * a).sum()
loss.backward()
print(w.grad)   # tensor([1., 2., 3.])
```
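
Broadcasting is mentioned above but not exercised in the example. The following sketch, using small integer tensors chosen purely for illustration, shows how size-1 dimensions are stretched to a common shape.

```python
import torch

col = torch.arange(3).reshape(3, 1)     # shape [3, 1]: [[0], [1], [2]]
row = torch.arange(4).reshape(1, 4)     # shape [1, 4]: [[0, 1, 2, 3]]

# Size-1 dimensions are expanded to match, giving a [3, 4] result
table = col * 10 + row
print(table.shape)        # torch.Size([3, 4])
print(table[2])           # tensor([20, 21, 22, 23])

# Subtracting a per-column mean: [5, 3] - [3] broadcasts across the rows
data = torch.ones(5, 3) * torch.tensor([1.0, 2.0, 3.0])
centered = data - data.mean(dim=0)
print(centered.abs().max())   # tensor(0.)
```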

---

## `torch.nn.Module` — Base Class for All Neural Network Modules

`nn.Module` is the foundation for every neural network component in PyTorch. Subclasses define learnable parameters in `__init__` and implement the forward computation in `forward`. The base class handles parameter registration, device movement (`.to()`), serialization, gradient zeroing, train/eval mode toggling, and hooks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvClassifier(nn.Module):
    """Simple CNN for MNIST-like 28×28 grayscale images."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool  = nn.MaxPool2d(2, 2)
        self.bn1   = nn.BatchNorm2d(32)
        self.dropout = nn.Dropout(0.25)
        self.fc1   = nn.Linear(64 * 7 * 7, 128)
        self.fc2   = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # [B,32,14,14]
        x = self.pool(F.relu(self.conv2(x)))             # [B,64,7,7]
        x = self.dropout(x.flatten(1))                   # [B, 3136]
        x = F.relu(self.fc1(x))                          # [B, 128]
        return self.fc2(x)                               # [B, 10]

model = ConvClassifier()

# Inspect parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total_params:,}")   # Parameters: 421,706

# Move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Forward pass
imgs = torch.randn(8, 1, 28, 28, device=device)  # batch of 8
logits = model(imgs)                               # [8, 10]
print(logits.shape)  # torch.Size([8, 10])

# Switch between training and evaluation modes
model.train()   # enables dropout / batch-norm in train mode
model.eval()    # disables dropout; batch-norm uses running stats
```
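
Hooks are listed among the services of `nn.Module` but do not appear in the example above. As a minimal sketch, a forward hook can capture an intermediate activation; the `captured` dictionary and `save_activation` function are hypothetical helpers introduced here for illustration.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
captured = {}   # hypothetical store for intermediate activations

def save_activation(module, inputs, output):
    # A forward hook receives the module, its positional inputs, and its output
    captured["relu_out"] = output.detach()

handle = net[1].register_forward_hook(save_activation)
_ = net(torch.randn(5, 8))
handle.remove()   # unregister the hook once it is no longer needed

print(captured["relu_out"].shape)          # torch.Size([5, 4])
print((captured["relu_out"] >= 0).all())   # tensor(True)
```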

---

## `torch.nn.Sequential` — Ordered Layer Container

`nn.Sequential` chains modules in order so that the output of each layer feeds into the next. It is the simplest way to build feedforward networks without writing an explicit `forward` method.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)
logits = mlp(x)          # [32, 10]
print(logits.shape)      # torch.Size([32, 10])

# Named children for introspection
for name, layer in mlp.named_children():
    print(name, "->", layer)
# 0 -> Linear(in_features=784, out_features=512, bias=True)
# 1 -> ReLU()  ...
```

---

## `torch.nn.Linear` — Fully Connected Layer

`nn.Linear(in_features, out_features)` applies an affine transformation `y = xAᵀ + b`. It is the building block of multilayer perceptrons, attention projections, and output heads.

```python
import torch
import torch.nn as nn

fc = nn.Linear(in_features=128, out_features=64, bias=True)
print(fc.weight.shape)  # torch.Size([64, 128])
print(fc.bias.shape)    # torch.Size([64])

x = torch.randn(16, 128)   # batch_size=16, in_features=128
y = fc(x)
print(y.shape)             # torch.Size([16, 64])

# Custom weight initialization
nn.init.kaiming_uniform_(fc.weight, nonlinearity="relu")
nn.init.zeros_(fc.bias)
```
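
To make the affine formula concrete, the layer's output can be compared against the same computation written out by hand. This is a small sanity-check sketch with arbitrary sizes, not a description of how the layer is implemented internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc = nn.Linear(in_features=5, out_features=3)
x = torch.randn(4, 5)

# weight is stored as [out_features, in_features], hence the transpose
manual = x @ fc.weight.T + fc.bias
print(torch.allclose(fc(x), manual, atol=1e-5))   # True

# All leading dimensions are treated as batch dimensions
seq = torch.randn(2, 7, 5)
print(fc(seq).shape)                              # torch.Size([2, 7, 3])
```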

---

## `torch.nn.Conv2d` — 2D Convolutional Layer

`nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups)` applies a 2-D cross-correlation over a 4-D input tensor of shape `(N, C_in, H, W)`. It supports grouped/depthwise convolutions via the `groups` argument.

```python
import torch
import torch.nn as nn

# Standard 3×3 convolution with same-padding
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(4, 3, 224, 224)   # NCHW image batch
y = conv(x)
print(y.shape)  # torch.Size([4, 64, 224, 224])

# Depthwise separable convolution (groups=in_channels)
dw = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)
pw = nn.Conv2d(64, 128, kernel_size=1)
out = pw(dw(y))
print(out.shape)  # torch.Size([4, 128, 224, 224])

# Strided convolution for spatial downsampling
down = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
print(down(out).shape)  # torch.Size([4, 256, 112, 112])
```
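
The output shapes in the comments above follow from the output-size formula given in the `nn.Conv2d` documentation. The helper below, `conv_out_size`, is a hypothetical name introduced here to sketch that arithmetic for one spatial axis.

```python
import torch
import torch.nn as nn

def conv_out_size(size: int, kernel: int, stride: int = 1,
                  padding: int = 0, dilation: int = 1) -> int:
    """Output length along one spatial axis (floor division, as in the docs)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv_out_size(224, kernel=3, padding=1))             # 224
print(conv_out_size(224, kernel=3, stride=2, padding=1))   # 112
print(conv_out_size(224, kernel=3, dilation=2))            # 220

# Cross-check against an actual layer on a small input
conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
y = conv(torch.randn(1, 3, 32, 32))
print(y.shape)   # torch.Size([1, 8, 16, 16])
```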

---

## `torch.nn.LSTM` / `torch.nn.GRU` — Recurrent Layers

`nn.LSTM` and `nn.GRU` implement multi-layer gated recurrent networks. They accept sequences of shape `(seq_len, batch, input_size)` (or `(batch, seq_len, input_size)` with `batch_first=True`) and return output sequences plus final hidden states.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(
    input_size=64,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    dropout=0.2,
    bidirectional=True,
)

# x: (batch=8, seq_len=20, input_size=64)
x = torch.randn(8, 20, 64)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 256])  (128*2 bidirectional)
print(h_n.shape)     # torch.Size([4, 8, 128])   (num_layers*2, batch, hidden)

# GRU variant
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
out, h_n = gru(x)
print(out.shape)   # torch.Size([8, 20, 128])

# Packing variable-length sequences
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lengths = torch.tensor([20, 15, 10, 20, 18, 12, 20, 20])  # per-sample lengths
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
unpacked, _ = pad_packed_sequence(packed_out, batch_first=True)
print(unpacked.shape)  # torch.Size([8, 20, 256])
```

---

## `torch.nn.MultiheadAttention` — Scaled Dot-Product Attention

`nn.MultiheadAttention(embed_dim, num_heads)` implements multi-head scaled dot-product attention as described in "Attention Is All You Need". It is the core building block for Transformer models.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, dropout=0.1, batch_first=True)

# Self-attention: Q = K = V = x
x   = torch.randn(4, 32, 512)   # (batch=4, seq_len=32, embed_dim=512)
out, weights = mha(x, x, x)
print(out.shape)     # torch.Size([4, 32, 512])
print(weights.shape) # torch.Size([4, 32, 32])

# Causal (autoregressive) attention mask
seq_len = 32
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out_causal, _ = mha(x, x, x, attn_mask=mask)
print(out_causal.shape)  # torch.Size([4, 32, 512])

# Key padding mask for variable-length inputs
key_padding = torch.zeros(4, 32, dtype=torch.bool)
key_padding[:, 20:] = True      # last 12 positions are padding
out_padded, _ = mha(x, x, x, key_padding_mask=key_padding)
```
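
The scaled dot-product step inside each attention head can be written out directly. The sketch below compares a hand-written version with `torch.nn.functional.scaled_dot_product_attention`, which is assumed to be available (PyTorch 2.0 or later); the tensor sizes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(2, 4, 6, 8)   # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 6, 8)
v = torch.randn(2, 4, 6, 8)

# softmax(Q Kᵀ / sqrt(d)) V, computed per head
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
weights = scores.softmax(dim=-1)        # each row sums to 1
manual = weights @ v

fused = F.scaled_dot_product_attention(q, k, v)
print(manual.shape)                                  # torch.Size([2, 4, 6, 8])
print(torch.allclose(manual, fused, atol=1e-4))      # True
```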

---

## `torch.nn.LayerNorm` / `torch.nn.BatchNorm2d` — Normalization Layers

Normalization layers stabilize and accelerate training. `LayerNorm` normalizes each sample over the last D dimensions of the input (used in Transformers), while `BatchNorm2d` normalizes each channel over the batch and spatial (H, W) dimensions (used in CNNs).

```python
import torch
import torch.nn as nn

# LayerNorm — Transformer style
ln = nn.LayerNorm(normalized_shape=512)
x  = torch.randn(4, 32, 512)   # (batch, seq_len, embed_dim)
print(ln(x).shape)             # torch.Size([4, 32, 512])

# RMSNorm: LLaMA/Llama2 style (no mean-centering, no bias; needs PyTorch >= 2.4)
rms = nn.RMSNorm(normalized_shape=512)
print(rms(x).shape)            # torch.Size([4, 32, 512])

# BatchNorm2d — CNN style
bn = nn.BatchNorm2d(num_features=64)
imgs = torch.randn(8, 64, 28, 28)
print(bn(imgs).shape)          # torch.Size([8, 64, 28, 28])

# GroupNorm — works with any batch size (e.g., batch=1)
gn = nn.GroupNorm(num_groups=8, num_channels=64)
print(gn(imgs).shape)          # torch.Size([8, 64, 28, 28])
```
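
The difference in normalization axes can be checked numerically. This sketch recomputes `LayerNorm` by hand and inspects the per-channel statistics of a `BatchNorm2d` output in training mode; it relies on both layers starting with weight 1 and bias 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# LayerNorm: statistics over the last dimension of each token
x = torch.randn(4, 32, 512)
ln = nn.LayerNorm(512)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased variance
manual = (x - mean) / torch.sqrt(var + ln.eps)
print(torch.allclose(ln(x), manual, atol=1e-5))     # True

# BatchNorm2d (training mode): statistics per channel over (N, H, W)
imgs = torch.randn(8, 64, 28, 28)
bn = nn.BatchNorm2d(64)
out = bn(imgs)
channel_means = out.mean(dim=(0, 2, 3))             # one value per channel
print(channel_means.shape)                          # torch.Size([64])
print(bool(channel_means.abs().max() < 1e-4))       # True
```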

---

## Loss Functions — `torch.nn` Losses

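PyTorch provides the standard loss functions as `nn.Module` subclasses. Depending on the loss, the input is raw logits (`CrossEntropyLoss`, `BCEWithLogitsLoss`) or real-valued predictions (`MSELoss`, `HuberLoss`). Each returns a scalar by default, or a per-element tensor with `reduction='none'`.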

```python
import torch
import torch.nn as nn

# ── Cross-entropy (multi-class classification) ────────────────────────────────
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 10)                   # [batch, classes]
labels = torch.randint(0, 10, (8,))           # integer class indices
loss   = ce(logits, labels)
print(f"CE loss: {loss.item():.4f}")

# ── Binary cross-entropy with logits ─────────────────────────────────────────
bce = nn.BCEWithLogitsLoss()
logits_bin = torch.randn(8, 1)
targets    = torch.randint(0, 2, (8, 1)).float()
print(f"BCE loss: {bce(logits_bin, targets).item():.4f}")

# ── Mean squared error (regression) ──────────────────────────────────────────
mse   = nn.MSELoss()
pred  = torch.randn(16, 1)
truth = torch.randn(16, 1)
print(f"MSE loss: {mse(pred, truth).item():.4f}")

# ── Huber / SmoothL1 (robust regression) ─────────────────────────────────────
huber = nn.HuberLoss(delta=1.0)
print(f"Huber loss: {huber(pred, truth).item():.4f}")

# ── Per-element loss for custom weighting ─────────────────────────────────────
ce_none = nn.CrossEntropyLoss(reduction="none")
per_sample = ce_none(logits, labels)   # shape [8]
weighted   = (per_sample * torch.rand(8)).mean()
```
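
To show what `CrossEntropyLoss` computes from raw logits, the sketch below reproduces it with `log_softmax` followed by picking out the log-probability of the target class. The functional form `F.cross_entropy` is used for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))

# Negative log-probability of the correct class, one value per sample
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(8), labels]

per_sample = F.cross_entropy(logits, labels, reduction="none")
print(per_sample.shape)                                                 # torch.Size([8])
print(torch.allclose(per_sample, manual))                               # True
print(torch.allclose(F.cross_entropy(logits, labels), manual.mean()))   # True
```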

---

## `torch.optim` — Optimizers

`torch.optim` provides standard gradient-based optimization algorithms. All optimizers follow the same interface: construct with `model.parameters()`, call `optimizer.zero_grad()`, compute loss, call `loss.backward()`, then `optimizer.step()`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(128, 10)

# ── Adam (default for most tasks) ────────────────────────────────────────────
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# ── AdamW (Adam with decoupled weight decay — preferred for Transformers) ────
optimizer = optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8,
                        weight_decay=0.01)

# ── SGD with momentum ─────────────────────────────────────────────────────────
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True,
                      weight_decay=1e-4)

# ── Training loop ─────────────────────────────────────────────────────────────
criterion = nn.CrossEntropyLoss()

for epoch in range(3):
    x      = torch.randn(32, 128)
    labels = torch.randint(0, 10, (32,))

    optimizer.zero_grad()        # clear previous gradients
    logits = model(x)
    loss   = criterion(logits, labels)
    loss.backward()              # compute gradients
    optimizer.step()             # update parameters

    print(f"Epoch {epoch}: loss={loss.item():.4f}")
```
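
What a single `optimizer.step()` does is easiest to see with plain SGD on a two-element tensor. This is a minimal sketch with hand-picked values; other optimizers apply more elaborate update rules.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()     # gradient is 2w = [2, -4]
loss.backward()
print(w.grad)             # tensor([ 2., -4.])

opt.step()                # w <- w - lr * grad
print(w.detach())         # approximately [0.8, -1.6]

opt.zero_grad()           # clears gradients (sets them to None in PyTorch 2.x)
```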

---

## `torch.optim.lr_scheduler` — Learning Rate Scheduling

Learning rate schedulers adjust the learning rate during training according to predefined rules, improving convergence. They wrap an optimizer and are stepped after each epoch or batch.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (
    CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau, StepLR
)

model = nn.Linear(128, 10)

def make_optimizer():
    # A fresh optimizer per scheduler: schedulers that share one optimizer
    # overwrite each other's learning rate.
    return optim.AdamW(model.parameters(), lr=1e-3)

# ── Step decay: multiply LR by gamma every step_size epochs ──────────────────
scheduler_step = StepLR(make_optimizer(), step_size=30, gamma=0.1)

# ── Cosine annealing ──────────────────────────────────────────────────────────
scheduler_cos = CosineAnnealingLR(make_optimizer(), T_max=100, eta_min=1e-6)

# ── ReduceLROnPlateau (validation-loss-aware) ─────────────────────────────────
scheduler_plateau = ReduceLROnPlateau(make_optimizer(), mode="min", factor=0.5,
                                       patience=5, min_lr=1e-6)
# Stepped once per epoch with a metric: scheduler_plateau.step(val_loss)

# ── OneCycleLR (recommended for super-convergence) ───────────────────────────
num_epochs  = 10
steps_per_epoch = 100
optimizer = make_optimizer()
scheduler_1cycle = OneCycleLR(
    optimizer, max_lr=1e-2,
    steps_per_epoch=steps_per_epoch, epochs=num_epochs
)

# ── Usage in training loop ───────────────────────────────────────────────────
for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        optimizer.zero_grad()
        loss = (model(torch.randn(32, 128)) - torch.randn(32, 10)).pow(2).mean()
        loss.backward()
        optimizer.step()
        scheduler_1cycle.step()   # step per batch for OneCycleLR

    print(f"Epoch {epoch}: lr={optimizer.param_groups[0]['lr']:.2e}")
```

---

## `torch.utils.data.Dataset` and `DataLoader` — Data Pipelines

`Dataset` defines how to access individual samples; `DataLoader` wraps it to provide batching, shuffling, and parallel data loading via worker processes.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split
import numpy as np

class TabularDataset(Dataset):
    """Custom map-style dataset for tabular data."""
    def __init__(self, X: np.ndarray, y: np.ndarray):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx: int):
        return self.X[idx], self.y[idx]

# Create dataset
X = np.random.randn(1000, 20).astype(np.float32)
y = np.random.randint(0, 5, 1000)
dataset = TabularDataset(X, y)

# Train / validation split
train_ds, val_ds = random_split(dataset, [800, 200])

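# DataLoaders with parallel workers (on Windows/macOS, where workers are
# spawned, run this under an `if __name__ == "__main__":` guard)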
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True,  num_workers=4,
                          pin_memory=True, drop_last=True)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=2)

for batch_X, batch_y in train_loader:
    # batch_X: [64, 20], batch_y: [64]
    print(batch_X.shape, batch_y.shape)
    break  # torch.Size([64, 20]) torch.Size([64])

# TensorDataset shortcut for in-memory data
from torch.utils.data import TensorDataset
tensor_ds = TensorDataset(torch.randn(100, 20), torch.randint(0, 5, (100,)))
loader = DataLoader(tensor_ds, batch_size=32, shuffle=True)
```
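
Batching assumes that samples can be stacked, which fails for variable-length sequences. One common approach, sketched here with hypothetical toy data and a hypothetical `pad_collate` helper, is to pass a custom `collate_fn` that pads each batch to its longest sequence.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy data: (token_ids, label) pairs of different lengths.
# A plain list works as a map-style dataset (it has __len__ and __getitem__).
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

def pad_collate(batch):
    """Pad sequences to the longest in the batch and keep the true lengths."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)
batches = list(loader)
for padded, lengths, labels in batches:
    print(padded.shape, lengths.tolist(), labels.tolist())
# torch.Size([2, 3]) [3, 2] [0, 1]
# torch.Size([2, 4]) [1, 4] [0, 1]
```

The returned `lengths` tensor can then be fed to `pack_padded_sequence` so that recurrent layers skip the padding positions.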

---

## `torch.autograd` — Automatic Differentiation

`torch.autograd` computes gradients of scalar outputs with respect to leaf tensors that have `requires_grad=True`. `torch.no_grad()` and `torch.inference_mode()` disable gradient tracking for inference to save memory and compute.

```python
import torch
import torch.autograd as autograd

# ── Basic gradient computation ─────────────────────────────────────────────
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = 4 + 9 = 13
y.backward()
print(x.grad)        # tensor([4., 6.])  — dy/dx = 2x

# ── autograd.grad for functional-style gradients ────────────────────────────
w = torch.randn(3, 3, requires_grad=True)
out = (w @ w).trace()
grads = autograd.grad(out, w)
print(grads[0].shape)   # torch.Size([3, 3])

# ── Disable gradient tracking for inference ──────────────────────────────────
model = torch.nn.Linear(10, 1)
x_val = torch.randn(5, 10)

with torch.no_grad():
    pred = model(x_val)   # no gradient graph built — faster & less memory

# torch.inference_mode() is slightly faster than no_grad for inference
with torch.inference_mode():
    pred = model(x_val)

# ── Custom autograd Function ─────────────────────────────────────────────────
class Swish(autograd.Function):
    @staticmethod
    def forward(ctx, x):
        sig = torch.sigmoid(x)
        ctx.save_for_backward(sig, x)
        return x * sig

    @staticmethod
    def backward(ctx, grad_output):
        sig, x = ctx.saved_tensors
        return grad_output * (sig * (1 + x * (1 - sig)))

x = torch.randn(4, requires_grad=True)
y = Swish.apply(x)
y.sum().backward()
print(x.grad.shape)   # torch.Size([4])
```
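
Two details from the description above are worth a quick check: `backward()` on a non-scalar output needs an explicit `gradient` argument, and tensors produced under `no_grad` or `inference_mode` do not record a graph. The sketch below uses arbitrary small tensors.

```python
import torch

# Non-scalar output: supply the vector for the vector-Jacobian product
v = torch.tensor([1.0, 2.0], requires_grad=True)
out = v * 3
out.backward(gradient=torch.ones_like(out))
print(v.grad)                         # tensor([3., 3.])

model = torch.nn.Linear(10, 1)
x = torch.randn(5, 10)

tracked = model(x)
print(tracked.requires_grad)          # True: a graph was recorded

with torch.no_grad():
    untracked = model(x)
print(untracked.requires_grad)        # False

with torch.inference_mode():
    inferred = model(x)
print(inferred.is_inference())        # True
```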

---

## `torch.amp` — Automatic Mixed Precision (AMP)

`torch.amp.autocast` automatically casts operations to float16 (or bfloat16) where safe, reducing memory usage and speeding up computation on modern GPUs. `torch.amp.GradScaler` prevents gradient underflow when training in float16.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import autocast, GradScaler

model     = nn.Linear(1024, 512).cuda()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scaler    = GradScaler("cuda")

for _ in range(3):
    x = torch.randn(64, 1024, device="cuda")
    y = torch.randn(64, 512,  device="cuda")

    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast("cuda"):
        pred = model(x)                   # computed in float16
        loss = nn.functional.mse_loss(pred, y)

    # Scale gradients to avoid float16 underflow
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()

    print(f"loss={loss.item():.4f}, scale={scaler.get_scale()}")

# bfloat16 (preferred on Ampere/Hopper GPUs — no scaling needed)
with autocast("cuda", dtype=torch.bfloat16):
    out = model(torch.randn(64, 1024, device="cuda"))
print(out.dtype)   # torch.bfloat16
```

---

## `torch.save` / `torch.load` — Model Serialization

`torch.save` serializes Python objects (state dicts, tensors, entire models) to disk using pickle. `torch.load` deserializes them. The recommended practice is to save only the `state_dict` (parameter dictionary) rather than the whole model object.

```python
import torch
import torch.nn as nn

model     = nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── Save a training checkpoint ────────────────────────────────────────────────
checkpoint = {
    "epoch":           42,
    "model_state":     model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "loss":            0.1234,
}
torch.save(checkpoint, "checkpoint.pt")

# ── Load and resume training ──────────────────────────────────────────────────
ckpt = torch.load("checkpoint.pt", weights_only=True)
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1
print(f"Resuming from epoch {start_epoch}")

# ── Save only weights (inference) ────────────────────────────────────────────
torch.save(model.state_dict(), "model_weights.pt")

new_model = nn.Linear(128, 10)
new_model.load_state_dict(torch.load("model_weights.pt", weights_only=True))
new_model.eval()

# ── Cross-device loading ──────────────────────────────────────────────────────
# Load GPU-trained model on CPU
state = torch.load("model_weights.pt", map_location="cpu", weights_only=True)
new_model.load_state_dict(state)
```

---

## `torch.compile` — Graph-Mode Compilation

`torch.compile` (introduced in PyTorch 2.0) compiles a model or function into optimized kernels using TorchDynamo (graph capture) and TorchInductor (the default backend). It usually requires no changes to the model definition and often yields a noticeable speedup on modern GPUs, although the gain depends heavily on the model, batch size, and hardware.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
    def forward(self, x):
        return x + self.net(x)

model = ResBlock(512).cuda()

# ── Compile with default backend (inductor) ───────────────────────────────────
compiled_model = torch.compile(model)

x = torch.randn(64, 512, device="cuda")

# First call triggers compilation (slow), subsequent calls use compiled kernel
with torch.no_grad():
    out = compiled_model(x)
print(out.shape)  # torch.Size([64, 512])

# ── Compilation modes ─────────────────────────────────────────────────────────
# "default"      — balance of speed and compile time
# "reduce-overhead" — minimize Python overhead (best for small batches)
# "max-autotune"   — exhaustive kernel tuning (longest compile, fastest runtime)
fast_model = torch.compile(model, mode="max-autotune")

# ── Compile a standalone function ─────────────────────────────────────────────
@torch.compile
def fused_gelu(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(1.702 * x)

print(fused_gelu(torch.randn(1024, device="cuda")).shape)  # torch.Size([1024])
```

---

## `torch.fx` — Symbolic Tracing and Graph Transformation

`torch.fx` provides a Python-level intermediate representation (IR) for `nn.Module` instances. It symbolically traces a module to produce a `GraphModule` that can be inspected, transformed, and re-emitted as valid Python code — enabling optimizations like operator fusion, quantization, and custom backends.

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace, GraphModule, Interpreter

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model  = TwoLayerNet()
traced: GraphModule = symbolic_trace(model)

# Print the IR graph
print(traced.graph)
# graph():
#   %x : ... = placeholder[target=x]
#   %fc1 : ... = call_module[target=fc1](args=(%x,))
#   %relu : ... = call_function[target=torch.relu](args=(%fc1,))
#   %fc2 : ... = call_module[target=fc2](args=(%relu,))
#   return fc2

# Print the generated Python code
print(traced.code)

# Transform: replace all relu with sigmoid
for node in traced.graph.nodes:
    if node.op == "call_function" and node.target is torch.relu:
        node.target = torch.sigmoid

traced.recompile()

x   = torch.randn(8, 64)
out = traced(x)
print(out.shape)  # torch.Size([8, 16])
```

---

## `torch.distributed` — Distributed Training

`torch.distributed` provides primitives for multi-process/multi-GPU training. `DistributedDataParallel` (DDP) is the recommended high-level API that wraps a model to synchronize gradients across GPUs automatically.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train_worker(rank: int, world_size: int):
    # Initialize process group
    dist.init_process_group(
        backend="nccl",        # "gloo" for CPU; "nccl" for GPU
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )

    device = torch.device(f"cuda:{rank}")
    model  = nn.Linear(128, 10).to(device)
    model  = DDP(model, device_ids=[rank])

    dataset = TensorDataset(
        torch.randn(256, 128), torch.randint(0, 10, (256,))
    )
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader  = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)          # important for correct shuffling
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()               # gradients averaged across GPUs
            optimizer.step()

    dist.destroy_process_group()

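# Launch with torchrun (preferred). torchrun sets RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT, so the entry point reads them from os.environ:
# torchrun --nproc_per_node=2 script.py

# Or programmatically (for 2 GPUs). With init_method="env://", MASTER_ADDR and
# MASTER_PORT must be set in os.environ before spawning:
# mp.spawn(train_worker, args=(2,), nprocs=2, join=True)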
```

---

## `torch.profiler` — Performance Profiling

`torch.profiler.profile` records CPU/GPU operator timings, memory usage, and kernel traces. Results can be exported to TensorBoard for visualization or inspected in-process with `key_averages()`.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

model = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512,  256), nn.ReLU(),
    nn.Linear(256,  10),
).cuda()

x = torch.randn(64, 1024, device="cuda")

# ── One-shot profile ──────────────────────────────────────────────────────────
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for _ in range(10):
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Illustrative output (columns abbreviated; timings vary by hardware):
# ┌───────────────────────┬────────────┬─────────────┬────────────┐
# │ Name                  │ CPU total  │ CUDA total  │ # Calls    │
# ├───────────────────────┼────────────┼─────────────┼────────────┤
# │ aten::linear          │  1.234 ms  │   0.823 ms  │       30   │
# └───────────────────────┴────────────┴─────────────┴────────────┘

# ── Scheduled profile with TensorBoard export ─────────────────────────────────
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
) as prof:
    for step in range(5):
        model(x)
        prof.step()
# View with: tensorboard --logdir=./log/profiler
```

---

## `torch.cuda` — CUDA Device Management

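`torch.cuda` exposes CUDA device selection, memory management, stream synchronization, and GPU information queries. It is lazily initialized and safe to import on CPU-only machines, although calls that query or use a specific device (such as `get_device_name` or `set_device`) require a GPU to be present.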

```python
import torch

# ── Device availability and selection ────────────────────────────────────────
print(torch.cuda.is_available())          # True / False
print(torch.cuda.device_count())          # e.g. 4
print(torch.cuda.get_device_name(0))      # 'NVIDIA A100-SXM4-80GB'

torch.cuda.set_device(0)
device = torch.device("cuda:0")

# ── Memory management ────────────────────────────────────────────────────────
x = torch.randn(1000, 1000, device=device)
print(torch.cuda.memory_allocated())      # bytes currently allocated
print(torch.cuda.memory_reserved())       # bytes reserved by caching allocator

torch.cuda.empty_cache()                  # release unused cached memory

# ── Streams for concurrent kernel execution ───────────────────────────────────
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

with torch.cuda.stream(s1):
    a = torch.randn(512, 512, device=device)
    b = a @ a.T                     # runs asynchronously on stream s1

with torch.cuda.stream(s2):
    c = torch.randn(512, 512, device=device)
    d = c @ c.T                     # runs concurrently on stream s2

torch.cuda.synchronize()            # wait for all streams to complete

# ── Seed for reproducibility ──────────────────────────────────────────────────
torch.cuda.manual_seed_all(42)
torch.manual_seed(42)
```

---

## Complete Training Loop — Putting It All Together

A full supervised training workflow integrating `Dataset`, `DataLoader`, `nn.Module`, `optim`, `lr_scheduler`, AMP, gradient clipping, and checkpointing.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
from torch.amp import autocast, GradScaler
from torch.optim.lr_scheduler import CosineAnnealingLR

# ── Reproducibility ───────────────────────────────────────────────────────────
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ── Data ──────────────────────────────────────────────────────────────────────
X = torch.randn(2000, 256)
y = torch.randint(0, 10, (2000,))
train_ds, val_ds = random_split(TensorDataset(X, y), [1600, 400])
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True,  num_workers=2, pin_memory=True)
val_dl   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=2)

# ── Model ─────────────────────────────────────────────────────────────────────
model = nn.Sequential(
    nn.Linear(256, 512), nn.LayerNorm(512), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(512, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(256,  10),
).to(device)

# ── Optimizer, scheduler, AMP ─────────────────────────────────────────────────
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = CosineAnnealingLR(optimizer, T_max=20, eta_min=1e-5)
scaler    = GradScaler("cuda") if device.type == "cuda" else None
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

best_val_acc = 0.0

for epoch in range(20):
    # ── Training ──────────────────────────────────────────────────────────────
    model.train()
    total_loss = 0.0
    for xb, yb in train_dl:
        xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
        optimizer.zero_grad()
        if scaler:
            with autocast("cuda"):
                logits = model(xb)
                loss   = criterion(logits, yb)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
        else:
            logits = model(xb)
            loss   = criterion(logits, yb)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
        total_loss += loss.item()

    scheduler.step()

    # ── Validation ────────────────────────────────────────────────────────────
    model.eval()
    correct = 0
    with torch.inference_mode():
        for xb, yb in val_dl:
            xb, yb = xb.to(device), yb.to(device)
            correct += (model(xb).argmax(1) == yb).sum().item()
    val_acc = correct / len(val_ds)

    print(f"Epoch {epoch:02d} | loss={total_loss/len(train_dl):.4f} | val_acc={val_acc:.3f}")

    # ── Checkpoint best model ─────────────────────────────────────────────────
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pt")
```

---

PyTorch is widely used for both research prototyping and production deployment of deep learning models across computer vision (CNNs, ViTs), natural language processing (Transformers, LLMs), reinforcement learning (policy gradient, Q-learning), and scientific computing. Its core design philosophy of immediate execution, Python-native debugging, and gradual graph capture makes it accessible to both researchers who need to iterate quickly and engineers who need production-grade performance. The same codebase that runs interactively in a Jupyter notebook can be compiled with `torch.compile`, quantized with `torch.ao.quantization`, exported via `torch.export`, and served through TorchScript or ONNX.

PyTorch integrates naturally into the broader ML ecosystem: Hugging Face Transformers, TorchVision, TorchAudio, and Lightning all build on its `nn.Module` / `DataLoader` foundations. Distributed training with DDP or FSDP scales from a single workstation GPU to hundreds of nodes with few code changes, and the AMP + `torch.compile` stack narrows the gap between research prototypes and inference-optimized deployments. Its permissive BSD-style license, extensive documentation, and active open-source community have made it one of the most widely adopted deep learning frameworks in both academia and industry.