### Install Entmax Package

Source: https://github.com/deep-spin/entmax/blob/master/README.md

Standard pip installation command for the entmax package.

```bash
pip install entmax
```

--------------------------------

### Project onto k-Subsets Budget with budget_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Projects onto a k-subsets budget constraint, where each output is in [0,1] and the outputs sum to the budget. Useful for multi-label classification. Budget=1 is equivalent to standard sparsemax. Requires torch and budget_bisect from entmax.

```python
import torch
from entmax import budget_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# budget=1 is equivalent to standard sparsemax
print(budget_bisect(x, budget=1, dim=0))
# tensor([0.0000, 0.2500, 0.7500])

# budget=2: outputs sum to 2, with each entry capped at 1
print(budget_bisect(x, budget=2, dim=0))
# tensor([0.0000, 1.0000, 1.0000])

# Multi-label scenario: predict top-3 labels from 10 classes
logits = torch.randn(16, 10)  # batch=16, classes=10
label_probs = budget_bisect(logits, budget=3, dim=1)
print(label_probs.sum(dim=1))  # all approximately 3.0
print((label_probs > 0).sum(dim=1))  # number of active labels
```

--------------------------------

### Use Parametric nn.Module Wrappers for Bisection Transforms

Source: https://context7.com/deep-spin/entmax/llms.txt

nn.Module wrappers for bisection-based transforms, supporting learnable or fixed alpha parameters. Supports EntmaxBisect, NormmaxBisect, BudgetBisect, and SparsemaxBisect. Requires torch, nn, and the respective classes from entmax.
```python
import torch
import torch.nn as nn
from entmax import (
    EntmaxBisect, NormmaxBisect, BudgetBisect, SparsemaxBisect, entmax_bisect
)

# Learnable per-head alpha, passed explicitly to the functional form
class AdaptiveSparseAttention(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # One alpha per attention head, initialized to 1.5
        self.alpha = nn.Parameter(torch.full((n_heads, 1, 1), 1.5))

    def forward(self, scores):
        # scores: (batch, heads, seq, seq); alpha broadcasts over batch and rows
        return entmax_bisect(scores, alpha=self.alpha, dim=-1)

# Direct module usage
sparse_layer = EntmaxBisect(alpha=1.7, dim=-1, n_iter=30)
normmax_layer = NormmaxBisect(alpha=3.0, dim=-1)
budget_layer = BudgetBisect(budget=5, dim=-1)
bisect_sparse = SparsemaxBisect(dim=-1)

x = torch.randn(4, 20)  # batch=4, vocab/classes=20
print(sparse_layer(x).sum(dim=1))   # all ~1.0
print(normmax_layer(x).sum(dim=1))  # all ~1.0
print(budget_layer(x).sum(dim=1))   # all ~5.0
print(bisect_sparse(x).sum(dim=1))  # all ~1.0
```

--------------------------------

### EntmaxBisect / NormmaxBisect / BudgetBisect / SparsemaxBisect Modules

Source: https://context7.com/deep-spin/entmax/llms.txt

nn.Module wrappers for bisection-based transforms, supporting learnable or fixed alpha parameters.

```APIDOC
## EntmaxBisect / NormmaxBisect / BudgetBisect / SparsemaxBisect

### Description
`nn.Module` wrappers for the bisection-based transforms, supporting learnable or fixed alpha parameters.

### Parameters
- **alpha** (float, optional) - The alpha parameter for EntmaxBisect and NormmaxBisect. Defaults to 1.5 for EntmaxBisect and 3.0 for NormmaxBisect.
- **budget** (float, optional) - The budget parameter for BudgetBisect. Defaults to 5.
- **dim** (int) - The dimension along which to compute the transform.
- **n_iter** (int, optional) - Number of iterations for the bisection method. Defaults to 30.

### Usage Example
```python
import torch
import torch.nn as nn
from entmax import (
    EntmaxBisect, NormmaxBisect, BudgetBisect, SparsemaxBisect, entmax_bisect
)

# Learnable per-head alpha, passed explicitly to the functional form
class AdaptiveSparseAttention(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # One alpha per attention head, initialized to 1.5
        self.alpha = nn.Parameter(torch.full((n_heads, 1, 1), 1.5))

    def forward(self, scores):
        # scores: (batch, heads, seq, seq); alpha broadcasts over batch and rows
        return entmax_bisect(scores, alpha=self.alpha, dim=-1)

# Direct module usage
sparse_layer = EntmaxBisect(alpha=1.7, dim=-1, n_iter=30)
normmax_layer = NormmaxBisect(alpha=3.0, dim=-1)
budget_layer = BudgetBisect(budget=5, dim=-1)
bisect_sparse = SparsemaxBisect(dim=-1)

x = torch.randn(4, 20)  # batch=4, vocab/classes=20
print(sparse_layer(x).sum(dim=1))   # all ~1.0
print(normmax_layer(x).sum(dim=1))  # all ~1.0
print(budget_layer(x).sum(dim=1))   # all ~5.0
print(bisect_sparse(x).sum(dim=1))  # all ~1.0
```
```

--------------------------------

### Use Sparsemax/Entmax15 as nn.Module Replacements for Softmax

Source: https://context7.com/deep-spin/entmax/llms.txt

Drop-in nn.Module replacements for softmax, usable anywhere nn.Softmax is used, such as in attention mechanisms. Requires torch, nn, and Sparsemax/Entmax15 from entmax.
```python
import torch
import torch.nn as nn
from entmax import Sparsemax, Entmax15

# Replace nn.Softmax in attention
class SparseAttention(nn.Module):
    def __init__(self, d_model, n_heads, sparse_type="entmax15"):
        super().__init__()
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        if sparse_type == "sparsemax":
            self.attn = Sparsemax(dim=-1)
        else:
            self.attn = Entmax15(dim=-1)

    def forward(self, x):
        B, T, C = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        weights = self.attn(scores)  # sparse attention weights
        return (weights @ v).transpose(1, 2).reshape(B, T, C)

model = SparseAttention(d_model=64, n_heads=4, sparse_type="entmax15")
x = torch.randn(2, 10, 64)
out = model(x)
print(out.shape)  # torch.Size([2, 10, 64])
```

--------------------------------

### SparsemaxLoss in a Training Loop

Source: https://context7.com/deep-spin/entmax/llms.txt

Integrate SparsemaxLoss into a standard PyTorch training loop. Zero the gradients before the forward pass and step the optimizer after backpropagation.

```python
import torch
import torch.nn as nn
from entmax import SparsemaxLoss

model = nn.Linear(128, 50)  # 50-class classifier
optimizer = torch.optim.Adam(model.parameters())
criterion = SparsemaxLoss(reduction="elementwise_mean")

x = torch.randn(32, 128)
y = torch.randint(0, 50, (32,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item():.4f}")
```

--------------------------------

### Entmax Probability Mappings Comparison

Source: https://github.com/deep-spin/entmax/blob/master/README.md

Compares the output of softmax, sparsemax, entmax15, normmax_bisect, and budget_bisect for a given input tensor.
Useful for understanding the sparsity patterns of different entmax variants.

```python
import torch
from torch.nn.functional import softmax
from entmax import sparsemax, entmax15, entmax_bisect, normmax_bisect, budget_bisect

x = torch.tensor([-2, 0, 0.5])

softmax(x, dim=0)
sparsemax(x, dim=0)                    # tensor([0.0000, 0.2500, 0.7500])
entmax15(x, dim=0)                     # tensor([0.0000, 0.3260, 0.6740])
normmax_bisect(x, alpha=2, dim=0)      # tensor([0.0000, 0.3110, 0.6890])
normmax_bisect(x, alpha=1000, dim=0)   # tensor([0.0000, 0.4997, 0.5003])
budget_bisect(x, budget=2, dim=0)      # tensor([0.0000, 1.0000, 1.0000])
```

--------------------------------

### budget_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Projects onto a k-subsets budget constraint, where each output value is in [0,1] and the outputs sum to `budget`. Useful for multi-label classification.

```APIDOC
## budget_bisect

### Description
Projects onto a k-subsets budget constraint: each output value is in [0,1] and the outputs sum to `budget`. Useful for multi-label classification where the expected number of active labels is known.

### Parameters
- **x** (torch.Tensor) - Input tensor.
- **budget** (float) - The budget constraint for the sum of output values.
- **dim** (int) - The dimension along which to compute the projection.

### Request Example
```python
import torch
from entmax import budget_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# budget=1 is equivalent to standard sparsemax
print(budget_bisect(x, budget=1, dim=0))
# tensor([0.0000, 0.2500, 0.7500])

# budget=2: outputs sum to 2, with each entry capped at 1
print(budget_bisect(x, budget=2, dim=0))
# tensor([0.0000, 1.0000, 1.0000])

# Multi-label scenario: predict top-3 labels from 10 classes
logits = torch.randn(16, 10)  # batch=16, classes=10
label_probs = budget_bisect(logits, budget=3, dim=1)
print(label_probs.sum(dim=1))  # all approximately 3.0
print((label_probs > 0).sum(dim=1))  # number of active labels
```

### Response
- **output** (torch.Tensor) - The resulting budget-sparse projection.
```

--------------------------------

### SparsemaxLoss Module with Reduction and Ignore Index

Source: https://context7.com/deep-spin/entmax/llms.txt

Use the SparsemaxLoss module to compute the sparsemax loss. Set `k` to None for a full sort, or specify `k` for a partial-sort speedup. Targets equal to `ignore_index` are ignored.

```python
import torch
from entmax import SparsemaxLoss

# Sample inputs so the snippet runs on its own
logits = torch.randn(8, 50, requires_grad=True)
targets = torch.randint(0, 50, (8,))

criterion = SparsemaxLoss(
    k=None,  # full sort; set k for partial-sort speedup
    ignore_index=-100,
    reduction="elementwise_mean"
)
mean_loss = criterion(logits, targets)
mean_loss.backward()
```

--------------------------------

### Entmax Gradients w.r.t. Alpha

Source: https://github.com/deep-spin/entmax/blob/master/README.md

Demonstrates how to compute gradients of entmax probabilities with respect to the alpha parameter. This is useful for implementing adaptive, learned sparsity.

```python
import torch
from torch.autograd import grad
from entmax import entmax_bisect

x = torch.tensor([[-1, 0, 0.5], [1, 2, 3.5]])
alpha = torch.tensor(1.33, requires_grad=True)

p = entmax_bisect(x, alpha)
grad(p[0, 0], alpha)  # gradient of one output probability w.r.t. alpha
```

--------------------------------

### Sparsemax via Bisection Algorithm

Source: https://context7.com/deep-spin/entmax/llms.txt

A bisection-based implementation of sparsemax, equivalent to alpha=2 entmax. Useful when a purely bisection-based pipeline is preferred. Allows control over precision with n_iter.

```python
import torch
from entmax import sparsemax_bisect

X = torch.tensor([[-2.0, 0.0, 0.5],
                  [ 1.0, 2.0, 3.5]])

p = sparsemax_bisect(X, dim=-1)
# tensor([[0.0000, 0.2500, 0.7500],
#         [0.0000, 0.0000, 1.0000]])
# In the second row the top score leads by more than 1, so it takes all the mass.

# Control precision with n_iter (24 sufficient for float32 machine precision)
p_precise = sparsemax_bisect(X, dim=-1, n_iter=24)
```

--------------------------------

### 1.5-Entmax Sparse Activation

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the 1.5-entmax probability mapping, which is sparser than softmax but less extreme than sparsemax. Uses an exact partial-sort algorithm. Supports batched input and returning the support size.
```python
import torch
from entmax import entmax15

x = torch.tensor([-2.0, 0.0, 0.5])
p = entmax15(x, dim=0)
# tensor([0.0000, 0.3260, 0.6740])
# Sparser than softmax, less so than sparsemax

# Batch of logits (e.g., attention scores in a transformer)
attn_scores = torch.randn(4, 16)  # batch=4, seq_len=16
attn_weights = entmax15(attn_scores, dim=1)
assert torch.allclose(attn_weights.sum(dim=1), torch.ones(4), atol=1e-5)

# Many entries will be exactly zero
print((attn_weights == 0).sum(dim=1))  # number of exact zeros per row

# With support size
p, support = entmax15(attn_scores, dim=1, return_support_size=True)
print(support.squeeze())  # number of nonzero attention positions
```

--------------------------------

### sparsemax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Bisection-based implementation of sparsemax, equivalent to alpha=2 entmax. Useful when a purely bisection-based pipeline is preferred.

```APIDOC
## sparsemax_bisect

### Description
Bisection-based implementation of sparsemax (equivalent to alpha=2 entmax). Useful when a purely bisection-based pipeline is preferred.

### Parameters
- **dim** (int) - The dimension along which to compute the sparsemax.
- **n_iter** (int, optional) - The number of iterations for the bisection algorithm. Defaults to a value sufficient for float32 machine precision.

### Request Example
```python
import torch
from entmax import sparsemax_bisect

X = torch.tensor([[-2.0, 0.0, 0.5],
                  [ 1.0, 2.0, 3.5]])

p = sparsemax_bisect(X, dim=-1)

# Control precision with n_iter
p_precise = sparsemax_bisect(X, dim=-1, n_iter=24)
```

### Response
#### Success Response
- **p** (Tensor) - The sparsemax probability distribution.
```

--------------------------------

### EntmaxBisectLoss Module Usage

Source: https://context7.com/deep-spin/entmax/llms.txt

Instantiate and use the `EntmaxBisectLoss` module for calculating the alpha-entmax Fenchel-Young loss.
Configure parameters like alpha, number of iterations (`n_iter`), and reduction method.

```python
import torch
from entmax import EntmaxBisectLoss

# Sample inputs so the snippet runs on its own
logits = torch.randn(8, 100, requires_grad=True)  # 8 examples, 100 classes
targets = torch.randint(0, 100, (8,))

# Module form
criterion = EntmaxBisectLoss(alpha=1.7, n_iter=50, reduction="elementwise_mean")
mean_loss = criterion(logits, targets)
mean_loss.backward()
```

--------------------------------

### EntmaxBisectLoss with Learnable Alpha

Source: https://context7.com/deep-spin/entmax/llms.txt

Enable gradient flow through the alpha parameter for learnable sparsity levels. The `entmax_bisect_loss` function accepts a `torch.Tensor` alpha that requires gradients.

```python
import torch
from entmax import entmax_bisect_loss

# Sample inputs so the snippet runs on its own
logits = torch.randn(8, 100)  # 8 examples, 100 classes
targets = torch.randint(0, 100, (8,))

# Learnable alpha: gradient flows through alpha
alpha = torch.tensor(1.5, requires_grad=True)
loss = entmax_bisect_loss(logits, targets, alpha=alpha)
loss.sum().backward()
print(alpha.grad)  # gradient w.r.t. sparsity level
```

--------------------------------

### Sparse Softmax Activation with Sparsemax

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes a sparse probability distribution by Euclidean (L2) projection onto the probability simplex. Produces exact zeros for low-scoring entries. Supports batched input and returning the support size.
```python
import torch
from entmax import sparsemax

x = torch.tensor([-2.0, 0.0, 0.5])

# Basic usage
p = sparsemax(x, dim=0)
# tensor([0.0000, 0.2500, 0.7500])

# Batched input along dim=1
X = torch.tensor([[-2.0, 0.0, 0.5],
                  [1.0, 2.0, 3.5]])
p_batch = sparsemax(X, dim=1)
# tensor([[0.0000, 0.2500, 0.7500],
#         [0.0000, 0.0000, 1.0000]])
# In the second row the top score leads by more than 1, so it takes all the mass.

# With partial-sort hint for efficiency (k slightly > expected nonzeros)
p_fast = sparsemax(X, dim=1, k=3)

# Return support size (number of nonzeros)
p, support = sparsemax(X, dim=1, return_support_size=True)
print(p)        # tensor([[0.0000, 0.2500, 0.7500], ...])
print(support)  # tensor([[2], [1]])

# Verify it's a valid probability distribution
assert torch.allclose(p.sum(dim=1), torch.ones(2))
```

--------------------------------

### Sparsemax / Entmax15 Modules

Source: https://context7.com/deep-spin/entmax/llms.txt

PyTorch nn.Module replacements for nn.Softmax, usable in attention mechanisms and other layers.

```APIDOC
## Sparsemax / Entmax15

### Description
PyTorch `nn.Module` wrappers that can be used anywhere `nn.Softmax` would be used, e.g. directly replacing attention softmax.

### Parameters
- **dim** (int) - The dimension along which to compute the transform.

### Usage Example
```python
import torch
import torch.nn as nn
from entmax import Sparsemax, Entmax15

# Replace nn.Softmax in attention
class SparseAttention(nn.Module):
    def __init__(self, d_model, n_heads, sparse_type="entmax15"):
        super().__init__()
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        if sparse_type == "sparsemax":
            self.attn = Sparsemax(dim=-1)
        else:
            self.attn = Entmax15(dim=-1)

    def forward(self, x):
        B, T, C = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        weights = self.attn(scores)  # sparse attention weights
        return (weights @ v).transpose(1, 2).reshape(B, T, C)

model = SparseAttention(d_model=64, n_heads=4, sparse_type="entmax15")
x = torch.randn(2, 10, 64)
out = model(x)
print(out.shape)  # torch.Size([2, 10, 64])
```
```

--------------------------------

### Compute Alpha-Normmax Projection with normmax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the alpha-normmax projection. Note that in the sample output, a very large alpha (1000) gives a near-uniform distribution over the highest-scoring entries, not a one-hot argmax (hardmax). Requires torch and normmax_bisect from entmax.
```python
import torch
from entmax import normmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# alpha=2 is similar to sparsemax
print(normmax_bisect(x, alpha=2, dim=0))
# tensor([0.0000, 0.3110, 0.6890])

# Very large alpha: mass spreads almost evenly over the top entries
print(normmax_bisect(x, alpha=1000, dim=0))
# tensor([0.0000, 0.4997, 0.5003])

# Batched input with a single shared alpha
X = torch.randn(8, 32)
p = normmax_bisect(X, alpha=3.0, dim=1)
assert torch.allclose(p.sum(dim=1), torch.ones(8), atol=1e-5)
```

--------------------------------

### sparsemax_loss / SparsemaxLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

Fenchel-Young sparsemax loss, a sparse alternative to cross-entropy.

```APIDOC
## sparsemax_loss / SparsemaxLoss

### Description
Sparse alternative to cross-entropy based on the sparsemax transform. Produces exactly zero loss when the target class receives all probability mass. Functional and module forms available.

### Parameters
- **logits** (torch.Tensor) - The input tensor of logits.
- **targets** (torch.Tensor) - The target class indices.

### Request Example
```python
import torch
import torch.nn as nn
from entmax import sparsemax_loss, SparsemaxLoss

# Functional form
logits = torch.tensor([[ 0.5, 2.0, -1.0],
                       [ 1.0, -0.5, 3.0]])
targets = torch.tensor([1, 2])

losses = sparsemax_loss(logits, targets)  # shape: (2,)
print(losses)  # per-sample losses
```

### Response
- **losses** (torch.Tensor) - A tensor containing the per-sample sparsemax loss.
```

--------------------------------

### entmax15

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the 1.5-entmax probability mapping, which is sparser than softmax but less extreme than sparsemax. It uses an exact partial-sort algorithm.

```APIDOC
## entmax15

### Description
Computes the 1.5-entmax probability mapping, which is sparser than softmax but less extreme than sparsemax. Solves `max_p <x, p> + H_1.5(p)` over the probability simplex, where H_1.5 is the Tsallis 1.5-entropy. Uses an exact partial-sort algorithm.

### Parameters
- **dim** (int) - The dimension along which to compute the entmax15.
- **return_support_size** (bool, optional) - If True, also returns the number of non-zero elements.

### Request Example
```python
import torch
from entmax import entmax15

x = torch.tensor([-2.0, 0.0, 0.5])
p = entmax15(x, dim=0)

# Batch of logits (e.g., attention scores in a transformer)
attn_scores = torch.randn(4, 16)
attn_weights = entmax15(attn_scores, dim=1)

# With support size
p, support = entmax15(attn_scores, dim=1, return_support_size=True)
```

### Response
#### Success Response
- **p** (Tensor) - The 1.5-entmax probability distribution.
- **support** (Tensor, optional) - The number of non-zero elements in the distribution.
```

--------------------------------

### Generic Alpha-Entmax with Bisection

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes alpha-entmax for any alpha > 1 using a bisection algorithm. This function is differentiable with respect to alpha, enabling learned sparsity. Supports fixed or per-row alpha values.

```python
import torch
from torch.autograd import grad
from entmax import entmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# Fixed alpha values
print(entmax_bisect(x, alpha=1.5, dim=0))  # similar to entmax15
# tensor([0.0000, 0.3260, 0.6740])
print(entmax_bisect(x, alpha=2.0, dim=0))  # similar to sparsemax
# tensor([0.0000, 0.2500, 0.7500])

# Learnable alpha: gradient flows through alpha
X = torch.tensor([[-1.0, 0.0, 0.5],
                  [ 1.0, 2.0, 3.5]])
alpha = torch.tensor(1.33, requires_grad=True)
p = entmax_bisect(X, alpha)
# tensor([[0.0460, 0.3276, 0.6264],
#         [0.0026, 0.1012, 0.8963]], grad_fn=...)

# Compute gradient w.r.t. alpha; retain the graph because p is reused below
g = grad(p[0, 0], alpha, retain_graph=True)
print(g)  # (tensor(-0.2562),)

# Per-row alpha (adaptive sparsity per attention head)
alpha_per_row = torch.tensor([[1.5], [2.0]])  # shape (batch, 1)
p_adaptive = entmax_bisect(X, alpha=alpha_per_row, dim=1)

# Use in a training loop with learnable alpha
optimizer_alpha = torch.optim.Adam([alpha], lr=0.01)
loss = -p[0, 2]  # dummy loss
loss.backward()
optimizer_alpha.step()
```

--------------------------------

### Entmax15Loss Functional and Module Usage

Source: https://context7.com/deep-spin/entmax/llms.txt

Calculate 1.5-entmax Fenchel-Young loss using either the functional interface or the Entmax15Loss module. The module can also return the support size when `return_support_size` is True.

```python
import torch
from entmax import entmax15_loss, Entmax15Loss

logits = torch.tensor([[ 0.5, 2.0, -1.0],
                       [ 1.0, -0.5, 3.0]])
targets = torch.tensor([1, 2])

# Functional
losses = entmax15_loss(logits, targets)
print(losses)  # sparse per-sample losses, shape (2,)
```

```python
# Module, also returns support size
criterion = Entmax15Loss(
    k=100,
    reduction="elementwise_mean",
    return_support_size=True
)
mean_loss, support = criterion(logits, targets)
print(f"Mean loss: {mean_loss.item():.4f}")
print(f"Support size: {support}")
```

```python
# Gradient check
logits.requires_grad_(True)
loss = entmax15_loss(logits, targets)
loss.sum().backward()
print(logits.grad)  # sparse gradient: only nonzero for support entries
```

--------------------------------

### sparsemax

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes a sparse probability distribution by projecting the input onto the probability simplex using the L2 norm. It produces exact zeros for low-scoring entries and uses an efficient partial-sort algorithm.

```APIDOC
## sparsemax

### Description
Computes a sparse probability distribution by projecting the input onto the probability simplex using the L2 norm. Produces exact zeros for low-scoring entries.
Uses an efficient partial-sort algorithm.

### Parameters
- **dim** (int) - The dimension along which to compute the sparsemax.
- **k** (int, optional) - A hint for the partial-sort algorithm for efficiency. Should be slightly larger than the expected number of non-zero elements.
- **return_support_size** (bool, optional) - If True, also returns the number of non-zero elements.

### Request Example
```python
import torch
from entmax import sparsemax

x = torch.tensor([-2.0, 0.0, 0.5])

# Basic usage
p = sparsemax(x, dim=0)

# Batched input along dim=1
X = torch.tensor([[-2.0, 0.0, 0.5],
                  [1.0, 2.0, 3.5]])
p_batch = sparsemax(X, dim=1)

# With partial-sort hint for efficiency
p_fast = sparsemax(X, dim=1, k=3)

# Return support size (number of nonzeros)
p, support = sparsemax(X, dim=1, return_support_size=True)
```

### Response
#### Success Response
- **p** (Tensor) - The sparse probability distribution.
- **support** (Tensor, optional) - The number of non-zero elements in the distribution.
```

--------------------------------

### NormmaxBisectLoss Functional and Module Usage

Source: https://context7.com/deep-spin/entmax/llms.txt

Calculate alpha-normmax Fenchel-Young loss using either the functional `normmax_bisect_loss` or the `NormmaxBisectLoss` module. Configure the alpha value and the number of iterations for the bisection method.

```python
import torch
from entmax import normmax_bisect_loss, NormmaxBisectLoss

# requires_grad=True so that the backward() call in the module example has a leaf to reach
logits = torch.randn(16, 50, requires_grad=True)
targets = torch.randint(0, 50, (16,))

# Functional
loss = normmax_bisect_loss(logits, targets, alpha=2.0)
print(loss)  # per-sample losses, shape (16,)
```

```python
# Module
criterion = NormmaxBisectLoss(
    alpha=3.0,
    n_iter=50,
    reduction="elementwise_mean"
)
mean_loss = criterion(logits, targets)
mean_loss.backward()
```

--------------------------------

### normmax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the alpha-normmax projection.
As alpha grows large, the output tends toward a near-uniform distribution over the highest-scoring entries (see the alpha=1000 output in the example) rather than toward argmax (hardmax).

```APIDOC
## normmax_bisect

### Description
Computes the alpha-normmax projection: `max_p <x, p> - ||p||_alpha` over the probability simplex, with the alpha-norm as regularizer. As alpha → ∞ the regularizer becomes the max-norm, which favors spreading mass evenly over the top entries; it does not approach argmax (hardmax).

### Parameters
- **x** (torch.Tensor) - Input tensor.
- **alpha** (float) - The alpha parameter for the normmax projection. Higher values flatten the distribution over the selected entries.
- **dim** (int) - The dimension along which to compute the projection.

### Request Example
```python
import torch
from entmax import normmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# alpha=2 is similar to sparsemax
print(normmax_bisect(x, alpha=2, dim=0))
# tensor([0.0000, 0.3110, 0.6890])

# Very large alpha: mass spreads almost evenly over the top entries
print(normmax_bisect(x, alpha=1000, dim=0))
# tensor([0.0000, 0.4997, 0.5003])

# Batched input with a single shared alpha
X = torch.randn(8, 32)
p = normmax_bisect(X, alpha=3.0, dim=1)
assert torch.allclose(p.sum(dim=1), torch.ones(8), atol=1e-5)
```

### Response
- **output** (torch.Tensor) - The resulting alpha-normmax projection.
```

--------------------------------

### entmax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes alpha-entmax for any alpha > 1 using a bisection algorithm. This function is differentiable with respect to alpha, enabling learned sparsity.

```APIDOC
## entmax_bisect

### Description
Computes alpha-entmax for any alpha > 1 using a bisection algorithm. This is the most general activation; alpha=1.5 matches `entmax15`, alpha=2 matches `sparsemax`. Critically, this function is **differentiable with respect to alpha**, enabling learned sparsity.

### Parameters
- **alpha** (float or Tensor) - The exponent for the entmax function. Can be a scalar or a tensor for per-row alpha values.
- **dim** (int) - The dimension along which to compute the entmax.

### Request Example
```python
import torch
from torch.autograd import grad
from entmax import entmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# Fixed alpha values
print(entmax_bisect(x, alpha=1.5, dim=0))
print(entmax_bisect(x, alpha=2.0, dim=0))

# Learnable alpha: gradient flows through alpha
X = torch.tensor([[-1.0, 0.0, 0.5],
                  [ 1.0, 2.0, 3.5]])
alpha = torch.tensor(1.33, requires_grad=True)
p = entmax_bisect(X, alpha)

# Compute gradient w.r.t. alpha; retain the graph because p is reused below
g = grad(p[0, 0], alpha, retain_graph=True)

# Per-row alpha (adaptive sparsity per attention head)
alpha_per_row = torch.tensor([[1.5], [2.0]])  # shape (batch, 1)
p_adaptive = entmax_bisect(X, alpha=alpha_per_row, dim=1)

# Use in a training loop with learnable alpha
optimizer_alpha = torch.optim.Adam([alpha], lr=0.01)
loss = -p[0, 2]  # dummy loss
loss.backward()
optimizer_alpha.step()
```

### Response
#### Success Response
- **p** (Tensor) - The alpha-entmax probability distribution.
```

--------------------------------

### EntmaxBisectLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

EntmaxBisectLoss is a generic alpha-entmax Fenchel-Young loss using bisection for arbitrary alpha > 1. It supports learnable alpha and can be used functionally or as a module.

```APIDOC
## EntmaxBisectLoss: Generic alpha-entmax Fenchel-Young loss

### Description
Bisection-based loss for arbitrary alpha > 1 with learnable alpha support.

### Usage
```python
import torch
from entmax import entmax_bisect_loss, EntmaxBisectLoss

# requires_grad=True so that the module-form backward() has a leaf to reach
logits = torch.randn(8, 100, requires_grad=True)  # 8 examples, 100 classes
targets = torch.randint(0, 100, (8,))

# alpha=1.5 (default)
loss_15 = entmax_bisect_loss(logits, targets, alpha=1.5)

# Sparsemax regime
loss_sp = entmax_bisect_loss(logits, targets, alpha=2.0)

# Learnable alpha: gradient flows through alpha
alpha = torch.tensor(1.5, requires_grad=True)
loss = entmax_bisect_loss(logits, targets, alpha=alpha)
loss.sum().backward()
print(alpha.grad)  # gradient w.r.t. sparsity level

# Module form
criterion = EntmaxBisectLoss(alpha=1.7, n_iter=50, reduction="elementwise_mean")
mean_loss = criterion(logits, targets)
mean_loss.backward()
```
```

--------------------------------

### Entmax15Loss

Source: https://context7.com/deep-spin/entmax/llms.txt

Entmax15Loss is a Fenchel-Young loss based on the 1.5-entmax transform, offering a balance between cross-entropy and sparsemax loss. It can be used functionally or as a module and can optionally return the support size.

```APIDOC
## Entmax15Loss: 1.5-entmax Fenchel-Young loss

### Description
Sparse loss based on the 1.5-entmax transform; a middle ground between cross-entropy and sparsemax loss.

### Usage
```python
import torch
from entmax import entmax15_loss, Entmax15Loss

logits = torch.tensor([[ 0.5, 2.0, -1.0],
                       [ 1.0, -0.5, 3.0]])
targets = torch.tensor([1, 2])

# Functional
losses = entmax15_loss(logits, targets)
print(losses)  # sparse per-sample losses, shape (2,)

# Module, also returns support size
criterion = Entmax15Loss(
    k=100,
    reduction="elementwise_mean",
    return_support_size=True
)
mean_loss, support = criterion(logits, targets)
print(f"Mean loss: {mean_loss.item():.4f}")
print(f"Support size: {support}")

# Gradient check
logits.requires_grad_(True)
loss = entmax15_loss(logits, targets)
loss.sum().backward()
print(logits.grad)  # sparse gradient: only nonzero for support entries
```
```

--------------------------------

### SparsemaxLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

SparsemaxLoss is a Fenchel-Young loss based on the sparsemax transform. It can be used functionally or as a module, with options for partial sorting (k) and ignoring specific indices.

```APIDOC
## SparsemaxLoss

### Description
SparsemaxLoss is a Fenchel-Young loss based on the sparsemax transform. It can be used functionally or as a module, with options for partial sorting (k) and ignoring specific indices.

### Usage
```python
import torch
from entmax import SparsemaxLoss

# Module form with reduction and ignore_index
criterion = SparsemaxLoss(
    k=None,  # full sort; set k for partial-sort speedup
    ignore_index=-100,
    reduction="elementwise_mean"
)

# Example usage in a training loop
model = torch.nn.Linear(128, 50)  # 50-class classifier
optimizer = torch.optim.Adam(model.parameters())
criterion = SparsemaxLoss(reduction="elementwise_mean")

x = torch.randn(32, 128)
y = torch.randint(0, 50, (32,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item():.4f}")
```
```

--------------------------------

### Compute Fenchel-Young Sparsemax Loss

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes a sparse alternative to cross-entropy loss based on the sparsemax transform. Produces zero loss when the target class receives all probability mass. Functional and module forms are available. Requires torch, nn, and sparsemax_loss/SparsemaxLoss from entmax.

```python
import torch
import torch.nn as nn
from entmax import sparsemax_loss, SparsemaxLoss

# Functional form
logits = torch.tensor([[ 0.5, 2.0, -1.0],
                       [ 1.0, -0.5, 3.0]])
targets = torch.tensor([1, 2])

losses = sparsemax_loss(logits, targets)  # shape: (2,)
print(losses)  # per-sample losses
```

--------------------------------

### EntmaxBisectLoss Functional Usage with Different Alphas

Source: https://context7.com/deep-spin/entmax/llms.txt

Compute generic alpha-entmax Fenchel-Young loss using `entmax_bisect_loss`. Supports arbitrary alpha values, including the default alpha=1.5 and the sparsemax regime (alpha=2.0).
```python
import torch
from entmax import entmax_bisect_loss, EntmaxBisectLoss

logits = torch.randn(8, 100)  # 8 examples, 100 classes
targets = torch.randint(0, 100, (8,))

# alpha=1.5 (default)
loss_15 = entmax_bisect_loss(logits, targets, alpha=1.5)

# Sparsemax regime
loss_sp = entmax_bisect_loss(logits, targets, alpha=2.0)
```

--------------------------------

### NormmaxBisectLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

NormmaxBisectLoss is an alpha-normmax Fenchel-Young loss derived from the normmax transform. It can be used functionally or as a module.

```APIDOC
## NormmaxBisectLoss: Alpha-normmax Fenchel-Young loss

### Description
Fenchel-Young loss derived from the normmax transform.

### Usage
```python
import torch
from entmax import normmax_bisect_loss, NormmaxBisectLoss

# requires_grad=True so that the module-form backward() has a leaf to reach
logits = torch.randn(16, 50, requires_grad=True)
targets = torch.randint(0, 50, (16,))

# Functional
loss = normmax_bisect_loss(logits, targets, alpha=2.0)
print(loss)  # per-sample losses, shape (16,)

# Module
criterion = NormmaxBisectLoss(
    alpha=3.0,
    n_iter=50,
    reduction="elementwise_mean"
)
mean_loss = criterion(logits, targets)
mean_loss.backward()
```
```
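
--------------------------------

### Reference Sketch: Sparsemax Threshold by Sorting and by Bisection

The sparsemax sections describe an L2 projection onto the probability simplex that produces exact zeros, computed either with a sort-based algorithm or by bisection. The sketch below illustrates both ideas in plain Python on a single list of scores. It is not the library's implementation: the names `sparsemax_sort` and `sparsemax_bisect_ref` are hypothetical and exist only for this illustration, and the code assumes the standard characterization of sparsemax as `p_i = max(x_i - tau, 0)` with `tau` chosen so that the outputs sum to 1.

```python
def sparsemax_sort(x):
    """Sort-based sparsemax: Euclidean projection of scores onto the simplex."""
    z = sorted(x, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score is in the support while 1 + k * zk > cumsum.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    # Entries at or below the threshold tau become exact zeros.
    return [max(xi - tau, 0.0) for xi in x]


def sparsemax_bisect_ref(x, n_iter=50):
    """Bisection on tau so that sum(max(x_i - tau, 0)) is approximately 1."""
    lo, hi = max(x) - 1.0, max(x)  # the mass is >= 1 at lo and 0 at hi
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if sum(max(xi - mid, 0.0) for xi in x) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = [max(xi - lo, 0.0) for xi in x]
    total = sum(p)
    return [pi / total for pi in p]  # renormalize the residual bisection error


print(sparsemax_sort([-2.0, 0.0, 0.5]))  # [0.0, 0.25, 0.75]
print(sparsemax_sort([1.0, 2.0, 3.5]))   # [0.0, 0.0, 1.0]
print(sparsemax_bisect_ref([-2.0, 0.0, 0.5]))  # close to [0.0, 0.25, 0.75]
```

The sort-based version finds the threshold exactly, while the bisection version only narrows an interval around it, which is why its iteration count controls precision. When the top score leads the runner-up by more than 1, as in `[1.0, 2.0, 3.5]`, all of the mass lands on the top entry.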