### Install Entmax Package

Source: https://github.com/deep-spin/entmax/blob/master/README.md

Standard pip installation command for the entmax package.

```bash
pip install entmax
```

--------------------------------

### Project onto k-Subsets Budget with budget_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Projects onto a k-subsets budget constraint, where each output is in [0,1] and sums to budget. Useful for multi-label classification. Budget=1 is equivalent to standard sparsemax. Requires torch and budget_bisect from entmax.

```python
import torch
from entmax import budget_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# budget=1 is equivalent to standard sparsemax
print(budget_bisect(x, budget=1, dim=0))
# tensor([0.0000, 0.2500, 0.7500])

# budget=2: outputs lie in [0, 1] and sum to 2
print(budget_bisect(x, budget=2, dim=0))
# tensor([0.0000, 1.0000, 1.0000])

# Multi-label scenario: about 3 expected labels out of 10 classes
logits = torch.randn(16, 10)  # batch=16, classes=10
label_probs = budget_bisect(logits, budget=3, dim=1)
print(label_probs.sum(dim=1))  # all approximately 3.0
print((label_probs > 0).sum(dim=1))  # nonzero entries per row: at least 3, possibly more
```

--------------------------------

### Use Parametric nn.Module Wrappers for Bisection Transforms

Source: https://context7.com/deep-spin/entmax/llms.txt

nn.Module wrappers for bisection-based transforms, supporting learnable or fixed alpha parameters. Supports EntmaxBisect, NormmaxBisect, BudgetBisect, and SparsemaxBisect. Requires torch, nn, and the respective classes from entmax.

```python
import torch
import torch.nn as nn
from entmax import EntmaxBisect, NormmaxBisect, BudgetBisect, SparsemaxBisect, entmax_bisect

# Learnable per-head alpha, passed explicitly to the functional form
class AdaptiveSparseAttention(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # One alpha per attention head, initialized to 1.5
        self.alpha = nn.Parameter(torch.full((n_heads, 1, 1), 1.5))

    def forward(self, scores):
        # scores: (batch, heads, seq, seq); alpha broadcasts over batch and query positions
        return entmax_bisect(scores, alpha=self.alpha, dim=-1)

# Direct module usage
sparse_layer = EntmaxBisect(alpha=1.7, dim=-1, n_iter=30)
normmax_layer = NormmaxBisect(alpha=3.0, dim=-1)
budget_layer  = BudgetBisect(budget=5, dim=-1)
bisect_sparse = SparsemaxBisect(dim=-1)

x = torch.randn(4, 20)  # batch=4, vocab/classes=20

print(sparse_layer(x).sum(dim=1))   # all ~1.0
print(normmax_layer(x).sum(dim=1))  # all ~1.0
print(budget_layer(x).sum(dim=1))   # all ~5.0
print(bisect_sparse(x).sum(dim=1))  # all ~1.0
```

--------------------------------

### EntmaxBisect / NormmaxBisect / BudgetBisect / SparsemaxBisect Modules

Source: https://context7.com/deep-spin/entmax/llms.txt

nn.Module wrappers for bisection-based transforms, supporting learnable or fixed alpha parameters.

```APIDOC
## EntmaxBisect / NormmaxBisect / BudgetBisect / SparsemaxBisect

### Description
`nn.Module` wrappers for the bisection-based transforms, supporting learnable or fixed alpha parameters.

### Parameters
- **alpha** (float, optional) - The alpha parameter for EntmaxBisect and NormmaxBisect. Defaults to 1.5 for EntmaxBisect and 3.0 for NormmaxBisect.
- **budget** (float, optional) - The budget parameter for BudgetBisect. Defaults to 5.
- **dim** (int) - The dimension along which to compute the transform.
- **n_iter** (int, optional) - Number of bisection iterations; more iterations give a tighter solution at higher cost. The example below passes 30 explicitly.

### Usage Example
```python
import torch
import torch.nn as nn
from entmax import EntmaxBisect, NormmaxBisect, BudgetBisect, SparsemaxBisect, entmax_bisect

# Learnable per-head alpha, passed explicitly to the functional form
class AdaptiveSparseAttention(nn.Module):
    def __init__(self, n_heads):
        super().__init__()
        # One alpha per attention head, initialized to 1.5
        self.alpha = nn.Parameter(torch.full((n_heads, 1, 1), 1.5))

    def forward(self, scores):
        # scores: (batch, heads, seq, seq); alpha broadcasts over batch and query positions
        return entmax_bisect(scores, alpha=self.alpha, dim=-1)

# Direct module usage
sparse_layer = EntmaxBisect(alpha=1.7, dim=-1, n_iter=30)
normmax_layer = NormmaxBisect(alpha=3.0, dim=-1)
budget_layer  = BudgetBisect(budget=5, dim=-1)
bisect_sparse = SparsemaxBisect(dim=-1)

x = torch.randn(4, 20)  # batch=4, vocab/classes=20

print(sparse_layer(x).sum(dim=1))   # all ~1.0
print(normmax_layer(x).sum(dim=1))  # all ~1.0
print(budget_layer(x).sum(dim=1))   # all ~5.0
print(bisect_sparse(x).sum(dim=1))  # all ~1.0
```
```

--------------------------------

### Use Sparsemax/Entmax15 as nn.Module Replacements for Softmax

Source: https://context7.com/deep-spin/entmax/llms.txt

Drop-in nn.Module replacements for softmax, usable anywhere nn.Softmax is used, such as in attention mechanisms. Requires torch, nn, and Sparsemax/Entmax15 from entmax.

```python
import torch
import torch.nn as nn
from entmax import Sparsemax, Entmax15

# Replace nn.Softmax in attention
class SparseAttention(nn.Module):
    def __init__(self, d_model, n_heads, sparse_type="entmax15"):
        super().__init__()
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        if sparse_type == "sparsemax":
            self.attn = Sparsemax(dim=-1)
        else:
            self.attn = Entmax15(dim=-1)

    def forward(self, x):
        B, T, C = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        weights = self.attn(scores)   # sparse attention weights
        return (weights @ v).transpose(1, 2).reshape(B, T, C)

model = SparseAttention(d_model=64, n_heads=4, sparse_type="entmax15")
x = torch.randn(2, 10, 64)
out = model(x)
print(out.shape)  # torch.Size([2, 10, 64])
```

--------------------------------

### SparsemaxLoss in a Training Loop

Source: https://context7.com/deep-spin/entmax/llms.txt

Integrate `SparsemaxLoss` into a standard PyTorch training loop: zero the gradients before the forward pass, then step the optimizer after backpropagation.

```python
import torch
import torch.nn as nn
from entmax import SparsemaxLoss

model = nn.Linear(128, 50)     # 50-class classifier
optimizer = torch.optim.Adam(model.parameters())
criterion = SparsemaxLoss(reduction="elementwise_mean")

x = torch.randn(32, 128)
y = torch.randint(0, 50, (32,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item():.4f}")
```

--------------------------------

### Entmax Probability Mappings Comparison

Source: https://github.com/deep-spin/entmax/blob/master/README.md

Compares the output of softmax, sparsemax, entmax15, normmax_bisect, and budget_bisect for a given input tensor. Useful for understanding the sparsity patterns of different entmax variants.

```python
import torch

from torch.nn.functional import softmax
from entmax import sparsemax, entmax15, entmax_bisect, normmax_bisect, budget_bisect

x = torch.tensor([-2, 0, 0.5])

softmax(x, dim=0)
sparsemax(x, dim=0)
entmax15(x, dim=0)
normmax_bisect(x, alpha=2, dim=0)
normmax_bisect(x, alpha=1000, dim=0)
budget_bisect(x, budget=2, dim=0)
```

--------------------------------

### budget_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Projects onto a k-subsets budget constraint, where each output value is in [0,1] and sums to `budget`. Useful for multi-label classification.

```APIDOC
## budget_bisect

### Description
Projects onto a k-subsets budget constraint: each output value is in [0,1] and sums to `budget`. Useful for multi-label classification where the expected number of active labels is known.

### Parameters
- **x** (torch.Tensor) - Input tensor.
- **budget** (float) - The budget constraint for the sum of output values.
- **dim** (int) - The dimension along which to compute the projection.

### Request Example
```python
import torch
from entmax import budget_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# budget=1 is equivalent to standard sparsemax
print(budget_bisect(x, budget=1, dim=0))
# tensor([0.0000, 0.2500, 0.7500])

# budget=2: outputs lie in [0, 1] and sum to 2
print(budget_bisect(x, budget=2, dim=0))
# tensor([0.0000, 1.0000, 1.0000])

# Multi-label scenario: about 3 expected labels out of 10 classes
logits = torch.randn(16, 10)  # batch=16, classes=10
label_probs = budget_bisect(logits, budget=3, dim=1)
print(label_probs.sum(dim=1))  # all approximately 3.0
print((label_probs > 0).sum(dim=1))  # nonzero entries per row: at least 3, possibly more
```

### Response
- **output** (torch.Tensor) - The resulting budget-sparse projection.
```

--------------------------------

### SparsemaxLoss Module with Reduction and Ignore Index

Source: https://context7.com/deep-spin/entmax/llms.txt

Use SparsemaxLoss module for calculating sparse loss. Set `k` to None for full sort, or specify `k` for partial-sort speedup. `ignore_index` can be used to ignore specific target values.

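```python
import torch
from entmax import SparsemaxLoss

logits = torch.randn(16, 50, requires_grad=True)  # example data: 16 samples, 50 classes
targets = torch.randint(0, 50, (16,))

criterion = SparsemaxLoss(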
    k=None,                     # full sort; set k for partial-sort speedup
    ignore_index=-100,
    reduction="elementwise_mean"
)
mean_loss = criterion(logits, targets)
mean_loss.backward()
```

--------------------------------

### Entmax Gradients w.r.t. Alpha

Source: https://github.com/deep-spin/entmax/blob/master/README.md

Demonstrates how to compute gradients of entmax probabilities with respect to the alpha parameter. This is useful for implementing adaptive, learned sparsity.

```python
import torch
from torch.autograd import grad
from entmax import entmax_bisect

x = torch.tensor([[-1, 0, 0.5], [1, 2, 3.5]])
alpha = torch.tensor(1.33, requires_grad=True)

p = entmax_bisect(x, alpha)

grad(p[0, 0], alpha)
```

--------------------------------

### Sparsemax via Bisection Algorithm

Source: https://context7.com/deep-spin/entmax/llms.txt

A bisection-based implementation of sparsemax, equivalent to alpha=2 entmax. Useful when a purely bisection-based pipeline is preferred. Allows control over precision with n_iter.

```python
import torch
from entmax import sparsemax_bisect

X = torch.tensor([[-2.0, 0.0, 0.5],
                   [ 1.0, 2.0, 3.5]])

p = sparsemax_bisect(X, dim=-1)
# approximately:
# tensor([[0.0000, 0.2500, 0.7500],
#         [0.0000, 0.0000, 1.0000]])

# Control precision with n_iter (24 sufficient for float32 machine precision)
p_precise = sparsemax_bisect(X, dim=-1, n_iter=24)
```
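To make the role of `n_iter` concrete, here is a pure-Python sketch of bisection for sparsemax. It assumes the standard characterization `p_i = max(x_i - tau, 0)`, with `tau` chosen so the entries sum to 1. `sparsemax_bisection` is a hypothetical helper for illustration and does not reproduce the library's implementation details.

```python
def sparsemax_bisection(x, n_iter=50):
    """Reference sketch: bisect on tau so that sum(max(x_i - tau, 0)) == 1."""
    def total(tau):
        return sum(max(v - tau, 0.0) for v in x)

    lo = max(x) - 1.0   # the largest entry alone contributes 1, so total(lo) >= 1
    hi = max(x)         # nothing is above the threshold, so total(hi) == 0
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if total(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2.0
    return [max(v - tau, 0.0) for v in x]


rows = [[-2.0, 0.0, 0.5], [1.0, 2.0, 3.5]]
for row in rows:
    print([round(v, 4) for v in sparsemax_bisection(row)])

# Each iteration halves the bracket [lo, hi], which starts with width 1,
# so the error in tau after n iterations is at most 2 ** -n.
coarse = sparsemax_bisection(rows[0], n_iter=5)
fine = sparsemax_bisection(rows[0], n_iter=50)
print(max(abs(a - b) for a, b in zip(coarse, fine)))
```

Because the bracket starts at width 1 and halves each step, about 24 iterations reach the resolution of a float32 mantissa, which is consistent with the comment in the snippet above.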

--------------------------------

### 1.5-Entmax Sparse Activation

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the 1.5-entmax probability mapping, which is sparser than softmax but less extreme than sparsemax. Uses an exact partial-sort algorithm. Supports batched input and returning the support size.

```python
import torch
from entmax import entmax15

x = torch.tensor([-2.0, 0.0, 0.5])

p = entmax15(x, dim=0)
# tensor([0.0000, 0.3260, 0.6740])  -- sparser than softmax, less so than sparsemax

# Batch of logits (e.g., attention scores in a transformer)
attn_scores = torch.randn(4, 16)  # batch=4, seq_len=16
attn_weights = entmax15(attn_scores, dim=1)
assert torch.allclose(attn_weights.sum(dim=1), torch.ones(4), atol=1e-5)

# Many entries will be exactly zero
print((attn_weights == 0).sum(dim=1))  # number of exact zeros per row

# With support size
p, support = entmax15(attn_scores, dim=1, return_support_size=True)
print(support.squeeze())  # number of nonzero attention positions
```
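The sort-based solution can be sketched in pure Python. This assumes the closed form `p_i = max(x_i / 2 - tau, 0) ** 2` for 1.5-entmax and finds `tau` by scanning sorted prefixes. It uses a full sort rather than a partial sort, and `entmax15_sorted` is a hypothetical name, not a library function.

```python
import math


def entmax15_sorted(x):
    """Reference sketch of exact 1.5-entmax; returns (probabilities, support size)."""
    top = max(x)
    z = [(v - top) / 2.0 for v in x]      # shift for stability, then halve
    z_sorted = sorted(z, reverse=True)

    tau_star, support = None, 0
    csum = csum_sq = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        csum += zk
        csum_sq += zk * zk
        mean = csum / k
        ss = k * (csum_sq / k - mean * mean)   # k times the prefix variance
        delta = max((1.0 - ss) / k, 0.0)
        tau = mean - math.sqrt(delta)          # threshold if the support were the top k
        if tau <= zk:                          # the k-th entry stays above the threshold
            tau_star, support = tau, k
    p = [max(v - tau_star, 0.0) ** 2 for v in z]
    return p, support


p, support = entmax15_sorted([-2.0, 0.0, 0.5])
print([round(v, 4) for v in p], support)   # approximately [0.0, 0.326, 0.674] with support 2
```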

--------------------------------

### sparsemax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Bisection-based implementation of sparsemax, equivalent to alpha=2 entmax. Useful when a purely bisection-based pipeline is preferred.

```APIDOC
## sparsemax_bisect

### Description
Bisection-based implementation of sparsemax (equivalent to alpha=2 entmax). Useful when a purely bisection-based pipeline is preferred.

### Parameters
- **dim** (int) - The dimension along which to compute the sparsemax.
- **n_iter** (int, optional) - The number of iterations for the bisection algorithm. Defaults to a value sufficient for float32 machine precision.

### Request Example
```python
import torch
from entmax import sparsemax_bisect

X = torch.tensor([[-2.0, 0.0, 0.5],
                   [ 1.0, 2.0, 3.5]])

p = sparsemax_bisect(X, dim=-1)

# Control precision with n_iter
p_precise = sparsemax_bisect(X, dim=-1, n_iter=24)
```

### Response
#### Success Response
- **p** (Tensor) - The sparsemax probability distribution.
```

--------------------------------

### EntmaxBisectLoss Module Usage

Source: https://context7.com/deep-spin/entmax/llms.txt

Instantiate and use the `EntmaxBisectLoss` module for calculating the alpha-entmax Fenchel-Young loss. Configure parameters like alpha, number of iterations (`n_iter`), and reduction method.

```python
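import torch
from entmax import EntmaxBisectLoss
logits = torch.randn(8, 100, requires_grad=True)  # example data: 8 samples, 100 classes
targets = torch.randint(0, 100, (8,))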
criterion = EntmaxBisectLoss(alpha=1.7, n_iter=50, reduction="elementwise_mean")
mean_loss = criterion(logits, targets)
mean_loss.backward()
```

--------------------------------

### EntmaxBisectLoss with Learnable Alpha

Source: https://context7.com/deep-spin/entmax/llms.txt

Enable gradient flow through the alpha parameter for learnable sparsity levels. The `entmax_bisect_loss` function can be used with a `torch.Tensor` for alpha that requires gradients.

```python
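import torch
from entmax import entmax_bisect_loss

logits = torch.randn(8, 100)  # example data: 8 samples, 100 classes
targets = torch.randint(0, 100, (8,))
alpha = torch.tensor(1.5, requires_grad=True)  # alpha passed as a tensor
loss = entmax_bisect_loss(logits, targets, alpha=alpha)
loss.sum().backward()
print(alpha.grad)   # gradient w.r.t. alpha; confirm it is not None on your installed version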
```

--------------------------------

### Sparse Softmax Activation with Sparsemax

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes a sparse probability distribution by Euclidean (L2) projection onto the probability simplex. Produces exact zeros for low-scoring entries. Supports batched input and can return the support size.

```python
import torch
from entmax import sparsemax

x = torch.tensor([-2.0, 0.0, 0.5])

# Basic usage
p = sparsemax(x, dim=0)
# tensor([0.0000, 0.2500, 0.7500])

# Batched input along dim=1
X = torch.tensor([[-2.0, 0.0, 0.5],
                   [1.0,  2.0, 3.5]])
p_batch = sparsemax(X, dim=1)
# tensor([[0.0000, 0.2500, 0.7500],
#         [0.0000, 0.0000, 1.0000]])

# With partial-sort hint for efficiency (k slightly > expected nonzeros)
p_fast = sparsemax(X, dim=1, k=3)

# Return support size (number of nonzeros)
p, support = sparsemax(X, dim=1, return_support_size=True)
print(p)       # tensor([[0.0000, 0.2500, 0.7500], ...])
print(support) # tensor([[2], [1]])

# Verify it's a valid probability distribution
assert torch.allclose(p.sum(dim=1), torch.ones(2))
```
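The projection itself is short enough to write out. Below is a pure-Python sketch of the sort-based algorithm, assuming the standard threshold rule from the sparsemax literature. `sparsemax_sorted` is a hypothetical helper; it performs a full sort and omits the `k` partial-sort optimization.

```python
def sparsemax_sorted(x):
    """Reference sketch: Euclidean projection of x onto the probability simplex."""
    z_sorted = sorted(x, reverse=True)
    csum = 0.0
    tau, support = 0.0, 0
    for k, zk in enumerate(z_sorted, start=1):
        csum += zk
        if 1.0 + k * zk > csum:          # the k-th largest entry stays in the support
            support = k
            tau = (csum - 1.0) / k       # threshold that makes the top k sum to 1
    return [max(v - tau, 0.0) for v in x], support


for row in ([-2.0, 0.0, 0.5], [1.0, 2.0, 3.5]):
    p, support = sparsemax_sorted(row)
    print(p, support)
```

In the second row the largest logit exceeds the runner-up by more than 1, so it takes all of the probability mass and the support size is 1.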

--------------------------------

### Sparsemax / Entmax15 Modules

Source: https://context7.com/deep-spin/entmax/llms.txt

PyTorch nn.Module replacements for nn.Softmax, usable in attention mechanisms and other layers.

```APIDOC
## Sparsemax / Entmax15

### Description
PyTorch `nn.Module` wrappers that can be used anywhere `nn.Softmax` would be used, e.g. directly replacing attention softmax.

### Parameters
- **dim** (int) - The dimension along which to compute the transform.

### Usage Example
```python
import torch
import torch.nn as nn
from entmax import Sparsemax, Entmax15

# Replace nn.Softmax in attention
class SparseAttention(nn.Module):
    def __init__(self, d_model, n_heads, sparse_type="entmax15"):
        super().__init__()
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        if sparse_type == "sparsemax":
            self.attn = Sparsemax(dim=-1)
        else:
            self.attn = Entmax15(dim=-1)

    def forward(self, x):
        B, T, C = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        weights = self.attn(scores)   # sparse attention weights
        return (weights @ v).transpose(1, 2).reshape(B, T, C)

model = SparseAttention(d_model=64, n_heads=4, sparse_type="entmax15")
x = torch.randn(2, 10, 64)
out = model(x)
print(out.shape)  # torch.Size([2, 10, 64])
```
```

--------------------------------

### Compute Alpha-Normmax Projection with normmax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the alpha-normmax projection. As alpha grows, the output moves toward equal weights on the highest-scoring entries rather than a one-hot argmax, as the alpha=1000 example shows. Requires torch and normmax_bisect from entmax.

```python
import torch
from entmax import normmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# alpha=2: sparse output, but not identical to sparsemax ([0.00, 0.25, 0.75])
print(normmax_bisect(x, alpha=2, dim=0))
# tensor([0.0000, 0.3110, 0.6890])

# Large alpha: mass spreads almost evenly over the retained entries
print(normmax_bisect(x, alpha=1000, dim=0))
# tensor([0.0000, 0.4997, 0.5003])

# Batched input with a shared scalar alpha
X = torch.randn(8, 32)
p = normmax_bisect(X, alpha=3.0, dim=1)
assert torch.allclose(p.sum(dim=1), torch.ones(8), atol=1e-5)
```
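The large-alpha behaviour can be checked by evaluating the objective directly. This pure-Python sketch assumes the normmax objective is `<x, p> - ||p||_alpha`, as stated in the `normmax_bisect` API description. `normmax_objective` is a hypothetical helper that scores candidate distributions and does not solve the optimization.

```python
def normmax_objective(x, p, alpha):
    """Value of <x, p> - ||p||_alpha for a candidate distribution p."""
    inner = sum(xi * pi for xi, pi in zip(x, p))
    norm = sum(pi ** alpha for pi in p) ** (1.0 / alpha)
    return inner - norm


x = [-2.0, 0.0, 0.5]
one_hot = [0.0, 0.0, 1.0]          # argmax / hardmax
top_two = [0.0, 0.5, 0.5]          # equal weight on the two largest entries
reported = [0.0, 0.4997, 0.5003]   # output listed above for alpha=1000

for name, p in [("one_hot", one_hot), ("top_two", top_two), ("reported", reported)]:
    print(name, round(normmax_objective(x, p, alpha=1000), 4))
```

At alpha=1000 the norm is close to the largest entry of `p`, so spreading mass over the two top entries costs about 0.5 instead of 1, which outweighs the small loss in `<x, p>`. The one-hot candidate therefore scores lower than either near-uniform candidate.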

--------------------------------

### sparsemax_loss / SparsemaxLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

Fenchel-Young sparsemax loss, a sparse alternative to cross-entropy.

```APIDOC
## sparsemax_loss / SparsemaxLoss

### Description
Sparse alternative to cross-entropy based on the sparsemax transform. Produces exactly zero loss when the target class receives all probability mass. Functional and module forms available.

### Parameters
- **logits** (torch.Tensor) - The input tensor of logits.
- **targets** (torch.Tensor) - The target class indices.

### Request Example
```python
import torch
import torch.nn as nn
from entmax import sparsemax_loss, SparsemaxLoss

# Functional form
logits = torch.tensor([[ 0.5,  2.0, -1.0],
                        [ 1.0, -0.5,  3.0]])
targets = torch.tensor([1, 2])

losses = sparsemax_loss(logits, targets)   # shape: (2,)
print(losses)  # per-sample losses
```

### Response
- **losses** (torch.Tensor) - A tensor containing the per-sample sparsemax loss.
```

--------------------------------

### entmax15

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the 1.5-entmax probability mapping, which is sparser than softmax but less extreme than sparsemax. It uses an exact partial-sort algorithm.

```APIDOC
## entmax15

### Description
Computes the 1.5-entmax probability mapping, which is sparser than softmax but less extreme than sparsemax. Solves `argmax_p <x,p> + H_1.5(p)` over the probability simplex, where H_1.5 is the Tsallis 1.5-entropy. Uses an exact partial-sort algorithm.

### Parameters
- **dim** (int) - The dimension along which to compute the entmax15.
- **return_support_size** (bool, optional) - If True, also returns the number of non-zero elements.

### Request Example
```python
import torch
from entmax import entmax15

x = torch.tensor([-2.0, 0.0, 0.5])

p = entmax15(x, dim=0)

# Batch of logits (e.g., attention scores in a transformer)
attn_scores = torch.randn(4, 16)
attn_weights = entmax15(attn_scores, dim=1)

# With support size
p, support = entmax15(attn_scores, dim=1, return_support_size=True)
```

### Response
#### Success Response
- **p** (Tensor) - The 1.5-entmax probability distribution.
- **support** (Tensor, optional) - The number of non-zero elements in the distribution.
```

--------------------------------

### Generic Alpha-Entmax with Bisection

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes alpha-entmax for any alpha > 1 using a bisection algorithm. This function is differentiable with respect to alpha, enabling learned sparsity. Supports fixed or per-row alpha values.

```python
import torch
from torch.autograd import grad
from entmax import entmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# Fixed alpha values
print(entmax_bisect(x, alpha=1.5, dim=0))  # matches entmax15 up to bisection tolerance
# tensor([0.0000, 0.3260, 0.6740])
print(entmax_bisect(x, alpha=2.0, dim=0))  # matches sparsemax up to bisection tolerance
# tensor([0.0000, 0.2500, 0.7500])

# Learnable alpha: gradient flows through alpha
X = torch.tensor([[-1.0, 0.0, 0.5],
                   [ 1.0, 2.0, 3.5]])
alpha = torch.tensor(1.33, requires_grad=True)
p = entmax_bisect(X, alpha)
# tensor([[0.0460, 0.3276, 0.6264],
#         [0.0026, 0.1012, 0.8963]], grad_fn=...)

# Compute gradient w.r.t. alpha
g = grad(p[0, 0], alpha)
print(g)  # (tensor(-0.2562),)

# Per-row alpha (adaptive sparsity per attention head)
alpha_per_row = torch.tensor([[1.5], [2.0]])  # shape (batch, 1)
p_adaptive = entmax_bisect(X, alpha=alpha_per_row, dim=1)

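# Use in a training loop with learnable alpha
optimizer_alpha = torch.optim.Adam([alpha], lr=0.01)
optimizer_alpha.zero_grad()
# Fresh forward pass: grad() above already freed the graph behind p
loss = -entmax_bisect(X, alpha)[0, 2]  # dummy loss
loss.backward()
optimizer_alpha.step()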
```
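The bisection and the dependence on alpha can be illustrated without torch. The sketch below assumes the alpha-entmax form `p_i = max((alpha - 1) * x_i - tau, 0) ** (1 / (alpha - 1))` and brackets `tau` between two values where the sum is known to be at least 1 and at most 1. `entmax_bisection` is a hypothetical pure-Python helper, and the finite difference only approximates what autograd computes.

```python
def entmax_bisection(x, alpha, n_iter=60):
    """Reference sketch of alpha-entmax (alpha > 1) by bisection on the threshold tau."""
    power = 1.0 / (alpha - 1.0)
    z = [(alpha - 1.0) * v for v in x]

    def unnormalized(tau):
        return [max(v - tau, 0.0) ** power for v in z]

    lo = max(z) - 1.0                               # largest entry maps to 1: sum >= 1
    hi = max(z) - (1.0 / len(z)) ** (alpha - 1.0)   # every entry maps to <= 1/d: sum <= 1
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if sum(unnormalized(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = unnormalized((lo + hi) / 2.0)
    total = sum(p)
    return [v / total for v in p]                   # remove the residual bisection error


x = [-2.0, 0.0, 0.5]
print([round(v, 4) for v in entmax_bisection(x, alpha=1.5)])
print([round(v, 4) for v in entmax_bisection(x, alpha=2.0)])

# Central finite difference of p[0] with respect to alpha, as a sanity check
# on the sign and rough size of the gradient reported by autograd above.
row, a, h = [-1.0, 0.0, 0.5], 1.33, 1e-5
fd = (entmax_bisection(row, a + h)[0] - entmax_bisection(row, a - h)[0]) / (2 * h)
print(round(fd, 3))
```

A negative value means that raising alpha removes mass from the lowest-scoring entry, which is the sparsifying effect a learned alpha can exploit.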

--------------------------------

### Entmax15Loss Functional and Module Usage

Source: https://context7.com/deep-spin/entmax/llms.txt

Calculate 1.5-entmax Fenchel-Young loss using either the functional interface or the Entmax15Loss module. The module can also return the support size when `return_support_size` is True.

```python
import torch
from entmax import entmax15_loss, Entmax15Loss

logits = torch.tensor([[ 0.5,  2.0, -1.0],
                        [ 1.0, -0.5,  3.0]])
targets = torch.tensor([1, 2])

# Functional
losses = entmax15_loss(logits, targets)
print(losses)   # sparse per-sample losses, shape (2,)
```

```python
# Module, also returns support size
criterion = Entmax15Loss(
    k=100,
    reduction="elementwise_mean",
    return_support_size=True
)
mean_loss, support = criterion(logits, targets)
print(f"Mean loss: {mean_loss.item():.4f}")
print(f"Support size: {support}")
```

```python
# Gradient check
logits.requires_grad_(True)
loss = entmax15_loss(logits, targets)
loss.sum().backward()
print(logits.grad)   # sparse gradient — only nonzero for support entries
```

--------------------------------

### sparsemax

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes a sparse probability distribution by projecting the input onto the probability simplex using the L2 norm. It produces exact zeros for low-scoring entries and uses an efficient partial-sort algorithm.

```APIDOC
## sparsemax

### Description
Computes a sparse probability distribution by projecting the input onto the probability simplex using the L2 norm. Produces exact zeros for low-scoring entries. Uses an efficient partial-sort algorithm.

### Parameters
- **dim** (int) - The dimension along which to compute the sparsemax.
- **k** (int, optional) - A hint for the partial-sort algorithm for efficiency. Should be slightly larger than the expected number of non-zero elements.
- **return_support_size** (bool, optional) - If True, also returns the number of non-zero elements.

### Request Example
```python
import torch
from entmax import sparsemax

x = torch.tensor([-2.0, 0.0, 0.5])

# Basic usage
p = sparsemax(x, dim=0)

# Batched input along dim=1
X = torch.tensor([[-2.0, 0.0, 0.5],
                   [1.0,  2.0, 3.5]])
p_batch = sparsemax(X, dim=1)

# With partial-sort hint for efficiency
p_fast = sparsemax(X, dim=1, k=3)

# Return support size (number of nonzeros)
p, support = sparsemax(X, dim=1, return_support_size=True)
```

### Response
#### Success Response
- **p** (Tensor) - The sparse probability distribution.
- **support** (Tensor, optional) - The number of non-zero elements in the distribution.
```

--------------------------------

### NormmaxBisectLoss Functional and Module Usage

Source: https://context7.com/deep-spin/entmax/llms.txt

Calculate alpha-normmax Fenchel-Young loss using either the functional `normmax_bisect_loss` or the `NormmaxBisectLoss` module. Configure the alpha value and the number of iterations for the bisection method.

```python
import torch
from entmax import normmax_bisect_loss, NormmaxBisectLoss

logits = torch.randn(16, 50, requires_grad=True)  # requires_grad so backward() in the module example works
targets = torch.randint(0, 50, (16,))

# Functional
loss = normmax_bisect_loss(logits, targets, alpha=2.0)
print(loss)   # per-sample losses, shape (16,)
```

```python
# Module
criterion = NormmaxBisectLoss(
    alpha=3.0,
    n_iter=50,
    reduction="elementwise_mean"
)
mean_loss = criterion(logits, targets)
mean_loss.backward()
```

--------------------------------

### normmax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes the alpha-normmax projection. As alpha grows, the output moves toward equal weights on the highest-scoring entries rather than a one-hot argmax.

```APIDOC
## normmax_bisect

### Description
Computes the alpha-normmax projection: `max_p <x,p> - ||p||_alpha` with the alpha-norm regularizer. As alpha → ∞, the regularizer approaches the max-norm and the output tends toward equal weights on the retained entries (see the alpha=1000 example).

### Parameters
- **x** (torch.Tensor) - Input tensor.
- **alpha** (float) - The alpha parameter for the normmax projection. Higher values spread mass more evenly over the retained entries.
- **dim** (int) - The dimension along which to compute the projection.

### Request Example
```python
import torch
from entmax import normmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# alpha=2: sparse output, but not identical to sparsemax ([0.00, 0.25, 0.75])
print(normmax_bisect(x, alpha=2, dim=0))
# tensor([0.0000, 0.3110, 0.6890])

# Large alpha: mass spreads almost evenly over the retained entries
print(normmax_bisect(x, alpha=1000, dim=0))
# tensor([0.0000, 0.4997, 0.5003])

# Batched input with a shared scalar alpha
X = torch.randn(8, 32)
p = normmax_bisect(X, alpha=3.0, dim=1)
assert torch.allclose(p.sum(dim=1), torch.ones(8), atol=1e-5)
```

### Response
- **output** (torch.Tensor) - The resulting alpha-normmax projection.
```

--------------------------------

### entmax_bisect

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes alpha-entmax for any alpha > 1 using a bisection algorithm. This function is differentiable with respect to alpha, enabling learned sparsity.

```APIDOC
## entmax_bisect

### Description
Computes alpha-entmax for any alpha > 1 using a bisection algorithm. This is the most general activation; alpha=1.5 matches `entmax15`, alpha=2 matches `sparsemax`. Critically, this function is **differentiable with respect to alpha**, enabling learned sparsity.

### Parameters
- **alpha** (float or Tensor) - The exponent for the entmax function. Can be a scalar or a tensor for per-row alpha values.
- **dim** (int) - The dimension along which to compute the entmax.

### Request Example
```python
import torch
from torch.autograd import grad
from entmax import entmax_bisect

x = torch.tensor([-2.0, 0.0, 0.5])

# Fixed alpha values
print(entmax_bisect(x, alpha=1.5, dim=0))
print(entmax_bisect(x, alpha=2.0, dim=0))

# Learnable alpha: gradient flows through alpha
X = torch.tensor([[-1.0, 0.0, 0.5],
                   [ 1.0, 2.0, 3.5]])
alpha = torch.tensor(1.33, requires_grad=True)
p = entmax_bisect(X, alpha)

# Compute gradient w.r.t. alpha
g = grad(p[0, 0], alpha)

# Per-row alpha (adaptive sparsity per attention head)
alpha_per_row = torch.tensor([[1.5], [2.0]])  # shape (batch, 1)
p_adaptive = entmax_bisect(X, alpha=alpha_per_row, dim=1)

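# Use in a training loop with learnable alpha
optimizer_alpha = torch.optim.Adam([alpha], lr=0.01)
optimizer_alpha.zero_grad()
# Fresh forward pass: grad() above already freed the graph behind p
loss = -entmax_bisect(X, alpha)[0, 2]  # dummy loss
loss.backward()
optimizer_alpha.step()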
```

### Response
#### Success Response
- **p** (Tensor) - The alpha-entmax probability distribution.
```

--------------------------------

### EntmaxBisectLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

EntmaxBisectLoss is a generic alpha-entmax Fenchel-Young loss using bisection for arbitrary alpha > 1. It supports learnable alpha and can be used functionally or as a module.

```APIDOC
## EntmaxBisectLoss — Generic alpha-entmax Fenchel-Young loss

### Description
Bisection-based loss for arbitrary alpha > 1 with learnable alpha support.

### Usage
```python
import torch
from entmax import entmax_bisect_loss, EntmaxBisectLoss

logits = torch.randn(8, 100, requires_grad=True)    # 8 examples, 100 classes; requires_grad for backward()
targets = torch.randint(0, 100, (8,))

# alpha=1.5 (default)
loss_15 = entmax_bisect_loss(logits, targets, alpha=1.5)

# Sparsemax regime
loss_sp = entmax_bisect_loss(logits, targets, alpha=2.0)

# Learnable alpha — gradient flows through alpha
alpha = torch.tensor(1.5, requires_grad=True)
loss = entmax_bisect_loss(logits, targets, alpha=alpha)
loss.sum().backward()
print(alpha.grad)   # gradient w.r.t. sparsity level

# Module form
criterion = EntmaxBisectLoss(alpha=1.7, n_iter=50, reduction="elementwise_mean")
mean_loss = criterion(logits, targets)
mean_loss.backward()
```
```

--------------------------------

### Entmax15Loss

Source: https://context7.com/deep-spin/entmax/llms.txt

Entmax15Loss is a Fenchel-Young loss based on the 1.5-entmax transform, offering a balance between cross-entropy and sparsemax loss. It can be used functionally or as a module and optionally return the support size.

```APIDOC
## Entmax15Loss — 1.5-entmax Fenchel-Young loss

### Description
Sparse loss based on the 1.5-entmax transform; a middle ground between cross-entropy and sparsemax loss.

### Usage
```python
import torch
from entmax import entmax15_loss, Entmax15Loss

logits = torch.tensor([[ 0.5,  2.0, -1.0],
                        [ 1.0, -0.5,  3.0]])
targets = torch.tensor([1, 2])

# Functional
losses = entmax15_loss(logits, targets)
print(losses)   # sparse per-sample losses, shape (2,)

# Module, also returns support size
criterion = Entmax15Loss(
    k=100,
    reduction="elementwise_mean",
    return_support_size=True
)
mean_loss, support = criterion(logits, targets)
print(f"Mean loss: {mean_loss.item():.4f}")
print(f"Support size: {support}")

# Gradient check
logits.requires_grad_(True)
loss = entmax15_loss(logits, targets)
loss.sum().backward()
print(logits.grad)   # sparse gradient — only nonzero for support entries
```
```

--------------------------------

### SparsemaxLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

SparsemaxLoss is a Fenchel-Young loss based on the sparsemax transform. It can be used functionally or as a module, with options for partial sorting (k) and ignoring specific indices.

```APIDOC
## SparsemaxLoss

### Description
SparsemaxLoss is a Fenchel-Young loss based on the sparsemax transform. It can be used functionally or as a module, with options for partial sorting (k) and ignoring specific indices.

### Usage
```python
import torch
from entmax import SparsemaxLoss

# Module form with reduction and ignore_index
criterion = SparsemaxLoss(
    k=None,                     # full sort; set k for partial-sort speedup
    ignore_index=-100,
    reduction="elementwise_mean"
)

# Example usage in a training loop
model = torch.nn.Linear(128, 50)     # 50-class classifier
optimizer = torch.optim.Adam(model.parameters())
criterion = SparsemaxLoss(reduction="elementwise_mean")

x = torch.randn(32, 128)
y = torch.randint(0, 50, (32,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item():.4f}")
```
```

--------------------------------

### Compute Fenchel-Young Sparsemax Loss

Source: https://context7.com/deep-spin/entmax/llms.txt

Computes a sparse alternative to cross-entropy loss based on the sparsemax transform. Produces zero loss when the target class receives all probability mass. Functional and module forms are available. Requires torch, nn, and sparsemax_loss/SparsemaxLoss from entmax.

```python
import torch
import torch.nn as nn
from entmax import sparsemax_loss, SparsemaxLoss

# Functional form
logits = torch.tensor([[ 0.5,  2.0, -1.0],
                        [ 1.0, -0.5,  3.0]])
targets = torch.tensor([1, 2])

losses = sparsemax_loss(logits, targets)   # shape: (2,)
print(losses)  # per-sample losses
```
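The zero-loss property can be verified by hand. The sketch below assumes the Fenchel-Young form of the sparsemax loss, `<p - y, x> + 0.5 * (1 - ||p||^2)`, where `p` is the sparsemax of `x` and `y` is the one-hot target. Both helpers are hypothetical pure-Python reference code, not the library's implementation.

```python
def sparsemax_probs(x):
    """Sort-based sparsemax: Euclidean projection onto the probability simplex."""
    csum, tau = 0.0, 0.0
    for k, zk in enumerate(sorted(x, reverse=True), start=1):
        csum += zk
        if 1.0 + k * zk > csum:
            tau = (csum - 1.0) / k
    return [max(v - tau, 0.0) for v in x]


def sparsemax_fy_loss(x, target):
    """Reference sketch: <p - y, x> + 0.5 * (1 - ||p||^2), with y one-hot at target."""
    p = sparsemax_probs(x)
    inner = sum((pi - (1.0 if i == target else 0.0)) * xi
                for i, (pi, xi) in enumerate(zip(p, x)))
    return inner + 0.5 * (1.0 - sum(pi * pi for pi in p))


# The two rows from the example above: the target logit wins by more than 1,
# so sparsemax puts all mass on the target and the loss is exactly zero.
print(sparsemax_fy_loss([0.5, 2.0, -1.0], target=1))
print(sparsemax_fy_loss([1.0, -0.5, 3.0], target=2))

# A case where the target is not the top-scoring class gives a positive loss.
print(sparsemax_fy_loss([0.0, 0.5], target=0))   # 0.5625
```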

--------------------------------

### EntmaxBisectLoss Functional Usage with Different Alphas

Source: https://context7.com/deep-spin/entmax/llms.txt

Compute generic alpha-entmax Fenchel-Young loss using `entmax_bisect_loss`. Supports arbitrary alpha values, including the default alpha=1.5 and the sparsemax regime (alpha=2.0).

```python
import torch
from entmax import entmax_bisect_loss, EntmaxBisectLoss

logits = torch.randn(8, 100)    # 8 examples, 100 classes
targets = torch.randint(0, 100, (8,))

# alpha=1.5 (default)
loss_15 = entmax_bisect_loss(logits, targets, alpha=1.5)

# Sparsemax regime
loss_sp = entmax_bisect_loss(logits, targets, alpha=2.0)
```

--------------------------------

### NormmaxBisectLoss

Source: https://context7.com/deep-spin/entmax/llms.txt

NormmaxBisectLoss is an alpha-normmax Fenchel-Young loss derived from the normmax transform. It can be used functionally or as a module.

```APIDOC
## NormmaxBisectLoss — Alpha-normmax Fenchel-Young loss

### Description
Fenchel-Young loss derived from the normmax transform.

### Usage
```python
import torch
from entmax import normmax_bisect_loss, NormmaxBisectLoss

logits = torch.randn(16, 50, requires_grad=True)  # requires_grad so backward() below works
targets = torch.randint(0, 50, (16,))

# Functional
loss = normmax_bisect_loss(logits, targets, alpha=2.0)
print(loss)   # per-sample losses, shape (16,)

# Module
criterion = NormmaxBisectLoss(
    alpha=3.0,
    n_iter=50,
    reduction="elementwise_mean"
)
mean_loss = criterion(logits, targets)
mean_loss.backward()
```
```
