# MABWiser

MABWiser is a research library written in Python for rapid prototyping of multi-armed bandit algorithms. It supports context-free, parametric, and non-parametric contextual bandit models with built-in parallelization for both training and testing components. The library provides a scikit-learn style public interface for fitting models on historical decision/reward data and predicting the best arm based on learned expectations.

Developed by the Artificial Intelligence Center of Excellence at Fidelity Investments, MABWiser includes a comprehensive simulation utility for comparing different policies and performing hyper-parameter tuning. The library is designed for applications like A/B testing, advertisement optimization, recommendation systems, and any scenario requiring sequential decision-making under uncertainty. It integrates with Mab2Rec for recommender systems and ALNS for combinatorial optimization problems.

## MAB Class Initialization

The `MAB` class is the main entry point for creating multi-armed bandit models. It accepts a list of arms, a learning policy, an optional neighborhood policy for contextual bandits, and parallelization settings.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

# Context-free bandit with Epsilon Greedy policy
arms = ['Arm1', 'Arm2', 'Arm3']
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.15),
    seed=123456,
    n_jobs=1  # Number of parallel jobs (-1 for all CPUs)
)

# Contextual bandit with LinUCB policy
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.25, l2_lambda=1.0),
    seed=123456
)

# Non-parametric contextual bandit with neighborhood policy
neighborhood_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.25),
    neighborhood_policy=NeighborhoodPolicy.KNearest(k=5, metric="euclidean"),
    n_jobs=4
)
```

## fit() - Training the Model

The `fit()` method trains the multi-armed bandit on historical decision and reward data. For contextual bandits, context features must also be provided.

```python
from mabwiser.mab import MAB, LearningPolicy

# Define arms and create bandit
arms = ['Layout1', 'Layout2']
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)

# Historical data: which arm was chosen and what reward was received
decisions = ['Layout1', 'Layout1', 'Layout2', 'Layout1', 'Layout2', 'Layout2']
rewards = [10, 17, 22, 9, 25, 15]

# Train the model
mab.fit(decisions=decisions, rewards=rewards)

# For contextual bandits, include context features
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0)
)
contexts = [[0.2, 0.5], [0.8, 0.3], [0.1, 0.9], [0.5, 0.5], [0.3, 0.7], [0.9, 0.1]]
contextual_mab.fit(decisions=decisions, rewards=rewards, contexts=contexts)
```
## predict() - Making Predictions

The `predict()` method returns the best arm based on the learned policy. For contextual bandits, context features for the prediction must be provided.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

# Train a context-free bandit
arms = ['Ad1', 'Ad2', 'Ad3']
mab = MAB(arms=arms, learning_policy=LearningPolicy.UCB1(alpha=1.25), seed=42)
decisions = ['Ad1', 'Ad2', 'Ad1', 'Ad3', 'Ad2', 'Ad1']
rewards = [10, 5, 12, 8, 6, 15]
mab.fit(decisions, rewards)

# Predict the best arm
best_arm = mab.predict()
print(f"Best arm: {best_arm}")  # Output: Best arm: Ad1

# For contextual bandits, predict with context
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0)
)
contexts = [[22, 0.5, 1], [35, 0.8, 0], [28, 0.3, 1],
            [45, 0.6, 0], [31, 0.4, 1], [27, 0.7, 0]]
contextual_mab.fit(decisions, rewards, contexts)

# Single prediction
new_context = [[30, 0.5, 1]]
prediction = contextual_mab.predict(new_context)
print(f"Prediction for context: {prediction}")

# Batch predictions
new_contexts = [[25, 0.6, 1], [40, 0.3, 0], [33, 0.9, 1]]
predictions = contextual_mab.predict(new_contexts)
print(f"Batch predictions: {predictions}")  # Returns a list of predictions
```

## predict_expectations() - Getting Expected Rewards

The `predict_expectations()` method returns a dictionary mapping each arm to its expected reward, useful for understanding the model's confidence in each arm.

```python
from mabwiser.mab import MAB, LearningPolicy

# Create and train bandit
arms = ['Option1', 'Option2', 'Option3']
mab = MAB(arms=arms, learning_policy=LearningPolicy.Softmax(tau=0.5), seed=42)
decisions = ['Option1', 'Option2', 'Option1', 'Option3', 'Option2', 'Option1', 'Option3']
rewards = [100, 80, 95, 70, 85, 110, 75]
mab.fit(decisions, rewards)

# Get the expectations for all arms
expectations = mab.predict_expectations()
print(f"Expected rewards: {expectations}")
# Dictionary mapping each arm to its expectation under the Softmax policy

# For contextual bandits with multiple contexts
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0)
)
contexts = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]]
contextual_mab.fit(decisions, rewards, contexts)

# Get expectations for multiple test contexts
test_contexts = [[1, 0], [0, 1]]
all_expectations = contextual_mab.predict_expectations(test_contexts)
for i, exp in enumerate(all_expectations):
    print(f"Context {i}: {exp}")
```
## partial_fit() - Online Learning

The `partial_fit()` method enables online learning by incrementally updating the model with new decision-reward pairs without retraining from scratch.

```python
from mabwiser.mab import MAB, LearningPolicy

# Initial training
arms = ['Product_A', 'Product_B', 'Product_C']
mab = MAB(arms=arms, learning_policy=LearningPolicy.ThompsonSampling(), seed=42)
initial_decisions = ['Product_A', 'Product_B', 'Product_A']
initial_rewards = [1, 0, 1]  # Binary rewards for Thompson Sampling
mab.fit(initial_decisions, initial_rewards)
print(f"Initial prediction: {mab.predict()}")

# New data arrives - update the model incrementally
new_decisions = ['Product_C', 'Product_B', 'Product_C']
new_rewards = [1, 1, 1]
mab.partial_fit(new_decisions, new_rewards)
print(f"Updated prediction: {mab.predict()}")
print(f"Updated expectations: {mab.predict_expectations()}")

# Contextual online learning
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinGreedy(epsilon=0.1, l2_lambda=1.0)
)
initial_contexts = [[1, 2], [2, 1], [1, 1]]
contextual_mab.fit(initial_decisions, [10, 5, 12], initial_contexts)

# Update with new contextual data
new_contexts = [[2, 2], [1, 3], [3, 1]]
contextual_mab.partial_fit(new_decisions, [8, 15, 11], new_contexts)
```

## add_arm() and remove_arm() - Dynamic Arm Management

These methods allow adding or removing arms dynamically after the model has been trained.

```python
from mabwiser.mab import MAB, LearningPolicy

# Create and train bandit
arms = ['Strategy1', 'Strategy2']
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)
decisions = ['Strategy1', 'Strategy2', 'Strategy1', 'Strategy2']
rewards = [100, 80, 95, 85]
mab.fit(decisions, rewards)
print(f"Current arms: {mab.arms}")
print(f"Best arm: {mab.predict()}")

# Add a new arm (starts with no training data)
mab.add_arm('Strategy3')
print(f"Arms after addition: {mab.arms}")

# Train the new arm with partial_fit
mab.partial_fit(['Strategy3', 'Strategy3'], [120, 115])
print(f"Best arm after training new arm: {mab.predict()}")
print(f"Expectations: {mab.predict_expectations()}")

# Remove an underperforming arm
mab.remove_arm('Strategy2')
print(f"Arms after removal: {mab.arms}")

# For Thompson Sampling, pass a custom binarizer when adding a new arm
def custom_binarizer(arm, reward):
    thresholds = {'Strategy1': 90, 'Strategy3': 110, 'Strategy4': 100}
    return reward > thresholds.get(arm, 100)

ts_mab = MAB(arms=['Strategy1', 'Strategy3'],
             learning_policy=LearningPolicy.ThompsonSampling(binarizer=custom_binarizer),
             seed=42)
ts_mab.fit(['Strategy1', 'Strategy3'], [95, 115])
ts_mab.add_arm('Strategy4', binarizer=custom_binarizer)
```
## warm_start() - Cold Start Handling

The `warm_start()` method addresses the cold start problem by initializing new arms using feature similarity to existing trained arms.

```python
from mabwiser.mab import MAB, LearningPolicy

# Create and train bandit
arms = ['Item1', 'Item2', 'Item3']
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)
decisions = ['Item1', 'Item2', 'Item1', 'Item2']  # Item3 has no training data
rewards = [50, 40, 55, 45]
mab.fit(decisions, rewards)

# Check cold arms (arms with no training data)
print(f"Cold arms: {mab.cold_arms}")  # Output: ['Item3']

# Define feature vectors for each arm
arm_to_features = {
    'Item1': [1.0, 0.5, 0.2],
    'Item2': [0.8, 0.6, 0.3],
    'Item3': [0.9, 0.55, 0.25]  # Similar to Item1
}

# Warm start cold arms using similar trained arms.
# distance_quantile=0.5 means only arms within the 50th percentile
# of pairwise feature distances are used for warm starting.
mab.warm_start(arm_to_features, distance_quantile=0.5)
print(f"Cold arms after warm start: {mab.cold_arms}")
print(f"Expectations: {mab.predict_expectations()}")

# Add a new arm, provide its features, and warm start it
mab.add_arm('Item4')
arm_to_features['Item4'] = [0.85, 0.58, 0.28]
mab.warm_start(arm_to_features, distance_quantile=0.75)
```

## LearningPolicy.EpsilonGreedy - Exploration vs Exploitation

Epsilon Greedy selects the best arm with probability (1 - epsilon) and a random arm with probability epsilon for exploration.

```python
from mabwiser.mab import MAB, LearningPolicy

# Arms represent different website layouts
arms = ['Layout_A', 'Layout_B', 'Layout_C']

# Historical A/B test data
decisions = ['Layout_A', 'Layout_B', 'Layout_C', 'Layout_A', 'Layout_B',
             'Layout_A', 'Layout_C', 'Layout_B', 'Layout_A', 'Layout_C']
rewards = [12, 8, 15, 10, 9, 14, 16, 7, 11, 18]

# Higher epsilon = more exploration
# Lower epsilon = more exploitation of the known best arm
mab_explore = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.25),  # 25% exploration
    seed=42
)
mab_explore.fit(decisions, rewards)

mab_exploit = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.05),  # 5% exploration
    seed=42
)
mab_exploit.fit(decisions, rewards)

print(f"High exploration prediction: {mab_explore.predict()}")
print(f"High exploitation prediction: {mab_exploit.predict()}")
print(f"Expectations: {mab_exploit.predict_expectations()}")
```
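To see what the policy is doing, the selection rule can be written out by hand. The following is a minimal sketch of the epsilon-greedy rule itself, illustrative only rather than MABWiser's internals; the per-arm means are the ones implied by the training data above.

```python
import random

random.seed(42)
epsilon = 0.25
# Per-arm mean rewards implied by the training data above
arm_means = {'Layout_A': 11.75, 'Layout_B': 8.0, 'Layout_C': 16.33}

# With probability epsilon, explore a random arm;
# otherwise exploit the arm with the highest mean
if random.random() < epsilon:
    choice = random.choice(list(arm_means))
else:
    choice = max(arm_means, key=arm_means.get)
print(f"Selected: {choice}")
```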
## LearningPolicy.UCB1 - Upper Confidence Bound

UCB1 balances exploration and exploitation by selecting arms based on their upper confidence bound, favoring both high-reward and under-explored arms.

```python
from mabwiser.mab import MAB, LearningPolicy

# Arms represent different recommendation algorithms
arms = ['Collaborative', 'ContentBased', 'Hybrid', 'Popular']

# Historical engagement data
decisions = ['Collaborative', 'ContentBased', 'Hybrid', 'Popular',
             'Collaborative', 'ContentBased', 'Hybrid', 'Collaborative']
rewards = [0.8, 0.6, 0.75, 0.5, 0.85, 0.55, 0.7, 0.9]

# Alpha controls exploration: higher alpha = more exploration
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.25),
    seed=42
)
mab.fit(decisions, rewards)

# UCB1 formula: mean + alpha * sqrt(2 * log(N) / n_i)
# where N = total trials, n_i = trials for arm i
print(f"Best arm: {mab.predict()}")
print(f"UCB expectations: {mab.predict_expectations()}")

# Online learning with UCB1
new_decisions = ['Popular', 'Hybrid']
new_rewards = [0.65, 0.8]
mab.partial_fit(new_decisions, new_rewards)
print(f"Updated best arm: {mab.predict()}")
```
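The formula in the comment above can be reproduced with a few lines of NumPy. This sketch recomputes the per-arm scores for the same training data to make the exploration bonus concrete; it illustrates the stated formula rather than MABWiser's internal implementation, so treat the numbers as approximate.

```python
import numpy as np

decisions = ['Collaborative', 'ContentBased', 'Hybrid', 'Popular',
             'Collaborative', 'ContentBased', 'Hybrid', 'Collaborative']
rewards = [0.8, 0.6, 0.75, 0.5, 0.85, 0.55, 0.7, 0.9]
alpha, N = 1.25, len(decisions)

for arm in ['Collaborative', 'ContentBased', 'Hybrid', 'Popular']:
    # Rewards observed for this arm, and how often it was played
    arm_rewards = [r for d, r in zip(decisions, rewards) if d == arm]
    n_i = len(arm_rewards)
    # Mean plus the confidence-bound bonus from the formula above
    ucb = np.mean(arm_rewards) + alpha * np.sqrt(2 * np.log(N) / n_i)
    print(f"{arm}: mean={np.mean(arm_rewards):.3f}, ucb={ucb:.3f}")
```

Note how the single-observation 'Popular' arm receives the largest bonus term, so UCB1 will keep trying it despite its low mean.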
## LearningPolicy.ThompsonSampling - Bayesian Approach

Thompson Sampling takes a Bayesian approach, maintaining a Beta distribution for each arm and sampling from it to make decisions. It requires binary rewards or a binarizer function.

```python
from mabwiser.mab import MAB, LearningPolicy

# Binary reward scenario (click/no-click)
arms = ['Banner_A', 'Banner_B', 'Banner_C']
decisions = ['Banner_A', 'Banner_B', 'Banner_C', 'Banner_A', 'Banner_B',
             'Banner_A', 'Banner_C', 'Banner_B', 'Banner_A', 'Banner_C']
rewards = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]  # Binary: clicked (1) or not (0)

mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.ThompsonSampling(),
    seed=42
)
mab.fit(decisions, rewards)
print(f"Thompson Sampling prediction: {mab.predict()}")
print(f"Success probabilities: {mab.predict_expectations()}")

# Non-binary rewards with a custom binarizer
arms_revenue = ['Plan_Basic', 'Plan_Pro', 'Plan_Enterprise']
decisions_rev = ['Plan_Basic', 'Plan_Pro', 'Plan_Enterprise', 'Plan_Basic', 'Plan_Pro']
rewards_rev = [29, 99, 299, 35, 89]  # Revenue amounts

# Binarizer: success if revenue meets the threshold for that arm
arm_thresholds = {'Plan_Basic': 30, 'Plan_Pro': 80, 'Plan_Enterprise': 250}

def revenue_binarizer(arm, reward):
    return reward >= arm_thresholds[arm]

mab_revenue = MAB(
    arms=arms_revenue,
    learning_policy=LearningPolicy.ThompsonSampling(binarizer=revenue_binarizer),
    seed=42
)
mab_revenue.fit(decisions_rev, rewards_rev)
print(f"Revenue-based prediction: {mab_revenue.predict()}")
```
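The Beta-posterior mechanics behind the policy are easy to illustrate standalone. The sketch below shows the general Thompson Sampling idea, not MABWiser's internal code: it tallies successes and failures from the banner click data above and plays one round by sampling each arm's Beta distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Success/failure counts per arm from the binary click data above
counts = {'Banner_A': (3, 1), 'Banner_B': (1, 2), 'Banner_C': (3, 0)}

# One round: sample each arm from Beta(successes + 1, failures + 1)
# and play the arm with the highest sampled value
samples = {arm: rng.beta(s + 1, f + 1) for arm, (s, f) in counts.items()}
print(max(samples, key=samples.get), samples)
```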
## LearningPolicy.LinUCB - Contextual Linear UCB

LinUCB uses ridge regression to model the relationship between context features and rewards, with an upper confidence bound for exploration.

```python
from mabwiser.mab import MAB, LearningPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different ad campaigns
arms = ['Campaign_Tech', 'Campaign_Fashion', 'Campaign_Sports']

# User context: [age, income_level, engagement_score]
contexts = [
    [25, 0.6, 0.8],   # Young, medium income, high engagement
    [45, 0.9, 0.5],   # Middle-aged, high income, medium engagement
    [30, 0.4, 0.9],   # Young, low income, very high engagement
    [55, 0.8, 0.3],   # Older, high income, low engagement
    [22, 0.3, 0.95],  # Very young, low income, very high engagement
    [40, 0.7, 0.6],   # Middle-aged, medium-high income, medium engagement
]
decisions = ['Campaign_Tech', 'Campaign_Fashion', 'Campaign_Tech',
             'Campaign_Fashion', 'Campaign_Sports', 'Campaign_Tech']
rewards = [15, 25, 18, 30, 12, 20]

# Scale contexts for better performance
scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# LinUCB: alpha controls exploration, l2_lambda is the regularization strength
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.5, l2_lambda=1.0, scale=False),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for new users
new_users = [[28, 0.5, 0.85], [50, 0.95, 0.4]]
new_users_scaled = scaler.transform(new_users)
predictions = mab.predict(new_users_scaled)
expectations = mab.predict_expectations(new_users_scaled)
for i, (pred, exp) in enumerate(zip(predictions, expectations)):
    print(f"User {i+1}: Recommended {pred}, Expectations: {exp}")
```

## LearningPolicy.LinTS - Contextual Thompson Sampling

LinTS combines linear regression with Thompson Sampling for contextual bandits, sampling from the posterior distribution of the regression coefficients.

```python
from mabwiser.mab import MAB, LearningPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different product recommendations
arms = ['Electronics', 'Clothing', 'Books', 'HomeGoods']

# User context: [browsing_time, cart_value, past_purchases, device_mobile]
contexts = [
    [15, 50, 3, 1],
    [45, 200, 10, 0],
    [8, 25, 1, 1],
    [30, 150, 7, 0],
    [20, 75, 4, 1],
    [60, 300, 15, 0],
]
decisions = ['Electronics', 'HomeGoods', 'Books', 'Clothing', 'Electronics', 'HomeGoods']
rewards = [120, 85, 25, 60, 95, 150]

# Scale the context features
scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# LinTS: alpha controls the exploration variance and must be > 0
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinTS(alpha=0.5, l2_lambda=1.0),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for a new user session
new_session = [[25, 100, 5, 1]]
new_session_scaled = scaler.transform(new_session)

# LinTS predictions have natural randomness from sampling
print(f"LinTS prediction: {mab.predict(new_session_scaled)}")
print(f"Expected values: {mab.predict_expectations(new_session_scaled)}")
```
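Both linear policies share the same ridge-regression core and differ only in how exploration enters the score. The sketch below works through that math for a single arm with plain NumPy, using the standard textbook formulation on made-up toy data; it is an illustration, not MABWiser's internals: LinUCB adds a confidence width around the ridge estimate, while LinTS samples coefficients from a Gaussian centered on it.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, l2_lambda = 1.0, 1.0

# Toy history for one arm: context rows X and observed rewards y
X = np.array([[0.2, 0.5], [0.8, 0.3], [0.5, 0.5]])
y = np.array([10.0, 18.0, 14.0])

A = l2_lambda * np.eye(X.shape[1]) + X.T @ X  # regularized Gram matrix
A_inv = np.linalg.inv(A)
beta = A_inv @ X.T @ y                        # ridge-regression coefficients

x = np.array([0.4, 0.6])                      # new context to score

# LinUCB-style score: point estimate plus confidence width
ucb_score = x @ beta + alpha * np.sqrt(x @ A_inv @ x)

# LinTS-style score: one draw from the coefficient posterior
beta_sample = rng.multivariate_normal(beta, alpha ** 2 * A_inv)
ts_score = x @ beta_sample

print(f"LinUCB score: {ucb_score:.2f}, LinTS sampled score: {ts_score:.2f}")
```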
## NeighborhoodPolicy.KNearest - K-Nearest Neighbors Contextual

KNearest finds the k most similar historical contexts and applies the learning policy only to those neighbors.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different treatment options
arms = ['Treatment_A', 'Treatment_B', 'Treatment_C']

# Patient context: [age, severity_score, biomarker_level]
contexts = [
    [35, 0.6, 1.2], [50, 0.8, 1.8], [42, 0.5, 1.0], [65, 0.9, 2.1],
    [38, 0.4, 0.9], [55, 0.7, 1.5], [48, 0.6, 1.3], [60, 0.85, 1.9],
]
decisions = ['Treatment_A', 'Treatment_B', 'Treatment_C', 'Treatment_B',
             'Treatment_A', 'Treatment_C', 'Treatment_A', 'Treatment_B']
rewards = [0.8, 0.7, 0.6, 0.75, 0.85, 0.65, 0.82, 0.72]

scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# KNearest with k=3 neighbors using euclidean distance
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.0),
    neighborhood_policy=NeighborhoodPolicy.KNearest(k=3, metric="euclidean"),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for a new patient
new_patient = [[45, 0.65, 1.4]]
new_patient_scaled = scaler.transform(new_patient)
prediction = mab.predict(new_patient_scaled)
expectations = mab.predict_expectations(new_patient_scaled)
print(f"Recommended treatment: {prediction}")
print(f"Expected outcomes: {expectations}")
```

## NeighborhoodPolicy.Radius - Radius-Based Neighborhood

The Radius neighborhood policy considers all historical observations within a specified distance of the prediction context.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different pricing strategies
arms = ['Price_Low', 'Price_Medium', 'Price_High']

# Market context: [demand_index, competitor_price, inventory_level]
contexts = [
    [0.8, 100, 0.9], [0.5, 95, 0.6], [0.9, 110, 0.95],
    [0.4, 90, 0.4], [0.7, 105, 0.8], [0.6, 98, 0.7],
]
decisions = ['Price_High', 'Price_Low', 'Price_High',
             'Price_Low', 'Price_Medium', 'Price_Medium']
rewards = [150, 80, 160, 75, 110, 105]

scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# Radius policy: considers neighbors within a distance of 1.5.
# If no neighbors are found, no_nhood_prob_of_arm gives the
# fallback probabilities for random selection.
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1),
    neighborhood_policy=NeighborhoodPolicy.Radius(
        radius=1.5,
        metric="euclidean",
        no_nhood_prob_of_arm=[0.3, 0.4, 0.3]  # Fallback probabilities
    ),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for new market conditions
new_market = [[0.75, 102, 0.85]]
new_market_scaled = scaler.transform(new_market)
print(f"Recommended pricing: {mab.predict(new_market_scaled)}")
print(f"Expected revenues: {mab.predict_expectations(new_market_scaled)}")
```
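Both neighborhood policies reduce to the same two-step recipe: select the historical rows near the query context, then run the learning policy on only those rows. Below is a minimal sketch of the selection step on made-up toy data; MABWiser performs this internally, so this is purely illustrative.

```python
import numpy as np

# Toy scaled history: contexts, decisions, rewards
contexts = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
decisions = np.array(['A', 'B', 'A', 'B'])
rewards = np.array([1.0, 0.4, 0.9, 0.5])

query = np.array([0.15, 0.15])
dists = np.linalg.norm(contexts - query, axis=1)

k_idx = np.argsort(dists)[:3]      # KNearest: indices of the 3 closest rows
r_idx = np.where(dists <= 0.5)[0]  # Radius: indices of rows within 0.5
print("KNearest neighbors:", k_idx, "Radius neighbors:", r_idx)

# Mean reward per arm among the k nearest neighbors
for arm in np.unique(decisions[k_idx]):
    print(arm, rewards[k_idx][decisions[k_idx] == arm].mean())
```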
## NeighborhoodPolicy.Clusters - Cluster-Based Contextual

The Clusters policy uses k-means clustering to partition the context space, applying the learning policy within each cluster.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different content categories
arms = ['News', 'Entertainment', 'Sports', 'Technology']

# User context: [age, time_of_day, session_duration]
contexts = [
    [25, 0.3, 15],   # Young, morning, short session
    [45, 0.8, 45],   # Middle-aged, evening, long session
    [30, 0.5, 25],   # Young, afternoon, medium session
    [55, 0.9, 60],   # Older, night, very long session
    [22, 0.2, 10],   # Very young, early morning, very short
    [35, 0.6, 30],   # Adult, afternoon, medium-long
    [50, 0.85, 50],  # Middle-aged, evening, long
    [28, 0.4, 20],   # Young adult, late morning, medium
]
decisions = ['Entertainment', 'News', 'Sports', 'News',
             'Entertainment', 'Technology', 'News', 'Sports']
rewards = [1, 1, 1, 1, 1, 0, 1, 1]  # Binary rewards for Thompson Sampling

scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# Clusters policy with 3 clusters
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.ThompsonSampling(),
    neighborhood_policy=NeighborhoodPolicy.Clusters(
        n_clusters=3,
        is_minibatch=False  # Set True to use MiniBatchKMeans for large datasets
    ),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for a new user
new_user = [[32, 0.55, 28]]
new_user_scaled = scaler.transform(new_user)
print(f"Recommended content: {mab.predict(new_user_scaled)}")
```

## NeighborhoodPolicy.TreeBandit - Decision Tree Partitioning

TreeBandit uses decision trees to partition the context space, maintaining separate bandit statistics at each leaf node. It is compatible with EpsilonGreedy, UCB1, and ThompsonSampling.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

# Arms represent different email subject lines
arms = ['SubjectLine_A', 'SubjectLine_B', 'SubjectLine_C']

# Recipient context: [email_frequency, past_opens, segment_id, time_since_last]
contexts = [
    [0.5, 0.8, 1, 2], [0.2, 0.3, 2, 7], [0.8, 0.9, 1, 1], [0.1, 0.2, 3, 14],
    [0.6, 0.7, 1, 3], [0.3, 0.4, 2, 5], [0.9, 0.95, 1, 1], [0.4, 0.5, 2, 4],
]
decisions = ['SubjectLine_A', 'SubjectLine_B', 'SubjectLine_C', 'SubjectLine_B',
             'SubjectLine_A', 'SubjectLine_C', 'SubjectLine_A', 'SubjectLine_B']
rewards = [1, 0, 1, 0, 1, 0, 1, 1]  # Binary: opened (1) or not (0)

# TreeBandit with custom decision tree parameters
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.ThompsonSampling(),
    neighborhood_policy=NeighborhoodPolicy.TreeBandit(
        tree_parameters={
            'max_depth': 4,
            'min_samples_leaf': 2,
            'min_samples_split': 4
        }
    ),
    seed=42
)
mab.fit(decisions, rewards, contexts)

# Predict for a new recipient
new_recipient = [[0.55, 0.75, 1, 2]]
print(f"Best subject line: {mab.predict(new_recipient)}")
print(f"Open probabilities: {mab.predict_expectations(new_recipient)}")
```

## Simulator - Comparing Multiple Bandits

The Simulator utility enables comparing different bandit configurations, performing hyper-parameter tuning, and running offline/online simulations.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from mabwiser.simulator import Simulator
from sklearn.preprocessing import StandardScaler
import numpy as np
import random

# Generate sample data
random.seed(42)
size = 500
arms = [0, 1, 2]
decisions = [random.choice(arms) for _ in range(size)]
rewards = [random.randint(0, 100) for _ in range(size)]
contexts = [[random.random() for _ in range(5)] for _ in range(size)]

# Define bandits to compare
bandits = [
    ('EpsilonGreedy_10%', MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)),
    ('EpsilonGreedy_25%', MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=42)),
    ('UCB1_alpha1', MAB(arms, LearningPolicy.UCB1(alpha=1.0), seed=42)),
    ('UCB1_alpha1.5', MAB(arms, LearningPolicy.UCB1(alpha=1.5), seed=42)),
    ('LinUCB', MAB(arms, LearningPolicy.LinUCB(alpha=1.0), seed=42)),
]

# Create simulator
sim = Simulator(
    bandits=bandits,
    decisions=decisions,
    rewards=rewards,
    contexts=contexts,
    scaler=StandardScaler(),
    test_size=0.3,     # 30% for testing
    is_ordered=False,  # Random train/test split
    batch_size=50,     # Online learning with batches of 50
    seed=42
)

# Run simulation
sim.run()

# Access results
for name, mab in sim.bandits:
    print(f"\n{name}:")
    print(f"  Confusion Matrix: {sim.bandit_to_confusion_matrices[name][-1]}")
    if 'total' in sim.bandit_to_arm_to_stats_avg[name]:
        total_stats = sim.bandit_to_arm_to_stats_avg[name]['total']
        total_reward = sum(s['sum'] for s in total_stats.values()
                           if not np.isnan(s['sum']))
        print(f"  Total Predicted Reward: {total_reward:.2f}")

# Plot results (requires matplotlib)
# sim.plot(metric='avg', is_per_arm=False)
```
## Parallel Processing

MABWiser supports parallel processing for both training and prediction, significantly improving performance on large datasets.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
import numpy as np
from sklearn.datasets import make_classification

# Generate a large dataset
np.random.seed(42)
n_samples = 50000
n_features = 100
arms = list(range(10))

contexts, _ = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=20,
    random_state=42
)
decisions = np.random.choice(arms, size=n_samples)
rewards = np.random.rand(n_samples)

# Parallel training and prediction with n_jobs
# n_jobs=-1 uses all available CPUs
# n_jobs=-2 uses all CPUs except one
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0),
    n_jobs=-1,      # Use all CPUs
    backend='loky'  # Options: 'loky', 'multiprocessing', 'threading'
)

# Parallel fit
mab.fit(decisions, rewards, contexts)

# Parallel prediction for a batch of new contexts
test_contexts = np.random.randn(1000, n_features)
predictions = mab.predict(test_contexts)
print(f"Processed {len(predictions)} predictions")
print(f"Distribution: {np.unique(predictions, return_counts=True)}")

# Contextual neighborhood policy with parallel processing
neighborhood_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.0),
    neighborhood_policy=NeighborhoodPolicy.Radius(radius=2.0),
    n_jobs=4,
    backend='loky'
)
neighborhood_mab.fit(decisions, rewards, contexts)
```

## Summary

MABWiser provides a comprehensive toolkit for implementing multi-armed bandit algorithms in Python. The primary use cases include A/B testing for web optimization, personalized recommendation systems, dynamic pricing, clinical trial optimization, and any sequential decision-making problem where you need to balance exploration of new options with exploitation of known good options.

The library supports context-free bandits for simple scenarios, parametric contextual bandits (LinUCB, LinTS, LinGreedy) when you have user/item features, and non-parametric approaches (KNearest, Radius, Clusters, TreeBandit) when the relationship between context and reward is complex.

Integration patterns typically involve: (1) collecting historical decision-reward data, (2) training a MAB model with `fit()`, (3) using `predict()` for real-time recommendations, and (4) continuously updating the model with `partial_fit()` as new data arrives. For production systems, use the Simulator to compare different policies and tune hyperparameters before deployment.

The parallel processing capabilities (the `n_jobs` parameter) enable scaling to large datasets, and the warm start functionality addresses cold start problems when introducing new arms. MABWiser follows scikit-learn conventions, making it easy to integrate into existing ML pipelines and combine with standard preprocessing tools like StandardScaler.
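As a closing illustration, the four-step integration pattern above might look like the following in a serving loop. This is a minimal sketch using only the calls documented in this guide; the reward lookup is a placeholder standing in for whatever feedback signal the application actually observes.

```python
from mabwiser.mab import MAB, LearningPolicy

# (1) Historical decision/reward log
arms = ['Variant_A', 'Variant_B']
decisions = ['Variant_A', 'Variant_B', 'Variant_A', 'Variant_B']
rewards = [12, 8, 15, 9]

# (2) Train once on the historical data
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=7)
mab.fit(decisions=decisions, rewards=rewards)

# (3) Serve recommendations and (4) feed observed outcomes back incrementally
for _ in range(3):
    arm = mab.predict()
    observed_reward = 10  # placeholder: replace with the real outcome for `arm`
    mab.partial_fit(decisions=[arm], rewards=[observed_reward])
```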