# MABWiser

MABWiser is a research library written in Python for rapid prototyping of multi-armed bandit algorithms. It supports context-free, parametric, and non-parametric contextual bandit models with built-in parallelization for both training and testing components. The library provides a scikit-learn style public interface for fitting models on historical decision/reward data and predicting the best arm based on learned expectations.

Developed by the Artificial Intelligence Center of Excellence at Fidelity Investments, MABWiser includes a comprehensive simulation utility for comparing different policies and performing hyper-parameter tuning. The library is designed for applications like A/B testing, advertisement optimization, recommendation systems, and any scenario requiring sequential decision-making under uncertainty. It integrates with Mab2Rec for recommender systems and ALNS for combinatorial optimization problems.

## MAB Class Initialization

The `MAB` class is the main entry point for creating multi-armed bandit models. It accepts a list of arms, a learning policy, an optional neighborhood policy for contextual bandits, and parallelization settings.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

# Context-free bandit with Epsilon Greedy policy
arms = ['Arm1', 'Arm2', 'Arm3']
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.15),
    seed=123456,
    n_jobs=1  # Number of parallel jobs (-1 for all CPUs)
)

# Contextual bandit with LinUCB policy
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.25, l2_lambda=1.0),
    seed=123456
)

# Non-parametric contextual bandit with neighborhood policy
neighborhood_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.25),
    neighborhood_policy=NeighborhoodPolicy.KNearest(k=5, metric="euclidean"),
    n_jobs=4
)
```

## fit() - Training the Model

The `fit()` method trains the multi-armed bandit on historical decision and reward data. For contextual bandits, context features must also be provided.

```python
from mabwiser.mab import MAB, LearningPolicy

# Define arms and create bandit
arms = ['Layout1', 'Layout2']
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)

# Historical data: which arm was chosen and what reward was received
decisions = ['Layout1', 'Layout1', 'Layout2', 'Layout1', 'Layout2', 'Layout2']
rewards = [10, 17, 22, 9, 25, 15]

# Train the model
mab.fit(decisions=decisions, rewards=rewards)

# For contextual bandits, include context features
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0)
)
contexts = [[0.2, 0.5], [0.8, 0.3], [0.1, 0.9], [0.5, 0.5], [0.3, 0.7], [0.9, 0.1]]
contextual_mab.fit(decisions=decisions, rewards=rewards, contexts=contexts)
```
## predict() - Making Predictions

The `predict()` method returns the best arm based on the learned policy. For contextual bandits, context features for the prediction must be provided.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

# Train a context-free bandit
arms = ['Ad1', 'Ad2', 'Ad3']
mab = MAB(arms=arms, learning_policy=LearningPolicy.UCB1(alpha=1.25), seed=42)
decisions = ['Ad1', 'Ad2', 'Ad1', 'Ad3', 'Ad2', 'Ad1']
rewards = [10, 5, 12, 8, 6, 15]
mab.fit(decisions, rewards)

# Predict the best arm
best_arm = mab.predict()
print(f"Best arm: {best_arm}")  # Output: Best arm: Ad1

# For contextual bandits, predict with context
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0)
)
contexts = [[22, 0.5, 1], [35, 0.8, 0], [28, 0.3, 1],
            [45, 0.6, 0], [31, 0.4, 1], [27, 0.7, 0]]
contextual_mab.fit(decisions, rewards, contexts)

# Single prediction
new_context = [[30, 0.5, 1]]
prediction = contextual_mab.predict(new_context)
print(f"Prediction for context: {prediction}")

# Batch predictions
new_contexts = [[25, 0.6, 1], [40, 0.3, 0], [33, 0.9, 1]]
predictions = contextual_mab.predict(new_contexts)
print(f"Batch predictions: {predictions}")  # Returns a list of predictions
```

## predict_expectations() - Getting Expected Rewards

The `predict_expectations()` method returns a dictionary mapping each arm to its expected reward, useful for understanding the model's confidence in each arm.

```python
from mabwiser.mab import MAB, LearningPolicy

# Create and train bandit
arms = ['Option1', 'Option2', 'Option3']
mab = MAB(arms=arms, learning_policy=LearningPolicy.Softmax(tau=0.5), seed=42)
decisions = ['Option1', 'Option2', 'Option1', 'Option3', 'Option2', 'Option1', 'Option3']
rewards = [100, 80, 95, 70, 85, 110, 75]
mab.fit(decisions, rewards)

# Get the expectations for all arms
expectations = mab.predict_expectations()
print(f"Expected rewards: {expectations}")
# Dictionary mapping each arm to its expectation under the Softmax policy

# For contextual bandits with multiple contexts
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0)
)
contexts = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]]
contextual_mab.fit(decisions, rewards, contexts)

# Get expectations for multiple test contexts
test_contexts = [[1, 0], [0, 1]]
all_expectations = contextual_mab.predict_expectations(test_contexts)
for i, exp in enumerate(all_expectations):
    print(f"Context {i}: {exp}")
```
## partial_fit() - Online Learning

The `partial_fit()` method enables online learning by incrementally updating the model with new decision-reward pairs without retraining from scratch.

```python
from mabwiser.mab import MAB, LearningPolicy

# Initial training
arms = ['Product_A', 'Product_B', 'Product_C']
mab = MAB(arms=arms, learning_policy=LearningPolicy.ThompsonSampling(), seed=42)
initial_decisions = ['Product_A', 'Product_B', 'Product_A']
initial_rewards = [1, 0, 1]  # Binary rewards for Thompson Sampling
mab.fit(initial_decisions, initial_rewards)
print(f"Initial prediction: {mab.predict()}")

# New data arrives - update the model incrementally
new_decisions = ['Product_C', 'Product_B', 'Product_C']
new_rewards = [1, 1, 1]
mab.partial_fit(new_decisions, new_rewards)
print(f"Updated prediction: {mab.predict()}")
print(f"Updated expectations: {mab.predict_expectations()}")

# Contextual online learning
contextual_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinGreedy(epsilon=0.1, l2_lambda=1.0)
)
initial_contexts = [[1, 2], [2, 1], [1, 1]]
contextual_mab.fit(initial_decisions, [10, 5, 12], initial_contexts)

# Update with new contextual data
new_contexts = [[2, 2], [1, 3], [3, 1]]
contextual_mab.partial_fit(new_decisions, [8, 15, 11], new_contexts)
```

## add_arm() and remove_arm() - Dynamic Arm Management

These methods allow adding or removing arms dynamically after the model has been trained.

```python
from mabwiser.mab import MAB, LearningPolicy

# Create and train bandit
arms = ['Strategy1', 'Strategy2']
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)
decisions = ['Strategy1', 'Strategy2', 'Strategy1', 'Strategy2']
rewards = [100, 80, 95, 85]
mab.fit(decisions, rewards)
print(f"Current arms: {mab.arms}")
print(f"Best arm: {mab.predict()}")

# Add a new arm (starts with no training data)
mab.add_arm('Strategy3')
print(f"Arms after addition: {mab.arms}")

# Train the new arm with partial_fit
mab.partial_fit(['Strategy3', 'Strategy3'], [120, 115])
print(f"Best arm after training new arm: {mab.predict()}")
print(f"Expectations: {mab.predict_expectations()}")

# Remove an underperforming arm
mab.remove_arm('Strategy2')
print(f"Arms after removal: {mab.arms}")

# For Thompson Sampling, pass a custom binarizer when adding a new arm
def custom_binarizer(arm, reward):
    thresholds = {'Strategy1': 90, 'Strategy3': 110, 'Strategy4': 100}
    return reward > thresholds.get(arm, 100)

ts_mab = MAB(arms=['Strategy1', 'Strategy3'],
             learning_policy=LearningPolicy.ThompsonSampling(binarizer=custom_binarizer),
             seed=42)
ts_mab.fit(['Strategy1', 'Strategy3'], [95, 115])
ts_mab.add_arm('Strategy4', binarizer=custom_binarizer)
```
## warm_start() - Cold Start Handling

The `warm_start()` method addresses the cold start problem by initializing new arms using feature similarity to existing trained arms.

```python
from mabwiser.mab import MAB, LearningPolicy

# Create and train bandit
arms = ['Item1', 'Item2', 'Item3']
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)
decisions = ['Item1', 'Item2', 'Item1', 'Item2']  # Item3 has no training data
rewards = [50, 40, 55, 45]
mab.fit(decisions, rewards)

# Check cold arms (arms with no training data)
print(f"Cold arms: {mab.cold_arms}")  # Output: ['Item3']

# Define feature vectors for each arm
arm_to_features = {
    'Item1': [1.0, 0.5, 0.2],
    'Item2': [0.8, 0.6, 0.3],
    'Item3': [0.9, 0.55, 0.25]  # Similar to Item1
}

# Warm start cold arms using similar trained arms.
# distance_quantile=0.5 means only arms within the 50th percentile
# of pairwise feature distances are used for warm starting.
mab.warm_start(arm_to_features, distance_quantile=0.5)
print(f"Cold arms after warm start: {mab.cold_arms}")
print(f"Expectations: {mab.predict_expectations()}")

# Add a new arm, provide its features, and warm start it
mab.add_arm('Item4')
arm_to_features['Item4'] = [0.85, 0.58, 0.28]
mab.warm_start(arm_to_features, distance_quantile=0.75)
```

## LearningPolicy.EpsilonGreedy - Exploration vs Exploitation

Epsilon Greedy selects the best arm with probability (1 - epsilon) and a random arm with probability epsilon for exploration.

```python
from mabwiser.mab import MAB, LearningPolicy

# Arms represent different website layouts
arms = ['Layout_A', 'Layout_B', 'Layout_C']

# Historical A/B test data
decisions = ['Layout_A', 'Layout_B', 'Layout_C', 'Layout_A', 'Layout_B',
             'Layout_A', 'Layout_C', 'Layout_B', 'Layout_A', 'Layout_C']
rewards = [12, 8, 15, 10, 9, 14, 16, 7, 11, 18]

# Higher epsilon = more exploration
# Lower epsilon = more exploitation of the known best arm
mab_explore = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.25),  # 25% exploration
    seed=42
)
mab_explore.fit(decisions, rewards)

mab_exploit = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.05),  # 5% exploration
    seed=42
)
mab_exploit.fit(decisions, rewards)

print(f"High exploration prediction: {mab_explore.predict()}")
print(f"High exploitation prediction: {mab_exploit.predict()}")
print(f"Expectations: {mab_exploit.predict_expectations()}")
```
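To see what the policy is doing, the selection rule can be written out by hand. The following is a minimal sketch of the epsilon-greedy rule itself, illustrative only rather than MABWiser's internals; the per-arm means are the ones implied by the training data above.

```python
import random

random.seed(42)
epsilon = 0.25
# Per-arm mean rewards implied by the training data above
arm_means = {'Layout_A': 11.75, 'Layout_B': 8.0, 'Layout_C': 16.33}

# With probability epsilon, explore a random arm;
# otherwise exploit the arm with the highest mean
if random.random() < epsilon:
    choice = random.choice(list(arm_means))
else:
    choice = max(arm_means, key=arm_means.get)
print(f"Selected: {choice}")
```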
## LearningPolicy.UCB1 - Upper Confidence Bound

UCB1 balances exploration and exploitation by selecting arms based on their upper confidence bound, favoring both high-reward and under-explored arms.

```python
from mabwiser.mab import MAB, LearningPolicy

# Arms represent different recommendation algorithms
arms = ['Collaborative', 'ContentBased', 'Hybrid', 'Popular']

# Historical engagement data
decisions = ['Collaborative', 'ContentBased', 'Hybrid', 'Popular',
             'Collaborative', 'ContentBased', 'Hybrid', 'Collaborative']
rewards = [0.8, 0.6, 0.75, 0.5, 0.85, 0.55, 0.7, 0.9]

# Alpha controls exploration: higher alpha = more exploration
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.25),
    seed=42
)
mab.fit(decisions, rewards)

# UCB1 formula: mean + alpha * sqrt(2 * log(N) / n_i)
# where N = total trials, n_i = trials for arm i
print(f"Best arm: {mab.predict()}")
print(f"UCB expectations: {mab.predict_expectations()}")

# Online learning with UCB1
new_decisions = ['Popular', 'Hybrid']
new_rewards = [0.65, 0.8]
mab.partial_fit(new_decisions, new_rewards)
print(f"Updated best arm: {mab.predict()}")
```
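The formula in the comment above can be reproduced with a few lines of NumPy. This sketch recomputes the per-arm scores for the same training data to make the exploration bonus concrete; it illustrates the stated formula rather than MABWiser's internal implementation, so treat the numbers as approximate.

```python
import numpy as np

decisions = ['Collaborative', 'ContentBased', 'Hybrid', 'Popular',
             'Collaborative', 'ContentBased', 'Hybrid', 'Collaborative']
rewards = [0.8, 0.6, 0.75, 0.5, 0.85, 0.55, 0.7, 0.9]
alpha, N = 1.25, len(decisions)

for arm in ['Collaborative', 'ContentBased', 'Hybrid', 'Popular']:
    # Rewards observed for this arm, and how often it was played
    arm_rewards = [r for d, r in zip(decisions, rewards) if d == arm]
    n_i = len(arm_rewards)
    # Mean plus the confidence-bound bonus from the formula above
    ucb = np.mean(arm_rewards) + alpha * np.sqrt(2 * np.log(N) / n_i)
    print(f"{arm}: mean={np.mean(arm_rewards):.3f}, ucb={ucb:.3f}")
```

Note how the single-observation 'Popular' arm receives the largest bonus term, so UCB1 will keep trying it despite its low mean.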
## LearningPolicy.ThompsonSampling - Bayesian Approach

Thompson Sampling takes a Bayesian approach, maintaining a Beta distribution for each arm and sampling from it to make decisions. It requires binary rewards or a binarizer function.

```python
from mabwiser.mab import MAB, LearningPolicy

# Binary reward scenario (click/no-click)
arms = ['Banner_A', 'Banner_B', 'Banner_C']
decisions = ['Banner_A', 'Banner_B', 'Banner_C', 'Banner_A', 'Banner_B',
             'Banner_A', 'Banner_C', 'Banner_B', 'Banner_A', 'Banner_C']
rewards = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]  # Binary: clicked (1) or not (0)

mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.ThompsonSampling(),
    seed=42
)
mab.fit(decisions, rewards)
print(f"Thompson Sampling prediction: {mab.predict()}")
print(f"Success probabilities: {mab.predict_expectations()}")

# Non-binary rewards with a custom binarizer
arms_revenue = ['Plan_Basic', 'Plan_Pro', 'Plan_Enterprise']
decisions_rev = ['Plan_Basic', 'Plan_Pro', 'Plan_Enterprise', 'Plan_Basic', 'Plan_Pro']
rewards_rev = [29, 99, 299, 35, 89]  # Revenue amounts

# Binarizer: success if revenue meets the threshold for that arm
arm_thresholds = {'Plan_Basic': 30, 'Plan_Pro': 80, 'Plan_Enterprise': 250}

def revenue_binarizer(arm, reward):
    return reward >= arm_thresholds[arm]

mab_revenue = MAB(
    arms=arms_revenue,
    learning_policy=LearningPolicy.ThompsonSampling(binarizer=revenue_binarizer),
    seed=42
)
mab_revenue.fit(decisions_rev, rewards_rev)
print(f"Revenue-based prediction: {mab_revenue.predict()}")
```
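The Beta-posterior mechanics behind the policy are easy to illustrate standalone. The sketch below shows the general Thompson Sampling idea, not MABWiser's internal code: it tallies successes and failures from the banner click data above and plays one round by sampling each arm's Beta distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Success/failure counts per arm from the binary click data above
counts = {'Banner_A': (3, 1), 'Banner_B': (1, 2), 'Banner_C': (3, 0)}

# One round: sample each arm from Beta(successes + 1, failures + 1)
# and play the arm with the highest sampled value
samples = {arm: rng.beta(s + 1, f + 1) for arm, (s, f) in counts.items()}
print(max(samples, key=samples.get), samples)
```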
## LearningPolicy.LinUCB - Contextual Linear UCB

LinUCB uses ridge regression to model the relationship between context features and rewards, with an upper confidence bound for exploration.

```python
from mabwiser.mab import MAB, LearningPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different ad campaigns
arms = ['Campaign_Tech', 'Campaign_Fashion', 'Campaign_Sports']

# User context: [age, income_level, engagement_score]
contexts = [
    [25, 0.6, 0.8],   # Young, medium income, high engagement
    [45, 0.9, 0.5],   # Middle-aged, high income, medium engagement
    [30, 0.4, 0.9],   # Young, low income, very high engagement
    [55, 0.8, 0.3],   # Older, high income, low engagement
    [22, 0.3, 0.95],  # Very young, low income, very high engagement
    [40, 0.7, 0.6],   # Middle-aged, medium-high income, medium engagement
]
decisions = ['Campaign_Tech', 'Campaign_Fashion', 'Campaign_Tech',
             'Campaign_Fashion', 'Campaign_Sports', 'Campaign_Tech']
rewards = [15, 25, 18, 30, 12, 20]

# Scale contexts for better performance
scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# LinUCB: alpha controls exploration, l2_lambda is the regularization strength
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.5, l2_lambda=1.0, scale=False),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for new users
new_users = [[28, 0.5, 0.85], [50, 0.95, 0.4]]
new_users_scaled = scaler.transform(new_users)
predictions = mab.predict(new_users_scaled)
expectations = mab.predict_expectations(new_users_scaled)
for i, (pred, exp) in enumerate(zip(predictions, expectations)):
    print(f"User {i+1}: Recommended {pred}, Expectations: {exp}")
```

## LearningPolicy.LinTS - Contextual Thompson Sampling

LinTS combines linear regression with Thompson Sampling for contextual bandits, sampling from the posterior distribution of the regression coefficients.

```python
from mabwiser.mab import MAB, LearningPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different product recommendations
arms = ['Electronics', 'Clothing', 'Books', 'HomeGoods']

# User context: [browsing_time, cart_value, past_purchases, device_mobile]
contexts = [
    [15, 50, 3, 1],
    [45, 200, 10, 0],
    [8, 25, 1, 1],
    [30, 150, 7, 0],
    [20, 75, 4, 1],
    [60, 300, 15, 0],
]
decisions = ['Electronics', 'HomeGoods', 'Books', 'Clothing', 'Electronics', 'HomeGoods']
rewards = [120, 85, 25, 60, 95, 150]

# Scale the context features
scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# LinTS: alpha controls the exploration variance and must be > 0
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinTS(alpha=0.5, l2_lambda=1.0),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for a new user session
new_session = [[25, 100, 5, 1]]
new_session_scaled = scaler.transform(new_session)

# LinTS predictions have natural randomness from sampling
print(f"LinTS prediction: {mab.predict(new_session_scaled)}")
print(f"Expected values: {mab.predict_expectations(new_session_scaled)}")
```
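Both linear policies share the same ridge-regression core and differ only in how exploration enters the score. The sketch below works through that math for a single arm with plain NumPy, using the standard textbook formulation on made-up toy data; it is an illustration, not MABWiser's internals: LinUCB adds a confidence width around the ridge estimate, while LinTS samples coefficients from a Gaussian centered on it.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, l2_lambda = 1.0, 1.0

# Toy history for one arm: context rows X and observed rewards y
X = np.array([[0.2, 0.5], [0.8, 0.3], [0.5, 0.5]])
y = np.array([10.0, 18.0, 14.0])

A = l2_lambda * np.eye(X.shape[1]) + X.T @ X  # regularized Gram matrix
A_inv = np.linalg.inv(A)
beta = A_inv @ X.T @ y                        # ridge-regression coefficients

x = np.array([0.4, 0.6])                      # new context to score

# LinUCB-style score: point estimate plus confidence width
ucb_score = x @ beta + alpha * np.sqrt(x @ A_inv @ x)

# LinTS-style score: one draw from the coefficient posterior
beta_sample = rng.multivariate_normal(beta, alpha ** 2 * A_inv)
ts_score = x @ beta_sample

print(f"LinUCB score: {ucb_score:.2f}, LinTS sampled score: {ts_score:.2f}")
```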
## NeighborhoodPolicy.KNearest - K-Nearest Neighbors Contextual

KNearest finds the k most similar historical contexts and applies the learning policy only to those neighbors.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different treatment options
arms = ['Treatment_A', 'Treatment_B', 'Treatment_C']

# Patient context: [age, severity_score, biomarker_level]
contexts = [
    [35, 0.6, 1.2], [50, 0.8, 1.8], [42, 0.5, 1.0], [65, 0.9, 2.1],
    [38, 0.4, 0.9], [55, 0.7, 1.5], [48, 0.6, 1.3], [60, 0.85, 1.9],
]
decisions = ['Treatment_A', 'Treatment_B', 'Treatment_C', 'Treatment_B',
             'Treatment_A', 'Treatment_C', 'Treatment_A', 'Treatment_B']
rewards = [0.8, 0.7, 0.6, 0.75, 0.85, 0.65, 0.82, 0.72]

scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# KNearest with k=3 neighbors using euclidean distance
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.0),
    neighborhood_policy=NeighborhoodPolicy.KNearest(k=3, metric="euclidean"),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for a new patient
new_patient = [[45, 0.65, 1.4]]
new_patient_scaled = scaler.transform(new_patient)
prediction = mab.predict(new_patient_scaled)
expectations = mab.predict_expectations(new_patient_scaled)
print(f"Recommended treatment: {prediction}")
print(f"Expected outcomes: {expectations}")
```

## NeighborhoodPolicy.Radius - Radius-Based Neighborhood

The Radius neighborhood policy considers all historical observations within a specified distance of the prediction context.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different pricing strategies
arms = ['Price_Low', 'Price_Medium', 'Price_High']

# Market context: [demand_index, competitor_price, inventory_level]
contexts = [
    [0.8, 100, 0.9], [0.5, 95, 0.6], [0.9, 110, 0.95],
    [0.4, 90, 0.4], [0.7, 105, 0.8], [0.6, 98, 0.7],
]
decisions = ['Price_High', 'Price_Low', 'Price_High',
             'Price_Low', 'Price_Medium', 'Price_Medium']
rewards = [150, 80, 160, 75, 110, 105]

scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# Radius policy: considers neighbors within a distance of 1.5.
# If no neighbors are found, no_nhood_prob_of_arm gives the
# fallback probabilities for random selection.
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1),
    neighborhood_policy=NeighborhoodPolicy.Radius(
        radius=1.5,
        metric="euclidean",
        no_nhood_prob_of_arm=[0.3, 0.4, 0.3]  # Fallback probabilities
    ),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for new market conditions
new_market = [[0.75, 102, 0.85]]
new_market_scaled = scaler.transform(new_market)
print(f"Recommended pricing: {mab.predict(new_market_scaled)}")
print(f"Expected revenues: {mab.predict_expectations(new_market_scaled)}")
```
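Both neighborhood policies reduce to the same two-step recipe: select the historical rows near the query context, then run the learning policy on only those rows. Below is a minimal sketch of the selection step on made-up toy data; MABWiser performs this internally, so this is purely illustrative.

```python
import numpy as np

# Toy scaled history: contexts, decisions, rewards
contexts = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
decisions = np.array(['A', 'B', 'A', 'B'])
rewards = np.array([1.0, 0.4, 0.9, 0.5])

query = np.array([0.15, 0.15])
dists = np.linalg.norm(contexts - query, axis=1)

k_idx = np.argsort(dists)[:3]      # KNearest: indices of the 3 closest rows
r_idx = np.where(dists <= 0.5)[0]  # Radius: indices of rows within 0.5
print("KNearest neighbors:", k_idx, "Radius neighbors:", r_idx)

# Mean reward per arm among the k nearest neighbors
for arm in np.unique(decisions[k_idx]):
    print(arm, rewards[k_idx][decisions[k_idx] == arm].mean())
```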
## NeighborhoodPolicy.Clusters - Cluster-Based Contextual

The Clusters policy uses k-means clustering to partition the context space, applying the learning policy within each cluster.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from sklearn.preprocessing import StandardScaler

# Arms represent different content categories
arms = ['News', 'Entertainment', 'Sports', 'Technology']

# User context: [age, time_of_day, session_duration]
contexts = [
    [25, 0.3, 15],   # Young, morning, short session
    [45, 0.8, 45],   # Middle-aged, evening, long session
    [30, 0.5, 25],   # Young, afternoon, medium session
    [55, 0.9, 60],   # Older, night, very long session
    [22, 0.2, 10],   # Very young, early morning, very short
    [35, 0.6, 30],   # Adult, afternoon, medium-long
    [50, 0.85, 50],  # Middle-aged, evening, long
    [28, 0.4, 20],   # Young adult, late morning, medium
]
decisions = ['Entertainment', 'News', 'Sports', 'News',
             'Entertainment', 'Technology', 'News', 'Sports']
rewards = [1, 1, 1, 1, 1, 0, 1, 1]  # Binary rewards for Thompson Sampling

scaler = StandardScaler()
scaled_contexts = scaler.fit_transform(contexts)

# Clusters policy with 3 clusters
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.ThompsonSampling(),
    neighborhood_policy=NeighborhoodPolicy.Clusters(
        n_clusters=3,
        is_minibatch=False  # Set True to use MiniBatchKMeans for large datasets
    ),
    seed=42
)
mab.fit(decisions, rewards, scaled_contexts)

# Predict for a new user
new_user = [[32, 0.55, 28]]
new_user_scaled = scaler.transform(new_user)
print(f"Recommended content: {mab.predict(new_user_scaled)}")
```

## NeighborhoodPolicy.TreeBandit - Decision Tree Partitioning

TreeBandit uses decision trees to partition the context space, maintaining separate bandit statistics at each leaf node. It is compatible with EpsilonGreedy, UCB1, and ThompsonSampling.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy

# Arms represent different email subject lines
arms = ['SubjectLine_A', 'SubjectLine_B', 'SubjectLine_C']

# Recipient context: [email_frequency, past_opens, segment_id, time_since_last]
contexts = [
    [0.5, 0.8, 1, 2], [0.2, 0.3, 2, 7], [0.8, 0.9, 1, 1], [0.1, 0.2, 3, 14],
    [0.6, 0.7, 1, 3], [0.3, 0.4, 2, 5], [0.9, 0.95, 1, 1], [0.4, 0.5, 2, 4],
]
decisions = ['SubjectLine_A', 'SubjectLine_B', 'SubjectLine_C', 'SubjectLine_B',
             'SubjectLine_A', 'SubjectLine_C', 'SubjectLine_A', 'SubjectLine_B']
rewards = [1, 0, 1, 0, 1, 0, 1, 1]  # Binary: opened (1) or not (0)

# TreeBandit with custom decision tree parameters
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.ThompsonSampling(),
    neighborhood_policy=NeighborhoodPolicy.TreeBandit(
        tree_parameters={
            'max_depth': 4,
            'min_samples_leaf': 2,
            'min_samples_split': 4
        }
    ),
    seed=42
)
mab.fit(decisions, rewards, contexts)

# Predict for a new recipient
new_recipient = [[0.55, 0.75, 1, 2]]
print(f"Best subject line: {mab.predict(new_recipient)}")
print(f"Open probabilities: {mab.predict_expectations(new_recipient)}")
```

## Simulator - Comparing Multiple Bandits

The Simulator utility enables comparing different bandit configurations, performing hyper-parameter tuning, and running offline/online simulations.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
from mabwiser.simulator import Simulator
from sklearn.preprocessing import StandardScaler
import numpy as np
import random

# Generate sample data
random.seed(42)
size = 500
arms = [0, 1, 2]
decisions = [random.choice(arms) for _ in range(size)]
rewards = [random.randint(0, 100) for _ in range(size)]
contexts = [[random.random() for _ in range(5)] for _ in range(size)]

# Define bandits to compare
bandits = [
    ('EpsilonGreedy_10%', MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=42)),
    ('EpsilonGreedy_25%', MAB(arms, LearningPolicy.EpsilonGreedy(epsilon=0.25), seed=42)),
    ('UCB1_alpha1', MAB(arms, LearningPolicy.UCB1(alpha=1.0), seed=42)),
    ('UCB1_alpha1.5', MAB(arms, LearningPolicy.UCB1(alpha=1.5), seed=42)),
    ('LinUCB', MAB(arms, LearningPolicy.LinUCB(alpha=1.0), seed=42)),
]

# Create simulator
sim = Simulator(
    bandits=bandits,
    decisions=decisions,
    rewards=rewards,
    contexts=contexts,
    scaler=StandardScaler(),
    test_size=0.3,     # 30% for testing
    is_ordered=False,  # Random train/test split
    batch_size=50,     # Online learning with batches of 50
    seed=42
)

# Run simulation
sim.run()

# Access results
for name, mab in sim.bandits:
    print(f"\n{name}:")
    print(f"  Confusion Matrix: {sim.bandit_to_confusion_matrices[name][-1]}")
    if 'total' in sim.bandit_to_arm_to_stats_avg[name]:
        total_stats = sim.bandit_to_arm_to_stats_avg[name]['total']
        total_reward = sum(s['sum'] for s in total_stats.values()
                           if not np.isnan(s['sum']))
        print(f"  Total Predicted Reward: {total_reward:.2f}")

# Plot results (requires matplotlib)
# sim.plot(metric='avg', is_per_arm=False)
```
## Parallel Processing

MABWiser supports parallel processing for both training and prediction, significantly improving performance on large datasets.

```python
from mabwiser.mab import MAB, LearningPolicy, NeighborhoodPolicy
import numpy as np
from sklearn.datasets import make_classification

# Generate a large dataset
np.random.seed(42)
n_samples = 50000
n_features = 100
arms = list(range(10))

contexts, _ = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=20,
    random_state=42
)
decisions = np.random.choice(arms, size=n_samples)
rewards = np.random.rand(n_samples)

# Parallel training and prediction with n_jobs
# n_jobs=-1 uses all available CPUs
# n_jobs=-2 uses all CPUs except one
mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.LinUCB(alpha=1.0),
    n_jobs=-1,      # Use all CPUs
    backend='loky'  # Options: 'loky', 'multiprocessing', 'threading'
)

# Parallel fit
mab.fit(decisions, rewards, contexts)

# Parallel prediction for a batch of new contexts
test_contexts = np.random.randn(1000, n_features)
predictions = mab.predict(test_contexts)
print(f"Processed {len(predictions)} predictions")
print(f"Distribution: {np.unique(predictions, return_counts=True)}")

# Contextual neighborhood policy with parallel processing
neighborhood_mab = MAB(
    arms=arms,
    learning_policy=LearningPolicy.UCB1(alpha=1.0),
    neighborhood_policy=NeighborhoodPolicy.Radius(radius=2.0),
    n_jobs=4,
    backend='loky'
)
neighborhood_mab.fit(decisions, rewards, contexts)
```

## Summary

MABWiser provides a comprehensive toolkit for implementing multi-armed bandit algorithms in Python. The primary use cases include A/B testing for web optimization, personalized recommendation systems, dynamic pricing, clinical trial optimization, and any sequential decision-making problem where you need to balance exploration of new options with exploitation of known good options.

The library supports context-free bandits for simple scenarios, parametric contextual bandits (LinUCB, LinTS, LinGreedy) when you have user/item features, and non-parametric approaches (KNearest, Radius, Clusters, TreeBandit) when the relationship between context and reward is complex.

Integration patterns typically involve: (1) collecting historical decision-reward data, (2) training a MAB model with `fit()`, (3) using `predict()` for real-time recommendations, and (4) continuously updating the model with `partial_fit()` as new data arrives. For production systems, use the Simulator to compare different policies and tune hyperparameters before deployment.

The parallel processing capabilities (the `n_jobs` parameter) enable scaling to large datasets, and the warm start functionality addresses cold start problems when introducing new arms. MABWiser follows scikit-learn conventions, making it easy to integrate into existing ML pipelines and combine with standard preprocessing tools like StandardScaler.
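As a closing illustration, the four-step integration pattern above might look like the following in a serving loop. This is a minimal sketch using only the calls documented in this guide; the reward lookup is a placeholder standing in for whatever feedback signal the application actually observes.

```python
from mabwiser.mab import MAB, LearningPolicy

# (1) Historical decision/reward log
arms = ['Variant_A', 'Variant_B']
decisions = ['Variant_A', 'Variant_B', 'Variant_A', 'Variant_B']
rewards = [12, 8, 15, 9]

# (2) Train once on the historical data
mab = MAB(arms=arms, learning_policy=LearningPolicy.EpsilonGreedy(epsilon=0.1), seed=7)
mab.fit(decisions=decisions, rewards=rewards)

# (3) Serve recommendations and (4) feed observed outcomes back incrementally
for _ in range(3):
    arm = mab.predict()
    observed_reward = 10  # placeholder: replace with the real outcome for `arm`
    mab.partial_fit(decisions=[arm], rewards=[observed_reward])
```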