### Install PowerShap

Source: https://github.com/predict-idlab/powershap/blob/main/README.md

Install the PowerShap library using pip. This command installs the latest version from PyPI.

```bash
pip install powershap
```

--------------------------------

### Setup Autoreload and Imports for Simulation

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Loads the IPython autoreload extension and imports the libraries used by the simulation cells. Run this cell before any of the simulation code.

```python
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import shap
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
import sys
import time
```

--------------------------------

### Initialize and Fit PowerShap with CatBoostClassifier

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Initializes a PowerShap object with a CatBoostClassifier and fits it to training data. This snippet demonstrates the basic setup for using PowerShap with a specific machine learning model.
```python
import sys
sys.path.append("../powershap")

from powershap import PowerShap
from catboost import CatBoostClassifier, CatBoostRegressor
from sklearn.linear_model import LogisticRegressionCV, RidgeClassifierCV, LinearRegression
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier

# X_train and y_train come from the dataset-generation cell of the notebook.
selector = PowerShap(
    model=CatBoostClassifier(verbose=0, n_estimators=250, use_best_model=True),
    # Alternatives: LogisticRegressionCV(), GradientBoostingClassifier(),
    # CatBoostClassifier(verbose=0, n_estimators=250)
    # model=CatBoostRegressor(verbose=0, n_estimators=0, use_best_model=True),
    verbose=True,
)
selector.fit(X_train, y_train)
```

--------------------------------

### Generate Classification Dataset and Split

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Creates a synthetic three-class classification dataset using `make_classification` and splits it into training and testing sets. This is a common setup for evaluating machine learning models.

```python
n_features = 250  # 20, 50, 100, 250, 500
n_informative = int(0.50 * n_features)  # 5%, 10%, 33%, 50%, 90%
n_samples = int(5000 / (1 - 0.33)) + 1  # 7463, so that about 5000 samples remain for training

X, y = make_classification(
    n_samples=n_samples, n_classes=3, n_features=n_features,
    n_informative=n_informative, n_redundant=0, n_repeated=0, shuffle=False,
)
# X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=n_informative, random_state=4, shuffle=False)
X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=42)
X_train
```

--------------------------------

### Get Subset Columns

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Retrieves the column names from the subset generated by BorutaShap.
```python
subset.columns
```

--------------------------------

### Load and Prepare SCENE Dataset

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Loads the SCENE dataset from a CSV file, resets the index, and converts the target column to an integer type for classification. It then splits the data into training and validation sets, stratified by the target column.

```python
current_db = pd.read_csv("data/scene.csv")
current_db = current_db.reset_index()

Index_col = "index"
target_col = "Urban"
current_db[target_col] = current_db[target_col].astype(np.int32)

# Split on the index column so that rows can be selected with isin() afterwards.
train_idx, val_idx = train_test_split(
    current_db[Index_col], test_size=0.25, random_state=1, stratify=current_db[target_col]
)
current_db_train = current_db[current_db[Index_col].isin(train_idx)]
current_db_test = current_db[current_db[Index_col].isin(val_idx)]
```

--------------------------------

### PowerShap and BorutaShap Imports

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Imports PowerShap and shapicant together with the dataset generation and time tracking utilities, and initializes the variables used by the simulation studies. Note that this cell imports `shapicant`; BorutaShap itself is imported in the BorutaShap simulation cell.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
import sys
import time
# sys.path.append("../powershap")
from powershap import PowerShap
import shapicant

# n_features = 20  # 20, 50, 100, 250, 500
# n_informative = int(0.10 * n_features)  # 10%, 33%, 50%, 90%
# n_samples = 1000  # 5000

regression_bool = False
output_dict = {}
```

--------------------------------

### BorutaShap Feature Selection

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Configures and applies BorutaShap for feature selection using a CatBoostClassifier. This example specifies 'shap' as the importance measure and enables verbose output during fitting.
```python
from BorutaShap import BorutaShap
from catboost import CatBoostClassifier

# Weight each class by the share of the other class to counter class imbalance.
neg_ratio = len(current_db_train[current_db_train[target_col] == 0]) / len(current_db_train)
model = CatBoostClassifier(verbose=False, iterations=250, class_weights=[1 - neg_ratio, neg_ratio])

# if classification is False it is a Regression problem
Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=True)
Feature_Selector.fit(
    X=current_db_train[list(current_db_train.columns.values[1:-6])],
    y=current_db_train[target_col],
    sample=False, train_or_test='test', normalize=True, verbose=True,
)
subset = Feature_Selector.Subset()
```

--------------------------------

### Load and Prepare Energy Data

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Loads the energy dataset using pandas, resets the index, and splits the data into training and validation sets.

```python
current_db = pd.read_csv("data/energydata_complete.csv")
current_db = current_db.reset_index()

Index_col = "index"
target_col = "Appliances"

train_idx, val_idx = train_test_split(current_db[Index_col], test_size=0.25, random_state=1)
current_db_train = current_db[current_db[Index_col].isin(train_idx)]
current_db_test = current_db[current_db[Index_col].isin(val_idx)]
```

--------------------------------

### Data Loading and Preprocessing

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Loads the 'gina_prior.csv' dataset, preprocesses labels, and splits the data into training and validation sets using `train_test_split`.
```python
gina_prior_df = pd.read_csv("data/gina_prior.csv")
gina_prior_df = gina_prior_df.reset_index()
# Map the -1 label to 0 so the target is binary {0, 1}.
gina_prior_df.loc[gina_prior_df.label == -1, "label"] = 0

train_idx, val_idx = train_test_split(gina_prior_df["index"], test_size=0.25, random_state=1)
current_db_train = gina_prior_df[gina_prior_df["index"].isin(train_idx)]
current_db_test = gina_prior_df[gina_prior_df["index"].isin(val_idx)]

target_col = "label"
Index_col = "index"
```

--------------------------------

### Generate and Split Classification Dataset

Source: https://github.com/predict-idlab/powershap/blob/main/examples/test.ipynb

Generates a synthetic classification dataset using scikit-learn, converts it into a pandas DataFrame for easier manipulation, and splits it into training and testing sets. Ensure scikit-learn and pandas are installed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=750, n_classes=2, n_features=10, n_informative=2, n_redundant=1)
X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=42)
X_train
```

--------------------------------

### Run PowerShap Benchmark with CatBoost

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

This script runs a benchmark for PowerShap using CatBoost models. It iterates through various dataset sizes, feature counts, and informative feature percentages, recording execution times and feature selection results. Requires pandas and numpy.
```python
import sys
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
# sys.path.append("../powershap")
from powershap import PowerShap
from catboost import CatBoostClassifier, CatBoostRegressor

# benchmark_dict_print and output_dict_to_df are helper functions defined elsewhere in the notebook.
regression_bool = False
estimators = 250
hypercube = False

output_dict = {}
for n_samples in [1000, 5000, 20000]:
    output_dict[str(n_samples)] = {}
    for n_features in [20, 100, 250, 500]:
        output_dict[str(n_samples)][str(n_features)] = {}
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times
        for n_informative in [10, 33, 50, 90]:  # percentage of informative features
            n_inf = int(n_informative / 100 * n_features)  # absolute number of informative features
            results = {}
            output_dict[str(n_samples)][str(n_features)][str(n_informative) + "%"] = results
            print("Amount of samples = " + str(n_samples))
            print("Total used features = " + str(n_features))
            print("Informative features: " + str(n_inf) + " (" + str(n_informative) + " %)")
            print("")

            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0, 1, 2, 3, 4]:
                print("Seed " + str(random_seed))
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=n_inf,
                                           random_state=random_seed, shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features,
                                               hypercube=hypercube, n_informative=n_inf, n_redundant=0,
                                               n_repeated=0, shuffle=False, random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    model = CatBoostRegressor(verbose=0, n_estimators=estimators, use_best_model=True)
                else:
                    model = CatBoostClassifier(verbose=0, n_estimators=estimators, use_best_model=True)
                selector = PowerShap(model=model, automatic=True)
                selector.fit(X, y)
                times.append(time.time() - start_time)

                processed_shaps_df = selector._processed_shaps_df
                print(50 * "-")
                significant = processed_shaps_df[processed_shaps_df.p_value < 0.01]
                found_features.append(len(significant))
                found_idx_features.append(significant.index.values)

            # With shuffle=False the informative features are the first n_inf columns.
            informative_cols = X.columns.values[:n_inf]
            found_informative_features = [np.sum(np.isin(informative_cols, f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1 - np.isin(f_list, informative_cols)) for f_list in found_idx_features]

            print("Average time: " + str(np.round(np.mean(times), 2)) + " seconds")
            print("Found features: " + str(found_features))
            print("Found " + str(np.mean(found_informative_features)) + " of " + str(n_inf) + " informative features")
            print(str(np.mean(found_noise_features)) + " of " + str(np.mean(found_features)) + " outputted powershap features are noise features")

            average_times.append(np.round(times, 2))
            results["informative_features"] = n_inf
            results["found_informative_features"] = found_informative_features
            results["outputted_noise_features"] = found_noise_features
            print(100 * "=")
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times

benchmark_dict_print(output_dict)
print(100 * "=")
if hypercube:
    output_dict_to_df(output_dict).to_csv("estimators_250_Classification_output_df.csv", index=False)
else:
    output_dict_to_df(output_dict).to_csv("estimators_250_Classification_output_df_polytone.csv", index=False)
```

--------------------------------

### BorutaShap Simulation with CatBoost

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Runs a simulation to evaluate BorutaShap's feature selection
performance. It iterates through different numbers of samples, features, and informative features, measuring the average time taken and the accuracy of feature identification. Requires CatBoostClassifier and pandas.

```python
import sys
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
# sys.path.append("../powershap")
from powershap import PowerShap
from BorutaShap import BorutaShap
from catboost import CatBoostClassifier

# benchmark_dict_print and output_dict_to_df are helper functions defined elsewhere in the notebook.
regression_bool = False

output_dict = {}
for n_samples in [1000, 5000, 20000]:
    output_dict[str(n_samples)] = {}
    for n_features in [20, 100, 250, 500]:
        output_dict[str(n_samples)][str(n_features)] = {}
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times
        for n_informative in [10, 33, 50, 90]:  # percentage of informative features
            n_inf = int(n_informative / 100 * n_features)  # absolute number of informative features
            results = {}
            output_dict[str(n_samples)][str(n_features)][str(n_informative) + "%"] = results
            print("Amount of samples = " + str(n_samples))
            print("Total used features = " + str(n_features))
            print("Informative features: " + str(n_inf) + " (" + str(n_informative) + " %)")
            print("")

            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0, 1, 2, 3, 4]:
                print("Seed " + str(random_seed))
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=n_inf,
                                           random_state=random_seed, shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features,
                                               n_informative=n_inf, n_redundant=0, n_repeated=0,
                                               shuffle=False, random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                # if classification is False it is a Regression problem
                model = CatBoostClassifier(verbose=0, n_estimators=250)
                selector = BorutaShap(model=model, importance_measure='shap', classification=True)
                selector.fit(X=X, y=y, verbose=False)
                subset = selector.Subset()
                times.append(time.time() - start_time)

                print(50 * "-")
                found_features.append(len(subset.columns))
                found_idx_features.append(subset.columns)

            # With shuffle=False the informative features are the first n_inf columns.
            informative_cols = X.columns.values[:n_inf]
            found_informative_features = [np.sum(np.isin(informative_cols, f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1 - np.isin(f_list, informative_cols)) for f_list in found_idx_features]

            print("Average time: " + str(np.round(np.mean(times), 2)) + " seconds")
            print("Found features: " + str(found_features))
            print("Found " + str(np.mean(found_informative_features)) + " of " + str(n_inf) + " informative features")
            print(str(np.mean(found_noise_features)) + " of " + str(np.mean(found_features)) + " outputted BorutaShap features are noise features")

            average_times.append(np.round(times, 2))
            results["informative_features"] = n_inf
            results["found_informative_features"] = found_informative_features
            results["outputted_noise_features"] = found_noise_features
            print(100 * "=")
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times

benchmark_dict_print(output_dict)
output_dict_to_df(output_dict).to_csv("250_est_Catboost_borutashap_output_df.csv", index=False)
print(100 * "=")
```

--------------------------------

### Run PowerShap Simulation for Regression

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

This script simulates feature selection using PowerShap with CatBoost regressors.
It iterates through different dataset sizes, feature counts, and informative feature percentages, recording the average time taken and the number of features identified. Use this to benchmark PowerShap's performance.

```python
import sys
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
# sys.path.append("../powershap")
from powershap import PowerShap
from catboost import CatBoostClassifier, CatBoostRegressor

# benchmark_dict_print and output_dict_to_df are helper functions defined elsewhere in the notebook.
regression_bool = True
estimators = 250

output_dict = {}
for n_samples in [1000, 5000, 20000]:
    output_dict[str(n_samples)] = {}
    for n_features in [20, 100, 250, 500]:
        output_dict[str(n_samples)][str(n_features)] = {}
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times
        for n_informative in [10, 33, 50, 90]:  # percentage of informative features
            n_inf = int(n_informative / 100 * n_features)  # absolute number of informative features
            results = {}
            output_dict[str(n_samples)][str(n_features)][str(n_informative) + "%"] = results
            print("Amount of samples = " + str(n_samples))
            print("Total used features = " + str(n_features))
            print("Informative features: " + str(n_inf) + " (" + str(n_informative) + " %)")
            print("")

            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0, 1, 2, 3, 4]:
                print("Seed " + str(random_seed))
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=n_inf,
                                           random_state=random_seed, shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features,
                                               n_informative=n_inf, n_redundant=0, n_repeated=0,
                                               shuffle=False, random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    model = CatBoostRegressor(verbose=0, n_estimators=estimators, use_best_model=True)
                else:
                    model = CatBoostClassifier(verbose=0, n_estimators=estimators, use_best_model=True)
                selector = PowerShap(model=model, automatic=True)
                selector.fit(X, y)
                times.append(time.time() - start_time)

                processed_shaps_df = selector._processed_shaps_df
                print(50 * "-")
                significant = processed_shaps_df[processed_shaps_df.p_value < 0.01]
                found_features.append(len(significant))
                found_idx_features.append(significant.index.values)

            # With shuffle=False the informative features are the first n_inf columns.
            informative_cols = X.columns.values[:n_inf]
            found_informative_features = [np.sum(np.isin(informative_cols, f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1 - np.isin(f_list, informative_cols)) for f_list in found_idx_features]

            print("Average time: " + str(np.round(np.mean(times), 2)) + " seconds")
            print("Found features: " + str(found_features))
            print("Found " + str(np.mean(found_informative_features)) + " of " + str(n_inf) + " informative features")
            print(str(np.mean(found_noise_features)) + " of " + str(np.mean(found_features)) + " outputted powershap features are noise features")

            average_times.append(np.round(times, 2))
            results["informative_features"] = n_inf
            results["found_informative_features"] = found_informative_features
            results["outputted_noise_features"] = found_noise_features
            print(100 * "=")
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times

benchmark_dict_print(output_dict)
print(100 * "=")
output_dict_to_df(output_dict).to_csv("estimators_250_Regression_output_df.csv", index=False)
```

--------------------------------

### Data Preparation for Training

Source:
https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

This code prepares the training and testing datasets by selecting the features determined in the previous step and assigning the target column.

```python
X_train = current_db_train[selected_features]
Y_train = current_db_train[target_col]
X_test = current_db_test[selected_features]
Y_test = current_db_test[target_col]
```

--------------------------------

### Load and Prepare Data for Visualization

Source: https://github.com/predict-idlab/powershap/blob/main/examples/visualisation_powershap.ipynb

Loads result CSV files for the compared methods (PowerSHAP with different estimators, logistic regression, chi-squared, F-test, random forest, borutashap, shapicant). It then concatenates these DataFrames, calculates the performance metrics '%found_informative_features' and '%outputted_noise_features', renames columns, and filters the data by sample size and total features. This prepares the data for plotting.
```python
import seaborn as sns
from matplotlib.ticker import ScalarFormatter
import pandas as pd
import numpy as np

pure_powershap = False

df_esti_500 = pd.read_csv("results/estimators_500_Classification_output_df.csv")
df_esti_500["method"] = "powershap - catboost - 500 est"
df_esti_250 = pd.read_csv("results/estimators_250_Classification_output_df.csv")
if pure_powershap:
    df_esti_250["method"] = "powershap - catboost - 250 est"
else:
    df_esti_250["method"] = "powershap"
df_esti_50 = pd.read_csv("results/estimators_50_Classification_output_df.csv")
df_esti_50["method"] = "powershap - catboost - 50 est"
df_logreg = pd.read_csv("results/logisticregressioncv_output_df.csv")
df_logreg["method"] = "powershap - logistic regression"
df_logreg_polytone = pd.read_csv("results/logisticregressioncv_output_df_polytone.csv")
df_logreg_polytone["method"] = "logistic regression polytone"
df_chisq = pd.read_csv("results/chisquared_output_df.csv")
df_chisq["method"] = "chi²"
df_chisq_poly = pd.read_csv("results/chi_squared_polytone_output_df.csv")
df_chisq_poly["method"] = "chi polytone"
df_f_classif = pd.read_csv("results/f_classif_output_df.csv")
df_f_classif["method"] = "f test"
df_f_classif_poly = pd.read_csv("results/f_classif_polytone_output_df.csv")
df_f_classif_poly["method"] = "f_classif polytone"

# DataFrame.append was removed in pandas 2.0, so pd.concat is used throughout.
df_random = pd.read_csv("results/randomforest_output_df.csv")
df_random["method"] = "powershap - random forest"
# Copy the 1,000-sample rows and relabel them as 20,000 samples.
t = df_random[df_random.n_samples == 1000].copy()
t["n_samples"] = 20000
df_random = pd.concat([df_random, t])

df_shapicant = pd.read_csv("results/shapicant_output_df.csv")
df_shapicant["method"] = "shapicant"
# Copy existing shapicant rows and relabel them for the settings without results.
t = df_shapicant[df_shapicant.total_features == 20].copy()
t["total_features"] = 50
df_shapicant = pd.concat([df_shapicant, t])
t = df_shapicant[df_shapicant.n_samples == 5000].copy()
t["n_samples"] = 20000
df_shapicant = pd.concat([df_shapicant, t])
t = df_shapicant[df_shapicant.n_samples == 5000].copy()
t["n_samples"] = 1000
df_shapicant = pd.concat([df_shapicant, t])

df_boruta = pd.read_csv("results/250_est_Catboost_borutashap_output_df.csv")
df_boruta["method"] = "borutashap"

order_samples = [1000]  # [5000]  # 1000, 5000, 20000
order_features = [20, 100, 250, 500]

if pure_powershap:
    df = pd.concat([df_esti_50, df_esti_250, df_esti_500, df_logreg, df_random])
    df_n = 5
else:
    # To include shapicant, add df_shapicant to the list and set df_n = 5.
    df = pd.concat([df_esti_250, df_chisq, df_f_classif, df_boruta])
    df_n = 4

# Each setting has 5 seeds; df_n is the number of concatenated methods.
df["%informative_features"] = np.tile(np.repeat([10, 33, 50, 90], 5), 15 * df_n)
df["%found_informative_features"] = df["found_informative_features"] / df["informative_features"] * 100
df["%outputted_noise_features"] = df["outputted_noise_features"] / (df["total_features"] - df["informative_features"]) * 100

df = df.rename(columns={"total_features": "total features"})
df = df[df.n_samples.isin(order_samples)]
df = df[df["total features"].isin(order_features)]
df = df.drop(columns="n_samples")
```

--------------------------------

### Initialize PowerSHAP with GradientBoostingClassifier

Source: https://github.com/predict-idlab/powershap/blob/main/examples/test.ipynb

Initializes the PowerSHAP selector with a GradientBoostingClassifier and automatic mode enabled. Ensure the necessary libraries are imported and the data (X_train, y_train) is available.
```python
import sys
sys.path.append("../powershap")

# The class name is PowerShap, as in the other snippets.
from powershap import PowerShap
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegressionCV, RidgeClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier

selector = PowerShap(
    model=GradientBoostingClassifier(),  # alternative: CatBoostClassifier(verbose=0, n_estimators=250)
    automatic=True,
    limit_automatic=100,
)
```

--------------------------------

### CatBoost Regressor Model Training

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes and trains a CatBoost Regressor model with specified parameters. The model is then used for cross-validation and evaluation.

```python
CB_model = CatBoostRegressor(verbose=100, iterations=250, random_seed=2)  # ,per_float_feature_quantization=['1:border_count=1024']
CB_model.fit(X_train, Y_train)
```

--------------------------------

### Cross-Validation with CatBoostRegressor

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes a CatBoostRegressor model and performs 10-fold cross-validation using the benchmark_regression_cross_validation function. Prints the model type and the mean and standard deviation of training and testing scores.
```python
# benchmark_regression_cross_validation is a helper function defined elsewhere in the notebook.
CB_model = CatBoostRegressor(verbose=False, iterations=250, random_seed=2, use_best_model=True)
scores_cv_train, scores_cv_test = benchmark_regression_cross_validation(
    Model=CB_model, Inp_db=current_db_train.copy(deep=True), index_col=Index_col,
    folds=10, RS=0, features=selected_features, target_col=target_col,
)

# Print the model type, then mean (std) per metric.
print(type(CB_model).__name__)
print("TRAIN")
for key in scores_cv_train:
    print(str(key) + ": " + str(np.round(np.mean(scores_cv_train[key]), 3)) + " (" + str(np.round(np.std(scores_cv_train[key]), 3)) + ")")
print(50 * "=")
print("TEST")
for key in scores_cv_test:
    print(str(key) + ": " + str(np.round(np.mean(scores_cv_test[key]), 3)) + " (" + str(np.round(np.std(scores_cv_test[key]), 3)) + ")")
```

--------------------------------

### Initialize and Fit BorutaShap

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes a CatBoostClassifier and then uses BorutaShap for feature selection. This is useful for identifying important features in a dataset for classification tasks.

```python
from BorutaShap import BorutaShap
from catboost import CatBoostClassifier

model = CatBoostClassifier(verbose=False, iterations=250)  # ,use_best_model=True

# if classification is False it is a Regression problem
Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=True)
Feature_Selector.fit(
    X=current_db_train[list(current_db_train.columns.values[1:-1])],
    y=current_db_train[target_col],
    sample=False, train_or_test='test', normalize=True, verbose=True,
)
subset = Feature_Selector.Subset()
```

--------------------------------

### Benchmark PowerShap with CatBoost Classifiers

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

This script benchmarks PowerShap's performance using CatBoost classifiers. It iterates through various sample sizes, feature counts, and informative feature percentages, recording average computation times and feature selection results. Requires pandas and numpy.
```python
import sys
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
# sys.path.append("../powershap")
from powershap import PowerShap
from catboost import CatBoostClassifier, CatBoostRegressor

# benchmark_dict_print and output_dict_to_df are helper functions defined elsewhere in the notebook.
regression_bool = False
estimators = 500

output_dict = {}
for n_samples in [1000, 5000, 20000]:
    output_dict[str(n_samples)] = {}
    for n_features in [20, 100, 250, 500]:
        output_dict[str(n_samples)][str(n_features)] = {}
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times
        for n_informative in [10, 33, 50, 90]:  # percentage of informative features
            n_inf = int(n_informative / 100 * n_features)  # absolute number of informative features
            results = {}
            output_dict[str(n_samples)][str(n_features)][str(n_informative) + "%"] = results
            print("Amount of samples = " + str(n_samples))
            print("Total used features = " + str(n_features))
            print("Informative features: " + str(n_inf) + " (" + str(n_informative) + " %)")
            print("")

            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0, 1, 2, 3, 4]:
                print("Seed " + str(random_seed))
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=n_inf,
                                           random_state=random_seed, shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features,
                                               n_informative=n_inf, n_redundant=0, n_repeated=0,
                                               shuffle=False, random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    model = CatBoostRegressor(verbose=0, n_estimators=estimators, use_best_model=True)
                else:
                    model = CatBoostClassifier(verbose=0, n_estimators=estimators, use_best_model=True)
                selector = PowerShap(model=model, automatic=True)
                selector.fit(X, y)
                times.append(time.time() - start_time)

                processed_shaps_df = selector._processed_shaps_df
                print(50 * "-")
                significant = processed_shaps_df[processed_shaps_df.p_value < 0.01]
                found_features.append(len(significant))
                found_idx_features.append(significant.index.values)

            # With shuffle=False the informative features are the first n_inf columns.
            informative_cols = X.columns.values[:n_inf]
            found_informative_features = [np.sum(np.isin(informative_cols, f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1 - np.isin(f_list, informative_cols)) for f_list in found_idx_features]

            print("Average time: " + str(np.round(np.mean(times), 2)) + " seconds")
            print("Found features: " + str(found_features))
            print("Found " + str(np.mean(found_informative_features)) + " of " + str(n_inf) + " informative features")
            print(str(np.mean(found_noise_features)) + " of " + str(np.mean(found_features)) + " outputted powershap features are noise features")

            average_times.append(np.round(times, 2))
            results.update({
                "informative_features": n_inf,
                "found_informative_features": found_informative_features,
                "outputted_noise_features": found_noise_features,
            })
            print(100 * "=")
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times

benchmark_dict_print(output_dict)
print(100 * "=")
output_dict_to_df(output_dict).to_csv("estimators_500_Classification_output_df.csv", index=False)
```

--------------------------------

### Benchmarking PowerShap with CatBoostRegressor

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

This script benchmarks PowerShap's feature selection capabilities with CatBoostRegressor. It iterates through various dataset sizes and feature configurations, measuring execution time and the effectiveness of feature identification. Requires pandas and numpy.
```python
from sklearn.datasets import make_classification, make_regression
import sys
import time

#sys.path.append("../powershap")
from powershap import PowerShap
from catboost import CatBoostClassifier, CatBoostRegressor

import pandas as pd
import numpy as np


def benchmark_dict_print(output_dict):
    """Print a summary for every benchmark configuration in output_dict."""
    for n_samples in output_dict:
        print("Amount of samples = " + str(n_samples))
        for n_features in output_dict[n_samples]:
            print("Total used features = " + str(n_features))
            for n_informative, result in output_dict[n_samples][n_features].items():
                # The "Average time" key holds the list of timings, not a configuration.
                if n_informative == "Average time":
                    continue
                informative = result["informative_features"]
                found_informative = np.mean(result["found_informative_features"])
                found_noise = np.mean(result["outputted_noise_features"])
                # Every selected feature is either informative or noise.
                found_total = found_informative + found_noise
                print("Informative features: " + str(informative) + " (" + str(int(informative / int(n_features) * 100)) + "%)")
                print("Average time: " + str(np.round(np.mean(result["Average time"]), 2)) + " seconds")
                print("Found features: " + str(np.round(found_total, 2)))
                print("Found " + str(np.round(found_informative, 2)) + " of " + str(informative) + " informative features")
                print(str(np.round(found_noise, 2)) + " of " + str(np.round(found_total, 2)) + " outputted powershap features are noise features")
                print(100 * "=")


def output_dict_to_df(output_dict):
    """Flatten output_dict into a DataFrame with one row per configuration."""
    df_list = []
    for n_samples in output_dict:
        for n_features in output_dict[n_samples]:
            for n_informative, result in output_dict[n_samples][n_features].items():
                if n_informative == "Average time":
                    continue
                df_list.append({
                    "n_samples": int(n_samples),
                    "n_features": int(n_features),
                    # Keys look like "10%", so strip the percent sign before converting.
                    "n_informative_percent": int(n_informative.rstrip("%")),
                    "informative_features": result["informative_features"],
                    "found_informative_features": np.mean(result["found_informative_features"]),
                    "outputted_noise_features": np.mean(result["outputted_noise_features"]),
                    "average_time_seconds": np.mean(result["Average time"]),
                })
    return pd.DataFrame(df_list)


regression_bool = True
estimators = 50
output_dict = {}

for n_samples in [1000, 5000, 20000]:
    output_dict[str(n_samples)] = {}
    for n_features in [20, 100, 250, 500]:
        output_dict[str(n_samples)][str(n_features)] = {}
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times

        for n_informative in [10, 33, 50, 90]:  # percentage of informative features
            n_informative_features = int(n_informative / 100 * n_features)
            result = {}
            output_dict[str(n_samples)][str(n_features)][str(n_informative) + "%"] = result

            print("Amount of samples = " + str(n_samples))
            print("Total used features = " + str(n_features))
            print("Informative features: " + str(n_informative_features) + " (" + str(n_informative) + " %)")
            print("")

            found_features = []
            found_idx_features = []
            times = []

            for random_seed in [0, 1, 2, 3, 4]:
                print("Seed " + str(random_seed))
                # shuffle=False keeps the informative features in the first columns.
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=n_informative_features, random_state=random_seed, shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features, n_informative=n_informative_features, n_redundant=0, n_repeated=0, shuffle=False, random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    model = CatBoostRegressor(verbose=0, n_estimators=estimators, use_best_model=True)
                else:
                    model = CatBoostClassifier(verbose=0, n_estimators=estimators, use_best_model=True)
                selector = PowerShap(model=model, automatic=True)
                selector.fit(X, y)
                times.append(time.time() - start_time)

                processed_shaps_df = selector._processed_shaps_df
                print(50 * "-")

                selected = processed_shaps_df[processed_shaps_df.p_value < 0.01]
                found_features.append(len(selected))
                found_idx_features.append(selected.index.values)

            informative_columns = X.columns.values[:n_informative_features]
            found_informative_features = [np.sum(np.isin(informative_columns, f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(~np.isin(f_list, informative_columns)) for f_list in found_idx_features]

            print("Average time: " + str(np.round(np.mean(times), 2)) + " seconds")
            print("Found features: " + str(found_features))
            print("Found " + str(np.mean(found_informative_features)) + " of " + str(n_informative_features) + " informative features")
            print(str(np.mean(found_noise_features)) + " of " + str(np.mean(found_features)) + " outputted powershap features are noise features")

            average_times.append(np.round(times, 2))
            # Store the timings per configuration so the helpers above can read them.
            result["Average time"] = np.round(times, 2)
            result["informative_features"] = n_informative_features
            result["found_informative_features"] = found_informative_features
            result["outputted_noise_features"] = found_noise_features
            print(100 * "=")
        output_dict[str(n_samples)][str(n_features)]["Average time"] = average_times

benchmark_dict_print(output_dict)
print(100 * "=")
output_dict_to_df(output_dict).to_csv("estimators_50_Regression_output_df.csv", index=False)
```

--------------------------------

### Initialize and Fit PowerShap Selector

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes a PowerShap selector with a CatBoostRegressor model and fits it to the
training data. Automatic mode is enabled (`automatic=True`) with `power_iterations=10` and `limit_automatic=10`.

```python
# current_db_train, target_col and Index_col are assumed to be defined earlier in the notebook.
selector = PowerShap(
    model=CatBoostRegressor(verbose=0, n_estimators=250, use_best_model=True),
    power_iterations=10,
    automatic=True,
    limit_automatic=10,
    verbose=True,
    target_col=target_col,
    index_col=Index_col,
)
# The first three columns are skipped; the remaining columns are used as features.
selector.fit(current_db_train[list(current_db_train.columns.values[3:])], current_db_train[target_col])
t = selector._processed_shaps_df
```

--------------------------------

### Initialize and Fit BorutaShap

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes and fits the BorutaShap feature selection algorithm with a CatBoostRegressor model. Setting `classification=False` treats the task as a regression problem. The selected subset is retrieved with `Subset()`.

```python
model = CatBoostRegressor(verbose=False, iterations=250)

# if classification is False it is a Regression problem
Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=False)

Feature_Selector.fit(
    X=current_db_train[list(current_db_train.columns.values[3:])],
    y=current_db_train[target_col],
    sample=False,
    train_or_test='test',
    normalize=True,
    verbose=True,
)
subset = Feature_Selector.Subset()
```

--------------------------------

### Import Necessary Libraries

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Imports the libraries used in the benchmark notebook for data manipulation, model training, evaluation, and feature selection: numpy, pandas, scikit-learn, catboost, shap, scipy, tabulate, tqdm, BorutaShap, powershap, and shapicant.

```python
import numpy as np
from numpy.random import RandomState
import pandas as pd
from math import sqrt

from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn import feature_selection as fs
from sklearn import metrics as me
from sklearn.metrics import classification_report, auc, r2_score, matthews_corrcoef, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, f_classif, f_regression

from catboost import CatBoostRegressor, CatBoostClassifier
from catboost.utils import get_roc_curve
from catboost import Pool, cv

import shap
from scipy import stats
from scipy.optimize import curve_fit
import copy
from tabulate import tabulate
from tqdm import tqdm

from BorutaShap import BorutaShap
from powershap import PowerShap
import shapicant
```

--------------------------------

### Create and Fit Sklearn Pipeline with PowerShap

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Builds a scikit-learn `Pipeline` with PowerShap as the feature selection step and a `KNeighborsClassifier` as the final estimator, then fits the pipeline on the training data. Test accuracy is compared against a `KNeighborsClassifier` baseline trained on all features.

```python
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

from powershap import PowerShap

# X_train, X_test, y_train and y_test come from the train/test split created earlier.
pipe = Pipeline(
    [
        (
            "selector",
            PowerShap(
                CatBoostClassifier(n_estimators=250, verbose=False, use_best_model=True),
                automatic=True,
                limit_automatic=100,
                # power_alpha=0.001, power_req_iterations=0.999,
            ),
        ),
        ("knn", KNeighborsClassifier()),
    ]
)
pipe.fit(X_train, y_train)

# Baseline: the same classifier trained on all features.
baseline = KNeighborsClassifier().fit(X_train, y_train)
print("Baseline", accuracy_score(y_test, baseline.predict(X_test)))
print("PowerShap feature selection:", accuracy_score(y_test, pipe.predict(X_test)))

# For regression, swap in CatBoostRegressor, a regressor as the final step, and r2_score.
```
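
The benchmark scripts in this section score a selection by checking which selected columns fall among the first `n_informative` columns, which they treat as the informative ones because the data is generated with `shuffle=False`. The following is a minimal, dependency-free sketch of that bookkeeping. The helper name `score_selection` and the example column lists are hypothetical, made up for illustration, and are not part of PowerShap.

```python
def score_selection(selected, all_columns, n_informative):
    """Count informative and noise columns among the selected ones.

    Assumes the first `n_informative` entries of `all_columns` are the
    informative features, as in data generated with shuffle=False.
    """
    informative = set(all_columns[:n_informative])
    selected = list(selected)
    found_informative = sum(1 for column in selected if column in informative)
    # Every selected column that is not informative counts as noise.
    found_noise = len(selected) - found_informative
    return found_informative, found_noise


# Hypothetical data: 10 columns, of which the first 4 are informative.
all_columns = [f"col_{i}" for i in range(10)]
# Hypothetical selector output: three informative columns and one noise column.
selected = ["col_0", "col_1", "col_3", "col_7"]

found_informative, found_noise = score_selection(selected, all_columns, n_informative=4)
print(found_informative, found_noise)  # prints: 3 1
```

In the benchmarks, the list of selected columns is taken from `selector._processed_shaps_df` by keeping the rows with `p_value < 0.01`. The same counts could be computed for any list of column names obtained that way.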