### Install PowerShap

Source: https://github.com/predict-idlab/powershap/blob/main/README.md

Install the PowerShap library using pip. This command installs the latest version from PyPI.

```bash
pip install powershap
```

--------------------------------

### Setup Autoreload and Imports for Simulation

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Initializes the autoreload extension and imports necessary libraries for simulation tasks. Ensure these imports are present before running simulation code.

```python
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import shap
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split
import sys
import time
```

--------------------------------

### Initialize and Fit PowerShap with CatBoostClassifier

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Initializes a PowerShap object with a CatBoostClassifier and fits it to training data. This snippet demonstrates the basic setup for using PowerShap with a specific machine learning model.

```python
import sys
sys.path.append("../powershap")

from powershap import PowerShap


from catboost import CatBoostClassifier,CatBoostRegressor
from sklearn.linear_model import LogisticRegressionCV, RidgeClassifierCV,LinearRegression
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier

selector = PowerShap(
    model = CatBoostClassifier(verbose=0, n_estimators=250,use_best_model=True),#LogisticRegressionCV(),#GradientBoostingClassifier(),#CatBoostClassifier(verbose=0, n_estimators=250),
    #model = CatBoostRegressor(verbose=0, n_estimators=0,use_best_model=True),
    verbose=True,
)
selector.fit(X_train, y_train)
```

--------------------------------

### Generate Classification Dataset and Split

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Creates a synthetic classification dataset using `make_classification` and splits it into training and testing sets. This is a common setup for evaluating machine learning models.

```python
n_features = 250 #20,50,100,250,500
n_informative = int(0.50*n_features) #5%,10%,33%,50%,90%
n_samples = int(5000/(1-0.33))+1 #7463#5000

X, y = make_classification(n_samples=n_samples, n_classes=3, n_features=n_features, n_informative=n_informative, n_redundant=0, n_repeated = 0,shuffle=False)
#X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=n_informative,random_state=4,shuffle=False)
X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=42)
X_train
```

--------------------------------

### Get Subset Columns

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Retrieves the column names from the subset generated by BorutaShap.

```python
subset.columns
```

--------------------------------

### Load and Prepare SCENE Dataset

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Loads the SCENE dataset from a CSV file, resets the index, and prepares it for classification by converting the target column to integer type. It then splits the data into training and validation sets, stratifying by the target column.

```python
current_db = pd.read_csv("data/scene.csv")
current_db = current_db.reset_index()

Index_col = "index"
target_col = "Urban"

current_db[target_col]=current_db[target_col].astype(np.int32)

train_idx,val_idx = train_test_split(current_db[Index_col],test_size=0.25,random_state = 1,stratify=current_db[target_col])
current_db_train = current_db[current_db[Index_col].isin(train_idx)]
current_db_test = current_db[current_db[Index_col].isin(val_idx)]

```

--------------------------------

### PowerShap and shapicant Imports

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Imports PowerShap and shapicant together with the dataset generation and timing utilities, then initializes the `regression_bool` flag and an empty `output_dict`. This snippet sets up the environment for simulation studies.

```python
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split
import sys
import time
#sys.path.append("../powershap")

from powershap import PowerShap
import shapicant

#n_features = 20 #20,50,100,250,500
#n_informative = int(0.10*n_features) #10%,33%,50%,90%
#n_samples = 1000#5000

regression_bool=False

output_dict = {}


```

--------------------------------

### BorutaShap Feature Selection

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Configures and applies BorutaShap for feature selection using a CatBoostClassifier. This example specifies 'shap' as the importance measure and enables verbose output during fitting.

```python
model = CatBoostClassifier(verbose=False,iterations=250,class_weights=[1-len(current_db_train[current_db_train[target_col] == 0])/len(current_db_train),len(current_db_train[current_db_train[target_col] == 0])/len(current_db_train)])

# if classification is False it is a Regression problem
Feature_Selector = BorutaShap(
    model=model,
    importance_measure='shap',
    classification=True)

Feature_Selector.fit(X=current_db_train[list(current_db_train.columns.values[1:-6])], y=current_db_train[target_col], sample=False,
                        train_or_test = 'test', normalize=True,verbose=True)
subset = Feature_Selector.Subset()

```
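
The `class_weights` expression in the snippet above gives class 0 the share of all other samples and class 1 the share of class 0, so the rarer class receives the larger weight. A minimal sketch of that arithmetic, assuming a made-up toy label column rather than the SCENE data:

```python
import pandas as pd

# Toy binary target, made up for illustration: 8 negatives, 2 positives.
labels = pd.Series([0] * 8 + [1] * 2)

# Fraction of samples that belong to class 0.
frac_class_0 = (labels == 0).sum() / len(labels)

# Same formula as in the snippet: class 0 gets (1 - frac_class_0),
# class 1 gets frac_class_0, so the rarer class is weighted more heavily.
class_weights = [1 - frac_class_0, frac_class_0]
print(class_weights)
```

With these toy labels the minority class (1) ends up with roughly four times the weight of the majority class.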

--------------------------------

### Load and Prepare Energy Data

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Loads the energy dataset using pandas, resets the index, and splits the data into training and validation sets.

```python
current_db = pd.read_csv("data/energydata_complete.csv")
current_db = current_db.reset_index()

Index_col = "index"
target_col = "Appliances"

train_idx,val_idx = train_test_split(current_db[Index_col],test_size=0.25,random_state = 1)
current_db_train = current_db[current_db[Index_col].isin(train_idx)]
current_db_test = current_db[current_db[Index_col].isin(val_idx)]
```

--------------------------------

### Data Loading and Preprocessing

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Loads the 'gina_prior.csv' dataset, remaps the label value -1 to 0 so the target is binary 0/1, and splits the data into training and validation sets using `train_test_split`.

```python
gina_prior_df = pd.read_csv("data/gina_prior.csv")
gina_prior_df = gina_prior_df.reset_index()
gina_prior_df.loc[gina_prior_df.label==-1,"label"]=0
train_idx,val_idx = train_test_split(gina_prior_df["index"],test_size=0.25,random_state = 1)
current_db_train = gina_prior_df[gina_prior_df["index"].isin(train_idx)]
current_db_test = gina_prior_df[gina_prior_df["index"].isin(val_idx)]

target_col = "label"
Index_col = "index"
```

--------------------------------

### Generate and Split Classification Dataset

Source: https://github.com/predict-idlab/powershap/blob/main/examples/test.ipynb

Generates a synthetic classification dataset using scikit-learn and splits it into training and testing sets. The data is then converted into a pandas DataFrame for easier manipulation. Ensure scikit-learn and pandas are installed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=750, n_classes=2, n_features=10, n_informative=2, n_redundant=1)
X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=42)
X_train
```

--------------------------------

### Run PowerShap Benchmark with CatBoost

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

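Benchmarks PowerShap with CatBoost models on synthetic data. The script iterates over dataset sizes, feature counts, and informative-feature percentages with five random seeds per setting, recording execution times and feature selection results. It assumes `pd` (pandas) and `np` (numpy) are already imported and that the helpers `benchmark_dict_print` and `output_dict_to_df`, which this snippet does not define, are available.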

```python
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split
import sys
import time
#sys.path.append("../powershap")

from powershap import PowerShap
from catboost import CatBoostClassifier,CatBoostRegressor

#n_features = 20 #20,50,100,250,500
#n_informative = int(0.10*n_features) #10%,33%,50%,90%
#n_samples = 1000#5000

regression_bool=False
estimators = 250
hypercube = False

output_dict = {}

for n_samples in [1000,5000,20000]:
    output_dict[str(n_samples)]={}
    for n_features in [20,100,250,500]:
        output_dict[str(n_samples)][str(n_features)]={}
        
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        for n_informative in [10,33,50,90]:#[int(0.10*n_features),int(0.33*n_features),int(0.50*n_features),int(0.90*n_features)]:
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]={}
            print("Amount of samples = "+str(n_samples))
            print("Total used features = "+str(n_features))
            print("Informative features: "+str(int(n_informative/100*n_features))+" ("+str(n_informative)+" %)")
            print("")
            
            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0,1,2,3,4]:
                print("Seed "+str(random_seed))
                
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=int(n_informative/100*n_features),random_state=random_seed,shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features, hypercube=hypercube, n_informative=int(n_informative/100*n_features), n_redundant=0, n_repeated = 0,shuffle=False,random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    selector = PowerShap(
                        model = CatBoostRegressor(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                else:
                    selector = PowerShap(
                        model = CatBoostClassifier(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                selector.fit(X, y)
                
                times.append(time.time() - start_time)
                
                processed_shaps_df = selector._processed_shaps_df
                print(50*"-")
                
                found_features.append(len(processed_shaps_df[processed_shaps_df.p_value<0.01]))
                found_idx_features.append(processed_shaps_df[processed_shaps_df.p_value<0.01].index.values)
                
            found_informative_features = [np.sum(np.isin(X.columns.values[:int(n_informative/100*n_features)],f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1-np.isin(f_list,X.columns.values[:int(n_informative/100*n_features)])) for f_list in found_idx_features]
            print("Average time: "+str(np.round(np.mean(times),2))+" seconds")
            print("Found features: "+str(found_features))#len(processed_shaps_df[processed_shaps_df.p_value<0.01])))
            print("Found "+str(np.mean(found_informative_features))+" of "+str(int(n_informative/100*n_features))+" informative features")
            print(str(np.mean(found_noise_features))+" of "+str(np.mean(found_features))+" outputted powershap features are noise features")
            
            average_times.append(np.round(times,2))
            
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["informative_features"]=int(n_informative/100*n_features)
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["found_informative_features"]=found_informative_features
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["outputted_noise_features"]=found_noise_features
            
            print(100*"=")
            
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        benchmark_dict_print(output_dict)
        print(100*"=")
    
        if hypercube:
            output_dict_to_df(output_dict).to_csv("estimators_250_Classification_output_df.csv",index=False)
        else:
            output_dict_to_df(output_dict).to_csv("estimators_250_Classification_output_df_polytone.csv",index=False)
```
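
The benchmark above calls `output_dict_to_df`, which the snippet does not define. The following is a hypothetical sketch of such a helper, not the notebook's actual implementation: it assumes the nested `output_dict` layout built in the loop and emits one row per random seed. The function name `output_dict_to_df_sketch` and the column names (taken from the visualization snippet in this document) are assumptions.

```python
import pandas as pd

def output_dict_to_df_sketch(output_dict):
    """Flatten the nested benchmark dict into one row per seed (hypothetical helper)."""
    rows = []
    for n_samples, per_features in output_dict.items():
        for n_features, per_setting in per_features.items():
            for key, result in per_setting.items():
                if key == "Average time":  # timing list, not a per-setting result
                    continue
                # Each setting stores one entry per random seed.
                for found, noise in zip(result["found_informative_features"],
                                        result["outputted_noise_features"]):
                    rows.append({
                        "n_samples": int(n_samples),
                        "total_features": int(n_features),
                        "informative_features": result["informative_features"],
                        "found_informative_features": int(found),
                        "outputted_noise_features": int(noise),
                    })
    return pd.DataFrame(rows)

# Made-up results for one setting and two seeds, mirroring the loop's structure.
example = {"1000": {"20": {
    "Average time": [],
    "10%": {"informative_features": 2,
            "found_informative_features": [2, 1],
            "outputted_noise_features": [0, 1]},
}}}
df = output_dict_to_df_sketch(example)
print(df)
```

Keeping one row per seed, rather than averaging, leaves the spread across seeds available for later plotting.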

--------------------------------

### BorutaShap Simulation with CatBoost

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

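Runs a simulation to evaluate BorutaShap's feature selection performance with a `CatBoostClassifier`. It iterates over different numbers of samples, features, and informative-feature percentages, measuring the time taken and how many informative and noise features are selected. It assumes `pd` and `np` are already imported and that the helpers `benchmark_dict_print` and `output_dict_to_df`, which this snippet does not define, are available.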

```python
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split
import sys
import time
#sys.path.append("../powershap")

from powershap import PowerShap
from BorutaShap import BorutaShap
from catboost import CatBoostClassifier

#n_features = 20 #20,50,100,250,500
#n_informative = int(0.10*n_features) #10%,33%,50%,90%
#n_samples = 1000#5000

regression_bool=False

output_dict = {}

for n_samples in [1000,5000,20000]:
    output_dict[str(n_samples)]={}
    for n_features in [20,100,250,500]:
        output_dict[str(n_samples)][str(n_features)]={}
        
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        for n_informative in [10,33,50,90]:#[int(0.10*n_features),int(0.33*n_features),int(0.50*n_features),int(0.90*n_features)]:
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]={}
            print("Amount of samples = "+str(n_samples))
            print("Total used features = "+str(n_features))
            print("Informative features: "+str(int(n_informative/100*n_features))+" ("+str(n_informative)+"%)")
            print("")
            
            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0,1,2,3,4]:
                print("Seed "+str(random_seed))
                
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=int(n_informative/100*n_features),random_state=random_seed,shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features, n_informative=int(n_informative/100*n_features), n_redundant=0, n_repeated = 0,shuffle=False,random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                
                # if classification is False it is a Regression problem
                model = CatBoostClassifier(verbose=0, n_estimators=250)
                selector = BorutaShap(model=model,importance_measure='shap',classification=True)

                selector.fit(X=X, y=y, verbose=False)
                subset = selector.Subset()
                
                times.append(time.time() - start_time)
                print(50*"-")
                
                found_features.append(len(subset.columns))
                found_idx_features.append(subset.columns)
                
            found_informative_features = [np.sum(np.isin(X.columns.values[:int(n_informative/100*n_features)],f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1-np.isin(f_list,X.columns.values[:int(n_informative/100*n_features)])) for f_list in found_idx_features]
            print("Average time: "+str(np.round(np.mean(times),2))+" seconds")
            print("Found features: "+str(found_features))#len(processed_shaps_df[processed_shaps_df.p_value<0.01])))
            print("Found "+str(np.mean(found_informative_features))+" of "+str(int(n_informative/100*n_features))+" informative features")
            print(str(np.mean(found_noise_features))+" of "+str(np.mean(found_features))+" outputted borutashap features are noise features")
            
            average_times.append(np.round(times,2))
            
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["informative_features"]=int(n_informative/100*n_features)
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["found_informative_features"]=found_informative_features
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["outputted_noise_features"]=found_noise_features
            
            print(100*"=")
            
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        benchmark_dict_print(output_dict)
            
        output_dict_to_df(output_dict).to_csv("250_est_Catboost_borutashap_output_df.csv",index=False)
        
        print(100*"=")
```
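
The informative/noise bookkeeping in the loop above relies on `shuffle=False`: `make_classification` then places the informative columns first, so the first `n_informative` column names are the ground truth. A small worked sketch of the two `np.isin` counts, using made-up selections in place of real selector output:

```python
import numpy as np

# Toy setup: 5 features, of which the first 2 are informative.
columns = np.array([f"col_{i}" for i in range(5)])
n_informative = 2
informative = columns[:n_informative]

# Made-up selections from two runs (stand-ins for subset.columns).
found_idx_features = [
    np.array(["col_0", "col_1", "col_4"]),  # both informative + one noise feature
    np.array(["col_1"]),                    # one informative, no noise
]

# How many of the informative features appear in each selection.
found_informative = [int(np.sum(np.isin(informative, f))) for f in found_idx_features]
# How many selected features are NOT informative, i.e. noise.
found_noise = [int(np.sum(1 - np.isin(f, informative))) for f in found_idx_features]

print(found_informative, found_noise)  # prints [2, 1] [1, 0]
```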

--------------------------------

### Run PowerShap Simulation for Regression

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

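This script simulates feature selection using PowerShap with CatBoost regressors on `make_regression` data. It iterates over dataset sizes, feature counts, and informative-feature percentages, recording the time taken and the number of features identified, and writes the results to a CSV file at the end. It assumes `pd` and `np` are already imported and that the helpers `benchmark_dict_print` and `output_dict_to_df`, which this snippet does not define, are available.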

```python
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split
import sys
import time
#sys.path.append("../powershap")

from powershap import PowerShap
from catboost import CatBoostClassifier,CatBoostRegressor

#n_features = 20 #20,50,100,250,500
#n_informative = int(0.10*n_features) #10%,33%,50%,90%
#n_samples = 1000#5000

regression_bool=True
estimators = 250

output_dict = {}

for n_samples in [1000,5000,20000]:
    output_dict[str(n_samples)]={}
    for n_features in [20,100,250,500]:
        output_dict[str(n_samples)][str(n_features)]={}
        
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        for n_informative in [10,33,50,90]:#[int(0.10*n_features),int(0.33*n_features),int(0.50*n_features),int(0.90*n_features)]:
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]={}
            print("Amount of samples = "+str(n_samples))
            print("Total used features = "+str(n_features))
            print("Informative features: "+str(int(n_informative/100*n_features))+" ("+str(n_informative)+"%) ")
            print("")
            
            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0,1,2,3,4]:
                print("Seed "+str(random_seed))
                
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=int(n_informative/100*n_features),random_state=random_seed,shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features, n_informative=int(n_informative/100*n_features), n_redundant=0, n_repeated = 0,shuffle=False,random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    selector = PowerShap(
                        model = CatBoostRegressor(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                else:
                    selector = PowerShap(
                        model = CatBoostClassifier(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                selector.fit(X, y)
                
                times.append(time.time() - start_time)
                
                processed_shaps_df = selector._processed_shaps_df
                print(50*"-")
                
                found_features.append(len(processed_shaps_df[processed_shaps_df.p_value<0.01]))
                found_idx_features.append(processed_shaps_df[processed_shaps_df.p_value<0.01].index.values)
                
            found_informative_features = [np.sum(np.isin(X.columns.values[:int(n_informative/100*n_features)],f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1-np.isin(f_list,X.columns.values[:int(n_informative/100*n_features)])) for f_list in found_idx_features]
            print("Average time: "+str(np.round(np.mean(times),2))+" seconds")
            print("Found features: "+str(found_features))#len(processed_shaps_df[processed_shaps_df.p_value<0.01])))
            print("Found "+str(np.mean(found_informative_features))+" of "+str(int(n_informative/100*n_features))+" informative features")
            print(str(np.mean(found_noise_features))+" of "+str(np.mean(found_features))+" outputted powershap features are noise features")
            
            average_times.append(np.round(times,2))
            
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["informative_features"]=int(n_informative/100*n_features)
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["found_informative_features"]=found_informative_features
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["outputted_noise_features"]=found_noise_features
            
            print(100*"=")
            
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        benchmark_dict_print(output_dict)
        print(100*"=")
        
    
output_dict_to_df(output_dict).to_csv("estimators_250_Regression_output_df.csv",index=False)
```
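
The script above treats a feature as selected when its `p_value` in `selector._processed_shaps_df` is below 0.01. A minimal sketch of that filter on a hand-built stand-in frame; the p-values are made up, and only the `p_value` column and the feature-name index are assumed from the snippet:

```python
import pandas as pd

# Stand-in for selector._processed_shaps_df; values are made up.
processed_shaps_df = pd.DataFrame(
    {"p_value": [0.001, 0.2, 0.009, 0.5]},
    index=["col_0", "col_1", "col_2", "col_3"],
)

# Same threshold as the benchmark: keep rows with p_value < 0.01.
significant = processed_shaps_df[processed_shaps_df.p_value < 0.01]

found_count = len(significant)            # number of selected features
found_names = significant.index.values    # their names, taken from the index
print(found_count, list(found_names))
```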

--------------------------------

### Data Preparation for Training

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

This code prepares the training and testing datasets by selecting the features determined in the previous step and assigning the target column.

```python
X_train = current_db_train[selected_features]
Y_train = current_db_train[target_col]

X_test = current_db_test[selected_features]
Y_test = current_db_test[target_col]
```

--------------------------------

### Load and Prepare Data for Visualization

Source: https://github.com/predict-idlab/powershap/blob/main/examples/visualisation_powershap.ipynb

Loads data from CSV files for various estimators (PowerSHAP with different estimators, logistic regression, chi-squared, F-test, random forest, borutashap, shapicant). It then concatenates these DataFrames, calculates performance metrics like '%found_informative_features' and '%outputted_noise_features', renames columns, and filters data based on sample size and total features. This prepares the data for plotting.

```python
import seaborn as sns
from matplotlib.ticker import ScalarFormatter
import pandas as pd
import numpy as np

pure_powershap = False

df_esti_500 = pd.read_csv("results/estimators_500_Classification_output_df.csv")
df_esti_500["method"]="powershap - catboost - 500 est"
df_esti_250 = pd.read_csv("results/estimators_250_Classification_output_df.csv")
if pure_powershap:
    df_esti_250["method"]="powershap - catboost - 250 est"
else:
    df_esti_250["method"]="powershap"
df_esti_50 = pd.read_csv("results/estimators_50_Classification_output_df.csv")
df_esti_50["method"]="powershap - catboost - 50 est"

df_logreg = pd.read_csv("results/logisticregressioncv_output_df.csv")
df_logreg["method"]="powershap - logistic regression"
df_logreg_polytone = pd.read_csv("results/logisticregressioncv_output_df_polytone.csv")
df_logreg_polytone["method"]="logistic regression polytone"

df_chisq = pd.read_csv("results/chisquared_output_df.csv")
df_chisq["method"]="chi²"
df_chisq_poly = pd.read_csv("results/chi_squared_polytone_output_df.csv")
df_chisq_poly["method"]="chi polytone"

df_f_classif = pd.read_csv("results/f_classif_output_df.csv")
df_f_classif["method"]="f test"
df_f_classif_poly = pd.read_csv("results/f_classif_polytone_output_df.csv")
df_f_classif_poly["method"]="f_classif polytone"

df_random = pd.read_csv("results/randomforest_output_df.csv")
df_random["method"]="powershap - random forest"
t = df_random[df_random.n_samples==1000].copy()
t["n_samples"]=20000
df_random = df_random.append(t)

df_shapicant = pd.read_csv("results/shapicant_output_df.csv")
df_shapicant["method"]="shapicant"
t = df_shapicant[df_shapicant.total_features==20].copy()
t["total_features"]=50
df_shapicant = df_shapicant.append(t)
t = df_shapicant[df_shapicant.n_samples==5000].copy()
t["n_samples"]=20000
df_shapicant = df_shapicant.append(t)
t = df_shapicant[df_shapicant.n_samples==5000].copy()
t["n_samples"]=1000
df_shapicant = df_shapicant.append(t)

df_boruta = pd.read_csv("results/250_est_Catboost_borutashap_output_df.csv")
df_boruta["method"]="borutashap"

order_samples = [1000]#[5000]#1000,5000,20000]
order_features = [20,100,250,500]


#df = df_esti_250.append(df_logreg).append(df_random).append(df_chisq).append(df_f_classif).append(df_boruta).append(df_shapicant)
#df = df_esti_250.append(df_chisq).append(df_f_classif).append(df_boruta).append(df_shapicant)

#df = df_esti_250.append(df_logreg).append(df_chisq).append(df_f_classif).append(df_boruta)

#df["model_samples"]=df["method"].values+"_"+df["n_samples"].map(lambda x: f"{x:,}").values

if pure_powershap:
    df = df_esti_50.append(df_esti_250).append(df_esti_500).append(df_logreg).append(df_random)
    df_n = 5
else:
    #df = df_esti_250.append(df_chisq).append(df_f_classif).append(df_boruta).append(df_shapicant)
    #df_n=5
    df = df_esti_250.append(df_chisq).append(df_f_classif).append(df_boruta)#.append(df_shapicant)
    df_n=4


df["%informative_features"]=np.tile(np.repeat([10,33,50,90],5),15*df_n)
df["%found_informative_features"]=df["found_informative_features"]/df["informative_features"]*100
df["%outputted_noise_features"]=df["outputted_noise_features"]/(df["total_features"]-df["informative_features"])*100

df = df.rename(columns={"total_features":"total features"})

df = df[df.n_samples.isin(order_samples)]
df = df[df["total features"].isin(order_features)]
df = df.drop(columns="n_samples")
```
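
`DataFrame.append`, used throughout the snippet above, was removed in pandas 2.0, so on current pandas the concatenation needs `pd.concat`. A minimal sketch of the same combine-then-compute step, assuming made-up result rows in place of the CSV files:

```python
import pandas as pd

# Made-up result rows for two methods; column names follow the snippet above.
df_a = pd.DataFrame({"method": ["powershap"], "total_features": [20],
                     "informative_features": [2],
                     "found_informative_features": [2],
                     "outputted_noise_features": [0]})
df_b = pd.DataFrame({"method": ["borutashap"], "total_features": [20],
                     "informative_features": [2],
                     "found_informative_features": [1],
                     "outputted_noise_features": [9]})

# pd.concat stands in for the chained DataFrame.append calls.
# ignore_index=True renumbers the rows, which .append did not do by default.
df = pd.concat([df_a, df_b], ignore_index=True)

# Share of informative features that were found, in percent.
df["%found_informative_features"] = (
    df["found_informative_features"] / df["informative_features"] * 100
)
# Share of the noise features that were selected, in percent.
df["%outputted_noise_features"] = (
    df["outputted_noise_features"]
    / (df["total_features"] - df["informative_features"]) * 100
)
print(df[["method", "%found_informative_features", "%outputted_noise_features"]])
```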

--------------------------------

### Initialize PowerShap with GradientBoostingClassifier

Source: https://github.com/predict-idlab/powershap/blob/main/examples/test.ipynb

Initializes a `PowerShap` selector with a `GradientBoostingClassifier` in automatic mode (`automatic=True`, `limit_automatic=100`). The snippet only constructs the selector; `X_train` and `y_train` must be available before calling `selector.fit`.

```python
import sys
sys.path.append("../powershap")

from powershap import PowerShap


from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegressionCV, RidgeClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier

selector = PowerShap(
    model = GradientBoostingClassifier(),#CatBoostClassifier(verbose=0, n_estimators=250),
    automatic=True, limit_automatic=100,
)

```

--------------------------------

### CatBoost Regressor Model Training

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes a `CatBoostRegressor` with 250 iterations and a fixed random seed, then fits it on `X_train` and `Y_train`.

```python
CB_model = CatBoostRegressor(verbose=100,iterations=250,random_seed=2)#,per_float_feature_quantization=['1:border_count=1024'])
CB_model.fit(X_train,Y_train)
```

--------------------------------

### Cross-Validation with CatBoostRegressor

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes a CatBoostRegressor model and performs 10-fold cross-validation using the benchmark_regression_cross_validation function. Prints the model type and the mean and standard deviation of training and testing scores.

```python
CB_model = CatBoostRegressor(verbose=False,iterations=250,random_seed=2,use_best_model=True)

scores_cv_train,scores_cv_test = benchmark_regression_cross_validation(Model = CB_model,Inp_db = current_db_train.copy(deep=True),index_col=Index_col,folds=10,RS=0,features = selected_features,target_col = target_col)

print(CB_model)
print("TRAIN")
for key in scores_cv_train:
    print(str(key)+": "+str(np.round(np.mean(scores_cv_train[key]),3))+" ("+str(np.round(np.std(scores_cv_train[key]),3))+")")
print(50*"=")
print("TEST")
for key in scores_cv_test:
    print(str(key)+": "+str(np.round(np.mean(scores_cv_test[key]),3))+" ("+str(np.round(np.std(scores_cv_test[key]),3))+")")
```

--------------------------------

### Initialize and Fit BorutaShap

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes a CatBoostClassifier and then uses BorutaShap for feature selection. This is useful for identifying important features in a dataset for classification tasks.

```python
model = CatBoostClassifier(verbose=False,iterations=250)#,use_best_model=True)

# if classification is False it is a Regression problem
Feature_Selector = BorutaShap(model=model,
                            importance_measure='shap',
                            classification=True)

Feature_Selector.fit(X=current_db_train[list(current_db_train.columns.values[1:-1])], y=current_db_train[target_col], sample=False,
                        train_or_test = 'test', normalize=True,verbose=True)
subset = Feature_Selector.Subset()
```

--------------------------------

### Benchmark PowerShap with CatBoost Classifiers

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

This script benchmarks PowerShap with CatBoost classifiers using 500 estimators. It iterates over sample sizes, feature counts, and informative-feature percentages, recording computation times and feature selection results. It assumes `pd` (pandas) and `np` (numpy) are already imported.

```python
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split
import sys
import time
#sys.path.append("../powershap")

from powershap import PowerShap
from catboost import CatBoostClassifier,CatBoostRegressor

regression_bool=False
estimators = 500

output_dict = {}

for n_samples in [1000,5000,20000]:
    output_dict[str(n_samples)]={}
    for n_features in [20,100,250,500]:
        output_dict[str(n_samples)][str(n_features)]={}
        
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        for n_informative in [10,33,50,90]:#[int(0.10*n_features),int(0.33*n_features),int(0.50*n_features),int(0.90*n_features)]:
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]={}
            print("Amount of samples = "+str(n_samples))
            print("Total used features = "+str(n_features))
            print("Informative features: "+str(int(n_informative/100*n_features))+" ("+str(n_informative)+" %)")
            print("")
            
            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0,1,2,3,4]:
                print("Seed "+str(random_seed))
                
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=int(n_informative/100*n_features),random_state=random_seed,shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features, n_informative=int(n_informative/100*n_features), n_redundant=0, n_repeated = 0,shuffle=False,random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    selector = PowerShap(
                        model = CatBoostRegressor(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                else:
                    selector = PowerShap(
                        model = CatBoostClassifier(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                selector.fit(X, y)
                
                times.append(time.time() - start_time)
                
                processed_shaps_df = selector._processed_shaps_df
                print(50*"-")
                
                found_features.append(len(processed_shaps_df[processed_shaps_df.p_value<0.01]))
                found_idx_features.append(processed_shaps_df[processed_shaps_df.p_value<0.01].index.values)
                
            found_informative_features = [np.sum(np.isin(X.columns.values[:int(n_informative/100*n_features)],f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1-np.isin(f_list,X.columns.values[:int(n_informative/100*n_features)])) for f_list in found_idx_features]
            print("Average time: "+str(np.round(np.mean(times),2))+" seconds")
            print("Found features: "+str(found_features))#len(processed_shaps_df[processed_shaps_df.p_value<0.01])))
            print("Found "+str(np.mean(found_informative_features))+" of "+str(int(n_informative/100*n_features))+" informative features")
            print(str(np.mean(found_noise_features))+" of "+str(np.mean(found_features))+" outputted powershap features are noise features")
            
            average_times.append(np.round(times,2))
            
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"].update({"informative_features":int(n_informative/100*n_features),"found_informative_features":found_informative_features,"outputted_noise_features":found_noise_features})
            
            print(100*"=")
            
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        benchmark_dict_print(output_dict)
        print(100*"=")
    
        output_dict_to_df(output_dict).to_csv("estimators_500_Classification_output_df.csv",index=False)
```
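The bookkeeping at the end of each configuration relies on `shuffle=False`, which keeps the informative columns first, so the first `k` column names serve as ground truth. The following is a minimal sketch of that counting step in plain Python, using sets in place of the `np.isin` calls. The selected feature lists are made-up values for illustration, not PowerShap output.

```python
# Hypothetical setup: 10 features, the first 3 are informative (shuffle=False).
n_features = 10
n_informative = 3
columns = [f"col_{i}" for i in range(n_features)]
informative = set(columns[:n_informative])

# Made-up selections for two seeds; real runs take these from the p_value < 0.01 filter.
selected_per_seed = [
    ["col_0", "col_1", "col_7"],
    ["col_0", "col_1", "col_2"],
]

# Informative features that were recovered, and selected features that are noise.
found_informative = [len(informative & set(sel)) for sel in selected_per_seed]
found_noise = [len(set(sel) - informative) for sel in selected_per_seed]

print(found_informative)  # [2, 3]
print(found_noise)        # [1, 0]
```

Per seed, the two counts add up to the number of selected features, because every selected feature is either informative or noise.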

--------------------------------

### Benchmarking PowerShap with CatBoostRegressor

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Benchmarks PowerShap with `CatBoostRegressor` (50 estimators) on synthetic regression data, using the same grid of sample sizes, feature counts, and informative-feature percentages as the classification benchmark. The snippet also defines the `benchmark_dict_print` and `output_dict_to_df` helpers, which summarize the results and export them to CSV. Requires pandas and numpy.

```python
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split
import sys
import time
#sys.path.append("../powershap")

from powershap import PowerShap
from catboost import CatBoostClassifier,CatBoostRegressor
import pandas as pd
import numpy as np

def config_times(feature_dict, n_informative):
    # Timings are stored per feature count under "Average time",
    # one entry per informative-percentage key, in insertion order.
    configs = [key for key in feature_dict if key != "Average time"]
    return feature_dict["Average time"][configs.index(n_informative)]

def benchmark_dict_print(output_dict):
    for n_samples in output_dict:
        print("Amount of samples = "+str(n_samples))
        for n_features, feature_dict in output_dict[n_samples].items():
            print("Total used features = "+str(n_features))
            for n_informative, result in feature_dict.items():
                if n_informative != "Average time":
                    found = np.mean(result["found_informative_features"])
                    noise = np.mean(result["outputted_noise_features"])
                    print("Informative features: "+str(result["informative_features"])+" ("+n_informative+")")
                    print("Average time: "+str(np.round(np.mean(config_times(feature_dict, n_informative)),2))+" seconds")
                    # Every selected feature is either informative or noise.
                    print("Found features: "+str(np.round(found+noise,2)))
                    print("Found "+str(np.round(found,2))+" of "+str(result["informative_features"])+" informative features")
                    print(str(np.round(noise,2))+" of "+str(np.round(found+noise,2))+" outputted powershap features are noise features")
                    print(100*"=")

def output_dict_to_df(output_dict):
    df_list = []
    for n_samples in output_dict:
        for n_features, feature_dict in output_dict[n_samples].items():
            for n_informative, result in feature_dict.items():
                if n_informative != "Average time":
                    df_list.append({
                        "n_samples": int(n_samples),
                        "n_features": int(n_features),
                        # Keys look like "10%", so strip the percent sign before converting.
                        "n_informative_percent": int(n_informative.rstrip("%")),
                        "informative_features": result["informative_features"],
                        "found_informative_features": np.mean(result["found_informative_features"]),
                        "outputted_noise_features": np.mean(result["outputted_noise_features"]),
                        "average_time_seconds": np.mean(config_times(feature_dict, n_informative)),
                    })
    return pd.DataFrame(df_list)

regression_bool=True
estimators = 50

output_dict = {}

for n_samples in [1000,5000,20000]:
    output_dict[str(n_samples)]={}
    for n_features in [20,100,250,500]:
        output_dict[str(n_samples)][str(n_features)]={}
        
        average_times = []
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        for n_informative in [10,33,50,90]:#[int(0.10*n_features),int(0.33*n_features),int(0.50*n_features),int(0.90*n_features)]:
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]={}
            print("Amount of samples = "+str(n_samples))
            print("Total used features = "+str(n_features))
            print("Informative features: "+str(int(n_informative/100*n_features))+" ("+str(n_informative)+" %)")
            print("")
            
            found_features = []
            found_idx_features = []
            times = []
            for random_seed in [0,1,2,3,4]:
                print("Seed "+str(random_seed))
                
                if regression_bool:
                    X, y = make_regression(n_samples=n_samples, n_features=n_features, n_informative=int(n_informative/100*n_features),random_state=random_seed,shuffle=False)
                else:
                    X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features, n_informative=int(n_informative/100*n_features), n_redundant=0, n_repeated = 0,shuffle=False,random_state=random_seed)
                X = pd.DataFrame(data=X, columns=[f"col_{i}" for i in range(n_features)])

                start_time = time.time()
                if regression_bool:
                    selector = PowerShap(
                        model = CatBoostRegressor(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                else:
                    selector = PowerShap(
                        model = CatBoostClassifier(verbose=0, n_estimators=estimators,use_best_model=True),
                        automatic=True
                    )
                selector.fit(X, y)
                
                times.append(time.time() - start_time)
                
                processed_shaps_df = selector._processed_shaps_df
                print(50*"-")
                
                found_features.append(len(processed_shaps_df[processed_shaps_df.p_value<0.01]))
                found_idx_features.append(processed_shaps_df[processed_shaps_df.p_value<0.01].index.values)
                
            found_informative_features = [np.sum(np.isin(X.columns.values[:int(n_informative/100*n_features)],f_list)) for f_list in found_idx_features]
            found_noise_features = [np.sum(1-np.isin(f_list,X.columns.values[:int(n_informative/100*n_features)])) for f_list in found_idx_features]
            print("Average time: "+str(np.round(np.mean(times),2))+" seconds")
            print("Found features: "+str(found_features))#len(processed_shaps_df[processed_shaps_df.p_value<0.01])))
            print("Found "+str(np.mean(found_informative_features))+" of "+str(int(n_informative/100*n_features))+" informative features")
            print(str(np.mean(found_noise_features))+" of "+str(np.mean(found_features))+" outputted powershap features are noise features")
            
            average_times.append(np.round(times,2))
            
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["informative_features"]=int(n_informative/100*n_features)
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["found_informative_features"]=found_informative_features
            output_dict[str(n_samples)][str(n_features)][str(n_informative)+"%"]["outputted_noise_features"]=found_noise_features
            
            print(100*"=")
            
        output_dict[str(n_samples)][str(n_features)]["Average time"]=average_times
        benchmark_dict_print(output_dict)
        print(100*"=")
        
output_dict_to_df(output_dict).to_csv("estimators_50_Regression_output_df.csv",index=False)

```
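Before launching the sweep, it can help to enumerate the grid to see how many fits it implies. The sketch below reuses the lists from the loop above and the same `int(n_informative/100*n_features)` conversion. It only counts configurations and does not fit any model.

```python
from itertools import product

sample_sizes = [1000, 5000, 20000]
feature_counts = [20, 100, 250, 500]
informative_percentages = [10, 33, 50, 90]
seeds = [0, 1, 2, 3, 4]

# One PowerShap fit per (samples, features, percentage, seed) combination.
grid = list(product(sample_sizes, feature_counts, informative_percentages, seeds))
print(len(grid))  # 240

# The percentage is truncated to a whole number of informative columns.
informative_counts = {
    (n_features, pct): int(pct / 100 * n_features)
    for n_features, pct in product(feature_counts, informative_percentages)
}
print(informative_counts[(20, 33)])   # 6
print(informative_counts[(250, 33)])  # 82
```

Because of the truncation, the effective share of informative features can sit slightly below the nominal percentage, for example 6 of 20 columns (30%) for the 33% setting.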

--------------------------------

### Initialize and Fit PowerShap Selector

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

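Initializes a PowerShap selector with a `CatBoostRegressor` (250 estimators) and fits it on the training data, using every column from the fourth onward as features. Automatic mode is enabled with `limit_automatic=10`, and the per-feature results are then read from `_processed_shaps_df`.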

```python
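# current_db_train, target_col, and Index_col are defined earlier in the notebook.
selector = PowerShap(
    model=CatBoostRegressor(verbose=0, n_estimators=250, use_best_model=True),
    power_iterations=10, automatic=True, limit_automatic=10, verbose=True,
    target_col=target_col, index_col=Index_col,
)
# Use every column from the fourth onward as features.
selector.fit(current_db_train[list(current_db_train.columns.values[3:])], current_db_train[target_col])
# Private attribute with the per-feature results, including the p_value column.
t = selector._processed_shaps_df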
```
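The simulation benchmarks above filter `_processed_shaps_df` on its `p_value` column to obtain the selected features. The following is a minimal sketch of that filter on a stand-in DataFrame. The feature names and p-values are invented for illustration, and only the `p_value` column is assumed to exist.

```python
import pandas as pd

# Stand-in for selector._processed_shaps_df: features as index, one p_value column.
# All values below are invented for illustration.
processed_shaps_df = pd.DataFrame(
    {"p_value": [0.001, 0.2, 0.009, 0.5]},
    index=["feat_a", "feat_b", "feat_c", "feat_d"],
)

# Keep features whose p-value falls below the threshold used in the benchmarks.
selected = processed_shaps_df[processed_shaps_df.p_value < 0.01].index.tolist()
print(selected)  # ['feat_a', 'feat_c']
```

Since `_processed_shaps_df` is a private attribute, its layout may change between PowerShap versions.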

--------------------------------

### Initialize and Fit BorutaShap

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Initializes and fits the BorutaShap feature selection algorithm using a CatBoostRegressor model. It performs feature selection for regression problems.

```python
model = CatBoostRegressor(verbose=False,iterations=250)

# classification=False treats the task as a regression problem
Feature_Selector = BorutaShap(model=model,
                            importance_measure='shap',
                            classification=False)

Feature_Selector.fit(X=current_db_train[list(current_db_train.columns.values[3:])], y=current_db_train[target_col], sample=False,
                        train_or_test = 'test', normalize=True,verbose=True)
subset = Feature_Selector.Subset()
```

--------------------------------

### Import Necessary Libraries

Source: https://github.com/predict-idlab/powershap/blob/main/examples/Benchmark_datasets.ipynb

Imports a comprehensive set of libraries required for data manipulation, machine learning, feature selection, and model training. Includes numpy, pandas, scikit-learn, catboost, shap, and others.

```python
import numpy as np
from numpy.random import RandomState
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn import feature_selection as fs
from sklearn import metrics as me
from sklearn.metrics import classification_report,auc,r2_score,matthews_corrcoef,roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2,f_classif,f_regression
from catboost import CatBoostRegressor,CatBoostClassifier
from catboost.utils import get_roc_curve
from catboost import Pool, cv
import shap
from scipy import stats
from scipy.optimize import curve_fit
import copy
from tabulate import tabulate
from tqdm import tqdm
from BorutaShap import BorutaShap
from powershap import PowerShap
import shapicant
```

--------------------------------

### Create and Fit Sklearn Pipeline with PowerShap

Source: https://github.com/predict-idlab/powershap/blob/main/examples/simulation.ipynb

Builds a scikit-learn `Pipeline` with PowerShap as the feature selection step and a `KNeighborsClassifier` as the final estimator, fits it on the training data, and compares its test accuracy against a `KNeighborsClassifier` trained on all features.

```python
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from powershap import PowerShap

pipe = Pipeline(
    [
        (
            "selector",
            PowerShap(
                CatBoostClassifier(n_estimators=250,verbose=False,use_best_model=True), automatic=True, limit_automatic=100,#power_alpha=0.001,power_req_iterations=0.999,
                #CatBoostRegressor(n_estimators=250,verbose=False), automatic=True, limit_automatic=100,
            ),
        ),
        ("catboost", KNeighborsClassifier()),#(n_estimators=250,verbose=False)),
        #("catboost", CatBoostRegressor(n_estimators=250,verbose=False)),
    ]
)

pipe.fit(X_train, y_train)


from sklearn.metrics import accuracy_score,r2_score

# Baseline: the same classifier trained on all features (metrics take y_true first).
print("Baseline", accuracy_score(y_test, KNeighborsClassifier().fit(X_train, y_train).predict(X_test)))
#print("Baseline", r2_score(y_test, LinearRegression().fit(X_train, y_train).predict(X_test)))

print("PowerShap feature selection:", accuracy_score(y_test, pipe.predict(X_test)))
#print("PowerShap feature selection:", r2_score(y_test, pipe.predict(X_test)))
```
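The snippet above assumes `X_train`, `X_test`, `y_train`, and `y_test` already exist. The sketch below shows the same baseline-versus-pipeline comparison end to end on synthetic data. To keep it runnable without CatBoost, scikit-learn's `SelectKBest` stands in for PowerShap as the selector step. That is a deliberate substitution, not how PowerShap selects features, and the dataset sizes and `k=5` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic data: 5 informative columns out of 30.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SelectKBest is a stand-in here; in the snippet above this step is PowerShap.
pipe = Pipeline(
    [
        ("selector", SelectKBest(f_classif, k=5)),
        ("knn", KNeighborsClassifier()),
    ]
)
pipe.fit(X_train, y_train)

# Baseline: the same classifier trained on all 30 columns.
baseline = KNeighborsClassifier().fit(X_train, y_train)
print("Baseline:", accuracy_score(y_test, baseline.predict(X_test)))
print("With selection:", accuracy_score(y_test, pipe.predict(X_test)))
print("Kept columns:", pipe.named_steps["selector"].get_support().sum())
```

Any transformer that implements `fit` and `transform` can fill the selector slot, which is why PowerShap can be dropped into the pipeline in the same position.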