### Install Feature-engine

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Install the feature-engine library if you haven't already. This ensures compatibility with the examples.

```bash
pip install feature_engine
```

--------------------------------

### Clone the Feature-engine Examples Repository

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Clone the Feature-engine examples repository to your local machine to start contributing.

```bash
git clone https://github.com//feature-engine-examples.git
```

--------------------------------

### Install Pytest

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install the pytest testing framework. This is a prerequisite for running tests.

```bash
$ pip install pytest
```

--------------------------------

### Install documentation dependencies

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Install the required Python packages for building the Feature-engine documentation from the root directory.

```bash
pip install -r docs/requirements.txt
```

--------------------------------

### Install Documentation Dependencies

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install the necessary libraries for building the documentation. Ensure you are in the feature_engine module directory.

```bash
$ pip install -r docs/requirements.txt
```

--------------------------------

### Install Documentation Requirements

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_docs.md

Install the necessary Python packages for building the documentation. This command should be run after activating the project's virtual environment.

```bash
pip install -r docs/requirements.txt
```

--------------------------------

### Install tox

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install tox, a tool for automating testing, in your development environment.

```bash
$ pip install tox
```

--------------------------------

### Install Feature-engine using pip

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Use this command to install the Feature-engine library from PyPI.

```bash
pip install feature_engine
```

--------------------------------

### Initial DataFrame Setup

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/preprocessing/MatchVariables.md

This code initializes two new columns, 'var_a' and 'var_b', in the test DataFrame and sets their values to 0. This is a setup step before applying transformations.

```python
# let's add some columns for the demo
test_t[['var_a', 'var_b']] = 0

test_t.head()
```

--------------------------------

### Install Mypy

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install mypy for type hinting checks. This is used to verify type annotations in the codebase.

```bash
$ pip install mypy
```

--------------------------------

### Install Black and Isort

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install the black and isort libraries for code formatting and import sorting. These tools help maintain PEP8 compliance.

```bash
$ pip install black
```

```bash
$ pip install isort
```

--------------------------------

### Install Feature-Engine with Pip

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/quickstart/index.md

Install the feature-engine package using pip. This is the standard method for installing Python packages.

```bash
pip install feature-engine
```

--------------------------------

### Example Feature Combinations Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/DecisionTreeFeatures.md

This is an example output showing the list of feature combinations that will be used to train decision trees. It includes individual features and all possible pairs of numerical features from the training set.

```python
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 ['MedInc', 'HouseAge'],
 ['MedInc', 'AveRooms'],
 ['MedInc', 'AveBedrms'],
 ['MedInc', 'Population'],
 ['MedInc', 'AveOccup'],
 ['HouseAge', 'AveRooms'],
 ['HouseAge', 'AveBedrms'],
 ['HouseAge', 'Population'],
 ['HouseAge', 'AveOccup'],
 ['AveRooms', 'AveBedrms'],
 ['AveRooms', 'Population'],
 ['AveRooms', 'AveOccup'],
 ['AveBedrms', 'Population'],
 ['AveBedrms', 'AveOccup'],
 ['Population', 'AveOccup']]
```

--------------------------------

### Setup Pipeline and Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Initializes a pandas DataFrame and Series, then creates a Feature-engine Pipeline that drops rows with missing data, one-hot encodes the categorical variable, and fits a Lasso model. The pipeline is then fitted to the data.

```python
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OneHotEncoder
from feature_engine.pipeline import Pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = Pipeline(
    [
        ("drop", DropMissingData()),
        ("enc", OneHotEncoder()),
        ("lasso", Lasso(random_state=10)),
    ]
)

pipe.fit(X, y)
```

--------------------------------

### Install Feature-engine in Developer Mode

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install Feature-engine and its development dependencies.
The '-e' flag installs the package in editable mode, so code changes are reflected immediately without reinstallation. Include '.[docs,tests]' to install dependencies for documentation and testing.

```bash
cd feature_engine
pip install -e ".[docs,tests]"
```

--------------------------------

### Install Feature-engine developer dependencies

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Install Feature-engine and its testing dependencies, necessary for development and running tests.

```bash
pip install -e ".[tests]"
```

--------------------------------

### Forecasting Pipeline Setup

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/make_pipeline.md

Imports for setting up a direct forecasting pipeline using Feature-engine's time series forecasting transformers and scikit-learn models.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import Lasso
from sklearn.metrics import root_mean_squared_error
from sklearn.multioutput import MultiOutputRegressor

from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)
from feature_engine.pipeline import make_pipeline
```

--------------------------------

### Install Feature-engine using Conda

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Use this command to install the Feature-engine library from the conda-forge channel.

```bash
conda install -c conda-forge feature_engine
```

--------------------------------

### Set up a Pipeline with DropMissingData, OrdinalEncoder, and Lasso

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Instantiate a Pipeline with a list of named steps. This example chains the removal of rows with missing data, categorical encoding, and a Lasso regression model.

```python
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import Pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = Pipeline(
    [
        ("drop", DropMissingData()),
        ("enc", OrdinalEncoder(encoding_method="arbitrary")),
        ("lasso", Lasso(random_state=10)),
    ]
)

# predict
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe
```

--------------------------------

### Discretizer Output Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/EqualWidthDiscretiser.md

Shows the learned interval limits for 'LotArea' and 'GrLivArea' after applying EqualWidthDiscretiser with 10 bins. The outer limits are -inf and inf, so values outside the range seen during training still fall into a bin.

```python
{
    'LotArea': [-inf, 22694.5, 44089.0, 65483.5, 86878.0, 108272.5,
                129667.0, 151061.5, 172456.0, 193850.5, inf],
    'GrLivArea': [-inf, 864.8, 1395.6, 1926.3999999999999, 2457.2, 2988.0,
                  3518.7999999999997, 4049.5999999999995, 4580.4, 5111.2, inf]
}
```

--------------------------------

### Performance Drifts Example Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Example output showing the drop in linear regression r2 after shuffling each feature. Larger positive values mean that shuffling the feature degraded performance more, so the feature is more important; values near zero or negative mean that shuffling had little effect.

```python
{
    'age': -0.0054698043007869734,
    'sex': 0.03325633986510784,
    'bmi': 0.184158237207512,
    'bp': 0.10089894421748086,
    's1': 0.49324432634948095,
    's2': 0.21163252880660438,
    's3': 0.02006839198785859,
    's4': 0.011098050006761673,
    's5': 0.4828781996541602,
    's6': 0.003963360084439538
}
```

--------------------------------

### Load and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/text/TextFeatures.md

Loads the 20 newsgroups dataset and splits it into training and testing sets. Ensure pandas and scikit-learn are installed.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
import pandas as pd

data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.sport.hockey'])
df = pd.DataFrame({'text': data.data, 'target': data.target})

X_train, X_test, y_train, y_test = train_test_split(
    df[['text']], df['target'], test_size=0.3, random_state=42
)

print(X_train.head())
```

--------------------------------

### Install Feature-engine in developer mode

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Install Feature-engine in editable mode, allowing for direct code changes to be reflected without reinstallation.

```bash
pip install -e .
```

--------------------------------

### Example Features to Drop

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Example output listing the features deemed non-important because their performance drift is below the mean performance drift of all features.

```python
['age', 'sex', 'bp', 's3', 's4', 's6']
```

--------------------------------

### Extract the first two steps of a Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Use slicing notation to extract a partial pipeline. This example retrieves the first two steps.

```python
pipe[:2]
```

--------------------------------

### Extract the first step of a Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Use slicing notation to extract a partial pipeline. This example retrieves only the first step.

```python
pipe[:1]
```

--------------------------------

### Import Libraries for Monotonic Features Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/DecisionTreeEncoder.md

Import necessary libraries including matplotlib, fetch_openml, train_test_split, and DecisionTreeEncoder for demonstrating monotonic features.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from feature_engine.encoding import DecisionTreeEncoder
```

--------------------------------

### RandomSampleImputer with observation-specific seeding (example)

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/RandomSampleImputer.md

Illustrates how RandomSampleImputer can be configured for observation-specific seeding. The seed is derived from the sum of 'height' and 'weight' for each observation.

```python
RandomSampleImputer(
    random_state=['height', 'weight'],
    seed='observation',
    seeding_method='add',
)
```

--------------------------------

### Navigate to the Project Directory

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Change your current directory to the cloned feature-engine-examples project.

```bash
cd feature-engine-examples
```

--------------------------------

### Load Diabetes Dataset and Display Head

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureElimination.md

Loads the diabetes dataset from Scikit-learn and displays the first few rows of the feature data.
This is a common starting point for feature selection examples.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureElimination

# load dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

print(X.head())
```

--------------------------------

### Build documentation with Sphinx

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Build the HTML version of the documentation using Sphinx. This command should be run from the root directory of the project.

```bash
sphinx-build -b html docs build
```

--------------------------------

### Transformed Data Example Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Example output of the DataFrame head after feature selection, showing only the remaining important features.

```python
        bmi        s1        s2        s5
0  0.061696 -0.044223 -0.034821  0.019907
1 -0.051474 -0.008449 -0.019163 -0.068332
2  0.044451 -0.045599 -0.034194  0.002861
3 -0.011595  0.012191  0.024991  0.022688
4 -0.036385  0.003935  0.015596 -0.031988
```

--------------------------------

### Load data and set up ArbitraryNumberImputer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/ArbitraryNumberImputer.md

This snippet demonstrates loading the house prices dataset, splitting it into training and testing sets, and initializing the ArbitraryNumberImputer to impute specified numerical variables with -999.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

from feature_engine.imputation import ArbitraryNumberImputer

# Load dataset
X, y = fetch_openml(
    name='house_prices',
    version=1,
    return_X_y=True,
    as_frame=True,
    parser='auto',
)

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)
```

```python
# set up the imputer
arbitrary_imputer = ArbitraryNumberImputer(
    arbitrary_number=-999,
    variables=['LotFrontage', 'MasVnrArea'],
)

# fit the imputer
arbitrary_imputer.fit(X_train)
```

--------------------------------

### Frequency Encoding Dictionary Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/CountFrequencyEncoder.md

Example output of the 'encoder_dict_' showing the mapping of categories to their frequencies for 'cabin', 'pclass', and 'embarked' variables.

```python
{
    'cabin': {'M': 0.7663755458515283,
              'C': 0.07751091703056769,
              'B': 0.04585152838427948,
              'E': 0.034934497816593885,
              'D': 0.034934497816593885,
              'A': 0.018558951965065504,
              'F': 0.016375545851528384,
              'G': 0.004366812227074236,
              'T': 0.001091703056768559},
    'pclass': {3: 0.5436681222707423,
               1: 0.25109170305676853,
               2: 0.2052401746724891},
    'embarked': {'S': 0.7117903930131004,
                 'C': 0.19541484716157206,
                 'Q': 0.0906113537117904,
                 'Missing': 0.002183406113537118}
}
```

--------------------------------

### Create Sample DataFrame

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/text/TextFeatures.md

Creates a pandas DataFrame with sample text data for demonstration purposes.

```python
import pandas as pd
from feature_engine.text import TextFeatures

# Create sample data
X = pd.DataFrame({
    'review': [
        'This product is AMAZING! Best purchase ever.',
        'Not great. Would not recommend.',
        'OK for the price. 3 out of 5 stars.',
        'TERRIBLE!!! DO NOT BUY!',
    ],
    'title': [
        'Great Product',
        'Disappointed',
        'Average',
        'Awful',
    ]
})

print(X)
```

--------------------------------

### Standard Deviation of Performance Drifts Example Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Example output showing the variability (standard deviation) of the change in r2 after shuffling each feature. Higher values suggest more inconsistent performance changes.

```python
{
    'age': 0.012788500580799392,
    'sex': 0.040792331972680645,
    'bmi': 0.042212436355346106,
    'bp': 0.05397012536801143,
    's1': 0.35198797776358015,
    's2': 0.167636042355086,
    's3': 0.03455158514716544,
    's4': 0.007755675852874145,
    's5': 0.1449579162698361,
    's6': 0.011193022434166025
}
```

--------------------------------

### Build Documentation with Sphinx

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Generate HTML documentation from the source files using Sphinx. The output will be stored in the 'build' folder.

```bash
$ sphinx-build -b html docs build
```

--------------------------------

### Navigate to Feature-engine directory

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Change your current directory to the root of the Feature-engine repository.

```bash
$ cd feature_engine
```

--------------------------------

### Create a Sample DataFrame

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_numerical_variables.md

Create a pandas DataFrame with various data types including numerical, categorical, and datetime.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["tom", "nick", "krish", "jack"],
    "City": ["London", "Manchester", "Liverpool", "Bristol"],
    "Age": [20, 21, 19, 18],
    "Marks": [0.9, 0.8, 0.7, 0.6],
    "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
})

print(df.head())
```

--------------------------------

### Load Data and Prepare for Reciprocal Transformation

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/ReciprocalTransformer.md

Load the Ames house prices dataset, create a new variable 'sqrfootpercar', and split the data into training and testing sets. This example demonstrates data preparation before applying the transformation.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

from feature_engine.transformation import ReciprocalTransformer

data = fetch_openml(name='house_prices', as_frame=True)
data = data.frame

data["sqrfootpercar"] = data['GarageArea'] / data['GarageCars']
data = data[~data["sqrfootpercar"].isna()]

y = data['SalePrice']
X = data[['GarageCars', 'GarageArea', "sqrfootpercar"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.head())
```

--------------------------------

### Example Transformed Data Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/DecisionTreeFeatures.md

This is an example output of the transformed data, showing the original features alongside the newly created features derived from decision tree predictions. The new features are named 'tree(...)', indicating the feature or feature combination each tree was trained on.

```python
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
14740  4.1518      22.0  5.663073   1.075472      1551.0  4.180593
10101  5.7796      32.0  6.107226   0.927739      1296.0  3.020979
20566  4.3487      29.0  5.930712   1.026217      1554.0  2.910112
2670   2.4511      37.0  4.992958   1.316901       390.0  2.746479
15709  5.0049      25.0  4.319261   1.039578       649.0  1.712401

       tree(MedInc)  tree(HouseAge)  tree(AveRooms)  tree(AveBedrms)  ...  \
14740      2.204822        2.130618        2.001950         2.080254  ...
10101      2.975513        2.051980        2.001950         2.165554  ...
20566      2.204822        2.051980        2.001950         2.165554  ...
2670       1.416771        2.051980        1.802158         1.882763  ...
15709      2.420124        2.130618        1.802158         2.165554  ...

       tree(['HouseAge', 'AveRooms'])  tree(['HouseAge', 'AveBedrms'])  \
14740                        1.885406                         2.124812
10101                        1.885406                         2.124812
20566                        1.885406                         2.124812
2670                         1.797902                         1.836498
15709                        1.797902                         2.124812

       tree(['HouseAge', 'Population'])  tree(['HouseAge', 'AveOccup'])  \
14740                          2.004703                        1.437440
10101                          2.004703                        2.257968
20566                          2.004703                        2.257968
2670                           2.123579                        2.257968
15709                          2.123579                        2.603372

       tree(['AveRooms', 'AveBedrms'])  tree(['AveRooms', 'Population'])  \
14740                         2.099977                          1.878989
10101                         2.438937                          2.077321
20566                         2.099977                          1.878989
2670                          1.728401                          1.843904
15709                         1.821467                          1.843904

       tree(['AveRooms', 'AveOccup'])  tree(['AveBedrms', 'Population'])  \
14740                        1.719582                           2.056003
10101                        2.156884                           2.056003
20566                        2.156884                           2.056003
2670                         1.747990                           1.882763
15709                        2.783690                           2.221092

       tree(['AveBedrms', 'AveOccup'])  tree(['Population', 'AveOccup'])
14740                         1.400491                          1.484939
10101                         2.153210                          2.059187
20566                         2.153210                          2.059187
2670                          1.861020                          2.235743
15709                         2.727460                          2.747390

[5 rows x 27 columns]
```

--------------------------------

### Create Toy Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/scaling/MeanNormalizationScaler.md

Creates a sample pandas DataFrame for demonstrating the MeanNormalizationScaler. Includes numerical and non-numerical columns.

```python
import pandas as pd
from feature_engine.scaling import MeanNormalizationScaler

df = pd.DataFrame.from_dict(
    {
        "Name": ["tom", "nick", "krish", "jack"],
        "City": ["London", "Manchester", "Liverpool", "Bristol"],
        "Age": [20, 21, 19, 18],
        "Height": [1.80, 1.77, 1.90, 2.00],
        "Marks": [0.9, 0.8, 0.7, 0.6],
        "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
    })

print(df)
```

--------------------------------

### Get Selected Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByTargetMeanPerformance.md

Use the get_feature_names_out method to retrieve the names of the features that were selected by the transformer.

```python
sel.get_feature_names_out()
```

--------------------------------

### Build Documentation with Sphinx

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_docs.md

Build the HTML version of the documentation using Sphinx. This command specifies the source directory for documentation files and the output directory for the generated HTML.

```bash
sphinx-build -b html docs build
```

--------------------------------

### Get Features to Drop

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureElimination.md

Retrieve the list of features that RecursiveFeatureElimination has identified for removal based on the specified threshold.

```python
# the features to remove
tr.features_to_drop_
```

--------------------------------

### Initialize and Fit Winsorizer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/Winsorizer.md

Initializes the Winsorizer to cap outliers in 'age' and 'fare' using the Gaussian method on the right tail. The 'fold' parameter is set to 3, so the cap is placed 3 standard deviations above the mean.

```python
from feature_engine.outliers import Winsorizer

# X_train is the training DataFrame prepared in an earlier step
capper = Winsorizer(capping_method='gaussian',
                    tail='right',
                    fold=3,
                    variables=['age', 'fare'])

capper.fit(X_train)
```

--------------------------------

### Example Model Accuracy Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/WoEEncoder.md

This snippet shows the expected output format for the model accuracy after training and prediction.

```python
Accuracy: 0.76
```

--------------------------------

### Pandas dropna Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/DropMissingData.md

Demonstrates the basic usage of pandas' dropna function to remove rows with NaN values.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(dict(
    x1 = [np.nan,1,1,0,np.nan],
    x2 = ["a", np.nan, "b", np.nan, "a"],
))

X.dropna(inplace=True)
print(X)
```

--------------------------------

### Set up Feature-engine Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Defines a pipeline including outlier trimming, one-hot encoding, scaling, and logistic regression.

```python
from feature_engine.encoding import OneHotEncoder
from feature_engine.outliers import OutlierTrimmer
from feature_engine.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [
        ("outliers", OutlierTrimmer(variables=["age", "fare"])),
        ("enc", OneHotEncoder()),
        ("scaler", StandardScaler()),
        ("logit", LogisticRegression(random_state=10)),
    ]
)
```

--------------------------------

### Get Feature Names After Transformation

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/datetime/DatetimeFeatures.md

Retrieves the names of the features generated by the DatetimeFeatures transformer after fitting and transforming the data.

```python
dtfs.get_feature_names_out()
```

--------------------------------

### Initialize Linear Regression Model

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Sets up a Linear Regression model from Scikit-learn.
This model will be used by SelectByShuffling to evaluate feature importance.

```python
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
```

--------------------------------

### Extract the last step of a Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Use slicing notation to extract a partial pipeline. This example retrieves only the last step.

```python
pipe[-1:]
```

--------------------------------

### Initialize and Prepare for Performance-Based Selection

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SmartCorrelatedSelection.md

This snippet imports the libraries needed for performance-based feature selection with SmartCorrelatedSelection: pandas, make_classification (for building a toy dataset), DecisionTreeClassifier, and SmartCorrelatedSelection.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from feature_engine.selection import SmartCorrelatedSelection
```

--------------------------------

### Create a Toy Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_all_variables.md

This code creates a sample pandas DataFrame with numerical, categorical, and datetime variables for demonstration purposes.

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=4,
    n_redundant=1,
    n_clusters_per_class=1,
    weights=[0.50],
    class_sep=2,
    random_state=1,
)

# transform arrays into pandas df and series
colnames = [f"num_var_{i+1}" for i in range(4)]
X = pd.DataFrame(X, columns=colnames)

X["cat_var1"] = ["Hello"] * 1000
X["cat_var2"] = ["Bye"] * 1000

X["date1"] = pd.date_range("2020-02-24", periods=1000, freq="min")
X["date2"] = pd.date_range("2021-09-29", periods=1000, freq="h")
X["date3"] = ["2020-02-24"] * 1000

print(X.head())
```

--------------------------------

### Initialize LogCpTransformer with user-defined constants for specific variables

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/LogCpTransformer.md

Initialize the LogCpTransformer by providing a dictionary to the 'C' parameter. Each key-value pair in the dictionary specifies a variable and the constant to be added to it before the logarithm is applied.

```python
tf = LogCpTransformer(C={"bmi": 2, "s3": 3, "s4": 4})
tf.fit(X_train)
```

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/OutlierTrimmer.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure `feature_engine` is installed.

```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.outliers import OutlierTrimmer

X, y = load_titanic(
    return_X_y_frame=True,
    predictors_only=True,
    handle_missing=True,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
```

--------------------------------

### Create a Feature Engineering Pipeline with WoEEncoder

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/WoEEncoder.md

Sets up a pipeline that first discretizes numerical variables, then groups rare labels, and finally encodes all specified variables using the WoEEncoder. This demonstrates a sequential application of multiple feature engineering steps.

```python
# `numerical_features` and `all` are lists of variable names defined earlier;
# note that `all` shadows the Python builtin of the same name.
pipe = Pipeline(
    [
        ("disc", EqualFrequencyDiscretiser(variables=numerical_features)),
        ("rare_label", RareLabelEncoder(tol=0.1, n_categories=2, variables=all, ignore_format=True)),
        ("woe", WoEEncoder(variables=all)),
    ])
```

--------------------------------

### Create a Toy DataFrame

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/MathFeatures.md

This code snippet demonstrates how to create a sample pandas DataFrame with various data types, which will be used to illustrate the functionality of MathFeatures.

```python
import numpy as np
import pandas as pd
from feature_engine.creation import MathFeatures

df = pd.DataFrame.from_dict(
    {
        "Name": ["tom", "nick", "krish", "jack"],
        "City": ["London", "Manchester", "Liverpool", "Bristol"],
        "Age": [20, 21, 19, 18],
        "Marks": [0.9, 0.8, 0.7, 0.6],
        # "min" replaces the "T" alias, which recent pandas versions deprecate
        "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
    })

print(df)
```

--------------------------------

### Activate Conda Environment

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Activate the conda environment you created.
This ensures that subsequent installations and commands are run within the isolated environment.

```bash
conda activate myenv
```

--------------------------------

### Prepare Test Set by Dropping Columns

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/preprocessing/MatchVariables.md

Demonstrates preparing a test set by dropping specific columns ('sex', 'age') to simulate missing features.

```python
# Let's drop some columns in the test set for the demo
test_t = test.drop(["sex", "age"], axis=1)

test_t.head()
```

--------------------------------

### Find Numerical Variables

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_numerical_variables.md

Use `find_numerical_variables` to get a list of all numerical variable names from the DataFrame. This function requires the DataFrame as input.

```python
from feature_engine.variable_handling import find_numerical_variables

var_num = find_numerical_variables(df)

var_num
```

--------------------------------

### Load Libraries and Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/CyclicalFeatures.md

Imports necessary libraries and loads the Bike Sharing Demand dataset from OpenML.

```python
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import FunctionTransformer

from feature_engine.creation import CyclicalFeatures

df = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True).frame

print(df.head())
```

--------------------------------

### Getting Feature Names with ExpandingWindowFeatures

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/timeseries/forecasting/ExpandingWindowFeatures.md

Use the `get_feature_names_out()` method after fitting the transformer to retrieve the names of the original and newly created features.

```python
win_f = ExpandingWindowFeatures()

win_f.fit(X)

win_f.get_feature_names_out()
```

--------------------------------

### Import Libraries and Load Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/LogCpTransformer.md

Imports necessary libraries and loads the California housing dataset for transformation.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

from feature_engine.transformation import LogCpTransformer

# Load dataset
X, y = fetch_california_housing(
    return_X_y=True, as_frame=True)

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```

--------------------------------

### Get Feature Names After Lagging

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/timeseries/forecasting/LagFeatures.md

Use the `get_feature_names_out()` method to retrieve the names of all features, including the newly created lag features.

```python
lag_f.get_feature_names_out()
```

--------------------------------

### Load Data and Initialize LogTransformer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/LogTransformer.md

Imports necessary libraries, loads the Ames house prices dataset, splits it into training and testing sets, and initializes the LogTransformer for specific variables. The transformer checks for numerical variables during fit.
```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from feature_engine.transformation import LogTransformer

data = fetch_openml(name='house_prices', as_frame=True)
data = data.frame

X = data.drop(['SalePrice', 'Id'], axis=1)
y = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.head())
```

```python
X_train[['LotArea', 'GrLivArea']].hist(figsize=(10,5))
plt.show()
```

```python
logt = LogTransformer(variables=['LotArea', 'GrLivArea'])

logt.fit(X_train)
```

--------------------------------

### Get Feature Names Out

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SmartCorrelatedSelection.md

Uses the `get_feature_names_out()` method, common to scikit-learn transformers, to retrieve the names of the features remaining in the transformed DataFrame.

```python
tr.get_feature_names_out()
```

--------------------------------

### Get supported features

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByInformationValue.md

Use the `get_support()` method to obtain a boolean list indicating which features are selected (True) or dropped (False).

```python
sel.get_support()
```

--------------------------------

### Load House Prices Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/MeanMedianImputer.md

Load the house prices dataset from OpenML for demonstration purposes.

```python
from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    name='house_prices',
    version=1,
    return_X_y=True,
    as_frame=True,
    parser='auto',
)
```

--------------------------------

### Get Performance Drifts

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureAddition.md

Retrieve the changes in model performance resulting from adding each feature.
This helps in understanding the incremental value of each feature.

```python
# Get the performance drift of each feature
tr.performance_drifts_
```

--------------------------------

### Initialize and Fit MeanMedianImputer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/MeanMedianImputer.md

Initialize MeanMedianImputer with the 'mean' imputation method and specify the variables to impute. Then, fit the imputer using the training data.

```python
from feature_engine.imputation import MeanMedianImputer

# Set up the imputer
mmi = MeanMedianImputer(
    imputation_method='mean',
    variables=['LotFrontage', 'MasVnrArea']
)

# Fit transformer with training data
mmi.fit(X_train)
```

--------------------------------

### Setting up a Pipeline with make_pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/make_pipeline.md

Use make_pipeline to create a pipeline that first drops missing data, then encodes categorical features ordinally, and finally fits a Lasso regression model. The pipeline automatically assigns names to each step.

```python
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import make_pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = make_pipeline(
    DropMissingData(),
    OrdinalEncoder(encoding_method="arbitrary"),
    Lasso(random_state=10),
)

# fit the pipeline, then predict
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe
```

--------------------------------

### Get Transformed Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/RelativeFeatures.md

Retrieves the names of all features in the DataFrame after the RelativeFeatures transformation has been applied. This is useful for understanding the output of the transformer.
```python
transformer.get_feature_names_out(input_features=None)
```

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/DropFeatures.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure you have the feature_engine library installed.

```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)
```

--------------------------------

### Verify Remotes

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Check that both your fork ('origin') and the main repository ('upstream') are correctly linked to your local copy.

```bash
$ git remote -v
```

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/DropConstantFeatures.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure you have feature_engine and scikit-learn installed.

```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropConstantFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)
```

--------------------------------

### Load Data and Split into Train/Test Sets

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/Winsorizer.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure Feature-Engine and Scikit-learn are installed.
```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.outliers import Winsorizer

X, y = load_titanic(
    return_X_y_frame=True,
    predictors_only=True,
    handle_missing=True,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
```

--------------------------------

### Load Wine Dataset and Libraries

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/index.md

Imports necessary libraries and loads the wine dataset bundled with Scikit-learn (`load_wine`). Displays the head of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_wine
from feature_engine.creation import RelativeFeatures, MathFeatures

X, y = load_wine(return_X_y=True, as_frame=True)
print(X.head())
```

--------------------------------

### Clone Feature-engine Repository

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Clone your forked repository to your local machine to begin development.

```bash
$ git clone https://github.com//feature_engine
```

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/OrdinalEncoder.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure Feature-engine and scikit-learn are installed.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import OrdinalEncoder

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
```

--------------------------------

### Upgrade Feature-Engine with Pip

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/quickstart/index.md

Upgrade an existing feature-engine installation to the latest version using pip. The -U flag ensures the package is updated.

```bash
pip install -U feature-engine
```

--------------------------------

### Instantiate and Transform Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/EqualFrequencyDiscretiser.md

Instantiates the EqualFrequencyDiscretiser with 5 quantiles (bins) and applies it to the created dataset to transform the features.

```python
# Instantiate discretizer
disc = EqualFrequencyDiscretiser(q=5)

# Transform simulated data
X_transformed = disc.fit_transform(X)
```

--------------------------------

### Stage and Commit Changes

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Add your notebook changes to the staging area and commit them with a meaningful message.

```bash
git add .
git commit -m "a meaningful commit message"
```

--------------------------------

### Find All Variables in a Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_all_variables.md

Use `find_all_variables` to get a list of all variable names in the DataFrame. This function is useful for quickly inspecting the columns of your dataset.
```python
from feature_engine.variable_handling import find_all_variables

vars_all = find_all_variables(X)

vars_all
```

--------------------------------

### Get Features to Drop

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureAddition.md

Access the list of features identified by RecursiveFeatureAddition that will be dropped. These are the features deemed least important based on the selection criteria.

```python
# the features to drop
tr.features_to_drop_
```

--------------------------------

### Display Training Data Head

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/OneHotEncoder.md

Output showing the first 5 rows of the training data, which illustrates its initial structure and content.

```text
      pclass     sex        age  sibsp  parch     fare cabin embarked
501        2  female  13.000000      0      1  19.5000     M        S
588        2  female   4.000000      1      1  23.0000     M        S
402        2  female  30.000000      1      0  13.8583     M        C
1193       3    male  29.881135      0      0   7.7250     M        Q
686        3  female  22.000000      0      0   7.7250     M        Q
```

--------------------------------

### Get Transformed Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/OutlierTrimmer.md

Retrieve the names of the features in the dataset after the outlier transformation has been applied. This is useful for subsequent data processing steps.

```python
ot.get_feature_names_out()
```

--------------------------------

### Get Output Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/datetime/DatetimeFeatures.md

Obtains the names of the features that will be present in the output DataFrame after the transformation, including the newly extracted datetime features.

```python
dfts.get_feature_names_out()
```