### Install Feature-engine

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Install the feature-engine library if you haven't already, so that you can run the examples.

```bash
pip install feature_engine
```

--------------------------------

### Clone the Feature-engine Examples Repository

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Clone your fork of the Feature-engine examples repository to your local machine to start contributing. Replace `<YOURUSERNAME>` with your GitHub username.

```bash
git clone https://github.com/<YOURUSERNAME>/feature-engine-examples.git
```

--------------------------------

### Install Pytest

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install the pytest testing framework. This is a prerequisite for running tests.

```bash
$ pip install pytest
```

--------------------------------

### Install documentation dependencies

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Install the required Python packages for building the Feature-engine documentation from the root directory.

```bash
pip install -r docs/requirements.txt
```

--------------------------------

### Install Documentation Dependencies

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install the necessary libraries for building the documentation. Run the command from the root of the cloned feature_engine repository, so that the relative path `docs/requirements.txt` resolves.

```bash
$ pip install -r docs/requirements.txt
```

--------------------------------

### Install Documentation Requirements

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_docs.md

Install the necessary Python packages for building the documentation. This command should be run after activating the project's virtual environment.

```bash
pip install -r docs/requirements.txt
```

--------------------------------

### Install tox

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install tox, a tool for automating testing, in your development environment.

```bash
$ pip install tox
```

--------------------------------

### Install Feature-engine using pip

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Use this command to install the Feature-engine library from PyPI.

```bash
pip install feature_engine
```

--------------------------------

### Initial DataFrame Setup

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/preprocessing/MatchVariables.md

This code initializes two new columns, 'var_a' and 'var_b', in the test DataFrame and sets their values to 0. This is a setup step before applying transformations.

```python
# let's add some columns for the demo
test_t[['var_a', 'var_b']] = 0

test_t.head()
```

--------------------------------

### Install Mypy

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install mypy for static type checking. It is used to verify the type annotations in the codebase.

```bash
$ pip install mypy
```

--------------------------------

### Install Black and Isort

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install the black and isort libraries for code formatting and import sorting. These tools help maintain PEP8 compliance.

```bash
$ pip install black
```

```bash
$ pip install isort
```

--------------------------------

### Install Feature-Engine with Pip

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/quickstart/index.md

Install the feature-engine package using pip. This is the standard method for installing Python packages.

```bash
pip install feature-engine
```

--------------------------------

### Example Feature Combinations Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/DecisionTreeFeatures.md

This is an example output showing the list of feature combinations that will be used to train decision trees. It includes individual features and all possible pairs of numerical features from the training set.

```python
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 ['MedInc', 'HouseAge'],
 ['MedInc', 'AveRooms'],
 ['MedInc', 'AveBedrms'],
 ['MedInc', 'Population'],
 ['MedInc', 'AveOccup'],
 ['HouseAge', 'AveRooms'],
 ['HouseAge', 'AveBedrms'],
 ['HouseAge', 'Population'],
 ['HouseAge', 'AveOccup'],
 ['AveRooms', 'AveBedrms'],
 ['AveRooms', 'Population'],
 ['AveRooms', 'AveOccup'],
 ['AveBedrms', 'Population'],
 ['AveBedrms', 'AveOccup'],
 ['Population', 'AveOccup']]
```
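The list above can be reproduced in plain Python: every single feature, followed by every unordered pair of features. This is a minimal sketch with `itertools.combinations`, not Feature-engine's internal code; the variable names are illustrative.

```python
from itertools import combinations

# Numerical features taken from the example output above
features = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]

# Single features first, then every unordered pair of features
pairs = [list(pair) for pair in combinations(features, 2)]
feature_combinations = list(features) + pairs

# 6 single features + 15 pairs
print(len(feature_combinations))
```

With 6 features there are 15 distinct pairs, which gives the 21 entries shown in the output.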

--------------------------------

### Setup Pipeline and Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Initializes a pandas DataFrame and Series, then creates a Feature-engine Pipeline that drops rows with missing data, one-hot encodes the categorical variable, and trains a Lasso model. The pipeline is then fitted to the data.

```python
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OneHotEncoder
from feature_engine.pipeline import Pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = Pipeline(
    [
        ("drop", DropMissingData()),
        ("enc", OneHotEncoder()),
        ("lasso", Lasso(random_state=10)),
    ]
)
pipe.fit(X, y)
```

--------------------------------

### Install Feature-engine in Developer Mode

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Install Feature-engine and its development dependencies. The '-e' flag installs the package in editable mode, so code changes are reflected immediately without reinstallation. Include '.[docs,tests]' to install dependencies for documentation and testing.

```bash
cd feature_engine
pip install -e ".[docs,tests]"
```

--------------------------------

### Install Feature-engine developer dependencies

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Install Feature-engine and its testing dependencies, necessary for development and running tests.

```bash
pip install -e ".[tests]"
```

--------------------------------

### Forecasting Pipeline Setup

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/make_pipeline.md

Imports for setting up a direct forecasting pipeline using Feature-engine's time series forecasting transformers and scikit-learn models. Note that `root_mean_squared_error` requires scikit-learn 1.4 or later.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import Lasso
from sklearn.metrics import root_mean_squared_error
from sklearn.multioutput import MultiOutputRegressor

from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)
from feature_engine.pipeline import make_pipeline
```

--------------------------------

### Install Feature-engine using Conda

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Use this command to install the Feature-engine library from the conda-forge channel.

```bash
conda install -c conda-forge feature_engine
```

--------------------------------

### Set up a Pipeline with DropMissingData, OrdinalEncoder, and Lasso

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Instantiate a Pipeline with a list of named steps. This example chains a step that drops rows with missing data, ordinal encoding of the categorical variable, and a Lasso regression model, then fits the pipeline and obtains predictions.

```python
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import Pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = Pipeline(
    [
        ("drop", DropMissingData()),
        ("enc", OrdinalEncoder(encoding_method="arbitrary")),
        ("lasso", Lasso(random_state=10)),
    ]
)
# predict
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe
```

--------------------------------

### Discretizer Output Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/EqualWidthDiscretiser.md

Shows the learned interval limits for 'LotArea' and 'GrLivArea' after applying EqualWidthDiscretiser with 10 bins. The outermost limits are -inf and inf, so values smaller or larger than those seen during fit are still assigned to the first or last interval.

```python
{
 'LotArea': [-inf,
  22694.5,
  44089.0,
  65483.5,
  86878.0,
  108272.5,
  129667.0,
  151061.5,
  172456.0,
  193850.5,
  inf],
 'GrLivArea': [-inf,
  864.8,
  1395.6,
  1926.3999999999999,
  2457.2,
  2988.0,
  3518.7999999999997,
  4049.5999999999995,
  4580.4,
  5111.2,
  inf]
}
```
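The way such limits arise can be sketched in a few lines: split the observed range into equal-width intervals, then replace the two outer limits with -inf and inf. This is an illustrative sketch, not the library's code; the function name `equal_width_limits` and the 0 to 100 range are hypothetical.

```python
import math


def equal_width_limits(minimum, maximum, bins):
    """Return the limits of `bins` equal-width intervals.

    The outer limits are -inf and inf, so values outside
    [minimum, maximum] still fall into the first or last interval.
    """
    width = (maximum - minimum) / bins
    inner = [minimum + i * width for i in range(1, bins)]
    return [-math.inf] + inner + [math.inf]


# Hypothetical variable ranging from 0 to 100, split into 10 bins
limits = equal_width_limits(0, 100, bins=10)
print(limits)
```

Ten bins need eleven limits: nine inner cut points plus the two infinite ends, matching the shape of the output above.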

--------------------------------

### Performance Drifts Example Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

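Example output showing the change in linear regression r2 after shuffling each feature, measured as the baseline performance minus the performance after shuffling. Larger positive values mean that shuffling the feature degraded performance more, so the feature is more important; values near zero or below suggest the feature contributes little.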

```python
{
'age': -0.0054698043007869734,
'sex': 0.03325633986510784,
'bmi': 0.184158237207512,
'bp': 0.10089894421748086,
's1': 0.49324432634948095,
's2': 0.21163252880660438,
's3': 0.02006839198785859,
's4': 0.011098050006761673,
's5': 0.4828781996541602,
's6': 0.003963360084439538
}
```

--------------------------------

### Load and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/text/TextFeatures.md

Loads two categories ('sci.space' and 'rec.sport.hockey') from the training subset of the 20 newsgroups dataset and splits them into training and testing sets. Ensure pandas and scikit-learn are installed.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
import pandas as pd

data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.sport.hockey'])
df = pd.DataFrame({'text': data.data, 'target': data.target})
X_train, X_test, y_train, y_test = train_test_split(
    df[['text']], df['target'], test_size=0.3, random_state=42
)

print(X_train.head())
```

--------------------------------

### Install Feature-engine in developer mode

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Install Feature-engine in editable mode, allowing for direct code changes to be reflected without reinstallation.

```bash
pip install -e .
```

--------------------------------

### Example Features to Drop

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Example output listing features that were deemed non-important because their performance drift is smaller than the mean performance drift of all features.

```python
['age', 'sex', 'bp', 's3', 's4', 's6']
```
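The selection rule can be checked by hand against the performance drifts shown earlier. The sketch below assumes the threshold is the mean drift and that features whose drift falls below it are dropped; it uses only the standard library and does not call Feature-engine.

```python
from statistics import mean

# Performance drifts copied from the example output shown earlier
performance_drifts = {
    "age": -0.0054698043007869734,
    "sex": 0.03325633986510784,
    "bmi": 0.184158237207512,
    "bp": 0.10089894421748086,
    "s1": 0.49324432634948095,
    "s2": 0.21163252880660438,
    "s3": 0.02006839198785859,
    "s4": 0.011098050006761673,
    "s5": 0.4828781996541602,
    "s6": 0.003963360084439538,
}

# Assumption: the cut-off is the mean drift across all features
threshold = mean(list(performance_drifts.values()))

# Features whose drift is below the cut-off are candidates for removal
features_to_drop = [
    name for name, drift in performance_drifts.items() if drift < threshold
]
print(features_to_drop)
```

The mean drift is roughly 0.154, so 'bp' (about 0.101) falls below it and is dropped, while 'bmi' (about 0.184) is kept.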

--------------------------------

### Extract the first two steps of a Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Use slicing notation to extract a partial pipeline. This example retrieves the first two steps.

```python
pipe[:2]
```

--------------------------------

### Extract the first step of a Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Use slicing notation to extract a partial pipeline. This example retrieves only the first step.

```python
pipe[:1]
```

--------------------------------

### Import Libraries for Monotonic Features Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/DecisionTreeEncoder.md

Import necessary libraries including matplotlib, fetch_openml, train_test_split, and DecisionTreeEncoder for demonstrating monotonic features.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

from feature_engine.encoding import DecisionTreeEncoder
```

--------------------------------

### RandomSampleImputer with observation-specific seeding (example)

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/RandomSampleImputer.md

Illustrates how RandomSampleImputer can be configured for observation-specific seeding. The seed is derived from the sum of 'height' and 'weight' for each observation.

```python
RandomSampleImputer(
    random_state=['height', 'weight'],
    seed='observation',
    seeding_method='add',
)
```
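To make the idea of observation-specific seeding concrete, the sketch below derives a seed from the sum of two values and uses it to draw a replacement. It is a conceptual illustration built on the standard `random` module, not the library's implementation; `train_values` and `impute_one` are hypothetical names.

```python
import random

# Hypothetical values observed in the training set for the variable to impute
train_values = [3.1, 4.7, 5.2, 6.0, 7.4]


def impute_one(height, weight, pool):
    """Draw one replacement value using a seed derived from the observation."""
    seed = int(height + weight)  # mirrors the idea of seeding_method='add'
    rng = random.Random(seed)
    return rng.choice(pool)


# The same observation always produces the same seed, hence the same draw
first = impute_one(170, 65, train_values)
second = impute_one(170, 65, train_values)
print(first == second)
```

Because the seed depends only on the observation's own values, the imputed value is reproducible for that observation regardless of the order in which rows are processed.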

--------------------------------

### Navigate to the Project Directory

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Change your current directory to the cloned feature-engine-examples project.

```bash
cd feature-engine-examples
```

--------------------------------

### Load Diabetes Dataset and Display Head

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureElimination.md

Loads the diabetes dataset from Scikit-learn and displays the first few rows of the feature data. This is a common starting point for feature selection examples.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureElimination

# load dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

print(X.head())
```

--------------------------------

### Build documentation with Sphinx

Source: https://github.com/feature-engine/feature_engine/blob/main/README.md

Build the HTML version of the documentation using Sphinx. This command should be run from the root directory of the project.

```bash
sphinx-build -b html docs build
```

--------------------------------

### Transformed Data Example Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Example output of the DataFrame head after feature selection, showing only the remaining important features.

```python
        bmi        s1        s2        s5
0  0.061696 -0.044223 -0.034821  0.019907
1 -0.051474 -0.008449 -0.019163 -0.068332
2  0.044451 -0.045599 -0.034194  0.002861
3 -0.011595  0.012191  0.024991  0.022688
4 -0.036385  0.003935  0.015596 -0.031988
```

--------------------------------

### Load data and set up ArbitraryNumberImputer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/ArbitraryNumberImputer.md

This snippet demonstrates loading the house prices dataset, splitting it into training and testing sets, and initializing the ArbitraryNumberImputer to impute specified numerical variables with -999.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

from feature_engine.imputation import ArbitraryNumberImputer

# Load dataset
X, y = fetch_openml(
    name='house_prices',
    version=1,
    return_X_y=True,
    as_frame=True,
    parser='auto',
)

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0,
)
```

```python
# set up the imputer
arbitrary_imputer = ArbitraryNumberImputer(
    arbitrary_number=-999,
    variables=['LotFrontage', 'MasVnrArea'],
    )

# fit the imputer
arbitrary_imputer.fit(X_train)
```

--------------------------------

### Frequency Encoding Dictionary Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/CountFrequencyEncoder.md

Example output of the 'encoder_dict_' showing the mapping of categories to their frequencies for 'cabin', 'pclass', and 'embarked' variables.

```python
{
    'cabin': {'M': 0.7663755458515283,
   'C': 0.07751091703056769,
   'B': 0.04585152838427948,
   'E': 0.034934497816593885,
   'D': 0.034934497816593885,
   'A': 0.018558951965065504,
   'F': 0.016375545851528384,
   'G': 0.004366812227074236,
   'T': 0.001091703056768559},
'pclass': {3: 0.5436681222707423,
   1: 0.25109170305676853,
   2: 0.2052401746724891},
'embarked': {'S': 0.7117903930131004,
   'C': 0.19541484716157206,
   'Q': 0.0906113537117904,
   'Missing': 0.002183406113537118}
}
```
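A frequency mapping like the one above is the share of observations in each category. The following sketch computes it with `collections.Counter` on a small, made-up list of passenger classes; it illustrates the calculation only and does not use Feature-engine.

```python
from collections import Counter

# Hypothetical passenger classes for eight passengers
pclass = [3, 3, 1, 2, 3, 1, 3, 2]

# Count each category, then divide by the number of observations
counts = Counter(pclass)
frequencies = {category: count / len(pclass) for category, count in counts.items()}

# Replace each category with its frequency
encoded = [frequencies[category] for category in pclass]
print(frequencies)
print(encoded)
```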

--------------------------------

### Create Sample DataFrame

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/text/TextFeatures.md

Creates a pandas DataFrame with sample text data for demonstration purposes.

```python
import pandas as pd
from feature_engine.text import TextFeatures

# Create sample data
X = pd.DataFrame({
    'review': [
        'This product is AMAZING! Best purchase ever.',
        'Not great. Would not recommend.',
        'OK for the price. 3 out of 5 stars.',
        'TERRIBLE!!! DO NOT BUY!',
    ],
    'title': [
        'Great Product',
        'Disappointed',
        'Average',
        'Awful',
    ]
})

print(X)
```

--------------------------------

### Standard Deviation of Performance Drifts Example Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Example output showing the variability (standard deviation) of the change in r2 after shuffling each feature. Higher values suggest more inconsistent performance changes.

```python
{
'age': 0.012788500580799392,
'sex': 0.040792331972680645,
'bmi': 0.042212436355346106,
'bp': 0.05397012536801143,
's1': 0.35198797776358015,
's2': 0.167636042355086,
's3': 0.03455158514716544,
's4': 0.007755675852874145,
's5': 0.1449579162698361,
's6': 0.011193022434166025
}
```

--------------------------------

### Build Documentation with Sphinx

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Generate HTML documentation from the source files using Sphinx. The output will be stored in the 'build' folder.

```bash
$ sphinx-build -b html docs build
```

--------------------------------

### Navigate to Feature-engine directory

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Change your current directory to the root of the Feature-engine repository.

```bash
$ cd feature_engine
```

--------------------------------

### Create a Sample DataFrame

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_numerical_variables.md

Create a pandas DataFrame with various data types including numerical, categorical, and datetime.

```python
import pandas as pd
df = pd.DataFrame({
    "Name": ["tom", "nick", "krish", "jack"],
    "City": ["London", "Manchester", "Liverpool", "Bristol"],
    "Age": [20, 21, 19, 18],
    "Marks": [0.9, 0.8, 0.7, 0.6],
    "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
})

print(df.head())
```

--------------------------------

### Load Data and Prepare for Reciprocal Transformation

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/ReciprocalTransformer.md

Load the Ames house prices dataset, create a new variable 'sqrfootpercar', and split the data into training and testing sets. This example demonstrates data preparation before applying the transformation.

```python
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from feature_engine.transformation import ReciprocalTransformer

data = fetch_openml(name='house_prices', as_frame=True)
data = data.frame

data["sqrfootpercar"] = data['GarageArea'] / data['GarageCars']
data = data[~data["sqrfootpercar"].isna()]

y = data['SalePrice']
X = data[['GarageCars', 'GarageArea', "sqrfootpercar"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.head())
```

--------------------------------

### Example Transformed Data Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/DecisionTreeFeatures.md

This is an example output of the transformed data, showing the original features alongside the newly created features derived from decision tree splits. The new features are named 'tree(...)' indicating their origin.

```python
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
14740  4.1518      22.0  5.663073   1.075472      1551.0  4.180593
10101  5.7796      32.0  6.107226   0.927739      1296.0  3.020979
20566  4.3487      29.0  5.930712   1.026217      1554.0  2.910112
2670   2.4511      37.0  4.992958   1.316901       390.0  2.746479
15709  5.0049      25.0  4.319261   1.039578       649.0  1.712401

       tree(MedInc)  tree(HouseAge)  tree(AveRooms)  tree(AveBedrms)  ...  \
14740      2.204822        2.130618        2.001950         2.080254  ...
10101      2.975513        2.051980        2.001950         2.165554  ...
20566      2.204822        2.051980        2.001950         2.165554  ...
2670       1.416771        2.051980        1.802158         1.882763  ...
15709      2.420124        2.130618        1.802158         2.165554  ...

       tree(['HouseAge', 'AveRooms'])  tree(['HouseAge', 'AveBedrms'])  \
14740                        1.885406                         2.124812
10101                        1.885406                         2.124812
20566                        1.885406                         2.124812
2670                         1.797902                         1.836498
15709                        1.797902                         2.124812

       tree(['HouseAge', 'Population'])  tree(['HouseAge', 'AveOccup'])  \
14740                          2.004703                        1.437440
10101                          2.004703                        2.257968
20566                          2.004703                        2.257968
2670                           2.123579                        2.257968
15709                          2.123579                        2.603372

       tree(['AveRooms', 'AveBedrms'])  tree(['AveRooms', 'Population'])  \
14740                         2.099977                          1.878989
10101                         2.438937                          2.077321
20566                         2.099977                          1.878989
2670                          1.728401                          1.843904
15709                         1.821467                          1.843904

       tree(['AveRooms', 'AveOccup'])  tree(['AveBedrms', 'Population'])  \
14740                        1.719582                           2.056003
10101                        2.156884                           2.056003
20566                        2.156884                           2.056003
2670                         1.747990                           1.882763
15709                        2.783690                           2.221092

       tree(['AveBedrms', 'AveOccup'])  tree(['Population', 'AveOccup'])
14740                         1.400491                          1.484939
10101                         2.153210                          2.059187
20566                         2.153210                          2.059187
2670                          1.861020                          2.235743
15709                         2.727460                          2.747390

[5 rows x 27 columns]
```

--------------------------------

### Create Toy Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/scaling/MeanNormalizationScaler.md

Creates a sample pandas DataFrame for demonstrating the MeanNormalizationScaler. Includes numerical and non-numerical columns.

```python
import pandas as pd
from feature_engine.scaling import MeanNormalizationScaler

df = pd.DataFrame.from_dict(
    {
        "Name": ["tom", "nick", "krish", "jack"],
        "City": ["London", "Manchester", "Liverpool", "Bristol"],
        "Age": [20, 21, 19, 18],
        "Height": [1.80, 1.77, 1.90, 2.00],
        "Marks": [0.9, 0.8, 0.7, 0.6],
        "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
    })

print(df)
```
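For reference, mean normalization subtracts the mean and divides by the value range, (x - mean) / (max - min). The sketch below applies that formula by hand to the "Age" column of the toy dataset, using only the standard library rather than the transformer itself.

```python
from statistics import mean

# The "Age" column from the toy dataset above
age = [20, 21, 19, 18]

age_mean = mean(age)               # centre of the variable
age_range = max(age) - min(age)    # spread used for scaling

# Mean normalization: centre at zero, scale by the value range
age_scaled = [(x - age_mean) / age_range for x in age]
print(age_scaled)
```

The scaled values are centred on zero and span exactly one unit between the smallest and largest observation.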

--------------------------------

### Get Selected Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByTargetMeanPerformance.md

Use the get_feature_names_out method to retrieve the names of the features that were selected by the transformer.

```python
sel.get_feature_names_out()
```

--------------------------------

### Build Documentation with Sphinx

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_docs.md

Build the HTML version of the documentation using Sphinx. This command specifies the source directory for documentation files and the output directory for the generated HTML.

```bash
sphinx-build -b html docs build
```

--------------------------------

### Get Features to Drop

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureElimination.md

Retrieve the list of features that RecursiveFeatureElimination has identified for removal based on the specified threshold.

```python
# the features to remove
tr.features_to_drop_
```

--------------------------------

### Initialize and Fit Winsorizer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/Winsorizer.md

Initializes the Winsorizer to cap outliers in 'age' and 'fare' using the Gaussian method on the right tail. The 'fold' parameter is set to 3, indicating 3 standard deviations from the mean.

```python
from feature_engine.outliers import Winsorizer

capper = Winsorizer(capping_method='gaussian',
                    tail='right',
                    fold=3,
                    variables=['age', 'fare'])

capper.fit(X_train)
```
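The Gaussian right-tail limit is the mean plus `fold` standard deviations, learned on the training data and then applied to new data. The sketch below illustrates that calculation with the standard library, assuming the sample standard deviation is used; `right_tail_cap` and the age values are hypothetical.

```python
from statistics import mean, stdev


def right_tail_cap(values, fold=3):
    """Upper limit under the Gaussian rule: mean + fold * standard deviation."""
    return mean(values) + fold * stdev(values)  # stdev is the sample std


# Hypothetical training ages used to learn the limit
train_age = [20, 22, 24, 26, 28]
cap = right_tail_cap(train_age, fold=3)

# New data: values above the learned limit are replaced by the limit
new_age = [25, 40]
capped = [min(x, cap) for x in new_age]
print(cap, capped)
```

Only the right tail is capped here, so values below the mean are never modified.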

--------------------------------

### Example Model Accuracy Output

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/WoEEncoder.md

This snippet shows the expected output format for the model accuracy after training and prediction.

```python
Accuracy: 0.76
```

--------------------------------

### Pandas dropna Example

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/DropMissingData.md

Demonstrates the basic usage of the pandas `dropna` method to remove rows with NaN values.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(dict(
    x1=[np.nan, 1, 1, 0, np.nan],
    x2=["a", np.nan, "b", np.nan, "a"],
))

X.dropna(inplace=True)
print(X)
```

--------------------------------

### Set up Feature-engine Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Defines a pipeline including outlier trimming, one-hot encoding, scaling, and logistic regression.

```python
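from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from feature_engine.encoding import OneHotEncoder
from feature_engine.outliers import OutlierTrimmer
from feature_engine.pipeline import Pipeline

pipe = Pipeline(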
    [
        ("outliers", OutlierTrimmer(variables=["age", "fare"])),
        ("enc", OneHotEncoder()),
        ("scaler", StandardScaler()),
        ("logit", LogisticRegression(random_state=10)),
    ]
)
```

--------------------------------

### Get Feature Names After Transformation

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/datetime/DatetimeFeatures.md

Retrieves the names of the features generated by the DatetimeFeatures transformer after fitting and transforming the data.

```python
dtfs.get_feature_names_out()
```

--------------------------------

### Initialize Linear Regression Model

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByShuffling.md

Sets up a Linear Regression model from Scikit-learn. This model will be used by SelectByShuffling to evaluate feature importance.

```python
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
```

--------------------------------

### Extract the last step of a Pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/Pipeline.md

Use slicing notation to extract a partial pipeline. This example retrieves only the last step.

```python
pipe[-1:]
```

--------------------------------

### Initialize and Prepare for Performance-Based Selection

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SmartCorrelatedSelection.md

This snippet imports the libraries needed for performance-based feature selection with SmartCorrelatedSelection: pandas, make_classification, DecisionTreeClassifier, and SmartCorrelatedSelection. It contains only the imports; the toy dataset itself is not created here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from feature_engine.selection import SmartCorrelatedSelection
```

--------------------------------

### Create a Toy Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_all_variables.md

This code creates a sample pandas DataFrame with numerical, categorical, and datetime variables for demonstration purposes.

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=4,
    n_redundant=1,
    n_clusters_per_class=1,
    weights=[0.50],
    class_sep=2,
    random_state=1,
)

# transform arrays into pandas df and series
colnames = [f"num_var_{i+1}" for i in range(4)]
X = pd.DataFrame(X, columns=colnames)

X["cat_var1"] = ["Hello"] * 1000
X["cat_var2"] = ["Bye"] * 1000

X["date1"] = pd.date_range("2020-02-24", periods=1000, freq="min")
X["date2"] = pd.date_range("2021-09-29", periods=1000, freq="h")
X["date3"] = ["2020-02-24"] * 1000

print(X.head())
```

--------------------------------

### Initialize LogCpTransformer with user-defined constants for specific variables

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/LogCpTransformer.md

Initialize the LogCpTransformer by providing a dictionary to the 'C' parameter. Each key-value pair in the dictionary specifies a variable and the constant to be added to it before the logarithm is applied.

```python
from feature_engine.transformation import LogCpTransformer

tf = LogCpTransformer(C={"bmi": 2, "s3": 3, "s4": 4})
tf.fit(X_train)
```
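The transformation itself is log(x + C), with a separate constant per variable. The sketch below applies it to a single made-up row with the standard `math` module; it illustrates the arithmetic only and leaves variables without a constant untouched.

```python
import math

# Constants from the example above: one per variable
C = {"bmi": 2, "s3": 3, "s4": 4}

# Hypothetical single observation; "age" has no constant and is left as is
row = {"bmi": -0.05, "s3": 0.02, "s4": -0.01, "age": 0.04}

# Add the variable's constant, then take the natural logarithm
transformed = {
    name: math.log(value + C[name]) if name in C else value
    for name, value in row.items()
}
print(transformed)
```

Adding the constant shifts each variable into the positive range, which is what makes the logarithm defined for values such as -0.05.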

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/OutlierTrimmer.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure `feature_engine` is installed.

```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.outliers import OutlierTrimmer

X, y = load_titanic(
    return_X_y_frame=True,
    predictors_only=True,
    handle_missing=True,
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
```

--------------------------------

### Create a Feature Engineering Pipeline with WoEEncoder

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/WoEEncoder.md

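Sets up a pipeline that first discretizes numerical variables, then groups rare labels, and finally encodes all specified variables using the WoEEncoder. This demonstrates a sequential application of multiple feature engineering steps. Here `numerical_features` and `all` are assumed to be lists of variable names defined earlier in the guide; note that the name `all` shadows Python's built-in function.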

```python
pipe = Pipeline(
    [
        ("disc", EqualFrequencyDiscretiser(variables=numerical_features)),
        ("rare_label", RareLabelEncoder(tol=0.1, n_categories=2, variables=all, ignore_format=True)),
        ("woe", WoEEncoder(variables=all)),
    ])
```
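
The last step of the pipeline replaces each category with its weight of evidence. Below is a minimal pandas sketch of that calculation, assuming the usual definition (the log of the share of positives over the share of negatives falling in each category). The toy data and the column names `colour` and `target` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy binary-classification data (hypothetical).
df = pd.DataFrame(
    {
        "colour": ["a", "a", "a", "b", "b", "b", "b", "b"],
        "target": [1, 0, 1, 0, 0, 1, 0, 1],
    }
)

is_pos = df["target"] == 1

# Share of all positives / all negatives that fall in each category.
pos_share = df.loc[is_pos, "colour"].value_counts() / is_pos.sum()
neg_share = df.loc[~is_pos, "colour"].value_counts() / (~is_pos).sum()

# Weight of evidence per category.
woe = np.log(pos_share / neg_share)

# Encode the variable by mapping each category to its WoE value.
df["colour_woe"] = df["colour"].map(woe)
print(woe)
```

A category with no positives or no negatives would give a division by zero or a log of zero, which is one reason to group rare labels before this step.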

--------------------------------

### Create a Toy DataFrame

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/MathFeatures.md

Create a sample pandas DataFrame with string, integer, float, and datetime columns. It is used to illustrate the functionality of MathFeatures.

```python
import numpy as np
import pandas as pd
from feature_engine.creation import MathFeatures

df = pd.DataFrame.from_dict(
    {
        "Name": ["tom", "nick", "krish", "jack"],
        "City": ["London", "Manchester", "Liverpool", "Bristol"],
        "Age": [20, 21, 19, 18],
        "Marks": [0.9, 0.8, 0.7, 0.6],
        # "min" is the minute frequency alias; the older "T" alias is deprecated in recent pandas
        "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
    })

print(df)
```

--------------------------------

### Activate Conda Environment

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Activate the conda environment you created. This ensures that subsequent installations and commands are run within the isolated environment.

```bash
conda activate myenv
```

--------------------------------

### Prepare Test Set by Dropping Columns

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/preprocessing/MatchVariables.md

Demonstrates preparing a test set by dropping specific columns ('sex', 'age') to simulate missing features.

```python
# Let's drop some columns in the test set for the demo
test_t = test.drop(["sex", "age"], axis=1)

test_t.head()
```
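
To show why the mismatch matters, here is a plain-pandas sketch of realigning a test set to the training columns, which is roughly the task MatchVariables is meant to automate. The toy frames and the `extra` column are hypothetical, and `reindex` stands in for the transformer.

```python
import pandas as pd

# Toy training set (hypothetical values).
train_toy = pd.DataFrame(
    {"sex": ["male", "female"], "age": [30.0, 40.0], "fare": [7.5, 20.0]}
)

# Simulate a test set with missing and unexpected columns.
test_toy = train_toy.drop(["sex", "age"], axis=1)
test_toy["extra"] = 1

# Realign to the training columns: missing ones are added as NaN,
# columns not seen in training are dropped, and the order is restored.
test_aligned = test_toy.reindex(columns=train_toy.columns)

print(test_aligned)
```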

--------------------------------

### Find Numerical Variables

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_numerical_variables.md

Use `find_numerical_variables` to get a list of all numerical variable names from the DataFrame. This function requires the DataFrame as input.

```python
from feature_engine.variable_handling import find_numerical_variables

var_num = find_numerical_variables(df)

var_num
```
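
For comparison, a similar selection can be sketched with pandas alone using `select_dtypes`. This is an approximation of the idea rather than the function's implementation, and the toy DataFrame is hypothetical.

```python
import pandas as pd

df_toy = pd.DataFrame(
    {
        "Name": ["tom", "nick"],
        "Age": [20, 21],
        "Marks": [0.9, 0.8],
        "dob": pd.date_range("2020-02-24", periods=2, freq="D"),
    }
)

# Keep only integer and float columns; strings and datetimes are excluded.
num_cols = df_toy.select_dtypes(include="number").columns.tolist()

print(num_cols)  # ['Age', 'Marks']
```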

--------------------------------

### Load Libraries and Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/CyclicalFeatures.md

Imports necessary libraries and loads the Bike Sharing Demand dataset from OpenML.

```python
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import FunctionTransformer

from feature_engine.creation import CyclicalFeatures

df = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True).frame

print(df.head())
```

--------------------------------

### Getting Feature Names with ExpandingWindowFeatures

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/timeseries/forecasting/ExpandingWindowFeatures.md

Use the `get_feature_names_out()` method after fitting the transformer to retrieve the names of the original and newly created features.

```python
win_f = ExpandingWindowFeatures()

win_f.fit(X)

win_f.get_feature_names_out()
```
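
The kind of feature being named can be sketched with pandas alone: an expanding mean shifted one step forward, so each row only sees past values. The `<variable>_expanding_mean` column name and the toy series are assumptions made for this sketch.

```python
import pandas as pd

X_toy = pd.DataFrame(
    {"demand": [1.0, 3.0, 5.0, 7.0]},
    index=pd.date_range("2022-01-01", periods=4, freq="D"),
)

X_out = X_toy.copy()
# Expanding mean over all rows so far, shifted by one period to avoid
# using the current value when building the feature.
X_out["demand_expanding_mean"] = X_toy["demand"].expanding().mean().shift(1)

print(X_out.columns.tolist())
print(X_out)
```

The first row of the new column is NaN because no past values exist for it.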

--------------------------------

### Import Libraries and Load Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/LogCpTransformer.md

Imports necessary libraries and loads the California housing dataset for transformation.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

from feature_engine.transformation import LogCpTransformer

# Load dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```

--------------------------------

### Get Feature Names After Lagging

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/timeseries/forecasting/LagFeatures.md

Use the `get_feature_names_out()` method to retrieve the names of all features, including the newly created lag features.

```python
lag_f.get_feature_names_out()
```
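
A pandas-only sketch of how lag features and their names come about, assuming a `<variable>_lag_<n>` naming pattern. The toy data and the chosen lags are hypothetical.

```python
import pandas as pd

X_toy = pd.DataFrame(
    {"demand": [10, 20, 30, 40]},
    index=pd.date_range("2022-01-01", periods=4, freq="D"),
)

X_lagged = X_toy.copy()
for lag in [1, 2]:
    # Shift the series forward so each row holds the value from `lag` rows earlier.
    X_lagged[f"demand_lag_{lag}"] = X_toy["demand"].shift(lag)

feature_names = X_lagged.columns.tolist()
print(feature_names)  # ['demand', 'demand_lag_1', 'demand_lag_2']
```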

--------------------------------

### Load Data and Initialize LogTransformer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/transformation/LogTransformer.md

Imports necessary libraries, loads the Ames house prices dataset, splits it into training and testing sets, and initializes the LogTransformer for specific variables. The transformer checks for numerical variables during fit.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

from feature_engine.transformation import LogTransformer

data = fetch_openml(name='house_prices', as_frame=True)
data = data.frame

X = data.drop(['SalePrice', 'Id'], axis=1)
y = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.head())
```

```python
X_train[['LotArea', 'GrLivArea']].hist(figsize=(10,5))
plt.show()
```

```python
logt = LogTransformer(variables=['LotArea', 'GrLivArea'])

logt.fit(X_train)
```

--------------------------------

### Get Feature Names Out

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SmartCorrelatedSelection.md

Uses the `get_feature_names_out()` method, common to scikit-learn transformers, to retrieve the names of the features remaining in the transformed DataFrame.

```python
tr.get_feature_names_out()
```

--------------------------------

### Get supported features

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/SelectByInformationValue.md

Use the `get_support()` method to obtain a boolean list indicating which features are selected (`True`) or dropped (`False`).

```python
sel.get_support()
```
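
Conceptually, the mask lines up with the input features one to one. A small sketch in plain Python, with hypothetical feature names and a hypothetical list of dropped features:

```python
# Input features in their original order (hypothetical).
features = ["pclass", "sex", "age", "fare"]

# Features the selector decided to remove (hypothetical).
features_to_drop = ["age"]

# One boolean per input feature: True if kept, False if dropped.
support = [f not in features_to_drop for f in features]

# The mask can be used to recover the names of the selected features.
selected = [f for f, keep in zip(features, support) if keep]

print(support)   # [True, True, False, True]
print(selected)  # ['pclass', 'sex', 'fare']
```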

--------------------------------

### Load House Prices Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/MeanMedianImputer.md

Load the house prices dataset from OpenML for demonstration purposes.

```python
from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    name='house_prices',
    version=1,
    return_X_y=True,
    as_frame=True,
    parser='auto',
)
```

--------------------------------

### Get Performance Drifts

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureAddition.md

Retrieve the changes in model performance resulting from adding each feature. This helps in understanding the incremental value of each feature.

```python
# Get the performance drift of each feature
tr.performance_drifts_
```
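
To make the idea of a performance drift concrete, here is a simplified sketch of a recursive addition loop in plain Python. The `score` function is a hypothetical stand-in for cross-validated model performance, and the feature ranking and threshold are made up; it is not the library's implementation.

```python
# Hypothetical contribution of each feature to the model score.
CONTRIBUTIONS = {"a": 0.70, "b": 0.00, "c": 0.05, "d": 0.001}


def score(features):
    """Stand-in for training a model on `features` and returning its performance."""
    return sum(CONTRIBUTIONS[f] for f in features)


ranked = ["a", "b", "c", "d"]  # features ordered by importance (assumed)
threshold = 0.01

selected = [ranked[0]]  # start with the most important feature
baseline = score(selected)
drifts = {}

for feature in ranked[1:]:
    new_score = score(selected + [feature])
    drifts[feature] = new_score - baseline  # change caused by adding the feature
    if drifts[feature] > threshold:
        selected.append(feature)  # keep it and raise the baseline
        baseline = new_score

to_drop = [f for f in ranked if f not in selected]
print(drifts)
print(selected, to_drop)
```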

--------------------------------

### Initialize and Fit MeanMedianImputer

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/imputation/MeanMedianImputer.md

Initialize MeanMedianImputer with the 'mean' imputation method and specify the variables to impute. Then, fit the imputer using the training data.

```python
from feature_engine.imputation import MeanMedianImputer

# Set up the imputer
mmi = MeanMedianImputer(
    imputation_method='mean',
    variables=['LotFrontage', 'MasVnrArea'],
)

# Fit transformer with training data
mmi.fit(X_train)
```
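
The fit-then-transform logic can be sketched with pandas alone: learn the means on the training data, then reuse them to fill missing values in other data. The toy frames below are hypothetical; only the variable names come from the snippet above.

```python
import numpy as np
import pandas as pd

variables = ["LotFrontage", "MasVnrArea"]

train_toy = pd.DataFrame(
    {"LotFrontage": [60.0, np.nan, 80.0], "MasVnrArea": [0.0, 100.0, np.nan]}
)
test_toy = pd.DataFrame(
    {"LotFrontage": [np.nan, 70.0], "MasVnrArea": [np.nan, 50.0]}
)

# "Fit": learn one mean per variable from the training data only (NaN is skipped).
learned_means = train_toy[variables].mean().to_dict()

# "Transform": fill missing values using the values learned from the training data.
test_imputed = test_toy.fillna(value=learned_means)

print(learned_means)  # {'LotFrontage': 70.0, 'MasVnrArea': 50.0}
print(test_imputed)
```

Learning the statistics on the training set and applying them to the test set avoids leaking information from the test data.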

--------------------------------

### Setting up a Pipeline with make_pipeline

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/pipeline/make_pipeline.md

Use make_pipeline to create a pipeline that first drops missing data, then encodes categorical features ordinally, and finally fits a Lasso regression model. The pipeline automatically assigns names to each step.

```python
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import make_pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = make_pipeline(
    DropMissingData(),
    OrdinalEncoder(encoding_method="arbitrary"),
    Lasso(random_state=10),
)
# fit the pipeline, then predict
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe
```
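
Dropping rows from `X` means the target has to shrink with it before the model is fitted. A pandas-only sketch of that alignment step, reusing the same toy data (this illustrates the idea, not the pipeline's internals):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

# Drop every row that has at least one missing value.
X_clean = X.dropna()

# Keep only the target values whose rows survived, matched by index.
y_clean = y.loc[X_clean.index]

print(X_clean.index.tolist())  # [0, 2]
print(y_clean.tolist())        # [1, 3]
```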

--------------------------------

### Get Transformed Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/RelativeFeatures.md

Retrieves the names of all features in the DataFrame after the RelativeFeatures transformation has been applied. This is useful for understanding the output of the transformer.

```python
transformer.get_feature_names_out(input_features=None)
```

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/DropFeatures.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure you have the feature_engine library installed.

```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)
```

--------------------------------

### Verify Remotes

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Check that both your fork ('origin') and the main repository ('upstream') are correctly linked to your local copy.

```bash
$ git remote -v
```

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/DropConstantFeatures.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure you have feature_engine and scikit-learn installed.

```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropConstantFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)
```

--------------------------------

### Load Data and Split into Train/Test Sets

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/Winsorizer.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure Feature-engine and scikit-learn are installed.

```python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.outliers import Winsorizer

X, y = load_titanic(
    return_X_y_frame=True,
    predictors_only=True,
    handle_missing=True,
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
```

--------------------------------

### Load Wine Dataset and Libraries

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/creation/index.md

Imports necessary libraries and loads the wine dataset bundled with scikit-learn (`load_wine`). Displays the head of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_wine
from feature_engine.creation import RelativeFeatures, MathFeatures

X, y = load_wine(return_X_y=True, as_frame=True)
print(X.head())
```

--------------------------------

### Clone Feature-engine Repository

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_code.md

Clone your forked repository to your local machine to begin development.

```bash
$ git clone https://github.com/<YOURUSERNAME>/feature_engine
```

--------------------------------

### Load Titanic Dataset and Split Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/OrdinalEncoder.md

Loads the Titanic dataset and splits it into training and testing sets. Ensure Feature-engine and scikit-learn are installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import OrdinalEncoder

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
```

--------------------------------

### Upgrade Feature-Engine with Pip

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/quickstart/index.md

Upgrade an existing feature-engine installation to the latest version using pip. The `-U` (`--upgrade`) flag tells pip to install the newest available release over the current one.

```bash
pip install -U feature-engine
```

--------------------------------

### Instantiate and Transform Data

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/EqualFrequencyDiscretiser.md

Instantiates the EqualFrequencyDiscretiser with 5 quantiles (bins) and applies it to the created dataset to transform the features.

```python
# Instantiate discretizer
disc = EqualFrequencyDiscretiser(q=5)

# Transform simulated data
X_transformed = disc.fit_transform(X)
```
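
Equal-frequency binning can also be sketched with `pandas.qcut`, which is a convenient way to see what "5 quantiles" means in practice. The simulated variable and the column name `var` are hypothetical.

```python
import numpy as np
import pandas as pd

# Simulate 1000 continuous observations (hypothetical data).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"var": rng.normal(size=1000)})

# Split the variable into 5 bins with (approximately) the same number of rows each.
# labels=False returns the bin number (0 to 4) instead of the interval.
bins = pd.qcut(X_toy["var"], q=5, labels=False)

# Count the observations per bin.
counts = bins.value_counts().sort_index()
print(counts)
```

With continuous data and no ties, every bin ends up with the same number of observations, here 200 each.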

--------------------------------

### Stage and Commit Changes

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/contribute/contribute_jup.md

Add your notebook changes to the staging area and commit them with a meaningful message.

```bash
git add .
git commit -m "a meaningful commit message"
```

--------------------------------

### Find All Variables in a Dataset

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/variable_handling/find_all_variables.md

Use `find_all_variables` to get a list of all variable names in the DataFrame. This function is useful for quickly inspecting the columns of your dataset.

```python
from feature_engine.variable_handling import find_all_variables

vars_all = find_all_variables(X)

vars_all
```

--------------------------------

### Get Features to Drop

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/selection/RecursiveFeatureAddition.md

Access the list of features that RecursiveFeatureAddition will drop. These are the features whose addition did not improve model performance by more than the configured threshold.

```python
# the features to drop
tr.features_to_drop_
```

--------------------------------

### Display Training Data Head

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/encoding/OneHotEncoder.md

Output of printing the first 5 rows of the training data, showing its initial structure and content.

```python
      pclass     sex        age  sibsp  parch     fare cabin embarked
501        2  female  13.000000      0      1  19.5000     M        S
588        2  female   4.000000      1      1  23.0000     M        S
402        2  female  30.000000      1      0  13.8583     M        C
1193       3    male  29.881135      0      0   7.7250     M        Q
686        3  female  22.000000      0      0   7.7250     M        Q
```

--------------------------------

### Get Transformed Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/outliers/OutlierTrimmer.md

Retrieve the names of the features in the dataset after the outlier transformation has been applied. This is useful for subsequent data processing steps.

```python
ot.get_feature_names_out()
```

--------------------------------

### Get Output Feature Names

Source: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/datetime/DatetimeFeatures.md

Obtains the names of the features that will be present in the output DataFrame after the transformation, including the newly extracted datetime features.

```python
dfts.get_feature_names_out()
```
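
As an illustration of where such names come from, the sketch below extracts a few datetime components with the pandas `.dt` accessor. The `<variable>_<component>` naming and the choice of components are assumptions made for this example, and the toy data is hypothetical.

```python
import pandas as pd

X_toy = pd.DataFrame({"date": pd.date_range("2021-09-29", periods=3, freq="D")})

# Build one new column per extracted component; the original column is left out.
X_out = pd.DataFrame(
    {
        "date_month": X_toy["date"].dt.month,
        "date_year": X_toy["date"].dt.year,
        "date_day_of_week": X_toy["date"].dt.dayofweek,  # Monday is 0
    }
)

feature_names = X_out.columns.tolist()
print(feature_names)
print(X_out)
```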