### Install Numerai Dependencies

Source: https://github.com/numerai/example-scripts/blob/master/example_model.ipynb

Installs the Python libraries needed for Numerai: numerapi, pandas, pyarrow, matplotlib, lightgbm, scikit-learn, scipy, and cloudpickle. The '-q' flag reduces pip's output, and '--upgrade' updates the unpinned packages to their latest versions; cloudpickle is pinned to 3.1.1.

```python
# Install dependencies
!pip install -q --upgrade numerapi pandas pyarrow matplotlib lightgbm scikit-learn scipy cloudpickle==3.1.1
```

--------------------------------

### Install Project Dependencies

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Installs the Python libraries used for Numerai data analysis and model training: numerapi, pandas, pyarrow, matplotlib, lightgbm, scikit-learn, scipy, cloudpickle (pinned to 3.1.1), and numerai-tools (installed with '--no-deps'). It also enables inline matplotlib plots.

```python
# Install dependencies
!pip install -q --upgrade numerapi pandas pyarrow matplotlib lightgbm scikit-learn scipy cloudpickle==3.1.1
!pip install -q --no-deps numerai-tools

# Inline plots
%matplotlib inline
```

--------------------------------

### Check Python Version

Source: https://github.com/numerai/example-scripts/blob/master/example_model.ipynb

Prints the installed Python version with the `!python --version` shell escape, which runs a command-line program from a notebook cell.

```python
!python --version
```

--------------------------------

### Download and Inspect Feature Metadata

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Downloads the 'features.json' file which contains metadata about the dataset's features and targets. It then parses this JSON to display the number of feature sets and targets available.

```python
import json

# download the feature metadata file
napi.download_dataset(f"{DATA_VERSION}/features.json")

# read the metadata and display
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
for metadata in feature_metadata:
  print(metadata, len(feature_metadata[metadata]))
```

--------------------------------

### Check Python Version

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Displays the installed Python version. Useful for ensuring compatibility with project dependencies.

```python
import sys

print(f"Python Version: {sys.version}")
```

--------------------------------

### List Numerai Datasets

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Initializes the NumerAPI client to interact with the Numerai platform. It then lists all available datasets and their versions, focusing on a specific version (e.g., 'v5.0') and printing the files within that version.

```python
# Initialize NumerAPI - the official Python API client for Numerai
from numerapi import NumerAPI
napi = NumerAPI()

# list the datasets and available versions
all_datasets = napi.list_datasets()
dataset_versions = list(set(d.split('/')[0] for d in all_datasets))
print("Available versions:", dataset_versions)

# Set data version to one of the latest datasets
DATA_VERSION = "v5.0"

# Print all files available for download for our version
current_version_files = [f for f in all_datasets if f.startswith(DATA_VERSION)]
print("Available", DATA_VERSION, "files:", current_version_files)
```
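
The version parsing and filtering above can be tried offline. The following is a minimal sketch that assumes a small hard-coded list in place of the real `napi.list_datasets()` result; the file names are illustrative only. It also matches on the prefix `"v5.0/"` rather than `"v5.0"`, which is a small variation on the snippet above.

```python
# Hypothetical listing standing in for the result of napi.list_datasets()
all_datasets = [
    "v4.3/meta_model.parquet",
    "v5.0/features.json",
    "v5.0/live.parquet",
    "v5.0/train.parquet",
    "v5.0/validation.parquet",
]

# Everything before the first "/" is the version prefix
dataset_versions = sorted({path.split("/")[0] for path in all_datasets})

DATA_VERSION = "v5.0"
# Including the trailing "/" keeps a longer prefix (e.g. a hypothetical "v5.0.1") out
current_version_files = [
    path for path in all_datasets if path.startswith(f"{DATA_VERSION}/")
]

print("Available versions:", dataset_versions)
print("Available", DATA_VERSION, "files:", current_version_files)
```

Sorting the set gives a stable order, whereas `list(set(...))` may print the versions in a different order from run to run.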

--------------------------------

### Download and Load Numerai Benchmark Models

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Downloads the validation benchmark models dataset from Numerai and loads it into a pandas DataFrame. This is a prerequisite for further analysis.

```python
napi.download_dataset(f"{DATA_VERSION}/validation_benchmark_models.parquet")
benchmark_models = pd.read_parquet(
    f"{DATA_VERSION}/validation_benchmark_models.parquet"
)
benchmark_models
```

--------------------------------

### Import Matplotlib for Plotting

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Imports the matplotlib.pyplot module, commonly aliased as plt, for creating visualizations.

```python
import matplotlib.pyplot as plt
```

--------------------------------

### Quick Test of Ensemble Prediction

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Performs a quick test of the `predict_ensemble` function by downloading live data and generating predictions. This serves as a basic validation step before deployment.

```python
# Quick test
napi.download_dataset(f"{DATA_VERSION}/live.parquet")
live_features = pd.read_parquet(f"{DATA_VERSION}/live.parquet", columns=feature_cols)
predict_ensemble(live_features, benchmark_models)
```

--------------------------------

### Download and Load Meta Model Data

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script shows how to download the meta-model data for a specific round and load it into a pandas DataFrame. This data is used for calculating Meta Model Contribution (MMC).

```python
# import the 2 scoring functions
from numerai_tools.scoring import numerai_corr, correlation_contribution

# Download and join in the meta_model for the validation eras
napi.download_dataset(f"v4.3/meta_model.parquet", round_num=842)
validation["meta_model"] = pd.read_parquet(
    f"v4.3/meta_model.parquet"
)["numerai_meta_model"]
```

--------------------------------

### Download Pickle File in Google Colab

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script provides a conditional way to download the generated pickle file ('hello_numerai.pkl') if the code is being run within a Google Colab environment. It uses a try-except block to handle potential errors if not in Colab.

```python
# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('hello_numerai.pkl')
except:
    pass
```

--------------------------------

### Display Training Data Sample (Python)

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This snippet shows a sample of the training data, illustrating the structure and content of stock data used for model training. It includes stock IDs, eras, target values, and numerous features.

```python
train
```

--------------------------------

### Display Feature Set Sizes

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Iterates through predefined feature set names ('small', 'medium', 'all') using the loaded feature metadata and prints the number of features in each set. This helps in understanding the scale of different feature subsets.

```python
feature_sets = feature_metadata["feature_sets"]
for feature_set in ["small", "medium", "all"]:
  print(feature_set, len(feature_sets[feature_set]))
```
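
To see the lookup without downloading anything, here is a small sketch that assumes a heavily shortened, made-up metadata dictionary. The feature names and sizes are hypothetical; only the `"feature_sets"` and `"targets"` keys come from the snippets in this document.

```python
# Hypothetical, heavily shortened stand-in for the parsed features.json
feature_metadata = {
    "feature_sets": {
        "small": ["feature_a", "feature_b"],
        "medium": ["feature_a", "feature_b", "feature_c"],
        "all": ["feature_a", "feature_b", "feature_c", "feature_d"],
    },
    "targets": ["target", "target_cyrusd_20"],
}

feature_sets = feature_metadata["feature_sets"]

# Number of features in each named set
sizes = {name: len(feature_sets[name]) for name in ["small", "medium", "all"]}
print(sizes)

# Column list in the shape later passed to pd.read_parquet(..., columns=...)
columns = ["era", "target"] + feature_sets["small"]
print(columns)
```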

--------------------------------

### Download Meta-Model and Calculate Correlation Contribution

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

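Imports `correlation_contribution` from `numerai_tools.scoring`, downloads the meta-model dataset for a specific round (842), and joins its `numerai_meta_model` column onto the validation DataFrame so that correlation contribution can be computed afterwards. This requires the pandas library and the numerai_tools package.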

```python
from numerai_tools.scoring import correlation_contribution

# Download and join in the meta_model for the validation eras
napi.download_dataset(f"v4.3/meta_model.parquet", round_num=842)
validation["meta_model"] = pd.read_parquet(
    f"v4.3/meta_model.parquet"
)["numerai_meta_model"]
```

--------------------------------

### Download & Prepare Validation Data and Predict

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Downloads the validation dataset, preprocesses it, and generates predictions using a pre-trained LightGBM model. The script uses pandas to load and filter the data, and the Numerai API to download the dataset.

```python
# Download validation data
napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

# Load the validation data, filtering for data_type == "validation"
validation = pd.read_parquet(
    f"{DATA_VERSION}/validation.parquet",
    columns=["era", "data_type", "target"] + small_features
)
validation = validation[validation["data_type"] == "validation"]
del validation["data_type"]

# Downsample every 4th era to reduce memory usage and speedup validation (suggested for Colab free tier)
# Comment out the line below to use all the data
validation = validation[validation["era"].isin(validation["era"].unique()[::4])]

# Embargo overlapping eras from training data
last_train_era = int(train["era"].unique()[-1])
eras_to_embargo = [str(era).zfill(4) for era in [last_train_era + i for i in range(4)]]
validation = validation[~validation["era"].isin(eras_to_embargo)]

# Generate predictions against the small feature set of the validation data
validation["prediction"] = model.predict(validation[small_features])
```

--------------------------------

### Get summary metrics for correlations

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Computes and displays summary metrics (mean, std, sharpe, max_drawdown) for a given set of correlations. It utilizes a get_summary_metrics function and formats the display of floating-point numbers. This is useful for comparing different prediction sets.

```python
summary_metrics = get_summary_metrics(correlations, cumsum_corrs)
pd.set_option('display.float_format', lambda x: '%f' % x)
summary = pd.DataFrame(summary_metrics)
summary
```

--------------------------------

### Generate and Format Live Predictions in Python

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script downloads the latest live features, loads them into a pandas DataFrame, generates predictions using a pre-trained model, and formats the output for submission. It requires the 'numerapi' and 'pandas' libraries.

```python
# Download latest live features
napi.download_dataset(f"{DATA_VERSION}/live.parquet")

# Load live features
live_features = pd.read_parquet(f"{DATA_VERSION}/live.parquet", columns=feature_set)

# Generate live predictions
live_predictions = model.predict(live_features[feature_set])

# Format submission
pd.Series(live_predictions, index=live_features.index).to_frame("prediction")
```

--------------------------------

### Download Serialized Model File

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Provides code to download the serialized model file ('target_ensemble.pkl') when running in a Google Colab environment. It uses a try-except block to handle cases where the code is not run in Colab.

```python
# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('target_ensemble.pkl')
except:
    pass
```

--------------------------------

### Plot Performance Metrics by Feature

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Generates bar charts for various performance metrics of features, sorted by their mean performance. Each metric is drawn in its own subplot on a 2x3 grid; the x-axis is not shared between subplots (`sharex=False`) and the x-axis ticks are hidden.

```python
# plot the performance metrics of the features as bar charts sorted by mean
feature_metrics.sort_values("mean", ascending=False).plot.bar(
    title="Performance Metrics of Features Sorted by Mean",
    subplots=True,
    figsize=(15, 6),
    layout=(2, 3),
    sharex=False,
    xticks=[],
    snap=False
)
```

--------------------------------

### Initialize NumerAPI Client and Download Metadata (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Initializes the NumerAPI client to interact with Numerai's API and downloads the 'features.json' metadata file. This file contains information about feature sets and groups, crucial for understanding and manipulating features.

```python
import json
import pandas as pd
from numerapi import NumerAPI

# initialize our API client
napi = NumerAPI()

# Set data version to one of the latest datasets
DATA_VERSION = "v5.0"

napi.download_dataset(f"{DATA_VERSION}/features.json")
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
feature_sets = feature_metadata["feature_sets"]
```

--------------------------------

### Quick Test of Feature Neutralization (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script performs a quick test of the feature neutralization process. It downloads the live dataset, loads the relevant features, and then calls the `predict_neutral` function to generate predictions.

```python
# Quick test
napi.download_dataset(f"{DATA_VERSION}/live.parquet")
live_features = pd.read_parquet(f"{DATA_VERSION}/live.parquet", columns=small_features)
predict_neutral(live_features)
```

--------------------------------

### Compute and Process Feature Performance Metrics

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Computes performance metrics for each specified feature using the 'metrics' function, converts the results into a pandas DataFrame, and sorts them by mean performance.

```python
# compute performance metrics for each feature
feature_metrics = [
    metrics(per_era_corr[feature_name])
    for feature_name in sm_serenity_feats
]

# convert to numeric DataFrame and sort
feature_metrics = (
    pd.DataFrame(feature_metrics, index=sm_serenity_feats)
    .apply(pd.to_numeric)
    .sort_values("mean", ascending=False)
)

feature_metrics
```

--------------------------------

### Calculate Per-Era Correlation

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This code calculates the per-era correlation between the model's predictions and the target values. It uses the `numerai_corr` function from the `numerai_tools.scoring` library.

```python
# Compute the per-era corr between our predictions and the target values
per_era_corr = validation.groupby("era").apply(
    lambda x: numerai_corr(x[["prediction"]].dropna(), x["target"].dropna())
)
```
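
The per-era grouping can be illustrated without `numerai_tools`. The sketch below swaps in a plain Spearman rank correlation (Pearson correlation of ranks) as a stand-in for `numerai_corr`, which it does not reproduce. The data is made up, so the numbers only show the mechanics of scoring each era separately.

```python
import pandas as pd

# Made-up validation frame with two eras of four rows each
validation = pd.DataFrame(
    {
        "era": ["0001"] * 4 + ["0002"] * 4,
        "prediction": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4],
        "target": [0.0, 0.25, 0.5, 1.0, 1.0, 0.75, 0.25, 0.0],
    }
)


def rank_corr(era_df: pd.DataFrame) -> float:
    # Pearson correlation of ranks equals the Spearman correlation
    return era_df["prediction"].rank().corr(era_df["target"].rank())


# One score per era: predictions are only compared within the same era
per_era_corr = validation.groupby("era")[["prediction", "target"]].apply(rank_corr)
print(per_era_corr)
```

In this toy data the first era is ordered perfectly and the second is ordered exactly backwards, so the two scores have opposite signs.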

--------------------------------

### Download and Load Numerai Data with LightGBM

Source: https://github.com/numerai/example-scripts/blob/master/example_model.ipynb

Downloads training data and feature metadata from Numerai using the numerapi library. It then loads the data into a pandas DataFrame, selects features, and trains a LightGBM regressor model. The script includes options for different feature sets and downsampling for faster processing.

```python
from numerapi import NumerAPI
import pandas as pd
import json
napi = NumerAPI()

# use one of the latest data versions
DATA_VERSION = "v5.0"

# Download data
napi.download_dataset(f"{DATA_VERSION}/train.parquet")
napi.download_dataset(f"{DATA_VERSION}/features.json")

# Load data
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
features = feature_metadata["feature_sets"]["small"]
# use "medium" or "all" for better performance. Requires more RAM.
# features = feature_metadata["feature_sets"]["medium"]
# features = feature_metadata["feature_sets"]["all"]
train = pd.read_parquet(f"{DATA_VERSION}/train.parquet", columns=["era"]+features+["target"])

# For better models, join train and validation data and train on all of it.
# This would cause diagnostics to be misleading though.
# napi.download_dataset(f"{DATA_VERSION}/validation.parquet")
# validation = pd.read_parquet(f"{DATA_VERSION}/validation.parquet", columns=["era", "data_type"]+features+["target"])
# validation = validation[validation["data_type"] == "validation"].drop(columns="data_type") # drop rows which don't have targets yet
# train = pd.concat([train, validation])

# Downsample for speed
train = train[train["era"].isin(train["era"].unique()[::4])]  # skip this step for better performance

# Train model
import lightgbm as lgb
model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2**5-1,
    colsample_bytree=0.1
)
# We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
# model = lgb.LGBMRegressor(
#     n_estimators=30_000,
#     learning_rate=0.001,
#     max_depth=10,
#     num_leaves=2**10,
#     colsample_bytree=0.1,
#     min_data_in_leaf=10000,
# )
model.fit(
    train[features],
    train["target"]
)

# Define predict function
def predict(
    live_features: pd.DataFrame,
    _live_benchmark_models: pd.DataFrame
 ) -> pd.DataFrame:
    live_predictions = model.predict(live_features[features])
    submission = pd.Series(live_predictions, index=live_features.index)
    return submission.to_frame("prediction")

# Pickle predict function
import cloudpickle
p = cloudpickle.dumps(predict)
with open("example_model.pkl", "wb") as f:
    f.write(p)

# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('example_model.pkl')
except:
    pass

```

--------------------------------

### Calculate Correlation with Main Target

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates and displays the correlation of each auxiliary target column with the main target column. The results are sorted in descending order of correlation.

```python
(
    targets_df[target_cols]
    .corrwith(targets_df[MAIN_TARGET])
    .sort_values(ascending=False)
    .to_frame("corr_with_cyrus_v4_20")
)
```

--------------------------------

### Set Pandas Display Format

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Configures pandas to display floating-point numbers in fixed-point notation with six decimal places (the `'%f'` format) instead of scientific notation. This makes small values, such as per-era correlations, easier to compare.

```python
pd.set_option('display.float_format', lambda x: '%f' % x)
```
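
The formatter passed to `display.float_format` is an ordinary function from float to string, so its effect can be checked on its own. A minimal sketch, using a named function in place of the lambda:

```python
def fixed_point(x: float) -> str:
    # Same formatting rule as the lambda above: fixed-point, six decimals
    return '%f' % x


small = fixed_point(1.5e-07)        # too small to show at six decimals
rounded = fixed_point(0.123456789)  # rounded to six decimals
large = fixed_point(1234.5)         # padded with trailing zeros

print(small, rounded, large)
```

Values smaller than about 5e-07 in magnitude are shown as zero under this format, which is worth keeping in mind when reading very small metrics.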

--------------------------------

### Define Prediction Pipeline Function in Python

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Defines a prediction pipeline function that takes live features and benchmark models as input and returns a DataFrame with predictions. This function serves as the core logic for model submissions.

```python
# Define your prediction pipeline as a function
def predict(live_features: pd.DataFrame, _live_benchmark_models: pd.DataFrame) -> pd.DataFrame:
    live_predictions = model.predict(live_features[feature_set])
    submission = pd.Series(live_predictions, index=live_features.index)
    return submission.to_frame("prediction")
```
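
The shape of the returned DataFrame can be checked with a toy model. The sketch below assumes a hypothetical `MeanModel` class and made-up feature names in place of the trained LightGBM model and the real feature set. It only illustrates that the output has a single `prediction` column and keeps the index of the live features.

```python
import numpy as np
import pandas as pd


class MeanModel:
    """Hypothetical stand-in for a trained model: predicts the row-wise feature mean."""

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        return X.mean(axis=1).to_numpy()


model = MeanModel()
feature_set = ["feature_a", "feature_b"]  # made-up feature names


def predict(live_features: pd.DataFrame, _live_benchmark_models: pd.DataFrame) -> pd.DataFrame:
    live_predictions = model.predict(live_features[feature_set])
    submission = pd.Series(live_predictions, index=live_features.index)
    return submission.to_frame("prediction")


live_features = pd.DataFrame(
    {"feature_a": [0.0, 0.5, 1.0], "feature_b": [1.0, 0.5, 0.0], "unused": [9, 9, 9]},
    index=pd.Index(["id_1", "id_2", "id_3"], name="id"),
)

# The second argument is unused by this pipeline, so an empty frame is enough here
submission = predict(live_features, pd.DataFrame())
print(submission)
```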

--------------------------------

### Serialize Prediction Function with Cloudpickle

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Serializes the `predict_ensemble` function and its dependencies using the `cloudpickle` library. The serialized object is then saved to a file named 'target_ensemble.pkl' in binary write mode.

```python
# Use the cloudpickle library to serialize your function and its dependencies
import cloudpickle
p = cloudpickle.dumps(predict_ensemble)
with open("target_ensemble.pkl", "wb") as f:
    f.write(p)
```

--------------------------------

### Calculate Per-Era Meta Model Contribution (MMC)

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script calculates the per-era Meta Model Contribution (MMC) by comparing the model's predictions against the meta model and the target values. It utilizes the `correlation_contribution` function.

```python
# Compute the per-era mmc between our predictions, the meta model, and the target values
per_era_mmc = validation.dropna().groupby("era").apply(
    lambda x: correlation_contribution(x[["prediction"]], x["meta_model"], x["target"])
)
```

--------------------------------

### Calculate and Plot Cumulative Per-Era Correlation

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Calculates the cumulative sum of per-era correlations and plots it. It also flips the sign of per-era correlation based on its mean, focusing on magnitude.

```python
import numpy as np

# Flip sign for negative mean correlation since we only care about magnitude
per_era_corr *= np.sign(per_era_corr.mean())

# Plot the per-era correlations
per_era_corr.cumsum().plot(
    title="Cumulative Absolute Value CORR of Features and the Target",
    figsize=(15, 5),
    legend=False,
    xlabel="Era"
  )
```
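
The sign flip can be seen on a tiny example. This sketch uses made-up per-era correlations for two hypothetical features and builds a new frame instead of modifying the original in place.

```python
import numpy as np
import pandas as pd

# Made-up per-era correlations: rows are eras, columns are features
per_era_corr = pd.DataFrame(
    {"feature_a": [0.02, 0.01, 0.03], "feature_b": [-0.02, 0.01, -0.04]},
    index=["0001", "0002", "0003"],
)

# +1 for columns with a positive mean, -1 for columns with a negative mean
signs = np.sign(per_era_corr.mean())

# Multiplying a DataFrame by a Series aligns the Series with the columns
flipped = per_era_corr * signs
cumulative = flipped.cumsum()
print(cumulative)
```

After the flip every column has a non-negative mean, so the cumulative curves can be compared by magnitude. A column whose mean is exactly zero would be multiplied by zero, since `np.sign(0)` is 0.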

--------------------------------

### Serialize Prediction Function using Cloudpickle in Python

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This code snippet serializes a Python prediction function using the `cloudpickle` library, creating a pickle file ready for upload to Numerai. It demonstrates how to save the serialized function to a file.

```python
# Use the cloudpickle library to serialize your function
import cloudpickle
p = cloudpickle.dumps(predict)
with open("hello_numerai.pkl", "wb") as f:
    f.write(p)
```

--------------------------------

### Plot Cumulative Correlations (Pandas/Matplotlib)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Plots the cumulative correlations of neutralized predictions using Pandas and Matplotlib. It sets a title and figure size and hides the x-axis ticks.

```python
pd.DataFrame(cumulative_correlations).plot(
    title="Cumulative Correlation of Neutralized Predictions",
    figsize=(10, 6),
    xticks=[]
)
```

--------------------------------

### Generate Predictions and Embargo Eras

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script demonstrates how to embargo future eras from the validation set to prevent data leakage. It then generates predictions using a pre-trained model on the filtered validation data.

```python
# Eras are 1 week apart, but targets look 20 days (or 4 weeks/eras) into the future,
# so we need to "embargo" the first 4 eras following our last train era to avoid "data leakage"
last_train_era = int(train["era"].unique()[-1])
eras_to_embargo = [str(era).zfill(4) for era in [last_train_era + i for i in range(4)]]
validation = validation[~validation["era"].isin(eras_to_embargo)]

# Generate predictions against the out-of-sample validation features
# This will take a few minutes 
validation["prediction"] = model.predict(validation[feature_set])
validation[["era", "prediction", "target"]]
```
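
The embargo list can be inspected with plain Python. The sketch below assumes made-up era labels and shows exactly which eras the expression produces.

```python
# Made-up era labels; real eras are zero-padded strings like these
train_eras = ["0571", "0572", "0573", "0574"]
validation_eras = ["0575", "0576", "0577", "0578", "0579"]

last_train_era = int(train_eras[-1])

# Same expression as in the snippet above
eras_to_embargo = [
    str(era).zfill(4) for era in [last_train_era + i for i in range(4)]
]

# Keep only validation eras outside the embargo window
kept = [era for era in validation_eras if era not in eras_to_embargo]

print(eras_to_embargo)
print(kept)
```

Note that `range(4)` starts at offset 0, so the list begins with the last training era itself. In this toy example that means three validation eras are dropped.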

--------------------------------

### Train LGBMRegressor Model

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Trains a LightGBM Regressor model on the provided training data. The model is configured with specific hyperparameters like `n_estimators`, `learning_rate`, `max_depth`, `num_leaves`, and `colsample_bytree`. This script imports the `lightgbm` library and fits the model using the features and the target variable from the training dataset. It includes comments linking to the LightGBM documentation for parameters.

```python
# https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html
import lightgbm as lgb

# https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
model = lgb.LGBMRegressor(
  n_estimators=2000,
  learning_rate=0.01,
  max_depth=5,
  num_leaves=2**5-1,
  colsample_bytree=0.1
)
# We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
# model = lgb.LGBMRegressor(
#     n_estimators=30_000,
#     learning_rate=0.001,
#     max_depth=10,
#     num_leaves=2**10,
#     colsample_bytree=0.1,
#     min_data_in_leaf=10000,
# )

# This will take a few minutes 🍵
model.fit(
  train[feature_set],
  train["target"]
)
```

--------------------------------

### Download Serialized Model in Google Colab (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script is designed to run in Google Colab and attempts to download the serialized model file (`feature_neutralization.pkl`) using the `google.colab.files` module. If not in a Colab environment, it gracefully skips the download.

```python
# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('feature_neutralization.pkl')
except:
    pass
```

--------------------------------

### Compute Performance Metrics (Python)

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Computes and displays key performance metrics for the model over the validation period, including mean correlation, standard deviation, Sharpe ratio, and maximum drawdown for both CORR and MMC. These metrics are presented in a pandas DataFrame for easy comparison.

```python
# Compute performance metrics
corr_mean = per_era_corr.mean()
corr_std = per_era_corr.std(ddof=0)
corr_sharpe = corr_mean / corr_std
corr_max_drawdown = (per_era_corr.cumsum().expanding(min_periods=1).max() - per_era_corr.cumsum()).max()

mmc_mean = per_era_mmc.mean()
mmc_std = per_era_mmc.std(ddof=0)
mmc_sharpe = mmc_mean / mmc_std
mmc_max_drawdown = (per_era_mmc.cumsum().expanding(min_periods=1).max() - per_era_mmc.cumsum()).max()

pd.DataFrame({
    "mean": [corr_mean, mmc_mean],
    "std": [corr_std, mmc_std],
    "sharpe": [corr_sharpe, mmc_sharpe],
    "max_drawdown": [corr_max_drawdown, mmc_max_drawdown]
}, index=["CORR", "MMC"]).T
```
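
The same four metrics can be traced by hand on a short series. This sketch uses five made-up per-era scores, so the intermediate values are easy to verify.

```python
import pandas as pd

# Made-up per-era scores
per_era = pd.Series(
    [0.02, -0.01, 0.03, -0.04, 0.01],
    index=["0001", "0002", "0003", "0004", "0005"],
)

mean = per_era.mean()
std = per_era.std(ddof=0)  # population standard deviation, as in the snippet above
sharpe = mean / std

# Max drawdown: largest drop of the cumulative curve below its running peak
cumulative = per_era.cumsum()
running_peak = cumulative.expanding(min_periods=1).max()
max_drawdown = (running_peak - cumulative).max()

print(pd.Series({"mean": mean, "std": std, "sharpe": sharpe, "max_drawdown": max_drawdown}))
```

Here the cumulative curve peaks at 0.04 after the third era and falls to 0.00 after the fourth, so the maximum drawdown is 0.04.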

--------------------------------

### Load and Prepare Numerai Data

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Loads training data and feature metadata from Numerai. It specifies data version, main target, candidate targets, and feature sets. The data is then filtered to include only specific eras for reduced memory usage.

```python
import pandas as pd
import json
from numerapi import NumerAPI

# Set the data version to one of the most recent versions
DATA_VERSION = "v5.0"
MAIN_TARGET = "target_cyrusd_20"
TARGET_CANDIDATES = [
  MAIN_TARGET,
  "target_victor_20",
  "target_xerxes_20",
  "target_teager2b_20"
]
FAVORITE_MODEL = "v5_lgbm_ct_blend"

# Download data
napi = NumerAPI()
napi.download_dataset(f"{DATA_VERSION}/train.parquet")
napi.download_dataset(f"{DATA_VERSION}/features.json")

# Load data
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
feature_cols = feature_metadata["feature_sets"]["small"]
# use "medium" or "all" for better performance. Requires more RAM.
# feature_cols = feature_metadata["feature_sets"]["medium"]
# feature_cols = feature_metadata["feature_sets"]["all"]
target_cols = feature_metadata["targets"]
train = pd.read_parquet(
    f"{DATA_VERSION}/train.parquet",
    columns=["era"] + feature_cols + target_cols
)

# Downsample to every 4th era to reduce memory usage and speedup model training (suggested for Colab free tier)
# Comment out the line below to use all the data (higher memory usage, slower model training, potentially better performance)
train = train[train["era"].isin(train["era"].unique()[::4])]

```

--------------------------------

### Build and Predict with Feature-Neutral Model (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script defines a function `predict_neutral` that takes live features, predicts using the `small_features` set, and then neutralizes these predictions against a subset of those features (`sm_serenity_feats`). It returns the neutralized predictions as percentile ranks.

```python
from numerai_tools.scoring import neutralize
import pandas as pd
def predict_neutral(live_features: pd.DataFrame, _live_benchmark_models: pd.DataFrame = None) -> pd.DataFrame:
    # make predictions using the small feature set
    predictions = pd.DataFrame(
        model.predict(live_features[small_features]),
        index=live_features.index,
        columns=["prediction"]
    )
    # neutralize predictions to a subset of features
    neutralized = neutralize(predictions, live_features[sm_serenity_feats])
    return neutralized.rank(pct=True)
```
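
The idea behind neutralization can be sketched with NumPy alone. The helper below, `neutralize_linear`, is a hypothetical illustration and not the `numerai_tools` implementation: it simply subtracts the least-squares projection of the predictions onto the features (plus an intercept), so that the residual has no linear exposure to them. All data is randomly generated.

```python
import numpy as np


def neutralize_linear(predictions: np.ndarray, features: np.ndarray, proportion: float = 1.0) -> np.ndarray:
    """Subtract the least-squares fit of predictions on features (with intercept)."""
    X = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(X, predictions, rcond=None)
    return predictions - proportion * (X @ coef)


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
# Predictions that deliberately load on the features, plus noise
predictions = features @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=200)

neutral = neutralize_linear(predictions, features)

# Linear exposure: dot product of each feature column with the (centered) predictions
exposure_before = np.abs(features.T @ (predictions - predictions.mean())).max()
exposure_after = np.abs(features.T @ neutral).max()
print(exposure_before, exposure_after)
```

With `proportion=1.0` the remaining exposure is zero up to floating-point error; a smaller proportion would remove only part of it.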

--------------------------------

### Calculate and Display Summary Metrics (Pandas)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Calculates summary metrics (mean, std, Sharpe ratio, max drawdown) for predictions and neutralized predictions. It then formats and displays these metrics in a Pandas DataFrame.

```python
summary_metrics = {}
for col in prediction_cols:
    mean = correlations[col].mean()
    std = correlations[col].std(ddof=0)
    sharpe = mean / std
    rolling_max = cumulative_correlations[col].expanding(min_periods=1).max()
    max_drawdown = (rolling_max - cumulative_correlations[col]).max()
    summary_metrics[col] = {
        "mean": mean,
        "std": std,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
    }
pd.set_option('display.float_format', lambda x: '%f' % x)
pd.DataFrame(summary_metrics).T
```

--------------------------------

### Plot Target Correlation Matrix

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Generates a heatmap visualization of the correlation matrix for the specified target columns using the seaborn library. This helps in understanding the relationships between different targets.

```python
import seaborn as sns
sns.heatmap(
  targets_df[target_cols].corr(),
  cmap="coolwarm",
  xticklabels=False,
  yticklabels=False
)
```

--------------------------------

### Train LightGBM Models for Multiple Targets

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Trains LightGBM regression models for a list of target candidates. It initializes the LGBMRegressor with specified hyperparameters and fits the model to the training data. The trained models are stored in a dictionary keyed by the target name.

```python
import lightgbm as lgb

models = {}
for target in TARGET_CANDIDATES:
    model = lgb.LGBMRegressor(
        n_estimators=2000,
        learning_rate=0.01,
        max_depth=5,
        num_leaves=2**4-1,
        colsample_bytree=0.1
    )
    # We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
    # model = lgb.LGBMRegressor(
    #     n_estimators=30_000,
    #     learning_rate=0.001,
    #     max_depth=10,
    #     num_leaves=2**10,
    #     colsample_bytree=0.1,
    #     min_data_in_leaf=10000,
    # )
    model.fit(
        train[feature_cols],
        train[target]
    )
    models[target] = model
```

--------------------------------

### Download and Prepare Validation Data

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Downloads the validation dataset and loads it into a pandas DataFrame. It filters the data to include only 'validation' type entries and selects relevant columns: 'era', 'data_type', 'target', and the specified feature set. The script also includes an option to downsample the validation data by selecting every 4th era, which is useful for reducing memory usage and speeding up evaluation, especially in environments like Google Colab.

```python
# Download validation data - this will take a few minutes
napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

# Load the validation data and filter for data_type == "validation"
validation = pd.read_parquet(
    f"{DATA_VERSION}/validation.parquet",
    columns=["era", "data_type", "target"] + feature_set
)
validation = validation[validation["data_type"] == "validation"]
del validation["data_type"]

# Downsample to every 4th era to reduce memory usage and speedup evaluation (suggested for Colab free tier)
# Comment out the line below to use all the data (slower and higher memory usage, but more accurate evaluation)
validation = validation[validation["era"].isin(validation["era"].unique()[::4])]
```
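
The downsampling expression can be checked on a small frame. This sketch assumes ten made-up eras with two rows each.

```python
import pandas as pd

# Ten made-up eras, "0001" to "0010", with two rows per era
eras = [str(i).zfill(4) for i in range(1, 11)]
validation = pd.DataFrame(
    {"era": [era for era in eras for _ in range(2)], "target": 0.5}
)

# unique() keeps order of first appearance; [::4] takes every 4th era from the first
kept_eras = validation["era"].unique()[::4]
downsampled = validation[validation["era"].isin(kept_eras)]

print(list(kept_eras))
print(len(validation), "->", len(downsampled))
```

Because `unique()` follows the order in which eras appear, the result is "every 4th era" in the chronological sense only if the rows are already sorted by era.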

--------------------------------

### Visualize Max Feature Exposure Per Era

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Calculates and plots the maximum feature exposure for each era from pre-computed feature exposures. It also prints the mean of these maximum exposures across all eras. Uses pandas for plotting.

```python
# Plot the max feature exposure per era
max_feature_exposure = feature_exposures.max(axis=1)
max_feature_exposure.plot(
  title="Max Feature Exposure",
  kind="bar",
  figsize=(10, 5),
  xticks=[],
  snap=False
)
# Mean max feature exposure across eras
print("Mean of max feature exposure", max_feature_exposure.mean())
```
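
Note that `max(axis=1)` picks the largest signed value in each era, so a strongly negative exposure does not show up in it. The sketch below contrasts it with the largest magnitude per era. The exposure values and feature names are hypothetical, and taking `abs()` first is a variation on the snippet above, not what the notebook does.

```python
import pandas as pd

# Hypothetical per-era exposures: rows are eras, columns are features
feature_exposures = pd.DataFrame(
    {"feature_a": [0.10, -0.05], "feature_b": [-0.40, 0.02]},
    index=["0001", "0002"],
)

signed_max = feature_exposures.max(axis=1)     # largest signed value per era
abs_max = feature_exposures.abs().max(axis=1)  # largest magnitude per era

print(signed_max.tolist())
print(abs_max.tolist())
```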

--------------------------------

### Serialize Model for Upload using Cloudpickle (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script demonstrates how to serialize a Python function (`predict_neutral`) and its dependencies using the `cloudpickle` library. The serialized function is then saved to a file named `feature_neutralization.pkl` in binary write mode.

```python
# Use the cloudpickle library to serialize your function and its dependencies
import cloudpickle
p = cloudpickle.dumps(predict_neutral)
with open("feature_neutralization.pkl", "wb") as f:
    f.write(p)
```

--------------------------------

### Plot Histograms of Target Distributions

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

This code snippet utilizes Pandas plotting capabilities to visualize the distributions of specified target columns. It generates histograms with a specified number of bins and density normalization, displayed in a grid layout.

```python
# Plot target distributions
targets_df[TARGET_CANDIDATES].plot(
  title="Target Distributions",
  kind="hist",
  bins=35,
  density=True,
  figsize=(8, 4),
  subplots=True,
  layout=(2, 2),
  ylabel="",
  yticks=[]
)
```

--------------------------------

### Plot Cumulative Performance Metrics

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Plots the cumulative performance of neutralized predictions using pandas plotting functionality. It sets a custom figure size and removes x-axis ticks for clarity.

```python
cumsum_mmc.plot(
  title="Cumulative BMC of Neutralized Predictions",
  figsize=(10, 6),
  xticks=[]
)
```

--------------------------------

### Download and Load Training Data (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Downloads the Numerai training data (train.parquet) and loads a specified feature set ('small' or 'all') along with 'era' and 'target' columns using pandas. It also downsamples the data by selecting every 4th era to reduce memory usage, which is particularly useful for environments like Google Colab's free tier.

```python
# define the small features and small serenity features
# use "all" for better performance. Requires more RAM.
feature_size = "small"
# feature_size = "all"
small_features = feature_sets[feature_size]
sm_serenity_feats = list(subgroups[feature_size]["serenity"])

# Download the training data and feature metadata
napi.download_dataset(f"{DATA_VERSION}/train.parquet")

# Load just the small feature set,
# this is a great feature of the parquet file format
train = pd.read_parquet(
    f"{DATA_VERSION}/train.parquet",
    columns=["era", "target"] + small_features
)

# Downsample to every 4th era to reduce memory usage and
# speed up model training (suggested for Colab free tier).
train = train[train["era"].isin(train["era"].unique()[::4])]
```

--------------------------------

### Visualize Feature Exposures

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Computes and visualizes the Pearson correlation between model predictions and a set of serenity features per era. Uses pandas for data manipulation and matplotlib for plotting. Handles plotting of multiple features and adjusts layout for clarity.

```python
import matplotlib.pyplot as plt

# Compute the Pearson correlation of the predictions with each of the
# serenity features of the small feature set
feature_exposures = validation.groupby("era").apply(
    lambda d: d[sm_serenity_feats].corrwith(d["prediction"])
)

# Plot the feature exposures as bar charts
feature_exposures.plot.bar(
    title="Feature Exposures",
    figsize=(16, 10),
    layout=(7,5),
    xticks=[],
    subplots=True,
    sharex=False,
    legend=False,
    snap=False
)
for ax in plt.gcf().axes:
    ax.set_xlabel("")
    ax.title.set_fontsize(10)
plt.tight_layout(pad=1.5)
plt.gcf().suptitle("Feature Exposures", fontsize=15)
```
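
The shape of `feature_exposures` is easier to see on a small example. This sketch runs the same `groupby` plus `corrwith` pattern on made-up data; the column names `feature_up` and `feature_down` are hypothetical, and the columns are selected explicitly before `apply` so the grouping column is not passed into the lambda.

```python
import pandas as pd

# Hypothetical data: two eras, one prediction column, two features
toy = pd.DataFrame({
    "era": ["0001"] * 3 + ["0002"] * 3,
    "prediction": [0.1, 0.5, 0.9, 0.2, 0.4, 0.6],
    "feature_up": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],    # rises with the prediction
    "feature_down": [1.0, 0.5, 0.0, 1.0, 0.5, 0.0],  # falls as the prediction rises
})
features = ["feature_up", "feature_down"]

# Result has one row per era and one column per feature
exposures = toy.groupby("era")[features + ["prediction"]].apply(
    lambda d: d[features].corrwith(d["prediction"])
)
print(exposures.round(3))
```

Each cell is the within-era Pearson correlation between the prediction and one feature, which is why the bar charts above have one subplot per feature and one bar per era.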

--------------------------------

### Load and Downsample Training Data

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Downloads the 'train.parquet' dataset and loads a specified feature set (e.g., 'small') into a pandas DataFrame. It also includes functionality to downsample the data by selecting every Nth era, which is useful for managing memory and speeding up training, especially in environments like Google Colab.

```python
import pandas as pd

# Define our feature set
feature_set = feature_sets["small"]
# use "medium" or "all" for better performance. Requires more RAM.
# feature_set = feature_sets["medium"]
# feature_set = feature_sets["all"]

# Download the training data - this will take a few minutes
napi.download_dataset(f"{DATA_VERSION}/train.parquet")

# Load only the selected feature set to reduce memory usage;
# parquet lets us read just the columns we need
train = pd.read_parquet(
    f"{DATA_VERSION}/train.parquet",
    columns=["era", "target"] + feature_set
)

# Downsample to every 4th era to reduce memory usage and speed up model training (suggested for Colab free tier)
# Comment out the line below to use all the data
train = train[train["era"].isin(train["era"].unique()[::4])]
```

--------------------------------

### Load and Preprocess Validation Data

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Downloads and loads validation data from a specified version. It filters the data for the 'validation' type, removes the 'data_type' column, and optionally downsamples eras to manage memory. It also embargoes eras that overlap with the training data.

```python
# Download validation data
napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

# Load the validation data, filtering for data_type == "validation"
validation = pd.read_parquet(
    f"{DATA_VERSION}/validation.parquet",
    columns=["era", "data_type"] + feature_cols + target_cols
)
validation = validation[validation["data_type"] == "validation"]
del validation["data_type"]

# Downsample to every 4th era to reduce memory usage and speed up validation (suggested for Colab free tier)
# Comment out the line below to use all the data
validation = validation[validation["era"].isin(validation["era"].unique()[::4])]

# Embargo overlapping eras from training data
last_train_era = int(train["era"].unique()[-1])
eras_to_embargo = [str(era).zfill(4) for era in [last_train_era + i for i in range(4)]]
validation = validation[~validation["era"].isin(eras_to_embargo)]
```
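
The embargo list is plain string arithmetic on era labels, which this standalone sketch spells out. The value of `last_train_era` and the validation eras are hypothetical; in the snippet above they come from `train["era"]` and `validation["era"]`.

```python
# Hypothetical last training era
last_train_era = 574

# range(4) starts at 0, so the list holds the last training era
# itself plus the next three, zero-padded to 4 characters
eras_to_embargo = [str(last_train_era + i).zfill(4) for i in range(4)]
print(eras_to_embargo)  # ['0574', '0575', '0576', '0577']

# Apply the same exclusion to a toy list of validation eras
validation_eras = ["0575", "0576", "0577", "0578", "0579"]
kept = [era for era in validation_eras if era not in eras_to_embargo]
print(kept)  # ['0578', '0579']
```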

--------------------------------

### Apply Feature Neutralization

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Applies feature neutralization to model predictions using the `numerai-tools` library for different proportions. It groups data by era and neutralizes predictions against specified features, storing the results in new columns.

```python
# import neutralization from numerai-tools
from numerai_tools.scoring import neutralize

# Neutralize predictions per-era against features at different proportions
proportions = [0.25, 0.5, 0.75, 1.0]
for proportion in proportions:
    neutralized = validation.groupby("era", group_keys=True).apply(
        lambda d: neutralize(
          d[["prediction"]],
          d[sm_serenity_feats],
          proportion=proportion
        )
    ).reset_index().set_index("id")
    validation[f"neutralized_{proportion*100:.0f}"] = neutralized["prediction"]
```
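
As a rough intuition for what neutralization does, the sketch below subtracts a fraction of the least-squares projection of the predictions onto the features. `neutralize_sketch` is a hypothetical helper written for illustration with NumPy only; it is not the `numerai-tools` implementation, which may differ in details such as scaling.

```python
import numpy as np

def neutralize_sketch(predictions, features, proportion=1.0):
    """Subtract `proportion` of the least-squares projection onto `features`.

    Hypothetical helper for illustration; not the numerai-tools implementation.
    """
    # Best linear fit of the predictions using the feature columns
    coefs, *_ = np.linalg.lstsq(features, predictions, rcond=None)
    return predictions - proportion * (features @ coefs)

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
# Predictions that lean heavily on the first feature, plus noise
predictions = 0.8 * features[:, 0] + rng.normal(scale=0.1, size=200)

full = neutralize_sketch(predictions, features, proportion=1.0)
half = neutralize_sketch(predictions, features, proportion=0.5)

# Linear exposure to each feature, measured as a dot product
print(np.abs(features.T @ predictions).max())  # large before neutralization
print(np.abs(features.T @ half).max())         # reduced at proportion=0.5
print(np.abs(features.T @ full).max())         # close to zero at proportion=1.0
```

With `proportion=1.0` the result is orthogonal to every feature column, while smaller proportions remove only part of the linear exposure.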

--------------------------------

### Generate Validation Predictions for Multiple Models (Python)

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

This code iterates through a list of target candidates, using a trained model for each target to predict on the validation dataset. It then selects and displays the prediction columns from the validation DataFrame.

```python
for target in TARGET_CANDIDATES:
    validation[f"prediction_{target}"] = models[target].predict(validation[feature_cols])

pred_cols = [f"prediction_{target}" for target in TARGET_CANDIDATES]
validation[pred_cols]
```

--------------------------------

### Train LGBM Regressor Model

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Trains a LightGBM Regressor model on the provided training data using a small feature set. The model is configured with specific hyperparameters, including the number of estimators, learning rate, max depth, number of leaves, and column sample by tree. It outputs the trained model.

```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2**4-1,
    colsample_bytree=0.1
)
# We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
# model = lgb.LGBMRegressor(
#     n_estimators=30_000,
#     learning_rate=0.001,
#     max_depth=10,
#     num_leaves=2**10,
#     colsample_bytree=0.1,
#     min_data_in_leaf=10000,
# )
model.fit(
    train[small_features],
    train["target"]
)
```

--------------------------------

### Drop 'target' Column in Pandas DataFrame

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

This code snippet asserts that the 'target' column is an alias of the main target column, then builds `targets_df` from 'era' and the target columns only, which leaves the duplicate 'target' column out.

```python
# Drop `target` column
assert train["target"].equals(train[MAIN_TARGET])
targets_df = train[["era"] + target_cols]
```

--------------------------------

### Calculate Summary Metrics for Model Performance

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates key performance metrics (mean, standard deviation, Sharpe ratio, max drawdown) for model predictions and their correlation with a main target ('cyrus'). It utilizes pandas for data aggregation and analysis. Requires 'pandas'.

```python
def get_summary_metrics(scores, cumsum_scores):
    # scores holds per-era scores; cumsum_scores is their running total
    mean = scores.mean()
    std = scores.std()
    sharpe = mean / std
    rolling_max = cumsum_scores.expanding(min_periods=1).max()
    max_drawdown = (rolling_max - cumsum_scores).max()
    return {
        "mean": mean,
        "std": std,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
    }

target_summary_metrics = {}
for pred_col in prediction_cols:
  target_summary_metrics[pred_col] = get_summary_metrics(
      correlations[pred_col], cumsum_corrs[pred_col]
  )
  # per era correlation between predictions of the model trained on this target and cyrus
  mean_corr_with_cryus = validation.groupby("era").apply(
      lambda d: d[pred_col].corr(d[MAIN_TARGET])
  ).mean()
  target_summary_metrics[pred_col].update({
      "mean_corr_with_cryus": mean_corr_with_cryus
  })


pd.set_option('display.float_format', lambda x: '%f' % x)
summary = pd.DataFrame(target_summary_metrics).T
summary
```
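
The max drawdown step is the least obvious part of `get_summary_metrics`, so here is a minimal worked example on made-up per-era scores. The numbers and era labels are hypothetical.

```python
import pandas as pd

# Hypothetical per-era scores for one model
scores = pd.Series(
    [0.02, -0.01, 0.03, -0.04, 0.01],
    index=["0001", "0002", "0003", "0004", "0005"],
)
cumsum_scores = scores.cumsum()

# Highest cumulative score seen so far at each era
running_peak = cumsum_scores.expanding(min_periods=1).max()
# Drawdown is the distance below that peak; the metric is the worst one
drawdowns = running_peak - cumsum_scores
max_drawdown = drawdowns.max()

sharpe = scores.mean() / scores.std()  # pandas std() uses ddof=1 by default

print(round(max_drawdown, 4), drawdowns.idxmax())
print(round(sharpe, 4))
```

Here the cumulative score peaks at 0.04 after era 0003 and falls back to about zero in era 0004, so the worst drawdown is about 0.04.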

--------------------------------

### Calculate Meta Model Contribution (MMC)

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates the Meta Model Contribution (MMC) by comparing a user's chosen model ('FAVORITE_MODEL') against benchmark models. It generates per-era MMC, cumulative MMC, and a summary DataFrame with performance statistics.

```python
validation[FAVORITE_MODEL] = benchmark_models[FAVORITE_MODEL]


per_era_mmc, cumsum_mmc, summary = get_mmc(validation, FAVORITE_MODEL)
# plot the cumsum mmc performance
cumsum_mmc.plot(
  title="Contribution of Neutralized Predictions to Numerai's Teager Ensemble",
  figsize=(10, 6),
  xticks=[]
)

pd.set_option('display.float_format', lambda x: '%f' % x)
summary
```

--------------------------------

### Count NaNs per Era and Plot

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates the number of NaN values for each target column, grouped by 'era'. It then generates a plot showing the number of NaNs per era for the specified target columns.

```python
nans_per_era = targets_df.groupby("era").apply(lambda x: x.isna().sum())
nans_per_era[target_cols].plot(figsize=(8, 4), title="Number of NaNs per Era", legend=False)
```
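
An equivalent way to get the same counts without `apply` is to build the boolean NaN mask first and sum it within each era. The sketch below assumes a toy frame with hypothetical columns `target_a` and `target_b`.

```python
import numpy as np
import pandas as pd

# Hypothetical targets with missing values in the earliest era
targets_toy = pd.DataFrame({
    "era": ["0001", "0001", "0002", "0002"],
    "target_a": [0.5, 0.25, 0.75, 0.5],
    "target_b": [np.nan, np.nan, 0.5, 0.0],
})
toy_target_cols = ["target_a", "target_b"]

# Boolean NaN mask, summed within each era
nans_per_era_toy = (
    targets_toy[toy_target_cols].isna().groupby(targets_toy["era"]).sum()
)
print(nans_per_era_toy)
```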

--------------------------------

### Calculate and Plot Cumulative Correlation

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates and plots the cumulative correlation of validation predictions against target values. It uses Numerai's scoring functions and pandas for data manipulation and plotting. Requires 'numerai_tools' and 'pandas'.

```python
from numerai_tools.scoring import numerai_corr, correlation_contribution

prediction_cols = [
    f"prediction_{target}"
    for target in TARGET_CANDIDATES
]
correlations = validation.groupby("era").apply(
    lambda d: numerai_corr(d[prediction_cols], d["target"])
)
cumsum_corrs = correlations.cumsum()
cumsum_corrs.plot(
  title="Cumulative Correlation of validation Predictions",
  figsize=(10, 6),
  xticks=[]
)
```
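
To illustrate how the per-era scores feed the cumulative curve, this sketch uses plain Pearson correlation as a stand-in for `numerai_corr`. That substitution is an assumption made to keep the example dependency-free; Numerai's scoring function applies its own transformations, so real scores will differ. The data is made up.

```python
import pandas as pd

# Hypothetical data: the model is right in era 0001 and inverted in era 0002
toy = pd.DataFrame({
    "era": ["0001"] * 3 + ["0002"] * 3,
    "prediction_a": [0.1, 0.5, 0.9, 0.9, 0.5, 0.1],
    "target": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
})

# Plain Pearson correlation per era (stand-in for numerai_corr)
per_era = toy.groupby("era")[["prediction_a", "target"]].apply(
    lambda d: d["prediction_a"].corr(d["target"])
)
cumulative = per_era.cumsum()
print(cumulative.round(3).tolist())
```

A negative era pulls the cumulative line down by its score, which is what dips in the plotted curve represent.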

--------------------------------

### Compile Feature Set Intersections (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Compiles and analyzes the intersections between Numerai's feature sets (small, medium, all) and feature groups (intelligence, wisdom, etc.). It calculates the count of features in each intersection and displays them in a sorted DataFrame.

```python
sizes = ["small", "medium", "all"]
groups = [
  "intelligence",
  "wisdom",
  "charisma",
  "dexterity",
  "strength",
  "constitution",
  "agility",
  "serenity",
  "all"
]

# compile the intersections of feature sets and feature groups
subgroups = {}
for size in sizes:
    subgroups[size] = {}
    for group in groups:
        subgroups[size][group] = (
            set(feature_sets[size])
            .intersection(set(feature_sets[group]))
        )

# convert to data frame and display the feature count of each intersection
# NOTE: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map;
# on older pandas versions, replace .map(len) with .applymap(len)
import pandas as pd

df_counts = pd.DataFrame(subgroups).map(len)
df_counts = df_counts.sort_values(by="all", ascending=False)
print(df_counts)
```
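
The nested loop can be checked by hand on a tiny example. The feature names and set memberships below are hypothetical; the real ones come from `features.json`.

```python
# Hypothetical feature metadata
toy_feature_sets = {
    "small": ["f1", "f2"],
    "all": ["f1", "f2", "f3", "f4"],
    "serenity": ["f2", "f3"],
    "wisdom": ["f4"],
}

# Intersection of each size-based set with each group
toy_subgroups = {
    size: {
        group: set(toy_feature_sets[size]) & set(toy_feature_sets[group])
        for group in ["serenity", "wisdom", "all"]
    }
    for size in ["small", "all"]
}

# Count the features in each intersection
counts = {
    size: {group: len(members) for group, members in by_group.items()}
    for size, by_group in toy_subgroups.items()
}
print(counts)
```

Intersecting with `"all"` simply returns the size-based set itself, which is why the `"all"` row works as a total to sort by.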

--------------------------------

### Plot Target Density Histogram

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Generates a density histogram for the 'target' column using pandas plotting. It visualizes the distribution of target values, which represent stock market returns over the next 20 business days, normalized by factors and trends. The target values are binned into 5 unequal bins. This plot helps in understanding the data distribution for model training.

```python
# Plot density histogram of the target
train["target"].plot(
  kind="hist",
  title="Target",
  figsize=(5, 3),
  xlabel="Value",
  density=True,
  bins=50
)
```

--------------------------------

### Select and Print Target Columns from DataFrame

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Demonstrates how to select specific columns, including 'era' and a list of target columns, from a Pandas DataFrame. This is useful for isolating relevant data for analysis or model training. Assumes the DataFrame 'train' is already loaded.

```python
train[["era"] + target_cols]
```