### Install Numerai Dependencies

Source: https://github.com/numerai/example-scripts/blob/master/example_model.ipynb

Installs the Python libraries needed for Numerai: numerapi, pandas, pyarrow, matplotlib, lightgbm, scikit-learn, scipy, and cloudpickle. The '-q' flag reduces pip's output, and '--upgrade' installs the latest version of each package except cloudpickle, which is pinned to 3.1.1.

```python
# Install dependencies
!pip install -q --upgrade numerapi pandas pyarrow matplotlib lightgbm scikit-learn scipy cloudpickle==3.1.1
```

--------------------------------

### Install Project Dependencies

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Installs the Python libraries needed for Numerai data analysis and model training: numerapi, pandas, pyarrow, matplotlib, lightgbm, scikit-learn, scipy, cloudpickle, and numerai-tools. numerai-tools is installed with '--no-deps', so pip does not pull in its dependencies.

```python
# Install dependencies
!pip install -q --upgrade numerapi pandas pyarrow matplotlib lightgbm scikit-learn scipy cloudpickle==3.1.1
!pip install -q --no-deps numerai-tools

# Inline plots
%matplotlib inline
```

--------------------------------

### Check Python Version

Source: https://github.com/numerai/example-scripts/blob/master/example_model.ipynb

Checks the installed Python version by running `!python --version` as a shell command from the notebook.

```python
!python --version
```

--------------------------------

### Download and Inspect Feature Metadata

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Downloads the 'features.json' file, which contains metadata about the dataset's features and targets. It then parses the JSON and prints each top-level key together with the number of entries under it (for example, the number of feature sets and targets).
```python
import json

# download the feature metadata file
napi.download_dataset(f"{DATA_VERSION}/features.json")

# read the metadata and display
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
for metadata in feature_metadata:
    print(metadata, len(feature_metadata[metadata]))
```

--------------------------------

### Check Python Version

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Displays the installed Python version. Useful for ensuring compatibility with project dependencies.

```python
import sys

print(f"Python Version: {sys.version}")
```

--------------------------------

### List Numerai Datasets

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Initializes the NumerAPI client to interact with the Numerai platform. It then lists all available datasets and their versions, selects a specific version (e.g., 'v5.0'), and prints the files within that version.

```python
# Initialize NumerAPI - the official Python API client for Numerai
from numerapi import NumerAPI
napi = NumerAPI()

# list the datasets and available versions
all_datasets = napi.list_datasets()
dataset_versions = list(set(d.split('/')[0] for d in all_datasets))
print("Available versions:", dataset_versions)

# Set data version to one of the latest datasets
DATA_VERSION = "v5.0"

# Print all files available for download for our version
current_version_files = [f for f in all_datasets if f.startswith(DATA_VERSION)]
print("Available", DATA_VERSION, "files:", current_version_files)
```

--------------------------------

### Download and Load Numerai Benchmark Models

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Downloads the validation benchmark models dataset from Numerai and loads it into a pandas DataFrame. This is a prerequisite for further analysis.
```python
napi.download_dataset(f"{DATA_VERSION}/validation_benchmark_models.parquet")
benchmark_models = pd.read_parquet(
    f"{DATA_VERSION}/validation_benchmark_models.parquet"
)
benchmark_models
```

--------------------------------

### Import Matplotlib for Plotting

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Imports the matplotlib.pyplot module, commonly aliased as plt, for creating visualizations.

```python
import matplotlib.pyplot as plt
```

--------------------------------

### Quick Test of Ensemble Prediction

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Performs a quick test of the `predict_ensemble` function by downloading live data and generating predictions. This serves as a basic validation step before deployment.

```python
# Quick test
napi.download_dataset(f"{DATA_VERSION}/live.parquet")
live_features = pd.read_parquet(f"{DATA_VERSION}/live.parquet", columns=feature_cols)
predict_ensemble(live_features, benchmark_models)
```

--------------------------------

### Download and Load Meta Model Data

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script shows how to download the meta-model data for a specific round and load it into a pandas DataFrame. This data is used for calculating Meta Model Contribution (MMC).
```python
# import the 2 scoring functions
from numerai_tools.scoring import numerai_corr, correlation_contribution

# Download and join in the meta_model for the validation eras
napi.download_dataset(f"v4.3/meta_model.parquet", round_num=842)
validation["meta_model"] = pd.read_parquet(
    f"v4.3/meta_model.parquet"
)["numerai_meta_model"]
```

--------------------------------

### Download Pickle File in Google Colab

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script provides a conditional way to download the generated pickle file ('hello_numerai.pkl') if the code is being run within a Google Colab environment. It uses a try-except block to handle potential errors if not in Colab.

```python
# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('hello_numerai.pkl')
except:
    pass
```

--------------------------------

### Display Training Data Sample (Python)

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This snippet shows a sample of the training data, illustrating the structure and content of stock data used for model training. It includes stock IDs, eras, target values, and numerous features.

```python
train
```

--------------------------------

### Display Feature Set Sizes

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Iterates through predefined feature set names ('small', 'medium', 'all') using the loaded feature metadata and prints the number of features in each set. This helps in understanding the scale of different feature subsets.
```python
feature_sets = feature_metadata["feature_sets"]
for feature_set in ["small", "medium", "all"]:
    print(feature_set, len(feature_sets[feature_set]))
```

--------------------------------

### Download Meta-Model and Calculate Correlation Contribution

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Imports `correlation_contribution` from numerai_tools.scoring, downloads the meta-model dataset for a specific round, and joins it onto the validation data as a `meta_model` column, ready for computing correlation contribution. This requires the pandas library and the numerai_tools package.

```python
from numerai_tools.scoring import correlation_contribution

# Download and join in the meta_model for the validation eras
napi.download_dataset(f"v4.3/meta_model.parquet", round_num=842)
validation["meta_model"] = pd.read_parquet(
    f"v4.3/meta_model.parquet"
)["numerai_meta_model"]
```

--------------------------------

### Download & Prepare Validation Data and Predict

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Downloads the validation dataset, preprocesses it, and generates predictions using a pre-trained LightGBM model. The script uses pandas to load and filter the data, and the Numerai API to download the dataset.
```python
# Download validation data
napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

# Load the validation data, filtering for data_type == "validation"
validation = pd.read_parquet(
    f"{DATA_VERSION}/validation.parquet",
    columns=["era", "data_type", "target"] + small_features
)
validation = validation[validation["data_type"] == "validation"]
del validation["data_type"]

# Downsample every 4th era to reduce memory usage and speedup validation (suggested for Colab free tier)
# Comment out the line below to use all the data
validation = validation[validation["era"].isin(validation["era"].unique()[::4])]

# Embargo overlapping eras from training data
last_train_era = int(train["era"].unique()[-1])
eras_to_embargo = [str(era).zfill(4) for era in [last_train_era + i for i in range(4)]]
validation = validation[~validation["era"].isin(eras_to_embargo)]

# Generate predictions against the small feature set of the validation data
validation["prediction"] = model.predict(validation[small_features])
```

--------------------------------

### Get summary metrics for correlations

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Computes and displays summary metrics (mean, std, sharpe, max_drawdown) for a given set of correlations. It utilizes a get_summary_metrics function and formats the display of floating-point numbers. This is useful for comparing different prediction sets.

```python
summary_metrics = get_summary_metrics(correlations, cumsum_corrs)
pd.set_option('display.float_format', lambda x: '%f' % x)
summary = pd.DataFrame(summary_metrics)
summary
```

--------------------------------

### Generate and Format Live Predictions in Python

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script downloads the latest live features, loads them into a pandas DataFrame, generates predictions using a pre-trained model, and formats the output for submission.
It requires the 'numerapi' and 'pandas' libraries.

```python
# Download latest live features
napi.download_dataset(f"{DATA_VERSION}/live.parquet")

# Load live features
live_features = pd.read_parquet(f"{DATA_VERSION}/live.parquet", columns=feature_set)

# Generate live predictions
live_predictions = model.predict(live_features[feature_set])

# Format submission
pd.Series(live_predictions, index=live_features.index).to_frame("prediction")
```

--------------------------------

### Download Serialized Model File

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Provides code to download the serialized model file ('target_ensemble.pkl') when running in a Google Colab environment. It uses a try-except block to handle cases where the code is not run in Colab.

```python
# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('target_ensemble.pkl')
except:
    pass
```

--------------------------------

### Plot Performance Metrics by Feature

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Generates bar charts for various performance metrics of features, sorted by their mean performance. Each metric gets its own subplot in a 2x3 grid; the x-axis is not shared between subplots and its ticks are hidden.

```python
# plot the performance metrics of the features as bar charts sorted by mean
feature_metrics.sort_values("mean", ascending=False).plot.bar(
    title="Performance Metrics of Features Sorted by Mean",
    subplots=True,
    figsize=(15, 6),
    layout=(2, 3),
    sharex=False,
    xticks=[],
    snap=False
)
```

--------------------------------

### Initialize NumerAPI Client and Download Metadata (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Initializes the NumerAPI client to interact with Numerai's API and downloads the 'features.json' metadata file. This file contains information about feature sets and groups, crucial for understanding and manipulating features.
```python
import json
import pandas as pd
from numerapi import NumerAPI

# initialize our API client
napi = NumerAPI()

# Set data version to one of the latest datasets
DATA_VERSION = "v5.0"

napi.download_dataset(f"{DATA_VERSION}/features.json")
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
feature_sets = feature_metadata["feature_sets"]
```

--------------------------------

### Quick Test of Feature Neutralization (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script performs a quick test of the feature neutralization process. It downloads the live dataset, loads the relevant features, and then calls the `predict_neutral` function to generate predictions.

```python
# Quick test
napi.download_dataset(f"{DATA_VERSION}/live.parquet")
live_features = pd.read_parquet(f"{DATA_VERSION}/live.parquet", columns=small_features)
predict_neutral(live_features)
```

--------------------------------

### Compute and Process Feature Performance Metrics

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Computes performance metrics for each specified feature using the 'metrics' function, converts the results into a pandas DataFrame, and sorts them by mean performance.

```python
# compute performance metrics for each feature
feature_metrics = [
    metrics(per_era_corr[feature_name])
    for feature_name in sm_serenity_feats
]

# convert to numeric DataFrame and sort
feature_metrics = (
    pd.DataFrame(feature_metrics, index=sm_serenity_feats)
    .apply(pd.to_numeric)
    .sort_values("mean", ascending=False)
)

feature_metrics
```

--------------------------------

### Calculate Per-Era Correlation

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This code calculates the per-era correlation between the model's predictions and the target values. It uses the `numerai_corr` function from the `numerai_tools.scoring` library.
```python
# Compute the per-era corr between our predictions and the target values
per_era_corr = validation.groupby("era").apply(
    lambda x: numerai_corr(x[["prediction"]].dropna(), x["target"].dropna())
)
```

--------------------------------

### Download and Load Numerai Data with LightGBM

Source: https://github.com/numerai/example-scripts/blob/master/example_model.ipynb

Downloads training data and feature metadata from Numerai using the numerapi library. It then loads the data into a pandas DataFrame, selects features, and trains a LightGBM regressor model. The script includes options for different feature sets and downsampling for faster processing.

```python
from numerapi import NumerAPI
import pandas as pd
import json

napi = NumerAPI()

# use one of the latest data versions
DATA_VERSION = "v5.0"

# Download data
napi.download_dataset(f"{DATA_VERSION}/train.parquet")
napi.download_dataset(f"{DATA_VERSION}/features.json")

# Load data
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
features = feature_metadata["feature_sets"]["small"]
# use "medium" or "all" for better performance. Requires more RAM.
# features = feature_metadata["feature_sets"]["medium"]
# features = feature_metadata["feature_sets"]["all"]
train = pd.read_parquet(f"{DATA_VERSION}/train.parquet", columns=["era"]+features+["target"])

# For better models, join train and validation data and train on all of it.
# This would cause diagnostics to be misleading though.
# napi.download_dataset(f"{DATA_VERSION}/validation.parquet")
# validation = pd.read_parquet(f"{DATA_VERSION}/validation.parquet", columns=["era", "data_type"]+features+["target"])
# validation = validation[validation["data_type"] == "validation"]  # drop rows which don't have targets yet
# train = pd.concat([train, validation])

# Downsample for speed
train = train[train["era"].isin(train["era"].unique()[::4])]  # skip this step for better performance

# Train model
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2**5-1,
    colsample_bytree=0.1
)
# We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
# model = lgb.LGBMRegressor(
#     n_estimators=30_000,
#     learning_rate=0.001,
#     max_depth=10,
#     num_leaves=2**10,
#     colsample_bytree=0.1,
#     min_data_in_leaf=10000,
# )
model.fit(
    train[features],
    train["target"]
)

# Define predict function
def predict(
    live_features: pd.DataFrame,
    _live_benchmark_models: pd.DataFrame
) -> pd.DataFrame:
    live_predictions = model.predict(live_features[features])
    submission = pd.Series(live_predictions, index=live_features.index)
    return submission.to_frame("prediction")

# Pickle predict function
import cloudpickle
p = cloudpickle.dumps(predict)
with open("example_model.pkl", "wb") as f:
    f.write(p)

# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('example_model.pkl')
except:
    pass
```

--------------------------------

### Calculate Correlation with Main Target

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates and displays the correlation of each auxiliary target column with the main target column. The results are sorted in descending order of correlation.
```python
(
    targets_df[target_cols]
    .corrwith(targets_df[MAIN_TARGET])
    .sort_values(ascending=False)
    .to_frame("corr_with_cyrus_v4_20")
)
```

--------------------------------

### Set Pandas Display Format

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Configures pandas to display floating-point numbers in fixed-point notation with six decimal places (the `'%f'` format) instead of scientific notation. This is useful for detailed analysis of financial or statistical data.

```python
pd.set_option('display.float_format', lambda x: '%f' % x)
```

--------------------------------

### Define Prediction Pipeline Function in Python

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Defines a prediction pipeline function that takes live features and benchmark models as input and returns a DataFrame with predictions. This function serves as the core logic for model submissions.

```python
# Define your prediction pipeline as a function
def predict(live_features: pd.DataFrame, _live_benchmark_models: pd.DataFrame) -> pd.DataFrame:
    live_predictions = model.predict(live_features[feature_set])
    submission = pd.Series(live_predictions, index=live_features.index)
    return submission.to_frame("prediction")
```

--------------------------------

### Serialize Prediction Function with Cloudpickle

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Serializes the `predict_ensemble` function and its dependencies using the `cloudpickle` library. The serialized object is then saved to a file named 'target_ensemble.pkl' in binary write mode.
```python
# Use the cloudpickle library to serialize your function and its dependencies
import cloudpickle
p = cloudpickle.dumps(predict_ensemble)
with open("target_ensemble.pkl", "wb") as f:
    f.write(p)
```

--------------------------------

### Calculate Per-Era Meta Model Contribution (MMC)

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script calculates the per-era Meta Model Contribution (MMC) by comparing the model's predictions against the meta model and the target values. It utilizes the `correlation_contribution` function.

```python
# Compute the per-era mmc between our predictions, the meta model, and the target values
per_era_mmc = validation.dropna().groupby("era").apply(
    lambda x: correlation_contribution(x[["prediction"]], x["meta_model"], x["target"])
)
```

--------------------------------

### Calculate and Plot Cumulative Per-Era Correlation

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Flips the sign of each per-era correlation series whose mean is negative, since only the magnitude matters here, then plots the cumulative sum of the per-era correlations.

```python
import numpy as np

# Flip sign for negative mean correlation since we only care about magnitude
per_era_corr *= np.sign(per_era_corr.mean())

# Plot the per-era correlations
per_era_corr.cumsum().plot(
    title="Cumulative Absolute Value CORR of Features and the Target",
    figsize=(15, 5),
    legend=False,
    xlabel="Era"
)
```

--------------------------------

### Serialize Prediction Function using Cloudpickle in Python

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This code snippet serializes a Python prediction function using the `cloudpickle` library, creating a pickle file ready for upload to Numerai. It demonstrates how to save the serialized function to a file.
```python
# Use the cloudpickle library to serialize your function
import cloudpickle
p = cloudpickle.dumps(predict)
with open("hello_numerai.pkl", "wb") as f:
    f.write(p)
```

--------------------------------

### Plot Cumulative Correlations (Pandas/Matplotlib)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Plots the cumulative correlations of neutralized predictions using Pandas and Matplotlib. It sets the plot title and figure size and hides the x-axis ticks.

```python
pd.DataFrame(cumulative_correlations).plot(
    title="Cumulative Correlation of Neutralized Predictions",
    figsize=(10, 6),
    xticks=[]
)
```

--------------------------------

### Generate Predictions and Embargo Eras

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

This script demonstrates how to embargo the eras that immediately follow the training period from the validation set to prevent data leakage. It then generates predictions using a pre-trained model on the filtered validation data.

```python
# Eras are 1 week apart, but targets look 20 days (or 4 weeks/eras) into the future,
# so we need to "embargo" the first 4 eras following our last train era to avoid "data leakage"
last_train_era = int(train["era"].unique()[-1])
eras_to_embargo = [str(era).zfill(4) for era in [last_train_era + i for i in range(4)]]
validation = validation[~validation["era"].isin(eras_to_embargo)]

# Generate predictions against the out-of-sample validation features
# This will take a few minutes
validation["prediction"] = model.predict(validation[feature_set])
validation[["era", "prediction", "target"]]
```

--------------------------------

### Train LGBMRegressor Model

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Trains a LightGBM Regressor model on the provided training data. The model is configured with specific hyperparameters like `n_estimators`, `learning_rate`, `max_depth`, `num_leaves`, and `colsample_bytree`.
This script imports the `lightgbm` library and fits the model using the features and the target variable from the training dataset. It includes comments linking to the LightGBM documentation for parameters.

```python
# https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html
import lightgbm as lgb

# https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2**5-1,
    colsample_bytree=0.1
)
# We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
# model = lgb.LGBMRegressor(
#     n_estimators=30_000,
#     learning_rate=0.001,
#     max_depth=10,
#     num_leaves=2**10,
#     colsample_bytree=0.1,
#     min_data_in_leaf=10000,
# )

# This will take a few minutes 🍵
model.fit(
    train[feature_set],
    train["target"]
)
```

--------------------------------

### Download Serialized Model in Google Colab (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script is designed to run in Google Colab and attempts to download the serialized model file (`feature_neutralization.pkl`) using the `google.colab.files` module. Outside a Colab environment, the import fails and the download is silently skipped.

```python
# Download file if running in Google Colab
try:
    from google.colab import files
    files.download('feature_neutralization.pkl')
except:
    pass
```

--------------------------------

### Compute Performance Metrics (Python)

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Computes and displays key performance metrics for the model over the validation period: mean, standard deviation, Sharpe ratio, and maximum drawdown for both CORR and MMC. These metrics are presented in a pandas DataFrame for easy comparison.
```python
# Compute performance metrics
corr_mean = per_era_corr.mean()
corr_std = per_era_corr.std(ddof=0)
corr_sharpe = corr_mean / corr_std
corr_max_drawdown = (per_era_corr.cumsum().expanding(min_periods=1).max() - per_era_corr.cumsum()).max()

mmc_mean = per_era_mmc.mean()
mmc_std = per_era_mmc.std(ddof=0)
mmc_sharpe = mmc_mean / mmc_std
mmc_max_drawdown = (per_era_mmc.cumsum().expanding(min_periods=1).max() - per_era_mmc.cumsum()).max()

pd.DataFrame({
    "mean": [corr_mean, mmc_mean],
    "std": [corr_std, mmc_std],
    "sharpe": [corr_sharpe, mmc_sharpe],
    "max_drawdown": [corr_max_drawdown, mmc_max_drawdown]
}, index=["CORR", "MMC"]).T
```

--------------------------------

### Load and Prepare Numerai Data

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Loads training data and feature metadata from Numerai. It specifies data version, main target, candidate targets, and feature sets. The data is then filtered to include only specific eras for reduced memory usage.

```python
import pandas as pd
import json
from numerapi import NumerAPI

# Set the data version to one of the most recent versions
DATA_VERSION = "v5.0"
MAIN_TARGET = "target_cyrusd_20"
TARGET_CANDIDATES = [
    MAIN_TARGET,
    "target_victor_20",
    "target_xerxes_20",
    "target_teager2b_20"
]
FAVORITE_MODEL = "v5_lgbm_ct_blend"

# Download data
napi = NumerAPI()
napi.download_dataset(f"{DATA_VERSION}/train.parquet")
napi.download_dataset(f"{DATA_VERSION}/features.json")

# Load data
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
feature_cols = feature_metadata["feature_sets"]["small"]
# use "medium" or "all" for better performance. Requires more RAM.
# feature_cols = feature_metadata["feature_sets"]["medium"]
# feature_cols = feature_metadata["feature_sets"]["all"]
target_cols = feature_metadata["targets"]
train = pd.read_parquet(
    f"{DATA_VERSION}/train.parquet",
    columns=["era"] + feature_cols + target_cols
)

# Downsample to every 4th era to reduce memory usage and speedup model training (suggested for Colab free tier)
# Comment out the line below to use all the data (higher memory usage, slower model training, potentially better performance)
train = train[train["era"].isin(train["era"].unique()[::4])]
```

--------------------------------

### Build and Predict with Feature-Neutral Model (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script defines a function `predict_neutral` that takes live features, predicts using the features in `small_features`, and then neutralizes these predictions against a specified subset of features (`sm_serenity_feats`). It returns the neutralized predictions as percentile ranks.

```python
from numerai_tools.scoring import neutralize
import pandas as pd

def predict_neutral(live_features: pd.DataFrame, _live_benchmark_models: pd.DataFrame = None) -> pd.DataFrame:
    # make predictions using all features
    predictions = pd.DataFrame(
        model.predict(live_features[small_features]),
        index=live_features.index,
        columns=["prediction"]
    )
    # neutralize predictions to a subset of features
    neutralized = neutralize(predictions, live_features[sm_serenity_feats])
    return neutralized.rank(pct=True)
```

--------------------------------

### Calculate and Display Summary Metrics (Pandas)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Calculates summary metrics (mean, std, Sharpe ratio, max drawdown) for predictions and neutralized predictions. It then formats and displays these metrics in a Pandas DataFrame.
```python
summary_metrics = {}
for col in prediction_cols:
    mean = correlations[col].mean()
    std = correlations[col].std(ddof=0)
    sharpe = mean / std
    rolling_max = cumulative_correlations[col].expanding(min_periods=1).max()
    max_drawdown = (rolling_max - cumulative_correlations[col]).max()
    summary_metrics[col] = {
        "mean": mean,
        "std": std,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
    }
pd.set_option('display.float_format', lambda x: '%f' % x)
pd.DataFrame(summary_metrics).T
```

--------------------------------

### Plot Target Correlation Matrix

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Generates a heatmap visualization of the correlation matrix for the specified target columns using the seaborn library. This helps in understanding the relationships between different targets.

```python
import seaborn as sns

sns.heatmap(
    targets_df[target_cols].corr(),
    cmap="coolwarm",
    xticklabels=False,
    yticklabels=False
)
```

--------------------------------

### Train LightGBM Models for Multiple Targets

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Trains LightGBM regression models for a list of target candidates. It initializes the LGBMRegressor with specified hyperparameters and fits the model to the training data. The trained models are stored in a dictionary keyed by the target name.
```python
import lightgbm as lgb

models = {}
for target in TARGET_CANDIDATES:
    model = lgb.LGBMRegressor(
        n_estimators=2000,
        learning_rate=0.01,
        max_depth=5,
        num_leaves=2**4-1,
        colsample_bytree=0.1
    )
    # We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
    # model = lgb.LGBMRegressor(
    #     n_estimators=30_000,
    #     learning_rate=0.001,
    #     max_depth=10,
    #     num_leaves=2**10,
    #     colsample_bytree=0.1,
    #     min_data_in_leaf=10000,
    # )
    model.fit(
        train[feature_cols],
        train[target]
    )
    models[target] = model
```

--------------------------------

### Download and Prepare Validation Data

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Downloads the validation dataset and loads it into a pandas DataFrame with the columns 'era', 'data_type', 'target', and the specified feature set. It keeps only rows whose 'data_type' is 'validation' and then drops that column. The script also downsamples the validation data to every 4th era, which reduces memory usage and speeds up evaluation, especially in environments like Google Colab.
```python
# Download validation data - this will take a few minutes
napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

# Load the validation data and filter for data_type == "validation"
validation = pd.read_parquet(
    f"{DATA_VERSION}/validation.parquet",
    columns=["era", "data_type", "target"] + feature_set
)
validation = validation[validation["data_type"] == "validation"]
del validation["data_type"]

# Downsample to every 4th era to reduce memory usage and speedup evaluation (suggested for Colab free tier)
# Comment out the line below to use all the data (slower and higher memory usage, but more accurate evaluation)
validation = validation[validation["era"].isin(validation["era"].unique()[::4])]
```

--------------------------------

### Visualize Max Feature Exposure Per Era

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Calculates and plots the maximum feature exposure for each era from pre-computed feature exposures. It also prints the mean of these maximum exposures across all eras. Uses pandas for plotting.

```python
# Plot the max feature exposure per era
max_feature_exposure = feature_exposures.max(axis=1)
max_feature_exposure.plot(
    title="Max Feature Exposure",
    kind="bar",
    figsize=(10, 5),
    xticks=[],
    snap=False
)

# Mean max feature exposure across eras
print("Mean of max feature exposure", max_feature_exposure.mean())
```

--------------------------------

### Serialize Model for Upload using Cloudpickle (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

This script demonstrates how to serialize a Python function (`predict_neutral`) and its dependencies using the `cloudpickle` library. The serialized function is then saved to a file named `feature_neutralization.pkl` in binary write mode.
```python
# Use the cloudpickle library to serialize your function and its dependencies
import cloudpickle

p = cloudpickle.dumps(predict_neutral)
with open("feature_neutralization.pkl", "wb") as f:
    f.write(p)
```

--------------------------------

### Plot Histograms of Target Distributions

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Uses pandas plotting to visualize the distributions of the specified target columns. It draws density-normalized histograms with 35 bins each, arranged as subplots in a 2x2 grid.

```python
# Plot target distributions
targets_df[TARGET_CANDIDATES].plot(
    title="Target Distributions",
    kind="hist",
    bins=35,
    density=True,
    figsize=(8, 4),
    subplots=True,
    layout=(2, 2),
    ylabel="",
    yticks=[]
)
```

--------------------------------

### Plot Cumulative Performance Metrics

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Plots the cumulative performance of neutralized predictions using pandas plotting functionality. It sets a custom figure size and removes x-axis ticks for clarity.

```python
cumsum_mmc.plot(
    title="Cumulative BMC of Neutralized Predictions",
    figsize=(10, 6),
    xticks=[]
)
```

--------------------------------

### Download and Load Training Data (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Downloads the Numerai training data (train.parquet) and loads a specified feature set ('small' or 'all') along with the 'era' and 'target' columns using pandas. It also downsamples the data to every 4th era to reduce memory usage, which is particularly useful for environments like Google Colab's free tier.

```python
# define the small features and small serenity features
# use "all" for better performance. Requires more RAM.
feature_size = "small"
# feature_size = "all"
small_features = feature_sets[feature_size]
sm_serenity_feats = list(subgroups[feature_size]["serenity"])

# Download the training data
napi.download_dataset(f"{DATA_VERSION}/train.parquet")

# Load just the small feature set,
# this is a great feature of the parquet file format
train = pd.read_parquet(
    f"{DATA_VERSION}/train.parquet",
    columns=["era", "target"] + small_features
)

# Downsample to every 4th era to reduce memory usage and
# speed up model training (suggested for Colab free tier).
train = train[train["era"].isin(train["era"].unique()[::4])]
```

--------------------------------

### Visualize Feature Exposures

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Computes and visualizes the per-era Pearson correlation between model predictions and each serenity feature. Uses pandas for data manipulation and matplotlib for plotting. Each feature gets its own subplot, and the layout is adjusted for clarity.

```python
import matplotlib.pyplot as plt

# Compute the Pearson correlation of the predictions with each of the
# serenity features of the small feature set
feature_exposures = validation.groupby("era").apply(
    lambda d: d[sm_serenity_feats].corrwith(d["prediction"])
)

# Plot the feature exposures as bar charts
feature_exposures.plot.bar(
    title="Feature Exposures",
    figsize=(16, 10),
    layout=(7,5),
    xticks=[],
    subplots=True,
    sharex=False,
    legend=False,
    snap=False
)
for ax in plt.gcf().axes:
    ax.set_xlabel("")
    ax.title.set_fontsize(10)
plt.tight_layout(pad=1.5)
plt.gcf().suptitle("Feature Exposures", fontsize=15)
```

--------------------------------

### Load and Downsample Training Data

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Downloads the 'train.parquet' dataset and loads a specified feature set (e.g., 'small') into a pandas DataFrame.
It also downsamples the data to every 4th era, which is useful for managing memory and speeding up training, especially in environments like Google Colab.

```python
import pandas as pd

# Define our feature set
feature_set = feature_sets["small"]
# use "medium" or "all" for better performance. Requires more RAM.
# feature_set = feature_metadata["feature_sets"]["medium"]
# feature_set = feature_metadata["feature_sets"]["all"]

# Download the training data - this will take a few minutes
napi.download_dataset(f"{DATA_VERSION}/train.parquet")

# Load only the columns in the chosen feature set
# Use the "all" feature set to use all features
train = pd.read_parquet(
    f"{DATA_VERSION}/train.parquet",
    columns=["era", "target"] + feature_set
)

# Downsample to every 4th era to reduce memory usage and speed up model training (suggested for Colab free tier)
# Comment out the line below to use all the data
train = train[train["era"].isin(train["era"].unique()[::4])]
```

--------------------------------

### Load and Preprocess Validation Data

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Downloads and loads validation data for the configured data version. It keeps only rows of the 'validation' type, removes the 'data_type' column, and downsamples to every 4th era to manage memory. It also embargoes the eras that overlap with the training data.
```python
# Download validation data
napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

# Load the validation data, filtering for data_type == "validation"
validation = pd.read_parquet(
    f"{DATA_VERSION}/validation.parquet",
    columns=["era", "data_type"] + feature_cols + target_cols
)
validation = validation[validation["data_type"] == "validation"]
del validation["data_type"]

# Downsample to every 4th era to reduce memory usage and speed up validation (suggested for Colab free tier)
# Comment out the line below to use all the data
validation = validation[validation["era"].isin(validation["era"].unique()[::4])]

# Embargo overlapping eras from training data
last_train_era = int(train["era"].unique()[-1])
eras_to_embargo = [str(era).zfill(4) for era in [last_train_era + i for i in range(4)]]
validation = validation[~validation["era"].isin(eras_to_embargo)]
```

--------------------------------

### Apply Feature Neutralization

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Applies feature neutralization to model predictions at several proportions using the `numerai-tools` library. It groups the data by era, neutralizes the predictions against the specified features, and stores each result in a new column.
```python
# import neutralization from numerai-tools
from numerai_tools.scoring import neutralize

# Neutralize predictions per-era against features at different proportions
proportions = [0.25, 0.5, 0.75, 1.0]
for proportion in proportions:
    neutralized = validation.groupby("era", group_keys=True).apply(
        lambda d: neutralize(
            d[["prediction"]],
            d[sm_serenity_feats],
            proportion=proportion
        )
    ).reset_index().set_index("id")
    validation[f"neutralized_{proportion*100:.0f}"] = neutralized["prediction"]
```

--------------------------------

### Generate Validation Predictions for Multiple Models (Python)

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Iterates through the list of target candidates, using the model trained on each target to predict on the validation dataset. It then selects and displays the prediction columns from the validation DataFrame.

```python
for target in TARGET_CANDIDATES:
    validation[f"prediction_{target}"] = models[target].predict(validation[feature_cols])

pred_cols = [f"prediction_{target}" for target in TARGET_CANDIDATES]
validation[pred_cols]
```

--------------------------------

### Train LGBM Regressor Model

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Trains a LightGBM Regressor model on the provided training data using a small feature set. The model is configured with specific hyperparameters, including the number of estimators, learning rate, max depth, number of leaves, and column sample by tree. It outputs the trained model.
```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2**4-1,
    colsample_bytree=0.1
)
# We've found the following "deep" parameters perform much better, but they require much more CPU and RAM
# model = lgb.LGBMRegressor(
#     n_estimators=30_000,
#     learning_rate=0.001,
#     max_depth=10,
#     num_leaves=2**10,
#     colsample_bytree=0.1,
#     min_data_in_leaf=10000,
# )
model.fit(
    train[small_features],
    train["target"]
)
```

--------------------------------

### Drop 'target' Column in Pandas DataFrame

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

This code snippet demonstrates how to drop the 'target' column from a Pandas DataFrame if it is an alias for a main target column. It includes an assertion to verify the equality before dropping.

```python
# Drop `target` column
assert train["target"].equals(train[MAIN_TARGET])
targets_df = train[["era"] + target_cols]
```

--------------------------------

### Calculate Summary Metrics for Model Performance

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates key performance metrics (mean, standard deviation, Sharpe ratio, max drawdown) for model predictions and their correlation with a main target ('cyrus'). It utilizes pandas for data aggregation and analysis. Requires 'pandas'.
```python
def get_summary_metrics(scores, cumsum_scores):
    # scores: per-era correlation between the predictions of the model
    # trained on this target and cyrus
    mean = scores.mean()
    std = scores.std()
    sharpe = mean / std
    # largest drop of the cumulative score below its running maximum
    rolling_max = cumsum_scores.expanding(min_periods=1).max()
    max_drawdown = (rolling_max - cumsum_scores).max()
    return {
        "mean": mean,
        "std": std,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
    }

target_summary_metrics = {}
for pred_col in prediction_cols:
    target_summary_metrics[pred_col] = get_summary_metrics(
        correlations[pred_col], cumsum_corrs[pred_col]
    )
    # mean per-era correlation between these predictions and cyrus
    mean_corr_with_cyrus = validation.groupby("era").apply(
        lambda d: d[pred_col].corr(d[MAIN_TARGET])
    ).mean()
    target_summary_metrics[pred_col].update({
        "mean_corr_with_cyrus": mean_corr_with_cyrus
    })

pd.set_option('display.float_format', lambda x: '%f' % x)
summary = pd.DataFrame(target_summary_metrics).T
summary
```

--------------------------------

### Calculate Meta Model Contribution (MMC)

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates the Meta Model Contribution (MMC) by comparing a user's chosen model ('FAVORITE_MODEL') against benchmark models. It generates per-era MMC, cumulative MMC, and a summary DataFrame with performance statistics.

```python
validation[FAVORITE_MODEL] = benchmark_models[FAVORITE_MODEL]
per_era_mmc, cumsum_mmc, summary = get_mmc(validation, FAVORITE_MODEL)

# plot the cumsum mmc performance
cumsum_mmc.plot(
    title="Contribution of Neutralized Predictions to Numerai's Teager Ensemble",
    figsize=(10, 6),
    xticks=[]
)

pd.set_option('display.float_format', lambda x: '%f' % x)
summary
```

--------------------------------

### Count NaNs per Era and Plot

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates the number of NaN values for each target column, grouped by 'era'.
It then generates a plot showing the number of NaNs per era for the specified target columns.

```python
nans_per_era = targets_df.groupby("era").apply(lambda x: x.isna().sum())
nans_per_era[target_cols].plot(figsize=(8, 4), title="Number of NaNs per Era", legend=False)
```

--------------------------------

### Calculate and Plot Cumulative Correlation

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Calculates and plots the cumulative per-era correlation of validation predictions against target values. It uses Numerai's scoring functions and pandas for data manipulation and plotting. Requires 'numerai_tools' and 'pandas'.

```python
from numerai_tools.scoring import numerai_corr, correlation_contribution

prediction_cols = [
    f"prediction_{target}"
    for target in TARGET_CANDIDATES
]
correlations = validation.groupby("era").apply(
    lambda d: numerai_corr(d[prediction_cols], d["target"])
)
cumsum_corrs = correlations.cumsum()
cumsum_corrs.plot(
    title="Cumulative Correlation of Validation Predictions",
    figsize=(10, 6),
    xticks=[]
)
```

--------------------------------

### Compile Feature Set Intersections (Python)

Source: https://github.com/numerai/example-scripts/blob/master/feature_neutralization.ipynb

Compiles and analyzes the intersections between Numerai's feature sets (small, medium, all) and feature groups (intelligence, wisdom, etc.). It calculates the count of features in each intersection and displays them in a sorted DataFrame.
```python
import pandas as pd

sizes = ["small", "medium", "all"]
groups = [
    "intelligence",
    "wisdom",
    "charisma",
    "dexterity",
    "strength",
    "constitution",
    "agility",
    "serenity",
    "all"
]

# compile the intersections of feature sets and feature groups
subgroups = {}
for size in sizes:
    subgroups[size] = {}
    for group in groups:
        subgroups[size][group] = (
            set(feature_sets[size])
            .intersection(set(feature_sets[group]))
        )

# convert to data frame and display the feature count of each intersection
# NOTE: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
# On older pandas versions, use the equivalent:
# pd.DataFrame(subgroups).applymap(len).sort_values(by="all", ascending=False)
df_counts = pd.DataFrame(subgroups).map(len)
df_counts = df_counts.sort_values(by="all", ascending=False)
print(df_counts)
```

--------------------------------

### Plot Target Density Histogram

Source: https://github.com/numerai/example-scripts/blob/master/hello_numerai.ipynb

Generates a density histogram for the 'target' column using pandas plotting. It visualizes the distribution of target values, which represent stock market returns over the next 20 business days, normalized by factors and trends. The target values are binned into 5 unequal bins. This plot helps in understanding the data distribution for model training.

```python
# Plot density histogram of the target
train["target"].plot(
    kind="hist",
    title="Target",
    figsize=(5, 3),
    xlabel="Value",
    density=True,
    bins=50
)
```

--------------------------------

### Select and Print Target Columns from DataFrame

Source: https://github.com/numerai/example-scripts/blob/master/target_ensemble.ipynb

Demonstrates how to select specific columns, including 'era' and a list of target columns, from a Pandas DataFrame. This is useful for isolating relevant data for analysis or model training.
Assumes the DataFrame 'train' is already loaded.

```python
train[["era"] + target_cols]
```
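
For readers without the dataset at hand, the selection above can be tried on a toy frame. This is a minimal sketch under stated assumptions: the column names (`feature_a`, `target_a`, `target_b`) and all values are hypothetical placeholders, not real Numerai columns, and `target_cols` is assumed to be a plain Python list of column names.

```python
import pandas as pd

# Toy stand-in for the real `train` DataFrame (hypothetical columns and values)
train = pd.DataFrame(
    {
        "era": ["0001", "0001", "0002"],
        "feature_a": [0, 2, 4],
        "target_a": [0.25, 0.5, 0.75],
        "target_b": [0.5, 0.5, 1.0],
    }
)

# Assumption: target_cols is a list holding the names of the target columns
target_cols = [col for col in train.columns if col.startswith("target")]

# Concatenating two lists of names selects "era" plus every target column;
# indexing with a list returns a new DataFrame and leaves `train` unchanged
targets_only = train[["era"] + target_cols]
print(targets_only.columns.tolist())
```

Because the index is a list of names, the result keeps the listed column order and leaves out anything not named, such as the feature column here.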