### Set Up GFMBench-API Using Python

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/evo2_gfmbench_recipe.ipynb

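This notebook cell clones the GFMBench-API repository into the notebook's working directory if it is not already present, installs the packages listed in `basic_requirements.txt`, and sets the `GFMBENCH_PATH` and `PYTHONPATH` environment variables (it also prepends the clone to `sys.path`). The clone path is resolved to an absolute path once, so it stays valid if the working directory changes later.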

```python
import os
import sys
from pathlib import Path


# Clone target is relative to the notebook working directory; resolve once for stable absolute paths.
GFMBENCH_PATH_RELATIVE = Path("GFMBench-api")
GFMBENCH_PATH = GFMBENCH_PATH_RELATIVE.resolve()

# Clone only if GFMBench-api is not already present (safe to re-run the cell)
if not GFMBENCH_PATH.exists():
    os.system("git clone https://github.com/NVIDIA/GFMBench-api")
else:
    print(f"Using existing clone: {GFMBENCH_PATH}")

REQUIREMENTS = GFMBENCH_PATH / "basic_requirements.txt"
if not REQUIREMENTS.exists():
    raise FileNotFoundError(f"{REQUIREMENTS} not found — run `git pull` in {GFMBENCH_PATH}")
print(f"Installing GFMBench-API deps from {REQUIREMENTS}")
os.system(f"{sys.executable} -m pip install -r {REQUIREMENTS}")

os.environ["GFMBENCH_PATH"] = str(GFMBENCH_PATH)
os.environ["PYTHONPATH"] = str(GFMBENCH_PATH)
sys.path.insert(0, str(GFMBENCH_PATH))

print(f"GFMBENCH_PATH={GFMBENCH_PATH}")
print(f"cwd={Path.cwd()}")
```
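Note that assigning `os.environ["PYTHONPATH"]` directly replaces any value that was already set. If existing entries need to be kept, a minimal sketch of prepending instead could look like the following. `prepend_to_pythonpath` is a hypothetical helper (not part of GFMBench-API or the notebook), and the paths are illustrative.

```python
import os
from typing import Optional


def prepend_to_pythonpath(new_entry: str, existing: Optional[str]) -> str:
    """Return a PYTHONPATH value with new_entry first, keeping existing entries."""
    if not existing:
        return new_entry
    entries = existing.split(os.pathsep)
    # Drop empty items and any earlier copy of new_entry so re-running does not duplicate it.
    kept = [entry for entry in entries if entry and entry != new_entry]
    return os.pathsep.join([new_entry, *kept])


# Illustrative paths only (assumption: the clone lives at /work/GFMBench-api).
previous = os.pathsep.join(["/opt/libs", "/work/GFMBench-api"])
combined = prepend_to_pythonpath("/work/GFMBench-api", previous)
print(combined)
```

In the cell above, this would be used as `os.environ["PYTHONPATH"] = prepend_to_pythonpath(str(GFMBENCH_PATH), os.environ.get("PYTHONPATH"))`.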

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/README.md

Installs all project dependencies using uv sync, or installs packages individually using pip.

```bash
# Install everything (recommended)
uv sync

# Or install packages individually
pip install -e sae/
pip install -e recipes/esm2/
```

--------------------------------

### Install Dependencies and Run Dashboard

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/codonfm/codon_dashboard/README.md

Installs project dependencies using npm and starts the development server for the SAE Feature Dashboard. Access the dashboard via http://localhost:5173.

```bash
npm install
npm run dev
```

--------------------------------

### Quick Start for ESM2 Recipe

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/README.md

Navigate to the ESM2 recipe directory and start the training process using a production configuration.

```bash
cd recipes/esm2
python scripts/train.py --config-name config_production
```

--------------------------------

### Install Dependencies and Run Tests for ESM2

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/docs/docs/main/getting-started/development.md

Navigate to the model directory, install requirements, and run tests. This is a common setup for standard BioNeMo models.

```bash
cd models/esm2
pip install -r requirements.txt
pytest -v .
```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/codonfm/README.md

Install project dependencies using uv sync. This command should be run from the repository root.

```bash
# From repo root (UV workspace)
uv sync
```

--------------------------------

### Build and Run BioNeMo Recipe with Docker

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/README.md

Build a Docker image for a BioNeMo recipe and run it. This example navigates to the 'esm2_native_te' recipe directory. Ensure Docker is installed and accessible.

```bash
# Navigate to a recipe
cd recipes/esm2_native_te

# Build and run
docker build -t esm2_recipe .
docker run --rm -it --gpus all esm2_recipe python train.py
```

--------------------------------

### Install PyBigWig

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/00-Mutation-Datasets-Preprocessing.ipynb

Uncomment and run this command to install the PyBigWig library if it's not already present.

```python
# Uncomment to install PyBigWig
# !pip install pyBigWig
```

--------------------------------

### Local Development Setup

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/docs/docs/main/getting-started/index.md

Navigate to a specific recipe directory and run build and test scripts for local development.

```bash
cd recipes/evo2_megatron
bash .ci_build.sh
pytest -v .
```

--------------------------------

### Install All Recipes for Development

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/README.md

Synchronize all dependencies for the entire project, including all recipes, from the repository root.

```bash
# From repository root
uv sync
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/README.md

Installs the pre-commit git hooks into the local repository so that the configured checks run automatically before each commit. The `pre-commit` tool itself must already be installed.

```bash
pre-commit install
```

--------------------------------

### Launch Single-Process Training

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_native_te/README.md

Starts a single-process training job on one GPU.

```bash
python train_fsdp2.py
```

--------------------------------

### Example Experiment Name Convention

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/docs/docs/main/getting-started/using-slurm.md

An example of how to construct an experiment name based on various training parameters for clear tracking.

```bash
EXPERIMENT_NAME=EVO2_SEQLEN${SEQ_LEN}_PP${PP_SIZE}_TP${TP_SIZE}_CP${CP_SIZE}_LR${LR}_MINLR${MIN_LR}_WU${WU_STEPS}_GA${GRAD_ACC_BATCHES}_...
```
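The same naming convention can be sketched in Python when experiment names are generated from a script rather than a shell. This covers only the fields shown in the example above (the original name continues past `GA`), `build_experiment_name` is a hypothetical helper, and the parameter values are illustrative.

```python
def build_experiment_name(params: dict) -> str:
    """Join the training parameters from the convention above into one name."""
    # (label, key) pairs in the order they appear in the shell example.
    fields = [
        ("SEQLEN", "SEQ_LEN"),
        ("PP", "PP_SIZE"),
        ("TP", "TP_SIZE"),
        ("CP", "CP_SIZE"),
        ("LR", "LR"),
        ("MINLR", "MIN_LR"),
        ("WU", "WU_STEPS"),
        ("GA", "GRAD_ACC_BATCHES"),
    ]
    parts = ["EVO2"] + [f"{label}{params[key]}" for label, key in fields]
    return "_".join(parts)


# Illustrative values only; they are not taken from the source.
name = build_experiment_name(
    {
        "SEQ_LEN": 8192,
        "PP_SIZE": 1,
        "TP_SIZE": 2,
        "CP_SIZE": 1,
        "LR": "3e-4",
        "MIN_LR": "3e-5",
        "WU_STEPS": 2500,
        "GRAD_ACC_BATCHES": 4,
    }
)
print(name)
```

Passing learning rates as strings keeps their formatting stable in the name, which a float would not guarantee.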

--------------------------------

### Launch Single Process Training

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_accelerate_te/README.md

Use this command to start training on a single GPU with a specified configuration name.

```bash
python train.py --config-name=L0_sanity
```

--------------------------------

### Initialize and Run Trainer

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/sae/README.md

Initialize the Trainer with the SAE model and configuration, then start the training process.

```python
from sae.training import Trainer

trainer = Trainer(sae, config)
trainer.train(embeddings)  # embeddings: [N, input_dim]
```

--------------------------------

### Clone GFMBench-API and Install Dependencies (Linux)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/evo2_gfmbench_recipe.ipynb

Use these Linux commands to clone the GFMBench-API repository and install its basic requirements. Ensure you run these from your project's working directory.

```bash
git clone https://github.com/NVIDIA/GFMBench-api
export GFMBENCH_PATH="$(pwd)/GFMBench-api"  # absolute path, so it stays valid after cd
cd "$GFMBENCH_PATH"
pip install -r basic_requirements.txt
export PYTHONPATH="$GFMBENCH_PATH"
```

--------------------------------

### Recipe Performance Benchmarks Example

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/README.md

Example markdown structure for documenting performance benchmarks of a recipe on single and multi-node configurations. Includes throughput, memory usage, model details, and batch size.

```markdown
## Performance Benchmarks

### Single Node (8x H100)
- **Throughput**: 2,500 tokens/sec
- **Memory Usage**: 45GB per GPU
- **Model**: ESM-2 650M parameters
- **Batch Size**: 32 (micro_batch_size=4, gradient_accumulation=8)

### Multi Node (2x8 H100)
- **Throughput**: 4,800 tokens/sec
- **Scaling Efficiency**: 96%
- **Network**: InfiniBand
```
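The template does not say how scaling efficiency is defined. A common definition, assumed here, is measured multi-node throughput divided by the single-node throughput times the node count, which reproduces the 96% figure in the example. `scaling_efficiency` is a hypothetical helper for illustration.

```python
def scaling_efficiency(single_node_throughput: float, multi_node_throughput: float, num_nodes: int) -> float:
    """Ratio of measured multi-node throughput to ideal linear scaling."""
    ideal = single_node_throughput * num_nodes
    return multi_node_throughput / ideal


# Numbers from the example template: 2,500 tokens/sec on one node, 4,800 on two.
efficiency = scaling_efficiency(2500, 4800, 2)
print(f"{efficiency:.0%}")
```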

--------------------------------

### HuggingFace Transformers Inference Example

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/esm2/README.md

Quick start example for performing inference using HuggingFace transformers with an ESM-2 model. Ensure you have the transformers library installed.

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/esm2_t6_8M_UR50D")
tokenizer = AutoTokenizer.from_pretrained("nvidia/esm2_t6_8M_UR50D")

gfp_P42212 = (
    "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL"
    "VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV"
    "NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD"
    "HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
)

inputs = tokenizer(gfp_P42212, return_tensors="pt")
output = model(**inputs)
```

--------------------------------

### Setup and Directory Creation

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/00-Mutation-Datasets-Preprocessing.ipynb

Initializes data directories and creates necessary subdirectories for storing downloaded datasets. Requires setting DATA_DIR and OUTPUT_DIR.

```python
import gzip
import os
import shutil
import urllib.request

import pandas as pd
import requests


# ── Set data directory ───────────────────────────────────────
DATA_DIR = ""  # <-- change this to your preferred data root
OUTPUT_DIR = ""  # output directory where all processed datasets will be saved
UCSC_API_KEY = ""  # <-- set your UCSC API key for Table Browser downloads
# ─────────────────────────────────────────────────────────────

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

for subdir in [
    "reference/hg19",
    "reference/hg38",
    "alphamissense_data",
    "ddd_asd_zhouetal",
    "clinvar_syn",
]:
    os.makedirs(os.path.join(DATA_DIR, subdir), exist_ok=True)
```
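With the placeholders left empty, `os.makedirs(OUTPUT_DIR, exist_ok=True)` raises `FileNotFoundError`, because an empty string is not a valid path. A minimal sketch of a wrapper that fails earlier with a clearer message is shown below. `create_data_dirs` is a hypothetical helper, and rejecting an empty `data_dir` as well is a design choice of this sketch rather than something the notebook does.

```python
import tempfile
from pathlib import Path

SUBDIRS = [
    "reference/hg19",
    "reference/hg38",
    "alphamissense_data",
    "ddd_asd_zhouetal",
    "clinvar_syn",
]


def create_data_dirs(data_dir: str, output_dir: str, subdirs=SUBDIRS) -> list:
    """Create output_dir and each subdir under data_dir; return the subdir paths."""
    if not data_dir or not output_dir:
        # Fail early instead of letting makedirs("") raise FileNotFoundError.
        raise ValueError("Set DATA_DIR and OUTPUT_DIR to non-empty paths first")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    created = []
    for subdir in subdirs:
        path = Path(data_dir) / subdir
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_data_dirs(f"{tmp}/data", f"{tmp}/out")
    all_exist = all(path.is_dir() for path in created)
print(len(created), all_exist)
```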

--------------------------------

### Load Pre-trained Geneformer Model (HF)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

Loads a Geneformer model variant from Hugging Face. Use this to get started with existing models.

```python
from transformers import AutoModelForMaskedLM

# Load the default model (Geneformer-V2-316M)
model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")

# Or load a specific variant
model = AutoModelForMaskedLM.from_pretrained(
    "ctheodoris/Geneformer", subfolder="Geneformer-V2-104M"
)
```

--------------------------------

### Launch Single Process Training (AMPLIFY Model)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_accelerate_te/README.md

Use this command to start training the AMPLIFY model on a single GPU with a specified configuration name.

```bash
python train.py --config-name=L0_sanity_amplify
```

--------------------------------

### Quick Start: Train a ReLUSAE Model

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/sae/README.md

Demonstrates how to set up a synthetic dataset, create a ReLUSAE model, configure training, and train the model.

```python
import torch
from sae.architectures import ReLUSAE, TopKSAE
from sae.training import Trainer, TrainingConfig
from sae.utils import get_device, set_seed

# Set random seed
set_seed(42)

# Create a synthetic dataset (replace with your embeddings)
embeddings = torch.randn(10000, 512)

# Create SAE model
sae = ReLUSAE(
    input_dim=512,
    hidden_dim=512 * 8,  # 8x expansion
    l1_coeff=1e-3,
)

# Configure training
config = TrainingConfig(
    lr=3e-4,
    n_epochs=10,
    batch_size=4096,
    device=get_device(),
)

# Train
trainer = Trainer(sae, config)
trainer.train(embeddings)
```

--------------------------------

### Quick Start: Convert and Run Mixtral HF to TE

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/mixtral/README.md

Converts a HuggingFace Mixtral model to Transformer Engine format and performs a sample inference. Ensure you are in the 'models/mixtral' directory or have installed dependencies.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from convert import convert_mixtral_hf_to_te

# Load the original HuggingFace Mixtral model
model_hf = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)

# Convert to TransformerEngine
model_te = convert_mixtral_hf_to_te(model_hf)
model_te.to("cuda")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer("The quick brown fox", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model_te.generate(**inputs, max_new_tokens=16)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

--------------------------------

### Get Transcript-Level Metadata

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/000-Annotation-File-Processing.ipynb

Extracts essential metadata for each transcript: gene ID, gene name, chromosome, strand, transcript start and end positions, transcript type, canonical status, and MANE Select status. The rows with feature type `transcript` are grouped by `transcript_id`, so the result has one row per transcript.

```python
import polars as pl

# Get transcript-level metadata (gene info, coordinates, canonical status).
# `protein_coding_gtf` is a polars frame built in an earlier notebook cell.
tx_starts = (
    protein_coding_gtf.filter(pl.col("feature") == "transcript")
    .group_by("transcript_id")
    .agg(
        pl.col("gene_id").first().alias("gene_id"),
        pl.col("gene_name").first().alias("gene_name"),
        pl.col("chrom").first().alias("chrom"),
        pl.col("strand").first().alias("strand"),
        pl.col("start").min().alias("tx_start"),
        pl.col("end").max().alias("tx_end"),
        pl.col("transcript_type").first().alias("transcript_type"),
        pl.col("is_canonical").first().alias("is_canonical"),
        pl.col("is_mane_select").first().alias("is_mane_select"),
    )
)
```

--------------------------------

### Install SAE Package

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/sae/README.md

Install the SAE package with pip, either in editable mode from a local checkout (first command) or directly from a git repository (second command).

```bash
pip install -e sae/
```

```bash
pip install git+https://github.com/yourusername/biosae.git#subdirectory=sae
```

--------------------------------

### Install Individual Recipe

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/README.md

Install a specific recipe package in editable mode after installing the core SAE package.

```bash
# Install core first
pip install -e sae/

# Then install recipe
pip install -e recipes/esm2/
```

--------------------------------

### Configure Data Paths and Download Models

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/3-Zero-Shot-Mutation-Variant-Clinvar-Synonymous.ipynb

Sets the input data path, output directory, model checkpoint locations, and inference parameters. It also calls the `download_checkpoint` utility to fetch each model checkpoint into its local directory if needed.

```python
# Configuration: Data paths and model checkpoints
DATA_INPUT_PATH = "/data/processed/mutation_datasets_latest/clinvar_synom.csv"
OUTPUT_DIR = "/data/validation/encodons_results/clinvar_synom"

# Model checkpoints for inference
MODEL_CHECKPOINTS = {
    "80M": "/data/checkpoints/NV-CodonFM-Encodon-TE-80M-v1",
    "600m": "/data/checkpoints/NV-CodonFM-Encodon-TE-600M-v1",
    "1B": "/data/checkpoints/NV-CodonFM-Encodon-TE-Cdwt-1B-v1",
}

from src.utils.load_checkpoint import download_checkpoint


# download models if necessary
download_checkpoint(
    repo_id="nvidia/NV-CodonFM-Encodon-TE-80M-v1", local_dir="/data/checkpoints/NV-CodonFM-Encodon-TE-80M-v1"
)
download_checkpoint(
    repo_id="nvidia/NV-CodonFM-Encodon-TE-600M-v1", local_dir="/data/checkpoints/NV-CodonFM-Encodon-TE-600M-v1"
)
download_checkpoint(
    repo_id="nvidia/NV-CodonFM-Encodon-TE-1B-v1", local_dir="/data/checkpoints/NV-CodonFM-Encodon-TE-1B-v1"
)


# Inference parameters
BATCH_SIZE = 64
NUM_WORKERS = 4
CONTEXT_LENGTH = 2048
```
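The configuration above spells out each checkpoint directory twice, once in `MODEL_CHECKPOINTS` and once in the `download_checkpoint` calls, so the two can drift apart. One way to keep them in sync, sketched here, is to derive the local directory from the repository ID. `local_dir_for` and `CHECKPOINT_ROOT` are hypothetical names, and the sketch does not call `download_checkpoint` itself.

```python
from pathlib import PurePosixPath

# Root directory taken from the paths in the configuration above.
CHECKPOINT_ROOT = PurePosixPath("/data/checkpoints")


def local_dir_for(repo_id: str, root: PurePosixPath = CHECKPOINT_ROOT) -> str:
    """Map an 'org/name' repository ID to '<root>/name'."""
    name = repo_id.split("/")[-1]
    return str(root / name)


# Repository IDs as they appear in the download calls above.
REPO_IDS = [
    "nvidia/NV-CodonFM-Encodon-TE-80M-v1",
    "nvidia/NV-CodonFM-Encodon-TE-600M-v1",
    "nvidia/NV-CodonFM-Encodon-TE-1B-v1",
]

local_dirs = {repo_id: local_dir_for(repo_id) for repo_id in REPO_IDS}
for repo_id, local_dir in local_dirs.items():
    print(f"{repo_id} -> {local_dir}")
```

The resulting mapping could then feed both the download calls and the checkpoint dictionary, so each path is written only once.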

--------------------------------

### Install MMseqs2 via Conda

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/data_scripts/data_curation/allseq_clustering_for_splits.ipynb

Installs MMseqs2 using Conda, recommended for most users. Ensure you have Conda or Miniconda installed.

```bash
conda install -c conda-forge -c bioconda mmseqs2
```

--------------------------------

### Recipe CI/CD Integration Example

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/README.md

Bash commands for building a Docker image for a recipe and running pytest for CI/CD integration. Demonstrates the standard test contract invocation.

```bash
cd recipes/my_recipe
docker build -t my_recipe .
docker run --rm -it --gpus all my_recipe pytest -v .
```

--------------------------------

### Install MMseqs2 via apt (Ubuntu)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/data_scripts/data_curation/allseq_clustering_for_splits.ipynb

Installs MMseqs2 on Ubuntu systems using the apt package manager. This may install an older version.

```bash
sudo apt-get update && sudo apt-get install -y mmseqs2
```

--------------------------------

### Install Python Dependencies

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/lora-fine-tuning-tutorial.ipynb

Installs necessary Python packages for data wrangling and evaluation, including scikit-learn, datasets, matplotlib, and tensorboard. Use the '-q' flag for quiet installation.

```python
%pip install -q scikit-learn datasets matplotlib tensorboard
```

--------------------------------

### Set Up Directories and Prediction Commands

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/zeroshot_brca1.ipynb

Configures output directories and constructs prediction commands for reference and variant sequences. It dynamically sets FP8 and parallelism options based on GPU support and model size.

```python
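from pathlib import Path

import torch

# `check_fp8_support`, `FAST_CI_MODE`, `MODEL_SIZE`, and `checkpoint_path` are defined in earlier notebook cells.

# Define output directories for prediction results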
output_dir = Path("brca1_fasta_files")
output_dir.mkdir(parents=True, exist_ok=True)

# Save reference and variant sequences to FASTA
ref_fasta_path = output_dir / "brca1_reference_sequences.fasta"
var_fasta_path = output_dir / "brca1_variant_sequences.fasta"

predict_ref_dir = output_dir / "reference_predictions"
predict_var_dir = output_dir / "variant_predictions"
predict_ref_dir.mkdir(parents=True, exist_ok=True)
predict_var_dir.mkdir(parents=True, exist_ok=True)

fp8_supported, gpu_info = check_fp8_support()
print(f"FP8 Support: {fp8_supported}")
print(gpu_info)

# Note: If FP8 is not supported, you may want to disable it in the model config
# The Evo2 config has 'use_fp8_input_projections: True' by default

if FAST_CI_MODE:
    model_subset_option = "--num-layers 4 --hybrid-override-pattern SDH*"
else:
    model_subset_option = ""

# NOTE: if you are using the 40b, 1b or 20b checkpoints that are sensitive to FP8, you should use --vortex-style-fp8
#  for prediction to get good accuracy.
# fp8_option = "--vortex-style-fp8" if fp8_supported else ""
if MODEL_SIZE in ["evo2_20b", "evo2_40b"]:
    fp8_option = "--mixed-precision-recipe bf16_mixed --vortex-style-fp8" if fp8_supported else ""
else:
    fp8_option = "--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed" if fp8_supported else ""

# Update predict commands to run on the full dataset
NUM_GPUS = torch.cuda.device_count()
if MODEL_SIZE == "evo2_1b_base":
    NUM_GPUS = min(NUM_GPUS, 2)  # lots of CP on small examples slows things down.
    parallelism_option = f"--context-parallel-size {NUM_GPUS}"
else:
    parallelism_option = f"--tensor-parallel-size {NUM_GPUS}"
predict_ref_command = (
    f"torchrun --standalone --nproc_per_node={NUM_GPUS} --no-python predict_evo2 --use-subquadratic-ops --fasta {ref_fasta_path} --ckpt-dir {checkpoint_path} "
    f"--output-dir {predict_ref_dir} {parallelism_option} {model_subset_option} "
    f"--pipeline-model-parallel-size 1 --output-log-prob-seqs {fp8_option}"
)

predict_var_command = (
    f"torchrun --standalone --nproc_per_node={NUM_GPUS} --no-python predict_evo2 --use-subquadratic-ops --fasta {var_fasta_path} --ckpt-dir {checkpoint_path} "
    f"--output-dir {predict_var_dir} {parallelism_option} {model_subset_option} "
    f"--pipeline-model-parallel-size 1 --output-log-prob-seqs {fp8_option}"
)
```
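The option-selection branches in the cell above can be isolated into two small pure functions, which makes the logic easy to check without a GPU. This is a sketch that mirrors the branches shown; `choose_fp8_option` and `choose_parallelism` are hypothetical names, not part of the notebook or of `predict_evo2`.

```python
def choose_fp8_option(model_size: str, fp8_supported: bool) -> str:
    """Return the mixed-precision flags used for prediction, or '' without FP8 support."""
    if not fp8_supported:
        return ""
    if model_size in ("evo2_20b", "evo2_40b"):
        # Same flags the cell above uses for the 20b and 40b checkpoints.
        return "--mixed-precision-recipe bf16_mixed --vortex-style-fp8"
    return "--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed"


def choose_parallelism(model_size: str, num_gpus: int) -> tuple:
    """Return (gpus_to_use, parallelism_flag) following the cell above."""
    if model_size == "evo2_1b_base":
        # The 1b model is capped at 2 GPUs and uses context parallelism.
        num_gpus = min(num_gpus, 2)
        return num_gpus, f"--context-parallel-size {num_gpus}"
    return num_gpus, f"--tensor-parallel-size {num_gpus}"


print(choose_fp8_option("evo2_40b", fp8_supported=True))
print(choose_parallelism("evo2_1b_base", num_gpus=8))
```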

--------------------------------

### Launch Multi-Process Training with Accelerate

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_accelerate_te/README.md

Initiates distributed training across multiple processes using Accelerate's launch command and a specified configuration file.

```bash
accelerate launch --config_file accelerate_config/fsdp2_te.yaml \
    --num_processes 2 train.py \
    --config-name=L0_sanity
```

--------------------------------

### Launch Single-Process Training

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/opengenome2_llama_native_te/README.md

Use this command to run training on a single GPU.

```bash
python train_fsdp2.py --config-name L0_sanity
```

--------------------------------

### Install Geneformer with Pip

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

Installs the Geneformer package in editable mode with pip, run from the `models/geneformer` directory.

```bash
cd models/geneformer
pip install -e .
```

--------------------------------

### Import Libraries and Setup

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/0-Zero-Shot-Mutation-Variant-CancerHotspot.ipynb

Imports necessary Python libraries for data manipulation, machine learning, and deep learning, and sets up the environment for PyTorch and CUDA.

```python
import os
import pickle
import sys
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import polars as pl
import torch
from tqdm import tqdm


warnings.filterwarnings("ignore")

# Machine learning libraries
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import auc, precision_recall_curve, roc_curve


plt.style.use("default")
sns.set_palette("husl")

# Add project paths
sys.path.append("../")

# Import Encodon-specific modules
from src.data.metadata import MetadataFields
from src.data.mutation_dataset import MutationDataset, collate_fn
from src.data.preprocess.mutation_pred import mlm_process_item
from src.inference.encodon import EncodonInference
from src.inference.task_types import TaskTypes


print("✅ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name()}")
```

--------------------------------

### Install Geneformer for Development

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

Installs Geneformer with development dependencies, including testing tools. Ensure you are in the Geneformer directory.

```bash
cd models/geneformer
pip install -e .[test]
```

--------------------------------

### Import Libraries and Setup

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/5-EnCodon-Downstream-Task-mRFP-expression.ipynb

Imports necessary Python libraries for data manipulation, machine learning, and Encodon model inference. Sets up random seeds for reproducibility and checks PyTorch and CUDA availability.

```python
import os
import sys
import warnings

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm


warnings.filterwarnings("ignore")

# ML libraries
# Visualization
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV


# Add project paths
sys.path.append("..")

# Import Encodon modules
# Import additional modules for dataset handling
from torch.utils.data import DataLoader

from src.data.codon_bert_dataset import CodonBertDataset
from src.data.metadata import MetadataFields
from src.data.preprocess.codon_sequence import process_item
from src.inference.encodon import EncodonInference
from src.inference.task_types import TaskTypes


# Fix random seed
torch.manual_seed(42)
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```

--------------------------------

### Install DeepEP for Fused Token Router

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/mixtral/README.md

Install the DeepEP library using the provided bash script to enable the high-performance FusedTokenRouter.

```bash
bash install_hybridep.sh
```

--------------------------------

### Install Model Package in Editable Mode

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/llama3/README.md

Install the model package in editable mode within the development container to enable testing with pytest.

```bash
pip install -e .
```

--------------------------------

### Install Development Dependencies with Constraints

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

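Installs development dependencies for Geneformer with `PIP_CONSTRAINT` set to an empty value for this one command, which clears any pip constraints file inherited from the environment rather than applying one. The command changes directory to `models/geneformer` first.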

```bash
cd models/geneformer
PIP_CONSTRAINT= pip install -e .[test]
```

--------------------------------

### Set Up Training Data Configuration

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/fine-tuning-tutorial.ipynb

Configures and writes a YAML file for preprocessing training data. It specifies dataset paths, output directories, and various preprocessing parameters.

```python
import os

from bionemo.evo2.data.dataset_tokenizer import (
    DEFAULT_HF_TOKENIZER_MODEL_PATH,  # use the 512 size for historical reasons
)


# `concat_path` is the input FASTA path defined in an earlier notebook cell.
full_fasta_path = os.path.abspath(concat_path)
output_dir = os.path.abspath("preprocessed_data")


output_yaml = f"""
- datapaths: ["{full_fasta_path}"]
  output_dir: "{output_dir}"
  output_prefix: chr20_21_22_uint8_distinct
  train_split: 0.9
  valid_split: 0.05
  test_split: 0.05
  overwrite: True
  embed_reverse_complement: true
  random_reverse_complement: 0.0
  random_lineage_dropout: 0.0
  include_sequence_id: false
  transcribe: "back_transcribe"
  force_uppercase: false
  indexed_dataset_dtype: "uint8"
  hf_tokenizer_model_path: {DEFAULT_HF_TOKENIZER_MODEL_PATH}
  pretrained_tokenizer_model: null
  special_tokens: null
  fast_hf_tokenizer: true
  append_eod: true
  enforce_sample_length: null
  ftfy: false
  workers: 1
  preproc_concurrency: 100000
  chunksize: 25
  drop_empty_sequences: true
  nnn_filter: false  # If you split your fasta on NNN (in human these are contigs), then you should set this to true.
  seed: 12342  # Not relevant because we are not using random reverse complement or lineage dropout.
"""
with open("preprocess_config.yaml", "w") as f:
    print(output_yaml, file=f)
```
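The three split values in the config above are fractions that add up to 1. Assuming that is the intended contract (the source does not say whether the preprocessing step enforces it), a small check before writing the YAML could look like this. `splits_are_valid` is a hypothetical helper.

```python
import math


def splits_are_valid(train: float, valid: float, test: float) -> bool:
    """Return True if each split is in [0, 1] and the three sum to 1."""
    values = (train, valid, test)
    if any(value < 0.0 or value > 1.0 for value in values):
        return False
    # isclose tolerates floating-point rounding in the sum.
    return math.isclose(sum(values), 1.0, rel_tol=1e-9)


# Values from the config above.
ok = splits_are_valid(0.9, 0.05, 0.05)
print(ok)
```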

--------------------------------

### Install GFMBench API Requirements

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/evo2_gfmbench_recipe.ipynb

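Installs the Python packages required by the GFMBench API. The absolute path shown is specific to the machine the notebook was run on; substitute the path to `basic_requirements.txt` in your own GFMBench-api clone. Using a virtual environment is recommended to avoid permission issues.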

```bash
pip install -r /data/sense/alarey/code/bionemo-framework/bionemo-recipes/recipes/evo2_megatron/examples/GFMBench-api/basic_requirements.txt
```

--------------------------------

### Install vLLM Post-Build

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/vllm_inference/esm2/README.md

Installs vLLM within an existing Docker container after the image has been built. The script auto-detects the GPU architecture or accepts an explicit architecture argument.

```bash
docker build -t esm2 .
docker run --rm -it --gpus all esm2 bash -c "./install_vllm.sh"
```

```bash
# or with an explicit architecture:
docker run --rm -it --gpus all esm2 bash -c "./install_vllm.sh 9.0"
```

--------------------------------

### Install Dependencies Manually (CUDA 13.0)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_native_te/README.md

Installs dependencies manually in a Python environment with CUDA 13.0 support. This includes PyTorch, Transformer Engine, and Flash Attention.

```bash
uv venv --python 3.12 --seed /workspace/.venv
source /workspace/.venv/bin/activate
uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130
uv pip install wheel packaging psutil
pip install --no-build-isolation "flash-attn>=2.1.1,<=2.8.1"
pip install --no-build-isolation transformer-engine[pytorch]==2.9.0
uv pip install -r /requirements.txt
```

--------------------------------

### Initialize Qwen3 Model with FP4 Recipe

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/qwen/README.md

Initializes a Qwen3 model using an `NVFP4BlockScaling` FP4 recipe, setting all layers to FP4 precision.

```python
fp4_recipe = te_recipe.NVFP4BlockScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,
)
model = NVQwen3ForCausalLM(config, fp4_recipe=fp4_recipe)
```

--------------------------------

### Install Dependencies in Dev Container

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/qwen/README.md

Install project dependencies within the development container using the provided requirements file. This is a prerequisite for running tests inside the container.

```bash
pip install -r requirements.txt
```