### Set Up GFMBench-API Using Python

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/evo2_gfmbench_recipe.ipynb

This Python script clones the GFMBench-API repository if it is not already present, installs its dependencies, and sets the `GFMBENCH_PATH` and `PYTHONPATH` environment variables. Paths are resolved relative to the notebook's working directory.

```python
import os
import sys
from pathlib import Path

# Clone target is relative to the notebook working directory; resolve once for stable absolute paths.
GFMBENCH_PATH_RELATIVE = Path("GFMBench-api")
GFMBENCH_PATH = GFMBENCH_PATH_RELATIVE.resolve()

# Clone only if GFMBench-api is not already present (safe to re-run the cell)
if not GFMBENCH_PATH.exists():
    os.system("git clone https://github.com/NVIDIA/GFMBench-api")
else:
    print(f"Using existing clone: {GFMBENCH_PATH}")

REQUIREMENTS = GFMBENCH_PATH / "basic_requirements.txt"
if not REQUIREMENTS.exists():
    raise FileNotFoundError(f"{REQUIREMENTS} not found; run `git pull` in {GFMBENCH_PATH}")

print(f"Installing GFMBench-API deps from {REQUIREMENTS}")
os.system(f"{sys.executable} -m pip install -r {REQUIREMENTS}")

# Expose the clone to this process and to any subprocesses it launches
os.environ["GFMBENCH_PATH"] = str(GFMBENCH_PATH)
os.environ["PYTHONPATH"] = str(GFMBENCH_PATH)
sys.path.insert(0, str(GFMBENCH_PATH))

print(f"GFMBENCH_PATH={GFMBENCH_PATH}")
print(f"cwd={Path.cwd()}")
```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/README.md

Installs all project dependencies using uv sync, or installs packages individually using pip.
```bash
# Install everything (recommended)
uv sync

# Or install packages individually
pip install -e sae/
pip install -e recipes/esm2/
```

--------------------------------

### Install Dependencies and Run Dashboard

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/codonfm/codon_dashboard/README.md

Installs project dependencies using npm and starts the development server for the SAE Feature Dashboard. Access the dashboard via http://localhost:5173.

```bash
npm install
npm run dev
```

--------------------------------

### Quick Start for ESM2 Recipe

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/README.md

Navigate to the ESM2 recipe directory and start the training process using a production configuration.

```bash
cd recipes/esm2
python scripts/train.py --config-name config_production
```

--------------------------------

### Install Dependencies and Run Tests for ESM2

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/docs/docs/main/getting-started/development.md

Navigate to the model directory, install requirements, and run tests. This is a common setup for standard BioNeMo models.

```bash
cd models/esm2
pip install -r requirements.txt
pytest -v .
```

--------------------------------

### Install Dependencies

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/codonfm/README.md

Install project dependencies using uv sync. This command should be run from the repository root.

```bash
# From repo root (UV workspace)
uv sync
```

--------------------------------

### Build and Run BioNeMo Recipe with Docker

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/README.md

Build a Docker image for a BioNeMo recipe and run it. This example navigates to the 'esm2_native_te' recipe directory. Ensure Docker is installed and accessible.
```bash
# Navigate to a recipe
cd recipes/esm2_native_te

# Build and run
docker build -t esm2_recipe .
docker run --rm -it --gpus all esm2_recipe python train.py
```

--------------------------------

### Install PyBigWig

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/00-Mutation-Datasets-Preprocessing.ipynb

Uncomment and run this command to install the PyBigWig library if it's not already present.

```python
# Uncomment to install PyBigWig
# !pip install pyBigWig
```

--------------------------------

### Local Development Setup

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/docs/docs/main/getting-started/index.md

Navigate to a specific recipe directory and run build and test scripts for local development.

```bash
cd recipes/evo2_megatron
bash .ci_build.sh
pytest -v .
```

--------------------------------

### Install All Recipes for Development

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/README.md

Synchronize all dependencies for the entire project, including all recipes, from the repository root.

```bash
# From repository root
uv sync
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/README.md

Installs the pre-commit git hooks into the local repository so they run automatically before each commit. The `pre-commit` tool itself must already be installed.

```bash
pre-commit install
```

--------------------------------

### Launch Single-Process Training

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_native_te/README.md

Starts a single-process training job on one GPU.
```bash
python train_fsdp2.py
```

--------------------------------

### Example Experiment Name Convention

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/docs/docs/main/getting-started/using-slurm.md

An example of how to construct an experiment name based on various training parameters for clear tracking.

```bash
EXPERIMENT_NAME=EVO2_SEQLEN${SEQ_LEN}_PP${PP_SIZE}_TP${TP_SIZE}_CP${CP_SIZE}_LR${LR}_MINLR${MIN_LR}_WU${WU_STEPS}_GA${GRAD_ACC_BATCHES}_...
```

--------------------------------

### Launch Single Process Training

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_accelerate_te/README.md

Use this command to start training on a single GPU with a specified configuration name.

```bash
python train.py --config-name=L0_sanity
```

--------------------------------

### Initialize and Run Trainer

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/sae/README.md

Initialize the Trainer with the SAE model and configuration, then start the training process.

```python
from sae.training import Trainer

trainer = Trainer(sae, config)
trainer.train(embeddings)  # embeddings: [N, input_dim]
```

--------------------------------

### Clone GFMBench-API and Install Dependencies (Linux)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/evo2_gfmbench_recipe.ipynb

Use these Linux commands to clone the GFMBench-API repository and install its basic requirements. Ensure you run these from your project's working directory.
```bash
git clone https://github.com/NVIDIA/GFMBench-api
# Use an absolute path so GFMBENCH_PATH and PYTHONPATH stay valid after `cd`
export GFMBENCH_PATH="$(pwd)/GFMBench-api"
cd "$GFMBENCH_PATH"
pip install -r basic_requirements.txt
export PYTHONPATH="$GFMBENCH_PATH"
```

--------------------------------

### Recipe Performance Benchmarks Example

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/README.md

Example markdown structure for documenting performance benchmarks of a recipe on single and multi-node configurations. Includes throughput, memory usage, model details, and batch size.

```markdown
## Performance Benchmarks

### Single Node (8x H100)

- **Throughput**: 2,500 tokens/sec
- **Memory Usage**: 45GB per GPU
- **Model**: ESM-2 650M parameters
- **Batch Size**: 32 (micro_batch_size=4, gradient_accumulation=8)

### Multi Node (2x8 H100)

- **Throughput**: 4,800 tokens/sec
- **Scaling Efficiency**: 96%
- **Network**: InfiniBand
```

--------------------------------

### HuggingFace Transformers Inference Example

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/esm2/README.md

Quick start example for performing inference using HuggingFace transformers with an ESM-2 model. Ensure you have the transformers library installed.
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nvidia/esm2_t6_8M_UR50D")
tokenizer = AutoTokenizer.from_pretrained("nvidia/esm2_t6_8M_UR50D")

gfp_P42212 = (
    "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL"
    "VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV"
    "NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD"
    "HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
)

inputs = tokenizer(gfp_P42212, return_tensors="pt")
output = model(**inputs)
```

--------------------------------

### Setup and Directory Creation

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/00-Mutation-Datasets-Preprocessing.ipynb

Initializes data directories and creates necessary subdirectories for storing downloaded datasets. Requires setting DATA_DIR and OUTPUT_DIR.

```python
import gzip
import os
import shutil
import urllib.request

import pandas as pd
import requests

# ── Set data directory ───────────────────────────────────────
DATA_DIR = ""  # <-- change this to your preferred data root
OUTPUT_DIR = ""  # output directory where all processed datasets will be saved
UCSC_API_KEY = ""  # <-- set your UCSC API key for Table Browser downloads
# ─────────────────────────────────────────────────────────────

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

for subdir in [
    "reference/hg19",
    "reference/hg38",
    "alphamissense_data",
    "ddd_asd_zhouetal",
    "clinvar_syn",
]:
    os.makedirs(os.path.join(DATA_DIR, subdir), exist_ok=True)
```

--------------------------------

### Load Pre-trained Geneformer Model (HF)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

Loads a Geneformer model variant from Hugging Face. Use this to get started with existing models.
```python
from transformers import AutoModelForMaskedLM

# Load the default model (Geneformer-V2-316M)
model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")

# Or load a specific variant
model = AutoModelForMaskedLM.from_pretrained(
    "ctheodoris/Geneformer", subfolder="Geneformer-V2-104M"
)
```

--------------------------------

### Launch Single Process Training (AMPLIFY Model)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_accelerate_te/README.md

Use this command to start training the AMPLIFY model on a single GPU with a specified configuration name.

```bash
python train.py --config-name=L0_sanity_amplify
```

--------------------------------

### Quick Start: Train a ReLUSAE Model

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/sae/README.md

Demonstrates how to set up a synthetic dataset, create a ReLUSAE model, configure training, and train the model.

```python
import torch

from sae.architectures import ReLUSAE, TopKSAE
from sae.training import Trainer, TrainingConfig
from sae.utils import get_device, set_seed

# Set random seed
set_seed(42)

# Create a synthetic dataset (replace with your embeddings)
embeddings = torch.randn(10000, 512)

# Create SAE model
sae = ReLUSAE(
    input_dim=512,
    hidden_dim=512 * 8,  # 8x expansion
    l1_coeff=1e-3,
)

# Configure training
config = TrainingConfig(
    lr=3e-4,
    n_epochs=10,
    batch_size=4096,
    device=get_device(),
)

# Train
trainer = Trainer(sae, config)
trainer.train(embeddings)
```

--------------------------------

### Quick Start: Convert and Run Mixtral HF to TE

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/mixtral/README.md

Converts a HuggingFace Mixtral model to Transformer Engine format and performs a sample inference. Ensure you are in the 'models/mixtral' directory or have installed dependencies.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from convert import convert_mixtral_hf_to_te

# Load the original HuggingFace Mixtral model
model_hf = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)

# Convert to TransformerEngine
model_te = convert_mixtral_hf_to_te(model_hf)
model_te.to("cuda")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer("The quick brown fox", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model_te.generate(**inputs, max_new_tokens=16)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

--------------------------------

### Get Transcript-Level Metadata

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/000-Annotation-File-Processing.ipynb

Extracts essential metadata for each transcript, including gene ID, gene name, chromosome, strand, transcript start and end positions, transcript type, and canonical status. This information is aggregated per transcript.
```python
# Assumes `import polars as pl` and a `protein_coding_gtf` DataFrame from earlier notebook cells.
# Get transcript-level metadata (gene info, coordinates, canonical status)
tx_starts = (
    protein_coding_gtf.filter(pl.col("feature") == "transcript")
    .group_by("transcript_id")
    .agg(
        pl.col("gene_id").first().alias("gene_id"),
        pl.col("gene_name").first().alias("gene_name"),
        pl.col("chrom").first().alias("chrom"),
        pl.col("strand").first().alias("strand"),
        pl.col("start").min().alias("tx_start"),
        pl.col("end").max().alias("tx_end"),
        pl.col("transcript_type").first().alias("transcript_type"),
        pl.col("is_canonical").first().alias("is_canonical"),
        pl.col("is_mane_select").first().alias("is_mane_select"),
    )
)
```

--------------------------------

### Install SAE Package

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/sae/README.md

Install the SAE package using pip, either in editable mode from a local checkout or directly from a git repository.

```bash
pip install -e sae/
```

```bash
pip install git+https://github.com/yourusername/biosae.git#subdirectory=sae
```

--------------------------------

### Install Individual Recipe

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/interpretability/sparse_autoencoders/recipes/README.md

Install a specific recipe package in editable mode after installing the core SAE package.

```bash
# Install core first
pip install -e sae/

# Then install recipe
pip install -e recipes/esm2/
```

--------------------------------

### Configure Data Paths and Download Models

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/3-Zero-Shot-Mutation-Variant-Clinvar-Synonymous.ipynb

Sets up input data paths, output directories, and model checkpoint locations. It also calls a utility function to download the required model checkpoints if they are not already present.
```python
# Configuration: Data paths and model checkpoints
DATA_INPUT_PATH = "/data/processed/mutation_datasets_latest/clinvar_synom.csv"
OUTPUT_DIR = "/data/validation/encodons_results/clinvar_synom"

# Model checkpoints for inference
MODEL_CHECKPOINTS = {
    "80M": "/data/checkpoints/NV-CodonFM-Encodon-TE-80M-v1",
    "600m": "/data/checkpoints/NV-CodonFM-Encodon-TE-600M-v1",
    "1B": "/data/checkpoints/NV-CodonFM-Encodon-TE-Cdwt-1B-v1",
}

from src.utils.load_checkpoint import download_checkpoint

# download models if necessary
download_checkpoint(
    repo_id="nvidia/NV-CodonFM-Encodon-TE-80M-v1",
    local_dir="/data/checkpoints/NV-CodonFM-Encodon-TE-80M-v1",
)
download_checkpoint(
    repo_id="nvidia/NV-CodonFM-Encodon-TE-600M-v1",
    local_dir="/data/checkpoints/NV-CodonFM-Encodon-TE-600M-v1",
)
# NOTE: the "1B" entry in MODEL_CHECKPOINTS points at a "Cdwt" directory, while this
# download writes to a directory without "Cdwt"; make the two paths agree before running.
download_checkpoint(
    repo_id="nvidia/NV-CodonFM-Encodon-TE-1B-v1",
    local_dir="/data/checkpoints/NV-CodonFM-Encodon-TE-1B-v1",
)

# Inference parameters
BATCH_SIZE = 64
NUM_WORKERS = 4
CONTEXT_LENGTH = 2048
```

--------------------------------

### Install MMseqs2 via Conda

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/data_scripts/data_curation/allseq_clustering_for_splits.ipynb

Installs MMseqs2 using Conda, recommended for most users. Ensure you have Conda or Miniconda installed.

```bash
conda install -c conda-forge -c bioconda mmseqs2
```

--------------------------------

### Recipe CI/CD Integration Example

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/README.md

Bash commands for building a Docker image for a recipe and running pytest for CI/CD integration. Demonstrates the standard test contract invocation.

```bash
cd recipes/my_recipe
docker build -t my_recipe .
docker run --rm -it --gpus all my_recipe pytest -v .
```

--------------------------------

### Install MMseqs2 via apt (Ubuntu)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/data_scripts/data_curation/allseq_clustering_for_splits.ipynb

Installs MMseqs2 on Ubuntu systems using the apt package manager. This may install an older version.

```bash
sudo apt-get update && sudo apt-get install -y mmseqs2
```

--------------------------------

### Install Python Dependencies

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/lora-fine-tuning-tutorial.ipynb

Installs the Python packages needed for data wrangling and evaluation: scikit-learn, datasets, matplotlib, and tensorboard. The '-q' flag keeps the installation output quiet.

```python
%pip install -q scikit-learn datasets matplotlib tensorboard
```

--------------------------------

### Set Up Directories and Prediction Commands

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/zeroshot_brca1.ipynb

Configures output directories and constructs prediction commands for reference and variant sequences. It dynamically sets FP8 and parallelism options based on GPU support and model size.
```python
# Assumes earlier notebook cells provide `Path`, `torch`, `check_fp8_support`,
# `FAST_CI_MODE`, `MODEL_SIZE`, and `checkpoint_path`.

# Define output directories for prediction results
output_dir = Path("brca1_fasta_files")
output_dir.mkdir(parents=True, exist_ok=True)

# Save reference and variant sequences to FASTA
ref_fasta_path = output_dir / "brca1_reference_sequences.fasta"
var_fasta_path = output_dir / "brca1_variant_sequences.fasta"

predict_ref_dir = output_dir / "reference_predictions"
predict_var_dir = output_dir / "variant_predictions"
predict_ref_dir.mkdir(parents=True, exist_ok=True)
predict_var_dir.mkdir(parents=True, exist_ok=True)

fp8_supported, gpu_info = check_fp8_support()
print(f"FP8 Support: {fp8_supported}")
print(gpu_info)

# Note: If FP8 is not supported, you may want to disable it in the model config
# The Evo2 config has 'use_fp8_input_projections: True' by default
if FAST_CI_MODE:
    model_subset_option = "--num-layers 4 --hybrid-override-pattern SDH*"
else:
    model_subset_option = ""

# NOTE: if you are using the 40b, 1b or 20b checkpoints that are sensitive to FP8, you should use --vortex-style-fp8
# for prediction to get good accuracy.
# fp8_option = "--vortex-style-fp8" if fp8_supported else ""
if MODEL_SIZE in ["evo2_20b", "evo2_40b"]:
    fp8_option = "--mixed-precision-recipe bf16_mixed --vortex-style-fp8" if fp8_supported else ""
else:
    fp8_option = "--mixed-precision-recipe bf16_with_fp8_current_scaling_mixed" if fp8_supported else ""

# Update predict commands to run on the full dataset
NUM_GPUS = torch.cuda.device_count()
if MODEL_SIZE == "evo2_1b_base":
    NUM_GPUS = min(NUM_GPUS, 2)  # lots of CP on small examples slows things down.
    parallelism_option = f"--context-parallel-size {NUM_GPUS}"
else:
    parallelism_option = f"--tensor-parallel-size {NUM_GPUS}"

predict_ref_command = (
    f"torchrun --standalone --nproc_per_node={NUM_GPUS} --no-python predict_evo2 --use-subquadratic-ops --fasta {ref_fasta_path} --ckpt-dir {checkpoint_path} "
    f"--output-dir {predict_ref_dir} {parallelism_option} {model_subset_option} "
    f"--pipeline-model-parallel-size 1 --output-log-prob-seqs {fp8_option}"
)
predict_var_command = (
    f"torchrun --standalone --nproc_per_node={NUM_GPUS} --no-python predict_evo2 --use-subquadratic-ops --fasta {var_fasta_path} --ckpt-dir {checkpoint_path} "
    f"--output-dir {predict_var_dir} {parallelism_option} {model_subset_option} "
    f"--pipeline-model-parallel-size 1 --output-log-prob-seqs {fp8_option}"
)
```

--------------------------------

### Launch Multi-Process Training with Accelerate

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_accelerate_te/README.md

Initiates distributed training across multiple processes using Accelerate's launch command and a specified configuration file.

```bash
accelerate launch --config_file accelerate_config/fsdp2_te.yaml \
    --num_processes 2 train.py \
    --config-name=L0_sanity
```

--------------------------------

### Launch Single-Process Training

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/opengenome2_llama_native_te/README.md

Use this command to run training on a single GPU.

```bash
python train_fsdp2.py --config-name L0_sanity
```

--------------------------------

### Install Geneformer with Pip

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

Installs the Geneformer package using pip. Navigate to the Geneformer directory first.

```bash
cd models/geneformer
pip install -e .
```

--------------------------------

### Import Libraries and Setup

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/0-Zero-Shot-Mutation-Variant-CancerHotspot.ipynb

Imports necessary Python libraries for data manipulation, machine learning, and deep learning, and sets up the environment for PyTorch and CUDA.

```python
import os
import pickle
import sys
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import polars as pl
import torch
from tqdm import tqdm

warnings.filterwarnings("ignore")

# Machine learning libraries
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import auc, precision_recall_curve, roc_curve

plt.style.use("default")
sns.set_palette("husl")

# Add project paths
sys.path.append("../")

# Import Encodon-specific modules
from src.data.metadata import MetadataFields
from src.data.mutation_dataset import MutationDataset, collate_fn
from src.data.preprocess.mutation_pred import mlm_process_item
from src.inference.encodon import EncodonInference
from src.inference.task_types import TaskTypes

print("✅ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name()}")
```

--------------------------------

### Install Geneformer for Development

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

Installs Geneformer with development dependencies, including testing tools. Ensure you are in the Geneformer directory.
```bash
cd models/geneformer
pip install -e .[test]
```

--------------------------------

### Import Libraries and Setup

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/codonfm_ptl_te/notebooks/5-EnCodon-Downstream-Task-mRFP-expression.ipynb

Imports necessary Python libraries for data manipulation, machine learning, and Encodon model inference. Sets up random seeds for reproducibility and checks PyTorch and CUDA availability.

```python
import os
import sys
import warnings

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm

warnings.filterwarnings("ignore")

# ML libraries
# Visualization
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

# Add project paths
sys.path.append("..")

# Import Encodon modules
# Import additional modules for dataset handling
from torch.utils.data import DataLoader

from src.data.codon_bert_dataset import CodonBertDataset
from src.data.metadata import MetadataFields
from src.data.preprocess.codon_sequence import process_item
from src.inference.encodon import EncodonInference
from src.inference.task_types import TaskTypes

# Fix random seed
torch.manual_seed(42)
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```

--------------------------------

### Install DeepEP for Fused Token Router

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/mixtral/README.md

Install the DeepEP library using the provided bash script to enable the high-performance FusedTokenRouter.
```bash
bash install_hybridep.sh
```

--------------------------------

### Install Model Package in Editable Mode

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/llama3/README.md

Install the model package in editable mode within the development container to enable testing with pytest.

```bash
pip install -e .
```

--------------------------------

### Install Development Dependencies with Constraints

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/geneformer/README.md

Installs development dependencies for Geneformer with `PIP_CONSTRAINT` set to an empty value for this one command, which overrides any constraints file inherited from the environment. Change directory to models/geneformer first.

```bash
cd models/geneformer
PIP_CONSTRAINT= pip install -e .[test]
```

--------------------------------

### Set Up Training Data Configuration

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/fine-tuning-tutorial.ipynb

Configures and writes a YAML file for preprocessing training data. It specifies dataset paths, output directories, and various preprocessing parameters.
```python
import os

from bionemo.evo2.data.dataset_tokenizer import (
    DEFAULT_HF_TOKENIZER_MODEL_PATH,  # use the 512 size for historical reasons
)

# `concat_path` is assumed to be defined by an earlier notebook cell.
full_fasta_path = os.path.abspath(concat_path)
output_dir = os.path.abspath("preprocessed_data")

output_yaml = f"""
- datapaths: ["{full_fasta_path}"]
  output_dir: "{output_dir}"
  output_prefix: chr20_21_22_uint8_distinct
  train_split: 0.9
  valid_split: 0.05
  test_split: 0.05
  overwrite: True
  embed_reverse_complement: true
  random_reverse_complement: 0.0
  random_lineage_dropout: 0.0
  include_sequence_id: false
  transcribe: "back_transcribe"
  force_uppercase: false
  indexed_dataset_dtype: "uint8"
  hf_tokenizer_model_path: {DEFAULT_HF_TOKENIZER_MODEL_PATH}
  pretrained_tokenizer_model: null
  special_tokens: null
  fast_hf_tokenizer: true
  append_eod: true
  enforce_sample_length: null
  ftfy: false
  workers: 1
  preproc_concurrency: 100000
  chunksize: 25
  drop_empty_sequences: true
  nnn_filter: false  # If you split your fasta on NNN (in human these are contigs), then you should set this to true.
  seed: 12342  # Not relevant because we are not using random reverse complement or lineage dropout.
"""

with open("preprocess_config.yaml", "w") as f:
    print(output_yaml, file=f)
```

--------------------------------

### Install GFMBench API Requirements

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/evo2_megatron/examples/evo2_gfmbench_recipe.ipynb

Installs the necessary Python packages for the GFMBench API. The absolute path shown is specific to the machine the notebook was run on; substitute the path to your own GFMBench-api clone. It's recommended to use a virtual environment to avoid permission issues.

```bash
pip install -r /data/sense/alarey/code/bionemo-framework/bionemo-recipes/recipes/evo2_megatron/examples/GFMBench-api/basic_requirements.txt
```

--------------------------------

### Install vLLM Post-Build

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/vllm_inference/esm2/README.md

Installs vLLM inside a container started from the already-built Docker image. The script auto-detects the GPU architecture or accepts an explicit architecture argument.

```bash
docker build -t esm2 .
docker run --rm -it --gpus all esm2 bash -c "./install_vllm.sh"
```

```bash
# or with an explicit architecture:
docker run --rm -it --gpus all esm2 bash -c "./install_vllm.sh 9.0"
```

--------------------------------

### Install Dependencies Manually (CUDA 13.0)

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/recipes/esm2_native_te/README.md

Installs dependencies manually in a Python environment with CUDA 13.0 support. This includes PyTorch, Transformer Engine, and Flash Attention.

```bash
uv venv --python 3.12 --seed /workspace/.venv
source /workspace/.venv/bin/activate
uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130
uv pip install wheel packaging psutil
pip install --no-build-isolation "flash-attn>=2.1.1,<=2.8.1"
pip install --no-build-isolation transformer-engine[pytorch]==2.9.0
uv pip install -r /requirements.txt
```

--------------------------------

### Initialize Qwen3 Model with FP4 Recipe

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/qwen/README.md

Initializes a Qwen3 model using an `NVFP4BlockScaling` FP4 recipe, setting all layers to FP4 precision.

```python
# Assumes `te_recipe`, `NVQwen3Config`, and `NVQwen3ForCausalLM` are already imported.
fp4_recipe = te_recipe.NVFP4BlockScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,  # one entry per layer
)
model = NVQwen3ForCausalLM(config, fp4_recipe=fp4_recipe)
```

--------------------------------

### Install Dependencies in Dev Container

Source: https://github.com/nvidia-bionemo/bionemo-framework/blob/main/models/qwen/README.md

Install project dependencies within the development container using the provided requirements file. This is a prerequisite for running tests inside the container.

```bash
pip install -r requirements.txt
```
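
After a `pip install -r requirements.txt` step like the one above, it can be useful to confirm that the listed packages actually ended up in the active environment before running tests. The following is a minimal sketch and not part of the BioNeMo repository: `missing_requirements` is a hypothetical helper, and it assumes simple requirement lines such as `name`, `name==1.2`, or `name[extra]>=1.2`. Comments, blank lines, pip options, and URL-style requirements are skipped rather than parsed.

```python
import re
from importlib import metadata


def missing_requirements(lines):
    """Return distribution names from requirement lines that are not installed.

    Hypothetical helper for illustration only; it does not check versions,
    environment markers, or URL / VCS requirements.
    """
    missing = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace
        line = raw.split("#", 1)[0].strip()
        # Skip blanks, pip options (e.g. "-r other.txt"), and URL-style entries
        if not line or line.startswith("-") or "://" in line:
            continue
        # The distribution name is the leading run of name characters
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match is None:
            continue
        name = match.group(0)
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

A typical use would be to pass `Path("requirements.txt").read_text().splitlines()` to the helper and fail early if the returned list is non-empty.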