### Install CellFlow Framework

Source: https://context7.com/theislab/cellflow/llms.txt

Instructions for installing the CellFlow package via pip or setting up a development environment from source.

```bash
pip install cellflow-tools
```

```bash
git clone https://github.com/theislab/cellflow
cd cellflow
pip install -e .
```

--------------------------------

### Install CellFlow from PyPI

Source: https://github.com/theislab/cellflow/blob/main/docs/installation.rst

Installs the stable version of the cellflow-tools package from the Python Package Index (PyPI). This is the recommended method for most users.

```bash
pip install cellflow-tools
```

--------------------------------

### Setup Training Metrics Callbacks

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/500_combosciplex.ipynb

Initializes callbacks for monitoring model training: energy distance and MMD in the embedding space, and R-squared in gene expression space.

```python
metrics_callback = cellflow.training.Metrics(metrics=["mmd", "e_distance"])
decoded_metrics_callback = cellflow.training.PCADecodedMetrics(
    ref_adata=adata_train, metrics=["r_squared"]
)
callbacks = [metrics_callback, decoded_metrics_callback]
```

--------------------------------

### Install CellFlow Development Version from GitHub

Source: https://github.com/theislab/cellflow/blob/main/docs/installation.rst

Installs the latest development version of CellFlow directly from its GitHub repository. This is useful for accessing the newest features or contributing to the project.

```bash
pip install git+https://github.com/theislab/CellFlow.git@main
```

--------------------------------

### Load Built-in Datasets in Python

Source: https://context7.com/theislab/cellflow/llms.txt

Utility functions to load preprocessed datasets directly from Hugging Face. These datasets are suitable for tutorials, examples, and testing CellFlow functionality. Examples include the PBMC cytokine, iNeurons, and zebrafish embryo datasets.

```python
import cellflow

# Load PBMC cytokine dataset (10M cells, 12 donors, 90 cytokines)
adata_pbmc = cellflow.datasets.pbmc_cytokines(force_download=False)

# Load iNeurons dataset for genetic perturbations
adata_ineurons = cellflow.datasets.ineurons(force_download=False)

# Load zebrafish embryo dataset with genetic perturbations
adata_zesta = cellflow.datasets.zesta(force_download=False)
```

--------------------------------

### Setup Metrics Callback for Training

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/200_zebrafish.ipynb

Initializes a Metrics callback for the CellFlow training process. This callback is configured to compute and log 'mmd' (Maximum Mean Discrepancy) and 'e_distance' (Energy Distance) during training.

```python
metrics_callback = cellflow.training.Metrics(metrics=["mmd", "e_distance"])
callbacks = [metrics_callback]
```

--------------------------------

### Install CellFlow in editable mode

Source: https://github.com/theislab/cellflow/blob/main/README.md

This command installs CellFlow in editable mode, which is useful for developers who want to modify the library's source code. It requires cloning the repository first.

```bash
git clone https://github.com/theislab/cellflow
cd cellflow
pip install -e .
```

--------------------------------

### Initialize CellFlow Training Metrics Callback (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/201_zebrafish_continuous.ipynb

Initializes a metrics callback for CellFlow training to compute and log specific metrics during the training process. This example configures the callback to compute 'mmd' (Maximum Mean Discrepancy) and 'e_distance' (Energy Distance).

```python
import cellflow.training

metrics_callback = cellflow.training.Metrics(metrics=["mmd", "e_distance"])
callbacks = [metrics_callback]
```

--------------------------------

### Import CellFlow and Dependencies

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/500_combosciplex.ipynb

Initializes the environment by importing necessary libraries for data manipulation, visualization, and deep learning, including CellFlow, JAX, and Flax.

```python
import warnings
from pandas.errors import SettingWithCopyWarning
import matplotlib
%matplotlib inline

warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", SettingWithCopyWarning)

import numpy as np
import pandas as pd
import seaborn as sns
import jax
from tqdm import tqdm
import functools
import matplotlib.pyplot as plt
import anndata as ad
import scanpy as sc
import rapids_singlecell as rsc
import flax.linen as nn
import optax
import pertpy

import cellflow
from cellflow.model import CellFlow
import cellflow.preprocessing as cfpp
from cellflow.utils import match_linear
from cellflow.plotting import plot_condition_embedding
from cellflow.preprocessing import (
    transfer_labels,
    compute_wknn,
    centered_pca,
    project_pca,
    reconstruct_pca,
    annotate_compounds,
    get_molecular_fingerprints,
)
from cellflow.metrics import compute_r_squared, compute_e_distance
```

--------------------------------

### POST /model/prepare

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/201_zebrafish_continuous.ipynb

Initializes the CellFlow model architecture with specific layers, pooling strategies, and matching functions.

```APIDOC
## POST /model/prepare

### Description
Configures the architecture of the CellFlow model, including embedding layers, pooling mechanisms, and the velocity field regression parameters.

### Method
POST

### Endpoint
/model/prepare

### Parameters
#### Request Body
- **condition_mode** (string) - Required - Mode for condition embeddings (e.g., "deterministic").
- **pooling** (string) - Required - Aggregation method (e.g., "attention_token").
- **layers_before_pool** (dict) - Required - Configuration for layers processing input features before pooling.
- **layers_after_pool** (dict) - Required - Configuration for layers processing pooled features.
- **condition_embedding_dim** (int) - Required - Latent space dimension.
- **cond_output_dropout** (float) - Required - Dropout rate for condition embeddings.
- **probability_path** (dict) - Required - Reference vector field configuration.
- **match_fn** (callable) - Required - Function to sample pairs between control and perturbed cells.

### Request Example
{
  "condition_mode": "deterministic",
  "pooling": "attention_token",
  "condition_embedding_dim": 256,
  "cond_output_dropout": 0.9
}

### Response
#### Success Response (200)
- **status** (string) - Confirmation of model preparation.
```

--------------------------------

### Get Condition Embedding API

Source: https://context7.com/theislab/cellflow/llms.txt

Retrieves learned condition embeddings from the trained model.

```APIDOC
## get_condition_embedding

### Description
Retrieves the learned condition embeddings from the trained model for visualization and analysis.

### Method
`cf.get_condition_embedding()`

### Parameters
- **covariate_data** (DataFrame) - Required - DataFrame containing condition information.
- **condition_id_key** (str) - Required - Column name in `covariate_data` identifying conditions.
- **key_added** (str) - Optional - Key to store the embeddings in `adata.uns`.
```

--------------------------------

### Save and Load CellFlow Model (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Demonstrates how to save the trained CellFlow model to disk and subsequently load it back.
This is essential for persisting model states and reusing trained models.

```python
cf.save("cellflow_model/", overwrite=True)
cf = cellflow.model.CellFlow.load("cellflow_model/")
```

--------------------------------

### Configure CellFlow Training Callbacks

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/100_pbmc.ipynb

Initializes metrics and logging callbacks for training. It demonstrates how to use PCADecodedMetrics and WandbLogger to track model performance.

```python
metrics_callback = cellflow.training.Metrics(metrics=["r_squared", "mmd", "e_distance"])
decoded_metrics_callback = cellflow.training.PCADecodedMetrics(
    ref_adata=adata_train, metrics=["r_squared"]
)
wandb_callback = cellflow.training.WandbLogger(
    project="cellflow_tutorials", out_dir="~", config={"name": "100m_pbmc"}
)
callbacks = [metrics_callback, decoded_metrics_callback]
```

--------------------------------

### Configure Neural Network Architecture

Source: https://context7.com/theislab/cellflow/llms.txt

Sets up the neural network components including the condition encoder architecture and the matching function for optimal transport.

```python
import functools
import optax
import flax.linen as nn
from cellflow.utils import match_linear

layers_before_pool = {
    "drug": {"layer_type": "mlp", "dims": [1024, 1024], "dropout_rate": 0.5},
    "cell_type": {"layer_type": "mlp", "dims": [256, 256], "dropout_rate": 0.0},
}
layers_after_pool = {"layer_type": "mlp", "dims": [1024, 1024], "dropout_rate": 0.0}
match_fn = functools.partial(match_linear, epsilon=0.5, tau_a=1.0, tau_b=1.0)
```

--------------------------------

### Initialize CellFlow Model

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/500_combosciplex.ipynb

Prepares the CellFlow model with specific architecture parameters, including condition modes, pooling strategies, and probability path settings.

```python
cf.prepare_model(
    condition_mode="deterministic",
    regularization=0.0,
    pooling="mean",
    layers_before_pool=layers_before_pool,
    layers_after_pool=layers_after_pool,
    condition_embedding_dim=64,
    cond_output_dropout=0.9,
    hidden_dims=[2048, 2048, 2048],
    conditioning="concatenation",
    decoder_dims=[4096, 4096, 4096],
    probability_path={"constant_noise": 1.5},
    match_fn=match_fn,
    linear_projection_before_concatenation=True,
)
```

--------------------------------

### Prepare Validation Data

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/201_zebrafish_continuous.ipynb

Sets up validation datasets for monitoring model performance during training.

```APIDOC
## POST /model/prepare_validation_data

### Description
Prepares validation data for the model, allowing for monitoring at specific training intervals.

### Method
POST

### Parameters
#### Request Body
- **adata** (AnnData) - Required - The validation dataset.
- **name** (string) - Required - Name of the validation set.
- **n_conditions_on_log_iteration** (int) - Optional - Number of conditions to log during iteration.
- **n_conditions_on_train_end** (int) - Optional - Number of conditions to log at training end.

### Request Example
```python
cf.prepare_validation_data(
    adata_test,
    name="test",
    n_conditions_on_log_iteration=None,
    n_conditions_on_train_end=None
)
```
```

--------------------------------

### Prepare CellFlow Model (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/100_pbmc.ipynb

Prepares the CellFlow model by setting various configuration parameters. This includes defining the condition mode, regularization, pooling strategy, network layer architectures, embedding dimensions, dropout rates, time encoding, hidden layer dimensions, conditioning method, decoder dimensions, activation functions, probability path definition, matching function, optimizer, and other architectural choices.

```python
import flax.linen as nn
import optax

# `cf` is an initialized CellFlow model instance (not the `cellflow` module);
# `layers_before_pool`, `layers_after_pool`, and `match_fn` are defined beforehand.
cf.prepare_model(
    condition_mode="deterministic",
    regularization=0.0,
    pooling="attention_token",
    pooling_kwargs={},
    layers_before_pool=layers_before_pool,
    layers_after_pool=layers_after_pool,
    condition_embedding_dim=256,
    cond_output_dropout=0.9,
    condition_encoder_kwargs={},
    pool_sample_covariates=True,
    time_freqs=1024,
    time_encoder_dims=[1024, 1024, 1024],
    time_encoder_dropout=0.0,
    hidden_dims=[2048, 2048, 2048],
    hidden_dropout=0.0,
    conditioning="concatenation",
    decoder_dims=[4096, 4096, 4096],
    vf_act_fn=nn.silu,
    vf_kwargs=None,
    probability_path={"constant_noise": 0.5},
    match_fn=match_fn,
    optimizer=optax.MultiSteps(optax.adam(5e-5), 20),
    solver_kwargs={},
    layer_norm_before_concatenation=False,
    linear_projection_before_concatenation=False,
)
```

--------------------------------

### POST /model/prepare

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Configures the architecture and hyperparameters for the CellFlow model.

```APIDOC
## POST /model/prepare

### Description
Initializes the CellFlow model architecture, defining layers, pooling strategies, and matching functions for condition embeddings.

### Method
POST

### Endpoint
cellflow.model.CellFlow.prepare_model

### Parameters
#### Request Body
- **condition_embedding_dim** (int) - Required - Dimension of the condition embedding.
- **time_encoder_dims** (list) - Required - List of hidden dimensions for the time encoder.
- **hidden_dims** (list) - Required - Dimensions for the velocity field hidden layers.
- **pooling** (string) - Required - Pooling strategy (e.g., "mean").
- **layers_before_pool** (list) - Required - Network layers before permutation-invariant pooling.
- **flow** (dict) - Required - Configuration for the reference vector field.

### Request Example
{
  "condition_embedding_dim": 128,
  "pooling": "mean",
  "hidden_dims": [2048, 2048, 128]
}

### Response
#### Success Response (200)
- **status** (string) - Confirmation of model preparation.
```

--------------------------------

### Configure CellFlow Matching Function

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/500_combosciplex.ipynb

Sets up the matching function using partial application to define parameters for linear matching between control and perturbed cell populations.

```python
match_fn = functools.partial(match_linear, epsilon=1.0, tau_a=1.0, tau_b=1.0)
```

--------------------------------

### Training Callbacks Configuration

Source: https://context7.com/theislab/cellflow/llms.txt

Configures various callbacks for monitoring training progress, including embedding space metrics, PCA-decoded metrics, VAE-decoded metrics, and Weights & Biases logging.

```APIDOC
## POST /training/callbacks

### Description
Defines the callback objects used during the CellFlow training process to track performance and log experiments.

### Parameters
#### Request Body
- **metrics** (list) - Optional - List of metric names to compute (e.g., r_squared, mmd, sinkhorn_div, e_distance).
- **metric_aggregations** (list) - Optional - Aggregation methods (e.g., mean, median).
- **ref_adata** (AnnData) - Optional - Reference data for PCA decoding.
- **vae** (Model) - Optional - VAE model for decoding metrics.
- **log_prefix** (string) - Optional - Prefix for logged metrics.

### Request Example
{
  "metrics": ["r_squared", "e_distance"],
  "metric_aggregations": ["mean"]
}
```

--------------------------------

### Running CellFlow Model Initialization

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Initializes the CellFlow model using the prepared training data and specifies the solver to be used.

```APIDOC
## Running CellFlow

Now we are ready to set up the `CellFlow` model.
We use the default deterministic `otfm` solver for this task.

### Code Example
```python
cf = cellflow.model.CellFlow(adata_train_full, solver="otfm")
```
```

--------------------------------

### Initialize CellFlow Environment

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/200_zebrafish.ipynb

Imports necessary libraries for data manipulation, modeling, and plotting, while suppressing common warnings.

```python
import warnings
from pandas.errors import SettingWithCopyWarning

warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", SettingWithCopyWarning)

import numpy as np
import pandas as pd
import seaborn as sns
import jax
import functools
import matplotlib.pyplot as plt
import anndata as ad
import scanpy as sc
import rapids_singlecell as rsc
import flax.linen as nn
import optax

import cellflow
from cellflow.model import CellFlow
import cellflow.preprocessing as cfpp
from cellflow.utils import match_linear
from cellflow.plotting import plot_condition_embedding
from cellflow.preprocessing import (
    transfer_labels,
    compute_wknn,
    centered_pca,
    project_pca,
    reconstruct_pca,
)
from cellflow.metrics import compute_r_squared, compute_e_distance
```

--------------------------------

### Data Preparation for Metrics Calculation

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/500_combosciplex.ipynb

This section explains how to prepare dictionaries of data for calculating various metrics, such as energy distance and R-squared. It separates data by condition for detailed analysis.

```APIDOC
## POST /api/cellflow/prepare_metrics_data

### Description
Organizes AnnData objects into dictionaries keyed by condition, separating data into encoded (PCA) and decoded (gene expression) spaces for metric calculation.

### Method
POST

### Endpoint
/api/cellflow/prepare_metrics_data

### Parameters
#### Request Body
- **adata_test** (AnnData) - Required - AnnData object for the test set.
- **adata_test_recon** (AnnData) - Required - AnnData object for the reconstructed test set.
- **adata_preds** (AnnData) - Required - AnnData object for the predicted data.
- **adata_train** (AnnData) - Required - AnnData object used for training/reference.

### Request Example
```python
test_data_target_encoded = {}
test_data_target_decoded = {}
# ... (similar dictionaries for recon and predicted data)

for cond in adata_preds.obs["condition"].unique():
    test_data_target_encoded[cond] = adata_test[adata_test.obs["condition"] == cond].obsm["X_pca"]
    test_data_target_decoded[cond] = adata_test[adata_test.obs["condition"] == cond].X.toarray()
    # ... (populate other dictionaries)
```

### Response
#### Success Response (200)
- **prepared_data** (dict) - A dictionary containing multiple dictionaries for encoded, decoded, reconstructed, and predicted data, separated by condition.

#### Response Example
```json
{
  "test_data_target_encoded": {"cond1": [...], "cond2": [...]},
  "test_data_target_decoded": {"cond1": [...], "cond2": [...]},
  "test_data_target_encoded_reconstructed": {"cond1": [...], "cond2": [...]},
  "test_data_target_decoded_reconstructed": {"cond1": [...], "cond2": [...]},
  "test_data_target_encoded_predicted": {"cond1": [...], "cond2": [...]},
  "test_data_target_decoded_predicted": {"cond1": [...], "cond2": [...]}
}
```
```

--------------------------------

### Execute Complete CellFlow Workflow

Source: https://context7.com/theislab/cellflow/llms.txt

A full end-to-end pipeline demonstrating data preprocessing, model initialization, training, and evaluation of cell perturbation predictions.

```python
import functools

import cellflow
import cellflow.preprocessing as cfpp
from cellflow.model import CellFlow
from cellflow.utils import match_linear

adata = cellflow.datasets.pbmc_cytokines()
# Condensed sketch: `adata_train`, `adata_control`, and `covariate_data` are assumed to be
# derived from `adata` beforehand, and the `cf.prepare_data(...)` step is omitted here.
cfpp.centered_pca(adata_train, n_comps=100)
cf = CellFlow(adata_train, solver="otfm")
cf.prepare_model(
    condition_embedding_dim=256,
    hidden_dims=[2048, 2048, 2048],
    match_fn=functools.partial(match_linear, epsilon=0.5),
)
cf.train(
    num_iterations=100000,
    batch_size=1024,
    callbacks=[cellflow.training.Metrics(metrics=["r_squared", "e_distance"])],
)
predictions = cf.predict(
    adata=adata_control, covariate_data=covariate_data, condition_id_key="condition"
)
```

--------------------------------

### Prepare Training Data

Source: https://context7.com/theislab/cellflow/llms.txt

Configures the data loader by defining perturbation and sample covariates, embeddings, and control definitions required for training the model.

```python
import numpy as np

adata.uns["drug_embeddings"] = {
    "DrugA": np.array([0.1, 0.2, 0.3]),
    "DrugB": np.array([0.4, 0.5, 0.6]),
    "DrugC": np.array([-0.2, 0.3, 0.0]),
}
adata.uns["cell_type_embeddings"] = {
    "cell_typeA": np.array([0.0, 1.0]),
    "cell_typeB": np.array([0.0, 2.0]),
}

cf.prepare_data(
    sample_rep="X_pca",
    control_key="is_control",
    perturbation_covariates={
        "drug": ("drug_1", "drug_2"),
        "dose": ("dose_1", "dose_2"),
    },
    perturbation_covariate_reps={"drug": "drug_embeddings"},
    sample_covariates=["cell_type"],
    sample_covariate_reps={"cell_type": "cell_type_embeddings"},
    split_covariates=["cell_type"],
    max_combination_length=2,
    null_value=0.0,
)
```

--------------------------------

### Configure Validation Data

Source: https://context7.com/theislab/cellflow/llms.txt

Prepares validation datasets by subsampling conditions and setting evaluation parameters to monitor training progress.

```python
import scanpy as sc
import anndata as ad

# Keep at most 1000 cells per validation condition
adata_val_subsampled = []
for cond in adata_val.obs["condition"].unique():
    adata_val_subsampled.append(
        sc.pp.subsample(adata_val[adata_val.obs["condition"] == cond], n_obs=1000, copy=True)
    )
adata_val_for_validation = ad.concat(adata_val_subsampled)
adata_val_for_validation.uns = adata.uns.copy()

cf.prepare_validation_data(
    adata_val_for_validation,
    name="validation",
    n_conditions_on_log_iteration=10,
    n_conditions_on_train_end=None,
    predict_kwargs={"n_steps": 100},
)
```

--------------------------------

### Configure Training Callbacks

Source: https://context7.com/theislab/cellflow/llms.txt

Defines various callback mechanisms for tracking model performance, including embedding space metrics, PCA/VAE decoded metrics, and Weights & Biases logging.

```python
from cellflow.training import Metrics, PCADecodedMetrics, VAEDecodedMetrics, WandbLogger

metrics_cb = Metrics(
    metrics=["r_squared", "mmd", "sinkhorn_div", "e_distance"],
    metric_aggregations=["mean", "median"],
)
pca_cb = PCADecodedMetrics(
    ref_adata=adata_train, metrics=["r_squared", "e_distance"], log_prefix="gene_space_"
)
# `scvi_model` is assumed to be a trained VAE; `vae_cb` is defined here but not added to `callbacks`.
vae_cb = VAEDecodedMetrics(
    vae=scvi_model, adata=adata_train, metrics=["r_squared"], log_prefix="vae_decoded_"
)
wandb_cb = WandbLogger(
    project="my_project", out_dir="./wandb_logs", config={"lr": 5e-5, "batch_size": 1024}
)
callbacks = [metrics_cb, pca_cb, wandb_cb]
```

--------------------------------

### Prepare CellFlow Model Architecture (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/201_zebrafish_continuous.ipynb

Prepares the CellFlow model by specifying its architecture and training parameters. This includes defining the condition mode, regularization, pooling strategy, layer configurations, embedding dimensions, dropout rates, hidden and decoder dimensions, probability path, and the matching function.

```python
# `cf` is an initialized CellFlow model instance (not the `cellflow` module).
# Assuming layers_before_pool, layers_after_pool, and match_fn are defined as above
cf.prepare_model(
    condition_mode="deterministic",
    regularization=0.0,
    pooling="attention_token",
    layers_before_pool=layers_before_pool,
    layers_after_pool=layers_after_pool,
    condition_embedding_dim=256,
    cond_output_dropout=0.9,
    hidden_dims=[2048, 2048, 2048],
    decoder_dims=[4096, 4096, 4096],
    probability_path={"constant_noise": 0.5},
    match_fn=match_fn,
)
```

--------------------------------

### Initialize CellFlow Model

Source: https://context7.com/theislab/cellflow/llms.txt

Configures the model architecture, including pooling strategies, layer dimensions, conditioning mechanisms, and optimizer settings.

```python
cf.prepare_model(
    condition_mode="deterministic",
    regularization=0.0,
    pooling="attention_token",
    pooling_kwargs={},
    layers_before_pool=layers_before_pool,
    layers_after_pool=layers_after_pool,
    condition_embedding_dim=256,
    cond_output_dropout=0.9,
    time_freqs=1024,
    time_encoder_dims=[1024, 1024, 1024],
    time_encoder_dropout=0.0,
    hidden_dims=[2048, 2048, 2048],
    hidden_dropout=0.0,
    conditioning="concatenation",
    decoder_dims=[4096, 4096, 4096],
    decoder_dropout=0.0,
    vf_act_fn=nn.silu,
    probability_path={"constant_noise": 0.5},
    match_fn=match_fn,
    optimizer=optax.MultiSteps(optax.adam(5e-5), 20),
    seed=0,
)
```

--------------------------------

### Initialize CellFlow Model

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/200_zebrafish.ipynb

Initializes the CellFlow model using the training data and specifies the flow matching solver.

```python
cf = CellFlow(adata_train, solver="otfm")
```

--------------------------------

### POST /training/metrics

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/201_zebrafish_continuous.ipynb

Configures training callbacks to compute and log metrics such as energy distance and MMD.

```APIDOC
## POST /training/metrics

### Description
Sets up the metrics callback to track model performance during the training process.

### Method
POST

### Endpoint
/training/metrics

### Parameters
#### Request Body
- **metrics** (list) - Required - List of metrics to compute (e.g., ["mmd", "e_distance"]).

### Request Example
{
  "metrics": ["mmd", "e_distance"]
}

### Response
#### Success Response (200)
- **callback_id** (string) - Identifier for the configured metrics callback.
```

--------------------------------

### Define Training Callbacks in Python

Source: https://context7.com/theislab/cellflow/llms.txt

Provides base classes and specific implementations for creating custom callbacks during model training. These include metrics calculation (e.g., PCA decoded, VAE decoded), logging utilities, and general computation callbacks for integrating custom logic into the training loop.

```python
from cellflow.training import (
    Metrics,
    PCADecodedMetrics,
    VAEDecodedMetrics,
    WandbLogger,
    ComputationCallback,
    LoggingCallback,
)
```

--------------------------------

### Initialize CellFlow Model

Source: https://context7.com/theislab/cellflow/llms.txt

Initializes the CellFlow model using an AnnData object. Users can choose between the OTFM solver for deterministic mappings or the GENOT solver for stochastic predictions.

```python
import cellflow
from cellflow.model import CellFlow
import anndata as ad

adata = ad.read_h5ad("perturbation_data.h5ad")

cf = CellFlow(adata, solver="otfm")
cf_stochastic = CellFlow(adata, solver="genot")
```

--------------------------------

### CellFlow Training Metrics Callback

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/500_combosciplex.ipynb

This section describes how to set up callbacks for computing and logging metrics during CellFlow model training.

```APIDOC
## POST /api/cellflow/training/metrics

### Description
Configures callbacks to compute and log metrics during model training, including energy distance, MMD, and R-squared.

### Method
POST

### Endpoint
/api/cellflow/training/metrics

### Parameters
#### Request Body
- **metrics_callback** (dict) - Configuration for the primary metrics callback (e.g., `{"metrics": ["mmd", "e_distance"]}`).
- **decoded_metrics_callback** (dict) - Configuration for the PCA decoded metrics callback (e.g., `{"ref_adata": "adata_train", "metrics": ["r_squared"]}`).

### Request Example
```json
{
  "metrics_callback": {"metrics": ["mmd", "e_distance"]},
  "decoded_metrics_callback": {"ref_adata": "adata_train", "metrics": ["r_squared"]}
}
```

### Response
#### Success Response (200)
- **callbacks** (list) - A list containing the configured metric callbacks.

#### Response Example
```json
{
  "callbacks": [
    {
      "type": "Metrics",
      "metrics": ["mmd", "e_distance"]
    },
    {
      "type": "PCADecodedMetrics",
      "ref_adata": "adata_train",
      "metrics": ["r_squared"]
    }
  ]
}
```
```

--------------------------------

### Initialize CellFlow model

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Initializes the CellFlow model object using the prepared AnnData object and the deterministic otfm solver.

```python
cf = cellflow.model.CellFlow(adata_train_full, solver="otfm")
```

--------------------------------

### Prepare CellFlow Data Handling

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/201_zebrafish_continuous.ipynb

Configures the data handling for the CellFlow model using the `prepare_data` method. This involves specifying representations for cellular states, control cells, perturbations, sample context, and mapping strategies. It also sets parameters like maximum combination length and null value for missing perturbations.

```python
cf.prepare_data(
    sample_rep="X_aligned",
    control_key="first_t_control",
    perturbation_covariates={"genetic_perturbation": ("gene_target_1", "gene_target_2")},
    perturbation_covariate_reps={"genetic_perturbation": "gene_embeddings"},
    sample_covariates=("logtime",),
    split_covariates=None,
    max_combination_length=2,
    null_value=0.0,
)
```

--------------------------------

### Prepare CellFlow Model Architecture

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/200_zebrafish.ipynb

Prepares the CellFlow model by setting various architectural parameters including condition mode, regularization, pooling strategy, layer configurations, embedding dimensions, dropout rates, hidden and decoder dimensions, probability path, and the matching function.

```python
cf.prepare_model(
    condition_mode="deterministic",
    regularization=0.0,
    pooling="attention_token",
    layers_before_pool=layers_before_pool,
    layers_after_pool=layers_after_pool,
    condition_embedding_dim=256,
    cond_output_dropout=0.9,
    hidden_dims=[2048, 2048, 2048],
    decoder_dims=[4096, 4096, 4096],
    probability_path={"constant_noise": 0.5},
    match_fn=match_fn,
)
```

--------------------------------

### CellFlow Workflow Execution

Source: https://context7.com/theislab/cellflow/llms.txt

Demonstrates the end-to-end workflow including data loading, preprocessing, model initialization, training, and prediction.

```APIDOC
## POST /workflow/execute

### Description
Executes the full pipeline: data loading, PCA projection, model preparation, training, and prediction.

### Parameters
#### Request Body
- **num_iterations** (int) - Required - Number of training steps.
- **batch_size** (int) - Required - Training batch size.
- **condition_id_key** (string) - Required - Key for identifying conditions in prediction.

### Request Example
{
  "num_iterations": 100000,
  "batch_size": 1024,
  "condition_id_key": "condition"
}

### Response
#### Success Response (200)
- **predictions** (dict) - Dictionary of predicted embeddings per condition.
```

--------------------------------

### Manage Model Persistence

Source: https://context7.com/theislab/cellflow/llms.txt

Saves trained models to disk and loads them back into the environment for inference or further analysis.

```python
cf.save(dir_path="./models", file_prefix="pbmc_cytokine", overwrite=True)
cf_loaded = CellFlow.load("./models/pbmc_cytokine_CellFlow.pkl")
```

--------------------------------

### Generating a Source Distribution

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

This section describes how to generate a source distribution for training the CellFlow model. It involves creating a random distribution based on subsampled means of the training data when control conditions are not suitable.

```APIDOC
## Generating a Source Distribution

This section describes how to generate a source distribution for training the CellFlow model. It involves creating a random distribution based on subsampled means of the training data when control conditions are not suitable.

### Code Example
```python
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
from scipy.sparse import csr_matrix

# `adata`, `adata_train`, `adata_eval`, and `morphogens` are assumed to be defined earlier.
n_src_cells = 10000
n_samples = 1000
sample_rep = "X_pca"

# Each synthetic source cell is the mean of a random subsample of training cells
samples = []
for _ in range(n_src_cells):
    sample = adata_train.obsm[sample_rep][
        np.random.choice(adata_train.n_obs, n_samples),
    ].mean(axis=0)
    samples.append(sample)
samples = np.array(samples)

samples_obs = pd.DataFrame(
    {col: 0.0 for col in [mol + "_conc" for mol in morphogens]},
    index=range(samples.shape[0]),
)
samples_obs["dataset"] = "CTRL"
samples_obs["media"] = "CTRL"
samples_obs["condition"] = "CTRL"

adata_ctrl = sc.AnnData(
    X=csr_matrix(np.zeros((samples.shape[0], adata_train.n_vars))),
    obs=samples_obs,
)
adata_ctrl.obsm[sample_rep] = samples
adata_ctrl.var_names = adata_train.var_names

adata_train_full = ad.concat([adata_train, adata_ctrl], join="outer")
adata_train_full.obs["CTRL"] = adata_train_full.obs["dataset"] == "CTRL"
adata_ctrl.obs["CTRL"] = True
adata_train_full.uns, adata_eval.uns = adata.uns, adata.uns
```
```

--------------------------------

### POST /model/train

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Executes the training process for the prepared CellFlow model.

```APIDOC
## POST /model/train

### Description
Trains the CellFlow model for a specified number of iterations.

### Method
POST

### Endpoint
cellflow.model.CellFlow.train

### Parameters
#### Request Body
- **num_iterations** (int) - Required - Number of training iterations.

### Request Example
{
  "num_iterations": 500000
}

### Response
#### Success Response (200)
- **loss** (float) - Final training loss value.
```

--------------------------------

### Prepare Model API

Source: https://context7.com/theislab/cellflow/llms.txt

Configures and prepares the CellFlow model with specified parameters for training.

```APIDOC
## prepare_model

### Description
Prepares the CellFlow model with various configuration options.

### Method
`cf.prepare_model()`

### Parameters
- **condition_mode** (str) - Optional - 'deterministic' or 'stochastic'.
- **regularization** (float) - Optional - L2 regularization for embeddings.
- **pooling** (str) - Optional - Pooling strategy: 'mean', 'attention_token', or 'attention_seed'.
- **pooling_kwargs** (dict) - Optional - Keyword arguments for pooling.
- **layers_before_pool** (list) - Optional - Layers before the pooling layer.
- **layers_after_pool** (list) - Optional - Layers after the pooling layer.
- **condition_embedding_dim** (int) - Optional - Latent dimension for conditions.
- **cond_output_dropout** (float) - Optional - Dropout rate for condition output.
- **time_freqs** (int) - Optional - Sinusoidal time encoding frequency.
- **time_encoder_dims** (list) - Optional - Dimensions for the time encoder.
- **time_encoder_dropout** (float) - Optional - Dropout rate for the time encoder.
- **hidden_dims** (list) - Optional - Dimensions for cell embedding layers.
- **hidden_dropout** (float) - Optional - Dropout rate for hidden layers.
- **conditioning** (str) - Optional - Conditioning method: 'concatenation', 'film', or 'resnet'.
- **decoder_dims** (list) - Optional - Dimensions for the output decoder layers.
- **decoder_dropout** (float) - Optional - Dropout rate for the decoder.
- **vf_act_fn** (callable) - Optional - Activation function for velocity field.
- **probability_path** (dict) - Optional - Configuration for the probability path (e.g., {"constant_noise": 0.5}).
- **match_fn** (callable) - Optional - Matching function.
- **optimizer** (object) - Optional - Optimizer configuration (e.g., optax.MultiSteps).
- **seed** (int) - Optional - Random seed for reproducibility.
```

--------------------------------

### Train CellFlow Model

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/200_zebrafish.ipynb

Initiates the training of the CellFlow model with specified parameters. This includes setting the total number of training iterations, batch size, providing a list of callbacks (e.g., for metrics), and defining the validation frequency.
```python
cf.train(
    num_iterations=500_000,
    batch_size=2048,
    callbacks=callbacks,
    valid_freq=80_000,
)
```

--------------------------------

### POST /model/persistence

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Saves or loads a trained CellFlow model to/from disk.

```APIDOC
## POST /model/persistence

### Description
Handles saving the current model state to a directory or loading a previously saved model.

### Method
POST

### Endpoint
cellflow.model.CellFlow.save / cellflow.model.CellFlow.load

### Parameters
#### Request Body
- **path** (string) - Required - File system path for the model.
- **overwrite** (boolean) - Optional - Whether to overwrite existing files.

### Request Example
{
  "path": "cellflow_model/",
  "overwrite": true
}

### Response
#### Success Response (200)
- **message** (string) - Status of the save or load operation.
```

--------------------------------

### Inspect Predictions using PCA Marginal Distributions

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Visualizes the marginal distributions of cell populations along principal components using Kernel Density Estimation (KDE) plots. This helps in understanding how cell distributions are predicted across different conditions and components.
```python
fig, axs = plt.subplots(2, 3, figsize=(10, 4))
for i, ax in enumerate(axs.flatten()):
    sns.kdeplot(
        x=adata[adata.obs["condition"] == "RA_4"].obsm["X_pca"][:, i],
        fill=True,
        levels=10,
        color=colors[0],
        label="RA_4",
        alpha=0.5,
        ax=ax,
    )
    sns.kdeplot(
        x=adata[adata.obs["condition"] == "BMP4_3"].obsm["X_pca"][:, i],
        fill=True,
        levels=10,
        color=colors[1],
        label="BMP4_3",
        alpha=0.5,
        ax=ax,
    )
    sns.kdeplot(
        x=adata[adata.obs["condition"] == "RA_4+BMP4_3"].obsm["X_pca"][:, i],
        fill=True,
        levels=10,
        color=colors[2],
        label="RA_4+BMP4_3",
        alpha=0.5,
        ax=ax,
    )
    sns.kdeplot(
```

--------------------------------

### Define Matching Function (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/100_pbmc.ipynb

Configures the `match_fn` using `functools.partial` to create a specific instance of the `match_linear` function. This function is used for sampling pairs batch-wise and includes parameters for entropic regularization (`epsilon`) and unbalancedness (`tau_a`, `tau_b`).

```python
import functools

match_fn = functools.partial(match_linear, epsilon=0.5, tau_a=1.0, tau_b=1.0)
```

--------------------------------

### Initialize CellFlow Model

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/201_zebrafish_continuous.ipynb

Initializes the CellFlow model instance with a specific flow matching solver.

```APIDOC
## POST /model/initialize

### Description
Initializes the CellFlow model using the provided AnnData object and solver type.

### Method
POST

### Parameters
#### Request Body
- **adata_train** (AnnData) - Required - The training dataset.
- **solver** (string) - Required - The flow matching solver to use (e.g., "otfm").

### Request Example
```python
cf = CellFlow(adata_train, solver="otfm")
```
```

--------------------------------

### Training Components

Source: https://github.com/theislab/cellflow/blob/main/docs/user/training.rst

Overview of the core components used for training models in CellFlow.
```APIDOC
## Training Module Overview

This module provides the tools and classes for training models within the CellFlow framework.

### Key Components:

- **BaseCallback**: Abstract base class for creating custom callbacks.
- **CallbackRunner**: Manages the execution of callbacks during training.
- **ComputationCallback**: A callback for performing computations during training steps.
- **LoggingCallback**: A callback for logging training progress and metrics.
- **Metrics**: Callback that computes evaluation metrics (for example `mmd` and `e_distance`) during training.
- **PCADecodedMetrics**: Metrics specifically for evaluating PCA-decoded outputs.
- **VAEDecodedMetrics**: Metrics for evaluating VAE-decoded outputs.
- **WandbLogger**: Integrates with Weights & Biases for experiment tracking and visualization.
- **CellFlowTrainer**: The main class orchestrating the training process.
```

--------------------------------

### Configure CellFlow Matching Function (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Configures the matching function used in CellFlow for sampling pairs between source and perturbed cells. It uses `match_linear` with entropic regularization (`epsilon`), cost scaling (`scale_cost`), and unbalancedness parameters (`tau_a`, `tau_b`).

```python
from functools import partial

match_fn = partial(
    solver_utils.match_linear,
    epsilon=0.5,
    scale_cost="mean",
    tau_a=0.99,
    tau_b=0.99,
)
```

--------------------------------

### Train CellFlow Model (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Initiates the training process for the CellFlow model for a specified number of iterations. This snippet focuses on the core training execution, with a note to monitor validation metrics separately for initial runs.
```python
cf.train(num_iterations=500000)
```

--------------------------------

### Prepare data for CellFlow

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Configures the model's data handling by specifying the cellular representation, control key, and perturbation covariates.

```python
cf.prepare_data(
    sample_rep=sample_rep,
    control_key="CTRL",
    perturbation_covariates={"conditions": condition_keys},
    perturbation_covariate_reps={"conditions": "conditions"},
)
```

--------------------------------

### Execute CellFlow Model Training

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/100_pbmc.ipynb

Runs the training process for the CellFlow model with specified iterations, batch size, and validation frequency.

```python
cf.train(
    num_iterations=500_000,
    batch_size=1024,
    callbacks=callbacks,
    valid_freq=20_000,
)
```

--------------------------------

### Calculate and Visualize Evaluation Metrics

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/100_pbmc.ipynb

Computes energy distance and R-squared metrics using JAX tree mapping, aggregates results into a DataFrame, and visualizes the performance across donors using Seaborn bar plots.
```python
e_distances = jax.tree_util.tree_map(
    compute_e_distance, test_data_target_encoded, test_data_target_encoded_predicted
)
r_squared = jax.tree_util.tree_map(
    compute_r_squared, test_data_target_decoded, test_data_target_decoded_predicted
)

df_e_distance = pd.DataFrame.from_dict(e_distances, orient="index", columns=["energy_distance"])
df_r_squared = pd.DataFrame.from_dict(r_squared, orient="index", columns=["r_squared"])
df_metrics = pd.merge(df_e_distance, df_r_squared, left_index=True, right_index=True)
df_metrics["condition"] = df_metrics.index
df_metrics["donor"] = df_metrics.apply(lambda x: x["condition"].split("_")[0], axis=1)

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
sns.barplot(data=df_metrics, x="donor", y="energy_distance", edgecolor="black", ax=axes[0])
sns.barplot(data=df_metrics, x="donor", y="r_squared", edgecolor="black", ax=axes[1])
plt.tight_layout()
plt.show()
```

--------------------------------

### Compute Molecular Fingerprints from SMILES in Python

Source: https://context7.com/theislab/cellflow/llms.txt

Generates Morgan fingerprints for chemical compounds from their SMILES representations. This function first annotates compounds with SMILES strings (e.g., from PubChem) and then computes the fingerprints, which can be used as feature representations for drugs or other molecules.

```python
import cellflow.preprocessing as cfpp

# First annotate compounds with SMILES from PubChem
cfpp.annotate_compounds(
    adata,
    compound_keys=["drug_1", "drug_2"],
    control_category="control",
    query_id_type="name",  # or "cid" for PubChem IDs
)
# Sets: adata.obs["drug_1_smiles"], adata.obs["drug_1_pubchem_name"], etc.

# Compute Morgan fingerprints
cfpp.get_molecular_fingerprints(
    adata,
    compound_keys=["drug_1", "drug_2"],
    smiles_keys=None,  # Auto-detects "{compound_key}_smiles"
    control_value="control",
    uns_key_added="fingerprints",
    radius=4,
    n_bits=1024,
)
# Sets: adata.uns["fingerprints"] = {"DrugA": fingerprint_array, ...}
```

--------------------------------

### Prepare CellFlow Model Architecture (Python)

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/300_ineuron_tutorial.ipynb

Prepares the CellFlow model architecture by setting various hyperparameters including condition embedding dimensions, time encoder dimensions, hidden layer dimensions, pooling type, and the matching function. This function configures the network structure for processing cell data.

```python
cf.prepare_model(
    condition_embedding_dim=128,
    time_encoder_dims=[1024] * 5,
    time_encoder_dropout=0.1,
    hidden_dims=[2048] * 2 + [128],
    hidden_dropout=0.2,
    decoder_dims=[512] * 2,
    decoder_dropout=0.1,
    pooling="mean",
    layers_before_pool=layers_before_pool,
    layers_after_pool=layers_after_pool,
    cond_output_dropout=0.3,
    flow={"constant_noise": 0.0},
    match_fn=match_fn,
)
```

--------------------------------

### CellFlow Model Preparation API

Source: https://github.com/theislab/cellflow/blob/main/docs/notebooks/500_combosciplex.ipynb

This section details the parameters for preparing the CellFlow model architecture using the `prepare_model` method.

```APIDOC
## POST /api/cellflow/prepare_model

### Description
Prepares the CellFlow model architecture by setting various configuration parameters.

### Method
POST

### Endpoint
/api/cellflow/prepare_model

### Parameters
#### Request Body
- **condition_mode** (string) - Optional - Mode for learning condition embeddings ('deterministic' recommended).
- **regularization** (float) - Optional - Regularization value for the latent space (0.0 recommended).
- **pooling** (string) - Optional - Method for aggregating condition combinations ('attention_token' or 'mean').
- **layers_before_pool** (dict) - Optional - Configuration for embedding ESM2 embeddings and time.
- **layers_after_pool** (dict) - Optional - Configuration for layers after pooling.
- **condition_embedding_dim** (int) - Optional - Dimension of the condition encoder's latent space.
- **cond_output_dropout** (float) - Optional - Dropout rate applied to the condition embedding.
- **hidden_dims** (list) - Optional - Dimensions of hidden layers.
- **conditioning** (string) - Optional - Conditioning method ('concatenation').
- **decoder_dims** (list) - Optional - Dimensions of decoder layers.
- **probability_path** (dict) - Optional - Configuration for the probability path, e.g., `{"constant_noise": 1.5}`.
- **match_fn** (callable) - Optional - Function to sample pairs between control and perturbed cells.
- **linear_projection_before_concatenation** (bool) - Optional - Whether to apply linear projection before concatenation.

### Request Example
```json
{
  "condition_mode": "deterministic",
  "regularization": 0.0,
  "pooling": "attention_token",
  "layers_before_pool": {
    "drug_perturbation": {"layer_type": "mlp", "dims": [256, 256], "dropout_rate": 0.0}
  },
  "layers_after_pool": {
    "layer_type": "mlp",
    "dims": [256, 256],
    "dropout_rate": 0.0
  },
  "condition_embedding_dim": 64,
  "cond_output_dropout": 0.9,
  "hidden_dims": [2048, 2048, 2048],
  "conditioning": "concatenation",
  "decoder_dims": [4096, 4096, 4096],
  "probability_path": {"constant_noise": 1.5},
  "match_fn": "functools.partial(match_linear, epsilon=1.0, tau_a=1.0, tau_b=1.0)",
  "linear_projection_before_concatenation": true
}
```

### Response
#### Success Response (200)
- **status** (string) - Indicates successful preparation.

#### Response Example
```json
{
  "status": "Model prepared successfully"
}
```
```
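
The request example lists `match_fn` as a string, while the parameter list describes it as a callable. In Python the same configuration would more plausibly be passed as keyword arguments, with `match_fn` built by `functools.partial`. The sketch below only assembles those arguments and does not import CellFlow: `match_linear_stub` is a hypothetical stand-in for the library's `match_linear`, and `model_kwargs` is an illustrative name, not part of the API.

```python
import functools


def match_linear_stub(source_batch, target_batch, epsilon=1.0, tau_a=1.0, tau_b=1.0):
    """Hypothetical stand-in for the library's match_linear; not CellFlow code."""
    raise NotImplementedError("replace with the real matching function")


# match_fn is a callable with its hyperparameters bound, not a string.
match_fn = functools.partial(match_linear_stub, epsilon=1.0, tau_a=1.0, tau_b=1.0)

# Same values as the JSON request example, expressed as Python keyword arguments.
model_kwargs = {
    "condition_mode": "deterministic",
    "regularization": 0.0,
    "pooling": "attention_token",
    "layers_before_pool": {
        "drug_perturbation": {"layer_type": "mlp", "dims": [256, 256], "dropout_rate": 0.0},
    },
    "layers_after_pool": {"layer_type": "mlp", "dims": [256, 256], "dropout_rate": 0.0},
    "condition_embedding_dim": 64,
    "cond_output_dropout": 0.9,
    "hidden_dims": [2048, 2048, 2048],
    "conditioning": "concatenation",
    "decoder_dims": [4096, 4096, 4096],
    "probability_path": {"constant_noise": 1.5},
    "match_fn": match_fn,
    "linear_projection_before_concatenation": True,
}

# With a CellFlow instance `cf` whose data has been prepared, the call would be:
# cf.prepare_model(**model_kwargs)
```

Keeping the arguments in one dictionary makes it easy to log or compare configurations between runs; the bound hyperparameters of the matching function remain inspectable through `match_fn.keywords`.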