### Install DiffMS package

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Installs the DiffMS package in editable mode using pip. This command should be run after setting up the environment and dependencies.

```bash
pip install -e .
```

--------------------------------

### Download and Process FP2MOL Dataset

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Executes a series of bash scripts to download and preprocess the FP2MOL dataset. These scripts are sequential and require unzip to be installed.

```bash
bash data_processing/00_download_fp2mol_data.sh
bash data_processing/01_download_canopus_data.sh
bash data_processing/02_download_msg_data.sh
bash data_processing/03_preprocess_fp2mol.sh
```

--------------------------------

### Install PyTorch with CUDA support

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Installs a specific version of PyTorch (2.3.1) with CUDA 11.8 support using pip. This is essential for GPU acceleration in the DiffMS project.

```bash
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu118
```

--------------------------------

### Create Mamba Environment for DiffMS

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Creates and activates a Mamba environment named 'diffms' with RDKit and Python 3.9 installed. Mamba offers a faster alternative to Conda for environment creation.

```bash
mamba create -y -n diffms rdkit=2024.09.4 python=3.9
mamba activate diffms
```

--------------------------------

### Create Conda Environment for DiffMS

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Creates and activates a Conda environment named 'diffms' with RDKit and Python 3.9 installed. This is a prerequisite for running the DiffMS code.

```bash
conda create -y -c conda-forge -n diffms rdkit=2024.09.4 python=3.9
conda activate diffms
```

--------------------------------

### Hydra Configuration for Model Training (YAML)

Source: https://context7.com/coleygroup/diffms/llms.txt

This snippet shows Hydra configuration files in YAML format for setting up model training parameters. It includes default settings for general experiment parameters, model architecture, and training specifics.

```yaml
# configs/config.yaml
defaults:
  - _self_
  - general: general_default
  - model: model_default
  - train: train_default
  - dataset: canopus  # or 'msg' or 'fp2mol'

# configs/general/general_default.yaml
name: 'experiment_name'
wandb_name: 'diffms'
gpus: 4
val_samples_to_generate: 10
test_samples_to_generate: 100
load_weights: null  # Path to pretrained checkpoint
encoder_finetune_strategy: null  # freeze, ft-transformer, freeze-transformer
decoder_finetune_strategy: null  # freeze, ft-input, freeze-input, ft-output

# configs/model/model_default.yaml
model: 'graph_tf_v2'  # graph_tf, graph_tf_v2, graph_tf_v3, graph_tf_v4
diffusion_steps: 1000
diffusion_noise_schedule: 'cosine'
transition: 'marginal'  # uniform or marginal
n_layers: 8
hidden_dims: 256
hidden_mlp_dims: 1024

# configs/train/train_default.yaml
lr: 0.0001
weight_decay: 0.0001
scheduler: 'one_cycle'  # const or one_cycle
n_epochs: 300
clip_grad: 10.0
ema_decay: 0.999
save_model: true

```
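
As a rough illustration of how the `defaults` list turns the per-group files into one nested config (so that code can read `cfg.train.n_epochs` or `cfg.general.gpus`), the sketch below mimics that composition with plain dictionaries. It is not Hydra's implementation: the `GROUPS` table, the `compose` function, the trimmed set of keys, and the `name` field under `dataset` are all assumptions made for illustration, and the override handling only imitates the `group=option` and `group.key=value` forms.

```python
import copy

# Hypothetical, trimmed-down stand-ins for the YAML files shown above.
GROUPS = {
    "general": {"general_default": {"gpus": 4, "load_weights": None}},
    "train": {"train_default": {"lr": 0.0001, "n_epochs": 300}},
    "dataset": {"canopus": {"name": "canopus"}, "msg": {"name": "msg"}},
}
DEFAULTS = {"general": "general_default", "train": "train_default", "dataset": "canopus"}


def compose(overrides=()):
    """Build a nested config dict from group defaults plus simple overrides."""
    choices = dict(DEFAULTS)
    values = []
    for item in overrides:
        key, raw = item.split("=", 1)
        if "." in key:
            values.append((key, raw))  # e.g. "train.n_epochs=50"
        else:
            choices[key] = raw         # e.g. "dataset=msg"
    cfg = {group: copy.deepcopy(GROUPS[group][option]) for group, option in choices.items()}
    for key, raw in values:
        group, field = key.split(".", 1)
        old = cfg[group].get(field)
        # Reuse the type of the existing value when there is one.
        cfg[group][field] = type(old)(raw) if old is not None else raw
    return cfg


cfg = compose(["dataset=msg", "train.n_epochs=50"])
print(cfg["dataset"]["name"], cfg["train"]["n_epochs"], cfg["general"]["gpus"])
```

The point of the sketch is the shape of the result: one top-level key per config group, each holding the contents of the selected file.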

--------------------------------

### Pretrain Fingerprint-Molecule Model

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Runs the main script for fingerprint-molecule pretraining. Requires setting the dataset in config.yaml to 'fp2mol'.

```bash
python src/fp2mol_main.py
```

--------------------------------

### Finetune Spectra-Molecule Generation Model

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Runs the main script for finetuning the end-to-end spectra-molecule generation model. Requires setting the dataset in config.yaml to 'msg' or 'canopus'.

```bash
python src/spec2mol_main.py
```

--------------------------------

### FP2Mol Pretraining: Train Molecular Graph Decoder with Morgan Fingerprints

Source: https://context7.com/coleygroup/diffms/llms.txt

This Python script demonstrates the pretraining phase of DiffMS, focusing on generating molecules from Morgan fingerprints. It uses Hydra for configuration management and PyTorch Lightning for training. The process involves loading a dataset, initializing the FP2MolDenoisingDiffusion model, and training it using a PyTorch Lightning Trainer.

```python
import hydra
from omegaconf import DictConfig
from pytorch_lightning import Trainer
from src.diffusion_model_fp2mol import FP2MolDenoisingDiffusion
from src.datasets import fp2mol_dataset

@hydra.main(version_base='1.3', config_path='configs', config_name='config')
def main(cfg: DictConfig):
    # Load dataset with Morgan fingerprints
    datamodule = fp2mol_dataset.FP2MolDataModule(cfg)
    dataset_infos = fp2mol_dataset.FP2Mol_infos(datamodule, cfg)

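    # Initialize diffusion model.
    # NOTE: train_metrics, visualization_tools, extra_features and
    # domain_features are not defined in this excerpt; they must be
    # constructed before this call.
    model = FP2MolDenoisingDiffusion(
        cfg=cfg,
        dataset_infos=dataset_infos,
        train_metrics=train_metrics,
        visualization_tools=visualization_tools,
        extra_features=extra_features,
        domain_features=domain_features
    )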

    # Train model
    trainer = Trainer(
        gradient_clip_val=cfg.train.clip_grad,
        strategy="ddp",
        accelerator='gpu',
        devices=cfg.general.gpus,
        max_epochs=cfg.train.n_epochs
    )
    trainer.fit(model, datamodule=datamodule)

if __name__ == '__main__':
    main()

```

--------------------------------

### Prepare Spectra-Molecule Dataset with Splits (Python)

Source: https://context7.com/coleygroup/diffms/llms.txt

Loads and prepares paired mass spectra and molecular structure data using PyTorch Geometric. It defines custom splitters and featurizers for spectra and molecules, including Morgan fingerprints. The function returns a dictionary containing DataLoader objects for training, validation, and testing sets.

```python
from src.mist.data import datasets, splitter, featurizers
from torch_geometric.loader import DataLoader

def prepare_spec2mol_dataset(cfg):
    """
    Load spectra-molecule dataset with train/val/test splits.

    Args:
        cfg: Configuration dict with dataset parameters
            - datadir: Directory with spectrum files
            - split_file: Path to predefined splits
            - morgan_nbits: Fingerprint size (e.g., 4096)

    Returns:
        Dictionary with train, val, test DataLoaders
    """
    # Define data splitter
    data_splitter = splitter.PresetSpectraSplitter(
        split_file=cfg.dataset.split_file
    )

    # Define featurizer for spectra and molecules
    paired_featurizer = featurizers.PairedFeaturizer(
        spec_featurizer=featurizers.PeakFormula(**cfg.dataset),
        mol_featurizer=featurizers.FingerprintFeaturizer(
            fp_names=['morgan4096'],
            **cfg.dataset
        ),
        graph_featurizer=featurizers.GraphFeaturizer(**cfg.dataset)
    )

    # Load spectra-molecule pairs
    spectra_mol_pairs = datasets.get_paired_spectra(**cfg.dataset)
    spectra_mol_pairs = list(zip(*spectra_mol_pairs))

    # Split dataset
    split_name, (train, val, test) = data_splitter.get_splits(spectra_mol_pairs)

    # Create datasets
    ms_datasets = {
        'train': datasets.SpectraMolDataset(
            spectra_mol_list=train,
            featurizer=paired_featurizer,
            **cfg.dataset
        ),
        'val': datasets.SpectraMolDataset(
            spectra_mol_list=val,
            featurizer=paired_featurizer,
            **cfg.dataset
        ),
        'test': datasets.SpectraMolDataset(
            spectra_mol_list=test,
            featurizer=paired_featurizer,
            **cfg.dataset
        )
    }

    # Create dataloaders
    dataloaders = {
        'train': DataLoader(ms_datasets['train'], batch_size=32, shuffle=True),
        'val': DataLoader(ms_datasets['val'], batch_size=32, shuffle=False),
        'test': DataLoader(ms_datasets['test'], batch_size=32, shuffle=False)
    }

    return dataloaders
```
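
To make the "predefined splits" step more concrete, here is a small standard-library sketch of what a preset splitter does conceptually: read a name-to-fold table and partition the pairs accordingly. It does not reproduce `PresetSpectraSplitter`; the function names, the example spectrum names, and the `name`/`split` column layout (borrowed from the notebook snippets in this document) are assumptions.

```python
import csv
import io


def read_split_file(handle):
    """Map each spectrum name to its fold from a two-column TSV with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: row["split"] for row in reader}


def apply_preset_split(pairs, name_to_fold):
    """Partition (name, payload) pairs into train/val/test lists."""
    folds = {"train": [], "val": [], "test": []}
    for name, payload in pairs:
        fold = name_to_fold.get(name)
        if fold in folds:  # names missing from the split file are skipped
            folds[fold].append((name, payload))
    return folds["train"], folds["val"], folds["test"]


# Hypothetical split file contents and spectrum names.
split_tsv = "name\tsplit\nspec_a\ttrain\nspec_b\tval\nspec_c\ttest\n"
pairs = [("spec_a", "CCO"), ("spec_b", "CCN"), ("spec_c", "CCC"), ("spec_d", "C")]
train, val, test = apply_preset_split(pairs, read_split_file(io.StringIO(split_tsv)))
print(len(train), len(val), len(test))
```

Because the assignment comes from a file rather than a random draw, the same molecules land in the same fold on every run, which keeps results comparable across experiments.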

--------------------------------

### Spec2Mol Fine-tuning: Generate Molecules from Mass Spectra

Source: https://context7.com/coleygroup/diffms/llms.txt

This Python script details the fine-tuning stage of DiffMS for spectrum-to-molecule generation. It utilizes Hydra for configuration and PyTorch Lightning for training, allowing for flexible encoder/decoder freezing strategies. The script loads a spectra-molecule dataset, initializes the Spec2MolDenoisingDiffusion model, optionally freezes parts of the encoder or decoder, loads pretrained weights, and then fine-tunes the model.

```python
import hydra
from omegaconf import DictConfig
from pytorch_lightning import Trainer
from src.diffusion_model_spec2mol import Spec2MolDenoisingDiffusion
from src.datasets import spec2mol_dataset
import torch

@hydra.main(version_base='1.3', config_path='configs', config_name='config')
def main(cfg: DictConfig):
    # Load spectra-molecule dataset
    datamodule = spec2mol_dataset.Spec2MolDataModule(cfg)
    dataset_infos = spec2mol_dataset.Spec2MolDatasetInfos(datamodule, cfg)

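    # Initialize model (pretrained weights are loaded further below).
    # NOTE: model_kwargs is not defined in this excerpt; it stands for the
    # remaining constructor arguments.
    model = Spec2MolDenoisingDiffusion(cfg=cfg, dataset_infos=dataset_infos, **model_kwargs)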

    # Apply fine-tuning strategies (freeze, ft-transformer, etc.)
    if cfg.general.encoder_finetune_strategy == 'freeze':
        for param in model.encoder.parameters():
            param.requires_grad = False

    if cfg.general.decoder_finetune_strategy == 'freeze-transformer':
        for param in model.decoder.tf_layers.parameters():
            param.requires_grad = False

    # Load pretrained weights
    if cfg.general.load_weights is not None:
        checkpoint = torch.load(cfg.general.load_weights, map_location='cpu')
        model.load_state_dict(checkpoint['state_dict'], strict=False)

    # Fine-tune
    trainer = Trainer(
        strategy="ddp_find_unused_parameters_true",
        accelerator='gpu',
        devices=cfg.general.gpus,
        max_epochs=cfg.train.n_epochs
    )
    trainer.fit(model, datamodule=datamodule)

if __name__ == '__main__':
    main()

```

--------------------------------

### Molecular Sampling: Generate Molecules from Diffusion Model

Source: https://context7.com/coleygroup/diffms/llms.txt

This snippet imports necessary PyTorch and RDKit modules for molecular sampling using the reverse diffusion process. It includes utilities for creating batches of graph data and applying functional transformations, essential for generating molecule candidates from a trained diffusion model.

```python
import torch
from torch_geometric.data import Batch
from rdkit import Chem
import torch.nn.functional as F
from src import utils
from src.diffusion import diffusion_utils

```

--------------------------------

### Sample Molecules from Batch using PyTorch

Source: https://context7.com/coleygroup/diffms/llms.txt

Generates multiple molecule candidates for each input in a batch using a trained diffusion model. It involves converting data to a dense representation, sampling from a noise distribution, reversing the diffusion process, and converting graph structures to RDKit molecule objects.

```python
import torch
from rdkit import Chem
from torch_geometric.data import Batch

# utils and diffusion_utils come from the DiffMS source tree (see the imports
# in the previous snippet).
from src import utils
from src.diffusion import diffusion_utils


@torch.no_grad()
def sample_batch(model, batch: Batch, num_samples: int = 10) -> list[list[Chem.Mol]]:
    """
    Generate multiple molecule candidates for each input in the batch.

    Args:
        model: Trained diffusion model
        batch: Input batch with fingerprints or spectra
        num_samples: Number of molecules to generate per input

    Returns:
        List of molecule lists, one per input
    """
    # Convert to dense representation
    dense_data, node_mask = utils.to_dense(
        batch.x, batch.edge_index, batch.edge_attr, batch.batch
    )

    predicted_mols = [list() for _ in range(len(batch))]

    for _ in range(num_samples):
        # Sample from noise distribution
        z_T = diffusion_utils.sample_discrete_feature_noise(
            limit_dist=model.limit_dist,
            node_mask=node_mask
        )

        X, E, y = dense_data.X, z_T.E, batch.y

        # Reverse diffusion process
        for s_int in reversed(range(0, model.T)):
            s_array = s_int * torch.ones((len(batch), 1), device=model.device)
            t_array = s_array + 1
            s_norm = s_array / model.T
            t_norm = t_array / model.T

            # Sample z_s given z_t
            sampled_s, _ = model.sample_p_zs_given_zt(
                s_norm, t_norm, X, E, y, node_mask
            )
            X, E, y = sampled_s.X, sampled_s.E, batch.y

        # Convert graphs to molecules
        sampled_s = sampled_s.mask(node_mask, collapse=True)
        for idx, (nodes, adj_mat) in enumerate(zip(sampled_s.X, sampled_s.E)):
            mol = model.visualization_tools.mol_from_graphs(nodes, adj_mat)
            predicted_mols[idx].append(mol)

    return predicted_mols

```
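
The time indexing in the reverse loop is easy to get wrong, so here is a tiny standalone sketch of just that part: for `T` diffusion steps, `s` runs from `T - 1` down to `0`, `t = s + 1`, and both are normalized by `T` before being passed to the model. The helper name `reverse_schedule` is made up for this illustration.

```python
def reverse_schedule(T):
    """Yield (s_norm, t_norm) for s = T-1, ..., 0 with t = s + 1."""
    for s_int in reversed(range(T)):
        t_int = s_int + 1
        yield s_int / T, t_int / T


steps = list(reverse_schedule(4))
print(steps)  # [(0.75, 1.0), (0.5, 0.75), (0.25, 0.5), (0.0, 0.25)]
```

The first step therefore starts from `t_norm = 1.0` (pure noise) and the last step ends at `s_norm = 0.0` (the final sample).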

--------------------------------

### Process Molecule to Graph with Morgan Fingerprint (Python)

Source: https://context7.com/coleygroup/diffms/llms.txt

Converts an InChI string into a PyTorch Geometric Data object representing a molecular graph, including atom and bond features, and computes a Morgan fingerprint. It utilizes RDKit for molecule parsing and PyTorch for tensor operations. The function handles potential parsing errors by returning None for invalid InChI strings.

```python
import torch
import torch.nn.functional as F
from rdkit import Chem
from rdkit.Chem.AllChem import GetMorganFingerprintAsBitVect
import numpy as np
from torch_geometric.data import Data

def process_molecule_to_graph(inchi, morgan_r=2, morgan_nbits=2048):
    """
    Convert InChI string to graph representation with fingerprint.

    Args:
        inchi: InChI string representation of molecule
        morgan_r: Morgan fingerprint radius
        morgan_nbits: Number of bits in fingerprint

    Returns:
        PyTorch Geometric Data object with graph and fingerprint
    """
    # Define atom and bond types
    atom_types = {'C': 0, 'O': 1, 'P': 2, 'N': 3, 'S': 4, 'Cl': 5, 'F': 6, 'H': 7}
    bond_types = {
        Chem.rdchem.BondType.SINGLE: 0,
        Chem.rdchem.BondType.DOUBLE: 1,
        Chem.rdchem.BondType.TRIPLE: 2,
        Chem.rdchem.BondType.AROMATIC: 3
    }

    # Parse molecule
    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        return None

    # Remove stereochemistry
    smi = Chem.MolToSmiles(mol, isomericSmiles=False)
    mol = Chem.MolFromSmiles(smi)

    N = mol.GetNumAtoms()

    # Extract atom features
    type_idx = []
    for atom in mol.GetAtoms():
        symbol = atom.GetSymbol()
        type_idx.append(atom_types[symbol])

    x = F.one_hot(torch.tensor(type_idx), num_classes=len(atom_types)).float()

    # Extract bond features
    row, col, edge_type = [], [], []
    for bond in mol.GetBonds():
        start, end = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        row += [start, end]
        col += [end, start]
        bond_idx = bond_types[bond.GetBondType()] + 1  # 0 reserved for no bond
        edge_type += 2 * [bond_idx]

    edge_index = torch.tensor([row, col], dtype=torch.long)
    edge_attr = F.one_hot(
        torch.tensor(edge_type),
        num_classes=len(bond_types) + 1
    ).float()

    # Compute Morgan fingerprint
    fp = GetMorganFingerprintAsBitVect(mol, morgan_r, nBits=morgan_nbits)
    y = torch.tensor(np.asarray(fp, dtype=np.int8)).unsqueeze(0)

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y)
```
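
The bond-encoding step can be looked at in isolation. The sketch below repeats it without RDKit or PyTorch, using plain lists, to show the two conventions the function relies on: every bond is stored in both directions, and class index 0 is reserved for "no bond" so real bond types are shifted by one. The string bond names and the helper names are stand-ins chosen for this example.

```python
BOND_TYPES = {"SINGLE": 0, "DOUBLE": 1, "TRIPLE": 2, "AROMATIC": 3}


def one_hot(index, num_classes):
    """Return a one-hot list of length num_classes."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def encode_bonds(bonds):
    """Encode (begin_idx, end_idx, bond_name) triples as edge_index and edge_attr."""
    row, col, edge_type = [], [], []
    for start, end, name in bonds:
        row += [start, end]   # store the edge in both directions
        col += [end, start]
        edge_type += 2 * [BOND_TYPES[name] + 1]  # 0 is reserved for "no bond"
    edge_index = [row, col]
    edge_attr = [one_hot(t, len(BOND_TYPES) + 1) for t in edge_type]
    return edge_index, edge_attr


# Heavy atoms of acetaldehyde: C0-C1 single bond, C1=O2 double bond.
edge_index, edge_attr = encode_bonds([(0, 1, "SINGLE"), (1, 2, "DOUBLE")])
print(edge_index)    # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(edge_attr[0])  # [0.0, 1.0, 0.0, 0.0, 0.0]
```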

--------------------------------

### Prepare and Compute Molecule Metrics (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

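This code converts true and predicted molecule objects to InChI strings, computes per-example metrics in parallel, averages them over the dataset, and saves the result to a CSV file. Predictions are deduplicated and sorted by how often they were generated; when `doFull` is false, the candidate list is truncated and metrics are reported for k = 1 to 10 instead of k = 1 to 100. The names `true`, `pred`, `is_valid`, `compute_metrics_for_one`, `doFull`, `doMCES`, and `csv_path` are defined elsewhere in the notebook. Dependencies include RDKit (Chem), collections, tqdm, tqdm_joblib, joblib, pandas, and pulp.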

```python
true_inchi = []
pred_inchi = []
for i in range(len(true)):
    local_pred_inchi = []
    for j in range(len(pred[i])):
        if is_valid(pred[i][j]):
            local_pred_inchi.append(Chem.MolToInchi(pred[i][j]))

    # sort local_pred_inchi by frequency
    inchi_counts = Counter(local_pred_inchi)
    local_pred_inchi = [item for item, count in inchi_counts.most_common()]

    if not doFull:
        local_pred_inchi = local_pred_inchi[:11]

    pred_inchi.append(local_pred_inchi)
    true_inchi.append(Chem.MolToInchi(true[i]))

solver = pulp.listSolvers(onlyAvailable=True)[0]

with tqdm_joblib(tqdm(total=len(true_inchi))) as progress_bar:
    results = Parallel(n_jobs=-1)(
        delayed(compute_metrics_for_one)(
            true_inchi[i],
            pred_inchi[i],
            solver,
            doMCES=doMCES,
            doFull=doFull
        )
        for i in range(len(true_inchi))
    )

# aggregate results
final_metrics = defaultdict(float)
for r in results:
    for key, val in r.items():
        final_metrics[key] += val

# metrics were summed over examples above; divide to obtain the mean
max_k = 100 if doFull else 10
metric_names = ('acc', 'mces', 'tanimoto', 'cosine', 'close_match', 'meaningful_match')
for k in range(1, max_k + 1):
    for name in metric_names:
        final_metrics[f'{name}@{k}'] /= len(true_inchi)

df = pd.DataFrame(final_metrics, index=[0])
df.to_csv(csv_path, index=False)
```
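
The ranking idea used above (deduplicate the sampled candidates and order them by how often they were generated) can be tried out with plain strings. The sketch below is only an illustration: `compute_metrics_for_one` is not shown in this document, so the `top_k_hits` definition of "hit within the first k candidates" is an assumption about what a top-k accuracy typically measures, not the notebook's code.

```python
from collections import Counter


def rank_by_frequency(candidates):
    """Deduplicate candidates, most frequently generated first."""
    return [item for item, _ in Counter(candidates).most_common()]


def top_k_hits(truth, ranked, max_k):
    """Return {k: 1.0 or 0.0}: is `truth` among the first k ranked candidates?"""
    return {k: float(truth in ranked[:k]) for k in range(1, max_k + 1)}


samples = ["B", "A", "B", "C", "A", "B"]   # stand-ins for InChI strings
ranked = rank_by_frequency(samples)
hits = top_k_hits("A", ranked, max_k=3)
print(ranked)  # ['B', 'A', 'C']
print(hits)    # {1: 0.0, 2: 1.0, 3: 1.0}
```

Summing such per-example dictionaries and dividing by the number of examples gives the averaged `acc@k` style values written to the CSV.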

--------------------------------

### Load Model Predictions from Pickle Files (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

This snippet loads model predictions and ground-truth molecules from a series of pickle files. For each `idx` from 1 to 4 it starts at file index `i = idx - 1` and advances in steps of 4 until a prediction file is missing, extending the `canopus_true` and `canopus_pred` lists as it goes. The file layout follows what the test step in `diffusion_model_spec2mol.py` saves; paths will differ in other setups. The snippet needs `os` and `pickle` to be imported.

```python
# example code of loading model predictions as saved in the diffusion_model_spec2mol.py test step
# paths/loading will be different

canopus_true = []
canopus_pred = []
for idx in range(1, 5):
    i = idx-1
    while os.path.exists(f"../final_results/canopus/spec2mol-canopus-eval-{idx}_resume_pred_{i}.pkl"):
        with open(f"../final_results/canopus/spec2mol-canopus-eval-{idx}_resume_true_{i}.pkl", 'rb') as f:
            canopus_true.extend(pickle.load(f))
        with open(f"../final_results/canopus/spec2mol-canopus-eval-{idx}_resume_pred_{i}.pkl", 'rb') as f:
            canopus_pred.extend(pickle.load(f))
        i += 4
```
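
The indexing pattern of that loop is easier to see without the file system. The sketch below reproduces only the control flow, with an `exists` callback standing in for `os.path.exists`; the function name and the example set of files are hypothetical.

```python
def shard_indices(idx, exists, stride=4):
    """Return the file indices visited for run `idx`, stopping at the first gap."""
    found = []
    i = idx - 1
    while exists(idx, i):
        found.append(i)
        i += stride
    return found


# Hypothetical set of (idx, i) pairs for which a prediction file is present.
present = {(1, 0), (1, 4), (1, 8), (2, 1), (2, 5), (3, 2)}
for idx in range(1, 5):
    print(idx, shard_indices(idx, lambda a, b: (a, b) in present))
# 1 [0, 4, 8]
# 2 [1, 5]
# 3 [2]
# 4 []
```

Note that the loop stops at the first missing index, so a gap in the sequence hides any later files for that `idx`.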

--------------------------------

### Build Molecule from Graph Representation using RDKit

Source: https://context7.com/coleygroup/diffms/llms.txt

Constructs an RDKit molecule object from discrete atom and edge type tensors. It maps atom indices to symbols and bond type indices to RDKit bond types. The function handles potential errors during sanitization, returning None if molecule construction fails.

```python
import torch
from rdkit import Chem
from rdkit.Chem import AllChem
from typing import List, Optional

def build_molecule_with_partial_charges(atom_types: torch.Tensor, edge_types: torch.Tensor, atom_decoder: List[str]) -> Optional[Chem.Mol]:
    """
    Construct a molecule from discrete graph representation.

    Args:
        atom_types: Tensor of shape (n_atoms,) with atom type indices
        edge_types: Tensor of shape (n_atoms, n_atoms) with bond type indices
        atom_decoder: List mapping indices to atom symbols

    Returns:
        RDKit Mol object or None if construction fails
    """
    bond_dict = [
        None,
        Chem.rdchem.BondType.SINGLE,
        Chem.rdchem.BondType.DOUBLE,
        Chem.rdchem.BondType.TRIPLE,
        Chem.rdchem.BondType.AROMATIC
    ]

    mol = Chem.RWMol()

    # Add atoms
    for atom_idx in atom_types:
        atom = Chem.Atom(atom_decoder[atom_idx.item()])
        mol.AddAtom(atom)

    # Add bonds
    edge_types = torch.triu(edge_types, diagonal=1)
    for i in range(len(atom_types)):
        for j in range(i + 1, len(atom_types)):
            bond_type_idx = edge_types[i, j].item()
            if bond_type_idx > 0:
                try:
                    mol.AddBond(i, j, bond_dict[bond_type_idx])
                except Exception as e:
                    print(f"Error adding bond between {i} and {j} with type {bond_type_idx}: {e}")
                    continue

    try:
        mol = mol.GetMol()
        Chem.SanitizeMol(mol)
        return mol
    except Exception as e:
        print(f"Error sanitizing molecule: {e}")
        return None

```
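
For readers who want to check the decoding logic without RDKit, the following sketch performs the same traversal on plain Python lists: it reads only the upper triangle of the adjacency matrix so that each bond is emitted once, and treats index 0 as "no bond". The string bond names and the `decode_bonds` helper are illustrative stand-ins, and no chemistry validation is attempted here.

```python
BOND_NAMES = [None, "SINGLE", "DOUBLE", "TRIPLE", "AROMATIC"]


def decode_bonds(atom_types, edge_types, atom_decoder):
    """Turn atom indices and an adjacency matrix into symbols and a bond list."""
    symbols = [atom_decoder[a] for a in atom_types]
    bonds = []
    n = len(atom_types)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: each bond once
            bond_idx = edge_types[i][j]
            if bond_idx > 0:       # 0 means "no bond"
                bonds.append((i, j, BOND_NAMES[bond_idx]))
    return symbols, bonds


atom_decoder = ["C", "O", "P", "N", "S", "Cl", "F", "H"]
atom_types = [0, 0, 1]
edge_types = [
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0],
]
print(decode_bonds(atom_types, edge_types, atom_decoder))
# (['C', 'C', 'O'], [(0, 1, 'SINGLE'), (1, 2, 'DOUBLE')])
```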

--------------------------------

### Compute Molecular Accuracy and Validity Metrics (Python)

Source: https://context7.com/coleygroup/diffms/llms.txt

This snippet provides Python functions to calculate top-k exact-match accuracy, Tanimoto similarity statistics, and the validity rate of generated molecules. Exact matches are decided by comparing InChI strings, and a molecule counts as valid only if it sanitizes and forms a single connected component. Both functions rely only on RDKit.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def compute_molecular_accuracy(predicted_mols, true_mol, k=10):
    """
    Compute top-k exact match accuracy and Tanimoto similarity.

    Args:
        predicted_mols: List of k generated RDKit Mol objects
        true_mol: Ground truth RDKit Mol object
        k: Number of top predictions to consider

    Returns:
        Dictionary with accuracy and similarity metrics
    """
    # Compute reference fingerprint
    true_fp = AllChem.GetMorganFingerprintAsBitVect(true_mol, 2, nBits=2048)
    true_inchi = Chem.MolToInchi(true_mol)

    # Check exact matches
    exact_match = False
    similarities = []

    for pred_mol in predicted_mols[:k]:
        if pred_mol is None:
            similarities.append(0.0)
            continue

        # Check InChI match
        pred_inchi = Chem.MolToInchi(pred_mol)
        if pred_inchi == true_inchi:
            exact_match = True

        # Compute Tanimoto similarity
        pred_fp = AllChem.GetMorganFingerprintAsBitVect(pred_mol, 2, nBits=2048)
        similarity = DataStructs.TanimotoSimilarity(true_fp, pred_fp)
        similarities.append(similarity)

    metrics = {
        f'top_{k}_accuracy': 1.0 if exact_match else 0.0,
        f'max_tanimoto_{k}': max(similarities) if similarities else 0.0,
        f'mean_tanimoto_{k}': sum(similarities) / len(similarities) if similarities else 0.0
    }

    return metrics


def compute_validity(generated_mols):
    """
    Compute validity rate of generated molecules.

    Args:
        generated_mols: List of RDKit Mol objects (may contain None)

    Returns:
        Validity rate as float between 0 and 1
    """
    valid_count = 0

    for mol in generated_mols:
        if mol is None:
            continue

        try:
            # Check if molecule can be sanitized
            Chem.SanitizeMol(mol)
            # Check if single connected component
            mol_frags = Chem.rdmolops.GetMolFrags(mol, asMols=True, sanitizeFrags=True)
            if len(mol_frags) == 1:
                valid_count += 1
        except Exception:
            continue

    return valid_count / len(generated_mols) if generated_mols else 0.0

```
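
Tanimoto similarity on bit-vector fingerprints is the number of shared "on" bits divided by the number of bits that are on in either fingerprint. The sketch below computes it on sets of bit positions, which is enough to sanity-check values by hand. The `tanimoto` helper and the example bit positions are illustrative, and returning 0.0 when both inputs are empty is a convention picked for this sketch rather than a statement about RDKit's behaviour.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


true_bits = {1, 5, 9, 12}
pred_bits = {1, 5, 20}
print(tanimoto(true_bits, pred_bits))  # 0.4  (2 shared bits / 5 bits in the union)
```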

--------------------------------

### Combine, Deduplicate, and Shuffle InChIs - Python

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This snippet combines InChI lists from several sources (HMDB, DSS, COCONUT, MOSES), removes duplicate entries using a set, and then shuffles the unique list randomly. This prepares the data for splitting into training and validation sets.

```python
combined_inchis = hmdb_inchis + dss_inchis + coconut_inchis + moses_inchis
combined_inchis = list(set(combined_inchis))
random.shuffle(combined_inchis)
```
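
One practical detail when deduplicating with a set: the iteration order of a set of strings can change between Python processes because of hash randomization, so seeding the random generator alone does not make the shuffle repeatable. A possible variant, shown below as a sketch, sorts the unique items first and shuffles with a dedicated seeded generator. The notebook excerpt does not show whether a seed is set; the `dedupe_and_shuffle` helper and its `seed` parameter are assumptions.

```python
import random


def dedupe_and_shuffle(*sources, seed=0):
    """Merge lists, drop duplicates, and shuffle reproducibly."""
    unique = sorted(set().union(*sources))  # sort for a deterministic starting order
    rng = random.Random(seed)               # local generator, global state untouched
    rng.shuffle(unique)
    return unique


hmdb = ["InChI=1S/A", "InChI=1S/B"]
moses = ["InChI=1S/B", "InChI=1S/C"]
first = dedupe_and_shuffle(hmdb, moses, seed=42)
second = dedupe_and_shuffle(hmdb, moses, seed=42)
print(first == second, len(first))  # True 3
```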

--------------------------------

### Save Training and Validation InChIs to CSV - Python

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This snippet creates pandas DataFrames from the processed training and validation InChI lists. It then saves these DataFrames to CSV files named 'combined_train.csv' and 'combined_val.csv' in the '../data/fp2mol/combined/preprocessed/' directory, without including the DataFrame index.

```python
combined_train_df = pd.DataFrame(combined_train_inchis, columns=["inchi"])
combined_train_df.to_csv("../data/fp2mol/combined/preprocessed/combined_train.csv", index=False)

combined_val_df = pd.DataFrame(combined_val_inchis, columns=["inchi"])
combined_val_df.to_csv("../data/fp2mol/combined/preprocessed/combined_val.csv", index=False)
```

--------------------------------

### Convert COCONUT SMILES to InChI and Split Dataset (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

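Processes the COCONUT dataset by reading a CSV file, removing stereochemistry, filtering molecules by atom type, and converting the remaining SMILES to InChI identifiers. The data is then split 95/5 into training and validation sets, any training InChI that appears in `excluded_inchis` is dropped, and both sets are saved as CSV files. `filter_with_atom_types` and `excluded_inchis` are defined elsewhere in the notebook. Requires RDKit, pandas, and tqdm.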

```python
coconut_df = pd.read_csv('../data/fp2mol/raw/coconut_complete-10-2024.csv')

coconut_set_raw = set(coconut_df["canonical_smiles"])

coconut_set = set()
for smi in tqdm(coconut_set_raw, desc='Cleaning COCONUT structures', leave=False):
    try:
        mol = Chem.MolFromSmiles(smi)
        smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)
        if filter_with_atom_types(mol):
            coconut_set.add(Chem.MolToInchi(mol))
    except:
        pass

coconut_inchis = list(coconut_set)
random.shuffle(coconut_inchis)

coconut_train_inchis = coconut_inchis[:int(0.95 * len(coconut_inchis))]
coconut_val_inchis = coconut_inchis[int(0.95 * len(coconut_inchis)):]

coconut_train_inchis = [inchi for inchi in coconut_train_inchis if inchi not in excluded_inchis]

coconut_train_df = pd.DataFrame(coconut_train_inchis, columns=["inchi"])
coconut_train_df.to_csv("../data/fp2mol/coconut/preprocessed/coconut_train.csv", index=False)

coconut_val_df = pd.DataFrame(coconut_val_inchis, columns=["inchi"])
coconut_val_df.to_csv("../data/fp2mol/coconut/preprocessed/coconut_val.csv", index=False)
```
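
The split-then-filter step above can be isolated into a few lines. The sketch below mirrors what the snippet does: cut the shuffled list at 95%, then remove from the training part anything found in the exclusion set (the held-out test and validation molecules of the spectral datasets), which keeps those molecules out of pretraining. As in the snippet, the validation part is left unfiltered. The function name and toy identifiers are made up for illustration.

```python
def split_and_filter(items, excluded, train_frac=0.95):
    """Split a shuffled list into train/val and drop excluded items from train."""
    cut = int(train_frac * len(items))
    train, val = items[:cut], items[cut:]
    train = [x for x in train if x not in excluded]  # set lookup keeps this fast
    return train, val


items = [f"mol{i}" for i in range(20)]  # stand-ins for shuffled InChI strings
excluded = {"mol3", "mol19"}            # stand-ins for held-out molecules
train, val = split_and_filter(items, excluded)
print("mol3" in train, "mol19" in val)  # False True
```

Keeping `excluded` as a set rather than a list matters at dataset scale, since the membership test runs once per training molecule.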

--------------------------------

### Process MSG Dataset: SMILES to InChI Conversion and Filtering

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

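This script processes the MSG dataset in the same way as the Canopus script. It loads split and label data, converts SMILES to InChI, applies filtering to the training set, and saves the resulting InChI lists for the train, test, and validation sets. It also adds the MSG test and validation InChIs to the `excluded_inchis` set, which the Canopus script creates, so the Canopus script has to run first. `filter` refers to a helper defined elsewhere in the notebook (Python's builtin `filter` would not accept a single argument). Dependencies include pandas, RDKit, and tqdm.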

```python
msg_split = pd.read_csv('../data/msg/split.tsv', sep='\t')

msg_labels = pd.read_csv('../data/msg/labels.tsv', sep='\t')
msg_labels["name"] = msg_labels["spec"]
msg_labels = msg_labels[["name", "smiles"]].reset_index(drop=True)

msg_labels = msg_labels.merge(msg_split, on="name")

msg_train_inchis = []
msg_test_inchis = []
msg_val_inchis = []

for i in tqdm(range(len(msg_labels)), desc="Converting SMILES to InChI"):
    
    mol = Chem.MolFromSmiles(msg_labels.loc[i, "smiles"])
    smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
    mol = Chem.MolFromSmiles(smi)
    inchi = Chem.MolToInchi(mol)

    if msg_labels.loc[i, "split"] == "train":
        if filter(mol):
            msg_train_inchis.append(inchi)
    elif msg_labels.loc[i, "split"] == "test":
        msg_test_inchis.append(inchi)
    elif msg_labels.loc[i, "split"] == "val":
        msg_val_inchis.append(inchi)

msg_train_df = pd.DataFrame(set(msg_train_inchis), columns=["inchi"])
msg_train_df.to_csv("../data/fp2mol/msg/preprocessed/msg_train.csv", index=False)

msg_test_df = pd.DataFrame(msg_test_inchis, columns=["inchi"])
msg_test_df.to_csv("../data/fp2mol/msg/preprocessed/msg_test.csv", index=False)

msg_val_df = pd.DataFrame(msg_val_inchis, columns=["inchi"])
msg_val_df.to_csv("../data/fp2mol/msg/preprocessed/msg_val.csv", index=False)

excluded_inchis.update(msg_test_inchis + msg_val_inchis)
```

--------------------------------

### Process Canopus Dataset: SMILES to InChI Conversion and Filtering

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

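This script processes the Canopus dataset by reading split and label information, removing stereochemistry, converting SMILES to InChI, and filtering training data with a `filter` helper defined elsewhere in the notebook (not Python's builtin). It then saves the InChI lists for the training, test, and validation sets to CSV files and initializes `excluded_inchis` with the test and validation InChIs so that later steps can keep them out of other training sets. Dependencies include pandas, RDKit, and tqdm.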

```python
canopus_split = pd.read_csv('../data/canopus/splits/canopus_hplus_100_0.tsv', sep='\t')

canopus_labels = pd.read_csv('../data/canopus/labels.tsv', sep='\t')
canopus_labels["name"] = canopus_labels["spec"]
canopus_labels = canopus_labels[["name", "smiles"]].reset_index(drop=True)

canopus_labels = canopus_labels.merge(canopus_split, on="name")

canopus_train_inchis = []
canopus_test_inchis = []
canopus_val_inchis = []

for i in tqdm(range(len(canopus_labels)), desc="Converting SMILES to InChI"):
    
    mol = Chem.MolFromSmiles(canopus_labels.loc[i, "smiles"])
    smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
    mol = Chem.MolFromSmiles(smi)
    inchi = Chem.MolToInchi(mol)

    if canopus_labels.loc[i, "split"] == "train":
        if filter(mol):
            canopus_train_inchis.append(inchi)
    elif canopus_labels.loc[i, "split"] == "test":
        canopus_test_inchis.append(inchi)
    elif canopus_labels.loc[i, "split"] == "val":
        canopus_val_inchis.append(inchi)

canopus_train_df = pd.DataFrame(set(canopus_train_inchis), columns=["inchi"])
canopus_train_df.to_csv("../data/fp2mol/canopus/preprocessed/canopus_train.csv", index=False)

canopus_test_df = pd.DataFrame(canopus_test_inchis, columns=["inchi"])
canopus_test_df.to_csv("../data/fp2mol/canopus/preprocessed/canopus_test.csv", index=False)

canopus_val_df = pd.DataFrame(canopus_val_inchis, columns=["inchi"])
canopus_val_df.to_csv("../data/fp2mol/canopus/preprocessed/canopus_val.csv", index=False)

excluded_inchis = set(canopus_test_inchis + canopus_val_inchis)
```
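The routing logic above can be sketched without pandas or RDKit. The snippet below is a minimal illustration on placeholder strings, not real InChIs; the helper name `route_by_split` is hypothetical, the `filter` check on training molecules is omitted, and the training set is sorted only to make the result deterministic (the notebook writes the unordered set).

```python
# Toy (InChI, split) records; the strings are placeholders, not real InChIs.
records = [
    ("InChI=A", "train"),
    ("InChI=A", "train"),  # duplicate training structure
    ("InChI=B", "test"),
    ("InChI=B", "test"),   # duplicates are kept in test and val
    ("InChI=C", "val"),
    ("InChI=D", "train"),
]


def route_by_split(records):
    """Group InChIs by their split label (hypothetical helper)."""
    buckets = {"train": [], "test": [], "val": []}
    for inchi, split in records:
        if split in buckets:
            buckets[split].append(inchi)
    return buckets


buckets = route_by_split(records)

# Only the training split is deduplicated, mirroring set(canopus_train_inchis).
train_unique = sorted(set(buckets["train"]))

# Test and validation structures form the exclusion set for later datasets.
excluded = set(buckets["test"] + buckets["val"])

print(train_unique)
print(sorted(excluded))
```

Keeping duplicates in the test and validation lists preserves one row per spectrum, while the exclusion set only needs each structure once.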

--------------------------------

### Python Imports for Chemical Informatics

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

Imports the libraries used by the metrics notebook: RDKit for molecule manipulation, pandas for data handling, PuLP for linear programming, `myopic_mces` for MCES distances, joblib for parallel execution, and tqdm for progress bars. RDKit logging is disabled to avoid verbose output.

```python
import pickle
import os
from collections import Counter, defaultdict

import pulp
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import pandas as pd
from myopic_mces import MCES
from joblib import Parallel, delayed
from tqdm import tqdm
from tqdm_joblib import tqdm_joblib


from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
```

--------------------------------

### Split and Filter InChIs into Training/Validation Sets - Python

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This code splits the shuffled InChI list into a training set (the first 95%) and a validation set (the last 5%). It then removes from the training set any InChI present in the `excluded_inchis` set, so that structures from the Canopus and MSG test and validation splits are not used for training. The validation slice is not filtered.

```python
combined_train_inchis = combined_inchis[:int(0.95 * len(combined_inchis))]
combined_val_inchis = combined_inchis[int(0.95 * len(combined_inchis)):]
combined_train_inchis = [inchi for inchi in combined_train_inchis if inchi not in excluded_inchis]
```
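A small, self-contained sketch of the same slicing and exclusion, run on placeholder strings rather than real InChIs. The function name `split_and_exclude` and the toy data are assumptions made for illustration.

```python
def split_and_exclude(inchis, excluded, train_frac=0.95):
    """Slice an already-shuffled list, then drop excluded items from train only."""
    cut = int(train_frac * len(inchis))
    train, val = inchis[:cut], inchis[cut:]
    # Set membership keeps this filter fast even for millions of InChIs.
    train = [inchi for inchi in train if inchi not in excluded]
    return train, val


# 20 placeholder identifiers standing in for a shuffled InChI list.
inchis = [f"InChI=toy-{i}" for i in range(20)]
excluded = {"InChI=toy-3", "InChI=toy-19"}

train, val = split_and_exclude(inchis, excluded)
print(len(train), len(val))
print(val)
```

In this toy run `InChI=toy-3` is removed from the training slice, while `InChI=toy-19` stays in the validation slice because only the training set is filtered.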

--------------------------------

### Compute Metrics using Loaded Predictions (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

Calls `compute_metrics` on the Canopus predictions, presumably loaded from pickle files in a preceding notebook cell. It compares `canopus_pred` against `canopus_true` and writes the results to `canopus_metrics.csv`. The `doMCES` and `doFull` flags control whether MCES is calculated and whether the full set of metrics is computed.

```python
compute_metrics(canopus_true, canopus_pred, "canopus_metrics.csv", doMCES=False, doFull=True)
```

--------------------------------

### Read Molecules from SDF File

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This cell imports the notebook's dependencies, seeds Python's `random` module with 42, restricts RDKit logging to critical messages, and defines `read_from_sdf`. The function scans an SDF file line by line and collects the line that follows each `> <SMILES>` tag. It takes a file path and returns a list of SMILES strings. RDKit is not used for the parsing itself; `tqdm` provides the progress bar.

```python
import random
from collections import Counter

import pandas as pd
from tqdm import tqdm

from rdkit import Chem
from rdkit import RDLogger
from rdkit.Chem import Descriptors

random.seed(42)

lg = RDLogger.logger()
lg.setLevel(RDLogger.CRITICAL)

def read_from_sdf(path):
    res = []
    app = False
    with open(path, 'r') as f:
        for line in tqdm(f.readlines(), desc='Loading SDF structures', leave=False):
            if app:
                res.append(line.strip())
                app = False
            if line.startswith('> <SMILES>'):
                app = True

    return res
```
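The tag-then-value parsing can be tried on a tiny file. The sketch below is a streaming variant under stated assumptions: `iter_sdf_smiles` is a hypothetical name, it iterates the file handle instead of calling `readlines()` and drops the progress bar, and the sample text is a stripped-down stand-in that contains only the data fields, not complete SDF records.

```python
import os
import tempfile


def iter_sdf_smiles(path):
    """Yield the line following each '> <SMILES>' tag without loading the whole file."""
    take_next = False
    with open(path, "r") as f:
        for line in f:
            if take_next:
                yield line.strip()
                take_next = False
            if line.startswith("> <SMILES>"):
                take_next = True


# Stand-in text: real SDF records also carry an atom/bond block before the tags.
sample = (
    "mol-1\n"
    "> <SMILES>\n"
    "CCO\n"
    "\n"
    "$$$$\n"
    "mol-2\n"
    "> <SMILES>\n"
    "c1ccccc1\n"
    "\n"
    "$$$$\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".sdf", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

smiles = list(iter_sdf_smiles(tmp_path))
os.remove(tmp_path)
print(smiles)
```

Streaming keeps memory use flat for large dumps; the trade-off is that a progress bar no longer knows the total line count up front.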

--------------------------------

### Convert MOSES SMILES to InChI and Split Dataset (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

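Converts SMILES strings from the MOSES dataset to InChI identifiers. Each molecule has its stereochemistry removed and must pass `filter_with_atom_types` (defined elsewhere in the notebook); structures that raise an error during conversion are skipped. The unique InChIs are shuffled and split 95/5 into training and validation sets, InChIs in `excluded_inchis` are dropped from the training set, and both sets are saved to CSV files. Dependencies: RDKit, pandas, tqdm.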

```python
moses_df = pd.read_csv('../data/fp2mol/raw/moses_complete.csv')

moses_set_raw = set(moses_df["SMILES"])

moses_set = set()
for smi in tqdm(moses_set_raw, desc='Cleaning MOSES structures', leave=False):
    try:
        mol = Chem.MolFromSmiles(smi)
        smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)
        if filter_with_atom_types(mol):
            moses_set.add(Chem.MolToInchi(mol))
    except Exception:  # skip structures RDKit cannot parse or convert
        pass

moses_inchis = list(moses_set)
random.shuffle(moses_inchis)

moses_train_inchis = moses_inchis[:int(0.95 * len(moses_inchis))]
moses_val_inchis = moses_inchis[int(0.95 * len(moses_inchis)):]

moses_train_inchis = [inchi for inchi in moses_train_inchis if inchi not in excluded_inchis]

moses_train_df = pd.DataFrame(moses_train_inchis, columns=["inchi"])
moses_train_df.to_csv("../data/fp2mol/moses/preprocessed/moses_train.csv", index=False)

moses_val_df = pd.DataFrame(moses_val_inchis, columns=["inchi"])
moses_val_df.to_csv("../data/fp2mol/moses/preprocessed/moses_val.csv", index=False)
```
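One caveat about the shuffle-then-split step: `list(moses_set)` iterates a set of strings, and Python randomizes string hashes per process unless `PYTHONHASHSEED` is fixed, so the order handed to `random.shuffle` may differ between runs even after `random.seed(42)`. The sketch below shows one possible way to make the split independent of set ordering. The function name `reproducible_split`, the sorting step, and the local `random.Random` instance are assumptions, not part of the notebook.

```python
import random


def reproducible_split(inchi_set, seed=42, train_frac=0.95):
    """Shuffle and split so the result depends only on the contents and the seed."""
    ordered = sorted(inchi_set)  # fixed starting order, independent of hashing
    rng = random.Random(seed)    # local generator, leaves global random state alone
    rng.shuffle(ordered)
    cut = int(train_frac * len(ordered))
    return ordered[:cut], ordered[cut:]


# 40 placeholder identifiers standing in for cleaned InChIs.
toy = {f"InChI=toy-{i}" for i in range(40)}

train_a, val_a = reproducible_split(toy)
# Rebuild the set from a different insertion order; the split should not change.
train_b, val_b = reproducible_split(set(reversed(sorted(toy))))

print(len(train_a), len(val_a))
print(train_a == train_b and val_a == val_b)
```

Sorting adds an O(n log n) pass, which is usually small next to the RDKit conversion cost in the cleaning loop.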

--------------------------------

### Convert DSSTox SMILES to InChI and Split Dataset (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

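Processes the DSSTox dataset by converting SMILES strings to InChI identifiers. It loads 13 Excel files (`DSSToxDump1.xlsx` through `DSSToxDump13.xlsx`), drops rows without a SMILES, removes stereochemistry, applies `filter_with_atom_types`, and splits the unique InChIs 95/5 into training and validation sets that are saved to CSV. Requires RDKit, pandas, and tqdm, plus an Excel reader engine for pandas such as openpyxl.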

```python
dss_set_raw = set()
for i in tqdm(range(1, 14), desc='Loading DSSTox structures', leave=False):
    df = pd.read_excel(f'../data/fp2mol/raw/DSSToxDump{i}.xlsx')
    dss_set_raw.update(df[df['SMILES'].notnull()]['SMILES'])

dss_set = set()
for smi in tqdm(dss_set_raw, desc='Cleaning DSSTox structures', leave=False):
    try:
        mol = Chem.MolFromSmiles(smi)
        smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)
        if filter_with_atom_types(mol):
            dss_set.add(Chem.MolToInchi(mol))
    except Exception:  # skip structures RDKit cannot parse or convert
        pass

dss_inchis = list(dss_set)
random.shuffle(dss_inchis)

dss_train_inchis = dss_inchis[:int(0.95 * len(dss_inchis))]
dss_val_inchis = dss_inchis[int(0.95 * len(dss_inchis)):]

dss_train_inchis = [inchi for inchi in dss_train_inchis if inchi not in excluded_inchis]

dss_train_df = pd.DataFrame(dss_train_inchis, columns=["inchi"])
dss_train_df.to_csv("../data/fp2mol/dss/preprocessed/dss_train.csv", index=False)

dss_val_df = pd.DataFrame(dss_val_inchis, columns=["inchi"])
dss_val_df.to_csv("../data/fp2mol/dss/preprocessed/dss_val.csv", index=False)
```

--------------------------------

### Python Molecule Canonicalization with Tautomer Handling

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

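Defines helpers for canonicalizing molecules parsed from InChI strings, using RDKit's MolStandardize module to pick a canonical tautomer. An import fallback supports both the older `TautomerCanonicalizer` API and the newer `TautomerEnumerator`. Also included are `mol2smiles`, which returns `None` when sanitization fails, and `is_valid`, which additionally rejects molecules with more than one fragment.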

```python
try:
    from rdkit.Chem.MolStandardize.tautomer import TautomerCanonicalizer, TautomerTransform
    _RD_TAUTOMER_CANONICALIZER = 'v1'
    _TAUTOMER_TRANSFORMS = (
        TautomerTransform('1,3 heteroatom H shift', 
                          '[#7,S,O,Se,Te;!H0]-[#7X2,#6,#15]=[#7,#16,#8,Se,Te]'),
        TautomerTransform('1,3 (thio)keto/enol r', '[O,S,Se,Te;X2!H0]-[C]=[C]')
    )
except ModuleNotFoundError:
    from rdkit.Chem.MolStandardize.rdMolStandardize import TautomerEnumerator  # newer rdkit
    _RD_TAUTOMER_CANONICALIZER = 'v2'

def canonical_mol_from_inchi(inchi):
    """Canonicalize mol after Chem.MolFromInchi
    Note that this function may be 50 times slower than Chem.MolFromInchi"""
    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        return None
    if _RD_TAUTOMER_CANONICALIZER == 'v1':
        _molvs_t = TautomerCanonicalizer(transforms=_TAUTOMER_TRANSFORMS)
        mol = _molvs_t.canonicalize(mol)
    else:
        _te = TautomerEnumerator()
        mol = _te.Canonicalize(mol)
    return mol

def mol2smiles(mol):
    try:
        Chem.SanitizeMol(mol)
    except ValueError:
        return None
    return Chem.MolToSmiles(mol)

def is_valid(mol):
    smiles = mol2smiles(mol)
    if smiles is None:
        return False

    try:
        mol_frags = Chem.rdmolops.GetMolFrags(mol, asMols=True, sanitizeFrags=True)
    except Exception:  # fragment extraction or sanitization failed
        return False
    if len(mol_frags) > 1:
        return False
    
    return True
```

--------------------------------

### Filter Molecules by Properties

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This function filters RDKit molecules on three criteria: the SMILES contains no disconnected fragments (no `.`), the molecular weight is below 1500, and no atom carries a formal charge. It takes an RDKit molecule object and returns `True` if the molecule passes; any exception raised during the checks counts as a failure. Note that the name `filter` shadows Python's built-in function of the same name within the notebook.

```python
def filter(mol):
    try:
        smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)

        if "." in smi:
            return False
        
        if Descriptors.MolWt(mol) >= 1500:
            return False
        
        for atom in mol.GetAtoms():
            if atom.GetFormalCharge() != 0:
                return False
    except Exception:  # treat any RDKit failure as "does not pass"
        return False
    
    return True
```