### Install DiffMS package

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Installs the DiffMS package in editable mode using pip. Run this command after setting up the environment and dependencies.

```bash
pip install -e .
```

--------------------------------

### Download and Process FP2MOL Dataset

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Executes a series of bash scripts to download and preprocess the FP2MOL dataset. The scripts must be run in the order shown and require `unzip` to be installed.

```bash
bash data_processing/00_download_fp2mol_data.sh
bash data_processing/01_download_canopus_data.sh
bash data_processing/02_download_msg_data.sh
bash data_processing/03_preprocess_fp2mol.sh
```

--------------------------------

### Install PyTorch with CUDA support

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Installs PyTorch 2.3.1 built for CUDA 11.8 using pip. This build is needed for GPU acceleration in the DiffMS project.

```bash
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu118
```

--------------------------------

### Create Mamba Environment for DiffMS

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Creates and activates a Mamba environment named 'diffms' with RDKit and Python 3.9 installed. Mamba offers a faster alternative to Conda for environment creation.

```bash
mamba create -y -n diffms rdkit=2024.09.4 python=3.9
mamba activate diffms
```

--------------------------------

### Create Conda Environment for DiffMS

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Creates and activates a Conda environment named 'diffms' with RDKit and Python 3.9 installed. This is a prerequisite for running the DiffMS code.
```bash
conda create -y -c conda-forge -n diffms rdkit=2024.09.4 python=3.9
conda activate diffms
```

--------------------------------

### Hydra Configuration for Model Training (YAML)

Source: https://context7.com/coleygroup/diffms/llms.txt

Shows the Hydra configuration files (YAML) that set the model training parameters. The four files below provide the defaults for general experiment settings, model architecture, and training.

```yaml
# configs/config.yaml
defaults:
  - _self_
  - general: general_default
  - model: model_default
  - train: train_default
  - dataset: canopus  # or 'msg' or 'fp2mol'

# configs/general/general_default.yaml
name: 'experiment_name'
wandb_name: 'diffms'
gpus: 4
val_samples_to_generate: 10
test_samples_to_generate: 100
load_weights: null  # Path to pretrained checkpoint
encoder_finetune_strategy: null  # freeze, ft-transformer, freeze-transformer
decoder_finetune_strategy: null  # freeze, ft-input, freeze-input, ft-output

# configs/model/model_default.yaml
model: 'graph_tf_v2'  # graph_tf, graph_tf_v2, graph_tf_v3, graph_tf_v4
diffusion_steps: 1000
diffusion_noise_schedule: 'cosine'
transition: 'marginal'  # uniform or marginal
n_layers: 8
hidden_dims: 256
hidden_mlp_dims: 1024

# configs/train/train_default.yaml
lr: 0.0001
weight_decay: 0.0001
scheduler: 'one_cycle'  # const or one_cycle
n_epochs: 300
clip_grad: 10.0
ema_decay: 0.999
save_model: true
```

--------------------------------

### Pretrain Fingerprint-Molecule Model

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Runs the main script for fingerprint-molecule pretraining. Requires setting the dataset in config.yaml to 'fp2mol'.

```bash
python src/fp2mol_main.py
```

--------------------------------

### Finetune Spectra-Molecule Generation Model

Source: https://github.com/coleygroup/diffms/blob/master/README.md

Runs the main script for finetuning the end-to-end spectra-molecule generation model.
Requires setting the dataset in config.yaml to 'msg' or 'canopus'.

```bash
python src/spec2mol_main.py
```

--------------------------------

### FP2Mol Pretraining: Train Molecular Graph Decoder with Morgan Fingerprints

Source: https://context7.com/coleygroup/diffms/llms.txt

This Python script demonstrates the pretraining phase of DiffMS, in which the decoder learns to generate molecules from Morgan fingerprints. It uses Hydra for configuration management and PyTorch Lightning for training. The script loads a dataset, initializes the FP2MolDenoisingDiffusion model, and trains it with a PyTorch Lightning Trainer.

```python
import hydra
from omegaconf import DictConfig
from pytorch_lightning import Trainer
from src.diffusion_model_fp2mol import FP2MolDenoisingDiffusion
from src.datasets import fp2mol_dataset


@hydra.main(version_base='1.3', config_path='configs', config_name='config')
def main(cfg: DictConfig):
    # Load dataset with Morgan fingerprints
    datamodule = fp2mol_dataset.FP2MolDataModule(cfg)
    dataset_infos = fp2mol_dataset.FP2Mol_infos(datamodule, cfg)

    # NOTE: train_metrics, visualization_tools, extra_features, and
    # domain_features are not defined in this snippet; construct them
    # before this point.

    # Initialize diffusion model
    model = FP2MolDenoisingDiffusion(
        cfg=cfg,
        dataset_infos=dataset_infos,
        train_metrics=train_metrics,
        visualization_tools=visualization_tools,
        extra_features=extra_features,
        domain_features=domain_features
    )

    # Train model
    trainer = Trainer(
        gradient_clip_val=cfg.train.clip_grad,
        strategy="ddp",
        accelerator='gpu',
        devices=cfg.general.gpus,
        max_epochs=cfg.train.n_epochs
    )
    trainer.fit(model, datamodule=datamodule)


if __name__ == '__main__':
    main()
```

--------------------------------

### Prepare Spectra-Molecule Dataset with Splits (Python)

Source: https://context7.com/coleygroup/diffms/llms.txt

Loads and prepares paired mass spectra and molecular structure data using PyTorch Geometric. It defines custom splitters and featurizers for spectra and molecules, including Morgan fingerprints.
The function returns a dictionary containing DataLoader objects for the training, validation, and test sets.

```python
from src.mist.data import datasets, splitter, featurizers
from torch_geometric.loader import DataLoader


def prepare_spec2mol_dataset(cfg):
    """
    Load spectra-molecule dataset with train/val/test splits.

    Args:
        cfg: Configuration dict with dataset parameters
            - datadir: Directory with spectrum files
            - split_file: Path to predefined splits
            - morgan_nbits: Fingerprint size (e.g., 4096)

    Returns:
        Dictionary with train, val, test DataLoaders
    """
    # Define data splitter
    data_splitter = splitter.PresetSpectraSplitter(
        split_file=cfg.dataset.split_file
    )

    # Define featurizer for spectra and molecules
    paired_featurizer = featurizers.PairedFeaturizer(
        spec_featurizer=featurizers.PeakFormula(**cfg.dataset),
        mol_featurizer=featurizers.FingerprintFeaturizer(
            fp_names=['morgan4096'],
            **cfg.dataset
        ),
        graph_featurizer=featurizers.GraphFeaturizer(**cfg.dataset)
    )

    # Load spectra-molecule pairs
    spectra_mol_pairs = datasets.get_paired_spectra(**cfg.dataset)
    spectra_mol_pairs = list(zip(*spectra_mol_pairs))

    # Split dataset
    split_name, (train, val, test) = data_splitter.get_splits(spectra_mol_pairs)

    # Create datasets
    ms_datasets = {
        'train': datasets.SpectraMolDataset(
            spectra_mol_list=train,
            featurizer=paired_featurizer,
            **cfg.dataset
        ),
        'val': datasets.SpectraMolDataset(
            spectra_mol_list=val,
            featurizer=paired_featurizer,
            **cfg.dataset
        ),
        'test': datasets.SpectraMolDataset(
            spectra_mol_list=test,
            featurizer=paired_featurizer,
            **cfg.dataset
        )
    }

    # Create dataloaders
    dataloaders = {
        'train': DataLoader(ms_datasets['train'], batch_size=32, shuffle=True),
        'val': DataLoader(ms_datasets['val'], batch_size=32, shuffle=False),
        'test': DataLoader(ms_datasets['test'], batch_size=32, shuffle=False)
    }

    return dataloaders
```

--------------------------------

### Spec2Mol Fine-tuning: Generate Molecules from Mass Spectra

Source: https://context7.com/coleygroup/diffms/llms.txt

This Python script covers the fine-tuning stage of DiffMS for spectrum-to-molecule generation. It uses Hydra for configuration and PyTorch Lightning for training, and supports several encoder/decoder freezing strategies. The script loads a spectra-molecule dataset, initializes the Spec2MolDenoisingDiffusion model, optionally freezes parts of the encoder or decoder, loads pretrained weights, and then fine-tunes the model.

```python
import hydra
import torch
from omegaconf import DictConfig
from pytorch_lightning import Trainer
from src.diffusion_model_spec2mol import Spec2MolDenoisingDiffusion
from src.datasets import spec2mol_dataset


@hydra.main(version_base='1.3', config_path='configs', config_name='config')
def main(cfg: DictConfig):
    # Load spectra-molecule dataset
    datamodule = spec2mol_dataset.Spec2MolDataModule(cfg)
    dataset_infos = spec2mol_dataset.Spec2MolDatasetInfos(datamodule, cfg)

    # Initialize model (pretrained weights are loaded further down).
    # NOTE: model_kwargs is not defined in this snippet; build it beforehand.
    model = Spec2MolDenoisingDiffusion(cfg=cfg, dataset_infos=dataset_infos, **model_kwargs)

    # Apply fine-tuning strategies (freeze, ft-transformer, etc.)
    if cfg.general.encoder_finetune_strategy == 'freeze':
        for param in model.encoder.parameters():
            param.requires_grad = False

    if cfg.general.decoder_finetune_strategy == 'freeze-transformer':
        for param in model.decoder.tf_layers.parameters():
            param.requires_grad = False

    # Load pretrained weights
    if cfg.general.load_weights is not None:
        checkpoint = torch.load(cfg.general.load_weights, map_location='cpu')
        model.load_state_dict(checkpoint['state_dict'], strict=False)

    # Fine-tune
    trainer = Trainer(
        strategy="ddp_find_unused_parameters_true",
        accelerator='gpu',
        devices=cfg.general.gpus,
        max_epochs=cfg.train.n_epochs
    )
    trainer.fit(model, datamodule=datamodule)


if __name__ == '__main__':
    main()
```

--------------------------------

### Molecular Sampling: Generate Molecules from Diffusion Model

Source: https://context7.com/coleygroup/diffms/llms.txt

This snippet imports the PyTorch, PyTorch Geometric, and RDKit modules needed for molecular sampling with the reverse diffusion process, along with the repository's `utils` and `diffusion_utils` helpers. These are the imports used when generating molecule candidates from a trained diffusion model.

```python
import torch
from torch_geometric.data import Batch
from rdkit import Chem
import torch.nn.functional as F
from src import utils
from src.diffusion import diffusion_utils
```

--------------------------------

### Sample Molecules from Batch using PyTorch

Source: https://context7.com/coleygroup/diffms/llms.txt

Generates multiple molecule candidates for each input in a batch using a trained diffusion model. The steps are: convert the batch to a dense representation, sample from the noise distribution, run the reverse diffusion process, and convert the resulting graphs to RDKit molecule objects.
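The time indexing of the reverse process is easier to follow when pulled out on its own. The sketch below is a plain-Python illustration, not code from the DiffMS repository: `reverse_schedule` is a hypothetical helper that lists the normalized `(s, t)` pairs visited when stepping from `t = T` down to `s = 0`, using the same `s_int`, `s_norm`, and `t_norm` arithmetic as the sampling loop in this snippet.

```python
def reverse_schedule(T):
    """Return the (s_norm, t_norm) pairs visited by a T-step reverse process.

    Hypothetical helper for illustration only: each step denoises from
    time t = s + 1 to time s, and both are divided by T so they lie in [0, 1].
    """
    pairs = []
    for s_int in reversed(range(T)):  # T-1, T-2, ..., 0
        t_int = s_int + 1
        pairs.append((s_int / T, t_int / T))
    return pairs


# Walk a tiny 4-step schedule from pure noise (t = 1) to the final sample (s = 0)
for s_norm, t_norm in reverse_schedule(4):
    print(f"denoise from t={t_norm:.2f} to s={s_norm:.2f}")
```

With `diffusion_steps: 1000` the same loop would produce 1000 such pairs per generated sample, which is why sampling cost grows with both `T` and `num_samples`.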
```python
import torch
from rdkit import Chem
from torch_geometric.data import Batch

# utils and diffusion_utils come from the DiffMS repository,
# as imported in the previous snippet.
from src import utils
from src.diffusion import diffusion_utils


@torch.no_grad()
def sample_batch(model, batch: Batch, num_samples: int = 10) -> list[list[Chem.Mol]]:
    """
    Generate multiple molecule candidates for each input in the batch.

    Args:
        model: Trained diffusion model
        batch: Input batch with fingerprints or spectra
        num_samples: Number of molecules to generate per input

    Returns:
        List of molecule lists, one per input
    """
    # Convert to dense representation
    dense_data, node_mask = utils.to_dense(
        batch.x, batch.edge_index, batch.edge_attr, batch.batch
    )

    predicted_mols = [list() for _ in range(len(batch))]

    for _ in range(num_samples):
        # Sample from noise distribution
        z_T = diffusion_utils.sample_discrete_feature_noise(
            limit_dist=model.limit_dist,
            node_mask=node_mask
        )
        # Node types stay fixed to the input; only the edges start from noise
        X, E, y = dense_data.X, z_T.E, batch.y

        # Reverse diffusion process
        for s_int in reversed(range(0, model.T)):
            s_array = s_int * torch.ones((len(batch), 1), device=model.device)
            t_array = s_array + 1
            s_norm = s_array / model.T
            t_norm = t_array / model.T

            # Sample z_s given z_t
            sampled_s, _ = model.sample_p_zs_given_zt(
                s_norm, t_norm, X, E, y, node_mask
            )
            X, E, y = sampled_s.X, sampled_s.E, batch.y

        # Convert graphs to molecules
        sampled_s = sampled_s.mask(node_mask, collapse=True)
        for idx, (nodes, adj_mat) in enumerate(zip(sampled_s.X, sampled_s.E)):
            mol = model.visualization_tools.mol_from_graphs(nodes, adj_mat)
            predicted_mols[idx].append(mol)

    return predicted_mols
```

--------------------------------

### Process Molecule to Graph with Morgan Fingerprint (Python)

Source: https://context7.com/coleygroup/diffms/llms.txt

Converts an InChI string into a PyTorch Geometric Data object representing a molecular graph, including atom and bond features, and computes a Morgan fingerprint. It uses RDKit for molecule parsing and PyTorch for tensor operations. The function returns None when the InChI string cannot be parsed.
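The bond encoding used by this conversion follows a convention that is easy to miss: every bond is stored twice, once per direction, and bond class indices are shifted by one so that class 0 can mean "no bond". The sketch below shows that convention in plain Python with no RDKit or PyTorch dependency. `encode_bonds` and the string bond names are hypothetical stand-ins chosen for illustration.

```python
# Same ordering as the bond_types mapping in the snippet, keyed by name here
BOND_TYPES = {"SINGLE": 0, "DOUBLE": 1, "TRIPLE": 2, "AROMATIC": 3}


def encode_bonds(bonds):
    """Turn (begin_idx, end_idx, bond_name) triples into directed edge lists.

    Returns (row, col, edge_type), where each bond contributes two directed
    edges and edge_type is the bond class shifted by one (0 = no bond).
    """
    row, col, edge_type = [], [], []
    for start, end, name in bonds:
        row += [start, end]
        col += [end, start]
        edge_type += 2 * [BOND_TYPES[name] + 1]
    return row, col, edge_type


# Three heavy atoms C(0)-C(1)=O(2): one single bond and one double bond
row, col, edge_type = encode_bonds([(0, 1, "SINGLE"), (1, 2, "DOUBLE")])
print(row, col, edge_type)
```

The shift by one is what lets a decoder treat index 0 of a dense adjacency tensor as the absence of a bond.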
```python
import torch
import torch.nn.functional as F
from rdkit import Chem
from rdkit.Chem.AllChem import GetMorganFingerprintAsBitVect
import numpy as np
from torch_geometric.data import Data


def process_molecule_to_graph(inchi, morgan_r=2, morgan_nbits=2048):
    """
    Convert InChI string to graph representation with fingerprint.

    Args:
        inchi: InChI string representation of molecule
        morgan_r: Morgan fingerprint radius
        morgan_nbits: Number of bits in fingerprint

    Returns:
        PyTorch Geometric Data object with graph and fingerprint
    """
    # Define atom and bond types
    atom_types = {'C': 0, 'O': 1, 'P': 2, 'N': 3, 'S': 4, 'Cl': 5, 'F': 6, 'H': 7}
    bond_types = {
        Chem.rdchem.BondType.SINGLE: 0,
        Chem.rdchem.BondType.DOUBLE: 1,
        Chem.rdchem.BondType.TRIPLE: 2,
        Chem.rdchem.BondType.AROMATIC: 3
    }

    # Parse molecule
    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        return None

    # Remove stereochemistry
    smi = Chem.MolToSmiles(mol, isomericSmiles=False)
    mol = Chem.MolFromSmiles(smi)

    N = mol.GetNumAtoms()

    # Extract atom features (raises KeyError for elements outside atom_types)
    type_idx = []
    for atom in mol.GetAtoms():
        symbol = atom.GetSymbol()
        type_idx.append(atom_types[symbol])
    x = F.one_hot(torch.tensor(type_idx), num_classes=len(atom_types)).float()

    # Extract bond features; each bond is stored once per direction
    row, col, edge_type = [], [], []
    for bond in mol.GetBonds():
        start, end = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        row += [start, end]
        col += [end, start]
        bond_idx = bond_types[bond.GetBondType()] + 1  # 0 reserved for no bond
        edge_type += 2 * [bond_idx]

    edge_index = torch.tensor([row, col], dtype=torch.long)
    edge_attr = F.one_hot(
        torch.tensor(edge_type),
        num_classes=len(bond_types) + 1
    ).float()

    # Compute Morgan fingerprint
    fp = GetMorganFingerprintAsBitVect(mol, morgan_r, nBits=morgan_nbits)
    y = torch.tensor(np.asarray(fp, dtype=np.int8)).unsqueeze(0)

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y)
```

--------------------------------

### Prepare and Compute Molecule Metrics (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

This code converts the true and predicted molecule objects to InChI strings, computes per-example metrics in parallel, aggregates the results, and saves them to a CSV file. Predictions are deduplicated and sorted by frequency. Metrics are averaged for k = 1 to 100 when `doFull` is set and for k = 1 to 10 otherwise. Dependencies include RDKit (Chem), collections, tqdm, joblib, pandas, and pulp.

```python
# This appears to be the body of the notebook's
# compute_metrics(true, pred, csv_path, doMCES, doFull) function.
# is_valid and compute_metrics_for_one are defined elsewhere in the notebook.
true_inchi = []
pred_inchi = []
for i in range(len(true)):
    local_pred_inchi = []
    for j in range(len(pred[i])):
        if is_valid(pred[i][j]):
            local_pred_inchi.append(Chem.MolToInchi(pred[i][j]))

    # sort local_pred_inchi by frequency
    inchi_counts = Counter(local_pred_inchi)
    local_pred_inchi = [item for item, count in inchi_counts.most_common()]

    if not doFull:
        local_pred_inchi = local_pred_inchi[:11]

    pred_inchi.append(local_pred_inchi)
    true_inchi.append(Chem.MolToInchi(true[i]))

solver = pulp.listSolvers(onlyAvailable=True)[0]

with tqdm_joblib(tqdm(total=len(true_inchi))) as progress_bar:
    results = Parallel(n_jobs=-1)(
        delayed(compute_metrics_for_one)(
            true_inchi[i], pred_inchi[i], solver, doMCES=doMCES, doFull=doFull
        )
        for i in range(len(true_inchi))
    )

# aggregate results
final_metrics = defaultdict(float)
for r in results:
    for key, val in r.items():
        final_metrics[key] += val

# Average each metric over all examples; k runs to 100 with doFull, else to 10
max_k = 100 if doFull else 10
metric_names = ['acc', 'mces', 'tanimoto', 'cosine', 'close_match', 'meaningful_match']
for k in range(1, max_k + 1):
    for name in metric_names:
        final_metrics[f'{name}@{k}'] /= len(true_inchi)

df = pd.DataFrame(final_metrics, index=[0])
df.to_csv(csv_path, index=False)
```

--------------------------------

### Load Model Predictions from Pickle Files (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

This snippet loads model predictions and ground truth data from multiple pickle files. For each run index it starts at file index `idx - 1` and advances in steps of 4, extending the lists for as long as the matching prediction file exists. It is typically used to collect results saved by earlier evaluation runs.

```python
# example code of loading model predictions as saved in the diffusion_model_spec2mol.py test step
# paths/loading will be different
canopus_true = []
canopus_pred = []
for idx in range(1, 5):
    i = idx-1
    while os.path.exists(f"../final_results/canopus/spec2mol-canopus-eval-{idx}_resume_pred_{i}.pkl"):
        with open(f"../final_results/canopus/spec2mol-canopus-eval-{idx}_resume_true_{i}.pkl", 'rb') as f:
            canopus_true.extend(pickle.load(f))
        with open(f"../final_results/canopus/spec2mol-canopus-eval-{idx}_resume_pred_{i}.pkl", 'rb') as f:
            canopus_pred.extend(pickle.load(f))
        i += 4
```

--------------------------------

### Build Molecule from Graph Representation using RDKit

Source: https://context7.com/coleygroup/diffms/llms.txt

Constructs an RDKit molecule object from discrete atom and edge type tensors. It maps atom indices to symbols and bond type indices to RDKit bond types. If sanitization fails, the function prints the error and returns None.

```python
import torch
from rdkit import Chem
from rdkit.Chem import AllChem
from typing import List, Optional


def build_molecule_with_partial_charges(atom_types: torch.Tensor, edge_types: torch.Tensor, atom_decoder: List[str]) -> Optional[Chem.Mol]:
    """
    Construct a molecule from discrete graph representation.
    Args:
        atom_types: Tensor of shape (n_atoms,) with atom type indices
        edge_types: Tensor of shape (n_atoms, n_atoms) with bond type indices
        atom_decoder: List mapping indices to atom symbols

    Returns:
        RDKit Mol object or None if construction fails
    """
    # Index 0 means "no bond"; indices 1-4 map to RDKit bond types
    bond_dict = [
        None,
        Chem.rdchem.BondType.SINGLE,
        Chem.rdchem.BondType.DOUBLE,
        Chem.rdchem.BondType.TRIPLE,
        Chem.rdchem.BondType.AROMATIC
    ]

    mol = Chem.RWMol()

    # Add atoms
    for atom_idx in atom_types:
        atom = Chem.Atom(atom_decoder[atom_idx.item()])
        mol.AddAtom(atom)

    # Add bonds (upper triangle only, so each bond is added once)
    edge_types = torch.triu(edge_types, diagonal=1)
    for i in range(len(atom_types)):
        for j in range(i + 1, len(atom_types)):
            bond_type_idx = edge_types[i, j].item()
            if bond_type_idx > 0:
                try:
                    mol.AddBond(i, j, bond_dict[bond_type_idx])
                except Exception as e:
                    print(f"Error adding bond between {i} and {j} with type {bond_type_idx}: {e}")
                    continue

    try:
        mol = mol.GetMol()
        Chem.SanitizeMol(mol)
        return mol
    except Exception as e:
        print(f"Error sanitizing molecule: {e}")
        return None
```

--------------------------------

### Compute Molecular Accuracy and Validity Metrics (Python)

Source: https://context7.com/coleygroup/diffms/llms.txt

This snippet provides Python functions that calculate molecular accuracy (top-k exact match and Tanimoto similarity) and the validity rate of generated molecules. The functions need only RDKit.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs


def compute_molecular_accuracy(predicted_mols, true_mol, k=10):
    """
    Compute top-k exact match accuracy and Tanimoto similarity.
    Args:
        predicted_mols: List of k generated RDKit Mol objects
        true_mol: Ground truth RDKit Mol object
        k: Number of top predictions to consider

    Returns:
        Dictionary with accuracy and similarity metrics
    """
    # Compute reference fingerprint
    true_fp = AllChem.GetMorganFingerprintAsBitVect(true_mol, 2, nBits=2048)
    true_inchi = Chem.MolToInchi(true_mol)

    # Check exact matches
    exact_match = False
    similarities = []

    for pred_mol in predicted_mols[:k]:
        if pred_mol is None:
            similarities.append(0.0)
            continue

        # Check InChI match
        pred_inchi = Chem.MolToInchi(pred_mol)
        if pred_inchi == true_inchi:
            exact_match = True

        # Compute Tanimoto similarity
        pred_fp = AllChem.GetMorganFingerprintAsBitVect(pred_mol, 2, nBits=2048)
        similarity = DataStructs.TanimotoSimilarity(true_fp, pred_fp)
        similarities.append(similarity)

    metrics = {
        f'top_{k}_accuracy': 1.0 if exact_match else 0.0,
        f'max_tanimoto_{k}': max(similarities) if similarities else 0.0,
        f'mean_tanimoto_{k}': sum(similarities) / len(similarities) if similarities else 0.0
    }

    return metrics


def compute_validity(generated_mols):
    """
    Compute validity rate of generated molecules.

    Args:
        generated_mols: List of RDKit Mol objects (may contain None)

    Returns:
        Validity rate as float between 0 and 1
    """
    valid_count = 0
    for mol in generated_mols:
        if mol is None:
            continue
        try:
            # Check if molecule can be sanitized
            Chem.SanitizeMol(mol)
            # Check if single connected component
            mol_frags = Chem.rdmolops.GetMolFrags(mol, asMols=True, sanitizeFrags=True)
            if len(mol_frags) == 1:
                valid_count += 1
        except Exception:
            # Sanitization or fragmentation failed: count as invalid
            continue

    return valid_count / len(generated_mols) if generated_mols else 0.0
```

--------------------------------

### Combine, Deduplicate, and Shuffle InChIs - Python

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This snippet combines InChI lists from several sources (HMDB, DSS, COCONUT, MOSES), removes duplicate entries using a set, and then shuffles the unique list randomly.
This prepares the data for splitting into training and validation sets.

```python
combined_inchis = hmdb_inchis + dss_inchis + coconut_inchis + moses_inchis
combined_inchis = list(set(combined_inchis))
random.shuffle(combined_inchis)
```

--------------------------------

### Save Training and Validation InChIs to CSV - Python

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This snippet creates pandas DataFrames from the processed training and validation InChI lists. It then saves them as 'combined_train.csv' and 'combined_val.csv' in the '../data/fp2mol/combined/preprocessed/' directory, without the DataFrame index.

```python
combined_train_df = pd.DataFrame(combined_train_inchis, columns=["inchi"])
combined_train_df.to_csv("../data/fp2mol/combined/preprocessed/combined_train.csv", index=False)

combined_val_df = pd.DataFrame(combined_val_inchis, columns=["inchi"])
combined_val_df.to_csv("../data/fp2mol/combined/preprocessed/combined_val.csv", index=False)
```

--------------------------------

### Convert COCONUT SMILES to InChI and Split Dataset (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

Processes the COCONUT dataset by reading a CSV file, removing stereochemistry, filtering molecules by atom type, and converting the remaining SMILES to InChI identifiers. The data is then split 95/5 into training and validation sets and saved as CSV files. Requires RDKit, Pandas, and tqdm.
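The shuffle, split, and exclude steps recur across the dataset-building snippets. As a minimal sketch (assuming a hypothetical helper name `split_train_val`; the notebook inlines these steps instead), they can be written once in plain Python. Sorting before the seeded shuffle is a choice made here so that the result does not depend on set iteration order; the notebook does not do this. The strings in the example are placeholders, not valid InChIs.

```python
import random


def split_train_val(items, excluded, train_frac=0.95, seed=42):
    """Deduplicate, shuffle, split, then drop held-out items from train only.

    Hypothetical helper for illustration. As in the notebook snippets,
    the exclusion filter is applied to the training portion only.
    """
    unique = sorted(set(items))      # fixed order before shuffling
    rng = random.Random(seed)        # local RNG, leaves global state untouched
    rng.shuffle(unique)
    cut = int(train_frac * len(unique))
    train, val = unique[:cut], unique[cut:]
    train = [x for x in train if x not in excluded]
    return train, val


# Placeholder identifiers with one duplicate and two held-out entries
inchis = [f"mock-inchi-{i}" for i in range(20)] + ["mock-inchi-0"]
held_out = {"mock-inchi-3", "mock-inchi-7"}
train, val = split_train_val(inchis, held_out)
print(len(train), len(val))
```

Keeping `excluded` as a set makes each membership test constant time, which matters when the training list holds millions of structures.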
```python
coconut_df = pd.read_csv('../data/fp2mol/raw/coconut_complete-10-2024.csv')

coconut_set_raw = set(coconut_df["canonical_smiles"])

coconut_set = set()
for smi in tqdm(coconut_set_raw, desc='Cleaning COCONUT structures', leave=False):
    try:
        mol = Chem.MolFromSmiles(smi)
        smi = Chem.MolToSmiles(mol, isomericSmiles=False)  # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)
        if filter_with_atom_types(mol):
            coconut_set.add(Chem.MolToInchi(mol))
    except Exception:
        # Skip structures that RDKit cannot parse or convert
        pass

coconut_inchis = list(coconut_set)
random.shuffle(coconut_inchis)

coconut_train_inchis = coconut_inchis[:int(0.95 * len(coconut_inchis))]
coconut_val_inchis = coconut_inchis[int(0.95 * len(coconut_inchis)):]

coconut_train_inchis = [inchi for inchi in coconut_train_inchis if inchi not in excluded_inchis]

coconut_train_df = pd.DataFrame(coconut_train_inchis, columns=["inchi"])
coconut_train_df.to_csv("../data/fp2mol/coconut/preprocessed/coconut_train.csv", index=False)

coconut_val_df = pd.DataFrame(coconut_val_inchis, columns=["inchi"])
coconut_val_df.to_csv("../data/fp2mol/coconut/preprocessed/coconut_val.csv", index=False)
```

--------------------------------

### Process MSG Dataset: SMILES to InChI Conversion and Filtering

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This script processes the MSG dataset in the same way as the Canopus script. It loads split and label data, converts SMILES to InChI, applies filtering to the training set only, and saves the resulting InChI lists for the train, test, and validation sets. It also adds the MSG test and validation InChIs to the existing `excluded_inchis` set. Dependencies include pandas and RDKit.
```python
msg_split = pd.read_csv('../data/msg/split.tsv', sep='\t')

msg_labels = pd.read_csv('../data/msg/labels.tsv', sep='\t')
msg_labels["name"] = msg_labels["spec"]
msg_labels = msg_labels[["name", "smiles"]].reset_index(drop=True)

msg_labels = msg_labels.merge(msg_split, on="name")

msg_train_inchis = []
msg_test_inchis = []
msg_val_inchis = []

for i in tqdm(range(len(msg_labels)), desc="Converting SMILES to InChI"):
    mol = Chem.MolFromSmiles(msg_labels.loc[i, "smiles"])
    smi = Chem.MolToSmiles(mol, isomericSmiles=False)  # remove stereochemistry information
    mol = Chem.MolFromSmiles(smi)
    inchi = Chem.MolToInchi(mol)

    if msg_labels.loc[i, "split"] == "train":
        # `filter` here is the notebook's molecule filter, not the Python built-in
        if filter(mol):
            msg_train_inchis.append(inchi)
    elif msg_labels.loc[i, "split"] == "test":
        msg_test_inchis.append(inchi)
    elif msg_labels.loc[i, "split"] == "val":
        msg_val_inchis.append(inchi)

msg_train_df = pd.DataFrame(set(msg_train_inchis), columns=["inchi"])
msg_train_df.to_csv("../data/fp2mol/msg/preprocessed/msg_train.csv", index=False)

msg_test_df = pd.DataFrame(msg_test_inchis, columns=["inchi"])
msg_test_df.to_csv("../data/fp2mol/msg/preprocessed/msg_test.csv", index=False)

msg_val_df = pd.DataFrame(msg_val_inchis, columns=["inchi"])
msg_val_df.to_csv("../data/fp2mol/msg/preprocessed/msg_val.csv", index=False)

excluded_inchis.update(msg_test_inchis + msg_val_inchis)
```

--------------------------------

### Process Canopus Dataset: SMILES to InChI Conversion and Filtering

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This script processes the Canopus dataset by reading split and label information, converting SMILES to InChI, and filtering the training data with the notebook's `filter` function. It then saves the InChI lists for the training, test, and validation sets to CSV files and creates the `excluded_inchis` set from the test and validation InChIs. Dependencies include pandas and RDKit.
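The routing logic shared by the Canopus and MSG scripts can be looked at separately from RDKit and pandas. The sketch below is an illustration under stated assumptions: `route_by_split`, the `keep` predicate, and the toy records are hypothetical, standing in for the notebook's `filter` function and label table. It shows that only training entries are filtered and deduplicated, while test and validation entries are kept as they are and collected into the held-out set.

```python
def route_by_split(records, keep):
    """Group identifiers by split label and build the held-out set.

    records: iterable of (identifier, split) pairs.
    keep: predicate applied to training entries only (hypothetical stand-in
          for the notebook's molecule filter).
    """
    buckets = {"train": [], "test": [], "val": []}
    for identifier, split in records:
        if split == "train" and not keep(identifier):
            continue                    # filtered out of training
        if split in buckets:
            buckets[split].append(identifier)
    train = sorted(set(buckets["train"]))  # deduplicate training entries
    excluded = set(buckets["test"]) | set(buckets["val"])
    return train, buckets["test"], buckets["val"], excluded


# Toy records: a duplicated training entry, one rejected entry, one test, one val
records = [
    ("mol-a", "train"),
    ("mol-a", "train"),
    ("mol-too-big", "train"),
    ("mol-b", "test"),
    ("mol-c", "val"),
]
train, test, val, excluded = route_by_split(records, keep=lambda s: len(s) <= 5)
print(train, test, val, sorted(excluded))
```

The returned `excluded` set plays the role of `excluded_inchis`: later training sets drop anything it contains so that evaluation structures never leak into pretraining data.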
```python
canopus_split = pd.read_csv('../data/canopus/splits/canopus_hplus_100_0.tsv', sep='\t')

canopus_labels = pd.read_csv('../data/canopus/labels.tsv', sep='\t')
canopus_labels["name"] = canopus_labels["spec"]
canopus_labels = canopus_labels[["name", "smiles"]].reset_index(drop=True)

canopus_labels = canopus_labels.merge(canopus_split, on="name")

canopus_train_inchis = []
canopus_test_inchis = []
canopus_val_inchis = []

for i in tqdm(range(len(canopus_labels)), desc="Converting SMILES to InChI"):
    mol = Chem.MolFromSmiles(canopus_labels.loc[i, "smiles"])
    smi = Chem.MolToSmiles(mol, isomericSmiles=False)  # remove stereochemistry information
    mol = Chem.MolFromSmiles(smi)
    inchi = Chem.MolToInchi(mol)

    if canopus_labels.loc[i, "split"] == "train":
        if filter(mol):
            canopus_train_inchis.append(inchi)
    elif canopus_labels.loc[i, "split"] == "test":
        canopus_test_inchis.append(inchi)
    elif canopus_labels.loc[i, "split"] == "val":
        canopus_val_inchis.append(inchi)

canopus_train_df = pd.DataFrame(set(canopus_train_inchis), columns=["inchi"])
canopus_train_df.to_csv("../data/fp2mol/canopus/preprocessed/canopus_train.csv", index=False)

canopus_test_df = pd.DataFrame(canopus_test_inchis, columns=["inchi"])
canopus_test_df.to_csv("../data/fp2mol/canopus/preprocessed/canopus_test.csv", index=False)

canopus_val_df = pd.DataFrame(canopus_val_inchis, columns=["inchi"])
canopus_val_df.to_csv("../data/fp2mol/canopus/preprocessed/canopus_val.csv", index=False)

excluded_inchis = set(canopus_test_inchis + canopus_val_inchis)
```

--------------------------------

### Python Imports for Chemical Informatics

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

Imports necessary libraries for cheminformatics, including RDKit for molecule manipulation, pandas for data handling, and others for optimization and progress tracking. Disables RDKit logging to avoid verbose output.
```python
import pickle
import os
from collections import Counter, defaultdict

import pulp
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import pandas as pd
from myopic_mces import MCES
from joblib import Parallel, delayed
from tqdm import tqdm
from tqdm_joblib import tqdm_joblib

from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
```

--------------------------------

### Split and Filter InChIs into Training/Validation Sets - Python

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This code splits the shuffled InChI list into training and validation sets, with 95% for training and 5% for validation. It then removes from the training set any InChI present in the `excluded_inchis` set, so that held-out test and validation structures are not included in the training data.

```python
combined_train_inchis = combined_inchis[:int(0.95 * len(combined_inchis))]
combined_val_inchis = combined_inchis[int(0.95 * len(combined_inchis)):]

combined_train_inchis = [inchi for inchi in combined_train_inchis if inchi not in excluded_inchis]
```

--------------------------------

### Compute Metrics using Loaded Predictions (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

Calls `compute_metrics` on the Canopus ground truth and predictions, such as the lists loaded from pickle files, and saves the results to 'canopus_metrics.csv'. `doMCES=False` skips the MCES calculation, and `doFull=True` requests the full metric computation.

```python
compute_metrics(canopus_true, canopus_pred, "canopus_metrics.csv", doMCES=False, doFull=True)
```

--------------------------------

### Read Molecules from SDF File

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This function reads molecular structures from an SDF file, extracting SMILES strings.
It iterates through the lines of the file and collects the line that follows each data-field header (a line starting with `> `), treating that line as a SMILES string. It uses `tqdm` for progress bars; RDKit is imported for the later processing steps, and its logger is set to `CRITICAL` to suppress warnings. It takes a file path as input and returns a list of SMILES strings.

```python
import random
from collections import Counter

import pandas as pd
from tqdm import tqdm
from rdkit import Chem
from rdkit import RDLogger
from rdkit.Chem import Descriptors

random.seed(42)

# Silence RDKit warnings below the CRITICAL level
lg = RDLogger.logger()
lg.setLevel(RDLogger.CRITICAL)

def read_from_sdf(path):
    res = []
    app = False  # True when the previous line was a data-field header
    with open(path, 'r') as f:
        for line in tqdm(f.readlines(), desc='Loading SDF structures', leave=False):
            if app:
                res.append(line.strip())
            app = False
            if line.startswith('> '):
                app = True

    return res
```

--------------------------------

### Convert MOSES SMILES to InChI and Split Dataset (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

Converts SMILES strings from the MOSES dataset to InChI identifiers. The process includes filtering molecules, removing stereochemistry information, and splitting the data into training and validation sets, which are then saved to CSV files. Dependencies: RDKit, Pandas, tqdm.

```python
moses_df = pd.read_csv('../data/fp2mol/raw/moses_complete.csv')
moses_set_raw = set(moses_df["SMILES"])

moses_set = set()
for smi in tqdm(moses_set_raw, desc='Cleaning MOSES structures', leave=False):
    try:
        mol = Chem.MolFromSmiles(smi)
        smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)
        if filter_with_atom_types(mol):
            moses_set.add(Chem.MolToInchi(mol))
    except:
        pass

moses_inchis = list(moses_set)
random.shuffle(moses_inchis)

moses_train_inchis = moses_inchis[:int(0.95 * len(moses_inchis))]
moses_val_inchis = moses_inchis[int(0.95 * len(moses_inchis)):]

moses_train_inchis = [inchi for inchi in moses_train_inchis if inchi not in excluded_inchis]

moses_train_df = pd.DataFrame(moses_train_inchis, columns=["inchi"])
moses_train_df.to_csv("../data/fp2mol/moses/preprocessed/moses_train.csv", index=False)

moses_val_df = pd.DataFrame(moses_val_inchis, columns=["inchi"])
moses_val_df.to_csv("../data/fp2mol/moses/preprocessed/moses_val.csv", index=False)
```

--------------------------------

### Convert DSSTox SMILES to InChI and Split Dataset (Python)

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

Processes the DSSTox dataset by converting SMILES strings to InChI identifiers. It handles multiple Excel files, filters molecules, removes stereochemistry, and splits the data into training and validation sets, saving them to CSV. Requires RDKit, Pandas, and tqdm.

```python
dss_set_raw = set()
for i in tqdm(range(1, 14), desc='Loading DSSTox structures', leave=False):
    df = pd.read_excel(f'../data/fp2mol/raw/DSSToxDump{i}.xlsx')
    dss_set_raw.update(df[df['SMILES'].notnull()]['SMILES'])

dss_set = set()
for smi in tqdm(dss_set_raw, desc='Cleaning DSSTox structures', leave=False):
    try:
        mol = Chem.MolFromSmiles(smi)
        smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)
        if filter_with_atom_types(mol):
            dss_set.add(Chem.MolToInchi(mol))
    except:
        pass

dss_inchis = list(dss_set)
random.shuffle(dss_inchis)

dss_train_inchis = dss_inchis[:int(0.95 * len(dss_inchis))]
dss_val_inchis = dss_inchis[int(0.95 * len(dss_inchis)):]

dss_train_inchis = [inchi for inchi in dss_train_inchis if inchi not in excluded_inchis]

dss_train_df = pd.DataFrame(dss_train_inchis, columns=["inchi"])
dss_train_df.to_csv("../data/fp2mol/dss/preprocessed/dss_train.csv", index=False)

dss_val_df = pd.DataFrame(dss_val_inchis, columns=["inchi"])
dss_val_df.to_csv("../data/fp2mol/dss/preprocessed/dss_val.csv", index=False)
```

--------------------------------

### Python Molecule Canonicalization with Tautomer Handling

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/compute_metrics.ipynb

Defines functions to canonicalize molecules from InChI strings, handling tautomers using RDKit's MolStandardize module. It supports different RDKit versions for tautomer enumeration. Includes functions to convert molecules to SMILES and check molecule validity.

```python
try:
    from rdkit.Chem.MolStandardize.tautomer import TautomerCanonicalizer, TautomerTransform
    _RD_TAUTOMER_CANONICALIZER = 'v1'
    _TAUTOMER_TRANSFORMS = (
        TautomerTransform('1,3 heteroatom H shift', '[#7,S,O,Se,Te;!H0]-[#7X2,#6,#15]=[#7,#16,#8,Se,Te]'),
        TautomerTransform('1,3 (thio)keto/enol r', '[O,S,Se,Te;X2!H0]-[C]=[C]')
    )
except ModuleNotFoundError:
    from rdkit.Chem.MolStandardize.rdMolStandardize import TautomerEnumerator  # newer rdkit
    _RD_TAUTOMER_CANONICALIZER = 'v2'

def canonical_mol_from_inchi(inchi):
    """Canonicalize mol after Chem.MolFromInchi
    Note that this function may be 50 times slower than Chem.MolFromInchi"""
    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        return None
    if _RD_TAUTOMER_CANONICALIZER == 'v1':
        _molvs_t = TautomerCanonicalizer(transforms=_TAUTOMER_TRANSFORMS)
        mol = _molvs_t.canonicalize(mol)
    else:
        _te = TautomerEnumerator()
        mol = _te.Canonicalize(mol)
    return mol

def mol2smiles(mol):
    try:
        Chem.SanitizeMol(mol)
    except ValueError:
        return None
    return Chem.MolToSmiles(mol)

def is_valid(mol):
    smiles = mol2smiles(mol)
    if smiles is None:
        return False

    try:
        mol_frags = Chem.rdmolops.GetMolFrags(mol, asMols=True, sanitizeFrags=True)
    except:
        return False
    if len(mol_frags) > 1:
        return False

    return True
```

--------------------------------

### Filter Molecules by Properties

Source: https://github.com/coleygroup/diffms/blob/master/notebooks/build_fp2mol_datasets.ipynb

This function filters RDKit molecules on three criteria: no disconnected structures ('.') in the SMILES, molecular weight below 1500, and no formal charge on any atom. It takes an RDKit molecule object as input and returns a boolean indicating whether the molecule passes the filters; any exception during the checks also counts as a failure. Note that the function name shadows Python's built-in `filter`.

```python
def filter(mol):
    try:
        smi = Chem.MolToSmiles(mol, isomericSmiles=False) # remove stereochemistry information
        mol = Chem.MolFromSmiles(smi)

        if "." in smi:
            return False

        if Descriptors.MolWt(mol) >= 1500:
            return False

        for atom in mol.GetAtoms():
            if atom.GetFormalCharge() != 0:
                return False
    except:
        return False

    return True
```
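
The dataset-building snippets above repeat one pattern: shuffle the deduplicated identifiers, cut them 95/5 into training and validation, then drop held-out identifiers from the training portion only. The following is a minimal sketch of that pattern using only the standard library. The helper name `split_and_exclude`, its parameters, and the toy identifiers are hypothetical and do not come from the DiffMS repository, which inlines these steps in the notebook.

```python
import random


def split_and_exclude(inchis, excluded, train_frac=0.95, seed=42):
    """Shuffle, split into train/val, then drop excluded items from train only.

    Hypothetical helper for illustration; not part of the DiffMS code base.
    """
    items = sorted(set(inchis))   # deduplicate; sort so the shuffle is reproducible
    rng = random.Random(seed)     # local RNG, leaves the global random state untouched
    rng.shuffle(items)

    cut = int(train_frac * len(items))
    train, val = items[:cut], items[cut:]

    excluded = set(excluded)      # set membership keeps the filter fast
    train = [x for x in train if x not in excluded]
    return train, val


# Toy identifiers standing in for InChI strings (made-up data).
pool = [f"mol-{i}" for i in range(100)]
held_out = {"mol-3", "mol-42"}

train, val = split_and_exclude(pool, held_out)
print(len(train), len(val))
```

As in the notebook snippets, the validation portion is not filtered against the excluded set, so a held-out identifier can still land there. Sorting before shuffling is an addition in this sketch: iterating over a set of strings is not guaranteed to follow the same order across interpreter runs, so seeding alone may not make the split reproducible.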