### Clone SE(3)-Transformer Repository

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Clones the DeepLearningExamples repository and navigates into the SE3Transformer directory. This is the first step to get the project code.

```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/DrugDiscovery/SE3Transformer
```

--------------------------------

### Start SE(3)-Transformer Training

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Executes the training script for the SE(3)-Transformer model. This command assumes the container is already running and the script is accessible.

```bash
bash scripts/train.sh
```

--------------------------------

### Start SE(3)-Transformer Inference

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Executes the prediction script for the SE(3)-Transformer model. This command is used after training to generate predictions.

```bash
bash scripts/predict.sh
```

--------------------------------

### Run SE(3)-Transformer NGC Container

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Starts an interactive session within the SE(3)-Transformer NGC container. It mounts a local 'results' directory for storing output and configures GPU runtime and memory limits.

```docker
mkdir -p results
docker run -it --runtime=nvidia --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ${PWD}/results:/results se3-transformer:latest
```

--------------------------------

### Orchestrate RF2NA Structure Prediction Pipeline (Bash)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `run_RF2NA.sh` script manages the entire structure prediction process, including MSA generation, template searching, and neural network inference for various complex types. It requires activating the RF2NA conda environment and navigating to the example directory.

```bash
# Predict protein-RNA complex structure
conda activate RF2NA
cd /path/to/RoseTTAFold2NA/example

# RNA binding protein with RNA
../run_RF2NA.sh rna_prediction rna_binding_protein.fa R:RNA.fa

# DNA binding protein with double-stranded DNA
../run_RF2NA.sh dna_prediction dna_binding_protein.fa D:DNA.fa

# Single-stranded DNA binding
../run_RF2NA.sh ssdna_prediction protein.fa S:ssDNA.fa

# Paired protein/RNA MSA prediction
../run_RF2NA.sh paired_prediction PR:paired_protein_rna.a3m

# Multi-chain protein with RNA
../run_RF2NA.sh multichain protein_A.fa protein_B.fa R:RNA.fa

# Expected outputs in output_folder/models/:
# - model_00.pdb: Predicted structure with pLDDT in B-factor column
# - model_00.npz: Contains dist (LxLx37 distogram), lddt (L), pae (LxL)
```

--------------------------------

### Build SE(3)-Transformer NGC Container

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Builds the Docker image for the SE(3)-Transformer model. This container includes the necessary dependencies to run the model.

```docker
docker build -t se3-transformer .
```

--------------------------------

### Utilize RoseTTAFold Utility Functions (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Demonstrates the usage of various utility functions from the `network/util.py` module. These functions cover data augmentation (random rotation/translation), structure manipulation (centering, realigning missing residues), frame construction, nucleic acid identification, and PDB file writing with confidence scores.

```python
import torch
from network.util import (
    random_rot_trans,
    center_and_realign_missing,
    rigid_from_3_points,
    generate_Cbeta,
    writepdb,
    is_nucleic,
    dna_reverse_complement
)

# Random rotation and translation for data augmentation
# xyz shape: (N, L, 27, 3)
xyz_augmented = random_rot_trans(xyz, random_noise=20.0)

# Center structure and handle missing residues
# xyz shape: (L, 27, 3), mask shape: (L, 27)
xyz_centered = center_and_realign_missing(xyz, mask)

# Build reference frame from backbone atoms
# N, Ca, C shapes: (..., 3)
R, T = rigid_from_3_points(N, Ca, C, is_na=None)
# R: rotation matrix (..., 3, 3)
# T: translation vector (..., 3)

# Generate Cbeta from backbone
Cb = generate_Cbeta(N, Ca, C)

# Check if residue is nucleic acid (DNA/RNA)
# seq values >= 22 are nucleic acids
seq = torch.tensor([0, 1, 22, 23, 27, 28])  # 2 protein, 4 NA residues
na_mask = is_nucleic(seq)  # [False, False, True, True, True, True]

# Generate reverse complement for DNA
dna_seq = torch.tensor([[22, 23, 24, 25]])  # ACGT
complement = dna_reverse_complement(dna_seq)  # TGCA -> [25, 24, 23, 22]

# Write PDB file with confidence scores
writepdb(
    filename="output.pdb",
    atoms=xyz_allatom,   # (1, L, 27, 3) coordinates
    seq=sequence,        # (1, L) residue types
    Ls=[100, 50],        # Chain lengths for multi-chain
    idx_pdb=None,        # Optional residue numbering
    bfacts=lddt_scores   # (L,) confidence scores for B-factor
)
```

--------------------------------

### Featurize MSA with PyTorch

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Loads and featurizes Multiple Sequence Alignments (MSAs) using PyTorch tensors. It includes optional block deletion for data augmentation and defines parameters for latent MSA sequences, total sequences, and recycling iterations. The output includes masked sequences, latent MSA features, extra sequence features, and masking positions.

```python
import torch
from network.data_loader import MSAFeaturize, MSABlockDeletion, merge_a3m_hetero

# Parameters for featurization
params = {
    'MAXLAT': 128,   # Maximum latent MSA sequences
    'MAXSEQ': 1024,  # Maximum total sequences
    'MAXCYCLE': 4    # Recycling iterations
}

# Load and convert MSA to tensors
# msa_array / ins_array: integer arrays, e.g. as returned by parse_a3m
msa = torch.tensor(msa_array).long()  # Shape: (N, L)
ins = torch.tensor(ins_array).long()  # Shape: (N, L)

# Optional: apply block deletion for data augmentation during training
if msa.shape[0] > 5:
    msa, ins = MSABlockDeletion(msa, ins, nb=5)

# Generate model input features
seq, msa_seed_orig, msa_seed, msa_extra, mask_msa = MSAFeaturize(
    msa, ins, params,
    p_mask=0.15,  # Masking probability for training
    L_s=[]        # Chain lengths for multi-chain complexes
)

# Output shapes:
# seq: (MAXCYCLE, L) - masked sequences per cycle
# msa_seed: (MAXCYCLE, MAXLAT, L, features) - latent MSA features
# msa_extra: (MAXCYCLE, MAXSEQ, L, features) - extra sequence features
# mask_msa: (MAXCYCLE, MAXLAT, L) - masking positions

# Merge heteromeric MSAs for protein-nucleic acid complexes
a3m_protein = {'msa': protein_msa, 'ins': protein_ins}
a3m_rna = {'msa': rna_msa, 'ins': rna_ins}
chain_lengths = [protein_length, rna_length]
merged = merge_a3m_hetero(a3m_protein, a3m_rna, chain_lengths)
combined_msa = merged['msa']  # Shape: (N, protein_length + rna_length)
```

--------------------------------

### Generate Protein MSA with HHblits

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

This bash script generates a protein MSA using iterative HHblits searches against the UniRef30 and BFD databases. It requires specific environment variables for paths and takes the input FASTA, output directory, a tag, number of CPUs, and max memory as arguments. The output is an A3M formatted MSA file.

```bash
#!/bin/bash
# Generate protein MSA from FASTA sequence

# Required environment variables
export PIPEDIR=/path/to/RoseTTAFold2NA

# Database paths
DB_UR30="$PIPEDIR/UniRef30_2020_06/UniRef30_2020_06"
DB_BFD="$PIPEDIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"

# Run MSA generation
# Arguments: input FASTA, output directory, tag, number of CPUs, max memory
./input_prep/make_protein_msa.sh \
    input_sequence.fa \
    /output/directory \
    protein_tag \
    8 \
    64

# Output files:
# /output/directory/protein_tag.msa0.a3m - Final MSA in A3M format

# The script performs iterative HHblits searches with increasing E-value thresholds:
# 1e-10 -> 1e-6 -> 1e-3 against UniRef30
# Falls back to BFD if insufficient sequences found
# Filters results: 90% identity, 75% or 50% coverage cutoffs
```

--------------------------------

### Generate RNA MSA with cmscan, Rfam, and BLAST

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

This bash script generates an RNA MSA by running cmscan against Rfam and then collecting homologs from RNAcentral and the NCBI nt database. It requires specific database files to be present in the RNA directory. The script takes the input RNA FASTA, output directory, a tag, number of CPUs, and max memory as arguments, producing an MSA file in FASTA alignment format.

```bash
#!/bin/bash
# Generate RNA MSA from FASTA sequence

# Required databases in $PIPEDIR/RNA/:
# - Rfam.cm (covariance models)
# - rnacentral.fasta (RNAcentral sequences)
# - nt (NCBI nucleotide database)
# - rfam_annotations.tsv.gz (Rfam to RNAcentral mapping)
# - Rfam.full_region.gz (Rfam to nt mapping)

# Arguments: input FASTA, output directory, tag, number of CPUs, max memory
./input_prep/make_rna_msa.sh \
    rna_sequence.fa \
    /output/directory \
    rna_tag \
    8 \
    64

# Output files:
# /output/directory/rna_tag.afa - Final RNA MSA in FASTA alignment format

# Pipeline steps:
# 1. cmscan against Rfam to identify RNA families
# 2. Retrieve homologs from RNAcentral via Rfam annotations
# 3. Retrieve homologs from nt database via Rfam annotations
# 4. BLASTN search against RNAcentral and nt
# 5. Cluster with cd-hit-est to remove redundancy
# 6. Realign all hits with nhmmer
# 7. Filter with hhfilter (99% identity, 50% coverage)
```

--------------------------------

### Define and Initialize RoseTTAFold Model Architecture (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Defines the complete neural network architecture for structure prediction using the RoseTTAFoldModule. It initializes the model with specified hyperparameters and loads trained weights from a checkpoint file. Dependencies include PyTorch and custom network modules.

```python
import torch
from network.RoseTTAFoldModel import RoseTTAFoldModule
import network.util as util

# Model hyperparameters (default configuration)
MODEL_PARAM = {
    "n_extra_block": 4,    # Extra MSA processing blocks
    "n_main_block": 32,    # Main trunk iterations
    "n_ref_block": 4,      # Structure refinement blocks
    "d_msa": 256,          # MSA embedding dimension
    "d_pair": 128,         # Pair embedding dimension
    "d_templ": 64,         # Template embedding dimension
    "n_head_msa": 8,       # MSA attention heads
    "n_head_pair": 4,      # Pair attention heads
    "n_head_templ": 4,     # Template attention heads
    "d_hidden": 32,        # Hidden dimension
    "d_hidden_templ": 64,  # Template hidden dimension
    "p_drop": 0.0,         # Dropout (0 for inference)
    "lj_lin": 0.75         # LJ potential linearization
}

# SE3-Transformer parameters for structure module
SE3_param = {
    "num_layers": 1,
    "num_channels": 32,
    "num_degrees": 2,
    "l0_in_features": 64,
    "l0_out_features": 64,
    "l1_in_features": 3,
    "l1_out_features": 2,
    "num_edge_features": 64,
    "div": 4,
    "n_heads": 4
}
MODEL_PARAM['SE3_param_full'] = SE3_param
MODEL_PARAM['SE3_param_topk'] = SE3_param

# Initialize model
device = torch.device("cuda:0")
model = RoseTTAFoldModule(
    **MODEL_PARAM,
    aamask=util.allatom_mask.to(device),
    ljlk_parameters=util.ljlk_parameters.to(device),
    lj_correction_parameters=util.lj_correction_parameters.to(device),
    num_bonds=util.num_bonds.to(device),
    hbtypes=util.hbtypes.to(device),
    hbbaseatoms=util.hbbaseatoms.to(device),
    hbpolys=util.hbpolys.to(device)
).to(device)

# Load trained weights
checkpoint = torch.load("network/weights/RF2NA_apr23.pt", map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
```

--------------------------------

### Read Structural Templates with PyTorch

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Parses HHsearch results to extract structural templates from a PDB database using PyTorch. It loads a pre-indexed FFDB and reads template information including coordinates, sequence features, and atom masks. The function supports specifying template IDs, residue offsets, and the number of templates to use, with an option for random noise for missing coordinates.

```python
import torch
from network.parsers import read_templates
from network.ffindex import read_index, read_data
from collections import namedtuple

# Load template database
FFDB = "/path/to/pdb100_2021Mar03/pdb100_2021Mar03"
FFindexDB = namedtuple("FFindexDB", "index, data")
ffdb = FFindexDB(
    read_index(FFDB + '_pdb.ffindex'),
    read_data(FFDB + '_pdb.ffdata')
)

# Read templates from HHsearch output
query_length = 150
xyz_t, t1d, mask_t = read_templates(
    qlen=query_length,
    ffdb=ffdb,
    hhr_fn="protein.hhr",    # HHsearch results
    atab_fn="protein.atab",  # Alignment table
    templ_to_use=[],         # Optional: specific template IDs
    offset=0,                # Residue offset for multi-chain
    n_templ=4,               # Number of templates to use
    random_noise=5.0         # Noise for missing coordinates
)

# Output tensors:
# xyz_t: (n_templ, L, 27, 3) - template coordinates
# t1d: (n_templ, L, 22) - one-hot sequence + confidence
# mask_t: (n_templ, L, 27) - valid atom mask
print(f"Loaded {xyz_t.shape[0]} templates")
print(f"Template coords shape: {xyz_t.shape}")
print(f"Template features shape: {t1d.shape}")
```

--------------------------------

### Parse Sequence Alignment Formats (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `network/parsers.py` module offers functions to parse various sequence alignment formats, including A3M for proteins and FASTA for RNA/DNA. It handles gzipped files, sequence limits, and specific alphabet encodings for nucleic acids. The `parse_mixed_fasta` function is used for paired protein/RNA MSAs.

```python
import numpy as np
from network.parsers import parse_a3m, parse_fasta, parse_mixed_fasta

# Parse protein A3M multiple sequence alignment
# Returns: msa (NxL array of residue indices), ins (NxL insertion counts)
msa, ins = parse_a3m(
    filename="protein.a3m.gz",
    unzip=True,   # Handle gzipped files
    maxseq=10000  # Maximum sequences to load
)
print(f"Protein MSA: {msa.shape[0]} sequences, length {msa.shape[1]}")

# Amino acid encoding: A=0, R=1, N=2, D=3, C=4, Q=5, E=6, G=7, H=8, I=9,
# L=10, K=11, M=12, F=13, P=14, S=15, T=16, W=17, Y=18, V=19, gap=20

# Parse RNA/DNA FASTA alignment
rna_msa, rna_ins = parse_fasta(
    filename="rna.afa",
    maxseq=10000,
    rna_alphabet=True,  # Use RNA encoding (A=27, C=28, G=29, U=30)
    dna_alphabet=False
)

dna_msa, dna_ins = parse_fasta(
    filename="dna.fa",
    maxseq=10000,
    rna_alphabet=False,
    dna_alphabet=True  # Use DNA encoding (A=22, C=23, G=24, T=25)
)

# Parse paired protein/RNA MSA (slash-separated format)
# Each line: PROTEIN_SEQUENCE/RNA_SEQUENCE
mixed_msa, mixed_ins, lengths = parse_mixed_fasta(
    filename="paired.a3m",
    maxseq=10000
)
protein_length, rna_length = lengths
print(f"Paired MSA: protein={protein_length}, RNA={rna_length}")
```

--------------------------------

### Featurize MSA Data for Model Input (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `MSAFeaturize` function, located in `network/data_loader.py`, transforms raw MSA data into features suitable for the neural network. This process includes applying masking strategies and performing sequence clustering, preparing the data for model inference.

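The masking step mentioned above can be illustrated with a small, dependency-free sketch. It is not the `MSAFeaturize` implementation: the function name `mask_msa`, the placeholder `MASK_TOKEN`, and the choice to overwrite every masked position with a single token are assumptions made for illustration. Only the default probability of 0.15 mirrors the `p_mask=0.15` argument used elsewhere in this document.

```python
import random

MASK_TOKEN = -1  # hypothetical placeholder index, not the token RF2NA uses


def mask_msa(msa, p_mask=0.15, seed=None):
    """Independently mask each MSA position with probability p_mask.

    msa is a list of equal-length rows of integer residue indices.
    Returns (masked_msa, mask); mask[i][j] is True where a position was masked.
    The input rows are left unmodified.
    """
    rng = random.Random(seed)
    masked, mask = [], []
    for row in msa:
        # Draw one uniform number per position; values below p_mask are masked
        row_mask = [rng.random() < p_mask for _ in row]
        masked.append([MASK_TOKEN if m else tok for tok, m in zip(row, row_mask)])
        mask.append(row_mask)
    return masked, mask


if __name__ == "__main__":
    # Toy alignment: 20 is the gap index in the protein encoding listed earlier
    toy_msa = [[0, 1, 2, 3, 4], [0, 1, 20, 3, 4]]
    masked, mask = mask_msa(toy_msa, p_mask=0.15, seed=0)
    print(masked)
    print(mask)
```

Returning the boolean mask alongside the masked alignment keeps the original tokens recoverable, which is what a masked-token training objective would need.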
```python
import torch
from network.data_loader import MSAFeaturize, MSABlockDeletion, merge_a3m_hetero
```

--------------------------------

### Run RF2NA Prediction with Predictor Class (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `Predictor` class in `network/predict.py` provides the core interface for structure prediction. It loads trained weights, initializes a template database, and runs inference on input sequences. The input format specifies chain types and file paths, and the output includes PDB files and prediction quality metrics.

```python
import torch
from collections import namedtuple
from network.predict import Predictor
from network.ffindex import read_index, read_data

# Initialize the predictor with trained weights
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
predictor = Predictor(
    model_weights="network/weights/RF2NA_apr23.pt",
    device=device
)

# Load template database
FFDB = "/path/to/pdb100_2021Mar03/pdb100_2021Mar03"
FFindexDB = namedtuple("FFindexDB", "index, data")
ffdb = FFindexDB(
    read_index(FFDB + '_pdb.ffindex'),
    read_data(FFDB + '_pdb.ffdata')
)

# Define inputs with chain type prefixes:
# P: protein MSA (a3m format) + HHR templates
# R: RNA MSA (afa format)
# D: double-stranded DNA (fasta, complement auto-generated)
# S: single-stranded DNA (fasta)
# PR: paired protein/RNA MSA
inputs = [
    "P:/path/to/protein.msa0.a3m:/path/to/protein.hhr:/path/to/protein.atab",
    "R:/path/to/rna.afa"
]

# Run prediction
predictor.predict(
    inputs=inputs,
    out_prefix="/output/models/model",
    ffdb=ffdb,
    n_templ=4  # Number of templates to use
)

# Output files created:
# - /output/models/model_00.pdb (structure with pLDDT confidence)
# - /output/models/model_00.npz (distogram, lddt, pae arrays)
```

--------------------------------

### Load and Interpret RoseTTAFold Prediction Outputs (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Loads prediction outputs from a NumPy archive (.npz) and a PDB file. It extracts and prints information about the predicted distogram, per-residue pLDDT scores, and the Predicted Aligned Error (PAE) matrix. It also demonstrates reading pLDDT scores from the B-factor column of a PDB file.

```python
import numpy as np

# Load prediction outputs
data = np.load("models/model_00.npz")

# Distogram: predicted inter-residue distances
# Shape: (L, L, 37) - 37 distance bins
distogram = data['dist']
print(f"Distogram shape: {distogram.shape}")

# Per-residue pLDDT confidence scores
# Shape: (L,) - values 0-100, higher is better
lddt = data['lddt']
print(f"Mean pLDDT: {np.mean(lddt):.1f}")
print(f"Residues with pLDDT > 70: {np.sum(lddt > 70)}")

# Predicted Aligned Error (PAE) matrix
# Shape: (L, L) - expected position error in Angstroms
pae = data['pae']
print(f"PAE matrix shape: {pae.shape}")
print(f"Mean PAE: {np.mean(pae):.2f} Angstroms")

# Read PDB with pLDDT in B-factor column
with open("models/model_00.pdb", 'r') as f:
    for line in f:
        if line.startswith("ATOM"):
            residue = int(line[22:26])
            bfactor = float(line[60:66])  # This is pLDDT * 100
            print(f"Residue {residue}: pLDDT = {bfactor:.1f}")
            break
```
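
The PDB snippet above stops at the first `ATOM` record. To collect one confidence value per residue, the same fixed-column slices can be applied to every record, as in the sketch below. The function names `bfactor_per_residue` and `format_atom`, the choice to take the first atom of each residue, and the demo values are assumptions made for illustration; the column positions follow the standard PDB `ATOM` record layout.

```python
def bfactor_per_residue(pdb_lines):
    """Map (chain ID, residue number) to the B-factor of that residue's first ATOM record."""
    scores = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]         # column 22: chain identifier
        residue = int(line[22:26])  # columns 23-26: residue sequence number
        if (chain_id, residue) not in scores:
            scores[(chain_id, residue)] = float(line[60:66])  # columns 61-66: B-factor
    return scores


def format_atom(serial, name, resname, chain, resseq, bfactor):
    """Build a fixed-column ATOM record with dummy coordinates (demo helper only)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, resname, chain, resseq, 0.0, 0.0, 0.0, 1.0, bfactor)


if __name__ == "__main__":
    # Made-up records: one protein residue on chain A, one nucleotide on chain B
    demo = [
        format_atom(1, " N", "ALA", "A", 1, 0.91),
        format_atom(2, " CA", "ALA", "A", 1, 0.91),
        format_atom(3, " P", "  A", "B", 1, 0.42),
        "TER",
    ]
    for (chain, resseq), value in bfactor_per_residue(demo).items():
        print(f"chain {chain} residue {resseq}: B-factor = {value:.2f}")
```

Keying the result by chain ID as well as residue number avoids collisions in multi-chain complexes where each chain restarts its numbering. To process a real file, pass an open file handle for `models/model_00.pdb` instead of the demo list.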