### Clone SE(3)-Transformer Repository

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Clones the DeepLearningExamples repository and navigates into the SE3Transformer directory. This is the first step to get the project code.

```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/DrugDiscovery/SE3Transformer
```

--------------------------------

### Start SE(3)-Transformer Training

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Executes the training script for the SE(3)-Transformer model. This command assumes the container is already running and the script is accessible.

```bash
bash scripts/train.sh
```

--------------------------------

### Start SE(3)-Transformer Inference

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Executes the prediction script for the SE(3)-Transformer model. This command is used after training to generate predictions.

```bash
bash scripts/predict.sh
```

--------------------------------

### Run SE(3)-Transformer NGC Container

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Starts an interactive session within the SE(3)-Transformer NGC container. It mounts a local 'results' directory for storing output and configures GPU runtime and memory limits.

```bash
mkdir -p results
docker run -it --runtime=nvidia --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ${PWD}/results:/results se3-transformer:latest
```

--------------------------------

### Orchestrate RF2NA Structure Prediction Pipeline (Bash)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `run_RF2NA.sh` script manages the entire structure prediction process, including MSA generation, template searching, and neural network inference for various complex types. It requires activating the RF2NA conda environment and navigating to the example directory.

```bash
# Predict protein-RNA complex structure
conda activate RF2NA
cd /path/to/RoseTTAFold2NA/example

# RNA binding protein with RNA
../run_RF2NA.sh rna_prediction rna_binding_protein.fa R:RNA.fa

# DNA binding protein with double-stranded DNA
../run_RF2NA.sh dna_prediction dna_binding_protein.fa D:DNA.fa

# Single-stranded DNA binding
../run_RF2NA.sh ssdna_prediction protein.fa S:ssDNA.fa

# Paired protein/RNA MSA prediction
../run_RF2NA.sh paired_prediction PR:paired_protein_rna.a3m

# Multi-chain protein with RNA
../run_RF2NA.sh multichain protein_A.fa protein_B.fa R:RNA.fa

# Expected outputs in output_folder/models/:
# - model_00.pdb: Predicted structure with pLDDT in B-factor column
# - model_00.npz: Contains dist (LxLx37 distogram), lddt (L), pae (LxL)
```
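
When many complexes are submitted in a batch, it can help to assemble the `run_RF2NA.sh` argument list programmatically. The sketch below is a hypothetical helper (`build_rf2na_command` is not part of the repository). It only mirrors the argument order and the `R:`/`D:`/`S:`/`PR:` prefixes shown above, and it assumes that an unprefixed argument is treated as a protein FASTA file.

```python
from typing import List, Tuple

# Chain-type prefixes as used in the examples above. Protein inputs ("P")
# appear without a prefix in those examples, which this sketch mirrors.
PREFIXED_TYPES = {"R", "D", "S", "PR"}


def build_rf2na_command(script: str, out_dir: str,
                        chains: List[Tuple[str, str]]) -> List[str]:
    """Return an argv list for run_RF2NA.sh (hypothetical helper).

    chains is a list of (chain_type, path) pairs, e.g. ("P", "protein.fa").
    """
    if not chains:
        raise ValueError("at least one chain is required")
    argv = [script, out_dir]
    for chain_type, path in chains:
        if chain_type == "P":
            argv.append(path)                      # protein: no prefix
        elif chain_type in PREFIXED_TYPES:
            argv.append(f"{chain_type}:{path}")    # e.g. R:RNA.fa
        else:
            raise ValueError(f"unknown chain type: {chain_type!r}")
    return argv


cmd = build_rf2na_command(
    "../run_RF2NA.sh", "rna_prediction",
    [("P", "rna_binding_protein.fa"), ("R", "RNA.fa")],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the example directory once the RF2NA environment is active.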

--------------------------------

### Build SE(3)-Transformer NGC Container

Source: https://github.com/uw-ipd/rosettafold2na/blob/main/SE3Transformer/README.md

Builds the Docker image for the SE(3)-Transformer model. This container includes the necessary dependencies to run the model.

```bash
docker build -t se3-transformer .
```

--------------------------------

### Utilize RoseTTAFold Utility Functions (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Demonstrates the usage of various utility functions from the `network/util.py` module. These functions cover data augmentation (random rotation/translation), structure manipulation (centering, realigning missing residues), frame construction, nucleic acid identification, and PDB file writing with confidence scores.

```python
import torch
from network.util import (
    random_rot_trans,
    center_and_realign_missing,
    rigid_from_3_points,
    generate_Cbeta,
    writepdb,
    is_nucleic,
    dna_reverse_complement
)

# Random rotation and translation for data augmentation
# xyz shape: (N, L, 27, 3)
xyz_augmented = random_rot_trans(xyz, random_noise=20.0)

# Center structure and handle missing residues
# xyz shape: (L, 27, 3), mask shape: (L, 27)
xyz_centered = center_and_realign_missing(xyz, mask)

# Build reference frame from backbone atoms
# N, Ca, C shapes: (..., 3)
R, T = rigid_from_3_points(N, Ca, C, is_na=None)
# R: rotation matrix (..., 3, 3)
# T: translation vector (..., 3)

# Generate Cbeta from backbone
Cb = generate_Cbeta(N, Ca, C)

# Check if residue is nucleic acid (DNA/RNA)
# seq values >= 22 are nucleic acids
seq = torch.tensor([0, 1, 22, 23, 27, 28])  # 2 protein, 4 NA residues
na_mask = is_nucleic(seq)  # [False, False, True, True, True, True]

# Generate reverse complement for DNA
dna_seq = torch.tensor([[22, 22, 23, 24]])  # AACG (ACGT would be its own reverse complement)
complement = dna_reverse_complement(dna_seq)  # expected: CGTT -> [[23, 24, 25, 25]]

# Write PDB file with confidence scores
writepdb(
    filename="output.pdb",
    atoms=xyz_allatom,      # (1, L, 27, 3) coordinates
    seq=sequence,           # (1, L) residue types
    Ls=[100, 50],           # Chain lengths for multi-chain
    idx_pdb=None,           # Optional residue numbering
    bfacts=lddt_scores      # (L,) confidence scores for B-factor
)
```
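
To make the token conventions concrete, here is a minimal pure-Python sketch of what a reverse complement and a nucleic-acid check compute under the encoding used in these snippets (DNA A/C/G/T = 22..25, tokens >= 22 are nucleic acids). The function names are illustrative and are not the `network/util.py` implementations, which operate on tensors.

```python
# Token values taken from the snippets in this document.
DNA_A, DNA_C, DNA_G, DNA_T = 22, 23, 24, 25
DNA_COMPLEMENT = {DNA_A: DNA_T, DNA_T: DNA_A, DNA_C: DNA_G, DNA_G: DNA_C}


def is_nucleic_token(token: int) -> bool:
    """True for tokens in the nucleic-acid range (>= 22)."""
    return token >= 22


def reverse_complement_tokens(tokens):
    """Reverse the strand and complement each base.

    Tokens without a known complement are passed through unchanged
    (an assumption made for this sketch).
    """
    return [DNA_COMPLEMENT.get(t, t) for t in reversed(tokens)]


strand = [22, 22, 23, 24]                  # AACG
print(reverse_complement_tokens(strand))   # CGTT as tokens
print([is_nucleic_token(t) for t in [0, 1, 22, 28]])
```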

--------------------------------

### Featurize MSA with PyTorch

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Loads and featurizes Multiple Sequence Alignments (MSAs) using PyTorch tensors. It includes optional block deletion for data augmentation and defines parameters for latent MSA sequences, total sequences, and recycling iterations. The output includes masked sequences, latent MSA features, extra sequence features, and masking positions.

```python
import torch
from network.data_loader import MSAFeaturize, MSABlockDeletion, merge_a3m_hetero

# Parameters for featurization
params = {
    'MAXLAT': 128,    # Maximum latent MSA sequences
    'MAXSEQ': 1024,   # Maximum total sequences
    'MAXCYCLE': 4     # Recycling iterations
}

# Load and convert MSA to tensors
msa = torch.tensor(msa_array).long()   # Shape: (N, L)
ins = torch.tensor(ins_array).long()   # Shape: (N, L)

# Optional: apply block deletion for data augmentation during training
if msa.shape[0] > 5:
    msa, ins = MSABlockDeletion(msa, ins, nb=5)

# Generate model input features
seq, msa_seed_orig, msa_seed, msa_extra, mask_msa = MSAFeaturize(
    msa, ins, params,
    p_mask=0.15,  # Masking probability for training
    L_s=[]        # Chain lengths for multi-chain complexes
)

# Output shapes:
# seq: (MAXCYCLE, L) - masked sequences per cycle
# msa_seed: (MAXCYCLE, MAXLAT, L, features) - latent MSA features
# msa_extra: (MAXCYCLE, MAXSEQ, L, features) - extra sequence features
# mask_msa: (MAXCYCLE, MAXLAT, L) - masking positions

# Merge heteromeric MSAs for protein-nucleic acid complexes
a3m_protein = {'msa': protein_msa, 'ins': protein_ins}
a3m_rna = {'msa': rna_msa, 'ins': rna_ins}
chain_lengths = [protein_length, rna_length]

merged = merge_a3m_hetero(a3m_protein, a3m_rna, chain_lengths)
combined_msa = merged['msa']  # Shape: (N, protein_length + rna_length)
```
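
One plausible reading of merging heteromeric MSAs is an unpaired, block-diagonal layout: the first row concatenates the two query sequences, and every other row keeps its own chain's residues while the other chain is filled with gaps. The sketch below illustrates that layout with plain lists. It is an assumption for illustration, not the actual `merge_a3m_hetero` implementation, and the gap token value of 20 is likewise assumed.

```python
GAP = 20  # assumed gap token for this sketch


def merge_unpaired(msa_a, msa_b):
    """Merge two MSAs (lists of token rows) into a block-diagonal layout.

    Row 0 of each input is taken to be that chain's query sequence.
    """
    len_a, len_b = len(msa_a[0]), len(msa_b[0])
    merged = [msa_a[0] + msa_b[0]]                        # concatenated queries
    merged += [row + [GAP] * len_b for row in msa_a[1:]]  # chain A homologs
    merged += [[GAP] * len_a + row for row in msa_b[1:]]  # chain B homologs
    return merged


protein_rows = [[0, 1], [0, 2]]            # toy protein MSA, length 2
rna_rows = [[27, 28, 29], [27, 28, 30]]    # toy RNA MSA, length 3
for row in merge_unpaired(protein_rows, rna_rows):
    print(row)
```

Every merged row has length `len_a + len_b`, which matches the combined shape noted in the snippet above.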

--------------------------------

### Generate Protein MSA with HHblits

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

This bash script generates a protein MSA using iterative HHblits searches against the UniRef30 and BFD databases. Database locations are resolved relative to the `PIPEDIR` environment variable. The script takes five positional arguments: input FASTA file, output directory, a tag used to name the output, number of CPUs, and maximum memory. The output is an MSA file in A3M format.

```bash
#!/bin/bash
# Generate protein MSA from FASTA sequence

# Required environment variables
export PIPEDIR=/path/to/RoseTTAFold2NA

# Database paths
DB_UR30="$PIPEDIR/UniRef30_2020_06/UniRef30_2020_06"
DB_BFD="$PIPEDIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"

# Run MSA generation
./input_prep/make_protein_msa.sh \
    input_sequence.fa \
    /output/directory \
    protein_tag \
    8 \
    64

# Output files:
# /output/directory/protein_tag.msa0.a3m - Final MSA in A3M format

# The script performs iterative HHblits searches with increasing E-value thresholds:
# 1e-10 -> 1e-6 -> 1e-3 against UniRef30
# Falls back to BFD if insufficient sequences found
# Filters results: 90% identity, 75% or 50% coverage cutoffs
```
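
The iterative search described in the comments above follows a simple control flow: try the strictest E-value first and relax it only if too few sequences come back. The sketch below shows that loop in isolation. The `run_search` callable, the `min_sequences` cutoff, and the hit counts are all made up for illustration; the real script's stopping criterion is not stated here.

```python
from typing import Callable, Iterable, Optional, Tuple


def iterative_search(run_search: Callable[[float], int],
                     e_values: Iterable[float],
                     min_sequences: int) -> Tuple[Optional[float], int]:
    """Try each E-value in order; stop at the first that yields enough hits.

    Returns (e_value_used, n_sequences). If no threshold is sufficient,
    the result of the last attempt is returned.
    """
    used, count = None, 0
    for e_value in e_values:
        used, count = e_value, run_search(e_value)
        if count >= min_sequences:
            break
    return used, count


# Toy stand-in for a real search: hit counts per E-value are invented.
fake_hits = {1e-10: 120, 1e-6: 900, 1e-3: 5000}
e_used, n_seqs = iterative_search(fake_hits.__getitem__,
                                  [1e-10, 1e-6, 1e-3],
                                  min_sequences=500)
print(e_used, n_seqs)
```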

--------------------------------

### Generate RNA MSA with cmscan, Rfam, and BLAST

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

This bash script generates an RNA MSA by running cmscan against Rfam to identify RNA families, then collecting homologs from RNAcentral and the NCBI nt database via Rfam annotations and BLASTN. The required database files must be present in the `$PIPEDIR/RNA/` directory. The script takes five positional arguments: input RNA FASTA file, output directory, a tag used to name the output, number of CPUs, and maximum memory. It produces an MSA file in aligned FASTA format.

```bash
#!/bin/bash
# Generate RNA MSA from FASTA sequence

# Required databases in $PIPEDIR/RNA/:
# - Rfam.cm (covariance models)
# - rnacentral.fasta (RNAcentral sequences)
# - nt (NCBI nucleotide database)
# - rfam_annotations.tsv.gz (Rfam to RNAcentral mapping)
# - Rfam.full_region.gz (Rfam to nt mapping)

./input_prep/make_rna_msa.sh \
    rna_sequence.fa \
    /output/directory \
    rna_tag \
    8 \
    64

# Output files:
# /output/directory/rna_tag.afa - Final RNA MSA in FASTA alignment format

# Pipeline steps:
# 1. cmscan against Rfam to identify RNA families
# 2. Retrieve homologs from RNAcentral via Rfam annotations
# 3. Retrieve homologs from nt database via Rfam annotations
# 4. BLASTN search against RNAcentral and nt
# 5. Cluster with cd-hit-est to remove redundancy
# 6. Realign all hits with nhmmer
# 7. Filter with hhfilter (99% identity, 50% coverage)
```
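
Because the RNA pipeline depends on several database files, a quick pre-flight check can save a failed run. The following is a hypothetical helper, not part of the repository. It matches by filename prefix because formatted databases (for example the BLAST `nt` database) are usually split across several files with extra suffixes, which is an assumption about how the files are laid out on disk.

```python
import glob
import os

# Names taken from the list of required databases above.
REQUIRED_RNA_DB_PREFIXES = [
    "Rfam.cm",
    "rnacentral.fasta",
    "nt",
    "rfam_annotations.tsv.gz",
    "Rfam.full_region.gz",
]


def missing_rna_databases(pipedir: str):
    """Return the required prefixes with no matching file in <pipedir>/RNA."""
    rna_dir = os.path.join(pipedir, "RNA")
    missing = []
    for prefix in REQUIRED_RNA_DB_PREFIXES:
        pattern = os.path.join(glob.escape(rna_dir), prefix + "*")
        if not glob.glob(pattern):
            missing.append(prefix)
    return missing


print(missing_rna_databases(os.environ.get("PIPEDIR", ".")))
```

An empty list means every prefix matched at least one file; it does not verify that the databases are complete or correctly indexed.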

--------------------------------

### Define and Initialize RoseTTAFold Model Architecture (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Defines the complete neural network architecture for structure prediction using the RoseTTAFoldModule. It initializes the model with specified hyperparameters and loads trained weights from a checkpoint file. Dependencies include PyTorch and custom network modules.

```python
import torch
from network.RoseTTAFoldModel import RoseTTAFoldModule
import network.util as util

# Model hyperparameters (default configuration)
MODEL_PARAM = {
    "n_extra_block": 4,      # Extra MSA processing blocks
    "n_main_block": 32,      # Main trunk iterations
    "n_ref_block": 4,        # Structure refinement blocks
    "d_msa": 256,            # MSA embedding dimension
    "d_pair": 128,           # Pair embedding dimension
    "d_templ": 64,           # Template embedding dimension
    "n_head_msa": 8,         # MSA attention heads
    "n_head_pair": 4,        # Pair attention heads
    "n_head_templ": 4,       # Template attention heads
    "d_hidden": 32,          # Hidden dimension
    "d_hidden_templ": 64,    # Template hidden dimension
    "p_drop": 0.0,           # Dropout (0 for inference)
    "lj_lin": 0.75           # LJ potential linearization
}

# SE3-Transformer parameters for structure module
SE3_param = {
    "num_layers": 1,
    "num_channels": 32,
    "num_degrees": 2,
    "l0_in_features": 64,
    "l0_out_features": 64,
    "l1_in_features": 3,
    "l1_out_features": 2,
    "num_edge_features": 64,
    "div": 4,
    "n_heads": 4
}

MODEL_PARAM['SE3_param_full'] = SE3_param
MODEL_PARAM['SE3_param_topk'] = SE3_param

# Initialize model
device = torch.device("cuda:0")
model = RoseTTAFoldModule(
    **MODEL_PARAM,
    aamask=util.allatom_mask.to(device),
    ljlk_parameters=util.ljlk_parameters.to(device),
    lj_correction_parameters=util.lj_correction_parameters.to(device),
    num_bonds=util.num_bonds.to(device),
    hbtypes=util.hbtypes.to(device),
    hbbaseatoms=util.hbbaseatoms.to(device),
    hbpolys=util.hbpolys.to(device)
).to(device)

# Load trained weights
checkpoint = torch.load("network/weights/RF2NA_apr23.pt", map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
```
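
Note that the snippet above stores the same `SE3_param` dict object under both `SE3_param_full` and `SE3_param_topk`. That is harmless while the two configurations are identical, but if they ever need to differ, each key needs its own copy. A minimal standalone illustration of the aliasing (dict contents trimmed, changed value arbitrary):

```python
import copy

se3_param = {"num_layers": 1, "num_channels": 32}  # trimmed for brevity

# Same object under both keys: a change made through one key is visible
# through the other.
aliased = {"SE3_param_full": se3_param, "SE3_param_topk": se3_param}
aliased["SE3_param_topk"]["num_layers"] = 2  # arbitrary illustrative value
print(aliased["SE3_param_full"]["num_layers"])  # 2

# Independent copies: each key can be tuned separately.
base = {"num_layers": 1, "num_channels": 32}
independent = {
    "SE3_param_full": copy.deepcopy(base),
    "SE3_param_topk": copy.deepcopy(base),
}
independent["SE3_param_topk"]["num_layers"] = 2
print(independent["SE3_param_full"]["num_layers"])  # 1
```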

--------------------------------

### Read Structural Templates with PyTorch

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Parses HHsearch results to extract structural templates from a PDB database using PyTorch. It loads a pre-indexed FFDB and reads template information including coordinates, sequence features, and atom masks. The function supports specifying template IDs, residue offsets, and the number of templates to use, with an option for random noise for missing coordinates.

```python
import torch
from network.parsers import read_templates
from network.ffindex import read_index, read_data
from collections import namedtuple

# Load template database
FFDB = "/path/to/pdb100_2021Mar03/pdb100_2021Mar03"
FFindexDB = namedtuple("FFindexDB", "index, data")
ffdb = FFindexDB(
    read_index(FFDB + '_pdb.ffindex'),
    read_data(FFDB + '_pdb.ffdata')
)

# Read templates from HHsearch output
query_length = 150
xyz_t, t1d, mask_t = read_templates(
    qlen=query_length,
    ffdb=ffdb,
    hhr_fn="protein.hhr",       # HHsearch results
    atab_fn="protein.atab",     # Alignment table
    templ_to_use=[],            # Optional: specific template IDs
    offset=0,                   # Residue offset for multi-chain
    n_templ=4,                  # Number of templates to use
    random_noise=5.0            # Noise for missing coordinates
)

# Output tensors:
# xyz_t: (n_templ, L, 27, 3) - template coordinates
# t1d: (n_templ, L, 22) - one-hot sequence + confidence
# mask_t: (n_templ, L, 27) - valid atom mask

print(f"Loaded {xyz_t.shape[0]} templates")
print(f"Template coords shape: {xyz_t.shape}")
print(f"Template features shape: {t1d.shape}")
```

--------------------------------

### Parse Sequence Alignment Formats (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `network/parsers.py` module offers functions to parse various sequence alignment formats, including A3M for proteins and FASTA for RNA/DNA. It handles gzipped files, sequence limits, and specific alphabet encodings for nucleic acids. The `parse_mixed_fasta` function is used for paired protein/RNA MSAs.

```python
import numpy as np
from network.parsers import parse_a3m, parse_fasta, parse_mixed_fasta

# Parse protein A3M multiple sequence alignment
# Returns: msa (NxL array of residue indices), ins (NxL insertion counts)
msa, ins = parse_a3m(
    filename="protein.a3m.gz",
    unzip=True,      # Handle gzipped files
    maxseq=10000     # Maximum sequences to load
)
print(f"Protein MSA: {msa.shape[0]} sequences, length {msa.shape[1]}")
# Amino acid encoding: A=0, R=1, N=2, D=3, C=4, Q=5, E=6, G=7, H=8, I=9,
#                      L=10, K=11, M=12, F=13, P=14, S=15, T=16, W=17, Y=18, V=19, gap=20

# Parse RNA/DNA FASTA alignment
rna_msa, rna_ins = parse_fasta(
    filename="rna.afa",
    maxseq=10000,
    rna_alphabet=True,   # Use RNA encoding (A=27, C=28, G=29, U=30)
    dna_alphabet=False
)

dna_msa, dna_ins = parse_fasta(
    filename="dna.fa",
    maxseq=10000,
    rna_alphabet=False,
    dna_alphabet=True    # Use DNA encoding (A=22, C=23, G=24, T=25)
)

# Parse paired protein/RNA MSA (slash-separated format)
# Each line: PROTEIN_SEQUENCE/RNA_SEQUENCE
mixed_msa, mixed_ins, lengths = parse_mixed_fasta(
    filename="paired.a3m",
    maxseq=10000
)
protein_length, rna_length = lengths
print(f"Paired MSA: protein={protein_length}, RNA={rna_length}")
```
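
For readers unfamiliar with A3M, the format marks insertions relative to the query as lowercase letters: they are dropped from the aligned columns and recorded as counts. The sketch below is a simplified pure-Python parser that illustrates this and the amino acid encoding listed above. It is not the `network/parsers.py` implementation; in particular, attaching each insertion count to the next aligned column and mapping unknown characters to the gap token are assumptions of this sketch.

```python
AA_ALPHABET = "ARNDCQEGHILKMFPSTWYV-"   # index = token, gap = 20
AA_TO_TOKEN = {aa: i for i, aa in enumerate(AA_ALPHABET)}


def parse_a3m_text(text: str):
    """Return (msa, ins): token rows and per-column insertion counts."""
    # Collect sequences, joining lines that belong to the same record.
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))

    msa, ins = [], []
    for seq in sequences:
        tokens, counts, pending = [], [], 0
        for ch in seq:
            if ch.islower():          # insertion: count it, skip the column
                pending += 1
                continue
            tokens.append(AA_TO_TOKEN.get(ch, 20))
            counts.append(pending)    # insertions seen just before this column
            pending = 0
        msa.append(tokens)
        ins.append(counts)
    return msa, ins


msa_rows, ins_rows = parse_a3m_text(">query\nACD\n>hit\nAcdC-\n")
print(msa_rows)
print(ins_rows)
```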

--------------------------------

### Featurize MSA Data for Model Input (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `MSAFeaturize` function, located in `network/data_loader.py`, transforms raw MSA data into features suitable for the neural network. This process includes applying masking strategies and performing sequence clustering, preparing the data for model inference.

```python
import torch
from network.data_loader import MSAFeaturize, MSABlockDeletion, merge_a3m_hetero
```

--------------------------------

### Run RF2NA Prediction with Predictor Class (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

The `Predictor` class in `network/predict.py` provides the core interface for structure prediction. It loads trained weights, initializes a template database, and runs inference on input sequences. The input format specifies chain types and file paths, and the output includes PDB files and prediction quality metrics.

```python
import torch
from collections import namedtuple
from network.predict import Predictor
from network.ffindex import read_index, read_data

# Initialize the predictor with trained weights
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
predictor = Predictor(
    model_weights="network/weights/RF2NA_apr23.pt",
    device=device
)

# Load template database
FFDB = "/path/to/pdb100_2021Mar03/pdb100_2021Mar03"
FFindexDB = namedtuple("FFindexDB", "index, data")
ffdb = FFindexDB(
    read_index(FFDB + '_pdb.ffindex'),
    read_data(FFDB + '_pdb.ffdata')
)

# Define inputs with chain type prefixes:
# P: protein MSA (a3m format) + HHR templates
# R: RNA MSA (afa format)
# D: double-stranded DNA (fasta, complement auto-generated)
# S: single-stranded DNA (fasta)
# PR: paired protein/RNA MSA

inputs = [
    "P:/path/to/protein.msa0.a3m:/path/to/protein.hhr:/path/to/protein.atab",
    "R:/path/to/rna.afa"
]

# Run prediction
predictor.predict(
    inputs=inputs,
    out_prefix="/output/models/model",
    ffdb=ffdb,
    n_templ=4  # Number of templates to use
)

# Output files created:
# - /output/models/model_00.pdb (structure with pLDDT confidence)
# - /output/models/model_00.npz (distogram, lddt, pae arrays)
```
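
The input strings above pack a chain type and one or more file paths into a single colon-separated value. The sketch below shows one way such a string could be split and validated before it is handed to the predictor. `parse_input_spec` and `ChainInput` are hypothetical names introduced for illustration, and the approach assumes that file paths contain no colons.

```python
from typing import List, NamedTuple

KNOWN_TYPES = ("P", "R", "D", "S", "PR")  # prefixes listed above


class ChainInput(NamedTuple):
    chain_type: str
    files: List[str]


def parse_input_spec(spec: str) -> ChainInput:
    """Split 'TYPE:file[:file...]' into its chain type and file list."""
    chain_type, sep, rest = spec.partition(":")
    if not sep or chain_type not in KNOWN_TYPES:
        raise ValueError(f"missing or unknown chain type in {spec!r}")
    files = rest.split(":")
    if not all(files):
        raise ValueError(f"empty file path in {spec!r}")
    return ChainInput(chain_type, files)


parsed = parse_input_spec(
    "P:/path/to/protein.msa0.a3m:/path/to/protein.hhr:/path/to/protein.atab"
)
print(parsed.chain_type, parsed.files)
```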

--------------------------------

### Load and Interpret RoseTTAFold Prediction Outputs (Python)

Source: https://context7.com/uw-ipd/rosettafold2na/llms.txt

Loads prediction outputs from a NumPy archive (.npz) and a PDB file. It extracts and prints information about the predicted distogram, per-residue pLDDT scores, and the Predicted Aligned Error (PAE) matrix. It also demonstrates reading pLDDT scores from the B-factor column of a PDB file.

```python
import numpy as np

# Load prediction outputs
data = np.load("models/model_00.npz")

# Distogram: predicted inter-residue distances
# Shape: (L, L, 37) - 37 distance bins
distogram = data['dist']
print(f"Distogram shape: {distogram.shape}")

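# Per-residue pLDDT confidence scores
# Shape: (L,) - higher is better; check the value range before thresholding
lddt = data['lddt']
cutoff = 70 if lddt.max() > 1 else 0.7  # handles a 0-100 or a 0-1 scale
print(f"Mean pLDDT: {np.mean(lddt):.2f}")
print(f"Residues with pLDDT > {cutoff}: {np.sum(lddt > cutoff)}")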

# Predicted Aligned Error (PAE) matrix
# Shape: (L, L) - expected position error in Angstroms
pae = data['pae']
print(f"PAE matrix shape: {pae.shape}")
print(f"Mean PAE: {np.mean(pae):.2f} Angstroms")

# Read PDB with pLDDT in B-factor column
with open("models/model_00.pdb", 'r') as f:
    for line in f:
        if line.startswith("ATOM"):
            residue = int(line[22:26])
            bfactor = float(line[60:66])  # pLDDT as stored in the B-factor column
            print(f"Residue {residue}: pLDDT = {bfactor:.1f}")
            break
```
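
The loop above stops at the first atom. To collect one confidence value per residue, the same fixed-column parsing can be wrapped in a small function, as in the sketch below. The helper name and the demo records are invented for illustration (the B-factor values are placeholders, not real predictions), and it assumes every atom of a residue carries the same B-factor, so the first atom is enough.

```python
def bfactor_per_residue(pdb_lines):
    """Return {(chain_id, residue_number): b_factor} from ATOM/HETATM records.

    Only the first atom of each residue is read.
    """
    values = {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        key = (line[21], int(line[22:26]))     # chain ID, residue number
        if key not in values:
            values[key] = float(line[60:66])   # B-factor column
    return values


# Fixed-width ATOM record layout, used here only to build demo input.
ATOM_FMT = "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
demo_lines = [
    ATOM_FMT % (1, " N", "ALA", "A", 1, 0.0, 0.0, 0.0, 1.0, 0.87),
    ATOM_FMT % (2, " CA", "ALA", "A", 1, 1.5, 0.0, 0.0, 1.0, 0.87),
    ATOM_FMT % (3, " N", "GLY", "A", 2, 3.0, 1.0, 0.0, 1.0, 0.42),
    "TER",
]
print(bfactor_per_residue(demo_lines))
```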
