# Unicycler

Unicycler is an assembly pipeline designed for bacterial genomes. It accepts Illumina short reads, PacBio or Oxford Nanopore long reads, or short and long reads together in a hybrid assembly. It aims to produce complete, circular chromosome and plasmid sequences by using SPAdes for short-read assembly, miniasm and Racon for long-read assembly, and bridging algorithms that resolve repeats and connect contigs.

The pipeline excels at hybrid assembly where it combines the accuracy of Illumina reads with the scaffolding power of long reads to resolve repetitive regions and produce finished-quality genomes. For short-read-only assemblies, Unicycler functions as a SPAdes optimizer, trying multiple k-mer sizes and selecting the best assembly graph. For long-read-only assemblies, it uses a miniasm+Racon pipeline with multiple polishing rounds to improve consensus accuracy.

## Command Line Interface

The main entry point for running Unicycler assemblies with configurable options for input reads, output directory, assembly mode, and threading.

```bash
# Illumina-only assembly (paired-end reads)
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -o output_dir

# Long-read-only assembly (PacBio or Nanopore)
unicycler -l long_reads.fastq.gz -o output_dir

# Hybrid assembly (best results - both short and long reads)
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads.fastq.gz -o output_dir

# Hybrid assembly with custom options
unicycler \
  -1 short_reads_1.fastq.gz \
  -2 short_reads_2.fastq.gz \
  -l long_reads.fastq.gz \
  -o output_dir \
  --mode bold \
  --threads 16 \
  --verbosity 2 \
  --min_fasta_length 500 \
  --keep 2

# Assembly with unpaired short reads
unicycler -s unpaired_reads.fastq.gz -l long_reads.fastq.gz -o output_dir

# Conservative mode (lowest misassembly rate, smaller contigs)
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out --mode conservative

# Bold mode (longest contigs, higher misassembly risk)
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out --mode bold

# Using existing long-read assembly for bridging
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out \
  --existing_long_read_assembly my_long_read_assembly.gfa

# Expected linear sequences (e.g., for organisms with linear chromosomes)
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out --linear_seqs 1
```
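
When assemblies are launched from a script rather than a shell, it can help to build the argument list in one place. The sketch below is a hypothetical helper (`build_unicycler_command` is not part of Unicycler) that uses only flags shown above. It builds the command without running it.

```python
from typing import List, Optional


def build_unicycler_command(out_dir: str,
                            short1: Optional[str] = None,
                            short2: Optional[str] = None,
                            long_reads: Optional[str] = None,
                            mode: str = "normal",
                            threads: Optional[int] = None) -> List[str]:
    """Build a Unicycler argument list (hypothetical helper, not Unicycler API)."""
    if mode not in ("conservative", "normal", "bold"):
        raise ValueError(f"unknown mode: {mode}")
    if bool(short1) != bool(short2):
        raise ValueError("paired reads need both -1 and -2")
    if not (short1 or long_reads):
        raise ValueError("at least one read set is required")

    cmd = ["unicycler"]
    if short1:
        cmd += ["-1", short1, "-2", short2]
    if long_reads:
        cmd += ["-l", long_reads]
    cmd += ["-o", out_dir, "--mode", mode]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


cmd = build_unicycler_command("out", "R1.fq.gz", "R2.fq.gz", "long.fq.gz",
                              mode="bold", threads=16)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps file names with spaces intact when the command is later passed to `subprocess.run`.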

## AssemblyGraph Class

The core data structure for representing and manipulating bacterial genome assembly graphs in GFA format, supporting segments, links, copy depth tracking, and bridging operations.

```python
from unicycler.assembly_graph import AssemblyGraph

# Load an assembly graph from a GFA file
graph = AssemblyGraph('assembly_graph.gfa', overlap=77)

# Load with custom insert size parameters
graph = AssemblyGraph('assembly_graph.gfa', overlap=77,
                      insert_size_mean=350, insert_size_deviation=75)

# Get basic graph statistics
total_length = graph.get_total_length()
total_length_no_overlaps = graph.get_total_length_no_overlaps()
dead_ends = graph.total_dead_end_count()
median_depth = graph.get_median_read_depth()

# Get connected components
components = graph.get_connected_components()
# Returns: [[1, 2, 3], [4, 5]]  # segment numbers grouped by component

# Check if a component is complete (circular replicon)
for component in components:
    if graph.is_component_complete(component):
        print(f"Component {component} is a complete circular replicon")

# Get single-copy segments (useful for scaffolding)
single_copy_segments = graph.get_single_copy_segments()

# Get path sequence through the graph
path = [1, 2, -3, 4]  # signed segment numbers (negative = reverse complement)
sequence = graph.get_path_sequence(path)

# Save graph to different formats
graph.save_to_gfa('output.gfa')
graph.save_to_fasta('output.fasta', min_length=500)
graph.save_to_gfa('output_with_info.gfa',
                  save_copy_depth_info=True,
                  save_seg_type_info=True)

# Merge simple paths (segments in unbranching paths)
graph.merge_all_possible(anchor_segments=None, bridging_mode=1)

# Renumber segments by length (longest = 1)
graph.renumber_segments()

# Get completed circular replicons
circular_segs = graph.completed_circular_replicons()
# Returns list of segment numbers that form complete circles
```
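
To make the idea of connected components concrete without installing Unicycler, the following standalone sketch groups the segments of a GFA v1 text into components by following `L` (link) lines. It is an illustration only: `gfa_components` is a hypothetical function, it ignores strand orientation, and it is not how `AssemblyGraph` is implemented.

```python
from collections import defaultdict


def gfa_components(gfa_text):
    """Group GFA v1 segment names into connected components (illustrative only)."""
    segments = []
    neighbours = defaultdict(set)
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments.append(fields[1])
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            neighbours[fields[1]].add(fields[3])
            neighbours[fields[3]].add(fields[1])

    seen, components = set(), []
    for start in segments:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this segment.
        stack, component = [start], []
        seen.add(start)
        while stack:
            seg = stack.pop()
            component.append(seg)
            for other in neighbours[seg]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


example = "\n".join([
    "S\t1\tACGT", "S\t2\tGGCC", "S\t3\tTTAA",
    "L\t1\t+\t2\t+\t0M",
    "L\t3\t+\t3\t+\t0M",   # segment 3 links to itself (a circle)
])
print(gfa_components(example))  # [['1', '2'], ['3']]
```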

## SPAdes Assembly Functions

Functions for running SPAdes short-read assembly with automatic k-mer optimization, returning the best assembly graph based on contig count and dead end minimization.

```python
from unicycler.spades_func import get_best_spades_graph, get_kmer_range

# Get optimal k-mer range based on read lengths
kmer_range = get_kmer_range(
    given_kmers=None,          # None for automatic, or list like [21, 51, 71]
    reads_1_filename='R1.fastq.gz',
    reads_2_filename='R2.fastq.gz',
    unpaired_reads_filename=None,
    spades_dir='/path/to/spades_work',
    kmer_count=8,              # Number of k-mer steps
    min_kmer_frac=0.2,         # Min k-mer as fraction of read length
    max_kmer_frac=0.95,        # Max k-mer as fraction of read length
    spades_path='spades.py'
)
# Returns a list of odd k-mer sizes, e.g. [27, 47, 63, 77, 89, 99, 107, 117]

# Run SPAdes and get best assembly graph
best_graph = get_best_spades_graph(
    short1='reads_1.fastq.gz',
    short2='reads_2.fastq.gz',
    short_unpaired=None,
    out_dir='/path/to/output',
    read_depth_filter=0.25,    # Filter contigs below this fraction of chromosomal depth
    verbosity=1,
    spades_path='spades.py',
    threads=8,
    keep=1,                    # File retention level (0-3)
    kmer_count=8,
    min_k_frac=0.2,
    max_k_frac=0.95,
    kmers=None,                # Specific k-mers or None for auto
    expected_linear_seqs=0,
    largest_component=False,   # Keep only largest component?
    spades_graph_prefix='/path/to/graphs/spades',
    spades_options=None,       # Additional SPAdes options
    spades_version='3.15.0'
)
# Returns: AssemblyGraph object with optimal k-mer assembly
```
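
The relationship between read length, the two fraction parameters, and the k-mer count can be sketched as below. This is not Unicycler's actual selection logic: `sketch_kmer_range` is a hypothetical function, and the even spacing and the cap of 127 (SPAdes' largest supported k) are assumptions made for illustration.

```python
import math


def sketch_kmer_range(read_length, kmer_count=8, min_frac=0.2, max_frac=0.95,
                      spades_max_k=127):
    """Pick up to `kmer_count` odd k-mer sizes spread across the allowed range."""
    if kmer_count < 2:
        raise ValueError("kmer_count must be at least 2")

    def odd_at_least(value):
        k = math.ceil(value)
        return k if k % 2 == 1 else k + 1

    def odd_at_most(value):
        k = int(value)
        return k if k % 2 == 1 else k - 1

    low = odd_at_least(read_length * min_frac)
    high = min(odd_at_most(read_length * max_frac), spades_max_k)
    if high <= low:
        return [low]

    # Evenly spaced targets, each rounded down to an odd number.
    # Using a set drops duplicates when the range is narrow.
    steps = kmer_count - 1
    kmers = {odd_at_most(low + (high - low) * i / steps) for i in range(kmer_count)}
    return sorted(kmers)


kmers = sketch_kmer_range(read_length=150)
print(kmers)
```

For 150 bp reads the upper bound is limited by the cap rather than by `max_frac`, which is why the largest value is 127 and not 141.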

## Miniasm Assembly Pipeline

Functions for long-read assembly using miniasm and Racon polishing, integrating long reads with short-read contigs for hybrid assembly bridging.

```python
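import itertools

from unicycler.miniasm_assembly import make_miniasm_string_graph

# short_read_graph, read_dictionary, scoring_scheme, nickname_dict, args and
# anchor_list are assumed to come from earlier pipeline steps.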

# Create miniasm string graph for hybrid assembly
string_graph = make_miniasm_string_graph(
    graph=short_read_graph,           # AssemblyGraph from SPAdes (None for long-read-only)
    read_dict=read_dictionary,        # Dict of read_name -> Read objects
    long_read_filename='long_reads.fastq.gz',
    scoring_scheme=scoring_scheme,    # AlignmentScoringScheme object
    read_nicknames=nickname_dict,     # Dict mapping read names to short IDs
    counter=itertools.count(start=1), # Counter for output file numbering
    args=args,                        # Parsed command-line arguments
    anchor_segments=anchor_list,      # List of single-copy segments to bridge
    existing_long_read_assembly=None  # Path to pre-made assembly or None
)
# Returns: StringGraph object with polished unitigs

# The function internally:
# 1. Aligns long reads to assembly graph with minimap
# 2. Saves anchor contigs and overlapping reads as "reads" for miniasm
# 3. Runs miniasm to create string graph
# 4. Polishes with multiple Racon rounds
# 5. Places short-read contigs back into unitig graph
```

## Bridge Application

System for creating and applying bridges between single-copy segments using evidence from long reads, SPAdes paths, and miniasm assemblies.

```python
from unicycler.bridge_long_read import create_long_read_bridges
from unicycler.bridge_spades_contig import create_spades_contig_bridges
from unicycler.bridge_miniasm import create_miniasm_bridges

# Create bridges from SPAdes contig paths
spades_bridges = create_spades_contig_bridges(
    graph=assembly_graph,
    anchor_segments=single_copy_segments
)

# Create bridges from long read alignments
long_read_bridges = create_long_read_bridges(
    graph=assembly_graph,
    read_dict=read_dictionary,
    read_names=read_name_list,
    anchor_segments=single_copy_segments,
    verbosity=1,
    min_scaled_score=85.0,
    threads=8,
    scoring_scheme=scoring_scheme,
    min_alignment_length=1000,
    expected_linear_seqs=False,
    min_bridge_qual=10.0
)

# Create bridges from miniasm assembly
miniasm_bridges = create_miniasm_bridges(
    graph=assembly_graph,
    string_graph=miniasm_string_graph,
    anchor_segments=single_copy_segments,
    scoring_scheme=scoring_scheme,
    verbosity=1,
    min_bridge_qual=10.0
)

# Combine all bridges
all_bridges = spades_bridges + long_read_bridges + miniasm_bridges

# Apply bridges to the graph in order of decreasing quality
seg_nums_used = assembly_graph.apply_bridges(
    bridges=all_bridges,
    verbosity=1,
    min_bridge_qual=10.0  # Reject bridges below this quality
)
# Conflicting bridges are resolved by using the highest-quality option
```
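
The "highest quality first, skip conflicts" rule described in the comments above can be sketched as a small greedy selection. Everything here is hypothetical: `SketchBridge` and `choose_bridges` are stand-ins, and a conflict is simplified to "this anchor end is already used by a better bridge". Unicycler's real bridge objects and conflict handling are more involved.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SketchBridge:
    """Hypothetical stand-in for a bridge joining two anchor segments."""
    start: int      # anchor segment the bridge leaves from
    end: int        # anchor segment the bridge arrives at
    quality: float


def choose_bridges(bridges: List[SketchBridge],
                   min_bridge_qual: float) -> List[SketchBridge]:
    """Greedily accept bridges from best to worst, skipping conflicts."""
    used_starts: Set[int] = set()
    used_ends: Set[int] = set()
    applied = []
    for bridge in sorted(bridges, key=lambda b: b.quality, reverse=True):
        if bridge.quality < min_bridge_qual:
            break  # all remaining bridges are lower quality still
        if bridge.start in used_starts or bridge.end in used_ends:
            continue  # a better bridge already claimed this anchor end
        used_starts.add(bridge.start)
        used_ends.add(bridge.end)
        applied.append(bridge)
    return applied


candidates = [
    SketchBridge(1, 2, 60.0),
    SketchBridge(1, 3, 40.0),   # conflicts with 1 -> 2 (same start)
    SketchBridge(4, 2, 30.0),   # conflicts with 1 -> 2 (same end)
    SketchBridge(3, 4, 20.0),
    SketchBridge(5, 6, 5.0),    # below the quality threshold
]
applied = choose_bridges(candidates, min_bridge_qual=10.0)
print([(b.start, b.end) for b in applied])  # [(1, 2), (3, 4)]
```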

## Alignment Scoring Scheme

Configuration for sequence alignment scoring parameters used throughout the assembly and bridging pipeline.

```python
from unicycler.alignment import AlignmentScoringScheme

# Create scoring scheme from comma-separated string
# Format: match, mismatch, gap_open, gap_extend
scoring_scheme = AlignmentScoringScheme('3,-6,-5,-2')

# Access individual scores
match_score = scoring_scheme.match          # 3
mismatch_score = scoring_scheme.mismatch    # -6
gap_open = scoring_scheme.gap_open          # -5
gap_extend = scoring_scheme.gap_extend      # -2

# The scoring scheme is used for:
# - Long read to graph alignment
# - Bridge path finding and scoring
# - Consensus sequence generation
# - Contig placement in unitig graphs
```
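
To show what the four numbers mean in practice, the sketch below scores a pair of already-aligned strings with the same `match,mismatch,gap_open,gap_extend` format. Both functions are hypothetical, and the gap convention (the first gap position costs `gap_open`, each further position costs `gap_extend`) is an assumption. Unicycler's aligner may count gaps differently.

```python
def parse_scoring_scheme(text):
    """Split 'match,mismatch,gap_open,gap_extend' into four integers."""
    match, mismatch, gap_open, gap_extend = (int(x) for x in text.split(","))
    return match, mismatch, gap_open, gap_extend


def score_alignment(ref, query, scheme="3,-6,-5,-2"):
    """Score two equal-length aligned strings where '-' marks a gap."""
    if len(ref) != len(query):
        raise ValueError("aligned strings must have the same length")
    match, mismatch, gap_open, gap_extend = parse_scoring_scheme(scheme)
    score, in_gap = 0, False
    for r, q in zip(ref, query):
        if r == "-" or q == "-":
            # Assumption: first gap position costs gap_open, later ones gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if r == q else mismatch
            in_gap = False
    return score


print(score_alignment("ACGT", "ACGT"))          # 4 matches: 12
print(score_alignment("ACGTACGT", "ACG--CTT"))  # 5 matches, 1 mismatch, 2-base gap: 2
```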

## Copy Depth Determination

Algorithm for determining segment copy numbers based on read depth and graph connectivity, essential for identifying single-copy segments for bridging.

```python
from unicycler.assembly_graph_copy_depth import determine_copy_depth

# Analyze graph and assign copy depths to segments
determine_copy_depth(graph)

# After running, segments have copy depth information:
# graph.copy_depths = {segment_number: [depth1, depth2, ...]}
# Single-copy segments have exactly one depth value

# Get single-copy segments (copy number = 1)
single_copy = graph.get_single_copy_segments()

# Get copy number for specific segment
copy_num = graph.get_copy_number(segment)  # Returns 0, 1, 2, 3, etc.

# Check if segment is single-copy
is_single = graph.is_seg_num_single_copy(segment_number)

# Get segments without assigned copy depth
no_depth = graph.get_no_copy_depth_segments()

# Copy depth visualization colors (for Bandage)
# 1 copy: green, 2 copies: gold, 3 copies: orange, 4+ copies: red
```
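
As a rough intuition for the depth side of this, a segment's copy number can be guessed from the ratio of its depth to the single-copy depth. The sketch below does only that and then maps the result to the colors listed above. It is deliberately naive: both functions are hypothetical, and it ignores the graph connectivity that the real algorithm also uses.

```python
def naive_copy_number(depth, single_copy_depth):
    """Guess copy number from the depth ratio alone (ignores graph structure)."""
    if single_copy_depth <= 0:
        raise ValueError("single_copy_depth must be positive")
    return max(0, round(depth / single_copy_depth))


def copy_number_color(copy_number):
    """Map a copy number to the colors listed above; None if not listed."""
    if copy_number < 1:
        return None  # no color is listed for segments without a copy number
    return {1: "green", 2: "gold", 3: "orange"}.get(copy_number, "red")


# Hypothetical segment depths, with a single-copy depth of about 50x.
depths = {1: 48.0, 2: 101.0, 3: 52.0, 4: 210.0}
for seg_num, depth in depths.items():
    copies = naive_copy_number(depth, single_copy_depth=50.0)
    print(seg_num, copies, copy_number_color(copies))
```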

## Long Read Loading and Processing

Functions for loading and processing long reads from FASTQ/FASTA files for assembly and alignment.

```python
from unicycler.read_ref import load_long_reads, get_read_nickname_dict

# Load long reads from file
read_dict, read_names, long_read_filename = load_long_reads(
    filename='long_reads.fastq.gz',
    output_dir='/path/to/output'  # For decompressed temp files
)
# read_dict: {read_name: Read object with sequence, qualities, alignments}
# read_names: ordered list of read names
# long_read_filename: path to (possibly decompressed) reads file

# Create short nicknames for reads (used in miniasm)
nicknames = get_read_nickname_dict(read_names)
# Returns: {'very_long_read_name_1': 'R001', 'another_read': 'R002', ...}

# Access read information
for name in read_names:
    read = read_dict[name]
    sequence = read.sequence
    qualities = read.qualities
    alignments = read.alignments  # List after alignment step
```
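
For readers who want to see the shape of the `read_dict` and `read_names` pair without Unicycler installed, here is a minimal FASTQ reader built on the standard library. It is a sketch under simple assumptions (four lines per record, no multi-line sequences), and `read_fastq` and `open_reads` are hypothetical names, not Unicycler functions.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        qualities = handle.readline().strip()
        fields = header[1:].split()
        if not header.startswith("@") or not fields or len(sequence) != len(qualities):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield fields[0], sequence, qualities


def open_reads(filename):
    """Open plain or gzipped text, choosing by the .gz suffix."""
    return gzip.open(filename, "rt") if filename.endswith(".gz") else open(filename)


# In-memory example so the sketch runs without any files on disk.
example = io.StringIO("@read_1 runid=abc\nACGTAC\n+\nIIIIII\n@read_2\nGGCC\n+\n!!!!\n")
read_dict = {name: (seq, qual) for name, seq, qual in read_fastq(example)}
read_names = list(read_dict)
print(read_names)  # ['read_1', 'read_2']
```

With real data, `read_fastq(open_reads('long_reads.fastq.gz'))` would stream records from a compressed file in the same way.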

## Output Files

Unicycler produces several output files depending on the `--keep` level setting, with the final assembly always saved in both GFA and FASTA formats.

```bash
# Output directory structure (--keep 1, default):
output_dir/
├── assembly.gfa              # Final assembly in GFA v1 format
├── assembly.fasta            # Final assembly in FASTA format
├── unicycler.log             # Detailed log of the assembly process
├── 001_spades_graph_k*.gfa   # SPAdes graphs at each k-mer
├── 002_depth_filter.gfa      # Graph after depth filtering
├── 003_overlaps_removed.gfa  # Overlap-free graph
├── 004_long_read_assembly.gfa # Miniasm+Racon assembly (hybrid only)
├── 005_bridges_applied.gfa   # Graph after bridging
└── 006_final_clean.gfa       # Cleaned graph before rotation
```

```python
# Parse GFA output programmatically
from unicycler.assembly_graph import AssemblyGraph

final_graph = AssemblyGraph('output_dir/assembly.gfa', overlap=0)

# Get assembly statistics
n50, shortest, q1, median, q3, longest = final_graph.get_contig_stats()
print(f"N50: {n50} bp")
print(f"Longest contig: {longest} bp")
print(f"Total segments: {len(final_graph.segments)}")

# Check completion status
components = final_graph.get_connected_components()
for i, comp in enumerate(components):
    status = "complete" if final_graph.is_component_complete(comp) else "incomplete"
    length = sum(final_graph.segments[x].get_length() for x in comp)
    print(f"Component {i+1}: {length} bp, {status}")
```
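
If only `assembly.fasta` is at hand, contig lengths and N50 can be computed with the standard library alone. The sketch below is independent of Unicycler. `fasta_lengths` and `n50` are hypothetical helpers, and N50 is taken here as the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size.

```python
def fasta_lengths(fasta_text):
    """Return the sequence length of each record in FASTA-formatted text."""
    lengths, current = [], None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


example_fasta = ">1 length=8\nACGTACGT\n>2 length=4\nACGT\n"
lengths = fasta_lengths(example_fasta)
print(lengths, n50(lengths))        # [8, 4] 8
print(n50([100, 200, 300, 400]))    # 300
```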

## Assembly Mode Settings

Configuration constants for conservative, normal, and bold assembly modes affecting bridge quality thresholds and contig merging behavior.

```python
from unicycler import settings

# Bridge quality thresholds by mode
# Conservative: most accurate, fewer completed replicons
conservative_threshold = settings.CONSERVATIVE_MIN_BRIDGE_QUAL  # 25.0

# Normal: balanced accuracy and completeness
normal_threshold = settings.NORMAL_MIN_BRIDGE_QUAL  # 10.0

# Bold: most completed replicons, higher misassembly risk
bold_threshold = settings.BOLD_MIN_BRIDGE_QUAL  # 1.0

# Other key settings
max_threads_default = settings.MAX_AUTO_THREAD_COUNT  # 8
min_alignment_length = settings.MIN_LONG_READ_ALIGNMENT_LENGTH  # 50
racon_loops_hybrid = settings.RACON_POLISH_LOOP_COUNT_HYBRID  # 2
racon_loops_long_only = settings.RACON_POLISH_LOOP_COUNT_LONG_ONLY  # 4

# Path finding settings
min_path_length_ratio = settings.MIN_RELATIVE_PATH_LENGTH  # 0.9
max_path_length_ratio = settings.MAX_RELATIVE_PATH_LENGTH  # 1.1
```
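
The practical effect of the three thresholds is easiest to see by filtering one set of bridge qualities under each mode. The sketch below copies the threshold values from the comments above into a plain dictionary instead of importing `unicycler.settings`. The helper name and the example quality scores are hypothetical.

```python
# Threshold values taken from the settings comments above.
MIN_BRIDGE_QUAL_BY_MODE = {"conservative": 25.0, "normal": 10.0, "bold": 1.0}


def bridges_passing(mode, qualities):
    """Return the bridge qualities that meet the threshold for `mode`."""
    try:
        threshold = MIN_BRIDGE_QUAL_BY_MODE[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode}") from None
    return [q for q in qualities if q >= threshold]


# Hypothetical quality scores for four candidate bridges.
qualities = [40.0, 18.0, 6.0, 0.5]
for mode in ("conservative", "normal", "bold"):
    print(mode, bridges_passing(mode, qualities))
```

The same candidates yield one, two, and three accepted bridges respectively, which is the trade-off between misassembly risk and completeness that the modes expose.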

Unicycler is primarily used for assembling bacterial genomes from sequencing data, with particular strength in producing complete, circularized chromosome and plasmid sequences. The hybrid assembly mode combines Illumina accuracy with long-read scaffolding to resolve complex repeat structures that fragment short-read assemblies, making it ideal for finishing bacterial genomes without manual intervention.

The pipeline integrates with downstream analysis tools through its standard GFA and FASTA output formats. Assembly graphs can be visualized in Bandage for manual inspection, and the circularized contigs can be annotated directly with tools like Prokka. For users with high-depth, high-accuracy long reads, Unicycler's author recommends considering long-read-first approaches using Trycycler and Polypolish. Unicycler remains a strong choice for short-read-first hybrid assembly when long-read depth is limited.