# Unicycler

Unicycler is an assembly pipeline specifically designed for bacterial genomes that can work with Illumina short reads, PacBio or Oxford Nanopore long reads, or both in hybrid assembly mode. It produces complete, circular bacterial chromosome and plasmid assemblies by using SPAdes for short-read assembly, miniasm and Racon for long-read assembly, and sophisticated bridging algorithms to resolve repeats and connect contigs.

The pipeline excels at hybrid assembly where it combines the accuracy of Illumina reads with the scaffolding power of long reads to resolve repetitive regions and produce finished-quality genomes. For short-read-only assemblies, Unicycler functions as a SPAdes optimizer, trying multiple k-mer sizes and selecting the best assembly graph. For long-read-only assemblies, it uses a miniasm+Racon pipeline with multiple polishing rounds to improve consensus accuracy.

## Command Line Interface

The main entry point for running Unicycler assemblies with configurable options for input reads, output directory, assembly mode, and threading.

```bash
# Illumina-only assembly (paired-end reads)
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -o output_dir

# Long-read-only assembly (PacBio or Nanopore)
unicycler -l long_reads.fastq.gz -o output_dir

# Hybrid assembly (best results - both short and long reads)
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads.fastq.gz -o output_dir

# Hybrid assembly with custom options
unicycler \
    -1 short_reads_1.fastq.gz \
    -2 short_reads_2.fastq.gz \
    -l long_reads.fastq.gz \
    -o output_dir \
    --mode bold \
    --threads 16 \
    --verbosity 2 \
    --min_fasta_length 500 \
    --keep 2

# Assembly with unpaired short reads
unicycler -s unpaired_reads.fastq.gz -l long_reads.fastq.gz -o output_dir

# Conservative mode (lowest misassembly rate, smaller contigs)
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out --mode conservative

# Bold mode (longest contigs, higher misassembly risk)
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out --mode bold

# Using existing long-read assembly for bridging
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out \
    --existing_long_read_assembly my_long_read_assembly.gfa

# Expected linear sequences (e.g., for organisms with linear chromosomes)
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out --linear_seqs 1
```

## AssemblyGraph Class

The core data structure for representing and manipulating bacterial genome assembly graphs in GFA format, supporting segments, links, copy depth tracking, and bridging operations.

```python
from unicycler.assembly_graph import AssemblyGraph

# Load an assembly graph from a GFA file
graph = AssemblyGraph('assembly_graph.gfa', overlap=77)

# Load with custom insert size parameters
graph = AssemblyGraph('assembly_graph.gfa', overlap=77,
                      insert_size_mean=350, insert_size_deviation=75)

# Get basic graph statistics
total_length = graph.get_total_length()
total_length_no_overlaps = graph.get_total_length_no_overlaps()
dead_ends = graph.total_dead_end_count()
median_depth = graph.get_median_read_depth()

# Get connected components
components = graph.get_connected_components()
# Returns: [[1, 2, 3], [4, 5]]  # segment numbers grouped by component

# Check if a component is complete (circular replicon)
for component in components:
    if graph.is_component_complete(component):
        print(f"Component {component} is a complete circular replicon")

# Get single-copy segments (useful for scaffolding)
single_copy_segments = graph.get_single_copy_segments()

# Get path sequence through the graph
path = [1, 2, -3, 4]  # signed segment numbers (negative = reverse complement)
sequence = graph.get_path_sequence(path)

# Save graph to different formats
graph.save_to_gfa('output.gfa')
graph.save_to_fasta('output.fasta', min_length=500)
graph.save_to_gfa('output_with_info.gfa',
                  save_copy_depth_info=True, save_seg_type_info=True)

# Merge simple paths (segments in unbranching paths)
graph.merge_all_possible(anchor_segments=None, bridging_mode=1)

# Renumber segments by length (longest = 1)
graph.renumber_segments()

# Get completed circular replicons
circular_segs = graph.completed_circular_replicons()
# Returns list of segment numbers that form complete circles
```

## SPAdes Assembly Functions

Functions for running SPAdes short-read assembly with automatic k-mer optimization, returning the best assembly graph based on contig count and dead end minimization.

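
To make the idea of an automatically chosen k-mer range more concrete, here is a minimal, standalone sketch of one way such a range could be derived from a read length. It is not Unicycler's implementation: `sketch_kmer_range` and `round_down_to_odd` are hypothetical names, the parameter names merely mirror those used with `get_kmer_range` in this section, and the lower bound of 11 is an assumption made for illustration. The only facts relied on are that SPAdes k-mer sizes must be odd and below 128.

```python
def round_down_to_odd(value: int) -> int:
    """Return value if it is odd, otherwise the next lower odd integer."""
    return value if value % 2 == 1 else value - 1


def sketch_kmer_range(read_length: int, kmer_count: int = 8,
                      min_kmer_frac: float = 0.2, max_kmer_frac: float = 0.95,
                      floor_k: int = 11, ceiling_k: int = 127) -> list:
    """Pick up to kmer_count odd k-mer sizes spread across a read-length window.

    Illustrative only: floor_k is an assumed lower bound, ceiling_k reflects
    the SPAdes requirement that k be below 128.
    """
    lowest = max(round_down_to_odd(int(read_length * min_kmer_frac)), floor_k)
    highest = min(round_down_to_odd(int(read_length * max_kmer_frac)), ceiling_k)
    if highest <= lowest or kmer_count < 2:
        return [lowest]

    span = highest - lowest
    # Integer arithmetic keeps the spacing deterministic; the set removes
    # duplicates that appear when the window is narrow.
    candidates = {
        round_down_to_odd(lowest + (i * span) // (kmer_count - 1))
        for i in range(kmer_count)
    }
    return sorted(candidates)


if __name__ == "__main__":
    print(sketch_kmer_range(150))
```

For 150 bp reads this sketch yields eight evenly spaced odd values from 29 to 127. The real pipeline may choose different values, so treat the output as an illustration of the parameters rather than a prediction of what Unicycler will run.
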
```python
from unicycler.spades_func import get_best_spades_graph, get_kmer_range

# Get optimal k-mer range based on read lengths
kmer_range = get_kmer_range(
    given_kmers=None,                  # None for automatic, or list like [21, 51, 71]
    reads_1_filename='R1.fastq.gz',
    reads_2_filename='R2.fastq.gz',
    unpaired_reads_filename=None,
    spades_dir='/path/to/spades_work',
    kmer_count=8,                      # Number of k-mer steps
    min_kmer_frac=0.2,                 # Min k-mer as fraction of read length
    max_kmer_frac=0.95,                # Max k-mer as fraction of read length
    spades_path='spades.py'
)
# Returns: [27, 47, 63, 77, 89, 99, 107, 117]

# Run SPAdes and get best assembly graph
best_graph = get_best_spades_graph(
    short1='reads_1.fastq.gz',
    short2='reads_2.fastq.gz',
    short_unpaired=None,
    out_dir='/path/to/output',
    read_depth_filter=0.25,            # Filter contigs below this fraction of chromosomal depth
    verbosity=1,
    spades_path='spades.py',
    threads=8,
    keep=1,                            # File retention level (0-3)
    kmer_count=8,
    min_k_frac=0.2,
    max_k_frac=0.95,
    kmers=None,                        # Specific k-mers or None for auto
    expected_linear_seqs=0,
    largest_component=False,           # Keep only largest component?
    spades_graph_prefix='/path/to/graphs/spades',
    spades_options=None,               # Additional SPAdes options
    spades_version='3.15.0'
)
# Returns: AssemblyGraph object with optimal k-mer assembly
```

## Miniasm Assembly Pipeline

Functions for long-read assembly using miniasm and Racon polishing, integrating long reads with short-read contigs for hybrid assembly bridging.

```python
import itertools

from unicycler.miniasm_assembly import make_miniasm_string_graph

# Create miniasm string graph for hybrid assembly
string_graph = make_miniasm_string_graph(
    graph=short_read_graph,              # AssemblyGraph from SPAdes (None for long-read-only)
    read_dict=read_dictionary,           # Dict of read_name -> Read objects
    long_read_filename='long_reads.fastq.gz',
    scoring_scheme=scoring_scheme,       # AlignmentScoringScheme object
    read_nicknames=nickname_dict,        # Dict mapping read names to short IDs
    counter=itertools.count(start=1),    # Counter for output file numbering
    args=args,                           # Parsed command-line arguments
    anchor_segments=anchor_list,         # List of single-copy segments to bridge
    existing_long_read_assembly=None     # Path to pre-made assembly or None
)
# Returns: StringGraph object with polished unitigs

# The function internally:
# 1. Aligns long reads to assembly graph with minimap
# 2. Saves anchor contigs and overlapping reads as "reads" for miniasm
# 3. Runs miniasm to create string graph
# 4. Polishes with multiple Racon rounds
# 5. Places short-read contigs back into unitig graph
```

## Bridge Application

System for creating and applying bridges between single-copy segments using evidence from long reads, SPAdes paths, and miniasm assemblies.

```python
from unicycler.bridge_long_read import create_long_read_bridges
from unicycler.bridge_spades_contig import create_spades_contig_bridges
from unicycler.bridge_miniasm import create_miniasm_bridges

# Create bridges from SPAdes contig paths
spades_bridges = create_spades_contig_bridges(
    graph=assembly_graph,
    anchor_segments=single_copy_segments
)

# Create bridges from long read alignments
long_read_bridges = create_long_read_bridges(
    graph=assembly_graph,
    read_dict=read_dictionary,
    read_names=read_name_list,
    anchor_segments=single_copy_segments,
    verbosity=1,
    min_scaled_score=85.0,
    threads=8,
    scoring_scheme=scoring_scheme,
    min_alignment_length=1000,
    expected_linear_seqs=False,
    min_bridge_qual=10.0
)

# Create bridges from miniasm assembly
miniasm_bridges = create_miniasm_bridges(
    graph=assembly_graph,
    string_graph=miniasm_string_graph,
    anchor_segments=single_copy_segments,
    scoring_scheme=scoring_scheme,
    verbosity=1,
    min_bridge_qual=10.0
)

# Combine all bridges
all_bridges = spades_bridges + long_read_bridges + miniasm_bridges

# Apply bridges to the same graph they were built from
# (sorted by quality, highest first)
seg_nums_used = assembly_graph.apply_bridges(
    bridges=all_bridges,
    verbosity=1,
    min_bridge_qual=10.0  # Reject bridges below this quality
)
# Bridges are applied in order of decreasing quality
# Conflicting bridges are resolved by using the highest-quality option
```

## Alignment Scoring Scheme

Configuration for sequence alignment scoring parameters used throughout the assembly and bridging pipeline.

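
As a rough illustration of what a four-value scheme (match, mismatch, gap open, gap extend) encodes, the standalone sketch below scores an already-aligned pair of sequences. `ToyScoringScheme` and `score_alignment` are hypothetical names and are not part of Unicycler. The sketch also assumes one particular convention: the first position of a gap costs `gap_open` and each further position of the same gap costs `gap_extend`. Other aligners define these terms differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyScoringScheme:
    """Hypothetical container for the four alignment scores."""
    match: int
    mismatch: int
    gap_open: int
    gap_extend: int

    @classmethod
    def from_string(cls, text: str) -> "ToyScoringScheme":
        """Parse 'match,mismatch,gap_open,gap_extend', e.g. '3,-6,-5,-2'."""
        parts = [int(value) for value in text.split(",")]
        if len(parts) != 4:
            raise ValueError("expected four comma-separated integers")
        return cls(*parts)


def score_alignment(aligned_a: str, aligned_b: str, scheme: ToyScoringScheme) -> int:
    """Score two gapped strings of equal length, using '-' for gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    score = 0
    previous_gap_row = None  # which sequence held a gap in the last column
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == "-" and base_b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        gap_row = "a" if base_a == "-" else "b" if base_b == "-" else None
        if gap_row is not None:
            # Assumed convention: first gap position pays gap_open,
            # continuing the same gap pays gap_extend.
            score += scheme.gap_extend if gap_row == previous_gap_row else scheme.gap_open
        elif base_a == base_b:
            score += scheme.match
        else:
            score += scheme.mismatch
        previous_gap_row = gap_row
    return score


if __name__ == "__main__":
    scheme = ToyScoringScheme.from_string("3,-6,-5,-2")
    print(score_alignment("ACGT", "ACGT", scheme))
    print(score_alignment("A--T", "ACGT", scheme))
```

With the scores `3,-6,-5,-2`, a mismatch costs more than opening a gap, so under this convention a single substitution is penalized more heavily than a single-base indel.
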
```python
from unicycler.alignment import AlignmentScoringScheme

# Create scoring scheme from comma-separated string
# Format: match, mismatch, gap_open, gap_extend
scoring_scheme = AlignmentScoringScheme('3,-6,-5,-2')

# Access individual scores
match_score = scoring_scheme.match        # 3
mismatch_score = scoring_scheme.mismatch  # -6
gap_open = scoring_scheme.gap_open        # -5
gap_extend = scoring_scheme.gap_extend    # -2

# The scoring scheme is used for:
# - Long read to graph alignment
# - Bridge path finding and scoring
# - Consensus sequence generation
# - Contig placement in unitig graphs
```

## Copy Depth Determination

Algorithm for determining segment copy numbers based on read depth and graph connectivity, essential for identifying single-copy segments for bridging.

```python
from unicycler.assembly_graph_copy_depth import determine_copy_depth

# Analyze graph and assign copy depths to segments
determine_copy_depth(graph)

# After running, segments have copy depth information:
# graph.copy_depths = {segment_number: [depth1, depth2, ...]}
# Single-copy segments have exactly one depth value

# Get single-copy segments (copy number = 1)
single_copy = graph.get_single_copy_segments()

# Get copy number for specific segment
copy_num = graph.get_copy_number(segment)
# Returns 0, 1, 2, 3, etc.

# Check if segment is single-copy
is_single = graph.is_seg_num_single_copy(segment_number)

# Get segments without assigned copy depth
no_depth = graph.get_no_copy_depth_segments()

# Copy depth visualization colors (for Bandage)
# 1 copy: green, 2 copies: gold, 3 copies: orange, 4+ copies: red
```

## Long Read Loading and Processing

Functions for loading and processing long reads from FASTQ/FASTA files for assembly and alignment.

```python
from unicycler.read_ref import load_long_reads, get_read_nickname_dict

# Load long reads from file
read_dict, read_names, long_read_filename = load_long_reads(
    filename='long_reads.fastq.gz',
    output_dir='/path/to/output'  # For decompressed temp files
)
# read_dict: {read_name: Read object with sequence, qualities, alignments}
# read_names: ordered list of read names
# long_read_filename: path to (possibly decompressed) reads file

# Create short nicknames for reads (used in miniasm)
nicknames = get_read_nickname_dict(read_names)
# Returns: {'very_long_read_name_1': 'R001', 'another_read': 'R002', ...}

# Access read information
for name in read_names:
    read = read_dict[name]
    sequence = read.sequence
    qualities = read.qualities
    alignments = read.alignments  # List after alignment step
```

## Output Files

Unicycler produces several output files depending on the `--keep` level setting, with the final assembly always saved in both GFA and FASTA formats.

```text
# Output directory structure (--keep 1, default):
output_dir/
├── assembly.gfa                  # Final assembly in GFA v1 format
├── assembly.fasta                # Final assembly in FASTA format
├── unicycler.log                 # Detailed log of the assembly process
├── 001_spades_graph_k*.gfa       # SPAdes graphs at each k-mer
├── 002_depth_filter.gfa          # Graph after depth filtering
├── 003_overlaps_removed.gfa      # Overlap-free graph
├── 004_long_read_assembly.gfa    # Miniasm+Racon assembly (hybrid only)
├── 005_bridges_applied.gfa       # Graph after bridging
└── 006_final_clean.gfa           # Cleaned graph before rotation
```

```python
# Parse GFA output programmatically
from unicycler.assembly_graph import AssemblyGraph

final_graph = AssemblyGraph('output_dir/assembly.gfa', overlap=0)

# Get assembly statistics
n50, shortest, q1, median, q3, longest = final_graph.get_contig_stats()
print(f"N50: {n50} bp")
print(f"Longest contig: {longest} bp")
print(f"Total segments: {len(final_graph.segments)}")

# Check completion status
components = final_graph.get_connected_components()
for i, comp in enumerate(components):
    status = "complete" if final_graph.is_component_complete(comp) else "incomplete"
    length = sum(final_graph.segments[x].get_length() for x in comp)
    print(f"Component {i+1}: {length} bp, {status}")
```

## Assembly Mode Settings

Configuration constants for conservative, normal, and bold assembly modes affecting bridge quality thresholds and contig merging behavior.

```python
from unicycler import settings

# Bridge quality thresholds by mode
# Conservative: most accurate, fewer completed replicons
conservative_threshold = settings.CONSERVATIVE_MIN_BRIDGE_QUAL  # 25.0
# Normal: balanced accuracy and completeness
normal_threshold = settings.NORMAL_MIN_BRIDGE_QUAL              # 10.0
# Bold: most completed replicons, higher misassembly risk
bold_threshold = settings.BOLD_MIN_BRIDGE_QUAL                  # 1.0

# Other key settings
max_threads_default = settings.MAX_AUTO_THREAD_COUNT                # 8
min_alignment_length = settings.MIN_LONG_READ_ALIGNMENT_LENGTH      # 50
racon_loops_hybrid = settings.RACON_POLISH_LOOP_COUNT_HYBRID        # 2
racon_loops_long_only = settings.RACON_POLISH_LOOP_COUNT_LONG_ONLY  # 4

# Path finding settings
min_path_length_ratio = settings.MIN_RELATIVE_PATH_LENGTH  # 0.9
max_path_length_ratio = settings.MAX_RELATIVE_PATH_LENGTH  # 1.1
```

Unicycler is primarily used for assembling bacterial genomes from sequencing data, with particular strength in producing complete, circularized chromosome and plasmid sequences. The hybrid assembly mode combines Illumina accuracy with long-read scaffolding to resolve complex repeat structures that fragment short-read assemblies, making it ideal for finishing bacterial genomes without manual intervention.

The pipeline integrates well with downstream analysis tools through its standard GFA and FASTA output formats. Assembly graphs can be visualized in Bandage for manual inspection, and the circularized contigs produced by Unicycler can be directly annotated with tools like Prokka.
For users with high-depth, high-accuracy long reads, Unicycler's author recommends considering long-read-first approaches using Trycycler and Polypolish, but Unicycler remains the preferred tool for short-read-first hybrid assembly when long-read depth is limited.
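
Because the outputs are plain GFA and FASTA, downstream scripts do not need Unicycler installed to inspect an assembly. The sketch below reads GFA v1 text directly, using only the standard `S` (segment) and `L` (link) record layouts, and reports N50 plus the segments that look circular. The function names `parse_gfa`, `n50`, and `circular_segments` are hypothetical, and the circularity rule used here (a same-orientation self-link and no links to other segments) is a simplifying assumption rather than Unicycler's own completeness check.

```python
def parse_gfa(lines):
    """Return ({segment_name: length}, [(from, from_orient, to, to_orient)])."""
    lengths = {}
    links = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, sequence = fields[1], fields[2]
            length = len(sequence)
            if sequence == "*":
                # Sequence omitted: fall back to an optional LN:i:<length> tag.
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            lengths[name] = length
        elif fields[0] == "L" and len(fields) >= 5:
            links.append((fields[1], fields[2], fields[3], fields[4]))
    return lengths, links


def n50(lengths):
    """Smallest length L such that segments of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def circular_segments(links):
    """Segments with a same-orientation self-link and no links to other segments."""
    self_linked = {a for a, a_or, b, b_or in links if a == b and a_or == b_or}
    linked_to_others = {name for a, _, b, _ in links if a != b for name in (a, b)}
    return self_linked - linked_to_others


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python gfa_stats.py output_dir/assembly.gfa
    with open(sys.argv[1]) as handle:
        seg_lengths, seg_links = parse_gfa(handle)
    print(f"Segments: {len(seg_lengths)}")
    print(f"N50: {n50(seg_lengths.values())} bp")
    print(f"Circular: {sorted(circular_segments(seg_links))}")
```

A quick check like this is handy before annotation: segments reported as circular are candidates for complete replicons, while anything else deserves a look in Bandage.
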