https://github.com/daehwankimlab/hisat2
# HISAT2

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against reference genomes and human populations. It is based on a Hierarchical Graph FM index (HGFM), which combines a global index with many small local indexes, each covering a 56 Kbp genomic region. This design enables efficient alignment while allowing SNPs, splice sites, and haplotype information to be stored in the index. Alignments are written in SAM format, so they can be passed directly to downstream tools such as SAMtools, GATK, StringTie, and Cufflinks. HISAT2 supports spliced alignment for RNA-seq data, and the HISAT-3N extension handles nucleotide conversion sequencing technologies, including bisulfite sequencing (BS-seq), SLAM-seq, TAPS, and other base-converted reads. The software runs on Linux, Mac OS X, and Windows and requires approximately 8 GB of RAM for basic operation.

## Building a HISAT2 Index

The `hisat2-build` command creates an index from reference genome FASTA files, producing 8 index files (`.1.ht2` through `.8.ht2`) that enable fast read alignment. SNP and splice site information can be included in the index to improve alignment accuracy.
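The eight files named above also allow a simple completeness check before alignment starts. The following is a minimal sketch, not part of HISAT2: the function name `check_hisat2_index` is hypothetical, and it assumes the `.ht2` suffix described above (indexes built with `--large-index` may use a different suffix).

```bash
# Hypothetical helper: succeeds only if all eight index files exist for a prefix.
# Assumes the standard .ht2 suffix; adjust it for other index types.
check_hisat2_index() {
    local prefix="$1"
    local missing=0
    local i
    for i in 1 2 3 4 5 6 7 8; do
        if [ ! -f "${prefix}.${i}.ht2" ]; then
            echo "missing index file: ${prefix}.${i}.ht2" >&2
            missing=$((missing + 1))
        fi
    done
    # Return status 0 only when nothing is missing
    [ "$missing" -eq 0 ]
}
```

It could be used as a guard, for example `check_hisat2_index genome && hisat2 -x genome -U reads.fq -S output.sam`. The build commands that produce these files follow.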
```bash
# Build a basic genome index from FASTA file
hisat2-build genome.fa genome

# Build index with SNP information for population-aware alignment
hisat2-build --snp genome.snp genome.fa genome_snp

# Build index with splice site and exon annotations for RNA-seq
hisat2-build --ss splicesites.txt --exon exons.txt genome.fa genome_tran

# Build index with both SNPs and transcriptome annotations
hisat2-build --snp genome.snp --ss splicesites.txt --exon exons.txt genome.fa genome_snp_tran

# Build using multiple threads for faster indexing
hisat2-build -p 8 genome.fa genome

# Build large index for genomes >4 billion nucleotides
hisat2-build --large-index large_genome.fa large_genome
```

## Aligning Single-End Reads

The `hisat2` command aligns sequencing reads to an indexed reference genome, outputting alignments in SAM format. It supports FASTQ, FASTA, and other input formats with extensive options for controlling alignment behavior.

```bash
# Align single-end FASTQ reads (default format)
hisat2 -x genome -U reads.fq -S output.sam

# Align single-end FASTA reads
hisat2 -f -x genome -U reads.fa -S output.sam

# Align with multiple threads for faster processing
hisat2 -p 8 -x genome -U reads.fq -S output.sam

# Align DNA reads (disable spliced alignment)
hisat2 -x genome -U reads.fq -S output.sam --no-spliced-alignment

# Align and output unaligned reads to separate file
hisat2 -x genome -U reads.fq -S output.sam --un unaligned.fq

# Align gzip-compressed reads directly
hisat2 -x genome -U reads.fq.gz -S output.sam

# Output alignment time statistics
hisat2 -t -x genome -U reads.fq -S output.sam
```

## Aligning Paired-End Reads

HISAT2 supports paired-end read alignment with automatic fragment length detection and concordant/discordant alignment reporting, optimized for Illumina paired-end sequencing data.
```bash
# Basic paired-end alignment
hisat2 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam

# Paired-end alignment with multiple threads
hisat2 -p 8 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam

# Paired-end alignment with custom fragment length constraints
hisat2 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam -I 0 -X 500

# Output concordantly aligned pairs to separate files
hisat2 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam --al-conc aligned_%.fq

# Output pairs that fail to align concordantly to separate files
hisat2 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam --un-conc unaligned_%.fq

# Disable mixed-mode alignment (only concordant/discordant)
hisat2 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam --no-mixed
```

## RNA-Seq Alignment with Spliced Reads

HISAT2 supports RNA-seq alignment with built-in splice site detection, strand-specific library support, and compatibility with transcript assemblers like StringTie and Cufflinks.

```bash
# Standard RNA-seq alignment with splice site discovery
hisat2 -x genome_tran -1 reads_1.fq -2 reads_2.fq -S output.sam

# RNA-seq with known splice sites file
hisat2 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam --known-splicesite-infile splicesites.txt

# Output novel splice sites discovered during alignment
hisat2 -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam --novel-splicesite-outfile novel_splicesites.txt

# Strand-specific RNA-seq (forward stranded library)
hisat2 -x genome_tran -1 reads_1.fq -2 reads_2.fq -S output.sam --rna-strandness FR

# Strand-specific RNA-seq (reverse stranded library)
hisat2 -x genome_tran -1 reads_1.fq -2 reads_2.fq -S output.sam --rna-strandness RF

# Alignment optimized for downstream transcript assembly with StringTie
hisat2 -x genome_tran -1 reads_1.fq -2 reads_2.fq -S output.sam --dta

# Alignment optimized specifically for Cufflinks
hisat2 -x genome_tran -1 reads_1.fq -2 reads_2.fq -S output.sam --dta-cufflinks

# Set custom intron length constraints
hisat2 -x genome_tran -1 reads_1.fq -2 reads_2.fq -S output.sam --min-intronlen 20 --max-intronlen 500000
```

## Extracting Splice Sites and Exons from GTF

The `hisat2_extract_splice_sites.py` and `hisat2_extract_exons.py` scripts parse GTF annotation files to generate splice site and exon lists for index building and alignment.

```bash
# Extract splice sites from GTF annotation file
python hisat2_extract_splice_sites.py genes.gtf > splicesites.txt

# Extract splice sites with verbose statistics
python hisat2_extract_splice_sites.py -v genes.gtf > splicesites.txt

# Extract exons from GTF annotation file
python hisat2_extract_exons.py genes.gtf > exons.txt

# Extract from gzipped GTF file using stdin
zcat genes.gtf.gz | python hisat2_extract_splice_sites.py - > splicesites.txt

# Complete workflow: extract annotations and build index
python hisat2_extract_splice_sites.py genes.gtf > splicesites.txt
python hisat2_extract_exons.py genes.gtf > exons.txt
hisat2-build --ss splicesites.txt --exon exons.txt genome.fa genome_tran
```

## Extracting SNPs and Haplotypes from VCF

The `hisat2_extract_snps_haplotypes_VCF.py` script processes VCF files to generate SNP and haplotype information for building population-aware indexes that improve alignment accuracy.

```bash
# Extract SNPs and haplotypes from VCF file
python hisat2_extract_snps_haplotypes_VCF.py genome.fa variants.vcf genome

# This produces two files:
# - genome.snp: SNP information
# - genome.haplotype: Haplotype information

# Extract from 1000 Genomes VCF for specific chromosome
python hisat2_extract_snps_haplotypes_VCF.py genome.fa ALL.chr22.vcf.gz chr22

# Build index with extracted SNP and haplotype data
hisat2-build --snp genome.snp --haplotype genome.haplotype genome.fa genome_snp
```

## Inspecting HISAT2 Indexes

The `hisat2-inspect` command extracts information from built indexes, including reference sequences, SNPs, splice sites, and index metadata.
```bash
# Extract original reference sequences from index
hisat2-inspect genome > genome_from_index.fa

# Print reference sequence names only
hisat2-inspect -n genome

# Print index summary with sequence names and lengths
hisat2-inspect -s genome

# Extract SNPs stored in the index
hisat2-inspect --snp genome_snp

# Extract splice sites from the index
hisat2-inspect --ss genome_tran

# Extract all splice sites including local index sites
hisat2-inspect --ss-all genome_tran

# Extract exon information
hisat2-inspect --exon genome_tran
```

## Working with SRA Data

HISAT2 can directly align reads from NCBI's Sequence Read Archive (SRA) using accession numbers, eliminating the need to download and decompress files manually.

```bash
# Align reads directly from SRA accession
hisat2 -x genome --sra-acc SRR353653 -S output.sam

# Align multiple SRA accessions
hisat2 -x genome --sra-acc SRR353653,SRR353654 -S output.sam

# Align SRA data with multiple threads
hisat2 -p 8 -x genome --sra-acc SRR353653 -S output.sam
```

## SAM Output and Read Group Configuration

HISAT2 outputs alignments in SAM format with configurable headers and read group information for downstream processing and sample tracking.
```bash
# Add read group information to SAM output
hisat2 -x genome -U reads.fq -S output.sam --rg-id sample1 --rg SM:sample1 --rg PL:ILLUMINA

# Suppress unaligned reads from SAM output
hisat2 -x genome -U reads.fq -S output.sam --no-unal

# Suppress SAM header lines
hisat2 -x genome -U reads.fq -S output.sam --no-hd

# Add 'chr' prefix to chromosome names
hisat2 -x genome -U reads.fq -S output.sam --add-chrname

# Remove 'chr' prefix from chromosome names
hisat2 -x genome -U reads.fq -S output.sam --remove-chrname

# Write alignment summary to file
hisat2 -x genome -U reads.fq -S output.sam --summary-file alignment_summary.txt

# Machine-friendly summary format
hisat2 -x genome -U reads.fq -S output.sam --new-summary --summary-file summary.txt
```

## HISAT-3N for Nucleotide Conversion Sequencing

HISAT-3N extends HISAT2 for aligning nucleotide conversion sequencing reads from technologies like bisulfite sequencing (BS-seq), SLAM-seq, and TAPS, with support for both standard and repeat-aware alignment modes.
```bash
# Build HISAT-3N index for bisulfite sequencing (C-to-T conversion)
hisat-3n-build --base-change C,T genome.fa genome

# Build HISAT-3N index for SLAM-seq (T-to-C conversion)
hisat-3n-build --base-change T,C genome.fa genome

# Build repeat-aware index (requires ~256GB RAM for human genome)
hisat-3n-build --base-change C,T --repeat-index genome.fa genome

# Align bisulfite-seq reads with standard mode
hisat-3n -x genome -U reads.fq -S output.sam --base-change C,T --no-repeat-index

# Align strand-specific bisulfite-seq paired-end reads
hisat-3n -x genome -1 reads_1.fq -2 reads_2.fq -S output.sam --base-change C,T --directional-mapping --no-spliced-alignment

# Align SLAM-seq RNA reads with repeat mode
hisat-3n -x genome -U reads.fq -S output.sam --base-change T,C --repeat

# Generate 3N conversion table from alignment
samtools sort output.sam -o sorted_output.sam -O sam
hisat-3n-table -p 16 --alignments sorted_output.sam --ref genome.fa --output-name conversion_table.tsv --base-change C,T

# Generate conversion table for CpG sites only (bisulfite-seq)
hisat-3n-table -p 16 --alignments sorted_output.sam --ref genome.fa --output-name cpg_table.tsv --base-change C,T --CG-only --unique-only
```

## Downstream Analysis with SAMtools

HISAT2 SAM output integrates with SAMtools for format conversion, sorting, indexing, and variant calling workflows.
```bash
# Convert SAM to BAM format
samtools view -bS output.sam > output.bam

# Sort BAM file by coordinate
samtools sort output.bam -o output.sorted.bam

# Index sorted BAM file
samtools index output.sorted.bam

# Complete alignment pipeline with piped output
hisat2 -p 8 -x genome -1 reads_1.fq -2 reads_2.fq | samtools view -bS - | samtools sort -o output.sorted.bam -

# Generate variant calls with BCFtools
# (bcftools mpileup replaces the BCF output mode of samtools mpileup, which recent SAMtools releases no longer provide)
bcftools mpileup -Ou -f genome.fa output.sorted.bam | bcftools call -mv -Ob -o variants.bcf

# View alignment statistics
samtools flagstat output.sorted.bam
```

## Performance Tuning Options

HISAT2 provides multiple options for optimizing alignment performance based on available computational resources and accuracy requirements.

```bash
# Use multiple threads (recommended for large datasets)
hisat2 -p 16 -x genome -U reads.fq -S output.sam

# Use memory-mapped I/O for index (enables multiple processes sharing memory)
hisat2 --mm -x genome -U reads.fq -S output.sam

# Preserve input read order in output (requires more memory)
hisat2 -p 8 --reorder -x genome -U reads.fq -S output.sam

# Report up to k alignments per read
hisat2 -k 5 -x genome -U reads.fq -S output.sam

# Adjust scoring for minimum alignment score
hisat2 -x genome -U reads.fq -S output.sam --score-min L,0,-0.6

# Trim bases from read ends before alignment
hisat2 -x genome -U reads.fq -S output.sam -5 10 -3 5
```

HISAT2 is primarily used for aligning next-generation sequencing reads in genomics research workflows, particularly RNA-seq analysis for gene expression studies, whole-genome sequencing for variant discovery, and exome sequencing for targeted mutation analysis. Its graph-based indexing approach makes it especially well suited to population-level studies where incorporating known genetic variants improves alignment accuracy.

The tool integrates into standard bioinformatics pipelines by producing SAM format output compatible with widely used downstream tools.
For RNA-seq, alignments can be processed by transcript assemblers like StringTie or Cufflinks for expression quantification. For DNA sequencing, outputs flow into variant callers like GATK or BCFtools. The HISAT-3N extension enables specialized epigenomics workflows for DNA methylation analysis from bisulfite sequencing and metabolic labeling studies using SLAM-seq or similar technologies.
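
As an illustration of how the alignment step might be scripted inside such a pipeline, the sketch below loops over paired-end samples in a directory and prints one `hisat2` command per sample. It is an assumption-laden example rather than part of HISAT2: the function name `align_pairs`, the `DRY_RUN` variable, and the `<sample>_1.fq` / `<sample>_2.fq` naming scheme are all hypothetical, and only flags shown earlier in this document are used.

```bash
# Hypothetical helper: print (or run) one hisat2 command per paired-end sample.
# Assumes mates are named <sample>_1.fq and <sample>_2.fq; this naming is a
# convention chosen for the example, not a HISAT2 requirement.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = execute them

align_pairs() {
    local index="$1" reads_dir="$2" threads="${3:-8}"
    local r1 r2 sample
    for r1 in "$reads_dir"/*_1.fq; do
        [ -e "$r1" ] || continue          # directory had no matching files
        r2="${r1%_1.fq}_2.fq"
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: mate file not found" >&2
            continue
        fi
        sample="$(basename "${r1%_1.fq}")"
        if [ "$DRY_RUN" = "0" ]; then
            hisat2 -p "$threads" --dta -x "$index" -1 "$r1" -2 "$r2" \
                -S "${sample}.sam" --summary-file "${sample}.summary.txt"
        else
            echo "hisat2 -p $threads --dta -x $index -1 $r1 -2 $r2 -S ${sample}.sam --summary-file ${sample}.summary.txt"
        fi
    done
}
```

Calling `align_pairs genome_tran ./reads 8` with the default `DRY_RUN=1` only prints the commands, which makes it easy to review them before running with `DRY_RUN=0`. Printing first is a deliberate choice here, since alignment runs are long and a misnamed file is cheaper to catch up front.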