Try Live
Add Docs
Rankings
Pricing
Docs
Install
Install
Docs
Pricing
More...
More...
Try Live
Rankings
Enterprise
Create API Key
Add Docs
HTSeq
https://github.com/htseq/htseq
Admin
HTSeq is a Python framework to process and analyze data from high-throughput sequencing assays,
...
Tokens:
56,043
Snippets:
403
Trust Score:
5
Update:
1 week ago
Context
Skills
Chat
Benchmark
86.9
Suggestions
Latest
Show doc for...
Code
Info
Show Results
Context Summary (auto-generated)
Raw
Copy
Link
# HTSeq HTSeq is a Python framework for processing and analyzing high-throughput sequencing (HTS) data. It provides classes and functions for reading and writing sequence files (FASTA, FASTQ), alignment files (SAM/BAM/CRAM), and annotation files (GFF/GTF). The library is designed to facilitate custom bioinformatics scripts and offers powerful data structures like GenomicArray and GenomicArrayOfSets for storing and querying genomic intervals efficiently. A key component of HTSeq is the `htseq-count` script, which counts how many sequencing reads map to each genomic feature (typically genes) from GTF/GFF annotations. This tool is widely used in RNA-Seq differential expression analysis. HTSeq also supports single-cell RNA-Seq with cell and molecular barcodes through `htseq-count-barcodes`, making it suitable for both bulk and single-cell transcriptomics workflows. ## Reading FASTQ Files with FastqReader The FastqReader class reads FASTQ files and yields SequenceWithQualities objects containing sequence data, quality scores, and read names. It supports multiple quality encoding formats including Phred (default), Solexa, and Solexa-old. ```python import HTSeq import numpy as np # Read FASTQ file with default Phred encoding fastq_file = HTSeq.FastqReader("reads.fastq") # Read FASTQ file with Solexa quality encoding fastq_file = HTSeq.FastqReader("reads.fastq", "solexa") # Iterate through reads for read in fastq_file: print(f"Read name: {read.name}") print(f"Sequence: {read.seq}") print(f"Quality scores: {read.qual}") print(f"Quality string: {read.qualstr}") break # Calculate average quality per position across all reads qualsum = None nreads = 0 for read in fastq_file: if qualsum is None: qualsum = np.zeros(len(read), dtype=int) qualsum += read.qual nreads += 1 average_quality = qualsum / float(nreads) print(f"Average quality per position: {average_quality}") ``` ## Reading FASTA Files with FastaReader The FastaReader class parses FASTA files and yields Sequence objects. It can read from regular or gzip-compressed files and supports optional raw iterator mode for improved performance. ```python import HTSeq # Read FASTA file fasta_file = HTSeq.FastaReader("sequences.fa") # Iterate through sequences for seq in fasta_file: print(f"Name: {seq.name}") print(f"Sequence: {seq.seq}") print(f"Length: {len(seq)}") # Load all sequences into a dictionary sequences = dict((s.name, s) for s in HTSeq.FastaReader("sequences.fa")) print(f"Sequence 'gene1': {sequences['gene1'].seq}") # Use raw iterator for faster parsing (returns tuples) sequences = dict((s[1], s[0]) for s in HTSeq.FastaReader("sequences.fa", raw_iterator=True)) # Get reverse complement seq = HTSeq.Sequence(b"ACGTAAAG", "my_seq") rc = seq.get_reverse_complement() print(f"Original: {seq.seq}, Reverse complement: {rc.seq}") # Write to FASTA file with open("output.fa", "w") as outfile: seq.write_to_fasta_file(outfile) ``` ## Reading SAM/BAM Files with BAM_Reader The BAM_Reader class reads SAM, BAM, and CRAM alignment files using pysam under the hood. It yields SAM_Alignment objects containing alignment information, read sequences, quality scores, and CIGAR operations. ```python import HTSeq import itertools # Open BAM/SAM file bam_reader = HTSeq.BAM_Reader("alignments.bam") # Iterate through alignments for alignment in itertools.islice(bam_reader, 5): if alignment.aligned: print(f"Read: {alignment.read.name}") print(f"Aligned to: {alignment.iv}") print(f"Strand: {alignment.iv.strand}") print(f"CIGAR: {alignment.cigar}") print(f"Mapping quality: {alignment.aQual}") # Count reads per chromosome import collections counts = collections.defaultdict(int) for alignment in HTSeq.BAM_Reader("alignments.bam"): if alignment.aligned: counts[alignment.iv.chrom] += 1 print(f"Reads per chromosome: {dict(counts)}") # Fetch reads from a specific region (requires index) bam_reader = HTSeq.BAM_Reader("alignments_sorted.bam") for alignment in bam_reader.fetch(region="chr1:1000000-2000000"): print(alignment) # Write alignments to BAM file bam_writer = HTSeq.BAM_Writer.from_BAM_Reader("output.bam", bam_reader) for alignment in bam_reader: bam_writer.write(alignment) bam_writer.close() ``` ## GenomicInterval and GenomicPosition GenomicInterval represents a genomic region with chromosome, start, end, and strand. GenomicPosition represents a single base position. These are fundamental data structures used throughout HTSeq for coordinate handling. ```python import HTSeq # Create a genomic interval (0-based, half-open) iv = HTSeq.GenomicInterval("chr3", 123203, 127245, "+") print(f"Chromosome: {iv.chrom}") print(f"Start: {iv.start}, End: {iv.end}") print(f"Strand: {iv.strand}") print(f"Length: {iv.length}") # Directional coordinates (accounting for strand) print(f"Directional start: {iv.start_d}") print(f"Directional end: {iv.end_d}") # Create interval from directional coordinates iv2 = HTSeq.GenomicInterval_from_directional("chr3", 127244, 4042, "-") # Create a genomic position pos = HTSeq.GenomicPosition("chr3", 125000, "+") print(f"Position: {pos.pos}") # Check overlap relationships iv_a = HTSeq.GenomicInterval("chr1", 100, 300, "+") iv_b = HTSeq.GenomicInterval("chr1", 200, 400, "+") print(f"Overlaps: {iv_a.overlaps(iv_b)}") print(f"Contains: {iv_a.contains(iv_b)}") print(f"Is contained in: {iv_a.is_contained_in(iv_b)}") # Extend interval to include another iv_a.extend_to_include(iv_b) print(f"Extended interval: {iv_a}") ``` ## GenomicArray for Coverage Vectors GenomicArray is a container that stores values indexed by genomic coordinates. It's ideal for coverage vectors, signal tracks, and other per-position data. The step-based storage efficiently handles regions of constant value. ```python import HTSeq # Create a stranded GenomicArray for integer values cvg = HTSeq.GenomicArray("auto", stranded=True, typecode="i") # Build coverage from alignments for alignment in HTSeq.BAM_Reader("alignments.bam"): if alignment.aligned: cvg[alignment.iv] += 1 # Query coverage at a position pos = HTSeq.GenomicPosition("chr1", 50000, "+") print(f"Coverage at position: {cvg[pos]}") # Query coverage over an interval iv = HTSeq.GenomicInterval("chr1", 45000, 55000, "+") coverage_values = list(cvg[iv]) print(f"Coverage values: {coverage_values[:10]}...") # Iterate through steps (constant-value regions) for step_iv, value in cvg[iv].steps(): if value > 0: print(f"{step_iv}: {value}") # Apply a function to modify values cvg[iv].apply(lambda x: x * 2) # Export to BedGraph/Wiggle format for genome browsers cvg.write_bedgraph_file("coverage_plus.wig", "+") cvg.write_bedgraph_file("coverage_minus.wig", "-") # Export to BigWig format (requires pyBigWig) cvg.write_bigwig_file("coverage_plus.bw", "+") # Create GenomicArray with known chromosome lengths chrom_lengths = {"chr1": 248956422, "chr2": 242193529} ga = HTSeq.GenomicArray(chrom_lengths, stranded=False, typecode="d") # Set values ga[HTSeq.GenomicInterval("chr1", 100, 200, ".")] = 5.5 ga[HTSeq.GenomicInterval("chr1", 150, 250, ".")] += 3.2 ``` ## GenomicArrayOfSets for Overlapping Features GenomicArrayOfSets stores sets of objects at each genomic position, making it ideal for handling overlapping genomic features like genes or exons. It's the core data structure used by htseq-count. ```python import HTSeq # Create a GenomicArrayOfSets features = HTSeq.GenomicArrayOfSets("auto", stranded=False) # Add features (use += to add to the set) features[HTSeq.GenomicInterval("chr1", 100, 250, ".")] += "geneA" features[HTSeq.GenomicInterval("chr1", 360, 640, ".")] += "geneA" # Second exon features[HTSeq.GenomicInterval("chr1", 510, 950, ".")] += "geneB" # Overlapping gene # Query what features overlap a read read_iv = HTSeq.GenomicInterval("chr1", 450, 800, ".") for step_iv, gene_set in features[read_iv].steps(): print(f"{step_iv}: {sorted(gene_set)}") # Find all features overlapping a read (union mode) all_features = set() for step_iv, gene_set in features[read_iv].steps(): all_features |= gene_set print(f"All overlapping features: {sorted(all_features)}") # Find features overlapping entire read (intersection-strict mode) common_features = None for step_iv, gene_set in features[read_iv].steps(): if common_features is None: common_features = gene_set.copy() else: common_features &= gene_set print(f"Common features: {sorted(common_features)}") # Count reads per gene counts = {} for alignment in HTSeq.BAM_Reader("alignments.bam"): if not alignment.aligned: continue gene_set = set() for step_iv, step_set in features[alignment.iv].steps(): gene_set |= step_set if len(gene_set) == 1: gene = list(gene_set)[0] counts[gene] = counts.get(gene, 0) + 1 elif len(gene_set) > 1: counts["__ambiguous"] = counts.get("__ambiguous", 0) + 1 ``` ## Reading GFF/GTF Files with GFF_Reader The GFF_Reader class parses GFF and GTF annotation files, yielding GenomicFeature objects with genomic coordinates and attributes like gene_id, transcript_id, and gene_name. ```python import HTSeq # Read GTF file gtf_file = HTSeq.GFF_Reader("annotations.gtf") # Iterate through features for feature in gtf_file: print(f"Type: {feature.type}") print(f"Name: {feature.name}") print(f"Interval: {feature.iv}") print(f"Attributes: {feature.attr}") if "gene_id" in feature.attr: print(f"Gene ID: {feature.attr['gene_id']}") break # Filter for exons only exons = [] for feature in HTSeq.GFF_Reader("annotations.gtf"): if feature.type == "exon": exons.append(feature) # Build GenomicArrayOfSets from exons features = HTSeq.GenomicArrayOfSets("auto", stranded=True) for feature in HTSeq.GFF_Reader("annotations.gtf"): if feature.type == "exon": features[feature.iv] += feature.attr["gene_id"] # Use make_feature_genomicarrayofsets (used by htseq-count) gff = HTSeq.GFF_Reader("annotations.gtf") result = HTSeq.make_feature_genomicarrayofsets( gff, id_attribute="gene_id", feature_type=["exon"], additional_attributes=["gene_name"], stranded=True, verbose=True ) features = result["features"] attributes = result["attributes"] # Access additional attributes for gene_id, attrs in list(attributes.items())[:5]: print(f"{gene_id}: {attrs}") # Write features back to GFF with open("output.gff", "w") as outfile: for feature in HTSeq.GFF_Reader("annotations.gtf"): outfile.write(feature.get_gff_line()) ``` ## htseq-count Command Line Tool The htseq-count script counts reads overlapping genomic features from a GTF file. It supports multiple overlap modes, handles paired-end data, and can process multiple BAM files in parallel. ```bash # Basic usage: count reads per gene htseq-count alignments.bam annotations.gtf > counts.txt # Multiple input files with parallel processing htseq-count -n 4 sample1.bam sample2.bam sample3.bam annotations.gtf -c counts.tsv # Specify options for RNA-Seq analysis htseq-count \ --stranded=reverse \ --mode=union \ --type=exon \ --idattr=gene_id \ --additional-attr=gene_name \ --minaqual=10 \ alignments.bam \ annotations.gtf \ -c counts.tsv # For non-strand-specific data htseq-count --stranded=no alignments.bam annotations.gtf > counts.txt # Position-sorted paired-end data htseq-count -r pos --max-reads-in-buffer=30000000 paired_end.bam annotations.gtf # Output annotated BAM file htseq-count -o annotated.sam alignments.bam annotations.gtf > counts.txt # Handle multimappers with fractional counts htseq-count --nonunique=fraction alignments.bam annotations.gtf # Output to different formats (mtx, h5ad, loom) htseq-count -c counts.h5ad --counts-output-sparse alignments.bam annotations.gtf # Restrict to specific gene htseq-count --feature-query 'gene_name == "ACTB"' alignments.bam annotations.gtf ``` ## htseq-count-barcodes for Single-Cell RNA-Seq The htseq-count-barcodes script is designed for single-cell RNA-Seq data with cell barcodes and UMIs (unique molecular identifiers), compatible with 10X Genomics output. ```bash # Basic usage with default 10X Genomics tags (CB for cell barcode, UB for UMI) htseq-count-barcodes cellranger_output.bam annotations.gtf -c counts.h5ad # Specify custom barcode tags htseq-count-barcodes \ --cell-barcode=CB \ --UMI=UB \ cellranger_output.bam \ annotations.gtf \ -c counts.mtx # Correct UMI sequencing errors (Hamming distance 1) htseq-count-barcodes \ --correct-UMI-distance=1 \ cellranger_output.bam \ annotations.gtf \ -c counts.h5ad # Full example with all common options htseq-count-barcodes \ --stranded=reverse \ --mode=union \ --type=exon \ --idattr=gene_id \ --additional-attr=gene_name \ --add-chromosome-info \ --minaqual=10 \ --cell-barcode=CB \ --UMI=UB \ --correct-UMI-distance=1 \ --counts_output_sparse \ cellranger_output.bam \ annotations.gtf \ -c counts.h5ad ``` ## Paired-End Alignment Processing HTSeq provides functions to pair alignments from paired-end sequencing data, supporting both name-sorted and position-sorted BAM files. ```python import HTSeq # For name-sorted BAM files bam_reader = HTSeq.BAM_Reader("paired_end_namesorted.bam") for first, second in HTSeq.pair_SAM_alignments(bam_reader): if first is not None and second is not None: print(f"Read 1: {first.iv}") print(f"Read 2: {second.iv}") print(f"Insert size: {first.inferred_insert_size}") # For position-sorted BAM files (uses memory buffer) bam_reader = HTSeq.BAM_Reader("paired_end_possorted.bam") for first, second in HTSeq.pair_SAM_alignments_with_buffer( bam_reader, max_buffer_size=30000000 ): if first is not None and second is not None: # Process paired reads pass # Access paired-end specific attributes for alignment in HTSeq.BAM_Reader("paired_end.bam"): if alignment.paired_end: print(f"PE which: {alignment.pe_which}") # 'first' or 'second' print(f"Mate aligned: {alignment.mate_aligned}") print(f"Proper pair: {alignment.proper_pair}") if alignment.mate_aligned: print(f"Mate start: {alignment.mate_start}") ``` ## CIGAR String Parsing HTSeq parses CIGAR strings from SAM/BAM files into CigarOperation objects, enabling analysis of spliced alignments, insertions, and deletions. ```python import HTSeq # Parse a CIGAR string manually cigar_ops = HTSeq.parse_cigar("20M6I10M", 1000, "chr2", "+") for op in cigar_ops: print(f"Type: {op.type}, Size: {op.size}") print(f"Reference interval: {op.ref_iv}") print(f"Query positions: {op.query_from}-{op.query_to}") # CIGAR operation type names print(HTSeq.cigar_operation_names) # {'M': 'matched', 'I': 'inserted', 'D': 'deleted', # 'N': 'skipped', 'S': 'soft-clipped', 'H': 'hard-clipped', # 'P': 'padded', '=': 'sequence-matched', 'X': 'sequence-mismatched'} # Process CIGAR from alignment for alignment in HTSeq.BAM_Reader("spliced_alignments.bam"): if alignment.aligned: for cigar_op in alignment.cigar: if cigar_op.type == "M": # Matched bases print(f"Match: {cigar_op.ref_iv}") elif cigar_op.type == "N": # Splice junction print(f"Splice: {cigar_op.ref_iv}") elif cigar_op.type == "I": # Insertion print(f"Insertion of {cigar_op.size} bp") elif cigar_op.type == "D": # Deletion print(f"Deletion: {cigar_op.ref_iv}") # Build coverage considering only matched bases cvg = HTSeq.GenomicArray("auto", stranded=True, typecode="i") for alignment in HTSeq.BAM_Reader("alignments.bam"): if alignment.aligned: for cigar_op in alignment.cigar: if cigar_op.type in ("M", "=", "X"): # Match operations cvg[cigar_op.ref_iv] += 1 ``` ## VCF File Reading HTSeq provides a VCF_Reader class for parsing Variant Call Format files, allowing integration of variant data with sequencing analysis. ```python import HTSeq # Read VCF file vcf_reader = HTSeq.VCF_Reader("variants.vcf") # Parse metadata first vcf_reader.parse_meta() print(f"Metadata: {vcf_reader.metadata}") print(f"Info fields: {vcf_reader.info}") print(f"Sample IDs: {vcf_reader.sampleids}") # Iterate through variants for variant in vcf_reader: print(f"Position: {variant.pos}") print(f"Reference: {variant.ref}") print(f"Alternate: {variant.alt}") print(f"Quality: {variant.qual}") print(f"Filter: {variant.filter}") print(f"Info: {variant.info}") # Unpack info field into typed values vcf_reader.parse_meta() vcf_reader.make_info_dict() for variant in vcf_reader: variant.unpack_info(vcf_reader.infodict) print(f"Unpacked info: {variant.info}") ``` ## BED and Wiggle File Reading HTSeq supports reading BED files for genomic intervals and Wiggle files for genomic signal tracks. ```python import HTSeq # Read BED file bed_reader = HTSeq.BED_Reader("regions.bed") for feature in bed_reader: print(f"Name: {feature.name}") print(f"Interval: {feature.iv}") print(f"Score: {feature.score}") if feature.thick: print(f"Thick region: {feature.thick}") # Read Wiggle file wig_reader = HTSeq.WiggleReader("signal.wig") for interval, value in wig_reader: print(f"{interval}: {value}") # Read BigWig file (requires pyBigWig) bw_reader = HTSeq.BigWig_Reader("signal.bw") print(f"Chromosomes: {bw_reader.chroms()}") for interval in bw_reader.intervals("chr1"): print(interval) ``` HTSeq is primarily used for RNA-Seq analysis workflows, where `htseq-count` quantifies gene expression by counting reads overlapping exonic regions. It integrates seamlessly with downstream differential expression tools like DESeq2 and edgeR. The library supports both bulk and single-cell RNA-Seq through `htseq-count-barcodes`, making it versatile for modern transcriptomics studies. Beyond read counting, HTSeq enables custom bioinformatics scripts through its Python API. Common use cases include building coverage profiles for genome browser visualization, filtering and annotating BAM files, analyzing splicing patterns through CIGAR operations, and integrating multiple data types (sequences, alignments, annotations, variants) in unified workflows. The GenomicArray and GenomicArrayOfSets data structures provide efficient storage and querying of genomic data, making HTSeq suitable for processing large-scale sequencing datasets.