HTSeq (htseq/htseq)

HTSeq

https://github.com/htseq/htseq
Admin
HTSeq is a Python framework to process and analyze data from high-throughput sequencing assays,...

Tokens:56,043
Snippets:403
Trust Score:5
Update:1 week ago
Show doc for...
Context Summary (auto-generated)
Raw
# HTSeq

HTSeq is a Python framework for processing and analyzing high-throughput sequencing (HTS) data. It provides classes and functions for reading and writing sequence files (FASTA, FASTQ), alignment files (SAM/BAM/CRAM), and annotation files (GFF/GTF). The library is designed to facilitate custom bioinformatics scripts and offers powerful data structures like GenomicArray and GenomicArrayOfSets for storing and querying genomic intervals efficiently.

A key component of HTSeq is the `htseq-count` script, which counts how many sequencing reads map to each genomic feature (typically genes) from GTF/GFF annotations. This tool is widely used in RNA-Seq differential expression analysis. HTSeq also supports single-cell RNA-Seq with cell and molecular barcodes through `htseq-count-barcodes`, making it suitable for both bulk and single-cell transcriptomics workflows.

## Reading FASTQ Files with FastqReader

The FastqReader class reads FASTQ files and yields SequenceWithQualities objects containing sequence data, quality scores, and read names. It supports multiple quality encoding formats including Phred (default), Solexa, and Solexa-old.

```python
import HTSeq
import numpy as np

# Read FASTQ file with default Phred encoding
fastq_file = HTSeq.FastqReader("reads.fastq")

# Read FASTQ file with Solexa quality encoding
fastq_file = HTSeq.FastqReader("reads.fastq", "solexa")

# Iterate through reads
for read in fastq_file:
    print(f"Read name: {read.name}")
    print(f"Sequence: {read.seq}")
    print(f"Quality scores: {read.qual}")
    print(f"Quality string: {read.qualstr}")
    break

# Calculate average quality per position across all reads
qualsum = None
nreads = 0
for read in fastq_file:
    if qualsum is None:
        qualsum = np.zeros(len(read), dtype=int)
    qualsum += read.qual
    nreads += 1

average_quality = qualsum / float(nreads)
print(f"Average quality per position: {average_quality}")
```

## Reading FASTA Files with FastaReader

The FastaReader class parses FASTA files and yields Sequence objects. It can read from regular or gzip-compressed files and supports optional raw iterator mode for improved performance.

```python
import HTSeq

# Read FASTA file
fasta_file = HTSeq.FastaReader("sequences.fa")

# Iterate through sequences
for seq in fasta_file:
    print(f"Name: {seq.name}")
    print(f"Sequence: {seq.seq}")
    print(f"Length: {len(seq)}")

# Load all sequences into a dictionary
sequences = dict((s.name, s) for s in HTSeq.FastaReader("sequences.fa"))
print(f"Sequence 'gene1': {sequences['gene1'].seq}")

# Use raw iterator for faster parsing (returns tuples)
sequences = dict((s[1], s[0]) for s in HTSeq.FastaReader("sequences.fa", raw_iterator=True))

# Get reverse complement
seq = HTSeq.Sequence(b"ACGTAAAG", "my_seq")
rc = seq.get_reverse_complement()
print(f"Original: {seq.seq}, Reverse complement: {rc.seq}")

# Write to FASTA file
with open("output.fa", "w") as outfile:
    seq.write_to_fasta_file(outfile)
```

## Reading SAM/BAM Files with BAM_Reader

The BAM_Reader class reads SAM, BAM, and CRAM alignment files using pysam under the hood. It yields SAM_Alignment objects containing alignment information, read sequences, quality scores, and CIGAR operations.

```python
import HTSeq
import itertools

# Open BAM/SAM file
bam_reader = HTSeq.BAM_Reader("alignments.bam")

# Iterate through alignments
for alignment in itertools.islice(bam_reader, 5):
    if alignment.aligned:
        print(f"Read: {alignment.read.name}")
        print(f"Aligned to: {alignment.iv}")
        print(f"Strand: {alignment.iv.strand}")
        print(f"CIGAR: {alignment.cigar}")
        print(f"Mapping quality: {alignment.aQual}")

# Count reads per chromosome
import collections
counts = collections.defaultdict(int)
for alignment in HTSeq.BAM_Reader("alignments.bam"):
    if alignment.aligned:
        counts[alignment.iv.chrom] += 1

print(f"Reads per chromosome: {dict(counts)}")

# Fetch reads from a specific region (requires index)
bam_reader = HTSeq.BAM_Reader("alignments_sorted.bam")
for alignment in bam_reader.fetch(region="chr1:1000000-2000000"):
    print(alignment)

# Write alignments to BAM file
bam_writer = HTSeq.BAM_Writer.from_BAM_Reader("output.bam", bam_reader)
for alignment in bam_reader:
    bam_writer.write(alignment)
bam_writer.close()
```

## GenomicInterval and GenomicPosition

GenomicInterval represents a genomic region with chromosome, start, end, and strand. GenomicPosition represents a single base position. These are fundamental data structures used throughout HTSeq for coordinate handling.

```python
import HTSeq

# Create a genomic interval (0-based, half-open)
iv = HTSeq.GenomicInterval("chr3", 123203, 127245, "+")
print(f"Chromosome: {iv.chrom}")
print(f"Start: {iv.start}, End: {iv.end}")
print(f"Strand: {iv.strand}")
print(f"Length: {iv.length}")

# Directional coordinates (accounting for strand)
print(f"Directional start: {iv.start_d}")
print(f"Directional end: {iv.end_d}")

# Create interval from directional coordinates
iv2 = HTSeq.GenomicInterval_from_directional("chr3", 127244, 4042, "-")

# Create a genomic position
pos = HTSeq.GenomicPosition("chr3", 125000, "+")
print(f"Position: {pos.pos}")

# Check overlap relationships
iv_a = HTSeq.GenomicInterval("chr1", 100, 300, "+")
iv_b = HTSeq.GenomicInterval("chr1", 200, 400, "+")
print(f"Overlaps: {iv_a.overlaps(iv_b)}")
print(f"Contains: {iv_a.contains(iv_b)}")
print(f"Is contained in: {iv_a.is_contained_in(iv_b)}")

# Extend interval to include another
iv_a.extend_to_include(iv_b)
print(f"Extended interval: {iv_a}")
```

## GenomicArray for Coverage Vectors

GenomicArray is a container that stores values indexed by genomic coordinates. It's ideal for coverage vectors, signal tracks, and other per-position data. The step-based storage efficiently handles regions of constant value.

```python
import HTSeq

# Create a stranded GenomicArray for integer values
cvg = HTSeq.GenomicArray("auto", stranded=True, typecode="i")

# Build coverage from alignments
for alignment in HTSeq.BAM_Reader("alignments.bam"):
    if alignment.aligned:
        cvg[alignment.iv] += 1

# Query coverage at a position
pos = HTSeq.GenomicPosition("chr1", 50000, "+")
print(f"Coverage at position: {cvg[pos]}")

# Query coverage over an interval
iv = HTSeq.GenomicInterval("chr1", 45000, 55000, "+")
coverage_values = list(cvg[iv])
print(f"Coverage values: {coverage_values[:10]}...")

# Iterate through steps (constant-value regions)
for step_iv, value in cvg[iv].steps():
    if value > 0:
        print(f"{step_iv}: {value}")

# Apply a function to modify values
cvg[iv].apply(lambda x: x * 2)

# Export to BedGraph/Wiggle format for genome browsers
cvg.write_bedgraph_file("coverage_plus.wig", "+")
cvg.write_bedgraph_file("coverage_minus.wig", "-")

# Export to BigWig format (requires pyBigWig)
cvg.write_bigwig_file("coverage_plus.bw", "+")

# Create GenomicArray with known chromosome lengths
chrom_lengths = {"chr1": 248956422, "chr2": 242193529}
ga = HTSeq.GenomicArray(chrom_lengths, stranded=False, typecode="d")

# Set values
ga[HTSeq.GenomicInterval("chr1", 100, 200, ".")] = 5.5
ga[HTSeq.GenomicInterval("chr1", 150, 250, ".")] += 3.2
```

## GenomicArrayOfSets for Overlapping Features

GenomicArrayOfSets stores sets of objects at each genomic position, making it ideal for handling overlapping genomic features like genes or exons. It's the core data structure used by htseq-count.

```python
import HTSeq

# Create a GenomicArrayOfSets
features = HTSeq.GenomicArrayOfSets("auto", stranded=False)

# Add features (use += to add to the set)
features[HTSeq.GenomicInterval("chr1", 100, 250, ".")] += "geneA"
features[HTSeq.GenomicInterval("chr1", 360, 640, ".")] += "geneA"  # Second exon
features[HTSeq.GenomicInterval("chr1", 510, 950, ".")] += "geneB"  # Overlapping gene

# Query what features overlap a read
read_iv = HTSeq.GenomicInterval("chr1", 450, 800, ".")
for step_iv, gene_set in features[read_iv].steps():
    print(f"{step_iv}: {sorted(gene_set)}")

# Find all features overlapping a read (union mode)
all_features = set()
for step_iv, gene_set in features[read_iv].steps():
    all_features |= gene_set
print(f"All overlapping features: {sorted(all_features)}")

# Find features overlapping entire read (intersection-strict mode)
common_features = None
for step_iv, gene_set in features[read_iv].steps():
    if common_features is None:
        common_features = gene_set.copy()
    else:
        common_features &= gene_set
print(f"Common features: {sorted(common_features)}")

# Count reads per gene
counts = {}
for alignment in HTSeq.BAM_Reader("alignments.bam"):
    if not alignment.aligned:
        continue
    gene_set = set()
    for step_iv, step_set in features[alignment.iv].steps():
        gene_set |= step_set
    if len(gene_set) == 1:
        gene = list(gene_set)[0]
        counts[gene] = counts.get(gene, 0) + 1
    elif len(gene_set) > 1:
        counts["__ambiguous"] = counts.get("__ambiguous", 0) + 1
```

## Reading GFF/GTF Files with GFF_Reader

The GFF_Reader class parses GFF and GTF annotation files, yielding GenomicFeature objects with genomic coordinates and attributes like gene_id, transcript_id, and gene_name.

```python
import HTSeq

# Read GTF file
gtf_file = HTSeq.GFF_Reader("annotations.gtf")

# Iterate through features
for feature in gtf_file:
    print(f"Type: {feature.type}")
    print(f"Name: {feature.name}")
    print(f"Interval: {feature.iv}")
    print(f"Attributes: {feature.attr}")
    if "gene_id" in feature.attr:
        print(f"Gene ID: {feature.attr['gene_id']}")
    break

# Filter for exons only
exons = []
for feature in HTSeq.GFF_Reader("annotations.gtf"):
    if feature.type == "exon":
        exons.append(feature)

# Build GenomicArrayOfSets from exons
features = HTSeq.GenomicArrayOfSets("auto", stranded=True)
for feature in HTSeq.GFF_Reader("annotations.gtf"):
    if feature.type == "exon":
        features[feature.iv] += feature.attr["gene_id"]

# Use make_feature_genomicarrayofsets (used by htseq-count)
gff = HTSeq.GFF_Reader("annotations.gtf")
result = HTSeq.make_feature_genomicarrayofsets(
    gff,
    id_attribute="gene_id",
    feature_type=["exon"],
    additional_attributes=["gene_name"],
    stranded=True,
    verbose=True
)
features = result["features"]
attributes = result["attributes"]

# Access additional attributes
for gene_id, attrs in list(attributes.items())[:5]:
    print(f"{gene_id}: {attrs}")

# Write features back to GFF
with open("output.gff", "w") as outfile:
    for feature in HTSeq.GFF_Reader("annotations.gtf"):
        outfile.write(feature.get_gff_line())
```

## htseq-count Command Line Tool

The htseq-count script counts reads overlapping genomic features from a GTF file. It supports multiple overlap modes, handles paired-end data, and can process multiple BAM files in parallel.

```bash
# Basic usage: count reads per gene
htseq-count alignments.bam annotations.gtf > counts.txt

# Multiple input files with parallel processing
htseq-count -n 4 sample1.bam sample2.bam sample3.bam annotations.gtf -c counts.tsv

# Specify options for RNA-Seq analysis
htseq-count \
    --stranded=reverse \
    --mode=union \
    --type=exon \
    --idattr=gene_id \
    --additional-attr=gene_name \
    --minaqual=10 \
    alignments.bam \
    annotations.gtf \
    -c counts.tsv

# For non-strand-specific data
htseq-count --stranded=no alignments.bam annotations.gtf > counts.txt

# Position-sorted paired-end data
htseq-count -r pos --max-reads-in-buffer=30000000 paired_end.bam annotations.gtf

# Output annotated BAM file
htseq-count -o annotated.sam alignments.bam annotations.gtf > counts.txt

# Handle multimappers with fractional counts
htseq-count --nonunique=fraction alignments.bam annotations.gtf

# Output to different formats (mtx, h5ad, loom)
htseq-count -c counts.h5ad --counts-output-sparse alignments.bam annotations.gtf

# Restrict to specific gene
htseq-count --feature-query 'gene_name == "ACTB"' alignments.bam annotations.gtf
```

## htseq-count-barcodes for Single-Cell RNA-Seq

The htseq-count-barcodes script is designed for single-cell RNA-Seq data with cell barcodes and UMIs (unique molecular identifiers), compatible with 10X Genomics output.

```bash
# Basic usage with default 10X Genomics tags (CB for cell barcode, UB for UMI)
htseq-count-barcodes cellranger_output.bam annotations.gtf -c counts.h5ad

# Specify custom barcode tags
htseq-count-barcodes \
    --cell-barcode=CB \
    --UMI=UB \
    cellranger_output.bam \
    annotations.gtf \
    -c counts.mtx

# Correct UMI sequencing errors (Hamming distance 1)
htseq-count-barcodes \
    --correct-UMI-distance=1 \
    cellranger_output.bam \
    annotations.gtf \
    -c counts.h5ad

# Full example with all common options
htseq-count-barcodes \
    --stranded=reverse \
    --mode=union \
    --type=exon \
    --idattr=gene_id \
    --additional-attr=gene_name \
    --add-chromosome-info \
    --minaqual=10 \
    --cell-barcode=CB \
    --UMI=UB \
    --correct-UMI-distance=1 \
    --counts_output_sparse \
    cellranger_output.bam \
    annotations.gtf \
    -c counts.h5ad
```

## Paired-End Alignment Processing

HTSeq provides functions to pair alignments from paired-end sequencing data, supporting both name-sorted and position-sorted BAM files.

```python
import HTSeq

# For name-sorted BAM files
bam_reader = HTSeq.BAM_Reader("paired_end_namesorted.bam")
for first, second in HTSeq.pair_SAM_alignments(bam_reader):
    if first is not None and second is not None:
        print(f"Read 1: {first.iv}")
        print(f"Read 2: {second.iv}")
        print(f"Insert size: {first.inferred_insert_size}")

# For position-sorted BAM files (uses memory buffer)
bam_reader = HTSeq.BAM_Reader("paired_end_possorted.bam")
for first, second in HTSeq.pair_SAM_alignments_with_buffer(
    bam_reader,
    max_buffer_size=30000000
):
    if first is not None and second is not None:
        # Process paired reads
        pass

# Access paired-end specific attributes
for alignment in HTSeq.BAM_Reader("paired_end.bam"):
    if alignment.paired_end:
        print(f"PE which: {alignment.pe_which}")  # 'first' or 'second'
        print(f"Mate aligned: {alignment.mate_aligned}")
        print(f"Proper pair: {alignment.proper_pair}")
        if alignment.mate_aligned:
            print(f"Mate start: {alignment.mate_start}")
```

## CIGAR String Parsing

HTSeq parses CIGAR strings from SAM/BAM files into CigarOperation objects, enabling analysis of spliced alignments, insertions, and deletions.

```python
import HTSeq

# Parse a CIGAR string manually
cigar_ops = HTSeq.parse_cigar("20M6I10M", 1000, "chr2", "+")
for op in cigar_ops:
    print(f"Type: {op.type}, Size: {op.size}")
    print(f"Reference interval: {op.ref_iv}")
    print(f"Query positions: {op.query_from}-{op.query_to}")

# CIGAR operation type names
print(HTSeq.cigar_operation_names)
# {'M': 'matched', 'I': 'inserted', 'D': 'deleted',
#  'N': 'skipped', 'S': 'soft-clipped', 'H': 'hard-clipped',
#  'P': 'padded', '=': 'sequence-matched', 'X': 'sequence-mismatched'}

# Process CIGAR from alignment
for alignment in HTSeq.BAM_Reader("spliced_alignments.bam"):
    if alignment.aligned:
        for cigar_op in alignment.cigar:
            if cigar_op.type == "M":  # Matched bases
                print(f"Match: {cigar_op.ref_iv}")
            elif cigar_op.type == "N":  # Splice junction
                print(f"Splice: {cigar_op.ref_iv}")
            elif cigar_op.type == "I":  # Insertion
                print(f"Insertion of {cigar_op.size} bp")
            elif cigar_op.type == "D":  # Deletion
                print(f"Deletion: {cigar_op.ref_iv}")

# Build coverage considering only matched bases
cvg = HTSeq.GenomicArray("auto", stranded=True, typecode="i")
for alignment in HTSeq.BAM_Reader("alignments.bam"):
    if alignment.aligned:
        for cigar_op in alignment.cigar:
            if cigar_op.type in ("M", "=", "X"):  # Match operations
                cvg[cigar_op.ref_iv] += 1
```

## VCF File Reading

HTSeq provides a VCF_Reader class for parsing Variant Call Format files, allowing integration of variant data with sequencing analysis.

```python
import HTSeq

# Read VCF file
vcf_reader = HTSeq.VCF_Reader("variants.vcf")

# Parse metadata first
vcf_reader.parse_meta()
print(f"Metadata: {vcf_reader.metadata}")
print(f"Info fields: {vcf_reader.info}")
print(f"Sample IDs: {vcf_reader.sampleids}")

# Iterate through variants
for variant in vcf_reader:
    print(f"Position: {variant.pos}")
    print(f"Reference: {variant.ref}")
    print(f"Alternate: {variant.alt}")
    print(f"Quality: {variant.qual}")
    print(f"Filter: {variant.filter}")
    print(f"Info: {variant.info}")

# Unpack info field into typed values
vcf_reader.parse_meta()
vcf_reader.make_info_dict()
for variant in vcf_reader:
    variant.unpack_info(vcf_reader.infodict)
    print(f"Unpacked info: {variant.info}")
```

## BED and Wiggle File Reading

HTSeq supports reading BED files for genomic intervals and Wiggle files for genomic signal tracks.

```python
import HTSeq

# Read BED file
bed_reader = HTSeq.BED_Reader("regions.bed")
for feature in bed_reader:
    print(f"Name: {feature.name}")
    print(f"Interval: {feature.iv}")
    print(f"Score: {feature.score}")
    if feature.thick:
        print(f"Thick region: {feature.thick}")

# Read Wiggle file
wig_reader = HTSeq.WiggleReader("signal.wig")
for interval, value in wig_reader:
    print(f"{interval}: {value}")

# Read BigWig file (requires pyBigWig)
bw_reader = HTSeq.BigWig_Reader("signal.bw")
print(f"Chromosomes: {bw_reader.chroms()}")
for interval in bw_reader.intervals("chr1"):
    print(interval)
```

HTSeq is primarily used for RNA-Seq analysis workflows, where `htseq-count` quantifies gene expression by counting reads overlapping exonic regions. It integrates seamlessly with downstream differential expression tools like DESeq2 and edgeR. The library supports both bulk and single-cell RNA-Seq through `htseq-count-barcodes`, making it versatile for modern transcriptomics studies.

Beyond read counting, HTSeq enables custom bioinformatics scripts through its Python API. Common use cases include building coverage profiles for genome browser visualization, filtering and annotating BAM files, analyzing splicing patterns through CIGAR operations, and integrating multiple data types (sequences, alignments, annotations, variants) in unified workflows. The GenomicArray and GenomicArrayOfSets data structures provide efficient storage and querying of genomic data, making HTSeq suitable for processing large-scale sequencing datasets.