### Run Example Tests with Pytest Source: https://bionumpy.github.io/bionumpy/developer_guide/making_examples.html Use this command to run your example tests locally during development. Ensure your example file ends with '_example.py' and contains functions starting with 'test_'. ```bash pytest scripts/your_example.py ``` -------------------------------- ### Example Script for Doctesting Source: https://bionumpy.github.io/bionumpy/tutorials/example.html A complete example script intended for use with pytest and doctesting. It includes a docstring and must be self-contained without auto-imports or test setup. ```python """ Example script used for documenting doctesting and other stuff """ a = 5 print(a) ``` -------------------------------- ### Run tests in a single example file Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/testing.rst.txt Execute tests for a specific example file. This is useful for isolating and testing changes in individual example scripts. ```bash pytest example/our_example.py ``` -------------------------------- ### Download example data Source: https://bionumpy.github.io/bionumpy/_sources/introduction.rst.txt Download a sample FASTQ file from the BioNumPy GitHub repository for use in examples. ```bash wget https://github.com/bionumpy/bionumpy/raw/main/example_data/big.fq.gz ``` -------------------------------- ### Install BioNumPy Source: https://bionumpy.github.io/bionumpy/_sources/index.rst.txt Use pip to install the BioNumPy library. ```bash pip install bionumpy ``` -------------------------------- ### Run Example Tests with Pytest Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/making_examples.rst.txt Execute your BioNumPy example tests using the pytest command-line tool. Ensure your example file is named with the `_example.py` suffix and contains functions prefixed with `test_`. 
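
As a rough illustration of the naming rules above, an example file could look like the sketch below. The file name `scripts/gc_content_example.py` and the `gc_content` helper are hypothetical, and the code deliberately uses plain Python instead of BioNumPy calls so that it runs anywhere. The command for running such a file follows.

```python
# Hypothetical file: scripts/gc_content_example.py
def gc_content(sequence: str) -> float:
    """Fraction of G/C characters in a DNA string (plain Python, no BioNumPy)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


def test_gc_content():
    # The function name starts with 'test_', as the guide above requires.
    assert gc_content("ACGT") == 0.5
    assert gc_content("") == 0.0
```
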
```bash
pytest scripts/your_example.py
```

--------------------------------

### Run and Print Random Integer

Source: https://bionumpy.github.io/bionumpy/_sources/tutorials/example.rst.txt

This code snippet generates a random integer and prints it. It is part of a test setup for documentation examples.

```python
import numpy as np

a = np.random.randint(3, 6)
print(a)
```

--------------------------------

### Install Development Dependencies

Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/setting_up_development_environment.rst.txt

Install the additional development dependencies required for testing and other development tasks from the provided requirements file.

```bash
pip install -r requirements_dev.txt
```

--------------------------------

### Install BioNumPy Locally

Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/setting_up_development_environment.rst.txt

Clone the BioNumPy repository and install it locally in editable mode using pip, so that changes made to the BioNumPy code take effect immediately. Using a virtual environment is recommended.

```bash
git clone git@github.com:bionumpy/bionumpy.git
cd bionumpy
pip install -e .
```

--------------------------------

### Example Script for Pytest

Source: https://bionumpy.github.io/bionumpy/_sources/tutorials/example.rst.txt

This script is intended to be included and tested by pytest. It must contain complete code, without auto-imports or test setup.

```python
import bionumpy as bnp


def main():
    # Example usage of bionumpy: open the file and read all entries
    entries = bnp.open(
        "/Users/runner/work/bionumpy/bionumpy/../scripts/example.fastq.gz"
    ).read()
    print(entries.sequence)


if __name__ == "__main__":
    main()
```

--------------------------------

### Run BioNumPy Tests

Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/setting_up_development_environment.rst.txt

Execute the BioNumPy test suite to verify the correct setup of your development environment.
This includes unit tests, property testing, example testing, and doctesting. ```bash ./run_tests ``` -------------------------------- ### Install Development Version of NpStructures Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/setting_up_development_environment.rst.txt Clone the npstructures repository and install its development branch using pip in editable mode. This is useful when BioNumPy's new features depend on unpublished changes in npstructures. ```bash git clone git@github.com:knutdrand/npstructures.git cd npstructures git checkout dev pip install -e . ``` -------------------------------- ### Get kmers for sequences Source: https://bionumpy.github.io/bionumpy/modules/sequences.html Generates kmers from sequences encoded with an AlphabetEncoding. Use bnp.change_encoding if your sequences lack a suitable encoding. This example shows kmer extraction from a small set of sequences. ```python import bionumpy as bnp sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding) bnp.sequence.get_kmers(sequences, 3) ``` -------------------------------- ### Create and Print Intervals Source: https://bionumpy.github.io/bionumpy/modules/genome_arithmetics.html Demonstrates how to create an Interval object and print its contents. Intervals are defined by chromosome, start, and stop positions. ```python >>> intervals = Interval(["chr1", "chr1", "chr1"], [3, 5, 10], [8, 7, 12]) >>> print(intervals) Interval with 3 entries chromosome start stop chr1 3 8 chr1 5 7 chr1 10 12 ``` -------------------------------- ### Get Interval Sequences (General Path) Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/indexed_fasta.html Retrieves sequences for genomic intervals by iterating through each interval and calculating offsets within the indexed FASTA file. This is used when the fast path is not applicable. 
It handles calculating start and stop positions, reading raw bytes, and deleting unwanted characters based on line length and modifications. ```python lengths = [] cur_offset = 0 pre_alloc = np.empty((intervals.stop-intervals.start).sum(), dtype=np.uint8) alloc_offset = 0 for interval in intervals: chromosome = interval.chromosome.to_string() idx = self._index[chromosome] lenb, rlen, lenc = (idx["lenb"], idx["rlen"], idx["lenc"]) start_row = interval.start//lenc start_mod = interval.start % lenc start_offset = start_row*lenb+start_mod stop_row = interval.stop // lenc stop_offset = stop_row*lenb+interval.stop % lenc self._f_obj.seek(idx["offset"] + start_offset) lengths.append(stop_offset-start_offset-(stop_row-start_row)) D = stop_offset-start_offset tmp = np.frombuffer(self._f_obj.read(stop_offset-start_offset), dtype=np.uint8) tmp = np.delete(tmp, [lenb*(j+1)-1-start_mod for j in range(stop_row-start_row)]) pre_alloc[alloc_offset:alloc_offset+tmp.size] = tmp alloc_offset += tmp.size cur_offset += stop_offset-start_offset assert alloc_offset == pre_alloc.size, (alloc_offset, pre_alloc.size) assert np.all(pre_alloc> 0), np.sum(pre_alloc==0) a = EncodedArray(pre_alloc, BaseEncoding) return EncodedRaggedArray(a, lengths) ``` -------------------------------- ### Access Start Positions of Streamed Intervals Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genomic_intervals.html Provides access to the start positions of streamed genomic intervals. ```python @property def start(self): return self._start ``` -------------------------------- ### Include External Code Example Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/writing_documentation.rst.txt This directive includes code from an external Python file directly into the documentation. Ensure the path is correct relative to the documentation source. ```rst .. 
literalinclude:: /../scripts/your_example.py
```

--------------------------------

### Generate and Print Random Integer

Source: https://bionumpy.github.io/bionumpy/tutorials/example.html

Generates a random integer between 3 and 5 (inclusive) and prints it. This is a basic example for demonstrating code execution.

```python
import numpy as np

a = np.random.randint(3, 6)
print(a)
```

--------------------------------

### Get Minimizers from DNA Sequences

Source: https://bionumpy.github.io/bionumpy/_sources/topics/kmers.rst.txt

Shows how to extract minimizers from DNA sequences. The kmer size and window size can be specified.

```python
bnp.sequence.get_minimizers(sequences, k=2, window_size=4)
```

--------------------------------

### Custom Encoding Output Example

Source: https://bionumpy.github.io/bionumpy/developer_guide/encodings.html

Demonstrates the output of encoding and decoding a sequence using a custom OneToOneEncoding. Shows the representation of the encoded object and its raw numpy array.

```text
ACT
encoded_array('ACT', MyCustomEncoding())
array([66, 68, 85], dtype=uint8)
ACT
encoded_array('ACT')
array([65, 67, 84], dtype=uint8)
```

--------------------------------

### Subsample Fasta/Fastq Reads

Source: https://bionumpy.github.io/bionumpy/_sources/tutorials/benchmarking_examples.rst.txt

Subsamples exactly half of the sequences from a fasta or fastq file. This example addresses the complexity of achieving exact subsampling when processing large files in chunks.
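
The chunking difficulty can be sketched in plain Python before looking at the BioNumPy snippet that follows. This is only an illustration under stated assumptions: lists stand in for BioNumPy chunks, `subsample_chunks` is a hypothetical helper, and the total number of entries is assumed to be known up front. Unlike the snippet that follows, which keeps the first half of the entries, this sketch draws a random subset of exactly the requested size.

```python
import random


def subsample_chunks(chunks, total, n_keep, seed=0):
    """Yield, chunk by chunk, exactly n_keep randomly chosen items overall."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(total), n_keep))  # global indices to keep
    pos = 0      # next position in `keep`
    offset = 0   # global index of the first item in the current chunk
    for chunk in chunks:
        end = offset + len(chunk)
        local = []
        # Translate global indices that fall in this chunk to local ones.
        while pos < len(keep) and keep[pos] < end:
            local.append(keep[pos] - offset)
            pos += 1
        yield [chunk[i] for i in local]
        offset = end


chunks = [["r0", "r1", "r2"], ["r3", "r4"], ["r5", "r6", "r7"]]
total = sum(len(c) for c in chunks)
picked = [r for part in subsample_chunks(chunks, total, total // 2) for r in part]
print(len(picked))  # 4
```

Choosing the indices globally and then splitting them per chunk is what guarantees the exact count; sampling half of each chunk independently would round differently per chunk.
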
```python
import bionumpy as bnp

input_file = "example.fasta"
output_file = "output.fasta"

total_sequences = bnp.open(input_file).size
sequences_to_subsample = total_sequences // 2

with bnp.open(input_file) as f_in, bnp.open(output_file, "w") as f_out:
    subsampled_count = 0
    for chunk in f_in.read_chunks():
        num_to_take = min(sequences_to_subsample - subsampled_count, len(chunk))
        if num_to_take > 0:
            f_out.write(chunk[:num_to_take])
            subsampled_count += num_to_take
        if subsampled_count >= sequences_to_subsample:
            break
```

--------------------------------

### Build Documentation Locally

Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/writing_documentation.rst.txt

Run this command to build and test the documentation locally. It creates HTML files and opens them in your browser.

```bash
make docs
```

--------------------------------

### Extend and Shift Intervals

Source: https://bionumpy.github.io/bionumpy/_sources/source/intervals.rst.txt

Demonstrates extending the stop position of intervals and shifting both start and stop positions. Filters intervals based on their length.

```python
>>> import bionumpy as bnp
>>> intervals = bnp.open("example_data/small_interval.bed").read()
>>> extended_right = bnp.replace(intervals, stop=intervals.stop+10)
>>> shifted = bnp.replace(intervals, start=intervals.start+5, stop=intervals.stop+5)
>>> small = intervals[(intervals.stop-intervals.start)<50]
```

--------------------------------

### Get Genomic Location (Start, Stop, Center)

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genomic_intervals.html

Retrieves the genomic location for the 'start', 'stop', or 'center' of the intervals. Handles stranded intervals by adjusting the stop position for '-' strands.
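
The strand handling can be illustrated with a small plain-Python sketch that mirrors the `np.where` expression in the library source that follows. The function name `stranded_location` is hypothetical, and the sketch assumes half-open intervals (which the `stop - 1` in the source suggests), so the last covered base is `stop - 1`.

```python
def stranded_location(starts, stops, strands, where="start"):
    """Pick one coordinate per interval, swapping ends on the opposite strand."""
    if where == "center":
        return [(a + b) // 2 for a, b in zip(starts, stops)]
    # For 'start' the '+' strand uses the start coordinate;
    # for 'stop' the '-' strand does. The other strand uses stop - 1.
    uses_start = "+" if where == "start" else "-"
    return [a if s == uses_start else b - 1
            for a, b, s in zip(starts, stops, strands)]


starts, stops, strands = [10, 20], [15, 30], ["+", "-"]
print(stranded_location(starts, stops, strands, "start"))   # [10, 29]
print(stranded_location(starts, stops, strands, "stop"))    # [14, 20]
print(stranded_location(starts, stops, strands, "center"))  # [12, 25]
```
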
```python if where in ('start', 'stop'): if not self.is_stranded(): data = self._intervals else: location = np.where(self.strand == ('+' if where == 'start' else '-'), self.start, self.stop - 1) data = replace(self._intervals, start=location) else: assert where == 'center' location = (self.start + self.stop) // 2 data = replace(self._intervals, start=location) return GenomicLocationGlobal.from_data( data, self._genome_context, is_stranded=self.is_stranded(), position_name='start') ``` -------------------------------- ### Build Documentation Locally Source: https://bionumpy.github.io/bionumpy/developer_guide/writing_documentation.html Run this command to build and test the documentation locally. It creates HTML files in the docs_source/_build directory. ```bash make docs ``` -------------------------------- ### Slice EncodedRaggedArray (last N) Source: https://bionumpy.github.io/bionumpy/source/sequences.html Slice an EncodedRaggedArray to get the last N sequences. This example retrieves the last four sequences. ```python >>> my_seqs[-4:] # last 4 sequences encoded_ragged_array(['TGIVPMRM*S', 'CENVC', 'RSTWF', 'NTIFMC'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY*')) ``` -------------------------------- ### Slice EncodedRaggedArray (first N) Source: https://bionumpy.github.io/bionumpy/source/sequences.html Slice an EncodedRaggedArray to get the first N sequences. This example retrieves the first two sequences. ```python >>> my_seqs[0:2] # first 2 sequences encoded_ragged_array(['LMSYAEVYGH', 'WKGVGKQNCAWSVNVH'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY*')) ``` -------------------------------- ### Slice EncodedRaggedArray (first N sequences) Source: https://bionumpy.github.io/bionumpy/_sources/source/sequences.rst.txt Slice an EncodedRaggedArray to retrieve a subset of sequences. This example gets the first two sequences. 
```python print(my_seqs[0:2]) ``` -------------------------------- ### Clean Documentation Build Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/writing_documentation.rst.txt Use this command to clean the existing documentation build files. This is useful before rebuilding. ```bash make clean ``` -------------------------------- ### Initialize MultiStream with Data Sources Source: https://bionumpy.github.io/bionumpy/_sources/source/multiple_data_sources.rst.txt Demonstrates how to initialize a MultiStream object by providing sequence lengths, an indexed reference genome, variant data, and interval data. This synchronizes the streams for aligned processing. ```python import bionumpy as bnp variants = bnp.open("example_data/few_variants.vcf").read_chunks() intervals = bnp.open("example_data/small_interval.bed").read_chunks() reference = bnp.open_indexed("example_data/small_genome.fa") multistream = bnp.MultiStream(reference.get_contig_lengths(), sequence=reference, variants=variants, intervals=intervals) ``` -------------------------------- ### Analyze Pileup Data within Peaks Source: https://bionumpy.github.io/bionumpy/_sources/topics/genomic_data.rst.txt Extract and analyze pileup data within the regions defined by genomic intervals. This example shows how to get the maximum and mean pileup values for each peak. ```python peak_pileups = pileup[intervals] print(peak_pileups.max(axis=-1)) print(peak_pileups.mean(axis=-1)) ``` -------------------------------- ### Initialize IndexedFasta Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/indexed_fasta.html Initializes an `IndexedFasta` object. It reads the FASTA index file and opens the FASTA file for binary reading. Requires a FASTA file and its corresponding .fai index file. ```python class IndexedFasta: """ Class representing an indexed fasta file. 
Behaves like dict of chrom names to sequences """ def __init__(self, filename: Union[str, Path]): if isinstance(filename, str): filename = Path(filename) self._filename = filename self._index = read_index(filename.with_suffix(filename.suffix + ".fai")) self._f_obj = open(filename, "rb") self._index_table = FastaIdx.from_entry_tuples( [ (name, var['rlen'], var['offset'], var['lenc'], var['lenb']) for name, var in self._index.items() ] # if '_' not in name]) ``` -------------------------------- ### Instantiate and print a bnpdataclass Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/bnpdataclass/bnpdataclass.html Instantiate the decorated class with data for each field. The printed output displays the data in a structured, table-like format. ```python data = Person(["Knut", "Ivar", "Geir"], [35, 30, 40]) print(data) ``` -------------------------------- ### Get minimizers for sequences Source: https://bionumpy.github.io/bionumpy/modules/sequences.html Computes minimizers for sequences encoded with an AlphabetEncoding. Specify the kmer size and window size for minimizer extraction. This example uses DNA sequences and extracts 2-mers within a window of 4. ```python import bionumpy as bnp sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding) bnp.sequence.get_minimizers(sequences, 2, 4) ``` -------------------------------- ### Test Documentation Code Source: https://bionumpy.github.io/bionumpy/_sources/developer_guide/writing_documentation.rst.txt Run this command within the docs_source directory to automatically test all code examples embedded in the documentation using doctest. It verifies that the code output matches the expected output. ```bash make doctest ``` -------------------------------- ### Get kmers from FASTQ sequences Source: https://bionumpy.github.io/bionumpy/modules/sequences.html Extracts kmers of a specified size from sequences read from a FASTQ file. 
Sequences are converted to DNAEncoding before kmer extraction. This example retrieves the first three kmers of the first sequence. ```python import bionumpy as bnp sequences = bnp.open("example_data/big.fq.gz").read().sequence sequences = bnp.change_encoding(sequences, bnp.DNAEncoding) bnp.sequence.get_kmers(sequences, 31)[0, 0:3] # first three kmers of first sequence ``` -------------------------------- ### Convert FASTQ to FASTA using Bash Source: https://bionumpy.github.io/bionumpy/_sources/manuscript/index.rst.txt This bash command demonstrates a common bioinformatics task of converting FASTQ to FASTA format using a pipeline of standard Unix utilities. ```bash zcat file.fastq.gz | paste - - - - | perl -ane 'print ">"$F[0]\n$F[2]\n";' | gzip -c > file.fasta.gz ``` -------------------------------- ### DelimitedBufferWithInernalComments: Calculate Column Starts and Ends Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/delimited_buffers.html Calculates column start and end positions, specifically handling lines that start with a comment character and are followed by a newline. ```python @classmethod def _calculate_col_starts_and_ends(cls, data, delimiters): comment_mask = (data[delimiters[:-1]] == '\n') & (data[delimiters[:-1] + 1] == cls.COMMENT) comment_mask = np.flatnonzero(comment_mask) start_delimiters = np.delete(delimiters, comment_mask)[:-1] end_delimiters = np.delete(delimiters, comment_mask + 1) if data[0] != cls.COMMENT: start_delimiters = np.insert(start_delimiters, 0, -1) else: end_delimiters = end_delimiters[1:] return start_delimiters + 1, end_delimiters ``` -------------------------------- ### Create an Interval Dictionary from a Small BED File Source: https://bionumpy.github.io/bionumpy/source/multiple_data_sources.html Load a small BED file into memory and group its intervals by chromosome to create a dictionary. This dictionary can then be used with MultiStream, regardless of the original file's sort order. 
```python >>> intervals = bnp.open("example_data/small_interval.bed").read() >>> interval_dict = dict(bnp.groupby(intervals, "chromosome")) >>> interval_dict {'0': Interval with 5 entries chromosome start stop 0 13 18 0 37 46 0 62 83 0 105 126 0 129 130, '1': Interval with 10 entries chromosome start stop 1 3 21 1 41 65 1 91 114 1 131 153 1 157 168 1 174 201 1 213 230 1 240 268 1 290 315 1 319 339, '2': Interval with 15 entries chromosome start stop 2 2 16 2 44 49 2 77 101 2 108 127 2 135 154 2 163 165 2 173 177 2 201 214 2 242 268 2 292 320, '3': Interval with 20 entries chromosome start stop 3 7 34 3 58 82 3 95 101 3 130 138 3 150 170 3 188 211 3 234 261 3 283 302 3 325 352 3 353 362} ``` -------------------------------- ### Open and Read a Gzipped FASTQ File Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/files.html Opens a gzipped FASTQ file and reads all its content into a `SequenceEntryWithQuality` object. This is useful for processing the entire file at once. ```python import bionumpy as bnp all_data = bnp.open("example_data/big.fq.gz").read() print(all_data) ``` -------------------------------- ### Create Genome from File Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genome.html Read genome information from a 'chrom.sizes' or 'fa.fai' file. If a FASTA file is provided, an index will be created if it doesn't exist, enabling sequence reading. ```python >>> import bionumpy as bnp >>> bnp.Genome.from_file('example_data/hg38.chrom.sizes') Genome(['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', '...']) ``` -------------------------------- ### Get EncodedRaggedArray shape Source: https://bionumpy.github.io/bionumpy/_sources/source/sequences.rst.txt Access the `.shape` property of an EncodedRaggedArray to get the number of sequences and the lengths of each sequence. 
```python
print(my_seqs.shape)
```

--------------------------------

### Import BioNumPy and read FASTQ data

Source: https://bionumpy.github.io/bionumpy/_sources/introduction.rst.txt

Import NumPy and BioNumPy, then open and read a FASTQ file into memory. The data is loaded as a SequenceEntryWithQuality object.

```python
import numpy as np
import bionumpy as bnp

# open the file
f = bnp.open("example_data/big.fq.gz")
data = f.read()  # reads the whole file into memory
print(data)
```

--------------------------------

### Clean Documentation Build Artifacts

Source: https://bionumpy.github.io/bionumpy/developer_guide/writing_documentation.html

Use this command to clean up previous build artifacts in the documentation directory.

```bash
make clean
```

--------------------------------

### GenomicIntervalsStreamed.start

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genomic_intervals.html

Property to access the start of the intervals.

```APIDOC
## start

### Description
Property to access the start of the intervals.
```

--------------------------------

### get_location

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genomic_intervals.html

Retrieves the genomic location (start, stop, or center) of the intervals.

```APIDOC
## get_location

### Description
Get the genomic location of either 'start', 'stop' or 'center' of the intervals.

### Parameters
- **where** (str): 'start', 'stop' or 'center'. Defaults to 'start'.

### Returns
- **GenomicLocation**: The genomic location.
```

--------------------------------

### Read Sequences from FASTQ File

Source: https://bionumpy.github.io/bionumpy/_sources/source/sequences.rst.txt

Use `bnp.open` to read sequence entries from a FASTQ file. The `read()` method returns all entries with their associated quality scores.
```python entries = bnp.open("example_data/reads.fq").read() ``` -------------------------------- ### EncodedCounts Initialization and Basic Operations Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/sequence/count_encoded.html Demonstrates the initialization of EncodedCounts and basic operations like string representation, equality checks, and element access by label. ```python from typing import List, Dict, Optional import numpy as np from numpy.typing import ArrayLike from numbers import Number from ..io.matrix_dump import Matrix from ..util.typing import EncodedArrayLike from ..encoded_array import EncodedArray class EncodedCounts: """ Class for storing counts of encoded data. """ alphabet: list counts: np.ndarray row_names: list = None def __init__(self, alphabet, counts, row_names=None): self.counts = counts self.alphabet = alphabet self.row_names = row_names def __str__(self): return "\n".join(f"{c}: {n}" for c, n in zip(self.alphabet, self.counts.T)) def __repr__(self): return f'''EncodedCounts(alphabet={repr(self.alphabet)}, counts={repr(self.counts)}, row_names={repr(self.row_names)})''' def __eq__(self, other): if self.alphabet != other.alphabet: return False if not np.all(self.counts == other.counts): return False return True def __getitem__(self, idx: str): return self.counts[..., self.alphabet.index(idx)] def __add__(self, other): if isinstance(other, Number): o_counts = other else: assert self.alphabet == other.alphabet o_counts = other.counts return self.__class__(self.alphabet, self.counts + o_counts) def __radd__(self, other): if isinstance(other, Number): o_counts = other else: assert self.alphabet == other.alphabet o_counts = other.counts return self.__class__(self.alphabet, self.counts + o_counts) # return dataclasses.replace(self, counts=self.counts+o_counts) def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): if method == "__call__": assert all(i.alphabet == self.alphabet for i in inputs if isinstance(i, EncodedCounts)) 
assert all(i.alphabet == self.alphabet for i in kwargs.values() if isinstance(i, EncodedCounts)) arrays = [i.counts if isinstance(i, EncodedCounts) else i for i in inputs] kwargs = {k: i.counts if isinstance(i, EncodedCounts) else i for k, i in kwargs.items()} return self.__class__(self.alphabet, getattr(ufunc, method)(*arrays, **kwargs)) else: return NotImplemented ``` -------------------------------- ### from_fields Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genomic_intervals.html Creates GenomicIntervals from separate arrays for chromosome, start, stop, and optionally strand. ```APIDOC ## from_fields ### Description Create genomic intervals from fields. ### Parameters - **genome_context** (GenomeContextBase) - The genome context. - **chromosome** (StringArray) - An array of chromosome names. - **start** (np.ndarray) - An array of start positions. - **stop** (np.ndarray) - An array of stop positions. - **strand** (Optional[EncodedArray]) - An optional array of strand information. ### Returns - **GenomicIntervals** - A GenomicIntervals object created from the provided fields. ``` -------------------------------- ### Compute Position Weight Matrix from File Source: https://bionumpy.github.io/bionumpy/_sources/tutorials/position_weight_matrix.rst.txt Reads a motif-PWM from a file and creates a PositionWeightMatrix object. Ensure the correct alphabet and counts are provided. ```python from bionumpy.io.motifs import PositionWeightMatrix # Read a motif-pwm from file # The alphabet and counts are inferred from the file pwm = PositionWeightMatrix.from_file("example.pwm") # Print the PWM print(pwm) ``` -------------------------------- ### Query EncodedArray Source: https://bionumpy.github.io/bionumpy/_sources/source/sequences.rst.txt Perform NumPy-fast queries on EncodedArray objects. This example checks for equality with a character. 
```python
print(encoded_array == "g")
```

--------------------------------

### Read Biological Files with bnp.open

Source: https://bionumpy.github.io/bionumpy/_sources/using_bionumpy_in_your_existing_project.rst.txt

Use `bnp.open` to read biological files such as VCF. It automatically detects the file format. Iterate over chunks for efficiency, and then over individual entries within each chunk.

```python
import numpy as np
import bionumpy as bnp

# open your file, bnp.open automatically detects the file format
f = bnp.open("example_data/variants.vcf")

# a chunk is an efficient representation of a chunk of many lines
for chunk in f.read_chunks():
    # we can iterate over the entries
    for single_entry in chunk.to_iter():
        print(single_entry)
        # and we can access things like chromosome, position and so on
        position = single_entry.position
```

--------------------------------

### intersect

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/arithmetics/intervals.html

Computes the intersection of two sets of intervals. Assumes intervals are sorted by start position.

```APIDOC
## intersect

### Description
Computes the intersection of two sets of intervals. Assumes intervals are sorted by start position.

### Parameters
* **intervals_a** (Interval) - The first set of intervals.
* **intervals_b** (Interval) - The second set of intervals.

### Returns
* **Interval** - The intervals representing the intersection.
```

--------------------------------

### Plotting Read Pileup Around Transcription Start Sites (TSS)

Source: https://bionumpy.github.io/bionumpy/_sources/tutorials/genomic_data.rst.txt

Reads a wig file as a stream and plots the mean read pileup around transcription start sites. The wig file must be alphabetically sorted by chromosome; pass `sort_names=True` when creating the `Genome` object so that its chromosome order matches. Computations are lazily evaluated and must be triggered with `bnp.compute`.
```python
import numpy as np
import bionumpy as bnp
import plotly.express as px


def tss_plot(wig_filename: str, chrom_sizes_filename: str, annotation_filename: str):
    # Read genome and transcripts
    genome = bnp.Genome.from_file(chrom_sizes_filename, sort_names=True)  # The wig file is alphabetically sorted
    annotation = genome.read_annotation(annotation_filename)
    transcripts = annotation.transcripts

    # Get transcript start locations and make windows around them
    tss = transcripts.get_location('start').sorted()  # Make sure the transcripts are sorted alphabetically
    windows = tss.get_windows(flank=500)

    # Get mean read pileup within these windows and plot
    track = genome.read_track(wig_filename, stream=True)
    signals = track[windows]
    mean_signal = signals.mean(axis=0)
    signal = bnp.compute(mean_signal)  # Compute the actual value
    px.line(x=np.arange(-500, 500), y=signal.to_array(),
            title="Read pileup relative to TSS start",
            labels={"x": "Position relative to TSS start", "y": "Mean read pileup"}).show()
```

--------------------------------

### Create FASTA Index

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/indexed_fasta.html

Creates a FASTA index for a given FASTA file. This function reads the file using `bnp_open` with `FastaIdxBuffer` and returns the index as a `FastaIdx` object.
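
To make the index fields concrete, the sketch below builds the same kind of index in plain Python, ahead of the library function that follows. It is not BioNumPy's implementation: `build_fai` and `byte_position` are hypothetical helpers. The field names `rlen`, `offset`, `lenc` and `lenb` are borrowed from the `IndexedFasta` snippet in this document and are assumed to mean record length, byte offset of the first base, bases per line and bytes per line.

```python
import io


def build_fai(handle):
    """Return one (name, rlen, offset, lenc, lenb) tuple per FASTA record."""
    entries = []
    name, rlen, offset, lenc, lenb = None, 0, 0, 0, 0
    pos = 0  # byte position of the start of the current line
    for line in handle:
        if line.startswith(b">"):
            if name is not None:
                entries.append((name, rlen, offset, lenc, lenb))
            name = line[1:].split()[0].decode()
            rlen, lenc, lenb = 0, 0, 0
            offset = pos + len(line)  # first sequence byte of this record
        else:
            bases = len(line.rstrip(b"\r\n"))
            if lenc == 0:
                lenc, lenb = bases, len(line)  # taken from the first sequence line
            rlen += bases
        pos += len(line)
    if name is not None:
        entries.append((name, rlen, offset, lenc, lenb))
    return entries


def byte_position(entry, i):
    """Byte offset in the file of base i (0-based) of a record."""
    _, _, offset, lenc, lenb = entry
    return offset + (i // lenc) * lenb + i % lenc


data = b">chr1 test\nACGTAC\nGTACGT\nAC\n>chr2\nGGGG\n"
index = build_fai(io.BytesIO(data))
print(index)  # [('chr1', 14, 11, 6, 7), ('chr2', 4, 34, 4, 5)]
p = byte_position(index[0], 7)
print(data[p:p + 1])  # b'T'
```

The `byte_position` arithmetic is the same row and remainder calculation that the general-path interval reader in this document performs with `lenc` and `lenb`.
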
```python def create_index(filename: str) -> FastaIdx: """Create a fasta index for a fasta file Parameters ---------- filename : str Filename of the fasta file Returns ------- FastaIdx Fasta index as bnpdataclass """ reader = bnp_open(filename, buffer_type=FastaIdxBuffer) indice_builders = list(reader.read_chunks()) offsets = np.cumsum([0] + [idx.byte_size[0] for idx in indice_builders]) return np.concatenate([ FastaIdx( idx.chromosome, idx.length, idx.start + offset, idx.characters_per_line, idx.line_length, ) for idx, offset in zip(indice_builders, offsets) ]) ``` -------------------------------- ### Slice EncodedArray Source: https://bionumpy.github.io/bionumpy/_sources/source/sequences.rst.txt Use NumPy-like indexing to slice EncodedArray objects, for example, to trim sequence ends. ```python print(encoded_array[2:-2]) ``` -------------------------------- ### Load Genomic Sequence Source: https://bionumpy.github.io/bionumpy/_sources/topics/genomic_data.rst.txt Loads a reference genome sequence from a FASTA file. Ensure the 'example_data/small_sequence.fa' file is accessible. ```python genome_sequence = genome.read_sequence('example_data/small_sequence.fa') print(genome_sequence) ``` -------------------------------- ### GenomicIntervals._from_fields Source: https://bionumpy.github.io/bionumpy/modules/genomic_data.html Creates genomic intervals from provided fields including chromosome, start, stop, and optionally strand. ```APIDOC ## GenomicIntervals._from_fields ### Description Create genomic intervals from fields. ### Parameters - **genome_context** (GenomeContextBase) - The genome context. - **chromosome** (StringArray) - Array of chromosome names. - **start** (np.ndarray) - Array of start positions. - **stop** (np.ndarray) - Array of stop positions. - **strand** (EncodedArray | None, optional) - Array of strand information. 
### Returns GenomicIntervals ``` -------------------------------- ### sort_intervals() Source: https://bionumpy.github.io/bionumpy/modules/genome_arithmetics.html Sorts intervals based on chromosome, start, and stop positions. Allows for a custom sort order. ```APIDOC ## sort_intervals() ### Description Sort intervals on “chromosome”, “start”, “stop”. ### Parameters - **intervals** (Interval) - Unsorted intervals. - **chromosome_key_function** (callable) - A function to determine the chromosome key (defaults to a lambda function). - **sort_order** (List[str]) - A list specifying the desired order of chromosomes. ### Returns - **Interval** - Sorted intervals. ``` -------------------------------- ### bnp_open Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/files.html Opens a file for reading or writing, automatically detecting the appropriate buffer type based on the file extension. Supports lazy reading and chunked processing. ```APIDOC ## bnp_open ### Description Open a `NpDataclassReader` file object, that can be used to read the file, either in chunks or completely. Files read in chunks can be used together with the `@bnp.streamable` decorator to call a function on all chunks in the file and optionally reduce the results. If `mode="w"` it opens a writer object. ### Parameters * **filename** (str) - Name of the file to open * **mode** (str) - Either "w" or "r" * **buffer_type** (FileBuffer) - A `FileBuffer` class to specify how the data in the file should be interpreted * **lazy** (bool) - If True, the data will be read lazily, i. e. only when it is accessed. 
  This is useful to speed up reading of large files, but it is more memory demanding.

### Returns
* **NpDataclassReader** - A file reader object

### Examples
import bionumpy as bnp

# Read all data from a gzipped FASTQ file
all_data = bnp.open("example_data/big.fq.gz").read()
print(all_data)

# Read the first chunk of a gzipped FASTQ file
first_chunk = bnp.open("example_data/big.fq.gz").read_chunk(300000)
print(first_chunk)
```

--------------------------------

### Define RawInterval Dataclass

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/arithmetics/intervals.html

Defines a simple dataclass for representing raw intervals with start and stop attributes.

```python
from bionumpy.bnpdataclass import bnpdataclass


@bnpdataclass
class RawInterval:
    start: int
    stop: int
```

--------------------------------

### Get EncodedRaggedArray encoding

Source: https://bionumpy.github.io/bionumpy/_sources/source/sequences.rst.txt

Access the encoding scheme used for the sequences in an EncodedRaggedArray via the `.encoding` property.

```python
print(my_seqs.encoding)
```

--------------------------------

### Shift and Filter Intervals with NumPy

Source: https://bionumpy.github.io/bionumpy/source/intervals.html

Demonstrates basic interval manipulation using NumPy-like operations: extending interval ends, shifting whole intervals, and filtering by interval length.

```python
import bionumpy as bnp

intervals = bnp.open("example_data/small_interval.bed").read()
# Extend every interval 10 bases to the right
extended_right = bnp.replace(intervals, stop=intervals.stop+10)
# Shift every interval 5 bases to the right
shifted = bnp.replace(intervals, start=intervals.start+5, stop=intervals.stop+5)
# Keep only intervals shorter than 50 bases
small = intervals[(intervals.stop-intervals.start)<50]
```

--------------------------------

### Read a Chunk from a Gzipped FASTQ File

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/files.html

Opens a gzipped FASTQ file and reads only the first chunk, passing 300000 to `read_chunk` as the chunk size. This is useful for processing large files in manageable parts.
```python
import bionumpy as bnp

first_chunk = bnp.open("example_data/big.fq.gz").read_chunk(300000)
print(first_chunk)
```

--------------------------------

### Get EncodedRaggedArray lengths

Source: https://bionumpy.github.io/bionumpy/_sources/source/sequences.rst.txt

Retrieve the lengths of individual sequences within an EncodedRaggedArray using the `.lengths` property.

```python
print(my_seqs.lengths)
```

--------------------------------

### Get Genome Size

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genome.html

Returns the total size of the genome in base pairs. This is a property of the `Genome` object.

```python
genome.size
```

--------------------------------

### bnp_open

Source: https://bionumpy.github.io/bionumpy/_sources/modules/io.rst.txt

Opens a file for reading. It supports automatic format detection based on filename suffix and allows overriding with a specified buffer type.

```APIDOC
## bnp_open

### Description
Opens a file for reading. It supports automatic format detection based on filename suffix and allows overriding with a specified buffer type.

### Parameters
- **filename** (str) - Name of the file to open.
- **buffer_type** (optional) - Specifies the type of buffer to use for reading, overriding automatic detection.
```

--------------------------------

### Get context from BNPDataClass object

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/bnpdataclass/bnpdataclass.html

Retrieves a context value from the BNPDataClass object. This method is marked as deprecated.
```python
# Body of the deprecated get_context(self, name) method
logger.warning(f'Deprecated method set_context in BNPDataClass')
if not hasattr(self, '_context'):
    self._context = dict()
return self._context[name]
```

--------------------------------

### Reverse Complement Fasta/Fastq Files

Source: https://bionumpy.github.io/bionumpy/tutorials/benchmarking_examples.html

Generates the reverse complement of sequences in a FASTA or FASTQ file and writes the result to a new file. Automatically detects the appropriate buffer type based on file extension.

```python
import bionumpy as bnp


def reverse_complement(input_filename: str, output_filename: str):
    """Reverse complements a fasta or fastq file and writes the result to a new file."""
    # Use the two-line FASTA buffer for .fa/.fa.gz files; None falls back to suffix-based detection
    bt = lambda filename: (bnp.TwoLineFastaBuffer if filename.endswith(('fa', 'fa.gz')) else None)
    with bnp.open(output_filename, "w", buffer_type=bt(output_filename)) as outfile:
        for chunk in bnp.open(input_filename, buffer_type=bt(input_filename)).read_chunks():
            rc = bnp.sequence.get_reverse_complement(chunk.sequence)
            outfile.write(bnp.replace(chunk, sequence=rc))


def test():
    reverse_complement('example_data/big.fq.gz', 'example_data/big_rc.fq.gz')
    assert bnp.count_entries('example_data/big_rc.fq.gz') == bnp.count_entries('example_data/big.fq.gz')


if __name__ == '__main__':
    import sys

    reverse_complement(sys.argv[1], sys.argv[2])
```

--------------------------------

### Get Pileup of Intervals

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/arithmetics/intervals.html

Calculates the number of intervals that overlap each position on a chromosome or contig. This function is streamable.

```python
def get_pileup(intervals: Interval, chromosome_size: int) -> GenomicRunLengthArray:
    """Get the number of intervals that overlap each position of the chromosome/contig"""
    ...  # body not included in this excerpt
```

--------------------------------

### global_intersect

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/arithmetics/intervals.html

Computes the intersection of two sets of intervals across all chromosomes.
Intervals are sorted by chromosome and then by start position.

```APIDOC
## global_intersect

### Description
Computes the intersection of two sets of intervals across all chromosomes. Intervals are sorted by chromosome and then by start position.

### Parameters
* **intervals_b** (Interval) - The second set of intervals.
* **intervals_a** (Interval) - The first set of intervals.

### Returns
* **Interval** - The intervals representing the global intersection.
```

--------------------------------

### create_index Function

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/indexed_fasta.html

Creates a FASTA index for a given FASTA file.

```APIDOC
## def create_index(filename: str) -> FastaIdx

Create a fasta index for a fasta file

Parameters
----------
filename : str
    Filename of the fasta file

Returns
-------
FastaIdx
    Fasta index as bnpdataclass
```

--------------------------------

### OneLineBuffer.get_field_range_as_text

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/one_line_buffer.html

Retrieves a specified range of fields, specifically expecting a single field (start to start+1), and returns it as text.

```APIDOC
## get_field_range_as_text(start: int, end: int)

### Description
Get a range of fields as text. Asserts that the range is exactly one field.

### Parameters
* **start** (int) - The starting index of the field range.
* **end** (int) - The ending index of the field range. Must be start + 1.

### Returns
* EncodedRaggedArray - The specified field range as text.
```

--------------------------------

### Get Pileup from Streamed Intervals

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genomic_intervals.html

Calculates the pileup for streamed genomic intervals. This is a method within the GenomicIntervalsStreamed class.
```python
def get_pileup(self) -> GenomicArray:
    ...  # body not included in this excerpt
```

--------------------------------

### Create Info Dataclass from Header

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/io/vcf_buffers.html

Dynamically creates a BNPDataClass for INFO fields based on VCF header data. Handles different data types and list formats specified in the header.

```python
from typing import List, Optional

# parse_header and make_dataclass are defined elsewhere in the BioNumPy source


def translate_field_type(info_dict):
    t = info_dict['Type']
    number = info_dict['Number']
    # A field holds a list when its Number is unspecified or greater than one
    is_list = (number is None) or (number > 1)
    if t == Optional[int] and is_list:
        return List[int]
    elif t == Optional[float] and is_list:
        return List[float]
    elif is_list:
        return str
    return t


def create_info_dataclass(header_data):
    if not header_data:
        return str
    header = parse_header(header_data)
    # These two helpers are not used further in this excerpt
    is_list = lambda val: (val['Number'] is None) or (val['Number'] > 1)
    is_int_list = lambda val: (val['Type'] == Optional[int]) and is_list(val)
    info_fields = [(key, translate_field_type(val)) for key, val in header.INFO.items()]
    dc = make_dataclass(info_fields, "InfoDataclass")
    return dc
```

--------------------------------

### Get Sorted Interval Stream

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/genomic_data/genomic_intervals.html

Returns a stream of sorted genomic intervals. The method sorts the intervals itself by calling `self.sorted()` and wraps the result in a single-element stream.

```python
def get_sorted_stream(self):
    sorted_intervals = self.sorted()
    return self.from_interval_stream(iter([sorted_intervals]))
```

--------------------------------

### bnp_open()

Source: https://bionumpy.github.io/bionumpy/modules/io.html

Opens a file based on its suffix, returning a reader or writer object. Supports lazy reading for large files.

```APIDOC
## bnp_open()

### Description
Open a file according to its suffix. Opens a NpDataclassReader file object, which can be used to read the file either in chunks or completely. Files read in chunks can be used together with the @bnp.streamable decorator to call a function on all chunks in the file and optionally reduce the results.
If mode="w", it opens a writer object.

### Method
`bnp_open(filename: str, mode: str = None, buffer_type=None, lazy=None)`

### Parameters
- **filename** (str) - Name of the file to open
- **mode** (str) - Optional. Either "w" or "r".
- **buffer_type** (FileBuffer) - Optional. A FileBuffer class to specify how the data in the file should be interpreted.
- **lazy** (bool) - Optional. If True, the data will be read lazily, i.e. only when it is accessed. This is useful to speed up reading of large files, but it is more memory demanding.

### Returns
- NpDataclassReader - A file reader object
```

--------------------------------

### Get Raw Kmer Values

Source: https://bionumpy.github.io/bionumpy/_sources/tutorials/extracting_kmers_around_snps.rst.txt

Retrieves the raw integer (int64) encoded values of the alternative allele k-mers.

```python
raw_kmers = alt_kmers.raw()
print(raw_kmers[0:5])
```

--------------------------------

### Convert FASTQ to FASTA using BioNumPy

Source: https://bionumpy.github.io/bionumpy/_sources/manuscript/index.rst.txt

This Python snippet performs the FASTQ to FASTA conversion with BioNumPy, offering a more integrated and potentially more robust approach than bash scripting. The output file is opened in write mode and receives the chunks read from the input file.

```python
import bionumpy as bnp

# Open the output in write mode and stream the input chunks into it
with bnp.open("output.fasta", "w") as out_file:
    out_file.write(bnp.open("input.fastq").read_chunks())
```

--------------------------------

### get_context

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/bnpdataclass/bnpdataclass.html

Gets a context value for the object, typically used for storing auxiliary information like header information.

```APIDOC
## get_context

### Description
Gets a context value for the object, typically used for storing auxiliary information like header information.

### Method
`get_context(self, name: str) -> Any`

### Parameters
- **name** (str) - The name of the context variable to retrieve.

### Returns
The value of the context variable.
### Warning
Deprecated method `set_context` in `BNPDataClass`.
```

--------------------------------

### Initialize MultiStream with an Interval Dictionary

Source: https://bionumpy.github.io/bionumpy/source/multiple_data_sources.html

Construct a MultiStream object by passing an in-memory interval dictionary along with other data sources. This allows MultiStream to efficiently access interval data irrespective of its original file's sort order.

```python
>>> multistream = bnp.MultiStream(reference.get_contig_lengths(),
...                               sequence=reference,
...                               variants=variants,
...                               intervals=interval_dict)
```

--------------------------------

### Slice EncodedArray

Source: https://bionumpy.github.io/bionumpy/source/sequences.html

Use NumPy-like slicing to extract subsequences from an EncodedArray. This example trims the first and last two characters.

```python
>>> encoded_array[2:-2]
encoded_array('tggt')
```

--------------------------------

### Global Interval Intersection

Source: https://bionumpy.github.io/bionumpy/_modules/bionumpy/arithmetics/intervals.html

Computes the intersection of intervals across different chromosomes. Sorts intervals by chromosome and then by start position.

```python
@streamable()
def global_intersect(intervals_b, intervals_a):
    all_intervals = np.concatenate([intervals_a, intervals_b])
    all_intervals = all_intervals[np.lexsort((all_intervals.start, all_intervals.chromosome))]
    stops = all_intervals.stop[np.lexsort((all_intervals.stop, all_intervals.chromosome))]
    mask = stops[:-1] > all_intervals.start[1:]
    result = all_intervals[1:][mask]
    result.stop = stops[:-1][mask]
    return result
```
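
--------------------------------

### Sketch: Sorted-Start/Sorted-Stop Intersection in Plain Python

The `global_intersect` snippet above compares the sorted stops against the next sorted start. The following is a minimal sketch of that idea on a single chromosome, using plain Python lists rather than BioNumPy objects. The function name `intersect_interval_sets` and the tuple representation are hypothetical, chosen for illustration only. The sketch assumes half-open intervals and that the intervals within each input set do not overlap each other, so that no position is covered more than twice.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, stop); illustrative representation


def intersect_interval_sets(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two interval sets on one chromosome.

    Assumes the intervals inside `a` are mutually non-overlapping,
    and likewise for `b` (hypothetical helper, not a BioNumPy function).
    """
    merged = a + b
    # Sort starts and stops independently, as the snippet above does with lexsort
    starts = sorted(start for start, _ in merged)
    stops = sorted(stop for _, stop in merged)
    # If the i-th smallest stop lies beyond the (i+1)-th smallest start,
    # two intervals are open at once, and the overlap is [next_start, stop)
    return [
        (next_start, stop)
        for stop, next_start in zip(stops[:-1], starts[1:])
        if stop > next_start
    ]


if __name__ == "__main__":
    print(intersect_interval_sets([(0, 10), (20, 30)], [(5, 25)]))
    # prints [(5, 10), (20, 25)]
```

Intervals that only touch, such as `(0, 5)` and `(5, 10)`, produce no output because the comparison is strict. If either input set contains internally overlapping intervals, the result would also include overlaps within that set, so such inputs should be merged first.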