### Install GenomeKit via Conda

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Instructions to install GenomeKit using the pre-compiled conda packages available from the conda-forge channel.

```bash
conda install -c conda-forge genomekit
```

--------------------------------

### Install GenomeKit via Conda

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Instructions to install the GenomeKit library using the conda package manager from the conda-forge channel.

```Bash
conda install -c conda-forge genomekit
```

--------------------------------

### Run GenomeKit Annotation Walk Demo

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows the command-line execution and example output of the `walk_annotations.py` script, illustrating the hierarchical structure of GenomeKit annotations as they are printed to the console.

```bash
$ python demos/walk_annotations.py
...
```

--------------------------------

### Example Output of Genomic Position to Exon Mappings

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows example mappings from input genomic coordinates to their corresponding nearest downstream exons, including cases with multiple candidates.

```Text
chr1:91662-91662 -->
chr1:169296-169296 -->
chr1:320862-320862 --> # 1st candidate
chr1:320862-320862 --> # 2nd candidate
chr1:320862-320862 --> # 3rd candidate
```

--------------------------------

### Example VCF File Structure

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Provides a sample VCF (Variant Call Format) file content, illustrating its header information, column definitions, and example variant entries.

```VCF
##fileformat=VCFv4.2
##reference=GRCh37
##INFO=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3
1 949523 . C T . . AF=0.00 GT:AD 0/0:0,1 0/1:0,2 0/0:0,3
1 949608 . G A . . AF=0.01 GT:AD 0/0:0,4 0/1:0,5 0/0:0,6
1 949696 . - G . . AF=0.02 GT:AD 0/0:0,7 0/1:0,8 0/1:0,9
1 949739 . G TC . . AF=0.03 GT:AD 0/1:0,10 0/0:0,11 1/1:0,12
1 977028 . G T . . AF=0.04 GT:AD 0/1:0,13 0/0:0,14 1/1:0,15
1 977330 . T C . . AF=0.05 GT:AD 0/1:0,16 0/0:0,17 ./.:0,18
1 977516 . - C . . AF=0.06 GT:AD 1/1:0,19 1/1:0,20 ./.:0,21
1 977570 . G A . . AF=0.07 GT:AD 1/1:0,22 1/1:0,23 ./.:0,24
1 978604 . CT - . . AF=0.08 GT:AD 1/1:0,25 1/1:0,26 ./.:0,27
1 978628 . C T . . AF=0.09 GT:AD ./.:28,0 0/0:29,0 ./.:30,0
```

--------------------------------

### Instantiating a Variant Object

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to create a `Variant` object using chromosome, 0-based position, reference allele, alternate allele, and reference genome. The example also demonstrates its string representation.

```Python
>>> variant = Variant("chr7", 117120148, "AT", "G", "hg19")
>>> variant
```

--------------------------------

### Example VCF File Content

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This block shows a sample VCF (Variant Call Format) file, including its header information and several variant entries. This file is used as input for subsequent GenomeKit operations.

```VCF
##fileformat=VCFv4.2
##reference=GRCh37
##INFO=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3
1 949523 . C T . . AF=0.00 GT:AD 0/0:0,1 0/1:0,2 0/0:0,3
1 949608 . G A . . AF=0.01 GT:AD 0/0:0,4 0/1:0,5 0/0:0,6
1 949696 . - G . . AF=0.02 GT:AD 0/0:0,7 0/1:0,8 0/1:0,9
1 949739 . G TC . . AF=0.03 GT:AD 0/1:0,10 0/0:0,11 1/1:0,12
1 977028 . G T . . AF=0.04 GT:AD 0/1:0,13 0/0:0,14 1/1:0,15
1 977330 . T C . . AF=0.05 GT:AD 0/1:0,16 0/0:0,17 ./.:0,18
1 977516 . - C . . AF=0.06 GT:AD 1/1:0,19 1/1:0,20 ./.:0,21
1 977570 . G A . . AF=0.07 GT:AD 1/1:0,22 1/1:0,23 ./.:0,24
1 978604 . CT - . . AF=0.08 GT:AD 1/1:0,25 1/1:0,26 ./.:0,27
1 978628 . C T . . AF=0.09 GT:AD ./.:28,0 0/0:29,0 ./.:30,0
```

--------------------------------

### Example Output of Filtered Acceptor Site DNA Sequences

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows example 10-nucleotide sequences extracted around acceptor sites after applying filtering criteria, demonstrating the sense-strand nature of the output.

```Text
TGCAGGGAAC # Note they are all sense-strand (AG)
TTCAGCTGCT # because exon.end5 knows the strand.
TGTAGGAAAC
TCCAGGCTAT
GCCAGAGGAC
GACAGAACCA
CCCAGATTGG
...
```

--------------------------------

### Initialize GenomeKit Interval Object

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to create an "Interval" object, specifying chromosome, strand, start, end, and reference genome. Intervals are 0-based with exclusive end.

```Python
interval = Interval("chr7", "+", 117120016, 117120201, "hg19")
```

--------------------------------

### Setup Conda Environment for M1 Macs

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Provides an alternative Conda setup specifically for M1 Mac users, creating a `cxx` environment and installing C++ compiler and other dependencies from a file.

```Shell
conda create -n cxx cxx-compiler zlib
conda activate cxx
conda install -c conda-forge -c bioconda --file a-file-with-the-deps-from-genomekit_dev-yml.txt
```

--------------------------------

### Run GenomeKit with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to run the GenomeKit Docker image in an interactive session and import the `genome_kit` library within Python.
```bash
docker run -it --rm deepgenomicsinc/genomekit:latest python
```

```python
import genome_kit
```

--------------------------------

### Clone GenomeKit Repository for Data Generation

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Commands to clone the GenomeKit GitHub repository and navigate into its directory, preparing for local data generation.

```Bash
git clone https://github.com/deepgenomics/GenomeKit.git
pushd GenomeKit
```

--------------------------------

### Example Interval Object Initialization

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates the creation of an `Interval` object, highlighting a special case where `anchor_offset` is used to indicate a motif match within an insertion. In such cases, the position within the insertion has no direct alignment to the reference genome.

```Python
Interval("chr7", "+", 117232020, 117232020, "hg19", 117232020)
```

--------------------------------

### Generate GenomeKit Annotation Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Command to build annotation data files (e.g., NCBI v105.20190906 for hg19) using the GenomeKit Docker image. The assembly must be built first, and the command uses the same data output directory setup.

```Bash
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/NCBI/v105.20190906 /output
```

--------------------------------

### Import GenomeKit Package

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Imports the main GenomeKit package, aliasing it as 'gk' for brevity in subsequent code.
```Python
import genome_kit as gk
```

--------------------------------

### Importing the VCFTable Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to import the `VCFTable` class from `genome_kit` for working with binary VCF files.

```Python
>>> from genome_kit import VCFTable
```

--------------------------------

### Run GenomeKit with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Command to run GenomeKit interactively using its official Docker image, allowing direct Python interaction within the container.

```Bash
$ docker run -it --rm deepgenomicsinc/genomekit:latest python
>>> import genome_kit
```

--------------------------------

### Importing the GenomeKit Package

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates the basic import of the `genome_kit` package into a Python project, typically aliased as `gk` for convenience in subsequent code.

```python
>>> import genome_kit as gk
```

--------------------------------

### Importing the Variant Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to import the `Variant` class from the `genome_kit` library to begin working with genomic variants.

```Python
from genome_kit import Variant
```

--------------------------------

### Calculate GenomeKit Interval Length

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to get the length (number of bases) of an "Interval" object, which is calculated as "end - start".

```Python
len(interval)
```

--------------------------------

### Initialize GenomeKit Genome Object and Get DNA Sequence

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates creating a "Genome" object for a specific reference genome (e.g., 'hg19') and then using it to retrieve the DNA sequence for a given "Interval".
```Python
genome = Genome("hg19")  # Equivalently "hg19"
genome.dna(interval)
```

--------------------------------

### Traverse Genome Annotation Hierarchy in Python

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

This example demonstrates how to programmatically walk through the hierarchical structure of genomic annotations provided by GenomeKit. It shows how to access genes, transcripts, and exons from a `Genome` object and iterate through them to print their details.

```python
genome = Genome("gencode.v19")
for gene in genome.genes:            # Each gene
    print(gene)
    for tran in gene.transcripts:    # Each transcript on the gene
        print(" ", tran)
        for exon in tran.exons:      # Each exon on the transcript
            print(" ", exon)
```

```bash
$ python demos/walk_annotations.py
...
```

--------------------------------

### Loading VCF File with VCFTable

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to open a gzipped VCF file using `VCFTable.from_vcf`, specifying the reference genome and which INFO and FORMAT fields to load, and shows the resulting `VCFTable` object.

```Python
>>> vcf = VCFTable.from_vcf("test.vcf.gz", Genome("hg19"), info_ids=["AF"], fmt_ids=["GT", "AD"])
>>> vcf
```

--------------------------------

### Creating VariantGenome Objects with Single and Multiple Variants

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This Python example demonstrates how to instantiate `VariantGenome` objects in GenomeKit. It shows the creation of variant genomes with a single variant (substitution or deletion) and with a list of multiple variants, illustrating how multiple variants are applied collectively to the reference genome.
```python
ref = Genome("hg19")
var1 = VariantGenome(ref, ref.variant("chr7:117120188:A:T"))  # rs397508673 (A>T)
var2 = VariantGenome(ref, ref.variant("chr7:117120190:A:-"))  # rs397508710 (delA)
var3 = VariantGenome(ref, [ref.variant(x) for x in ["chr7:117120188:A:T",
                                                    "chr7:117120190:A:-"]])  # both variants together
```

--------------------------------

### Create GenomeKit Variant Object from Parameters

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to instantiate a `Variant` object using its constructor, providing chromosome, 0-based position, reference allele, alternate allele, and reference genome.

```Python
>>> variant = Variant("chr7", 117120148, "AT", "G", "hg19")
>>> variant
```

--------------------------------

### Implementing and Registering a Custom GenomeKit DataManager

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates how to extend the `DataManager` class to provide a custom mechanism for storing and retrieving GenomeKit data files. This example includes methods for initialization, file retrieval (`get_file`), and file upload (`upload_file`), along with the code to register the custom manager with GenomeKit. It also mentions the alternative of using a plugin package.

```Python
from typing import Dict, Optional

class MyDataManager(DataManager):
    def __init__(self, data_dir: str):
        ...

    def get_file(self, filename: str) -> str:
        ...

    def upload_file(self, filepath: str, filename: str, metadata: Optional[Dict[str, str]] = None):
        ...

# The constructor requires a data directory; this path is only a placeholder.
gk.gk_data.data_manager = MyDataManager("/path/to/data")
```

--------------------------------

### Creating Interval Objects and Checking Length

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to instantiate `Interval` objects by specifying chromosome, strand, start, end coordinates, and the genome build. It also illustrates how to retrieve the length of an `Interval` using the built-in `len()` function.
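Before the GenomeKit snippet that follows, the half-open arithmetic behind `len()`, containment, and overlap can be sketched without the library. This is a minimal stand-in, not GenomeKit's implementation: the `HalfOpen` class is hypothetical and ignores chromosome, strand, and reference genome, which the real `Interval` also compares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfOpen:
    """Hypothetical toy interval [start, end) on one chromosome and strand."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    def __len__(self) -> int:
        # Number of bases spanned: end - start.
        return self.end - self.start

    def contains(self, other: "HalfOpen") -> bool:
        # True when `other` lies entirely inside this interval.
        return self.start <= other.start and other.end <= self.end

    def overlaps(self, other: "HalfOpen") -> bool:
        # Adjacent intervals such as [0, 5) and [5, 10) share no base.
        return self.start < other.end and other.start < self.end


# Same coordinates as the a, b, c, d intervals used in the quickstart.
a, b, c, d = HalfOpen(0, 5), HalfOpen(5, 10), HalfOpen(3, 7), HalfOpen(3, 4)
lengths = (len(a), len(b), len(c), len(d))
print(lengths, a.contains(c), a.contains(d), a.overlaps(b), a.overlaps(c))
```

Because the end is exclusive, two adjacent intervals never overlap and lengths add up without an off-by-one correction.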
```python
a = Interval("chr1", "+", 0, 5, "hg38")
b = Interval("chr1", "+", 5, 10, "hg38")
c = Interval("chr1", "+", 3, 7, "hg38")
d = Interval("chr1", "+", 3, 4, "hg38")

len(a), len(b), len(c), len(d)  # (5, 5, 4, 1)
```

--------------------------------

### Initializing a GenomeKit Interval Object

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates the creation of an `Interval` object, which represents a specific genomic region. An interval is defined by its chromosome, strand, start and end coordinates, and the reference genome it belongs to.

```python
>>> interval = Interval("chr7", "+", 117120016, 117120201, "hg19")
```

--------------------------------

### Accessing Variant Interval Attributes

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Illustrates how to retrieve the start, end, and length of a variant's reference allele interval, and how to access the underlying `Interval` object.

```Python
>>> variant.start, variant.end, len(variant)
(117120148, 117120150, 2)
>>> variant.interval
Interval("chr7", "+", 117120148, 117120150, "hg19")
```

--------------------------------

### Clone GenomeKit Repository for Local Data Generation

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Commands to clone the GenomeKit GitHub repository and change the current directory into the cloned repository, which is necessary for generating local data files.

```bash
git clone https://github.com/deepgenomics/GenomeKit.git
pushd GenomeKit
```

--------------------------------

### Access Variant Interval Properties

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows how a `Variant` object behaves as a subclass of `Interval`, allowing access to `start`, `end`, and `len` properties, and how to retrieve its underlying `Interval` object which spans the reference allele.
```Python
>>> variant.start, variant.end, len(variant)
(117120148, 117120150, 2)
>>> variant.interval
Interval("chr7", "+", 117120148, 117120150, "hg19")
```

--------------------------------

### Initialize and Perform Basic Operations on GenomeKit Intervals in Python

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This Python snippet demonstrates the creation of `Interval` objects with specified chromosome, strand, start, end, and genome build. It then showcases fundamental operations like calculating interval length, checking for containment of one interval within another, detecting overlaps, determining upstream/downstream relationships, and comparing intervals for equality.

```python
>>> # 0123456789
>>> # aaaaabbbbb
>>> #    cccc
>>> #    d
>>> a = Interval("chr1", "+", 0, 5, "hg38")
>>> b = Interval("chr1", "+", 5, 10, "hg38")
>>> c = Interval("chr1", "+", 3, 7, "hg38")
>>> d = Interval("chr1", "+", 3, 4, "hg38")
>>> len(a), len(b), len(c), len(d)
(5, 5, 4, 1)
>>> a.contains(c), c.within(a), a.contains(d), d.within(a)
(False, False, True, True)
>>> a.overlaps(b), a.overlaps(c)
(False, True)
>>> a.upstream_of(b), b.dnstream_of(a)
(True, True)
>>> c.upstream_of(b), b.dnstream_of(c)
(False, False)
>>> a == b, a == d
(False, False)
>>> a != b, a != d
(True, True)
```

--------------------------------

### Import Core GenomeKit Types

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Imports essential GenomeKit classes, "Genome" and "Interval", directly into the current namespace for easier access.

```Python
from genome_kit import Genome
from genome_kit import Interval
...
```

--------------------------------

### Building Genome Tracks with Strand Awareness

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates the use of `GenomeTrackBuilder` to create custom genomic tracks, highlighting the impact of the `strandedness` argument.
It shows examples for both 'strand_unaware' and 'strand_aware' modes, illustrating how data is ordered and retrieved based on the specified strand behavior.

```python
>>> track = GenomeTrackBuilder("neg.gtrack", "u3", "strand_unaware", Genome("hg19"))
>>> interval = Interval("chr1", "-", 10, 15, "hg19")  # same reference genome as the track
>>> track.set_data(interval, np.arange(0, len(interval), dtype=np.uint8))
>>> track.finalize()
>>> track = GenomeTrack("neg.gtrack")
>>> track(interval)
array([[4], [3], [2], [1], [0]], dtype=uint8)

>>> track = GenomeTrackBuilder("neg.gtrack", "u3", "strand_aware", Genome("hg19"))
>>> track.set_data(interval, np.arange(0, len(interval), dtype=np.uint8))
>>> track.finalize()
>>> track = GenomeTrack("neg.gtrack")
>>> track(interval)
array([[0], [1], [2], [3], [4]], dtype=uint8)
```

--------------------------------

### Initializing Genome Object and Retrieving DNA Sequence

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Explains how to instantiate a `Genome` object for a specific reference genome. It then demonstrates using this `Genome` object to retrieve the DNA sequence corresponding to a given `Interval`.

```python
>>> genome = Genome("hg19")  # Equivalently "hg19"
>>> genome.dna(interval)
'AATTGGAAGCAAA...AACTTTTTTTCAG'
```

--------------------------------

### Generate GenomeKit Assembly Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Command to build assembly data files (e.g., hg19) using the GenomeKit Docker image. It sets the data output directory and runs the build script for the specified assembly.
```Bash
export GENOMEKIT_DATA_DIR=$(python -c "import os ; import appdirs ; print(os.environ.get('GENOMEKIT_DATA_DIR', appdirs.user_data_dir('genome_kit')))")
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/assembly /output
```

--------------------------------

### Clone GenomeKit Source Repository

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Clones the GenomeKit source code from its GitHub repository to your local machine, initiating the development setup.

```Shell
git clone git@github.com:deepgenomics/GenomeKit.git
```

--------------------------------

### API Reference: VCFTable Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Documentation for the `VCFTable` class, used to open and query binary VCF files, returning `Variant`-based objects and allowing access to INFO and FORMAT fields.

```APIDOC
VCFTable:
  - Description: Provides an interface to compact, indexed binary VCF files.
  - Class Methods:
    - from_vcf(file_path: str, genome: Genome, info_ids: list = None, fmt_ids: list = None) -> VCFTable
      - Description: Opens a VCF file and loads specified INFO and FORMAT fields.
      - Parameters:
        - file_path: str (Path to the VCF file, e.g., 'test.vcf.gz')
        - genome: Genome (Genome object for reference)
        - info_ids: list (Optional list of INFO field IDs to load)
        - fmt_ids: list (Optional list of FORMAT field IDs to load)
  - Methods:
    - __getitem__(index: int) -> VCFVariant
      - Description: Accesses a VCFVariant object by its 0-based index.
    - info(info_id: str) -> numpy.ndarray
      - Description: Retrieves all values for a specified INFO field as a NumPy array.
      - Parameters:
        - info_id: str (The ID of the INFO field)
    - find_within(interval: Interval) -> list[VCFVariant]
      - Description: Finds all VCFVariant objects that fall within the given interval.
      - Parameters:
        - interval: Interval (The genomic interval to query)
    - index_of(variant: VCFVariant) -> int
      - Description: Returns the 0-based index of a VCFVariant object within the VCFTable.
      - Parameters:
        - variant: VCFVariant (The VCFVariant object to find the index for)
    - format(format_id: str) -> numpy.ndarray
      - Description: Retrieves per-sample format data (e.g., GT, AD) as a NumPy array.
      - Parameters:
        - format_id: str (The ID of the FORMAT field)
      - Returns: numpy.ndarray (Shape depends on data, e.g., (num_variants, num_samples) or (num_variants, num_samples, num_alleles))
```

--------------------------------

### Extract DNA Sequence from GenomeKit Interval

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to extract DNA sequences from `Interval` objects using the `dna` method of a `Genome` instance. It shows how to get both forward and reverse-complemented sequences based on the strand.

```Python
>>> a = Interval("chr7", "+", 117120016, 117120201, "hg19")
>>> b = a.as_opposite_strand()
>>> genome = Genome("hg19")
>>> genome.dna(a)
'AATTGGAAGCAAA...AACTTTTTTTCAG'
>>> genome.dna(b)
'CTGAAAAAAAGTT...TTTGCTTCCAATT'
```

--------------------------------

### Install GenomeKit in Develop Mode

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Installs the GenomeKit source tree in editable (develop) mode. This step is crucial for enabling the `build` subcommand to correctly locate and utilize test data directories within the source tree.

```Bash
pip install -e .
```

--------------------------------

### Install GenomeKit in Development Mode

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Installs the GenomeKit package in editable development mode. This command builds the C++ extension and links it into your Python `site-packages`, allowing `import genome_kit` from any directory and reflecting local source changes.

```Shell
pip install -e .
```

--------------------------------

### Importing Core GenomeKit Types

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows how to directly import essential classes like `Genome` and `Interval` into the current namespace. This practice simplifies code by avoiding the need to prefix object instantiations with `genome_kit`.

```python
>>> from genome_kit import Genome
>>> from genome_kit import Interval
...
```

--------------------------------

### Perform Basic Motif Search in GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Illustrates how to use `genome_kit.Genome.find_motif` to search for a specific DNA motif within a genomic interval on a reference genome. The example also shows how to expand the returned empty interval for further feature extraction.

```Python
genome = Genome('hg19')
# Short sequence from CFTR
interval = Interval('chr7', '+', 117231957, 117232030, genome)
genome.dna(interval)
motif = 'AACAA'
matches = genome.find_motif(interval, motif)
matches[0].expand(5, 5)
```

--------------------------------

### APIDOC: genome_kit.Genome Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documentation for the `Genome` class, providing convenient access to resources associated with a reference genome. It outlines the constructor and methods for retrieving genomic data like DNA sequences.

```APIDOC
Class: Genome
Description: Resources available for a reference genome.
Constructor:
  __init__(genome_name: str)
    genome_name: The name of the reference genome (e.g., "hg19", "gencode.v19").
Methods:
  dna(interval: Interval): Returns the DNA sequence for the given interval.
Properties:
  genes: Access to gene annotations (available when genome is versioned, e.g., "gencode.v19").
```

--------------------------------

### Importing VCFTable from GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This snippet demonstrates how to import the `VCFTable` class from the `genome_kit` library, which is essential for working with VCF files.

```Python
from genome_kit import VCFTable
```

--------------------------------

### APIDOC: genome_kit.Interval Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documentation for the `Interval` class, representing a genomic interval. It details the constructor parameters and key properties/methods for manipulating interval data.

```APIDOC
Class: Interval
Description: An interval on a reference genome.
Constructor:
  __init__(chromosome: str, strand: str, start: int, end: int, reference_genome: str)
    chromosome: The chromosome name (e.g., "chr7").
    strand: The strand ("+" or "-").
    start: The 0-based start position (inclusive).
    end: The 0-based end position (exclusive).
    reference_genome: The reference genome name (e.g., "hg19").
Properties:
  len(): Returns the number of bases spanned by the interval (end - start).
  as_ucsc(): Returns the interval in UCSC browser's "1-based, inclusive end" format.
```

--------------------------------

### Opening a VCF File and Accessing Variants

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This code demonstrates how to open a gzipped VCF file (`test.vcf.gz`) using `VCFTable.from_vcf`, specifying the genome build and which INFO and FORMAT fields to carry over. It also shows how to inspect the `VCFTable` object and access an individual `Variant` object by index.
```Python
vcf = VCFTable.from_vcf("test.vcf.gz", Genome("hg19"), info_ids=["AF"], fmt_ids=["GT", "AD"])
vcf
vcf[0]
```

--------------------------------

### Access Versioned Genomic Resources with GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to initialize a "Genome" object with a versioned resource (e.g., 'gencode.v19') to enable access to specific annotations like genes, transcripts, and exons, and then retrieve DNA sequences for these objects.

```Python
genome = Genome("gencode.v19")              # Implies "hg19"
gene = genome.genes["ENSG00000001626.10"]   # Gene object
tran = gene.transcripts[2]                  # Transcript object
exon = tran.exons[0]                        # Exon object
genome.dna(exon)
```

--------------------------------

### Creating Variant from 1-Based String

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates two methods to create a `Variant` object from a 1-based string representation, commonly used in UCSC and Clinvar conventions, ensuring validation against a specified genome.

```Python
>>> genome = Genome("hg19")
>>> variant = genome.variant("chr7:117,120,149:AT:G")                  # First way
>>> variant = Variant.from_string('chr7:117,120,149:AT:G', genome)     # Second way
>>> variant
```

--------------------------------

### Extract DNA Features from Reference and Variant Genomes

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

This example illustrates how to extract DNA sequences from both a reference genome and a variant genome using a single feature extraction function. It defines a function that retrieves a specific transcript, expands its 5' end, and extracts the DNA sequence, demonstrating the transparent handling of `Genome` and `VariantGenome` objects.
```Python
def extract_features(genome):
    tran = genome.transcripts["ENST00000426809.1"]  # CFTR transcript
    span = tran.end5.expand(2, 5)                   # 7nt span at 5' end
    return genome.dna(span)                         # extract DNA

ref = Genome("gencode.v19")
variants = [Variant.from_string("chr7:117120149:A:G", ref),  # rs397508328
            Variant.from_string("chr7:117120151:G:T", ref)]  # rs397508657
var = VariantGenome(ref, variants)

print(extract_features(ref))
print(extract_features(var))
```

--------------------------------

### Import GenomeKit Variant Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Imports the `Variant` class from the `genome_kit` library, which is used to represent individual genomic variants.

```Python
from genome_kit import Variant
```

--------------------------------

### Python Example: Uploading and Getting GenomeKit Files

Source: https://deepgenomics.github.io/GenomeKit/api.html/_modules/genome_kit/gk_data

Demonstrates a typical workflow for managing files with GenomeKit. This example shows how to upload a local file using `upload_file` to make it accessible, and then retrieve it using `get_file`, which handles on-demand downloads and returns the file's local path.

```Python
>>> upload_file('/local/path/hg38.2bit', 'hg38.2bit')
>>> get_file('hg38.2bit')
"/Users/example/Application Support/genome_kit/hg38.2bit"
```

--------------------------------

### Generate Genome Annotation Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Command to generate specific genome annotation data files (e.g., hg19.p13.plusMT/NCBI/v105.20190906) using a Docker container. This step should be performed after the corresponding assembly data has been built.
```bash
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/NCBI/v105.20190906 /output
```

--------------------------------

### Define and Use Anchored Intervals in GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how GenomeKit's 'anchored' intervals allow a specific position to remain aligned when an interval is lifted over to a variant genome. This example shows anchoring to the 5' or 3' end and observing the resulting DNA sequence changes on a variant genome.

```Python
ref = Genome("hg19")  # defined first, since the interval below refers to it
interval = Interval("chr7", "+", 117120185, 117120192, ref)
anchored_5p = interval.with_anchor("5p")  # Anchored to its 5' end
anchored_3p = interval.with_anchor("3p")  # Anchored to its 3' end

var = VariantGenome(ref, ref.variant("chr7:117120190:A:-"))  # rs397508710 (delA)
ref.dna(interval)
var.dna(interval)     # (shrink 3' end)
var.dna(anchored_5p)  # (fill 3' end)
var.dna(anchored_3p)  # (fill 5' end)
```

--------------------------------

### Accessing Versioned Genomic Resources (GENCODE)

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates how to access versioned genomic resources, such as GENCODE annotations, by initializing the `Genome` object with a specific version. This allows navigation through gene, transcript, and exon objects to retrieve associated DNA sequences.
```python
>>> genome = Genome("gencode.v19")              # Implies "hg19"
>>> gene = genome.genes["ENSG00000001626.10"]   # Gene object
>>> tran = gene.transcripts[2]                  # Transcript object
>>> exon = tran.exons[0]                        # Exon object
>>> genome.dna(exon)
'AATTGGAAGCAAA...AACTTTTTTTCAG'
```

--------------------------------

### APIDOC: genome_kit.GenomeTrackBuilder Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documentation for the `GenomeTrackBuilder` class, used to construct genomic tracks. It details the constructor parameters, including the `strandedness` argument and its possible values, and methods for setting data and finalizing the track.

```APIDOC
Class: GenomeTrackBuilder
Description: Builder for creating genomic tracks.
Constructor:
  __init__(track_name: str, data_type: str, strandedness: str, genome: Genome)
    track_name: The name of the track file.
    data_type: The data type for the track.
    strandedness: Defines how data is ordered based on strand.
      Possible values:
        "single_stranded": Both strands share the same data, applied in Interval coordinate (reference strand) order.
        "strand_unaware": Ignores the Interval strand, data applied in Interval coordinate (reference strand) order.
        "strand_aware": Data applied from 5' end to 3' end (sense strand order).
    genome: The Genome object associated with the track.
Methods:
  set_data(interval: Interval, data: np.ndarray): Sets data for a specific interval.
  finalize(): Finalizes the track building process.
```

--------------------------------

### GenomeKit API Reference: Interval and Motif Methods

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Detailed API documentation for key GenomeKit methods: `Interval.with_anchor` for creating anchored intervals and `Genome.find_motif` (also applicable to `VariantGenome.find_motif`) for comprehensive motif searching, including parameter descriptions and return types.
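To make the `match_position` and `find_overlapping_motifs` options in the reference that follows more concrete, here is a rough sketch on a plain Python string. It is an illustration under stated assumptions, not GenomeKit code: `find_motif_positions` is a hypothetical helper, it returns integer offsets rather than empty `Interval` objects, and it ignores strand and variant genomes entirely.

```python
from typing import List, Union


def find_motif_positions(seq: str, motif: str,
                         match_position: Union[int, str] = 0,
                         find_overlapping: bool = False) -> List[int]:
    """Hypothetical helper: one 0-based offset per motif match in `seq`."""
    if not motif:
        raise ValueError("motif must be non-empty")

    # Translate '5p'/'3p' into an offset from the start of the match.
    if match_position == "5p":
        offset = 0
    elif match_position == "3p":
        offset = len(motif)
    else:
        offset = int(match_position)
    if not 0 <= offset <= len(motif):
        raise ValueError("match_position must lie within the motif")

    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + offset)
        # Advance one base to allow overlaps, otherwise skip the whole match.
        step = 1 if find_overlapping else len(motif)
        start = seq.find(motif, start + step)
    return positions


non_overlapping = find_motif_positions("AAAA", "AA")
overlapping = find_motif_positions("AAAA", "AA", find_overlapping=True)
three_prime = find_motif_positions("GGAACAAGG", "AACAA", match_position="3p")
print(non_overlapping, overlapping, three_prime)
```

Returning a single position per match mirrors the idea of an empty interval: the caller decides how much flanking sequence to take around it, as `expand(5, 5)` does in the motif search example.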
```APIDOC
Interval:
  with_anchor(mode: str) -> Interval
    mode: str
      Description: Specifies the anchoring mode. Can be "5p" (5' end), "3p" (3' end), or an integer for a specific base within the interval.
    Purpose: To create a new Interval object anchored to a specific position, ensuring that position remains aligned when lifted over to a variant genome.

Genome:
  find_motif(interval: Interval, motif: str, match_position: Union[int, str] = 0, find_overlapping_motifs: bool = False) -> List[Interval]
    interval: Interval
      Description: The genomic interval within which to search for the motif.
    motif: str
      Description: The DNA sequence string to search for.
    match_position: Union[int, str] = 0
      Description: Controls the alignment of the returned empty interval relative to the motif match.
      Values:
        - 0 or '5p': Aligns the match to the 5' end of the motif (default).
        - len(motif) or '3p': Aligns the match to the 3' end of the motif.
        - Integer (0 to len(motif)): Aligns to a specific base within the motif.
    find_overlapping_motifs: bool = False
      Description: If True, all overlapping motif matches are returned. If False (default), only non-overlapping matches are returned.
    Returns: List[Interval]
      Description: A list of empty Interval objects, each representing a motif match. The anchor of each returned interval is set to its position, ensuring alignment on variant genomes.
    Purpose: To locate occurrences of a specified DNA motif within a given genomic interval on the reference genome.

VariantGenome:
  find_motif(interval: Interval, motif: str, match_position: Union[int, str] = 0, find_overlapping_motifs: bool = False) -> List[Interval]
    Description: Similar to Genome.find_motif, but performs the search on a variant genome.
    (Parameters are identical to Genome.find_motif)
```

--------------------------------

### Generate Genome Assembly Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Command to generate specific genome assembly data files (e.g., hg19.p13.plusMT/assembly) using a Docker container. The command mounts the local data-src and output directories and sets GENOMEKIT_DATA_DIR inside the container.

```bash
docker run --rm -it -v ./data-src:/data-src \
  -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
  --platform=linux/amd64 deepgenomicsinc/genomekit \
  python /data-src/build.py hg19.p13.plusMT/assembly /output
```

--------------------------------

### Walk GenomeKit Annotation Structure

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to iterate through the genes, transcripts, and exons of a GenomeKit `Genome` object to access its hierarchical annotation data programmatically.

```python
from genome_kit import Genome

genome = Genome("gencode.v19")
for gene in genome.genes:              # Each gene
    print(gene)
    for tran in gene.transcripts:      # Each transcript on the gene
        print("  ", tran)
        for exon in tran.exons:        # Each exon on the transcript
            print("    ", exon)
```

--------------------------------

### Create GenomeKit Variant from UCSC/Clinvar String

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates two ways to create a `Variant` object from a 1-based (DNA1) string representation, as is common in UCSC and Clinvar formats: `genome.variant()` and `Variant.from_string()`.
```Python
>>> genome = Genome("hg19")
>>> variant = genome.variant("chr7:117,120,14
```

--------------------------------

### Accessing Per-Sample Genotype and Allelic Depth

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to extract per-sample format data such as 'GT' (Genotype) and 'AD' (Allelic Depths) from the `VCFTable`, showing the shape of the resulting arrays and how to filter them by variant indices.

```Python
>>> gt = vcf.format('GT')
>>> gt.shape
(10L, 3L)
>>> gt[indices]
array([[1, 0, 0],
       [2, 2, 0],
       [2, 2, 0]], dtype=int8)
>>> ad = vcf.format('AD')
>>> ad.shape
(10L, 3L)
>>> ad[indices]
array([[[ 0, 16],
        [ 0, 17],
        [ 0, 18]],

       [[ 0, 19],
        [ 0, 20],
        [ 0, 21]],

       [[ 0, 22],
        [ 0, 23],
        [ 0, 24]]], dtype=int32)
```

--------------------------------

### Exploring Exon Object Attributes

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows attributes available on an `Exon` object obtained from versioned genomic resources. These attributes describe the exon's genomic interval, its index within the transcript, and its references to parent and sibling objects.

```python
>>> exon.interval
Interval("chr7", "+", 117120016, 117120201, "hg19")
>>> exon.index
0
>>> exon.transcript
>>> exon.cds
>>> exon.next_exon
```

--------------------------------

### GenomeKit Annotation Table Positional Query Methods API

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documents the methods available on GenomeKit annotation tables (e.g., `ExonTable`) for performing positional queries. These methods map genomic positions to specific annotation elements.

```APIDOC
ExonTable Methods for Positional Queries:
  find_overlapping(): elements overlapping interval.
  find_within(): elements falling within interval.
  find_exact(): elements exactly spanning interval.
  find_5p_aligned(): elements with 5’ end aligned to the 5’ end of interval.
  find_3p_aligned(): elements with 3’ end aligned to the 3’ end of interval.
  find_5p_within(): elements with 5’-most position within interval.
  find_3p_within(): elements with 3’-most position within interval.
```

--------------------------------

### Set GenomeKit Data Directory Environment Variable

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Sets the GENOMEKIT_DATA_DIR environment variable, which specifies where GenomeKit data files are stored. A Python one-liner keeps the existing value if the variable is already set, and otherwise uses the appdirs library to determine the user's default data directory.

```bash
export GENOMEKIT_DATA_DIR=$(python -c "import os ; import appdirs ; print(os.environ.get('GENOMEKIT_DATA_DIR', appdirs.user_data_dir('genome_kit')))")
```

--------------------------------

### Calculating Interval Length

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to determine the span of an `Interval` object, that is, the number of bases it covers. The length is the difference between the end and start positions, following a 0-based, exclusive-end convention.

```python
>>> len(interval)
185
```
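
The length arithmetic above can be checked without GenomeKit installed. The following is a minimal sketch, not GenomeKit code: `HalfOpenInterval` is a hypothetical stand-in class introduced only for illustration, and the coordinates are the ones shown for `exon.interval` earlier (assumed here to be the interval whose length is reported as 185).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfOpenInterval:
    """Hypothetical stand-in for illustration; not GenomeKit's Interval class."""

    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    def __len__(self) -> int:
        # With an exclusive end, the base count is simply end - start.
        return self.end - self.start


exon_span = HalfOpenInterval(117120016, 117120201)
print(len(exon_span))  # 185

# The same span in 1-based, inclusive coordinates needs a +1 correction.
start_1based, end_1based = exon_span.start + 1, exon_span.end
print(end_1based - start_1based + 1)  # 185
```

Under the exclusive-end convention an empty interval (start equal to end) has length 0, and no off-by-one correction is needed when computing lengths.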