### Install GenomeKit via Conda

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Instructions to install GenomeKit using the pre-compiled conda packages available from the conda-forge channel.

```bash
conda install -c conda-forge genomekit
```

--------------------------------

### Install GenomeKit via Conda

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Instructions to install the GenomeKit library using the conda package manager from the conda-forge channel.

```Bash
conda install -c conda-forge genomekit
```

--------------------------------

### Run GenomeKit Annotation Walk Demo

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows the command-line execution and example output of the `walk_annotations.py` script, illustrating the hierarchical structure of GenomeKit annotations as they are printed to the console.

```bash
$ python demos/walk_annotations.py
<Gene ENSG00000223972.4 (DDX11L1)>
   <Transcript ENST00000456328.2 of DDX11L1>
      <Exon 1/3 of ENST00000456328.2>
      <Exon 2/3 of ENST00000456328.2>
      <Exon 3/3 of ENST00000456328.2>
...
```

--------------------------------

### Example Output of Genomic Position to Exon Mappings

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows example mappings from input genomic coordinates to their corresponding nearest downstream exons, including cases with multiple candidates.

```Text
chr1:91662-91662 --> <Exon 4/4 of ENST00000466430.1>
chr1:169296-169296 --> <Exon 3/8 of ENST00000466557.2>
chr1:320862-320862 --> <Exon 2/3 of ENST00000432964.1>  # 1st candidate
chr1:320862-320862 --> <Exon 2/4 of ENST00000601486.1>  # 2nd candidate
chr1:320862-320862 --> <Exon 1/3 of ENST00000599771.2>  # 3rd candidate
```

--------------------------------

### Example VCF File Structure

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Provides a sample VCF (Variant Call Format) file content, illustrating its header information, column definitions, and example variant entries.

```VCF
##fileformat=VCFv4.2
##reference=GRCh37
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM POS     ID REF ALT QUAL FILTER INFO    FORMAT sample1  sample2  sample3
1      949523  .  C   T   .    .      AF=0.00 GT:AD  0/0:0,1  0/1:0,2  0/0:0,3
1      949608  .  G   A   .    .      AF=0.01 GT:AD  0/0:0,4  0/1:0,5  0/0:0,6
1      949696  .  -   G   .    .      AF=0.02 GT:AD  0/0:0,7  0/1:0,8  0/1:0,9
1      949739  .  G   TC  .    .      AF=0.03 GT:AD  0/1:0,10 0/0:0,11 1/1:0,12
1      977028  .  G   T   .    .      AF=0.04 GT:AD  0/1:0,13 0/0:0,14 1/1:0,15
1      977330  .  T   C   .    .      AF=0.05 GT:AD  0/1:0,16 0/0:0,17 ./.:0,18
1      977516  .  -   C   .    .      AF=0.06 GT:AD  1/1:0,19 1/1:0,20 ./.:0,21
1      977570  .  G   A   .    .      AF=0.07 GT:AD  1/1:0,22 1/1:0,23 ./.:0,24
1      978604  .  CT  -   .    .      AF=0.08 GT:AD  1/1:0,25 1/1:0,26 ./.:0,27
1      978628  .  C   T   .    .      AF=0.09 GT:AD  ./.:28,0 0/0:29,0 ./.:30,0
```
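
To make the column layout concrete, here is a minimal plain-Python sketch (standard library only, not GenomeKit's `VCFTable` reader) that splits one data row of the sample above into named fields. The helper name `parse_vcf_row` is hypothetical. The row is split on whitespace because the sample is space-aligned, whereas real VCF files are tab-delimited.

```python
# Column names taken from the "#CHROM ..." header line of the sample file.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def parse_vcf_row(line, sample_names):
    """Split one VCF data row into fixed fields plus per-sample FORMAT values."""
    parts = line.split()
    record = dict(zip(FIXED_COLUMNS, parts[:len(FIXED_COLUMNS)]))
    record["POS"] = int(record["POS"])  # VCF positions are 1-based
    # INFO is a ';'-separated list of KEY=VALUE pairs, e.g. "AF=0.00"
    record["INFO"] = dict(item.split("=", 1) for item in record["INFO"].split(";"))
    # Each sample column follows the FORMAT key order, e.g. "GT:AD" -> "0/1:0,2"
    keys = record["FORMAT"].split(":")
    sample_values = parts[len(FIXED_COLUMNS):]
    record["samples"] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, sample_values)
    }
    return record

row = "1      949523  .  C   T   .    .      AF=0.00 GT:AD  0/0:0,1  0/1:0,2  0/0:0,3"
record = parse_vcf_row(row, ["sample1", "sample2", "sample3"])
print(record["POS"] - 1)             # 0-based position: 949522
print(record["INFO"]["AF"])          # 0.00
print(record["samples"]["sample2"])  # {'GT': '0/1', 'AD': '0,2'}
```

The sketch assumes every INFO entry has the form `KEY=VALUE`; flag-style INFO entries without `=` would need extra handling.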

--------------------------------

### Instantiating a Variant Object

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to create a `Variant` object using chromosome, 0-based position, reference allele, alternate allele, and reference genome. The example also demonstrates its string representation.

```Python
>>> variant = Variant("chr7", 117120148, "AT", "G", "hg19")
>>> variant
<Variant chr7:117120148:AT:G:hg19>
```

--------------------------------

### Example VCF File Content

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This block shows a sample VCF (Variant Call Format) file, including its header information and several variant entries. This file is used as input for subsequent GenomeKit operations.

```VCF
##fileformat=VCFv4.2
##reference=GRCh37
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM POS     ID REF ALT QUAL FILTER INFO    FORMAT sample1  sample2  sample3
1      949523  .  C   T   .    .      AF=0.00 GT:AD  0/0:0,1  0/1:0,2  0/0:0,3
1      949608  .  G   A   .    .      AF=0.01 GT:AD  0/0:0,4  0/1:0,5  0/0:0,6
1      949696  .  -   G   .    .      AF=0.02 GT:AD  0/0:0,7  0/1:0,8  0/1:0,9
1      949739  .  G   TC  .    .      AF=0.03 GT:AD  0/1:0,10 0/0:0,11 1/1:0,12
1      977028  .  G   T   .    .      AF=0.04 GT:AD  0/1:0,13 0/0:0,14 1/1:0,15
1      977330  .  T   C   .    .      AF=0.05 GT:AD  0/1:0,16 0/0:0,17 ./.:0,18
1      977516  .  -   C   .    .      AF=0.06 GT:AD  1/1:0,19 1/1:0,20 ./.:0,21
1      977570  .  G   A   .    .      AF=0.07 GT:AD  1/1:0,22 1/1:0,23 ./.:0,24
1      978604  .  CT  -   .    .      AF=0.08 GT:AD  1/1:0,25 1/1:0,26 ./.:0,27
1      978628  .  C   T   .    .      AF=0.09 GT:AD  ./.:28,0 0/0:29,0 ./.:30,0
```

--------------------------------

### Example Output of Filtered Acceptor Site DNA Sequences

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows example 10-nucleotide sequences extracted around acceptor sites after applying filtering criteria, demonstrating the sense-strand nature of the output.

```Text
TGCAGGGAAC   # Note they are all sense-strand (AG)
TTCAGCTGCT   # because exon.end5 knows the strand.
TGTAGGAAAC
TCCAGGCTAT
GCCAGAGGAC
GACAGAACCA
CCCAGATTGG
...
```

--------------------------------

### Initialize GenomeKit Interval Object

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to create an "Interval" object, specifying chromosome, strand, start, end, and reference genome. Intervals are 0-based with exclusive end.

```Python
interval = Interval("chr7", "+", 117120016, 117120201, "hg19")
```

--------------------------------

### Setup Conda Environment for M1 Macs

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Provides an alternative Conda setup for M1 Mac users: it creates a `cxx` environment with a C++ compiler and zlib, then installs the remaining dependencies from a file.

```Shell
conda create -n cxx cxx-compiler zlib
conda activate cxx
conda install -c conda-forge -c bioconda --file a-file-with-the-deps-from-genomekit_dev-yml.txt
```

--------------------------------

### Run GenomeKit with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to run the GenomeKit Docker image in an interactive session and import the `genome_kit` library within Python.

```bash
docker run -it --rm deepgenomicsinc/genomekit:latest python
```

```python
import genome_kit
```

--------------------------------

### Clone GenomeKit Repository for Data Generation

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Commands to clone the GenomeKit GitHub repository and navigate into its directory, preparing for local data generation.

```Bash
git clone https://github.com/deepgenomics/GenomeKit.git
pushd GenomeKit
```

--------------------------------

### Example Interval Object Initialization

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates the creation of an `Interval` object, highlighting a special case where `anchor_offset` is used to indicate a motif match within an insertion. In such cases, the position within the insertion has no direct alignment to the reference genome.

```Python
Interval("chr7", "+", 117232020, 117232020, "hg19", 117232020)
```

--------------------------------

### Generate GenomeKit Annotation Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Command to build annotation data files (e.g., NCBI v105.20190906 for hg19) using the GenomeKit Docker image. This requires the assembly to be built first and uses the same data output directory setup.

```Bash
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/NCBI/v105.20190906 /output
```

--------------------------------

### Import GenomeKit Package

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Imports the main GenomeKit package, aliasing it as 'gk' for brevity in subsequent code.

```Python
import genome_kit as gk
```

--------------------------------

### Importing the VCFTable Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to import the `VCFTable` class from `genome_kit` for working with binary VCF files.

```Python
>>> from genome_kit import VCFTable
```

--------------------------------

### Run GenomeKit with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Command to run GenomeKit interactively using its official Docker image, allowing direct Python interaction within the container.

```Bash
$ docker run -it --rm deepgenomicsinc/genomekit:latest python
>>> import genome_kit
```

--------------------------------

### Importing the GenomeKit Package

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates the basic import of the `genome_kit` package into a Python project, typically aliased as `gk` for convenience in subsequent code.

```python
>>> import genome_kit as gk
```

--------------------------------

### Importing the Variant Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to import the `Variant` class from the `genome_kit` library to begin working with genomic variants.

```Python
from genome_kit import Variant
```

--------------------------------

### Calculate GenomeKit Interval Length

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to get the length (number of bases) of an "Interval" object, which is calculated as "end - start".

```Python
len(interval)
```
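
The `end - start` rule follows directly from the 0-based, exclusive-end convention. The sketch below is plain Python with a hypothetical helper name (`interval_length`), not GenomeKit code. It uses the coordinates from the quickstart interval and shows that the convention matches Python slicing.

```python
def interval_length(start, end):
    """Number of bases in a 0-based, end-exclusive interval [start, end)."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start

# Coordinates of the interval used in the quickstart examples
length = interval_length(117120016, 117120201)
print(length)  # 185

# Same convention as Python slicing: seq[start:end] has end - start items
seq = "ACGTACGTAC"
assert len(seq[3:7]) == interval_length(3, 7)
```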

--------------------------------

### Initialize GenomeKit Genome Object and Get DNA Sequence

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates creating a "Genome" object for a specific reference genome (e.g., 'hg19') and then using it to retrieve the DNA sequence for a given "Interval".

```Python
genome = Genome("hg19")  # Open the hg19 reference genome
genome.dna(interval)
```

--------------------------------

### Traverse Genome Annotation Hierarchy in Python

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

This example demonstrates how to programmatically walk through the hierarchical structure of genomic annotations provided by GenomeKit. It shows how to access genes, transcripts, and exons from a `Genome` object and iterate through them to print their details.

```python
from genome_kit import Genome

genome = Genome("gencode.v19")
for gene in genome.genes:          # Each gene
    print(gene)
    for tran in gene.transcripts:  # Each transcript on the gene
        print("  ", tran)
        for exon in tran.exons:    # Each exon on the transcript
            print("     ", exon)
```

```bash
$ python demos/walk_annotations.py
<Gene ENSG00000223972.4 (DDX11L1)>
   <Transcript ENST00000456328.2 of DDX11L1>
      <Exon 1/3 of ENST00000456328.2>
      <Exon 2/3 of ENST00000456328.2>
      <Exon 3/3 of ENST00000456328.2>
...
```

--------------------------------

### Loading VCF File with VCFTable

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to open a gzipped VCF file using `VCFTable.from_vcf`, specifying the reference genome and which INFO and FORMAT fields to load, and shows the resulting `VCFTable` object.

```Python
>>> vcf = VCFTable.from_vcf("test.vcf.gz", Genome("hg19"), info_ids=["AF"], fmt_ids=["GT", "AD"])
>>> vcf
<VCFTable, len() = 10>
```

--------------------------------

### Creating VariantGenome Objects with Single and Multiple Variants

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This Python example demonstrates how to instantiate `VariantGenome` objects in GenomeKit. It shows the creation of variant genomes with a single variant (substitution or deletion) and with a list of multiple variants, illustrating how multiple variants are applied collectively to the reference genome.

```python
from genome_kit import Genome, VariantGenome

ref = Genome("hg19")
var1 = VariantGenome(ref, ref.variant("chr7:117120188:A:T"))    # rs397508673 (A>T)
var2 = VariantGenome(ref, ref.variant("chr7:117120190:A:-"))    # rs397508710 (delA)
var3 = VariantGenome(ref, [ref.variant(x) for x in ["chr7:117120188:A:T",
                           "chr7:117120190:A:-"]])  # both variants together
```
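
As a rough mental model of what "applying variants to the reference" means, the following plain-Python sketch edits a toy string. It is an illustration under simplifying assumptions, not how `VariantGenome` is implemented: variants are hypothetical `(start, ref, alt)` tuples with 0-based starts, an empty `alt` denotes a deletion, and variants must not overlap.

```python
def apply_variants(reference, variants):
    """Apply (start, ref, alt) variants to a reference string.

    `start` is 0-based; an empty `alt` string denotes a deletion.
    Variants are applied right to left so that an edit never shifts
    the coordinates of variants that are still to be applied.
    """
    sequence = reference
    for start, ref, alt in sorted(variants, reverse=True):
        end = start + len(ref)
        if sequence[start:end] != ref:
            raise ValueError(f"reference allele mismatch at {start}")
        sequence = sequence[:start] + alt + sequence[end:]
    return sequence

reference = "GATTACA"
snv = (3, "T", "C")      # substitution T>C at index 3
deletion = (5, "C", "")  # delete the C at index 5

print(apply_variants(reference, [snv]))            # GATCACA
print(apply_variants(reference, [deletion]))       # GATTAA
print(apply_variants(reference, [snv, deletion]))  # GATCAA
```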

--------------------------------

### Create GenomeKit Variant Object from Parameters

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to instantiate a `Variant` object using its constructor, providing chromosome, 0-based position, reference allele, alternate allele, and reference genome.

```Python
>>> variant = Variant("chr7", 117120148, "AT", "G", "hg19")
>>> variant
<Variant chr7:117120148:AT:G:hg19>
```

--------------------------------

### Implementing and Registering a Custom GenomeKit DataManager

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates how to extend the `DataManager` class to provide a custom mechanism for storing and retrieving GenomeKit data files. This example includes methods for initialization, file retrieval (`get_file`), and file upload (`upload_file`), along with the code to register the custom manager with GenomeKit. It also mentions the alternative of using a plugin package.

```Python
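from typing import Dict, Optional

import genome_kit as gk

class MyDataManager(DataManager):
    def __init__(self, data_dir: str):
        ...

    def get_file(self, filename: str) -> str:
        ...

    def upload_file(self, filepath: str, filename: str, metadata: Optional[Dict[str, str]] = None):
        ...

# __init__ requires a data directory; "/path/to/data" is a placeholder
gk.gk_data.data_manager = MyDataManager("/path/to/data")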
```
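
The method bodies above are left as `...`. One possible storage strategy is a plain local directory, sketched below with the standard library only. `LocalDirManager` is a hypothetical stand-in that mirrors the method names but does not inherit from GenomeKit's `DataManager`. It ignores `metadata`, and a real implementation would need to follow the base class contract.

```python
import os
import shutil
import tempfile

class LocalDirManager:
    """Stand-in with the same method names as the class sketched above.

    It does not inherit from GenomeKit's DataManager; it only illustrates
    one possible storage strategy (a plain local directory).
    """

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)

    def get_file(self, filename: str) -> str:
        """Return the local path of a stored file, or raise if it is missing."""
        path = os.path.join(self.data_dir, filename)
        if not os.path.exists(path):
            raise FileNotFoundError(filename)
        return path

    def upload_file(self, filepath: str, filename: str, metadata=None):
        """Copy a local file into the managed directory under `filename`."""
        shutil.copyfile(filepath, os.path.join(self.data_dir, filename))

# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.txt")
    with open(source, "w") as handle:
        handle.write("ACGT")
    manager = LocalDirManager(os.path.join(tmp, "store"))
    manager.upload_file(source, "genome.txt")
    with open(manager.get_file("genome.txt")) as handle:
        stored = handle.read()
print(stored)  # ACGT
```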

--------------------------------

### Creating Interval Objects and Checking Length

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to instantiate `Interval` objects by specifying chromosome, strand, start, end coordinates, and the genome build. It also illustrates how to retrieve the length of an `Interval` using the built-in `len()` function.

```python
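a = Interval("chr1", "+", 0, 5, "hg38")
b = Interval("chr1", "+", 5, 10, "hg38")
c = Interval("chr1", "+", 3, 7, "hg38")
d = Interval("chr1", "+", 3, 4, "hg38")

len(a), len(b), len(c), len(d)
# (5, 5, 4, 1)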
```

--------------------------------

### Initializing a GenomeKit Interval Object

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates the creation of an `Interval` object, which represents a specific genomic region. An interval is defined by its chromosome, strand, start and end coordinates, and the reference genome it belongs to.

```python
>>> interval = Interval("chr7", "+", 117120016, 117120201, "hg19")
```

--------------------------------

### Accessing Variant Interval Attributes

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Illustrates how to retrieve the start, end, and length of a variant's reference allele interval, and how to access the underlying `Interval` object.

```Python
>>> variant.start, variant.end, len(variant)
(117120148, 117120150, 2)

>>> variant.interval
Interval("chr7", "+", 117120148, 117120150, "hg19")
```

--------------------------------

### Clone GenomeKit Repository for Local Data Generation

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Commands to clone the GenomeKit GitHub repository and change the current directory into the cloned repository, which is necessary for generating local data files.

```bash
git clone https://github.com/deepgenomics/GenomeKit.git
pushd GenomeKit
```

--------------------------------

### Access Variant Interval Properties

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows how a `Variant` object behaves as a subclass of `Interval`, allowing access to `start`, `end`, and `len` properties, and how to retrieve its underlying `Interval` object which spans the reference allele.

```Python
>>> variant.start, variant.end, len(variant)
(117120148, 117120150, 2)

>>> variant.interval
Interval("chr7", "+", 117120148, 117120150, "hg19")
```

--------------------------------

### Initialize and Perform Basic Operations on GenomeKit Intervals in Python

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This Python snippet demonstrates the creation of `Interval` objects with specified chromosome, strand, start, end, and genome build. It then showcases fundamental operations like calculating interval length, checking for containment of one interval within another, detecting overlaps, determining upstream/downstream relationships, and comparing intervals for equality.

```python
>>> #  0123456789
>>> #  aaaaabbbbb
>>> #     cccc
>>> #     d
>>> a = Interval("chr1", "+", 0,  5, "hg38")
>>> b = Interval("chr1", "+", 5, 10, "hg38")
>>> c = Interval("chr1", "+", 3,  7, "hg38")
>>> d = Interval("chr1", "+", 3,  4, "hg38")

>>> len(a), len(b), len(c), len(d)
(5, 5, 4, 1)

>>> a.contains(c), c.within(a), a.contains(d), d.within(a)
(False, False, True, True)

>>> a.overlaps(b), a.overlaps(c)
(False, True)

>>> a.upstream_of(b), b.dnstream_of(a)
(True, True)
>>> c.upstream_of(b), b.dnstream_of(c)
(False, False)

>>> a == b, a == d
(False, False)
>>> a != b, a != d
(True, True)
```
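
The results above follow from simple comparisons on half-open coordinates. The sketch below reproduces three of the checks in plain Python. It is a simplified model, not GenomeKit's implementation: `Span` and the helper functions are hypothetical names, and the code assumes all intervals lie on the "+" strand of the same chromosome.

```python
from collections import namedtuple

# Minimal stand-in for an interval on the "+" strand of one chromosome
Span = namedtuple("Span", ["start", "end"])

def contains(outer, inner):
    return outer.start <= inner.start and inner.end <= outer.end

def overlaps(x, y):
    # Half-open intervals overlap only if each starts before the other ends
    return x.start < y.end and y.start < x.end

def upstream_of(x, y):
    # On the "+" strand, x is upstream of y if x ends at or before y's start
    return x.end <= y.start

a, b, c, d = Span(0, 5), Span(5, 10), Span(3, 7), Span(3, 4)

print(contains(a, c), contains(a, d))        # False True
print(overlaps(a, b), overlaps(a, c))        # False True
print(upstream_of(a, b), upstream_of(c, b))  # True False
```

Because `a` ends exactly where `b` starts, the two touch without sharing a base, which is why `overlaps(a, b)` is `False` while `upstream_of(a, b)` is `True`.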

--------------------------------

### Import Core GenomeKit Types

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Imports essential GenomeKit classes, "Genome" and "Interval", directly into the current namespace for easier access.

```Python
from genome_kit import Genome
from genome_kit import Interval
...
```

--------------------------------

### Building Genome Tracks with Strand Awareness

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates the use of `GenomeTrackBuilder` to create custom genomic tracks, highlighting the impact of the `strandedness` argument. It shows examples for both 'strand_unaware' and 'strand_aware' modes, illustrating how data is ordered and retrieved based on the specified strand behavior.

```python
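>>> import numpy as np
>>> from genome_kit import Genome, GenomeTrack, GenomeTrackBuilder, Interval
>>> track = GenomeTrackBuilder("neg.gtrack", "u3", "strand_unaware", Genome("hg19"))
>>> interval = Interval("chr1", "-", 10, 15, "hg19")  # same reference genome as the track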
>>> track.set_data(interval, np.arange(0, len(interval), dtype=np.uint8))
>>> track.finalize()
>>> track = GenomeTrack("neg.gtrack")
>>> track(interval)
array([[4],
       [3],
       [2],
       [1],
       [0]], dtype=uint8)
>>> track = GenomeTrackBuilder("neg.gtrack", "u3", "strand_aware", Genome("hg19"))
>>> track.set_data(interval, np.arange(0, len(interval), dtype=np.uint8))
>>> track.finalize()
>>> track = GenomeTrack("neg.gtrack")
>>> track(interval)
array([[0],
       [1],
       [2],
       [3],
       [4]], dtype=uint8)
```
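
One way to read the two outputs above is with a toy model that stores values in reference-strand order and returns queries in sense-strand order. The sketch below is consistent with the outputs shown but is only an assumption about the ordering rules. `ToyTrack` is a hypothetical class, it keeps a single array, and it says nothing about GenomeKit's actual file format or per-strand storage.

```python
class ToyTrack:
    """Toy single-chromosome track storing values in reference-strand order."""

    def __init__(self, length, strandedness):
        self.values = [0] * length
        self.strandedness = strandedness

    def set_data(self, start, end, strand, data):
        data = list(data)
        # Assumption: "strand_aware" input is given in sense-strand order,
        # so negative-strand data is reversed before storing.
        if self.strandedness == "strand_aware" and strand == "-":
            data = data[::-1]
        self.values[start:end] = data

    def query(self, start, end, strand):
        # Results are returned in sense-strand order of the queried interval
        window = self.values[start:end]
        return window[::-1] if strand == "-" else window

unaware = ToyTrack(20, "strand_unaware")
unaware.set_data(10, 15, "-", range(5))
print(unaware.query(10, 15, "-"))  # [4, 3, 2, 1, 0]

aware = ToyTrack(20, "strand_aware")
aware.set_data(10, 15, "-", range(5))
print(aware.query(10, 15, "-"))    # [0, 1, 2, 3, 4]
```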

--------------------------------

### Initializing Genome Object and Retrieving DNA Sequence

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Explains how to instantiate a `Genome` object for a specific reference genome. It then demonstrates using this `Genome` object to retrieve the DNA sequence corresponding to a given `Interval`.

```python
>>> genome = Genome("hg19")  # Open the hg19 reference genome
>>> genome.dna(interval)
'AATTGGAAGCAAA...AACTTTTTTTCAG'
```

--------------------------------

### Generate GenomeKit Assembly Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Command to build assembly data files (e.g., hg19) using the GenomeKit Docker image. It sets the data output directory and runs the build script for the specified assembly.

```Bash
export GENOMEKIT_DATA_DIR=$(python -c "import os ; import appdirs ; print(os.environ.get('GENOMEKIT_DATA_DIR', appdirs.user_data_dir('genome_kit')))")

docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/assembly /output
```

--------------------------------

### Clone GenomeKit Source Repository

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Clones the GenomeKit source code from its GitHub repository to your local machine, initiating the development setup.

```Shell
git clone git@github.com:deepgenomics/GenomeKit.git
```

--------------------------------

### API Reference: VCFTable Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Documentation for the `VCFTable` class, used to open and query binary VCF files, returning `Variant`-based objects and allowing access to INFO and FORMAT fields.

```APIDOC
VCFTable:
  - Description: Provides an interface to compact, indexed binary VCF files.
  - Class Methods:
    - from_vcf(file_path: str, genome: Genome, info_ids: list = None, fmt_ids: list = None) -> VCFTable
      - Description: Opens a VCF file and loads specified INFO and FORMAT fields.
      - Parameters:
        - file_path: str (Path to the VCF file, e.g., 'test.vcf.gz')
        - genome: Genome (Genome object for reference)
        - info_ids: list (Optional list of INFO field IDs to load)
        - fmt_ids: list (Optional list of FORMAT field IDs to load)
  - Methods:
    - __getitem__(index: int) -> VCFVariant
      - Description: Accesses a VCFVariant object by its 0-based index.
    - info(info_id: str) -> numpy.ndarray
      - Description: Retrieves all values for a specified INFO field as a NumPy array.
      - Parameters:
        - info_id: str (The ID of the INFO field)
    - find_within(interval: Interval) -> list[VCFVariant]
      - Description: Finds all VCFVariant objects that fall within the given interval.
      - Parameters:
        - interval: Interval (The genomic interval to query)
    - index_of(variant: VCFVariant) -> int
      - Description: Returns the 0-based index of a VCFVariant object within the VCFTable.
      - Parameters:
        - variant: VCFVariant (The VCFVariant object to find the index for)
    - format(format_id: str) -> numpy.ndarray
      - Description: Retrieves per-sample format data (e.g., GT, AD) as a NumPy array.
      - Parameters:
        - format_id: str (The ID of the FORMAT field)
      - Returns: numpy.ndarray (Shape depends on data, e.g., (num_variants, num_samples) or (num_variants, num_samples, num_alleles))
```

--------------------------------

### Extract DNA Sequence from GenomeKit Interval

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to extract DNA sequences from `Interval` objects using the `dna` attribute of a `Genome` instance. It shows how to get both forward and reverse-complemented sequences based on the strand.

```Python
>>> a = Interval("chr7", "+", 117120016, 117120201, "hg19")
>>> b = a.as_opposite_strand()

>>> genome = Genome("hg19")
>>> genome.dna(a)
'AATTGGAAGCAAA...AACTTTTTTTCAG'
>>> genome.dna(b)
'CTGAAAAAAAGTT...TTTGCTTCCAATT'
```
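
The relationship between the two outputs is a reverse complement. The following plain-Python sketch (hypothetical helper, independent of GenomeKit) shows the operation on the first six bases of the "+" strand result, which become the last six bases of the "-" strand result.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

# First six bases of the "+" strand sequence shown above
head = "AATTGG"
print(reverse_complement(head))  # CCAATT
```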

--------------------------------

### Install GenomeKit in Develop Mode

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Installs the GenomeKit source tree in editable (develop) mode. This step is crucial for enabling the `build` subcommand to correctly locate and utilize test data directories within the source tree.

```Bash
pip install -e .
```

--------------------------------

### Install GenomeKit in Development Mode

Source: https://deepgenomics.github.io/GenomeKit/api.html/develop

Installs the GenomeKit package in editable development mode. This command builds the C++ extension and links it into your Python `site-packages`, allowing `import genome_kit` from any directory and reflecting local source changes.

```Shell
pip install -e .
```

--------------------------------

### Importing Core GenomeKit Types

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Shows how to directly import essential classes like `Genome` and `Interval` into the current namespace. This practice simplifies code by avoiding the need to prefix object instantiations with `genome_kit`.

```python
>>> from genome_kit import Genome
>>> from genome_kit import Interval
...
```

--------------------------------

### Perform Basic Motif Search in GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Illustrates how to use `genome_kit.Genome.find_motif` to search for a specific DNA motif within a genomic interval on a reference genome. The example also shows how to expand the returned empty interval for further feature extraction.

```Python
from genome_kit import Genome, Interval

genome = Genome('hg19')

# Short sequence from CFTR
interval = Interval('chr7', '+', 117231957, 117232030, genome)
genome.dna(interval)

motif = 'AACAA'
matches = genome.find_motif(interval, motif)
matches[0].expand(5, 5)
```
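
Conceptually, a motif search scans the interval's DNA for every occurrence of the motif and reports where each match sits on the genome. The sketch below shows that idea on a plain string. It is not `find_motif` itself: the function name, the `offset` parameter, and the toy sequence are all made up for illustration, and it reports the position of the first base of each match.

```python
def find_motif_positions(sequence, motif, offset=0):
    """Return 0-based positions of every (possibly overlapping) motif match.

    `offset` is the genomic coordinate of sequence[0], so returned values
    are genomic positions of the first base of each match.
    """
    positions = []
    index = sequence.find(motif)
    while index != -1:
        positions.append(offset + index)
        index = sequence.find(motif, index + 1)
    return positions

# Toy sequence (not real CFTR DNA) placed at a made-up genomic offset
toy = "GGAACAATTAACAACAA"
hits = find_motif_positions(toy, "AACAA", offset=1000)
print(hits)  # [1002, 1009, 1012]
```

Under this model, expanding a zero-length match position `p` by 5 on each side corresponds to the half-open window `[p - 5, p + 5)`.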

--------------------------------

### APIDOC: genome_kit.Genome Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documentation for the `Genome` class, providing convenient access to resources associated with a reference genome. It outlines the constructor and methods for retrieving genomic data like DNA sequences.

```APIDOC
Class: Genome
Description: Resources available for a reference genome.
Constructor:
  __init__(genome_name: str)
    genome_name: The name of the reference genome (e.g., "hg19", "gencode.v19").
Methods:
  dna(interval: Interval): Returns the DNA sequence for the given interval.
Properties:
  genes: Access to gene annotations (available when genome is versioned, e.g., "gencode.v19").
```

--------------------------------

### Importing VCFTable from GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This snippet demonstrates how to import the `VCFTable` class from the `genome_kit` library, which is essential for working with VCF files.

```Python
from genome_kit import VCFTable
```

--------------------------------

### APIDOC: genome_kit.Interval Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documentation for the `Interval` class, representing a genomic interval. It details the constructor parameters and key properties/methods for manipulating interval data.

```APIDOC
Class: Interval
Description: An interval on a reference genome.
Constructor:
  __init__(chromosome: str, strand: str, start: int, end: int, reference_genome: str)
    chromosome: The chromosome name (e.g., "chr7").
    strand: The strand ("+" or "-").
    start: The 0-based start position (inclusive).
    end: The 0-based end position (exclusive).
    reference_genome: The reference genome name (e.g., "hg19").
Methods:
  len(interval): Returns the number of bases spanned by the interval (end - start).
  as_ucsc(): Returns the interval in UCSC browser's "1-based, inclusive end" format.
```
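
Converting between the 0-based, exclusive-end convention and the "1-based, inclusive end" convention only changes the start coordinate. The sketch below illustrates the arithmetic in plain Python. The helper names are hypothetical, and the `chrom:start-end` string layout is an assumption, since the source does not show the exact output of `as_ucsc()`.

```python
def to_ucsc(chromosome, start, end):
    """Format a 0-based, end-exclusive interval as 1-based, end-inclusive."""
    # Only the start shifts by one: [start, end) covers the same bases
    # as the closed range [start + 1, end].
    return f"{chromosome}:{start + 1}-{end}"

def from_ucsc(text):
    """Parse 'chrom:start-end' (1-based, inclusive) back to 0-based half-open."""
    chromosome, span = text.split(":")
    start, end = span.replace(",", "").split("-")
    return chromosome, int(start) - 1, int(end)

ucsc = to_ucsc("chr7", 117120016, 117120201)
print(ucsc)             # chr7:117120017-117120201
print(from_ucsc(ucsc))  # ('chr7', 117120016, 117120201)
```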

--------------------------------

### Opening a VCF File and Accessing Variants

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

This code demonstrates how to open a gzipped VCF file (`test.vcf.gz`) using `VCFTable.from_vcf`, specifying the genome build and which INFO and FORMAT fields to carry over. It also shows how to inspect the `VCFTable` object and access an individual `VCFVariant` by index.

```Python
>>> vcf = VCFTable.from_vcf("test.vcf.gz", Genome("hg19"), info_ids=["AF"], fmt_ids=["GT", "AD"])
>>> vcf
<VCFTable, len() = 10>
>>> vcf[0]
<VCFVariant chr1:949522:C:T:hg19>
```

--------------------------------

### Access Versioned Genomic Resources with GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Shows how to initialize a "Genome" object with a versioned resource (e.g., 'gencode.v19') to enable access to specific annotations like genes, transcripts, and exons, and then retrieve DNA sequences for these objects.

```Python
genome = Genome("gencode.v19")             # Implies "hg19"
gene = genome.genes["ENSG00000001626.10"]  # Gene object
tran = gene.transcripts[2]                 # Transcript object
exon = tran.exons[0]                       # Exon object
genome.dna(exon)
```

--------------------------------

### Creating Variant from 1-Based String

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates two ways to create a `Variant` object from a 1-based string representation, the convention used by UCSC and ClinVar. The string is validated against the specified genome.

```Python
>>> genome = Genome("hg19")
>>> variant = genome.variant("chr7:117,120,149:AT:G")               # First way
>>> variant = Variant.from_string('chr7:117,120,149:AT:G', genome)  # Second way
>>> variant
<Variant chr7:117120148:AT:G:hg19>
```
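
The example above takes position 117,120,149 as input and reports 117120148 in the resulting `Variant`, which reflects the shift from 1-based to 0-based coordinates. The plain-Python sketch below shows that conversion. `parse_variant_string` is a hypothetical helper, not the GenomeKit parser, and it performs no validation against a genome.

```python
def parse_variant_string(text):
    """Parse 'chrom:pos:ref:alt' with a 1-based position into 0-based fields."""
    chromosome, position, ref, alt = text.split(":")
    start = int(position.replace(",", "")) - 1  # 1-based -> 0-based
    end = start + len(ref)                      # interval spans the reference allele
    return chromosome, start, end, ref, alt

parsed = parse_variant_string("chr7:117,120,149:AT:G")
print(parsed)  # ('chr7', 117120148, 117120150, 'AT', 'G')
```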

--------------------------------

### Extract DNA Features from Reference and Variant Genomes

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

This example illustrates how to extract DNA sequences from both a reference genome and a variant genome using a single feature extraction function. It defines a function that retrieves a specific transcript, expands its 5' end, and extracts the DNA sequence, demonstrating the transparent handling of `Genome` and `VariantGenome` objects.

```Python
from genome_kit import Genome, Variant, VariantGenome

def extract_features(genome):
    tran = genome.transcripts["ENST00000426809.1"]   # CFTR transcript
    span = tran.end5.expand(2, 5)                    # 7nt span at 5' end
    return genome.dna(span)                          # extract DNA

ref = Genome("gencode.v19")
variants = [Variant.from_string("chr7:117120149:A:G", ref),     # rs397508328
            Variant.from_string("chr7:117120151:G:T", ref)]     # rs397508657
var = VariantGenome(ref, variants)
print(extract_features(ref))
print(extract_features(var))
```

--------------------------------

### Import GenomeKit Variant Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Imports the `Variant` class from the `genome_kit` library, which is used to represent individual genomic variants.

```Python
from genome_kit import Variant
```

--------------------------------

### Python Example: Uploading and Getting GenomeKit Files

Source: https://deepgenomics.github.io/GenomeKit/api.html/_modules/genome_kit/gk_data

Demonstrates a typical workflow for managing files with GenomeKit. This example shows how to upload a local file using `upload_file` to make it accessible, and then retrieve it using `get_file`, which handles on-demand downloads and returns the file's local path.

```Python
>>> upload_file('/local/path/hg38.2bit', 'hg38.2bit')
>>> get_file('hg38.2bit')
"/Users/example/Application Support/genome_kit/hg38.2bit"
```

--------------------------------

### Generate Genome Annotation Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Command to generate specific genome annotation data files (e.g., hg19.p13.plusMT/NCBI/v105.20190906) using a Docker container. This step should be performed after the corresponding assembly data has been built.

```bash
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/NCBI/v105.20190906 /output
```

--------------------------------

### Define and Use Anchored Intervals in GenomeKit

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how GenomeKit's 'anchored' intervals allow a specific position to remain aligned when an interval is lifted over to a variant genome. This example shows anchoring to the 5' or 3' end and observing the resulting DNA sequence changes on a variant genome.

```Python
from genome_kit import Genome, Interval, VariantGenome

ref = Genome("hg19")  # Must exist before it is passed to Interval below
interval = Interval("chr7", "+", 117120185, 117120192, ref)
anchored_5p = interval.with_anchor("5p")  # Anchored to its 5' end
anchored_3p = interval.with_anchor("3p")  # Anchored to its 3' end

var = VariantGenome(ref, ref.variant("chr7:117120190:A:-"))  # rs397508710 (delA)
ref.dna(interval)     # Reference sequence
var.dna(interval)     # Unanchored (shrink 3' end)
var.dna(anchored_5p)  # (fill 3' end)
var.dna(anchored_3p)  # (fill 5' end)
```
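The effect of anchoring can be illustrated without GenomeKit by using plain strings. The sketch below is a simplified model, assuming a single-base deletion strictly inside a window on the forward strand. The names `apply_deletion` and `window_on_variant` are hypothetical helpers and do not reflect how the library implements liftover.

```python
from typing import Optional

def apply_deletion(ref_seq: str, pos: int) -> str:
    """Return ref_seq with the base at 0-based position pos removed."""
    return ref_seq[:pos] + ref_seq[pos + 1:]

def window_on_variant(ref_seq: str, start: int, end: int, del_pos: int,
                      anchor: Optional[str] = None) -> str:
    """Read the half-open window [start, end) after deleting the base at del_pos.

    anchor=None -> the window shrinks by one base
    anchor="5p" -> the start stays fixed, the 3' end is filled
    anchor="3p" -> the end stays aligned, the 5' end is filled
    """
    if not start < del_pos < end - 1:
        raise ValueError("this sketch only handles deletions strictly inside the window")
    var_seq = apply_deletion(ref_seq, del_pos)
    length = end - start
    if anchor is None:
        return var_seq[start:end - 1]
    if anchor == "5p":
        return var_seq[start:start + length]
    if anchor == "3p":
        new_end = end - 1  # the end shifts left by the deleted base
        # Clamp at 0, so fewer bases come back if the window starts at position 0.
        return var_seq[max(0, new_end - length):new_end]
    raise ValueError(f"unknown anchor: {anchor!r}")

ref_seq = "ACGTACGTAC"
print(ref_seq[2:8])                                 # GTACGT
print(window_on_variant(ref_seq, 2, 8, 5))          # GTAGT
print(window_on_variant(ref_seq, 2, 8, 5, "5p"))    # GTAGTA
print(window_on_variant(ref_seq, 2, 8, 5, "3p"))    # CGTAGT
```

The unanchored window loses a base, while each anchored window keeps its original length by pulling in one base from the side opposite the anchor.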

--------------------------------

### Accessing Versioned Genomic Resources (GENCODE)

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates how to access versioned genomic resources, such as GENCODE annotations, by initializing the `Genome` object with a specific version. This allows navigation through gene, transcript, and exon objects to retrieve associated DNA sequences.

```python
>>> from genome_kit import Genome
>>> genome = Genome("gencode.v19")             # Implies "hg19"
>>> gene = genome.genes["ENSG00000001626.10"]  # Gene object
>>> tran = gene.transcripts[2]                 # Transcript object
>>> exon = tran.exons[0]                       # Exon object
>>> genome.dna(exon)
'AATTGGAAGCAAA...AACTTTTTTTCAG'
```

--------------------------------

### APIDOC: genome_kit.GenomeTrackBuilder Class

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documentation for the `GenomeTrackBuilder` class, used to construct genomic tracks. It details the constructor parameters, including the `strandedness` argument and its possible values, and methods for setting data and finalizing the track.

```APIDOC
Class: GenomeTrackBuilder
Description: Builder for creating genomic tracks.
Constructor:
  __init__(track_name: str, data_type: str, strandedness: str, genome: Genome)
    track_name: The name of the track file.
    data_type: The data type for the track.
    strandedness: Defines how data is ordered based on strand.
      Possible values:
        "single_stranded": Both strands share the same data, applied in Interval coordinate (reference strand) order.
        "strand_unaware": Ignores the Interval strand, data applied in Interval coordinate (reference strand) order.
        "strand_aware": Data applied from 5' end to 3' end (sense strand order).
    genome: The Genome object associated with the track.
Methods:
  set_data(interval: Interval, data: np.ndarray): Sets data for a specific interval.
  finalize(): Finalizes the track building process.
```
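To make the three `strandedness` values concrete, the sketch below models how a list of values supplied for an interval might map onto reference-strand order. It is an illustration of the ordering rules described above, not GenomeKit code. `to_reference_order` is a hypothetical helper, and plain lists stand in for NumPy arrays.

```python
from typing import List, Sequence

VALID_MODES = {"single_stranded", "strand_unaware", "strand_aware"}

def to_reference_order(values: Sequence[float], strand: str, strandedness: str) -> List[float]:
    """Return values arranged in reference-strand (coordinate) order.

    Only "strand_aware" data on the "-" strand needs flipping, because it is
    supplied 5' to 3' along the sense strand; the other modes are already in
    coordinate order.
    """
    if strandedness not in VALID_MODES:
        raise ValueError(f"unknown strandedness: {strandedness!r}")
    if strand not in {"+", "-"}:
        raise ValueError(f"unknown strand: {strand!r}")
    if strandedness == "strand_aware" and strand == "-":
        return list(reversed(values))
    return list(values)

print(to_reference_order([1, 2, 3], "-", "strand_aware"))    # [3, 2, 1]
print(to_reference_order([1, 2, 3], "-", "strand_unaware"))  # [1, 2, 3]
```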

--------------------------------

### GenomeKit API Reference: Interval and Motif Methods

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Detailed API documentation for key GenomeKit methods: `Interval.with_anchor` for creating anchored intervals and `Genome.find_motif` (also applicable to `VariantGenome.find_motif`) for comprehensive motif searching, including parameter descriptions and return types.

```APIDOC
Interval:
  with_anchor(mode: str) -> Interval
    mode: str
      Description: Specifies the anchoring mode. Can be "5p" (5' end), "3p" (3' end), or an integer for a specific base within the interval.
      Purpose: To create a new Interval object anchored to a specific position, ensuring that position remains aligned when lifted over to a variant genome.

Genome:
  find_motif(interval: Interval, motif: str, match_position: Union[int, str] = 0, find_overlapping_motifs: bool = False) -> List[Interval]
    interval: Interval
      Description: The genomic interval within which to search for the motif.
    motif: str
      Description: The DNA sequence string to search for.
    match_position: Union[int, str] = 0
      Description: Controls the alignment of the returned empty interval relative to the motif match.
      Values:
        - 0 or '5p': Aligns the match to the 5' end of the motif (default).
        - len(motif) or '3p': Aligns the match to the 3' end of the motif.
        - Integer (0 to len(motif)): Aligns to a specific base within the motif.
    find_overlapping_motifs: bool = False
      Description: If True, all overlapping motif matches are returned. If False (default), only non-overlapping matches are returned.
    Returns: List[Interval]
      Description: A list of empty Interval objects, each representing a motif match. The anchor of each returned interval is set to its position, ensuring alignment on variant genomes.
    Purpose: To locate occurrences of a specified DNA motif within a given genomic interval on the reference genome.

VariantGenome:
  find_motif(interval: Interval, motif: str, match_position: Union[int, str] = 0, find_overlapping_motifs: bool = False) -> List[Interval]
    Description: Similar to Genome.find_motif, but performs the search on a variant genome.
    (Parameters are identical to Genome.find_motif)
```
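The interaction between `match_position` and `find_overlapping_motifs` is easier to see on a plain string. The following sketch assumes a forward-strand sequence and returns integer offsets rather than empty `Interval` objects. `find_motif_offsets` is a hypothetical stand-in written for illustration, not the library's implementation.

```python
from typing import List, Union

def find_motif_offsets(sequence: str, motif: str,
                       match_position: Union[int, str] = 0,
                       find_overlapping_motifs: bool = False) -> List[int]:
    """Return one offset per motif match, shifted by match_position."""
    if not motif:
        raise ValueError("motif must not be empty")
    if match_position == "5p":
        match_position = 0
    elif match_position == "3p":
        match_position = len(motif)
    if not isinstance(match_position, int) or not 0 <= match_position <= len(motif):
        raise ValueError("match_position must be '5p', '3p', or 0..len(motif)")

    # Overlapping search advances one base at a time; otherwise skip the whole match.
    step = 1 if find_overlapping_motifs else len(motif)
    hits = []
    i = sequence.find(motif)
    while i != -1:
        hits.append(i + match_position)
        i = sequence.find(motif, i + step)
    return hits

print(find_motif_offsets("AAAAGT", "AA"))                                # [0, 2]
print(find_motif_offsets("AAAAGT", "AA", find_overlapping_motifs=True))  # [0, 1, 2]
print(find_motif_offsets("AAAAGT", "AA", match_position="3p"))           # [2, 4]
```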

--------------------------------

### Generate Genome Assembly Data with Docker

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Command to generate specific genome assembly data files (e.g., hg19.p13.plusMT/assembly) using a Docker container. This command mounts local data-src and output directories, and sets the GENOMEKIT_DATA_DIR within the container.

```bash
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/assembly /output
```

--------------------------------

### Walk GenomeKit Annotation Structure

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to iterate through genes, transcripts, and exons within a GenomeKit Genome object to access hierarchical annotation data. This provides a programmatic way to explore the genomic elements.

```python
from genome_kit import Genome

genome = Genome("gencode.v19")
for gene in genome.genes:          # Each gene
    print(gene)
    for tran in gene.transcripts:  # Each transcript on the gene
        print("  ", tran)
        for exon in tran.exons:    # Each exon on the transcript
            print("     ", exon)
```

--------------------------------

### Create GenomeKit Variant from UCSC/Clinvar String

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Illustrates two methods to create a `Variant` object from a 1-based (DNA1) string representation, common in UCSC and Clinvar formats, using both `genome.variant()` and `Variant.from_string()`.

```Python
>>> genome = Genome("hg19")
>>> variant = genome.variant("chr7:117,120,14
```
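For readers unfamiliar with 1-based variant strings, the sketch below shows one way such a string could be split into fields and converted to a 0-based start. It assumes the `chrom:position:ref:alt` layout with `-` denoting an empty allele and optional thousands separators in the position. `parse_variant_string` and `ParsedVariant` are hypothetical names, and this is not GenomeKit's parser.

```python
from typing import NamedTuple

class ParsedVariant(NamedTuple):
    chrom: str
    start: int  # 0-based position of the first reference base
    ref: str
    alt: str

def parse_variant_string(text: str) -> ParsedVariant:
    """Parse 'chrom:pos:ref:alt', where pos is 1-based and may contain commas."""
    chrom, pos, ref, alt = text.split(":")
    one_based = int(pos.replace(",", ""))
    if one_based < 1:
        raise ValueError("1-based positions start at 1")

    def clean(allele: str) -> str:
        # "-" is treated as an empty allele (assumption of this sketch).
        return "" if allele == "-" else allele

    return ParsedVariant(chrom, one_based - 1, clean(ref), clean(alt))

print(parse_variant_string("chr7:117,120,190:A:-"))
```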

--------------------------------

### Accessing Per-Sample Genotype and Allelic Depth

Source: https://deepgenomics.github.io/GenomeKit/api.html/_sources/quickstart.rst

Demonstrates how to extract per-sample format data like 'GT' (Genotype) and 'AD' (Allelic Depths) from the `VCFTable`, showing the shape of the resulting arrays and how to filter them by variant indices.

```Python
>>> gt = vcf.format('GT')
>>> gt.shape
(10L, 3L)
>>> gt[indices]
array([[1, 0, 0],
       [2, 2, 0],
       [2, 2, 0]], dtype=int8)

>>> ad = vcf.format('AD')
>>> ad.shape
(10L, 3L, 2L)
>>> ad[indices]
array([[[ 0, 16],
        [ 0, 17],
        [ 0, 18]],

       [[ 0, 19],
        [ 0, 20],
        [ 0, 21]],

       [[ 0, 22],
        [ 0, 23],
        [ 0, 24]]], dtype=int32)
```

--------------------------------

### Exploring Exon Object Attributes

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Showcases various attributes available on an `Exon` object obtained from versioned genomic resources. These attributes provide detailed information about the exon, including its genomic interval, index within the transcript, and references to parent and sibling objects.

```python
>>> exon.interval
Interval("chr7", "+", 117120016, 117120201, "hg19")
>>> exon.index
0
>>> exon.transcript
<Transcript ENST00000003084.6 of CFTR>
>>> exon.cds
<Cds in Exon 1/27 of ENST00000003084.6>
>>> exon.next_exon
<Exon 2/27 of ENST00000003084.6>
```

--------------------------------

### GenomeKit Annotation Table Positional Query Methods API

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Documents the methods available on GenomeKit annotation tables (e.g., `ExonTable`) for performing various positional queries. These methods are crucial for mapping genomic positions to specific annotation elements.

```APIDOC
ExonTable Methods for Positional Queries:
  find_overlapping(): elements overlapping interval.
  find_within(): elements falling within interval.
  find_exact(): elements exactly spanning interval.
  find_5p_aligned(): elements with 5' end aligned to the 5' end of interval.
  find_3p_aligned(): elements with 3' end aligned to the 3' end of interval.
  find_5p_within(): elements with 5'-most position within interval.
  find_3p_within(): elements with 3'-most position within interval.
```
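The differences between these queries can be sketched as simple predicates on 0-based, end-exclusive coordinates. The code below is a minimal model under stated assumptions: `Span` is a hypothetical stand-in for an interval, chromosome matching is ignored, and the real table methods may apply additional rules (for example around strand or empty intervals).

```python
from typing import NamedTuple

class Span(NamedTuple):
    """Hypothetical interval stand-in: 0-based start, exclusive end."""
    strand: str
    start: int
    end: int

def overlaps(elem: Span, query: Span) -> bool:
    return elem.start < query.end and query.start < elem.end

def within(elem: Span, query: Span) -> bool:
    return query.start <= elem.start and elem.end <= query.end

def exact(elem: Span, query: Span) -> bool:
    return elem.start == query.start and elem.end == query.end

def five_prime(span: Span) -> int:
    # The 5' boundary is the start on "+" and the end on "-".
    return span.start if span.strand == "+" else span.end

def aligned_5p(elem: Span, query: Span) -> bool:
    return elem.strand == query.strand and five_prime(elem) == five_prime(query)

exons = [Span("+", 100, 200), Span("+", 150, 400), Span("+", 500, 600)]
query = Span("+", 100, 300)

print([e for e in exons if overlaps(e, query)])    # first two spans
print([e for e in exons if within(e, query)])      # first span only
print([e for e in exons if aligned_5p(e, query)])  # first span only
```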

--------------------------------

### Set GenomeKit Data Directory Environment Variable

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Sets the GENOMEKIT_DATA_DIR environment variable, which specifies where GenomeKit data files will be stored. It uses a Python one-liner that leverages the appdirs library to determine the user's default data directory if the variable is not already set.

```bash
export GENOMEKIT_DATA_DIR=$(python -c "import os ; import appdirs ; print(os.environ.get('GENOMEKIT_DATA_DIR', appdirs.user_data_dir('genome_kit')))")
```
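The fallback logic in that one-liner can be expressed as a small function. This is a sketch only: `resolve_data_dir` is a hypothetical helper, and the default directory is passed in as a parameter instead of being computed with `appdirs`, so the example runs without extra dependencies.

```python
import os
from typing import Mapping

def resolve_data_dir(environ: Mapping[str, str], default_dir: str) -> str:
    """Return GENOMEKIT_DATA_DIR from environ if present, otherwise default_dir."""
    return environ.get("GENOMEKIT_DATA_DIR", default_dir)

# "/tmp/genome_kit_default" is a placeholder; the shell command above derives
# the default from appdirs.user_data_dir('genome_kit') instead.
data_dir = resolve_data_dir(os.environ, "/tmp/genome_kit_default")
print(data_dir)
```

As in the shell version, an already exported value always wins, so running the command again does not overwrite a custom location.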

--------------------------------

### Calculating Interval Length

Source: https://deepgenomics.github.io/GenomeKit/api.html/quickstart

Demonstrates how to determine the span of an `Interval` object, which is the number of bases it covers. The length is calculated as the difference between the end and start positions, adhering to a 0-based, exclusive end convention.

```python
>>> len(interval)
185
```
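As a quick check of the half-open convention, the sketch below recomputes that length from the coordinates of the chr7 exon interval used in this guide. `half_open_length` is a hypothetical helper written for illustration and is not part of GenomeKit.

```python
def half_open_length(start: int, end: int) -> int:
    """Length of a 0-based interval [start, end) whose end is exclusive."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start

# Coordinates of Interval("chr7", "+", 117120016, 117120201, "hg19").
length = half_open_length(117120016, 117120201)
print(length)  # 185
```

Because the end is exclusive, no `+ 1` correction is needed, and an interval with equal start and end has length 0.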