### Install fastQpick via PyPI Source: https://github.com/pachterlab/fastqpick/blob/main/README.md Install the fastQpick library using pip from the Python Package Index. ```bash pip install fastQpick ``` -------------------------------- ### Install fastQpick from Source Source: https://github.com/pachterlab/fastqpick/blob/main/README.md Install the fastQpick library directly from its source code repository using pip. ```bash pip install git+https://github.com/pachterlab/fastQpick.git ``` -------------------------------- ### Setup Reference, Reads, and Index for Yeast Data Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/yeast_example.ipynb Downloads and prepares the necessary reference genome, FASTQ files, and builds a kallisto index. Ensures paired-end reads are of equal length, a requirement for fastQpick. ```python def sh(cmd): subprocess.run(cmd, shell=True, check=True) # kallisto binary if not os.path.exists(KALLISTO): sh(f"cd {WORK} && wget -q https://github.com/pachterlab/kallisto/releases/download/" f ``` -------------------------------- ### Import Libraries and Setup Paths for fastQpick Analysis Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/yeast_example.ipynb Imports necessary Python libraries and defines working directories and file paths for the fastQpick analysis. Ensures output directories exist. ```python import os import glob import json import shutil import subprocess import numpy as np import matplotlib.pyplot as plt import pyfastx from fastQpick import fastQpick WORK = os.path.abspath("realdata") FIG_DIR = os.path.abspath("figures") os.makedirs(WORK, exist_ok=True) os.makedirs(FIG_DIR, exist_ok=True) KALLISTO = os.path.join(WORK, "kallisto", "kallisto") IDX = os.path.join(WORK, "yeast.idx") M1 = os.path.join(WORK, "data", "SRR453566_1.fastq.gz") M2 = os.path.join(WORK, "data", "SRR453566_2.fastq.gz") ACCESSION = "SRR453566" ``` -------------------------------- ### fastQpick Python API Usage Source: https://github.com/pachterlab/fastqpick/blob/main/README.md Example of how to import and use the fastQpick function within a Python script. ```python from fastQpick import fastQpick fastQpick( input_files=['FASTQ_FILE1', 'FASTQ_FILE2', ...], ... ) ``` -------------------------------- ### fastQpick Command-line Help Source: https://github.com/pachterlab/fastqpick/blob/main/README.md Command to display all available options and arguments for the fastQpick command-line interface. ```bash fastQpick --help ``` -------------------------------- ### fastQpick Command-line Usage Source: https://github.com/pachterlab/fastqpick/blob/main/README.md Basic command-line syntax for running fastQpick. Specify options and input FASTQ files. ```bash fastQpick [OPTIONS] FASTQ_FILE1 FASTQ_FILE2 ... ``` -------------------------------- ### Bootstrap Replicates with Replacement Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/yeast_example.ipynb This snippet sets up parameters for bootstrapping, defines a function to quantify each replicate by resampling FASTQ files with replacement and running kallisto, and then collects the results into a count matrix. It reuses a length dictionary and quantifies reads only once per replicate. ```python B = 200 # reduce for a quick run length_dict = {M1: N_READS, M2: N_READS} counts_dir = os.path.join(WORK, "counts_nb") os.makedirs(counts_dir, exist_ok=True) def quantify_replicate(seed): tmp = os.path.join(WORK, f"tmp_nb/seed{seed}") rep, kq = os.path.join(tmp, "rep"), os.path.join(tmp, "kq") fastQpick(input_files=[M1, M2], fraction=1.0, seeds=seed, output_dir=rep, file_group_size=2, replacement=True, overwrite=True, verbose=False, fastq_to_length_dict=length_dict) r1 = os.path.join(rep, "SRR453566_1.fastq") r2 = os.path.join(rep, "SRR453566_2.fastq") subprocess.run([KALLISTO, "quant", "-i", IDX, "-o", kq, "-t", "4", r1, r2], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) est = np.loadtxt(os.path.join(kq, "abundance.tsv"), skiprows=1, usecols=3) shutil.rmtree(tmp, ignore_errors=True) return est R = np.vstack([quantify_replicate(b) for b in range(1, B + 1)]) print("bootstrap count matrix (replicates x transcripts):", R.shape) ``` -------------------------------- ### fastQpick Python API Help Source: https://github.com/pachterlab/fastqpick/blob/main/README.md Python code to access the help documentation for the fastQpick function. ```python help(fastQpick) ``` -------------------------------- ### Generate Bootstrap Replicates with fastQpick Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/intro.ipynb Generates bootstrap replicates by resampling FASTQ files with replacement. The `num_samples` argument draws all replicates in a single call, with each replicate written to its own seed-suffixed file. The `fastq_to_length_dict` is reused across replicates for efficiency. ```python B = 300 length_dict = {input_fastq: N_READS} rep_dir = os.path.join(WORKDIR, "replicates") fastQpick( input_files=input_fastq, fraction=1.0, seed=1, num_samples=B, output_dir=rep_dir, without_replacement=False, overwrite=True, verbose=False, fastq_to_length_dict=length_dict, ) # One seed-suffixed output file per replicate is written into a single directory. rep_files = sorted( os.path.join(rep_dir, f) for f in os.listdir(rep_dir) if f.endswith(".fastq") ) print("Generated", len(rep_files), "bootstrap replicates with replacement") ``` -------------------------------- ### Import necessary libraries and set up directories Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/intro.ipynb Imports essential Python libraries for data manipulation, file handling, and plotting. It also defines and creates working and figure directories, removing existing ones if they are present. ```python import os import shutil from collections import Counter import numpy as np import pyfastx import matplotlib.pyplot as plt from fastQpick import fastQpick rng = np.random.default_rng(0) WORKDIR = os.path.abspath("_bootstrap_demo") FIG_DIR = os.path.abspath("figures") if os.path.exists(WORKDIR): shutil.rmtree(WORKDIR) os.makedirs(WORKDIR, exist_ok=True) os.makedirs(FIG_DIR, exist_ok=True) ``` -------------------------------- ### Simulate RNA-seq experiment and write FASTQ file Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/intro.ipynb Simulates RNA-seq reads by assigning them to transcripts based on predefined proportions. Each read is generated with a transcript-specific barcode and random padding. The simulated reads are written to a FASTQ file. This section also prints the true and observed proportions for verification. ```python TRANSCRIPTS = ["TXA", "TXB", "TXC", "TXD"] TRUE_PROPS = np.array([0.50, 0.30, 0.15, 0.05]) BARCODES = {"TXA": "AAAACCCC", "TXB": "GGGGTTTT", "TXC": "ACACACAC", "TXD": "TGTGTGTG"} BARCODE_LEN = 8 N_READS = 50_000 READ_PAD = 42 assignments = rng.choice(len(TRANSCRIPTS), size=N_READS, p=TRUE_PROPS) bases = np.array(list("ACGT")) input_fastq = os.path.join(WORKDIR, "reads.fastq") with open(input_fastq, "w") as fh: for i, t in enumerate(assignments): barcode = BARCODES[TRANSCRIPTS[t]] pad = "".join(rng.choice(bases, size=READ_PAD)) seq = barcode + pad qual = "I" * len(seq) fh.write(f"@read{i}\n{seq}\n+\n{qual}\n") observed_counts = Counter(TRANSCRIPTS[t] for t in assignments) observed_props = np.array([observed_counts[t] / N_READS for t in TRANSCRIPTS]) print("Wrote", N_READS, "reads to", input_fastq) for t, p_true, p_obs in zip(TRANSCRIPTS, TRUE_PROPS, observed_props): print(f" {t}: true={p_true:.3f} observed={p_obs:.4f}") ``` -------------------------------- ### Baseline Quantification with Kallisto Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/yeast_example.ipynb Quantifies the original library using kallisto to obtain transcript proportions and pseudoaligned read counts. This sets the basis for analyzing sampling error. ```python base_dir = os.path.join(WORK, "kout_baseline") sh(f"{KALLISTO} quant -i {IDX} -o {base_dir} -t 4 {M1} {M2}") targets = np.loadtxt(os.path.join(base_dir, "abundance.tsv"), skiprows=1, usecols=0, dtype=str) baseline = np.loadtxt(os.path.join(base_dir, "abundance.tsv"), skiprows=1, usecols=3) info = json.load(open(os.path.join(base_dir, "run_info.json"))) print("transcripts:", len(targets), " pseudoaligned:", info["n_pseudoaligned"], f"({info['p_pseudoaligned']}%) ") ``` -------------------------------- ### Analyze Bootstrap Detection Probability and Plot Results Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/yeast_example.ipynb Calculates bootstrap detection probabilities, identifies persistent and fragile transcripts, and generates a three-panel figure comparing standardized deviations, bootstrap vs. analytic SE, and detection probability vs. point estimate. Saves the figure to a file. ```python THETA = 5.0 # detection threshold (estimated counts) detect_prob = (R >= THETA).mean(axis=0) # bootstrap detection prob. pi_t persist = detect_prob >= 0.95 fragile = (detect_prob > 0.05) & (detect_prob < 0.95) print(f"fragile calls (0.05= THETA)).sum())}") print(f" absent at baseline, may appear : {int((fragile & (baseline < THETA)).sum())}") fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16.5, 4.3)) # Panel A: standardized bootstrap deviations pooled across expressed transcripts. z = ((P[:, expressed] - p_hat[expressed]) / analytic_se[expressed]).ravel() ax1.hist(z, bins=80, density=True, color="#4C72B0", alpha=0.6, label=f"pooled $z$ (std={z.std(ddof=1):.2f})") xs = np.linspace(-4, 4, 400) ax1.plot(xs, np.exp(-0.5 * xs**2) / np.sqrt(2*np.pi), color="#C44E52", lw=2, label=r"$\mathcal{{N}}(0,1)$ ") ax1.set_xlim(-4, 4) ax1.set_xlabel(r"standardized deviation $(\hat{{p}}^{(b)}-\hat{{p}})/SE_{\mathrm{analytic}}$") ax1.set_ylabel("density") ax1.set_title("A. Standardized bootstrap deviations vs. $\mathcal{{N}}(0,1)$ ") ax1.legend(frameon=False, fontsize=9) # Panel B: bootstrap SE vs analytic SE across all expressed transcripts. ase, bse = analytic_se[expressed], boot_se[expressed] lo, hi = min(ase.min(), bse.min())*0.8, max(ase.max(), bse.max())*1.2 ax2.plot([lo, hi], [lo, hi], color="#888888", ls="--", lw=1, label="y = x") sc = ax2.scatter(ase, bse, c=np.log10(baseline[expressed]), cmap="viridis", s=8, alpha=0.6) ax2.set_xscale("log"); ax2.set_yscale("log") ax2.set_xlim(lo, hi); ax2.set_ylim(lo, hi) ax2.set_xlabel(r"analytic SE $\sqrt{\hat{{p}}(1-\hat{{p}})/N}$ ") ax2.set_ylabel("bootstrap SE") ax2.set_title("B. Bootstrap SE recovers analytic SE") fig.colorbar(sc, ax=ax2).set_label(r"$\\log_{10}$ est. counts", fontsize=8) ax2.legend(frameon=False, fontsize=9, loc="upper left") # Panel C: bootstrap detection probability vs. point estimate for low-abundance # transcripts. Detection is a discrete presence/absence event, so the Gaussian SE # of panel B does not describe it; the bootstrap instead gives a per-transcript # detection probability that flags which low-abundance calls persist and which # are driven by a handful of reads and go away under resampling. # Persist and fragile transcripts are plotted in full so the legend counts match # the totals reported in the text; rarely-called transcripts are restricted to the # low-count regime, and zero counts are floored to the left edge of the log axis. XFLOOR = 0.3 cat = np.where(persist, 0, np.where(fragile, 1, 2)) lowab = baseline > XFLOOR colors = ["#55A868", "#DD8452", "#C44E52"] # persist / fragile / rarely called labels = ["persists ($\pi\geq0.95$)", "fragile ($0.05<\pi<0.95$)", "rarely called ($\pi\leq0.05$)"] for k in (0, 1, 2): # persists, fragile, rarely called m = (cat == k) if k != 2 else ((cat == k) & lowab) xk = np.clip(baseline[m], XFLOOR, None) ax3.scatter(xk, detect_prob[m], s=10, alpha=0.6, color=colors[k], label=f"{labels[k]} (n={int(m.sum())})") ax3.set_xlim(left=XFLOOR * 0.9) ax3.axvline(THETA, color="#888888", ls="--", lw=1) ax3.text(THETA*1.1, 0.06, f"threshold $\theta={THETA:g}$", rotation=90, va="bottom", ha="left", fontsize=8, color="#888888") ax3.set_xscale("log") ax3.set_ylim(-0.03, 1.03) ax3.set_xlabel("point-estimate count $\hat{{c}}_t$ (baseline)") ax3.set_ylabel(r"bootstrap detection probability $\pi_t$") ax3.set_title("C. Detection robustness of low-abundance calls") ax3.legend(frameon=False, fontsize=8, loc="center right") fig.tight_layout() out = os.path.join(FIG_DIR, "bootstrap_realdata.png") fig.savefig(out, dpi=200) print("Saved", out) plt.show() ``` -------------------------------- ### Visualize Bootstrap Distributions and SE Comparison Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/intro.ipynb Generates a two-panel figure. Panel A shows the distribution of bootstrap estimates for transcript proportions, along with true values and 95% confidence intervals. Panel B directly compares the bootstrap standard error against the analytic standard error. This visualization helps to assess the accuracy and reliability of the bootstrap method. ```python fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4.2)) # Panel A: bootstrap distributions with true values and 95% CIs positions = np.arange(len(TRANSCRIPTS)) parts = ax1.violinplot([boot_props[:, i] for i in range(len(TRANSCRIPTS))], positions=positions, showextrema=False) for pc in parts['bodies']: pc.set_facecolor("#4C72B0") pc.set_alpha(0.6) ax1.vlines(positions, ci_lo, ci_hi, color="#333333", lw=2, label="95% bootstrap CI") ax1.scatter(positions, TRUE_PROPS, color="#C44E52", zorder=3, label="true proportion") ax1.set_xticks(positions) ax1.set_xticklabels(TRANSCRIPTS) ax1.set_ylabel("estimated proportion") ax1.set_title("A. Bootstrap distribution of abundance estimates") ax1.legend(frameon=False, fontsize=9) # Panel B: bootstrap SE vs analytic SE lim = max(boot_se.max(), analytic_se.max()) * 1.15 ax2.plot([0, lim], [0, lim], color="#888888", ls="--", lw=1, label="y = x") ax2.scatter(analytic_se, boot_se, color="#4C72B0", zorder=3) for i, t in enumerate(TRANSCRIPTS): ax2.annotate(t, (analytic_se[i], boot_se[i]), textcoords="offset points", xytext=(6, -2), fontsize=9) ax2.set_xlim(0, lim) ax2.set_ylim(0, lim) ax2.set_xlabel(r"analytic SE $\\sqrt{p(1-p)/N}$") ax2.set_ylabel("bootstrap SE") ax2.set_title("B. Bootstrap SE recovers analytic SE") ax2.legend(frameon=False, fontsize=9) fig.tight_layout() out_path = os.path.join(FIG_DIR, "bootstrap_demo.png") fig.savefig(out_path, dpi=200) print("Saved figure to", out_path) plt.show() ``` -------------------------------- ### Checkout Manuscript Branch Source: https://github.com/pachterlab/fastqpick/blob/main/README.md Command to switch to the 'manuscript' git branch, likely for reproducing figures from the paper. ```bash git checkout manuscript ``` -------------------------------- ### Re-quantify Bootstrap Replicates Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/intro.ipynb Quantifies each bootstrap replicate by recovering transcript barcodes and computing the proportion of reads assigned to each transcript. This function is used to generate a bootstrap distribution of abundance estimates for every transcript. Ensure `BARCODES`, `BARCODE_LEN`, and `TRANSCRIPTS` are defined. ```python barcode_to_tx = {bc: tx for tx, bc in BARCODES.items()} def quantify(fastq_path): counts = Counter() total = 0 for _, seq, _ in pyfastx.Fastx(fastq_path): counts[barcode_to_tx[seq[:BARCODE_LEN]]] += 1 total += 1 return np.array([counts[t] / total for t in TRANSCRIPTS]) boot_props = np.vstack([quantify(f) for f in rep_files]) print("Bootstrap proportion matrix shape (replicates x transcripts):", boot_props.shape) ``` -------------------------------- ### Calculate and Print Bootstrap and Analytic SE and CI Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/intro.ipynb Calculates bootstrap standard errors, percentile confidence intervals, and analytic standard errors for transcript proportions. It then prints a formatted table comparing these values and reports the maximum relative difference between the bootstrap and analytic standard errors. This is useful for validating the bootstrap method against theoretical expectations. ```python boot_se = boot_props.std(axis=0, ddof=1) ci_lo = np.percentile(boot_props, 2.5, axis=0) ci_hi = np.percentile(boot_props, 97.5, axis=0) analytic_se = np.sqrt(observed_props * (1 - observed_props) / N_READS) print(f"{ 'tx':>4} {'observed':>9} {'boot_SE':>9} {'analytic_SE':>12} {'95% CI':>22}") for i, t in enumerate(TRANSCRIPTS): ci = f"[{ci_lo[i]:.4f}, {ci_hi[i]:.4f}]" print(f"{t:>4} {observed_props[i]:>9.4f} {boot_se[i]:>9.5f} {analytic_se[i]:>12.5f} {ci:>22}") rel_err = np.abs(boot_se - analytic_se) / analytic_se print(f"\nMax relative difference between bootstrap and analytic SE: {rel_err.max():.1%}") ``` -------------------------------- ### Calculate Bootstrap and Analytic Standard Errors Source: https://github.com/pachterlab/fastqpick/blob/main/notebooks/yeast_example.ipynb Calculates and compares bootstrap and analytic standard errors for transcript proportions. Filters for expressed transcripts (>= 100 counts) and prints summary statistics of the comparison. ```python P = R / R.sum(axis=1, keepdims=True) N = baseline.sum() p_hat = baseline / N boot_se = P.std(axis=0, ddof=1) analytic_se = np.sqrt(p_hat * (1 - p_hat) / N) expressed = baseline >= 100 rel = np.abs(boot_se[expressed] - analytic_se[expressed]) / analytic_se[expressed] ratio = (boot_se[expressed] / analytic_se[expressed]).mean() print(f"expressed transcripts : {expressed.sum()}") print(f"median |boot-analytic|/analytic : {np.median(rel):.1%}") print(f"mean boot/analytic ratio : {ratio:.4f}") ``` === COMPLETE CONTENT === This response contains all available snippets from this library. No additional content exists. Do not make further requests.