### Get Current Date and Time

Source: https://github.com/verily-src/workbench-examples/blob/main/cloud_env_setup.ipynb

This shell command prints the current date and time. It is used for logging and provenance purposes, indicating when specific setup steps or the notebook execution occurred.

```shell
date
```

--------------------------------

### Setup Libraries and Utilities

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Imports the Python libraries used for making HTTP requests, manipulating file paths, and displaying progress bars during downloads. These are foundational for interacting with web APIs and managing data.

```python
# Import the third-party "requests" library for programmatic access of HTTP URLs
import requests

# Import the standard "os" module for file path manipulation
import os

# Import "tqdm" to display a progress bar during downloads
from tqdm import tqdm
```

--------------------------------

### List Jupyter Lab Extensions

Source: https://github.com/verily-src/workbench-examples/blob/main/cloud_env_setup.ipynb

This shell command lists all installed JupyterLab extensions. The output is useful when setting up or troubleshooting the environment, because it shows which functionality is available in JupyterLab.

```shell
jupyter labextension list
```

--------------------------------

### Get CPU Core Count

Source: https://github.com/verily-src/workbench-examples/blob/main/cloud_env_setup.ipynb

This shell command counts the number of logical CPU cores available on the system by parsing the `/proc/cpuinfo` file. It provides a quick way to assess the processing power of the cloud environment.
```shell
grep processor /proc/cpuinfo | wc -l
```

--------------------------------

### Get Total System Memory

Source: https://github.com/verily-src/workbench-examples/blob/main/cloud_env_setup.ipynb

This shell command retrieves the total system memory in kilobytes by reading the `MemTotal` line from the `/proc/meminfo` file. This is useful for understanding the memory resources available in the cloud environment.

```shell
grep "^MemTotal:" /proc/meminfo
```

--------------------------------

### Create Cromwell Examples Directory

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_gvs_stats.ipynb

Ensures the existence of a '~/wb-tutorials/cromwell' directory, which is used to store Cromwell-related files, such as server logs. The `!mkdir -p` command creates the directory if it does not already exist, preventing potential errors in subsequent operations. This is a notebook cell (Python with an IPython `!` shell escape), not a plain Bash script.

```python
import os

CROMWELL_EXAMPLES_DIR=os.path.expanduser('~/wb-tutorials/cromwell')
CROMWELL_SERVER_LOG=f'{CROMWELL_EXAMPLES_DIR}/cromwell.server.log'

!mkdir -p {CROMWELL_EXAMPLES_DIR}
```

--------------------------------

### Clone workbench-examples Repository using Git

Source: https://github.com/verily-src/workbench-examples/blob/main/nextflow/workspace_description.md

Clones the 'workbench-examples' Git repository. This command is used if the repository is not automatically cloned into the Verily Workbench cloud environment. It requires Git to be installed in the environment.

```sh
git clone https://github.com/verily-src/workbench-examples.git
```

--------------------------------

### Manage Dataproc Autoscaling Policies with gcloud CLI

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Demonstrates how to manage Dataproc autoscaling policies using the `gcloud` command-line tool. It includes commands for importing a policy from a YAML file, describing an existing policy, and applying a policy during the creation of a Dataproc cluster (the cluster itself is created with the Workbench `wb` CLI).
Requires the Google Cloud SDK to be installed and configured.

```bash
# Import autoscaling policy
gcloud dataproc autoscaling-policies import two_worker_autoscaling_policy \
    --source=two_worker_autoscaling_policy.yaml \
    --region=us-central1

# Describe policy
gcloud dataproc autoscaling-policies describe two_worker_autoscaling_policy \
    --region=us-central1

# Use policy when creating cluster
wb resource create dataproc-cluster \
    --name=my-cluster \
    --autoscaling-policy=two_worker_autoscaling_policy \
    --num-workers=2
```

--------------------------------

### Setup RNASeq Test Datasets - Python

Source: https://github.com/verily-src/workbench-examples/blob/main/nextflow/nextflow_examples.ipynb

This code snippet is used to set up the necessary test datasets for the RNASeq pipeline. It either checks out a specific branch if the workspace is a clone of the 'Getting Started with Nextflow workspace' or clones the repository and checks out the branch if it's a new or existing personal workspace. It ensures the 'samplesheet_minimal.csv' file is accessible.

```python
!cd /home/jupyter/test-datasets && git checkout rnaseq
!cd /home/jupyter/test-datasets && cat samplesheet/samplesheet_minimal.csv || echo "Something's not quite right. Please ensure you've added the Git repo as referenced resource and checked out the RNASeq branch."
```

```python
![[ -f test-datasets/samplesheet/samplesheet_minimal.csv ]] && echo "Resource already exists" || (wb resource add-ref git-repo --name=test-datasets --repo-url=git@github.com:nf-core/test-datasets.git && cd /home/jupyter && wb git clone --resource=nf-core-sample-data-repo && git checkout rnaseq)
```

--------------------------------

### Start MySQL Database for Cromwell

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_server_management.ipynb

Starts a MySQL database instance using Docker to store Cromwell job states. It configures the database with necessary credentials and parameters.
Dependencies: Docker. Inputs: None. Outputs: Runs a MySQL Docker container.

```bash
!docker run -p 3306:3306 \
    --name MySQLContainer \
    -e MYSQL_ROOT_PASSWORD=cromwell \
    -e MYSQL_DATABASE=cromwell_db \
    -e MYSQL_USER=cromwell \
    -e MYSQL_PASSWORD=cromwell \
    -d mysql/mysql-server:5.5 \
    --max-allowed-packet=16M
```

--------------------------------

### Create JupyterLab Notebook Instances with Workbench CLI

Source: https://context7.com/verily-src/workbench-examples/llms.txt

This section shows how to create JupyterLab notebook instances using the Workbench CLI. Examples include creating an instance with a custom post-startup script and accessing it privately, as well as creating an instance with default settings.

```bash
# Create cloud environment (notebook instance)
wb resource create gcp-notebook \
    --access=PRIVATE_ACCESS \
    --cloning=COPY_NOTHING \
    --name=analysis-environment \
    --description="Personal analysis environment" \
    --instance-id=my-notebook-20231208 \
    --post-startup-script=gs://my-bucket/startup.sh

# Create with default settings
wb resource create gcp-notebook \
    --name=default-notebook \
    --description="Default notebook instance"
```

--------------------------------

### Display HelloWorld WDL Workflow Content (Shell)

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_examples.ipynb

Displays the content of the 'helloWorld.wdl' file using the 'cat' shell command. This workflow is a simple example that takes a string input parameter named 'name' and has no file inputs.

```shell
!cat workflows/wdl/helloWorld.wdl
```

--------------------------------

### Download Files for First Project

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Downloads all files for the first project in the previously fetched project list. It uses the `download_project_files` function and specifies an output directory `OUTPUT_DIR`. Prints status messages before and after download.
```python
TARGET_PROJECT = PROJECT_LIST[0]

print(f"Downloading files for project '{TARGET_PROJECT['title']}'")
download_project_files(CATALOG, TARGET_PROJECT['uuid'], OUTPUT_DIR)
print("Downloads Complete.")
```

--------------------------------

### Move Resource to Version Folder using Workbench CLI

Source: https://github.com/verily-src/workbench-examples/blob/main/first_hour_on_vwb/creating_a_data_collection.ipynb

This example demonstrates how to move a resource to a specific version folder within a workspace using the Workbench CLI. It requires the target folder's ID, the resource's name, and the workspace ID.

```bash
# Move desired resource to version folder
wb resource move --folder-id= --name= --workspace=
```

--------------------------------

### Create Cromwell Options JSON (Python)

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_gvs_setup_inputs.ipynb

Generates a JSON file named `gvs_options.json` that configures Cromwell execution options. This example enables both reading from and writing to the Cromwell cache, which can significantly speed up subsequent runs by reusing previous results.

```python
import json

with open('gvs_options.json', 'w') as outfile:
    json.dump({
        'read_from_cache': True,
        'write_to_cache': True
    }, outfile, indent=4)
```

--------------------------------

### Configure Git User Name and Email

Source: https://github.com/verily-src/workbench-examples/blob/main/cloud_env_setup.ipynb

This Python snippet configures the global Git user name and email address. It first verifies that the `GOOGLE_CLOUD_PROJECT` environment variable is set, then applies `GIT_NAME` and `GIT_EMAIL` only if you edit them to non-`None` values. This ensures that Git commits are properly attributed. Dependencies include the `os` module.

```python
import os

if not os.getenv('GOOGLE_CLOUD_PROJECT'):
    raise Exception('Expected environment variables are not available. Please let workbench-support@verily.com know.')

# [Optional] EDIT THIS CELL If you wish to set your name and email address for all git repositories, change these
# values to be correct for you. All other cells in this notebook work fine unchanged.

# Uncomment the following line if you want to use your Workbench email address as your Git email address.
#GIT_EMAIL = os.environ['WORKBENCH_USER_EMAIL']
GIT_EMAIL = None
GIT_NAME = None

!git config --global --list

if GIT_NAME is not None:
    !git config --global user.name "{GIT_NAME}"
if GIT_EMAIL is not None:
    !git config --global user.email "{GIT_EMAIL}"

!git config --global --list | grep user

# [Optional] EDIT THIS CELL If you wish to set the text editor when using git
# in the terminal instead of via the JupyterLab git extension.
# !git config --global core.editor emacs
```

--------------------------------

### Install and Import nf-core Tool

Source: https://github.com/verily-src/workbench-examples/blob/main/nextflow/nextflow_examples.ipynb

Installs the 'nf-core' companion tool using pip and imports it. This tool is necessary for interacting with 'nf-core' pipelines. After installation, the kernel must be restarted before the tool can be successfully imported.

```python
try:
    import nf_core
    print("nf-core is already installed")
except ImportError:
    # Only a failed import should trigger installation
    print("Installing nf-core...")
    !pip install nf-core
```

```python
try:
    import nf_core
    print("nf-core is already installed")
except ImportError:
    print("Please restart the kernel before importing...")
```

--------------------------------

### Start Cromwell Server in Background

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_server_management.ipynb

Launches the Cromwell server in server mode as a background task. All server messages are redirected to a log file. The server takes a few seconds to initialize and become ready for requests. Dependencies: Bash shell. Inputs: CROMWELL_CONF, CROMWELL_SERVER_LOG. Outputs: Starts Cromwell server process.
```bash
%%bash -s {CROMWELL_CONF} {CROMWELL_SERVER_LOG}

# Start Cromwell in server mode
cromwell --config "$1" --logdir "$(dirname "$2")" server > "$2" 2>&1 &
```

--------------------------------

### Start Cromwell Server in Background (Bash)

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_server_management.ipynb

Starts the Cromwell server in the background; the source notebook runs this through the %%bash magic command, passing the configuration path and log path as positional arguments. It configures the JVM memory (10 GB initial and maximum heap) and points to the Cromwell configuration file. The output is redirected to a specified log file. This method is preferred over '!' for background processes in IPython. The snippet references `CROMWELL_JAR` without defining it, so that variable must already be set in the environment.

```bash
CROMWELL_CONF="$1"
CROMWELL_SERVER_LOG="$2"

java -Xms10g -Xmx10g -Dconfig.file="${CROMWELL_CONF}" -jar "${CROMWELL_JAR}" server &> "${CROMWELL_SERVER_LOG}" &
```

--------------------------------

### Create Python Virtual Environment Directory

Source: https://github.com/verily-src/workbench-examples/blob/main/cloud_env_setup.ipynb

This shell command creates a directory named `venvs` in the user's home directory. This directory is intended to store Python virtual environments used by Verily Workbench tutorials, promoting organized dependency management.

```shell
mkdir -p ~/venvs
```

--------------------------------

### List Installed Python Packages with Bash

Source: https://github.com/verily-src/workbench-examples/blob/main/ml_examples/ml4h/ML4H_ML_intro.ipynb

This Bash command uses `pip3 freeze` to list all installed Python packages and their versions in the current environment. This is useful for environment management and reproducibility.

```bash
pip3 freeze
```

--------------------------------

### List Installed Python Packages (Shell)

Source: https://github.com/verily-src/workbench-examples/blob/main/ml_examples/llama31/vwb_8b_v100_llama31_hf.ipynb

Outputs a list of all installed Python packages and their versions. This is crucial for understanding the environment's configuration and ensuring reproducibility.
```bash
!pip freeze
```

--------------------------------

### Configure Workspace and Cloud Environment Directories

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_server_management.ipynb

Sets up local directories for tutorial files and Cromwell configuration. It defines paths for tutorial files, Cromwell configuration, and server logs, creating necessary directories. Dependencies: os module. Inputs: None. Outputs: Prints configured paths.

```python
import os

CROMWELL_EXAMPLES_DIR=os.path.expanduser('~/wb-tutorials/cromwell')
CROMWELL_CONF=f'{CROMWELL_EXAMPLES_DIR}/cromwell.conf'
CROMWELL_SERVER_LOG=f'{CROMWELL_EXAMPLES_DIR}/cromwell.server.log'

!mkdir -p {CROMWELL_EXAMPLES_DIR}

print(f'Tutorial files will be written locally to {CROMWELL_EXAMPLES_DIR}')
print()
print(f'Cromwell configuration file will be written to {CROMWELL_CONF}')
print(f'Cromwell server log file will be written to {CROMWELL_SERVER_LOG}')
```

--------------------------------

### Configure Cromwell Tutorial File Paths (Python)

Source: https://github.com/verily-src/workbench-examples/blob/main/cromwell_setup/cromwell_examples.ipynb

Configures and prints the local file paths for Cromwell-related tutorial files within the VWB environment. This includes directories for examples, configuration, input JSON files, and log files. It also creates the necessary directories using '!mkdir -p'.
```python
import os

CROMWELL_EXAMPLES_DIR=os.path.expanduser('~/wb-tutorials/cromwell')
CROMWELL_CONF=f'{CROMWELL_EXAMPLES_DIR}/cromwell.runmode.conf'
HELLO_WORLD_INPUTS_JSON=f'{CROMWELL_EXAMPLES_DIR}/hello_world.inputs.json'
SAMPLE_INPUTS_JSON=f'{CROMWELL_EXAMPLES_DIR}/sample.inputs.json'
RUNMODE_LOG=f'{CROMWELL_EXAMPLES_DIR}/cromwell.run.log'

!mkdir -p {CROMWELL_EXAMPLES_DIR}

print(f'Tutorial files will be written locally to {CROMWELL_EXAMPLES_DIR}')
print()
print(f'Cromwell configuration file will be written to {CROMWELL_CONF}')
print(f'Cromwell hello-world input JSON file will be written to {HELLO_WORLD_INPUTS_JSON}')
print(f'Cromwell runmode log file will be written to {RUNMODE_LOG}')
print(f'Cromwell samples input JSON file will be written to {SAMPLE_INPUTS_JSON}')
```

--------------------------------

### Set Notebook Globals

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Defines global variables for notebook operations, including the DCP catalog prefix, the HCA endpoint URLs, and the directory for saved files. It also creates the necessary output directory if it doesn't exist. The `os` module must already be imported.

```python
CATALOG_PREFIX = 'dcp'
ENDPOINT_URL = 'https://service.azul.data.humancellatlas.org/index'
CATALOGS_URL = f'{ENDPOINT_URL}/catalogs'
PROJECTS_URL = f'{ENDPOINT_URL}/projects'

HCA_EXAMPLES_DIR = os.path.expanduser('~/wb-tutorials/hca')
OUTPUT_DIR = os.path.join(HCA_EXAMPLES_DIR, 'data')

!mkdir -p "{OUTPUT_DIR}"
```

--------------------------------

### Generate Environment Provenance: Jupyter Lab Extensions

Source: https://github.com/verily-src/workbench-examples/blob/main/dataproc/batch_job_submit.ipynb

Lists installed Jupyter Lab extensions using `jupyter labextension list`. This helps in documenting the Jupyter environment setup, including any custom extensions that might affect notebook execution.
```bash
!jupyter labextension list
```

--------------------------------

### Setup: Get Current Workspace ID in Python

Source: https://github.com/verily-src/workbench-examples/blob/main/first_hour_on_vwb/creating_a_data_collection.ipynb

This code snippet captures the ID of the current workspace using the 'wb workspace describe' command and then parses the JSON output to extract the workspace ID. It requires the 'subprocess', 'json', 'ipywidgets', 'widget_utils', 'vwb_folder_utils', and 'datetime' libraries.

```python
import json
import ipywidgets as widgets
import subprocess
import widget_utils as wu
import vwb_folder_utils as vfu

from datetime import date

def get_current_workspace_id():
    '''
    Resolves ID of current workspace.
    '''
    CURRENT_WORKSPACE_ID_CMD_OUTPUT = !wb workspace describe --format=json | jq --raw-output ".id"
    CURRENT_WORKSPACE_ID = CURRENT_WORKSPACE_ID_CMD_OUTPUT[0]
    return CURRENT_WORKSPACE_ID

CURRENT_WORKSPACE_ID = get_current_workspace_id()
print(f"Current workspace ID is {CURRENT_WORKSPACE_ID}")
```

--------------------------------

### Manage BigQuery Datasets with Workbench CLI

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Commands for creating and managing BigQuery datasets in Verily Workbench. Includes options for table lifecycle policies and referencing datasets/tables from other projects.

```bash
# Create BigQuery dataset with auto-delete for tables (14 days)
wb resource create bq-dataset \
    --name=tabular_data_autodelete_after_two_weeks \
    --dataset-id=tabular_data_autodelete_after_two_weeks \
    --cloning=COPY_NOTHING \
    --default-table-lifetime=1209600 \
    --description="BigQuery dataset for temporary tabular data."

# Add referenced BigQuery dataset from another project
wb resource add-ref bq-dataset \
    --cloning=COPY_REFERENCE \
    --description="Public genomics dataset" \
    --name=public-genomes \
    --path=bigquery-public-data.human_genome_variants

# Add referenced BigQuery table
wb resource add-ref bq-table \
    --cloning=COPY_REFERENCE \
    --description="1000 Genomes pedigree data" \
    --name=genomes-pedigree \
    --path=bigquery-public-data.human_genome_variants.1000_genomes_pedigree

# List all workspace resources
wb resource list
```

--------------------------------

### Install Specific 'igraph' Version (R)

Source: https://github.com/verily-src/workbench-examples/blob/main/1kgenomes_examples/R_1k_genomes.ipynb

This R command installs a specific version ('1.6.0') of the 'igraph' package from a specified CRAN repository. This is done to 'pin' the version and avoid potential compatibility issues with the latest version, as mentioned in the context of the example. It requires the 'remotes' package to be installed and loaded.

```r
install_version("igraph", version = "1.6.0", repos = "http://cran.us.r-project.org")
```

--------------------------------

### Workbench CLI - BigQuery Dataset Management

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Create and manage BigQuery datasets, including setting table lifecycle policies and adding references to existing datasets or tables.

```APIDOC
## Workbench CLI - BigQuery Dataset Management

### Description
Create and manage BigQuery datasets, including setting table lifecycle policies and adding references to existing datasets or tables.

### Commands
- **`wb resource create bq-dataset --name= --dataset-id= --cloning= --default-table-lifetime= --description=`**: Create a new BigQuery dataset.
- **`wb resource add-ref bq-dataset --cloning= --description= --name= --path=`**: Add a referenced BigQuery dataset from another project or location.
- **`wb resource add-ref bq-table --cloning= --description= --name= --path=`**: Add a referenced BigQuery table.
- **`wb resource list`**: List all workspace resources.
```

--------------------------------

### Get Latest DCP Catalog

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Retrieves the latest Data Coordination Platform (DCP) catalog identifier. This is a simple utility function to get the catalog name, which is then used in other operations.

```python
CATALOG = get_dcp_catalog()
print(f"The DCP catalog is: {CATALOG}")
```

--------------------------------

### Display Final Workbench Resource List

Source: https://github.com/verily-src/workbench-examples/blob/main/workspace_setup.ipynb

Lists the Workbench resources after their creation or resolution. This command allows users to confirm that the Cloud Storage buckets and BigQuery dataset have been successfully set up in their workspace.

```bash
!wb resource list
```

--------------------------------

### Install 'irlba' R Package

Source: https://github.com/verily-src/workbench-examples/blob/main/1kgenomes_examples/R_1k_genomes.ipynb

This R command installs the 'irlba' package, which provides efficient methods for truncated singular value decomposition and principal components analysis (PCA) on large sparse and dense matrices. This package is a dependency for the PCA computation in this example. It requires an internet connection to download from CRAN.

```r
# Fast and memory efficient methods for truncated singular value decomposition and
# principal components analysis of large sparse and dense matrices.
install.packages("irlba")
```

--------------------------------

### Get Project Request Parameters

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Constructs the necessary parameters for requesting a list of projects from the HCA catalog.
This includes specifying the catalog, the maximum number of projects to retrieve, and sorting options.

```python
def get_project_request_params(catalog: str, max_projects: int) -> dict:
    # Set up request parameters
    return {
        'catalog': catalog,
        'size': max_projects,
        'sort': 'projectTitle',
        'order': 'asc'
    }
```

--------------------------------

### Get Latest DCP Catalog

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Identifies and returns the latest Data Coordination Platform (DCP) catalog from the list of available catalogs. It filters for DCP-prefixed catalogs and selects the one with the highest numerical suffix.

```python
def get_dcp_catalog() -> str:
    # We want the latest dcp catalog.
    catalogs = list_catalogs()

    # Extract the 'dcp' catalogs
    dcp_catalogs = [c for c in catalogs if c.startswith(CATALOG_PREFIX)]

    # Get the largest numerically
    max_value = 0
    max_catalog = None
    for c in dcp_catalogs:
        if int(c[len(CATALOG_PREFIX):]) > max_value:
            max_value = int(c[len(CATALOG_PREFIX):])
            max_catalog = c

    return max_catalog
```

--------------------------------

### Manage Cloud Storage Buckets with Workbench CLI

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Commands for creating and managing Google Cloud Storage buckets within Verily Workbench. Supports features like automatic deletion policies and referencing existing buckets.

```bash
# Create a durable storage bucket
wb resource create gcs-bucket \
    --name=ws_files \
    --bucket-name=${GOOGLE_CLOUD_PROJECT}-ws-files \
    --cloning=COPY_NOTHING \
    --description="Bucket for reports and provenance records."

# Create auto-deleting bucket (files deleted after 14 days)
wb resource create gcs-bucket \
    --name=ws_files_autodelete_after_two_weeks \
    --bucket-name=${GOOGLE_CLOUD_PROJECT}-autodelete-after-two-weeks \
    --cloning=COPY_NOTHING \
    --auto-delete=14 \
    --description="Bucket for temporary storage with automatic cleanup."

# Resolve bucket URL from workspace reference
wb resolve --name=ws_files

# Add existing bucket as referenced resource
wb resource add-ref gcs-bucket \
    --bucket-name=my-existing-bucket \
    --cloning=COPY_REFERENCE \
    --description="External bucket reference" \
    --name=external-bucket
```

--------------------------------

### Configure and Run ML Training from Command Line (Python)

Source: https://github.com/verily-src/workbench-examples/blob/main/ml_examples/ml4h/mnist_survival_analysis_demo.ipynb

This Python code snippet demonstrates how to set `sys.argv` to mimic command-line arguments for running an ML training process. It configures input/output tensors, batch size, and output directories. Dependencies include `sys` and a `parse_args` function.

```python
import sys

# Assuming HD5_FOLDER and OUTPUT_FOLDER are defined elsewhere
# Assuming parse_args and train_multimodal_multitask are imported

sys.argv = ['train',
            '--tensors', HD5_FOLDER,
            '--input_tensors', 'mnist.mnist_image',
            '--output_tensors', 'mnist.mnist_label',
            '--batch_size', '64',
            '--test_steps', '64',
            '--epochs', '6',
            '--output_folder', OUTPUT_FOLDER,
            '--id', 'learn_mnist'
           ]
args = parse_args()
metrics = train_multimodal_multitask(args)
```

--------------------------------

### Query Top 1000 SNPs from BigQuery

Source: https://github.com/verily-src/workbench-examples/blob/main/1kgenomes_examples/GWAS_experiments.ipynb

Retrieves up to 1000 SNP rows from a BigQuery table and returns them sorted by start position. It selects specific columns including reference name, start position, reference bases, alternative bases, and Chi-squared score. The `LIMIT 1000` is applied in the inner query, which has no `ORDER BY`, so the 1000 rows are not guaranteed to be the highest-scoring ones; they are only ordered after being selected.
```sql
SELECT * FROM (
  SELECT
    reference_name,
    start_position,
    reference_bases,
    alt_bases,
    chi_squared_score
  FROM `stats_results_table`
  LIMIT 1000
)
ORDER BY start_position asc
```

--------------------------------

### Fetch and Print Project List

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Fetches a short list of projects (up to 10) from the latest DCP catalog and prints their details. This snippet demonstrates the usage of the `list_projects` function.

```python
PROJECT_LIST = list_projects(CATALOG, 10)
```

--------------------------------

### Install 'threejs' R Package

Source: https://github.com/verily-src/workbench-examples/blob/main/1kgenomes_examples/R_1k_genomes.ipynb

This R command installs the 'threejs' package, which enables the creation of interactive 3D scatter plots, network plots, and globes using the 'three.js' JavaScript visualization library. This package is used for visualizing the PCA results in 3D. It requires an internet connection to download from CRAN.

```r
# Create interactive 3D scatter plots, network plots, and globes using the 'three.js' visualization library
install.packages("threejs")
```

--------------------------------

### Fetch HCA Project List

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Fetches a list of project titles and UUIDs from the HCA catalog. It handles pagination and limits the number of projects returned. Dependencies include `fetch_json` and `get_project_request_params`.
```python
def list_projects(catalog: str, max_projects: int) -> list:
    # Allocate a list to populate for return
    project_list = []

    print(f"Fetching first {max_projects} projects:")

    # Set up the fetch parameters
    url = PROJECTS_URL
    params = get_project_request_params(catalog, max_projects)

    while url and len(project_list) < max_projects:
        response_json = fetch_json(url, params)

        # Iterate over results, pulling out key project elements
        for hit in response_json['hits']:
            uuid = hit['entryId']
            shortname = hit['projects'][0]['projectShortname']
            title = hit['projects'][0]['projectTitle']

            print("-----------------------")
            print(f"Title: {title}")
            print(f"Shortname: {shortname}")
            print(f"Id: {uuid}")

            project_list.append({'title': title, 'uuid': uuid})

        # Handle response pagination if we haven't reached max_projects
        url = response_json['pagination']['next']
        if url:
            params = None
        else:
            break

    return project_list
```

--------------------------------

### Manage Workspaces with Workbench CLI

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Commands to manage and inspect Verily Workbench workspaces using the `wb` CLI. This includes checking status, listing workspaces, retrieving details, and switching between workspaces.

```bash
# Check current workspace status
wb status

# List all accessible workspaces
wb workspace list

# Get workspace details in JSON format
wb workspace describe --format=json

# Extract workspace ID programmatically
wb workspace describe --format=json | jq --raw-output ".id"

# Switch to a different workspace
wb workspace set --id=
```

--------------------------------

### ML4H Imports and Setup

Source: https://github.com/verily-src/workbench-examples/blob/main/ml_examples/ml4h/ML4H_Model_Factory_Intro.ipynb

This code block imports essential libraries and modules from the ML4H toolkit and other common data science packages.
It sets up the environment for machine learning tasks using ML4H, including data manipulation, model creation, and visualization tools.

```python
# Imports
import os
import sys
import pickle
import random
from typing import List, Dict, Callable
from collections import defaultdict, Counter

import h5py
import numpy as np

from ml4h.defines import StorageType
from ml4h.arguments import parse_args
from ml4h.TensorMap import TensorMap, Interpretation
from ml4h.tensor_generators import test_train_valid_tensor_generators
from ml4h.models.train import train_model_from_generators
from ml4h.models.model_factory import make_multimodal_multitask_model
from ml4h.models.inspect import plot_and_time_model
from ml4h.recipes import compare_multimodal_scalar_task_models, train_multimodal_multitask

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import gridspec
```

--------------------------------

### List Available Catalogs

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Retrieves a list of available Human Cell Atlas (HCA) catalogs from the server. It filters out internal catalogs and returns a list of catalog names, such as ['dcp31', 'dcp32', 'dcp1'].

```python
def list_catalogs() -> list:
    response = fetch_json(CATALOGS_URL, None)

    catalogs = []
    for catalog, details in response['catalogs'].items():
        if not details['internal']:
            catalogs.append(catalog)

    return catalogs
```

--------------------------------

### Get Workspace ID with `wb` CLI

Source: https://github.com/verily-src/workbench-examples/blob/main/dataproc/batch_job_submit.ipynb

Retrieves the workspace ID using the `wb workspace describe` command and parses the JSON output to extract the ID. This is a prerequisite for deriving other workspace-specific resources.
```python
ws_id_list = !wb workspace describe --format=JSON | jq '.id'
WORKSPACE_ID = ws_id_list[0]
print(WORKSPACE_ID)
```

--------------------------------

### Workbench CLI - Workspace Management

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Manage Verily Workbench workspaces, including checking status, listing available workspaces, describing workspace details, and setting the active workspace.

```APIDOC
## Workbench CLI - Workspace Management

### Description
Manage Verily Workbench workspaces, including checking status, listing available workspaces, describing workspace details, and setting the active workspace.

### Commands
- **`wb status`**: Check the current workspace status.
- **`wb workspace list`**: List all accessible workspaces.
- **`wb workspace describe --format=json`**: Get detailed workspace information in JSON format.
- **`wb workspace describe --format=json | jq --raw-output ".id"`**: Extract the workspace ID programmatically.
- **`wb workspace set --id=`**: Switch to a different workspace.
```

--------------------------------

### Create BigQuery Dataset using Workbench CLI

Source: https://github.com/verily-src/workbench-examples/blob/main/1kgenomes_examples/GWAS_experiments.ipynb

Executes a command-line instruction via the Verily Workbench CLI (`wb resource create bq-dataset`) to create a new BigQuery dataset within the current workspace. The dataset is specified to be in the `US` region.

```bash
!wb resource create bq-dataset --location=US --id $bq_dataset_name
```

--------------------------------

### Download File with Progress Bar

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

Downloads a file from a given URL to a specified local path. It streams the content and displays a progress bar using `tqdm`, providing visual feedback on the download status and speed.
```python
def download_file(url: str, output_path: str) -> None:
    # Start the request stream
    response = requests.get(url, stream=True)
    response.raise_for_status()

    # Get the content length so the progress bar can display accurate progress
    total = int(response.headers.get('content-length', 0))

    print(f'Downloading to: {output_path}', flush=True)

    # Fetch the content in chunks, updating the progress bar
    with open(output_path, 'wb') as f:
        with tqdm(total=total, unit='B', unit_scale=True, unit_divisor=1024) as bar:
            for chunk in response.iter_content(chunk_size=1024):
                size = f.write(chunk)
                bar.update(size)
```

--------------------------------

### Workbench CLI - Git Repository Integration

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Integrate Git repositories as workspace references, allowing them to be automatically mounted into cloud environments.

```APIDOC
## Workbench CLI - Git Repository Integration

### Description
Integrate Git repositories as workspace references, allowing them to be automatically mounted into cloud environments.

### Commands
- **`wb resource add-ref git-repo --repo-url= --name= --cloning= --description=`**: Add a Git repository as a referenced resource.
  - Supports public and private repositories (private requires SSH key setup).
- **`wb resource resolve --name=`**: Resolve the reference to a Git repository.
```

--------------------------------

### Generate Environment Provenance: Conda Environment

Source: https://github.com/verily-src/workbench-examples/blob/main/dataproc/batch_job_submit.ipynb

Exports the current Conda environment configuration using `conda env export`. This command lists all packages and their versions installed in the active Conda environment, crucial for reproducibility.
```bash
!conda env export
```

--------------------------------

### Python Utilities for Group Management

Source: https://context7.com/verily-src/workbench-examples/llms.txt

This Python code assists in retrieving and managing organization-linked group information and permissions. It provides functions to get a group's description and to list role assignments for users and other groups within a specified organization.

```python
import json
import subprocess


# Get org-linked group information
def get_org_linked_group_info(org_id, group_name):
    """
    Return an org-linked group's description as JSON.
    """
    wb_command = ["wb", "group", "describe", f"--org={org_id}", f"--name={group_name}", "--format=JSON"]
    result = subprocess.run(wb_command, capture_output=True, text=True)
    group_info = json.loads(result.stdout)
    return group_info


# Get group role assignments
def get_org_linked_group_roles(org_id, group_name):
    """
    Return a flattened mapping of roles to user emails for the named group.
    """
    roles_dict = {role: set() for role in ["ADMIN", "MEMBER", "READER", "SUPPORT"]}
    wb_command = ["wb", "group", "role", "list", f"--org={org_id}", f"--name={group_name}", "--format=JSON"]
    result = subprocess.run(wb_command, capture_output=True, text=True)
    nested_roles = json.loads(result.stdout)

    for item in nested_roles:
        if item['principal']['principalType'] == "GROUP":
            # Recurse into the nested group and collect its members
            nested_members = get_org_linked_group_roles(
                item['principal']['groupOrg'],
                item['principal']['groupName']
            )["MEMBER"]
            # Assumption: members of a nested group receive each role granted
            # to that group (the original referenced an undefined `role` here)
            for role in item['roles']:
                roles_dict[role].update(nested_members)
        else:
            for role in item['roles']:
                if item['principal']['userEmail'] is not None:
                    roles_dict[role].add(item['principal']['userEmail'])

    return roles_dict


# Usage example
group_info = get_org_linked_group_info("verily", "research-team")
roles = get_org_linked_group_roles("verily", "research-team")
print(f"Group info: {group_info}")
print(f"Admins: {roles['ADMIN']}")
print(f"Members: {roles['MEMBER']}")
```

--------------------------------

### Setup and Configuration: Python Variables

Source:
https://github.com/verily-src/workbench-examples/blob/main/dataproc/create_hail_cluster.ipynb

Sets up Python variables and constructs a Hail cluster name from the user's email and the current date, so the name is distinct per user per day. It imports the `datetime` and `os` modules for date formatting and environment variable access. If `WORKBENCH_USER_EMAIL` is not defined, the code falls back to the `USER` environment variable.

```python
from datetime import datetime
import os
```

```python
USER = os.getenv('WORKBENCH_USER_EMAIL')
if USER:
    USER = USER.split('@')[0].replace('.', '-')
else:
    print('WORKBENCH_USER_EMAIL not defined; using USER')
    USER = os.getenv('USER')

print(USER)
```

```python
HAIL_CLUSTER_NAME = '-'.join(['hail', USER, datetime.now().strftime('%Y%m%d')])
print(HAIL_CLUSTER_NAME)
```

--------------------------------

### Verify Git Global Configuration

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Lists the global Git configuration and filters it to show the user name and email settings, confirming that the preceding `git config` commands succeeded.

```bash
!git config --global --list | grep user
```

--------------------------------

### Fetch JSON Data with Error Handling

Source: https://github.com/verily-src/workbench-examples/blob/main/single_cell/getting-started-with-hca.ipynb

A utility function to fetch JSON data from a given URL with optional query parameters. It raises an exception on HTTP error status codes and returns the parsed JSON response on success.

```python
def fetch_json(url: str, params: dict) -> dict:
    response = requests.get(url, params=params)
    # Raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.json()
```

--------------------------------

### Integrate Git Repositories with Workbench CLI

Source: https://context7.com/verily-src/workbench-examples/llms.txt

Commands to add Git repositories as workspace references, enabling automatic mounting into cloud environments. Supports both public and private repositories.
```bash
# Add public GitHub repository
wb resource add-ref git-repo \
  --repo-url=https://github.com/verily-src/workbench-examples.git \
  --name=workbench-examples \
  --cloning=COPY_REFERENCE \
  --description="Verily Workbench example notebooks"

# Add private GitHub repository (requires SSH key setup)
wb resource add-ref git-repo \
  --repo-url=git@github.com:org/private-repo.git \
  --name=private-repo \
  --cloning=COPY_REFERENCE \
  --description="Private research repository"

# Resolve repository reference
wb resource resolve --name=workbench-examples
```

--------------------------------

### Generate Environment Provenance with Python

Source: https://github.com/verily-src/workbench-examples/blob/main/first_hour_on_vwb/creating_a_data_collection.ipynb

These notebook cells use IPython shell escapes (`!`) to record provenance information about the current notebook environment. They capture the date, the packages in the Conda environment (including pip-installed ones), the JupyterLab extensions, the CPU count, and the total memory.

```python
!date
```

```python
!conda env export
```

```python
!jupyter labextension list
```

```python
!grep ^processor /proc/cpuinfo | wc -l
```

```python
!grep "^MemTotal:" /proc/meminfo
```
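
The provenance cells above depend on shell escapes, so they only work inside a notebook. As a rough alternative, the sketch below gathers part of the same record (timestamp, CPU count, total memory) with the Python standard library alone. It assumes a Linux host that exposes `/proc/meminfo`; the names `parse_mem_total_kb` and `collect_provenance` are hypothetical and do not come from the workbench-examples repository.

```python
import os
import platform
from datetime import datetime, timezone
from typing import Optional


def parse_mem_total_kb(meminfo_text: str) -> Optional[int]:
    """Return the MemTotal value in kB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Expected line shape: "MemTotal:       16384256 kB"
            return int(line.split()[1])
    return None


def collect_provenance(meminfo_path: str = "/proc/meminfo") -> dict:
    """Gather a small provenance record (hypothetical helper, stdlib only)."""
    try:
        with open(meminfo_path) as f:
            mem_total_kb = parse_mem_total_kb(f.read())
    except OSError:
        # Not a Linux host, or the file is unreadable
        mem_total_kb = None
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),  # logical CPUs, like counting "processor" lines
        "mem_total_kb": mem_total_kb,
    }


if __name__ == "__main__":
    for key, value in collect_provenance().items():
        print(f"{key}: {value}")
```

This sketch does not cover the Conda package list or the JupyterLab extensions; those would still need `conda env export` and `jupyter labextension list`.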