### Install Project Dependencies

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Installs the project's dependencies defined in the setup file. The '-e' flag installs in editable mode, and '.[local,test]' includes dependencies required for local development and running tests.

```bash
pip install -e ".[local,test]"
```

--------------------------------

### Example Deployment Configuration (YAML)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This snippet shows a partial example of the `conf/deployment.yml` file used to define workflows and tasks for deployment on Databricks. It highlights the use of a custom section for repetitive configurations.

```yaml
# Custom section is used to store configurations that might be repetitive.
```

--------------------------------

### Install dbx Tool

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Installs the dbx command-line tool using pip, making it available in the activated Conda environment.

```bash
pip install dbx
```

--------------------------------

### Example Project Tree Structure (Shell)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This shell output displays a sample directory tree for a dbx project, illustrating the typical layout including configuration files (.dbx, conf), source code (charming_aurora), notebooks, tests, and CI/CD workflows (.github).

```shell
.
├── .dbx #(1)
│   ├── lock.json #(2)
│   └── project.json #(3)
├── .github #(4)
│   └── workflows
│       ├── onpush.yml #(5)
│       └── onrelease.yml #(6)
├── .gitignore #(7)
├── README.md #(8)
├── charming_aurora #(9)
│   ├── __init__.py #(10)
│   ├── common.py #(11)
│   └── tasks #(12)
│       ├── __init__.py
│       ├── sample_etl_task.py #(13)
│       └── sample_ml_task.py #(14)
├── conf #(15)
│   ├── deployment.yml #(16)
│   └── tasks #(17)
│       ├── sample_etl_config.yml #(18)
│       └── sample_ml_config.yml #(19)
├── notebooks #(20)
│   └── sample_notebook.py
├── pyproject.toml #(21)
├── setup.py #(22)
└── tests #(23)
    ├── entrypoint.py #(24)
    ├── integration
    │   └── e2e_test.py
    └── unit #(25)
        ├── conftest.py #(26)
        └── sample_test.py #(27)
```

--------------------------------

### Exporting Coverage Reports (Bash)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

These Bash commands demonstrate how to run unit tests with coverage and export the results into specific formats using the `--cov-report` flag. Examples are provided for exporting to HTML and XML formats.

```bash
pytest tests/unit --cov --cov-report=html
pytest tests/unit --cov --cov-report=xml
```

--------------------------------

### Change Directory to Project Folder

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Navigates into the newly created project directory 'charming-aurora' which was generated by the 'dbx init' command.

```bash
cd charming-aurora
```

--------------------------------

### Install OpenJDK for Local Spark Tests

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Installs OpenJDK version 11.0.15 from the conda-forge channel, which is required for running local Apache Spark tests.
```bash
conda install -c conda-forge openjdk=11.0.15
```

--------------------------------

### Initialize dbx Project with Custom Artifact Location

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Initializes a new dbx project like the basic `dbx init` invocation, but also specifies a custom cloud-based artifact storage location (S3, WASBS, or GS) instead of the default DBFS path. This is recommended for production setups.

```bash
dbx init \
    -p "cicd_tool=GitHub Actions" \
    -p "cloud=" \
    -p "project_name=charming-aurora" \
    -p "profile=charming-aurora" \
    -p "artifact_location=" \
    --no-input
```

--------------------------------

### Install Project Locally with Pip

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Installs the project package locally in editable mode using pip. This command includes local and test dependencies, making the project ready for development and testing.

```bash
pip install -e ".[local,test]"
```

--------------------------------

### Verify Databricks CLI Profile

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Verifies that the configured Databricks CLI profile 'charming-aurora' is working correctly by listing the root directory of the Databricks workspace filesystem.

```bash
databricks --profile charming-aurora workspace ls /
```

--------------------------------

### Running Unit Tests (Bash)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This Bash command executes the unit tests located in the `tests/unit` directory using the pytest framework. It will typically start a local Spark session, run the tests, and shut down the session upon completion.
```bash
pytest tests/unit
```

--------------------------------

### Mapping Entrypoint Alias in setup.py (Python)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This Python code snippet from "setup.py" defines console script entry points. It maps the alias "etl" to the "entrypoint" function within the "charming_aurora.tasks.sample_etl_task" module, allowing the task to be launched via this alias.

```python
# irrelevant content is omitted
from setuptools import setup

setup(
    # ...
    entry_points={
        "console_scripts": [
            "etl = charming_aurora.tasks.sample_etl_task:entrypoint",
            "ml = charming_aurora.tasks.sample_ml_task:entrypoint",
        ]
    },
    # ...
)
```

--------------------------------

### Initialize dbx Project with GitHub Actions

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Initializes a new dbx project named 'charming-aurora' using the default template. It configures the project for GitHub Actions CI/CD, specifies the cloud provider, and links it to the 'charming-aurora' Databricks CLI profile. The --no-input flag prevents interactive prompts.

```bash
dbx init \
    -p "cicd_tool=GitHub Actions" \
    -p "cloud=" \
    -p "project_name=charming-aurora" \
    -p "profile=charming-aurora" \
    --no-input
```

--------------------------------

### Launching Deployed Databricks Workflow (dbx)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This command uses the dbx tool to trigger a run of the previously deployed Databricks Job corresponding to the "charming-aurora-sample-etl" workflow. The "--trace" flag can be added to wait for completion.

```bash
dbx launch charming-aurora-sample-etl
```

--------------------------------

### Create Conda Environment for Project

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Creates a new Conda environment named 'charming-aurora' with Python version 3.9 to isolate project dependencies.
```bash
conda create -n charming-aurora python=3.9
```

--------------------------------

### Activate Conda Environment

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Activates the newly created Conda environment 'charming-aurora' to ensure subsequent installations are within this isolated environment.

```bash
conda activate charming-aurora
```

--------------------------------

### Initializing dbx Project from Git Template (Example) - Bash

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/custom_templates.md

Provides a concrete example of initializing a dbx project from a Git repository template using the `--path` argument with a placeholder URL. This demonstrates the simplest form of using a Git-based template.

```bash
dbx init --path=https://git/repo/with/template.git
```

--------------------------------

### Configuring Python Wheel Task Deployment (YAML)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This YAML snippet shows how to configure a "python_wheel_task" within a dbx deployment file. It specifies the package name, the entry point alias ("etl"), and parameters passed to the task.

```yaml
# relevant section of the deployment file, some parts are omitted
environments:
  default:
    workflows:
      - name: "charming-aurora-sample-etl"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            python_wheel_task:
              package_name: "charming_aurora"
              entry_point: "etl" # take a look at the setup.py entry_points section for details on how to define an entrypoint
              parameters: ["--conf-file", "file:fuse://conf/tasks/sample_etl_config.yml"]
```

--------------------------------

### Task Configuration Loading Utility (Python)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This Python `Task` class provides a base for Databricks job tasks, handling initialization and configuration loading.
The `_provide_config` method reads configuration from a file specified by the `--conf-file` command-line argument, parsed by `_get_conf_file` using `argparse`. The `_read_config` method uses `yaml.safe_load` to parse the configuration file, supporting `file:fuse://` references for files uploaded to the workspace. The `__init__` method allows passing initial configuration or defaults to reading from the file.

```Python
# some lines were intentionally omitted
import pathlib
import sys
from abc import ABC
from argparse import ArgumentParser
from typing import Any, Dict

import yaml


class Task(ABC):
    def __init__(self, spark=None, init_conf=None):
        self.spark = self._prepare_spark(spark)
        self.logger = self._prepare_logger()
        self.dbutils = self.get_dbutils()
        if init_conf: #(1)
            self.conf = init_conf
        else:
            self.conf = self._provide_config()
        self._log_conf()

    def _provide_config(self):
        self.logger.info("Reading configuration from --conf-file job option")
        conf_file = self._get_conf_file()
        if not conf_file:
            self.logger.info(
                "No conf file was provided, setting configuration to empty dict. "
                "Please override configuration in subclass init method"
            )
            return {}
        else:
            self.logger.info(f"Conf file was provided, reading configuration from {conf_file}")
            return self._read_config(conf_file)

    @staticmethod
    def _get_conf_file(): #(2)
        p = ArgumentParser()
        p.add_argument("--conf-file", required=False, type=str)
        namespace = p.parse_known_args(sys.argv[1:])[0]
        return namespace.conf_file

    @staticmethod
    def _read_config(conf_file) -> Dict[str, Any]:
        config = yaml.safe_load(pathlib.Path(conf_file).read_text()) #(3)
        return config
```

--------------------------------

### Defining Python Task Entrypoint (Databricks)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This Python code defines an "entrypoint" function used by "python_wheel_task" and a "__main__" block for "spark_python_task". The "entrypoint" function initializes and launches the main task logic.
```python
# if you're using python_wheel_task, you'll need the entrypoint function to be used in setup.py
def entrypoint():  # pragma: no cover
    task = SampleETLTask()
    task.launch()


# if you're using spark_python_task, you'll need the __main__ block to start the code execution
if __name__ == '__main__':
    entrypoint()
```

--------------------------------

### Deploying Databricks Workflow as Job (dbx)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This command uses the dbx tool to deploy the specified workflow ("charming-aurora-sample-etl") configuration, typically defined in "deployment.yml", as a Databricks Job in the target environment.

```bash
dbx deploy charming-aurora-sample-etl
```

--------------------------------

### Setup Local Python Environment with Conda

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Creates and activates a local Python environment using Conda for project development. This step is necessary to isolate project dependencies. Requires Conda to be installed and configured.

```bash
conda create -n {{cookiecutter.project_slug}} python=3.9
conda activate {{cookiecutter.project_slug}}
```

--------------------------------

### Executing Databricks Task on All-Purpose Cluster (dbx)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This command executes a specific task ("main") from the "charming-aurora-sample-etl" workflow on a named all-purpose (interactive) Databricks cluster. It handles package building, uploading to DBFS, and running the task in isolation.
```bash
dbx execute charming-aurora-sample-etl --task=main --cluster-name="some-interactive-cluster-name"
```

--------------------------------

### Run Local Unit Tests with Coverage

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

Executes the unit tests located in the 'tests/unit' directory using pytest. The '--cov' flag generates a code coverage report.

```bash
pytest tests/unit --cov
```

--------------------------------

### Example .dbx/project.json Configuration

Source: https://github.com/databrickslabs/dbx/blob/main/docs/reference/project.md

Example structure of the `.dbx/project.json` file. It defines environments, mapping them to Databricks workspaces, specifying artifact storage type (currently only `mlflow`), and referencing a local Databricks CLI profile for authentication. The `inplace_jinja_support` flag enables Jinja templating within the project.

```JSON
{
  "environments": {
    "default": {
      "profile": "charming-aurora",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Shared/dbx/charming_aurora",
        "artifact_location": "dbfs:/Shared/dbx/projects/charming_aurora"
      }
    }
  },
  "inplace_jinja_support": true
}
```

--------------------------------

### Generate Project Tree (Bash)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This bash command generates a tree view of the project directory structure up to 3 levels deep, excluding common build and version control artifacts like __pycache__, .git, .pytest_cache, and .coverage.

```bash
tree -L 3 -I __pycache__ -a -I .git -I .pytest_cache -I .coverage
```

--------------------------------

### Install JDK with Conda

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Installs a specific version of the OpenJDK using Conda.
This is required if JDK is not already present on the local machine, as some Databricks tools or dependencies might rely on it.

```bash
conda install -c conda-forge openjdk=11.0.15
```

--------------------------------

### Defining Databricks Job Workflow (YAML)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This YAML configuration defines a Databricks job workflow using `dbx`. It utilizes YAML anchors (`&`, `<<: *`) to reuse cluster configurations. The workflow includes a single task named 'main' which is a `python_wheel_task` executing the 'etl' entry point from the 'charming_aurora' package. It demonstrates passing configuration via a file reference using `file:fuse://`.

```YAML
# Please read YAML documentation for details on how to use substitutions and anchors.
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: # value and the rest of the file are missing from this snippet
```

--------------------------------

```python
# imports are omitted
def test_jobs(spark: SparkSession, tmp_path: Path):
    logging.info("Testing the ETL job")
    common_config = {"database": "default", "table": "sklearn_housing"}
    test_etl_config = {"output": common_config}
    etl_job = SampleETLTask(spark, test_etl_config)
    etl_job.launch()
    table_name = f"{test_etl_config['output']['database']}.{test_etl_config['output']['table']}"
    _count = spark.table(table_name).count()
    assert _count > 0
    logging.info("Testing the ETL job - done")

    logging.info("Testing the ML job")
    test_ml_config = {
        "input": common_config,
        "experiment": "/Shared/charming-aurora/sample_experiment"
    }
    ml_job = SampleMLTask(spark, test_ml_config)
    ml_job.launch()
    experiment = mlflow.get_experiment_by_name(test_ml_config['experiment'])
    assert experiment is not None
    runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
    assert runs.empty is False
    logging.info("Testing the ML job - done")
```

--------------------------------

### Deploy and Launch Workflow on Job Cluster (dbx)

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Deploys workflow assets to Databricks and then launches a run of the workflow on a job cluster using the deployed assets. This simulates a production job run.
```bash
dbx deploy --assets-only
dbx launch --from-assets --trace
```

--------------------------------

### Create Databricks Repo from Git URL (CLI)

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Creates a new Databricks Repo linked to a specified Git repository URL using the Databricks CLI. This command integrates your Git repository with the Databricks Repos feature.

```bash
databricks repos create --url --provider
```

--------------------------------

### Initializing dbx Project from Python Package Template - Bash

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/custom_templates.md

Demonstrates how to initialize a dbx project using a template that has been installed as a Python package. The `--package` argument specifies the name of the installed package containing the template.

```bash
dbx init --package=my-template-pkg
```

--------------------------------

### Tag and Push Git for Release

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Creates an annotated Git tag for a specific version and pushes the tag to the remote repository. This action is typically used to trigger a CI/CD release pipeline.

```bash
git tag -a v -m "Release tag for version "
git push origin --tags
```

--------------------------------

### Testing ETL Job with Pytest (Python)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/python_quickstart.md

This snippet focuses specifically on the ETL task testing portion of the `test_jobs` function. It illustrates how to configure and launch an ETL task programmatically within a test and verify its output, such as checking the row count of the resulting table.
```python
# imports are omitted
def test_jobs(spark: SparkSession, tmp_path: Path):
    logging.info("Testing the ETL job")
    common_config = {"database": "default", "table": "sklearn_housing"}
    test_etl_config = {"output": common_config}
    etl_job = SampleETLTask(spark, test_etl_config)
    etl_job.launch()
    table_name = f"{test_etl_config['output']['database']}.{test_etl_config['output']['table']}"
    _count = spark.table(table_name).count()
    assert _count > 0
    logging.info("Testing the ETL job - done")

    # code of the ML task test is omitted
```

--------------------------------

### Defining Basic dbx Deployment File Structure (YAML)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/reference/deployment.md

This snippet illustrates the standard structure of a `dbx` deployment file in YAML format. It includes the `build` configuration, the required `environments` section with a named environment (e.g., `default`), and the required `workflows` section containing workflow definitions. The example shows a basic workflow with a `python_wheel_task`.

```yaml
build: #(1)
  python: "pip"

environments: #(2)
  default: #(3)
    workflows: #(4)
      - name: "workflow1" #(5)
        tasks:
          - task_key: "task1"
            # example task payload
            python_wheel_task:
              package_name: "some-pkg"
              entry_point: "some-ep"
```

--------------------------------

### Project Structure Example (Shell)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/file_references.md

Illustrates a typical project directory structure for a dbx project, showing the location of source code, tasks, and configuration files. This structure is relevant for understanding how file references are resolved relative to the project root.

```shell
.
├── charming_aurora #
│   ├── __init__.py
│   ├── common.py
│   └── tasks
│       ├── __init__.py
│       ├── sample_etl_task.py
│       └── sample_ml_task.py
├── conf
│   ├── deployment.yml
│   └── tasks
│       ├── sample_etl_config.yml
│       └── sample_ml_config.yml
```

--------------------------------

### Example Included Cluster Configuration (JSON)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/jinja_support.md

A JSON snippet representing a cluster configuration that can be included in a main deployment file using Jinja's 'include' tag, demonstrating reusable configuration blocks.

```json
{
  "spark_version": "some-version",
  "node_type_id": "some-node-type",
  "aws_attributes": {
    "first_on_demand": 0,
    "availability": "SPOT"
  },
  "num_workers": 2
}
```

--------------------------------

### Installing dbx with pip (Shell)

Source: https://github.com/databrickslabs/dbx/blob/main/README.md

This command demonstrates how to install the dbx package using pip, the standard package installer for Python. It requires Python 3.8+ and pip or conda to be installed on the system.

```shell
pip install dbx
```

--------------------------------

### Initializing dbx Project from Versioned Git Template - Bash

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/custom_templates.md

Illustrates how to use the `--checkout` flag with `dbx init` to specify a particular version (tag, branch, or commit) when initializing a project from a Git repository template. This is useful for ensuring reproducible project setups.
```bash
# specific tag
dbx init --path=https://git/repo/with/template.git --checkout=v0.0.1
# specific branch
dbx init --path=https://git/repo/with/template.git --checkout=prod
# specific git commit
dbx init --path=https://git/repo/with/template.git --checkout=aaa111bbb
```

--------------------------------

### Installing Python Template Package - Bash

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/custom_templates.md

Shows the command to install a Python package containing a dbx template using pip. This step is required before initializing a project from a template shipped as a Python package.

```bash
pip install "my-template-pkg==0.0.1" # or whatever version
```

--------------------------------

### Examples of Passing Parameters for Specific Task Types (dbx launch, bash)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/passing_parameters.md

These examples illustrate the required structure of the `--parameters` argument for various Databricks job task types (spark_jar_task, notebook_task, spark_python_task, python_wheel_task, spark_submit_task, pipeline_task, sql_task, dbt_task) when launching a workflow using `dbx launch`.
```bash
dbx launch --parameters='{"jar_params": ["a1", "b1"]}' # spark_jar_task
dbx launch --parameters='{"notebook_params":{"name":"john doe","age":"35"}}' # notebook_task
dbx launch --parameters='{"python_params":["john doe","35"]}' # spark_python_task or python_wheel_task
dbx launch --parameters='{"spark_submit_params": ["--class", "org.apache.spark.examples.SparkPi"]}' # spark_submit_task
dbx launch --parameters='{"python_named_params": {"name": "task", "data": "dbfs:/path/to/data.json"}}' # python_wheel_task
dbx launch --parameters='{"pipeline_params": {"full_refresh": true}}' # pipeline_task as a part of a workflow
dbx launch --parameters='{"sql_params": {"name": "john doe", "age": "35"}}' # sql_task
dbx launch --parameters='{"dbt_commands": ["dbt deps", "dbt seed", "dbt run"]}' # dbt_task
```

--------------------------------

### Run Unit Tests with Pytest

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Executes unit tests located in the `tests/unit` directory using pytest. The `--cov` flag generates a coverage report to assess code test coverage.

```bash
pytest tests/unit --cov
```

--------------------------------

### Launching DLT Pipeline with Parameters using dbx

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/delta_live_tables.md

These Bash commands illustrate how to pass parameters to a Delta Live Tables (DLT) pipeline update when launching it via dbx. The --parameters flag is used, followed by a JSON string containing the desired parameters. Examples include triggering a full refresh or specifying a selection of tables for refresh, following the structure expected by the DLT API's Start Update endpoint.
```bash
dbx launch --parameters='{ "full_refresh": "true" }' # for full refresh
```

```bash
dbx launch --parameters='{ "refresh_selection": ["sales_orders_cleaned", "sales_order_in_chicago"] }' # start an update of selected tables
```

```bash
dbx launch --parameters='{ "refresh_selection": ["sales_orders_cleaned", "sales_order_in_chicago"], "full_refresh_selection": ["customers", "sales_orders_raw"] }' # start a full update of selected tables
```

--------------------------------

### Install Extras During dbx Execute (Bash)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/dependency_management.md

This Bash command shows how to instruct the dbx tool to install specific dependency 'extras' (defined in setup.py or pyproject.toml) when executing a job on a Databricks cluster, allowing environment-specific dependency installation.

```bash
dbx execute ... --pip-install-extras="test,other-extra,one-more-extra"
```

--------------------------------

### Running dbx Development Commands - Bash

Source: https://github.com/databrickslabs/dbx/blob/main/contrib/CONTRIBUTING.md

These commands demonstrate how to use the `make` tool for common development tasks within the dbx project. They cover displaying help, cleaning and installing dependencies, running tests (including specific test files), fixing code formatting, and running linters.

```bash
make help
make clean install
make test
make test /tests/path/to/blah_test.py
make fix
make lint
```

--------------------------------

### Including Configuration Snippets in dbx Deployment (JSON/Jinja)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/jinja_support.md

Example of a dbx deployment JSON file using Jinja's 'include' tag to insert content from another file ('includes/cluster-test.json.j2') into the 'new_cluster' field, promoting modularity.
```json
{
  "environments": {
    "default": {
      "jobs": [
        {
          "name": "your-job-name",
          "new_cluster": {% include 'includes/cluster-test.json.j2' %},
          "libraries": [],
          "max_retries": 0,
          "spark_python_task": {
            "python_file": "file://placeholder_1.py"
          }
        }
      ]
    }
  }
}
```

--------------------------------

### Execute Workflow on All-Purpose Cluster (dbx)

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Runs a specified dbx workflow on an existing all-purpose Databricks cluster. This is useful for testing workflows in an interactive environment.

```bash
dbx execute --cluster-name=
```

--------------------------------

### Configure JVM Project Deployment - dbx YAML - YAML

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/jvm/jvm_devloop.md

Provides an example of a dbx deployment file (deployment.yml) for a JVM project. It defines cluster properties, build commands (mvn clean package), and a workflow task that uses an instance pool, specifies a JAR library built locally, disables default packaging, and sets the main class name for a spark_jar_task. It highlights using Jinja functions and instance pools for faster iteration.
```YAML
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 2
      instance_pool_name: "dev-instance-pool-created-above" #(1)
      driver_instance_pool_name: "dev-instance-pool-created-above" #(2)

build:
  commands:
    - "mvn clean package" #(3)

environments:
  default:
    workflows:
      - name: "charming-aurora-sample-jvm"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            libraries:
              - jar: "{{ 'file://' + dbx.get_last_modified_file('target/scala-2.12', 'jar') }}" #(4)
            deployment_config: #(5)
              no_package: true
            spark_jar_task:
              main_class_name: "org.some.main.ClassName"
              parameters: []
```

--------------------------------

### Installing dbx with Cloud Storage Dependencies (Bash)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/concepts/artifact_storage.md

These snippets show the extra identifiers to append to the `pip install dbx` command to include necessary libraries for interacting with cloud-based artifact storage locations. `dbx[azure]` adds support for `wasbs://`, `dbx[aws]` for `s3://`, and `dbx[gcp]` for `gs://`. These extras are required for `dbx` to perform upload/download operations on these storage types.

```Bash
dbx[azure]
```

```Bash
dbx[aws]
```

```Bash
dbx[gcp]
```

--------------------------------

### Example Jinja Variables File (YAML)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/jinja_support.md

A YAML file defining variables ('TASK_CLUSTER', 'TASK_NAME') that can be referenced in dbx deployment Jinja templates using the 'var['...']' syntax, providing a structured way to manage parameters.
```yaml
TASK_CLUSTER:
  MIN_WORKERS: 1
  MAX_WORKERS: 5
TASK_NAME: 'main'
```

--------------------------------

### Install Python Wheel with Notebook-Scoped Libraries using dbx

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/dependency_management.md

This command is used within a Databricks notebook cell to install a Python wheel containing project code and dependencies. It leverages notebook-scoped libraries, ensuring the installation is specific to the current user's context and doesn't require a cluster restart. The `--force-reinstall` flag ensures the package is reinstalled even if a previous version exists.

```Databricks Notebook (Python)
%pip install --force-reinstall /package.whl
```

--------------------------------

### Installing Package from Custom Repository in Notebook (Python)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/devloop/mixed.md

Installs a Python package from a configured custom PyPI repository (like Artifactory) directly within a notebook cell using the `%pip` magic command. This is useful for consuming packaged code with specific versions.

```python
%pip install package-from-artifactory
```

--------------------------------

### Referencing Environment Variables in dbx Deployment (YAML/Jinja)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/jinja_support.md

Example of a dbx deployment YAML file using Jinja syntax ('{{ env['VAR_NAME'] }}') to dynamically set a job tag based on an environment variable, adding flexibility to deployments.
```yaml
# only relevant block shown
environments:
  default:
    - name: "job-with-tags"
      tags:
        - job_group: "{{ env['JOB_GROUP'] }}"
```

--------------------------------

### Accessing packaged files with pkg_resources (Python)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/packaging_files.md

Shows how to use the `pkg_resources.resource_filename` function from the `setuptools` package to get the filesystem path to files included in the Python package via `package_data`. This allows the application code to read or process these files.

```Python
import pkg_resources

raw_csv_path = pkg_resources.resource_filename(
    "", "resources/raw/username.csv"
)
query_path = pkg_resources.resource_filename(
    "", "resources/sql/create_table.sql"
)
```

--------------------------------

### Execute Workflow Interactively on Cluster (dbx)

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Executes a dbx workflow interactively on a specified Databricks cluster. This command is useful for development and debugging directly on the cluster environment.

```bash
dbx execute \
    --cluster-name=""
```

--------------------------------

### dbx Deployment Configuration (Policy & Init Script)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/named_properties.md

Example of a dbx deployment configuration in YAML format. It demonstrates how to reference a cluster policy by name or ID and how to define additional init scripts directly within the cluster definition. dbx will merge these with the policy's init scripts.

```yaml
# irrelevant parts are omitted
environments:
  default:
    workflows:
      - name: workflow_name
        job_clusters:
          - new_cluster:
              policy_id: "cluster-policy://policy-with-pip-install-script"
              init_scripts:
                - dbfs:
                    destination: dbfs:/some/path/install_sql_driver.sh
        tasks:
          ...
```

--------------------------------

### Referencing Variables File in dbx Deployment (YAML/Jinja)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/jinja_support.md

Example of a dbx deployment YAML file using Jinja syntax ('{{ var['...'] }}') to dynamically set task key and cluster autoscale parameters based on variables defined in a separate variables file.

```yaml
# irrelevant config parts are omitted
environments:
  default:
    workflows:
      - name: "charming-aurora-sample-etl"
        tasks:
          - task_key: "{{ var['TASK_NAME'] }}"
            new_cluster:
              autoscale:
                min_workers: {{ var['TASK_CLUSTER']['MIN_WORKERS'] }}
                max_workers: {{ var['TASK_CLUSTER']['MAX_WORKERS'] }}
```

--------------------------------

### Configure Custom PyPI Index with Bash Init Script

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/dependency_management.md

This bash script, intended as an init script, overwrites the /etc/pip.conf file with a configuration that adds a custom PyPI index URL alongside the default one. This allows pip to install packages from the specified private repository during cluster initialization.

```Bash
echo """[global]
index-url=https://pypi.org/simple
extra-index-url=https://my.custom.pypi.example.com/simple/
""" > /etc/pip.conf
```

--------------------------------

### Passing Parameters using dbx launch --from-assets (Jobs API 2.0, notebook_task)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/passing_parameters.md

Example of passing parameters to a `notebook_task` using `dbx launch --from-assets` with the Jobs API 2.0 format. Parameters are provided as a JSON object under the 'base_parameters' key.
```bash
dbx launch --from-assets --parameters='{"base_parameters": {"key1": "value1", "key2": "value2"}}'
```

--------------------------------

### Invalid Combination of Build Options in dbx Deployment YAML

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/build_management.md

Shows an example of an invalid `deployment.yml` configuration where multiple exclusive build options (like `python` and `commands`) are specified simultaneously. This configuration will not work.

```YAML
build:
  python: "pip"
  commands:
    - "echo 'building!'"
    - "sleep 5"
    - "mvn clean package"
```

--------------------------------

### Passing Parameters using dbx launch (Jobs API 2.0, spark_submit_task)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/passing_parameters.md

Example of passing parameters to a `spark_submit_task` using the standard `dbx launch` command with the Jobs API 2.0 format. Parameters are provided as an array under the 'spark_submit_params' key.

```bash
dbx launch --parameters='{"spark_submit_params": ["--class", "org.apache.spark.examples.SparkPi"]}'
```

--------------------------------

### Execute Specific Task in Multitask Job (dbx)

Source: https://github.com/databrickslabs/dbx/blob/main/src/dbx/templates/projects/python_basic/render/{{cookiecutter.project_name}}/README.md

Executes a single task within a multitask job definition on an all-purpose Databricks cluster using dbx. This allows for targeted testing of individual job components.

```bash
dbx execute \
    --cluster-name= \
    --job= \
    --task=
```

--------------------------------

### Configuring Different Workflow Types in dbx Deployment File (YAML)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/reference/deployment.md

This snippet demonstrates how to define different types of workflows within the `workflows` section of a `dbx` deployment file.
It shows examples for `jobs-v2.1` (identified by the `tasks` section), `jobs-v2.0` (identified by the absence of `tasks`), and `pipeline` (explicitly set using `workflow_type: "pipeline"`).

```yaml
build:
  python: "pip"

environments:
  default:
    workflows:
      ################################################
      - name: "workflow-in-v2.1-format"
        tasks:
          - task_key: "task1"
            python_wheel_task:
              package_name: "some-pkg"
              entry_point: "some-ep"
      ################################################
      - name: "workflow-in-v2.0-format"
        spark_python_task:
          python_file: "file://some/file.py"
      ################################################
      - name: "workflow-in-pipeline-format"
        target: "some-target-db"
        workflow_type: "pipeline" # enforces the recognition
        libraries:
          - notebook:
              path: "/Repos/some/path"
```

--------------------------------

### Databricks Workflow Deployment with Git Source (YAML)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/devops/mixed.md

Example deployment configuration file (`conf/deployment.yml`) for a mixed-mode Databricks project. It defines a workflow with tasks, including a notebook task sourced from a remote Git repository (`git_source`) and a Python wheel task. The `deployment_config` for the notebook task disables automatic package dependency, as the package code is imported manually in the notebook.
```YAML
environments:
  default:
    workflows:
      - name: "mixed-mode-workflow"
        job_clusters:
          # omitted
        git_source:
          git_url: https://some-git-provider.com/some/remote/repo.git
          git_provider: "git-provider-name"
          git_branch: "main" # or git_tag or git_commit
        tasks:
          - task_key: "notebook-remote"
            notebook_task:
              notebook_path: "notebooks/sample_notebook"
            deployment_config:
              no_package: true
            job_cluster_key: "default"
          - task_key: "packaged"
            python_wheel_task:
              package_name: ""
              entry_point: ""
            job_cluster_key: "default"
```

--------------------------------

### Using Custom Jinja Functions in dbx Deployment (YAML/Jinja)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/jinja_support.md

Example of a dbx deployment YAML file using Jinja syntax ('{{ custom.(...) }}') to call a custom Python function ('custom.multiply_by_two') to dynamically set a cluster configuration value.

```yaml
# irrelevant config parts are omitted
environments:
  default:
    workflows:
      - name: "charming-aurora-sample-etl"
        tasks:
          - task_key: "some-task"
            new_cluster:
              autoscale:
                min_workers: 1
                max_workers: {{ custom.multiply_by_two(2) }}
```

--------------------------------

### Initializing dbx Project from Git Template (Generic) - Bash

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/custom_templates.md

Shows the basic syntax for initializing a dbx project using a template located in a Git repository. The `--path` argument specifies the repository URL, and the optional `--checkout` argument allows specifying a branch, tag, or commit.

```bash
dbx init --path PATH [--checkout LOC]
```

--------------------------------

### Passing Task-Specific Parameters using dbx launch --from-assets (Jobs API 2.1, pipeline_task)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/passing_parameters.md

Example of passing task-specific parameters to a `pipeline_task` using `dbx launch --from-assets` with the Jobs API 2.1 format.
Parameters are provided as an array of objects, each specifying 'task_key' and parameters like 'full_refresh'.

```bash
dbx launch --from-assets --parameters='[
    {"task_key": "some", "full_refresh": true}
]'
```

--------------------------------

### Define Dependencies with Setuptools in Python

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/dependency_management.md

This Python snippet shows a typical setup.py file structure using setuptools to define different sets of dependencies: main package requirements, local development requirements, and test requirements, using the extras_require mechanism.

```python
from setuptools import find_packages, setup
from your_package_name import __version__

PACKAGE_REQUIREMENTS = ["pyyaml"] #(1)

LOCAL_REQUIREMENTS = [ #(2)
    "pyspark==3.2.1",
    "delta-spark==1.1.0",
    "scikit-learn",
    "pandas",
    "mlflow",
]

TEST_REQUIREMENTS = [ #(3)
    # development & testing tools
    "pytest",
    "coverage[toml]",
    "pytest-cov",
    "dbx>=0.8"
]

setup(
    name="your_package_name",
    packages=find_packages(exclude=["tests", "tests.*"]),
    setup_requires=["setuptools","wheel"],
    install_requires=PACKAGE_REQUIREMENTS,
    extras_require={"local": LOCAL_REQUIREMENTS, "test": TEST_REQUIREMENTS}, #(4)
    version=__version__,
)
```

--------------------------------

### Initializing dbx Project with Default Template - Bash

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/general/custom_templates.md

Shows how to use the `--template` option to initialize a dbx project using one of the built-in templates provided by dbx. Currently, `python_basic` is the only default template available.

```bash
dbx init --template=python_basic
```

--------------------------------

### Bash Commands to Run Tests on Job Cluster

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/integration_tests.md

These bash commands demonstrate how to deploy and launch the 'sample-tests' workflow on a job cluster using assets-based deployment.
The first command deploys the necessary assets, and the second command launches the job using those deployed assets.

```Bash
dbx deploy sample-tests --assets-only
dbx launch sample-tests --from-assets
```

--------------------------------

### Databricks Cluster Policy Definition (Init Script)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/features/named_properties.md

Example of a Databricks cluster policy definition in JSON format that enforces the inclusion of a specific init script by setting its destination as a fixed property.

```json
{
  "init_scripts.0.dbfs.destination": {
    "type": "fixed",
    "value": "dbfs://some/path/script.sh"
  }
}
```

--------------------------------

### Configuring package_data in setup.py (Python)

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/packaging_files.md

Demonstrates how to use the `package_data` field in a `setup.py` file to include arbitrary files (like SQL or CSV) within a Python package. It shows how to specify file patterns relative to the package directory.

```Python
from setuptools import setup

setup(
    # ...
    package_data={'': ['resources/sql/*.sql', "resources/raw/*.csv"]},
    # ...
)
```

--------------------------------

### YAML Workflow Configuration for Tests

Source: https://github.com/databrickslabs/dbx/blob/main/docs/guides/python/integration_tests.md

This YAML configuration defines a 'sample-tests' workflow within the default environment. It sets up a task named 'main' that uses a basic static cluster and executes the Python entrypoint script `tests/entrypoint.py` as a Spark Python task, passing parameters to pytest to run tests and collect coverage.
```YAML
environments:
  default:
    workflows:
      - name: "sample-tests"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            spark_python_task:
              python_file: "file://tests/entrypoint.py"
              # this call supports all standard pytest arguments
              parameters: ["file:fuse://tests/integration", "--cov="]
```

--------------------------------

### Enabling Photon Runtime Engine for Databricks Job Clusters

Source: https://github.com/databrickslabs/dbx/blob/main/docs/reference/deployment.md

Provides a YAML configuration snippet demonstrating how to specify the `runtime_engine: PHOTON` property within a job cluster definition in the `dbx` deployment file to enable the Databricks Photon runtime.

```YAML
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "your-spark-version"
    node_type_id: "your-node-type-id"
    spark_conf:
      spark.databricks.delta.preview.enabled: 'true'
    instance_pool_name:
    driver_instance_pool_name:
    runtime_engine: PHOTON
    init_scripts:
      - dbfs:
          destination: dbfs:/
```
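
The `&basic-cluster-props` anchor above, like the `<<: *basic-static-cluster` merge key in the test workflow, relies on standard YAML anchors and merge keys. As a rough illustration only (this is not how dbx resolves them internally), a merge key behaves like a shallow dictionary update in which keys set locally take precedence over the anchored ones. The `merge_key` helper and the `num_workers` override below are hypothetical names chosen for this sketch.

```python
def merge_key(anchor: dict, local: dict) -> dict:
    """Mimic YAML's `<<: *anchor`: copy the anchored mapping, then let local keys win."""
    merged = dict(anchor)   # shallow copy of the anchored mapping
    merged.update(local)    # keys defined next to the merge key override it
    return merged


# Placeholder values mirroring the `basic-cluster-props` anchor (assumed, not real IDs).
basic_cluster_props = {
    "spark_version": "your-spark-version",
    "node_type_id": "your-node-type-id",
    "runtime_engine": "PHOTON",
}

# Hypothetical cluster definition that reuses the anchor and adds one key.
job_cluster = merge_key(basic_cluster_props, {"num_workers": 2})
print(job_cluster)
```

Because the merge is shallow, nested mappings such as `spark_conf` are replaced as a whole when redefined locally, not combined key by key.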