### Initialize Overwatch Workspace (Scala) Source: https://databrickslabs.github.io/overwatch/gettingstarted Initializes the Overwatch workspace by defining configuration parameters such as storage paths, data targets, security tokens, and cost details. This setup is crucial for the pipeline's operation and requires manual validation on the first run. ```scala private val storagePrefix = "/mnt/global/Overwatch/working/directory" private val workspaceID = ??? //automatically generated in the runner scripts referenced above private val dataTarget = DataTarget( Some(etlDB), Some(s"${storagePrefix}/${workspaceID}/${etlDB}.db"), Some(s"${storagePrefix}/global_share"), Some(consumerDB), Some(s"${storagePrefix}/${workspaceID}/${consumerDB}.db") ) private val tokenSecret = TokenSecret(secretsScope, dbPATKey) private val badRecordsPath = s"${storagePrefix}/${workspaceID}/sparkEventsBadRecords" private val auditSourcePath = "/path/to/my/workspaces/raw_audit_logs" // INPUT: workspace audit log directory private val interactiveDBUPrice = 0.56 private val automatedDBUPrice = 0.26 private val customWorkspaceName = workspaceID // customize this to a custom name if custom workspace_name is desired val params = OverwatchParams( auditLogConfig = AuditLogConfig(rawAuditPath = Some(auditSourcePath)), dataTarget = Some(dataTarget), tokenSecret = Some(tokenSecret), badRecordsPath = Some(badRecordsPath), overwatchScope = Some(scopes), maxDaysToLoad = maxDaysToLoad, databricksContractPrices = DatabricksContractPrices(interactiveDBUPrice, automatedDBUPrice), primordialDateString = Some(primordialDateString), workspace_name = Some(customWorkspaceName), // as of 0.6.0 externalizeOptimize = false // as of 0.6.0 ) // args are the full json config to be used for the Overwatch workspace deployment val args = JsonUtils.objToJson(params).compactString // the workspace object contains everything Overwatch needs to build, run, maintain its pipelines val workspace = Initializer(args) // 0.5.x+ ``` -------------------------------- ### Setup Widgets for Configuration (Scala) Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/azure_runner_docs_example Initializes and retrieves values from Databricks widgets for configuring ETL and consumer database names, secrets scope, and other parameters. ```scala // dbutils.widgets.removeAll // dbutils.widgets.text("storagePrefix", "", "1. ETL Storage Prefix") // dbutils.widgets.text("etlDBName", "overwatch_etl", "2. ETL Database Name") // dbutils.widgets.text("consumerDBName", "overwatch", "3. Consumer DB Name") // dbutils.widgets.text("secretsScope", "my_secret_scope", "4. Secret Scope") // dbutils.widgets.text("dbPATKey", "my_key_with_api", "5. Secret Key (DBPAT)") // dbutils.widgets.text("ehKey", "overwatch_eventhub_conn_string", "6. Secret Key (EH)") // dbutils.widgets.text("ehName", "my_eh_name", "7. EH Topic Name") // dbutils.widgets.text("primordialDateString", "2021-04-01", "8. Primordial Date") // dbutils.widgets.text("maxDaysToLoad", "60", "9. Max Days") // dbutils.widgets.text("scopes", "all", "A1. Scopes") ``` ```scala val storagePrefix = dbutils.widgets.get("storagePrefix") val etlDB = dbutils.widgets.get("etlDBName") val consumerDB = dbutils.widgets.get("consumerDBName") val secretsScope = dbutils.widgets.get("secretsScope") val dbPATKey = dbutils.widgets.get("dbPATKey") val ehName = dbutils.widgets.get("ehName") val ehKey = dbutils.widgets.get("ehKey") val primordialDateString = dbutils.widgets.get("primordialDateString") val maxDaysToLoad = dbutils.widgets.get("maxDaysToLoad").toInt val scopes = if (dbutils.widgets.get("scopes") == "all") { "audit,sparkEvents,jobs,clusters,clusterEvents,notebooks,pools".split(",") } else dbutils.widgets.get("scopes").split(",") ``` ```scala if (storagePrefix.isEmpty || consumerDB.isEmpty || etlDB.isEmpty || ehName.isEmpty || secretsScope.isEmpty || ehKey.isEmpty || dbPATKey.isEmpty) { throw new IllegalArgumentException("Please specify all required parameters!") } ``` -------------------------------- ### Initialize Pipeline for Custom Costs (Scala) Source: https://databrickslabs.github.io/overwatch/gettingstarted Initializes the pipeline to create custom costing tables in the ETL database. This step is necessary for the first-time initialization when custom compute costs need to be configured according to regional or contract pricing. ```scala Bronze(workspace) ``` -------------------------------- ### Databricks Overwatch Main Class Parameters Source: https://databrickslabs.github.io/overwatch/gettingstarted Demonstrates how parameters are passed to the Databricks Overwatch Main Class, requiring a JSON array of escaped strings. This format is crucial for configuring job parameters, especially when using the 'escapedConfigString' utility. ```shell [""] ["bronze", ""] ["silver", ""] ["gold", ""] ``` -------------------------------- ### Databricks Overwatch Main Class Argument Structure Source: https://databrickslabs.github.io/overwatch/gettingstarted Explains the expected structure for arguments passed to the Databricks Overwatch Main Class. It covers scenarios with a single argument (Overwatch parameters JSON string) and two arguments (pipeline layer and Overwatch parameters JSON string). ```text Single argument: Arg(0) is Overwatch Parameters json string (escaped if sent through API). Two arguments: Arg(0) must be ‘bronze’, ‘silver’, or ‘gold’ corresponding to the Overwatch Pipeline layer you want to be run. Arg(1) will be the Overwatch Parameters json string. ``` -------------------------------- ### Add Cluster Dependencies for OverWatch Source: https://databrickslabs.github.io/overwatch/gettingstarted Specifies the Maven coordinates for the OverWatch assembly (fat jar) and the Azure EventHubs Spark integration. These are required dependencies to be added to your Databricks cluster. ```maven com.databricks.labs:overwatch_2.12: ``` ```maven com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.21 ``` -------------------------------- ### Execute Overwatch Pipeline Stages (Scala) Source: https://databrickslabs.github.io/overwatch/gettingstarted Executes the main stages of the Overwatch pipeline: Bronze, Silver, and Gold. This is typically done from a Databricks notebook after the workspace has been initialized and configured. ```scala private val args = JsonUtils.objToJson(params).compactString val workspace = Initializer(Array(args), debugFlag = false) Bronze(workspace).run() Silver(workspace).run() Gold(workspace).run() ``` -------------------------------- ### Construct Overwatch Parameters and Instantiate Workspace Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example Defines data targets, token secrets, file paths, DBU prices, and constructs the OverwatchParams object. It then uses these parameters to instantiate the Initializer, converting the parameters to a JSON string. This is a key setup step for the Overwatch project. ```scala private val dataTarget = DataTarget( Some(etlDB), Some(s"${storagePrefix}/${workspaceID}/${etlDB}.db"), Some(s"${storagePrefix}/global_share"), Some(consumerDB), Some(s"${storagePrefix}/${workspaceID}/${consumerDB}.db") ) private val tokenSecret = TokenSecret(secretsScope, dbPATKey) private val badRecordsPath = s"${storagePrefix}/${workspaceID}/sparkEventsBadrecords" private val auditSourcePath = s"${storagePrefix}/${workspaceID}/raw_audit_logs" private val interactiveDBUPrice = 0.56 private val automatedDBUPrice = 0.26 val params = OverwatchParams( auditLogConfig = AuditLogConfig(rawAuditPath = Some(auditSourcePath)), dataTarget = Some(dataTarget), tokenSecret = Some(tokenSecret), badRecordsPath = Some(badRecordsPath), overwatchScope = Some(scopes), maxDaysToLoad = maxDaysToLoad, databricksContractPrices = DatabricksContractPrices(interactiveDBUPrice, automatedDBUPrice), primordialDateString = Some(primordialDateString) ) private val args = JsonUtils.objToJson(params).compactString val workspace = if (args.length != 0) { Initializer(Array(args), debugFlag = true) } else { Initializer(Array()) } ``` -------------------------------- ### Interacting With Overwatch and its State Source: https://databrickslabs.github.io/overwatch/dataengineer/advancedtopics Guide on how to instantiate Overwatch configurations and retrieve workspace state, and interact with pipeline modules. ```APIDOC ## Interacting With Overwatch and its State ### Description This section explains how to use your production notebook to instantiate Overwatch configurations, obtain condensed parameters using `JsonUtils.compactString`, and retrieve workspace objects using helper functions. ### Method Scala code execution within a Databricks notebook. ### Endpoint N/A ### Parameters #### Instantiating Workspace (As of 0.6.0.4) - **Database Name** (string) - The name of the Overwatch ETL database (e.g., `"overwatch_etl"`). #### Instantiating Workspace (Older versions/Compact String) - **Compact Config String** (string) - The condensed configuration string obtained from a previous run. ### Request Example #### Getting Compact String ```scala import com.databricks.labs.overwatch.utils.JsonUtils val params = OverwatchParams(...) print(JsonUtils.objToJson(params).compactString) ``` #### Instantiating Workspace (Helper Function) ```scala import com.databricks.labs.overwatch.utils.Helpers val prodWorkspace = Helpers.getWorkspaceByDatabase("overwatch_etl") ``` #### Instantiating Workspace (Compact String) ```scala // For older versions import com.databricks.labs.overwatch.pipeline.Initializer val workspace = Initializer("") ``` ### Response - **Workspace Object**: Represents the Overwatch configuration and state of a workspace. ### Using the Workspace #### Exploring the Config ```scala prodWorkspace.getConfig.* ``` #### Interacting with the Pipeline ```scala val bronzePipeline = Bonze(prodWorkspace) val silverPipeline = Silver(prodWorkspace) val goldPipeline = Gold(prodWorkspace) bronzePipeline.* ``` ### Error Handling - Ensure the compact string used is from a compatible Overwatch version. - `Helpers.getWorkspaceByDatabase` may not work across different Overwatch versions. ``` -------------------------------- ### Materialize JSON Configurations (Scala) Source: https://databrickslabs.github.io/overwatch/gettingstarted Generates different string representations of the Overwatch pipeline parameters (params). These configurations are used for various purposes, including Databricks job runs and notebook instantiations. ```scala val escapedConfigString = JsonUtils.objToJson(params).escapedString // used for job runs using main class val compactConfigString = JsonUtils.objToJson(params).compactString // used to instantiate workspace in a notebook val prettyConfigString = JsonUtils.objToJson(params).prettyString // human-readable json config ``` -------------------------------- ### Build Overwatch Uber JAR with Maven Source: https://databrickslabs.github.io/overwatch/deployoverwatch/runningoverwatch/clusterconfig Demonstrates the Maven command to compile an uber JAR of Overwatch with all its dependencies. This is an alternative to downloading pre-built JARs from GitHub releases, requiring Maven to be installed and configured. ```bash mvn clean package ``` -------------------------------- ### Azure Production Workspace Config - Scala Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/Upgrade_Example Provides an example of a production workspace configuration string for Azure. This JSON string includes settings for audit logging, token secrets, data targets, and intelligent scaling, formatted for use with Databricks Overwatch. ```Scala val prodArgs = """{"auditLogConfig":{"auditLogFormat":"json","azureAuditLogEventhubConfig":{"connectionString":"{{secrets/overwatch_shared/eh_conn}}","eventHubName":"EHName","auditRawEventsPrefix":"EHChkDir","maxEventsPerTrigger":10000}},"tokenSecret":{"scope":"mySecretScope","key":"PATKeyName"},"dataTarget":{"databaseName":"overwatch_etl","databaseLocation":"/path/to/etl.db","etlDataPathPrefix":"/path/to/global_share","consumerDatabaseName":"overwatch","consumerDatabaseLocation":"/path/to/consumer.db"},"badRecordsPath":"/path/to/bad/recordsTracker","overwatchScope":["audit","accounts","jobs","sparkEvents","clusters","clusterEvents","notebooks","pools"],"maxDaysToLoad":60,"databricksContractPrices":{"interactiveDBUCostUSD":0.55,"automatedDBUCostUSD":0.15,"sqlComputeDBUCostUSD":0.22,"jobsLightDBUCostUSD":0.1},"primordialDateString":"2021-01-01","intelligentScaling":{"enabled":true,"minimumCores":8,"maximumCores":64,"coeff":1.0}}""" ``` -------------------------------- ### Initialize Overwatch Workspace (Scala) Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/Upgrade_0610_91LTS Initializes the Overwatch workspace with a compact configuration string. This sets up the necessary objects and configurations to run Overwatch jobs. ```Scala val compactString = """compact string""" val prodWorkspace = Initializer(compactString) ``` -------------------------------- ### AWS Production Workspace Config - Scala Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/Upgrade_Example Provides an example of a production workspace configuration string for AWS. This JSON string outlines settings for audit logging, token secrets, data targets, and intelligent scaling, tailored for Databricks Overwatch on AWS. ```Scala val prodArgs = """{"auditLogConfig":{"rawAuditPath":"/path/to/AuditLogs","auditLogFormat":"json"},"tokenSecret":{"scope":"mySecretScope","key":"PATKeyName"},"dataTarget":{"databaseName":"overwatch_etl","databaseLocation":"/path/to/etl.db","etlDataPathPrefix":"/path/to/global_share","consumerDatabaseName":"overwatch","consumerDatabaseLocation":"/path/to/consumer.db"},"badRecordsPath":"/tmp/overwatch_etl/sparkEventsBadrecords","overwatchScope":["audit","sparkEvents","jobs","clusters","clusterEvents","notebooks","pools"],"maxDaysToLoad":30,"databricksContractPrices":{"interactiveDBUCostUSD":0.56,"automatedDBUCostUSD":0.26,"sqlComputeDBUCostUSD":0.22,"jobsLightDBUCostUSD":0.1},"primordialDateString":"2021-01-01","intelligentScaling":{"enabled":false,"minimumCores":4,"maximumCores":512,"coeff":1.0}}""" ``` -------------------------------- ### Import Overwatch Utilities - Scala Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/Upgrade_Example Imports necessary classes from the Overwatch utility and pipeline modules for use in Scala notebooks. ```Scala import com.databricks.labs.overwatch.utils.Upgrade ``` ```Scala import com.databricks.labs.overwatch.pipeline.Initializer ``` -------------------------------- ### Initialize Overwatch Workspace (Scala) Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/Upgrade_Example Initializes the Overwatch workspace using production arguments. This step prepares the environment for subsequent operations, such as upgrades. This snippet uses Scala. ```Scala val prodWorkspace = Initializer(prodArgs) ``` -------------------------------- ### Configure Overwatch Parameters and Instantiate Workspace Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_070 This snippet shows how to construct the Overwatch parameters, including data targets, token secrets, paths, pricing, and workspace name, and then instantiate the Overwatch workspace. ```scala private val dataTarget = DataTarget( Some(etlDB), Some(s"${storagePrefix}/${workspaceID}/${etlDB}.db"), Some(s"${storagePrefix}/global_share"), Some(consumerDB), Some(s"${storagePrefix}/${workspaceID}/${consumerDB}.db") ) private val tokenSecret = TokenSecret(secretsScope, dbPATKey) private val badRecordsPath = s"${storagePrefix}/${workspaceID}/sparkEventsBadrecords" private val auditSourcePath = "/path/to/my/workspaces/raw_audit_logs" // INPUT: workspace audit log directory private val interactiveDBUPrice = 0.56 private val automatedDBUPrice = 0.26 private val customWorkspaceName = workspaceID // customize this to a custom name if custom workspace_name is desired val params = OverwatchParams( auditLogConfig = AuditLogConfig(rawAuditPath = Some(auditSourcePath)), dataTarget = Some(dataTarget), tokenSecret = Some(tokenSecret), badRecordsPath = Some(badRecordsPath), overwatchScope = Some(scopes), maxDaysToLoad = maxDaysToLoad, databricksContractPrices = DatabricksContractPrices(interactiveDBUPrice, automatedDBUPrice), primordialDateString = Some(primordialDateString), workspace_name = Some(customWorkspaceName), externalizeOptimize = false // recommend true -- reference this https://databrickslabs.github.io/overwatch/gettingstarted/advancedtopics/#externalize-optimize--z-order-as-of-060 ) private val args = JsonUtils.objToJson(params).compactString val workspace = Initializer(args, debugFlag = false) ``` -------------------------------- ### AWS Example Configuration for Overwatch Pipeline Source: https://databrickslabs.github.io/overwatch/gettingstarted/configuration This Scala code snippet demonstrates how to configure the Overwatch pipeline for AWS, setting up data targets, security tokens, and pricing details. It initializes parameters for audit logs, data storage, and intelligent scaling. ```scala import com.databricks.labs.overwatch.pipeline.{Initializer, Bronze, Silver, Gold} import com.databricks.labs.overwatch.utils._ private val storagePrefix = /mnt/overwatch_global private val dataTarget = DataTarget( Some(etlDB), Some(s"${storagePrefix}/${workspaceID}/${etlDB}.db"), Some(s"${storagePrefix}/global_share"), Some(consumerDB), Some(s"${storagePrefix}/${workspaceID}/${consumerDB}.db") ) private val tokenSecret = TokenSecret(secretsScope, dbPATKey) private val badRecordsPath = s"${storagePrefix}/${workspaceID}/sparkEventsBadrecords" private val auditSourcePath = "/path/to/my/workspaces/raw_audit_logs" // INPUT: workspace audit log directory private val interactiveDBUPrice = 0.56 private val automatedDBUPrice = 0.26 private val DatabricksSQLDBUPrice = 0.22 private val automatedJobsLightDBUPrice = 0.10 private val customWorkspaceName = workspaceID // customize this to a custom name if custom workspace_name is desired val params = OverwatchParams( auditLogConfig = AuditLogConfig(rawAuditPath = Some(auditSourcePath), auditLogFormat = "json"), dataTarget = Some(dataTarget), tokenSecret = Some(tokenSecret), badRecordsPath = Some(badRecordsPath), overwatchScope = Some(scopes), maxDaysToLoad = maxDaysToLoad, databricksContractPrices = DatabricksContractPrices(interactiveDBUPrice, automatedDBUPrice, DatabricksSQLDBUPrice, automatedJobsLightDBUPrice), primordialDateString = Some(primordialDateString), intelligentScaling = IntelligentScaling(enabled = true, minimumCores = 16, maximumCores = 64, coeff = 1.0), workspace_name = Some(customWorkspaceName), // as of 0.6.0 externalizeOptimize = false // as of 0.6.0 ) ``` -------------------------------- ### Scala Example for Azure Spark UI API Source: https://databrickslabs.github.io/overwatch/assets/_index/realtime_helpers Example Scala code demonstrating how to construct the URL and retrieve data from the Azure Spark UI API. ```APIDOC ## Scala Example for Azure Spark UI API ### Description This Scala code snippet shows how to dynamically construct the API URL for the Azure Spark UI and set up variables needed for the request. ### Method N/A (Code Example) ### Endpoint N/A (Code Example) ### Parameters #### Scala Variables - **`env`** (String) - The Databricks API URL. - **`token`** (String) - The Databricks API token. - **`clusterId`** (String) - The ID of the Spark cluster. - **`sparkContextID`** (String) - The ID of the Spark context. - **`orgId`** (String) - The organization ID. - **`uiPort`** (String) - The Spark UI port. - **`apiPath`** (String) - The specific API endpoint path (e.g., 'applications'). - **`url`** (String) - The complete constructed URL for the API request. ### Request Example ```scala val env = dbutils.notebook.getContext.apiUrl.get val token = dbutils.notebook.getContext.apiToken.get val clusterId = spark.conf.get("spark.databricks.clusterUsageTags.clusterId") val sparkContextID = spark.conf.get("spark.databricks.sparkContextId") val orgId = spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId") val uiPort = spark.conf.get("spark.ui.port") val apiPath = "applications" val url = s"$env/driver-proxy-api/o/$orgId/$clusterId/$uiPort/api/v1/$apiPath" ``` ### Response N/A (Code Example) ``` -------------------------------- ### Display Configuration Strings Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example Shows how to convert the Overwatch parameters object to different JSON string formats (escaped and compact). These commands are useful for preparing the configuration to be run as a job with a main class. ```scala JsonUtils.objToJson(params).escapedString ``` ```scala JsonUtils.objToJson(params).compactString ``` -------------------------------- ### Scala Example for AWS Spark UI API Source: https://databrickslabs.github.io/overwatch/assets/_index/realtime_helpers Example Scala code for constructing the URL to access the AWS Spark UI API. ```APIDOC ## Scala Example for AWS Spark UI API ### Description This Scala code snippet demonstrates how to construct the API endpoint URL for accessing the AWS Spark UI through the Databricks driver proxy. ### Method N/A (Code Example) ### Endpoint N/A (Code Example) ### Parameters #### Scala Variables - **`uiPort`** (String) - The Spark UI port. - **`env`** (String) - The base URL of the Databricks environment (shard name). - **`token`** (String) - The Databricks API token. - **`clusterId`** (String) - The ID of the Spark cluster. - **`url`** (String) - The complete constructed URL for the API request. ### Request Example ```scala val uiPort = spark.conf.get("spark.ui.port") val env = dbutils.notebook.getContext.apiUrl.get val token = dbutils.notebook.getContext.apiToken.get val clusterId = spark.conf.get("spark.databricks.clusterUsageTags.clusterId") val url = s"https://$env/driver-proxy-api/o/0/$clusterId/$uiPort/api/v1/" // Example output from execution: // uiPort: String = 42285 // env: String = https://demo.cloud.databricks.com // token: String = [REDACTED] // clusterId: String = 0318-151752-abed99 // url: String = https://https://demo.cloud.databricks.com/driver-proxy-api/o/0/0318-151752-abed99/42285/api/v1/ ``` ### Response N/A (Code Example) ``` -------------------------------- ### AWS EC2 DescribeSpotPriceHistory API Response Example Source: https://databrickslabs.github.io/overwatch/faq A sample XML response from the AWS EC2 DescribeSpotPriceHistory API, detailing instance type, product description, spot price, timestamp, and availability zone for spot instances. ```xml 59dbff89-35bd-4eac-99ed-be587EXAMPLE m3.medium Linux/UNIX 0.287 2020-11-01T20:56:05.000Z us-west-2a m3.medium Windows 0.033 2020-11-01T22:33:47.000Z us-west-2a ``` -------------------------------- ### Import Databricks Overwatch Libraries (Scala) Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example Imports necessary classes from the Databricks Overwatch library for pipeline operations and utility functions. ```scala import com.databricks.labs.overwatch.pipeline.{Initializer, Bronze, Silver, Gold} import com.databricks.labs.overwatch.utils._ import com.databricks.labs.overwatch.pipeline.TransformFunctions ``` -------------------------------- ### Generate Compact Config String - Scala Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/Upgrade_Example Demonstrates how to instantiate Overwatch parameters and then convert the object to a compact JSON string using `JsonUtils.objToJson` and `compactString`. This is essential for passing configuration parameters efficiently. ```Scala val params = OverwatchParams(...) println(JsonUtils.objToJson(params).compactString) ``` -------------------------------- ### List Storage Directory (Scala) Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example Displays the contents of a specified directory within the Databricks file system, typically used for initial verification or debugging. ```scala display(dbutils.fs.ls(s"${storagePrefix}/${workspaceID}").toDF) ``` -------------------------------- ### Get Config Strings for Overwatch Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_060 Provides methods to serialize Overwatch parameters into different string formats (escaped, compact, pretty) for job execution or reference. ```Scala // use this to run Overwatch as a job JsonUtils.objToJson(params).escapedString ``` ```Scala // use this to get a compact string of the created parameters for reference and/or creating workspace object on the fly JsonUtils.objToJson(params).compactString ``` ```Scala // Prints a pretty string of the entire config to be used in the run JsonUtils.objToJson(params).prettyString ``` -------------------------------- ### Install and Configure Collectd Source: https://databrickslabs.github.io/overwatch/assets/_index/realtime_helpers Installs the collectd monitoring agent and its utilities, then configures the collectd.conf file to monitor various system metrics and send them to a Graphite server. This involves setting up plugins for CPU, memory, disk, network interfaces, and more, along with specifying the Graphite host and port. ```bash apt-get update -y DEBIAN_FRONTEND=noninteractive apt install collectd collectd-utils -y cat >"/etc/collectd/collectd.conf" < Device "/dev/mapper/vg-lv" MountPoint "/local_disk0" FSType "ext4" Interface "ens3" IgnoreSelected false Host "${OVERWATCH_RELAY}" Port "2003" Protocol "tcp" LogSendErrors true Prefix "v1.${WORKSPACE}.cluster_${DB_CLUSTER_ID}.machine.raw.${IS_DRIVER}." StoreRates true AlwaysAppendDS false EscapeCharacter "_" EOL service collectd restart ``` -------------------------------- ### Convert Parameters to Compact String for Reference Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_070 This snippet demonstrates converting the Overwatch parameters object into a compact JSON string. This is useful for referencing the configuration or instantiating the workspace object on the fly. ```scala // use this to get a compact string of the created parameters for reference and/or creating workspace object on the fly JsonUtils.objToJson(params).compactString ``` -------------------------------- ### Execute Overwatch Upgrade - Scala Example Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/060_upgrade_process Provides an example of how to execute the Overwatch upgrade to version 0.6.0 using Scala. It demonstrates calling the 'upgradeTo060' function with specific parameters, including disabling Spark table rebuilds. ```scala Upgrade.upgradeTo060(prodWorkspace, workspaceNameMap, rebuildSparkTables = false, startStep = 1.0) ``` -------------------------------- ### Import Overwatch Initialization Utility (Scala) Source: https://databrickslabs.github.io/overwatch/assets/ChangeLog/Upgrade_0610_91LTS Imports the Initializer class from the Overwatch pipeline utilities. This class is essential for setting up and configuring Overwatch jobs. ```Scala import com.databricks.labs.overwatch.pipeline.Initializer ``` -------------------------------- ### Execute Silver Pipeline Stage Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_070 This command executes the Silver layer of the Overwatch pipeline, transforming bronze data into a silver data format. ```scala Silver(workspace).run() ``` -------------------------------- ### Construct Overwatch Parameters and Instantiate Workspace Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_042 Defines data targets, token secrets, paths, and pricing for the Overwatch project. It then constructs the OverwatchParams object, serializes it to JSON, and initializes the workspace. Dependencies include `DataTarget`, `TokenSecret`, `DatabricksContractPrices`, `AuditLogConfig`, and `OverwatchParams`. ```scala private val dataTarget = DataTarget( Some(etlDB), Some(s"${storagePrefix}/${workspaceID}/${etlDB}.db"), Some(s"${storagePrefix}/global_share"), Some(consumerDB), Some(s"${storagePrefix}/${workspaceID}/${consumerDB}.db") ) ``` ```scala private val tokenSecret = TokenSecret(secretsScope, dbPATKey) ``` ```scala private val badRecordsPath = s"${storagePrefix}/${workspaceID}/sparkEventsBadrecords" ``` ```scala private val auditSourcePath = "/path/to/my/workspaces/raw_audit_logs" // INPUT: workspace audit log directory ``` ```scala private val interactiveDBUPrice = 0.56 ``` ```scala private val automatedDBUPrice = 0.26 ``` ```scala val params = OverwatchParams( auditLogConfig = AuditLogConfig(rawAuditPath = Some(auditSourcePath)), dataTarget = Some(dataTarget), tokenSecret = Some(tokenSecret), badRecordsPath = Some(badRecordsPath), overwatchScope = Some(scopes), maxDaysToLoad = maxDaysToLoad, databricksContractPrices = DatabricksContractPrices(interactiveDBUPrice, automatedDBUPrice), primordialDateString = Some(primordialDateString) ) ``` ```scala private val args = JsonUtils.objToJson(params).compactString ``` ```scala val workspace = Initializer(args, debugFlag = true) ``` -------------------------------- ### Configure Widgets for Job Parameters (Scala) Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example Sets up Databricks widgets for user-configurable parameters like storage prefix, database names, secrets, and date ranges. It also defines the scopes for data loading. ```scala // dbutils.widgets.removeAll // dbutils.widgets.text("storagePrefix", "", "1. ETL Storage Prefix") // dbutils.widgets.text("etlDBName", "overwatch_etl", "2. ETL Database Name") // dbutils.widgets.text("consumerDBName", "overwatch", "3. Consumer DB Name") // dbutils.widgets.text("secretsScope", "my_secret_scope", "4. Secret Scope") // dbutils.widgets.text("dbPATKey", "my_key_with_api", "5. Secret Key (DBPAT)") // dbutils.widgets.text("primordialDateString", "2020-10-27", "6. Primordial Date") // dbutils.widgets.text("maxDaysToLoad", "60", "7. Max Days") // dbutils.widgets.text("scopes", "all", "8. Scopes") ``` -------------------------------- ### Show The Run Report Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example Displays the run report for the specified etlDB. This command is used to visualize the results and performance of the executed pipeline stages. ```scala display(pipReport(etlDB)) ``` -------------------------------- ### Construct Overwatch Parameters and Instantiate Workspace Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/azure_runner_docs_example Defines data targets, token secrets, event hub configurations, DBU prices, and constructs the OverwatchParams object. It then initializes the workspace based on the generated parameters. ```scala private val dataTarget = DataTarget( Some(etlDB), Some(s"${storagePrefix}/${workspaceID}/${etlDB}.db"), Some(s"${storagePrefix}/global_share"), Some(consumerDB), Some(s"${storagePrefix}/${workspaceID}/${consumerDB}.db") ) ``` ```scala private val tokenSecret = TokenSecret(secretsScope, dbPATKey) ``` ```scala private val ehConnString = dbutils.secrets.get(secretsScope, ehKey) ``` ```scala private val ehStatePath = s"${storagePrefix}/${workspaceID}/ehState" ``` ```scala private val badRecordsPath = s"${storagePrefix}/${workspaceID}/sparkEventsBadrecords" ``` ```scala private val azureLogConfig = AzureAuditLogEventhubConfig(connectionString = ehConnString, eventHubName = ehName, auditRawEventsPrefix = ehStatePath) ``` ```scala private val interactiveDBUPrice = 0.56 ``` ```scala private val automatedDBUPrice = 0.26 ``` ```scala val params = OverwatchParams( auditLogConfig = AuditLogConfig(azureAuditLogEventhubConfig = Some(azureLogConfig)), dataTarget = Some(dataTarget), tokenSecret = Some(tokenSecret), badRecordsPath = Some(badRecordsPath), overwatchScope = Some(scopes), maxDaysToLoad = maxDaysToLoad, databricksContractPrices = DatabricksContractPrices(interactiveDBUPrice, automatedDBUPrice), primordialDateString = Some(primordialDateString) ) ``` ```scala private val args = JsonUtils.objToJson(params).compactString ``` ```scala val workspace = if (args.length != 0) { Initializer(Array(args), debugFlag = true) } else { Initializer(Array()) } ``` -------------------------------- ### Execute Gold Pipeline Stage Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_070 This command executes the Gold layer of the Overwatch pipeline, further processing silver data into a gold data format. ```scala Gold(workspace).run() ``` -------------------------------- ### Install and Configure Collectd for Metrics Collection Source: https://databrickslabs.github.io/overwatch/assets/_index/realtime_helpers Installs the collectd package and configures it to collect various system metrics (CPU, memory, disk, network interfaces) and send them to a Graphite relay. The configuration includes specifying devices, mount points, and the Graphite node details. ```bash apt-get update -y DEBIAN_FRONTEND=noninteractive apt install collectd collectd-utils -y ``` ```bash cat >"/etc/collectd/collectd.conf" < Device "/dev/mapper/vg-lv" MountPoint "/local_disk0" FSType "ext4" Interface "ens3" IgnoreSelected false Host "${OVERWATCH_RELAY}" Port "2003" Protocol "tcp" LogSendErrors true Prefix "v1.${WORKSPACE}.cluster_${DB_CLUSTER_ID}.machine.raw.${IS_DRIVER}." StoreRates true AlwaysAppendDS false EscapeCharacter "_" EOL ``` -------------------------------- ### Configure Overwatch Job Parameters via Widgets (Scala) Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_070 Sets up Databricks widgets for interactive configuration of Overwatch job parameters such as storage prefix, database names, secrets scope, and date ranges. These parameters are then retrieved and validated. ```Scala // dbutils.widgets.removeAll // dbutils.widgets.text("storagePrefix", "", "1. ETL Storage Prefix") // dbutils.widgets.text("etlDBName", "overwatch_etl", "2. ETL Database Name") // dbutils.widgets.text("consumerDBName", "overwatch", "3. Consumer DB Name") // dbutils.widgets.text("secretsScope", "my_secret_scope", "4. Secret Scope") // dbutils.widgets.text("dbPATKey", "SECRET_NAME_my_key_with_api", "5. Secret Key (DBPAT)") // dbutils.widgets.text("primordialDateString", "2021-09-30", "6. Primordial Date") // dbutils.widgets.text("maxDaysToLoad", "60", "7. Max Days") // dbutils.widgets.text("scopes", "all", "8. Scopes") ``` ```Scala val storagePrefix = dbutils.widgets.get("storagePrefix").toLowerCase // OUTPUT: PRIMARY OVERWATCH OUTPUT PREFIX val etlDB = dbutils.widgets.get("etlDBName").toLowerCase val consumerDB = dbutils.widgets.get("consumerDBName").toLowerCase val secretsScope = dbutils.widgets.get("secretsScope") val dbPATKey = dbutils.widgets.get("dbPATKey") val primordialDateString = dbutils.widgets.get("primordialDateString") val maxDaysToLoad = dbutils.widgets.get("maxDaysToLoad").toInt val scopes = if (dbutils.widgets.get("scopes") == "all") { "audit,sparkEvents,jobs,clusters,clusterEvents,notebooks,pools,accounts,dbsql".split(",") } else dbutils.widgets.get("scopes").split(",") if (storagePrefix.isEmpty || consumerDB.isEmpty || etlDB.isEmpty || secretsScope.isEmpty || dbPATKey.isEmpty) { throw new IllegalArgumentException("Please specify all required parameters!") } ``` -------------------------------- ### Execute Bronze Pipeline Stage Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_070 This command executes the Bronze layer of the Overwatch pipeline, processing raw data into a bronze data format. ```scala Bronze(workspace).run() ``` -------------------------------- ### AWS EC2 DescribeSpotPriceHistory API Request Example Source: https://databrickslabs.github.io/overwatch/faq A sample request to the AWS EC2 API to retrieve historical spot pricing data. It specifies the time range, availability zone, and requires authentication parameters. ```http &StartTime=2020-11-01T00:00:00.000Z &EndTime=2020-11-01T23:59:59.000Z &AvailabilityZone=us-west-2a &AUTHPARAMS ``` -------------------------------- ### List Storage Files (Scala) Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/azure_runner_docs_example Lists the files in the specified storage path for the current workspace. This is typically used to check if data already exists. ```scala // If first run this should be empty display(dbutils.fs.ls(s"${storagePrefix}/${workspaceID}").toDF) ``` -------------------------------- ### Databricks Overwatch Init Script Source: https://databrickslabs.github.io/overwatch/assets/_index/realtime_helpers A bash script that initializes the Databricks environment for Overwatch. It sets up environment variables, configures Spark metrics, installs and configures collectd, and restarts the collectd service. ```bash #!/bin/bash OVERWATCH_RELAY=127.0.0.1 # CHANGE_ME WORKSPACE=AWS_DEMO # CHANGE_ME if [[ $DB_IS_DRIVER = "TRUE" ]]; then IS_DRIVER="driver" else IS_DRIVER="worker" fi IP_CLEAN=$( echo ${DB_CONTAINER_IP} | tr '.' '_' ) LOG_DIR=/dbfs/cluster-logs/${DB_CLUSTER_ID}/csv_metrics/${DB_CONTAINER_IP} if [ ! -d "/dbfs/cluster-logs/${DB_CLUSTER_ID}/csv_metrics/${DB_CONTAINER_IP}" ] then mkdir -p $LOG_DIR fi TMP_SCRIPT=/tmp/custom_metrics.sh cat >"$TMP_SCRIPT" <> /databricks/spark/conf/metrics.properties apt-get update -y DEBIAN_FRONTEND=noninteractive apt install collectd collectd-utils -y cat >"/etc/collectd/collectd.conf" < Device "/dev/mapper/vg-lv" MountPoint "/local_disk0" FSType "ext4" Interface "ens3" IgnoreSelected false Host "${OVERWATCH_RELAY}" Port "2003" Protocol "tcp" LogSendErrors true Prefix "v1.${WORKSPACE}.cluster_${DB_CLUSTER_ID}.machine.raw.${IS_DRIVER}." StoreRates true AlwaysAppendDS false EscapeCharacter "_" EOL service collectd restart ``` -------------------------------- ### Construct Overwatch Parameters and Instantiate Workspace Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/azure_runner_docs_example_070 Constructs the `OverwatchParams` object using the retrieved configuration values, including data targets, token secrets, and event hub configurations. It then uses these parameters to initialize the Overwatch workspace, preparing it for pipeline execution. ```scala private val dataTarget = DataTarget( Some(etlDB), Some(s"${storagePrefix}/${workspaceID}/${etlDB}.db"), Some(s"${storagePrefix}/global_share"), Some(consumerDB), Some(s"${storagePrefix}/${workspaceID}/${consumerDB}.db") ) private val tokenSecret = TokenSecret(secretsScope, dbPATKey) private val ehConnString = s"{{secrets/${secretsScope}/${ehKey}}}" private val ehStatePath = s"${storagePrefix}/${workspaceID}/ehState" private val badRecordsPath = s"${storagePrefix}/${workspaceID}/sparkEventsBadrecords" private val azureLogConfig = AzureAuditLogEventhubConfig(connectionString = ehConnString, eventHubName = ehName, auditRawEventsPrefix = ehStatePath) private val interactiveDBUPrice = 0.55 private val automatedDBUPrice = 0.30 private val customWorkspaceName = workspaceID // customize this to a custom name if custom workspace_name is desired val params = OverwatchParams( auditLogConfig = AuditLogConfig(azureAuditLogEventhubConfig = Some(azureLogConfig)), dataTarget = Some(dataTarget), tokenSecret = Some(tokenSecret), badRecordsPath = Some(badRecordsPath), overwatchScope = Some(scopes), maxDaysToLoad = maxDaysToLoad, databricksContractPrices = DatabricksContractPrices(interactiveDBUPrice, automatedDBUPrice), primordialDateString = Some(primordialDateString), workspace_name = Some(customWorkspaceName), externalizeOptimize = false // recommend true -- reference this https://databrickslabs.github.io/overwatch/gettingstarted/advancedtopics/#externalize-optimize--z-order-as-of-060 ) private val args = JsonUtils.objToJson(params).compactString val workspace = Initializer(args, debugFlag = false) ``` -------------------------------- ### Import Overwatch Environment and Pipeline Components (Scala) Source: https://databrickslabs.github.io/overwatch/assets/FAQ/multi_workspace_cleanup Imports necessary classes from the Overwatch library for environment setup and pipeline initialization. These imports are essential for interacting with Databricks workspaces. ```Scala import org.apache.spark.sql.functions._ ``` ```Scala import com.databricks.labs.overwatch.env.Workspace ``` ```Scala import scala.collection.parallel.ForkJoinTaskSupport ``` ```Scala import java.util.concurrent.ForkJoinPool ``` ```Scala import com.databricks.labs.overwatch.pipeline.Initializer ``` -------------------------------- ### Scala: Initialize Overwatch with updated arguments (0.4.2+) Source: https://databrickslabs.github.io/overwatch/changelog This Scala code shows the updated syntax for initializing Overwatch starting from version 0.4.2 when instantiating it from a notebook. It reflects a change in how arguments are passed to the `Initializer` class. ```scala // 0.4.1 (OLD) val workspace = Initializer(Array(args), debugFlag = true) // 0.4.2+ (NEW) val workspace = Initializer(args, debugFlag = true) ``` -------------------------------- ### Convert Parameters to Pretty String for Display Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example_070 This code snippet converts the Overwatch parameters object into a pretty-printed JSON string, making the entire configuration easily readable for review. ```scala // Prints a pretty string of the entire config to be used in the run JsonUtils.objToJson(params).prettyString ``` -------------------------------- ### Execute The Pipeline Source: https://databrickslabs.github.io/overwatch/assets/GettingStarted/aws_runner_docs_example Executes the Overwatch data processing pipeline in stages: Bronze, Silver, and Gold. Each stage is represented by a class that is instantiated with the workspace object and then its run method is called. ```scala Bronze(workspace).run() ``` ```scala Silver(workspace).run() ``` ```scala Gold(workspace).run() ```