### Getting Started Tutorials

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

A collection of tutorials to help users get started with data analysis and tools at Mozilla.

```markdown
[Getting Started](cookbooks/getting_started/index.md)
```

--------------------------------

### Serve Documentation Locally

Source: https://github.com/mozilla/data-docs/blob/main/README.md

Starts a local server to preview the documentation. This command is used after installing mdbook-dtmo and assumes the documentation source files are present in the current directory.

```bash
mdbook-dtmo serve
```

--------------------------------

### BigQuery ETL for Live Data Setup

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/live_data.md

This snippet references a Python script used with bigquery-etl for setting up direct access to live data. It's presented as an example of the ease of setup for this method, particularly for creating user-facing views.

```Python
# Reference to a bigquery-etl script for setting up live data views
# Example: https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/hubs/active_subscription_ids/view.sql
# This is a conceptual representation. The actual code would be within the linked SQL file
# which defines a BigQuery view. The bigquery-etl tool orchestrates the creation of such views.
# Example of a SQL view definition (as found in the linked file):
# CREATE VIEW `your_project.your_dataset.hubs_active_subscription_ids_live` AS
# SELECT
#   user_id,
#   MAX(CASE WHEN subscription_active THEN 1 ELSE 0 END) AS is_active
# FROM
#   `your_project.your_dataset.hubs_subscriptions_live`
# WHERE
#   submission_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
# GROUP BY
#   user_id;
```

--------------------------------

### Install mdbook-dtmo via curl

Source: https://github.com/mozilla/data-docs/blob/main/README.md

Installs the mdbook-dtmo tool, a fork of mdBook with custom additions for Mozilla's environment, using a script from a provided URL. This is a convenient way to get the tool if Rust is already installed.

```bash
curl -LSfs https://japaric.github.io/trust/install.sh | sh -s -- --git badboy/mdbook-dtmo
```

--------------------------------

### Example Query for Telemetry (BigQuery)

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/other/addons_daily/intro.md

This snippet shows an example query for accessing the addons_daily table through the Telemetry (BigQuery) data source. It demonstrates how to retrieve data related to add-ons and their users.

```SQL
SELECT
  addon_id,
  submission_date,
  users_with_addon
FROM
  `project.dataset.addons_daily`
WHERE
  submission_date BETWEEN '2023-01-01' AND '2023-01-31'
LIMIT
  100;
```

--------------------------------

### Working with Looker Introduction

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

An introductory guide to using Looker for data visualization and analysis at Mozilla.

```markdown
[Working with Looker](cookbooks/looker/index.md)
```

--------------------------------

### Stub Installer Ping Dataset

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

Provides data from the stub installer ping, likely related to the initial installation process of Firefox. Useful for tracking installation success and early user experience.
```markdown
[Stub installer ping](datasets/other/stub_installer/reference.md)
```

--------------------------------

### Introduction to Cookbooks

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

An introduction to the collection of tutorials and practical guides for data analysis at Mozilla.

```markdown
[Tutorials & Cookbooks](cookbooks/index.md)
```

--------------------------------

### Install mdbook-dtmo via Cargo

Source: https://github.com/mozilla/data-docs/blob/main/README.md

Builds and installs the mdbook-dtmo preprocessors using the Cargo package manager. This method is suitable if you have the Rust toolchain and Cargo installed.

```bash
cargo install mdbook-dtmo
```

--------------------------------

### BigQuery Table Naming Convention

Source: https://github.com/mozilla/data-docs/blob/main/src/tools/guiding_principles.md

This example shows the naming convention for BigQuery tables used by the data pipeline. It includes the dataset, table name, and version, differentiating between live and stable (historical) data.

```sql
activity_stream_live.impression_stats_v1
activity_stream_stable.impression_stats_v1
```

--------------------------------

### BigQuery ETL Logic Examples

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/data_modeling/where_to_store.md

Examples of business logic and data transformations expected in the 'bigquery-etl' repository for BigQuery datasets. This includes core metrics, search metrics, acquisition/retention/churn calculations, and partner code mapping.

```markdown
- The calculation of [core metrics](https://docs.telemetry.mozilla.org/metrics/index.html): DAU, WAU, MAU, new profiles.
- Calculation of [search metrics](https://docs.telemetry.mozilla.org/datasets/search.html?highlight=search#terminology). E.g. Ad clicks, search with ads, organic search.
- Calculation of acquisition, retention and churn metrics.
- Mapping from partner code to platform for Bing revenue.
- Segmentation of clients that require the implementation of business logic, not just filtering on specific columns.
```

--------------------------------

### Looker Explore Definition Example (VPN Subscriptions)

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/data_modeling/where_to_store.md

An example of a Looker explore definition in LKML for 'subscriptions', showcasing how to join with other views and define specific dimensions and time frames.

```lkml
explore: subscriptions {
  label: "VPN Subscriptions"
  from: subscriptions

  join: users {
    type: left_outer
    sql_on: ${subscriptions.user_id} = ${users.id} ;;
  }

  # ... other join and view definitions
}
```

--------------------------------

### OpMon Project Configuration Example

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/operational_monitoring.md

Example TOML configuration for an OpMon project, detailing sections like [project], [data_sources], [metrics], and [dimensions]. This snippet illustrates how to define project-specific monitoring parameters.

```toml
[project]
# A custom, descriptive name of the project.
# This will be used as the generated Looker dashboard title.
name = "A new operational monitoring project"

# The name of the platform this project targets.
# For example, "firefox_desktop", "fenix", "firefox_ios", ...
platform = "firefox_desktop"

# Specifies the type of monitoring desired as described above.
# Either "submission_date" (to monitor each day) or "build_id" (to monitor build over build)
xaxis = "submission_date"

# Both start_date and end_date can be overridden, otherwise the dates configured in
# Experimenter will be used as defaults.
start_date = "2022-01-01"

# Whether to skip the analysis for this project entirely.
# Useful for skipping rollouts for which OpMon projects are generated automatically otherwise.
skip = false

# Whether the project is related to a rollout.
is_rollout = false

# Ignore the default metrics that would be computed.
skip_default_metrics = false

# Whether to have all the results in a single tile on the Looker dashboard (compact)
# or to have separate tiles for each metric.
compact_visualization = false

# Metrics, that are based on metrics, to compute.
# Defined as a list of strings. These strings are the "slug" of the metric, which is the
```

--------------------------------

### Querying Live Tables Directly Example

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/live_data.md

This example demonstrates how to query live tables, specifically accessing data from the last two days from live tables and older data from stable tables. It highlights the importance of disabling caching to ensure fresh data is returned. The example links to a SQL view definition for active hub subscriptions.

```SQL
-- Example SQL for accessing live data (conceptual)
-- Assumes a view named 'monitoring.topsites_click_rate_live'
-- and filtering by submission_timestamp.
SELECT
  *
FROM
  `your_project.monitoring.topsites_click_rate_live`
WHERE
  submission_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
  AND submission_timestamp < CURRENT_TIMESTAMP();

-- Note: Actual query will depend on the specific view and table structure.
-- Caching should be disabled in the query execution environment.
```

--------------------------------

### Example BigQuery Query: Clients Last Seen

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/bigquery/querying.md

An example SQL query to retrieve and aggregate data from the `mozdata.telemetry.clients_last_seen` table, demonstrating filtering, grouping, and ordering.
```sql
SELECT
  submission_date,
  os,
  COUNT(*) AS count
FROM
  mozdata.telemetry.clients_last_seen
WHERE
  submission_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 WEEK)
  AND days_since_seen = 0
GROUP BY
  submission_date,
  os
HAVING
  count > 10 -- remove outliers
  AND lower(os) NOT LIKE '%windows%'
ORDER BY
  os,
  submission_date DESC
```

--------------------------------

### SQL Style Guide

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

Presents a style guide for writing SQL queries, promoting consistency, readability, and maintainability. Includes best practices for formatting, naming conventions, and query optimization.

```markdown
[SQL Style Guide](concepts/sql_style.md)
```

--------------------------------

### Looker View Definition Example (Browser KPIs)

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/data_modeling/where_to_store.md

An example of a Looker view definition in LKML for 'browser_kpis', demonstrating the implementation of cumulative days of use using a SUM aggregation.

```lkml
view: browser_kpis {
  # ... other dimension and measure definitions

  measure: cumulative_days_of_use {
    type: sum
    sql: ${days_of_use} ;;
    label: "Cumulative Days of Use"
  }

  # ... other dimension and measure definitions
}
```

--------------------------------

### Creating a Prototype Data Project on Google Cloud Platform

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

Steps for setting up a prototype data project within Google Cloud Platform, including environment configuration and initial setup.

```markdown
[Creating a Prototype Data Project on Google Cloud Platform](cookbooks/gcp-projects.md)
```

--------------------------------

### Project Configuration Example

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/operational_monitoring.md

This snippet shows a typical project configuration, defining metrics, alerts, data sources, and rollout parameters.
```toml
# name of the metric definition section in either the project configuration or the platform-specific
# configuration file.
# See [metrics] section on how these metrics get defined.
metrics = [
    'shutdown_hangs',
    'main_crashes',
    'startup_crashes',
    'memory_unique_content_startup',
    'perf_page_load_time_ms'
]

alerts = [
    "ci_diffs"
]

# This section specifies the clients that should be monitored.
[project.population]

# Slug/name of the data source definition section in either the project configuration or the platform-specific
# configuration file. This data source refers to a database table.
# See [data_sources] section on how this gets defined.
data_source = "main"

# The name of the branches that have been configured for a rollout or experiment.
# If defined, this configuration overrides boolean_pref.
branches = ["enabled", "disabled"]

# A SQL snippet that results in a boolean representing whether a client is included in the rollout or experiment or not.
boolean_pref = "environment.settings.fission_enabled"

# The channel the clients should be monitored from: "release", "beta", or "nightly".
channel = "beta"

# If set to "true", the rollout and experiment configurations will be ignored and instead
# the entire client population (regardless of whether they are part of the experiment or rollout)
# will be monitored.
# This option is useful if the project is not associated to a rollout or experiment and the general
# client population of a product should be monitored.
monitor_entire_population = false

# References to dimension slugs that are used to segment the client population.
# Defined as a list of strings. These strings are the "slug" of the dimension, which is the
# name of the dimension definition section in either the project configuration or the platform-specific
# configuration file. See [dimensions] section on how these get defined.
dimensions = ["os"]

# A set of metrics that should be part of the same visualization
[project.metric_groups.crashes]
friendly_name = "Crashes"
description = "Breakdown of crashes"
metrics = [
    "main_crashes",
    "startup_crashes",
]
```

--------------------------------

### GCP Project Cookbook

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/gcp-projects.md

This snippet provides a link to the GCP Project Cookbook on docs.telemetry.mozilla.org. This resource offers detailed guidance on setting up and managing GCP projects within Mozilla, likely covering best practices, configurations, and procedures.

```APIDOC
GCP Project Cookbook:
  URL: https://docs.telemetry.mozilla.org/cookbooks/gcp-projects.html
```

--------------------------------

### Left Alignment of Root Keywords in SQL

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/sql_style.md

Demonstrates the recommended practice of left-aligning root SQL keywords (like SELECT, FROM, WHERE, LIMIT) to start on the same character boundary, improving visual structure. The first block follows the recommendation; the second shows the right-aligned "rivers" layout it replaces.

```sql
-- Recommended: root keywords start on the same character boundary
SELECT
  client_id,
  submission_date
FROM
  main_summary
WHERE
  sample_id = '42'
  AND submission_date > '20180101'
LIMIT
  10
```

```sql
-- Avoid: keywords right-aligned to a common column
SELECT client_id,
       submission_date
  FROM main_summary
 WHERE sample_id = '42'
   AND submission_date > '20180101'
```

--------------------------------

### SQL Join Condition Formatting

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/sql_style.md

Illustrates the correct indentation and placement for JOIN conditions using the ON clause. The ON keyword should start on a new line, indented further than the JOIN keyword, with conditions following on the same line.
```sql
FROM
  telemetry_stable.main_v4
LEFT JOIN
  static.normalized_os_name
  ON main_v4.environment.system.os.name = normalized_os_name.os_name
```

--------------------------------

### SQL Multi-line Parentheses Formatting

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/sql_style.md

Details the proper formatting for parentheses that span multiple lines. The opening parenthesis should end its line, the closing parenthesis should align with the start of the multi-line construct, and the content should be indented.

```sql
WITH sample AS (
  SELECT
    client_id,
  FROM
    main_summary
  WHERE
    sample_id = '42'
)
```

--------------------------------

### SQL Grouping Columns: Aliases vs. Implicit

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/sql_style.md

Illustrates the use of aliases or implicit column numbering for grouping in SQL, showing examples for BigQuery and Presto syntax. Emphasizes clarity and avoiding repetition of complex expressions.

```sql
-- BigQuery SQL Syntax
SELECT
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
  count(*) AS count
FROM
  telemetry.clients_daily
WHERE
  submission_date > '2019-07-01'
GROUP BY
  submission_date,
  is_prerelease -- Grouping by aliases is supported in BigQuery
```

```sql
-- Presto SQL Syntax
SELECT
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
  count(*) AS count
FROM
  telemetry.clients_daily
WHERE
  submission_date > '20190701'
GROUP BY
  1, 2 -- Implicit grouping avoids repeating expressions
```

```sql
-- Presto SQL Syntax
SELECT
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
  count(*) AS count
FROM
  telemetry.clients_daily
WHERE
  submission_date > '20190701'
GROUP BY
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta')
```

--------------------------------

### Introduction to Operational Tasks

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

An introduction to operational procedures and best practices for managing data infrastructure.

```markdown
[Operational](cookbooks/operational/index.md)
```

--------------------------------

### Data Pipeline Ingestion Endpoint

Source: https://github.com/mozilla/data-docs/blob/main/src/tools/guiding_principles.md

This example shows the structure of the HTTPS endpoint used to submit data payloads to the Mozilla data pipeline. The path shown carries the namespace, document type, and version; a unique document ID, used for deduplication, follows the trailing slash.

```bash
https://incoming.telemetry.mozilla.org/submit/activity-stream/impression-stats/1/
```

--------------------------------

### JSON to BigQuery Map Conversion Example

Source: https://github.com/mozilla/data-docs/blob/main/src/tools/guiding_principles.md

Demonstrates the transformation of a JSON object representing a map into a BigQuery-compatible array of key-value pairs. This is necessary when dealing with free-form maps in BigQuery, following conventions for complex Avro types.

```json
{
  "key1": "value1",
  "key2": "value2"
}
```

```json
[
  {
    "key": "key1",
    "value": "value1"
  },
  {
    "key": "key2",
    "value": "value2"
  }
]
```

--------------------------------

### Clients Daily Dataset Reference

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/bigquery/accessing_desktop_data.md

Provides an introduction to the 'Clients Daily' derived dataset, which is built from raw ping data with transformations for easier analysis. Users are directed to a specific reference document for more information.

```markdown
See the [`clients_daily` reference](../../datasets/batch_view/clients_daily/reference.md) for more information.
```

--------------------------------

### Using the Data Catalog

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

Instructions on how to use the Data Catalog to find, understand, and access various datasets.
```markdown
[Using the Data Catalog](cookbooks/analysis/data_catalog.md)
```

--------------------------------

### Explicit JOIN Types in SQL

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/sql_style.md

Highlights the best practice of explicitly stating the `JOIN` type (e.g., `CROSS JOIN`) instead of relying on implicit joins, improving code clarity. Shows examples for BigQuery Standard SQL.

```sql
-- BigQuery Standard SQL Syntax
SELECT
  submission_date,
  experiment.key AS experiment_id,
  experiment.value AS experiment_branch,
  count(*) AS count
FROM
  telemetry.clients_daily
CROSS JOIN
  UNNEST(experiments.key_value) AS experiment
WHERE
  submission_date > '2019-07-01'
  AND sample_id = '10'
GROUP BY
  submission_date,
  experiment_id,
  experiment_branch
```

```sql
-- BigQuery Standard SQL Syntax
SELECT
  submission_date,
  experiment.key AS experiment_id,
  experiment.value AS experiment_branch,
  count(*) AS count
FROM
  telemetry.clients_daily,
  UNNEST(experiments.key_value) AS experiment -- Implicit JOIN
WHERE
  submission_date > '2019-07-01'
  AND sample_id = '10'
GROUP BY
  1, 2, 3 -- Implicit grouping column names
```

--------------------------------

### Data Monitoring Introduction

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

An introduction to data monitoring practices and tools used at Mozilla.

```markdown
[Data Monitoring - Intro to Bigeye](cookbooks/data_monitoring/intro.md)
```

--------------------------------

### Get Main Crashes on Windows Over a Small Interval

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/obsolete/error_aggregates/reference.md

This SQL query counts the number of 'main_crashes' for Firefox on Windows for a specific version ('58.0.2') within a defined time interval. It filters out experiments and groups the results by the window start time.
```sql
SELECT
  window_start as time,
  sum(main_crashes) AS main_crashes
FROM
  error_aggregates_v2
WHERE
  application = 'Firefox'
  AND os_name = 'Windows_NT'
  AND channel = 'release'
  AND version = '58.0.2'
  AND window_start > timestamp '2018-02-21'
  AND window_end < timestamp '2018-02-22'
  AND experiment_id IS NULL
  AND experiment_branch IS NULL
GROUP BY
  window_start
```

--------------------------------

### JSON Schema Definition Example

Source: https://github.com/mozilla/data-docs/blob/main/src/tools/guiding_principles.md

This snippet illustrates how a JSON Schema is defined for a new document namespace and type within the Mozilla pipeline schemas repository. It specifies the namespace ('activity-stream') and document type ('impression-stats') with a version ('1').

```json
{
  "namespace": "activity-stream",
  "document_type": "impression-stats",
  "version": 1
}
```

--------------------------------

### Building and Deploying Containers to GCR with CircleCI

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

A tutorial on building container images and deploying them to Google Container Registry (GCR) using CircleCI for continuous integration and deployment.

```markdown
[Building and Deploying Containers to GCR with CircleCI](cookbooks/deploying-containers.md)
```

--------------------------------

### Example Query for Event Counts

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/bigquery/events_daily/reference.md

This SQL query retrieves daily event and client counts from the events_daily table. It utilizes the mozfun.event_analysis.extract_event_counts function to parse the event strings and joins with the event_types table to get event names. The query filters data for the last 28 days.
```sql
SELECT
  submission_date,
  category,
  event,
  COUNT(*) AS client_count,
  SUM(count) AS event_count
FROM
  `moz-fx-data-shared-prod`.fenix.events_daily
CROSS JOIN
  UNNEST(mozfun.event_analysis.extract_event_counts(events))
JOIN
  `moz-fx-data-shared-prod`.fenix.event_types
  USING (index)
WHERE
  submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)
GROUP BY
  submission_date,
  category,
  event
```

--------------------------------

### Firefox Profile Creation Commands

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/profile/profile_creation.md

Demonstrates command-line arguments for creating and managing Firefox profiles. The `--createprofile` argument creates a new profile, while `--profile` allows starting Firefox with a specified existing or new profile directory.

```bash
firefox --createprofile
firefox --profile /path/to/profile/directory
```

--------------------------------

### BigQuery Table Listing Example

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/pipeline/schemas.md

Demonstrates how to list BigQuery tables within a specific namespace using the 'bq ls' command, showing table details like schema ID, labels, partitioning, and clustering.
```bash
$ bq ls --max_results=3 moz-fx-data-shared-prod:org_mozilla_fenix_stable
       tableId        Type                   Labels                           Time Partitioning                 Clustered Fields
 ------------------- ------- --------------------------------------- ----------------------------------- -------------------------------
  activation_v1       TABLE   schema_id:glean_ping_1                  DAY (field: submission_timestamp)   normalized_channel, sample_id
                              schema_id:glean_ping_1
                              schemas_build_id:202001230145_be1f11e
  baseline_v1         TABLE   schema_id:glean_ping_1                  DAY (field: submission_timestamp)   normalized_channel, sample_id
                              schema_id:glean_ping_1
                              schemas_build_id:202001230145_be1f11e
  bookmarks_sync_v1   TABLE   schema_id:glean_ping_1                  DAY (field: submission_timestamp)   normalized_channel, sample_id
                              schema_id:glean_ping_1
                              schemas_build_id:202001230145_be1f11e
```

--------------------------------

### Install Node.js Dependencies for Spell and Link Checking

Source: https://github.com/mozilla/data-docs/blob/main/README.md

Installs the necessary Node.js packages for spell checking (markdown-spellcheck) and link checking (markdown-link-check) by running `npm install` in the repository's root directory. This requires Node.js to be installed.

```bash
npm install
```

--------------------------------

### Example Query for Day 2-7 Activation by Product

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/non_desktop/day_2_7_activation/reference.md

This SQL query calculates the Day 2-7 activation metric, broken down by product. It aggregates new profiles and activated users from the firefox_nondesktop_day_2_7_activation table for a specific cohort date.
```sql
SELECT
  cohort_date,
  product,
  SUM(day_2_7_activated) as day_2_7_activated,
  SUM(new_profiles) as new_profiles,
  SAFE_DIVIDE(SUM(day_2_7_activated), SUM(new_profiles)) as day_2_7_activation
FROM
  mozdata.telemetry.firefox_nondesktop_day_2_7_activation
WHERE
  cohort_date = "2020-03-01"
GROUP BY
  1,2
ORDER BY
  1
```

--------------------------------

### Query Successful Installs per Country Code

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/other/stub_installer/reference.md

This SQL query retrieves the count of successful installs per normalized country code on a specific date from the `firefox_installer.install` BigQuery table. It demonstrates how to access and analyze stub installer ping data.

```sql
SELECT
  normalized_country_code,
  succeeded,
  count(*)
FROM
  firefox_installer.install
WHERE
  DATE(submission_timestamp) = '2021-04-20'
GROUP BY
  normalized_country_code,
  succeeded
```

--------------------------------

### Introduction to Operational Monitoring

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

An overview of the systems and practices used for operational monitoring of Mozilla's services and infrastructure.

```markdown
[Introduction to Operational Monitoring](cookbooks/operational_monitoring.md)
```

--------------------------------

### SQL Date Conversion Examples

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/analysis_gotchas.md

Provides SQL examples for converting specific date fields into usable date formats. These examples are useful for data analysis and manipulation in SQL environments.
```SQL
DATE_FROM_UNIX_DATE(SAFE_CAST(environment.profile.creation_date AS INT64))
```

```SQL
SAFE.PARSE_TIMESTAMP('%a, %d %b %Y %T %Z', REPLACE(metadata.header.date, 'GMT+00:00', 'GMT'))
```

--------------------------------

### Google Cloud Platform Prototype Project Creation

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/gcp-projects.md

This section details the steps and benefits of creating a prototype GCP project. It covers provisioning service accounts for BigQuery and other GCP resources, writing and querying data in private BigQuery tables, making Docker images available via Google Container Registry, creating Google Cloud Storage buckets, Compute Instances, and Kubernetes clusters. It also touches on cost tracking and the advantages over traditional sandbox projects.

```APIDOC
Google Cloud Platform Prototype Project:
  Purpose: To provision a dedicated GCP project for data-intensive development and production deployment.
  Benefits:
    - Easy cost tracking of individual components.
    - Self-service administrative credentials with limited lifespan.
    - Ability to spin down projects and resources after use.
  Features:
    - Service Accounts: For BigQuery access and command-line/Docker container operations.
    - BigQuery: Private tables for writing and querying data without impacting production.
    - Google Container Registry: For hosting Docker images.
    - Google Cloud Storage: For temporary data storage.
    - Google Compute Instances: For testing software in the cloud.
    - Kubernetes Clusters: For testing scheduled jobs with telemetry-airflow.
    - Protosaur: For creating static dashboards.
  Request Process:
    - File a bug using the provided template.
  Support:
    - Contact Data Engineering contact for project creation and advice.
    - Get in touch with the data platform team for project necessity, contact, or budget queries.
  Tracking:
    - Projects are tracked on Confluence (requires Mozilla LDAP).
```

--------------------------------

### Get Crash Measures Across Platforms

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/obsolete/error_aggregates/reference.md

This SQL query retrieves various crash measures (usage hours, main crashes, content crashes, etc.) for Firefox across different operating systems and channels. It filters for specific build IDs, time windows, and excludes experiments. The results are grouped by window start, channel, build ID, version, and OS name.

```sql
SELECT
  window_start,
  build_id,
  channel,
  os_name,
  version,
  sum(usage_hours) AS usage_hours,
  sum(main_crashes) AS main,
  sum(content_crashes) AS content,
  sum(gpu_crashes) AS gpu,
  sum(plugin_crashes) AS plugin,
  sum(gmplugin_crashes) AS gmplugin
FROM
  error_aggregates_v2
WHERE
  application = 'Firefox'
  AND (os_name = 'Darwin' or os_name = 'Linux' or os_name = 'Windows_NT')
  AND (channel = 'beta' or channel = 'release' or channel = 'nightly' or channel = 'esr')
  AND build_id > '201801'
  AND window_start > current_timestamp - (1 * interval '24' hour)
  AND experiment_id IS NULL
  AND experiment_branch IS NULL
GROUP BY
  window_start,
  channel,
  build_id,
  version,
  os_name
```

--------------------------------

### Day 2-7 Activation Dataset

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

Details the 'Day 2-7 Activation' dataset, likely tracking user activation within the first week of using a product. Useful for onboarding analysis.

```markdown
[Day 2-7 Activation](datasets/non_desktop/day_2_7_activation/reference.md)
```

--------------------------------

### SendPing Subroutine in NSIS

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/other/stub_installer/reference.md

This refers to the SendPing subroutine within the NSIS script of the stub installer. It is responsible for forming and sending the installer pings. The exact code is located in the mozilla-central repository.
```nsis
SendPing subroutine in stub.nsi
```

--------------------------------

### Retained Installer Tables Schema

Source: https://github.com/mozilla/data-docs/blob/main/src/datasets/non_desktop/google_play_store/reference.md

Defines the schema for the retained installer tables in the Google Play Store dataset. It includes fields like Date, Package_Name, Acquisition_Channel, Store_Listing_Visitors, Installers, and various retention rates.

```json
{
  "root": {
    "Date": "date",
    "Package_Name": "string",
    "Acquisition_channel | country | UTM_source_campaign": "string",
    "Store_Listing_Visitors": "integer",
    "Installers": "integer",
    "Visitor_to_installer_conversion_rate": "float",
    "installers_retained_for_1_day": "integer",
    "installers_to_1_day_retention_rate": "float",
    "installers_retained_for_7_days": "integer",
    "installers_to_7_days_retention_rate": "float",
    "installers_retained_for_15_days": "integer",
    "installers_to_15_days_retention_rate": "float",
    "installers_retained_for_30_days": "integer",
    "installers_to_30_days_retention_rate": "float"
  }
}
```

--------------------------------

### Data Monitoring - Intro to Bigeye

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

An introduction to using Bigeye for data quality monitoring, covering its features and initial setup.

```markdown
[Data Monitoring - Intro to Bigeye](cookbooks/data_monitoring/intro.md)
```

--------------------------------

### Implementing Experiments

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

Guidelines for implementing experiments within Mozilla products, focusing on telemetry collection and analysis for A/B testing and feature validation.
```markdown
[Implementing Experiments](cookbooks/client_guidelines.md)
```

--------------------------------

### Guiding Principles for Data Infrastructure

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

The core principles and philosophies that guide the design, development, and maintenance of Mozilla's data infrastructure.

```markdown
[Guiding Principles for Data Infrastructure](tools/guiding_principles.md)
```

--------------------------------

### Profile Creation

Source: https://github.com/mozilla/data-docs/blob/main/src/SUMMARY.md

Explains the process of creating and initializing user profiles. Covers the data points collected during profile creation and their significance.

```markdown
[Profile Creation](concepts/profile/profile_creation.md)
```

--------------------------------

### Looker Explore Definition Example

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/data_modeling/where_to_store.md

An example of a Looker explore definition in LKML, specifically for 'pocket_tile_impressions', demonstrating how to define a subset of data for analysis.

```lkml
explore: pocket_tile_impressions {
  label: "Pocket Tile Impressions"
  from: pocket_tile_impressions

  join: user_devices {
    type: left_outer
    sql_on: ${pocket_tile_impressions.device_id} = ${user_devices.id} ;;
  }

  # ... other join and view definitions
}
```

--------------------------------

### Exposed Views Filtering Example

Source: https://github.com/mozilla/data-docs/blob/main/src/concepts/pipeline/filtering.md

SQL example demonstrating filtering within exposed views, where data in stable tables is not exposed to users.
```sql
github.com/mozilla/bigquery-etl/blob/master/sql/moz-fx-data-shared-prod/telemetry/lockwise_mobile_events_v1/view.sql#L17
```

--------------------------------

### OpMon Preview Command

Source: https://github.com/mozilla/data-docs/blob/main/src/cookbooks/operational_monitoring.md

Provides a detailed breakdown of the `opmon preview` command, its options, and usage examples for generating data previews of OpMon projects. It covers project targeting, date ranges, configuration sources, and the output link to Looker dashboards.

```APIDOC
OpMon Preview Command:

Usage: opmon preview [OPTIONS]

Description:
  Create a preview for a specific project based on a subset of data.

Options:
  --project_id, --project-id TEXT
      Project to write to
  --dataset_id, --dataset-id TEXT
      Temporary dataset to write to
  --derived_dataset_id, --derived-dataset-id TEXT
      Temporary derived dataset to write to
  --start_date, --start-date YYYY-MM-DD
      Date for which project should be started to get analyzed.
      Default: current date - 3 days
  --end_date, --end-date YYYY-MM-DD
      Date for which project should be stop to get analyzed.
      Default: current date
  --slug TEXT
      Experimenter or Normandy slug associated with the project
      to create a preview for [required]
  --config_file, --config-file PATH
      Custom local config file
  --config_repos, --config-repos TEXT
      URLs to public repos with configs
  --private_config_repos, --private-config-repos TEXT
      URLs to private repos with configs
  --help
      Show this message and exit.

Example Usage:
  gcloud auth login --update-adc
  gcloud config set project mozdata
  opmon preview --slug=firefox-install-demo --config_file='/local/path/to/opmon/firefox-install-demo.toml'

Output:
  Start running backfill for firefox-install-demo: 2022-12-17 to 2022-12-19
  Backfill 2022-12-17
  ...
  A preview is available at: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::opmon_preview?Table='mozdata.tmp.firefox_install_demo_statistics'&Submission+Date=2022-12-17+to+2022-12-20
```
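
The date defaults listed in the help text above can be sketched in Python. This is a minimal illustration, not part of OpMon: `build_preview_args` is a hypothetical helper that only assembles an argument list from flags shown in the help output (`--slug`, `--start_date`, `--end_date`, `--config_file`) and applies the documented default window of the current date minus 3 days through the current date. It does not invoke `opmon`, and the rejection of an inverted date range is an assumption of this sketch.

```python
from datetime import date, timedelta
from typing import List, Optional


def build_preview_args(
    slug: str,
    config_file: Optional[str] = None,
    start_date: Optional[date] = None,
    end_date: Optional[date] = None,
    today: Optional[date] = None,
) -> List[str]:
    """Assemble an `opmon preview` argument list (hypothetical helper)."""
    if not slug:
        # The help text marks --slug as [required].
        raise ValueError("--slug is required")
    today = today or date.today()
    # Documented defaults: start = current date - 3 days, end = current date.
    start = start_date or today - timedelta(days=3)
    end = end_date or today
    if start > end:
        # Assumption of this sketch: an inverted range is treated as an error.
        raise ValueError("start_date must not be after end_date")
    args = [
        "opmon",
        "preview",
        f"--slug={slug}",
        f"--start_date={start.isoformat()}",
        f"--end_date={end.isoformat()}",
    ]
    if config_file is not None:
        args.append(f"--config_file={config_file}")
    return args


if __name__ == "__main__":
    # Fixing `today` keeps the example reproducible.
    print(" ".join(build_preview_args("firefox-install-demo", today=date(2022, 12, 20))))
```

With `today` fixed to 2022-12-20, the helper yields `opmon preview --slug=firefox-install-demo --start_date=2022-12-17 --end_date=2022-12-20`. Passing the list to `subprocess.run` would be one way to execute it once `gcloud` authentication is in place.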