### Install Kerchunk from Source

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Installs the Kerchunk package directly from its GitHub repository. Recommended when you need the development version.

```bash
pip install git+https://github.com/fsspec/kerchunk
```

--------------------------------

### Verify Kerchunk Installation

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Start a Python interpreter and import the kerchunk library to verify that the installation was successful. Check the installed version.

```python
import kerchunk
kerchunk.__version__
```

--------------------------------

### Set Up Documentation Environment

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Create and activate a conda virtual environment for building documentation, then install the necessary Python dependencies.

```sh
conda create --name kerchunk-docs python=3.8
conda activate kerchunk-docs
python -m pip install -r docs/requirements.txt
```

--------------------------------

### Build Documentation

Source: https://github.com/fsspec/kerchunk/blob/main/docs/README.md

Steps to build the documentation locally. Ensure you are in the 'docs' directory and have installed dependencies from 'requirements.txt'.

```bash
cd docs
pip install -r requirements.txt
make html
open build/html/index.html
```

--------------------------------

### Version 1 Spec: Example

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

A concrete example of the Version 1 specification, demonstrating the use of templates, generation rules, and references.
```json
{
  "version": 1,
  "templates": {
    "u": "server.domain/path",
    "f": "{{c}}"
  },
  "gen": [
    {
      "key": "gen_key{{i}}",
      "url": "http://{{u}}_{{i}}",
      "offset": "{{(i + 1) * 1000}}",
      "length": "1000",
      "dimensions": {
        "i": {"stop": 5}
      }
    }
  ],
  "refs": {
    "key0": "data",
    "key1": ["http://target_url", 10000, 100],
    "key2": ["http://{{u}}", 10000, 100],
    "key3": ["http://{{f(c='text')}}", 10000, 100]
  }
}
```

--------------------------------

### Install Kerchunk with Development Dependencies

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Install kerchunk along with optional development dependencies, which may include linters, testing tools, and other utilities.

```shell
pip install -e '.[dev]'
```

--------------------------------

### Setup Logging and Autoreload

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Configures logging and enables autoreloading of modules for development.

```python
%load_ext autoreload
%autoreload 2

import logging
import importlib

importlib.reload(logging)
logging.basicConfig(
    format="%(asctime)s.%(msecs)03dZ %(processName)s %(threadName)s %(levelname)s:%(name)s:%(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.WARNING,
)

logger = logging.getLogger("juypter")
```

--------------------------------

### Start Dask cluster

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Initializes a Dask cluster using Dask Gateway for distributed computing. This requires a configured gateway, such as on a QHub deployment.

```python
# this requires you to have a configured gateway, e.g., be on a QHub deployment
from dask.distributed import Client
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
```

--------------------------------

### Start a Dask cluster

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_chunk_compare.ipynb

Initializes a Dask cluster using Dask Gateway for distributed computing. This is required before scaling the cluster or creating a client.

```python
from dask.distributed import Client
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
```

```python
cluster.scale(30);
```

```python
client = Client(cluster)
```

```python
client
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Install pre-commit to automatically run code linting and style checks on each git commit. This helps maintain code quality and consistency.

```shell
pre-commit install
```

--------------------------------

### Install Kerchunk in Editable Mode

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Install the kerchunk package in editable mode from the project's home directory. This allows changes in the source code to be reflected immediately without reinstallation.

```shell
pip install -e .
```

--------------------------------

### Import necessary libraries for logging and kerchunk

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Imports logging and various kerchunk modules for GRIB file processing and aggregation. Ensure these libraries are installed.
```python
import logging
import importlib

importlib.reload(logging)
logging.basicConfig(
    format="%(asctime)s.%(msecs)03dZ %(processName)s %(threadName)s %(levelname)s:%(name)s:%(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.WARNING,
)
logger = logging.getLogger("juypter")

import copy
import fsspec
import pandas as pd
import xarray as xr
import datetime
from kerchunk.grib2 import (
    AggregationType,
    build_idx_grib_mapping,
    extract_datatree_chunk_index,
    grib_tree,
    map_from_index,
    parse_grib_idx,
    reinflate_grib_store,
    scan_grib,
    strip_datavar_chunks,
)
```

--------------------------------

### Version 0 Spec: Zarr Example (String Values)

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

An example of how Zarr data might be represented in the Version 0 spec using string values for JSON content.

```json
{
  ".zgroup": "{\n \"zarr_format\": 2\n}",
  ".zattrs": "{\n \"Conventions\": \"UGRID-0.9.0\"\n}",
  "x/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n \"node\"\n ...",
  "x/.zarray": "{\n \"chunks\": [\n 9228245\n ], \"compressor\": null, \"dtype\": \" ..."
}
```

--------------------------------

```sh
…/kerchunk.git
cd kerchunk
git remote add upstream git@github.com:fsspec/kerchunk.git
```

--------------------------------

### Examine Grib Fixture JSON with zcat and jq

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

Use zcat to decompress gzipped JSON files and jq to parse and query the JSON content on the command line.

```console
zcat tests/fixtures/hrrr.wrfsubhf/zarr_tree_store_v1.json.gz | jq .
```

--------------------------------

### Create and Display the Application Layout

Source: https://github.com/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb

Constructs the Panel application layout, combining the variable selector, global map, and local map with its refresh button. The application is then displayed.
```python
app = pn.Column(
    pn.Param(ze.param.variable, width=150),
    pn.Row(
        ze.global_map,
        pn.Column(
            pn.panel(ze.local_map, loading_indicator=True),
            ze.param.update_localmap,
        ),
    ),
)

app.show()
```

--------------------------------

### Preprocessing Data Before Combining

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Apply custom preprocessing functions to filter or modify data within reference files before combining them. This example drops a specific variable by passing a function to the `preprocess` argument.

```python
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr

# `fs2` is assumed to be an fsspec filesystem object defined earlier in the tutorial.

def pre_process(refs):
    # Drop every reference key that belongs to the unwanted variable.
    for k in list(refs):
        if k.startswith('air_pressure_at_mean_sea_level'):
            refs.pop(k)
    return refs

json_list = fs2.glob("vars_combined.json") + fs2.glob("02_sea_surface_temperature.json")

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol='s3',
    remote_options={'anon': True},
    concat_dims=['time0'],
    identical_dims=['lat', 'lon'],
    preprocess=pre_process,
)

d = mzz.translate()

with fs2.open('sea_surface_temperature_combined.json', 'wb') as f:
    f.write(ujson.dumps(d).encode())

backend_args = {
    "consolidated": False,
    "storage_options": {"fo": d, "remote_protocol": "s3", "remote_options": {"anon": True}},
}
print(xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args))
```

--------------------------------

### Open and view the datamodel with xarray-datatree

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Opens the generated `grib_tree_store` using xarray-datatree to visualize the hierarchical structure of the GRIB data. This step uses a reference filesystem.

```python
# Transforming the output to datatree to view it. This tree models the variables
s3_dt = xr.open_datatree(
    fsspec.filesystem(
        "reference",
        fo=grib_tree_store,
        remote_protocol="s3",
        remote_options={"anon": True},
    ).get_mapper(""),
    engine="zarr",
    consolidated=False,
)
```

--------------------------------

### Build Mapping from GRIB/Zarr Metadata to IDX Attributes

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Constructs a mapping between GRIB/Zarr metadata and the attributes found in .idx files for a specific forecast horizon. This requires reading both GRIB and .idx files.

```python
# What we need is a mapping from our grib/zarr metadata to the attributes in the idx files
# They are unique for each time horizon e.g. you need to build a unique mapping for the 1 hour
# forecast, the 2 hour forecast... the 48 hour forecast.

# let's make one for the 6 hour horizon. This requires reading both the grib and the idx file,
# mapping the data for each grib message in order
mapping = build_idx_grib_mapping(
    basename="gs://global-forecast-system/gfs.20230928/00/atmos/gfs.t00z.pgrb2.0p25.f006",
)
mapping
```

--------------------------------

### Copy GRIB Data with gsutil

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

Utilize gsutil for efficient, parallel copying of GRIB data from cloud storage to a local directory.

```console
gsutil -m cp gs://high-resolution-rapid-refresh/hrrr.20230928/conus/hrrr.t00z.wrfsfcf* testdata/.
```

--------------------------------

### Import necessary libraries

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_simple.ipynb

Import xarray for data manipulation and fsspec for accessing various file systems.

```python
import xarray as xr
import fsspec
```

--------------------------------

### List Root Directory of Reference Filesystem

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Lists the contents of the root directory of the created reference filesystem.
```python
fs.ls("", False)
```

--------------------------------

### Load Dataset via Intake Catalog

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Simplify dataset access by opening an intake catalog. This allows listing available datasets and loading a specific one using its catalog entry. This method abstracts away the complexities of fsspec and zarr configuration.

```python
import intake

catalog = intake.open_catalog('s3://esip-qhub-public/ecmwf/intake_catalog.yml')
list(catalog)
```

```python
ds = catalog['ERA5-Kerchunk-1979-2022'].to_dask()
```

--------------------------------

### Create Test Parquet Chunk Indexes with Python

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

This Python script reads the GRIB chunk index from parquet files, filters it by variable name and valid time, and saves the result as a parquet file. It uses 'fsspec' for file operations and 'dask.dataframe' for efficient data handling.

```python
import os

import dask.dataframe as dd
import fsspec

gfs_base_path = "gs://dev.camus-infra.camus.store/davetest/gfs"

# Read every parquet index file under the base path into one DataFrame.
gfs_kind = dd.read_parquet(
    [f.full_name for f in fsspec.open_files(os.path.join(gfs_base_path, "data_index/**.parquet"))],
    index=False,
).compute()

# Keep only two variables and the early valid times, then write the test fixture.
gfs_kind.loc[
    gfs_kind.varname.isin(["u", "dswrf"]) & (gfs_kind.valid_time <= "2023-09-28 04:00:00")
].to_parquet("/home/builder/bando/ingestion/noaa_nwp/tests/fixtures/gfs.pgrb2.0p25/test_reinflate.parquet")
```

--------------------------------

### Generate Truncated GRIB and IDX Files with Python

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

This Python snippet uses 'fsspec' and a 'dynamic_zarr_store' utility to create truncated GRIB and index files for testing purposes. It requires a GCS filesystem object.
```python
fs = fsspec.filesystem("gcs")
dynamic_zarr_store.make_test_grib_idx_files(
    fs=fs,
    basename="gs://camus-infra.camus.store/circleci_test_data/bando/ingestion/noaa_nwp/tests/fixtures/20221014/hrrr.t01z.wrfsubhf00.grib2"
)
```

--------------------------------

### Create a ReferenceFileSystem mapper

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_simple.ipynb

Use fsspec.get_mapper to create a reference to a remote JSON file that describes the dataset's structure. This is useful for accessing data in cloud storage like S3, especially when requester pays is enabled.

```python
mapper = fsspec.get_mapper(
    "reference://",
    fo='s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json',
    target_options={'requester_pays': True},
    remote_protocol='s3',
    remote_options={'requester_pays': True},
)
```

--------------------------------

### Build GRIB Index Mapping

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Creates a mapping from GRIB metadata to index file attributes for a specific time horizon. This mapping is crucial for building the kerchunk index.

```python
# creating a mapping for a single horizon file which is to be used later
mapping = build_idx_grib_mapping(
    "s3://noaa-gefs-pds/gefs.20230101/00/atmos/pgrb2sp25/geavg.t00z.pgrb2s.0p25.f006",
    storage_options=dict(anon=True),
    validate=True,
)
mapping.head()
```

--------------------------------

### Build GRIB Tree Model

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Constructs a tree model from GRIB files using `grib_tree` and `scan_grib`. Ensure the GRIB files used for scanning are from the repository being indexed. `remote_options` are used for accessing S3.
```python
grib_tree_store = grib_tree(
    scan_grib(
        # "s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006",
        "s3://noaa-gefs-pds/gefs.20230101/00/atmos/pgrb2sp25/geavg.t00z.pgrb2s.0p25.f006",
        storage_options=dict(anon=True),
    ),
    remote_options=dict(anon=True),
)
```

--------------------------------

### Create Zarr Reference Set from Existing Zarr Store with `single_zarr`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Produces a kerchunk-style reference dictionary for an existing Zarr v2 store, consolidating all chunk keys. Useful for testing and combining Zarr stores. Supports both local and remote Zarr stores.

```python
from kerchunk.zarr import single_zarr

# Local Zarr store
refs_local = single_zarr("path/to/my_store.zarr", inline_threshold=100)

# Remote Zarr store on GCS
refs_gcs = single_zarr(
    "gcs://my-bucket/data/output.zarr",
    storage_options={"token": "anon"},
    inline_threshold=0,
)

# Use class-based API (identical interface to other drivers)
from kerchunk.zarr import ZarrToZarr

zzr = ZarrToZarr("s3://my-bucket/data/output.zarr", storage_options={"anon": True})
refs = zzr.translate()
```

--------------------------------

### Import Necessary Libraries

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Imports core libraries for data manipulation, file system operations, and kerchunk functionalities.
```python
import datetime
import copy

import xarray as xr
import numpy as np
import pandas as pd
import fsspec
import kerchunk
from kerchunk.grib2 import (
    grib_tree,
    scan_grib,
    extract_datatree_chunk_index,
    strip_datavar_chunks,
    reinflate_grib_store,
    AggregationType,
    read_store,
    write_store,
    parse_grib_idx,
    build_idx_grib_mapping,
    map_from_index,
)
import gcsfs

pd.set_option('display.max_columns', None)
```

--------------------------------

### Stage and Commit Changes

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Check the status of your changes, add new or modified files to the staging area, and commit them to your local repository with a descriptive message.

```sh
git status
```

```sh
git add path/to/file-to-be-added.py
```

```sh
git commit -m ""
```

--------------------------------

### Version 0 Spec: Basic Structure

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

This is the prototype spec for the structure required by ReferenceFileSystem. It defines how to include data as-is or reference data from a URL with offset and length.

```json
{
  "key0": "data",
  "key1": ["protocol://target_url", 10000, 100]
}
```

--------------------------------

### Open Reinflated Store as Xarray Datatree

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Opens the reinflated GRIB store as an xarray datatree using a reference filesystem. This allows for interactive data exploration and analysis.

```python
gfs_dt = xr.open_datatree(
    fsspec.filesystem("reference", fo=gfs_store).get_mapper(""),
    engine="zarr",
    consolidated=False,
)
gfs_dt
```

--------------------------------

### List First 5 GCS Files in a Subdirectory

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Lists the first 5 files within a specific subdirectory of a GCS bucket.
```python
gcs.ls("pangeo-data/SDO_AIA_Images/094")[:5]
```

--------------------------------

### Open intake catalog

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Opens an intake catalog from a YAML file and lists its contents. This is the entry point for accessing datasets.

```python
cat = intake.open_catalog('intake_catalog.yml')
list(cat)
```

--------------------------------

### Import necessary libraries

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Imports required libraries for data handling, Zarr, fsspec, intake, and Dask.

```python
import xarray as xr
import zarr
import fsspec
import fsspec.implementations.reference as refs
import intake
import intake_xarray
```

--------------------------------

### Build hierarchical datamodel using grib_tree

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Converts GRIB files into a hierarchical datamodel using kerchunk's `grib_tree` function. This method can be slow for large datasets.

```python
# converting the references into the hierarchical datamodel
grib_tree_store = grib_tree(
    [
        group
        for f in s3_files
        for group in scan_grib(f, storage_options=dict(anon=True))
    ],
    remote_options=dict(anon=True),
)
```

--------------------------------

### DataTree Model Visualization

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/reference_aggregation.md

This represents a DataTree model generated from an aggregation of GRIB files. It shows the hierarchical structure of the aggregated data, including variables, dimensions, coordinates, and attributes. Use this to understand the organization of your aggregated dataset.
```text
DataTree('None', parent=None)
├── DataTree('prmsl')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       *empty*
│   │   Attributes:
│   │       name:  Pressure reduced to MSL
│   └── DataTree('instant')
│       │   Dimensions:  ()
│       │   Data variables:
│       │       *empty*
│       │   Attributes:
│       │       stepType:  instant
│       └── DataTree('meanSea')
│               Dimensions:     (latitude: 181, longitude: 360, time: 1, step: 1,
│                                model_horizons: 1, valid_times: 237)
│               Coordinates:
│                 * latitude    (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
│                 * longitude   (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
│                   meanSea     float64 8B ...
│                   number      (time, step) int64 8B ...
│                   step        (model_horizons, valid_times) timedelta64[ns] 2kB ...
│                   time        (model_horizons, valid_times) datetime64[ns] 2kB ...
│                   valid_time  (model_horizons, valid_times) datetime64[ns] 2kB ...
│               Dimensions without coordinates: model_horizons, valid_times
│               Data variables:
│                   prmsl       (model_horizons, valid_times, latitude, longitude) float64 124MB ...
│               Attributes:
│                   typeOfLevel:  meanSea
└── DataTree('ulwrf')
    │   Dimensions:  ()
    │   Data variables:
    │       *empty*
    │   Attributes:
    │       name:  Upward long-wave radiation flux
    └── DataTree('avg')
        │   Dimensions:  ()
        │   Data variables:
        │       *empty*
        │   Attributes:
        │       stepType:  avg
        └── DataTree('nominalTop')
                Dimensions:     (latitude: 181, longitude: 360, time: 1, step: 1,
                                 model_horizons: 1, valid_times: 237)
                Coordinates:
                  * latitude    (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
                  * longitude   (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
                    nominalTop  float64 8B ...
                    number      (time, step) int64 8B ...
                    step        (model_horizons, valid_times) timedelta64[ns] 2kB ...
                    time        (model_horizons, valid_times) datetime64[ns] 2kB ...
                    valid_time  (model_horizons, valid_times) datetime64[ns] 2kB ...
                Dimensions without coordinates: model_horizons, valid_times
                Data variables:
                    ulwrf       (model_horizons, valid_times, latitude, longitude) float64 124MB ...
                Attributes:
                    typeOfLevel:  nominalTop
```

--------------------------------

### Open Local Reference Set with fsspec/xarray

Source: https://context7.com/fsspec/kerchunk/llms.txt

Opens a pre-built reference set from a local JSON file using `xarray.open_dataset`. Requires `fsspec` and `xarray` on the reader side. Ensure `consolidated` is set to `False` and provide `storage_options` for remote access if needed.

```python
import fsspec
import xarray as xr

# --- Open from a local JSON file ---
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```

--------------------------------

### Load SDO Dataset with Dask

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Loads the SDO dataset into a dask-backed xarray DataArray. This allows for out-of-core computation.

```python
ds = cat.SDO.to_dask()
ds
```

--------------------------------

### Parse Runtime from IDX Filename

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Parses the runtime information (e.g., '20230901/00') from the GRIB index file's basename. This is a preliminary step for creating mappings.

```python
# Now if we parse the RunTime from the idx file name `gfs.20230901/00/`
```

--------------------------------

### Open Subset as Datatree

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Opens the reinflated store as an xarray datatree using `xr.open_datatree`. Requires `fsspec` for remote file system access and specifies `remote_protocol` and `remote_options`.
```python
s3_dt_subset = xr.open_datatree(
    fsspec.filesystem(
        "reference",
        fo=s3_store,
        remote_protocol="s3",
        remote_options={"anon": True},
    ).get_mapper(""),
    engine="zarr",
    consolidated=False,
)
```

--------------------------------

### Prepare for Aggregation Indexing

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Initializes a list to store mapped indices and removes duplicate entries from the mapping DataFrame. This is a preparatory step for building the final aggregation index.

```python
%%time
mapped_index_list = []
dedupe_mapping = mapping.loc[~mapping["attrs"].duplicated(keep="first"), :]
```

--------------------------------

### `kerchunk.grib2.scan_grib` / `kerchunk.grib2.GribToZarr`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Scans a GRIB2 file and produces a list of reference dicts (one per GRIB message or logical group) using `cfgrib` for metadata decoding. Each dict is a valid single-variable reference set. Requires `cfgrib` and optionally `eccodes`.

````APIDOC
## `kerchunk.grib2.scan_grib` / `kerchunk.grib2.GribToZarr`: Translate a GRIB2 file into reference sets

Scans a GRIB2 file and produces a list of reference dicts (one per GRIB message or logical group) using `cfgrib` for metadata decoding. Each dict is a valid single-variable reference set. Requires `cfgrib` and optionally `eccodes`.

```python
import ujson
from kerchunk.grib2 import scan_grib

# Returns a list of reference dicts, one per logical variable/level group
refs_list = scan_grib(
    "gfs.t00z.pgrb2.0p25.f006",
    inline_threshold=100,
    storage_options={},
)
print(f"Found {len(refs_list)} GRIB message groups")

# Save each as a separate JSON for later combining
for i, refs in enumerate(refs_list):
    with open(f"grib_msg_{i:04d}.json", "w") as f:
        ujson.dump(refs, f)

# Combine all into a single virtual dataset
from kerchunk.combine import MultiZarrToZarr
import xarray as xr

mzz = MultiZarrToZarr(
    refs_list,
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)
combined = mzz.translate()

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={"consolidated": False, "storage_options": {"fo": combined}},
)
print(ds)
```
````

--------------------------------

### Open GRIB2 Zarr Store with Xarray

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Opens the generated GRIB2 Zarr store metadata directly using xarray's open_datatree function. This is useful for inspection but can be slow for large aggregations.

```python
# The grib_tree can be opened directly using either zarr or xarray datatree
# But this is too slow to build big aggregations
gfs_dt = xr.open_datatree(
    fsspec.filesystem("reference", fo=gfs_grib_tree_store).get_mapper(""),
    engine="zarr",
    consolidated=False,
)
gfs_dt
```

--------------------------------

### Read Zarr dataset

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Loads an equivalent Zarr dataset directly. Prints the encoding details and displays the dataset for comparison.

```python
ds_zarr = cat['ike-zarr'].to_dask()
print(ds_zarr.zeta.encoding, '\n')
ds_zarr.zeta
```

--------------------------------

### Check dataset size

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_simple.ipynb

Calculate and display the size of the dataset in gigabytes.
```python
ds.nbytes/1e9
```

--------------------------------

### `kerchunk.zarr.single_zarr` / `ZarrToZarr`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Produces a kerchunk-style reference dict for an existing Zarr v2 store (local or remote), consolidating all chunk keys into the reference format. Useful for testing and for combining Zarr stores with `MultiZarrToZarr`.

````APIDOC
## `kerchunk.zarr.single_zarr` / `ZarrToZarr`: Create a reference set from an existing Zarr store

Produces a kerchunk-style reference dict for an existing Zarr v2 store (local or remote), consolidating all chunk keys into the reference format. Useful for testing and for combining Zarr stores with `MultiZarrToZarr`.

```python
from kerchunk.zarr import single_zarr

# Local Zarr store
refs_local = single_zarr("path/to/my_store.zarr", inline_threshold=100)

# Remote Zarr store on GCS
refs_gcs = single_zarr(
    "gcs://my-bucket/data/output.zarr",
    storage_options={"token": "anon"},
    inline_threshold=0,
)

# Use class-based API (identical interface to other drivers)
from kerchunk.zarr import ZarrToZarr

zzr = ZarrToZarr("s3://my-bucket/data/output.zarr", storage_options={"anon": True})
refs = zzr.translate()
```
````

--------------------------------

### Import Kerchunk HDF and fsspec

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Imports necessary modules for handling HDF files and interacting with various file systems using fsspec.

```python
from kerchunk.hdf import SingleHdf5ToZarr
import fsspec
```

--------------------------------

### Create and Activate Conda Environment

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Create a new conda environment named 'kerchunk' using the specified environment file and activate it. This ensures a consistent development environment.
```shell
conda env create --name kerchunk --file ci/environment-py3<*>.yml
conda activate kerchunk
```

--------------------------------

### Import MultiZarrToZarr

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Imports the MultiZarrToZarr class for combining multiple kerchunked datasets.

```python
from kerchunk.combine import MultiZarrToZarr
```

--------------------------------

### Check Dataset Size

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Calculates and displays the size of the loaded dataset in gibibytes (GiB, units of 2**30 bytes).

```python
ds.nbytes / 2**30  # GiB
```

--------------------------------

### Create Dask client

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Creates a Dask client connected to the Dask cluster. This client is used to submit computations to the cluster.

```python
client = Client(cluster)
```

--------------------------------

### Build GRIB2 Zarr Store Metadata

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Scans GRIB2 files from GCS to extract Zarr kerchunk metadata and builds a hierarchical view of the dataset. This operation can be I/O intensive.
```python
# Pick two files to build a grib_tree with the correct dimensions
gfs_files = [
    "gs://global-forecast-system/gfs.20230928/00/atmos/gfs.t00z.pgrb2.0p25.f000",
    "gs://global-forecast-system/gfs.20230928/00/atmos/gfs.t00z.pgrb2.0p25.f001"
]

# This operation reads two of the large Grib2 files from GCS
# scan_grib extracts the zarr kerchunk metadata for each individual grib message
# grib_tree builds a zarr/xarray compatible hierarchical view of the dataset
gfs_grib_tree_store = grib_tree([group for f in gfs_files for group in scan_grib(f)])
# it is slow even in parallel because it requires a huge amount of IO
```

--------------------------------

### Translate GRIB2 to Zarr Reference Sets with `scan_grib`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Scans a GRIB2 file and produces a list of reference dictionaries, one per GRIB message or logical group. Requires `cfgrib` and optionally `eccodes`. The output can be saved as individual JSON files or combined into a single virtual dataset.

```python
import ujson
from kerchunk.grib2 import scan_grib

# Returns a list of reference dicts, one per logical variable/level group
refs_list = scan_grib(
    "gfs.t00z.pgrb2.0p25.f006",
    inline_threshold=100,
    storage_options={},  # local file; use {"anon": True} for S3
)
print(f"Found {len(refs_list)} GRIB message groups")

# Save each as a separate JSON for later combining
for i, refs in enumerate(refs_list):
    with open(f"grib_msg_{i:04d}.json", "w") as f:
        ujson.dump(refs, f)

# Combine all into a single virtual dataset
from kerchunk.combine import MultiZarrToZarr
import xarray as xr

mzz = MultiZarrToZarr(
    refs_list,  # pass dicts directly
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)
combined = mzz.translate()

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={"consolidated": False, "storage_options": {"fo": combined}},
)
print(ds)
```

--------------------------------

### Read Zarray File from Reference Filesystem

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Reads and decodes the '.zarray' file from the reference filesystem. This file contains metadata about the Zarr array.

```python
print(fs.cat("094/.zarray").decode())
```

--------------------------------

### Create and Checkout a New Git Branch

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Create a new feature branch and switch to it for development. This ensures production-ready code remains on the main branch.

```sh
git branch shiny-new-feature
git checkout shiny-new-feature
```

```sh
git checkout -b shiny-new-feature
```

--------------------------------

### Aggregate GRIB Files into a Data Tree

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Iterates through dates and runtimes to parse GRIB index files, map them, and concatenate into a single index. This process aggregates data from multiple GRIB files efficiently.
```python
mapped_index_list = []

deduped_mapping = mapping.loc[~mapping["attrs"].duplicated(keep="first"), :]
for date in pd.date_range("2023-09-01", "2023-09-30"):
    for runtime in range(0, 24, 6):
        horizon = 6
        fname = f"gs://global-forecast-system/gfs.{date.strftime('%Y%m%d')}/{runtime:02}/atmos/gfs.t{runtime:02}z.pgrb2.0p25.f{horizon:03}"

        idxdf = parse_grib_idx(
            basename=fname
        )

        mapped_index = map_from_index(
            pd.Timestamp(date + datetime.timedelta(hours=runtime)),
            deduped_mapping,
            idxdf.loc[~idxdf["attrs"].duplicated(keep="first"), :],
        )
        mapped_index_list.append(mapped_index)

gfs_kind = pd.concat(mapped_index_list)
gfs_kind
```
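
--------------------------------

The aggregation loop above visits one GFS run per date and runtime and derives each file's basename from a format string. The sketch below reproduces only that enumeration with the Python standard library, so the list of files can be inspected without touching cloud storage. The helper name `gfs_basenames` and its parameters are hypothetical and not part of kerchunk; the URL pattern is copied from the loop above.

```python
import datetime


def gfs_basenames(start, end, horizon=6, step_hours=6):
    """Yield (run_time, basename) for every GFS run from start to end, inclusive.

    Hypothetical helper for illustration; it only builds strings and
    performs no I/O.
    """
    day = start
    while day <= end:
        for runtime in range(0, 24, step_hours):
            # Timestamp of the model run: midnight of `day` plus the runtime hour.
            run_time = datetime.datetime.combine(day, datetime.time()) + datetime.timedelta(hours=runtime)
            basename = (
                f"gs://global-forecast-system/gfs.{day:%Y%m%d}/{runtime:02}"
                f"/atmos/gfs.t{runtime:02}z.pgrb2.0p25.f{horizon:03}"
            )
            yield run_time, basename
        day += datetime.timedelta(days=1)


runs = list(gfs_basenames(datetime.date(2023, 9, 1), datetime.date(2023, 9, 30)))
print(len(runs))   # 30 days x 4 runs per day
print(runs[0][1])  # basename of the first run
```

Each yielded `run_time` corresponds to the first argument passed to `map_from_index` in the loop above, and each `basename` to the value passed to `parse_grib_idx`.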