### Install Kerchunk from Source

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Installs the Kerchunk package directly from its GitHub repository. Use this when you need the latest development version rather than a tagged release.

```bash
pip install git+https://github.com/fsspec/kerchunk
```

--------------------------------

### Verify Kerchunk Installation

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Start a Python interpreter and import the kerchunk library to verify that the installation was successful. Check the installed version.

```python
import kerchunk
kerchunk.__version__
```
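
The version can also be read from the package metadata without importing the library, which is handy in setup scripts. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of kerchunk:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string when kerchunk is installed, otherwise None.
print(installed_version("kerchunk"))
```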

--------------------------------

### Set Up Documentation Environment

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Create and activate a conda virtual environment for building documentation, then install the necessary Python dependencies.

```sh
conda create --name kerchunk-docs python=3.8
conda activate kerchunk-docs
python -m pip install -r docs/requirements.txt
```

--------------------------------

### Build Documentation

Source: https://github.com/fsspec/kerchunk/blob/main/docs/README.md

Build the documentation locally. The commands below change into the `docs` directory, install the dependencies from `requirements.txt`, build the HTML, and open the result.

```bash
cd docs
pip install -r requirements.txt
make html
open build/html/index.html  # macOS; on Linux use xdg-open instead
```

--------------------------------

### Version 1 Spec: Example

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

A concrete example of the Version 1 specification, demonstrating the use of templates, generation rules, and references.

```json
{
    "version": 1,
    "templates": {
        "u": "server.domain/path",
        "f": "{{c}}"
    },
    "gen": [
        {
            "key": "gen_key{{i}}",
            "url": "http://{{u}}_{{i}}",
            "offset": "{{(i + 1) * 1000}}",
            "length": "1000",
            "dimensions":
              {
                "i": {"stop":  5}
              }
        }
    ],
    "refs": {
      "key0": "data",
      "key1": ["http://target_url", 10000, 100],
      "key2": ["http://{{u}}", 10000, 100],
      "key3": ["http://{{f(c='text')}}", 10000, 100]
    }
}
```

--------------------------------

### Install Kerchunk with Development Dependencies

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Install kerchunk along with optional development dependencies, which may include linters, testing tools, and other utilities.

```shell
pip install -e '.[dev]'
```

--------------------------------

### Setup Logging and Autoreload

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Configures logging and enables autoreloading of modules for development.

```python
%load_ext autoreload
%autoreload 2

import logging
import importlib

importlib.reload(logging)
logging.basicConfig(
    format="%(asctime)s.%(msecs)03dZ %(processName)s %(threadName)s %(levelname)s:%(name)s:%(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.WARNING,
)

logger = logging.getLogger("jupyter")
```

--------------------------------

### Start Dask cluster

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Initializes a Dask cluster using Dask Gateway for distributed computing. This requires a configured gateway, such as on a QHub deployment.

```python
# this requires you to have a configured gateway, e.g., be on a QHub deployment
from dask.distributed import Client
from dask_gateway import Gateway
gateway = Gateway()
cluster = gateway.new_cluster()
```

--------------------------------

### Start a Dask cluster

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_chunk_compare.ipynb

Initializes a Dask cluster using Dask Gateway for distributed computing. This is required before scaling the cluster or creating a client.

```python
from dask.distributed import Client
from dask_gateway import Gateway
gateway = Gateway()
cluster = gateway.new_cluster()
```

```python
cluster.scale(30);
```

```python
client = Client(cluster)
```

```python
client
```

--------------------------------

### Install Pre-commit Hooks

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Install pre-commit to automatically run code linting and style checks on each git commit. This helps maintain code quality and consistency.

```shell
pre-commit install
```

--------------------------------

### Install Kerchunk in Editable Mode

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Install the kerchunk package in editable mode from the project's home directory. This allows changes in the source code to be reflected immediately without reinstallation.

```shell
pip install -e .
```

--------------------------------

### Import necessary libraries for logging and kerchunk

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Imports logging and various kerchunk modules for GRIB file processing and aggregation. Ensure these libraries are installed.

```python
import logging
import importlib
importlib.reload(logging)
logging.basicConfig(
    format="%(asctime)s.%(msecs)03dZ %(processName)s %(threadName)s %(levelname)s:%(name)s:%(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.WARNING,
)

logger = logging.getLogger("jupyter")


import copy
import fsspec
import pandas as pd
import xarray as xr
import datetime
from kerchunk.grib2 import (
    AggregationType,
    build_idx_grib_mapping,
    extract_datatree_chunk_index,
    grib_tree,
    map_from_index,
    parse_grib_idx,
    reinflate_grib_store,
    scan_grib,
    strip_datavar_chunks,
)
```

--------------------------------

### Version 0 Spec: Zarr Example (String Values)

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

An example of how Zarr data might be represented in the Version 0 spec using string values for JSON content.

```json
{
  ".zgroup": "{\n    \"zarr_format\": 2\n}",
  ".zattrs": "{\n    \"Conventions\": \"UGRID-0.9.0\"\n}",
  "x/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"node\"\n ...",
  "x/.zarray": "{\n    \"chunks\": [\n        9228245\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n  ...",
  "x/0": ["s3://bucket/path/file.nc", 294094376, 73825960]
}
```

--------------------------------

### Version 1 Spec: Evaluated to Version 0

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

The Version 0 equivalent of the provided Version 1 example. This shows how the templating and generation rules are expanded.

```json
{
  "key0": "data",
  "key1": ["http://target_url", 10000, 100],
  "key2": ["http://server.domain/path", 10000, 100],
  "key3": ["http://text", 10000, 100],
  "gen_key0": ["http://server.domain/path_0", 1000, 1000],
  "gen_key1": ["http://server.domain/path_1", 2000, 1000],
  "gen_key2": ["http://server.domain/path_2", 3000, 1000],
  "gen_key3": ["http://server.domain/path_3", 4000, 1000],
  "gen_key4": ["http://server.domain/path_4", 5000, 1000]
}
```
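
The expansion from the Version 1 example to this Version 0 form can be sketched in a few lines. This is an illustration only, not the code kerchunk or fsspec use: it swaps a real template engine for a regular expression plus a restricted `eval`, and it handles only what the example needs (string templates, templates called with keyword arguments, and `gen` dimensions given as `start`/`stop`/`step`). The names `render`, `make_callable`, and `expand_v1` are hypothetical.

```python
import itertools
import re

_EXPR = re.compile(r"\{\{(.*?)\}\}")


def render(text, context):
    """Replace every {{expr}} in `text` with expr evaluated against `context`."""
    def substitute(match):
        # Restricted eval: no builtins, only the names supplied in `context`.
        return str(eval(match.group(1), {"__builtins__": {}}, context))
    return _EXPR.sub(substitute, text)


def make_callable(template):
    """Turn a template such as "{{c}}" into a function taking keyword arguments."""
    return lambda **kwargs: render(template, kwargs)


def expand_v1(spec):
    """Expand a Version 1 reference spec into a flat Version 0 dictionary."""
    context = {
        name: make_callable(tmpl) if "{{" in tmpl else tmpl
        for name, tmpl in spec.get("templates", {}).items()
    }
    out = {}
    for key, ref in spec.get("refs", {}).items():
        # Inline data stays as-is; only the URL of [url, offset, length] is rendered.
        out[key] = [render(ref[0], context), *ref[1:]] if isinstance(ref, list) else ref
    for gen in spec.get("gen", []):
        names = list(gen["dimensions"])
        ranges = [
            range(dim.get("start", 0), dim["stop"], dim.get("step", 1))
            for dim in gen["dimensions"].values()
        ]
        # One generated entry per combination of dimension values.
        for values in itertools.product(*ranges):
            ctx = {**context, **dict(zip(names, values))}
            out[render(gen["key"], ctx)] = [
                render(gen["url"], ctx),
                int(render(gen["offset"], ctx)),
                int(render(gen["length"], ctx)),
            ]
    return out


spec = {
    "version": 1,
    "templates": {"u": "server.domain/path", "f": "{{c}}"},
    "gen": [
        {
            "key": "gen_key{{i}}",
            "url": "http://{{u}}_{{i}}",
            "offset": "{{(i + 1) * 1000}}",
            "length": "1000",
            "dimensions": {"i": {"stop": 5}},
        }
    ],
    "refs": {
        "key0": "data",
        "key1": ["http://target_url", 10000, 100],
        "key2": ["http://{{u}}", 10000, 100],
        "key3": ["http://{{f(c='text')}}", 10000, 100],
    },
}

for key, ref in expand_v1(spec).items():
    print(key, ref)
```

Because `eval` is used, a sketch like this should only ever be run on reference files you trust.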

--------------------------------

### Read Combined Dataset with Xarray

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/test_example.md

Demonstrates how a user can open and analyze the generated virtual dataset using xarray. This process does not require kerchunk or h5py to be installed. The `fo` entry in `storage_options` (passed through `backend_kwargs`) should be the output from `mzz.translate()` or a path to a saved JSON file.

```python
import xarray as xr
ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": out,
            "remote_protocol": "s3",
            "remote_options": {"anon": True}
        },
        "consolidated": False
    }
)
# do analysis...
ds.velocity.mean()

```

--------------------------------

### Select and Plot Temperature Data

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

This example demonstrates selecting data by latitude and longitude, then plotting the 'air_temperature_at_2_metres' for a specific year. It showcases basic data slicing and plotting capabilities with xarray.

```python
%%time
da = ds.sel(lat = -34).sel(lon = 198)
da.air_temperature_at_2_metres.sel(time0 = slice('2000-01-01','2000-12-31')).plot()
```

--------------------------------

### View Remote Repositories

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Display the URLs of the remote repositories configured for your local Git repository. This is useful for verifying your remote setup.

```sh
git remote -v
```

--------------------------------

### Version 0 Spec: Zarr Example (JSON Object Values)

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

An equivalent representation of Zarr data in Version 0 spec, where JSON content is provided as JSON objects instead of strings.

```json
{
  ".zgroup": {"zarr_format": 2},
  ".zattrs": {"Conventions": "UGRID-0.9.0"},
  "x/.zattrs": {"_ARRAY_DIMENSIONS": ["node"]},
  "x/.zarray": {"chunks": [9228245], "compressor": null, "dtype": "<f8"},
  "x/0": ["s3://bucket/path/file.nc", 294094376, 73825960]
}
```
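
The relationship between the string form and the object form can be illustrated by decoding the metadata entries with `json.loads`. This is a minimal sketch on a toy reference set; the helper name `decode_metadata` is hypothetical, and only keys ending in `.zgroup`, `.zattrs`, or `.zarray` are decoded so that inline chunk data is left untouched.

```python
import json

METADATA_NAMES = {".zgroup", ".zattrs", ".zarray"}


def decode_metadata(refs):
    """Return a copy of `refs` with string-encoded Zarr metadata parsed into objects."""
    out = {}
    for key, value in refs.items():
        name = key.rsplit("/", 1)[-1]
        if isinstance(value, str) and name in METADATA_NAMES:
            value = json.loads(value)
        out[key] = value
    return out


# Toy reference set in the string form.
refs_as_strings = {
    ".zgroup": '{\n    "zarr_format": 2\n}',
    "x/.zarray": '{"chunks": [9228245], "compressor": null, "dtype": "<f8"}',
    "x/0": ["s3://bucket/path/file.nc", 294094376, 73825960],
}

print(decode_metadata(refs_as_strings))
```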

--------------------------------

### Combine Zarrs with Default Mapping

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Use `MultiZarrToZarr` to combine Zarr datasets. This example demonstrates default mapping where variable names are preserved.

```python
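import xarray as xr
from kerchunk.combine import MultiZarrToZarr

# json_list holds the paths of the single-file reference JSONs created earlier
mzz = MultiZarrToZarr(json_list,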
    remote_protocol='s3',
    remote_options={'anon':True},
    concat_dims=['time0'],
    identical_dims = ['lat', 'lon']
)

d = mzz.translate()

backend_args = {"consolidated": False, "storage_options": {"fo": d, "remote_protocol": "s3","remote_options": {"anon": True}}}
print(xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args))
```

--------------------------------

### Select and Plot Data from xarray Dataset

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Once the dataset is loaded (e.g., using intake), you can operate on it like any other lazy xarray dataset. This example selects data for a specific date and plots the 'air_pressure_at_mean_sea_level' variable.

```python
%%time
da = ds.sel(time0 = '2021-01-01T00:00:00')
da['air_pressure_at_mean_sea_level'].plot()
```

--------------------------------

### Import Libraries for Geospatial Data Analysis

Source: https://github.com/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb

Imports necessary Python libraries for geospatial data handling, visualization, and array manipulation. Ensure all libraries are installed before running.

```python
import fsspec
import geoviews as gv
import imagecodecs.numcodecs
import hvplot.xarray
import holoviews as hv
import numpy as np
import panel as pn
import param
import intake
from tqdm import tqdm
import xarray as xr

import itertools
import math

imagecodecs.numcodecs.register_codecs()  # register the TIFF codec
pn.extension()  # viz
```

--------------------------------

### Open Compressed Remote Reference Set (zstd)

Source: https://context7.com/fsspec/kerchunk/llms.txt

Opens a compressed remote reference set (e.g., zstd) using `fsspec`. This example demonstrates opening a zstd-compressed JSON file from S3 and then using `xarray` to access the data. `target_options` can specify compression details.

```python
import fsspec
import xarray as xr

# --- Open from a compressed remote JSON (zstd) ---
fs = fsspec.filesystem(
    "reference",
    fo="s3://esip-qhub-public/ecmwf/ERA5_1979_2022_multivar.json.zst",
    target_options={"compression": "zstd", "anon": True},
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds_era5 = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)

# --- Coordinate selection over 43 years of data with minimal latency ---
da = ds_era5.sel(time0="2021-01-01T00:00:00")
da["air_pressure_at_mean_sea_level"].plot()  # fetches only needed chunks
```

--------------------------------

### Build Documentation Locally

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Navigate to the docs directory and use the make command to build the HTML documentation locally. This allows you to preview your documentation changes.

```sh
cd docs
make html
```

--------------------------------

### Convert References to Parquet and Open with ReferenceFileSystem

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/advanced.md

This workflow demonstrates converting kerchunk references to parquet format for efficient storage and then opening the data using `fsspec.implementations.reference.ReferenceFileSystem` with lazy loading enabled. It's useful for handling very large datasets where memory efficiency is crucial.

```python
import os

import fsspec
import fsspec.implementations.reference
import xarray as xr
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk import hdf, combine, df

# location_of_data is a placeholder for the location of the input files
files = fsspec.open(location_of_data)

# Create LazyReferenceMapper to pass to MultiZarrToZarr
fs = fsspec.filesystem("file")

os.makedirs("combined.parq")
out = LazyReferenceMapper.create(record_size=1000, root="combined.parq", fs=fs)

# Create references from input files
single_ref_sets = [hdf.SingleHdf5ToZarr(_).translate() for _ in files]

out_dict = combine.MultiZarrToZarr(
    single_ref_sets,
    remote_protocol="s3",
    concat_dims=["time"],
    remote_options={"anon": True},
    out=out,
).translate()

out.flush()

df.refs_to_dataframe(out_dict, "combined.parq")

fs = fsspec.implementations.reference.ReferenceFileSystem(
    "combined.parq", remote_protocol="s3", target_protocol="file", lazy=True)
ds = xr.open_dataset(
    fs.get_mapper(), engine="zarr",
    backend_kwargs={"consolidated": False}
)

```

--------------------------------

### Initialize S3 and Local Filesystems

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Sets up an anonymous S3 filesystem to access ERA5 data and a local filesystem to save generated JSON reference files. The `so` dictionary contains arguments for opening files in read-binary mode with specific caching and fill behavior.

```python
import fsspec

fs = fsspec.filesystem('s3', anon=True)  # S3 file system to manage ERA5 files
flist = (fs.glob('s3://era5-pds/2020/*/data/air_pressure_at_mean_sea_level.nc')[:2]
        + fs.glob('s3://era5-pds/2020/*/data/*sea_surface_temperature.nc')[:2])

fs2 = fsspec.filesystem('')  #local file system to save final jsons to

from pathlib import Path
import os
import ujson

so = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') # args to fs.open()
```

--------------------------------

### Opening a pre-built reference set directly with fsspec/xarray

Source: https://context7.com/fsspec/kerchunk/llms.txt

Demonstrates how to open a pre-built reference set directly using `fsspec` and `xarray`, without requiring kerchunk on the reader side.

```python
import fsspec
import xarray as xr

# --- Open from a local JSON file ---
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)

# --- Open from a compressed remote JSON (zstd) ---
fs = fsspec.filesystem(
    "reference",
    fo="s3://esip-qhub-public/ecmwf/ERA5_1979_2022_multivar.json.zst",
    target_options={"compression": "zstd", "anon": True},
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds_era5 = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)

# --- Coordinate selection over 43 years of data with minimal latency ---
da = ds_era5.sel(time0="2021-01-01T00:00:00")
da["air_pressure_at_mean_sea_level"].plot()  # fetches only needed chunks
```

--------------------------------

### Get Zarr Keys

Source: https://github.com/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb

Creates a set of existing Zarr keys from the mapper. This is used to check for the presence of data chunks.

```python
zkeys = set(mapper)
```
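
How such a key set might be used to check for missing chunks can be sketched with a plain dictionary standing in for the mapper. The key layout (`x/0`, `x/1`, ...) and the helper name `missing_chunks` are assumptions made for illustration.

```python
# Stand-in for the mapper: any mapping of Zarr keys works for this illustration.
mapper = {".zgroup": b"{}", "x/.zarray": b"{}", "x/0": b"...", "x/2": b"..."}
zkeys = set(mapper)


def missing_chunks(array, n_chunks, keys):
    """Return the indices of 1-D chunks of `array` that have no key in `keys`."""
    return [i for i in range(n_chunks) if f"{array}/{i}" not in keys]


print(missing_chunks("x", 4, zkeys))  # [1, 3]
```

Building the set once makes each membership test a constant-time lookup instead of a round trip to the mapper.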

--------------------------------

### Open dataset using ReferenceFileSystem mapper

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_simple.ipynb

Open the dataset using xarray.open_dataset, passing the created mapper and specifying the engine as 'zarr'. Ensure consolidated metadata is set to False.

```python
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
```

--------------------------------

### Instantiate ZarrExplorer

Source: https://github.com/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb

Creates an instance of the ZarrExplorer class to initialize the visualization tool.

```python
ze = ZarrExplorer()
```

--------------------------------

### List First 5 Items in a Subdirectory of Reference Filesystem

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Lists the first 5 items within a specified subdirectory of the reference filesystem.

```python
fs.ls("094", False)[:5]
```

--------------------------------

### Postprocessing Data After Combining

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Utilize `postprocess` functions to modify the final combined dataset before it is returned. This example adjusts the `fill_value` for latitude and longitude coordinates.

```python
import zarr
def modify_fill_value(out):
    out_ = zarr.open(out)
    out_.lon.fill_value = -999
    out_.lat.fill_value = -999
    return out

def postprocess(out):
    out = modify_fill_value(out)
    return out

json_list = fs2.glob("air_pressure_at_mean_sea_level_combined.json") + fs2.glob("sea_surface_temperature_combined.json")

mzz = MultiZarrToZarr(json_list,
    remote_protocol='s3',
    remote_options={'anon':True},
    concat_dims=['time0'],
    identical_dims = ['lat', 'lon'],
    postprocess = postprocess)

d = mzz.translate()

with fs2.open('combined.json', 'wb') as f:
    f.write(ujson.dumps(d).encode())


backend_args = {"consolidated": False, "storage_options": {"fo": d, "remote_protocol": "s3","remote_options": {"anon": True}}}
print(xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args))
```
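
The same kind of adjustment can be sketched without zarr by editing the `.zarray` metadata directly. This assumes the callable receives the references as a plain dictionary whose metadata values are JSON strings or bytes; the helper name `set_fill_value` and the toy reference set are hypothetical.

```python
import json


def set_fill_value(refs, array, fill_value):
    """Rewrite `fill_value` in the `.zarray` metadata of `array` within `refs`."""
    key = f"{array}/.zarray"
    raw = refs[key]
    meta = json.loads(raw.decode() if isinstance(raw, bytes) else raw)
    meta["fill_value"] = fill_value
    refs[key] = json.dumps(meta)
    return refs


def postprocess(refs):
    # Apply the same fill value to both coordinate arrays.
    for name in ("lon", "lat"):
        refs = set_fill_value(refs, name, -999)
    return refs


# Toy reference set standing in for the combined output.
toy_refs = {
    "lon/.zarray": '{"shape": [4], "fill_value": null}',
    "lat/.zarray": b'{"shape": [3], "fill_value": null}',
}
print(postprocess(toy_refs)["lon/.zarray"])  # {"shape": [4], "fill_value": -999}
```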

--------------------------------

### Create Reference Filesystem

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Creates a 'reference' filesystem using fsspec, pointing to a JSON file containing metadata for remote files. Requires anonymous access to GCS.

```python
fs = fsspec.filesystem(
    "reference",
    fo="gcs://mdtemp/SDO_no_coords.json",
    remote_options={"token": "anon"},
    remote_protocol="gcs",
    target_options={"token": "anon"}
)
```

--------------------------------

### Clone and Set Up Upstream Remote

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Clone your forked repository and add the main kerchunk repository as an upstream remote for future updates.

```shell
git clone git@github.com:<yourusername>/kerchunk.git
cd kerchunk
git remote add upstream git@github.com:fsspec/kerchunk.git
```

--------------------------------

### Examine Grib Fixture JSON with zcat and jq

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

Use zcat to decompress gzipped JSON files and jq to parse and query the JSON content on the command line.

```console
zcat tests/fixtures/hrrr.wrfsubhf/zarr_tree_store_v1.json.gz | jq .
```

--------------------------------

### Create and Display the Application Layout

Source: https://github.com/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb

Constructs the Panel application layout, combining the variable selector, global map, and local map with its refresh button. The application is then displayed.

```python
app = pn.Column(
    pn.Param(ze.param.variable, width=150),
    pn.Row(
        ze.global_map,
        pn.Column(
            pn.panel(ze.local_map, loading_indicator=True),
            ze.param.update_localmap
        ),
    ),
)
app.show()
```

--------------------------------

### Preprocessing Data Before Combining

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Apply custom preprocessing functions to filter or modify data within reference files before combining them. This example drops a specific variable using a `preprocess` function.

```python
def pre_process(refs):
    for k in list(refs):
        if k.startswith('air_pressure_at_mean_sea_level'):
            refs.pop(k)
    return refs

json_list = fs2.glob("vars_combined.json") + fs2.glob("02_sea_surface_temperature.json")

mzz = MultiZarrToZarr(json_list,
    remote_protocol='s3',
    remote_options={'anon':True},
    concat_dims=['time0'],
    identical_dims = ['lat', 'lon'],
    preprocess = pre_process)

d = mzz.translate()

with fs2.open('sea_surface_temperature_combined.json', 'wb') as f:
    f.write(ujson.dumps(d).encode())

backend_args = {"consolidated": False, "storage_options": {"fo": d, "remote_protocol": "s3","remote_options": {"anon": True}}}
print(xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args))
```
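
The effect of the preprocessing function is easiest to see on a small reference dictionary. The sketch below repeats `pre_process` so that it runs on its own; the keys and bucket path in `toy_refs` are made up for illustration.

```python
def pre_process(refs):
    # Drop every key that belongs to the unwanted variable.
    for k in list(refs):
        if k.startswith('air_pressure_at_mean_sea_level'):
            refs.pop(k)
    return refs


toy_refs = {
    ".zgroup": '{"zarr_format": 2}',
    "air_pressure_at_mean_sea_level/.zarray": "{}",
    "air_pressure_at_mean_sea_level/0.0.0": ["s3://bucket/file.nc", 0, 100],
    "sea_surface_temperature/0.0.0": ["s3://bucket/file.nc", 100, 100],
}

print(sorted(pre_process(toy_refs)))
# ['.zgroup', 'sea_surface_temperature/0.0.0']
```

Iterating over `list(refs)` rather than `refs` itself matters here, because removing keys while iterating over the dictionary directly would raise a `RuntimeError`.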

--------------------------------

### Open and view the datamodel with xarray-datatree

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Opens the generated `grib_tree_store` using xarray-datatree to visualize the hierarchical structure of the GRIB data. This step uses a reference filesystem.

```python
# Transform the output to a datatree to view it. This tree models the variables.
s3_dt = xr.open_datatree(
    fsspec.filesystem(
        "reference",
        fo=grib_tree_store,
        remote_protocol="s3",
        remote_options={"anon": True},
    ).get_mapper(""),
    engine="zarr",
    consolidated=False,
)

```

--------------------------------

### Build Mapping from GRIB/Zarr Metadata to IDX Attributes

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Constructs a mapping between GRIB/Zarr metadata and the attributes found in .idx files for a specific forecast horizon. This requires reading both GRIB and .idx files.

```python
# What we need is a mapping from our grib/zarr metadata to the attributes in the idx files
# They are unique for each time horizon e.g. you need to build a unique mapping for the 1 hour
# forecast, the 2 hour forecast... the 48 hour forecast.

# let's make one for the 6 hour horizon. This requires reading both the grib and the idx file,
# mapping the data for each grib message in order
mapping = build_idx_grib_mapping(
    basename="gs://global-forecast-system/gfs.20230928/00/atmos/gfs.t00z.pgrb2.0p25.f006",
)
mapping
```
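
To give a feel for what the `.idx` side of the mapping contains, the sketch below parses a few index lines by hand. It assumes the colon-separated layout commonly written by wgrib2-style index files (message number, byte offset, date, variable, level, forecast); the sample lines are invented, and the helper `parse_idx` is hypothetical and is not kerchunk's `parse_grib_idx`.

```python
def parse_idx(text):
    """Parse idx lines into records and derive each message length from the next offset."""
    records = []
    for line in text.strip().splitlines():
        number, offset, date, variable, level, forecast = line.split(":")[:6]
        records.append({
            "idx": int(number),
            "offset": int(offset),
            "date": date,
            "variable": variable,
            "level": level,
            "forecast": forecast,
            "length": None,  # unknown for the last message without the file size
        })
    # A message ends where the next one begins.
    for current, following in zip(records, records[1:]):
        current["length"] = following["offset"] - current["offset"]
    return records


sample = """
1:0:d=2023092800:PRMSL:mean sea level:6 hour fcst:
2:990253:d=2023092800:CLWMR:1 hybrid level:6 hour fcst:
3:1079774:d=2023092800:ICMR:1 hybrid level:6 hour fcst:
"""

for record in parse_idx(sample):
    print(record["idx"], record["variable"], record["offset"], record["length"])
```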

--------------------------------

### Copy GRIB Data with gsutil

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

Utilize gsutil for efficient, parallel copying of GRIB data from cloud storage to a local directory.

```console
gsutil -m cp gs://high-resolution-rapid-refresh/hrrr.20230928/conus/hrrr.t00z.wrfsfcf* testdata/.
```

--------------------------------

### Import necessary libraries

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_simple.ipynb

Import xarray for data manipulation and fsspec for accessing various file systems.

```python
import xarray as xr
import fsspec
```

--------------------------------

### List Root Directory of Reference Filesystem

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Lists the contents of the root directory of the created reference filesystem.

```python
fs.ls("", False)
```
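
A reference set is a flat collection of keys, so a directory listing has to be derived from key prefixes. The following sketch shows the idea on a toy key list; it is not how fsspec implements `ls`, and the helper name `list_dir` and the keys are assumptions.

```python
def list_dir(keys, path=""):
    """List the immediate children of `path` given flat, slash-separated keys."""
    stripped = path.strip("/")
    prefix = stripped + "/" if stripped else ""
    children = set()
    for key in keys:
        if key.startswith(prefix):
            # Keep only the first path component below the prefix.
            children.add(prefix + key[len(prefix):].split("/", 1)[0])
    return sorted(children)


keys = [".zgroup", "094/.zarray", "094/0.0", "131/.zarray"]
print(list_dir(keys))         # ['.zgroup', '094', '131']
print(list_dir(keys, "094"))  # ['094/.zarray', '094/0.0']
```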

--------------------------------

### Load Dataset via Intake Catalog

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Simplify dataset access by opening an intake catalog. This allows listing available datasets and loading a specific one using its catalog entry. This method abstracts away the complexities of fsspec and zarr configuration.

```python
import intake
catalog = intake.open_catalog('s3://esip-qhub-public/ecmwf/intake_catalog.yml')
list(catalog)
```

```python
ds = catalog['ERA5-Kerchunk-1979-2022'].to_dask()
```

--------------------------------

### Create Test Parquet Chunk Indexes with Python

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

This Python script reads the Parquet chunk-index files, filters the rows by variable name and valid time, and saves the result as a single Parquet file. It uses `fsspec` to list the input files and `dask.dataframe` to read them.

```python
import os

import dask.dataframe as dd
import fsspec

gfs_base_path = "gs://dev.camus-infra.camus.store/davetest/gfs"
# Read every index file under data_index/ into one in-memory DataFrame
gfs_kind = dd.read_parquet(
    [f.full_name for f in fsspec.open_files(os.path.join(gfs_base_path, "data_index/**.parquet"))],
    index=False
).compute()
# Keep two variables up to a cut-off valid time and write the test fixture
gfs_kind.loc[
    gfs_kind.varname.isin(["u", "dswrf"]) &
    (gfs_kind.valid_time <= "2023-09-28 04:00:00")
].to_parquet("/home/builder/bando/ingestion/noaa_nwp/tests/fixtures/gfs.pgrb2.0p25/test_reinflate.parquet")
```

--------------------------------

### Generate Truncated GRIB and IDX Files with Python

Source: https://github.com/fsspec/kerchunk/blob/main/tests/grib_idx_fixtures/README.md

This Python snippet uses 'fsspec' and a 'dynamic_zarr_store' utility to create truncated GRIB and index files for testing purposes. It requires a GCS filesystem object.

```python
fs = fsspec.filesystem("gcs")
dynamic_zarr_store.make_test_grib_idx_files(
    fs=fs,
    basename="gs://camus-infra.camus.store/circleci_test_data/bando/ingestion/noaa_nwp/tests/fixtures/20221014/hrrr.t01z.wrfsubhf00.grib2",
)
```

--------------------------------

### Create a ReferenceFileSystem mapper

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_simple.ipynb

Use fsspec.get_mapper to create a reference to a remote JSON file that describes the dataset's structure. This is useful for accessing data in cloud storage like S3, especially when requester pays is enabled.

```python
mapper = fsspec.get_mapper("reference://", 
    fo='s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json', 
    target_options={'requester_pays': True}, 
    remote_protocol='s3', 
    remote_options={'requester_pays': True})
```

--------------------------------

### Build GRIB Index Mapping

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Creates a mapping from GRIB metadata to index file attributes for a specific time horizon. This mapping is crucial for building the kerchunk index.

```python
# creating a mapping for a single horizon file which is to be used later
mapping = build_idx_grib_mapping(
    "s3://noaa-gefs-pds/gefs.20230101/00/atmos/pgrb2sp25/geavg.t00z.pgrb2s.0p25.f006",
    storage_options=dict(anon=True),
    validate=True,
)
mapping.head()
```

--------------------------------

### Build GRIB Tree Model

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Constructs a tree model from GRIB files using `grib_tree` and `scan_grib`. Ensure the GRIB files used for scanning are from the repository being indexed. `remote_options` are used for accessing S3.

```python
grib_tree_store = grib_tree(
    scan_grib(
        #"s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006",
        "s3://noaa-gefs-pds/gefs.20230101/00/atmos/pgrb2sp25/geavg.t00z.pgrb2s.0p25.f006",
        storage_options=dict(anon=True),
    ),
    remote_options=dict(anon=True),
)
```

--------------------------------

### Create Zarr Reference Set from Existing Zarr Store with `single_zarr`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Produces a kerchunk-style reference dictionary for an existing Zarr v2 store, consolidating all chunk keys. Useful for testing and combining Zarr stores. Supports both local and remote Zarr stores.

```python
from kerchunk.zarr import single_zarr

# Local Zarr store
refs_local = single_zarr("path/to/my_store.zarr", inline_threshold=100)

# Remote Zarr store on GCS
refs_gcs = single_zarr(
    "gcs://my-bucket/data/output.zarr",
    storage_options={"token": "anon"},
    inline_threshold=0,
)

# Use class-based API (identical interface to other drivers)
from kerchunk.zarr import ZarrToZarr

zzr = ZarrToZarr("s3://my-bucket/data/output.zarr", storage_options={"anon": True})
refs = zzr.translate()
```

--------------------------------

### Import Necessary Libraries

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Imports core libraries for data manipulation, file system operations, and kerchunk functionalities.

```python
import datetime
import copy
import xarray as xr
import numpy as np
import pandas as pd
import fsspec
import kerchunk
from kerchunk.grib2 import (
    grib_tree, scan_grib, extract_datatree_chunk_index, strip_datavar_chunks, 
    reinflate_grib_store, AggregationType, read_store, write_store, parse_grib_idx,
    build_idx_grib_mapping, map_from_index
)
import gcsfs

pd.set_option('display.max_columns', None)
```

--------------------------------

### Stage and Commit Changes

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Check the status of your changes, add new or modified files to the staging area, and commit them to your local repository with a descriptive message.

```sh
git status
```

```sh
git add path/to/file-to-be-added.py
```

```sh
git commit -m "<commit message>"
```

--------------------------------

### Version 0 Spec: Basic Structure

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/spec.md

This is the prototype spec for the structure required by ReferenceFileSystem. It defines how to include data as-is or reference data from a URL with offset and length.

```json
{
  "key0": "data",
  "key1": ["protocol://target_url", 10000, 100]
}
```
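
How a reader might resolve the two kinds of entry can be sketched with local files standing in for remote URLs. This is an illustration under that assumption, not the `ReferenceFileSystem` implementation; the helper name `read_reference` and the file `blob.bin` are hypothetical.

```python
import os
import tempfile


def read_reference(refs, key):
    """Return the bytes for `key`: inline data, or a byte range of the target file."""
    ref = refs[key]
    if isinstance(ref, str):
        return ref.encode()  # data included as-is
    path, offset, length = ref
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "blob.bin")
    with open(target, "wb") as f:
        f.write(b"0123456789abcdef")

    refs = {"key0": "data", "key1": [target, 10, 4]}
    print(read_reference(refs, "key0"))  # b'data'
    print(read_reference(refs, "key1"))  # b'abcd'
```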

--------------------------------

### Open Reinflated Store as Xarray Datatree

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Opens the reinflated GRIB store as an xarray datatree using a reference filesystem. This allows for interactive data exploration and analysis.

```python
gfs_dt = xr.open_datatree(fsspec.filesystem("reference", fo=gfs_store).get_mapper(""), engine="zarr", consolidated=False)
gfs_dt
```

--------------------------------

### List First 5 GCS Files in a Subdirectory

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Lists the first 5 files within a specific subdirectory of a GCS bucket.

```python
gcs.ls("pangeo-data/SDO_AIA_Images/094")[:5]
```

--------------------------------

### Open intake catalog

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Opens an intake catalog from a YAML file and lists its contents. This is the entry point for accessing datasets.

```python
cat = intake.open_catalog('intake_catalog.yml')
list(cat)
```

--------------------------------

### Import necessary libraries

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Imports required libraries for data handling with xarray, Zarr, fsspec, and intake.

```python
import xarray as xr
import zarr
import fsspec
import fsspec.implementations.reference as refs
import intake
import intake_xarray
```

--------------------------------

### Build hierarchical datamodel using grib_tree

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Converts GRIB files into a hierarchical datamodel using kerchunk's `grib_tree` function. This method can be slow for large datasets.

```python
from kerchunk.grib2 import grib_tree, scan_grib

# Scan every file, then convert the flattened references into the hierarchical datamodel
grib_tree_store = grib_tree(
    [
        group
        for f in s3_files
        for group in scan_grib(f, storage_options=dict(anon=True))
    ],
    remote_options=dict(anon=True),
)
```

--------------------------------

### DataTree Model Visualization

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/reference_aggregation.md

This represents a DataTree model generated from an aggregation of GRIB files. It shows the hierarchical structure of the aggregated data, including variables, dimensions, coordinates, and attributes. Use this to understand the organization of your aggregated dataset.

```text
DataTree('None', parent=None)
├── DataTree('prmsl')
│   │   Dimensions:  ()
│   │   Data variables:
│   │       *empty*
│   │   Attributes:
│   │       name:     Pressure reduced to MSL
│   └── DataTree('instant')
│       │   Dimensions:  ()
│       │   Data variables:
│       │       *empty*
│       │   Attributes:
│       │       stepType:  instant
│       └── DataTree('meanSea')
│               Dimensions:     (latitude: 181, longitude: 360, time: 1, step: 1,
│                                model_horizons: 1, valid_times: 237)
│               Coordinates:
│                 * latitude    (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
│                 * longitude   (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
│                   meanSea     float64 8B ...
│                   number      (time, step) int64 8B ...
│                   step        (model_horizons, valid_times) timedelta64[ns] 2kB ...
│                   time        (model_horizons, valid_times) datetime64[ns] 2kB ...
│                   valid_time  (model_horizons, valid_times) datetime64[ns] 2kB ...
│               Dimensions without coordinates: model_horizons, valid_times
│               Data variables:
│                   prmsl       (model_horizons, valid_times, latitude, longitude) float64 124MB ...
│               Attributes:
│                   typeOfLevel:  meanSea
└── DataTree('ulwrf')
    │   Dimensions:  ()
    │   Data variables:
    │       *empty*
    │   Attributes:
    │       name:     Upward long-wave radiation flux
    └── DataTree('avg')
        │   Dimensions:  ()
        │   Data variables:
        │       *empty*
        │   Attributes:
        │       stepType:  avg
        └── DataTree('nominalTop')
                Dimensions:     (latitude: 181, longitude: 360, time: 1, step: 1,
                                    model_horizons: 1, valid_times: 237)
                Coordinates:
                    * latitude    (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
                    * longitude   (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
                    nominalTop  float64 8B ...
                    number      (time, step) int64 8B ...
                    step        (model_horizons, valid_times) timedelta64[ns] 2kB ...
                    time        (model_horizons, valid_times) datetime64[ns] 2kB ...
                    valid_time  (model_horizons, valid_times) datetime64[ns] 2kB ...
                Dimensions without coordinates: model_horizons, valid_times
                Data variables:
                    ulwrf       (model_horizons, valid_times, latitude, longitude) float64 124MB ...
                Attributes:
                    typeOfLevel:  nominalTop
```
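
The `124MB` shown for each data variable can be checked from the dimension sizes. The following is only a sanity-check sketch; the shape is copied from the `prmsl` entry in the tree, and the 8-byte item size is assumed from the `float64` dtype.

```python
import math

# Dimension sizes copied from the `prmsl` variable in the tree
shape = {"model_horizons": 1, "valid_times": 237, "latitude": 181, "longitude": 360}
itemsize = 8  # bytes per float64 element

n_elements = math.prod(shape.values())
nbytes = n_elements * itemsize
size_mb = nbytes / 1e6  # decimal megabytes

print(n_elements, nbytes, round(size_mb))
```

This is the uncompressed in-memory size; the referenced GRIB messages on disk are typically smaller.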

--------------------------------

### Open Local Reference Set with fsspec/xarray

Source: https://context7.com/fsspec/kerchunk/llms.txt

Opens a pre-built reference set from a local JSON file using `xarray.open_dataset`. Requires `fsspec` and `xarray` on the reader side. Ensure `consolidated` is set to `False` and provide `storage_options` for remote access if needed.

```python
import fsspec
import xarray as xr

# --- Open from a local JSON file ---
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```

--------------------------------

### Load SDO Dataset with Dask

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Loads the SDO dataset into a dask-backed xarray DataArray. This allows for out-of-core computation.

```python
ds = cat.SDO.to_dask()
ds
```

--------------------------------

### Parse Runtime from IDX Filename

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Parses the runtime information (e.g., '20230901/00') from the GRIB index file's basename. This is a preliminary step for creating mappings.

```python
# Now if we parse the RunTime from the idx file name `gfs.20230901/00/`
```
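
The notebook cell contains only the comment. A minimal sketch of the parsing step it describes could look like the following, where `parse_runtime` is a hypothetical helper (not a kerchunk function) and the `gfs.YYYYMMDD/HH/` path layout is assumed from the example in the comment.

```python
import datetime
import re


def parse_runtime(basename):
    """Hypothetical helper: extract the model run time from a GFS-style path."""
    match = re.search(r"gfs\.(\d{8})/(\d{2})/", basename)
    if match is None:
        raise ValueError(f"no run time found in {basename!r}")
    day, hour = match.groups()
    return datetime.datetime.strptime(day + hour, "%Y%m%d%H")


runtime = parse_runtime(
    "gs://global-forecast-system/gfs.20230901/00/atmos/gfs.t00z.pgrb2.0p25.f006"
)
print(runtime)
```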

--------------------------------

### Open Subset as Datatree

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Opens the reinflated subset store as an xarray datatree using `xr.open_datatree`. The reference filesystem is created with `remote_protocol="s3"` and anonymous `remote_options`, so the referenced chunks are read from S3 without credentials.

```python
s3_dt_subset = xr.open_datatree(
    fsspec.filesystem(
        "reference", fo=s3_store, remote_protocol="s3", remote_options={"anon": True}
    ).get_mapper(""),
    engine="zarr",
    consolidated=False,
)
```

--------------------------------

### Prepare for Aggregation Indexing

Source: https://github.com/fsspec/kerchunk/blob/main/Fast_aggregation.ipynb

Initializes a list to store mapped indices and removes duplicate entries from the mapping DataFrame. This is a preparatory step for building the final aggregation index.

```python
%%time
mapped_index_list = []

# Keep only the first row for each distinct "attrs" value in the mapping
dedupe_mapping = mapping.loc[~mapping["attrs"].duplicated(keep="first"), :]
```
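
To make the deduplication step concrete, the sketch below applies the same `duplicated(keep="first")` mask to a small made-up DataFrame. The column values are invented for illustration and do not come from a real GRIB index.

```python
import pandas as pd

# Invented stand-in for the mapping table: two rows share the same "attrs" value
mapping = pd.DataFrame(
    {
        "attrs": ["PRMSL:mean sea level", "PRMSL:mean sea level", "ULWRF:top of atmosphere"],
        "offset": [0, 1000, 2000],
    }
)

# duplicated(keep="first") marks every repeat after the first occurrence;
# negating the mask keeps one row per distinct "attrs" value
dedupe_mapping = mapping.loc[~mapping["attrs"].duplicated(keep="first"), :]
kept_offsets = dedupe_mapping["offset"].tolist()
print(kept_offsets)
```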

--------------------------------

### `kerchunk.grib2.scan_grib` / `kerchunk.grib2.GribToZarr`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Scans a GRIB2 file and produces a list of reference dicts (one per GRIB message or logical group) using `cfgrib` for metadata decoding. Each dict is a valid single-variable reference set. Requires `cfgrib` and optionally `eccodes`.

```python
import ujson
from kerchunk.grib2 import scan_grib

# Returns a list of reference dicts, one per logical variable/level group
refs_list = scan_grib(
    "gfs.t00z.pgrb2.0p25.f006",
    inline_threshold=100,
    storage_options={},
)

print(f"Found {len(refs_list)} GRIB message groups")

# Save each as a separate JSON for later combining
for i, refs in enumerate(refs_list):
    with open(f"grib_msg_{i:04d}.json", "w") as f:
        ujson.dump(refs, f)

# Combine all into a single virtual dataset
from kerchunk.combine import MultiZarrToZarr
import xarray as xr

mzz = MultiZarrToZarr(
    refs_list,
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)
combined = mzz.translate()

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={"consolidated": False, "storage_options": {"fo": combined}},
)
print(ds)
```

--------------------------------

### Open GRIB2 Zarr Store with Xarray

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Opens the generated GRIB2 Zarr store metadata directly using xarray's open_datatree function. This is useful for inspection but can be slow for large aggregations.

```python
# The grib_tree can be opened directly using either zarr or xarray datatree
# But this is too slow to build big aggregations
gfs_dt = xr.open_datatree(fsspec.filesystem("reference", fo=gfs_grib_tree_store).get_mapper(""), engine="zarr", consolidated=False)
gfs_dt
```

--------------------------------

### Read Zarr dataset

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Loads an equivalent Zarr dataset directly. Prints the encoding details and displays the dataset for comparison.

```python
ds_zarr = cat['ike-zarr'].to_dask()
print(ds_zarr.zeta.encoding, '\n')
ds_zarr.zeta
```

--------------------------------

### Check dataset size

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_simple.ipynb

Calculate and display the size of the dataset in gigabytes.

```python
ds.nbytes/1e9
```

--------------------------------

### `kerchunk.zarr.single_zarr` / `ZarrToZarr`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Produces a kerchunk-style reference dict for an existing Zarr v2 store (local or remote), consolidating all chunk keys into the reference format. Useful for testing and for combining Zarr stores with `MultiZarrToZarr`.

```python
from kerchunk.zarr import single_zarr

# Local Zarr store
refs_local = single_zarr("path/to/my_store.zarr", inline_threshold=100)

# Remote Zarr store on GCS
refs_gcs = single_zarr(
    "gcs://my-bucket/data/output.zarr",
    storage_options={"token": "anon"},
    inline_threshold=0,
)

# Use class-based API (identical interface to other drivers)
from kerchunk.zarr import ZarrToZarr

zzr = ZarrToZarr("s3://my-bucket/data/output.zarr", storage_options={"anon": True})
refs = zzr.translate()
```
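
To give a feel for what such a reference dict contains, here is a conceptual sketch using only the standard library. `build_refs` is a hypothetical helper and not kerchunk's implementation; it assumes a directory-based store on local disk, keeps metadata files inline, and either embeds or references chunk files depending on a size threshold.

```python
import base64
import os
import tempfile


def build_refs(store_dir, inline_threshold=0):
    """Hypothetical helper: build a reference dict for a directory-based store."""
    refs = {}
    for root, _dirs, files in os.walk(store_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            key = os.path.relpath(path, store_dir).replace(os.sep, "/")
            size = os.path.getsize(path)
            if name.startswith("."):
                # Zarr metadata files (.zarray, .zattrs, ...) are JSON text: store inline
                with open(path, "r", encoding="utf-8") as f:
                    refs[key] = f.read()
            elif size < inline_threshold:
                # Small chunks are embedded as base64 text
                with open(path, "rb") as f:
                    refs[key] = "base64:" + base64.b64encode(f.read()).decode("ascii")
            else:
                # Larger chunks are referenced as [url, offset, length]
                refs[key] = [path, 0, size]
    return refs


# Create a tiny fake store: one array with a metadata file and one 16-byte chunk
with tempfile.TemporaryDirectory() as store_dir:
    os.makedirs(os.path.join(store_dir, "arr"))
    with open(os.path.join(store_dir, "arr", ".zarray"), "w", encoding="utf-8") as f:
        f.write('{"shape": [16]}')
    with open(os.path.join(store_dir, "arr", "0"), "wb") as f:
        f.write(bytes(16))

    chunk_path = os.path.join(store_dir, "arr", "0")
    refs_by_location = build_refs(store_dir, inline_threshold=0)
    refs_inlined = build_refs(store_dir, inline_threshold=100)

print(refs_by_location["arr/0"])
print(refs_inlined["arr/0"])
```

A threshold of `0` references every chunk by location, while a larger threshold trades a bigger reference file for fewer small reads.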

--------------------------------

### Import Kerchunk HDF and fsspec

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Imports necessary modules for handling HDF files and interacting with various file systems using fsspec.

```python
from kerchunk.hdf import SingleHdf5ToZarr
import fsspec
```

--------------------------------

### Create and Activate Conda Environment

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

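Create a new conda environment named 'kerchunk' from one of the environment files in `ci/` and activate it. The `<*>` in the file name is a placeholder; replace it so the name matches the environment file you want to use. This ensures a consistent development environment.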

```shell
conda env create --name kerchunk --file ci/environment-py3<*>.yml
conda activate kerchunk
```

--------------------------------

### Import MultiZarrToZarr

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/tutorial.md

Imports the MultiZarrToZarr class for combining multiple kerchunked datasets.

```python
from kerchunk.combine import MultiZarrToZarr
```

--------------------------------

### Check Dataset Size

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Calculates and displays the size of the loaded dataset in gibibytes (GiB, 2**30 bytes).

```python
ds.nbytes / 2**30  # GiB
```
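
Dividing by `2**30` gives binary gibibytes, whereas dividing by `1e9` gives decimal gigabytes, so the two conventions differ by about 7%. A small sketch with an arbitrary example byte count:

```python
nbytes = 2**30  # example byte count, chosen so the binary figure is exactly 1

size_gb = nbytes / 1e9     # decimal gigabytes (10**9 bytes)
size_gib = nbytes / 2**30  # binary gibibytes (2**30 bytes)

print(f"{size_gb:.3f} GB, {size_gib:.3f} GiB")  # prints "1.074 GB, 1.000 GiB"
```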

--------------------------------

### Create Dask client

Source: https://github.com/fsspec/kerchunk/blob/main/examples/ike_intake.ipynb

Creates a Dask client connected to the Dask cluster. This client is used to submit computations to the cluster.

```python
from dask.distributed import Client
client = Client(cluster)  # `cluster` must already exist from an earlier cell
```

--------------------------------

### Build GRIB2 Zarr Store Metadata

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Scans GRIB2 files from GCS to extract Zarr kerchunk metadata and builds a hierarchical view of the dataset. This operation can be I/O intensive.

```python
from kerchunk.grib2 import grib_tree, scan_grib

# Pick two files to build a grib_tree with the correct dimensions
gfs_files = [
    "gs://global-forecast-system/gfs.20230928/00/atmos/gfs.t00z.pgrb2.0p25.f000",
    "gs://global-forecast-system/gfs.20230928/00/atmos/gfs.t00z.pgrb2.0p25.f001"
]

# This operation reads two of the large Grib2 files from GCS
# scan_grib extracts the zarr kerchunk metadata for each individual grib message
# grib_tree builds a zarr/xarray compatible hierarchical view of the dataset
gfs_grib_tree_store = grib_tree([group for f in gfs_files for group in scan_grib(f)])
# it is slow even in parallel because it requires a huge amount of IO
```

--------------------------------

### Translate GRIB2 to Zarr Reference Sets with `scan_grib`

Source: https://context7.com/fsspec/kerchunk/llms.txt

Scans a GRIB2 file and produces a list of reference dictionaries, one per GRIB message or logical group. Requires `cfgrib` and optionally `eccodes`. The output can be saved as individual JSON files or combined into a single virtual dataset.

```python
import ujson
from kerchunk.grib2 import scan_grib

# Returns a list of reference dicts, one per logical variable/level group
refs_list = scan_grib(
    "gfs.t00z.pgrb2.0p25.f006",
    inline_threshold=100,
    storage_options={},   # local file; use {"anon": True} for S3
)

print(f"Found {len(refs_list)} GRIB message groups")

# Save each as a separate JSON for later combining
for i, refs in enumerate(refs_list):
    with open(f"grib_msg_{i:04d}.json", "w") as f:
        ujson.dump(refs, f)

# Combine all into a single virtual dataset
from kerchunk.combine import MultiZarrToZarr
import xarray as xr

mzz = MultiZarrToZarr(
    refs_list,   # pass dicts directly
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)
combined = mzz.translate()

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={"consolidated": False, "storage_options": {"fo": combined}},
)
print(ds)
```

--------------------------------

### Read Zarray File from Reference Filesystem

Source: https://github.com/fsspec/kerchunk/blob/main/examples/SDO.ipynb

Reads and decodes the '.zarray' file from the reference filesystem. This file contains metadata about the Zarr array.

```python
print(fs.cat("094/.zarray").decode())
```

--------------------------------

### Create and Checkout a New Git Branch

Source: https://github.com/fsspec/kerchunk/blob/main/docs/source/contributing.md

Create a new feature branch and switch to it for development, which keeps production-ready code on the main branch. The first block does this in two steps; the second block is the equivalent single command.

```sh
git branch shiny-new-feature
git checkout shiny-new-feature
```

```sh
git checkout -b shiny-new-feature
```

--------------------------------

### Aggregate GRIB Files into a Data Tree

Source: https://github.com/fsspec/kerchunk/blob/main/dynamicgribchunking.ipynb

Loops over each day of September 2023 and the four daily run times (00, 06, 12, 18), parses the GRIB index file for the 6-hour forecast horizon, maps it against the deduplicated mapping, and concatenates the results into a single index. This process aggregates data from multiple GRIB files efficiently.

```python
import datetime

import pandas as pd

# `mapping`, `parse_grib_idx`, and `map_from_index` come from earlier cells of the notebook
mapped_index_list = []

# Keep only the first row for each distinct "attrs" value
deduped_mapping = mapping.loc[~mapping["attrs"].duplicated(keep="first"), :]
for date in pd.date_range("2023-09-01", "2023-09-30"):
    for runtime in range(0, 24, 6):
        horizon = 6
        fname = (
            f"gs://global-forecast-system/gfs.{date.strftime('%Y%m%d')}/{runtime:02}"
            f"/atmos/gfs.t{runtime:02}z.pgrb2.0p25.f{horizon:03}"
        )

        idxdf = parse_grib_idx(basename=fname)

        mapped_index = map_from_index(
            pd.Timestamp(date + datetime.timedelta(hours=runtime)),
            deduped_mapping,
            idxdf.loc[~idxdf["attrs"].duplicated(keep="first"), :],
        )
        mapped_index_list.append(mapped_index)

gfs_kind = pd.concat(mapped_index_list)
gfs_kind
```
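
The file-name pattern used in the loop can be checked in isolation, without touching GCS. The sketch below uses only the standard library; `gfs_urls` is a hypothetical helper introduced for illustration, and the bucket layout is taken from the f-string in the loop.

```python
import datetime


def gfs_urls(start, end, horizon=6):
    """Hypothetical helper: yield (run_time, url) for four runs per day, inclusive range."""
    day = start
    while day <= end:
        for runtime in range(0, 24, 6):
            run_time = datetime.datetime.combine(day, datetime.time(hour=runtime))
            url = (
                f"gs://global-forecast-system/gfs.{day:%Y%m%d}/{runtime:02}"
                f"/atmos/gfs.t{runtime:02}z.pgrb2.0p25.f{horizon:03}"
            )
            yield run_time, url
        day += datetime.timedelta(days=1)


pairs = list(gfs_urls(datetime.date(2023, 9, 1), datetime.date(2023, 9, 30)))
print(len(pairs))   # 30 days x 4 runs per day
print(pairs[0][1])
```

Each of these base names corresponds to one index file that the loop parses and maps, so the final concatenated index covers 120 model runs.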