### Setup Development Environment and Run Tests

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/contributing.md

Follow these steps to create a conda environment, clone the repository, install development dependencies, and run unit tests to verify the installation.

```bash
conda create -n nested_pandas_env python=3.11
conda activate nested_pandas_env
git clone https://github.com/lincc-frameworks/nested-pandas.git
cd nested-pandas/
bash ./.setup_dev.sh
pip install pytest
pytest
```

--------------------------------

### Install nested-pandas with Development Dependencies

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/CLAUDE.md

Use the provided setup script for recommended development environment setup. Alternatively, install the package in editable mode with development dependencies.

```bash
./.setup_dev.sh
```

```bash
pip install -e '.[dev]'
```

```bash
pre-commit install
```

--------------------------------

### Install pytest and run tests

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/installation.md

Installs the pytest framework and runs the unit test suite to verify the local installation of nested-pandas.

```bash
pip install pytest
pytest
```

--------------------------------

### Generate Example Nested DataFrame

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_manipulation.ipynb

Generates an example nested dataframe for demonstration purposes. This is the initial setup for most operations.

```python
import nested_pandas as npd
from nested_pandas.datasets import generate_data

# Begin by generating an example dataset
ndf = generate_data(5, 20, seed=1)
ndf
```

--------------------------------

### Benchmarking setup and execution

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Defines helper functions for running benchmarks and plotting results.
Ensures the njit function is compiled before benchmarking.

```python
# define helpers for benchmarking
def run_max_slope_py(nf):
    nf.map_rows(
        max_slope_py, columns=["nested.t", "nested.flux"], row_container="args", output_names="max_slope"
    )


def run_max_slope_njit(nf):
    nf.map_rows(
        max_slope_njit,
        columns=["nested.t", "nested.flux"],
        row_container="args",
        output_names="max_slope",
        njit=True,
    )


run_max_slope_njit(nf.copy())  # run njit once for compilation before benchmark
plot_bench(run_max_slope_py, run_max_slope_njit, title="njit over python execution - max_slope")
```

--------------------------------

### Install nested-pandas from source

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/installation.md

Clones the nested-pandas repository and installs it from the local source code. This is useful for development versions.

```bash
git clone https://github.com/lincc-frameworks/nested-pandas.git
cd nested-pandas
pip install .
```

--------------------------------

### Install nested-pandas using pip

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/installation.md

Installs the latest release version of nested-pandas from PyPI.

```bash
pip install nested-pandas
```

--------------------------------

### Setup Toy DataFrame for Combining Nested Structures

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_manipulation.ipynb

Sets up a toy nested dataframe with multiple nested columns ('c' and 'd') to demonstrate combining nested structures.
```python
# Setup a toy dataframe with two nested columns
list_nf = npd.NestedFrame(
    {
        "a": ["cat", "dog", "bird"],
        "b": [1, 2, 3],
        "c": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "d": [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
    }
)
list_nf = list_nf.nest_lists(["c"], "c")
list_nf = list_nf.nest_lists(["d"], "d")
list_nf
```

--------------------------------

### Install nested-pandas with development dependencies

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/installation.md

Installs nested-pandas from source along with optional development dependencies, which are needed for running unit tests and building documentation. Depending on your shell (zsh, for example, treats square brackets as a glob pattern), you may need to quote the argument: `pip install '.[dev]'`.

```bash
pip install .[dev]
```

--------------------------------

### Install nested-pandas

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_loading_notebook.ipynb

Install nested-pandas and its dependencies using pip.

```python
# % pip install nested-pandas
```

--------------------------------

### Inspecting All Column Labels

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/quickstart.ipynb

Use the `.all_columns` property to get a dictionary of both top-level ('base') and nested column labels. This provides a comprehensive view of all available columns.

```python
# Provides a dictionary of "base" (top-level) and nested column labels
nf.all_columns
```

--------------------------------

### nested-pandas Workflow for Flux Amplitude Calculation

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/performance.ipynb

This snippet demonstrates an optimized workflow using nested-pandas to achieve the same results as the native pandas example, often with improved performance. It involves reading data, joining nested structures, filtering, and using the reduce method for calculations. Ensure nested_pandas and its utilities are imported.
```python
%%timeit

# Read in parquet data
# nesting sources into objects
nf = npd.read_parquet("objects.parquet")
nf = nf.join_nested(npd.read_parquet("ztf_sources.parquet"), "ztf_sources")

# Filter on object
nf = nf.query("ra > 10.0")

# Count number of observations per photometric band and add it as a column
nf = count_nested(nf, "ztf_sources", by="band", join=True)  # use an existing utility

# Filter on our nobs
nf = nf.query("n_ztf_sources_g > 520")

# Calculate Amplitude
amplitude = licu.Amplitude()
nf.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")
```

--------------------------------

### Build Documentation with Sphinx

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/CLAUDE.md

Navigate to the docs directory and use the make html command to build the project's documentation.

```bash
cd docs && make html
```

--------------------------------

### Run Pre-commit Checks and Linting/Formatting with Ruff

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/CLAUDE.md

Execute all pre-commit hooks to ensure code quality and style compliance across the project. Specific commands are available for Ruff linting and formatting.

```bash
pre-commit run --all-files
```

```bash
ruff check src/ tests/
```

```bash
ruff format src/ tests/
```

--------------------------------

### Get nested column names

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Retrieves the names of all nested columns available in the Series using the .columns attribute of the .nest accessor.

```python
nested_series.nest.columns
```

--------------------------------

### Set up conda environment for nested-pandas

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/installation.md

Creates and activates a new conda environment for nested-pandas development. Use Python 3.11.
```bash
conda create -n nested_pandas_env python=3.11
conda activate nested_pandas_env
```

--------------------------------

### Import necessary libraries

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_loading_notebook.ipynb

Import pandas, os, tempfile, and specific components from nested_pandas.

```python
import os
import tempfile

import pandas as pd

from nested_pandas import NestedFrame, read_parquet
from nested_pandas.datasets import generate_parquet_file
```

--------------------------------

### Convert Nested Series to Flat DataFrame

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Use `.to_flat()` to get a "flat" pandas DataFrame with a repeated index, effectively concatenating nested elements. This operation is copy-free.

```python
nested_series.nest.to_flat(["flux", "t"])
```

--------------------------------

### Plotting Benchmark for Asymptotic Behavior

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Configures and plots benchmark results to analyze the asymptotic behavior of njit versus Python execution. Adjusts the number of base rows and nested columns to observe performance trends.

```python
n_base_shrink = [7500, 10_000]
n_nested_list_asy = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
plot_bench(
    run_max_slope_py,
    run_max_slope_njit,
    title="njit over python execution - max_slope (asymptotic behavior)",
    n_base_list=n_base_shrink,
    n_nested_list=n_nested_list_asy,
)
```

--------------------------------

### Read Parquet with Partial Column Loading

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Demonstrates reading a Parquet file with nested columns, supporting partial loading of specific sub-columns using dot notation. Requires the 's3fs' library for S3 access.
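
To make the dot notation concrete, the following plain-Python sketch shows one way a column list such as `["a", "nested.flux"]` could be interpreted: names without a dot are base columns, and `outer.sub` names select a sub-column of a nested column. The helper `split_column_spec` is hypothetical and exists only for illustration; it is not part of nested-pandas and makes no claim about how `read_parquet` is implemented.

```python
from collections import defaultdict


def split_column_spec(columns):
    """Split a column list into base columns and nested sub-columns.

    Hypothetical illustration helper, not a nested-pandas function.
    """
    base = []
    nested = defaultdict(list)
    for name in columns:
        if "." in name:
            # "nested.flux" -> nested column "nested", sub-column "flux"
            outer, sub = name.split(".", 1)
            nested[outer].append(sub)
        else:
            base.append(name)
    return base, dict(nested)


base, nested = split_column_spec(["a", "nested.flux"])
print(base)    # ['a']
print(nested)  # {'nested': ['flux']}
```

The library's own usage of the same column list follows.
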
```python
import nested_pandas as npd
from nested_pandas.datasets import generate_data

nf = generate_data(100, 50, seed=0)

# Write
nf.to_parquet("data.parquet")

# Read full file
nf2 = npd.read_parquet("data.parquet")
assert list(nf2.nested_columns) == ["nested"]

# Partial load: only "a" base column and the "flux" sub-column of "nested"
nf3 = npd.read_parquet("data.parquet", columns=["a", "nested.flux"])
print(nf3.columns.tolist())  # ['a', 'nested']
print(nf3["nested"].nest.columns)  # ['flux']

# Read from S3 (requires s3fs)
# nf_s3 = npd.read_parquet("s3://my-bucket/data.parquet")
```

--------------------------------

### Run All Tests with Pytest

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/CLAUDE.md

Execute all tests in the project using the pytest command. For coverage reports, include the --cov and --cov-report flags.

```bash
python -m pytest
```

```bash
python -m pytest --cov=nested_pandas --cov-report=xml
```

--------------------------------

### Applying njit-compiled function with map_rows

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Demonstrates how to use an `@njit` decorated function with `map_rows` by setting `njit=True`. This enables performance optimizations.

```python
nf.map_rows(
    max_slope_njit,
    columns=["nested.t", "nested.flux"],
    row_container="args",
    output_names="max_slope",
    njit=True,
)
```

--------------------------------

### Running and Plotting Weighted Mean Slope Benchmarks

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Sets up and runs benchmark comparisons for Python and njit versions of weighted_mean_slope using map_rows. Includes initial njit compilation and plotting of results.
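
For orientation, the quantity being benchmarked can be written in plain Python. This sketch assumes the same definition as the explicit-loop variant `weighted_mean_slope_njit_loop` that appears in this collection, where each consecutive slope is weighted by its time step. The name `weighted_mean_slope_sketch` is hypothetical, and the notebook's own `weighted_mean_slope_py` may be written differently.

```python
import math


def weighted_mean_slope_sketch(t, flux):
    """Mean of consecutive slopes, each weighted by its time step dt."""
    num = 0.0
    weight = 0.0
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        df = flux[i + 1] - flux[i]
        num += (df / dt) * dt  # slope times its weight
        weight += dt
    return num / weight


t = [0.0, 1.0, 3.0, 6.0]
flux = [1.0, 3.0, 2.0, 8.0]
result = weighted_mean_slope_sketch(t, flux)

# With dt weights the sum telescopes to overall rise over overall run
expected = (flux[-1] - flux[0]) / (t[-1] - t[0])
print(math.isclose(result, expected))  # True
```

Under that assumption the result depends only on the first and last samples, so the benchmark mainly measures per-row call and array-handling overhead rather than heavy arithmetic.
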
```python
def run_weighted_mean_slope_py(nf):
    nf.map_rows(
        weighted_mean_slope_py,
        columns=["nested.t", "nested.flux"],
        row_container="args",
        output_names="weighted_mean_slope",
    )


def run_weighted_mean_slope_njit(nf):
    nf.map_rows(
        weighted_mean_slope_njit,
        columns=["nested.t", "nested.flux"],
        row_container="args",
        output_names="weighted_mean_slope",
        njit=True,
    )


run_weighted_mean_slope_njit(nf.copy())  # run njit once for compilation before benchmark
plot_bench(
    run_weighted_mean_slope_py,
    run_weighted_mean_slope_njit,
    title="njit over python execution - weighted_mean_slope",
)
```

--------------------------------

### GroupBy Describe Aggregation

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/groupby_doc.ipynb

Demonstrates the `describe()` aggregation on a `groupby` object. This method works as expected, providing descriptive statistics by automatically flattening the nested columns.

```python
# describe works as expected with automatic flattened nested column
nf.groupby("c").describe()
```

--------------------------------

### Import necessary libraries and generate data

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Imports `generate_data` from `nested_pandas.datasets`, `numpy`, and `njit` from `numba`. Generates a sample nested pandas DataFrame for demonstration.

```python
from nested_pandas.datasets import generate_data
import numpy as np
from numba import njit

# example frame
nf = generate_data(10_000, 1000, seed=1)
```

--------------------------------

### Access nested column keys using .nest

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Demonstrates how to retrieve the names (keys) of the nested columns using the .nest accessor.
```python
list(nested_series.nest.keys())
```

--------------------------------

### Build Flat Spectrum Dataframe

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/nested_spectra.ipynb

Constructs a 'flat' spectrum table by iterating through retrieved FITS spectral data. It aggregates wavelength, flux, error, and object index into NumPy arrays. Requires numpy.

```python
import numpy as np

# Build a flat spectrum dataframe

# Initialize some empty arrays to hold the flat data
wave = np.array([])
flux = np.array([])
err = np.array([])
index = np.array([])

# Loop over each spectrum, adding its data to the arrays
for i, hdu in enumerate(sp):
    wave = np.append(wave, 10 ** hdu["COADD"].data.loglam)  # * u.angstrom
    flux = np.append(flux, hdu["COADD"].data.flux * 1e-17)  # * u.erg/u.second/u.centimeter**2/u.angstrom
    err = np.append(err, 1 / hdu["COADD"].data.ivar * 1e-17)  # * flux.unit

    # We'll need to set an index to keep track of which rows correspond
    # to which object
    index = np.append(index, i * np.ones(len(hdu["COADD"].data.loglam)))
```

--------------------------------

### Asymptotic Behavior Plot for njit Looping Function

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Use this to visualize the performance scaling of njit-compiled functions with explicit loops. It helps understand how performance changes as the input size increases, particularly for nested data structures.
```python
n_base_shrink = [7500, 10_000]
nested_list_asy = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]

# Assuming plot_bench and run_weighted_mean_slope_py are defined elsewhere
# plot_bench(
#     run_weighted_mean_slope_py,
#     run_weighted_mean_slope_njit_loop,  # Assuming this is defined elsewhere
#     title="njit over python execution - weighted_mean_slope (asymptotic behavior for loop)",
#     n_base_list=n_base_shrink,
#     n_nested_list=nested_list_asy,
# )
```

--------------------------------

### Applying standard Python function with map_rows

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Shows the usage of a standard Python function with `map_rows`. This serves as the baseline for performance comparison.

```python
nf.map_rows(
    max_slope_py,
    columns=["nested.t", "nested.flux"],
    row_container="args",
    output_names="max_slope",
)
```

--------------------------------

### Load Data from Parquet Files

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_loading_notebook.ipynb

Ingest data from Parquet files into a NestedFrame using the `read_parquet` function. Temporary files are used for demonstration.

```python
# Note that we use the `tempfile` module to create and then clean up a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    # Generates parquet files with random data within our temporary directory
    generate_parquet_file(10, {"nested1": 100, "nested2": 10}, os.path.join(temp_path, "test.parquet"))

    # Read the parquet file to a NestedFrame
    nf = read_parquet(os.path.join(temp_path, "test.parquet"))
```

--------------------------------

### Class Documentation

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/_templates/autosummary/class.rst

This section details the documentation structure for classes within the nested-pandas library, including their constructors, methods, and attributes.

```APIDOC
## Class Name

.. autoclass:: {{ objname }}

   .. automethod:: __init__

   .. rubric:: Methods

   .. autosummary::
   {% for item in methods %}
      ~{{ name }}.{{ item }}
   {%- endfor %}

   .. rubric:: Attributes

   .. autosummary::
   {% for item in attributes %}
      ~{{ name }}.{{ item }}
   {%- endfor %}
```

--------------------------------

### Pandas Workflow for Flux Amplitude Calculation

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/performance.ipynb

This snippet shows a typical workflow using native pandas to read, filter, and process photometric data, including calculating flux amplitudes. It requires importing pandas, light_curve, and numpy.
```python
import nested_pandas as npd
import pandas as pd
import light_curve as licu
import numpy as np

from nested_pandas.utils import count_nested
```

```python
%%timeit

# Read data
object_df = pd.read_parquet("objects.parquet")
source_df = pd.read_parquet("ztf_sources.parquet")

# Filter on object
filtered_object = object_df.query("ra > 10.0")
# sync object to source -- removes any index values of source not found in object
filtered_source = filtered_object[[]].join(source_df, how="left")

# Count number of observations per photometric band and add it to the object table
band_counts = (
    source_df.groupby(level=0)
    .apply(lambda x: x[["band"]].value_counts().reset_index())
    .pivot_table(values="count", index="index", columns="band", aggfunc="sum")
)
filtered_object = filtered_object.join(band_counts[["g", "r"]])

# Filter on our nobs
filtered_object = filtered_object.query("g > 520")
filtered_source = filtered_object[[]].join(source_df, how="left")

# Calculate Amplitude
amplitude = licu.Amplitude()
filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))
```

--------------------------------

### Benchmarking Three Mixed-Type Arguments with map_rows

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Benchmark the performance of a three-argument function using `map_rows`, comparing njit execution against the default Python pathway. Observe how optimization decreases with increasing nested column width.
```python
def run_sum_max3_py(nf):
    nf.map_rows(
        sum_max3_py, columns=["a", "nested.t", "nested.flux"], row_container="args", output_names="max_3"
    )


# leave out `njit=True` to take python pathway
def run_sum_max3_njit(nf):
    nf.map_rows(
        sum_max3_njit,
        columns=["a", "nested.t", "nested.flux"],
        row_container="args",
        output_names="max_3",
    )


run_sum_max3_njit(nf.copy())  # run once for jit compilation before benchmark
plot_bench(run_sum_max3_py, run_sum_max3_njit, title="njit custom function over python - sum_max3")
```

--------------------------------

### Plotting Asymptotic Behavior for Weighted Mean Slope

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Analyzes the asymptotic behavior of the weighted_mean_slope function by plotting performance against increasing nested column width. This helps identify the crossover point where Python may outperform njit.

```python
n_base_shrink = [7500, 10_000]
n_nested_list_asy = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
plot_bench(
    run_weighted_mean_slope_py,
    run_weighted_mean_slope_njit,
    title="njit over python execution - weighted_mean_slope (asymptotic behavior)",
    n_base_list=n_base_shrink,
    n_nested_list=n_nested_list_asy,
)
```

--------------------------------

### NestedFrame.to_parquet / read_parquet

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Write a NestedFrame to a Parquet file and read it back. Supports partial column loading and remote paths.

```APIDOC
## NestedFrame.to_parquet / read_parquet: Parquet I/O

Write a `NestedFrame` to a Parquet file (nested columns are stored as struct-of-lists) and read it back. Supports partial column loading via dot notation and remote paths (S3, HTTP).

### Request Example
```python
import nested_pandas as npd
from nested_pandas.datasets import generate_data

nf = generate_data(100, 50, seed=0)

# Write
nf.to_parquet("data.parquet")

# Read full file
nf2 = npd.read_parquet("data.parquet")

# Partial load: only "a" base column and the "flux" sub-column of "nested"
nf3 = npd.read_parquet("data.parquet", columns=["a", "nested.flux"])

# Read from S3 (requires s3fs)
# nf_s3 = npd.read_parquet("s3://my-bucket/data.parquet")
```

### Response Example
```python
# After reading full file:
# assert list(nf2.nested_columns) == ["nested"]

# After partial load:
# print(nf3.columns.tolist())  # Expected: ['a', 'nested']
# print(nf3["nested"].nest.columns)  # Expected: ['flux']
```
```

--------------------------------

### Create a flat Pandas DataFrame

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_loading_notebook.ipynb

Define a sample flat DataFrame with repeating and varying columns, suitable for conversion to a nested structure.

```python
flat_df = pd.DataFrame(
    data={
        "a": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "b": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)
flat_df
```

--------------------------------

### Create Base NestedFrame

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_loading_notebook.ipynb

Initialize a NestedFrame from a dictionary to define top-level objects and constant values.

```python
nf = NestedFrame(
    data={
        "a": [1, 2, 3],
        "b": [2, 4, 6],
    },
    index=[0, 1, 2],
)
nf
```

--------------------------------

### Display NestedFrame with Nested Columns

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/data_loading_notebook.ipynb

View a NestedFrame that contains nested columns, demonstrating the structure after data loading.
```python
nf  # nf contains nested columns
```

--------------------------------

### Build NestedFrame from Arrays

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/nested_spectra.ipynb

Constructs a NestedFrame from flat arrays representing spectral data (wavelength, flux, error). Ensure data is properly formatted and indexed.

```python
flat_spec = npd.NestedFrame(dict(wave=wave, flux=flux, err=err), index=index.astype(np.int8))
```

--------------------------------

### Python and njit implementations for max_slope

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Provides both a standard Python and an `@njit` decorated version of the `max_slope` function. Use the `@njit` version for performance-critical computations.

```python
def max_slope_py(t, flux):
    slope = np.diff(flux) / np.diff(t)
    return np.max(slope)


@njit
def max_slope_njit(t, flux):
    slope = np.diff(flux) / np.diff(t)
    return np.max(slope)
```

--------------------------------

### Generate Toy NestedFrame Dataset

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Use `generate_data` to quickly create synthetic NestedFrame datasets for testing. It supports single or multiple nested columns with configurable sizes.

```python
from nested_pandas.datasets import generate_data

# Single nested column: 5 base rows, 10 nested rows each
nf = generate_data(5, 10, seed=1)
print(nf)
#           a         b                                             nested
# 0  0.417022  0.184677  [{t: 8.38389, flux: 31.551563, band: 'r'}; …] (10 rows)
# ...
```

```python
# Multiple nested columns with different sizes
nf2 = generate_data(5, {"lc": 10, "spectra": 3}, seed=42)
print(nf2.nested_columns)  # ['lc', 'spectra']
print(nf2["lc"].nest.columns)  # ['t', 'flux', 'band']
```

--------------------------------

### Import necessary libraries

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Imports essential libraries for data manipulation and nested data structures.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

from nested_pandas import NestedDtype
from nested_pandas.datasets import generate_data
from nested_pandas.series.packer import pack
```

--------------------------------

### Mapping Rows with Custom Function and Arguments Input

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/quickstart.ipynb

Apply a custom function to rows using `map_rows`, passing specified columns as arguments. The `row_container='args'` option unpacks the data into separate arguments for the function.

```python
def show_inputs(*args):
    return args


nf_inputs = nf.map_rows(show_inputs, columns=["ra", "lightcurve.time"], row_container="args")
nf_inputs
```

--------------------------------

### GroupBy Min/Max/Mean Aggregation Failure

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/groupby_doc.ipynb

Illustrates that `min()`, `max()`, and `mean()` aggregations fail when applied to nested columns. This is due to the unhashable nature of the nested data structures.

```python
# min/max/mean fail on nested columns
try:
    grouped_min = nf.groupby("c").min()
    print(grouped_min)
except TypeError as e:
    print(f"Cannot compute min on nested columns: {e}")
```

--------------------------------

### Infer Nesting with Prefix

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Use `infer_nesting=True` to automatically create nested columns based on an 'out.' prefix in the output dictionary keys.
This is useful for transforming flat data into a nested structure.

```python
def offsets(row):
    return {"out.dt": row["nested.t"] - row["a"], "out.df": row["nested.flux"] - row["b"]}


result3 = nf.map_rows(offsets, columns=["a", "b", "nested.t", "nested.flux"], infer_nesting=True)
print(result3.nested_columns)  # ['out']

# append_columns: merge results back into the original frame
# (`summarize` is not defined in this snippet; it is assumed to come from an earlier example)
nf_aug = nf.map_rows(summarize, columns=["a", "nested.flux"], append_columns=True)
print(nf_aug.base_columns)  # ['a', 'b', 'mean_flux', 'n_obs', 'max_minus_a']
```

--------------------------------

### Generate Sample NestedPandas Data

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/groupby_doc.ipynb

Generates a sample NestedPandas DataFrame for demonstration purposes. This includes creating a DataFrame with nested data and adding a non-nested column 'c' for grouping.

```python
from nested_pandas.datasets import generate_data

nf = generate_data(5, 10, seed=1)
nf["c"] = [0, 0, 1, 1, 1]
nf
```

--------------------------------

### Benchmarking Exploded Base Column Optimization

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Benchmark the performance of the njit-optimized function against its Python equivalent. This helps determine the breaking point where the overhead of exploding the column outweighs njit's benefits.
```python
# explode graph
def run_scaled_max_flux_njit_explode(nf):
    nf["nested.a"] = nf["a"]
    nf.map_rows(
        scaled_max_flux_njit_explode,
        columns=["nested.flux", "nested.a"],
        row_container="args",
        output_names="scaled_max_flux",
        njit=True,
    )


run_scaled_max_flux_njit_explode(nf.copy())  # run once for jit compilation before benchmark
plot_bench(
    run_scaled_max_flux_py,
    run_scaled_max_flux_njit_explode,
    title="njit over python execution - scaled_max_flux (explode)",
)
```

--------------------------------

### Numba njit with Explicit Loop for Weighted Mean Slope

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Use this when optimizing functions that involve iterative calculations. Numba's njit can provide substantial speedups by compiling explicit Python loops into machine code, outperforming numpy-based approaches in certain scenarios.

```python
import numpy as np
from numba import njit


@njit
def weighted_mean_slope_njit_loop(t, flux):
    n = t.size - 1
    num = 0.0
    weight = 0.0

    # manually looping to get the difference and summing up
    for i in range(n):
        dt = t[i + 1] - t[i]
        df = flux[i + 1] - flux[i]
        slope = df / dt
        num += slope * dt
        weight += dt

    return num / weight


def run_weighted_mean_slope_njit_loop(nf):
    nf.map_rows(
        weighted_mean_slope_njit_loop,
        columns=["nested.t", "nested.flux"],
        row_container="args",
        output_names="weighted_mean_slope",
        njit=True,
    )


# Assuming nf and plot_bench are defined elsewhere
# run_weighted_mean_slope_njit_loop(nf.copy())  # run njit once for compilation before benchmark
# plot_bench(
#     run_weighted_mean_slope_py,  # Assuming this is defined elsewhere
#     run_weighted_mean_slope_njit_loop,
#     title="njit over python execution - weighted_mean_slope (loop)",
# )
```

--------------------------------

### Multi-select Sub-columns in NestedSeries

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/quickstart.ipynb

Select multiple sub-columns from a NestedSeries
simultaneously.

```python
# Multi-selecting sub-columns
nf["lightcurve"][["time", "brightness"]]
```

--------------------------------

### Create Nested Series using pack()

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Use `pack()` to create a nested Series from a collection of elements like DataFrames, dictionaries, or None. Elements must share the same columns but can have different lengths.

```python
series_from_pack = pack(
    [
        pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
        {"t": [4, 5], "flux": [0.4, 0.5]},
        None,
    ],
    name="from_pack",  # optional
    index=[3, 4, 5],  # optional
)
series_from_pack
```

--------------------------------

### Mapping Rows with Custom Function and Dictionary Input

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/quickstart.ipynb

Apply a custom function to rows using `map_rows`, passing specified columns as a dictionary. The `row_container='dict'` option structures the input for the function.

```python
def show_inputs(row):
    return row


# row_container="dict" passes the data as a dictionary to the function
nf_inputs = nf.map_rows(show_inputs, columns=["ra", "lightcurve.time"], row_container="dict")
nf_inputs
# map_rows returns a dataframe view of the dicts, but the two columns can be accessed within show_inputs as
# row["ra"] and row["lightcurve.time"]
```

--------------------------------

### Dispatcher for packing DataFrames or sequences

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

The `pack` function acts as a dispatcher, automatically selecting `pack_flat` for DataFrames and `pack_seq` for sequences.
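
Conceptually, packing a flat table groups the rows that share an index value into one nested entry per index. The plain-Python sketch below illustrates only that grouping idea. The function `pack_flat_sketch` is a hypothetical stand-in written for this example; the real packer returns a pandas Series with a nested dtype rather than plain dictionaries.

```python
def pack_flat_sketch(index, columns):
    """Group flat rows by repeated index values.

    Hypothetical illustration; not the nested-pandas implementation.
    """
    packed = {}
    for row, key in enumerate(index):
        # One entry per distinct index value, holding one list per column
        entry = packed.setdefault(key, {name: [] for name in columns})
        for name, values in columns.items():
            entry[name].append(values[row])
    return packed


flat_index = [0, 0, 1, 1]
flat_columns = {"t": [1.0, 1.5, 2.0, 2.5], "flux": [10.0, 11.0, 20.0, 21.0]}
packed = pack_flat_sketch(flat_index, flat_columns)
print(packed[0])  # {'t': [1.0, 1.5], 'flux': [10.0, 11.0]}
print(packed[1])  # {'t': [2.0, 2.5], 'flux': [20.0, 21.0]}
```

The library call on the same flat layout follows.
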
```python
import pandas as pd
import numpy as np
from nested_pandas.series.packer import pack, pack_flat, pack_lists, pack_seq

# pack: dispatcher that delegates to pack_flat (DataFrame) or pack_seq (sequence)
flat = pd.DataFrame({
    "t": [1.0, 1.5, 2.0, 2.5],
    "flux": [10., 11., 20., 21.],
}, index=[0, 0, 1, 1])
ns4 = pack(flat, name="lc")
```

--------------------------------

### Plotting a Nested Spectrum

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/nested_spectra.ipynb

Visualizes a single spectrum by plotting its flux against wavelength. Requires matplotlib for plotting. Ensure the spectrum data is correctly accessed.

```python
import matplotlib.pyplot as plt

# Plot a spectrum
spec = spec_ndf.iloc[1].coadd_spectrum
plt.plot(spec["wave"], spec["flux"])
plt.xlabel("Wavelength (Å)")
plt.ylabel(r"Flux ($ergs/s/cm^2/Å$)")
```

--------------------------------

### Build NestedFrame Manually

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Construct a `NestedFrame` by joining a flat DataFrame as a nested column to a base DataFrame. This demonstrates the core `NestedFrame` structure and its properties.
```python
import nested_pandas as npd
import pandas as pd
import numpy as np

# Build a NestedFrame manually
base = npd.NestedFrame({"obj_id": [1, 2, 3], "ra": [10.0, 20.0, 30.0]}, index=[0, 1, 2])
measurements = pd.DataFrame({
    "time": [1.1, 1.2, 2.1, 2.2, 3.1],
    "flux": [10.0, 11.0, 20.0, 21.0, 30.0],
    "band": ["g", "r", "g", "r", "g"],
}, index=[0, 0, 1, 1, 2])

nf = base.join_nested(measurements, "lc")
print(nf)
#    obj_id    ra                                                 lc
# 0       1  10.0  [{time: 1.1, flux: 10.0, band: 'g'}; …] (2 rows)
# 1       2  20.0  [{time: 2.1, flux: 20.0, band: 'g'}; …] (2 rows)
# 2       3  30.0  [{time: 3.1, flux: 30.0, band: 'g'}]

print(nf.nested_columns)  # ['lc']
print(nf.base_columns)  # ['obj_id', 'ra']
print(nf.all_columns)  # {'base': [...], 'lc': ['time', 'flux', 'band']}
```

--------------------------------

### Basic GroupBy Operation

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/groupby_doc.ipynb

Demonstrates a basic `groupby` operation on a non-nested column, which returns a standard pandas GroupBy object. Grouping by nested columns is not supported because their values are unhashable.

```python
nf.groupby("c")  # returns a Pandas GroupBy object
```

--------------------------------

### Exploding Base Column for njit Optimization

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/njit_map_rows.ipynb

Use this pattern when njit optimization is needed for functions with mixed-type arguments. Explode the base column into a nested column to satisfy njit's static typing requirements. Be aware that this can be inefficient for large nested column widths.
```python
import numpy as np
from numba import njit


@njit
def scaled_max_flux_njit_explode(flux, a):
    """
    flux: 1D array (nested slice)
    a: scalar vector (base column value exploded into nested column)
    """
    return a[0] * np.max(flux)


nf["nested.a"] = nf["a"]  # explode base column into nested column

nf.map_rows(
    scaled_max_flux_njit_explode,
    columns=["nested.flux", "nested.a"],  # input both arguments as nested column
    row_container="args",
    output_names="scaled_max_flux",
    njit=True,
)
```

--------------------------------

### Create Nested Series from Existing Nested Series

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Use the `pack()` function to create a new nested Series from an existing one. This is useful for creating copies or when performing operations that result in a new nested structure.

```python
new_series = pack(nested_series.nest.to_flat())
new_series.equals(nested_series)
```

--------------------------------

### NestSeriesAccessor (.nest)

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Low-level accessor for NestedSeries, providing methods to convert between representations, mutate sub-columns, and query the flat array.

```APIDOC
## NestSeriesAccessor (`.nest`): Low-level accessor for NestedSeries

Registered as `.nest` on any `pd.Series` with `NestedDtype`. Provides methods to convert between representations, mutate sub-columns, and query the flat array.

### Methods and Properties
- **`to_flat()`**: Convert to flat DataFrame (repeated index).
- **`to_lists()`**: Convert to list-arrays DataFrame (one array per row).
- **`columns`**: Get list of sub-column names.
- **`len()`**: Get number of nested rows per outer row.
- **`flat_length`**: Get total number of nested elements.
- **`flat_index`**: Get flat index (repeated outer index).
- **`set_flat_column(name, values)`**: Add a new sub-column with a scalar or flat array.
- **`set_list_column(name, values)`**: Add a sub-column using list-arrays.
- **`set_filled_column(name, values)`**: Repeat a base-column value into nested rows.
- **`drop(column_name)`**: Drop sub-columns.
- **`query(expression)`**: Query the flat arrays.

### Request Example
```python
from nested_pandas.datasets import generate_data

nf = generate_data(5, 5, seed=1)
ns = nf["nested"]  # NestedSeries

# Convert to flat DataFrame
flat_df = ns.nest.to_flat()

# Add a new sub-column
ns2 = ns.nest.set_flat_column("weight", 1.0)
ns3 = ns.nest.set_flat_column("norm_flux", flat_df["flux"].values / flat_df["flux"].max())

# Drop sub-columns
ns_no_band = ns.nest.drop("band")

# Query the flat arrays
ns_bright = ns.nest.query("flux > 50")
```

### Response Example
```python
# print(flat_df.head())
# print(ns.nest.columns)  # Expected: ['t', 'flux', 'band']
# print(ns.nest.len())  # Expected: [5, 5, 5, 5, 5]
# print(ns.nest.flat_length)  # Expected: 25
# print(ns.nest.flat_index)  # Expected: Index([0, 0, 0, 0, 0, 1, 1, ...])
# print(ns_bright)
```
```

--------------------------------

### Query SDSS for Spectra

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/nested_spectra.ipynb

Queries the Sloan Digital Sky Survey (SDSS) for astronomical objects within a specified region and retrieves their spectra. Requires astroquery and astropy libraries.

```python
from astroquery.sdss import SDSS
from astropy import coordinates as coords
import astropy.units as u
import nested_pandas as npd

# Query SDSS for a set of objects with spectra
pos = coords.SkyCoord("0h8m10.63s +14d50m23.3s", frame="icrs")
xid = SDSS.query_region(pos, radius=3 * u.arcmin, spectro=True)
xid_ndf = npd.NestedFrame(xid.to_pandas())
xid_ndf
```

--------------------------------

### Perform Type Checking with Mypy

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/CLAUDE.md

Utilize mypy for static type checking to catch potential type-related errors in the source and test files.
```bash
mypy src/ tests/
```

--------------------------------

### Retrieve Spectra Data from SDSS

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/nested_spectra.ipynb

Retrieves the actual spectral data for objects previously identified by an SDSS query. Clears the cache before fetching to ensure fresh data.

```python
# Query SDSS for the corresponding spectra
SDSS.clear_cache()
sp = SDSS.get_spectra(matches=xid)
sp
```

--------------------------------

### Join Nested Spectra to Existing Data

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/pre_executed/nested_spectra.ipynb

Joins the spectral table `flat_spec` to the existing NestedFrame `xid_ndf` as a nested column named `coadd_spectrum`, then sets `objid` as the index. This nests the spectral data within the main table.

```python
spec_ndf = xid_ndf.join_nested(flat_spec, "coadd_spectrum").set_index("objid")
```

--------------------------------

### Create Nested Series from PyArrow Struct Array

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Construct a nested Series efficiently from a PyArrow struct array. This is the most performant method for creating nested Series when data is already in PyArrow format.

```python
import numpy as np
import pandas as pd
import pyarrow as pa
from nested_pandas import NestedDtype

pa_struct_array = pa.StructArray.from_arrays(
    [
        [
            np.arange(10),
            np.arange(5),
        ],  # "a" field
        [
            np.linspace(0, 1, 10),
            np.linspace(0, 1, 5),
        ],  # "b" field
    ],
    names=["a", "b"],
)
series_from_pa_struct = pd.Series(
    pa_struct_array,
    dtype=NestedDtype(pa_struct_array.type),
    name="from_pa_struct_array",
    index=["I", "II"],
)
```

--------------------------------

### Create a Flat Pandas DataFrame

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/gettingstarted/quickstart.ipynb

Create a standard pandas DataFrame to represent nested time-series data. This serves as the input for converting into a NestedFrame.

```python
import pandas as pd

# Represent nested time series information as a classic pandas dataframe.
my_data_frame = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "ra": [10.0, 10.0, 10.0, 15.0, 15.0, 15.0, 12.1, 12.1, 12.1, 12.1],
        "dec": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],
        "time": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],
        "brightness": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],
        "band": ["g", "r", "g", "r", "g", "r", "g", "g", "r", "r"],
    }
)
my_data_frame
```

--------------------------------

### Convert Nested Series to ArrowDtype, Flat DataFrame, and List DataFrame

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Demonstrates conversions of a nested Series to a PyArrow dtype Series, a flat DataFrame, and a DataFrame with list-arrays. These conversions are useful for interoperability and analysis.

```python
# Convert to pd.ArrowDtype Series of struct-arrays
arrow_dtyped_series = pd.Series(nested_series, dtype=nested_series.dtype.to_pandas_arrow_dtype())

# Convert to a flat dataframe
flat_df = nested_series.nest.to_flat()

# Convert to a list-array dataframe
list_df = nested_series.nest.to_lists()
```

--------------------------------

### Generate nested data and access Series

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Generates sample nested data and extracts a Series with NestedDtype for further manipulation.

```python
nested_df = generate_data(4, 3, seed=42)
nested_series = nested_df["nested"]
nested_series[2]
```

--------------------------------

### Pack list columns into a nested column

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Use `nest_lists` as an instance method to pack existing list-valued columns into a new named nested column. The packed frame is returned, so the example assigns it to `result` rather than relying on `nf` being modified in place.
```python
import nested_pandas as npd

nf = npd.NestedFrame({
    "id": [1, 2, 3],
    "flux": [[10., 11.], [20., 21.], [30., 31.]],
    "band": [["g", "r"], ["g", "g"], ["r", "r"]],
})

result = nf.nest_lists(columns=["flux", "band"], name="obs")
print(result)
#    id                                    obs
# 0   1  [{flux: 10.0, band: 'g'}; …] (2 rows)
# 1   2  [{flux: 20.0, band: 'g'}; …] (2 rows)
# 2   3  [{flux: 30.0, band: 'r'}; …] (2 rows)
```

--------------------------------

### Create a new nested Series with a subset of columns

Source: https://github.com/lincc-frameworks/nested-pandas/blob/main/docs/tutorials/low_level.ipynb

Generates a new nested Series containing only the specified columns ('t' and 'flux') from the original nested Series and displays its dtype.

```python
nested_series.nest[["t", "flux"]].dtype
```

--------------------------------

### generate_data

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

Generates a synthetic NestedFrame dataset for testing and exploration. It can create single or multiple nested columns with specified sizes.

```APIDOC
## `generate_data`: Generate a toy NestedFrame dataset

Quickly creates a synthetic `NestedFrame` with base columns `a` and `b` and one or more nested columns, each with sub-columns `t`, `flux`, and `band`, for testing and exploration. Accepts a dictionary for `n_layer` to create multiple nested columns in one call.

```python
from nested_pandas.datasets import generate_data

# Single nested column: 5 base rows, 10 nested rows each
nf = generate_data(5, 10, seed=1)
print(nf)
#           a         b                                                    nested
# 0  0.417022  0.184677  [{t: 8.38389, flux: 31.551563, band: 'r'}; …] (10 rows)
# ...
# Multiple nested columns with different sizes
nf2 = generate_data(5, {"lc": 10, "spectra": 3}, seed=42)
print(nf2.nested_columns)  # ['lc', 'spectra']
print(nf2["lc"].nest.columns)  # ['t', 'flux', 'band']
```
```

--------------------------------

### NestedSeries Accessor for Low-Level Operations

Source: https://context7.com/lincc-frameworks/nested-pandas/llms.txt

The `.nest` accessor provides low-level methods for NestedSeries, enabling conversion to flat or list-array DataFrames, manipulation of sub-columns, and querying of nested data.

```python
from nested_pandas.datasets import generate_data

nf = generate_data(5, 5, seed=1)
ns = nf["nested"]  # NestedSeries

# Convert to flat DataFrame (repeated index)
flat_df = ns.nest.to_flat()
print(flat_df.head())

# Convert to list-arrays DataFrame (one array per row)
lists_df = ns.nest.to_lists()
print(lists_df.head())

# List of sub-column names
print(ns.nest.columns)  # ['t', 'flux', 'band']

# Number of nested rows per outer row
print(ns.nest.len())  # [5, 5, 5, 5, 5]

# Total number of nested elements
print(ns.nest.flat_length)  # 25

# Flat index (repeated outer index)
print(ns.nest.flat_index)  # Index([0, 0, 0, 0, 0, 1, 1, ...])

# Add a new sub-column with a scalar (broadcast) or flat array
ns2 = ns.nest.set_flat_column("weight", 1.0)
ns3 = ns.nest.set_flat_column("norm_flux", flat_df["flux"].values / flat_df["flux"].max())

# Add a sub-column using list-arrays (one list per outer row)
ns4 = ns.nest.set_list_column("flag", [[True]*5]*5)

# Repeat a base-column value into nested rows
ns5 = ns.nest.set_filled_column("object_id", [10, 20, 30, 40, 50])

# Drop sub-columns
ns_no_band = ns.nest.drop("band")

# Query the flat arrays
ns_bright = ns.nest.query("flux > 50")
print(ns_bright)
```
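
The flat layout (one record per nested element, with a repeated outer index) and the list layout (one list per sub-column for each outer row) carry the same information. The plain-Python sketch below illustrates that round trip conceptually. It does not call nested-pandas, and the helper names `flat_to_lists` and `lists_to_flat` are hypothetical, chosen only for this illustration; the library's own conversions are `to_flat()` and `to_lists()`.

```python
from collections import defaultdict

# Flat layout: one record per nested element, tagged with its outer-row index.
flat_rows = [
    (0, {"t": 1.0, "flux": 10.0}),
    (0, {"t": 1.5, "flux": 11.0}),
    (1, {"t": 2.0, "flux": 20.0}),
]


def flat_to_lists(rows):
    """Group flat records into one list per sub-column for each outer index (hypothetical helper)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for idx, record in rows:
        for column, value in record.items():
            grouped[idx][column].append(value)
    return {idx: dict(columns) for idx, columns in grouped.items()}


def lists_to_flat(lists):
    """Expand per-row lists back into flat records with a repeated index (hypothetical helper)."""
    rows = []
    for idx, columns in lists.items():
        names = list(columns)
        for values in zip(*(columns[name] for name in names)):
            rows.append((idx, dict(zip(names, values))))
    return rows


lists = flat_to_lists(flat_rows)

# Total number of nested elements across all outer rows.
flat_length = sum(len(columns["t"]) for columns in lists.values())

# Converting back recovers the original flat records.
round_trip = lists_to_flat(lists)
```

Here `lists[0]` holds two-element lists for `t` and `flux`, `lists[1]` holds one-element lists, and `round_trip` equals `flat_rows`, which mirrors how outer rows may have different numbers of nested rows.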