# Embedding Atlas
Embedding Atlas is an interactive visualization tool for large embeddings, released as both a Python package (`embedding-atlas` on PyPI) and a JavaScript/TypeScript npm package (`embedding-atlas`). It allows users to visualize, cross-filter, and search millions of data points using high-dimensional embeddings projected to 2D via UMAP. Core capabilities include automatic data clustering and labeling, kernel density estimation with density contours, order-independent transparency rendering, real-time nearest-neighbor search, and multi-coordinated metadata views. The tool supports text, image, and audio modalities and provides multiple backends for computing embeddings (SentenceTransformers, HuggingFace Transformers, or any LiteLLM-compatible API).
The project is structured as a monorepo with two main packages: a Python backend (`packages/backend`) that exposes a CLI, a Jupyter/notebook widget, and a Streamlit component; and a JavaScript/TypeScript `embedding-atlas` npm package that exports UI components (`EmbeddingAtlas`, `EmbeddingView`, `EmbeddingViewMosaic`) along with WebAssembly-powered algorithms for in-browser UMAP computation and density-based clustering. All three Python integration modes (CLI, notebook widget, Streamlit) share a common `compute_projection` pipeline and `EmbeddingAtlasOptions` configuration schema, making it straightforward to move from exploratory notebook use to production Streamlit dashboards or standalone exported web applications.
---
## CLI: `embedding-atlas` command
The command-line tool loads a dataset (Parquet, JSONL, CSV, or Hugging Face), optionally computes embeddings and a UMAP projection, then serves an interactive web UI at `http://localhost:5055`.
```bash
# Install
pip install embedding-atlas
# Quickstart: auto-detect text column interactively
embedding-atlas dataset.parquet
# Specify text column and use a custom sentence-transformers model
embedding-atlas dataset.parquet --text description --model all-mpnet-base-v2
# Load from Hugging Face with a specific split
embedding-atlas james-burton/wine_reviews --split validation --text description
# Use pre-computed 2D projections and neighbor graph
embedding-atlas dataset.parquet --x projection_x --y projection_y --neighbors neighbors
# Use pre-computed embedding vectors (skips text embedding, runs UMAP only)
embedding-atlas dataset.parquet --vector embedding_vectors
# Use an API-based embedder via LiteLLM (e.g. OpenAI)
embedding-atlas dataset.parquet --text description \
--embedder litellm \
--model openai/text-embedding-3-small \
--api-key sk-xxx \
--batch-size 1024
# Use a locally running Ollama model
embedding-atlas dataset.parquet --text description \
--embedder litellm \
--model ollama/nomic-embed-text \
--api-base http://localhost:11434 \
--max-concurrency 2
# Control UMAP parameters for reproducibility
embedding-atlas dataset.parquet --text description \
--umap-metric cosine \
--umap-n-neighbors 15 \
--umap-min-dist 0.1 \
--umap-random-state 42
# Export as a standalone offline web app (ZIP or folder)
embedding-atlas dataset.parquet --text description --export-application output.zip
# Apply a DuckDB SQL query before visualization (filter or transform data)
embedding-atlas dataset.parquet --query "SELECT * FROM data WHERE country = 'US'" --text description
# Sample 50,000 rows from a large dataset
embedding-atlas large_dataset.parquet --text description --sample 50000
# Enable MCP (Model Context Protocol) server for AI agent integration
embedding-atlas dataset.parquet --text description --mcp
# Expose server on all interfaces with CORS enabled
embedding-atlas dataset.parquet --text description --host 0.0.0.0 --cors
# Run with uv (no pip install needed)
uvx embedding-atlas dataset.parquet
```
---
## `compute_projection` — Compute embeddings and UMAP projection
Synchronous function that embeds a DataFrame column (text, image, audio, or pre-computed vectors) and projects the result to 2D using UMAP, adding `projection_x`, `projection_y`, and a `neighbors` column to the DataFrame. Cannot be called from within a running async event loop (e.g. inside Jupyter); use `async_compute_projection` in that case.
```python
from embedding_atlas.projection import compute_projection
import pandas as pd
df = pd.read_parquet("wine_reviews.parquet")
# Text: default SentenceTransformers model (all-MiniLM-L6-v2)
df = compute_projection(
df,
inputs="description", # column to embed
modality="text", # 'text' | 'image' | 'audio' | 'vector' | 'auto'
x="projection_x", # output X column
y="projection_y", # output Y column
neighbors="neighbors", # output neighbors column (set None to skip)
umap_args={"metric": "cosine", "n_neighbors": 15, "random_state": 42},
)
# Vector: skip embedding, just run UMAP on pre-computed vectors
df = compute_projection(
df,
inputs="embedding_vectors",
modality="vector",
x="projection_x",
y="projection_y",
)
# Image: HuggingFace Transformers pipeline (google/vit-base-patch16-224 by default)
df = compute_projection(
df,
inputs="image_bytes", # column with bytes or {"bytes": ...} dicts
modality="image",
x="projection_x",
y="projection_y",
model="google/vit-base-patch16-224",
)
# LiteLLM API (OpenAI text-embedding-3-small)
df = compute_projection(
df,
inputs="description",
modality="text",
embedder="litellm",
model="openai/text-embedding-3-small",
embedder_args={"api_key": "sk-xxx"},
batch_size=1024,
x="projection_x",
y="projection_y",
)
# Custom async embedder function
async def my_embedder(batch, *, model, embedder_args):
# batch: list of strings (for text) or list of {"bytes": bytes} (for image/audio)
import numpy as np
return np.random.rand(len(batch), 384).astype("float32")
df = compute_projection(
df,
inputs="description",
modality="text",
embedder=my_embedder,
x="projection_x",
y="projection_y",
)
print(df[["description", "projection_x", "projection_y", "neighbors"]].head())
# Expected output columns: projection_x (float64), projection_y (float64),
# neighbors (list of dicts: {"ids": [...], "distances": [...]})
```
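The `neighbors` column format documented above can be consumed directly for nearest-neighbor lookups. A minimal sketch on a toy frame (column names and data are illustrative, not taken from a real run):

```python
import pandas as pd

# Toy DataFrame mimicking compute_projection output: each `neighbors` entry
# is a dict with parallel "ids" and "distances" lists, where ids are
# positional row indices into the same frame.
df = pd.DataFrame({
    "description": ["oaky chardonnay", "dry riesling", "jammy zinfandel"],
    "neighbors": [
        {"ids": [1, 2], "distances": [0.12, 0.45]},
        {"ids": [0, 2], "distances": [0.12, 0.50]},
        {"ids": [0, 1], "distances": [0.45, 0.50]},
    ],
})

def nearest_texts(df, row, k=2, text_col="description"):
    """Return the texts of the k nearest neighbors of `row`."""
    nn = df.iloc[row]["neighbors"]
    return df.iloc[nn["ids"][:k]][text_col].tolist()

print(nearest_texts(df, 0))  # ['dry riesling', 'jammy zinfandel']
```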
---
## `async_compute_projection` — Async version for Jupyter notebooks
Async counterpart to `compute_projection`. Accepts the same arguments and must be awaited. Use this inside Jupyter notebooks, Marimo, or any async context where `asyncio.run()` would raise `RuntimeError: This event loop is already running`.
```python
from embedding_atlas.projection import async_compute_projection
import pandas as pd
from datasets import load_dataset
# Load dataset
ds = load_dataset("james-burton/wine_reviews", split="validation")
df = pd.DataFrame(ds)
# Default: SentenceTransformers for text
df = await async_compute_projection(
df,
inputs="description",
modality="text",
x="projection_x",
y="projection_y",
neighbors="neighbors",
)
# Ollama local server via LiteLLM (limit concurrency for local servers)
df = await async_compute_projection(
df,
inputs="description",
modality="text",
embedder="litellm",
model="ollama/nomic-embed-text",
embedder_args={"api_base": "http://localhost:11434"},
batch_size=512,
max_concurrency=2,
x="projection_x",
y="projection_y",
neighbors="neighbors",
)
print(df.shape)   # same number of rows as the input, with 3 added columns
print(df.dtypes)  # projection_x: float64, projection_y: float64, neighbors: object
```
---
## `EmbeddingAtlasWidget` — Jupyter / AnyWidget notebook widget
An interactive Embedding Atlas widget for use in Jupyter, Marimo, Colab, and VSCode notebooks. Backed by [AnyWidget](https://anywidget.dev). Registers the DataFrame as a DuckDB in-memory table and exposes a `.selection()` method to retrieve the currently filtered rows.
```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from embedding_atlas.projection import async_compute_projection
import pandas as pd
from datasets import load_dataset
ds = load_dataset("james-burton/wine_reviews", split="validation")
df = pd.DataFrame(ds)
# Compute projection (async in notebooks)
df = await async_compute_projection(
df, inputs="description", modality="text",
x="projection_x", y="projection_y", neighbors="neighbors",
)
# Basic widget: show table and charts (no embedding view)
EmbeddingAtlasWidget(df)
# Full widget with embedding view, text, and nearest neighbors
widget = EmbeddingAtlasWidget(
df,
text="description", # column to show in tooltips and search
x="projection_x", # X coordinate for embedding scatter plot
y="projection_y", # Y coordinate for embedding scatter plot
neighbors="neighbors", # pre-computed KNN for nearest-neighbor lookup
point_size=3.0, # override auto point size
labels="automatic", # auto-generate cluster labels using TF-IDF
stop_words=["the", "a"], # words to exclude from label generation
show_table=True, # show data table panel on open
show_charts=True, # show charts panel on open
show_embedding=True, # show embedding view panel on open
)
widget # display the widget
# Retrieve the current user selection as a DataFrame
selected_df = widget.selection(format="dataframe") # or format="arrow"
print(selected_df.shape)
# Custom labels instead of auto-generated ones
custom_labels = [
{"x": 1.5, "y": 2.3, "text": "Fruity Wines", "level": 1, "priority": 10},
{"x": -0.5, "y": 3.1, "text": "Dry Reds", "level": 2, "priority": 5},
]
widget2 = EmbeddingAtlasWidget(
df, x="projection_x", y="projection_y", labels=custom_labels
)
```
---
## `embedding_atlas` — Streamlit component
Renders an interactive Embedding Atlas component inside a Streamlit app. Returns a `dict` containing a `predicate` SQL string representing the user's current cross-filter selection, which can be applied with DuckDB to filter the source DataFrame.
```python
import duckdb
import pandas as pd
import streamlit as st
from datasets import load_dataset
from embedding_atlas.projection import compute_projection
from embedding_atlas.streamlit import embedding_atlas
st.set_page_config(layout="wide")
st.title("Embedding Atlas + Streamlit")
@st.cache_data
def load_data():
ds = load_dataset("james-burton/wine_reviews", split="validation")
df = pd.DataFrame(ds)
return compute_projection(
df, inputs="description", modality="text",
x="projection_x", y="projection_y", neighbors="neighbors",
)
df = load_data()
# Render the Embedding Atlas component; returns the current selection state
value = embedding_atlas(
df,
text="description",
x="projection_x",
y="projection_y",
neighbors="neighbors",
show_table=True,
show_charts=True,
key="embedding_atlas_widget",
)
# Use the SQL predicate to filter the DataFrame with DuckDB
st.subheader("Selected rows")
predicate = value.get("predicate")
if predicate:
subset = duckdb.query_df(df, "dataframe", f"SELECT * FROM dataframe WHERE {predicate}")
st.dataframe(subset)
else:
st.write("No selection — interact with the widget above.")
# Without projection (table + charts only mode)
value2 = embedding_atlas(df, key="charts_only")
```
---
## `pagerank` — PageRank from a graph edge list
Computes PageRank scores from a weighted or unweighted edge list using PyTorch sparse matrix power iteration. Dangling nodes (nodes with no outgoing edges) are supported, and convergence is measured by the L1 norm of successive iterates.
```python
from embedding_atlas.pagerank import pagerank, knn_to_edges, compute_pagerank_column
import numpy as np
import pandas as pd
# --- Directly from an edge list ---
edges = [
(0, 1, 0.5), # (source, target, weight)
(0, 2, 1.0),
(1, 2, 0.8),
(2, 0, 1.0),
]
scores = pagerank(edges, n=3, damping=0.85, max_iterations=100, tolerance=1e-9)
print(scores) # array([0.32..., 0.21..., 0.46...]) — higher = more central
# --- From KNN arrays (e.g., output of compute_projection) ---
knn_indices = np.array([[1, 2], [0, 2], [0, 1]])
knn_distances = np.array([[0.1, 0.2], [0.1, 0.3], [0.2, 0.3]])
# Convert raw KNN distances → UMAP membership-strength edge weights
weighted_edges = knn_to_edges(knn_indices, knn_distances, local_connectivity=1.0)
scores = pagerank(weighted_edges, n=3)
# --- From a DataFrame with a neighbors column ---
# The 'neighbors' column must contain dicts: {"ids": [...], "distances": [...]}
df = pd.read_parquet("dataset_with_neighbors.parquet")
df["pagerank"] = compute_pagerank_column(df, neighbors="neighbors", damping=0.85)
print(df[["pagerank"]].describe())
# Run from CLI to add pagerank column to a parquet file:
# python -m embedding_atlas.pagerank --in dataset.parquet --out dataset_ranked.parquet
```
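For reference, the damped power iteration that PageRank performs can be sketched in dense NumPy (the actual implementation uses PyTorch sparse matrices; this is an illustration of the algorithm, so exact scores may differ from the library's output):

```python
import numpy as np

def pagerank_dense(edges, n, damping=0.85, max_iterations=100, tolerance=1e-9):
    # Build a column-stochastic transition matrix from weighted edges.
    M = np.zeros((n, n))
    for src, dst, w in edges:
        M[dst, src] += w
    out = M.sum(axis=0)
    dangling = out == 0          # nodes with no outgoing edges
    M[:, ~dangling] /= out[~dangling]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iterations):
        # Dangling nodes redistribute their rank mass uniformly.
        r_new = damping * (M @ r + r[dangling].sum() / n) + (1 - damping) / n
        if np.abs(r_new - r).sum() < tolerance:  # L1-norm convergence test
            return r_new
        r = r_new
    return r

edges = [(0, 1, 0.5), (0, 2, 1.0), (1, 2, 0.8), (2, 0, 1.0)]
scores = pagerank_dense(edges, n=3)
print(scores.sum())  # ≈ 1.0; node 2 receives the highest score
```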
---
## `EmbeddingView` — Standalone JavaScript scatter plot component
A WebGPU/WebGL2 scatter plot component for rendering up to a few million embedding points directly from typed arrays. Available as vanilla JS, React, and Svelte wrappers. Supports lasso selection, tooltips, custom overlays, and theme configuration.
```js
// Import the wrapper matching your framework:
import { EmbeddingView } from "embedding-atlas";            // vanilla JS
// import { EmbeddingView } from "embedding-atlas/react";   // React
// import { EmbeddingView } from "embedding-atlas/svelte";  // Svelte
// --- React example ---
import { useState } from "react";
import { EmbeddingView } from "embedding-atlas/react";
function App({ xData, yData, categoryData }) {
  const [tooltip, setTooltip] = useState(null);
  return (
    <EmbeddingView
      data={{ x: xData, y: yData, category: categoryData }}
      tooltip={tooltip}
      onTooltip={setTooltip}
    />
  );
}
// --- Vanilla JS example ---
import { EmbeddingView } from "embedding-atlas";
const component = new EmbeddingView(document.getElementById("container"), {
data: { x: new Float32Array(xData), y: new Float32Array(yData) },
onTooltip: (value) => console.log("hovered point:", value),
});
// Update props after creation
component.update({ data: { x: newX, y: newY } });
// Cleanup
component.destroy();
```
---
## `EmbeddingViewMosaic` — Mosaic-connected scatter plot component
Variant of `EmbeddingView` that reads data from a [Mosaic](https://idl.uw.edu/mosaic/) coordinator table rather than typed arrays. Enables cross-filtering and linked views with other Mosaic-connected charts.
```js
import { EmbeddingViewMosaic } from "embedding-atlas/react";
import { Coordinator } from "@uwdata/mosaic-core";
// Assumes a Mosaic coordinator with a "data_table" table loaded
const coordinator = new Coordinator(/* DuckDB connector */);
// React
<EmbeddingViewMosaic
  coordinator={coordinator}
  table="data_table"
  x="x_column"
  y="y_column"
  onTooltip={(v) => console.log(v)}
  config={{ pointSize: 3 }}
  theme={{ light: { clusterLabelColor: "#333" } }}
/>
// Vanilla JS
import { EmbeddingViewMosaic } from "embedding-atlas";
const component = new EmbeddingViewMosaic(document.getElementById("container"), {
coordinator,
table: "data_table",
x: "x_column",
y: "y_column",
onTooltip: (value) => console.log(value),
});
component.update({ filter: newBrush }); // newBrush: an updated Mosaic filter (illustrative)
component.destroy();
```
---
## `EmbeddingAtlas` — Full frontend UI component (JavaScript)
The complete Embedding Atlas UI as a JavaScript component. Integrates embedding scatter plot, metadata charts, text search, and tabular data view into a single coordinated interface backed by a Mosaic coordinator.
```js
import { useState } from "react";
import { EmbeddingAtlas } from "embedding-atlas/react";
import { Coordinator } from "@uwdata/mosaic-core";
const coordinator = new Coordinator(/* DuckDB connection */);
// React
function App() {
  const [state, setState] = useState(null);
  return (
    <EmbeddingAtlas
      coordinator={coordinator}
      data={{ table: "data_table", id: "id", projection: { x: "x", y: "y" }, text: "text" }}
      initialState={state}
      onStateChange={setState}
    />
  );
}
// Vanilla JS
import { EmbeddingAtlas } from "embedding-atlas";
const component = new EmbeddingAtlas(document.getElementById("app"), {
coordinator,
data: { table: "data_table", id: "id", projection: { x: "x", y: "y" }, text: "text" },
});
component.update({ colorScheme: "dark" });
component.destroy();
```
---
## `createUMAP` — In-browser UMAP (WebAssembly)
WebAssembly implementation of UMAP (ported from umap-learn/pynndescent to Rust) for running dimensionality reduction entirely in the browser without a Python server.
```js
import { createUMAP } from "embedding-atlas";
const count = 2000;
const inputDim = 128;
const outputDim = 2;
// Float32Array of shape [count * inputDim]
const data = new Float32Array(count * inputDim);
// ... populate data with your high-dimensional vectors
const umap = await createUMAP(count, inputDim, outputDim, data, {
metric: "cosine", // distance metric
});
// Run to completion
await umap.run();
// Retrieve 2D coordinates: Float32Array of shape [count * outputDim]
const embedding = umap.embedding();
console.log("First point:", embedding[0], embedding[1]);
// Free WebAssembly memory
umap.destroy();
```
---
## `createNNDescent` — Approximate nearest neighbor search (WebAssembly)
WebAssembly implementation of the NNDescent algorithm for approximate nearest neighbor (ANN) search in the browser.
```js
import { createNNDescent } from "embedding-atlas";
const count = 2000;
const inputDim = 128;
const k = 15; // number of neighbors
const data = new Float32Array(count * inputDim);
// ... populate with your vectors
const index = await createNNDescent(count, inputDim, data, {
metric: "cosine",
});
// Query by vector
const query = new Float32Array(inputDim);
// ... populate query vector
const neighbors = index.queryByVector(query, k);
// neighbors: { indices: Uint32Array, distances: Float32Array }
index.destroy();
```
---
## `findClusters` — Density-based clustering (WebAssembly)
WebAssembly implementation of a density map clustering algorithm that identifies clusters in a 2D density grid. Used internally by Embedding Atlas to generate automatic cluster labels for the embedding view.
```js
import { findClusters } from "embedding-atlas";
const width = 512;
const height = 512;
// Float32Array of width * height density values (e.g., from a KDE output)
const densityMap = new Float32Array(width * height);
// ... populate from your rendering pipeline
const clusters = await findClusters(densityMap, width, height);
// clusters: Array of Cluster objects, each with:
// - x, y: center coordinates in density-map space
// - level: detail level (controls zoom threshold for label display)
// - priority: relative importance (higher = shown preferentially)
clusters.forEach((c) => {
console.log(`Cluster at (${c.x.toFixed(2)}, ${c.y.toFixed(2)}) level=${c.level}`);
});
```
---
## REST API endpoints (served by the Python backend)
When the CLI or `make_server` is used, a FastAPI server is started that exposes the following HTTP endpoints consumed by the frontend.
```bash
# Retrieve dataset as Parquet
curl http://localhost:5055/data/dataset.parquet --output dataset.parquet
# Retrieve metadata JSON (contains DuckDB mode, MCP config, props)
curl http://localhost:5055/data/metadata.json
# Response: {"props": {...}, "database": {"type": "rest"}, "mcp": {"type": "websocket"}}
# Execute a DuckDB SQL query (POST, returns Arrow IPC or JSON)
curl -X POST http://localhost:5055/data/query \
-H "Content-Type: application/json" \
-d '{"type": "arrow", "sql": "SELECT description, projection_x, projection_y FROM dataset LIMIT 10"}'
# GET variant (query parameter, URL-encoded JSON)
curl "http://localhost:5055/data/query?query=%7B%22type%22%3A%22json%22%2C%22sql%22%3A%22SELECT+COUNT(*)+FROM+dataset%22%7D"
# Export current selection as CSV/JSON/JSONL/Parquet
curl -X POST http://localhost:5055/data/selection \
-H "Content-Type: application/json" \
-d '{"format": "csv", "predicate": "country = '\''US'\''"}' \
--output selection.csv
# Download the entire visualization as a self-contained ZIP
curl http://localhost:5055/data/archive.zip --output archive.zip
# Cache read/write (used by the frontend to persist UI state)
curl -X POST http://localhost:5055/data/cache/mykey \
-H "Content-Type: application/json" \
-d '{"someKey": "someValue"}'
curl http://localhost:5055/data/cache/mykey
# Response: {"someKey": "someValue"}
# MCP endpoint (only available when --mcp flag is used)
curl -X POST http://localhost:5055/mcp \
-H "Content-Type: application/json" \
-d '{"method": "tools/list"}'
```
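The opaque URL-encoded string in the GET variant above is just a JSON object run through percent-encoding. It can be built programmatically, assuming only the documented `type` and `sql` fields:

```python
import json
from urllib.parse import quote_plus

# Build the `query` parameter for GET /data/query: a JSON object with the
# documented "type" and "sql" fields, percent-encoded into the URL.
query = {"type": "json", "sql": "SELECT COUNT(*) FROM dataset"}
encoded = quote_plus(json.dumps(query, separators=(",", ":")))
url = f"http://localhost:5055/data/query?query={encoded}"
print(url)
```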
---
Embedding Atlas is well-suited for three primary use cases: rapid exploratory data analysis of large text, image, or audio corpora via the CLI or Jupyter widget; embedding quality auditing and dataset curation workflows where researchers need to visually identify clusters, outliers, and duplicates across millions of data points; and production dashboards in Streamlit applications where interactive embedding-based cross-filtering is surfaced to end users. The `compute_projection` API with caching means repeated runs reuse expensive embedding computations, and the `--export-application` flag allows sharing visualizations as self-contained static web apps with no server dependency.
For integration, Embedding Atlas follows a layered pattern: compute projections once with `compute_projection` / `async_compute_projection` and persist the enriched DataFrame (including `projection_x`, `projection_y`, and `neighbors` columns) as Parquet for reuse across CLI, widget, and Streamlit contexts. The JavaScript npm package enables embedding Embedding Atlas components inside existing web applications using any framework (React, Svelte, or vanilla JS) backed by a Mosaic DuckDB coordinator, while the WebAssembly UMAP and NNDescent exports allow fully in-browser projection pipelines without a Python server. MCP support (`--mcp`) further integrates the tool into AI agent workflows where LLMs can issue SQL queries, create charts, and inspect the data schema programmatically.