Embedding Atlas
https://github.com/apple/embedding-atlas
Tokens: 21,881 · Snippets: 221 · Trust Score: 8.6 · Benchmark: 78.8 · Updated: 1 week ago
Context Summary (auto-generated)
# Embedding Atlas

Embedding Atlas is an interactive visualization tool for large embeddings, released as both a Python package (`embedding-atlas` on PyPI) and a JavaScript/TypeScript npm package (`embedding-atlas`). It allows users to visualize, cross-filter, and search millions of data points using high-dimensional embeddings projected to 2D via UMAP. Core capabilities include automatic data clustering and labeling, kernel density estimation with density contours, order-independent transparency rendering, real-time nearest-neighbor search, and multi-coordinated metadata views. The tool supports text, image, and audio modalities and provides multiple backends for computing embeddings (SentenceTransformers, HuggingFace Transformers, or any LiteLLM-compatible API).

The project is structured as a monorepo with several packages: a Python backend (`packages/backend`) that exposes a CLI, a Jupyter/notebook widget, and a Streamlit component; and a JavaScript `embedding-atlas` npm package that exports UI components (`EmbeddingAtlas`, `EmbeddingView`, `EmbeddingViewMosaic`) along with WebAssembly-powered algorithms for in-browser UMAP computation and density clustering. All three Python integration modes (CLI, notebook widget, Streamlit) share a common `compute_projection` pipeline and `EmbeddingAtlasOptions` configuration schema, making it straightforward to move from exploratory notebook usage to production Streamlit dashboards or standalone exported web applications.

---

## CLI: `embedding-atlas` command

The command line tool loads a dataset (Parquet, JSONL, CSV, or Hugging Face), optionally computes embeddings and a UMAP projection, then serves an interactive web UI at `http://localhost:5055`.
```bash
# Install
pip install embedding-atlas

# Quickstart: auto-detect text column interactively
embedding-atlas dataset.parquet

# Specify text column and use a custom sentence-transformers model
embedding-atlas dataset.parquet --text description --model all-mpnet-base-v2

# Load from Hugging Face with a specific split
embedding-atlas james-burton/wine_reviews --split validation --text description

# Use pre-computed 2D projections and neighbor graph
embedding-atlas dataset.parquet --x projection_x --y projection_y --neighbors neighbors

# Use pre-computed embedding vectors (skips text embedding, runs UMAP only)
embedding-atlas dataset.parquet --vector embedding_vectors

# Use an API-based embedder via LiteLLM (e.g. OpenAI)
embedding-atlas dataset.parquet --text description \
    --embedder litellm \
    --model openai/text-embedding-3-small \
    --api-key sk-xxx \
    --batch-size 1024

# Use a locally running Ollama model
embedding-atlas dataset.parquet --text description \
    --embedder litellm \
    --model ollama/nomic-embed-text \
    --api-base http://localhost:11434 \
    --max-concurrency 2

# Control UMAP parameters for reproducibility
embedding-atlas dataset.parquet --text description \
    --umap-metric cosine \
    --umap-n-neighbors 15 \
    --umap-min-dist 0.1 \
    --umap-random-state 42

# Export as a standalone offline web app (ZIP or folder)
embedding-atlas dataset.parquet --text description --export-application output.zip

# Apply a DuckDB SQL query before visualization (filter or transform data)
embedding-atlas dataset.parquet --query "SELECT * FROM data WHERE country = 'US'" --text description

# Sample 50,000 rows from a large dataset
embedding-atlas large_dataset.parquet --text description --sample 50000

# Enable MCP (Model Context Protocol) server for AI agent integration
embedding-atlas dataset.parquet --text description --mcp

# Expose server on all interfaces with CORS enabled
embedding-atlas dataset.parquet --text description --host 0.0.0.0 --cors

# Run with uv (no pip install needed)
uvx embedding-atlas dataset.parquet
```

---

## `compute_projection` — Compute embeddings and UMAP projection

Synchronous function that embeds a DataFrame column (text, image, audio, or pre-computed vectors) and projects the result to 2D using UMAP, adding `projection_x`, `projection_y`, and a `neighbors` column to the DataFrame. Cannot be called from within a running async event loop (e.g. inside Jupyter); use `async_compute_projection` in that case.

```python
from embedding_atlas.projection import compute_projection
import pandas as pd

df = pd.read_parquet("wine_reviews.parquet")

# Text: default SentenceTransformers model (all-MiniLM-L6-v2)
df = compute_projection(
    df,
    inputs="description",   # column to embed
    modality="text",        # 'text' | 'image' | 'audio' | 'vector' | 'auto'
    x="projection_x",       # output X column
    y="projection_y",       # output Y column
    neighbors="neighbors",  # output neighbors column (set None to skip)
    umap_args={"metric": "cosine", "n_neighbors": 15, "random_state": 42},
)

# Vector: skip embedding, just run UMAP on pre-computed vectors
df = compute_projection(
    df,
    inputs="embedding_vectors",
    modality="vector",
    x="projection_x",
    y="projection_y",
)

# Image: HuggingFace Transformers pipeline (google/vit-base-patch16-224 by default)
df = compute_projection(
    df,
    inputs="image_bytes",   # column with bytes or {"bytes": ...} dicts
    modality="image",
    x="projection_x",
    y="projection_y",
    model="google/vit-base-patch16-224",
)

# LiteLLM API (OpenAI text-embedding-3-small)
df = compute_projection(
    df,
    inputs="description",
    modality="text",
    embedder="litellm",
    model="openai/text-embedding-3-small",
    embedder_args={"api_key": "sk-xxx"},
    batch_size=1024,
    x="projection_x",
    y="projection_y",
)

# Custom async embedder function
async def my_embedder(batch, *, model, embedder_args):
    # batch: list of strings (for text) or list of {"bytes": bytes} (for image/audio)
    import numpy as np
    return np.random.rand(len(batch), 384).astype("float32")

df = compute_projection(
    df,
    inputs="description",
    modality="text",
    embedder=my_embedder,
    x="projection_x",
    y="projection_y",
)

print(df[["description", "projection_x", "projection_y", "neighbors"]].head())
# Expected output columns: projection_x (float64), projection_y (float64),
# neighbors (list of dicts: {"ids": [...], "distances": [...]})
```

---

## `async_compute_projection` — Async version for Jupyter notebooks

Async counterpart to `compute_projection`. Accepts the same arguments and must be awaited. Use this inside Jupyter notebooks, Marimo, or any async context where `asyncio.run()` would raise `RuntimeError: This event loop is already running`.

```python
from embedding_atlas.projection import async_compute_projection
import pandas as pd
from datasets import load_dataset

# Load dataset
ds = load_dataset("james-burton/wine_reviews", split="validation")
df = pd.DataFrame(ds)

# Default: SentenceTransformers for text
df = await async_compute_projection(
    df,
    inputs="description",
    modality="text",
    x="projection_x",
    y="projection_y",
    neighbors="neighbors",
)

# Ollama local server via LiteLLM (limit concurrency for local servers)
df = await async_compute_projection(
    df,
    inputs="description",
    modality="text",
    embedder="litellm",
    model="ollama/nomic-embed-text",
    embedder_args={"api_base": "http://localhost:11434"},
    batch_size=512,
    max_concurrency=2,
    x="projection_x",
    y="projection_y",
    neighbors="neighbors",
)

print(df.shape)   # Same shape as input + 3 new columns
print(df.dtypes)  # projection_x: float64, projection_y: float64, neighbors: object
```

---

## `EmbeddingAtlasWidget` — Jupyter / AnyWidget notebook widget

An interactive Embedding Atlas widget for use in Jupyter, Marimo, Colab, and VSCode notebooks. Backed by [AnyWidget](https://anywidget.dev). Registers the DataFrame as a DuckDB in-memory table and exposes a `.selection()` method to retrieve the currently filtered rows.
```python
from embedding_atlas.widget import EmbeddingAtlasWidget
from embedding_atlas.projection import async_compute_projection
import pandas as pd
from datasets import load_dataset

ds = load_dataset("james-burton/wine_reviews", split="validation")
df = pd.DataFrame(ds)

# Compute projection (async in notebooks)
df = await async_compute_projection(
    df,
    inputs="description",
    modality="text",
    x="projection_x",
    y="projection_y",
    neighbors="neighbors",
)

# Basic widget: show table and charts (no embedding view)
EmbeddingAtlasWidget(df)

# Full widget with embedding view, text, and nearest neighbors
widget = EmbeddingAtlasWidget(
    df,
    text="description",       # column to show in tooltips and search
    x="projection_x",         # X coordinate for embedding scatter plot
    y="projection_y",         # Y coordinate for embedding scatter plot
    neighbors="neighbors",    # pre-computed KNN for nearest-neighbor lookup
    point_size=3.0,           # override auto point size
    labels="automatic",       # auto-generate cluster labels using TF-IDF
    stop_words=["the", "a"],  # words to exclude from label generation
    show_table=True,          # show data table panel on open
    show_charts=True,         # show charts panel on open
    show_embedding=True,      # show embedding view panel on open
)
widget  # display the widget

# Retrieve the current user selection as a DataFrame
selected_df = widget.selection(format="dataframe")  # or format="arrow"
print(selected_df.shape)

# Custom labels instead of auto-generated ones
custom_labels = [
    {"x": 1.5, "y": 2.3, "text": "Fruity Wines", "level": 1, "priority": 10},
    {"x": -0.5, "y": 3.1, "text": "Dry Reds", "level": 2, "priority": 5},
]
widget2 = EmbeddingAtlasWidget(
    df, x="projection_x", y="projection_y", labels=custom_labels
)
```

---

## `embedding_atlas` — Streamlit component

Renders an interactive Embedding Atlas component inside a Streamlit app. Returns a `dict` containing a `predicate` SQL string representing the user's current cross-filter selection, which can be applied with DuckDB to filter the source DataFrame.
```python
import duckdb
import pandas as pd
import streamlit as st
from datasets import load_dataset

from embedding_atlas.projection import compute_projection
from embedding_atlas.streamlit import embedding_atlas

st.set_page_config(layout="wide")
st.title("Embedding Atlas + Streamlit")

@st.cache_data
def load_data():
    ds = load_dataset("james-burton/wine_reviews", split="validation")
    df = pd.DataFrame(ds)
    return compute_projection(
        df,
        inputs="description",
        modality="text",
        x="projection_x",
        y="projection_y",
        neighbors="neighbors",
    )

df = load_data()

# Render the Embedding Atlas component; returns the current selection state
value = embedding_atlas(
    df,
    text="description",
    x="projection_x",
    y="projection_y",
    neighbors="neighbors",
    show_table=True,
    show_charts=True,
    key="embedding_atlas_widget",
)

# Use the SQL predicate to filter the DataFrame with DuckDB
st.subheader("Selected rows")
predicate = value.get("predicate")
if predicate:
    subset = duckdb.query_df(df, "dataframe", f"SELECT * FROM dataframe WHERE {predicate}")
    st.dataframe(subset)
else:
    st.write("No selection — interact with the widget above.")

# Without projection (table + charts only mode)
value2 = embedding_atlas(df, key="charts_only")
```

---

## `pagerank` — PageRank from a graph edge list

Computes PageRank scores from a weighted or unweighted edge list using PyTorch sparse matrix power iteration. Supports dangling nodes (no outgoing edges) and converges using L1-norm tolerance.
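Conceptually this is standard PageRank; the following is a minimal NumPy power-iteration sketch over a toy edge list, not the package's implementation (which uses PyTorch sparse matrices), so exact scores will differ. It uses uniform teleportation and redistributes dangling-node mass uniformly:

```python
import numpy as np

# Toy weighted edge list: (source, target, weight)
edges = [(0, 1, 0.5), (0, 2, 1.0), (1, 2, 0.8), (2, 0, 1.0)]
n, damping = 3, 0.85

# Build a column-stochastic transition matrix M[target, source]
M = np.zeros((n, n))
for s, t, w in edges:
    M[t, s] += w
out = M.sum(axis=0)               # total outgoing weight per node
M[:, out > 0] /= out[out > 0]     # normalize columns with outgoing edges

# Power iteration with L1 convergence check
r = np.full(n, 1.0 / n)
for _ in range(100):
    dangling = r[out == 0].sum()  # rank mass stuck on dangling nodes
    r_new = damping * (M @ r + dangling / n) + (1 - damping) / n
    converged = np.abs(r_new - r).sum() < 1e-9
    r = r_new
    if converged:
        break

print(r)  # scores sum to 1; node 2 collects the most incoming weight here
```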
```python
from embedding_atlas.pagerank import pagerank, knn_to_edges, compute_pagerank_column
import numpy as np
import pandas as pd

# --- Directly from an edge list ---
edges = [
    (0, 1, 0.5),  # (source, target, weight)
    (0, 2, 1.0),
    (1, 2, 0.8),
    (2, 0, 1.0),
]
scores = pagerank(edges, n=3, damping=0.85, max_iterations=100, tolerance=1e-9)
print(scores)  # array([0.32..., 0.21..., 0.46...]) — higher = more central

# --- From KNN arrays (e.g., output of compute_projection) ---
knn_indices = np.array([[1, 2], [0, 2], [0, 1]])
knn_distances = np.array([[0.1, 0.2], [0.1, 0.3], [0.2, 0.3]])

# Convert raw KNN distances → UMAP membership-strength edge weights
weighted_edges = knn_to_edges(knn_indices, knn_distances, local_connectivity=1.0)
scores = pagerank(weighted_edges, n=3)

# --- From a DataFrame with a neighbors column ---
# The 'neighbors' column must contain dicts: {"ids": [...], "distances": [...]}
df = pd.read_parquet("dataset_with_neighbors.parquet")
df["pagerank"] = compute_pagerank_column(df, neighbors="neighbors", damping=0.85)
print(df[["pagerank"]].describe())

# Run from CLI to add pagerank column to a parquet file:
# python -m embedding_atlas.pagerank --in dataset.parquet --out dataset_ranked.parquet
```

---

## `EmbeddingView` — Standalone JavaScript scatter plot component

A WebGPU/WebGL2 scatter plot component for rendering up to a few million embedding points directly from typed arrays. Available as vanilla JS, React, and Svelte wrappers. Supports lasso selection, tooltips, custom overlays, and theme configuration.
```js
import { EmbeddingView } from "embedding-atlas";        // vanilla JS
import { EmbeddingView } from "embedding-atlas/react";  // React
import { EmbeddingView } from "embedding-atlas/svelte"; // Svelte

// --- React example ---
import { useState } from "react";
import { EmbeddingView } from "embedding-atlas/react";

function App({ xData, yData, categoryData }) {
  const [tooltip, setTooltip] = useState(null);
  return (
    <EmbeddingView
      data={{
        x: new Float32Array(xData),             // required: X coordinates
        y: new Float32Array(yData),             // required: Y coordinates
        category: new Uint8Array(categoryData), // optional: category index per point
      }}
      tooltip={tooltip}
      onTooltip={setTooltip}
      config={{
        pointSize: 4, // optional: override auto point size
      }}
      theme={{
        light: { clusterLabelColor: "black" },
        dark: { clusterLabelColor: "white" },
      }}
      customTooltip={{
        class: class CustomTooltip {
          constructor(target, props) { /* mount tooltip DOM here */ }
          update(props) { /* re-render with new props.tooltip */ }
          destroy() { /* cleanup */ }
        },
        props: { maxWidth: 300 },
      }}
      customOverlay={{
        class: class CustomOverlay {
          constructor(target, props) {
            // props.proxy.location(x, y) → pixel [px, py]
          }
          update(props) {}
          destroy() {}
        },
        props: {},
      }}
    />
  );
}

// --- Vanilla JS example ---
import { EmbeddingView } from "embedding-atlas";

const component = new EmbeddingView(document.getElementById("container"), {
  data: { x: new Float32Array(xData), y: new Float32Array(yData) },
  onTooltip: (value) => console.log("hovered point:", value),
});

// Update props after creation
component.update({ data: { x: newX, y: newY } });

// Cleanup
component.destroy();
```

---

## `EmbeddingViewMosaic` — Mosaic-connected scatter plot component

Variant of `EmbeddingView` that reads data from a [Mosaic](https://idl.uw.edu/mosaic/) coordinator table rather than typed arrays. Enables cross-filtering and linked views with other Mosaic-connected charts.
```js
import { EmbeddingViewMosaic } from "embedding-atlas/react";
import { createClient } from "@uwdata/mosaic-core";

// Assumes a Mosaic coordinator with a "data_table" table loaded
const coordinator = createClient(/* ... */);

// React
<EmbeddingViewMosaic
  coordinator={coordinator}
  table="data_table"
  x="x_column"
  y="y_column"
  category="category_column" // optional: color points by category
  text="text_column"         // optional: shown in tooltip
  identifier="id_column"     // optional: used for cross-filtering
  filter={selectionBrush}    // optional: Mosaic Selection for cross-filter input
  onTooltip={(v) => console.log(v)}
  config={{ pointSize: 3 }}
  theme={{ light: { clusterLabelColor: "#333" } }}
/>

// Vanilla JS
import { EmbeddingViewMosaic } from "embedding-atlas";

const component = new EmbeddingViewMosaic(document.getElementById("container"), {
  coordinator,
  table: "data_table",
  x: "x_column",
  y: "y_column",
  onTooltip: (value) => console.log(value),
});

component.update({ filter: newBrush });
component.destroy();
```

---

## `EmbeddingAtlas` — Full frontend UI component (JavaScript)

The complete Embedding Atlas UI as a JavaScript component. Integrates embedding scatter plot, metadata charts, text search, and tabular data view into a single coordinated interface backed by a Mosaic coordinator.
```js
import { useState } from "react";
import { EmbeddingAtlas } from "embedding-atlas/react";
import { createClient } from "@uwdata/mosaic-core";

const coordinator = createClient(/* DuckDB connection */);

// React
function App() {
  const [state, setState] = useState(null);
  return (
    <EmbeddingAtlas
      coordinator={coordinator}
      data={{
        table: "data_table",
        id: "id_column",
        projection: { x: "x_column", y: "y_column" },
        text: "text_column",
        image: "image_column",      // optional
        neighbors: "neighbors_col", // optional: KNN for nearest-neighbor search
        importance: "pagerank_col", // optional: point importance scores
      }}
      embeddingViewLabels="automatic" // or "disabled" or custom label array
      embeddingViewConfig={{ pointSize: 4 }}
      colorScheme="light" // or "dark"
      chartTheme={{
        light: { markColor: "steelblue" },
        dark: { markColor: "orange" },
      }}
      initialState={savedState} // restore a previously saved UI state
      onStateChange={setState}  // callback when UI state changes
    />
  );
}

// Vanilla JS
import { EmbeddingAtlas } from "embedding-atlas";

const component = new EmbeddingAtlas(document.getElementById("app"), {
  coordinator,
  data: { table: "data_table", id: "id", projection: { x: "x", y: "y" }, text: "text" },
});

component.update({ colorScheme: "dark" });
component.destroy();
```

---

## `createUMAP` — In-browser UMAP (WebAssembly)

WebAssembly implementation of UMAP (ported from umap-learn/pynndescent to Rust) for running dimensionality reduction entirely in the browser without a Python server.

```js
import { createUMAP } from "embedding-atlas";

const count = 2000;
const inputDim = 128;
const outputDim = 2;

// Float32Array of shape [count * inputDim]
const data = new Float32Array(count * inputDim);
// ... populate data with your high-dimensional vectors

const umap = await createUMAP(count, inputDim, outputDim, data, {
  metric: "cosine", // distance metric
});

// Run to completion
await umap.run();

// Retrieve 2D coordinates: Float32Array of shape [count * outputDim]
const embedding = umap.embedding();
console.log("First point:", embedding[0], embedding[1]);

// Free WebAssembly memory
umap.destroy();
```

---

## `createNNDescent` — Approximate nearest neighbor search (WebAssembly)

WebAssembly implementation of the NNDescent algorithm for approximate nearest neighbor (ANN) search in the browser.

```js
import { createNNDescent } from "embedding-atlas";

const count = 2000;
const inputDim = 128;
const k = 15; // number of neighbors

const data = new Float32Array(count * inputDim);
// ... populate with your vectors

const index = await createNNDescent(count, inputDim, data, {
  metric: "cosine",
});

// Query by vector
const query = new Float32Array(inputDim);
// ... populate query vector
const neighbors = index.queryByVector(query, k);
// neighbors: { indices: Uint32Array, distances: Float32Array }

index.destroy();
```

---

## `findClusters` — Density-based clustering (WebAssembly)

WebAssembly implementation of a density map clustering algorithm that identifies clusters in a 2D density grid. Used internally by Embedding Atlas to generate automatic cluster labels for the embedding view.

```js
import { findClusters } from "embedding-atlas";

const width = 512;
const height = 512;

// Float32Array of width * height density values (e.g., from a KDE output)
const densityMap = new Float32Array(width * height);
// ... populate from your rendering pipeline

const clusters = await findClusters(densityMap, width, height);
// clusters: Array of Cluster objects, each with:
// - x, y: center coordinates in density-map space
// - level: detail level (controls zoom threshold for label display)
// - priority: relative importance (higher = shown preferentially)

clusters.forEach((c) => {
  console.log(`Cluster at (${c.x.toFixed(2)}, ${c.y.toFixed(2)}) level=${c.level}`);
});
```

---

## REST API endpoints (served by the Python backend)

When the CLI or `make_server` is used, a FastAPI server is started that exposes the following HTTP endpoints consumed by the frontend.

```bash
# Retrieve dataset as Parquet
curl http://localhost:5055/data/dataset.parquet --output dataset.parquet

# Retrieve metadata JSON (contains DuckDB mode, MCP config, props)
curl http://localhost:5055/data/metadata.json
# Response: {"props": {...}, "database": {"type": "rest"}, "mcp": {"type": "websocket"}}

# Execute a DuckDB SQL query (POST, returns Arrow IPC or JSON)
curl -X POST http://localhost:5055/data/query \
  -H "Content-Type: application/json" \
  -d '{"type": "arrow", "sql": "SELECT description, projection_x, projection_y FROM dataset LIMIT 10"}'

# GET variant (query parameter, URL-encoded JSON)
curl "http://localhost:5055/data/query?query=%7B%22type%22%3A%22json%22%2C%22sql%22%3A%22SELECT+COUNT(*)+FROM+dataset%22%7D"

# Export current selection as CSV/JSON/JSONL/Parquet
curl -X POST http://localhost:5055/data/selection \
  -H "Content-Type: application/json" \
  -d '{"format": "csv", "predicate": "country = '\''US'\''"}' \
  --output selection.csv

# Download the entire visualization as a self-contained ZIP
curl http://localhost:5055/data/archive.zip --output archive.zip

# Cache read/write (used by the frontend to persist UI state)
curl -X POST http://localhost:5055/data/cache/mykey \
  -H "Content-Type: application/json" \
  -d '{"someKey": "someValue"}'
curl http://localhost:5055/data/cache/mykey
# Response: {"someKey": "someValue"}

# MCP endpoint (only available when --mcp flag is used)
curl -X POST http://localhost:5055/mcp \
  -H "Content-Type: application/json" \
  -d '{"method": "tools/list"}'
```

---

Embedding Atlas is well-suited for three primary use cases: rapid exploratory data analysis of large text, image, or audio corpora via the CLI or Jupyter widget; embedding quality auditing and dataset curation workflows where researchers need to visually identify clusters, outliers, and duplicates across millions of data points; and production dashboards in Streamlit applications where interactive embedding-based cross-filtering is surfaced to end users. The `compute_projection` API with caching means repeated runs reuse expensive embedding computations, and the `--export-application` flag allows sharing visualizations as self-contained static web apps with no server dependency.

For integration, Embedding Atlas follows a layered pattern: compute projections once with `compute_projection` / `async_compute_projection` and persist the enriched DataFrame (including `projection_x`, `projection_y`, and `neighbors` columns) as Parquet for reuse across CLI, widget, and Streamlit contexts. The JavaScript npm package enables embedding Embedding Atlas components inside existing web applications using any framework (React, Svelte, or vanilla JS) backed by a Mosaic DuckDB coordinator, while the WebAssembly UMAP and NNDescent exports allow fully in-browser projection pipelines without a Python server. MCP support (`--mcp`) further integrates the tool into AI agent workflows where LLMs can issue SQL queries, create charts, and inspect the data schema programmatically.
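For programmatic clients of the REST API above, the GET variant of `/data/query` takes a URL-encoded JSON object as its `query` parameter. A minimal stdlib sketch of building that URL (host, port, and endpoint taken from the examples above; the `dataset` table name is illustrative):

```python
import json
import urllib.parse

# JSON payload matching the GET-variant curl example above
query = {"type": "json", "sql": "SELECT COUNT(*) FROM dataset"}

# Percent-encode the compact JSON for use as a query-string value
encoded = urllib.parse.quote(json.dumps(query, separators=(",", ":")), safe="")
url = f"http://localhost:5055/data/query?query={encoded}"
print(url)
```

Note that `urllib.parse.quote` encodes spaces as `%20` rather than the `+` shown in the curl example; servers accept either form in a query string.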