### Install Sphinx and Sphinx Book Theme

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/README.md

Install Sphinx for documentation generation and the Sphinx Book Theme for styling. Use conda for Sphinx and pip for the theme.

```bash
conda install sphinx
pip install sphinx-book-theme
```

--------------------------------

### Install Hugging Face Datasets

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Install the Hugging Face datasets library using pip.

```bash
pip install datasets
```

--------------------------------

### Install Ubuntu Dependencies with apt

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs optional dependencies for handling audio, video, images, and compressed streams on Ubuntu using apt. The AWS SDK is not covered by this command and must be built manually.

```bash
sudo apt install libsndfile1-dev libsamplerate0-dev ffmpeg libjpeg-turbo8-dev \
    zlib1g-dev libbz2-dev liblzma-dev
```

--------------------------------

### Install Python Build Tools

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs pybind11 and CMake, which are required for building the Python bindings from source.

```bash
pip install "pybind11[global]" cmake  # quoted so shells such as zsh do not expand the brackets
```

--------------------------------

### Python Sample Examples

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Demonstrates the creation of valid samples in Python, including scalar casting and string encoding.

```python
import numpy as np

# This is a valid sample
sample = {"hello": np.array(0)}

# So is this because scalars are cast to scalar arrays
sample = {"scalar": 42}

# Strings can also be used, however, they will be represented in unicode.
sample = {"key": "value"}

# Most likely you would want to write it as bytes in the sample as follows
sample = {"key": b"path/to/my/file"}
sample = {"key": "value".encode("ascii")}
```
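
The difference between the two string forms can be checked without MLX Data. The following is a minimal sketch in plain Python: `encode_strings` is a hypothetical helper (not part of the library) that converts every `str` value in a sample to bytes, and the UTF-8 default is an assumption rather than something the documentation prescribes.

```python
def encode_strings(sample, encoding="utf-8"):
    """Return a copy of `sample` with every str value encoded to bytes.

    Hypothetical helper for illustration; not part of mlx.data.
    """
    return {
        key: value.encode(encoding) if isinstance(value, str) else value
        for key, value in sample.items()
    }


sample = {"file": "path/to/my/file", "label": 3}
encoded = encode_strings(sample)
print(encoded)
```

Non-string values pass through unchanged, so the helper can be applied to a whole list of samples before building a buffer.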

--------------------------------

### Install MLX Data Python Package from Source

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs the MLX Data Python package locally from source. Use 'pip install -e .' for an editable install.

```bash
cd /path/to/mlx/data
pip install .  # or pip install -e . for an editable install
```

--------------------------------

### Install mlx-data and dependencies

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Install the mlx-data library using pip. Optional dependencies for audio, video, and S3 on macOS and Ubuntu are listed, along with instructions to build the C++ standalone library.

```bash
pip install mlx-data

# macOS: optional dependencies for audio, video, S3
brew install libsndfile libsamplerate ffmpeg jpeg-turbo zlib bzip2 xz aws-sdk-cpp

# Ubuntu: optional dependencies
sudo apt install libsndfile1-dev libsamplerate0-dev ffmpeg libjpeg-turbo8-dev \
    zlib1g-dev libbz2-dev liblzma-dev

# Build C++ standalone library
mkdir build && cd build
cmake ..
make -j
sudo make install
```

--------------------------------

### Install MLX Data from PyPI

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Use this command to install the MLX Data package and its essential dependencies for reading various data formats.

```bash
pip install mlx-data
```

--------------------------------

### Install macOS Dependencies with Homebrew

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs optional dependencies for handling audio, video, images, and S3 access on macOS using Homebrew.

```bash
brew install libsndfile libsamplerate ffmpeg jpeg-turbo zlib bzip2 xz aws-sdk-cpp
```

--------------------------------

### Download and Run WikiText Benchmark

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/wikitext/README.md

Use this bash script to download the necessary data, tokenizer, and execute the benchmarks. Ensure you have wget and unzip installed.

```bash
bash run_wikitext.sh
```

--------------------------------

### Key Transform Example

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Illustrates how to use `key_transform` to apply a function to a specific key (e.g., 'image') within the stream's data.

```python
.key_transform("image", ...)
```

--------------------------------

### Load Hugging Face Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Download and inspect a dataset from Hugging Face. This example loads the MNIST dataset.

```python
from datasets import load_dataset

ds = load_dataset("ylecun/mnist")
print(ds['train'])
```

--------------------------------

### Format C++ Code with clang-format

Source: https://github.com/ml-explore/mlx-data/blob/main/CONTRIBUTING.md

Use this command to format a C++ file in-place. Ensure clang-format is installed.

```bash
clang-format -i file.cpp
```

--------------------------------

### Install Target

Source: https://github.com/ml-explore/mlx-data/blob/main/python/src/CMakeLists.txt

Installs the compiled '_c' target to the 'mlx/data' directory within the Python package structure. This makes the C++ extension available for import in Python.

```cmake
install(TARGETS _c DESTINATION mlx/data)
```

--------------------------------

### Create Buffer from Vector in Python

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Illustrates creating a Buffer from a list of samples using `buffer_from_vector`. This is useful for in-memory datasets or when starting a data pipeline.

```python
from pathlib import Path

import mlx.data as dx

def files_and_classes(root: Path):
    """Load the files and classes from an image dataset that contains one folder per class."""
    images = list(root.rglob("*.jpg"))
    categories = [p.relative_to(root).parent.name for p in images]
    category_set = set(categories)
    category_map = {c: i for i, c in enumerate(sorted(category_set))}

    return [
        {
            "image": str(p.relative_to(root)).encode("ascii"),
            "category": c,
            "label": category_map[c]
        }
        for c, p in zip(categories, images)
    ]

dset = dx.buffer_from_vector(files_and_classes(Path("path/to/dataset")))
# We can now apply transformations to the dataset
```
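
To see what the list of samples looks like before it is handed to `buffer_from_vector`, the sketch below builds a throwaway directory tree and runs a copy of the helper on it. It uses only the standard library; the folder and file names are made up for illustration, and `sorted` and `as_posix` are added here (assumptions, not in the original) so the result is reproducible across platforms.

```python
import tempfile
from pathlib import Path


def files_and_classes(root: Path):
    """Map each image under `root` to its folder name and an integer label."""
    images = sorted(root.rglob("*.jpg"))
    categories = [p.relative_to(root).parent.name for p in images]
    category_map = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [
        {
            "image": p.relative_to(root).as_posix().encode("ascii"),
            "category": c,
            "label": category_map[c],
        }
        for c, p in zip(categories, images)
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical dataset layout: one folder per class.
    for rel in ("cat/a.jpg", "cat/b.jpg", "dog/c.jpg"):
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()
    samples = files_and_classes(root)

for s in samples:
    print(s)
```

Labels are assigned by sorting the folder names, so the same directory layout always yields the same label mapping.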

--------------------------------

### Load and Inspect MNIST Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Loads the MNIST dataset and prints its buffer information. This is a starting point for using the dataset.

```python
import mlx.data as dx
from mlx.data.datasets import load_mnist, load_wikitext_lines
from mlx.data.tokenizer_helpers import read_trie_from_vocab

mnist = load_mnist()
print(mnist)
# Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz 9.5MiB (15.1MiB/s)
# Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz 1.6MiB (12.9MiB/s)
# Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz 32.0KiB (17.1MiB/s)
# Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz 8.0KiB (26.6MiB/s)
# Buffer(size=60000, keys={'label', 'image'})
```

--------------------------------

### Format Python Code with black

Source: https://github.com/ml-explore/mlx-data/blob/main/CONTRIBUTING.md

Use this command to format a Python file in-place. Ensure black is installed.

```bash
black file.py
```

--------------------------------

### Download LibriSpeech Data and Tokenizer

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/librispeech/README.md

Use this bash script to download the LibriSpeech dataset, the required tokenizer model, and run the benchmarks. Ensure you have wget and unzip installed.

```bash
bash run_librispeech.sh
```

--------------------------------

### Convert Buffer to MLX Stream with Transformations

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Transform an MLX Buffer into an MLX Stream, applying transformations like normalization and batching. This example normalizes images and batches them.

```python
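import mlx.data as dx

# `buffer` is an existing MLX Data Buffer, e.g. one created with
# dx.buffer_from_vector. The parentheses allow the chain to span several lines.
stream = (
    buffer
    .to_stream()
    .key_transform("image", lambda x: x.astype("float32") / 255)
    .batch(32)
    .prefetch(prefetch_size=8, num_threads=4)
)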
```

--------------------------------

### Add bzip2 Dependency with Patch

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds bzip2, applying a patch for -fPIC compilation. It's configured to install into the 'deps' directory.

```cmake
ExternalProject_Add(
  bzip2
  URL https://sourceware.org/pub/bzip2/bzip2-1.0.8.tar.gz
  PATCH_COMMAND patch -p1 < ${CMAKE_SOURCE_DIR}/cmake/bzip2-1.0.8.patch
  CONFIGURE_COMMAND ""
  BUILD_COMMAND ""
  INSTALL_COMMAND make install PREFIX=${CMAKE_BINARY_DIR}/deps
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Build Standalone C++ Library

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Builds and installs MLX Data as a standalone C++ static library. This allows linking MLXData into other C++ projects.

```bash
mkdir build && cd build
cmake ..
make -j
sudo make install
```

--------------------------------

### Add zlib Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds zlib, a compression library. It's configured for static linking and PIC (Position Independent Code) compilation, installing into the 'deps' directory.

```cmake
ExternalProject_Add(
  zlib
  URL https://www.zlib.net/zlib-1.3.1.tar.gz
  CONFIGURE_COMMAND
    ${CMAKE_COMMAND} -E env PATH=${PATH} PKG_CONFIG_PATH=${PKG_CONFIG_PATH}
    CFLAGS=-fPIC ./configure --prefix=${CMAKE_BINARY_DIR}/deps --static
  BUILD_COMMAND make
  INSTALL_COMMAND make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add xz Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds xz, a compression library. It's configured with specific CMake arguments to disable shared libraries, documentation, and scripts, ensuring static linking and PIC support, and installs into the 'deps' directory.

```cmake
ExternalProject_Add(
  xz
  URL https://downloads.sourceforge.net/project/lzmautils/xz-5.8.1.tar.gz
  CMAKE_ARGS -DCMAKE_BUILD_TYPE=Release
             -DBUILD_SHARED_LIBS=OFF
             -DXZ_TOOL_XZ=OFF
             -DXZ_TOOL_XZDEC=OFF
             -DXZ_TOOL_LZMADEC=OFF
             -DXZ_TOOL_LZMAINFO=OFF
             -DXZ_DOC=OFF
             -DENABLE_SCRIPTS=OFF
             -DCMAKE_POSITION_INDEPENDENT_CODE=ON
             -DCMAKE_PREFIX_PATH=${CMAKE_BINARY_DIR}/deps
             -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/deps
  INSTALL_DIR ${CMAKE_BINARY_DIR}/deps
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add zstd Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds zstd, a fast compression library. It is configured for static linking with multithreading and the command-line programs disabled, and on macOS it sets the target architectures. It installs into the 'deps' directory.

```cmake
if(APPLE)
  set(OSX_ARCHITECTURES "x86_64$<SEMICOLON>x86_64h$<SEMICOLON>arm64")
endif()
ExternalProject_Add(
  zstd
  URL https://github.com/facebook/zstd/archive/refs/tags/v1.5.7.tar.gz
  CMAKE_ARGS -DCMAKE_BUILD_TYPE=Release
             -DCMAKE_OSX_ARCHITECTURES=${OSX_ARCHITECTURES}
             -DZSTD_BUILD_SHARED=OFF
             -DZSTD_MULTITHREAD_SUPPORT=OFF
             -DZSTD_BUILD_PROGRAMS=OFF
             -DCMAKE_PREFIX_PATH=${CMAKE_BINARY_DIR}/deps
             -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/deps
  SOURCE_SUBDIR build/cmake
  INSTALL_DIR ${CMAKE_BINARY_DIR}/deps
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add xvidcore External Project

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Configures the xvidcore project, specifying its URL, patch command, and custom configure, build, and install commands. It's built in-source and uses a specific patch to disable shared libraries.

```cmake
ExternalProject_Add(
  xvidcore
  DEPENDS nasm
  URL https://downloads.xvid.com/downloads/xvidcore-1.3.7.tar.bz2
  PATCH_COMMAND patch -p1 < ${CMAKE_SOURCE_DIR}/cmake/xvidcore-1.3.7.patch
  CONFIGURE_COMMAND
    cd build/generic && ${CMAKE_COMMAND} -E env PATH=${PATH}
    PKG_CONFIG_PATH=${PKG_CONFIG_PATH} CFLAGS=-fPIC ./configure
    --prefix=${CMAKE_BINARY_DIR}/deps
  BUILD_COMMAND cd build/generic && ${CMAKE_COMMAND} -E env PATH=${PATH}
                PKG_CONFIG_PATH=${PKG_CONFIG_PATH} make -j1
  INSTALL_COMMAND cd build/generic && make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add pkg-config Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds pkg-config, a dependency for managing compile/link flags. It's configured to install into the local 'deps' directory and disables shared library building.

```cmake
ExternalProject_Add(
  pkg-config
  URL http://pkgconfig.freedesktop.org/releases/pkg-config-0.29.2.tar.gz
      http://fresh-center.net/linux/misc/pkg-config-0.29.2.tar.gz
  CONFIGURE_COMMAND
    ${CMAKE_COMMAND} -E env CFLAGS=-Wno-int-conversion
    CXXFLAGS=-Wno-int-conversion ./configure --prefix=${CMAKE_BINARY_DIR}/deps
    --disable-shared --with-internal-glib
  BUILD_COMMAND make
  INSTALL_COMMAND make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Serve Local Documentation

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/README.md

Run a local HTTP server in the mlx/docs/build/html/ directory to view the built documentation. Point your browser to http://localhost:<port>.

```bash
python -m http.server <port>
```

--------------------------------

### Build HTML Documentation

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/README.md

Build the HTML version of the documentation from the mlx/docs/ directory using the make html command.

```bash
make html
```

--------------------------------

### Stream Processing with Batching and Prefetching

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Demonstrates creating a stream from a buffer, applying batching, and then using non-deterministic prefetching for efficient iteration.

```python
# We can define the rest of the processing pipeline using streams.
# 1. First shuffle the buffer
# 2. Make a stream
# 3. Batch and then prefetch
dset = (
    dset
    .shuffle()
    .to_stream()  # <-- making a stream from the shuffled buffer
    .batch(32)
    .prefetch(8, 4)  # <-- prefetch 8 batches using 4 threads
)

# Now we can iterate over dset
sample = next(dset)
```

--------------------------------

### Stream Processing with Ordered Prefetching

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Shows how to create a stream from a buffer and apply batching with deterministic prefetching using `ordered_prefetch`.

```python
# We can define the rest of the processing pipeline using streams.
# 1. First shuffle the buffer
# 2. Make a stream
# 3. Batch and then prefetch
dset = (
    dset
    .shuffle()
    .batch(32)
    .ordered_prefetch(8, 4)  # <-- prefetch 8 batches in a stream using 4 threads
)

# Now we can iterate over dset
sample = next(dset)
```
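
The difference between deterministic and non-deterministic prefetching can be illustrated with the standard library alone. This is an analogy and not MLX Data's implementation: `load` is a hypothetical stand-in for per-sample work, `pool.map` plays the role of ordered prefetching, and `as_completed` plays the role of unordered prefetching.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def load(index):
    """Stand-in for an expensive per-sample operation (hypothetical)."""
    time.sleep(random.uniform(0.0, 0.01))
    return index


indices = list(range(16))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Ordered: results come back in submission order, whichever finishes first.
    ordered = list(pool.map(load, indices))

    # Unordered: results come back in completion order.
    futures = [pool.submit(load, i) for i in indices]
    unordered = [f.result() for f in as_completed(futures)]

print(ordered)
print(unordered)
```

Both lists contain the same items; only the ordered variant is guaranteed to reproduce the input order from run to run.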

--------------------------------

### Create and Access Buffer Elements

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/buffer.md

Demonstrates creating a buffer from a vector, applying a key transformation, and accessing elements by index. Use this to create and manipulate in-memory datasets.

```python
import mlx.data as dx

numbers = dx.buffer_from_vector([{"x": i} for i in range(10)])
evens = numbers.key_transform("x", lambda x: 2*x)

print(evens)
# prints Buffer(size=10, keys={'x'})

print(evens[3])
# prints {'x': array(6)}

print(len(evens))
# prints 10
```

--------------------------------

### Initialize and Use AWSFileFetcher

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/miscellaneous.md

Instantiate an AWSFileFetcher for downloading files from S3-compatible storage with local caching. Prefetch files in the background for improved performance.

```python
from pathlib import Path
from mlx.data.core import AWSFileFetcher

LOCAL_CACHE = Path("/path/to/local/cache")

ff = AWSFileFetcher(
    "my-cool-bucket",
    endpoint="https://my.endpoint.com/",
    local_prefix=LOCAL_CACHE,
    num_kept_files=100,
)

# When fetch returns my/remote/path/foo.npy will be in LOCAL_CACHE
ff.fetch("my/remote/path/foo.npy")
assert (LOCAL_CACHE / "my/remote/path/foo.npy").is_file()

# We can prefetch in the background
ff.prefetch(["foo_1.npy", "foo_2.npy"])
ff.fetch("foo_1.npy")
# process foo_1 while foo_2 downloads in the background
```

--------------------------------

### Build CharTrie and Tokenizer Manually

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/tokenizing.md

Demonstrates building a CharTrie from a list of words and then creating a Tokenizer. Useful for custom vocabularies.

```python
from mlx.data.core import CharTrie, Tokenizer

# We can build a trie ourselves
trie = CharTrie()
for t in b"a quick brown fox jumped over the lazy dog".split():
    trie.insert(t)
trie.insert(b" ")

tokenizer = Tokenizer(trie)
print(tokenizer.tokenize_shortest(b"a quick brown fox jumped over the lazy dog"))
# [0, 9, 1, 9, 2, 9, 3, 9, 4, 9, 5, 9, 6, 9, 7, 9, 8]
```

```python
# We can also add all the letters in the trie and then tokenize anything we want
import string
for l in string.ascii_letters:
    trie.insert(bytes(l, "utf-8"))

print(tokenizer.tokenize_shortest(b"This is a quick example"))
# [54, 16, 17, 27, 9, 17, 27, 9, 0, 9, 1, 9, 13, 32, 0, 21, 24, 20, 13]
```

--------------------------------

### Shuffle, batch, and prefetch with Buffer and Stream APIs

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Demonstrates chaining transformations on a Buffer, including shuffling, converting to a Stream, batching, and prefetching. Supports both non-deterministic multi-threaded streams and deterministic ordered prefetching.

```python
import mlx.data as dx
from mlx.data.datasets import load_mnist

mnist = load_mnist(train=True)  # Buffer(size=60000, keys={'label', 'image'})

# Non-deterministic multi-threaded stream
stream = (
    mnist
    .shuffle()
    .to_stream()
    .batch(32)
    .prefetch(prefetch_size=8, num_threads=4)
)

for batch in stream:
    images = batch["image"]  # shape (32, 28, 28, 1), dtype uint8
    labels = batch["label"]  # shape (32,)
    break

stream.reset()  # restart iteration

# Deterministic ordered prefetch (stays a Buffer)
ordered = (
    mnist
    .shuffle()
    .batch(32)
    .ordered_prefetch(num_prefetch=8, num_threads=4)
)
```

--------------------------------

### Load Common Datasets (MNIST, CIFAR-10, LibriSpeech, WikiText)

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Provides pre-built loaders for common datasets like MNIST, CIFAR-10, LibriSpeech, and WikiText. Data is downloaded and cached locally by default.

```python
import mlx.data as dx
from mlx.data.datasets import (
    load_mnist, load_fashion_mnist,
    load_cifar10, load_cifar100,
    load_imagenet,
    load_librispeech,
    load_speechcommands,
    load_wikitext_lines,
    load_images_from_folder,
)

# MNIST — Buffer(size=60000, keys={'image', 'label'})
mnist_train = load_mnist(train=True)
mnist_test  = load_mnist(train=False)

train_iter = (
    mnist_train
    .shuffle()
    .to_stream()
    .key_transform("image", lambda x: (x.astype("float32") / 255).ravel())
    .batch(128)
    .prefetch(4, 2)
)
print(next(train_iter)["image"].shape)  # (128, 784)

# CIFAR-10
cifar = load_cifar10(train=True)  # Buffer(size=50000, keys={'image','label'})

# LibriSpeech
libri = load_librispeech(split="train-clean-100")  # Buffer

# WikiText-103 — returns a Stream of lines
wiki = load_wikitext_lines(split="train")  # Stream

# Load images from a custom folder (one subfolder = one class)
folder_dset = load_images_from_folder("path/to/imagenet/train")
```

--------------------------------

### Build AWS SDK from Source on Ubuntu

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Builds the AWS SDK from source on Ubuntu, specifically for S3 access. This is only needed if you require S3 functionality and are not using prebuilt binaries.

```bash
sudo apt install libcurl4-openssl-dev libssl-dev
git clone --depth 1 --recurse-submodules https://github.com/aws/aws-sdk-cpp.git
cd aws-sdk-cpp
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3" -DBUILD_SHARED_LIBS=OFF
make -j
sudo make install
```

--------------------------------

### Iterate Through MLX Stream for Training

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Demonstrates how to use the created MLX stream for training, including resetting the stream, iterating through batches, and displaying a sample image.

```python
import matplotlib.pyplot as plt
import mlx.core as mx

train_stream = hf_dataset_to_mlx_stream(ds['train'], shuffle=True)
test_stream = hf_dataset_to_mlx_stream(ds['test'], shuffle=False)

train_stream.reset()
for batch in train_stream:
    (X, y) = mx.array(batch['image']), mx.array(batch['label'])

    print('The image should display a ', y[0].item())
    plt.imshow(X[0])
    break
```

--------------------------------

### Run Pre-commit Hooks for All Files

Source: https://github.com/ml-explore/mlx-data/blob/main/CONTRIBUTING.md

Execute all pre-commit hooks on all files in the repository to ensure consistent code style.

```bash
pre-commit run --all-files
```

--------------------------------

### Tokenize and Prepare Wikitext Dataset for Language Modeling

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Prepares the WikiText dataset for language modeling by tokenizing lines with a vocabulary trie, dropping the raw text, batching, and cutting the token array into fixed-size windows. The closing comment reports the approximate throughput on an M2 MacBook Air.

```python
# `wiki` is assumed to be the stream of lines returned by load_wikitext_lines
workers = 8
trie = read_trie_from_vocab("/path/to/vocab.txt")
wiki_iterator = (
    wiki
    .tokenize("line", trie, output_key="tokens")
    .filter_key("tokens")
    .prefetch(512, workers)
    .batch(128, dim=dict(tokens=0))  # gather everything in a big array of tokens
    .sliding_window("tokens", 1025, 1025)
    .shape("tokens", "tokens_length", 0)
    .batch(32)  # actual batch size
    .prefetch(2, 1)
)
# The above can be iterated at approximately 2.5M tok/s on an M2 Macbook Air.
```

--------------------------------

### Load and Prepare MNIST Dataset for MLP Training

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/quick_start.md

Imports MLX Data and the MNIST dataset loader. It then shuffles, converts to a stream, flattens images, batches, and prefetches data for efficient MLP training. Finally, it shows how to iterate over the prepared batches.

```python
# This is the standard way to import and access mlx.data
import mlx.data as dx

# Let's import MNIST loading
from mlx.data.datasets import load_mnist

# Loads a buffer with the MNIST images
mnist_train = load_mnist(train=True)

# Let's shuffle, flatten, and batch to prepare for MLP training
mnist_mlp = (
    mnist_train
    .shuffle()
    .to_stream()
    .key_transform("image", lambda x: x.astype("float32").reshape(-1))
    .batch(32)
    .prefetch(4, 2)
)

# Now we can iterate over the batches in normal python
for batch in mnist_mlp:
    x, y = batch["image"], batch["label"]
```
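
The shapes produced by the flatten-then-batch steps can be checked with NumPy alone. This is a sketch under stated assumptions: the samples are zero-filled arrays with the MNIST sample shape `(28, 28, 1)`, and `np.stack` stands in for `.batch(32)`; it is not how MLX Data batches internally.

```python
import numpy as np

# Same per-sample transform as in the pipeline above.
flatten = lambda x: x.astype("float32").reshape(-1)

# 32 fake MNIST-shaped samples: 28x28 pixels, 1 channel, uint8.
images = [np.zeros((28, 28, 1), dtype=np.uint8) for _ in range(32)]

# np.stack stands in for .batch(32): it stacks samples along a new first axis.
batch = np.stack([flatten(image) for image in images])
print(batch.shape, batch.dtype)
# prints (32, 784) float32
```

Each batch is therefore a 2-D array with one flattened image per row, which is the input layout an MLP expects.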

--------------------------------

### Run WikiText Benchmark with Custom Paths

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/wikitext/README.md

Execute the benchmark script by specifying the paths to your tokenizer model and the extracted WikiText dataset. The command sets OMP_NUM_THREADS=1, which limits OpenMP to a single thread.

```bash
OMP_NUM_THREADS=1 python mlx_data.py \
    --tokenizer_file /path/to/tokenizer.model \
    /path/to/wikitext/wikitext-103-raw
```

--------------------------------

### Set PATH and PKG_CONFIG_PATH

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Configures environment variables for the build process, ensuring that executables and pkg-config files from downloaded dependencies are found.

```cmake
set(PATH ${CMAKE_BINARY_DIR}/deps/bin:$ENV{PATH})
set(PKG_CONFIG_PATH
    ${CMAKE_BINARY_DIR}/deps/lib/pkgconfig:${CMAKE_BINARY_DIR}/deps/lib64/pkgconfig
)
```

--------------------------------

### Combine Dataset Loading, Conversion, and Stream Creation

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

A comprehensive function that loads a Hugging Face dataset, converts it to NumPy arrays, and creates a preprocessed MLX stream ready for training.

```python
import numpy as np
import mlx.data as dx

# Convert the content of the dataset into numpy arrays
def huggingface_to_array_of_dict(dataset):
    return [{"image": np.array(image).copy(), "label": label}
            for label, image in zip(dataset['label'], dataset['image'])]

# Convert the Hugging Face dataset to a stream of batches
def hf_dataset_to_mlx_stream(dataset, shuffle=False):
    numpy_data = huggingface_to_array_of_dict(dataset)

    buffer = dx.buffer_from_vector(numpy_data)
    if shuffle:
        buffer = buffer.shuffle()

    return (
        buffer
        .to_stream()
        .key_transform("image", lambda x: x.astype("float32") / 255)
        .batch(32)
        .prefetch(prefetch_size=8, num_threads=4)
    )
```

--------------------------------

### Read CharTrie from SentencePiece Model

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/tokenizing.md

Shows how to load a CharTrie and associated weights from a SentencePiece model file for efficient tokenization. Recommended for using pre-trained models.

```python
from mlx.data.core import Tokenizer
from mlx.data.tokenizer_helpers import read_trie_from_spm

trie, weights = read_trie_from_spm("path/to/spm/model")
tokenizer = Tokenizer(trie, trie_key_scores=weights)
tokenizer.tokenize_shortest(b"This is some more text to tokenize")
```

--------------------------------

### Run mlx.data Benchmark

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/librispeech/README.md

Execute the mlx.data benchmark by specifying the tokenizer file and the path to the extracted LibriSpeech dataset. Set OMP_NUM_THREADS=1 to prevent thread contention.

```bash
OMP_NUM_THREADS=1 python mlx_data.py \
    --tokenizer_file /path/to/tokenizer.model \
    /path/to/librispeech/LibriSpeech/dev-clean
```

--------------------------------

### Load and Process Images with MLX Data Pipeline

Source: https://github.com/ml-explore/mlx-data/blob/main/README.md

This pipeline demonstrates loading images, resizing, cropping, batching, scaling, and prefetching. It uses a Python function to prepare data samples and then applies various transformations using the MLX Data API.

```python
from pathlib import Path

import mlx.data as dx


# A simple python function returning a list of dicts. All samples in MLX data
# are dicts of arrays.
def files_and_classes(root: Path):
    files = [str(f) for f in root.glob("**/*.jpg")]
    files = [f for f in files if "BACKGROUND" not in f]
    classes = dict(
        map(reversed, enumerate(sorted(set(f.split("/")[-2] for f in files))))
    )

    return [
        dict(image=f.encode("ascii"), label=classes[f.split("/")[-2]]) for f in files
    ]


root = Path("/path/to/dataset")  # placeholder: dataset root, one folder per class
batch_size = 32  # placeholder batch size

dset = (
    # Make a buffer (finite length container of samples) from the python list
    dx.buffer_from_vector(files_and_classes(root))

    # Shuffle and transform to a stream
    .shuffle()
    .to_stream()

    # Implement a simple image pipeline. No random augmentations here but they
    # could be applied.
    .load_image("image")  # load the file pointed to by the 'image' key as an image
    .image_resize_smallest_side("image", 256)
    .image_center_crop("image", 224, 224)

    # Accumulate into batches
    .batch(batch_size)

    # Cast to float32 and scale to [0, 1]. We do this in python and we could
    # have done any transformation we could think of.
    .key_transform("image", lambda x: x.astype("float32") / 255)

    # Finally, fetch batches in background threads
    .prefetch(prefetch_size=8, num_threads=8)
)

# dset is a python iterable so one could simply
for sample in dset:
    # access sample["image"] and sample["label"]
    pass
```

--------------------------------

### Configure Target Properties

Source: https://github.com/ml-explore/mlx-data/blob/main/python/src/CMakeLists.txt

Sets include directories, link libraries, and compile definitions for the '_c' target. This ensures the extension can access necessary headers, link against the 'mlxdata' library, and use version-specific definitions.

```cmake
target_include_directories(_c PUBLIC ${CMAKE_SOURCE_DIR})
target_link_libraries(_c PRIVATE mlxdata)
target_compile_definitions(_c PRIVATE _VERSION_=${MLX_DATA_VERSION})
```

--------------------------------

### Load and Transform Images with MLX Data

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Loads images, applies transformations like resizing and cropping, normalizes pixel values, and batches the data. Ensure the 'image' key exists in your samples.

```python
import mlx.data as dx

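# files_and_classes and root are assumed to be defined elsewhere:
# a helper returning a list of sample dicts and the dataset root path.
dset = (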
    dx.buffer_from_vector(files_and_classes(root))
    .shuffle()
    .to_stream()
    # Decode JPEG file pointed to by 'image' key -> HWC uint8 array
    .load_image("image")
    # Resize so the smallest side is 256 pixels
    .image_resize_smallest_side("image", 256)
    # Center crop to 224x224
    .image_center_crop("image", 224, 224)
    # Accumulate into batches of 32
    .batch(32)
    # Cast to float32 and normalize to [0, 1] using a Python lambda
    .key_transform("image", lambda x: x.astype("float32") / 255)
    # Prefetch 8 batches using 8 background threads
    .prefetch(prefetch_size=8, num_threads=8)
)

for batch in dset:
    x = batch["image"]   # (32, 224, 224, 3), float32
    y = batch["label"]   # (32,), int
```

--------------------------------

### Fast Tokenization with CharTrie

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Tokenizes byte strings using a CharTrie for shortest-path or unigram tokenization. This process does not hold the GIL, allowing for parallel execution.

```python
from mlx.data.core import CharTrie, Tokenizer
from mlx.data.tokenizer_helpers import (
    read_trie_from_spm,
    read_bpe_from_spm,
    read_bpe_from_hf,
    read_trie_from_vocab,
    gpt2_byte_map,
)
import mlx.data as dx

# --- CharTrie tokenizer (unigram / shortest path) ---
trie = CharTrie()
for word in b"a quick brown fox jumped over the lazy dog".split():
    trie.insert(word)
trie.insert(b" ")
tokenizer = Tokenizer(trie)
print(tokenizer.tokenize_shortest(b"a quick brown fox"))
# [0, 9, 1, 9, 2, 9, 3]
```
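
What "shortest path" means here can be sketched in plain Python: among all ways of splitting the input into vocabulary entries, pick one that uses the fewest tokens. The function below is a conceptual illustration using dynamic programming over a dict; it is not the `Tokenizer` implementation, and assigning ids in insertion order is an assumption chosen to mirror the output shown above.

```python
def tokenize_shortest(text: bytes, vocab: dict) -> list:
    """Segment `text` into the fewest vocabulary entries and return their ids."""
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: fewest tokens covering text[:i]
    back = [None] * (n + 1)  # back[i]: (start, token_id) of the last token
    best[0] = 0
    max_len = max(len(token) for token in vocab)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = (start, vocab[piece])
    if best[n] == inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    ids = []
    pos = n
    while pos > 0:  # walk the back-pointers from the end of the text
        pos, token_id = back[pos]
        ids.append(token_id)
    return ids[::-1]


# Ids follow insertion order: the nine words get 0..8 and the space gets 9.
vocab = {}
for word in b"a quick brown fox jumped over the lazy dog".split() + [b" "]:
    vocab.setdefault(word, len(vocab))

print(tokenize_shortest(b"a quick brown fox", vocab))
# prints [0, 9, 1, 9, 2, 9, 3]
```

A trie makes the inner lookup far cheaper than slicing and hashing every candidate piece, which is one reason to prefer the library's `CharTrie` for real workloads.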

--------------------------------

### Load and Tokenize from SentencePiece Model

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Loads a SentencePiece model for tokenization and applies it to a text corpus. It includes steps for streaming, tokenizing, filtering, padding, chunking, and batching data.

```python
import mlx.data as dx
from mlx.data.tokenizer_helpers import read_trie_from_spm

trie, weights = read_trie_from_spm("path/to/tokenizer.model")

dset = (
    dx.stream_python_iterable(lambda: ({"text": line} for line in open("corpus.txt", "rb")))
    .tokenize("text", trie, trie_key_scores=weights, output_key="tokens")
    .filter_key("text", remove=True)        # drop raw text key
    .pad("tokens", 0, 1, 0, trie.search("<s>").id)   # prepend BOS
    .pad("tokens", 0, 0, 1, trie.search("</s>").id)  # append EOS
    .sliding_window("tokens", 1025, 1025)   # chunk into context windows
    .shape("tokens", "tokens_length", 0)
    .batch(32)
    .prefetch(4, 4)
)
```
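
The padding and chunking steps can be pictured on plain Python lists. This is a sketch under stated assumptions, not what `pad` and `sliding_window` do internally: the helper names are hypothetical, the final shorter window is assumed to be kept, and the reading that a 1025-token window exists to supply 1024 inputs plus 1024 shifted targets is an inference from the window size.

```python
def add_bos_eos(tokens, bos_id, eos_id):
    """Mirror the two pad() calls: one BOS in front, one EOS at the end."""
    return [bos_id, *tokens, eos_id]


def windows(tokens, size, stride):
    """Yield windows of `size` tokens taken every `stride` tokens.

    Assumption: a final, shorter window is kept rather than dropped.
    """
    for start in range(0, len(tokens), stride):
        yield tokens[start:start + size]
        if start + size >= len(tokens):
            break


# Toy ids; 1 and 2 stand in for the <s> and </s> ids
tokens = add_bos_eos(list(range(10, 18)), bos_id=1, eos_id=2)
chunks = list(windows(tokens, size=5, stride=5))

# A window of n + 1 tokens gives n next-token training pairs
inputs = [chunk[:-1] for chunk in chunks]
targets = [chunk[1:] for chunk in chunks]
```

With `size == stride` the windows do not overlap, which is the configuration used in the pipeline above.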

--------------------------------

### Create and Use a Stream from a Python Iterable

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/stream.md

Creates a stream from a Python iterable that is far too large to materialize, then filters it with `sample_transform`. A derived stream does not copy its source: `evens` reads from the same underlying stream as `numbers`, so advancing either one advances both. Streams can be reset if the source supports it.

```python
import mlx.data as dx

# The samples are never all instantiated
numbers = dx.stream_python_iterable(lambda: ({"x": i} for i in range(10**10)))

# Filtering is done with transforms returning an empty sample
evens = numbers.sample_transform(lambda s: s if s["x"] % 2 == 0 else dict())

print(next(numbers))
# prints {'x': array(0)}
print(next(numbers))
# prints {'x': array(1)}

# A derived stream holds a pointer to its source, so evens reads from numbers
# under the hood. Since numbers was advanced, evens is advanced as well.
print(next(evens))
# prints {'x': array(2)}
print(next(evens))
# prints {'x': array(4)}
print(next(numbers))
# prints {'x': array(5)}

# Streams can be reset. Resetting evens also rewinds numbers, because both
# read from the same source.
evens.reset()
print(next(evens))
# prints {'x': array(0)}
print(next(evens))
# prints {'x': array(2)}
print(next(numbers))
# prints {'x': array(3)}
```
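
The shared-pointer behaviour has a close analogue in plain Python: a generator expression built on top of an iterator pulls from that same iterator. The sketch below uses only the standard library and is meant as an analogy, not as a description of how streams are implemented; unlike a stream, a plain iterator cannot be reset.

```python
numbers = iter(range(10))
evens = (x for x in numbers if x % 2 == 0)  # lazily pulls from numbers

first = next(numbers)   # 0
second = next(numbers)  # 1
third = next(evens)     # 2, because numbers had already moved past 0 and 1
fourth = next(evens)    # consumes 3 (odd, skipped) and returns 4
fifth = next(numbers)   # 5, because evens advanced the shared iterator
```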

--------------------------------

### Apply Python Transforms to Data Samples

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Applies Python functions to data samples. `key_transform` modifies a single key's value, while `sample_transform` operates on the entire sample dictionary. Use `*_if` variants for conditional transformations.

```python
import numpy as np
import mlx.data as dx

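# One sample carrying every key that the transforms below read
dset = dx.buffer_from_vector([{
    "audio": np.random.randn(16000).astype("float32"),
    "image": np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8),
    "label": 1,
}])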

# Apply a function to a single key
dset = dset.key_transform("audio", lambda x: x / np.abs(x).max())

# Cross-key logic and conditional filtering
def augment_and_filter(sample):
    if sample["label"] < 0:
        return dict()  # drop sample
    sample["image"] = sample["image"].astype("float32") / 255.0
    sample["image"] = (1 + 0.1 * np.random.rand()) * sample["image"]
    return sample

dset = dset.to_stream().sample_transform(augment_and_filter)

# Conditional variants: every operation has a *_if form
enable_flip = True
dset = dset.image_random_h_flip_if(enable_flip, "image", 0.5)
```
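
The difference between the two transform kinds, and the "empty dict drops the sample" convention, can be sketched with generators over plain dictionaries. The two functions below are hypothetical stand-ins written for illustration; they are not the library's code and ignore arrays, threading, and the `*_if` variants.

```python
def key_transform(samples, key, fn):
    """Apply fn to one key and leave every other key untouched."""
    for sample in samples:
        yield {**sample, key: fn(sample[key])}


def sample_transform(samples, fn):
    """Apply fn to the whole sample; an empty result drops the sample."""
    for sample in samples:
        out = fn(dict(sample))
        if out:
            yield out


samples = [
    {"x": 1, "label": 0},
    {"x": 2, "label": -1},  # negative label: filtered out below
    {"x": 3, "label": 1},
]
doubled = key_transform(samples, "x", lambda v: v * 2)
kept = list(sample_transform(doubled, lambda s: s if s["label"] >= 0 else {}))
```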

--------------------------------

### MLX Data Stream Prefetching (Non-deterministic)

Source: https://context7.com/ml-explore/mlx-data/llms.txt

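Uses `Stream.prefetch` for non-deterministic, high-throughput prefetching with multiple threads. Because several threads pull from the stream concurrently, the order in which batches arrive can differ between runs, so use it when the exact order within an epoch does not matter (for example, after shuffling).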

```python
import mlx.data as dx
from mlx.data.datasets import load_cifar10

buf = load_cifar10(train=True)

# Non-deterministic (highest throughput)
stream_nondet = (
    buf
    .shuffle()
    .to_stream()
    # load_cifar10 already yields decoded image arrays, so there is no load_image step
    .batch(64)
    .key_transform("image", lambda x: x.astype("float32") / 255)
    .prefetch(8, 8)  # prefetch size, number of threads
)

for batch in stream_nondet:
    pass  # iterate one epoch

stream_nondet.reset()  # reset for next epoch
```
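
Why multi-threaded prefetching loses ordering can be seen in a small standard-library sketch: several workers pull from one shared iterator and push results into a bounded queue as soon as they finish. This is an illustration under assumptions, not the library's implementation; `prefetch_unordered` is a hypothetical name, and errors raised by `fn` are not propagated to the consumer.

```python
import queue
import threading


def prefetch_unordered(iterable, fn, prefetch_size, num_threads):
    """Yield fn(item) for every item, in whatever order workers finish."""
    it = iter(iterable)
    lock = threading.Lock()
    out = queue.Queue(maxsize=prefetch_size)  # bounds how far workers run ahead
    done = object()  # sentinel marking the end of the input or of a worker

    def worker():
        try:
            while True:
                with lock:  # iterators are not thread-safe
                    item = next(it, done)
                if item is done:
                    break
                out.put(fn(item))
        finally:
            out.put(done)  # always tell the consumer this worker stopped

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()

    finished = 0
    while finished < num_threads:
        result = out.get()
        if result is done:
            finished += 1
        else:
            yield result


results = list(prefetch_unordered(range(50), lambda x: x * 2, 8, 4))
```

Every item appears exactly once in `results`, but the order is not guaranteed to match the input order.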

--------------------------------

### MLX Data Stream Prefetching (Deterministic)

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Uses `Buffer.ordered_prefetch` to prefetch samples while preserving order after an initial shuffle. Ensures deterministic iteration order for each epoch.

```python
import mlx.data as dx
from mlx.data.datasets import load_cifar10

buf = load_cifar10(train=True)

# Deterministic ordered prefetch (same order every epoch after shuffle)
stream_det = (
    buf
    .shuffle()
    # load_cifar10 already yields decoded image arrays, so there is no load_image step
    .batch(64)
    .key_transform("image", lambda x: x.astype("float32") / 255)
    .ordered_prefetch(8, 8)  # prefetch size, number of threads
)

for batch in stream_det:
    pass  # iterate one epoch

stream_det.reset() # reset for next epoch
```
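
One way to prefetch in parallel while keeping order is to submit work to a thread pool and always wait on the oldest pending result. The sketch below uses only `concurrent.futures` and is an assumption about the general technique, not the code behind `ordered_prefetch`; the function names are hypothetical.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def ordered_prefetch_sketch(items, fn, prefetch_size, num_threads):
    """Yield fn(item) in input order while up to prefetch_size calls are in flight."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(fn, item))
            if len(pending) >= prefetch_size:
                # Block on the oldest future so results keep their input order
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


def slow_square(x):
    # Uneven delays make later items finish before earlier ones
    time.sleep(0.002 * (3 - x % 3))
    return x * x


squares = list(ordered_prefetch_sketch(range(12), slow_square, 4, 4))
```

The trade-off relative to unordered prefetching is that a single slow item holds back results that are already finished behind it.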

--------------------------------

### Load and Process Audio Files with MLX Data

Source: https://context7.com/ml-explore/mlx-data/llms.txt

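Reads LibriSpeech transcript files line by line, derives each utterance's audio path from its ID, loads the audio, removes the channel dimension for mono audio, extracts Mel filterbank features, and batches the results. The audio path must be stored under the `audio` key before `load_audio` runs.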

```python
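from pathlib import Path

import mlx.data as dx
from mlx.data.features import mfsc

# buffer_from_vector does not expand glob patterns, so list the transcript
# files explicitly (LibriSpeech names them <speaker>-<chapter>.trans.txt)
root = Path("path/to/librispeech")
files = [{"file": str(p).encode()} for p in sorted(root.rglob("*.trans.txt"))]

dset = (
    dx.buffer_from_vector(files)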
    .to_stream()
    .line_reader_from_key("file", "line")
    .sample_transform(lambda s: {
        "audio": b"/".join(bytes(s["line"]).split(b" ", 1)[0].split(b"-")[:-1]
                          + [bytes(s["line"]).split(b" ", 1)[0] + b".flac"]),
        "transcript": bytes(s["line"]).split(b" ", 1)[1].lower(),
    })
    # Load audio file -> (T, C) int16 array
    .load_audio("audio", prefix="path/to/librispeech")
    # Drop channel dim: (T, 1) -> (T,)
    .squeeze("audio")
    # Extract 128-band log Mel filterbank features at 16 kHz
    .key_transform("audio", mfsc(n_filterbank=128, sampling_freq=16000))
    # Record audio length before batching
    .shape("audio", "audio_length", 0)
    .batch(32)
    .prefetch(8, 8)
)
```
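
The lambda passed to `sample_transform` is dense, so here is the same path derivation written as a named function on plain bytes. It assumes the standard LibriSpeech layout, where utterance `1089-134686-0000` lives at `1089/134686/1089-134686-0000.flac`; the function name is hypothetical and the trailing-newline handling is an addition for standalone use.

```python
def parse_librispeech_line(line: bytes) -> dict:
    """Split one transcript line into an audio path and a lowercased transcript."""
    utt_id, text = line.rstrip(b"\n").split(b" ", 1)
    speaker, chapter, _ = utt_id.split(b"-")
    return {
        # <speaker>/<chapter>/<speaker>-<chapter>-<utterance>.flac
        "audio": b"/".join([speaker, chapter, utt_id + b".flac"]),
        "transcript": text.lower(),
    }


sample = parse_librispeech_line(b"1089-134686-0000 HE HOPED THERE WOULD BE STEW\n")
```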

--------------------------------

### Add ffmpeg External Project

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Configures ffmpeg 7.1.1 as a CMake external project: its download URL, the projects it depends on, and custom configure and build commands. The build runs in-source and produces static, position-independent libraries with libvorbis and libopus enabled; the command-line programs, documentation, and hardware-acceleration backends are disabled.

```cmake
ExternalProject_Add(
  ffmpeg
  DEPENDS nasm
          zlib
          lame
          libogg
          opus
          libvorbis
          xvidcore
          pkg-config
  URL https://ffmpeg.org/releases/ffmpeg-7.1.1.tar.bz2
  CONFIGURE_COMMAND
    ${CMAKE_COMMAND} -E env PATH=${PATH} PKG_CONFIG_PATH=${PKG_CONFIG_PATH}
    ./configure --prefix=${CMAKE_BINARY_DIR}/deps --disable-shared --enable-pic
    --enable-runtime-cpudetect --enable-libvorbis --enable-libopus
    --disable-iconv --disable-programs --disable-doc --disable-htmlpages
    --disable-manpages --disable-podpages --disable-txtpages --disable-alsa
    --disable-sdl2 --disable-xlib --disable-cuda-llvm --disable-cuvid
    --disable-d3d11va --disable-dxva2 --disable-nvdec --disable-nvenc
    --disable-v4l2-m2m --disable-vdpau
    --pkg-config=${CMAKE_BINARY_DIR}/deps/bin/pkg-config
    "--extra-ldflags=-L${CMAKE_BINARY_DIR}/deps/lib -L${CMAKE_BINARY_DIR}/deps/lib64"
    "--extra-libs=-lvorbis -logg -lm"
  BUILD_COMMAND ${CMAKE_COMMAND} -E env
                PATH=${CMAKE_BINARY_DIR}/deps/bin:$ENV{PATH} make
  INSTALL_COMMAND make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Create MLX Buffer from List of Dictionaries

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Convert a list of dictionaries, where each dictionary represents a data sample (e.g., image and label), into an MLX Buffer.

```python
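import numpy as np
import mlx.data as dx

# `dicts` is a list with one dict per sample (placeholder data shown here)
dicts = [{"image": np.zeros((28, 28), dtype=np.uint8), "label": 0}]
buffer = dx.buffer_from_vector(dicts)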
```

--------------------------------

### Load and Inspect Wikitext-103 Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Loads the wikitext-103 dataset (training split) and prints its stream information. This dataset is often used for language modeling tasks.

```python
from mlx.data.datasets import load_wikitext_lines

wiki = load_wikitext_lines(split="train")
print(wiki)
# Downloading https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip 183.1MiB (9.9MiB/s)
# Computing hash of ..../.cache/mlx.data/wikitext/wikitext-103-raw-v1.zip |████████████████████████████████████████| 183.1MiB / 183.1MiB (1.0GiB/s)
# Extracting ..../.cache/mlx.data/wikitext/wikitext-103-raw-v1.zip 517.9MiB (318.2MiB/s)
# Stream()
```

--------------------------------

### Create Buffer from Python list with dx.buffer_from_vector

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Use `dx.buffer_from_vector` to create a Buffer from a list of sample dictionaries. This is suitable when the entire dataset fits in memory. Samples can map string keys to NumPy arrays, bytes, or scalars.

```python
from pathlib import Path
import mlx.data as dx

def files_and_classes(root: Path):
    images = list(root.rglob("*.jpg"))
    categories = [p.relative_to(root).parent.name for p in images]
    category_set = set(categories)
    category_map = {c: i for i, c in enumerate(sorted(category_set))}
    return [
        {
            "image": str(p.relative_to(root)).encode("ascii"),  # bytes path
            "category": c,
            "label": category_map[c],                           # int label
        }
        for c, p in zip(categories, images)
    ]

buf = dx.buffer_from_vector(files_and_classes(Path("path/to/dataset")))
print(buf)          # Buffer(size=9144, keys={'category', 'image', 'label'})
print(buf[0])       # {'category': array([...]), 'image': array([...]), 'label': array(42)}
print(len(buf))     # 9144
```

--------------------------------

### BPE Tokenizer from HuggingFace tokenizer.json

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Reads BPE symbols and merges from a HuggingFace tokenizer.json file and applies byte mapping for tokenization.

```python
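import mlx.data as dx
from mlx.data.tokenizer_helpers import gpt2_byte_map, read_bpe_from_hf

symbols, merges = read_bpe_from_hf("tokenizer.json")
byte_map = gpt2_byte_map()

dset_bpe = (
    # Pass bytes: a Python str would be stored as unicode rather than raw bytes
    dx.buffer_from_vector([{"text": b"Hello world!"}])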
    .replace_bytes("text", byte_map)
    .tokenize_bpe("text", symbols, merges)
)
```
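
For background, GPT-2 style BPE vocabularies are defined over printable characters, so every input byte is first remapped to a printable Unicode code point. The sketch below rebuilds the table used by the original GPT-2 encoder with the standard library only. That `gpt2_byte_map()` returns an equivalent mapping is an assumption, and its return type may differ from the plain `dict` shown here.

```python
def gpt2_bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Remaining bytes (controls, space, ...) move to 256, 257, ...
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, map(chr, code_points)))


table = gpt2_bytes_to_unicode()
encoded = "".join(table[b] for b in "Hello world!".encode("utf-8"))
# encoded == 'HelloĠworld!' (the space byte becomes U+0120)
```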

--------------------------------

### Convert HuggingFace Dataset to MLX Data Stream

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Converts a HuggingFace dataset to an MLX Data stream, processing images and labels. Images are converted to NumPy arrays, normalized, and flattened. Use this for image-based datasets from HuggingFace.

```python
import numpy as np
import mlx.data as dx
import mlx.core as mx
from datasets import load_dataset

ds = load_dataset("ylecun/mnist")

def hf_to_mlx_stream(hf_split, shuffle=False, batch_size=32):
    samples = [
        {"image": np.array(img).copy(), "label": lbl}
        for img, lbl in zip(hf_split["image"], hf_split["label"])
    ]
    buf = dx.buffer_from_vector(samples)
    if shuffle:
        buf = buf.shuffle()
    return (
        buf
        .to_stream()
        .key_transform("image", lambda x: (x.astype("float32") / 255).ravel())
        .batch(batch_size)
        .prefetch(prefetch_size=8, num_threads=4)
    )

train_stream = hf_to_mlx_stream(ds["train"], shuffle=True)
test_stream  = hf_to_mlx_stream(ds["test"],  shuffle=False)

train_stream.reset()
for batch in train_stream:
    X = mx.array(batch["image"])   # (32, 784)
    y = mx.array(batch["label"])   # (32,)
    # training step here
    break
```

--------------------------------

### Find Python and pybind11

Source: https://github.com/ml-explore/mlx-data/blob/main/python/src/CMakeLists.txt

Locates the Python interpreter and pybind11 components required for building the C++ extension. Ensures that the necessary Python development files and pybind11 configuration are available.

```cmake
find_package(
  Python
  COMPONENTS Interpreter Development.Module
  REQUIRED)
execute_process(
  COMMAND "${Python_EXECUTABLE}" -m pybind11 --cmakedir
  OUTPUT_STRIP_TRAILING_WHITESPACE
  OUTPUT_VARIABLE pybind11_ROOT)
find_package(pybind11 CONFIG REQUIRED)
```

--------------------------------

### Process and Batch MNIST Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Applies transformations to the MNIST dataset, including shuffling, converting to stream, normalizing images, and batching. Prefetching is used for performance.

```python
from mlx.data.datasets import load_mnist

mnist = load_mnist(train=True)

mnist_iter = (
    mnist
    .shuffle()
    .to_stream()
    .key_transform("image", lambda x: (x.astype("float32") / 255).ravel())
    .batch(128)
    .prefetch(4, 2)
)
print(next(mnist_iter)["image"].shape)
# (128, 784)
```

--------------------------------

### AWSFileFetcher for Remote S3 File Fetching

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Fetches files from an S3-compatible bucket into a local cache, supporting background prefetching. It can be integrated with I/O operations via the `file_fetcher` parameter.

```python
from pathlib import Path
from mlx.data.core import AWSFileFetcher
import mlx.data as dx

LOCAL_CACHE = Path("/tmp/s3_cache")

ff = AWSFileFetcher(
    "my-dataset-bucket",
    endpoint="https://s3.us-east-1.amazonaws.com/",
    local_prefix=LOCAL_CACHE,
    num_kept_files=500,    # LRU cache: keep at most 500 files locally
)

# Standalone usage
ff.fetch("data/train/image_001.jpg")
assert (LOCAL_CACHE / "data/train/image_001.jpg").is_file()

# Background prefetch while processing current file
ff.prefetch(["data/train/image_002.jpg", "data/train/image_003.jpg"])
ff.fetch("data/train/image_002.jpg")  # likely already cached

# In a pipeline: pass file_fetcher to load_image
samples = [{"image": b"data/train/image_001.jpg", "label": 0}]
dset = (
    dx.buffer_from_vector(samples)
    .to_stream()
    .load_image("image", file_fetcher=ff)
    .image_resize_smallest_side("image", 256)
    .batch(32)
    .prefetch(4, 4)
)
```
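
The effect of `num_kept_files` can be illustrated with a small in-memory least-recently-used cache. This is a sketch of the eviction policy named in the comment above, not the fetcher's implementation: the class name and the `download` callback are hypothetical, and the real fetcher works with files on disk rather than values in a dictionary.

```python
from collections import OrderedDict


class LRUFileCache:
    """Keep at most num_kept_files entries, evicting the least recently used."""

    def __init__(self, download, num_kept_files):
        self.download = download
        self.num_kept_files = num_kept_files
        self.files = OrderedDict()  # oldest entry first

    def fetch(self, name):
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return self.files[name]
        data = self.download(name)
        self.files[name] = data
        while len(self.files) > self.num_kept_files:
            self.files.popitem(last=False)  # drop the least recently used
        return data


downloads = []


def fake_download(name):
    # Stand-in for a remote fetch; records every cache miss
    downloads.append(name)
    return f"contents of {name}"


cache = LRUFileCache(fake_download, num_kept_files=2)
for name in ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "b.jpg"]:
    cache.fetch(name)
```

Here `b.jpg` is downloaded twice: it was the least recently used entry when `c.jpg` arrived, so it had been evicted by the time it was requested again.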