### Install Sphinx and Sphinx Book Theme

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/README.md

Install Sphinx for documentation generation and the Sphinx Book Theme for styling. Use conda for Sphinx and pip for the theme.

```bash
conda install sphinx
pip install sphinx-book-theme
```

--------------------------------

### Install Hugging Face Datasets

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Install the Hugging Face datasets library using pip.

```bash
pip install datasets
```

--------------------------------

### Install Ubuntu Dependencies with apt

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs optional dependencies for handling audio, video, images, and compressed streams on Ubuntu using apt. Note: the AWS SDK must be built manually.

```bash
sudo apt install libsndfile1-dev libsamplerate0-dev ffmpeg libjpeg-turbo8-dev \
  zlib1g-dev libbz2-dev liblzma-dev
```

--------------------------------

### Install Python Build Tools

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs pybind11 and CMake, which are required for building the Python bindings from source.

```bash
# Quote the extras specifier so shells such as zsh do not expand the brackets
pip install "pybind11[global]" cmake
```

--------------------------------

### Python Sample Examples

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Demonstrates the creation of valid samples in Python, including scalar casting and string encoding.

```python
import numpy as np

# This is a valid sample
sample = {"hello": np.array(0)}

# So is this because scalars are cast to scalar arrays
sample = {"scalar": 42}

# Strings can also be used, however, they will be represented in unicode.
sample = {"key": "value"}

# Most likely you would want to write it as bytes in the sample as follows
sample = {"key": b"path/to/my/file"}
sample = {"key": "value".encode("ascii")}
```

--------------------------------

### Install MLX Data Python Package from Source

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs the MLX Data Python package locally from source. Use 'pip install -e .' for an editable install.

```bash
cd /path/to/mlx/data
pip install .
# or pip install -e . for an editable install
```

--------------------------------

### Install mlx-data and dependencies

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Install the mlx-data library using pip. Optional dependencies for audio, video, and S3 on macOS and Ubuntu are listed, along with instructions to build the C++ standalone library.

```bash
pip install mlx-data

# macOS: optional dependencies for audio, video, S3
brew install libsndfile libsamplerate ffmpeg jpeg-turbo zlib bzip2 xz aws-sdk-cpp

# Ubuntu: optional dependencies
sudo apt install libsndfile1-dev libsamplerate0-dev ffmpeg libjpeg-turbo8-dev \
  zlib1g-dev libbz2-dev liblzma-dev

# Build C++ standalone library
mkdir build && cd build
cmake ..
make -j
sudo make install
```

--------------------------------

### Install MLX Data from PyPI

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Use this command to install the MLX Data package and its essential dependencies for reading various data formats.

```bash
pip install mlx-data
```

--------------------------------

### Install macOS Dependencies with Homebrew

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Installs optional dependencies for handling audio, video, images, and S3 access on macOS using Homebrew.
```bash
brew install libsndfile libsamplerate ffmpeg jpeg-turbo zlib bzip2 xz aws-sdk-cpp
```

--------------------------------

### Download and Run WikiText Benchmark

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/wikitext/README.md

Use this bash script to download the necessary data and tokenizer and to execute the benchmarks. Ensure you have wget and unzip installed.

```bash
bash run_wikitext.sh
```

--------------------------------

### Key Transform Example

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Illustrates how to use `key_transform` to apply a function to a specific key (e.g., 'image') within the stream's data.

```python
.key_transform("image", ...)
```

--------------------------------

### Load Hugging Face Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Download and inspect a dataset from Hugging Face. This example loads the MNIST dataset.

```python
from datasets import load_dataset

ds = load_dataset("ylecun/mnist")
print(ds['train'])
```

--------------------------------

### Format C++ Code with clang-format

Source: https://github.com/ml-explore/mlx-data/blob/main/CONTRIBUTING.md

Use this command to format a C++ file in-place. Ensure clang-format is installed.

```bash
clang-format -i file.cpp
```

--------------------------------

### Install Target

Source: https://github.com/ml-explore/mlx-data/blob/main/python/src/CMakeLists.txt

Installs the compiled '_c' target to the 'mlx/data' directory within the Python package structure. This makes the C++ extension available for import in Python.

```cmake
install(TARGETS _c DESTINATION mlx/data)
```

--------------------------------

### Create Buffer from Vector in Python

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Illustrates creating a Buffer from a list of samples using `buffer_from_vector`.
This is useful for in-memory datasets or when starting a data pipeline.

```python
from pathlib import Path

import mlx.data as dx


def files_and_classes(root: Path):
    """Load the files and classes from an image dataset that contains one folder per class."""
    images = list(root.rglob("*.jpg"))
    categories = [p.relative_to(root).parent.name for p in images]
    category_set = set(categories)
    category_map = {c: i for i, c in enumerate(sorted(category_set))}
    return [
        {
            "image": str(p.relative_to(root)).encode("ascii"),
            "category": c,
            "label": category_map[c],
        }
        for c, p in zip(categories, images)
    ]


dset = dx.buffer_from_vector(files_and_classes(Path("path/to/dataset")))

# We can now apply transformations to the dataset
```

--------------------------------

### Load and Inspect MNIST Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Loads the MNIST dataset and prints its buffer information. This is a starting point for using the dataset.

```python
import mlx.data as dx
from mlx.data.datasets import load_mnist, load_wikitext_lines
from mlx.data.tokenizer_helpers import read_trie_from_vocab

mnist = load_mnist()
print(mnist)
# Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz 9.5MiB (15.1MiB/s)
# Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz 1.6MiB (12.9MiB/s)
# Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz 32.0KiB (17.1MiB/s)
# Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz 8.0KiB (26.6MiB/s)
# Buffer(size=60000, keys={'label', 'image'})
```

--------------------------------

### Format Python Code with black

Source: https://github.com/ml-explore/mlx-data/blob/main/CONTRIBUTING.md

Use this command to format a Python file in-place. Ensure black is installed.
```bash
black file.py
```

--------------------------------

### Download LibriSpeech Data and Tokenizer

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/librispeech/README.md

Use this bash script to download the LibriSpeech dataset and the required tokenizer model, and to run the benchmarks. Ensure you have wget and unzip installed.

```bash
bash run_librispeech.sh
```

--------------------------------

### Convert Buffer to MLX Stream with Transformations

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Transform an MLX Buffer into an MLX Stream, applying transformations like normalization and batching. This example normalizes images and batches them.

```python
import mlx.data as dx

# `buffer` is an existing mlx.data Buffer, e.g. one built with dx.buffer_from_vector.
# The parentheses let the method chain span several lines.
stream = (
    buffer
    .to_stream()
    .key_transform("image", lambda x: x.astype("float32") / 255)
    .batch(32)
    .prefetch(prefetch_size=8, num_threads=4)
)
```

--------------------------------

### Add bzip2 Dependency with Patch

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds bzip2, applying a patch for -fPIC compilation. It's configured to install into the 'deps' directory.

```cmake
ExternalProject_Add(
  bzip2
  URL https://sourceware.org/pub/bzip2/bzip2-1.0.8.tar.gz
  PATCH_COMMAND patch -p1 < ${CMAKE_SOURCE_DIR}/cmake/bzip2-1.0.8.patch
  CONFIGURE_COMMAND ""
  BUILD_COMMAND ""
  INSTALL_COMMAND make install PREFIX=${CMAKE_BINARY_DIR}/deps
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Build Standalone C++ Library

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Builds and installs MLX Data as a standalone C++ static library. This allows linking MLXData into other C++ projects.

```bash
mkdir build && cd build
cmake ..
make -j
sudo make install
```

--------------------------------

### Add zlib Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds zlib, a compression library.
It's configured for static linking and PIC (Position Independent Code) compilation, installing into the 'deps' directory.

```cmake
ExternalProject_Add(
  zlib
  URL https://www.zlib.net/zlib-1.3.1.tar.gz
  CONFIGURE_COMMAND
    ${CMAKE_COMMAND} -E env PATH=${PATH} PKG_CONFIG_PATH=${PKG_CONFIG_PATH}
    CFLAGS=-fPIC ./configure --prefix=${CMAKE_BINARY_DIR}/deps --static
  BUILD_COMMAND make
  INSTALL_COMMAND make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add xz Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds xz, a compression library. It's configured with specific CMake arguments to disable shared libraries, documentation, and scripts, ensuring static linking and PIC support, and installs into the 'deps' directory.

```cmake
ExternalProject_Add(
  xz
  URL https://downloads.sourceforge.net/project/lzmautils/xz-5.8.1.tar.gz
  CMAKE_ARGS -DCMAKE_BUILD_TYPE=Release
             -DBUILD_SHARED_LIBS=OFF
             -DXZ_TOOL_XZ=OFF
             -DXZ_TOOL_XZDEC=OFF
             -DXZ_TOOL_LZMADEC=OFF
             -DXZ_TOOL_LZMAINFO=OFF
             -DXZ_DOC=OFF
             -DENABLE_SCRIPTS=OFF
             -DCMAKE_POSITION_INDEPENDENT_CODE=ON
             -DCMAKE_PREFIX_PATH=${CMAKE_BINARY_DIR}/deps
             -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/deps
  INSTALL_DIR ${CMAKE_BINARY_DIR}/deps
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add zstd Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds zstd, a fast compression algorithm. It's configured for static linking, disabling multithreading and programs, and setting macOS specific architectures if applicable. Installs into the 'deps' directory.
```cmake
if(APPLE)
  set(OSX_ARCHITECTURES "x86_64$x86_64h$arm64")
endif()

ExternalProject_Add(
  zstd
  URL https://github.com/facebook/zstd/archive/refs/tags/v1.5.7.tar.gz
  CMAKE_ARGS -DCMAKE_BUILD_TYPE=Release
             -DCMAKE_OSX_ARCHITECTURES=${OSX_ARCHITECTURES}
             -DZSTD_BUILD_SHARED=OFF
             -DZSTD_MULTITHREAD_SUPPORT=OFF
             -DZSTD_BUILD_PROGRAMS=OFF
             -DCMAKE_PREFIX_PATH=${CMAKE_BINARY_DIR}/deps
             -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/deps
  SOURCE_SUBDIR build/cmake
  INSTALL_DIR ${CMAKE_BINARY_DIR}/deps
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add xvidcore External Project

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Configures the xvidcore project, specifying its URL, patch command, and custom configure, build, and install commands. It's built in-source and uses a specific patch to disable shared libraries.

```cmake
ExternalProject_Add(
  xvidcore
  DEPENDS nasm
  URL https://downloads.xvid.com/downloads/xvidcore-1.3.7.tar.bz2
  PATCH_COMMAND patch -p1 < ${CMAKE_SOURCE_DIR}/cmake/xvidcore-1.3.7.patch
  CONFIGURE_COMMAND
    cd build/generic && ${CMAKE_COMMAND} -E env PATH=${PATH}
    PKG_CONFIG_PATH=${PKG_CONFIG_PATH} CFLAGS=-fPIC ./configure
    --prefix=${CMAKE_BINARY_DIR}/deps
  BUILD_COMMAND
    cd build/generic && ${CMAKE_COMMAND} -E env PATH=${PATH}
    PKG_CONFIG_PATH=${PKG_CONFIG_PATH} make -j1
  INSTALL_COMMAND cd build/generic && make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Add pkg-config Dependency

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Downloads and builds pkg-config, a dependency for managing compile/link flags. It's configured to install into the local 'deps' directory and disables shared library building.
```cmake
ExternalProject_Add(
  pkg-config
  URL http://pkgconfig.freedesktop.org/releases/pkg-config-0.29.2.tar.gz
      http://fresh-center.net/linux/misc/pkg-config-0.29.2.tar.gz
  CONFIGURE_COMMAND
    ${CMAKE_COMMAND} -E env CFLAGS=-Wno-int-conversion
    CXXFLAGS=-Wno-int-conversion ./configure --prefix=${CMAKE_BINARY_DIR}/deps
    --disable-shared --with-internal-glib
  BUILD_COMMAND make
  INSTALL_COMMAND make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Serve Local Documentation

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/README.md

Run a local HTTP server in the mlx/docs/build/html/ directory to view the built documentation. Point your browser to http://localhost:.

```bash
python -m http.server
```

--------------------------------

### Build HTML Documentation

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/README.md

Build the HTML version of the documentation from the mlx/docs/ directory using the make html command.

```bash
make html
```

--------------------------------

### Stream Processing with Batching and Prefetching

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Demonstrates creating a stream from a buffer, applying batching, and then using non-deterministic prefetching for efficient iteration.

```python
# We can define the rest of the processing pipeline using streams.
# 1. First shuffle the buffer
# 2. Make a stream
# 3. Batch and then prefetch
dset = (
    dset
    .shuffle()
    .to_stream()  # <-- making a stream from the shuffled buffer
    .batch(32)
    .prefetch(8, 4)  # <-- prefetch 8 batches using 4 threads
)

# Now we can iterate over dset
sample = next(dset)
```

--------------------------------

### Stream Processing with Ordered Prefetching

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/buffers_streams_samples.md

Shows how to create a stream from a buffer and apply batching with deterministic prefetching using `ordered_prefetch`.
```python
# We can define the rest of the processing pipeline using streams.
# 1. First shuffle the buffer
# 2. Make a stream
# 3. Batch and then prefetch
dset = (
    dset
    .shuffle()
    .batch(32)
    .ordered_prefetch(8, 4)  # <-- prefetch 8 batches in a stream using 4 threads
)

# Now we can iterate over dset
sample = next(dset)
```

--------------------------------

### Create and Access Buffer Elements

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/buffer.md

Demonstrates creating a buffer from a vector, applying a key transformation, and accessing elements by index. Use this to create and manipulate in-memory datasets.

```python
import mlx.data as dx

numbers = dx.buffer_from_vector([{"x": i} for i in range(10)])
evens = numbers.key_transform("x", lambda x: 2*x)

print(evens)  # prints Buffer(size=10, keys={'x'})
print(evens[3])  # prints {'x': array(6)}
print(len(evens))  # prints 10
```

--------------------------------

### Initialize and Use AWSFileFetcher

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/miscellaneous.md

Instantiate an AWSFileFetcher for downloading files from S3-compatible storage with local caching. Prefetch files in the background for improved performance.
```python
from pathlib import Path

from mlx.data.core import AWSFileFetcher

LOCAL_CACHE = Path("/path/to/local/cache")

ff = AWSFileFetcher(
    "my-cool-bucket",
    endpoint="https://my.endpoint.com/",
    local_prefix=LOCAL_CACHE,
    num_kept_files=100,
)

# When fetch returns my/remote/path/foo.npy will be in LOCAL_CACHE
ff.fetch("my/remote/path/foo.npy")
assert (LOCAL_CACHE / "my/remote/path/foo.npy").is_file()

# We can prefetch in the background
ff.prefetch(["foo_1.npy", "foo_2.npy"])
ff.fetch("foo_1.npy")
# process foo_1 while foo_2 downloads in the background
```

--------------------------------

### Build CharTrie and Tokenizer Manually

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/tokenizing.md

Demonstrates building a CharTrie from a list of words and then creating a Tokenizer. Useful for custom vocabularies.

```python
from mlx.data.core import CharTrie, Tokenizer

# We can build a trie ourselves
trie = CharTrie()
for t in b"a quick brown fox jumped over the lazy dog".split():
    trie.insert(t)
trie.insert(b" ")

tokenizer = Tokenizer(trie)
print(tokenizer.tokenize_shortest(b"a quick brown fox jumped over the lazy dog"))
# [0, 9, 1, 9, 2, 9, 3, 9, 4, 9, 5, 9, 6, 9, 7, 9, 8]
```

```python
# We can also add all the letters in the trie and then tokenize anything we want
import string

for l in string.ascii_letters:
    trie.insert(bytes(l, "utf-8"))

print(tokenizer.tokenize_shortest(b"This is a quick example"))
# [54, 16, 17, 27, 9, 17, 27, 9, 0, 9, 1, 9, 13, 32, 0, 21, 24, 20, 13]
```

--------------------------------

### Shuffle, batch, and prefetch with Buffer and Stream APIs

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Demonstrates chaining transformations on a Buffer, including shuffling, converting to a Stream, batching, and prefetching. Supports both non-deterministic multi-threaded streams and deterministic ordered prefetching.
```python
import mlx.data as dx
from mlx.data.datasets import load_mnist

mnist = load_mnist(train=True)  # Buffer(size=60000, keys={'label', 'image'})

# Non-deterministic multi-threaded stream
stream = (
    mnist
    .shuffle()
    .to_stream()
    .batch(32)
    .prefetch(prefetch_size=8, num_threads=4)
)

for batch in stream:
    images = batch["image"]  # shape (32, 28, 28, 1), dtype uint8
    labels = batch["label"]  # shape (32,)
    break

stream.reset()  # restart iteration

# Deterministic ordered prefetch (stays a Buffer)
ordered = (
    mnist
    .shuffle()
    .batch(32)
    .ordered_prefetch(num_prefetch=8, num_threads=4)
)
```

--------------------------------

### Load Common Datasets (MNIST, CIFAR-10, LibriSpeech, WikiText)

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Provides pre-built loaders for common datasets like MNIST, CIFAR-10, LibriSpeech, and WikiText. Data is downloaded and cached locally by default.

```python
import mlx.data as dx
from mlx.data.datasets import (
    load_mnist,
    load_fashion_mnist,
    load_cifar10,
    load_cifar100,
    load_imagenet,
    load_librispeech,
    load_speechcommands,
    load_wikitext_lines,
    load_images_from_folder,
)

# MNIST: Buffer(size=60000, keys={'image', 'label'})
mnist_train = load_mnist(train=True)
mnist_test = load_mnist(train=False)

train_iter = (
    mnist_train
    .shuffle()
    .to_stream()
    .key_transform("image", lambda x: (x.astype("float32") / 255).ravel())
    .batch(128)
    .prefetch(4, 2)
)
print(next(train_iter)["image"].shape)  # (128, 784)

# CIFAR-10
cifar = load_cifar10(train=True)  # Buffer(size=50000, keys={'image','label'})

# LibriSpeech
libri = load_librispeech(split="train-clean-100")  # Buffer

# WikiText-103: returns a Stream of lines
wiki = load_wikitext_lines(split="train")  # Stream

# Load images from a custom folder (one subfolder = one class)
folder_dset = load_images_from_folder("path/to/imagenet/train")
```

--------------------------------

### Build AWS SDK from Source on Ubuntu

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/install.md

Builds
the AWS SDK from source on Ubuntu, specifically for S3 access. This is only needed if you require S3 functionality and are not using prebuilt binaries.

```bash
sudo apt install libcurl4-openssl-dev libssl-dev
git clone --depth 1 --recurse-submodules https://github.com/aws/aws-sdk-cpp.git
cd aws-sdk-cpp
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3" -DBUILD_SHARED_LIBS=OFF
make -j
sudo make install
```

--------------------------------

### Iterate Through MLX Stream for Training

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Demonstrates how to use the created MLX stream for training, including resetting the stream, iterating through batches, and displaying a sample image.

```python
import matplotlib.pyplot as plt
import mlx.core as mx

train_stream = hf_dataset_to_mlx_stream(ds['train'], shuffle=True)
test_stream = hf_dataset_to_mlx_stream(ds['test'], shuffle=False)

train_stream.reset()
for batch in train_stream:
    (X, y) = mx.array(batch['image']), mx.array(batch['label'])
    print('The image should display a ', y[0].item())
    plt.imshow(X[0])
    break
```

--------------------------------

### Run Pre-commit Hooks for All Files

Source: https://github.com/ml-explore/mlx-data/blob/main/CONTRIBUTING.md

Execute all pre-commit hooks on all files in the repository to ensure consistent code style.

```bash
pre-commit run --all-files
```

--------------------------------

### Tokenize and Prepare Wikitext Dataset for Language Modeling

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Prepares the wikitext dataset for language modeling by tokenizing lines with a provided vocabulary trie, then filtering, batching, and applying sliding windows. The throughput figure in the code was measured on an M2 MacBook Air.
```python
from mlx.data.datasets import load_wikitext_lines
from mlx.data.tokenizer_helpers import read_trie_from_vocab

# Stream of text lines; this call mirrors the dataset-loader example elsewhere
# in this document and is an assumption about how `wiki` was created.
wiki = load_wikitext_lines(split="train")

workers = 8
trie = read_trie_from_vocab("/path/to/vocab.txt")

wiki_iterator = (
    wiki
    .tokenize("line", trie, output_key="tokens")
    .filter_key("tokens")
    .prefetch(512, workers)
    .batch(128, dim=dict(tokens=0))  # gather everything in a big array of tokens
    .sliding_window("tokens", 1025, 1025)
    .shape("tokens", "tokens_length", 0)
    .batch(32)  # actual batch size
    .prefetch(2, 1)
)

# The above can be iterated at approximately 2.5M tok/s on an M2 Macbook Air.
```

--------------------------------

### Load and Prepare MNIST Dataset for MLP Training

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/quick_start.md

Imports MLX Data and the MNIST dataset loader. It then shuffles the buffer, converts it to a stream, flattens the images, batches, and prefetches data for efficient MLP training. Finally, it shows how to iterate over the prepared batches.

```python
# This is the standard way to import and access mlx.data
import mlx.data as dx

# Let's import MNIST loading
from mlx.data.datasets import load_mnist

# Loads a buffer with the MNIST images
mnist_train = load_mnist(train=True)

# Let's shuffle, flatten, and batch to prepare for MLP training
mnist_mlp = (
    mnist_train
    .shuffle()
    .to_stream()
    .key_transform("image", lambda x: x.astype("float32").reshape(-1))
    .batch(32)
    .prefetch(4, 2)
)

# Now we can iterate over the batches in normal python
for batch in mnist_mlp:
    x, y = batch["image"], batch["label"]
```

--------------------------------

### Run WikiText Benchmark with Custom Paths

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/wikitext/README.md

Execute the benchmark script by specifying the paths to your tokenizer model and the extracted WikiText dataset. The command also limits OpenMP to a single thread via `OMP_NUM_THREADS=1`.
```bash
OMP_NUM_THREADS=1 python mlx_data.py \
  --tokenizer_file /path/to/tokenizer.model \
  /path/to/wikitext/wikitext-103-raw
```

--------------------------------

### Set PATH and PKG_CONFIG_PATH

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Configures environment variables for the build process, ensuring that executables and pkg-config files from downloaded dependencies are found.

```cmake
set(PATH ${CMAKE_BINARY_DIR}/deps/bin:$ENV{PATH})
set(PKG_CONFIG_PATH
    ${CMAKE_BINARY_DIR}/deps/lib/pkgconfig:${CMAKE_BINARY_DIR}/deps/lib64/pkgconfig
)
```

--------------------------------

### Combine Dataset Loading, Conversion, and Stream Creation

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

A comprehensive function that loads a Hugging Face dataset, converts it to NumPy arrays, and creates a preprocessed MLX stream ready for training.

```python
import numpy as np
import mlx.data as dx


# Convert the content of the dataset into numpy arrays
def huggingface_to_array_of_dict(dataset):
    return [
        {"image": np.array(image).copy(), "label": label}
        for label, image in zip(dataset['label'], dataset['image'])
    ]


# Convert the Hugging Face dataset to a stream of batches
def hf_dataset_to_mlx_stream(dataset, shuffle=False):
    numpy_data = huggingface_to_array_of_dict(dataset)
    buffer = dx.buffer_from_vector(numpy_data)
    if shuffle:
        buffer = buffer.shuffle()
    return (
        buffer
        .to_stream()
        .key_transform("image", lambda x: x.astype("float32") / 255)
        .batch(32)
        .prefetch(prefetch_size=8, num_threads=4)
    )
```

--------------------------------

### Read CharTrie from SentencePiece Model

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/tokenizing.md

Shows how to load a CharTrie and associated weights from a SentencePiece model file for efficient tokenization. Recommended for using pre-trained models.
```python
from mlx.data.core import Tokenizer
from mlx.data.tokenizer_helpers import read_trie_from_spm

trie, weights = read_trie_from_spm("path/to/spm/model")
tokenizer = Tokenizer(trie, trie_key_scores=weights)
tokenizer.tokenize_shortest(b"This is some more text to tokenize")
```

--------------------------------

### Run mlx.data Benchmark

Source: https://github.com/ml-explore/mlx-data/blob/main/benchmarks/comparative/librispeech/README.md

Execute the mlx.data benchmark by specifying the tokenizer file and the path to the extracted LibriSpeech dataset. Set OMP_NUM_THREADS=1 to prevent thread contention.

```bash
OMP_NUM_THREADS=1 python mlx_data.py \
  --tokenizer_file /path/to/tokenizer.model \
  /path/to/librispeech/LibriSpeech/dev-clean
```

--------------------------------

### Load and Process Images with MLX Data Pipeline

Source: https://github.com/ml-explore/mlx-data/blob/main/README.md

This pipeline demonstrates loading images, resizing, cropping, batching, scaling, and prefetching. It uses a Python function to prepare data samples and then applies various transformations using the MLX Data API.

```python
from pathlib import Path

import mlx.data as dx

# `root` (a Path to the image folder) and `batch_size` are assumed to be
# defined by the caller; the excerpt does not show them.


# A simple python function returning a list of dicts. All samples in MLX data
# are dicts of arrays.
def files_and_classes(root: Path):
    files = [str(f) for f in root.glob("**/*.jpg")]
    files = [f for f in files if "BACKGROUND" not in f]
    classes = dict(
        map(reversed, enumerate(sorted(set(f.split("/")[-2] for f in files))))
    )
    return [
        dict(image=f.encode("ascii"), label=classes[f.split("/")[-2]]) for f in files
    ]


dset = (
    # Make a buffer (finite length container of samples) from the python list
    dx.buffer_from_vector(files_and_classes(root))
    # Shuffle and transform to a stream
    .shuffle()
    .to_stream()
    # Implement a simple image pipeline. No random augmentations here but they
    # could be applied.
    .load_image("image")  # load the file pointed to by the 'image' key as an image
    .image_resize_smallest_side("image", 256)
    .image_center_crop("image", 224, 224)
    # Accumulate into batches
    .batch(batch_size)
    # Cast to float32 and scale to [0, 1]. We do this in python and we could
    # have done any transformation we could think of.
    .key_transform("image", lambda x: x.astype("float32") / 255)
    # Finally, fetch batches in background threads
    .prefetch(prefetch_size=8, num_threads=8)
)

# dset is a python iterable so one could simply
for sample in dset:
    # access sample["image"] and sample["label"]
    pass
```

--------------------------------

### Configure Target Properties

Source: https://github.com/ml-explore/mlx-data/blob/main/python/src/CMakeLists.txt

Sets include directories, link libraries, and compile definitions for the '_c' target. This ensures the extension can access necessary headers, link against the 'mlxdata' library, and use version-specific definitions.

```cmake
target_include_directories(_c PUBLIC ${CMAKE_SOURCE_DIR})
target_link_libraries(_c PRIVATE mlxdata)
target_compile_definitions(_c PRIVATE _VERSION_=${MLX_DATA_VERSION})
```

--------------------------------

### Load and Transform Images with MLX Data

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Loads images, applies transformations like resizing and cropping, normalizes pixel values, and batches the data. Ensure the 'image' key exists in your samples.
```python
import mlx.data as dx

# `files_and_classes` and `root` are not defined in this excerpt; they are
# assumed to match the helper and dataset path from the README image pipeline.
dset = (
    dx.buffer_from_vector(files_and_classes(root))
    .shuffle()
    .to_stream()
    # Decode JPEG file pointed to by 'image' key -> HWC uint8 array
    .load_image("image")
    # Resize so the smallest side is 256 pixels
    .image_resize_smallest_side("image", 256)
    # Center crop to 224x224
    .image_center_crop("image", 224, 224)
    # Accumulate into batches of 32
    .batch(32)
    # Cast to float32 and normalize to [0, 1] using a Python lambda
    .key_transform("image", lambda x: x.astype("float32") / 255)
    # Prefetch 8 batches using 8 background threads
    .prefetch(prefetch_size=8, num_threads=8)
)

for batch in dset:
    x = batch["image"]  # (32, 224, 224, 3), float32
    y = batch["label"]  # (32,), int
```

--------------------------------

### Fast Tokenization with CharTrie

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Tokenizes byte strings using a CharTrie for shortest-path or unigram tokenization. This process does not hold the GIL, allowing for parallel execution.

```python
from mlx.data.core import CharTrie, Tokenizer
from mlx.data.tokenizer_helpers import (
    read_trie_from_spm,
    read_bpe_from_spm,
    read_bpe_from_hf,
    read_trie_from_vocab,
    gpt2_byte_map,
)
import mlx.data as dx

# --- CharTrie tokenizer (unigram / shortest path) ---
trie = CharTrie()
for word in b"a quick brown fox jumped over the lazy dog".split():
    trie.insert(word)
trie.insert(b" ")

tokenizer = Tokenizer(trie)
print(tokenizer.tokenize_shortest(b"a quick brown fox"))
# [0, 9, 1, 9, 2, 9, 3]
```

--------------------------------

### Load and Tokenize from SentencePiece Model

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Loads a SentencePiece model for tokenization and applies it to a text corpus. It includes steps for streaming, tokenizing, filtering, padding, chunking, and batching data.
```python
import mlx.data as dx
from mlx.data.tokenizer_helpers import read_trie_from_spm

trie, weights = read_trie_from_spm("path/to/tokenizer.model")

dset = (
    dx.stream_python_iterable(lambda: ({"text": line} for line in open("corpus.txt", "rb")))
    .tokenize("text", trie, trie_key_scores=weights, output_key="tokens")
    .filter_key("text", remove=True)  # drop raw text key
    .pad("tokens", 0, 1, 0, trie.search("").id)  # prepend BOS
    .pad("tokens", 0, 0, 1, trie.search("").id)  # append EOS
    .sliding_window("tokens", 1025, 1025)  # chunk into context windows
    .shape("tokens", "tokens_length", 0)
    .batch(32)
    .prefetch(4, 4)
)
```

--------------------------------

### Create and Use a Stream from a Python Iterable

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/stream.md

Demonstrates creating a stream from a large Python iterable and applying a filtering transformation. Note that streams are pointers, so advancing one stream also advances the streams derived from it. Streams can be reset if the source supports it.

```python
import mlx.data as dx

# The samples are never all instantiated
numbers = dx.stream_python_iterable(lambda: ({"x": i} for i in range(10**10)))

# Filtering is done with transforms returning an empty sample
evens = numbers.sample_transform(lambda s: s if s["x"] % 2 == 0 else dict())

print(next(numbers))  # prints {'x': array(0)}
print(next(numbers))  # prints {'x': array(1)}

# Streams are pointers to the streams so evens is using numbers under the
# hood. Since numbers was advanced now evens is advanced as well.
print(next(evens))  # prints {'x': array(2)}
print(next(evens))  # prints {'x': array(4)}
print(next(numbers))  # prints {'x': array(5)}

# Streams can be reset.
evens.reset()
print(next(evens))
print(next(evens))
print(next(numbers))
# prints {'x': array(0)}
#        {'x': array(2)}
#        {'x': array(3)}
```

--------------------------------

### Apply Python Transforms to Data Samples

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Applies Python functions to data samples.
`key_transform` modifies a single key's value, while `sample_transform` operates on the entire sample dictionary. Use `*_if` variants for conditional transformations.

```python
import numpy as np
import mlx.data as dx

# Placeholder sample carrying every key used by the transforms in this snippet
dset = dx.buffer_from_vector([
    {
        "audio": np.random.randn(16000).astype("float32"),
        "image": np.zeros((32, 32, 3), dtype="uint8"),
        "label": 0,
    }
])

# Apply a function to a single key
dset = dset.key_transform("audio", lambda x: x / np.abs(x).max())

# Cross-key logic and conditional filtering
def augment_and_filter(sample):
    if sample["label"] < 0:
        return dict()  # drop sample
    sample["image"] = sample["image"].astype("float32") / 255.0
    sample["image"] = (1 + 0.1 * np.random.rand()) * sample["image"]
    return sample

dset = dset.to_stream().sample_transform(augment_and_filter)

# Conditional variants: every operation has a *_if form
enable_flip = True
dset = dset.image_random_h_flip_if(enable_flip, "image", 0.5)
```

--------------------------------

### MLX Data Stream Prefetching (Non-deterministic)

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Uses `Stream.prefetch` for non-deterministic, high-throughput prefetching with multiple threads. Suitable when the exact sample order within an epoch does not matter, for example after shuffling.

```python
import mlx.data as dx
from mlx.data.datasets import load_cifar10

buf = load_cifar10(train=True)

# Non-deterministic (highest throughput)
stream_nondet = (
    buf
    .shuffle()
    .to_stream()
    .load_image("image")
    .batch(64)
    .key_transform("image", lambda x: x.astype("float32") / 255)
    # Keyword name matches the other prefetch calls in this document
    .prefetch(prefetch_size=8, num_threads=8)
)

for batch in stream_nondet:
    pass  # iterate one epoch
stream_nondet.reset()  # reset for next epoch
```

--------------------------------

### MLX Data Stream Prefetching (Deterministic)

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Uses `Buffer.ordered_prefetch` to prefetch samples while preserving order after an initial shuffle. Ensures deterministic iteration order for each epoch.
```python
import mlx.data as dx
from mlx.data.datasets import load_cifar10

buf = load_cifar10(train=True)

# Deterministic ordered prefetch (same order every epoch after shuffle)
stream_det = (
    buf
    .shuffle()
    .load_image("image")
    .batch(64)
    .key_transform("image", lambda x: x.astype("float32") / 255)
    # Positional arguments: prefetch 8 batches using 8 threads
    .ordered_prefetch(8, 8)
)

for batch in stream_det:
    pass  # iterate one epoch
stream_det.reset()  # reset for next epoch
```

--------------------------------

### Load and Process Audio Files with MLX Data

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Loads audio files, removes the channel dimension for mono audio, extracts Mel filterbank features, and batches the results. Each sample must carry the audio file path under the 'audio' key.

```python
import mlx.data as dx
from mlx.data.features import mfsc
from mlx.data.tokenizer_helpers import read_trie_from_spm

trie, _ = read_trie_from_spm("tokenizer.model")

dset = (
    dx.buffer_from_vector([{"file": b"librispeech/train/**/*.txt"}])
    .to_stream()
    .line_reader_from_key("file", "line")
    # Split each transcript line into an utterance id and its text, then
    # derive the .flac path from the dash-separated id components
    .sample_transform(lambda s: {
        "audio": b"/".join(
            bytes(s["line"]).split(b" ", 1)[0].split(b"-")[:-1]
            + [bytes(s["line"]).split(b" ", 1)[0] + b".flac"]
        ),
        "transcript": bytes(s["line"]).split(b" ", 1)[1].lower(),
    })
    # Load audio file -> (T, C) int16 array
    .load_audio("audio", prefix="path/to/librispeech")
    # Drop channel dim: (T, 1) -> (T,)
    .squeeze("audio")
    # Extract 128-band log Mel filterbank features at 16 kHz
    .key_transform("audio", mfsc(n_filterbank=128, sampling_freq=16000))
    # Record audio length before batching
    .shape("audio", "audio_length", 0)
    .batch(32)
    .prefetch(8, 8)
)
```

--------------------------------

### Add ffmpeg External Project

Source: https://github.com/ml-explore/mlx-data/blob/main/super/CMakeLists.txt

Configures the ffmpeg project, specifying its URL, dependencies, and custom configure and build commands.
It's built in-source and includes various options to disable specific features and enable others like libvorbis and libopus.

```cmake
ExternalProject_Add(
  ffmpeg
  DEPENDS nasm zlib lame libogg opus libvorbis xvidcore pkg-config
  URL https://ffmpeg.org/releases/ffmpeg-7.1.1.tar.bz2
  CONFIGURE_COMMAND
    ${CMAKE_COMMAND} -E env PATH=${PATH} PKG_CONFIG_PATH=${PKG_CONFIG_PATH}
    ./configure --prefix=${CMAKE_BINARY_DIR}/deps --disable-shared --enable-pic
    --enable-runtime-cpudetect --enable-libvorbis --enable-libopus
    --disable-iconv --disable-programs --disable-doc --disable-htmlpages
    --disable-manpages --disable-podpages --disable-txtpages --disable-alsa
    --disable-sdl2 --disable-xlib --disable-cuda-llvm --disable-cuvid
    --disable-d3d11va --disable-dxva2 --disable-nvdec --disable-nvenc
    --disable-v4l2-m2m --disable-vdpau
    --pkg-config=${CMAKE_BINARY_DIR}/deps/bin/pkg-config
    "--extra-ldflags=-L${CMAKE_BINARY_DIR}/deps/lib -L${CMAKE_BINARY_DIR}/deps/lib64"
    "--extra-libs=-lvorbis -logg -lm"
  BUILD_COMMAND ${CMAKE_COMMAND} -E env
                PATH=${CMAKE_BINARY_DIR}/deps/bin:$ENV{PATH} make
  INSTALL_COMMAND make install
  BUILD_IN_SOURCE 1
  DOWNLOAD_EXTRACT_TIMESTAMP 1)
```

--------------------------------

### Create MLX Buffer from List of Dictionaries

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/hf_datasets_streams.md

Convert a list of dictionaries, where each dictionary represents a data sample (e.g., image and label), into an MLX Buffer.

```python
import mlx.data as dx

buffer = dx.buffer_from_vector(dicts)
```

--------------------------------

### Load and Inspect Wikitext-103 Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Loads the wikitext-103 dataset (training split) and prints its stream information. This dataset is often used for language modeling tasks.
```python
from mlx.data.datasets import load_wikitext_lines

wiki = load_wikitext_lines(split="train")
print(wiki)
# Downloading https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip 183.1MiB (9.9MiB/s)
# Computing hash of ..../.cache/mlx.data/wikitext/wikitext-103-raw-v1.zip |████████████████████████████████████████| 183.1MiB / 183.1MiB (1.0GiB/s)
# Extracting ..../.cache/mlx.data/wikitext/wikitext-103-raw-v1.zip 517.9MiB (318.2MiB/s)
# Stream()
```

--------------------------------

### Create Buffer from Python list with dx.buffer_from_vector

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Use `dx.buffer_from_vector` to create a Buffer from a list of sample dictionaries. This is suitable when the entire dataset fits in memory. Samples can map string keys to NumPy arrays, bytes, or scalars.

```python
from pathlib import Path
import mlx.data as dx

def files_and_classes(root: Path):
    # Every .jpg under root; the parent directory name is the category
    images = list(root.rglob("*.jpg"))
    categories = [p.relative_to(root).parent.name for p in images]
    category_set = set(categories)
    # Stable integer label per category, assigned in sorted order
    category_map = {c: i for i, c in enumerate(sorted(category_set))}
    return [
        {
            "image": str(p.relative_to(root)).encode("ascii"),  # bytes path
            "category": c,
            "label": category_map[c],  # int label
        }
        for c, p in zip(categories, images)
    ]

buf = dx.buffer_from_vector(files_and_classes(Path("path/to/dataset")))
print(buf)  # Buffer(size=9144, keys={'category', 'image', 'label'})
print(buf[0])  # {'category': array([...]), 'image': array([...]), 'label': array(42)}
print(len(buf))  # 9144
```

--------------------------------

### BPE Tokenizer from HuggingFace tokenizer.json

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Reads BPE symbols and merges from a HuggingFace tokenizer.json file and applies byte mapping for tokenization.
```python
import mlx.data as dx
from mlx.data.tokenizer_helpers import read_bpe_from_hf, gpt2_byte_map

symbols, merges = read_bpe_from_hf("tokenizer.json")
byte_map = gpt2_byte_map()

dset_bpe = (
    dx.buffer_from_vector([{"text": "Hello world!"}])
    # Map raw bytes to the GPT-2 byte alphabet before applying the merges
    .replace_bytes("text", byte_map)
    .tokenize_bpe("text", symbols, merges)
)
```

--------------------------------

### Convert HuggingFace Dataset to MLX Data Stream

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Converts a HuggingFace dataset to an MLX Data stream, processing images and labels. Images are converted to NumPy arrays, normalized, and flattened. Use this for image-based datasets from HuggingFace.

```python
import numpy as np
import mlx.data as dx
import mlx.core as mx
from datasets import load_dataset

ds = load_dataset("ylecun/mnist")

def hf_to_mlx_stream(hf_split, shuffle=False, batch_size=32):
    # Materialize the split as a list of sample dictionaries
    samples = [
        {"image": np.array(img).copy(), "label": lbl}
        for img, lbl in zip(hf_split["image"], hf_split["label"])
    ]
    buf = dx.buffer_from_vector(samples)
    if shuffle:
        buf = buf.shuffle()
    return (
        buf
        .to_stream()
        .key_transform("image", lambda x: (x.astype("float32") / 255).ravel())
        .batch(batch_size)
        .prefetch(prefetch_size=8, num_threads=4)
    )

train_stream = hf_to_mlx_stream(ds["train"], shuffle=True)
test_stream = hf_to_mlx_stream(ds["test"], shuffle=False)

train_stream.reset()
for batch in train_stream:
    X = mx.array(batch["image"])  # (32, 784)
    y = mx.array(batch["label"])  # (32,)
    # training step here
    break
```

--------------------------------

### Find Python and pybind11

Source: https://github.com/ml-explore/mlx-data/blob/main/python/src/CMakeLists.txt

Locates the Python interpreter and pybind11 components required for building the C++ extension. Ensures that the necessary Python development files and pybind11 configuration are available.
```cmake
find_package(
  Python
  COMPONENTS Interpreter Development.Module
  REQUIRED)
execute_process(
  COMMAND "${Python_EXECUTABLE}" -m pybind11 --cmakedir
  OUTPUT_STRIP_TRAILING_WHITESPACE
  OUTPUT_VARIABLE pybind11_ROOT)
find_package(pybind11 CONFIG REQUIRED)
```

--------------------------------

### Process and Batch MNIST Dataset

Source: https://github.com/ml-explore/mlx-data/blob/main/docs/src/python/common_datasets.md

Shuffles the MNIST buffer, converts it to a stream, scales the images to [0, 1] and flattens them, then batches the result. Prefetching loads batches in background threads.

```python
from mlx.data.datasets import load_mnist

# Assumption: `mnist` is the training buffer returned by load_mnist
mnist = load_mnist(train=True)

mnist_iter = (
    mnist
    .shuffle()
    .to_stream()
    .key_transform("image", lambda x: (x.astype("float32") / 255).ravel())
    .batch(128)
    .prefetch(4, 2)
)
print(next(mnist_iter)["image"].shape)  # (128, 784)
```

--------------------------------

### AWSFileFetcher for Remote S3 File Fetching

Source: https://context7.com/ml-explore/mlx-data/llms.txt

Fetches files from an S3-compatible bucket into a local cache, supporting background prefetching. It can be integrated with I/O operations via the `file_fetcher` parameter.

```python
from pathlib import Path
from mlx.data.core import AWSFileFetcher
import mlx.data as dx

LOCAL_CACHE = Path("/tmp/s3_cache")

ff = AWSFileFetcher(
    "my-dataset-bucket",
    endpoint="https://s3.us-east-1.amazonaws.com/",
    local_prefix=LOCAL_CACHE,
    num_kept_files=500,  # LRU cache: keep at most 500 files locally
)

# Standalone usage
ff.fetch("data/train/image_001.jpg")
assert (LOCAL_CACHE / "data/train/image_001.jpg").is_file()

# Background prefetch while processing current file
ff.prefetch(["data/train/image_002.jpg", "data/train/image_003.jpg"])
ff.fetch("data/train/image_002.jpg")  # likely already cached

# In a pipeline: pass file_fetcher to load_image
samples = [{"image": b"data/train/image_001.jpg", "label": 0}]
dset = (
    dx.buffer_from_vector(samples)
    .to_stream()
    .load_image("image", file_fetcher=ff)
    .image_resize_smallest_side("image", 256)
    .batch(32)
    .prefetch(4, 4)
)
```
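
The `num_kept_files` comment above describes the local cache as an LRU cache. The sketch below illustrates that eviction policy in plain Python, assuming an in-memory dictionary in place of files on disk. `LRUFileCache`, `fake_download`, and the `evicted` list are hypothetical names made up for this illustration; this is not how `AWSFileFetcher` is implemented.

```python
from collections import OrderedDict


class LRUFileCache:
    """Hypothetical illustration of LRU eviction; not the AWSFileFetcher code."""

    def __init__(self, num_kept_files, download):
        self.num_kept_files = num_kept_files
        self.download = download  # callable: key -> file contents
        self.files = OrderedDict()  # key -> contents, least recently used first
        self.evicted = []  # keys dropped so far, kept only for inspection

    def fetch(self, key):
        if key in self.files:
            # Cache hit: mark the entry as most recently used
            self.files.move_to_end(key)
            return self.files[key]
        # Cache miss: download, then evict least recently used entries
        self.files[key] = self.download(key)
        while len(self.files) > self.num_kept_files:
            old_key, _ = self.files.popitem(last=False)
            self.evicted.append(old_key)
        return self.files[key]


downloads = []


def fake_download(key):
    # Stand-in for a remote fetch; records which keys were downloaded
    downloads.append(key)
    return b"contents of " + key.encode("ascii")


cache = LRUFileCache(num_kept_files=2, download=fake_download)
cache.fetch("a.jpg")
cache.fetch("b.jpg")
cache.fetch("a.jpg")  # hit: "a.jpg" becomes the most recently used entry
cache.fetch("c.jpg")  # miss: evicts "b.jpg", the least recently used entry
print(list(cache.files))  # ['a.jpg', 'c.jpg']
print(downloads)  # ['a.jpg', 'b.jpg', 'c.jpg']
```

Under this policy a larger `num_kept_files` trades local disk space for fewer repeated downloads when the same files are read across epochs.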