### Web Demo Server Launch

Source: https://context7.com/google-research/pix2struct/llms.txt

Start an interactive web demo server for model predictions.

```bash
# Start the web demo server on port 8080
python -m pix2struct.demo \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/chartqa_base/checkpoint_287600'"

# Access at http://localhost:8080
# Upload an image and optionally provide a question prompt
# The demo will display the model's prediction

# Custom port configuration
python -m pix2struct.demo \
  --port=9000 \
  ... # other gin flags
```

--------------------------------

### Install Pix2Struct and Dependencies

Source: https://context7.com/google-research/pix2struct/llms.txt

Clone the repository, set up a conda environment, and install the package with development dependencies. The snippet also sets the environment variables for the GCS storage directory and the GCP project used by Dataflow preprocessing.

```bash
# Clone the repository
git clone https://github.com/google-research/pix2struct.git
cd pix2struct

# Create and activate conda environment
conda create -n pix2struct python=3.9
conda activate pix2struct

# Install the package with dev dependencies
pip install -e ."[dev]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Run tests to verify installation
pytest

# Set up GCS storage directory for data and models
export PIX2STRUCT_DIR="gs://<your_bucket>/<path_to_pix2struct_dir>"

# Set up GCP project for Dataflow preprocessing
export GCP_PROJECT=<your_project_id>
export GCP_REGION=<your_region>
```

--------------------------------

### Install Pix2Struct Environment

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Commands to clone the repository, create a conda environment, and install dependencies.

```bash
git clone https://github.com/google-research/pix2struct.git
cd pix2struct
conda create -n pix2struct python=3.9
conda activate pix2struct
pip install -e ."[dev]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pytest
```

--------------------------------

### Run Pix2Struct Inference via Command Line

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Runs inference on a single example from the command line. This path has only been tested at small scale and is not recommended for large-scale inference. Make sure the size config (here sizes/base.gin) matches the checkpoint being loaded.

```bash
python -m pix2struct.example_inference \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/textcaps_base/checkpoint_280400'" \
  --image=$HOME/test_image.jpg
```
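
One way to keep the size config and the checkpoint in sync is to derive the gin file from the checkpoint path. The sketch below assumes the `<task>_base` / `<task>_large` directory naming seen in the checkpoint paths in this document; `size_gin_for_checkpoint` is a hypothetical helper, not part of pix2struct.

```python
import re


def size_gin_for_checkpoint(checkpoint_path: str) -> str:
    """Guess the size gin file from a path like .../textcaps_base/checkpoint_280400.

    Assumption: the parent directory name ends in `_base` or `_large`.
    """
    match = re.search(r"_(base|large)/checkpoint_\d+/?$", checkpoint_path)
    if match is None:
        raise ValueError(f"Cannot infer model size from {checkpoint_path!r}")
    return f"sizes/{match.group(1)}.gin"


print(size_gin_for_checkpoint("gs://pix2struct-data/textcaps_base/checkpoint_280400"))
# sizes/base.gin
```

The returned value can be passed as the third `--gin_file` flag. Paths that do not follow the naming convention raise an error instead of silently picking a size.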

--------------------------------

### Prepare Widget Captioning Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads and converts the Widget Captioning dataset. Requires the RICO dataset to be set up first.

```bash
mkdir -p data/widget_captioning
cd data/widget_captioning
git clone https://github.com/google-research-datasets/widget-caption.git
cp widget-caption/widget_captions.csv ./
cp widget-caption/split/*.txt ./
mv dev.txt val.txt
rm -rf widget-caption
cd ..
gsutil -m cp -r widget_captioning $PIX2STRUCT_DIR/data/widget_captioning
python -m pix2struct.preprocessing.convert_widget_captioning \
  --data_dir=$PIX2STRUCT_DIR/data/widget_captioning \
  --image_dir=$PIX2STRUCT_DIR/data/rico_images \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Prepare RefExp Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads and converts the RefExp dataset. Requires the RICO dataset to be set up first.

```bash
mkdir -p data/refexp
cd data/refexp
wget https://github.com/google-research-datasets/uibert/raw/main/ref_exp/train.tfrecord
wget https://github.com/google-research-datasets/uibert/raw/main/ref_exp/dev.tfrecord
wget https://github.com/google-research-datasets/uibert/raw/main/ref_exp/test.tfrecord
mv dev.tfrecord val.tfrecord
cd ..
gsutil -m cp -r refexp $PIX2STRUCT_DIR/data/refexp
python -m pix2struct.preprocessing.convert_refexp \
  --data_dir=$PIX2STRUCT_DIR/data/refexp \
  --image_dir=$PIX2STRUCT_DIR/data/rico_images \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Run Pix2Struct Inference via Web Demo

Source: https://github.com/google-research/pix2struct/blob/main/README.md

This command launches a web-based demo for inference, accessible at localhost:8080. It allows uploading custom images and prompts. Ensure the config file matches the checkpoint.

```bash
python -m pix2struct.demo \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/textcaps_base/checkpoint_280400'"
```

--------------------------------

### Provision Cloud TPU VM

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Creates and connects to a v3-8 TPU VM instance for model training.

```bash
TPU_TYPE=v3-8
TPU_NAME=pix2struct-$TPU_TYPE
TPU_ZONE=europe-west4-a
gcloud compute tpus tpu-vm create $TPU_NAME \
  --zone=$TPU_ZONE \
  --accelerator-type=$TPU_TYPE \
  --version=tpu-vm-base
gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$TPU_ZONE
```

--------------------------------

### Preprocess ChartQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Download and convert the ChartQA dataset.

```bash
mkdir -p data/chartqa
cd data/chartqa
git clone https://github.com/vis-nlp/ChartQA.git
cp -r ChartQA/ChartQA\ Dataset/* ./
rm -rf ChartQA
cd ..
gsutil -m cp -r chartqa $PIX2STRUCT_DIR/data/chartqa
python -m pix2struct.preprocessing.convert_chartqa \
  --data_dir=$PIX2STRUCT_DIR/data/chartqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Prepare Screen2Words Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads and converts the Screen2Words dataset. Requires the RICO dataset to be set up first.

```bash
cd data
git clone https://github.com/google-research-datasets/screen2words.git
gsutil -m cp -r screen2words $PIX2STRUCT_DIR/data/screen2words
python -m pix2struct.preprocessing.convert_screen2words \
  --screen2words_dir=$PIX2STRUCT_DIR/data/screen2words \
  --rico_dir=$PIX2STRUCT_DIR/data/rico_images \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Model Size Configurations and Checkpoints

Source: https://context7.com/google-research/pix2struct/llms.txt

Defines configurations for the Base (282M parameters) and Large (1.3B parameters) model sizes using gin files, and lists pretrained and finetuned checkpoint paths for various tasks and model sizes.

```python
# Base model configuration (sizes/base.gin)
# NUM_ENCODER_LAYERS = 12
# NUM_DECODER_LAYERS = 12
# NUM_HEADS = 12
# HEAD_DIM = 64
# MLP_DIM = 2048
# EMBED_DIM = 768

# Large model configuration (sizes/large.gin)
# NUM_ENCODER_LAYERS = 18
# NUM_DECODER_LAYERS = 18
# NUM_HEADS = 24
# HEAD_DIM = 64
# MLP_DIM = 3968
# EMBED_DIM = 1536

# Pretrained checkpoint paths
CHECKPOINTS = {
    "base": "gs://pix2struct-data/pix2struct_base/checkpoint_900200",
    "large": "gs://pix2struct-data/pix2struct_large/checkpoint_900200",
}

# Finetuned checkpoints for specific tasks
FINETUNED_CHECKPOINTS = {
    "textcaps_base": "gs://pix2struct-data/textcaps_base/checkpoint_280400",
    "textcaps_large": "gs://pix2struct-data/textcaps_large/checkpoint_180600",
    "chartqa_base": "gs://pix2struct-data/chartqa_base/checkpoint_287600",
    "docvqa_base": "gs://pix2struct-data/docvqa_base/checkpoint_284400",
    "screen2words_base": "gs://pix2struct-data/screen2words_base/checkpoint_282600",
}
```

--------------------------------

### T5X Model Training

Source: https://context7.com/google-research/pix2struct/llms.txt

Finetune Pix2Struct models on TPU infrastructure using T5X.

```bash
# Set up TPU VM
TPU_TYPE=v3-8
TPU_NAME=pix2struct-$TPU_TYPE
TPU_ZONE=europe-west4-a
gcloud compute tpus tpu-vm create $TPU_NAME \
  --zone=$TPU_ZONE \
  --accelerator-type=$TPU_TYPE \
  --version=tpu-vm-base

# SSH into TPU and install pix2struct package

# Run training with validation evaluation
python -m t5x.train \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/train.gin" \
  --gin_file="sizes/base.gin" \
  --gin_file="optimizers/adafactor.gin" \
  --gin_file="schedules/screen2words.gin" \
  --gin_file="init/pix2struct_base_init.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.MODEL_DIR="'$PIX2STRUCT_DIR/experiments/screen2words_base'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32

# Train on other tasks by changing schedule and task name:
# --gin_file="schedules/chartqa.gin" --gin.MIXTURE_OR_TASK_NAME="'chartqa'"
# --gin_file="schedules/docvqa.gin" --gin.MIXTURE_OR_TASK_NAME="'docvqa'"
# --gin_file="schedules/textcaps.gin" --gin.MIXTURE_OR_TASK_NAME="'textcaps'"
```

--------------------------------

### Prepare DocVQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Processes DocVQA data after manual download. Assumes tar files are present in the data directory.

```bash
mkdir -p data/docvqa
cd data/docvqa
```

```bash
tar xvf train.tar.gz
tar xvf val.tar.gz
tar xvf test.tar.gz
rm -r *.tar.gz */ocr_results

cd ..
gsutil -m cp -r docvqa $PIX2STRUCT_DIR/data/docvqa
python -m pix2struct.preprocessing.convert_docvqa \
  --data_dir=$PIX2STRUCT_DIR/data/docvqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Train Pix2Struct Model

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Initiates the training loop using T5X with specified configuration files and task parameters.

```bash
python -m t5x.train \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/train.gin" \
  --gin_file="sizes/base.gin" \
  --gin_file="optimizers/adafactor.gin" \
  --gin_file="schedules/screen2words.gin" \
  --gin_file="init/pix2struct_base_init.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.MODEL_DIR="'$PIX2STRUCT_DIR/experiments/screen2words_base'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32
```

--------------------------------

### Command-line Inference Execution

Source: https://context7.com/google-research/pix2struct/llms.txt

Run inference on images using pretrained checkpoints via the command line.

```bash
# Set JAX platform to CPU for testing (optional)
export JAX_PLATFORMS=cpu

# Run inference on an image with a TextCaps checkpoint
python -m pix2struct.example_inference \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/textcaps_base/checkpoint_280400'" \
  --image=/path/to/image.jpg

# Run inference with a question (VQA tasks)
python -m pix2struct.example_inference \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/docvqa_base/checkpoint_284400'" \
  --image=/path/to/document.png \
  --text="What is the total amount?"
```

--------------------------------

### Preprocess Datasets with Apache Beam

Source: https://context7.com/google-research/pix2struct/llms.txt

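Use these commands to convert raw datasets to TFRecord format. Flags after the bare -- separator are passed to the Beam pipeline: specify the runner (e.g., DataflowRunner) and the related project, region, and staging flags for cloud execution, or omit them to run locally.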

```bash
python -m pix2struct.preprocessing.convert_chartqa \
  --data_dir=$PIX2STRUCT_DIR/data/chartqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

```bash
python -m pix2struct.preprocessing.convert_docvqa \
  --data_dir=$PIX2STRUCT_DIR/data/docvqa \
  -- \
  --runner=DataflowRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION
```

```bash
python -m pix2struct.preprocessing.convert_textcaps \
  --textcaps_dir=$PIX2STRUCT_DIR/data/textcaps \
  --output_dir=$PIX2STRUCT_DIR/data/textcaps/processed
```

```bash
python -m pix2struct.preprocessing.convert_screen2words \
  --screen2words_dir=$PIX2STRUCT_DIR/data/screen2words \
  --rico_dir=$PIX2STRUCT_DIR/data/rico_images
```
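
The local-versus-Dataflow choice can be made explicit with a small command builder. This is a sketch only: `convert_command` is a hypothetical helper, and the flag values simply mirror the commands shown in this section.

```python
import shlex
from typing import List, Optional


def convert_command(module: str, script_flags: List[str],
                    dataflow: Optional[dict] = None) -> str:
    """Build a preprocessing command; Beam pipeline flags go after a bare `--`."""
    args = ["python", "-m", module, *script_flags]
    if dataflow is not None:
        base = dataflow["pix2struct_dir"].rstrip("/")
        args += [
            "--",
            "--runner=DataflowRunner",
            "--save_main_session",
            f"--project={dataflow['project']}",
            f"--region={dataflow['region']}",
            f"--temp_location={base}/data/temp",
            f"--staging_location={base}/data/staging",
            "--setup_file=./setup.py",
        ]
    return shlex.join(args)


flags = ["--data_dir=gs://my-bucket/p2s/data/chartqa"]  # placeholder bucket
local_cmd = convert_command("pix2struct.preprocessing.convert_chartqa", flags)
cloud_cmd = convert_command(
    "pix2struct.preprocessing.convert_chartqa",
    flags,
    dataflow={"pix2struct_dir": "gs://my-bucket/p2s",
              "project": "my-project", "region": "europe-west4"},
)
print(local_cmd)
print(cloud_cmd)
```

Passing `dataflow=None` yields the local form with no pipeline flags at all.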

--------------------------------

### Prepare InfographicVQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

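Processes InfographicVQA data after manual download. Assumes the image zip files and JSON annotation files are present in the data directory. The conversion step reuses the DocVQA converter (convert_docvqa).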

```bash
mkdir -p data/infographicvqa
cd data/infographicvqa
```

```bash
for split in train val test
do
  unzip infographicVQA_${split}_v1.0_images.zip
  mv infographicVQA_${split}_v1.0_images $split
  mv infographicVQA_${split}_v1.0.json $split/${split}_v1.0.json
done
rm *.zip

cd ..
gsutil -m cp -r infographicvqa $PIX2STRUCT_DIR/data/infographicvqa
python -m pix2struct.preprocessing.convert_docvqa \
  --data_dir=$PIX2STRUCT_DIR/data/infographicvqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Prepare OCR-VQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Processes OCR-VQA data after manual download. Assumes images directory and dataset.json are present.

```bash
mkdir -p data/ocrvqa
cd data/ocrvqa
```

```bash
cd ..
gsutil -m cp -r ocrvqa $PIX2STRUCT_DIR/data/ocrvqa
python -m pix2struct.preprocessing.convert_ocrvqa \
  --data_dir=$PIX2STRUCT_DIR/data/ocrvqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Preprocess TextCaps Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Download and convert the TextCaps dataset for use with Pix2Struct.

```bash
mkdir -p data/textcaps
cd data/textcaps
curl -O https://dl.fbaipublicfiles.com/textvqa/data/textcaps/TextCaps_0.1_train.json
curl -O https://dl.fbaipublicfiles.com/textvqa/data/textcaps/TextCaps_0.1_val.json
curl -O https://dl.fbaipublicfiles.com/textvqa/data/textcaps/TextCaps_0.1_test.json
curl -O https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
curl -O https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip
unzip train_val_images.zip
rm train_val_images.zip
unzip test_images.zip
rm test_images.zip
cd ..
gsutil -m cp -r textcaps $PIX2STRUCT_DIR/data/textcaps
python -m pix2struct.preprocessing.convert_textcaps \
  --textcaps_dir=$PIX2STRUCT_DIR/data/textcaps \
  --output_dir=$PIX2STRUCT_DIR/data/textcaps/processed \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Preprocess RICO Images

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Download and prepare RICO dataset images required for Screen2Words, RefExp, and Widget Captioning tasks.

```bash
cd data
wget https://storage.googleapis.com/crowdstf-rico-uiuc-4540/rico_dataset_v0.1/unique_uis.tar.gz
tar xvfz unique_uis.tar.gz
rm unique_uis.tar.gz
gsutil -m cp -r combined $PIX2STRUCT_DIR/data/rico_images
```

--------------------------------

### Configure Environment Variables

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Set the required environment variables for GCS storage and GCP project configuration.

```bash
export PIX2STRUCT_DIR="gs://<your_bucket>/<path_to_pix2struct_dir>"
```

```bash
export GCP_PROJECT=<your_project_id>
export GCP_REGION=<your_region>
```
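
Before launching a long preprocessing or training job, it can help to confirm that these variables are set and no longer contain the `<...>` placeholders. The check below is a minimal sketch; `check_environment` is a hypothetical helper, and the `gs://` prefix test is an assumption based on the export shown above.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("PIX2STRUCT_DIR", "GCP_PROJECT", "GCP_REGION")


def check_environment(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif "<" in value or ">" in value:
            problems.append(f"{name} still contains a placeholder: {value}")
    bucket_dir = env.get("PIX2STRUCT_DIR", "")
    if bucket_dir and "<" not in bucket_dir and not bucket_dir.startswith("gs://"):
        problems.append("PIX2STRUCT_DIR is expected to be a gs:// path")
    return problems


for problem in check_environment(os.environ):
    print("warning:", problem)
```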

--------------------------------

### Extract Patches from Image

Source: https://context7.com/google-research/pix2struct/llms.txt

Demonstrates the use of the patch_sequence function to extract patches from an image. It automatically determines optimal resizing based on the maximum patch limit.

```python
from pix2struct import preprocessors
import tensorflow as tf

# Extract patches from an image
# Image shape: [height, width, channels]
```
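
The resizing step can be illustrated without TensorFlow. The sketch below shows one way to pick the largest aspect-ratio-preserving patch grid that fits within the patch budget. The formula is an assumption about how the library chooses the grid, and `feasible_patch_grid` is a hypothetical name, not a pix2struct function.

```python
import math
from typing import Tuple


def feasible_patch_grid(image_height: int, image_width: int, max_patches: int,
                        patch_size: Tuple[int, int] = (16, 16)) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid after aspect-preserving rescaling."""
    patch_height, patch_width = patch_size
    # Scale factor that would make rows * cols equal max_patches before flooring.
    scale = math.sqrt(
        max_patches * (patch_height / image_height) * (patch_width / image_width))
    rows = max(min(math.floor(scale * image_height / patch_height), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_width), max_patches), 1)
    return rows, cols


rows, cols = feasible_patch_grid(480, 640, max_patches=4096)
print(rows, cols, rows * cols)  # 55 73 4015
```

Under these assumptions a 480x640 image would be resized to 880x1168 pixels (55 x 73 patches of 16x16), which stays within the 4096-patch budget.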

--------------------------------

### Prepare AI2D Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads, extracts, and converts the AI2D dataset for use with Pix2Struct using Dataflow.

```bash
mkdir -p data/
cd data/
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
unzip ai2d-all.zip
rm ai2d-all.zip
gsutil -m cp -r ai2d $PIX2STRUCT_DIR/data/ai2d
python -m pix2struct.preprocessing.convert_ai2d \
  --data_dir=$PIX2STRUCT_DIR/data/ai2d \
  --test_ids_path=gs://pix2struct-data/ai2d_test_ids.csv \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Render Text Header on Image

Source: https://context7.com/google-research/pix2struct/llms.txt

Renders a text header (question or prompt) above an image, creating a combined image. Text is wrapped to 80 characters, uses a default font size of 36px, and has a white background with black text.

```python
from pix2struct.preprocessing import preprocessing_utils
from PIL import Image

# Load an image
image = Image.open("chart.png")

# Render a question header above the image
question = "What is the total revenue for Q4 2023?"
combined_image = preprocessing_utils.render_header(image, question)

# The header is rendered with:
# - Text wrapped to 80 characters per line
# - Default font size: 36px
# - White background, black text
# - Padding: 5px on all sides

# Save or convert to bytes
combined_image.save("chart_with_question.png")

# Convert to bytes for model input
image_bytes = preprocessing_utils.image_to_bytes(combined_image)
```
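
The 80-character wrapping can be previewed with the standard library before rendering. This sketch only covers the text layout, not the drawing; `wrap_header` is a hypothetical helper, and using `textwrap` is an assumption about how the wrapping is done.

```python
import textwrap


def wrap_header(text: str, width: int = 80) -> str:
    """Wrap header text to at most `width` characters per line."""
    return "\n".join(textwrap.wrap(text, width=width))


long_question = "What is the total revenue for Q4 2023 " * 4
wrapped = wrap_header(long_question)
print(wrapped)
print(len(wrapped.splitlines()), "lines")
```

Long prompts produce more lines and therefore a taller header, which takes up part of the patch budget once the combined image is converted to patches.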

--------------------------------

### Preprocessing Pipeline Configuration

Source: https://context7.com/google-research/pix2struct/llms.txt

Defines a standard SeqIO preprocessing pipeline for Pix2Struct tasks, including decoding, normalization, and patch conversion.

```python
import functools
import seqio
from pix2struct import preprocessors

# Standard preprocessing pipeline for pix2struct tasks
PREPROCESSORS = [
    # Remap feature keys to standard names
    functools.partial(seqio.preprocessors.rekey, key_map={
        "inputs": "image",
        "targets": "parse",
        "parse": "parse",
        "image": "image",
        "id": "id",
        "group_id": "group_id"
    }),
    # Sample one target if multiple exist
    preprocessors.sample_one(key="targets"),
    # Decode PNG bytes to image tensor
    preprocessors.image_decoder(key="inputs", channels=3),
    # Normalize image (per-image standardization)
    preprocessors.normalize_image(key="inputs"),
    # Convert image to patch sequence
    preprocessors.image_to_patches(key="inputs"),
    # Tokenize text targets and append EOS
    seqio.preprocessors.tokenize_and_append_eos,
]
```
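
The first step of the pipeline maps new feature names to existing ones, which is why `key_map` lists `"inputs": "image"` rather than the reverse. A plain-Python sketch of that behavior on a single example dict is shown below; `rekey_sketch` is a hypothetical stand-in, not the SeqIO implementation, which operates on `tf.data` datasets.

```python
from typing import Any, Dict, Mapping


def rekey_sketch(example: Mapping[str, Any], key_map: Mapping[str, str]) -> Dict[str, Any]:
    """Build a new example whose keys are the new names in key_map.

    Features not listed in key_map are dropped; a falsy old key yields "".
    """
    return {new_key: (example[old_key] if old_key else "")
            for new_key, old_key in key_map.items()}


raw = {"image": b"<png bytes>", "parse": ["a caption"], "id": "ex-1",
       "group_id": "g-1", "unused": 123}
key_map = {"inputs": "image", "targets": "parse", "parse": "parse",
           "image": "image", "id": "id", "group_id": "group_id"}
rekeyed = rekey_sketch(raw, key_map)
print(sorted(rekeyed))
```

Listing `parse`, `image`, `id`, and `group_id` under their own names keeps the original features available alongside the new `inputs` and `targets` keys.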

--------------------------------

### T5X Model Evaluation

Source: https://context7.com/google-research/pix2struct/llms.txt

Evaluate trained models on test datasets using T5X.

```bash
# Evaluate model on test set
python -m t5x.eval \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/eval.gin" \
  --gin_file="sizes/base.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.CHECKPOINT_PATH="'$PIX2STRUCT_DIR/experiments/screen2words_base/checkpoint_286600'" \
  --gin.EVAL_OUTPUT_DIR="'$PIX2STRUCT_DIR/experiments/screen2words_base/test_eval'" \
  --gin.EVAL_SPLIT="'test'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32

# Use large model checkpoints with appropriate size config
python -m t5x.eval \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/eval.gin" \
  --gin_file="sizes/large.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'docvqa'" \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/docvqa_large/checkpoint_184000'" \
  --gin.EVAL_OUTPUT_DIR="'$PIX2STRUCT_DIR/experiments/docvqa_large/test_eval'" \
  --gin.EVAL_SPLIT="'test'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=16
```

--------------------------------

### Task Registration and Feature Description

Source: https://context7.com/google-research/pix2struct/llms.txt

Registers a custom task in the SeqIO registry and defines the expected TFRecord feature structure.

```python
import os

import tensorflow as tf

from pix2struct import metrics
from pix2struct import tasks

# Register a custom pix2struct task
tasks.add_pix2struct_task(
    name="my_custom_task",
    base_dir=os.environ.get("PIX2STRUCT_DIR", "") + "/data",
    train_file_pattern="my_task/processed/train.tfr*",
    valid_file_pattern="my_task/processed/val.tfr*",
    test_file_pattern="my_task/processed/test.tfr*",
    # Optional: custom metrics (defaults to pix2struct_metrics)
    metric_fns=[metrics.pix2struct_metrics],
    # Optional: custom postprocessor (defaults to multi_target)
    postprocess_fn=None
)

# TFRecord feature description expected by tasks
FEATURE_DESCRIPTION = {
    "id": tf.io.FixedLenFeature([], tf.string, default_value="no-id"),
    "image": tf.io.FixedLenFeature([], tf.string),  # PNG bytes
    "parse": tf.io.FixedLenSequenceFeature([], tf.string, allow_missing=True),
    "group_id": tf.io.FixedLenFeature([], tf.string, default_value="no-group-id"),
}
```

--------------------------------

### Inference Function Retrieval

Source: https://context7.com/google-research/pix2struct/llms.txt

Initializes inference functions for a trained model checkpoint using T5X partitioning.

```python
from pix2struct import inference_utils
from t5x import partitioning
import tensorflow as tf

# Get inference functions from a checkpoint
inference_fns = inference_utils.get_inference_fns(
    task_name="placeholder_pix2struct",
    batch_size=1,
    sequence_length={"inputs": 4096, "targets": 128},
    model=model,  # ImageToTextModel instance
    checkpoint_path="gs://pix2struct-data/textcaps_base/checkpoint_280400",
    partitioner=partitioning.PjitPartitioner(num_partitions=1)
)
```

--------------------------------

### Evaluate Pix2Struct Model

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Runs model evaluation on the test set using a specific checkpoint.

```bash
python -m t5x.eval \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/eval.gin" \
  --gin_file="sizes/base.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.CHECKPOINT_PATH="'$PIX2STRUCT_DIR/experiments/screen2words_base/checkpoint_286600'" \
  --gin.EVAL_OUTPUT_DIR="'$PIX2STRUCT_DIR/experiments/test_exp/test_eval'" \
  --gin.EVAL_SPLIT="'test'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32
```

--------------------------------

### Programmatic Inference with Pix2Struct

Source: https://context7.com/google-research/pix2struct/llms.txt

Use the inference functions to process datasets and generate predictions.

```python
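import tensorflow as tf

# `inference_fns` is the result of inference_utils.get_inference_fns(...);
# `image_bytes` must hold the PNG-encoded input image.
predict_fn = inference_fns["predict"]  # Returns decoded string predictions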
intermediates_fn = inference_fns["intermediates"]  # Returns attention weights, etc.

# Make predictions on a dataset
dataset = tf.data.Dataset.from_tensors({
    "id": "",
    "group_id": "",
    "image": image_bytes,  # PNG bytes
    "parse": [""]
})

for prediction in predict_fn(dataset):
    print(prediction)  # Decoded text output
```

--------------------------------

### Delete TPU VM Instance

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Deletes the TPU VM instance when it is no longer needed. $TPU_NAME and $TPU_ZONE must hold the same values that were used when the instance was created.

```bash
gcloud compute tpus tpu-vm delete $TPU_NAME --zone=$TPU_ZONE
```

--------------------------------

### Split and Write TFRecords with Apache Beam

Source: https://context7.com/google-research/pix2struct/llms.txt

An Apache Beam transform for partitioning data into train/validation/test splits and writing to TFRecord files. It shuffles data, partitions by hash of a key feature for deterministic splitting, and reports counts via beam.metrics.

```python
from pix2struct.preprocessing import preprocessing_utils
import apache_beam as beam
import tensorflow as tf

# Create a pipeline to process and split data
with beam.Pipeline() as pipeline:
    examples = (
        pipeline
        | "CreateExamples" >> beam.Create(processed_examples)
        | "SplitAndWrite" >> preprocessing_utils.SplitAndWriteTFRecords(
            output_dir="gs://bucket/data/my_task/processed",
            key="id",  # Feature key used for deterministic splitting
            validation_percent=10,  # 10% of data goes to validation
            train_file_name="train.tfr",
            val_file_name="val.tfr",
            test_file_name="test.tfr",
            is_test=lambda key: key.startswith("test_"),  # Optional test filter
        )
    )

# The transform:
# 1. Shuffles data using beam.Reshuffle()
# 2. Partitions by hash of key feature (deterministic)
# 3. Writes to sharded TFRecord files
# 4. Reports train/val/test counts via beam.metrics
```
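
Deterministic splitting by key hash can be sketched without Beam. The hash function and the comparison against `validation_percent` below are assumptions for illustration; `assign_split` is a hypothetical helper and may differ from what `SplitAndWriteTFRecords` does internally.

```python
import hashlib
from typing import Callable, Optional


def assign_split(key: str, validation_percent: int,
                 is_test: Optional[Callable[[str], bool]] = None) -> str:
    """Assign a key to "train", "val", or "test" in a reproducible way."""
    if is_test is not None and is_test(key):
        return "test"
    # hashlib is stable across processes, unlike Python's salted built-in hash().
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
    return "val" if bucket < validation_percent else "train"


keys = [f"example_{i}" for i in range(1000)] + ["test_0", "test_1"]
splits = [assign_split(k, 10, is_test=lambda k: k.startswith("test_")) for k in keys]
print({name: splits.count(name) for name in ("train", "val", "test")})
```

Because the assignment depends only on the key, rerunning the pipeline places each example in the same split, and examples that share a key always land together.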

--------------------------------

### ImageToTextModel Configuration

Source: https://context7.com/google-research/pix2struct/llms.txt

Defines output features and vocabulary for the ImageToTextModel. The model uses ImageToTextFeatureConverter for processing inputs and targets.

```python
import seqio
import tensorflow as tf

from pix2struct import models

# Model is configured through gin files
# Output features define the input/output specifications
OUTPUT_FEATURES = dict(
    inputs=seqio.ContinuousFeature(rank=2, dtype=tf.float32),
    targets=seqio.Feature(
        vocabulary=seqio.SentencePieceVocabulary(
            "gs://pix2struct-data/sentencepiece.model")))

# The ImageToTextModel uses ImageToTextFeatureConverter for processing
# TASK_FEATURES map to MODEL_FEATURES:
# - inputs (float32, rank=2) -> encoder_input_tokens
# - targets (int32) -> decoder_target_tokens, decoder_input_tokens, decoder_loss_weights
```

--------------------------------

### Patch Sequence Generation

Source: https://context7.com/google-research/pix2struct/llms.txt

Converts an image tensor into a sequence of patches for model input.

```python
import tensorflow as tf

from pix2struct import preprocessors

image = tf.random.uniform([480, 640, 3])  # Example RGB image
max_patches = 4096  # Maximum number of patches (sequence length)
patch_size = (16, 16)  # Height and width of each patch

patches, original_shape = preprocessors.patch_sequence(
    image=image,
    max_patches=max_patches,
    patch_size=patch_size
)
```

--------------------------------

### PatchEmbed Implementation

Source: https://context7.com/google-research/pix2struct/llms.txt

Details the PatchEmbed component, which converts image patches into embeddings. It combines positional embeddings for row/column IDs with a linear projection of pixel values.

```python
from pix2struct import models
from flaxformer.components import dense
from flaxformer.components import embedding

# PatchEmbed configuration
# Input: [batch, num_patches, 2 + patch_height * patch_width * channels]
# - First 2 values: row_id and col_id (1-indexed, 0 = padding)
# - Remaining values: flattened pixel values (normalized)

# The patch projection converts pixels to embedding dimension
# patch_projection/dense.DenseGeneral:
#   features = 768  # EMBED_DIM for base model
#   use_bias = True

# Position embeddings are added for row and column IDs
# embedding.PositionEmbed:
#   num_embeddings = 4096
#   features = 768  # EMBED_DIM

# Final embedding = row_embed + col_embed + patch_projection(pixels)
```
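
The final formula in the comments above can be made concrete with a tiny pure-Python sketch for a single patch. Dimensions are shrunk for readability, and `patch_embed_sketch` is a hypothetical function, not the Flax module; padding handling (row or column ID 0) is not modeled here.

```python
from typing import List, Sequence


def patch_embed_sketch(patch: Sequence[float],
                       row_table: List[List[float]],
                       col_table: List[List[float]],
                       projection: List[List[float]],
                       bias: List[float]) -> List[float]:
    """Embed one patch: row_embed + col_embed + (projection @ pixels + bias)."""
    row_id, col_id = int(patch[0]), int(patch[1])  # first 2 values are IDs
    pixels = patch[2:]                             # remaining values are pixels
    projected = [sum(w * p for w, p in zip(weights, pixels)) + b
                 for weights, b in zip(projection, bias)]
    return [r + c + x
            for r, c, x in zip(row_table[row_id], col_table[col_id], projected)]


# Toy setup: embedding dim 2, two pixel values per patch.
row_table = [[0.0, 0.0], [1.0, 1.0]]
col_table = [[0.0, 0.0], [10.0, 10.0]]
projection = [[1.0, 0.0], [0.0, 1.0]]  # shape [embed_dim, num_pixel_values]
bias = [0.5, 0.5]
embedding = patch_embed_sketch([1, 1, 2.0, 3.0], row_table, col_table, projection, bias)
print(embedding)  # [13.5, 14.5]
```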

--------------------------------

### Evaluation Metrics Computation

Source: https://context7.com/google-research/pix2struct/llms.txt

Computes various evaluation metrics including ANLS and relaxed accuracy for numeric answers.

```python
from pix2struct import metrics

# Evaluation targets and predictions
targets = [
    ["The quick brown fox"],
    ["42", "forty-two"],
    ["Paris"],
    ["25%", "0.25"],
]
predictions = [
    "The quick brown fox",
    "42",
    "paris",  # Case mismatch
    "24.8%",  # Close to target
]

# Compute all metrics
results = metrics.pix2struct_metrics(targets, predictions)

# Individual metric functions
anls_score = metrics.anls_metric("hello world", "hello word")  # ~0.91

# Relaxed correctness for numeric values (5% tolerance)
is_correct = metrics.relaxed_correctness("100", "98")  # True (within 5%)
is_correct = metrics.relaxed_correctness("100", "90")  # False (outside 5%)
```
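
To make the two metrics concrete, the following pure-Python sketch reimplements them under common definitions: ANLS as one minus the normalized Levenshtein distance with a 0.5 cutoff, and relaxed correctness as a 5% relative tolerance on numeric answers. The function names, the cutoff, and the percent handling are assumptions for illustration and may differ from the library code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ch_a != ch_b)))  # substitution
        previous = current
    return previous[-1]


def anls_sketch(target: str, prediction: str, theta: float = 0.5) -> float:
    longest = max(len(target), len(prediction))
    if longest == 0:  # both empty: treated as a perfect match (assumption)
        return 1.0
    normalized = levenshtein(target, prediction) / longest
    return 1.0 - normalized if normalized < theta else 0.0


def _to_float(text: str):
    try:
        return float(text.rstrip("%")) / 100.0 if text.endswith("%") else float(text)
    except ValueError:
        return None


def relaxed_match_sketch(target: str, prediction: str, tolerance: float = 0.05) -> bool:
    target_value, predicted_value = _to_float(target), _to_float(prediction)
    if predicted_value is not None and target_value:
        return abs(predicted_value - target_value) / abs(target_value) <= tolerance
    return prediction.lower() == target.lower()  # non-numeric: case-insensitive match


print(round(anls_sketch("hello world", "hello word"), 2))  # 0.91
print(relaxed_match_sketch("100", "98"), relaxed_match_sketch("100", "90"))  # True False
```

With several reference answers per example, a natural aggregation is to score the prediction against each target and keep the best score.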

--------------------------------

### ImageEncoder Configuration

Source: https://context7.com/google-research/pix2struct/llms.txt

Configuration details for the ImageEncoder, which processes image patches into embeddings. It uses transformer layers without discrete tokenization.

```python
from pix2struct import models
from flaxformer.architectures.t5 import t5_architecture

# ImageEncoder configuration (base size)
# NUM_ENCODER_LAYERS = 12
# NUM_HEADS = 12
# HEAD_DIM = 64
# EMBED_DIM = 768

# The encoder processes inputs with shape [batch, num_patches, patch_features]
# Positional information (row_id, col_id) is encoded in the first 2 channels
# Remaining channels contain pixel values

# ImageEncoder uses PatchEmbed for converting images to embeddings
# PatchEmbed combines:
# - Position embeddings for row and column IDs (num_extra_embedders = 2)
# - Linear projection for pixel patches
```
