### Web Demo Server Launch

Source: https://context7.com/google-research/pix2struct/llms.txt

Start an interactive web demo server for model predictions.

```bash
# Start the web demo server on port 8080
python -m pix2struct.demo \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/chartqa_base/checkpoint_287600'"

# Access at http://localhost:8080
# Upload an image and optionally provide a question prompt
# The demo will display the model's prediction

# Custom port configuration
python -m pix2struct.demo \
  --port=9000 \
  ... # other gin flags
```

--------------------------------

### Install Pix2Struct and Dependencies

Source: https://context7.com/google-research/pix2struct/llms.txt

Clone the repository, set up a conda environment, and install the package with development dependencies. Includes commands for setting up cloud storage and GCP project.

```bash
# Clone the repository
git clone https://github.com/google-research/pix2struct.git
cd pix2struct

# Create and activate conda environment
conda create -n pix2struct python=3.9
conda activate pix2struct

# Install the package with dev dependencies
pip install -e ."[dev]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Run tests to verify installation
pytest

# Set up GCS storage directory for data and models
export PIX2STRUCT_DIR="gs:///"

# Set up GCP project for Dataflow preprocessing
export GCP_PROJECT=
export GCP_REGION=
```

--------------------------------

### Install Pix2Struct Environment

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Commands to clone the repository, create a conda environment, and install dependencies.
```bash
git clone https://github.com/google-research/pix2struct.git
cd pix2struct
conda create -n pix2struct python=3.9
conda activate pix2struct
pip install -e ."[dev]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pytest
```

--------------------------------

### Run Pix2Struct Inference via Command Line

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Use this command for inference on a single example. It's tested at small scales and not recommended for large-scale inference. Ensure the config file matches the checkpoint.

```bash
python -m pix2struct.example_inference \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/textcaps_base/checkpoint_280400'" \
  --image=$HOME/test_image.jpg
```

--------------------------------

### Prepare Widget Captioning Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads and converts the Widget Captioning dataset. Requires the RICO dataset to be set up first.

```bash
mkdir -p data/widget_captioning
cd data/widget_captioning
git clone https://github.com/google-research-datasets/widget-caption.git
cp widget-caption/widget_captions.csv ./
cp widget-caption/split/*.txt ./
mv dev.txt val.txt
rm -rf widget-caption
cd ..
gsutil -m cp -r widget_captioning $PIX2STRUCT_DIR/data/widget_captioning
python -m pix2struct.preprocessing.convert_widget_captioning \
  --data_dir=$PIX2STRUCT_DIR/data/widget_captioning \
  --image_dir=$PIX2STRUCT_DIR/data/rico_images \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Prepare RefExp Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads and converts the RefExp dataset. Requires the RICO dataset to be set up first.

```bash
mkdir -p data/refexp
cd data/refexp
wget https://github.com/google-research-datasets/uibert/raw/main/ref_exp/train.tfrecord
wget https://github.com/google-research-datasets/uibert/raw/main/ref_exp/dev.tfrecord
wget https://github.com/google-research-datasets/uibert/raw/main/ref_exp/test.tfrecord
mv dev.tfrecord val.tfrecord
cd ..
gsutil -m cp -r refexp $PIX2STRUCT_DIR/data/refexp
python -m pix2struct.preprocessing.convert_refexp \
  --data_dir=$PIX2STRUCT_DIR/data/refexp \
  --image_dir=$PIX2STRUCT_DIR/data/rico_images \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Run Pix2Struct Inference via Web Demo

Source: https://github.com/google-research/pix2struct/blob/main/README.md

This command launches a web-based demo for inference, accessible at localhost:8080. It allows uploading custom images and prompts. Ensure the config file matches the checkpoint.
```bash
python -m pix2struct.demo \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/textcaps_base/checkpoint_280400'"
```

--------------------------------

### Provision Cloud TPU VM

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Creates and connects to a v3-8 TPU VM instance for model training.

```bash
TPU_TYPE=v3-8
TPU_NAME=pix2struct-$TPU_TYPE
TPU_ZONE=europe-west4-a
gcloud compute tpus tpu-vm create $TPU_NAME \
  --zone=$TPU_ZONE \
  --accelerator-type=$TPU_TYPE \
  --version=tpu-vm-base
gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$TPU_ZONE
```

--------------------------------

### Preprocess ChartQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Download and convert the ChartQA dataset.

```bash
mkdir -p data/chartqa
cd data/chartqa
git clone https://github.com/vis-nlp/ChartQA.git
cp -r ChartQA/ChartQA\ Dataset/* ./
rm -rf ChartQA
cd ..
gsutil -m cp -r chartqa $PIX2STRUCT_DIR/data/chartqa
python -m pix2struct.preprocessing.convert_chartqa \
  --data_dir=$PIX2STRUCT_DIR/data/chartqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Prepare Screen2Words Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads and converts the Screen2Words dataset. Requires the RICO dataset to be set up first.
```bash
cd data
git clone https://github.com/google-research-datasets/screen2words.git
gsutil -m cp -r screen2words $PIX2STRUCT_DIR/data/screen2words
python -m pix2struct.preprocessing.convert_screen2words \
  --screen2words_dir=$PIX2STRUCT_DIR/data/screen2words \
  --rico_dir=$PIX2STRUCT_DIR/data/rico_images \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Model Size Configurations and Checkpoints

Source: https://context7.com/google-research/pix2struct/llms.txt

Defines configurations for Base (282M) and Large (1.3B) model sizes using gin files. Lists pretrained and finetuned checkpoint paths for various tasks and model sizes.

```python
# Base model configuration (sizes/base.gin)
# NUM_ENCODER_LAYERS = 12
# NUM_DECODER_LAYERS = 12
# NUM_HEADS = 12
# HEAD_DIM = 64
# MLP_DIM = 2048
# EMBED_DIM = 768

# Large model configuration (sizes/large.gin)
# NUM_ENCODER_LAYERS = 18
# NUM_DECODER_LAYERS = 18
# NUM_HEADS = 24
# HEAD_DIM = 64
# MLP_DIM = 3968
# EMBED_DIM = 1536

# Pretrained checkpoint paths
CHECKPOINTS = {
    "base": "gs://pix2struct-data/pix2struct_base/checkpoint_900200",
    "large": "gs://pix2struct-data/pix2struct_large/checkpoint_900200",
}

# Finetuned checkpoints for specific tasks
FINETUNED_CHECKPOINTS = {
    "textcaps_base": "gs://pix2struct-data/textcaps_base/checkpoint_280400",
    "textcaps_large": "gs://pix2struct-data/textcaps_large/checkpoint_180600",
    "chartqa_base": "gs://pix2struct-data/chartqa_base/checkpoint_287600",
    "docvqa_base": "gs://pix2struct-data/docvqa_base/checkpoint_284400",
    "screen2words_base": "gs://pix2struct-data/screen2words_base/checkpoint_282600",
}
```

--------------------------------

### T5X Model Training

Source: https://context7.com/google-research/pix2struct/llms.txt

Finetune Pix2Struct models on TPU infrastructure using T5X.
```bash
# Set up TPU VM
TPU_TYPE=v3-8
TPU_NAME=pix2struct-$TPU_TYPE
TPU_ZONE=europe-west4-a
gcloud compute tpus tpu-vm create $TPU_NAME \
  --zone=$TPU_ZONE \
  --accelerator-type=$TPU_TYPE \
  --version=tpu-vm-base

# SSH into TPU and install pix2struct package

# Run training with validation evaluation
python -m t5x.train \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/train.gin" \
  --gin_file="sizes/base.gin" \
  --gin_file="optimizers/adafactor.gin" \
  --gin_file="schedules/screen2words.gin" \
  --gin_file="init/pix2struct_base_init.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.MODEL_DIR="'$PIX2STRUCT_DIR/experiments/screen2words_base'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32

# Train on other tasks by changing schedule and task name:
# --gin_file="schedules/chartqa.gin" --gin.MIXTURE_OR_TASK_NAME="'chartqa'"
# --gin_file="schedules/docvqa.gin" --gin.MIXTURE_OR_TASK_NAME="'docvqa'"
# --gin_file="schedules/textcaps.gin" --gin.MIXTURE_OR_TASK_NAME="'textcaps'"
```

--------------------------------

### Prepare DocVQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Processes DocVQA data after manual download. Assumes tar files are present in the data directory.

```bash
mkdir -p data/docvqa
cd data/docvqa
```

```bash
tar xvf train.tar.gz
tar xvf val.tar.gz
tar xvf test.tar.gz
rm -r *.tar.gz */ocr_results
cd ..
gsutil -m cp -r docvqa $PIX2STRUCT_DIR/data/docvqa
python -m pix2struct.preprocessing.convert_docvqa \
  --data_dir=$PIX2STRUCT_DIR/data/docvqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Train Pix2Struct Model

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Initiates the training loop using T5X with specified configuration files and task parameters.

```bash
python -m t5x.train \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/train.gin" \
  --gin_file="sizes/base.gin" \
  --gin_file="optimizers/adafactor.gin" \
  --gin_file="schedules/screen2words.gin" \
  --gin_file="init/pix2struct_base_init.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.MODEL_DIR="'$PIX2STRUCT_DIR/experiments/screen2words_base'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32
```

--------------------------------

### Command-line Inference Execution

Source: https://context7.com/google-research/pix2struct/llms.txt

Run inference on images using pretrained checkpoints via the command line.
```bash
# Set JAX platform to CPU for testing (optional)
export JAX_PLATFORMS=cpu

# Run inference on an image with a TextCaps checkpoint
python -m pix2struct.example_inference \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 2048, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/textcaps_base/checkpoint_280400'" \
  --image=/path/to/image.jpg

# Run inference with a question (VQA tasks)
python -m pix2struct.example_inference \
  --gin_search_paths="pix2struct/configs" \
  --gin_file=models/pix2struct.gin \
  --gin_file=runs/inference.gin \
  --gin_file=sizes/base.gin \
  --gin.MIXTURE_OR_TASK_NAME="'placeholder_pix2struct'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=1 \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/docvqa_base/checkpoint_284400'" \
  --image=/path/to/document.png \
  --text="What is the total amount?"
```

--------------------------------

### Preprocess Datasets with Apache Beam

Source: https://context7.com/google-research/pix2struct/llms.txt

Use these commands to convert raw datasets to TFRecord format. Specify the runner (e.g., DataflowRunner) and other flags for cloud execution, or omit them to run locally.
```bash
python -m pix2struct.preprocessing.convert_chartqa \
  --data_dir=$PIX2STRUCT_DIR/data/chartqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

```bash
python -m pix2struct.preprocessing.convert_docvqa \
  --data_dir=$PIX2STRUCT_DIR/data/docvqa \
  -- \
  --runner=DataflowRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION
```

```bash
python -m pix2struct.preprocessing.convert_textcaps \
  --textcaps_dir=$PIX2STRUCT_DIR/data/textcaps \
  --output_dir=$PIX2STRUCT_DIR/data/textcaps/processed
```

```bash
python -m pix2struct.preprocessing.convert_screen2words \
  --screen2words_dir=$PIX2STRUCT_DIR/data/screen2words \
  --rico_dir=$PIX2STRUCT_DIR/data/rico_images
```

--------------------------------

### Prepare InfographicVQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Processes InfographicVQA data after manual download. Assumes zip files and JSON files are present.

```bash
mkdir -p data/infographicvqa
cd data/infographicvqa
```

```bash
for split in train val test
do
  unzip infographicVQA_${split}_v1.0_images.zip
  mv infographicVQA_${split}_v1.0_images $split
  mv infographicVQA_${split}_v1.0.json $split/${split}_v1.0.json
done
rm *.zip
cd ..
gsutil -m cp -r infographicvqa $PIX2STRUCT_DIR/data/infographicvqa
python -m pix2struct.preprocessing.convert_docvqa \
  --data_dir=$PIX2STRUCT_DIR/data/infographicvqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Prepare OCR-VQA Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Processes OCR-VQA data after manual download. Assumes images directory and dataset.json are present.
```bash
mkdir -p data/ocrvqa
cd data/ocrvqa
```

```bash
cd ..
gsutil -m cp -r ocrvqa $PIX2STRUCT_DIR/data/ocrvqa
python -m pix2struct.preprocessing.convert_ocrvqa \
  --data_dir=$PIX2STRUCT_DIR/data/ocrvqa \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Preprocess TextCaps Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Download and convert the TextCaps dataset for use with Pix2Struct.

```bash
mkdir -p data/textcaps
cd data/textcaps
curl -O https://dl.fbaipublicfiles.com/textvqa/data/textcaps/TextCaps_0.1_train.json
curl -O https://dl.fbaipublicfiles.com/textvqa/data/textcaps/TextCaps_0.1_val.json
curl -O https://dl.fbaipublicfiles.com/textvqa/data/textcaps/TextCaps_0.1_test.json
curl -O https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
curl -O https://dl.fbaipublicfiles.com/textvqa/images/test_images.zip
unzip train_val_images.zip
rm train_val_images.zip
unzip test_images.zip
rm test_images.zip
cd ..
# The directory created above is named "textcaps", so that is what gets copied
gsutil -m cp -r textcaps $PIX2STRUCT_DIR/data/textcaps
python -m pix2struct.preprocessing.convert_textcaps \
  --textcaps_dir=$PIX2STRUCT_DIR/data/textcaps \
  --output_dir=$PIX2STRUCT_DIR/data/textcaps/processed \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Preprocess RICO Images

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Download and prepare RICO dataset images required for Screen2Words, RefExp, and Widget Captioning tasks.
```bash
cd data
wget https://storage.googleapis.com/crowdstf-rico-uiuc-4540/rico_dataset_v0.1/unique_uis.tar.gz
tar xvfz unique_uis.tar.gz
rm unique_uis.tar.gz
gsutil -m cp -r combined $PIX2STRUCT_DIR/data/rico_images
```

--------------------------------

### Configure Environment Variables

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Set the required environment variables for GCS storage and GCP project configuration.

```bash
export PIX2STRUCT_DIR="gs:///"
```

```bash
export GCP_PROJECT=
export GCP_REGION=
```

--------------------------------

### Extract Patches from Image

Source: https://context7.com/google-research/pix2struct/llms.txt

Demonstrates the use of the patch_sequence function to extract patches from an image. It automatically determines optimal resizing based on the maximum patch limit.

```python
from pix2struct import preprocessors
import tensorflow as tf

# Extract patches from an image
# Image shape: [height, width, channels]
```

--------------------------------

### Prepare AI2D Dataset

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Downloads, extracts, and converts the AI2D dataset for use with Pix2Struct using Dataflow.

```bash
mkdir -p data/
cd data/
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
unzip ai2d-all.zip
rm ai2d-all.zip
gsutil -m cp -r ai2d $PIX2STRUCT_DIR/data/ai2d
python -m pix2struct.preprocessing.convert_ai2d \
  --data_dir=$PIX2STRUCT_DIR/data/ai2d \
  --test_ids_path=gs://pix2struct-data/ai2d_test_ids.csv \
  -- \
  --runner=DataflowRunner \
  --save_main_session \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --temp_location=$PIX2STRUCT_DIR/data/temp \
  --staging_location=$PIX2STRUCT_DIR/data/staging \
  --setup_file=./setup.py
```

--------------------------------

### Render Text Header on Image

Source: https://context7.com/google-research/pix2struct/llms.txt

Renders a text header (question or prompt) above an image, creating a combined image.
The text is wrapped to 80 characters per line and drawn at a default font size of 36px, in black on a white background.

```python
from pix2struct.preprocessing import preprocessing_utils
from PIL import Image

# Load an image
image = Image.open("chart.png")

# Render a question header above the image
question = "What is the total revenue for Q4 2023?"
combined_image = preprocessing_utils.render_header(image, question)

# The header is rendered with:
# - Text wrapped to 80 characters per line
# - Default font size: 36px
# - White background, black text
# - Padding: 5px on all sides

# Save or convert to bytes
combined_image.save("chart_with_question.png")

# Convert to bytes for model input
image_bytes = preprocessing_utils.image_to_bytes(combined_image)
```

--------------------------------

### Preprocessing Pipeline Configuration

Source: https://context7.com/google-research/pix2struct/llms.txt

Defines a standard SeqIO preprocessing pipeline for Pix2Struct tasks, including decoding, normalization, and patch conversion.

```python
import functools
import seqio
from pix2struct import preprocessors

# Standard preprocessing pipeline for pix2struct tasks
PREPROCESSORS = [
    # Remap feature keys to standard names
    functools.partial(seqio.preprocessors.rekey, key_map={
        "inputs": "image",
        "targets": "parse",
        "parse": "parse",
        "image": "image",
        "id": "id",
        "group_id": "group_id"
    }),
    # Sample one target if multiple exist
    preprocessors.sample_one(key="targets"),
    # Decode PNG bytes to image tensor
    preprocessors.image_decoder(key="inputs", channels=3),
    # Normalize image (per-image standardization)
    preprocessors.normalize_image(key="inputs"),
    # Convert image to patch sequence
    preprocessors.image_to_patches(key="inputs"),
    # Tokenize text targets and append EOS
    seqio.preprocessors.tokenize_and_append_eos,
]
```

--------------------------------

### T5X Model Evaluation

Source: https://context7.com/google-research/pix2struct/llms.txt

Evaluate trained models on test datasets using T5X.
```bash
# Evaluate model on test set
python -m t5x.eval \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/eval.gin" \
  --gin_file="sizes/base.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.CHECKPOINT_PATH="'$PIX2STRUCT_DIR/experiments/screen2words_base/checkpoint_286600'" \
  --gin.EVAL_OUTPUT_DIR="'$PIX2STRUCT_DIR/experiments/screen2words_base/test_eval'" \
  --gin.EVAL_SPLIT="'test'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32

# Use large model checkpoints with appropriate size config
python -m t5x.eval \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/eval.gin" \
  --gin_file="sizes/large.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'docvqa'" \
  --gin.CHECKPOINT_PATH="'gs://pix2struct-data/docvqa_large/checkpoint_184000'" \
  --gin.EVAL_OUTPUT_DIR="'$PIX2STRUCT_DIR/experiments/docvqa_large/test_eval'" \
  --gin.EVAL_SPLIT="'test'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=16
```

--------------------------------

### Task Registration and Feature Description

Source: https://context7.com/google-research/pix2struct/llms.txt

Registers a custom task in the SeqIO registry and defines the expected TFRecord feature structure.
```python
import os

import tensorflow as tf  # needed for the tf.io feature specs below
from pix2struct import metrics
from pix2struct import tasks

# Register a custom pix2struct task
tasks.add_pix2struct_task(
    name="my_custom_task",
    base_dir=os.environ.get("PIX2STRUCT_DIR", "") + "/data",
    train_file_pattern="my_task/processed/train.tfr*",
    valid_file_pattern="my_task/processed/val.tfr*",
    test_file_pattern="my_task/processed/test.tfr*",
    # Optional: custom metrics (defaults to pix2struct_metrics)
    metric_fns=[metrics.pix2struct_metrics],
    # Optional: custom postprocessor (defaults to multi_target)
    postprocess_fn=None
)

# TFRecord feature description expected by tasks
FEATURE_DESCRIPTION = {
    "id": tf.io.FixedLenFeature([], tf.string, default_value="no-id"),
    "image": tf.io.FixedLenFeature([], tf.string),  # PNG bytes
    "parse": tf.io.FixedLenSequenceFeature([], tf.string, allow_missing=True),
    "group_id": tf.io.FixedLenFeature([], tf.string, default_value="no-group-id"),
}
```

--------------------------------

### Inference Function Retrieval

Source: https://context7.com/google-research/pix2struct/llms.txt

Initializes inference functions for a trained model checkpoint using T5X partitioning.

```python
from pix2struct import inference_utils
from t5x import partitioning
import tensorflow as tf

# Get inference functions from a checkpoint
inference_fns = inference_utils.get_inference_fns(
    task_name="placeholder_pix2struct",
    batch_size=1,
    sequence_length={"inputs": 4096, "targets": 128},
    model=model,  # ImageToTextModel instance
    checkpoint_path="gs://pix2struct-data/textcaps_base/checkpoint_280400",
    partitioner=partitioning.PjitPartitioner(num_partitions=1)
)
```

--------------------------------

### Evaluate Pix2Struct Model

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Runs model evaluation on the test set using a specific checkpoint.
```bash
python -m t5x.eval \
  --gin_search_paths="pix2struct/configs" \
  --gin_file="models/pix2struct.gin" \
  --gin_file="runs/eval.gin" \
  --gin_file="sizes/base.gin" \
  --gin.MIXTURE_OR_TASK_NAME="'screen2words'" \
  --gin.CHECKPOINT_PATH="'$PIX2STRUCT_DIR/experiments/screen2words_base/checkpoint_286600'" \
  --gin.EVAL_OUTPUT_DIR="'$PIX2STRUCT_DIR/experiments/test_exp/test_eval'" \
  --gin.EVAL_SPLIT="'test'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 4096, 'targets': 128}" \
  --gin.BATCH_SIZE=32
```

--------------------------------

### Programmatic Inference with Pix2Struct

Source: https://context7.com/google-research/pix2struct/llms.txt

Use the inference functions to process datasets and generate predictions.

```python
import tensorflow as tf

# `inference_fns` is the result of inference_utils.get_inference_fns(...);
# `image_bytes` holds the PNG-encoded input image.
predict_fn = inference_fns["predict"]  # Returns decoded string predictions
intermediates_fn = inference_fns["intermediates"]  # Returns attention weights, etc.

# Make predictions on a dataset
dataset = tf.data.Dataset.from_tensors({
    "id": "",
    "group_id": "",
    "image": image_bytes,  # PNG bytes
    "parse": [""]
})

for prediction in predict_fn(dataset):
    print(prediction)  # Decoded text output
```

--------------------------------

### Delete TPU VM Instance

Source: https://github.com/google-research/pix2struct/blob/main/README.md

Command to delete the TPU VM instance when it is no longer needed. Replace $TPU_NAME and $TPU_ZONE with your specific values.

```bash
gcloud compute tpus tpu-vm delete $TPU_NAME --zone=$TPU_ZONE
```

--------------------------------

### Split and Write TFRecords with Apache Beam

Source: https://context7.com/google-research/pix2struct/llms.txt

An Apache Beam transform for partitioning data into train/validation/test splits and writing to TFRecord files. It shuffles data, partitions by hash of a key feature for deterministic splitting, and reports counts via beam.metrics.
```python
from pix2struct.preprocessing import preprocessing_utils
import apache_beam as beam
import tensorflow as tf

# Create a pipeline to process and split data
with beam.Pipeline() as pipeline:
    examples = (
        pipeline
        | "CreateExamples" >> beam.Create(processed_examples)
        | "SplitAndWrite" >> preprocessing_utils.SplitAndWriteTFRecords(
            output_dir="gs://bucket/data/my_task/processed",
            key="id",  # Feature key used for deterministic splitting
            validation_percent=10,  # 10% of data goes to validation
            train_file_name="train.tfr",
            val_file_name="val.tfr",
            test_file_name="test.tfr",
            is_test=lambda key: key.startswith("test_"),  # Optional test filter
        )
    )

# The transform:
# 1. Shuffles data using beam.Reshuffle()
# 2. Partitions by hash of key feature (deterministic)
# 3. Writes to sharded TFRecord files
# 4. Reports train/val/test counts via beam.metrics
```

--------------------------------

### ImageToTextModel Configuration

Source: https://context7.com/google-research/pix2struct/llms.txt

Defines output features and vocabulary for the ImageToTextModel. The model uses ImageToTextFeatureConverter for processing inputs and targets.

```python
from pix2struct import models
import seqio
import tensorflow as tf  # needed for tf.float32 below

# Model is configured through gin files
# Output features define the input/output specifications
OUTPUT_FEATURES = dict(
    inputs=seqio.ContinuousFeature(rank=2, dtype=tf.float32),
    targets=seqio.Feature(
        vocabulary=seqio.SentencePieceVocabulary(
            "gs://pix2struct-data/sentencepiece.model")))

# The ImageToTextModel uses ImageToTextFeatureConverter for processing
# TASK_FEATURES map to MODEL_FEATURES:
# - inputs (float32, rank=2) -> encoder_input_tokens
# - targets (int32) -> decoder_target_tokens, decoder_input_tokens, decoder_loss_weights
```

--------------------------------

### Patch Sequence Generation

Source: https://context7.com/google-research/pix2struct/llms.txt

Converts an image tensor into a sequence of patches for model input.
```python
import tensorflow as tf
from pix2struct import preprocessors

image = tf.random.uniform([480, 640, 3])  # Example RGB image
max_patches = 4096  # Maximum number of patches (sequence length)
patch_size = (16, 16)  # Height and width of each patch

patches, original_shape = preprocessors.patch_sequence(
    image=image,
    max_patches=max_patches,
    patch_size=patch_size
)
```

--------------------------------

### PatchEmbed Implementation

Source: https://context7.com/google-research/pix2struct/llms.txt

Details the PatchEmbed component, which converts image patches into embeddings. It combines positional embeddings for row/column IDs with a linear projection of pixel values.

```python
from pix2struct import models
from flaxformer.components import dense
from flaxformer.components import embedding

# PatchEmbed configuration
# Input: [batch, num_patches, 2 + patch_height * patch_width * channels]
# - First 2 values: row_id and col_id (1-indexed, 0 = padding)
# - Remaining values: flattened pixel values (normalized)

# The patch projection converts pixels to embedding dimension
# patch_projection/dense.DenseGeneral:
#   features = 768  # EMBED_DIM for base model
#   use_bias = True

# Position embeddings are added for row and column IDs
# embedding.PositionEmbed:
#   num_embeddings = 4096
#   features = 768  # EMBED_DIM

# Final embedding = row_embed + col_embed + patch_projection(pixels)
```

--------------------------------

### Evaluation Metrics Computation

Source: https://context7.com/google-research/pix2struct/llms.txt

Computes various evaluation metrics including ANLS and relaxed accuracy for numeric answers.
```python
from pix2struct import metrics

# Evaluation targets and predictions
targets = [
    ["The quick brown fox"],
    ["42", "forty-two"],
    ["Paris"],
    ["25%", "0.25"],
]
predictions = [
    "The quick brown fox",
    "42",
    "paris",  # Case mismatch
    "24.8%",  # Close to target
]

# Compute all metrics
results = metrics.pix2struct_metrics(targets, predictions)

# Individual metric functions
anls_score = metrics.anls_metric("hello world", "hello word")  # ~0.91

# Relaxed correctness for numeric values (5% tolerance)
is_correct = metrics.relaxed_correctness("100", "98")  # True (within 5%)
is_correct = metrics.relaxed_correctness("100", "90")  # False (outside 5%)
```

--------------------------------

### ImageEncoder Configuration

Source: https://context7.com/google-research/pix2struct/llms.txt

Configuration details for the ImageEncoder, which processes image patches into embeddings. It uses transformer layers without discrete tokenization.

```python
from pix2struct import models
from flaxformer.architectures.t5 import t5_architecture

# ImageEncoder configuration (base size)
# NUM_ENCODER_LAYERS = 12
# NUM_HEADS = 12
# HEAD_DIM = 64
# EMBED_DIM = 768

# The encoder processes inputs with shape [batch, num_patches, patch_features]
# Positional information (row_id, col_id) is encoded in the first 2 channels
# Remaining channels contain pixel values

# ImageEncoder uses PatchEmbed for converting images to embeddings
# PatchEmbed combines:
# - Position embeddings for row and column IDs (num_extra_embedders = 2)
# - Linear projection for pixel patches
```
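To make the patch layout described for PatchEmbed and ImageEncoder concrete, the sketch below builds a `[num_patches, 2 + patch_height * patch_width * channels]` sequence in plain Python. It is an illustration only, not the library's `patch_sequence`: the helper name `to_patch_sequence` is hypothetical, the image is a nested list rather than a tensor, and it assumes the image dimensions are already multiples of the patch size, so the resizing and normalization steps are skipped.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [height][width][channels]


def to_patch_sequence(image: Image, patch_h: int, patch_w: int,
                      max_patches: int) -> List[List[float]]:
    """Hypothetical helper: rows of [row_id, col_id, *pixels], zero-padded."""
    height, width = len(image), len(image[0])
    channels = len(image[0][0])
    if height % patch_h or width % patch_w:
        raise ValueError("sketch assumes dimensions divisible by the patch size")
    rows, cols = height // patch_h, width // patch_w
    if rows * cols > max_patches:
        raise ValueError("image yields more patches than max_patches")

    feature_dim = 2 + patch_h * patch_w * channels
    sequence = []
    for r in range(rows):
        for c in range(cols):
            # Flatten one patch in (row, column, channel) order.
            pixels = [
                image[r * patch_h + dy][c * patch_w + dx][ch]
                for dy in range(patch_h)
                for dx in range(patch_w)
                for ch in range(channels)
            ]
            # IDs are 1-indexed so that 0 can mark padding.
            sequence.append([float(r + 1), float(c + 1)] + pixels)
    # Pad with all-zero rows up to the fixed sequence length.
    while len(sequence) < max_patches:
        sequence.append([0.0] * feature_dim)
    return sequence


# A 4x6 single-channel image split into 2x2 patches gives a 2x3 patch grid.
image = [[[float(y * 6 + x)] for x in range(6)] for y in range(4)]
patches = to_patch_sequence(image, patch_h=2, patch_w=2, max_patches=8)

print(len(patches), len(patches[0]))  # 8 rows of 6 features each
print(patches[0])                     # [1.0, 1.0, 0.0, 1.0, 6.0, 7.0]
print(patches[-1])                    # padding row: all zeros
```

Keeping the row and column IDs inside the feature vector means that padding rows are recognizable from the data alone (both IDs are 0), which matches the "0 = padding" convention noted in the PatchEmbed comments.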