### Run the Demo Application

Source: https://github.com/microsoft/gui-actor/blob/main/demo/README.md

Executes the main Python script to start the GUI Actor demo. This command should be run after installing dependencies.

```bash
python app.py
```

--------------------------------

### Install GUI-Actor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Clone the repository and install the package in editable mode.

```bash
git clone https://github.com/microsoft/GUI-Actor.git
cd GUI-Actor
pip install -e .
```

--------------------------------

### Training Example with GUI-Actor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Demonstrates how to load and initialize the Qwen2VLForConditionalGenerationWithPointer model for training, specifying data type and device mapping.

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer

model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)
```

--------------------------------

### Example YAML Data Configuration

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Demonstrates how to configure datasets using YAML, specifying paths, sampling strategies, and image folders.

```yaml
datasets:
  - json_path: /data/screenspot/train.json
    sampling_strategy: first:5000
    images_folder: /data/screenspot/images
  - json_path: /data/gui_actions/data.json
    sampling_strategy: random:1000
    images_folder: /data/gui_actions/images
  - json_path: /data/mobile/train.json
    sampling_strategy: all
    images_folder: /data/mobile/images
```

--------------------------------

### Install Dependencies

Source: https://github.com/microsoft/gui-actor/blob/main/demo/README.md

Installs the necessary Python packages listed in requirements.txt. Ensure you have pip installed.

```bash
pip install -r requirements.txt
```

--------------------------------

### Example Usage of LazySupervisedDataset

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/dataset.md

Demonstrates how to initialize the dataset with a tokenizer, processor, and data configuration, and then load a sample.

```python
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.dataset import LazySupervisedDataset

tokenizer = AutoTokenizer.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")

class DataArgs:
    image_folder = "/data/images"
    max_conv_turns = 10
    early_mix_text = False

dataset = LazySupervisedDataset(
    tokenizer=tokenizer,
    processor=processor,
    data_path="/data/config.yaml",
    data_args=DataArgs()
)

# Load a sample
sample = dataset[0]
print(f"Input shape: {sample['input_ids'].shape}")
print(f"Num targets: {len(sample['coordinates'])}")
```

--------------------------------

### Initialize Custom Trainer

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Example of initializing the AGUVISTrainer with TrainingArguments and a LazySupervisedDataset.

```python
from transformers import TrainingArguments
from gui_actor.trainer import AGUVISTrainer
from gui_actor.dataset import LazySupervisedDataset
```

--------------------------------

### Prepare Example Data

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Loads a dataset and extracts a sample for processing. This prepares the data for model inference.

```python
from datasets import load_dataset

# Load the ScreenSpot test split and take the first example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")
```

--------------------------------

### Inference Example with generate()

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Perform inference using the model's generate method. This is a simplified example; refer to inference.md for a full demonstration.

```python
# See inference.md for full inference example using generate() or inference()
outputs = model.generate(
    input_ids=input_ids,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    max_new_tokens=100,
    return_dict_in_generate=True,
    output_hidden_states=True
)
```

--------------------------------

### GUI Actor Inference Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/inference.md

Demonstrates how to load the GUI Actor model, prepare conversation input with an image and text, run the inference function, and extract the predicted click coordinates and confidence score. This example uses placeholder mode for faster generation.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference
from gui_actor.constants import grounding_system_message

# Load model and processor
model_name = "microsoft/GUI-Actor-7B-Qwen2-VL"
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()
data_processor = AutoProcessor.from_pretrained(model_name)
tokenizer = data_processor.tokenizer

# Prepare input
image = Image.open("screenshot.png")
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": grounding_system_message}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Close the window"}
        ]
    }
]

# Run inference
with torch.no_grad():
    pred = inference(
        conversation,
        model,
        tokenizer,
        data_processor,
        use_placeholder=True,
        topk=3
    )

# Extract results
best_x, best_y = pred["topk_points"][0]
print(f"Predicted click: ({best_x:.4f}, {best_y:.4f})")
print(f"Confidence: {pred['topk_values'][0]:.4f}")
```

--------------------------------

### Loading and Training Qwen2.5-VL Model

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Example of how to load the Qwen2.5-VL model with pointer capabilities and use it for training. This snippet demonstrates combining language modeling loss with pointer loss for multi-patch grounding.

```python
import torch
from gui_actor.modeling_qwen25vl import Qwen2_5_VLForConditionalGenerationWithPointer

# Load Qwen2.5-VL variant
model = Qwen2_5_VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2.5-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)

# Training: combine LM and pointer losses
outputs = model(
    input_ids=input_ids,
    labels=labels,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    visual_token_indices_of_coordinates=visual_indices,
    multi_patch_labels=patch_labels,
    if_multi_patch=True
)

# Loss is combination
loss = outputs.loss  # = lm_loss_weight * lm_loss + pointer_loss_weight * pointer_loss
```

--------------------------------

### ForceFollowTokensLogitsProcessor Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/inference.md

Demonstrates how to instantiate and use the ForceFollowTokensLogitsProcessor with a tokenizer. This is useful for controlling specific token generation during inference.

```python
from transformers import AutoTokenizer
from gui_actor.inference import ForceFollowTokensLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
pointer_pad_id = tokenizer.encode("<|pointer_pad|>")[0]
pointer_end_id = tokenizer.encode("<|pointer_end|>")[0]

processor = ForceFollowTokensLogitsProcessor(
    token_a_id=pointer_pad_id,
    forced_sequence=[pointer_end_id]
)
```

--------------------------------

### Run Inference with GUI-Actor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Example of loading the model, preparing input with an image and conversation, and running inference to get prediction coordinates.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference
from gui_actor.constants import grounding_system_message

# Load model
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()
processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = processor.tokenizer

# Prepare input
image = Image.open("screenshot.png")
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": grounding_system_message}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Click the submit button"}
        ]
    }
]

# Run inference
with torch.no_grad():
    pred = inference(
        conversation,
        model,
        tokenizer,
        processor,
        use_placeholder=True,
        topk=3
    )

# Get prediction
x, y = pred["topk_points"][0]
print(f"Click at ({x:.4f}, {y:.4f})")
```

--------------------------------

### Example Training Sample

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Illustrates a complete training sample, including its ID, associated image file, and a conversation with human and GPT turns, featuring ground truth bounding box information.

```python
{
    "id": "sample_001",
    "image": "screenshot.jpg",
    "conversations": [
        {"from": "human", "value": "\nClose the window"},
        {
            "from": "gpt",
            "value": "pyautogui.click(x=0.95, y=0.15)",
            "recipient": "os",
            "end_turn": True,
            "bbox_gt": [0.9, 0.1, 1.0, 0.2]
        }
    ]
}
```

--------------------------------

### Training Setup with Special Tokens and Ignore Index

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/constants.md

Demonstrates how to register special tokens with the tokenizer and create a dataset with an ignore index for labels during training.

```python
import torch
from gui_actor.constants import (
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
    IGNORE_INDEX,
    grounding_system_message,
)

# Register special tokens
special_tokens = [
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# Create dataset with ignore index
labels = torch.full((seq_len,), IGNORE_INDEX, dtype=torch.long)
labels[target_positions] = target_token_ids

# Use system message in conversations
conversation = [{
    "role": "system",
    "content": [{"type": "text", "text": grounding_system_message}]
}]
```

--------------------------------

### YAML Configuration for Dataset Loading

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/dataset.md

Example of a YAML configuration file for specifying multiple datasets, their sampling strategies, and associated image folders.

```yaml
datasets:
  - json_path: path/to/data1.json
    sampling_strategy: first:1000
    images_folder: path/to/images1
  - json_path: path/to/data2.json
    sampling_strategy: random:500
    images_folder: path/to/images2
```

--------------------------------

### Install GUI-Actor Package

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Install the GUI-Actor package in editable mode with pip, from the repository root, to resolve a `ModuleNotFoundError` for `gui_actor` when loading models.

```bash
pip install -e .
```

--------------------------------

### Example Bounding Box Usage

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Illustrates the assignment of a normalized bounding box tuple to a variable.

```python
from typing import Tuple

bbox: Tuple[float, float, float, float] = (0.25, 0.25, 0.75, 0.75)  # Center square
```

--------------------------------

### Example Conversation Format

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Demonstrates a multi-turn conversation involving system instructions, user input with an image and text, and an assistant's response.

```python
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a GUI agent..."}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Click the submit button"}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "pyautogui.click(x=0.5, y=0.7)"}
        ]
    }
]
```

--------------------------------

### Create and Activate Conda Environment

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Create a new conda environment named 'gui_actor' with Python 3.10, activate it, install PyTorch with CUDA support, and then install the project dependencies.

```bash
conda create -n gui_actor python=3.10
conda activate gui_actor
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
```

--------------------------------

### Example Coordinates Usage

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Illustrates the assignment of a normalized (x, y) coordinate tuple to a variable.

```python
from typing import Tuple

click_point: Tuple[float, float] = (0.5, 0.75)  # Horizontally centered, three-quarters of the way down
```

--------------------------------

### Configure ForceFollowTokensLogitsProcessor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Initialize ForceFollowTokensLogitsProcessor to control token generation during inference. Specify the start token and a sequence of forced tokens.

```python
from gui_actor.inference import ForceFollowTokensLogitsProcessor

processor = ForceFollowTokensLogitsProcessor(
    token_a_id=tokenizer.encode("<|pointer_start|>")[0],
    forced_sequence=[
        tokenizer.encode("<|pointer_pad|>")[0],
        tokenizer.encode("<|pointer_end|>")[0]
    ]
)
```

--------------------------------

### VisionHead_MultiPatch Forward Pass Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Demonstrates the forward pass of the VisionHead_MultiPatch. Use this to compute attention weights and loss for visual grounding tasks.

```python
import torch
from gui_actor.modeling import VisionHead_MultiPatch

head = VisionHead_MultiPatch(d_model=3584, projection_dim=3584)

# Simulate visual features (e.g., 196 patches from 14x14 grid)
visual_embeds = torch.randn(196, 3584)

# Simulate target tokens (3 regions to ground)
target_hidden = torch.randn(3, 3584)

# Ground truth: first region covers patches 0-3, second covers 4-8, etc.
labels = torch.zeros(3, 196)
labels[0, :4] = 1
labels[1, 4:9] = 1
labels[2, 10:15] = 1

# Forward pass
attn_weights, loss = head(visual_embeds, target_hidden, labels=labels)
# attn_weights: (3, 196)
# loss: scalar tensor
```

--------------------------------

### get_prediction_region_point Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/inference.md

Shows how to use get_prediction_region_point with simulated attention scores to find predicted click regions. Adjust parameters like top_n and activation_threshold for different results.

```python
import torch
from gui_actor.inference import get_prediction_region_point

# Simulated attention scores from model
attn_scores = torch.randn(1, 784)  # 28x28 grid
attn_scores = torch.softmax(attn_scores, dim=-1)

best_point, all_centers, scores, all_patches = get_prediction_region_point(
    attn_scores,
    n_width=28,
    n_height=28,
    top_n=5,
    activation_threshold=0.25,
    return_all_regions=True
)

print(f"Best prediction: {best_point}")
print(f"Alternative options: {all_centers[:3]}")
```

--------------------------------

### Perform Inference with `inference()`

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

This function serves as the primary entry point for model inference. It takes a conversation history, model, tokenizer, and processor to generate predictions. The example shows how to set up the conversation with system and user roles, an image, and text, and then process the output.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference
from gui_actor.constants import grounding_system_message

model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
).eval()
processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = processor.tokenizer

image = Image.open("screenshot.png")
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": grounding_system_message}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Click the submit button"}
        ]
    }
]

with torch.no_grad():
    pred = inference(
        conversation=conversation,
        model=model,
        tokenizer=tokenizer,
        data_processor=processor,
        use_placeholder=True,
        topk=3
    )

# Results
print(f"Best point: {pred['topk_points'][0]}")
print(f"Confidence: {pred['topk_values'][0]}")
print(f"Alternatives: {pred['topk_points'][1:]}")
```

--------------------------------

### Load and Use Qwen2VLForConditionalGenerationWithPointer

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Demonstrates loading the Qwen2VL model with specified configurations and using it for both training and inference. For training, it combines LM and pointer losses. For inference, it uses the `generate` method.

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer

# Load model
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()
processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = processor.tokenizer

# Training mode (combine LM and pointer losses)
model.train()  # the model was loaded with .eval(); switch back before computing training losses
outputs = model(
    input_ids=input_ids,
    labels=labels,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    visual_token_indices_of_coordinates=coordinates,
    multi_patch_labels=patch_labels,
)
loss = outputs.loss
loss.backward()

# Inference mode
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        pixel_values=pixel_values,
        image_grid_thw=image_grid_thw,
        max_new_tokens=100,
        return_dict_in_generate=True,
        output_hidden_states=True
    )
```

--------------------------------

### Initialize AGUVISTrainer

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Instantiate the AGUVISTrainer with model, training arguments, datasets, data collator, tokenizer, and processor.

```python
from gui_actor.trainer import AGUVISTrainer

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    processing_class=processor,
)
```

--------------------------------

### Load Model with Optimizations

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Load the Qwen2VLForConditionalGenerationWithPointer model with specified data type, device mapping, and attention implementation for performance optimization.

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer

model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = AutoTokenizer.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
```

--------------------------------

### Create Training Sampler

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Creates a training sampler with support for length-grouped, modality-grouped, or random sampling strategies to optimize data loading and minimize padding.

```python
def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]
```

--------------------------------

### Run Warmup Training Script

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Execute the warmup training script for the GUI-Actor model.

```bash
bash scripts/warmup.sh
```

--------------------------------

### Run Full-Parameter Training Script

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Execute the full-parameter training script for the GUI-Actor model.

```bash
bash scripts/train.sh
```

--------------------------------

### Python Example of Using do_boxes_overlap

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/utils.md

Demonstrates how to use the `do_boxes_overlap` function with both overlapping and non-overlapping boxes. Ensure the function is imported from `gui_actor.utils` before use.

```python
from gui_actor.utils import do_boxes_overlap

box1 = (0, 0, 100, 100)
box2 = (50, 50, 150, 150)

if do_boxes_overlap(box1, box2):
    print("Boxes overlap!")  # Will print
else:
    print("No overlap")

# Non-overlapping boxes
box3 = (200, 200, 300, 300)
print(do_boxes_overlap(box1, box3))  # False
```

--------------------------------

### Register and Get Special Token IDs

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Register special pointer tokens with the tokenizer and retrieve their corresponding IDs for use in model input.

```python
from gui_actor.constants import (
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN
)

# Register with tokenizer
tokenizer.add_special_tokens({
    "additional_special_tokens": [
        DEFAULT_POINTER_START_TOKEN,
        DEFAULT_POINTER_PAD_TOKEN,
        DEFAULT_POINTER_END_TOKEN,
    ]
})

# Get token IDs
start_id = tokenizer.encode(DEFAULT_POINTER_START_TOKEN)[0]
pad_id = tokenizer.encode(DEFAULT_POINTER_PAD_TOKEN)[0]
end_id = tokenizer.encode(DEFAULT_POINTER_END_TOKEN)[0]
```

--------------------------------

### Evaluate on ScreenSpot-Pro

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Evaluate the GUI-Actor model on the ScreenSpot-Pro benchmark. Ensure you have downloaded the data and provide the correct paths to the saved results and data directory.

```bash
python eval/screenSpot_pro.py --save_path --data_path
```

--------------------------------

### Obtaining Special Token IDs

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/constants.md

Illustrates how to get the integer IDs for special tokens like '<|pointer_start|>' and '<|pointer_end|>' by encoding them using the tokenizer.

```python
from gui_actor.constants import (
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
)

pointer_start_id = tokenizer.encode(DEFAULT_POINTER_START_TOKEN)[0]
pointer_end_id = tokenizer.encode(DEFAULT_POINTER_END_TOKEN)[0]
```

--------------------------------

### Initialize and Load GUI-Actor Model

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

This Python snippet demonstrates how to load the GUI-Actor model with Qwen2-VL backbone, including processor, tokenizer, and model configuration with specific torch dtype and attention implementation. The model is set to evaluation mode.

```python
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import AutoProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference

# load model
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = AutoProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()
```

--------------------------------

### Qwen2_5_VLForConditionalGenerationWithPointer Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Main model class integrating Qwen2.5-VL with a pointer head for coordinate-free grounding. Customize loss weights for pointer and language modeling components.

```python
Qwen2_5_VLForConditionalGenerationWithPointer(
    config,
    *args,
    pointer_loss_weight: float = 1.0,
    lm_loss_weight: float = 1.0,
    **kwargs
)
```

--------------------------------

### Import Dataset and Training Utilities

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import necessary components for dataset handling and training. This includes dataset classes, data reformatting utilities, and the trainer itself.

```python
from gui_actor.dataset import (
    LazySupervisedDataset,
    reformat_coordinates,
    get_token_index,
    get_multi_patch_labels
)
from gui_actor.trainer import AGUVISTrainer
```

--------------------------------

### Coordinate Extraction Patterns (Drag)

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/constants.md

Regular expression patterns used to extract coordinates for drag operations from text responses. These patterns capture the start and end coordinates of a drag action.

```python
r"from_coord=\[([0-9.]+), ([0-9.]+)\], to_coord=\[([0-9.]+), ([0-9.]+)\]"
```

--------------------------------

### AGUVISTrainer.create_accelerator_and_postprocess

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Overrides the parent method to configure the Accelerator with custom settings, including gradient accumulation, DeepSpeed plugin support, a 52-week timeout, and FSDP activation checkpointing.

```APIDOC
## AGUVISTrainer.create_accelerator_and_postprocess

### Description
Overrides parent to configure Accelerator with custom settings.

### Function Signature
```python
def create_accelerator_and_postprocess(self) -> None
```

### Configuration
- **Gradient Accumulation**: Disabled sync with dataloader for efficiency
- **DeepSpeed**: Supports DeepSpeed plugin from training args
- **Timeout**: 52-week timeout for long training runs
- **FSDP**: Configures activation checkpointing if FSDP enabled
```

--------------------------------

### AGUVISTrainer Create Accelerator and Postprocess

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

This method overrides the parent's functionality to configure the Accelerator with custom settings, including gradient accumulation, DeepSpeed support, and FSDP configuration.

```python
def create_accelerator_and_postprocess(self) -> None:
    # Overrides parent to configure Accelerator with custom settings.
    # Configuration:
    # - Gradient Accumulation: Disabled sync with dataloader for efficiency
    # - DeepSpeed: Supports DeepSpeed plugin from training args
    # - Timeout: 52-week timeout for long training runs
    # - FSDP: Configures activation checkpointing if FSDP enabled
    pass
```

--------------------------------

### Qwen2VLForConditionalGenerationWithPointer Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Initializes the Qwen2VLForConditionalGenerationWithPointer model. Accepts a configuration object and optional weights for pointer and language model losses.

```python
Qwen2VLForConditionalGenerationWithPointer(
    config,
    *args,
    pointer_loss_weight: float = 1.0,
    lm_loss_weight: float = 1.0,
    **kwargs
)
```

--------------------------------

### Run ScreenSpot Evaluation

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Execute the evaluation script for the ScreenSpot benchmark. Ensure the script path is correct within the 'eval/' directory.

```bash
python eval/screenSpot.py
```

--------------------------------

### Prepare Batch and Forward Pass

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Prepare a batch of input data including token IDs, labels, and pixel values for the model. Then, perform a forward pass to calculate the loss and print various loss components.

```python
batch = {
    "input_ids": torch.randint(0, 50000, (2, 512)),  # batch_size=2
    "labels": torch.randint(0, 50000, (2, 512)),
    "pixel_values": torch.randn(2, 3, 1088, 1088),  # 2 images
    "image_grid_thw": torch.tensor([[1, 14, 14], [1, 14, 14]]),
    "visual_token_indices_of_coordinates": torch.tensor([[5, 10], [15, 20]]),
    "multi_patch_labels": [
        torch.ones(2, 196) * 0.1,  # sample 1: 2 targets, 196 patches
        torch.ones(2, 196) * 0.1   # sample 2: 2 targets, 196 patches
    ],
    "if_multi_patch": True,
}

# Forward pass
outputs = model(**batch)
loss = outputs.loss
print(f"Total loss: {loss.item()}")
print(f"LM loss: {outputs.lm_loss.item()}")
print(f"Pointer loss: {outputs.pointer_loss.item()}")

# Backward pass
loss.backward()
```

--------------------------------

### Load Dataset and Configure Training

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

This snippet demonstrates how to load a dataset using LazySupervisedDataset and configure training arguments with TrainingArguments for the AGUVISTrainer.

```python
class DataArgs:
    image_folder = "/path/to/images"
    max_conv_turns = 10
    early_mix_text = False

dataset = LazySupervisedDataset(
    tokenizer=tokenizer,
    processor=processor,
    data_path="/path/to/config.yaml",
    data_args=DataArgs()
)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    learning_rate_new_params=1e-3,
    group_by_length=True,
    save_steps=500,
)

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    processing_class=processor,
)

trainer.train()
```

--------------------------------

### Configure Loss Weights via Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Set initial loss weights for pointer head and language model directly when loading the model from a pre-trained checkpoint.

```python
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    pointer_loss_weight=1.0,
    lm_loss_weight=1.0,
)
```

--------------------------------

### Clone GUI-Actor Repository

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Clone the GUI-Actor repository to your local machine and navigate into the project directory.

```bash
git clone https://github.com/microsoft/GUI-Actor.git
cd GUI-Actor
```

--------------------------------

### QwenVLwithVisionHeadOutputWithPast Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Custom output class extending Qwen2.5-VL's base output with vision pointer network results. Use this to capture language modeling loss, pointer loss, and pointer scores.

```python
QwenVLwithVisionHeadOutputWithPast(
    lm_loss: Optional[torch.FloatTensor] = None,
    pointer_loss: Optional[torch.FloatTensor] = None,
    pointer_scores: Optional[List[torch.FloatTensor]] = None,
    *args,
    **kwargs
)
```

--------------------------------

### Qwen2VLForConditionalGenerationWithPointer Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Initializes the Qwen2VLForConditionalGenerationWithPointer model. This constructor allows for the configuration of the base Qwen2VL model along with specific weights for pointer and language model losses.

```APIDOC
## Constructor

```python
Qwen2VLForConditionalGenerationWithPointer(
    config,
    *args,
    pointer_loss_weight: float = 1.0,
    lm_loss_weight: float = 1.0,
    **kwargs
)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| config | PretrainedConfig | — | Model config (from base Qwen2VL) |
| pointer_loss_weight | float | 1.0 | Weight of pointer loss in combined loss |
| lm_loss_weight | float | 1.0 | Weight of language model loss in combined loss |

### Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| multi_patch_pointer_head | VisionHead_MultiPatch | Multi-patch grounding head |
| pointer_loss_weight | float | Pointer loss scaling factor |
| lm_loss_weight | float | LM loss scaling factor |
```

--------------------------------

### Create Optimizer with Different Learning Rates

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Initializes an optimizer with distinct learning rates for the base model parameters and newly added parameters (e.g., pointer head, embed tokens), useful during model warmup stages.

```python
def create_optimizer_with_different_learning_rates(self) -> torch.optim.Optimizer
```

```python
from transformers import TrainingArguments
from gui_actor.trainer import AGUVISTrainer
from gui_actor.dataset import LazySupervisedDataset

# Assumes `model`, `train_dataset`, and `data_collator` are already defined.
# NOTE: `learning_rate_new_params` is not a stock transformers.TrainingArguments
# field; this assumes a project-specific TrainingArguments subclass that defines it.
training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    learning_rate_new_params=1e-3,
    group_by_length=True,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_persistent_workers=True,
    gradient_accumulation_steps=2,
)

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

# Use different learning rates
trainer.create_optimizer_with_different_learning_rates()
trainer.train()
```

--------------------------------

### VisionHead_MultiPatch Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Initializes a multi-patch visual grounding head. Configure with hidden dimension, projection dimension, number of attention heads, and dropout rate.

```python
VisionHead_MultiPatch(
    d_model: int,
    projection_dim: int,
    num_attention_heads: int = 8,
    dropout_rate: float = 0.1
)
```

--------------------------------

### Run DeepSpeed ZeRO-3 Training

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Launch distributed training using DeepSpeed ZeRO-3. This command specifies the number of GPUs, the DeepSpeed configuration file, and the number of training epochs.

```bash
deepspeed --num_gpus 8 train.py \
    --deepspeed scripts/zero3.json \
    --num_train_epochs 3
```

--------------------------------

### __getitem__

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/dataset.md

Retrieves and preprocesses a sample at a given index.

```APIDOC
## __getitem__

### Description

Retrieves and preprocesses the sample at index i.

### Method Signature

```python
def __getitem__(self, i: int) -> Dict[str, torch.Tensor]
```

### Returns

Dictionary with keys:

- **input_ids** (torch.LongTensor) - `(seq_len,)` - Token IDs
- **labels** (torch.LongTensor) - `(seq_len,)` - Target IDs (IGNORE_INDEX for non-target tokens)
- **coordinates** (List[Tuple[float, float]]) - Normalized click points
- **visual_token_indices_of_coordinates** (torch.LongTensor) - `(n_targets,)` - Token indices for each coordinate
- **pixel_values** (torch.Tensor) - `(num_images, 3, H, W)` - Image tensors
- **image_grid_thw** (torch.LongTensor) - `(num_images, 3)` - Grid dimensions [T, H, W]
- **multi_patch_labels** (torch.Tensor) - `(n_targets, n_patches)` - Binary masks for regions
- **id** (str) - Sample ID

### Fallback Behavior

If sample processing fails, a different sample is selected at random instead of raising an exception.
```

--------------------------------

### Required Runtime Dependencies

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Lists the essential packages required for the project to run, including their version constraints.

```text
pre-commit>=3.7.1                   # Git hooks framework
pip>=24.1.1                         # Package installer
Pillow>=10.4.0                      # Image processing
liger-kernel==0.5.2                 # Optimized LM kernel
opencv-python-headless>=4.10.0.84   # Computer vision
accelerate==1.1.1                   # Distributed training
qwen-vl-utils==0.0.8                # Qwen VL processing utilities
deepspeed==0.16.0                   # DeepSpeed training optimizations
transformers==4.51.3                # Hugging Face transformers
flash-attn==2.7.3                   # Flash Attention optimization
wandb==0.18.3                       # Weights & Biases logging
datasets>=2.18.0                    # Dataset utilities
```

--------------------------------

### Load Qwen2.5-VL Model Variant

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Load the GUI-Actor-7B-Qwen2.5-VL model for improved performance. Specify torch_dtype and device_map for efficient loading.
```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling_qwen25vl import Qwen2_5_VLForConditionalGenerationWithPointer

# Qwen2.5-VL variant (better performance)
model = Qwen2_5_VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2.5-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)
processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2.5-VL")
```

--------------------------------

### Project Citation (BibTeX)

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Provides the BibTeX entry for citing the GUI-Actor project in academic work.

```bibtex
@article{wu2025gui,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
  journal={arXiv preprint arXiv:2506.03143},
  year={2025}
}
```

--------------------------------

### Import Utility Functions

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import necessary utility functions from gui_actor.utils for data handling and visualization.

```python
from gui_actor.utils import (
    dump_args_to_json,
    draw_point,
    draw_bbox,
    do_boxes_overlap
)
```

--------------------------------

### Configure Weights & Biases

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Sets environment variables for Weights & Biases (WandB) to log training metrics and experiments. Specify the project name and the entity (team) for logging.

```bash
# Weights & Biases
export WANDB_PROJECT=gui-actor
export WANDB_ENTITY=my-team
```

--------------------------------

### Custom AGUVISTrainer with Length-Grouped Sampling

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Use AGUVISTrainer for custom training with length-grouped sampling and loss weighting.
Configure training arguments and initialize the trainer with model, datasets, and processor.

```python
from transformers import TrainingArguments
from gui_actor.trainer import AGUVISTrainer

# Assumes `model`, `train_dataset`, `eval_dataset`, `data_collator`,
# `tokenizer`, and `processor` are already defined.
# NOTE: `learning_rate_new_params` is not a stock transformers.TrainingArguments
# field; this assumes a project-specific TrainingArguments subclass that defines it.
training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    learning_rate_new_params=1e-3,
    group_by_length=True,
    save_steps=500,
    logging_steps=10,
)

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    processing_class=processor,
)

# Warmup stage: train the pointer head only (LM loss disabled)
model.reset_loss_weights(pointer_loss_weight=1.0, lm_loss_weight=0.0)
trainer.train()

# Full training: pointer and LM losses weighted equally
model.reset_loss_weights(pointer_loss_weight=1.0, lm_loss_weight=1.0)
trainer.train(resume_from_checkpoint="./checkpoints/checkpoint-1000")
```

--------------------------------

### dump_args_to_json

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/utils.md

Serializes training and model configuration to a JSON file for experiment tracking. It filters arguments to be JSON-serializable, saves the output to `{output_dir}/args.json` with 4-space indentation, and skips saving if the file already exists.

```APIDOC
## dump_args_to_json

### Description

Serializes training and model configuration to a JSON file for experiment tracking.

### Function Signature

```python
def dump_args_to_json(
    model_config,
    data_processor,
    model_args,
    data_args,
    training_args,
    output_dir: str
) -> None
```

### Parameters

#### Path Parameters

- **model_config** (object) - Required - Model config object (e.g., config from from_pretrained)
- **data_processor** (object) - Required - Image processor + tokenizer combined
- **model_args** (object) - Required - Model-specific arguments (as argparse Namespace)
- **data_args** (object) - Required - Data loading arguments
- **training_args** (object) - Required - Training/optimization arguments
- **output_dir** (str) - Required - Directory where args.json will be saved (skipped if the file already exists)

### Behavior

1. Filters all arguments to only JSON-serializable values
2. Saves to `{output_dir}/args.json` with 4-space indentation
3. Skips saving if the file already exists

### Output Structure

```json
{
    "model_config": {...},
    "data_processor_config": {...},
    "image_processor_config": {...},
    "model_args": {...},
    "data_args": {...},
    "training_args": {...}
}
```

### Example

```python
from gui_actor.utils import dump_args_to_json

dump_args_to_json(
    model_config=model.config,
    data_processor=processor,
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    output_dir="./output"
)
# Creates: ./output/args.json
```
```

--------------------------------

### Create Training DataLoader

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Generates a training DataLoader with custom configurations, including column removal, custom collate function, persistent workers, custom sampling, memory pinning, and prefetching.

```python
def get_train_dataloader(self) -> DataLoader
```

--------------------------------

### Import Core Model Classes

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import the necessary classes for using the core models. These are the primary classes for building and interacting with the GUI-Actor models.
```python
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.modeling_qwen25vl import Qwen2_5_VLForConditionalGenerationWithPointer
```

--------------------------------

### Import Constants

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import predefined constants for special tokens and other configurations from gui_actor.constants.

```python
from gui_actor.constants import (
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
    ADDITIONAL_SPECIAL_TOKENS,
    grounding_system_message,
    chat_template,
    IGNORE_INDEX
)
```

--------------------------------

### Configure CUDA/GPU Settings

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Sets environment variables for CUDA and GPU usage. `CUDA_VISIBLE_DEVICES` selects which GPUs to use, and `CUDA_LAUNCH_BLOCKING=1` enables synchronous execution for easier debugging of CUDA errors.

```bash
# CUDA/GPU settings
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_LAUNCH_BLOCKING=1
```

--------------------------------

### Attention Head Output from VisionHead_MultiPatch.forward()

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

`VisionHead_MultiPatch.forward()` returns attention weights and an optional loss. The attention weights are scores softmax-normalized over the visual tokens.

```python
(
    attn_weights: torch.Tensor,  # Shape: (n_targets, n_visual)
                                 # Softmax-normalized attention scores
    loss: torch.Tensor | None,   # Shape: (1,) - KL divergence loss
)
```

--------------------------------

### Evaluation Script Execution

Source: https://github.com/microsoft/gui-actor/blob/main/verifier/README.md

Shell commands to run the evaluation scripts for ScreenSpot datasets v1, v2, and Pro. Ensure file paths are correctly updated in the scripts before execution.

```bash
bash run_ss_v1.sh
bash run_ss_v2.sh
bash run_ss_pro.sh
```
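
--------------------------------

### Sketch: Softmax Attention Scores and KL Loss

The "Attention Head Output from VisionHead_MultiPatch.forward()" entry describes softmax-normalized scores over visual tokens and a KL divergence loss. The dependency-free sketch below illustrates those two quantities for a single target. It is not the library's implementation: `softmax`, `kl_divergence`, and `pointer_loss` are hypothetical helpers, and both the uniform target over positive patches and the KL direction are assumptions made for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target: List[float], predicted: List[float], eps: float = 1e-12) -> float:
    """KL(target || predicted); entries with zero target mass contribute nothing."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


def pointer_loss(raw_scores: List[float], patch_mask: List[int]) -> float:
    """Hypothetical helper: compare attention over visual tokens with a
    uniform distribution over the patches marked 1 in the binary mask."""
    positives = sum(patch_mask)
    if positives == 0:
        raise ValueError("patch_mask must mark at least one positive patch")
    attn = softmax(raw_scores)
    target = [m / positives for m in patch_mask]  # assumed target distribution
    return kl_divergence(target, attn)


# One target, four visual tokens; patches 1 and 2 cover the target element.
raw_scores = [0.1, 2.0, 2.0, -1.0]
patch_mask = [0, 1, 1, 0]

attn_weights = softmax(raw_scores)           # one row of an (n_targets, n_visual) matrix
loss = pointer_loss(raw_scores, patch_mask)  # scalar loss for this target

print([round(w, 3) for w in attn_weights], round(loss, 4))
```

The loss is zero only when the attention mass matches the target distribution exactly, so scores concentrated on the masked patches give a smaller loss than scores spread over unrelated patches.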