### Run the Demo Application

Source: https://github.com/microsoft/gui-actor/blob/main/demo/README.md

Runs `app.py` to start the GUI-Actor demo. Run this command after installing the dependencies.

```bash
python app.py
```

--------------------------------

### Install GUI-Actor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Clone the repository and install the package in editable mode.

```bash
git clone https://github.com/microsoft/GUI-Actor.git
cd GUI-Actor
pip install -e .
```

--------------------------------

### Training Example with GUI-Actor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Demonstrates how to load and initialize the Qwen2VLForConditionalGenerationWithPointer model for training, specifying data type and device mapping.

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer

model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)
```

--------------------------------

### Example YAML Data Configuration

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Demonstrates how to configure datasets using YAML, specifying paths, sampling strategies, and image folders.

```yaml
datasets:
  - json_path: /data/screenspot/train.json
    sampling_strategy: first:5000
    images_folder: /data/screenspot/images

  - json_path: /data/gui_actions/data.json
    sampling_strategy: random:1000
    images_folder: /data/gui_actions/images

  - json_path: /data/mobile/train.json
    sampling_strategy: all
    images_folder: /data/mobile/images
```
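
The `sampling_strategy` values above follow a `name:count` pattern. The sketch below shows one way such a string could be interpreted. `apply_sampling_strategy` is a hypothetical helper, and its semantics (`first:N` keeps the first N records, `random:N` draws N records without replacement, `all` keeps everything) are assumptions inferred from the option names, not confirmed library behavior.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def apply_sampling_strategy(
    records: Sequence[T], strategy: str, seed: Optional[int] = None
) -> List[T]:
    """Hypothetical helper: select records according to a strategy string."""
    if strategy == "all":
        return list(records)

    name, _, count_text = strategy.partition(":")
    count = min(int(count_text), len(records))  # never ask for more than exist

    if name == "first":
        return list(records[:count])
    if name == "random":
        # Sample without replacement; a seed makes the draw reproducible.
        return random.Random(seed).sample(list(records), count)
    raise ValueError(f"Unknown sampling strategy: {strategy!r}")


records = list(range(10))
first_three = apply_sampling_strategy(records, "first:3")
random_four = apply_sampling_strategy(records, "random:4", seed=0)
everything = apply_sampling_strategy(records, "all")
print(first_three, len(random_four), len(everything))
```

Capping `count` at the dataset size is a design choice of this sketch; a stricter loader might raise an error instead.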

--------------------------------

### Install Dependencies

Source: https://github.com/microsoft/gui-actor/blob/main/demo/README.md

Installs the necessary Python packages listed in requirements.txt. Ensure you have pip installed.

```bash
pip install -r requirements.txt
```

--------------------------------

### Example Usage of LazySupervisedDataset

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/dataset.md

Demonstrates how to initialize the dataset with a tokenizer, processor, and data configuration, and then load a sample.

```python
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.dataset import LazySupervisedDataset

tokenizer = AutoTokenizer.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")

class DataArgs:
    image_folder = "/data/images"
    max_conv_turns = 10
    early_mix_text = False

dataset = LazySupervisedDataset(
    tokenizer=tokenizer,
    processor=processor,
    data_path="/data/config.yaml",
    data_args=DataArgs()
)

# Load a sample
sample = dataset[0]
print(f"Input shape: {sample['input_ids'].shape}")
print(f"Num targets: {len(sample['coordinates'])}")
```

--------------------------------

### Initialize Custom Trainer

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Imports needed to initialize the AGUVISTrainer with TrainingArguments and a LazySupervisedDataset. This snippet shows only the imports.

```python
from transformers import TrainingArguments
from gui_actor.trainer import AGUVISTrainer
from gui_actor.dataset import LazySupervisedDataset

```

--------------------------------

### Prepare Example Data

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Loads a dataset and extracts a sample for processing. This prepares the data for model inference.

```python
from datasets import load_dataset

# Load the ScreenSpot test split and take the first example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")
```

--------------------------------

### Inference Example with generate()

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Perform inference using the model's generate method. This is a simplified example; refer to inference.md for a full demonstration.

```python
# See inference.md for full inference example using generate() or inference()
outputs = model.generate(
    input_ids=input_ids,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    max_new_tokens=100,
    return_dict_in_generate=True,
    output_hidden_states=True
)
```

--------------------------------

### GUI Actor Inference Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/inference.md

Demonstrates how to load the GUI Actor model, prepare conversation input with an image and text, run the inference function, and extract the predicted click coordinates and confidence score. This example uses placeholder mode for faster generation.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference
from gui_actor.constants import grounding_system_message

# Load model and processor
model_name = "microsoft/GUI-Actor-7B-Qwen2-VL"
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

data_processor = AutoProcessor.from_pretrained(model_name)
tokenizer = data_processor.tokenizer

# Prepare input
image = Image.open("screenshot.png")
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": grounding_system_message}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Close the window"}
        ]
    }
]

# Run inference
with torch.no_grad():
    pred = inference(
        conversation,
        model,
        tokenizer,
        data_processor,
        use_placeholder=True,
        topk=3
    )

# Extract results
best_x, best_y = pred["topk_points"][0]
print(f"Predicted click: ({best_x:.4f}, {best_y:.4f})")
print(f"Confidence: {pred['topk_values'][0]:.4f}")
```

--------------------------------

### Loading and Training Qwen2.5-VL Model

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Example of how to load the Qwen2.5-VL model with pointer capabilities and use it for training. This snippet demonstrates combining language modeling loss with pointer loss for multi-patch grounding.

```python
import torch
from gui_actor.modeling_qwen25vl import Qwen2_5_VLForConditionalGenerationWithPointer

# Load Qwen2.5-VL variant
model = Qwen2_5_VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2.5-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)

# Training: combine LM and pointer losses
outputs = model(
    input_ids=input_ids,
    labels=labels,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    visual_token_indices_of_coordinates=visual_indices,
    multi_patch_labels=patch_labels,
    if_multi_patch=True
)

# Loss is combination
loss = outputs.loss  # = lm_loss_weight * lm_loss + pointer_loss_weight * pointer_loss
```

--------------------------------

### ForceFollowTokensLogitsProcessor Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/inference.md

Demonstrates how to instantiate the ForceFollowTokensLogitsProcessor with token IDs obtained from a tokenizer. The processor takes a trigger token (`token_a_id`) and a sequence of tokens that is forced to follow it during generation.

```python
from transformers import AutoTokenizer
from gui_actor.inference import ForceFollowTokensLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
pointer_pad_id = tokenizer.encode("<|pointer_pad|>")[0]
pointer_end_id = tokenizer.encode("<|pointer_end|>")[0]

processor = ForceFollowTokensLogitsProcessor(
    token_a_id=pointer_pad_id,
    forced_sequence=[pointer_end_id]
)
```
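
To make the idea concrete, here is a minimal pure-Python sketch of the logic such a processor could apply at each decoding step. It is not the library's implementation: `next_forced_token` and `mask_scores` are hypothetical helpers, the token IDs are made up, and the behavior (force each token of `forced_sequence` in order once `token_a_id` has been generated) is an assumption inferred from the parameter names.

```python
import math
from typing import List, Optional


def next_forced_token(
    generated: List[int], token_a_id: int, forced_sequence: List[int]
) -> Optional[int]:
    """Return the token that must come next, or None if nothing is forced."""
    # Try the longest partial match first: trigger + first k forced tokens.
    for k in range(len(forced_sequence) - 1, -1, -1):
        prefix = [token_a_id] + forced_sequence[:k]
        if generated[-len(prefix):] == prefix:
            return forced_sequence[k]
    return None


def mask_scores(scores: List[float], forced_id: Optional[int]) -> List[float]:
    """Set every score except the forced token's to -inf."""
    if forced_id is None:
        return scores
    return [s if i == forced_id else -math.inf for i, s in enumerate(scores)]


PAD_ID, END_ID = 3, 4  # hypothetical token ids for illustration
forced = next_forced_token([7, 9, PAD_ID], token_a_id=PAD_ID, forced_sequence=[END_ID])
masked = mask_scores([0.1, 0.5, 0.2, 0.9, 0.3], forced)
print(forced, masked)
```

Masking all other scores to negative infinity is the usual way a logits processor makes one token the only possible choice under both greedy decoding and sampling.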

--------------------------------

### Run Inference with GUI-Actor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Example of loading the model, preparing input with an image and conversation, and running inference to get prediction coordinates.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference
from gui_actor.constants import grounding_system_message

# Load model
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = processor.tokenizer

# Prepare input
image = Image.open("screenshot.png")
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": grounding_system_message}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Click the submit button"}
        ]
    }
]

# Run inference
with torch.no_grad():
    pred = inference(
        conversation,
        model,
        tokenizer,
        processor,
        use_placeholder=True,
        topk=3
    )

# Get prediction
x, y = pred["topk_points"][0]
print(f"Click at ({x:.4f}, {y:.4f})")
```

--------------------------------

### Example Training Sample

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Illustrates a complete training sample, including its ID, associated image file, and a conversation with human and GPT turns, featuring ground truth bounding box information.

```python
{
    "id": "sample_001",
    "image": "screenshot.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nClose the window"},
        {
            "from": "gpt",
            "value": "pyautogui.click(x=0.95, y=0.15)",
            "recipient": "os",
            "end_turn": True,
            "bbox_gt": [0.9, 0.1, 1.0, 0.2]
        }
    ]
}
```
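
A quick sanity check for samples of this shape is to confirm that the click in the `gpt` turn lands inside `bbox_gt`. The sketch below assumes the box is ordered `(x1, y1, x2, y2)` and that the action string follows the `pyautogui.click(x=..., y=...)` form shown above. `parse_click` and `point_in_bbox` are hypothetical helpers, not part of the package.

```python
import re
from typing import Optional, Sequence, Tuple

# Assumed pattern, written to match the sample action string only.
CLICK_PATTERN = re.compile(r"pyautogui\.click\(x=([0-9.]+), y=([0-9.]+)\)")


def parse_click(action: str) -> Optional[Tuple[float, float]]:
    """Extract the normalized (x, y) click point, or None if absent."""
    match = CLICK_PATTERN.search(action)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def point_in_bbox(point: Tuple[float, float], bbox: Sequence[float]) -> bool:
    """True if the point lies inside the (x1, y1, x2, y2) box, edges included."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


sample_turn = {
    "value": "pyautogui.click(x=0.95, y=0.15)",
    "bbox_gt": [0.9, 0.1, 1.0, 0.2],
}
click = parse_click(sample_turn["value"])
is_consistent = click is not None and point_in_bbox(click, sample_turn["bbox_gt"])
print(click, is_consistent)
```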

--------------------------------

### Training Setup with Special Tokens and Ignore Index

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/constants.md

Demonstrates how to register special tokens with the tokenizer and create a dataset with an ignore index for labels during training.

```python
import torch

from gui_actor.constants import (
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
    IGNORE_INDEX,
    grounding_system_message,
)

# Register special tokens (assumes `tokenizer` is already loaded)
special_tokens = [
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

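# Build labels, masking every position except the targets with IGNORE_INDEX
# (seq_len, target_positions, and target_token_ids are assumed to be defined)
labels = torch.full((seq_len,), IGNORE_INDEX, dtype=torch.long)
labels[target_positions] = target_token_ids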

# Use system message in conversations
conversation = [{
    "role": "system",
    "content": [{"type": "text", "text": grounding_system_message}]
}]
```

--------------------------------

### YAML Configuration for Dataset Loading

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/dataset.md

Example of a YAML configuration file for specifying multiple datasets, their sampling strategies, and associated image folders.

```yaml
datasets:
  - json_path: path/to/data1.json
    sampling_strategy: first:1000
    images_folder: path/to/images1
  - json_path: path/to/data2.json
    sampling_strategy: random:500
    images_folder: path/to/images2
```

--------------------------------

### Install GUI-Actor Package

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Install the GUI-Actor package in editable mode with pip to resolve a `ModuleNotFoundError` when loading models.

```bash
pip install -e .
```

--------------------------------

### Example Bounding Box Usage

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Illustrates the assignment of a normalized bounding box tuple to a variable.

```python
from typing import Tuple
bbox: Tuple[float, float, float, float] = (0.25, 0.25, 0.75, 0.75)  # Center square
```
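
Normalized boxes usually have to be scaled back to pixels before drawing or cropping. The sketch below assumes the tuple is ordered `(x1, y1, x2, y2)` and uses a hypothetical 1920x1080 screenshot. `bbox_to_pixels` and `bbox_center` are illustrative helpers, not library functions.

```python
from typing import Tuple

NormBox = Tuple[float, float, float, float]


def bbox_to_pixels(bbox: NormBox, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized (x1, y1, x2, y2) box to integer pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height)


def bbox_center(bbox: NormBox) -> Tuple[float, float]:
    """Return the normalized center point of the box."""
    x1, y1, x2, y2 = bbox
    return (x1 + x2) / 2, (y1 + y2) / 2


bbox = (0.25, 0.25, 0.75, 0.75)
pixel_box = bbox_to_pixels(bbox, width=1920, height=1080)  # hypothetical image size
center = bbox_center(bbox)
print(pixel_box, center)
```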

--------------------------------

### Example Conversation Format

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Demonstrates a multi-turn conversation involving system instructions, user input with an image and text, and an assistant's response.

```python
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a GUI agent..."}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Click the submit button"}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "pyautogui.click(x=0.5, y=0.7)"}
        ]
    }
]
```

--------------------------------

### Create and Activate Conda Environment

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Create a new conda environment named 'gui_actor' with Python 3.10, activate it, install PyTorch with CUDA support, and then install the project dependencies.

```bash
conda create -n gui_actor python=3.10
conda activate gui_actor
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
```

--------------------------------

### Example Coordinates Usage

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

Illustrates the assignment of a normalized (x, y) coordinate tuple to a variable.

```python
from typing import Tuple
click_point: Tuple[float, float] = (0.5, 0.75)  # Horizontally centered, three-quarters of the way down
```
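
To act on a normalized point, it has to be mapped to a pixel position on the actual screen. The following sketch assumes the origin is the top-left corner and uses a hypothetical 1280x720 screen. `to_pixel` is an illustrative helper, not part of the package.

```python
from typing import Tuple


def to_pixel(point: Tuple[float, float], width: int, height: int) -> Tuple[int, int]:
    """Map a normalized (x, y) point to a valid pixel index on the screen."""
    x, y = point
    # Clamp to [0, 1] so slightly out-of-range predictions stay on screen.
    x = min(max(x, 0.0), 1.0)
    y = min(max(y, 0.0), 1.0)
    # Keep the result inside 0..width-1 and 0..height-1.
    return min(int(x * width), width - 1), min(int(y * height), height - 1)


click_point = (0.5, 0.75)
pixel = to_pixel(click_point, width=1280, height=720)  # hypothetical screen size
print(pixel)
```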

--------------------------------

### Configure ForceFollowTokensLogitsProcessor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Initialize ForceFollowTokensLogitsProcessor to control token generation during inference. Specify the start token and a sequence of forced tokens.

```python
from gui_actor.inference import ForceFollowTokensLogitsProcessor

processor = ForceFollowTokensLogitsProcessor(
    token_a_id=tokenizer.encode("<|pointer_start|>")[0],
    forced_sequence=[
        tokenizer.encode("<|pointer_pad|>")[0],
        tokenizer.encode("<|pointer_end|>")[0]
    ]
)
```

--------------------------------

### VisionHead_MultiPatch Forward Pass Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Demonstrates the forward pass of the VisionHead_MultiPatch. Use this to compute attention weights and loss for visual grounding tasks.

```python
import torch
from gui_actor.modeling import VisionHead_MultiPatch

head = VisionHead_MultiPatch(d_model=3584, projection_dim=3584)

# Simulate visual features (e.g., 196 patches from 14x14 grid)
visual_embeds = torch.randn(196, 3584)

# Simulate target tokens (3 regions to ground)
target_hidden = torch.randn(3, 3584)

# Ground truth: first region covers patches 0-3, second covers 4-8, etc.
labels = torch.zeros(3, 196)
labels[0, :4] = 1
labels[1, 4:9] = 1
labels[2, 10:15] = 1

# Forward pass
attn_weights, loss = head(visual_embeds, target_hidden, labels=labels)
# attn_weights: (3, 196)
# loss: scalar tensor
```

--------------------------------

### get_prediction_region_point Example

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/inference.md

Shows how to use get_prediction_region_point with simulated attention scores to find predicted click regions. Adjust parameters like top_n and activation_threshold for different results.

```python
import torch
from gui_actor.inference import get_prediction_region_point

# Simulated attention scores from model
attn_scores = torch.randn(1, 784)  # 28x28 grid
attn_scores = torch.softmax(attn_scores, dim=-1)

best_point, all_centers, scores, all_patches = get_prediction_region_point(
    attn_scores,
    n_width=28,
    n_height=28,
    top_n=5,
    activation_threshold=0.25,
    return_all_regions=True
)

print(f"Best prediction: {best_point}")
print(f"Alternative options: {all_centers[:3]}")
```
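
The core step behind turning attention scores into a click point is mapping a patch index to a position on the `n_width` x `n_height` grid. The sketch below covers only that step, assuming patches are stored in row-major order. `patch_center` is a hypothetical helper, and the real `get_prediction_region_point` may do considerably more (for example, grouping activated patches into regions).

```python
from typing import Tuple


def patch_center(index: int, n_width: int, n_height: int) -> Tuple[float, float]:
    """Return the normalized (x, y) center of a patch in a row-major grid."""
    if not 0 <= index < n_width * n_height:
        raise IndexError(f"patch index {index} outside a {n_width}x{n_height} grid")
    row, col = divmod(index, n_width)
    return (col + 0.5) / n_width, (row + 0.5) / n_height


# Toy scores with a single peak at row 3, column 5 of a 28x28 grid.
scores = [0.0] * (28 * 28)
scores[28 * 3 + 5] = 1.0

best_index = max(range(len(scores)), key=scores.__getitem__)
center = patch_center(best_index, n_width=28, n_height=28)
print(best_index, center)
```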

--------------------------------

### Perform Inference with `inference()`

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

`inference()` is the primary entry point for model inference. It takes a conversation, a model, a tokenizer, and a processor, and returns predictions. The example builds a conversation with system and user turns containing an image and text, then reads the top-k points and confidence values from the result.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference
from gui_actor.constants import grounding_system_message

model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
).eval()

processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = processor.tokenizer

image = Image.open("screenshot.png")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": grounding_system_message}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Click the submit button"}
        ]
    }
]

with torch.no_grad():
    pred = inference(
        conversation=conversation,
        model=model,
        tokenizer=tokenizer,
        data_processor=processor,
        use_placeholder=True,
        topk=3
    )

# Results
print(f"Best point: {pred['topk_points'][0]}")
print(f"Confidence: {pred['topk_values'][0]}")
print(f"Alternatives: {pred['topk_points'][1:]}")
```

--------------------------------

### Load and Use Qwen2VLForConditionalGenerationWithPointer

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Demonstrates loading the Qwen2VL model with specified configurations and using it for both training and inference. For training, it combines LM and pointer losses. For inference, it uses the `generate` method.

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer

# Load model
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = processor.tokenizer

# Training mode (combine LM and pointer losses)
# The model was loaded with .eval(), so switch to training mode first
model.train()
outputs = model(
    input_ids=input_ids,
    labels=labels,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    visual_token_indices_of_coordinates=coordinates,
    multi_patch_labels=patch_labels,
)
loss = outputs.loss
loss.backward()

# Inference mode
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        pixel_values=pixel_values,
        image_grid_thw=image_grid_thw,
        max_new_tokens=100,
        return_dict_in_generate=True,
        output_hidden_states=True
    )
```

--------------------------------

### Initialize AGUVISTrainer

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Instantiate the AGUVISTrainer with model, training arguments, datasets, data collator, tokenizer, and processor.

```python
from gui_actor.trainer import AGUVISTrainer

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    processing_class=processor,
)
```

--------------------------------

### Load Model with Optimizations

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Load the Qwen2VLForConditionalGenerationWithPointer model with specified data type, device mapping, and attention implementation for performance optimization.

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer

model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
tokenizer = AutoTokenizer.from_pretrained("microsoft/GUI-Actor-7B-Qwen2-VL")
```

--------------------------------

### Create Training Sampler

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Creates a training sampler with support for length-grouped, modality-grouped, or random sampling strategies to optimize data loading and minimize padding.

```python
def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]
```
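
As an illustration of the length-grouped strategy, the sketch below orders sample indices so that neighbors have similar lengths while keeping some randomness. It is a simplified version of the general technique, not the trainer's actual `_get_train_sampler` logic. `length_grouped_indices` and its `mega_factor` parameter are hypothetical.

```python
import random
from typing import List


def length_grouped_indices(
    lengths: List[int], batch_size: int, seed: int = 0, mega_factor: int = 4
) -> List[int]:
    """Return a permutation of indices grouped so similar lengths are adjacent."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)

    # Split the shuffled indices into "mega-batches" of several batches each.
    mega_size = batch_size * mega_factor
    ordered: List[int] = []
    for start in range(0, len(indices), mega_size):
        mega = indices[start:start + mega_size]
        # Within a mega-batch, sort longest first so each batch pads very little.
        mega.sort(key=lambda i: lengths[i], reverse=True)
        ordered.extend(mega)
    return ordered


lengths = [5, 120, 7, 118, 64, 60, 9, 115]
order = length_grouped_indices(lengths, batch_size=2, mega_factor=4)
print(order)
```

Shuffling before grouping keeps epochs from being identical, while sorting inside each mega-batch keeps the padding per batch small.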

--------------------------------

### Run Warmup Training Script

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Execute the warmup training script for the GUI-Actor model.

```bash
bash scripts/warmup.sh
```

--------------------------------

### Run Full-Parameter Training Script

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Execute the full-parameter training script for the GUI-Actor model.

```bash
bash scripts/train.sh
```

--------------------------------

### Python Example of Using do_boxes_overlap

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/utils.md

Demonstrates how to use the `do_boxes_overlap` function with both overlapping and non-overlapping boxes. Ensure the function is imported from `gui_actor.utils` before use.

```python
from gui_actor.utils import do_boxes_overlap

box1 = (0, 0, 100, 100)
box2 = (50, 50, 150, 150)

if do_boxes_overlap(box1, box2):
    print("Boxes overlap!")  # Will print
else:
    print("No overlap")

# Non-overlapping boxes
box3 = (200, 200, 300, 300)
print(do_boxes_overlap(box1, box3))  # False
```
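
For readers who want to see the underlying test, an axis-aligned overlap check can be written in a few lines. This is a sketch of the general technique, not the source of `do_boxes_overlap`. `boxes_overlap_sketch` is a hypothetical name, boxes are assumed to be `(x1, y1, x2, y2)`, and treating boxes that only touch at an edge as overlapping is an assumption that the real function may not share.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def boxes_overlap_sketch(a: Box, b: Box) -> bool:
    """Axis-aligned overlap test for (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Boxes are disjoint if one lies entirely to the side of, above, or below the other.
    separated = ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1
    return not separated


overlapping = boxes_overlap_sketch((0, 0, 100, 100), (50, 50, 150, 150))
disjoint = boxes_overlap_sketch((0, 0, 100, 100), (200, 200, 300, 300))
print(overlapping, disjoint)
```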

--------------------------------

### Register and Get Special Token IDs

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Register special pointer tokens with the tokenizer and retrieve their corresponding IDs for use in model input.

```python
from gui_actor.constants import (
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN
)

# Register with tokenizer
tokenizer.add_special_tokens({
    "additional_special_tokens": [
        DEFAULT_POINTER_START_TOKEN,
        DEFAULT_POINTER_PAD_TOKEN,
        DEFAULT_POINTER_END_TOKEN,
    ]
})

# Get token IDs
start_id = tokenizer.encode(DEFAULT_POINTER_START_TOKEN)[0]
pad_id = tokenizer.encode(DEFAULT_POINTER_PAD_TOKEN)[0]
end_id = tokenizer.encode(DEFAULT_POINTER_END_TOKEN)[0]
```

--------------------------------

### Evaluate on ScreenSpot-Pro

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Evaluate the GUI-Actor model on the ScreenSpot-Pro benchmark. Ensure you have downloaded the data and provide the correct paths to the saved results and data directory.

```bash
python eval/screenSpot_pro.py --save_path <path_to_save_results> --data_path <path_to_data_dir>
```

--------------------------------

### Obtaining Special Token IDs

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/constants.md

Illustrates how to get the integer IDs for special tokens like '<|pointer_start|>' and '<|pointer_end|>' by encoding them using the tokenizer.

```python
from gui_actor.constants import (
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
)

pointer_start_id = tokenizer.encode(DEFAULT_POINTER_START_TOKEN)[0]
pointer_end_id = tokenizer.encode(DEFAULT_POINTER_END_TOKEN)[0]
```

--------------------------------

### Initialize and Load GUI-Actor Model

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

This Python snippet demonstrates how to load the GUI-Actor model with Qwen2-VL backbone, including processor, tokenizer, and model configuration with specific torch dtype and attention implementation. The model is set to evaluation mode.

```python
import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import AutoProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# load model
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = AutoProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()
```

--------------------------------

### Qwen2_5_VLForConditionalGenerationWithPointer Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Main model class integrating Qwen2.5-VL with a pointer head for coordinate-free grounding. Customize loss weights for pointer and language modeling components.

```python
Qwen2_5_VLForConditionalGenerationWithPointer(
    config,
    *args,
    pointer_loss_weight: float = 1.0,
    lm_loss_weight: float = 1.0,
    **kwargs
)
```

--------------------------------

### Import Dataset and Training Utilities

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import necessary components for dataset handling and training. This includes dataset classes, data reformatting utilities, and the trainer itself.

```python
from gui_actor.dataset import (
    LazySupervisedDataset,
    reformat_coordinates,
    get_token_index,
    get_multi_patch_labels
)
from gui_actor.trainer import AGUVISTrainer
```

--------------------------------

### Coordinate Extraction Patterns (Drag)

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/constants.md

Regular expression patterns used to extract coordinates for drag operations from text responses. These patterns capture the start and end coordinates of a drag action.

```python
r"from_coord=\[([0-9.]+), ([0-9.]+)\], to_coord=\[([0-9.]+), ([0-9.]+)\]"
```
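
The pattern can be applied with Python's `re` module as sketched below. The response string and the `parse_drag` helper are hypothetical and only illustrate how the four capture groups map to start and end points.

```python
import re
from typing import Optional, Tuple

Point = Tuple[float, float]

DRAG_PATTERN = re.compile(
    r"from_coord=\[([0-9.]+), ([0-9.]+)\], to_coord=\[([0-9.]+), ([0-9.]+)\]"
)


def parse_drag(text: str) -> Optional[Tuple[Point, Point]]:
    """Return ((x1, y1), (x2, y2)) for a drag action, or None if not found."""
    match = DRAG_PATTERN.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(group) for group in match.groups())
    return (x1, y1), (x2, y2)


# Hypothetical model response, used only to exercise the pattern.
response = "drag(from_coord=[0.10, 0.20], to_coord=[0.55, 0.80])"
drag = parse_drag(response)
print(drag)
```

Note that the pattern expects exactly one space after each comma, so responses formatted differently would not match.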

--------------------------------

### AGUVISTrainer.create_accelerator_and_postprocess

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Overrides the parent method to configure the Accelerator with custom settings, including gradient accumulation, DeepSpeed plugin support, a 52-week timeout, and FSDP activation checkpointing.

````APIDOC
## AGUVISTrainer.create_accelerator_and_postprocess

### Description
Overrides parent to configure Accelerator with custom settings.

### Function Signature
```python
def create_accelerator_and_postprocess(self) -> None
```

### Configuration
- **Gradient Accumulation**: Disabled sync with dataloader for efficiency
- **DeepSpeed**: Supports DeepSpeed plugin from training args
- **Timeout**: 52-week timeout for long training runs
- **FSDP**: Configures activation checkpointing if FSDP enabled
````

--------------------------------

### AGUVISTrainer Create Accelerator and Postprocess

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

This method overrides the parent's functionality to configure the Accelerator with custom settings, including gradient accumulation, DeepSpeed support, and FSDP configuration.

```python
def create_accelerator_and_postprocess(self) -> None:
    # Overrides parent to configure Accelerator with custom settings.
    # Configuration:
    # - Gradient Accumulation: Disabled sync with dataloader for efficiency
    # - DeepSpeed: Supports DeepSpeed plugin from training args
    # - Timeout: 52-week timeout for long training runs
    # - FSDP: Configures activation checkpointing if FSDP enabled
    pass
```

--------------------------------

### Qwen2VLForConditionalGenerationWithPointer Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Initializes the Qwen2VLForConditionalGenerationWithPointer model. Accepts a configuration object and optional weights for pointer and language model losses.

```python
Qwen2VLForConditionalGenerationWithPointer(
    config,
    *args,
    pointer_loss_weight: float = 1.0,
    lm_loss_weight: float = 1.0,
    **kwargs
)
```

--------------------------------

### Run ScreenSpot Evaluation

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Execute the evaluation script for the ScreenSpot benchmark. Ensure the script path is correct within the 'eval/' directory.

```bash
python eval/screenSpot.py
```

--------------------------------

### Prepare Batch and Forward Pass

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Prepare a batch of input data including token IDs, labels, and pixel values for the model. Then, perform a forward pass to calculate the loss and print various loss components.

```python
import torch

# `model` is assumed to be a loaded Qwen2VLForConditionalGenerationWithPointer
batch = {
    "input_ids": torch.randint(0, 50000, (2, 512)),  # batch_size=2
    "labels": torch.randint(0, 50000, (2, 512)),
    "pixel_values": torch.randn(2, 3, 1088, 1088),  # 2 images
    "image_grid_thw": torch.tensor([[1, 14, 14], [1, 14, 14]]),
    "visual_token_indices_of_coordinates": torch.tensor([[5, 10], [15, 20]]),
    "multi_patch_labels": [
        torch.ones(2, 196) * 0.1,  # sample 1: 2 targets, 196 patches
        torch.ones(2, 196) * 0.1   # sample 2: 2 targets, 196 patches
    ],
    "if_multi_patch": True,
}

# Forward pass
outputs = model(**batch)
loss = outputs.loss
print(f"Total loss: {loss.item()}")
print(f"LM loss: {outputs.lm_loss.item()}")
print(f"Pointer loss: {outputs.pointer_loss.item()}")

# Backward pass
loss.backward()
```

--------------------------------

### Load Dataset and Configure Training

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

This snippet demonstrates how to load a dataset using LazySupervisedDataset and configure training arguments with TrainingArguments for the AGUVISTrainer.

```python
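from transformers import TrainingArguments
from gui_actor.trainer import AGUVISTrainer
from gui_actor.dataset import LazySupervisedDataset

# `model`, `tokenizer`, `processor`, and `data_collator` are assumed to be defined
class DataArgs: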
    image_folder = "/path/to/images"
    max_conv_turns = 10
    early_mix_text = False

dataset = LazySupervisedDataset(
    tokenizer=tokenizer,
    processor=processor,
    data_path="/path/to/config.yaml",
    data_args=DataArgs()
)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    learning_rate_new_params=1e-3,  # not a stock transformers.TrainingArguments field; assumes a project-specific extension
    group_by_length=True,
    save_steps=500,
)

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    processing_class=processor,
)

trainer.train()
```

--------------------------------

### Configure Loss Weights via Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Set initial loss weights for pointer head and language model directly when loading the model from a pre-trained checkpoint.

```python
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2-VL",
    pointer_loss_weight=1.0,
    lm_loss_weight=1.0,
)
```
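
The Qwen2.5-VL training example in this collection describes the total loss as `lm_loss_weight * lm_loss + pointer_loss_weight * pointer_loss`. Assuming the Qwen2-VL class combines its terms the same way, the effect of the two weights can be sketched with plain numbers. `combined_loss` is a hypothetical helper and the loss values are made up.

```python
def combined_loss(
    lm_loss: float,
    pointer_loss: float,
    lm_loss_weight: float = 1.0,
    pointer_loss_weight: float = 1.0,
) -> float:
    """Weighted sum of the language-modeling and pointer losses."""
    return lm_loss_weight * lm_loss + pointer_loss_weight * pointer_loss


# Made-up loss values for illustration.
default_total = combined_loss(lm_loss=2.0, pointer_loss=0.5)
pointer_heavy = combined_loss(
    lm_loss=2.0, pointer_loss=0.5, lm_loss_weight=0.5, pointer_loss_weight=2.0
)
print(default_total, pointer_heavy)
```

Raising `pointer_loss_weight` relative to `lm_loss_weight` shifts the gradient signal toward the grounding head.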

--------------------------------

### Clone GUI-Actor Repository

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Clone the GUI-Actor repository to your local machine and navigate into the project directory.

```bash
git clone https://github.com/microsoft/GUI-Actor.git
cd GUI-Actor
```

--------------------------------

### QwenVLwithVisionHeadOutputWithPast Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Custom output class extending Qwen2.5-VL's base output with vision pointer network results. Use this to capture language modeling loss, pointer loss, and pointer scores.

```python
QwenVLwithVisionHeadOutputWithPast(
    lm_loss: Optional[torch.FloatTensor] = None,
    pointer_loss: Optional[torch.FloatTensor] = None,
    pointer_scores: Optional[List[torch.FloatTensor]] = None,
    *args,
    **kwargs
)
```

--------------------------------

### Qwen2VLForConditionalGenerationWithPointer Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling.md

Constructs a `Qwen2VLForConditionalGenerationWithPointer` from a base Qwen2VL config, with separate weights for the pointer loss and the language model loss.

```APIDOC
## Constructor

```python
Qwen2VLForConditionalGenerationWithPointer(
    config,
    *args,
    pointer_loss_weight: float = 1.0,
    lm_loss_weight: float = 1.0,
    **kwargs
)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| config | PretrainedConfig | — | Model config (from base Qwen2VL) |
| pointer_loss_weight | float | 1.0 | Weight of pointer loss in combined loss |
| lm_loss_weight | float | 1.0 | Weight of language model loss in combined loss |

### Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| multi_patch_pointer_head | VisionHead_MultiPatch | Multi-patch grounding head |
| pointer_loss_weight | float | Pointer loss scaling factor |
| lm_loss_weight | float | LM loss scaling factor |
```
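
The two weights can be read as coefficients of a weighted sum. The sketch below assumes the combined loss is a plain linear combination of the two terms; the exact formula is not shown in this section, and `combined_loss` is a hypothetical helper rather than part of the library.

```python
def combined_loss(lm_loss, pointer_loss, lm_loss_weight=1.0, pointer_loss_weight=1.0):
    """Weighted sum of the two loss terms (assumed form, not taken from the library)."""
    return lm_loss_weight * lm_loss + pointer_loss_weight * pointer_loss


# Equal weighting, matching the documented defaults
print(combined_loss(2.0, 0.5))                      # prints 2.5
# Pointer-only stage: the language model term is switched off
print(combined_loss(2.0, 0.5, lm_loss_weight=0.0))  # prints 0.5
```

Setting one weight to `0.0` removes that term from the total, which is how a pointer-only warmup stage could be expressed.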

--------------------------------

### Create Optimizer with Different Learning Rates

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Initializes an optimizer with distinct learning rates for the base model parameters and newly added parameters (e.g., pointer head, embed tokens), useful during model warmup stages.

```python
def create_optimizer_with_different_learning_rates(self) -> torch.optim.Optimizer
```

```python
from transformers import TrainingArguments
from gui_actor.trainer import AGUVISTrainer
from gui_actor.dataset import LazySupervisedDataset

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    learning_rate_new_params=1e-3,  # not a stock transformers field; assumes an extended TrainingArguments
    group_by_length=True,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    dataloader_persistent_workers=True,
    gradient_accumulation_steps=2,
)

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

# Use different learning rates
trainer.create_optimizer_with_different_learning_rates()

trainer.train()
```
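
As a rough illustration of what separate learning rates mean at the optimizer level, the sketch below splits named parameters into two groups. It is not the trainer's actual code: `build_param_groups` and the keyword list are assumptions made for illustration, loosely based on the "pointer head, embed tokens" examples in the description.

```python
def build_param_groups(named_params, base_lr, new_lr, new_param_keywords):
    """Split (name, param) pairs into two groups with separate learning rates."""
    base, new = [], []
    for name, param in named_params:
        # A parameter counts as "new" if its name contains any keyword.
        if any(key in name for key in new_param_keywords):
            new.append(param)
        else:
            base.append(param)
    groups = []
    if base:
        groups.append({"params": base, "lr": base_lr})
    if new:
        groups.append({"params": new, "lr": new_lr})
    return groups
```

The returned list has the shape PyTorch optimizers accept as parameter groups, so with `named_params=model.named_parameters()` it could be passed to `torch.optim.AdamW(groups)`.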

--------------------------------

### VisionHead_MultiPatch Constructor

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/modeling_qwen25vl.md

Initializes a multi-patch visual grounding head. Configure with hidden dimension, projection dimension, number of attention heads, and dropout rate.

```python
VisionHead_MultiPatch(
    d_model: int,
    projection_dim: int,
    num_attention_heads: int = 8,
    dropout_rate: float = 0.1
)
```

--------------------------------

### Run DeepSpeed ZeRO-3 Training

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/README.md

Launch distributed training using DeepSpeed ZeRO-3. This command specifies the number of GPUs, the DeepSpeed configuration file, and the number of training epochs.

```bash
deepspeed --num_gpus 8 train.py \
    --deepspeed scripts/zero3.json \
    --num_train_epochs 3
```

--------------------------------

### __getitem__

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/dataset.md

Retrieves and preprocesses a sample at a given index.

```APIDOC
## __getitem__

### Description
Retrieves and preprocesses the sample at index `i`.

### Method Signature
```python
def __getitem__(self, i: int) -> Dict[str, torch.Tensor]
```

### Returns
Dictionary with keys:
- **input_ids** (torch.LongTensor) - `(seq_len,)` - Token IDs
- **labels** (torch.LongTensor) - `(seq_len,)` - Target IDs (IGNORE_INDEX for non-target tokens)
- **coordinates** (List[Tuple[float, float]]) - Normalized click points
- **visual_token_indices_of_coordinates** (torch.LongTensor) - `(n_targets,)` - Token indices for each coordinate
- **pixel_values** (torch.Tensor) - `(num_images, 3, H, W)` - Image tensors
- **image_grid_thw** (torch.LongTensor) - `(num_images, 3)` - Grid dimensions [T, H, W]
- **multi_patch_labels** (torch.Tensor) - `(n_targets, n_patches)` - Binary masks for regions
- **id** (str) - Sample ID

### Fallback Behavior
If sample processing fails, randomly selects another sample instead of raising.
```
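
The fallback behavior can be sketched with a toy dataset. This is an illustration of the pattern only, not the `LazySupervisedDataset` code: `FallbackDataset`, `_process`, and the `max_retries` cap are hypothetical, and the cap is an addition here to keep the loop bounded.

```python
import random


class FallbackDataset:
    """Toy dataset whose __getitem__ retries a random index when a sample fails."""

    def __init__(self, samples, max_retries=10):
        self.samples = samples
        self.max_retries = max_retries  # hypothetical cap to avoid looping forever

    def __len__(self):
        return len(self.samples)

    def _process(self, i):
        # Stand-in for real preprocessing; None marks a corrupt sample here.
        sample = self.samples[i]
        if sample is None:
            raise ValueError(f"sample {i} could not be processed")
        return {"id": str(i), "value": sample}

    def __getitem__(self, i):
        for _ in range(self.max_retries):
            try:
                return self._process(i)
            except Exception:
                # Swallow the error and try a randomly chosen index instead.
                i = random.randrange(len(self))
        raise RuntimeError("too many consecutive failures")
```

A side effect of this design is that a bad sample is silently replaced, so the item returned for index `i` may not correspond to `i`; the `id` field identifies which sample was actually loaded.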

--------------------------------

### Required Runtime Dependencies

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Lists the essential packages required for the project to run, including their version constraints.

```text
pre-commit>=3.7.1        # Git hooks framework
pip>=24.1.1              # Package installer
Pillow>=10.4.0           # Image processing
liger-kernel==0.5.2      # Optimized LM kernel
opencv-python-headless>=4.10.0.84  # Computer vision
accelerate==1.1.1        # Distributed training
qwen-vl-utils==0.0.8     # Qwen VL processing utilities
deepspeed==0.16.0        # DeepSpeed training optimizations
transformers==4.51.3     # Hugging Face transformers
flash-attn==2.7.3        # Flash Attention optimization
wandb==0.18.3            # Weights & Biases logging
datasets>=2.18.0         # Dataset utilities
```

--------------------------------

### Load Qwen2.5-VL Model Variant

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Load the GUI-Actor-7B-Qwen2.5-VL checkpoint, the variant the source describes as better performing. Pass `torch_dtype` and `device_map` to control precision and device placement.

```python
import torch
from transformers import AutoProcessor, AutoTokenizer
from gui_actor.modeling_qwen25vl import Qwen2_5_VLForConditionalGenerationWithPointer

# Qwen2.5-VL variant (better performance)
model = Qwen2_5_VLForConditionalGenerationWithPointer.from_pretrained(
    "microsoft/GUI-Actor-7B-Qwen2.5-VL",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)

processor = AutoProcessor.from_pretrained("microsoft/GUI-Actor-7B-Qwen2.5-VL")
```

--------------------------------

### Project Citation (BibTeX)

Source: https://github.com/microsoft/gui-actor/blob/main/README.md

Provides the BibTeX entry for citing the GUI-Actor project in academic work.

```bibtex
@article{wu2025gui,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
  journal={arXiv preprint arXiv:2506.03143},
  year={2025}
}
```

--------------------------------

### Import Utility Functions

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import necessary utility functions from gui_actor.utils for data handling and visualization.

```python
from gui_actor.utils import (
    dump_args_to_json,
    draw_point,
    draw_bbox,
    do_boxes_overlap
)
```

--------------------------------

### Configure Weights & Biases

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Sets environment variables for Weights & Biases (WandB) to log training metrics and experiments. Specify the project name and the entity (team) for logging.

```bash
# Weights & Biases
export WANDB_PROJECT=gui-actor
export WANDB_ENTITY=my-team
```

--------------------------------

### Custom AGUVISTrainer with Length-Grouped Sampling

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Use AGUVISTrainer for custom training with length-grouped sampling and loss weighting. Configure training arguments and initialize the trainer with model, datasets, and processor.

```python
from transformers import TrainingArguments
from gui_actor.trainer import AGUVISTrainer

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    learning_rate_new_params=1e-3,  # not a stock transformers field; assumes an extended TrainingArguments
    group_by_length=True,
    save_steps=500,
    logging_steps=10,
)

trainer = AGUVISTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    processing_class=processor,
)

# Warmup stage
model.reset_loss_weights(pointer_loss_weight=1.0, lm_loss_weight=0.0)
trainer.train()

# Full training
model.reset_loss_weights(pointer_loss_weight=1.0, lm_loss_weight=1.0)
trainer.train(resume_from_checkpoint="./checkpoints/checkpoint-1000")
```
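
`group_by_length=True` asks the trainer to batch samples of similar length, which reduces padding. The general idea can be sketched in plain Python as below. This is not the `AGUVISTrainer` sampler; `length_grouped_indices`, `mega_batch_mult`, and `seed` are illustrative names.

```python
import random


def length_grouped_indices(lengths, batch_size, mega_batch_mult=4, seed=None):
    """Return a shuffled index order in which neighbouring samples have similar lengths."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)                 # keeps the order random overall
    mega = batch_size * mega_batch_mult  # hypothetical chunk size
    ordered = []
    for start in range(0, len(indices), mega):
        chunk = indices[start:start + mega]
        # Sort each chunk longest-first so batches cut from it are homogeneous.
        chunk.sort(key=lambda i: lengths[i], reverse=True)
        ordered.extend(chunk)
    return ordered
```

Sorting only within chunks, rather than over the whole dataset, keeps some randomness in the batch composition while still limiting padding.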

--------------------------------

### dump_args_to_json

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/utils.md

Serializes training and model configuration to a JSON file for experiment tracking. It filters arguments to be JSON-serializable, saves the output to `{output_dir}/args.json` with 4-space indentation, and skips saving if the file already exists.

```APIDOC
## dump_args_to_json

### Description
Serializes training and model configuration to a JSON file for experiment tracking.

### Function Signature
```python
def dump_args_to_json(
    model_config,
    data_processor,
    model_args,
    data_args,
    training_args,
    output_dir: str
) -> None
```

### Parameters
#### Arguments
- **model_config** (object) - Required - Model config object (e.g., `model.config` after `from_pretrained`)
- **data_processor** (object) - Required - Image processor + tokenizer combined
- **model_args** (object) - Required - Model-specific arguments (as argparse Namespace)
- **data_args** (object) - Required - Data loading arguments
- **training_args** (object) - Required - Training/optimization arguments
- **output_dir** (str) - Required - Directory where `args.json` is written (skipped if the file already exists)

### Behavior
1. Filters all arguments to only JSON-serializable values
2. Saves to `{output_dir}/args.json` with 4-space indentation
3. Skips saving if file already exists

### Output Structure
```json
{
    "model_config": {...},
    "data_processor_config": {...},
    "image_processor_config": {...},
    "model_args": {...},
    "data_args": {...},
    "training_args": {...}
}
```

### Example
```python
from gui_actor.utils import dump_args_to_json

dump_args_to_json(
    model_config=model.config,
    data_processor=processor,
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    output_dir="./output"
)
# Creates: ./output/args.json
```
```
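
The three behavior steps can be sketched with the standard library alone. This is a simplified stand-in, not the `gui_actor.utils` implementation: `dump_dict_to_json` and `is_json_serializable` are hypothetical names, and the input is assumed to be a dict of plain dicts rather than config objects.

```python
import json
import os


def is_json_serializable(value):
    """Return True if json.dumps accepts the value."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError, OverflowError):
        return False


def dump_dict_to_json(sections, output_dir, filename="args.json"):
    """Write the JSON-serializable entries of each section; skip if the file exists."""
    path = os.path.join(output_dir, filename)
    if os.path.exists(path):
        return False  # leave the existing file untouched
    filtered = {
        name: {k: v for k, v in values.items() if is_json_serializable(v)}
        for name, values in sections.items()
    }
    os.makedirs(output_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(filtered, f, indent=4)
    return True
```

Because an existing file is never overwritten, resuming a run into the same output directory keeps the arguments recorded by the first launch.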

--------------------------------

### Create Training DataLoader

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/trainer.md

Generates a training DataLoader with custom configurations, including column removal, custom collate function, persistent workers, custom sampling, memory pinning, and prefetching.

```python
def get_train_dataloader(self) -> DataLoader
```

--------------------------------

### Import Core Model Classes

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import the necessary classes for using the core models. These are the primary classes for building and interacting with the GUI-Actor models.

```python
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.modeling_qwen25vl import Qwen2_5_VLForConditionalGenerationWithPointer
```

--------------------------------

### Import Constants

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/api-reference.md

Import predefined constants for special tokens and other configurations from gui_actor.constants.

```python
from gui_actor.constants import (
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_POINTER_START_TOKEN,
    DEFAULT_POINTER_PAD_TOKEN,
    DEFAULT_POINTER_END_TOKEN,
    ADDITIONAL_SPECIAL_TOKENS,
    grounding_system_message,
    chat_template,
    IGNORE_INDEX
)
```

--------------------------------

### Configure CUDA/GPU Settings

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/configuration.md

Sets environment variables for CUDA and GPU usage. `CUDA_VISIBLE_DEVICES` selects which GPUs to use, and `CUDA_LAUNCH_BLOCKING=1` enables synchronous execution for easier debugging of CUDA errors.

```bash
# CUDA/GPU settings
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_LAUNCH_BLOCKING=1
```
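
The same settings can be applied from Python, provided they are written before CUDA is initialized (in practice, before `torch` is imported). `configure_cuda` below is a hypothetical helper for illustration, not part of the project.

```python
import os


def configure_cuda(device_ids, launch_blocking=False, env=os.environ):
    """Write the two CUDA variables into env and return the device string."""
    devices = ",".join(str(d) for d in device_ids)
    env["CUDA_VISIBLE_DEVICES"] = devices
    # "1" forces synchronous kernel launches, which slows training but
    # makes CUDA errors surface at the call that caused them.
    env["CUDA_LAUNCH_BLOCKING"] = "1" if launch_blocking else "0"
    return devices
```

For example, `configure_cuda([0, 1, 2, 3], launch_blocking=True)` mirrors the two `export` lines above; leaving `launch_blocking` off is the better choice outside of debugging.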

--------------------------------

### Attention Head Output from VisionHead_MultiPatch.forward()

Source: https://github.com/microsoft/gui-actor/blob/main/_autodocs/types.md

`VisionHead_MultiPatch.forward()` returns a tuple of attention weights and an optional loss. The attention weights have one row per target, softmax-normalized over the visual tokens.

```python
(
    attn_weights: torch.Tensor,   # Shape: (n_targets, n_visual); softmax-normalized attention scores
    loss: torch.Tensor | None,    # Shape: (1,); KL divergence loss
)
```
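
To turn one row of `attn_weights` into a click point, one simple option is to take the highest-scoring patch and return its centre. The sketch below assumes patches are laid out row-major on a `grid_h` x `grid_w` grid and uses plain lists instead of tensors. `top_patch_center` is a hypothetical helper, and the library's own decoding may differ.

```python
def top_patch_center(weights, grid_h, grid_w):
    """Return the normalized (x, y) centre of the highest-scoring patch."""
    if len(weights) != grid_h * grid_w:
        raise ValueError("expected one weight per visual patch")
    best = max(range(len(weights)), key=weights.__getitem__)
    row, col = divmod(best, grid_w)  # assumes row-major patch order
    return ((col + 0.5) / grid_w, (row + 0.5) / grid_h)


# One row of attention weights for a 2 x 3 patch grid
weights = [0.05, 0.10, 0.05, 0.10, 0.60, 0.10]
print(top_patch_center(weights, grid_h=2, grid_w=3))  # prints (0.5, 0.75)
```

Multiplying the result by the image width and height gives pixel coordinates.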

--------------------------------

### Evaluation Script Execution

Source: https://github.com/microsoft/gui-actor/blob/main/verifier/README.md

Shell commands to run the evaluation scripts for ScreenSpot datasets v1, v2, and Pro. Ensure file paths are correctly updated in the scripts before execution.

```bash
bash run_ss_v1.sh
bash run_ss_v2.sh
bash run_ss_pro.sh
```