## Why Multiple Models?
AskUI uses different AI models for different tasks instead of one large model for everything, because UI automation requires several distinct capabilities:

- Computer vision to see what's on screen
- Natural language understanding to interpret instructions
- Planning to break down complex tasks
- Precise interaction to click and type accurately
## The Three Model Types
### 1. Locator Models
**What they do:** Find and interact with UI elements

Locator models analyze screenshots to locate buttons, text fields, and other UI elements. They also execute mouse clicks and keyboard input.

**Tasks:**

- Identify UI elements from screenshots
- Determine element locations and boundaries
- Execute clicks, typing, and other interactions
**Models:**

- UIDT-1: Locates elements on screen
- PTA-1: Takes text descriptions and finds matching UI elements
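In practice, locator models sit behind single-step commands in the AskUI Python SDK. A minimal sketch using `VisionAgent` (the element description and the typed text are illustrative):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # A locator model resolves the text description to on-screen
    # coordinates, then the click is executed at that position.
    agent.click("blue submit button")
    # Keyboard input goes to whatever element currently has focus.
    agent.type("hello@example.com")
```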
### 2. Query Models
**What they do:** Answer questions and make decisions

Query models process natural language and generate responses. They understand context and can reason about what actions to take.

**Tasks:**

- Interpret user instructions
- Answer questions about screen content
- Make decisions about next steps
**Models:**

- GPT-4: General language understanding and reasoning
- Computer Use: Anthropic's model for computer interaction tasks
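In the Python SDK, query models answer `get()` calls about the current screen. A minimal sketch (the question text is illustrative):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # The query model receives a screenshot plus the question
    # and returns a natural-language answer.
    answer = agent.get("What is the title of the currently open page?")
    print(answer)
```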
### 3. Action Models (AMs)
**What they do:** Plan and coordinate multi-step tasks

Action Models take high-level goals and break them into sequences of actions. They coordinate the other models and handle errors.

**Tasks:**

- Break down complex goals into steps
- Decide which model to use for each step
- Handle failures and retry logic
- Monitor progress and adjust plans
**Models:**

- Computer Use: Plans and executes computer tasks
- UI-Tars: Specialized for UI automation workflows
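Action Models are driven through goal-level commands rather than single steps. A minimal sketch using the Python SDK's `act()` call (the goal text is illustrative):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # The Action Model decomposes the goal into steps, delegates
    # locating and clicking to the other models, and monitors progress.
    agent.act("Open the settings page and enable dark mode")
```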
## How They Work Together
When you give AskUI a task:

1. The Action Model creates a plan with specific steps
2. Query Models interpret any unclear instructions
3. Locator Models execute each individual action
4. The Action Model checks results and continues or adjusts the plan
For example, when booking a flight from Berlin to Rome:

- The AM plans: open travel site → search flights → select options → book
- Locator model clicks on flight search
- Query model interprets “Berlin” and “Rome” as departure/destination
- Locator model fills in the form fields
- AM monitors progress and handles any errors
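The flight example maps onto a single goal-level call; the decomposition into locator and query steps happens inside the Action Model. A sketch, assuming the travel site is reachable from the current screen:

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # One high-level goal; the AM plans the search, form-filling,
    # and selection steps and recovers from intermediate failures.
    agent.act(
        "On the open travel site, search for flights from Berlin to Rome "
        "and select a direct connection"
    )
```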
## Model Capabilities
| Model Type | Model Name | Purpose | Teachable | Online Trainable |
|---|---|---|---|---|
| Locator | UIDT-1 | Locate elements & understand screen | No | Partial |
| Locator | PTA-1 | Convert prompts into one-click actions | No | Yes |
| Query | GPT-4 | Understand & respond to user queries | Yes | No |
| Query | Gemini 2.5 Flash | Understand & respond to user queries | Yes | No |
| Query | Gemini 2.5 Pro | Understand & respond to user queries | Yes | No |
| Query | Computer Use | Understand & respond to user queries | Yes | No |
| Large Action (act) | Computer Use | Plan and execute full workflows | Yes | No |
| Large Action (act) | UI-Tars | Plan and execute full workflows | Yes | No |
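Most single-step and goal-level commands accept a model override, so the table above can be read as a menu of what to pass per call. A hedged sketch using the Python SDK's `model` parameter (the identifier strings below are assumptions; see the note on model names):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # Pin a specific locator model for one click
    # ("askui-pta" is an assumed identifier).
    agent.click("search field", model="askui-pta")
    # Route a whole workflow to a specific Action Model
    # ("tars" is an assumed identifier).
    agent.act("Fill out the signup form with test data", model="tars")
```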
Note: The exact model name identifiers are listed in the AskUI documentation.