## Why Multiple Models?
AskUI uses different AI models for different tasks instead of one large model for everything, because UI automation requires several distinct capabilities:

- Computer vision to see what's on screen
- Natural language understanding to interpret instructions
- Planning to break down complex tasks
- Precise interaction to click and type accurately
## The Three Model Types
### 1. Locator Models
**What they do:** Find and interact with UI elements

Locator models analyze screenshots to locate buttons, text fields, and other UI elements. They also execute mouse clicks and keyboard input.

**Tasks:**

- Identify UI elements from screenshots
- Determine element locations and boundaries
- Execute clicks, typing, and other interactions
**Models:**

- UIDT-1: Locates elements on screen
- PTA-1: Takes text descriptions and finds matching UI elements
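In practice, locator models sit behind single-step commands in the AskUI Python SDK. A minimal sketch using `VisionAgent` (the element description and the typed text are illustrative):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # A locator model resolves the text description to on-screen
    # coordinates, then the click is executed at that position.
    agent.click("blue submit button")
    # Keyboard input goes to whatever element currently has focus.
    agent.type("hello@example.com")
```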
### 2. Query Models
**What they do:** Answer questions and make decisions

Query models process natural language and generate responses. They understand context and can reason about what actions to take.

**Tasks:**

- Interpret user instructions
- Answer questions about screen content
- Make decisions about next steps
**Models:**

- GPT-4: General language understanding and reasoning
- Computer Use: Anthropic's model for computer interaction tasks
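In the Python SDK, query models answer `get()` calls about the current screen. A minimal sketch (the question text is illustrative):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # The query model receives a screenshot plus the question
    # and returns a natural-language answer.
    answer = agent.get("What is the title of the currently open page?")
    print(answer)
```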
### 3. Action Models (AMs)
**What they do:** Plan and coordinate multi-step tasks

Action Models take high-level goals and break them into sequences of actions. They coordinate the other models and handle errors.

**Tasks:**

- Break down complex goals into steps
- Decide which model to use for each step
- Handle failures and retry logic
- Monitor progress and adjust plans
**Models:**

- Computer Use: Plans and executes computer tasks
- UI-Tars: Specialized for UI automation workflows
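Action Models are driven through goal-level commands rather than single steps. A minimal sketch using the Python SDK's `act()` call (the goal text is illustrative):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # The Action Model decomposes the goal into steps, delegates
    # locating and clicking to the other models, and monitors progress.
    agent.act("Open the settings page and enable dark mode")
```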
## How They Work Together
When you give AskUI a task:

1. The Action Model creates a plan with specific steps
2. Query Models interpret any unclear instructions
3. Locator Models execute each individual action
4. The Action Model checks results and continues or adjusts the plan
For example, when booking a flight from Berlin to Rome:

- The AM plans: open travel site → search flights → select options → book
- Locator model clicks on flight search
- Query model interprets “Berlin” and “Rome” as departure/destination
- Locator model fills in the form fields
- AM monitors progress and handles any errors
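The flight example maps onto a single goal-level call; the decomposition into locator and query steps happens inside the Action Model. A sketch, assuming the travel site is reachable from the current screen:

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # One high-level goal; the AM plans the search, form-filling,
    # and selection steps and recovers from intermediate failures.
    agent.act(
        "On the open travel site, search for flights from Berlin to Rome "
        "and select a direct connection"
    )
```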
## Model Capabilities
| Model Type | Model Name | Purpose | Teachable | Online Trainable |
|---|---|---|---|---|
| Locator | UIDT-1 | Locate elements & understand screen | No | Partial |
| Locator | PTA-1 | Convert prompts into one-click actions | No | Yes |
| Query | GPT-4 | Understand & respond to user queries | Yes | No |
| Query | Gemini 2.5 Flash | Understand & respond to user queries | Yes | No |
| Query | Gemini 2.5 Pro | Understand & respond to user queries | Yes | No |
| Query | Computer Use | Understand & respond to user queries | Yes | No |
| Large Action (act) | Computer Use | Plan and execute full workflows | Yes | No |
| Large Action (act) | UI-Tars | Plan and execute full workflows | Yes | No |
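Most single-step and goal-level commands accept a model override, so the table above can be read as a menu of what to pass per call. A hedged sketch using the Python SDK's `model` parameter (the identifier strings below are assumptions; see the note on model names):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # Pin a specific locator model for one click
    # ("askui-pta" is an assumed identifier).
    agent.click("search field", model="askui-pta")
    # Route a whole workflow to a specific Action Model
    # ("tars" is an assumed identifier).
    agent.act("Fill out the signup form with test data", model="tars")
```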
Note: The exact model name identifiers are listed in the AskUI documentation.