How AskUI Works
Learn how AskUI works
AskUI consists of multiple components that work together to let users build versatile automations across different operating systems. These components are:
- Model Provider: Model providers host the AI models. These models are the brain of the agents: locating elements (Grounding Models), answering questions about the screen (Query Models), and acting autonomously (Large Action Models).
- Agent Framework: The Agent Framework is used to build a Vision Agent in Python. You can easily interact with any OS by taking over mouse and keyboard, visualize the thinking process, or use specialized tools such as Excel, PDF, or web requests.
- Agent OS: The Agent OS is installed on the device to be automated. It executes all commands, interacts with mouse and keyboard, and visualizes the thinking process.
- Agent Hub Web: The Agent Hub Web gives you an easy way to collaborate with your team. It lets you synchronize your Vision Agents to other devices via the repository, trigger your agents from email or the web, schedule agents on remote devices, and manage your team.
- Agent Hub Desktop: The Agent Hub Desktop is a graphical interface for your Vision Agents. It is installed on your local device and allows you to set up your device, sync agents, and interact with running agents. To be released in Summer 2025.
Let's have a deeper look at these components:
Building Your Agent
The Agent Framework is where you tell AskUI what to automate. You can write single-step instructions or let the agent act on its own, e.g. `agent.act("Book me a flight from Karlsruhe to Rome")`.
For example, let's say you want AskUI to click a "Login" button (a single-step action). You define your agent in Python, e.g. in `my_agent.py`:
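A minimal sketch of what this file could contain, reconstructed from the calls explained below (the `from askui import VisionAgent` import path and the exact locator wording are assumptions, not shown on this page):

```python
# my_agent.py - single-step automation sketch (import path assumed)
from askui import VisionAgent

# The with-statement starts the Agent OS and stops it again when the block ends
with VisionAgent(display=1) as agent:
    # Resolve the natural language locator and click the element
    agent.click("the Login button")
```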
- `VisionAgent(display=1)` → starts and configures the agent (e.g. start a Vision Agent on display 1).
- `agent.click("Element locator")` → specifies the action of the agent (clicking).
Running your Agent
Here's what's happening under the hood when you execute the agent with `python my_agent.py`:
Let's zoom in on what the first line is doing:
- The `with`-statement creates a context for the Vision Agent to start and stop the Agent OS.
- `VisionAgent(display=1)` instantiates a Vision Agent with the given arguments and starts the Agent OS on display 1.
- `as agent` makes the instance available inside the context.
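For readers less familiar with Python context managers, this is roughly how a with-statement maps to start/stop behavior (a generic Python sketch, not AskUI's actual implementation; the class and print statements are purely illustrative):

```python
# Illustrative only: how a with-statement drives setup and teardown in Python.
class ExampleAgent:
    def __enter__(self):
        # Runs when the with-block is entered, e.g. start the Agent OS
        print("starting")
        return self  # this object is bound by "as agent"

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with-block exits (even on errors), e.g. stop the Agent OS
        print("stopping")
        return False  # do not suppress exceptions

with ExampleAgent() as agent:
    print("running steps")
```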
Now what's happening when a single step is executed:
- A screenshot is taken via the Agent OS.
- The natural language locator of the element to be located (`"the Login button"`) is used.
- This locator is sent together with the screenshot to the configured AI model of the Model Provider. (By default, PTA-1 is set as the model, hosted on the AskUI Model Provider.)
- The Grounding Model returns a location, e.g. click on pixel x=100, y=100.
- This action is then sent to the Agent OS.
- The Agent OS takes over the mouse, moves it to pixel x=100, y=100, and presses the left mouse button.
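Putting those steps together, the flow for one click looks roughly like this (a conceptual sketch; all helper names are hypothetical and not AskUI's internal API):

```python
# Conceptual sketch of a single-step click; all helper names are hypothetical.
def run_single_step(agent_os, grounding_model, locator: str) -> None:
    # 1. Take a screenshot via the Agent OS
    screenshot = agent_os.screenshot()

    # 2.-3. Send the natural language locator together with the screenshot
    #       to the configured Grounding Model (e.g. PTA-1)
    x, y = grounding_model.locate(locator, screenshot)

    # 4.-6. The returned pixel location is sent back to the Agent OS,
    #       which moves the mouse there and presses the left button
    agent_os.mouse_move(x, y)
    agent_os.click_left()

# Example: run_single_step(agent_os, model, "the Login button")
```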
Model Provider
AskUI supports multiple Model Providers and various AI models for different purposes. Generally, these can be split into two categories:
- UI Grounding Model: This describes using an AI model to localize UI elements on a screen. Typically, OCR, Computer Vision, or Vision Language Models (VLMs) are used for this.
- Reasoning and Planning: To enable real agentic behaviour, a Large Language Model such as the Computer Use model by Anthropic is used to plan the steps to be executed in a reasoning loop.
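To make the distinction concrete, here is a small sketch based on the calls shown earlier on this page (the import path is an assumption): a locator-based call like `click` relies mainly on a Grounding Model, while `act` relies on a reasoning and planning model.

```python
from askui import VisionAgent  # assumed import path

with VisionAgent(display=1) as agent:
    # Grounding: the locator is resolved to pixel coordinates by a Grounding Model
    agent.click("the Login button")

    # Reasoning and planning: the goal is broken into steps by a Large Action Model
    # in a reasoning loop, which then executes them one by one
    agent.act("Book me a flight from Karlsruhe to Rome")
```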
We provide our own models as well as external Model Providers. You can find all supported models here.