How AskUI Works
Learn how AskUI works
AskUI consists of multiple components that work together to let users build versatile automations across different operating systems. These components are:
- Model Provider: The AI models used for UI grounding and prompt processing.
- Agent Framework: The framework used to instruct the agent with agent commands (Python).
- Agent OS: A client installed on the device to be automated. It executes all commands and interacts with mouse and keyboard, serving as the "hands" of the Vision Agent. It is also responsible for environment management of the agent.
- Agent Hub Web: Cross-team collaboration platform. Web frontend for creating and managing access tokens and users. Serves as a repository for Vision Agents.
- Agent Hub Desktop: Configuration tool. Local frontend for agent configuration and interaction. To be released Summer 2025.
Let's take a deeper look at these components:
Agent Framework
The Agent Framework is where you tell AskUI what to automate. You can write single-step instructions or let the agent act on its own.
For example, let’s say you want AskUI to click a “Login” button (single-step action). The instruction would look like this:
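A minimal sketch, assuming the askui Python package's VisionAgent API (the import path and method name are taken from that package as we understand it and may differ in your version):

```python
from askui import VisionAgent

# Single-step instruction: locate the "Login" button on screen and click it.
with VisionAgent() as agent:
    agent.click("Login button")
```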
Here’s what’s happening:
- agent → Specifies which agent executes the instruction. It connects to the respective Agent OS controller and executes the command there. By default, the installed controller running on your local machine is used.
- click("Element description") → Specifies the action (clicking) and takes a natural language description of the element to be located (Element description). This description is sent to the configured AI model of the Model Provider. By default, PTA-1, a model hosted on AskUI infrastructure, is used.
Once you run the agent, the Agent OS executes the instructions.
Agent OS
Specifically, the Agent OS does the following when the agent is triggered:
- Takes a screenshot of the screen it is running on.
- Sends the screenshot to the Model Provider to retrieve the position at which it will execute the action. The model inference is called for every single step unless it is cached.
- Executes the specified action, in this case a leftClick(). AskUI supports all actions a human can perform via keyboard and mouse. For some actions, such as a direct mouseLeftClick(), no object localization is needed; in these cases, no AI inference is used.
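The flow for a single grounded action can be summarized in a short conceptual sketch. All function and type names below are illustrative placeholders, not the actual Agent OS internals:

```python
from dataclasses import dataclass

# Illustrative sketch of the Agent OS flow for one grounded action.
# None of these names are the real Agent OS internals.

@dataclass
class Point:
    x: int
    y: int

def capture_screen() -> bytes:
    """Placeholder: take a screenshot of the controlled display."""
    return b"<png bytes>"

def locate(screenshot: bytes, description: str) -> Point:
    """Placeholder: call the Model Provider to ground the description to coordinates."""
    return Point(x=640, y=360)

def left_click(point: Point) -> None:
    """Placeholder: issue a native mouse event on the device."""
    print(f"left click at ({point.x}, {point.y})")

def execute_click(description: str) -> None:
    screenshot = capture_screen()            # 1. screenshot of the target screen
    point = locate(screenshot, description)  # 2. inference: description -> position
    left_click(point)                        # 3. execute the action at that position

execute_click("Login button")
```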
The Agent OS also provides functions such as environment management, proxy configuration, and in-background automation. A detailed overview can be found here. #todo
Model Provider
AskUI supports multiple Model Providers and various AI models for different purposes. Generally, these can be split into two categories:
- UI Grounding Tasks: This describes the process of using an AI model to locate UI elements on a screen. Typically, OCR, computer vision, or Vision Language Models (VLMs) are used for this.
- Reasoning and Planning: To enable real agentic behaviour, a Large Language Model such as Anthropic's Computer Use model is used to plan the steps to be executed in a reasoning loop.
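A hedged sketch of how the two categories surface in the Agent Framework, assuming the askui package's click() and act() methods (both method names and the prompt text are illustrative assumptions):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # UI grounding: a single step where the model only has to locate the element.
    agent.click("Login button")

    # Reasoning and planning: a goal-level prompt handled by a large language
    # model in a reasoning loop that decides the individual steps itself.
    agent.act("Log in with the test account and open the settings page")
```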
We provide our own models as well as support for external Model Providers. You can find all supported models in #todo