How AskUI Works
Learn how AskUI works
AskUI consists of multiple components that work together to let users build versatile automations across different operating systems. These components are:
- Model Provider: The AI models used for UI grounding and prompt processing.
- Agent Framework: The framework used to instruct the agent with agent commands (Python).
- Agent OS: A client installed on the device to be automated. It executes all commands and interacts with mouse and keyboard, serving as the "hands" of the Vision Agent. It is also responsible for environment management of the agent.
- Agent Hub Web: Cross-team collaboration platform. Web frontend for creating and managing access tokens and users. Serves as a repository for Vision Agents.
- Agent Hub Desktop: Configuration tool. Local frontend for agent configuration and interaction. To be released Summer 2025.
Let's take a deeper look at these components:
Agent Framework
The Agent Framework is where you tell AskUI what to automate. You can write single-step instructions or let the agent act on its own.
For example, let’s say you want AskUI to click a “Login” button (single-step action). The instruction would look like this:
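A minimal sketch, assuming the askui Python package's VisionAgent API (the import path and method name are taken from that package as we understand it and may differ in your version):

```python
from askui import VisionAgent

# Single-step instruction: locate the "Login" button on screen and click it.
with VisionAgent() as agent:
    agent.click("Login button")
```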
Here’s what’s happening:
- agent → Specifies which agent executes the instruction. It connects to the respective Agent OS controller and executes the command there. By default, the installed controller running on your local machine is used.
- click("Element description") → Specifies the action (clicking) and takes a natural language description of the element to be located (Element description). This description is sent to the configured AI model of the Model Provider. By default, PTA-1, a model hosted on AskUI infrastructure, is used.
Once you run the agent, the Agent OS executes the instructions.
Agent OS
Specifically, the Agent OS does the following when the agent is triggered:
- Takes a screenshot of the screen it is running on.
- Sends the screenshot to the Model Provider to retrieve the position at which it will execute the action. The model inference is called for every single step unless it is cached.
- Executes the specified action, in this case a leftClick(). AskUI supports all actions a human can perform via keyboard and mouse. For some actions, such as a direct mouseLeftClick(), no object localization is needed; in these cases, no AI inference is used.
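The flow for a single grounded action can be summarized in a short conceptual sketch. All function and type names below are illustrative placeholders, not the actual Agent OS internals:

```python
from dataclasses import dataclass

# Illustrative sketch of the Agent OS flow for one grounded action.
# None of these names are the real Agent OS internals.

@dataclass
class Point:
    x: int
    y: int

def capture_screen() -> bytes:
    """Placeholder: take a screenshot of the controlled display."""
    return b"<png bytes>"

def locate(screenshot: bytes, description: str) -> Point:
    """Placeholder: call the Model Provider to ground the description to coordinates."""
    return Point(x=640, y=360)

def left_click(point: Point) -> None:
    """Placeholder: issue a native mouse event on the device."""
    print(f"left click at ({point.x}, {point.y})")

def execute_click(description: str) -> None:
    screenshot = capture_screen()            # 1. screenshot of the target screen
    point = locate(screenshot, description)  # 2. inference: description -> position
    left_click(point)                        # 3. execute the action at that position

execute_click("Login button")
```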
The Agent OS also provides functions such as environment management, proxy configuration, and in-background automation. A detailed overview can be found here. #todo
Model Provider
AskUI supports multiple Model Providers and various AI models for different purposes. Generally, these can be split into two categories:
- UI Grounding Tasks: This describes the process of using an AI model to locate UI elements on a screen. Typically, OCR, computer vision, or Vision Language Models (VLMs) are used for this.
- Reasoning and Planning: To enable real agentic behaviour, a Large Language Model such as Anthropic's Computer Use model is used to plan the steps to be executed in a reasoning loop.
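A hedged sketch of how the two categories surface in the Agent Framework, assuming the askui package's click() and act() methods (both method names and the prompt text are illustrative assumptions):

```python
from askui import VisionAgent

with VisionAgent() as agent:
    # UI grounding: a single step where the model only has to locate the element.
    agent.click("Login button")

    # Reasoning and planning: a goal-level prompt handled by a large language
    # model in a reasoning loop that decides the individual steps itself.
    agent.act("Log in with the test account and open the settings page")
```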
We provide our own models as well as support for external Model Providers. You can find all supported models in #todo