Understand the foundational concept of agents in AskUI
At the core of AskUI’s architecture is the concept of agents: intelligent entities that understand, reason about, and interact with user interfaces on behalf of users. Agents represent a fundamental shift from traditional automation approaches by embedding intelligence directly into the automation process.
Traditional automation tools assume that applications behave predictably and deterministically. Real-world applications, however, are stateful and exhibit behavior that cannot be fully controlled. AskUI’s agent-based approach addresses this challenge by mimicking human adaptability.
Since stateful applications cannot be converted into stateless ones, agents must handle unexpected situations the way humans do: they continuously adapt their behavior based on what they observe, making real-time decisions about how to proceed when something unexpected arises.
Agents can validate whether actions were successfully performed and recover from failures. For example, if an agent sends a scroll command but the mouse cursor wasn’t in the scrollable area, the agent detects this failure, repositions the mouse, and performs the scroll action again.
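A minimal sketch of this verify-and-retry loop using AskUI’s `VisionAgent`: the natural-language goals, the yes/no verification query (via `get`), and the two-attempt policy are illustrative assumptions, not a prescribed recipe.

```python
from askui import VisionAgent

# Illustrative only: the goal phrasing, verification query, and retry
# count below are assumptions, not a fixed AskUI recipe.
with VisionAgent() as agent:
    for attempt in range(2):
        # Execution: try the scroll via a natural-language goal
        agent.act("Scroll down in the results list")
        # Verification: ask the agent to inspect the new screen state
        answer = agent.get("Did the results list scroll down? Answer yes or no.")
        if "yes" in str(answer).lower():
            break  # success, no recovery needed
        # Recovery: reposition over the scrollable area, then retry
        agent.act("Move the mouse pointer over the results list")
```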
Agents bridge the gap between human intention and machine execution. Users can describe what they want to accomplish in natural language, and agents translate this into appropriate actions while handling the unpredictable nature of stateful applications.
Tools are the operational interface that agents use to interact with the underlying system. They provide the concrete actions an agent can perform: capturing screenshots, clicking buttons, typing text, and executing commands. Tools connect high-level agent reasoning to low-level system operations, enabling agents to translate their decisions into real-world actions.
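Conceptually, a tool is a narrow, typed surface of primitive operations. The following sketch is only a hypothetical illustration of that shape; the `Tool` protocol and its method names (`screenshot`, `click`, `type_text`) are assumptions for exposition, not AskUI’s actual tool API.

```python
from typing import Protocol

class Tool(Protocol):
    """Hypothetical shape of a tool: primitive, concrete system operations."""

    def screenshot(self) -> bytes:
        """Perception primitive: capture the current screen as image bytes."""
        ...

    def click(self, x: int, y: int) -> None:
        """Execution primitive: click at absolute screen coordinates."""
        ...

    def type_text(self, text: str) -> None:
        """Execution primitive: send keystrokes to the focused element."""
        ...
```

The agent’s reasoning layer decides which of these primitives to invoke and when; the tool itself stays deliberately simple so that it behaves the same way across applications.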
Every AskUI agent follows a consistent operational lifecycle:
Perception: The agent captures and analyzes the current state of the interface
Understanding: It interprets the visual information to identify elements and their relationships
Planning: The agent determines the appropriate sequence of actions to achieve the goal
Execution: It performs the planned actions on the interface
Verification: The agent confirms whether actions succeeded and adjusts if necessary
```python
from askui import VisionAgent

# Agent initialization creates the perception and reasoning systems
with VisionAgent() as agent:
    # Perception: Agent analyzes the current interface and system (e.g. network status)
    # Understanding: Agent interprets the "login form" concept
    # Planning: Agent determines the sequence of actions needed
    # Execution: Agent performs the planned interactions
    # Verification: Agent confirms successful completion
    agent.act("Fill out the login form with username john.doe")
```
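Note how the entire lifecycle is encapsulated in a single `act` call: perception, understanding, planning, execution, and verification all happen inside the agent rather than in user code.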