class AndroidVisionAgent(AgentBase)
A vision-based agent that can interact with Android devices through computer vision and AI. This agent can perform various UI interactions on Android devices like tapping, typing, swiping, and more. It uses computer vision models to locate UI elements and execute actions on them. Arguments:
  • reporters list[Reporter] | None, optional - List of reporter instances for logging and reporting. If None, an empty list is used.
  • model ModelChoice | ModelComposition | str | None, optional - The default choice or name of the model(s) to be used for vision tasks. Can be overridden by the model parameter in the tap(), get(), act() etc. methods.
  • retry Retry, optional - The retry instance to use for retrying failed actions. Defaults to ConfigurableRetry with exponential backoff. Currently only supported for the locate() method.
  • models ModelRegistry | None, optional - A registry of models to make available to the AndroidVisionAgent so that they can be selected using the model parameter of AndroidVisionAgent or the model parameter of its tap(), get(), act() etc. methods. Entries in the registry override entries in the default model registry.
  • model_provider str | None, optional - The model provider to use for vision tasks.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.tap("Submit button")
    agent.type("Hello World")
    agent.act("Open settings menu")

tap

def tap(
    target: str | Locator | tuple[int, int],
    model: ModelComposition | str | None = None
) -> None
Taps on the specified target. Arguments:
  • target str | Locator | Point - The target to tap on. Can be a locator, a point, or a string.
  • model ModelComposition | str | None, optional - The composition or name of the model(s) to be used for tapping on the target.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.tap("Submit button")
    agent.tap((100, 100))
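
Targets can also be structured locators from askui.locators (used again in the wait() example below):
from askui import AndroidVisionAgent
from askui.locators import loc

with AndroidVisionAgent() as agent:
    agent.tap(loc.Text("Submit"))  # tap an element matched by a text locator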

type

def type(text: Annotated[str, Field(min_length=1)]) -> None
Types the specified text as if it were entered on a keyboard. Arguments:
  • text str - The text to be typed. Must be at least 1 character long. Only ASCII printable characters are supported; other characters will raise an error.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.type("Hello, world!")  # Types "Hello, world!"
    agent.type("user@example.com")  # Types an email address
    agent.type("password123")  # Types a password

key_tap

def key_tap(key: ANDROID_KEY) -> None
Taps the specified key on the Android device. Arguments:
  • key ANDROID_KEY - The key to tap.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.key_tap("HOME")  # Taps the home key
    agent.key_tap("BACK")  # Taps the back key

key_combination

def key_combination(
    keys: Annotated[list[ANDROID_KEY], Field(min_length=2)],
    duration_in_ms: int = 100
) -> None
Taps the specified keys on the Android device. Arguments:
  • keys list[ANDROID_KEY] - The keys to tap.
  • duration_in_ms int, optional - The duration in milliseconds to hold the key combination. Default is 100ms.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.key_combination(["HOME", "BACK"])  # Taps the home key and then the back key
    agent.key_combination(["HOME", "BACK"], duration_in_ms=200)  # Taps the home key and then the back key for 200ms.

shell

def shell(command: str) -> str
Executes a shell command on the Android device. Arguments:
  • command str - The shell command to execute.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.shell("pm list packages")  # Lists all installed packages
    agent.shell("dumpsys battery")  # Displays battery information

drag_and_drop

def drag_and_drop(
    x1: int, y1: int, x2: int, y2: int, duration_in_ms: int = 1000
) -> None
Performs a drag-and-drop gesture from the starting point (x1, y1) to the ending point (x2, y2). Arguments:
  • x1 int - The x-coordinate of the starting point.
  • y1 int - The y-coordinate of the starting point.
  • x2 int - The x-coordinate of the ending point.
  • y2 int - The y-coordinate of the ending point.
  • duration_in_ms int, optional - The duration of the gesture in milliseconds. Default is 1000ms.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.drag_and_drop(100, 100, 200, 200)  # Drags and drops from (100, 100) to (200, 200)
    agent.drag_and_drop(100, 100, 200, 200, duration_in_ms=2000)  # Drags and drops from (100, 100) to (200, 200) with a 2000ms duration
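
The start and end points can come from locate(), which returns (x, y) tuples; the element descriptions below are illustrative:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    x1, y1 = agent.locate("Draggable item")
    x2, y2 = agent.locate("Drop target")
    agent.drag_and_drop(x1, y1, x2, y2)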

swipe

def swipe(
    x1: int, y1: int, x2: int, y2: int, duration_in_ms: int = 1000
) -> None
Performs a swipe gesture from the starting point (x1, y1) to the ending point (x2, y2). Arguments:
  • x1 int - The x-coordinate of the starting point.
  • y1 int - The y-coordinate of the starting point.
  • x2 int - The x-coordinate of the ending point.
  • y2 int - The y-coordinate of the ending point.
  • duration_in_ms int, optional - The duration of the gesture in milliseconds. Default is 1000ms.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.swipe(100, 100, 200, 200)  # Swipes from (100, 100) to (200, 200)
    agent.swipe(100, 100, 200, 200, duration_in_ms=2000)  # Swipes from (100, 100) to (200, 200) with a 2000ms duration
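
A common use is scrolling. A hedged sketch; the coordinates assume a roughly 1080x2400 portrait screen and should be adapted to the device:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.swipe(540, 1600, 540, 400, duration_in_ms=500)  # swipe up to scroll further down the page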

set_device_by_serial_number

def set_device_by_serial_number(device_sn: str) -> None
Sets the active device for screen interactions by its serial number. Arguments:
  • device_sn str - The serial number of the device to set as active.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.set_device_by_serial_number("emulator-5554")  # Sets the active device by its serial number
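
A hedged sketch of driving two connected devices in one session; the serial numbers are placeholders for those reported by `adb devices`:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.set_device_by_serial_number("emulator-5554")  # placeholder serial
    agent.tap("Submit button")
    agent.set_device_by_serial_number("R58M123ABC")     # placeholder serial
    agent.tap("Submit button")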

act

def act(
    goal: Annotated[str | list[MessageParam],
                    Field(min_length=1)],
    model: str | None = None,
    on_message: OnMessageCb | None = None,
    tools: list[Tool] | ToolCollection | None = None,
    settings: ActSettings | None = None
) -> None
Instructs the agent to achieve a specified goal through autonomous actions. The agent will analyze the screen, determine necessary steps, and perform actions to accomplish the goal. This may include clicking, typing, scrolling, and other interface interactions. Arguments:
  • goal str | list[MessageParam] - A description of what the agent should achieve.
  • model str | None, optional - The name of the model to be used for achieving the goal.
  • on_message OnMessageCb | None, optional - Callback invoked for each new message. If it returns None, the agent stops and the message is not added.
  • tools list[Tool] | ToolCollection | None, optional - The tools for the agent. Defaults to default tools depending on the selected model.
  • settings ActSettings | None, optional - The settings for the agent. Defaults to default settings depending on the selected model.
Returns: None
Raises:
  • MaxTokensExceededError - If the model reaches the maximum token limit defined in the agent settings.
  • ModelRefusalError - If the model refuses to process the request.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.act("Open the settings menu")
    agent.act("Search for 'printer' in the search box")
    agent.act("Log in with username 'admin' and password '1234'")

get

def get(
    query: Annotated[str, Field(min_length=1)],
    response_schema: Type[ResponseSchema] | None = None,
    model: str | None = None,
    source: Optional[InputSource] = None
) -> ResponseSchema | str
Retrieves information from an image or PDF based on the provided query. If no source is provided, a screenshot of the current screen is taken. Arguments:
  • query str - The query describing what information to retrieve.
  • response_schema Type[ResponseSchema] | None, optional - A Pydantic model class that defines the response schema. If not provided, returns a string.
  • model str | None, optional - The composition or name of the model(s) to be used for retrieving information from the screen or image using the query. Note: response_schema is not supported by all models. PDF processing is only supported for Gemini models hosted on AskUI.
  • source InputSource | None, optional - The source to extract information from. Can be a path to an image, PDF, or office document file, a PIL Image object or a data URL. Defaults to a screenshot of the current screen.
Returns:
  • ResponseSchema | str - The extracted information as an instance of the response schema, or a str if no response_schema is provided.
Raises:
  • NotImplementedError - If PDF processing is not supported for the selected model.
  • ValueError - If the source is not a valid PDF or image.
Example:
from askui import ResponseSchemaBase, AndroidVisionAgent
from PIL import Image
import json

class UrlResponse(ResponseSchemaBase):
    url: str

class NestedResponse(ResponseSchemaBase):
    nested: UrlResponse

class LinkedListNode(ResponseSchemaBase):
    value: str
    next: "LinkedListNode | None"

with AndroidVisionAgent() as agent:
    # Get URL as string
    url = agent.get("What is the current url shown in the url bar?")

    # Get URL as Pydantic model from image at (relative) path
    response = agent.get(
        "What is the current url shown in the url bar?",
        response_schema=UrlResponse,
        source="screenshot.png",
    )
    # Dump whole model
    print(response.model_dump_json(indent=2))
    # or
    response_json_dict = response.model_dump(mode="json")
    print(json.dumps(response_json_dict, indent=2))
    # or for regular dict
    response_dict = response.model_dump()
    print(response_dict["url"])

    # Get boolean response from PIL Image
    is_login_page = agent.get(
        "Is this a login page?",
        response_schema=bool,
        source=Image.open("screenshot.png"),
    )
    print(is_login_page)

    # Get integer response
    input_count = agent.get(
        "How many input fields are visible on this page?",
        response_schema=int,
    )
    print(input_count)

    # Get float response
    design_rating = agent.get(
        "Rate the page design quality from 0 to 1",
        response_schema=float,
    )
    print(design_rating)

    # Get nested response
    nested = agent.get(
        "Extract the URL and its metadata from the page",
        response_schema=NestedResponse,
    )
    print(nested.nested.url)

    # Get recursive response
    linked_list = agent.get(
        "Extract the breadcrumb navigation as a linked list",
        response_schema=LinkedListNode,
    )
    current = linked_list
    while current:
        print(current.value)
        current = current.next

    # Get text from PDF
    text = agent.get(
        "Extract all text from the PDF",
        source="document.pdf",
    )
    print(text)
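
Because get() can return typed values, it composes naturally with ordinary control flow; a brief sketch using the bool schema shown above:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    if agent.get("Is this a login page?", response_schema=bool):
        agent.act("Log in with username 'admin' and password '1234'")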

locate

def locate(
    locator: str | Locator,
    screenshot: Optional[InputSource] = None,
    model: ModelComposition | str | None = None
) -> Point
Locates the first matching UI element identified by the provided locator. Arguments:
  • locator str | Locator - The identifier or description of the element to locate.
  • screenshot InputSource | None, optional - The screenshot to use for locating the element. Can be a path to an image file, a PIL Image object or a data URL. If None, takes a screenshot of the currently selected display.
  • model ModelComposition | str | None, optional - The composition or name of the model(s) to be used for locating the element using the locator.
Returns:
  • Point - The coordinates of the element as a tuple (x, y).
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    point = agent.locate("Submit button")
    print(f"Element found at coordinates: {point}")

locate_all

def locate_all(
    locator: str | Locator,
    screenshot: Optional[InputSource] = None,
    model: ModelComposition | str | None = None
) -> PointList
Locates all matching UI elements identified by the provided locator. Note: Some LocateModels can only locate a single element. In this case, the returned list will have a length of 1. Arguments:
  • locator str | Locator - The identifier or description of the element to locate.
  • screenshot InputSource | None, optional - The screenshot to use for locating the element. Can be a path to an image file, a PIL Image object or a data URL. If None, takes a screenshot of the currently selected display.
  • model ModelComposition | str | None, optional - The composition or name of the model(s) to be used for locating the element using the locator.
Returns:
  • PointList - The coordinates of the elements as a list of tuples (x, y).
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    points = agent.locate_all("Submit button")
    print(f"Found {len(points)} elements at coordinates: {points}")

locate_all_elements

def locate_all_elements(
    screenshot: Optional[InputSource] = None,
    model: ModelComposition | None = None
) -> list[DetectedElement]
Locates all elements on the current screen using AskUI models. Arguments:
  • screenshot InputSource | None, optional - The screenshot to use for locating the elements. Can be a path to an image file, a PIL Image object or a data URL. If None, takes a screenshot of the currently selected display.
  • model ModelComposition | None, optional - The model composition to be used for locating the elements.
Returns:
  • list[DetectedElement] - A list of detected elements
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    detected_elements = agent.locate_all_elements()
    print(f"Found {len(detected_elements)} elements: {detected_elements}")

annotate

def annotate(
    screenshot: InputSource | None = None,
    annotation_dir: str = "annotations",
    model: ModelComposition | None = None
) -> None
Annotates the screenshot with the detected elements. Creates an interactive HTML file with the detected elements and saves it to the annotation directory. The HTML file can be opened in a browser to see the annotated image. Hovering over an element shows its name and text value, and clicking its box copies the text value to the clipboard. Arguments:
  • screenshot InputSource | None, optional - The screenshot to annotate. If None, takes a screenshot of the currently selected display.
  • annotation_dir str - The directory to save the annotated image. Defaults to “annotations”.
  • model ModelComposition | None, optional - The composition of the model(s) to be used for annotating the image. If None, uses the default model.
Example:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.annotate()

Example with a custom screenshot and annotation directory:
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.annotate(screenshot="screenshot.png", annotation_dir="htmls")

wait

def wait(
    until: Annotated[float, Field(gt=0.0)] | str | Locator,
    retry_count: Optional[Annotated[int, Field(gt=0)]] = None,
    delay: Optional[Annotated[float, Field(gt=0.0)]] = None,
    until_condition: Literal["appear", "disappear"] = "appear",
    model: ModelComposition | str | None = None
) -> None
Pauses execution or waits until a UI element appears or disappears. Arguments:
  • until float | str | Locator - If a float, pauses execution for the specified number of seconds (must be greater than 0.0). If a string or Locator, waits until the specified UI element appears or disappears on screen.
  • retry_count int | None - Number of retries when waiting for a UI element. Defaults to 3 if None.
  • delay float | None - Sleep duration in seconds between retries when waiting for a UI element. Defaults to 1 second if None.
  • until_condition Literal[“appear”, “disappear”] - The condition to wait until the element satisfies. Defaults to “appear”.
  • model ModelComposition | str | None, optional - The composition or name of the model(s) to be used for locating the element using the until locator.
Raises:
  • WaitUntilError - If the UI element is not found after all retries.
Example:
from askui import AndroidVisionAgent
from askui.locators import loc

with AndroidVisionAgent() as agent:
    # Wait for a specific duration
    agent.wait(5)  # Pauses execution for 5 seconds
    agent.wait(0.5)  # Pauses execution for 500 milliseconds

    # Wait for a UI element to appear
    agent.wait("Submit button", retry_count=5, delay=2)
    agent.wait("Login form")  # Uses default retries and sleep time
    agent.wait(loc.Text("Password"))  # Uses default retries and sleep time

    # Wait for a UI element to disappear
    agent.wait("Loading spinner", until_condition="disappear")

    # Wait using a specific model
    agent.wait("Submit button", model="custom_model")