# Agent

<a id="askui.agent.VisionAgent" />

```python theme={null}
class VisionAgent(AgentBase)
```

A vision-based agent that can interact with user interfaces through computer vision and AI.

This agent can perform various UI interactions like clicking, typing, scrolling, and more.
It uses computer vision models to locate UI elements and execute actions on them.

**Arguments**:

* `display` *int, optional* - The display number to use for screen interactions. Defaults to `1`.
* `reporters` *list\[Reporter] | None, optional* - List of reporter instances for logging and reporting. If `None`, an empty list is used.
* `tools` *AgentToolbox | None, optional* - Custom toolbox instance. If `None`, a default one will be created with `AskUiControllerClient`.
* `model` *ModelChoice | ModelComposition | str | None, optional* - The default choice or name of the model(s) to be used for vision tasks. Can be overridden by the `model` parameter in the `click()`, `get()`, `act()` etc. methods.
* `retry` *Retry, optional* - The retry instance to use for retrying failed actions. Defaults to `ConfigurableRetry` with exponential backoff. Currently only supported for the `locate()` method.
* `models` *ModelRegistry | None, optional* - A registry of models to make available to the `VisionAgent` so that they can be selected using the `model` parameter of `VisionAgent` or the `model` parameter of its `click()`, `get()`, `act()` etc. methods. Entries in the registry override entries in the default model registry.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.click("Submit button")
    agent.type("Hello World")
    agent.act("Open settings menu")
```
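The `retry` argument defaults to a `ConfigurableRetry` with exponential backoff. Its constructor parameters are not listed here, so the following is only a minimal sketch of what an exponential backoff schedule looks like in general. The names `backoff_delays`, `base_delay`, and `factor` are hypothetical and not part of the AskUI API.

```python
def backoff_delays(base_delay: float, factor: float, attempts: int) -> list[float]:
    """Return the wait time before each retry: base_delay * factor**i."""
    return [base_delay * factor**i for i in range(attempts)]


# Hypothetical values chosen for illustration only
delays = backoff_delays(base_delay=1.0, factor=2.0, attempts=4)
print(delays)  # [1.0, 2.0, 4.0, 8.0]
```

Each wait is `factor` times longer than the previous one, which gives a slow UI progressively more time to settle before the next attempt.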

<a id="askui.agent.VisionAgent.click" />

## click

```python theme={null}
def click(
    locator: Optional[str | Locator | Point] = None,
    button: Literal["left", "middle", "right"] = "left",
    repeat: Annotated[int, Field(gt=0)] = 1,
    offset: Optional[Point] = None,
    model: ModelComposition | str | None = None
) -> None
```

Simulates a mouse click on the user interface element identified by the provided locator.

**Arguments**:

* `locator` *str | Locator | Point | None, optional* - UI element description, structured locator, or absolute coordinates (x, y). If `None`, clicks at current position.
* `button` *'left' | 'middle' | 'right', optional* - Specifies which mouse button to click. Defaults to `'left'`.
* `repeat` *int, optional* - The number of times to click. Must be greater than `0`. Defaults to `1`.
* `offset` *Point | None, optional* - Pixel offset (x, y) from the target location. Positive x=right, negative x=left, positive y=down, negative y=up.
* `model` *ModelComposition | str | None, optional* - The composition or name of the model(s) to be used for locating the element to click on using the `locator`.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.click()              # Left click on current position
    agent.click("Edit")        # Left click on text "Edit"
    agent.click((100, 200))    # Left click at absolute coordinates (100, 200)
    agent.click("Edit", button="right")  # Right click on text "Edit"
    agent.click(repeat=2)      # Double left click on current position
    agent.click("Edit", button="middle", repeat=4)   # 4x middle click on text "Edit"
    agent.click("Submit", offset=(10, -5))  # Click 10 pixels right and 5 pixels up from "Submit"
```
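The offset semantics described above (positive x moves right, positive y moves down) amount to adding the offset to the target coordinates. A minimal sketch, assuming `Point` behaves like an `(x, y)` tuple of integers as in the examples; `apply_offset` is a hypothetical helper for illustration, not an AskUI function:

```python
from typing import Optional

Point = tuple[int, int]  # assumption: (x, y) in screen pixels


def apply_offset(target: Point, offset: Optional[Point] = None) -> Point:
    """Shift `target` by `offset`; no offset leaves the target unchanged."""
    if offset is None:
        return target
    return (target[0] + offset[0], target[1] + offset[1])


# 10 pixels right and 5 pixels up from (100, 200)
print(apply_offset((100, 200), (10, -5)))  # (110, 195)
```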

<a id="askui.agent.VisionAgent.mouse_move" />

## mouse\_move

```python theme={null}
def mouse_move(
    locator: str | Locator | Point,
    offset: Optional[Point] = None,
    model: ModelComposition | str | None = None
) -> None
```

Moves the mouse cursor to the UI element identified by the provided locator.

**Arguments**:

* `locator` *str | Locator | Point* - UI element description, structured locator, or absolute coordinates (x, y).
* `offset` *Point | None, optional* - Pixel offset (x, y) from the target location. Positive x=right, negative x=left, positive y=down, negative y=up.
* `model` *ModelComposition | str | None, optional* - The composition or name of the model(s) to be used for locating the element to move the mouse to using the `locator`.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_move("Submit button")  # Moves cursor to submit button
    agent.mouse_move((300, 150))       # Moves cursor to absolute coordinates (300, 150)
    agent.mouse_move("Close")          # Moves cursor to close element
    agent.mouse_move("Profile picture", model="custom_model")  # Uses specific model
    agent.mouse_move("Menu", offset=(5, 10))  # Move 5 pixels right and 10 pixels down from "Menu"
```

<a id="askui.agent.VisionAgent.mouse_scroll" />

## mouse\_scroll

```python theme={null}
def mouse_scroll(x: int, y: int) -> None
```

Simulates scrolling the mouse wheel by the specified horizontal and vertical amounts.

**Arguments**:

* `x` *int* - The horizontal scroll amount. Positive values typically scroll right, negative values scroll left.
* `y` *int* - The vertical scroll amount. Positive values typically scroll down, negative values scroll up.

**Notes**:

The actual scroll direction depends on the operating system's configuration.
Some systems may have "natural scrolling" enabled, which reverses the traditional direction.

The meaning of scroll units varies across operating systems and applications.
A scroll value of `10` might result in different distances depending on the application and system settings.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_scroll(0, 10)  # Usually scrolls down 10 units
    agent.mouse_scroll(0, -5)  # Usually scrolls up 5 units
    agent.mouse_scroll(3, 0)   # Usually scrolls right 3 units
```

<a id="askui.agent.VisionAgent.type" />

## type

```python theme={null}
def type(
    text: Annotated[str, Field(min_length=1)],
    locator: str | Locator | Point | None = None,
    offset: Optional[Point] = None,
    model: ModelComposition | str | None = None,
    clear: bool = True
) -> None
```

Types the specified text as if it were entered on a keyboard.

If `locator` is provided, it will first click on the element to give it focus before typing.
If `clear` is `True` (default), it will triple click on the element to select the current text (in multi-line inputs like textareas the current line or paragraph) before typing.

**IMPORTANT:** `clear` only works if a `locator` is provided.

**Arguments**:

* `text` *str* - The text to be typed. Must be at least `1` character long.
* `locator` *str | Locator | Point | None, optional* - UI element description, structured locator, or absolute coordinates (x, y). If `None`, types at current focus.
* `offset` *Point | None, optional* - Pixel offset (x, y) from the target location. Positive x=right, negative x=left, positive y=down, negative y=up.
* `model` *ModelComposition | str | None, optional* - The composition or name of the model(s) to be used for locating the element, i.e., input field, to type into using the `locator`.
* `clear` *bool, optional* - Whether to triple click on the element to give it focus and select the current text before typing. Defaults to `True`.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.type("Hello, world!")  # Types "Hello, world!" at current focus
    agent.type("user@example.com", locator="Email")  # Clicks on "Email" input, then types
    agent.type("username", locator=(200, 100))  # Clicks at coordinates (200, 100), then types
    agent.type("password123", locator="Password field", model="custom_model")  # Uses specific model
    agent.type("Hello, world!", locator="Textarea", clear=False)  # Types "Hello, world!" into textarea without clearing
    agent.type("text", locator="Input field", offset=(5, 0))  # Click 5 pixels right of "Input field", then type
```

<a id="askui.agent.VisionAgent.key_up" />

## key\_up

```python theme={null}
def key_up(key: PcKey | ModifierKey) -> None
```

Simulates the release of a key.

**Arguments**:

* `key` *PcKey | ModifierKey* - The key to be released.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.key_up('a')  # Release the 'a' key
    agent.key_up('shift')  # Release the 'Shift' key
```

<a id="askui.agent.VisionAgent.key_down" />

## key\_down

```python theme={null}
def key_down(key: PcKey | ModifierKey) -> None
```

Simulates the pressing of a key.

**Arguments**:

* `key` *PcKey | ModifierKey* - The key to be pressed.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.key_down('a')  # Press the 'a' key
    agent.key_down('shift')  # Press the 'Shift' key
```
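`key_down()` and `key_up()` are typically used as a pair, with other actions in between. One way to make sure the key is always released is a small context manager. This is a sketch under assumptions: `hold_key` is a hypothetical helper, and `RecordingAgent` is a stand-in that only records calls so the snippet runs without a screen. With a real `VisionAgent` you would pass the agent itself.

```python
from contextlib import contextmanager


class RecordingAgent:
    """Stand-in for VisionAgent that only logs the calls it receives."""

    def __init__(self) -> None:
        self.calls = []

    def key_down(self, key) -> None:
        self.calls.append(("key_down", key))

    def key_up(self, key) -> None:
        self.calls.append(("key_up", key))

    def click(self, locator=None) -> None:
        self.calls.append(("click", locator))


@contextmanager
def hold_key(agent, key):
    """Hypothetical helper: hold `key` for the duration of the with-block."""
    agent.key_down(key)
    try:
        yield
    finally:
        agent.key_up(key)  # released even if an action inside the block raises


recorder = RecordingAgent()
with hold_key(recorder, "shift"):
    recorder.click("First item")
    recorder.click("Last item")
print(recorder.calls)
```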

<a id="askui.agent.VisionAgent.mouse_up" />

## mouse\_up

```python theme={null}
def mouse_up(button: Literal["left", "middle", "right"] = "left") -> None
```

Simulates the release of a mouse button.

**Arguments**:

* `button` *'left' | 'middle' | 'right', optional* - The mouse button to be released. Defaults to `'left'`.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_up()  # Release the left mouse button
    agent.mouse_up('right')  # Release the right mouse button
    agent.mouse_up('middle')  # Release the middle mouse button
```

<a id="askui.agent.VisionAgent.mouse_down" />

## mouse\_down

```python theme={null}
def mouse_down(button: Literal["left", "middle", "right"] = "left") -> None
```

Simulates the pressing of a mouse button.

**Arguments**:

* `button` *'left' | 'middle' | 'right', optional* - The mouse button to be pressed. Defaults to `'left'`.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.mouse_down()  # Press the left mouse button
    agent.mouse_down('right')  # Press the right mouse button
    agent.mouse_down('middle')  # Press the middle mouse button
```
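Combined with `mouse_move()`, the `mouse_down()` and `mouse_up()` methods can express a drag-and-drop gesture. A hedged sketch: `drag` is a hypothetical helper, the locators are placeholders, and `FakeAgent` merely records calls so the example runs without a display.

```python
class FakeAgent:
    """Stand-in for VisionAgent that records calls instead of moving the mouse."""

    def __init__(self) -> None:
        self.calls = []

    def mouse_move(self, locator) -> None:
        self.calls.append(("mouse_move", locator))

    def mouse_down(self, button: str = "left") -> None:
        self.calls.append(("mouse_down", button))

    def mouse_up(self, button: str = "left") -> None:
        self.calls.append(("mouse_up", button))


def drag(agent, source, target, button: str = "left") -> None:
    """Hypothetical helper: press at `source`, move to `target`, then release."""
    agent.mouse_move(source)
    agent.mouse_down(button)
    try:
        agent.mouse_move(target)
    finally:
        agent.mouse_up(button)  # release even if the move fails


fake = FakeAgent()
drag(fake, "File icon", "Trash")
print(fake.calls)
```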

<a id="askui.agent.VisionAgent.keyboard" />

## keyboard

```python theme={null}
def keyboard(
    key: PcKey | ModifierKey,
    modifier_keys: Optional[list[ModifierKey]] = None,
    repeat: Annotated[int, Field(gt=0)] = 1
) -> None
```

Simulates pressing (and releasing) a key or key combination on the keyboard.

**Arguments**:

* `key` *PcKey | ModifierKey* - The main key to press. This can be a letter, number, special character, or function key.
* `modifier_keys` *list\[ModifierKey] | None, optional* - List of modifier keys to press along with the main key. Common modifier keys include `'control'`, `'alt'`, `'shift'`.
* `repeat` *int, optional* - The number of times to press (and release) the key. Must be greater than `0`. Defaults to `1`.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.keyboard('a')  # Press 'a' key
    agent.keyboard('enter')  # Press 'Enter' key
    agent.keyboard('v', ['control'])  # Press Ctrl+V (paste)
    agent.keyboard('s', ['control', 'shift'])  # Press Ctrl+Shift+S
    agent.keyboard('a', repeat=2)  # Press 'a' key twice
```

<a id="askui.agent.VisionAgent.cli" />

## cli

```python theme={null}
def cli(command: Annotated[str, Field(min_length=1)]) -> None
```

Executes a command on the command line interface.

This method allows running shell commands directly from the agent. The command
is split on spaces and executed as a subprocess.

**Arguments**:

* `command` *str* - The command to execute on the command line.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    # Windows
    agent.cli(r'start "" "C:\Program Files\VideoLAN\VLC\vlc.exe"')  # Start VLC non-blocking
    agent.cli(r'"C:\Program Files\VideoLAN\VLC\vlc.exe"')  # Start VLC blocking

    # macOS
    agent.cli("open -a chrome")  # Open Chrome non-blocking
    agent.cli("chrome")  # Open Chrome blocking
    agent.cli("echo Hello World")  # Prints "Hello World"
    agent.cli("python --version")  # Displays Python version

    # Linux
    agent.cli("nohup chrome")  # Open Chrome non-blocking
    agent.cli("chrome")  # Open Chrome blocking
    agent.cli("echo Hello World")  # Prints "Hello World"
    agent.cli("python --version")  # Displays Python version
```
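Because the command string is split into arguments before it is run, paths and names containing spaces deserve care. How AskUI tokenizes the string internally is not specified beyond the description above. The sketch below only contrasts plain whitespace splitting with the standard library's `shlex.split`, which respects quotes; the sample command is illustrative.

```python
import shlex

command = 'open -a "Google Chrome"'

# Plain whitespace splitting breaks the quoted name into two pieces
naive = command.split()
print(naive)   # ['open', '-a', '"Google', 'Chrome"']

# shlex.split keeps the quoted name together and drops the quotes
quoted = shlex.split(command)
print(quoted)  # ['open', '-a', 'Google Chrome']
```

If a command behaves unexpectedly, comparing these two tokenizations is a quick way to check whether whitespace inside an argument is the cause.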

<a id="askui.agent.VisionAgent.act" />

## act

```python theme={null}
def act(
    goal: Annotated[str | list[MessageParam],
                    Field(min_length=1)],
    model: str | None = None,
    on_message: OnMessageCb | None = None,
    tools: list[Tool] | ToolCollection | None = None,
    settings: ActSettings | None = None
) -> None
```

Instructs the agent to achieve a specified goal through autonomous actions.

The agent will analyze the screen, determine necessary steps, and perform
actions to accomplish the goal. This may include clicking, typing, scrolling,
and other interface interactions.

**Arguments**:

* `goal` *str | list\[MessageParam]* - A description of what the agent should
  achieve.
* `model` *str | None, optional* - The composition or name of the model(s) to
  be used for achieving the `goal`.
* `on_message` *OnMessageCb | None, optional* - Callback for new messages. If
  it returns `None`, stops and does not add the message.
* `tools` *list\[Tool] | ToolCollection | None, optional* - The tools for the
  agent. Defaults to default tools depending on the selected model.
* `settings` *ActSettings | None, optional* - The settings for the agent.
  Defaults to default settings depending on the selected model.

**Returns**:

None

**Raises**:

* `MaxTokensExceededError` - If the model reaches the maximum token limit
  defined in the agent settings.
* `ModelRefusalError` - If the model refuses to process the request.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.act("Open the settings menu")
    agent.act("Search for 'printer' in the search box")
    agent.act("Log in with username 'admin' and password '1234'")
```

<a id="askui.agent.VisionAgent.get" />

## get

```python theme={null}
def get(
    query: Annotated[str, Field(min_length=1)],
    response_schema: Type[ResponseSchema] | None = None,
    model: str | None = None,
    source: Optional[InputSource] = None
) -> ResponseSchema | str
```

Retrieves information from an image or PDF based on the provided `query`.

If no `source` is provided, a screenshot of the current screen is taken.

**Arguments**:

* `query` *str* - The query describing what information to retrieve.
* `source` *InputSource | None, optional* - The source to extract information
  from. Can be a path to an image, PDF, or office document file,
  a PIL Image object or a data URL. Defaults to a screenshot of the
  current screen.
* `response_schema` *Type\[ResponseSchema] | None, optional* - A Pydantic model
  class that defines the response schema. If not provided, returns a
  string.
* `model` *str | None, optional* - The composition or name of the model(s) to
  be used for retrieving information from the screen or image using the
  `query`. Note: `response_schema` is not supported by all models.
  PDF processing is only supported for Gemini models hosted on AskUI.

**Returns**:

ResponseSchema | str: The extracted information, `str` if no
`response_schema` is provided.

**Raises**:

* `NotImplementedError` - If PDF processing is not supported for the selected
  model.
* `ValueError` - If the `source` is not a valid PDF or image.

**Example**:

```python theme={null}
from askui import ResponseSchemaBase, VisionAgent
from PIL import Image
import json

class UrlResponse(ResponseSchemaBase):
    url: str

class NestedResponse(ResponseSchemaBase):
    nested: UrlResponse

class LinkedListNode(ResponseSchemaBase):
    value: str
    next: "LinkedListNode | None"

with VisionAgent() as agent:
    # Get URL as string
    url = agent.get("What is the current url shown in the url bar?")

    # Get URL as Pydantic model from image at (relative) path
    response = agent.get(
        "What is the current url shown in the url bar?",
        response_schema=UrlResponse,
        source="screenshot.png",
    )
    # Dump whole model
    print(response.model_dump_json(indent=2))
    # or
    response_json_dict = response.model_dump(mode="json")
    print(json.dumps(response_json_dict, indent=2))
    # or for regular dict
    response_dict = response.model_dump()
    print(response_dict["url"])

    # Get boolean response from PIL Image
    is_login_page = agent.get(
        "Is this a login page?",
        response_schema=bool,
        source=Image.open("screenshot.png"),
    )
    print(is_login_page)

    # Get integer response
    input_count = agent.get(
        "How many input fields are visible on this page?",
        response_schema=int,
    )
    print(input_count)

    # Get float response
    design_rating = agent.get(
        "Rate the page design quality from 0 to 1",
        response_schema=float,
    )
    print(design_rating)

    # Get nested response
    nested = agent.get(
        "Extract the URL and its metadata from the page",
        response_schema=NestedResponse,
    )
    print(nested.nested.url)

    # Get recursive response
    linked_list = agent.get(
        "Extract the breadcrumb navigation as a linked list",
        response_schema=LinkedListNode,
    )
    current = linked_list
    while current:
        print(current.value)
        current = current.next

    # Get text from PDF
    text = agent.get(
        "Extract all text from the PDF",
        source="document.pdf",
    )
    print(text)
```

<a id="askui.agent.VisionAgent.locate" />

## locate

```python theme={null}
def locate(
    locator: str | Locator,
    screenshot: Optional[InputSource] = None,
    model: ModelComposition | str | None = None
) -> Point
```

Locates the first matching UI element identified by the provided locator.

**Arguments**:

* `locator` *str | Locator* - The identifier or description of the element to
  locate.
* `screenshot` *InputSource | None, optional* - The screenshot to use for
  locating the element. Can be a path to an image file, a PIL Image object
  or a data URL. If `None`, takes a screenshot of the currently
  selected display.
* `model` *ModelComposition | str | None, optional* - The composition or name
  of the model(s) to be used for locating the element using the `locator`.

**Returns**:

* `Point` - The coordinates of the element as a tuple (x, y).

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    point = agent.locate("Submit button")
    print(f"Element found at coordinates: {point}")
```

<a id="askui.agent.VisionAgent.locate_all" />

## locate\_all

```python theme={null}
def locate_all(
    locator: str | Locator,
    screenshot: Optional[InputSource] = None,
    model: ModelComposition | str | None = None
) -> PointList
```

Locates all matching UI elements identified by the provided locator.

Note: Some LocateModels can only locate a single element. In this case, the
returned list will have a length of 1.

**Arguments**:

* `locator` *str | Locator* - The identifier or description of the element to
  locate.
* `screenshot` *InputSource | None, optional* - The screenshot to use for
  locating the element. Can be a path to an image file, a PIL Image object
  or a data URL. If `None`, takes a screenshot of the currently
  selected display.
* `model` *ModelComposition | str | None, optional* - The composition or name
  of the model(s) to be used for locating the element using the `locator`.

**Returns**:

* `PointList` - The coordinates of the elements as a list of tuples (x, y).

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    points = agent.locate_all("Submit button")
    print(f"Found {len(points)} elements at coordinates: {points}")
```
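The order of the returned points is not specified above. If a script needs a deterministic order, for example top-to-bottom and then left-to-right, the list can be sorted on the caller's side. A minimal sketch, assuming `PointList` behaves like a list of `(x, y)` tuples; `reading_order` and the sample coordinates are made up for illustration.

```python
def reading_order(points):
    """Sort (x, y) points by row first (y), then by column (x)."""
    return sorted(points, key=lambda p: (p[1], p[0]))


# Hypothetical coordinates standing in for a locate_all() result
sample = [(300, 120), (40, 120), (40, 20)]
print(reading_order(sample))  # [(40, 20), (40, 120), (300, 120)]
```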

<a id="askui.agent.VisionAgent.locate_all_elements" />

## locate\_all\_elements

```python theme={null}
def locate_all_elements(
    screenshot: Optional[InputSource] = None,
    model: ModelComposition | None = None
) -> list[DetectedElement]
```

Locates all elements on the current screen using AskUI models.

**Arguments**:

* `screenshot` *InputSource | None, optional* - The screenshot to use for
  locating the elements. Can be a path to an image file, a PIL Image
  object or a data URL. If `None`, takes a screenshot of the currently
  selected display.
* `model` *ModelComposition | None, optional* - The model composition
  to be used for locating the elements.

**Returns**:

* `list[DetectedElement]` - A list of detected elements.

**Example**:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    detected_elements = agent.locate_all_elements()
    print(f"Found {len(detected_elements)} elements: {detected_elements}")
```

<a id="askui.agent.VisionAgent.annotate" />

## annotate

```python theme={null}
def annotate(
    screenshot: InputSource | None = None,
    annotation_dir: str = "annotations",
    model: ModelComposition | None = None
) -> None
```

Annotates the screenshot with the detected elements.
Creates an interactive HTML file with the detected elements
and saves it to the annotation directory.
The HTML file can be opened in a browser to see the annotated image.
The user can hover over the elements to see their names and text value
and click on the box to copy the text value to the clipboard.

**Arguments**:

* `screenshot` *InputSource | None, optional* - The screenshot to annotate.
  If `None`, takes a screenshot of the currently selected display.
* `annotation_dir` *str, optional* - The directory to save the annotated
  image. Defaults to `"annotations"`.
* `model` *ModelComposition | None, optional* - The composition
  of the model(s) to be used for annotating the image.
  If `None`, uses the default model.

**Example**:

Using `VisionAgent`:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.annotate()
```

Using `AndroidVisionAgent`:

```python theme={null}
from askui import AndroidVisionAgent

with AndroidVisionAgent() as agent:
    agent.annotate()
```

Using `VisionAgent` with a custom screenshot and annotation directory:

```python theme={null}
from askui import VisionAgent

with VisionAgent() as agent:
    agent.annotate(screenshot="screenshot.png", annotation_dir="htmls")
```

<a id="askui.agent.VisionAgent.wait" />

## wait

```python theme={null}
def wait(
    until: Annotated[float, Field(gt=0.0)] | str | Locator,
    retry_count: Optional[Annotated[int, Field(gt=0)]] = None,
    delay: Optional[Annotated[float, Field(gt=0.0)]] = None,
    until_condition: Literal["appear", "disappear"] = "appear",
    model: ModelComposition | str | None = None
) -> None
```

Pauses execution or waits until a UI element appears or disappears.

**Arguments**:

* `until` *float | str | Locator* - If a float, pauses execution for the
  specified number of seconds (must be greater than 0.0). If a string
  or Locator, waits until the specified UI element appears or
  disappears on screen.
* `retry_count` *int | None, optional* - Number of retries when waiting for a UI
  element. Defaults to `3` if `None`.
* `delay` *float | None, optional* - Sleep duration in seconds between retries when
  waiting for a UI element. Defaults to `1` second if `None`.
* `until_condition` *'appear' | 'disappear', optional* - The condition to wait
  until the element satisfies. Defaults to `'appear'`.
* `model` *ModelComposition | str | None, optional* - The composition or name
  of the model(s) to be used for locating the element using the
  `until` locator.

**Raises**:

* `WaitUntilError` - If the UI element has not appeared (or, with `until_condition="disappear"`, has not disappeared) after all retries.

**Example**:

```python theme={null}
from askui import VisionAgent
from askui.locators import loc

with VisionAgent() as agent:
    # Wait for a specific duration
    agent.wait(5)  # Pauses execution for 5 seconds
    agent.wait(0.5)  # Pauses execution for 500 milliseconds

    # Wait for a UI element to appear
    agent.wait("Submit button", retry_count=5, delay=2)
    agent.wait("Login form")  # Uses default retries and sleep time
    agent.wait(loc.Text("Password"))  # Uses default retries and sleep time

    # Wait for a UI element to disappear
    agent.wait("Loading spinner", until_condition="disappear")

    # Wait using a specific model
    agent.wait("Submit button", model="custom_model")
```
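The `retry_count` and `delay` parameters describe a polling loop: check for the element, and if the condition is not met, sleep and try again. The following is a sketch of that general pattern, not AskUI's implementation. `poll_until`, `check`, and the injected `sleep` function are hypothetical names, and a plain `TimeoutError` stands in for `WaitUntilError`.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    retry_count: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `check` up to `retry_count` times; return the attempt that succeeded."""
    for attempt in range(1, retry_count + 1):
        if check():
            return attempt
        if attempt < retry_count:
            sleep(delay)  # pause only between attempts, not after the last one
    raise TimeoutError(f"condition not met after {retry_count} attempts")


# Simulated element that becomes visible on the third check;
# sleep is replaced by list.append so the example finishes instantly.
results = iter([False, False, True])
slept = []
attempts = poll_until(lambda: next(results), retry_count=5, delay=2.0, sleep=slept.append)
print(attempts, slept)  # 3 [2.0, 2.0]
```

Waiting for an element to disappear fits the same loop by inverting `check`.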
