Using visual references or sketches as an input to provide context to the LLM.
Overview
This pattern lets users visually outline an idea and have the AI match its structure or gain clearer context, just as humans do with one another. Instead of describing something in words (you may not even know the correct term), you can provide visual context through a fast, low-friction, multi-modal medium.
User intent
Getting results faster
Macro trend
Human in the loop
Why does “visual input” matter?
Physical objects and UIs often contain context-rich information that is difficult to feed to an LLM through articulation alone, whereas pointing and saying "this part" is very easy.
So why not make LLMs capable of exactly this? Users can provide context from their camera, take a screenshot, or even select an element from the UI. This reduces both the time and the cognitive load of explaining in words, and keeps users in the flow.
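As a rough sketch of what this looks like under the hood, a screenshot or camera capture can travel alongside the text prompt as one more content part in a multimodal chat request. The message shape below follows the OpenAI-style content-parts format; the function name and image bytes are illustrative, not from the article:

```python
import base64


def build_image_message(image_bytes: bytes, prompt: str) -> dict:
    """Package a screenshot or camera capture with a text prompt in the
    multi-part message format accepted by multimodal chat APIs."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            # The user's words, which can now be as terse as "this part"...
            {"type": "text", "text": prompt},
            # ...because the pixels carry the rest of the context.
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }


# A user screenshots a confusing control and simply points at it:
message = build_image_message(b"\x89PNG...", "What does this part of the UI do?")
```

The design point is that the image is a first-class part of the same message, not an attachment handled by a separate flow, which is what keeps the interaction low-friction.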
A common use case is providing visual input to replicate the structure or style of an image. It also supports creativity that is sometimes non-verbal, and it retains the user's intent more accurately than a text prompt, where layout and framing depend on how well you can articulate them. The pattern is especially helpful for visual workflows that require expression.
Let’s look at some products that have used this pattern.
Examples
Photomath
provides the answer to your math problems just by, as the name suggests, taking a photo. The app interprets the equation and returns a step-by-step solution. The key detail is producing a structured, correct output with minimal user effort.


iPadOS’s
sketch to image is an example of the structure-as-input use case mentioned earlier. The user’s Apple Pencil sketch goes in, the model refines the drawing, and out comes a neater, “made it pop” image that preserves the original structure.


Adobe’s Firefly
lets you define the layout you want in your generated images. Once the structure is in place, you can choose a style or prompt one, and the app generates images that follow the references.

Variations maintain the original positioning and balance, enabling consistent iterations. You can also lock certain elements while changing the others, allowing flexible control over both structure and style.

AI UX checklist
Can the system reliably extract actionable content from photos or screenshots?
Are image inputs treated as a first-class mode, not a fallback?
Is recognition fast enough to feel native to the task?
Does the system respond with the right tool, not just information?
Does the interaction allow multiple entry points - camera, screenshots, clipboard?
As models gain precision across vision and language, this pattern will evolve to work across different input contexts — grabbing a chart, a UI, or a paragraph and triggering the right function without switching tools.
*yes we use em dash :)
Visual input will become key to ambient AI UX, helping systems better understand the user’s intent.

Design founder at Studio Oblique
