Enabling users to interact with their voice, whether to have a conversation or to take actions.
This is a type of screenless UX.
And most recently, there's a hidden behaviour shift taking place...
People are relying on voice more and more: taking walks with ChatGPT to think out loud, or dictating with Wispr Flow instead of click, click, type.
We're seeing voice being added to our day-to-day applications, not just standalone devices like the Echo Dot.
But why now? Because the technology is finally here. Our fear of "Uh-uh, I didn't quite get that. Try again" is a thing of the past. Audio interpretation has gotten faster and much more advanced; take AirPods' live translation, for example.
In fact, it's so much better that users' expectations have been raised again (for the first time since Siri's 2011 launch): they expect accuracy, context awareness, and a seamless handoff between speaking, writing, and editing. But done right, voice input can make a product feel like a real partner.
Let's dive into some key takeaways.
Users only believe in voice if they know they're being heard. Break that trust once, especially during onboarding or the let-me-give-this-a-shot phase, and you've lost that user for a couple of years (courtesy of Siri).
Perplexity has a voice input mode with fancyyy visual feedback, reminiscent of the audio-visualizer days.

ElevenLabs has a similar (open-source) orb component to indicate the different states of your voice agent (one of their most-used components, by the way).

Alexa shows a vibrant blue ring as its listening state. The interaction is hands-free and optimized for ambient environments like kitchens or living rooms.
One thing missing, though, is a "mmhhmm" sound along with the visual, confirming that it's listening. Sometimes I'm running late and don't have time to look at Alexa to see if it's listening. Siri 2.0 caters to this!
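To make the pattern concrete, here is a minimal sketch of how a voice UI might map agent states to a visual plus an optional audio cue, so the user always knows they're being heard. The state names, cue file paths, and rendering calls are assumptions for illustration, not any of these products' actual implementations.

```typescript
// Minimal sketch: mapping voice-agent states to visual + audio feedback.
// State names and asset paths are illustrative assumptions, not a real product API.

type AgentState = "idle" | "listening" | "thinking" | "speaking" | "error";

interface Feedback {
  visual: string;   // what the orb / ring should render
  earcon?: string;  // short confirmation sound, e.g. the "mmhhmm" cue
}

const FEEDBACK: Record<AgentState, Feedback> = {
  idle:      { visual: "dim-orb" },
  listening: { visual: "pulsing-blue-ring", earcon: "sounds/ack.mp3" },   // confirm "I hear you"
  thinking:  { visual: "swirling-orb",      earcon: "sounds/hmm.mp3" },   // fill the silence
  speaking:  { visual: "waveform" },
  error:     { visual: "red-flash",         earcon: "sounds/retry.mp3" },
};

function render(state: AgentState): void {
  const { visual, earcon } = FEEDBACK[state];
  console.log(`show: ${visual}`);             // swap for your orb / visualizer component
  if (earcon) console.log(`play: ${earcon}`); // audio cue for eyes-free confirmation
}

render("listening");
```

The point of the audio cue is exactly the Alexa gap above: eyes-free confirmation for when you can't (or don't want to) look at the device.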

One of the most interesting examples is Tolan - their sound designer, Thomas, broke down the process of coming up with the audio for its "thinking" state (you will LOVE it!):

A voice input system is only as good as its editing flow. Punctuation, filler removal, and intent detection make it usable.
Wispr Flow corrects grammar and removes the errs and umms. Tools that existed for written content, like Grammarly, will now be invisibly present in voice.
Monologue auto-formats the content so you don't have to structure it. You dump your thoughts → AI transcribes and structures them.
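As a rough illustration of that cleanup step, here is a minimal sketch of a post-transcription pass that strips fillers and adds basic punctuation. Real tools like Wispr Flow use language models for this; the regex below only shows the shape of the idea, and the filler list is an assumption.

```typescript
// Minimal sketch of a post-transcription cleanup pass (filler removal + light formatting).
// Real products use language models; this regex-based version only illustrates the idea.

const FILLERS = /\b(u+m+|u+h+|er+m*|like|you know)\b[,.]?\s*/gi; // illustrative filler list

function cleanTranscript(raw: string): string {
  let text = raw.replace(FILLERS, "");                  // drop the umms and errs
  text = text.replace(/\s{2,}/g, " ").trim();           // collapse leftover whitespace
  text = text.charAt(0).toUpperCase() + text.slice(1);  // start like a sentence
  if (!/[.!?]$/.test(text)) text += ".";                // add terminal punctuation
  return text;
}

console.log(cleanTranscript("um so like, we should uh ship the voice feature next week"));
// -> "So we should ship the voice feature next week."
```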

We’re past the “dedicated device for audio” era. Voice works best when it’s baked into daily tools (calls, docs, chats) instead of living on a smart speaker island. Embedding it reduces friction and raises adoption.
Siri is activated via a long press or the voice trigger "Hey Siri". Users can ask "add hiking to my calendar for 7 AM" and, since it's integrated with apps, it can execute tasks with outputs in the form of GUI widgets, plus a follow-up for continuity if needed.
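To sketch the underlying pattern (utterance → intent → app action → GUI widget → follow-up), here is a minimal, hypothetical example. This is not SiriKit or App Intents code; the parser, types, and calendar call are stand-ins for illustration.

```typescript
// Sketch of the voice-to-action pattern: utterance -> intent -> app action -> GUI widget.
// Names and shapes are illustrative; real assistants use frameworks like App Intents.

interface CalendarIntent {
  kind: "add_event";
  title: string;
  time: string;
}

interface WidgetResponse {
  widget: { type: "event_card"; title: string; time: string }; // small GUI confirmation
  followUp?: string;                                           // keeps the conversation going
}

// Very naive parser, standing in for the assistant's NLU.
function parseIntent(utterance: string): CalendarIntent | null {
  const match = utterance.match(/add (.+) to my calendar for (.+)/i);
  return match ? { kind: "add_event", title: match[1], time: match[2] } : null;
}

function execute(intent: CalendarIntent): WidgetResponse {
  // Here the assistant would call the calendar app's integration.
  return {
    widget: { type: "event_card", title: intent.title, time: intent.time },
    followUp: "Want me to set a reminder 30 minutes before?",
  };
}

const intent = parseIntent("add hiking to my calendar for 7 AM");
if (intent) console.log(execute(intent));
```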

New.computer's experiment plays with two modalities: when you are looking at the screen, it's text-first; when you look away, it's audio-first.
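The switching logic itself can be tiny; the hard part is the attention signal. A minimal sketch, assuming a hypothetical isLookingAtScreen input (how you detect it is a separate problem):

```typescript
// Sketch of attention-based modality switching.
// "isLookingAtScreen" is a hypothetical input, not a real sensor API.

type Modality = "text-first" | "audio-first";

function pickModality(isLookingAtScreen: boolean): Modality {
  // Eyes on screen: show the response as text; eyes away: speak it out loud.
  return isLookingAtScreen ? "text-first" : "audio-first";
}

function respond(message: string, isLookingAtScreen: boolean): void {
  if (pickModality(isLookingAtScreen) === "text-first") {
    console.log(`render text: ${message}`);
  } else {
    console.log(`speak aloud: ${message}`);
  }
}

respond("Your 7 AM hike is on the calendar.", false); // user looked away -> audio-first
```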

Arc (with elevator music during the wait, a real classic move!) lets users speak when they hold their phone up in a call position. You can then talk and search like it's a normal phone call, but with your friend Google. This is an embedded voice interaction that provides search through real-time voice input.

Voice isn't just about commands anymore. Users are using it to reason, brainstorm, and multitask, which opens up a new design challenge: how do you design for messy, exploratory speech, not just short, direct commands?
As a practitioner, I like to keep notes (key takeaways and questions to ask myself) for whenever I'm designing a voice input interaction in the future.
The aim when working on a voice interaction is to get into code (for a live experience) as soon as possible. Understand all those states and iron out the nuances.
As a designer, there's not much to the interface, maybe a button and a visualizer, but a lot of designing is needed for the part of the interaction that's invisible.
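One quick way to get into code is a browser prototype with the Web Speech API (supported in Chromium-based browsers); here is a minimal sketch for feeling those invisible states live. The handlers just log states; swap them for your button and visualizer.

```typescript
// Quick live prototype using the browser's Web Speech API (Chromium-based browsers).
// Goal: experience the invisible states (idle, listening, result, error) for real.

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;      // keep listening across pauses
recognition.interimResults = true;  // stream partial transcripts as the user speaks

recognition.onstart = () => console.log("state: listening"); // show the orb / ring here
recognition.onresult = (event: any) => {
  const last = event.results[event.results.length - 1];
  console.log(last.isFinal ? "state: final result" : "state: interim", last[0].transcript);
};
recognition.onerror = (event: any) => console.log("state: error", event.error); // e.g. no-speech, not-allowed
recognition.onend = () => console.log("state: idle"); // decide whether to auto-restart

recognition.start(); // trigger from a user gesture; the browser will ask for mic permission
```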
Voice interfaces are soon going to evolve from simple, reactive tools into proactive, conversational systems (more on this in the Audio output AI pattern). With advancements in memory and context modeling, future assistants will anticipate user needs and suggest helpful actions without being prompted.
Imagine this: your room is connected to a smart voice assistant that says, "It's getting quite warm; should I turn on the AC? Also, it's late; would you like me to dim the lights?"
