Multimodal Apps – Designing for Apps That Can See, Hear, and Speak. - TO26: Firebrand

People want a reset of their relationship with their computers. We'll use the latest AI models to explore prototype experiences that use a device's microphone, speakers, and camera to perceive the world the way we do, then design for the trust, context, and consent questions that follow.

This is the third time Josh has run this workshop at The Outlook, and each round has been refined based on what participants build and discover.

The whole session runs in Cursor, so you're not just observing prototypes: you're inside the codebase, modifying and extending them live. We'll work with three prototypes, each leaning on a different mix of voice, vision, and audio, to show how these capabilities reshape the expectations users bring to an interface.

We'll work through a simple framework, the Three Cs:

Context – what is actually driving the interaction, and what the app needs to understand about the user's situation.
Capability – what the underlying models and APIs can genuinely do, and where they fall short.
Consent – what data and permissions we're asking for, and whether users understand what they're agreeing to.

In small groups you'll move through each prototype in Cursor, document what you notice through the Three Cs lens, then change the context parameters and behaviour yourself to see how the apps respond. You leave having built, not just watched.

By the end you'll have hands-on experience with several multimodal AI providers, a working framework for evaluating perceptual interfaces, and a sharper instinct for the trust questions these apps raise.

Who it's for: Designers, product people, and developers prototyping with AI who want to move past chat boxes.

Format: 3 hours, hands-on in Cursor.

Laptops with webcam and microphone required.
