Your Team Chose Voice Because It Felt Modern. That's Not a Framework.

A fintech product team added voice to their onboarding flow in 2025. Competitors had it. Users were asking about it. The decision happened in a product review, somewhere between the competitive audit and the roadmap discussion.

Six months later, voice accounted for 2.3% of sessions and a disproportionate share of support tickets. The friction wasn't implementation — the technical execution was clean. The friction was that users were trying to set up investment accounts in public places, on commutes, in offices. They didn't want to read their account numbers aloud.

Nobody had asked who uses this, where, for what. They'd asked: "Do we have voice yet?"

The Modality Decision Nobody Has Formalized

The question of which interface modality to use — voice, touch, text input, gesture, or some combination — is one of the most consequential design decisions a team makes. It shapes the entire interaction model, the accessibility profile, the privacy implications, and the cognitive load of every user encounter.

It's also one of the least formalized. Most teams make modality decisions based on one of three inputs: what competitors are shipping, what's technically feasible given the current stack, or what feels modern. None of those are criteria. They're noise.

CHI 2026 published two relevant papers: Bashar et al. on cross-modal interaction performance under varying cognitive load, and DeVries et al. on task structure and modality affordance matching. Together they offer the clearest empirical framework currently available for this decision.

The Four Variables That Actually Matter

1. Task Structure

DeVries et al. coded tasks along a single axis: discrete versus continuous. Discrete tasks have defined inputs and defined completion states. Continuous tasks require ongoing judgment, comparison, or navigation.

Voice performs well for discrete tasks with bounded inputs: setting a timer, querying a status, initiating a call. It degrades sharply for continuous tasks that require comparison ("which plan should I choose?"), review ("can you look over this before I submit?"), or multi-step decision making where the user needs to hold multiple variables simultaneously.

The fintech onboarding flow above is a continuous, high-stakes task requiring careful review of account details and terms. Voice is poorly matched to every dimension of that task structure. The matching error wasn't about voice being wrong in principle. It was about voice being wrong for this task type.
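The task-structure check can be written down as a rough rule of thumb. The sketch below is an illustration of the distinction described above, not a scoring model from DeVries et al.; the `Task` fields, the `voice_fit` function, and the three-level result are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Structure(Enum):
    DISCRETE = "discrete"      # defined inputs, defined completion state
    CONTINUOUS = "continuous"  # ongoing judgment, comparison, or navigation


@dataclass(frozen=True)
class Task:
    name: str
    structure: Structure
    bounded_inputs: bool  # assumption: inputs come from a small, known set
    needs_review: bool    # assumption: user must compare or re-read before committing


def voice_fit(task: Task) -> str:
    """Return a coarse fit label for voice: 'good', 'poor', or 'unclear'."""
    if task.structure is Structure.CONTINUOUS or task.needs_review:
        return "poor"
    if task.structure is Structure.DISCRETE and task.bounded_inputs:
        return "good"
    # Discrete but open-ended input: the text gives no clear guidance.
    return "unclear"


set_timer = Task("set a timer", Structure.DISCRETE, bounded_inputs=True, needs_review=False)
account_setup = Task("set up investment account", Structure.CONTINUOUS,
                     bounded_inputs=False, needs_review=True)

print(voice_fit(set_timer))      # good
print(voice_fit(account_setup))  # poor
```

The value of writing the rule down is less the code than the forced conversation: someone has to decide, per task, whether it is discrete and whether the user needs to review anything before committing.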

2. Visual Attention Budget

Bashar et al. introduced the concept of the visual attention budget: the portion of a user's visual processing capacity that is available for interface interaction during a given task. The finding is counterintuitive but robust: voice interfaces don't reduce cognitive load when the use context already has high visual demands.

Driving is the canonical example. Voice feels like it should reduce the attention cost of interacting while driving. But the 2026 Ergonomics meta-analysis (synthesizing 47 prior studies) found that voice interfaces requiring any auditory response processing, such as reading back content or presenting options, captured cognitive resources that overlapped with the visual-spatial processing demands of driving. Touch-optimized large-target interfaces, despite involving physical interaction, were often less strongly correlated with accidents than voice interfaces requiring active listening.

The implication: voice reduces friction when the environment has low visual demand and the user's hands are occupied (walking, cooking). It doesn't reduce friction when the environment already has high attentional demands.

3. Environmental Context

This is the variable teams most consistently fail to model. Interfaces get designed in offices and tested in labs. They get used on trains, in bathrooms, in open-plan offices, and in places where audio is inappropriate, noisy, or both.

Voice has an obvious failure mode in environments where audio output or input is unwelcome. But the subtler failure mode is in noisy environments where speech recognition degrades: construction sites, events, crowded transit. A modality that works in 80% of environments but fails catastrophically in the remaining 20% will generate a volume of support tickets that obscures the underlying design error.

Touch interfaces have their own environmental failures: gloves, wet hands, screen glare, physical incapacitation. Gesture interfaces fail when physical range of motion is restricted. Text input fails when hands are occupied.

The framework question isn't "which modality works best?" It's "which failure modes are acceptable for our specific user context distribution?"
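One way to make "user context distribution" concrete is to weight each modality's failure modes by how often sessions happen in the environments that trigger them. The sketch below assumes a team has already estimated those shares from research; every number and environment name here is a made-up placeholder, and `failure_share` and `uncovered_share` are hypothetical helper names.

```python
# Hypothetical share of sessions per environment. Real values would come
# from user research, not from this sketch.
CONTEXT_SHARE = {
    "home_quiet": 0.40,
    "open_office": 0.25,
    "public_transit": 0.20,
    "outdoors_gloves": 0.15,
}

# Hypothetical: environments in which each modality is assumed to fail.
FAILS_IN = {
    "voice": {"open_office", "public_transit"},
    "touch": {"outdoors_gloves"},
    "text": {"outdoors_gloves"},
}


def failure_share(modality: str) -> float:
    """Fraction of sessions that occur where the given modality fails."""
    return sum(
        share for env, share in CONTEXT_SHARE.items()
        if env in FAILS_IN[modality]
    )


def uncovered_share(modalities: list[str]) -> float:
    """Fraction of sessions where every one of the listed modalities fails."""
    return sum(
        share for env, share in CONTEXT_SHARE.items()
        if all(env in FAILS_IN[m] for m in modalities)
    )


for name in FAILS_IN:
    print(f"{name}: fails in {failure_share(name):.0%} of sessions")
print(f"voice + touch uncovered: {uncovered_share(['voice', 'touch']):.0%}")
```

With these placeholder numbers, voice alone fails in roughly 45% of sessions, while voice backed by touch leaves no environment uncovered. The point of the exercise is the comparison between modality sets, not the specific figures.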

4. Accessibility Requirements

The accessibility implications of modality choice are rarely surfaced at the design decision stage. They should be.

Voice-first interfaces create barriers for users with speech impairments, for users with hearing impairments when the interface relies on audio feedback, and for non-native speakers when language support is limited. Touch interfaces create barriers for users with motor impairments. Text-heavy interfaces create barriers for users with reading disabilities or low literacy.

The cognitive load research on design tools makes a related point about how the design environment shapes what designers notice — and accessibility issues are disproportionately invisible to designers who don't encounter them in their own daily use.

Building the accessibility analysis into the modality decision framework, rather than treating it as a post-hoc audit, means asking at decision time: "What percentage of our user base does this modality exclude, and what's the cost of that exclusion?"

Why the Framework Gets Skipped

The reason modality decisions happen by competitive imitation rather than framework is the same reason most design decisions happen without rigorous criteria: the framework requires admitting uncertainty about your users.

"We should add voice because competitors have it" is a confident statement. "Here's what we know about our user context distribution and task structure, which suggests voice would benefit approximately X% of sessions while creating friction for Y%" requires user research, environmental modeling, and the willingness to ship something narrower than your competitor's feature list.

The competitive imitation path is faster and feels more defensible in product reviews. The framework path produces better products.

What "Multimodal" Actually Means

The two CHI 2026 papers converge on something important: most successful implementations don't pick a single modality. They define a primary modality for the core use case and layered alternatives for contexts where the primary fails.

This is different from "multimodal" as it's usually described — "we support voice AND touch AND text." That framing implies equivalence. The framework approach implies hierarchy: voice is primary for hands-occupied ambient queries, touch is the fallback for environments where voice is inappropriate, text is available for precise inputs where accuracy matters more than convenience.

The product teams that get multimodal right think about fallback hierarchies, not feature checklists. The teams that get it wrong add modalities to reach competitive parity and then struggle to explain why the new modality has low adoption.
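A fallback hierarchy differs from a feature checklist in that it is ordered: the first modality whose conditions hold wins. The sketch below shows that idea for the hands-occupied ambient query case described above. The `Context` fields, the rule conditions, and the `pick_modality` function are assumptions made for illustration, not something taken from the CHI papers.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Context:
    audio_ok: bool             # speaking and listening are acceptable here
    hands_available: bool      # user can operate the screen
    needs_precise_input: bool  # accuracy matters more than convenience


Rule = tuple[str, Callable[[Context], bool]]

# Ordered hierarchy for an ambient-query use case: primary first, fallbacks after.
AMBIENT_QUERY_HIERARCHY: list[Rule] = [
    ("voice", lambda c: c.audio_ok and not c.needs_precise_input),
    ("touch", lambda c: c.hands_available and not c.needs_precise_input),
    ("text", lambda c: c.hands_available),
]


def pick_modality(ctx: Context,
                  hierarchy: list[Rule] = AMBIENT_QUERY_HIERARCHY) -> Optional[str]:
    """Return the first modality in the hierarchy usable in this context."""
    for name, usable in hierarchy:
        if usable(ctx):
            return name
    return None  # no modality works: a gap the design has to acknowledge


cooking = Context(audio_ok=True, hands_available=False, needs_precise_input=False)
open_office = Context(audio_ok=False, hands_available=True, needs_precise_input=False)

print(pick_modality(cooking))      # voice
print(pick_modality(open_office))  # touch
```

Returning `None` is deliberate in this sketch: a context where nothing works is a design finding, and silently defaulting to one modality would hide it.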

The Decision Before the Decision

The fintech team eventually rearchitected their onboarding flow. Voice got moved to post-setup: account status queries, quick balance checks, portfolio updates. The initial setup — the continuous, high-stakes, sensitive-data task — stayed touch-and-text.

Adoption of the voice features in the new context reached 23% of eligible sessions within four months.

The product didn't need more voice. It needed voice in the right place, for the right tasks, for the right user context. That's a different question than "should we have voice?" And it's a question a framework answers.

Photo: Quintessence UK (Pexels)