Confidence Is a Design Token. Your Design System Doesn't Know That Yet.

Pick up any major design system documentation. Polaris. Carbon. Material Design 3. Atlassian. Open the token section. You'll find color primitives, type scales, spacing, elevation, motion curves. You'll find component states: default, hover, focus, disabled, error.

You won't find a token for we're not sure.

Every output a product generates carries a confidence level. A search result ranked first might have a relevance score of 97%. Or 53%. They look identical. A language model completing a sentence might be drawing on dense training signal or pattern-matching on thin evidence. The text renders the same either way. A database query might return a real-time verified record or a six-hour-old cached fallback from a degraded upstream service. Both display with the same font weight, same opacity, same visual authority.

Design systems have made a quiet collective decision: confidence is not their problem. The assumption embedded in nearly every component library is that data arrives ready — valid, complete, trustworthy — and the system's job is to present it cleanly. That assumption has always been partially wrong. In a product landscape filling up with AI-generated outputs, it's becoming catastrophic.

Why Design Systems Don't Specify Uncertainty

I've reviewed more than twenty design systems in the past two years, including Polaris, Carbon, Material Design 3, Atlassian, Primer (GitHub), Radix, Chakra, Ant Design, Base Web, and Paste (Twilio). Every one specifies error states. They handle the hard binary: success or failure, loading or loaded, valid or invalid. They handle states that are defined: present or absent, triggered or not.

What none of them specify is the gradient between "definitely right" and "probably wrong."

IBM's Carbon is the closest to an exception. It has data quality indicators — a Tag component with a warning state, InlineNotification patterns for flagging uncertain data. But these are editorial additions, not systematic. There is no confidence-level semantic baked into the token architecture. Carbon doesn't say: when a value has a confidence level below 70%, reduce opacity to 0.65 and switch to the lighter type weight. That rule has to be invented by each product team independently, without any shared specification to anchor from.

The result is that products make inconsistent local decisions — or make no decision at all and render everything at full visual authority by default. The visual presentation is universally overconfident, not because any designer consciously chose overconfidence, but because no design system told them confidence was a thing they needed to decide about.

Loading States Are Lying About Result Quality

The loading state problem is where the gap between design system specification and reality becomes most concrete.

Most loading patterns collapse four meaningfully different states into one signal. When you show a spinner or a skeleton screen, you communicate: something is happening. But "something is happening" covers:

Fetching: request sent, waiting on the network.

Processing: data received, computation underway.

Uncertain: timeout hit, serving stale or partial data, upstream degraded.

Validating: data received, running integrity checks before rendering.

A skeleton screen says nothing about which of these is true. If the state is "fetching from a fast edge CDN, completes in 120ms," the skeleton is accurate — it previews what's coming. If the state is "upstream timeout, falling back to a cache from four hours ago," the skeleton is lying. It's presenting a confidence-neutral loading pattern for data that, when it arrives, should be visually marked as low-confidence.
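One way to keep those four states apart is to model them as distinct values instead of a single loading flag. The following is a minimal sketch, not a pattern taken from any existing design system: the `LoadState` type, the `reason` values, and the `skeleton-muted` placeholder name are all hypothetical.

```typescript
// Hypothetical load-state model: the four states stay distinct
// instead of collapsing into one `isLoading` boolean.
type LoadState =
  | { kind: "fetching" }    // request sent, waiting on the network
  | { kind: "processing" }  // data received, computation underway
  | { kind: "validating" }  // data received, integrity checks running
  | {
      kind: "uncertain";    // timeout hit, stale or partial data, upstream degraded
      reason: "timeout" | "stale-cache" | "partial" | "upstream-degraded";
      staleForMs?: number;  // age of the fallback data, when known
    };

// Assumed placeholder variants; "skeleton-muted" is an invented name.
type Placeholder = "skeleton" | "skeleton-muted";

// Pick a placeholder that does not overstate the quality of what is coming.
function placeholderFor(state: LoadState): Placeholder {
  switch (state.kind) {
    case "fetching":
    case "processing":
    case "validating":
      return "skeleton";
    case "uncertain":
      return "skeleton-muted";
  }
}

// Whether the result, once it arrives, should get low-confidence styling.
function arrivesLowConfidence(state: LoadState): boolean {
  return state.kind === "uncertain";
}
```

With a model like this, the four-hour-old cache fallback becomes `{ kind: "uncertain", reason: "stale-cache" }`, and the component has something concrete to branch on before the data renders.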

The consequence is that users can't develop accurate mental models of data freshness. Every result feels equally current, equally authoritative. Research on cognitive load in design shows that users offload calibration work onto visual presentation — when the interface looks confident, users read the content as confident. Skeleton screens make that worse by encoding a "normal load" signal regardless of underlying quality.

Weather apps demonstrate this failure plainly: they show the same skeleton → loaded transition for a reading that's real-time from a nearby station and a reading interpolated from the nearest available sensor 40 miles away. Both load with identical visual weight. Both look equally authoritative. The interface has made a claim about data quality that the data doesn't support.

What Confidence Tokens Actually Look Like

A workable confidence token system has three layers, applied in order of perceptual subtlety.

Layer 1: Saturation and opacity. Confident values: full color, 100% opacity. Uncertain values: −30% saturation, 0.65 opacity. This operates as a passive signal — users don't have to read anything, they perceive the difference. A search result grid where the top-ranked items are full color and lower-ranked results are visually muted communicates quality hierarchy without a single label change. The signal is embedded in the presentation, not added as supplementary text that users will skip.

Layer 2: Typography weight. Confident values render at the standard display weight: 600 for headings, 400 for body. Uncertain values drop to the next lighter weight in the scale: 400 for headings, 300 for body. Combined with the opacity layer, this creates a perceptible gradient without requiring the user to parse additional UI chrome.

Layer 3: Iconographic supplementation. For values below a threshold (below 60%, say), surface the uncertainty explicitly with an info icon that reveals context on hover or tap. Use a neutral information indicator rather than a warning icon, which signals error rather than uncertainty. The revealed text might read "This result has a lower relevance score" or "This value was last verified 6 hours ago." The tooltip layer also handles accessibility, ensuring the signal is available without relying solely on color or weight.
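The three layers can be sketched as a single lookup from a confidence score to presentation values. This is an illustration under assumptions, not a published spec: the token names and the `tokensFor` and `bodyStyle` helpers are hypothetical, the 0.7 cut-off is borrowed from the Carbon example earlier in this piece, and the 0.6 cut-off is the "say" threshold from Layer 3.

```typescript
// Sketch of confidence tokens. Values come from the three layers;
// the names and both cut-offs are illustrative assumptions.
interface ConfidenceTokens {
  opacity: number;          // Layer 1
  saturationScale: number;  // Layer 1: 0.7 means 30% less saturated
  headingWeight: number;    // Layer 2
  bodyWeight: number;       // Layer 2
  showInfoIcon: boolean;    // Layer 3
}

const UNCERTAIN_BELOW = 0.7; // assumed threshold for muted presentation
const INFO_ICON_BELOW = 0.6; // assumed threshold for the explicit icon

function tokensFor(confidence: number): ConfidenceTokens {
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    throw new RangeError(`confidence must be in [0, 1], got ${confidence}`);
  }
  const uncertain = confidence < UNCERTAIN_BELOW;
  return {
    opacity: uncertain ? 0.65 : 1,
    saturationScale: uncertain ? 0.7 : 1,
    headingWeight: uncertain ? 400 : 600,
    bodyWeight: uncertain ? 300 : 400,
    showInfoIcon: confidence < INFO_ICON_BELOW,
  };
}

// Translate tokens into inline CSS properties for a body-text element.
function bodyStyle(t: ConfidenceTokens): Record<string, string> {
  return {
    opacity: String(t.opacity),
    filter: `saturate(${t.saturationScale})`,
    fontWeight: String(t.bodyWeight),
  };
}
```

Under these assumptions, the two search results from the opening stop looking identical: `tokensFor(0.97)` keeps full opacity and standard weights, while `tokensFor(0.53)` mutes the result and turns on the info icon. A real system would probably want more than two bands; the binary split here only mirrors the values given above.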

Specifying this system requires a three-column token table most design systems don't have: the confidence range, the opacity and saturation values, and the type weight. Writing the table takes a day. Getting engineers to expose confidence scores from APIs — not binary valid/invalid flags, but actual scores — is the larger work. Most API responses don't surface confidence data at all. Building this system requires an explicit conversation with the engineering layer: we're designing components that consume confidence values; what can you expose?
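To make that conversation with engineering concrete, it can help to bring a proposed response shape. The sketch below is one hypothetical envelope, not an existing API: `Scored`, `Flagged`, `fromFlagged`, and `confidenceOrDefault` are invented names, and the 0.5 fallback for unscored values is an assumption a team would need to set for itself.

```typescript
// Hypothetical envelope: every value travels with a score.
interface Scored<T> {
  value: T;
  confidence: number | null; // 0..1, or null when the service cannot score
  verifiedAt?: string;       // ISO 8601 timestamp, when known
}

// The shape many endpoints return today: a binary flag only.
interface Flagged<T> {
  value: T;
  valid: boolean;
}

// A binary flag carries no gradient, so do not invent one.
// Mark the value as unscored and let the presentation layer decide.
function fromFlagged<T>(legacy: Flagged<T>): Scored<T> {
  return { value: legacy.value, confidence: null };
}

// Treat unscored values as uncertain rather than silently confident.
// The 0.5 default is an assumption, not a recommendation.
function confidenceOrDefault<T>(scored: Scored<T>, fallback = 0.5): number {
  return scored.confidence ?? fallback;
}
```

The design choice worth arguing over is the fallback. Defaulting unscored data to full confidence reproduces the current behavior; defaulting it to something lower makes the absence of a score visible, which is the point.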

That's a larger organizational change than adding a token to a JSON file, which is partly why it hasn't happened. But describing something as hard is not the same as describing it as someone else's problem.

AI Interfaces Have a Design Confidence Problem Right Now

Every AI-powered surface shipping today is a confidence crisis in slow motion.

When a language model generates a response, the visual presentation gives the user no signal about how grounded the answer is. Same font, same container, same visual authority — whether the model is drawing on solid signal or confabulating from sparse evidence. "This is established fact" and "the model's best guess" are identical display states. The interface treats them as equivalent because the design system never drew a distinction.

The medical symptom checker case makes the stakes concrete: an interface that renders "likely seasonal allergies" and "possible early-stage lymphoma" in identical typography, with identical visual weight and layout, has failed. Not because of a content moderation failure — because of a design system failure. There's no hedged presentation mode. There's no component that says: render this with appropriate visual uncertainty markers. The design system didn't specify it, so the interface can't use it.

OpenAI added a caveat text convention — "I'm not certain, but..." — but that's a content-layer patch, not a design system solution. The caveat text is formatted identically to the confident assertions around it. It's more words in the same confident-looking container. The hedging exists in the content; the visual layer still signals uniform authority.

This matters more now than it did three years ago because AI-generated outputs are appearing in clinical interfaces, legal research tools, financial products, and medical information services. The design systems going into those products never contemplated a distinction between "this output has a high confidence score" and "this is a best-effort guess." Their silence on the question isn't neutral. It's a decision — and the default it produces is wrong.

The Spec Creates the Requirement

The reframe I'd push: confidence signaling is not an engineering problem that needs to be solved before it becomes a design problem. It's a design problem that needs to be specified before engineering will surface the necessary data.

Every design system team can define what low-confidence presentation looks like right now — the token values, the component variants, the iconographic language. Once that spec exists, it becomes a concrete requirement on the API layer. "We have components that consume confidence scores; this endpoint needs to expose them." The spec creates the requirement. This is how design systems have always expanded engineering scope — by making design possibilities concrete enough that building them is a clear ask rather than an abstract value judgment.

The reason we don't have confidence tokens isn't that it's technically hard. It's that design systems are built by people who've inherited a mental model where the system's job starts when data arrives. It doesn't. It starts at the question of what we're allowed to claim — and right now, the default answer is: everything, always, at full visual authority.

That's not a design position. It's an oversight that's been institutionalized.

The gap between "we're confident" and "we're guessing" is a visual design problem. Nobody has owned it yet.


Photo by Egor Komarov on Pexels