JSON Mode Lies. Your LLM Structured Outputs Are Failing Silently in Production.

Your LLM pipeline has been in production for six weeks. The evals pass. The demo is clean. Then a customer support ticket shows up: the AI-generated report is missing the risk_level field, and the downstream system silently defaulted it to null, which the compliance team just discovered means "no risk identified."
That field was in your schema. The model just didn't include it.
JSON mode is not the same as structured outputs. Most teams don't know the difference until something breaks quietly in production, usually on a schema complex enough that no single eval caught the failure path.
What JSON Mode Actually Promises
OpenAI introduced JSON mode in late 2023 with a clean guarantee: the model will only produce valid JSON. That's it. Parseable JSON — not JSON that conforms to your specific schema.
This distinction sounds pedantic until you're writing a Python parser that does response["category"] and gets a KeyError at 2 AM, or — worse — gets None because you wrote response.get("category") like a careful developer, and now every downstream decision that depends on category is running on null data.
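To make the gap concrete, here is a minimal sketch using only the standard library. The payload and the field names (summary, category, risk_level) are invented for illustration. The point is that parsing succeeds on a response that is missing a required field, and .get() hides the gap unless you check for it explicitly.

```python
import json

# Hypothetical model response: valid JSON, but "risk_level" is missing.
raw = '{"summary": "Quarterly exposure review", "category": "finance"}'

# Illustrative requirement, not taken from any real system.
REQUIRED_FIELDS = ("summary", "category", "risk_level")

def missing_fields(payload: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required keys that are absent from the payload."""
    return [name for name in required if name not in payload]

response = json.loads(raw)          # parses fine: this is all JSON mode promises
risk = response.get("risk_level")   # None, and no error is raised
gaps = missing_fields(response)     # the explicit check surfaces the gap

print(risk)   # None
print(gaps)   # ['risk_level']
```

Nothing in the parse step fails, so the omission only becomes visible if something downstream goes looking for it.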
Anthropic's tool use (function calling) pattern is closer to schema-enforced: the model is steered toward a response matching a defined tool input schema, but how strictly that holds depends on how the provider implements it under the hood. OpenAI's "Structured Outputs" (released August 2024) brought actual constrained decoding to the API for a specific subset of JSON Schema: required fields, enum types, object shapes. The subset has limits that matter. Composition keywords such as allOf, not, and if/then/else are rejected, anyOf cannot sit at the root of the schema, every property must be listed as required, and additionalProperties must be false. Real production schemas break those rules routinely.
The promise narrowed significantly in the fine print.
The Failure Modes Nobody Runs Evals For
Field omission is the most common failure and the hardest to catch. A model that's been asked to return a five-field object will sometimes return a three-field object, usually keeping the three fields it treats as most important, by whatever semantic weight it assigns them. The optional-looking field your schema marks required is the one that disappears.
Type coercion failures are the subtler problem. The model returns "45" where your schema says integer. Your JSON parser accepts it. Your code's type checker, if you have one, flags it as a warning rather than an error. Your database ingestion silently coerces it or drops the row.
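A strict type check is short to write, and it is exactly what lenient layers skip. The sketch below uses only the standard library; the payload and its field names are hypothetical. Note the explicit bool exclusion, since Python treats True as an int.

```python
import json

def is_strict_int(value) -> bool:
    """True only for real integers; rejects "45", 45.0, and booleans."""
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

# Hypothetical model response where "age" came back as a string.
payload = json.loads('{"age": "45", "retries": 3, "active": true}')

checks = {key: is_strict_int(value) for key, value in payload.items()}
coerced = int(payload["age"])  # lenient coercion succeeds and hides the mismatch

print(checks)   # {'age': False, 'retries': True, 'active': False}
print(coerced)  # 45
```

The lenient int() call is the silent path: it produces a plausible number and leaves no trace that the model broke the contract.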
Enum constraint violations happen under distribution shift — when the input text contains a value that's close to, but not exactly, one of your schema's enum members. A model that correctly returns "APPROVED" on your test cases returns "Approved" on a production case with slightly different phrasing. Downstream case-sensitive comparison fails.
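One way to handle this is to separate near misses from genuinely invalid values, so you can count the first kind instead of silently dropping them. This is a sketch under assumptions: the enum members and the normalization rule (trim, uppercase, spaces and hyphens to underscores) are illustrative choices, not a standard.

```python
# Illustrative enum; substitute the members from your own schema.
ALLOWED_STATUS = frozenset({"APPROVED", "REJECTED", "HIGH_RISK"})

def check_enum(value, allowed=ALLOWED_STATUS) -> str:
    """Classify a value as 'ok', 'near_miss', or 'invalid'."""
    if not isinstance(value, str):
        return "invalid"
    if value in allowed:
        return "ok"
    # Assumed normalization: trim, uppercase, unify separators.
    normalized = value.strip().upper().replace(" ", "_").replace("-", "_")
    if normalized in allowed:
        return "near_miss"
    return "invalid"

print(check_enum("APPROVED"))   # ok
print(check_enum("Approved"))   # near_miss
print(check_enum("High Risk"))  # near_miss
print(check_enum("Accepted"))   # invalid
```

Whether you repair a near miss or reject it is a policy decision. Either way, log it, because a rising near-miss rate is an early sign of distribution shift.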
Nested object collapse is the least understood mode. A schema with deeply nested required fields gets "flattened" by the model: it returns a shallower structure that's valid JSON but doesn't match the schema. This shows up most often on schemas nested more than three levels deep.
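Flattening is easy to detect if you check required paths rather than top-level keys. A minimal sketch, assuming a hypothetical report schema nested four levels deep; the path names and both payloads are made up for illustration.

```python
def missing_paths(payload, required_paths):
    """Return the dotted paths that cannot be resolved in the payload."""
    missing = []
    for path in required_paths:
        node = payload
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing

# Hypothetical required paths for a nested schema.
REQUIRED_PATHS = [
    "report.title",
    "report.assessment.risk.level",
    "report.assessment.risk.score",
]

# What the schema asks for.
nested = {"report": {"title": "Q3 review",
                     "assessment": {"risk": {"level": "HIGH", "score": 0.9}}}}
# What a "flattened" response can look like: valid JSON, wrong shape.
flattened = {"report": {"title": "Q3 review",
                        "risk_level": "HIGH", "risk_score": 0.9}}

print(missing_paths(nested, REQUIRED_PATHS))     # []
print(missing_paths(flattened, REQUIRED_PATHS))
# ['report.assessment.risk.level', 'report.assessment.risk.score']
```

The flattened response still carries the information, which is why it passes a content-quality eval while failing the schema.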
None of these failures surface in simple eval sets. They emerge under distribution shift, complex schemas, and the specific inputs that weren't in your golden set.
Why Your Evals Don't Catch This
Most LLM eval frameworks test for content quality, not schema validity. You're checking whether the model's output is "good" — does it answer the question, is the sentiment correct, does the reasoning hold. Schema compliance is treated as a pre-condition that the provider guarantees.
A Berkeley evaluation of structured output reliability across six popular LLMs (published in late 2024) found failure rates of 5–18% on schemas with four or more required fields, nested objects, and enum constraints. On simpler schemas — two or three flat required fields — failure rates were under 2%.
Those numbers don't sound alarming until you multiply them by volume. A pipeline processing 10,000 requests per day with a 3% structured output failure rate generates 300 silent failures daily. If each failure means missing data, corrupted records, or a default value that changes a decision, the damage compounds before anyone notices.
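The arithmetic is worth running for your own traffic. The helper below is a trivial sketch; the function name is made up, and the six-week window simply mirrors the scenario at the top of this article.

```python
def silent_failures(requests_per_day: int, failure_rate: float, days: int = 1) -> int:
    """Expected number of malformed responses over a period, in whole requests."""
    return round(requests_per_day * failure_rate * days)

per_day = silent_failures(10_000, 0.03)                 # 300
per_six_weeks = silent_failures(10_000, 0.03, days=42)  # 12600

# The same volume at the 5% and 18% ends of the reported range.
low = silent_failures(10_000, 0.05)    # 500
high = silent_failures(10_000, 0.18)   # 1800

print(per_day, per_six_weeks, low, high)
```

At 3%, six weeks of unnoticed failures means 12,600 records that need to be found and repaired.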
The other eval problem: test inputs usually look like the typical, well-formed cases you had in mind when you built the golden set. The failures happen at the edges: unusual input phrasing, long context, edge cases in the content being structured. Your golden set caught the easy cases. Production is the rest of them.
Constrained Decoding Is the Actual Fix
The only architectural solution that works is constrained decoding at inference time. Instead of generating free-form text and hoping it matches your schema, constrained decoding restricts the token sequence the model can produce to only sequences that are valid according to the schema.
Outlines (dottxt-ai/outlines) is the most widely used open-source implementation. It builds a finite-state machine from your JSON Schema and applies it during generation — each token is selected from the intersection of what the model wants to output and what the schema allows. You can't get a missing field because the generation physically cannot end without producing it.
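The mechanism is easier to see in a toy version. The sketch below is not the Outlines API. It is a hand-written state machine for one tiny schema, {"status": "APPROVED" | "REJECTED"}, with a fake scoring function standing in for model logits. Real implementations compile the machine from the schema and work over the tokenizer's full vocabulary. Every name here is hypothetical.

```python
import json

# Toy finite-state machine: each state maps its allowed tokens to the next state.
FSM = {
    0: {"{": 1},
    1: {'"status"': 2},
    2: {":": 3},
    3: {'"APPROVED"': 4, '"REJECTED"': 4},
    4: {"}": 5},
}
FINAL_STATE = 5

def fake_model_scores(prefix):
    """Stand-in for logits: this 'model' prefers closing early and a wrong-case enum."""
    return {
        "{": 1.0, "}": 3.0, '"status"': 2.0, ":": 1.5,
        '"Approved"': 2.5, '"APPROVED"': 2.0, '"REJECTED"': 0.5,
    }

def constrained_decode(score_fn, fsm=FSM, final_state=FINAL_STATE) -> str:
    state, tokens = 0, []
    while state != final_state:
        scores = score_fn(tokens)
        allowed = fsm[state]
        # Mask: consider only tokens the schema permits here, then pick the best.
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        tokens.append(best)
        state = allowed[best]
    return "".join(tokens)

output = constrained_decode(fake_model_scores)
print(output)              # {"status":"APPROVED"}
print(json.loads(output))  # {'status': 'APPROVED'}
```

Left unconstrained, this fake model's top-scoring token is the closing brace, and its preferred enum value is "Approved". The mask removes both options before selection, which is why the field cannot be skipped and the casing cannot drift.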
Guidance (Microsoft) takes a similar approach. llama.cpp has native grammar support that achieves the same effect for self-hosted models.
The tradeoff: constrained decoding adds latency. For typical schemas, the overhead is 5–15% on generation time. For deeply complex schemas with many branching enum values, it can run higher. In production, you benchmark it and decide whether the correctness guarantee is worth it — for most compliance-adjacent use cases, it is.
If you're using hosted APIs (OpenAI, Anthropic, Google) and can't apply constrained decoding yourself, you're dependent on the provider's implementation. OpenAI's Structured Outputs feature applies constrained decoding server-side for supported schema features. For schemas that use unsupported features, you're back to probabilistic compliance.
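A cheap safeguard is to scan your schema for flagged keywords before you rely on strict mode. The sketch below is a heuristic, not a validator. The keyword set is an assumption you should replace with the list in your provider's current documentation, and the function name is made up.

```python
# Assumption: keywords your provider's strict mode rejects. Verify against its docs.
ASSUMED_UNSUPPORTED = frozenset({"oneOf", "allOf", "not", "if", "then", "else"})

def find_keywords(schema, keywords, path="$"):
    """Recursively collect (path, keyword) pairs for flagged keywords in a schema."""
    hits = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key in keywords:
                hits.append((path, key))
            if key == "properties" and isinstance(value, dict):
                # Property names are data, not keywords: go straight to their schemas.
                for name, sub in value.items():
                    hits.extend(find_keywords(sub, keywords, f"{path}.properties.{name}"))
            else:
                hits.extend(find_keywords(value, keywords, f"{path}.{key}"))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            hits.extend(find_keywords(item, keywords, f"{path}[{index}]"))
    return hits

# Hypothetical schema with one flagged keyword.
schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["APPROVED", "REJECTED"]},
        "detail": {"oneOf": [{"type": "string"}, {"type": "null"}]},
    },
    "required": ["status", "detail"],
}

hits = find_keywords(schema, ASSUMED_UNSUPPORTED)
print(hits)  # [('$.properties.detail', 'oneOf')]
```

Run a check like this in CI, so that a schema change that quietly moves you out of the enforced subset fails a build instead of a production request.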
What This Costs When You Get It Wrong
The real cost of silent structured output failures isn't debugging time — it's the downstream decisions made on corrupted data.
A recommendation system that returns confidence: 0.87 in one record and omits confidence in another doesn't crash. The omitting record gets a default. The default is wrong. The recommendation changes. Nobody traced it to the JSON field.
A document classification pipeline that occasionally returns "High Risk" instead of the schema-required "HIGH_RISK" keeps running. The downstream risk dashboard that expects exact enum matches silently drops those records from the high-risk view. The compliance team is looking at an incomplete picture.
LLM pipeline failures in general get documented and categorized, but the structured output category is consistently underestimated because these failures are silent by design. JSON that parses counts as "working" by most monitoring definitions.
The Production-Ready Pattern
For any schema where field completeness is load-bearing: use constrained decoding if you control the inference, and OpenAI Structured Outputs if you're on their API and your schema is within the supported subset. Either way, validate every response explicitly with a schema validation library (Pydantic, Zod, jsonschema) before it enters your pipeline. Don't trust that the upstream layer enforced it.
For monitoring: track field presence rates per schema field in your telemetry. A field that's present in 97% of responses is failing 3% of the time. Name it, set an alert on it, and trace the failure patterns before they compound.
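A presence tracker can be very small. The sketch below keeps counts in memory with the standard library. In a real system you would emit these as metrics to whatever telemetry stack you run. The class name, field names, and the 99% threshold are all illustrative assumptions, and null values are counted as absent on purpose.

```python
from collections import Counter

class FieldPresenceTracker:
    """Counts how often each expected field shows up with a non-null value."""

    def __init__(self, fields):
        self.fields = tuple(fields)
        self.total = 0
        self.present = Counter()

    def observe(self, payload: dict) -> None:
        self.total += 1
        for name in self.fields:
            # A null counts as absent: it is what a silent default looks like.
            if payload.get(name) is not None:
                self.present[name] += 1

    def rates(self) -> dict:
        if self.total == 0:
            return {}
        return {name: self.present[name] / self.total for name in self.fields}

    def below(self, threshold: float) -> list:
        """Fields whose presence rate is under the alert threshold."""
        return [name for name, rate in self.rates().items() if rate < threshold]

tracker = FieldPresenceTracker(["summary", "risk_level"])
for i in range(100):
    payload = {"summary": "ok", "risk_level": "LOW"}
    if i < 3:
        del payload["risk_level"]   # simulate three omissions in a hundred
    tracker.observe(payload)

print(tracker.rates())       # {'summary': 1.0, 'risk_level': 0.97}
print(tracker.below(0.99))   # ['risk_level']
```

Tracking per field rather than per response is the design choice that matters: an aggregate "valid response" rate hides which field is disappearing.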
JSON mode was a useful abstraction for demos. Production schemas deserve the actual guarantee.