Hand-writing a JSON Schema for a 300-line API response is the kind of task that gets put off forever. The mechanical answer is "infer from sample data" — paste the response, get back a schema. The cleaner answer is "infer from multiple samples" — paste five real responses, get back a schema that actually distinguishes required fields from optional ones, catches enums, and pulls repeated subtrees into $defs. This guide walks the inference pipeline, the draft-version differences that bite you in production, and the constraints that data alone can never tell you.
Single Sample vs Multi-Sample
Given one sample:
{ "id": "u_123", "name": "Ada", "email": "ada@example.com" }
a single-sample inference gives you id, name, email — all required, all string. The problem: you have no idea if any of these can be missing. You have no idea if there are fields like phone or created_at that appear in some responses and not others. You have no idea if id actually has a structured format you should pin (^u_[0-9]+$).
Multi-sample inference solves these one at a time:
- Required vs optional comes from the intersection of keys across samples. A field is required only if it appears in every sample.
- Optional fields come from the union — anything that appeared in any sample is in the schema, but only required if it appeared in all.
- Enum candidates come from finite low-cardinality observed value sets — if a field only ever holds one of 4 values across 50 samples, it's probably an enum.
- Format candidates require every observed value to match the strict spec — one outlier and the format keyword is dropped.
The JSON Schema Generator does multi-sample inference by default. Drop in 5–10 real API responses and the output is meaningfully better than what you'd hand-write in an hour.
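The enum-candidate rule from the list above can be sketched as follows. This is a minimal illustration, not the generator's actual code: the function name enum_candidate and the thresholds max_values and min_samples are assumptions chosen for the example.

```python
from __future__ import annotations

from typing import Optional


def enum_candidate(
    values: list[str], max_values: int = 8, min_samples: int = 20
) -> Optional[list[str]]:
    """Return sorted enum values if the field looks enum-like, else None.

    Both thresholds are hypothetical defaults for this sketch.
    """
    if len(values) < min_samples:
        return None  # too few observations to say anything
    distinct = sorted(set(values))
    if len(distinct) > max_values:
        return None  # high cardinality: treat as a free-form string
    return distinct


# 50 observations holding only 4 distinct values
statuses = ["active", "pending", "suspended", "deleted"] * 12 + ["active", "pending"]
# 50 observations, all distinct
names = [f"user_{i}" for i in range(50)]

print(enum_candidate(statuses))
print(enum_candidate(names))
```

A result here is only a candidate. Whether the four values are the complete set is something the samples cannot confirm.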
The Required-Field Rule, Step by Step
| Sample | Has id | Has name | Has email | Has phone |
|---|---|---|---|---|
| 1 | yes | yes | yes | no |
| 2 | yes | yes | yes | yes |
| 3 | yes | yes | no | no |
The output schema marks id and name as required. email and phone are properties but not required. This is the intersection rule applied literally. If you only had sample 1, you'd mistakenly mark email required and never know phone exists.
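As a rough sketch of the intersection rule, assuming each sample is already parsed into a dict: the function name infer_object_keys and the sample values are made up for illustration, and value types are left out to keep the focus on required vs optional.

```python
from __future__ import annotations


def infer_object_keys(samples: list[dict]) -> dict:
    """Union of keys becomes `properties`; intersection becomes `required`."""
    if not samples:
        raise ValueError("need at least one sample")
    key_sets = [set(sample) for sample in samples]
    all_keys = set().union(*key_sets)
    required = set.intersection(*key_sets)
    return {
        "type": "object",
        # Per-property type inference is omitted in this sketch.
        "properties": {key: {} for key in sorted(all_keys)},
        "required": sorted(required),
    }


# Mirrors the three rows of the table above (values are illustrative).
samples = [
    {"id": "u_1", "name": "Ada", "email": "ada@example.com"},
    {"id": "u_2", "name": "Bob", "email": "bob@example.com", "phone": "555-0100"},
    {"id": "u_3", "name": "Cy"},
]

schema = infer_object_keys(samples)
print(schema["required"])
print(sorted(schema["properties"]))
```

Running the same function on the first sample alone marks email as required and never lists phone, which is the single-sample failure described above.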
The flip side: add a sample with a real corner case. A user who registered before email verification was required will have a record with no email. One such sample in the inference set saves you from a schema that fails validation on legitimate data in production.
Format Detection (Conservative on Purpose)
JSON Schema has built-in format keywords: date-time, date, time, email, uuid, ipv4, ipv6, uri, hostname. A naive inference would slap format: email on any string that looks email-shaped. The conservative rule, which the JSON Schema Generator uses:
Emit format: X only if every observed value in the field passes the strict spec for X. One outlier disables the keyword.
The reason is round-trip safety. If you infer format: email, then run AJV validation on the same data, and one of the samples has a malformed email — your validator now fails on the same data you inferred from. Better to under-detect than to silently break.
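A minimal sketch of the all-or-nothing rule, covering only two formats. The checker functions and the detect_format name are assumptions for this example. The checks are deliberately simple (a regex for uuid, a regex plus calendar validation for date) and are not a substitute for a validator's own format implementations.

```python
from __future__ import annotations

import re
from datetime import datetime
from typing import Optional

_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_uuid(value: str) -> bool:
    return _UUID.fullmatch(value) is not None


def is_date(value: str) -> bool:
    if _DATE.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects dates like 2023-02-30
    except ValueError:
        return False
    return True


CHECKERS = {"uuid": is_uuid, "date": is_date}


def detect_format(values: list[str]) -> Optional[str]:
    """Return a format name only if every observed value passes its checker."""
    if not values:
        return None
    for name, check in CHECKERS.items():
        if all(check(v) for v in values):
            return name
    return None  # a single outlier lands here


clean = [
    "123e4567-e89b-12d3-a456-426614174000",
    "00000000-0000-0000-0000-000000000000",
]
print(detect_format(clean))
print(detect_format(clean + ["not-a-uuid"]))
```

Because the rule is "all values or nothing", any schema this produces will accept the samples it was inferred from, at least as far as format is concerned.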
Practical formats that hold up well in real data:
- uuid: strict 8-4-4-4-12 hex; rarely has false positives
- date-time: RFC 3339, a profile of ISO 8601; check that every value carries both a date and a time component, normally joined by a T separator
- email: RFC 5322 simplified; common in user profiles but watch for plus-addressed and IDN cases
- uri: RFC 3986; check that every value parses; surprisingly common to have one malformed URL in the wild
Formats that are usually too aggressive at sample sizes under 50:
- hostname: many string fields look hostname-shaped but aren't
- ipv4 / ipv6: rare enough that you usually know in advance
Extracting $defs
When the same shape appears in multiple places, inline duplication makes the schema painful to maintain. Promote it to $defs:
{
"$defs": {
"Address": { "type": "object", "properties": { ... } }
},
"properties": {
"shipping_address": { "$ref": "#/$defs/Address" },
"billing_address": { "$ref": "#/$defs/Address" }
}
}
The inference rule: hash the canonical form of each subtree. If a hash appears 2 or more times (the default threshold), extract it. Below 2, inline it — the maintenance cost of extracting a one-use $def outweighs the readability benefit.
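The hash-and-count step might look roughly like this. It is a sketch under several assumptions: only object-typed subschemas reachable through properties are considered (array items are not walked), every subschema is a dict, and extracted definitions get placeholder names such as Shape1, since choosing a name like Address needs heuristics that are out of scope here.

```python
from __future__ import annotations

import hashlib
import json
from collections import Counter


def _digest(node: dict) -> str:
    """Hash the canonical (sorted-key, compact) JSON form of a subtree."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def _object_subtrees(node: dict):
    """Yield every object-typed subschema nested under `properties`."""
    for child in node.get("properties", {}).values():
        if child.get("type") == "object":
            yield child
        yield from _object_subtrees(child)


def extract_defs(schema: dict, threshold: int = 2) -> dict:
    counts = Counter(_digest(sub) for sub in _object_subtrees(schema))
    names: dict[str, str] = {}  # digest -> placeholder definition name
    defs: dict[str, dict] = {}

    def rewrite(node: dict) -> dict:
        out = dict(node)
        if "properties" in node:
            out["properties"] = {k: replace(v) for k, v in node["properties"].items()}
        return out

    def replace(child: dict) -> dict:
        if child.get("type") == "object":
            digest = _digest(child)
            if counts[digest] >= threshold:
                if digest not in names:
                    names[digest] = f"Shape{len(names) + 1}"
                    defs[names[digest]] = child
                return {"$ref": f"#/$defs/{names[digest]}"}
        return rewrite(child)  # below threshold: keep inline

    result = rewrite(schema)
    if defs:
        result["$defs"] = defs
    return result


address = {
    "type": "object",
    "properties": {"street": {"type": "string"}, "city": {"type": "string"}},
}
schema = {
    "type": "object",
    "properties": {
        "shipping_address": address,
        "billing_address": json.loads(json.dumps(address)),  # equal shape, separate copy
        "name": {"type": "string"},
    },
}
print(json.dumps(extract_defs(schema), indent=2))
```

Sorting keys before hashing is what makes two subtrees with the same content but different key order count as the same shape.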
The threshold is a tradeoff:
| Threshold | Effect |
|---|---|
| 2 (default) | Catches most real duplication; small schemas may get over-extracted |
| 3 | Cleaner inline-vs-extracted balance for large schemas |
| 1 (every reusable subtree) | Maximum reuse, maximum maintenance burden |
For schemas you'll hand-edit afterward, threshold 3 is usually nicer. For schemas that will be regenerated from samples on every API change, threshold 2 keeps the diff readable.
Draft Differences You Actually Hit
JSON Schema has shipped five major drafts that you'll see in real codebases: 4, 6, 7, 2019-09, and 2020-12. The differences that matter when generating schemas from samples:
- prefixItems (2020-12) replaces the tuple form of items. Draft 7 wrote [{ ... }, { ... }, { ... }] to constrain a fixed-position array. Draft 2020-12 uses prefixItems: [...] and reserves items for the "rest" schema. If you generate Draft 7 and run it through a 2020-12 validator, tuples silently degrade.
- unevaluatedProperties (2019-09+) lets you forbid properties not validated by any sub-schema. Doesn't exist in Draft 7. Generators that target Draft 7 lose this expressiveness.
- $dynamicRef (2020-12) replaces the older $recursiveRef for self-referential schemas. Library support is uneven: AJV supports it, but some older toolchains don't.
- $schema keyword position. Always at the root; some validators emit warnings if it's in a sub-schema even though the spec allows it.
- exclusiveMinimum / exclusiveMaximum changed from boolean (Draft 4) to number (Draft 6+). A Draft 4 schema with { "minimum": 0, "exclusiveMinimum": true } doesn't parse against Draft 6+.
For new work in 2026, default to Draft 2020-12. For interop with older tooling (OpenAPI 3.0 uses a Draft 4-derived subset), target what your toolchain requires and pin it explicitly with $schema.
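Two of the keyword changes above are mechanical enough to sketch as rewrites of a single schema node. The function names are hypothetical, and neither function recurses into nested subschemas, which a real migration would need to do.

```python
def upgrade_tuple_keywords(node: dict) -> dict:
    """Draft 7 array-form `items` -> 2020-12 `prefixItems`.

    In 2020-12, `items` takes over the role Draft 7 gave to `additionalItems`.
    """
    out = dict(node)
    if isinstance(out.get("items"), list):
        out["prefixItems"] = out.pop("items")
        if "additionalItems" in out:
            out["items"] = out.pop("additionalItems")
    return out


def upgrade_exclusive_bounds(node: dict) -> dict:
    """Draft 4 boolean exclusiveMinimum/exclusiveMaximum -> Draft 6+ numeric form."""
    out = dict(node)
    for bound, flag in (("minimum", "exclusiveMinimum"), ("maximum", "exclusiveMaximum")):
        if isinstance(out.get(flag), bool):
            exclusive = out.pop(flag)
            if exclusive and bound in out:
                # The numeric keyword carries the limit itself.
                out[flag] = out.pop(bound)
    return out


draft7_tuple = {
    "type": "array",
    "items": [{"type": "number"}, {"type": "string"}],
    "additionalItems": False,
}
draft4_bound = {"type": "number", "minimum": 0, "exclusiveMinimum": True}

print(upgrade_tuple_keywords(draft7_tuple))
print(upgrade_exclusive_bounds(draft4_bound))
```

Whichever draft the output targets, the root $schema value should be updated to match, so the validator does not have to guess.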
Export Targets
The same Shape tree can render to multiple targets:
- JSON / YAML: the schema itself, formatted for the language of choice
- TypeScript: equivalent interface declarations with ?: for optional fields
- Zod: runtime validators usable in TypeScript code
Conversion is mechanical but lossy in the TypeScript direction: JSON Schema can express constraints (minimum, pattern, format) that TypeScript types cannot. The export keeps the structural part — types, optionality, enums — and drops the constraints. If you need both runtime validation and types from one source, Zod is the better target because Zod schemas carry the constraints into runtime.
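To make the lossiness concrete, here is a small sketch that renders a flat object schema as a TypeScript interface. The function to_ts_interface and the type mapping are assumptions for illustration. It handles only primitive types and enums, and it shows how pattern, minimum, and format have nowhere to go in the output.

```python
import json

# JSON Schema primitive type -> TypeScript type (both numeric types collapse to number)
TS_TYPES = {
    "string": "string",
    "integer": "number",
    "number": "number",
    "boolean": "boolean",
    "null": "null",
}


def to_ts_interface(name: str, schema: dict) -> str:
    """Render a flat object schema as a TypeScript interface declaration."""
    required = set(schema.get("required", []))
    lines = [f"interface {name} {{"]
    for key, prop in schema.get("properties", {}).items():
        if "enum" in prop:
            ts_type = " | ".join(json.dumps(v) for v in prop["enum"])
        else:
            ts_type = TS_TYPES.get(prop.get("type"), "unknown")
        optional = "" if key in required else "?"
        lines.append(f"  {key}{optional}: {ts_type};")
    lines.append("}")
    return "\n".join(lines)


user_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "pattern": "^u_[0-9]+$"},
        "age": {"type": "integer", "minimum": 0},
        "role": {"enum": ["admin", "user"]},
        "email": {"type": "string", "format": "email"},
    },
    "required": ["id", "age", "role"],
}

print(to_ts_interface("User", user_schema))
```

The printed interface keeps types, optionality, and the enum, while the pattern on id, the minimum on age, and the format on email are silently gone.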
The Round-Trip Validation Check
After generating a schema, run AJV validation on the same samples you generated from. If anything fails, the inference made a wrong assumption — usually format-too-aggressive or enum-too-tight from a small sample. The JSON Schema Generator ships this as a Validation tab that lazy-loads AJV when opened, so the round-trip check is one click rather than a manual setup.
Run validation against samples that weren't in the inference set too. New data that fails validation tells you the inference was overfit; loosen and regenerate.
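The shape of the round-trip check can be sketched without a validator library. The checker below is a deliberately tiny stand-in that looks only at required and at primitive type. It is not AJV and not a JSON Schema implementation, so a real setup would swap check_sample for a call into a full validator. All names here are hypothetical.

```python
from __future__ import annotations

_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "null": lambda v: v is None,
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
}


def check_sample(schema: dict, sample: dict) -> list[str]:
    """Stand-in validator: checks only `required` and primitive `type`."""
    errors = []
    for key in schema.get("required", []):
        if key not in sample:
            errors.append(f"missing required property '{key}'")
    for key, prop in schema.get("properties", {}).items():
        check = _TYPE_CHECKS.get(prop.get("type"))
        if key in sample and check is not None and not check(sample[key]):
            errors.append(f"property '{key}' is not of type {prop['type']}")
    return errors


def round_trip(schema: dict, samples: list[dict]) -> dict[int, list[str]]:
    """Map sample index -> errors, listing only the samples that fail."""
    failures = {}
    for index, sample in enumerate(samples):
        errors = check_sample(schema, sample)
        if errors:
            failures[index] = errors
    return failures


# A schema overfit to a single sample: email was marked required.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["id", "name", "email"],
}
inference_set = [{"id": "u_123", "name": "Ada", "email": "ada@example.com"}]
held_out = [{"id": "u_456", "name": "Grace"}]  # legitimate record with no email

print(round_trip(schema, inference_set))
print(round_trip(schema, held_out))
```

An empty result on the inference set and a non-empty one on held-out data is the overfitting signal described above.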
The Honest Limitations
Inference from data is bounded by what data alone can show. The things it cannot do:
- Cross-field constraints. "If country is US, state is required." JSON Schema can express this via dependentRequired and if/then, but inference can't see the rule from samples, only from documentation.
- oneOf discriminator branching. If your API returns one of three response shapes based on a type field, inference will merge the properties of all three shapes into a single object. The discriminator pattern has to be added by hand.
- Custom keywords. Tools like Hyper-Schema or AsyncAPI extensions need explicit configuration, not data.
- Meaningful enums vs accidental low cardinality. A field with 4 distinct values across 50 samples might be an enum — or it might be a string field that happened to only see 4 values in your sample. Inference can't tell the difference.
- Format detection is per-field, not per-context. A field named id that holds UUIDs in one endpoint and ^u_[0-9]+$ IDs in another will be inferred as string in the merged case, losing the structure.
Use inference to draft the schema. Hand-tighten the parts data can't see.
Related Tools
- JSON Schema Generator: multi-sample inference with format detection, $defs, AJV validation, and TS/Zod export
- JSON Formatter: pretty-print and inspect JSON before/after generation
- JSON Validator: validate JSON against a schema in the browser
- YAML Formatter: for OpenAPI specs and Kubernetes manifests
- CI Converter: convert CI YAML across platforms
TL;DR
Single-sample inference guesses required vs optional; multi-sample inference proves it via intersection. Format detection should be conservative (one outlier disables the keyword) to survive the AJV round-trip. $defs extraction at threshold 2 catches real duplication. Draft 2020-12 uses prefixItems for tuples; if you target Draft 7, tuples silently degrade. JSON Schema can express constraints TypeScript can't — export to Zod when you need both. The JSON Schema Generator runs the whole pipeline client-side from one or many samples.