Tutorial · 9 min read

Generate JSON Schema from Samples

Single-sample inference guesses required vs optional; multi-sample inference proves it via intersection. How conservative format detection, $defs extraction by content hash, and Draft 2020-12's prefixItems actually work — plus the constraints that no amount of sample data can reveal.

Hand-writing a JSON Schema for a 300-line API response is the kind of task that gets put off forever. The mechanical answer is "infer from sample data" — paste the response, get back a schema. The cleaner answer is "infer from multiple samples" — paste five real responses, get back a schema that actually distinguishes required fields from optional ones, catches enums, and pulls repeated subtrees into $defs. This guide walks the inference pipeline, the draft-version differences that bite you in production, and the constraints that data alone can never tell you.

Single Sample vs Multi-Sample

Given one sample:

{ "id": "u_123", "name": "Ada", "email": "ada@example.com" }

a single-sample inference gives you id, name, email — all required, all string. The problem: you have no idea if any of these can be missing. You have no idea if there are fields like phone or created_at that appear in some responses and not others. You have no idea if id actually has a structured format you should pin (^u_[0-9]+$).

Multi-sample inference solves these one at a time:

  • Required vs optional comes from the intersection of keys across samples. A field is required only if it appears in every sample.
  • Properties come from the union: anything that appeared in any sample is in the schema. Optional fields are the ones in the union but not in the intersection.
  • Enum candidates come from finite low-cardinality observed value sets — if a field only ever holds one of 4 values across 50 samples, it's probably an enum.
  • Format candidates require every observed value to match the strict spec — one outlier and the format keyword is dropped.

The JSON Schema Generator does multi-sample inference by default. Drop in 5–10 real API responses and the output is meaningfully better than what you'd hand-write in an hour.
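The enum rule from the list above can be sketched in a few lines. This is a minimal illustration, not the generator's actual code: the function name enumCandidate and the two cutoffs (maxDistinct, minSamples) are assumptions chosen for the example.

```typescript
// Hypothetical helper: the name and both default cutoffs are assumptions.
function enumCandidate(
  values: string[],
  maxDistinct: number = 5,
  minSamples: number = 20
): string[] | null {
  if (values.length < minSamples) {
    return null; // too few observations to trust low cardinality
  }
  const distinct: string[] = [];
  for (const v of values) {
    if (distinct.indexOf(v) === -1) {
      distinct.push(v);
    }
  }
  // Only a small, closed-looking value set is proposed as an enum.
  return distinct.length <= maxDistinct ? distinct.sort() : null;
}

// 52 observations that only ever hold one of 4 values.
const observedStatuses: string[] = [];
for (let i = 0; i < 13; i++) {
  observedStatuses.push("active", "pending", "suspended", "deleted");
}
const statusEnum = enumCandidate(observedStatuses);
// statusEnum → ["active", "deleted", "pending", "suspended"]
```

The result is only a candidate: whether those four values are the complete set is something the samples cannot confirm.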

The Required-Field Rule, Step by Step

Sample   Has id   Has name   Has email   Has phone
1        yes      yes        yes         no
2        yes      yes        yes         yes
3        yes      yes        no          no

The output schema marks id and name as required. email and phone are properties but not required. This is the intersection rule applied literally. If you only had sample 1, you'd mistakenly mark email required and never know phone exists.
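Applied to the three samples in the table, the intersection rule might look like the sketch below. The function name inferKeys and the sample values are made up for illustration; only the presence or absence of each key mirrors the table.

```typescript
type JsonObject = { [key: string]: unknown };

// Count how many samples contain each key; a key is required only
// when its count equals the number of samples (the intersection).
function inferKeys(samples: JsonObject[]): { properties: string[]; required: string[] } {
  const counts: Record<string, number> = Object.create(null);
  for (const sample of samples) {
    for (const key of Object.keys(sample)) {
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  const properties = Object.keys(counts).sort(); // the union
  const required = properties.filter((k) => counts[k] === samples.length);
  return { properties, required };
}

const tableSamples: JsonObject[] = [
  { id: "u_1", name: "Ada", email: "ada@example.com" },
  { id: "u_2", name: "Grace", email: "grace@example.com", phone: "555-0100" },
  { id: "u_3", name: "Linus" },
];
const inferredKeys = inferKeys(tableSamples);
// inferredKeys.properties → ["email", "id", "name", "phone"]
// inferredKeys.required   → ["id", "name"]
```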

The flip side: add a sample with a real corner case. A user who registered before email verification was required will have a record with no email. One such sample in the inference set saves you from a schema that fails validation on legitimate data in production.

Format Detection (Conservative on Purpose)

JSON Schema has built-in format keywords: date-time, date, time, email, uuid, ipv4, ipv6, uri, hostname. A naive inference would slap format: email on any string that looks email-shaped. The conservative rule, which the JSON Schema Generator uses:

Emit format: X only if every observed value in the field passes the strict spec for X. One outlier disables the keyword.

The reason is round-trip safety. If you infer format: email, then run AJV validation on the same data, and one of the samples has a malformed email — your validator now fails on the same data you inferred from. Better to under-detect than to silently break.
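A rough sketch of the all-values-must-match rule follows. The regular expressions are simplified stand-ins (assumptions for this example), not the full grammars a validator applies, and detectFormat is a hypothetical name.

```typescript
// Simplified patterns; a real checker would follow the relevant RFCs more closely.
const FORMAT_CHECKS: { format: string; test: (v: string) => boolean }[] = [
  {
    format: "uuid",
    test: (v) =>
      /^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$/.test(v),
  },
  {
    format: "date-time",
    test: (v) =>
      /^\d{4}-\d{2}-\d{2}[Tt]\d{2}:\d{2}:\d{2}(\.\d+)?([Zz]|[+-]\d{2}:\d{2})$/.test(v),
  },
  { format: "date", test: (v) => /^\d{4}-\d{2}-\d{2}$/.test(v) },
];

// Emit a format only when every observed value passes; one outlier returns null.
function detectFormat(values: string[]): string | null {
  if (values.length === 0) {
    return null;
  }
  for (const check of FORMAT_CHECKS) {
    if (values.every((v) => check.test(v))) {
      return check.format;
    }
  }
  return null;
}

const cleanField = detectFormat(["2026-01-05T10:00:00Z", "2026-01-06T11:30:00+02:00"]);
// cleanField → "date-time"
const dirtyField = detectFormat(["2026-01-05T10:00:00Z", "yesterday"]);
// dirtyField → null
```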

Practical formats that hold up well in real data:

  • uuid — strict 8-4-4-4-12 hex; rarely has false positives
  • date-time — RFC 3339 (a profile of ISO 8601); check that every value has a full date, a time, and either Z or a numeric UTC offset
  • email — RFC 5322 simplified; common in user profiles but watch for plus-addressed and IDN cases
  • uri — RFC 3986; check that every value parses; surprisingly common to have one malformed URL in the wild

Formats that are usually too aggressive at sample sizes under 50:

  • hostname — many string fields look hostname-shaped but aren't
  • ipv4 / ipv6 — rare enough that you usually know in advance

Extracting $defs

When the same shape appears in multiple places, inline duplication makes the schema painful to maintain. Promote it to $defs:

{
  "$defs": {
    "Address": { "type": "object", "properties": { ... } }
  },
  "properties": {
    "shipping_address": { "$ref": "#/$defs/Address" },
    "billing_address":  { "$ref": "#/$defs/Address" }
  }
}

The inference rule: hash the canonical form of each subtree. If a hash appears 2 or more times (the default threshold), extract it. Below 2, inline it — the maintenance cost of extracting a one-use $def outweighs the readability benefit.
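One way to sketch the rule: serialize each object subtree with sorted keys so that key order does not matter, then count identical results. To stay dependency-free, this example uses the canonical string itself as the lookup key where a real implementation would presumably hash it. All names here (canonical, repeatedShapes) are illustrative, and the walk only follows properties.

```typescript
type Schema = { [key: string]: unknown };

// Serialize with sorted keys so equivalent subtrees produce identical strings.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonical(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Schema;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonical(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Walk nested properties and count each object-typed subschema by canonical form.
function countObjectShapes(schema: Schema, counts: Record<string, number>): void {
  const props = schema["properties"];
  if (props === null || typeof props !== "object") {
    return;
  }
  const propMap = props as { [key: string]: Schema };
  for (const propName of Object.keys(propMap)) {
    const child = propMap[propName];
    if (!child) {
      continue;
    }
    if (child["type"] === "object") {
      const key = canonical(child);
      counts[key] = (counts[key] || 0) + 1;
    }
    countObjectShapes(child, counts);
  }
}

// Canonical forms seen at least `threshold` times are candidates for $defs.
function repeatedShapes(schema: Schema, threshold: number = 2): string[] {
  const counts: Record<string, number> = {};
  countObjectShapes(schema, counts);
  return Object.keys(counts).filter((k) => (counts[k] || 0) >= threshold);
}

const addressA: Schema = {
  type: "object",
  properties: { street: { type: "string" }, city: { type: "string" } },
};
const addressB: Schema = {
  properties: { city: { type: "string" }, street: { type: "string" } },
  type: "object",
};
const orderSchema: Schema = {
  type: "object",
  properties: {
    shipping_address: addressA,
    billing_address: addressB,
    customer: { type: "object", properties: { id: { type: "string" } } },
  },
};
const extractable = repeatedShapes(orderSchema);
// extractable.length → 1 (the shared address shape; customer appears only once)
```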

The threshold is a tradeoff:

Threshold                     Effect
2 (default)                   Catches most real duplication; small schemas may get over-extracted
3                             Cleaner inline-vs-extracted balance for large schemas
1 (every reusable subtree)    Maximum reuse, maximum maintenance burden

For schemas you'll hand-edit afterward, threshold 3 is usually nicer. For schemas that will be regenerated from samples on every API change, threshold 2 keeps the diff readable.

Draft Differences You Actually Hit

JSON Schema has shipped five major drafts that you'll see in real codebases: 4, 6, 7, 2019-09, 2020-12. The differences that matter when generating schemas from samples:

  • prefixItems (2020-12) replaces the tuple form of items. Draft 7 wrote items: [{ ... }, { ... }, { ... }] to constrain a fixed-position array, with additionalItems covering the rest. Draft 2020-12 uses prefixItems: [...] and reserves items for the "rest" schema. If you generate Draft 7 and run it through a 2020-12 validator, tuples silently degrade.
  • unevaluatedProperties (2019-09+) lets you forbid properties not validated by any sub-schema. Doesn't exist in Draft 7. Generators that target Draft 7 lose this expressiveness.
  • $dynamicRef (2020-12) replaces the older $recursiveRef for self-referential schemas. Library support is uneven — AJV supports it, but some older toolchains don't.
  • $schema keyword position. Always at the root; some validators emit warnings if it's in a sub-schema even though the spec allows it.
  • exclusiveMinimum / exclusiveMaximum changed from boolean (Draft 4) to number (Draft 6+). A Draft 4 schema with { "minimum": 0, "exclusiveMinimum": true } is invalid under Draft 6+; the equivalent there is { "exclusiveMinimum": 0 }.

For new work in 2026, default to Draft 2020-12. For interop with older tooling (OpenAPI 3.0 uses a Draft 4-derived subset), target what your toolchain requires and pin it explicitly with $schema.
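The tuple and exclusive-bound changes are mechanical enough to sketch as a rewrite of a single schema node. The function below is a hypothetical helper, not part of any library. It is shallow on purpose: a full converter would also recurse into subschemas and handle exclusiveMaximum the same way.

```typescript
type SchemaNode = { [key: string]: unknown };

// Shallow upgrade of one node toward Draft 2020-12 keywords.
function upgradeNode(node: SchemaNode): SchemaNode {
  const out: SchemaNode = {};
  for (const key of Object.keys(node)) {
    out[key] = node[key];
  }

  // Draft 7 tuple: items is an array and additionalItems is the "rest" schema.
  // Draft 2020-12: prefixItems is the array and items is the "rest" schema.
  if (Array.isArray(out["items"])) {
    out["prefixItems"] = out["items"];
    delete out["items"];
    if ("additionalItems" in out) {
      out["items"] = out["additionalItems"];
    }
  }
  delete out["additionalItems"]; // the keyword no longer exists in 2020-12

  // Draft 4 boolean exclusiveMinimum becomes a numeric bound.
  if (typeof out["exclusiveMinimum"] === "boolean") {
    const exclusive = out["exclusiveMinimum"];
    delete out["exclusiveMinimum"];
    if (exclusive && typeof out["minimum"] === "number") {
      out["exclusiveMinimum"] = out["minimum"];
      delete out["minimum"];
    }
  }
  return out;
}

const draft7Tuple: SchemaNode = {
  type: "array",
  items: [{ type: "number" }, { type: "string" }],
  additionalItems: false,
};
const upgradedTuple = upgradeNode(draft7Tuple);
// upgradedTuple → { type: "array", prefixItems: [{ type: "number" }, { type: "string" }], items: false }

const draft4Bound: SchemaNode = { type: "number", minimum: 0, exclusiveMinimum: true };
const upgradedBound = upgradeNode(draft4Bound);
// upgradedBound → { type: "number", exclusiveMinimum: 0 }
```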

Export Targets

The same Shape tree can render to multiple targets:

  • JSON / YAML — the schema itself, formatted for the language of choice
  • TypeScript — equivalent interface declarations with ?: for optional fields
  • Zod — runtime validators usable in TypeScript code

Conversion is mechanical but lossy in the TypeScript direction: JSON Schema can express constraints (minimum, pattern, format) that TypeScript types cannot. The export keeps the structural part — types, optionality, enums — and drops the constraints. If you need both runtime validation and types from one source, Zod is the better target because Zod schemas carry the constraints into runtime.
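To make the lossiness concrete, here is a small sketch of a TypeScript renderer over a flat object schema. The types and function names are assumptions for the example, and only a few primitive types are handled. Note that pattern, minimum, and format go in but have nowhere to go in the output.

```typescript
type PropSchema = {
  type: string;
  enum?: string[];
  format?: string;
  minimum?: number;
  pattern?: string;
};
type ObjectSchema = { properties: { [key: string]: PropSchema }; required: string[] };

// Map a property schema to a TypeScript type; constraints are silently dropped.
function tsType(prop: PropSchema): string {
  if (prop.enum) {
    return prop.enum.map((v) => JSON.stringify(v)).join(" | ");
  }
  if (prop.type === "integer" || prop.type === "number") {
    return "number";
  }
  if (prop.type === "boolean") {
    return "boolean";
  }
  if (prop.type === "string") {
    return "string";
  }
  return "unknown";
}

// Render an interface; keys missing from `required` get the ?: marker.
function toInterface(typeName: string, schema: ObjectSchema): string {
  const lines = Object.keys(schema.properties).map((key) => {
    const prop = schema.properties[key];
    const rendered = prop ? tsType(prop) : "unknown";
    const optional = schema.required.indexOf(key) === -1 ? "?" : "";
    return "  " + key + optional + ": " + rendered + ";";
  });
  return "interface " + typeName + " {\n" + lines.join("\n") + "\n}";
}

const userSchema: ObjectSchema = {
  properties: {
    id: { type: "string", pattern: "^u_[0-9]+$" },
    age: { type: "integer", minimum: 0 },
    email: { type: "string", format: "email" },
  },
  required: ["id"],
};
const renderedInterface = toInterface("User", userSchema);
// interface User {
//   id: string;
//   age?: number;
//   email?: string;
// }
```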

The Round-Trip Validation Check

After generating a schema, run AJV validation on the same samples you generated from. If anything fails, the inference made a wrong assumption — usually format-too-aggressive or enum-too-tight from a small sample. The JSON Schema Generator ships this as a Validation tab that lazy-loads AJV when opened, so the round-trip check is one click rather than a manual setup.

Run validation against samples that weren't in the inference set too. New data that fails validation tells you the inference was overfit; loosen and regenerate.
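The check itself is a short loop. In the sketch below the validator is passed in as a plain function so the example stays dependency-free; with AJV, the function returned by ajv.compile(schema) has this shape. The stand-in validator and the sample data are made up for illustration.

```typescript
type Validator = (sample: unknown) => boolean;

// Return the indexes of samples that fail validation.
// An empty result on the inference set means the round trip holds.
function failingSamples(samples: unknown[], validate: Validator): number[] {
  const failures: number[] = [];
  samples.forEach((sample, index) => {
    if (!validate(sample)) {
      failures.push(index);
    }
  });
  return failures;
}

// Stand-in for a compiled schema: requires a string id.
const standInValidator: Validator = (sample) => {
  if (sample === null || typeof sample !== "object") {
    return false;
  }
  const record = sample as { [key: string]: unknown };
  return typeof record["id"] === "string";
};

const inferenceSet: unknown[] = [{ id: "u_1" }, { id: "u_2" }];
const heldOutSet: unknown[] = [{ id: "u_3" }, { id: 42 }];

const roundTripFailures = failingSamples(inferenceSet, standInValidator);
// roundTripFailures → []
const heldOutFailures = failingSamples(heldOutSet, standInValidator);
// heldOutFailures → [1]
```

A failure index in the held-out set points at the sample worth adding to the inference set before regenerating.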

The Honest Limitations

Inference from data is bounded by what data alone can show. The things it cannot do:

  • Cross-field constraints. "If country is US, state is required." JSON Schema can express this via dependentRequired and if/then, but inference can't see the rule from samples — only from documentation.
  • oneOf discriminator branching. If your API returns one of three response shapes based on a type field, inference will merge the properties of all three shapes into a single object. The discriminator pattern has to be added by hand.
  • Custom keywords. Tools like Hyper-Schema or AsyncAPI extensions need explicit configuration, not data.
  • Meaningful enums vs accidental low cardinality. A field with 4 distinct values across 50 samples might be an enum — or it might be a string field that happened to only see 4 values in your sample. Inference can't tell the difference.
  • Format detection is per-field, not per-context. A field named id that holds UUIDs in one endpoint and ^u_[0-9]+$ IDs in another will be inferred as string in the merged case, losing the structure.

Use inference to draft the schema. Hand-tighten the parts data can't see.
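As an example of hand-tightening, the "if country is US, state is required" rule could be added to an inferred object schema with if/then (available from Draft 7 onward). The schema is written here as a TypeScript constant; the property names come from the rule as stated, and the rest is an illustrative assumption.

```typescript
// Inferred part: two string properties, only country seen in every sample.
// Hand-added part: the if/then pair, which no sample set could reveal.
const addressRuleSchema = {
  type: "object",
  properties: {
    country: { type: "string" },
    state: { type: "string" },
  },
  required: ["country"],
  if: {
    properties: { country: { const: "US" } },
    required: ["country"],
  },
  then: {
    required: ["state"],
  },
};
```

Listing country under required inside the if block keeps the condition from matching objects that have no country at all.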

TL;DR

Single-sample inference guesses required vs optional; multi-sample inference proves it via intersection. Format detection should be conservative (one outlier disables the keyword) to survive the AJV round-trip. $defs extraction at threshold 2 catches real duplication. Draft 2020-12 uses prefixItems for tuples; if you target Draft 7, tuples silently degrade. JSON Schema can express constraints TypeScript can't — export to Zod when you need both. The JSON Schema Generator runs the whole pipeline client-side from one or many samples.
