Overview
An extraction schema is a JSON object that defines which fields to extract from a document and how to structure the output. You pass the schema to the API along with Markdown content generated by the API. You can build extraction schemas using the Playground or the API, both of which generate valid schemas automatically. Use this article as a reference for understanding how schemas are structured and how the API handles unsupported keywords.
The schema requirements in this article apply to extract-20260314 and later. For information about earlier versions, see Earlier Versions of Extract.
When using the library, you can also pass a Pydantic model class instead of a JSON schema. For more information, go to Python Library.
Basic Structure
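The extraction schema must be a JSON object with a top-level type of "object" and a properties keyword that defines each field to extract. Fields can also use the x-alternativeNames keyword to account for field name variations across documents.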
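As a minimal sketch of that structure, the schema below is written as a Python dict and serialized to JSON. The field names (invoice_number, total_amount) and the alternative labels are hypothetical, and x-alternativeNames is assumed here to take a list of strings.

```python
import json

# Hypothetical invoice schema; field names and labels are illustrative only.
schema = {
    "type": "object",  # the top-level type must be "object"
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Unique identifier printed on the invoice.",
            # Assumption: alternative labels are given as a list of strings.
            "x-alternativeNames": ["Invoice No.", "Reference Number"],
        },
        "total_amount": {
            "type": "number",
            "description": "Total amount due, including tax.",
        },
    },
}

# Serialize to the JSON text that would be passed as the extraction schema.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```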
Schemas generated by the Playground or the API automatically include root-level description and required keywords. These are currently ignored by the API.
Top-Level Type Requirement
The top-level type keyword must be "object". Schemas with a different top-level type will return an error.
In the following example, the highlighted line shows the required top-level type keyword:
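A minimal sketch, assuming a hypothetical patient_name field: the first schema has the required top-level type, the second does not. The has_object_root helper is an illustrative local check, not part of the API.

```python
def has_object_root(schema: dict) -> bool:
    """Return True when the schema's top-level type is "object"."""
    return schema.get("type") == "object"


valid_schema = {
    "type": "object",  # required top-level type
    "properties": {"patient_name": {"type": "string"}},
}

# A root-level array is the kind of schema the text says returns an error.
invalid_schema = {"type": "array", "items": {"type": "string"}}

print(has_object_root(valid_schema))
print(has_object_root(invalid_schema))
```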
Define Each Field
Each field you want to extract is defined as a property inside the properties keyword. Each property can include the following:
| Element | Required | Description |
|---|---|---|
| Field name | Required | The key used to identify the extracted value in the output. |
| type | Required | The data type of the extracted value. |
| description | Optional | Natural-language description of what to extract. |
| enum | Optional | Restricts the extracted value to a set of allowed values. |
| format | Optional | Instructions for how to format the extracted value. |
| x-alternativeNames | Optional | Alternative labels for the field that may appear across different documents. |
| properties | Required for object types | Defines the fields within a nested object. |
| items | Required for array types | Defines the structure of items in an array. |
Field Names
Required. The field name is the key that identifies each property in the properties object and determines how the extracted value is labeled in the output.
Field names can contain letters, numbers, underscores, and hyphens. Use descriptive, specific names that clearly indicate what data to extract:
- invoice_number instead of number
- patient_name instead of name
Supported Field Types
Required. Use the type keyword to define the data type of the extracted value. Supported types are:
| Type | Description |
|---|---|
| array | A list of items. To see an example, go to Arrays. |
| boolean | True or false values. |
| integer | Whole numbers. |
| number | Numeric values, including decimals. Use this type for monetary values or when you need to perform calculations on the extracted value. |
| object | A nested structure. To see an example, go to Nested Objects. |
| string | Text values. |
In the following example, the highlighted lines show the type keyword:
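A short sketch of the primitive types in use, written as a Python dict. The field names are hypothetical; only the type values come from the table above.

```python
# Hypothetical patient-form schema with one field per primitive type.
schema = {
    "type": "object",
    "properties": {
        "patient_name": {"type": "string"},
        "age": {"type": "integer"},
        "copay_amount": {"type": "number"},  # decimals, so number rather than integer
        "is_insured": {"type": "boolean"},
    },
}

# List each field with its declared type.
for name, spec in schema["properties"].items():
    print(f"{name}: {spec['type']}")
```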
To restrict a field to a specific set of allowed values, use the enum keyword.
Restrict Values with Enum
Optional. Use the enum keyword to restrict the extracted value to a specific set of allowed values. Only string values are supported. Include "type": "string" in the field definition. Any non-string values are converted to strings.
For example, the following schema restricts the extracted account type to one of three allowed values. The highlighted lines show the type and enum keywords:
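A hedged sketch of such a schema as a Python dict. The account_type field name and its three allowed values are hypothetical.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "account_type": {
            "type": "string",  # enum fields are declared as strings
            "enum": ["checking", "savings", "business"],  # allowed values
        }
    },
}

print(json.dumps(schema, indent=2))
```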
In the Playground, “enum” appears as a field type option in the Type drop-down menu. When you export the schema, the Playground outputs this correctly using the enum keyword.
Field Descriptions
Optional. Use the description keyword to help the API identify and extract the correct data. The more specific your descriptions, the more accurate the extraction.
Include the following in your descriptions:
- Exactly what data to extract
- What to include or exclude (for example, “excluding tax” or “including area code”)
In the following example, the highlighted lines show the description keyword:
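A minimal sketch with hypothetical fields, where each description states exactly what to extract and what to include or exclude:

```python
schema = {
    "type": "object",
    "properties": {
        "subtotal": {
            "type": "number",
            # States the exact value wanted and what to leave out.
            "description": "Sum of all line items on the invoice, excluding tax.",
        },
        "phone_number": {
            "type": "string",
            # States what must be included in the value.
            "description": "Customer's primary phone number, including area code.",
        },
    },
}

for name, spec in schema["properties"].items():
    print(f"{name}: {spec['description']}")
```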
Format
Optional. Use the format keyword to specify how the extracted value should be formatted. This is most commonly applied to string fields.
The format keyword accepts natural-language instructions and standard JSON Schema format values. Natural-language instructions offer more flexibility, since you can describe formatting requirements that don’t have a standard equivalent. We recommend experimenting with different values to find what works best for your use case.
The following examples illustrate the range of options:
| format value | Output example |
|---|---|
| YYYY-MM-DD | 2026-01-17 |
| Month DD, YYYY | January 17, 2026 |
| Currency amount with the $ symbol, for example $12.50 | $170.23 |
| Two-letter US state code | CA |
In the following example, the highlighted lines show the format keyword:
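A brief sketch using two format values taken from the table above. The field names are hypothetical.

```python
schema = {
    "type": "object",
    "properties": {
        "invoice_date": {
            "type": "string",
            "description": "Date the invoice was issued.",
            "format": "YYYY-MM-DD",  # date pattern from the table above
        },
        "billing_state": {
            "type": "string",
            "description": "State in the billing address.",
            "format": "Two-letter US state code",  # natural-language instruction
        },
    },
}

for name, spec in schema["properties"].items():
    print(f"{name}: format = {spec['format']}")
```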
Alternative Names
Optional. Use the x-alternativeNames keyword to list alternative labels for a field. This helps the API locate the correct data when documents use different labels for the same field, such as “Invoice Number” versus “Reference Number.”
In the following example, the highlighted lines show the x-alternativeNames keyword:
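A minimal sketch built on the two labels mentioned above. The invoice_number field name is hypothetical, and the list-of-strings shape for x-alternativeNames is an assumption.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Identifier assigned to the invoice by the issuer.",
            # Assumption: labels are supplied as a list of strings.
            "x-alternativeNames": ["Invoice Number", "Reference Number"],
        }
    },
}

print(json.dumps(schema["properties"]["invoice_number"], indent=2))
```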
Properties (For Objects)
Required for object types. Use the properties keyword to define the fields within a nested object. For a full example, see Nested Objects.
Format Arrays with items
Required for array types. Use the items keyword to define the structure of items in an array. For a full example, see Arrays.
Nested Objects
Use nested objects when the data you want to extract has a natural hierarchical structure. For example, an invoice might include a billing address with multiple sub-fields (street, city, state, and ZIP code), or a patient form might have separate sections for personal details and insurance information. In the following example, the highlighted lines show a nested properties keyword inside the invoice object:
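A hedged sketch of the billing-address case as a Python dict. All field names are hypothetical. Each object-typed field carries its own properties keyword.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {  # nested properties for the invoice object
                "invoice_number": {"type": "string"},
                "billing_address": {
                    "type": "object",
                    "properties": {  # second level of nesting
                        "street": {"type": "string"},
                        "city": {"type": "string"},
                        "state": {"type": "string"},
                        "zip_code": {"type": "string"},
                    },
                },
            },
        }
    },
}

print(json.dumps(schema, indent=2))
```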
Arrays
Use arrays to extract repeating structures from a document, such as all rows in a table. Each item in the array follows the same schema, making arrays well-suited for data like transaction lists, invoice line items, or lists of charges.
To define an array field:
- Set "type": "array" on the field.
- Include the items keyword to define the structure of each item.
- Inside items, define the fields each item should contain.
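The steps above can be sketched as follows for invoice line items. The line_items field and its sub-fields are hypothetical names.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",  # step 1: mark the field as an array
            "items": {        # step 2: describe one item
                "type": "object",
                "properties": {  # step 3: fields each item should contain
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
            },
        }
    },
}

print(json.dumps(schema, indent=2))
```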
Keyword Support
The API only supports a specific set of JSON Schema keywords. Unsupported keywords either cause errors or are silently ignored, depending on the keyword.
Supported Keywords
The API supports the following keywords. For details on each, see Define Each Field.
- description
- enum (string values only)
- format
- items
- properties
- type
- x-alternativeNames
Ignored Keywords
The following keywords are not supported but will not cause errors. The API removes or resolves them before running extraction.
| Keyword | How the API handles it |
|---|---|
| Reference keywords: $anchor, $defs, $id, $ref, $schema, definitions | Used to define and reference reusable schema components. All references are resolved during schema conversion. The keywords are then removed. |
| Recursive and dynamic keywords: $dynamicAnchor, $dynamicRef, $recursiveAnchor, $recursiveRef | Used for self-referential schemas. All references are resolved during schema conversion. The keywords are then removed. |
| anyOf | To ensure consistent output, each field is limited to a specific type. If one of the anyOf types is null, the API removes null and sets the type to the other specified type. If none of the types are null, the API falls back to string, since that is least likely to cause issues. |
| default | If default is null, the API removes the keyword. If default is any other value, the API returns a 206. |
| nullable | The API removes the keyword. All fields are nullable by default. For more information, see Missing Fields and Nullable Fields. |
| required | The API removes the keyword. The API considers all fields to be required. |
| title | The API removes the keyword. Typically, the title does not give additional information that is not already present in the field name or description. |
Keywords That Cause Errors
Whether the API returns a 206 (Partial Content) or 422 (Unprocessable Entity) depends on the strict parameter. If strict is false, the API returns a 206. If strict is true, the API returns a 422. For more information, see Set the strict Parameter.
Any keyword not listed in Supported Keywords or Ignored Keywords will cause the API to return an error. The following list provides common examples and is not exhaustive:
- allOf
- const
- maxItems
- maxLength
- maximum
- minItems
- minLength
- minimum
- oneOf
- pattern
- propertyOrdering
- uniqueItems
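One way to catch these keywords before sending a schema is a local pre-flight check. The sketch below is an illustrative helper, not part of the API: it walks a schema and reports every keyword that appears in neither the supported nor the ignored list from this article. It does not model the special case of a non-null default, and the sample schema is hypothetical.

```python
SUPPORTED = {"description", "enum", "format", "items", "properties",
             "type", "x-alternativeNames"}
IGNORED = {"$anchor", "$defs", "$id", "$ref", "$schema", "definitions",
           "$dynamicAnchor", "$dynamicRef", "$recursiveAnchor", "$recursiveRef",
           "anyOf", "default", "nullable", "required", "title"}


def find_error_keywords(schema: dict, path: str = "$") -> list:
    """Return the paths of keywords that are neither supported nor ignored."""
    problems = [f"{path}.{key}" for key in schema
                if key not in SUPPORTED and key not in IGNORED]
    # Keys inside "properties" are field names, not keywords, so only
    # their values are checked.
    properties = schema.get("properties")
    if isinstance(properties, dict):
        for name, sub in properties.items():
            if isinstance(sub, dict):
                problems += find_error_keywords(sub, f"{path}.properties.{name}")
    items = schema.get("items")
    if isinstance(items, dict):
        problems += find_error_keywords(items, f"{path}.items")
    return problems


# Hypothetical schema with two error-causing keywords and one ignored keyword.
schema = {
    "type": "object",
    "properties": {
        "sku": {"type": "string", "pattern": "^[A-Z]{3}$", "title": "SKU"},
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
}

print(find_error_keywords(schema))
```

Here pattern and minItems are reported, while title is skipped because it is in the ignored list.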
How the API Handles Required Fields
The API treats all fields as required and always attempts to extract every property defined in your schema. The required keyword is not supported and is ignored if included. For more information, see Ignored Keywords.
If the API cannot find a field in the document, it returns null rather than an error. For more information, see How the API Handles Missing Fields.
How the API Handles Missing Fields
All fields are nullable, meaning the API returns null when it cannot find a field in the document rather than returning an error. Because of this, the API ignores null and nullable if included in your schema. For more information, see Ignored Keywords.
For example, if your schema includes a first_name field but the document does not contain a first name, the API returns null for that field.
The exact behavior depends on the field type:
| Field type | Behavior when not found |
|---|---|
| Primitive fields (boolean, integer, number, string) | Returns null. |
| array | Returns an empty array: []. |
| object | Never returns null, but all primitive fields within it return null. |
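To make the table concrete, the sketch below builds the output you would expect if none of the fields were found. The missing_value helper is an illustration of the rules in the table, not the API's implementation, and the schema fields are hypothetical. JSON null corresponds to Python None.

```python
def missing_value(spec: dict):
    """Return the value the table describes for a field that is not found."""
    field_type = spec.get("type")
    if field_type == "array":
        return []  # arrays come back empty
    if field_type == "object":
        # Objects are never null; each nested field is resolved in turn.
        return {name: missing_value(sub)
                for name, sub in spec.get("properties", {}).items()}
    return None  # boolean, integer, number, string


schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "line_items": {"type": "array", "items": {"type": "string"}},
        "billing_address": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "zip_code": {"type": "string"},
            },
        },
    },
}

result = missing_value(schema)
print(result)
```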
Schema Validation
The API processes your schema and output in three stages:
- Before Extraction: Validate Schema Structure
- During Processing: Convert Schema
- After Extraction: Validate Extracted Output Against Schema
Validate Schema Structure (Before Extraction)
The API checks that your schema is valid JSON and follows the required structure. If validation fails, the API returns a 422 error before processing begins.
Convert Schema (During Processing)
The API converts your schema before running extraction. If the schema includes keywords that cause errors, the behavior depends on the strict parameter:
- If strict is false: the API continues and returns a 206 (Partial Content).
- If strict is true: the API stops and returns a 422 (Unprocessable Entity).
Validate Extracted Output Against Schema (After Extraction)
After extraction completes, the API validates that the extracted output matches your schema. If it does not, the API returns a 206 (Partial Content) with any successfully extracted data.
Because the API returns at least partial results, the API call consumes credits.
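On the client side, the status codes described above can be mapped to a next step. The sketch below is a hypothetical helper that takes only the HTTP status code. The meanings of 206 and 422 come from this article, while treating any other 2xx code as a full success is an assumption.

```python
def classify_status(status_code: int) -> str:
    """Map an extraction response status code to a coarse outcome."""
    if status_code == 206:
        # Partial Content: some data was extracted; review the schema or output.
        return "partial"
    if status_code == 422:
        # Unprocessable Entity: the schema was rejected.
        return "rejected"
    if 200 <= status_code < 300:
        # Assumption: any other 2xx code means a full success.
        return "ok"
    return "other"


for code in (200, 206, 422, 500):
    print(code, classify_status(code))
```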
FAQs for Extraction Schemas
Does the order of properties in the schema need to match the order of the fields in the document?
No. The order of properties in the schema has no impact on extraction. For example, a property defined last in the schema can still extract data from a field that appears at the top of the document.
Is there a maximum number of fields that can be extracted?
No. There is no maximum number of properties in an extraction schema.
Can I put formatting and alternative names in the description field?
You can include formatting instructions and alternative field names in description, but using the dedicated format and x-alternativeNames keywords is more effective.
When you use dedicated keywords, the API knows exactly what each piece of information is: format contains only formatting instructions, and x-alternativeNames contains only alternative field labels. This allows the API to apply each more precisely than if the same information were embedded in a general description.
For more information, see Format and Alternative Names.

