Overview
To extract fields from parsed documents, provide a JSON schema that defines what structured data should be extracted from Markdown content. The extraction schema is a JSON Schema object that determines which key-value pairs are extracted and how they are structured. This article provides requirements and best practices for creating JSON schemas for extraction. For information about extraction models and model versions, go to Extraction Model Versions.
When using the library, you can also pass a Pydantic model class instead of a JSON schema. For more information, go to Python Library.
Basic Structure
The JSON schema must follow this structure:
Schema Type
The top-level type property must be "object":
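As a minimal sketch of this structure, the schema can be written as a Python dict and serialized to JSON. The field name invoice_number and the helper check_top_level are illustrative assumptions, not part of the API.

```python
import json

# Hypothetical minimal extraction schema: the top-level type is "object",
# and each field to extract is a property inside "properties".
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice number printed at the top of the invoice",
        },
    },
    "required": ["invoice_number"],
}


def check_top_level(schema: dict) -> list[str]:
    """Return a list of problems with the top-level structure (empty if none)."""
    problems = []
    if schema.get("type") != "object":
        problems.append('top-level "type" must be "object"')
    if not isinstance(schema.get("properties"), dict):
        problems.append('"properties" must be an object')
    return problems


schema_json = json.dumps(schema, indent=2)  # JSON text form of the schema
print(check_top_level(schema))
```

A local check like this only catches structural slips early; the API performs its own validation.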
Field Names
Define each field you want to extract as a property in the properties object. Field names can contain letters, numbers, underscores, and hyphens.
Use descriptive and specific field names that clearly indicate what data should be extracted. For example, use:
- invoice_number instead of number
- patient_name instead of name
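The naming rule can be checked locally before sending a schema. The sketch below assumes that "letters" means ASCII letters, which may be stricter than the API; the pattern and the helper invalid_field_names are hypothetical.

```python
import re

# Assumption: one or more ASCII letters, digits, underscores, or hyphens.
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def invalid_field_names(schema: dict) -> list[str]:
    """Return top-level property names that contain other characters."""
    return [
        name
        for name in schema.get("properties", {})
        if not FIELD_NAME_PATTERN.fullmatch(name)
    ]


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "patient name": {"type": "string"},  # contains a space
        "total-due": {"type": "number"},
    },
}
print(invalid_field_names(schema))  # ['patient name']
```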
Supported Field Types
Define field types using the type property. Supported types are:
- array: Lists of items. The array can include these data types: string, enum, date, boolean, number, integer, and object. To see an example, go to Arrays.
- boolean: True/false values
- integer: Whole numbers
- number: Numeric values, including decimals. If extracting monetary values, or if you need to perform calculations on the extracted value, use this type. When using the agentic-doc library, this is float.
- object: Nested structures. To see an example, go to Nested Objects.
- string: Text values
When using extraction model extract-20250930, you can also use null. For more information, go to Nullable Fields.
Field Descriptions
The description property is optional, but providing detailed descriptions helps the API identify and extract the correct data. The more specific your field names and descriptions, the more accurate the extraction.
Include the following details in your descriptions:
- Exactly what data to extract
- Any formatting requirements (e.g., “in USD”, “as YYYY-MM-DD”)
- What to include or exclude (e.g., “excluding tax”, “including area code”)
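As an illustration of these guidelines, the sketch below defines three hypothetical invoice fields whose descriptions state what to extract, the expected format, and what to include or exclude. The field names and wording are assumptions chosen for the example.

```python
# Hypothetical invoice fields with detailed descriptions.
schema = {
    "type": "object",
    "properties": {
        "invoice_date": {
            "type": "string",
            "description": "Date the invoice was issued, as YYYY-MM-DD",
        },
        "subtotal": {
            "type": "number",
            "description": "Invoice subtotal in USD, excluding tax",
        },
        "contact_phone": {
            "type": "string",
            "description": "Vendor phone number, including area code",
        },
    },
}

# List any top-level fields that still lack a description.
undescribed = [
    name
    for name, spec in schema["properties"].items()
    if not spec.get("description")
]
print(undescribed)  # []
```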
Schema Examples
Basic Example
This example shows a simple schema that extracts key information from a patient intake form:
Restrict Values with Enum
Use the enum property to restrict the extracted value to a specific set of allowed values. Only string enums are supported.
For example, when extracting data from bank statements, you can limit the account type to only “Premium Checking” or “Standard Checking”:
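A sketch of such a schema as a Python dict follows. The field name account_type and the helper non_string_enums are hypothetical; the helper only inspects top-level fields.

```python
schema = {
    "type": "object",
    "properties": {
        "account_type": {
            "type": "string",
            "description": "Type of bank account shown on the statement",
            "enum": ["Premium Checking", "Standard Checking"],
        },
    },
}


def non_string_enums(schema: dict) -> list[str]:
    """Return top-level fields whose enum contains a non-string value."""
    return [
        name
        for name, spec in schema.get("properties", {}).items()
        if any(not isinstance(value, str) for value in spec.get("enum", []))
    ]


print(non_string_enums(schema))  # []
```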
If you use a non-string data type with the enum property when using extraction model extract-20251024, the extraction request will fail.
Arrays
Use arrays to extract lists of similar items from a document. This example extracts individual charges from a utility bill:
Union Types (Multiple Allowed Types)
When a field can accept multiple data types and one of those types is object or array, use the anyOf keyword instead of a type array (e.g., "type": ["number", "object"]). This prevents validation conflicts.
For example, if a field can contain either a number or an object, use anyOf:
Correct:
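A sketch of the anyOf form, assuming a hypothetical discount field that is either a plain number or an object with details. The helper needs_any_of is also hypothetical; it flags a type array that mixes in object or array.

```python
# Hypothetical field: "discount" is either a number or a detail object.
schema = {
    "type": "object",
    "properties": {
        "discount": {
            "description": "Discount applied to the order",
            "anyOf": [
                {"type": "number"},
                {
                    "type": "object",
                    "properties": {
                        "amount": {"type": "number"},
                        "reason": {"type": "string"},
                    },
                },
            ],
        },
    },
}


def needs_any_of(field: dict) -> bool:
    """True if the field uses a type array that includes object or array."""
    types = field.get("type")
    return isinstance(types, list) and bool({"object", "array"} & set(types))


print(needs_any_of({"type": ["number", "object"]}))  # True
print(needs_any_of(schema["properties"]["discount"]))  # False
```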
Nested Objects
Use nested objects to extract hierarchical data from a document. The extraction schema supports up to five nested levels. This example extracts structured information from an invoice:
Nullable Fields
Some fields in your JSON schema may not have values in every document. For example, when processing patient intake forms from different healthcare providers, some forms may include a middle name field while others may not. To handle fields that may be missing, allow the API to return null for the field. The JSON schema syntax depends on which extraction model you are using:
extract-20251024: Use the Nullable Keyword
When using extraction model extract-20251024, add "nullable": true to any field that can be null.
This example allows the middle name field to return either a string or null:
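A sketch of the nullable keyword, written as a Python dict in which True serializes to JSON true. The field names first_name and middle_name are assumptions.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "middle_name": {
            "type": "string",
            "nullable": True,  # becomes "nullable": true in JSON
            "description": "Patient's middle name, if present on the form",
        },
    },
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```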
extract-20250930: Use the Null Type
When using extraction model extract-20250930, define nullable fields using a type array that includes "null".
This example allows the middle name field to return either a string or null:
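A sketch of the type-array form, together with a hypothetical helper that converts a simple field from the nullable-keyword form to this one. The helper handles only fields whose type is a single string; the field name middle_name is an assumption.

```python
# Type-array form: the field may be a string or null.
middle_name = {"type": ["string", "null"]}


def nullable_to_type_array(field: dict) -> dict:
    """Convert {"type": T, "nullable": true} into {"type": [T, "null"]}."""
    converted = {key: value for key, value in field.items() if key != "nullable"}
    if field.get("nullable") and isinstance(field.get("type"), str):
        converted["type"] = [field["type"], "null"]
    return converted


print(nullable_to_type_array({"type": "string", "nullable": True}))
# {'type': ['string', 'null']}
```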
Schema Validation
The API validates your JSON schema in two ways:
Before Extraction
The API checks that your JSON schema is valid JSON and follows the required structure. Invalid schemas return a 422 error:
After Extraction
The API validates that extracted data matches your JSON schema. If the extracted data doesn’t conform to the schema, the API returns a 206 status and any successfully extracted data. Because the API returns at least partial results, the API call consumes credits.
Keyword Support
Each extraction model supports different JSON Schema keywords.
extract-20251024: Supported Keywords
When using extraction model extract-20251024, the API only supports the following keywords in your JSON schema. All other keywords are ignored and won’t cause errors.
- $defs
- $ref
- anyOf
- description
- enum (only string enums are supported)
- format
- items
- maximum
- maxItems
- minimum
- minItems
- nullable
- properties
- propertyOrdering
- required
- title
- type
The following commonly-used keywords are not supported. This list is not exhaustive; any keyword not listed above is ignored.
- $schema
- allOf
- oneOf
- const
- minLength
- maxLength
- uniqueItems
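Because ignored keywords do not raise errors, it can help to detect them locally. The sketch below walks a schema and collects keywords outside the supported list for extract-20251024. The helper ignored_keywords is hypothetical and does not descend into the values of unsupported keywords.

```python
SUPPORTED_20251024 = {
    "$defs", "$ref", "anyOf", "description", "enum", "format", "items",
    "maximum", "maxItems", "minimum", "minItems", "nullable", "properties",
    "propertyOrdering", "required", "title", "type",
}


def ignored_keywords(node, found=None) -> set:
    """Recursively collect schema keywords that are not in the supported set."""
    found = set() if found is None else found
    if not isinstance(node, dict):
        return found
    for key, value in node.items():
        if key not in SUPPORTED_20251024:
            found.add(key)
        elif key in ("properties", "$defs") and isinstance(value, dict):
            # Keys at this level are field or definition names, not keywords.
            for sub_schema in value.values():
                ignored_keywords(sub_schema, found)
        elif key == "items":
            ignored_keywords(value, found)
        elif key == "anyOf" and isinstance(value, list):
            for sub_schema in value:
                ignored_keywords(sub_schema, found)
    return found


schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "patient_name": {"type": "string", "minLength": 1},
    },
}
print(sorted(ignored_keywords(schema)))  # ['$schema', 'minLength']
```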
extract-20250930: Unsupported Keywords
When using extraction model extract-20250930, the API does not support the following keywords in your JSON schema. If your schema includes these keywords, the API returns an error: Keyword 'KEY' is not supported.
- allOf
- not
- dependentRequired
- dependentSchemas
- if
- then
- else
Missing Fields
When the API cannot find a field defined in your JSON schema within the document, the response depends on the extraction model.
extract-20251024: Returns Null for Missing Fields
When using extraction model extract-20251024, the API returns missing values as null, even if the field is marked as required or not marked as nullable in your JSON schema.
For example, if your JSON schema requires a First Name field but the submitted document does not contain a first name, the API returns null for that field.
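Since a required field can still come back as null, downstream code may want to check for it explicitly. The sketch below assumes the extracted data has already been loaded into a Python dict; the helper missing_required and the sample values are hypothetical.

```python
def missing_required(schema: dict, extraction: dict) -> list[str]:
    """Return required top-level fields that are null or absent in the data."""
    return [
        name
        for name in schema.get("required", [])
        if extraction.get(name) is None
    ]


schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
    },
    "required": ["first_name", "last_name"],
}

# Hypothetical extracted data: JSON null is loaded as Python None.
extraction = {"first_name": None, "last_name": "Rivera"}
print(missing_required(schema, extraction))  # ['first_name']
```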
extract-20250930: Inconsistent Behavior
When using extraction model extract-20250930, the API has inconsistent behavior when fields are missing. It may return null, 0, an empty string, or another value.
Best Practices and Recommendations
This section provides best practices for designing JSON schemas. While these aren’t requirements, following these recommendations may improve extraction accuracy.
Match Schema to Document Structure
Design your JSON schema to align with how information appears in the documents. For example, if you’re extracting a field from the top of a document and another from the bottom, list them in that order in your schema.
Limit the Number of Properties
For optimal performance, include no more than 30 properties in your JSON schema. Performance may degrade as the number of properties increases. Each field defined within a properties object counts as one property. The schema below has 4 total properties: 1 top-level property (employeeInfo) that organizes the data, plus 3 nested properties (name, address, socialSecurityNumber) that contain the actual extracted values.
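A sketch of a schema matching that description, with a hypothetical count_properties helper that applies the counting rule at any depth. The string types for the nested fields are assumptions.

```python
schema = {
    "type": "object",
    "properties": {
        "employeeInfo": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "socialSecurityNumber": {"type": "string"},
            },
        },
    },
}


def count_properties(node) -> int:
    """Count every field defined inside any "properties" object."""
    if not isinstance(node, dict):
        return 0
    total = 0
    properties = node.get("properties")
    if isinstance(properties, dict):
        total += len(properties)
        total += sum(count_properties(sub) for sub in properties.values())
    total += count_properties(node.get("items"))  # array item schemas
    for sub in node.get("anyOf", []):
        total += count_properties(sub)
    return total


print(count_properties(schema))  # 4
```

A count like this can be compared against the recommended limit of 30 before submitting a schema.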
Reduce JSON Schema Complexity
In general, keep JSON schemas as simple as they need to be. Find the balance between being specific and descriptive and being concise. Recommendations:
- Start with only a few fields in your JSON schema and then add more as needed.
- Keep property names and enum names short but descriptive.
- Flatten nested arrays.
- Reduce the number of optional properties.
- Reduce the number of valid values for enums.
When using extraction model extract-20251024, if the API determines that the JSON schema is too complex, the API will fall back to extract-20250930 and attempt extraction with that model.
