Overview
To extract fields from parsed documents, include a JSON extraction schema that defines what structured data should be extracted from Markdown content. The extraction schema is a JSON Schema object that determines which key-value pairs are extracted and how they are structured.
When using the agentic-doc library, you can also pass a Pydantic model class instead of a JSON schema. For more information, go to Extract Data with the Legacy Library.
If you are running as a Snowflake native application, Snowflake has additional requirements when using Local Processing. For more information, go to Parse Documents with Snowflake Local Processing.
Schema Structure
The extraction schema must be a valid JSON object that follows the JSON Schema Draft 2020-12 specification. The API validates your schema before processing and returns an error if the schema is invalid.
Basic Structure
The extraction schema must adhere to the following schema structure.
Schema Type
The top-level element of the schema must be `object`.
Field Names
Define each field you want to extract as a property in the `properties` object. Field names can contain letters, numbers, underscores, and hyphens.
Use descriptive and specific field names that clearly indicate what data should be extracted. For example, use:
- `invoice_number` instead of `number`
- `patient_name` instead of `name`
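As a minimal sketch of the structure described so far, the following Python snippet writes a schema as a dict and checks two of the stated rules: the top-level type is object, and field names use only letters, numbers, underscores, and hyphens. The helper `check_basic_structure` and the ASCII-only regular expression are illustrative assumptions, not part of the API, which performs its own validation.

```python
import json
import re

# Assumed reading of the documented rule: ASCII letters, digits, underscores, hyphens.
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def check_basic_structure(schema: dict) -> list[str]:
    """Return a list of problems found at the top level of a schema (hypothetical helper)."""
    problems = []
    if schema.get("type") != "object":
        problems.append("top-level type must be 'object'")
    for name in schema.get("properties", {}):
        if not FIELD_NAME_PATTERN.fullmatch(name):
            problems.append(f"invalid field name: {name!r}")
    return problems


# Example schema with descriptive field names.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "patient_name": {"type": "string"},
    },
}

problems = check_basic_structure(schema)
print(json.dumps(schema, indent=2))
print(problems)
```

A local check like this only catches the simplest mistakes before a request is sent; it does not replace the validation the API runs.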
Supported Field Types
Define field types using the `type` property. Supported types are:
- array: Lists of items. The array can include these data types: string, enum, date, boolean, number, integer, and object. To see an example, go to Arrays.
- boolean: True/false values
- integer: Whole numbers
- null: Represents a null value. Used with other types to make fields nullable (e.g., `type: ["string", "null"]`). For more information, go to Nullable Fields.
- number: Numeric values, including decimals. If extracting monetary values, or if you need to perform calculations on the extracted value, use this type. When using the agentic-doc library, this is `float`.
- object: Nested structures. To see an example, go to Nested Objects.
- string: Text values
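To illustrate the list above, here is a rough sketch that scans the top-level properties of a schema and reports any `type` value that is not one of the seven listed types. The function name `unsupported_types` and the sample field names are hypothetical. The deliberate mistake in the sample is writing `float`, the Python-side name, where the JSON schema expects `number`.

```python
SUPPORTED_TYPES = {"array", "boolean", "integer", "null", "number", "object", "string"}


def unsupported_types(schema: dict) -> set[str]:
    """Collect top-level `type` values that are not in the documented list (hypothetical helper)."""
    found = set()
    for field in schema.get("properties", {}).values():
        declared = field.get("type", [])
        # `type` may be a single string or a list such as ["string", "null"].
        names = [declared] if isinstance(declared, str) else declared
        found.update(name for name in names if name not in SUPPORTED_TYPES)
    return found


schema = {
    "type": "object",
    "properties": {
        "total_due": {"type": "number"},
        "page_count": {"type": "integer"},
        "is_paid": {"type": "boolean"},
        "notes": {"type": ["string", "null"]},
        "late_fee": {"type": "float"},  # mistake: should be "number"
    },
}

print(unsupported_types(schema))
```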
Field Descriptions
The `description` property is optional, but providing detailed descriptions helps the API identify and extract the correct data from your documents. The more descriptive and specific your field names and descriptions are, the more accurately the API can identify the correct data.
Include the following details in your descriptions:
- Exactly what data to extract
- Any formatting requirements (e.g., “in USD”, “as YYYY-MM-DD”)
- What to include or exclude (e.g., “excluding tax”, “including area code”)
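The guidance above might translate into a schema like the following sketch, written as a Python dict and serialized to JSON. The field names and description wording are invented for illustration; each description states what to extract, a format, or what to exclude.

```python
import json

# Hypothetical fields for a patient intake form.
schema = {
    "type": "object",
    "properties": {
        "patient_name": {
            "type": "string",
            "description": "Full name of the patient as written on the form, excluding titles such as Dr. or Mr.",
        },
        "date_of_birth": {
            "type": "string",
            "description": "Patient date of birth, formatted as YYYY-MM-DD",
        },
        "copay_amount": {
            "type": "number",
            "description": "Copay amount in USD, excluding tax",
        },
    },
}

# Serialize the dict to the JSON text that would be sent as the extraction schema.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```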
Schema Examples
Simple Fields
Use simple fields to extract basic, standalone values from a document. This example extracts key information from a patient intake form:
Restrict Possible Values (Use Enum)
Use `enum` to restrict the extracted value to a specific set of allowed values. Define the field as `type: "string"` and use the `enum` property to list the allowed values.
For example, when extracting data from bank statements, you can limit the account type to only “Premium Checking” or “Standard Checking”:
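A minimal sketch of such a schema, expressed as a Python dict, could look like this. The field name `account_type`, its description, and the helper `is_allowed` are assumptions made for illustration.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "account_type": {
            "type": "string",
            "enum": ["Premium Checking", "Standard Checking"],
            "description": "Type of bank account shown on the statement",
        }
    },
}


def is_allowed(value: str) -> bool:
    """Check a value against the enum list (hypothetical local check)."""
    return value in schema["properties"]["account_type"]["enum"]


print(json.dumps(schema, indent=2))
print(is_allowed("Premium Checking"))
```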
Nullable Fields
Use nullable fields when a value may or may not be present in the document. Define nullable fields using a type array that includes `null` as one of the allowed types.
For example, to allow an optional Middle Name field that can be either a string or null, use the following schema.
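One way to write this, sketched as a Python dict, is shown here. The field name `middle_name` and the helper `matches_nullable_string` are illustrative assumptions; in Python, a JSON `null` is represented as `None`.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "middle_name": {
            # Type array: the extracted value may be a string or null.
            "type": ["string", "null"],
            "description": "Middle name of the person, if present",
        }
    },
}


def matches_nullable_string(value) -> bool:
    """Return True if a value is a string or None (hypothetical local check)."""
    return value is None or isinstance(value, str)


print(json.dumps(schema, indent=2))
```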
Arrays
Use arrays to extract lists of similar items from a document. This example extracts individual charges from a utility bill:
Type Arrays with Complex Types
When defining fields that can accept multiple data types (called a “type array”), do not include `object` or `array` as options. Instead, use the `anyOf` keyword to specify multiple schema definitions. This approach prevents validation conflicts and provides clearer schema definitions.
For example, if a field can contain either a number or an object, use `anyOf`.
Correct:
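The following sketch shows the recommended `anyOf` form next to the type-array form to avoid, both as Python dicts. The field contents (`amount`, `currency`) and the helper `has_complex_type_array` are hypothetical.

```python
# Recommended: each alternative is its own schema under anyOf.
correct_field = {
    "anyOf": [
        {"type": "number"},
        {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "currency": {"type": "string"},
            },
        },
    ]
}

# To avoid: a type array that lists "object" (or "array") as an option.
incorrect_field = {"type": ["number", "object"]}


def has_complex_type_array(field: dict) -> bool:
    """Return True if a field's type array includes object or array (hypothetical check)."""
    declared = field.get("type")
    return isinstance(declared, list) and any(t in ("object", "array") for t in declared)


print(has_complex_type_array(correct_field))
print(has_complex_type_array(incorrect_field))
```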
Nested Objects
Use nested objects to extract hierarchical data from a document. The extraction schema supports up to five nested levels. This example extracts structured information from an invoice:
Supported Number of Properties
For optimal performance, include no more than 30 properties in your extraction schema. Performance may degrade as the number of properties increases.
What Counts as a Property?
A property is a key-value pair in an object and is defined using the `properties` keyword.
The schema below has 4 total properties: 1 top-level property (`employeeInfo`) that organizes the data, plus 3 nested properties (`name`, `address`, `socialSecurityNumber`) that contain the actual extracted values.
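The counting rule can be sketched in Python as follows, using a schema built from the property names mentioned above. The function `count_properties` is a hypothetical helper that counts keys under `properties` at every level of nested objects; it does not look inside array `items`, and how the API counts those is not stated here.

```python
RECOMMENDED_MAX_PROPERTIES = 30  # recommended limit stated in this section


def count_properties(schema: dict) -> int:
    """Count keys under `properties`, including those of nested objects (hypothetical helper)."""
    total = 0
    for field in schema.get("properties", {}).values():
        total += 1  # the property itself
        total += count_properties(field)  # properties of a nested object, if any
    return total


schema = {
    "type": "object",
    "properties": {
        "employeeInfo": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "socialSecurityNumber": {"type": "string"},
            },
        }
    },
}

total = count_properties(schema)
print(total, total <= RECOMMENDED_MAX_PROPERTIES)
```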
Match Schema to Document Structure
Design your schema to align with how information appears in the source documents. For example, if you’re extracting a field from the top of a document and another from the bottom, list them in that order in your schema.
Schema Validation
The API validates your schema in two ways:
Before Extraction
The API checks that your schema is valid JSON and conforms to JSON Schema specifications. Invalid schemas return a 422 error:
After Extraction
The API validates that extracted data matches your schema definition. If the extracted data doesn’t conform to the schema, the API returns a 206 status with a validation error:
Unsupported Keywords
The JSON extraction schema does not support the following keywords:
- `allOf`
- `not`
- `dependentRequired`
- `dependentSchemas`
- `if`
- `then`
- `else`
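Before submitting a schema, a rough local scan for these keywords might look like the sketch below. The function `find_unsupported_keywords` is hypothetical and is not part of the API. It treats keys directly under `properties` as field names rather than keywords, so a field that happens to be called `if` is not flagged.

```python
UNSUPPORTED_KEYWORDS = {
    "allOf", "not", "dependentRequired", "dependentSchemas", "if", "then", "else",
}


def find_unsupported_keywords(node, path="$"):
    """Return the paths of unsupported keywords found in a schema (hypothetical helper)."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in UNSUPPORTED_KEYWORDS:
                hits.append(f"{path}.{key}")
            if key == "properties" and isinstance(value, dict):
                # Keys under `properties` are field names, not schema keywords.
                for name, sub_schema in value.items():
                    hits.extend(find_unsupported_keywords(sub_schema, f"{path}.properties.{name}"))
            else:
                hits.extend(find_unsupported_keywords(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_unsupported_keywords(item, f"{path}[{index}]"))
    return hits


schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "not": {"enum": ["unknown"]}},
        "if": {"type": "string"},  # a field named "if", not the keyword
    },
}

print(find_unsupported_keywords(schema))
```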