Overview

To extract fields from parsed documents, include a JSON extraction schema that defines what structured data should be extracted from Markdown content. The extraction schema is a JSON Schema object that determines which key-value pairs are extracted and how they are structured.
When using the agentic-doc library, you can also pass a Pydantic model class instead of a JSON schema. For more information, go to Extract Data with the Legacy Library.
If you run extraction as a Snowflake native application, Snowflake imposes additional requirements when you use Local Processing. For more information, go to Parse Documents with Snowflake Local Processing.

Schema Structure

The extraction schema must be a valid JSON object that follows the JSON Schema Draft 2020-12 specification. The API validates your schema before processing and returns an error if the schema is invalid.

Basic Structure

The extraction schema must follow this structure:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Description of what to extract"
    }
  },
  "required": ["field_name"]
}

Schema Type

The top-level type of the schema, set with the type keyword, must be object:
  "type": "object"

Field Names

Define each field you want to extract as a property in the properties object. Field names can contain letters, numbers, underscores, and hyphens. Use descriptive and specific field names that clearly indicate what data should be extracted. For example, use:
  • invoice_number instead of number
  • patient_name instead of name
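
The naming rule above can be checked before a schema is submitted. The following is a minimal sketch, not part of the API: the helper name invalid_field_names is hypothetical, and the pattern assumes that "letters" means ASCII letters.

```python
import re

# Assumption: "letters" means ASCII letters; digits, underscores, and hyphens are also allowed.
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def invalid_field_names(schema: dict) -> list[str]:
    """Return top-level property names that contain disallowed characters."""
    return [
        name
        for name in schema.get("properties", {})
        if not FIELD_NAME_PATTERN.fullmatch(name)
    ]


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "patient name": {"type": "string"},  # contains a space, so it is flagged
    },
}
print(invalid_field_names(schema))  # ['patient name']
```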

Supported Field Types

Define field types using the type property. Supported types are:
  • array: Lists of items. The array can include these data types: string, enum, date, boolean, number, integer, and object. To see an example, go to Arrays.
  • boolean: True/false values
  • integer: Whole numbers
  • null: Represents a null value. Used with other types to make fields nullable (e.g., type: ["string", "null"]). For more information, go to Nullable Fields.
  • number: Numeric values, including decimals. Use this type for monetary values or for any value you need to perform calculations on. When using the agentic-doc library, this corresponds to float.
  • object: Nested structures. To see an example, go to Nested Objects.
  • string: Text values
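
As an illustration of the list above, a schema can be scanned locally for type values that are not in the supported set (for example, "float" or "date" used as a type). This is a sketch under the assumption that only properties and items need to be walked; unsupported_types is a hypothetical helper, not an API function.

```python
SUPPORTED_TYPES = {"array", "boolean", "integer", "null", "number", "object", "string"}


def unsupported_types(schema: dict, path: str = "$") -> list[tuple[str, str]]:
    """Return (path, type) pairs for every type value outside SUPPORTED_TYPES."""
    problems = []
    declared = schema.get("type")
    if declared is None:
        declared_types = []
    elif isinstance(declared, list):
        declared_types = declared  # a type array such as ["string", "null"]
    else:
        declared_types = [declared]
    for type_name in declared_types:
        if type_name not in SUPPORTED_TYPES:
            problems.append((path, type_name))
    # Recurse into object properties and array items.
    for name, sub_schema in schema.get("properties", {}).items():
        problems.extend(unsupported_types(sub_schema, f"{path}.{name}"))
    if isinstance(schema.get("items"), dict):
        problems.extend(unsupported_types(schema["items"], f"{path}[]"))
    return problems


schema = {
    "type": "object",
    "properties": {
        "patient_name": {"type": "string"},
        "copay": {"type": "float"},  # should be "number"
    },
}
print(unsupported_types(schema))  # [('$.copay', 'float')]
```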

Field Descriptions

The description property is optional, but providing detailed descriptions helps the API identify and extract the correct data from your documents. The more descriptive and specific your field names and descriptions are, the more accurately the API can identify the correct data. Include the following details in your descriptions:
  • Exactly what data to extract
  • Any formatting requirements (e.g., “in USD”, “as YYYY-MM-DD”)
  • What to include or exclude (e.g., “excluding tax”, “including area code”)
{
  "type": "object",
  "properties": {
    "total_amount": {
      "type": "number",
      "description": "Total amount in USD, excluding tax"
    }
  }
}

Schema Examples

Simple Fields

Use simple fields to extract basic, standalone values from a document. This example extracts key information from a patient intake form:
{
  "type": "object",
  "properties": {
    "patient_name": {
      "type": "string",
      "title": "Patient Name",
      "description": "The name of the patient"
    },
    "doctor": {
      "type": "string",
      "title": "Doctor",
      "description": "Primary care physician of the patient"
    },
    "copay": {
      "type": "number",
      "title": "Patient Copay",
      "description": "Copay that the patient is required to pay before services are rendered"
    }
  },
  "required": ["patient_name"]
}

Restrict Possible Values (Use Enum)

Use enum to restrict the extracted value to a specific set of allowed values. Define the field as type: "string" and use the enum property to list the allowed values. For example, when extracting data from bank statements, you can limit the account type to only “Premium Checking” or “Standard Checking”:
{
  "type": "object",
  "required": [
    "account_type"
  ],
  "properties": {
    "account_type": {
      "type": "string",
      "enum": [
        "Premium Checking",
        "Standard Checking"
      ],
      "description": "Bank account type"
    }
  }
}

Nullable Fields

Use nullable fields when a value may or may not be present in the document. Define nullable fields using a type array that includes null as one of the allowed types. For example, to allow an optional middle_name field that can be either a string or null, use the following schema:
{
  "type": "object",
  "properties": {
    "first_name": {
      "type": "string",
      "description": "Patient's first name"
    },
    "middle_name": {
      "type": ["string", "null"],
      "description": "Patient's middle name, if provided"
    },
    "last_name": {
      "type": "string",
      "description": "Patient's last name"
    }
  },
  "required": ["first_name", "last_name"]
}
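
If you build schemas in Python, a small helper can add null to a field's type array. This is a sketch only: make_nullable is a hypothetical name, and it is intended for primitive types such as string or number, not for object or array fields.

```python
def make_nullable(field: dict) -> dict:
    """Return a copy of a field definition whose type array also allows null."""
    declared = field["type"]
    types = list(declared) if isinstance(declared, list) else [declared]
    if "null" not in types:
        types.append("null")
    return {**field, "type": types}


middle_name = make_nullable(
    {"type": "string", "description": "Patient's middle name, if provided"}
)
print(middle_name["type"])  # ['string', 'null']
```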

Arrays

Use arrays to extract lists of similar items from a document. This example extracts individual charges from a utility bill:
{
  "type": "object",
  "properties": {
    "charges": {
      "type": "array",
      "title": "Charges",
      "description": "List of charges on the utility bill",
      "items": {
        "type": "object",
        "properties": {
          "charge_type": {
            "type": "string",
            "title": "Charge Type",
            "description": "Type of charge (e.g., electricity, gas, water)"
          },
          "amount": {
            "type": "number",
            "title": "Amount",
            "description": "Charge amount in USD"
          },
          "usage": {
            "type": "string",
            "title": "Usage",
            "description": "Usage amount with unit (e.g., '450 kWh', '25 CCF')"
          }
        }
      }
    }
  }
}

Type Arrays with Complex Types

When defining fields that can accept multiple data types (called a “type array”), do not include object or array as options. Instead, use the anyOf keyword to specify multiple schema definitions. This approach prevents validation conflicts and provides clearer schema definitions. For example, if a field can contain either a number or an object, use anyOf.
Correct:
{
  "type": "object",
  "properties": {
    "field1": {"type": "string"},
    "field2": {
      "anyOf": [
        {"type": "number"},
        {"type": "object"}
      ]
    }
  },
  "required": ["field1", "field2"]
}
Incorrect:
{
  "type": "object",
  "properties": {
    "field1": {"type": "string"},
    "field2": {"type": ["number", "object"]}
  },
  "required": ["field1", "field2"]
}
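
The rewrite from the incorrect form to the correct form can be automated. The sketch below uses a hypothetical helper, rewrite_complex_type_arrays, and handles only bare type arrays; it does not move keywords such as properties into the matching anyOf branch.

```python
import json

COMPLEX_TYPES = {"object", "array"}


def rewrite_complex_type_arrays(schema: dict) -> dict:
    """Return a copy in which type arrays containing object or array become anyOf."""
    fixed = dict(schema)
    declared = fixed.get("type")
    if isinstance(declared, list) and COMPLEX_TYPES & set(declared):
        del fixed["type"]
        fixed["anyOf"] = [{"type": type_name} for type_name in declared]
    if isinstance(fixed.get("properties"), dict):
        fixed["properties"] = {
            name: rewrite_complex_type_arrays(sub_schema)
            for name, sub_schema in fixed["properties"].items()
        }
    if isinstance(fixed.get("items"), dict):
        fixed["items"] = rewrite_complex_type_arrays(fixed["items"])
    return fixed


incorrect = {
    "type": "object",
    "properties": {
        "field1": {"type": "string"},
        "field2": {"type": ["number", "object"]},
    },
    "required": ["field1", "field2"],
}
corrected = rewrite_complex_type_arrays(incorrect)
print(json.dumps(corrected["properties"]["field2"]))
# {"anyOf": [{"type": "number"}, {"type": "object"}]}
```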

Nested Objects

Use nested objects to extract hierarchical data from a document. The extraction schema supports up to five nested levels. This example extracts structured information from an invoice:
{
  "type": "object",
  "properties": {
    "invoice": {
      "type": "object",
      "properties": {
        "number": {
          "type": "string",
          "description": "Invoice number"
        },
        "date": {
          "type": "string",
          "description": "Invoice date"
        },
        "total": {
          "type": "number",
          "description": "Total amount"
        }
      }
    }
  }
}
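
To check a schema against the nesting limit, you can measure its depth locally. How the API counts levels is not specified here, so this sketch assumes that each object nested below the root counts as one level and that objects inside array items count the same way. The names nesting_depth and MAX_NESTED_LEVELS are hypothetical.

```python
MAX_NESTED_LEVELS = 5  # limit stated above; the counting convention is an assumption


def nesting_depth(schema: dict) -> int:
    """Return how many levels of objects are nested below this schema."""
    depth = 0
    for sub_schema in schema.get("properties", {}).values():
        if sub_schema.get("type") == "object":
            depth = max(depth, 1 + nesting_depth(sub_schema))
        elif sub_schema.get("type") == "array":
            items = sub_schema.get("items")
            if isinstance(items, dict) and items.get("type") == "object":
                depth = max(depth, 1 + nesting_depth(items))
    return depth


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {
                "number": {"type": "string"},
                "total": {"type": "number"},
            },
        }
    },
}
print(nesting_depth(invoice_schema))  # 1
print(nesting_depth(invoice_schema) <= MAX_NESTED_LEVELS)  # True
```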

Supported Number of Properties

For optimal performance, include no more than 30 properties in your extraction schema. Performance may degrade as the number of properties increases.

What Counts as a Property?

A property is a key-value pair in an object and is defined using the properties keyword. The schema below has 4 total properties: 1 top-level property (employeeInfo) that organizes the data, plus 3 nested properties (name, address, socialSecurityNumber) that contain the actual extracted values.
{
  "type": "object",
  "title": "Payroll Document Field Extraction Schema",
  "required": [
    "employeeInfo"
  ],
  "properties": {
    "employeeInfo": {
      "type": "object",
      "title": "Employee Information",
      "required": [
        "name",
        "address",
        "socialSecurityNumber"
      ],
      "properties": {
        "name": {
          "type": "string",
          "title": "Employee Name",
          "description": "Full name of the employee."
        },
        "address": {
          "type": "string",
          "title": "Employee Address",
          "description": "Mailing address of the employee."
        },
        "socialSecurityNumber": {
          "type": "string",
          "title": "Social Security Number",
          "description": "Employee's Social Security Number."
        }
      },
      "description": "Key identifying and contact information for the employee."
    }
  },
  "description": "Schema for extracting high-value tabular and form-like fields from a payroll-related markdown document."
}
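
The counting rule can be expressed as a short recursive function. This is a sketch with a hypothetical helper name, count_properties; it counts every key under every properties keyword, and it assumes that properties defined inside array items count toward the total as well.

```python
def count_properties(schema: dict) -> int:
    """Count every key defined under a properties keyword, at any depth."""
    total = 0
    for sub_schema in schema.get("properties", {}).values():
        total += 1 + count_properties(sub_schema)
    items = schema.get("items")
    if isinstance(items, dict):
        total += count_properties(items)  # assumption: array item properties also count
    return total


# Condensed version of the payroll schema above.
payroll_schema = {
    "type": "object",
    "properties": {
        "employeeInfo": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "socialSecurityNumber": {"type": "string"},
            },
        }
    },
}
print(count_properties(payroll_schema))  # 4
```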

Match Schema to Document Structure

Design your schema to align with how information appears in the source documents. For example, if you’re extracting a field from the top of a document and another from the bottom, list them in that order in your schema.

Schema Validation

The API performs validation at two stages:

Before Extraction

The API checks that your schema is valid JSON and conforms to JSON Schema specifications. Invalid schemas return a 422 error:
{
  "error": "Invalid JSON schema provided for fields_schema."
}

After Extraction

The API validates that extracted data matches your schema definition. If the extracted data doesn’t conform to the schema, the API returns a 206 status with a validation error:
{
  "error": "Field validation error: Extracted schema does not match the provided fields schema: ValidationError :: Expected string, got integer"
}
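
Client code can branch on these two status codes. The sketch below is a hypothetical helper that takes a status code and a parsed JSON body; it makes no request itself, and the handling of any other status code is an assumption because this page documents only 422 and 206.

```python
def describe_extraction_result(status_code: int, body: dict) -> str:
    """Summarize an extraction response based on the status codes described above."""
    message = body.get("error", "no error message provided")
    if status_code == 422:
        # The schema itself was rejected, so extraction did not run.
        return f"Schema rejected before extraction: {message}"
    if status_code == 206:
        # Extraction ran, but the extracted data did not match the schema.
        return f"Extracted data failed schema validation: {message}"
    # Assumption: other status codes carry no schema validation error.
    return f"No schema validation error reported (status {status_code})"


print(describe_extraction_result(
    422, {"error": "Invalid JSON schema provided for fields_schema."}
))
```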

Unsupported Keywords

The JSON extraction schema does not support the following keywords:
  • allOf
  • not
  • dependentRequired
  • dependentSchemas
  • if
  • then
  • else
Using these keywords results in this error:
Keyword 'KEY' is not supported
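
To catch unsupported keywords before sending a schema, you can scan it locally. This is a minimal sketch with a hypothetical helper name, find_unsupported_keywords. It walks properties, items, and anyOf only, and it does not treat field names inside properties as keywords, so a field that happens to be named "not" is not flagged.

```python
UNSUPPORTED_KEYWORDS = {
    "allOf", "not", "dependentRequired", "dependentSchemas", "if", "then", "else",
}


def find_unsupported_keywords(schema: dict, path: str = "$") -> list[str]:
    """Return the paths of unsupported keywords used anywhere in the schema."""
    found = []
    for key, value in schema.items():
        if key in UNSUPPORTED_KEYWORDS:
            found.append(f"{path}.{key}")
        elif key == "properties" and isinstance(value, dict):
            # Keys here are field names, not keywords; only their schemas are checked.
            for name, sub_schema in value.items():
                if isinstance(sub_schema, dict):
                    found.extend(
                        find_unsupported_keywords(sub_schema, f"{path}.properties.{name}")
                    )
        elif key == "items" and isinstance(value, dict):
            found.extend(find_unsupported_keywords(value, f"{path}.items"))
        elif key == "anyOf" and isinstance(value, list):
            for index, sub_schema in enumerate(value):
                if isinstance(sub_schema, dict):
                    found.extend(
                        find_unsupported_keywords(sub_schema, f"{path}.anyOf[{index}]")
                    )
    return found


schema = {
    "type": "object",
    "properties": {
        "not": {"type": "string"},  # a field named "not" is allowed
        "status": {"type": "string", "not": {"enum": ["unknown"]}},  # keyword: flagged
    },
}
print(find_unsupported_keywords(schema))  # ['$.properties.status.not']
```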