Overview

To extract fields from parsed documents, provide a JSON schema that defines what structured data should be extracted from Markdown content. The extraction schema is a JSON Schema object that determines which key-value pairs are extracted and how they are structured. This article provides requirements and best practices for creating JSON schemas for extraction. For information about extraction models and model versions, go to Extraction Model Versions.
When using the library, you can also pass a Pydantic model class instead of a JSON schema. For more information, go to Python Library.
When running as a Snowflake native application, additional Snowflake requirements apply to Local Processing. For more information, go to Parse Documents with Snowflake Local Processing.

Basic Structure

The JSON schema must follow this structure:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Description of what to extract"
    }
  },
  "required": ["field_name"]
}
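
As a minimal sketch, the structure above can be checked locally in Python before a schema is submitted. The helper name check_basic_structure is hypothetical and not part of any library, and the check that every required name is defined in properties is an assumed sanity check rather than a documented rule.

```python
import json

def check_basic_structure(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic shape looks right."""
    problems = []
    # Documented rule: the top-level type must be "object".
    if schema.get("type") != "object":
        problems.append('top-level "type" must be "object"')
    properties = schema.get("properties")
    if not isinstance(properties, dict):
        problems.append('"properties" must be an object')
        properties = {}
    # Assumed sanity check (not a documented rule): required names should be defined.
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(f'required field "{name}" is not defined in "properties"')
    return problems

schema = json.loads("""
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Description of what to extract"
    }
  },
  "required": ["field_name"]
}
""")

print(check_basic_structure(schema))
```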

Schema Type

The top-level type property must be "object":
{
  "type": "object",
  "properties": {}
}

Field Names

Define each field you want to extract as a property in the properties object. Field names can contain letters, numbers, underscores, and hyphens. Use descriptive and specific field names that clearly indicate what data should be extracted. For example, use:
  • invoice_number instead of number
  • patient_name instead of name
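
The character rule for field names can be sketched as a regular expression check. This example assumes that "letters" means ASCII letters, which the text above does not state, and it inspects only top-level properties. The helper name invalid_field_names is hypothetical.

```python
import re

# Letters, numbers, underscores, and hyphens (ASCII letters are an assumption).
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")

def invalid_field_names(schema: dict) -> list[str]:
    """Return top-level property names that contain other characters."""
    return [
        name
        for name in schema.get("properties", {})
        if not FIELD_NAME_PATTERN.fullmatch(name)
    ]

example = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "patient-name": {"type": "string"},
        "total amount": {"type": "number"},  # contains a space
        "tax%": {"type": "number"},          # contains a percent sign
    },
}

print(invalid_field_names(example))
```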

Supported Field Types

Define field types using the type property. Supported types are:
  • array: Lists of items. The array can include these data types: string, enum, date, boolean, number, integer, and object. To see an example, go to Arrays.
  • boolean: True/false values
  • integer: Whole numbers
  • number: Numeric values, including decimals. If extracting monetary values, or if you need to perform calculations on the extracted value, use this type. When using the agentic-doc library, this is float.
  • object: Nested structures. To see an example, go to Nested Objects.
  • string: Text values
When using extraction model extract-20250930, you can also use null. For more information, go to Nullable Fields.

Field Descriptions

The description property is optional, but providing detailed descriptions helps the API identify and extract the correct data. The more specific your field names and descriptions, the more accurate the extraction. Include the following details in your descriptions:
  • Exactly what data to extract
  • Any formatting requirements (e.g., “in USD”, “as YYYY-MM-DD”)
  • What to include or exclude (e.g., “excluding tax”, “including area code”)
{
  "type": "object",
  "properties": {
    "total_amount": {
      "type": "number",
      "description": "Total amount in USD, excluding tax"
    }
  }
}

Schema Examples

Basic Example

This example shows a simple schema that extracts key information from a patient intake form:
{
  "type": "object",
  "properties": {
    "patient_name": {
      "type": "string",
      "title": "Patient Name",
      "description": "The name of the patient"
    },
    "doctor": {
      "type": "string",
      "title": "Doctor",
      "description": "Primary care physician of the patient"
    },
    "copay": {
      "type": "number",
      "title": "Patient Copay",
      "description": "Copay that the patient is required to pay before services are rendered"
    }
  },
  "required": ["patient_name"]
}

Restrict Values with Enum

Use the enum property to restrict the extracted value to a specific set of allowed values. Only string enums are supported. For example, when extracting data from bank statements, you can limit the account type to only “Premium Checking” or “Standard Checking”:
{
  "type": "object",
  "required": [
    "account_type"
  ],
  "properties": {
    "account_type": {
      "type": "string",
      "enum": [
        "Premium Checking",
        "Standard Checking"
      ],
      "description": "bank account type"
    }
  }
}
If you use a non-string data type with the enum property when using extraction model extract-20251024, the extraction request will fail.
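
Because a non-string enum causes the request to fail on extract-20251024, it may be worth scanning a schema for such enums before sending it. The following is a minimal sketch; the helper name non_string_enum_fields is hypothetical, and it only walks properties and nested objects (not array items or anyOf branches).

```python
def non_string_enum_fields(schema: dict, path: str = "") -> list[str]:
    """Return dotted paths of fields whose enum contains a non-string value."""
    bad = []
    for name, prop in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        # Flag the field if any allowed value is not a string.
        if any(not isinstance(value, str) for value in prop.get("enum", [])):
            bad.append(field_path)
        # Recurse into nested objects.
        if prop.get("type") == "object":
            bad.extend(non_string_enum_fields(prop, field_path))
    return bad

statement_schema = {
    "type": "object",
    "properties": {
        "account_type": {
            "type": "string",
            "enum": ["Premium Checking", "Standard Checking"],
        },
        "account": {
            "type": "object",
            "properties": {
                "tier": {"type": "integer", "enum": [1, 2, 3]},
            },
        },
    },
}

print(non_string_enum_fields(statement_schema))
```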

Arrays

Use arrays to extract lists of similar items from a document. This example extracts individual charges from a utility bill:
{
  "type": "object",
  "properties": {
    "charges": {
      "type": "array",
      "title": "Charges",
      "description": "List of charges on the utility bill",
      "items": {
        "type": "object",
        "properties": {
          "charge_type": {
            "type": "string",
            "title": "Charge Type",
            "description": "Type of charge (e.g., electricity, gas, water)"
          },
          "amount": {
            "type": "number",
            "title": "Amount",
            "description": "Charge amount in USD"
          },
          "usage": {
            "type": "string",
            "title": "Usage",
            "description": "Usage amount with unit (e.g., '450 kWh', '25 CCF')"
          }
        }
      }
    }
  }
}

Union Types (Multiple Allowed Types)

When a field can accept multiple data types, use the anyOf keyword instead of a type array (e.g., "type": ["number", "object"]) if one of the types is object or array. This prevents validation conflicts. For example, if a field can contain either a number or an object, use anyOf:
Correct:
{
  "type": "object",
  "properties": {
    "field1": {"type": "string"},
    "field2": {
      "anyOf": [
        {"type": "number"},
        {"type": "object"}
      ]
    }
  },
  "required": ["field1", "field2"]
}
Incorrect:
{
  "type": "object",
  "properties": {
    "field1": {"type": "string"},
    "field2": {"type": ["number", "object"]}
  },
  "required": ["field1", "field2"]
}
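
The rewrite from a type array to anyOf can be sketched in Python as follows. The helper name type_array_to_any_of is hypothetical. It leaves type arrays without object or array untouched, and it does not move keywords such as properties or items into the matching anyOf branch, so treat it as a starting point only.

```python
import copy

def type_array_to_any_of(prop: dict) -> dict:
    """Rewrite a type array as anyOf when it includes "object" or "array"."""
    types = prop.get("type")
    # Leave single types and simple type arrays (e.g. ["string", "null"]) alone.
    if not isinstance(types, list) or not {"object", "array"} & set(types):
        return prop
    rewritten = {
        key: copy.deepcopy(value) for key, value in prop.items() if key != "type"
    }
    rewritten["anyOf"] = [{"type": t} for t in types]
    return rewritten

print(type_array_to_any_of({"type": ["number", "object"]}))
```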

Nested Objects

Use nested objects to extract hierarchical data from a document. The extraction schema supports up to five nested levels. This example extracts structured information from an invoice:
{
  "type": "object",
  "properties": {
    "invoice": {
      "type": "object",
      "properties": {
        "number": {
          "type": "string",
          "description": "Invoice number"
        },
        "date": {
          "type": "string",
          "description": "Invoice date"
        },
        "total": {
          "type": "number",
          "description": "Total amount"
        }
      }
    }
  }
}
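
To stay within the five-level limit, the nesting depth of a schema can be estimated locally. This sketch rests on an assumption about how levels are counted: the top-level object counts as level 0 and each object nested inside a properties object adds one level. The text above does not define the counting rule, and objects inside array items are not counted here. The helper name nested_levels is hypothetical.

```python
def nested_levels(schema: dict) -> int:
    """Count object levels nested below this schema (assumed counting rule)."""
    deepest = 0
    for prop in schema.get("properties", {}).values():
        if prop.get("type") == "object":
            deepest = max(deepest, 1 + nested_levels(prop))
    return deepest

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {
                "number": {"type": "string", "description": "Invoice number"},
                "date": {"type": "string", "description": "Invoice date"},
                "total": {"type": "number", "description": "Total amount"},
            },
        }
    },
}

MAX_NESTED_LEVELS = 5  # limit stated above
print(nested_levels(invoice_schema) <= MAX_NESTED_LEVELS)
```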

Nullable Fields

Some fields in your JSON schema may not have values in every document. For example, when processing patient intake forms from different healthcare providers, some forms may include a middle name field while others may not. To handle fields that may be missing, allow the API to return null for the field. The JSON schema syntax depends on which extraction model you are using:

extract-20251024: Use the Nullable Keyword

When using extraction model extract-20251024, add "nullable": true to any field that can be null. This example allows the middle name field to return either a string or null:
{
  "type": "object",
  "properties": {
    "first_name": {
      "type": "string",
      "description": "Patient's first name"
    },
    "middle_name": {
      "type": "string",
      "nullable": true,
      "description": "Patient's middle name, if provided"
    },
    "last_name": {
      "type": "string",
      "description": "Patient's last name"
    }
  },
  "required": ["first_name", "last_name"]
}

extract-20250930: Use the Null Type

When using extraction model extract-20250930, define nullable fields using a type array that includes "null". This example allows the middle name field to return either a string or null:
{
  "type": "object",
  "properties": {
    "first_name": {
      "type": "string",
      "description": "Patient's first name"
    },
    "middle_name": {
      "type": ["string", "null"],
      "description": "Patient's middle name, if provided"
    },
    "last_name": {
      "type": "string",
      "description": "Patient's last name"
    }
  },
  "required": ["first_name", "last_name"]
}
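
If the same schema has to work with both models, the two nullable syntaxes can be converted mechanically. The following is a minimal sketch that turns the extract-20251024 style ("nullable": true) into the extract-20250930 style (a type array that includes "null") for a single property. The helper name nullable_to_null_type is hypothetical.

```python
def nullable_to_null_type(prop: dict) -> dict:
    """Convert {"type": T, "nullable": true} into {"type": [T, "null"]}."""
    if prop.get("nullable") is not True:
        return dict(prop)  # nothing to convert; return a shallow copy
    converted = {key: value for key, value in prop.items() if key != "nullable"}
    declared = converted.get("type")
    if isinstance(declared, str):
        converted["type"] = [declared, "null"]
    return converted

middle_name = {
    "type": "string",
    "nullable": True,
    "description": "Patient's middle name, if provided",
}

print(nullable_to_null_type(middle_name))
```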

Schema Validation

The API performs validation at two stages:

Before Extraction

The API checks that your JSON schema is valid JSON and follows the required structure. Invalid schemas return a 422 error:
{
  "error": "Invalid JSON schema provided for fields_schema."
}

After Extraction

The API validates that extracted data matches your JSON schema. If the extracted data doesn’t conform to the schema, the API returns a 206 status and any successfully extracted data. Because the API returns at least partial results, the API call consumes credits.
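
Client code may want to branch on these two status codes. The sketch below maps a status code to a label without making any HTTP call. Only 422 and 206 come from the text above; treating the other 2xx codes as success and everything else as a generic error is an assumption, and the function name and labels are hypothetical.

```python
def interpret_extraction_status(status_code: int) -> str:
    """Map an HTTP status code from an extraction call to a coarse label."""
    if status_code == 422:
        # Schema was rejected before extraction.
        return "schema_rejected"
    if status_code == 206:
        # Extraction ran, but the result does not fully conform to the schema.
        return "partial_result"
    if 200 <= status_code < 300:
        return "ok"  # assumption: other 2xx codes mean full success
    return "other_error"  # assumption: anything else is handled generically

for code in (200, 206, 422):
    print(code, interpret_extraction_status(code))
```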

Keyword Support

Each extraction model supports different JSON Schema keywords.

extract-20251024: Supported Keywords

When using extraction model extract-20251024, the API supports only the following keywords in your JSON schema. All other keywords are ignored and won’t cause errors.
  • $defs
  • $ref
  • anyOf
  • description
  • enum (only string enums are supported)
  • format
  • items
  • maximum
  • maxItems
  • minimum
  • minItems
  • nullable
  • properties
  • propertyOrdering
  • required
  • title
  • type
The following commonly used keywords are not supported. This list is not exhaustive; any keyword not listed above is ignored.
  • $schema
  • allOf
  • oneOf
  • const
  • minLength
  • maxLength
  • uniqueItems
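
Since unsupported keywords are silently ignored, a local check can reveal constraints that will have no effect. The sketch below walks a schema and collects keywords that are not in the supported list for extract-20251024. The helper name ignored_keywords is hypothetical. Property names under properties and definition names under $defs are treated as names rather than keywords.

```python
SUPPORTED_KEYWORDS = {
    "$defs", "$ref", "anyOf", "description", "enum", "format", "items",
    "maximum", "maxItems", "minimum", "minItems", "nullable", "properties",
    "propertyOrdering", "required", "title", "type",
}

def ignored_keywords(schema: dict) -> set[str]:
    """Collect keywords in the schema that are not in the supported list."""
    found = set()
    for key, value in schema.items():
        if key not in SUPPORTED_KEYWORDS:
            found.add(key)
            continue
        # Recurse into subschemas; the keys of properties/$defs are names.
        if key in ("properties", "$defs") and isinstance(value, dict):
            for sub in value.values():
                if isinstance(sub, dict):
                    found |= ignored_keywords(sub)
        elif key == "items" and isinstance(value, dict):
            found |= ignored_keywords(value)
        elif key == "anyOf" and isinstance(value, list):
            for sub in value:
                if isinstance(sub, dict):
                    found |= ignored_keywords(sub)
    return found

candidate = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "tags": {"type": "array", "items": {"type": "string"}, "uniqueItems": True},
    },
}

print(sorted(ignored_keywords(candidate)))
```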

extract-20250930: Unsupported Keywords

When using extraction model extract-20250930, the API does not support the following keywords in your JSON schema. If your schema includes these keywords, the API returns an error: Keyword 'KEY' is not supported.
  • allOf
  • not
  • dependentRequired
  • dependentSchemas
  • if
  • then
  • else

Missing Fields

When the API cannot find a field defined in your JSON schema within the document, the response depends on the extraction model.

extract-20251024: Returns Null for Missing Fields

When using extraction model extract-20251024, the API returns missing values as null, even if the field is marked as required or not marked as nullable in your JSON schema. For example, if your JSON schema requires a First Name field but the submitted document does not contain a first name, the API returns null for that field.

extract-20250930: Inconsistent Behavior

When using extraction model extract-20250930, the API has inconsistent behavior when fields are missing. It may return null, 0, an empty string, or another value.
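
Because extract-20251024 returns null even for required fields, downstream code may need to check which required fields came back empty. The following is a minimal sketch; the helper name null_required_fields and the shape of the extraction result (a flat mapping of field name to value) are assumptions. For extract-20250930 a null check alone is not sufficient, since missing values may also appear as 0 or an empty string.

```python
def null_required_fields(schema: dict, extraction: dict) -> list[str]:
    """Return required top-level fields whose extracted value is None or absent."""
    return [
        name
        for name in schema.get("required", [])
        if extraction.get(name) is None
    ]

patient_schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "middle_name": {"type": "string", "nullable": True},
        "last_name": {"type": "string"},
    },
    "required": ["first_name", "last_name"],
}

# Hypothetical extraction result for a document with no first name.
result = {"first_name": None, "middle_name": None, "last_name": "Doe"}

print(null_required_fields(patient_schema, result))
```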

Best Practices and Recommendations

This section provides best practices for designing JSON schemas. While these aren’t requirements, following these recommendations may improve extraction accuracy.

Match Schema to Document Structure

Design your JSON schema to align with how information appears in the documents. For example, if you’re extracting a field from the top of a document and another from the bottom, list them in that order in your schema.

Limit the Number of Properties

For optimal performance, include no more than 30 properties in your JSON schema. Performance may degrade as the number of properties increases. Each field defined within a properties object counts as one property. The schema below has 4 total properties: 1 top-level property (employeeInfo) that organizes the data, plus 3 nested properties (name, address, socialSecurityNumber) that contain the actual extracted values.
{
  "type": "object",
  "title": "Payroll Document Field Extraction Schema",
  "required": [
    "employeeInfo"
  ],
  "properties": {
    "employeeInfo": {
      "type": "object",
      "title": "Employee Information",
      "required": [
        "name",
        "address",
        "socialSecurityNumber"
      ],
      "properties": {
        "name": {
          "type": "string",
          "title": "Employee Name",
          "description": "Full name of the employee."
        },
        "address": {
          "type": "string",
          "title": "Employee Address",
          "description": "Mailing address of the employee."
        },
        "socialSecurityNumber": {
          "type": "string",
          "title": "Social Security Number",
          "description": "Employee's Social Security Number."
        }
      },
      "description": "Key identifying and contact information for the employee."
    }
  },
  "description": "Schema for extracting high-value tabular and form-like fields from a payroll-related markdown document."
}
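
The counting rule above can be sketched as a recursive function: every field inside a properties object counts as one property, including the parent field that groups nested ones. The helper name count_properties is hypothetical, and counting properties inside array items is an assumption, since the text above does not cover arrays. The schema below is a condensed copy of the payroll example.

```python
def count_properties(schema: dict) -> int:
    """Count every field defined in any properties object of the schema."""
    total = 0
    for prop in schema.get("properties", {}).values():
        total += 1  # the field itself
        if isinstance(prop, dict):
            total += count_properties(prop)  # nested object fields
            items = prop.get("items")
            if isinstance(items, dict):
                total += count_properties(items)  # assumption: array items count too
    return total

payroll_schema = {
    "type": "object",
    "required": ["employeeInfo"],
    "properties": {
        "employeeInfo": {
            "type": "object",
            "required": ["name", "address", "socialSecurityNumber"],
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "socialSecurityNumber": {"type": "string"},
            },
        }
    },
}

RECOMMENDED_MAX_PROPERTIES = 30  # recommendation stated above
print(count_properties(payroll_schema))
```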

Reduce JSON Schema Complexity

In general, keep JSON schemas as simple as they need to be. Find the balance between being specific and descriptive and being concise. Recommendations:
  • Start with only a few fields in your JSON schema and then add more as needed.
  • Keep property names and enum names short but descriptive.
  • Flatten nested arrays.
  • Reduce the number of optional properties.
  • Reduce the number of valid values for enums.
When using extraction model extract-20251024, if the API determines that the JSON schema is too complex, the API will fall back to extract-20250930 and attempt extraction with that model.