When parsing a document, you can extract specific data that you define. This is helpful if you need to extract the same data from multiple documents. For example, if you work for a financial institution and need to extract the Total Income field from tens of thousands of loan applications, you can use the extraction feature to do that.

Classification

As part of the extraction process, you can classify documents and extract data based on each document's type. For example, let's say you work for a financial institution and want to extract one set of data from Loan Applications and another set of data from Income Statements. You can assign a class to each document, and then extract data based on that document's class. In the JSON schema, both in the Playground and when calling the API, use the enum keyword to list the document types.
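As a rough illustration of the enum keyword, here is a minimal sketch that builds a schema with a classification field. The field names (documentType, totalIncome) and the class labels are hypothetical, chosen to match the example above; they are not names the product requires.

```python
import json

# Hypothetical schema: "documentType" and "totalIncome" are illustrative names.
classification_schema = {
    "type": "object",
    "properties": {
        "documentType": {
            "type": "string",
            # The enum keyword lists the document classes to choose from.
            "enum": ["Loan Application", "Income Statement"],
            "description": "The type of document being parsed.",
        },
        "totalIncome": {
            "type": "number",
            "description": "Total income reported in the document.",
        },
    },
    "required": ["documentType"],
}

# Serialize to JSON so the schema can be pasted into a tool or sent in a request.
schema_json = json.dumps(classification_schema, indent=2)
print(schema_json)
```

Keeping the class labels in a single enum makes it easy to add a new document type later without touching the other fields.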

Get Started: Extraction Workflow

We recommend using the schema extraction wizard directly in our Playground to build and validate an extraction schema. You can then use that schema when parsing documents:
  1. Use the schema extraction wizard in our Playground to build a schema tailored to your documents.
  2. Choose a format to export the schema to: library or API.
  3. Include the schema when you call the parse function in the agentic-doc library or call the API.
You can also extract data in the Playground. We recommend doing this only for testing purposes, since the Playground isn’t designed to handle bulk document processing.

Supported Data Types

When creating an extraction schema, you can specify the following data types:
  • boolean
  • number: When using the library, this is float.
  • string
  • enum
  • date
  • integer
  • object
  • array: The array can include these data types: string, enum, date, boolean, number, integer, object.
These data types are only supported in the library and API:
  • byte
  • nested objects
  • list: This data type from the Python typing module is supported if the types within the list are valid.
  • union: This data type from the Python typing module is supported if the types within the union are valid.
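The rule for array contents can be checked before you submit a schema. The sketch below is a hypothetical helper, not part of the library. It assumes that enum and date fields are expressed on top of the JSON Schema string type, and that any item type not listed above (for example, an array of arrays) is unsupported.

```python
from typing import List

# Assumption: enum and date items are declared with "type": "string",
# so the allowed JSON Schema item types reduce to this set.
ALLOWED_ITEM_TYPES = {"string", "boolean", "number", "integer", "object"}


def unsupported_array_fields(schema: dict, path: str = "") -> List[str]:
    """Return dotted paths of array fields whose item type is not allowed."""
    problems = []
    for name, field in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        if field.get("type") == "array":
            items = field.get("items", {})
            if items.get("type") not in ALLOWED_ITEM_TYPES:
                # Also flags items that declare no type at all.
                problems.append(field_path)
            elif items.get("type") == "object":
                # Check arrays nested inside the item objects.
                problems.extend(unsupported_array_fields(items, field_path + "[]"))
        elif field.get("type") == "object":
            problems.extend(unsupported_array_fields(field, field_path))
    return problems


example_schema = {
    "type": "object",
    "properties": {
        "lineItems": {
            "type": "array",
            "items": {"type": "object", "properties": {"amount": {"type": "number"}}},
        },
        "matrix": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "number"}},
        },
    },
}

flagged = unsupported_array_fields(example_schema)
print(flagged)  # ['matrix']
```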

The Library and API Use Different Schema Formats

The library and the API use different schema formats. But no worries; you can build or upload a schema in the Playground and then choose which format to export it to!

Field Definition and Extraction Guidance

When you define the data to be extracted, you provide a Name for each field. You can also add an optional Description to give more context. Both the Name and Description serve as guidance for locating and extracting the right information from your documents. The more descriptive and specific your field names and descriptions are, the more accurately the correct data can be identified in your documents.
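To make the guidance concrete, the sketch below contrasts a vague field with a specific one and lists the fields that have no description. The helper fields_missing_description and both field names are hypothetical; this is a simple lint you could run on your own schema, not a feature of the product.

```python
from typing import List


def fields_missing_description(schema: dict, prefix: str = "") -> List[str]:
    """Return dotted paths of properties that have no non-empty description."""
    missing = []
    for name, field in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if not field.get("description", "").strip():
            missing.append(path)
        # Recurse so nested object fields are checked too.
        missing.extend(fields_missing_description(field, path))
    return missing


loan_schema = {
    "type": "object",
    "properties": {
        # Vague: a short name and no description.
        "income": {"type": "number"},
        # Specific: a descriptive name plus a description with context.
        "totalIncome": {
            "type": "number",
            "title": "Total Income",
            "description": "Applicant's total annual income as stated on the loan application.",
        },
    },
}

needs_attention = fields_missing_description(loan_schema)
print(needs_attention)  # ['income']
```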

Supported Number of Properties

For optimal performance, include no more than 30 properties in your extraction schema. Performance may degrade as the number of properties increases.

What Counts as a Property?

A property is a key-value pair in an object and is defined using the properties keyword. The schema below has 4 total properties: 1 top-level property (employeeInfo) that organizes the data, plus 3 nested properties (name, address, socialSecurityNumber) that contain the actual extracted values.
{
  "type": "object",
  "title": "Payroll Document Field Extraction Schema",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "required": [
    "employeeInfo"
  ],
  "properties": {
    "employeeInfo": {
      "type": "object",
      "title": "Employee Information",
      "required": [
        "name",
        "address",
        "socialSecurityNumber"
      ],
      "properties": {
        "name": {
          "type": "string",
          "title": "Employee Name",
          "description": "Full name of the employee."
        },
        "address": {
          "type": "string",
          "title": "Employee Address",
          "description": "Mailing address of the employee."
        },
        "socialSecurityNumber": {
          "type": "string",
          "title": "Social Security Number",
          "description": "Employee's Social Security Number."
        }
      },
      "description": "Key identifying and contact information for the employee."
    }
  },
  "description": "Schema for extracting high-value tabular and form-like fields from a payroll-related markdown document."
}
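If you want to check a schema against the 30-property guideline, a small recursive count is enough. The sketch below is a hypothetical helper, not part of the library. It counts every key under a properties keyword, and it assumes that fields defined inside array items count as well. The schema is a trimmed copy of the payroll example above.

```python
MAX_RECOMMENDED_PROPERTIES = 30  # guideline from "Supported Number of Properties"


def count_properties(schema: dict) -> int:
    """Count every key under a `properties` keyword, including nested ones."""
    total = 0
    for field in schema.get("properties", {}).values():
        total += 1 + count_properties(field)
    # Assumption: properties declared inside array items also count.
    items = schema.get("items")
    if isinstance(items, dict):
        total += count_properties(items)
    return total


# Trimmed copy of the payroll schema above (titles and descriptions omitted).
payroll_schema = {
    "type": "object",
    "properties": {
        "employeeInfo": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "socialSecurityNumber": {"type": "string"},
            },
        }
    },
}

total = count_properties(payroll_schema)
print(total)  # 4
if total > MAX_RECOMMENDED_PROPERTIES:
    print("Schema exceeds the recommended number of properties.")
```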