When parsing a document, you can extract the data fields that you specify. This is helpful when you need to extract the same data from many similar documents.

For example, if you work for a financial institution and need to extract the Total Income field from tens of thousands of loan applications, you can use the extraction feature to do that.

Classification

As part of the extraction process, you can classify documents and extract data based on each document’s type.

For example, suppose you work for a financial institution and want to extract one set of data from Loan Applications and another set from Income Statements. You can assign a class to each document, and then extract data based on that document’s class.

In the JSON schema, whether you build it in the Playground or pass it when calling the API, use the enum keyword to list the possible document types.
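
As a rough illustration of the enum keyword, the sketch below builds a schema in Python and prints it as JSON. The field names "Document Type" and "Total Income" and the two class labels are hypothetical, chosen to match the example above; the exact schema layout the service expects for classification may differ.

```python
import json

# Hypothetical field names and class labels, used for illustration only.
schema = {
    "type": "object",
    "properties": {
        "Document Type": {
            "type": "string",
            # enum restricts the value to one of the listed document types.
            "enum": ["Loan Application", "Income Statement"],
            "description": "The kind of document being parsed.",
        },
        "Total Income": {
            "type": "number",
            "description": "The total income reported on the document.",
        },
    },
}

# Serialize the schema so it can be pasted into the Playground or sent with a request.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Because enum is a standard JSON Schema keyword, the same structure can be checked with any JSON Schema tool before you use it.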

Extraction Workflows

Your extraction workflow depends on whether you use the API or the agentic-doc library.

If you want to extract data when calling the API:

  1. Create a JSON schema that identifies the fields you want to extract. You can use the guided workflow in the Playground to help you build the schema.
  2. Include the JSON schema in your API request.

If you want to extract data with the agentic-doc library:

  1. Define the fields you want to extract using Pydantic models directly in your code. You do not need to create a JSON schema.

You can also extract data in the Playground. We recommend doing this only for testing purposes, since the Playground isn’t designed to handle bulk document processing.
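
For the library workflow described above, a Pydantic model might look like the minimal sketch below. The class name, field names, and placeholder values are hypothetical. The commented-out parse call reflects an assumption about how the agentic-doc library accepts a model, so confirm the exact function and parameter names in the library's documentation.

```python
from pydantic import BaseModel, Field


class LoanApplicationFields(BaseModel):
    # Hypothetical fields; each description acts as extraction guidance.
    applicant_name: str = Field(description="The full name of the loan applicant.")
    total_income: float = Field(description="The Total Income amount on the application.")


# Assumption (not confirmed by this page): the library takes the model class
# when parsing, along the lines of:
#   from agentic_doc.parse import parse
#   results = parse("loan_application.pdf", extraction_model=LoanApplicationFields)

# Placeholder values (not real output) to show the shape of the extracted data.
example = LoanApplicationFields(applicant_name="Jane Doe", total_income=85000.0)
print(example.applicant_name, example.total_income)
```

Keeping the field definitions in a model means the names, types, and descriptions live in one place in your code.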

Field Definition and Extraction Guidance

When you define the data to be extracted, you provide a Name for each field. You can also add an optional Description to give more context. Both the Name and Description serve as guidance for locating and extracting the right information from your documents.

The more descriptive and specific your field names and descriptions are, the more accurately the correct data can be identified in your documents.

JSON Schema for Extraction

The JSON schema is a structured list of fields (such as names, dates, or numbers) that specifies which values to extract.

JSON Schema Format

The JSON schema for extraction must follow the format below.

{
  "type": "object",
  "properties": {
    "field name": {
      "type": "string",
      "description": "Add details to clarify what to extract, which is especially useful when the field name alone isn't specific enough. For example, enter 'Only include the first name' if the document also has fields for middle and last names."
    }
  }
}
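
If you maintain many fields, it can be convenient to generate a schema in this format rather than write it by hand. The helper below is a minimal sketch; build_extraction_schema and the sample field names are hypothetical and not part of any library.

```python
import json


def build_extraction_schema(fields):
    """Build a schema dict from (name, json_type, description) tuples.

    The description may be None, in which case it is left out of the schema.
    """
    properties = {}
    for name, json_type, description in fields:
        entry = {"type": json_type}
        if description:
            entry["description"] = description
        properties[name] = entry
    return {"type": "object", "properties": properties}


# Hypothetical fields for illustration.
schema = build_extraction_schema([
    ("First Name", "string", "Only include the first name."),
    ("Total Income", "number", None),
])
print(json.dumps(schema, indent=2))
```

The printed JSON has the same top-level "type" and "properties" keys as the format shown above.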

Sample JSON Schema

Here is a sample JSON schema that shows fields that could be extracted from an earnings statement.

{
  "type": "object",
  "properties": {
    "Net Pay": {
      "type": "number",
      "description": "The net pay in the \"this period\" column."
    },
    "Federal Income Tax": {
      "type": "string",
      "description": "The federal income tax in the \"this period\" column."
    },
    "Payroll Check Number": {
      "type": "number",
      "description": "The check number."
    }
  }
}
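
Once you have extracted values, you may want to confirm that they match the types declared in the schema. The sketch below checks a result against the types from the sample schema. It assumes the extracted data is available as a flat dictionary keyed by field name, which this page does not confirm, and the values shown are placeholders rather than real output.

```python
# Map JSON Schema type names to the Python types that represent them.
PYTHON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

# Field types copied from the sample schema (descriptions omitted for brevity).
sample_schema = {
    "type": "object",
    "properties": {
        "Net Pay": {"type": "number"},
        "Federal Income Tax": {"type": "string"},
        "Payroll Check Number": {"type": "number"},
    },
}


def find_type_mismatches(schema, extracted):
    """Return the names of fields that are missing or have the wrong type."""
    mismatches = []
    for name, spec in schema["properties"].items():
        if name not in extracted:
            mismatches.append(name)
            continue
        value = extracted[name]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            mismatches.append(name)
        elif not isinstance(value, PYTHON_TYPES[spec["type"]]):
            mismatches.append(name)
    return mismatches


# Placeholder result: the check number arrives as text instead of a number.
extracted = {"Net Pay": 1290.5, "Federal Income Tax": "40.60", "Payroll Check Number": "00123"}
problems = find_type_mismatches(sample_schema, extracted)
print(problems)
```

A check like this can flag fields whose declared type does not suit the document, for example an identifier with leading zeros that is better declared as a string.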