Chunk Definition

A chunk is a discrete element extracted from a document, such as a block of text, a table, or a figure. When you send a document to the API, it analyzes the content, breaks it down into meaningful elements, and returns each one as a chunk.

Each chunk includes structured data that describes the content of the chunk and the location of the chunk in the document. This structure makes it easier to understand the extracted data and use it for downstream tasks.
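One use of the location data is mapping a chunk back onto a page image. The sketch below is a minimal example, assuming the `l`, `t`, `r`, and `b` values in a chunk's grounding box (shown in the JSON examples later on this page) are fractions of the page width and height measured from the top-left corner. That reading is inferred from the sample values and is not stated on this page. The function name `box_to_pixels` and the page size are hypothetical.

```python
from typing import Dict, Tuple


def box_to_pixels(
    box: Dict[str, float], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Convert a grounding box to (left, top, right, bottom) in pixels.

    Assumption: l, t, r, b are fractions of the page size, measured
    from the top-left corner of the page.
    """
    left = round(box["l"] * page_width)
    top = round(box["t"] * page_height)
    right = round(box["r"] * page_width)
    bottom = round(box["b"] * page_height)
    return left, top, right, bottom


# Box values copied from the table chunk example on this page.
example_box = {
    "l": 0.04332852363586426,
    "t": 0.36108696460723877,
    "r": 0.8898441195487976,
    "b": 0.47052615880966187,
}

# Hypothetical page image size: 1000 x 2000 pixels.
pixel_box = box_to_pixels(example_box, page_width=1000, page_height=2000)
print(pixel_box)
```

The resulting tuple can be passed to an image library's crop function, provided the image was rendered from the page given in the grounding entry's `page` field.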

Extracted chunks are included in both the JSON and Markdown outputs.

Chunk Types

Each chunk is labeled with a chunk type (chunk_type), which identifies what kind of content it represents.

The chunk types returned by the API are:

Text

A text chunk type is an element that consists entirely of characters (letters and numbers), such as:

  • paragraphs
  • equations
  • code blocks
  • lists
  • handwritten text

Example: Paragraph

Here is an example of the API marking a paragraph as a text chunk:

Here is the rendered Markdown for that chunk:

Example: Lists

Here is an example of the API marking a list as a text chunk:

Here is the rendered Markdown for that chunk:

Table

A table chunk type is a grid of rows and columns containing data.

The API doesn’t require gridlines to be present, and typically interprets well-aligned sets of data as part of a table. For example, part of a receipt can be extracted as a table if the purchased items align with the costs.

Output

When a chunk is extracted, the chunk description is included in the text field. For table chunk types, the text field contains the table as HTML.

Here is an example JSON output for a table chunk:

    {
      "text": "<table><tbody><tr><td>1</td><td>Americano</td><td>$3.19</td></tr><tr><td>1</td><td>Almond Scone</td><td>$1.99</td></tr><tr><td>1</td><td>16oz Bottle Water</td><td>$2.99</td></tr></tbody></table>",
      "grounding": [
        {
          "box": {
            "l": 0.04332852363586426,
            "t": 0.36108696460723877,
            "r": 0.8898441195487976,
            "b": 0.47052615880966187
          },
          "page": 0
        }
      ],
      "chunk_type": "table",
      "chunk_id": "a2948002-f9b9-4afd-8534-8f2ff8947d74"
    }
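Because the table arrives as an HTML string, downstream code usually needs to turn it back into rows and cells. The following is a minimal sketch using only Python's standard-library `html.parser`. The names `TableRowParser` and `table_rows` are hypothetical helpers, not part of the API, and the sketch assumes a simple table without nested tables or merged cells.

```python
from html.parser import HTMLParser
from typing import List


class TableRowParser(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])  # tolerate a cell with no <tr>
            self.rows[-1].append("")  # start a new, empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # append text to the current cell


def table_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# The "text" value from the table chunk example above.
table_html = (
    "<table><tbody>"
    "<tr><td>1</td><td>Americano</td><td>$3.19</td></tr>"
    "<tr><td>1</td><td>Almond Scone</td><td>$1.99</td></tr>"
    "<tr><td>1</td><td>16oz Bottle Water</td><td>$2.99</td></tr>"
    "</tbody></table>"
)

rows = table_rows(table_html)
for row in rows:
    print(row)
```

Tables with `rowspan` or `colspan` attributes would need extra handling, since this sketch treats every cell as occupying exactly one grid position.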

Example: Receipt

Here is an example of the API marking receipt line items as a table chunk:

Here is the rendered Markdown for that chunk:

Example: Earnings Statement

Here is an example of the API marking part of an earnings statement as a table chunk:

Here is the rendered Markdown for that chunk:

Form

A form chunk type is a collection of key-value pairs, where each key is a field name and each value is the data entered for that field. Related pairs can be grouped into sets. Forms can include:

  • form fields
  • checkboxes
  • radio buttons

Output

When a chunk is extracted, the chunk description is included in the text field. For form chunk types, the text field contains the key-value pairs separated by line breaks (\n).

Here is an example JSON output for a form chunk:

    {
      "text": "Social Security Number: 999-99-9999\nTaxable Marital Status: Married\nExemptions/Allowances:\n  Federal: 3, $25 Additional Tax\n  State: 2\n  Local: 2",
      "grounding": [
        {
          "box": {
            "l": 0.1897532343864441,
            "t": 0.17260606586933136,
            "r": 0.39956772327423096,
            "b": 0.2404632717370987
          },
          "page": 0
        }
      ],
      "chunk_type": "form",
      "chunk_id": "6bfe9850-6a12-456e-b355-afe9bdd9b1f5"
    }
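The key-value text of a form chunk can be split back into a dictionary. The sketch below is one possible approach, not an official parser. It assumes that each line has the form `key: value`, and that indented lines belong to the most recent unindented key that has no value of its own, as in the example above. The function name `parse_form_text` is hypothetical.

```python
from typing import Dict, Optional, Union

FormValue = Union[str, Dict[str, str]]


def parse_form_text(text: str) -> Dict[str, FormValue]:
    """Parse "key: value" lines into a dict, nesting indented lines."""
    fields: Dict[str, FormValue] = {}
    group: Optional[Dict[str, str]] = None  # currently open nested group

    for line in text.split("\n"):
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that have no key-value separator
        key, value = key.strip(), value.strip()

        if line[:1].isspace() and group is not None:
            # Indented line: attach it to the open group.
            group[key] = value
        elif value:
            # Plain top-level pair; this closes any open group.
            fields[key] = value
            group = None
        else:
            # Top-level key with no value: assume it opens a group.
            group = {}
            fields[key] = group

    return fields


# The "text" value from the form chunk example above.
form_text = (
    "Social Security Number: 999-99-9999\n"
    "Taxable Marital Status: Married\n"
    "Exemptions/Allowances:\n"
    "  Federal: 3, $25 Additional Tax\n"
    "  State: 2\n"
    "  Local: 2"
)

fields = parse_form_text(form_text)
print(fields["Exemptions/Allowances"])
```

Under these assumptions, a top-level field that was left blank on the form is indistinguishable from a group heading and is returned as an empty dictionary.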

Example: Form with Checkboxes and Radio Buttons

Here is an example of the API marking part of a loan form as a form chunk:

Here is the rendered Markdown for that chunk:

Page Header

A page_header chunk type is a set of text in the top, bottom, or side margins of a document, including:

  • page headers
  • page footers
  • page numbers
  • handwritten notes in margins
  • line numbers on one side of a page

The page_header chunk type will be deprecated in a later version.

Example: Header and Page Number

Here is an example of the API marking a header and page number as a page_header chunk:

Here is the rendered Markdown for that chunk:

Figure

A figure chunk type is an element that contains visual or graphical non-text content, including:

  • logos
  • pictures
  • graphs (bar graphs, line graphs, etc.)
  • flowcharts
  • diagrams
  • QR codes
  • barcodes
  • stamps
  • signatures
  • ID cards

Example: Medical Imaging

Here is an example of the API marking a pathology image as a figure chunk:

Here is the rendered Markdown for that chunk:

Example: Bar Chart

Here is an example of the API marking a bar chart as a figure chunk:

Here is the rendered Markdown for that chunk: