Chunk Types
Chunk Definition
A chunk is a discrete element extracted from a document, such as a block of text, a table, or a figure. When you send a document to the API, it analyzes the content, breaks it down into meaningful elements, and returns each one as a chunk.
Each chunk includes structured data that describes the content of the chunk and the location of the chunk in the document. This structure makes it easier to understand the extracted data and use it for downstream tasks.
Extracted chunks are included in both the JSON and Markdown outputs.
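As a minimal sketch of how this structure might be consumed downstream, the snippet below groups chunks by type. The `chunks` key, the `page` field, and the sample values are assumptions made up for illustration, not the API's confirmed response schema; `chunk_type` and `text` are the names used on this page.

```python
import json
from collections import defaultdict

# Hypothetical payload: the "chunks" key and "page" field are assumptions;
# "chunk_type" and "text" are the names this page uses.
sample_response = json.loads("""
{
  "chunks": [
    {"chunk_type": "text", "text": "Quarterly summary", "page": 0},
    {"chunk_type": "table", "text": "<table><tr><td>Q1</td><td>10</td></tr></table>", "page": 0},
    {"chunk_type": "figure", "text": "Bar chart of revenue by quarter", "page": 1}
  ]
}
""")


def group_by_type(chunks):
    """Group chunk text by chunk_type, preserving document order within each type."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["chunk_type"]].append(chunk["text"])
    return dict(grouped)


grouped = group_by_type(sample_response["chunks"])
print(sorted(grouped))
```

Grouping by type first makes it straightforward to route each kind of content to its own handler, for example tables to an HTML parser and text to a search index.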
Chunk Types
Each chunk is labeled with a chunk type (chunk_type), which identifies what kind of content it represents.
The API returns the following chunk types:
Text
A text chunk is an element that consists entirely of characters (letters and numbers), such as:
- paragraphs
- equations
- code blocks
- lists
- handwritten text
Example: Paragraph
Here is an example of the API marking a paragraph as a text chunk:
Here is the rendered Markdown for that chunk:
Example: Lists
Here is an example of the API marking a list as a text chunk:
Here is the rendered Markdown for that chunk:
Table
A table chunk is a grid of rows and columns containing data.
The API doesn’t require gridlines to be present, and typically interprets well-aligned sets of data as part of a table. For example, part of a receipt can be extracted as a table if the purchased items align with the costs.
Output
When a chunk is extracted, the chunk description is included in the text object. For table chunk types, the chunk is returned as HTML.
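Because table content arrives as HTML, it usually has to be parsed before it can be used as rows and columns. The sketch below does that with Python's standard-library `html.parser`. The chunk dictionary and its HTML are invented for illustration, and `TableRows` and `table_rows` are hypothetical helper names, not part of the API.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical table chunk; the HTML content is made up for illustration.
chunk = {
    "chunk_type": "table",
    "text": "<table><tr><th>Item</th><th>Cost</th></tr>"
            "<tr><td>Coffee</td><td>3.50</td></tr></table>",
}
rows = table_rows(chunk["text"])
print(rows)
```

This sketch ignores merged cells (`rowspan`, `colspan`) and nested tables; a dedicated HTML library may be a better fit if the extracted tables use them.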
Here is an example JSON output for a table chunk:
Example: Receipt
Here is an example of the API marking receipt line items as a table chunk:
Here is the rendered Markdown for that chunk:
Example: Earnings Statement
Here is an example of the API marking part of an earnings statement as a table chunk:
Here is the rendered Markdown for that chunk:
Form
A form chunk is a collection of sets of key-value pairs, where keys denote field names and values hold the data. Forms can include:
- form fields
- checkboxes
- radio buttons
Output
When a chunk is extracted, the chunk description is included in the text object. For form chunk types, the chunk is returned as key-value pairs separated by line breaks (\n).
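A form chunk's text can be turned into a dictionary by splitting on line breaks. The sketch below assumes each line separates key and value with a colon; that delimiter, the sample field names, and the `parse_form_text` helper are assumptions for illustration, since this page only specifies that pairs are separated by `\n`.

```python
def parse_form_text(text, sep=":"):
    """Split a form chunk's text into a dict of field name -> value.

    Assumes one key-value pair per line and `sep` between key and value.
    Lines without the separator are kept as keys with an empty value.
    """
    fields = {}
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        key, found, value = line.partition(sep)
        if found:
            fields[key.strip()] = value.strip()
        else:
            fields[line.strip()] = ""
    return fields


# Hypothetical form chunk; the field names and values are made up.
chunk = {
    "chunk_type": "form",
    "text": "Name: Jane Doe\nLoan Type: Fixed\nCo-applicant: No",
}
fields = parse_form_text(chunk["text"])
print(fields)
```

Using `partition` splits on the first separator only, so values that themselves contain a colon (such as times) stay intact.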
Here is an example JSON output for a form chunk:
Example: Form with Checkboxes and Radio Buttons
Here is an example of the API marking part of a loan form as a form chunk:
Here is the rendered Markdown for that chunk:
Page Header
A page_header chunk is text in the top, bottom, or side margins of a document, including:
- page headers
- page footers
- page numbers
- handwritten notes in margins
- line numbers on one side of a page
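Margin content is often noise for downstream tasks such as summarization or search. As a rough sketch, chunks of this type could be filtered out before the body text is assembled. The sample chunks and the `body_chunks` helper are hypothetical.

```python
# Chunk types treated as margin content in this sketch.
MARGIN_TYPES = {"page_header"}


def body_chunks(chunks):
    """Return only the chunks that are not margin content."""
    return [c for c in chunks if c["chunk_type"] not in MARGIN_TYPES]


# Hypothetical chunks; the text values are made up for illustration.
chunks = [
    {"chunk_type": "page_header", "text": "Annual Report 2023"},
    {"chunk_type": "text", "text": "Revenue grew in every region."},
    {"chunk_type": "page_header", "text": "Page 4"},
]
body_text = "\n\n".join(c["text"] for c in body_chunks(chunks))
print(body_text)
```

Keeping the excluded types in a single set means the filter can be adjusted in one place if the type names change.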
The page_header chunk type will be deprecated in a later version.
Example: Header and Page Number
Here is an example of the API marking a header and page number as a page_header chunk:
Here is the rendered Markdown for that chunk:
Figure
A figure chunk is an element that contains visual or graphical non-text content, including:
- logos
- pictures
- graphs (bar graphs, line graphs, etc.)
- flowcharts
- diagrams
- QR codes
- barcodes
- stamps
- signatures
- ID cards
Example: Medical Imaging
Here is an example of the API marking a pathology image as a figure chunk:
Here is the rendered Markdown for that chunk:
Example: Bar Chart
Here is an example of the API marking a bar chart as a figure chunk:
Here is the rendered Markdown for that chunk: