Chunk Types
Chunk Definition
A chunk is a discrete element extracted from a document, such as a block of text, a table, or a figure.
Chunk Overview
When you send a document to the API, it analyzes the content on each page, breaks it down into meaningful elements, and returns each one as a chunk.
Each chunk includes structured data that describes the content of the chunk and the location of the chunk in the document. This structure makes it easier to understand the extracted data and use it for downstream tasks.
Extracted chunks are included in both the JSON and Markdown outputs.
Semantic Chunking
The API uses semantic chunking, which means it intelligently groups content based on meaning rather than just layout or formatting.
Instead of splitting documents at arbitrary points like fixed lengths or paragraph breaks, the API identifies coherent units of information (like complete ideas, logical sections, or related data) and extracts them as individual chunks.
Semantic chunking improves the relevance and usability of the extracted content, especially in downstream tasks like search, retrieval, and analysis.
Why Do We Create Chunks?
Chunking makes downstream tasks faster, more accurate, and easier to scale. It serves several key purposes:
- Enables downstream apps to process large documents efficiently: Chunking allows applications like RAG systems and LLMs to index and retrieve smaller, meaningful segments instead of full documents. This helps avoid input size constraints, such as token limits.
- Improves retrieval granularity: Smaller, semantically meaningful units allow for more accurate and relevant results in downstream tasks like question answering and summarization.
- Supports downstream semantic search and embeddings: Well-structured chunks provide better inputs for embedding and make it easier to index and retrieve information during search.
- Maintains human readability: Chunking reflects how a human would naturally read the document, maintaining the visual and logical relationships between elements on the page.
Chunk Types
Each chunk is labeled with a chunk type (chunk_type), which identifies what kind of content it represents.
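Because every chunk carries a chunk_type label, downstream code can route chunks by type. The following is a minimal sketch, assuming the chunks have already been parsed into Python dictionaries. Only the chunk_type and text names come from this page; the sample data, the flat dictionary layout, and the group_by_type helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical chunks: only "chunk_type" and "text" are named on this page.
# The real response may nest these fields differently and include more data.
chunks = [
    {"chunk_type": "text", "text": "Quarterly results"},
    {"chunk_type": "table", "text": "<table><tr><td>Q1</td></tr></table>"},
    {"chunk_type": "text", "text": "Revenue grew in Q1."},
    {"chunk_type": "marginalia", "text": "Page 3"},
]


def group_by_type(chunks):
    """Group chunk dicts by their chunk_type value, preserving document order."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["chunk_type"]].append(chunk)
    return dict(grouped)


grouped = group_by_type(chunks)
for chunk_type, items in grouped.items():
    print(chunk_type, len(items))
```

Grouping this way lets an application send tables to one handler and body text to another without inspecting the content itself.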
The chunk types returned by the API are:
Text
A text chunk type is an element that consists entirely of characters (letters and numbers), such as:
- paragraphs
- titles and headings
- lists
- form fields
- checkboxes
- radio buttons
- equations
- code blocks
- handwritten text
Output for Key-Value Pairs
If the text content has key-value pairs, like form fields, the extracted data will be returned as key-value pairs separated by line breaks (\n).
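A text chunk that holds form fields can be turned back into a dictionary by splitting on line breaks. This is a rough sketch, not the API's own parser: the page only says that pairs are separated by \n, so the colon used to split each key from its value is an assumption, and both the sample string and the parse_key_value_text helper are hypothetical.

```python
def parse_key_value_text(text, separator=":"):
    """Split newline-separated "key<separator>value" lines into a dict.

    The separator between key and value is an assumption; this page only
    states that the pairs themselves are separated by line breaks.
    """
    pairs = {}
    for line in text.split("\n"):
        if separator not in line:
            continue  # skip lines that are not key-value pairs
        key, value = line.split(separator, 1)
        pairs[key.strip()] = value.strip()
    return pairs


# Hypothetical chunk text for a short form.
sample = "Name: Jane Doe\nDate of Birth: 01/02/1990\nMember ID: 12345"
fields = parse_key_value_text(sample)
print(fields["Name"])
```

Splitting on the first separator only keeps values that themselves contain a colon, such as times, intact.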
Here is an example JSON output for a text chunk that has form fields:
Example: Paragraph
Here is an example of the API marking a paragraph as a text chunk:
Here is the rendered Markdown for that chunk:
Example: Lists
Here is an example of the API marking a list as a text chunk:
Here is the rendered Markdown for that chunk:
Table
A table chunk type is a grid of rows and columns containing data.
The API doesn’t require gridlines to be present, and typically interprets well-aligned sets of data as part of a table. For example, part of a receipt can be extracted as a table if the purchased items align with the costs.
Output
When a chunk is extracted, the chunk description is included in the text object. For table chunk types, the chunk is returned as HTML.
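Since table chunks arrive as HTML, they usually need to be converted to rows before further processing. The sketch below uses only Python's standard html.parser module and assumes a simple table built from tr, th, and td tags. The sample HTML and the TableRows and table_html_to_rows names are hypothetical, and real output may include attributes or merged cells that this does not handle.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_html_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical HTML standing in for the text of a table chunk.
sample_html = (
    "<table><tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Coffee</td><td>3.50</td></tr></table>"
)
rows = table_html_to_rows(sample_html)
print(rows)
```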
Here is an example JSON output for a table chunk:
Example: Receipt
Here is an example of the API marking receipt line items as a table chunk:
Here is the rendered Markdown for that chunk:
Example: Earnings Statement
Here is an example of the API marking part of an earnings statement as a table chunk:
Here is the rendered Markdown for that chunk:
Marginalia
A marginalia chunk type is a set of text in the top, bottom, or side margins of a document, including:
- page headers
- page footers
- page numbers
- handwritten notes in margins
- line numbers on one side of a page
Example: Header and Page Number
Here is an example of the API marking a header and page number as a page_header chunk:
Here is the rendered Markdown for that chunk:
Figure
A figure chunk type is an element that contains visual or graphical non-text content, including:
- logos
- pictures
- graphs (bar graphs, line graphs, etc.)
- flowcharts
- diagrams
- QR codes
- barcodes
- stamps
- signatures
- ID cards
Example: Medical Imaging
Here is an example of the API marking a pathology image as a figure chunk:
Here is the rendered Markdown for that chunk:
Example: Bar Chart
Here is an example of the API marking a bar chart as a figure chunk:
Here is the rendered Markdown for that chunk:
Deprecated Chunk Types
Some chunk types were deprecated and consolidated into other types. These changes were introduced in the library v0.2.1, and will be rolled out to the API on Thursday, May 22.
These chunk types were consolidated into marginalia:
- page_header
- page_footer
- page_number
These chunk types were consolidated into text:
- title
- form
- key_value
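Code that still checks for the deprecated names can be made tolerant of both old and new responses by normalizing each chunk type before comparing it. The mapping below restates the consolidation described above; the DEPRECATED_TO_CURRENT and normalize_chunk_type names are hypothetical and are not part of the library or the API.

```python
# Deprecated chunk types mapped to the types they were consolidated into,
# as listed in this section.
DEPRECATED_TO_CURRENT = {
    "page_header": "marginalia",
    "page_footer": "marginalia",
    "page_number": "marginalia",
    "title": "text",
    "form": "text",
    "key_value": "text",
}


def normalize_chunk_type(chunk_type):
    """Return the current chunk type for a possibly deprecated one."""
    return DEPRECATED_TO_CURRENT.get(chunk_type, chunk_type)


for name in ("page_footer", "title", "table"):
    print(name, "->", normalize_chunk_type(name))
```

Types that were not deprecated, such as table and figure, pass through unchanged.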
Action Required When Using Library
If you use the library and your scripts or workflows use any of the deprecated chunk types, update your code to use the new types.
How the library handles the deprecated chunk types depends on the version you’re using:
- Upgrade to v0.2.1 to use the new chunk types.
- If using v0.0.13 to v0.1.3, the marginalia type doesn’t exist and will fall back to page_header.
- If using v0.0.12 or earlier, the code will NOT work after May 22.
Action Required When Calling the API Directly
If you call the API directly and your scripts or workflows use any of the deprecated chunk types, update your code to use the new types.
We are making these same changes (consolidating the chunk types) to the API on Thursday, May 22.
Starting May 22, the API will stop using the deprecated types in the response. If your code uses the deprecated chunk types, the code will no longer work.