Parsing Basics
The library provides three functions for parsing documents:
parse_documents
: Use to parse one or more documents and return the output as objects.
parse_and_save_documents
: Use to parse one or more documents and save the output as JSON files.
parse_and_save_document
: Use to parse a single document. You can have the output returned as objects or saved as a JSON file.
Parse Documents and Return Results as Objects
Use the parse_documents function to parse one or more documents and return the output as objects. You can optionally save the visual groundings to a directory.
When to Use: Immediate Processing
Because the parse_documents function returns extracted data as objects, it is best for immediate downstream processing, integrations with other systems, or interactive environments (like Jupyter Notebook).
Use this when:
- You’re running the script in a notebook or web service that will immediately process, transform, or display the data.
- You need to pass the data to another function or microservice as part of a larger pipeline.
- You’re working in an interactive environment (like a Jupyter Notebook or a web-based UI).
- You want to avoid writing to disk due to permission issues or cloud function constraints.
Sample Script
This script parses two PDFs and returns the results as both Markdown and JSON objects. This example uses documents hosted at URLs, but local files are also supported.
Function Signature
Parameters
Here are the parameters for the parse_documents function:
documents
: List of paths to documents or URLs pointing to documents.
include_marginalia
: If True, includes page_header chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
include_metadata_in_markdown
: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
grounding_save_dir
: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
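The parameters above suggest a call shape like the one sketched below. This is a stand-in written from the parameter list only, not the library's actual signature: the name parse_documents_stub, the keyword-only layout, and the type hints are assumptions, and the body merely echoes its arguments so that the defaults are visible.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Sequence, Union

# Assumption: documents may be given as strings (paths or URLs) or Path objects.
DocumentSource = Union[str, Path]


def parse_documents_stub(
    documents: Sequence[DocumentSource],
    *,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    grounding_save_dir: Optional[DocumentSource] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in: records the arguments instead of parsing anything."""
    return {
        "documents": [str(d) for d in documents],
        "include_marginalia": include_marginalia,
        "include_metadata_in_markdown": include_metadata_in_markdown,
        "grounding_save_dir": (
            None if grounding_save_dir is None else str(grounding_save_dir)
        ),
    }


# Only `documents` is required; the optional parameters fall back to their defaults.
call = parse_documents_stub(["https://example.com/a.pdf", "local/b.pdf"])
print(call["include_marginalia"])  # True
```

Passing include_marginalia=False or a grounding_save_dir value to the stub shows how the optional arguments would be supplied by keyword.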
Returns
The parse_documents function returns a list of ParsedDocument objects. For more information, go to ParsedDocument Object.
Raises
The parse_documents function can raise these errors:
FileNotFoundError
: This error is raised if the provided file path does not exist.
ValueError
: This error is raised if the file type is not supported or a URL is invalid.
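Since each of these errors is caused by a single bad input, one way to keep a batch going is to parse documents one at a time and collect the failures. The sketch below is an illustration, not part of the library: parse_each is a hypothetical helper, and parse_fn stands for any callable that behaves as described above (it takes a list of documents, returns a list of results, and raises FileNotFoundError or ValueError). The fake_parse function exists only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def parse_each(
    parse_fn: Callable[[List[str]], list],
    documents: Sequence[str],
) -> Tuple[list, Dict[str, str]]:
    """Call parse_fn once per document, collecting results and failures."""
    results: list = []
    failures: Dict[str, str] = {}
    for doc in documents:
        try:
            results.extend(parse_fn([doc]))
        except FileNotFoundError as exc:
            failures[doc] = f"not found: {exc}"
        except ValueError as exc:
            failures[doc] = f"unsupported or invalid: {exc}"
    return results, failures


def fake_parse(documents: List[str]) -> list:
    """Stand-in for a real parse function, used only for this demo."""
    for doc in documents:
        if doc.startswith("missing"):
            raise FileNotFoundError(doc)
        if not doc.endswith(".pdf"):
            raise ValueError(f"unsupported file type: {doc}")
    return [f"parsed:{doc}" for doc in documents]


results, failures = parse_each(fake_parse, ["a.pdf", "missing.pdf", "notes.xyz"])
print(results)  # ['parsed:a.pdf']
print(sorted(failures))  # ['missing.pdf', 'notes.xyz']
```

Parsing one document per call isolates failures, at the cost of any efficiency the library may gain from receiving the whole list at once.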
Parse Documents and Save Results as JSON Files
Use the parse_and_save_documents function if you want to parse one or more documents and save the output as JSON files in a specified directory.
You can optionally save the visual groundings to a directory.
When to Use: Persistence and Auditing
Because the parse_and_save_documents function saves the output as JSON files, it is best for use cases that require persistent storage or auditing.
Use this when:
- You want to store the extracted output for future reference, manual review, or archiving.
- You have a pipeline that uses batch processing or file watchers.
- You need to debug the output separately or share it with others.
- Your process includes a manual review step, such as human-in-the-loop verification.
Sample Script
This script parses two PDFs and saves the output as JSON files in this directory: ./parsed_results. This example uses documents hosted at URLs, but local files are also supported.
Function Signature
Parameters
Here are the parameters for the parse_and_save_documents function:
documents
: List of paths to documents or URLs pointing to documents.
result_save_dir
: The directory where the JSON files will be saved.
include_marginalia
: If True, includes page_header chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
include_metadata_in_markdown
: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
grounding_save_dir
: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
Returns
The parse_and_save_documents function returns a list of file paths to the JSON files that the function created. The JSON files contain the structured data for the extracted elements.
The file paths are sorted in the same order as the input file paths. Each JSON file name is the original file name with a timestamp appended. For example, if the input file is “document.pdf”, the output file could be “document_20250313_070305.json”.
Example return:
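A return value of this shape can be illustrated without running the parser. The sketch below is not the library's code: expected_result_paths is a hypothetical helper, the %Y%m%d_%H%M%S timestamp format is inferred from the “document_20250313_070305.json” example, and using a single timestamp for every file is a simplification. It builds one JSON path per input, in input order.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import List, Sequence


def expected_result_paths(
    documents: Sequence[str],
    result_save_dir: str,
    timestamp: datetime,
) -> List[str]:
    """Build '<stem>_<timestamp>.json' paths, preserving input order."""
    stamp = timestamp.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    save_dir = PurePosixPath(result_save_dir)
    return [
        str(save_dir / f"{PurePosixPath(doc).stem}_{stamp}.json")
        for doc in documents
    ]


paths = expected_result_paths(
    ["document.pdf", "reports/invoice.pdf"],
    "./parsed_results",
    datetime(2025, 3, 13, 7, 3, 5),  # one shared timestamp, for simplicity
)
for path in paths:
    print(path)
# parsed_results/document_20250313_070305.json
# parsed_results/invoice_20250313_070305.json
```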
Raises
FileNotFoundError
: This error is raised if the provided file path does not exist.
ValueError
: This error is raised if the file type is not supported or a URL is invalid.
Parse One Document
Use the parse_and_save_document function if you want to parse one document. You can either return the output as objects or save the output as a JSON file in a specified directory.
You can optionally save the visual groundings to a directory.
Sample Script
This script parses a PDF and saves the output as a JSON file in this directory: ./parsed_results. This example uses a document hosted at a URL, but local files are also supported.
Function Signature
Parameters
document
: The path to a document or URL pointing to a document.
result_save_dir
: The directory where the JSON file will be saved. (Optional)
include_marginalia
: If True, includes page_header chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
include_metadata_in_markdown
: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
grounding_save_dir
: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
Returns
If the result_save_dir parameter is included, the function returns the file path to the JSON file that the function created.
The JSON file contains the structured data for the extracted elements. The JSON file name is the original file name with a timestamp appended. For example, if the input file is “document.pdf”, the output file could be “document_20250313_070305.json”.
If the result_save_dir parameter is not included, the function returns a list of ParsedDocument objects. For more information, go to ParsedDocument Object.
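Because the return type depends on whether result_save_dir is passed, calling code may need to branch on it. The following is a minimal sketch, assuming the saved-file result arrives as a str or pathlib.Path and the in-memory result as a list; describe_result is a hypothetical helper, not a library function.

```python
from pathlib import Path
from typing import Union


def describe_result(result: Union[str, Path, list]) -> str:
    """Summarize a result that is either a saved-file path or a list of objects."""
    if isinstance(result, (str, Path)):
        # result_save_dir was given: the result points at the JSON file on disk.
        return f"saved to {Path(result).name}"
    if isinstance(result, list):
        # result_save_dir was omitted: the result holds parsed objects in memory.
        return f"{len(result)} object(s) in memory"
    raise TypeError(f"unexpected result type: {type(result).__name__}")


print(describe_result(Path("parsed_results/document_20250313_070305.json")))
# saved to document_20250313_070305.json
print(describe_result([object(), object()]))
# 2 object(s) in memory
```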
Raises
FileNotFoundError
: This error is raised if the provided file path does not exist.
ValueError
: This error is raised if the file type is not supported or a URL is invalid.
ParsedDocument Objects
A ParsedDocument object contains the data extracted from a document.