The library provides three functions for parsing documents:

  • parse_documents: Use if you want to parse one or more documents and return the output as objects.
  • parse_and_save_documents: Use if you want to parse one or more documents and save the output as JSON files.
  • parse_and_save_document: Use if you want to parse only one file. You can have the output returned as an object or saved as a JSON file.

Parse Documents and Return Results as Objects

Use the parse_documents function to parse one or more documents and return the output as objects. You can optionally save the visual groundings to a directory by setting the grounding_save_dir parameter.

When to Use: Immediate Processing

Because the parse_documents function returns extracted data as objects, this function is best for immediate downstream processing, integrations with other systems, or interactive environments (like Jupyter Notebook).

Use this when:

  • You’re running the script in a notebook or web service that will immediately process, transform, or display the data.
  • You need to pass the data to another function or microservice as part of a larger pipeline.
  • You’re working in an interactive environment (like a Jupyter Notebook or a web-based UI).
  • You want to avoid writing to disk due to permission issues or cloud function constraints.

Sample Script

This script parses two PDFs, then prints the first result as Markdown and as structured chunks. This example uses documents hosted at URLs, but local files are also supported.

from agentic_doc.parse import parse_documents

# Parse documents from URLs
results = parse_documents(["https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf", "https://www.rbcroyalbank.com/banking-services/_assets-custom/pdf/eStatement.pdf"])
parsed_doc = results[0]

# Get the extracted data as markdown
print(parsed_doc.markdown)  

# Get the extracted data as structured chunks of content in a JSON schema
print(parsed_doc.chunks)  
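
Because parse_documents returns one result per input document, a loop is often more useful than indexing the first element. The sketch below summarizes each result using only the markdown and chunks attributes shown in the sample script. The helper name summarize_results is hypothetical, and the SimpleNamespace objects are stand-ins so that the snippet runs without calling the parsing service.

```python
from types import SimpleNamespace


def summarize_results(results):
    """Return one (markdown_length, chunk_count) tuple per parsed document."""
    return [(len(doc.markdown), len(doc.chunks)) for doc in results]


# Stand-in objects that mimic the two attributes used in the sample script.
# With the real library, pass the list returned by parse_documents instead.
fake_results = [
    SimpleNamespace(markdown="# Score Report", chunks=["title"]),
    SimpleNamespace(markdown="Statement\nBalance", chunks=["header", "table"]),
]

for index, (md_len, n_chunks) in enumerate(summarize_results(fake_results)):
    print(f"document {index}: {md_len} markdown characters, {n_chunks} chunks")
```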

Function Signature

def parse_documents(
    documents: list[str | Path | Url],
    *,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    grounding_save_dir: str | Path | None = None
) -> list[ParsedDocument]:
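
The bare * in the signature is standard Python syntax for keyword-only parameters: every option after documents has to be passed by name. The sketch below illustrates this with parse_documents_stub, a hypothetical stand-in that copies only the parameter layout and does no parsing.

```python
def parse_documents_stub(
    documents,
    *,
    include_marginalia=True,
    include_metadata_in_markdown=True,
    grounding_save_dir=None,
):
    """Stand-in with the same parameter layout; returns the options it received."""
    return {
        "documents": list(documents),
        "include_marginalia": include_marginalia,
        "include_metadata_in_markdown": include_metadata_in_markdown,
        "grounding_save_dir": grounding_save_dir,
    }


# Options are accepted when passed by name.
options = parse_documents_stub(["report.pdf"], include_marginalia=False)
print(options["include_marginalia"])

# Passing an option positionally fails because of the bare `*`.
try:
    parse_documents_stub(["report.pdf"], False)
except TypeError as exc:
    print(f"TypeError: {exc}")
```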

Parameters

Here are the parameters for the parse_documents function:

  • documents: List of paths to documents or URLs pointing to documents.
  • include_marginalia: If True, includes page_header chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
  • include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
  • grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)

Returns

The parse_documents function returns a list of ParsedDocument objects. For more information, go to ParsedDocument Object.

Raises

The parse_documents function can raise these errors:

  • FileNotFoundError: This error is raised if the provided file path does not exist.
  • ValueError: This error is raised if the file type is not supported or a URL is invalid.
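
One way to handle these two errors is to wrap the call and turn each exception into a readable message. This is a minimal sketch: parse_or_report and fake_parse are hypothetical names, and the parser is passed in as an argument so the snippet runs offline. With the real library, parse_documents would be passed as parse_fn.

```python
def parse_or_report(parse_fn, documents):
    """Call parse_fn(documents) and return (results, error_message)."""
    try:
        return parse_fn(documents), None
    except FileNotFoundError as exc:
        return [], f"missing file: {exc}"
    except ValueError as exc:
        return [], f"unsupported file type or invalid URL: {exc}"


# Stand-in parser that always fails, used here only to exercise the handler.
def fake_parse(documents):
    raise FileNotFoundError(documents[0])


results, error = parse_or_report(fake_parse, ["missing.pdf"])
print(results, error)
```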

Parse Documents and Save Results as JSON Files

Use the parse_and_save_documents function if you want to parse one or more documents and save the output as JSON files in a specified directory.

You can optionally save the visual groundings to a directory by setting the grounding_save_dir parameter.

When to Use: Persistence and Auditing

Because the parse_and_save_documents function saves the output as JSON files, this function is best for use cases that require persistent storage or auditing.

Use this when:

  • You want to store the extracted output for future reference, manual review, or archiving.
  • You have a pipeline that uses batch processing or file watchers.
  • You need to debug the output separately or share it with others.
  • Your process includes a manual review step, such as human-in-the-loop verification.

Sample Script

This script parses two PDFs and saves the output as JSON files in this directory: ./parsed_results. This example uses documents hosted at URLs, but local files are also supported.

from agentic_doc.parse import parse_and_save_documents

# URLs of the documents
documents = ["https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf", "https://www.rbcroyalbank.com/banking-services/_assets-custom/pdf/eStatement.pdf"]

# Directory where the parsed results will be saved
result_save_dir = "./parsed_results"

# Parse the documents and save the results
result_paths = parse_and_save_documents(documents=documents, result_save_dir=result_save_dir)

print(f"Result saved to: {result_paths}")

Function Signature

def parse_and_save_documents(
    documents: list[str | Path | Url],
    *,
    result_save_dir: str | Path,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    grounding_save_dir: str | Path | None = None
) -> list[Path]:

Parameters

Here are the parameters for the parse_and_save_documents function:

  • documents: List of paths to documents or URLs pointing to documents.
  • result_save_dir: The directory where the JSON files will be saved.
  • include_marginalia: If True, includes page_header chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
  • include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
  • grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)

Returns

The parse_and_save_documents function returns a list of file paths to the JSON files that the function created. The JSON files contain the structured data for the extracted elements.

The file paths are sorted in the same order as the input file paths. The JSON file name is the original file name with a timestamp appended. For example, if the input file is “document.pdf”, the output file could be “document_20250313_070305.json”.
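
The naming rule makes it possible to recover the original name and the write time from a result path. The sketch below assumes the suffix follows a _YYYYMMDD_HHMMSS pattern, which is inferred from the example file name and is not stated as the exact format. The helper split_result_name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def split_result_name(result_path):
    """Split 'name_YYYYMMDD_HHMMSS.json' into (original_stem, timestamp)."""
    stem = Path(result_path).stem  # for example 'document_20250313_070305'
    # Split from the right so underscores in the original name are preserved.
    original, date_part, time_part = stem.rsplit("_", 2)
    timestamp = datetime.strptime(f"{date_part}_{time_part}", "%Y%m%d_%H%M%S")
    return original, timestamp


name, written_at = split_result_name("parsed_results/document_20250313_070305.json")
print(name, written_at.isoformat())
```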

Example return:

Result saved to: [PosixPath('parsed_results/sample-sat-score-report_20250508_094455.json'), PosixPath('parsed_results/eStatement_20250508_094347.json')]
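
The returned paths can be read back with the standard json module. The following sketch uses a hypothetical helper, load_results, and demonstrates it on a placeholder file so it runs offline. The "markdown" key in the placeholder is an assumption for illustration only, not the documented schema of the saved files.

```python
import json
import tempfile
from pathlib import Path


def load_results(result_paths):
    """Read each saved JSON file and return {file name: parsed JSON}."""
    return {
        Path(p).name: json.loads(Path(p).read_text(encoding="utf-8"))
        for p in result_paths
    }


# Offline demonstration with a placeholder file. With the real library,
# pass the list returned by parse_and_save_documents instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "document_20250313_070305.json"
    sample.write_text(json.dumps({"markdown": "# Title"}), encoding="utf-8")
    loaded = load_results([sample])

print(loaded)
```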

Raises

  • FileNotFoundError: This error is raised if the provided file path does not exist.
  • ValueError: This error is raised if the file type is not supported or a URL is invalid.
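
For batch runs, it can help to check local paths before starting so that a single missing file does not surface as a FileNotFoundError midway. This is a sketch using only the standard library. The helper missing_local_files is hypothetical, and treating anything that starts with http:// or https:// as a URL is a simplifying assumption.

```python
from pathlib import Path


def missing_local_files(documents):
    """Return the local paths in documents that do not exist on disk."""
    missing = []
    for doc in documents:
        text = str(doc)
        if text.startswith(("http://", "https://")):
            continue  # URLs are left for the library to validate
        if not Path(text).is_file():
            missing.append(text)
    return missing


documents = [
    "https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf",
    "./no_such_file_for_this_example.pdf",
]
print(missing_local_files(documents))
```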

Parse One Document

Use the parse_and_save_document function if you want to parse one document. You can either return the output as an object or save the output as a JSON file in a specified directory.

You can optionally save the visual groundings to a directory by setting the grounding_save_dir parameter.

Sample Script

This script parses a PDF and saves the output as a JSON file in this directory: ./parsed_results. This example uses a document hosted at a URL, but local files are also supported.

from agentic_doc.parse import parse_and_save_document

# URL of the document (a single value, not a list)
document = "https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf"

# Directory where the parsed result will be saved
result_save_dir = "./parsed_results"

# Parse the document and save the result
result_path = parse_and_save_document(document=document, result_save_dir=result_save_dir)

print(f"Result saved to: {result_path}")

Function Signature

def parse_and_save_document(
    document: str | Path | Url,
    *,
    result_save_dir: str | Path | None = None,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    grounding_save_dir: str | Path | None = None
) -> Path | ParsedDocument:


Parameters

  • document: The path to a document or URL pointing to a document.
  • result_save_dir: The directory where the JSON file will be saved. If omitted, the result is returned as an object instead of being saved. (Optional)
  • include_marginalia: If True, includes page_header chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
  • include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
  • grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)

Returns

If the result_save_dir parameter is included, the function returns the file path to the JSON file that the function created.

The JSON file contains the structured data for the extracted elements. The JSON file name is the original file name with a timestamp appended. For example, if the input file is “document.pdf”, the output file could be “document_20250313_070305.json”.

If the result_save_dir parameter is not included, the function returns a ParsedDocument object. For more information, go to ParsedDocument Object.
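
Since the signature declares a return type of Path | ParsedDocument, calling code that does not know which mode was used can branch on the type. This is a minimal sketch: describe_result is a hypothetical helper, and a SimpleNamespace with a chunks attribute stands in for a ParsedDocument so the snippet runs without the library.

```python
from pathlib import Path
from types import SimpleNamespace


def describe_result(result):
    """Describe either kind of return value from a single-document parse."""
    if isinstance(result, Path):
        # result_save_dir was given: the result is the saved JSON file path.
        return f"saved to {result.name}"
    # result_save_dir was omitted: the result is an in-memory object.
    return f"in memory, {len(result.chunks)} chunks"


print(describe_result(Path("parsed_results/document_20250313_070305.json")))
print(describe_result(SimpleNamespace(markdown="# Title", chunks=["title"])))
```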

Raises

  • FileNotFoundError: This error is raised if the provided file path does not exist.
  • ValueError: This error is raised if the file type is not supported or a URL is invalid.

ParsedDocument Object

A ParsedDocument object contains the data extracted from a document.