Deprecated Parsing Functions

This article is about the legacy agentic-doc library. Use the landingai-ade library for all new projects.

The parsing functions below were deprecated in v0.2.3 of the library. They continue to work in later versions, but we recommend using the parse function instead. Deprecated functions:

parse_documents: Use if you want to parse one or more documents and return the output as objects.
parse_and_save_documents: Use if you want to parse one or more documents and save the output as JSON files.
parse_and_save_document: Use if you want to parse only one file. You can have the output returned as objects or saved as a JSON file.

Parse Documents and Return Results as Objects

Use the parse_documents function to parse one or more documents and return the output as objects. You have the option to save the visual groundings to a directory.

When to Use: Immediate Processing

Because the parse_documents function returns extracted data as objects, this function is best for immediate downstream processing, integrations with other systems, or interactive environments (like Jupyter Notebook). Use this when:

You’re running the script in a notebook or web service that will immediately process, transform, or display the data.
You need to pass the data to another function or microservice as part of a larger pipeline.
You’re working in an interactive environment (like a Jupyter Notebook or a web-based UI).
You want to avoid writing to disk due to permission issues or cloud function constraints.

Sample Script

This script parses two PDFs, then prints the first result both as Markdown and as structured chunks. This example uses documents hosted at URLs, but local files are also supported.

from agentic_doc.parse import parse_documents

# Parse documents from URLs
results = parse_documents(["https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf", "https://www.rbcroyalbank.com/banking-services/_assets-custom/pdf/eStatement.pdf"])
parsed_doc = results[0]

# Get the extracted data as markdown
print(parsed_doc.markdown)  

# Get the extracted data as structured chunks of content in a JSON schema
print(parsed_doc.chunks)  

Function Signature

def parse_documents(
    documents: list[Union[str, Path, Url]],
    *,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    grounding_save_dir: Union[str, Path, None] = None,
    extraction_model: Optional[type[T]] = None,
    extraction_schema: Optional[dict[str, Any]] = None,
    config: Optional[ParseConfig] = None,
) -> list[ParsedDocument[T]]:

Parameters

Here are the parameters for the parse_documents function:

documents: List of paths to documents or URLs pointing to documents.
include_marginalia: If True, includes marginalia chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
extraction_model: Pydantic model schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
extraction_schema: JSON schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
config: Pass configuration settings with the ParseConfig object. For more information about using this parameter, go to Pass Settings with ParseConfig. (Optional)
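
As a rough illustration of what could be passed as extraction_schema, the sketch below builds a JSON Schema as a plain Python dictionary. The field names (student_name, total_score) are hypothetical and are not taken from the library or the sample documents. Which JSON Schema features the library accepts is also an assumption here, so check Extract Data with the Library before relying on this shape.

```python
import json

# Hypothetical extraction schema: the field names are invented for illustration
# and do not come from the library or the sample documents.
extraction_schema = {
    "type": "object",
    "properties": {
        "student_name": {"type": "string", "description": "Name printed on the report"},
        "total_score": {"type": "integer", "description": "Overall score"},
    },
    "required": ["student_name", "total_score"],
}

# Serialize the schema to confirm it is valid JSON before passing it on.
schema_json = json.dumps(extraction_schema, indent=2)
print(schema_json)
```

Under these assumptions, the dictionary would be passed as a keyword argument, for example parse_documents(documents, extraction_schema=extraction_schema).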

Returns

The parse_documents function returns a list of ParsedDocument objects. For more information, go to ParsedDocument.

Raises

The parse_documents function can raise these errors:

FileNotFoundError: This error is raised if the provided file path does not exist.
ValueError: This error is raised if the file type is not supported or a URL is invalid.
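
One way to handle these two errors is sketched below. The helper parse_each is hypothetical and not part of the library. It takes the parsing function as an argument (for example, parse_documents) and calls it with one document at a time, so that a missing file or unsupported type is recorded instead of stopping the rest of the batch. Parsing one document per call is a design choice of this sketch, not something the library requires.

```python
from typing import Any, Callable


def parse_each(
    documents: list[str],
    parse_fn: Callable[[list[str]], list[Any]],
) -> tuple[list[Any], dict[str, str]]:
    """Parse documents one at a time so a bad input does not stop the batch."""
    parsed: list[Any] = []
    failures: dict[str, str] = {}
    for document in documents:
        try:
            # parse_fn is expected to accept a list and return a list.
            parsed.extend(parse_fn([document]))
        except FileNotFoundError as exc:
            failures[document] = f"missing file: {exc}"
        except ValueError as exc:
            failures[document] = f"unsupported type or invalid URL: {exc}"
    return parsed, failures
```

With the library installed, a call might look like parsed, failures = parse_each(paths, parse_documents), after which failures maps each rejected input to a short reason.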

Parse Documents and Save Results as JSON Files

Use the parse_and_save_documents function if you want to parse one or more documents and save the output as JSON files in a specified directory. You have the option to save the visual groundings to a directory.

When to Use: Persistence and Auditing

Because the parse_and_save_documents function saves the output as JSON files, this function is best for use cases that require persistent storage or auditing. Use this when:

You want to store the extracted output for future reference, manual review, or archiving.
You have a pipeline that uses batch processing or file watchers.
You need to debug the output separately or share it with others.
Your process includes a manual review step, such as human-in-the-loop verification.

Sample Script

This script parses two PDFs and saves the output as JSON files in this directory: ./parsed_results. This example uses documents hosted at URLs, but local files are also supported.

from agentic_doc.parse import parse_and_save_documents

# URLs of the documents
documents = ["https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf", "https://www.rbcroyalbank.com/banking-services/_assets-custom/pdf/eStatement.pdf"]

# Directory where the parsed results will be saved
result_save_dir = "./parsed_results"

# Parse the documents and save the results
result_paths = parse_and_save_documents(documents=documents, result_save_dir=result_save_dir)

print(f"Result saved to: {result_paths}")
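
Once the files are on disk, a later step can read them back with the standard library. The sketch below assumes only that each saved file is valid JSON and makes no assumption about the keys inside. The helper name load_results is hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def load_results(result_paths: list[Path]) -> dict[str, Any]:
    """Load each saved JSON file and key the content by file name."""
    loaded: dict[str, Any] = {}
    for path in result_paths:
        with Path(path).open(encoding="utf-8") as f:
            loaded[Path(path).name] = json.load(f)
    return loaded
```

For example, load_results(result_paths) could feed a manual review or audit step that runs separately from parsing.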

Function Signature

def parse_and_save_documents(
    documents: list[Union[str, Path, Url]],
    *,
    result_save_dir: Union[str, Path],
    grounding_save_dir: Union[str, Path, None] = None,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    extraction_model: Optional[type[T]] = None,
    extraction_schema: Optional[dict[str, Any]] = None,
    config: Optional[ParseConfig] = None,
) -> list[Path]:

Parameters

Here are the parameters for the parse_and_save_documents function:

documents: List of paths to documents or URLs pointing to documents.
result_save_dir: The directory where the JSON files will be saved.
include_marginalia: If True, includes marginalia chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
extraction_model: Pydantic model schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
extraction_schema: JSON schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
config: Pass configuration settings with the ParseConfig object. For more information about using this parameter, go to Pass Settings with ParseConfig. (Optional)

Returns

The parse_and_save_documents function returns a list of file paths to the JSON files that the function created. The JSON files contain the structured data for the extracted elements. The file paths are in the same order as the input documents. Each JSON file name is the original file name with a timestamp appended. For example, if the input file is “document.pdf”, the output file could be “document_20250313_070305.json”. Example return:

Result saved to: [PosixPath('parsed_results/sample-sat-score-report_20250508_094455.json'), PosixPath('parsed_results/eStatement_20250508_094347.json')]
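
The naming pattern in these examples can be sketched as follows. The timestamp layout (YYYYMMDD_HHMMSS) is inferred from the sample file names above rather than taken from the library's source, and both helper names are hypothetical. The second helper shows how a downstream script might recover the original document name and timestamp from a result file name.

```python
from datetime import datetime
from pathlib import Path


def timestamped_json_name(input_name: str, when: datetime) -> str:
    """Build '<stem>_<YYYYMMDD>_<HHMMSS>.json' from an input file name."""
    return f"{Path(input_name).stem}_{when.strftime('%Y%m%d_%H%M%S')}.json"


def split_result_name(result_name: str) -> tuple[str, datetime]:
    """Recover the original stem and timestamp from a result file name."""
    # Split from the right so underscores in the original name are preserved.
    stem, date_part, time_part = Path(result_name).stem.rsplit("_", 2)
    return stem, datetime.strptime(f"{date_part}_{time_part}", "%Y%m%d_%H%M%S")
```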

Raises

FileNotFoundError: This error is raised if the provided file path does not exist.
ValueError: This error is raised if the file type is not supported or a URL is invalid.

Parse One Document

Use the parse_and_save_document function if you want to parse one document. You can either return the output as an object or save it as a JSON file in a specified directory. You can also save the visual groundings to a directory.

Sample Script

This script parses a PDF and saves the output as a JSON file in this directory: ./parsed_results. This example uses a document hosted at a URL, but local files are also supported.

from agentic_doc.parse import parse_and_save_document

# URL to the document
document = "https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf"

# Directory where the parsed result will be saved
result_save_dir = "./parsed_results"

# Parse the document and save the result
result_path = parse_and_save_document(document=document, result_save_dir=result_save_dir)

print(f"Result saved to: {result_path}")

Function Signature

def parse_and_save_document(
    document: Union[str, Path, Url],
    *,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    result_save_dir: Union[str, Path, None] = None,
    grounding_save_dir: Union[str, Path, None] = None,
    extraction_model: Optional[type[T]] = None,
    extraction_schema: Optional[dict[str, Any]] = None,
    config: Optional[ParseConfig] = None,
) -> Union[Path, ParsedDocument[T]]:

Parameters

document: The path to a document or URL pointing to a document.
result_save_dir: The directory where the JSON file will be saved. If omitted, the function returns the parsed output as an object instead of saving it. (Optional)
include_marginalia: If True, includes marginalia chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
connector_path: The path for the connector to search (when using connectors).
connector_pattern: The pattern used to filter files (when using connectors).
extraction_model: Pydantic model schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
extraction_schema: JSON schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
config: Pass configuration settings with the ParseConfig object. For more information about using this parameter, go to Pass Settings with ParseConfig. (Optional)

Returns

If the result_save_dir parameter is included, the function returns the file path to the JSON file that the function created. The JSON file contains the structured data for the extracted elements. The JSON file name is the original file name with a timestamp appended. For example, if the input file is “document.pdf”, the output file could be “document_20250313_070305.json”. If the result_save_dir parameter is not included, the function returns a ParsedDocument object, as the function signature indicates. For more information, go to ParsedDocument Object.
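
Because the return type depends on whether result_save_dir is set, calling code may need to branch on it. The sketch below shows one way to do that with an isinstance check. The helper describe_result is hypothetical, and it treats anything that is not a Path as the in-memory parsed object.

```python
from pathlib import Path
from typing import Any


def describe_result(result: Any) -> str:
    """Summarize a return value that is either a saved-file path or a parsed object."""
    if isinstance(result, Path):
        # result_save_dir was set: the function returned where the JSON was written.
        return f"saved to {result}"
    # result_save_dir was omitted: the parsed output itself was returned.
    return f"returned in memory as {type(result).__name__}"
```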

Raises

FileNotFoundError: This error is raised if the provided file path does not exist.
ValueError: This error is raised if the file type is not supported or a URL is invalid.
