Parse Function

Use the parse function to parse one or more documents. You can return the output as objects or save it as JSON files in a directory you specify.

You can also save the visual groundings to a directory.

The parse function is available in the agentic-doc library v0.2.3 and later.

Prerequisite: Set API Key

Before running the parse function, you must set the API key as an environment variable (or put it in a .env file). Get your API key on the API Key page.
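
As a minimal sketch of a pre-flight check, the snippet below confirms that the key is visible to the Python process before any parsing starts. The variable name VISION_AGENT_API_KEY is an assumption (this page does not name it), so confirm it on the API Key page or in the library README. The helper require_api_key is hypothetical and is not part of agentic-doc.

```python
import os

# Assumed variable name; this page does not state it, so confirm it against
# the agentic-doc README before relying on it.
API_KEY_ENV_VAR = "VISION_AGENT_API_KEY"


def require_api_key(env=None, name=API_KEY_ENV_VAR):
    """Return the API key from the environment, or raise if it is missing.

    Hypothetical helper for illustration; not part of agentic-doc.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to a .env file."
        )
    return key
```

Calling require_api_key() at the top of a script surfaces a missing key immediately instead of partway through a batch of documents.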

Sample Script: Parse Local File

Run this script to parse a file in a local directory and return the results as Markdown and JSON objects.

from agentic_doc.parse import parse

# Parse a local file
# parse always returns a list, even for a single document
results = parse("path/to/file.pdf")
parsed_doc = results[0]

# Get the extracted data as markdown
print(parsed_doc.markdown)

# Get the extracted data as structured chunks of content in a JSON schema
print(parsed_doc.chunks)

Sample Script: Parse Multiple Local Files

Run this script to parse two files in a local directory and return the results as Markdown and JSON objects.

from agentic_doc.parse import parse

# Parse multiple local files
file_paths = ["path/to/file1.pdf", "path/to/file2.pdf"]
results = parse(file_paths)

for result in results:
    # Get the extracted data as markdown
    print(result.markdown)

    # Get the extracted data as structured chunks of content in a JSON schema
    print(result.chunks)

Sample Script: Parse PDFs Located at URLs

Run this script to parse two PDFs located at URLs and return the results as Markdown and JSON objects. In this example, sample URLs are provided for you.

from agentic_doc.parse import parse

# Parse documents from URLs
file_paths = ["https://satsuite.collegeboard.org/media/pdf/sample-sat-score-report.pdf", "https://www.rbcroyalbank.com/banking-services/_assets-custom/pdf/eStatement.pdf"]
results = parse(file_paths)

# Get the extracted data as markdown
for result in results:
    print(result.markdown)

# Get the extracted data as structured chunks of content in a JSON schema
for result in results:
    print(result.chunks)

Sample Script: Parse Local File and Save JSON

Run this script to parse a local file and save the results as a JSON file in the specified directory.

from agentic_doc.parse import parse

# Parse a local PDF and save results to directory
result = parse("path/to/file.pdf", result_save_dir="path/to/save/results")

# Print the file path to the JSON file
print(f"Final result: {result[0].result_path}")

Sample Script: Parse Files from Bytes

In addition to supporting PDFs and images, the parse function supports raw bytes from PDF and image files. This means you can parse documents that are already loaded into memory, without needing to save them to disk first.

Here are two common situations where this is useful:

  • File uploaded through a web form: You can send the uploaded file directly to the parse function without storing it as a file.
  • File returned from another API: You can pass the file content from the API response straight to the parse function.

When you pass bytes to the parse function, it automatically detects the file type from the bytes.

The ability to load bytes is available in the agentic-doc library v0.2.4 and later.
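
To illustrate what detecting a file type from bytes can look like, here is a minimal sketch that checks the leading "magic" bytes of a buffer. It is not the library's actual detection logic, which this page does not describe. The helper sniff_file_type is hypothetical, and it covers only the PDF, PNG, and JPEG signatures.

```python
from typing import Optional

# Well-known file signatures ("magic numbers") found at the start of each format.
_SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
)


def sniff_file_type(data: bytes) -> Optional[str]:
    """Return "pdf", "png", or "jpeg" based on the leading bytes, else None.

    Hypothetical helper for illustration; not part of agentic-doc.
    """
    for prefix, kind in _SIGNATURES:
        if data.startswith(prefix):
            return kind
    return None
```

A check like this can reject an obviously unsupported upload before the bytes are sent to the parse function.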

Parse PDF Bytes

from agentic_doc.parse import parse

# Load a PDF as bytes
with open("document.pdf", "rb") as f:
    raw_bytes = f.read()

# Parse the document from bytes
results = parse(raw_bytes)

Parse Image Bytes

from agentic_doc.parse import parse

# Load an image as bytes
with open("image.png", "rb") as f:
    image_bytes = f.read()

# Parse the document from bytes
results = parse(image_bytes)

Function Signature

def parse(
    documents: Union[
        bytes,
        str,
        Path,
        Url,
        List[Union[str, Path, Url]],
        BaseConnector,
        ConnectorConfig,
    ],
    *,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
    result_save_dir: Optional[Union[str, Path]] = None,
    grounding_save_dir: Optional[Union[str, Path]] = None,
    connector_path: Optional[str] = None,
    connector_pattern: Optional[str] = None,
) -> List[ParsedDocument]:

Parameters

Here are the parameters for the parse function:

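  • documents: The documents to parse. Accepts a single file path, URL, or raw bytes; a list of file paths or URLs; or a connector (BaseConnector or ConnectorConfig).
  • result_save_dir: The directory where the JSON files will be saved. (Optional)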
  • include_marginalia: If True, includes marginalia chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
  • include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
  • grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
  • connector_path: Path for the connector to search (when using connectors). (Optional)
  • connector_pattern: Pattern used to filter files (when using connectors). (Optional)

Returns

The function returns a list of ParsedDocument objects. For more information, go to ParsedDocument Object.

If the result_save_dir parameter is included, you can find the file path to each generated JSON file in the result_path field in each ParsedDocument object.
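
As a hedged sketch of how the saved files might be read back, the helper below loads the JSON at each result_path. The name load_saved_results is hypothetical and not part of agentic-doc; it relies only on the result_path field described above and makes no assumption about the structure of the JSON itself.

```python
import json
from pathlib import Path
from typing import Any, Iterable, List


def load_saved_results(parsed_docs: Iterable[Any]) -> List[Any]:
    """Load the JSON file referenced by each document's result_path.

    Documents without a result_path (for example, when result_save_dir
    was not set) are skipped. Hypothetical helper for illustration.
    """
    loaded = []
    for doc in parsed_docs:
        result_path = getattr(doc, "result_path", None)
        if result_path is None:
            continue
        with Path(result_path).open("r", encoding="utf-8") as f:
            loaded.append(json.load(f))
    return loaded
```

With the library installed, the list returned by parse(..., result_save_dir=...) would be passed in as parsed_docs.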

Raises

  • No documents to parse: This error is raised if the provided file path does not exist.
  • ValueError: This error is raised if the file type is not supported or a URL is invalid.
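
One way to avoid the missing-path error is to check local paths before calling the parse function. The sketch below is an illustration under that assumption: split_existing is a hypothetical helper, not part of agentic-doc, and it passes URLs through without checking them.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_existing(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split inputs into (usable, missing).

    URLs are passed through unchecked; local paths must point to an
    existing file. Hypothetical helper for illustration.
    """
    found, missing = [], []
    for p in paths:
        text = str(p)
        if text.startswith(("http://", "https://")):
            found.append(text)  # not a local path, so nothing to check on disk
        elif Path(text).is_file():
            found.append(text)
        else:
            missing.append(text)
    return found, missing
```

The usable list can then be passed to parse, while the missing list can be logged or reported to the user.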

ParsedDocument Objects

A ParsedDocument object contains the data extracted from a document.