Parsing Basics
Parse Function
Use the parse function to parse one or more documents. You can either return the output as objects or save it as a JSON file in a specified directory.
You can also save the visual groundings to a directory.
The parse function is available in the agentic-doc library v0.2.3 and later.
Prerequisite: Set API Key
Before running the parse function, you must set the API key as an environment variable (or put it in a .env file). Get your API key on the API Key page.
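To fail early when the key is missing, a check along these lines can run before any parsing. This is a minimal sketch: the variable name VISION_AGENT_API_KEY is an assumption (this section does not name the variable), and require_api_key is a hypothetical helper, not part of the library.

```python
import os

# Assumption: the library reads the key from this variable. The name is not
# stated in this section, so confirm it on the API Key page.
API_KEY_ENV_VAR = "VISION_AGENT_API_KEY"


def require_api_key(env_var: str = API_KEY_ENV_VAR) -> str:
    """Return the API key from the process environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell or add it to a .env file."
        )
    return key
```

This check only looks at the process environment. A key stored in a .env file is not visible to it unless something has already loaded that file into the environment.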
Sample Script: Parse Local File
Run this script to parse a file in a local directory and return the results as Markdown and JSON objects.
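A minimal sketch of such a script follows. It assumes the import path agentic_doc.parse and that each ParsedDocument exposes markdown and chunks attributes; neither is stated in this section. parse_local_file is a hypothetical wrapper used only for illustration.

```python
def parse_local_file(path: str):
    """Parse one local document and return its Markdown and chunks."""
    # Imported inside the function so this module loads even if agentic-doc
    # is not installed yet. The import path is an assumption.
    from agentic_doc.parse import parse

    # parse returns a list of ParsedDocument objects.
    results = parse([path])
    doc = results[0]

    # Assumption: ParsedDocument has `markdown` and `chunks` attributes.
    return doc.markdown, doc.chunks
```

For example, `markdown, chunks = parse_local_file("path/to/document.pdf")`, where the path is a placeholder for your own file.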
Sample Script: Parse Multiple Local Files
Run this script to parse two files in a local directory and return the results as Markdown and JSON objects.
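Parsing several files is a matter of passing a list of paths. The sketch below summarizes each returned document. It assumes the import path agentic_doc.parse, that markdown is a string, and that chunks is a list; summarize_documents is a hypothetical helper.

```python
def summarize_documents(paths):
    """Parse several local files and return one small summary per result."""
    # Import path is an assumption; imported lazily so the module loads
    # without agentic-doc installed.
    from agentic_doc.parse import parse

    results = parse(list(paths))

    # Assumption: each ParsedDocument has `markdown` (str) and `chunks` (list).
    return [
        {"markdown_chars": len(doc.markdown), "chunk_count": len(doc.chunks)}
        for doc in results
    ]
```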
Sample Script: Parse PDFs Located at URLs
Run this script to parse two PDFs located at URLs and return the results as Markdown and JSON objects. In this example, sample URLs are provided for you.
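A sketch of the URL case, assuming the import path agentic_doc.parse. parse_pdf_urls is a hypothetical wrapper that rejects anything that is not an http(s) URL before calling parse. The pre-check is a design choice for this example, not documented library behavior.

```python
from urllib.parse import urlparse


def parse_pdf_urls(urls):
    """Validate that each entry is an http(s) URL, then parse them together."""
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"Not an http(s) URL: {url!r}")

    # Import path is an assumption; imported lazily so validation works
    # even without agentic-doc installed.
    from agentic_doc.parse import parse

    return parse(list(urls))
```

Call it with your own document URLs, for example `parse_pdf_urls(["https://example.com/first.pdf", "https://example.com/second.pdf"])`. These addresses are placeholders, not the sample URLs mentioned above.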
Sample Script: Parse Local File and Save JSON
Run this script to parse a local file and save the results as a JSON file at the specified directory.
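A sketch of saving results, using the result_save_dir parameter and the result_path field described in the Parameters and Returns sections. The import path agentic_doc.parse is an assumption, and parse_and_save is a hypothetical wrapper.

```python
from pathlib import Path


def parse_and_save(path, save_dir):
    """Parse one file, save the JSON output, and return the saved file paths."""
    # Import path is an assumption; imported lazily so the module loads
    # without agentic-doc installed.
    from agentic_doc.parse import parse

    # Creating the directory up front is harmless if the library also does so.
    Path(save_dir).mkdir(parents=True, exist_ok=True)

    results = parse([path], result_save_dir=save_dir)

    # Each ParsedDocument records where its JSON file was written.
    return [doc.result_path for doc in results]
```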
Sample Script: Parse Files from Bytes
In addition to supporting PDFs and images, the parse function supports raw bytes from PDF and image files. This means you can parse documents that are already loaded into memory, without needing to save them to disk first.
Here are two common situations where this is useful:
- File uploaded through a web form: You can send the uploaded file directly to the parse function without storing it as a file.
- File returned from another API: You can pass the file content from the API response straight to the parse function.
When bytes are passed to the parser, it automatically detects the file type from the bytes.
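The same call shape applies to PDF and image bytes. The sketch below assumes the import path agentic_doc.parse and that the bytes are passed directly as the first argument rather than inside a list. parse_from_bytes is a hypothetical wrapper.

```python
def parse_from_bytes(data):
    """Parse a document that is already in memory as bytes."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("data must be bytes or bytearray")

    # Import path is an assumption; imported lazily so the type check works
    # even without agentic-doc installed.
    from agentic_doc.parse import parse

    # Assumption: raw bytes are passed as the documents argument itself.
    return parse(bytes(data))
```

With an uploaded file or an API response, the bytes would come from the upload object or the response body. With a file on disk, they would come from opening it in "rb" mode and calling read().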
Parse PDF Bytes
Parse Image Bytes
Function Signature
Parameters
Here are the parameters for the parse function:
- documents: List of paths to documents or URLs pointing to documents.
- result_save_dir: The directory where the JSON files will be saved.
- include_marginalia: If True, includes marginalia chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
- include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
- grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
- connector_path: Path for connector to search (when using connectors).
- connector_pattern: Pattern to filter files (when using connectors).
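The sketch below passes several of the parameters above in one call. The parameter names come from the list above. The import path agentic_doc.parse and the use of keyword arguments are assumptions, and parse_with_options is a hypothetical wrapper.

```python
def parse_with_options(documents, result_dir, grounding_dir):
    """Parse documents without marginalia or Markdown metadata, saving outputs."""
    # Import path is an assumption; imported lazily so the module loads
    # without agentic-doc installed.
    from agentic_doc.parse import parse

    return parse(
        documents,
        include_marginalia=False,            # drop header, footer, and margin text
        include_metadata_in_markdown=False,  # keep the Markdown free of metadata
        result_save_dir=result_dir,          # where the JSON files are saved
        grounding_save_dir=grounding_dir,    # where grounding images are saved
    )
```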
Returns
The function returns a list of ParsedDocument objects. For more information, go to ParsedDocument Object.
If the result_save_dir parameter is included, you can find the file path to each generated JSON file in the result_path field in each ParsedDocument object.
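Once results are saved, the result_path field can be used to read the JSON back. The sketch below makes no assumption about the JSON structure and only loads each file. load_saved_results is a hypothetical helper, and it skips documents whose result_path is missing or None.

```python
import json
from pathlib import Path


def load_saved_results(parsed_documents):
    """Map each saved JSON file path to its decoded contents."""
    loaded = {}
    for doc in parsed_documents:
        result_path = getattr(doc, "result_path", None)
        if result_path is None:
            # Nothing was saved for this document.
            continue
        text = Path(result_path).read_text(encoding="utf-8")
        loaded[str(result_path)] = json.loads(text)
    return loaded
```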
Raises
- No documents to parse: This error is raised if the provided file path does not exist.
- ValueError: This error is raised if the file type is not supported or a URL is invalid.
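One way to handle both cases is to check that local paths exist before calling parse, then catch ValueError around the call. This is a sketch under assumptions: the import path agentic_doc.parse is assumed, safe_parse is a hypothetical wrapper, and raising FileNotFoundError for missing paths is this example's choice, not documented library behavior.

```python
from pathlib import Path


def safe_parse(paths):
    """Parse local files, reporting missing paths and unsupported inputs."""
    missing = [str(p) for p in paths if not Path(p).exists()]
    if missing:
        # Raised by this wrapper before the library is called.
        raise FileNotFoundError(f"Files not found: {', '.join(missing)}")

    # Import path is an assumption; imported lazily so the path check works
    # even without agentic-doc installed.
    from agentic_doc.parse import parse

    try:
        return parse(list(paths))
    except ValueError as exc:
        # Raised for an unsupported file type or an invalid URL.
        print(f"Could not parse documents: {exc}")
        return []
```

Returning an empty list keeps a batch job running after a bad input. Re-raising the error instead may be the better choice when a failed parse should stop the program.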
ParsedDocument Objects
A ParsedDocument object contains the data extracted from a document.