Parse Function
Use the parse function to parse one or more documents. You can either return the output as objects or save it as a JSON file in a specified directory.
You can also save the visual groundings as images to a directory.
The parse function is available in the agentic-doc library v0.2.3 and later.
Prerequisite: Set API Key
Before running the parse function, get your API Key and set it.
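As a rough sketch of the setup step, the key can be placed in the process environment before the library is imported. The variable name VISION_AGENT_API_KEY is an assumption here (confirm it against the library's setup guide), and the helper names are illustrative only.

```python
import os

# Assumption: the library reads the key from this environment variable.
API_KEY_ENV_VAR = "VISION_AGENT_API_KEY"


def set_api_key(api_key: str) -> None:
    """Store the API key in the environment of the current process."""
    if not api_key.strip():
        raise ValueError("API key must be a non-empty string")
    os.environ[API_KEY_ENV_VAR] = api_key.strip()


def api_key_is_set() -> bool:
    """Return True if a non-empty key is present in the environment."""
    return bool(os.environ.get(API_KEY_ENV_VAR, "").strip())
```

Exporting the variable in your shell before starting Python achieves the same result without any code.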
Sample Script: Parse Local File
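A minimal sketch of a single-file call might look like the following. It assumes that parse is importable from agentic_doc.parse and that each result exposes a markdown attribute; the helper names and the path are placeholders.

```python
from typing import Any, List


def parse_local_file(path: str) -> List[Any]:
    """Parse a single local document and return the parsed results."""
    # Assumption: parse is importable from agentic_doc.parse.
    from agentic_doc.parse import parse

    return parse([path])


def collect_markdown(results: List[Any]) -> List[str]:
    """Return the Markdown text of each parsed document."""
    # Assumption: each result exposes a `markdown` attribute.
    return [doc.markdown for doc in results]


# Example call (needs the library, an API key, and a real file):
# results = parse_local_file("path/to/document.pdf")
# print(collect_markdown(results)[0])
```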
Run this script to parse a file in a local directory and return the results as Markdown and JSON objects.
Sample Script: Parse Multiple Local Files
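One possible shape for the multi-file case is sketched below. It assumes parse is importable from agentic_doc.parse and accepts a list of paths; the pre-check for missing files is an illustrative addition, not part of the library.

```python
from pathlib import Path
from typing import Any, List, Sequence, Tuple


def split_existing(paths: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (found, missing) so missing files are reported early."""
    found = [p for p in paths if Path(p).is_file()]
    missing = [p for p in paths if not Path(p).is_file()]
    return found, missing


def parse_local_files(paths: Sequence[str]) -> List[Any]:
    """Parse several local documents in one call."""
    found, missing = split_existing(paths)
    if missing:
        raise FileNotFoundError(f"Missing input files: {missing}")

    # Assumption: parse is importable from agentic_doc.parse.
    from agentic_doc.parse import parse

    return parse(found)


# Example call with placeholder paths:
# results = parse_local_files(["path/to/first.pdf", "path/to/second.png"])
```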
Run this script to parse two files in a local directory and return the results as Markdown and JSON objects.
Sample Script: Parse PDFs Located at URLs
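The URL case can be sketched the same way. The example assumes parse is importable from agentic_doc.parse and accepts URL strings in the documents list; the example.com addresses are placeholders, and the scheme check is an illustrative guard rather than library behavior.

```python
from typing import Any, List, Sequence
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True for strings with an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def parse_pdf_urls(urls: Sequence[str]) -> List[Any]:
    """Parse PDFs hosted at the given URLs."""
    bad = [u for u in urls if not is_http_url(u)]
    if bad:
        raise ValueError(f"Not valid HTTP(S) URLs: {bad}")

    # Assumption: parse is importable from agentic_doc.parse.
    from agentic_doc.parse import parse

    return parse(list(urls))


# Placeholder URLs; substitute PDFs you can actually reach.
# results = parse_pdf_urls([
#     "https://example.com/first.pdf",
#     "https://example.com/second.pdf",
# ])
```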
Run this script to parse two PDFs located at URLs and return the results as Markdown and JSON objects. In this example, sample URLs are provided for you.
Sample Script: Parse Local File and Save JSON
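A sketch of saving the output, using the result_save_dir parameter and the result_path field described on this page. The import path agentic_doc.parse and the helper names are assumptions; the paths are placeholders.

```python
from typing import Any, List


def parse_and_save(path: str, result_save_dir: str) -> List[Any]:
    """Parse one local file and write its JSON output to result_save_dir."""
    # Assumption: parse is importable from agentic_doc.parse.
    from agentic_doc.parse import parse

    return parse([path], result_save_dir=result_save_dir)


def saved_json_paths(results: List[Any]) -> List[str]:
    """Collect the location of each saved JSON file from result_path."""
    return [
        str(doc.result_path)
        for doc in results
        if doc.result_path is not None
    ]


# Example call with placeholder paths:
# results = parse_and_save("path/to/document.pdf", "path/to/output_dir")
# print(saved_json_paths(results))
```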
Run this script to parse a local file and save the results as a JSON file at the specified directory.
Sample Script: Parse Files from Bytes
In addition to supporting PDFs and images, the parse function supports raw bytes from PDF and image files. This means you can parse documents that are already loaded into memory, without needing to save them to disk first.
Here are two common situations where this is useful:
- File uploaded through a web form: You can send the uploaded file directly to the parse function without storing it as a file.
- File returned from another API: You can pass the file content from the API response straight to the parse function.
The ability to load bytes is available in the agentic-doc library v0.2.4 and later.
Parse PDF Bytes
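A minimal sketch for PDF bytes, assuming parse is importable from agentic_doc.parse and accepts a bytes object directly as its documents argument. The %PDF- header check is an illustrative guard added here, not something the library is known to require.

```python
from typing import Any, List


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files normally begin with the marker %PDF-."""
    return data.startswith(b"%PDF-")


def parse_pdf_bytes(data: bytes) -> List[Any]:
    """Parse a PDF that is already held in memory."""
    if not looks_like_pdf(data):
        raise ValueError("Data does not start with the %PDF- marker")

    # Assumption: parse is importable from agentic_doc.parse
    # and accepts raw bytes as the documents argument.
    from agentic_doc.parse import parse

    return parse(data)


# Example call with a placeholder path:
# with open("path/to/document.pdf", "rb") as f:
#     results = parse_pdf_bytes(f.read())
```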
Parse Image Bytes
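For image bytes, the web-form scenario described earlier could be sketched as follows. The example assumes parse is importable from agentic_doc.parse and accepts raw bytes; read_upload and the upload buffer are hypothetical stand-ins for whatever your web framework provides.

```python
import io
from typing import Any, List


def read_upload(stream: io.BufferedIOBase) -> bytes:
    """Read a binary file-like upload into bytes, starting from the beginning."""
    stream.seek(0)
    return stream.read()


def parse_image_bytes(data: bytes) -> List[Any]:
    """Parse an image that is already held in memory."""
    # Assumption: parse is importable from agentic_doc.parse
    # and accepts raw bytes as the documents argument.
    from agentic_doc.parse import parse

    return parse(data)


# Example call with a hypothetical in-memory upload:
# upload = io.BytesIO(image_bytes_from_form)
# results = parse_image_bytes(read_upload(upload))
```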
Function Signature
Parameters
Here are the parameters for the parse function:
- documents: List of paths to documents or URLs pointing to documents.
- result_save_dir: The directory where the JSON files will be saved.
- include_marginalia: If True, includes marginalia chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
- include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
- grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
- connector_path: Path for connector to search (when using connectors).
- connector_pattern: Pattern to filter files (when using connectors).
- extraction_model: Pydantic model schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
- extraction_schema: JSON schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
- config: Pass configuration settings with the ParseConfig object. For more information about using this parameter, go to Pass Settings with ParseConfig. (Optional)
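To illustrate how several of these parameters combine, here is a sketch that builds the keyword arguments and forwards them to parse. The parameter names come from the list above; the import path agentic_doc.parse and the two helper functions are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_parse_options(
    result_save_dir: Optional[str] = None,
    grounding_save_dir: Optional[str] = None,
    include_marginalia: bool = True,
    include_metadata_in_markdown: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for parse, omitting unset directories."""
    options: Dict[str, Any] = {
        "include_marginalia": include_marginalia,
        "include_metadata_in_markdown": include_metadata_in_markdown,
    }
    if result_save_dir is not None:
        options["result_save_dir"] = result_save_dir
    if grounding_save_dir is not None:
        options["grounding_save_dir"] = grounding_save_dir
    return options


def parse_with_options(documents: Sequence[str], **kwargs: Any) -> List[Any]:
    """Call parse with options assembled by build_parse_options."""
    # Assumption: parse is importable from agentic_doc.parse.
    from agentic_doc.parse import parse

    return parse(list(documents), **build_parse_options(**kwargs))


# Example call with placeholder paths:
# results = parse_with_options(
#     ["path/to/document.pdf"],
#     result_save_dir="path/to/json_dir",
#     grounding_save_dir="path/to/groundings_dir",
#     include_marginalia=False,
# )
```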
Returns
The function returns a list of ParsedDocument objects. For more information, go to ParsedDocument.
If the result_save_dir parameter is included, you can find the file path to each generated JSON file in the result_path field of each ParsedDocument object.
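Once a result_path is known, the saved file can be read back with the standard json module. This is a small sketch; load_saved_result is a hypothetical helper, and the structure of the saved JSON is not assumed here.

```python
import json
from pathlib import Path
from typing import Any, Union


def load_saved_result(result_path: Union[str, Path]) -> Any:
    """Load the JSON file written for one parsed document."""
    with open(result_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Example, given `results` returned by a parse call that set result_save_dir:
# for doc in results:
#     data = load_saved_result(doc.result_path)
```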
Raises
- No documents to parse: This error is raised if the provided file path does not exist.
- ValueError: This error is raised if the file type is not supported or a URL is invalid.
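One way to handle these errors is sketched below. The exception type behind the "No documents to parse" message is not stated above, so the sketch catches ValueError explicitly and everything else broadly. try_parse is a hypothetical wrapper that takes the parse function as an argument.

```python
from typing import Any, Callable, List, Optional, Tuple


def try_parse(
    documents: List[str],
    parse_fn: Callable[[List[str]], List[Any]],
) -> Tuple[List[Any], Optional[str]]:
    """Run parse_fn and return (results, error message or None)."""
    try:
        return parse_fn(documents), None
    except ValueError as exc:
        # Documented case: unsupported file type or invalid URL.
        return [], f"Unsupported file type or invalid URL: {exc}"
    except Exception as exc:
        # Other failures, including a missing file, whose type is not stated.
        return [], f"Parsing failed: {exc}"


# Example call (assumes parse is importable from agentic_doc.parse):
# from agentic_doc.parse import parse
# results, error = try_parse(["path/to/document.pdf"], parse)
```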