Use the parse function to parse one or more documents. You can either return the output as objects or save it as a JSON file in a specified directory.
You can also save the visual groundings to a directory.
The parse function is available in the agentic-doc library v0.2.3 and later.
Before running the parse function, get your API key and set it.
Run this script to parse a file in a local directory and return the results as Markdown and JSON objects.
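A minimal sketch of such a script, assuming agentic-doc is installed and the API key is already set. The import path agentic_doc.parse and the attribute names markdown and chunks are assumptions rather than details confirmed on this page, and the file path is a placeholder.

```python
def markdown_and_chunks(doc):
    """Return the Markdown text and the list of chunks from one parsed document."""
    # `markdown` and `chunks` are assumed attribute names on ParsedDocument.
    return doc.markdown, list(doc.chunks)


def parse_one_file(path):
    """Parse a single local file, for example 'path/to/document.pdf' (placeholder)."""
    # Imported here so the helper above stays usable without agentic-doc installed.
    from agentic_doc.parse import parse  # assumed import path

    results = parse([path])  # parse returns a list of ParsedDocument objects
    return markdown_and_chunks(results[0])
```

Calling parse_one_file("path/to/document.pdf") would then hand back the Markdown string and the chunk list for that file.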
Run this script to parse two files in a local directory and return the results as Markdown and JSON objects.
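For two files, one plausible shape is a single call with both paths in a list. This is a sketch only: the import path agentic_doc.parse and the markdown attribute are assumptions, and the file paths passed in are placeholders chosen by the caller.

```python
def markdown_per_document(results):
    """Collect the Markdown text of every parsed document in a list."""
    # `markdown` is an assumed attribute name on ParsedDocument.
    return [doc.markdown for doc in results]


def parse_two_files(first_path, second_path):
    """Parse two local files in one call and return their Markdown texts."""
    from agentic_doc.parse import parse  # assumed import path

    results = parse([first_path, second_path])  # a list of ParsedDocument objects
    return markdown_per_document(results)
```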
Run this script to parse two PDFs located at URLs and return the results as Markdown and JSON objects. In this example, sample URLs are provided for you.
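A hedged sketch of the URL case follows. The URLs below are placeholders on example.com, not the sample URLs this page refers to, and the import path agentic_doc.parse is an assumption. The looks_like_url helper is a hypothetical pre-check added for illustration; it is not part of the library.

```python
from urllib.parse import urlparse

# Placeholder URLs for illustration only; substitute real document URLs.
PLACEHOLDER_URLS = [
    "https://example.com/first.pdf",
    "https://example.com/second.pdf",
]


def looks_like_url(value):
    """Return True when the value has an http(s) scheme and a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def parse_urls(urls):
    """Check the URLs locally, then pass them to parse as a list."""
    bad = [u for u in urls if not looks_like_url(u)]
    if bad:
        raise ValueError(f"Not valid http(s) URLs: {bad}")

    from agentic_doc.parse import parse  # assumed import path

    return parse(list(urls))  # a list of ParsedDocument objects
```

Checking the URLs before the call keeps an obviously malformed value from reaching the service at all.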
Run this script to parse a local file and save the results as a JSON file at the specified directory.
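One way this might look, assuming the import path agentic_doc.parse; the file path and directory are placeholders supplied by the caller. Creating the directory up front is a precaution added in this sketch, and it is harmless if the library also creates it.

```python
from pathlib import Path


def prepare_output_dir(directory):
    """Create the output directory if it is missing and return it as a string."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    return str(out)


def parse_and_save(path, directory):
    """Parse one local file and ask parse to write its JSON output to `directory`."""
    from agentic_doc.parse import parse  # assumed import path

    save_dir = prepare_output_dir(directory)
    # result_save_dir tells parse where to save the JSON file.
    return parse([path], result_save_dir=save_dir)
```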
In addition to supporting PDFs and images, the parse function supports raw bytes from PDF and image files. This means you can parse documents that are already loaded into memory, without needing to save them to disk first.
Here are two common situations where this is useful:
... parse function without storing it as a file.
... parse function.
When bytes are loaded to the parser, the parser automatically detects the file type from the bytes.
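A rough sketch of parsing from memory, assuming the import path agentic_doc.parse and that the bytes are passed to parse directly. The sniff_file_type helper is hypothetical and only illustrates the general idea of detecting a type from leading bytes; this page does not describe how the library itself does the detection.

```python
def sniff_file_type(raw):
    """Guess a file type from well-known leading bytes (illustration only)."""
    if raw.startswith(b"%PDF"):
        return "pdf"
    if raw.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if raw.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return "unknown"


def parse_from_memory(raw):
    """Pass bytes that are already in memory straight to parse."""
    from agentic_doc.parse import parse  # assumed import path

    return parse(raw)  # a list of ParsedDocument objects
```

The bytes could come from anywhere, for example open(path, "rb").read() or the body of an HTTP response.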
Here are the parameters for the parse function:
documents: List of paths to documents or URLs pointing to documents.
result_save_dir: The directory where the JSON files will be saved.
include_marginalia: If True, includes marginalia chunks (text in the header, footer, and margins) in the output. For more information, go to Chunk Types. Defaults to True. (Optional)
include_metadata_in_markdown: If True, includes metadata in the Markdown output. Defaults to True. (Optional)
grounding_save_dir: The directory where grounding images will be saved. For more information, go to Save Groundings as Images. (Optional)
connector_path: Path for connector to search (when using connectors).
connector_pattern: Pattern to filter files (when using connectors).
extraction_model: Pydantic model schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
extraction_schema: JSON schema for field extraction. For more information about extraction, go to Extract Data with the Library. (Optional)
config: Pass configuration settings with the ParseConfig object. For more information about using this parameter, go to Pass Settings with ParseConfig. (Optional)
The function returns a list of ParsedDocument objects. For more information, go to ParsedDocument Object.
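To show how a few of the optional parameters above might be combined, here is a small sketch. The parameter names and defaults come from the list above; the build_parse_options and parse_with_options helpers are hypothetical, and the import path agentic_doc.parse is an assumption.

```python
def build_parse_options(include_marginalia=True,
                        include_metadata_in_markdown=True,
                        grounding_save_dir=None):
    """Assemble keyword arguments for parse, leaving out unset directories."""
    options = {
        "include_marginalia": include_marginalia,
        "include_metadata_in_markdown": include_metadata_in_markdown,
    }
    if grounding_save_dir is not None:
        # Only request grounding images when a directory was given.
        options["grounding_save_dir"] = grounding_save_dir
    return options


def parse_with_options(documents, **options):
    """Forward the documents and the chosen options to parse."""
    from agentic_doc.parse import parse  # assumed import path

    return parse(documents, **options)
```

For instance, parse_with_options(["path/to/document.pdf"], **build_parse_options(include_marginalia=False)) would request output without header, footer, and margin text.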
If the result_save_dir parameter is included, you can find the file path to each generated JSON file in the result_path field in each ParsedDocument object.
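As an illustration of using result_path, the sketch below reads the saved JSON files back into Python dictionaries. It assumes that result_path is empty (None) when nothing was saved for a document, which this page does not state; load_saved_results is a hypothetical helper.

```python
import json


def load_saved_results(results):
    """Read back each JSON file referenced by a document's result_path."""
    loaded = []
    for doc in results:
        path = getattr(doc, "result_path", None)
        if path is None:  # assumed to mean nothing was saved for this document
            continue
        with open(path, "r", encoding="utf-8") as handle:
            loaded.append(json.load(handle))
    return loaded
```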
No documents to parse: This error is raised if the provided file path does not exist.
ValueError: This error is raised if the file type is not supported or a URL is invalid.
A ParsedDocument object contains the data extracted from a document.
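The two error cases listed above can be guarded against defensively. Because this page does not name the exception type behind "No documents to parse", the sketch below checks that local paths exist before calling parse instead of catching that error, and catches only ValueError. The helpers are hypothetical and the import path agentic_doc.parse is an assumption.

```python
import os


def split_existing(paths):
    """Separate local paths into those that exist and those that do not."""
    existing = [p for p in paths if os.path.exists(p)]
    missing = [p for p in paths if not os.path.exists(p)]
    return existing, missing


def parse_safely(paths):
    """Parse only the paths that exist; return (results, missing_paths)."""
    existing, missing = split_existing(paths)
    if not existing:
        return [], missing  # nothing to send to parse

    from agentic_doc.parse import parse  # assumed import path

    try:
        return parse(existing), missing
    except ValueError as err:
        # Raised for an unsupported file type or an invalid URL.
        print(f"parse rejected the input: {err}")
        return [], missing
```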