The legacy API (v1/tools/agentic-document-analysis) was the original endpoint for document processing. It combined document parsing and field extraction into a single API call. This endpoint has been replaced with separate, function-specific APIs that provide more flexibility in how you process documents:
API: Converts documents into structured Markdown with hierarchical JSON
API: Classifies and separates documents into sub-documents
API: Extracts specific data fields from parsed documents
We recommend migrating to the current APIs. The legacy API will continue to work, but new features and improvements are only added to the current APIs. For migration guidance:
This example shows the basic legacy API call to parse documents and get structured Markdown and JSON output. You can add optional parameters to customize the output or include field extraction. For the full API specification and all available parameters, see Agentic Document Extraction (Legacy).
Extract Data Based on Document Type (Classification)
As part of the extraction process, you can classify documents and extract data based on each document's type. For example, you can extract different sets of data from Invoices, Receipts, and Waybills. To classify documents, use the enum keyword to define the document types in your script. Here is an example:
```python
# Define document types
class_schema = {
    "type": "object",
    "properties": {
        "document_type": {
            "type": "string",
            "enum": ["Document Type 1", "Document Type 2", "Document Type 3"],
        }
    },
    "required": ["document_type"],
}
```
Let’s say you need to parse three types of documents: Passports, Invoices, and Other. The data you need to extract depends on the document type. The following script shows you how to define your document types, the fields to extract for each document type, and the code to actually parse the documents, classify the documents, and extract the data.
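The per-type logic described above can be sketched as a lookup from the classified document type to an extraction schema. This is a minimal sketch under stated assumptions, not the product's own script: the field names (`passport_number`, `invoice_total`, and so on) and the helpers `object_schema` and `schema_for` are hypothetical, and the call that sends a schema to the legacy API is deliberately left out.

```python
# Classification schema: the enum lists the allowed document types.
DOCUMENT_TYPES = ["Passport", "Invoice", "Other"]

class_schema = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string", "enum": DOCUMENT_TYPES},
    },
    "required": ["document_type"],
}


def object_schema(fields):
    """Build a JSON Schema object with one required string property per field.

    Hypothetical helper; `fields` maps a field name to its description.
    """
    return {
        "type": "object",
        "properties": {
            name: {"type": "string", "description": description}
            for name, description in fields.items()
        },
        "required": list(fields),
    }


# Hypothetical fields to extract for each document type.
EXTRACTION_SCHEMAS = {
    "Passport": object_schema({
        "full_name": "Name of the passport holder.",
        "passport_number": "Passport number.",
    }),
    "Invoice": object_schema({
        "invoice_number": "Invoice identifier.",
        "invoice_total": "Total amount due.",
    }),
    "Other": object_schema({
        "summary": "Short description of the document.",
    }),
}


def schema_for(classification):
    """Pick the extraction schema that matches a classification result."""
    doc_type = classification.get("document_type")
    if doc_type not in DOCUMENT_TYPES:
        raise ValueError(f"Unexpected document type: {doc_type!r}")
    return EXTRACTION_SCHEMAS[doc_type]
```

In this sketch, the classification result is assumed to be a dictionary shaped like `class_schema`, for example `{"document_type": "Invoice"}`. Rejecting values outside the enum keeps an unexpected classification from silently falling through to the wrong schema.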
When using field extraction with the legacy ADE API, the results include a confidence score for each extracted field. This score indicates how certain ADE is about the accuracy of the extracted data. Having a confidence score allows you to create logic that routes low-confidence fields to human reviewers before the data is sent to downstream systems. For example, you can write a script that sends an extracted field to a human reviewer if its confidence score is lower than a set threshold. The higher the confidence score, the more confident ADE is that the prediction is accurate.
The confidence score feature is experimental and still in development, and may not return accurate results.
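The threshold-based routing described above can be sketched as follows. This is an illustration only: it assumes the extraction metadata is available as a flat dictionary that maps each field name to an object with a `confidence` key, and the threshold value, the function name `fields_needing_review`, and the sample scores are all hypothetical.

```python
from typing import Any, Dict, List, Optional

REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; tune it for your own data


def fields_needing_review(
    extraction_metadata: Dict[str, Dict[str, Any]],
    threshold: float = REVIEW_THRESHOLD,
) -> List[str]:
    """Return the names of fields whose confidence is missing or below threshold."""
    flagged = []
    for name, meta in extraction_metadata.items():
        confidence: Optional[float] = meta.get("confidence")
        # A missing (null) score gives no assurance, so treat it like a low score.
        if confidence is None or confidence < threshold:
            flagged.append(name)
    return flagged


# Made-up scores for illustration, not real API output.
sample_metadata = {
    "accountHolder": {"confidence": 0.97},
    "accountNumber": {"confidence": 0.42},
}
print(fields_needing_review(sample_metadata))  # ['accountNumber']
```

Fields returned by the function would go to a human reviewer, and the rest could continue to downstream systems.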
The legacy ADE API returns a confidence property for each extracted field within the data.extraction_metadata object. The sample extraction_metadata output below shows how the confidence property displays after extraction.
Copy the Python script below and save it to a local directory:
Sample Python Script for Field Extraction
```python
from __future__ import annotations

from pydantic import BaseModel, Field
from agentic_doc.parse import parse


class SampleExtractionSchema(BaseModel):
    accountHolder: str = Field(
        ...,
        description='The full name of the person who holds the bank account.',
        title='Account Holder Name',
    )
    accountNumber: str = Field(
        ...,
        description='The bank account number associated with the account holder.',
        title='Bank Account Number',
    )


# Parse a file and extract the fields
results = parse("estatement.pdf", extraction_model=SampleExtractionSchema)
fields = results[0].extraction

# Print the values of the extracted fields
print("Extracted Fields:")
print(fields)

# Print the metadata of the extracted fields
print("\nExtraction Metadata:")
print(results[0].extraction_metadata)
```
Because the confidence score feature is experimental and still in development, there are certain situations where scores are not available. The confidence score value will be null in the following situations:
Tables: Data extracted from tables will have a null confidence score.
Changes to formatting: Fields with custom formatting applied during extraction will have a null confidence score. For example, reformatting a date from “DD-MM-YYYY” to “MM-DD-YYYY” results in a null score.
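Because a null score says nothing about accuracy in either direction, it may be useful to track unscored fields separately from scored ones. The sketch below shows one way to do that, assuming the metadata is a flat dictionary that maps each field name to an object with a `confidence` key. The function name `split_by_score_availability`, the field name `statementDate`, and the sample score are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def split_by_score_availability(
    extraction_metadata: Dict[str, Dict[str, Any]],
) -> Tuple[Dict[str, float], List[str]]:
    """Separate fields that have a confidence score from fields that do not."""
    scored: Dict[str, float] = {}
    unscored: List[str] = []
    for name, meta in extraction_metadata.items():
        confidence = meta.get("confidence")
        if confidence is None:
            # For example, a table value or a reformatted date.
            unscored.append(name)
        else:
            scored[name] = confidence
    return scored, unscored


# Made-up metadata for illustration, not real API output.
sample_metadata = {
    "accountHolder": {"confidence": 0.97},
    "statementDate": {"confidence": None},
}
scored, unscored = split_by_score_availability(sample_metadata)
print(scored)    # {'accountHolder': 0.97}
print(unscored)  # ['statementDate']
```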