This article is about the legacy ADE endpoint (v1/tools/agentic-document-analysis). Use the current endpoints for all new projects.
The legacy API (v1/tools/agentic-document-analysis) was the original endpoint for document processing. It combined document parsing and field extraction into a single API call. This endpoint has been replaced with separate, function-specific APIs that provide more flexibility in how you process documents:
  • A parsing API that converts documents into structured Markdown with hierarchical JSON
  • A splitting API that classifies and separates documents into sub-documents
  • An extraction API that extracts specific data fields from parsed documents

Migrate to the Current APIs

We recommend migrating to the current APIs. The legacy API will continue to work, but new features and improvements are added only to the current APIs. For migration guidance, see the documentation for the current APIs.

About This Section

This section documents the legacy API for users who still rely on it. If you’re starting a new integration, use the current APIs instead.

Parse Documents with the Legacy API

This example shows the basic legacy API call to parse documents and get structured Markdown and JSON output. You can add optional parameters to customize the output or include field extraction. For the full API specification and all available parameters, see Agentic Document Extraction (Legacy).
curl -X POST 'https://api.va.landing.ai/v1/tools/agentic-document-analysis' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F '[email protected]'

Extract Data with the Legacy API

To extract specific fields while parsing with the legacy API, add the fields_schema parameter to the API call. This parameter accepts a JSON schema that defines the fields you want to extract. You can use the guided schema wizard in the Playground to help you create the schema. After creating the schema, you can either pass the JSON file in the API call or define the JSON schema directly in your script.

Sample Script: Pass JSON Schema to API

The script below shows how to pass a JSON file with the extraction schema to the API when parsing a document.
import json
import requests

VA_API_KEY = "YOUR_VA_API_KEY"  # Replace with your API key
headers = {"Authorization": f"Bearer {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"

base_pdf_path = "your_pdf_path"  # Replace with the path to the directory
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"

schema_name = "w2-schema.json"  # Replace with the JSON schema file name
schema_path = f"{base_pdf_path}/{schema_name}"

with open(schema_path, "r") as file:
    schema = json.load(file)

payload = {"fields_schema": json.dumps(schema)}

with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    response = requests.post(url, headers=headers, files=files, data=payload)

output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)

Sample Script: Include JSON Schema in Script

The script below shows how to define the extraction schema directly in the script when parsing a document.
import json
import requests

VA_API_KEY = "YOUR_VA_API_KEY"  # Replace with your API key
headers = {"Authorization": f"Bearer {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"

base_pdf_path = "your_pdf_path"  # Replace with the path to the directory
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"

# Define your schema
schema = {
  "title": "Student Enrollment Form Extraction Schema",
  "description": "Schema for extracting key fields from a student enrollment form in markdown format.",
  "type": "object",
  "properties": {
    "studentName": {
      "type": "object",
      "title": "Student Name",
      "description": "The full name of the student.",
      "properties": {
        "first": {
          "type": "string",
          "title": "First Name",
          "description": "The student's first name."
        },
        "last": {
          "type": "string",
          "title": "Last Name",
          "description": "The student's last name."
        }
      },
      "required": [
        "first",
        "last"
      ]
    }
  },
  "required": [
    "studentName"
  ]
}

payload = {"fields_schema": json.dumps(schema)}

with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    response = requests.post(url, headers=headers, files=files, data=payload)

output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)

Extract Data Based on Document Type (Classification)

As part of the extraction process, you can classify documents and extract data based on the document type. For example, you can extract different sets of data from Invoices, Receipts, and Waybills. To classify documents, use the enum keyword to define the document types in your script. Here is an example:
# Define document types
class_schema = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string", "enum": ["Document Type 1", "Document Type 2", "Document Type 3"]}
    },
    "required": ["document_type"],
}

Sample Script: Classify Documents

Let’s say you need to parse three types of documents: Passports, Invoices, and Other. The data you need to extract depends on the document type. The following script shows how to define your document types, the fields to extract for each type, and the code that parses each document, classifies it, and extracts the data.
import requests
import json

VA_API_KEY = "YOUR_VA_API_KEY"  # Replace with your API key
headers = {"Authorization": f"Bearer {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"

base_pdf_path = "your_pdf_path"  # Replace with the path to the directory
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"

# Define document types
class_schema = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string", "enum": ["Passport", "Invoice", "Other"]}
    },
    "required": ["document_type"],
}

# First request: classification
with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    payload = {"fields_schema": json.dumps(class_schema)}
    classification_response = requests.post(
        url, headers=headers, files=files, data=payload
    )

classification = classification_response.json()["data"]["extracted_schema"][
    "document_type"
]

# Define schema based on classification
if classification == "Passport":
    schema = {
        "type": "object",
        "properties": {
            "Given Names": {"type": "string"},
            "Date of birth": {"type": "string"},
            "ID_number": {"type": "number"},
            "Passport Number": {"type": "string"},
        },
        "required": ["Given Names", "Date of birth", "ID_number", "Passport Number"],
    }
elif classification == "Invoice":
    schema = {
        "type": "object",
        "properties": {
            "Bill to": {"type": "string"},
            "Invoice Number": {"type": "string"},
            "Invoice Date": {"type": "string"},
            "Due Date": {"type": "string"},
            "Total": {"type": "string"},
        },
        "required": ["Bill to", "Invoice Number", "Invoice Date", "Due Date", "Total"],
    }
else:
    print("Document type is 'Other'. No extraction schema defined.")
    exit()

# Second request: extraction
with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    payload = {"fields_schema": json.dumps(schema)}
    response = requests.post(url, headers=headers, files=files, data=payload)

output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)

Confidence Scores

When using field extraction with the legacy ADE API, the results include a confidence score for each extracted field. This score indicates how certain the model is about the accuracy of the extracted data. Confidence scores allow you to create logic that routes low-confidence fields to human reviewers before sending data to downstream systems. For example, you can write a script that sends an extracted field to a human reviewer if its confidence score is below a set threshold. The higher the confidence score, the more confident the model is that the prediction is accurate.
The confidence score feature is experimental and still in development, and may not return accurate results.

Availability

  • API: Confidence scores are included in extraction results starting July 21, 2025.
  • Library: Use agentic-doc library v0.3.1 or later to get confidence scores in extraction results.

Confidence Scores in Output

The legacy ADE API returns a confidence property for each extracted field within the data.extraction_metadata object. The sample extraction_metadata output below shows how the confidence property displays after extraction.
Confidence Score in Output
{
   "patient": {
      "name": {
         "value": "John Doe",
         "chunk_references": [
            "0a75d377-a435-49d0-b987-ee77e67e746c"
         ],
         "confidence": 0.99
      },
      "address": {
         "value": "123 Main Street",
         "chunk_references": [
            "1a753ba4-e375-4e70-aaf5-db3834a7d6a7"
         ],
         "confidence": 0.97
      }
   }
}
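You can use these per-field scores to implement the review-routing logic described above. The sketch below walks a metadata dictionary shaped like the sample output and collects fields below a threshold; the threshold value and the inlined sample data are illustrative, not part of the API.

```python
# Illustrative threshold and sample metadata (shaped like data.extraction_metadata)
CONFIDENCE_THRESHOLD = 0.95

extraction_metadata = {
    "patient": {
        "name": {"value": "John Doe", "confidence": 0.99},
        "address": {"value": "123 Main Street", "confidence": 0.82},
    }
}

def flag_low_confidence(metadata, threshold):
    """Collect (path, value, confidence) for fields below the threshold."""
    flagged = []
    for field, info in metadata.items():
        if isinstance(info, dict) and "confidence" in info:
            if info["confidence"] is not None and info["confidence"] < threshold:
                flagged.append((field, info["value"], info["confidence"]))
        elif isinstance(info, dict):  # nested group of fields
            flagged.extend(
                (f"{field}.{path}", value, conf)
                for path, value, conf in flag_low_confidence(info, threshold)
            )
    return flagged

for path, value, conf in flag_low_confidence(extraction_metadata, CONFIDENCE_THRESHOLD):
    print(f"Route to human review: {path}={value!r} (confidence={conf})")
```

With the sample data above, only `patient.address` (confidence 0.82) falls below the threshold and is flagged for review.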

Example Workflow: Get Confidence Scores During Field Extraction

This example walks you through how to access and identify the confidence property during field extraction with the agentic-doc library.
  1. Download this PDF and save it to a local directory: Sample Bank Statement.
  2. Copy the Python script below and save it to a local directory:
    Sample Python Script for Field Extraction
        
    from __future__ import annotations
    
    from pydantic import BaseModel, Field
    from agentic_doc.parse import parse
    
    
    class SampleExtractionSchema(BaseModel):
        accountHolder: str = Field(
            ...,
            description='The full name of the person who holds the bank account.',
            title='Account Holder Name',
        )
        accountNumber: str = Field(
            ...,
            description='The bank account number associated with the account holder.',
            title='Bank Account Number',
        )
    
    # Parse a file and extract the fields
    results = parse("estatement.pdf", extraction_model=SampleExtractionSchema)
    fields = results[0].extraction
    
    # Return the value of the extracted fields
    print("Extracted Fields:")
    print(fields)    
    
    # Return the value of the extracted field metadata
    print("\nExtraction Metadata:")
    print(results[0].extraction_metadata)
    
  3. Install the agentic-doc library, if you haven’t already.
  4. Get your API key and set it. For more information, go to API Key.
  5. Run the Python script.
  6. Check that the output is similar to the following:
    Confidence Score in Output
    Extracted Fields:
    accountHolder='SUSAN SAMPLE' accountNumber='02782-5094431'
    
    Extraction Metadata:
    accountHolder=MetadataType[str](value='SUSAN SAMPLE', chunk_references=['d04c75fd-5b1e-4aab-a859-b4aff17305cc'], confidence=0.9999998063873693) accountNumber=MetadataType[str](value='02782-5094431', chunk_references=['5c06add1-bb8e-4143-9424-e95b5bbaeee5'], confidence=0.9718049806957251)
    

Null Confidence Scores

Because the confidence score feature is experimental and still in development, there are certain situations where scores are not available. The confidence score value will be null in the following situations:
  • Tables: Data extracted from tables will have a null confidence score.
  • Changes to formatting: Fields with custom formatting applied during extraction will have a null confidence score. For example, reformatting a date from “DD-MM-YYYY” to “MM-DD-YYYY” results in a null score.
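Because scores can be null, any threshold logic should check for a missing score before comparing. One reasonable policy, sketched below with hypothetical field data, is to treat a null score the same as a low one and route the field to review.

```python
# Hypothetical metadata entries; table-derived and reformatted
# fields can carry confidence=None (null in the JSON output).
fields = {
    "Total": {"value": "$100.00", "confidence": None},      # e.g. from a table
    "Invoice Date": {"value": "03-15-2025", "confidence": 0.97},
}

THRESHOLD = 0.9

def needs_review(info, threshold):
    """Treat a null confidence score as requiring human review."""
    conf = info.get("confidence")
    return conf is None or conf < threshold

for name, info in fields.items():
    if needs_review(info, THRESHOLD):
        print(f"{name}: send to review")  # only "Total" is flagged here
```

Comparing `None < threshold` directly would raise a TypeError in Python, so the explicit null check is required either way; whether null means "review" or "accept" is a policy choice for your workflow.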