Extract Data with the API

You can extract specific fields from a document when you call the API directly.

To do this, first create a JSON schema of the fields you want to extract. You can use the Playground to help you create the schema.

After creating the schema, you can either pass the JSON file in the API call or define the JSON schema directly in your script.
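As a minimal sketch of these two options, the snippet below defines a schema as a Python dictionary, saves it to a JSON file, and loads it back. The field names and the file name `example-schema.json` are hypothetical and used only for illustration; `fields_schema` is the form field used in the sample scripts on this page.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema with two illustrative fields (not from a real document)
schema = {
    "type": "object",
    "properties": {
        "employee_name": {"type": "string"},
        "wages": {"type": "number"},
    },
    "required": ["employee_name", "wages"],
}

# Option 1: keep the schema in a JSON file and load it when needed
schema_path = Path(tempfile.gettempdir()) / "example-schema.json"
schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
loaded_schema = json.loads(schema_path.read_text(encoding="utf-8"))

# Option 2: use the dictionary defined in the script directly.
# Either way, the schema is serialized to a string for the request.
payload = {"fields_schema": json.dumps(loaded_schema)}
```

Both options yield the same payload, so the choice mostly depends on whether you want to reuse the schema across scripts.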

You can also classify documents before extracting data, and then extract data based on the document type.

Supported Data Types for Fields

When extracting data with the API or library, you can specify the following data types for fields:

bool
bytes
float
int
nested objects
str
list: This data type from the typing library is supported if the types within the list are valid.
union: This data type from the typing library is supported if the types within the union are valid.

These data types are not supported: set, tuple, dict, optional, none.
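The rules above can be sketched as a small check on Python type annotations. This is an illustrative helper, not part of the API or library: the name `is_supported` is hypothetical, and the sketch covers only the scalar, list, and union cases (it does not model nested objects).

```python
from typing import List, Optional, Union, get_args, get_origin

# Scalar types listed as supported above
SUPPORTED_SCALARS = {bool, bytes, float, int, str}


def is_supported(tp) -> bool:
    """Return True if a type annotation uses only the supported types."""
    if tp in SUPPORTED_SCALARS:
        return True
    origin = get_origin(tp)
    # List[...] and Union[...] are supported only if every inner type is
    if origin in (list, Union):
        args = get_args(tp)
        return bool(args) and all(is_supported(arg) for arg in args)
    # set, tuple, dict, Optional (a Union with None), and None fall through
    return False


print(is_supported(List[int]))         # True
print(is_supported(Union[int, str]))   # True
print(is_supported(Optional[int]))     # False
```

Note that `Optional[int]` is shorthand for `Union[int, None]`, which is why it is rejected: the union contains an unsupported type.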

Sample Script: Pass JSON Schema to API

The script below shows how to pass a JSON file with the extraction schema to the API when parsing a document.

import json

import requests

VA_API_KEY = "<YOUR_VA_API_KEY>"  # Replace with your API key
headers = {"Authorization": f"Basic {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"

base_pdf_path = "your_pdf_path"  # Replace with the path to the folder that contains the file
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"

schema_name = "w2-schema.json"  # Replace with the name of the JSON schema file
schema_path = f"{base_pdf_path}/{schema_name}"

with open(schema_path, "r") as file:
    schema = json.load(file)

payload = {"fields_schema": json.dumps(schema)}

# Open the PDF in a context manager so the file is closed after the upload
with open(pdf_path, "rb") as pdf_file:
    files = [("pdf", (pdf_name, pdf_file, "application/pdf"))]
    response = requests.post(url, headers=headers, files=files, data=payload)

output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)
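The script above indexes directly into the response, which raises a `KeyError` if the request fails or the body has a different shape. As a hedged sketch, a small helper can make that failure explicit. The name `get_extracted_fields` is hypothetical, and the only assumption about the response is the `data` and `extracted_schema` layout used in the script above; the shape of error responses is not covered here.

```python
def get_extracted_fields(response_json: dict) -> dict:
    """Return the extracted fields from a parsed API response body.

    Assumes the layout used in the sample script:
    {"data": {"extracted_schema": {...}}}
    """
    data = response_json.get("data")
    if not isinstance(data, dict) or "extracted_schema" not in data:
        raise ValueError("Response does not contain data.extracted_schema")
    return data["extracted_schema"]
```

In the script, this could be called as `get_extracted_fields(response.json())`, ideally after `response.raise_for_status()` so that HTTP errors surface first.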

Sample Script: Include JSON Schema in Script

The script below shows how to define the extraction schema directly in the script when parsing a document.

import json
import requests

VA_API_KEY = "<YOUR_VA_API_KEY>"  # Replace with your API key
headers = {"Authorization": f"Basic {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"

base_pdf_path = "your_pdf_path"  # Replace with the path to the folder that contains the file
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"

# Define your schema
schema = {
    "type": "object",
    "properties": {
        "table": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "Id": {"type": "number"},
                    "Name": {"type": "string"},
                    "Email": {"type": "string"},
                    "Investments": {"type": "string"},
                },
                "required": [
                    "Id",
                    "Name",
                    "Email",
                    "Investments",
                ],
            },
            "description": "A table containing user information with columns for Id, Name, Email, and Investments.",
        }
    },
    "required": ["table"],
}

payload = {"fields_schema": json.dumps(schema)}

# Open the PDF in a context manager so the file is closed after the upload
with open(pdf_path, "rb") as pdf_file:
    files = [("pdf", (pdf_name, pdf_file, "application/pdf"))]
    response = requests.post(url, headers=headers, files=files, data=payload)

output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)

Extract Data Based on Document Type (Classification)

As part of the extraction process, you can classify each document and then extract data based on its type. For example, you can extract different sets of data from Invoices, Receipts, and Waybills.

To classify documents, use the enum keyword to define the document types in your script. Here is an example:

# Define document types
class_schema = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string", "enum": ["Document Type 1", "Document Type 2", "Document Type 3"]}
    },
    "required": ["document_type"],
}
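If you maintain several sets of document types, the classification schema can be generated from a list instead of written by hand. The following is a minimal sketch; the helper name `build_class_schema` is hypothetical and simply reproduces the structure shown above.

```python
def build_class_schema(document_types):
    """Build a classification schema whose enum lists the given document types."""
    types = list(document_types)
    if not types:
        raise ValueError("At least one document type is required")
    return {
        "type": "object",
        "properties": {
            "document_type": {"type": "string", "enum": types}
        },
        "required": ["document_type"],
    }


class_schema = build_class_schema(["Invoice", "Receipt", "Waybill"])
```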

Sample Script: Classify Documents

Let’s say you need to parse three types of documents: Passports, Invoices, and Other. The data you need to extract depends on the document type.

The following script shows you how to define your document types, define the fields to extract for each document type, and then parse the document, classify it, and extract the data.

import requests
import json

VA_API_KEY = "<YOUR_VA_API_KEY>"  # Replace with your API key
headers = {"Authorization": f"Basic {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"

base_pdf_path = "your_pdf_path"  # Replace with the path to the folder that contains the file
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"

# Define document types
class_schema = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string", "enum": ["Passport", "Invoice", "Other"]}
    },
    "required": ["document_type"],
}

# First request: classification
with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    payload = {"fields_schema": json.dumps(class_schema)}
    classification_response = requests.post(
        url, headers=headers, files=files, data=payload
    )

classification = classification_response.json()["data"]["extracted_schema"][
    "document_type"
]

# Define schema based on classification
if classification == "Passport":
    schema = {
        "type": "object",
        "properties": {
            "Given Names": {"type": "string"},
            "Date of birth": {"type": "string"},
            "ID_number": {"type": "number"},
            "Passport Number": {"type": "string"},
        },
        "required": ["Given Names", "Date of birth", "ID_number", "Passport Number"],
    }
elif classification == "Invoice":
    schema = {
        "type": "object",
        "properties": {
            "Bill to": {"type": "string"},
            "Invoice Number": {"type": "string"},
            "Invoice Date": {"type": "string"},
            "Due Date": {"type": "string"},
            "Total": {"type": "string"},
        },
        "required": ["Bill to", "Invoice Number", "Invoice Date", "Due Date", "Total"],
    }
else:
    print("Document type is 'Other'. No extraction schema defined.")
    raise SystemExit(0)  # Stop here; there is nothing to extract

# Second request: extraction
with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    payload = {"fields_schema": json.dumps(schema)}
    response = requests.post(url, headers=headers, files=files, data=payload)

output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)
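As the number of document types grows, the if/elif chain in the script above can be replaced with a lookup table. The sketch below is one possible arrangement, not the documented method: `object_schema`, `SCHEMAS_BY_TYPE`, and `schema_for` are hypothetical names, and the fields match the Passport and Invoice schemas in the script.

```python
def object_schema(field_types):
    """Build an object schema in which every listed field is required."""
    return {
        "type": "object",
        "properties": {name: {"type": t} for name, t in field_types.items()},
        "required": list(field_types),
    }


# One extraction schema per document type; "Other" has no entry on purpose
SCHEMAS_BY_TYPE = {
    "Passport": object_schema({
        "Given Names": "string",
        "Date of birth": "string",
        "ID_number": "number",
        "Passport Number": "string",
    }),
    "Invoice": object_schema({
        "Bill to": "string",
        "Invoice Number": "string",
        "Invoice Date": "string",
        "Due Date": "string",
        "Total": "string",
    }),
}


def schema_for(document_type):
    """Return the extraction schema for a document type, or None if undefined."""
    return SCHEMAS_BY_TYPE.get(document_type)
```

With this layout, adding a new document type means adding one dictionary entry, and a `None` result signals that the second (extraction) request should be skipped.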
