This article is about the legacy agentic-doc library. Use the landingai-ade library for all new projects.
You can extract specified fields from a document when parsing it with the agentic-doc library. There are two parameters you can use to define and pass your extraction schema:
  • extraction_model: Define fields using a Pydantic model class and receive the results as a Pydantic model instance.
  • extraction_schema: Define fields using a JSON schema dictionary and receive results as a dictionary.

Extract Fields with extraction_model

The extraction_model approach uses Pydantic models to specify which data fields to extract from documents. You define a class that inherits from BaseModel and pass that class to the extraction_model parameter. The results are returned as a Pydantic model instance with type validation and attribute access.

Sample Script: Extract Fields with extraction_model

For example, suppose you want to extract a few fields from this Pay Stub. First, define the extraction schema as a Pydantic model. Then call the parse function and pass that model class to the extraction_model parameter.
from pydantic import BaseModel, Field
from agentic_doc.parse import parse

# Define the fields you want to extract
class ExtractedFields(BaseModel):
    employee_name: str = Field(description="the full name of the employee")
    employee_ssn: str = Field(description="the social security number of the employee")
    gross_pay: float = Field(description="the gross pay of the employee")
    employee_address: str = Field(description="the address of the employee")

# Parse a file and extract the fields
results = parse("pay-stub.pdf", extraction_model=ExtractedFields)
fields = results[0].extraction

# Return the value of one of the extracted fields
print(fields.employee_name)
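
The two approaches are closely related, because a Pydantic model can be rendered as a JSON schema dictionary. The sketch below assumes Pydantic v2, where BaseModel.model_json_schema() is available, and uses a shortened two-field version of the model above. It only inspects the schema locally and does not call parse. Whether agentic-doc performs this conversion internally is not confirmed here.

```python
import json

from pydantic import BaseModel, Field


# Shortened version of the ExtractedFields model, for illustration only
class ExtractedFields(BaseModel):
    employee_name: str = Field(description="the full name of the employee")
    gross_pay: float = Field(description="the gross pay of the employee")


# Render the model as a JSON schema dictionary (Pydantic v2 API)
schema = ExtractedFields.model_json_schema()
print(json.dumps(schema, indent=2))
```

Inspecting the generated schema is a quick way to check that field names, types, and descriptions read the way you intend before parsing a document.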

Extract Fields with extraction_schema

The extraction_schema approach uses a JSON schema dictionary to specify which data fields to extract from documents. You provide the schema as a dictionary in the extraction_schema parameter and receive the results as a dictionary.

Sample Script: Extract Fields with extraction_schema

For example, suppose you want to extract a few fields from this Pay Stub. First, define the JSON schema as a dictionary. Then call the parse function and pass that dictionary to the extraction_schema parameter.
from agentic_doc.parse import parse

# Define the extraction schema as a dictionary
extraction_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Employee Payroll Field Extraction Schema",
    "description": "Schema for extracting key employee payroll fields from a markdown document, as specified by the user.",
    "type": "object",
    "properties": {
        "employee_name": {
            "title": "Employee Name",
            "description": "The full name of the employee as it appears on the payroll document.",
            "type": "string"
        },
        "employee_ssn": {
            "title": "Employee Social Security Number", 
            "description": "The Social Security Number of the employee, formatted as XXX-XX-XXXX.",
            "type": "string"
        },
        "gross_pay": {
            "title": "Gross Pay",
            "description": "The total gross pay for the employee for the specified pay period, as shown on the earnings statement.",
            "type": "string"
        },
        "employee_address": {
            "title": "Employee Address",
            "description": "The full mailing address of the employee as listed on the payroll document.",
            "type": "string"
        }
    }
}

# Parse the document with the extraction schema
results = parse("./documents/pay-stub.pdf", extraction_schema=extraction_schema)
extracted_data = results[0].extraction

# Access the extracted fields (returned as a dictionary)
print(f"Employee: {extracted_data['employee_name']}")
print(f"SSN: {extracted_data['employee_ssn']}")
print(f"Gross Pay: {extracted_data['gross_pay']}")
print(f"Address: {extracted_data['employee_address']}")
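
Because the extraction_schema approach returns a plain dictionary, no Pydantic validation is applied to the values. The sketch below shows one way you might check a result yourself, under stated assumptions: missing_fields and parse_amount are hypothetical helpers, not part of agentic-doc, and the sample values are invented for illustration rather than real extraction output.

```python
from __future__ import annotations

import re


def missing_fields(schema: dict, extracted: dict) -> list[str]:
    """Return schema property names that are absent or None in the result."""
    return [
        name
        for name in schema.get("properties", {})
        if extracted.get(name) is None
    ]


def parse_amount(text: str) -> float:
    """Convert a currency string such as '$1,234.50' to a float."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    return float(cleaned)


# Hypothetical schema and result, shaped like the example above
sample_schema = {
    "type": "object",
    "properties": {
        "employee_name": {"type": "string"},
        "gross_pay": {"type": "string"},
    },
}
sample_result = {"employee_name": "Jane Doe", "gross_pay": "$1,234.50"}

print(missing_fields(sample_schema, sample_result))
print(parse_amount(sample_result["gross_pay"]))
```

Since gross_pay is declared as a string in the schema above, a conversion step like parse_amount is needed before doing arithmetic on the value.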

Extract Nested Subfields

To extract nested subfields from documents, define a Pydantic model for each logical grouping of related fields, then nest those models within your main extraction schema. This organizes the extracted data hierarchically, with related fields grouped under meaningful section names. Define the models for the nested fields before the main extraction schema. Otherwise, you may get an error that the classes with the nested fields are not defined.

Sample Script: Extract Nested Subfields

For example, suppose you want to extract data from the Patient Details and Emergency Contact Information sections in this Medical Form. First, define a schema for the nested information in the Patient Details section and another schema for the nested information in the Emergency Contact Information section. Then, define a schema that combines the nested schemas. Finally, pass the combined schema to the parse function in the extraction_model parameter.
from __future__ import annotations
from pydantic import BaseModel, Field
from agentic_doc.parse import parse


# Define a nested model for patient-specific information
class PatientDetails(BaseModel):
    
    patient_name: str = Field(
        ..., 
        description='Full name of the patient.', 
        title='Patient Name'
    )
    date: str = Field(
        ...,
        description='Date the patient information form was filled out.',
        title='Date',
    )


# Define a nested model for emergency contact details
class EmergencyContactInformation(BaseModel):
    
    emergency_contact_name: str = Field(
        ...,
        description='Full name of the emergency contact person.',
        title='Emergency Contact Name',
    )
    relationship_to_patient: str = Field(
        ...,
        description='Relationship of the emergency contact to the patient.',
        title='Relationship to Patient',
    )
    primary_phone_number: str = Field(
        ...,
        description='Primary phone number of the emergency contact.',
        title='Primary Phone Number',
    )
    secondary_phone_number: str = Field(
        ...,
        description='Secondary phone number of the emergency contact.',
        title='Secondary Phone Number',
    )
    address: str = Field(
        ...,
        description='Full address of the emergency contact.', 
        title='Address'
    )


# Define the main extraction schema that combines all the nested models
class PatientAndEmergencyContactInformationExtractionSchema(BaseModel):
    
    # Nested field containing patient details
    patient_details: PatientDetails = Field(
        ...,
        description='Information about the patient as provided in the form.',
        title='Patient Details',
    )
    
    # Nested field containing emergency contact information
    emergency_contact_information: EmergencyContactInformation = Field(
        ...,
        description='Details of the emergency contact person for the patient.',
        title='Emergency Contact Information',
    )


# MAIN EXECUTION SECTION
# Parse the PDF and extract structured data using the schema
print("Parsing document and extracting patient information...")
results = parse("medical-form.pdf", extraction_model=PatientAndEmergencyContactInformationExtractionSchema)

# Get the extracted fields from the first result
# (results is a list, this takes the first item's extraction)
fields = results[0].extraction

# Display the extracted structured data
print("Extracted patient and emergency contact information:")
print(fields)

Extract Variable-Length Data with List Objects

If your documents have repeatable data structures, use Python’s List type (from the typing module) in your Pydantic model to extract that data. This pattern works when the same type of information appears multiple times, you don’t know how many items will appear, and each repeated item has the same fields. Common examples include:
  • Line items in invoices or receipts
  • Transaction records in bank statements
  • Contact information for multiple people
  • Product details in catalogs

Sample Script: Extract Variable-Length Data with List Objects

For example, suppose you want to extract several fields from Wire Transfer Forms. Each form might have a different number of wire instructions and line items. You could use the following script to extract these variable-length lists. The Invoice model uses List[DescriptionItem] for line items and List[WireInstruction] for wire transfer details. This allows the extraction to capture all items, regardless of how many are in the document. Run the script on this Wire Transfer Form.
from typing import List
from pydantic import BaseModel, Field
from agentic_doc.parse import parse

# Nested models for list fields
class DescriptionItem(BaseModel):
    description: str = Field(description="Invoice or Bill Description")
    amount: float = Field(description="Invoice or Bill Amount")

class WireInstruction(BaseModel):
    bank_name: str = Field(description="Bank name")
    bank_address: str = Field(description="Bank address")
    bank_account_no: str = Field(description="Bank account number")
    swift_code: str = Field(description="SWIFT code")
    aba_routing: str = Field(description="ABA routing number")
    ach_routing: str = Field(description="ACH routing number")

# Invoice model containing list object fields
class Invoice(BaseModel):
    description_or_particular: List[DescriptionItem] = Field(
        description="List of invoice line items (description and amount)"
    )
    wire_instructions: List[WireInstruction] = Field(
        description="Wire transfer instructions"
    )

# Main extraction model
class ExtractedInvoiceFields(BaseModel):
    invoice: Invoice = Field(description="Invoice list-type fields")

# Parse a file and extract the fields
results = parse("./documents/wire-transfer.pdf", extraction_model=ExtractedInvoiceFields)
fields = results[0].extraction

# Print results
print(fields)
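
Once the list fields are extracted, you can iterate over them like any Python list. The sketch below illustrates this without calling parse: it uses SimpleNamespace objects as stand-ins for the returned model instance, and all values are hypothetical. The total_amount helper is an illustration, not part of agentic-doc.

```python
from types import SimpleNamespace

# Stand-in for results[0].extraction with hypothetical values;
# attribute names mirror the ExtractedInvoiceFields model above
fields = SimpleNamespace(
    invoice=SimpleNamespace(
        description_or_particular=[
            SimpleNamespace(description="Consulting", amount=1200.0),
            SimpleNamespace(description="Travel", amount=300.5),
        ],
        wire_instructions=[
            SimpleNamespace(bank_name="Example Bank", swift_code="EXAMPLEXX"),
        ],
    )
)


def total_amount(invoice) -> float:
    """Sum the amount of every extracted line item."""
    return sum(item.amount for item in invoice.description_or_particular)


# Walk the variable-length list of line items
for item in fields.invoice.description_or_particular:
    print(f"{item.description}: {item.amount}")

print(f"Total: {total_amount(fields.invoice)}")
print(f"Wire instructions found: {len(fields.invoice.wire_instructions)}")
```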

Extraction Output

When you extract fields from a document using the agentic-doc library, these fields are returned:

extraction

This is the same type as the BaseModel that is passed in as the argument to extraction_model.

extraction_metadata

This has the same nesting structure as the BaseModel that is passed in as the argument to extraction_model, except that the value of each field is replaced with a dictionary of chunk references. Each dictionary has this form:
{"chunk_references": ["chunk-id-1", "chunk-id-2", ...]}
Here is an example of a returned extraction_metadata field:
extraction_metadata = {
    "employee_name": {
        "chunk_references": [
            "72ba3cca-01e5-407b-9fc4-81f54f9f0c51"
        ]
    },
    "gross_pay": {
        "chunk_references": [
            "5b8865b9-1a81-46df-bcf7-0bdbed9130dc"
        ]
    }
}
If you have a more deeply nested extraction_model (like in this medical form example), the metadata mirrors that nesting, and you may see something like this:
extraction_metadata = {
    "patient": {
        "name": {
            "chunk_references": [
                "72ba3cca-01e5-407b-9fc4-81f54f9f0c51"
            ]
        }
    },
    "emergency_contact": {
        "name": {
            "chunk_references": [
                "5b8865b9-1a81-46df-bcf7-0bdbed9130dc"
            ]
        }
    }
}
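
To work with nested metadata, it can help to flatten it into a mapping from field path to chunk IDs. The sketch below assumes the metadata is available as a plain dictionary shaped like the example above. The collect_chunk_references helper is hypothetical, not part of agentic-doc, and it does not handle list-type fields, whose metadata shape is not shown here.

```python
def collect_chunk_references(metadata: dict, prefix: str = "") -> dict:
    """Flatten nested metadata into {dotted.field.path: [chunk ids]}."""
    refs = {}
    for key, value in metadata.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and "chunk_references" in value:
            # Leaf field: copy its list of chunk IDs
            refs[path] = list(value["chunk_references"])
        elif isinstance(value, dict):
            # Nested group: recurse with the extended path
            refs.update(collect_chunk_references(value, path))
    return refs


# Same structure as the nested example above
extraction_metadata = {
    "patient": {
        "name": {
            "chunk_references": ["72ba3cca-01e5-407b-9fc4-81f54f9f0c51"]
        }
    },
    "emergency_contact": {
        "name": {
            "chunk_references": ["5b8865b9-1a81-46df-bcf7-0bdbed9130dc"]
        }
    },
}

for path, chunk_ids in collect_chunk_references(extraction_metadata).items():
    print(path, chunk_ids)
```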

extraction_error

If an error occurs during extraction, the extraction_error field describes the issue. If there are no errors, the extraction_error field is None.
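
A simple way to use this field is to check it before reading the extracted values. The sketch below uses SimpleNamespace objects as stand-ins for a parsed result so that it runs without calling parse. The extraction_or_raise helper and the error message are hypothetical.

```python
from types import SimpleNamespace


def extraction_or_raise(result):
    """Return the extracted fields, or raise if extraction reported an error."""
    if result.extraction_error is not None:
        raise RuntimeError(f"Field extraction failed: {result.extraction_error}")
    return result.extraction


# Stand-ins for items in the list returned by parse (hypothetical values)
ok = SimpleNamespace(extraction={"employee_name": "Jane Doe"}, extraction_error=None)
failed = SimpleNamespace(extraction=None, extraction_error="hypothetical error message")

print(extraction_or_raise(ok))

try:
    extraction_or_raise(failed)
except RuntimeError as exc:
    print(exc)
```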