This article is about the legacy agentic-doc library. Use the landingai-ade library for all new projects.
extraction_model
: Define fields using a Pydantic model class and receive the results as a Pydantic model instance.
extraction_schema
: Define fields using a JSON schema dictionary and receive results as a dictionary.
Extract Fields with extraction_model
The extraction_model approach uses Pydantic models to specify which data fields to extract from documents. You define a class that inherits from BaseModel and pass that class to the extraction_model parameter. The results are returned as a Pydantic model instance with type validation and attribute access.
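To make "type validation and attribute access" concrete, here is a minimal Pydantic-only sketch that does not call the agentic-doc library. The PayStubFields class and its two fields are hypothetical names chosen for illustration.

```python
from pydantic import BaseModel, ValidationError


class PayStubFields(BaseModel):
    # Hypothetical fields, used only to illustrate Pydantic behavior
    employee_name: str
    gross_pay: float


# Values are validated against the declared types when the instance is created;
# a numeric string is converted to a float by Pydantic's default (lax) validation.
fields = PayStubFields(employee_name="Jane Doe", gross_pay="1250.75")

# Results are read as attributes rather than dictionary keys.
print(fields.employee_name)
print(fields.gross_pay)

# A value that cannot be converted to the declared type raises ValidationError.
try:
    PayStubFields(employee_name="Jane Doe", gross_pay="not a number")
    validation_failed = False
except ValidationError:
    validation_failed = True
print(validation_failed)
```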
Sample Script: Extract Fields with extraction_model
For example, let’s say you want to extract a few fields from this Pay Stub. First define the extraction schema as a Pydantic model in the extraction_model parameter. Then run the parse function with the extraction_model parameter.
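A minimal sketch of these two steps might look like the following. It assumes that parse can be imported from agentic_doc.parse and that it returns a list with one result per document. The field names, the PayStubFields class, and the extract_pay_stub helper are hypothetical and not taken from the Pay Stub sample.

```python
from pydantic import BaseModel, Field


class PayStubFields(BaseModel):
    # Hypothetical fields; adjust names and descriptions to match your document
    employee_name: str = Field(description="Full name of the employee")
    pay_period: str = Field(description="Pay period covered by the pay stub")
    gross_pay: float = Field(description="Gross pay for the pay period")
    net_pay: float = Field(description="Net pay after deductions")


def extract_pay_stub(path: str):
    """Parse one document with the model above and return its result object."""
    # Assumed import path; imported inside the function so that the model
    # can be defined and inspected without the library installed.
    from agentic_doc.parse import parse

    # The model class itself (not an instance) is passed to extraction_model.
    results = parse(path, extraction_model=PayStubFields)
    # Assumption: one result per input document, in order.
    return results[0]
```

Calling extract_pay_stub with the path to a local copy of the pay stub would then return the result for that document.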
Extract Fields with extraction_schema
The extraction_schema approach uses a JSON schema dictionary to specify which data fields to extract from documents. You provide the schema as a dictionary in the extraction_schema parameter and receive the results as a dictionary.
Sample Script: Extract Fields with extraction_schema
For example, let’s say you want to extract a few fields from this Pay Stub. First define the JSON schema in the extraction_schema parameter. Then run the parse function with the extraction_schema parameter.
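A minimal sketch of the dictionary-based approach follows. As before, it assumes that parse can be imported from agentic_doc.parse and returns a list with one result per document. The schema contents and the extract_pay_stub_as_dict helper are hypothetical.

```python
# Hypothetical JSON schema, written as a plain Python dictionary.
pay_stub_schema = {
    "type": "object",
    "properties": {
        "employee_name": {
            "type": "string",
            "description": "Full name of the employee",
        },
        "gross_pay": {
            "type": "number",
            "description": "Gross pay for the pay period",
        },
        "net_pay": {
            "type": "number",
            "description": "Net pay after deductions",
        },
    },
    "required": ["employee_name", "gross_pay", "net_pay"],
}


def extract_pay_stub_as_dict(path):
    """Parse one document with the schema above and return its result object."""
    # Assumed import path; imported lazily so the schema can be checked
    # without the library installed.
    from agentic_doc.parse import parse

    results = parse(path, extraction_schema=pay_stub_schema)
    # Assumption: one result per input document, in order.
    return results[0]
```

Compared with a Pydantic model, the dictionary form needs no class definitions, but the extracted values come back as a plain dictionary without attribute access.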
Extract Nested Subfields
To extract nested subfields from documents, define a Pydantic model for a logical grouping of related fields, then nest it within your main extraction schema. This approach allows you to organize document data into structured, hierarchical formats that group related information under meaningful section names. The models with the nested fields must be defined before the main extraction schema. Otherwise you may get an error that the classes with the nested fields are not defined.
Sample Script: Extract Nested Subfields
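The ordering requirement can be sketched as follows: the two section models are declared first, and the combined model that references them comes last. The class and field names are hypothetical and are not taken from the Medical Form. The parse import path and list return value are assumptions.

```python
from pydantic import BaseModel, Field


# Nested models come first so their names exist when MedicalForm is defined.
class PatientDetails(BaseModel):
    patient_name: str = Field(description="Full name of the patient")
    date_of_birth: str = Field(description="Date of birth as written on the form")


class EmergencyContact(BaseModel):
    contact_name: str = Field(description="Full name of the emergency contact")
    relationship: str = Field(description="Relationship to the patient")
    phone: str = Field(description="Phone number of the emergency contact")


# The combined schema nests the two section models as typed fields.
class MedicalForm(BaseModel):
    patient_details: PatientDetails = Field(description="Patient Details section")
    emergency_contact: EmergencyContact = Field(
        description="Emergency Contact Information section"
    )


def extract_medical_form(path: str):
    """Parse one document with the combined schema and return its result object."""
    # Assumed import path; imported lazily so the models can be used on their own.
    from agentic_doc.parse import parse

    results = parse(path, extraction_model=MedicalForm)
    return results[0]
```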
For example, let’s say you want to extract data from the Patient Details and Emergency Contact Information sections in this Medical Form. First, define a schema for the nested information in the Patient Details section and another schema for the nested information in the Emergency Contact Information section. Then, define a schema that combines the nested schemas. Last, run the “combined” schema.
Extract Variable-Length Data with List Objects
If your documents have repeatable data structures, use the List type (from Python’s typing module) in your Pydantic model to extract that data.
This pattern works when the same type of information appears multiple times, you don’t know how many items will appear, and each repeated item has the same fields. Common examples include:
- Line items in invoices or receipts
- Transaction records in bank statements
- Contact information for multiple people
- Product details in catalogs
Sample Script: Extract Variable-Length Data with List Objects
For example, let’s say you want to extract several fields from Wire Transfer Forms. Each form might have a different number of wire instructions and line items. You could use the following script to extract these variable-length lists. The Invoice model uses List[DescriptionItem] for line items and List[WireInstruction] for wire transfer details. This allows the extraction to capture all items, regardless of how many are in the document.
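A minimal sketch of such a script is shown below. The class names Invoice, DescriptionItem, and WireInstruction follow the description above, but every field name is hypothetical. The parse import path and list return value are assumptions.

```python
from typing import List

from pydantic import BaseModel, Field


class DescriptionItem(BaseModel):
    # Hypothetical fields for one line item
    description: str = Field(description="Description of the line item")
    amount: float = Field(description="Amount charged for the line item")


class WireInstruction(BaseModel):
    # Hypothetical fields for one set of wire instructions
    bank_name: str = Field(description="Name of the receiving bank")
    account_number: str = Field(description="Account number for the transfer")


class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice identifier")
    # List[...] lets each document contribute any number of items.
    line_items: List[DescriptionItem] = Field(description="All line items")
    wire_instructions: List[WireInstruction] = Field(
        description="All wire transfer instructions"
    )


def extract_invoice(path: str):
    """Parse one document with the Invoice model and return its result object."""
    # Assumed import path; imported lazily so the models can be used on their own.
    from agentic_doc.parse import parse

    results = parse(path, extraction_model=Invoice)
    return results[0]
```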
Run the script on this Wire Transfer Form.
Extraction Output
When you extract fields from a document using the agentic-doc library, these fields are returned:
extracted_schema
This is the same type as the BaseModel that is passed in as the argument to extraction_model.
extraction_metadata
This is the same type as the BaseModel that is passed in as the argument to extraction_model.
It has the same nesting structure as the BaseModel except that each field is replaced with a dictionary. This dictionary uses this form:
Example of the extraction_metadata field:
If you use nested models in the extraction_model (like in this medical form example), then you may see something like this:
extraction_error
If the extraction function encountered an error, the extraction_error field will describe the issue. If there are no errors, the extraction_error field is None.
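One way to act on this field is to check it before reading any extracted values. The sketch below uses stand-in objects built with SimpleNamespace instead of real parse results, so the error message shown is made up. The describe_extraction helper is hypothetical and relies only on the extraction_error field described above.

```python
from types import SimpleNamespace


def describe_extraction(result):
    """Return a short status string based on the extraction_error field."""
    error = getattr(result, "extraction_error", None)
    if error is not None:
        return f"extraction failed: {error}"
    return "extraction succeeded"


# Stand-ins for parse results; real results come from the parse function.
ok_result = SimpleNamespace(extraction_error=None)
failed_result = SimpleNamespace(extraction_error="example error message")

print(describe_extraction(ok_result))      # extraction succeeded
print(describe_extraction(failed_result))  # extraction failed: example error message
```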