Extract Data with the Library
You can extract specified fields from a document when parsing it with the agentic-doc library.
There are two methods for running extraction with the library:
- Define the fields that you want to extract using Pydantic models directly in your code. Then pass that definition to the extraction_model parameter when running the parse function.
- Input the schema in the extraction_schema parameter as a dict and receive the answer as a dict (JSON).
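The second method can be sketched as follows. This is a minimal illustration, not the library's reference usage: it assumes the schema dict follows JSON Schema conventions, that parse is imported from agentic_doc.parse, and that parse returns a list with one parsed document per input. The field names invoice_number and total_due and the helper extract_with_schema are hypothetical.

```python
# Hypothetical schema: the field names are illustrative only.
# Assumption: extraction_schema accepts a JSON-Schema-style dict.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice number"},
        "total_due": {"type": "number", "description": "The total amount due"},
    },
    "required": ["invoice_number", "total_due"],
}


def extract_with_schema(path):
    """Parse one local file with a dict schema and return the first parsed document."""
    # Imported lazily so the schema can be defined without the library installed.
    from agentic_doc.parse import parse

    # Assumption: parse returns a list of parsed documents.
    results = parse(path, extraction_schema=invoice_schema)
    return results[0]
```

The extracted values are then read from the output fields of the returned document, which are described under Extraction Output.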
Sample Script: Extract Fields (extraction_model)
Run this script to extract fields from a local file and return the value of one of the extracted fields. In this scenario, the schema is passed in the extraction_model parameter.
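A minimal sketch of such a script, assuming pydantic is installed and that parse is imported from agentic_doc.parse and returns a list of parsed documents. The model InvoiceFields, its fields, and the helper extract_fields are hypothetical names chosen for illustration.

```python
from pydantic import BaseModel, Field


class InvoiceFields(BaseModel):
    # Hypothetical fields; replace them with the fields in your own document.
    invoice_number: str = Field(description="The invoice number")
    total_due: float = Field(description="The total amount due")


def extract_fields(path):
    """Parse one local file with the Pydantic model and return the first parsed document."""
    # Imported lazily so the model can be defined and tested on its own.
    from agentic_doc.parse import parse

    # Assumption: parse returns a list of parsed documents.
    results = parse(path, extraction_model=InvoiceFields)
    return results[0]
```

Because the extracted values come back as an instance of the model passed to extraction_model, a single value is read as a plain attribute, for example invoice_number on that instance.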
Sample Script: Extract Nested Subfields
You can extract nested subfields from documents. For example, let’s say you have a medical form that has two “name” fields: one for the Patient and one for their Emergency Contact. You can extract both fields by running the script below.
The script extracts the nested subfields from a local file and returns their values.
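One way to model the medical form is with one Pydantic model nested inside another, so that each "name" lives under its own parent field. This is a sketch under assumptions: the class names Person and MedicalForm, the field names, and both helpers are hypothetical, and parse is assumed to be imported from agentic_doc.parse and to return a list of parsed documents.

```python
from pydantic import BaseModel, Field


class Person(BaseModel):
    # Shared shape for anyone named on the form (hypothetical).
    name: str = Field(description="The person's full name")


class MedicalForm(BaseModel):
    # Two nested subfields, each with its own "name".
    patient: Person = Field(description="The patient")
    emergency_contact: Person = Field(description="The patient's emergency contact")


def read_names(form):
    """Return (patient name, emergency contact name) from a MedicalForm instance."""
    return form.patient.name, form.emergency_contact.name


def extract_form(path):
    """Parse one local file with the nested model and return the first parsed document."""
    # Imported lazily so the models can be defined and tested on their own.
    from agentic_doc.parse import parse

    # Assumption: parse returns a list of parsed documents.
    results = parse(path, extraction_model=MedicalForm)
    return results[0]
```

Nesting keeps the two names unambiguous: the schema itself records which "name" belongs to the patient and which to the emergency contact.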
Extraction Output
When you extract fields from a document using the agentic-doc library, these fields are returned:
extracted_schema
This is the same type as the BaseModel that is passed in as the argument to extraction_model.
extraction_metadata
This is the same type as the BaseModel that is passed in as the argument to extraction_model.
It has the same nesting structure as the BaseModel except that each field is replaced with a dictionary. This dictionary uses this form:
Here is an example of a returned extraction_metadata field:
If you have a more deeply-nested extraction_model (like in this medical form example), then you may see something like this:
extraction_error
If the extraction function encountered an error, the extraction_error field will describe the issue. If there are no errors, the extraction_error field returns None.
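A caller can check extraction_error before trusting the extracted values. The helper below is a hypothetical sketch, not part of the library; it only assumes that the parsed document exposes an extraction_error attribute that is None on success. The SimpleNamespace objects are stand-ins for parsed documents, and the error text is made up.

```python
from types import SimpleNamespace


def check_extraction(doc):
    """Return doc unchanged, or raise if it reports an extraction error."""
    error = getattr(doc, "extraction_error", None)
    if error is not None:
        raise RuntimeError(f"Extraction failed: {error}")
    return doc


# Stand-ins for parsed documents (not real library objects).
ok = SimpleNamespace(extraction_error=None)
failed = SimpleNamespace(extraction_error="made-up error text")

check_extraction(ok)  # passes silently
try:
    check_extraction(failed)
except RuntimeError as exc:
    print(exc)
```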