Document Parser#
Document Parser can automatically parse PDF, JPG, and PNG documents to extract fields, tables, values, and barcodes from invoices, statements, orders, and other PDF and scanned documents.
Built-in document parser templates#
General Invoice Template
can parse invoices (English only) to extract the invoice ID, invoice date, total, tax, and line items. Set the templateId parameter to 1 to use this template.
How to classify incoming documents before parsing them?#
Use the /pdf/classifier endpoint (see below) to automatically detect the class of a document, based either on AI or on custom keyword-based rules.
For example, you can define rules that identify which vendor issued the document and then apply the matching template. See Document Classifier for more details.
Additional Information and Tools#
Available Methods#
/pdf/documentparser#
This API method extracts data from documents based on a document parser extraction template. With this API method, you can extract data from custom areas by searching form fields, tables, multiple pages, and more.
Method: POST
Endpoint: /v1/pdf/documentparser
Attributes#
Note
Attributes are case-sensitive and must be passed in the JSON body of the POST request, for example:
{
"url": "https://example.com/file1.pdf"
}
Attribute | Description | Required |
---|---|---|
url | URL to the source file. 1 | yes |
 | HTTP auth user name if required to access source | no |
 | HTTP auth password if required to access source | no |
templateId | ID of the document parser template to use. View and manage your templates at Document Parser. | no |
 | You can pass the code of the document parser template to be used directly. | no |
 | Set to | no |
 | Default is | no |
password | Password of the PDF file. The input must be in string format. | no |
 | Set | no |
 | File name for the generated output. The input must be in string format. | no |
 | Set the expiration time for the output link in minutes (default is | no |
profiles | Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more. | no |
Query parameters#
No query parameters accepted.
Payload#
{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
"outputFormat": "JSON",
"templateId": "1",
"async": false,
"inline": "true",
"password": "",
"profiles": ""
}
Response 2#
{
"body": {
"objects": [
{
"name": "companyName",
"objectType": "field",
"value": "Amazon Web Services, Inc",
"rectangle": [
0,
0,
0,
0
]
},
{
"name": "companyName2",
"objectType": "field",
"value": "Amazon Web Services, Inc",
"rectangle": [
0,
0,
0,
0
]
},
{
"name": "invoiceId",
"objectType": "field",
"value": "123456789",
"pageIndex": 0,
"rectangle": [
0,
0,
0,
0
]
},
{
"name": "dateIssued",
"objectType": "field",
"value": "2018-04-03T00:00:00",
"pageIndex": 0,
"rectangle": [
0,
0,
0,
0
]
},
{
"name": "dateDue",
"objectType": "field",
"value": "2018-04-03T00:00:00",
"pageIndex": 0,
"rectangle": [
0,
0,
0,
0
]
},
{
"name": "bankAccount",
"objectType": "field",
"value": "123456789012",
"pageIndex": 0,
"rectangle": [
0,
0,
0,
0
]
},
{
"name": "total",
"objectType": "field",
"value": 6.58,
"pageIndex": 0,
"rectangle": [
0,
0,
0,
0
]
},
{
"name": "subTotal",
"objectType": "field",
"value": ""
},
{
"name": "tax",
"objectType": "field",
"value": 1.01,
"pageIndex": 0,
"rectangle": [
0,
0,
0,
0
]
},
{
"objectType": "table",
"name": "table",
"rows": []
}
],
"templateName": "Generic Invoice [en]",
"templateVersion": "4",
"timestamp": "2020-08-21T19:23:31"
},
"pageCount": 1,
"error": false,
"status": 200,
"name": "sample-invoice.json",
"remainingCredits": 60803
}
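The objects array in the response mixes scalar fields and tables, so client code usually needs to flatten it before use. The following is a minimal Python sketch, assuming the response has the shape of the sample above. The helper name extract_fields is illustrative and is not part of any PDF.co SDK.

```python
import json


def extract_fields(response):
    """Return ({field name: value}, {table name: rows}) from a parser response."""
    if response.get("error"):
        raise ValueError(f"parser reported an error, status {response.get('status')}")
    fields, tables = {}, {}
    for obj in response.get("body", {}).get("objects", []):
        kind = obj.get("objectType")
        if kind == "field":
            fields[obj["name"]] = obj.get("value")
        elif kind == "table":
            tables[obj["name"]] = obj.get("rows", [])
    return fields, tables


# Trimmed copy of the sample response shown above.
sample = json.loads("""
{
  "body": {
    "objects": [
      {"name": "invoiceId", "objectType": "field", "value": "123456789", "pageIndex": 0},
      {"name": "total", "objectType": "field", "value": 6.58, "pageIndex": 0},
      {"name": "subTotal", "objectType": "field", "value": ""},
      {"objectType": "table", "name": "table", "rows": []}
    ],
    "templateName": "Generic Invoice [en]"
  },
  "pageCount": 1,
  "error": false,
  "status": 200
}
""")

fields, tables = extract_fields(sample)
print(fields["invoiceId"], fields["total"])  # prints: 123456789 6.58
print(tables)                                # prints: {'table': []}
```

In the sample, subTotal comes back as an empty string, so callers may want to treat empty values as "not found" rather than as real data.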
CURL#
curl --location --request POST 'https://api.pdf.co/v1/pdf/documentparser' \
--header 'Content-Type: application/json' \
--header 'x-api-key: {{x-api-key}}' \
--data-raw '{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
"outputFormat": "JSON",
"templateId": "1",
"async": false,
"inline": "true",
"password": "",
"profiles": ""
}'
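For callers who prefer a script over curl, the same request can be sketched in Python with only the standard library. This is a rough equivalent, not an official client: the function names are hypothetical, the payload simply mirrors the sample above, and the use_cache flag applies the cache: URL prefix described in footnote 1.

```python
import json
import urllib.request

API_BASE = "https://api.pdf.co/v1"  # host taken from the curl example


def build_payload(file_url, template_id="1", use_cache=False):
    """Mirror the sample payload; optionally add the cache: prefix from footnote 1."""
    if use_cache and not file_url.startswith("cache:"):
        file_url = "cache:" + file_url
    return {
        "url": file_url,
        "outputFormat": "JSON",
        "templateId": str(template_id),
        "async": False,
        "inline": "true",
        "password": "",
        "profiles": "",
    }


def build_request(api_key, payload):
    """Create (but do not send) the POST request for /pdf/documentparser."""
    return urllib.request.Request(
        API_BASE + "/pdf/documentparser",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def parse_document(api_key, file_url, **options):
    """Send the request and return the decoded JSON response (needs network access)."""
    request = build_request(api_key, build_payload(file_url, **options))
    with urllib.request.urlopen(request, timeout=60) as reply:
        return json.load(reply)
```

A call would look like parse_document("YOUR_API_KEY", "https://example.com/file1.pdf", template_id=1). Note that urllib raises urllib.error.HTTPError for non-success HTTP statuses, so real code would likely wrap the call in a try/except.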
/pdf/documentparser/templates#
Returns all Document Parser data extraction templates for the current user.
Use the PDF.co dashboard to manage your Document Parser Templates.
Method: GET
Endpoint: /v1/pdf/documentparser/templates
Query parameters#
No query parameters accepted.
Body payload#
No body parameters accepted.
Response 2#
{
"templates": [
{
"id": 40,
"type": "user",
"title": "Untitled",
"description": "Untitled"
},
{
"id": 1,
"type": "system",
"title": "Invoice Parser",
"description": "Parses invoices and extracts invoice number, company name, due date, amount, tax"
}
],
"remainingCredits": 94229
}
CURL#
curl --location --request GET 'https://api.pdf.co/v1/pdf/documentparser/templates' \
--header 'Content-Type: application/json' \
--header 'x-api-key: {{x-api-key}}'
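The template list distinguishes built-in templates (type "system") from your own (type "user"). A small Python sketch, assuming the response shape shown above, that builds the GET request and groups templates by type; both function names are made up for illustration.

```python
import json
import urllib.request


def templates_request(api_key):
    """Create (but do not send) the GET request for the template list."""
    return urllib.request.Request(
        "https://api.pdf.co/v1/pdf/documentparser/templates",
        headers={"x-api-key": api_key},
        method="GET",
    )


def index_templates(listing):
    """Group templates as {type: {id: title}}."""
    by_type = {}
    for template in listing.get("templates", []):
        by_type.setdefault(template["type"], {})[template["id"]] = template["title"]
    return by_type


# Trimmed copy of the sample response shown above.
sample = json.loads("""
{
  "templates": [
    {"id": 40, "type": "user", "title": "Untitled", "description": "Untitled"},
    {"id": 1, "type": "system", "title": "Invoice Parser",
     "description": "Parses invoices and extracts invoice number, company name, due date, amount, tax"}
  ],
  "remainingCredits": 94229
}
""")

index = index_templates(sample)
print(index["system"])  # prints: {1: 'Invoice Parser'}
print(index["user"])    # prints: {40: 'Untitled'}
```

The resulting ids are the values you would pass as templateId to /pdf/documentparser.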
/pdf/documentparser/templates/:id#
Returns detailed information for a Document Parser template, looked up by the template's ID.
Use the PDF.co dashboard to manage your Document Parser Templates.
Method: GET
Endpoint: /v1/pdf/documentparser/templates/:id
Query parameters#
No query parameters accepted.
Body payload#
No body parameters accepted.
CURL#
curl --location --request GET 'https://api.pdf.co/v1/pdf/documentparser/templates/1' \
--header 'Content-Type: application/json' \
--header 'x-api-key: {{x-api-key}}' \
--data-raw ''
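The :id segment of the endpoint is replaced with a concrete template id, as in the curl example (.../templates/1). A brief Python sketch of building that request follows. It assumes template ids are positive integers, which matches the sample ids 1 and 40 but is not stated explicitly here, and the function name is hypothetical. The response format for this endpoint is not shown in this section, so the sketch stops at constructing the request.

```python
import urllib.request


def template_detail_request(api_key, template_id):
    """Create (but do not send) the GET request for one template's details."""
    template_id = int(template_id)  # raises ValueError for non-numeric input
    if template_id <= 0:
        # Assumption: ids are positive integers, as in the samples (1, 40).
        raise ValueError("template_id must be a positive integer")
    return urllib.request.Request(
        f"https://api.pdf.co/v1/pdf/documentparser/templates/{template_id}",
        headers={"x-api-key": api_key},
        method="GET",
    )


request = template_detail_request("YOUR_API_KEY", 1)
print(request.full_url)
# prints: https://api.pdf.co/v1/pdf/documentparser/templates/1
```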
Template samples#
Find templates to use with Document Parser here:
Footnotes
- 1
Supports links from Google Drive, Dropbox, and PDF.co Built-In Files Storage. To upload files via the API, check out the File Upload section. Note: If you experience intermittent Too Many Requests or Access Denied errors, try adding the cache: prefix to the file URL to enable built-in URL caching (e.g. cache:https://example.com/file1.pdf). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.
- 2
Response codes as follows:
Code | Description |
---|---|
200 | The request has succeeded |
400 | Bad input parameters |
401 | Unauthorized |
403 | Not enough credits |
405 | Timeout error. To process large documents or files, use asynchronous mode (set the async parameter to true) and then check the status using the /job/check endpoint. If a file contains many pages, specify a page range using the pages parameter. The number of pages in the document can be obtained using the /pdf/info endpoint. |
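The response codes in footnote 2 suggest a simple client-side policy: accept 200, retry a timeout in asynchronous mode, and report everything else. The Python sketch below illustrates that policy under two assumptions: that the status value in the JSON response carries these codes (as in the sample, where it is 200), and that setting async to true in the original payload is enough to switch modes. Polling /job/check is not covered because its request format is not described in this section. All names are illustrative.

```python
# Codes and meanings copied from the response code list in footnote 2.
STATUS_MEANINGS = {
    200: "The request has succeeded",
    400: "Bad input parameters",
    401: "Unauthorized",
    403: "Not enough credits",
    405: "Timeout error",
}


def explain_status(code):
    """Return the documented meaning of a status code, if it is listed."""
    return STATUS_MEANINGS.get(code, "Unknown status code")


def as_async(payload):
    """Return a copy of the payload with asynchronous mode switched on."""
    retry = dict(payload)
    retry["async"] = True
    return retry


def next_step(response, payload):
    """Decide what to do with a response: ("ok" | "retry_async" | "fail", detail)."""
    code = response.get("status")
    if code == 200 and not response.get("error"):
        return "ok", None
    if code == 405:
        return "retry_async", as_async(payload)
    return "fail", explain_status(code)


payload = {"url": "https://example.com/file1.pdf", "templateId": "1", "async": False}
print(next_step({"status": 200, "error": False}, payload))  # prints: ('ok', None)
print(next_step({"status": 403, "error": True}, payload))   # prints: ('fail', 'Not enough credits')
```

For a timeout, next_step returns the retry payload rather than sending it, which keeps the decision logic separate from the network call.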