PDF to JSON#
Available Methods#
Note
Auto-classification of incoming documents
Use the Document Classifier endpoint to automatically detect the class of a document using keyword-based rules. For example, you can define rules that identify which vendor provided the document and then apply the matching template.
/pdf/convert/to/json2#
Convert PDF and scanned images into JSON representation with text, fonts, images, vectors, and formatting preserved.
Method: POST
Endpoint: /v1/pdf/convert/to/json2
/pdf/convert/to/json-meta#
What is the difference between /pdf/convert/to/json-meta and /pdf/convert/to/json2?
- /json-meta uses AI to detect meta styles for text objects, such as:
  - paragraph style (from h1..h7 to p and small).
  - meta type of the text object (text, datetime, integer, decimal, currency, etc.).
  - meta subType of the text object (companyName, personName, and other AI-based meta types).
- /json-meta consumes more credits because it runs with AI.
- /json-meta is also a bit slower due to the AI process. Async mode is recommended for this endpoint.
Convert PDF and scanned images into JSON using AI.
Method: POST
Endpoint: /v1/pdf/convert/to/json-meta
Attributes#
Note
Attributes are case-sensitive and should be sent inside the JSON body of the POST request, for example:
{
"url": "https://example.com/file1.pdf"
}
| Attribute | Description | Required |
|---|---|---|
| url | URL to the source file (see footnote 1). | yes |
| | HTTP auth user name if required to access source | no |
| | HTTP auth password if required to access source | no |
| pages | Comma-separated indices of pages (or page ranges) that you want to use. The first-page index is always 0. For example, if you have a 7-page document that you want to split into 3 separate PDFs with a different number of pages each, it would look like this: 0, 1, 2- or 1, 2, 3-7, which results in one PDF with page one, one PDF with page two, and one PDF with the rest of the pages. You can also use inverted page numbers adding | no |
| | Unwrap lines into a single line within table cells when | no |
| | Defines coordinates for extraction, e.g. | no |
| | Set the language for OCR (text from image) to use for scanned PDF, PNG, and JPG documents input when extracting text. The default is | no |
| | Set to | no |
| | Line grouping within table cells. Set to | no |
| | Password of the PDF file. The input must be in string format. | no |
| | Set | no |
| | File name for the generated output. The input must be in string format. | no |
| | Set the expiration time for the output link in minutes (default is | no |
| | Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more. | no |
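To make the page-range syntax described for the pages attribute more concrete, here is a minimal local sketch of how such a string could be read. The helper name expand_pages is hypothetical and not part of the API. It assumes zero-based indices, inclusive ranges, and that an open-ended range such as 2- runs to the last page. It does not handle inverted page numbers, and it is not a description of how the service itself parses the value.

```python
def expand_pages(spec: str, page_count: int) -> list[int]:
    """Expand a pages string such as "0, 1, 2-" into zero-based page indices.

    Hypothetical helper for illustration only; the service's own parsing
    rules may differ (inverted page numbers are not handled here).
    """
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # Assumption: an open-ended range like "2-" runs to the last page.
            end = int(end_text) if end_text.strip() else page_count - 1
        else:
            start = end = int(part)
        # Clamp to the document so indices past the last page are dropped.
        end = min(end, page_count - 1)
        indices.extend(range(start, end + 1))
    return indices


print(expand_pages("0, 1, 2-", 7))
```

Under these assumptions, "0, 1, 2-" on a 7-page document selects every page, with pages 2 through 6 coming from the open-ended range.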
Query parameters#
No query parameters accepted.
Payload#
{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-json/sample.pdf",
"inline": true,
"async": false
}
Response 2#
{
"body": {
"document": {
"pageCount": "1",
"pageCountWithOCRPerformed": "0",
"page": {
"index": "0",
"width": "595.320007324219",
"height": "841.919982910156",
"OCRWasPerformed": "False",
"row": [
{
"column": [
{
"text": {
"fontName": "Arial",
"fontSize": "24.0",
"fontStyle": "Bold",
"color": "#538DD3",
"x": "36.00",
"y": "34.44",
"width": "242.81",
"height": "24.00",
"text": "Your Company Name"
}
},
{
"text": ""
},
{
"text": ""
},
{
"text": ""
}
]
},
{
"column": [
{
"text": ""
},
{
"text": ""
},
{
"text": {
"fontName": "Arial",
"fontSize": "11.0",
"fontStyle": "Bold",
"x": "389.11",
"y": "425.83",
"width": "36.75",
"height": "11.04",
"text": "TOTAL"
}
},
{
"text": {
"fontName": "Arial",
"fontSize": "11.0",
"fontStyle": "Bold",
"x": "525.82",
"y": "425.83",
"width": "33.62",
"height": "11.04",
"text": "200.00"
}
}
]
}
]
}
}
},
"pageCount": 1,
"error": false,
"status": 200,
"name": "sample.json",
"remainingCredits": 99227903,
"credits": 28
}
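The response nests each text object under body.document.page.row[].column[], and empty cells carry an empty string instead of an object. The following sketch shows one way to walk that structure and collect the non-empty cells. The function name iter_text_cells is hypothetical, the sample dictionary is a trimmed copy of the response above, and the handling of multi-page output as a list of page objects is an assumption, since the sample only shows a single page.

```python
def iter_text_cells(response: dict):
    """Yield (row, column, text, x, y) for every non-empty text cell."""
    pages = response["body"]["document"]["page"]
    # The sample shows one page object; a list is assumed for multi-page output.
    if isinstance(pages, dict):
        pages = [pages]
    for page in pages:
        for row_index, row in enumerate(page.get("row", [])):
            for col_index, cell in enumerate(row.get("column", [])):
                text = cell.get("text")
                # Empty cells are plain "" strings; filled cells are objects.
                if isinstance(text, dict) and text.get("text"):
                    yield (row_index, col_index, text["text"],
                           float(text["x"]), float(text["y"]))


# Trimmed copy of the sample response shown above.
sample = {
    "body": {"document": {"pageCount": "1", "page": {
        "index": "0",
        "row": [
            {"column": [
                {"text": {"fontName": "Arial", "x": "36.00", "y": "34.44",
                          "text": "Your Company Name"}},
                {"text": ""},
            ]},
            {"column": [
                {"text": ""},
                {"text": ""},
                {"text": {"fontName": "Arial", "x": "389.11", "y": "425.83",
                          "text": "TOTAL"}},
                {"text": {"fontName": "Arial", "x": "525.82", "y": "425.83",
                          "text": "200.00"}},
            ]},
        ],
    }}},
    "error": False,
    "status": 200,
}

for cell in iter_text_cells(sample):
    print(cell)
```

Note that numeric fields such as x, y, and fontSize arrive as strings in the sample, so the sketch converts the coordinates with float() before returning them.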
CURL#
curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/json2' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data-raw '{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-json/sample.pdf",
"inline": true,
"async": false
}'
curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/json-meta' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data-raw '{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-json/sample.pdf",
"inline": true,
"async": false
}'
Code samples#
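As a rough Python equivalent of the CURL requests above, the sketch below builds the same POST request using only the standard library. The function names build_request and convert are hypothetical, YOUR_API_KEY is a placeholder, and the payload mirrors the one shown in the Payload section. Error handling is omitted to keep the example short.

```python
import json
import urllib.request

API_BASE = "https://api.pdf.co/v1"


def build_request(api_key: str, source_url: str,
                  endpoint: str = "/pdf/convert/to/json2") -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"url": source_url, "inline": True, "async": False}
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def convert(api_key: str, source_url: str,
            endpoint: str = "/pdf/convert/to/json2") -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(api_key, source_url, endpoint)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call (requires a valid API key and network access):
# result = convert(
#     "YOUR_API_KEY",
#     "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-json/sample.pdf",
# )
```

Passing endpoint="/pdf/convert/to/json-meta" targets the AI-based endpoint with the same payload.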
Footnotes
- 1
Supports links from Google Drive, Dropbox, and PDF.co Built-In Files Storage. To upload files via the API, check out the File Upload section. Note: If you experience intermittent Too Many Requests or Access Denied errors, try adding the cache: prefix to the URL to enable built-in URL caching (e.g. cache:https://example.com/file1.pdf). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.
- 2
The main response codes are as follows:
| Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad request. Typically happens because of bad input parameters, or because the input URLs can't be reached, possibly due to access restrictions like needing a login or password. |
| 401 | Unauthorized |
| 402 | Not enough credits |
| 445 | Timeout error. To process large documents or files, use asynchronous mode (set the async parameter to true) and then check the status using the /job/check endpoint. If a file contains many pages, specify a page range using the pages parameter. The number of pages in the document can be obtained using the /pdf/info endpoint. |
Note
For more, see the complete list of available response codes.
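The response codes listed in footnote 2 can be turned into a small client-side check. This is a sketch only: the name check_response is hypothetical, and it assumes that failed responses carry the same status and error fields as the successful sample response, which the text above does not confirm.

```python
# Descriptions taken from the main response codes in footnote 2.
RESPONSE_CODES = {
    200: "Success",
    400: "Bad request",
    401: "Unauthorized",
    402: "Not enough credits",
    445: "Timeout error",
}


def check_response(response: dict) -> dict:
    """Return the response unchanged on success, raise RuntimeError otherwise.

    Assumption: error responses expose "status" and "error" fields
    in the same way as the successful sample response.
    """
    status = response.get("status")
    if response.get("error") or status != 200:
        reason = RESPONSE_CODES.get(status, "Unknown response code")
        hint = ""
        if status == 445:
            # Footnote 2 suggests async mode or a narrower pages range.
            hint = " (try async mode or a smaller pages range)"
        raise RuntimeError(f"Request failed with status {status}: {reason}{hint}")
    return response
```

A caller would wrap the decoded JSON response in check_response before reading its body, so that timeouts and credit errors surface as exceptions instead of missing keys.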