PDF to JSON#
Available Methods#
Note
Auto-classification of incoming documents
Use the Document Classifier endpoint to automatically detect the class of a document using keyword-based rules. For example, you can define rules that identify which vendor provided a document and then apply the matching template.
/pdf/convert/to/json2#
Convert PDF and scanned images into JSON representation with text, fonts, images, vectors, and formatting preserved.
Method: POST
Endpoint: /v1/pdf/convert/to/json2
/pdf/convert/to/json-meta#
What is the difference between /pdf/convert/to/json-meta and /pdf/convert/to/json2?
/json-meta uses AI to detect meta styles for text objects, such as:
- paragraph style (from h1..h7 to p and small).
- meta type of the text object (text, datetime, integer, decimal, currency etc.).
- meta subType of the text object (companyName, personName and other AI-based meta types).
/json-meta consumes more credits because it runs with AI.
/json-meta is also a bit slower due to the AI process. Async mode is recommended for this endpoint.
Convert PDF and scanned images into JSON using AI.
Method: POST
Endpoint: /v1/pdf/convert/to/json-meta
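The choice between the two endpoints can be sketched in Python as follows. This is an illustration only: the helper name `build_conversion_call` and the `want_meta` flag are hypothetical, the host is the one used in the CURL examples on this page, and turning on `async` for /json-meta simply follows the recommendation above.

```python
from typing import Dict, Tuple

# Host as it appears in the CURL examples on this page.
BASE_URL = "https://api.pdf.co/v1"

ENDPOINTS = {
    "json2": "/pdf/convert/to/json2",
    "json-meta": "/pdf/convert/to/json-meta",
}


def build_conversion_call(source_url: str, want_meta: bool) -> Tuple[str, Dict[str, object]]:
    """Return (endpoint URL, JSON payload) for a conversion request.

    Hypothetical helper: picks /json-meta when AI meta styles are wanted
    and enables async for it, since that endpoint is slower.
    """
    kind = "json-meta" if want_meta else "json2"
    payload = {"url": source_url, "async": want_meta}
    return BASE_URL + ENDPOINTS[kind], payload


endpoint, payload = build_conversion_call("https://example.com/file1.pdf", want_meta=True)
print(endpoint)
print(payload)
```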
Attributes#
Note
Attributes are case-sensitive and must be passed inside the JSON body of the POST request, for example:
{
  "url": "https://example.com/file1.pdf"
}
| Attribute | Description | Required |
|---|---|---|
| | URL to the source file. [1] | yes |
| | HTTP auth user name if required to access source | no |
| | HTTP auth password if required to access source | no |
| | Comma-separated indices of pages (or page ranges) that you want to use. The first-page index is always 0. For example, if you have a 7-page document that you want to be split into 3 separate PDFs but a different number of pages it would go like this: 0, 1, 2- or 1, 2, 3-7 which will result in 1 PDF with page one, 1 PDF with page two and one PDF with the rest of the pages. You can also use inverted page numbers adding | no |
| | Unwrap lines into a single line within table cells when | no |
| | Defines coordinates for extraction, e.g. | no |
| | Set the language for OCR (text from image) to use for scanned PDF, PNG, and JPG documents input when extracting text. The default is | no |
| | Set to | no |
| | Line grouping within table cells. Set to | no |
| | Set | no |
| | File name for the generated output, the input must be in string format. | no |
| | Set the expiration time for the output link in minutes (default is | no |
| | Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more. | no |
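To make the page-range syntax concrete, here is a minimal Python sketch that expands a `pages` string into zero-based indices. The function name `expand_pages` is hypothetical, the assumption that ranges are inclusive is mine, and inverted page numbers are deliberately not handled because their syntax is not spelled out here. The service performs this expansion itself; the sketch only shows how a spec such as `0, 1, 2-` is read.

```python
from typing import List


def expand_pages(spec: str, page_count: int) -> List[int]:
    """Expand a spec such as "0, 2-4, 6-" into zero-based page indices.

    Assumptions (not confirmed by the reference): ranges are inclusive,
    and an open-ended range like "2-" runs through the last page.
    """
    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An empty right-hand side ("2-") means "through the last page".
            end = int(end_text) if end_text.strip() else page_count - 1
        else:
            start = end = int(part)
        if start < 0 or end >= page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 0..{page_count - 1}")
        indices.extend(range(start, end + 1))
    return indices


# A 7-page document: page 0, page 1, then everything from page 2 onward.
print(expand_pages("0, 1, 2-", 7))
```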
Query parameters#
No query parameters accepted.
Payload#
{
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-json/sample.pdf",
  "inline": true,
  "async": false
}
Response 2#
{
  "body": {
    "document": {
      "pageCount": "1",
      "pageCountWithOCRPerformed": "0",
      "page": {
        "index": "0",
        "width": "595.320007324219",
        "height": "841.919982910156",
        "OCRWasPerformed": "False",
        "row": [
          {
            "column": [
              {
                "text": {
                  "fontName": "Arial",
                  "fontSize": "24.0",
                  "fontStyle": "Bold",
                  "color": "#538DD3",
                  "x": "36.00",
                  "y": "34.44",
                  "width": "242.81",
                  "height": "24.00",
                  "text": "Your Company Name"
                }
              },
              {
                "text": ""
              },
              {
                "text": ""
              },
              {
                "text": ""
              }
            ]
          },
          {
            "column": [
              {
                "text": ""
              },
              {
                "text": ""
              },
              {
                "text": {
                  "fontName": "Arial",
                  "fontSize": "11.0",
                  "fontStyle": "Bold",
                  "x": "389.11",
                  "y": "425.83",
                  "width": "36.75",
                  "height": "11.04",
                  "text": "TOTAL"
                }
              },
              {
                "text": {
                  "fontName": "Arial",
                  "fontSize": "11.0",
                  "fontStyle": "Bold",
                  "x": "525.82",
                  "y": "425.83",
                  "width": "33.62",
                  "height": "11.04",
                  "text": "200.00"
                }
              }
            ]
          }
        ]
      }
    }
  },
  "pageCount": 1,
  "error": false,
  "status": 200,
  "name": "sample.json",
  "remainingCredits": 99227903,
  "credits": 28
}
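The nested row/column layout of this response can be flattened into plain text cells with a short Python sketch. The helper names are hypothetical. The sketch assumes that empty cells carry an empty string under `text`, as in the sample, and that a multi-page document may return `page` as a list (an assumption, since the sample shows only a single page object).

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """Wrap a single object in a list; pass lists through; None becomes []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def extract_rows(response: Dict[str, Any]) -> List[List[str]]:
    """Return the text of every cell, grouped by row, across all pages."""
    rows: List[List[str]] = []
    document = response["body"]["document"]
    for page in _as_list(document.get("page")):
        for row in _as_list(page.get("row")):
            cells: List[str] = []
            for column in _as_list(row.get("column")):
                cell = column.get("text", "")
                # A populated cell is an object with font info and a nested "text".
                cells.append(cell["text"] if isinstance(cell, dict) else cell)
            rows.append(cells)
    return rows


# Trimmed version of the sample response above.
sample = {
    "body": {"document": {"page": {"row": [
        {"column": [{"text": {"text": "Your Company Name"}}, {"text": ""}]},
        {"column": [{"text": ""}, {"text": {"text": "TOTAL"}}, {"text": {"text": "200.00"}}]},
    ]}}}
}
for cells in extract_rows(sample):
    print(cells)
```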
CURL#
curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/json2' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-json/sample.pdf",
  "inline": true,
  "async": false
}'

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/json-meta' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-json/sample.pdf",
  "inline": true,
  "async": false
}'
Code samples#
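As a rough Python counterpart to the CURL commands, the sketch below builds the same POST request with the standard library only. The function names `build_request` and `convert` are hypothetical, the API key is a placeholder you must supply, and the timeout value is an arbitrary choice. It is not an official client.

```python
import json
import urllib.request

# Host as used in the CURL examples.
API_BASE = "https://api.pdf.co/v1"


def build_request(endpoint: str, api_key: str, source_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the CURL example."""
    payload = {"url": source_url, "inline": True, "async": False}
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def convert(endpoint: str, api_key: str, source_url: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(endpoint, api_key, source_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Inspect the request without sending it; call convert(...) with a real key to run it.
    req = build_request("/pdf/convert/to/json2", "YOUR_API_KEY", "https://example.com/file1.pdf")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before any credits are spent.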
Footnotes
- 1
Supports links from Google Drive, Dropbox, and PDF.co Built-In Files Storage. To upload files via the API, check out the File Upload section. Note: If you experience intermittent Too Many Requests or Access Denied errors, try adding the cache: prefix to the URL to enable built-in URL caching (e.g. cache:https://example.com/file1.pdf). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.
- 2
Response codes are as follows:
| Code | Description |
|---|---|
| 200 | The request has succeeded |
| 400 | Bad input parameters |
| 401 | Unauthorized |
| 403 | Not enough credits |
| 405 | Timeout error. To process large documents or files please use asynchronous mode (set the async parameter to true) and then check status using the /job/check endpoint. If a file contains many pages then specify a page range using the pages parameter. The number of pages of the document can be obtained using the /pdf/info endpoint. |
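One way to act on these codes in client code is sketched below in Python. The function name `check_response` is hypothetical, and the sketch assumes that failed requests report their code in the same `status` and `error` fields that the sample success response uses, which this page does not confirm.

```python
# Descriptions copied from the response code list above.
STATUS_MESSAGES = {
    200: "The request has succeeded",
    400: "Bad input parameters",
    401: "Unauthorized",
    403: "Not enough credits",
    405: "Timeout error",
}


def check_response(response: dict) -> dict:
    """Return the response unchanged on success, otherwise raise RuntimeError.

    Assumes error responses expose "status" and "error" like the sample
    success response does.
    """
    status = response.get("status")
    if status == 200 and not response.get("error", False):
        return response
    message = STATUS_MESSAGES.get(status, "Unexpected status")
    if status == 405:
        # Mirrors the advice given for timeouts in the list above.
        message += "; retry with async set to true or a narrower pages range"
    raise RuntimeError(f"PDF.co request failed ({status}): {message}")


print(check_response({"status": 200, "error": False, "name": "sample.json"})["name"])
```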