PDF Find#
Available Methods#
/pdf-find#
Find text in PDF and get coordinates. Supports regular expressions.
Method: POST
Endpoint: /v1/pdf/find
Attributes#
Note
Attributes are case-sensitive and should be passed inside the JSON body of the POST request, for example:
{
"url": "https://example.com/file1.pdf"
}
| Attribute | Description | Required |
|---|---|---|
| | URL to the source file. 1 | yes |
| | HTTP auth user name if required to access source | no |
| | HTTP auth password if required to access source | no |
| | Text to search can support regular expressions if you set the | yes |
| | Comma-separated indices of pages (or page ranges) that you want to use. The first-page index is always 0. For example, if you have a 7-page document that you want to be split into 3 separate PDFs but a different number of pages it would go like this: 0, 1, 2- or 1, 2, 3-7 which will result in 1 PDF with page one, 1 PDF with page two and one PDF with the rest of the pages. You can also use inverted page numbers adding | no |
| | Set to | no |
| | Values can be either | no |
| | Password of PDF file, the input must be in string format. | no |
| | Must be one of: | no |
| | Set | no |
| | Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more. | no |
Query parameters#
No query parameters accepted.
Payload#
{
"async": "false",
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
"searchString": "Invoice Date \\d+/\\d+/\\d+",
"regexSearch": "true",
"name": "output",
"pages": "0-",
"inline": "true",
"wordMatchingMode": "",
"password": ""
}
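The doubled backslashes in searchString are JSON escaping: once the body is decoded, the pattern reads Invoice Date \d+/\d+/\d+. The sketch below shows that decoding step in Python. It uses Python's own re module only to illustrate what the pattern matches; the service's regex engine may differ in details, and sample_text is a made-up line, not taken from the sample PDF.

```python
import json
import re

# The request body as it travels over the wire (note the doubled backslashes).
raw_payload = r'''
{
  "url": "https://example.com/file1.pdf",
  "searchString": "Invoice Date \\d+/\\d+/\\d+",
  "regexSearch": "true"
}
'''

payload = json.loads(raw_payload)
pattern = payload["searchString"]  # JSON decoding turns \\d into \d

# Hypothetical line of page text, used only to exercise the pattern locally.
sample_text = "Total due. Invoice Date 01/01/2016"
match = re.search(pattern, sample_text)

print(pattern)
print(match.group(0) if match else None)
```

If the pattern is built in code rather than written by hand, serializing the payload with json.dumps adds the extra backslashes automatically.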
Response 2#
{
"body": [
{
"text": "Invoice Date 01/01/2016",
"left": 436.5400085449219,
"top": 130.4599995137751,
"width": 122.85311957550027,
"height": 11.040000486224898,
"pageIndex": 0,
"bounds": {
"location": {
"isEmpty": false,
"x": 436.54,
"y": 130.46
},
"size": "122.853119, 11.0400009",
"x": 436.54,
"y": 130.46,
"width": 122.853119,
"height": 11.0400009,
"left": 436.54,
"top": 130.46,
"right": 559.3931,
"bottom": 141.5,
"isEmpty": false
},
"elementCount": 1,
"elements": [
{
"index": 0,
"left": 436.5400085449219,
"top": 130.4599995137751,
"width": 122.85311957550027,
"height": 11.040000486224898,
"angle": 0,
"text": "Invoice Date 01/01/2016",
"isNewLine": true,
"fontIsBold": true,
"fontIsItalic": false,
"fontName": "Helvetica-Bold",
"fontSize": 11,
"fontColor": "0, 0, 0",
"fontColorAsOleColor": 0,
"fontColorAsHtmlColor": "#000000",
"bounds": {
"location": {
"isEmpty": false,
"x": 436.54,
"y": 130.46
},
"size": "122.853119, 11.0400009",
"x": 436.54,
"y": 130.46,
"width": 122.853119,
"height": 11.0400009,
"left": 436.54,
"top": 130.46,
"right": 559.3931,
"bottom": 141.5,
"isEmpty": false
}
}
]
}
],
"pageCount": 1,
"error": false,
"status": 200,
"name": "output",
"remainingCredits": 59970
}
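Each match carries left, top, width, and height, and the nested bounds object repeats them together with right and bottom. As a rough illustration, the right and bottom edges can be derived from the top-level fields; the helper name match_box is hypothetical, and the values are copied from the sample response above.

```python
# One match, reduced to the fields used here (values from the sample response).
match = {
    "text": "Invoice Date 01/01/2016",
    "left": 436.5400085449219,
    "top": 130.4599995137751,
    "width": 122.85311957550027,
    "height": 11.040000486224898,
    "pageIndex": 0,
}


def match_box(m):
    """Return (left, top, right, bottom) for a single match."""
    right = m["left"] + m["width"]
    bottom = m["top"] + m["height"]
    return (m["left"], m["top"], right, bottom)


left, top, right, bottom = match_box(match)
print(round(right, 2), round(bottom, 2))
```

Within rounding, the computed edges agree with the right and bottom values reported under bounds.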
CURL#
curl --location --request POST 'https://api.pdf.co/v1/pdf/find' \
--header 'x-api-key: ' \
--header 'Content-Type: application/json' \
--data-raw '{
"async": "false",
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
"searchString": "Invoice Date \\d+/\\d+/\\d+",
"regexSearch": "true",
"name": "output",
"pages": "0-",
"inline": "true",
"wordMatchingMode": "",
"password": ""
}'
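For readers who prefer Python to cURL, the same request could be assembled with the standard library roughly as follows. This is a sketch, not an official client: build_find_request is a hypothetical helper, "YOUR_API_KEY" is a placeholder, and sending "false" as a string for regexSearch is an assumption based on the string-valued "true" shown above.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/find"


def build_find_request(api_key, source_url, search_string, regex=True, pages="0-"):
    """Build (but do not send) a POST request mirroring the cURL example."""
    payload = {
        "async": "false",
        "url": source_url,
        "searchString": search_string,
        # Assumption: booleans are sent as strings, as in the sample payload.
        "regexSearch": "true" if regex else "false",
        "pages": pages,
        "inline": "true",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_find_request(
    "YOUR_API_KEY",
    "https://example.com/file1.pdf",
    r"Invoice Date \d+/\d+/\d+",
)

# To actually send the request:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

Building the request separately from sending it keeps the payload easy to inspect or log before any credits are spent.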
/pdf-find-table#
This function uses an AI-powered table detection engine to scan your document for tables. It returns an array of the tables found on each page, with their coordinates and information about the columns detected in each table.
This endpoint locates tables in an input PDF document and returns JSON with:
- The array of tables objects.
- X, Y, Width, and Height coordinates for every table found.
- rect param for every table that you can re-use with pdf/convert/to/json, pdf/convert/to/csv, and other endpoints to extract a selected table only.
- PageIndex page index for a page with a table. The very first page is 0.
- Columns array with the set of X coordinates for every column inside the table that was found.
To extract the table into CSV, JSON, or XML please use the pdf/convert/to/csv, pdf/convert/to/json2, and pdf/convert/to/xml endpoints with the rect parameter set to the value of the rect output param for this table.
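The hand-off described above, passing a table's rect value on to a convert endpoint, might look like this in Python. The helper table_extract_payload is hypothetical, and restricting the follow-up call with a pages attribute is an assumption; only the rect parameter is confirmed by the text above.

```python
def table_extract_payload(source_url, table):
    """Build a request body for a convert endpoint limited to one table."""
    return {
        "url": source_url,
        # Reuse the rect string exactly as returned for this table.
        "rect": table["rect"],
        # Assumption: the convert endpoint accepts a pages attribute,
        # used here to restrict extraction to the table's own page.
        "pages": str(table["PageIndex"]),
    }


# A table entry shaped like the objects in the tables array.
table = {"PageIndex": 0, "rect": "36, 316.249969, 523.44, 120.620026"}
payload = table_extract_payload("https://example.com/file1.pdf", table)
print(payload)
```

The resulting dictionary would be sent as the JSON body of a POST to pdf/convert/to/csv, pdf/convert/to/json2, or pdf/convert/to/xml.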
Method: POST
Endpoint: /v1/pdf/find/table
Attributes#
Note
Attributes are case-sensitive and should be passed inside the JSON body of the POST request, for example:
{
"url": "https://example.com/file1.pdf"
}
| Attribute | Description | Required |
|---|---|---|
| | URL to the source file. 1 | yes |
| | HTTP auth user name if required to access source | no |
| | HTTP auth password if required to access source | no |
| | Comma-separated indices of pages (or page ranges) that you want to use. The first-page index is always 0. For example, if you have a 7-page document that you want to be split into 3 separate PDFs but a different number of pages it would go like this: 0, 1, 2- or 1, 2, 3-7 which will result in 1 PDF with page one, 1 PDF with page two and one PDF with the rest of the pages. You can also use inverted page numbers adding | no |
| | Set to | no |
| | Password of PDF file, the input must be in string format. | no |
| | Set | no |
| | File name for the generated output, the input must be in string format. | no |
| | Set the expiration time for the output link in minutes (default is | no |
| | Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more. | no |
Note
There is also a “legacy find tables” mode. To enable it, set an object on your profiles attribute like this:
"profiles": "{ 'Mode': 'Legacy'}"
A more detailed configuration can also set the minimum number of rows, the minimum number of columns, and the column detection mode:
"profiles": { 'Mode': 'Legacy',
'ColumnDetectionMode': 'BorderedTables',
'DetectionMinNumberOfRows': 1,
'DetectionMinNumberOfColumns': 1,
'DetectionMaxNumberOfInvalidSubsequentRowsAllowed': 0,
'DetectionMinNumberOfLineBreaksBetweenTables': 0,
'EnhanceTableBorders': false
}
Query parameters#
No query parameters accepted.
Payload#
{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
"async": "false",
"inline": "true",
"password": ""
}
Response 2#
{
"body": {
"tables": [
{
"PageIndex": 0,
"X": 36,
"Y": 34.4400024,
"Width": 523.44,
"Height": 160.82,
"Columns": [
357.675
],
"rect": "36, 34.4400024, 523.44, 160.82"
},
{
"PageIndex": 0,
"X": 36,
"Y": 316.249969,
"Width": 523.44,
"Height": 120.620026,
"Columns": [
157.117,
340.68,
475.84
],
"rect": "36, 316.249969, 523.44, 120.620026"
}
]
},
"pageCount": 1,
"error": false,
"status": 200,
"name": "sample.json",
"remainingCredits": 98892697,
"credits": 21
}
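One way to use the Columns array is to turn it into horizontal spans, one per column. The sketch below assumes that each value in Columns is the X coordinate of an interior boundary between two adjacent columns, and that the table spans from X to X + Width; neither assumption is spelled out in this reference, so check it against your own documents. The helper name column_spans is hypothetical.

```python
def column_spans(table):
    """Split a table's horizontal extent into (start_x, end_x) spans.

    Assumes table["Columns"] holds interior column boundaries.
    """
    left = table["X"]
    right = table["X"] + table["Width"]
    edges = [left] + sorted(table["Columns"]) + [right]
    return list(zip(edges[:-1], edges[1:]))


# Second table from the sample response, reduced to the fields used here.
table = {"X": 36, "Width": 523.44, "Columns": [157.117, 340.68, 475.84]}
spans = column_spans(table)
for start, end in spans:
    print(f"{start:.2f} .. {end:.2f}")
```

Under that reading, three boundary values produce four column spans for the second table.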
CURL#
curl --location --request POST 'https://api.pdf.co/v1/pdf/find/table' \
--header 'x-api-key: ' \
--header 'Content-Type: application/json' \
--data-raw '{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
"async": "false",
"inline": "true",
"password": ""
}'
Code samples#
Footnotes
- 1: Supports links from Google Drive, Dropbox, and PDF.co Built-In Files Storage. To upload files via the API check out the File Upload section. Note: If you experience intermittent Too Many Requests or Access Denied errors, please try to add cache: to enable built-in URL caching (e.g. cache:https://example.com/file1.pdf). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.
- 2: Main response codes as follows:

| Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad request. Typically happens because of bad input parameters, or because the input URLs can’t be reached, possibly due to access restrictions like needing a login or password. |
| 401 | Unauthorized |
| 402 | Not enough credits |
| 445 | Timeout error. To process large documents or files please use asynchronous mode (set the async parameter to true) and then check status using the /job/check endpoint. If a file contains many pages then specify a page range using the pages parameter. The number of pages of the document can be obtained using the /pdf/info endpoint. |

Note
For more see the complete list of available response codes.
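The response codes listed above can be mapped to a follow-up action in client code. The sketch below is illustrative only: next_step is a hypothetical helper, and it assumes that error responses carry status and error fields in the body in the same way the success samples do.

```python
# Main response codes, as listed in the footnote.
RESPONSE_CODES = {
    200: "Success",
    400: "Bad request",
    401: "Unauthorized",
    402: "Not enough credits",
    445: "Timeout error",
}


def next_step(response):
    """Suggest a follow-up action for a decoded response body."""
    status = response.get("status")
    if status == 200 and not response.get("error"):
        return "ok"
    if status == 445:
        # The footnote recommends async mode and a page range for timeouts.
        return "retry with async=true and a narrower pages range"
    return RESPONSE_CODES.get(status, "see the complete list of response codes")


print(next_step({"status": 200, "error": False}))
print(next_step({"status": 445, "error": True}))
print(next_step({"status": 402, "error": True}))
```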