PDF to Text#

/pdf/convert/to/text#

Convert PDFs and scanned images to text with the layout preserved. This method uses OCR and reproduces the layout.

Note

Auto classification of incoming documents

Use the Document Classifier endpoint to automatically sort documents or detect their class using keyword-based rules. For example, you can define rules that identify which vendor provided a document and then apply the matching template.

  • Method: POST

  • Endpoint: /v1/pdf/convert/to/text

Attributes#

Note

Attributes are case-sensitive and must be passed inside the JSON body of the POST request, for example:

{
    "url": "https://example.com/file1.pdf"
}

| Attribute | Description | Required |
|---|---|---|
| url | URL to the source file. 1 | yes |
| httpusername | HTTP auth user name, if required to access the source URL. | no |
| httppassword | HTTP auth password, if required to access the source URL. | no |
| pages | Comma-separated indices of pages (or page ranges) that you want to use. The first page index is always 0. For example, to split a 7-page document into 3 separate PDFs with different numbers of pages, use 0, 1, 2- which results in one PDF with page one, one PDF with page two, and one PDF with the rest of the pages. You can also use inverted page numbers by adding ! before the number: !0 means “the last page”, 1-!1 means “from the second to the penultimate page”, and !1- means “the last two pages”. You can also use a single asterisk (*) character as the range to split the document into separate pages. The input must be in string format. | no |
| unwrap | Unwraps lines into a single line within table cells when lineGrouping is enabled. Must be true or false. | no |
| rect | Coordinates of the region to extract, e.g. 51.8, 114.8, 235.5, 204.0. Use the PDF Edit Add Helper to get or measure PDF coordinates. The input must be in string format. | no |
| lang | Language for OCR (text from images) used when extracting text from scanned PDF, PNG, and JPG input. The default is eng. Other languages are also supported, such as deu, spa, chi_sim, jpn, and many others; see Language Support. You can also use two languages simultaneously, e.g. eng+deu or jpn+kor (any combination). | no |
| inline | Set to true to return results inside the response. Otherwise, the endpoint returns a link to the generated output file. | no |
| lineGrouping | Line grouping within table cells. Set to 1 to enable the grouping. The input must be in string format. | no |
| password | Password for the PDF file. The input must be in string format. | no |
| async | Set async to true to run long processes in the background. The API then returns a jobId that you can use with the Background Job Check endpoint to check the status of the process and retrieve the output, while you proceed with other tasks. | no |
| name | File name for the generated output. The input must be in string format. | no |
| expiration | Expiration time for the output link, in minutes (default is 60, i.e. 1 hour). After this duration, any generated output files are automatically deleted from PDF.co Temporary Files Storage. The maximum duration for link expiration depends on your current subscription plan. To store permanent input files (e.g. reusable images, PDF templates, documents), consider using PDF.co Built-In Files Storage. | no |
| profiles | Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more. | no |
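As a rough illustration of how several of the attributes above combine, the following Python sketch assembles a payload. The helper name `build_text_payload` and its `use_cache` option are hypothetical and not part of the API; the sketch only follows the string-format and `+`-joined language rules from the table and the `cache:` prefix from footnote 1.

```python
import json


def build_text_payload(url, pages=None, lang=None, rect=None,
                       inline=True, use_cache=False):
    """Assemble a JSON-serializable payload (hypothetical helper)."""
    # Footnote 1 describes prefixing the URL with "cache:" to enable URL caching.
    payload = {"url": ("cache:" + url) if use_cache else url, "inline": inline}
    if pages is not None:
        # pages must be sent as a string, e.g. "0,2-4,!0".
        payload["pages"] = str(pages)
    if lang is not None:
        # One code, or several codes joined with "+", e.g. "eng+deu".
        payload["lang"] = lang if isinstance(lang, str) else "+".join(lang)
    if rect is not None:
        # rect is sent as a string of comma-separated coordinates.
        payload["rect"] = ", ".join(str(value) for value in rect)
    return payload


payload = build_text_payload(
    "https://example.com/file1.pdf",
    pages="0,!0",                      # first and last page
    lang=["eng", "deu"],
    rect=(51.8, 114.8, 235.5, 204.0),
)
print(json.dumps(payload, indent=4))
```

Attribute names are case-sensitive, so the keys are written exactly as they appear in the table.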

Query parameters#

No query parameters accepted.

Payload#

{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": true,
    "async": false
}

Response 2#

{
    "body": "   Your Company Name \r\n       Your Address \r\n        City, State Zip \r\n                                                                                      Invoice No. 123456 \r\n                                                                                   Invoice Date 01/01/2016 \r\n      Client Name \r\n       Address \r\n        City, State Zip \r\n\r\n       Notes \r\n\r\n\r\n       Item                                     Quantity                     Price                     Total \r\n       Item 1                                              1                      40.00                      40.00 \r\n       Item 2                                              2                      30.00                      60.00 \r\n       Item 3                                              3                      20.00                      60.00 \r\n       Item 4                                              4                      10.00                      40.00 \r\n                                                           TOTAL                200.00\r\n",
    "pageCount": 1,
    "error": false,
    "status": 200,
    "name": "sample.txt",
    "remainingCredits": 99032333,
    "credits": 21
}
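A minimal sketch of reading such a response in Python, assuming the request used "inline": true so that the text arrives in the body field. The function name `extract_lines` is hypothetical; the field names body, error, and status are taken from the sample response above.

```python
import json


def extract_lines(response_text):
    """Return the non-empty text lines from an inline conversion response."""
    data = json.loads(response_text)
    if data.get("error"):
        raise RuntimeError(f"conversion failed with status {data.get('status')}")
    # Lines are separated by \r\n; leading spaces carry the layout, so only
    # trailing whitespace is stripped and blank lines are dropped.
    return [line.rstrip() for line in data["body"].split("\r\n") if line.strip()]


# Shortened stand-in for the sample response above.
sample = json.dumps({
    "body": "   Your Company Name \r\n       Your Address \r\n\r\n",
    "pageCount": 1,
    "error": False,
    "status": 200,
})
lines = extract_lines(sample)
for line in lines:
    print(repr(line))
```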

CURL#

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": true,
    "async": false
}'
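For readers who prefer Python over curl, the same request could be sketched with the standard library as follows. The helper names `make_request` and `send` are hypothetical, and "YOUR_API_KEY" is a placeholder; the sketch builds the request object first so it can be inspected without contacting the API.

```python
import json
import urllib.request

API_BASE = "https://api.pdf.co"


def make_request(endpoint, payload, api_key):
    """Build (but do not send) a POST request with a JSON body."""
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response.

    urllib raises urllib.error.HTTPError for 4xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = make_request(
    "/v1/pdf/convert/to/text",
    {
        "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
        "inline": True,
        "async": False,
    },
    api_key="YOUR_API_KEY",  # placeholder, replace with a real key
)
print(request.get_method(), request.full_url)
```

Calling `send(request)` would perform the actual conversion and consume credits, so it is left out of the sketch.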


/pdf/convert/to/text-simple#

This endpoint works faster and requires fewer credits because it does not use AI-powered layout analysis or OCR, and it does not support profiles for fine-tuning. For advanced conversion with layout analysis, OCR (for scanned pages), PDF repair, and other features, please use the /pdf/convert/to/text endpoint instead.
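The trade-off described above can be expressed as a small selection rule. This is only a sketch: the function `choose_endpoint` and its flag names are hypothetical, and the decision criteria are limited to the features named in the paragraph.

```python
TEXT_ENDPOINT = "/v1/pdf/convert/to/text"
TEXT_SIMPLE_ENDPOINT = "/v1/pdf/convert/to/text-simple"


def choose_endpoint(needs_ocr=False, needs_layout=False, needs_profiles=False):
    """Pick the cheaper endpoint unless an advanced feature is required."""
    if needs_ocr or needs_layout or needs_profiles:
        return TEXT_ENDPOINT
    return TEXT_SIMPLE_ENDPOINT


print(choose_endpoint())                 # plain text from a digital PDF
print(choose_endpoint(needs_ocr=True))   # scanned pages need OCR
```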

Note

Auto classification of incoming documents

Use the Document Classifier endpoint to automatically sort documents or detect their class using keyword-based rules. For example, you can define rules that identify which vendor provided a document and then apply the matching template.

  • Method: POST

  • Endpoint: /v1/pdf/convert/to/text-simple

Attributes#

Note

Attributes are case-sensitive and must be passed inside the JSON body of the POST request, for example:

{
    "url": "https://example.com/file1.pdf"
}

| Attribute | Description | Required |
|---|---|---|
| url | URL to the source file. 1 | yes |
| httpusername | HTTP auth user name, if required to access the source URL. | no |
| httppassword | HTTP auth password, if required to access the source URL. | no |
| pages | Comma-separated indices of pages (or page ranges) that you want to use. The first page index is always 0. For example, to split a 7-page document into 3 separate PDFs with different numbers of pages, use 0, 1, 2- which results in one PDF with page one, one PDF with page two, and one PDF with the rest of the pages. You can also use inverted page numbers by adding ! before the number: !0 means “the last page”, 1-!1 means “from the second to the penultimate page”, and !1- means “the last two pages”. You can also use a single asterisk (*) character as the range to split the document into separate pages. The input must be in string format. | no |
| inline | Set to true to return results inside the response. Otherwise, the endpoint returns a link to the generated output file. | no |
| password | Password for the PDF file. The input must be in string format. | no |
| async | Set async to true to run long processes in the background. The API then returns a jobId that you can use with the Background Job Check endpoint to check the status of the process and retrieve the output, while you proceed with other tasks. | no |
| name | File name for the generated output. The input must be in string format. | no |
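Because this endpoint accepts fewer attributes than /pdf/convert/to/text, a client-side check before sending can catch leftovers such as lang or profiles. The sketch below is an assumption about how one might do that; `unsupported_for_simple` is a hypothetical helper, and how the API itself treats unlisted attributes is not stated here.

```python
# Attribute names copied from the table above (case-sensitive).
SIMPLE_ATTRIBUTES = {
    "url", "httpusername", "httppassword", "pages",
    "inline", "password", "async", "name",
}


def unsupported_for_simple(payload):
    """Return the payload keys that the text-simple table does not list."""
    return sorted(set(payload) - SIMPLE_ATTRIBUTES)


candidate = {
    "url": "https://example.com/file1.pdf",
    "inline": True,
    "lang": "eng",
    "lineGrouping": "1",
}
print(unsupported_for_simple(candidate))
```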

Query parameters#

No query parameters accepted.

Payload#

{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text-simple/sample.pdf",
    "inline": true,
    "async": false
}

Response 2#

{
    "body": "Your Company Name \r\nYour Address \r\nCity, State Zip \r\nInvoice No. 123456 \r\nInvoice Date 01/01/2016 \r\nClient Name \r\nAddress \r\nCity, State Zip  \r\nNotes   \r\nItem Quantity Price Total \r\nItem 1 1 40.00 40.00 \r\nItem 2 2 30.00 60.00 \r\nItem 3 3 20.00 60.00 \r\nItem 4 4 10.00 40.00   \r\nTOTAL 200.00   \r\n",
    "pageCount": 1,
    "error": false,
    "status": 200,
    "name": "sample.txt",
    "remainingCredits": 99885491,
    "credits": 2
}

CURL#

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/text-simple' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": true,
    "async": false
}'


Code samples#
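The asynchronous mode mentioned in the attribute tables can be sketched as a polling loop. Everything about the job-check response here is an assumption: the sketch presumes a caller-supplied `check_job` function that queries the Background Job Check endpoint and returns a dict whose "status" is "working" while the job is still running. Consult the Background Job Check documentation for the actual fields and values.

```python
import time


def wait_for_job(job_id, check_job, interval=3.0, max_checks=100,
                 sleep=time.sleep):
    """Poll check_job(job_id) until the job is no longer reported as working."""
    for _ in range(max_checks):
        result = check_job(job_id)
        # Assumption: "working" marks an unfinished job; any other status
        # (success or failure) ends the loop and is returned to the caller.
        if result.get("status") != "working":
            return result
        sleep(interval)
    raise TimeoutError(f"job {job_id} still working after {max_checks} checks")


# Offline demonstration with canned responses instead of real API calls.
_responses = iter([
    {"status": "working"},
    {"status": "working"},
    {"status": "success"},
])
final = wait_for_job(
    "demo-job",
    check_job=lambda job_id: next(_responses),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final)
```

Passing `check_job` and `sleep` as parameters keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.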

Footnotes

1

Supports links from Google Drive, Dropbox, and PDF.co Built-In Files Storage. To upload files via the API, see the File Upload section. Note: if you experience intermittent Too Many Requests or Access Denied errors, try adding the cache: prefix to enable built-in URL caching (e.g. cache:https://example.com/file1.pdf). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.

2

The main response codes are as follows:

| Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad request. Typically happens because of bad input parameters, or because the input URLs can’t be reached, possibly due to access restrictions like needing a login or password. |
| 401 | Unauthorized |
| 402 | Not enough credits |
| 445 | Timeout error. To process large documents or files, use asynchronous mode (set the async parameter to true) and then check the status using the /job/check endpoint. If a file contains many pages, specify a page range using the pages parameter. The number of pages in the document can be obtained using the /pdf/info endpoint. |
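A client might translate these codes into short messages and decide when to fall back to asynchronous mode. The following sketch uses hypothetical helper names (`describe_status`, `should_retry_async`); the codes and their meanings come from the table above.

```python
RESPONSE_CODES = {
    200: "Success",
    400: "Bad request",
    401: "Unauthorized",
    402: "Not enough credits",
    445: "Timeout error",
}


def describe_status(status):
    """Return the documented meaning of a status code, if it is listed."""
    return RESPONSE_CODES.get(status, f"Unexpected status {status}")


def should_retry_async(status):
    """A 445 timeout suggests retrying with async set to true."""
    return status == 445


for code in (200, 402, 445, 500):
    print(code, describe_status(code), should_retry_async(code))
```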