POST /v1/pdf/convert/to/xls

Attributes

Attributes are case-sensitive and must be sent inside the JSON body of the POST request, for example: { "url": "https://example.com/file1.pdf" }
| Attribute | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | URL to the source file. |
| callback | string | No | - | The callback URL (or webhook) used to receive the POST data. See Webhooks & Callbacks. Only applicable when async is set to true. |
| httpusername | string | No | - | HTTP auth user name, if required to access the source URL. |
| httppassword | string | No | - | HTTP auth password, if required to access the source URL. |
| pages | string | No | all pages | Page indices to process, as comma-separated values or ranges (e.g. "0, 1, 2-" or "1, 2, 3-7"). The first page index is 0. Use "!" before a number for inverted page numbers (e.g. "!0" for the last page). If not specified, all pages are processed. The input must be in string format. |
| rect | string | No | - | Defines coordinates for extraction. Use PDF Edit Add Helper to get or measure PDF coordinates. The format is {x} {y} {width} {height}. |
| lang | string | No | eng | Sets the OCR (text from image) language used when extracting text from scanned PDF, PNG, and JPG input documents. See Language Support. You can also use two languages simultaneously, like this: eng+deu (any combination). |
| password | string | No | - | Password for the PDF file. |
| async | boolean | No | false | Set async to true to run long processes in the background. The API then returns a jobId, which you can use with the Background Job Check endpoint. Also see Webhooks & Callbacks. |
| name | string | No | - | File name for the generated output. The input must be in string format. |
| expiration | integer | No | 60 | Expiration time for the output link, in minutes. After this duration, any generated output files are automatically deleted from PDF.co Temporary Files Storage. The maximum duration for link expiration depends on your current subscription plan. To store permanent input files (e.g. reusable images, PDF templates, documents), consider using PDF.co Built-In Files Storage. |
| profiles | object | No | - | See Profiles for more information. |

The following parameters are set inside profiles:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| ColumnDetectionMode | string | No | ContentGroupsAndBorders | Controls column detection/alignment in PDF table extraction. See Column Detection Mode for more information. |
| OCRMode | string | No | Auto | Specifies how OCR (Optical Character Recognition) should process input content, offering various modes to tailor text extraction to content types such as images, fonts, and vector graphics. For more information, see OCR Extraction Modes. |
| OCRResolution | integer | No | 300 | Changes the OCR resolution from the default 300 dpi. The range is from 72 to 1200 dpi. |
| RotationAngle | integer | No | - | Manual rotation for PDFs with vertically drawn text. Normally, OCR automatically detects page rotation in PDFs and extracts text accurately. However, in some cases the PDF does not have an actual rotated page; rather, the text itself is drawn vertically. In such scenarios, auto-detection may fail, and you can use this parameter to set the page rotation manually. The available angles are: 0, 1, 2, 3. |
| LineGroupingMode | string | No | None | Controls line grouping in PDF text extraction. Modes: None (no grouping), GroupByRows (merge rows if all cells align), GroupByColumns (merge cells by column), JoinOrphanedRows (merge single-cell rows to above if no separator). |
| ConsiderFontColors | boolean | No | false | Controls whether font colors are considered when detecting table structure and merging text objects during PDF extraction. Set to true to consider font colors. |
| DetectNewColumnBySpacesRatio | string | No | 1.2 | Controls how spaces between words are interpreted for column detection in PDF text extraction. It defines the ratio of space width that determines when text should be treated as being in separate columns. |
| AutoAlignColumnsToHeader | boolean | No | true | Controls how columns are detected and aligned during table extraction from PDF documents. It affects both table structure detection and text extraction with formatting preservation. When set to true (default), the row with the most columns is used as the header, and all other rows are aligned to this structure, which is ideal for well-structured tables. When set to false, columns are analyzed independently across all rows to build the structure, which works better for inconsistent or irregular tables. |
| OCRImagePreprocessingFilters.AddGammaCorrection() | array[string (float format)] | No | ["1.4"] | Adds a gamma correction filter to the image preprocessing pipeline used during OCR. This filter adjusts the brightness and contrast of an image by applying a non-linear gamma correction to improve text recognition quality. |
| OCRImagePreprocessingFilters.AddGrayscale() | boolean | No | false | Set to true to add a preprocessing filter that converts a colored document/image to grayscale before performing OCR. |
| SaveVectors | boolean | No | false | Controls whether to save vector graphics during PDF to HTML conversion. Set to true to save vector graphics. |
| SaveImages | string | No | None | Controls how images are saved during PDF to HTML conversion. Modes: None (no images), OuterFile (save to sub-folder), Embed (embed as Base64 data:URI). |
| ConsiderFontSizes | boolean | No | false | Set to true to make the converter consider font size differences in document text when detecting and parsing table structures. This can be helpful when tables use different font sizes to distinguish between headers, data cells, or other structural elements. |
| ExtractionArea | array[number] | No | - | Extracts text from a specific area, defined with points in the format [x, y, width, height]. |
| ExtractShadowLikeText | boolean | No | true | Controls whether to extract invisible text from a PDF document. Set to false to skip invisible text during extraction. This is particularly useful when dealing with PDFs that contain hidden text layers or when you only want to extract visible content. When this value is set to false, OCRMode must be set to Auto for the shadow text filtering to take effect. |
| DataEncryptionAlgorithm | string | No | - | Controls the encryption algorithm used for data encryption. See User-Controlled Encryption for more information. The available algorithms are: AES128, AES192, AES256. |
| DataEncryptionKey | string | No | - | Controls the encryption key used for data encryption. See User-Controlled Encryption for more information. |
| DataEncryptionIV | string | No | - | Controls the encryption IV used for data encryption. See User-Controlled Encryption for more information. |
| DataDecryptionAlgorithm | string | No | - | Controls the decryption algorithm used for data decryption. See User-Controlled Encryption for more information. The available algorithms are: AES128, AES192, AES256. |
| DataDecryptionKey | string | No | - | Controls the decryption key used for data decryption. See User-Controlled Encryption for more information. |
| DataDecryptionIV | string | No | - | Controls the decryption IV used for data decryption. See User-Controlled Encryption for more information. |

You can use profiles to control the conversion process and the output XLS file.
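
The pages syntax is compact, so it can help to see what a given string selects. The following is a minimal client-side sketch that expands a pages string into zero-based indices, following only the rules stated for the pages attribute (comma-separated values, closed and open-ended ranges, and "!" for counting from the end). The function name expandPages is hypothetical and not part of the API, and the handling of out-of-range values and of "!" inside ranges is an assumption, since the documentation does not describe it.

```javascript
// Hypothetical helper: expands a `pages` string into zero-based page indices
// for a document with `pageCount` pages. It mirrors the documented syntax only.
// Assumption: out-of-range or malformed tokens throw; the service may behave differently.
function expandPages(spec, pageCount) {
  const all = Array.from({ length: pageCount }, (_, i) => i);
  // An empty or missing value means "all pages" (the documented default).
  if (spec === undefined || spec.trim() === "") return all;

  // "!n" counts from the end: "!0" is the last page.
  const toIndex = (token) => {
    const inverted = token.startsWith("!");
    const digits = inverted ? token.slice(1) : token;
    if (!/^\d+$/.test(digits)) throw new Error(`Invalid page token: "${token}"`);
    const n = Number(digits);
    const index = inverted ? pageCount - 1 - n : n;
    if (index < 0 || index >= pageCount) throw new Error(`Page out of range: "${token}"`);
    return index;
  };

  const result = [];
  for (const raw of spec.split(",")) {
    const part = raw.trim();
    const dash = part.indexOf("-");
    if (dash === -1) {
      result.push(toIndex(part)); // single page
      continue;
    }
    const start = toIndex(part.slice(0, dash).trim());
    const endToken = part.slice(dash + 1).trim();
    // "2-" is an open-ended range that runs to the last page.
    const end = endToken === "" ? pageCount - 1 : toIndex(endToken);
    for (let i = start; i <= end; i++) result.push(i);
  }
  return result;
}

console.log(expandPages("0, 2-", 4)); // [0, 2, 3]
```

A helper like this is only useful for previewing or validating input before sending a request; the pages value itself is still sent to the API as a plain string.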

Column Detection Mode

Column detection may need adjusting when a document contains overlapping invisible text and vector objects that affect column detection. In this case you may need to fix the wrongly positioned data.

Set the options for your column detection via the following profiles parameters:

ColumnDetectionMode - available values:

  • ContentGroups
  • Borders
  • BorderedTables
  • ContentGroupsAI
{
 "profiles": "{ 'ColumnDetectionMode': 'ContentGroups' }"
}
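
As a rough illustration, the request body above could be built in code as follows. This sketch assumes that profiles is sent as a string in the single-quoted form shown in the example; the helper name buildColumnDetectionBody is hypothetical. The list of accepted values combines the four modes listed above with ContentGroupsAndBorders, which the attributes table names as the default.

```javascript
// Modes listed in this section, plus the default named in the attributes table.
const COLUMN_DETECTION_MODES = [
  "ContentGroups",
  "Borders",
  "BorderedTables",
  "ContentGroupsAI",
  "ContentGroupsAndBorders",
];

// Hypothetical helper: returns the JSON request body as a string.
function buildColumnDetectionBody(url, mode) {
  // Attribute values are case-sensitive, so only exact matches are accepted.
  if (!COLUMN_DETECTION_MODES.includes(mode)) {
    throw new Error(`Unsupported ColumnDetectionMode: ${mode}`);
  }
  return JSON.stringify({
    url: url,
    // Same single-quoted string form as the documented example.
    profiles: `{ 'ColumnDetectionMode': '${mode}' }`,
  });
}

console.log(buildColumnDetectionBody("https://example.com/file1.pdf", "ContentGroups"));
```

Checking the mode against a fixed list before sending catches typos such as wrong casing locally, instead of relying on the service to report them.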

Query parameters

No query parameters accepted.

Responses

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | Direct URL to the final output file stored in S3. |
| outputLinkValidTill | string | Timestamp indicating when the output link will expire. |
| pageCount | integer | Number of pages in the PDF document. |
| error | boolean | Indicates whether an error occurred (false means success). |
| status | string | Status code of the request (200, 404, 500, etc.). For more information, see Response Codes. |
| name | string | Name of the output file. |
| credits | integer | Number of credits consumed by the request. |
| remainingCredits | integer | Number of credits remaining in the account. |
| duration | integer | Time taken for the operation, in milliseconds. |

Example Payload

To see the request size limits, please refer to the Request Size Limits.
{
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-excel/sample.pdf",
  "async": false
}

Example Response

To see the main response codes, please refer to the Response Codes page.
{
  "url": "https://pdf-temp-files.s3.amazonaws.com/60c6b9f50280495a9567f73a0a394252/sample.xlsx",
  "pageCount": 1,
  "error": false,
  "status": 200,
  "name": "sample.xlsx",
  "remainingCredits": 60568
}
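
A response like the one above might be checked along these lines before the output link is used. This is a minimal sketch: the function name readConvertResult is hypothetical, and the message field read on failure is taken from the code sample on this page rather than from the response table, so its presence is an assumption.

```javascript
// Hypothetical helper: parses the response body text and returns the fields
// needed to download the result, or throws if the service reported an error.
function readConvertResult(bodyText) {
  const data = JSON.parse(bodyText);
  // `error: false` means success; anything else is treated as a failure.
  if (data.error !== false) {
    throw new Error(`Conversion failed (status ${data.status}): ${data.message}`);
  }
  return {
    url: data.url,
    name: data.name,
    pageCount: data.pageCount,
    remainingCredits: data.remainingCredits,
  };
}

const example = JSON.stringify({
  url: "https://pdf-temp-files.s3.amazonaws.com/60c6b9f50280495a9567f73a0a394252/sample.xlsx",
  pageCount: 1,
  error: false,
  status: 200,
  name: "sample.xlsx",
  remainingCredits: 60568,
});
console.log(readConvertResult(example).name); // sample.xlsx
```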

Code Samples

var https = require("https");
var path = require("path");
var fs = require("fs");

// The authentication key (API Key).
// Get your own by registering at https://app.pdf.co
const API_KEY = "***********************************";

// Direct URL of source PDF file.
// You can also upload your own file into PDF.co and use it as url. Check "Upload File" samples for code snippets: https://github.com/bytescout/pdf-co-api-samples/tree/master/File%20Upload/
const SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-excel/sample.pdf";
// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";
// PDF document password. Leave empty for unprotected documents.
const Password = "";
// Destination XLS file name
const DestinationFile = "./result.xls";

// Prepare request to `PDF To XLS` API endpoint
var queryPath = `/v1/pdf/convert/to/xls`;

// JSON payload for API request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};

// Send request
var postRequest = https.request(reqOptions, (response) => {
    // Collect the whole body first: a single "data" event may carry only part of the JSON
    var chunks = [];
    response.on("data", (chunk) => chunks.push(chunk));
    response.on("end", () => {
        // Parse JSON response
        var data;
        try {
            data = JSON.parse(Buffer.concat(chunks).toString("utf8"));
        } catch (e) {
            console.log("Could not parse API response: " + e.message);
            return;
        }
        if (data.error === false) {
            // Download XLS file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                    .on("close", () => {
                        console.log(`Generated XLS file saved as "${DestinationFile}" file.`);
                    });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();
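
The same request can also be written with async/await. The sketch below is one possible variant, assuming a Node.js version that provides a global fetch (18 or later). The function name convertPdfToXls and the fetchImpl parameter are hypothetical; fetchImpl exists only so the function can be exercised with a stub instead of a real network call. The endpoint, the x-api-key header, and the error and message fields are taken from the sample above.

```javascript
// Hypothetical wrapper around the `PDF To XLS` endpoint.
// `fetchImpl` defaults to the global fetch (assumed available, Node.js 18+).
async function convertPdfToXls(apiKey, payload, fetchImpl = fetch) {
  const response = await fetchImpl("https://api.pdf.co/v1/pdf/convert/to/xls", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  });
  // response.json() waits for the complete body before parsing it.
  const data = await response.json();
  if (data.error !== false) {
    throw new Error(data.message || `Request failed with status ${data.status}`);
  }
  return data;
}

// Usage (not executed here, since it needs a real API key):
// convertPdfToXls(API_KEY, { url: SourceFileUrl, name: "result.xls" })
//   .then((data) => console.log(data.url))
//   .catch((e) => console.log(e.message));
```

Passing the transport as a parameter keeps the function testable offline; in normal use the third argument is simply omitted.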