Profiles#

This page describes the profiles parameter that can be used with your API calls.

Profiles set extra options for common API calls. Some options are specific to a particular API.

Profiles are written in a JSON-like notation and passed as the value of the profiles field in your API calls, as shown in the samples below.

Important

Please note that the value for the profiles field in the code snippets must be enclosed in quotes ("), making it a complete string. For example: { "profiles": "{'TrimSpaces':true, 'PreserveFormattingOnTextExtraction': true}"}

Sample Code#

{
    "profiles": "{'TrimSpaces':true, 'PreserveFormattingOnTextExtraction': true}"
}
profiles = '"TrimSpaces": "True", "PreserveFormattingOnTextExtraction": "True" '
{
  "profiles": "'TrimSpaces': 'True' , 'PreserveFormattingOnTextExtraction': 'True'"
}
String profiles = "{ 'TrimSpaces': 'True', 'PreserveFormattingOnTextExtraction': 'True' }";
const Profiles = "{ 'TrimSpaces': 'True', 'PreserveFormattingOnTextExtraction': 'True' }";
$Profiles = '{ "TrimSpaces": "True", "PreserveFormattingOnTextExtraction": "True" }'
{
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/pdf-to-json/sample.pdf",
    "inline": true,
    "profiles": "{ 'TrimSpaces': 'True', 'PreserveFormattingOnTextExtraction': 'True' }"
}
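
Because the profiles value is a string rather than a nested object, it can be convenient to build it programmatically. The following is a minimal Python sketch, assuming the service accepts standard double-quoted JSON inside the string (as in the double-quoted samples above). The helper name build_profiles is hypothetical and is not part of any PDF.co SDK.

```python
import json


def build_profiles(options: dict) -> str:
    """Serialize profile options into the string form used by the profiles field."""
    # json.dumps emits double-quoted JSON; that the API accepts this form is an
    # assumption based on the double-quoted samples on this page.
    return json.dumps(options)


payload = {
    "profiles": build_profiles(
        {"TrimSpaces": True, "PreserveFormattingOnTextExtraction": True}
    )
}

# The value of "profiles" is a plain string, not a nested object.
print(json.dumps(payload, indent=2))
```

Serializing the options in one place avoids hand-escaping quotes when the request body itself is later encoded as JSON.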

Generic Profile Options#

The following profiles options are not specific to any one particular endpoint.

Standard Parameters#

The std_params option within the profiles parameter lets you define regular API parameters in JSON format. It is designed to simplify passing standard parameters and additional options together in the profiles parameter for PDF.co API requests.

Note

When std_params are used in the profiles parameter, if a parameter is duplicated within both std_params and outside profiles, the value specified in std_params will overwrite the duplicate value.
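
The precedence rule in this note can be pictured with a small local sketch. This is only an illustration of the rule as stated, not the service's implementation; the function name effective_parameters and the sample values are hypothetical.

```python
def effective_parameters(top_level: dict, std_params: dict) -> dict:
    """Return the parameters that apply when std_params overrides duplicates."""
    # In a dict merge, later entries win, so std_params values replace
    # any top-level values that share the same key.
    return {**top_level, **std_params}


top_level = {"async": False, "inline": True}
std_params = {"async": True, "callback": "webhook_url"}

result = effective_parameters(top_level, std_params)
print(result)  # {'async': True, 'inline': True, 'callback': 'webhook_url'}
```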

std_params Structure#

  • Description: Contains key-value pairs of standard parameters that will be used across PDF.co API requests.

  • Type: JSON Object (passed as a string)

  • Example:

    {
      "profiles": "{'std_params': {'callback': 'webhook_url'}}"
    }
    

Practical Application#

Using the std_params profile, you can define a set of standard parameters and configurations that will be consistently applied across your PDF.co API requests. This approach is particularly beneficial when using automation platforms like Zapier, Make, and others, where the number of parameters you can pass directly is limited.

Complete Request Example#

Here is a complete example illustrating the use of the std_params profile with other parameters:

/pdf/convert/to/text

{
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
  "inline": true,
  "profiles": "{'std_params': {'callback': 'webhook_url', 'async': true}, 'ExtractShadowLikeText': false, 'ExtractColumnByColumn': true, 'OCRMode': 'Auto'}",
  "TrimSpaces": true,
  "PreserveFormattingOnTextExtraction": true
}

Output as Base64#

If you require your output as base64, use the following:

{
    "profiles": "{ 'outputDataFormat': 'base64' }"
}

Important

This output data format is supported by endpoints that generate binary files - PDF and images. The output is accessible via a generated link and the file under the link is in a base64-encoded text format.
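
Once the text behind the generated link has been downloaded, it needs to be decoded back into binary. The sketch below shows only the decoding step, using a locally generated stand-in string instead of a real download; the function name decode_base64_output is hypothetical.

```python
import base64


def decode_base64_output(text: str) -> bytes:
    """Decode base64 text (for example, the downloaded result) into raw bytes."""
    # With default settings, b64decode ignores whitespace such as line breaks.
    return base64.b64decode(text)


# Stand-in for the downloaded text: the first bytes of a PDF, base64-encoded.
sample_text = base64.b64encode(b"%PDF-1.7\n").decode("ascii")

pdf_bytes = decode_base64_output(sample_text)
print(pdf_bytes[:5])  # b'%PDF-'
```

The decoded bytes can then be written to disk with a file opened in binary mode.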

Converting PDFs#

There are a variety of profiles options which can be set when converting from PDF to other documents. These profiles control how to extract the information from the source PDF file.

These options apply to the following endpoints:

  • /pdf/convert/to/csv

  • /pdf/convert/to/xml

  • /pdf/convert/to/json

  • /pdf/convert/to/json2

  • /pdf/convert/to/xls

  • /pdf/convert/to/xlsx

Convert Vectors#

You can choose whether the conversion process should convert vectors or not as follows:

{
    "profiles": "{ 'SaveVectors': true }"
}

Save Images#

This profiles parameter includes the SaveImages property that extracts individual images in a regular PDF.

{
    "profiles": "{ 'SaveImages': 'Embed' }"
}

Consider Font Size#

This profiles parameter allows you to separate header and body text based on font size.

{
    "profiles": "{ 'ConsiderFontSizes': true }"
}

Set the Extraction Area#

Extract text in a specific area by defining the extraction area - set with points in the format [x, y, width, height].

{
    "profiles": "{ 'ExtractionArea': [171.0,69.0,249.75,71.25] }"
}
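
If a region is known by its corner coordinates rather than by its size, the [x, y, width, height] form can be derived with simple subtraction. This is a minimal sketch; the function name extraction_area is hypothetical, and the coordinate origin and axis direction are not specified on this page, so the helper makes no assumption beyond right > left and bottom > top.

```python
def extraction_area(left: float, top: float, right: float, bottom: float) -> list:
    """Convert corner coordinates into the [x, y, width, height] form."""
    if right <= left or bottom <= top:
        raise ValueError("right/bottom must be greater than left/top")
    return [left, top, right - left, bottom - top]


area = extraction_area(171.0, 69.0, 420.75, 140.25)
print(area)  # [171.0, 69.0, 249.75, 71.25]
```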

Extracting Invisible Text#

When dealing with PDF documents, sometimes there may be unwanted invisible text that makes it difficult to extract the desired content accurately. This could be due to various reasons such as the original document being scanned or saved with a low-quality setting. In such cases, it is important to remove the unwanted invisible text to ensure accurate extraction of the desired content.

{
    "profiles": "{ 'ExtractInvisibleText': false, 'ExtractShadowLikeText': false, 'OCRMode': 'Auto' }"
}

OCR Options#

For OCR (Optical Character Recognition) there are a variety of profile options.

Setting the OCR Mode#

There are four values which can be set for this mode:

  • Auto - OCR will be determined automatically

  • TextFromImagesOnly - OCR will only extract from images

  • TextFromImagesAndVectorsAndFonts - OCR will extract from images, vectors and fonts

  • TextFromImagesAndVectorsAndRepairedFonts - OCR will extract from images, vectors and repaired fonts. Some PDF files are malformed: the embedded font used to draw characters has a modified character table, so the correct symbol codes cannot be obtained for any relevant charset. To check for this, open the document in Adobe Reader and copy and paste some text from it. If the pasted characters are also garbled, the file may be using some form of extraction protection. If you need to get the text from this kind of file at any cost, try this mode, which attempts to “repair” the text.

{
    "profiles": "{ 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts' }"
}

Extracting Text from Colored Background#

If you can’t extract text with a colored background, please add the Grayscale filter to the profiles as follows:

{
    "profiles": "{ 'OCRImagePreprocessingFilters.AddGrayscale()': [] }"
}

Considering the Font Color on Tables#

Sometimes the data which OCR must extract from a table might have colored text which is difficult to extract. OCR results can be improved with the following:

{
    "profiles": "{
        'LineGroupingMode': 'JoinOrphanedRows',
        'ConsiderFontColors': true,
        'DetectNewColumnBySpacesRatio': '1.1',
        'AutoAlignColumnsToHeader': false,
        'OCRImagePreprocessingFilters.AddGammaCorrection()': [ '1.4' ]
    }"
}

Setting the Rotation Angle#

Normally OCR detects PDF rotation and extracts text properly. In some cases, however, a PDF is constructed so that the page is not rotated and the text is instead drawn vertically, and OCR does not detect the rotation automatically. In such scenarios, use the following profile setting.

{
    "profiles": "{ 'RotationAngle': 2 }"
}
  • 0 - no rotation

  • 1 - 90 degrees

  • 2 - 180 degrees

  • 3 - 270 degrees
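
The RotationAngle codes listed above are the rotation in degrees divided by 90. A small sketch of that mapping follows; the function name rotation_angle_code is hypothetical, and the direction of rotation is not stated on this page.

```python
def rotation_angle_code(degrees: int) -> int:
    """Map a rotation in degrees (a multiple of 90) to a RotationAngle code 0-3."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # Normalize to the range 0-359 first, so 360 maps to 0 and -90 maps to 3.
    return (degrees % 360) // 90


print(rotation_angle_code(180))  # 2
```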


Profile Options by Endpoint#

Explore various profiles options by API endpoint below.

PDF Add#

Crop a PDF File#

Crop a PDF file using an array to define the crop area. The crop box is defined by a rectangle [x, y, width, height] in PDF points (1 point = 1/72 inch).

Note

An A4 page size in points is 595 x 842

{
    "profiles": "{ 'Pages[0].SetCropBox()': ['28', '28', '539', '786'] }"
}
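
One way to arrive at crop box values is to start from the page size and subtract a uniform margin from each side. The sketch below assumes that approach; the function name crop_box_with_margin is hypothetical. With an A4 page (595 x 842 points) and a 28-point margin it happens to reproduce the values in the example above.

```python
A4_WIDTH_PT, A4_HEIGHT_PT = 595, 842  # A4 page size in PDF points


def crop_box_with_margin(page_width: int, page_height: int, margin: int) -> list:
    """Return [x, y, width, height] as strings, inset by margin on every side."""
    width = page_width - 2 * margin
    height = page_height - 2 * margin
    if width <= 0 or height <= 0:
        raise ValueError("margin is too large for the page size")
    # The example on this page passes the values as strings, so do the same.
    return [str(value) for value in (margin, margin, width, height)]


box = crop_box_with_margin(A4_WIDTH_PT, A4_HEIGHT_PT, 28)
print(box)  # ['28', '28', '539', '786']
```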

Disable Ligaturization#

To disable ligaturization, for example for Hebrew, use the following:

{
    "profiles": "{ 'DisableLigatures': true }"
}

Flatten Document#

Flattening a document renders it read-only, which is handy if you want to remove editing or copying capability.

{
    "profiles": "{ 'FlattenDocument()': [] }"
}

Search and Replace Text#

Adjust Text Alignment#

Users may have encountered an issue when using the /pdf/edit/replace-text API endpoint to replace text in a PDF document. The replaced text might appear slightly higher than the original text or the surrounding text, causing alignment issues.

To fix this issue, we have added a new parameter called YAdjustmentForReplacementText in the profiles parameter of the API request. This parameter allows you to adjust the vertical position of the replaced text, ensuring proper alignment with the rest of the document. Negative values for this parameter move text up, positive values move text down.

Here’s an example of how to use the YAdjustmentForReplacementText parameter. In this example API request, the YAdjustmentForReplacementText parameter has been set to -1, which moves the replaced text 1 unit up vertically, resulting in better alignment with the original text.

{
    "profiles": "{'YAdjustmentForReplacementText': '-1'}"
}

Search and Replace Text with Image#

Crop Empty Space Around Images#

If you need to crop empty space around an inserted image, use the following:

{
    "profiles": "{'AutoCropImages': true}"
}

Search and Delete Text#

Showing Redacted Text#

By default, deleting text with the /pdf/edit/delete-text endpoint simply removes the text, leaving a space where it was.

If you need to black out the deleted text instead, this can be achieved with the following profiles parameters.

  • Set the UsePatch parameter to true.

  • Set the PatchColor parameter to the color to use for redacting, in hex format. For example: 'PatchColor': '#000000'.

To black out text without removing it, so that it can still be copied, set the RemoveTextUnderPatch parameter to false.

Important

If RemoveTextUnderPatch is set to false then a user could still copy the text making the redaction less secure than you might require!

{
    "profiles": "{'UsePatch': true, 'PatchColor': '#000000', 'RemoveTextUnderPatch': true}"
}

PDF Optimize#

Optimization options#

Set the options for your optimization via the following profiles parameters:

  • ImageOptimizationFormat - (optional) controls image compression format. Available values:
    • JPEG (default) JPEG based compression.

    • Flate (zip-like compression).

    • Fax 1-bit black and white compression, provides best file size.

  • JPEGQuality (optional) controls JPEG compression quality from 1 (worst quality, smallest size) to 100 (best quality, largest size). Set to 25 by default.

  • ResampleImages (optional) tells the compressor to resample images to a new resolution - true by default.

  • ResamplingResolution (optional) target resampled images resolution. 120 (dots per inch) by default.

  • GrayscaleImages (optional) turns all images into grayscale. This does not affect the compression, but is useful if you need every image in the document to be grayscale - false by default.

{
    "profiles": "{ 'ImageOptimizationFormat': 'JPEG', 'JPEGQuality': 25, 'ResampleImages': true, 'ResamplingResolution': 120, 'GrayscaleImages': false }"
}

PDF Convert to CSV & PDF Convert to XLS#

Column Detection Mode#

A document may contain a number of overlapping invisible text and vector objects that affect column detection. In this case, you may need to change the column detection mode to fix the wrongly positioned data.

Set the options for your column detection via the following profiles parameters:

  • ColumnDetectionMode - available values:
    • ContentGroups

    • Borders

    • BorderedTables

    • ContentGroupsAI

{
    "profiles": "{ 'ColumnDetectionMode': 'ContentGroups' }"
}

PDF Merge#

Rename Matching Fields#

This feature enables the renaming of field names during the merging of PDF files which contain forms. If set to false, it will retain the original field names. This is helpful for merged PDF forms with identical field names when the customer wants to auto-fill the identical field names in other pages.

{
    "profiles": "{ 'RenameMatchingFieldsDuringMerge': false }"
}

Generate Bookmarks#

This adds bookmarks to the merged document with names assigned to every merged document in the same order:

{
    "profiles": "{'GenerateBookmarks': true, 'BookmarkTitles': [ 'BookmarkName1', 'BookmarkName2', 'BookmarkName3' ] }"
}
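
Since the bookmark titles must follow the order of the merged documents, one option is to derive them from the input file names. This is a sketch under that assumption; the function name bookmark_titles and the example.com URLs are hypothetical placeholders.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def bookmark_titles(urls: list) -> list:
    """Derive one bookmark title per input URL, keeping the input order."""
    # Use the file name without its extension, decoding %XX escapes.
    return [PurePosixPath(unquote(urlparse(url).path)).stem for url in urls]


# Hypothetical input files, in the order they will be merged.
urls = [
    "https://example.com/files/invoice.pdf",
    "https://example.com/files/terms%20and%20conditions.pdf",
]

titles = bookmark_titles(urls)
print(titles)  # ['invoice', 'terms and conditions']

profiles = json.dumps({"GenerateBookmarks": True, "BookmarkTitles": titles})
```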

Include / Exclude from ZIPS#

You can control which files to include in or exclude from input zip files with a profile.

// include PDF, XLS and XLSX files
{
    "profiles": "{ 'zipIncludeFilter': '*.pdf,*.xls*' }"
}
// exclude DOC, DOCX, XLS and XLSX files
{
    "profiles": "{ 'zipExcludeFilter': '*.doc*,*.xls*' }"
}

Note

zipIncludeFilter and zipExcludeFilter support * and ? wildcards.
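
To get a feel for which file names a comma-separated filter would select, the wildcards can be approximated locally with Python's fnmatch module. This is only an approximation: the service's actual matching (for example, case sensitivity) may differ, fnmatch also supports [seq] patterns that are not mentioned on this page, and the function name matches_filter is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_filter(filename: str, filter_value: str) -> bool:
    """Check a file name against a comma-separated list of wildcard patterns."""
    patterns = [p.strip() for p in filter_value.split(",") if p.strip()]
    # fnmatchcase treats * as any run of characters and ? as one character.
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


include_filter = "*.pdf,*.xls*"
for name in ["report.pdf", "data.xlsx", "notes.docx"]:
    print(name, matches_filter(name, include_filter))
```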

Change Document Title#

You can change the document title during a merge with the following:

{
    "profiles": "{ 'MergedDocumentTitle': 'New Title' }"
}

PDF Find & PDF Find Table#

Find only bordered tables#

You can limit search to bordered tables only by enabling the legacy table search mode with the following profiles config:

{
    "profiles": "{ 'Mode': 'Legacy',
                  'ColumnDetectionMode': 'BorderedTables',
                  'DetectionMinNumberOfRows': 1,
                  'DetectionMinNumberOfColumns': 1,
                  'DetectionMaxNumberOfInvalidSubsequentRowsAllowed': 0,
                  'DetectionMinNumberOfLineBreaksBetweenTables': 0,
                  'EnhanceTableBorders': false
                }"
}

PDF Find#

Support page rotation#

This endpoint supports PDF page rotation as follows:

{
   "profiles": "{ 'OCRDetectPageRotation': true }"
}

PDF to HTML#

Disable Images#

To turn off image output, set the following profile:

{
   "profiles": "{ 'saveImages': 0 }"
}

Control Image Quality#

Some PDFs contain high-quality images, and you may need to keep their quality in the output HTML. By default, PDF to HTML optimizes images; you can turn this off with the following profile:

{
   "profiles": "{ 'OptimizeImages': false }"
}

Control Output Page Width#

Control page width output as follows:

{
   "profiles": "{ 'OutputPageWidth': 2048 }"
}

Inject CSS#

To inject CSS for layout options in your HTML use the following:

{
   "profiles": "{ 'AdditionalCssStyles': '#canvas { zoom: 50%; }' }"
}

PDF to Image#

Disable Text Layer#

We can turn off the text layer for our render as follows:

{
   "profiles": "{ 'RenderTextObjects': false }"
}

Set Image Resolution#

By default the screen resolution is 120 DPI. To change the rendering resolution, please use:

{
    "profiles": "{ 'RenderingResolution': 300 }"
}

Options for TIFF#

TIFF has a variety of options as follows:

{
    "profiles": "{
        'TextSmoothingMode': 'HighQuality', // Valid values: 'HighSpeed', 'HighQuality'
        'VectorSmoothingMode': 'HighQuality', // Valid values: 'HighSpeed', 'HighQuality'
        'ImageInterpolationMode': 'HighQuality', // Valid values: 'HighSpeed', 'HighQuality'
        'RenderTextObjects': true, // Valid values: true, false
        'RenderVectorObjects': true, // Valid values: true, false
        'RenderImageObjects': true, // Valid values: true, false
        'RenderCurveVectorObjects': true, // Valid values: true, false
        'JPEGQuality': 85, // from 0 (lowest) to 100 (highest)
        'TIFFCompression': 'LZW', // Valid values: 'None', 'LZW', 'CCITT3', 'CCITT4', 'RLE'
        'RotateFlipType': 'RotateNoneFlipNone', // See note
        'ImageBitsPerPixel': 'BPP24', // Valid values: 'BPP1', 'BPP8', 'BPP24', 'BPP32'
        'OneBitConversionAlgorithm': 'OtsuThreshold', // Valid values: 'OtsuThreshold', 'BayerOrderedDithering'
        'FontHintingMode': 'Default', // Valid values: 'Default', 'Stronger'
        'NightMode': false // Valid values: true, false
    }"
}

Options for WEBP#

To control the quality and encoding speed use the following:

{
    "profiles": "{ 'WEBPQuality': 75, 'WEBPEncodingSpeed': 4 }"
}