Profiles#
This page describes the profiles parameter that can be used with your API calls.
Profiles are used to set extra options for common API calls and are sometimes specific to a particular API.
Profiles are written in a JSON-like notation and passed in the profiles field of your API calls, for example:
Important
Please note that the value for the profiles field in the code snippets must be enclosed in quotes ("), making it a complete string. For example: { "profiles": "{'TrimSpaces':true, 'PreserveFormattingOnTextExtraction': true}"}
Sample Code#
{
"profiles": "{'TrimSpaces':true, 'PreserveFormattingOnTextExtraction': true}"
}
profiles = '{ "TrimSpaces": "True", "PreserveFormattingOnTextExtraction": "True" }'
{
"profiles": "{ 'TrimSpaces': 'True', 'PreserveFormattingOnTextExtraction': 'True' }"
}
String profiles = "{ 'TrimSpaces': 'True', 'PreserveFormattingOnTextExtraction': 'True' }";
const Profiles = "{ 'TrimSpaces': 'True', 'PreserveFormattingOnTextExtraction': 'True' }";
$Profiles = '{ "TrimSpaces": "True", "PreserveFormattingOnTextExtraction": "True" }'
{
"url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/pdf-to-json/sample.pdf",
"inline": true,
"profiles": "{ 'TrimSpaces': 'True', 'PreserveFormattingOnTextExtraction': 'True' }"
}
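Because the profiles value is a string rather than a nested object, it can be convenient to build it programmatically. The following is a minimal Python sketch, not an official client. `build_profiles_body` is a hypothetical helper, and the sketch assumes the API accepts standard double-quoted JSON inside the string (as in some of the snippets above) as well as the single-quoted notation.

```python
import json


def build_profiles_body(options, **params):
    """Return a request body whose "profiles" field is a JSON string.

    `options` holds profile options such as TrimSpaces; `params` holds
    regular top-level parameters such as url or inline.
    """
    body = dict(params)
    # Serialize the options first so the field is a string, not an object.
    body["profiles"] = json.dumps(options)
    return body


body = build_profiles_body(
    {"TrimSpaces": True, "PreserveFormattingOnTextExtraction": True},
    inline=True,
)
payload = json.dumps(body)  # the inner quotes are escaped automatically
```

Serializing twice (once for the options, once for the whole body) avoids escaping the inner quotes by hand.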
Generic Profile Options#
The following profiles options are not specific to any one particular endpoint.
Standard Parameters#
The std_params object within the profiles parameter lets you define regular API parameters in JSON format. It is designed to simplify passing standard parameters and additional options together in the profiles parameter of PDF.co API requests.
When using Standard Parameters, webhooks can be utilized by setting the callback object to the URL of your choice. However, it is simpler to set the callback object directly - see Webhooks & Callbacks for more.
Note
When std_params is used in the profiles parameter and a parameter is defined both inside std_params and outside profiles, the value specified in std_params overwrites the duplicate value.
Therefore, if you define a callback object in std_params, it will overwrite any value you may have defined via the basic callback object!
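The precedence rule in this note can be modelled in a few lines of Python. This is only an illustration of the documented behaviour, not the service's implementation; `effective_parameters` is a hypothetical name and the URLs are placeholders.

```python
def effective_parameters(top_level, std_params):
    """Merge parameters so that std_params wins on duplicate keys."""
    merged = dict(top_level)
    merged.update(std_params)  # values from std_params overwrite duplicates
    return merged


result = effective_parameters(
    {"callback": "https://example.com/basic", "inline": True},
    {"callback": "https://example.com/from-std-params"},
)
```

Here `result["callback"]` ends up as the std_params value, while `inline` is kept because it is not duplicated.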
std_params Structure#
Description: Contains key-value pairs of standard parameters that will be used across PDF.co API requests.
Type: JSON Object (passed as a string)
Example:
{ "profiles": "{'std_params': {'callback': 'webhook_url'}}" }
Practical Application#
Using the std_params profile, you can define a set of standard parameters and configurations that will be consistently applied across your PDF.co API requests. This approach is particularly beneficial when using automation platforms like Zapier, Make, and others, where the number of parameters you can pass directly is limited.
Complete Request Example#
Here is a complete example illustrating the use of the std_params profile with other parameters:
/pdf/convert/to/text
{
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
"inline": true,
"profiles": "{'std_params': {'callback': 'webhook_url', 'async': true}, 'ExtractShadowLikeText': false, 'ExtractColumnByColumn': true, 'OCRMode': 'Auto'}",
"TrimSpaces": true,
"PreserveFormattingOnTextExtraction": true
}
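As a rough sketch, a request like this one could be assembled in Python as follows. The base URL and the `x-api-key` header name are assumptions that are not stated on this page, `build_text_request` is a hypothetical helper, and the webhook URL is a placeholder. The request is only built here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.pdf.co/v1"  # assumption: not stated on this page


def build_text_request(api_key, source_url, callback_url):
    """Build (but do not send) a POST request for /pdf/convert/to/text."""
    profiles = {
        "std_params": {"callback": callback_url, "async": True},
        "ExtractColumnByColumn": True,
        "OCRMode": "Auto",
    }
    body = {
        "url": source_url,
        "inline": True,
        # The profiles field must be a string, so serialize it separately.
        "profiles": json.dumps(profiles),
    }
    return urllib.request.Request(
        API_BASE + "/pdf/convert/to/text",
        data=json.dumps(body).encode("utf-8"),
        # Header name is an assumption; check your account documentation.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_text_request(
    "YOUR_API_KEY",
    "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "https://example.com/webhook",
)
# urllib.request.urlopen(request) would send it; omitted in this sketch.
```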
Output as Base64#
If you require your output as base64, use the following:
{
"profiles": "{ 'outputDataFormat': 'base64' }"
}
Important
This output data format is supported by endpoints that generate binary files - PDF and images. The output is accessible via a generated link and the file under the link is in a base64-encoded text format.
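Once the base64 text has been downloaded from the generated link, it still has to be decoded back into binary. A minimal Python sketch of that step is shown below; `decode_base64_output` is a hypothetical helper, and the sample input is generated locally rather than taken from a real API response.

```python
import base64


def decode_base64_output(text):
    """Turn base64-encoded text into the original binary bytes."""
    # Remove line breaks and spaces, which strict validation would reject.
    cleaned = "".join(text.split())
    return base64.b64decode(cleaned, validate=True)


sample = base64.b64encode(b"%PDF-1.7 example").decode("ascii")
pdf_bytes = decode_base64_output(sample)
```

The resulting bytes can then be written to a file opened in binary mode.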
Converting PDFs#
There are a variety of profiles options which can be set when converting from PDF to other document formats. These options control how information is extracted from the source PDF file.
These options apply to the following endpoints:
/pdf/convert/to/csv
/pdf/convert/to/xml
/pdf/convert/to/json
/pdf/convert/to/json2
/pdf/convert/to/xls
/pdf/convert/to/xlsx
Convert Vectors#
You can choose whether the conversion process should convert vectors or not as follows:
{
"profiles": "{ 'SaveVectors': true }"
}
Save Images#
The profiles parameter includes the SaveImages property, which extracts individual images from a regular PDF.
{
"profiles": "{ 'SaveImages': 'Embed' }"
}
Consider Font Size#
This profiles parameter allows you to separate header and body text based on font size.
{
"profiles": "{ 'ConsiderFontSizes': true }"
}
Set the Extraction Area#
Extract text in a specific area by defining the extraction area - set with points in the format [x, y, width, height].
{
"profiles": "{ 'ExtractionArea': [171.0,69.0,249.75,71.25] }"
}
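If the area is known as two corner points rather than as a width and height, it has to be converted first. The Python sketch below shows one way to do that; `extraction_area` is a hypothetical helper, and it assumes both corners are already expressed in the coordinate system the API expects.

```python
import json


def extraction_area(left, top, right, bottom):
    """Convert two corner points into [x, y, width, height]."""
    if right <= left or bottom <= top:
        raise ValueError("right/bottom must be greater than left/top")
    return [left, top, right - left, bottom - top]


area = extraction_area(171.0, 69.0, 420.75, 140.25)
profiles = json.dumps({"ExtractionArea": area})
```

With these inputs, `area` is `[171.0, 69.0, 249.75, 71.25]`, the same rectangle used in the example above.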
Extracting Invisible Text#
When dealing with PDF documents, sometimes there may be unwanted invisible text that makes it difficult to extract the desired content accurately. This could be due to various reasons such as the original document being scanned or saved with a low-quality setting. In such cases, it is important to remove the unwanted invisible text to ensure accurate extraction of the desired content.
{
"profiles": "{ 'ExtractInvisibleText': false, 'ExtractShadowLikeText': false, 'OCRMode': 'Auto' }"
}
OCR (Optical Character Recognition) Mode Options#
The following values can be configured for OCR mode:
OCR Mode | Description
---|---
Auto | Automatically determines the optimal OCR settings based on the input.
 | Automatically repairs fonts in text extracted from images or other documents.
 | Extracts text from images and fonts from documents.
 | Extracts text from images and repaired fonts from documents.
 | Extracts text, vectors, and fonts from images and documents.
TextFromImagesAndVectorsAndRepairedFonts | Extracts text, vectors, and repaired fonts from images and documents.
 | Extracts text and vectors from images only.
 | Extracts text from images only.
 | Extracts text from documents with repaired fonts only.
 | Extracts text and fonts from documents with vectors.
 | Extracts text and repaired fonts from documents with vectors.
 | Extracts text from documents with vectors only.
{
"profiles": "{ 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts' }"
}
OCR (Optical Character Recognition) Resolution#
OCR resolution can be set from 72 to 1200 DPI. The default value is 300 DPI. Higher resolutions generally give better OCR results, but they also mean longer processing times.
{
"profiles": "{ 'OCRResolution': 300 }"
}
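Since the accepted range is fixed, it can be checked on the client before a request is made. The following Python sketch does that; `ocr_resolution_profile` is a hypothetical helper, and how the service itself reacts to out-of-range values is not covered here.

```python
import json

MIN_DPI, MAX_DPI = 72, 1200  # range stated in this section


def ocr_resolution_profile(dpi=300):
    """Return a profiles string after checking the DPI range locally."""
    if not MIN_DPI <= dpi <= MAX_DPI:
        raise ValueError(f"OCRResolution must be between {MIN_DPI} and {MAX_DPI}")
    return json.dumps({"OCRResolution": dpi})


profiles = ocr_resolution_profile(600)
```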
Extracting Text from Colored Background#
If you can’t extract text from a colored background, please add the Grayscale filter to the profiles as follows:
{
"profiles": "{ 'OCRImagePreprocessingFilters.AddGrayscale()': [] }"
}
Considering the Font Color on Tables#
Sometimes the data which OCR must extract from a table might have colored text which is difficult to extract. OCR results can be improved with the following:
{
"profiles": "{
'LineGroupingMode': 'JoinOrphanedRows',
'ConsiderFontColors': true,
'DetectNewColumnBySpacesRatio': '1.1',
'AutoAlignColumnsToHeader': false,
'OCRImagePreprocessingFilters.AddGammaCorrection()': [ '1.4' ]
}"
}
Setting the Rotation Angle#
Normally OCR detects PDF rotation and extracts text properly. In some cases, however, a PDF is constructed so that the page itself is not rotated and the text is instead drawn vertically, and OCR does not detect the rotation automatically. In such scenarios, use the following profile setting.
{
"profiles": "{ 'RotationAngle': 2 }"
}
RotationAngle values:
- 0: no rotation
- 1: 90 degrees
- 2: 180 degrees
- 3: 270 degrees
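The RotationAngle value is a code rather than a number of degrees. A small Python sketch of the conversion is shown below; `rotation_angle_code` is a hypothetical helper, and it makes no assumption about the direction of rotation.

```python
def rotation_angle_code(degrees):
    """Map a rotation in degrees (a multiple of 90) to the 0-3 code."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # 0 -> 0, 90 -> 1, 180 -> 2, 270 -> 3; larger angles wrap around.
    return (degrees // 90) % 4


code = rotation_angle_code(180)
```

For 180 degrees this yields 2, the value used in the example above.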
Profile Options by Endpoint#
Explore various profiles options by API endpoint below.
PDF Add#
Crop a PDF File#
Crop a PDF file using an array to define the crop area. The crop box is defined by a rectangle [x, y, width, height] in PDF points (1 Point = 1/72 inches).
Note
An A4 page size in points is 595 x 842
{
"profiles": "{ 'Pages[0].SetCropBox()': ['28', '28', '539', '786'] }"
}
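The numbers in this example correspond to an A4 page with a uniform 28-point margin removed on every side. A short Python sketch of that arithmetic follows; `crop_box_from_margin` is a hypothetical helper, and the values are passed as strings only because the example above does so.

```python
import json


def crop_box_from_margin(page_width, page_height, margin):
    """Return [x, y, width, height] for a page trimmed by a uniform margin."""
    if margin < 0 or 2 * margin >= min(page_width, page_height):
        raise ValueError("margin does not fit the page")
    return [margin, margin, page_width - 2 * margin, page_height - 2 * margin]


box = crop_box_from_margin(595, 842, 28)  # A4 size in points
profiles = json.dumps({"Pages[0].SetCropBox()": [str(v) for v in box]})
```

With these inputs, `box` is `[28, 28, 539, 786]`.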
Disable Ligaturization#
To disable ligaturization, for example for Hebrew, use the following:
{
"profiles": "{ 'DisableLigatures': true }"
}
Flatten Document#
Flattening a document renders it as read-only. Handy if you want to remove editing or copying capability.
{
"profiles": "{ 'FlattenDocument()': [] }"
}
Search and Replace Text#
Adjust Text Alignment#
Users may have encountered an issue when using the /pdf/edit/replace-text API endpoint to replace text in a PDF document. The replaced text might appear slightly higher than the original text or the surrounding text, causing alignment issues.
To fix this issue, we have added a new parameter called YAdjustmentForReplacementText to the profiles parameter of the API request. This parameter allows you to adjust the vertical position of the replaced text, ensuring proper alignment with the rest of the document. Negative values move the text up; positive values move it down.
Here’s an example of how to use the YAdjustmentForReplacementText parameter. In this example API request, it is set to -1, which moves the replaced text 1 unit up vertically, resulting in better alignment with the original text.
{
"profiles": "{'YAdjustmentForReplacementText': '-1'}"
}
Search and Replace Text with Image#
Crop Empty Space Around Images#
If you need to crop empty space around an inserted image, use the following:
{
"profiles": "{'AutoCropImages': true}"
}
Search and Delete Text#
Showing Redacted Text#
By default, deleting text with /pdf/edit/delete-text simply removes the text, leaving a space where it was.
If you need to black out the deleted text, this can be achieved using the following profiles parameters:
- Set the UsePatch parameter to true.
- Set the PatchColor parameter to the color to use for redacting, in hex format. For example: 'PatchColor': '#000000'.
If you only want to black out the text without removing it, so that it can still be copied, set the RemoveTextUnderPatch parameter to false.
Important
If RemoveTextUnderPatch is set to false then a user could still copy the text, making the redaction less secure than you might require!
{
"profiles": "{'UsePatch': true, 'PatchColor': '#000000', 'RemoveTextUnderPatch': true}"
}
PDF Optimize#
Optimization options#
Set the options for your optimization via the following profiles parameters:
- ImageOptimizationFormat (optional) controls the image compression format. Available values:
  - JPEG (default): JPEG based compression.
  - Flate: zip-like compression.
  - Fax: 1-bit black and white compression, provides the best file size.
- JPEGQuality (optional) controls JPEG compression quality from 1 (worst quality, smallest size) to 100 (best quality, largest size). Set to 25 by default.
- ResampleImages (optional) tells the compressor to resample images to a new resolution. true by default.
- ResamplingResolution (optional) sets the target resolution for resampled images. 120 (dots per inch) by default.
- GrayscaleImages (optional) turns all images into grayscale. This does not affect the compression, but is useful if you need all images inside the document to be grayscale. false by default.
{
"profiles": "{ 'ImageOptimizationFormat': 'JPEG', 'JPEGQuality': 25, 'ResampleImages': true, 'ResamplingResolution': 120, 'GrayscaleImages': false }"
}
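The options and defaults listed above can be gathered into one helper that also checks the documented value ranges before a request is made. This is a Python sketch under those stated defaults; `optimization_profile` is a hypothetical name, and the checks are client-side conveniences, not a description of server behaviour.

```python
import json

ALLOWED_FORMATS = ("JPEG", "Flate", "Fax")  # values listed in this section


def optimization_profile(image_format="JPEG", jpeg_quality=25,
                         resample=True, resolution=120, grayscale=False):
    """Return a profiles string using the documented defaults."""
    if image_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {image_format!r}")
    if not 1 <= jpeg_quality <= 100:
        raise ValueError("JPEGQuality must be between 1 and 100")
    return json.dumps({
        "ImageOptimizationFormat": image_format,
        "JPEGQuality": jpeg_quality,
        "ResampleImages": resample,
        "ResamplingResolution": resolution,
        "GrayscaleImages": grayscale,
    })


default_profile = optimization_profile()
fax_profile = optimization_profile(image_format="Fax", resample=False)
```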
PDF Convert to CSV & PDF Convert to XLS#
Column Detection Mode#
Sometimes a document contains a number of overlapping invisible text and vector objects that affect column detection. In this case you may need to fix the wrongly positioned data.
Set the options for your column detection via the following profiles parameters:
- ColumnDetectionMode - available values:
  - ContentGroups
  - Borders
  - BorderedTables
  - ContentGroupsAI
{
"profiles": "{ 'ColumnDetectionMode': 'ContentGroups' }"
}
PDF Merge#
Rename Matching Fields#
This feature enables the renaming of field names during the merging of PDF files which contain forms. If set to false, the original field names are retained. This is helpful for merged PDF forms with identical field names when you want to auto-fill the identical fields on other pages.
{
"profiles": "{ 'RenameMatchingFieldsDuringMerge': false }"
}
Generate Bookmarks#
This adds bookmarks to the merged document with names assigned to every merged document in the same order:
{
"profiles": "{'GenerateBookmarks': true, 'BookmarkTitles': [ 'BookmarkName1', 'BookmarkName2', 'BookmarkName3' ] }"
}
Include / Exclude from ZIPS#
You can control which files to include and exclude from input zip files with the profiles parameter.
// include PDF, XLS and XLSX files
{
"profiles": "{ 'zipIncludeFilter': '*.pdf,*.xls*' }"
}
// exclude DOC, DOCX, XLS and XLSX files
{
"profiles": "{ 'zipExcludeFilter': '*.doc*,*.xls*' }"
}
Note
zipIncludeFilter and zipExcludeFilter support * and ? wildcards.
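To get a feel for which file names a filter would select, the patterns can be tried locally. The Python sketch below uses the standard `fnmatch` module as a stand-in. It assumes the filter is a comma-separated list of patterns, as in the examples above; the service's exact matching rules (for instance case sensitivity) are not stated here and may differ. `matches_filter` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matches_filter(filename, filter_value):
    """Return True if filename matches any comma-separated pattern."""
    patterns = [p.strip() for p in filter_value.split(",") if p.strip()]
    # fnmatchcase treats * as any run of characters and ? as one character.
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


included = matches_filter("report.xlsx", "*.pdf,*.xls*")
excluded = matches_filter("notes.docx", "*.pdf,*.xls*")
```

Note that `fnmatch` also understands bracket expressions such as `[abc]`, which this page does not mention, so avoid them when testing filters this way.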
Change Document Title#
You can change the document title during a merge with the following:
{
"profiles": "{ 'MergedDocumentTitle': 'New Title' }"
}
PDF Find & PDF Find Table#
Find only bordered tables#
You can limit the search to bordered tables only by enabling the legacy table search mode with the following profiles config:
{
"profiles": "{ 'Mode': 'Legacy',
'ColumnDetectionMode': 'BorderedTables',
'DetectionMinNumberOfRows': 1,
'DetectionMinNumberOfColumns': 1,
'DetectionMaxNumberOfInvalidSubsequentRowsAllowed': 0,
'DetectionMinNumberOfLineBreaksBetweenTables': 0,
'EnhanceTableBorders': false
}"
}
PDF Find#
Support page rotation#
This endpoint supports PDF page rotation as follows:
{
"profiles": "{ 'OCRDetectPageRotation': true }"
}
PDF to HTML#
Disable Images#
To turn off images output set the following profile:
{
"profiles": "{ 'saveImages': 0 }"
}
Control Image Quality#
Some PDFs contain high quality images, and you may need to keep the quality of these images in the output HTML. By default, PDF to HTML optimizes images; you can turn this off with the following profile:
{
"profiles": "{ 'OptimizeImages': false }"
}
Control Output Page Width#
Control page width output as follows:
{
"profiles": "{ 'OutputPageWidth': 2048 }"
}
Inject CSS#
To inject CSS for layout options in your HTML use the following:
{
"profiles": "{ 'AdditionalCssStyles': '#canvas { zoom: 50%; }' }"
}
PDF to Image#
Disable Text Layer#
We can turn off the text layer for our render as follows:
{
"profiles": "{ 'RenderTextObjects': false }"
}
Set Image Resolution#
By default the screen resolution is 120 DPI. To change the rendering resolution, please use:
{
"profiles": "{ 'RenderingResolution': 300 }"
}
Options for TIFF#
TIFF has a variety of options as follows:
{
"profiles": "{
'TextSmoothingMode': 'HighQuality', // Valid values: 'HighSpeed', 'HighQuality'
'VectorSmoothingMode': 'HighQuality', // Valid values: 'HighSpeed', 'HighQuality'
'ImageInterpolationMode': 'HighQuality', // Valid values: 'HighSpeed', 'HighQuality'
'RenderTextObjects': true, // Valid values: true, false
'RenderVectorObjects': true, // Valid values: true, false
'RenderImageObjects': true, // Valid values: true, false
'RenderCurveVectorObjects': true, // Valid values: true, false
'JPEGQuality': 85, // from 0 (lowest) to 100 (highest)
'TIFFCompression': 'LZW', // Valid values: 'None', 'LZW', 'CCITT3', 'CCITT4', 'RLE'
'RotateFlipType': 'RotateNoneFlipNone', // See note
'ImageBitsPerPixel': 'BPP24', // Valid values: 'BPP1', 'BPP8', 'BPP24', 'BPP32'
'OneBitConversionAlgorithm': 'OtsuThreshold', // Valid values: 'OtsuThreshold', 'BayerOrderedDithering'
'FontHintingMode': 'Default', // Valid values: 'Default', 'Stronger'
'NightMode': false // Valid values: true, false
}"
}
Note
RotateFlipType values can be found here: https://docs.microsoft.com/en-us/dotnet/api/system.drawing.rotatefliptype?view=netframework-2.0
Options for WEBP#
To control the quality and encoding speed use the following:
{
"profiles": "{ 'WEBPQuality': 75, 'WEBPEncodingSpeed': 4 }"
}