General#

Artifex Software and ByteScout/PDF.co Merger FAQ#

We understand that you may have some questions following the recent announcement about the merger of ByteScout PDF.co and Artifex Software. This FAQ is designed to address your concerns and provide information about the transition.

General Information

Q1: Who is Artifex Software?

Artifex Software, Inc. is a long-established provider of PDF solutions. A subsidiary of ePapyrus, Inc., it has delivered robust PDF technologies, including the open-source Ghostscript and MuPDF, for over 30 years. Its flagship product, Ghostscript, was the first non-Adobe PDF solution and ships with almost all Linux distributions. Artifex Software has served a range of notable companies, including Google, Oracle, HP, Kyocera, and Intuit.

Q2: What is the aim behind the ByteScout and Artifex Software merger?

This merger combines the strengths of both companies: Artifex Software’s in-depth expertise in PDF solutions, and ByteScout’s innovative technologies. Our joint aim is to provide even more comprehensive and enhanced solutions to our clients.

Account and Services

Q3: Will my ByteScout.com and PDF.co services be affected?

Rest assured, your services will continue as they are now. We aim to ensure a seamless transition where the tools, resources, and services you use daily continue without interruption.

Q4: Will the brand names ByteScout.com and PDF.co change?

No. We recognize the trust and value you have in the ByteScout.com and PDF.co brands, and we will continue to maintain these names.

Q5: Do I need to take any action due to this merger?

No action is required on your part. Your services will continue as before.

Q6: Will this merger impact my subscription?

No, your subscription and its terms will remain the same.

Q7: Will I need to make any changes to my current setup because of this merger?

No changes are necessary on your part. Everything will continue to function as it currently does.

Billing

Q8: Will there be any changes to my billing?

While your services will remain unchanged, your invoices will now be issued under the name of Artifex Software Inc.

Support

Q9: Who should I contact if I have further questions or concerns?

If you have any more questions or concerns, don’t hesitate to contact our customer support team at support@bytescout.com.

We appreciate your understanding and continued support during this exciting transition, and look forward to serving you under this new and promising partnership.

Fonts available for PDF Filling and Adding Text to PDF with pdf/edit/add#

PDF.co Font List

Arial
Arial Black
Bahnschrift
Calibri
Cambria
Cambria Math
Candara
Comic Sans MS
Consolas
Constantia
Corbel
Courier New
Ebrima
Franklin Gothic Medium
Gabriola
Gadugi
Georgia
HoloLens MDL2 Assets
Impact
Ink Free
Javanese Text
Leelawadee UI
Lucida Console
Lucida Sans Unicode
Malgun Gothic
Marlett
Microsoft Himalaya
Microsoft JhengHei
Microsoft New Tai Lue
Microsoft PhagsPa
Microsoft Sans Serif
Microsoft Tai Le
Microsoft YaHei
Microsoft Yi Baiti
MingLiU-ExtB
Mongolian Baiti
MS Gothic
MV Boli
Myanmar Text
Nirmala UI
Palatino Linotype
Segoe MDL2 Assets
Segoe Print
Segoe Script
Segoe UI
Segoe UI Historic
Segoe UI Emoji
Segoe UI Symbol
SimSun
Sitka
Sylfaen
Symbol
Tahoma
Times New Roman
Trebuchet MS
Verdana
Webdings
Wingdings
Yu Gothic

Japanese Fonts

MS Gothic
MS Mincho
Yu Gothic

Chinese Fonts

SimSun
MingLiU
Microsoft YaHei

Korean Fonts

Malgun Gothic

Hebrew Fonts

Miriam

Arabic Fonts

Aldhabi
Andalus
Arabic Typesetting
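
The lists above can be used to validate a font name before a request is sent. The following is a minimal sketch, not the PDF.co client: the helper names are hypothetical, only a subset of the fonts is reproduced, and the annotation field names ("text", "x", "y", "fontName") are assumptions that should be checked against the pdf/edit/add reference.

```python
# Subset of the font names listed above (extend as needed).
SUPPORTED_FONTS = {
    "Arial", "Calibri", "Courier New", "Georgia", "Tahoma",
    "Times New Roman", "Verdana", "MS Gothic", "MS Mincho",
    "SimSun", "Malgun Gothic", "Miriam", "Arabic Typesetting",
}


def resolve_font(name, fallback="Arial"):
    """Return the listed spelling of a font, or the fallback if it is not listed."""
    lookup = {font.casefold(): font for font in SUPPORTED_FONTS}
    return lookup.get(name.strip().casefold(), fallback)


def build_text_annotation(text, x, y, font):
    """Build one text annotation; the field names are assumed, not confirmed."""
    return {"text": text, "x": x, "y": y, "fontName": resolve_font(font)}


if __name__ == "__main__":
    print(build_text_annotation("Hello", 10, 20, "times new roman"))
```

Falling back to a listed font keeps the request valid when a caller passes a name that is not installed on the cloud server.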

How to create and test configurations for PDF extraction and image-to-text functions locally#

If you are working with scanned PDFs and the extracted text (text, csv, json, xml) is incomplete or inaccurate, consider using our desktop app, ByteScout PDF Multitool (compatible with Windows 7/10/11 and higher). This app emulates most of the major functions of the PDF.co API and, more importantly, allows you to create and test configurations for PDF extraction and image-to-text functions locally.

ByteScout PDF Multitool includes the OCR Analyzer tool, which helps you quickly find the best combination of OCR filters and parameters to enhance the quality of PDF text extraction results.

PDF Multitool and its OCR Analyzer provide JSON code for profiles that can be used with both the PDF.co cloud and on-premises versions. Simply pass this JSON config as the value of the profiles parameter for the PDF To Text/CSV/XML/JSON API methods.

Step-by-step guide on how to start using the PDF Multitool free app:

  1. First, download the free version of PDF Multitool here.

  2. Next, load your PDF/JPG/PNG document into the multitool.

  3. Then, in the left navigation menu, select OCR Analyzer.

  4. Choose the OCR Language and OCR Resolution and click Go.

  5. Click the Copy To button and select Send to CSV (or the equivalent option) to copy this configuration into the appropriate extractor.

  6. This opens the PDF Extractor configuration for PDF to CSV/Text/XML/JSON, as appropriate.

  7. Try the new configuration by clicking Preview.

  8. If you’re satisfied with the outcome, go to the Profile for PDF.co and API Server tab.

  9. Click on Copy as payload for PDF.co or API Server.

  10. Finally, paste this as the value of the profiles parameter in your script/code or in the Zapier/Make plugin.

  11. If you are not satisfied with the results, adjust the parameters and filters on the All Options tab (see Tips and Tricks below).

For a demo on how to use this tool, watch this video: https://youtu.be/NSyyohNNe6E.
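
Once a profile has been copied from PDF Multitool, it travels with the API request as the profiles parameter. The sketch below shows one way this might look from a script. The endpoint URL, the x-api-key header, and the choice to serialize the profile as a JSON string are assumptions to verify against the PDF.co API reference, and the sample profile is a stand-in for whatever Copy as payload produced.

```python
import json
from urllib.request import Request, urlopen

# Assumed endpoint for PDF to Text; confirm in the API reference.
API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"


def build_request(api_key, file_url, profile):
    """Return (headers, body) for a call that carries an OCR profile."""
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    # The profile copied from PDF Multitool is passed as a JSON string.
    body = {"url": file_url, "profiles": json.dumps(profile)}
    return headers, body


def send(headers, body, api_url=API_URL):
    """POST the request and return the decoded JSON response (needs network)."""
    request = Request(api_url, data=json.dumps(body).encode("utf-8"),
                      headers=headers, method="POST")
    with urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    headers, body = build_request("YOUR_API_KEY",
                                  "https://example.com/scan.pdf",
                                  {"rotationAngle": 1})
    print(body)
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before spending API credits.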

Tips and Tricks On Finding Best OCR Settings Using PDF Multitool

  1. For fuzzy or blurred scans: increase OCR Resolution from the default 300 dpi (dots per inch) to 600, 800, or even 1200 dpi and try again. Note: higher resolution increases processing time.

  2. For dark scans: add the Gamma Correction filter with the default value of 1.4, or with 1.5, and try again. Note: this filter automatically lightens dark images.

  3. To capture text printed near borders or lines, add a filter that removes lines before extraction. For tables with borders or lines, if the layout is reproduced incorrectly or some words or letters are lost, add the Horizontal Line Removal and Vertical Line Removal filters in the All Options - OCRImagePreprocessingFilters section. Make sure to put these filters first in the list (use the Up and Down buttons to move filters within the list).

  4. For non-English documents, set the proper recognition language: set OCR Language to the language used in the document. The default is eng (English). For a document in German, set it to deu (German). If the document contains multiple languages, select each of them (for example, eng and deu).

  5. If you don’t need the whole page, limit extraction to a specific area of the page. This improves both the quality of text extraction and the processing speed. To set the extraction area, click the Select tool on the main toolbar in PDF Multitool and use your mouse to select the area containing the source text. Then run the extraction and preview again.

  6. If the extracted text is missing important snippets, set an extraction area that covers them. Limiting extraction to a specific area of the page may dramatically improve recognition quality.

  7. If extracting from the whole page produces broken results: run several extractions on the same page, each limited to a selected area (for example, the top area, then the middle area, then the bottom area), and combine the results into one file. This helps when the page mixes different layouts, fonts, or font sizes.

  8. Setting extraction area to exclude header and footer and/or side notes in the document may simplify text analysis greatly.

  9. Removing Background Noise: Lowering Gamma (with values below 1.4) and raising Contrast can effectively remove background noise from images.

  10. Extracting text from color photos or scans: applying the Grayscale filter before Gamma may yield a better gamma effect on color photos and improve extraction quality. Grayscale alone is generally less useful.

  11. Removing stray dots and artifacts that produce small garbled text snippets: combining the Median filter with high-resolution rendering (600+ DPI) can help remove stray dots from scanned images or fax rasterization artifacts. However, this approach may also remove punctuation symbols.

  12. Fixing Etched/Distorted Letters: The Dilate filter can be used to repair etched or distorted letters in images.
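
To make the effect of the gamma value in the tips above more concrete, here is a small sketch of the textbook gamma curve applied to 8-bit intensities. It illustrates the general technique only. The exact formula used by the Gamma Correction filter in PDF Multitool and PDF.co is not stated in this document, so treat the function as an assumption.

```python
def gamma_correct(value, gamma=1.4):
    """Map one 8-bit intensity (0-255) through a standard gamma curve.

    With gamma > 1 mid-tones become lighter; with gamma < 1 they become darker.
    """
    if not 0 <= value <= 255:
        raise ValueError("value must be in the range 0-255")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return round(255 * (value / 255) ** (1 / gamma))


def apply_gamma(pixels, gamma=1.4):
    """Apply the curve to a flat list of grayscale pixel values."""
    return [gamma_correct(p, gamma) for p in pixels]


if __name__ == "__main__":
    dark_scan_row = [10, 40, 64, 128, 200]
    print(apply_gamma(dark_scan_row, gamma=1.4))  # values shift toward white
    print(apply_gamma(dark_scan_row, gamma=0.8))  # values shift toward black
```

Black and white stay fixed while everything in between moves, which is why a gamma above 1 helps dark scans and a lower gamma combined with more contrast can push faint background noise out of the way.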

List of OCR Image Preprocessing Filters Supported By PDF Multitool and PDF.co API

  • Contrast - Adds the Contrast image filter, which enhances image quality for OCR by improving contrast. This filter is particularly helpful for images where the text color is gray or similar to the background color. Lowering gamma and raising contrast can effectively remove background noise from images.

  • Deskew - Applies the Deskew image filter with a default angle threshold of 0.4 degrees (minimal admissible skew angle). This filter is useful for fixing slight rotation of scanned images. For scans rotated by 90, 180, or 270 degrees, use the RotationAngle parameter in profiles instead, for example { 'rotationAngle': 1 }. The available RotationAngle values are:

    • 0 no rotation (default)

    • 1 90 degrees

    • 2 180 degrees

    • 3 270 degrees

  • Dilate - Incorporates the “Dilate” image filter, which improves image quality for OCR by thickening the letter strokes. The Dilate filter can be used to repair etched or distorted letters in images.

  • Fit - Adds the Fit image filter with a specified size limit. The image is proportionally resized when its width or height exceeds the limit, which improves text extraction performance from large images.

  • Gamma - Implements the Gamma Correction filter with a default value of 1.4. This filter enhances image quality for OCR by automatically lightening dark images.

  • Grayscale - Applies the “Grayscale” image filter. Applying the Grayscale filter before Gamma may yield better gamma effects on color photos, although Grayscale alone is less useful.

  • HorizontalLinesRemover - Integrates the “Horizontal Lines Remover” image filter. This filter enhances OCR text recognition quality inside borders and near borders by removing horizontal lines before text recognition. IMPORTANT: this filter is added by default in PDF.co cloud and on-prem. If you don’t need it, set profiles to { 'OCRImagePreprocessingFilters.Clear()': [] }.

  • VerticalLinesRemover - Implements the “Vertical Lines Remover” image filter. This filter enhances OCR text recognition quality inside borders and near borders by removing vertical lines before text recognition. IMPORTANT: this filter is added by default in PDF.co cloud and on-prem. If you don’t need it, set profiles to { 'OCRImagePreprocessingFilters.Clear()': [] }.

  • Invert - Adds the Invert (negative) image filter. Sometimes scanned documents are inverted (white text on a black background). This filter fixes the issue by inverting all colors before extracting text.

  • Median - Incorporates the “Median” image filter. Combining the Median filter with high-resolution rendering (600+ DPI) can help remove parasite dots from scanned images or fax rasterization artifacts. However, this approach may also remove punctuation symbols.

  • Scale - Adds the Scale image filter with a specified scale factor. For example, 2.0 doubles the size of the input image, improving the recognition quality of small letters.
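
The two profile fragments quoted in the list above, { 'rotationAngle': 1 } and { 'OCRImagePreprocessingFilters.Clear()': [] }, can be generated from a script. This is a minimal sketch with a hypothetical helper name. Whether both keys may be combined in a single profile is an assumption, so test the result in PDF Multitool first.

```python
import json

# Rotation in degrees mapped to the RotationAngle codes listed above.
ROTATION_CODES = {0: 0, 90: 1, 180: 2, 270: 3}


def build_profile(rotation_degrees=0, clear_default_filters=False):
    """Build a profile dict from the documented rotation codes and Clear() key."""
    if rotation_degrees not in ROTATION_CODES:
        raise ValueError("rotation must be 0, 90, 180, or 270 degrees")
    profile = {}
    if rotation_degrees:
        profile["rotationAngle"] = ROTATION_CODES[rotation_degrees]
    if clear_default_filters:
        # Removes the line-remover filters that are added by default.
        profile["OCRImagePreprocessingFilters.Clear()"] = []
    return profile


if __name__ == "__main__":
    print(json.dumps(build_profile(rotation_degrees=90, clear_default_filters=True)))
```

Mapping degrees to codes in one place avoids passing 90 where the profile expects 1.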

ByteScout PDF Multitool - more information at https://bytescout.com/products/pdfmultitool/index.html.

How to use custom fonts?#

Due to possible security and licensing issues, we cannot add third-party fonts to our server. However, we offer a PDF.co Self-Hosted server that allows you to install custom fonts. The PDF.co Self-Hosted server is on-premises and must be hosted in your infrastructure.

Here’s a comparison of our PDF.co Cloud and PDF.co Self-Hosted options: https://pdf.co/pricing/on-demand-cloud-vs-dedicated-vs-on-prem.

Please let us know if you’re interested in the PDF.co Self-Hosted server.

Another way to use custom fonts is through the HTML to PDF API. There are two ways that you can use custom fonts in your HTML template.

You can read about HTML Template to PDF here: https://pdf.co/html-template-to-pdf.
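
This page does not spell out the two ways of using custom fonts in an HTML template. As an illustration only, one general HTML technique is to embed the font in the template as a base64 data URI inside an @font-face rule. The sketch below builds such a template. The helper names are hypothetical, and whether the PDF.co renderer honors embedded fonts is an assumption to confirm with the HTML to PDF documentation.

```python
import base64


def font_face_css(family, font_bytes, mime="font/ttf"):
    """Return an @font-face rule that embeds the font as a data URI."""
    encoded = base64.b64encode(font_bytes).decode("ascii")
    return (
        "@font-face {\n"
        f"  font-family: '{family}';\n"
        f"  src: url(data:{mime};base64,{encoded}) format('truetype');\n"
        "}\n"
    )


def build_template(family, font_bytes, body_html):
    """Wrap body HTML in a page that uses the embedded font for all text."""
    css = font_face_css(family, font_bytes) + f"body {{ font-family: '{family}'; }}\n"
    return f"<html><head><style>\n{css}</style></head><body>{body_html}</body></html>"


if __name__ == "__main__":
    # In practice font_bytes would be read from a .ttf file you are licensed to use.
    sample = build_template("MyFont", b"\x00\x01", "<p>Invoice</p>")
    print(sample)
```

Embedding keeps the template self-contained, at the cost of a larger HTML payload for every request.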

IP addresses used by PDF.co Cloud#

The PDF.co Cloud is hosted on Amazon AWS infrastructure. For information about the IP addresses and IP address ranges used by AWS, you can refer to this link: https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html.

Our servers currently run in the us-west-2 (Oregon) region. You can find details about AWS Regions and IP ranges at this link: https://docs.aws.amazon.com/quicksight/latest/user/regions.html.
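
For firewall allow-lists, AWS publishes its current address ranges as a JSON file at https://ip-ranges.amazonaws.com/ip-ranges.json. The sketch below filters that file by region. Which AWS services PDF.co traffic actually originates from is not stated here, so the optional service filter is an assumption, and a region-wide list is much broader than PDF.co alone.

```python
import ipaddress
import json
from urllib.request import urlopen

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"


def fetch_ranges(url=AWS_RANGES_URL):
    """Download and parse the published AWS ranges file (needs network)."""
    with urlopen(url) as response:
        return json.load(response)


def prefixes_for_region(ranges, region="us-west-2", service=None):
    """Return IPv4 networks for a region, optionally limited to one service."""
    networks = []
    for entry in ranges.get("prefixes", []):
        if entry.get("region") != region:
            continue
        if service and entry.get("service") != service:
            continue
        networks.append(ipaddress.ip_network(entry["ip_prefix"]))
    return networks


if __name__ == "__main__":
    for network in prefixes_for_region(fetch_ranges(), service="EC2"):
        print(network)
```

Because AWS updates the file over time, an allow-list built this way needs to be refreshed periodically.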

Where can I find the PDF.co output in Zapier?#

The PDF.co output is temporary and expires after an hour by default. The expiration can be extended in the Business plan.

We recommend adding a step to your Zap that saves the PDF output to permanent cloud storage such as Google Drive, Dropbox, or similar.

Here’s a step-by-step guide on how to set it up. It starts at Step 6: https://pdf.co/make-pdf-searchable-and-upload-in-google-drive#6.

If you’d like to review the generated output, please check out Step 5 here: https://pdf.co/make-pdf-searchable-and-upload-in-google-drive#5.

Who can access the pdf-temp-files, and how long are files stored?#

The pdf-temp-files storage is a private Amazon S3 bucket that uses strong industry-standard encryption at rest. Uploaded and output files are temporarily stored in this bucket under highly randomized names generated using a secure random generator. Each file is set to expire in 60 minutes by default and is automatically and permanently deleted from the bucket upon expiration. Depending on your subscription plan, you may set the expiration timeout to anywhere from 5 to 1440 minutes (24 hours) using the expiration parameter. You may also remove a file directly at any time using the file/delete endpoint.
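
The expiration parameter and the file/delete endpoint described above can be prepared from a script as follows. This is a sketch under stated assumptions: the base URL and the "url" field of the delete request are not given on this page and should be checked against the API reference, and the helper names are hypothetical.

```python
MIN_EXPIRATION_MINUTES = 5
MAX_EXPIRATION_MINUTES = 1440  # 24 hours
DEFAULT_EXPIRATION_MINUTES = 60


def with_expiration(payload, minutes=DEFAULT_EXPIRATION_MINUTES):
    """Return a copy of the payload with a validated expiration value."""
    if not MIN_EXPIRATION_MINUTES <= minutes <= MAX_EXPIRATION_MINUTES:
        raise ValueError("expiration must be between 5 and 1440 minutes")
    return {**payload, "expiration": minutes}


def delete_request(file_url, base_url="https://api.pdf.co/v1"):
    """Return (endpoint, body) for removing a temporary file early."""
    # Endpoint path comes from the text above; base URL and body field are assumed.
    return f"{base_url}/file/delete", {"url": file_url}


if __name__ == "__main__":
    print(with_expiration({"url": "https://example.com/in.pdf"}, minutes=120))
    print(delete_request("https://example.com/out.pdf"))
```

Validating the range locally gives a clear error before the request is sent, although the limit that actually applies still depends on the subscription plan.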

Since the pdf-temp-files storage is a private bucket, files are accessed via a special “signed” link using the Amazon AWS powered signed links mechanism. This mechanism provides an additional layer of security when accessing the file.

The pdf-temp-files bucket is not included in any backups. Only our engineers have temporary access to this bucket, and 2FA is enforced and required for access. Each access session to the storage is automatically logged, and information about the files’ relation to a specific user is stored separately in a different database.

For additional encryption of the file content, you may utilize user-controlled encryption. This feature provides a way to encrypt output file content with your own encryption option using industry-standard AES encryption, which is supported by all platforms, including Salesforce and others.