Document Classifier#

Use the Document Classifier endpoint (described below) to automatically detect the class of a document using keyword-based rules. For example, you can define rules that identify which vendor issued a document and then choose the template to apply accordingly.

Auto Classification of Incoming Documents#

Available Methods#

/pdf/classifier

/pdf/classifier#

Document Classifier automatically determines the class of an input PDF, JPG, or PNG document by analyzing its content with the built-in AI or with custom classification rules.

The best way to develop, test, and maintain classification rules is the Classifier Tester Tool in the PDF.co Document Classifier UI. Use it to quickly edit and test rules against single PDFs and whole folders.

Method: POST
Endpoint: /v1/pdf/classifier

Attributes#

Note

Attributes are case-sensitive and must be sent inside the JSON body of the POST request, for example:

{
    "url": "https://example.com/file1.pdf"
}

Attribute	Description	Required
`url`	URL to the source file. 1	yes
`httpusername`	HTTP auth user name if required to access source `url`.	no
`httppassword`	HTTP auth password if required to access source `url`.	no
`rulescsv`	Custom classification rules in CSV format. Each row contains the `class name`, the `logic` (`AND` or `OR`; `OR` is the default), and one or more keywords, all separated by commas. Rows are separated by the `\n` symbol. Keywords can be regular expressions written as `/keyword or regexp/i`, where `i` is the case-insensitive flag. Because the rules travel inside JSON, every `\` must be escaped with another `\`, so `\d` becomes `\\d` and so on. Example 1 (for more examples, see the Document Classifier Usage Guide): `Amazon AWS, OR, Amazon Web Services Invoice, Amazon CloudFront\nDigital Ocean, OR,DigitalOcean, DOInvoice\nACME,OR, ACME Inc.,1540 Long Street` Example 2, with regular expressions: `Medical Report,AND,/Instructing Party|Medical Report|Date Of Injury|Med Agency Ref/i\r\nInjured Claimant,OR, Injured Claimant, Injured Patient ID`	no
`rulescsvurl`	Instead of inline CSV you can use this parameter and set the URL to a CSV file with classification rules. This is useful if you have a separate developer working on CSV rules.	no
`caseSensitive`	Defines whether keywords in rules are case-sensitive (default `true`).	no
`inline`	Set to `true` to return results inside the response. Otherwise, the endpoint will return a link to the output file generated. Note: only applies if `async` mode is `true`.	no
`password`	Password of the PDF file. The value must be a string.	no
`async`	Set `async` to `true` to run long processes in the background. The API then returns a `jobId`, which you can use with the Background Job Check endpoint to check the status of the process and retrieve the output while you proceed with other tasks.	no
`name`	File name for the generated output. The value must be a string.	no
`expiration`	Expiration time for the output link, in minutes (default is `60`, i.e. 1 hour). After this duration, any generated output files are automatically deleted from PDF.co Temporary Files Storage. The maximum duration for link expiration depends on your current subscription plan. To store input files permanently (e.g. reusable images, PDF templates, documents), consider using PDF.co Built-In Files Storage.	no
`profiles`	Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more.	no
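
The row format described for `rulescsv` can be assembled programmatically instead of by hand. The following is a minimal sketch, not an official helper: `buildRulesCsv`, the rule object shape, and the `Invoice Number` rule are hypothetical names chosen for illustration. It assumes keywords contain no commas, since the format uses commas as separators.

```javascript
// Hypothetical helper: turns structured rules into the CSV string for `rulescsv`.
// Each rule becomes one row: class name, logic (AND/OR, default OR), then keywords.
function buildRulesCsv(rules) {
  return rules
    .map(({ className, logic = "OR", keywords }) =>
      [className, logic, ...keywords].join(","))
    .join("\n"); // rows are separated by a newline
}

// Illustrative rules; the second one uses a regular expression keyword.
const rules = [
  { className: "Amazon AWS", keywords: ["Amazon Web Services Invoice", "Amazon CloudFront"] },
  { className: "Invoice Number", logic: "AND", keywords: ["/INV-\\d+/i"] },
];

const payload = {
  url: "https://example.com/file1.pdf",
  rulescsv: buildRulesCsv(rules),
};

// JSON.stringify escapes backslashes and newlines, so the regex \d is sent as \\d.
const requestBody = JSON.stringify(payload);
console.log(requestBody);
```

Serializing with `JSON.stringify` rather than concatenating strings means the backslash escaping required by the JSON format is applied automatically.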

Query parameters#

No query parameters accepted.

Payload 3 #

{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "async": false,
    "inline": "true",
    "password": "",
    "profiles": ""
}

Response 2 #

{
    "body": {
        "classes": [
            {
                "class": "invoice"
            },
            {
                "class": "finance"
            },
            {
                "class": "documents"
            }
        ]
    },
    "pageCount": 1,
    "error": false,
    "status": 200,
    "credits": 42,
    "duration": 353,
    "remainingCredits": 98019328
}
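
Reading the detected classes out of a response shaped like the sample above could look as follows. This is a sketch under the assumption that results are returned inline in `body.classes`; `extractClasses` is a hypothetical helper name, not part of any PDF.co SDK.

```javascript
// Hypothetical helper: returns the class names from a classifier response,
// or throws if the response reports an error.
function extractClasses(response) {
  if (response.error) {
    throw new Error(`Classifier request failed with status ${response.status}`);
  }
  // Fall back to an empty list if no classes were returned.
  const classes = (response.body && response.body.classes) || [];
  return classes.map((entry) => entry.class);
}

// Trimmed copy of the sample response shown above.
const sampleResponse = {
  body: {
    classes: [{ class: "invoice" }, { class: "finance" }, { class: "documents" }],
  },
  pageCount: 1,
  error: false,
  status: 200,
};

console.log(extractClasses(sampleResponse));
```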

CURL#

curl --location --request POST 'https://api.pdf.co/v1/pdf/classifier' \
--header 'Content-Type: application/json' \
--header 'x-api-key: *******************' \
--data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "async": false,
    "inline": "true",
    "password": "",
    "profiles": ""
} '

Code samples#

JavaScript / Node.js

var request = require('request');
var options = {
  'method': 'POST',
  'url': 'https://api.pdf.co/v1/pdf/classifier',
  'headers': {
    'Content-Type': 'application/json',
    'x-api-key': 'YOUR_PDFCO_API_KEY'
  },
  body: JSON.stringify({
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
    "async": false,
    "encrypt": "false",
    "inline": "true",
    "password": "",
    "profiles": ""
  })

};
request(options, function (error, response) {
  if (error) throw new Error(error);
  console.log(response.body);
});

Java

import java.io.*;
import okhttp3.*;
public class main {
        public static void main(String []args) throws IOException{
                OkHttpClient client = new OkHttpClient().newBuilder()
                        .build();
                MediaType mediaType = MediaType.parse("application/json");
                RequestBody body = RequestBody.create(mediaType, "{\n    \"url\": \"https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf\",\n    \"async\": false,\n    \"encrypt\": \"false\",\n    \"inline\": \"true\",\n    \"password\": \"\",\n    \"profiles\": \"\"\n} ");
                Request request = new Request.Builder()
                        .url("https://api.pdf.co/v1/pdf/classifier")
                        .method("POST", body)
                        .addHeader("Content-Type", "application/json")
                        .addHeader("x-api-key", "YOUR_PDFCO_API_KEY")
                        .build();
                Response response = client.newCall(request).execute();
                System.out.println(response.body().string());
        }
}

C#

using System;
using RestSharp;
namespace HelloWorldApplication {
        class HelloWorld {
                static void Main(string[] args) {
                        var client = new RestClient("https://api.pdf.co/v1/pdf/classifier");
                        client.Timeout = -1;
                        var request = new RestRequest(Method.POST);
                        request.AddHeader("Content-Type", "application/json");
                        request.AddHeader("x-api-key", "YOUR_PDFCO_API_KEY");
                        var body = @"{" + "\n" +
                        @"    ""url"": ""https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf""," + "\n" +
                        @"    ""async"": false," + "\n" +
                        @"    ""encrypt"": ""false""," + "\n" +
                        @"    ""inline"": ""true""," + "\n" +
                        @"    ""password"": """"," + "\n" +
                        @"    ""profiles"": """"" + "\n" +
                        @"} ";
                        request.AddParameter("application/json", body,  ParameterType.RequestBody);
                        IRestResponse response = client.Execute(request);
                        Console.WriteLine(response.Content);
                }
        }
}

PHP

<?php

    $curl = curl_init();

    curl_setopt_array($curl, array(
            CURLOPT_URL => 'https://api.pdf.co/v1/pdf/classifier',
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_ENCODING => '',
            CURLOPT_MAXREDIRS => 10,
            CURLOPT_TIMEOUT => 0,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
            CURLOPT_CUSTOMREQUEST => 'POST',
            CURLOPT_POSTFIELDS =>'{
        "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
        "async": false,
        "encrypt": "false",
        "inline": "true",
        "password": "",
        "profiles": ""
    } ',
            CURLOPT_HTTPHEADER => array(
                    'Content-Type: application/json',
                    'x-api-key: YOUR_PDFCO_API_KEY'
            ),
    ));

    $response = json_decode(curl_exec($curl));

    curl_close($curl);
    echo "<h2>Output:</h2><pre>", var_export($response, true), "</pre>";

?>

Footnotes

1

Supports publicly accessible links from any source, including Google Drive, Dropbox, and PDF.co Built-In Files Storage. To upload files via the API, see the File Upload section. Note: If you experience intermittent Access Denied or Too Many Requests errors, try adding the `cache:` prefix to the URL to enable built-in URL caching (e.g., cache:https://example.com/file1.pdf). For data security, you can encrypt output files and decrypt input files. Learn more about user-controlled data encryption.

2

The main response codes are as follows:

Code	Description
`200`	Success
`400`	Bad request. Typically happens because of bad input parameters, or because the input URLs can’t be reached, possibly due to access restrictions like needing a login or password.
`401`	Unauthorized
`402`	Not enough credits
`445`	Timeout error. To process large documents or files please use asynchronous mode (set the `async` parameter to `true`) and then check status using the /job/check endpoint. If a file contains many pages then specify a page range using the `pages` parameter. The number of pages of the document can be obtained using the /pdf/info endpoint.

Note

For more, see the complete list of available response codes.
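
One way to act on the response codes listed above is a small dispatch function. The sketch below is illustrative only: `nextStepForStatus`, `withAsyncRetry`, and the step labels are hypothetical names, and the suggested follow-up for each code is an interpretation of the table, not documented behavior.

```javascript
// Hypothetical mapping from the status codes listed above to a next step.
function nextStepForStatus(status) {
  switch (status) {
    case 200: return "use-result";
    case 400: return "fix-request";        // bad parameters or unreachable input URL
    case 401: return "check-credentials";  // unauthorized
    case 402: return "add-credits";        // not enough credits
    case 445: return "retry-async";        // timeout: resend with "async": true
    default:  return "see-full-code-list";
  }
}

// On a timeout, return a copy of the payload with async mode enabled;
// otherwise return the payload unchanged.
function withAsyncRetry(payload, status) {
  return nextStepForStatus(status) === "retry-async"
    ? { ...payload, async: true }
    : payload;
}

console.log(withAsyncRetry({ url: "https://example.com/file1.pdf", async: false }, 445));
```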

3

Request size: PDF.co API requests cannot exceed 4 megabytes. Make sure your requests stay within this limit.
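
A client can check the serialized size of a request before sending it. The sketch below assumes that "4 megabytes" means 4 × 1024 × 1024 bytes and that the limit applies to the JSON body; neither detail is stated here, and `requestSizeInBytes` and `fitsRequestLimit` are hypothetical helper names.

```javascript
// Assumption: the 4 megabyte limit is interpreted as 4 MiB of JSON body.
const MAX_REQUEST_BYTES = 4 * 1024 * 1024;

// Size of the payload once serialized to JSON and encoded as UTF-8.
function requestSizeInBytes(payload) {
  return Buffer.byteLength(JSON.stringify(payload), "utf8");
}

// True when the serialized payload is within the assumed limit.
function fitsRequestLimit(payload) {
  return requestSizeInBytes(payload) <= MAX_REQUEST_BYTES;
}

const smallPayload = { url: "https://example.com/file1.pdf" };
console.log(requestSizeInBytes(smallPayload), fitsRequestLimit(smallPayload));
```

Large inline rule sets are the most likely cause of an oversized request; moving them to a CSV file referenced through `rulescsvurl` keeps the request body small.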
