Document Classifier#
Use the Document Classifier endpoint (see below) to automatically detect the class of a document using keyword-based rules. For example, you can define rules that identify which vendor issued a document, and then apply the matching template.
Auto Classification of Incoming Documents#
Note
To quickly create and test classification rules, download and install ByteScout PDF Multitool. Run it and select PDF Classifier in the left sidebar. Test your rules there and export them as a JSON request for the PDF.co PDF Classifier.
Available Methods#
/pdf/classifier#
Document Classifier automatically determines the class of an input PDF, JPG, or PNG document by analyzing its content with either the built-in AI or custom classification rules.
The best way to develop, test, and maintain classification rules is the Classifier Tester Tool in the PDF.co Document Classifier UI. Use this tool to quickly edit and test rules against single PDFs or whole folders.
Method: POST
Endpoint: /v1/pdf/classifier
Attributes#
Note
Attributes are case-sensitive and must be passed inside the JSON body of the POST request, for example:
{
  "url": "https://example.com/file1.pdf"
}
| Attribute | Description | Required |
|---|---|---|
| url | URL to the source file. [1] | yes |
| | HTTP auth user name, if required to access the source. | no |
| | HTTP auth password, if required to access the source. | no |
| | Define custom classification rules in CSV format, where each row contains: | no |
| | Instead of inline CSV, you can use this parameter to set the URL of a CSV file with classification rules. This is useful if a separate developer maintains the CSV rules. | no |
| | Defines whether keywords in rules are case-sensitive. (default | no |
| | Set to | no |
| password | Password of the PDF file. Must be a string. | no |
| | Set | no |
| | File name for the generated output. Must be a string. | no |
| | Set the expiration time for the output link in minutes (default is | no |
| profiles | Use this parameter to set additional configurations for fine-tuning and extra options. Explore the Profiles section for more. | no |
Query parameters#
No query parameters accepted.
Payload#
{
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
  "async": false,
  "inline": "true",
  "password": "",
  "profiles": ""
}
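The payload can also be assembled in code rather than written by hand. The Python sketch below is a minimal illustration, not an official client: the helper name build_classifier_payload and its keyword arguments are hypothetical, the attribute names are taken from the sample payload above, and the optional cache: prefix follows the URL-caching note in footnote 1.

```python
import json


def build_classifier_payload(url, use_cache=False, run_async=False,
                             password="", profiles=""):
    """Return a dict mirroring the sample payload for /v1/pdf/classifier.

    Hypothetical helper for illustration; only attributes that appear in
    the sample payload are set here.
    """
    # Footnote 1: prefixing the URL with "cache:" enables built-in URL caching.
    if use_cache and not url.startswith("cache:"):
        url = "cache:" + url
    return {
        "url": url,            # attribute names are case-sensitive
        "async": run_async,    # JSON boolean, as in the sample payload
        "inline": "true",      # kept as a string, as in the sample payload
        "password": password,
        "profiles": profiles,
    }


payload = build_classifier_payload("https://example.com/file1.pdf", use_cache=True)
print(json.dumps(payload, indent=2))
```

Keeping the payload as a plain dict until the last moment makes it easy to adjust individual attributes before serializing it with json.dumps.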
Response [2]#
{
  "body": {
    "classes": [
      {
        "class": "invoice"
      },
      {
        "class": "finance"
      },
      {
        "class": "documents"
      }
    ]
  },
  "pageCount": 1,
  "error": false,
  "status": 200,
  "credits": 42,
  "duration": 353,
  "remainingCredits": 98019328
}
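When results are returned inline, as in the response above, the detected classes can be read from body.classes. The following is a minimal Python sketch, assuming the response has exactly the shape shown above. The function name extract_classes is hypothetical, and the message field read on failure is an assumption about error responses that this section does not show.

```python
import json


def extract_classes(response_text):
    """Return the class names from a classifier response (inline results)."""
    data = json.loads(response_text)
    if data.get("error"):
        # "message" is an assumed field name; fall back to the status code.
        detail = data.get("message", "status %s" % data.get("status"))
        raise RuntimeError("Classifier request failed: %s" % detail)
    body = data.get("body") or {}
    return [entry["class"] for entry in body.get("classes", [])]


sample = """
{
  "body": {"classes": [{"class": "invoice"}, {"class": "finance"}, {"class": "documents"}]},
  "pageCount": 1,
  "error": false,
  "status": 200
}
"""
print(extract_classes(sample))  # ['invoice', 'finance', 'documents']
```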
CURL#
curl --location --request POST 'https://api.pdf.co/v1/pdf/classifier' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: YOUR_PDFCO_API_KEY' \
  --data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "async": false,
    "inline": "true",
    "password": "",
    "profiles": ""
  }'
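The CURL call above can also be sketched in Python using only the standard library. This is an illustrative sketch rather than an official sample: the function names build_request and classify are hypothetical, YOUR_PDFCO_API_KEY is a placeholder, and the endpoint, headers, and payload are copied from the CURL example.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/classifier"


def build_request(api_key, payload):
    """Build (but do not send) the POST request for the classifier endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def classify(api_key, payload):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(api_key, payload)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Build the request only; call classify(...) with a real key to send it.
request = build_request("YOUR_PDFCO_API_KEY", {
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "async": False,
    "inline": "true",
    "password": "",
    "profiles": "",
})
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the first step testable without network access. Note that urllib.request.urlopen raises urllib.error.HTTPError for non-2xx HTTP responses, so callers of classify may want to catch it.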
Code samples#
// Node.js sample (uses the "request" package)
var request = require('request');

var options = {
  'method': 'POST',
  'url': 'https://api.pdf.co/v1/pdf/classifier',
  'headers': {
    'Content-Type': 'application/json',
    'x-api-key': 'YOUR_PDFCO_API_KEY'
  },
  body: JSON.stringify({
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
    "async": false,
    "encrypt": "false",
    "inline": "true",
    "password": "",
    "profiles": ""
  })
};

// Send the request and print the raw JSON response.
request(options, function (error, response) {
  if (error) throw new Error(error);
  console.log(response.body);
});
// Java sample (uses OkHttp)
import java.io.*;
import okhttp3.*;

public class Main {
    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient().newBuilder()
            .build();
        MediaType mediaType = MediaType.parse("application/json");
        // JSON payload for the classifier request.
        RequestBody body = RequestBody.create(mediaType, "{\n \"url\": \"https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf\",\n \"async\": false,\n \"encrypt\": \"false\",\n \"inline\": \"true\",\n \"password\": \"\",\n \"profiles\": \"\"\n} ");
        Request request = new Request.Builder()
            .url("https://api.pdf.co/v1/pdf/classifier")
            .method("POST", body)
            .addHeader("Content-Type", "application/json")
            .addHeader("x-api-key", "YOUR_PDFCO_API_KEY")
            .build();
        // Execute the call and print the raw JSON response.
        Response response = client.newCall(request).execute();
        System.out.println(response.body().string());
    }
}
// C# sample (uses the legacy RestSharp API: IRestResponse, Method.POST)
using System;
using RestSharp;

namespace HelloWorldApplication {
    class HelloWorld {
        static void Main(string[] args) {
            var client = new RestClient("https://api.pdf.co/v1/pdf/classifier");
            client.Timeout = -1;
            var request = new RestRequest(Method.POST);
            request.AddHeader("Content-Type", "application/json");
            request.AddHeader("x-api-key", "YOUR_PDFCO_API_KEY");
            // JSON payload for the classifier request.
            var body = @"{" + "\n" +
                @" ""url"": ""https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf""," + "\n" +
                @" ""async"": false," + "\n" +
                @" ""encrypt"": ""false""," + "\n" +
                @" ""inline"": ""true""," + "\n" +
                @" ""password"": """"," + "\n" +
                @" ""profiles"": """"" + "\n" +
                @"} ";
            request.AddParameter("application/json", body, ParameterType.RequestBody);
            // Execute the request and print the raw JSON response.
            IRestResponse response = client.Execute(request);
            Console.WriteLine(response.Content);
        }
    }
}
<?php
// PHP sample (uses the cURL extension)
$curl = curl_init();

curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api.pdf.co/v1/pdf/classifier',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_POSTFIELDS => '{
        "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
        "async": false,
        "encrypt": "false",
        "inline": "true",
        "password": "",
        "profiles": ""
    }',
    CURLOPT_HTTPHEADER => array(
        'Content-Type: application/json',
        'x-api-key: YOUR_PDFCO_API_KEY'
    ),
));

// Execute the request, decode the JSON response, and print it.
$response = json_decode(curl_exec($curl));
curl_close($curl);
echo "<h2>Output:</h2><pre>", var_export($response, true), "</pre>";
?>
Footnotes
- 1
Supports links from Google Drive, Dropbox, and PDF.co Built-In Files Storage. To upload files via the API, check out the File Upload section.
Note: If you experience intermittent Access Denied or Too Many Requests errors, try adding the cache: prefix to the URL to enable built-in URL caching (e.g. cache:https://example.com/file1.pdf).
For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.
- 2
Main response codes are as follows:
| Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad request. Typically happens because of bad input parameters, or because the input URLs can't be reached, possibly due to access restrictions like needing a login or password. |
| 401 | Unauthorized |
| 402 | Not enough credits |
| 445 | Timeout error. To process large documents or files, use asynchronous mode (set the async parameter to true) and then check the status using the /job/check endpoint. If a file contains many pages, specify a page range using the pages parameter. The number of pages in the document can be obtained using the /pdf/info endpoint. |
Note
For more, see the complete list of available response codes.
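The response codes above suggest a simple client-side policy: on a 445 timeout, resubmit the same job with async enabled and then check /job/check. The Python sketch below illustrates only that decision step under this reading. The function name plan_next_step and the returned action labels are hypothetical, and polling itself is left out because the /job/check request format is not described in this section.

```python
def plan_next_step(status, payload):
    """Map a response code from the table above to a follow-up action.

    Returns an (action, payload) pair; the action labels are illustrative.
    """
    if status == 200:
        return "done", payload
    if status == 445:
        # Timeout: retry in asynchronous mode, then check /job/check.
        return "retry_async", {**payload, "async": True}
    if status == 400:
        return "fix_request", payload      # bad parameters or unreachable URL
    if status == 401:
        return "check_api_key", payload    # unauthorized
    if status == 402:
        return "add_credits", payload      # not enough credits
    return "see_full_code_list", payload


action, retry = plan_next_step(445, {"url": "https://example.com/file1.pdf", "async": False})
print(action, retry["async"])  # retry_async True
```

Returning a new payload instead of mutating the original keeps the first request available for logging or comparison.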