How PDFs make implementing RPA applications easier

25 Aug 2020 By Sam Claeys

PDF standards

In order for RPA applications to work smoothly, processes must both have a standard structure and work with standardized files. This is the only way to maximize the number of files processed by a single automation. In many cases, this means that RPA can only work with data that it has itself generated. But why not use a standard here too? There are many good reasons to use the PDF format whenever such processes need to interact with third-party data. After all, PDF is the lowest common denominator for nearly all document types processed or received in offices. Office files, emails and even images can be easily converted to PDF, giving RPA applications a standardized starting point for processing. Moreover, with the many features it has accumulated over the years, PDF is the most powerful, versatile document format in the world. However, not every PDF meets the prerequisites for automatic processing equally well. What matters here is not so much how reliably the document displays as the actual data it contains.

Making PDFs RPA-ready

  • One of the ‘simplest’ stumbling blocks – in fact, a more or less unavoidable one for document-based RPA – is the password protection that is often applied to documents without thinking. Technical and legal restrictions prevent content from being extracted from password-protected files, which instead just have to be returned to sender.
  • PDF files need to meet a few requirements before they can be processed automatically. For example, after a document has been scanned, it is generally necessary to make its text fully searchable using OCR, assigning Unicode characters to the recognized content. Only then can RPA processes evaluate the actual text. Even ‘born digital’ PDFs may not necessarily have full Unicode support. This is where dedicated validation tools (and repair tools, where necessary) come into play. One very concrete example would be print files exported from an ERP system, used to consolidate outgoing invoices. A PDF software tool will search through the text, finding keywords or separators that it uses to split the single PDF into separate invoices. Naturally, this only works if the software can recognize the keywords – and without OCR, this is only possible if the text is already mapped to Unicode.
  • By integrating metadata into PDFs, RPA applications can be provided with pointers that show how to process a given file. For instance, it may make sense to extract information before converting the source file, and then add the extracted information to the PDF. Consider the following example: a retail company receives a set of product descriptions from a supplier in PDF format. They can add entries to the metadata, which they then use to classify the files. If their customers need information, these descriptions can then be added to individual product catalogs and used to generate a table of contents.
  • Ideally, the PDF files will be ‘tagged’. This means that not just the semantic component of the text is defined in Unicode, but also that headers, paragraphs, image descriptions and tables are described (‘tagged’) in a structured data format. These tags allow the RPA application to tell how to structure text content (particularly in multi-column layouts), extract headers and organize images using their descriptions. Since assigning tags to PDF documents after the fact is a very time-intensive process, AI is generally used for tasks like reading forms correctly at the field level. This makes it even more important to fully index the text of PDF files as described in the second bullet above.
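
The invoice-splitting example from the second bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text of each page is taken to have been extracted already (by whatever PDF tool is in use) and is passed in as a plain list of strings, and the function name split_by_keyword and the keyword "Invoice No." are hypothetical. No real PDF library is called here.

```python
from typing import List


def split_by_keyword(page_texts: List[str], keyword: str) -> List[List[int]]:
    """Group page indices into documents.

    A new document starts on every page whose text contains the keyword
    (case-insensitive). Pages before the first match form a leading group.
    """
    documents: List[List[int]] = []
    for index, text in enumerate(page_texts):
        starts_new = keyword.lower() in text.lower()
        # Open a new group on a keyword hit, or for the very first page.
        if starts_new or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


# Hypothetical page texts standing in for an ERP print file.
pages = [
    "Invoice No. 1001\nCustomer: A",
    "Continued: line items",
    "Invoice No. 1002\nCustomer: B",
]
groups = split_by_keyword(pages, "Invoice No.")
print(groups)
```

Note what happens when the pages carry no usable Unicode text: every extracted string is empty, the keyword never matches, and the whole file ends up in a single group. That is the failure mode the bullet describes, and the reason such files need OCR or repair before splitting.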

Conclusion

Businesses looking to maximize process automation can and should first establish a framework for seamlessly leveraging RPA-based applications. Part of this is about building a solid foundation, using data that is as homogeneous and standardized as possible. As the common denominator for Office files, high-quality PDFs are a good starting point for this foundation.
