Advanced document processing with AWS Textract

By abdulmumin yaqeen

on September 25, 2023

In today's digital world, businesses and organizations deal with an ever-increasing volume of documents, ranging from invoices and contracts to forms and reports. Extracting structured data from these documents manually can be time-consuming, error-prone, and inefficient.

This is where AWS Textract, a service provided by Amazon Web Services (AWS), comes into play. AWS Textract leverages machine learning to automatically recognize and extract text, forms, tables, and other valuable data from a variety of document types.

In this article, we will explore how to perform advanced document processing with AWS Textract and unlock the potential of automated data extraction. In specific, we will be analyzing documents to extract forms and form data.

Setting up Your Environment

Before diving in, you need to set up your AWS environment. This involves:

Creating an AWS Account and IAM Role: If you don't already have an AWS account, sign up for one. Next, create a lambda function and update the role with the necessary permissions to use Textract. The role should have policies like AmazonTextractFullAccess or custom policies with the required permissions.
Creating an S3 Bucket: You'll need to store the documents you want to process in an Amazon S3 bucket. Textract will analyze documents from this bucket.

Starting a Textract Job

Now that your environment is set up, it's time to start processing documents with Textract. Here are the steps involved:

Choose the Documents: Decide which documents you want to process and upload them to your S3 bucket.
Start a Textract Job: You can use either the AWS SDK or AWS CLI to start a Textract job. Specify the S3 bucket and the document(s) you want to process. Textract supports various document formats, including PDF, image files (JPEG, PNG), and more. In our case, we will be using the SDK in a Lambda function.

From this moment onward, we will be writing a lot of code, but most of it is reusable and it can be copied over to any Textract project.

import boto3
import sys
import re
import json
from collections import defaultdict

def load_image():
    s3_bucket = "textract-images-bucket-2"
    s3_key = "dummy_id.jpeg"
    # process using image bytes

    textract = boto3.client('textract', region_name='us-east-1')
    response = textract.analyze_document(
              Document={'S3Object': {'Bucket': s3_bucket, 'Name': s3_key}}
, FeatureTypes=['FORMS'])

    return response

So this is the first thing we are doing, loading our test image from an s3 bucket.

Everything here is very basic though, in the Textract API the analyse_document function, we pass two arguments:

Document={'S3Object': {'Bucket': s3_bucket, 'Name': s3_key}}, This is for the location of the document we will be processing
FeatureTypes=['FORMS'], This argument specifies the type of data we are looking for in this document. Which is forms.

If you were to call this function in your lambda_handler, you would be able to see the result, but in a super cluttered way.

The series of functions we will be writing next will be just to declutter the result. As said earlier, these functions will be reusable across all of your Textract projects. It will only be invalid if AWS Textract decides to change the syntax of the response you get.

def get_kv_map(file_name):

    response = load_image()
    # Get the text blocks
    blocks = response['Blocks']

    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:
        block_id = block['Id']
        block_map[block_id] = block
        if block['BlockType'] == "KEY_VALUE_SET":
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block

    return key_map, value_map, block_map

In simple terms, this function organizes the information extracted from the document using Textract into three separate collections. These collections are like folders:

key_map: It's a folder for blocks that contain keys or labels. For example, if the document has labels like "Name" or "Date," those blocks are placed here.
value_map: This folder holds blocks that contain values associated with the keys. If, for instance, you have corresponding data like "John Doe" or "2023-09-25," these blocks are stored here.
block_map: Think of this as a general folder where all the extracted blocks are stored. It includes both key and value blocks, as well as any other type of block found in the document.

Why is this helpful? Well, it makes it easier to work with the extracted data, especially when you're dealing with documents structured as key-value pairs. You can quickly find and use the keys and their associated values, thanks to this organization.

def get_kv_relationship(key_map, value_map, block_map):
    kvs = defaultdict(list)
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)
        kvs[key].append(val)
    return kvs

def find_value_block(key_block, value_map):
    for relationship in key_block['Relationships']:
        if relationship['Type'] == 'VALUE':
            for value_id in relationship['Ids']:
                value_block = value_map[value_id]
    return value_block


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '

    return text

Here's a step-by-step explanation:

Initialization: The function initializes an empty dictionary called kvs. This dictionary will be used to establish relationships between key-value pairs. It uses defaultdict(list) to ensure that each key in kvs maps to a list, allowing multiple values to be associated with a single key.
Looping Through Key Blocks: The function iterates through each key block in the key_map dictionary. Each key block represents a label or identifier.
Finding Corresponding Value Block: For each key block, it calls a function named find_value_block(key_block, value_map) to locate the corresponding value block in the value_map. This function is assumed to find and return the value block associated with the given key block.
Extracting Key and Value Text: After identifying the key and value blocks, the function extracts the text content from these blocks using a function called get_text(key_block, block_map) and get_text(value_block, block_map). The block_map is used to locate the blocks within all the blocks extracted from the document.
Creating Relationships: The extracted key and value texts are used to establish relationships within the kvs dictionary. Specifically, it appends the value (val) to the list associated with the key (key) in the kvs dictionary.
Return: Finally, the function returns the kvs dictionary, which now contains key-value relationships. Each key in the dictionary corresponds to a label or key block, and its associated value is a list containing one or more values extracted from value blocks associated with that key.

In essence, this function builds upon the initial categorization of key and value blocks to create structured relationships between them. It's particularly useful when processing documents that contain multiple key-value pairs, enabling easy access to the associated values for each key.

Finally, we will use all of these functions in our main function/lambda_handler function.

def print_kvs(kvs):
    for key, value in kvs.items():
        print(key, ":", value)

def lambda_handler(file_name):
    key_map, value_map, block_map = get_kv_map(file_name)

    # Get Key Value relationship
    kvs = get_kv_relationship(key_map, value_map, block_map)
    print("\n\n== FOUND KEY : VALUE pairs ===\n")
    print_kvs(kvs)

For this demo, we are only printing the result to the console, but you will want to return that as a JSON or write that somewhere, Depending on your use case or how you architect your workflow.

Testing

In my s3, I have uploaded a sample ID card image, which looks like this 👇🏾

After testing, the result looks something like this.

== FOUND KEY : VALUE pairs ===

ISSUE DATE  : ['28 AUG 14 ']
DATE of BIRTH  : ['01 OCT 60 ']
HEIGHT  : ['26cm ']
SEX  : ['M ']
MIDDLE NAME  : ['CITIZEN ']
EXPIRY  : ['01/19 ']
MENT NUMBER  : ['4812640614 ']
FIRST NAME  : ['NIGERIAN ']
SURNAME  : ['PROUD ']
NATIONALITY  : ['NGA ']
My  : ['Signature ']

And this is such an accurate result, the form data was detected and extracted from this image.

NOTE: The entire document processing was done by the initial function we wrote, which uses Textract API. Every other subsequent function is used to format the response for better results and further processing if needed.

That's it! You've learned to use AWS Textract for document processing. You can leverage the full potential of Textract to perform advanced document processing and automation in your applications or business processes.

Happy Coding!

Solutions & Offerings

Wendu

Microsoft on AWS

Backup and Restore

AWS Managed Services

AI Acceleration

AWS Maturity Services

Cloud Migration

Data Management

Security Compliance

AI Scan

Professional Services

Data Analytics

Cloud Resource Management

Cloud Training

Database as a Service

DevOps as a Service

Omni-Channel Contact Center

Resources

Press Releases

White-paper & Ebooks

Advanced document processing with AWS Textract

By abdulmumin yaqeen

Setting up Your Environment

Starting a Textract Job

Testing