Technology for digital assetization 101

2023/11/14 | 3 mins
 
  • Eunjung Park (Upstage CSO)

  • PEOPLE WHO WERE CONFUSED BETWEEN OCR AND INFORMATION EXTRACTION

    ANYONE WHO WANTS TO KNOW THE BASICS OF OCR

  • OCR, INFORMATION EXTRACTION CONCEPT SUMMARY

    CRITERIA FOR A GOOD OCR AND INFORMATION EXTRACTION MODEL

  • ✔️ OCR, TECHNOLOGY TO READ ALL LETTERS IN A DOCUMENT

    ✔️ INFORMATION EXTRACTION, A TECHNOLOGY THAT EVOLVES FROM OCR TO SELECT ONLY KEY INFORMATION FROM DOCUMENTS

(This blog is part 2 of the Digitize Anything! series.)

As mentioned in the last blog, digital assetization must come first in order to achieve digital transformation. However, the right technology varies depending on the purpose of digital assetization. Which technology should we use? In this article, we explain which technologies to adopt depending on the data we have and our purposes.




OCR, A TECHNOLOGY THAT READS ALL LETTERS IN A DOCUMENT

WHAT IS OCR?

First, if the data our organization has is an image or document file, and we want to find and read all the letters in the file, we can use OCR (optical character recognition).

  • What input does OCR take? Image or document files such as PNG, JPG, PDF, etc.

  • What output does OCR return? The recognized characters and their location information

  • How does OCR work? OCR usually consists of two models: a detector that finds the characters in a given file, and a recognizer that identifies which characters were found.

 

Detector → Recognizer

The detector expresses the position of each piece of text as a quad (a quadrilateral defined by 4 points), a polygon (a contour defined by 2N points), a center point (a single point), etc. Upstage Document OCR uses the four-point quad format, as shown in the figure above.
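
To make these location formats concrete, here is a minimal Python sketch. The `Quad` class and its method names are illustrative assumptions, not part of the Upstage Document OCR API. It only shows how a four-point box relates to a center point and to a general polygon.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates


@dataclass
class Quad:
    """Hypothetical four-point text box; corners listed clockwise from top-left."""
    points: List[Point]

    def __post_init__(self) -> None:
        if len(self.points) != 4:
            raise ValueError("a quad needs exactly 4 points")

    def center(self) -> Point:
        # Mean of the four corners: the "center point" view of the same box.
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return (sum(xs) / 4, sum(ys) / 4)

    def as_polygon(self) -> List[Point]:
        # A quad is the special case of a polygon contour with 4 vertices.
        return list(self.points)


box = Quad([(10, 20), (110, 20), (110, 50), (10, 50)])
print(box.center())  # (60.0, 35.0)
```

A quad can follow rotated or skewed text more tightly than an axis-aligned rectangle, while a polygon with more vertices is typically reserved for curved text.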

The recognizer identifies characters from a predefined set of recognition targets. Characters outside this set are usually output as an unknown symbol such as “�”. The recognition targets currently defined in Upstage Document OCR include (1) Korean, (2) English, (3) numbers, (4) Chinese characters, and (5) special characters, and we continue to add more in response to customer requests.
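
The effect of a predefined character set can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: `CHARSET` is a deliberately tiny, hypothetical set, and a real recognizer predicts over its fixed output vocabulary directly rather than filtering text afterwards.

```python
# Hypothetical, deliberately tiny charset; a real recognizer defines far more characters.
CHARSET = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,-")
UNKNOWN = "\ufffd"  # the Unicode replacement character, rendered as "�"


def restrict_to_charset(text: str, charset: set = CHARSET) -> str:
    """Replace every character outside the recognition target set with UNKNOWN."""
    return "".join(ch if ch in charset else UNKNOWN for ch in text)


print(restrict_to_charset("Total 3,500원"))  # Total 3,500�
```

Because the Korean character is not in this toy set, it comes back as the unknown symbol, which is why checking the supported character set before adoption matters.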

Nowadays, both detectors and recognizers are developed as deep learning-based models, and depending on the usage scenario, a single end-to-end model is sometimes used instead of separate detector and recognizer models.


WHAT IS A GOOD OCR MODEL?

In order to adopt the right OCR model for your organization, you need to be able to properly evaluate the model. Below are four criteria that can be used to evaluate an OCR model.

  1. Accuracy: The most important metric for many customers in OCR models is accuracy. The more precise technical term is F1-score, and as of 2023, it is common for commercial-level models to score 95 or higher on a given test set.

  2. Inference speed: Inference speed is important when you need to return results in real time. It varies depending on the number of characters in the image, but an inference time of less than 2 seconds per image is generally acceptable. Inference speed usually trades off against accuracy. If inference speed is less important, you can choose a model that is slower but more accurate.

  3. Recognition range: The range of characters recognized by an OCR model can usually be described by the script codes defined in ISO 15924 or the language codes of ISO 639-1. Therefore, before introducing OCR, it is essential to check which scripts or languages our organization's usage scenario requires. Additionally, you should review whether you need to recognize special objects such as signatures, checkboxes, and stamps that are not defined in Unicode.

  4. Robustness: Lastly, because of the nature of AI models, accuracy scores can vary greatly depending on the test set and metric. So it is not surprising that an OCR model that scored 95 in one case would score 80 in another. If a model that scores well in a specific case performs poorly on our organization's data, its generalization performance can be considered poor. A good OCR model maintains high quality across a variety of real-world data, including edge cases. Therefore, to verify whether a particular OCR model is a good model, it is important to check whether it is sufficiently robust to the various cases in our data.
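
The accuracy and robustness criteria above can be checked with a short script. The sketch below is a simplification made for illustration: it computes a word-level F1-score from text alone, whereas real OCR evaluations usually also match predicted boxes to ground-truth boxes. The test-set names and sample words are hypothetical, and scores are on a 0 to 1 scale rather than 0 to 100.

```python
from collections import Counter
from typing import Dict, List, Tuple


def word_f1(predicted: List[str], reference: List[str]) -> float:
    """Word-level F1: harmonic mean of precision and recall over matched words."""
    if not predicted or not reference:
        return 0.0
    # Multiset intersection: each reference word can be matched at most once.
    matched = sum((Counter(predicted) & Counter(reference)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (OCR output, ground truth) pairs for two kinds of documents.
test_sets: Dict[str, Tuple[List[str], List[str]]] = {
    "clean_scans": (["Invoice", "No", "1234"], ["Invoice", "No", "1234"]),
    "phone_photos": (["lnvoice", "No", "1234", "x"], ["Invoice", "No", "1234"]),
}

scores = {name: word_f1(pred, ref) for name, (pred, ref) in test_sets.items()}
for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")

# A large gap between the best and worst test set signals weak robustness.
gap = max(scores.values()) - min(scores.values())
print(f"gap: {gap:.2f}")
```

Running the same metric on several samples of our own data, rather than on a single benchmark, is what exposes the generalization gap described in the robustness criterion.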



What is it used for?

Representative use cases for OCR are as follows.

  1. Image search: OCR is used to index the text inside images and documents so that they can be searched. This is useful, for example, when you want to find images that contain specific words or sentences. The search analyzes the text entered by the user and finds matching images on the Internet or in a database.

  2. Manga translation: OCR is used to recognize the text contained in a comic so that it can be translated into the user's preferred language. This is very useful for global readers, as it makes comic content easily accessible in a variety of languages.
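
The image search use case can be sketched as an inverted index built from OCR output. The file names, the sample words, and the `build_index` and `search` helpers are all hypothetical. A production system would more likely rely on a search engine than on an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical OCR output: image file name -> words recognized in that image.
ocr_results: Dict[str, List[str]] = {
    "receipt_01.png": ["Egg", "Tart", "3500"],
    "menu_02.jpg": ["Egg", "Sandwich", "4200"],
    "poster_03.png": ["Grand", "Opening"],
}


def build_index(results: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each lowercased word to the set of images that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for image, words in results.items():
        for word in words:
            index[word.lower()].add(image)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the images containing every word of the query, sorted by name."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(ocr_results)
print(search(index, "egg"))       # ['menu_02.jpg', 'receipt_01.png']
print(search(index, "egg tart"))  # ['receipt_01.png']
```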



INFORMATION EXTRACTION, A TECHNOLOGY THAT EVOLVES FROM OCR TO SELECT ONLY KEY INFORMATION FROM DOCUMENTS

What is information extraction?

When you want to pick out key information contained in a document rather than simply reading text from it, you can go one step further from OCR and use information extraction technology.

  • What input does information extraction receive? Image or document files such as PNG, JPG, PDF, etc., together with a list of the key information you want to extract

  • What output does information extraction return? Only the necessary data, as structured information

  • How does information extraction work? As in OCR, the detector and recognizer run first, and then a parser extracts only the necessary information from all the recognized characters.

 

Detector → Recognizer → Parser


“The list of key information you want to extract” is also called an “ontology.” If the ontology we want to extract from documents consists of the treatment period, patient registration number, and receipt number, we annotate the training data with these three pieces of information. The information extractor trained on this data returns the final values as key-value pairs.
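
The relationship between the ontology and the key-value output can be sketched in Python. Everything here is an assumption made for illustration: the snake_case key names, the sample values, and the idea that the trained parser tags each recognized word with an ontology key. `assemble_key_values` is a hypothetical helper, not an Upstage API.

```python
from typing import Dict, List, Optional, Tuple

# The ontology: the list of keys we want to extract (from the example above).
ONTOLOGY = ["treatment_period", "patient_registration_number", "receipt_number"]

# Hypothetical parser input: each recognized word with the key the model assigned
# to it, or None when the word is not part of any key information.
tagged_words: List[Tuple[str, Optional[str]]] = [
    ("Hospital", None),
    ("2023-10-01", "treatment_period"),
    ("~", "treatment_period"),
    ("2023-10-05", "treatment_period"),
    ("P-000123", "patient_registration_number"),
    ("R-98765", "receipt_number"),
    ("Thank", None),
]


def assemble_key_values(
    words: List[Tuple[str, Optional[str]]], ontology: List[str]
) -> Dict[str, Optional[str]]:
    """Join the words tagged with each ontology key; keys with no words stay None."""
    grouped: Dict[str, List[str]] = {key: [] for key in ontology}
    for text, key in words:
        if key in grouped:
            grouped[key].append(text)
    return {key: " ".join(parts) if parts else None for key, parts in grouped.items()}


result = assemble_key_values(tagged_words, ONTOLOGY)
print(result)
```

Words that do not belong to any key, such as "Hospital", are dropped, which is the difference between reading everything (OCR) and returning only the key information.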

Example of information extraction results


What is a good information extraction model?

Below are four criteria you can use to evaluate a good information extraction model.

  1. Accuracy, inference speed: As with OCR models, accuracy and inference speed are important, and there is a trade-off between the two.

  2. Adaptability to various templates: Conventional rule-based models cannot extract key information at all if the document template changes. Models developed with AI technology, however, have the advantage of being able to extract information even from templates they have not seen before.

  3. Support for our organization's data format: Sometimes the key information in a document cannot be expressed as independent key-value pairs. For example, you may need to extract information as a table that preserves rows and columns, so that related values stay grouped together.

Receipt example where [Egg Tart, 3500, 1, 3500] forms a group


  4. Providing confidence scores: In many cases, information extraction is used to automate data entry from documents. In this case, a confidence score is useful for deciding whether the extracted information requires human verification. When a confidence score is provided, items above a certain threshold are processed automatically, and items below it are inspected by humans or go through a separate procedure. The threshold should be configurable to the level our organization requires.
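
Threshold-based routing on confidence scores can be sketched as follows. The field names, values, scores, and the `route_by_confidence` helper are hypothetical, and the 0.9 threshold is only an example value that each organization would tune for itself.

```python
from typing import Dict, List, Tuple

# Hypothetical extraction result: key -> (value, confidence between 0 and 1).
extracted: Dict[str, Tuple[str, float]] = {
    "receipt_number": ("R-98765", 0.99),
    "patient_registration_number": ("P-000123", 0.97),
    "treatment_period": ("2023-10-01 ~ 2023-10-05", 0.62),
}


def route_by_confidence(
    fields: Dict[str, Tuple[str, float]], threshold: float
) -> Tuple[Dict[str, str], List[str]]:
    """Split fields into auto-accepted values and keys that need human review."""
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    for key, (value, confidence) in fields.items():
        if confidence >= threshold:
            accepted[key] = value
        else:
            needs_review.append(key)
    return accepted, needs_review


# The threshold is the organization's choice; 0.9 is only an example.
accepted, needs_review = route_by_confidence(extracted, threshold=0.9)
print(accepted)
print(needs_review)  # ['treatment_period']
```

Raising the threshold sends more items to human review and lowers the risk of silent errors, while lowering it increases the automation rate.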

What is it used for?

  1. Loading various types of documents into a relational DB: Insurance companies use information extraction technology to automatically extract important data (e.g. drug name, amount, etc.) from medical bill receipts and detailed medical bill statements. If the extracted information is automatically saved in a database, it can be used to easily generate statistics such as drug usage.

  2. Personal information masking: Information extraction automatically finds and masks (hides) personal information such as names, resident registration numbers, and addresses included in a document. The identified personal information is hidden for security purposes, helping the document comply with privacy regulations.

  3. Work automation: Logistics and shipping companies need to certify and track the delivery of cargo. The B/L (bill of lading) document contains key information such as cargo details, origin, destination, and transportation conditions. Automatically extracting this data with information extraction technology can automate processes such as cargo management, transport route optimization, and delivery status tracking, greatly improving logistics efficiency.
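
As a sketch of the first use case, the snippet below loads extracted key-value results into a relational table and computes a simple usage statistic. The table layout, column names, and sample rows are assumptions made for illustration. It uses Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3

# Hypothetical line items extracted from medical bill statements.
extracted_items = [
    {"receipt_number": "R-001", "drug_name": "Drug A", "amount": 12000},
    {"receipt_number": "R-001", "drug_name": "Drug B", "amount": 8000},
    {"receipt_number": "R-002", "drug_name": "Drug A", "amount": 12000},
]

conn = sqlite3.connect(":memory:")  # in-memory DB, so the sketch has no side effects
conn.execute(
    "CREATE TABLE bill_items (receipt_number TEXT, drug_name TEXT, amount INTEGER)"
)
# Named placeholders let each extracted key-value dict map directly onto a row.
conn.executemany(
    "INSERT INTO bill_items VALUES (:receipt_number, :drug_name, :amount)",
    extracted_items,
)

# Once loaded, statistics such as drug usage become a plain SQL query.
usage = conn.execute(
    "SELECT drug_name, COUNT(*), SUM(amount) FROM bill_items "
    "GROUP BY drug_name ORDER BY drug_name"
).fetchall()
print(usage)  # [('Drug A', 2, 24000), ('Drug B', 1, 8000)]
conn.close()
```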

In closing

OCR and information extraction technology for digital assetization go beyond simple data processing and enable fundamental innovation in business processes. They significantly improve work efficiency by automating the processing of large quantities of documents and extracting important information quickly and accurately. These technologies, which are already being used in a variety of industries including insurance, manufacturing, banking, healthcare, and retail, can be expanded to an even wider scope. By referring to the cases introduced in this article, find and apply the solution best suited to your organization's work environment to build a faster, more accurate, and more efficient work process.

 

📝 Learn more about Document AI

Create new value through capitalizing data

Feel free to test the Document AI API in the Upstage console and create the service you want!


 