APPLYING IMAGE PRE-PROCESSING AND POST-PROCESSING  
TO OCR: A CASE STUDY FOR VIETNAMESE BUSINESS CARDS  
Thai Duy Quy a*, Vo Phương Binh a, Tran Nhat Quang a, Phan Thi Thanh Nga a
a Faculty of Information Technology, Dalat University, Lamdong, Vietnam
*Corresponding author: Email: quytd@dlu.edu.vn  
Abstract  
This paper proposes image pre-processing and Vietnamese post-processing algorithms that efficiently adapt the Tesseract open-source Optical Character Recognition (OCR) library. We built an Android mobile application and applied the results to Vietnamese business cards. The experimental results show that the proposed method, implemented as an Android application, achieves higher accuracy than the original OCR library.
Keywords: Android; OCR; Image pre-processing; Post-processing; Vietnamese Business  
Card.  
1. INTRODUCTION
In daily work, we often receive business cards from friends or partners. A business card typically carries contact information such as a name, address, and phone number, the same information a smartphone stores in its contact list. Our goal is therefore to build an application that extracts the text of a business card and saves the contact information to a smartphone. The Android application captures an image of the card directly with the phone's camera. Noise in the business card image is then eliminated, and the image is passed to the Optical Character Recognition (OCR) engine to extract the necessary information and save it to the contact list. To improve the efficiency of the extraction process, we developed improved algorithms for image pre-processing and post-processing. Our application is implemented on an Android device and tested with Vietnamese business cards. The OCR engine used in this paper is the Tesseract open-source library.
2. RELATED WORK
OCR systems have been under development in research and industry since the 1950s, using knowledge-based and statistical pattern recognition techniques to transform scanned or photographed images of text into machine-editable text files (Eason, Noble, & Sneddon, 1955). Shalin, Chopra, Ghadge, and Onkar (2014) developed an early OCR system and presented image pre-processing techniques used as an initial step in character recognition systems, of which the feature extraction step is the most important. To improve the accuracy of image recognition, Mande and Hansheng (2015) and Matteo, Ratko, Matija, and Tihomir (2017) proposed efficient methods to remove background noise and to enhance low-quality images, respectively. In addition, Shivananda and Nagabhushan (2009) proposed an approach that can handle document images with varying backgrounds of multiple colors. Bhaskar, Lavassar, and Green (2015); Pal, Rajani, Poojary, and Prasad (2017); and Yorozu, Hirano, Oka, and Tagawa (1987) presented tutorials on improving the accuracy of OCR when converting printed words into digital text.
Although many OCR applications are highly accurate for the English language (Badla, 2014; Chang & Steven, 2009; Kulkarni, Jadhav, Kalpe, & Kurkut, 2014; Palan, Bhatt, Mehta, Shavdia, & Kambli, 2014; Phan, Nguyen, Nguyen, Thai, & Vo, 2017; Trần, 2013), OCR systems for non-English languages still face several problems. Vietnamese is a tonal, monosyllabic language (Phan et al., 2017). We did not find any study that reports a 100% recognition rate for Vietnamese, but some applications have been implemented, such as Trần (2013). Among commercial products, a popular application is CamCard, but it offers little support for Vietnamese-language business cards. Another application available for Vietnamese in the Google Play Store is Business Card Reader Free, but its accuracy in our experiments was not high.
3. OCR AND TESSERACT
OCR is the technical process of converting scanned images of typewritten or printed text into machine-encoded text. OCR has been in development for almost 80 years: the first patent for an OCR machine was filed in 1929 by the German inventor Gustav Tauschek, and an American patent followed in 1935. OCR has many applications, including the postal service, language translation, and digital libraries. Today, OCR is even in the hands of the general public in the form of mobile applications. The input to an OCR system is an image containing text that cannot be edited; the output is editable text extracted from that image. The OCR process is illustrated in Figure 1.
Figure 1. OCR process  
The OCR process comprises several stages that convert an image to text. To simplify these steps, we use open-source software called Tesseract as the kernel of our project. Tesseract was first built in 1985 by Hewlett-Packard. The project later changed hands and was further developed by the University of Nevada, Las Vegas from 1996 to 2006 (Matteo et al., 2017). Since 2007, Google has sponsored the project as open-source software under the Apache 2.0 license. Today, Tesseract is considered the most accurate free OCR engine in existence and is one of the most widely used in the world; it currently supports 139 languages (Mande & Hansheng, 2015). The Tesseract OCR process can be represented by the flow chart in Figure 2. The system has eight stages, as follows (Bhaskar et al., 2015):
• Gray-scale or color image input: The input should ideally be a "flat" image from a flatbed scanner or a near-parallel camera capture;
• Adaptive thresholding: Reduces the gray-scale image to a binary image using Otsu's method (Bhaskar et al., 2015). The algorithm assumes that an image contains foreground (black) pixels and background (white) pixels, and calculates the optimal threshold separating the two classes so that their combined intra-class variance is minimal (a sketch of this step with OpenCV follows the list);
• Connected-component labeling: From the binary image, Tesseract identifies the foreground pixels and marks the potential characters;
• Line finding: Lines of text are found by analyzing the image space adjacent to potential characters;
• Baseline fitting: After each line of text is found, Tesseract examines it to find the approximate text height across the line;
• Fixed-pitch detection: The next step in setting up character detection is finding the approximate character width, which allows the correct incremental extraction of characters as Tesseract progresses down a line;
• Non-fixed-pitch spacing delimiting: Characters that are not of uniform width, or whose width disagrees with the surrounding neighbourhood, are reclassified and processed in an alternate manner;
• Word recognition: After finding all of the possible character "blobs" in the document, Tesseract performs word recognition on a word-by-word, line-by-line basis. Words are then passed through a contextual and syntactical analyzer, which ensures accurate recognition.
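As a concrete illustration of the adaptive-threshold stage, the minimal sketch below applies Otsu's method through the OpenCV Java bindings. It is an illustration only, not the code Tesseract runs internally, and the file names are placeholders:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class OtsuDemo {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        // Load the image directly as gray-scale
        Mat gray = Imgcodecs.imread("card_gray.png", Imgcodecs.IMREAD_GRAYSCALE);
        Mat binary = new Mat();
        // Otsu picks the threshold that minimizes the combined
        // intra-class variance of foreground and background pixels
        double t = Imgproc.threshold(gray, binary, 0, 255,
                Imgproc.THRESH_BINARY | Imgproc.THRESH_OTSU);
        System.out.println("Otsu threshold: " + t);
        Imgcodecs.imwrite("card_binary.png", binary);
    }
}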
Figure 2. Tesseract flow chart  
4. PROPOSED METHOD
4.1. Pre-processing  
The Tesseract engine is the kernel of the OCR system in our project. To improve the accuracy of the process, we apply several pre-processing techniques to the input images. The first technique fixes a frame after the picture is taken with the camera and converts the image to gray-scale. After that, we apply the methods proposed by Mande and Hansheng (2015), Matteo et al. (2017), and Shivananda and Nagabhushan (2009).

When the user finishes taking a picture, the program automatically identifies a frame for it, which is the outline of the business card. The user can change the size and shape of the frame to suit the text to be recognized. This not only increases the accuracy of the captured image but also removes unnecessary parts around the business card. Figure 3 shows an example of frame selection for a photographed business card. We used the OpenCV open-source library, an efficient tool for image processing; OpenCV can also convert a color picture to gray-scale, which is convenient for the next step of our OCR process (a sketch follows Figure 3).
Figure 3. A frame after taking a picture  
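The following is a minimal sketch of how such frame selection and gray-scale conversion can be done with the OpenCV Java bindings; the largest-contour heuristic and the file names are simplifying assumptions, not our exact production code:

import java.util.ArrayList;
import java.util.List;
import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class FrameDemo {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat photo = Imgcodecs.imread("card_photo.jpg"); // hypothetical input
        Mat gray = new Mat();
        Imgproc.cvtColor(photo, gray, Imgproc.COLOR_BGR2GRAY);
        Mat edges = new Mat();
        Imgproc.Canny(gray, edges, 75, 200);
        List<MatOfPoint> contours = new ArrayList<>();
        Imgproc.findContours(edges, contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);
        // Assume the card outline is the contour with the largest area
        Rect frame = new Rect(0, 0, photo.cols(), photo.rows());
        double best = 0;
        for (MatOfPoint c : contours) {
            double area = Imgproc.contourArea(c);
            if (area > best) {
                best = area;
                frame = Imgproc.boundingRect(c);
            }
        }
        // Crop the gray-scale image to the detected frame
        Imgcodecs.imwrite("card_cropped.png", new Mat(gray, frame));
    }
}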
In addition, the images can be further processed before being input to Tesseract, so we applied several methods proposed by previous authors. First, the original color image is converted into a gray-scale image using the formula proposed by Li, Jia-bing, and Shan-shan (2010), shown in Equation (1):

Y = 0.299R + 0.587G + 0.114B    (1)

where R, G, and B are the normalized red, green, and blue pixel values, respectively.
Second, we applied the methods proposed by Badla (2014), which process the images with two techniques: luminosity conversion and DPI enhancement, both implemented with the OpenCV library. Luminosity converts an image into gray-scale while preserving some of the color intensities (Badla, 2014). The code below describes the image luminosity process:
// Iterate over all pixels of the input java.awt.image.BufferedImage
// (width x height) and replace each with its luminosity gray value
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        // BufferedImage.getRGB() returns the packed RGB value of the pixel;
        // Color(int) unpacks it into red, green, and blue components
        Color color = new Color(image.getRGB(x, y));
        // Weighted average of the red, green, and blue components
        int luminosity = (int) (0.2126 * color.getRed()
                + 0.7152 * color.getGreen()
                + 0.0722 * color.getBlue());
        // Write the gray value back into all three channels of the pixel
        Color lum = new Color(luminosity, luminosity, luminosity);
        image.setRGB(x, y, lum.getRGB());
    }
}
To get the best results from the image, we need to fix the DPI, as 300 DPI is the minimum acceptable for Tesseract (Badla, 2014). The DPI enhancement relies on an edge-extraction pass, sketched below with 3x3 gradient kernels:
// Edge extraction used for the DPI enhancement; we assume 3x3 Sobel
// kernels for the horizontal (x) and vertical (y) differences and keep
// the gradient magnitude of each interior pixel
int[][] kernelX = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
int[][] kernelY = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
int[][] magnitude = new int[height][width]; // zero-initialized in Java

for (int y = 1; y < height - 1; y++) {
    for (int x = 1; x < width - 1; x++) {
        int diffX = 0; // difference in DPI along the x edge
        int diffY = 0; // difference in DPI along the y edge
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                int pixel = gray[y + dy][x + dx]; // gray-scale input image
                diffX += pixel * kernelX[dy + 1][dx + 1];
                diffY += pixel * kernelY[dy + 1][dx + 1];
            }
        }
        magnitude[y][x] = (int) Math.sqrt(diffX * diffX + diffY * diffY);
    }
}
// The magnitude map is then used to rebuild the image at the target DPI
Finally, we use the methods proposed by Mande and Hansheng (2015) and Matteo et al. (2017) for low-quality images and images with backgrounds. Tesseract requires a minimum text size for reasonable accuracy: if the x-height of the text is below 20px, accuracy drops off. The first pre-processing method proposed by Matteo et al. (2017) is resizing the image so that its height is 100px; resizing is only applied if the height of the original image is below 100px. Their second method is image sharpening, whose main purpose is to enhance the contrast between edges, i.e., between text and background. Sharpening is achieved using unsharp masking, represented by Equation (2):
g(i,j) = f(i,j) − fsmooth(i,j)    (2)
Here a smoothed image fsmooth is subtracted from the original image f. The third method proposed by Matteo et al. (2017) is image blurring, which reduces high-frequency information and removes noise that could lower the OCR accuracy rate. It applies a low-pass filter to the analyzed image f so that each pixel is replaced by the average of all values in its local 9x9 neighborhood, as in Equation (3):

g(i,j) = (1/81) Σ(m=−4..4) Σ(n=−4..4) f(i+m, j+n)    (3)
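The three enhancement steps above can be sketched with the OpenCV Java bindings as follows. The 100px threshold and the 9x9 average follow the text; the 5x5 Gaussian kernel and the cubic interpolation are our assumptions:

import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class EnhanceDemo {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat f = Imgcodecs.imread("line.png", Imgcodecs.IMREAD_GRAYSCALE);

        // 1) Resize so the image is 100px tall, only if it is smaller
        if (f.rows() < 100) {
            double scale = 100.0 / f.rows();
            Imgproc.resize(f, f, new Size(), scale, scale, Imgproc.INTER_CUBIC);
        }

        // 2) Unsharp mask, Equation (2): g = f - fsmooth
        Mat smooth = new Mat();
        Imgproc.GaussianBlur(f, smooth, new Size(5, 5), 0);
        Mat g = new Mat();
        Core.subtract(f, smooth, g);

        // 3) Noise removal, Equation (3): 9x9 local average (box filter)
        Mat denoised = new Mat();
        Imgproc.blur(f, denoised, new Size(9, 9));

        Imgcodecs.imwrite("line_enhanced.png", denoised);
    }
}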
Mande and Hansheng (2015) proposed methods for the case where the image has a background. The methods are based on a color model in RGB space (Figure 4). We applied this method, using the brightness distortion (αi) and chromaticity distortion (CDi) parameters to enhance a document image and make its background easier to remove. The brightness distortion αi is obtained by minimizing the objective function in Equation (4):

φ(αi) = (pi − αiEi)²    (4)

where αi represents the pixel's brightness. To minimize the objective function (4), αi must be 1 if the brightness of the given pixel in the current image is the same as in the reference image; αi < 1 means the pixel is dimmer than the expected brightness, and αi > 1 means it is brighter. Once αi is determined, the value of CDi follows from Equation (5):

CDi = ||pi − αiEi||    (5)
Figure 4. Color model in RGB space. Ei represents the expected color of pixel pi in the current image. The difference between pi and Ei is decomposed into brightness distortion (αi) and chromaticity distortion (CDi).
Source: Mande and Hansheng (2015).
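Assuming the standard least-squares solution of Equation (4), αi = (pi·Ei)/(Ei·Ei), the two distortion measures can be computed as in the sketch below; the helper names are ours:

public class ColorModel {
    // Brightness distortion alpha_i that minimizes (p - alpha*E)^2,
    // assuming the least-squares solution alpha = (p.E)/(E.E)
    static double brightnessDistortion(double[] p, double[] e) {
        double dot = 0, norm2 = 0;
        for (int k = 0; k < 3; k++) {
            dot += p[k] * e[k];
            norm2 += e[k] * e[k];
        }
        return dot / norm2;
    }

    // Chromaticity distortion CD_i = ||p - alpha*E||, Equation (5)
    static double chromaticityDistortion(double[] p, double[] e, double alpha) {
        double sum = 0;
        for (int k = 0; k < 3; k++) {
            double d = p[k] - alpha * e[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] p = {200, 180, 170}, e = {210, 190, 180}; // sample RGB values
        double alpha = brightnessDistortion(p, e);
        System.out.printf("alpha=%.3f CD=%.3f%n",
                alpha, chromaticityDistortion(p, e, alpha));
    }
}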
4.2. Post-Processing  
OCR (including Tesseract) is used in many applications these days. In this project, we researched and applied OCR only to business cards, so we were concerned with just four items: i) name or organization; ii) telephone number; iii) email; and iv) address of the organization. There are two main techniques for extracting such textual information from images: i) regular expressions (with user-defined rules) or ii) statistical machine learning (Trần, 2013). In this study, we used regular expressions, together with methods based on Vietnamese language rules, to obtain the necessary information.
The editable text produced by the OCR process consists of multiple lines. The information on a business card is usually short, and the first words of a line indicate its contents. Overall, the telephone number and email address are matched with regular expressions, whereas the name and address are identified using Vietnamese language conventions. For the email address and phone number, we used the regular expressions provided by Kipalog (2018). The regular expression for the email address is Expression (6):

/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/igm    (6)

Similarly, the phone number is matched by Expression (7):

(\\(\\d+\\)+[\\s-.]*)*(\\d+[\\s-.]*)+    (7)
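As an illustration, the two expressions can be compiled and applied line by line in Java as below; the /igm flags map to CASE_INSENSITIVE and MULTILINE, and the sample text is hypothetical:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractDemo {
    // Expression (6): email address
    static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    // Expression (7): phone number
    static final Pattern PHONE = Pattern.compile(
            "(\\(\\d+\\)+[\\s-.]*)*(\\d+[\\s-.]*)+");

    public static void main(String[] args) {
        String ocrText = "Email: quytd@dlu.edu.vn\nTel: (063) 3822 246";
        for (String line : ocrText.split("\n")) {
            Matcher m = EMAIL.matcher(line);
            if (m.find()) System.out.println("email: " + m.group());
            m = PHONE.matcher(line);
            if (m.find()) System.out.println("phone: " + m.group().trim());
        }
    }
}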
In addition, when the algorithm finds a phone number, it also categorizes it as a mobile or home number. On most business cards, phone numbers appear as a plain sequence of digits or are separated by special characters such as white spaces, dots, and dashes. We therefore included some special exceptions in the algorithm to improve the post-processing.
For a Vietnamese name, the algorithm checks whether the line contains a family name: known family names are stored in a list, and each line is compared against it. If no family name is found, the algorithm takes all the words in the line and saves them as the organization name. For the address, the algorithm checks whether the input line starts with a Vietnamese heading such as "Đc:" or "Địa chỉ:", an English heading such as "Add:" or "Address:", or these words in uppercase. If such a heading exists, the line is the address; otherwise the algorithm looks for the name of a Vietnamese province, stored in a list similar to the family-name list. A simplified sketch of this classification follows.
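The sketch below illustrates the heuristic; the family-name, heading, and province lists are abbreviated stand-ins for the full lists the application uses:

import java.util.Arrays;
import java.util.List;

public class LineClassifier {
    static final List<String> FAMILY_NAMES =
            Arrays.asList("Nguyễn", "Trần", "Lê", "Phạm", "Võ", "Phan", "Thái");
    static final List<String> ADDRESS_HEADINGS =
            Arrays.asList("Đc:", "Địa chỉ:", "Add:", "Address:");
    static final List<String> PROVINCES =
            Arrays.asList("Lâm Đồng", "Hà Nội", "Đà Nẵng", "TP. Hồ Chí Minh");

    static String classify(String line) {
        String upper = line.toUpperCase();
        // Headings and province names mark an address line
        for (String h : ADDRESS_HEADINGS)
            if (upper.startsWith(h.toUpperCase())) return "address";
        for (String p : PROVINCES)
            if (upper.contains(p.toUpperCase())) return "address";
        // A line starting with a known family name is treated as a person
        for (String f : FAMILY_NAMES)
            if (line.startsWith(f)) return "name";
        // Otherwise the whole line is kept as an organization name
        return "organization";
    }

    public static void main(String[] args) {
        System.out.println(classify("Thái Duy Quý"));                // name
        System.out.println(classify("Đc: 01 Phù Đổng Thiên Vương")); // address
    }
}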
4.3. Proposed model  
Figure 5 shows the basic steps of recognition in our project. Images of a business card taken by the phone's camera are pre-processed (see Section 4.1) and then input to the Tesseract engine. After receiving the text results, we use Vietnamese language conventions for names and addresses to extract information from the card (post-processing, see Section 4.2) and then save the information to the contact list on the Android device.
Figure 5. Proposed structural model  
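A minimal sketch of this pipeline, assuming the tess-two Android wrapper for Tesseract; the preprocess and postProcess helpers stand for the steps of Sections 4.1 and 4.2 and are hypothetical names:

import android.graphics.Bitmap;
import com.googlecode.tesseract.android.TessBaseAPI;

public class CardPipeline {
    // dataPath must contain tessdata/vie.traineddata
    public String recognize(Bitmap photo, String dataPath) {
        Bitmap cleaned = preprocess(photo);          // Section 4.1
        TessBaseAPI api = new TessBaseAPI();
        api.init(dataPath, "vie");                   // Vietnamese model
        api.setImage(cleaned);
        String text = api.getUTF8Text();             // editable text
        api.end();
        return postProcess(text);                    // Section 4.2
    }

    private Bitmap preprocess(Bitmap b) { return b; }  // placeholder
    private String postProcess(String t) { return t; } // placeholder
}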
5. RESULTS
We implemented an application called Vietnamese Card Scan (VnCS) on the Android OS. The experiment was deployed on a Samsung Galaxy Tab E tablet running Android 4.4.4; the APK file is 26.5MB. The program running on Android is shown in Figure 6. The test data comprise 250 Vietnamese business cards of three types, as presented in Table 1.
Figure 6. ScanVnCard program in Samsung Galaxy Tab E  
Table 1. Business card data collected

Type   Features                                                                  Quantity
No. 1  Distinctive background and text, no wallpaper                             135
No. 2  Distinctive background and letters, with wallpaper                        75
No. 3  Same color, logo, picture, or characters that are difficult to identify  40
Four types of information are extracted: i) name or organization; ii) phone numbers; iii) email; and iv) address. The accuracy for each extraction type is shown in Table 2. Figure 7 presents an original Vietnamese business card, the image after pre-processing, and the editable text after OCR processing.
Table 2. Results for four types of information extracted from business cards

                      No. 1 (%)  No. 2 (%)  No. 3 (%)
Name or organization  90         70         60
Phone numbers         90         80         70
Email                 80         60         50
Address               70         60         60
Figure 7. An example of our OCR process on a Vietnamese business card
Note: a) Original business card; b) Pre-processing; c) Editable text; and d) Saving to contact list.
6. CONCLUSIONS
This paper has presented a mobile image-to-text recognition system, implemented as an Android application, for Vietnamese business cards. The image is taken with a camera and pre-processed using various techniques, then processed with OCR to produce editable text on screen. Finally, the necessary information is extracted by post-processing and saved to the contact list. The results show that the proposed method achieves higher efficiency and accuracy than the original software. In the future, we will improve the program's speed and deploy it on other operating systems.
REFERENCES  
Badla, S. (2014). Improving the efficiency of Tesseract OCR engine. Retrieved from om/&httpsredir=1&article=1416&context=etd_projects.
Bhaskar, S., Lavassar, N., & Green, S. (2015). Implementing optical character  
recognition on the Android operating system for business cards. Retrieved from  
sinessCardRecognition.pdf.  
Chang, L. Z., & Steven, Z. Z. (2009). Robust pre-processing techniques for OCR  
applications on mobile devices. Paper presented at The International Conference  
on Mobile Technology, Application & Systems, France.  
Eason, G., Noble, B., & Sneddon, I. N. (1955). On certain integrals of Lipschitz-Hankel  
type involving products of Bessel functions. Phil. Trans. Roy. Soc., A247, 529-  
551.  
Kipalog. (2018). 30 đoạn biểu thức chính quy mà lập trình viên Web nên biết [30 regular expression snippets every web developer should know]. Retrieved from https://kipalog.com/posts/30-doan-bieu-thuc-chinh-quy-ma-lap-trinh-vien-web-nen-biet.
Koistinen, M., Kettunen, K., & Kervinen, J. (2017). How to improve optical character  
recognition of historical Finnish newspapers using open source Tesseract OCR  
engine. Paper presented at The Language & Technology Conference: Human  
Language Technologies as a Challenge for Computer Science and Linguistics,  
Poland.  
Kulkarni, S. S., Jadhav, V., Kalpe, A., & Kurkut, V. (2014). Android card reader  
application using OCR. International Journal of Advanced Research in Computer  
and Communication Engineering, 3, 5238-5239.  
Li, J., Jia-Bing, H. D., & Shan-shan, Z. (2010). A novel algorithm for color space  
conversion model from CMYK to LAB. Journal of Multimedia, 5(2), 159-166.  
Mande, S., & Hansheng, L. (2015). Improving OCR performance with background image  
elimination. Paper presented at The International Conference on Fuzzy Systems  
and Knowledge Discovery, China.  
Matteo, B., Ratko, G., Matija, P., & Tihomir, M. (2017). Improving optical character  
recognition performance for low quality images. Paper presented at The  
International Symposium ELMAR, Croatia.  
Pal, I., Rajani, M., Poojary, A., & Prasad, P. (2017). Implementation of image to text  
conversion using Android app. International Journal of Advanced Research in  
Electrical, Electronics and Instrumentation Engineering, 6, 2291-2297.  
Palan, D. R., Bhatt, G. B., Mehta, K. J., Shavdia, K. J., & Kambli, M. (2014). OCR on  
Android-travelmate. International Journal of Advanced Research in Computer  
and Communication Engineering, 3, 5810-5812.  
Phan, T. T. N., Nguyen, T. H. T., Nguyen, V. P., Thai, D. Q., & Vo, P. B. (2017).  
Vietnamese text extraction from book covers. Dalat University Journal of  
Science, 7(2), 142-152.  
Shalin, Chopra, A., Ghadge, A. A., & Onkar, A. P. (2014). Optical character recognition.  
International Journal of Advanced Research in Computer and Communication  
Engineering, 3, 214-219.  
Shivananda, N., & Nagabhushan, P. (2009). Separation of foreground text from complex  
background in color document images. Paper presented at The Seventh  
International Conference on Advances in Pattern Recognition, India.  
Trần, Đ. H. (2013). Ứng dụng nhận dạng danh thiếp tiếng Việt và cập nhật thông tin danh bạ trên Android [Vietnamese business card recognition and contact-list update application on Android]. Retrieved from 2558917-ung-dung-nhan-dang-danh-thiep-tieng-viet-va-cap-nhat-thong-tin-danh-ba-tren-android-full-soure-code.htm.
Yorozu, Y., Hirano, M., Oka, K., & Tagawa, Y. (1987). Electron spectroscopy studies on  
magneto-optical media and plastic substrate interface. IEEE Transl. J. Magn., 2,  
740-741.  
Zhou, S. Z., Gilani, S. O., & Winkler, S. (2016). Open source OCR framework using  
mobile devices. SPIE-IS&T, 6821, 1-6.  