
VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 39-58

Vietnamese Semantic Role Labelling

Le-Hong Phuong¹,*, Pham Thai Hoang², Pham Xuan Khoai,

Nguyen Thi Minh Huyen¹, Nguyen Thi Luong³, Nguyen Minh Hiep³

¹VNU University of Science, 334 Nguyen Trai, Thanh Xuan, Hanoi, Vietnam

²FPT University, Hanoi, Vietnam

³Dalat University, Lamdong, Vietnam

Abstract

In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language

sentences and its application for the Vietnamese language. We present our effort in building Vietnamese

PropBank, the first Vietnamese SRL corpus and a software system for labelling semantic roles of Vietnamese

texts. In particular, we present a novel constituent extraction algorithm in the argument candidate identification

step which is more suitable and more accurate than the common node-mapping method. In the machine learning

part, our system integrates distributed word features produced by two recent unsupervised learning models in

two learned statistical classifiers and makes use of integer linear programming inference procedure to improve

the accuracy. The system is evaluated in a series of experiments and achieves a good result, an F₁ score of

74.77%. Our system, including corpus and software, is available as an open source project for free research and

we believe that it is a good baseline for the development of future Vietnamese SRL systems.

Received 27 June 2017; Revised 23 October 2017; Accepted 20 November 2017

Keywords: Distributed word representation, integer linear programming, semantic role labelling, Vietnamese,

Vietnamese PropBank.

1. Introduction

In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language sentences. SRL is the task of identifying the semantic roles of the arguments of each predicate in a sentence. In particular, it answers the question "Who did what to whom, when, where and why?". For each predicate in a sentence, the goal is to identify all constituents that fill a semantic role and to determine their roles, such as agent, patient or instrument, and their adjuncts, such as locative, temporal or manner.

Figure 1 shows the SRL of a simple Vietnamese sentence. In this example, the arguments of the predicate giúp (helped) are labelled with their semantic roles. The meaning of the labels will be described in detail in Section 2.2.

Figure 1. SRL of the Vietnamese sentence.

"Nam giúp Huy học bài vào hôm qua"

(Nam helped Huy to do homework yesterday).

________

^*Corresponding author. Email: phuonglh@vnu.edu.vn

https://doi.org/10.25073/2588-1086/vnucsce.166


SRL has been used in many natural language processing (NLP) applications such as question answering [1], machine translation [2], document summarization [3] and information extraction [4]. Therefore, SRL is an important task in NLP. The first SRL system was developed by Gildea and Jurafsky [5]. This system was developed on the English FrameNet corpus. Since then, the SRL task has been widely studied by the NLP community. In particular, there have been two shared tasks, CoNLL-2004 [6] and CoNLL-2005 [7], focusing on the SRL task for English. Most of the systems participating in these shared tasks treated the problem as a classification problem which can be solved by supervised machine learning techniques. There also exist several systems for other well-studied languages such as Chinese [8] and Japanese [9].

This paper covers not only the contents of two works published in conference proceedings, [10] (in Vietnamese) and [11], on the construction and the evaluation of a first SRL system for Vietnamese, but also an extended investigation of techniques used in SRL. More concretely, the use of an integer linear programming inference procedure and of distributed word representations in our semantic role labelling system, which leads to improved results over our previous work, as well as a more elaborate evaluation, are new in this article.

Our system includes two main components, an SRL corpus and an SRL software system, which is thoroughly evaluated. We employ the development methodology of the English PropBank to build an SRL corpus for Vietnamese containing a large number of syntactically parsed sentences with predicate-argument structures.

We then use this SRL corpus and supervised machine learning models to develop an SRL software system for Vietnamese. We demonstrate that a simple application of SRL techniques developed for English or other languages does not give good accuracy for Vietnamese. In particular, in the constituent identification step, the widely used 1-1 node-mapping algorithm for extracting argument candidates performs poorly on the Vietnamese dataset, with an F₁ score of 35.93%. We thus introduce a new algorithm for extracting candidates, which is much more accurate, achieving an F₁ score of 84.08%. In the classification step, in addition to the common linguistic features, we propose novel and useful features for use in SRL, including function tags and distributed word representations. These features are employed in two statistical classification models, maximum entropy and support vector machines, which have proved to be good at many classification problems. In order to incorporate important grammatical constraints into the system and further improve its performance, we combine machine learning techniques with an inference procedure based on integer linear programming. Finally, we use distributed word representations produced by two recent unsupervised models, the Skip-gram model and the GloVe model, on a large corpus to alleviate the data sparseness problem. These word embeddings help our SRL software system generalize well to unseen words. Our final system achieves an F₁ score of 74.77% on a test corpus. This system, including corpus and software, is available as an open source project for free research and we believe that it is a good baseline for the development of future Vietnamese SRL systems.

The remainder of this paper is structured as follows. Section 2 describes the construction of an SRL corpus for Vietnamese. Section 3 presents the development of an SRL software system, including the methodologies of existing systems and of our system. Section 4 presents the evaluation results and discussion. Finally, Section 5 concludes the paper and suggests some directions for future work.

2. Vietnamese SRL corpus

As for many other problems in NLP, annotated corpora are essential for statistical learning as well as for the evaluation of SRL systems. In this section, we start with an introduction to existing English SRL corpora. Then we present our work on the construction of the first reference SRL corpus for Vietnamese.

2.1. Existing English SRL corpora

2.1.1. FrameNet


The FrameNet project is a lexical database of English. It was built by annotating examples of how words are used in actual texts. It consists of more than 10,000 word senses, most of them with annotated examples that show their meaning and usage, and more than 170,000 manually annotated sentences [12]. This is the most widely used dataset upon which SRL systems for English have been developed and tested.

FrameNet is based on the Frame Semantics theory [13]. The basic idea is that the meanings of most words can be best understood on the basis of a semantic frame: a description of a type of event, relation, or entity and the participants in it. All members of a semantic frame are called frame elements. For example, a sentence in FrameNet is annotated with the cooking concept as shown in Figure 2.

Figure 2. An example sentence in the FrameNet corpus.

2.1.2. PropBank

PropBank is a corpus that is annotated with verbal propositions and their arguments [14]. PropBank tries to supply a general-purpose labelling of semantic roles for a large corpus to support the training of automatic semantic role labelling systems. However, defining such a universal set of semantic roles for all types of predicates is a difficult task; therefore, only the Arg0 and Arg1 semantic roles can be generalized. In addition to the core roles, PropBank defines several adjunct roles that can apply to any verb, called argument modifiers. The semantic roles covered by PropBank are the following:

• Core Arguments (Arg0-Arg5, ArgA): Arguments that define predicate-specific roles. Their semantics depend on the predicate in the sentence.

• Adjunct Arguments (ArgM-*): General arguments that can belong to any predicate. There are 13 types of adjuncts.

• Reference Arguments (R-*): Arguments that represent arguments realized in other parts of the sentence.

• Predicate (V): Participant realizing the verb of the proposition.

For example, the sentence of Figure 2 can be annotated in the PropBank role schema as shown in Figure 3.

Figure 3. An example sentence in the PropBank corpus.

The English PropBank methodology is currently implemented for a wide variety of languages such as Chinese, Arabic or Hindi with the aim of creating parallel PropBanks¹. This SRL resource has a great impact on many natural language processing tasks and applications.

2.1.3. VerbNet

VerbNet is a verb lexicon of English, which was developed by Karin Kipper-Schuler and colleagues [15]. It contains more than 5,800 English verbs, which are classified into 270 groups according to the verb classification method of Beth Levin [16]. In this approach, the behavior of a verb is mostly determined by its meaning.

Once the verbs are classified into groups, semantic roles are added to each verb group. VerbNet has 23 semantic roles, for example:

• Actor, the participant that is the investigator of an event.

• Agent, the actor in an event who initiates and carries out the event and who exists independently of the event.

• Attribute, the undergoer that is a property of an entity or entities.

• Destination, the goal that is a concrete, physical location.

These semantic roles normally answer who, what, when and how questions. The SRL annotation guidelines of this project are available online².

________

¹http://verbs.colorado.edu/~mpalmer/projects/ace/EPB-annotation-guidelines.pdf

²http://verbs.colorado.edu/verb-index/VerbNet_Guidelines.pdf

In summary, SRL corpora have been constructed for English and other well-resourced languages. They are important


resources which are very useful for many natural language processing applications.

For the Vietnamese language, there has not existed any SRL corpus of a level comparable to the English corpora described above. In the following sections, we report our initiatives for constructing and evaluating an SRL corpus for Vietnamese.

2.2. Building a Vietnamese PropBank

In this section, we present the construction of a Vietnamese SRL corpus, which is referred to as the Vietnamese PropBank hereafter. We first describe the annotation guidelines and then the SRL corpus which has been developed.

2.2.1. Vietnamese SRL Annotation Guidelines

The determination of semantic roles in the Vietnamese language is a difficult problem on which opinions differ. In general, Vietnamese linguists have not reached a consensus on a list of semantic roles for the language. Different linguists have proposed different lists; some use the same name with different meanings for a role, and others use different names with the same meaning.

Nevertheless, one can use an important principle for determining semantic roles: "Semantic role is the actual role a participant plays in some situation and it always depends on the nature of that situation" [17]. This means that when identifying the meaning of a phrase or of a sentence, one must not separate it from the underlying situation in which it appears. While there might be some controversy about what the exact semantic role names should be, one can list common semantic roles which have been accepted by most Vietnamese linguists [18].

The syntactic sub-categorization frames are closely related to the verb meanings. That is, the meaning of a sentence can be captured by the sub-categorization frame of the verb predicate. In consequence, the sentence meaning can be described by labelling the semantic roles of each participant in the sub-categorization frame of the predicate. This approach is adopted by many Vietnamese linguists and different sets of semantic roles have been proposed. For example, Cao Xuân Hạo [17] makes use of the argument (obligatory participant) roles agent, actor, processed, force, carrier, patient, experiencer, goal, etc., while Diệp Quang Ban [19] makes use of fact categories: dynamic, static, mental, existential, verbal, relational, etc. For adjuncts (optional participants), Cao Xuân Hạo uses the roles manner, mean, result, path, etc., while Diệp Quang Ban makes use of circumstance types: time, space, cause, condition, goal, result, path, etc.

In this work, we took a pragmatic standpoint during the design of a semantic role tagset and focused our attention on the SRL categories that we expect to be most necessary and useful in practical applications. We have constructed a semantic role tagset based on the two following principles:

• The semantic roles are well-defined and commonly accepted by the Vietnamese linguistic community.

• The semantic roles are comparable to those of the English PropBank corpus, which makes them helpful and advantageous for constructing multi-lingual corpora and applications in later steps. Furthermore, it seems fairly indisputable that there are structural and semantic correspondences across languages.

We have selected an SRL tagset which is basically similar to that of the PropBank. However, some roles are made more fine-grained to account for idiosyncratic properties of the Vietnamese language. In addition, some new roles are added to better distinguish predicate arguments when the predicate is an adjective, a numeral, a noun or a preposition, which is a common phenomenon in Vietnamese besides the popular verbal predicate.

The following list describes some semantic roles of predicative arguments where the predicate is a verb:

• Arg0: The agent semantic role representing a person or thing who is the doer of an event. For example,

• Arg0-Identified and Arg1-Identifier: The semantic roles representing identified entity and


identifier respectively, normally used with the copula “là”. For example,

(He is the best player here.)

• Arg1-Patient: The semantic role which is the surface object of a predicate indicating the person or thing affected. For example,

(The soldiers broke a bridge.)

• Arg2: The semantic role of a beneficiary indicating a referent who is advantaged or disadvantaged by an event. For example,

(He repaired a bike for her.)

Figure 4 presents an example of the SRL analysis of a syntactically bracketed sentence “Ba đứa con anh đã có việc làm ổn định.” (His three children have had a permanent job.). The semantic roles of this sentence include:

• Arg0: “ba đứa con anh” (his three children) is the agent

• ArgM-TMP: “đã” is a temporal modifier

• Rel: “có” (have) is the predicate

• Arg1: “việc làm ổn định” (a permanent job) is the patient.

Figure 4. An SRL annotated Vietnamese sentence.

2.2.2. Vietnamese SRL Corpus

Once the SRL annotation guidelines had been designed, we built a Vietnamese SRL corpus in two main steps.

In the first step, we proposed a set of conversion rules to automatically convert a syntactically annotated treebank containing 10,000 manually annotated sentences (the VietTreeBank) to a coarse-grained SRL annotated corpus.

The Vietnamese treebank is one result of a national project which aims to develop basic resources and tools for Vietnamese language and speech processing³. The raw texts of the treebank are collected from the social and political sections of the Youth online daily newspaper. The corpus is divided into three sets corresponding to three annotation levels: word-segmented, part-of-speech-tagged and syntax-annotated. The syntax-annotated corpus, a subset of the part-of-speech-tagged set, is currently composed of 10,471 sentences (225,085 tokens). Sentences range from 2 to 105 words, with an average length of 21.75 words. There are 9,314 sentences of length 40 words or less. The tagset of the treebank has 38 syntactic labels (18 part-of-speech tags, 17 syntactic category tags, 3 empty categories) and 17 function tags. For details, please refer to [20]⁴. The meanings of some common tags are listed in Table 1.

Table 1. Some Vietnamese treebank tags

No.   Category   Description
1.    S          simple declarative clause
2.    VP         verb phrase
3.    NP         noun phrase
4.    PP         preposition phrase
5.    N          common noun
6.    V          verb
7.    P          pronoun
8.    R          adverb
9.    E          preposition
10.   CC         coordinating conjunction

________

³VLSP Project, https://vlsp.hpda.vn/demo/

⁴All the resources are available at the website of the VLSP project.

The coarse-grained semantic role tagset contains 24 role names which are all based on the main roles of the PropBank. We carefully investigated the tagset of the VietTreeBank based on detailed guidelines of constituency structures, phrasal types, functional tags, clauses, parts-of-speech and adverbial


functional tagset to propose a set of rules for determining high-level semantic roles. Some rules for coarse-grained annotation are shown in Table 2. Each rule is used to determine a semantic role for a phrase of a sentence.

Table 2. Some rules for coarse-grained SRL annotation

Role       Description           Rule
ARG0       Agent                 SUB | phrasal types (NP, ...) preceding predicate
ARG1       Patient               DOB | phrasal types (NP, ...) following predicate
ARG2       Beneficiary           IOB phrases
ARGM-NEG   Negation              Negative words “không, chẳng, chớ, chả”
ARGM-LOC   Locatives             LOC phrases
ARGM-MNR   Manner markers        MNR phrases
ARGM-CAU   Cause clauses         PRP | causal words “do, bởi vì, vì, bởi,”
ARGM-DIR   Directionals          DIR phrases
ARGM-DIS   Conjunctive clauses   CC phrases or C word
ARGM-EXT   Extent markers        EXT phrases

As an example, consider the constituency analysis of a sentence in the VietTreeBank “Kia là những ngôi nhà vách đất.” (Over there are soil-wall houses.)

(S (NP-SUB (P-H Kia)) (VP (V-H là) (NP (L những) (Nc-H ngôi) (N nhà) (NP (N-H vách) (N đất)))) (. .))

First, using the annotation rule for Arg0, which selects the phrase having the syntactic function SUB or preceding the predicate of the sentence, we can annotate the word “Kia” with the semantic role Arg0. The predicate “là” is annotated with the semantic role REL. Finally, the noun phrase following the predicate, “những ngôi nhà vách đất”, is annotated with Arg1.

In the second step, we developed a software tool to help a team of Vietnamese linguists manually revise and annotate the converted corpus with fine-grained semantic roles. The software is web-based, user-friendly and makes correction and editing by multiple linguists easy. In addition, it supports collaborative work: any edit at the sentence level is versioned and logged with meta-information so as to facilitate cross-validation and discussion between linguists if necessary.

We have completed the semantic role annotation of 5,460 sentences of the VietTreeBank, covering 7,525 verbal and adjectival predicates. The annotation guidelines as well as the current SRL corpus are published as open resources for free research. In the next section, we present our effort in developing an SRL software system for Vietnamese which is constructed and evaluated on this SRL corpus.

3. Vietnamese SRL system

3.1. Existing approaches

This section gives a brief survey of common approaches which are used by many existing SRL systems for well-studied languages. These systems are investigated in two aspects: (a) the data type that the systems use and (b) their approaches for labelling semantic roles, including model types, labelling strategies, degrees of granularity and post-processing.

3.1.1. Data types

The input data of an SRL system are typically syntactically parsed sentences. There are two common syntactic representations, namely bracketed trees and dependency trees. Some systems use bracketed trees of sentences as input data. A bracketed tree of a sentence is the tree of nested constituents representing its constituency structure. Other systems use dependency trees, which represent dependencies between the individual words of a sentence. A syntactic dependency represents the fact that the presence of a word is licensed by another word, which is its governor. In a typed dependency analysis, grammatical labels are added to the dependencies to mark their grammatical relations, for example nominal subject (nsubj) or direct object (dobj).
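To make the bracketed-tree input concrete, the following minimal Python sketch parses the VietTreeBank example “Kia là những ngôi nhà vách đất.” from Section 2.2.2 and applies three of the coarse-grained rules (SUB gives Arg0, the head verb gives Rel, a DOB or NP phrase following the predicate gives Arg1). The names Node, parse_bracketed and coarse_roles are hypothetical, and treating the V-H child of the VP as the predicate is an assumption made for this illustration; it is not the conversion tool used to build the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str                          # e.g. "NP-SUB", "V-H", "N"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None          # set on pre-terminal nodes only

    def words(self) -> List[str]:
        """Words covered by this node, from left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.words()]

    def function_tag(self) -> str:
        """Part after the first hyphen, e.g. "SUB" for "NP-SUB"."""
        return self.label.split("-", 1)[1] if "-" in self.label else ""


def parse_bracketed(text: str) -> Node:
    """Parse a single bracketed tree such as "(S (NP-SUB (P-H Kia)) ...)"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        pos += 1                        # skip "("
        node = Node(tokens[pos])        # the label follows the bracket
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.word = tokens[pos]
                pos += 1
        pos += 1                        # skip ")"
        return node

    return read()


def coarse_roles(clause: Node) -> Dict[str, str]:
    """Apply three illustrative rules of Table 2 to one clause (assumed layout)."""
    roles: Dict[str, str] = {}
    for child in clause.children:
        if child.function_tag() == "SUB":
            roles["Arg0"] = " ".join(child.words())
        elif child.label.split("-")[0] == "VP":
            seen_predicate = False
            for part in child.children:
                if part.label == "V-H":                 # assumed predicate marker
                    roles["Rel"] = " ".join(part.words())
                    seen_predicate = True
                elif seen_predicate and (part.function_tag() == "DOB"
                                         or part.label.split("-")[0] == "NP"):
                    roles["Arg1"] = " ".join(part.words())
    return roles


EXAMPLE = ("(S (NP-SUB (P-H Kia)) (VP (V-H là) (NP (L những) (Nc-H ngôi) "
           "(N nhà) (NP (N-H vách) (N đất)))) (. .))")
tree = parse_bracketed(EXAMPLE)
print(coarse_roles(tree))
```

On this sentence the sketch reproduces the annotation described in Section 2.2.2: “Kia” as Arg0, “là” as the predicate and “những ngôi nhà vách đất” as Arg1.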


Figure 5 shows the bracketed tree and the dependency tree of an example sentence.

Figure 5. Bracketed and dependency trees for sentence Nam đá bóng (Nam plays football).

3.1.2. SRL strategy

Input structures

The first step of an SRL system is to extract constituents that are more likely to be arguments or parts of arguments. This step is called argument candidate extraction. Most SRL systems for English use the 1-1 node-mapping method to find candidates. This method searches all nodes in a parse tree and maps constituents to arguments. Many systems use a pruning strategy on bracketed trees to better identify argument candidates [8].

Model types

In a second step, each argument candidate is labelled with a semantic role. Every SRL system has a classification model, which can be classified into two types: independent model or joint model. While an independent model decides the label of each argument candidate independently of other candidates, a joint model finds the best overall labelling for all candidates in the sentence at the same time. Independent models are fast but are prone to inconsistencies such as argument overlap, argument repetition or missing arguments. For example, Figure 6 shows some examples of these inconsistencies when analyzing the Vietnamese sentence Do học chăm, Nam đã đạt thành tích cao (By studying hard, Nam got a high achievement).

(a) Overlapping argument

(b) Repeated argument

(c) Missing argument

Figure 6. Examples of some inconsistencies.

Labelling strategies

Strategies for labelling semantic roles are diverse, but they can be classified into three main strategies. Most systems use a two-step approach consisting of identification and classification [21, 22]. The first step identifies arguments from many candidates, which is essentially a binary classification problem. The second step classifies the identified arguments into particular semantic roles. Some systems use a single classification step by adding a “null” label to the semantic roles, denoting that the candidate is not an argument [23]. Other systems consider SRL as a sequence tagging problem [24, 25].

Granularity

Existing SRL systems use different degrees of granularity when considering constituents. Some systems use individual words as their input and perform sequence tagging to identify arguments. This method is called the word-by-word (W-by-W) approach. Other systems use syntactic phrases as input constituents. This method is called the constituent-by-constituent (C-by-C) approach. Compared to the W-by-W approach, the C-by-C approach has two main advantages. First, phrase boundaries are usually consistent with argument boundaries. Second, the C-by-C approach allows us to work with larger contexts due to a smaller number of candidates in comparison to the W-by-W approach. Figure 7 presents an example of the C-by-C and W-by-W approaches.

(a) Example of C-by-C

(b) Example of W-by-W

Figure 7. C-by-C and W-by-W approaches.
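The difference in granularity can be illustrated on the sentence of Figure 4. The sketch below keeps the constituent-level labelling as a short list of (role, phrase) pairs and expands it into one tag per token, as a sequence tagger in the W-by-W setting would see it. The B-/I- prefix scheme and the name to_bio are assumptions made for this illustration (the paper does not specify a tag encoding), and tokens are obtained by splitting on whitespace, which ignores Vietnamese multi-syllable words.

```python
from typing import List, Tuple

# Constituent-by-constituent view: one decision per phrase.
# Roles are those listed for the sentence of Figure 4.
CONSTITUENTS: List[Tuple[str, str]] = [
    ("Arg0", "Ba đứa con anh"),
    ("ArgM-TMP", "đã"),
    ("Rel", "có"),
    ("Arg1", "việc làm ổn định"),
]


def to_bio(constituents: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Word-by-word view: one tag per whitespace-separated token."""
    tagged = []
    for label, text in constituents:
        for k, token in enumerate(text.split()):
            prefix = "B" if k == 0 else "I"   # B = first token, I = continuation
            tagged.append((token, f"{prefix}-{label}"))
    return tagged


word_tags = to_bio(CONSTITUENTS)
print(len(CONSTITUENTS), "constituent decisions versus", len(word_tags), "token decisions")
for token, tag in word_tags:
    print(f"{token}\t{tag}")
```

The constituent view needs far fewer decisions than the token view for the same sentence, which is the second advantage of the C-by-C approach mentioned above.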


Post-processing

To improve the final result, some systems use post-processing to correct argument labels. Common post-processing methods include re-ranking, Viterbi search and integer linear programming (ILP).

3.2. Our approach

The previous subsection has reviewed existing techniques for SRL which have been published so far for well-studied languages. In this section, we first show that these techniques per se cannot give a good result for Vietnamese SRL, due to some inherent difficulties, both in terms of language characteristics and of the available corpus. We then develop a new algorithm for extracting candidate constituents for use in the identification step.

Some difficulties of Vietnamese SRL are related to its SRL corpus. As presented in the previous section, this SRL corpus has 5,460 annotated sentences, which is much smaller than the SRL corpora of other languages. For example, the English PropBank contains about 50,000 sentences, which is about ten times larger. While smaller in size, the Vietnamese PropBank has more semantic roles than the English PropBank: 28 roles compared to 21 roles. This makes the unavoidable data sparseness problem more severe for Vietnamese SRL than for English SRL.

In addition, our extensive inspection and experiments on the Vietnamese PropBank have uncovered that this corpus has many annotation errors, largely due to encoding problems and inconsistencies in annotation. In many cases, we had to fix these annotation errors ourselves. In other cases, where only one proposition of a complex sentence is incorrectly annotated, we perform an automatic preprocessing procedure to drop it, leaving the correctly annotated propositions untouched. We finally come up with a corpus of 4,800 sentences which are annotated with semantic roles.

A major difficulty of Vietnamese SRL is due to the nature of the language, whose linguistic characteristics are different from those of occidental languages [26]. We first try to apply the common node-mapping algorithm, which is widely used in English SRL systems, to the Vietnamese corpus. However, this application gives us a very poor performance. Therefore, in the identification step, we develop a new algorithm for extracting candidate constituents which is much more accurate for Vietnamese than the node-mapping algorithm. Details of the experimental results will be provided in Section 4.

In order to improve the accuracy of the classification step, and hence of our SRL system as a whole, we have integrated many useful features for use in two statistical classification models, namely Maximum Entropy (ME) and Support Vector Machines (SVM). On the one hand, we adapt the features which have been proved to be good for SRL of English. On the other hand, we propose some novel features, including function tags, predicate type and distance. Moreover, to further improve the performance of our system, we introduce some appropriate constraints and apply a post-processing method using ILP. Finally, to better handle unseen words, we generalize the system by integrating distributed word representations.

In the next paragraphs, we first present our constituent extraction algorithm to get inputs for the identification step and then the ILP post-processing method. Details of the features used in the classification step and the effect of distributed word representations in SRL will be presented in Section 4.

3.2.1. Constituent extraction algorithm

Our algorithm derives from the pruning algorithm for English [27] with some modifications. While the original algorithm collects sisters of the current node, our algorithm checks whether or not the children of each sister have the same phrase label and have different function labels from their parent. If they have the same phrase labels and different function labels from their parent, our algorithm collects each of them as an argument candidate. Otherwise, their parent is collected as a candidate. In addition, we remove the constraint of the original algorithm that does not collect coordinated nodes.

This algorithm aims to extract constituents from a bracketed tree which are associated with their corresponding predicates of the sentence.
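The procedure just described, whose pseudo code is given in Algorithm 1, can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: the Node class and extract_constituents are hypothetical names; following Algorithm 1, the children of a sister are compared with the first child; a sister that fails the test is collected as a whole; and the tree for “Bà nói nó là con trai tôi mà” is an assumed shape, since only the figure shows the real one.

```python
from typing import List, Optional


class Node:
    def __init__(self, label: str, *children: "Node", word: Optional[str] = None):
        self.label = label                      # e.g. "NP-SUB" or "V-H"
        self.word = word                        # only pre-terminals carry a word
        self.children: List["Node"] = list(children)
        self.parent: Optional["Node"] = None
        for child in self.children:
            child.parent = self

    def siblings(self) -> List["Node"]:
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]

    def is_phrase(self) -> bool:
        # Assumption: a phrasal node is any node without a word of its own.
        return self.word is None

    def phrase_type(self) -> str:
        return self.label.split("-", 1)[0]

    def function_tag(self) -> str:
        return self.label.split("-", 1)[1] if "-" in self.label else ""

    def text(self) -> str:
        if self.word is not None:
            return self.word
        return " ".join(child.text() for child in self.children)


def extract_constituents(predicate: Node) -> List[str]:
    """Walk from the predicate up to the root, collecting argument candidates."""
    collected: List[str] = []
    current = predicate
    while current.parent is not None:           # stop once the root is reached
        for sister in current.siblings():
            kids = sister.children
            split = len(kids) > 1 and kids[0].is_phrase()
            if split:
                first = kids[0]
                same_type = all(k.phrase_type() == first.phrase_type() for k in kids[1:])
                diff_tag = all(k.function_tag() != first.function_tag() for k in kids[1:])
                split = same_type and diff_tag
            if split:
                collected.extend(k.text() for k in kids)   # each child is a candidate
            else:
                collected.append(sister.text())            # the sister is a candidate
        current = current.parent
    return collected


# Assumed tree for "Bà nói nó là con trai tôi mà" with predicate "là".
predicate = Node("V-H", word="là")
tree = Node(
    "S",
    Node("NP-SUB", Node("N-H", word="Bà")),
    Node(
        "VP",
        Node("V-H", word="nói"),
        Node(
            "SBAR",
            Node(
                "S",
                Node("NP-SUB", Node("P-H", word="nó")),
                Node("VP", predicate,
                     Node("NP", Node("N-H", word="con trai"),
                          Node("P", word="tôi"), Node("T", word="mà"))),
            ),
        ),
    ),
)
print(extract_constituents(predicate))
```

With this assumed tree, the candidates are collected bottom-up, starting from the sister of the predicate and ending with the subject of the main clause.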


If the sentence has multiple predicates, multiple constituent sets corresponding to the predicates are extracted. The pseudo code of the algorithm is described in Algorithm 1.

Algorithm 1: Constituent Extraction Algorithm

Data: A bracketed tree T and its predicate
Result: A tree with constituents for the predicate
begin
  currentNode ← predicateNode
  while currentNode ≠ T.root() do
    for S ∈ currentNode.sibling() do
      sameType ← false
      diffTag ← false
      if |S.children()| > 1 and S.children().get(0).isPhrase() then
        sameType ← true
        diffTag ← true
        phraseType ← S.children().get(0).phraseType()
        funcTag ← S.children().get(0).functionTag()
        for i ← 1 to |S.children()| − 1 do
          if S.children().get(i).phraseType() ≠ phraseType then
            sameType ← false
            break
          if S.children().get(i).functionTag() = funcTag then
            diffTag ← false
            break
      if sameType and diffTag then
        for child ∈ S.children() do
          T.collect(child)
      else
        T.collect(S)
    currentNode ← currentNode.parent()
  return T

This algorithm uses several simple functions. The root() function gets the root of a tree. The children() function gets the children of a node. The sibling() function gets the sisters of a node. The isPhrase() function checks whether a node is of phrasal type or not. The phraseType() and functionTag() functions extract the phrase type and the function tag of a node, respectively. Finally, the collect(node) function collects words from the leaves of the subtree rooted at a node and creates a constituent.

Figure 8. Extracting constituents of the sentence "Bà nói nó là con trai tôi mà" at predicate "là".

Figure 8 shows an example of running the algorithm on the sentence Bà nói nó là con trai tôi mà (You said that he is my son). First, we


find the current predicate node V-H là (is). The current node has only one sibling NP node. This NP node has three children, some of which have labels different from their parent's, so this node and its associated words are collected. After that, we set the current node to its parent and repeat the process until reaching the root of the tree. Finally, we obtain a tree with the following constituents for predicate là: Bà, nói, nó, and con trai tôi mà.

3.2.2. Integer linear programming

Because the system classifies arguments independently, labels assigned to arguments in a sentence may violate Vietnamese grammatical constraints. To prevent such violations and improve the result, we propose a post-processing procedure which finds the best global assignment that also satisfies grammatical constraints. Our work is based on the ILP method for the English PropBank [28]. Some constraints that are unique to Vietnamese are also introduced and incorporated.

Integer programs are almost identical to linear programs. The cost function and the constraints are all in linear form. The only difference is that the variables in ILP can only take integer values. A general binary ILP can be stated as follows.

Given a cost vector $p \in R^d$, a set of variables $z = (z_1, \dots, z_d) \in R^d$, and cost matrices $C_1 \in R^{t_1} \times R^d$, $C_2 \in R^{t_2} \times R^d$, where $t_1, t_2$ are the numbers of inequality and equality constraints and $d$ is the number of binary variables, the ILP solution $\hat{z}$ is the vector that maximizes the cost function:

$$\hat{z} = \operatorname{argmax}_{z \in \{0,1\}^d} \; p^{\top} z \quad \text{subject to} \quad C_1 z \ge b_1, \; C_2 z = b_2. \tag{1}$$

argument $S^i$. The aim of the system is to find the maximal overall score of the arguments:

$$\hat{c}^{1:M} = \operatorname{argmax}_{c^{1:M} \in P^M} score(S^{1:M} = c^{1:M}) \tag{2}$$

$$= \operatorname{argmax}_{c^{1:M} \in P^M} \sum_{i=1}^{M} score(S^i = c^i) \tag{3}$$

ILP Constraints

In this paragraph, we propose a constraint set for our SRL system. Some of the constraints are directly inspired by and derived from results for English SRL; others are constraints that we specify uniquely to account for Vietnamese specificities. The constraint set includes:

1. One argument can take only one type.

2. Arguments cannot overlap with the predicate in the sentence.

3. Arguments cannot overlap other arguments in the sentence.

4. There is no duplicating argument phenomenon for core arguments in the sentence.

5. If the predicate is not of verb type, there are only two types of core arguments, Arg0 and Arg1.

In particular, constraints 1 to 4 are derived from the ILP method for English [28], while constraint 5 is designed specifically for Vietnamese.

ILP Formulation

To find the best overall labelling satisfying these constraints, we transform our system into an ILP problem. First, let $z_{ic} = [S^i = c]$ be the binary variable that shows whether or not $S^i$ is labelled with argument type $c$. We denote $p_{ic} = score(S^i = c)$. The objective function of the optimization problem can be written as:

$$\operatorname{argmax}_{z} \sum_{i=1}^{M} \sum_{c=1}^{|P|} p_{ic} z_{ic}. \tag{4}$$



z = argmax p  z subject to





d

C₂z = b₂

^z0^,1

where

z0,1

i=1 c=1

d

Next, each constraint proposed above can

be reformulated as follows:

.

b ,b₂R

1

Our system attempts to find exact roles for

argument candidate set for each sentence. This

set is denoted as S^1:M, where the index ranged

1. One argument can take only one type.

|P|

z_ic=1, i[1,M ]. (5)



from

1

to

M

; and the argument role set is

c=1

2. Arguments cannot overlap with the

predicate in the sentence.

denoted as

P

. Assuming that the classifier

returns a score, score(Sⁱ= cⁱ) , corresponding

to the likelihood of assigning label cⁱto

3.

Arguments cannot overlap other

arguments in the sentence. If there are

k

L.H. Phuong et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 39-58

49

arguments $S^1, S^2, \dots, S^k$ that contain the same word of the sentence, we can conclude that there are at least $k - 1$ arguments that are classified as "null":

$$\sum_{i=1}^{k} z_{ic} \ge k - 1 \quad (c = \text{"null"}). \qquad (6)$$

This constraint has been satisfied by our constituent extraction approach. Thus, we do not need to add this constraint in the post-processing step if the constituent extraction algorithm has been used.

4. There is no duplicating argument phenomenon for core arguments in the sentence.

$$\sum_{i=1}^{M} z_{ic} \le 1, \quad \forall c \in \{Arg0, Arg1, Arg2, Arg3, Arg4\}. \qquad (7)$$

5. If the predicate is not verb type, there are only 2 types of core argument Arg0 and Arg1.

$$\sum_{i=1}^{M} z_{ic} = 0, \quad \forall c \in \{Arg2, Arg3, Arg4\}. \qquad (8)$$

1. Phrase type: This is a very useful feature in classifying semantic roles because different roles tend to have different syntactic categories. For example, in the sentence in Figure 8, Bà nói nó là con trai tôi mà, the phrase type of constituent nó is NP.

2. Parse tree path: This feature captures the syntactic relation between a constituent and a predicate in a bracketed tree. This is the shortest path from a constituent node to a predicate node in the tree. We use either symbol ↑ or symbol ↓ to indicate the upward direction or the downward direction, respectively. For example, the parse tree path from constituent nó to the predicate là is NP↑S↓VP↓V.

3. Position: Position is a binary feature that describes whether the constituent occurs after or before the predicate. It takes value 0 if the constituent appears before the predicate in the sentence or value 1 otherwise. For example, the position of constituent nó in Figure 8 is 0 since it appears before predicate là.

4. Voice: Sometimes, the differentiation between active and passive voice is useful. For example, in an active sentence, the subject is usually an Arg0 while in a passive sentence, it is often an Arg1. The voice feature is also a binary feature, taking value 1 for active voice or 0 for passive voice. The sentence in Figure 8 is of active voice, thus its voice feature value is 1.

5. Head word: This is the first word of a phrase. For example, the head word for the phrase con trai tôi mà is con trai.

In the next section, we present experimental

results, system evaluation and discussions.
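The effect of the ILP post-processing of Section 3.2.2 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: instead of calling an ILP solver, it enumerates every label assignment by brute force, which is only feasible for a handful of candidates. The function name, the label set and the example scores are hypothetical. Constraint 1 (Equation 5) holds by construction, constraints 4 and 5 (Equations 7 and 8) are checked explicitly, and constraints 2 and 3 are assumed to be already guaranteed by the candidate extraction step.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

CORE_ROLES = ("Arg0", "Arg1", "Arg2", "Arg3", "Arg4")
NON_VERB_FORBIDDEN = ("Arg2", "Arg3", "Arg4")


def best_labelling(scores: Sequence[Dict[str, float]],
                   predicate_is_verb: bool = True) -> Tuple[List[str], float]:
    """Return the highest-scoring labelling that satisfies the constraints.

    scores[i][c] plays the role of p_ic = score(S^i = c); every candidate is
    assumed to carry a score for the same set of labels.
    """
    if not scores:
        return [], 0.0
    labels = sorted(scores[0])
    best: List[str] = []
    best_total = float("-inf")
    # One label per candidate: constraint 1 is satisfied by construction.
    for assignment in product(labels, repeat=len(scores)):
        # Constraint 4: each core role is used at most once in the sentence.
        if any(assignment.count(role) > 1 for role in CORE_ROLES):
            continue
        # Constraint 5: a non-verb predicate only takes Arg0 and Arg1 as core roles.
        if not predicate_is_verb and any(c in NON_VERB_FORBIDDEN for c in assignment):
            continue
        total = sum(s[c] for s, c in zip(scores, assignment))
        if total > best_total:
            best, best_total = list(assignment), total
    return best, best_total


# Hypothetical classifier scores for three argument candidates.
EXAMPLE_SCORES = [
    {"Arg0": 0.9, "Arg1": 0.1, "Arg2": 0.0, "null": -0.5},
    {"Arg0": 0.6, "Arg1": 0.5, "Arg2": -0.2, "null": -0.1},
    {"Arg0": -0.3, "Arg1": 0.2, "Arg2": 0.4, "null": 0.0},
]
```

With these scores, labelling each candidate independently would assign Arg0 twice. The constrained search instead moves the second candidate to Arg1, and for a non-verb predicate the third candidate falls back to "null".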

4. Evaluation

In this section, we describe the evaluation of our SRL system. First, we introduce two feature sets used in machine learning classifiers. Then, the evaluation results are presented and discussed. Next, we report the improved results obtained by using the integer linear programming inference method. Finally, we present the efficacy of distributed word representations in generalizing the system to unseen words.

4.1. Feature sets

We use two feature sets in this study. The first one is composed of basic features which are commonly used in SRL systems for English. This feature set is used in the SRL system of Gildea and Jurafsky [5] on the FrameNet corpus.

4.1.1. Basic features

This feature set consists of 6 feature templates, as follows:

6. Subcategorization: The subcategorization feature captures the tree that has the concerned predicate as its child. For example, in Figure 8, the subcategorization of the predicate là is VP(V, NP).

4.1.2. New features

Preliminary investigations on the basic feature set give us a rather poor result. Therefore, we propose some novel features so as to improve the accuracy of the system. These features are as follows:

1. Function tag: The function tag is a useful piece of information, especially for classifying adjunct arguments. It determines a constituent's role; for example, the function tag of constituent nó is SUB, indicating that it has a subjective role.

2. Distance: This feature records the length of the full parse tree path before pruning. For example, the distance from constituent nó to the predicate là is 3.

3. Predicate type: Unlike in English, the type of predicates in Vietnamese is much more complicated. It is not only a verb, but is also a noun, an adjective, or a preposition. Therefore, we propose a new feature which captures predicate types. For example, the predicate type of the concerned predicate is V.

4.2. Results and discussions

4.2.1. Evaluation Method

We use a 10-fold cross-validation method to evaluate our system. The final accuracy scores are the averages of the scores of the 10 runs.

The evaluation metrics are the precision, recall and F1-measure. The precision (P) is the proportion of labelled arguments identified by the system which are correct; the recall (R) is the proportion of labelled arguments in the gold results which are correctly identified by the system; and the F1-measure is the harmonic mean of P and R, that is F1 = 2PR/(P + R).

4.2.2. Baseline system

In the first experiment, we compare our constituent extraction algorithm to the 1-1 node mapping and the pruning algorithm [28]. Table 3 shows the performance of the three extraction algorithms.

Table 3. Accuracy of three extraction algorithms

            1-1 Node Mapping Alg.   Pruning Alg.   Our Extraction Alg.
Precision   29.58%                  85.05%         82.15%
Recall      45.82%                  79.39%         86.12%
F1          35.93%                  82.12%         84.08%

We see that our extraction algorithm significantly outperforms the 1-1 node mapping algorithm, in both the precision and the recall ratios. It is also better than the pruning algorithm. In particular, the precision of the 1-1 node mapping algorithm is only 29.58%; it means that this method captures many candidates which are not arguments. In contrast, our algorithm is able to identify a large number of correct argument candidates, particularly with the recall ratio of 86.12% compared to 79.39% of the pruning algorithm. This result also shows that we cannot take for granted that a good algorithm for English could also work well for another language of different characteristics.

In the second experiment, we continue to compare the performance of the extraction algorithms, this time at the final classification step, and get the baseline for Vietnamese SRL. The classifier we use in this experiment is a Support Vector Machine (SVM) classifier⁵. Table 4 shows the accuracy of the baseline system.

Table 4. Accuracy of the baseline system

            1-1 Node Mapping Alg.   Pruning Alg.   Our Extraction Alg.
Precision   66.19%                  73.63%         73.02%
Recall      29.34%                  62.79%         67.16%
F1          40.66%                  67.78%         69.96%

Once again, this result confirms that our algorithm achieves the better result. The F1 of our baseline SRL system is 69.96%, compared to 40.66% of the 1-1 node mapping and 67.78% of the pruning system. This result can be explained by the fact that the 1-1 node mapping and the pruning algorithm have a low recall ratio, because they identify many argument candidates incorrectly.

4.2.3. Labelling strategy

In the third experiment, we compare two labelling strategies for Vietnamese SRL. In addition to the SVM classifier, we also try the Maximum Entropy (ME) classifier, which usually gives good accuracy in a wide variety of classification problems⁶. Table 5 shows the F1 scores of different labelling strategies.

________

⁵ We use the linear SVM classifier with L2 regularization provided by the scikit-learn software package. The regularization term is fixed at 0.1.

⁶ We use the logistic regression classifier with L2 regularization provided by the scikit-learn software package. The regularization term is fixed at 1.
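The evaluation metrics of Section 4.2.1 are simple enough to state as code. The sketch below is an illustration only; the function names and the argument counts in the usage note are hypothetical. Because the paper averages the scores over 10 folds, recomputing F1 from an averaged precision and recall may differ from a reported F1 in the last digit.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, F1 = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(num_correct: int, num_predicted: int,
                        num_gold: int) -> Tuple[float, float, float]:
    """Compute P, R and F1 from argument counts.

    num_correct:   labelled arguments proposed by the system that are correct
    num_predicted: all labelled arguments proposed by the system
    num_gold:      all labelled arguments in the gold annotation
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return precision, recall, f1_score(precision, recall)
```

For instance, `precision_recall_f1(6, 8, 10)` corresponds to a system that proposes 8 arguments, 6 of them correct, against 10 gold arguments.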

Table 5. Accuracy of two labelling strategies

                  ME       SVM
1-step strategy   69.79%   69.96%
2-step strategy   69.28%   69.38%

We see that the performance of the SVM classifier is slightly better than that of the ME classifier. The best accuracy is obtained by using the 1-step strategy with the SVM classifier. The current SRL system achieves an F1 score of 69.96%.

4.2.4. Feature analysis

In the fourth experiment, we analyse and evaluate the impact of each individual feature on the accuracy of our system so as to find the best feature set for our Vietnamese SRL system. We start with the basic feature set presented previously, denoted by Φ₀, and augment it with modified and new features as shown in Table 6. The accuracy of these feature sets is shown in Table 7.

Table 6. Feature sets

Feature Set   Description
Φ₁            Φ₀ ∪ {Function Tag}
Φ₂            Φ₀ ∪ {Predicate Type}
Φ₃            Φ₀ ∪ {Distance}

Table 7. Accuracy of feature sets in Table 6

Feature Set   Precision   Recall   F1
Φ₀            73.02%      67.16%   69.96%
Φ₁            77.38%      71.20%   74.16%
Φ₂            72.98%      67.15%   69.94%
Φ₃            73.04%      67.21%   70.00%

We notice that amongst the three features, function tag is the most important feature, which increases the accuracy of the baseline feature set by about 4% of F1 score. The distance feature also helps increase the accuracy slightly. We thus consider the fourth feature set Φ₄ defined as

Φ₄ = Φ₀ ∪ {Function Tag} ∪ {Distance}.

In the fifth experiment, we investigate the significance of individual features to the system by removing them, one by one, from the feature set Φ₄. By doing this, we can evaluate the importance of each feature to our overall system. The feature sets and their corresponding accuracy are presented in Table 8 and Table 9 respectively.

Table 8. Feature sets (continued)

Feature Set   Description
Φ₅            Φ₄ \ {Function Tag}
Φ₆            Φ₄ \ {Distance}
Φ₇            Φ₄ \ {Head Word}
Φ₈            Φ₄ \ {Path}
Φ₉            Φ₄ \ {Position}
Φ₁₀           Φ₄ \ {Voice}
Φ₁₁           Φ₄ \ {Subcategorization}
Φ₁₂           Φ₄ \ {Predicate}
Φ₁₃           Φ₄ \ {Phrase Type}

Table 9. Accuracy of feature sets in Table 8

Feature Set   Precision   Recall   F1
Φ₄            77.53%      71.29%   74.27%
Φ₅            73.04%      67.21%   70.00%
Φ₆            77.38%      71.20%   74.16%
Φ₇            73.74%      67.17%   70.29%
Φ₈            77.58%      71.10%   74.20%
Φ₉            77.39%      71.39%   74.26%
Φ₁₀           77.51%      71.24%   74.24%
Φ₁₁           77.53%      71.46%   74.37%
Φ₁₂           77.38%      71.41%   74.27%
Φ₁₃           77.86%      70.99%   74.26%

We see that the accuracy increases slightly when the subcategorization feature (Φ₁₁) is removed. For this reason, we remove only the subcategorization feature. The best feature set includes the following features: predicate, phrase type, function tag, parse tree path,

distance, voice, position and head word. The accuracy of our system with this feature set is 74.37% of F1 score.

4.2.5. Improvement via integer linear programming

As discussed previously, after classifying the arguments, we use the ILP method to help improve the overall accuracy. In the sixth experiment, we set up an ILP to find the best labelling satisfying the constraints presented earlier⁷. The score p_ic = score(Sⁱ = c) is the signed distance of that argument to the hyperplane. We also compare our ILP system with the ILP method for English by using only constraints from 1 to 4. The improvement given by ILP is shown in Table 10. We see that ILP increases the performance by about 0.4% and, when adding constraint 5, the result is slightly better. The accuracy for each argument type is shown in Table 11.

Table 10. The impact of ILP

     Precision   Recall   F1
A    77.53%      71.46%   74.37%
B    78.28%      71.48%   74.72%
C    78.29%      71.48%   74.73%

A: Without ILP
B: With ILP (not using constraint 5)
C: With ILP (using constraint 5)

Table 11. Accuracy of each argument type

               Precision   Recall   F1
Arg0           93.92%      97.34%   95.59%
Arg1           68.97%      82.38%   75.03%
Arg2           56.87%      46.62%   50.78%
Arg3           3.33%       5.00%    4.00%
Arg4           61.62%      22.01%   31.17%
ArgM-ADJ       0.00%       0.00%    0.00%
ArgM-ADV       60.18%      44.80%   51.17%
ArgM-CAU       61.96%      47.63%   50.25%
ArgM-COM       41.90%      78.72%   52.53%
ArgM-DIR       41.21%      23.01%   29.30%
ArgM-DIS       60.79%      56.37%   58.25%
ArgM-DSP       0.00%       0.00%    0.00%
ArgM-EXT       70.10%      77.78%   73.19%
ArgM-GOL       0.00%       0.00%    0.00%
ArgM-I         0.00%       0.00%    0.00%
ArgM-LOC       59.26%      75.56%   66.21%
ArgM-LVB       0.00%       0.00%    0.00%
ArgM-MNR       56.06%      52.00%   53.70%
ArgM-MOD       76.57%      84.77%   80.33%
ArgM-NEG       85.21%      94.24%   89.46%
ArgM-PRD       22.00%      13.67%   15.91%
ArgM-PRP       70.38%      70.96%   70.26%
ArgM-Partice   38.76%      17.51%   22.96%
ArgM-REC       45.00%      48.00%   45.56%
ArgM-RES       2.00%       6.67%    9.52%
ArgM-TMP       78.86%      93.09%   85.36%

A detailed investigation of our constituent extraction algorithm reveals that it can account for about 86% of possible argument candidates. Although this coverage ratio is relatively high, it is not exhaustive. One natural question to ask is whether an exhaustive search of argument candidates could improve the accuracy of the system or not. Thus, in the seventh experiment, we replace our constituent extraction algorithm by an exhaustive search where all nodes of a syntactic tree are taken as possible argument candidates. Then, we add the third constraint to the ILP post-processing step as presented above (Arguments cannot overlap other arguments in the sentence). An accuracy comparison of the two constituent extraction algorithms is shown in Table 12.

Table 12. Accuracy of two extraction algorithms

            Getting All Nodes   Our Extraction Alg.
Precision   19.56%              82.15%
Recall      93.25%              86.12%
F1          32.23%              84.08%

Taking all nodes of a syntactic tree helps increase the number of candidate arguments to a coverage ratio of 93.25%. However, it also proposes many wrong candidates, as shown by a low precision ratio. Table 13 shows the accuracy of our system in the two candidate extraction approaches.

Table 13. Accuracy of our system

            Getting All Nodes   Our Extraction Alg.
Precision   77.99%              78.29%
Recall      62.50%              71.48%
F1          69.39%              74.73%

________

⁷ We use the GLPK solver provided by the PuLP software package, available at https://pythonhosted.org/PuLP/.

We see that an exhaustive search of candidates helps present more possible constituent candidates, but it makes the performance of the system worse than the constituent extraction algorithm (69.39% compared to 74.73% of F1 ratio). One plausible explanation is that the more candidates a classifier has to consider, the more likely it is to make wrong classification decisions, which results in worse accuracy of the overall system. In addition, a large number of candidates makes the system slower to run. In our experiment, we see the training time increased fourfold when the exhaustive search approach was used instead of our constituent extraction algorithm.

4.2.6. Learning curve

In the ninth experiment, we investigate the dependence of accuracy on the size of the training dataset. Figure 9 depicts the learning curve of our system when the data size is varied.

Figure 9. Learning curve of the system.

It seems that the accuracy of our system improves only slightly starting from the dataset of about 2,000 sentences. Nevertheless, the curve has not converged, indicating that the system could achieve a better accuracy when a larger dataset is available.

4.3. Generalizing to unseen words

In this section, we report our effort to extend the applicability of our SRL system to new text domains where rare or unknown words are common. As seen in the previous systems, some important features of our SRL system are word features including predicates and head words.

As in most NLP tasks, the words are usually encoded as symbolic identifiers which are drawn from a vocabulary. Therefore, they are often represented by one-hot vectors (also called indicator vectors) of the same length as the size of the vocabulary. This representation suffers from two major problems. The first problem is data sparseness, that is, the parameters corresponding to rare or unknown words are poorly estimated. The second problem is that it is not able to capture the semantic similarity between closely related words. This limitation of the one-hot word representation has motivated unsupervised methods for inducing word representations over large, unlabelled corpora.

Recently, distributed representations of words have been shown to be advantageous for many natural language processing tasks. A distributed representation is dense, low dimensional and real-valued. Distributed word representations are called word embeddings. Each dimension of the embedding represents a latent feature of the word which hopefully captures useful syntactic and semantic similarities [29].

Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model. Historically, training and testing of neural language models has been slow, scaling as the size of the vocabulary for each model computation [30]. However, many approaches have been recently proposed to speed up the training process, allowing scaling to very large corpora [31, 32, 33, 34].

Another method to produce word embeddings has been introduced recently by the natural language processing group at Stanford University [35]. They proposed a global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods.

We present in subsections 4.3.1 and 4.3.2 how we use a neural language model and a global log-bilinear regression model, respectively, to produce word embeddings for Vietnamese which are used in this study.

4.3.1. Skip-gram Model

We use word embeddings produced by Mikolov's continuous Skip-gram model using the neural network and source code introduced in [36]. The continuous skip-gram model itself is described in detail in [34].

For our experiments we used a continuous skip-gram window of size 2, i.e. the actual context size for each training sample is a random number up to 2. The neural network uses the central word in the context to predict the other words, by maximizing the average conditional log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j=-c}^{c} \log p(w_{t+j} \mid w_t), \qquad (9)$$

where $\{w_i : i \in T\}$ is the whole training set, $w_t$ is the central word and the $w_{t+j}$ are on either side of the context. The conditional probabilities are defined by the softmax function

$$p(a \mid b) = \frac{\exp(o_a^\top i_b)}{\sum_{w \in V} \exp(o_w^\top i_b)}, \qquad (10)$$

where $i_w$ and $o_w$ are the input and output vectors of $w$ respectively, and $V$ is the vocabulary. For computational efficiency, Mikolov's training code approximates the softmax function by the hierarchical softmax, as defined in [31]. Here the hierarchical softmax is built on a binary Huffman tree with one word at each leaf node. The conditional probabilities are calculated according to the decomposition:

$$p(a \mid b) = \prod_{i=1}^{l} p(d_i(a) \mid d_1(a) \dots d_{i-1}(a), b), \qquad (11)$$

where $l$ is the path length from the root to the node $a$, and $d_i(a)$ is the decision at step $i$ on the path (for example, 0 if the next node is the left child of the current node, and 1 if it is the right child). If the tree is balanced, the hierarchical softmax only needs to compute around $\log_2 |V|$ nodes in the tree, while the true softmax requires computing over all $|V|$ words.

The training code was obtained from the tool word2vec⁸ and we used frequent word subsampling as well as a word appearance threshold of 5. The output dimension is set to 50, i.e. each word is mapped to a unit vector in R⁵⁰. This is deemed adequate for our purpose without overfitting the training data. Figure 10 shows the scatter plot of some Vietnamese words which are projected onto the first two principal components after performing the principal component analysis of all the word distributed representations. We can see that semantically related words are grouped closely together.

Figure 10. Some Vietnamese words produced by the Skip-gram model, projected onto two dimensions.

4.3.2. GloVe model

Pennington, Socher, and Manning [35] introduced the global vector model for learning word representations (GloVe). Similar to the Skip-gram model, GloVe is a local context window method but it has the advantages of the global matrix factorization method.

________

⁸ http://code.google.com/p/word2vec/
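As a rough numerical illustration of the GloVe objective, the weighted least-squares cost of Equation (13) and the weighting function of Equation (14) in this section can be sketched in plain Python. This is not the GloVe tool's code: the function names and the toy inputs are hypothetical, and the default values x_max = 100 and α = 0.75 are assumptions taken from the GloVe paper [35], not values stated in this article. Pairs with a zero count are skipped, since their weight is zero.

```python
import math
from typing import Dict, List, Tuple


def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Weighting function f(x): (x / x_max)^alpha below x_max, 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_cost(counts: Dict[Tuple[int, int], float],
               vectors: List[List[float]],
               biases: List[float],
               x_max: float = 100.0,
               alpha: float = 0.75) -> float:
    """Sum of f(C_ij) * (w_i . w_j + b_i + b_j - log C_ij)^2 over observed pairs."""
    total = 0.0
    for (i, j), c_ij in counts.items():
        if c_ij <= 0:
            continue  # f(0) = 0, so empty cells contribute nothing
        dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
        residual = dot + biases[i] + biases[j] - math.log(c_ij)
        total += glove_weight(c_ij, x_max, alpha) * residual ** 2
    return total
```

The weighting function caps the influence of very frequent pairs at 1 and down-weights rare pairs, so that a few noisy low counts do not dominate the regression.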

The main idea of GloVe is to use word-word occurrence counts to estimate the co-occurrence probabilities rather than the probabilities by themselves. Let $P_{ij}$ denote the probability that word $j$ appears in the context of word $i$; $w_i \in R^d$ and $w_j \in R^d$ denote the word vectors of word $i$ and word $j$ respectively. It is shown that

$$w_i^\top w_j = \log(P_{ij}) = \log(C_{ij}) - \log(C_i), \qquad (12)$$

where $C_{ij}$ is the number of times word $j$ occurs in the context of word $i$.

It turns out that GloVe is a global log-bilinear regression model. Finding word vectors is equivalent to solving a weighted least-squares regression model with the cost function:

$$J = \sum_{i,j=1}^{n} f(C_{ij}) \left( w_i^\top w_j + b_i + b_j - \log(C_{ij}) \right)^2, \qquad (13)$$

where $n$ is the size of the vocabulary, $b_i$ and $b_j$ are additional bias terms and $f(C_{ij})$ is a weighting function. A class of weighting functions which are found to work well can be parameterized as

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad (14)$$

The training code was obtained from the tool GloVe⁹ and we used a word appearance threshold of 2,000. Figure 11 shows the scatter plot of the same words as in Figure 10, but this time their word vectors are produced by the GloVe model.

Figure 11. Some Vietnamese words produced by the GloVe model, projected onto two dimensions.

4.3.3. Text corpus

To create distributed word representations, we use a dataset consisting of 7.3GB of text from 2 million articles collected through a Vietnamese news portal¹⁰. The text is first normalized to lower case and all special characters are removed except these common symbols: the comma, the semicolon, the colon, the full stop and the percentage sign. All numeral sequences are replaced with the special token <number>, so that correlations between certain words and numbers are correctly recognized by the neural network or the log-bilinear regression model.

Each word in the Vietnamese language may consist of more than one syllable with spaces in between, which could be regarded as multiple words by the unsupervised models. Hence it is necessary to replace the spaces within each word with underscores to create full word tokens. The tokenization process follows the method described in [37].
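The text normalization steps of Section 4.3.3 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the regular expressions are one possible reading of "special characters" and "numeral sequences", and the greedy longest-match word joining over a small lexicon is only a stand-in for the tokenization method of [37].

```python
import re
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lower-case, strip special characters and mask numeral sequences."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep letters, digits, whitespace and the symbols , ; : . %
    text = re.sub(r"[^\w\s,;:.%]", " ", text)
    # Replace every numeral sequence (e.g. 2017 or 12,5) with one token.
    text = re.sub(r"\d+(?:[.,]\d+)*", "<number>", text)
    return re.sub(r"\s+", " ", text).strip()


def join_syllables(syllables: List[str], lexicon: Iterable[str],
                   max_len: int = 4) -> List[str]:
    """Greedily merge syllables that form a known multi-syllable word.

    The lexicon holds words written with spaces, e.g. "con trai"; merged
    words are returned with underscores, e.g. "con_trai".
    """
    words = set(lexicon)
    tokens: List[str] = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in words:
                tokens.append(candidate.replace(" ", "_"))
                i += n
                break
    return tokens
```

For example, with a lexicon containing only "con trai", the syllables of "bà nói nó là con trai tôi mà" are merged into seven tokens, with "con_trai" kept as a single word.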

________

⁹ http://nlp.stanford.edu/projects/glove/

¹⁰ http://www.baomoi.com

After removal of special characters and tokenization, the articles add up to 969 million word tokens, spanning a vocabulary of 1.5 million unique tokens. We train the unsupervised models with the full vocabulary to obtain the representation vectors, and then prune the collection of word vectors to the 65,000 most frequent words, excluding special symbols and the token <number> representing numeral sequences.

4.3.4. SRL with distributed word representations

We train the two word embedding models on the same text corpus presented in the previous subsections to produce distributed word representations, where each word is represented by a real-valued vector of 50 dimensions.

In the last experiment, we replace predicate or head word features in our SRL system by their corresponding word vectors. For predicates which are composed of multiple words, we first tokenize them into individual words and then average their vectors to get vector representations. Table 14 and Table 15 show the performance of the Skip-gram and GloVe models for the predicate feature and for the head word feature, respectively.

system. In other words, their use can help generalize the system to unseen words.

5. Conclusion

We have presented our work on developing a semantic role labelling system for the Vietnamese language. The system comprises two main components, a corpus and a software. Our system achieves a good accuracy of about 74.8% of F1 score.

We have argued that one cannot assume a good applicability of existing methods and tools developed for English and other occidental languages and that they may not offer a cross-language validity. For an isolating language such as Vietnamese, techniques developed for inflectional languages cannot be applied “as is”. In particular, we have developed an algorithm for extracting argument candidates which has a better accuracy than the 1-1 node mapping algorithm. We have proposed some novel features which are proved to be useful for

Vietnamese semantic role labelling, notably and

function

representations. We have employed integer

linear programming, recent inference