Dependency-based pre-ordering for English-Vietnamese statistical machine translation

VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 14-27  
Dependency-based Pre-ordering For English-Vietnamese  
Statistical Machine Translation  
Tran Hong Viet1,2,*, Nguyen Van Vinh2, Vu Thuong Huyen3, Nguyen Le Minh4  
1University of Economic and Technical Industries, Hanoi, Vietnam  
2VNU University of Engineering and Technology, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam  
3Thuy Loi University, Hanoi, Vietnam  
4Japan Advanced Institute of Science and Technology  
Abstract  
Reordering is a major challenge in machine translation (MT) between two languages with significant differences in word order. In this paper, we present a pre-processing approach for phrase-based statistical machine translation (SMT) that uses a dependency parser to obtain automatic and manual reordering rules from English to Vietnamese. The dependency parse trees and transformation rules are used to reorder the source sentences before they are passed to systems translating from English to Vietnamese. We evaluated our approach on English-Vietnamese machine translation tasks and showed that it outperforms the baseline phrase-based SMT system.  
Received 16 May 2017; Revised 07 Sep 2017; Accepted 29 Sep 2017  
Keywords: Natural Language Processing, Machine Translation, Phrase-based Statistical Machine Translation.  
1. Introduction

Phrase-based statistical machine translation [8] is the state of the art in SMT because of its power in modelling short reordering and local context. However, with phrase-based SMT, long-distance reordering is still problematic. The reordering problem (global reordering) is one of the major problems, since different languages have different word order requirements. In recent years, many methods have been proposed to tackle the long-distance reordering problem, such as syntax-based models [15] and lexicalized reordering [10]. Chiang [15] shows significant improvements by keeping the strengths of phrases while incorporating syntax into SMT. Some approaches were applied at the word level [3]. They are useful for languages with rich morphology, for reducing data sparseness. Other kinds of syntax reordering methods require parse trees, such as the work in [3]. The parse tree is more powerful in capturing the sentence structure. However, it is expensive to create tree structures and build a good-quality parser. All the above approaches require much decoding time, which is expensive.

The approach that we are interested in balances the quality of translation with decoding time. Reordering approaches as a preprocessing step [5, 21, 27] are very effective (significant improvement over state-of-the-art phrase-based and hierarchical machine translation systems, and separate quality evaluation of each reordering model).
_______  
* Corresponding author. E-mail.: thviet@uneti.edu.vn  
The end-to-end neural MT (NMT) approach [26] has recently been proposed for MT. However, the NMT method has some limitations that may jeopardize its ability to generate better translations. An NMT system usually suffers from a serious out-of-vocabulary (OOV) problem, which badly hurts translation quality. The NMT decoder lacks a mechanism to guarantee that all the source words are translated and usually favors short translations. It is difficult for an NMT system to benefit from a target language model trained on a target monolingual corpus, which has proven useful for improving translation quality in statistical machine translation (SMT). NMT also needs much more training time: in [20], NMT requires a longer time to train (18 days) compared to their best SMT system (3 days).

Inspired by these preprocessing approaches, we propose a combined approach which preserves the strength of phrase-based SMT in reordering and decoding time as well as the strength of integrating syntactic information in reordering. Firstly, the proposed method uses dependency parsing as a preprocessing step in both training and testing. Secondly, transformation rules are applied to reorder the source sentences. The experimental results on the English-Vietnamese pair show that our approach achieves improvements in BLEU scores [1] when translating from English, compared to Moses [7], which is the state-of-the-art phrase-based SMT system.

This paper is structured as follows: Section 1 introduces the reordering problem. Section 2 reviews the related works. Section 3 introduces phrase-based SMT. Section 4 describes how to apply transformation rules for reordering the source sentences. Section 5 presents a learning model that transforms the word order of an input sentence into an order that is natural in the target language. Section 6 describes the experimental results. Section 7 discusses the experimental results. Conclusions are given in Section 8.

2. Related works

The difference in word order between source and target languages is the major problem in phrase-based statistical machine translation. Fig. 1 shows an example in which a reordering approach modifies the word order of an input sentence in the source language (English) in order to produce the word order of the target language (Vietnamese).

Figure 1. An example of preordering for English-Vietnamese translation.

Many preordering methods using syntactic information have been proposed to solve the reordering problem. (Collins 2005; Xu 2009) [3, 27] presented preordering methods which used manually created rules on parse trees. Linguistic knowledge for a language pair is necessary to create such rules. Other preordering methods using automatically created reordering rules or a statistical classifier were studied in [21, 28].

Collins [3] developed a clause detection method and used some handwritten rules to reorder words in the clause. (Habash 2007) [18] built automatically extracted syntactic rules. Xu [27] described a method using a dependency parse tree and flexible rules to perform the reordering of subject, object, etc. These rules were written by hand, but [27] showed that an automatic rule learner can be used.
Bach [13] proposed a novel source-side dependency tree reordering model for statistical machine translation, in which subtree movements and constraints are represented as reordering events associated with the widely used lexicalized reordering models.

(Genzel 2010; Lerner and Petrov 2013) [5, 21] described a method using discriminative classifiers to directly predict the final word order. Cai [2] introduced a novel pre-ordering approach based on dependency parsing for Chinese-English SMT. Isao Goto [17] described a preordering method using a target-language parser via cross-language syntactic projection for statistical machine translation.

Joachim Daiber [16] presented a novel study examining the relationship between preordering and word order freedom in machine translation.

Chenchen Ding [4] proposed extra-chunk pre-ordering of morphemes, which allows Japanese functional morphemes to move across chunk boundaries.

Christian Hadiwinoto presented a novel reordering approach utilizing sparse features based on dependency word pairs [19], and a novel reordering approach utilizing a neural network and dependency-based embeddings to predict whether the translations of two source words linked by a dependency relation should remain in the same order or should be swapped in the translated sentence [20]. This approach is complex and takes much time to process.

However, there have not been many studies on English-Vietnamese SMT tasks. To our knowledge, no research addresses reordering models for English-Vietnamese SMT based on dependency parsing. In comparison with the approaches mentioned above, our proposed method differs as follows. We investigate reordering models for English-Vietnamese SMT using dependency information. We study the SVO word order of English and Vietnamese in order to recognize the differences between English and Vietnamese word labels, phrase labels and dependency labels. We use the dependency parse of the English sentence for translating from English to Vietnamese. Based on the above studies, we utilize English-Vietnamese transformation rules (manual rules, and automatic rules extracted from an English-Vietnamese parallel corpus) that directly predict the target-side word order as a preprocessing step in phrase-based machine translation. As in [18], we also apply preprocessing at both training and decoding time.

3. Brief description of the baseline phrase-based SMT

In this section, we describe the phrase-based SMT system which was used for the experiments. Phrase-based SMT, as described by [8], translates a source sentence into a target sentence by decomposing the source sentence into a sequence of source phrases, which can be any contiguous sequences of words (or tokens treated as words) in the source sentence. For each source phrase, a target phrase translation is selected, and the target phrases are arranged in some order to produce the target sentence. A set of possible translation candidates created in this way is scored according to a weighted linear combination of feature values, and the highest scoring translation candidate is selected as the translation of the source sentence. Symbolically,

$$\hat{t} = \arg\max_{t,a} \sum_{i=1}^{n} \lambda_i f_i(s, t, a) \qquad (1)$$

where $s$ is the input sentence, $t$ is a possible output sentence, $a$ is a phrasal alignment that specifies how $t$ is constructed from $s$, and $\hat{t}$ is the selected output sentence. The weights $\lambda_i$ associated with each feature $f_i$ are tuned to maximize the quality of the translation hypothesis selected by the decoding procedure that computes the argmax. The log-linear model is a natural framework to integrate many features. The probabilities of source phrases given target phrases, and of target phrases given source phrases, are estimated from the bilingual corpus.
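As a minimal sketch of the weighted linear combination in Eq. (1), the snippet below scores a handful of candidate translations and returns the highest-scoring one. The feature names, weights and feature values are made up for illustration and are not taken from our system; a real decoder such as Moses searches a far larger hypothesis space rather than enumerating candidates.

```python
def score(features, weights):
    """Weighted linear combination: sum over i of lambda_i * f_i(s, t, a)."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Return the candidate whose feature vector receives the highest score."""
    return max(candidates, key=lambda c: score(c["features"], weights))


# Hypothetical weights (lambda_i) and log-domain feature values (f_i).
weights = {"tm": 1.0, "lm": 0.5, "distortion": 0.3}
candidates = [
    {"text": "candidate A",
     "features": {"tm": -2.0, "lm": -4.0, "distortion": -1.0}},
    {"text": "candidate B",
     "features": {"tm": -2.5, "lm": -2.0, "distortion": 0.0}},
]
best = best_translation(candidates, weights)
```

Tuning then amounts to choosing the values in `weights` so that the selected candidate is as good as possible on a development set.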
Koehn [8] used the following distortion model (reordering model), which simply penalizes non-monotonic phrase alignment based on the word distance of successively translated source phrases, with an appropriate value for the parameter $\alpha$:

$$d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|} \qquad (2)$$

where $a_i$ is the start position of the source phrase translated into the $i$-th target phrase and $b_{i-1}$ is the end position of the source phrase translated into the $(i-1)$-th target phrase.

Figure 2. An example with POS tags and dependency parse.

Figure 3. An example of the dependency parse of an English sentence using the Stanford Parser.

Moses [7] is an open-source toolkit for statistical machine translation that allows translation models to be trained automatically for any language pair. Given a trained model, an efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices. In our work, we also used Moses to evaluate on English-Vietnamese machine translation tasks.
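The distortion penalty of the baseline can be illustrated with a short sketch. It assumes the form commonly given for this model in the phrase-based SMT literature, $\alpha^{|a_i - b_{i-1} - 1|}$, where $a_i$ is the start position of the source phrase for the $i$-th target phrase and $b_{i-1}$ is the end position of the source phrase for the previous target phrase. The value $\alpha = 0.5$ and the phrase spans are arbitrary illustrations.

```python
def distortion(start_i, end_prev, alpha=0.5):
    """Penalty alpha ** |a_i - b_{i-1} - 1| for one phrase transition."""
    return alpha ** abs(start_i - end_prev - 1)


def total_distortion(spans, alpha=0.5):
    """Product of penalties over source spans listed in target order.

    Each span is (start, end) with 1-based inclusive word positions;
    the position before the first source word is taken to be 0.
    """
    penalty, end_prev = 1.0, 0
    for start, end in spans:
        penalty *= distortion(start, end_prev, alpha)
        end_prev = end
    return penalty


monotone = total_distortion([(1, 2), (3, 3), (4, 6)])  # no reordering
swapped = total_distortion([(1, 2), (4, 6), (3, 3)])   # last two phrases swapped
```

A monotone segmentation is not penalized, while the swapped order is, and the penalty grows with the jump distance. This is why long-distance reordering is costly for the baseline and why moving the reordering into a preprocessing step is attractive.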
4. Dependency syntactic preprocessing for SMT

Reordering approaches for the English-Vietnamese translation task have limitations. In this paper, we first produce a parse tree using dependency parser tools [11]. Figure 3 shows an example of a parsed English sentence. Then, we utilize some dependency relations extracted from a statistical dependency parser to create dependency-based reordering rules. Dependencies among words, typed with grammatical relations, have proven to be useful information in applications related to syntactic processing (Figure 4).

Figure 4. Representation of the Stanford Dependencies for the English source sentence.

There are approximately 50 grammatical relations in English, while there are 27 in Vietnamese, based on [9]. We use these relations and the differences in word order between English and Vietnamese to create the set of reordering rules. Based on these rules, we propose a method which is capable of applying and combining them simultaneously. We utilize the word labels in [9] to analyze and extract POS tags and head-modifier dependencies.
In addition, we focus on analyzing some popular structures of English when translating into Vietnamese. This analysis can achieve remarkable improvements in translation performance. Because English and Vietnamese are both SVO languages, the position of the verb rarely changes, so we focus mainly on some typical relations such as noun phrases, adjectival and adverbial phrases, and prepositions, and create a manually written reordering rule set for the English-Vietnamese language pair. Inspired by [27], our study employs dependency syntax and transformation rules to reorder the source sentences and applies them to an English-Vietnamese translation system.

For example, in a noun phrase there always exists a head noun and the components before and after it. These auxiliary components move to new positions according to Vietnamese word order.

Figures 6 and 7 illustrate the difference in word order between English and Vietnamese in noun phrases and in adjectival and adverbial phrases.

Figure 6. An example of the word reordering phenomenon in a noun phrase with an adjectival modifier (amod) and a determiner modifier (det). In this example, the noun “computer” is swapped with the adjective “personal”.

Figure 7. An example of the word reordering phenomenon in an adjectival phrase with an adverbial modifier (advmod) and a determiner modifier (det).
4.1. Transformation rule

In this section, we describe the transformation rules. Our rule set is for English-Vietnamese phrase-based SMT. Table 1 shows the handwritten rules that use dependency syntactic preprocessing to reorder from English to Vietnamese.

Table 1. Handwritten rules for reordering English to Vietnamese using dependency syntactic preprocessing

T                   (L, W, O)
JJ or JJS or JJR    (advcl, 1, NORMAL)
                    (self, -1, NORMAL)
                    (aux, -2, REVERSE)
                    (auxpass, -2, REVERSE)
                    (neg, -2, REVERSE)
                    (cop, 0, REVERSE)
NN or NNS           (prep, 0, NORMAL)
                    (rcmod, 1, NORMAL)
                    (self, 0, NORMAL)
                    (poss, -1, NORMAL)
                    (admod, -2, REVERSE)
IN or TO            (pobj, 1, NORMAL)
                    (self, 2, NORMAL)

In the proposed approach, a transformation rule is a mapping from T to a set of tuples (L, W, O):
• T is the part-of-speech (POS) tag of the head in a dependency parse tree node.
• L is a dependency label for a child node.
• W is a weight indicating the order of that child node.
• O is the type of order (either NORMAL or REVERSE).

Our rule set provides a valuable resource for preordering in English-Vietnamese phrase-based SMT.

4.2. Dependency syntactic processing

We aim to reorder an English sentence to obtain a new English sentence in which some words are arranged in Vietnamese word order. The weight is used to determine the relative order of the children, going from the largest to the smallest, while the type of order is only used when multiple children have the same weight. The weight can be any real-valued number. The order type NORMAL means we preserve the original order of the children, while REVERSE means we flip the order. We reserve a special label self to refer to the head node itself so that we can apply a weight to the head, too. We call such a tuple a precedence tuple in later discussions. In this study, we use manually created rules only.

Suppose we have a reordering rule: NNS: (prep, 0, NORMAL), (rcmod, 1, NORMAL), (self, 0, NORMAL), (poss, -1, NORMAL), (admod, -2, REVERSE). For the example shown in Figure 4, we would apply it to the ROOT node, resulting in "songwriter that wrote many songs romantic."

We apply the rules in a dependency tree recursively, starting from the root node. If the POS tag of a node matches the left-hand side of a rule, the rule is applied and the order of the sentence is changed. We go through all the children of the node and get the precedence weights for them from the set of precedence tuples. If we encounter a child node that has a dependency label not listed in the set of tuples, we give it a default weight of 0 and a default order type of NORMAL. The children nodes are sorted according to their weights from highest to lowest, and nodes with the same weight are ordered according to the type of order defined in the rule.

Figure 5. An example of using dependency syntax before and after our preprocessing.

Figure 5 gives an example of an original and a preprocessed sentence in English. The first line is the original English sentence, "that songwriter wrote many romantic songs.", and the fourth line is the target Vietnamese sentence, "Nhạc sĩ đó đã viết nhiều bài hát lãng mạn.", which is arranged in Vietnamese word order. We aim to preprocess as in Figure 5, where the sentence in Vietnamese word order is the output of our method. As can be seen, after reordering, the English line has the same word order as the Vietnamese sentence.
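The recursive use of precedence tuples can be sketched as follows. This is a simplified illustration, not the implementation used in our experiments: the `Node` class and `reorder` function are hypothetical, the label printed as "admod" is assumed to be the Stanford label `amod`, and a group of equally weighted nodes is assumed to be flipped whenever any of its tuples asks for REVERSE. The example phrase follows the noun phrase of Figure 6, with the determiner "a" assumed.

```python
NORMAL, REVERSE = "NORMAL", "REVERSE"

# Precedence tuples for noun heads, copied from the NNS rule in the text.
# Assumption: the label printed as "admod" is the Stanford label "amod".
NOUN_RULE = {
    "prep": (0, NORMAL),
    "rcmod": (1, NORMAL),
    "self": (0, NORMAL),
    "poss": (-1, NORMAL),
    "amod": (-2, REVERSE),
}
RULES = {"NN": NOUN_RULE, "NNS": NOUN_RULE}


class Node:
    """A dependency tree node: word, POS tag, label to its head, position."""

    def __init__(self, word, pos, label, index, children=()):
        self.word, self.pos, self.label, self.index = word, pos, label, index
        self.children = list(children)


def reorder(node, rules=RULES):
    """Return the words under `node` after applying precedence rules."""
    rule = rules.get(node.pos, {})
    entries = []
    for item in [node] + node.children:
        label = "self" if item is node else item.label
        # Unlisted labels get the default weight 0 and order type NORMAL.
        weight, order = rule.get(label, (0, NORMAL))
        entries.append((weight, order, item))

    words = []
    # Visit groups of equal weight from the highest weight to the lowest.
    for weight in sorted({w for w, _, _ in entries}, reverse=True):
        group = sorted((e for e in entries if e[0] == weight),
                       key=lambda e: e[2].index)
        # Assumption: a tie group is flipped if any member asks for REVERSE.
        if any(order == REVERSE for _, order, _ in group):
            group.reverse()
        for _, _, item in group:
            words += [item.word] if item is node else reorder(item, rules)
    return words


# Noun phrase "a personal computer": head "computer" with a determiner
# child "a" (det) and an adjectival child "personal" (amod).
phrase = Node("computer", "NN", "dobj", 2, [
    Node("a", "DT", "det", 0),
    Node("personal", "JJ", "amod", 1),
])
reordered = " ".join(reorder(phrase))
```

Here the determiner and the head share the default weight 0 and keep their original order, while the adjectival modifier, with weight -2, moves after the noun, giving "a computer personal".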
Table 2. Corpus statistics

Corpus     Sentence pairs   Training set   Development set   Test set
General    132636           131236         400               1000

                                 English     Vietnamese
Training      Sentences          131236      131236
              Average length     17.98       18.91
              Words              2360727     2481762
              Vocabulary         54086       39071
Development   Sentences          400         400
              Average length     21.41       22.73
              Words              8567        9092
              Vocabulary         1920        1537
Test          Sentences          1000        1000
              Average length     21.42       22.70
              Words              21428       22707
              Vocabulary         3816        2882
5. Classifier-based preordering for phrase-based SMT

Currently, the state-of-the-art phrase-based SMT system uses the lexicalized reordering model in the Moses toolkit. In our work, we also used Moses to evaluate on English-Vietnamese machine translation tasks.

5.1. Classifier-based preordering

In this section, we describe a learning model that can transform the word order of an input sentence into an order that is natural in the target language. English is used as the source language, while Vietnamese is used as the target language in our discussion of word order.

For example, when translating the English sentence:
I ’m looking at a new jewelry site.
to Vietnamese, we would like to reorder it as:
I ’m looking at a site new jewelry.
This model is then used in combination with the translation model.

The feature is built for the "site, a, new, jewelry" family in Figure 2:
NN, DT, det, JJ, amod, NN, nn, 1230, 1023

We use the dependency grammars and the differences in word order between English and Vietnamese to create a set of reordering rules. We POS-tag and parse the input sentence, producing the POS tags and head-modifier dependencies shown in Figure 2. We then traverse the dependency tree, starting at the root, to reorder it. For each head word, we determine the order of the head and its children (independently of other decisions) and continue the traversal recursively in that order. In the above example, we need to decide the order of the head "looking" and the children "I", "’m", and "site".

The words in the sentence are reordered by a new sequence learned from training data using a multi-class classifier. We use an SVM classification model [25] that supports multi-class prediction. The class labels correspond to reordering sequences, so the model is able to select the best one from many possible sequences.

Table 3. Set of features used in training data from the English-Vietnamese corpus

Feature   Description
T         The head’s POS tag
T1        The first child’s POS tag
L1        The first child’s syntactic label
T2        The second child’s POS tag
L2        The second child’s syntactic label
T3        The third child’s POS tag
L3        The third child’s syntactic label
T4        The fourth child’s POS tag
L4        The fourth child’s syntactic label
O1        The sequence of head and its children in source alignment
O2        The sequence of head and its children in target alignment
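The construction of one feature vector can be sketched for the "site, a, new, jewelry" family. The function `family_features`, the tuple layout, the dependency label given to the head and the one-to-one alignment map are assumptions made for this illustration; the head is numbered 0 and the children 1 to k in source order, which reproduces the sequences 1230 and 1023 quoted above.

```python
def family_features(head, children, target_position):
    """Build a Table 3 style feature list for one head-children family.

    head, children: (word, pos_tag, dep_label, source_index) tuples.
    target_position: maps a source index to the position of its aligned
    target word (a simplification: one-to-one alignment is assumed).
    """
    kids = sorted(children, key=lambda c: c[3])
    members = [head] + kids            # id 0 is the head, ids 1..k the children
    features = [head[1]]
    for _, pos, label, _ in kids:
        features += [pos, label]
    ids = range(len(members))
    source_order = sorted(ids, key=lambda i: members[i][3])
    target_order = sorted(ids, key=lambda i: target_position[members[i][3]])
    features.append("".join(map(str, source_order)))   # O1
    features.append("".join(map(str, target_order)))   # O2, the class label
    return features


# Family from "I 'm looking at a new jewelry site ." (0-based word indices).
# The head's own label is not part of the pattern; "pobj" is a placeholder.
head = ("site", "NN", "pobj", 7)
children = [("a", "DT", "det", 4),
            ("new", "JJ", "amod", 5),
            ("jewelry", "NN", "nn", 6)]
# Hypothetical alignment: positions of the words in "a site new jewelry".
target_position = {4: 0, 7: 1, 5: 2, 6: 3}
features = family_features(head, children, target_position)
```

The last element, O2, is the permutation the classifier is trained to predict; the elements before O1 form the pattern used at prediction time.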
Table 4. Examples of rules and reordered source sentences

Pattern                            Order      Example
NN, DT, det, JJ, amod, NN, nn      1,0,2,3    I ’m looking at a new jewelry site.
                                              I ’m looking at a site new jewelry.
NNS, JJ, amod, CC, cc, NNS, con    2,1,0,3    it faced a blank wall.
                                              it faced a wall blank.
NNP, NNP, nn, NNP, nn              2,1,0      it ’s a social phenomenon.
                                              it ’s a phenomenon social.

5.2. Features

The features extracted from the dependency tree include POS tags and alignment information. We traverse the tree from the top, and in each family we create features with the following information:
• The head’s POS tag.
• The first child’s POS tag, the first child’s syntactic label.
• The second child’s POS tag, the second child’s syntactic label.
• The third child’s POS tag, the third child’s syntactic label.
• The fourth child’s POS tag, the fourth child’s syntactic label.
• The sequence of head and its children in source alignment.
• The sequence of head and its children in target alignment. It is the class label for the SVM classifier model.

We limited ourselves to processing families that have fewer than five children, based on counting the total number of families in each group: 1 head and 1 child, 1 head and 2 children, 1 head and 3 children, 1 head and 4 children, and so on. We found that the most common families (80%) in our training sentences have four or fewer children.

The classifiers learn to trade off between a rich set of overlapping features. The list of features is given in Table 3.

We use the SVM classification model in the WEKA tools [6], since it naturally supports multi-class prediction and can therefore be used to select one out of many possible permutations. The learning algorithm produces a sparse set of features. In our experiments, the models were based on features generated from 100k English-Vietnamese sentence pairs.

When extracting the features, every word can be represented by its word identity, its POS tag from the treebank, and its syntactic label. We also include pairs of these features, resulting in potentially bilexical features.

Algorithm 1 Extract rules
input: dependency trees of source sentences and alignment pairs;
output: set of automatic rules;
for each family in dependency trees of subset and alignment pairs of sentences do
    generate feature (pattern + order);
end for
Build model from set of features;
for each family in dependency trees in the rest of the sentences do
    generate pattern for prediction;
    get predicted order from model;
    add (pattern, order) as new rule in set of rules;
end for

Algorithm 2 Apply rule
input: source-side dependency trees, set of rules;
output: set of new sentences;
for each dependency tree do
    for each family in tree do
        generate pattern
        get order from set of rules based on pattern
        apply transform
    end for
    Build new sentence;
end for
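A minimal sketch of Algorithm 2 is given below. The `Node` class, `pattern_of` and `apply_rules` are hypothetical names introduced for this illustration, and the rule set contains only the first row of Table 4. The order lists member ids in their new sequence, where id 0 is the head and ids 1 to k are the children in source order. A family without a matching rule is assumed to keep its source order.

```python
class Node:
    """A dependency tree node with its children kept in source order."""

    def __init__(self, word, pos, label, index, children=()):
        self.word, self.pos, self.label, self.index = word, pos, label, index
        self.children = sorted(children, key=lambda c: c.index)


def pattern_of(node):
    """Pattern = head POS tag followed by (POS tag, label) of each child."""
    pattern = [node.pos]
    for child in node.children:
        pattern += [child.pos, child.label]
    return tuple(pattern)


def apply_rules(node, rules):
    """Return the words under `node`, reordered family by family."""
    members = [node] + node.children       # id 0 = head, ids 1..k = children
    order = rules.get(pattern_of(node))
    if order is None:
        # No rule matches this family: keep the source order.
        order = sorted(range(len(members)), key=lambda i: members[i].index)
    words = []
    for i in order:
        words += [node.word] if i == 0 else apply_rules(members[i], rules)
    return words


# One rule, taken from the first row of Table 4.
rules = {("NN", "DT", "det", "JJ", "amod", "NN", "nn"): (1, 0, 2, 3)}
site = Node("site", "NN", "pobj", 7, [
    Node("a", "DT", "det", 4),
    Node("new", "JJ", "amod", 5),
    Node("jewelry", "NN", "nn", 6),
])
new_phrase = " ".join(apply_rules(site, rules))
unchanged = " ".join(apply_rules(site, {}))
```

With the rule, the family is emitted as "a site new jewelry"; with an empty rule set, the original order "a new jewelry site" is preserved.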
5.3. Training data for preordering

In this section, we describe a method to build training data for the English-Vietnamese pair. Our purpose is to rearrange the word order of an input sentence into Vietnamese word order. For example, the English sentence in Figure 2:
I ’m looking at a new jewelry site.
is transformed into Vietnamese order:
I ’m looking at a site new jewelry.

For this approach, we first preprocess the sentences to encode some special words and parse them into dependency trees using the Stanford Parser [14]. Then, we use the target-to-source alignment and the dependency tree to generate features. We add the source and target alignment, POS tag and syntactic label of each word to its node in the dependency tree. For each family in the tree, we generate a training instance if it has four or fewer children. If a family has five or more children, we discard this family but still keep traversing each child.

Each rule consists of a pattern and an order. For every node in the dependency tree, from the top down, we match the node against the patterns, and if a match is found, the associated order is applied. We arrange the words in the English sentence that are covered by the matching node in Vietnamese word order. We then do the same for each child of this node. If no rule applies, we keep the order of the original sentence. These rules are learnt automatically from bilingual corpora. Our algorithm is outlined in Alg. 1 and Alg. 2.

Algorithm 1 automatically extracts the rules, taking as input the dependency trees of source sentences and the alignment pairs.

Algorithm 2 takes all rules produced by Algorithm 1 together with the source-side dependency trees and builds the new sentences.

5.4. Classification model

The reordering decisions are made by multi-class classifiers (corresponding to the number of permutations: 2, 6, 24, 120), where class labels correspond to permutation sequences. We train a separate classifier for each number of possible children. Crucially, we do not learn explicit tree transformation rules, but let the classifiers learn to trade off between a rich set of overlapping features. To build the classification model, we use the SVM classification model in the WEKA tools. The following results are obtained using 10-fold cross-validation.

We apply the rules in a dependency tree recursively, starting from the root node. If the POS tag of a node matches the left-hand side of a rule, the rule is applied and the order of the sentence is changed. We go through all the children of the node and match rules for them from the set of automatic rules.

Table 4 gives examples of original and preprocessed phrases in English. The first line is the original English sentence, "I’m looking at a new jewelry site", and the target Vietnamese sentence is "Tôi đang xem một trang web mới về nữ_trang", which is arranged in Vietnamese word order. The sentence in Vietnamese word order is the output of our method. As can be seen, after reordering, the English line has the same word order as the Vietnamese sentence: "I ’m looking at a site new jewelry" in Figure 1.

6. Experimental results

6.1. Data set and experimental setup

For evaluation, we used a Vietnamese-English corpus [22], including 131236 pairs for training, 1000 pairs for testing and 400 pairs for the development test set. Table 2 gives more statistical information about our corpora. We conducted experiments with the Moses decoder [7] and SRILM [12]. We trained a trigram language model with the interpolate and kndiscount smoothing options on a Vietnamese monolingual corpus. Before extracting the phrase table, we use GIZA++ [10] to build word alignments with the grow-diag-final-and algorithm. Besides using preprocessing, we also used the default reordering model in the Moses decoder: word-based extraction (wbe), splitting the reordering orientation into three classes (monotone, swap and discontinuous: msd), combining backward and forward directions (bidirectional), and modeling based on both source and target language (fe) [7]. For comparison, we tried preprocessing the source sentences with manual rules and automatic rules. We implemented this as follows:
• We used the Stanford Parser [14] to parse the source sentences (English sentences) and applied the result to preprocess them.
• We used classifier-based preordering, with the SVM classification model [25] in the Weka tools [6], to train feature-rich discriminative classifiers, extract automatic rules and apply them to reorder words in English sentences according to Vietnamese word order.
• We implemented the preprocessing step at both training and decoding time.
• We used the SMT Moses decoder [7] for decoding.

We give some definitions for our experiments:
Baseline: the baseline phrase-based SMT system using the lexicalized reordering model in the Moses toolkit.
Manual Rules: the phrase-based SMT system applying manual rules [23].
Auto Rules: the phrase-based SMT system applying automatic rules [24].
Auto Rules + Manual Rules: the phrase-based SMT system applying automatic rules, then applying manual rules.

6.2. Using manual rules

In this section, we present our experiments on translating from English to Vietnamese in a statistical machine translation system. We used the Stanford Parser [14] to parse the source (English) sentences and preprocess them. According to the typical differences in word order between English and Vietnamese, we have created a set of dependency-based rules for reordering words in English sentences according to Vietnamese word order. The types of rules include noun phrases, adjectival and adverbial phrases, and prepositions, as described in Table 1.

6.3. Using automatic rules

We present our experiments on translating from English to Vietnamese in a statistical machine translation system. Hence, the language pair chosen is English-Vietnamese. We used the Stanford Parser [14] to parse the source (English) sentences. We used dependency parsing and rules extracted by training feature-rich discriminative classifiers to reorder the source-side sentences. The rules are automatically extracted from the English-Vietnamese parallel corpus and the dependency parses of the English examples. Finally, we used these rules to reorder the source sentences. We evaluated our approach on English-Vietnamese machine translation tasks with the systems in Table 5, which shows that it can outperform the baseline phrase-based SMT system.
Table 5. Our experimental systems on English-  
Vietnamese parallel corpus  
Name  
Description  
Baseline  
Manual Rules  
Phrase-based system  
Phrase-based system  
with corpus  
Table 6. Size of phrase tables  
which preprocessed  
using manual rules  
Phrase-based system  
with corpus which  
preprocessed using  
automatic learning rules  
Phrase-based system  
with corpus which  
preprocessed using  
automatic learning rules  
and manual rules  
Name  
Size of phrase-table  
1152216  
Auto Rules  
Baseline  
Manual Rules  
Auto Rules  
1231365  
1213401  
Auto Rules +  
Manual Rules  
Auto Rules +  
Manual Rules  
1253401  
Table 7. Translation performance for the English-Vietnamese task  
System                      BLEU (%)  
Baseline                    36.89  
Manual Rules                37.71  
Auto Rules                  37.12  
Auto Rules + Manual Rules   37.85  

6.4. BLEU score  
Table 6 shows the size of the phrase tables built from the translation models based on our method. With this method, more distinct phrases are found in the translation model, which gives the decoder more options for generating the best translation.  
Table 7 gives the BLEU scores of our experiments. With preprocessing applied in both training and decoding, the BLEU score of the "Auto Rules" system is 0.59 points lower than that of the "Manual Rules" system (37.12 vs. 37.71). We attribute this to the manual rules being of higher quality than the automatic rules. However, the "Auto Rules + Manual Rules" system is the best system, because combining the two rule sets covers more linguistic phenomena.  
These results demonstrate the effect of applying transformation rules based on the dependency parse tree.  

Table 8. Number of families in the English-Vietnamese corpus, by number of children of the head  
Number of children of head   Number of families   Description  
1                            79142                Family has 1 child  
2                            40822                Family has 2 children  
3                            26008                Family has 3 children  
4                            15990                Family has 4 children  
5                            7442                 Family has 5 children  
6                            2728                 Family has 6 children  
7                            942                  Family has 7 children  
8                            307                  Family has 8 children  
9                            83                   Family has 9 children  

Table 9. Examples of translations produced by our system for input sentences sampled from the English-Vietnamese corpus  
Input sentence: The coat was far too big - it completely enveloped him.  
Translation (Baseline): Chiếc áo khoác là quá lớn - nó hoàn toàn phủ anh ta.  
Translation (Auto): Chiếc áo khoác là quá lớn - nó phủ hoàn toàn anh ta.  
Translation (human): Chiếc áo khoác quá lớn - nó hoàn toàn phủ anh ta.  

Input sentence: Manh Cuong is a young football player with potential great.  
Translation (Baseline): Manh Cuong là một cầu thủ bóng đá với nhiều tiềm năng.  
Translation (Auto): Manh Cuong là một cầu thủ bóng đá trẻ có tiềm năng lớn.  
Translation (human): Mạnh Cường là cầu thủ bóng đá trẻ rất nhiều triển vọng.  
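Family statistics of the kind reported in Table 8 can be gathered by counting, for every head word, how many direct children it has. The sketch below assumes that each parsed sentence is given as a list of head indices, with -1 marking the root; this input format and the function name `family_size_histogram` are illustrative assumptions, not the format used in the experiments.

```python
from collections import Counter
from typing import Dict, Iterable, List


def family_size_histogram(corpus: Iterable[List[int]]) -> Dict[int, int]:
    """Return {number of children: number of families} over all parsed sentences.

    Each sentence is a list where position i holds the index of token i's head,
    or -1 if token i is the root (assumed input format).
    """
    histogram: Counter = Counter()
    for heads in corpus:
        # Count how many tokens point at each head within this sentence.
        children_per_head = Counter(h for h in heads if h >= 0)
        # Every head with at least one child forms one family of that size.
        histogram.update(children_per_head.values())
    return dict(histogram)


corpus = [
    [3, 3, 3, -1],  # "a young football player": one family with 3 children
    [1, -1, 1],     # "it enveloped him": one family with 2 children
]
print(family_size_histogram(corpus))
```

Such a histogram indicates how much of the corpus a limit on family size would cover before rules are learned for larger families.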
7. Analysis and discussion  
According to the typical differences in word order between English and Vietnamese, we have created a set of automatic rules for reordering the words of an English sentence according to Vietnamese word order. The rule types cover noun phrases, adjectival and adverbial phrases, and prepositional phrases. Table 8 gives statistics on the families in our corpus, including those with four or more children. In our approach, the number of children in each family is limited to four. Consequently, in the target language (Vietnamese), the number of children in each family is the same.  
The manual rules are of good quality [27, 18], and the phrase-based SMT system applying manual rules performs better than the phrase-based SMT system applying automatic rules. We believe that the quality of the phrase-based SMT system applying automatic rules will improve when a better corpus is available.  
For example, in a noun phrase there always exists a head noun, together with the components before and after it. These auxiliary components move to new positions according to Vietnamese word order. These rules can cover common source-language linguistic phenomena and their target-language equivalents, as follows:  
• The phrase-based systems applying rules with category JJ or JJS  
• The phrase-based systems applying rules with category NN or NNS  
• The phrase-based systems applying rules with category IN or TO  
Based on these phenomena, translation quality has significantly improved. We carried out an error analysis of the reordered sentences and compared them to the golden reordering. Our analysis also shows the benefits of automatic reordering rules for translation quality. In combination with the machine learning method in related work [21], it shows that a classifier-based method can solve reordering problems automatically.  
We have found that the results of our experiments are sufficiently correlated with translation quality assessed manually. Besides, we have also found several sources of error, such as the quality of the source-sentence parse trees, the quality of the word alignment, and the quality of the corpus. All of these errors can affect the automatic reordering rules. Table 9 shows translation examples produced by our system that are better than those of the baseline system, for input sentences from the English-Vietnamese test set. Some phrases in the English source sentences were reordered to correspond to the Vietnamese target word order. We focus mainly on some typical relations, such as noun phrases, adjectival and adverbial phrases, and prepositions, and created a manually written reordering rule set for the English-Vietnamese language pair. Our study employed dependency syntax and transformation rules to reorder the source sentences, and applied them to English-to-Vietnamese translation systems.  
8. Conclusion  
In this paper, we presented a preprocessing approach based on a dependency parser. The proposed approach is applied to an English-Vietnamese translation system. The experimental results show that our approach achieved improvements in BLEU scores over a state-of-the-art phrase-based baseline system. By applying manual rules and automatic rules, the quality of the English-Vietnamese translation system improves. In our study, the rules cover some linguistic reordering phenomena. These reordering rules benefit the English-Vietnamese language pair.  
In future work, we will focus more on word order problems and linguistic reordering phenomena in English-Vietnamese, in order to learn better dependency-based reordering rules (both manual and automatic). This is necessary for improving SMT systems and might lead to their wider adoption.  
Acknowledgments  
The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).  
References  
[1] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: A method for automatic evaluation of machine translation, in: ACL, 2002.  
[2] J. Cai, M. Utiyama, E. Sumita, Y. Zhang, Dependency-based pre-ordering for Chinese-English machine translation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.  
[3] M. Collins, P. Koehn, I. Kucerová, Clause restructuring for statistical machine translation, in: Proc. ACL 2005, Ann Arbor, USA, 2005, pp. 531-540.  
[4] C. Ding, K. Sakanushi, H. Touji, M. Yamamoto, Inter-, intra-, and extra-chunk pre-ordering for statistical Japanese-to-English machine translation, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15 (3) (2016) 20:1-20:28. doi:10.1145/2818381.  
[6] D. Genzel, Automatically learning source-side reordering rules for large scale machine translation, in: Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, 2010, pp. 376-384.  
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The Weka data mining software: An update, SIGKDD Explor. Newsl. 11 (1) (2009) 10-18.  
[8] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: Proceedings of ACL, Demonstration Session, 2007.  
[9] P. Koehn, F. J. Och, D. Marcu, Statistical phrase-based translation, in: Proceedings of HLT-NAACL 2003, Edmonton, Canada, 2003, pp. 127-133.  
[10] T. L. Nguyen, M. L. Ha, V. H. Nguyen, T. M. H. Nguyen, P. Le-Hong, Building a treebank for Vietnamese dependency parsing, in: 2013 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, RIVF 2013, Hanoi, Vietnam, November 10-13, 2013, 2013, pp. 147-151.  
[11] F. J. Och, H. Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29 (1) (2003) 19-51.  
[12] M.-C. de Marneffe, B. MacCartney, C. D. Manning, Generating typed dependency parses from phrase structure parses, in: Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.  
[13] A. Stolcke, SRILM - an extensible language modeling toolkit, in: Proceedings of International Conference on Spoken Language Processing, Vol. 29, 2002, pp. 901-904.  
[14] N. Bach, Q. Gao, S. Vogel, Source-side dependency tree reordering models with subtree movements and constraints, in: Proceedings of the Twelfth Machine Translation Summit (MTSummit-XII), International Association for Machine Translation, Ottawa, Canada, 2009.  
[15] D. Cer, M.-C. de Marneffe, D. Jurafsky, C. D. Manning, Parsing to Stanford dependencies: Trade-offs between speed and accuracy, in: 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010.  
[16] D. Chiang, A hierarchical phrase-based model for statistical machine translation, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 2005, pp. 263-270.  
[17] J. Daiber, M. Stanojevic, W. Aziz, K. Sima'an, Examining the relationship between preordering and word order freedom in machine translation, in: Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August. Association for Computational Linguistics, 2016.  
[18] I. Goto, M. Utiyama, E. Sumita, S. Kurohashi, Preordering using a target-language parser via cross-language syntactic projection for statistical machine translation, ACM Transactions on Asian and Low-Resource Language Information Processing 14 (3) (2015) 13.  
[19] N. Habash, Syntactic preprocessing for statistical machine translation, Proceedings of the 11th MT Summit, 2007.  
[20] C. Hadiwinoto, Y. Liu, H. T. Ng, To swap or not to swap? Exploiting dependency word pairs for reordering in statistical machine translation, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.  
[21] C. Hadiwinoto, H. T. Ng, A dependency-based neural reordering model for statistical machine translation, arXiv preprint arXiv:1702.04510, 2017.  
[22] U. Lerner, S. Petrov, Source-side classifier preordering for machine translation, in: EMNLP, 2013, pp. 513-523.  
[23] H. V. Huy, T.-L. N. Phuong-Thai Nguyen, M. Nguyen, Bootstrapping phrase-based statistical machine translation via WSD integration, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013), 2013, pp. 1042-1046.  
[24] V. H. Tran, V. V. Nguyen, M. L. Nguyen, Improving English-Vietnamese statistical machine translation using preprocessing dependency syntactic, in: Proceedings of the 2015 Conference of the Pacific Association for Computational Linguistics (Pacling 2015), pp. 115-121.  
[25] V. H. Tran, H. T. Vu, V. V. Nguyen, M. L. Nguyen, A classifier-based preordering approach for English-Vietnamese statistical machine translation, 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016).  
[26] L. Wang, Support Vector Machines: theory and applications, Vol. 177, Springer Science & Business Media, 2005.  
[27] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016.  
[28] P. Xu, J. Kang, M. Ringgaard, F. Och, Using a dependency parser to improve SMT for subject-object-verb languages, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Boulder, Colorado, 2009, pp. 245-253.  
[29] N. Yang, M. Li, D. Zhang, N. Yu, A ranking-based approach to word reordering for statistical machine translation, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Association for Computational Linguistics, 2012, pp. 912-920.  