
VNU Journal of Science: Comp. Science & Com. Eng, Vol. 36, No. 1 (2020) 11-16  
Original Article  
Single Concatenated Input is Better than Independent
Multiple-input for CNNs to Predict Chemical-induced Disease  
Relation from Literature  
Pham Thi Quynh Trang, Bui Manh Thang, Dang Thanh Hai*  
Bingo Biomedical Informatics Lab, Faculty of Information Technology,  
VNU University of Engineering and Technology, Vietnam National University, Hanoi,  
144 Xuan Thuy, Cau Giay, Hanoi, Vietnam  
Received 21 October 2019  
Revised 17 March 2020; Accepted 23 March 2020

* Corresponding author. E-mail address: hai.dang@vnu.edu.vn
Abstract: Chemical compounds (drugs) and diseases are among the top keywords searched on the PubMed database of biomedical literature by biomedical researchers all over the world (according to a study in 2009). Working with PubMed is essential for researchers to gain insights into drugs' side effects (chemical-induced disease relations, CDR), which are essential for drug safety and toxicity. It is, however, a catastrophic burden for them, as PubMed is a huge database of unstructured texts that is growing steadily and very fast (~28 million scientific articles currently, with approximately two deposited per minute). As a result, biomedical text mining has empirically demonstrated its great value to biomedical research communities. Biomedical text has its own distinct challenging properties, attracting much attention from the natural language processing community. A recent large-scale study in 2018 showed that incorporating information into independent multiple-input layers outperforms concatenating them into a single input layer (for biLSTMs), producing better performance than state-of-the-art CDR classification models. This paper demonstrates that for CNNs it is the other way around: concatenation is better for CDR classification. To this end, we develop a CNN-based model with multiple inputs concatenated for CDR classification. Experimental results on the benchmark dataset demonstrate that it outperforms other recent state-of-the-art CDR classification models.

Keywords: Chemical disease relation prediction, Convolutional neural network, Biomedical text mining.
1. Introduction
Drug manufacturing is an extremely expensive and time-consuming process [1]. It requires approximately 14 years, with a total cost of about $1 billion, for a specific drug to become available on the pharmaceutical market [2]. Nevertheless, even after being in clinical use for a while, the side effects of many drugs remain unknown to scientists and/or clinical doctors [3]. Understanding drugs' side effects is essential for drug safety and toxicity. All these facts explain why chemical compounds (drugs) and diseases are among the top keywords searched on PubMed by biomedical researchers all over the world [4]. PubMed is a huge database of biomedical literature, currently holding ~28 million scientific articles and growing steadily and very fast (approximately two new articles added per minute).
Working with such a huge amount of unstructured textual documents in PubMed is a catastrophic burden for biomedical researchers. It can, however, be accelerated with the application of biomedical text mining, here for drug (chemical)-disease relation prediction in particular. Biomedical text mining has empirically demonstrated its great value to biomedical research communities [5-7].

Biomedical text has its own distinct challenging properties, attracting much attention from the natural language processing community [8, 9]. In 2004, an annual challenge called BioCreative (Critical Assessment of Information Extraction systems in Biology) was launched for biomedical text mining researchers. In 2016, researchers from NCBI organized the chemical-disease relation extraction task for the challenge [10].

To date, almost all proposed models only predict relationships between chemicals and diseases that appear within a single sentence (intra-sentence relationships) [11]. We note that the models producing state-of-the-art performance are mainly based on deep neural architectures [12-14], such as recurrent neural networks (RNNs) like bi-directional long short-term memory (biLSTM) in [15] and convolutional neural networks (CNNs) in [16-18].

Recently, Le et al. developed a biLSTM-based intra-sentence biomedical relation prediction model that incorporates various informative linguistic properties in an independent multiple-input manner [19]. Their experimental results demonstrate that incorporating information into independent multiple-input layers outperforms concatenating them into a single input layer (for biLSTM), producing better performance than relevant state-of-the-art models. To the best of our knowledge, there is currently no study confirming whether this still holds true for a CNN-based intra-sentence chemical-disease relation prediction model. To this end, this paper proposes a model for predicting intra-sentence chemical-disease relations in biomedical text using a CNN with a concatenation of multiple layers encoding different linguistic properties as input.

The rest of this paper is organized as follows. Section 2 describes the proposed method in detail. Experimental results are discussed in Section 3. Finally, Section 4 concludes the paper.
2. Method  
Given a preprocessed and tokenized sentence containing two entities of the types of interest (i.e. a chemical and a disease), our model first extracts the shortest dependency path (SDP) between the two entities on the dependency tree. The SDP contains the tokens (together with the dependency relations between them) that are important for understanding the semantic connection between the two entities (see Figure 1 for an example of an SDP).
Figure 1. Dependency tree for an example sentence.  
The shortest dependency path between two entities  
(i.e. depression and methyldopa) goes through the  
tokens “occurring” and “patients”.  
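As an illustration of this step, the following is a minimal sketch of SDP extraction, assuming spaCy for dependency parsing and networkx for the shortest-path search; the entity token indices are taken to be known from an earlier entity-markup step, and the function name and example call are hypothetical rather than part of the original system.

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def shortest_dependency_path(sentence, left_idx, right_idx):
    # Build an undirected graph over the dependency tree, keeping the
    # dependency label on each edge, then take the shortest path between
    # the two entity tokens.
    doc = nlp(sentence)
    graph = nx.Graph()
    for token in doc:
        for child in token.children:
            graph.add_edge(token.i, child.i, dep=child.dep_)
    path = nx.shortest_path(graph, source=left_idx, target=right_idx)
    sdp = []
    for a, b in zip(path, path[1:]):
        sdp.append(doc[a].text)                 # token on the path
        sdp.append(graph.edges[a, b]["dep"])    # dependency relation between a and b
    sdp.append(doc[path[-1]].text)
    return sdp

# e.g. shortest_dependency_path("Depression occurring in patients treated with methyldopa", 0, 6)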
Each token t on an SDP is encoded with an embedding e_t obtained by concatenating three embeddings of equal dimension d, namely e_w, e_pt and e_ps, which represent important linguistic information: the token itself (e_w), its part-of-speech (POS) tag (e_pt), and its position (e_ps). The two former partial embeddings are fine-tuned during model training. Position embeddings are indexed by distance pairs [dl%5, dr%5], where dl and dr are the distances from the token to the left and the right entity, respectively.
The embedding of each dependency relation r on the SDP has dimension 3*d; it is randomly initialized and fine-tuned as a model parameter during training.
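This input construction can be sketched as follows in PyTorch. The vocabulary sizes, the per-part dimension d = 100 and the flattening of the [dl%5, dr%5] pair into 25 buckets are illustrative assumptions, not values given by the paper.

import torch
import torch.nn as nn

d = 100  # dimension of each partial embedding (assumption; then D = 3*d = 300)

word_emb = nn.Embedding(30000, d)      # e_w: token embedding, fine-tuned during training
pos_tag_emb = nn.Embedding(50, d)      # e_pt: POS-tag embedding, fine-tuned during training
position_emb = nn.Embedding(25, d)     # e_ps: one vector per [dl % 5, dr % 5] pair (5 x 5 buckets)
dep_emb = nn.Embedding(60, 3 * d)      # one 3*d-dimensional embedding per dependency relation

def embed_token(word_id, pos_id, dl, dr):
    # e_t = [e_w ; e_pt ; e_ps], a vector of dimension 3*d = D
    bucket = torch.tensor([(dl % 5) * 5 + (dr % 5)])
    return torch.cat([word_emb(torch.tensor([word_id])),
                      pos_tag_emb(torch.tensor([pos_id])),
                      position_emb(bucket)], dim=-1)

def embed_relation(dep_id):
    # dependency relations are embedded directly into 3*d dimensions
    return dep_emb(torch.tensor([dep_id]))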
Each SDP is thus embedded into the space R^(N x D) (see Figure 2), where N is the number of tokens and dependency relations on the SDP and D = 3*d. The embedded SDP is fed as input into a conventional convolutional neural network (CNN [20]), which classifies whether or not a predefined relation (i.e. a chemical-induced disease relation) holds between the two entities.

Figure 2. Embedding by concatenation of the shortest dependency path (SDP) from the example in Figure 1.
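For concreteness, a minimal PyTorch sketch of this single concatenated-input CNN is given below. The filter widths and counts, the two 128-unit hidden layers and the binary output follow the hyper-parameters of Section 2.2; everything else (ReLU activations, no padding or dropout) is an assumption.

import torch
import torch.nn as nn

class ConcatInputCNN(nn.Module):
    def __init__(self, D=300, n_classes=2):
        super().__init__()
        # filter banks from Section 2.2: 32 x (2 x 300), 128 x (3 x 300), 32 x (4 x 300), 96 x (5 x 300)
        self.convs = nn.ModuleList([
            nn.Conv1d(D, 32, kernel_size=2),
            nn.Conv1d(D, 128, kernel_size=3),
            nn.Conv1d(D, 32, kernel_size=4),
            nn.Conv1d(D, 96, kernel_size=5),
        ])
        self.fc1 = nn.Linear(32 + 128 + 32 + 96, 128)
        self.fc2 = nn.Linear(128, 128)
        self.out = nn.Linear(128, n_classes)

    def forward(self, sdp_emb):            # sdp_emb: (batch, N, D), N = SDP length
        x = sdp_emb.transpose(1, 2)        # Conv1d expects (batch, D, N)
        # convolve, then max-pool each feature map over the sequence dimension
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = torch.relu(self.fc1(torch.cat(feats, dim=1)))
        h = torch.relu(self.fc2(h))
        return self.out(h)                 # logits: relation vs. no relation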
2.1. Multiple-channel embedding

For multiple-channel embedding, instead of concatenating the three partial embeddings of each token on an SDP, we maintain three independent embedding channels. For dependency relations on the SDP, the three channels hold identical embeddings. As a result, SDPs are embedded into R^(n x d x c), where n is the number of tokens and dependency relations on the SDP, d is the embedding dimension, and c = 3 is the number of embedding channels.

To calculate feature maps, we follow the scheme of Kim (2014) [21]. Each CNN filter f_i is slid along each embedding channel c independently, creating a corresponding feature map i_c. The max-pooling operator is then applied over the feature maps from all channels (three in our case) to create a single feature value for filter f_i (Figure 3).
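A corresponding sketch of the multiple-channel variant is shown below. Sharing one filter bank across the three channels and pooling over both channels and positions is our reading of the description above; the per-channel dimension and filter count are illustrative assumptions.

import torch
import torch.nn as nn

class MultiChannelPooling(nn.Module):
    def __init__(self, d=300, n_filters=32, width=2):
        super().__init__()
        # the same filters f_i are slid along every embedding channel
        self.conv = nn.Conv1d(d, n_filters, kernel_size=width)

    def forward(self, sdp_channels):                   # (batch, c=3, N, d)
        feature_maps = []
        for ch in range(sdp_channels.size(1)):         # one feature map i_c per channel
            x = sdp_channels[:, ch].transpose(1, 2)    # (batch, d, N)
            feature_maps.append(torch.relu(self.conv(x)))
        stacked = torch.stack(feature_maps, dim=1)     # (batch, c, n_filters, N - width + 1)
        # max pooling over channels and positions gives one value per filter f_i
        return stacked.amax(dim=(1, 3))                # (batch, n_filters)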
2.2. Hyper-parameters

The model's hyper-parameters are empirically set as follows:
- Filter size: n x d, where d is the embedding dimension (300 in our experiments) and n is the number of consecutive elements (tokens/POS tags, relations) on the SDP covered by the filter (Figure 3).
- Number of filters: 32 filters of size 2 x 300, 128 of size 3 x 300, 32 of size 4 x 300, and 96 of size 5 x 300.
- Number of hidden layers: 2.
- Number of units at each hidden layer: 128.
- Number of training epochs: 100.
- Patience for early stopping: 10.
- Optimizer: Adam.
3. Experimental results  
3.1. Dataset  
Our experiments are conducted on the BioCreative V CDR data [10]. It is an annotated text corpus that consists of human annotations of chemicals, diseases and their chemical-induced disease (CID) relations at the abstract level. The dataset contains 1500 PubMed articles divided into three subsets for training, development and testing. Most of the 1500 articles (1400 of them) were selected from the CTD dataset; the remaining 100 articles in the test set are completely different, carefully selected articles. All these data are manually curated. Detailed statistics are shown in Table 1.
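As a concrete example of working with this corpus, the sketch below reads the gold CID relations, assuming the PubTator-style plain-text distribution of the BioCreative V CDR data; the file name is a placeholder.

from collections import defaultdict

def load_cid_relations(path="CDR_TrainingSet.PubTator.txt"):
    # Map each PMID to its set of gold (chemical MeSH ID, disease MeSH ID) pairs.
    relations = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # relation lines look like: PMID <tab> CID <tab> ChemicalID <tab> DiseaseID
            if len(fields) == 4 and fields[1] == "CID":
                pmid, _, chem_id, dis_id = fields
                relations[pmid].add((chem_id, dis_id))
    return relations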
3.2. Model evaluation  
We merge the training and development subsets of the BioCreative V CDR corpus into a single training dataset, which is then divided into new training and validation/development sets with a ratio of 85%:15%. To stop the training process at the right time, we use early stopping based on the F1-score on the new validation data.

The entire text is first passed through a sentence splitter. Then, based on the disease and chemical names marked in the previous step, we filter out all sentences containing at least one pair of chemical-disease entities. For every sentence found, we classify the relation of each pair of chemical-disease entities. We perform model training and evaluation 15 times on the new training and development sets; the averaged F1 on the test set is taken as the final evaluation result, to make sure that the model also works well on unseen samples.

Finally, the model that achieves the best results at the sentence level is applied to the abstract-level problem to compare with other very recent state-of-the-art methods.
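The protocol above can be summarized by the following sketch, in which build_model, train_one_epoch and evaluate_f1 are hypothetical placeholders standing in for the CNN of Section 2, one Adam optimization pass, and F1 scoring, respectively.

import random
import statistics

def run_once(examples, test_set, seed):
    # 85%/15% split of the merged training + development data
    random.seed(seed)
    examples = list(examples)
    random.shuffle(examples)
    cut = int(0.85 * len(examples))
    train, valid = examples[:cut], examples[cut:]
    model = build_model()                       # hypothetical: CNN from Section 2
    best_f1, best_state, patience = 0.0, None, 10
    for epoch in range(100):                    # at most 100 epochs
        train_one_epoch(model, train)           # hypothetical Adam training pass
        f1 = evaluate_f1(model, valid)          # hypothetical validation F1
        if f1 > best_f1:
            best_f1, best_state, patience = f1, model.state_dict(), 10
        else:
            patience -= 1
            if patience == 0:                   # early stopping with patience 10
                break
    model.load_state_dict(best_state)
    return evaluate_f1(model, test_set)

def averaged_test_f1(examples, test_set, runs=15):
    # repeat training and evaluation 15 times and report the mean test F1
    return statistics.mean(run_once(examples, test_set, seed) for seed in range(runs))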
Figure 3. Model architecture with three-channel embedding as an input for an SDP.  
Table 1. Statistics on the BioCreative V CDR dataset [10]

Dataset       Articles   Chemical mentions   Chemical IDs   Disease mentions   Disease IDs   CID relations
Training      500        5203                1467           4182               1965          1038
Development   500        5347                1507           4244               1865          1012
Test          500        5385                1435           4424               1988          1066
3.3. Results and comparison  
Experimental results show that the model achieves an averaged F1 of 57% (precision of 55.6% and recall of 58.6%) at the abstract level. Compared with its variant that does not use dependency relations, we observe an improvement of about 2.6% in F1, which is very significant (see Table 2). This indicates that dependency relations carry much information for relation extraction. Meanwhile, POS tag and position information are also very useful, contributing 0.9% of the F1 improvement to the final performance of the model.
Table 2. Performance of our model with different linguistic information used as input

Information used                        Precision   Recall   F1
Tokens only                             53.7        55.4     54.5
Tokens, dependency relations (depRE)    55.7        56.8     56.2
Tokens, depRE and POS tags              55.7        57.5     56.6
Tokens, depRE, POS and position         55.6        58.6     57.0
Compared with recent state-of-the-art models such as MASS [19], ASM [22] and the tree-kernel based model [23], our model performs better (Table 3). Our model and MASS exploit only intra-sentence information (namely SDPs, POS tags and positions), ignoring cross-sentence relations, while the other two incorporate cross-sentence information. We note that cross-sentence relations account for 30% of all relations in the CDR dataset. This probably explains why ASM achieves better recall (67.4%) than our model (58.6%).
Table 3. Performance of our model in comparison with other state-of-the-art models

Model                 Relations                   Precision   Recall   F1
Zhou et al., 2016     Intra- and inter-sentence   64.9        49.2     56.0
Panyam et al., 2018   Intra- and inter-sentence   49.0        67.4     56.8
Le et al., 2018       Intra-sentence              58.9        54.9     56.9
Our model             Intra-sentence              55.6        58.6     57.0
4. Conclusion  
This paper experimentally demonstrates that CNNs predict abstract-level chemical-induced disease relations in the biomedical literature better when using a single concatenated input embedding rather than independent multiple input channels. The reverse holds for biLSTMs, for which multiple independent channels give better performance, as shown in a recent large-scale related study [19]. To this end, this paper presents a model for predicting chemical-induced disease relations in biomedical text based on a CNN with concatenated input embeddings. Experimental results on the benchmark dataset show that our model outperforms three recent state-of-the-art related models.
Acknowledgements

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2016.14.

References

[1] S.M. Paul, D.S. Mytelka, C.T. Dunwiddie, C.C. Persinger, B.H. Munos, S.R. Lindborg, A.L. Schacht, How to improve R&D productivity: The pharmaceutical industry's grand challenge, Nat. Rev. Drug Discov. 9(3) (2010) 203-214. https://doi.org/10.1038/nrd3078.
[2] J.A. DiMasi, New drug development in the United States from 1963 to 1999, Clinical Pharmacology and Therapeutics 69 (2001) 286-296. https://doi.org/10.1067/mcp.2001.115132.
[3] C.P. Adams, V. Van Brantner, Estimating the cost of new drug development: Is it really $802 million? Health Affairs 25 (2006) 420-428. https://doi.org/10.1377/hlthaff.25.2.420.
[4] R.I. Doğan, G.C. Murray, A. Névéol et al., Understanding PubMed user search behavior through log analysis, Database, 2009.
[5] G.K. Savova, J.J. Masanz, P.V. Ogren et al., Mayo clinical text analysis and knowledge extraction system (cTAKES): Architecture, component evaluation and applications, Journal of the American Medical Informatics Association, 2010.
[6] T.C. Wiegers, A.P. Davis, C.J. Mattingly, Collaborative biocuration-text mining development task for document prioritization for curation, Database, 2012, bas037.
[7] N. Kang, B. Singh, C. Bui et al., Knowledge-based extraction of adverse drug events from biomedical text, BMC Bioinformatics 15, 2014.
[8] A. Névéol, R.L. Doğan, Z. Lu, Semi-automatic semantic annotation of PubMed queries: A study on quality, efficiency, satisfaction, Journal of Biomedical Informatics 44, 2011.
[9] L. Hirschman, G.A. Burns, M. Krallinger, C. Arighi, K.B. Cohen et al., Text mining for the biocuration workflow, Database, 2012, bas020.
[10] Wei et al., Overview of the BioCreative V Chemical Disease Relation (CDR) Task, Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, 2015.
[11] P. Verga, E. Strubell, A. McCallum, Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1 (2018) 872-884.
[12] Y. Shen, X. Huang, Attention-based convolutional neural network for semantic relation extraction, In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 2016, pp. 2526-2536.
[13] Y. Peng, Z. Lu, Deep learning for extracting protein-protein interactions from biomedical literature, In Proceedings of the BioNLP 2017 Workshop, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 29-38.
[14] S. Liu, F. Shen, R. Komandur Elayavilli, Y. Wang, M. Rastegar-Mojarad, V. Chaudhary, H. Liu, Extracting chemical-protein relations using attention-based neural networks, Database, 2018.
[15] H. Zhou, H. Deng, L. Chen, Y. Yang, C. Jia, D. Huang, Exploiting syntactic and semantics information for chemical-disease relation extraction, Database, 2016, baw048.
[16] S. Liu, B. Tang, Q. Chen et al., Drug-drug interaction extraction via convolutional neural networks, Computational and Mathematical Methods in Medicine (2016) 1-8. https://doi.org/10.1155/2016/6918381.
[17] L. Wang, Z. Cao, G. De Melo et al., Relation classification via multi-level attention CNNs, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics 1 (2016) 1298-1307.
[18] J. Gu, F. Sun, L. Qian et al., Chemical-induced disease relation extraction via convolutional neural network, Database (2017) 1-12. https://doi.org/10.1093/database/bax024.
[19] H.Q. Le, D.C. Can, S.T. Vu, T.H. Dang, M.T. Pilehvar, N. Collier, Large-scale Exploration of Neural Relation Classification Architectures, In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2266-2277.
[20] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998) 2278-2324.
[21] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882, 2014.
[22] N.C. Panyam, K. Verspoor, T. Cohn, K. Ramamohanarao, Exploiting graph kernels for high performance biomedical relation extraction, Journal of Biomedical Semantics 9(1) (2018) 7.
[23] H. Zhou, H. Deng, L. Chen, Y. Yang, C. Jia, D. Huang, Exploiting syntactic and semantics information for chemical-disease relation extraction, Database, 2016.