
Graduate Student: 陳柏宇 (Bo-Yu Chen)
Thesis Title: Model Adaptation Learning for Large Scale Gene Tagging Task
Advisor: 李育杰 (Yuh-Jye Lee)
Committee Members: 鮑興國 (Hsing-Kuo Pao), 戴碧如 (Bi-Ru Dai), 許鈞南 (Chun-Nan Hsu)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2009
Academic Year of Graduation: 97 (ROC calendar; 2008-2009)
Language: English
Number of Pages: 39
Keywords: named entity recognition, human gene tagging, conditional random fields, periodic step-size adaptation, model adaptation, voting scheme
Abstract:

    Gene tagging is a named entity recognition (NER) task in biomedical text mining: extracting mentions of genes and gene products from scientific text. It is difficult, however, to train a good model from the small amount of available human gene data and then use it to tag a large number of human gene and gene product mentions. We propose a model adaptation method based on conditional random fields (CRFs) to address this scarcity of human gene data. We select other data related to human genes, namely gene data covering all species, and use model adaptation to extract the information about human gene names that these data contain. With model adaptation we tag more human gene and gene product mentions while keeping the rate at which non-human genes are misjudged as human genes as low as possible. To improve performance further, we apply a voting scheme that combines the labeled results of models adapted with different proportions of the all-species data. Our experiments confirm that model adaptation does improve performance, and that combining several models by voting tags the test data better than any single model does.
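
    For readers unfamiliar with the model named above: the following is the standard linear-chain CRF of Lafferty et al. (2001), given as textbook background rather than as an equation quoted from the thesis. It defines the conditional probability of a label sequence $\mathbf{y}=(y_1,\dots,y_T)$ given a token sequence $\mathbf{x}$ as

    $$ p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\!\left(\sum_{t=1}^{T}\sum_{k}\lambda_k\,f_k(y_{t-1},y_t,\mathbf{x},t)\right), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'}\exp\!\left(\sum_{t=1}^{T}\sum_{k}\lambda_k\,f_k(y'_{t-1},y'_t,\mathbf{x},t)\right), $$

    where the $f_k$ are feature functions over adjacent labels and the input, the $\lambda_k$ are weights fitted during training (per the keywords, by stochastic gradient descent with periodic step-size adaptation), and $Z(\mathbf{x})$ normalizes over all candidate label sequences.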
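
    The voting scheme combines the labeled results of several adapted models. Below is a minimal Python sketch of one natural realization, per-token majority voting over BIO-style tag sequences; the function name, the tag set, and the tie-breaking behavior are illustrative assumptions, since the abstract does not specify how votes are combined.

        from collections import Counter

        # Assumed BIO tags for gene mention tagging: B-GENE begins a mention,
        # I-GENE continues it, O marks a token outside any mention.

        def majority_vote(predictions):
            """Combine per-token label sequences from several models by majority vote.

            predictions: a list of label sequences, one per model; all sequences
            must have the same length (one label per token of the sentence).
            Returns the combined label sequence.
            """
            assert len({len(seq) for seq in predictions}) == 1, "sequences must align"
            combined = []
            for labels_at_token in zip(*predictions):
                # most_common(1) breaks ties by first occurrence; a real system
                # might instead prefer the label from the strongest single model.
                combined.append(Counter(labels_at_token).most_common(1)[0][0])
            return combined

        # Three hypothetical adapted models tagging the five-token sentence
        # "the p53 tumor suppressor gene":
        model_outputs = [
            ["O", "B-GENE", "I-GENE", "I-GENE", "O"],
            ["O", "B-GENE", "O",      "O",      "O"],
            ["O", "B-GENE", "I-GENE", "O",      "O"],
        ]
        print(majority_vote(model_outputs))
        # -> ['O', 'B-GENE', 'I-GENE', 'O', 'O']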

Table of Contents:

    1 Introduction
      1.1 Background
      1.2 Motivation
    2 Learning Algorithms for Gene Tagging
      2.1 Conditional Random Fields
      2.2 Periodic Step-Size Adaptation
        2.2.1 Stochastic Gradient Descent
        2.2.2 Periodic Step-Size Adaptation
      2.3 Model Adaptation Method
    3 Data Source and Data Preprocessing
      3.1 Data Source
      3.2 Dataset Preprocessing
      3.3 Extracting Features
      3.4 Data Labeling and Performance Measures
        3.4.1 Data Labeling
        3.4.2 Performance Measures
    4 Experimental Results
      4.1 Baseline Performance
      4.2 Model Adaptation Measurement
      4.3 Voting Scheme
      4.4 Summary
    5 Conclusion

