Graduate student: 陳威達 (Wei-Da Chen)
Thesis title: Mining Translations of Chinese Names from Web Corpora by Using a Query Expansion Technique and Support Vector Machine (使用查詢擴展技術及支援向量機由網路資料集挖掘中文姓名翻譯)
Advisors: 李漢銘 (Hahn-Ming Lee), 何建明 (Jan-Ming Ho)
Oral examination committee: 王勝德 (Sheng-De Wang), 李育杰 (Yuh-Jye Lee), 王榮英 (Jung-Ying Wang)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95
Language: English
Number of pages: 78
Keywords (Chinese): 資料探勘, 姓名翻譯, 查詢擴展, 支援向量機
Keywords (English): Data mining, Name translation, Query expansion, Support Vector Machine
  • Chinese name translation is a special case of named entity translation. It is challenging because many Romanization systems coexist and because many people add words to their English names that are unrelated to their Chinese names. Translating a scholar’s name into the correct English name makes it much easier to find information about that scholar’s academic achievements on the Web, so Chinese name translation is an important problem.
    In this thesis, we first propose a classification of Chinese name translations and then a novel method for mining Chinese name translations from Web corpora. The method combines a query expansion technique and a Support Vector Machine (SVM) with two kinds of features, phonetic and distance, to extract name translation candidates. Query expansion retrieves Web pages that contain both the input Chinese name and its English translation more effectively and precisely. The SVM learns from training samples to verify whether a name translation candidate is correct, which reduces the side effects of the subjective judgment involved in heuristic rules. We classify Chinese names into eight types according to their corresponding English translations. The experimental results show that our method effectively mines the correct translations for three common name types.
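    The query expansion step and the distance feature described in the abstract can be illustrated with a minimal Python sketch. This record does not give the thesis's actual implementation, so everything here is an assumption made for illustration: the small surname romanization table, the function names `expand_queries`, `tokenize`, and `token_distance`, and the choice to measure distance as the number of tokens between the Chinese name and an English candidate in a snippet. The SVM verification step is not shown.

```python
import re

# Hypothetical, illustrative table: a few surnames with romanizations
# drawn from different systems. Not taken from the thesis.
SURNAME_ROMANIZATIONS = {
    "陳": ["Chen", "Chan", "Tan"],
    "李": ["Lee", "Li"],
    "何": ["Ho", "He"],
}


def expand_queries(chinese_name):
    """Build one expanded query per known romanization of the surname.

    Assumes a single-character surname at the start of the name.
    """
    surname = chinese_name[0]
    variants = SURNAME_ROMANIZATIONS.get(surname, [])
    return [f'"{chinese_name}" {variant}' for variant in variants]


def tokenize(text):
    """Split text into Latin words (hyphens kept) and runs of CJK characters."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*|[\u4e00-\u9fff]+", text)


def token_distance(snippet, chinese_name, candidate):
    """Smallest number of tokens between the Chinese name and the candidate.

    Returns None when either the name or the candidate is absent.
    """
    tokens = tokenize(snippet)
    cand = tokenize(candidate)
    m = len(cand)
    if m == 0:
        return None
    name_pos = [i for i, tok in enumerate(tokens) if chinese_name in tok]
    cand_pos = [j for j in range(len(tokens) - m + 1) if tokens[j:j + m] == cand]
    best = None
    for i in name_pos:
        for j in cand_pos:
            if i < j:
                gap = j - i - 1            # name appears before the candidate
            elif i > j + m - 1:
                gap = i - (j + m - 1) - 1  # name appears after the candidate
            else:
                continue                   # overlapping positions, skip
            if best is None or gap < best:
                best = gap
    return best


snippet = "李漢銘 教授 Prof. Hahn-Ming Lee, Department of CSIE"
queries = expand_queries("李漢銘")
distance = token_distance(snippet, "李漢銘", "Hahn-Ming Lee")
```

    In this sketch, a smaller distance suggests that the English candidate is more likely to be the translation of the nearby Chinese name. A value like this, together with a phonetic similarity score, could serve as an input feature for a classifier such as an SVM.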

    Contents
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
    Chapter 2 Background
    Chapter 3 Chinese Name Translation Mining System (CNTMS)
    Chapter 4 Experiments
    Chapter 5 Discussion and Conclusion
    References
    Vita

    List of Tables
    Table 1. 8 Types of translated Chinese name formats
    Table 2. Distribution of Database I & Database II

    List of Figures
    Figure 1. Returned Web page snippets by using a person’s name and surname translation as a query
    Figure 2. A common example of translating movie title in Chinese
    Figure 3. The system architecture of CNTMS
    Figure 4. Components of the Query expander
    Figure 5. An example of the distance between two terms
    Figure 6. The detailed and comprehensive steps running in the Candidate extractor
    Figure 7. Translation accuracy of Dataset I by using the phonetic feature
    Figure 8. Translation accuracy of Dataset II by using the phonetic feature
    Figure 9. Translation accuracy of Dataset I by using the distant feature
    Figure 10. Translation accuracy of Dataset II by using the distant feature
    Figure 11. Translation accuracy of each name type of Dataset I by using the phonetic feature
    Figure 12. Translation accuracy of each name type of Dataset I by using the distant feature
    Figure 13. Translation accuracy of each name type of Dataset II by using the phonetic feature
    Figure 14. Translation accuracy of each name type of Dataset II by using the distant feature
    Figure 15. Translation accuracy of Dataset I by using both the phonetic and the distant features
    Figure 16. Translation accuracy of Dataset I by using both the phonetic and the distant features
    Figure 17. Translation accuracy of each name type of Dataset I by using both the phonetic and the distant features
    Figure 18. Translation accuracy of each name type of Dataset II by using both the phonetic and the distant features
