
Author: 郭金喜
Jin-Shea Kuo
Thesis title: 自網際網路抽取音譯詞研究
A Study on Extracting Transliterations from the Web
Advisor: 楊英魁
Ying-Kuei Yang
Committee members: 鄭伯順
Bor-Shenn Jeng
陳信希
梁婷
孫宗瀛
吳傳嘉
黎碧煌
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 131
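Chinese keywords: machine translation, transliteration pair extraction, unsupervised learning, active learning, multi-view learning, confusion matrix, machine transliteration
English keywords: transliteration, transliteration extraction, multi-view learning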
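  • With frequent cultural exchange, transliterated foreign terms keep flowing into every language. Machine transliteration therefore plays an important role in natural language processing research, particularly in named entity recognition (NER), cross-language information retrieval (CLIR), question answering (QA), and machine translation (MT). Machine transliteration studies how to translate a word from one language into another according to its pronunciation; this translation-by-sound is called transliteration for short. Such research usually requires a large number of transliteration pairs for training the translation model, so large collections of transliteration pairs are an indispensable resource. Collecting them by hand, however, is time-consuming and labor-intensive, so the automatic extraction of large numbers of transliteration pairs is the focus of this thesis. The rapidly growing Web has become one of the largest distributed databases in the world. New articles are published on it continually, and many are translated from foreign-language sources, so this huge non-parallel corpus contains many translated and transliterated terms. This thesis therefore uses the Web as a corpus for extracting transliteration pairs.
    This thesis proposes three learning frameworks for automatically extracting transliteration pairs from the Web. During extraction, a phonetic similarity model (PSM) computes the phonetic similarity between English and Chinese words; the PSM here comprises confusion matrices and a Chinese n-gram language model. With this model, automatic extraction becomes a two-step process of recognition and validation. First, in recognition, an English word is spotted and its most probable Chinese candidate is located in the surrounding context. Second, in validation, a hypothesis test screens the candidates to confirm the final transliterations. The thesis also presents a statistical analysis of the set of transliteration pairs used for performance evaluation, in order to better understand the characteristics of transliteration, model transliteration rules more accurately, and thereby improve extraction performance.
    Within the learning frameworks, supervised and unsupervised learning are first applied to extract transliteration pairs from a development corpus. Under supervised learning, the PSM achieves an F-measure of 0.739, far higher than the F-measure of 0.20 obtained with linguistic rules, which confirms the feasibility of the PSM. When the PSM is initialized with confusion matrices produced by automatic speech recognition and then used to extract transliteration pairs from a small Web corpus, unsupervised learning yields results very close to those of supervised learning. This confirms that such a PSM can be used to extract transliteration pairs in the Web environment.
    Next, the thesis applies active learning to automatic transliteration pair extraction, aiming to improve performance without external knowledge such as automatic speech recognition data. Active learning selects the most informative samples for learning rather than passively learning from whatever samples are received. The experiments show that, with the most effective strategy, active learning reaches an F-measure of 0.722 while requiring 90.2% less sample labeling than supervised learning. Finally, the thesis uses multi-view learning to further improve performance and reduce the need for manual labeling. Two strategies, Co-training and Co-EM (expectation maximization), are used in an unsupervised manner to extract transliteration pairs from the Web; the most effective setting achieves an F-measure of 0.727. With these learning methods, the required transliteration lexicon can be built quickly from the Web.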


    Machine transliteration, or phonetic transcription, plays an important role in natural language processing tasks such as named entity recognition (NER), cross-language information retrieval (CLIR), question answering (QA), and machine translation (MT). It is the process of translating a word from one language into another while preserving its original pronunciation, otherwise known as translation-by-sound. A collection of transliterations is important to the study of machine transliteration; however, constructing such a corpus is time-consuming and labor-intensive.
    This thesis proposes three learning frameworks for automatic transliteration extraction from the Web. We formulate the machine transliteration process using a phonetic similarity model (PSM), which consists of phonetic confusion matrices and a Chinese character n-gram language model. With the PSM, the extraction of transliteration pairs becomes a two-step process of recognition followed by validation. First, in the recognition process, we identify the most probable transliteration in the k-neighborhood of a spotted English word. Then, in the validation process, we qualify the transliteration pair candidates with a hypothesis test. We also carry out an analytical study of the statistics of several key factors in English-Chinese transliteration, such as lexical variation and phonetic variation, which result in casual transliteration, to help formulate the phonetic similarity model.
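The two-step process described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the confusion probabilities, the `FLOOR` value, the one-to-one alignment of sound units, and the fixed threshold standing in for the hypothesis test are all assumptions made for illustration, and the n-gram language model is left out entirely.

```python
import math

# Toy confusion probabilities P(Chinese syllable | English sound unit).
# Every value here is invented for illustration.
CONFUSION = {
    ("ko", "ke"): 0.6,
    ("ko", "gu"): 0.2,
    ("da", "da"): 0.9,
    ("da", "ta"): 0.1,
}
FLOOR = 1e-4  # probability assumed for pairs missing from the table


def log_similarity(en_units, zh_syllables):
    """Sum log confusion probabilities over a one-to-one alignment."""
    if len(en_units) != len(zh_syllables):
        return float("-inf")
    return sum(
        math.log(CONFUSION.get((e, z), FLOOR))
        for e, z in zip(en_units, zh_syllables)
    )


def recognize(en_units, candidates):
    """Recognition: pick the highest-scoring candidate near the English word."""
    return max(candidates, key=lambda c: log_similarity(en_units, c))


def validate(en_units, candidate, min_avg_logprob=math.log(0.1)):
    """Validation: accept only if the average per-unit score clears a threshold."""
    score = log_similarity(en_units, candidate)
    return score / len(en_units) >= min_avg_logprob


def extract(en_units, candidates):
    """Return the validated best candidate, or None if nothing qualifies."""
    if not en_units or not candidates:
        return None
    best = recognize(en_units, candidates)
    return best if validate(en_units, best) else None
```

Here `candidates` plays the role of the Chinese strings found in the neighborhood of the spotted English word, already converted to syllables. Keeping recognition and validation as separate functions mirrors the description: a best candidate always exists, but it is only kept when its score is convincing.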
    In the learning frameworks, we first present supervised and unsupervised learning to harvest transliterations from a development corpus. The experimental results validate the effectiveness of the PSM, which achieves an F-measure of 0.739 under supervised learning. Unsupervised learning bootstrapped with prior ASR (automatic speech recognition) knowledge performs very close to the supervised approach, which allows us to deploy automatic extraction of transliteration pairs in the Web space.
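For readers unfamiliar with the metric, the following Python sketch shows how an F-measure can be computed for a set of extracted pairs against a gold-standard set. It assumes the balanced form (the harmonic mean of precision and recall); the record does not state which weighting the thesis uses, and the function name and sample pairs are hypothetical.

```python
def f_measure(extracted, gold):
    """Balanced F-measure of an extracted pair set against a gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(extracted)  # share of extracted pairs that are right
    recall = correct / len(gold)          # share of gold pairs that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 pairs extracted, 2 of them among 4 gold pairs.
gold_pairs = {("kodak", "keda"), ("nokia", "nuojiya"),
              ("sony", "suoni"), ("intel", "yingteer")}
found_pairs = {("kodak", "keda"), ("nokia", "nuojiya"), ("sony", "xinli")}
score = f_measure(found_pairs, gold_pairs)
```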
    Then, we exploit the active learning algorithm, which actively selects the most informative samples for annotation instead of passively receiving samples for learning, to improve performance. With the most effective strategy, active learning achieves an F-measure of 0.722, close to the performance of supervised learning, while reducing the labeling effort by 90.2%. Finally, we further employ multi-view learning to alleviate the necessity of human annotation and improve performance. Two learning strategies, Co-training and Co-EM, are implemented in an unsupervised manner to discover transliterations from the Web. The most effective view setting achieves an F-measure of 0.727. The reported performance shows the effectiveness of our proposed approaches. By exploiting these approaches, we can obtain a set of transliterations easily and quickly from the Web.
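The idea of selecting the most informative samples can be sketched in Python as uncertainty sampling, one common active learning strategy. The record does not say which selection strategies the thesis evaluates, so treat this purely as an illustration; the function name, the confidence scores, and the 0.5 decision boundary are assumptions.

```python
def select_for_labeling(scored_pairs, batch_size):
    """Pick the candidates whose confidence lies closest to 0.5.

    scored_pairs: iterable of (candidate, confidence) with confidence in [0, 1].
    Candidates near 0.5 are the ones the current model is least sure about,
    so a human label for them is assumed to be the most informative.
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [candidate for candidate, _ in ranked[:batch_size]]


# Hypothetical confidences produced by the current model.
pool = [("pair_a", 0.95), ("pair_b", 0.52), ("pair_c", 0.10), ("pair_d", 0.45)]
to_label = select_for_labeling(pool, batch_size=2)
```

Only the selected candidates go to the annotator; confidently accepted or rejected ones are left alone. A reduction in labeling effort such as the reported 90.2% would come from this kind of selective annotation.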

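    Abstract (Chinese) iii
    ABSTRACT vi
    Acknowledgments viii
    Table of contents 1
    List of figures 3
    List of tables 4
    List of acronyms 5
    1. Introduction 6
    2. Related work 12
    3. Phonetic similarity model 16
    3.1. Fundamentals of English-Chinese transliteration 16
    3.1.1. Principles of transliteration 17
    3.1.2. Statistical analysis 21
    3.2. Phonetic similarity model 27
    3.2.1. Derivation of the Chinese transliteration formulation 27
    3.2.2. Selecting transliteration candidates 33
    3.2.3. Syllable correspondence strategy 35
    3.2.4. Computing phonetic similarity 38
    3.2.5. Learning strategy for syllable correspondence 41
    3.3. Textual confusion matrix from Chinese romanization 43
    3.4. Rule-based confusion matrix 45
    3.5. Experimental results 46
    3.5.1. Textual confusion matrix from Chinese romanization 47
    3.5.2. Rule-based confusion matrix 48
    3.5.3. Confusion matrix from automatic speech recognition 49
    3.5.4. Learning the confusion matrix 52
    3.5.5. Learning the confusion matrix from bilingual Web pages 57
    3.6. Discussion 58
    4. Active learning for automatic transliteration extraction 62
    4.1. Extending the transliteration similarity model 63
    4.1.1. Grapheme-based approach 64
    4.1.2. PSM parameter estimation 65
    4.2. Unsupervised learning 66
    4.3. Active learning 67
    4.4. Active-unsupervised learning 70
    4.5. Co-occurrence model 71
    4.6. Experimental results 73
    4.6.1. Unsupervised learning 74
    4.6.2. Active learning 75
    4.6.3. Active-unsupervised learning 77
    4.6.4. Co-occurrence model 79
    4.6.5. Building a transliteration lexicon from search engines 81
    5. Multi-view learning for transliteration extraction 84
    5.1. Multi-view syllable correspondences 84
    5.2. Co-training and Co-EM learning 90
    5.3. Experimental results 95
    5.3.1. Unsupervised learning 96
    5.3.2. Co-training 97
    5.3.3. Co-EM learning 99
    5.3.4. Co-occurrence model 101
    5.4. Comparison with other transliteration pair extraction methods 102
    6. Conclusions and future work 107
    6.1. Conclusions 107
    6.2. Future work 109
    References 110
    Author biography 119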

