
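Graduate student: 李美玲 (Mei-lin Li)
Thesis title: 基於機器學習理論建置中英文電子郵件過濾系統之研究 (A Comparative Study on Spam Filtering for English and Chinese E-mails)
Advisor: 洪西進 (Shi-jinn Horng)
Oral examination committee: 蘇民揚 (Ming-yang Su), 高宗萬 (Tzong-wann Kao), 馮輝文 (Huei-wen Ferng), 吳金雄 (Chin-hsiung Wu)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Publication year: 2006
Graduation academic year: 94
Language: Chinese
Number of pages: 74
Chinese keywords: 垃圾郵件 (spam), 機器學習 (machine learning), 過濾器 (filter), 分類 (classification)
Foreign-language keywords: spam, mail, classify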

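This thesis has two research directions. The first is to implement several machine learning methods and verify their accuracy in e-mail classification. The second is to analyze feature extraction methods and derive a good training rule.
The main function of the system is a classification mechanism with incremental learning capability that can be applied to both Chinese and English mail filtering. The proposed improvement adds weighting to feature selection, so that sufficiently representative features are selected while the dimensionality is reduced; different weight-assignment methods give different results. For Chinese mail, a word segmentation method is applied to improve system performance. The experiments show that two feature selection methods, Markovian features and Chi-square, give quite good classification results: they effectively reduce the amount of spam that users have to read and also reduce the loss caused by misclassifying users' important messages, which serves the goal of customized e-mail handling. The experiments confirm that the system delivers excellent classification performance.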
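As an illustration of the Chi-square feature selection named above, the following is a minimal sketch that scores each term with the standard chi-square statistic over a 2x2 term/class table and keeps the top-scoring terms. It is not the thesis's implementation: the function names, the tokenized toy messages, and the choice of document frequency as the count are assumptions made for this example, and the weighting scheme described in the thesis is not reproduced.

```python
from collections import Counter


def chi_square_scores(spam_docs, ham_docs):
    """Score each term by the chi-square statistic of its 2x2 term/class table.

    Each document is a list of tokens (hypothetical input format).
    """
    n_spam, n_ham = len(spam_docs), len(ham_docs)
    n = n_spam + n_ham
    # Document frequency: a term is counted at most once per message.
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    scores = {}
    for term in set(spam_df) | set(ham_df):
        a = spam_df[term]   # spam messages containing the term
        b = ham_df[term]    # legitimate messages containing the term
        c = n_spam - a      # spam messages without the term
        d = n_ham - b       # legitimate messages without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the term (or a class) carries no contrast.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_features(spam_docs, ham_docs, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = chi_square_scores(spam_docs, ham_docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (invented for illustration only).
spam_docs = [["free", "money", "now"], ["free", "offer"]]
ham_docs = [["meeting", "now"], ["project", "meeting"]]
top_terms = select_features(spam_docs, ham_docs, 2)
```

Terms that appear equally often in both classes, such as "now" in the toy corpus, score zero and are dropped first, which is how this kind of selection lowers the feature dimension while keeping discriminative terms.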


This research addresses two topics. The first is to implement several machine learning algorithms and verify their spam accuracy and recall. The second is to analyze feature selection methods so that an appropriate training rule can be derived.
We build a classifier with adaptive learning ability that can be applied to spam written in both Chinese and English. The main improvement is to incorporate different weighting methods into feature selection: on the one hand, representative features are selected; on the other hand, the dimensionality is lowered. Furthermore, we apply word segmentation to handle Chinese mail. According to the results, the Markovian features and Chi-square methods achieve the best performance. The system reduces the number of spam messages read by users and, at the same time, decreases the damage caused by misclassified mail. Finally, we show that the system achieves the goal of customer-oriented mail handling and performs well.
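The two effects described above, less spam reaching the user and less damage from misclassified legitimate mail, are commonly measured with spam recall and spam precision. The sketch below uses the standard definitions; the function name, label strings, and sample predictions are assumptions for illustration and are not taken from the thesis.

```python
def spam_precision_recall(true_labels, predicted_labels, spam="spam"):
    """Return (spam precision, spam recall) for paired label sequences."""
    pairs = list(zip(true_labels, predicted_labels))
    # Spam correctly filtered.
    tp = sum(1 for t, p in pairs if t == spam and p == spam)
    # Legitimate mail wrongly filtered (the costly error for the user).
    fp = sum(1 for t, p in pairs if t != spam and p == spam)
    # Spam that slipped through to the inbox.
    fn = sum(1 for t, p in pairs if t == spam and p != spam)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example: 4 spam and 4 legitimate messages.
truth = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "spam", "ham", "spam", "ham", "ham", "ham"]
precision, recall = spam_precision_recall(truth, predicted)
```

High spam recall corresponds to fewer spam messages read by the user, while high spam precision corresponds to fewer legitimate messages lost to the filter.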

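Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
Chapter 1 Introduction
1.1 Research Background and Problem Discussion
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Literature Review and Related Theory
2.1 Preface
2.2 E-mail Format
2.3 E-mail Transmission Process and Related Protocols
2.3.1 Transmission Process
2.3.2 SMTP (Simple Mail Transfer Protocol)
2.3.3 POP3
2.4 Anti-Spam Techniques
2.5 KNN (K-Nearest Neighborhood) Classification Algorithm
2.6 Centroid-Based Algorithm
2.7 Naïve Bayes Algorithm
2.8 SVM & SSVM Algorithms
2.9 FKC (Frequency Key Chain) Model
Chapter 3 System Architecture
3.1 System Architecture Diagram
3.2 Preprocessing
3.2.1 Stop Terms
3.2.2 Stemming
3.2.3 Chinese Word Processing
3.3 Feature Selection Methods
3.3.1 Term Frequency (TF)
3.3.2 Term Frequency - Inverse Document Frequency (TF-IDF)
3.3.3 Markovian Features
3.3.4 Chi-square
3.4 Mail Classifier
3.5 Analysis of Training Methods
3.5.1 TEFT (Train Every Thing)
3.5.2 TOE (Train Only Errors)
3.5.3 TUNE (Train Until No Errors)
3.6 Performance Evaluation (Spam Precision and Spam Recall)
Chapter 4 Experimental Methods and Result Analysis
4.1 Training Data Sets
4.1.1 English Data Set
4.1.2 Chinese Data Set
4.2 Validation Method
4.3 Experimental Results
4.3.1 Preliminary Experiment: Comparison of Basic Learning Methods
4.3.2 Improvement Step 1: Preprocessing
4.3.3 Improvement Step 2: Feature Extraction Methods
4.3.4 Improvement Step 3: Training Methods
4.3.5 Study of Tampered Mail
4.3.6 Comparison with Other Classification Methods
Chapter 5 Conclusions and Future Work
5.1 Analysis of Research Results
5.2 Future Research Directions
5.3 Closing Remarks
Chapter 6 References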


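Full-text release date: 2011/08/04 (campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)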