
Graduate student: 謝居呈
Chu-Cheng Hsieh
Thesis title: 應用機器學習理論改良分類竄改過之中英文垃圾電子郵件
Apply Machine Learning Theory to Classify Camouflage E-Mail Written in Chinese and English
Advisor: 洪西進
Shi-Jinn Horng
Oral defense committee: 范國清
Kuo-Ching Fan
唐永新
Yung-Hsin Tang
胡俊之
Chun-Chih Hu
蘇民揚
Min-Yang Su
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: Chinese
Pages: 66
Keywords (Chinese): 垃圾郵件, 機器學習, 過濾器, 貝氏分類
Keywords (English): classification, Bayesian, filter, machine learning, spam mail

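This thesis has two main research directions. The first is to analyze various machine learning theories and apply them to a spam e-mail filter that supports classifying spam written in both Chinese and English. The second is to study how to account effectively for the relationships among the feature terms in a message, and to propose an improved sparse binary polynomial hashing method that achieves higher filtering accuracy.
The main improvement is applied to classification with the Bayesian probability model. Sparse binary polynomial hashing is used to preprocess the features and is paired with a suitable Chinese word segmentation method, yielding an improved Bayesian model that can be applied to Chinese e-mail. This classifier is compared with other classifiers such as support vector machines, KNN, and K-Centroid. A complete Chinese spam corpus is also built in Microsoft Access, and a workable Chinese e-mail classification architecture is proposed for the Microsoft Windows platform.
Building on the improved Bayesian model, and on a further analysis of the characteristics of Chinese spam, a feedback-based client-server joint defense architecture is proposed. The system combines the filtering mechanisms at the MTA and the MUA to provide stronger filtering: it blocks deliberately altered (camouflaged) spam while maintaining excellent spam filtering performance.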
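The abstract pairs the Bayesian model with a Chinese word segmentation step but does not say which method is used. As an illustration only, the sketch below assumes forward maximum matching against a small word list; the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are hypothetical and are not taken from the thesis.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Split text into words by greedily taking the longest lexicon match.

    Assumption for this sketch: characters that match no lexicon entry
    are emitted as single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon, invented for the example.
lexicon = {"垃圾", "郵件", "垃圾郵件", "過濾器"}
print(forward_max_match("垃圾郵件過濾器", lexicon))  # ['垃圾郵件', '過濾器']
```

The resulting word list is what a token-based filter would consume in place of whitespace-separated English words.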


This thesis addresses two major topics. The first is to analyze machine learning methods and apply them to the problem of classifying spam mail written in English and Chinese. The second concerns the relations between features: we propose a method that combines the "Sparse Binary Polynomial Hashing Method" with a "Bayesian Classifier." We believe this process can achieve better precision in classification problems.
The main improvement is applied to the Bayesian classifier. We compare this process with other machine learning methods. Using the advanced Bayesian method yields better precision. Most importantly, the procedure also works well on spam written in Chinese. In the meantime, we build a well-defined Chinese spam database with Microsoft Access, which gives us a complete testing environment for a modern mail system.
We join the different processes together and investigate the properties of modern spam mail. Finally, a new client-server system is proposed. This system has better efficiency and more accurate precision. On this basis, the system can help users build a clean, safe, spam-free network environment.
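The combination of sparse binary polynomial hashing with a Bayesian classifier can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the five-token window, the CRC32 hash, the `<skip>` placeholder, the stripping of leading placeholders, the Laplace smoothing, and the names `sbph_features` and `SbphBayes` are all assumptions made for the example.

```python
import math
import zlib
from collections import Counter
from itertools import product

SKIP = "<skip>"  # placeholder for a skipped token (name chosen for this sketch)


def sbph_features(tokens, window=5):
    """Return hashed sparse phrases for every sliding window over a token list."""
    features = []
    for i, newest in enumerate(tokens):
        earlier = tokens[max(0, i - window + 1):i]  # up to window-1 earlier tokens
        # Each mask keeps or skips every earlier token; the newest is always kept.
        for mask in product((False, True), repeat=len(earlier)):
            phrase = [t if keep else SKIP for t, keep in zip(earlier, mask)]
            phrase.append(newest)
            while phrase[0] == SKIP:  # simplification: drop leading placeholders
                phrase.pop(0)
            features.append(zlib.crc32(" ".join(phrase).encode("utf-8")))
    return features


class SbphBayes:
    """Naive Bayes over SBPH features with Laplace smoothing (illustrative)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        self.counts[label].update(sbph_features(tokens))
        self.docs[label] += 1

    def spam_log_odds(self, tokens):
        """Positive values favor spam, negative values favor ham."""
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        n_spam = sum(self.counts["spam"].values())
        n_ham = sum(self.counts["ham"].values())
        for f in sbph_features(tokens):
            p_spam = (self.counts["spam"][f] + 1) / (n_spam + vocab)
            p_ham = (self.counts["ham"][f] + 1) / (n_ham + vocab)
            score += math.log(p_spam) - math.log(p_ham)
        return score


clf = SbphBayes()
clf.train(["buy", "cheap", "pills", "now"], "spam")
clf.train(["meeting", "at", "noon", "tomorrow"], "ham")
print(clf.spam_log_odds(["cheap", "pills"]))
```

Because each feature is a phrase with gaps rather than a single word, inserting or dropping a word inside a spam phrase still leaves some features that match the training data, which is one plausible reason such features help against altered messages. The tokens can equally be the output of a Chinese word segmenter.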

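Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
Chapter 1 Introduction
1.1 Motivation and Challenges
1.2 Problem Considerations
1.2.1 Message Format and Composition
1.2.2 Differences Between E-Mail Classification and Document Classification
1.2.3 Traditional Filtering Mechanisms
1.2.4 Typical Two-Tier Defense Architecture
1.2.5 Spam Filters at the MTA
1.2.6 Spam Filters at the MUA
1.3 Research Direction
1.3.1 Why Earlier Methods No Longer Apply
1.3.2 Main Research Directions
1.4 Research Framework and Thesis Outline
1.4.1 Research Framework
1.4.2 Thesis Outline
Chapter 2 Related Theory and Research Background
2.1 Preface
2.2 E-Mail Format Analysis
2.2.1 The E-Mail Transmission Process
2.2.2 The SMTP Protocol
2.3 KNN (K-Nearest-Neighbor) Classification Algorithm
2.4 Centroid-Based Classification Algorithm
2.5 Naïve Bayes Classification Algorithm
2.6 SVM & SSVM (Smooth Support Vector Machine) Classification Algorithms
2.7 FKC (Frequency Key Chain) Model
Chapter 3 Problem Definition and Feature Extraction
3.1 Training Data Sets
3.1.1 English E-Mail Data Set
3.1.2 Chinese E-Mail Data Set
3.2 Feature Selection
3.2.1 Methods for Selecting Features
3.2.2 How Features Are Represented
3.2.3 Handling Stop Words
3.3 System Classification Architecture
3.3.1 Stand-Alone Operation Diagram
3.3.2 Main Considerations for Stand-Alone Operation
3.3.3 Two-Tier Architecture (Domain-Level Joint Defense)
3.3.4 Considerations for Adopting a Client-Server Architecture
3.3.5 Next-Generation Spam Scanning Service (Web Service Architecture)
Chapter 4 Research Methods and Experimental Design
4.1 Experimental Design and Concept
4.1.1 Ideas
4.1.2 Basic Experiments
4.1.3 Balancing the Data Set
4.1.4 Alterations Deliberately Made to Evade Filters
4.1.5 Chinese Word Processing
4.2 Evaluation (Spam Precision and Spam Recall)
4.2.1 Performance Definitions for General Classification Problems
4.2.2 Evaluating Spam Filter Performance
4.2.3 Cross-Validation
4.3 Training Steps and Details
4.3.1 SVM
4.3.2 Bayes
4.4 Experimental Results
4.4.1 Preliminary Experiment: Comparison of Basic Learning Theories
4.4.2 Improvement 1: Feature Selection Strategy
4.4.3 Comparison with Other Improved Classification Rules
4.4.4 Word Segmentation and Chinese Classification Results
4.4.5 Performance Tuning and Analysis of Training Methods
4.5 Further Improvements
4.5.1 Sparse Binary Polynomial Hashing (SBPH)
4.5.2 A Study of Altered E-Mail
Chapter 5 Conclusions and Future Work
5.1 Analysis of Research Results
5.2 Future Research Priorities
5.3 Closing Remarks
Chapter 6 References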

