
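Graduate student: 鄧維侖 (Wei-Lun Teng)
Thesis title: A Spam Filtering Approach Utilizing Personalized Legitimate Mail Filter (搭配個人合法郵件過濾器之垃圾郵件過濾方法)
Advisor: 鄧惟中 (Wei-Chung Teng)
Oral examination committee: 曾文貴 (Wen-Guey Tzeng), 雷欽隆 (Chin-Laung Lei), 項天瑞 (Tien-Ruey Hsiang)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2009
Academic year of graduation: 97 (ROC academic year)
Language: English
Number of pages: 52
Keywords (translated from Chinese): personalized spam filter, two-tier mail filtering system
Keywords (English): personalized spam filtering, content-based, two-tier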

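Personal spam filters are generally deployed on the client side and can use sender-list labels and information in the user's mailbox to strengthen mail analysis, so they have the potential to filter better than server-side spam filtering systems. However, current spam filters have difficulty precisely classifying mail that shows the characteristics of both legitimate mail and spam, and can only trade off between lowering false positives and lowering false negatives. Earlier research in our laboratory proposed placing a legitimate mail filter in front of the traditional spam filter; separating the two filters and using them in combination lowers the probability that legitimate mail is misclassified, that is, the false positives mentioned above. If a message contains information the user may be interested in, or its features match the legitimate mail in the user's mailbox, the legitimate mail filter delivers it to the user's legitimate-mail inbox first, which reduces the chance that such mail is misjudged when handed directly to a traditional spam filter. This thesis follows up by building a complete experimental platform and implementing filtering systems for both Chinese and English mail, adjusting and combining thresholds with the aim of effectively lowering false positives at the cost of a small increase in false negatives. In the experiments we set up two mail servers to compare the filtering results of the two-tier mail filtering system with those of a traditional mail filter. The experimental results show that, with appropriate thresholds, the proposed two-tier spam filtering system does achieve better filtering results and lowers the misclassification rate of legitimate mail.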


Compared with a server-side spam filter, a personal spam filter has the advantage of using personal information, such as the address book and local mail folders, to reach higher accuracy in spam filtering. However, filters trained on both spam and personal mail may have difficulty classifying e-mails that share characteristics of both spam and ham. Earlier research suggests that putting a legitimate mail filter in front of a traditional personal spam filter may effectively decrease the false positive rate and allow some spam mails that the user might be interested in to pass through. E-mails classified as legitimate by the legitimate mail filter pass directly, while the remaining e-mails are processed by the spam filter in the ordinary way. This thesis focuses on implementation issues, including building a testbed, developing different training processes for Chinese and English mails, and fine-tuning the thresholds of both filters to reach the lowest false positive rate at a reasonable false negative rate. Experiments are performed on two mail servers: one equipped with an ordinary spam filter only, and the other equipped with both the legitimate mail filter and the spam filter. The experimental results demonstrate that, given the same false negative rate, the two-filter approach offers a much lower false positive rate than the ordinary one.
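The two-tier flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the keyword-based toy scorers, and the threshold value 0.5 are all hypothetical stand-ins for the trained legitimate mail filter and the spam filter.

```python
from typing import Callable, Sequence, Tuple


def two_tier_classify(
    mail: str,
    legit_score: Callable[[str], float],
    spam_score: Callable[[str], float],
    legit_threshold: float,
    spam_threshold: float,
) -> str:
    """Return "ham" or "spam" for one message (hypothetical interface)."""
    # Tier 1: mail that resembles the user's legitimate mail passes directly.
    if legit_score(mail) >= legit_threshold:
        return "ham"
    # Tier 2: the remaining mail goes through the ordinary spam filter.
    return "spam" if spam_score(mail) >= spam_threshold else "ham"


def error_rates(predicted: Sequence[str], actual: Sequence[str]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate).

    A false positive is ham flagged as spam; a false negative is spam passed as ham.
    """
    ham = [p for p, a in zip(predicted, actual) if a == "ham"]
    spam = [p for p, a in zip(predicted, actual) if a == "spam"]
    fpr = sum(p == "spam" for p in ham) / len(ham) if ham else 0.0
    fnr = sum(p == "ham" for p in spam) / len(spam) if spam else 0.0
    return fpr, fnr


# Toy scorers (assumptions for illustration only): fraction of keywords present.
LEGIT_WORDS = {"thesis", "meeting", "lab"}
SPAM_WORDS = {"free", "discount", "winner"}


def toy_legit_score(mail: str) -> float:
    return len(set(mail.lower().split()) & LEGIT_WORDS) / len(LEGIT_WORDS)


def toy_spam_score(mail: str) -> float:
    return len(set(mail.lower().split()) & SPAM_WORDS) / len(SPAM_WORDS)


mixed = "free discount on thesis printing for lab meeting"
label = two_tier_classify(mixed, toy_legit_score, toy_spam_score, 0.5, 0.5)
```

In this toy setup the message `mixed` carries both spam-like and legitimate-looking words. The spam scorer alone would flag it at a threshold of 0.5, but the first tier lets it through, which is the kind of false positive the two-tier arrangement is meant to reduce. Raising `legit_threshold` passes fewer messages in tier 1, which is one way to trade false negatives against false positives.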

Abstract
Contents
List of Tables
List of Figures
CHAPTER 1 INTRODUCTION
    1.1 Motivation
    1.2 Thesis Contributions
    1.3 Thesis Organization
CHAPTER 2 RELATED WORK
    2.1 Rule-based Filtering Technique
    2.2 Content-based Filtering Technique
CHAPTER 3 THE TWO-TIER SPAM FILTER STRUCTURE
    3.1 The Legitimate Mail Filter
    3.2 The Spam Mail Filter
CHAPTER 4 THE LEGITIMATE MAIL FILTER
    4.1 E-mail of Chinese Language Family
        4.1.1 Training Legitimate Mails with Chinese Content
        4.1.2 Filtering Legitimate Mails with Chinese Content
    4.2 E-mail of English Language Family
        4.2.1 Training Legitimate Mails with English Content
        4.2.2 Porter Stemming
        4.2.3 Stop Words
        4.2.4 Term Frequency and Inverse Document Frequency Matrix
        4.2.5 Sliding Window Strategy
        4.2.6 Filtering Legitimate Mails with English Content
CHAPTER 5 THE SPAM MAIL FILTER
    5.1 About SpamAssassin
    5.2 The Features of SpamAssassin
CHAPTER 6 EXPERIMENT
    6.1 Experiment Platform
    6.2 Experiment Using Personal Mails
    6.3 Experiment Using TREC 2007
    6.4 Experiment Analysis
        6.4.1 Data of Personal Mails
        6.4.2 Data of TREC 2007
CHAPTER 7 CONCLUSION AND FUTURE RESEARCH
References

