
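Graduate student: 廖婉淑
Wan-Shu Liao
Thesis title: 利用整合式的機器學習方式提高垃圾網站偵測率
Improving WEBSPAM Detection by Using a Hybrid Machine Learning
Advisor: 洪西進
Shi-Jinn Horng
Oral defense committee: 陳秋華, 王毓饒
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2009
Academic year of graduation: 97
Language: Chinese
Number of pages: 46
Chinese keywords: web spam (垃圾網站), machine learning (機器學習), information gain (資訊增益), support vector machine (支援向量機), decision tree (決策樹), clustering (叢聚法)
Foreign-language keywords: WEBSPAM, imbalance distribution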
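  • Web spam (WEBSPAM) refers to websites that use artificial means to improperly raise their own ranking (Rank) in search engines. Detecting and identifying web spam has therefore become a pressing task for search engine companies: too much web spam lowers a search engine's efficiency and harms its reputation, and users who land on spam sites may be exposed to attacks from malicious websites.
    This study uses the link-based features and content-based features of websites, with the well-known WEBSPAM UK2007 as the benchmark dataset, and combines supervised SVM with unsupervised clustering (Cluster) as its machine learning method. After feature selection (reducing 275 features to 21), the method not only improves classification efficiency but still achieves a detection rate of 0.832, which is 0.221 higher than the 0.611 obtained without clustering, while the false positive rate drops from 0.668 to 0.349. For classification problems in which the dataset distribution is imbalanced, it offers a solution with a better classification rate.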
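
    The record lists information gain among its keywords and reports a reduction from 275 features to 21, but it does not describe the selection procedure itself. The following is a minimal sketch of ranking features by information gain, assuming discrete feature values; the function names, feature names, and toy data are hypothetical and are not taken from the thesis. Continuous features, such as those in WEBSPAM UK2007, would need to be discretized first.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): 1 = spam host, 0 = normal host.
labels = [1, 1, 0, 0, 0, 0]
columns = {
    "many_outlinks": [1, 1, 0, 0, 0, 0],  # separates the two classes completely
    "long_title": [1, 0, 1, 0, 1, 0],     # same class mix in both groups
}
selected = top_k_features(columns, labels, k=1)
print(selected)
```

    In this toy case the first feature is kept because splitting on it leaves no uncertainty about the label, while the second feature leaves the class mix unchanged and so has zero gain.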


    Web spamming is a scheme for deliberately boosting a website's ranking on a search engine higher than it deserves. We call such websites "WEBSPAM". Detecting them is a crucial issue for search engine providers, because these websites damage not only the performance but also the reputation of a search engine.
    Our research aims to improve WEBSPAM detection by using a hybrid machine learning approach that combines supervised (SVM) and unsupervised (Cluster) learning. We evaluated our model on the well-known WEBSPAM corpus, WEBSPAM UK2007, in which spam and reputable websites are imbalanced (1:18). As a result, the detection rate improved from 0.611 to 0.832 while the false positive rate fell from 0.668 to 0.349. The approach should be a good solution to the problem of imbalanced data set distributions.
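
    The abstract reports a detection rate and a false positive rate but does not define them. The sketch below assumes the common definitions (detection rate = TP / (TP + FN), false positive rate = FP / (FP + TN)); the thesis may use different ones. The host counts are hypothetical and chosen only to match the 1:18 ratio. It illustrates why accuracy alone is a poor measure on such imbalanced data.

```python
def rates(y_true, y_pred, positive=1):
    """Compute detection rate, false positive rate, and accuracy from two label lists."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return {
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }


# Hypothetical host set with a 1:18 spam-to-normal ratio (1 = spam, 0 = normal).
y_true = [1] * 10 + [0] * 180

# A trivial classifier that labels every host as normal.
always_normal = [0] * len(y_true)

baseline = rates(y_true, always_normal)
print(baseline)
```

    Under these assumptions the trivial classifier is correct on 180 of 190 hosts yet detects no spam at all, which is why results on this kind of dataset are better reported as detection and false positive rates than as accuracy.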

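    Chinese Abstract
    Abstract
    Acknowledgements
    List of Figures
    List of Tables
    Appendix
    Chapter 1 Introduction
    1.1 Background
    1.2 Contributions
    Chapter 2 Related Work
    2.1 Search engine ranking strategies
    2.2 Techniques web spam uses to manipulate rankings
    2.3 Discussion of related research
    Chapter 3 Research Method
    3.1 Data mining
    3.2 Architecture and method of this thesis
    Chapter 4 Experimental Results
    4.1 Dataset
    4.2 Feature selection and classifier
    4.3 Experimental results
    Chapter 5 Conclusions and Future Work
    Appendix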

