Graduate student: | 徐紹恒 SHAO-HENG HSU |
---|---|
Thesis title: | 運用整體學習與群集分析偵測垃圾網站方法 Using Ensemble Learning and Cluster Analysis to Detect WebSpam |
Advisor: | 楊英魁 Ying-Kuei Yang |
Oral defense committee: | 陳俊良, 黎碧煌, 孫宗瀛 |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of publication: | 2013 |
Graduation academic year: | 101 (ROC calendar) |
Language: | Chinese |
Pages: | 51 |
Keywords (Chinese): | 搜尋引擎、垃圾網站、類不平衡問題、分群抽樣、整體學習 |
Keywords (English): | search engine, spam website, class imbalance problem, cluster sampling, ensemble learning |
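With the rapid growth of the Internet, more and more information is available online, but the quality of web content has become uneven. The major search engines therefore use ranking techniques so that users can find useful information more efficiently. Spam websites, by contrast, use improper manipulation to raise their ranking in search engines, for advertising or for malicious purposes. Such behavior lowers search engine efficiency and may pose a threat to users.
The main difficulty in detecting spam websites is class imbalance: in practice, normal websites far outnumber spam websites. In this thesis, spam websites are the minority class and normal websites are the majority class. In machine learning, most classifiers facing class imbalance try to maximize overall accuracy, so the trained classifier tends to predict every instance as the majority class and has a very high error rate on the minority class. This thesis therefore proposes balancing the two classes through cluster sampling: multiple majority-class subsets are drawn, each is combined with the minority-class data, and an ensemble learning classification model is built from them. Drawing several subsets is meant to offset the risk that cluster sampling misses some important data points and thereby affects the classification result. Finally, all classifiers are combined by weighted voting into a final classification model that detects the small share of spam websites that trouble users among a large volume of data. Experimental results show that the proposed method effectively addresses the class imbalance problem, detects spam websites effectively, and greatly reduces the misclassification rate for normal websites.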
As the Internet has grown rapidly, it has come to carry an extensive range of information resources. However, the quality and reliability of web content have become uneven. Most major search engines therefore use their own ranking algorithms to help users find useful information more efficiently. Spam websites, however, use artificial and improper means to raise their ranking in search engines, for advertising or for malicious purposes. Such behavior reduces the efficiency of search engines and may pose a threat to users.
The main issue in detecting spam websites is imbalanced data classification. In real-world applications, normal websites far outnumber spam websites. In this thesis, normal websites represent the majority class and spam websites represent the minority class. In machine learning, when the class distribution is imbalanced, a general classifier tends to classify all data as the majority class to achieve higher accuracy, so the minority class is ignored and its error rate is high. This thesis proposes using cluster sampling to balance the training data, and drawing multiple majority-class samples, each combined with the minority-class data, to train an ensemble learning model, so that important data points missed by any single cluster sample do not degrade the result. Finally, all classifiers are combined by weighted voting into the final classifier. Experimental results show that this method can solve the class imbalance problem and detect spam websites effectively.
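The pipeline described in the abstract (cluster-based sampling of the majority class, several balanced training sets, weighted voting) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: k-means as the clustering step, a nearest-centroid stand-in for the base classifier, balanced accuracy as the vote weight, the function names, and the synthetic two-dimensional data.

```python
import math
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns k clusters as lists of points."""
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if a cluster went empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

def cluster_sample(clusters, n, rng):
    """Draw about n majority points, proportionally to cluster size."""
    total = sum(len(cl) for cl in clusters)
    sample = []
    for cl in clusters:
        take = min(len(cl), round(n * len(cl) / total))
        sample.extend(rng.sample(cl, take))
    return sample

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def train_member(majority_sample, minority):
    """Stand-in base learner: nearest-centroid classifier (1 = spam)."""
    c_normal, c_spam = centroid(majority_sample), centroid(minority)
    return lambda p: 1 if math.dist(p, c_spam) < math.dist(p, c_normal) else 0

def balanced_accuracy(clf, majority, minority):
    tnr = sum(clf(p) == 0 for p in majority) / len(majority)
    tpr = sum(clf(p) == 1 for p in minority) / len(minority)
    return (tnr + tpr) / 2

def build_ensemble(majority, minority, k=4, rounds=5, seed=0):
    rng = random.Random(seed)
    clusters = kmeans(majority, k, rng)
    members = []
    for _ in range(rounds):
        # Each round sees all minority data plus a fresh balanced majority sample.
        sample = cluster_sample(clusters, len(minority), rng)
        clf = train_member(sample, minority)
        weight = balanced_accuracy(clf, majority, minority)  # assumed weighting
        members.append((weight, clf))
    return members

def vote(members, p):
    """Weighted vote: spam if spam-voting members hold over half the weight."""
    spam_weight = sum(w for w, clf in members if clf(p) == 1)
    return 1 if spam_weight > sum(w for w, _ in members) / 2 else 0

# Synthetic, imbalanced data: 300 "normal" points versus 30 "spam" points.
data_rng = random.Random(42)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(300)]
spam = [(data_rng.gauss(4, 0.5), data_rng.gauss(4, 0.5)) for _ in range(30)]

ensemble = build_ensemble(normal, spam)
spam_recall = sum(vote(ensemble, p) for p in spam) / len(spam)
normal_fpr = sum(vote(ensemble, p) for p in normal) / len(normal)
# Accuracy of the degenerate "always predict normal" classifier.
always_normal_accuracy = len(normal) / (len(normal) + len(spam))
```

On this toy data, `always_normal_accuracy` is about 0.91 even though that classifier detects no spam at all, which is why the sketch reports `spam_recall` and `normal_fpr` separately instead of overall accuracy.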