
Graduate student: 徐紹恒 (SHAO-HENG HSU)
Thesis title: 運用整體學習與群集分析偵測垃圾網站方法
Using Ensemble Learning and Cluster Analysis to Detect WebSpam
Advisor: 楊英魁 (Ying-Kuei Yang)
Oral examination committee: 陳俊良, 黎碧煌, 孫宗瀛
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2013
Academic year of graduation: 101
Language: Chinese
Number of pages: 51
Keywords (Chinese): 搜尋引擎, 垃圾網站, 類不平衡問題, 分群抽樣, 整體學習
Keywords (English): search engine, spam website, class imbalance problem, cluster sampling, ensemble learning
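  • With the rapid growth of the Internet, more and more information is available online, but the quality of web content has become uneven. Major search engines therefore use ranking techniques to help users find useful information more efficiently. Spam websites use improper, artificial means to raise their ranking in search engines, for advertising or with malicious intent; such behavior lowers search engine efficiency and may pose a threat to users.
    The main problem in detecting spam websites is class imbalance: in practice, normal websites far outnumber spam websites. In this thesis, spam websites are the minority class and normal websites are the majority class. In machine learning, most classifiers facing class imbalance try to maximize overall accuracy, so the trained classifier tends to predict every instance as the majority class and has a very high error rate on the minority class. This thesis therefore proposes balancing the two classes through cluster sampling, drawing multiple sets of majority-class samples, combining each set with the minority-class data, and building an ensemble learning classification model, so that important data points missed by any single cluster sample do not degrade the classification result. Finally, all classifiers are combined by weighted voting into a final classification model that detects the small fraction of troublesome spam websites among the data. Experimental results show that the proposed method effectively addresses the class imbalance problem, detects spam websites effectively, and substantially reduces the misclassification rate for normal websites.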


    As the Internet has grown rapidly, it has come to carry an extensive range of information resources. However, the quality and reliability of that information have become uneven. Most major search engines therefore use their own ranking algorithms to help users find useful information more efficiently. Spam websites, however, use artificial and improper means to raise their ranking in search engines, for advertising or for malicious purposes. Such behavior reduces the efficiency of search engines and may pose a threat to users.
    The main issue in detecting spam websites is imbalanced data classification. In real-world applications, normal websites outnumber spam websites. In this thesis, normal websites represent the majority class and spam websites represent the minority class. In machine learning, when the class distribution is imbalanced, a general classifier tends to classify all data as the majority class in order to achieve higher accuracy. The minority class can then be ignored, resulting in a high error rate for the minority class. This thesis proposes balancing the training data by cluster sampling, and drawing multiple samples of majority-class data, each combined with the minority-class data, to train an ensemble learning model, in order to avoid missing important data as a result of cluster sampling. Finally, all classifiers are combined by weighted voting into a final classifier. Experimental results show that this method can solve the class imbalance problem and detect spam websites effectively.
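    The pipeline described in the abstract (cluster-based sampling of the majority class, several balanced training sets, one classifier per set, weighted voting) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the abstract does not state the clustering algorithm, the base classifier, or the voting weights, so the k-means clustering, the nearest-centroid classifier, the balanced-accuracy weights, the synthetic 2-D data, and all function names below are assumptions made for the example.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans_labels(points, k, rng, iters=10):
    """Plain Lloyd's k-means (assumed clustering step); returns a cluster id per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster became empty
                centers[c] = centroid(members)
    return labels

def cluster_undersample(majority, n_keep, k, rng):
    """Draw about n_keep majority points, taking from each cluster in proportion to its size."""
    labels = kmeans_labels(majority, k, rng)
    sample = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        if not members:
            continue
        quota = max(1, round(n_keep * len(members) / len(majority)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

def train_centroid_classifier(normal, spam):
    """Hypothetical base learner: predict 1 (spam) if the spam centroid is closer, else 0."""
    c0, c1 = centroid(normal), centroid(spam)
    return lambda p: 1 if dist2(p, c1) < dist2(p, c0) else 0

def balanced_accuracy(clf, normal, spam):
    """Mean of the per-class recalls; used here as an assumed voting weight."""
    tnr = sum(clf(p) == 0 for p in normal) / len(normal)
    tpr = sum(clf(p) == 1 for p in spam) / len(spam)
    return (tnr + tpr) / 2

def train_ensemble(normal, spam, n_models=5, k=3, seed=0):
    """Train one classifier per balanced subset and store it with its weight."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_models):
        subset = cluster_undersample(normal, len(spam), k, rng)
        clf = train_centroid_classifier(subset, spam)
        ensemble.append((balanced_accuracy(clf, normal, spam), clf))
    return ensemble

def predict(ensemble, p):
    """Weighted vote: the class with the larger total weight wins."""
    spam_score = sum(w for w, clf in ensemble if clf(p) == 1)
    normal_score = sum(w for w, clf in ensemble if clf(p) == 0)
    return 1 if spam_score > normal_score else 0

# Synthetic imbalanced data (10:1), purely for illustration.
data_rng = random.Random(42)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
spam = [(data_rng.gauss(4, 1), data_rng.gauss(4, 1)) for _ in range(20)]
ensemble = train_ensemble(normal, spam)
print(predict(ensemble, (4.0, 4.0)), predict(ensemble, (0.0, 0.0)))
```

    Because each ensemble member sees a different cluster-based sample of the majority class, a data point left out of one sample can still influence another member, which is the motivation the abstract gives for combining several classifiers.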

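    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Thesis Organization
    Chapter 2 Literature Review
        2.1 The Class Imbalance Problem
        2.2 Methods for Addressing Class Imbalance
            2.2.1 Data Manipulation Methods
            2.2.2 Cost-Adjusted Classification Methods
            2.2.3 Ensemble Learning Methods
    Chapter 3 An Improved Ensemble Learning Method Based on Cluster Sampling
        3.1 Detection System Architecture
        3.2 Cluster Analysis
        3.3 Ensemble Learning Method
    Chapter 4 Experimental Results and Discussion
        4.1 The WebSpam Dataset
        4.2 Evaluation Criteria
        4.3 Experimental Results
    Chapter 5 Conclusions and Recommendations
    References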


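    Full-text release date: 2018/07/02 (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)
    Full-text release date: full text not authorized for public release (National Central Library: Taiwan theses and dissertations system)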