
Author: 張銘億
Ming-Yi Chang
Thesis Title: 一個基於 HTML 碼相似度與廣告追蹤碼的電商詐騙網站偵測框架
An E-commerce Scam Website Detection Framework Based on Syntactic Similarity of HTML Code and Conversion Tracking Identity
Advisor: 鄧惟中
Wei-Chung Teng
Committee: 王勝德
Sheng-De Wang
李漢銘
Hahn-Ming Lee
李育杰
Yuh-Jye Lee
項天瑞
Tien-Ruey Hsiang
Degree: 碩士
Master
Department: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
Thesis Publication Year: 2020
Graduation Academic Year: 108
Language: English
Pages: 49
Keywords (in Chinese): 電商詐騙網站偵測, 廣告追蹤碼, 相似度測量
Keywords (in other languages): E-commerce scam website detection, Conversion tracking identity, similarity measurement, syntactic similarity
  • 近年來電商詐騙網站的問題持續惡化,詐騙網站經常利用限時優惠等方式來吸引民眾購買,以達到從中謀利與竊取使用者個資之目的。其中,透過購買社群媒體廣告方式來投放大量相似詐騙網站的問題尤為嚴重。在本研究中,我們提出一個具有模組化特色的電商詐騙網站偵測框架ScamHunter,能有效篩選可疑的詐騙網站,並提供多項指標與信用分數給消費者參考,讓消費者在購物前有更多的資訊來評估是否與電商交易。

    ScamHunter這個框架包含了原始碼相似度模組、網站信譽模組以及評分機制。在相似度模組中,我們利用網頁中的HTML tag(元素)架構來實作兩個網站的相似度計算,並以此對網站進行分群。這種利用文法結構相似度的方法讓我們的框架不會受限於語言,也更利於識別目標網站是否與已知的詐騙群集有關。在網站信譽模組中,我們從網站與域名註冊資訊中提取多項特徵作為後續評估網站的依據,再透過追蹤識別碼的唯一性來辨別網站背後的管理者是否相同。在評分機制中,我們利用多項指標來對網站進行加權評分,再以最後的信用分數來篩選出可疑的詐騙網站。

    在實驗部分,本研究與現有網站評分服務Scamadviser進行比較,透過觀察雙方信用分數的差異,來檢驗本框架的有效性。實驗結果顯示,在2,361個網站的資料集中,透過對網站分群結果進行人工分析,我們分類出56個群集,其中標記出21個可疑的詐騙群集,這些詐騙群集共包含了246個網站。另外,在有效性評估實驗中,Scamadviser對於詐騙網站的精確率只有18.7%,相較於Scamadviser,我們提出的偵測框架除了可以有效篩選出可疑的詐騙網站,對於正常網站也能給予有參考性的信用分數。


    In recent years, the problem of e-commerce scam websites has continued to worsen. Such websites typically use flash sales and similar offers to attract consumers and sell products, profiting illegally and covertly collecting users' sensitive information. One of the most serious problems is that scam groups purchase social media ads to promote large numbers of similar scam websites. In this work, we design and implement ScamHunter, a modular e-commerce scam website detection framework that identifies suspicious scam websites effectively and provides multiple indicators and a credit score for consumers' reference.

    ScamHunter comprises a source code similarity module, a domain reputation module, and a scoring mechanism. In the similarity module, we compute the syntactic similarity of the HTML tag structure of web pages and use it to cluster websites. Because this approach relies on syntactic structure, it is not limited by language and helps identify whether an unknown website is related to a known scam cluster. In the domain reputation module, we extract several features from websites and domain registration information as the basis for evaluating a website. We also propose a new feature based on ad tracking codes: because a tracking code identity is unique, it lets us determine whether the same operator is behind different websites. In the scoring mechanism, we combine multiple indicators into a weighted credit score and use that score to identify suspicious scam websites.
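
The two ideas in the paragraph above, structural similarity over HTML tags and linking sites through shared tracking identities, can be illustrated with a minimal sketch. This is not the thesis implementation: `difflib.SequenceMatcher` over a flat tag sequence is used here as a stand-in similarity measure, the tracking ID patterns (Google Analytics `UA-...` and Facebook Pixel `fbq('init', ...)`) are assumed formats, and all function names and the example domains are hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect opening tag names in document order, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] of two pages' tag sequences (stand-in measure)."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


# Assumed ID formats, for illustration only.
TRACKING_PATTERNS = [
    re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),                        # Google Analytics
    re.compile(r"fbq\(\s*['\"]init['\"]\s*,\s*['\"](\d+)['\"]"),   # Facebook Pixel
]


def tracking_ids(html):
    """Extract the set of tracking identities embedded in a page."""
    ids = set()
    for pattern in TRACKING_PATTERNS:
        for match in pattern.finditer(html):
            # Use the captured group if the pattern has one, else the whole match.
            ids.add(match.group(match.lastindex or 0))
    return ids


def group_by_tracking_id(pages):
    """Map each tracking ID shared by two or more domains to those domains.

    `pages` is a dict of {domain: html}.
    """
    owners = defaultdict(set)
    for domain, html in pages.items():
        for tid in tracking_ids(html):
            owners[tid].add(domain)
    return {tid: sorted(ds) for tid, ds in owners.items() if len(ds) > 1}
```

Because only tag names are compared, two pages with the same layout but text in different languages get the same similarity score, which is the language independence the abstract describes. A shared tracking ID across domains is a hint that the sites have a common operator, not proof of it.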

    To evaluate the effectiveness of our framework, we conducted experiments comparing the credit scores given by the existing website evaluation service Scamadviser with those given by our framework. In a dataset of 2,361 websites, manual analysis of the clustering results yielded 56 clusters, of which we labeled 21 as suspicious scam clusters containing 246 websites in total. In the effectiveness evaluation, the precision of Scamadviser for scam websites was only 18.7%. Compared with Scamadviser, our proposed framework not only screens out suspicious scam websites effectively but also gives informative credit scores for normal websites.

    Abstract in Chinese
    Abstract in English
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
    1.1 Background
    1.2 Motivation and Goals
    1.3 Contributions
    1.4 Outline of the Thesis
    2 Related Work
    2.1 Clustering Web Pages Based on Structure and Style Similarity
    2.2 Clustering Algorithm
    2.3 WHOIS
    2.4 Conversion Tracking
    3 System Architecture
    3.1 Domain Reputation Module
    3.2 The Source Code Similarity Module
    3.3 Scoring Mechanism
    4 Experiments and Analysis
    4.1 Environment Configuration and Dataset
    4.1.1 Design of Experiments
    4.1.2 Data Collection
    4.1.3 Dataset Statistics
    4.2 Discussion of Euclidean Distance over Similarity Score
    4.3 Clustering Analysis
    4.4 Effectiveness Analysis
    4.4.1 Effectiveness of Trust Score Evaluation
    4.5 Feature Statistics
    5 Conclusions
    References

