
Author: 陳立東 (Li-Dong Chen)
Thesis title: Artificial Intelligence with Feature Engineering for Clickbait News Detection (人工智慧與特徵工程技術應用於點擊誘餌新聞偵測)
Advisors: 陳俊良 (Jiann-Liang Chen), 馬奕葳 (Yi-Wei Ma)
Oral defense committee: 陳俊良 (Jiann-Liang Chen), 黎碧煌 (Bih-Hwang Lee), 林宗男 (Tsung-Nan Lin), 黃能富 (Nen-Fu Huang), 楊竹星 (Chu-Sing Yang)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Number of pages: 92
Keywords: Deep Learning, Feature Engineering, Natural Language Processing, Artificial Intelligence, Clickbait, Disinformation
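  • In recent years, the rapid development of Internet technology has changed how the public receives information, making online platforms the mainstream channel for news sharing and information dissemination. However, the large volume of news circulating online has also led to a flood of low-quality content: some news websites publish clickbait articles to attract readers' clicks and earn advertising revenue. Clickbait articles often care only about an eye-catching headline and neglect the quality and accuracy of the body text. Content farm websites that publish clickbait articles in bulk are therefore also a breeding ground for disinformation. Readers drawn in by clickbait news not only have a worse reading experience but also help spread false information.
    To reduce the spread of clickbait news, this study proposes a clickbait news detection system based on artificial intelligence and feature engineering. It uses deep learning, feature construction, and feature evaluation to analyze the titles and content of clickbait news, helping users detect whether a headline link is clickbait when they click it and thereby reducing the spread of low-quality news. The clickbait news dataset comes from the blacklist of Content Farm Terminator, and general news is collected from the Google News platform. The text of the dataset undergoes natural language processing, and the model architecture is built on a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. To detect clickbait news effectively among general news, the study analyzes clickbait news and proposes 18 hand-crafted features, which are joined at the model's concatenation layer as feature fusion for subsequent model training. The features comprise lexical-based features of the title vocabulary and format-based features of the web page presentation. An evaluation mechanism based on ANOVA and the arithmetic mean of each feature selects the discriminative features to improve overall model accuracy.
    This study analyzes the patterns and content of clickbait news pages and proposes 18 features combined with a deep learning architecture for natural language processing. Based on the model training results, the detection output is graded into three levels (strong, moderate, and mild) to rate the clickbait strength of an article. Finally, the CLSTM-TCEF model with the best feature combination is trained and tested with the proposed architecture, and it reaches 98.42% accuracy in validation on the training set. Compared with the 87.67% accuracy of a bidirectional GRU detection model from related work that uses text processing only, the proposed architecture and features raise accuracy by 10.75 percentage points, achieving effective detection of clickbait articles.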


    In recent years, the rapid development of Internet technology has changed how the public receives information, and online platforms have become the mainstream channel for news sharing and information dissemination. However, the large volume of news spreading online has also led to a flood of low-quality content. Some news websites use clickbait to lure readers into clicking news links and so earn advertising revenue. Clickbait articles often focus on an attention-grabbing title and neglect the quality and accuracy of the content. Content farm sites that publish large amounts of clickbait news are therefore also a breeding ground for disinformation. Clickbait not only degrades the user's reading experience but also encourages the spread of disinformation and fake news.
    To reduce the spread of clickbait news, this study proposes a clickbait news detection system based on artificial intelligence and feature engineering. The system uses deep learning, feature building, and feature evaluation to analyze the content of clickbait news, which helps users detect whether a news item is clickbait, improves reading quality, and reduces the spread of low-quality news. The clickbait dataset is drawn from the blacklist of the Content Farm Terminator Google Chrome plug-in, and general news is collected from the Google News platform. The text of the dataset is processed with natural language processing for training, and the model is built on a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to improve training efficiency in text processing. To detect clickbait news among general news, the study analyzes the titles and web page formats of clickbait news and proposes 18 hand-crafted features: lexical-based features for the title vocabulary and format-based features for the web page format. A feature evaluation mechanism that combines ANOVA with the arithmetic mean confirms which of the proposed hand-crafted features are discriminative and selects them to improve the overall accuracy of the system.
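    As a rough illustration of the ANOVA-based feature evaluation mentioned above, the sketch below computes a one-way ANOVA F statistic per feature and ranks features by it. It is a minimal sketch, not the thesis's procedure: the feature names and sample values are hypothetical, and the record does not say how the F statistic and the arithmetic mean are combined into a selection rule.

```python
from typing import Dict, List, Sequence, Tuple


def anova_f(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        # No spread inside the groups: perfectly separated, or a constant feature.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical feature values: (clickbait samples, general-news samples).
features: Dict[str, Tuple[List[float], List[float]]] = {
    "exclamation_count": ([3.0, 4.0, 5.0], [0.0, 1.0, 2.0]),
    "title_length": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
}

scores = {name: anova_f(groups) for name, groups in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    clickbait, general = features[name]
    # Class means give the arithmetic-mean view of the same feature.
    print(name, scores[name], sum(clickbait) / len(clickbait), sum(general) / len(general))
```

    A feature whose class means differ strongly relative to its within-class spread gets a high F value and is a candidate to keep, while a feature with identical class distributions scores zero.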
    This study analyzes the behaviors and patterns of clickbait news pages and proposes 18 hand-crafted features that are concatenated with a deep learning framework incorporating natural language processing. Based on the model training results, the clickbait strength of an article is rated as strong, moderate, or mild. Finally, the CLSTM-TCEF model with the best combination of features is trained and tested within the proposed architecture. It reaches 98.42% accuracy in validation on the training set, whereas a bidirectional GRU detection model from related work that uses text processing only reaches 87.67%. The proposed system thus raises accuracy by 10.75 percentage points and achieves the goal of effectively detecting clickbait news.
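    The strength rating and the reported accuracy gain can be made concrete with a short sketch. The assumptions here are loud ones: the record does not say how the three levels are derived, so mapping a model probability to a level and the cut-offs `moderate_at` and `strong_at` are hypothetical. Only the two accuracy figures come from the abstract.

```python
def clickbait_strength(probability: float,
                       moderate_at: float = 0.5,
                       strong_at: float = 0.8) -> str:
    """Map a clickbait probability to a strength level (hypothetical cut-offs)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= strong_at:
        return "strong"
    if probability >= moderate_at:
        return "moderate"
    return "mild"


# Accuracy figures reported in the abstract, in percent.
proposed_accuracy = 98.42   # CLSTM-TCEF
baseline_accuracy = 87.67   # bidirectional GRU, text only
gain_points = round(proposed_accuracy - baseline_accuracy, 2)

print(clickbait_strength(0.93), clickbait_strength(0.61), clickbait_strength(0.20))
print(gain_points)
```

    The gain is a difference of two percentages, so it is expressed in percentage points rather than as a relative improvement.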

    Abstract (Chinese)
    Abstract
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Contributions
        1.3 Organization
    Chapter 2 Background Knowledge
        2.1 Clickbait Concept
        2.2 Clickbait Detection Techniques
            2.2.1 Blacklist-based Detection
            2.2.2 Heuristic-based Detection
        2.3 Artificial Intelligence
            2.3.1 Natural Language Processing
            2.3.2 Deep Learning
        2.4 Previous Study
    Chapter 3 Proposed Method
        3.1 System Overview
        3.2 Data Collection
            3.2.1 Dataset
            3.2.2 Data Preparation
        3.3 Text Processing
        3.4 Feature Extraction
            3.3.1 Lexical-based Features
            3.3.2 Format-based Features
        3.5 Feature Evaluation
        3.6 Model Training
        3.7 Prediction
    Chapter 4 System Environment and Performance Analysis
        4.1 System Environment
            4.1.1 Experimental Environment
            4.1.2 Experimental Parameter
        4.2 Performance Analysis
            4.2.1 Performance Analysis of Textual Feature
            4.2.2 Performance Analysis of Textual Feature and Hand-crafted Features
            4.2.3 Performance Analysis of Textual Feature and Hand-crafted Features with Feature Selection
        4.3 Comparison and Practical Test
        4.4 Prediction Result Analysis
        4.5 Summary
    Chapter 5 Conclusions and Future Works
        5.1 Conclusions
        5.2 Future Works
    References


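    Full-text release date: 2025/07/29 (campus network)
    Full-text release date: 2025/07/29 (off-campus network)
    Full-text release date: 2025/07/29 (National Central Library: Taiwan theses and dissertations system)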