
Graduate student: 莊鎮豪
Chen-Hao Chuang
Thesis title: 深度學習與動態資料技術應用於語句反諷之分析
Deep Learning with Dynamic Dataset for Sarcasm Detection
Advisors: 陳俊良
Jiann-Liang Chen
馬奕葳
Yi-Wei Ma
Oral defense committee: 林宗男
Tsung-Nan Lin
馬奕葳
Yi-Wei Ma
黎碧煌
Bih-Hwang Lee
黃能富
Nen-Fu Huang
楊竹星
csyang@mail.ee.ncku.edu.tw
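Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: English
Number of pages: 64
Keywords (Chinese, translated): natural language processing, deep learning, dynamic data, text processing, in-sentence reading, word segmentation
Keywords (English): Dynamic data, Word segment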
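  • As technology advances, computers have become increasingly adept at understanding text, yet many problems still trouble researchers. Understanding the emotion of a text, its historical context, and sentence-level sarcasm, for example, are problems that current techniques have not fully overcome. This study addresses sentence sarcasm by formulating a data collection strategy, analyzing high-frequency words, and designing a dataset adjustment mechanism to strengthen recognition accuracy.
    Early sarcasm detection in text relied mostly on machine learning, extracting specific features from sentences (such as emoticons, punctuation marks, hashtags, or inconsistent sentiment) to serve as training data. Gradually, researchers began using deep learning to learn sentences automatically and let the machine make the judgment itself. In this process, however, the very first step, training the word-vector embedding, poses a difficulty, especially for Chinese. New vocabulary keeps appearing over time, which greatly challenges the model. The deep learning architecture also strongly affects learning ability, and any single architecture brings its own advantages and disadvantages. The tendency of a model to be biased toward particular high-frequency words is likewise a problem that urgently needs solving.
    This study proposes an Emotional Data Collection Rule (EDCR) module, which formulates dataset filtering rules in a way that effectively increases accuracy. The cues that humans attend to when judging sarcastic sentences serve as the main basis for collecting the dataset, and, imitating the way machine learning uses features, filtering rules are defined for the deep learning dataset. The study also develops a Dynamic Data Adjustment (DDA) module, which effectively reduces the influence of particular high-frequency words on the model and balances the dataset. By observing the proportion that each high-frequency word occupies in the sarcastic and non-sarcastic datasets, the module determines whether the sarcastic or the non-sarcastic dataset should be changed, and by how much, further improving recognition accuracy. With the proposed architecture, the recognition rate of the overall model improves markedly, reaching an accuracy of 98.74%.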


    With the rapid development of information and communication technology, computers can now understand most textual knowledge, but many issues remain unresolved. Understanding textual emotion, historical context, and sentence-level sarcasm, for example, are problems that current technology has not fully solved. This study develops a dynamic learning architecture that combines a data collection strategy, high-frequency word analysis, and dynamic dataset adjustment to improve the identification of sentence sarcasm.
    Most early textual sarcasm detection relied on machine learning, extracting specific features from sentences (emoticons, punctuation marks, hashtags, inconsistent emotions, etc.) to use as training data. Gradually, researchers began using deep learning to learn sentences automatically. In this process, however, the initial training of the word vector embedding poses a challenge, especially for Chinese. New vocabulary keeps appearing over time, and words that the model has not learned before are difficult for it to recognize. The deep learning architecture also strongly influences learning ability, and any single architecture has its own advantages and disadvantages. The tendency of a model to be biased toward high-frequency words is a further problem to be solved.
    This study proposes an Emotional Data Collection Rule (EDCR) module, whose function is to filter out data that does not conform to the collection rules. The cues that humans observe when judging sarcastic sentences serve as the primary basis for collecting the dataset, and the filtering rules for the dataset are established by imitating the way machine learning uses features. This study also develops a Dynamic Data Adjustment (DDA) module, which effectively reduces the impact of high-frequency keywords on the model and balances the datasets. By observing the proportions of high-frequency words in the sarcasm and non-sarcasm datasets, the module decides which dataset should be changed and by how much, improving the performance of the output model. With the proposed dynamic learning architecture, accuracy reaches as high as 98.74%.
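
The proportion check described for the DDA module can be illustrated with a small Python sketch. This is not the thesis's implementation: the abstract does not give the actual decision rule, so the function names, the `top_k` and `tolerance` parameters, and the "deviation from an even 50/50 split" criterion are all assumptions made for illustration. Sentences are assumed to be already segmented into word lists.

```python
from collections import Counter


def word_proportions(sarcastic, non_sarcastic, top_k=3):
    """For the top_k most frequent words overall, return the share of
    their occurrences that fall in the sarcastic dataset."""
    sar_counts = Counter(w for sent in sarcastic for w in sent)
    non_counts = Counter(w for sent in non_sarcastic for w in sent)
    total = sar_counts + non_counts
    return {
        word: sar_counts[word] / count
        for word, count in total.most_common(top_k)
    }


def datasets_to_adjust(proportions, tolerance=0.2):
    """Hypothetical decision rule (an assumption, not from the thesis):
    flag a word when its sarcastic share deviates from 0.5 by more than
    `tolerance`, and report the dataset in which the word is
    under-represented together with the size of the gap."""
    decisions = {}
    for word, share in proportions.items():
        gap = share - 0.5
        if abs(gap) > tolerance:
            target = "non_sarcastic" if gap > 0 else "sarcastic"
            decisions[word] = (target, round(abs(gap), 3))
    return decisions


# Toy, pre-segmented data used only to exercise the functions.
sarcastic = [["great", "job", "really"],
             ["really", "great", "idea"],
             ["really", "nice"]]
non_sarcastic = [["great", "job"], ["nice", "idea"], ["good", "job"]]

props = word_proportions(sarcastic, non_sarcastic, top_k=3)
decisions = datasets_to_adjust(props)
print(decisions)
```

In this toy data the word "really" occurs only in sarcastic sentences, so the sketch flags the non-sarcastic dataset as the one to adjust for that word. How the adjustment itself is carried out (the "changing amplitude" step listed in the table of contents) is not specified in the abstract and is left out here.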

    Abstract (in Chinese)
    Abstract
    Contents
    List of Figures
    List of Tables
    Notation table
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Contributions
        1.3 Organization
    Chapter 2 Background Knowledge
        2.1 Text Analytics
        2.2 Sarcasm Detection in Traditional
            2.2.1 Detection in Sentence
            2.2.2 Detection with Context
            2.2.3 Detection with Markers
        2.3 Artificial Intelligence
            2.3.1 Machine Learning
            2.3.2 Deep Learning
        2.4 Word Embedding
    Chapter 3 System Architecture
        3.1 System Overview
        3.2 Dataset
        3.3 Emotional Data Collection Rule (EDCR)
            3.3.1 EDCR
        3.4 Data Preprocess
            3.4.1 Fine-grained in Chinese Word Segmentation
        3.5 Dynamic Data Adjustment (DDA)
            3.5.1 Proportion
            3.5.2 Judgement
            3.5.3 Changing Amplitude
        3.6 Model Training
    Chapter 4 Performance Analysis
        4.1 Performance Analysis of Segmentation Methods
        4.2 Performance Analysis of Dataset’s Changing Amplitude
            4.2.1 Performance Analysis Changing Amplitude of Jieba Segmentation
            4.2.2 Performance Analysis Changing Amplitude of Char segmentation
        4.3 Performance Analysis of EDCR and DDA
            4.3.1 Performance Analysis with none collection rule.
            4.3.2 Performance Analysis with EDCR.
            4.3.3 Performance Analysis with EDCR and DDA.
        4.4 Comparison of Different Studies
        4.5 Performance Analysis of Dataset’s Collection
        4.6 Summary
    Chapter 5 Conclusions and Future Works
        5.1 Conclusions
        5.2 Future Works
    References

