
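Graduate student: 吳勇霆
Yong-Ting Wu
Thesis title: 即時異質資料分析系統於雲端運算平台上的設計與實現
Design and Implementation of Real-time Heterogeneous Data Analyzer over Cloud Computing Platform
Advisor: 呂政修
Jenq-Shiou Leu
Oral defense committee: 石維寬
Wei-Kuan Shih
周承復
Cheng-Fu Chou
陳省隆
Hsing-Lung Chen
沈中安
Chung-An Shen
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: Chinese
Number of pages: 36
Keywords (Chinese, translated): Cloud Computing, Real-time Heterogeneous Data, Visualization and Analysis
Keywords (English): Apache Spark, Real-time Heterogeneous Data, Visualization and Analysis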
    Now that Internet access is widespread, people can produce data through mobile applications anywhere and at any time, uploading text, photos, or videos as they wish, so data is generated ever faster and data volumes keep rising. Social networks are one major source of this data: they are used all over the world, tens of thousands of users may be spreading data at the same moment, and social data is therefore produced second by second. The number of social network users is also rising year by year, which further accelerates the growth of data. To make effective use of this data, this study builds a real-time data analysis system that collects real-time tweet data from a social network, then processes and analyzes it to find the main topics currently discussed around the world. The system also performs sentiment analysis on each topic to determine whether positive or negative sentiment dominates, and shows the positive and negative sentiment distributions. In addition, to identify differences between the topics on the social network and those in news coverage, the system fetches web news content and, after processing and analysis, checks whether the mainstream topic distributions of the two sources agree. Processing is distributed and runs in real time on a cloud computing platform, so the analysis results are obtained quickly. The experimental results show that social media and web news discuss similar topics, and the sentiment analysis results reveal how people perceive each topic and how much they favor it. The system thus successfully processes and analyzes real-time data.

    Chapter 1 Introduction
    Chapter 2 Background and Related Work
    2.1 Distributed Storage Systems
    2.2 Distributed Computing Frameworks
    2.2.1 MapReduce
    2.2.2 Apache Spark
    2.2.3 Differences Between MapReduce and Apache Spark
    2.3 Related Work
    Chapter 3 System Architecture
    3.1 System Architecture Overview
    3.1.1 Distributed Cluster
    3.1.2 Interactive Data Analysis and Visualization Tool
    3.1.3 Sentiment Analysis
    3.2 Execution Steps
    3.2.1 Data Collection
    3.2.2 Real-Time Processing
    3.2.3 Data Store
    3.2.4 Visualization and Analysis
    Chapter 4 System Performance Evaluation
    4.1 Environment Setup
    4.2 Experimental Scenarios
    4.3 Experimental Results and Performance Evaluation
    4.3.1 Experimental Results
    4.3.2 Performance Evaluation
    Chapter 5 Conclusion and Future Work
    References

