
Graduate Student: 黃洱宸 (Er-Chen Huang)
Thesis Title: Active Learning with Massive High Dimensional Data (利用主動式學習法處理高維度的巨量資料)
Advisor: 鮑興國 (Hsing-Kuo Pao)
Committee Members: 鮑興國 (Hsing-Kuo Pao), 李育杰, 蘇黎, 陳瑞彬, 邱舉明
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2017
Graduation Academic Year: 105
Language: English
Number of Pages: 41
Keywords (Chinese): 主動式學習 (active learning), 降維 (dimensionality reduction), 巨量資料 (massive data), 高維度 (high dimensionality)
Keywords (English): Active learning, Dimensionality reduction, Large-scale data, High dimensionality

Abstract in Chinese:
Technology advances by the day and the Internet of Things is flourishing, linking people to people and people to things. Massive amounts of information pervade our lives, generated and collected at every moment, and how to process, analyze, and use these data effectively is an important problem for this generation. Collecting large amounts of unlabeled data from instruments is easy, but analyzing and making predictions from the data usually requires a great deal of human effort to annotate it with class labels, which makes labeled data comparatively expensive. Some data, such as gene sequences or geological measurements, must be labeled by domain experts; others, such as handwriting or speech recognition data, have a lower labeling threshold but still incur substantial labor cost. We therefore want to use active learning to obtain the best model while labeling as few instances as possible, reducing the annotation cost and handling massive data in the most efficient way.
In traditional active learning, the available information is used to compute the information value of each unlabeled instance, and the most informative instances are selected for labeling. Once new labeled data are obtained, the previously computed information values change and must be recomputed, which wastes a great deal of time, especially for massive high-dimensional data. To use active learning more effectively, this thesis addresses two directions: first, deciding whether the information values actually need to be recomputed, so as to reduce the number of recomputations; second, processing the unlabeled data with dimensionality reduction and a layered structure, so that each computation of information values takes less time. Together, these two methods reduce both the number and the duration of information-value computations and substantially improve the efficiency of the whole active learning system.
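To make the pool-based loop described above concrete, the following Python sketch implements plain margin-based uncertainty sampling with a linear SVM: the information value of an unlabeled point is taken as its distance to the current decision boundary, and every margin is recomputed after each query, which is exactly the cost the first direction aims to cut down. This is only an illustration under assumed names (scikit-learn's SVC and a hypothetical oracle callable standing in for the human annotator), not the thesis's Anti Re-Rank procedure itself.

import numpy as np
from sklearn.svm import SVC

def uncertainty_sampling(X_labeled, y_labeled, X_pool, oracle, n_queries=50):
    """Pool-based active learning with margin-based uncertainty sampling.

    oracle(i) is a hypothetical callable returning the true label of pool
    point i (e.g. a human annotator); the initial labeled set must contain
    both classes so the SVM can be fit.
    """
    X_lab, y_lab = np.asarray(X_labeled), np.asarray(y_labeled)
    pool_idx = np.arange(len(X_pool))
    model = SVC(kernel="linear")
    for _ in range(n_queries):
        model.fit(X_lab, y_lab)
        # Information value of each pool point: distance to the decision
        # boundary.  A small |f(x)| means the model is uncertain, so the
        # point is the most informative one to label next.
        margins = np.abs(model.decision_function(X_pool[pool_idx]))
        pick = pool_idx[np.argmin(margins)]
        # Query the oracle, move the point from the unlabeled pool into the
        # labeled set, and retrain.  Every margin is recomputed on the next
        # iteration -- the re-ranking cost the first direction tries to avoid.
        X_lab = np.vstack([X_lab, X_pool[pick][None, :]])
        y_lab = np.append(y_lab, oracle(pick))
        pool_idx = pool_idx[pool_idx != pick]
    return model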


Abstract in English:
Technology improves by the day, and the Internet of Things (IoT) is flourishing, connecting everything in the world. Massive amounts of information are generated and collected in our lives at every moment, and how to process and analyze such a large amount of data is one of the most important problems of this generation. It is simple to collect unlabeled data from sensors; however, analyzing the data usually requires a great deal of human effort to label it, which makes labeled data expensive. Some kinds of data, such as gene sequences and geological data, must be labeled by domain experts. Others, such as handwriting or speech recognition data, do not require special domain knowledge to label, but still incur considerable cost. Therefore, we use active learning, which aims to obtain the best model from the smallest amount of labeled data, reducing the labeling cost and making the whole system more efficient.
Traditional active learning uses the available data to compute the information value of each unlabeled instance and selects the ones with the highest information value to query an expert for labels. Once the system obtains new labeled data, the previously computed information values change, and recalculating them is costly, especially for massive high-dimensional data. In this thesis, we propose two directions for using active learning more effectively. First, we decide whether the information values really need to be recomputed, reducing the number of recalculations. Second, we apply dimensionality reduction and a multi-layer pool to process the unlabeled data, shortening the time needed to compute information values. Combining these two points, we build a more efficient active learning system.
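For the second direction, the sketch below shows one plausible way to organize the unlabeled pool into two layers after PCA: the pool is projected to a lower-dimensional space, partitioned into clusters, and one representative per cluster forms a coarse top layer that can be scored before descending into a full cluster. The use of k-means, the number of components, and the layer sizes are assumptions made for illustration and may differ from the thesis's multi-layer pool construction.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def build_two_layer_pool(X_pool, n_components=50, n_clusters=100, seed=0):
    """Project the unlabeled pool with PCA and split it into clusters.

    Returns the reduced pool, the cluster label of every pool point, and one
    representative index per cluster.  Scoring only the representatives first
    (the coarse top layer) and expanding a cluster only when its representative
    looks informative keeps each round of information-value computation cheap.
    """
    X_red = PCA(n_components=n_components, random_state=seed).fit_transform(X_pool)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_red)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Representative: the cluster member closest to the cluster centre.
        dists = np.linalg.norm(X_red[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return X_red, km.labels_, np.array(reps)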

Table of Contents:
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Proposed Method
  1.4 Thesis Outline
2 Methodology
  2.1 Active Learning
    2.1.1 Scenarios
    2.1.2 Uncertainty Sampling
    2.1.3 Pool-Based Active Learning
  2.2 Anti Re-Rank Pool-Based Active Learning
  2.3 Multi-Layer Pool
    2.3.1 Principal Components Analysis
    2.3.2 Multi-Layer Pool Building
3 Experiments and Results
  3.1 SVMs Model
  3.2 Evaluation
  3.3 Dataset
  3.4 Data Preprocessing
  3.5 Experimental Setting
  3.6 Anti Re-rank Pool-based Active Learning
    3.6.1 Linear Model
    3.6.2 Non-Linear Model
  3.7 Multi-Layer Pool
  3.8 ARPAL + Multi-Layer Pool
4 Conclusions
References

