
Author: Yi-Fan Chen (陳怡帆)
Thesis Title: Online Nonlinear Support Vector Machine for Large-Scale Classification (適用於大規模分類問題的一個線上非線性支撐向量機)
Advisor: Yuh-Jye Lee (李育杰)
Committee Members: Hsing-Kuo Pao (鮑興國), Hsuan-Tien Lin (林軒田), Su-Yun Huang (陳素雲), Yi-Ren Yeh (葉倚任)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2014
Academic Year of Graduation: 102 (ROC calendar)
Language: English
Number of Pages: 56
Keywords (Chinese): large-scale dataset problem, online learning, nonlinear classifier, support vector machine, passive-aggressive algorithm, reduced kernel, second-order information
Keywords (English): Nonlinear classifier, Reduced Kernel, Second order information
    As technology continues to advance, enormous amounts of data of many different kinds are generated every second of every day, so handling large-scale datasets is a major challenge in the big-data era. Real-world classification problems generally require nonlinear classifiers. The support vector machine (SVM) is one of the most popular nonlinear learning methods, but generating a nonlinear SVM classifier for a large-scale dataset is computationally difficult. Online learning is an important technique for handling large-scale problems; however, most online learning algorithms produce linear classifiers. In this thesis we therefore propose an online nonlinear SVM algorithm. The passive-aggressive (PA) algorithm is an online linear SVM; because it has a closed-form update rule, each update is extremely fast. With this advantage in mind, we introduce a family of nonlinear PA algorithms via the reduced kernel trick, and we use numerical techniques such as orthogonal diagonalization and approximation to gain numerical robustness and computational efficiency. In the online learning setting, the lack of memory of previously seen instances can hurt learning performance. To address this, we propose a proximal model built from all previously seen instances to provide a reference for the update direction. To further improve single-pass performance, we incorporate second-order information of the data when updating the model. Experimental results show that our method achieves accuracy comparable to batch nonlinear algorithms while handling very large datasets faster than those batch methods. The results also show that our method is insensitive to the input order of the data and quickly reaches a reasonably good solution in a single pass. These findings indicate that our method is well suited to problems with extremely large datasets.
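    The core mechanism the abstract describes, a closed-form PA update applied to reduced-kernel features, can be illustrated with a minimal sketch. This is only an illustrative example under assumed choices (an RBF kernel, the first few streamed points as the reduced set, and the standard PA-I rule tau = min(C, hinge loss / squared norm of the mapped instance)); it does not reproduce the thesis's RKPA variants, proximal model, or second-order update.

    ```python
    import numpy as np

    class ReducedKernelPA:
        """Sketch of an online PA-I classifier on reduced-kernel features.

        The reduced kernel trick maps each instance x to the vector
        [K(x, z_1), ..., K(x, z_m)] for a small fixed reference set {z_j},
        then runs the usual linear PA update in that m-dimensional space.
        """

        def __init__(self, reduced_set, gamma=1.0, C=1.0):
            self.Z = np.asarray(reduced_set)   # m reference points (the reduced set)
            self.gamma = gamma                 # RBF kernel width (assumed kernel choice)
            self.C = C                         # PA-I aggressiveness parameter
            self.w = np.zeros(len(self.Z))     # weights in reduced-kernel space

        def _features(self, x):
            # RBF reduced-kernel map: K(x, z_j) = exp(-gamma * ||x - z_j||^2)
            d2 = np.sum((self.Z - x) ** 2, axis=1)
            return np.exp(-self.gamma * d2)

        def predict(self, x):
            return np.sign(self.w @ self._features(x))

        def update(self, x, y):
            # Closed-form PA-I step: tau = min(C, loss / ||phi(x)||^2), w += tau * y * phi(x)
            phi = self._features(x)
            loss = max(0.0, 1.0 - y * (self.w @ phi))
            if loss > 0.0:
                tau = min(self.C, loss / (phi @ phi))
                self.w += tau * y * phi

    # Toy single-pass usage on a nonlinearly separable stream of 2-D points.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)
    model = ReducedKernelPA(reduced_set=X[:20], gamma=1.0, C=1.0)
    for xi, yi in zip(X, y):
        model.update(xi, yi)
    ```

    Because each update touches only the m-dimensional reduced-kernel features rather than all previously seen support vectors, the per-instance cost stays constant over the stream, which is the property that makes the approach attractive for large-scale data.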

    1 Introduction
    2 Related Work
    2.1 Reduced Support Vector Machines
    2.2 Passive-Aggressive Algorithm
    2.3 Passive-Aggressive Algorithm with the Proximal Model
    2.4 Periodic Step-size Adaptation
    3 Passive-Aggressive Algorithm for Nonlinear Classification
    3.1 PA with Kernel Trick
    3.2 PA with Reduced Kernel Trick
    3.3 Stability Consideration
    3.4 Experiment: Linear vs. Nonlinear
    4 Proximal Model
    4.1 Our Proximal Model: quasi-RKFDA
    4.2 Combining the RKPA Algorithm with the Proximal Model
    4.3 Experiment: Proximal Model vs. without Proximal Model
    5 Second Order Information
    5.1 Second Order RKPAm-2
    5.2 Experiment: Second Order vs. without Second Order
    6 Experiments
    7 Conclusion

