
Graduate student: Jing-Yun Carey Fan (范瀞勻)
Thesis title: Stagewise gradient methods for neural-network learning (階層式梯度法之類神經網路學習)
Advisor: Eiji Mizutani (水谷英二)
Committee members: Kung-Jeng Wang (王孔政), An-Chyau Huang (黃安橋)
Degree: Master
Department: College of Management - Department of Industrial Management
Year of publication: 2009
Graduating academic year: 97 (ROC calendar)
Language: English
Number of pages: 94
Chinese keywords: neural networks, stagewise procedures
English keywords: stagewise procedures, CANFIS neuro-fuzzy learning
Access counts: 376 views, 4 downloads

In this thesis, we first introduce the optimal-control formulation used in multilayer-perceptron (MLP) learning. On that basis, we compare two procedures for computing the gradient: the stagewise backpropagation (stagewise BP) that we emphasize, and the still-popular node-wise backpropagation (node-wise BP), illustrated in the text with 1-1-10 MLP and 10-1-10 MLP examples. We then implement stagewise BP in both MLP and CANFIS (Co-Active Neuro-Fuzzy Inference System) models and evaluate it on three representative problems: the N-shape fitting problem that appears frequently in the literature, and two classification problems from the UCI machine learning repository, namely the small-to-medium-scale Heart C problem and the large-scale Letter Recognition problem. Depending on the model design for each problem, we also incorporate hidden-node teaching, a truncation filter, and a committee method. Compared with other methods reported in the literature, our designs generally perform favorably.
Many currently available nonlinear least squares models and machine-learning architectures can be restructured in a stagewise format. Beyond our expectation, the stagewise idea can be applied even more broadly to many other machine-learning models.
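As a concrete illustration of the committee (model-averaging) idea mentioned above, the following minimal Python sketch averages the class-score vectors of several independently trained networks and takes the arg-max. The `predict_fns` interface is an assumption made for illustration, not the thesis's actual code.

```python
# Minimal committee-averaging sketch (illustrative; the predict_fns
# interface is assumed, not taken from the thesis implementation).
import numpy as np

def committee_predict(predict_fns, x):
    """Average the class-score vectors of all committee members for input x
    and return the winning class index."""
    scores = np.mean([f(x) for f in predict_fns], axis=0)
    return int(np.argmax(scores))

# Example usage with three hypothetical trained models:
#   label = committee_predict([net1.predict, net2.predict, net3.predict], x)
```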


This thesis describes stagewise gradient procedures for learning by artificial “adaptive”
neural networks including a multilayer perceptron (MLP) and a neuro-fuzzy modular network.
These computational models lie at the heart of diverse applications in neural computing and
machine learning. For optimizing those models, we resort to optimal-control stagewise gradient
procedures as an extension of widely-employed (first-order) backpropagation. Here, we
emphasize its “stagewise” construct, for we identify the layered “stagewise” structure in given
neural-network models, and exploit it for devising learning algorithms. By exploring various
pattern recognition applications, we conclude that our stagewise methods can avoid plateaus in
learning, and improve input-to-output mapping accuracy. Furthermore, in neuro-fuzzy learning,
the method can be designed to keep fuzzy rules within meaningful limits, even with on-line
backpropagation, which often hinders the rules' interpretability.
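To make the "stagewise" construct concrete, the sketch below treats each layer of a small MLP as one stage of a multi-stage system: the forward sweep stores every stage state, and the backward sweep propagates a costate (delta) vector stage by stage to yield all weight gradients. It is a minimal illustration under assumed choices (tanh hidden units, a linear terminal stage, squared-error terminal cost), not the thesis's exact implementation.

```python
# Minimal sketch of first-order stagewise backpropagation for a small MLP,
# viewed as a multi-stage optimal-control problem: each layer is one "stage",
# the layer activations are the stage states, and the backward recursion
# propagates a costate (delta) vector stage by stage.
# Layer sizes and activation choices are illustrative assumptions.
import numpy as np

def init_mlp(sizes, rng):
    """One (W, b) pair per stage, e.g. sizes = [1, 3, 1] for a 1-3-1 MLP."""
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward sweep: store every stage state for the backward sweep."""
    states = [x]
    for k, (W, b) in enumerate(params):
        a = W @ states[-1] + b
        # tanh activation at hidden stages, linear at the terminal stage
        states.append(np.tanh(a) if k < len(params) - 1 else a)
    return states

def stagewise_gradient(params, x, target):
    """Backward sweep: the costate recursion yields every stage gradient."""
    states = forward(params, x)
    grads = [None] * len(params)
    delta = states[-1] - target                        # terminal-cost residual
    for k in reversed(range(len(params))):
        W, _ = params[k]
        grads[k] = (np.outer(delta, states[k]), delta)  # dE/dW_k, dE/db_k
        if k > 0:                                       # costate for the stage below
            delta = (W.T @ delta) * (1.0 - states[k] ** 2)
    return grads

# Example usage:
#   rng = np.random.default_rng(0)
#   params = init_mlp([1, 3, 1], rng)
#   grads = stagewise_gradient(params, np.array([0.5]), np.array([1.0]))
```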
The challenge is to develop efficient and effective learning algorithms. In conventional
supervised learning, the desired signals are presented only at the terminal layer. In
classical optimal-control theory, the problem thus posed is known as the Mayer-type problem, which
involves the terminal cost alone. In contrast, we formulate a general Bolza-type problem that
involves stage costs on top of the terminal one. Specifically, we pose a nonlinear least squares
problem, for which the sum of squared error measure is modified accordingly to include such
added quantities. In machine learning, for example, weight-decay regularization is popular to
penalize large control (i.e., large weight parameters in our context). Such a regularization can be
included as stage costs for our efficient stagewise implementation. In our scheme, we endeavor
to contrive certain teacher signals for intermediate stages; this is what we call “hidden-node
teaching.” The resulting “hidden” residuals are incorporated into the stage costs. In practice,
such hidden teacher signals may not be well-defined in supervised-learning problems. But, in
MLP-learning, one may develop an algorithm to avoid hidden-node saturations, which certainly
cause plateaus in learning. On the other hand, one may attempt to encourage hidden-node saturations
to attain a particular saturation pattern that leads to a solution if such a pattern is
identifiable (e.g. in the parity problem). In the context of neuro-fuzzy learning, one could use
the same terminal teacher signals for an intermediate stage so as to keep the interpretability of
fuzzy rules. We demonstrate those new concepts using machine-learning benchmark problems
as well as small-scale problems found in the literature.
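The following sketch indicates, under illustrative assumptions, how such stage costs can be folded into the backward sweep: the weight-decay term contributes directly to each stage's weight gradient, and a hidden residual from a teacher signal attached to the first hidden stage is injected into the costate recursion. The stage-state list is assumed to come from a forward sweep as in the earlier sketch; the penalty weights and the hidden teacher signal are hypothetical, not values taken from the thesis.

```python
# Hedged sketch of a Bolza-type cost for MLP-learning: the terminal
# sum-of-squared-error plus stage costs (weight decay on every stage and a
# "hidden-node teaching" residual on the first hidden stage).  The teacher
# signal h_teach and the penalties lam_decay / lam_hidden are illustrative.
import numpy as np

def bolza_cost(params, states, target, h_teach, lam_decay, lam_hidden):
    """Terminal (Mayer) cost plus the added stage costs (the Bolza part)."""
    terminal = 0.5 * np.sum((states[-1] - target) ** 2)
    decay = 0.5 * lam_decay * sum(np.sum(W ** 2) for W, _ in params)
    hidden = 0.5 * lam_hidden * np.sum((states[1] - h_teach) ** 2)
    return terminal + decay + hidden

def bolza_gradient(params, states, target, h_teach, lam_decay, lam_hidden):
    """Stagewise backward sweep; each stage cost enters where it arises."""
    grads = [None] * len(params)
    delta = states[-1] - target                        # terminal residual
    for k in reversed(range(len(params))):
        W, _ = params[k]
        grads[k] = (np.outer(delta, states[k]) + lam_decay * W,  # + weight decay
                    delta)
        if k > 0:
            delta = (W.T @ delta) * (1.0 - states[k] ** 2)
            if k == 1:  # states[1] is the first hidden stage's (tanh) output
                delta += lam_hidden * (states[1] - h_teach) * (1.0 - states[1] ** 2)
    return grads
```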
Many currently available nonlinear least squares models and machine-learning architectures
could be re-structured into the stagewise format that commonly arises in MLP-learning.
We then render the relevant problems amenable to attack by stagewise gradient procedures.
Potentially, our stagewise methods could apply to a wide variety of machine learning models
beyond our expectation.

1 Introduction
  1.1 Nonlinear Optimization for Learning
    1.1.1 Steepest Descent and Newton methods
    1.1.2 Nonlinear Least Squares Problems
  1.2 Contributions
  1.3 Research Scope and Outline
2 Stagewise Backpropagation for Multi-Stage Neural-Network Learning
  2.1 Stagewise structure
  2.2 Optimal Control Problems
  2.3 First-Order Stagewise Backpropagation
    2.3.1 Stagewise procedures
    2.3.2 Hidden-Node Teaching and Weight Decay
  2.4 Stagewise vs. node-wise procedures in F-output MLP-learning
    2.4.1 1-1-10 MLP-learning
    2.4.2 10-1-10 MLP-learning
    2.4.3 Cost Analysis
    2.4.4 An example of node-wise procedure
  2.5 Second-Order Stagewise Backpropagation
  2.6 Summary
3 Neuro-Fuzzy Learning
  3.1 TSK fuzzy models
    3.1.1 A basic linear structure of TSK
    3.1.2 An original example for the Takagi-Sugeno fuzzy model
  3.2 CANFIS (Co-Active Neuro-Fuzzy Inference System)
    3.2.1 CANFIS with linear rules
    3.2.2 The XOR problem by a CANFIS with two linear rules
  3.3 Neuro-Fuzzy Hidden-Node Teaching
    3.3.1 Basic formulations
    3.3.2 Interpretability-precision dilemma: A motivating example
    3.3.3 On-line hidden-node teaching effects in TSK-learning
    3.3.4 CANFIS with neural rules
    3.3.5 A simple committee method for model averaging
  3.4 Summary
4 Experiments
  4.1 N-shape curve-fitting problem
    4.1.1 Our CANFIS design for N-shape curve-fitting
    4.1.2 N-shape curve-fitting by a CANFIS with three neural rules
    4.1.3 N-shape curve fitting by a 1-3-1 MLP
  4.2 Heart C five-class pattern classification
    4.2.1 Heart-C classification by single MLPs
    4.2.2 Heart-C classification by CANFIS
  4.3 Letter Recognition Problem
    4.3.1 CANFIS design for letter recognition
    4.3.2 Letter recognition by a CANFIS with two 16-70-50-26 MLPs
  4.4 Committee method results with trained CANFISs
  4.5 Hidden-node teaching effects
  4.6 Summary
5 Conclusions and Future Directions
  5.1 Main Results
  5.2 Outlook for Future Directions
Bibliography
Appendices
A The Hessian Structure in 2-2-1 MLP-Learning

