Author: 鄭德璋 (Te-Chang Cheng)
Thesis title: 基於作業負載密度聚類分析的故障預警系統 (System Failure Forewarning Based on Workload Density Cluster Analysis)
Advisor: 李漢銘 (Hahn-Ming Lee)
Oral defense committee: 鮑興國 (Hsing-Kuo Pao), 鄧惟中 (Wei-Chung Teng), 鄭博仁 (Albert B. Jeng), 林豐澤 (Feng-Tse Lin)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99
Language: English
Number of pages: 115
Keywords (Chinese): 錯誤預警, 工作負載, 可擴展性, 基於路徑的軟體可靠性預測, 自主系統的故障
Keywords (English): failure forewarning, workload intensity, scalability, path-based software reliability prediction, autonomic systems
  • Every system is designed for long-term use, so it must operate normally and meet its intended economic targets. Any period of downtime affects a company's business, which puts growing pressure on operators and administrators to minimize downtime and run systems efficiently. To do so, operators must assess system performance continuously and accurately, and detect as early as possible the factors that could degrade it.
    Full-system simulation is commonly used to collect data and to understand, from the experimental results, the performance and state of the hardware, operating system, and application layers. In most multi-user software systems, however, anomalous behavior does not arise only from failures of individual components. Workload intensity, generated by the interaction among components and by behavior that differs over time, introduces high statistical variance. This variance makes it harder to draw statistical conclusions and to identify the root causes of problems. As a result, simulation results match only part of real-world experience.
    Condition indicators are the basic building blocks of diagnosis. They carry a large amount of information that may represent system behavior, and their measurements can be used to monitor and track that behavior. The selected feature vectors must fit the requirements if the diagnostic routines are to be accurate and robust. We therefore focus on variation in system workload intensity, which helps us extract the best feature vectors.
    Workload modeling and fault diagnosis are important tasks that give system designers a basis for continuously improving a system. For this reason, we aim to build a system health monitor and a method that uses it to indicate system status. Our ultimate goal is to use this indicator to represent the actual behavior of the system, to collect system data, and to generate the various models used by the diagnostic algorithms. Tracking actual workload changes through the indicator lets us characterize and model the system state. The diagnostic algorithms then perform fault detection against the resulting templates. Because the templates cover a variety of system information, this approach minimizes false alarms and improves the accuracy with which the next status of the system service is predicted.
    In this thesis, we propose the System Health Monitor to indicate system status and to predict anomalous behavior. The experimental results show two things. First, in most normal states, the lowest accuracy observed in our validation tests approaches our theoretical minimum threshold of 84%, which implies a close relationship between our measurements and the real system status. Second, the command data used by the system can predict 90% of the events announced in daily work reports, which demonstrates the prediction effectiveness of the proposed system.
    Nonetheless, detecting the current state of a system from its changing workload still has a long way to go. Handling every fault and failure event is extremely challenging. Provisioning resources in advance for all of them is like preparing for a magnitude-9 earthquake: it can be designed for in theory, but it is not economically feasible. Likewise, although it is infeasible to provision for the largest possible fault events, statistics can be used to characterize anomalous behavior, to understand the nature of unexpected events, and to test the service under such scenarios. This enables the system's diagnostic algorithms to detect, isolate, and identify a fault at its earliest stage.
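The abstract does not spell out how workload density is turned into an anomaly signal, so the following is only a minimal sketch of the general idea, not the thesis's Density-Based Model Builder. It assumes each workload sample is a numeric feature vector and flags samples that have too few neighbours within a radius. The function name `density_outliers`, the parameters `eps` and `min_pts`, and the sample data are all hypothetical. This is a simplified density test in the spirit of density-based clustering, not a full clustering algorithm.

```python
from math import dist


def density_outliers(points, eps, min_pts):
    """Return indices of points with fewer than min_pts neighbours within eps.

    A point counts as its own neighbour. This is a simplified density test,
    assumed here for illustration; it does not build clusters.
    """
    outliers = []
    for i, p in enumerate(points):
        # Count every point (including p itself) within Euclidean distance eps.
        neighbours = sum(1 for q in points if dist(p, q) <= eps)
        if neighbours < min_pts:
            outliers.append(i)
    return outliers


# Hypothetical workload samples: (requests per second, mean response time in ms).
samples = [(100, 20), (102, 21), (98, 19), (101, 22), (99, 20), (300, 95)]
print(density_outliers(samples, eps=5.0, min_pts=3))  # prints [5]
```

In practice the features would likely need to be scaled to comparable ranges before a single `eps` radius is meaningful, and the thresholds would have to be tuned against known normal workload periods.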

    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 System Logs on System Health Monitor
        1.2 The Challenges of System Health Monitor
        1.3 Motivations
        1.4 Goals
        1.5 The Outline of thesis
    Chapter 2 Background and Related Work
        2.1 The Online Fault Detection Techniques
        2.2 The Proactive Fault Management of Online Fault Detection
        2.3 The Related Work of Density-Based Methodology
        2.4 The Related Work of Path-based Methodology
        2.5 The Related Work of Performance Analysis
        2.6 The Related Work of Event Parser
        2.7 The Fault Detection Project
    Chapter 3 System Health Monitor
        3.1 Concept of System Health Monitor
        3.2 The System Architecture of System Health Monitor
        3.3 Data Preprocessor Stage
            3.3.1 Log Abstractor
            3.3.2 Path Performance Tracer
        3.4 Statistical Template Modeling Stage
            3.4.1 Density-Based Model Builder
            3.4.2 Statistical Pattern Selector
        3.5 Online Monitor Stage
            3.5.1 Statistical Change Detector
            3.5.2 Next Status Predictor
            3.5.3 Alarm Reporter
    Chapter 4 Experiments
        4.1 Description of Data Set
        4.2 Evaluation Design
            4.2.1 The Validation in Measurement Stage
            4.2.2 The Evaluation in Prediction Stage
        4.3 Density-Based Model Builder Parameter Setting
            4.3.1 The Effect of Cluster Set
            4.3.2 The Effect in the Different Number of Cluster Setting
        4.4 The Validation of Magnitude Scales
            4.4.1 Threshold Setting
            4.4.2 The Effect of Magnitude Scales
            4.4.3 Build the Statistical Pattern
            4.4.4 The Summarization of Experiment Results
        4.5 The Performance of System Health Monitor
            4.5.1 The Performance of the Next Status Predictor
            4.5.2 The Forewarning Effectiveness Evaluation
            4.5.3 The Analysis of the Forewarning Results
            4.5.4 The Summarization of Overall Prediction Performance
    Chapter 5 Conclusion and Further Work
        5.1 Conclusion
        5.2 Further Work
    References
    Vita

