
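Author: Yi-Hsiung Su (蘇義雄)
Thesis Title: Using Text Mining to Create an Invalid Defect Classification Model for Server Development (使用文本探勘在伺服器開發上建立無效的缺陷分類模型)
Advisor: Pin Luarn (欒斌)
Committee Members: Cheng-Kang Chen (陳正綱), Chien-Lung Chan (詹前隆), Wen-Chih Liao (廖文志), Peter KY Lo (羅凱揚), Pin Luarn (欒斌)
Degree: Doctor
Department: School of Management - Graduate Institute of Management
Year of Publication: 2017
Graduation Academic Year: 105 (ROC calendar)
Language: English
Number of Pages: 78
Keywords (translated from Chinese): Invalid defect classification, Text mining, Data mining, Server development, Project management, BIOS
Keywords (English): Invalid defect, Server development, BIOS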

    Invalid defects are often overlooked, yet they reduce development productivity and efficiency. This study used an exploratory study, expert meetings, and text mining to answer four research questions in three research stages. In the first stage, we collected 3,347 defects from the defect tracking system of three x86 server projects. This stage determined the defect distribution of server products, which supports the Pareto principle. In the second stage, we filtered 231 invalid BIOS (basic input/output system) defects from the 3,347 defects. These defects were found by virtual teams in Taiwan, China, and the United States that cover numerous functional areas. The results indicated that BIOS firmware had the largest number of both defects and invalid defects, accounting for 43.4% of the defects and 33% of the invalid defects in server development. The results also established an invalid defect classification with four types: working as designed (WAD), user error, duplicate, and others, grouped under the term WUDO. Within the WUDO classification, WAD accounted for the largest share of invalid defects, 45%. In the third stage, this study identified a stable classification algorithm, the C4.5 decision tree, for classifying invalid defect types. This study helps project teams for information technology products classify the different invalid defect types that developers and testers face. The results can improve project team productivity and reduce project management risk.
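The abstract names the C4.5 decision tree but does not show how it picks a split. As a rough illustration only, the sketch below computes the gain ratio, the criterion C4.5 uses to choose which attribute to split on. The keyword features (`spec`, `fail`) and the six labeled records are hypothetical and are not taken from the thesis data; the function names are likewise invented for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """C4.5 split criterion: information gain divided by split information."""
    total = len(rows)
    # Group the class labels by the value each record has for this feature.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Expected entropy remaining after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features that fragment the data.
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records: does the defect text contain a given keyword?
rows = [
    {"spec": True, "fail": True},
    {"spec": True, "fail": False},
    {"spec": False, "fail": True},
    {"spec": False, "fail": False},
    {"spec": False, "fail": True},
    {"spec": False, "fail": False},
]
# Hypothetical WUDO labels for the records above.
labels = ["WAD", "WAD", "Duplicate", "Duplicate", "User Error", "Others"]

scores = {f: gain_ratio(rows, labels, f) for f in ("spec", "fail")}
best_feature = max(scores, key=scores.get)
print(scores, best_feature)
```

In this toy set the `spec` keyword isolates every WAD record, so it scores a higher gain ratio than `fail` and would be chosen as the first split. A full C4.5 implementation would also recurse on each partition, handle continuous attributes, and prune the tree.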

    Abstract (Chinese)
    ABSTRACT
    Acknowledgments
    Paper Submissions
    CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    Chapter 1 INTRODUCTION
        1.1 Background and Motivation
        1.2 Research Objectives
        1.3 Research Process
        1.4 Organization of Dissertation
    Chapter 2 LITERATURE REVIEW
        2.1 Software Engineering and Project Management
        2.2 Defects and Invalid Defects
        2.3 Data Mining and Text Mining
        2.4 Supervised Machine Learning
        2.5 Algorithms
            2.5.1 Decision Tree
            2.5.2 Naive Bayes
            2.5.3 Bayesian Network
            2.5.4 Logistic Regression
            2.5.5 Neural Network
    Chapter 3 METHOD
        3.1 Researched Case
        3.2 Research Design
        3.3 Definition of Invalid Defects
        3.4 Data Collection and Extraction
        3.5 Definition of GBI and EBI
        3.6 Text-Mining Approaches
        3.7 Evaluating Performance
    Chapter 4 RESULTS AND ANALYSIS
        4.1 Defect Distribution in Stage 1
        4.2 Invalid Defect Classification in Stage 2
        4.3 WUDO Classification
        4.4 Invalid Defect Classification Model in Stage 3
    Chapter 5 DISCUSSIONS
    Chapter 6 CONCLUSIONS AND SUGGESTIONS
        6.1 Conclusions
        6.2 Theoretical and Practical Implications
        6.3 Limitations and Directions for Future Research
    REFERENCE
    Appendix A Keyword List

