Graduate student: | 張鈺輝 Yui-Hui Chang |
---|---|
Thesis title: | 距離夾角最似鄰近結合貝氏定理預測分類推論模式之研究 Distance and Cosine Angle-Based K-nearest neighbor Classification with Bayesian Framework |
Advisor: | 鄭明淵 Min-Yuan Cheng |
Oral defense committee: | 潘南飛 Pan, Nang-Fei; 陳柏翰 Po-Han Chen; 陳鴻銘 Hung-Ming Chen |
Degree: | Master |
Department: | College of Engineering - Department of Civil and Construction Engineering |
Year of publication: | 2015 |
Academic year of graduation: | 103 |
Language: | Chinese |
Number of pages: | 130 |
Keywords (Chinese): | 分類分析, K-NN Classifier, Bayesian Theory, 餘弦相似度 |
Keywords (English): | Classification Analysis, K-NN Classifier, Bayesian Theory, Cosine Similarity |
In data mining, classification analysis is an important method for predictive data processing, and the K-Nearest Neighbor (K-NN) algorithm is one of its most commonly used techniques. K-NN classification uses Euclidean distance as its classification basis. Euclidean distance reflects the absolute difference between the numerical features of data points, but when two distances are very close or even identical, the classifier may be unable to decide on a class. In addition, a conventional classifier assigns each test instance directly to a single class, even though the instance may also resemble other classes and has some probability of belonging to one of them; a hard assignment cannot express this degree of membership. Besides Euclidean distance, cosine similarity is another commonly used measure. Because cosine similarity distinguishes data only by direction and is insensitive to absolute magnitude, it has blind spots of its own.
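The complementary blind spots of the two measures can be seen on a few hand-picked points. The sketch below is only an illustration using the standard definitions of Euclidean distance and cosine similarity; the function names and sample vectors are assumptions made for this example and do not come from the thesis.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical points chosen for illustration only.
query = (1.0, 0.0)
p = (1.0, 1.0)   # same distance from query as q, different direction
q = (2.0, 0.0)   # same distance from query as p, same direction as query

# Case 1: Euclidean distance ties, cosine similarity separates the points.
print(euclidean_distance(query, p), euclidean_distance(query, q))
print(cosine_similarity(query, p), cosine_similarity(query, q))

# Case 2: same direction, very different magnitude.
# Cosine similarity is (numerically) 1, while the distance is large.
small = (1.0, 2.0)
large = (10.0, 20.0)
print(cosine_similarity(small, large), euclidean_distance(small, large))
```

In the first case distance alone cannot rank the two candidates, and in the second case the angle alone cannot tell them apart, which is the motivation for using both measures together.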
This research therefore addresses these shortcomings and aims to improve both the K-NN classifier and the way classification results are reported. It examines the characteristics of Euclidean distance and cosine similarity and uses both as the classification basis, developing the θ-Means Nearest Neighbor (θ-MNN) Classifier algorithm. It also combines the θ-MNN classifier with Bayesian theory to develop the θ-Means Nearest Bayesian Classifier (θ-MNBC) algorithm, so that the classification result provides more detailed information. Both algorithms are applied to public construction project dispute resolution cases and roadside slope collapse cases to classify and analyze them and to verify the classification model.
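The abstract does not give the details of θ-MNN or θ-MNBC, so the following is not the thesis's algorithm. It is a minimal sketch of the general idea under stated assumptions: neighbors are ranked by Euclidean distance, exact distance ties are broken by higher cosine similarity, and the class shares among the k nearest neighbors are reported as estimated membership probabilities (the standard k-nearest-neighbor estimate of the posterior, k_c / k). The function name `knn_posteriors`, the labels, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    # Assumes neither vector is the zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_posteriors(query, samples, k):
    """Estimate class membership probabilities for `query`.

    `samples` is a list of (vector, label) pairs. Neighbors are sorted by
    Euclidean distance; equal distances are ordered by descending cosine
    similarity (an assumed tie-break, not the thesis's rule).
    """
    ranked = sorted(
        samples,
        key=lambda s: (
            euclidean_distance(query, s[0]),
            -cosine_similarity(query, s[0]),
        ),
    )
    counts = Counter(label for _, label in ranked[:k])
    return {label: n / k for label, n in counts.items()}


# Hypothetical two-feature training data with made-up labels.
samples = [
    ((1.0, 2.0), "collapse"),
    ((2.0, 0.0), "stable"),
    ((1.0, 1.0), "collapse"),
    ((3.0, 0.0), "stable"),
    ((5.0, 5.0), "collapse"),
]
posteriors = knn_posteriors((1.0, 0.0), samples, k=3)
print(posteriors)
```

Returning a probability per class instead of a single label is what lets a result express partial membership; a hard label can still be recovered by taking the class with the largest share.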