簡易檢索 / 詳目顯示

研究生: 廖晨如
Chen-Ju Liao
論文名稱: 基於雙方文本長度之空間文本相似度演算法
An Enhanced Algorithm for Spatio-Textual Similarity Join by Considering Multiple Keyword Lengths
指導教授: 邱舉明
Ge-Ming Chiu
口試委員: 項天瑞
Tien-Ruey Hsiang
鄧惟中
Wei-Chung Teng
李育杰
Yuh-Jye Lee
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2018
畢業學年度: 106
語文別: 中文
論文頁數: 57
中文關鍵詞: 空間-文本相似度spatio-textual signaturetop-k
外文關鍵詞: spatio-textual similarity, spatio-textual signature, top-k
相關次數: 點閱:151下載:2
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 隨著行動裝置結合適地性服務(location based service, LBS)的普遍化,使用者們產生了大量同時包含地理位置和文字描述的資料,像是Foursquare平台的check-in、Facebook的打卡和標籤或是興趣點(point of interest, POI)的產生等等。透過這些資料產生了空間-文本相似度(spatio-textual similarity)比對的議題,有對於各種不同來源所產生的空間-文本資料可以利用相似度的比對進行整合,譬如在空間上的小距離誤差和文本上的不同表達方式的統整,也可以分析社群網路上的使用者們發布的高相似度空間-文本資料用於交友系統的推薦名單,而這些大量的空間-文本資料要如何有效率的進行相互比較,並找出相似度最高的k個配對是本論文主要想探討的議題。
    在本論文中,為了使空間-文本資料之間的相似度運算效能增強並改進相關研究方法的缺點,我們針對了spatio-textual signature架構進行研究,其概念為對所有的空間-文本資料產生spatio-textual signature,如果彼此的相似分數可能會位於最高的k個中,兩者的spatio-textual signature就一定會產生重疊,藉由這個方法,可以過濾掉signature沒有重疊的資料配對。在我們的研究中,為了提升spatio-textual signature的過濾效率,每一個空間-文本資料對於其他人將會針對它們不同的文本資料長度產生相應的spatio-textual signature,如此便能大量增加沒有重疊的資料配對,減少資料彼此之間的空間-文本相似度運算,最後,將會同時利用合成和Twitter的實際資料集來對我們提出的方法進行效能評估及分析。


    With the popularity of mobile devices in conjunction with location based service (LBS), users generate a wealth of spatio-textual data, such as check-in for both Foursquare and Facebook, Facebook tag, point of interest (POI) generation, and more. Through these data, the issue of Spatio-textual Similarity has become important. For example, spatial-textual data generated by various sources can be effective to integrated, such as small spatial errors in space and different expressions in texts, and it can also be used to find high-similarity spatio-textual data published by different users on social networks to dating recommendations. However, how to efficiently calculate the similarity of these spatial-textual data and find out k pairs with highest similarity are the main topics to be discussed in this thesis.
    In this paper, in order to improve the algorithmic efficiency of similarity between spatial and textual data and to improve the performance of the existing methods, we study the spatio-textual signature framework. The concept is to generate a spatio-textual signature set for all spatial-textual data. If the similarity scores for each other are likely to be in the highest k result, the two spatio-textual signature sets must overlap. With this concept, we can prune pairs that do not overlap. In our research, in order to improve the filtering efficiency of the spatio-textual signature, each spatial-textual data will create different spatio-textual signature for others depending on their length of textual description. As a result, there is a large amount of reduction in the similarity calculation of the spatial-textual data. Besides, we employ synthetic database and Twitter database in the experiment to show that our method outperforms the other method significantly.

    目錄 摘要 I Abstract II 目錄 III 圖目錄 V 表目錄 VII 第一章 緒論 1 1.1 背景 1 1.2 論文目標 2 1.3 論文架構 3 第二章 相關研究 5 第三章 系統模型及問題定義 15 第四章 我們的空間文本相似度演算法 17 4.1 考慮雙方文本長度Spatio-Textual Signature 概念 18 4.2 運算結構 21 4.3 Spatio-Textual Signature 策略 24 4.4 演算法 26 第五章 實驗與模擬結果 28 5.1 合成資料集實驗環境設定 28 5.2 合成資料集實驗模擬結果 29 5.2.1 不同總資料數量的影響 29 5.2.2 依據第k高分數變化的影響 31 5.2.3 不同權重的影響 33 5.2.4 Keyword選取機率偏斜度的影響 34 5.2.5 最高keyword重疊數為3 35 5.3 實際資料集實驗環境設定 37 5.4 實際資料集實驗模擬結果 38 5.4.1 在實際資料集的執行效能 38 5.4.2 依據第k高分數變化的影響 39 5.4.3 不同權重的影響 40 5.5 實驗結論 41 第六章 結論與未來研究方向 43 參考文獻 44

    [1]S. Chaudhuri, V. Ganti, and R. Kaushik, “A primitive operator for similarity joins in data cleaning,” in Proc. IEEE Int’l Conf. on Data Engineering(ICDE), 2006, pp. 5-16.
    [2]C. Xiao, W. Wang, X. Lin, and J. X. Yu, “Efficient similarity joins for near duplicate detection,” in Proc. Int’l Conf. on World Wide Web(WWW), 2008, pp. 131-140.
    [3]A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos, “Closest pair queries in spatial databases,” in Proc. ACM Int’l Conf. on Management of data (SIGMOD), 2000, pp. 189-200.
    [4]C. Xiao, W. Wang, X. Lin, and H. Shang, “Top-k set similarity joins,” in Proc. IEEE Int’l Conf. on Data Engineering(ICDE), 2009, pp. 916-927.
    [5]H. Hu, G. Li, Z. Bao, J. Feng, Y. Wu, Z. Gong, and Y. Xu, “Top-k spatio-textual similarity join,” IEEE Transactions on Knowledge and Data Engineering(TKDE), vol. 28, no. 2, pp. 551-565, 2016.
    [6]H. Samet, “An overview of quadtrees, octrees, and related hierarchical data structures,” Theoretical Foundations of Computer Graphics and CAD, 1988, pp. 51- 68.
    [7]A. Arasu, V. Ganti, and R. Kaushik, “Efficient exact set-similarity joins,” in Proc. Int’l Conf. on Very Large Data Bases(VLDB), 2006, pp. 918-929.
    [8]J. Wang, G. Li, and J. Feng, “Can we beat the prefix filtering? : An adaptive framework for similarity join and search,” in Proc. ACM Int’l Conf. on Management of data (SIGMOD), 2012, pp. 85-96.
    [9]S. Qi, P. Bouros, and N. Mamoulis, “Efficient top-k spatial distance joins,” in Proc. Int’l Conf. on Advances in Spatial and Temporal Databases(SSTD), 2013, pp. 1-18.
    [10]G. R. Hjaltason, and H. Samet, “Incremental distance join algorithms for spatial databases,” in Proc. ACM Int’l Conf. on Management of data (SIGMOD), 1998, pp. 237-248.
    [11]Y. Jiang, G. Li, J. Feng, and W. Li, “String similarity joins: An experimental evaluation,” Proc. VLDB Endowment, vol. 7, no. 8, pp. 625-636, 2014.
    [12]P. Bouros, S. Ge, and N. Mamoulis, “Spatio-Textual Similarity Joins,” Proc. VLDB Endowment, vol. 6, no. 1, pp. 1-12, 2012.
    [13]S. Liu, G. Li, and J. Feng, “A prefix-filter based method for spatio-textual similarity join,” IEEE Transactions on Knowledge and Data Engineering(TKDE), vol. 26, no. 10, pp. 2354-2367, 2014.
    [14]L. Chen, G. Cong, C. S. Jensen, and D. Wu, “Spatial keyword query processing: An experimental evaluation,” Proc. VLDB Endowment, vol. 6, no. 3, pp. 217-228, 2013.
    [15]G. Cong, C. S. Jensen, and D. Wu, “Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects,” Proc. VLDB Endowment, vol. 2, no. 1, pp. 337-348, 2009.
    [16]J. B. Rocha-Junior, O. Gkorgkas, S. Jonassen, and K. Nørvåg “Efficient processing of top-k spatial keyword queries,” in Proc. Int’l Conf. on Advances in Spatial and Temporal Databases(SSTD), 2011, pp. 205-222.
    [17]D. Wu, M. L. Yiu, G. Cong, and C. S. Jensen , “Joint top-k spatial keyword query processing,” IEEE Transactions on Knowledge and Data Engineering(TKDE), vol. 24, no. 10, pp. 1889-1903, 2012.
    [18]G. Li, J. Feng, and J. Xu, “Desks: Direction-aware spatial keyword search,” in Proc. IEEE Int’l Conf. on Data Engineering(ICDE), 2012, pp. 474-485.
    [19]G. M. Morton, “A computer oriented geodetic data base and a new technique in file sequencing,” Technical Report, Ottawa, Canada: IBM Ltd, 1966.
    [20]J. A. Orenstein and F. A. Manola, “PROBE Spatial Data Modeling in an Image Database and Query Processing Application,” IEEE Transactions on Software Engineering, vol. 14, no. 5, pp. 611-629, 1988.
    [21]S. Sahni, Data structures, algorithms, and applications in c++. Univ. Press, NJ, USA, 1999.
    [22]J. Zobel, A. Moffat, and K. Ramamohanarao, “Inverted files versus signature files for text indexing,” ACM Transactions on Database Systems (TODS), vol. 23, no. 4, pp. 453-490, 1998.
    [23]J Giridharan, and S. V. Vairavan, “Inverted index and interval lists for keyword search,” Int’l Conf. on Green Computing Communication and Electrical Engineering (ICGCCEE), 2014, pp. 1-4.
    [24]J. Eisenstein, B. O'Connor, N. A. Smith, and E. P. Xing, “A latent variable model for geographic lexical variation,” in Proc. IEEE Int’l Conf. on Empirical Methods in Natural Language Processing, 2010, pp. 1277-1287.
    [25]L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “Web Caching and Zipf-Like Distribution: Evidence and Implication,” in Proc. IEEE Int’l Conf. on Computer Communications(INFOCOM), 1999, pp. 126-134.

    QR CODE