
Author: 陳曉毅
Hsiao-Yi Chen
Thesis Title: 在加密雲上基於詞嵌入之語意搜尋
A Semantic Search over Encrypted Cloud Data Based on Word Embedding
Advisor: 金台齡
Tai-Lin Chin
Committee: 陳冠宇
Kuan-Yu Chen
洪智傑
Chih-Chieh Hung
Degree: 碩士
Master
Department: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
Thesis Publication Year: 2019
Graduation Academic Year: 107
Language: Chinese (中文)
Pages: 47
Keywords (in Chinese): 加密雲、詞嵌入、語意搜尋
Keywords (in other languages): Encrypted Cloud Data, Word Embedding, Semantic Search
Reference times: Clicks: 226, Downloads: 2

近年來,雲端儲存服務的使用越來越廣泛。憑藉較低的設備成本和高容量的優勢,一些企業和使用者傾向於將他們的數據資料從本地的儲存裝置移動到遠端的儲存裝置上,例如雲端伺服器。為了讓使用者能在雲端伺服器上有效率地搜尋儲存的數據資料,透過關鍵字的搜尋是現在廣為使用的方法。隨著資訊安全意識的抬頭,數據的擁有者希望放在雲端伺服器中的資料能保有隱私、不被不受信任的使用者窺探,同時使用者也希望自身的查詢內容不會被不受信任的伺服器紀錄,因此將數據和查詢加密是最常見的方式。然而加密過後的密文已經失去明文所具有的關係,這會在關鍵詞搜尋上增加許多困難。
此外,大部分現有的搜尋方法無法有效率地從使用者所下的關鍵字中獲取使用者真正感興趣的資料。為了解決這些問題,本研究提出一種基於詞嵌入(Word Embedding)的語意搜尋演算法。其中詞嵌入模型由神經網路(Neural Network)實現:神經網路模型可以學習語料庫(corpus)中詞與詞之間的語意關係,並以向量表示單詞。透過詞嵌入模型,可生成文檔索引向量(document index vector)和查詢向量(query vector)。最後,本論文提出的方案可將查詢向量和索引向量加密為密文,在保護使用者隱私和文檔安全性的同時保有搜尋的效率。


Cloud storage services have become very popular in recent years. With the advantages of low cost and high capacity, people are inclined to move their data from local computers to remote facilities such as cloud servers. The majority of existing methods for searching data on the cloud concentrate on keyword-based search schemes. With the rise of information security awareness, data owners hope that data placed on a cloud server can be kept private from untrusted users, and users also hope that their query content will not be recorded by an untrusted server. Therefore, encrypting the data and the queries is the most common approach. However, the encrypted ciphertext loses the relationships present in the original plaintext, which causes many difficulties in keyword search. In addition, most existing search methods cannot efficiently retrieve the information the user is really interested in from the user's query keywords.
To address these problems, this study proposes a word-embedding-based semantic search scheme for documents on the cloud. The word embedding model is implemented by a neural network, which learns the semantic relationships between words in a corpus and expresses each word as a vector. Using the word embedding model, a document index vector and a query vector can be generated. The proposed scheme encrypts the query vector and the index vectors into ciphertext, preserving search efficiency while protecting the privacy of the user and the security of the documents.
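The abstract describes a two-step pipeline: compose word embeddings into document index vectors and query vectors, then encrypt both so the server can rank documents without seeing the plaintext vectors. The thesis's actual construction is given in its Chapter 3; the toy sketch below only illustrates the general idea under stated assumptions: a tiny hypothetical embedding table standing in for a trained word2vec model, mean-pooling as the vector composition, and an invertible random matrix transform (in the style of secure-kNN schemes) that preserves inner products between an encrypted index vector and an encrypted query vector.

```python
# Toy sketch (NOT the thesis's exact scheme): averaged word embeddings
# plus an inner-product-preserving matrix encryption.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-word vocabulary with random 8-d "embeddings";
# a real system would load vectors from a trained word2vec model.
vocab = {"cloud": 0, "storage": 1, "privacy": 2, "search": 3}
dim = 8
E = rng.normal(size=(len(vocab), dim))

def text_to_vector(words):
    """Mean-pool the embeddings of in-vocabulary words."""
    rows = [E[vocab[w]] for w in words if w in vocab]
    return np.mean(np.stack(rows), axis=0)

# Secret key: a random matrix M (invertible with probability 1).
# Index vectors are encrypted with M^T, queries with M^-1, so
# (M^T p) . (M^-1 q) = p^T M M^-1 q = p . q -- the server can rank
# by inner product without learning p or q themselves.
M = rng.normal(size=(dim, dim))
M_inv = np.linalg.inv(M)

def encrypt_index(p):
    return M.T @ p

def encrypt_query(q):
    return M_inv @ q

doc = text_to_vector("cloud storage privacy".split())
qry = text_to_vector("privacy search".split())

plain_score = float(doc @ qry)
enc_score = float(encrypt_index(doc) @ encrypt_query(qry))
assert np.isclose(plain_score, enc_score)  # inner product survives encryption
```

The basic transform shown here leaks inner products to the server by design; hardened variants of such schemes add vector splitting and dummy dimensions, which this sketch omits for brevity.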

Abstract (in Chinese)
Abstract
Table of Contents
List of Figures
1 Introduction
1.1 Background
1.2 Motivation
1.3 Objectives and Contributions
1.4 Thesis Organization
2 Literature Review
2.1 Searchable Encryption
2.2 Topic Models
2.3 Word Embedding Models
3 Problem Definition and Methodology
3.1 Problem Definition
3.1.1 System Model
3.1.2 Threat Model
3.2 Word Embedding Model
3.2.1 Input Layer
3.2.2 Hidden Layer
3.2.3 Output Layer
3.2.4 Training and Optimization
3.3 Building Index and Query Vectors
3.4 Search over Encrypted Data
4 Experimental Results and Analysis
4.1 Experimental Environment
4.2 Training the Word Embedding Model
4.3 Effectiveness of the Word Embedding Model
4.4 Semantic Search
4.5 Search Efficiency
5 Conclusion and Future Work
Appendix: Security Analysis
Appendix: Backpropagation
References

