
Author: Yohanes Satria Nugroho
Thesis title: Lightweight American Sign Language Recognition Using a Deep Learning Approach
Advisor: Chuan-Kai Yang (楊傳凱)
Committee members: Yuan-Cheng Lai (賴源正), Bor-Shen Lin (林伯慎)
Degree: Master
Department: Department of Information Management, College of Management
Year of publication: 2023
Academic year of graduation: 111
Language: English
Number of pages: 54
Keywords: Sign Language Recognition, Lightweight Model, Keypoints Estimation

Sign Language Recognition is a variant of Action Recognition that involves more detailed features, such as hand shapes and movements. Researchers have tried for years to apply computer-based methods to this task. However, the proposed methods are constrained by hardware limitations, which prevents them from being applied in real-life situations.

In this research, we explore the possibility of creating a lightweight Sign Language Recognition model that can be applied in real-life situations. We explore two different approaches. First, we extract keypoints and use a simple LSTM model for recognition, reaching 75% Top-1 validation accuracy. Second, we use the lightweight MoViNet A0 model and achieve 71% Top-1 test accuracy. Although these models perform slightly worse than the state-of-the-art I3D, their complexity in terms of FLOPs is far lower.
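Although the abstract only names the building blocks, the first approach can be pictured as a two-stage pipeline: per-frame keypoints are extracted from the video, and the resulting sequence of coordinate vectors is classified by a small LSTM. The following is a minimal Python sketch of that idea, assuming MediaPipe Holistic for keypoint extraction and a two-layer Keras LSTM; the frame count, layer widths, and 100-class setting are illustrative assumptions rather than the configuration used in the thesis.

```python
# Minimal sketch of a keypoints + LSTM sign-language classifier.
# Assumptions (not taken from the thesis): MediaPipe Holistic keypoints,
# 30-frame clips, two LSTM layers, and 100 sign classes.
import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

NUM_FRAMES = 30                       # frames sampled per clip (assumed)
NUM_CLASSES = 100                     # number of sign glosses (assumed)
FEATURE_DIM = (33 + 21 + 21) * 3      # pose + both hands, (x, y, z) each

def extract_keypoints(video_path):
    """Run MediaPipe Holistic on a video and return per-frame keypoint vectors."""
    keypoints = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            parts = []
            for landmarks, count in ((results.pose_landmarks, 33),
                                     (results.left_hand_landmarks, 21),
                                     (results.right_hand_landmarks, 21)):
                if landmarks is None:
                    parts.append(np.zeros(count * 3))   # undetected part -> zero vector
                else:
                    parts.append(np.array(
                        [[p.x, p.y, p.z] for p in landmarks.landmark]).flatten())
            keypoints.append(np.concatenate(parts))
    cap.release()
    # Clips are assumed to be sampled or padded to NUM_FRAMES elsewhere.
    return np.asarray(keypoints, dtype=np.float32)

# A small LSTM classifier over the per-frame keypoint vectors.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, FEATURE_DIM)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

For the second approach, a pretrained MoViNet-A0 video backbone (distributed through TensorFlow Hub) could be fine-tuned on the same clips in a comparable way; in both cases the trade-off is a large reduction in FLOPs relative to I3D at the cost of a few points of accuracy.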


Table of Contents:
Master's Thesis Recommendation Form
Qualification Form by Master's Degree Examination Committee
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Contribution
  1.3 Research Outline
2 Related Works
  2.1 Action Recognition
  2.2 Sign Language Recognition
  2.3 Human Pose Estimation
  2.4 MoViNets
3 Proposed Method
  3.1 Models
    3.1.1 Keypoints + LSTM
    3.1.2 MoViNet
  3.2 Dataset
  3.3 Dataset Processing
    3.3.1 Keypoints Input
    3.3.2 RGB Input
4 Experiments & Results
  4.1 Training
    4.1.1 LSTM
    4.1.2 MoViNet
  4.2 Experimental Results
  4.3 Limitations
5 Conclusion and Discussion
  5.1 Conclusion
  5.2 Future Works
Reference


Full text available date: 2026/01/18 (campus network)
Full text available date: 2026/01/18 (off-campus network)
Full text available date: 2026/01/18 (National Central Library: Taiwan Electronic Theses and Dissertations system)