Lightweight American Sign Language Recognition Using a Deep Learning Approach

Graduate student:	Yohanes Satria Nugroho
Thesis title:	Lightweight American Sign Language Recognition Using a Deep Learning Approach
Advisor:	楊傳凱 Chuan-Kai Yang
Committee members:	賴源正 Yuan-Cheng Lai, 林伯慎 Bor-Shen Lin
Degree:	Master
Department:	School of Management - Department of Information Management
Year of publication:	2023
Academic year of graduation:	111 (ROC calendar)
Language:	English
Number of pages:	54
Keywords:	Sign Language Recognition, Lightweight Model, Keypoints Estimation

Sign Language Recognition is a variant of Action Recognition that involves finer-grained features, such as hand shapes and movements. Researchers have applied computer-based methods to this task for years. However, the proposed methods are constrained by hardware limitations, which prevents them from being applied in real-life situations.

In this research, we explore the possibility of creating a lightweight Sign Language Recognition model that can be applied in real-life situations. We explore two approaches. In the first, we extract keypoints and use a simple LSTM model for recognition, obtaining 75% Top-1 validation accuracy. In the second, we use the lightweight MoViNet A0 model and achieve 71% Top-1 test accuracy. Although these models perform slightly worse than the state-of-the-art I3D, their complexity in terms of FLOPs is far lower.
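To make the first approach concrete, the following is a minimal sketch of a keypoint-sequence classifier: each video frame is reduced to a flat vector of keypoint coordinates, a single LSTM layer reads the frames in order, and a linear layer with softmax scores the sign classes. It is not the thesis implementation. The class name `TinyLSTMClassifier`, the layer sizes, the random untrained weights, and the synthetic input are all assumptions chosen for illustration, and biases are omitted for brevity. The helper `top1_accuracy` shows how the Top-1 metric quoted above is typically computed.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMClassifier:
    """Hypothetical single-layer LSTM plus linear classifier (untrained)."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_size, hidden_size, num_classes, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to the concatenation [x_t, h_{t-1}].
        self.weights = {
            gate: init_matrix(hidden_size, input_size + hidden_size, rng)
            for gate in self.GATES
        }
        self.output_weights = init_matrix(num_classes, hidden_size, rng)

    def forward(self, frames):
        """frames: list of per-frame keypoint vectors. Returns class probabilities."""
        h = [0.0] * self.hidden_size  # hidden state
        c = [0.0] * self.hidden_size  # cell state
        for frame in frames:
            z = list(frame) + h
            pre = {gate: matvec(self.weights[gate], z) for gate in self.GATES}
            i = [sigmoid(v) for v in pre["input"]]
            f = [sigmoid(v) for v in pre["forget"]]
            o = [sigmoid(v) for v in pre["output"]]
            g = [math.tanh(v) for v in pre["candidate"]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        # Classify from the final hidden state with a numerically stable softmax.
        logits = matvec(self.output_weights, h)
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


def top1_accuracy(probabilities, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        hits += predicted == label
    return hits / len(labels)


if __name__ == "__main__":
    # Synthetic clip: 5 frames, 4 keypoints with (x, y) each, so 8 values per frame.
    clip = [[0.1 * t] * 8 for t in range(5)]
    model = TinyLSTMClassifier(input_size=8, hidden_size=4, num_classes=3)
    print(model.forward(clip))
```

Because the recurrent layer only processes a short coordinate vector per frame instead of full RGB frames, a model of this kind needs far fewer operations than a 3D convolutional network, which is the trade-off the abstract describes.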

Master’s Thesis Recommendation Form
Qualification Form by Master’s Degree Examination Committee
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Introduction
  1 Background
  2 Contribution
  3 Research Outline
Related Works
  1 Action Recognition
  2 Sign Language Recognition
  3 Human Pose Estimation
  4 MoViNets
Proposed Method
  1 Models
    1.1 Keypoints + LSTM
    1.2 MoViNet
  2 Dataset
  3 Dataset Processing
    3.1 Keypoints Input
    3.2 RGB Input
Experiments & Results
  1 Training
    1.1 LSTM
    1.2 MoViNet
  2 Experimental Results
  3 Limitations
Conclusion and Discussion
  1 Conclusion
  2 Future Works
Reference

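Full-text release date: 2026/01/18 (campus network)
Full-text release date: 2026/01/18 (off-campus network)
Full-text release date: 2026/01/18 (National Central Library: Taiwan Theses and Dissertations System)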
