Author: | Yohanes Satria Nugroho
---|---
Thesis Title: | Lightweight American Sign Language Recognition Using a Deep Learning Approach
Advisor: | Chuan-Kai Yang (楊傳凱)
Committee: | Yuan-Cheng Lai (賴源正), Bor-Shen Lin (林伯慎)
Degree: | Master (碩士)
Department: | Department of Information Management, School of Management (管理學院 - 資訊管理系)
Thesis Publication Year: | 2023
Graduation Academic Year: | 111
Language: | English
Pages: | 54
Keywords (in Chinese): | Sign Language Recognition, Lightweight Model, Keypoints Estimation
Keywords (in other languages): | Sign Language Recognition, Lightweight Model, Keypoints Estimation
Sign Language Recognition is a variant of Action Recognition that involves more detailed features, such as hand shapes and movements. Researchers have applied computer-based methods to this task over the years; however, the proposed methods are constrained by hardware requirements, which prevents them from being deployed in real-life situations.
In this research, we explore the possibility of creating a lightweight Sign Language Recognition model that can be applied in real-life situations. We explore two approaches. In the first, we extract keypoints and use a simple LSTM model for recognition, achieving 75% Top-1 validation accuracy. In the second, we use the lightweight MoViNet A0 model and achieve 71% Top-1 test accuracy. Although these models perform slightly worse than the state-of-the-art I3D, their complexity in terms of FLOPs is far lower.
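To make the keypoint-based approach concrete, the following is a minimal sketch, not the thesis' actual architecture or hyperparameters: an LSTM classifier over per-frame keypoint vectors. The feature size of 126 assumes MediaPipe-style hand landmarks (2 hands × 21 points × 3 coordinates), and the class count is arbitrary.

```python
# Minimal sketch (illustrative only): an LSTM classifier over keypoint sequences.
# Assumes each frame is flattened to a 126-dim vector (2 hands x 21 landmarks x 3 coords).
import torch
import torch.nn as nn

class KeypointLSTM(nn.Module):
    def __init__(self, feat_dim=126, hidden_dim=256, num_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):             # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)         # out: (batch, frames, hidden_dim)
        return self.head(out[:, -1])  # classify from the last time step

# Example: a batch of 8 clips, 32 frames each
model = KeypointLSTM()
logits = model(torch.randn(8, 32, 126))  # (8, 100) class scores
```

The MoViNet A0 approach, by contrast, consumes raw video frames directly rather than extracted keypoints.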