
Author: Chun-Han Huang (黃俊翰)
Title: Enhancing Mask2Former for Real-Time Universal Human Image Segmentation (基於增強 Mask2Former 以實現即時通用人體圖像分割)
Advisor: Chih-Yuan Yao (姚智原)
Committee: Hung-Kuo Chu (朱宏國)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: Chinese
Pages: 47
Keywords: Human Image Segmentation, Universal Segmentation, Real-Time Segmentation, GPU Pre-Processing and Post-Processing, GPU Rendering



Image segmentation is not only a highly active research area in computer vision but also widely applied in real-life scenarios such as face recognition, medical image analysis, fingerprint identification, and crowd control. These applications highlight the importance of segmenting humans and human body parts, so our work focuses specifically on human segmentation. In addition to recognizing the humans in an image, as semantic segmentation does, we also distinguish each individual person, achieving the object-level segmentation effect of instance or panoptic segmentation. A practical way to obtain such diverse segmentation results is to use a single universal image segmentation model rather than a separate model per segmentation task, so we adopt a universal model. However, current universal image segmentation models are very large: although they achieve good results across segmentation tasks, they cannot keep up in speed, processing a high-resolution image at fewer than 10 FPS. In practical applications, balancing segmentation quality and speed is crucial. Moreover, such large models require high-performance GPUs for training and inference, and we also aim to make universal image segmentation usable on lower-performance GPUs.
We balance quality and speed through extensive experimentation. Guided by these experiments, we shrink the current universal image segmentation model: we analyze which parts of the existing model are most important, identify the components with the least effect on the results, and reduce those less critical parts while retaining the essential ones. We use a smaller, faster lightweight model as the backbone to extract image features at multiple scales, and we propose the Multi-Fusion Model, which fuses features from multiple scales to produce the segmentation result (a generic sketch follows below). The Multi-Fusion Model speeds up the overall model by 50% while keeping the result quality in a good range. Next, an experimentally simplified Transformer Decoder refines and adjusts the segmentation results. Finally, the Segmentation Module selects which kind of human segmentation result to generate.
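
The abstract does not spell out how the Multi-Fusion Model combines the multi-scale features, so the following is only a generic sketch of multi-scale feature fusion in PyTorch, not the thesis's actual design: each backbone map is projected to a common width with a 1x1 convolution, upsampled to the finest resolution, and summed. The class name, channel widths, and sum-based fusion are all our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Generic stand-in for the Multi-Fusion Model (our assumption): project
    each backbone feature map to a shared width, upsample everything to the
    finest resolution, and fuse by summation."""

    def __init__(self, in_channels: list[int], out_channels: int = 256):
        super().__init__()
        # One 1x1 projection per backbone stage, coarse to fine.
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # features: coarse-to-fine maps, e.g. strides 32/16/8 of the input.
        target = features[-1].shape[-2:]          # finest spatial size
        fused = sum(
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, features)
        )
        return fused                              # single map for the decoder

# Usage with three hypothetical backbone stages of a 1080p input:
feats = [torch.randn(1, c, 1080 // s, 1920 // s)
         for c, s in [(512, 32), (256, 16), (128, 8)]]
out = MultiScaleFusion([512, 256, 128])(feats)    # (1, 256, 135, 240)
```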
In addition, we implement all pre-processing, post-processing, and result rendering on the GPU. This includes downsampling, converting grayscale images to RGBA, converting BGR images to RGBA, assigning a distinct color to each person in the image, and rendering the results directly with shaders. It eliminates the time conventional pipelines waste transferring model outputs to the CPU for processing with libraries such as OpenCV [1] or Pillow [2]. With these GPU-side changes, we fix the common problem that, even with a fast model, CPU-bound data processing prevents image segmentation from being applied in real time.
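
As an illustration of the GPU-side post-processing, the sketch below maps a per-pixel person-ID mask to an RGBA image entirely on the GPU using a color lookup table, avoiding the copy to the CPU that an OpenCV [1] or Pillow [2] pipeline would incur. The thesis implements these steps with shaders; this PyTorch version, including the function name and the random palette, is only our approximation.

```python
import torch

def colorize_on_gpu(id_mask: torch.Tensor, num_ids: int, alpha: int = 128) -> torch.Tensor:
    """Map an (H, W) tensor of per-pixel person IDs (0 = background) to an
    (H, W, 4) RGBA image without leaving the GPU. A minimal approximation of
    the thesis's shader-based color mapping; the random palette is ours."""
    # One RGBA entry per ID; advanced indexing runs as a single GPU kernel.
    lut = torch.randint(0, 256, (num_ids + 1, 4),
                        dtype=torch.uint8, device=id_mask.device)
    lut[:, 3] = alpha      # semi-transparent overlay
    lut[0] = 0             # background stays fully transparent
    return lut[id_mask.long()]

# Usage on a hypothetical 1080p prediction with up to 9 people:
device = "cuda" if torch.cuda.is_available() else "cpu"
mask = torch.randint(0, 10, (1080, 1920), device=device)
rgba = colorize_on_gpu(mask, num_ids=9)   # (1080, 1920, 4) uint8, stays on device
```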
We can render 1080p segmentation results at more than 600 FPS on an NVIDIA RTX 2080 Ti GPU. The universal image segmentation model alone detects high-resolution 1080p images at over 40 FPS, and the overall system pipeline predicts 1080p images at over 35 FPS while rendering the segmentation results on screen in real time. Our human universal image segmentation model achieves 81 person-IoU and 31 person-AP on the COCO dataset [3], while the variant that only swaps in the Multi-Fusion Model achieves 87.2 person-IoU and 42.8 person-AP.
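
As a sanity check on these numbers, if the segmentation model and the renderer run back-to-back with no overlap (our assumption), their per-frame latencies add, and the reported stage speeds imply roughly 37 FPS end to end, consistent with the figure of over 35 FPS:

```python
# Serial latency budget (our assumption: stages run back-to-back, no overlap).
model_fps = 40       # reported model speed on 1080p input
render_fps = 600     # reported GPU rendering speed at 1080p

frame_time = 1 / model_fps + 1 / render_fps   # 25 ms + 1.67 ms ~= 26.7 ms
print(f"{1 / frame_time:.1f} FPS")            # ~37.5 FPS, over the 35 FPS reported
```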

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
   2.1 Semantic Segmentation
   2.2 Real-Time Semantic Segmentation
   2.3 Instance Segmentation
   2.4 Real-Time Instance Segmentation
   2.5 Panoptic Segmentation
   2.6 Universal Image Segmentation
3 Method Overview
   3.1 System Architecture
4 Methodology
   4.1 Model Architecture
   4.2 Transformer Decoder Design
   4.3 Pixel Decoder Design (Multi-Fusion Model)
   4.4 Backbone Design
   4.5 Object Query and Segmentation Module
   4.6 GPU Data Processing
      4.6.1 Pre-Processing
      4.6.2 Post-Processing
      4.6.3 Result Rendering
5 Experimental Results
   5.1 Datasets
   5.2 Experimental Setup
   5.3 Overview of Experiment Order
   5.4 Transformer Decoder Experiments
   5.5 Pixel Decoder Experiments
   5.6 Backbone Experiments
   5.7 Other Experimental Comparisons
   5.8 Model Result Comparisons
      5.8.1 COCO Dataset [3]
      5.8.2 Other Datasets
   5.9 System Timing Comparison
6 Conclusion and Future Work
References
Authorization Letter

[1] OpenCV, “Open source computer vision library,” 2015.
[2] A. Clark, “Pillow (PIL fork) documentation,” 2015.
[3] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona,
D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in
context,” CoRR, vol. abs/1405.0312, 2014.
[4] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, “Panoptic segmentation,”
in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), June 2019.
[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in
CVPR, 2017.
[6] M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei, “Rethinking bisenet
for real-time semantic segmentation,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 9716–9725, June 2021.
[7] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature
pyramid networks for object detection,” CoRR, vol. abs/1612.03144, 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
CoRR, vol. abs/1512.03385, 2015.
[9] T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu,
“Sparse instance activation for real-time instance segmentation,” in Proc. IEEE Conf.
Computer Vision and Pattern Recognition (CVPR), 2022.
[10] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention
mask transformer for universal image segmentation,” in CVPR, 2022.
[11] M. Woo, J. Neider, T. Davis, and D. Shreiner, OpenGL programming guide: the of-
ficial guide to learning OpenGL, version 1.2. Addison-Wesley Longman Publishing
Co., Inc., 1999.
[12] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Semantic un-
derstanding of scenes through the ADE20K dataset,” CoRR, vol. abs/1608.05442,
2016.
[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke,
S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene under-
standing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016.
[14] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2.”
https://github.com/facebookresearch/detectron2, 2019.
[15] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for seman-
tic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2015.
[16] B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification is not all you
need for semantic segmentation,” in NeurIPS, 2021.
[17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object
detection with region proposal networks,” in Advances in Neural Information Pro-
cessing Systems (C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett,
eds.), vol. 28, Curran Associates, Inc., 2015.
[18] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic seg-
mentation,” arXiv preprint arXiv:1505.04366, 2015.
[19] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with
atrous separable convolution for semantic image segmentation,” in ECCV, 2018.
[20] G. Lin, F. Liu, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement
networks for dense prediction,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2019.
[21] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional
encoder-decoder architecture for image segmentation,” CoRR, vol. abs/1511.00561,
2015.
[22] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural net-
work architecture for real-time semantic segmentation,” CoRR, vol. abs/1606.02147,
2016.
[23] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmenta-
tion network for real-time semantic segmentation,” in Proceedings of the European
Conference on Computer Vision (ECCV), September 2018.
[24] J. Xu, Z. Xiong, and S. P. Bhattacharyya, “Pidnet: A real-time semantic segmentation
network inspired from pid controller,” 2022.
[25] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in Proceedings of the
IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[26] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, “SOLO: Segmenting objects by
locations,” in Proc. Eur. Conf. Computer Vision (ECCV), 2020.
[27] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “Solov2: Dynamic and fast in-
stance segmentation,” Proc. Advances in Neural Information Processing Systems
(NeurIPS), 2020.
[28] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time instance segmentation,”
in ICCV, 2019.
[29] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact++: Better real-time instance
segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
2020.
[30] K. Oksuz, B. C. Cam, F. Kahraman, Z. S. Baltaci, S. Kalkan, and E. Akbas, “Mask-
aware iou for anchor assignment in real-time instance segmentation,” in The British
Machine Vision Conference (BMVC), 2021.
[31] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko,
“End-to-end object detection with transformers,” in Computer Vision - ECCV 2020
- 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part
I (A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, eds.), vol. 12346 of Lecture Notes
in Computer Science, pp. 213–229, Springer, 2020.
[32] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Interna-
tional Conference on Learning Representations, 2019.
[33] S. Huang, Z. Lu, R. Cheng, and C. He, “FaPN: Feature-aligned pyramid network for
dense image prediction,” in International Conference on Computer Vision (ICCV),
2021.
