Discriminative Cross-Modal Attention Approach for RGB-D Semantic Segmentation
Computer and Knowledge Engineering
Volume 8, Issue 1 - Serial Number 15, July 2025, Pages 43-52. Full Text (1.7 MB)
Article Type: Image Processing-Pourreza
DOI: 10.22067/cke.2025.88682.1117
Authors
Emad Mousavian* 1; Danial Qashqai 1; Shahriar B. Shokouhi 2
1 Department of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
2 Department of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
Abstract
Scene understanding through semantic segmentation is a vital component of autonomous driving. Because safety is paramount in this setting, existing methods constantly strive to improve accuracy and reduce error. RGB-only semantic segmentation models typically underperform because they lose information under challenging conditions such as lighting variations, and because they struggle to distinguish occluded objects of similar appearance. Recent studies therefore develop RGB-D semantic segmentation methods built on attention-based fusion modules. However, existing fusion modules typically combine cross-modal features by attending to each modality independently, which limits their ability to capture the complementary nature of the two modalities. To address this issue, we propose a simple yet effective module called the Discriminative Cross-modal Attention Fusion (DCMAF) module, which performs cross-modal discrimination through element-wise subtraction in an attention-based design. By integrating the DCMAF module with efficient channel- and spatial-wise attention modules, we introduce the Discriminative Cross-modal Network (DCMNet), a scale- and appearance-invariant model. Extensive experiments demonstrate significant improvements, particularly in predicting small and fine objects: DCMNet achieves 77.39% mIoU on the CamVid dataset, outperforming state-of-the-art RGB-based methods, and a remarkable 82.8% mIoU on the Cityscapes dataset. As CamVid lacks depth information, we employ the DPT monocular depth estimation model to generate depth images.
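To make the fusion idea concrete, below is a minimal PyTorch sketch of a discriminative cross-modal fusion block in the spirit of DCMAF. The abstract only states that the module performs cross-modal discrimination via element-wise subtraction inside an attention-based design, so the block structure, layer choices, and all names (DCMAFSketch, rgb_feat, depth_feat, the reduction ratio) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Hypothetical discriminative cross-modal fusion block: the RGB-depth
    # difference map drives channel and spatial attention that re-weights
    # and fuses the two modalities.
    class DCMAFSketch(nn.Module):
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            # Channel attention computed from the cross-modal difference
            # (squeeze-and-excitation style; an assumption, not the paper's design).
            self.channel_attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # Spatial attention computed from the same difference map.
            self.spatial_attn = nn.Sequential(
                nn.Conv2d(channels, 1, kernel_size=7, padding=3),
                nn.Sigmoid(),
            )

        def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
            diff = rgb_feat - depth_feat   # discriminative signal: where the modalities disagree
            ca = self.channel_attn(diff)   # (B, C, 1, 1)
            sa = self.spatial_attn(diff)   # (B, 1, H, W)
            # Channel attention arbitrates between the two modalities; spatial
            # attention then highlights discriminative regions, with a residual
            # term keeping the fused features intact.
            fused = rgb_feat * ca + depth_feat * (1.0 - ca)
            return fused + fused * sa

    # Example on dummy encoder features:
    block = DCMAFSketch(channels=64)
    out = block(torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80))
    print(out.shape)  # torch.Size([1, 64, 60, 80])

Since CamVid ships without depth, the abstract notes that depth maps are produced with DPT. One publicly documented way to do this, not necessarily the authors' exact pipeline (the checkpoint choice and the file name below are assumptions), is the intel-isl/MiDaS torch.hub entry point:

    import cv2
    import torch

    # Load a DPT model through the public MiDaS hub (requires the timm package).
    midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
    midas.eval()
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

    # "camvid_frame.png" is a placeholder path for one CamVid frame.
    img = cv2.cvtColor(cv2.imread("camvid_frame.png"), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(img))                 # (1, H', W') relative inverse depth
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()                                  # resized back to the input resolution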
Keywords
Attention Mechanism; Autonomous Driving; Deep Learning; RGB-D Semantic Segmentation
References
X. Hu, K. Yang, L. Fei, and K. Wang. (2019, Sep.). ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. IEEE International Conference on Image Processing (ICIP). [Online]. Available: https://doi.org/10.1109/ICIP.2019.8803025
X. Chen, K. Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng. (2020, Aug.). Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. European Conference on Computer Vision. [Online]. Available: https://doi.org/10.1007/978-3-030-58621-8_33
D. Seichter, M. Köhler, B. Lewandowski, T. Wengefeld, and H. M. Gross. (2021, May). Efficient RGB-D semantic segmentation for indoor scene analysis. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13525–13531. [Online]. Available: https://doi.org/10.1109/ICRA48506.2021.9561675
C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. (2016, Nov.). FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Asian Conference on Computer Vision, pp. 213–228. [Online]. Available: https://doi.org/10.1007/978-3-319-54181-5_14
J. Jiang, L. Zheng, F. Luo, and Z. Zhang. (2018, Jun.). RedNet: Residual encoder-decoder network for indoor RGB-D semantic segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.1806.01054
Y. Zhang, Y. Yang, C. Xiong, G. Sun, and Y. Guo. (2022, Jan.). Attention-based dual supervised decoder for RGBD semantic segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.2201.01427
J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen. (2023, Dec.). CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers. IEEE Transactions on Intelligent Transportation Systems. [Online]. 24(12), pp. 14679–14694. Available: https://doi.org/10.1109/TITS.2023.3300537
Zhong, C. Guo, J. Zhan, and J. Deng. (2024, Dec.). Attention-based fusion network for RGB-D semantic segmentation. Neurocomputing. [Online]. 608, p. 128371. Available: https://doi.org/10.1016/j.neucom.2024.128371
Y. Zhang, C. Xiong, J. Liu, X. Ye, and G. Sun. (2023, Aug.). Spatial information-guided adaptive context-aware network for efficient RGB-D semantic segmentation. IEEE Sensors Journal. [Online]. 23(19), pp. 23512–23521. Available: https://doi.org/10.1109/JSEN.2023.3304637
G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. (2008). Segmentation and recognition using structure from motion point clouds. In Proceedings of the 10th European Conference on Computer Vision (ECCV), pp. 44–57. [Online]. Available: https://doi.org/10.1007/978-3-540-88682-2_5
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. [Online]. Available: http://openaccess.thecvf.com
R. Ranftl, A. Bochkovskiy, and V. Koltun. (2021). Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12179–12188. [Online]. Available: http://openaccess.thecvf.com
J. Long, E. Shelhamer, and T. Darrell. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. [Online]. Available: http://openaccess.thecvf.com
V. Badrinarayanan, A. Kendall, and R. Cipolla. (2017, Jan.). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Online]. 39(12), pp. 2481–2495. Available: https://doi.org/10.1109/TPAMI.2016.2644615
O. Ronneberger, P. Fischer, and T. Brox. (2015, Oct.). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, Proceedings, Part III, pp. 234–241. [Online]. Available: https://doi.org/10.1007/978-3-319-24574-4_28
H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890. [Online]. Available: http://openaccess.thecvf.com
L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. (2017, Apr.). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Online]. 40(4), pp. 834–848. Available: https://doi.org/10.1109/TPAMI.2017.2699184
L. C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. [Online]. Available: http://openaccess.thecvf.com
J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu. (2019). Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3146–3154. [Online]. Available: http://openaccess.thecvf.com
Z. Zhong, Z. Q. Lin, R. Bidart, X. Hu, I. B. Daya, Z. Li, W. S. Zheng, J. Li, and A. Wong. (2020). Squeeze-and-attention networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13065–13074. [Online]. Available: http://openaccess.thecvf.com
H. Li, P. Xiong, J. An, and L. Wang. (2018, May). Pyramid attention network for semantic segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.1805.10180
E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo. (2021, Dec.). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems. [Online]. 34, pp. 12077–12090. Available: https://proceedings.neurips.cc
S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6881–6890. [Online]. Available: http://openaccess.thecvf.com
J. Wang, Z. Wang, D. Tao, S. See, and G. Wang. (2016, Oct.). Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part V, pp. 664–679. [Online]. Available: https://doi.org/10.1007/978-3-319-46454-1_40
Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang. (2017). Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3029–3037. [Online]. Available: http://openaccess.thecvf.com
D. Qashqai, E. Mousavian, S. B. Shokouhi, and S. Mirzakuchaki. (2024, Jul.). CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.2407.01328
Li, Q. Zhou, D. Wu, M. Sun, and T. Hu. (2024, May). CLGFormer: Cross-Level-Guided Transformer for RGB-D Semantic Segmentation. Multimedia Tools and Applications. [Online]. pp. 1–23. Available: https://doi.org/10.1007/s11042-024-19051-9
K. He, X. Zhang, S. Ren, and J. Sun. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. [Online]. Available: http://openaccess.thecvf.com
A. Geiger, P. Lenz, and R. Urtasun. (2012, Jun.). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. [Online]. Available: https://doi.org/10.1109/CVPR.2012.6248074
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. (2009, Jun.). ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. [Online]. Available: https://doi.org/10.1109/CVPR.2009.5206848
S. Y. Lo, H. M. Hang, S. W. Chan, and J. J. Lin. (2019, Dec.). Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, pp. 1–6. [Online]. Available: https://doi.org/10.1145/3338533.3366558
M. A. Elhassan, C. Yang, C. Huang, T. L. Munea, X. Hong, A. Adam, and A. Benabid. (2022, Jun.). S2-FPN: Scale-aware Strip Attention Guided Feature Pyramid Network for Real-time Semantic Segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.2206.07298
C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. (2018). BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 325–341. [Online]. Available: http://openaccess.thecvf.com
H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. (2018). ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 405–420. [Online]. Available: http://openaccess.thecvf.com
G. Dong, Y. Yan, C. Shen, and H. Wang. (2020, Mar.). Real-time high-performance semantic image segmentation of urban street scenes. IEEE Transactions on Intelligent Transportation Systems. [Online]. 22(6), pp. 3258–3274. Available: https://doi.org/10.1109/TITS.2020.2980426
C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang. (2021, Nov.). BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision. [Online]. 129, pp. 3051–3068. Available: https://doi.org/10.1007/s11263-021-01515-2
M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei. (2021). Rethinking BiSeNet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9716–9725. [Online]. Available: http://openaccess.thecvf.com
W. Zhou, E. Yang, J. Lei, and L. Yu. (2022, May). FRNet: Feature reconstruction network for RGB-D indoor scene parsing. IEEE Journal of Selected Topics in Signal Processing. [Online]. 16(4), pp. 677–687. Available: https://doi.org/10.1109/JSTSP.2022.3174338
Peng, Y. Zheng, Y. Cheng, and Y. Qiao. (2024, Oct.). RDFormer: Efficient RGB-D Semantic Segmentation in Complex Outdoor Scenes. In Proceedings of the 2024 5th International Conference on Machine Learning and Computer Application (ICMLCA), pp. 170–175. [Online]. Available: https://doi.org/10.1109/ICMLCA63499.2024.10754213