Discriminative Cross-Modal Attention Approach for RGB-D Semantic Segmentation
Computer and Knowledge Engineering
Volume 8, Issue 1 - Serial Number 15, July 2025, Pages 43-52. Full Text (1.7 MB)
Article Type: Image Processing-Pourreza
DOI: 10.22067/cke.2025.88682.1117
Authors
Emad Mousavian* 1; Danial Qashqai 1; Shahriar B. Shokouhi 2
1 Department of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
2 Department of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran
Abstract
Scene understanding through semantic segmentation is a vital component of autonomous driving. Because safety is paramount in this setting, existing methods constantly strive to improve accuracy and reduce error. RGB-only semantic segmentation models typically underperform because they lose information under challenging conditions such as lighting variations, and because they struggle to distinguish occluded objects of similar appearance. Recent studies therefore develop RGB-D semantic segmentation methods built on attention-based fusion modules. However, existing fusion modules typically combine cross-modal features by attending to each modality independently, which limits their ability to capture the complementary nature of the two modalities. To address this issue, we propose a simple yet effective module called the Discriminative Cross-modal Attention Fusion (DCMAF) module, which performs cross-modal discrimination through element-wise subtraction in an attention-based design. By integrating the DCMAF module with efficient channel- and spatial-wise attention modules, we introduce the Discriminative Cross-modal Network (DCMNet), a scale- and appearance-invariant model. Extensive experiments demonstrate significant improvements, particularly in predicting small and fine objects: DCMNet achieves 77.39% mIoU on the CamVid dataset, outperforming state-of-the-art RGB-based methods, and a remarkable 82.8% mIoU on the Cityscapes dataset. As CamVid lacks depth information, we employ the DPT monocular depth estimation model to generate depth images.
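To make the fusion idea concrete, below is a minimal PyTorch sketch of a discriminative cross-modal fusion block in the spirit of DCMAF. The abstract only states that the module performs cross-modal discrimination via element-wise subtraction inside an attention-based design, so the block structure, layer choices, and all names (DCMAFSketch, rgb_feat, depth_feat, the reduction ratio) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Hypothetical discriminative cross-modal fusion block: the RGB-depth
    # difference map drives channel and spatial attention that re-weights
    # and fuses the two modalities.
    class DCMAFSketch(nn.Module):
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            # Channel attention computed from the cross-modal difference
            # (squeeze-and-excitation style; an assumption, not the paper's design).
            self.channel_attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # Spatial attention computed from the same difference map.
            self.spatial_attn = nn.Sequential(
                nn.Conv2d(channels, 1, kernel_size=7, padding=3),
                nn.Sigmoid(),
            )

        def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
            diff = rgb_feat - depth_feat   # discriminative signal: where the modalities disagree
            ca = self.channel_attn(diff)   # (B, C, 1, 1)
            sa = self.spatial_attn(diff)   # (B, 1, H, W)
            # Channel attention arbitrates between the two modalities; spatial
            # attention then highlights discriminative regions, with a residual
            # term keeping the fused features intact.
            fused = rgb_feat * ca + depth_feat * (1.0 - ca)
            return fused + fused * sa

    # Example on dummy encoder features:
    block = DCMAFSketch(channels=64)
    out = block(torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80))
    print(out.shape)  # torch.Size([1, 64, 60, 80])

Since CamVid ships without depth, the abstract notes that depth maps are produced with DPT. One publicly documented way to do this, not necessarily the authors' exact pipeline (the checkpoint choice and the file name below are assumptions), is the intel-isl/MiDaS torch.hub entry point:

    import cv2
    import torch

    # Load a DPT model through the public MiDaS hub (requires the timm package).
    midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
    midas.eval()
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

    # "camvid_frame.png" is a placeholder path for one CamVid frame.
    img = cv2.cvtColor(cv2.imread("camvid_frame.png"), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(img))                 # (1, H', W') relative inverse depth
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()                                  # resized back to the input resolution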
Keywords
Attention Mechanism; Autonomous Driving; Deep Learning; RGB-D Semantic Segmentation
References
X. Hu, K. Yang, L. Fei, and K. Wang. (2019, Sep.). ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. IEEE International Conference on Image Processing (ICIP). [Online]. Available: https://doi.org/10.1109/ICIP.2019.8803025
X. Chen, K. Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng. (2020, Aug.). Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. European Conference on Computer Vision. [Online]. Available: https://doi.org/10.1007/978-3-030-58621-8_33
D. Seichter, M. Köhler, B. Lewandowski, T. Wengefeld, and H. M. Gross. (2021, May). Efficient RGB-D semantic segmentation for indoor scene analysis. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13525–13531. [Online]. Available: https://doi.org/10.1109/ICRA48506.2021.9561675
C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. (2016, Nov.). FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Asian Conference on Computer Vision, pp. 213–228. [Online]. Available: https://doi.org/10.1007/978-3-319-54181-5_14
J. Jiang, L. Zheng, F. Luo, and Z. Zhang. (2018, Jun.). RedNet: Residual encoder-decoder network for indoor RGB-D semantic segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.1806.01054
Y. Zhang, Y. Yang, C. Xiong, G. Sun, and Y. Guo. (2022, Jan.). Attention-based dual supervised decoder for RGBD semantic segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.2201.01427
J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen. (2023, Dec.). CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers. IEEE Transactions on Intelligent Transportation Systems. [Online]. 24(12), pp. 14679–14694. Available: https://doi.org/10.1109/TITS.2023.3300537
Zhong, C. Guo, J. Zhan, and J. Deng. (2024, Dec.). Attention-based fusion network for RGB-D semantic segmentation. Neurocomputing. [Online]. 608, p. 128371. Available: https://doi.org/10.1016/j.neucom.2024.128371
Y. Zhang, C. Xiong, J. Liu, X. Ye, and G. Sun. (2023, Aug.). Spatial information-guided adaptive context-aware network for efficient RGB-D semantic segmentation. IEEE Sensors Journal. [Online]. 23(19), pp. 23512–23521. Available: https://doi.org/10.1109/JSEN.2023.3304637
G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. (2008). Segmentation and recognition using structure from motion point clouds. In Proceedings of the 10th European Conference on Computer Vision (ECCV), pp. 44–57. [Online]. Available: https://doi.org/10.1007/978-3-540-88682-2_5
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. [Online]. Available: http://openaccess.thecvf.com
R. Ranftl, A. Bochkovskiy, and V. Koltun. (2021). Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12179–12188. [Online]. Available: http://openaccess.thecvf.com
J. Long, E. Shelhamer, and T. Darrell. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. [Online]. Available: http://openaccess.thecvf.com
V. Badrinarayanan, A. Kendall, and R. Cipolla. (2017, Jan.). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Online]. 39(12), pp. 2481–2495. Available: https://doi.org/10.1109/TPAMI.2016.2644615
O. Ronneberger, P. Fischer, and T. Brox. (2015, Oct.). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, Proceedings, Part III, pp. 234–241. [Online]. Available: https://doi.org/10.1007/978-3-319-24574-4_28
H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890. [Online]. Available: http://openaccess.thecvf.com
L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. (2017, Apr.). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Online]. 40(4), pp. 834–848. Available: https://doi.org/10.1109/TPAMI.2017.2699184
L. C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. [Online]. Available: http://openaccess.thecvf.com
J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu. (2019). Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3146–3154. [Online]. Available: http://openaccess.thecvf.com
Z. Zhong, Z. Q. Lin, R. Bidart, X. Hu, I. B. Daya, Z. Li, W. S. Zheng, J. Li, and A. Wong. (2020). Squeeze-and-attention networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13065–13074. [Online]. Available: http://openaccess.thecvf.com
H. Li, P. Xiong, J. An, and L. Wang. (2018, May). Pyramid attention network for semantic segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.1805.10180
E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo. (2021, Dec.). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems. [Online]. 34, pp. 12077–12090. Available: https://proceedings.neurips.cc
S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6881–6890. [Online]. Available: http://openaccess.thecvf.com
J. Wang, Z. Wang, D. Tao, S. See, and G. Wang. (2016, Oct.). Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part V, pp. 664–679. [Online]. Available: https://doi.org/10.1007/978-3-319-46454-1_40
Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang. (2017). Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3029–3037. [Online]. Available: http://openaccess.thecvf.com
D. Qashqai, E. Mousavian, S. B. Shokouhi, and S. Mirzakuchaki. (2024, Jul.). CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.2407.01328
Li, Q. Zhou, D. Wu, M. Sun, and T. Hu. (2024, May). CLGFormer: Cross-Level-Guided Transformer for RGB-D Semantic Segmentation. Multimedia Tools and Applications. [Online]. pp. 1–23. Available: https://doi.org/10.1007/s11042-024-19051-9
K. He, X. Zhang, S. Ren, and J. Sun. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. [Online]. Available: http://openaccess.thecvf.com
A. Geiger, P. Lenz, and R. Urtasun. (2012, Jun.). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. [Online]. Available: https://doi.org/10.1109/CVPR.2012.6248074
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. (2009, Jun.). ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. [Online]. Available: https://doi.org/10.1109/CVPR.2009.5206848
S. Y. Lo, H. M. Hang, S. W. Chan, and J. J. Lin. (2019, Dec.). Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, pp. 1–6. [Online]. Available: https://doi.org/10.1145/3338533.3366558
M. A. Elhassan, C. Yang, C. Huang, T. L. Munea, X. Hong, A. Adam, and A. Benabid. (2022, Jun.). S2-FPN: Scale-aware Strip Attention Guided Feature Pyramid Network for Real-time Semantic Segmentation. arXiv preprint. [Online]. Available: https://doi.org/10.48550/arXiv.2206.07298
C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. (2018). BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 325–341. [Online]. Available: http://openaccess.thecvf.com
H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. (2018). ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 405–420. [Online]. Available: http://openaccess.thecvf.com
G. Dong, Y. Yan, C. Shen, and H. Wang. (2020, Mar.). Real-time high-performance semantic image segmentation of urban street scenes. IEEE Transactions on Intelligent Transportation Systems. [Online]. 22(6), pp. 3258–3274. Available: https://doi.org/10.1109/TITS.2020.2980426
C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang. (2021, Nov.). BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision. [Online]. 129, pp. 3051–3068. Available: https://doi.org/10.1007/s11263-021-01515-2
M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei. (2021). Rethinking BiSeNet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9716–9725. [Online]. Available: http://openaccess.thecvf.com
W. Zhou, E. Yang, J. Lei, and L. Yu. (2022, May). FRNet: Feature reconstruction network for RGB-D indoor scene parsing. IEEE Journal of Selected Topics in Signal Processing. [Online]. 16(4), pp. 677–687. Available: https://doi.org/10.1109/JSTSP.2022.3174338
Peng, Y. Zheng, Y. Cheng, and Y. Qiao. (2024, Oct.). RDFormer: Efficient RGB-D Semantic Segmentation in Complex Outdoor Scenes. In Proceedings of the 2024 5th International Conference on Machine Learning and Computer Application (ICMLCA), pp. 170–175. [Online]. Available: https://doi.org/10.1109/ICMLCA63499.2024.10754213