A Systematic Review of Multimodal Sentiment Analysis Based on Text-Image Fusion: Trends, Models, and Research Gaps
DOI:
10.33395/sinkron.v9i2.14840
Keywords:
Attention Mechanisms, Deep learning, Feature extraction, Fusion techniques, Sentiment classification, Transformers.
Abstract
Sentiment analysis has evolved from text-only approaches to multimodal sentiment analysis (MSA), which integrates textual and visual data to improve the accuracy of emotional understanding, especially in visually rich social media contexts. This study presents a systematic literature review (SLR) of recent developments in text-image MSA, aiming to identify prevailing methods, fusion strategies, and major research gaps. Following the PRISMA protocol, 20 key articles published between 2019 and 2024 were selected and analyzed. The results indicate that deep learning models such as LXMERT, ViLBERT, and ERNIE-ViL outperform traditional architectures, achieving accuracies above 80% on datasets such as MVSA and Twitter. Attention mechanisms and advanced feature fusion techniques contribute significantly to both accuracy and interpretability. However, challenges remain in annotation quality, semantic alignment across modalities, and real-time implementation constraints. This study contributes by mapping the state of the art in multimodal sentiment analysis, highlighting underexplored research gaps, and offering directions for future work toward more adaptive and context-aware sentiment systems.
License
Copyright (c) 2025 Mohammed Abdul Mohsen Hamidi, Alaa Yaseen Taqa, Yahya Ismail Ibrahim

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.