A Systematic Review of Multimodal Sentiment Analysis Based on Text-Image Fusion: Trends, Models, and Research Gaps

Mohammed Abdul Mohsen Hamidi; Alaa Yaseen Taqa; Yahya Ismail Ibrahim

doi:10.33395/sinkron.v9i2.14840

Authors

Mohammed Abdul Mohsen Hamidi, University of Mosul
Alaa Yaseen Taqa, University of Mosul
Yahya Ismail Ibrahim, University of Mosul

Keywords:

Attention mechanisms, Deep learning, Feature extraction, Fusion techniques, Sentiment classification, Transformers

Abstract

Sentiment analysis has evolved from text-only approaches to multimodal sentiment analysis (MSA), which integrates textual and visual data to improve the accuracy of emotional understanding, especially in visually rich social media contexts. This study presents a systematic literature review (SLR) of recent developments in text-image MSA, aiming to identify prevailing methods, fusion strategies, and major research gaps. Following the PRISMA protocol, 20 key articles published between 2019 and 2024 were selected and analyzed. The results indicate that deep learning models such as LXMERT, ViLBERT, and ERNIE-ViL outperform traditional architectures, achieving accuracies above 80% on datasets such as MVSA and Twitter. Attention mechanisms and advanced feature fusion techniques contribute significantly to improving both accuracy and interpretability. However, challenges remain in annotation quality, semantic alignment across modalities, and real-time implementation constraints. This study contributes by mapping the state of the art in multimodal sentiment analysis, highlighting underexplored research gaps, and offering directions for future work toward more adaptive and context-aware sentiment systems.
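As a minimal illustration of the fusion strategies the review surveys, the sketch below contrasts early (feature-level) and late (decision-level) fusion of a text and an image branch. It is not taken from any of the reviewed models: the function names, the three-class label set, the example probabilities, and the equal weighting are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical label set; the reviewed datasets may use different classes.
LABELS = ["negative", "neutral", "positive"]


def early_fusion(text_feats: List[float], image_feats: List[float]) -> List[float]:
    """Early fusion: concatenate modality features into one joint vector,
    which a single downstream classifier would then consume."""
    return list(text_feats) + list(image_feats)


def late_fusion(text_probs: List[float], image_probs: List[float],
                text_weight: float = 0.5) -> List[float]:
    """Late fusion: weighted average of per-modality class probabilities."""
    if len(text_probs) != len(image_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    w = text_weight
    return [w * t + (1.0 - w) * i for t, i in zip(text_probs, image_probs)]


def predict(probs: List[float]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return LABELS[best]


if __name__ == "__main__":
    # Made-up scores: the text alone looks neutral, the image looks positive.
    text_probs = [0.2, 0.5, 0.3]
    image_probs = [0.1, 0.2, 0.7]
    print(predict(text_probs))                            # text-only decision
    print(predict(late_fusion(text_probs, image_probs)))  # fused decision
    print(len(early_fusion([0.1, 0.9], [0.3, 0.4, 0.5])))  # joint feature size
```

In this toy case the image branch shifts the fused decision away from the text-only one, which is the kind of complementary signal that motivates multimodal models. Attention-based fusion, as used in many of the reviewed models, would replace the fixed `text_weight` with weights learned from the data.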
