Hint: you can click on a column header to sort the table by that column.
Multimodal Large Language Models (MLLMs), harnessing the formidable capabilities of Large Language Models (LLMs), demonstrate outstanding performance across a spectrum of multimodal tasks. Recent benchmarks, including but not limited to MME, MMBench, SEED-Bench, LVLM-eHub, MM-Vet, MMMU, and MLLM-Bench, have predominantly focused on appraising the traditional vision-language capabilities of MLLMs on tasks driven primarily by visual content (vision + text question), such as Visual Question Answering (VQA) and Video Question Answering (VideoQA).
However, little is known about how MLLMs perform on multimodal reasoning tasks with vision-text contexts (vision-text context + text question), such as Multimodal Sentiment Analysis (MSA), Multimodal Aspect-Based Sentiment Analysis (MABSA), Multimodal Hateful Memes Recognition (MHMR), Multimodal Sarcasm Recognition (MSR), Multimodal Relation Extraction (MRE), and Visual Question Answering with Text Context (VQAC). As a result, the performance of various MLLMs on reasoning tasks that jointly rely on the text and image modalities remains largely unexplored.
To address this gap, we conduct a comprehensive evaluation of 30 publicly available models, including 22 MLLMs, across 14 datasets covering 6 distinct tasks. Our primary focus is to assess how various MLLMs perform on tasks that require comprehending multimodal content, specifically text-image pairs, and to establish benchmarks across a range of MLLMs for diverse multimodal reasoning tasks with vision-text contexts. These tasks not only require conventional vision-language capabilities but also demand a deep understanding of multimodal content, whether for classification (sentiment analysis, hate speech, sarcasm, etc.) or for reasoning (visual question answering).
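To make the task format concrete, the snippet below sketches how a "vision-text context + text question" query might be assembled for a task such as MSA. It is illustrative only: the function name `build_vision_text_prompt`, the prompt wording, and the label set are our assumptions, not the exact instruction templates used in MM-InstructEval.

```python
# Illustrative only: a hypothetical prompt builder showing what a
# "vision-text context + text question" input looks like for a task such as
# Multimodal Sentiment Analysis (MSA). The actual instruction templates used
# by MM-InstructEval are not reproduced here.
def build_vision_text_prompt(text_context: str, question: str, options: list[str]) -> str:
    """Combine the textual context of an image-text pair with a task question."""
    option_str = ", ".join(options)
    return (
        f'The image is accompanied by the following text: "{text_context}"\n'
        f"{question}\n"
        f"Choose one answer from: {option_str}.\nAnswer:"
    )


prompt = build_vision_text_prompt(
    text_context="Great, another Monday morning stuck in traffic.",
    question="What is the sentiment expressed by this image-text pair?",
    options=["positive", "neutral", "negative"],
)
print(prompt)
```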
We propose MM-InstructEval, a comprehensive assessment framework that incorporates a diverse set of metrics to thoroughly evaluate various models and instructions on multimodal reasoning tasks with vision-text contexts. MM-InstructEval complements existing zero-shot evaluation studies of MLLMs and, combined with prior related work, offers a more comprehensive and holistic assessment.
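As a rough illustration of how per-instruction results can be aggregated per model, the sketch below computes a few summary statistics over a table of (model, instruction) accuracies. The helper `summarize` and the statistic names are hypothetical; they are not the exact metrics defined by MM-InstructEval.

```python
# A minimal sketch, assuming a results table of per-(model, instruction)
# accuracies on one dataset; metric names are illustrative, not the exact
# metrics used by MM-InstructEval.
from collections import defaultdict


def summarize(results: dict[tuple[str, str], float]) -> dict[str, dict[str, float]]:
    """results maps (model_name, instruction_id) -> accuracy."""
    per_model: dict[str, list[float]] = defaultdict(list)
    for (model, _instruction), acc in results.items():
        per_model[model].append(acc)
    return {
        model: {
            "best": max(accs),               # best accuracy over all instructions
            "mean": sum(accs) / len(accs),   # average performance across instructions
            "spread": max(accs) - min(accs), # sensitivity to instruction wording
        }
        for model, accs in per_model.items()
    }


summary = summarize({
    ("model-A", "instruction-1"): 0.62,
    ("model-A", "instruction-2"): 0.58,
    ("model-B", "instruction-1"): 0.67,
    ("model-B", "instruction-2"): 0.71,
})
print(summary)
```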
Notably, we support most models from HuggingFace Transformers 🤗
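For example, a Transformers-based MLLM such as BLIP-2 can be queried on a vision-text-context input roughly as follows. This is a minimal sketch, not the repository's evaluation code; the image path and prompt wording are placeholders.

```python
# Minimal sketch: querying a HuggingFace Transformers MLLM (BLIP-2 here) with a
# vision-text-context prompt. The image path and prompt are placeholders.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("path/to/image_text_pair.jpg")  # placeholder image path
prompt = (
    'The image is accompanied by the following text: "Great, another Monday morning."\n'
    "What is the sentiment expressed by this image-text pair?\n"
    "Choose one answer from: positive, neutral, negative.\nAnswer:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```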
@inproceedings{Yang2023MMBigBenchEM,
  title={MM-BigBench: Evaluating Multimodal Models on Multimodal Reasoning Tasks with Vision-Text Contexts},
  author={Xiaocui Yang and Wenfang Wu and Shi Feng and Ming Wang and Daling Wang and Yang Li and Qi Sun and Yifei Zhang and Xiaoming Fu and Soujanya Poria},
  year={2023},
  url={https://api.semanticscholar.org/CorpusID:264127863}
}