MM-InstructEval Leaderboard


Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

GitHub | Paper

Hint: you can click on a column header to sort the table by that column.



Why?

Multimodal Large Language Models (MLLMs), harnessing the formidable capabilities of Large Language Models (LLMs), demonstrate outstanding performance across a spectrum of multimodal tasks. Recent benchmarks, including but not limited to MME, MMBench, SEED-Bench, LVLM-eHub, MM-Vet, MMMU, and MLLM-Bench, have predominantly focused on appraising the traditional vision-language capabilities of MLLMs on tasks driven primarily by visual content (vision + text question), such as Visual Question Answering (VQA) and Video Question Answering (VideoQA).

However, there is limited understanding of how MLLMs perform on multimodal reasoning tasks with vision-text contexts (vision-text context + text question), such as Multimodal Sentiment Analysis (MSA), Multimodal Aspect-Based Sentiment Analysis (MABSA), Multimodal Hateful Memes Recognition (MHMR), Multimodal Sarcasm Recognition (MSR), Multimodal Relation Extraction (MRE), and Visual Question Answering with Vision-Text Context (VQAMC). The performance of various MLLMs on such tasks, which rely jointly on the text and image modalities, remains largely unexplored.

To address this gap, we conduct a comprehensive evaluation of 30 publicly available models, including 22 MLLMs, across a diverse set of 14 datasets covering 6 distinct tasks. Our primary focus is to assess how various MLLMs perform on tasks that require comprehension of multimodal content, specifically text-image pairs, and to establish benchmarks across a range of MLLMs for diverse multimodal reasoning tasks with vision-text contexts. These tasks not only require conventional vision-language capabilities but also demand a deep understanding of multimodal content for classification (sentiment analysis, hate speech, sarcasm, etc.) or reasoning (visual question answering).

We propose MM-InstructEval, a comprehensive assessment framework that incorporates a diverse set of metrics to thoroughly evaluate various models and instructions on multimodal reasoning tasks with vision-text contexts. MM-InstructEval complements existing zero-shot evaluation studies of MLLMs, offering a more holistic assessment when combined with prior related work.

Notably, we support most models from HuggingFace Transformers 🤗
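
As a quick illustration, here is a minimal sketch of running a Transformers-hosted MLLM zero-shot on an image-text pair. The BLIP-2 checkpoint and the image URL are example choices for illustration only, not the framework's required setup.

    # Minimal sketch: zero-shot inference with a Transformers-hosted MLLM.
    # The BLIP-2 checkpoint and image URL below are examples only.
    import requests
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

    image = Image.open(requests.get("https://example.com/post.jpg", stream=True).raw)
    prompt = "Question: What sentiment does this post convey? Answer:"

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())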

Notes:


The notation "-" signifies that we do not assess LLMs on the PuzzleVQA and MMMU datasets, since these two datasets inherently require both the image information and the textual context to produce valid responses. The "Total" results of LLMs are therefore not calculated. "Total^{★}" represents the aggregate results across all datasets except ScienceQA, PuzzleVQA, and MMMU.

Diverse Tasks

Visual Question Answering with Vision-Text Context (VQAMC)

ScienceQA; MMMU; AlgoPuzzleVQA.

Multimodal Sentiment Analysis (MSA)

MVSA-Single; MVSA-Multiple; MVSA-TumEmo; MOSI-2; MOSI-7; MOSEI-2; MOSEI-7.

Multimodal Aspect-Based Sentiment Analysis (MABSA)

Twitter-2015; Twitter-2017; MASAD.

Multimodal Hateful Memes Recognition (MHMR)

Hate

Multimodal Sarcasm Recognition (MSR)

Sarcasm

Multimodal Relation Extraction (MRE)

MNRE

Comprehensive Metrics (Details can be found in our paper and code.)

Best Performance

Considering the performance variations across different instructions, we report the best performance achieved by each model among all instructions on each dataset. This metric highlights the upper bound of performance for each model. The "Best Performance on Full Test Datasets" page displays the best performance on the full test datasets, except for the GPT-4V model.
Notes: Since GPT-4V is expensive and has limited usage, we only evaluate a subset of each test dataset, randomly sampled to cover 10% of the test data. The related results are listed in the "Best Performance on Test Subsets" page.
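
Concretely, a minimal sketch of this metric is shown below, assuming results are stored as accuracy scores keyed by model, dataset, and instruction (this data layout is an assumption for illustration, not the repository's actual format).

    # Sketch: best performance = maximum accuracy over instructions,
    # reported per (model, dataset) pair.
    # results[model][dataset][instruction] -> accuracy (assumed layout).
    from typing import Dict

    def best_performance(results: Dict[str, Dict[str, Dict[str, float]]]) -> Dict[str, Dict[str, float]]:
        best = {}
        for model, per_dataset in results.items():
            best[model] = {
                dataset: max(per_instruction.values())
                for dataset, per_instruction in per_dataset.items()
            }
        return best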

Mean Relative Gain of Models

Given the diversity of models and instructions, accuracy on each dataset varies substantially depending on the model and the instruction. We therefore use aggregated metrics to evaluate overall performance across models.

Mean Relative Gain of Instructions

The mean relative gain of instructions is used to meaningfully compare and summarize the performance of different instructions across all models.
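
One way to compute both mean relative gains is sketched below, assuming acc[m, i, d] stores the accuracy of model m with instruction i on dataset d; the exact normalization in the paper may differ, so this is an illustrative formulation only: each score is compared against the mean over models (or over instructions), and the resulting relative gains are averaged over the remaining axes.

    # Sketch of mean relative gain. Assumes acc[m, i, d] is the accuracy of
    # model m with instruction i on dataset d; the paper's exact normalization
    # may differ, so treat this as an illustration of the idea.
    import numpy as np

    def mean_relative_gain_of_models(acc: np.ndarray) -> np.ndarray:
        # Relative gain of each model w.r.t. the mean over models,
        # computed per (instruction, dataset) cell, then averaged.
        mean_over_models = acc.mean(axis=0, keepdims=True)           # (1, I, D)
        relative_gain = (acc - mean_over_models) / mean_over_models  # (M, I, D)
        return 100.0 * relative_gain.mean(axis=(1, 2))               # (M,)

    def mean_relative_gain_of_instructions(acc: np.ndarray) -> np.ndarray:
        # Same computation with the roles of models and instructions swapped.
        return mean_relative_gain_of_models(acc.transpose(1, 0, 2))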

Adaptability

Different instructions have a significant impact on model performance. To quantify the adaptability between models and instructions, we propose the Global Top-K Hit Ratio (GHR@K), a metric that evaluates the performance of each instruction across models. It measures the proportion of times each instruction achieves top-K performance on a specific model across all datasets.
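
A sketch of one possible implementation of GHR@K is given below, again assuming acc[m, i, d] stores accuracies: for every (model, dataset) pair the instructions are ranked, and GHR@K for an instruction is the fraction of pairs in which it lands in the top K (tie-breaking here follows argsort order, which is an implementation choice rather than the paper's specification).

    # Sketch of the Global Top-K Hit Ratio (GHR@K). Assumes acc[m, i, d] holds
    # the accuracy of model m with instruction i on dataset d.
    import numpy as np

    def ghr_at_k(acc: np.ndarray, k: int) -> np.ndarray:
        num_models, num_instructions, num_datasets = acc.shape
        hits = np.zeros(num_instructions)
        for m in range(num_models):
            for d in range(num_datasets):
                top_k = np.argsort(acc[m, :, d])[::-1][:k]  # best-k instructions
                hits[top_k] += 1
        # Fraction of (model, dataset) pairs where each instruction is in the top K.
        return hits / (num_models * num_datasets)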
           
@inproceedings{Yang2023MMBigBenchEM,
    title={MM-BigBench: Evaluating Multimodal Models on Multimodal Reasoning Tasks with Vision-Text Contexts},
    author={Xiaocui Yang and Wenfang Wu and Shi Feng and Ming Wang and Daling Wang and Yang Li and Qi Sun and Yifei Zhang and Xiaoming Fu and Soujanya Poria},
    year={2023},
    url={https://api.semanticscholar.org/CorpusID:264127863}
}