InstructEval Leaderboard


An Evaluation Suite for Instructed Language Models




Why?

Instruction-tuned models such as Flan-T5 and Alpaca represent an exciting direction for approximating the performance of large language models (LLMs) like ChatGPT at lower cost. However, since the internal workings of many proprietary LLMs such as GPT-4 are still unknown, the research community has yet to reach a holistic understanding of instructed LLMs.

To reduce this knowledge gap, we need to understand how diverse factors contribute to their behavior and performance, such as pretraining, instruction data, and training methods. To address these challenges, we create InstructEval, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Our evaluation involves a rigorous assessment of models based on problem-solving ability, writing ability, and alignment to human values.

Compared to existing libraries such as lm-evaluation-harness and HELM, our GitHub repository enables simple and convenient evaluation of multiple models. Notably, we support most models from HuggingFace Transformers 🤗
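
To give a concrete sense of what this looks like, here is a minimal sketch of loading a HuggingFace seq2seq model and prompting it with a multiple-choice question. The model name and prompt template are illustrative examples, not the exact setup in our repository.

    # Minimal sketch: prompting a HuggingFace seq2seq model with a multiple-choice question.
    # The model name and prompt template are illustrative, not the exact setup in this repository.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "google/flan-t5-base"  # any seq2seq model on the Hub works similarly
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    prompt = (
        "Question: Which planet is known as the Red Planet?\n"
        "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=4)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True).strip())  # e.g. "B"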

    @article{chia2023instructeval,
      title   = {INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models},
      author  = {Yew Ken Chia and Pengfei Hong and Lidong Bing and Soujanya Poria},
      journal = {arXiv preprint arXiv:2306.04757},
      year    = {2023}
    }

Details

Problem Solving benchmark datasets:
The MMLU benchmark is designed to measure world knowledge and problem-solving ability across multiple subjects (a prompt-formatting and scoring sketch follows this list).
BIG-Bench Hard (BBH) is a subset of 23 challenging tasks from the BIG-Bench benchmark, which focuses on tasks believed to be beyond the capabilities of current language models. It requires models to follow challenging instructions such as navigation, logical deduction, and fallacy detection.
Discrete Reasoning Over Paragraphs (DROP) is a math-based reading comprehension task that requires a system to perform discrete reasoning over passages extracted from Wikipedia articles.
HumanEval is a problem-solving benchmark used for evaluating large language models trained on code.
The Counterfactual Reasoning Assessment (CRASS) benchmark is a novel dataset and evaluation tool designed to test the causal reasoning capabilities of large language models.
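
Most of these datasets are framed as few-shot multiple-choice or short-answer tasks. The sketch below illustrates one plausible way to build a k-shot multiple-choice prompt and compute accuracy; the field names and template are assumptions for illustration, since each benchmark defines its own official format.

    # Illustrative k-shot prompt construction and accuracy scoring for a multiple-choice
    # task such as MMLU. Field names and the template are assumptions for illustration;
    # the official benchmarks define their own prompt formats.
    from typing import Dict, List

    CHOICES = ["A", "B", "C", "D"]

    def format_example(example: Dict, include_answer: bool = True) -> str:
        lines = [example["question"]]
        for letter, option in zip(CHOICES, example["options"]):
            lines.append(f"{letter}. {option}")
        lines.append("Answer:" + (f" {example['answer']}" if include_answer else ""))
        return "\n".join(lines)

    def build_prompt(dev_examples: List[Dict], test_example: Dict) -> str:
        # Few-shot examples (with answers) followed by the unanswered test question.
        shots = "\n\n".join(format_example(ex) for ex in dev_examples)
        return shots + "\n\n" + format_example(test_example, include_answer=False)

    def accuracy(predictions: List[str], golds: List[str]) -> float:
        # Count a prediction as correct if it starts with the gold option letter.
        correct = sum(p.strip().upper().startswith(g) for p, g in zip(predictions, golds))
        return correct / len(golds)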


Writing Evaluation:
We evaluate general writing ability across diverse usage scenarios covering informative, professional, argumentative, and creative writing. The dataset is published as IMPACT. We use ChatGPT to judge the quality of the generated answers, and each answer is scored on a Likert scale from 1 to 5.
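
As a rough illustration of this judging setup, the sketch below sends a Likert-scale rubric to an OpenAI chat model and parses an integer score; the rubric wording and model name are placeholders rather than the exact prompt used for IMPACT.

    # Sketch of Likert-scale judging with an OpenAI chat model (openai>=1.0 SDK).
    # The rubric wording and model name are placeholders, not the exact IMPACT prompt.
    import re
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def judge_answer(scenario: str, answer: str, model: str = "gpt-3.5-turbo") -> int:
        rubric = (
            "Rate the following response on a Likert scale from 1 (poor) to 5 (excellent), "
            "considering relevance, coherence, and overall writing quality. "
            "Reply with a single integer.\n\n"
            f"Writing prompt: {scenario}\n\nResponse: {answer}\n\nScore:"
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": rubric}],
            temperature=0,
        )
        match = re.search(r"[1-5]", reply.choices[0].message.content)
        return int(match.group()) if match else 0  # 0 marks an unparseable judgement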


Alignment to Human Values: We assess alignment to human values using the Helpful, Honest, and Harmless (HHH) benchmark.
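
Pairwise preference tasks like HHH are often scored by checking whether the model assigns a higher likelihood to the better-aligned response. The sketch below illustrates that idea with a causal HuggingFace model; it is an assumption-laden illustration, not our exact evaluation code.

    # Illustrative log-likelihood comparison for a pairwise preference item, in the spirit
    # of HHH-style evaluation. The model name is a placeholder; the scoring details here
    # are an assumption, not necessarily the exact procedure in this repository.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def response_log_likelihood(query: str, response: str) -> float:
        # Sum of token log-probabilities of the response, conditioned on the query.
        # Assumes the query/response boundary survives tokenization of the concatenation.
        query_len = tokenizer(query, return_tensors="pt").input_ids.size(1)
        full_ids = tokenizer(query + response, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = full_ids[0, 1:]
        token_scores = log_probs[torch.arange(targets.size(0)), targets]
        return token_scores[query_len - 1:].sum().item()  # keep only response tokens

    def prefers_better(query: str, better: str, worse: str) -> bool:
        return response_log_likelihood(query, better) > response_log_likelihood(query, worse)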