Evaluation of Large Language Models


The landscape of large language model evaluation is rich and sometimes overwhelming: dozens of leaderboards, hundreds of benchmarks, and no single agreed-upon standard for what “good” means. This post maps out the major categories of evaluation, the key benchmarks within each, and how to actually run them yourself.

NOTE: Stanford CS224U is an excellent resource for building a deeper theoretical foundation in LLM evaluation beyond what is covered here.

LLM Leaderboards

A useful starting point is to see which benchmarks are prominent enough to appear on major leaderboards. The following are the most widely referenced:

  1. Hugging Face Open LLM Leaderboard — automated scoring of open models on a fixed benchmark suite
  2. LMSYS Chatbot Arena — human preference-based ranking
  3. Can AI Code — coding-specific evaluation

For a comprehensive academic treatment, the survey paper Evaluating Large Language Models: A Comprehensive Survey covers the full taxonomy in depth.

Classification of LLM Evaluation

LLM evaluation broadly splits into two axes: what is being evaluated (knowledge, reasoning, alignment, safety, etc.) and how it is evaluated (automatic scoring, human judgment, LLM-as-judge).

Figure 1: The classification of LLM evaluation.

The knowledge and capability progression across model generations:

Figure 2: The progress of LLM knowledge capability evaluation.

Commonsense Reasoning Datasets

| Dataset | Domain | Size | Source | Task |
| --- | --- | --- | --- | --- |
| ARC | Science | 7,787 | Various | Multiple-choice QA |
| QASC | Science | 9,980 | Human-authored | Multiple-choice QA |
| MCTACO | Temporal | 1,893 | MultiRC | Multiple-choice QA |
| HellaSwag | Event | 20K | ActivityNet, WikiHow | Multiple-choice QA |
| PIQA | Physical | 21K | Human-authored | 2-choice QA |
| Social IQA | Social | 38K | Human-authored | Multiple-choice QA |
| CommonsenseQA | Generic | 12,247 | ConceptNet | Multiple-choice QA |
| OpenBookQA | Generic | 6K | WorldTree | Multiple-choice QA |

Multi-Hop Reasoning Datasets

| Dataset | Domain | Size | Hops | Task |
| --- | --- | --- | --- | --- |
| HotpotQA | Generic | 112,779 | 1–3 | Span extraction |
| HybridQA | Generic | 69,611 | 2–3 | Span extraction |
| MultiRC | Generic | 9,872 | ~2.4 | Multiple-choice |
| NarrativeQA | Fiction | 46,765 | – | Generative |
| MedHop | Medical | 2,508 | – | Multiple-choice |
| WikiHop | Generic | 51,318 | – | Multiple-choice |

(–: hop count not reported.)

Benchmarks

Once you have a dataset, you need a benchmark — a standardized procedure that produces a comparable score. The following are the most commonly used:

| Benchmark | Subjects | Language | Questions | Access |
| --- | --- | --- | --- | --- |
| MMLU | 57 | English | 15,908 | Local |
| MMCU | 51 | Chinese | 11,900 | Local |
| C-Eval | 52 | Chinese | 13,948 | Online |
| AGIEval | 20 | English + Chinese | 8,062 | Local |
| M3KE | 71 | Chinese | 20,477 | Local |
| CMMLU | 67 | Chinese | 11,528 | Local |

For holistic evaluation across many tasks simultaneously:

| Benchmark | Language | Evaluation Method | Leaderboard |
| --- | --- | --- | --- |
| HELM | English | Automatic | Yes |
| BIG-bench | English + others | Automatic | Yes |
| OpenCompass | English + others | Automatic + LLM-judge | Yes |
| Chatbot Arena | English + others | Human preference | Yes |
| FlagEval | English + others | Automatic + Manual | No |

How to Calculate a Benchmark

Understanding the mechanics of benchmark calculation matters — especially when you need to trust the numbers or reproduce them.

A Concrete Example: TruthfulQA

The TruthfulQA dataset is designed to evaluate whether a model generates factually accurate answers rather than plausible-sounding ones. It is one of the six benchmarks that appear on the Hugging Face Open LLM Leaderboard.
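
To make the scoring concrete: TruthfulQA’s multiple-choice variants are computed from the log-likelihood the model assigns to each answer choice. The sketch below implements an MC2-style score (the normalized probability mass placed on the reference-true answers); the function name and the toy log-likelihoods are illustrative, not the official TruthfulQA implementation.

```python
import math

def truthfulqa_mc2(loglikelihoods: list[float], is_true: list[bool]) -> float:
    """MC2-style score: normalized probability mass on the true answers.

    loglikelihoods[i] is the model's log-likelihood of answer choice i
    given the question; is_true[i] marks the reference-true choices.
    Illustrative reimplementation, not the official TruthfulQA code.
    """
    probs = [math.exp(ll) for ll in loglikelihoods]
    return sum(p for p, t in zip(probs, is_true) if t) / sum(probs)

# Toy question with two true and two false answer choices.
score = truthfulqa_mc2(
    loglikelihoods=[-2.1, -4.0, -1.5, -3.2],
    is_true=[True, False, True, False],
)
print(f"MC2 for this question: {score:.3f}")  # the benchmark averages over all questions
```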

The Evaluation Harness

Rather than implementing evaluation from scratch for each benchmark, the standard tool is lm-evaluation-harness by EleutherAI. It provides a unified interface for running a model against hundreds of benchmarks with consistent prompting and scoring.
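
A minimal sketch of driving the harness from Python, assuming a recent v0.4.x release where `simple_evaluate` is the documented entry point (the API has changed across releases, so check the README for your installed version):

```python
# pip install lm-eval   (EleutherAI's lm-evaluation-harness)
import lm_eval

# Evaluate a small Hugging Face model zero-shot on TruthfulQA (MC2).
# The model and task names here are illustrative; list the tasks shipped
# with your harness version via `lm_eval --tasks list` on the command line.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["truthfulqa_mc2"],
    num_fewshot=0,
)
print(results["results"]["truthfulqa_mc2"])
```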

Zero-Shot vs. Few-Shot Evaluation

The evaluation protocol — zero-shot or few-shot — significantly affects scores and must be reported alongside results:

  • Few-shot: The model is primed with k examples before the test question. A prompt suffix like "thus, the choice is:" can steer the model toward a structured answer format (e.g., A, B, C); the sketch after this list shows how such a prompt is assembled.
  • Zero-shot: No examples are provided. The model must interpret the task from the question alone. Multiple prompts are sometimes needed: one to elicit a free-form response, another to coerce it into a parseable answer.
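
To make the difference concrete, here is a sketch of how a k-shot multiple-choice prompt is typically assembled. The template and the field names ('question', 'choices', 'answer') are assumptions for illustration; real harnesses use per-task templates.

```python
def build_prompt(shots: list[dict], question: dict, k: int = 0) -> str:
    """Assemble a k-shot multiple-choice prompt (k=0 yields zero-shot).

    Each dict is assumed to carry 'question' and 'choices' (list of str);
    the in-context shots also carry 'answer' (a letter). Illustrative only.
    """
    def render(item: dict) -> str:
        choices = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"])
        )
        return f"Question: {item['question']}\n{choices}\nThus, the choice is:"

    blocks = [render(ex) + f" {ex['answer']}" for ex in shots[:k]]
    blocks.append(render(question))  # the model completes the final suffix
    return "\n\n".join(blocks)
```

With k=0 the same template degenerates to the zero-shot case, which is why a separate parsing step is often still needed to extract the answer letter from the completion.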

Dataset Splits and Leakage

Pay attention to train/val/test splits. For benchmarks like HellaSwag, models should be fine-tuned only on train and val, with the test set held out strictly for evaluation. Contamination — where test data appears in the pretraining corpus — is a known problem with many public benchmarks and is an active area of research.
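
There is no standard contamination test, but a common first pass is n-gram overlap between test items and the pretraining corpus. Below is a simplified sketch; the 13-gram window is an assumption borrowed from the GPT-3 paper's style of overlap analysis, and real audits normalize text more carefully.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of the given text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

def is_contaminated(test_example: str, corpus_ngrams: set[tuple[str, ...]],
                    n: int = 13) -> bool:
    """Flag a test example whose n-grams intersect the pretraining corpus.

    Simplified sketch: production audits report per-example overlap
    statistics rather than a single boolean.
    """
    return not ngrams(test_example, n).isdisjoint(corpus_ngrams)
```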

Takeaways

Benchmarks are necessary but imperfect. A few principles worth keeping in mind:

  • No single benchmark is sufficient. Use a diverse set covering different capabilities and domains.
  • Report the evaluation protocol. Zero-shot vs. few-shot, prompt format, and framework version all affect results.
  • Be skeptical of leaderboard rankings. Many benchmarks have been saturated or are potentially contaminated in large pretraining corpora.
  • Human evaluation remains the gold standard for open-ended generation quality — Chatbot Arena’s Elo-based approach is the closest thing to a reliable real-world signal (the core Elo update is sketched below).
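
For reference, the rating update behind an Elo-style leaderboard fits in a few lines. This is the classic online Elo rule; Chatbot Arena has since moved to fitting ratings with a Bradley-Terry-style model, so treat this as background rather than their exact method.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One online Elo update after a pairwise model comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    K controls how far a single comparison moves the ratings.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models: a win moves each rating by K/2 = 16 points.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```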