📐 Benchmarkthing

Evals as an API

Run out-of-the-box evals and benchmarks in the cloud. Save weeks of setup and development by running evals on our platform.

from benchthing import Bench

# Connect to the hosted WebArena benchmark
bench = Bench("webarena")

# Launch task "1" against your own agents
bench.run(
    benchmark="webarena",
    task_id="1",
    agents=your_agents  # your agent implementations
)

# Fetch the result for task "1"
result = bench.get_result("1")
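
If runs execute asynchronously, you can poll for results with the same client. A minimal sketch, continuing from the snippet above and assuming the object returned by get_result exposes a status field (both the asynchronous behavior and the field name are assumptions, not confirmed API):

import time

from benchthing import Bench

bench = Bench("webarena")

# Assumption: get_result returns an object whose `status` field flips
# from "running" to a terminal value once the task above finishes.
result = bench.get_result("1")
while getattr(result, "status", "completed") == "running":
    time.sleep(10)  # wait before polling again
    result = bench.get_result("1")

print(result)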

Built or used by industry leaders from

  • Tesla
  • Trellis
  • Dolby
  • RemNote
  • Red Hat
  • Princeton

Use Cases

Largest library of benchmarks

Run comprehensive evaluations against the largest available library of benchmarks.

Extend existing benchmarks

Easily extend and customize existing benchmarks to fit your specific needs.

Create your own evals

Design and implement your own system evaluations with flexibility and ease.
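
To make the idea concrete, here is a plain-Python sketch of what a minimal custom eval amounts to. It uses no Benchmarkthing-specific API; the task format, the agent callable, and the exact-match scoring rule are all illustrative assumptions:

from typing import Callable

# Illustrative only, not the Benchmarkthing API: a custom eval is just
# a task set plus a scoring rule applied to an agent's outputs.
TASKS = [
    {"id": "1", "prompt": "What is the capital of France?", "expected": "Paris"},
    {"id": "2", "prompt": "What is 2 + 2?", "expected": "4"},
]

def run_custom_eval(agent: Callable[[str], str]) -> float:
    """Return the fraction of tasks the agent answers exactly right."""
    correct = sum(
        agent(task["prompt"]).strip() == task["expected"] for task in TASKS
    )
    return correct / len(TASKS)

# A trivial baseline agent scores 0.0 on these tasks.
print(run_custom_eval(lambda prompt: "I don't know"))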

What Our Users Say

"If Benchmarkthing existed before, it would have saved me weeks of setting up miscellaneous sub-tasks in VLMs. I'm excited about using it to benchmark other Computer Vision tasks."

Tianpei Gu

Research Scientist at TikTok

"With Benchmarkthing's endpoint, I was able to focus on developing web agents instead of setting up configs and environments for the task execution."

Yitao Liu

NLP Researcher at Princeton

"Using Benchmarkthing is like having Codecov but for our Retrieval-Augmented Generation (RAG) workflows. It makes them a lot more reliable."

Gus Ye

Senior AI Engineer, Founder of Memobase.io

Popular Benchmarks

WebArena

Carnegie Mellon University. A realistic web environment for developing autonomous agents; a GPT-4 agent achieves a 14.41% success rate vs. 78.24% human performance.

MLE-bench

OpenAI. A benchmark for measuring how well AI agents perform at machine learning engineering.

SWE-bench

Princeton NLP. A benchmark for evaluating models on real-world software engineering tasks drawn from GitHub issues.

SWE-bench Multimodal

Princeton NLP. A benchmark for evaluating AI systems on visual software engineering tasks with JavaScript.

AgentBench

Tsinghua University. A comprehensive benchmark to evaluate LLMs as agents (ICLR'24).

Tau (𝜏)-Bench

Sierra AI. A benchmark for evaluating AI agents' performance in real-world settings with dynamic interaction.

BIRD-SQL

A pioneering, cross-domain dataset for evaluating text-to-SQL models on large-scale databases.

LegalBench

Hazy Research at Stanford. A collaboratively built benchmark for measuring legal reasoning in large language models.

STS (Semantic Textual Similarity)

A benchmark for evaluating semantic equivalence between text snippets.

MS MARCO

Microsoft. A large-scale dataset for benchmarking information retrieval systems.

Stanford HELM

Stanford CRFM. A comprehensive framework for evaluating language models across various scenarios.

API-bank

Tsinghua University. A benchmark for evaluating tool-augmented LLMs with 73 API tools and 314 dialogues.

ARC (AI2 Reasoning Challenge)

Allen Institute for Artificial Intelligence. A dataset of 7,787 grade-school level, multiple-choice science questions for advanced QA research.

HellaSwag

Allen Institute for Artificial Intelligence. A benchmark for testing physical situation reasoning with harder endings and longer contexts.

HumanEval

OpenAI. Evaluates the functional correctness of code generated from docstrings.

MMLU (Massive Multitask Language Understanding)

UC Berkeley. A comprehensive benchmark evaluating models on multiple-choice questions across 57 subjects.

SuperGLUE

ML for Language Group at NYU CILVR. A rigorous benchmark for language understanding with eight challenging tasks.

TruthfulQA

University of Oxford. A benchmark for evaluating model truthfulness with 817 questions across 38 categories.

EQ Bench

A benchmark for evaluating emotional intelligence in Large Language Models.

CyberSecEval

Meta Llama. A benchmark for evaluating cybersecurity risks and capabilities in Large Language Models.

Spec-Bench

The Hong Kong Polytechnic University, Peking University, Microsoft Research Asia, and Alibaba Group. A comprehensive benchmark for evaluating speculative decoding methods across diverse scenarios.

MobileAIBench

Salesforce AI Research. A comprehensive benchmark for evaluating the performance and resource consumption of LLMs and LMMs on mobile devices.

MEGA-Bench

Tiger AI Lab. A comprehensive multimodal evaluation suite with 505 real-world tasks and diverse output formats.

Alexa Arena

Amazon Science. A user-centric interactive platform for embodied AI and robotic task completion in simulated environments.

BIG-bench

Google Research. A diverse set of tasks designed to measure AI's general capabilities across reasoning, common sense, creativity, and many other areas.

CodeXGLUE

Microsoft Research. A benchmark for evaluating models on various programming and software engineering tasks such as code completion, bug detection, and code translation.

BEIR

UKP Lab, TU Darmstadt. A heterogeneous benchmark suite for evaluating the performance of retrieval models across various datasets and domains.

MMOCR

OpenMMLab. A benchmark for optical character recognition tasks, evaluating models on text detection, text recognition, and end-to-end text spotting.

HotpotQA

A dataset of 113K Wikipedia-based questions requiring multi-hop reasoning and supporting fact identification.

TriviaQA

Allen Institute for Artificial Intelligence. A large-scale QA dataset with 950K question-answer pairs from 662K Wikipedia and web documents.

Berkeley Function Calling Leaderboard (BFCL)

UC Berkeley. A comprehensive benchmark for evaluating function/tool calling capabilities of language models across single-turn, multi-turn, and multi-step scenarios.

NexusBench

Nexusflow. A comprehensive benchmark suite for evaluating function calling, tool use, and agent capabilities of language models.

HaluBench

Patronus AI. A comprehensive hallucination evaluation benchmark with 15K samples from real-world domains including finance and medicine.

RouterBench

Martian. A comprehensive benchmark for evaluating multi-LLM routing systems with 405k+ inference outcomes.

StableToolBench

Tsinghua University. A stable benchmark for evaluating LLMs' tool-learning capabilities with a virtual API system and solvable queries.

TaskBench

Microsoft. A comprehensive framework for evaluating LLMs in task automation across decomposition, tool selection, and parameter prediction.

MMGenBench

Alibaba Group & Beihang University. A benchmark evaluating LMMs' image understanding through text-to-image generation.

InfiniteBench

Tsinghua University. A benchmark for evaluating LLMs' ability to process and reason over super long contexts (100K+ tokens).

BABILong

AIRI & DeepPavlov.ai. A benchmark for evaluating LLMs' ability to process and reason over super long contexts using a needle-in-a-haystack approach.

LOFT

Google DeepMind. A comprehensive benchmark for evaluating long-context language models across retrieval, RAG, SQL, and more.

HELMET

Princeton NLP. A comprehensive benchmark for evaluating long-context language models across seven diverse categories.

Missing something?

Help us expand our benchmark collection by suggesting new benchmarks to add.