Benchmark execution without hassle
Run your AI evals or benchmarks in the cloud.
Save weeks of development with just a few lines of code.
import bench from 'benchthing'
const data = bench.get('webarena')
const models = bench.getModels({config})
bench.run({ benchmark: 'webarena', taskId: '1' })
const result = bench.result({ taskId: '1' })
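For an end-to-end run, the snippet below is a minimal sketch of how the same calls might be wired together. It assumes the client methods return promises and that bench.result() can be polled until the run finishes; the status field and the five-second polling interval are illustrative assumptions, not documented behavior.

import bench from 'benchthing'

// Minimal sketch: launch one WebArena task and poll for its result.
// Assumes promise-returning methods and a 'status' field on the result (both assumptions).
async function runWebArenaTask(taskId) {
  await bench.run({ benchmark: 'webarena', taskId })

  let result = await bench.result({ taskId })
  while (result && result.status === 'running') {
    // Wait five seconds between polls until the hosted run finishes (assumed flow)
    await new Promise((resolve) => setTimeout(resolve, 5000))
    result = await bench.result({ taskId })
  }
  return result
}

runWebArenaTask('1').then((result) => console.log(result))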
Use Cases
Largest library of benchmarks
Utilize the largest library of benchmarks for comprehensive evaluations.
Extend existing benchmarks
Easily extend and customize existing benchmarks to fit your specific needs.
Create your own evals
Design and implement your own system evaluations with flexibility and ease.
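As a rough illustration of what a custom eval can look like, here is a minimal, generic sketch in plain JavaScript. It is not tied to the benchthing API: the task list, the askModel() stub, and the exact-match scoring are hypothetical placeholders for whatever system you want to evaluate.

// Hypothetical custom eval: score a system's answers against expected outputs.
const tasks = [
  { id: '1', prompt: 'What is 2 + 2?', expected: '4' },
  { id: '2', prompt: 'Capital of France?', expected: 'Paris' },
]

// Placeholder for the system under test (an assumption, not a real API call)
async function askModel(prompt) {
  return 'stubbed answer'
}

async function runEval() {
  let correct = 0
  for (const task of tasks) {
    const answer = await askModel(task.prompt)
    // Exact-match scoring; swap in whatever metric fits your eval
    if (answer.trim() === task.expected) correct++
  }
  console.log(`Accuracy: ${correct}/${tasks.length}`)
}

runEval()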
What Our Users Say
"If Benchmarkthing existed before, it would have saved me weeks of setting up miscellaneous sub-tasks in VLMs. I'm excited about using it to benchmark other Computer Vision tasks."
Tianpei Gu
Research Scientist at TikTok
"With Benchmarkthing's endpoint, I was able to focus on developing web agents instead of setting up configs and environments for the task execution."
Yitao Liu
NLP Researcher at Princeton
"Using Benchmarkthing is like having Codecov but for our Retrieval-Augmented Generation (RAG) workflows. It makes them a lot more reliable."
Gus Ye
Senior AI Engineer, Founder of Memobase.io
Explore Benchmarks
WebArena
A realistic web environment for developing autonomous agents. The best GPT-4 agent achieves a 14.41% task success rate, versus 78.24% for humans.
τ-Bench (Tau-Bench)
A benchmark for evaluating AI agents that interact dynamically with simulated users and tools in real-world domains.
BIRD-SQL
A pioneering, cross-domain dataset for evaluating text-to-SQL models on large-scale databases.
STS (Semantic Textual Similarity)
A benchmark for scoring the degree of semantic similarity between pairs of text snippets.
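Any of these benchmarks can, in principle, be driven through the same client shown above. The sketch below tallies a success rate over a handful of WebArena tasks; the task IDs, the awaited calls, and the success field on the result object are illustrative assumptions rather than documented behavior.

import bench from 'benchthing'

// Hedged sketch: run several WebArena tasks and report an aggregate success rate.
async function webArenaSuccessRate(taskIds) {
  let passed = 0
  for (const taskId of taskIds) {
    await bench.run({ benchmark: 'webarena', taskId })
    const result = await bench.result({ taskId })
    if (result && result.success) passed++ // 'success' field is an assumption
  }
  return passed / taskIds.length
}

webArenaSuccessRate(['1', '2', '3']).then((rate) =>
  console.log(`WebArena success rate: ${(rate * 100).toFixed(2)}%`)
)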