Overview

HellaSwag

Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. Introduced in 2019, HellaSwag tests commonsense reasoning about physical situations: given a short context, the model must choose the most plausible continuation from four candidate endings. At release, state-of-the-art models scored below 50% accuracy, while humans exceeded 95%.
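To make the task concrete, here is a minimal sketch of a HellaSwag-style item and an accuracy metric over predicted ending indices. The field names (ctx, endings, label) follow the published dataset schema, but the example text itself is invented for illustration, and the accuracy helper is not part of any library.

# Illustrative HellaSwag-style item; the text is invented, not from the dataset.
item = {
    "ctx": "A man is standing on a ladder, cleaning the gutters of a house. He",
    "endings": [
        "dives into the swimming pool below.",
        "scoops leaves out of the gutter and drops them to the ground.",
        "begins to play a violin solo.",
        "eats the ladder rung by rung.",
    ],
    "label": 1,  # index of the correct (human-written) ending
}

def accuracy(predictions, items):
    """Fraction of items where the predicted ending index matches the label."""
    correct = sum(p == it["label"] for p, it in zip(predictions, items))
    return correct / len(items)

# A model that picks ending 1 gets this single item right.
print(accuracy([1], [item]))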

from benchthing import Bench

bench = Bench("hellaswag")

# Replace these placeholder identifiers with the models you want to evaluate.
your_language_models = ["model-a", "model-b"]

bench.run(
    benchmark="hellaswag",
    task_id="1",
    models=your_language_models,
)

result = bench.get_result("1")

Sign up to get access to the HellaSwag benchmark API.