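HellaSwag stands for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. Introduced in 2019, the benchmark tests commonsense reasoning about physical situations: given a short context, a model must choose the most plausible ending. When it was released, the best models reached only about 50% accuracy, far below human performance.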
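from benchthing import Bench

bench = Bench("hellaswag")

# yourLanguageModels is a placeholder; replace it with the models you want to evaluate.
bench.run(
    benchmark="hellaswag",
    task_id="1",
    models=yourLanguageModels,
)

# Fetch the result using the same task_id that was passed to run().
result = bench.get_result("1")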
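
To make the accuracy figure concrete: each HellaSwag item offers four candidate endings, so random guessing scores about 25%. The following is a minimal sketch of how accuracy could be computed, assuming the model assigns one score per ending (for example a log-likelihood) and picks the highest. It is independent of benchthing and is not necessarily how benchthing scores a run; the helper names pick_ending and accuracy and the example numbers are hypothetical.

```python
from typing import Sequence


def pick_ending(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of items where the top-scoring ending matches the gold label."""
    if len(all_scores) != len(labels):
        raise ValueError("all_scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one item is required")
    correct = sum(pick_ending(s) == y for s, y in zip(all_scores, labels))
    return correct / len(labels)


# Hypothetical per-ending scores for three items, four endings each.
example_scores = [
    [-4.1, -2.3, -5.0, -3.7],  # highest score at index 1
    [-1.9, -2.8, -2.2, -6.4],  # highest score at index 0
    [-3.3, -3.1, -2.9, -3.0],  # highest score at index 2
]
example_labels = [1, 0, 3]  # gold ending index for each item (made up)

print(f"accuracy = {accuracy(example_scores, example_labels):.3f}")
```

Here the first two items are answered correctly and the third is not, so the sketch reports two out of three correct.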