A benchmark containing 15,000 Context-Question-Answer triplets annotated for hallucinations, sourced from real-world domains including finance and medicine. Built from examples in FinanceBench, PubMedQA, CovidQA, HaluEval, DROP, and RAGTruth, it is designed to evaluate models' ability to detect hallucinations in challenging scenarios.
from benchthing import Bench

# List of identifiers for the language models you want to evaluate
your_language_models = [...]

bench = Bench("halu-bench")

# Submit an evaluation run against the halu-bench benchmark
bench.run(
    benchmark="halu-bench",
    task_id="1",
    models=your_language_models,
)

# Retrieve the results for the run once it has completed
result = bench.get_result("1")
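Each item in the benchmark pairs a context and question with a candidate answer and a hallucination label. A minimal sketch of what a single triplet might look like, using hypothetical field names and an illustrative example (the actual schema of the underlying datasets may differ):

# Hypothetical triplet structure, for illustration only
example = {
    "context": "Aspirin is commonly used to reduce fever and relieve mild pain.",
    "question": "What is aspirin commonly used for?",
    "answer": "Aspirin is commonly used to treat bacterial infections.",
    "is_hallucination": True,  # the answer is not supported by the context
}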