A benchmark suite for evaluating LLMs on real-world, enterprise-level function calling and agent scenarios. It includes specialized benchmarks for IT ticket systems, security tools (NVD, VirusTotal), and complex multi-step interactions, and has been used to compare models such as Athene-V2 against GPT-4 on practical tool-use cases.
from benchthing import Bench

# Initialize a client for the nexus-bench suite
bench = Bench("nexus-bench")

# Launch a single task against your agent.
# `your_agent_model` is a placeholder for your own model or agent handle.
bench.run(
    benchmark="nexus-bench",
    task_id="1",
    agents=your_agent_model,
)

# Retrieve the result for the task once the run completes
result = bench.get_result("1")