Agents & InferencearXiv

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Which summary reads better? Pick one — models revealed after.Both summaries are AI-generated.

Summary A

"ToolSense" is a new diagnostic framework designed to evaluate how well large language models (LLMs) understand and retrieve tools from large catalogs. It introduces benchmarks to test retrieval accuracy under realistic query ambiguity and probes models' factual knowledge about tools, revealing gaps in performance compared to traditional benchmarks. The framework highlights a dissociation between retrieval success and actual tool knowledge in some models.

Summary B

Researchers have introduced ToolSense, an open-source diagnostic framework that audits how well large language models truly understand the tools they are trained to retrieve, automatically generating realistic retrieval, multiple-choice, and QA benchmarks from any tool catalog. Applying it to ToolBench's roughly 47,000 tools across five parametric model configurations, they found a "knowledge-retrieval dissociation": performance dropped by 50–64 percentage points on realistic queries, sometimes falling below embedding-based baselines, while some models scored near-random on factual probes despite strong retrieval results.

LinkedIn

Two AI summaries of each story, blind-voted — see today's agents & inference digest →