Which AI writes the better take? You decide — blind.

Two top models go head-to-head on today's AI news. Pick the sharper summary without seeing the names — the crowd's verdict builds the leaderboard.

Agents & Inference · arXiv

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Which summary reads better? Pick one; the models are revealed afterward. Both summaries are AI-generated.

Summary A

Researchers have introduced ToolSense, an open-source diagnostic framework that audits how well large language models truly understand the tools they are trained to retrieve, automatically generating realistic retrieval, multiple-choice, and QA benchmarks from any tool catalog. Applying it to ToolBench's roughly 47,000 tools across five parametric model configurations, they found a "knowledge-retrieval dissociation": performance dropped by 50–64 percentage points on realistic queries, sometimes falling below embedding-based baselines, while some models scored near-random on factual probes despite strong retrieval results.

Summary B

"ToolSense" is a new diagnostic framework designed to evaluate how well large language models (LLMs) understand and retrieve tools from large catalogs. It introduces benchmarks to test retrieval accuracy under realistic query ambiguity and probes models' factual knowledge about tools, revealing gaps in performance compared to traditional benchmarks. The framework highlights a dissociation between retrieval success and actual tool knowledge in some models.

More stories

Agents & Inference · TechCrunch

Cheaper, faster, and culturally aware, Avataar’s video AI is built for India’s scale

Summary A

Avataar AI, one of 12 startups selected for India's $1.2 billion AI Mission, has launched Varya, a video generation model trained to recognize local cultural elements like festivals, food, and clothing. Built by distilling Alibaba's Wan 2.2 model, Varya runs roughly 10 times faster and at about a 20x lower price than rivals like Veo, Kling, and Runway, charging around $0.005 per second of video. The model will be released as open-weight on India's AI Kosh portal, allowing developers to self-host or modify it.

Summary B

Avataar AI has launched Varya, a faster, cheaper video generation model tailored for India's cultural context, as part of the government's India AI Mission. The model, built by refining Alibaba's Wan 2.2, produces videos 10 times faster at a fraction of the cost, addressing affordability and cultural relevance for India's video-first market. Varya will be available as an open-weight model on India's AI Kosh portal, enabling broader access and customization.

Agents & Inference · Hugging Face

Introducing North Mini Code: Cohere’s First Model For Developers

Summary A

Cohere has released North Mini Code, its first model aimed at developers and the first entry in a new model family designed for agentic software engineering tasks. The 30B-parameter Mixture-of-Experts model with 3B active parameters is available on Hugging Face under the Apache 2.0 license and is optimized for complex coding workflows, terminal-based agentic tasks, and code generation. Cohere reports a score of 33.4 on Artificial Analysis' Coding Index, claiming it outperforms several comparably sized and substantially larger open-source models.

Summary B

Cohere has released North Mini Code, a 30B-parameter Mixture-of-Experts model designed for agentic coding tasks, now available on Hugging Face under the Apache 2.0 license. It outperforms similar-sized models in coding benchmarks and is optimized for complex software engineering workflows. The model was trained using a combination of supervised fine-tuning and reinforcement learning to enhance its coding and agentic capabilities.

Agents & Inference · Google DeepMind

Investing in multi-agent AI safety research

Summary A

Google DeepMind is advancing multi-agent AI safety research to proactively address security risks and ensure responsible development. The initiative focuses on developing safeguards against evolving threats in complex AI systems. This work aligns with the organization's mission to create beneficial AI while mitigating potential harms.

Summary B

Google DeepMind is investing in research focused on the safety of multi-agent AI systems, where multiple AI agents interact with one another. The initiative aims to address emerging risks and ensure AI is developed responsibly as agents become more capable of reasoning, learning, and operating together.

Agents & Inference · Mistral

Vibe gets to work.

Summary A

Mistral has rebranded its Le Chat assistant as Vibe, a unified AI agent that handles both long-running work tasks and coding across web, IDE, and terminal environments. In Work Mode, Vibe manages multi-step tasks like inbox catch-up, research, data analysis, and document drafting by connecting to enterprise tools such as Google Workspace, Slack, and SharePoint, while Code Mode lets remote agents build features, fix bugs, and ship pull requests. The product runs on Mistral's flagship models, carries over existing user data and licenses, and includes a new VS Code extension along with planned Slack integration.

Summary B

Vibe is an AI agent designed to handle complex, multi-stage tasks across work and coding, integrating with tools like Google Workspace, Outlook, and GitHub. It automates processes, drafts documents, analyzes data, and manages coding tasks from request to merged changes. The platform runs on Mistral models optimized for reasoning, agentic tasks, and coding, with features like enterprise knowledge search and structured data analysis.

Agents & Inference · Ollama

Improved performance and model support with GGUF

Summary A

Ollama 0.30 introduces improved performance and expanded GGUF model compatibility, offering up to 20% faster performance on NVIDIA hardware and broader GPU acceleration support. The update enables seamless integration with GGUF files and enhances tool-calling capabilities for coding agents and assistants. Vulkan is now enabled by default, extending GPU acceleration to AMD and Intel devices without requiring vendor-specific libraries.

Summary B

Ollama 0.30 has been released with improved performance and broader GGUF model compatibility through llama.cpp, complementing its MLX engine on Apple silicon. The update delivers up to 20% faster performance on NVIDIA hardware, enables Vulkan by default to extend GPU acceleration to AMD and Intel devices, and expands support for more model families and fine-tuned models. Models with tool-calling capabilities can also be used directly with coding agents and assistants via a single command.
