Which AI writes the better take? You decide — blind.

Two top models go head-to-head on today's AI news. Pick the sharper summary without seeing the names — the crowd's verdict builds the leaderboard.

Agents & Inference · Simon Willison

llm-anthropic 0.25.1

Which summary reads better? Pick one; the models are revealed after. Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

Simon Willison has released version 0.25.1 of llm-anthropic, a plugin for working with Anthropic's language models. The release was used to generate content related to pelicans and corresponds with updates to Opus 4.8.

Summary B

A new release of the llm-anthropic plugin, version 0.25.1, is now available. The update was used to generate pelican test outputs alongside coverage of the Opus 4.8 model release.

More stories

Agents & Inference · Simon Willison

datasette-agent 0.1a4

Summary A

Datasette-agent 0.1a4 now integrates with Datasette 1.0a30's new makeJumpSections() JavaScript hook to provide an agent chat interface accessible through the Jump to menu. Users can test the feature by signing into agent.datasette.io with their GitHub account.

Summary B

An early alpha release of datasette-agent, version 0.1a4, introduces a "Start a new agent chat" interface integrated into Datasette's Jump to menu, accessible by pressing the slash key. The feature relies on the new makeJumpSections() JavaScript plugin hook added in Datasette 1.0a30, and users can test it by signing into agent.datasette.io with a GitHub account.

Agents & Inference · Hugging Face

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

Summary A

JetBrains has released Mellum2, a 12-billion parameter Mixture-of-Experts model designed for efficient text-and-code processing with more than 2x faster inference than similarly sized models. The open-source model is optimized for latency-sensitive tasks in production AI systems, including routing, RAG pipelines, and sub-agent operations, while maintaining competitive performance on code generation, reasoning, and math benchmarks. Mellum2 is intended as a specialized component for larger AI systems rather than a replacement for frontier models, enabling faster and more cost-effective deployment in software engineering applications.

Summary B

JetBrains has released Mellum2, an open 12-billion-parameter Mixture-of-Experts model optimized for low-latency text and code tasks. Building on the original Mellum code completion model, it activates only a subset of parameters per token to deliver more than twice the inference speed of similarly sized open models while remaining competitive on benchmarks. JetBrains positions Mellum2 as a "focal" model for high-frequency tasks within larger AI systems, including routing, RAG pipelines, sub-agent operations, and private self-hosted deployments.

Agents & Inference · Hugging Face

Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

Summary A

IBM Research argues that scalable enterprise AI adoption requires more than large language models, pointing to "agent logic"—software primitives such as knowledge graphs, algorithms, and program analysis libraries that operate within an agent harness to steer LLMs toward enterprise workflows. The approach aims to reduce context space, lower token costs, and curb hallucinations while improving agent quality and end-user trust. IBM tested the concept by building agents for offerings including watsonx Code Assistant for Z, which accelerates mainframe application development and modernization.

Summary B

Enterprise AI adoption at scale requires more than large language models alone—it demands "agent logic," specialized software components that guide AI agents through complex, dynamic enterprise workflows while reducing costs and improving reliability. The article examines how agent logic, including knowledge graphs and program analysis tools, can steer AI models away from hallucinations and inefficiencies by constraining context to what's relevant for specific enterprise tasks. IBM's research demonstrates this approach across multiple domains, including mainframe application development, showing that intelligent guidance systems are critical for moving AI from failed pilots into core business operations.

Agents & Inference · Hugging Face

Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action

Summary A

NVIDIA has released Cosmos 3, described as the first open omni-model for physical AI, now available on Hugging Face. Built on a Mixture-of-Transformers architecture, it unifies world generation, physical reasoning, and action generation into a single model, processing text, image, video, audio, and action modalities in one forward pass. The model is aimed at applications such as robotics, autonomous vehicles, and smart spaces, enabling simulation and understanding of motion, causality, and physics.

Summary B

NVIDIA has released Cosmos 3, an open-source unified artificial intelligence model designed for physical AI applications including robotics, autonomous vehicles, and smart spaces. Built on a Mixture-of-Transformers architecture, Cosmos 3 combines world generation, physical reasoning, and action generation into a single omni-model, eliminating the need to juggle separate models for different tasks. The model is now available on Hugging Face and can process multiple modalities—text, image, video, audio, and action—to simulate and understand the physical world.

Agents & Inference · Hugging Face

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

Summary A

A new glossary from Hugging Face aims to clarify the increasingly muddled vocabulary surrounding AI agents, where terms like "harness" and "scaffold" are often used inconsistently across different frameworks and contexts. Authored by Sergio Paniego and Aritra Roy Gosthipaty, it defines key concepts such as model, scaffolding, context engineering, tool use, and training-related terms to provide a practical mental model for newcomers and practitioners alike. Rather than enforcing a single correct definition, the piece seeks to make technical discussions about building, deploying, and training agents easier to follow.

Summary B

The article provides a glossary of key terminology in the rapidly evolving AI agents field, clarifying commonly confused terms like "harness" and "scaffold." According to the authors, the model (LLM) is the core text-processing engine, scaffolding defines the behavioral layer around it through prompts and tool descriptions, and a harness executes tools and manages the agent's loop. The piece aims to establish practical definitions for these terms to facilitate clearer communication among practitioners building, deploying, or using AI agents, while acknowledging that universal definitions don't yet exist across different frameworks.
