Which AI writes the better take? You decide — blind.

Two top models go head-to-head on today's AI news. Pick the sharper summary without seeing the names — the crowd's verdict builds the leaderboard.

Agents & Inference · Hugging Face

Adding MCP Tools to Reachy Mini

Which summary reads better? Pick one; the models are revealed after. Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

Reachy Mini's conversation app now supports tools hosted on Hugging Face Spaces via MCP, allowing users to add new capabilities like weather checks without modifying the app. These tools run remotely in the Space rather than locally, and users can also publish their own tools for others to use. The system works by enabling tools through profile settings, expanding functionality beyond the robot's built-in physical capabilities.

Summary B

Hugging Face's Reachy Mini conversation app now supports external tools hosted in public Hugging Face Spaces and called via MCP, allowing users to add new abilities like checking the weather or searching the web without editing the app's code. Because the tools run within the Spaces themselves, no code is downloaded locally, and users can also publish their own tools for others to use. Tools are managed through profiles, where a tools.txt file determines which capabilities the robot can access.

More stories

Agents & Inference · Simon Willison

datasette-agent-micropython 0.1a0

Summary A

A new alpha release, datasette-agent-micropython 0.1a0, aims to let Datasette Agent safely generate and execute Python code within a sandboxed environment. Early testing has been promising, with GPT-5.5 so far unable to break out of the sandbox.

Summary B

Simon Willison has released the alpha version 0.1a0 of Datasette Agent Micropython, aiming to safely generate and execute Python code. Initial tests show promise, with GPT-5.5 unable to break out of the sandbox.

Agents & Inference · TechCrunch

Meta’s AI agent for WhatsApp Business is now available globally

Summary A

Meta has launched its AI customer support agent for WhatsApp Business globally, enabling businesses to handle queries, recommend products, and manage appointments. The AI tool, tested in markets like India and Mexico, will be part of WhatsApp Business Premium subscriptions, with pricing based on usage for larger enterprises. Meta is also expanding the agent's capabilities to include market research, calendar management, and integration with platforms like Shopify and Zendesk.

Summary B

Meta has launched its customer support AI tool, now called Meta Business Agent, globally within WhatsApp and Instagram DMs after nearly two years of testing in countries like India and Mexico. The agent can answer customer questions, recommend products, book appointments, qualify sales leads, and route queries to human staff, with Meta planning further capabilities such as market research and calendar management. Businesses will pay for the service through WhatsApp Business Premium subscription tiers, while larger enterprises will be charged based on token usage and gain access to custom agents that connect to systems like Shopify and Zendesk.

Agents & Inference · Hugging Face

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

Summary A

JetBrains has released Mellum2, an open 12B-parameter Mixture-of-Experts model optimized for low-latency text and code tasks. The model activates only a subset of its parameters per token, delivering more than twice the inference speed of similarly sized open models while remaining competitive on code, reasoning, science, and math benchmarks. JetBrains positions Mellum2 as a "focal" model for high-frequency operations such as routing, RAG pipelines, sub-agent tasks, and private self-hosted deployments.

Summary B

JetBrains has released Mellum2, a 12-billion-parameter Mixture-of-Experts model optimized for efficient text-and-code tasks like routing, retrieval, and agent workflows. The open model delivers faster inference than similarly sized alternatives while specializing in software engineering applications. Mellum2 is designed as a focused component for production AI systems, prioritizing speed and deployability in latency-sensitive scenarios.

Agents & Inference · Simon Willison

Uber Caps Usage of AI Tools Like Claude Code to Manage Costs

Summary A

Uber has imposed a $1,500 monthly spending cap per employee on AI coding tools like Claude Code to control costs after exceeding its AI budget. The limit applies separately to each tool, preventing overspending while still allowing engineers flexibility. This policy reflects Uber's effort to balance AI productivity gains with financial sustainability.

Summary B

Uber has imposed a $1,500 monthly cap on per-tool spending for agentic AI coding software like Cursor and Anthropic's Claude Code, after exhausting its 2026 AI budget within four months. The limit applies separately to each tool, so usage of one doesn't count against another, and represents a roughly 11% slice of the median Uber engineer's annual compensation if applied to two tools.

Agents & Inference · TechCrunch

New Microsoft tool lets devs spin up AI behavior tests using text descriptions

Summary A

Microsoft has released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), an open source framework that lets developers create AI behavior tests from plain-language descriptions of their system's intended goals, policies, and constraints. The tool converts those descriptions into structured tests, generates and runs problem scenarios against the target system, scores the results, and records the AI's decision paths so developers can pinpoint failures. Microsoft says the framework can be used during development, after deployment, and for continuous monitoring, addressing the need for application-specific evaluations that broader benchmarks cannot cover.

Summary B

Microsoft has launched ASSERT, an open-source framework that lets developers test AI behavior by converting natural-language descriptions into scored evaluations. The tool generates test cases based on specified rules and checks if AI systems comply with application-specific policies. It aims to address gaps in broader AI evaluations by focusing on tailored, continuous monitoring for deployed models.
