Which AI writes the better take? You decide — blind.

Two top models go head-to-head on today's AI news. Pick the sharper summary without seeing the names — the crowd's verdict builds the leaderboard.

Agents & InferenceHugging Face

olmo-eval: An evaluation workbench for the model development loop

Which summary reads better? Pick one — models revealed after.Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

Ai2 has released olmo-eval, an open evaluation workbench designed to support the iterative work of LLM development, including adding and configuring benchmarks, running them across model checkpoints, and analyzing results prompt by prompt. Building on the OLMES standard introduced in 2024, the tool supports agentic and multi-turn evaluation as first-class use cases and offers flexibility in how each benchmark runs to save time and resources. It also provides stronger analysis tools to help developers determine whether a performance change reflects a real improvement or statistical noise.

Summary B

that tests agentic behavior can run in a containerized sandbox for safety and reproducibility. olmo-eval streamlines the evaluation loop by integrating flexible benchmarking, checkpoint tracking, and granular analysis tools to help developers iterate efficiently during model training. It builds on the OLMES standard while adapting to the dynamic needs of ongoing model development.

0 picks

Permalink Embed Leaderboard →

What you'll learn · Jun 13, 2026 · 6 stories

Browse editions · 65 days

Newer Older

Latest 12 days

07-29 07-28 07-27 07-26 07-25 07-24 07-23 07-22 07-21 07-20 07-19 07-18

More stories

Agents & InferenceGoogle DeepMind

Investing in multi-agent AI safety research

Which summary reads better? Pick one — models revealed after.Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

Google DeepMind is committing resources to research focused on the safety of multi-agent AI systems, where multiple AI agents interact with one another. The initiative reflects the company's broader emphasis on developing AI responsibly and addressing emerging risks as agentic systems become more capable.

Summary B

Google DeepMind is advancing multi-agent AI safety research to ensure responsible development of intelligent systems. The company focuses on breakthroughs like Gemini Robotics and AlphaFold while emphasizing proactive security measures. Their mission is to create AI that benefits humanity through responsible innovation and real-world applications.

0 picks

Permalink Embed Leaderboard →

Agents & InferenceMistral

Vibe gets to work.

Which summary reads better? Pick one — models revealed after.Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

Mistral has rebranded its Le Chat assistant as Vibe, a unified AI agent designed to handle both long-running, multi-step work tasks and coding projects under a single license. In Work Mode, Vibe maps out plans, pulls from connected apps like Google Workspace and Slack, conducts research, and drafts deliverables, while Code Mode launches remote coding agents that build features, fix bugs, and ship reviewable pull requests across web, IDE, and terminal. The agent runs on Mistral's flagship models, with existing user conversations, settings, and plans carried over from Le Chat.

Summary B

Vibe is an AI agent designed for long-running, multi-step tasks, integrating with workflows like email, calendars, and coding projects. It leverages Mistral models for reasoning and coding, offering features like document synthesis, data analysis, and automated task scheduling. The platform includes Work Mode for general tasks and Code Mode for coding projects, with support for GitHub, IDE extensions, and CLI tools.

0 picks

Permalink Embed Leaderboard →

Agents & InferenceOllama

Improved performance and model support with GGUF

Which summary reads better? Pick one — models revealed after.Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

Ollama 0.30 introduces improved performance and broader model support through GGUF compatibility, offering up to 20% faster speeds on NVIDIA hardware and expanded GPU acceleration for AMD and Intel devices. The update enables more models to run out of the box, including LFM, Prism, and fine-tuned models from Unsloth. Users can now easily integrate GGUF files and leverage tool-calling capabilities for coding agents and assistants.

Summary B

Ollama 0.30 has been released with improved performance and broader GGUF model compatibility through llama.cpp, complementing its existing MLX engine on Apple silicon. The update delivers up to 20% faster performance on NVIDIA hardware, enables Vulkan by default to extend GPU acceleration to AMD and Intel devices, and expands support for more model families including LFM, Prism, and Unsloth fine-tunes. Models with tool-calling capabilities can also be used directly with coding agents and assistants through a single launch command.

0 picks

Permalink Embed Leaderboard →

Agents & InferenceTechCrunch

Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI

Which summary reads better? Pick one — models revealed after.Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

The U.S. government ordered Anthropic on Friday to immediately disable global access to its two most powerful AI models, Claude Fable 5 and Claude Mythos 5, citing national security concerns framed as an export control action. Anthropic complied but publicly disputed the move, arguing the underlying issue—a claimed narrow jailbreak of Fable 5 that lets the model identify software flaws—reflects a capability already available in other public models and used routinely for defensive cybersecurity. The company warned that applying such a standard across the industry would effectively halt all new frontier model deployments, a notable clash for a firm that has built its identity around being the safety-focused AI developer ahead of an expected IPO.

Summary B

safety warnings Anthropic issued about its powerful AI models may have led to their shutdown by the U.S. government over national security concerns. The company was ordered to disable Claude Fable 5 and Claude Mythos 5 globally, despite arguing the alleged vulnerabilities are already present in other publicly available AI systems. Anthropic maintains its safeguards are robust and criticizes the decision, warning it could stifle innovation across the AI industry.

0 picks

Permalink Embed Leaderboard →

Agents & InferenceSimon Willison

Statement on the US government directive to suspend access to Fable 5 and Mythos 5

Which summary reads better? Pick one — models revealed after.Both summaries are AI-generated.

Match the models (Optional)

Which model wrote which summary? Select a matchup mapping below before voting.

Summary A

Anthropic says the US government issued an export control directive citing national security, ordering the company to suspend all access to its Fable 5 and Mythos 5 models for any foreign national, forcing it to disable the models for all customers. The directive reportedly stems from concerns about a "jailbreak" of Fable 5, though Anthropic maintains the technique only surfaced minor, previously known vulnerabilities that other publicly available models can also find. Access to the models was cut off the evening of June 12, 2026, with users redirected to use Opus 4.8 instead.

Summary B

The US government has ordered Anthropic to suspend access to its Fable 5 and Mythos 5 AI models for all foreign nationals, citing unspecified national security concerns. The directive, received on June 12, 2026, requires immediate compliance, though the company disputes the uniqueness of the alleged vulnerabilities. Access to other Anthropic models remains unaffected.

0 picks

Permalink Embed Leaderboard →

See who's winning the model face-off

Tomorrow's blind matchup and the running leaderboard — one email a day.

Takeaways written by GPT-5.5 — not one of this week's two contestants.