Two AIs summarize each story. You pick the better one.

A daily AI agents & inference digest: every story is summarized by two models, blind. Vote for the better summary, then see the running leaderboard of which model practitioners actually prefer. One short email a day, free.

Editions

(20 days)

Agents & Inference, UTC dates, up to 6 stories/day.

What is an edition?

Each edition is one calendar day (UTC). Stories come from the allowlisted RSS feed. Use the menu to jump to any day; empty days show "empty" and may need a rebuild.

What you'll learn today · 6 stories

  1. 2.4pp performance changes can be tested against noise with olmo-eval, helping teams rerun reproducible benchmarks across changing checkpoints and agentic multi-turn workflows.
  2. Investing in multi-agent AI safety research
  3. One agent and one licence now span work and code, so teams can centralize multi-step workflows while governing access through admin-level permissions.
  4. Up to 20% faster NVIDIA inference and default Vulkan GPU acceleration make Ollama 0.30 more useful for running GGUF models across NVIDIA, AMD, and Intel hardware.
  5. Two Anthropic models were cut off worldwide after reported jailbreak concerns, so production teams should plan failover for sudden policy-driven model removals.
  6. 28 PyPI packages now publish pyemscripten_202*_wasm32 wheels, letting Pyodide apps install WASM-backed Python packages at runtime instead of relying on maintainer-hosted builds.

Agents & Inference

Agents & Inference · Hugging Face

olmo-eval: An evaluation workbench for the model development loop

Which summary reads better? Pick one; models revealed after. Both summaries are AI-generated.

Summary A

Ai2 has released olmo-eval, an open evaluation workbench designed to support the iterative loop of building large language models rather than just scoring finished ones. Building on Ai2's earlier OLMES standard, the tool streamlines adding and configuring benchmarks, running them across model checkpoints, and analyzing results prompt by prompt, with first-class support for agentic and multi-turn evaluation. It also offers flexible execution options (such as running lighter benchmarks directly rather than in resource-heavy containers) and stronger analysis tools to determine whether a change genuinely improves performance or is just noise.

Summary B

olmo-eval is a new evaluation workbench designed to streamline the iterative process of testing language models during development, offering more flexibility than traditional benchmarking tools. It builds on the OLMES standard by simplifying evaluation implementation, supporting agentic and multi-turn testing, and providing stronger analysis tools. Unlike frameworks focused solely on final benchmarks, olmo-eval is tailored for continuous model adjustments, allowing developers to run and analyze tests efficiently across different model checkpoints.
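
Both summaries mention checking whether a change between checkpoints is real or just noise. As a rough illustration of that idea only, here is a minimal paired-bootstrap sketch in Python. It does not use olmo-eval and is not its API: the function name `paired_bootstrap_delta`, its parameters, and the per-prompt pass/fail inputs are all hypothetical.

```python
import random


def paired_bootstrap_delta(base_correct, new_correct, n_resamples=2000, seed=0):
    """Return (observed_delta, lower, upper) for an accuracy change.

    Hypothetical helper, not part of olmo-eval. Inputs are per-prompt
    pass/fail results for two checkpoints scored on the SAME prompts.
    lower/upper bound a 95% percentile bootstrap interval.
    """
    if len(base_correct) != len(new_correct):
        raise ValueError("both checkpoints must be scored on the same prompts")
    n = len(base_correct)
    if n == 0:
        raise ValueError("need at least one prompt")

    rng = random.Random(seed)
    # Per-prompt difference: +1 newly solved, -1 regressed, 0 unchanged.
    diffs = [int(new) - int(old) for old, new in zip(base_correct, new_correct)]
    observed = sum(diffs) / n

    # Resample prompts with replacement and recompute the mean difference.
    deltas = []
    for _ in range(n_resamples):
        sample_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        deltas.append(sample_sum / n)
    deltas.sort()

    lower = deltas[int(0.025 * n_resamples)]
    upper = deltas[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    # Toy data: the new checkpoint fixes 6 prompts and breaks 2 out of 100.
    base = [True] * 60 + [False] * 40
    new = [True] * 58 + [False] * 2 + [True] * 6 + [False] * 34
    delta, lo, hi = paired_bootstrap_delta(base, new)
    print(f"delta={delta:+.3f}, 95% interval=({lo:+.3f}, {hi:+.3f})")
```

If the interval comfortably excludes zero, the change is less likely to be noise. If it straddles zero, the benchmark probably needs more prompts or repeated runs before drawing a conclusion.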

Agents & Inference · Google DeepMind

Investing in multi-agent AI safety research

Summary A

Google DeepMind is investing in multi-agent AI safety research to ensure responsible development of advanced AI systems. The initiative focuses on proactive security measures to address evolving threats while advancing breakthroughs in AI capabilities. This aligns with their mission to build AI that benefits humanity responsibly.

Summary B

Google DeepMind is committing resources to research focused on the safety of multi-agent AI systems, where multiple AI agents interact with one another. The effort reflects the company's broader push to develop AI responsibly and address emerging risks as agentic systems become more capable.

Agents & Inference · Mistral

Vibe gets to work.

Summary A

Mistral has rebranded its Le Chat assistant as Vibe, a unified AI agent for both office work and coding that carries over users' existing conversations, settings, and plans. The platform features a Work Mode for long-running, multi-step tasks like research, data analysis, and report drafting across connected apps like Google Workspace, Slack, and SharePoint, plus a Code Mode that handles coding from request to merged pull request via the web, a VS Code extension, and a CLI. Vibe runs on Mistral's flagship models optimized for reasoning, agentic tasks, and coding, with admin-level permissions and visible step-by-step progress.

Summary B

Vibe is an AI agent designed for complex, multi-step tasks, integrating with work and coding environments to automate processes, draft deliverables, and manage projects. It operates across platforms like Google Workspace, Outlook, and GitHub, offering features like enterprise knowledge search, structured data analysis, and reusable skills. The tool also supports coding tasks, running sessions in isolated sandboxes and integrating with IDEs and the Vibe CLI.

Agents & Inference · Ollama

Improved performance and model support with GGUF

Summary A

Ollama 0.30 has been released, delivering up to 20% faster performance on NVIDIA hardware and expanded GGUF model compatibility through llama.cpp, building on its existing MLX engine for Apple silicon. The update enables Vulkan by default, broadening GPU acceleration to AMD and Intel devices without vendor-specific libraries, and adds support for more model families including LFM, Prism, and Unsloth fine-tuned models. Models with tool-calling capability can now be used directly with coding agents and assistants through a single command.

Summary B

Ollama 0.30 introduces improved performance and expanded GGUF model compatibility, offering up to 20% faster speeds on NVIDIA hardware and broader GPU acceleration support. The update enables seamless use of GGUF files and supports additional model families like LFM and Prism, along with fine-tuned models from Unsloth. Vulkan is now enabled by default, extending GPU acceleration to AMD and Intel devices without requiring vendor-specific libraries.

Agents & Inference · TechCrunch

Amazon CEO reportedly raised Anthropic model concerns before government crackdown

Summary A

Amazon CEO Andy Jassy reportedly raised security concerns about Anthropic's AI models, leading to a government export ban on two of them. Amazon researchers allegedly found vulnerabilities in Anthropic's Claude Fable 5 model that could be exploited for cyberattacks. Anthropic disputed the concerns, stating similar capabilities exist in other publicly available models.

Summary B

Amazon CEO Andy Jassy reportedly told Treasury Secretary Scott Bessent and other officials that Amazon researchers used Anthropic's Claude Fable 5 model to obtain information that could aid cyberattacks, prompting the government to impose export controls on the Fable 5 and Mythos 5 models. Anthropic subsequently cut off worldwide access to the two models, though it maintains the capabilities at issue are already available in other public models. Amazon, a major Anthropic investor, declined to detail its discussions with the government.

Agents & Inference · Simon Willison

Publishing WASM wheels to PyPI for use with Pyodide

Summary A

Pyodide's 314.0 release now allows developers to publish Python packages built for the WebAssembly-based platform directly to PyPI, eliminating the previous bottleneck in which maintainers had to build and host over 300 packages themselves. To demonstrate the new capability, Simon Willison packaged and published luau-wasm, a PyPI package wrapping the Roblox-developed Luau language compiled to WebAssembly. As of now, roughly 28 PyPI packages are publishing wheels using the new platform tags.

Summary B

Pyodide now allows Python packages built for WebAssembly (WASM) to be published directly to PyPI, eliminating the need for manual hosting by maintainers. Developers can now distribute WASM-compatible wheels like native platform wheels, streamlining package availability for Pyodide. This change reduces maintenance burdens and expands community contributions to Pyodide's ecosystem.

Takeaways written by GPT-5.5 — not one of this week's two contestants.