Benchmarks frontier language models on long-horizon automated research and engineering workflows to assess end-to-end autonomous problem-solving capability.
Compresses lengthy chain-of-thought reasoning traces into more compact representations through introspective preference learning over model-generated rationales.
Introduces a budget-aware model merging method that selectively limits which expert weight subsets each model can read, improving scalability.
Attributes training data influence by applying sparse recovery techniques to model output changes induced by systematic perturbations of training subsets.
WebRISE evaluates multimodal LLM-generated web artifacts by checking whether outputs satisfy explicit functional and structural requirements, not just visual similarity.
OpenSTBench introduces evaluation metrics for speech translation that go beyond semantic accuracy, capturing structural, prosodic, or pragmatic translation quality.
Scaling modifications to Gated Delta Networks enable effective feature learning, addressing a previously identified limitation of this recurrent architecture at larger scales.
A two-stage on-policy distillation method first filters low-quality training samples, then applies per-sample reweighting to improve fine-grained optimization of student models.
OVO-S-Bench hierarchically benchmarks multimodal LLMs on streaming video spatial reasoning, testing capabilities like depth, layout, and object-relation understanding over temporal sequences.
RAMP provides a runtime evaluation framework for assessing agentic AI models in live production environments, capturing failure modes invisible to static offline benchmarks.
Uber imposed usage caps on AI coding tools including Claude Code as a direct cost-control measure following higher-than-expected enterprise spending.
MCP tool integration was added to the Reachy Mini robot, enabling it to invoke external AI-powered tools via the Model Context Protocol.
Direct Preference Optimization techniques are applied beyond conversational chatbots to align generative models in domains such as code, images, or structured outputs.
OpenAI published its formal public policy agenda outlining its positions on AI regulation, safety standards, and government engagement priorities.
A governance blueprint proposes democratic oversight mechanisms, such as public participation and accountability structures, for decisions about frontier AI development and deployment.
Wasmer engineers used OpenAI Codex to accelerate building a Node.js-compatible JavaScript runtime optimized for edge computing environments.
GPT-Rosalind receives new capabilities, likely expanding its functionality for biology or genomics-related AI tasks.
A year-long empirical mapping of AI-enabled cyber threats yields policy-relevant findings about attack patterns and defensive implications.
Anthropic launches a Services Track and Partner Hub to formalize and expand the Claude Partner Network ecosystem.
An analysis identifies the conditions and memory allocation patterns that trigger fragmentation in CUDA's caching memory allocator.
A user explains switching away from Gmail out of frustration with AI-driven smart features that oversimplify interactions or patronize users.
Microsoft releases MAI-Code-1-Flash, a fast, efficient AI model optimized for code generation tasks.
Trump signs a reduced-scope executive order on AI policy after earlier, broader versions were revised multiple times.
A Stanford Law study finds AI systems outperform law professors on the legal reasoning or analysis tasks it tested.
Open Repair Alliance publishes a standardized open data schema for logging and sharing consumer product repair records across organizations.
An engineering team describes their pipeline for embedding and indexing images to enable retrieval-augmented generation over visual content.
LangChain provides a framework and toolset for building, orchestrating, and deploying AI agents and multi-step LLM pipelines.
Open WebUI delivers a self-hostable browser interface for interacting with local and remote LLMs via Ollama and OpenAI-compatible APIs.
Dify offers a production-grade platform for visually designing, deploying, and managing agentic LLM workflows and applications.
Hugging Face Transformers provides a unified Python library for defining, loading, fine-tuning, and running state-of-the-art pretrained models.