The AI Wire

5101 articles — page 5 of 171

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs (huggingface.co)

2026-06-04|model|huggingface

OVO-S-Bench hierarchically benchmarks multimodal LLMs on streaming video spatial reasoning, testing capabilities like depth, layout, and object-relation understanding over temporal sequences.

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems (huggingface.co)

2026-06-04|model|huggingface

RAMP provides a runtime evaluation framework for assessing agentic AI models in live production environments, capturing failure modes invisible to static offline benchmarks.

Uber Caps Usage of AI Tools Like Claude Code to Manage Costs (simonwillison.net)

2026-06-04|news|blog/Simon Willison

Uber imposed usage caps on AI coding tools including Claude Code as a direct cost-control measure following higher-than-expected enterprise spending.

Adding MCP Tools to Reachy Mini (huggingface.co)

2026-06-04|news|blog/Hugging Face Blog

MCP tool integration was added to the Reachy Mini robot, enabling it to invoke external AI-powered tools via the Model Context Protocol.

Direct Preference Optimization Beyond Chatbots (huggingface.co)

2026-06-04|news|blog/Hugging Face Blog

Direct Preference Optimization techniques are applied outside conversational chatbot settings to align generative models in other domains such as code, images, or structured outputs.

OpenAI public policy agenda (openai.com)

2026-06-04|news|blog/OpenAI Blog

OpenAI published its formal public policy agenda outlining its positions on AI regulation, safety standards, and government engagement priorities.

A blueprint for democratic governance of frontier AI (openai.com)

2026-06-04|news|blog/OpenAI Blog

A governance blueprint proposes democratic oversight mechanisms—such as public participation and accountability structures—for decisions made about frontier AI development and deployment.

How Wasmer used Codex to build a Node.js runtime for the edge (openai.com)

2026-06-04|news|blog/OpenAI Blog

Wasmer engineers used OpenAI Codex to accelerate building a Node.js-compatible JavaScript runtime optimized for edge computing environments.

Introducing new capabilities to GPT-Rosalind (openai.com)

2026-06-04|news|blog/OpenAI Blog

GPT-Rosalind receives new capabilities, likely expanding its functionality for biology or genomics-related AI tasks.

Jun 3, 2026 | Policy | What we learned mapping a year’s worth of AI-enabled cyber threats (anthropic.com)

2026-06-04|news|blog/Anthropic News

A year-long empirical mapping of AI-enabled cyber threats yields policy-relevant findings about attack patterns and defensive implications.

Jun 3, 2026 | Announcements | Introducing the Services Track and Partner Hub of the Claude Partner Network (anthropic.com)

2026-06-04|news|blog/Anthropic News

Anthropic launches a Services Track and Partner Hub to formalize and expand the Claude Partner Network ecosystem.

When does fragmentation occur in the CUDA caching allocator? (docs.pytorch.org)

2026-06-04|news|hackernews

An analysis identifies the conditions and allocation patterns that trigger fragmentation in PyTorch's CUDA caching allocator.

Fast & Faithful Function Vectors (arxiv.org)

2026-06-04|paper|arxiv

A method computes function vectors—representations capturing model behavior for specific tasks—more quickly while preserving faithfulness to the original approach.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? (arxiv.org)

2026-06-04|paper|arxiv

AutoLab benchmarks frontier models on long-horizon automated research and engineering tasks, evaluating autonomous scientific problem-solving capability.

Automatic Generation of Titles for Research Papers Using Language Models (arxiv.org)

2026-06-04|paper|arxiv

Language models are applied to automatically generate concise, relevant titles for research papers given their content.

Light or Full Verb? A Minimal-Pair Dataset for Probing Phraseological Competence in Language Models (arxiv.org)

2026-06-04|paper|arxiv

A minimal-pair dataset probes whether language models can distinguish light verb constructions from full verb uses in matched phraseological contexts.

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors (arxiv.org)

2026-06-04|paper|arxiv

FoeGlass uses simple in-context learning to generate adversarial audio examples that fool deepfake detection systems during red teaming.

Identifying Gems from Roman RAPIDly (arxiv.org)

2026-06-04|paper|arxiv

A method rapidly identifies genuine Roman-era gems from the RAPID collection using automated recognition or classification techniques.

Knowledge Index of Noah's Ark (arxiv.org)

2026-06-04|paper|arxiv

A structured knowledge index is built for the Noah's Ark corpus, organizing and making its contents systematically queryable or retrievable.

Arithmetic Pedagogy for Language Models (arxiv.org)

2026-06-04|paper|arxiv

Techniques or curricula for teaching arithmetic reasoning are developed or analyzed to improve language models' mathematical computation accuracy.

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have (arxiv.org)

2026-06-04|paper|arxiv

Leverages existing metadata (e.g., geolocation, timestamps, tags) as supervision signals to fine-tune vision foundation models without requiring manual annotations.

RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities (arxiv.org)

2026-06-04|paper|arxiv

Extends disentangled representation learning frameworks to handle more than two modalities simultaneously, scaling the approach across richer multi-modal data combinations.

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases (arxiv.org)

2026-06-04|paper|arxiv

Assesses how accurately LLMs diagnose and make treatment decisions when presented with structured standardized patient case scenarios mimicking real clinical encounters.

Continual Visual and Verbal Learning Through a Child's Egocentric Input (arxiv.org)

2026-06-04|paper|arxiv

Trains a model on egocentric video and speech data recorded from a child's perspective to study continual multimodal learning mimicking child development.

Graph Set Transformer (arxiv.org)

2026-06-04|paper|arxiv

Applies set transformer attention mechanisms to graph-level tasks, enabling permutation-invariant aggregation over sets of graph elements for improved graph representation.

Audio Interaction Model (arxiv.org)

2026-06-04|paper|arxiv

Introduces a model for processing and generating audio through interactive turn-taking or contextual audio exchange, enabling conversational or responsive audio understanding.

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data (arxiv.org)

2026-06-04|paper|arxiv

Extracts latent self-evaluation capabilities already encoded in base LLMs using minimal labeled examples, enabling calibrated judgment without dedicated RLHF-style training.

Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting (arxiv.org)

2026-06-04|paper|arxiv

Separates appearance attributes from geometric structure within 3D Gaussian Splatting representations, allowing independent editing and optimization of geometry and visual appearance.

Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption (arxiv.org)

2026-06-04|paper|arxiv

Applies fully homomorphic encryption to causal structure learning algorithms, enabling discovery of causal graphs over sensitive data without ever decrypting individual records.

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent (arxiv.org)

2026-06-04|paper|arxiv

Uses an LLM-driven agent to generate interpretable, evidence-grounded mobility predictions while reducing computational overhead compared to standard deep learning approaches.
