The AI Wire

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only r...

cs-AI cs-CL cs-CR

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models [TOP LAB](arxiv.org)

2026-02-25|paper|arXiv

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tac...

cs-CV cs-AI

Airavat: An Agentic Framework for Internet Measurement [TOP LAB](arxiv.org)

2026-02-25|paper|arXiv

Internet measurement faces twin challenges: complex analyses require expert-level orchestration of tools, yet even syntactically correct implementations can have methodological flaws and can be diffic...

cs-NI cs-AI cs-SE

Test-Time Training with KV Binding Is Secretly Linear Attention (arxiv.org)

2026-02-25|paper|arXiv

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis rev...

cs-LG cs-AI cs-CV

A Very Big Video Reasoning Suite (arxiv.org)

2026-02-24|paper|arXiv

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual env...

cs-CV cs-AI cs-LG

Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System [TOP LAB](arxiv.org)

2026-02-23|paper|arXiv

In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog c...

cs-CL cs-AI

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery [TOP LAB](arxiv.org)

2026-02-20|paper|arXiv

In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unob...

cs-CV cs-AI cs-CY

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability [TOP LAB](arxiv.org)

2026-02-20|paper|arXiv

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly f...

cs-AI cs-CL cs-IR

What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data [TOP LAB](arxiv.org)

2026-02-20|paper|arXiv

Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions. Prior work shows that PD can resurface, yet u...

cs-HC cs-AI cs-CL

Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes [TOP LAB](arxiv.org)

2026-02-19|paper|arXiv

The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a ...

cs-LG cs-AI

DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows [TOP LAB](arxiv.org)

2026-02-19|paper|arXiv

Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps -- SciOps -- yet common approaches fragment provenance across ...

cs-DB cs-AI

Interpretability-by-Design with Accurate Locally Additive Models and Conditional Feature Effects [TOP LAB](arxiv.org)

2026-02-19|paper|arXiv

Generalized additive models (GAMs) offer interpretability through independent univariate feature effects but underfit when interactions are present in data. GA$^2$Ms add selected pairwise interactions...

cs-LG cs-AI

Lifelong Scalable Multi-Agent Realistic Testbed and A Comprehensive Study on Design Choices in Lifelong AGV Fleet Management Systems [TOP LAB](arxiv.org)

2026-02-18|paper|arXiv

We present Lifelong Scalable Multi-Agent Realistic Testbed (LSMART), an open-source simulator to evaluate any Multi-Agent Path Finding (MAPF) algorithm in a Fleet Management System (FMS) with Automate...

cs-RO cs-AI

Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment (arxiv.org)

2026-02-13|paper|arXiv

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress t...

cs-RO cs-AI eess-SY

GameDevBench: Evaluating Agentic Capabilities Through Game Development [TOP LAB](arxiv.org)

2026-02-12|paper|arXiv

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software dev...

cs-AI cs-CL cs-SE

ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression [TOP LAB](arxiv.org)

2026-02-12|paper|arXiv

We present ROCKET, a training-free model compression method that achieves state-of-the-art performance in comparison with factorization, structured-sparsification and dynamic compression baselines. Op...

cs-LG cs-AI cs-CL

A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models [TOP LAB](arxiv.org)

2026-02-11|paper|arXiv

How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain ge...

cs-CL cs-AI

Biases in the Blind Spot: Detecting What LLMs Fail to Mention (arxiv.org)

2026-02-11|paper|arXiv

Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their...

cs-LG cs-AI

CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute [TOP LAB](arxiv.org)

2026-02-10|paper|arXiv

Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a...

cs-AI cs-CL

GPT-oss-120B / GPT-oss-20B (OpenAI) (github.com)

2026-02-06|model|GitHub / HuggingFace

OpenAI's first open-weight LLMs since GPT-2 (2019). Apache 2.0 license. Trained with RL and distillation from o3 and frontier internal models. GPT-oss-120B runs on a single 80GB GPU; GPT-oss-20B runs on 16GB edge devices.

open-source apache-2 reasoning edge

NVIDIA Cosmos Reason 2 / Isaac GR00T N1.6 (nvidianews.nvidia.com)

2026-02-06|model|NVIDIA (HuggingFace)

Cosmos Reason 2 is an open reasoning VLM enabling machines to see, understand, and act in the physical world. GR00T N1.6 is a vision-language-action (VLA) model for humanoid robots integrating egocentric camera streams, robot states, and language instructions into a unified policy.

robotics physical-ai nvidia open-source

Qwen3-Max-Thinking (Alibaba) (qwen.ai)

2026-02-06|model|Alibaba Cloud / Qwen

Flagship reasoning model with adaptive tool use: it invokes retrieval and a code interpreter on demand during inference. Advanced test-time scaling via RL.

llm reasoning trillion-param tool-use

SkyReels V3 (Skywork AI) (github.com)

2026-02-06|model|GitHub / HuggingFace

First open-source model supporting three video generation modes in one architecture: multi-subject reference image-to-video, audio-driven avatar generation, and video-to-video editing. Intelligent shot-switching for minute-level durations.

open-source video-gen multimodal audio-driven

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps [TOP LAB](arxiv.org)

2026-02-06|paper|arXiv

Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We ...

cs-LG cs-AI

Shared LoRA Subspaces for almost Strict Continual Learning (arxiv.org)

2026-02-06|paper|arXiv

Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. W...

cs-LG cs-AI cs-CV
