The AI Wire

180 articles tagged "cv"

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning [TOP LAB](arxiv.org)

2026-02-27|paper|arXiv

The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data....

Partial recovery of meter-scale surface weather [TOP LAB](arxiv.org)

2026-02-27|paper|arXiv

Near-surface atmospheric conditions can differ sharply over tens to hundreds of meters due to land cover and topography, yet this variability is absent from current weather analyses and forecasts. It ...

cs-LG cs-CV physics-ao-ph

MediX-R1: Open Ended Medical Reinforcement Learning (arxiv.org)

2026-02-27|paper|arXiv

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choi...

VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale (arxiv.org)

2026-02-27|paper|arXiv

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of ...

Mobile-Ready Automated Triage of Diabetic Retinopathy Using Digital Fundus Images [TOP LAB](arxiv.org)

2026-02-26|paper|arXiv

Diabetic Retinopathy (DR) is a major cause of vision impairment worldwide. However, manual diagnosis is often time-consuming and prone to errors, leading to delays in screening. This paper presents a ...

Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences (arxiv.org)

2026-02-26|paper|arXiv

Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformat...

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models [TOP LAB](arxiv.org)

2026-02-25|paper|arXiv

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tac...

Test-Time Training with KV Binding Is Secretly Linear Attention (arxiv.org)

2026-02-25|paper|arXiv

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis rev...

cs-LG cs-AI cs-CV

Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics (arxiv.org)

2026-02-25|paper|arXiv

Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown t...

cs-RO cs-CV cs-LG

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device (arxiv.org)

2026-02-24|paper|arXiv

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We pr...

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction (arxiv.org)

2026-02-24|paper|arXiv

We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, ...

A Very Big Video Reasoning Suite (arxiv.org)

2026-02-24|paper|arXiv

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual env...

cs-CV cs-AI cs-LG

Comparative Assessment of Multimodal Earth Observation Data for Soil Moisture Estimation [TOP LAB](arxiv.org)

2026-02-23|paper|arXiv

Accurate soil moisture (SM) estimation is critical for precision agriculture, water resources management and climate monitoring. Yet, existing satellite SM products are too coarse (>1km) for farm-leve...

Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory (arxiv.org)

2026-02-23|paper|arXiv

Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-t...

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery [TOP LAB](arxiv.org)

2026-02-20|paper|arXiv

In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unob...

cs-CV cs-AI cs-CY

OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents (arxiv.org)

2026-02-20|paper|arXiv

Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sens...

TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos [TOP LAB](arxiv.org)

2026-02-19|paper|arXiv

Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution...

Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching [TOP LAB](arxiv.org)

2026-02-13|paper|arXiv

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, av...

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning [TOP LAB](arxiv.org)

2026-02-13|paper|arXiv

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipati...

SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos (arxiv.org)

2026-02-12|paper|arXiv

Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces...

Learning to Detect Baked Goods with Limited Supervision [TOP LAB](arxiv.org)

2026-02-11|paper|arXiv

Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short s...

SAGE: Scalable Agentic 3D Scene Generation for Embodied AI (arxiv.org)

2026-02-11|paper|arXiv

Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on...

Quantum Multiple Rotation Averaging (arxiv.org)

2026-02-11|paper|arXiv

Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Establis...

Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals [TOP LAB](arxiv.org)

2026-02-10|paper|arXiv

Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate mult...

Autoregressive Image Generation with Masked Bit Modeling (arxiv.org)

2026-02-10|paper|arXiv

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that...

Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening [TOP LAB](arxiv.org)

2026-02-09|paper|arXiv

Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scal...

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images (arxiv.org)

2026-02-09|paper|arXiv

Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we ...

Pseudo-Invertible Neural Networks [TOP LAB](arxiv.org)

2026-02-06|paper|arXiv

The Moore-Penrose Pseudo-inverse (PInv) serves as the fundamental solution for linear systems. In this paper, we propose a natural generalization of PInv to the nonlinear regime in general and to neur...

Shared LoRA Subspaces for almost Strict Continual Learning (arxiv.org)

2026-02-06|paper|arXiv

Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. W...

cs-LG cs-AI cs-CV

Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning (arxiv.org)

2026-02-06|paper|arXiv

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a...