The AI Wire

180 articles tagged "cv" — page 2 of 6

Reinforced Attention Learning (arxiv.org)

2026-02-05|paper|arXiv

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) t...

cs-CL cs-CV cs-LG

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery [TOP LAB](arxiv.org)

2026-02-03|paper|arXiv

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This...

cs-AI cs-CV cs-LG

Personalized Image Generation via Human-in-the-loop Bayesian Optimization [TOP LAB](arxiv.org)

2026-02-03|paper|arXiv

Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multipl...

cs-CV cs-LG

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation (arxiv.org)

2026-02-02|paper|arXiv

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drif...

cs-CV cs-AI cs-LG

EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers [TOP LAB](arxiv.org)

2026-02-01|paper|arXiv

Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken sc...

cs-CV cs-GR cs-LG

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding (arxiv.org)

2026-01-29|paper|arXiv

Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. Although multimodal models have advanced across languages, th...

cs-CV

Lunar-G2R: Geometry-to-Reflectance Learning for High-Fidelity Lunar BRDF Estimation [TOP LAB](arxiv.org)

2026-01-16|paper|arXiv

We address the problem of estimating realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based ...

cs-CV

WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments (arxiv.org)

2026-01-16|paper|arXiv

We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that...

cs-CV

STEP3-VL-10B Technical Report [TOP LAB](arxiv.org)

2026-01-15|paper|arXiv

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized t...

cs-CV

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning (arxiv.org)

2026-01-15|paper|arXiv

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-...

cs-CV cs-AI cs-LG

Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation [TOP LAB](arxiv.org)

2026-01-14|paper|arXiv

Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare ...

cs-CV

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis (arxiv.org)

2026-01-14|paper|arXiv

Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerabilit...

cs-CV

3AM: Segment Anything with Geometric Consistency in Videos (arxiv.org)

2026-01-14|paper|arXiv

Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional...

cs-CV

More Images, More Problems? A Controlled Analysis of VLM Failure Modes [TOP LAB](arxiv.org)

2026-01-13|paper|arXiv

Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing ben...

cs-CV

SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations (arxiv.org)

2026-01-13|paper|arXiv

Large Language Models have emerged as transformative tools for Security Operations Centers, enabling automated log analysis, phishing triage, and malware explanation; however, deployment in adversaria...

cs-CR cs-CV

Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video (arxiv.org)

2026-01-11|paper|arXiv

We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented...

cs-CV

RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes (arxiv.org)

2026-01-11|paper|arXiv

Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statisti...

cs-CV

QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer (arxiv.org)

2026-01-11|paper|arXiv

Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen...

cs-CV

Prithvi-Complimentary Adaptive Fusion Encoder (CAFE): unlocking full-potential for flood inundation mapping [TOP LAB](arxiv.org)

2026-01-06|paper|arXiv

Geo-Foundation Models (GFMs) have proven effective in diverse downstream applications, including semantic segmentation, classification, and regression tasks. However, in the case of flood mapping using S...

cs-CV

BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models [TOP LAB](arxiv.org)

2026-01-06|paper|arXiv

Vision language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing appro...

cs-CV cs-AI cs-LG

SketchRodGS: Sketch-based Extraction of Slender Geometries for Animating Gaussian Splatting Scenes [TOP LAB](arxiv.org)

2026-01-06|paper|arXiv

Physics simulation of slender elastic objects often requires discretization as a polyline. However, constructing a polyline from Gaussian splatting is challenging as Gaussian splatting lacks connectiv...

cs-GR cs-CV

AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction (arxiv.org)

2026-01-05|paper|arXiv

Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian prim...

cs-CV

FedHypeVAE: Federated Learning with Hypernetwork Generated Conditional VAEs for Differentially Private Embedding Sharing [TOP LAB](arxiv.org)

2026-01-05|paper|arXiv

Federated data sharing promises utility without centralizing raw data, yet existing embedding-level generators struggle under non-IID client heterogeneity and provide limited formal protection against...

cs-LG cs-AI cs-CV

A Comprehensive Dataset for Human vs. AI Generated Image Detection [TOP LAB](arxiv.org)

2026-01-05|paper|arXiv

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of m...

cs-CV cs-AI

Two Deep Learning Approaches for Automated Segmentation of Left Ventricle in Cine Cardiac MRI (arxiv.org)

2026-01-05|paper|arXiv

Left ventricle (LV) segmentation is critical for clinical quantification and diagnosis of cardiac images. In this work, we propose two novel deep learning architectures called LNU-Net and IBU-Net for ...

cs-CV cs-LG

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time (arxiv.org)

2026-01-04|paper|arXiv

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera vi...

cs-CV cs-AI cs-RO

Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control [TOP LAB](arxiv.org)

2026-01-04|paper|arXiv

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control modu...

cs-CV cs-AI

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction (arxiv.org)

2026-01-04|paper|arXiv

Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, inclu...

cs-CV
