The AI Wire

178 articles tagged "cs-CV" — page 3 of 6

Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control [TOP LAB](arxiv.org)

2026-01-03|paper|arXiv

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control modu...

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction (arxiv.org)

2026-01-03|paper|arXiv

Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, inclu...

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time (arxiv.org)

2026-01-02|paper|arXiv

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera vi...

cs-CV cs-AI cs-RO

Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control [TOP LAB](arxiv.org)

2026-01-02|paper|arXiv

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control modu...

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction (arxiv.org)

2026-01-02|paper|arXiv

Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, inclu...

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time (arxiv.org)

2026-01-01|paper|arXiv

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera vi...

cs-CV cs-AI cs-RO

Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control [TOP LAB](arxiv.org)

2026-01-01|paper|arXiv

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control modu...

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction (arxiv.org)

2026-01-01|paper|arXiv

Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, inclu...

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion (arxiv.org)

2025-12-31|paper|arXiv

Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step d...

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation [TOP LAB](arxiv.org)

2025-12-31|paper|arXiv

Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attenti...

Stochastic Siamese MAE Pretraining for Longitudinal Medical Images [TOP LAB](arxiv.org)

2025-12-31|paper|arXiv

Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approache...

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion (arxiv.org)

2025-12-30|paper|arXiv

Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step d...

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation [TOP LAB](arxiv.org)

2025-12-30|paper|arXiv

Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attenti...

Stochastic Siamese MAE Pretraining for Longitudinal Medical Images [TOP LAB](arxiv.org)

2025-12-30|paper|arXiv

Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approache...

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming (arxiv.org)

2025-12-27|paper|arXiv

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To a...

TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning [TOP LAB](arxiv.org)

2025-12-27|paper|arXiv

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, c...

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation [TOP LAB](arxiv.org)

2025-12-27|paper|arXiv

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (...

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming (arxiv.org)

2025-12-26|paper|arXiv

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To a...

TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning [TOP LAB](arxiv.org)

2025-12-26|paper|arXiv

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, c...

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation [TOP LAB](arxiv.org)

2025-12-26|paper|arXiv

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (...

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming (arxiv.org)

2025-12-25|paper|arXiv

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To a...

TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning [TOP LAB](arxiv.org)

2025-12-25|paper|arXiv

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, c...

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation [TOP LAB](arxiv.org)

2025-12-25|paper|arXiv

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (...

SemanticGen: Video Generation in Semantic Space (arxiv.org)

2025-12-24|paper|arXiv

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality vi...

LongVideoAgent: Multi-Agent Reasoning with Long Videos (arxiv.org)

2025-12-24|paper|arXiv

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summa...

cs-AI cs-CV cs-LG

SpatialTree: How Spatial Abilities Branch Out in MLLMs (arxiv.org)

2025-12-24|paper|arXiv

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most s...

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning (arxiv.org)

2025-12-23|paper|arXiv

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributi...

cs-SD cs-CV cs-LG

Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis [TOP LAB](arxiv.org)

2025-12-23|paper|arXiv

Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-...

Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models (arxiv.org)

2025-12-23|paper|arXiv

Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to ...

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding (arxiv.org)

2025-12-23|paper|arXiv

Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our stud...
