The AI Wire

7 articles tagged "multimodal"

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding (huggingface.co)

2026-06-06|model|huggingface

BRepCLIP applies contrastive multimodal pretraining to CAD boundary representation primitives for geometric understanding.

multimodal contrastive-learning cad geometry

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset (huggingface.co)

2026-06-06|model|huggingface

KITScenes multimodal dataset release targeting autonomous driving research.

autonomous-driving dataset multimodal

Gemini 3 Flash (Google DeepMind) (blog.google)

2026-02-06|model|Google DeepMind

Achieves Gemini 3 Pro-class reasoning at Flash-tier latency and cost. Outperforms Gemini 2.5 Pro while being 3x faster, at less than 1/4 the cost of Gemini 3 Pro. Offers a 1M-token context window and 65K output tokens.

llm frontier multimodal fast

Kimi K2.5 (Moonshot AI) (kimi.com)

2026-02-06|model|HuggingFace / Moonshot AI

Native multimodal model trained on 15T tokens mixing visual and textual data from the start. Agent Swarm technology coordinates up to 100 specialized agents simultaneously, reducing execution time by 4.5x for complex workflows.

llm open-source multimodal moe

SkyReels V3 (Skywork AI) (github.com)

2026-02-06|model|GitHub / HuggingFace

First open-source model supporting three video generation modes in one architecture: multi-subject reference image-to-video, audio-driven avatar generation, and video-to-video editing. Includes intelligent shot-switching for minute-long videos.

open-source video-gen multimodal audio-driven

Sarvam Vision (Sarvam AI) (sarvam.ai)

2026-02-06|model|Sarvam AI

Multilingual document intelligence model supporting all 22 official Indian languages with OCR, visual language understanding, and semantic document parsing. Uses a state-space architecture rather than a transformer.

multimodal ocr multilingual document-intelligence

Qwen2-VL-72B (huggingface.co)

2025-11-06|model|HuggingFace

Strong vision-language understanding with competitive performance on benchmarks. Supports multiple image inputs and high-resolution processing.

multimodal vision-language transformer qwen