BRepCLIP applies contrastive multimodal pretraining to CAD boundary representation primitives for geometric understanding.
KITScenes is a multimodal dataset release targeting autonomous driving research.
Achieves Gemini 3 Pro-class reasoning at Flash-tier latency and cost. Outperforms 2.5 Pro while running 3x faster, at less than 1/4 the cost of 3 Pro. 1M-token context, 65K output tokens.
Native multimodal model trained on 15T tokens mixing visual and textual data from the start. Agent Swarm technology coordinates up to 100 specialized agents simultaneously, reducing execution time by 4.5x for complex workflows.
First open-source model supporting three video generation modes in one architecture: multi-subject reference image-to-video, audio-driven avatar generation, and video-to-video editing. Intelligent shot-switching for minute-level durations.
Multilingual document intelligence model supporting all 22 official Indian languages with OCR, visual language understanding, and semantic document parsing. Uses a state-space architecture rather than a transformer.
Strong vision-language understanding with competitive performance on benchmarks. Supports multiple image inputs and high-resolution processing.
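
For readers unfamiliar with the contrastive multimodal pretraining mentioned in the BRepCLIP entry, here is a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) loss. BRepCLIP's actual objective, encoders, and hyperparameters are not described here, so everything in this block is an assumption for illustration: the function names, the `temperature` value, and the idea that one embedding per B-rep primitive is paired with one text embedding are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_style_loss(geom_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    geom_embs[i] and text_embs[i] are treated as a matching pair; every
    other combination in the batch serves as a negative. The temperature
    default is an illustrative assumption, not a value from BRepCLIP.
    """
    geom = [l2_normalize(v) for v in geom_embs]
    text = [l2_normalize(v) for v in text_embs]
    n = len(geom)

    # Cosine similarity between every geometry/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(geom[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Geometry-to-text direction: each row should peak on its diagonal entry.
    loss_g2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-geometry direction: same idea applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2g = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_g2t + loss_t2g)


if __name__ == "__main__":
    geometry = [[1.0, 0.0], [0.0, 1.0]]
    captions_aligned = [[1.0, 0.0], [0.0, 1.0]]
    captions_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", clip_style_loss(geometry, captions_aligned))
    print("swapped loss:", clip_style_loss(geometry, captions_swapped))
```

The loss is small when each geometry embedding is most similar to its own caption and large when the pairs are mismatched, which is the signal that pulls matching modalities together during pretraining.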