SePO is an agent that iteratively refines LLM system prompts using self-generated feedback, without human intervention.
A method scales reinforcement learning from verifiable rewards for code by decomposing complex programming problems into atomic sub-tasks and recombining them to synthesize new training examples.
A policy optimization method equips LLM agents with a meta-cognitive memory mechanism that tracks and leverages past reasoning experiences to improve performance on long-horizon tasks.
A benchmark evaluates image editing models along multiple dimensions, testing whether edits are both visually correct and consistent with chain-of-thought reasoning requirements.
EvoDS is an autonomous data science agent that improves over time by accumulating reusable skills and managing context efficiently to handle complex analytical workflows.
Reinforcement learning on multilingual tasks causes LLMs to generalize translation ability to previously unseen language pairs by learning to exploit in-context cues.
ArcANE investigates when and how well role-playing LLM agents maintain their assigned character identity versus appropriately breaking character in contextually sensitive situations.
A semi-supervised segmentation framework uses predicted quality scores to weight unlabeled medical images, giving higher training influence to more reliably pseudo-labeled samples.
A benchmark evaluates household robot decision-making in scenarios where competing human values create ethical conflicts with no single correct action.
A diffusion-based generation method splits the denoising process into segments of balanced complexity to improve computational efficiency and output quality.
A code-switching ASR approach generalizes to language pairs unseen during training by learning language-agnostic switching representations transferable across multilingual combinations.
AdaPlanBench measures LLM agents' ability to revise plans dynamically when the world state changes or user constraints shift mid-task.
TIDE discovers multiple latent problems in text or systems by iteratively applying reusable problem templates to surface issues beyond those initially targeted.
AdaCodec generates adaptive visual token codes that predict future-frame relevance, enabling video multimodal LLMs to allocate representation capacity more efficiently.
A music recommendation system fuses audio features, lyrics, and user context through LLMs to produce semantically grounded, multimodal song suggestions.
A quote or statement from Emanuel Maiberg of 404 Media, likely offering journalistic commentary on an AI-related topic.
An opinion piece contrasts AI enthusiasts racing to deploy capabilities before safety catches up with skeptics watching the whole effort degrade under its own contradictions.
Hugging Face redesigned its CLI so that autonomous agents can programmatically discover, upload, and manage Hub resources through structured, predictable commands.
EVA-Bench Data 2.0 expands an evaluation suite to 3 domains, 121 tools, and 213 scenarios for testing AI agents on diverse real-world tasks.
Nemotron 3.5 Content Safety delivers a customizable multimodal safety system designed for enterprise AI deployments across diverse global regulatory and cultural contexts.
Applies AI-driven intelligence analysis techniques to strengthen biological threat detection, attribution, and defensive response capabilities at national security scale.
ChatGPT gains a persistent memory architecture that retains user context across conversations, making responses more personalized and contextually relevant over time.
Endava restructures its software development lifecycle by delegating discrete engineering tasks to autonomous AI agents, reducing human bottlenecks in delivery pipelines.
CERN's CASTOR system provides a large-scale hierarchical storage management solution for archiving and retrieving the massive data volumes produced by particle physics experiments.
The Pentagon runs influence operations that produce AI-generated Spanish-language propaganda targeting Latin American populations to shape geopolitical narratives.
A post-training method internalizes multi-agent debate into a single model's latent space, enabling self-refinement without requiring multiple separate model instances at inference.
Describes and analyzes specific AI experiments conducted in the game of Go, clarifying the design choices and outcomes behind notable research milestones.
Magenta RealTime 2 releases open, locally runnable generative music models capable of producing and responding to live musical input in real time.
Fine-tunes a large language model on vintage technical writing corpora to reproduce the terse, structured documentation style characteristic of 1990s software manuals.
Gemma 4 12B processes both text and images within a single encoder-free architecture, unifying multimodal understanding without separate vision encoders.