Proposes a method that jointly trains agent memory and exploration behavior using novelty-based signals to improve navigation and discovery in unknown environments.
Uses unmodified LLMs to score intermediate reasoning steps in math problems at inference time, replacing trained process reward models without any additional training.
Releases an open framework for training visual web agents with online multi-turn RL, clarifying implementation details that enable agents to learn from live browser interactions.
Benchmarks LLM agents on personal productivity tasks by simulating realistic personal data environments, testing performance on real-world applications like calendars and email.
Reports that attackers used social engineering prompts to manipulate Meta AI into granting unauthorized access to high-profile Instagram accounts.
Describes a tool or feature enabling users to directly edit files that have been pasted into an interface, streamlining in-context file modification.
Argues that enterprise AI scaling bottlenecks stem from agent orchestration logic rather than LLM capability, advocating for purpose-built agent architectures over raw model scaling.
JetBrains releases Mellum2, a 12-billion-parameter mixture-of-experts language model, likely targeting developer-focused coding and IDE assistance tasks.
Announces infrastructure investment in Michigan to build data centers or computing facilities supporting AI workloads as part of a broader national AI build-out.
Articulates an organization's official positions on AI governance policy and the boundaries of appropriate political engagement or lobbying activity.
Anthropic has filed a confidential draft S-1 registration statement with the SEC, initiating the regulatory process toward a potential public offering.
xAI has released Composer 2.5, a code/content composition tool, now integrated into the Grok Build development environment.
A survey or advocacy piece covers the resurgence of terminal user interface tools, highlighting strace-ui and Bonsai_term as examples of the TUI revival.
A system simulates agent behavior or cognition using the Free Energy Principle as the computational and theoretical foundation.
A method categorizes individual sentences in clinical notes by their source discipline, enabling fine-grained provenance tracking for multidisciplinary hospital-stay summaries.
RASER routes multi-hop questions selectively to more powerful models only when earlier reasoning steps are detectably unrecoverable, reducing unnecessary escalation cost.
Theoretical analysis characterizes which functions deep neural networks built on congruence-based operations can express when inputs are symmetric positive-definite matrices.
An audit framework analyzes how LLMs frame responses communicatively (hedging, assertiveness, stance) independently of factual content, revealing systematic stylistic biases.
A monitoring framework detects and flags unsafe or erroneous behaviors in agentic AI systems during deployment before those systems have achieved reliable performance.
LLM agents handle the final refinement stage of time series forecasting, where standard models underperform due to domain-specific or contextual gaps.
CRAM uses centroid-based token routing and an adaptive mixture-of-experts architecture to enable multimodal models to continually learn new instruction-following tasks without catastrophic forgetting.
A review synthesizes how generative models, multimodal learning, and closed-loop experimental workflows are combined to autonomously discover and design new materials with target properties.
Uses LLMs to extract ADHD-related behavioral signals from free-text teacher narratives in Turkish, going beyond the limits of structured rating scales.
Reformulates optimal transport between Gaussian mixture models as a biconvex optimization problem guaranteed to have a unique, stable solution.
Introduces a preference optimization method that progressively shifts the reference distribution during training to align one-step generative models with human preferences.
Constructs a benchmark targeting short, precise temporal moments in video to diagnose whether multimodal LLMs correctly localize and interpret brief visual events.
Provides a labeled dataset pairing fine-grained suicide risk severity annotations with figurative language categories extracted from suicide-related memes.
Proposes a monotonic adaptive norm-rescaling optimizer that reduces sensitivity to hyperparameter choices when training on long-tailed class distributions.
Audits financial LLMs for asset-specific biases toward Bitcoin by analyzing their internal representations and the resulting portfolio allocation recommendations.
Distills safety-aligned behavior into a smaller model by localizing safety-critical layers and applying on-policy distillation only to those components.