Ψ-Bench evaluates how well conversational AI systems tailor persuasive dialogue strategies to individual user personas and psychological profiles.
A KV cache eviction policy for reasoning models selectively discards cache entries based on their estimated contribution to output value, reducing memory without degrading reasoning quality.
MERIT learns disentangled latent representations of music that separate independent attributes such as melody, rhythm, and timbre to improve audio similarity retrieval.
Linear probes trained to detect deceptive internal states in LLMs are stress-tested for robustness under adversarial pressure, with analysis of how deception organizes geometrically in representation space.
This work combines world models, which handle concrete environment dynamics, with language models, which handle abstract reasoning, and shows that the two approaches are complementary rather than competing.
A local perturbation theory formalizes how policy updates in one domain cause interference in others during multi-domain RL and derives recovery conditions to restore cross-domain performance.
This work analyzes long chain-of-thought training traces in which the final answer is correct but intermediate reasoning steps contain harmful continuations, diagnosing how such traces arise and the training risks they pose.
ClawHub analyzes malware signals by reconciling disagreements between VirusTotal verdicts, static analysis findings, and SkillSpector detections to improve security assessments.
AutoMedBench evaluates agentic AI systems on automated medical research tasks, benchmarking their ability to autonomously conduct and validate biomedical investigations.
TRON provides rule-verifiable online environments specifically designed for training visual reasoning agents via reinforcement learning with objectively checkable rewards.
A small RL controller guides token sampling decisions of a large language model at test time, improving output quality without retraining the LLM.
Decoupled residual denoising separates content and style pathways in a diffusion model to enable unified image-to-image translation across multiple tasks with fewer training examples.
PaddleOCR-VL-1.6 improves document parsing by targeting previously under-optimized layout regions and applying a progressive post-training strategy to boost recognition accuracy.
micropython-wasm 0.1a0 is an initial alpha release enabling MicroPython to run as a WebAssembly module.
A news item covering the California Brown Pelican, likely reporting on its conservation status, population trends, or ecological observations.
micropython-wasm 0.1a1 is a follow-up alpha release of the MicroPython WebAssembly package, delivering early fixes or improvements over 0.1a0.
datasette-agent-micropython 0.1a0 is an initial alpha plugin integrating MicroPython-based agentic capabilities into the Datasette data exploration tool.
Microsoft announced new MAI frontier models, signaling expanded investment in its own internally developed AI systems beyond existing partnerships.
Holo3.1 is a fast, locally running computer-use agent system that executes GUI tasks on-device without requiring cloud inference.
OpenAI positions Codex as a mainstream productivity tool extending beyond professional developers to broader everyday users.
OpenAI outlines global policy and partnership initiatives aimed at protecting youth safety and expanding educational or economic opportunities.
OpenAI expands Codex integration across diverse professional roles, development tools, and organizational workflows beyond software engineering.
Insurance company Travelers deploys an OpenAI-powered system to automate or assist claims processing nationwide.
OpenAI announces an expansion of Project Glasswing, likely a safety or societal-impact initiative, as of June 2026.
Paseo is an open-source, visually refined user interface for interacting with coding agents, released as a Show HN project.
University of Toronto researchers built a proof-of-concept AI worm capable of propagating attacks across internet-connected devices regardless of platform.
Hedge-Bench introduces a benchmark of hard, realistic financial reasoning tasks designed to evaluate agent performance on complex economic decision-making.
A forecasting model predicts how the concept of quantum computing spreads and is adopted across the scientific literature over time.
A contrastive learning framework improves neural algorithmic reasoning for solving the graph coloring problem by contrasting valid and invalid colorings.
New methods update or correct specific factual knowledge stored in masked diffusion language models without full retraining.