Qwen3.5 introduces Gated Delta Network architectures that replace standard attention, together with advanced decoding strategies, across an open-source model family for improved efficiency and performance.
OpenAI releases two open-weight reasoning-capable models at 120B and 20B parameter scales, making competitive reasoning model weights publicly accessible.
OpenAI releases frontier-grade coding and agentic models in two tiers, GPT-5.3-Codex and GPT-5.1-Codex-Max, optimized for software generation and autonomous task execution.
Anthropic upgrades the Claude Opus line to version 4.8, advancing frontier-level capability, likely in reasoning, instruction following, or safety alignment over prior Opus releases.
OpenAI targets enterprise deployment with GPT-5.2, a frontier model series tuned for reliability, compliance, and performance in business-critical applications.
OpenAI rolls out GPT-5.5 as a frontier model with a cyber-focused initial deployment, targeting cybersecurity-related tasks or threat analysis use cases.
PRISM evaluates LLMs acting as academic peer reviewers across multiple quality dimensions, measuring review accuracy, consistency, and alignment with human expert judgments.
Reformulating uniform diffusion models with a leave-one-out denoiser and absorbing-state perspective yields a theoretically cleaner, more stable training and inference framework for discrete generative models.
EarlyTom compresses video token representations at early transformer layers, dramatically reducing computation while preserving sufficient temporal information for accurate video understanding.
CoHyDE jointly and iteratively trains an LLM query rewriter alongside a dense encoder so both components mutually improve retrieval of the correct external tools for a given query.
Xetrieval applies mechanistic interpretability methods to dense retrieval models, identifying which internal circuits and representations drive document-query similarity scoring.
Introduces checkpoint-based repair mechanisms that recover failed Program-of-Thought reasoning chains mid-execution rather than restarting from scratch.
Builds consistent 3D Gaussian head avatars from single-view inputs by enforcing multi-view consistency without requiring multi-view generative models.
Applies consistency training to penalize contradictory political outputs, reducing a model's susceptibility to manipulation through framing or loaded prompts.
Deploys a compact vision-language model for time-series anomaly detection, achieving trustworthy reasoning under tight computational constraints.
Fuses visual, language, and dynamics modalities with tri-modal guidance to learn robot perception representations robust to dynamic scene changes.
Formulates accent-robust language identification as a convex optimization problem to improve speech recognition accuracy under low-resource accent conditions.
Distills agent skills online into a multimodal AI agent framework, reducing computation while preserving task performance across diverse modalities.
Evicts KV cache entries based on attention confidence scores and stores retained entries at mixed precision, reducing memory use in long-context LLM inference.
Probes vision-language models to diagnose why distant objects are systematically encoded as higher up in their spatial representations, revealing a geometric bias.
Uses language model function-calling as a reflective feedback loop to iteratively refine soft prompt tuning based on self-generated evaluative signals.
Provides a beginner-oriented tutorial on using torch.profiler to identify performance bottlenecks in PyTorch training and inference workflows.
Proposes standardized guidelines for conducting trustworthy third-party AI evaluations, covering methodology, transparency, and conflict-of-interest management.
Rosalind Biodefense applies AI to biological threat detection and response, strengthening public health infrastructure against pandemic and bioterrorism risks.
Braintrust uses Codex to automatically translate natural-language customer feature requests directly into executable code, accelerating software delivery.
Boston Children's Hospital deploys AI to identify previously missed or difficult-to-reach diagnoses in pediatric patients from clinical data.
Anthropic released Claude Opus 4.8, a new iteration of its flagship model in the Opus line with updated capabilities.
Anthropic secured $65 billion in Series H funding, pushing its post-money valuation to $965 billion.
A 60-second interactive game simulates the repetitive approval prompts AI agents generate, highlighting user fatigue from constant permission requests.
Catalogs recurring anti-patterns and problematic behaviors observed across large language models, analogous to code smells in software engineering.
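
As a rough illustration of the KV cache item above (evicting entries by attention confidence and storing the retained ones at mixed precision), the idea might be sketched as follows. This is a minimal toy under stated assumptions, not the method from the work summarized: the names `CacheEntry`, `quantize`, and `compress_cache`, the `keep` and `high_precision` parameters, and the uniform quantizer are all hypothetical, and `scores` stands in for whatever confidence signal the actual approach computes.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    position: int   # index of the token in the original sequence
    values: list    # stored key/value numbers (flattened for simplicity)
    precision: str  # "high" (stored as-is) or "low" (quantized)


def quantize(values, levels=16):
    """Snap values onto `levels` evenly spaced steps (a stand-in for low-bit storage)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def compress_cache(entries, scores, keep, high_precision):
    """Evict all but the `keep` highest-scoring entries.

    The top `high_precision` entries are stored unchanged; the remaining
    retained entries are quantized. Both thresholds are assumptions made
    for this sketch.
    """
    ranked = sorted(range(len(entries)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:keep]
    full = set(ranked[:high_precision])
    compressed = []
    for i in sorted(kept):  # restore original token order
        if i in full:
            compressed.append(CacheEntry(i, list(entries[i]), "high"))
        else:
            compressed.append(CacheEntry(i, quantize(entries[i]), "low"))
    return compressed


# Four toy cache entries with hypothetical confidence scores.
entries = [[0.0, 1.0], [0.2, 0.4], [0.5, 0.5], [0.1, 0.9]]
scores = [0.9, 0.1, 0.5, 0.7]
cache = compress_cache(entries, scores, keep=3, high_precision=1)
for entry in cache:
    print(entry.position, entry.precision, entry.values)
```

In this toy run the lowest-scoring entry (position 1) is evicted, the highest-scoring entry keeps its original values, and the other two retained entries are stored in quantized form. A real implementation would operate on per-head key and value tensors and would use an actual low-bit format rather than this uniform quantizer.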