Qwen3.5 introduces Gated Delta Network architectures in place of standard attention, together with advanced decoding strategies, across an open-source model family for improved efficiency and performance.
PRISM evaluates LLMs acting as academic peer reviewers across multiple quality dimensions, measuring review accuracy, consistency, and alignment with human expert judgments.
Braintrust uses Codex to automatically translate natural-language customer feature requests directly into executable code, accelerating software delivery.
Boston Children's Hospital deploys AI to identify previously missed or difficult-to-reach diagnoses in pediatric patients from clinical data.
Proposes standardized guidelines for conducting trustworthy third-party AI evaluations, covering methodology, transparency, and conflict-of-interest management.
- vLLM release notes mention Qwen3.5 support as a major new architecture.[2]
- The underlying Qwen3.5 models are typically published on Hugging Face under the Qwen org; vLLM’s notes are a reliable pointer to the family’s capabilities.[2]
OpenAI releases two open-weight reasoning-capable models at 120B and 20B parameter scales, making competitive reasoning model weights publicly accessible.
A community-curated repository for sharing, discovering, and collecting reusable prompt templates for ChatGPT and other LLM interfaces.
Rosalind Biodefense applies AI to biological threat detection and response, strengthening public health infrastructure against pandemic and bioterrorism risks.
OpenAI releases frontier-grade coding and agentic models in two tiers, GPT-5.3-Codex and GPT-5.1-Codex-Max, optimized for software generation and autonomous task execution.
Anthropic upgrades the Claude Opus line to version 4.8, advancing frontier-level capability, likely in reasoning, instruction following, or safety alignment over prior Opus releases.
OpenAI targets enterprise deployment with GPT-5.2, a frontier model series tuned for reliability, compliance, and performance in business-critical applications.
OpenAI rolls out GPT-5.5 as a frontier model with a cyber-focused initial deployment, targeting cybersecurity-related tasks or threat analysis use cases.
Anthropic secured $65 billion in Series H funding, pushing its post-money valuation to $965 billion.
Trains a model to verify its own outputs, then uses those verification signals to improve both fine-tuning and inference-time reasoning.
Improves LLM reasoning efficiency by identifying and sampling only at critical decision branch points rather than uniformly across generation steps.
Provides finite-sample theoretical analysis identifying when, why, and how diffusion-based posterior samplers break down, characterizing failure conditions precisely.
Expands LLMs' effective working memory by enabling latent-space reasoning steps that are maintained across context without being decoded into tokens.
Reduces memory cost of autoregressive video diffusion at minute-scale lengths by compressing key-value caches using low-rank latent factorization.
Hugging Face Transformers provides standardized model definitions, weights, and APIs for loading and running state-of-the-art pretrained language and vision models.
Larger models generalize better on rare tasks because greater capacity reduces inter-task interference and preserves low-frequency training signal that smaller models overwrite.
Extends verifiable reward signals for RLHF beyond math/code by using lightweight corpus-grounded process supervision to train models on factual question answering.
Adapts reward models at inference time using in-context examples of preferences, making preference modeling more robust to distribution shift without retraining.
Derives bounds showing multi-component LLM agent pipelines can produce locally coherent outputs that are globally compositionally inconsistent, quantifying this incoherence gap.
Reduces test-time LLM fine-tuning cost by reconstructing weight updates via convex combinations and caching gradients to avoid redundant recomputation.
Diagnoses and estimates the composition of training data mixtures in large language models by analyzing model weights or outputs without direct data access.
Anthropic released Claude Opus 4.8, a new iteration of their flagship model in the Opus line with updated capabilities.
Claude Code gained a research-preview feature called dynamic workflows, enabling adaptive, condition-driven multi-step agentic task execution.
Dynamic workflows in Claude Code allow the agent to adaptively plan and modify its execution steps at runtime based on intermediate results.
A causal framework exposes gaps between current video generation models and true world models by testing whether generated videos respect cause-and-effect dependencies.