Scale AI Research
Scale AI’s mission is to accelerate the development of AI applications. Through our research, we aim to build AI systems capable of solving complex, human-level problems.
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
September 19, 2025
Agents
Safety, Evaluation and Alignment

TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
September 11, 2025
Safety, Evaluation and Alignment

Reliable Weak-to-Strong Monitoring of LLM Agents
August 26, 2025
Safety, Evaluation and Alignment
Oversight

Search-Time Data Contamination
August 13, 2025
Safety, Evaluation and Alignment
Oversight

MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
July 23, 2025
Reasoning
Safety, Evaluation and Alignment

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
July 23, 2025
Science of Data
Post-Training

WebGuard: Building a Generalizable Guardrail for Web Agents
July 21, 2025
Agents
Safety, Evaluation and Alignment

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
July 15, 2025
Reasoning
Oversight
Safety, Evaluation and Alignment

Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
June 28, 2025
Post-Training
Reasoning