The UK's AI Security Institute has found that common AI benchmarks systematically underestimate AI agent performance because they impose strict compute limits, according to The Decoder. In practice, these limits cap the number of tokens an agent may spend on a task, so the budget itself shapes the measured capability.

In tests across seven benchmarks, a tenfold increase in the token budget raised success rates on software engineering tasks by roughly 25 percent. The results suggest that prior assessments underestimated AI progress: depending on the token budget allowed, progress at the frontier is approximately 60 percent steeper than previously measured.
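A minimal sketch can make the mechanism concrete. It assumes a simplified model in which each task needs some number of tokens to solve and counts as a failure if the budget runs out first. The `TaskRun` class, the `success_rate` function, and every number below are hypothetical illustrations, not data or methodology from the institute's study.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    tokens_needed: int  # hypothetical tokens the agent needs to finish the task
    solvable: bool      # whether the agent can solve it at all, given enough tokens


def success_rate(runs: list[TaskRun], token_budget: int) -> float:
    """Fraction of tasks solved when every attempt is cut off at token_budget."""
    solved = sum(
        1 for run in runs if run.solvable and run.tokens_needed <= token_budget
    )
    return solved / len(runs)


# Illustrative values only; they are not taken from the study.
runs = [
    TaskRun(tokens_needed=50_000, solvable=True),
    TaskRun(tokens_needed=200_000, solvable=True),
    TaskRun(tokens_needed=800_000, solvable=True),
    TaskRun(tokens_needed=3_000_000, solvable=True),
    TaskRun(tokens_needed=500_000, solvable=False),
]

low_budget = success_rate(runs, token_budget=250_000)
high_budget = success_rate(runs, token_budget=2_500_000)  # tenfold increase

print(f"success rate at low budget:  {low_budget:.2f}")
print(f"success rate at high budget: {high_budget:.2f}")
```

In this toy setup the agent is the same in both runs; only the cutoff changes. That is the sense in which a tight budget can understate what an agent is able to do.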

For Japanese markets, where AI-driven trading and automated systems are increasingly prevalent, the findings underscore the need to revise evaluation standards so that they better capture what AI systems can do and can inform investment strategies.