LLMs maintain high inference speeds without the efficiency tax of poor guesses by timing drafter downtime, pausing the small draft model when its predictions stop paying off
https://techxplore.com/news/2026-02-drafter-downtime-llm.html "standard speculative decoding speeds up generation by having a small drafter model predict several tokens ahead, which the larger target model then verifies in parallel, but during hard or highly creative passages the drafter's guesses keep failing and its overhead slows generation down... overcome with a confidence gate that monitors the drafter's accuracy in real time; when it drops below a threshold the small model is put to sleep and the large model takes over alone until the text becomes more predictable again... 25-30% increased throughput... edge computing, mobile AI"
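the gating loop is simple enough to sketch. a minimal toy Python sketch of confidence-gated speculative decoding, assuming the mechanism described above; every name (draft_propose, target_verify, GATE_THRESHOLD, SLEEP_STEPS) and the random stand-in models are illustrative assumptions, not the paper's implementation:

```python
import random

# --- toy stand-ins for real models (assumptions for illustration) ---

def draft_propose(prefix, k):
    """Drafter proposes k tokens plus a per-token confidence score."""
    tokens = [random.randrange(100) for _ in range(k)]
    confs = [random.random() for _ in tokens]
    return tokens, confs

def target_verify(prefix, proposed):
    """Target checks the proposed tokens in one parallel pass and
    returns how many it accepts, plus one token of its own."""
    accepted = 0
    for _ in proposed:
        if random.random() < 0.7:  # toy acceptance model
            accepted += 1
        else:
            break
    return accepted, random.randrange(100)

def target_step(prefix):
    """Target generates a single token alone (drafter asleep)."""
    return random.randrange(100)

# --- confidence-gated loop (threshold/sleep values are assumptions) ---

GATE_THRESHOLD = 0.4  # confidence cutoff that triggers drafter sleep
SLEEP_STEPS = 8       # how long the drafter sleeps before retrying
K = 4                 # tokens drafted per round

def generate(prompt, max_tokens):
    out = list(prompt)
    sleep = 0
    while len(out) - len(prompt) < max_tokens:
        if sleep > 0:
            # Drafter asleep: target decodes alone, no drafting overhead.
            out.append(target_step(out))
            sleep -= 1
            continue
        tokens, confs = draft_propose(out, K)
        # Gate: if drafter confidence collapses (a proxy for its
        # real-time accuracy), stop drafting for a while.
        if min(confs) < GATE_THRESHOLD:
            sleep = SLEEP_STEPS
            out.append(target_step(out))
            continue
        accepted, next_token = target_verify(out, tokens)
        out.extend(tokens[:accepted])
        out.append(next_token)  # target always yields one token per pass
    return out[:len(prompt) + max_tokens]

print(generate([1, 2, 3], 32))
```

the win is that during predictable stretches each target pass commits up to K+1 tokens, while the gate prevents paying the drafter's overhead when its acceptance rate would be near zero anyway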