SpecEdge reduces the cost of running LLMs by offloading work from expensive data-center GPUs to affordable, consumer-grade edge GPUs

https://www.eurekalert.org/news-releases/1111289

"speculative decoding splits workload: edge GPU predicts likely token (word) sequence, data center GPU batch verifies sequences... eliminates idle time/ speeds overall response... 67.6% reduced cost/ token, 1.91x more efficient, increased processing capacity, 2.22X increased server throughput... works over standard internet, servers handle verification requests from multiple edge devices simultaneously... smartphones, neural processing units"
