Speculative Streaming: Fast LLM Inference Without Auxiliary Models (Apple Machine Learning Research)
Speculative decoding is a prominent technique for speeding up inference of a large target language model using predictions from a smaller auxiliary draft model. While effective, in application-specific settings it often involves fine-tuning both the draft and target models to achieve high acceptance rates. As…
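To make the draft/target interplay concrete, here is a minimal sketch of the accept/reject step at the heart of standard speculative decoding, using toy hand-written token distributions in place of real model logits (the distributions, function names, and vocabulary below are illustrative assumptions, not part of the post). The draft model proposes a token; the target accepts it with probability min(1, p_target/p_draft), and on rejection resamples from the normalized residual, which provably preserves the target distribution.

```python
import random

def sample(dist):
    """Sample a token from a {token: probability} dict."""
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point underflow

def speculative_step(draft_dist, target_dist):
    """One accept/reject step of speculative sampling.

    Returns (token, accepted). The draft proposes a token, which is
    accepted with probability min(1, p_target / p_draft); on rejection
    we resample from the residual max(0, p_target - p_draft), renormalized.
    """
    tok = sample(draft_dist)
    p_d, p_t = draft_dist[tok], target_dist.get(tok, 0.0)
    if random.random() < min(1.0, p_t / p_d):
        return tok, True
    residual = {t: max(0.0, target_dist.get(t, 0.0) - draft_dist.get(t, 0.0))
                for t in target_dist}
    z = sum(residual.values())  # z > 0 whenever rejection is possible
    residual = {t: p / z for t, p in residual.items()}
    return sample(residual), False

# Hypothetical next-token distributions for a single position.
draft = {"cat": 0.6, "dog": 0.3, "bird": 0.1}
target = {"cat": 0.5, "dog": 0.4, "bird": 0.1}
tok, accepted = speculative_step(draft, target)
```

The speedup comes from the draft model proposing several tokens cheaply and the target model verifying them in a single batched forward pass; Speculative Streaming's contribution is removing the separate draft model entirely, which this sketch does not capture.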