Recurrent Drafter for Fast Speculative Decoding in Large Language Models (Apple Machine Learning Research)
We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that achieves state-of-the-art speedups for large language model (LLM) inference. The performance gains are driven by three key aspects: (1) leveraging a recurrent neural network (RNN) as the draft model, conditioned on the LLM's hidden states, …
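To make the first aspect concrete, below is a minimal sketch, not the paper's implementation, of a recurrent draft head that proposes candidate tokens conditioned on the base LLM's last hidden state; the verifier then checks these drafts in a single forward pass, as in standard speculative decoding. All names here (DraftRNN, draft_tokens, the GRUCell choice, and the dimensions) are illustrative assumptions.

```python
# Hypothetical sketch of an RNN draft head for speculative decoding.
# The RNN state is initialized from the LLM's last-layer hidden state,
# then rolled out greedily to propose a short run of draft tokens.
import torch
import torch.nn as nn

class DraftRNN(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn_cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def draft_tokens(self, llm_hidden: torch.Tensor, last_token: torch.Tensor, steps: int) -> torch.Tensor:
        """Roll out `steps` draft tokens, conditioning on the LLM's hidden state."""
        state = llm_hidden                # (batch, hidden_dim): init state from the LLM
        token = last_token                # (batch,): last accepted token
        drafts = []
        for _ in range(steps):
            state = self.rnn_cell(self.embed(token), state)
            token = self.lm_head(state).argmax(dim=-1)   # greedy draft token
            drafts.append(token)
        return torch.stack(drafts, dim=-1)               # (batch, steps)

# Usage: propose 4 draft tokens, which the base LLM would then verify in
# one forward pass, accepting the longest prefix that matches its own output.
draft_head = DraftRNN(hidden_dim=4096, vocab_size=32000)
hidden = torch.randn(1, 4096)            # stand-in for the LLM's last hidden state
last_tok = torch.tensor([42])
candidates = draft_head.draft_tokens(hidden, last_tok, steps=4)
```

Because the draft head reuses the LLM's hidden state rather than running a separate small transformer, its per-step cost is tiny, which is where the speedup in this family of methods comes from.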