Although large language models (LLMs) such as GPT-4 and LLaMA are rapidly reshaping modern applications, their inference is slow and difficult to optimize because it is based on autoregressive decoding. The latency of an LLM request largely depends on the length of the response or, equivalently, the number of decoding steps, because each autoregressive decoding step yields only one token at a time. Unfortunately, each decoding step makes poor use of the parallel processing capacity of modern GPUs, leaving it largely underutilized. This is a problem for many practical LLM applications, such as chatbots and personal assistants, which must generate long responses at low latency.
Autoregressive decoding can be accelerated with speculative decoding methods such as Medusa and OSD, which follow a "guess-and-verify" strategy: a draft model predicts several future tokens, and the original LLM verifies these predictions in parallel. These methods can reduce latency when the draft's guesses are accepted and fewer decoding steps are needed. They do, however, have limitations. First, the maximum speedup that speculative decoding-based approaches can achieve is bounded by the token acceptance rate, i.e., how accurately the draft model anticipates the main model's outputs. Second, building a reliable draft model is not easy; it typically requires extra training and careful tuning to cope with shifts in traffic over time.
A new study by LMSYS ORG presents lookahead decoding, a novel exact, parallel decoding technique developed to address these difficulties. Although it is computationally prohibitive to have an LLM fill in many subsequent tokens in a single step, it has been observed that an LLM can produce numerous disjoint n-grams simultaneously, and these n-grams could potentially fit into future parts of the generated sequence. Lookahead decoding adapts the classical Jacobi iteration method for parallel decoding, which allows autoregressive decoding to be viewed as solving a system of nonlinear equations. The n-grams that are produced are recorded, verified, and, if appropriate, incorporated into the sequence. Lookahead decoding is particularly notable since it:
Operates without a draft model, which streamlines deployment.
Reduces the total number of decoding steps linearly relative to the log(FLOPs) invested per decoding step.
The researchers demonstrate that lookahead decoding substantially reduces latency, by 1.5x-2.3x, with almost no increase in computational burden. Perhaps most significantly, it enables trading extra computation for reduced latency, albeit with diminishing returns.
They have built their implementation so that lookahead decoding works with huggingface/transformers: users can keep HuggingFace's native generate function and significantly boost its efficiency with just a few lines of code.
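As a rough illustration, the integration is meant to look something like the sketch below. The `lade` package name, its `augment_all` and `config_lade` calls, and the tuning parameters are assumptions based on the project's description of a few-line drop-in; only the transformers calls are standard.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import lade                                   # hypothetical: the lookahead-decoding package
lade.augment_all()                            # hypothetical: patches transformers' generation loop
lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7)  # hypothetical tuning knobs

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(
    "Explain lookahead decoding in one sentence.", return_tensors="pt"
).to(model.device)

# The usual generate() call is unchanged; the patched loop can emit several tokens per step.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```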
Jacobi iteration is a time-tested technique for solving nonlinear systems, and it can likewise be applied to LLM inference to generate tokens in parallel without a draft model. Since each step of Jacobi decoding runs the LLM forward pass on more than one token, it is significantly more expensive in FLOPs than each step of autoregressive decoding. In practice, the researchers observed several difficulties that keep Jacobi decoding from meaningfully improving wallclock performance. While it can decode many tokens over a series of steps, it often places them in the wrong positions, and even correctly anticipated tokens are frequently replaced in subsequent iterations. As a result, few iterations succeed in decoding and correctly positioning several tokens at once, which nullifies the entire point of parallel decoding. On the upside, the extra FLOPs generally do not cause a slowdown, thanks to the parallel processing capabilities of GPUs.
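For intuition, the sketch below shows a minimal, greedy form of Jacobi decoding for a single block of future tokens. It is an illustration of the idea, not the authors' implementation, and the function name and parameters are made up for this example.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, block_size, pad_id, max_iters=None):
    """Greedy Jacobi decoding of one block of `block_size` future tokens.

    All block positions start from an arbitrary guess (here: pad tokens) and are
    updated in parallel each iteration with the model's greedy prediction given
    the current, possibly stale, guesses to their left. The iteration stops at a
    fixed point, which matches ordinary greedy autoregressive decoding of the block.
    """
    device = prefix_ids.device
    guess = torch.full((1, block_size), pad_id, dtype=torch.long, device=device)
    max_iters = max_iters or block_size   # converges in at most block_size iterations
    for _ in range(max_iters):
        seq = torch.cat([prefix_ids, guess], dim=1)
        logits = model(seq).logits        # one forward pass over prefix + all guesses
        # The prediction for block position i comes from the logits one position earlier.
        start = prefix_ids.shape[1] - 1
        new_guess = logits[:, start:start + block_size, :].argmax(dim=-1)
        if torch.equal(new_guess, guess): # fixed point: every guess is self-consistent
            break
        guess = new_guess
    return guess
```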
Lookahead decoding circumvents these shortcomings by capitalizing on Jacobi decoding's ability to generate parallel n-grams. In Jacobi decoding, each new token at a position is decoded from the values at that position in previous iterations, so the process builds a trajectory of historical tokens at every token position, and many n-grams are formed along these trajectories. To exploit this, lookahead decoding collects and caches these n-grams from their trajectories. While performing parallel decoding with Jacobi iterations for future tokens, lookahead decoding simultaneously verifies promising n-grams from the cache.
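One way to picture the cache is as a pool of n-grams keyed by their first token, populated from the per-position trajectories. The class below is an illustrative sketch with made-up names and an assumed n-gram length, not the released data structure.

```python
from collections import defaultdict, deque

N = 4  # assumed n-gram length: N - 1 past iterations plus the current guess

class NGramPool:
    """Illustrative cache of n-grams harvested from Jacobi iteration trajectories."""

    def __init__(self, max_per_key=8):
        # Each key (a token id) maps to recently seen continuations of length N - 1.
        self.pool = defaultdict(lambda: deque(maxlen=max_per_key))

    def add(self, ngram):
        """Cache one n-gram (a sequence of N token ids) keyed by its first token."""
        if len(ngram) == N:
            self.pool[ngram[0]].append(tuple(ngram[1:]))

    def candidates(self, last_accepted_token):
        """Continuations whose n-gram starts with the most recently accepted token."""
        return list(self.pool.get(last_accepted_token, ()))
```

During verification, the most recently accepted token serves as the lookup key, and the returned continuations become the speculation candidates for the next step.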
Each lookahead decoding step is split into two parallel branches, the lookahead branch and the verification branch, to improve efficiency. The lookahead branch maintains a fixed-size, two-dimensional window to generate n-grams from the Jacobi iteration trajectory. At the same time, the verification branch selects and checks promising n-gram candidates.
Since memory bandwidth is the primary bottleneck in LLM decoding, the researchers combine the lookahead and verification branches into a single pass, taking advantage of the GPU’s parallel processing capacity while concealing any associated overheads.
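To make the verification step concrete, the sketch below checks cached continuations against the model's own greedy predictions and keeps the longest verified prefix. It is a simplified stand-in: the actual implementation fuses this check and the lookahead branch into one forward pass with a specially constructed attention mask rather than looping over candidates, and the names here are illustrative.

```python
import torch

@torch.no_grad()
def verify_candidates(model, prefix_ids, candidates):
    """Return the longest run of candidate tokens confirmed by the model.

    `candidates` are continuations (tuples of token ids) drawn from the n-gram
    cache whose key matched the last accepted token. Each candidate is appended
    to the prefix, and its tokens are accepted left to right for as long as they
    agree with the model's greedy predictions.
    """
    best = []
    for cont in candidates:
        cont_ids = torch.tensor([list(cont)], device=prefix_ids.device)
        seq = torch.cat([prefix_ids, cont_ids], dim=1)
        logits = model(seq).logits
        # Logits at position p predict the token at position p + 1.
        preds = logits[0, prefix_ids.shape[1] - 1 : -1, :].argmax(dim=-1).tolist()
        accepted = []
        for guess, pred in zip(cont, preds):
            if guess != pred:
                break
            accepted.append(guess)
        if len(accepted) > len(best):
            best = accepted
    return best  # all of these tokens can be committed in this single decoding step
```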
The team tested different sizes of LLaMA-2-Chat and CodeLLaMA on MT-Bench, HumanEval, and GSM8K to see how effective lookahead decoding is. The technique delivers speedup without any fine-tuning or draft models. Under fp16 precision, they evaluate the 7B, 13B, and 33B models on a single A100 GPU and the 70B model on two A100 GPUs with pipeline parallelism.
LLaMA-2-Chat on MT-Bench: across many model configurations, lookahead decoding achieves a speedup of around 1.5x.
CodeLLaMA on HumanEval: lookahead decoding reduces CodeLLaMA's latency by more than 2x on HumanEval, because code contains many easily guessable n-grams.
CodeLLaMA-Instruct on GSM8K: applying CodeLLaMA-Instruct to GSM8K's mathematical problems, lookahead decoding reduces latency by 1.8x.
The post ‘Lookahead Decoding’: A Parallel Decoding Algorithm to Accelerate LLM Inference appeared first on MarkTechPost.