Faster LLMs with speculative decoding and AWS Inferentia2
In recent years, we have seen a big increase in the size of large language models (LLMs) used to solve natural language processing (NLP) tasks such as question answering and text summarization. Larger models with more parameters, which are in the order of hundreds of billions at the time of writing, tend to produce better …
Faster LLMs with speculative decoding and AWS Inferentia2 Read More »