Last week I was excited to hear about a new LLM paradigm: 1.58-bit large language models (LLMs). In other words, an LLM whose parameters are stored as ternary values {-1, 0, 1} rather than as, say, 16-bit floating-point numbers (1.58 because a three-valued weight carries log2(3) ≈ 1.58 bits of information). This clearly has advantages in terms of latency, memory, and energy consumption, but, surprisingly (to me at least), according to the paper linked below, performance is also comparable.
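To make the idea concrete, here is a minimal sketch of how a tensor of floating-point weights might be mapped to ternary values. It follows the general absmean scheme described in the paper (scale by the mean absolute value, then round and clip to [-1, 1]); the function name and exact details here are illustrative, not the paper's reference implementation.

```python
def absmean_ternary_quantize(weights, eps=1e-8):
    """Quantize a list of float weights to ternary {-1, 0, 1}.

    Sketch of an absmean scheme: scale each weight by the mean
    absolute value of the tensor, then round and clip to [-1, 1].
    The scale gamma lets you approximate the original as q * gamma.
    """
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / gamma))) for w in weights]
    return quantized, gamma

# Example: quantize a few weights
q, gamma = absmean_ternary_quantize([0.9, -0.05, -1.3, 0.4])
# q == [1, 0, -1, 1]
```

Note that each quantized weight now needs under two bits of storage instead of sixteen, and multiplying by {-1, 0, 1} reduces matrix multiplication to additions and subtractions, which is where the latency and energy savings come from.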
The authors demonstrate that, across 1.3B to 70B parameter variants of a LLaMA model, the new 1.58-bit architecture performed comparably in accuracy to a standard 16-bit architecture (and in some cases surprisingly better), while having up to a 7.16x smaller memory footprint and 4.1x lower latency.
Perhaps we’ll see LLMs running on a Raspberry Pi in the near future.