E8 lattice codebook quantization of large language models (LLMs)

RAM prices are currently high, partly because AI companies are buying up memory, and many people want to run LLMs locally on devices with limited GPU memory, such as gaming GPUs and workstation GPUs. With the help of AI coding assistants, I have created an open source library called glq, which quantizes LLM weights using the E8 lattice so that the model takes up less GPU VRAM. glq was inspired by QuIP#, described in the paper "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks".
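To give a feel for what quantizing with the E8 lattice means, here is a minimal sketch of snapping a block of 8 weights to the nearest E8 lattice point. It uses the textbook decomposition of E8 into the D8 lattice (integer vectors with an even coordinate sum) and a copy of D8 shifted by 1/2. This is an illustration only, not glq's actual code: QuIP#-style quantizers use a finite codebook and extra preprocessing, and the names nearest_d8, nearest_e8, quantize_blocks, and the scale parameter are hypothetical.

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        # Parity is wrong: take the coordinate with the largest rounding
        # error and round it the other way, which is the cheapest fix.
        err = x - r
        i = int(np.argmax(np.abs(err)))
        r[i] += 1.0 if err[i] >= 0 else -1.0
    return r

def nearest_e8(x):
    """Nearest point of E8 = D8 united with (D8 + 1/2)."""
    x = np.asarray(x, dtype=np.float64)
    a = nearest_d8(x)              # candidate from the integer coset
    b = nearest_d8(x - 0.5) + 0.5  # candidate from the half-integer coset
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

def quantize_blocks(weights, scale):
    """Hypothetical helper: quantize weights in blocks of 8 with one global scale."""
    w = np.asarray(weights, dtype=np.float64).reshape(-1, 8)
    return np.stack([nearest_e8(row / scale) for row in w]) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16)            # two blocks of 8 made-up weights
w_q = quantize_blocks(w, scale=0.5)
print("max abs error:", np.max(np.abs(w - w_q.ravel())))
```

A real quantizer stores a compact index for each 8-weight block rather than the lattice point itself, which is where the memory saving comes from.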

glq has an advantage over some other quantization methods in the range of 2 to 4 bits per weight.
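As a rough guide to what those bit rates mean for VRAM, the sketch below estimates the memory needed for the weights alone. It assumes a 26-billion-parameter model (taken from the "26B" in the model name) with every weight stored at the same bit rate, and it ignores codebooks, scales, the KV cache, and activations, so real usage will be higher. The function name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 2**30

NUM_PARAMS = 26e9  # assumption: nominal parameter count of a "26B" model

for bits in (16, 4, 3, 2):
    gib = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{bits:>2} bits/weight: {gib:6.1f} GiB")
```

Going from 16 bits to 2 bits per weight cuts the weight storage by a factor of 8. At 2 bits per weight, each block of 8 weights is represented by 16 bits in total.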

For serving speed, I have made glq work together with vLLM, the open source LLM serving engine.

I have quantized models such as Gemma 4 and SmolLM from Hugging Face.

A current lm-eval MMLU-Pro benchmark of Gemma 4 26B it, run on a small sample of n=60, shows glq quantization at 93.3% vs. bf16 at 91.7%. With such a small sample the difference is within the margin of error, but it is still an interesting result: in this benchmark, at this sample size, glq is nearly lossless.
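To make "within the margin of error" concrete, here is a small sketch that computes 95% Wilson score intervals for the two accuracies. The counts 56/60 and 55/60 are inferred from the reported 93.3% and 91.7% at n=60 and are not taken from the raw benchmark output. The function name wilson_interval is hypothetical.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

N = 60
# Assumption: counts inferred from the reported percentages.
results = {"glq": 56, "bf16": 55}

for name, correct in results.items():
    lo, hi = wilson_interval(correct, N)
    print(f"{name:>4}: {correct}/{N} = {correct / N:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The two intervals overlap almost entirely, and the gap between the scores corresponds to a single question out of 60. If both runs used the same 60 questions, a paired comparison would be the more appropriate test, but a larger sample is needed either way before drawing conclusions.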

glq also includes E8 key-value (KV) cache quantization, which was inspired by NexusQuant.

Comparison of small open source LLMs by Artificial Analysis

https://artificialanalysis.ai/models/open-source/small

Sphere packing

https://en.wikipedia.org/wiki/Sphere_packing

E8 Lattice

https://en.wikipedia.org/wiki/E8_lattice

Open source code for glq, the E8 LLM compression library

https://github.com/cnygaard/glq

Python package glq on PyPI (installable with pip)

https://pypi.org/project/glq/