A Perplexity Benchmark of llama.cpp

Without further ado, here are the results (explanation and discussion follow):

Table 1: Perplexity on wikitext-2 test set.  
Model \ Quantization  q4_0    q4_1    q5_0    q5_1    q8_0    fp16
llama-7b              6.157   6.0915  5.9846  5.948   5.9063  5.68
llama-13b             5.385   5.3608  5.285   5.2702  5.2547  5.09
llama-30b             4.2707  -       -       -       -       4.1
alpaca-30b            4.4521  -       -       -       -       -
llama-2-7b            5.9675  6.0398  5.8328  5.8435  5.7897  -
llama-2-7b-chat       7.7641  7.7853  7.5055  7.5392  7.5014  -
llama-2-13b           5.2172  5.2115  5.1343  5.1289  5.1005  -
llama-2-13b-chat      6.6296  6.7059  6.5336  6.5771  6.5361  -

Other than the fp16 column, these are perplexity numbers obtained by running the perplexity program from llama.cpp on the test set of the wikitext-2 dataset. qM_N denotes a quantization format using M bits per weight, where N selects the underlying quantization algorithm; over the course of llama.cpp's development there have been several attempts to improve quantization even at the same bit width, hence the extra selector.
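
To give a sense of what these block-wise formats do, here is a simplified sketch of q4_0-style 4-bit quantization: weights are grouped into blocks of 32, each block stores one scale, and each weight is rounded to a 4-bit integer. This is only an illustration of the idea, not llama.cpp's actual kernel, and the function names are mine.

```python
import numpy as np

def quantize_block_4bit(block: np.ndarray):
    """Symmetric 4-bit quantization of one 32-weight block (q4_0-style idea).

    One floating-point scale is stored per block, plus a 4-bit integer per
    weight; packed, that comes to roughly 4.5 bits per weight.
    """
    assert block.shape == (32,)
    amax = float(np.abs(block).max())
    scale = amax / 7.0 if amax > 0 else 1.0          # map [-amax, amax] onto ~[-7, 7]
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return scale, q

def dequantize_block_4bit(scale: float, q: np.ndarray) -> np.ndarray:
    """Recover approximate weights; the reconstruction error is what quantization costs."""
    return scale * q.astype(np.float32)

# q4_1-style variants additionally store a per-block minimum (an offset),
# spending a little more memory for a usually more accurate reconstruction.

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)
s, q = quantize_block_4bit(w)
print("max reconstruction error:", np.abs(w - dequantize_block_4bit(s, q)).max())
```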

All of these results were obtained on an NVIDIA L4 instance from Google Cloud Platform, thanks to sponsorship from Philippe Beaudoin, co-founder and CEO at Waverly. The NVIDIA L4 GPU has 24 GB of VRAM, which is enough to run llama-30b quantized at q4_0, but not anything beyond that.

Perplexity measures how well a model predicts the next token given the previous tokens in a text sequence; lower perplexity means better modeling of the given dataset. The table shows that the quantization methods implemented in llama.cpp hold up well. It also suggests that the dominant factor in a large language model's performance is still the number of parameters, even under aggressive quantization.
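
Concretely, perplexity is the exponential of the average negative log-likelihood per token. Here is a minimal sketch (the function is mine, not part of llama.cpp) of computing it from per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token.

    token_logprobs[i] is log p(token_i | tokens_<i), in natural log,
    as assigned by the model to the token that actually appeared.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a model that always gives the observed next token
# probability 1/6 is "as uncertain as a fair die" -- perplexity 6.
print(perplexity([math.log(1 / 6)] * 100))  # ~6.0
```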

To make the point about parameter count more concrete, here is a table of the VRAM required just to hold the model parameters.

Table 2: VRAM requirement for model parameters in MB.
Model \ Quantization  q4_0   q4_1   q5_0   q5_1   q8_0
llama-7b              4090   4484   4877   5271   7240
llama-13b             7656   8422   9188   9954   13784
llama-30b             18555  20481  22407  24333  33964
alpaca-30b            18555  20481  22407  24333  33964
llama-2-7b            4090   4484   4877   5271   7240
llama-2-7b-chat       4090   4484   4877   5271   7240
llama-2-13b           7656   8422   9188   9954   13784
llama-2-13b-chat      7656   8422   9188   9954   13784

Note that these figures cover only the model parameters. When actually running a model, additional memory is needed for intermediate activations, the key-value cache, and inputs/outputs, which is why we were not able to run q4_1 for llama-30b and alpaca-30b even though their parameters alone fit on an L4 GPU.
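
For a back-of-the-envelope check, the sketch below estimates parameter-only memory from the bits per weight that each block format implies. The bits-per-weight figures are my reading of the ggml block layouts and should be treated as approximate; the estimate is a lower bound, so it comes out somewhat below Table 2, which presumably also includes tensors kept at higher precision and other overhead.

```python
# Approximate bits per weight implied by the ggml block formats
# (a block of 32 weights plus a small per-block scale, and for the *_1
# variants also a per-block minimum). Treat these as estimates.
BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_1": 5.0,
    "q5_0": 5.5,
    "q5_1": 6.0,
    "q8_0": 8.5,
    "fp16": 16.0,
}

def param_memory_mib(n_params: float, quant: str) -> float:
    """Approximate memory needed for the weights alone, in MiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**20

# Example: llama-7b (roughly 6.7e9 parameters) at q4_0 vs q8_0.
for q in ("q4_0", "q8_0"):
    print(q, round(param_memory_mib(6.7e9, q)), "MiB")
```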

The memory cost of llama-13b-q4_0 is very close to that of llama-7b-q8_0, yet llama-13b-q4_0 significantly outperforms llama-7b-q8_0 in the perplexity table. The same pattern holds for the llama-2 models. In other words, under the same memory budget, the number of parameters matters more for a large language model's performance than the precision of the quantization.

There are other conclusions one can draw from these numbers, but I will leave those for readers to interpret. I am currently benchmarking the inference speed of these models, and will publish the results in a future blog post.
