For comparison, on an Apple M1 Pro device, the f16 version of the Llama2 7B model used in the WebLLM chat demo runs significantly faster than the f32 version, delivering a 28% improvement in prefill speed and a 41% improvement in decoding speed, as shown in the following screenshots.
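
Because f16 shader arithmetic in WebGPU is gated behind the `shader-f16` feature, a page can check for it before deciding which model build to load. Below is a minimal sketch of that check; the model IDs are hypothetical placeholders for illustration, not the demo's exact identifiers.

```typescript
// Minimal sketch: detect WebGPU f16 support before choosing a model variant.
// The model ID strings here are illustrative placeholders.
async function pickModelVariant(): Promise<string> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error("WebGPU is not available in this browser.");
  }
  // "shader-f16" is the WebGPU feature that gates f16 types in shaders.
  const hasF16 = adapter.features.has("shader-f16");
  return hasF16
    ? "Llama-2-7b-chat-q4f16_1"  // f16 build: faster prefill and decoding
    : "Llama-2-7b-chat-q4f32_1"; // f32 fallback for adapters without f16
}
```

A runtime check like this lets the same page serve both configurations, falling back to the f32 build on adapters that do not expose `shader-f16`.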