Meta has introduced its first lightweight quantized Llama models, which are small and performant enough to run on many popular mobile devices. These instruction-tuned models meet the same quality and safety requirements as the original 1B and 3B models while achieving a 2-4x speedup. They also show an average 56% reduction in model size and an average 41% reduction in memory usage compared to the original BF16 format.
These models offer a reduced memory footprint, faster on-device inference, accuracy, and portability while maintaining quality and safety, letting developers deploy them on resource-constrained devices. The size and memory reductions above were measured on Android OnePlus 12 devices. The community can now deploy the quantized models onto more mobile CPUs, building experiences that are fast and offer more privacy, since interactions stay entirely on the device.
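As a rough back-of-envelope illustration of what those reductions mean in practice (the parameter count and 2-bytes-per-parameter BF16 storage below are assumptions, not figures from the announcement):

```python
# Back-of-envelope estimate; the parameter count and BF16 byte size are
# assumptions for illustration, not numbers from the announcement.
params = 1.24e9                            # approx. Llama 3.2 1B parameter count
bf16_bytes = params * 2                    # BF16 stores 2 bytes per parameter
quantized_bytes = bf16_bytes * (1 - 0.56)  # reported 56% average size reduction

print(f"BF16 weights:      ~{bf16_bytes / 1e9:.2f} GB")
print(f"Quantized weights: ~{quantized_bytes / 1e9:.2f} GB")
```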
The team used two techniques for quantizing the Llama 3.2 1B and 3B models: Quantization-Aware Training (QAT) with LoRA adaptors, which prioritizes accuracy, and SpinQuant, a state-of-the-art post-training quantization method that prioritizes portability.
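To make the post-training side of this concrete, below is a minimal sketch of symmetric group-wise 4-bit weight quantization in plain PyTorch. It is illustrative only: the actual recipes (QAT with LoRA adaptors, SpinQuant's learned rotations and calibration) are considerably more involved, and the function names, group size, and tensor shapes here are assumptions for the example rather than Meta's implementation.

```python
import torch

def quantize_groupwise_4bit(weight: torch.Tensor, group_size: int = 32):
    """Symmetric 4-bit group-wise quantization of a 2-D weight matrix.

    Illustrative sketch only: methods like SpinQuant add rotations and
    calibration on top of a basic step like this.
    """
    out_features, in_features = weight.shape
    # Split each output row into groups of `group_size` input channels.
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, chosen so the largest magnitude maps to the int4 max (7).
    scales = w.abs().amax(dim=-1, keepdim=True) / 7.0
    # Round to the signed 4-bit range [-8, 7]; stored here in an int8 container.
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Recover an approximate floating-point weight matrix for reference."""
    out_features = q.shape[0]
    return (q.float() * scales).reshape(out_features, -1)

if __name__ == "__main__":
    w = torch.randn(64, 128)  # stand-in for a small linear-layer weight
    q, s = quantize_groupwise_4bit(w)
    w_hat = dequantize(q, s)
    print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```

The trade-off the two techniques navigate shows up even in this toy version: smaller groups keep the reconstruction error low (accuracy), while a simple, fixed scheme like this keeps the exported model easy to run across different CPUs (portability).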
Inference with both quantization techniques is supported in the Llama Stack reference implementation via PyTorch's ExecuTorch framework. The team built these quantized models in close collaboration with industry-leading partners and is making them available on Qualcomm and MediaTek SoCs with Arm CPUs.