Half the model size with almost no loss of accuracy: TensorFlow launches a half-precision floating point quantization tool, with an online demo
Yuyang from Aofei Temple
Quantum Bit Report | Public Account QbitAI
Recently, the TensorFlow model optimization toolkit has added a new member, the post-training half-precision floating point quantization (float16 quantization) tool.
With it, a model can be compressed to half its size with almost no loss of accuracy, and latency on CPUs and hardware accelerators can also be reduced.
This suite of tools includes hybrid quantization, full integer quantization, and pruning.
You can choose whichever quantization method fits your needs.
Compress the size without losing accuracy
Double precision is 64 bits, single precision is 32 bits, and the so-called half-precision floating point number is stored using 2 bytes (16 bits).
Compared to 8-bit or 16-bit integers, half-precision floating-point numbers have the advantage of a higher dynamic range; compared to single-precision floating-point numbers, they can save half the storage space and bandwidth.
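To make that trade-off concrete, here is a small illustrative sketch using NumPy (NumPy is used here purely for illustration and is not part of the TensorFlow Lite workflow):

import numpy as np

# Dynamic range: the largest magnitude float16 can represent far exceeds int8's
print(np.finfo(np.float16).max)       # 65504.0
print(np.iinfo(np.int8).max)          # 127

# Storage: float16 uses half the bytes of float32
print(np.dtype(np.float16).itemsize)  # 2 bytes per value
print(np.dtype(np.float32).itemsize)  # 4 bytes per value

# The price is precision: float16 keeps only about 3 decimal digits
print(np.finfo(np.float16).eps)       # ~0.000977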
Compared with double-precision and single-precision floating point numbers, half-precision floating point numbers are obviously not as suitable for calculations. So the question is, why do we actively reduce the precision?
In fact, many application scenarios do not require such high precision. In distributed deep learning in particular, a model may have thousands upon thousands of parameters, and the resulting models can be enormous. If all constant values can be stored as 16-bit floating point numbers instead of 32-bit ones, the model size shrinks to half, which is a considerable saving.
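As a rough illustration (again a toy NumPy sketch, not the actual TensorFlow Lite conversion), casting a weight array from float32 to float16 halves its memory footprint while introducing only a tiny per-value rounding error:

import numpy as np

# A toy stand-in for one layer's weights
weights_fp32 = np.random.randn(1000, 1000).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes)  # 4000000 bytes
print(weights_fp16.nbytes)  # 2000000 bytes, exactly half

# The rounding error per weight is tiny
max_err = np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
print(max_err)              # on the order of 1e-3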
But if the size is compressed like this, won't accuracy be lost?
Reducing the precision of floating-point numbers will of course result in a loss of accuracy, but don't worry, the loss is small enough to be negligible.
Testing the standard MobileNet float32 models and their float16 variants on the ILSVRC 2012 image classification task shows that the top-1 and top-5 accuracy loss of the fp16 models is less than 0.03% for both MobileNet v1 and MobileNet v2.
The same holds for the object detection task: the fp16 variants show almost no accuracy loss compared to the standard models.
For both MobileNet v1 and MobileNet SSD, the size of the fp16 variant is about half that of the standard model.
Smaller size and nearly the same accuracy: why not give the half-precision floating point quantization tool a try?
Easy to use
Converting a trained 32-bit model to 16-bit is not complicated; only two key lines of code need to be set.
In the TensorFlow Lite converter, set the optimizations for the 32-bit model to DEFAULT, and set the target specification's supported types to FLOAT16:
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.lite.constants.FLOAT16]
tflite_quant_model = converter.convert()
After the model is converted successfully, it can be run directly.
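For example, the converted model can be written to disk and executed with the TensorFlow Lite interpreter. The sketch below is only illustrative; the file name and the zero-filled input are made up for the example:

import numpy as np
import tensorflow as tf

# Save the float16-quantized model (the file name is arbitrary)
with open('model_fp16.tflite', 'wb') as f:
    f.write(tflite_quant_model)

# Load it with the TF Lite interpreter and run one inference
interpreter = tf.lite.Interpreter(model_path='model_fp16.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype
dummy_input = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])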
By default, the model runs on the CPU by "upsampling" (dequantizing) the 16-bit parameters to 32 bits and then performing all operations in standard 32-bit floating point arithmetic.
The reason is that much of today's hardware does not yet support accelerated fp16 computation. In the future, as more hardware adds support, these half-precision values will no longer need to be "upsampled" and can be used in calculations directly.
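Note that this "upsampling" is itself lossless: a float16 value converts to float32 exactly, so the only error in the whole pipeline is the original rounding of the weights to float16. A tiny sketch (NumPy again, purely for illustration):

import numpy as np

x = np.float16(0.1)          # rounding to float16 loses a little precision
y = x.astype(np.float32)     # "upsampling" back to float32 is exact
print(np.float32(x) == y)    # True: no further error is introduced
print(y)                     # ~0.09998, only the original float16 rounding remains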
Running fp16 models on GPU is simpler.
TensorFlow Lite's GPU delegate has been enhanced so that it can read and operate on 16-bit precision parameters directly:
// Prepare GPU delegate.
const TfLiteGpuDelegateOptions options = {
  .metadata = NULL,
  .compile_options = {
    .precision_loss_allowed = 1,  // FP16
    .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
    .dynamic_batch_enabled = 0,   // Not fully functional yet
  },
};
If you are interested, TensorFlow provides an official tutorial demo. Open the Colab link at the end of this article to train an MNIST model and convert it to a 16-bit version online.
Portal
Official guide:
https://www.tensorflow.org/lite/performance/post_training_quantization
Colab link:
https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/performance/post_training_float16_quant.ipynb
-over-