The battle for deep learning "engines": GPU acceleration or dedicated neural network chips?
Deep learning has exploded in popularity worldwide over the past two years, and big data and high-performance computing platforms have played an indispensable role in that rise. They can be called the "fuel" and the "engine" of deep learning, and the GPU is the engine of that engine: almost every deep learning computing platform uses GPU acceleration. Deep learning has in turn become a new strategic direction for GPU vendor NVIDIA, and it was the undisputed protagonist of GTC 2015 in March.
So what are the latest developments in GPUs for deep learning? What do these developments mean for deep learning frameworks? How should deep learning developers unlock the potential of GPUs? And what are the prospects and future technology trends for the combination of GPUs and deep learning? At the recent NVIDIA Deep Learning China Strategy Conference, Ashok Pandey, NVIDIA's global vice president, PSG, and general manager of the cloud computing business in China, led his deep learning management team in an interview with reporters and gave a detailed account of NVIDIA's deep learning strategy, technology, ecosystem, and market.
NVIDIA believes that data, models, and GPUs are jointly driving the current boom in deep learning. Deep learning users can choose among different computing platforms, but developers need a platform that is easy to deploy and a healthy ecosystem, including open source tools optimized for the hardware. Building a strong deep learning computing ecosystem is both an existing advantage of the GPU and NVIDIA's consistent aim.
Ashok Pandey, NVIDIA Global Vice President, PSG and General Manager of Cloud Computing Business in China
Why GPUs and Deep Learning Are a Good Match
With the growth of data volumes and computing power, the large-scale neural networks that Hinton and LeCun worked on for many years have finally come into their own. The performance and accuracy of deep learning have improved dramatically, and it is now widely used in text processing, speech recognition, and image recognition. It has been adopted not only by giants such as Google, Facebook, Baidu, and Microsoft, but has also become the core competitive strength of start-ups such as Yuantiku and Megvii Technology.
So why the GPU? Most importantly, the GPU's excellent floating-point performance dramatically accelerates the two key operations in deep learning, classification and convolution, while still meeting the required accuracy. NVIDIA notes that deep learning demands high intrinsic parallelism, large amounts of floating-point compute, and heavy matrix operations, all of which the GPU provides; at the same accuracy, it offers faster processing, a smaller server investment, and lower power consumption than traditional CPU-based approaches.
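To make the link between convolution and matrix computation concrete, here is a minimal NumPy sketch (CPU-only and purely illustrative; the image and kernel sizes are arbitrary): a sliding-window convolution can be lowered via "im2col" into one large matrix multiplication, which is exactly the kind of massively parallel floating-point work a GPU is built for.

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Direct convolution (cross-correlation) with nested loops."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def conv2d_im2col(image, kernel):
    """Same result, expressed as a single matrix-vector product."""
    H, W = image.shape
    k = kernel.shape[0]
    oh, ow = H - k + 1, W - k + 1
    # Each row of `cols` is one k*k patch of the image, flattened.
    cols = np.empty((oh * ow, k * k), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = image[i:i + k, j:j + k].ravel()
    return (cols @ kernel.ravel()).reshape(oh, ow)

image = np.random.rand(64, 64).astype(np.float32)
kernel = np.random.rand(5, 5).astype(np.float32)
assert np.allclose(conv2d_naive(image, kernel), conv2d_im2col(image, kernel))
```

On a GPU, libraries such as cuDNN perform this kind of lowering (and more specialized variants) so that the bulk of the work runs as highly parallel matrix arithmetic.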
Performance comparison of CNN training using GPU acceleration and CPU only
Take the ImageNet competition as an example: using GPU-accelerated deep learning algorithms, the computer vision systems of Baidu, Microsoft, and Google achieved error rates of 5.98% (January 2015), 4.94% (February 2015), and 4.8% (February 2015), respectively, in the ImageNet image classification test, approaching or surpassing human-level recognition. Although such benchmark races invite optimizations tuned to a known data set, the results are still a useful reference for industrial practice.
"Artificial intelligence has evolved from the model-based approach in the past to the data-based and statistics-based approach today, mainly due to the highly parallel structure of GPUs and their efficient and fast connection capabilities. Facts have proved that GPUs are very suitable for deep learning," said Qian Depei, professor at the Beihang University and leader of the overall group of the major project "High-performance Computers and Application Service Environment of the 12th Five-Year Plan 863 Plan".
Four new solutions
NVIDIA reviewed four new products and initiatives introduced at GTC that are helping advance deep learning:
1. GeForce GTX TITAN X, a GPU developed for training deep neural networks.
TITAN X is built on the NVIDIA Maxwell GPU architecture, combining 3,072 processing cores with 7 teraflops of peak single-precision performance, 12GB of onboard video memory, and 336.5GB/s of memory bandwidth, enough to process the millions of data points used to train deep neural networks.
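As a back-of-envelope sanity check of those headline numbers, the sketch below reconstructs them from the core count and bus width; the ~1.1 GHz boost clock and the 7 Gbps / 384-bit GDDR5 memory figures are assumptions taken from public spec sheets, not from the article.

```python
cores = 3072                       # CUDA cores quoted above
flops_per_core_per_clock = 2       # one fused multiply-add = 2 FLOPs
boost_clock_hz = 1.1e9             # assumed ~1.1 GHz boost clock

peak_tflops = cores * flops_per_core_per_clock * boost_clock_hz / 1e12
print(f"peak single precision: ~{peak_tflops:.1f} TFLOPS")   # ~6.8, close to the quoted 7

bus_width_bits = 384               # assumed memory bus width
mem_rate_gbps = 7                  # assumed GDDR5 data rate per pin
bandwidth_gbs = bus_width_bits / 8 * mem_rate_gbps
print(f"memory bandwidth: {bandwidth_gbs:.0f} GB/s")          # 336, matching the quoted 336.5 GB/s
```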
NVIDIA reports that TITAN X took less than three days to train the industry-standard AlexNet model on the 1.2-million-image ImageNet dataset, a task that would take more than 40 days on a 16-core CPU.
2. DIGITS DevBox, a deskside deep learning tool for researchers.
DIGITS DevBox uses four TITAN X GPUs, with every component from memory to I/O optimized and tuned. It comes pre-installed with the software needed to develop deep neural networks, including the DIGITS software package, the three popular deep learning frameworks Caffe, Theano, and Torch, and NVIDIA's complete GPU-accelerated deep learning library cuDNN 2.0. Like the other giants, NVIDIA spares no effort in supporting open source.
NVIDIA says that in key deep learning tests, the DIGITS DevBox delivers four times the performance of a single TITAN X: training AlexNet on the DevBox takes just 13 hours, compared with two days on the best single-GPU PC and more than a month on a CPU-only system.
3. Pascal, the next-generation GPU architecture, which will speed up deep learning applications tenfold compared with Maxwell.
Pascal introduces three features that significantly accelerate training: 32GB of video memory (2.7 times that of the GeForce GTX TITAN X) together with mixed-precision computing, which runs 16-bit floating-point operations at twice the rate of 32-bit; 3D stacked memory, which lets developers build larger neural networks and boosts the speed of deep learning applications by up to 5 times; and NVIDIA's high-speed NVLink interconnect, which links more than two GPUs and can increase deep learning speed by up to ten times.
NVIDIA noted that single precision is the norm in deep learning today, but the future trend may be toward half precision or even quarter precision, so the GPU architecture has to be adjusted to user needs. Pascal's support for both FP16 and FP32 should therefore improve machine learning performance.
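The mixed-precision idea can be illustrated with a minimal NumPy sketch (the toy linear model, learning rate, and sizes are purely illustrative): keep the working copy of the weights in FP16 to halve memory and bandwidth, but accumulate updates into an FP32 "master" copy so that small gradient contributions are not rounded away.

```python
import numpy as np

rng = np.random.default_rng(0)
master_w = rng.standard_normal(4).astype(np.float32)   # FP32 master weights
lr = 1e-3

for step in range(100):
    w16 = master_w.astype(np.float16)                  # FP16 copy used for compute
    x = rng.standard_normal(4).astype(np.float16)
    grad16 = (w16 @ x) * x                             # FP16 gradient of the toy loss 0.5*(w.x)^2
    master_w -= lr * grad16.astype(np.float32)         # accumulate the update in FP32

print("FP16 working copy:", w16.nbytes, "bytes; FP32 master copy:", master_w.nbytes, "bytes")
```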
4. DRIVE PX, a deep learning platform for autonomous vehicles.
Built around the NVIDIA Tegra X1 and combined with the latest PX platform, it enables a qualitative leap in both in-vehicle instrument displays and autonomous driving.
NVLink and DIGITS worth noting
Any discussion of the tenfold performance of the next-generation Pascal architecture has to mention NVLink, which makes data transfers between GPUs, and between GPUs and CPUs, 5 to 12 times faster than the existing PCI Express standard, a great boon for deep learning applications that need higher inter-GPU transfer speeds. Developers should also be pleased that NVLink is a point-to-point interconnect whose programming model is the same as PCI Express.
NVIDIA said that NVLink allows the number of GPUs working together on deep learning tasks in a system to be doubled; it can also connect CPUs and GPUs in new ways, offering more flexibility and better energy efficiency in server design than PCI-E.
In fact, whether for data parallelism or model parallelism, NVLink opens up new possibilities for deep learning developers. iFlytek, the domestic leader in speech recognition, built a ring-shaped parallel training architecture on multiple GPGPUs and InfiniBand for DNN, RNN, CNN, and other model training, with good results, but the cost of InfiniBand made that deep-pocketed setup the envy of other practitioners. With NVLink available, there will clearly be other good options.
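The data-parallel pattern that such interconnects accelerate can be sketched in a few lines of NumPy (simulated on the CPU; the linear model, worker count, and sizes are illustrative): each worker computes gradients on its own shard of a batch, and the averaged gradient, which is what the interconnect must exchange every step, equals the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(42)
n_workers, batch, dim = 4, 64, 8
X = rng.standard_normal((batch, dim))
y = rng.standard_normal(batch)
w = rng.standard_normal(dim)

def grad(Xs, ys, w):
    """Gradient of the mean squared error 0.5*||Xs.w - ys||^2 / len(ys)."""
    return Xs.T @ (Xs @ w - ys) / len(ys)

# Each simulated "GPU" works on its own equal-sized shard of the batch ...
shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
local_grads = [grad(Xs, ys, w) for Xs, ys in shards]

# ... and averaging the per-worker gradients reproduces the full-batch gradient.
avg = np.mean(local_grads, axis=0)
assert np.allclose(avg, grad(X, y, w))
```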
Of course, adopting NVLink also means new investment, and NVIDIA's existing product line already supports deep learning well, so users can choose as appropriate. For more on choosing deep learning hardware, see the blog post by Kaggle competitor Tim Dettmers, "Full Guide to Deep Learning Hardware".
The other is DIGITS, an all-in-one graphical system for designing, training, and validating deep neural networks for image classification. DIGITS guides users through installing, configuring, and training deep neural networks, with a user interface and workflow management that make it easy to load training data sets from local and network sources, plus real-time monitoring and visualization. It currently supports the GPU-accelerated version of Caffe; for details, see the Parallel Forall blog post "DIGITS: Deep Learning Training System".
NVIDIA said DIGITS supports Caffe first because its customer surveys show that this framework is currently the most popular (including among domestic BAT users and some users abroad); for the same reason, the cuDNN library was also integrated into the Caffe open source tool first. NVIDIA promised that, even if it cannot cover every tool, DIGITS will go on to support the mainstream open source frameworks, chiefly the aforementioned Theano and Torch. NVIDIA has more than 30 people worldwide on the DIGITS and cuDNN teams devoted to this open source work, and these developers stay in close contact with deep learning developers in the community.
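For orientation, here is a minimal sketch of driving a GPU-accelerated Caffe model from its Python interface (pycaffe); the file paths and the 'data'/'prob' blob names are placeholders that depend on the particular deploy.prototxt and weights (for example, ones exported from DIGITS).

```python
import numpy as np
import caffe

caffe.set_device(0)          # select GPU 0
caffe.set_mode_gpu()         # run layers on the GPU (using cuDNN where compiled in)

net = caffe.Net('deploy.prototxt',        # network definition (placeholder path)
                'weights.caffemodel',     # trained weights (placeholder path)
                caffe.TEST)

# Feed one randomly generated, image-shaped input and read back class probabilities.
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)
out = net.forward()
print('top class:', out['prob'][0].argmax())
```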
The ecosystem in China
In NVIDIA's view, deep learning research in China is essentially on par with that of institutions abroad. Among universities, the Chinese University of Hong Kong and the Institute of Automation of the Chinese Academy of Sciences have both ranked well in ImageNet. On the industry side, BAT, LeTV, iFlytek, and others all have many young engineers and solid research results in deep learning. NVIDIA hopes to strengthen China's ecosystem and promote the adoption of deep learning, mainly through investment in the open source community, research cooperation with universities, and partnerships with server manufacturers and enterprise users.
In January 2015, NVIDIA and iQiyi signed a framework agreement for deep cooperation. The two parties will work closely in the fields of deep video and media cloud computing, using the most advanced GPUs and deep learning architectures to build iQiyi's video creation, sharing, and service platform. NVIDIA said it will continue to set up joint laboratories with key customers in the future.
GPU or dedicated chip?
Although deep learning and artificial intelligence enjoy enormous publicity, the industrial application of deep learning is still in its infancy, whether viewed from the perspective of bionics or of statistics, and its theoretical foundations have yet to be established and refined. In the eyes of some practitioners, relying on the accumulation of computing power and data sets to obtain results seems too much like brute force: for machines to better understand human intentions, more data, stronger computing platforms, and usually supervised learning are needed (though at this stage there is no shortage of data). Once the theory matures, will deep learning no longer depend on data, no longer depend on labeled data (that is, unsupervised learning), and no longer demand ever more performance and accuracy from the computing platform?
Taking a step back, even if computing power remains a necessary engine, does it have to come from the GPU? CPUs and FPGAs have already shown their strengths on deep learning workloads, and IBM's SyNAPSE brain-inspired chip provides 1 million "neurons" and 256 million "synapses" across 4,096 neurosynaptic cores at 70 milliwatts of power, even allowing neural network and machine learning workloads to move beyond the von Neumann architecture. Their energy efficiency and performance make them potential challengers to the GPU. For example, to build the "iFlytek Super Brain", iFlytek is considering, in addition to GPUs, deeply customized chips dedicated to artificial neural networks for an even larger supercomputing cluster.
However, since neither has yet been commercialized, NVIDIA is not worried that the GPU will fall out of favor in deep learning. First, NVIDIA argues that the GPU, as the underlying platform, plays an accelerating role, helping deep learning developers train larger models faster, and is not tied to any particular way of implementing deep learning models. Second, NVIDIA says users can choose platforms according to their needs, but deep learning developers must keep improving their algorithms and statistics and need the support of an ecosystem: around the GPU there are tools such as CUDA, cuDNN, and DIGITS, support for the mainstream open source frameworks, friendly interfaces and visualization, and backing from partners. Inspur, for example, has developed a version of Caffe that supports multiple GPUs, and Sugon has built a multi-GPU technology on the PCI bus that is friendlier to developers used to serial programming. By contrast, FPGAs and dedicated artificial neural network chips place higher demands on server integration, the programming environment, and programming skill, and they lack general-purpose potential, which makes them harder to popularize.