Dear organizers of the Electronic World event, hello! Thank you for giving me this opportunity to read "Infrastructure in the Big Model Era: A Guide to Building a Big Model Computing Center" and to share my thoughts here. As a data mining and AI enthusiast, I am honored to stay so closely connected to the forefront of the industry. 2023 was the breakout year for large-model AI products, with major vendors each exploring their own path to putting large models into production, and I have long looked forward to where this technology will go. In my own work, however, I have also come to appreciate just how much depends on computing power and infrastructure.
It was with these expectations that I read the first four chapters of this book. These chapters are the soul and cornerstone of the whole work, leading readers step by step into the remarkable world of computing power.
Chapter 1 gets straight to the point: the demanding computing requirements of AI and large models. The essence of machine learning is to approximate real-world phenomena with mathematical models, and massive matrix operations are the key. The author introduces an important idea here: by stacking many simple linear functions, interleaved with nonlinearities, a model can approximate complex relationships in the real world. Hardware such as GPUs and TPUs performs so well in machine learning precisely because it has unique advantages in matrix multiply-and-add operations. This insight was a revelation to me: underneath, seemingly sophisticated machine learning rests on such "simple" mathematical principles.
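This idea can be made concrete in a few lines of NumPy. The sketch below is my own toy illustration, not code from the book: it approximates sin(x) by summing many random ReLU "hinges" (each hinge is one linear function clipped at zero), with the output weights fitted by least squares. The forward pass is exactly the matrix multiply-and-add pattern that GPUs accelerate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a smooth nonlinear function we want to approximate.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

# Hidden layer: a random linear map (W, b) followed by ReLU.
# Each hidden unit is a "hinge"; summing many hinges gives a
# piecewise-linear curve that can bend to follow the target.
W = rng.normal(size=(1, 50))
b = rng.normal(size=(1, 50))
h = np.maximum(x @ W + b, 0.0)   # matrix multiply-add, then nonlinearity

# Output layer: solve for the best linear combination of the hinges.
coef, *_ = np.linalg.lstsq(h, y, rcond=None)
y_hat = h @ coef

print("max abs error:", float(np.max(np.abs(y_hat - y))))
```

With 50 random hinges the fit is already close; adding more units (more rows in the matrix multiply) tightens it further, which is why throughput on these operations matters so much.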
Chapter 2 explores the tight integration of software and hardware. As industry mainstays, CUDA and TensorFlow play a pivotal role here. Through the author's vivid explanation, I came to understand how the TensorFlow computational graph works: abstract operators and tensors are ultimately lowered to CUDA kernels that execute efficiently on the GPU. This seamless hand-off between software and hardware is the key to deep learning's performance. Data parallelism and model parallelism in distributed training place extremely high demands on computing power. The author offers a striking metaphor: the co-design of software and hardware is like the double helix of DNA, two strands that complement each other and cannot be separated. The image made me appreciate how intricate, almost organic, the marriage of software and hardware really is.
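To make the computational-graph idea concrete, here is my own toy sketch, far simpler than TensorFlow's real machinery: operators are nodes, tensors flow along edges, and nothing is computed until the graph is evaluated. In a real framework, each node's `eval` step would dispatch to an optimized kernel (for example a CUDA matmul) rather than to NumPy.

```python
import numpy as np

class Node:
    """One operator in a toy computational graph."""
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Recursively evaluate inputs, then apply this node's operator.
        # In a real framework this is where a node would dispatch to an
        # optimized GPU kernel instead of a NumPy call.
        args = [node.eval() for node in self.inputs]
        return self.op(*args)

class Const(Node):
    """A leaf node holding a constant tensor."""
    def __init__(self, value):
        self.value = np.asarray(value, dtype=float)

    def eval(self):
        return self.value

# Build the graph for y = relu(x @ W + b) without computing anything yet.
x = Const([[1.0, 2.0]])
W = Const([[1.0, 0.0], [0.0, -1.0]])
b = Const([0.5, 0.5])
y = Node(lambda a: np.maximum(a, 0.0),
         [Node(np.add, [Node(np.matmul, [x, W]), b])])

print(y.eval())   # prints [[1.5 0. ]]
```

Separating graph construction from execution is what lets a framework inspect the whole computation first, fuse operators, and choose where each one runs.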
Chapters 3 and 4 turn to the underlying hardware design. From a hardware engineer's perspective, the author leads readers layer by layer through the elegant architecture of GPU chips: from basic transistors and logic gates up to parallel compute units such as streaming multiprocessors and tensor cores, every level matters for performance. Reading about the design details of NVIDIA's H100, H800 and other chips, I truly appreciated that GPU performance gains are the fruit of countless engineers' ingenuity and persistence. The DGX server line is the culmination of it all, using technologies such as NVLink, NVSwitch, and RDMA to squeeze the most out of the GPUs' computing power. These terms once seemed impenetrable to me, but thanks to the author's lucid explanations I can now follow them at least in outline, which I found genuinely rewarding.
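Why interconnects like NVLink and RDMA matter becomes clear from data-parallel training: every step, each GPU computes a gradient on its own data shard, and all replicas must end up with the same averaged gradient. A common pattern for this exchange is the ring all-reduce, where each replica passes only 1/n of the vector to its neighbor per step. The following is my own single-process toy simulation of that pattern, not the book's code:

```python
import numpy as np

def ring_allreduce(grads):
    """Average equal-length gradient vectors with a ring all-reduce.

    Toy single-process simulation: replica r exchanges one 1/n-sized
    chunk with its ring neighbour per step, so per-link bandwidth
    (what NVLink/RDMA provide) dominates the communication cost.
    """
    n = len(grads)
    chunks = [np.array_split(g.astype(float).copy(), n) for g in grads]

    # Phase 1: reduce-scatter. At step t, replica r adds in its left
    # neighbour's copy of chunk (r - t - 1) % n. After n-1 steps,
    # replica r holds the full sum in chunk (r + 1) % n.
    for t in range(n - 1):
        for r in range(n):
            c = (r - t - 1) % n
            chunks[r][c] = chunks[r][c] + chunks[(r - 1) % n][c]

    # Phase 2: all-gather. Completed chunks travel around the ring
    # until every replica holds every summed chunk.
    for t in range(n - 1):
        for r in range(n):
            c = (r - t) % n
            chunks[r][c] = chunks[(r - 1) % n][c]

    # Every replica now has the same full sum; average it.
    return [np.concatenate(ch) / n for ch in chunks]

rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(4)]    # 4 simulated "GPUs"
out = ring_allreduce(grads)
expected = sum(grads) / 4
print(all(np.allclose(o, expected) for o in out))   # prints True
```

Because this exchange happens on every training step, the speed of the links between GPUs directly bounds how well a cluster's raw compute can be utilized.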
Looking back over Chapters 1-4, I cannot help but marvel that large-model infrastructure is a complex systems-engineering effort spanning algorithms, chips, architecture, interconnects, and many other fields. With a distinctive perspective and systematic thinking, the author weaves these seemingly independent fields together, combining macro-level architectural design with micro-level technical analysis; the result is both far-reaching in thought and sweeping in vision. The book's rich, well-crafted illustrations also go a long way toward helping readers grasp the technical details. Perhaps because of space constraints, some topics are not covered in great depth, and I could not absorb everything at once; but learning never ends, and I will broaden and deepen my understanding with material beyond this book. As an AI beginner, I gained from reading not only knowledge but also a way of thinking about problems: start from the whole, take a comprehensive view, and learn to seize the crux and get to the essence. This mindset will serve me well in future study and work.
Of course, reading the first four chapters also made me keenly aware of the gaps in my background in computer systems, computer networks, engineering practice, and related subfields; I still cannot fully grasp some of the technical concepts in the book. I need to dig further with other reference materials and to communicate more with practitioners in this industry. I will also strengthen my study of the relevant fundamentals through every means available, and think about how to internalize these technical basics, absorb them gradually, and ultimately put them to practical use in my work.
I was also struck by the author's ability to write about deep technology in an accessible way. To truly build a large-model computing center, one must work hard on many fronts: distributed training, efficient communication, intelligent scheduling, and flexible deployment. This demands not only top-tier hardware as the foundation but also excellent software design to give it a soul. I can hardly wait to read on and see how the author leads us past these obstacles toward a complete and efficient large-model computing center.
Here I would like to express my sincere respect to the book's authors. It is your selfless sharing and careful work that allow us latecomers to stand on the shoulders of giants and glimpse the latest developments in this field. You impart not only knowledge but also a passion for exploration and the courage to innovate. This book will surely be a guiding light on my AI learning journey, inspiring me to keep moving forward.
As someone who dreams of a career in data mining and AI, I have also drawn up a study plan for myself: in the days ahead I will complete one set of reading notes every 15 days, turning the knowledge in the book into skills of my own. I also look forward to exchanging ideas with senior members of the community, resolving the questions that come up in my studies, and making progress together.
"Knowledge gained from books is always shallow; to truly understand, one must practice." Let us join hands at the forefront of the large-model wave and, with wisdom and sweat, write a brilliant chapter of this era!
August 6, 2024, Shenzhen