Let's talk about today's AI and all those buzzwords, in plain language
Hello everyone, I'm Pigo! I'd like to share Shanke's article, which may help you make sense of AI.
Artificial intelligence is developing so fast that a new term pops up every few days. Today I'll use plain language to explain where AI came from and all those things we run into every day, terms we've heard many times but still can't quite explain. Don't worry, there won't be any deep theory or formulas.
---
Let's start with an example. Think about how to tell whether a picture shows a car. One way is to give me a rule: something is a car if it has wheels, a steering wheel, and can move forward. A machine can do the same thing, but it needs stricter, clearer rules so that it can reliably decide, from those rules alone, whether something is a car. This school of artificial intelligence is the earliest one: symbolism.
This is nothing profound. When you proved math problems in high school, you already used symbols like "because", "therefore", "there exists", "for all", "contains", and "belongs to" to deduce conclusions. You may also have learned about axiomatic systems and propositional logic, which embody the same symbolist idea. The idea holds that everything in the world can be written in symbols, and that once the derivation rules are set, a machine can explain and run every principle.
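To make this concrete, here is a toy sketch of the rule-based, symbolist approach to the car example above; the feature names and rules are made up purely for illustration:

```python
# A toy "symbolist" car detector: everything is a hand-written rule.
# The feature names and thresholds here are invented for illustration only.

def is_car(features: dict) -> bool:
    """Return True if the hand-crafted rules say this object is a car."""
    return (
        features.get("wheel_count", 0) == 4
        and features.get("has_steering_wheel", False)
        and features.get("can_move_forward", False)
    )

print(is_car({"wheel_count": 4, "has_steering_wheel": True, "can_move_forward": True}))  # True
# A three-wheeler that most people would still call a car gets rejected,
# which is exactly the brittleness of hand-written rules.
print(is_car({"wheel_count": 3, "has_steering_wheel": True, "can_move_forward": True}))  # False
```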
But this road turned out to be hard to walk. Even mathematics, rigorous and rule-bound as it is, cannot be perfectly formalized and symbolized, let alone the messy human world. Take the simplest question: will a stock price rise or fall tomorrow? Imagine how hard it would be to formalize that strictly and prove an answer.
---
So another school of thought gradually emerged: connectionism. Back to recognizing a car in a picture. This time, don't give me any rules, I can't remember them anyway. Just let me look at a hundred pictures of cars, and I'll naturally learn what a car is, even though I can't explain how I learned it or how I recognize one.
This is the prototype of a neural network, and it is very close to how we humans learn. Think about how a child first learns to recognize an object: certainly not by checking whether it has wheels or legs, but simply by seeing it many times until it sinks in. We rarely notice how strong human pattern recognition is. Almost at a glance, with no conscious reasoning, we know what the objects in front of us are, how they are arranged, and what is likely to happen next, yet this is extremely difficult for machines.
Neural networks may seem magical, but the basic principle is simple. The input is a bunch of numbers, the output is a bunch of numbers, and the layers in between hold a bunch of parameters (weights). Those weights are adjusted over and over against a large amount of data. As more data flows through, a seemingly magical set of weights slowly takes shape, much like the human learning process, as if the machine had learned the knowledge.
A classic example is a neural network for handwritten digit recognition. The final, tuned parameters recognize digits reliably, as if there were logic behind them, yet each neuron inside is just a meaningless little bundle of numbers when viewed on its own. Disorder at the microscopic level produces astonishing capability at the macroscopic level.
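If you're curious what that looks like in code, here is a minimal sketch of such a digit-recognition network in PyTorch. The layer sizes are arbitrary and the "images" are random numbers, just to show the shape of the idea:

```python
# A minimal sketch of a digit-recognition network in PyTorch.
# Layer sizes and training details are illustrative, not tuned.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),            # a 28x28 pixel image becomes 784 numbers
    nn.Linear(784, 128),     # 784 inputs -> 128 hidden "neurons" (the weights are the parameters)
    nn.ReLU(),
    nn.Linear(128, 10),      # 10 outputs: one score per digit 0-9
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on a fake batch; real training loops over MNIST images.
images = torch.rand(32, 1, 28, 28)      # 32 random "images"
labels = torch.randint(0, 10, (32,))    # 32 random labels

optimizer.zero_grad()
logits = model(images)                  # forward pass: numbers in, scores out
loss = loss_fn(logits, labels)          # how wrong were we?
loss.backward()                         # work out how to nudge each weight
optimizer.step()                        # adjust the weights a little
```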
---
From the early rule-based approach (symbolism) to the later neural-network-based machine learning (connectionism), the latter gradually became mainstream.
Hardware kept getting more powerful, computing power kept growing, and deep learning techniques built on neural networks kept improving (CNNs, RNNs, the Transformer). I won't go into the technology itself here; the point is that the hardware got stronger, the algorithms got stronger, and the parameter counts exploded (where we could once only add, subtract, multiply, and divide a dozen or so numbers, we can now do it for hundreds of millions of them).
That's when ChatGPT came along! For many people it was their first real contact with the term AI. ChatGPT is not a brand-new technology; it is the product of a quantitative change turning into a qualitative one: ever more parameters, ever more powerful hardware, and brute force working miracles.
AI used to handle only relatively simple tasks, such as face recognition and customer-service bots. Now it suddenly feels as if it can think when you talk to it! From micro to macro, quantity turns into quality, and brute force works miracles: this is called emergence. In fact, even the small trick of recognizing digits by combining tiny neurons is a kind of emergence. And because these models have so many parameters, everyone calls them large models.
By now these words probably sound familiar; they are the buzzwords that have flooded everyone's screens in recent years.
---
After ChatGPT appeared, the first question most people asked was: what is it? Many people never actually saw ChatGPT for themselves, whether because they didn't know how to find it or because of network access problems. It's really just a chat page, and the reason it took off is that it is so easy to use.
Afterwards, all kinds of media hyped it up and coined a pile of new terms to explain it, which only confused people. Let's sort them out now.
The first point of confusion: ChatGPT cannot access the Internet, so it doesn't know about recent events. As I recall, the training data of the first version of ChatGPT only went up to 2021. After training, the model is just a bunch of weighted parameters; to put it bluntly, a very complicated, fixed algorithm. Talking to it for a while cannot make it smarter or teach it new knowledge. If you want it to improve, you have to train it on a new batch of data or change the model itself; merely chatting with it does nothing.
The second point of confusion: later, ChatGPT could access the Internet again and knew about recent events. This is essentially a tool that searches the web before answering, then bundles the search results with your question and hands the whole thing to ChatGPT. What ChatGPT itself does is still just produce one piece of text from another piece of text.
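Roughly speaking, the plumbing looks like this; `web_search` and `ask_llm` below are placeholders for whatever search API and chat model you actually plug in:

```python
# A sketch of "ChatGPT with web access": search first, then stuff the results
# into the prompt. Both helper functions are placeholders, not real APIs.

def web_search(query: str, top_k: int = 3) -> list[dict]:
    """Placeholder: swap in a real search-engine API."""
    return [{"snippet": f"(result {i} for '{query}')"} for i in range(top_k)]

def ask_llm(prompt: str) -> str:
    """Placeholder: swap in a real chat-model call."""
    return f"(model answer based on: {prompt[:60]}...)"

def answer_with_search(question: str) -> str:
    results = web_search(question)
    context = "\n\n".join(r["snippet"] for r in results)
    prompt = (
        "Use the following search results to answer the question.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}"
    )
    return ask_llm(prompt)   # still just text in, text out

print(answer_with_search("What happened in the news today?"))
```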
The third point is the private knowledge base. ChatGPT can chat happily in text and can use tools to search the web before answering, but sometimes you need it to absorb a lot of your own background material first, your company's database, a whole novel, and so on, so that it understands that material before talking to you. ChatGPT alone can't do this without some extra tricks.
There are several ways to achieve this. The first is to increase the context length: before each conversation, feed in the whole novel as context and then ask your question. That obviously has its limits.
The second method is rather clever. First split the novel into chunks and store them in a database. Then, each time a question comes in, look up the relevant chunks, pull them out, and stitch them together with the question as context before asking. This is called Retrieval-Augmented Generation (RAG). The database is usually a vector database, and the technique used to store and query it is called embedding: mapping high-dimensional data such as text, images, and video into numeric vectors. In effect, this is like looking up the answer in advance, telling ChatGPT, and then asking it to answer. In a face-to-face conversation it would be like asking, "My name is xxx, who am I?" Many company knowledge bases and the popular knowledge-base tools on the market work this way.
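Here is a toy sketch of that idea. The `embed` function below is a crude hashing trick just so the example runs end to end; a real system would use a proper embedding model and a vector database:

```python
# Toy RAG: chunk the text, "embed" each chunk, retrieve the most similar chunk
# for a question, and put it in front of the question as context.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Crude stand-in for a real embedding model: a hashed bag of words."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word.strip(".,!?:;")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

novel = "Chapter 1: the hero buys a red car. Chapter 2: the car breaks down in the rain."
chunks = [c.strip() for c in novel.split("Chapter") if c.strip()]   # naive chunking
index = [(chunk, embed(chunk)) for chunk in chunks]                 # the "vector database"

question = "Who buys the red car?"
q_vec = embed(question)
best_chunk = max(index, key=lambda pair: float(pair[1] @ q_vec))[0]  # cosine similarity

prompt = f"Context: {best_chunk}\n\nQuestion: {question}"            # hand this to the chat model
print(prompt)
```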
The third method is fine-tuning. This one has a real technical threshold. The chat models that talk like a particular celebrity, voice-cloning systems, and the like are all fine-tuned on top of pre-trained models. Simply put, GPT, BERT, and LLaMA are pre-trained models: their parameters have already been trained on a huge amount of data and are mostly in the right place. At that point you only need a little further training on your own data for the model to pick up your knowledge. A completely untrained model is like a newborn baby; a pre-trained model is like a child who already knows what parents are, which makes it much easier to teach it what grandparents are.
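As a rough illustration of the fine-tuning idea (not how the celebrity bots are literally built), here is a PyTorch sketch that freezes a pre-trained image model and trains only a small new head on your own data; text models are fine-tuned with the same logic, often via adapter techniques such as LoRA:

```python
# Fine-tuning sketch: start from a pretrained model, freeze most of it,
# and train only a small new "head" on your own data. An image model is used
# here only because it is small and self-contained.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")      # download ImageNet-pretrained weights

for param in model.parameters():                # freeze the pretrained "knowledge"
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)   # new head: e.g. "my cat" vs "not my cat"

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One step on a fake batch; real fine-tuning loops over your own labeled images.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```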
Of course, you could also start from a bare, un-pre-trained model and train it from scratch, but don't bother unless you have very deep pockets.
---
So far we have only talked about ChatGPT, but looking at the whole AI landscape, ChatGPT is just one application of the GPT model to dialogue, and the GPT model is just a Transformer-based pre-trained model for text-to-text generation.
What people see and hear boils down to text, sound, images, and video. Combining them is the multimodal capability of large models. Put bluntly, it just means many forms: no longer only text-to-text, but text-to-image, text-to-video, image-to-video, image-to-text, and so on, conversions back and forth in every direction.
Each conversion maps to many application scenarios. Text-to-text is not only chat but also translation, article writing, code writing, and code explanation. Between text and images you get AI drawing, AI image interpretation, and AI photo editing. The room for imagination and the range of applications are enormous, which is why new AI applications and new terms appear every day: the same basic technology repackaged into new things.
Let's sort out the more popular ones. The text-to-text models I just mentioned come from several pre-trained model families built on the Transformer architecture; the field is largely split among OpenAI's GPT family, Google's BERT family, and Meta's LLaMA family.
Text-to-image is the field of AI painting. It was originally done with generative adversarial networks (GANs); now diffusion models have swept those aside. This field had also been around for a long time before the quality suddenly jumped high enough to make people applaud. Representative examples are the open-source Stable Diffusion and the closed-source Midjourney and DALL·E 2. Midjourney was the first to blow up; remember the pictures that flooded everyone's WeChat Moments, those AI paintings that looked uncannily real?
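For a sense of how accessible this has become, here is a minimal sketch of generating an image with the open-source Stable Diffusion through Hugging Face's diffusers library; it needs a GPU with a few GB of memory, and the model ID shown here may have moved or changed since this was written:

```python
# Minimal text-to-image sketch with Stable Diffusion via the diffusers library.
# Requires: pip install diffusers transformers accelerate, plus a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example model ID; may have moved on the Hub
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photorealistic red sports car parked by the sea").images[0]
image.save("car.png")
```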
Text-to-video has not had a real breakthrough yet. Unlike ChatGPT and Midjourney, which have produced plenty of real work, text-to-video is still mostly at the gimmick stage. OpenAI's Sora was hyped for a long time without anything you could actually use. ByteDance's Doubao large model has opened up for trial and the results look impressive, but it still needs more time to prove itself.
Text-to-speech is comparatively mature. We have been using it in daily life for years, so it feels less explosive; the AI wave has simply pushed it to a new level. Cloning someone's voice almost perfectly is no longer hard: TikTok can clone your voice from just five seconds of audio, and open-source projects such as GPT-SoVITS (a Chinese-made project) can clone a voice on a personal computer from just a few minutes of recordings.
Although these multimodal applications use a variety of different underlying models, almost all of them are powered by the Transformer. That is why there has been such a qualitative leap compared with before; we have this architecture to thank.
---
With those basics covered, the current application-oriented ecosystem is easier to understand.
Large models have gradually made their way into ordinary homes. Training and inference, which used to require enterprise-grade computing power, can now be run by ordinary people on a beat-up computer. To make models run on weaker machines we have distillation, pruning, and quantization, which roughly mean transferring a big model's knowledge into a smaller one, trimming away parameters, and lowering numerical precision. The goal is the same: let the model run on lower-end hardware so ordinary people can use it. Otherwise, how would you promote and sell it?
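Quantization, for instance, can be as simple as the PyTorch sketch below, which stores the weights of the linear layers as 8-bit integers instead of 32-bit floats; tools like llama.cpp do the analogous thing for large language models:

```python
# Quantization sketch: shrink a model by storing weights as int8 instead of float32.
import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # convert the Linear layers to int8
)

def saved_size(m: nn.Module) -> int:
    """Serialize the model's weights and return the size in bytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

# The int8 version should come out several times smaller than the float32 one.
print(saved_size(model), saved_size(quantized))
```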
On the other hand, people are not satisfied with what a single AI application can do; they want to combine multiple AIs, or multiple steps, into something more powerful. That is where the concepts of agents and workflow engines come from.
Early on there was AutoGPT, which claimed that a single sentence was enough for it to automatically look things up, ask follow-up questions, and write reports. It was really several ChatGPT instances talking to one another to push a task forward. The results turned out to be far from satisfactory, and the excitement fizzled out.
A workflow engine is a tool for stringing steps together into a so-called intelligent agent. For example, say I want an agent that first crawls articles from the web, then writes a new article based on them, generates an image, and automatically posts it to a few blog platforms to pick up some easy money. (Plenty of "get rich with AI side hustles" sellers teach exactly this stuff.) A workflow engine lets you string that together, essentially enabling no-code development for ordinary people.
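Stripped of the marketing, such a workflow is just a chain of steps; every function below is a placeholder for a real tool (a scraper, a chat model, an image model, a blog API):

```python
# A "workflow" is just a few steps chained together. All functions are placeholders.

def crawl_article(url: str) -> str:
    return f"(text scraped from {url})"

def rewrite_article(text: str) -> str:
    return f"(new article written by an LLM based on: {text})"

def generate_cover_image(article: str) -> str:
    return "(path to an AI-generated cover image)"

def publish(article: str, image: str) -> None:
    print("published:", article, "with cover", image)

def workflow(url: str) -> None:
    text = crawl_article(url)          # step 1: fetch source material
    article = rewrite_article(text)    # step 2: have an LLM rewrite it
    image = generate_cover_image(article)  # step 3: generate a cover image
    publish(article, image)            # step 4: post it somewhere

workflow("https://example.com/some-article")
```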
AI workflow engines and intelligent agents are not mysterious concepts; once you have used a few of the tools, you will understand what they are. For example, Dify and ByteDance's Coze are agent-building tools geared toward text, ComfyUI is a somewhat more complex tool for AI image generation, and there are plenty of other zero-code AI builders. Because there is a small technical threshold and the results vary in quality, many people also sell the generated works or teach the skills as a service.
Things are also getting friendlier for developers. The Ollama tool, for example, lets people run and serve large models locally. Previously, to use a local model you had to download it yourself, with no idea where it went or how to run it, and a pile of model files you couldn't make sense of. Now you can launch one with a single command like Ollama run xxx. It's simply more convenient.
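Once the Ollama server is running and a model has been pulled (say with `ollama run llama3`), talking to it from code is just an HTTP request. The endpoint and field names below follow Ollama's documented local API and may change between versions:

```python
# Querying a locally running Ollama model over its HTTP API.
# Assumes the Ollama server is running on the default port and the model is pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                                  # example model name
        "prompt": "Explain embeddings in one sentence.",
        "stream": False,                                    # return one JSON blob, not a stream
    },
)
print(resp.json()["response"])
```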
The same goes for the LangChain framework, which is aimed more squarely at developers. The truth is that many of the new things and new concepts around large models involve fewer and fewer genuine technical breakthroughs. The messier the terms you keep hearing, the more likely they all serve one common goal: popularization. Let ordinary people use AI more conveniently, build AI applications, run AI on ever-weaker computers, or deploy their own AI applications without owning a computer at all, as long as they have an Internet connection. Just look at how GPU rental platforms and cloud computers keep gaining popularity.
Beyond that, most of the other noise can be ignored. These days, whenever someone shouts that something is about to change the world, it is most likely just an application with a clever trick or two, or some chemistry between different modalities.
---
As for the future of AI technology and the models themselves: we may not reach the next jaw-dropping stage in the short term. GPT-4 has been the strongest model on the planet for more than a year. The next real emergence point for large models may have to wait for something like GPT-10, or for a breakthrough in the old symbolist line that reshapes the AI landscape.
Multimodal development, on the other hand, will see breakthroughs as the modalities reinforce one another: today's digital humans, voice cloning, and article imitation, and tomorrow's video generation. An era where everyone can be a director may arrive. Of course, that also means AI ethics and law will get very complicated. Voice and video fraud has already grown far more common, because the small step from "you can tell it's AI" to "you can't tell it's AI" produces a qualitative leap in fraud. Notice that platforms offering voice or face cloning now make you sign an agreement first; that is worth paying attention to.
As AI tools, and even AI development itself, become more accessible, more and more people will assemble their own super combinations of AI tools and let AI assist their work and life. The gap between those who can use AI and those who can't will widen. But don't assume you can't learn AI just because you didn't study it or aren't a technical person; that doesn't matter at all.
This article mixes in some of my own views; I hope it clears up some of the confusion you may feel as we enter this new era of AI.