Human data is running short, so Microsoft and OpenAI are starting to feed AI with AI; Sam Altman says all data will become synthetic data in the future

Last updated: 2023-08-14 15:21

Xiao Xiao, reporting from Aofei Temple
QbitAI | WeChat official account: QbitAI

Human data is in short supply, and AI is being forced to start eating data produced by AI!

This is the situation now facing many cutting-edge AI companies, including Microsoft and OpenAI.

They scraped a lot of data from platforms and forums like Wikipedia, e-books, news sites, blogs, Twitter and Reddit, and now... they're running out of data.

But when it comes to training a better large model, no amount of data is ever enough.

According to the Financial Times, many companies are feeding the output of large models, so-called synthetic data, to models with fewer parameters, and have found that the results are pretty good.

OpenAI CEO Sam Altman not only has no objection to synthetic data, he has also said that "all data will become synthetic data in the future."

Cohere, a large-model startup valued at $2 billion, is also using synthetic data. Aidan Gomez, the company's CEO and one of the authors of the original Transformer paper, goes even further:

Synthetic data could accelerate the path to "superintelligent" AI systems.

So, which large models are already using synthetic data, and where does this synthetic data come from?

Big AI synthesizes data and small AI eats it

This so-called synthetic data is essentially data generated by a large model that already performs well. After manual adjustment, it is fed to a somewhat smaller model.

Cohere, for example, has experimented with having two large models hold "role-playing" conversations and turning the resulting dialogue into synthetic data.

The two models play a "mathematics teacher" and a "student" respectively and conduct a simulated math lesson, while a Cohere employee stands by to oversee the conversation as it is generated.

Whenever the conversation goes wrong, the human employee steps in to correct the text.

While it does still require manpower, this is much cheaper than hiring experts in science, medicine, and business to write the text.
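
The teacher-and-student setup described above can be sketched roughly as follows. This is only an illustrative sketch, not Cohere's actual pipeline: the names `generate_dialogue`, `Turn`, and the three toy functions are hypothetical, the "models" are hard-coded stand-ins rather than real language models, and the reviewer function stands in for the human employee.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical types: a "model" maps the previous message to a reply, and a
# "reviewer" returns corrected text, or None when the draft needs no change.
Model = Callable[[str], str]
Reviewer = Callable[[str, str], Optional[str]]


@dataclass
class Turn:
    role: str
    text: str
    corrected: bool


def generate_dialogue(teacher: Model, student: Model, reviewer: Reviewer,
                      topic: str, rounds: int) -> List[Turn]:
    """Alternate teacher and student turns, letting a reviewer fix each draft."""
    transcript: List[Turn] = []
    last_message = topic
    for _ in range(rounds):
        for role, model in (("teacher", teacher), ("student", student)):
            draft = model(last_message)
            fix = reviewer(role, draft)
            text = draft if fix is None else fix
            transcript.append(Turn(role, text, corrected=fix is not None))
            last_message = text
    return transcript


# Toy stand-ins for the two models and the human overseer (assumptions).
def toy_teacher(previous: str) -> str:
    return "What is 2 + 2?"


def toy_student(previous: str) -> str:
    return "2 + 2 = 5"  # deliberately wrong so the reviewer has work to do


def toy_reviewer(role: str, draft: str) -> Optional[str]:
    return "2 + 2 = 4" if draft == "2 + 2 = 5" else None


transcript = generate_dialogue(toy_teacher, toy_student, toy_reviewer,
                               topic="addition", rounds=1)
for turn in transcript:
    print(turn.role, "|", turn.text, "| corrected:", turn.corrected)
```

In this sketch the corrected transcript, not the raw drafts, is what would be kept as training data, which is why each turn records whether a correction was applied.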

So, what kinds of models will use this synthetic data?

Recent research from Microsoft Research shows that synthetic data can be used to train language models that are slightly smaller than GPT-4 or PaLM-2.

Take TinyStories, a dataset of short stories for four-year-olds generated with GPT-4, as an example. Although the dataset contains only words that a 4-year-old child can understand, a model trained on it can still generate stories that are grammatically correct and read fluently.

Regarding the reasons for using synthetic data, Cohere CEO Aidan Gomez believes:

Getting all the data from the internet would be ideal, but web data is too messy to meet the need. By comparison, synthetic data is already abundant, even if that fact is not yet widely publicized.

An industry has already emerged behind it

Companies including Scale AI and Gretel.ai have already begun offering synthetic data services to outside customers.

First, Scale AI launched a synthetic data product, Scale Synthetic, to provide synthetic data services to enterprises.

An earlier SemiAnalysis article that leaked details about GPT-4 also mentioned that the GPT-4 dataset contains millions of rows of instruction fine-tuning data from Scale AI and from internal sources.

As for the synthetic data platform Gretel.ai, its official website says it has worked with companies such as Google, Riot Games, and HSBC to generate more synthetic data for other developers to use.

Ali Golshan, CEO of Gretel.ai, believes that the benefit of synthetic data is that it preserves the privacy of all individuals in the data set while still maintaining its statistical integrity.

But not everyone accepts this "clever trick" of using synthetic data. Opinion is currently split into two camps.

One camp supports the use of synthetic data. Many companies building large models, Cohere among them, are sticking with this approach and believe it may produce better AI and could even give rise to "superintelligence."

The other camp believes that synthetic data will eventually cause AI to "reap what it sows."

For example, a study from Oxford University, Cambridge University, Imperial College London, University of Toronto, University of Edinburgh and Vector Institute shows:

Training on synthetic data causes irreversible defects in the model:

The model forgets "improbable events" and ends up poisoned by the data it generated itself.
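
A toy simulation can illustrate why rare events are the first to be forgotten. The sketch below is not the method used in the study: it assumes a hypothetical three-token "language" and a "model" that is nothing more than the token frequencies counted in a finite sample. Each generation is trained only on a sample drawn from the previous generation, so once a rare token fails to appear in a sample, no later generation can ever produce it again.

```python
import random
from collections import Counter


def resample_generations(probs, sample_size, generations, seed=0):
    """Repeatedly re-estimate token frequencies from the previous model's samples."""
    rng = random.Random(seed)
    tokens = list(probs)
    current = dict(probs)
    history = [current]
    for _ in range(generations):
        # Only tokens the current model can still emit are candidates.
        support = [t for t in tokens if current[t] > 0]
        weights = [current[t] for t in support]
        draws = rng.choices(support, weights=weights, k=sample_size)
        counts = Counter(draws)
        # The next "model" is just the empirical frequency table.
        current = {t: counts[t] / sample_size for t in tokens}
        history.append(current)
    return history


def support_size(dist):
    """Number of tokens the model can still produce."""
    return sum(1 for p in dist.values() if p > 0)


# Hypothetical starting distribution with one rare token.
original = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
history = resample_generations(original, sample_size=50, generations=30)
sizes = [support_size(d) for d in history]
print(sizes)  # the number of surviving tokens can shrink but never grow
```

The exact generation at which a token disappears depends on the random seed and the sample size, but the loss is one-way in this setup, which mirrors the "irreversible" wording above.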

Some netizens believe that this synthetic data will eventually turn into a pile of "unusable sludge," and that people will then have to hire data scientists to clean it up.

Others joked that this sounds like "AI inbreeding."

Do you think AI needs to use synthetic data?

Reference links:
[1]https://www.ft.com/content/053ee253-820e-453a-a1d5-0f24985258de
[2]https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/
[3]https://arxiv.org/pdf/2306.11644.pdf
[4]https://arxiv.org/pdf/2305.17493v2.pdf
