Would you be surprised or horrified when technology replicates your speech?-EEWORLD

If a technology could copy or imitate your voice within a second, would you be surprised or horrified?

In 2019, applications of AI have grown increasingly diverse. Technology companies such as iFlytek and Sogou have released speech synthesis applications one after another. With AI, users can change their voice to that of a celebrity, or of anyone else they want to imitate, in about a second.

Internet technology is quietly changing our lives. For artificial intelligence companies, deploying voice recognition technology widely is no longer difficult. However, the ethical and security risks behind it may become impossible to ignore as AI develops.

Real-time voice changing is popular among AI companies: a voice can be changed in one second

"Hi, everyone. I'm very happy today to be here at iFlytek's new product launch. I've always liked iFlytek..."

This happened at iFlytek's 2019 new product launch. iFlytek Chairman Liu Qingfeng used the technology to deliver his opening remarks in the simulated voices of Shan Tianfang, Lin Chiling and Luo Yonghao. When Luo Yonghao's voice rang out in particular, many people thought that he was actually there.

"You see Liu Qingfeng, but you hear Lao Luo's voice." Liu Qingfeng said on stage that this is the company's latest real-time voice-changing technology. According to reports, this new speech synthesis technology needs only a one-minute voice sample to imitate anyone's speech.

iFlytek is not alone. Sogou CEO Wang Xiaochuan recently demonstrated Sogou's voice-changing function at a conference. Using mobile phone software, he simulated the voices of Gao Xiaosong and of a girl from Northeast China, which drew laughter from the audience. He then demonstrated voice replacement in a song. According to the presentation, the system first trained on his voice for 14 minutes and then transferred the timbre.

This is Sogou's latest speech synthesis technology, which can transform anyone's voice into a specific voice, such as that of Lin Chiling or Jack Ma, in seconds. Wang Xiaochuan said that this is not simple speech synthesis: it also transfers voice, tone and emotion.

Currently, users of Sogou Input Method can freely change their voice into one of their favorite voices and use it in major social apps such as WeChat, QQ and Momo. Sogou provides 19 specific voices in several categories, including celebrities, cartoon characters, game IPs and dialects.

In fact, speech synthesis is no longer a new technology. Until now, it has mostly been used to convert text into sound, as in navigation, transcription, smart speakers, and voice assistants such as Siri, rather than to reproduce real people speaking.

This year, many AI companies have focused on the application of speech synthesis in scenarios such as voice changing and voice cosplay, converting the voices of real people into specific sounds.

Baidu also has practical applications of related technologies. In early May this year, in the CCTV public welfare program "Wait for Me", Baidu Brain, based on intelligent voice technology, synthesized the voice of a deceased veteran, helping old comrades who had been separated for 64 years to "reunite".

According to reports, the technology uses Baidu's end-to-end speech style separation and modeling solution, and uses multiple sets of neural networks to independently encode and model different dimensions of speech, such as timbre, emotion, style, etc., to guide the final synthesis.
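The idea of encoding each dimension of speech separately can be illustrated with a toy sketch. This is not Baidu's implementation: the encoders below are hypothetical linear maps with fixed pseudo-random weights, and all names (`make_encoder`, `encode_speech`, `conditioning_vector`) and feature values are assumptions made for illustration. The sketch only shows how independent codes for timbre, emotion and style could be recombined to guide a synthesizer.

```python
import random


def make_encoder(in_dim, out_dim, seed):
    """Build a toy linear encoder with fixed pseudo-random weights (hypothetical)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(features):
        # Each output value is a weighted sum of the input features.
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    return encode


FEATURE_DIM = 4

# One independent encoder per speech dimension, as described in the report.
encoders = {
    "timbre": make_encoder(FEATURE_DIM, 3, seed=1),
    "emotion": make_encoder(FEATURE_DIM, 2, seed=2),
    "style": make_encoder(FEATURE_DIM, 2, seed=3),
}


def encode_speech(features):
    """Encode one utterance into separate timbre, emotion and style codes."""
    return {name: encoder(features) for name, encoder in encoders.items()}


def conditioning_vector(codes):
    """Concatenate the separate codes into one vector that would guide synthesis."""
    return codes["timbre"] + codes["emotion"] + codes["style"]


# Made-up acoustic features for two speakers.
source = encode_speech([0.2, 0.5, 0.1, 0.9])
target = encode_speech([0.8, 0.1, 0.7, 0.3])

# Keep the source speaker's emotion and style, but swap in the target timbre.
mixed = dict(source, timbre=target["timbre"])
cond = conditioning_vector(mixed)
```

Because the codes are kept separate, the timbre can be replaced without touching the emotion or style codes. A real system would learn these encoders from data and feed the combined vector to a neural synthesizer.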

These applications reflect both the progress of AI and the inclusive value it can bring to society. For example, Sogou is combining voice changing and AI synthetic anchor technology with industries such as media, education, content production and tourism, which opens up more room for creating value.

On the other hand, the risks of technical loopholes and of future abuse of the technology cannot be ignored. Some netizens warned, "be careful of it being used for telecom fraud" and "you may receive a call from 'Jack Ma' in the future."

An industry insider in the audio field believes that the technology should be useful for tool-type products that use audio as a means of interaction, but that its benefit for online audio platforms, where audio is the content itself, remains to be seen.

Enterprises, therefore, while constantly pursuing technological breakthroughs and commercial value, should also take responsibility for the security of their technology.

Speech synthesis technology still has many shortcomings in practice

Realistic speech synthesis is reportedly underpinned by neural networks and machine learning. Neural networks loosely simulate how electrical signals pass between neurons in the human brain: they process input data through layers of neurons and extract common features from large amounts of sample data.
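As a rough illustration of what "layers of neurons" means, here is a minimal sketch of a forward pass through a tiny two-layer network. The weights are hand-picked assumptions rather than learned values, and the function names are hypothetical. Training, the step in which a real network adjusts its weights from sample data, is left out.

```python
def relu(x):
    """Pass positive signals through and block negative ones."""
    return max(0.0, x)


def identity(x):
    """Leave the signal unchanged (used for the output layer)."""
    return x


def dense_layer(inputs, weights, biases, activation):
    """Each neuron takes a weighted sum of its inputs, adds a bias, then activates."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Send the input signal through each layer in turn."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = dense_layer(signal, weights, biases, activation)
    return signal


# A tiny network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.0], identity),
]

output = forward([2.0, 1.0], layers)
print(output)  # [2.5]
```

With the input `[2.0, 1.0]`, the hidden layer produces `[1.0, 1.5]` and the output neuron sums them to `2.5`. A speech model stacks many such layers and learns its weights from large amounts of recorded audio.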

In terms of commercialization, speech synthesis technology is being applied in fields such as voice interaction, audiobooks, new media, intelligent customer service and pan-entertainment.

In an interview, Niu Sen, head of the education category at Qingting FM, said that speech synthesis technology in the audio field will greatly reduce the personnel, time and financial costs of converting text content into audio.

On voice cosplay, Niu Sen pointed out that it has many practical flaws. For example, synthesized audio cannot fully match a real human voice in emotion and expressiveness.

He said that, for audio users, hearing a script read aloud and hearing the same content narrated are very different listening experiences. Only the most authentic human voice can trigger deep emotional resonance, which is also where the value of audio lies.

On ethics and safety, Niu Sen believes that human voices and synthesized voices must first be screened and told apart by technical means, and that the copyright chain needs to be clarified from a rights perspective. Any unauthorized synthesized audio constitutes infringement and is illegal. "As a platform, we will conduct strict copyright and quality control."

On some audio platforms, speech synthesis technology is reportedly used mainly for children's programs. For other content, the quality of AI simulation is not yet good enough, and it has not been widely adopted.

Regarding the security risks of speech synthesis, Liu Qingfeng stressed on the spot after unveiling the voice-changing technology that, for artificial intelligence to keep developing, what matters most is that its values are positive, healthy and kind to people. For that reason, iFlytek will not casually open up a "black technology" like voice changing to the public across apps; there must be a healthy, safe and interesting way to connect it with the world.

Previously, Liu Qingfeng also mentioned that the field of artificial intelligence requires not only technical cooperation, but also legal and ethical cooperation.

Regarding security issues, Sogou told Sina Technology that "Technology is a double-edged sword. It can be used for good or bring disaster. Sogou is committed to using technology for good. Voice changing technology is a cutting-edge application of artificial intelligence. Based on speech representation learning and transfer learning technology, it can convert anyone's voice into a specific person's voice (Any-to-One). Sogou has made a breakthrough in this regard and is the first to enter the practical stage. This technology can also be applied to film and television dubbing, family companionship and other scenarios to help people improve work efficiency and happiness in life."

Sogou revealed that in order to ensure that this technology is not abused by those with ulterior motives, the company has implemented strict management and restrictions:

1. Sogou does not export voice changing technology to third parties to ensure the controllability and security of this technology.

2. All target timbres of the voice changing function are defined by Sogou and do not support users' random imitation.

3. The changed voice can be used in apps such as WeChat and QQ. It cannot be forwarded or copied, and the sender can be tracked.

Previously, Wang Xiaochuan also addressed AI legislation in a media interview: at the current stage of AI development, adjusting and improving legislation as quickly as possible in step with the technology is the most practical way to deal with the legal and ethical risks that AI brings.

However, technology is still developing ahead of ethics and law. Zhou Hongyi said at the World Intelligence Conference in May this year that, in the field of AI, a system designed without humanistic thinking may end in tragedy.

Humanistic thinking behind AI technology

In fact, the blurring of real and fake behind AI technology is not limited to sound. A technology application from Samsung has also attracted attention recently.

According to foreign media reports, researchers at Samsung's artificial intelligence laboratory in Moscow trained a "deep convolutional neural network" on a large amount of animated image and video material. The resulting AI can accurately identify certain facial features and turn still images into animated images or even videos.

In the experiment, the researchers used still images of Einstein, Marilyn Monroe and even the Mona Lisa to generate videos of them speaking, but the video quality is currently low.

In other words, with the advancement of AI image generation technology in the future, fake videos can be generated with just one photo.

Before this, AI face-swapping had also sparked heated discussion on social media. Someone replaced the face of Athena Chu, who played Huang Rong in the 1994 version of The Legend of the Condor Heroes, with Yang Mi's face. Netizens said the result was "perfectly harmonious" and "indistinguishable from the real thing", and even joked that it was "the most cost-effective way to reshoot an old drama".
