▲ After many years, Tim and his mother hear his original voice again, synthesized by AI (photo source: DeepMind official website)
Project Euphonia is a speech-to-text transcription service for people with speech impairments. Drawing on audio data from patients with neurodegenerative diseases, and combined with the Parrotron model (an attention-based seq2seq model), it can improve the efficiency of speech synthesis and generate high-quality speech. The voice recovery project for Tim lasted six months. The researchers first extracted recordings of Tim's voice from before he fell ill and used them as the sample data for the synthetic voice. To do this, they used a generative AI model called WaveNet.
The WaveNet model imitates and synthesizes human speech by recognizing its rhythm. Compared with earlier speech generation models, the speech it generates sounds more authentic and convincing. WaveNet is said to have reached a level comparable to about 70% of human speech, and it also generates speech more efficiently. The model runs on a Tensor Processing Unit (TPU), a chip Google designed for machine learning, and takes an average of 50 milliseconds to create one second of speech.
Photo source: Pixabay
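The speed figure quoted above (50 milliseconds of compute per second of speech) can be restated as a real-time factor. The short sketch below only does that arithmetic; the function names and the 4-second sentence are illustrative assumptions, not part of the project.

```python
def real_time_factor(synthesis_ms: float, audio_ms: float) -> float:
    """Compute time divided by audio duration; below 1.0 means faster than real time."""
    return synthesis_ms / audio_ms


def synthesis_time_ms(audio_ms: float, rtf: float) -> float:
    """Estimated compute time for a clip, assuming the real-time factor stays constant."""
    return audio_ms * rtf


# Figure from the article: 50 ms of compute for 1 second (1000 ms) of speech.
RTF = real_time_factor(50.0, 1000.0)

# How many times faster than real time the synthesis runs.
SPEEDUP = 1000.0 / 50.0

# Hypothetical 4-second sentence, used only to illustrate the scaling.
SENTENCE_MS = synthesis_time_ms(4000.0, RTF)

print(f"real-time factor: {RTF:.2f}")
print(f"speed-up over real time: {SPEEDUP:.0f}x")
print(f"estimated time for a 4 s sentence: {SENTENCE_MS:.0f} ms")
```

At that rate, synthesis runs about twenty times faster than the speech plays back, which is what makes a conversational voice practical.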
After building the models, another key step for the researchers was fine-tuning, which is also the key to obtaining high-quality synthesis from as little training data as possible. They first trained the WaveNet model on thousands of speakers, and then fine-tuned it on a small set of speech samples extracted from Tim's past audio and video recordings. After this continued imitation practice, the speech generated by WaveNet naturally takes on the speaker's own characteristics.
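The idea of fine-tuning can be illustrated with a deliberately tiny toy. This is not WaveNet: the one-parameter model, the made-up "speaker" data, and every name below are assumptions chosen only to show why starting from a model pretrained on many speakers helps when the target speaker has very few samples.

```python
def train(w, data, lr, steps):
    """Gradient descent on mean squared error for the one-parameter model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of y = w * x on the given (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


# Hypothetical "many speakers" corpus: each speaker has a slightly different slope.
xs = [0.5, 1.0, 1.5, 2.0]
many_speakers = [(x, slope * x) for slope in (1.8, 1.9, 2.0, 2.1, 2.2) for x in xs]

# Hypothetical target speaker: only three samples, with a slope of 2.5.
target = [(0.5, 1.25), (1.0, 2.5), (1.5, 3.75)]

# Step 1: pretrain on the large multi-speaker corpus.
pretrained = train(0.0, many_speakers, lr=0.1, steps=200)

# Step 2: fine-tune on the few target samples, starting from the pretrained weight.
fine_tuned = train(pretrained, target, lr=0.1, steps=5)

# Baseline: the same few steps on the same few samples, but starting from zero.
from_scratch = train(0.0, target, lr=0.1, steps=5)

print(f"pretrained weight:  {pretrained:.3f}")
print(f"fine-tuned error:   {mse(fine_tuned, target):.4f}")
print(f"from-scratch error: {mse(from_scratch, target):.4f}")
```

With the same small training budget, the fine-tuned model ends up closer to the target speaker than the model trained from scratch, because the pretrained weight already captures what the speakers have in common.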
However, excellent voice imitation and generation ability alone is not enough. For an AI model, a sound architecture is the basis for the overall operating efficiency of the system. So the researchers migrated from WaveNet to WaveRNN, a more compact model that produces higher-fidelity audio. In addition, they fine-tuned Tacotron 2, a text-to-speech system. It builds its speech synthesis on the spectrogram, a visual representation of how the frequency spectrum of an audio signal changes over time. In other words, the AI learns not only to imitate sounds by listening, but also to produce sounds by looking at pictures!
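To make the "picture of sound" concrete, here is a minimal sketch of how a spectrogram can be computed from raw audio samples. It is a plain linear-frequency magnitude spectrogram written in pure Python for illustration, not the exact features Tacotron 2 uses; the sample rate, frame sizes, and test tone are all assumptions.

```python
import cmath
import math


def frame_magnitudes(frame):
    """Magnitude of the DFT of one frame (naive O(n^2) version, fine for a demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len=64, hop=32):
    """Slice the signal into overlapping Hann-windowed frames and take each frame's spectrum.

    Returns a list of frames (the time axis), each a list of magnitudes (the frequency axis).
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1)) for t in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(frame_magnitudes(frame))
    return frames


# Hypothetical test signal: a pure 1000 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
FRAME_LEN = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)

# The brightest "pixel" in the first column should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN

print(f"{len(spec)} time frames x {len(spec[0])} frequency bins")
print(f"strongest frequency in the first frame: {peak_hz:.0f} Hz")
```

Each row of the result is one moment in time and each column one frequency band, so the whole grid can be drawn as an image, which is the representation a text-to-speech model learns to predict.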
Over the past six months, the voice recovery project for Tim has made great progress, and its results have now been shown to the public. In the first episode of the new technology series The Age of A.I., narrated by Robert Downey Jr., Tim and his family heard his synthesized voice for the first time. In the program, an AI trained on Tim's voice reads aloud a letter from the 34-year-old Tim to his 22-year-old self.
▲ Tim (second from right) watches The Age of A.I. with his family and members of Project Euphonia (photo source: DeepMind official website)
When disease strikes, it destroys health and disrupts the pace of life. But don't forget that science and technology keep advancing too. Those entangled by disease will move forward with the hands of the technological clock, until the disease is cured and they return to health.