

The quality of voice clones has been steadily improving in recent years, with a recent replica of podcaster Joe Rogan demonstrating exactly how far we’ve come. Now, these audio samples are undeniably impressive, but MelNet isn’t exactly a bolt from the blue. Much of this progress dates back to 2016 with the unveiling of SampleRNN and WaveNet, the latter being a machine learning text-to-speech program created by Google’s London-based AI lab DeepMind, which now powers the Google Assistant. (Older text-to-speech systems don’t generate audio, but reconstitute it: chopping up speech samples into phonemes, then stitching these back together to create new words.)

The basic approach with WaveNet, SampleRNN, and similar programs is to feed the AI system a ton of data and use that to analyze the nuances in a human voice. But while WaveNet and others were trained using audio waveforms, Facebook’s MelNet uses a richer and more informationally dense format to learn to speak: the spectrogram.

A comparison of spectrogram and waveform data.

In an accompanying paper, Facebook’s researchers note that while WaveNet produces higher-fidelity audio output, MelNet is superior at capturing “high-level structure” - the subtle consistencies contained in a speaker’s voice that are, ironically, almost impossible to describe in words, but to which the human ear is finely attuned.
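To make the waveform-versus-spectrogram distinction concrete, here’s a minimal sketch of turning raw audio into a mel spectrogram using the open-source librosa library. The input file name (“speech.wav”) is a hypothetical placeholder, and this is an illustration of the representation itself, not Facebook’s actual MelNet pipeline.

```python
# Illustrative sketch (not Facebook's pipeline): converting a raw audio
# waveform into a mel spectrogram with librosa.
import librosa
import numpy as np

# Load a speech clip as a waveform: a 1-D array of amplitude samples.
# "speech.wav" is a hypothetical file path.
waveform, sample_rate = librosa.load("speech.wav", sr=22050)

# A mel spectrogram re-expresses the same audio as a 2-D time-frequency grid,
# packing far more information into each step along the time axis than the
# one-sample-at-a-time raw waveform.
mel_spec = librosa.feature.melspectrogram(
    y=waveform,
    sr=sample_rate,
    n_fft=2048,      # window size of the short-time Fourier transform
    hop_length=512,  # step between successive windows
    n_mels=128,      # number of mel frequency bands
)

# Log-scaling compresses the dynamic range, a standard step before feeding
# spectrograms to a model.
log_mel = librosa.power_to_db(mel_spec, ref=np.max)

print(waveform.shape)  # e.g. (110250,) for 5 s of audio: one value per sample
print(log_mel.shape)   # e.g. (128, 216): far fewer steps along the time axis
```

The shapes printed at the end show the trade-off the article describes: the waveform stretches a few seconds of speech across more than a hundred thousand samples, while the spectrogram condenses the same audio into a couple of hundred time frames, each carrying a full frequency profile.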
Now you may be wondering why the researchers chose to replicate such a STEM-y bunch of speakers. Well, the simple answer is that one of the resources used to train MelNet was a 452-hour dataset of TED talks. The rest of the training data came from audiobooks, chosen because the “highly animated manner” of the speakers makes for a challenging target.
