ABSTRACT
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
1 INTRODUCTION
This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al., 2016a;b) and text (Jozefowicz et al., 2016). Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation.
Remarkably, these architectures are able to model distributions over thousands of random variables (e.g. 64×64 pixels as in PixelRNN (van den Oord et al., 2016a)). The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second (see Fig. 1).
This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:
• We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
• In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
• We show that a single model can be used to generate different voices, conditioned on a speaker identity.
• The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.
We believe that WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).
2 WAVENET
In this paper audio signals are modelled with a generative model operating directly on the raw audio waveform. The joint probability of a waveform x = {x_1, . . . , x_T} is factorised as a product of conditional probabilities as follows:

p(x) = \prod_{t=1}^{T} p(x_t | x_1, . . . , x_{t-1})
Each audio sample x_t is therefore conditioned on the samples at all previous timesteps. Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value x_t with a softmax layer. The model is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure overfitting/underfitting.
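As a minimal illustration of this training objective (not the paper's implementation), the sketch below assumes the waveform has been quantised to 256 levels and that the network emits one logit per level at every timestep; the log-likelihood is then the sum of the log-probabilities the softmax assigns to the observed samples.

```python
import numpy as np

# Sketch of the categorical/softmax objective described above; the logits and
# targets are random stand-ins for the network output and the quantised
# ground-truth waveform (256 levels is an assumption for illustration).
num_levels, T = 256, 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, num_levels))       # one row of logits per timestep
targets = rng.integers(0, num_levels, size=T)   # observed quantised samples x_t

# Softmax over quantisation levels gives p(x_t | x_1, ..., x_{t-1}) per timestep.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1}); training maximises this.
log_likelihood = np.log(probs[np.arange(T), targets]).sum()
print(log_likelihood)
```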
2.1 DILATED CAUSAL CONVOLUTIONS
The main ingredient of WaveNet is causal convolutions. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction p(x_{t+1} | x_1, . . . , x_t) emitted by the model at timestep t cannot depend on any of the future timesteps x_{t+1}, x_{t+2}, . . . , x_T. This is visualized in Fig. 2. For images, the equivalent of a causal convolution is a masked convolution (van den Oord et al., 2016a), which can be implemented by constructing a mask tensor and multiplying it elementwise with the convolution kernel before applying it. For 1-D data such as audio, one can more easily implement this by shifting the output of a normal convolution by a few timesteps.
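For concreteness, here is a minimal sketch of that shifting trick for 1-D data, assuming a single input channel and a single filter (the actual model uses multi-channel convolution layers): left-padding the input with kernel_size − 1 zeros makes every output depend only on the current and past inputs.

```python
import numpy as np

def causal_conv1d(x, weights):
    """Causal 1-D convolution via left-padding; output[t] sees only x[:t+1]."""
    k = len(weights)
    padded = np.concatenate([np.zeros(k - 1), x])  # shift: pad k-1 zeros on the left
    # np.convolve flips its second argument, so pass weights reversed to obtain a
    # cross-correlation, which is what neural-network "convolutions" compute.
    return np.convolve(padded, weights[::-1], mode="valid")

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25])           # w[0] weights x[t-1], w[1] weights x[t]
print(causal_conv1d(x, w))          # [0.25 1.   1.75 2.5 ]
```

The dilated variant referred to in the section title spaces the kernel taps apart in time, which is how the receptive field is grown rapidly with depth.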
At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.
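A minimal sketch of that sequential sampling loop follows; `model` is a hypothetical callable returning the categorical distribution over quantisation levels for the next sample given the history, and the uniform stand-in exists only so the loop runs.

```python
import numpy as np

def generate(model, num_samples, num_levels=256, seed=0):
    """Sample one value at a time, feeding each prediction back as input."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(num_samples):
        probs = model(history)                    # p(x_t | x_1, ..., x_{t-1})
        sample = int(rng.choice(num_levels, p=probs))
        history.append(sample)                    # fed back for the next step
    return np.array(history)

# Uniform stand-in for the trained network, just to exercise the loop.
uniform_model = lambda history: np.full(256, 1.0 / 256)
print(generate(uniform_model, num_samples=8))
```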