Jitendra Kumar Dhiman

Research Interests

Speech and audio signal processing, speech synthesis for text-to-speech (TTS) applications, natural language processing, machine learning for signal processing, sequence-to-sequence modeling of time-series data, and generative and discriminative modeling techniques for machine learning.

Brief Overview of Ph.D. Thesis Work

Speech signals have time-varying spectra. Spectrograms have served as a useful tool for the visualization and analysis of speech signals in the joint time-frequency plane. In this thesis, we perform 2-D analysis of speech spectrograms: we take a spectrotemporal patch and model it as a 2-D amplitude-modulated and frequency-modulated (AM-FM) sinusoid. Demodulation of the spectrogram yields the 2-D AM and FM components, which correspond to the slowly varying vocal-tract envelope and the excitation, respectively. To solve the demodulation problem, we rely on the complex Riesz transform, which is a 2-D extension of the 1-D Hilbert transform.

The demodulation viewpoint brings forth many interesting properties of the speech signal. The spectrotemporal carrier helps us identify the regions that are coherent and those that are not. Based on this idea, we introduce the coherencegram corresponding to a given spectrogram. The temporal evolution of the pitch harmonics can also be characterized by the orientation at each time-frequency coordinate, resulting in the orientationgram. We show that these features collectively enable solutions to the important problems of voiced/unvoiced segmentation, aperiodicity estimation, periodic/aperiodic signal separation, and pitch tracking. We compare the performance of the proposed methods with benchmark methods.

The spectrotemporal amplitude characterizes the time-varying magnitude response of the vocal-tract filter. We show how the formants and their bandwidths manifest in the spectrotemporal amplitude. It turns out that the formant bandwidths are mildly overestimated, and the overestimation is perceptible when one performs speech synthesis using the estimated parameters. We propose a method for correcting the formant bandwidths, which also restores the speech quality.

Finally, we use the curated spectrotemporal amplitude, pitch, aperiodicity, and voiced/unvoiced decisions for the task of speech reconstruction in a spectral synthesis model and in a neural vocoder, namely, WaveNet. We show that conditioning WaveNet on the spectrotemporal features results in high-quality speech synthesis.
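
To make the demodulation idea above concrete, the following is a minimal sketch of Riesz-transform demodulation of a 2-D patch. It is not the thesis implementation: it uses the standard FFT-based (monogenic-signal) formulation of the Riesz transform, assumes the patch is zero-mean, and the function name `riesz_demodulate` and the synthetic test patch are hypothetical choices made only for illustration.

```python
import numpy as np

def riesz_demodulate(patch):
    """Demodulate a zero-mean 2-D patch with an FFT-based Riesz transform.

    Illustrative sketch only (hypothetical helper, not the thesis code).
    Returns the local amplitude (AM), local phase (carrier), and local
    orientation at every point of the patch.
    """
    patch = np.asarray(patch, dtype=float)
    rows, cols = patch.shape

    # Frequency grids (cycles/sample) along axis 0 and axis 1.
    u = np.fft.fftfreq(rows)[:, None]
    v = np.fft.fftfreq(cols)[None, :]
    radius = np.sqrt(u**2 + v**2)
    radius[0, 0] = 1.0  # avoid 0/0 at DC; the numerator is zero there

    spectrum = np.fft.fft2(patch)

    # The two Riesz components have frequency responses -j*u/|w| and -j*v/|w|.
    r1 = np.real(np.fft.ifft2(-1j * u / radius * spectrum))
    r2 = np.real(np.fft.ifft2(-1j * v / radius * spectrum))

    riesz_mag = np.hypot(r1, r2)
    amplitude = np.sqrt(patch**2 + riesz_mag**2)  # AM component
    phase = np.arctan2(riesz_mag, patch)          # carrier phase in [0, pi]
    orientation = np.arctan2(r2, r1)              # direction of the carrier
    return amplitude, phase, orientation

# Synthetic patch: a constant-amplitude 2-D sinusoid with an integer number
# of cycles along each axis (assumed values, chosen only for the demo).
m, n = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
carrier_phase = 2 * np.pi * (3 * m + 5 * n) / 64
patch = 2.0 * np.cos(carrier_phase)

amplitude, phase, orientation = riesz_demodulate(patch)
```

For this synthetic patch, the recovered amplitude is flat and the orientation (modulo pi) matches the direction of the carrier's frequency vector. On a real spectrogram patch, the amplitude would play the role of the vocal-tract envelope and the phase and orientation would describe the harmonic carrier, under the AM-FM model described above.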

The quality of the synthesized speech is assessed using both objective and subjective measures. We rely on the Perceptual Evaluation of Speech Quality (PESQ) metric and the standard Mean Opinion Score (MOS) test for objective and subjective evaluation, respectively. The performance of the proposed parameters is also evaluated in a vocoder framework that uses the spectral synthesis model for speech reconstruction. The objective evaluation shows that the performance of the Riesz transform-based speech parameters is on par with the baseline systems. Using the spectral synthesis model, we report average PESQ scores ranging from 2.30 to 3.45 over a total of 200 speech waveforms taken from the CMU-ARCTIC database, comprising both male and female speakers. In comparison, WaveNet-based speech reconstruction gave an average PESQ score of 3.65.

Subjective evaluation was carried out through listening tests conducted in an acoustic test chamber with volunteers aged 21 to 30. The average MOS was 4.30 when the Riesz transform-based features were used in WaveNet for speech reconstruction, which was comparable to the baseline systems, STRAIGHT and WORLD. Both objective and subjective evaluations also showed that the quality of the reconstructed speech waveforms was better with the proposed features in a WaveNet vocoder than in the spectral synthesis model.

Audio demonstrations of thesis work and other related research are available at the GitHub link: Demos