Authors: Michal Novotný;  Jan Rusz;  Roman Čmejla
Authors' workplace: Czech Technical University, Faculty of Electrical Engineering, Prague, Czech Rep.
Published in: Lékař a technika - Clinician and Technology No. 2, 2012, 42, 81-84
Category: Conference YBERC 2012


Hypokinetic dysarthria is a common manifestation of Parkinson's disease (PD). Articulation characteristics can provide useful information for distinguishing dysarthric speakers from healthy subjects and for monitoring disease severity and treatment effects. The aim of this study was to design an algorithm for automatic segmentation of consonants and vowels in a rapid, steady /pa/-/ta/-/ka/ syllable repetition. All syllables were manually labeled at three positions: explosion (E), vowel (V), and occlusion (O). In addition, voice onset time (VOT) was included as a representative measurement, computed as the difference between the V and E positions. Compared with the manually labeled positions, VOT was detected with success rates of 68.2-90.5%, 44.1-75.2%, and 57.2-83.5% for normal, dysarthric, and all speakers, respectively, for tolerance thresholds from 5 ms to 20 ms. In conclusion, this study shows that an algorithm based on spectral, Bayesian, and polynomial approaches can be used to accurately detect the positions of consonants and vowels in normal and dysarthric utterances.

Parkinson’s disease, dysarthria, articulation, diadochokinetic


PD affects the dopaminergic pathways from the substantia nigra to the putamen. This causes striatal dopaminergic loss, which leads to motor disorders. The four basic symptoms are tremor, muscular rigidity, akinesia or bradykinesia, and a stooped, unstable posture [1].

According to a published study [2], speech pathologies occur in 70-90% of cases. These speech disorders can be among the first symptoms of PD [2]. Three grades of dysarthria are defined: mild, moderate, and severe [3].

This work is aimed at the evaluation of mild articulation disorders. For this purpose the diadochokinetic (DDK) task was used. In this task, patients are asked to repeat the syllable sequence /pa/-/ta/-/ka/ as fast and as long as possible.

This paper is concerned with the segmentation of utterances obtained in the DDK task. Its intention is to create an algorithm that automatically labels the important positions in the signal: the beginning of the plosive (/p/, /t/, /k/) (E), the beginning of the vowel (V), and the end of the vowel (O). Once the positions of E, V, and O are known, other characteristics can be estimated, such as the harmonics-to-noise ratio or the fundamental frequency F0.



The data used for the algorithm design are part of an earlier study [4], in which utterances of 46 native speakers were collected. Of these speakers, 24 (20 men and 4 women) were diagnosed with early-stage PD and were recorded before the start of pharmacological treatment. The data of the healthy control group (CG) were acquired from 22 participants (15 men and 7 women) without any neurological disorders.

Within the study [4], utterances of the DDK task were recorded, in which participants were asked to repeat the syllable sequence /pa/-/ta/-/ka/ as fast and as long as possible [5].

The final training database consists of 80 recordings (1644 syllables /pa/, /ta/, or /ka/), of which 40 recordings (753 syllables) are from the PD group and 40 recordings (891 syllables) are from the CG.

Phoneme boundaries

For automatic analysis of an utterance, it is necessary to cut the speech recording and detect three basic positions. The first position is the explosion (E), the beginning of the consonant (/p/, /t/, /k/), which is characterized by the release of the oral closure and by an increase in the energy of the explosive noise.

The second position is the beginning of the vowel (V), which represents the onset of vocal fold vibration. The last one is the occlusion (O), which marks the end of voicing in the signal. One syllable /pa/ with the positions of E, V, and O marked is shown in figure 1.

Fig. 1: One labeled syllable recorded from a healthy participant.

Signal segmentation

Because the number of syllables in an individual signal is unknown, it is more effective to first split the signal into smaller segments, each containing only one position of E, V, and O. This yields the approximate borders of the syllables in the signal.

First, the signal is resampled to the sampling frequency fs = 16 kHz and filtered by two lowpass filters, one with a bandwidth of 300 Hz and the other with a bandwidth of 1100 Hz. This gives two filtered signals, to each of which a peak detector is applied. This peak detector is defined as

where |y(n)| is the absolute value of the n-th sample of the signal and k(n) is defined as

The peak detector outputs are smoothed by a moving average filter and normalized. The smoothed signals are then searched for local maxima with a minimum distance of 800 samples (50 ms at fs = 16 kHz). This procedure helps to prevent false detections.
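The smoothing and peak-picking step can be sketched as follows. This is only an illustration under stated assumptions: the peak detector itself is not reproduced here, so the functions take any non-negative detector output as input, the window length of 401 samples is a hypothetical value, and the greedy highest-first selection is one possible way to enforce the minimum distance of 800 samples.

```python
import numpy as np

def smooth_and_normalize(detector_output, window=401):
    """Moving-average smoothing followed by scaling to a maximum of 1.

    `window` is an illustrative value, not taken from the paper.
    """
    kernel = np.ones(window) / window
    smoothed = np.convolve(detector_output, kernel, mode="same")
    peak = smoothed.max()
    return smoothed / peak if peak > 0 else smoothed

def local_maxima(signal, min_distance=800):
    """Return indices of local maxima at least `min_distance` samples apart.

    Candidates are visited from the highest to the lowest; a candidate is
    kept only if no already accepted maximum lies closer than `min_distance`.
    """
    signal = np.asarray(signal, dtype=float)
    # Interior samples that are >= the left neighbour and > the right one.
    candidates = np.flatnonzero(
        (signal[1:-1] >= signal[:-2]) & (signal[1:-1] > signal[2:])
    ) + 1
    order = candidates[np.argsort(signal[candidates])[::-1]]
    kept = []
    for idx in order:
        if all(abs(idx - k) >= min_distance for k in kept):
            kept.append(int(idx))
    return sorted(kept)
```

Visiting the candidates from the highest peak downwards means that, of two maxima closer than the minimum distance, the weaker one is discarded.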

The vector of positions found in the signal filtered by the 1100 Hz filter is taken as the initial one and is supplemented by the positions obtained from the signal filtered by the 300 Hz filter.

The widest range between two locations in this vector is computed. This value is then widened by a few samples, which gives the final length of one segment. This range is then divided in a given ratio, which gives the approximate borders of the syllable shown in figure 1.
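A minimal sketch of how segment borders could be derived from the detected positions is given below. The text does not state the widening or the division ratio, so `margin` and `ratio_before` are hypothetical parameters, and the reading that the widest range is measured between neighbouring positions is likewise an assumption.

```python
import numpy as np

def segment_borders(peaks, n_samples, margin=160, ratio_before=0.5):
    """Cut one fixed-length segment around every detected position.

    `margin` and `ratio_before` are illustrative assumptions: the segment
    length is the widest gap between neighbouring positions plus `margin`,
    and `ratio_before` is the share of the segment placed before a position.
    """
    peaks = np.asarray(sorted(peaks))
    seg_len = int(np.diff(peaks).max()) + margin
    before = int(round(seg_len * ratio_before))
    borders = []
    for p in peaks:
        start = max(0, int(p) - before)          # clip at the signal start
        stop = min(n_samples, start + seg_len)   # clip at the signal end
        borders.append((start, stop))
    return seg_len, borders
```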

Detection of the explosion

The detection of E is carried out only on the part of the segment that precedes the local maximum found by the previous segmentation. It is based on filtering the spectrogram, which is treated as a matrix P with m rows and n columns. To highlight the peaks, a filtering threshold is computed for each row as the weighted average of all values in that row. Every value in the row smaller than this threshold is set to zero and every higher value is kept.
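The row-wise filtering of the spectrogram can be illustrated with a short sketch. The weights of the weighted average are not given in the text, so the example assumes a plain row mean scaled by a hypothetical factor `weight`. Values exactly equal to the threshold are kept, which is also an assumption.

```python
import numpy as np

def threshold_rows(P, weight=1.0):
    """Zero every spectrogram value that lies below its row threshold.

    The threshold of a row is `weight` times the mean of that row
    (an assumed form of the weighted average); other values are kept.
    """
    P = np.asarray(P, dtype=float)
    thresholds = weight * P.mean(axis=1, keepdims=True)
    return np.where(P < thresholds, 0.0, P)
```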

Two energy envelopes are obtained by summing the values in each column. The first one is the sum of all values and the second one is the sum of the values above 1500 Hz. The next step is the computation of the centroid of each envelope and the elimination of signals with unsuitable centroid positions. The approximate position of V is then found in the energy envelope of the whole filtered spectrogram; if this position lies too close to the segment border, the borders of the segment are moved so that the real position of E is not missed outside the segment.
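The two envelopes and their centroids could be computed as in the following sketch. It assumes that the rows of P correspond to the frequencies listed in a vector `freqs` (a hypothetical name) and that the centroid is the energy-weighted mean column index. The criterion for rejecting a signal by its centroid position is not specified in the text and is therefore left out.

```python
import numpy as np

def energy_envelopes(P, freqs, cutoff_hz=1500.0):
    """Sum spectrogram columns over all rows and over rows above cutoff_hz."""
    P = np.asarray(P, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    full = P.sum(axis=0)
    high = P[freqs > cutoff_hz].sum(axis=0)
    return full, high

def centroid(envelope):
    """Energy-weighted mean column index of an envelope (NaN if empty)."""
    envelope = np.asarray(envelope, dtype=float)
    total = envelope.sum()
    if total == 0:
        return float("nan")
    return float(np.dot(np.arange(envelope.size), envelope) / total)
```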

The explosion is then searched for in the energy envelope of the frequencies above 1500 Hz, because the high energy of the vowel at lower frequencies hides the E peak. This method is illustrated in figure 2.

Fig. 2: Principle of E detection. From the top: labeled part of the signal, spectrogram, energy envelope of the whole signal, and energy envelope of the signal above 1500 Hz.

Detection of the vocal beginning

The V detection relies on the rapid growth of the signal energy, which is detected by the Bayesian step changepoint detector (BSCD) [6]. Local maxima in the BSCD output are found and the one corresponding to the position of V is chosen. The selection is based on the shape of the output: it can be assumed that the peak following the longest gap is the peak matching the position of V. For better insight see figure 3.

Fig. 3: Method of V detection; the bottom part shows the BSCD output.
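The rule for selecting the V peak (the peak following the longest gap) can be sketched in a few lines. The sketch does not implement the BSCD itself; it only takes the positions of the local maxima of a detector output. Measuring the gap before the first peak from the segment start is an assumption, as is the function name.

```python
def pick_vowel_onset(peaks, segment_start=0):
    """Return the peak that follows the longest gap between detector peaks.

    The gap before the first peak is measured from `segment_start`
    (an assumption; the text does not say how the first peak is handled).
    """
    if not peaks:
        raise ValueError("no peaks to choose from")
    peaks = sorted(peaks)
    previous = [segment_start] + peaks[:-1]
    gaps = [p - q for p, q in zip(peaks, previous)]
    return peaks[gaps.index(max(gaps))]
```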

Detection of the vocal ending

The principle of the O detection is to find a flexible threshold that adapts itself to the shape of the signal. First, the signal is filtered by a lowpass filter with a bandwidth of 500 Hz. Next, the square of the signal is computed.

The threshold is formed by an inverted polynomial approximation of the ninth order, shifted by an offset computed as two times the average value of the signal energy. This threshold can be written as

where x is the vector of x-axis values, with the first value equal to one and a length equal to the length of the searched segment. The coefficients ai and bi are the coefficients of the i-th order of the polynomial and x̄ is the average value of the signal energy. This threshold is shown in figure 4.

Fig. 4: Detection of the O by the inverted polynomial threshold.
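One possible reading of the adaptive threshold is sketched below, under the assumption that "inverted" means the negated ninth-order least-squares fit of the signal energy and that the offset is added to it. The function name and the use of `numpy.polynomial.Polynomial.fit` (which rescales the x axis internally for numerical stability) are choices made for this example, not details taken from the paper.

```python
import numpy as np
from numpy.polynomial import Polynomial

def occlusion_threshold(energy, order=9, offset_factor=2.0):
    """Negated polynomial fit of the energy, shifted by a mean-based offset.

    Assumed form: threshold = -fit(x) + offset_factor * mean(energy),
    with x = 1, 2, ..., len(energy).
    """
    energy = np.asarray(energy, dtype=float)
    x = np.arange(1, energy.size + 1)   # x axis starting at one
    fit = Polynomial.fit(x, energy, deg=order)
    return -fit(x) + offset_factor * energy.mean()
```

With this form the threshold is low where the fitted energy is high and rises as the energy decays, so it adapts to the shape of each segment.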

Appraisal of the results

The detection results were evaluated over the total count of syllables, not over individual signals. Automatically detected positions were compared with the hand-labeled E, V, and O positions. The difference between the detected and hand-labeled positions was compared against three thresholds (5 ms, 10 ms, and 20 ms), and a detection was marked as successful when the difference was smaller than or equal to the threshold. The percentage success rate was computed relative to the total number of syllables.

For the VOT evaluation, the differences between the detected and hand-labeled VOT lengths were compared with the same thresholds as for the individual E, V, and O positions. Figure 5 gives a better view of the achieved results.

Fig. 5: Dependence of the success rate on the value of the threshold, with 5 ms, 10 ms, and 20 ms highlighted.
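The evaluation described above amounts to counting the syllables whose absolute error does not exceed the threshold. A minimal sketch follows, with positions given in milliseconds and with hypothetical function names. The numbers in the usage example are made up for illustration and are not results from the study.

```python
def success_rate(detected_ms, labeled_ms, threshold_ms):
    """Percentage of positions whose absolute error is within threshold_ms."""
    if len(detected_ms) != len(labeled_ms):
        raise ValueError("inputs must have the same length")
    if not detected_ms:
        raise ValueError("no syllables to evaluate")
    hits = sum(
        abs(d - l) <= threshold_ms for d, l in zip(detected_ms, labeled_ms)
    )
    return 100.0 * hits / len(detected_ms)

def vot_lengths(e_ms, v_ms):
    """VOT of each syllable as the difference between its V and E positions."""
    return [v - e for e, v in zip(e_ms, v_ms)]

# Illustrative values only: four syllables with errors of 2, 5, 15 and 0 ms.
detected = [10, 25, 40, 70]
labeled = [12, 20, 55, 70]
rates = {t: success_rate(detected, labeled, t) for t in (5, 10, 20)}
```

The same function can be applied to VOT by passing the detected and hand-labeled VOT lengths obtained from `vot_lengths`.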


It is possible to roughly compare the success of the VOT detection with works [7] and [8]. The study [7] is aimed at evaluating VOT length for the purpose of accent distinction. For the comparison, the results of native speakers of American English were used because of their similarity to our data. Correct detections (difference less than 10% of the length) were made in 74.9% of all cases and the average difference of the correct detections was 0.735 ms [7]. Our algorithm gives a worse score of 57.2% and 1.273 ms. The results published in paper [8], which deals with the measurement of voiced and unvoiced consonants (/b/, /p/, /d/, /t/, /g/, and /k/), are 72.6% for the 10 ms threshold and 87.8% for the 20 ms threshold. Our results are 68.1% for the 10 ms and 83.5% for the 20 ms threshold.

When making this comparison, it is necessary to consider the different purposes of each work. The speakers used for the comparison had the most similar accents available, but these still differ. A further limitation is the presence of PD, which worsens our results.

With the strictest 5 ms threshold, the algorithm presented in this paper achieved, for all participants, success rates of sE = 64.0%, sV = 71.2%, and sVOT = 57.2% for E, V, and VOT. The success rate of O for all participants at the more relevant 10 ms threshold is sO = 64.6%. These rates are satisfactory and comparable to the other two works; however, further improvement is necessary to increase robustness in PD cases.

Currently, all three detections work independently of each other, which leaves room for additional improvement by feedback control. The length of the VOT can be compared with physiological values, and if the detected VOT is out of bounds, the positions can be re-estimated. The O detection can be improved by an algorithm for automatic estimation of the ideal order of the polynomial approximation. At present the order is set to a fixed empirical value, which is not ideal in all cases.


The work has been supported by research grants SGS12/185/OHK4/3T/13, GACR 102/12/2230 and NT 12288-5/2011 and by the research programs MSM 0021620849 and MSM 6840770012.

Ing. Michal Novotný

Department of Circuit Theory

Faculty of Electrical Engineering

Czech Technical University in Prague

Technická 2, CZ-166 27 Praha 6 – Dejvice


Phone: +420 776 643 155


[1] Rodríguez-Oroz, M. C., Jahanshahi, M., Krack, P., Macias, R., Bezard, E., Obeso, J. A.: Initial clinical manifestations of Parkinson’s disease: features and pathophysiological mechanisms. The Lancet Neurology, 8 (12), 1128–1139, 2009.

[2] Duffy, J. R.: Motor Speech Disorders: Substrates, Differential Diagnosis and Management. 2nd ed. Mosby, New York, NY, 2005, pp. 1–592.

[3] Darley, F. L., Aronson, A. E., Brown, J. R.: Differential diagnostic patterns of dysarthria. J. Speech Hear. Res., 12, 426–496, 1969.

[4] Rusz, J., Čmejla, R., Růžičková, H., Růžička, E.: Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated Parkinson’s disease. J. Acoust. Soc. Am., 129 (1), 350–367, 2011.

[5] Fletcher, S.: Time-by-count measurement of diadochokinetic syllable rate. J. Speech Hear. Disord., 15, 757–762, 1972.

[6] Čmejla, R., Sovka, P.: Recursive Bayesian Autoregressive Changepoint Detector for Sequential Signal Segmentation. EUSIPCO Proceedings, Vienna, 2004, 245–248.

[7] Hansen, J. H. L., Gray, S. S., Kim, W.: Automatic voice onset time detection for unvoiced stops (/p/, /t/, /k/) with application to accent classification. Speech Communication, 52, 777–789, 2010.

[8] Stouten, V., Van hamme, H.: Automatic voice onset time estimation from reassignment spectra. Speech Communication, 51, 1194–1205, 2009.

