Hypokinetic dysarthria is a common manifestation of Parkinson's disease (PD). Articulation characteristics can provide useful information for distinguishing dysarthric speakers from healthy subjects and for monitoring disease severity and treatment effects. The aim of this study was to design an algorithm for automatic segmentation of consonants and vowels based on rapid, steady repetition of the syllables /pa/-/ta/-/ka/. All syllables were manually labeled at three positions: explosion (E), vowel (V), and occlusion (O). In addition, voice onset time (VOT) was included as a representative measurement, computed as the difference between the V and E positions. Compared with the manually labeled positions, the VOT was detected within tolerances of 5 ms to 20 ms with success rates of 68.2-90.5%, 44.1-75.2%, and 57.2-83.5% for normal, dysarthric, and all speakers, respectively. In conclusion, this study shows that an algorithm based on spectral, Bayesian, and polynomial approaches can be used to accurately detect the positions of consonants and vowels in normal and dysarthric utterances.
PD affects the dopaminergic pathways from the substantia nigra to the putamen. This causes striatal dopaminergic loss, which leads to motor disorders. The four basic symptoms are tremor, muscular rigidity, akinesia or bradykinesia, and a stooped, unstable posture. According to a published study, speech pathologies occur in 70-90% of cases. These speech disorders can be among the first symptoms of PD. Three grades of dysarthria are defined: mild, moderate and severe.
This work aims at the evaluation of mild articulation disorders. For this purpose the diadochokinetic (DDK) task was used, in which patients are asked to repeat the syllable sequence /pa/-/ta/-/ka/ as fast and for as long as possible.
This paper is concerned with the segmentation of utterances obtained in the DDK task. Its aim is to create an algorithm that automatically labels important positions in the signal: the beginning of the plosive (/p/, /t/, /k/) (E), the beginning of the vowel (V), and the end of the vowel (O). Once the positions of E, V and O are known, other characteristics can be estimated, such as the harmonic-to-noise ratio or the fundamental frequency F0.
The data used for the algorithm design are part of an earlier study, in which utterances of 46 native speakers were collected. Of these, 24 speakers (20 men and 4 women) had been diagnosed with early-stage PD, and their recordings were made before pharmacological treatment. The healthy control group (CG) consists of 22 participants (15 men and 7 women) without any neurological disorders.
Within that study, DDK task utterances were recorded, in which participants were asked to repeat the syllable sequence /pa/-/ta/-/ka/ as fast and for as long as possible.
The final training database consists of 80 recordings (1644 syllables /pa/, /ta/ or /ka/), of which 40 recordings (753 syllables) are PD and 40 recordings are CG (891 syllables).
For automatic analysis of the utterance, it is necessary to segment the speech recording and detect three basic positions. The first position is the explosion (E), the beginning of the consonant (/p/, /t/, /k/), which is characterized by the release of the oral closure and by an increase in the energy of the explosion noise.
The second position is the beginning of the vowel (V), which represents the onset of vocal cord vibration. The last one is the occlusion (O), which marks the end of voicing in the signal. One syllable /pa/ with marked positions of E, V and O is shown in figure 1.
Because the number of syllables in an individual signal is unknown, it is more effective to first split the signal into smaller segments, each containing only one E, V and O position. This yields approximate syllable borders in the signal.
First, the signal is resampled to the sampling frequency fs = 16 kHz and filtered by two low-pass filters, one with a bandwidth of 300 Hz and the other with a bandwidth of 1100 Hz. This gives two filtered signals, to each of which a peak detector is applied. This peak detector is defined as
where |y(n)| is the absolute value of the n-th sample of the signal and k(n) is defined as
The peak detector outputs are smoothed by a moving average filter and normalized. Local maxima are then searched in the smoothed signals with a minimum distance of 800 samples (50 ms at fs = 16 kHz). This constraint helps to prevent false detections.
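The pre-segmentation chain described above can be sketched as follows. This is only an illustration under stated assumptions: the filter type and order (4th-order Butterworth), the moving-average window length, and the use of plain rectification in place of the peak detector are not taken from the paper; only fs = 16 kHz, the two cutoffs and the 800-sample minimum distance are. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

FS = 16_000  # sampling frequency after resampling (from the text)

def smoothed_envelope(y, cutoff_hz, fs=FS, win=400):
    """Low-pass filter, rectify, smooth by a moving average, normalize."""
    # Butterworth order 4 is an assumption; the text only gives the bandwidth.
    sos = butter(4, cutoff_hz, btype="low", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, y)
    # Rectification is a stand-in for the paper's peak detector (assumption).
    rectified = np.abs(filtered)
    # Moving-average window length is an assumption.
    kernel = np.ones(win) / win
    smooth = np.convolve(rectified, kernel, mode="same")
    return smooth / smooth.max()

def syllable_peaks(y, cutoff_hz, fs=FS, min_distance=800):
    """Local maxima of the smoothed envelope, at least min_distance samples apart."""
    env = smoothed_envelope(y, cutoff_hz, fs)
    peaks, _ = find_peaks(env, distance=min_distance)
    return peaks, env
```

Calling `syllable_peaks(y, 1100)` and `syllable_peaks(y, 300)` would give the two position vectors that the next step combines.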
The vector of positions found in the signal filtered by the 1100 Hz filter is taken as the initial one and is amended by the positions obtained from the signal filtered by the 300 Hz filter.
The widest range between two neighboring positions is computed from this vector. This value is further widened by a few samples, which gives the final length of one segment. This range is then divided in a fixed ratio, which gives the approximate syllable borders shown in figure 1.
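A minimal sketch of combining the two position vectors and deriving the segment length is given below. The merge rule (a position from the second vector is added only when no position of the initial vector lies within a tolerance) and the margin value are assumptions made for illustration; the paper specifies neither, and the function names are hypothetical.

```python
import numpy as np

def merge_positions(primary, secondary, tolerance=800):
    """Add secondary positions that have no primary position within `tolerance` samples.

    The tolerance-based rule is an assumption, not the paper's stated rule.
    """
    primary = np.asarray(primary)
    extra = [p for p in secondary if np.min(np.abs(primary - p)) > tolerance]
    return np.sort(np.concatenate([primary, np.asarray(extra, dtype=primary.dtype)]))

def segment_length(positions, margin=160):
    """Widest gap between neighbouring positions, widened by a small margin.

    The margin of 160 samples is an assumed value for "a few samples".
    """
    return int(np.max(np.diff(positions))) + margin
```

For example, merging `[1000, 4000, 10000]` with `[1100, 7000]` keeps 7000 and drops 1100, which duplicates the position at 1000.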
Detection of the explosion
The detection of E is performed only on the part of the segment that precedes the local maximum found by the previous segmentation. It is based on filtering the spectrogram, which is treated as a matrix P with m rows and n columns. To highlight the peaks, a filtration threshold is computed for each row as a weighted average of all values in that row. Every value in the row smaller than the threshold is set to zero and every higher value is kept.
Two energy envelopes are obtained by summing the values in each column. The first one is the sum of all values and the second one is the sum of the values above 1500 Hz. The next step is the computation of the centroid of each envelope; signals whose centroids lie in unsuitable positions are eliminated. The approximate position of V is then found in the energy envelope of the whole filtered spectrogram. If this position lies too close to the segment border, the borders of the segment are shifted so that the true position of E does not fall outside the segment.
The explosion is then searched in the energy envelope of the frequencies above 1500 Hz, because the high energy of the vowel at lower frequencies hides the E peak. This method is illustrated in figure 2.
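The row-wise spectrogram filtering and the two energy envelopes might be sketched as below. The spectrogram settings (`nperseg`, `noverlap`) and the reading of "weighted average" as a scalar weight times the row mean are assumptions; only the per-row thresholding, the column sums and the 1500 Hz split come from the text. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import spectrogram

def filtered_envelopes(segment, fs=16_000, weight=1.0, split_hz=1500):
    """Threshold each spectrogram row, then sum columns into two envelopes."""
    # Window length and overlap are assumed values.
    f, t, P = spectrogram(segment, fs=fs, nperseg=256, noverlap=192)
    # Per-row threshold: weight * row mean (the weighting is an assumption).
    thresholds = weight * P.mean(axis=1, keepdims=True)
    # Values below the row threshold are zeroed, the rest are kept.
    P_filt = np.where(P >= thresholds, P, 0.0)
    env_all = P_filt.sum(axis=0)                  # all frequencies
    env_high = P_filt[f > split_hz].sum(axis=0)   # only above 1500 Hz
    return t, env_all, env_high

def centroid(t, env):
    """Energy-weighted mean time of an envelope."""
    return float(np.sum(t * env) / np.sum(env))
```

In a syllable-like segment, the high-frequency envelope is expected to peak at the burst while the full envelope is dominated by the vowel, which is the reason given above for searching E in the high band.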
Detection of the vocal beginning
The V detection relies on the rapid growth of the signal energy, which is detected by a Bayesian step changepoint detector (BSCD). Local maxima in the BSCD output are found, and the one corresponding to the position of V is chosen. The selection is based on the shape of the output, from which we can assume that the peak following the longest gap is the peak matching the position of V. For better insight see figure 3.
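The selection rule (the peak following the longest gap) can be sketched as follows. The BSCD output is assumed to be already computed and passed in as an array; treating the segment start as the left boundary of the first gap is an assumption, and the function name is hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks

def select_vowel_onset(bscd_output, segment_start=0):
    """Return the index of the local maximum that follows the longest gap."""
    peaks, _ = find_peaks(bscd_output)
    if peaks.size == 0:
        return None
    # Gaps: from the segment start to the first peak (assumption),
    # then between consecutive peaks.
    gaps = np.diff(np.concatenate(([segment_start], peaks)))
    return int(peaks[np.argmax(gaps)])
```

With peaks at samples 10, 15, 20, 60 and 65, the longest gap (40 samples) precedes the peak at 60, so that peak would be taken as V.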
Detection of the vocal ending
The principle of the O detection is to find a flexible threshold that adapts itself to the shape of the signal. The signal is first filtered by a low-pass filter with a bandwidth of 500 Hz. Next, the square of the signal is computed.
The threshold is formed by an inverted polynomial approximation of the ninth order, which is additionally shifted by an offset computed as twice the average value of the signal energy. This threshold can be written as
where x is a vector of x-axis values whose first value equals one and whose length equals the length of the searched segment. The coefficients ai and bi are the coefficients of the i-th order of the polynomial and x is the average value of the signal energy. This threshold is shown in figure 4.
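One possible reading of this threshold is sketched below. It is an interpretation, not the paper's formula: "inverted" is read here as sign-negated, so the threshold is taken as twice the mean energy minus a ninth-order polynomial fit of the energy. The filter order is also assumed, and the function names are hypothetical. `Polynomial.fit` is used because it rescales the x axis internally, which keeps a ninth-order fit numerically stable.

```python
import numpy as np
from numpy.polynomial import Polynomial
from scipy.signal import butter, sosfiltfilt

def signal_energy(y, fs=16_000, cutoff_hz=500):
    """Low-pass filter (500 Hz, from the text) and square the signal."""
    sos = butter(4, cutoff_hz, btype="low", fs=fs, output="sos")  # order assumed
    return sosfiltfilt(sos, y) ** 2

def flexible_threshold(energy, order=9):
    """Offset minus a polynomial fit of the energy (assumed reading of 'inverted')."""
    x = np.arange(1, len(energy) + 1)       # x axis starts at one, as in the text
    fit = Polynomial.fit(x, energy, deg=order)
    offset = 2.0 * np.mean(energy)          # twice the average energy (from the text)
    return offset - fit(x)
```

Under this reading the threshold is low where the vowel energy is high and rises toward the edges of the segment, so the energy falls below it near the end of voicing.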
Appraisal of the results
The detection results were evaluated over the total number of syllables, not per individual signal. Automatically detected positions were compared with the hand-labeled E, V and O positions. The difference between the detected and hand-labeled positions was compared against three thresholds (5 ms, 10 ms, and 20 ms), and a detection was marked as successful when the difference was smaller than or equal to the threshold. The percentage rate was computed relative to the total number of syllables.
For the VOT evaluation, the differences between the detected and hand-labeled VOT lengths were compared against the same thresholds as for the individual E, V and O positions. For a better view of the achieved results see figure 5.
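The evaluation rule can be written down compactly. The sketch below assumes positions are given in samples at fs = 16 kHz; the function names are hypothetical.

```python
import numpy as np

THRESHOLDS_MS = (5, 10, 20)  # tolerances from the text

def success_rates(detected, labeled, fs=16_000, thresholds_ms=THRESHOLDS_MS):
    """Percentage of syllables whose absolute error is within each threshold."""
    diff = np.abs(np.asarray(detected) - np.asarray(labeled))
    err_ms = diff * 1000.0 / fs
    return {thr: 100.0 * float(np.mean(err_ms <= thr)) for thr in thresholds_ms}

def vot(e_pos, v_pos):
    """Voice onset time as the difference between the V and E positions."""
    return np.asarray(v_pos) - np.asarray(e_pos)
```

The same `success_rates` function applies to VOT by passing `vot(E_detected, V_detected)` and `vot(E_labeled, V_labeled)` as the two arguments.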
It is possible to roughly compare the success of the VOT detection with the works of Hansen et al. and of Stouten and Van hamme. The study by Hansen et al. evaluates the VOT length for the purpose of accent classification. For the comparison, the results for native speakers of American English were used because of their similarity to our data. Correct detections (difference less than 10% of the length) were made in 74.9% of all cases, and the average difference of the correct detections is 0.735 ms. Our algorithm gives a worse score of 57.2% and 1.273 ms. The results published by Stouten and Van hamme, who deal with the measurement of voiced and unvoiced consonants (/b/, /p/, /d/, /t/, /g/ and /k/), are 72.6% for the 10 ms threshold and 87.8% for the 20 ms threshold. Our results are 68.1% for the 10 ms threshold and 83.5% for the 20 ms threshold.
In this comparison it is necessary to consider the different purposes of each work. The speakers with the most similar accents were used for the comparison, but the accents still differ. A further limitation is the presence of PD, which worsens our results.
With the strictest 5 ms threshold, the algorithm presented in this paper achieved, for all participants, success rates of sE = 64.0%, sV = 71.2% and sVOT = 57.2% for E, V and VOT, respectively. For O, the success rate for all participants at the more relevant 10 ms threshold is sO = 64.6%. These rates are satisfactory and comparable to the other two works; however, further improvement is necessary to increase robustness in PD cases.
Currently, all three detections work independently of each other, which leaves room for additional improvement by feedback control. The length of the VOT can be compared with physiological values, and if the detected VOT is out of bounds, the positions can be re-estimated. The O detection can be improved by an algorithm for automatic estimation of the ideal order of the polynomial approximation. At present the order is set to a fixed empirical value, which is not ideal in all cases.
The work has been supported by research grants
SGS12/185/OHK4/3T/13, GACR 102/12/2230 and NT
12288-5/2011 and by the research program MSM
0021620849 and MSM 6840770012.
Ing. Michal Novotný
Department of Circuit Theory
Faculty of Electrical Engineering
Czech Technical University in Prague
Technická 2, CZ-166 27 Praha 6 – Dejvice
Phone: +420 776 643 155
 Rodríguez-Oroz, M. C., Jahanshahi, M., Krack, P., Macias, R., Bezard, E., Obeso, J. A.: Initial clinical manifestations of Parkinson's disease: features and pathophysiological mechanisms. The Lancet Neurology, 8 (12), 1128–1139, 2009.
 Duffy, J. R.: Motor Speech Disorders: Substrates, Differential Diagnosis and Management. 2nd ed. Mosby, New York, NY, 2005, pp. 1–592.
 Darley, F. L., Aronson, A. E., Brown, J. R.: Differential diagnostic patterns of dysarthria. J. Speech Hear. Res., 12, 426–496, 1969.
 Rusz, J., Čmejla, R., Růžičková, H., Růžička, E.: Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated Parkinson's disease. J. Acoust. Soc. Am., 129 (1), 350–367, 2011.
 Fletcher, S.: Time-by-count measurement of diadochokinetic syllable rate. J. Speech Hear. Disord., 15, 757–762, 1972.
 Čmejla, R., Sovka, P.: Recursive Bayesian Autoregressive Changepoint Detector for Sequential Signal Segmentation. EUSIPCO Proceedings, Wien, 2004, 245–248.
 Hansen, J. H. L., Gray, S. S., Kim, W.: Automatic voice onset time detection for unvoiced stops (/p/, /t/, /k/) with application to accent classification. Speech Communication, 52, 777–789, 2010.
 Stouten, V., Van hamme, H.: Automatic voice onset time estimation from reassignment spectra. Speech Communication, 51, 1194–1205, 2009.