Note: This paper was written for L700 "Seminar in Articulatory Phonetics", a class taught by Bob Port at Indiana University, Fall Semester, 1999. It was revised and subsequently published by the IULC Working Papers Online (Volume 2, No. 1), where the full text is available in PDF format. The program was implemented in MATLAB's Simulink modelling environment; if you have access to MATLAB and would like to see it in action, drop me a line. Below are the introduction and the bibliography. If you prefer another format or have any questions or comments, feel free to e-mail me.

Spike-V: An Adaptive Mechanism for Speech-Rate Independent Timing

by Sean McLennan and Stephen Hockema

2002

1. Introduction

A major stumbling block to models of cognitive phenomena, machine perception, and natural language processing is a reliance on what can be called “Naive Time” (Port et al., 1995). This is the assumption that human perception and manipulation of time is based on an objective, absolute measure like seconds or milliseconds. Most speech recognition systems, for example, critically rely on sampling a signal at a specific rate in Hz, applying a Fourier transform to obtain some sort of frequency representation, and then converting it into overlapping segments (typically of 20 ms or so) extracted at specific intervals (about 10 ms) (Allen, 1995). It is extremely unlikely that such a process occurs in discrete time in human beings, and since human beings remain vastly superior at the task of speech recognition, it seems a reasonable goal to attempt to model more biologically plausible mechanisms.
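To make the “Naive Time” front end concrete, the following is a minimal sketch of fixed-rate framing followed by a Fourier transform, written in Python with only the standard library. It is not taken from Allen (1995) or from any particular recognizer: the function names, the 8000 Hz sampling rate, and the synthetic 200 Hz test tone are all assumptions chosen for illustration. Real systems typically apply a window function before the transform and use an FFT rather than the naive DFT shown here.

```python
import cmath
import math


def frame_signal(samples, sample_rate_hz, frame_ms=20.0, hop_ms=10.0):
    """Cut the signal into overlapping frames of fixed absolute duration."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 (adequate for a sketch)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


# Hypothetical input: 100 ms of a 200 Hz sine sampled at 8000 Hz.
sample_rate_hz = 8000
signal = [math.sin(2 * math.pi * 200 * t / sample_rate_hz) for t in range(800)]

frames = frame_signal(signal, sample_rate_hz)          # 20 ms frames, 10 ms hop
spectra = [magnitude_spectrum(f) for f in frames]

# Strongest frequency bin in the first frame, converted back to Hz.
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * sample_rate_hz / len(frames[0])
```

Note that the frame length and hop are fixed in milliseconds no matter how fast or slowly the speaker talks, which is exactly the dependence on absolute time that the paragraph above questions.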

Presented here is a potentially powerful method of targeting the most salient points of a speech signal that robustly adapts to changes in speaking rate. It relies on and responds solely to the input signal, adapting in real time, thereby providing an implicit, dynamic measure of speaking rate. The method is instantiated in a MATLAB Simulink model called “Spike-V” (“V” for “vowel”), which is conceived of as an addition to other neural network models of speech recognition. For purposes of this discussion, we conceive of Spike-V as integrated into Stephen Grossberg’s ARTPHONE model (Grossberg et al., 1997); however, we believe the mechanisms involved are of general applicability.

An assumption that underlies Spike-V is that not all speech segments are created equal; certain segments are more important for recognizing speech than others. That is, highly sonorous segments, particularly vowels, are the primary sources of information about changes in the speech signal. Sonorance is more or less defined by the presence of the fundamental frequency (F0) in the speech signal, and so vowels (and other sonorous segments) are highly correlated with a strong F0. Spike-V finds periods of F0 in the signal and produces a spike at points more or less in the center of those periods. Crucially, Spike-V normalizes to the speaking rate, in real time, so as to produce (in general) only one spike per period regardless of the period’s absolute duration. The result, we suspect, is a spike train whose qualitative properties remain constant for the same utterance regardless of speaking rate. The rigorous testing of this hypothesis on linearly time-warped speech, naturally different speaking rates, and across speakers will be the focus of further research.
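The intended output can be illustrated with a small offline sketch in Python. This is not the Spike-V mechanism itself, which is a Simulink model that adapts in real time and is not specified in this introduction. The sketch assumes a per-frame voicing decision (F0 present or absent) is already available, and it places spikes only after each voiced period has ended. All function names and the example voicing pattern are hypothetical. Its purpose is to show the property being claimed: a linearly time-warped version of the same pattern yields the same number of spikes, at roughly the same relative positions.

```python
def voiced_runs(voiced):
    """Return (start, end) frame indices, end exclusive, for each voiced run."""
    runs = []
    start = None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i                      # a voiced period begins
        elif not is_voiced and start is not None:
            runs.append((start, i))        # the voiced period has ended
            start = None
    if start is not None:                  # signal ended while still voiced
        runs.append((start, len(voiced)))
    return runs


def center_spikes(voiced):
    """One spike per voiced run, placed at the run's middle frame."""
    return [(start + end - 1) // 2 for start, end in voiced_runs(voiced)]


def time_warp(voiced, factor):
    """Linearly slow the pattern down by repeating every frame `factor` times."""
    return [v for v in voiced for _ in range(factor)]


# Hypothetical voicing pattern: 1 = F0 present in that frame, 0 = absent.
pattern = [bool(x) for x in [0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]]

normal = center_spikes(pattern)                # spikes at frames 2, 8, 12
slow = center_spikes(time_warp(pattern, 2))    # spikes at frames 4, 16, 25

# Relative positions within each utterance are nearly the same at both rates.
normal_rel = [i / len(pattern) for i in normal]
slow_rel = [i / (2 * len(pattern)) for i in slow]
```

Both versions produce three spikes, one per voiced period, even though the slowed version is twice as long in absolute time. What this sketch leaves out is the hard part that Spike-V addresses: deciding where the center falls while the period is still unfolding.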

Although the specifics of Spike-V’s interaction within a larger speech recognition model have yet to be fully conceived, we envision Spike-V as a unit that provides waves of excitation to other areas of the model. In effect, this provides a nonarbitrary, signal-driven segmentation of the speech signal into discrete periods that are not measures of absolute time, but more task-appropriate measures.

References:

Allen, J. (1995). Natural Language Understanding. Benjamin/Cummings Publishing Company, Redwood City, CA.

Gazzaniga, M. S., Ivry, R. B., and Mangun, G. R. (1998). Cognitive Neuroscience: The Biology of the Mind. W. W. Norton and Company, New York, NY.

Grossberg, S. (1995). Neural dynamics of motion perception, recognition learning, and spatial attention. In Port, R. and van Gelder, T., editors, Mind as Motion, pages 449–490. MIT Press, Cambridge, MA.

Grossberg, S., Boardman, I., and Cohen, M. (1997). Neural dynamics of variable-rate speech categorization. Journal of Experimental Psychology: Human Perception and Performance, 23:418–503.

Kenstowicz, M. (1994). Phonology in Generative Grammar. Blackwell, Cambridge, MA.

Ladefoged, P. (1982). A Course in Phonetics. Harcourt Brace Jovanovich, Chicago, IL.

Lieberman, P. and Blumstein, S. (1988). Speech Physiology, Speech Perception, and Acoustic Phonetics. Cambridge University Press, Cambridge, MA.

Port, R., Cummins, F., and McAuley, J. D. (1995). Naive time, temporal patterns, and human audition. In Port, R. and van Gelder, T., editors, Mind as Motion, pages 339–372. MIT Press, Cambridge, MA.

Wang, D. (1995). Habituation. In Arbib, M. A., editor, The Handbook of Brain Theory and Neural Networks, pages 441–444. MIT Press, Cambridge, MA.