Welcome to the Harvard-Haskins database of Regularly-Timed Speech!

Overview of the Database

This database contains acoustic and kinematic data (collected at Harvard University and Haskins Laboratories) from speakers who were uttering sequences of alternating syllables in an evenly timed fashion, "like a metronome". Since the syllables were perceptually isochronous, the database can be used to study physical cues that underlie the perception of temporal intervals in speech. A paper containing the details of one such study is included on the download page. The paper also contains details of data collection for all materials on this website, as well as a method for analyzing these data for proposed timing cues.

The database contains acoustic data from 6 speakers (3 male, 3 female), and kinematic data from 3 speakers (1 male, 2 female). All utterances in this database have 11 syllables, and consist of the syllable /ba/ alternating with another syllable (the "target syllable"). The target syllable is one of the following:

/ba/ /cha/ /dela/ /ha/ /la/ /lad/ /li/ /ma/ /pa/ /sa/ /spa/ /ta/ /ya/

Each speaker produced four utterances for each target syllable.

Some details on acoustic and kinematic data

The acoustic data were collected in quiet rooms at Harvard and Haskins, and are in 16-bit .wav format (sample rate = 10 kHz, anti-alias filtered at 4 kHz).

NOTE: Some computer sound cards do not offer 10 kHz as a playback sample rate and may silently play the sound files at a different rate, which speeds them up or slows them down. To hear the files at the correct speed, make sure your sound card supports a 10 kHz sample rate.
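
To confirm the rate a player should be using, the sample rate can be read from the file header. The following is a minimal Python sketch using only the standard library. The helper name wav_info is illustrative and not part of the database.

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, bits_per_sample, n_samples, duration_s) for a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()        # samples per second stored in the header
        bits = 8 * w.getsampwidth()    # sample width is reported in bytes
        n = w.getnframes()             # number of samples per channel
    return rate, bits, n, n / rate
```

For an acoustic file as described above, the first two values should be 10000 and 16. If a player reports a noticeably different duration than the last value, it is probably playing the file at the wrong rate.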

Kinematic data were collected using an electromagnetic midsagittal articulometer (EMMA) system at Haskins Laboratories, and are also in 16-bit .wav format (sample rate = 625 Hz, anti-alias filtered at 200 Hz during acquisition, low-pass filtered at 17 Hz after voltage-distance conversion). Note that these files are in .wav file format for portability between systems: they are not acoustic files!

The EMMA system measures horizontal and vertical position for selected articulators in a coordinate system centered at the upper incisors. Movements were measured from the upper lip, lower lip, jaw, tongue tip, tongue blade, tongue body, and tongue rear. (Note that the "tongue tip" transducer was not truly on the tip of the tongue, as this would interfere with articulation, but roughly 1/2 cm behind the tip.)

In this database, acoustic files have the following filename structure:

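{subject initials}{target syllable}{utterance number}.wav, with no separators between the parts. For example: jbsa1.wav (subject initials "jb", target syllable /sa/, utterance 1).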

The kinematic files follow the same filename conventions as the acoustic data, except that kinematic data files have a "k" at the end of the file prefix. For example, the kinematic data corresponding to jbsa1.wav is jbsa1k.wav.
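
The naming scheme can be sketched as a small parser. This is a Python illustration, not a tool shipped with the database. It assumes that subject initials are always exactly two letters (as in "jb") and that the target syllable is one of the thirteen listed in the overview. The names TARGETS and parse_name are hypothetical.

```python
import re

# Target syllables from the overview. Longer names come first so that
# "lad" is tried before "la" and "spa" before "pa".
TARGETS = ["dela", "cha", "lad", "spa", "ba", "ha", "la", "li",
           "ma", "pa", "sa", "ta", "ya"]

# Assumption: subject initials are exactly two letters.
_PATTERN = re.compile(
    r"^(?P<subject>[a-z]{2})(?P<target>" + "|".join(TARGETS) +
    r")(?P<utt>\d+)(?P<kin>k?)\.wav$",
    re.IGNORECASE,
)

def parse_name(filename):
    """Split a database filename into its parts, or return None if it does not match."""
    m = _PATTERN.match(filename)
    if m is None:
        return None
    return {
        "subject": m.group("subject").lower(),
        "target": m.group("target").lower(),
        "utterance": int(m.group("utt")),
        "kinematic": m.group("kin") != "",   # trailing "k" marks kinematic data
    }
```

Under these assumptions, parse_name("jbsa1k.wav") identifies subject "jb", target "sa", utterance 1, and flags the file as kinematic.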

Working with kinematic files

Each kinematic .wav file contains a single time series in which the data for the 7 articulators are laid out end-to-end (no gaps), with an x series and a y series for each articulator. Thus, once a kinematic file is read into a computer, dividing it into 14 equal parts will result in vectors representing:

  1. upper lip x
  2. upper lip y
  3. lower lip x
  4. lower lip y
  5. jaw x
  6. jaw y
  7. tongue tip x
  8. tongue tip y
  9. tongue blade x
  10. tongue blade y
  11. tongue body x
  12. tongue body y
  13. tongue rear x
  14. tongue rear y
The center of the xy coordinate system (i.e. [0,0]) is at the upper incisors. x values get more positive in the rostro-caudal direction (from the front to the back of the head), and y values get more positive in the ventro-dorsal direction (from the bottom to the top of the head).
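
The division into 14 equal parts can be sketched in Python as follows. The sketch assumes the samples have already been read into a flat list. The names CHANNELS, split_channels and time_axis are illustrative only, and the 625 Hz rate is the kinematic sample rate stated above.

```python
CHANNELS = [
    f"{articulator} {axis}"
    for articulator in ("upper lip", "lower lip", "jaw", "tongue tip",
                        "tongue blade", "tongue body", "tongue rear")
    for axis in ("x", "y")
]

KIN_RATE_HZ = 625  # kinematic sample rate

def split_channels(samples):
    """Cut one end-to-end series into 14 equal, named channels."""
    n_total = len(samples)
    if n_total % len(CHANNELS) != 0:
        raise ValueError("length is not a multiple of 14")
    n = n_total // len(CHANNELS)
    # Channel i occupies samples [i*n, (i+1)*n) of the original series.
    return {name: samples[i * n:(i + 1) * n] for i, name in enumerate(CHANNELS)}

def time_axis(n_samples, rate_hz=KIN_RATE_HZ):
    """Time in seconds of each sample within a single channel."""
    return [i / rate_hz for i in range(n_samples)]
```

Each channel then has one fourteenth of the file's samples, and time_axis(len(channel)) gives the matching time stamps in seconds.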

Physical units of the data

The acoustic files are in units of volts, and the kinematic files are in units of centimeters.

As a check that you have the correct physical units, here are the max and min values in the acoustic utterance jbsa1.wav:

max 6.3281 volts

min -5.1123 volts

Here are the lower lip x and y maximum and minimum values for jbsa1k.wav:

        lower lip x    lower lip y
max       -0.6006        -1.4014     centimeters
min       -1.1865        -3.6084     centimeters

Notes for MATLAB users

The following MATLAB code shows how to read the acoustic data into a vector and the kinematic data into a 14-column matrix, where the columns correspond to the articulator coordinates in the list above.

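% Acoustic data (wavread returns samples normalized to the range -1 to 1)
% In MATLAB releases without wavread, use [y,fs]=audioread(filename) instead.
[acoustic_data,sample_rate,nbits]=wavread('jbsa1.wav');
acoustic_data=acoustic_data*10;                 % gain factor: volts

% Kinematic data
[kinematic_data,sample_rate,nbits]=wavread('jbsa1k.wav');
npts=length(kinematic_data);
kin_data=reshape(kinematic_data, npts/14,14);   % one column per coordinate
kin_data=kin_data*10;                           % gain factor: centimeters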

Note the final step in both cases, which applies a gain factor of 10. This is necessary in MATLAB to convert these particular .wav files back to their original physical units (see "Physical units of the data", above).
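
Outside MATLAB, most WAV readers return raw 16-bit integers rather than normalized values. The Python sketch below works on the assumption that wavread normalized 16-bit samples by dividing by 32768, so that the physical value is raw * 10 / 32768. It also assumes the files are mono. The function names are illustrative.

```python
import struct
import wave

FULL_SCALE = 32768   # assumption: 16-bit samples normalized by 2**15, as wavread did
GAIN = 10            # gain factor from the MATLAB example above

def to_physical(raw_samples):
    """Convert raw 16-bit integer samples to volts (acoustic) or centimeters (kinematic)."""
    return [s * GAIN / FULL_SCALE for s in raw_samples]

def read_physical(path):
    """Read a mono 16-bit WAV file and return (values_in_physical_units, sample_rate_hz)."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() != 1:
            raise ValueError("expected mono 16-bit data")
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    # WAV stores 16-bit samples as little-endian signed integers.
    raw = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return to_physical(raw), rate
```

Under this assumption a raw sample of 20736 maps to 6.328125 and a raw sample of -16752 maps to -5.1123046875, which is consistent with the maximum and minimum quoted for jbsa1.wav after rounding to four decimals.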

Miscellaneous notes

Subject AP has no utterances with /lad/ or /spa/ as the target syllable, and has only 3 utterances with /li/ as the target syllable. Subject LC has only 3 utterances with /pa/ as the target syllable. Subject LK has only 3 utterances each with /ha/ and /sa/ as the target syllable.
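
These gaps can be turned into an expected per-subject utterance count, which may be useful as a check after downloading. The Python sketch below assumes that every other subject and target combination has the full four utterances, and that subject LK has three utterances for each of /ha/ and /sa/. The names EXCEPTIONS and expected_count are hypothetical.

```python
TARGETS = ["ba", "cha", "dela", "ha", "la", "lad", "li",
           "ma", "pa", "sa", "spa", "ta", "ya"]
UTTERANCES_PER_TARGET = 4

# Utterance counts that differ from the default, taken from the note above.
EXCEPTIONS = {
    "AP": {"lad": 0, "spa": 0, "li": 3},
    "LC": {"pa": 3},
    "LK": {"ha": 3, "sa": 3},   # assumption: 3 utterances for each of the two
}

def expected_count(subject):
    """Number of utterances expected for one subject across all target syllables."""
    overrides = EXCEPTIONS.get(subject.upper(), {})
    return sum(overrides.get(t, UTTERANCES_PER_TARGET) for t in TARGETS)
```

With these assumptions a subject without gaps has 52 utterances, while AP, LC and LK have 43, 51 and 50 respectively.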

More information

A detailed description of this dataset, including figures of acoustic and kinematic data, can be found in chapter 5 of: Patel, A.D. (1996). A Biological Study of the Relationship between Language and Music. Ph.D. thesis, Harvard University. (Available from University Microfilms.)


April 08, 2002, apatel