The development of a multimodal communication behaviour capture system
Capturing precisely quantified conversational interactions in ecologically valid scenarios is of great value for studying the nature and adaptation of communication behaviour in challenging environments. Such data could, for example, be used to evaluate audio processing strategies, to estimate the effects of hearing assistance devices on spoken interaction, and to inform behavioural models of listening. To this end, we assembled a high-precision behavioural capture system for simultaneously recording voice, pupil dilation, gaze, and head and torso motion from several people. The capture facility can also reproduce realistic audio backgrounds through a 52-channel loudspeaker array.

To relate events in one modality to events in the others, the data had to be highly synchronous, which posed a hardware challenge. To synchronize voice capture, noise playback, and motion capture, two sound cards (one for voice capture, one for the loudspeaker array) and a motion-capture sync box were slaved to a single master clock, using word clock for audio and genlock for video, with timecode serving as a common time reference. Owing to CPU and software restrictions, gaze and pupil dilation had to be captured on a separate computer for each eye tracker, and were therefore post-synchronized using recordings of an audio click train sent from one of the sound cards to each eye-tracking computer.

We will present how we designed, constructed, and calibrated the system, and what software was used and developed for the purpose. Moreover, we show preliminary data captured from a three-person free conversation in spatially reconstructed cafeteria noise presented at various levels, and we demonstrate how categorizations of conversational turn-taking can be used to find points of interest in the data, allowing us to investigate the transitions in all modalities leading up to, during, and immediately following a turn exchange.
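To illustrate the post-synchronization step, the sketch below estimates the clock offset between two streams by cross-correlating the click train as played with the click train as recorded on one of the eye-tracking computers. The file names, mono WAV format, and the SciPy-based pipeline are illustrative assumptions, not the actual tooling used in the study.

```python
import numpy as np
import soundfile as sf                      # assumed I/O dependency
from scipy.signal import correlate, correlation_lags

def estimate_offset_seconds(reference_wav, recorded_wav):
    """Estimate how far the recorded click train lags the reference
    click train, in seconds, via FFT-based cross-correlation."""
    ref, fs_ref = sf.read(reference_wav)    # assumes mono recordings
    rec, fs_rec = sf.read(recorded_wav)
    assert fs_ref == fs_rec, "resample one stream first if rates differ"
    corr = correlate(rec, ref, mode="full", method="fft")
    lags = correlation_lags(len(rec), len(ref), mode="full")
    return lags[np.argmax(np.abs(corr))] / fs_ref  # peak lag -> seconds

# Hypothetical usage: shift one eye tracker's timestamps onto the common
# timecode reference (file names and variable are placeholders).
# offset = estimate_offset_seconds("clicks_played.wav", "clicks_tracker1.wav")
# gaze_timestamps -= offset
```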
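Similarly, the sketch below shows one minimal way turn exchanges could be located from per-talker voice activity, assuming turns are available as labelled time intervals. The floor-transfer-offset sign test and the gap/overlap labels are illustrative, not necessarily the categorization scheme applied in the study.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    talker: str
    start: float  # turn onset in seconds
    end: float    # turn offset in seconds

def categorize_transitions(turns):
    """Label each change of talker as a 'gap' (silence before the new
    turn) or an 'overlap' (new talker starts before the old one ends)."""
    turns = sorted(turns, key=lambda t: t.start)
    events = []
    for prev, nxt in zip(turns, turns[1:]):
        if nxt.talker == prev.talker:
            continue  # same talker resuming; not a turn exchange
        fto = nxt.start - prev.end  # floor-transfer offset
        events.append({"time": nxt.start, "from": prev.talker,
                       "to": nxt.talker, "fto": fto,
                       "kind": "gap" if fto >= 0 else "overlap"})
    return events

# Hypothetical example: talker B overlaps A; C starts after a short gap.
turns = [Turn("A", 0.0, 2.1), Turn("B", 1.9, 4.0), Turn("C", 4.3, 6.0)]
for event in categorize_transitions(turns):
    print(event)
```

The resulting event times could then serve as anchors for extracting synchronized windows of voice, gaze, pupil, and motion data around each exchange.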