Speech Input/Output for HRI

For speech recognition, we initially used the Google Speech-to-Text engine built into the Chrome web browser, and have since upgraded to the full Google Web Speech API. This has the advantage of being extremely accurate, but recognition can be delayed when network traffic is heavy, and timing in interaction is of vital importance. We therefore want to experiment with a local instance of an open-source recognizer (for instance Mozilla's DeepSpeech), to see whether local processing can reduce the impact of such delays on the interaction.
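
As a concrete sketch of how we might quantify this, the harness below times any recognizer callable against an interaction latency budget. The `recognizer` argument and the one-second default budget are illustrative assumptions, not part of our existing system; in practice it would wrap either a local model's decode method or a cloud client.

```python
import time

def timed_recognition(recognizer, audio, budget_s=1.0):
    """Run a recognizer on an audio buffer and report whether the
    transcript arrived within the interaction latency budget.

    `recognizer` is any callable mapping audio -> transcript, e.g. a
    local model's decode method or a wrapper around a cloud client.
    """
    start = time.perf_counter()
    transcript = recognizer(audio)  # blocking call; its duration is what we measure
    latency = time.perf_counter() - start
    return {
        "transcript": transcript,
        "latency_s": latency,
        "within_budget": latency <= budget_s,
    }
```

In an experiment, the same harness would wrap both the local and the cloud recognizer, so their latency distributions can be compared directly under identical network conditions.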

Similarly, we use Google Text-to-Speech for generation, which has the advantage of being both natural sounding and configurable (in terms of voice). Again, the cloud component introduces an overhead, and we will evaluate its impact on human users in comparison to local models (using, for example, well-known systems such as Mary TTS). For many applications canned responses are sufficient (the audio files can be generated in advance), but we envisage scenarios where this may not be possible, and will evaluate the modules as if just-in-time generation were required.
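
To illustrate the canned-versus-just-in-time distinction, the sketch below caches synthesized audio on disk, so responses generated in advance are served with no synthesis delay while unseen utterances fall back to on-demand generation. The `synthesize` argument is a hypothetical stand-in for any TTS backend, not an actual Google or Mary TTS API.

```python
import hashlib
from pathlib import Path

def speak(text, synthesize, cache_dir="tts_cache"):
    """Return audio bytes for `text`, serving a pre-generated ("canned")
    file when one exists and synthesizing just-in-time otherwise.

    `synthesize` is any callable mapping text -> audio bytes.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Key the cache on the exact utterance text
    key = hashlib.sha256(text.encode("utf-8")).hexdigest() + ".wav"
    path = cache / key
    if path.exists():            # canned response: no synthesis delay
        return path.read_bytes()
    audio = synthesize(text)     # just-in-time generation
    path.write_bytes(audio)      # cache for subsequent requests
    return audio
```

Measuring the gap between the cached and uncached paths of such a wrapper would put a number on the cost of just-in-time generation for each backend.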

The evaluation of speech input and output is reasonably well understood, and has been the subject of much of our prior work. What we will investigate is how well the current models work in real-world environments. To that end, we will compare word and concept error rates in laboratory conditions against those in busy public spaces, and take accurate measurements of the time needed to generate textual results from speech recognition and audio responses from text-to-speech modules. The role of output volume with respect to background noise will also be explored.
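
For reference, word error rate is the minimum number of word substitutions, insertions, and deletions needed to turn a recognizer's hypothesis into the reference transcript, normalized by reference length; concept error rate applies the same computation to concept labels rather than words. A minimal plain-Python sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance between the reference and
    hypothesis word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(len(hyp) > 0)
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the robot is here", "the robot was here")` is 0.25: one substitution over four reference words.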

References

David Benyon, Björn Gambäck, Preben Hansen, Oli Mival, Nick Webb (2012). How Was Your Day? Evaluating a Conversational Companion. IEEE Transactions on Affective Computing.

Cameron Smith, Nigel Crook, Simon Dobnik, Daniel Charlton, Johan Boye, Stephen Pulman, Raul Santos de la Camara, Markku Turunen, David Benyon, Jay Bradley, Nick Webb (2011). Interaction strategies for an affective conversational agent. Presence: Teleoperators and Virtual Environments.

Nick Webb, David Benyon, Preben Hansen, Oli Mival (2010). Evaluating Human-Machine Conversation for Appropriateness. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC).

Nick Webb
Associate Professor of Computer Science / Director of Data Analytics