Speech Input/Output for HRI
Currently we use Google Speech-to-Text for speech recognition. This has the advantage of being extremely accurate, but recognition can be delayed when network traffic is heavy, and timing is of vital importance in interaction. We therefore want to experiment with a local instance of Mozilla's open-source speech recognition (for instance, DeepSpeech), to see whether local processing can reduce the impact of such delays on the interaction.
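A minimal sketch of how such a cloud-versus-local latency comparison might be instrumented. The recognizer functions below are placeholders, not the real clients: in practice they would wrap the Google Cloud Speech-to-Text API and a local DeepSpeech model, and the sleep durations are illustrative stand-ins for network round trips and on-device inference.

```python
import time
import statistics

def time_recognizer(recognize, audio_chunks):
    """Measure per-utterance latency of a recognition callable.

    `recognize` takes raw audio bytes and returns a transcript; here it
    stands in for either a cloud client or a local model."""
    latencies = []
    for chunk in audio_chunks:
        start = time.perf_counter()
        recognize(chunk)
        latencies.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(latencies), "max_s": max(latencies)}

# Placeholder back ends (illustrative delays only).
def cloud_stub(audio):
    time.sleep(0.05)   # simulated network round trip
    return "transcript"

def local_stub(audio):
    time.sleep(0.01)   # simulated on-device inference
    return "transcript"

chunks = [b"\x00" * 3200] * 5   # fake 16 kHz / 16-bit audio frames
cloud = time_recognizer(cloud_stub, chunks)
local = time_recognizer(local_stub, chunks)
```

The same harness could be pointed at the real back ends unchanged, which is the intended design: the timing code is agnostic to where recognition happens.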
Similarly, we use Google Text-to-Speech for generation, which has the advantage of being both natural sounding and configurable (in terms of voice). Again, the cloud component introduces an overhead, and we will evaluate its impact on human users in comparison to local models (for example, well-known systems such as MaryTTS). For many applications canned responses are sufficient (audio files can be generated in advance), but we envisage scenarios where this is not possible, and will therefore evaluate the modules as if just-in-time generation were required.
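The canned-versus-just-in-time distinction can be captured by a simple cache in front of the synthesizer: the first request for an utterance pays the generation cost, repeats are served from disk or memory. A sketch, in which `synthesize` is a placeholder for the actual TTS back end (Google Text-to-Speech or a local MaryTTS server), and the returned bytes are fake audio:

```python
import hashlib

class CannedSpeechCache:
    """Serve previously synthesized utterances from a cache so that
    canned responses are generated only once."""

    def __init__(self, synthesize):
        self._synthesize = synthesize   # pluggable TTS back end
        self._cache = {}
        self.misses = 0                 # count of just-in-time syntheses

    def speak(self, text):
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1            # just-in-time generation path
            self._cache[key] = self._synthesize(text)
        return self._cache[key]

# Stub back end returning fake audio bytes.
tts = CannedSpeechCache(lambda text: b"RIFF" + text.encode("utf-8"))
tts.speak("Hello, how can I help?")
tts.speak("Hello, how can I help?")    # second call served from cache
```

Evaluating "as if just-in-time generation is required" then amounts to forcing every request down the cache-miss path and measuring the resulting delay.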
The evaluation of speech input and output is relatively well understood, and has been the subject of much of our prior work. What we will investigate here is how well current models work in real-world environments. To that end, we will compare word and concept error rates in laboratory conditions against those in busy, real-world public spaces, and will accurately measure the time taken to produce textual results from speech recognition and audio responses from text-to-speech. The role of output volume relative to background noise will also be explored.
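For concreteness, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the length of the reference transcript; concept error rate is computed analogously over semantic units rather than words. A minimal reference implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as Levenshtein distance over word tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") and one deletion ("the") against a
# six-word reference: WER = 2/6.
wer = word_error_rate("take the box to the kitchen",
                      "take a box to kitchen")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which matters when comparing noisy public-space recordings against clean laboratory audio.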