RESNA 27th International Annual Conference
Perception of Audio Cues in Robot-Assisted Navigation
An important issue in assisted navigation systems for the visually impaired is how to present navigation-related information to the user. The paper examines this issue in the context of a pilot study in which five visually impaired participants interacted with a robotic guide in noise-free and noisy indoor environments.
Keywords: Audio perception, speech recognition, speech understanding, visual impairment, robotic guides.
Comparison studies show that in controlled environments visually impaired users do not find major performance differences between different sensor technologies, but do have major concerns about the available output and input modes [4]. Many existing systems offer only one output mode. For example, Talking Signs© [3] delivers information through speech synthesis over open radio receivers, which may block ambient background sounds not only for the users but also for the people around them. Drishti [1] also uses speech synthesis but requires that users wear headphones over both ears. Yet many visually impaired users express strong reservations about wearing headphones while they walk. The situation is similar for input: a system either uses speech recognition, as Drishti does, or provides no meaningful input mode beyond an on/off switch.
The objective of this study was to examine the input and output mode preferences of visually impaired individuals using a robotic guide to navigate an unfamiliar indoor environment [2]. The study tested two hypotheses: 1) speech is a usable input and output mode, and 2) audio icons, i.e., non-verbal sounds, can help users identify objects and events while engaged in navigational tasks [5].
Figure 1 shows the robotic guide (RG) built at the Robotics Laboratory of the Utah State University Computer Science Department [2]. The experiments described below were conducted with five visually impaired participants (no more than light perception) at the USU CS Department over a period of two months. The participants did not have any speech impediments or cognitive disabilities. Three experiments were conducted: audio perception, speech recognition and speech understanding. All experiments were videotaped.
The audio perception experiment tested whether participants preferred speech or audio icons, e.g., the sound of water bubbles, to signify different objects and events in the environment, and how well participants remembered their audio icon selections. There were seven different objects, e.g., elevator, vending machine, and bathroom, and four different events. Each object was associated with two different events. For example, the elevator was associated with the events 'at' and 'approaching.' The audio icons available for each event were played to each participant at selection time. The following statistics were gathered: 1) percentage of accurately recognized icons; 2) percentage of objects/events associated with speech; 3) percentage of objects/events associated with audio icons; 4) percentage of objects/events associated with both.
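As a concrete illustration, the following is a minimal sketch of how each participant's cue selections and recall results could be recorded and the four statistics above computed. It is not the study's actual software; the object and event names follow the paper, while the function names, data layout, and scoring details are assumptions.

```python
# Sketch (not the study's software) of recording cue selections and
# computing the four audio perception statistics.

OBJECTS_AND_EVENTS = {
    "elevator": ("at", "approaching"),
    "bathroom": ("at", "approaching"),
    "vending machine": ("at", "approaching"),
    # ... the remaining objects, each paired with two of the four event types
}

def score_participant(selections, recall_correct):
    """selections: {(obj, event): "speech" | "icon" | "both"}
    recall_correct: {(obj, event): True/False} -- whether the participant
    later recognized the audio cue they had chosen."""
    total = len(selections)
    pct = lambda count: 100.0 * count / total
    return {
        "accurately recognized icons": pct(sum(recall_correct.values())),
        "associated with speech": pct(sum(v == "speech" for v in selections.values())),
        "associated with icons": pct(sum(v == "icon" for v in selections.values())),
        "associated with both": pct(sum(v == "both" for v in selections.values())),
    }
```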
The second and third experiments ran concurrently and tested the feasibility of using speech as a means of input and output, respectively, for humans to communicate with the robot. Each participant was asked to speak approximately sixty phrases while wearing a headset that consisted of a microphone and one headphone. The phrase list contained standard phrases that a person might say to a robotic guide in an unfamiliar environment, e.g., “go to the bathroom,” “where am I?” etc. Each participant was positioned in front of a computer running Microsoft's SAPI, a state-of-the-art speech recognition and synthesis engine. The test program used SAPI's text-to-speech engine to read the phrases to the participant one by one, wait for the participant to repeat each phrase, and record the recognition result (speech recognized vs. speech not recognized) in a database. If a participant did not understand the phrase to be repeated, a human observer counted it as a speech understanding failure. The experiment was repeated in two environments: noise-free and noisy. The noise-free environment had no ambient sounds other than the usual sounds of a typical office. To simulate a noisy environment, a long audio recording of a busy bus station was played on another computer in the office. All five participants were native English speakers and did not train SAPI's speech recognition engine on sample texts. No training was done because the premise behind SAPI is that domain-dependent command and control grammars constrain speech recognition enough to eliminate training altogether.
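For illustration, the following is a minimal sketch of the phrase test loop described above. It is not the original test program: the SAPI SpVoice text-to-speech COM object (accessed here through pywin32 on Windows) is real, but wait_for_recognition() is a hypothetical stand-in for SAPI's command-and-control recognition hookup, and the database layout is assumed.

```python
# Sketch of the phrase test loop; assumptions noted in the text above.
import sqlite3
import win32com.client  # pywin32; Windows-only

voice = win32com.client.Dispatch("SAPI.SpVoice")  # SAPI text-to-speech

def wait_for_recognition(expected_phrase):
    """Hypothetical stand-in: in the real setup SAPI's recognizer, loaded
    with a command-and-control grammar, would report whether the repeated
    phrase matched. Here the observer enters the outcome manually."""
    answer = input(f"Was '{expected_phrase}' recognized? [y/n] ")
    return answer.strip().lower().startswith("y")

def run_session(phrases, participant_id, environment, db_path="results.db"):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS results
                  (participant TEXT, environment TEXT,
                   phrase TEXT, recognized INTEGER)""")
    for phrase in phrases:
        voice.Speak(phrase)                        # read the phrase aloud
        recognized = wait_for_recognition(phrase)  # participant repeats it
        db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                   (participant_id, environment, phrase, int(recognized)))
    db.commit()
    db.close()
```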
The averages for the first experiment were: 1) accurately recognized icons (93.3%); 2) objects/events associated with speech (55.8%); 3) objects/events associated with icons (32.6%); 4) objects/events associated with both (11.4%).
The averages for the second experiment (speech recognition) were: 1) 38% in the noise-free environment; 2) 40.2% in the noisy environment. The averages for the third experiment (speech understanding) were: 1) 83.3% in the noise-free environment; 2) 93.5% in the noisy environment.
The analysis of the audio perception experiment showed that two participants chose their audio preferences essentially at random, while the other three tended to follow a pattern: they chose speech messages for 'at' events and audio icons for 'approaching' events, or vice versa. The experiment also showed that the participants tended to go either with speech or with audio icons, but rarely with both. The experiment did not give a clear answer as to whether visually impaired individuals prefer to be notified of objects/events via speech or audio icons. Further work on a larger sample is needed to answer this question at a statistically significant level.
While the level of ambient noise in the environment did not seem to affect the system's speech recognition, in both environments fewer than 50% of the phrases were correctly recognized by the system. Even worse, on average, 20% of spoken phrases were incorrectly recognized by the system. For example, when one participant made two throat-clearing sounds, the system recognized the sound sequence as the phrase “men's room.” The statistics were far better for the participants' understanding of phrases spoken by the computer. The average percentage of speech understood in the noise-free environment was 83.3%, while the average percentage of phrases understood in the noisy environment was 93.5%. The improvement in the noisy environment, which was the second trial, most likely reflects the participants having become more accustomed to SAPI's speech synthesis patterns. These results suggest that speech is a better output medium than input medium.
It is unlikely that the speech recognition problems can be solved on the software level until there is a substantial improvement in state-of-the-art speech recognition. Further work is planned to seek a wearable hardware solution, e.g., a wearable keyboard. The obvious advantage is that keyboard-based interaction eliminates the input ambiguity problems of speech recognition. One potential challenge is the learning curve required of subjects to master the necessary key combinations.
The study was funded by a Community University Research Initiative (CURI) grant from the State of Utah and a New Faculty Research Grant from Utah State University.
Vladimir Kulyukin, Ph.D.,
Department of Computer Science,
Utah State University,
4205 Old Main Hill,
Logan, UT 84322-4205,
Office Phone (435) 797-8163.
EMAIL: vladimir.kulyukin@usu.edu.