Statistical Identification of Factors that Influence Performance with Speech Recognition

Heidi Horstmann Koester
Rehabilitation Engineering Research Center on Ergonomics
University of Michigan

Abstract

The goal of this study was to identify factors that account for the variation in performance with automatic speech recognition (ASR) systems. Using data from experienced ASR users with physical disabilities, the effect of 20 independent variables on recognition accuracy and text entry rate with ASR was measured using bivariate and multivariate analyses. Use of appropriate correction strategies had the strongest influence on user performance. The amount of time the user spent on their computer, the user's manual typing speed, and the speed with which the ASR system recognized speech were all positively associated with better performance. The amount or perceived adequacy of ASR training did not have a significant impact on performance for this user group.

Keywords

speech recognition, computer access, user performance, outcomes, multiple regression

Background

User performance with automatic speech recognition varies widely, for both new and experienced users. Data from 8 new ASR users show that after 4 to 6 weeks of use, recognition accuracy ranged from 60% to 99%, and text entry rate ranged from 1.5 words per minute (wpm) to 72.6 wpm [1]. For 23 experienced ASR users, the range for recognition accuracy was 72% to 94%, and 3 to 32 wpm for text entry rate [2]. There are many possible reasons for this diverse performance, including factors related to the hardware and software in the system, the user's training and experience, specific ASR usage techniques, and user characteristics [3].

Research Question

With such wide variation, there is no simple answer to the question of what performance users can expect from ASR systems. This study was conducted to provide some insight into why some ASR users perform relatively well and others relatively poorly.

Methods

Overview

Data from 23 experienced ASR users with physical disabilities were analyzed to determine the factors that influenced user performance with ASR. Measurements of recognition accuracy and text entry rate with ASR were the dependent variables [2]. Indicators for 20 potential factors were formed from responses to survey questions and other measures with this same group of users. The relationship between these 20 independent variables (representing the possible factors) and the 2 dependent variables (representing actual user performance) was assessed graphically and statistically using scatter plots, bivariate analyses, and multivariate regression modeling.

Bivariate Analyses

The bivariate relationship between each independent variable and each dependent variable was graphed for visual inspection, and determined statistically by calculating the Pearson correlation. Statistical significance for the correlations was set at the 0.05 level.
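As an illustration of the bivariate step, the following is a minimal sketch of a Pearson correlation and its significance test, not the software actually used in the study. The function names are hypothetical. The critical t value of 2.080 is the standard two-tailed table value for 21 degrees of freedom, which assumes that all 23 subjects contributed to a given correlation; that may not hold for every variable.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom (|r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Two-tailed critical t at alpha = 0.05 for df = 21 (assumes n = 23).
T_CRIT_DF21 = 2.080

def critical_r(t_crit, n):
    """Smallest |r| that reaches significance for a given critical t."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

print(round(critical_r(T_CRIT_DF21, 23), 3))  # 0.413
```

Under these assumptions, a correlation based on all 23 subjects needs an absolute value of roughly 0.41 or more to reach the 0.05 level.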

Multivariate Analyses

Multiple regression models were developed for both recognition accuracy and text entry rate. Multivariate influence was examined for any independent variable that had:

(1) a visible bivariate relationship on the scatter plot; AND
(2) a statistically significant bivariate correlation OR a bivariate correlation greater than 0.2 (absolute value).

The first step was to find the “best” one-factor model from the pool of candidate factors, then determine if any of the remaining factors improved the model enough to warrant a two-factor model. If a two-factor model was found, the remaining factors were again searched for a possible three-factor model. A model was judged to be “better” than another if it had: a higher adjusted R² value, greater statistical significance for each independent variable's model coefficient, stronger partial relationships based on graphic analysis, and more robust satisfaction of regression assumptions.
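The screening rule and the stepwise search can be sketched as follows. This is a simplified illustration under stated assumptions: it ranks models by adjusted R² alone, whereas the study also weighed coefficient significance, graphic analysis of partial relationships, and regression assumptions. All function names are hypothetical, and `fit_r2` stands for a caller-supplied routine that fits a regression on a tuple of factor names and returns its R².

```python
def is_candidate(r, p_value, visible_on_plot, alpha=0.05, r_floor=0.2):
    """Screening rule: visible relationship AND (significant OR |r| > r_floor)."""
    return visible_on_plot and (p_value < alpha or abs(r) > r_floor)

def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def forward_select(candidates, fit_r2, n, max_factors=3):
    """Greedy forward selection driven by adjusted R^2 only (a simplification)."""
    chosen, best_adj = [], float("-inf")
    remaining = list(candidates)
    while remaining and len(chosen) < max_factors:
        # Score every model formed by adding one remaining factor.
        scored = [
            (adjusted_r2(fit_r2(tuple(chosen + [f])), n, len(chosen) + 1), f)
            for f in remaining
        ]
        adj, factor = max(scored)
        if adj <= best_adj:
            break  # no remaining factor improves the model
        chosen.append(factor)
        remaining.remove(factor)
        best_adj = adj
    return chosen, best_adj
```

Because adjusted R² penalizes each added predictor, the loop stops as soon as an extra factor no longer pays for itself, which mirrors the one-, two-, then three-factor progression described above.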

The purpose of the multivariate modeling was to identify influential factors and their relative influence on ASR performance. An independent variable was considered to be an “influential factor” if its standardized Beta coefficient in a multivariate model was significant at the p < 0.05 level. The relative strength of two or more influential factors in a single model was determined by comparing their standardized Beta coefficients.
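For readers unfamiliar with standardized coefficients, the sketch below shows the usual conversion from an unstandardized regression coefficient: it is rescaled by the ratio of the predictor's standard deviation to the outcome's standard deviation. This is the textbook relationship, offered as an assumption about how such values are typically obtained rather than a description of the study's software; the function name is hypothetical.

```python
import statistics

def standardized_beta(b, xs, ys):
    """Rescale an unstandardized coefficient b to standard-deviation units.

    xs: observed values of the predictor; ys: observed values of the outcome.
    """
    return b * statistics.stdev(xs) / statistics.stdev(ys)
```

Because standardized coefficients are unit-free, their absolute values can be compared directly across predictors measured on different scales.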

Results

Bivariate Results

Table 1 shows the Pearson correlations between all candidate factors and recognition accuracy and text entry rate. For recognition accuracy, 10 candidate factors were retained for multivariate analysis. Only weak bivariate relationships were found between recognition accuracy and factors related to hardware and software, ASR training, or the amount of experience subjects had using ASR. For text entry rate, 9 factors were retained for multivariate analysis. ASR training factors showed relatively little relationship to text entry rate, as did the amount of ASR experience subjects had.

**Table 1.** Independent variables and their Pearson correlations with recognition accuracy (Rec Acc) and text entry rate (TER). * significance at p < 0.05; ** significance at p < 0.01.

| Independent Variable | Rec Acc | TER |
|---|---|---|
| **Hardware/Software** | | |
| RAM | 0.024 | 0.158 |
| ASR Delay | 0.085 | -0.356 |
| Microphone | -0.152 | -0.105 |
| Text Application | -0.081 | 0.010 |
| **ASR Training/Usage** | | |
| Training Hours | 0.001 | -0.114 |
| Training Adequacy | 0.190 | -0.127 |
| ASR Usage | 0.090 | -0.010 |
| ASR Text Usage | 0.419* | 0.227 |
| ASR Experience | -0.078 | -0.004 |
| **ASR Techniques** | | |
| “Scratch That” Usage | -0.681** | -0.598** |
| Proofread Style | 0.266 | -0.003 |
| Words per Utterance | 0.315 | 0.559** |
| Dictation Speed | 0.132 | 0.426 |
| **Computer Experience and Usage** | | |
| Computer Usage | 0.251 | -0.053 |
| Word Proc Time | 0.413* | 0.355 |
| Pre-ASR Experience | -0.311 | -0.198 |
| **User Characteristics** | | |
| Gender | -0.078 | -0.147 |
| Education | 0.338 | 0.371 |
| Need Computer for Job/School | 0.397 | 0.478* |
| Typing Speed | 0.189 | 0.610** |
| **Other ASR Factors** | | |
| Recognition Accuracy | 1.0 | 0.687** |

Multivariate Results

Table 2 shows the best multi-factor model found for recognition accuracy. Recognition accuracy was influenced most strongly by the frequency with which users employed the “Scratch That” method of correcting recognition errors. A secondary influence on recognition accuracy was the amount of time users spent on their computer each week.

**Table 2.** Two-factor regression model of recognition accuracy (RA) as a function of Scratch That (ST) and Computer Usage (CU). * significance at p < 0.05. ** significance at p < 0.01.

| Model Equation | ST Partial b | ST Sig. of b | CU Partial b | CU Sig. of b | Adj. R² |
|---|---|---|---|---|---|
| RA = 85.5 – 0.25(ST) + 0.48(CU) | -0.748 | <0.001** | 0.347 | 0.034* | 0.535 |
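To make the Table 2 equation concrete, it can be evaluated directly. The sketch below simply encodes the published coefficients; the function name and the input values in the example are hypothetical, and the measurement units of ST and CU are not stated in this section, so the inputs should be read as illustrative only.

```python
def predict_recognition_accuracy(scratch_that, computer_usage):
    """Evaluate the Table 2 equation: RA = 85.5 - 0.25(ST) + 0.48(CU)."""
    return 85.5 - 0.25 * scratch_that + 0.48 * computer_usage

# Hypothetical inputs: each extra unit of ST lowers the prediction by 0.25,
# and each extra unit of CU raises it by 0.48.
baseline = predict_recognition_accuracy(0, 0)
example = predict_recognition_accuracy(20, 10)
```

As with any regression fitted to 23 subjects, predictions of this kind describe the trend in the sample and should not be extrapolated far beyond the observed range.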

Table 3 shows the best multi-factor model found for text entry rate. Use of “Scratch That” was again the most influential factor, primarily because of its influence on recognition accuracy. Of secondary importance was the ASR Delay, or how long it takes for the ASR system to display a recognition at the completion of an utterance. Finally, typing speed without ASR also emerged as an influential factor in text entry rate.

**Table 3.** Three-factor regression model of text entry rate (TER) as a function of Scratch That (ST), Typing Speed (TS), and ASR Delay (AD). * significance at p < 0.05. ** significance at p < 0.01.

| Model Equation | ST Partial b | ST Sig. of b | TS Partial b | TS Sig. of b | AD Partial b | AD Sig. of b | Adj. R² |
|---|---|---|---|---|---|---|---|
| TER = 22.4 – 0.22(ST) + 0.28(TS) – 11.9(AD) | -0.523 | 0.002** | 0.377 | 0.019* | -0.384 | 0.012* | 0.625 |
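The ordering of influence described for Table 3 can be reproduced by ranking the factors on the absolute value of their coefficients. The sketch below assumes that the "Partial b" column holds the standardized coefficients described in the Methods; the function names and the example inputs to the equation are hypothetical, and the units of the predictors are not stated in this section.

```python
# "Partial b" values from Table 3 (assumed to be standardized coefficients).
TABLE3_PARTIAL_B = {
    "Scratch That": -0.523,
    "Typing Speed": 0.377,
    "ASR Delay": -0.384,
}

def rank_by_influence(coeffs):
    """Order factor names from strongest to weakest by |coefficient|."""
    return sorted(coeffs, key=lambda name: abs(coeffs[name]), reverse=True)

def predict_text_entry_rate(scratch_that, typing_speed, asr_delay):
    """Evaluate the Table 3 equation: TER = 22.4 - 0.22(ST) + 0.28(TS) - 11.9(AD)."""
    return 22.4 - 0.22 * scratch_that + 0.28 * typing_speed - 11.9 * asr_delay

ranking = rank_by_influence(TABLE3_PARTIAL_B)
# ranking == ["Scratch That", "ASR Delay", "Typing Speed"]
```

The sign of each coefficient gives the direction of the effect: more "Scratch That" use and longer ASR delay go with lower text entry rates, while faster manual typing goes with higher rates.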

Discussion

While this study only begins to answer the question of which factors have the most influence on ASR performance, the results support the following clinical implications:

Coach the proper correction strategy. While most clinicians are aware of the desirability of limiting the use of “Scratch That,” these results suggest that this point should receive primary emphasis.
Don't ignore non-ASR input methods such as single-digit typing. These can be used in conjunction with ASR to leverage better performance, not solely as a backup method.
Get the best hardware possible. Teach users methods of gauging system performance and monitoring resource use within the operating system.

References

[1] Koester HH. (2003). Abandonment of speech recognition by new users. In Proceedings of the RESNA 2003 Conference. Washington, DC: RESNA.
[2] Koester HH. (2003). Performance of experienced speech recognition users. In Proceedings of the RESNA 2003 Conference. Washington, DC: RESNA.
[3] Koester HH. (2001). User performance with speech recognition: a literature review. Assistive Technology. 13(2): 116-30.

Acknowledgments

This study was funded by U.S. Dept of Education Grant #H133E980007.

Heidi Horstmann Koester, Ph.D.
2408 Antietam
Ann Arbor MI 48105
hhk@umich.edu