Speech Recognition, Dysarthria, Computer Access
Computer access via voice recognition is a notable challenge for persons with dysarthria. This paper discusses research underway to investigate the feasibility of a speaker-independent automatic speech recognition (ASR) system able to recognize imperfect (e.g., dysarthric) speech. Previous research by investigators at the Naval Air Warfare Center Training Systems Division has successfully developed computer systems that can very accurately transform a microphone audio signal into written words, which then control a computer simulation. Building on this work, we are collecting samples of dysarthric speech, from which ASR models are created. The models formed from dysarthric speech are then evaluated by individuals with dysarthria.
Computer access via voice recognition continues to be a notable challenge for persons with movement disorders that result in both extremity control problems and dysarthria. This paper discusses efforts to develop a computer-based recognizer that can recognize imperfect speech.
According to findings from the 1990 National Health Interview Survey (NHIS) on Assistive Devices, which was co-sponsored by the National Center for Health Statistics (NCHS) and the National Institute on Disability and Rehabilitation Research (NIDRR), more than 13.1 million Americans (about 5.3% of the population) were using assistive technology devices to accommodate physical impairments (1). It is also estimated that 37.3 million non-institutionalized persons aged 15 years and older living in the United States have a chronic health disability that limits their ability to participate fully in life. More than 70 chronic conditions are listed in the NHIS report, and each of these has the potential to cause one or more functional limitations that could be ameliorated by the use of appropriate assistive technology. Several pieces of legislation, notably the Rehabilitation Act of 1973 (as amended), the Americans with Disabilities Act of 1990, the Individuals with Disabilities Education Act of 1990, and the Technology Related Assistance for Individuals with Disabilities Act of 1988, have placed significant emphasis on the use of assistive technology as part of the continuum of services (2). It is important to remember, however, that as technology is used to increase opportunities for individuals with disabilities to become productive members of society, the needs of the whole person must be taken into account to ensure maximum benefit from the technology. Equally, all professionals in assistive technology service delivery must be technology-literate so that appropriate referrals and selections can be made (3).
Automatic Speech Recognition (ASR) input systems allow persons to dictate to the computer. The computer is "trained" to recognize a sound and to provide a computer action in response. However, approximately 14 million Americans have a speech disorder that affects the way they talk (4). As many as two million of these individuals experience such a severe communication disability that they cannot effectively meet their daily communication needs through natural speech, thereby requiring some type of adaptive communication assistance. For these people, ASR systems hold much promise.
Dysarthria, a speech disability that presents as difficulty articulating words, is often caused by impairment of the muscles used in speech. Dysarthria, or slurred speech, is very common following stroke or traumatic brain injury and is often associated with neurological disorders including cerebral palsy, amyotrophic lateral sclerosis (ALS, or Lou Gehrig's disease), Parkinson's disease, and multiple sclerosis (MS). People with dysarthria are often understood quite well by familiar listeners, including family and support people, but even mild dysarthria is difficult for the unfamiliar listener to understand.
There are several commercially available ASR systems for typical English speakers, but these systems do not work well for people with speech disorders because their recognition models are trained on typical English speech. We propose to develop ASR models based on dysarthric speech.
Our partners at the Naval Air Warfare Center Training Systems Division (NavAir) in Orlando, FL, have years of experience creating ASR models of speech in noisy environments for military training simulations. This relationship is part of a Cooperative Research and Development Agreement (CRADA), approved and signed by Duke University Medical Center and NavAir in January 2001, that explores virtual reality technologies with military application and potential use in augmentative and alternative communication technology (5).
Speech samples are collected at Duke University Medical Center, encrypted, and sent electronically to our Federal Laboratory Consortium partners at NavAir in Orlando, FL. Using the speech samples, the engineers at NavAir create speaker-independent ASR models that are returned and evaluated by dysarthric speakers at Duke University Medical Center and the Center for Applied Rehabilitation Technology (CART) at Rancho Los Amigos National Rehabilitation Center in Downey, CA.
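The encryption step is not specified further in this paper. As a minimal sketch only, assuming AES from the standard javax.crypto API and a hypothetical sample.wav file name, a collected sample might be protected for transfer as follows:

```java
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.io.FileInputStream;
import java.io.FileOutputStream;

/** Illustrative only: encrypt a recorded speech sample before transfer. */
public class SampleEncryptor {
    public static void main(String[] args) throws Exception {
        // Generate a 128-bit AES key; in practice the key would be shared
        // with the receiving site through a separate, secure channel.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        // Stream the recording through the cipher to an encrypted copy.
        try (FileInputStream in = new FileInputStream("sample.wav");
             CipherOutputStream out = new CipherOutputStream(
                     new FileOutputStream("sample.wav.enc"), cipher)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```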
Voice samples have been collected from human subjects with speech impairments including spastic, flaccid, and mixed dysarthria. Because subjects need to be literate, we have focused recruitment on individuals with dysarthria secondary to ALS. Many individuals with ALS do not experience the cognitive impairment that is common with, for example, multiple sclerosis or Parkinson's disease.
Subjects sit in front of a computer monitor wearing a lightweight, head-mounted microphone and read the digits 0 through 9, written out as words (e.g., EIGHT, FOUR, ZERO). Each subject reads one hundred "zip codes" that are simultaneously recorded by the computer. Voice samples are verified by Duke researchers for completeness and accuracy and are then sent to NavAir to be built into ASR models, which are returned to Duke University Medical Center. Models are evaluated by standard internal methods for overall percentage word recognition correctness, as determined by the following formula:
Word recognition (%) = (Number of Correctly Recognized Words / Total Number of Words) × 100
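As a concrete illustration of the formula, the calculation in Java is simply (the counts below are made up for the example, not results from the study):

```java
/** Overall percentage word recognition correctness. */
public class WordRecognition {
    static double wordRecognition(int correctlyRecognized, int totalWords) {
        return 100.0 * correctlyRecognized / totalWords;
    }

    public static void main(String[] args) {
        // Hypothetical example: 431 of 500 test words recognized correctly.
        System.out.printf("Word recognition = %.2f%%%n",
                wordRecognition(431, 500)); // prints 86.20%
    }
}
```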
Much of the effort has centered on the collection of speech samples. The original equipment used to collect samples consisted of a Sun SPARCstation 20 running UNIX, connected to a studio-quality ART TubAmp preamplifier with a Shure SM-10A headset microphone. This system had been used for several years in the Navy's research because of its superb signal-to-noise ratio. We encountered several clinical and technical challenges collecting data with the Sun setup, including difficulty transporting the equipment and difficulty transferring data from the computer to removable media for shipment to Orlando. Because of these issues, the system was ported from a UNIX-native scripting language to Red Hat Linux. The Linux system runs on an IBM ThinkPad 390 laptop with an Intel Pentium II 300 MHz processor and 256 MB of RAM. To ensure good-quality recordings, an external sound capture device was needed: an Edirol UA-1A audio capture device by Roland, a USB device recognized by Linux, was chosen and configured for use with the system. The system is completed with the ART TubAmp preamplifier and Shure SM-10A headset microphone.
The most recent recordings have been made using an IBM ThinkPad X41 running a user-friendly, custom-built data collection program written in Java. Java enables the application to run on standard PCs running Windows XP using standard internal sound cards. Again, the ART TubAmp preamplifier and Shure SM-10A headset microphone were used. Several pilot data sets of typical, non-dysarthric speech have been collected using the new data collection systems, and ASR models built from these data sets have been evaluated. The models built from data collected on the new hardware platform perform at a level comparable to the Sun system, with word recognition rates for the pilot data of typical speakers at or above 95%.
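The Duke collection program itself is not reproduced here. As a minimal sketch of the capability the Java rewrite relies on, the following shows how a Java application can capture a spoken prompt through a standard internal sound card using only the standard javax.sound.sampled API; the 16 kHz format, five-second duration, and file name are assumptions for illustration:

```java
import javax.sound.sampled.*;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;

/** Illustrative sketch: record a few seconds of speech to a WAV file. */
public class PromptRecorder {
    public static void main(String[] args) throws Exception {
        // 16 kHz, 16-bit, mono: a common format for speech recognition work.
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();

        // Capture roughly five seconds of audio from the microphone.
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        long bytesWanted = (long) (format.getFrameRate() * format.getFrameSize() * 5);
        while (captured.size() < bytesWanted) {
            int n = line.read(buffer, 0, buffer.length);
            captured.write(buffer, 0, n);
        }
        line.stop();
        line.close();

        // Wrap the raw bytes in a stream and write a standard WAV file.
        byte[] audio = captured.toByteArray();
        AudioInputStream stream = new AudioInputStream(
                new ByteArrayInputStream(audio), format,
                audio.length / format.getFrameSize());
        AudioSystem.write(stream, AudioFileFormat.Type.WAVE, new File("prompt01.wav"));
    }
}
```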
To date, a total of 50 sample sets have been collected from 29 females and 21 males with dysarthria. These data have been used to create the first set of computer acoustic models based on dysarthric speech samples. ASR models are created from a subset of 80% of each sample set and tested, or "exercised," with the remaining 20% (a sketch of this split follows the results table). Navy researchers report that the initial correctness numbers are where historical experience would lead us to expect them. Results for the first four models created are summarized in the table below:
Model | Word Recognition
---|---
Female Mild | 86.24%
Female Moderate | 70.27%
Male Moderate | 75.38%
Female Severe | 51.25%
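The mechanics of the 80/20 split are not described in detail in this paper. A minimal sketch of one straightforward way a speaker's sample set might be partitioned, assuming each utterance is stored as a hypothetical WAV file, is:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch: split a sample set into training and test portions. */
public class SampleSplit {
    public static void main(String[] args) {
        // Hypothetical stand-ins for the 100 recorded zip-code utterances.
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < 100; i++) samples.add("zipcode_" + i + ".wav");

        // Shuffle, then hold out the last 20% for exercising the model.
        Collections.shuffle(samples);
        int cut = (int) (samples.size() * 0.8);
        List<String> training = samples.subList(0, cut);              // 80 utterances
        List<String> testing  = samples.subList(cut, samples.size()); // 20 utterances

        System.out.println("Training: " + training.size()
                + ", Testing: " + testing.size());
    }
}
```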
While the results from the models are lower than expected, the pattern of decreasing word recognition with increasing severity of dysarthria is expected and demonstrates that the models represent the data. The 86% word recognition rate for the Female Mild model and the rates for the male and female moderate models are in line with the initial performance of off-the-shelf ASR systems as reported by Koester (6), while the Female Severe word recognition of 51% is well above chance and shows correlation. Additionally, these models are raw and have not been tuned; we expect that tuning the linguistic model to the acoustic model, as is common practice in ASR systems, will improve the recognition rates.
There are significant challenges surrounding the collection of speech data from individuals with speech disabilities. Speech disabilities are often secondary to conditions that affect motor, sensory, and cognitive abilities, which can make it difficult for subjects to travel to research sites, see and hear stimuli, or even understand or read instructions. In this study, subjects with dysarthria secondary to ALS were chosen specifically because ALS does not generally affect sensory or cognitive abilities. Associated motor impairments did require the development of portable data collection systems, enabling researchers to go to the subject rather than requiring the subject to come to them.
In speech recognition research and development, speech databases are closely guarded; those that are available are expensive and reflect only typical speech. We are working toward making the existing dysarthric speech database freely available for research purposes.
This study was funded by the National Institute on Disability and Rehabilitation Research grant # H133E980026. The authors would like to thank the staff and patients of Rancho Los Amigos CART and the staff and patients of the Duke ALS Clinic.
Kevin Caves
Duke University Medical Center
DUMC 3887
Durham, NC 27710