Cornell University researchers propose EchoSpeech, a low-power silent speech interface (SSI) driven by active acoustic sensing. EchoSpeech transmits inaudible sound waves towards the skin using tiny speakers and microphones mounted on a glasses frame.
Silent speech interfaces have recently gained increasing attention. Silent speech, in contrast to voiced speech, does not require users to vocalise sounds, which expands its applications to scenarios where voiced speech is limited. For instance, SSI can be used in noisy environments where voiced speech suffers severe interference, as well as in quiet places and other situations where speaking out loud is socially inappropriate. Furthermore, a recent study found that SSI is more socially acceptable than voiced communication and that users are willing to tolerate more errors.
"Anything that could give rise to smarter-than-human intelligence — in the form of Artificial Intelligence, brain-computer interfaces, or neuroscience-based human intelligence enhancement — wins hands down beyond contest as doing the most to change the world. Nothing else is even in the same league." — Eliezer Yudkowsky, AI researcher.
According to studies, social awkwardness and privacy concerns are also significant factors influencing user perception and willingness to use voice assistants. By eliminating the need to speak aloud, SSI protects privacy more effectively. These benefits make SSI a promising way to extend voice assistants into silent voice assistants. SSI also creates opportunities that voiced communication cannot offer. For example, SSI can be used to input passwords without leaking audio into the surrounding environment, and collaborators at a shared workstation can instruct AI agents using SSI without interfering with one another.
Objective
EchoSpeech is driven by active acoustic sensing: tiny speakers and microphones mounted on the bottom edge of a commercial off-the-shelf (COTS) glasses frame track lip and skin movements from multiple angles. The researchers designed a customised deep learning pipeline with connectionist temporal classification (CTC) loss, which lets EchoSpeech recognise both discrete and continuous silent speech without requiring segmentation. In a study with 12 participants, EchoSpeech achieved a word error rate (WER) of 4.5% (std 3.5%) on 31 commands and 6.1% (std 4.2%) on 3-6 digit connected numbers, spoken at a speed of 101 words per minute (wpm).
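The article does not reproduce the authors' architecture, but the core idea of CTC training can be illustrated with a minimal PyTorch sketch: a toy recurrent encoder maps echo-profile frames to per-frame class probabilities, and `nn.CTCLoss` aligns them with the unsegmented label sequence. The feature dimensions, class count, and GRU encoder here are illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 64 echo-profile features per frame, 33 output
# classes (e.g. 31 commands + blank + silence). These values are only meant
# to make the sketch concrete; they do not come from the paper.
N_FEATURES, N_CLASSES, HIDDEN = 64, 33, 128

class SilentSpeechNet(nn.Module):
    """Toy frame-level encoder: echo-profile frames -> per-frame log-probs."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_FEATURES, HIDDEN, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * HIDDEN, N_CLASSES)

    def forward(self, x):                      # x: (batch, time, features)
        out, _ = self.rnn(x)
        return self.fc(out).log_softmax(-1)    # CTC expects log-probabilities

model = SilentSpeechNet()
ctc = nn.CTCLoss(blank=0)                      # class 0 reserved as the CTC blank

x = torch.randn(2, 100, N_FEATURES)            # two utterances, 100 frames each
logp = model(x).permute(1, 0, 2)               # CTCLoss wants (time, batch, classes)
targets = torch.tensor([3, 7, 12, 5, 9])       # both label sequences, concatenated
input_lens = torch.tensor([100, 100])
target_lens = torch.tensor([3, 2])             # 3 labels, then 2 labels

# CTC marginalises over all alignments, so no frame-level segmentation is needed.
loss = ctc(logp, targets, input_lens, target_lens)
loss.backward()
```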
To ease onboarding for new users and improve performance, the researchers devised a two-stage (pre-training + fine-tuning) training scheme. They show that, with only 6-7 minutes of training data from a new user, EchoSpeech recognises standalone commands and connected digits with WERs of 9.5% and 14.4%, respectively. The researchers also demonstrated EchoSpeech's robustness in scenarios such as walking and noise injection, and built four real-time demo applications on a low-power variant running at 73.3mW to show its practicality and performance.
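A minimal sketch of that two-stage recipe, reusing the toy `SilentSpeechNet` from the sketch above: pre-trained weights learned across existing users are loaded once, then briefly fine-tuned on the new user's few minutes of data. The checkpoint name and `new_user_batches` iterable are hypothetical placeholders, not artifacts from the paper.

```python
import torch
import torch.nn as nn

# Stage 1: load weights pre-trained on data pooled from existing users.
# "pretrained_cross_user.pt" is an assumed file name for this sketch.
model = SilentSpeechNet()
model.load_state_dict(torch.load("pretrained_cross_user.pt"))

# Stage 2: fine-tune on a few minutes of the new user's recordings,
# typically with a small learning rate to preserve the shared representation.
ctc = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for x, targets, input_lens, target_lens in new_user_batches:  # placeholder loader
    optimizer.zero_grad()
    logp = model(x).permute(1, 0, 2)           # (time, batch, classes) for CTCLoss
    loss = ctc(logp, targets, input_lens, target_lens)
    loss.backward()
    optimizer.step()
```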
Conclusion
EchoSpeech detects the small skin deformations induced by silent utterances and infers silent speech by analysing echoes travelling along multiple paths. In a user study with 12 participants, the researchers show that EchoSpeech recognises 31 standalone commands and 3-6 digit connected numbers with WERs of 4.5% (std 3.5%) and 6.1% (std 4.2%), respectively.
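The article does not detail the signal processing, but a common way to analyse multipath echoes in active acoustic sensing is to cross-correlate the received microphone signal with the known transmitted chirp, producing an "echo profile" whose frame-to-frame changes reflect skin movement. The NumPy sketch below illustrates that idea; the sample rate, chirp band, and frame length are assumptions, not EchoSpeech's actual parameters.

```python
import numpy as np

FS = 50_000                 # sample rate in Hz (assumed)
F0, F1 = 16_000, 20_000     # near-inaudible chirp band in Hz (assumed)
FRAME = 600                 # samples per transmitted frame (assumed)

# Linear chirp sweeping F0 -> F1 over one frame.
t = np.arange(FRAME) / FS
chirp = np.sin(2 * np.pi * (F0 * t + (F1 - F0) * t**2 / (2 * t[-1])))

def echo_profile(rx_frame: np.ndarray) -> np.ndarray:
    """Correlate one received frame with the transmitted chirp; each
    correlation peak corresponds to one echo path (direct path, lips, chin...),
    with the lag axis mapping to the round-trip distance of that surface."""
    c = np.correlate(rx_frame, chirp, mode="full")
    return np.abs(c[FRAME - 1:])    # keep non-negative lags only

def differential_profile(prev: np.ndarray, cur: np.ndarray) -> np.ndarray:
    """Skin movement shows up as frame-to-frame change, so differencing
    consecutive profiles highlights articulation over static geometry."""
    return echo_profile(cur) - echo_profile(prev)
```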
Furthermore, to verify EchoSpeech's robustness, the researchers tested it in scenarios such as walking and noise injection. They then demonstrated EchoSpeech in real-time demo applications running at 73.3mW, with the real-time pipeline running on a smartphone and requiring only 1-6 minutes of training data. According to the researchers, EchoSpeech is a significant step towards a minimally obtrusive wearable SSI for real-world deployment.