Voice-powered Internet of Things


IoT solutions are about innovative, intuitive approaches that make the way we live, learn, work, and play easier and more interactive. When you picture interacting with such a system, perhaps a smartphone or tablet comes to mind, with a touchscreen where you tap and type your way through the interface. But applications like medical monitoring and home automation often involve wearable devices and/or small embedded systems with no touchscreen at all. In these situations, interacting with the device may come down to cumbersome button-press sequences or “hold until this light blinks.”

I discussed these challenges with Sensory CEO and founder Todd Mozer, who provided a peek into the future of voice-enabled sensors and devices.

Benefits of voice-enabled interaction

Sensory is a company with a long history in speech recognition. It started in 1994 as an embedded speech recognition chip company. Its basic strategy remains the same, but Sensory has since expanded to put its speech recognition intellectual property into system-on-chip (SoC) MCUs and DSPs.

“The majority of IoT systems today tend to think of voice recognition being within the cloud, and there are complex speech recognition features that make sense to be there,” Mozer said, describing the role of speech in the IoT environment. “But there is tremendous value in on-board speech recognition within sensors and devices. This enables better human interface and enables endpoints to communicate amongst themselves and from device-to-cloud. This meshed speech capability concept enables IoT developers to optimize where the speech features are placed.”

Sensory has developed its TrulyHandsfree voice control technology targeted at IoT devices and sensors. Most of us have encountered speech recognition on our smartphones, where the interaction often begins with a button press. TrulyHandsfree removes the button press entirely, enabling ultra-low-power, high-accuracy command sets.
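As a rough illustration of the hands-free pattern, a device can run a tiny always-listening recognizer that ignores everything until it hears a trigger phrase, then acts on a small command set. The sketch below is purely conceptual: the trigger phrase, command list, and function names are invented for illustration and do not reflect Sensory's actual API.

```python
# Conceptual sketch of a hands-free trigger loop. All names here are
# hypothetical; a real TrulyHandsfree integration looks different.

TRIGGER = "hello device"
COMMANDS = {"lights on", "lights off", "set temperature"}

def handle_audio(phrases):
    """Scan a stream of decoded phrases; ignore everything until the
    trigger phrase is heard, then act on the next known command."""
    actions = []
    armed = False
    for phrase in phrases:
        phrase = phrase.lower().strip()
        if not armed:
            # Stay in low-power listening mode until the trigger is heard
            armed = (phrase == TRIGGER)
        elif phrase in COMMANDS:
            actions.append(phrase)  # dispatch the recognized command
            armed = False           # drop back to trigger listening
        # Unrecognized speech after the trigger is simply ignored
    return actions
```

In a real device the decoded phrases would come from a low-power acoustic front end rather than strings, but the control flow is the same: background speech never triggers an action, so no button press is needed.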

Embedding speech recognition is an art

In summer 2015, Sensory announced TrulyNatural – a deep learning speech recognition technology whose neural-network acoustic models are small enough to fit into an embedded system without sacrificing state-of-the-art accuracy. Goals of the system include letting users speak to devices naturally, keeping speech recognition working even when not connected to the Internet, and eliminating the risk of conversations being recorded in a distant cloud.

“We aren’t trying to do everything the cloud can do with TrulyNatural, but something specific without sacrificing accuracy,” said Mozer. “For example, something like a coffee machine has a specific language domain. The TrulyNatural system scales down to create a natural language interface for making coffee.”
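Scaling recognition down to a single language domain can be pictured as a small intent grammar. The toy parser below is a made-up example, not TrulyNatural's actual interface: it maps a free-form coffee request onto one of a handful of known drinks.

```python
# Toy domain grammar for a coffee machine -- illustrative only, not
# Sensory's implementation.
DRINKS = {"espresso", "cappuccino", "latte", "americano"}

def parse_request(utterance):
    """Pick out a known drink from a free-form request, or return None."""
    words = utterance.lower().replace(",", " ").split()
    for word in words:
        if word in DRINKS:
            return {"intent": "make_drink", "drink": word}
    return None
```

Because the vocabulary is tiny, a model like this can run entirely on-device, which is the point Mozer makes: the system does not need cloud-scale vocabulary to feel natural within its domain.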

Elaborating on the paradigm in which speech hooks wake systems from very low-power modes so they can then be addressed in natural speech, Todd used the example, “Hey coffee machine – I want a cappuccino,” to illustrate that there is no need for silent breaks or separate wake-up phrases. “The real trick is for the device to be responsive when you’re talking to it and not interrupt if you aren’t,” he said.

“Truly hands free” development platforms

Of course, the key question for embedded developers is how to get started. Samsung ARTIK is a collection of open platforms that provide Bluetooth, Wi-Fi, Thread, and ZigBee connectivity to IoT application developers, and Sensory recently ported its TrulyHandsfree software to an ARTIK board for a home automation demo at CES in Las Vegas.

“A number of people came away impressed with the demo,” Todd said. “A crowded room, lots of music and background noise. There was no Internet connection. But when people told the board to turn the lights on or change the temperature, it just worked.”

Sensory also offers hardware-based solutions for “always on” speech that can be implemented in an SoC device – about 10 thousand gates that, when running with a microphone, consume as little as 100 microamps (µA). When the system is in quiet mode, it consumes less current than standard battery leakage.

Wearables and home automation are two areas that spring to mind where touchscreen-based user interaction severely limits usefulness. The ability to embed speech recognition in these devices extends use cases and greatly enhances usability for a wide range of embedded devices connecting to IoT systems.