We are entering an era in which voice is being transformed from audio into information. Three factors contribute: behavioral changes driven by smartphones and voice-controlled speakers, advancements in infrastructure technology, and falling infrastructure prices. Today we give voice commands to products like Amazon Alexa, Apple Siri, and Google Assistant, and get a response back for each request. Behind the scenes, these products convert voice into text via Automatic Speech Recognition (ASR), use Natural Language Processing (NLP) to interpret the text, and return the results either visually or as voice using Text to Speech (TTS). The rapid growth and ease of use of these products have changed consumer behavior, making us more comfortable with voice and more likely to use it instead of traditional user interfaces.
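The ASR, NLP, and TTS stages described above can be sketched as a simple pipeline. This is only an illustration of the data flow, not how any of the named products is implemented: every function name here is hypothetical, the "audio" is stand-in UTF-8 bytes rather than a real recording, and each stage is a toy placeholder for what would be a call to a speech or language service.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Structured meaning extracted from the transcribed text."""
    name: str
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    # ASR placeholder (assumption): the "audio" is just UTF-8 text,
    # so decoding it stands in for real speech recognition.
    return audio.decode("utf-8").strip().lower()


def interpret(text: str) -> Intent:
    # NLP placeholder (assumption): naive prefix matching instead of
    # a trained language-understanding model.
    if text.startswith("play "):
        return Intent("play_music", {"query": text[len("play "):]})
    if text.startswith("what time"):
        return Intent("get_time")
    return Intent("unknown", {"text": text})


def respond(intent: Intent) -> str:
    # Decide what to say back for each recognized intent.
    if intent.name == "play_music":
        return f"Playing {intent.slots['query']}."
    if intent.name == "get_time":
        return "I cannot check the clock in this sketch."
    return "Sorry, I did not understand that."


def synthesize_speech(text: str) -> bytes:
    # TTS placeholder (assumption): encoding the reply stands in for
    # generating spoken audio.
    return text.encode("utf-8")


def handle_request(audio: bytes) -> bytes:
    # Voice in, voice out: ASR -> NLP -> response -> TTS.
    text = recognize_speech(audio)
    intent = interpret(text)
    reply = respond(intent)
    return synthesize_speech(reply)
```

Calling `handle_request(b"Play jazz")` walks the request through all four steps. In a real product each placeholder would be replaced by a service call, while the overall shape of the pipeline stays the same.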
Every major part of the technology infrastructure required to convert voice into information is available from the cloud vendors Amazon, Google, and Microsoft. The three major voice services they provide are ASR, TTS, and NLP. For example, the ASR services from Amazon, Google, and Microsoft are priced at around 2.4 cents per minute. Going forward, all of these vendors are investing in deep learning to reduce the training workload, improve the accuracy of the transcribed text, and handle more complex ASR tasks. In parallel, falling hardware prices combined with Nvidia GPU cloud infrastructure (for both training and inference) are expected to dramatically reduce the prices of ASR services.
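To put the roughly 2.4 cents per minute figure in perspective, here is a small back-of-the-envelope calculation. The rate is the approximate one quoted above; the function name and the 1,000-hour workload are assumptions chosen purely for illustration, and actual vendor pricing varies by tier and billing increment.

```python
from decimal import Decimal

# Approximate ASR rate quoted above, in US cents per minute of audio.
ASR_CENTS_PER_MINUTE = Decimal("2.4")


def transcription_cost_usd(hours_of_audio: int) -> Decimal:
    """Estimate the cost in US dollars of transcribing the given hours of audio."""
    minutes = Decimal(hours_of_audio) * 60
    cents = minutes * ASR_CENTS_PER_MINUTE
    return cents / 100


# Hypothetical workload: 1,000 hours of recorded audio.
print(transcription_cost_usd(1000))
```

At this rate, one hour of audio costs about $1.44 to transcribe, so a 1,000-hour archive comes to roughly $1,440.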
What can we do as the speech infrastructure services (ASR, NLP, and TTS) improve and their prices come down?
To make a prediction, let’s take a look at the evolution of voice communication services over the last decade. Infrastructure technologies for voice communication, such as high-bandwidth codecs, Acoustic Echo Cancellation (AEC), and broadband, set the stage for complete solutions like Skype, WeChat, Line, WebEx, WhatsApp, and many others. These complete solutions offered a better experience and benefited from network effects, which made them the dominant players in voice communication.
Along similar lines, we expect complete solutions built on the speech infrastructure services to emerge and become the dominant players. At Alan, we are excited to leverage the available infrastructure to develop the world’s first Voice AI service for Enterprises. Our focus is to provide a voice interface to all the information in your Enterprise. Just talk to Alan!