Voice is the new multi-touch. Cloud-based virtual assistants, powered by advances in artificial intelligence, are improving in usefulness and sweeping across the landscape of consumer tech. How does this new paradigm of the conversational user interface work? What underlying technology makes this all possible? What are the major voice UI platforms for third-party developers available today?
At a very high level, here’s how these platforms work: A user either asks the virtual assistant for some information from an app or tells the assistant to do something with a specific app. The assistant then asks for any additional details it needs before passing that information to the app, which then performs the requested task or replies with the requested information.
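The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's API: the names `handle_request`, `REQUIRED_DETAILS`, and `book_table` are hypothetical, and the `ask` callback stands in for the assistant prompting the user by voice.

```python
from typing import Callable, Dict

# Hypothetical app task: the details it needs before it can run.
REQUIRED_DETAILS = ["restaurant", "party_size", "time"]


def book_table(details: Dict[str, str]) -> str:
    """Stand-in for the app code that performs the requested task."""
    return (f"Booked {details['restaurant']} for {details['party_size']} "
            f"at {details['time']}.")


def handle_request(known: Dict[str, str], ask: Callable[[str], str]) -> str:
    """Ask for any missing details, then hand everything to the app."""
    details = dict(known)
    for name in REQUIRED_DETAILS:
        if name not in details:
            # The assistant prompts the user for each missing detail.
            details[name] = ask(name)
    return book_table(details)


if __name__ == "__main__":
    # Simulated user answers in place of real speech input.
    answers = {"party_size": "4", "time": "7 pm"}
    print(handle_request({"restaurant": "Luigi's"}, lambda name: answers[name]))
```

In a real deployment the assistant platform owns the prompting loop; the app only sees the completed set of details.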
Speech, Software and Scale
Why have these virtual assistants become ubiquitous so recently? Several factors have come together to push this conversational UI tech past the tipping point.
One is the availability of huge amounts of on-demand bandwidth, processing power, and storage made possible by cloud computing platforms. Another contributing factor is the advancement of AI algorithms, software, and hardware (e.g., FPGAs and GPUs for machine learning). In addition, the vast number of natural language examples generated by users typing or speaking queries into search engines (such as Google Search or Microsoft’s Bing) provides plenty of data to train those advanced machine learning algorithms, making them better at understanding the context, content, and intent of voice commands and questions.
Non-technological trends are also at work. Verbal communication is natural to humans and is a great fit for tablets, smartphones, and wearables, which lack physical keyboards and mice. Each of the big players in this space also has a financial incentive to push an almost frictionless interface to its money-making platforms, such as search (Google), other cloud-hosted services (Microsoft), or ecommerce (Amazon).
A Look at the Big Three
Speaking of Google, Microsoft, and Amazon, here’s a rundown of each of their conversational UI offerings for developers:
Amazon Alexa Skills Kit: Amazon enables developers to build specific bridges (called “intents”) between Alexa and their app. The three main components of an intent are “utterances” (the spoken or typed trigger for the intent), “slots” (the details needed by the app to fulfill the intent), and “fulfillment” (the code which hooks into the app and actually does what the user wants).
Microsoft’s Cortana Skills Kit: Same basic approach as Amazon’s: devs build “skills” for Cortana, which extend and customize it. Devs specify what the skill does, how it’s launched (the trigger phrase), and the contextual info needed to execute that skill.
Google Assistant: Verbal requests from the user and the responses spoken back to her are brokered through Google Assistant. “Conversation Actions” are composed of Invocation triggers, Dialogs, and Fulfillment (again, the same basic threefold paradigm). Google provides the Conversation API, the Actions SDK, and API.AI, which hook into Google’s cloud services to power an app’s conversational UI.
Each of these platforms makes it possible, for instance, to tell Alexa to start streaming that new MST3K reboot, or to book a flight using a travel app by talking to Cortana. These virtual assistants serve as both the enabler and the gatekeeper of all your app’s verbal communication to and from the user. Computationally, the virtual assistants do all the heavy lifting (speech-to-text conversion, natural language processing, and text-to-speech conversion).
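The threefold paradigm the three platforms share (a trigger, the details needed, and the fulfillment code) can be modeled roughly as follows. This is a toy sketch, not any platform's SDK: the `Intent` class, `dispatch` function, and the sample intents are hypothetical, and literal template matching stands in for the natural language processing that the real assistants perform. The input text is assumed to be already transcribed from speech.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Intent:
    name: str
    utterances: List[str]                         # triggers with {slot} placeholders
    fulfillment: Callable[[Dict[str, str]], str]  # app code that does the work

    def match(self, text: str) -> Optional[Dict[str, str]]:
        """Return the slot values if text matches an utterance, else None."""
        for template in self.utterances:
            # Splitting on {slot} yields literal, slot name, literal, ...
            parts = re.split(r"\{(\w+)\}", template)
            pattern = "".join(
                re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+?)"
                for i, p in enumerate(parts)
            )
            found = re.fullmatch(pattern, text, flags=re.IGNORECASE)
            if found:
                return found.groupdict()
        return None


def dispatch(text: str, intents: List[Intent]) -> str:
    """Route transcribed text to the first intent whose utterance matches."""
    for intent in intents:
        slots = intent.match(text)
        if slots is not None:
            return intent.fulfillment(slots)
    return "Sorry, I can't help with that."


# Hypothetical intents for a streaming app and a travel app.
INTENTS = [
    Intent("StreamShow", ["play {show}", "start streaming {show}"],
           lambda s: f"Streaming {s['show']}"),
    Intent("BookFlight", ["book a flight from {origin} to {destination}"],
           lambda s: f"Flight booked: {s['origin']} to {s['destination']}"),
]

if __name__ == "__main__":
    print(dispatch("Start streaming MST3K", INTENTS))
    print(dispatch("book a flight from Boston to Denver", INTENTS))
```

On the actual platforms, the developer declares the utterances and slots and supplies the fulfillment code, while the assistant handles the matching.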
The Future of Conversational UIs
As this technology matures, expect to see a complete ecosystem of third-party app integrations for one or more of these platforms. You may also see the emergence of tools and standards which seek to mitigate the danger of vendor lock-in.
How can you, as an app developer, begin to add a conversational UI to your software?
Start by listing and mapping out the specific actions your app is able to accomplish with its current UI. For each specific action, list the specific input you need from the user. Each of these three platforms requires this information to be exposed by your app, so you’ll need to do this regardless of which vendor you choose.
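One way to capture that inventory is as plain data that can later be translated into whichever vendor's format you settle on. The sketch below is only an illustration: the `AppAction` class, the travel-app actions, and the JSON layout are all assumptions, not a format any of the three platforms defines.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class AppAction:
    name: str                   # something the app can already do
    required_inputs: List[str]  # what it needs from the user to do it


# Hypothetical inventory for a travel app.
ACTIONS = [
    AppAction("book_flight", ["origin", "destination", "date"]),
    AppAction("check_flight_status", ["flight_number"]),
]


def to_json(actions: List[AppAction]) -> str:
    """Serialize the inventory to vendor-neutral JSON."""
    return json.dumps([asdict(a) for a in actions], indent=2)


if __name__ == "__main__":
    print(to_json(ACTIONS))
```

Keeping the list vendor-neutral at this stage also leaves room to target more than one platform later.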
Conversational UIs are quickly revolutionizing how people interact with technology as the cloud wafts deeper into our everyday lives.