With Siri’s launch, Apple moved technology a little closer to Hollywood’s dream of HAL in 2001: A Space Odyssey and the computer in Star Trek. Siri is a major contributing factor to the voice recognition software market’s forecast growth of 5% over 2012. After years of stagnation, the field is now moving at a phenomenal pace and in unexpected directions: for instance, new research, unveiled at TEDGlobal in Edinburgh on Monday, shows that Parkinson's symptoms can be detected by computer algorithms that analyse voice recordings.
So what is voice recognition technology capable of in 2012? What are its limitations and just where is Apple headed with Siri?
Clarke’s Third Law
Siri, the polite app with a personality (but which doesn’t know anything about abortion clinics), was, for many, 2011’s technological high point. With an accuracy rate of up to 99%, the app was thought to have attained the Holy Grail of Natural Language Processing (NLP). Such was the enthusiasm that reviewers gushed about Clarke’s Third Law (any sufficiently advanced technology is indistinguishable from magic) and “the fourth interface” (following on from the keyboard, mouse and gestures). No more need for instruction, metaphor or a mental model. Google and Microsoft were trounced (for a while at least) and the prospects for emergent technologies appeared to multiply overnight.
Siri has been likened to a mini, mobile version of IBM's Watson, the Jeopardy!-playing computer. Siri:
• Does things for you: it lets you use your voice to send messages, schedule meetings, make phone calls and search multiple information sources
• Seeks to “comprehend” your request via location context, time context, task context and dialog context, and answers you back
• Learns and acts on the personal information you give it.
For many actions, the number of steps is slashed: setting an alarm with a traditional touchscreen takes around nine actions, whereas with Siri it takes two.
Voice-activated technology isn't new; it's just getting better because of increasingly powerful processors and cloud services, improved algorithms and advances in natural language processing.
What differentiates the NLP logic in Siri is that:
1. It maintains context, so you could say: “Also send her an email reminder.” Siri will understand ‘her’ and compose the email accordingly.
2. It accurately translates a variety of speaking styles and accents into text, with no “training” period required.
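The context handling described in point 1 can be illustrated with a toy example. This is only a minimal sketch of pronoun resolution against a remembered dialog context, not a description of how Siri itself works; the contact list, the gender labels and the DialogContext class are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field

# Hypothetical contact list; names and gender labels are illustrative assumptions.
CONTACTS = {"Alice": "f", "Bob": "m"}
# Pronouns mapped to the label they can refer back to.
PRONOUNS = {"her": "f", "she": "f", "him": "m", "he": "m"}


@dataclass
class DialogContext:
    # Most recently mentioned contact for each label ("f" or "m").
    last_mentioned: dict = field(default_factory=dict)

    def resolve(self, utterance: str) -> str:
        """Replace pronouns with the contact they most recently referred to."""
        words = []
        for raw in utterance.split():
            token = raw.strip(".,!?")
            label = PRONOUNS.get(token.lower())
            if token in CONTACTS:
                # Remember this contact for later pronouns.
                self.last_mentioned[CONTACTS[token]] = token
                words.append(raw)
            elif label is not None and label in self.last_mentioned:
                # Swap the pronoun for the remembered name, keeping punctuation.
                words.append(raw.replace(token, self.last_mentioned[label]))
            else:
                words.append(raw)
        return " ".join(words)


ctx = DialogContext()
first = ctx.resolve("Schedule a meeting with Alice at noon.")
second = ctx.resolve("Also send her an email reminder.")
print(second)
```

The second request only makes sense because the first one left “Alice” in the context; a pronoun with nothing to refer back to is left unchanged.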
How does it work?
1. An acoustic signal is captured as digital data; the recording is transmitted to the company’s servers and analyzed to identify phonemes and produce strings of symbols.
2. Algorithms based on stochastic (probabilistic) modeling, such as Hidden Markov Models, are used to decode the symbols.
3. These are analysed using a language model to identify words, sentences, any domain-specific vocabulary and statistical grammars.
4. The servers then send back a file that contains the response data.
5. What you then get is concatenative synthesis: recorded voice sounds, extracted from a database of diphones and demi-syllables, are assembled into speech (a demanding and expensive approach).
6. Formant synthesis is applied: a waveform is generated by a computer model, timbre is added, and the result is passed through a synthesiser.
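The decoding in step 2 can be sketched with the Viterbi algorithm, the standard way of finding the most likely sequence of hidden states in a Hidden Markov Model. The example below is a toy under stated assumptions: the two phoneme-like states ("S" and "IY"), the two acoustic symbols ("hiss" and "tone") and every probability are invented for illustration and do not come from any real recogniser.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximises the path probability into s.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Toy model: all values below are illustrative assumptions.
states = ["S", "IY"]
start_p = {"S": 0.8, "IY": 0.2}
trans_p = {"S": {"S": 0.5, "IY": 0.5}, "IY": {"S": 0.1, "IY": 0.9}}
emit_p = {"S": {"hiss": 0.9, "tone": 0.1}, "IY": {"hiss": 0.2, "tone": 0.8}}

path, prob = viterbi(["hiss", "hiss", "tone", "tone"], states, start_p, trans_p, emit_p)
print(path, prob)
```

With these numbers the decoder labels the two hiss-like frames as "S" and the two tone-like frames as "IY". A production system works on the same principle but with far larger models, and passes the result on to the language model described in step 3.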
On the synthesised speech capability scale, current systems are at level 2: an intelligible simulation of the natural qualities of human speech. We have yet to attain level 3 (speech personalized to the person it represents) and level 4 (speech based on a person's own voice that sounds just like that person).
How did Apple achieve what others could not?
Basically, through acquisitions and partnerships:
1. Siri, Inc. (acquired in April 2010), a spin-off from SRI (Stanford Research Institute), one of the world's largest contract research institutes. Siri's technology was born from SRI's work on the DARPA-funded CALO project, described by SRI as the largest artificial intelligence project ever launched.
2. An (unconfirmed) partnership with Nuance, also a spin-off from SRI, which earlier merged with ScanSoft, a company with strong links to the futurist Ray Kurzweil. Nuance’s state-of-the-art servers are thought to power Siri’s voice recognition.
3. A partnership with Wolfram Alpha, the "computational knowledge engine" that has amassed a wealth of information about science, math, music, and other subjects.
Shake Up of Search
In a mobile environment, it's not practical to wade through pages of links to get at simple answers. This has led Siri to be called “the answer engine” and, according to some, Siri challenges Google's entire search empire and shakes it to its foundations.
Google and Microsoft have their own voice products (such as Google Voice Actions, Microsoft Tellme and Bing Voice Search) as well as search engines. Google has recently updated Voice Actions, making it faster and smoother, but reviewers were not impressed. Some commentators say that Siri has given Apple a two-year advantage over its competitors.
Additionally, Siri, the company bought by Apple, had implemented more than 45 APIs before it was acquired, meaning that the possibilities of a conversational interface to the web are endless.
But, for SMEs, that doesn't make it any easier to implement NLP: you either have to pay one of the very few companies to use their technology, or you have to invest in a lot of research science.
However, Microsoft, Nuance and other large entities are working to level the playing field, and to offer Natural Language products at a lower cost and broader scale.
Siri Competitors: Vlingo, Evi, Start and Zypr also have some advantages over Siri. Siri is limited in that it's a sealed system, so developers can't embed Siri-like functionality into their own applications. With Zypr, however, they can, and Evi has become popular in the UK as she is apparently better at recognising British regional accents.
Environment: Despite Siri’s secondary microphone, which attempts to separate background noise from a user's voice to improve call quality, noisy environments are a problem for voice recognition. However, a recent study on the so-called “cocktail party effect” found that, in a noisy environment, the brain tunes into just one sound, such as a person’s voice. This finding should help speed the development of improved voice recognition technologies.
Language: Siri presently only works in English, French and German, but other languages are soon to be added and the field is moving very quickly. Nuance has mapped 13 out of 22 languages spoken in India and is working on the other nine.
Speech accuracy: Despite Siri’s high reported accuracy rate, accuracy is subjective. Apple has already been the subject of two lawsuits, with plaintiffs complaining that commercials were deceptive and that Siri either did not understand what was asked or provided the wrong answer.
Accents seem to be a particular problem, especially Scottish, Japanese, Arabic and Australian. According to Apple, Siri will improve with time: the application is still in beta and will be upgraded. Additionally, the more such applications are used, the more they contribute to a huge new data stream to study. Servers continue to analyze intonation, accents and languages in minute detail and constantly improve recognition algorithms.
Not suitable for everything: Some search and organisation tasks rely on vision, such as sorting through photographs, and for these voice recognition would be much slower. So it’s likely that graphical user interfaces (GUIs) won’t be going away anytime soon.
Dissatisfaction: Initial enthusiasm for Siri seems to have died down, with only about 20–35% of users continuing to rely on the application on a daily basis; a significant number appear to have decided that they don’t like it. However, there are concerns over the small sample sizes of these polls.
Although rudimentary, the polls seem to bear out research by Carnegie Mellon showing that, when it came to feedback, people preferred to give it via text rather than vocally to a robot.
State of the Art – new uses
Siri in other Apple products: Siri requires a constant data connection. iPhones are rarely without one, but the same can't be said for WiFi-only iPads and some iPods. There are also issues with low-quality microphones, which are likely to make it difficult for Apple to implement Siri across its entire product range.
However, integrating Siri into Mac OS X is a more realistic prospect. All current Mac hardware is already capable of doing the necessary speech-to-text processing locally.
Cars: There are systems, such as Ford’s Sync, which can recognize up to 10,000 voice commands. But usually, for vehicles that allow voice input for directions, the menus are long, laborious, and often just as distracting to navigate as using a touch screen.
Mercedes is planning to unveil the first system which will use Siri, integrating it with Merc’s own software.
Limitations: Manufacturers have greater liability concerns and need to grapple with the issue of distracted driving. Mobile data connections can be inadequate.
Smart TV: Samsung's new smart LED TV, the ES8000, has both voice and gesture control. The touch remote also has a built-in mic that users can speak into when the environment is too loud. However, some reviewers were not impressed by the speed and thought the number of different interfaces available may prove confusing for viewers.
In the future, we may not need a remote at all but instead use a secondary device, like an iPhone.
Home Automation: US company Carnes Audio, which installs home automation systems, created its own promotional video of Siri communicating with household technologies. For example, the demo shows Siri dimming the lights to 50% and setting the thermostat to 70 degrees with the aid of Crestron AMS-AIP home automation equipment and an intermediary proxy server.
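To give a rough idea of what an intermediary of this kind has to do, the sketch below turns recognised text into a structured device command. It is a minimal illustration only, not the software used in the demo: the phrase patterns, device names and action names are all hypothetical, and a real proxy would go on to forward the command to the automation controller.

```python
import re

# Hypothetical phrase patterns mapped to device commands; all names are illustrative.
PATTERNS = [
    (
        re.compile(r"dim the lights to (\d+)\s*%", re.IGNORECASE),
        lambda m: {"device": "lights", "action": "set_level", "value": int(m.group(1))},
    ),
    (
        re.compile(r"set the thermostat to (\d+) degrees", re.IGNORECASE),
        lambda m: {"device": "thermostat", "action": "set_temperature", "value": int(m.group(1))},
    ),
]


def parse_command(text):
    """Return a command dictionary for recognised text, or None if nothing matches."""
    for pattern, build in PATTERNS:
        match = pattern.search(text)
        if match:
            return build(match)
    return None


print(parse_command("Dim the lights to 50%"))
print(parse_command("Set the thermostat to 70 degrees"))
```

Fixed patterns like these only cover phrasings someone has anticipated, which is one reason natural language systems rely on statistical models instead.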
Bluetooth Low Energy (BLE) paired with Siri also offers interesting opportunities: with BLE fitted to something like your front door, you wouldn’t need a key, or need to open the door yourself when visitors are expected.
Medicine and Law: Ironically, in occupations that have complex vocabularies, like medicine and law, voice recognition software can differentiate between words extremely well. In a courtroom, for instance, Nuance’s Dragon Dictate can create court records in real time.
Customer Services: US Airways has introduced a Nuance-designed system, Wally, to anticipate callers’ requests, telling frequent-flier members their seat assignments, for example. Because it converts speech to text, the customer doesn’t have to repeat a request if subsequently transferred to a live operator. Wally is said to seem so personable that many people say “thank you” before hanging up.
Both Microsoft and telecom giant AT&T are offering cloud-based enabler platforms with speech capabilities to create natural language customer care and sales applications for contact centers.
Death of advertising: What about the advertisers? No problem: there are companies, like Zeebox, already using technologies such as audio fingerprinting to get a clear view, in real time, of what is happening in both the video and the audio. Content is analysed and tags are applied which are linked to companion content from sources such as Wikipedia and eCommerce sites.
VPAs Everywhere: Voice recognition is already being used for everything from locking a USB drive and enabling overworked doctors to quickly load medical records to a Nuance-powered coffee maker (which speaks German and English) and a cuddly toy that can teach your kids Spanish. The latter, Ubooly, combines a furry body with a voice recognition app which turns it into an interactive educational tool. The app uses NLP to respond to kids and has a memory, so it can reminisce about past playtimes.
The technology which powers systems like Siri also has interesting implications beyond voice. An Australian start-up, Kaggle, has developed Siri-like software which can skim through files and is currently being used in a trial to mark students’ essays.
It’s entirely possible, then, that in the future you'll not only be able to use voice recognition to find this article, but that it will also be among the tools used to produce it.