
Microsoft's voice recognition tech now better than teams of human transcribers

Tyrone Stewart

Microsoft’s voice recognition tech can transcribe conversations better than a team of trained humans, less than a year after the tech reached parity with the average human transcriber.

Following its success in 2016, Microsoft’s speech and dialog research group was challenged to cut its word error rate from 5.9 per cent to 5.1 per cent, matching the rate achieved by humans in a multi-transcriber process, and it has done just that.

“Reaching human parity with an accuracy on par with humans has been a research goal for the last 25 years. Microsoft’s willingness to invest in long-term research is now paying dividends for our customers in products and services such as Cortana, Presentation Translator, and Microsoft Cognitive Services,” said Xuedong Huang, technical fellow at Microsoft, in a blog post. “It’s deeply gratifying to our research teams to see our work used by millions of people each day.

“Advances in speech recognition have created services such as Speech Translator, which can translate presentations in real-time for multi-lingual audiences.”

To put the technology to the test, Microsoft used Switchboard, a collection of recorded telephone conversations used to benchmark speech recognition systems. On this benchmark, Microsoft found that its technology had reached an accuracy level comparable with or better than that of the highest standard of human transcribers.
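
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Microsoft's scoring tooling, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and does no text normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one
# deletion ("the") against a six-word reference.
rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{rate:.1%}")
```

On this definition, a 5.1 per cent word error rate corresponds to roughly five word errors for every 100 words in the reference transcript.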

The next test for the technology will be to reach a similar standard on grainy phone lines, in noisy environments, and across different accents, speaking styles and languages.

“While achieving a 5.1 percent word error rate on the Switchboard speech recognition task is a significant achievement, the speech research community still has many challenges to address, such as achieving human levels of recognition in noisy environments with distant microphones, in recognizing accented speech, or speaking styles and languages for which only limited training data is available,” said Huang. “Moreover, we have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent. Moving from recognizing to understanding speech is the next major frontier for speech technology.”