Today, Microsoft announced that Microsoft Translator, its AI-powered text translation service, now supports more than 100 languages and dialects. With the addition of 12 new languages, including Georgian, Macedonian, Tibetan, and Uyghur, Microsoft claims that Translator can now make text and document information accessible to 5.66 billion people worldwide.
Microsoft says these new languages are underpinned by unique advances in artificial intelligence and will be available in the Microsoft Translator apps, Translator for Office and Bing, Azure Cognitive Services Translator, and Azure Cognitive Services Speech.
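The new languages are surfaced through the same public Translator REST API in Azure Cognitive Services. As a minimal sketch of what that looks like in practice (the endpoint and v3.0 request shape follow Microsoft's public documentation; the subscription key and region below are placeholders you would need to supply), a request translating English into Georgian and Uyghur might be:

```python
import uuid

import requests

# Placeholders: substitute your own Azure Translator resource key and region.
SUBSCRIPTION_KEY = "YOUR_AZURE_TRANSLATOR_KEY"
REGION = "westus2"

ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
params = {
    "api-version": "3.0",
    "from": "en",
    "to": ["ka", "ug"],  # Georgian and Uyghur, two of the 12 new languages
}
headers = {
    "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
    "Ocp-Apim-Subscription-Region": REGION,
    "Content-Type": "application/json",
    "X-ClientTraceId": str(uuid.uuid4()),
}
body = [{"Text": "Hello, world!"}]

response = requests.post(ENDPOINT, params=params, headers=headers, json=body)
response.raise_for_status()
for translation in response.json()[0]["translations"]:
    print(translation["to"], "->", translation["text"])
```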
Z-code
What's driving Translator's upgrade is Z-code, part of Microsoft's larger XYZ-code initiative, which aims to combine AI models for text, vision, audio, and language to create AI systems that can speak, see, hear, and understand. Z-code provides the framework, architecture, and models for text-based, multilingual AI translation across families of languages. By sharing linguistic elements among similar languages and by using transfer learning (applying knowledge from one task to another, related task), Microsoft claims it has significantly improved translation quality and lowered the cost of its machine translation capabilities.
With Z-code, Microsoft is using transfer learning to move beyond the most common languages and improve translation accuracy for "low-resource" languages, defined as languages with fewer than 1 million sentences of training data. (Like all models, Microsoft's learn from examples in large datasets, both public and private.) Roughly 1,500 known languages meet this criterion, which is why Microsoft developed a multilingual training process that combines families of languages and language models.
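Z-code itself is not publicly released, but the underlying idea, a single multilingual model whose shared parameters let low-resource languages borrow strength from related high-resource ones, can be illustrated with an open model. The sketch below uses Facebook's M2M100 via the Hugging Face transformers library purely as a stand-in for that idea, not as Microsoft's system:

```python
# Illustrative only: M2M100 is an open multilingual translation model used
# here as a stand-in for the general approach; it is not Microsoft's Z-code.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# One shared set of weights covers ~100 languages, so a lower-resource
# language benefits from parameters learned on higher-resource relatives.
tokenizer.src_lang = "en"
encoded = tokenizer("Machine translation helps people communicate.",
                    return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("ka"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```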
But until recently, even the most advanced algorithms lagged behind human performance.
Efforts outside of Microsoft illustrate the scale of the problem: the Masakhane project, which aims to make thousands of languages on the African continent automatically translatable, has not yet moved beyond data collection and transcription. And Mozilla's Common Voice project, which aims to build an open source collection of transcribed speech data, has vetted only a few dozen languages since its launch in 2017.
Because Z-code language models are trained multilingually across many languages, that knowledge is transferred between languages.
For example, the models' translation skills ("machine translation") are used to help improve their ability to understand natural language ("natural language understanding").
In August, Microsoft said that a Z-code model with 10 billion parameters could achieve state-of-the-art results on machine translation and cross-lingual summarization tasks.
In machine learning, parameters are the internal configuration variables that a model uses to make predictions, and their values largely (but not entirely) define the model's skill on a problem.
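For a concrete sense of what is being counted, the toy PyTorch model below tallies its own parameters; every weight and bias is one parameter, and models like Z-code simply have billions of them:

```python
import torch.nn as nn

# A toy two-layer network, purely to illustrate what "parameters" are;
# this is not Z-code, which has 10 billion of these values.
model = nn.Sequential(
    nn.Linear(512, 2048),  # weights: 512 * 2048, biases: 2048
    nn.ReLU(),             # activations have no learnable parameters
    nn.Linear(2048, 512),  # weights: 2048 * 512, biases: 512
)

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters")  # 2,099,712 here vs. 10 billion in Z-code
```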
Microsoft is also hard at work training a 200-billion-parameter version of the benchmark-beating model mentioned above. For comparison, OpenAI's GPT-3, one of the largest language models in the world, has 175 billion parameters.
Some researchers claim that AI translations are not as lexically rich as human translations, and there is ample evidence that language models amplify biases in the datasets they are trained on.
AI researchers from MIT, Intel, and CIFAR, a Canadian initiative, have found high levels of bias in language models including BERT, XLNet, OpenAI's GPT-2, and RoBERTa.
In addition, Google has identified (and claims to have addressed) gender bias in its Google Translate models, particularly in low-resource languages such as Turkish, Finnish, Farsi, and Hungarian.
Microsoft, for its part, points to Translator's adoption as evidence of the platform's maturity.
In a blog post, the company noted that thousands of organizations around the world use Translator for their translation needs, including Volkswagen.
“The Volkswagen Group is using the machine translation technology to serve customers in more than 60 languages — translating more than 1 billion words each year,” Microsoft’s John Roach writes. “The reduced data requirements … enable the Translator team to build models for languages with limited resources or that are endangered due to dwindling populations of native speakers.”