A Few Notes on Natural Language Processing & XR

Blog Posts, Cognitive Services, High Fidelity, Programming, Services, Software, Virtual Reality
June 13, 2017

A few weeks ago, I spoke briefly about a new passion project of mine: combining natural language processing and virtual reality to harness our tendency towards conversational interaction and build out a new degree of engagement in virtual spaces.

Natural language processing is an important segment of machine learning, and it powers our digital personal assistants. We’re fortunate to be living in an age where we have access to immensely powerful language processing and understanding through cloud services provided by companies like Microsoft, IBM, Google, and Amazon, all of whom provide endpoints for various tools around speech detection, “real enough” time translation, keyword detection, and sentiment analysis – to name a few.

These services are accessible in a number of different ways, many of which are relatively simple to begin integrating into today’s immersive experiences. While I’ve only begun dipping my virtual toes into the ocean of possibilities, I wanted to write up a few things that were top of mind, in the hope of encouraging more XR (generally speaking, any type of immersive technology) developers to start thinking about how they could integrate these services into their own applications and experiences.

An overview of a few NLP (and related) services

Microsoft Cognitive Services – Speech API – Microsoft’s APIs are the ones I’m most familiar with, and I’ve recently used the Language APIs to do real-time translation on speech-to-text strings from Limitless in High Fidelity. Microsoft offers a number of APIs to build on around speech and language processing, a snippet of which I’ve included below using JavaScript. You need to sign up for an Azure account to use the Cognitive Services APIs, most of which come with a free tier that’s great for testing. Microsoft also built Cortana skills into the capabilities of HoloLens applications, so it’s likely that the upcoming Mixed Reality ecosystem devices will also support the Cortana abstraction layer on top of the Bing Speech and Language APIs.

Google Cloud Speech API – like Microsoft, Google offers a range of speech and language APIs that are available to developers and that power the Google Assistant on Android phones. Google Cloud accounts start with a free offer, and the platform provides a wide range of specific APIs around speech recognition and detection.
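Since Google’s Speech API is exposed over plain HTTPS as well, a call into it from a script looks much like the Microsoft example later in this post. Here’s a rough sketch of what a recognition request might look like – I haven’t battle-tested this one, so treat GOOGLE_API_KEY and base64Audio as placeholders for your own key and a base64-encoded audio capture:

    // Sketch: recognizing speech with the Google Cloud Speech REST API (v1).
    // GOOGLE_API_KEY and base64Audio are placeholders for your own API key
    // and a base64-encoded audio clip captured in your application.
    var recognizeSpeech = function (base64Audio) {
        var req = new XMLHttpRequest();
        req.open("POST", "https://speech.googleapis.com/v1/speech:recognize?key=" + GOOGLE_API_KEY);
        req.setRequestHeader("Content-Type", "application/json");
        req.onreadystatechange = function () {
            if (req.readyState === 4 && req.status === 200) {
                // The response contains an array of results with ranked alternatives
                var result = JSON.parse(req.responseText);
                print(result.results[0].alternatives[0].transcript);
            }
        };
        req.send(JSON.stringify({
            config: { encoding: "LINEAR16", sampleRateHertz: 16000, languageCode: "en-US" },
            audio: { content: base64Audio }
        }));
    };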

IBM Watson – the Jeopardy-winning Watson is available to developers on Bluemix, IBM’s cloud service offering. Like Google and Microsoft, IBM provides a range of functionality that developers can build on top of to tap into the services behind one of the most well-known machine-trained AI systems.
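The shape of a Watson Speech to Text call is similar – you post captured audio to the recognize endpoint using the HTTP Basic credentials from your Bluemix service instance. Another hedged sketch (username, password, and audioData are placeholders, and btoa may need a base64 substitute outside browser contexts):

    // Sketch: Watson Speech to Text over REST with Bluemix service credentials.
    // username/password come from your service instance; audioData is a WAV
    // buffer captured from your application.
    var watsonRecognize = function (audioData) {
        var req = new XMLHttpRequest();
        req.open("POST", "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize");
        req.setRequestHeader("Authorization", "Basic " + btoa(username + ":" + password));
        req.setRequestHeader("Content-Type", "audio/wav");
        req.onreadystatechange = function () {
            if (req.readyState === 4 && req.status === 200) {
                // Print the top-ranked transcription alternative
                var result = JSON.parse(req.responseText);
                print(result.results[0].alternatives[0].transcript);
            }
        };
        req.send(audioData);
    };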

Amazon Alexa Voice Service – from what I can tell reading over the documentation and API, Amazon’s Alexa Voice Service differs from the first three in that its language processing APIs tie directly back into the Alexa platform offerings. This is a good option for building out features within your application that could benefit from an out-of-the-box voice assistant like Alexa, but it may not provide the same degree of flexibility in customizing application-side behaviors that you’d want for game- or app-specific development.

Getting Started with NLP in XR Applications

This process will vary depending on just about every factor imaginable in your game/app, but to start with, you’ll want to consider several stages:

  • Design your functionality and choose a service. Pricing may come into play here – I recommend reading the documentation for each of the cloud services that supports your desired feature set and testing out the setup process. Personally, I’m most familiar with the Microsoft services, so that’s what I end up using most frequently.
  • Using your input mechanism, capture the text that you want to send off to the service. If you’re using microphone input and targeting both mobile and desktop VR / AR headsets, consider whether you’ll need an abstraction layer to hook into the different capture hardware across platforms. With audio input, you’ll also have the extra step of speech-to-text (STT) conversion, which is often offered as an additional service. Some platforms and toolsets (like High Fidelity) have speech-to-text support built in, which is great for saving a step here.
  • Send your string to the service you’ve chosen (these services often expose a RESTful HTTP endpoint, and most engines and platforms have built-in support for sending such requests). At some point you’ll likely need to authenticate against your account, which may require a handshake of sorts to acquire a token or otherwise prove the validity of your request; there’s a minimal sketch of that token handshake just after this list. Sometimes you can skip this step during development, but you don’t want to expose your credentials or access keys in production code!
  • Parse the response that you get from the cloud service of choice and use the result in your application!
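To make the authentication step concrete: with Microsoft’s services, the handshake is a POST of your subscription key to the Cognitive Services token endpoint, which hands back a short-lived access token. A minimal sketch (subscriptionKey is a placeholder for your own Azure key – again, keep it out of production code):

    var subscriptionKey = "YOUR_AZURE_KEY_HERE"; // placeholder
    var token = "";

    // Request a short-lived access token; the response body is the raw token,
    // which we prefix with "Bearer " so it can go straight into an
    // Authorization header on later requests.
    var requestToken = function () {
        var req = new XMLHttpRequest();
        req.open("POST", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken");
        req.setRequestHeader("Ocp-Apim-Subscription-Key", subscriptionKey);
        req.onreadystatechange = function () {
            if (req.readyState === 4 && req.status === 200) {
                token = "Bearer " + req.responseText;
            }
        };
        req.send("");
    };

These tokens expire after about ten minutes, so a long-running experience will want to refresh them on a timer.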

Translation Example: Calling Microsoft Cognitive Services Translation APIs with JavaScript

This is a really simple example of a JavaScript function that takes in a string (in the full code, it’s the transcription from Limitless in High Fidelity, but you could call this on any string generated by your speech-to-text service of choice) and translates it into Spanish. You would need to modify this example, including the request and what happens when you receive the response text, to make sense for your application and engine; here, I’ve simply updated a text entity using the High Fidelity Entities API. I use this to do near real-time translation of my spoken input in High Fidelity, rendering it in writing in other languages, and I make similar calls to other services to generate images on the fly from spoken descriptions.

    // Translate `toTest` into Spanish and write the result onto a text entity.
    // Assumes `token` (acquired in an earlier function call) and `textID` (the
    // ID of the text entity to update) exist in the enclosing scope, and that
    // formatResponse() pulls the translated string out of the XML response.
    var testTranslationWithString = function (toTest) {
        // Escape the string for safe use in a URL query parameter
        var _toTest = "?text=" + encodeURIComponent(toTest) + "&to=es";

        // Create our HTTP request against the Cognitive Services Translation Service
        var req = new XMLHttpRequest();
        req.open("GET", "https://api.microsofttranslator.com/V2/Http.svc/Translate" + _toTest);

        req.onreadystatechange = function () {
            if (req.readyState === 4) {
                if (req.status === 200) {
                    // Successful response: parse the returned text (in this case, a
                    // Spanish translation of the original string) and update the entity
                    print("200!");
                    var newText = { "text": formatResponse(req.responseText) };
                    print(JSON.stringify(newText));
                    Entities.editEntity(textID, newText);
                } else {
                    print("HTTP Code: " + req.status + ": " + req.responseText);
                }
            }
        };

        req.setRequestHeader("Accept", "application/xml");

        // We requested a token in an earlier function call and pass it in here
        req.setRequestHeader("Authorization", token);
        req.send("");
    };
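Wiring the pieces together looks something like the snippet below. Keep in mind that the token request is asynchronous, so a real script should wait for it to complete (or refresh it on a timer) before translating:

    // Illustrative usage: acquire a token, then translate a transcription
    requestToken();
    testTranslationWithString("Hello from virtual reality");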

If you have any suggestions for those who are just getting started with NLP, cloud services, and using these solutions in immersive apps, or if you’ve done something cool with them yourself, leave a comment below! I’ll do my best to answer any questions and I’d love to see what you’re working on with these technologies!
