Research and issues with cross-platform compatibility

I needed to keep researching how to get speech-to-text functionality working in Visual Studio Code. My notes from that research can be found in this pull request.

One issue with grabbing microphone input and sending it to an application that works across a variety of platforms is that capturing audio from different computers' hardware is not necessarily straightforward, and neither is finding a program that can parse that audio on every platform.

Web apps tend to be associated with running across a variety of platforms, since the work happens remotely on one system and the data comes back in a reasonably universal format like HTML, JavaScript, JSON, or common file formats. Things like the Web Speech API are built into some browsers, but only a few, and Google seems to make a point of keeping a complex and valuable feature like this out of Chromium, where one might otherwise look at how they make it work. Google does, however, offer speech recognition through its cloud API services, which can take .raw audio from a file, or a short stream of a minute or less, analyze it, and send back JSON containing the words it thinks it heard. The platform is also handy because you can configure it with predefined words or phrases that users are likely to say. That is quite useful for a VS Code extension that will reuse the names of HTML elements frequently, provided you can figure out a way to get VS Code to pull in mic audio.
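To make the phrase hints a bit more concrete, here is a rough sketch of the JSON body that the Cloud Speech-to-Text v1 `speech:recognize` REST method accepts. The helper name `buildRecognizeRequest`, the 16 kHz sample rate, the LINEAR16 encoding for the .raw audio, and the list of HTML words are all my assumptions for illustration. None of this is code from our repo, and nothing is sent over the network here.

```javascript
// Sketch only: builds a request body for the Cloud Speech-to-Text v1
// `speech:recognize` method. It does not call the API.
function buildRecognizeRequest(rawAudio, phrases) {
  return {
    config: {
      encoding: "LINEAR16",          // assumed: headerless 16-bit PCM (.raw)
      sampleRateHertz: 16000,        // assumed recording rate
      languageCode: "en-US",
      speechContexts: [{ phrases }], // words the recognizer should favor
    },
    // The REST API expects the audio bytes as a base64 string.
    audio: { content: rawAudio.toString("base64") },
  };
}

// Hypothetical hint list for dictating HTML elements.
const htmlWords = ["div", "span", "paragraph", "heading", "anchor"];
const body = buildRecognizeRequest(Buffer.from([0, 0, 1, 0]), htmlWords);
console.log(JSON.stringify(body, null, 2));
```

The body would then be POSTed to the API along with whatever credentials your account uses, which is the part that requires registering with Google.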

The only issue with this is that Google's API service is free only under a certain usage threshold. After that they'll cut you off or start asking you to pay. What's more, the API asks users to enter their credit card information when registering for an account, even if they only intend to use the free tier. Having repo contributors register their own accounts for Google's cloud speech-to-text API would greatly increase how much the speech-to-text service could be used with the project, since each person would have their own trial account. However, it would be understandably off-putting for potential contributors to have to give a company their credit card information in order to work with the repo.

In terms of hardware, there seem to be a few Node.js libraries out there that will supposedly work across various platforms, although they may require platform-specific software to be installed, like sox or arecord, depending on your platform. What's interesting is that these packages can use Node's process.platform to determine which platform-specific software to launch with Node's child_process.spawn(). This concept of detecting the operating system is something that can be used in the code for (or processes spawned by) our VS Code extension. The only issue is that if the dependencies that need to be installed differ by operating system, it becomes difficult to distribute both the dev environment and the actual extension itself.
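Here is a minimal sketch of that idea: pick a recorder based on process.platform, then launch it with child_process.spawn(). The function names `recorderFor` and `startRecording` are made up for this post, and the choice of arecord on Linux and sox elsewhere, along with the 16 kHz mono 16-bit settings, is an assumption on my part rather than what any particular library does.

```javascript
const { spawn } = require("child_process");

// Maps a platform name (as reported by process.platform) to a recorder
// command that writes raw 16-bit PCM to stdout. Assumed settings: 16 kHz mono.
function recorderFor(platform) {
  if (platform === "linux") {
    // arecord writes to stdout when no output file is given.
    return {
      command: "arecord",
      args: ["-q", "-t", "raw", "-f", "S16_LE", "-r", "16000", "-c", "1"],
    };
  }
  if (platform === "darwin" || platform === "win32") {
    // sox: "-d" reads the default audio device, the trailing "-" is stdout.
    return {
      command: "sox",
      args: ["-q", "-d", "-t", "raw", "-r", "16000", "-c", "1",
             "-b", "16", "-e", "signed-integer", "-"],
    };
  }
  throw new Error(`No recorder configured for platform: ${platform}`);
}

// Starts the recorder for the current machine and returns the child process.
function startRecording() {
  const { command, args } = recorderFor(process.platform);
  const child = spawn(command, args);
  // Fires if the program is not installed, which is the distribution problem.
  child.on("error", (err) => {
    console.error(`Could not start ${command}: ${err.message}`);
  });
  return child;
}

console.log(recorderFor("linux").command);
```

Calling `startRecording().stdout` would give a stream of audio bytes, but only on machines where the chosen program is already installed.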

This got me thinking about how to make the code more readily accessible to people who try to use our repo. Someone had already created a bare-bones extension that would pull in mic audio, call a Python script through a Node.js child process to parse the mic audio to text, and return the text to the VS Code extension to be printed via console.log(). However, this code was meant to be a proof of concept, and some of the directory paths were hard-coded for that person's computer. I also needed to look at the dependencies being used in the code, look up their documentation, and determine what to install on my machine to get the example code working... this was somewhat inconvenient.

Node.js offers the __dirname variable, which holds the directory of the currently running script. This gives a more machine-universal way of referencing files relative to that script, and I found it helped get the existing code running on my machine. However, it would be ideal if people didn't have to go out and figure out which dependencies were needed and install them manually.
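As a sketch of how this replaces a hard-coded path, the snippet below builds the script location from Node's `__dirname` and runs it in a child process. The folder and file name (`python/speech_to_text.py`), the `python3` command, and the function names are hypothetical. The real proof-of-concept code is laid out differently.

```javascript
const path = require("path");
const { execFile } = require("child_process");

// Builds the path to the Python script relative to this file instead of
// hard-coding one person's home directory. The file name is hypothetical.
function scriptPath(dir = __dirname) {
  return path.join(dir, "python", "speech_to_text.py");
}

// Runs the script and resolves with whatever text it prints to stdout.
// Assumes a `python3` executable is on the PATH.
function transcribe(pythonCommand = "python3") {
  return new Promise((resolve, reject) => {
    execFile(pythonCommand, [scriptPath()], (error, stdout) => {
      if (error) {
        reject(error);
      } else {
        resolve(stdout.trim());
      }
    });
  });
}
```

From the extension, something like `transcribe().then((text) => console.log(text))` would print the recognized words, provided Python and the script's own dependencies are installed.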

To solve this, a bash/PowerShell script could be written to detect the operating system and use its package manager to install the dependencies that are not provided via npm (i.e. sox, arecord, Python, Python's pip package manager, etc.). There seems to be a hacky way to make scripts work on both Windows and Unix machines, as well as a way to detect the operating system in bash scripts. However, this would become a fairly lengthy script to write, and a potentially brittle one as package repositories change, and some Linux distros may not be adequately supported by it.
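To show why such a script gets brittle, here is the same idea sketched in Node rather than bash: a lookup from platform to an install command. The function name `installHintFor` is made up, the Linux entry only covers Debian/Ubuntu-style systems with apt-get, and Windows is deliberately left out, so treat it as an illustration of the problem rather than a working installer.

```javascript
// Returns a suggested install command for the non-npm dependencies,
// or null when this sketch has no suggestion for the platform.
function installHintFor(platform) {
  switch (platform) {
    case "darwin":
      // Assumes Homebrew is already installed.
      return "brew install sox";
    case "linux":
      // Assumes a Debian/Ubuntu-style distro; alsa-utils provides arecord.
      return "sudo apt-get install -y alsa-utils python3 python3-pip";
    default:
      // Windows and everything else: not handled in this sketch.
      return null;
  }
}

const hint = installHintFor(process.platform);
console.log(hint === null ? "Please install the dependencies manually." : hint);
```

Every additional distro or package manager means another branch here, which is exactly the maintenance burden described above.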

An alternative would be to use something like Docker to containerize the speech-to-text functionality, and then talk to that container (perhaps via a child process) from the VS Code extension. The container could take in the host's mic audio, then run the Python script that already exists in our repo's mic-testing branch.
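One way the child-process approach might look is sketched below: the extension spawns `docker run` and pipes audio into the container's stdin. This assumes the container's entrypoint reads raw audio from stdin and prints the recognized text to stdout, which is not something our repo does yet. The image name `speech-to-text:latest` and both function names are hypothetical.

```javascript
const { spawn } = require("child_process");

// Arguments for `docker run`.
// --rm removes the container when it exits; -i keeps stdin open for audio.
function dockerRunArgs(image) {
  return ["run", "--rm", "-i", image];
}

// Pipes a readable stream of audio into the container and resolves with
// the text the container prints. The default image name is hypothetical.
function transcribeInContainer(audioStream, image = "speech-to-text:latest") {
  return new Promise((resolve, reject) => {
    const container = spawn("docker", dockerRunArgs(image));
    let text = "";
    container.stdout.on("data", (chunk) => {
      text += chunk;
    });
    // Fires if Docker itself is not installed on the host.
    container.on("error", reject);
    container.on("close", (code) => {
      if (code === 0) {
        resolve(text.trim());
      } else {
        reject(new Error(`docker exited with code ${code}`));
      }
    });
    audioStream.pipe(container.stdin);
  });
}
```

With this design the microphone is still captured on the host, so only Python and its packages move into the image. Docker becomes the one dependency contributors need to install.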

This would compartmentalize some of the dependencies in a Docker image, so that repo contributors wouldn't have to install Python or the required SpeechRecognition package. However, it might also require that some of the Node.js functionality be ported from the extension's code to the Docker container. I found a good template container to build on here. It might work after adding some COPY directives to the Dockerfile, so that the Python scripts and any other relevant files from the repo are copied in when the container builds.
