Building a Speech Recognition and a Speech-to-Text Tool

In 2010, I was sitting in my room watching TV when I saw an Xbox commercial. Kinect, a line of motion-sensing devices by Microsoft, was released, and with it, a new way to play games. I was fascinated by the features of this new technology, and an idea came to my mind: What if we leveraged Kinect features outside of the console?

There were so many options and so much to create. Kinect could be connected to robots to perform medical surgeries or deliver packages on the battlefield. It could be integrated with your house to make it what we now call “intelligent.” It could also be integrated with hardware to accept voice commands, so that people with disabilities could operate computers without a keyboard and take jobs that require handling specific machines. So many pictures came to my mind, as if I were watching a futuristic movie. These ideas would later become reality: many people reading this article probably already have an Alexa at home or use augmented reality on their cellphones. You might also have used image recognition to simplify your daily tasks. Back then, I was able to materialize these concepts and ideas with Kinect.

The Process

I have a methodical design thinking process that is personal and works for me, which I have divided into four stages. I like to work like this because it allows me to imagine and consider all scenarios before starting, so that I can set goals without any restrictions. At this moment, my imagination is the limit.

Then, I take my time to read and investigate as much as possible so that the implications of my ideas are clear in terms of costs, scope, time, and effort. In the end, just as we do when we use the Scrum framework, I can define my Minimum Viable Product (MVP) with this process. After this, I can start building the MVP; by then, I am no longer working with vague ideas.

Once the MVP is finished, I can add more features to it. I am always mindful of defining clear and achievable short-term objectives.

1. Dream Stage

When something catches my attention, it won’t leave my mind, so I decided to make my ideas come true. But what did I need to materialize such a challenging idea? In 2010, the technology was very new, and the company was not going to deliver an SDK (software development kit) for developers anytime soon. When a technology is that new, how it works is unknown to everyone except the engineers who created it; neither manuals nor other information sources are available.

This prompted me to create things from scratch and rely on the personal process I follow before jumping into something like this. I usually take a couple of days or a week to organize my thoughts, but I don’t take notes. I just wander from one idea to another, trying to imagine as many scenarios as possible to determine what I need and what could go wrong.

2. Research Stage

Microsoft released no SDKs to manipulate Kinect at that time, so I delved deep into the web to find some people who had already pulled apart the hardware to obtain the DLLs (dynamic-link libraries) that make the magic happen. I finally found them in a Russian web forum. The following steps were relatively easy from then on. Once you have the libraries to work with, it is just a matter of reading their content. As I was doing that, I bought three Kinect sensors to disassemble and understand their hardware capabilities.

3. Creation Stage

At this point, I had everything I needed to start. This is my favorite part because you just dive deep into the project and start a long-term relationship with it. This is the moment you become a creator. While coding, I realized that the DLLs only interacted with the hardware; the part that would make Kinect listen to and understand the user was still missing. I figured out that I could probably use the dictionary that comes with Windows to translate spoken words into text, and just like that, my project started to understand me when I spoke.

This step was necessary because the Kinect DLLs only contained the functions to capture audio. On their own, they could not determine whether the speaker was talking in English or another language, or identify the words being pronounced. By adding a Windows dictionary, just as we do when we set a language on our computer, you tell the system which language to work with. Most importantly, you also provide a pool of words to compare against the received audio. That is how my project started to “understand” me when I spoke.
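The idea of comparing what was heard against a pool of known words can be pictured with a small sketch. This is not the Kinect or Windows speech API and not the code I wrote back then; it is a toy illustration in Python that assumes the audio has already been turned into rough text tokens. The names WORD_POOL and match_words are hypothetical, and the fuzzy matching comes from Python's standard difflib module.

```python
from difflib import get_close_matches

# Hypothetical word pool. In the project, this role was played by the
# Windows dictionary for the chosen language.
WORD_POOL = ["open", "close", "word", "scroll", "click", "write"]


def match_words(heard, pool=WORD_POOL, cutoff=0.6):
    """Map each rough token to the closest word in the pool.

    Tokens with no sufficiently similar word in the pool become None,
    which is how words outside the vocabulary get rejected.
    """
    result = []
    for token in heard:
        candidates = get_close_matches(token.lower(), pool, n=1, cutoff=cutoff)
        result.append(candidates[0] if candidates else None)
    return result


if __name__ == "__main__":
    # "opn" and "wurd" stand in for imperfectly heard words.
    print(match_words(["opn", "wurd", "xyz"]))
```

Restricting recognition to a known pool is what keeps the problem manageable: the system never has to guess an arbitrary word, only pick the best candidate from a list or reject the input.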

I integrated different third-party software and hardware using the Kinect sensors and their libraries. For example, I made it possible to navigate through any non-Windows program or to write inside text boxes to fill out a form without using the mouse or keyboard. In Microsoft Word, I could control the cursor by waving my hands without touching the mouse, and write on the page by dictating aloud without a keyboard. I could also build a Lego electric car and drive it without physical contact, just by moving my hands in front of the camera sensors to indicate which direction it should go. The dream had finally come true.
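To give a feel for the Lego car control described above, here is a minimal sketch of how a hand position could be turned into a drive command. It is only an illustration under assumptions: the function name direction_from_hand, the normalized offsets, and the dead zone value are all hypothetical and do not come from the Kinect libraries.

```python
def direction_from_hand(dx, dy, dead_zone=0.1):
    """Turn a hand offset from a neutral point into a drive command.

    dx, dy are assumed to be normalized offsets in the range -1..1,
    with +x to the right and +y upward. Small movements inside the
    dead zone are ignored so the car does not react to hand jitter.
    """
    if abs(dx) < dead_zone and abs(dy) < dead_zone:
        return "stop"
    # The dominant axis decides the command.
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "forward" if dy > 0 else "backward"


if __name__ == "__main__":
    for offset in [(0.0, 0.0), (0.5, 0.1), (0.1, 0.8)]:
        print(offset, direction_from_hand(*offset))
```

The dead zone is the important design choice here: without it, a hand held roughly still in front of the sensor would keep sending tiny, contradictory commands.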

4. Perfection Stage

Finally, it was time to enhance my project by adding some features. While analyzing the Kinect hardware, I discovered a branch of engineering I had not known about: digital image analysis, which works with images.

I discovered we could use two types of cameras from Kinect to detect the depth of the body and even of the hands. Together they let you measure proximity to the sensor, so you have more to work with than a flat x and y position, and they can also detect facial gestures and hand positions that can be integrated into various systems in different ways.
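Proximity detection can be sketched in a few lines. The example below assumes a depth frame is available as rows of distances in millimetres, with 0 meaning "no reading"; that layout, the units, and the name nearest_point are assumptions made for illustration, not the format the Kinect libraries actually return.

```python
def nearest_point(depth_frame, min_valid_mm=1):
    """Return (row, col, depth_mm) of the closest valid pixel, or None.

    depth_frame is assumed to be a list of rows of distances in
    millimetres. Values below min_valid_mm are treated as missing.
    """
    best = None
    for r, row in enumerate(depth_frame):
        for c, depth in enumerate(row):
            if depth >= min_valid_mm and (best is None or depth < best[2]):
                best = (r, c, depth)
    return best


if __name__ == "__main__":
    frame = [
        [0, 1200, 1500],
        [900, 0, 2000],
    ]
    print(nearest_point(frame))
```

A hand stretched toward the sensor is often the closest object in the scene, so the nearest valid pixel is a crude but useful first guess at where the hand is.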

Soon after, I could perform basic sentiment analysis without AI training, focusing on facial gestures. Of course, my sentiment analysis from back then seems rather simple compared with what exists today, when there is an Artificial Intelligence specialization dedicated exclusively to improving and upgrading sentiment analysis algorithms. As for other features, I successfully achieved mouse control, opening and closing applications, dictation, and automatic writing in Microsoft Word.

Conclusion

Today we have smaller sensors to work with that allow us to perform the same integrations I made almost a decade ago. Something that amazes me every time I remember this moment in my life is that even though so many years have passed, technology is still working the same way. The sensors have become smaller, and hardware upgrades have improved the quality of environmental stimuli detection, but the backend logic and algorithms remain the same.

For anyone willing to do something that seems unattainable right now, I recommend following my path. Let your imagination flow, and you will find at least one feasible idea. Kick off your journey, and once you have your MVP, take another look at the seemingly unfeasible ideas. You will probably be able to materialize them now.
