Natural Language Processing is a field of computer science, more specifically a field of Artificial Intelligence, that is concerned with developing computers with the ability to perceive, understand and produce human language.
Language analysis has for the most part been a qualitative field, one that relies on human interpreters to find meaning in discourse. Powerful as it may be, this approach has quite a few limitations, the first being that humans carry unconscious biases that distort their understanding of the information.
The other issue, and the one most relevant to us, is the limited speed at which humans can consume data: most adults read only about 200 to 250 words per minute, with college graduates averaging around 300.
To put those numbers in perspective, an average book runs between 90,000 and 100,000 words, which means a regular reader needs about seven hours to finish a regular-sized book. 100,000 words may seem like a lot, but it's actually a very small fraction of the amount of language produced every single day on social media.
Twitter, a social network built on the foundation of 280-character messages, averages 500 million tweets a day. Assuming around 20 words per tweet, we're looking at around 100,000 books' worth of text. And that's just one social media platform.
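The arithmetic behind those figures is worth spelling out. A quick sketch, using the reading speed and volume estimates from the paragraphs above:

```python
# Back-of-the-envelope arithmetic from the figures above.
words_per_minute = 250          # upper end of an average adult's reading speed
book_length = 100_000           # words in a typical book

hours_per_book = book_length / words_per_minute / 60
print(f"{hours_per_book:.1f} hours to read one book")   # ~6.7 hours

tweets_per_day = 500_000_000    # Twitter's reported daily volume
words_per_tweet = 20            # rough assumption, as above
daily_words = tweets_per_day * words_per_tweet

print(f"{daily_words // book_length:,} book-equivalents per day")  # 100,000
```

At a week of reading time per book, a single day of Twitter would take one person roughly two thousand years to read, which is the whole case for processing language by machine.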
Gathering Big Data
Any researcher that sets their sights on social media has to deal with massive amounts of data. Manually gathering and analyzing the data is inefficient at best and a complete waste of time at worst. So what’s the solution?
Gathering data programmatically. Most social media platforms have APIs that allow researchers to access their feeds and grab data samples. And even without an API, web scraping is as old a practice as the internet itself, right?
Web scraping refers to the practice of fetching and extracting information from web pages, either manually or through automated processes (the latter being far more common than the former).
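The extraction half of that practice can be sketched with nothing but the standard library. This is a minimal, hypothetical example: in a real scraper you would first fetch the page (for instance with `urllib.request.urlopen`) and respect the site's robots.txt and terms of service; here a hardcoded snippet stands in for the fetched page.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text chunks from an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False          # ignore <script>/<style> contents

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

# Stand-in for a fetched page.
page = ("<html><body><h1>Post title</h1>"
        "<p>First comment.</p>"
        "<script>var x = 1;</script></body></html>")

parser = TextExtractor()
parser.feed(page)
print(parser.chunks)   # ['Post title', 'First comment.']
```

Dedicated libraries such as Beautiful Soup do the same job with far less ceremony, but the principle is identical: pull the page, discard the markup, keep the language.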
Unfortunately, web scraping sits in a legal gray area. Facebook v. Power Ventures is one of the best-known examples of big tech pushing back against the practice. In this case, Power Ventures created a site that let users aggregate data about themselves from different services, including Facebook, LinkedIn, Twitter, Myspace, and AOL.
One of the biggest challenges when working with social media is having to manage several APIs at the same time, as well as understanding the legal limitations in each country. For example, Australia is fairly lax with regard to web scraping, as long as it's not used to harvest email addresses.
Another challenge is understanding and navigating the tiers of developers’ accounts and APIs. Most services offer free tiers with some rather important limitations, like the size of a query or the amount of information you can gather every month.
For example, in Twitter's case, the Search API sandbox allows for up to 25,000 tweets per month, while a premium account offers up to 5 million. The former is best suited for small-scale projects or proofs of concept, the latter for bigger projects.
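Those monthly caps shape how collection code is written: you paginate through results and stop before exhausting the quota. A hedged sketch of that pattern, where `fetch_page` is a hypothetical stand-in for a real search endpoint:

```python
MONTHLY_CAP = 25_000      # the sandbox-tier limit from the example above

def collect_tweets(fetch_page, cap=MONTHLY_CAP, page_size=100):
    """Pull pages of results until we hit the cap or the feed runs dry."""
    collected = []
    token = None
    while len(collected) < cap:
        page, token = fetch_page(page_size=page_size, next_token=token)
        collected.extend(page[: cap - len(collected)])   # never overshoot
        if token is None:                                # no more pages
            break
    return collected

# Toy stand-in for an API client: serves 350 fake tweets in pages of 100.
def fake_fetch(page_size, next_token):
    start = next_token or 0
    page = [f"tweet {i}" for i in range(start, min(start + page_size, 350))]
    nxt = start + page_size if start + page_size < 350 else None
    return page, nxt

print(len(collect_tweets(fake_fetch, cap=300)))  # 300
```

Real clients (tweepy, for instance) wrap the pagination for you, but the budgeting logic, stopping before the tier's cap, stays your responsibility.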
In other words, anyone interested in gathering information from Social Media should:
- Understand the law regarding data gathering
- Understand how developer accounts and API work for each platform
- Figure out the potential investment based on the scope of their project.
Understanding your Audience
Human nature pushes like-minded individuals toward each other. We’d rather share with people who have the same interests as we do. Social media sites appeal to different demographics, and the interactions in these virtual spaces are shaped both by their behaviors and by the emerging culture.
Natural Language Processing excels at understanding syntax, but semantics and pragmatics are still challenging, to say the least. In other words, a computer might parse a sentence, and even produce sentences that make sense, but it has a hard time understanding the meaning of words, or how language changes depending on context.
That's why computers have such a hard time detecting sarcasm and irony. For the most part, that's a non-issue: on the one hand, the amount of data containing sarcasm is minuscule, and on the other, some very interesting tools can help.
When training machine learning models to interpret language from social media platforms, it's very important to understand these cultural differences. Twitter, for example, has a rather toxic reputation, and for good reason: its users rank it alongside Facebook as one of the most toxic platforms.
It should come as no surprise then, that you’re more likely to find differences of opinion depending on which platform you work with. And in fact, such differences are very important data points.
As a quick example, market researchers need to understand which social media platform appeals to their target audience. It makes little sense to invest time and resources in following trends on networks that will yield little to no valuable information.
More Than Words
The exponential growth of platforms like Instagram and TikTok poses a new challenge for Natural Language Processing. Videos and images as user-generated content are quickly becoming mainstream, which in turn means that our technology needs to adapt.
Face and voice recognition will prove game-changing in the near future, as more and more content creators share their opinions via video. While challenging, this is also a great opportunity for emotion analysis: traditional approaches rely on written language, where the emotion behind the words has always been difficult to assess.
While it's still too early to make an educated guess, if big tech companies keep pushing for a "metaverse", social media will most likely change and adapt into something akin to an MMORPG or a game like Club Penguin or Second Life: a social space where people freely exchange information through their microphones and virtual reality headsets.
Will Meta allow researchers access to these interactions? If the past is any indication, the answer is no, but once again, it’s still too early to tell, and the Metaverse is a long way off.
NLP and Data Science
Faster and more powerful computers have led to a revolution in Natural Language Processing algorithms, but NLP is only one tool in a bigger box. Data scientists have to rely on data gathering, sociological understanding, and just a bit of intuition to make the best of this technology.
It’s an exciting time for Natural Language Processing, and you can bet that in the following years the field is going to keep growing as it has, providing better and more refined tools for understanding how humans communicate.