In the early 2000s, when I started in systems engineering, data science was a fringe term used by a small pool of academics. I remember my professors referring to it as a sort of fad that would eventually go away. And wow, that opinion didn’t age well.
It’s been almost two decades since then, and data science as we know it has become one of the fastest-growing fields in computer science, if not the fastest. The statistics from CALU are impressive, to say the least, with a job growth rate of over 650% since 2014.
Delve into data science long enough and you’ll feel like it’s a divided house. On one side, you’ll see scientists who have been doing data analysis with R since the last century. On the other, you have those who defend Python as the one true savior of our people.
Yes, there is another path: relying on statistical software like MATLAB, Stata, or SPSS. But every data scientist will need to spread their wings sooner or later. Since we are usually working with unstructured data, we need to build custom solutions that stock software is simply not able to provide.
And yes, the obvious answer is to learn both, but that’s only a real answer if you have enough time on your hands to play around with two different coding languages. For the purpose of this discussion, let’s assume that, due to time constraints, our hypothetical newbie data scientist can only pick one.
The answer, as usual, isn’t simple. Let’s get one thing out of the way: both languages are perfectly suitable for data science, as both can cover the basics wonderfully: data manipulation, ad-hoc analysis, and exploration.
So, instead of focusing on the basics, our time is better spent going over what makes them different.
Popularity and job opportunities
If you are already joining a data science team, just focus on whatever tools they use. On the other hand, if you are just starting and want to make a choice based on what’s more popular on the job market, then Python wins hands down. As of 2020, Python is the 3rd most popular programming language according to GitHub (R doesn’t even make the top 20).
As for job outlook, Python wins by a landslide. Python is 1.5 times more likely to appear as a requirement in a job posting. Python users are also more loyal, while at least 10% of surveyed data scientists report a transition to Python. Once we take a look at the trends, it seems like R is quickly becoming a dying language. Except that over half of all data scientists use both in their daily routine. But why?
R’s ecosystem is really powerful
I’ve never been much of a statistics buff. I know enough to analyze data just fine, but once I take a look at R and its packages I feel like an undergrad seeing statistics 101 for the first time. R has a long tradition in academia and among statistics experts. The sheer number of available projects is staggering. As of right now, CRAN, R’s biggest repository, has over 12,000 packages that are being updated.
Need a lavaan test? You got it! Factor analysis? It’s right there in the psych package. Structural equations? Please, at least try to make it a challenge. Keep in mind that Python has a lot of these things implemented as well, but for the really obscure stuff, R still reigns as king.
Since most R developers are academics, most of these packages are specifically designed to solve academic problems. For example, psych, that package I mentioned earlier, is designed for psychologists who work with psychometrics.
R is for you if what you’re looking for includes:
- Functions designed to prepare the data for specific analysis
- Functions ready to run the analysis and interpret the results
- Functions that build custom graphics for those analyses
- All of it backed by documentation based on academic books
Python’s versatility is unbeatable
If R is the old but trusty Mustang your father bought when he was a teenager, then Python is a flamboyant Tesla. Python was originally built with readability in mind so that it could act as a gateway for new programmers, and it shows. The syntax is human-friendly, and even a new programmer can read code written by a Pythonista (an advanced Python developer) and get an overall idea of what it’s doing.
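As a rough illustration of that readability, here is a minimal sketch that uses only the standard library. The sample scores and the function name `average_by_group` are made up for this example; they don't come from any real dataset or library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: (group, score) pairs invented for this example.
scores = [
    ("control", 4.0),
    ("treatment", 6.0),
    ("control", 5.0),
    ("treatment", 8.0),
]


def average_by_group(rows):
    """Return the mean score for each group label."""
    grouped = defaultdict(list)
    for group, score in rows:
        grouped[group].append(score)
    return {group: mean(values) for group, values in grouped.items()}


averages = average_by_group(scores)
for group, average in averages.items():
    print(f"{group}: {average}")
```

Even without knowing what `defaultdict` does, most readers can guess that the code collects scores per group and then averages them.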
Since Python is a multi-purpose language, it’s a lot more versatile than R, which makes it the ideal language for integration with other platforms. Just as an example, a colleague of mine is currently developing a game in Python that registers decision-making data, uploads it to a server where it’s analyzed, and eventually will be automatically uploaded to a webpage so other scientists can take a look at the data.
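That kind of pipeline can be sketched in a few lines. This is a simplified illustration, not my colleague’s actual code: the `Decision` record, its fields, and the sample values are hypothetical, and the upload step is replaced by a plain JSON string so the sketch runs without a server.

```python
import json
from dataclasses import asdict, dataclass
from statistics import mean


@dataclass
class Decision:
    """One choice made by a player (hypothetical fields)."""
    player_id: str
    choice: str
    reaction_ms: int


# 1. The game registers decision-making data.
decisions = [
    Decision("p1", "cooperate", 820),
    Decision("p1", "defect", 640),
    Decision("p2", "cooperate", 910),
]

# 2. The data is serialized for upload. A real game would send this
#    payload to a server; here it simply stays in a string.
payload = json.dumps([asdict(d) for d in decisions])

# 3. The receiving side decodes the payload and analyzes it.
received = json.loads(payload)
cooperation_rate = mean(
    [1 if record["choice"] == "cooperate" else 0 for record in received]
)
mean_reaction_ms = mean([record["reaction_ms"] for record in received])

print(f"cooperation rate: {cooperation_rate:.2f}")
print(f"mean reaction time: {mean_reaction_ms} ms")
```

The point is less the analysis itself than the fact that recording, serializing, and analyzing all happen in the same language.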
A few years ago most data scientists would have preferred R since it had a more robust set of tools for machine learning. But that’s no longer the case. Python nowadays equals (and sometimes surpasses) R as the best language for artificial intelligence.
Python is growing at a tremendous pace, and the main reason is that the developer space (at least for data science) is shared between programmers and scientists, so you have people with the theoretical know-how working hand in hand with people with the technical know-how.
Some critics believe that it’s harder to get into Python since the ecosystem is so big. In my experience, as a data scientist, you only need to engage with a very small part of Python’s tools and then reach further as you explore new possibilities. As a new data scientist, you only need five libraries: NumPy, scikit-learn, pandas, SciPy, and seaborn.
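To give a feel for how little of that ecosystem you need at first, here is a minimal sketch that touches only two of the five libraries, NumPy and pandas. It assumes both are installed, and the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements invented for this example.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "height_cm": [170.0, 180.0, 160.0, 164.0],
    }
)

# NumPy: vectorized math on a whole column at once (here, a z-score).
heights = df["height_cm"].to_numpy()
df["height_z"] = (heights - np.mean(heights)) / np.std(heights)

# pandas: split the rows by group and summarize each group.
summary = df.groupby("group")["height_cm"].agg(["mean", "max"])

print(df)
print(summary)
```

The other three libraries build on the same objects: scikit-learn and SciPy accept NumPy arrays, and seaborn plots directly from a pandas DataFrame.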
The best tool for you
A programmer who is looking to delve into data science should start with Python and then dabble in R when the need arises, while an academic might feel more at home starting with R and then moving on to Python as they take on bigger projects.
Most of us simply use one or the other depending on what we feel comfortable with and what each language can provide to tackle a problem. Python users can call R functions through bridge packages such as rpy2, and R users can do the reverse with reticulate.
It’s not a question of which one will eventually win over the other, but rather of which language you should focus your time and effort on first. In my humble opinion, with the massive expansion we are seeing with Python, I find very little reason to tell someone to start with R. Even so, any data scientist worth a dime should have a good understanding of both languages.