Hadoop

For massive computation, you need the right tools.

Is your company in the business of crunching numbers? Do you need to run computations on massive amounts of data? If so, chances are you’ll need distributed computing to handle Big Data workloads. How do you manage such a feat? With the help of Hadoop.

Apache Hadoop is a collection of open-source tools that makes it possible to cluster numerous computers together to solve big problems. Hadoop can store enormous amounts of nearly any kind of data and provides massive processing power for a virtually limitless number of concurrent jobs.

Hadoop grew out of work by Doug Cutting and Mike Cafarella, who began the Apache Nutch web-crawler project in 2002. Very soon after that project began, Cutting and Cafarella concluded that running it at web scale would cost nearly half a million dollars in hardware alone, with a monthly running cost of around $30,000.

To lower the cost, the pair turned to Google’s published designs for the Google File System and MapReduce. Even so, Cutting realized that Nutch was limited to clusters of 20-40 nodes, which meant they couldn’t achieve their goal with only two people working on the project. Soon after that realization (and after Cutting joined Yahoo!), a new project called Hadoop was formed, created to expand Nutch’s ability to scale to thousands of nodes.

In 2007, Yahoo! successfully ran Hadoop on a 1,000-node cluster. Hadoop was then released to the Apache Software Foundation as an open-source project. That same year, Apache successfully tested Hadoop on a 4,000-node cluster.

Cutting and Cafarella’s goal had been achieved, and Hadoop could scale enough to handle massive data computation. In December of 2011, Apache released version 1.0 of Hadoop.

What are the pieces that make up Hadoop?

Hadoop is a framework composed of three core components:

  • Hadoop HDFS – The Hadoop Distributed File System (HDFS) serves as the storage unit, which provides not only distributed storage but also data security and fault tolerance.
  • Hadoop MapReduce – Hadoop MapReduce serves as the processing unit. Processing is handled on the cluster nodes, and the results are sent back to the cluster master.
  • Hadoop YARN – Hadoop YARN (Yet Another Resource Negotiator) serves as the resource management unit and performs job scheduling.
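To make the MapReduce idea concrete, here is a toy sketch of the programming model in plain Python. This is an illustration only, not Hadoop’s actual Java API: each input string stands in for a data block processed on a separate node, and the map, shuffle, and reduce steps mirror what the framework does across the cluster.

```python
from collections import defaultdict

def map_phase(document):
    # Map: each node emits (key, value) pairs from its slice of the data.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key before reduction.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the values for each key into a final result.
    return {key: sum(values) for key, values in grouped.items()}

# Each string stands in for a block of input handled by a different node.
splits = ["big data needs big tools", "hadoop handles big data"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

In real Hadoop, the map tasks run in parallel on the nodes that already hold the data, and only the much smaller intermediate results move across the network.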

Those components come together to make distributed storage and processing far more efficient. To help you understand this, let’s use an analogy.

A small business sells coffee beans. At first, they only sell one type of coffee beans. They store their coffee beans in a single storage room connected to their building and everything goes smoothly. Eventually, however, customers start asking for different types of coffee beans, so the company decides it’s in their best interest to expand. 

To save money, the company stores all the beans in the same room but hires more employees to handle the new demand. As demand continues to grow, supply has to match, so the storage room becomes problematic. To compensate for this, the business hires even more workers but soon realizes the problem is the bottleneck in the storage room. 

It finally dawns on the company that they need separate storage rooms for each type of bean, with separate employees assigned to manage the different rooms. With this new delivery pipeline in place, the business not only runs more smoothly but can handle the continued growth in demand.

That arrangement is similar to how distributed data storage works, but instead of separate storage rooms, we have multiple cluster nodes to store the data.

This is how Hadoop helps Big Data overcome the ever-growing needs for:

  • Volume - the amount of data
  • Velocity - the speed at which data is generated
  • Variety - the different types of data
  • Veracity - the ability to trust in the data

Why Hadoop over a traditional database?

You’re probably asking yourself, “Why use Hadoop when a traditional database has served my company just fine?” That’s a good question. The answer all boils down to how far you want your business to scale. 


You might have the best Java, JavaScript, .NET, Python, PHP, and cloud developers money can buy. But if your database solution isn’t capable of handling massive amounts of data, there’s no way those developers (no matter how talented they are) can work around the limitations of standard databases, such as:

  • Data storage size limitations - Hadoop makes it possible to store huge amounts of any type of data.
  • Computing power - Hadoop’s distributed computing model means your company can manage ever-larger amounts of data simply by adding nodes to the cluster.
  • Fault tolerance - Hadoop protects your cluster against hardware failure. If one node goes down, jobs are automatically redirected to other nodes, and data blocks are replicated so nothing is lost.
  • Flexibility - Unlike a traditional database, Hadoop doesn’t require you to pre-process data before storing it. You can store data in any form and decide how to use it later.
  • Cost-effectiveness - Hadoop is not only free but also makes it possible to scale your business well beyond traditional databases without the added cost of expensive specialized hardware.
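The fault-tolerance point above can be sketched as a toy simulation in plain Python. This is not HDFS code; the round-robin placement and node names are invented for illustration. The one real detail is the replication factor: HDFS copies each block to three nodes by default, so losing a single node never makes a block unreadable.

```python
import itertools

REPLICATION = 3  # HDFS's default replication factor

def place_blocks(blocks, nodes):
    # Assign each block to REPLICATION distinct nodes, round-robin.
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for block in blocks:
        placement[block] = {next(node_cycle) for _ in range(REPLICATION)}
    return placement

def readable_blocks(placement, failed_node):
    # Blocks that still have at least one live replica after a node fails.
    return [b for b, replicas in placement.items() if replicas - {failed_node}]

blocks = ["block-1", "block-2", "block-3", "block-4"]
nodes = ["node-A", "node-B", "node-C", "node-D", "node-E"]
placement = place_blocks(blocks, nodes)

# Simulate losing node-A: every block is still readable somewhere.
print(len(readable_blocks(placement, "node-A")))  # 4
```

Real HDFS placement is smarter than round-robin (it is rack-aware, spreading replicas across racks), but the principle is the same: redundancy, not hardware reliability, is what keeps the data safe.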

The caveat of using Hadoop

The biggest problem your company will face is having the in-house skills to deploy and manage a Hadoop system. Although Hadoop uses Java as a primary language (especially with MapReduce), the skills required go well beyond the basics. Fortunately, there are plenty of nearshore and offshore development hiring firms that offer the necessary talent to implement Hadoop for your company.

Conclusion

So if you’re looking to compete with the biggest companies on the market, you should seriously consider a platform like Hadoop to help you meet and exceed the data computation needs of your current and future growth.
