When it comes to Big Data, a NoSQL database is usually the right choice. Why? Because NoSQL databases are geared toward rapid processing of massive data stores and varied, unstructured data. If you attempt to use a traditional relational database for Big Data, you will often find it struggles with that scale and variety.
Now that you know which type of database to use, which actual database should you select for your project? When you dig into the answer, you’ll find there are quite a few NoSQL databases that are up to the task: MongoDB, RavenDB, Redis, Couchbase, IBM Cloudant, and Amazon DynamoDB.
There are also two others, both of which are maintained by the Apache Software Foundation: HBase and Cassandra. These NoSQL databases look very similar at first blush, but when you look a bit closer, you’ll find they are quite different. With that said, let’s take a look at Cassandra vs. HBase to see which might be the best fit for your company.
What is HBase?
Apache HBase is an open-source, NoSQL, distributed database for big data stores. This NoSQL database enables random, strongly consistent, real-time access to massive amounts of data (petabytes).
HBase is column-oriented, which means data is grouped into column families and indexed by unique row keys. Data and queries are distributed across the cluster of servers, which makes for very fast retrieval of results (often on the order of milliseconds). This rapid retrieval of both rows and columns helps make it a viable option for very large data stores.
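To make the row-key idea concrete, here is a minimal in-memory sketch of a table indexed by sorted row keys, with each value stored under a "family:qualifier" column name. It is only an illustration of the data model: the class TinyWideColumnTable and its methods are hypothetical names, not part of the HBase API, and nothing here is distributed or persisted.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class TinyWideColumnTable:
    """Toy table: row key -> {"family:qualifier": value}. Hypothetical, not the HBase API."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> columns for that row
        self._sorted_keys = []          # row keys kept in sorted order for range scans

    def put(self, row_key, column, value):
        # Register the row key once, keeping the key list sorted.
        if row_key not in self._rows:
            insort(self._sorted_keys, row_key)
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        # Return one cell, or a copy of the whole row when no column is given.
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key."""
        i = bisect_left(self._sorted_keys, start_key)
        while i < len(self._sorted_keys) and self._sorted_keys[i] < stop_key:
            key = self._sorted_keys[i]
            yield key, dict(self._rows[key])
            i += 1


table = TinyWideColumnTable()
table.put("user#001", "info:name", "Ada")
table.put("user#001", "info:city", "London")
table.put("user#002", "info:name", "Linus")
```

Because the row keys stay sorted, a lookup by key and a scan over a key range are both cheap, which is the property that makes row-key design so important in this kind of store.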
HBase is used to store non-relational data, which is accessed via the HBase API. To make HBase a bit more accessible to administrators, it’s often used in conjunction with Apache Phoenix as an SQL layer. By combining HBase and Phoenix, it’s then possible to use standard SQL query syntax for the insertion, deletion, and querying of data.
HBase is scalable, fast, and fault-tolerant.
Components of HBase
HBase consists of the following components:
What is Cassandra?
Apache Cassandra is another open-source, NoSQL, distributed database used for massive stores of data. Unlike some NoSQL distributed databases, Cassandra has a “masterless” architecture (all nodes provide the same functionality within the cluster) and, when data is replicated across data centers, can withstand a data center outage without data loss, even across public or private clouds.
Cassandra is prized for its scalability, high availability, and performance. Apache Cassandra can be deployed on either commodity hardware or a cloud infrastructure, making it a strong option for mission-critical data. Cassandra is one of the most performant NoSQL databases on the market, so if your project or business needs a database geared toward speed, this might be the perfect option.
Components of Cassandra
Cassandra consists of the following components:
- Replication factor
- Commit Log
What’s the Difference Between HBase and Cassandra?
Let’s take a look at two very important aspects of a database, read and write performance, where the differences can be rather glaring.
Start with reads. With HBase, reads and writes for a given row are handled by a single region server, so a read goes to exactly one place. Cassandra, on the other hand, may read from multiple replicas that hold different versions of the data. HBase also stores its data in the Hadoop Distributed File System (HDFS) and uses Bloom filters and block caches, which equates to considerably faster read performance. With Cassandra, the database must first work out which partition holds the data in question.
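A Bloom filter speeds up reads by answering "definitely not here" or "possibly here" before any storage file is opened. The sketch below is a generic illustration of the technique, not HBase's actual implementation; the class name, the default sizes, and the use of SHA-256 as the hash function are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (hypothetical sizes, not HBase's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True means it may have been.
        return all(self.bits[pos] for pos in self._positions(key))


row_filter = BloomFilter()
row_filter.add("row-42")
```

A read path would call might_contain before touching a file and skip the file entirely on False. A True answer can be a false positive, so the file still has to be checked in that case.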
Writes are where the tables are turned. Cassandra writes to a log and a cache simultaneously, while HBase performs those two steps one after the other. Cassandra also uses consistent hashing for both data partitioning and distribution, which helps to speed up writes. With HBase, a client must first ask ZooKeeper where the metadata table is stored. The client then asks the server housing the metadata to provide an address for the table region where the write will happen. This lookup means writes in HBase require more overhead than in Cassandra, which can make them slower.
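Consistent hashing lets any node work out where a key lives without asking a coordinator. The following is a minimal sketch of the general technique, assuming a SHA-256 based token and a small number of virtual nodes per server. The HashRing class and the node names are hypothetical, and Cassandra's real partitioner and token allocation differ in detail.

```python
import bisect
import hashlib


def _token(value):
    # Map a string to a position on the ring (assumption: SHA-256, first 8 bytes).
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring sketch (hypothetical, not Cassandra's partitioner)."""

    def __init__(self, nodes, vnodes=8):
        # Each node owns several tokens so that keys spread more evenly.
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, key):
        # Walk clockwise to the first token past the key, wrapping around at the end.
        idx = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user#001")
```

The useful property is that removing a node only reassigns the keys that node owned. Every other key keeps its owner, so the cluster can grow or shrink without reshuffling all of the data.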
In HBase, the average latency decreases as more random reads and updates are performed. In Cassandra, latency increases proportionally as I/O operations increase. However, there is a decrease in latency after 10,000 read and write operations.
As far as throughput is concerned, HBase is fairly consistent, as it can handle between 100,000 and 200,000 operations, but an increase can occur at 250,000+ operations. On the other hand, Cassandra’s throughput rises steadily as the number of reads and writes increases.
Average read latency is generally higher in HBase, but it doesn’t vary to a noticeable degree as the number of read operations increases.
Which is Right For You?
Let’s make this choice fairly simple by looking at it through the lens of fault tolerance. With HBase, the cluster depends on a master node, so the database can go down should the master fail and no standby master be ready to take over. With Cassandra, on the other hand, if a node goes down the database will still be available. However, because of the masterless architecture of Cassandra, replicas can temporarily disagree, so data inconsistencies can occur.
So, if your primary focus is on data consistency, go with HBase. If your focus is on high availability, go with Cassandra.
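The consistency side of that trade-off can be put into simple arithmetic. In a replicated, masterless store, a read is guaranteed to see the latest acknowledged write only when the replicas that acknowledged the write and the replicas consulted by the read must overlap, which is the case when write acknowledgements plus read acknowledgements exceed the replication factor. The sketch below illustrates that rule; the function names are hypothetical and it does not call any Cassandra API.

```python
def quorum(replication_factor):
    # Smallest majority of the replicas, e.g. 2 out of 3.
    return replication_factor // 2 + 1


def replicas_overlap(replication_factor, write_acks, read_acks):
    # True when every read set must share at least one replica with every write set.
    return write_acks + read_acks > replication_factor


rf = 3
strong = replicas_overlap(rf, quorum(rf), quorum(rf))  # quorum writes, quorum reads
fast = replicas_overlap(rf, 1, 1)                      # one replica each way
```

With a replication factor of 3, quorum reads and writes overlap and so behave consistently, while single-replica reads and writes do not. The latter is faster and more available, and that is the setting in which the inconsistencies mentioned above can appear.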