What is HBase?
HBase is an open-source, sorted-map data store built on top of Hadoop. It is column-oriented and horizontally scalable, and is modeled after Google's Bigtable. It is a set of tables that keep data in key-value format. HBase is well suited to the large, distributed data sets that are common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is part of the Hadoop ecosystem and gives random, real-time read/write access to data stored in the Hadoop file system.
Why HBase?
· An RDBMS gets exponentially slower as the data becomes massive
· An RDBMS expects data to be highly structured, i.e., able to fit a well-defined schema
· Any modification to the schema may require downtime
· For sparse datasets, maintaining NULL values adds too much overhead
Features of HBase
· Horizontally scalable: capacity grows by adding nodes to the cluster, and you can add any number of columns at any time.
· Automatic failover: a capability that lets the system automatically switch data handling to a standby server in the event of a failure.
· Integration with the MapReduce framework: HBase commands and Java client code can internally use MapReduce to do the work, and HBase is built over the Hadoop Distributed File System (HDFS).
· It is a sparse, distributed, persistent, multidimensional sorted map indexed by row key, column key, and timestamp.
· It is often described as a key-value store, a column-family-oriented database, or a store of versioned maps of maps.
· Fundamentally, it is a platform for storing and retrieving data with random access.
· It does not care about data types (you can store an integer in one row and a string in another for the same column).
· It does not enforce relationships within your data.
· It is designed to run on a cluster of computers built from commodity hardware.
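The "sparse, multidimensional sorted map" above can be sketched in plain Python. This is a conceptual model only, not the HBase API; all names (`put`, `get`, `scan`, the `table` dict) are illustrative:

```python
# Conceptual model of HBase's data structure: a sorted map indexed by
# row key, then column key (family:qualifier), then timestamp.
table = {}  # rowkey -> {(family, qualifier) -> {timestamp -> value}}

def put(row, family, qualifier, ts, value):
    """Store one versioned cell."""
    table.setdefault(row, {}).setdefault((family, qualifier), {})[ts] = value

def get(row, family, qualifier):
    """Return the newest version of a cell, or None if absent (sparse)."""
    versions = table.get(row, {}).get((family, qualifier))
    if not versions:
        return None
    return versions[max(versions)]

def scan():
    """Yield cells sorted by row key, then column key, then newest timestamp
    first -- the order HBase keeps cells in."""
    for row in sorted(table):
        for col in sorted(table[row]):
            for ts in sorted(table[row][col], reverse=True):
                yield row, col, ts, table[row][col][ts]

put("row1", "info", "name", 1, "alice")
put("row1", "info", "name", 2, "alicia")   # a newer version of the same cell
put("row2", "info", "age", 1, 30)          # sparse: row2 stores no "name" cell
```

Note how sparseness costs nothing here: a cell that was never written simply has no entry, which is why HBase avoids the NULL-value overhead mentioned earlier.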
HBase Read
A read against HBase is reconciled between the HFiles, the MemStore, and the BlockCache. The BlockCache is designed to keep frequently accessed data from the HFiles in memory, in order to avoid disk reads. Every column family has its own BlockCache. The BlockCache holds data in 'blocks', the unit of data that HBase reads from disk in a single pass. An HFile is physically laid out as a sequence of blocks plus an index over those blocks. This means reading a block from HBase requires only looking up that block's location in the index and retrieving it from disk.
Block: the smallest indexed unit of data and the smallest unit of data that can be read from disk. The default size is 64 KB.
Scenario where a smaller block size is preferred: frequent random lookups. Having smaller blocks creates a larger index and thereby consumes more memory.
Scenario where a larger block size is preferred: frequent sequential scans. Larger blocks mean fewer index entries and therefore a smaller index, which saves memory.
Reading a row from HBase requires first checking the MemStore, then the BlockCache; finally, the HFiles on disk are accessed.
HBase Write
When a write is made, by default it goes to two places:
· the write-ahead log (WAL), also called the HLog, and
· the in-memory write buffer, the MemStore.
Clients do not interact directly with the underlying HFiles during writes; instead, writes go to the WAL and the MemStore in parallel. Each write to HBase requires confirmation from both the WAL and the MemStore.
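This dual write can be sketched in a few lines. The names (`wal`, `memstore`, `write`) are illustrative, not HBase APIs; the point is only that the acknowledgement comes after both destinations succeed:

```python
# Illustrative write path: every edit is recorded in the write-ahead log
# and buffered in the MemStore before the client is acknowledged.
wal = []        # append-only log, persisted on durable storage in real HBase
memstore = {}   # in-memory write buffer

def write(rowkey, value):
    wal.append((rowkey, value))   # 1. record the change durably (WAL/HLog)
    memstore[rowkey] = value      # 2. buffer it in memory (MemStore)
    return True                   # 3. acknowledge only after both succeed
```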
HBase MemStore
· The MemStore is a write buffer where HBase accumulates data in memory before a permanent write.
· Its contents are flushed to disk to form an HFile when the MemStore fills up.
· It does not append to an existing HFile; instead, a new file is created on each flush.
· The HFile is the underlying storage format for HBase.
· HFiles belong to a column family (there is one MemStore per column family). A column family can have multiple HFiles, but the reverse is not true.
· The size of the MemStore is defined in hbase-site.xml by the property hbase.hregion.memstore.flush.size.
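The flush behaviour can be sketched as below. This is a toy model, not HBase code: the threshold stands in for hbase.hregion.memstore.flush.size (which is a byte size, not an entry count), and each flush produces a brand-new sorted "file" rather than appending to an existing one:

```python
# Illustrative MemStore flush: when the buffer reaches its threshold, its
# contents are written out as a NEW immutable HFile and the buffer is reset.
FLUSH_SIZE = 3     # stand-in for hbase.hregion.memstore.flush.size
memstore = {}
hfiles = []        # each entry represents one immutable HFile on disk

def put(rowkey, value):
    memstore[rowkey] = value
    if len(memstore) >= FLUSH_SIZE:
        flush()

def flush():
    global memstore
    hfiles.append(dict(sorted(memstore.items())))  # new sorted, immutable file
    memstore = {}                                  # start a fresh, empty buffer

for i in range(7):
    put(f"row{i}", i)
```

After seven puts there are two HFiles of three rows each plus one row still buffered in the MemStore, matching the rule that one column family accumulates multiple HFiles over time.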
What happens when the server hosting a MemStore that has not yet been flushed crashes?
Every server in an HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write is not considered successful until the new WAL entry is successfully written; this guarantees durability. If the server crashes before the MemStore is flushed, the edits can be recovered by replaying the WAL.
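The recovery scenario above can be sketched as a WAL replay. This is an illustrative model, not HBase's actual log-splitting mechanism; the names are hypothetical:

```python
# Illustrative crash recovery: edits lost with the in-memory MemStore are
# rebuilt by replaying the write-ahead log, which is why the WAL entry must
# be written before the client is acknowledged.
wal = []
memstore = {}

def write(rowkey, value):
    wal.append((rowkey, value))   # durable log entry first
    memstore[rowkey] = value      # then the in-memory buffer

write("row1", "a")
write("row2", "b")

memstore.clear()                  # simulate the server crashing before a flush

recovered = {}
for rowkey, value in wal:         # replay the log on the replacement server
    recovered[rowkey] = value
```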
RDBMS vs HBase
The differences between an RDBMS and HBase are given below.
· A schema/database in an RDBMS can be compared to a namespace in HBase.
· A table in an RDBMS can be compared to a column family in HBase.
· A record (after table joins) in an RDBMS can be compared to a record in HBase.
· A collection of tables in an RDBMS can be compared to a table in HBase.
Author
TIB Academy is a well-known provider of Hadoop training in Bangalore. Join us and build your career in Big Data. TIB Academy provides Big Data training in Bangalore, with trainers who have 10+ years of experience and well-equipped classrooms. Demo classes are conducted every week.
For Demo Classes Call: 9513332301