Hadoop is designed to address both the storage and the processing of huge volumes of data. The Hadoop Distributed File System (HDFS) manages the storage part of the Hadoop framework and is designed to run on commodity hardware.
HDFS is a distributed, scalable, and highly fault-tolerant file system written in Java for the Hadoop framework. Generally, a group of DataNodes in a cluster forms HDFS.
Commodity Hardware: HDFS is designed to work well with commodity hardware. Because Hadoop handles huge volumes of data, keeping hardware costs manageable calls for commodity hardware rather than high-end machines with huge storage and computing power.
Batch Processing: Hadoop is designed to do batch processing rather than real-time or interactive usage.
Large Data Sets: HDFS is tuned for working with large datasets rather than small ones. A large dataset can range from a few terabytes to petabytes.
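To get a feel for the scale involved, HDFS splits each file into fixed-size blocks. The sketch below (plain Python, not Hadoop code) computes how many blocks a large file occupies, assuming the common default block size of 128 MB:

```python
# Illustrative sketch (not Hadoop code): how a large file maps to HDFS blocks.
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size

def block_count(file_size_bytes):
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

one_tb = 1024 ** 4
print(block_count(one_tb))  # a 1 TB file occupies 8192 blocks
```

Even a single 1 TB file already means thousands of blocks for the NameNode to track, which is why HDFS favors a small number of large files over many small ones.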
HDFS mainly consists of three components:
- NameNode
- DataNode
- Secondary NameNode
- NameNode is also known as the Master Node.
- It manages HDFS storage.
- NameNode keeps the metadata of HDFS, such as block locations, the directory structure, etc.
- NameNode is not meant to store any data apart from this metadata.
- NameNode is a single point of failure for HDFS. If NameNode goes down, the entire file system is unavailable for use.
- Generally, the NameNode machine has a high-end configuration, with large RAM and many cores.
- NameNode continuously monitors the health of DataNodes through heartbeats.
- To ensure high availability, we can run an active NameNode along with a standby NameNode.
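The heartbeat monitoring mentioned above can be sketched in a few lines. This is an illustrative model, not the actual Hadoop implementation; the 30-second timeout is hypothetical (real HDFS waits roughly 10.5 minutes by default before declaring a DataNode dead):

```python
# Illustrative sketch (not Hadoop's implementation): a NameNode tracking
# DataNode liveness from heartbeat timestamps. Timeout value is hypothetical.
import time

class HeartbeatMonitor:
    """Track DataNode liveness from heartbeat timestamps."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of last heartbeat

    def heartbeat(self, datanode_id, now=None):
        """Record a heartbeat from a DataNode."""
        self.last_seen[datanode_id] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        """Return DataNodes whose last heartbeat is older than the timeout."""
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=100.0)   # dn1 keeps reporting; dn2 goes silent
print(monitor.dead_nodes(now=100.0))  # ['dn2']
```

A DataNode that stops sending heartbeats is eventually marked dead, and the NameNode then schedules re-replication of its blocks on healthy nodes.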
- DataNode is also called a ‘Worker’ Node.
- DataNodes are responsible for storing actual data in terms of blocks.
- Data is distributed across DataNodes.
- DataNodes communicate with the NameNode through heartbeats.
- DataNodes do not communicate with each other. They work independently.
- As they work independently, the cluster is not affected if one of the DataNodes goes down, as long as data replication is taken care of.
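The replication point above can be made concrete with a toy placement sketch. This is not Hadoop's real rack-aware placement policy, and the node names are hypothetical; it only shows why losing one DataNode loses no data when every block has multiple replicas:

```python
# Illustrative sketch (not Hadoop's real placement policy): spread each
# block's replicas across distinct DataNodes, round-robin.
import itertools

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    ring = itertools.cycle(range(len(datanodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes, replication=3)

# If dn1 fails, every block still has replicas on surviving nodes.
for block, holders in placement.items():
    survivors = [n for n in holders if n != "dn1"]
    assert len(survivors) >= 2
```

With the default replication factor of 3, any single node failure leaves at least two live copies of every block, and the NameNode restores the missing replicas in the background.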
Note: You can create a single-node cluster where the NameNode and DataNode are installed on one system. In this scenario, we cannot expect high availability of the NameNode or DataNode, as all the services run on one system.
We will cover the above concepts in greater detail in a later chapter.
- Secondary NameNode is used to help achieve high availability of the NameNode's metadata.
- As the NameNode is a single point of failure in HDFS, the file system becomes unavailable when it goes down.
- To overcome this problem, the Secondary NameNode takes periodic snapshots of the NameNode's metadata.
- The Secondary NameNode periodically copies the 'FsImage' and 'edits log' files and merges them into a checkpoint.
- In case of NameNode failure, the metadata can be recovered from the last checkpoint saved by the Secondary NameNode.
- However, the Secondary NameNode does not take over as the primary NameNode in case of failure.
- Generally, the Secondary NameNode's configuration is as high as that of the primary NameNode.
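The checkpointing idea can be sketched as follows. This is a simplified model, not the real checkpoint protocol; the operation names and data structures are hypothetical. The key point is that the edits log is replayed on top of the old FsImage to produce a fresh snapshot:

```python
# Illustrative sketch (not the real HDFS checkpoint protocol): merge a
# FsImage snapshot with the edits log to produce a new checkpoint.
def apply_edits(fsimage, edits):
    """Replay edit-log entries on top of a FsImage snapshot."""
    image = dict(fsimage)  # work on a copy; the original snapshot is untouched
    for op, path, *args in edits:
        if op == "create":
            image[path] = args[0]          # path -> list of block ids
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/data/a.txt": ["blk_1"], "/data/b.txt": ["blk_2"]}
edits = [("create", "/data/c.txt", ["blk_3"]),
         ("delete", "/data/a.txt")]

checkpoint = apply_edits(fsimage, edits)
print(sorted(checkpoint))  # ['/data/b.txt', '/data/c.txt']
```

Merging periodically keeps the edits log short, so a restarted NameNode can load the latest checkpoint quickly instead of replaying a long history of edits.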