HDFS File Read Process
To understand the HDFS read and write operations, let’s consider the assumptions below.
Let’s take the replication factor to be 3. We have a client machine where Hadoop is installed with all components except the NameNode and DataNode. The Secondary NameNode is not considered here.
HDFS Read:
As we know, a file in HDFS is split into multiple blocks, and each block is stored as multiple replicas based on the replication factor.
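To make the block layout concrete, here is a small arithmetic sketch in plain Java. It uses no Hadoop classes. The 300 MB file size is hypothetical, and the 128 MB block size is an assumption (it is the usual default, but it is configurable per cluster); the replication factor of 3 comes from the assumptions above.

```java
// Toy arithmetic only; BlockLayout is a hypothetical helper, not a Hadoop class.
class BlockLayout {

    // Number of blocks needed for a file: ceil(fileSize / blockSize).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Total number of block replicas stored across the cluster.
    static long replicaCount(long fileSizeBytes, long blockSizeBytes, int replicationFactor) {
        return blockCount(fileSizeBytes, blockSizeBytes) * replicationFactor;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 300 * mb;   // hypothetical file size
        long blockSize = 128 * mb;  // assumed block size
        System.out.println("blocks   = " + blockCount(fileSize, blockSize));
        System.out.println("replicas = " + replicaCount(fileSize, blockSize, 3));
    }
}
```

Under these assumptions the file occupies 3 blocks (the last one only partly filled), and the cluster holds 9 block replicas in total.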
- Since the NameNode holds the addresses of all blocks, the client interacts with the NameNode to read a file in HDFS.
- The client asks the NameNode for the addresses of all blocks of the data file.
- The client initiates the read by calling the open() method: FSDataInputStream in = fs.open(inFile);
- In response, the NameNode returns metadata about the blocks, including the addresses of the replicas, and open() hands the client an FSDataInputStream.
- FSDataInputStream wraps a DFSInputStream, which takes care of reading the blocks from the different nodes.
- The client invokes the read method (step 3), which causes DFSInputStream to connect to the closest DataNode holding the first block and stream the block.
- The same process (steps 4 and 5) repeats until all the blocks have been read.
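The read flow above can be sketched as a toy in-memory model in plain Java. This is only an illustration under stated assumptions, not the Hadoop implementation: `ToyNameNode`, `ToyCluster`, `ToyBlock`, the block IDs, and the DataNode names are all hypothetical. The fallback to the next replica when a DataNode cannot serve a block goes slightly beyond the list above and is included to show why several replica addresses are returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of the read path; none of these types are Hadoop classes.
class HdfsReadSketch {

    // One block of a file plus the DataNodes holding its replicas, nearest first.
    static final class ToyBlock {
        final String blockId;
        final List<String> dataNodes;

        ToyBlock(String blockId, List<String> dataNodes) {
            this.blockId = blockId;
            this.dataNodes = dataNodes;
        }
    }

    // Stand-in for the NameNode: maps a file path to its ordered block list.
    static final class ToyNameNode {
        private final Map<String, List<ToyBlock>> files = new HashMap<>();

        void addFile(String path, List<ToyBlock> blocks) {
            files.put(path, blocks);
        }

        List<ToyBlock> getBlockLocations(String path) {
            List<ToyBlock> blocks = files.get(path);
            if (blocks == null) {
                throw new IllegalArgumentException("No such file: " + path);
            }
            return blocks;
        }
    }

    // Stand-in for the DataNodes: "dataNode/blockId" -> block contents.
    static final class ToyCluster {
        private final Map<String, String> stored = new HashMap<>();

        void store(String dataNode, String blockId, String data) {
            stored.put(dataNode + "/" + blockId, data);
        }

        // Returns null when this DataNode does not have the replica.
        String readBlock(String dataNode, String blockId) {
            return stored.get(dataNode + "/" + blockId);
        }
    }

    // Client side: fetch block locations once, then stream block by block.
    static String readFile(ToyNameNode nameNode, ToyCluster cluster, String path) {
        StringBuilder out = new StringBuilder();
        for (ToyBlock block : nameNode.getBlockLocations(path)) {
            String data = null;
            // Try the nearest replica first, then fall back to the others.
            for (String dataNode : block.dataNodes) {
                data = cluster.readBlock(dataNode, block.blockId);
                if (data != null) {
                    break;
                }
            }
            if (data == null) {
                throw new IllegalStateException("No replica available for " + block.blockId);
            }
            out.append(data);
        }
        return out.toString();
    }

    // Builds a two-block file with replication factor 3 and reads it back.
    static String demo() {
        ToyNameNode nameNode = new ToyNameNode();
        ToyCluster cluster = new ToyCluster();

        List<ToyBlock> blocks = new ArrayList<>();
        blocks.add(new ToyBlock("blk_1", Arrays.asList("dn1", "dn2", "dn3")));
        blocks.add(new ToyBlock("blk_2", Arrays.asList("dn2", "dn3", "dn4")));
        nameNode.addFile("/data/sample.txt", blocks);

        // dn1 is deliberately left without blk_1 to exercise the fallback.
        cluster.store("dn2", "blk_1", "hello ");
        cluster.store("dn3", "blk_1", "hello ");
        for (String dn : Arrays.asList("dn2", "dn3", "dn4")) {
            cluster.store(dn, "blk_2", "hdfs");
        }
        return readFile(nameNode, cluster, "/data/sample.txt");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model the client contacts the NameNode stand-in only for metadata, while the block contents come directly from the DataNode stand-ins, which mirrors the division of work described in the list.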