To understand HDFS read and write operations, let’s start with a few assumptions.
Let’s take the replication factor to be 3. We have a client machine where Hadoop is installed with all components except the NameNode and DataNode. The Secondary NameNode is not considered here.
As we know, a file in HDFS is split into multiple blocks, and each block is stored as several replicas according to the replication factor.
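To make the relationship between blocks and replicas concrete, here is a small arithmetic sketch. It uses no Hadoop classes. The 300 MB file size and the 128 MB block size are assumptions chosen for illustration (the block size is configurable), and BlockMath is a hypothetical helper name.

```java
// Toy arithmetic only; BlockMath is a hypothetical helper, not part of Hadoop.
class BlockMath {

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Total number of block replicas stored across the cluster.
    static long replicaCount(long fileSizeBytes, long blockSizeBytes, int replicationFactor) {
        return blockCount(fileSizeBytes, blockSizeBytes) * replicationFactor;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 300 * mb;   // assumed example file size
        long blockSize = 128 * mb;  // assumed block size
        int replication = 3;        // replication factor from the text

        System.out.println("blocks   = " + blockCount(fileSize, blockSize));
        System.out.println("replicas = " + replicaCount(fileSize, blockSize, replication));
    }
}
```

Under these assumptions the file occupies 3 blocks (two full blocks and one partial block), so with a replication factor of 3 the cluster stores 9 block replicas in total.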
- Because the NameNode holds the addresses of all blocks, the client first interacts with the NameNode to read a file in HDFS.
- The client requests from the NameNode the addresses of all the blocks of the data file.
- The client starts this by calling the open() method: FSDataInputStream in = fs.open(inFile);
- In response, the NameNode returns metadata about the blocks, including the addresses of the DataNodes that hold each replica. The client then receives an FSDataInputStream to read from.
- FSDataInputStream wraps a DFSInputStream, which takes care of reading the blocks from the different nodes.
- The client invokes the read method (step 3), which causes DFSInputStream to connect to the closest DataNode holding the first block and stream that block back to the client.
- The same process (steps 4 and 5) repeats until all the blocks have been read.
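The read flow described above can be sketched as a toy model in plain Java. This is not the Hadoop API: HdfsReadSketch, getBlockLocations, streamBlock and readFile are hypothetical names, and the block IDs, DataNode names and contents are made up. It also assumes each replica list is ordered nearest-first, so the client simply reads from the first entry. In a real client all of this is hidden behind fs.open() and the read calls on FSDataInputStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the read path; hypothetical classes, NOT the Hadoop API.
class HdfsReadSketch {

    // Stand-in for the NameNode: maps each block of the file, in file order,
    // to the DataNodes holding a replica (assumed sorted nearest-first).
    static Map<String, List<String>> getBlockLocations() {
        Map<String, List<String>> locations = new LinkedHashMap<>();
        locations.put("blk_1", Arrays.asList("dn1", "dn2", "dn3"));
        locations.put("blk_2", Arrays.asList("dn2", "dn3", "dn4"));
        locations.put("blk_3", Arrays.asList("dn4", "dn1", "dn2"));
        return locations;
    }

    // Stand-in for a DataNode streaming a block back to the client.
    // Every replica holds the same bytes, so the toy ignores which node is asked.
    static String streamBlock(String dataNode, String blockId, Map<String, String> blockData) {
        return blockData.get(blockId);
    }

    // Stand-in for the input stream: walks the blocks in order and reads each
    // one from the first DataNode in its replica list, recording who served it.
    static String readFile(Map<String, List<String>> locations,
                           Map<String, String> blockData,
                           List<String> servedBy) {
        StringBuilder contents = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : locations.entrySet()) {
            String dataNode = entry.getValue().get(0);
            servedBy.add(dataNode);
            contents.append(streamBlock(dataNode, entry.getKey(), blockData));
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        Map<String, String> blockData = new HashMap<>();
        blockData.put("blk_1", "hello ");
        blockData.put("blk_2", "hdfs ");
        blockData.put("blk_3", "read");

        List<String> servedBy = new ArrayList<>();
        String contents = readFile(getBlockLocations(), blockData, servedBy);
        System.out.println(contents);  // file contents reassembled block by block
        System.out.println(servedBy);  // DataNode chosen for each block
    }
}
```

The point of the sketch is the division of labour: the NameNode stand-in only hands out locations, while the block data itself comes from the DataNode stand-ins, one block at a time and in file order.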