Explain the HDFS data write pipeline workflow.
The HDFS client starts a write by calling create() on the DistributedFileSystem API.
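DistributedFileSystem issues an RPC call to the name node to create a new file in the file system namespace. The name node runs various checks, for example that the file does not already exist and that the client has permission to create it. If the checks pass, the file is created; otherwise the client gets an IOException.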
DistributedFileSystem returns an FSDataOutputStream to the client for writing data. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the data nodes and the name node. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which asks the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas.
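The packet-splitting step can be pictured with a small toy model. This is only a sketch, not Hadoop code: split_into_packets and PACKET_SIZE are hypothetical names, and the packet size is deliberately tiny so the result is easy to read.

```python
from collections import deque

# Hypothetical, deliberately tiny packet size; real HDFS packets are far larger.
PACKET_SIZE = 4


def split_into_packets(data: bytes, packet_size: int = PACKET_SIZE) -> deque:
    """Split a byte string into (seq_no, payload) packets held in a FIFO queue."""
    data_queue = deque()
    for seq_no, start in enumerate(range(0, len(data), packet_size)):
        data_queue.append((seq_no, data[start:start + packet_size]))
    return data_queue


# The client writes 10 bytes; they end up as three packets in the data queue.
data_queue = split_into_packets(b"hello hdfs")

# A consumer playing the DataStreamer's role takes packets from the front.
first_packet = data_queue.popleft()
```

A deque is used because the queue is filled at one end and drained at the other, which matches the first-in, first-out order in which packets are streamed.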
The list of data nodes forms a pipeline whose length equals the replication factor, which is 3 by default. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node. The second data node likewise stores the packet and forwards it to the third.
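The store-and-forward behaviour of the pipeline can be sketched as a chain of objects. ToyDataNode and build_pipeline are hypothetical names for this illustration; a real data node writes to disk and talks over the network, which the sketch ignores.

```python
class ToyDataNode:
    """Hypothetical stand-in for a data node: stores a packet, then forwards it."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next node in the pipeline, or None
        self.stored = []

    def receive(self, packet):
        self.stored.append(packet)        # keep a local replica of the packet
        if self.downstream is not None:   # pass it on down the pipeline
            self.downstream.receive(packet)


def build_pipeline(names):
    """Chain nodes so that names[0] is the head and names[-1] is the tail."""
    downstream = None
    for name in reversed(names):
        downstream = ToyDataNode(name, downstream)
    return downstream  # head of the pipeline (None for an empty list)


# Three names model a replication factor of 3.
head = build_pipeline(["dn1", "dn2", "dn3"])
for packet in [(0, b"hell"), (1, b"o hd"), (2, b"fs")]:
    head.receive(packet)  # the writer only ever talks to the first node
```

After the loop, every node in the chain holds the same three packets, even though the writer sent each packet only once.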
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the ack queue. A packet is removed from the ack queue only once it has been acknowledged by all the data nodes in the pipeline. The data nodes send their acknowledgments once the required replicas have been written.
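The interplay between the data queue and the ack queue can be modelled roughly as follows. ToyOutputStream and its methods are hypothetical, and the sketch simplifies by recording one acknowledgment per node; how acknowledgments are actually carried on the wire is not modelled here.

```python
from collections import deque


class ToyOutputStream:
    """Hypothetical model of a writer with a data queue and an ack queue."""

    def __init__(self, packets, pipeline_nodes):
        self.data_queue = deque(packets)   # packets not yet sent
        self.ack_queue = deque()           # packets sent but not fully acked
        self.pipeline_nodes = set(pipeline_nodes)
        self.acks = {}                     # seq_no -> names of nodes that acked

    def send_next(self):
        """Move the next packet from the data queue to the ack queue."""
        packet = self.data_queue.popleft()
        self.ack_queue.append(packet)
        self.acks[packet[0]] = set()
        return packet

    def on_ack(self, seq_no, node):
        """Record one node's ack; drop the packet once every node has acked."""
        self.acks[seq_no].add(node)
        if self.acks[seq_no] == self.pipeline_nodes:
            self.ack_queue = deque(p for p in self.ack_queue if p[0] != seq_no)
            del self.acks[seq_no]


stream = ToyOutputStream([(0, b"hell"), (1, b"o hd")], ["dn1", "dn2", "dn3"])
stream.send_next()

stream.on_ack(0, "dn3")
stream.on_ack(0, "dn2")
pending_after_two = len(stream.ack_queue)    # still waiting on dn1

stream.on_ack(0, "dn1")
pending_after_three = len(stream.ack_queue)  # packet 0 is now fully acked
```

The point of the second queue is that a sent packet is not forgotten until the whole pipeline has confirmed it.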
When it has finished writing, the client calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete. The name node already knows which blocks the file is made up of, because the DataStreamer asked it for block allocations, so it only has to wait for the blocks to be minimally replicated before returning successfully.
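The completion condition can be expressed as a small check. The functions below are hypothetical and only illustrate the rule; the default minimum of 1 replica is an assumption of this sketch, chosen to mirror what is commonly the default of the dfs.namenode.replication.min setting.

```python
def is_minimally_replicated(replica_counts, min_replication=1):
    """True when every block of the file has at least min_replication replicas."""
    return all(count >= min_replication for count in replica_counts.values())


def close_file(pending_packets, replica_counts, min_replication=1):
    """Toy close(): refuse while packets are unacknowledged, then check blocks."""
    if pending_packets:
        raise RuntimeError("packets still awaiting acknowledgment")
    return is_minimally_replicated(replica_counts, min_replication)


# Hypothetical block IDs mapped to the number of replicas written so far.
replicas = {"blk_1": 3, "blk_2": 1}
done = close_file([], replicas)
```

Note that the check does not require the full replication factor: a block with a single replica already satisfies a minimum of 1, and the remaining replicas can be created afterwards.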