Files and Blocks in HDFS
1. Definition
HDFS Blocks are fixed-size chunks (default 128 MB) into which large files are divided for distributed storage across the Hadoop cluster.
2. Block Fundamentals
2.1 Block Size
Default: 128 MB (Hadoop 2.x and 3.x).
Rationale:
- Minimize metadata overhead
- Optimize sequential read performance
- Balance seek time vs transfer time
Example:
File Size: 1 GB
Blocks: 1024 MB / 128 MB = 8 blocks
With 3x replication:
Physical storage: 8 × 3 = 24 blocks = 3 GB
2.2 File-to-Block Mapping
Process:
Large File → Split into 128 MB blocks →
Each block gets unique ID →
Blocks distributed across DataNodes →
Metadata stored in NameNode
Example:
sales.csv (500 MB):
Block 1 (0-128 MB): blk_1001 → [DN1, DN2, DN3]
Block 2 (128-256 MB): blk_1002 → [DN2, DN4, DN5]
Block 3 (256-384 MB): blk_1003 → [DN3, DN5, DN6]
Block 4 (384-500 MB): blk_1004 → [DN1, DN4, DN6] (only 116 MB)
3. Block Replication
Default Replication: 3 copies.
Benefits:
- Fault Tolerance: Survives 2 node failures
- Read Performance: Parallel reads from multiple copies
- Load Balancing: Distribute read requests
Customization:
# Set replication for specific file
hadoop fs -setrep 5 /critical/data.txt
# Different files, different replication factors
hadoop fs -setrep 1 /archive/old-logs.txt (save space)
hadoop fs -setrep 5 /critical/customer-db.txt (high availability)
4. Block Advantages
Loading comparison…
Exam Pattern Questions and Answers
Question 1: "Explain HDFS block structure with example." (6 Marks)
Answer:
Block Definition (1 mark): HDFS blocks are fixed-size chunks of 128 MB (default) into which large files are divided for distributed storage across cluster.
File Splitting (2 marks): When a file is written to HDFS, it is split into 128 MB blocks. For example, a 500 MB file creates 4 blocks: first three blocks are 128 MB each, and fourth block is 116 MB. Each block receives unique identifier like blk_1001, blk_1002, etc.
Distribution and Replication (3 marks): Each block is replicated across three DataNodes (default) for fault tolerance. Blocks are distributed using rack-aware placement: first replica on writer node, second on different rack, third on same rack as second. This ensures survival of rack failures while optimizing network bandwidth. NameNode maintains metadata mapping each file to its blocks and each block to hosting DataNodes.
Summary
- Block Size: 128 MB default
- Replication: 3 copies default
- Advantages: Unlimited file size, parallel processing, fault tolerance
- Trade-off: 3x storage overhead
Always provide numerical example showing file split into blocks with replication calculation.
Quiz Time! 🎯
Loading quiz…