Home > Topics > Big Data Analysis > Files and Blocks in HDFS

Files and Blocks in HDFS

1. Definition

HDFS Blocks are fixed-size chunks (default 128 MB) into which large files are divided for distributed storage across the Hadoop cluster.

2. Block Fundamentals

2.1 Block Size

Default: 128 MB (Hadoop 2.x and 3.x).

Rationale:

Minimize metadata overhead
Optimize sequential read performance
Balance seek time vs transfer time

Example:

File Size: 1 GB
Blocks: 1024 MB / 128 MB = 8 blocks

With 3x replication:
Physical storage: 8 × 3 = 24 blocks = 3 GB

2.2 File-to-Block Mapping

Process:

Large File → Split into 128 MB blocks → 
Each block gets unique ID → 
Blocks distributed across DataNodes →
Metadata stored in NameNode

Example:

sales.csv (500 MB):
  Block 1 (0-128 MB): blk_1001 → [DN1, DN2, DN3]
  Block 2 (128-256 MB): blk_1002 → [DN2, DN4, DN5]
  Block 3 (256-384 MB): blk_1003 → [DN3, DN5, DN6]
  Block 4 (384-500 MB): blk_1004 → [DN1, DN4, DN6] (only 116 MB)

3. Block Replication

Default Replication: 3 copies.

Benefits:

Fault Tolerance: Survives 2 node failures
Read Performance: Parallel reads from multiple copies
Load Balancing: Distribute read requests

Customization:

# Set replication for specific file
hadoop fs -setrep 5 /critical/data.txt

# Different files, different replication factors
hadoop fs -setrep 1 /archive/old-logs.txt  (save space)
hadoop fs -setrep 5 /critical/customer-db.txt  (high availability)

4. Block Advantages

Benefits

File Size: No limit, can exceed any single disk
Parallel Processing: Process blocks simultaneously
Fault Tolerance: Block replication across nodes
Network Efficiency: Transfer only needed blocks

Considerations

Small Files: Many small files waste NameNode memory
Block Size: Must balance metadata vs efficiency
Replication: Storage overhead (3x space used)
Consistency: Maintain block-to-file relationships

Loading comparison…

Exam Pattern Questions and Answers

Question 1: "Explain HDFS block structure with example." (6 Marks)

Answer:

Block Definition (1 mark): HDFS blocks are fixed-size chunks of 128 MB (default) into which large files are divided for distributed storage across cluster.

File Splitting (2 marks): When a file is written to HDFS, it is split into 128 MB blocks. For example, a 500 MB file creates 4 blocks: first three blocks are 128 MB each, and fourth block is 116 MB. Each block receives unique identifier like blk_1001, blk_1002, etc.

Distribution and Replication (3 marks): Each block is replicated across three DataNodes (default) for fault tolerance. Blocks are distributed using rack-aware placement: first replica on writer node, second on different rack, third on same rack as second. This ensures survival of rack failures while optimizing network bandwidth. NameNode maintains metadata mapping each file to its blocks and each block to hosting DataNodes.

Summary

Block Size: 128 MB default
Replication: 3 copies default
Advantages: Unlimited file size, parallel processing, fault tolerance
Trade-off: 3x storage overhead

Exam Tip

Always provide numerical example showing file split into blocks with replication calculation.

Quiz Time! 🎯

Test Your Knowledge

Question 1 of 2

1. Default HDFS block size is:

64 MB

128 MB

256 MB

512 MB

Loading quiz…