Files and Blocks in HDFS

1. Definition

HDFS blocks are fixed-size chunks (128 MB by default) into which large files are divided for distributed storage across the Hadoop cluster. Only the last block of a file may be smaller than the block size.


2. Block Fundamentals

2.1 Block Size

Default: 128 MB (Hadoop 2.x and 3.x), configurable through the dfs.blocksize property.

Rationale:

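  • Minimize metadata overhead: fewer blocks per file means fewer entries for the NameNode to track
  • Optimize sequential read performance: each block is read as one long contiguous stream
  • Balance seek time vs transfer time: with large blocks, seek time is a small fraction of transfer time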

Example:

File Size: 1 GB
Blocks: 1024 MB / 128 MB = 8 blocks

With 3x replication:
Physical storage: 8 × 3 = 24 blocks = 3 GB
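
The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above; the function names are made up for this example and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file needs (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def physical_storage_mb(file_size_mb, replication=3):
    """Raw disk space used across the cluster, counting every replica."""
    return file_size_mb * replication


blocks = block_count(1024)
print(blocks)                     # 8 logical blocks
print(blocks * 3)                 # 24 block replicas with 3x replication
print(physical_storage_mb(1024))  # 3072 MB = 3 GB
```

Rounding up matters for files that are not a multiple of the block size: block_count(500) gives 4, matching the 500 MB example in section 2.2.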

2.2 File-to-Block Mapping

Process:

Large File → Split into 128 MB blocks → 
Each block gets unique ID → 
Blocks distributed across DataNodes →
Metadata stored in NameNode

Example:

sales.csv (500 MB):
  Block 1 (0-128 MB): blk_1001 → [DN1, DN2, DN3]
  Block 2 (128-256 MB): blk_1002 → [DN2, DN4, DN5]
  Block 3 (256-384 MB): blk_1003 → [DN3, DN5, DN6]
  Block 4 (384-500 MB): blk_1004 → [DN1, DN4, DN6] (only 116 MB)
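
The splitting step can be sketched as follows. This is a simplified model, not HDFS code: the sequential IDs starting at blk_1001 simply mirror the example above (real block IDs are assigned by the NameNode), and split_into_blocks is a hypothetical helper name.

```python
def split_into_blocks(file_size_mb, block_size_mb=128, first_id=1001):
    """Return (block_name, start_mb, end_mb, size_mb) for each block of a file."""
    blocks = []
    offset = 0
    block_id = first_id  # illustrative numbering only
    while offset < file_size_mb:
        # The last block ends at the file size, so it may be smaller.
        end = min(offset + block_size_mb, file_size_mb)
        blocks.append((f"blk_{block_id}", offset, end, end - offset))
        offset = end
        block_id += 1
    return blocks


for name, start, end, size in split_into_blocks(500):
    print(f"{name}: {start}-{end} MB ({size} MB)")
```

For a 500 MB file this yields four blocks, the last one covering 384-500 MB with a size of 116 MB.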

3. Block Replication

Default Replication: 3 copies.

Benefits:

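  • Fault Tolerance: A block stays readable even if two of the three DataNodes holding its replicas fail
  • Read Performance: Clients can read different blocks, or the same block, from multiple replicas in parallel
  • Load Balancing: Read requests are spread across the DataNodes that hold a replica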
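
To make the fault-tolerance claim concrete, here is a small sketch that reuses the sales.csv block locations from section 2.2 and checks which blocks remain readable after some DataNodes fail. The dictionary and function names are hypothetical and only model the idea that a block is readable while at least one replica is on a live node.

```python
# Block -> DataNodes holding a replica (taken from the sales.csv example).
block_locations = {
    "blk_1001": {"DN1", "DN2", "DN3"},
    "blk_1002": {"DN2", "DN4", "DN5"},
    "blk_1003": {"DN3", "DN5", "DN6"},
    "blk_1004": {"DN1", "DN4", "DN6"},
}


def readable_blocks(locations, failed_nodes):
    """Blocks that still have at least one replica on a live DataNode."""
    failed = set(failed_nodes)
    return {blk for blk, nodes in locations.items() if nodes - failed}


def file_is_readable(locations, failed_nodes):
    """A file is readable only if every one of its blocks is readable."""
    return len(readable_blocks(locations, failed_nodes)) == len(locations)


print(file_is_readable(block_locations, {"DN1", "DN2"}))         # True
print(file_is_readable(block_locations, {"DN1", "DN2", "DN3"}))  # False
```

With three replicas per block, any two node failures leave every block with a live copy. Losing DN1, DN2 and DN3 together removes all replicas of blk_1001, so the file can no longer be read in full.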

Customization:

# Set replication for specific file
hadoop fs -setrep 5 /critical/data.txt

# Different files, different replication factors
hadoop fs -setrep 1 /archive/old-logs.txt       # save space
hadoop fs -setrep 5 /critical/customer-db.txt   # high availability

4. Block Advantages



Exam Pattern Questions and Answers

Question 1: "Explain HDFS block structure with example." (6 Marks)

Answer:

Block Definition (1 mark): HDFS blocks are fixed-size chunks of 128 MB (default) into which large files are divided for distributed storage across the cluster.

File Splitting (2 marks): When a file is written to HDFS, it is split into 128 MB blocks. For example, a 500 MB file creates 4 blocks: the first three are 128 MB each and the fourth is 116 MB. Each block receives a unique identifier such as blk_1001, blk_1002, and so on.

Distribution and Replication (3 marks): Each block is replicated to three DataNodes by default for fault tolerance. Replicas are placed with rack awareness: the first on the writer's node, the second on a node in a different rack, and the third on a different node in the same rack as the second. This lets the data survive the failure of a whole rack while limiting cross-rack write traffic. The NameNode maintains the metadata that maps each file to its blocks and each block to the DataNodes hosting it.
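
The placement rule in the answer above can be sketched as a short function. This is a deterministic toy model under stated assumptions: the rack layout is hypothetical, every remote rack is assumed to hold at least two DataNodes, and the code always picks the first eligible node, whereas HDFS chooses among eligible nodes at random.

```python
def place_replicas(writer, topology):
    """Pick three DataNodes for one block.

    topology maps a rack name to its list of DataNodes (hypothetical layout).
    """
    rack_of = {dn: rack for rack, dns in topology.items() for dn in dns}
    # First replica: the node that is writing the block.
    first = writer
    # Second replica: a node on a different rack from the writer.
    remote_rack = next(r for r in sorted(topology) if r != rack_of[writer])
    second = topology[remote_rack][0]
    # Third replica: a different node on the same rack as the second.
    third = next(dn for dn in topology[remote_rack] if dn != second)
    return [first, second, third]


topology = {
    "rack1": ["DN1", "DN2", "DN3"],
    "rack2": ["DN4", "DN5", "DN6"],
}
print(place_replicas("DN1", topology))  # ['DN1', 'DN4', 'DN5']
```

The result spans two racks, so losing either rack still leaves at least one replica, and only one copy of the block has to cross between racks during the write.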


Summary

  1. Block Size: 128 MB default
  2. Replication: 3 copies default
  3. Advantages: Files larger than any single disk, parallel processing, fault tolerance
  4. Trade-off: 3x storage overhead

Exam Tip

Always provide a numerical example showing a file split into blocks, together with the replication calculation.

