
Introduction to HDFS

1. Definition

HDFS (Hadoop Distributed File System) is a distributed file system designed to store very large files across multiple machines in a Hadoop cluster, providing high-throughput access to data and fault tolerance through replication.


2. Purpose and Goals of HDFS

2.1 Design Goals

  1. Hardware Failure Handling: Assume failures are the norm, not the exception
  2. Streaming Data Access: Optimize for batch processing with high throughput
  3. Large Datasets: Handle files from gigabytes to terabytes
  4. Simple Coherency Model: Write-once-read-many access pattern
  5. Data Locality: Move computation near data to reduce network congestion
  6. Portability: Run across heterogeneous hardware and software platforms

3. Key Features of HDFS

3.1 Distributed Storage

Concept: Files stored across multiple machines instead of single server.

Mechanism:

  • Large file split into fixed-size blocks (default: 128 MB)
  • Blocks distributed across cluster DataNodes
  • Each block replicated on multiple nodes (default: 3 copies)

Example:

1 GB File → 8 blocks of 128 MB each
Block 1: DataNode1, DataNode3, DataNode5
Block 2: DataNode2, DataNode4, DataNode6
Block 3: DataNode1, DataNode4, DataNode7
...

Benefit: Enables parallel processing and fault tolerance
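
The same block and replication settings can be inspected programmatically. The sketch below is illustrative only: the cluster address hdfs://namenode:9000 and the path /data/sample.txt are assumed placeholders. It writes a small file with Hadoop's Java FileSystem API and prints the block size and replication factor HDFS assigned to it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed cluster address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");           // assumed path

        // Write a small file (a real workload would write GBs split into 128 MB blocks)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Inspect how HDFS stored the file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size  : " + status.getBlockSize() / (1024 * 1024) + " MB"); // typically 128 MB
        System.out.println("Replication : " + status.getReplication());                      // typically 3

        fs.close();
    }
}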


3.2 High Throughput

Focus: Optimized for reading large files quickly, not low latency.


Performance: Can achieve multi-GB/sec aggregate throughput when reading large files across the cluster.
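
As a rough illustration of sequential access (a sketch, not a benchmark; the path /data/large.log is an assumed placeholder), the snippet below reads an HDFS file front to back with a large buffer and reports the observed throughput:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/large.log");   // assumed path

        byte[] buffer = new byte[8 * 1024 * 1024]; // large buffer favours sequential throughput
        long bytesRead = 0;
        long start = System.nanoTime();

        try (FSDataInputStream in = fs.open(file)) {
            int n;
            while ((n = in.read(buffer)) > 0) {
                bytesRead += n;                    // a real job would process each chunk here
            }
        }

        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Read %d MB at %.1f MB/s%n",
                bytesRead / (1024 * 1024), bytesRead / (1024.0 * 1024.0) / seconds);
        fs.close();
    }
}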


3.3 Fault Tolerance

Principle: Automatic recovery from failures without data loss.

Implementation:

Block replication combined with heartbeat monitoring: the NameNode tracks DataNode health through periodic heartbeats, marks a node dead when its heartbeats stop, and schedules re-replication of the blocks that node held.

Example Scenario:

Initial State:
Block A: DataNode1, DataNode2, DataNode3

DataNode2 Fails:
Block A: DataNode1, ❌, DataNode3 (only 2 copies)

Automatic Recovery:
NameNode detects failure
Instructs DataNode4 to copy from DataNode1
Block A: DataNode1, DataNode3, DataNode4 (3 copies restored)

Recovery Time: Typically a few minutes for a single node failure
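
Re-replication itself is automatic, but the target replication factor that the NameNode maintains can be inspected and changed per file. A minimal sketch, assuming the path /data/blockA.dat and a target factor of 4 purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/blockA.dat");            // assumed path

        // Current target replication factor the NameNode maintains for this file
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Raise it for a critical file; the NameNode schedules extra copies
        // and keeps restoring this count after DataNode failures
        fs.setReplication(file, (short) 4);

        fs.close();
    }
}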


3.4 Data Locality

Definition: Processing data where it is stored rather than moving data over the network.

Traditional Processing:

Data on Server A → Copy over network to Server B → Process → Copy results back
Network Bottleneck: Transferring TBs takes hours/days

HDFS with Data Locality:

Data on Server A → Process on Server A → Results (small)
Network Usage: Minimal, only for small result transfer

Performance Impact:

Only small results travel over the network instead of entire datasets, reducing network traffic by up to 99% and speeding up processing by 10-100x compared with moving data to the computation.
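
Block locations are exactly the information a locality-aware scheduler uses. A minimal sketch, assuming the file /data/sample.txt already exists in HDFS, that prints which hosts hold each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");            // assumed path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // A locality-aware scheduler (e.g. MapReduce/YARN) uses this information
        // to run each task on a node that already holds the block it will read.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}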


3.5 Scalability

Horizontal Scaling: Add more nodes to increase both storage and processing capacity.

Scaling Example:

Cluster Size | Storage Capacity | Processing Power     | Cost
10 nodes     | 50 TB            | 10x parallel tasks   | ₹5 lakhs
100 nodes    | 500 TB           | 100x parallel tasks  | ₹50 lakhs
1000 nodes   | 5 PB             | 1000x parallel tasks | ₹5 crores

Linear Growth: 10x nodes = 10x storage + 10x processing


4. HDFS vs Traditional File Systems

Aspect          | Traditional FS             | HDFS
File Size       | KBs to MBs optimal         | GBs to TBs optimal
Access Pattern  | Random read/write          | Sequential read, append-only
Storage         | Single machine             | Distributed across cluster
Fault Tolerance | RAID, backup systems       | Automatic replication
Scalability     | Limited by single machine  | Horizontal scaling by adding nodes
Latency         | Milliseconds               | Seconds (not latency-optimized)
Use Case        | OS files, applications     | Big Data analytics

5. Write-Once-Read-Many Model

HDFS Assumption: Files are written once and read multiple times.

Supported Operations:

  • ✅ Create file
  • ✅ Append to file
  • ✅ Read file (many times)
  • ✅ Delete file
  • ❌ Random writes (not supported)
  • ❌ Modify middle of file (not supported)
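
These operations map directly onto the Java FileSystem API. A minimal sketch exercising each supported operation (the log path is an assumed placeholder, and append assumes the cluster permits it, as Hadoop 2.x/3.x clusters do by default):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadManyExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/app-2024-01-01.log");     // assumed path

        // 1. Create: write the file once
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.writeBytes("10:00 user login\n");
        }

        // 2. Append: add records at the end (no random writes, no in-place edits)
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("10:05 page view\n");
        }

        // 3. Read: any number of times
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(log)))) {
            reader.lines().forEach(System.out::println);
        }

        // 4. Delete when no longer needed (second argument: recursive)
        fs.delete(log, false);
        fs.close();
    }
}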

Why This Model:

  • Simplifies data coherency
  • Enables high throughput
  • Reduces complexity
  • Perfect for log files, analytics data

Example Use Case:

Log File Processing Workflow:
1. Application writes daily logs to HDFS (write operation)
2. Append new logs throughout the day (append operation)
3. MapReduce jobs read logs for analysis (read many times)
4. After analysis, optionally delete old logs (delete operation)

6. Block Size Explanation

Default Block Size: 128 MB (Hadoop 2.x and 3.x)

Why So Large?

  1. Minimize Metadata: Fewer blocks = less metadata overhead

    • 1 TB file with 128 MB blocks ≈ 8,192 blocks
    • 1 TB file with 4 KB blocks ≈ 268 million blocks!
  2. Reduce Seek Time: Large sequential reads are efficient

    • Disk seek time: ~10 ms
    • Transfer time for 128 MB @ 100 MB/s: ~1280 ms
    • Ratio: 99% transfer, 1% seek (efficient!)
  3. Network Efficiency: Amortize network latency over large transfers

Example Calculation:

File Size: 1 GB
Block Size: 128 MB
Number of Blocks: 1024 MB / 128 MB = 8 blocks

With 3x Replication:
Total Physical Storage: 8 blocks × 3 copies = 24 blocks
Total Storage Used: 3 GB
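
The same arithmetic in code, as a toy calculation mirroring the numbers above:

public class BlockCountExample {
    public static void main(String[] args) {
        long fileSizeMb  = 1024;   // 1 GB file
        long blockSizeMb = 128;    // default HDFS block size
        int  replication = 3;      // default replication factor

        // Round up: a final partial block still counts as one block
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;

        System.out.println("Logical blocks   : " + blocks);                            // 8
        System.out.println("Physical blocks  : " + blocks * replication);              // 24
        System.out.println("Storage used (GB): " + fileSizeMb * replication / 1024.0); // 3.0
    }
}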



Exam Pattern Questions and Answers

Question 1: "Explain HDFS and its key features." (10 Marks)

Answer:

Definition (2 marks):
HDFS (Hadoop Distributed File System) is a distributed file system designed to store very large files across multiple machines in a Hadoop cluster, providing high-throughput access to data and fault tolerance through replication. It is optimized for batch processing workloads and forms the storage foundation of the Hadoop ecosystem.

Distributed Storage (2 marks):
HDFS splits large files into fixed-size blocks (default 128 MB) and distributes them across cluster DataNodes. For example, a 1 GB file is divided into 8 blocks of 128 MB each. Each block is replicated on multiple nodes (default 3 copies) ensuring data availability even when nodes fail.

High Throughput (1.5 marks):
HDFS optimizes for high data throughput rather than low latency, achieving multi-GB/sec aggregate read speeds across the cluster. It uses a sequential read pattern well suited to batch analytics, where processing large datasets in bulk matters more than random access speed.

Fault Tolerance (2 marks):
HDFS provides automatic failure recovery through block replication and heartbeat monitoring. The NameNode continuously monitors DataNode health via heartbeats. When a node fails, the NameNode detects the missing heartbeats and instructs healthy nodes to re-replicate its blocks, maintaining the configured replication factor without manual intervention or data loss.

Data Locality (1.5 marks):
HDFS enables processing data where it is stored rather than transferring large datasets over the network. MapReduce tasks preferentially run on nodes containing the required data blocks, reducing network traffic by up to 99% and improving processing speed by 10-100x compared to approaches that move data to the computation.

Scalability (1 mark):
HDFS scales horizontally by adding more commodity servers. Scaling from 10 to 100 nodes increases both storage capacity (10x) and processing power (10x) linearly with proportional cost growth, enabling petabyte-scale deployments.


Question 2: "Explain the write-once-read-many model of HDFS with example." (6 Marks)

Answer:

Model Explanation (2 marks):
HDFS follows write-once-read-many access pattern where files are created, optionally appended, and read multiple times without modification. Random writes and in-place modifications are not supported, simplifying data coherency and enabling high throughput optimizations.

Supported Operations (2 marks):
HDFS supports create file (write new data), append to file (add more data at the end), read file (multiple times for analysis), and delete file (remove when no longer needed). It does not support random writes or modifying data in the middle of existing files.

Example Usage (2 marks):
In a log file processing workflow, applications write daily server logs to HDFS as new files. Throughout the day, new log entries are appended to these files. MapReduce jobs then read these logs multiple times for different analyses: security analysis, performance monitoring, and user behavior tracking. After the retention period expires, old log files are deleted. This access pattern matches HDFS's write-once-read-many model exactly, enabling efficient storage and processing of Big Data.


Summary

Key Points for Revision:

  1. HDFS: Distributed file system for storing very large files across cluster
  2. Key Features: Distributed storage, high throughput, fault tolerance, data locality, scalability
  3. Block Size: Default 128 MB per block
  4. Replication: Default 3 copies of each block
  5. Access Pattern: Write-once-read-many model
  6. Optimization: High throughput batch processing, not low latency
  7. Scaling: Horizontal scaling by adding commodity nodes

Exam Tip

Always mention specific numbers: 128 MB block size, 3x replication factor, 99% network reduction with data locality. For architecture questions, draw the distribution of blocks across DataNodes. Explain how HDFS differs from traditional file systems.

