
Introduction to HDFS

1. Definition

HDFS (Hadoop Distributed File System) is a distributed file system designed to store very large files across multiple machines in a Hadoop cluster, providing high-throughput access to data and fault tolerance through replication.


2. Purpose and Goals of HDFS

2.1 Design Goals

  1. Hardware Failure Handling: Assume failures are the norm, not the exception
  2. Streaming Data Access: Optimize for batch processing with high throughput
  3. Large Datasets: Handle files from gigabytes to terabytes
  4. Simple Coherency Model: Write-once-read-many access pattern
  5. Data Locality: Move computation near data to reduce network congestion
  6. Portability: Run across heterogeneous hardware and software platforms

3. Key Features of HDFS

3.1 Distributed Storage

Concept: Files stored across multiple machines instead of single server.

Mechanism:

  • Large file split into fixed-size blocks (default: 128 MB)
  • Blocks distributed across cluster DataNodes
  • Each block replicated on multiple nodes (default: 3 copies)

Example:

1 GB File → 8 blocks of 128 MB each
Block 1: DataNode1, DataNode3, DataNode5
Block 2: DataNode2, DataNode4, DataNode6
Block 3: DataNode1, DataNode4, DataNode7
...

Benefit: Enables parallel processing and fault tolerance
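
The same block and replication settings can be inspected programmatically. The sketch below is illustrative only: the cluster address hdfs://namenode:9000 and the path /data/sample.txt are assumed placeholders. It writes a small file with Hadoop's Java FileSystem API and prints the block size and replication factor HDFS assigned to it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed cluster address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");           // assumed path

        // Write a small file (a real workload would write GBs split into 128 MB blocks)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Inspect how HDFS stored the file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size  : " + status.getBlockSize() / (1024 * 1024) + " MB"); // typically 128 MB
        System.out.println("Replication : " + status.getReplication());                      // typically 3

        fs.close();
    }
}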


3.2 High Throughput

Focus: Optimized for reading large files quickly, not low latency.


Performance: Can achieve multi-GB/sec aggregate throughput when reading large files across the cluster.
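
As a rough illustration of sequential access (a sketch, not a benchmark; the path /data/large.log is an assumed placeholder), the snippet below reads an HDFS file front to back with a large buffer and reports the observed throughput:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/large.log");   // assumed path

        byte[] buffer = new byte[8 * 1024 * 1024]; // large buffer favours sequential throughput
        long bytesRead = 0;
        long start = System.nanoTime();

        try (FSDataInputStream in = fs.open(file)) {
            int n;
            while ((n = in.read(buffer)) > 0) {
                bytesRead += n;                    // a real job would process each chunk here
            }
        }

        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Read %d MB at %.1f MB/s%n",
                bytesRead / (1024 * 1024), bytesRead / (1024.0 * 1024.0) / seconds);
        fs.close();
    }
}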


3.3 Fault Tolerance

Principle: Automatic recovery from failures without data loss.

Implementation:

Block replication combined with heartbeat monitoring: the NameNode tracks DataNode health through periodic heartbeats, marks a node dead when its heartbeats stop, and schedules re-replication of the blocks that node held.

Example Scenario:

Initial State:
Block A: DataNode1, DataNode2, DataNode3

DataNode2 Fails:
Block A: DataNode1, ❌, DataNode3 (only 2 copies)

Automatic Recovery:
NameNode detects failure
Instructs DataNode4 to copy from DataNode1
Block A: DataNode1, DataNode3, DataNode4 (3 copies restored)

Recovery Time: Typically a few minutes for a single node failure
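
Re-replication itself is automatic, but the target replication factor that the NameNode maintains can be inspected and changed per file. A minimal sketch, assuming the path /data/blockA.dat and a target factor of 4 purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/blockA.dat");            // assumed path

        // Current target replication factor the NameNode maintains for this file
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Raise it for a critical file; the NameNode schedules extra copies
        // and keeps restoring this count after DataNode failures
        fs.setReplication(file, (short) 4);

        fs.close();
    }
}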


3.4 Data Locality

Definition: Processing data where it is stored rather than moving data over the network.

Traditional Processing:

Data on Server A → Copy over network to Server B → Process → Copy results back
Network Bottleneck: Transferring TBs takes hours/days

HDFS with Data Locality:

Data on Server A → Process on Server A → Results (small)
Network Usage: Minimal, only for small result transfer

Performance Impact:

Only small results travel over the network instead of entire datasets, reducing network traffic by up to 99% and speeding up processing by 10-100x compared with moving data to the computation.
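
Block locations are exactly the information a locality-aware scheduler uses. A minimal sketch, assuming the file /data/sample.txt already exists in HDFS, that prints which hosts hold each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");            // assumed path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // A locality-aware scheduler (e.g. MapReduce/YARN) uses this information
        // to run each task on a node that already holds the block it will read.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}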


3.5 Scalability

Horizontal Scaling: Add more nodes to increase both storage and processing capacity.

Scaling Example:

Cluster Size | Storage Capacity | Processing Power     | Cost
10 nodes     | 50 TB            | 10x parallel tasks   | ₹5 lakhs
100 nodes    | 500 TB           | 100x parallel tasks  | ₹50 lakhs
1000 nodes   | 5 PB             | 1000x parallel tasks | ₹5 crores

Linear Growth: 10x nodes = 10x storage + 10x processing


4. HDFS vs Traditional File Systems

Aspect          | Traditional FS             | HDFS
File Size       | KBs to MBs optimal         | GBs to TBs optimal
Access Pattern  | Random read/write          | Sequential read, append-only
Storage         | Single machine             | Distributed across cluster
Fault Tolerance | RAID, backup systems       | Automatic replication
Scalability     | Limited by single machine  | Horizontal scaling by adding nodes
Latency         | Milliseconds               | Seconds (not latency-optimized)
Use Case        | OS files, applications     | Big Data analytics

5. Write-Once-Read-Many Model

HDFS Assumption: Files are written once and read multiple times.

Supported Operations:

  • ✅ Create file
  • ✅ Append to file
  • ✅ Read file (many times)
  • ✅ Delete file
  • ❌ Random writes (not supported)
  • ❌ Modify middle of file (not supported)
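
These operations map directly onto the Java FileSystem API. A minimal sketch exercising each supported operation (the log path is an assumed placeholder, and append assumes the cluster permits it, as Hadoop 2.x/3.x clusters do by default):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadManyExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/app-2024-01-01.log");     // assumed path

        // 1. Create: write the file once
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.writeBytes("10:00 user login\n");
        }

        // 2. Append: add records at the end (no random writes, no in-place edits)
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("10:05 page view\n");
        }

        // 3. Read: any number of times
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(log)))) {
            reader.lines().forEach(System.out::println);
        }

        // 4. Delete when no longer needed (second argument: recursive)
        fs.delete(log, false);
        fs.close();
    }
}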

Why This Model:

  • Simplifies data coherency
  • Enables high throughput
  • Reduces complexity
  • Perfect for log files, analytics data

Example Use Case:

Log File Processing Workflow:
1. Application writes daily logs to HDFS (write operation)
2. Append new logs throughout the day (append operation)
3. MapReduce jobs read logs for analysis (read many times)
4. After analysis, optionally delete old logs (delete operation)

6. Block Size Explanation

Default Block Size: 128 MB (Hadoop 2.x and 3.x)

Why So Large?

  1. Minimize Metadata: Fewer blocks = less metadata overhead

    • 1 TB file with 128 MB blocks ≈ 8,192 blocks
    • 1 TB file with 4 KB blocks ≈ 268 million blocks!
  2. Reduce Seek Time: Large sequential reads are efficient

    • Disk seek time: ~10 ms
    • Transfer time for 128 MB @ 100 MB/s: ~1280 ms
    • Ratio: 99% transfer, 1% seek (efficient!)
  3. Network Efficiency: Amortize network latency over large transfers

Example Calculation:

File Size: 1 GB
Block Size: 128 MB
Number of Blocks: 1024 MB / 128 MB = 8 blocks

With 3x Replication:
Total Physical Storage: 8 blocks × 3 copies = 24 blocks
Total Storage Used: 3 GB
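
The same arithmetic in code, as a toy calculation mirroring the numbers above:

public class BlockCountExample {
    public static void main(String[] args) {
        long fileSizeMb  = 1024;   // 1 GB file
        long blockSizeMb = 128;    // default HDFS block size
        int  replication = 3;      // default replication factor

        // Round up: a final partial block still counts as one block
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;

        System.out.println("Logical blocks   : " + blocks);                            // 8
        System.out.println("Physical blocks  : " + blocks * replication);              // 24
        System.out.println("Storage used (GB): " + fileSizeMb * replication / 1024.0); // 3.0
    }
}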



Exam Pattern Questions and Answers

Question 1: "Explain HDFS and its key features." (10 Marks)

Answer:

Definition (2 marks):
HDFS (Hadoop Distributed File System) is a distributed file system designed to store very large files across multiple machines in a Hadoop cluster, providing high-throughput access to data and fault tolerance through replication. It is optimized for batch processing workloads and forms the storage foundation of the Hadoop ecosystem.

Distributed Storage (2 marks):
HDFS splits large files into fixed-size blocks (default 128 MB) and distributes them across cluster DataNodes. For example, a 1 GB file is divided into 8 blocks of 128 MB each. Each block is replicated on multiple nodes (default 3 copies) ensuring data availability even when nodes fail.

High Throughput (1.5 marks):
HDFS optimizes for high data throughput rather than low latency, achieving multi-GB/sec aggregate read speeds across the cluster. It uses a sequential read pattern well suited to batch analytics, where processing large datasets in bulk matters more than random access speed.

Fault Tolerance (2 marks):
HDFS provides automatic failure recovery through block replication and heartbeat monitoring. The NameNode continuously monitors DataNode health via heartbeats. When a node fails, the NameNode detects the missing heartbeats and instructs healthy nodes to re-replicate its blocks, maintaining the configured replication factor without manual intervention or data loss.

Data Locality (1.5 marks):
HDFS enables processing data where it is stored rather than transferring large datasets over the network. MapReduce tasks preferentially run on nodes containing the required data blocks, reducing network traffic by up to 99% and improving processing speed by 10-100x compared to approaches that move data to the computation.

Scalability (1 mark):
HDFS scales horizontally by adding more commodity servers. Scaling from 10 to 100 nodes increases both storage capacity (10x) and processing power (10x) linearly with proportional cost growth, enabling petabyte-scale deployments.


Question 2: "Explain the write-once-read-many model of HDFS with example." (6 Marks)

Answer:

Model Explanation (2 marks):
HDFS follows write-once-read-many access pattern where files are created, optionally appended, and read multiple times without modification. Random writes and in-place modifications are not supported, simplifying data coherency and enabling high throughput optimizations.

Supported Operations (2 marks):
HDFS supports create file (write new data), append to file (add more data at the end), read file (multiple times for analysis), and delete file (remove when no longer needed). It does not support random writes or modifying data in the middle of existing files.

Example Usage (2 marks):
In a log file processing workflow, applications write daily server logs to HDFS as new files. Throughout the day, new log entries are appended to these files. MapReduce jobs then read these logs multiple times for different analyses: security analysis, performance monitoring, and user behavior tracking. After the retention period expires, old log files are deleted. This access pattern matches HDFS's write-once-read-many model exactly, enabling efficient storage and processing of Big Data.


Summary

Key Points for Revision:

  1. HDFS: Distributed file system for storing very large files across cluster
  2. Key Features: Distributed storage, high throughput, fault tolerance, data locality, scalability
  3. Block Size: Default 128 MB per block
  4. Replication: Default 3 copies of each block
  5. Access Pattern: Write-once-read-many model
  6. Optimization: High throughput batch processing, not low latency
  7. Scaling: Horizontal scaling by adding commodity nodes

Exam Tip

Always mention specific numbers: 128 MB block size, 3x replication factor, 99% network reduction with data locality. For architecture questions, draw the distribution of blocks across DataNodes. Explain how HDFS differs from traditional file systems.

