HDFS Internals
1. Definition
HDFS Internals refer to the underlying mechanisms and data structures that enable HDFS to manage metadata, track blocks, ensure data integrity, and maintain fault tolerance.
2. Metadata Storage
2.1 FsImage (File System Image)
Definition: Snapshot of entire file system namespace stored on disk.
Contents:
- Complete directory structure
- File-to-block mapping
- File permissions and quotas
- Block replication information
Location: ${dfs.namenode.name.dir}/current/fsimage_*
Size: Typically a few hundred MB to a few GB, depending on namespace size.
Example Entry:
/user/hadoop/sales.csv
- Size: 1 GB
- Blocks: [blk_1001, blk_1002, blk_1003, blk_1004, blk_1005, blk_1006, blk_1007, blk_1008]
- Replication: 3
- Permissions: rw-r--r--
- Owner: hadoop
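To make the structure concrete, here is a minimal Java sketch of how such a namespace entry could be modeled. The class and field names are illustrative, not HDFS's actual internal types; the real on-disk fsimage is a compact binary file, inspectable with the Offline Image Viewer (hdfs oiv).

// Hypothetical model of one namespace entry, mirroring the example above.
import java.util.List;

class NamespaceEntry {
    String path;          // /user/hadoop/sales.csv
    long sizeBytes;       // 1 GB
    List<String> blocks;  // [blk_1001, ..., blk_1008]
    short replication;    // 3
    String permissions;   // rw-r--r--
    String owner;         // hadoop
}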
2.2 Edit Logs
Definition: Transaction log recording all changes to file system namespace.
Purpose: Capture modifications between fsimage checkpoints.
Operations Logged:
- File creation/deletion
- Directory operations
- Permission changes
- Replication factor modifications
Location: ${dfs.namenode.name.dir}/current/edits_*
Example Entries:
OP_ADD: /user/hadoop/data.csv created
OP_DELETE: /user/hadoop/temp.txt deleted
OP_MKDIR: /user/hadoop/reports created
OP_SET_REPLICATION: /user/hadoop/critical.csv replication=5
Workflow:
NameNode Starts → Loads fsimage → Applies edit logs →
In-Memory Namespace Ready → New changes append to edit logs
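Conceptually, the edit log is a write-ahead log: each operation is appended and flushed to disk before the in-memory namespace changes. A minimal Java sketch of that pattern (simplified; not the actual FSEditLog class):

import java.io.FileWriter;
import java.io.IOException;

class SimpleEditLog {
    private final FileWriter out;

    SimpleEditLog(String logPath) throws IOException {
        out = new FileWriter(logPath, true);   // append-only
    }

    // Record the operation durably BEFORE applying it in memory.
    void logEdit(String op, String target) throws IOException {
        out.write(op + ": " + target + System.lineSeparator());
        out.flush();                           // real HDFS also syncs to disk
    }
}

// Usage: new SimpleEditLog("edits_inprogress").logEdit("OP_MKDIR", "/user/hadoop/reports");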
2.3 Checkpoint Process
Problem: Edit logs grow indefinitely, increasing startup time.
Solution: Periodically merge edit logs with fsimage.
Process:
Secondary NameNode downloads fsimage + edit logs → Applies edit transactions to fsimage →
Uploads merged fsimage to NameNode → NameNode replaces old fsimage, starts fresh edit log
Frequency: Every hour OR every 1 million transactions (whichever comes first).
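A toy model of the merge itself, treating the namespace as a map and each edit as an (op, path) record. This is a sketch under the assumption of only add/mkdir/delete operations, not the NameNode's real implementation:

import java.util.List;
import java.util.Map;

class Checkpointer {
    // Replay logged transactions on top of the loaded fsimage; the result
    // would then be serialized as fsimage_<txid> and the edit log truncated.
    static Map<String, String> merge(Map<String, String> fsimage, List<String[]> edits) {
        for (String[] e : edits) {                    // e = {op, path}
            switch (e[0]) {
                case "OP_ADD", "OP_MKDIR" -> fsimage.put(e[1], e[0]);
                case "OP_DELETE"          -> fsimage.remove(e[1]);
            }
        }
        return fsimage;
    }
}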
3. Block Management Internals
3.1 Block ID Generation
Structure: 64-bit block ID paired with a generation stamp.
Components:
- Block ID: Monotonically increasing 64-bit sequence number
- Generation Stamp: Version number, incremented when a block is reopened for append or recovery
Example: blk_1073741825_1001
- Block ID: 1073741825
- Generation Stamp: 1001
Purpose: Unique identification prevents conflicts during replication and recovery.
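A simplified Java generator showing how the two parts combine (the class and starting values are illustrative, not HDFS's internals):

import java.util.concurrent.atomic.AtomicLong;

class BlockIdGenerator {
    private final AtomicLong nextBlockId = new AtomicLong(1073741825L);
    private final AtomicLong genStamp    = new AtomicLong(1001L);

    // New block: next sequence number + current generation stamp.
    String newBlockName() {
        return "blk_" + nextBlockId.getAndIncrement() + "_" + genStamp.get();
    }

    // Bumped when a block is reopened for append or recovery, so stale
    // replicas (carrying an old stamp) can be detected and discarded.
    long bumpGenerationStamp() {
        return genStamp.incrementAndGet();
    }
}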
3.2 Block Scanner
Function: Periodically verifies data integrity on DataNodes.
Process:
- Read block data from disk
- Compute checksum
- Compare with stored checksum
- Report corrupted blocks to NameNode
Frequency: Every 3 weeks per block (default).
Corruption Handling:
Corrupted Block Detected → Report to NameNode →
NameNode marks block corrupt → Instructs re-replication from good replica →
Delete corrupted block → Maintain replication factor
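The scan itself is a chunk-by-chunk comparison. A minimal sketch using Java's built-in CRC32C, assuming the stored checksums were already loaded from the block's metadata file:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32C;

class BlockVerifier {
    // True if every 512-byte chunk matches its stored checksum.
    static boolean verify(Path blockFile, long[] storedChecksums) throws IOException {
        byte[] chunk = new byte[512];
        int i = 0;
        try (InputStream in = Files.newInputStream(blockFile)) {
            int n;
            while ((n = in.read(chunk)) > 0) {
                CRC32C crc = new CRC32C();
                crc.update(chunk, 0, n);
                if (crc.getValue() != storedChecksums[i++]) {
                    return false;              // report corruption to NameNode
                }
            }
        }
        return true;
    }
}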
4. Memory Management
4.1 NameNode RAM Usage
Metadata Size: Approximately 150 bytes per block.
Calculation:
NameNode RAM ≈ Number of blocks × 150 bytes
Number of blocks ≈ Total data size / Block size (128 MB default)
Example Calculation:
Cluster storing 10 PB:
10 PB = 10,000 TB = 10,000,000 GB
Blocks = 10,000,000 GB / 0.128 GB per block = 78,125,000 blocks
RAM needed = 78,125,000 × 150 bytes ≈ 12 GB
With per-replica metadata at 3× replication (approximated as a 1.5× factor):
Total RAM ≈ 12 GB × 1.5 = 18 GB
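The same estimate in executable form (the 1.5× replication factor mirrors the rule of thumb above, not an exact HDFS formula):

class NameNodeRamEstimate {
    public static void main(String[] args) {
        long dataBytes  = 10_000_000L * 1_000_000_000L;   // 10 PB, decimal units
        long blockBytes = 128L * 1_000_000;               // 128 MB blocks
        long blocks     = dataBytes / blockBytes;         // 78,125,000
        double ramGb    = blocks * 150.0 / 1e9;           // ≈ 11.7 GB
        System.out.printf("blocks=%d ram≈%.1f GB total≈%.1f GB%n",
                blocks, ramGb, ramGb * 1.5);              // total ≈ 17.6 GB
    }
}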
5. Data Integrity
5.1 Checksums
Mechanism: CRC-32C (Cyclic Redundancy Check) computed for each 512-byte chunk.
Storage: Checksum file stored alongside block file.
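Overhead (worked example, from the 512-byte chunk size and 4-byte CRC-32C values):
128 MB block / 512 bytes per chunk = 262,144 chunks
262,144 chunks × 4 bytes = 1 MB of checksum data per block (~0.8% overhead)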
Verification Points:
- Write Time: Client computes checksum, DataNode verifies
- Read Time: DataNode sends data + checksum, client verifies
- Background: Periodic block scanner verification
Corruption Detection:
Read Operation:
Client requests block → DataNode reads from disk →
Computes checksum → Compares with stored checksum →
If mismatch: Report corruption, read from replica →
If match: Send data to client
6. Safe Mode
Definition: Read-only startup state where HDFS verifies block replication.
Purpose: Ensure cluster health before allowing writes.
Entry Conditions:
- NameNode startup
- Manual activation by administrator
- Critical metadata inconsistencies
Exit Conditions:
- 99.9% of blocks meet minimum replication (default)
- 30 seconds elapsed since threshold met
During Safe Mode:
- ✅ Read operations allowed
- ❌ Write operations blocked
- ✅ Block reports processed
- ❌ No block deletions/replications
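A simplified sketch of the exit check. The 99.9% threshold and 30-second extension correspond to the dfs.namenode.safemode.threshold-pct and dfs.namenode.safemode.extension settings; the logic below is a toy model, not the NameNode's code:

class SafeModeMonitor {
    static final double THRESHOLD = 0.999;    // 99.9% of blocks reported
    static final long EXTENSION_MS = 30_000;  // linger 30 s past threshold

    private long thresholdMetAt = -1;

    // Called as DataNode block reports arrive.
    boolean canLeaveSafeMode(long reportedBlocks, long totalBlocks, long nowMs) {
        if ((double) reportedBlocks / totalBlocks < THRESHOLD) {
            thresholdMetAt = -1;              // fell back below threshold
            return false;
        }
        if (thresholdMetAt < 0) {
            thresholdMetAt = nowMs;           // threshold first reached
        }
        return nowMs - thresholdMetAt >= EXTENSION_MS;
    }
}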
Example:
NameNode Startup:
[2024-01-15 10:00:00] Entering safe mode
[2024-01-15 10:00:05] Loading fsimage (500 MB)
[2024-01-15 10:00:10] Replaying edits (1000 transactions)
[2024-01-15 10:00:15] Building block map from DataNode reports
[2024-01-15 10:02:30] 99.95% blocks replicated (threshold: 99.9%)
[2024-01-15 10:03:00] Leaving safe mode - cluster ready
Exam Pattern Questions and Answers
Question 1: "Explain fsimage and edit logs in HDFS with checkpoint process." (8 Marks)
Answer:
FsImage (2 marks): FsImage is a snapshot of the entire file system namespace stored on the NameNode's disk, containing the complete directory structure, file-to-block mappings, permissions, and replication information. It represents the file system state at a specific point in time, and its size ranges from a few hundred MB to a few GB depending on namespace size.
Edit Logs (2 marks): Edit logs are transaction logs recording all modifications to the file system namespace between fsimage checkpoints. They capture operations like file creation, deletion, directory operations, and permission changes. As the NameNode processes requests, changes are appended to the edit logs, ensuring durability of metadata modifications.
Checkpoint Process (4 marks): Checkpointing merges the edit logs with the fsimage to prevent indefinite growth of the edit logs. The Secondary NameNode downloads the current fsimage and edit logs from the NameNode, applies the edit log transactions to the fsimage to create an updated snapshot, and uploads the new fsimage back to the NameNode. This occurs every hour or every 1 million transactions, whichever comes first. The NameNode then replaces the old fsimage with the new one and starts a fresh edit log, keeping restart times short.
Question 2: "How does HDFS ensure data integrity?" (4 Marks)
Answer:
Checksums (2 marks): HDFS computes CRC-32C checksums for each 512-byte chunk of data. During writes, clients compute checksums and DataNodes verify them before storing. Checksum files are stored alongside block files on disk.
Verification (2 marks): Data integrity is verified at three points: during writes (client-computed checksums verified by the DataNode), during reads (the DataNode sends data plus checksums and the client verifies them), and through the background block scanner, which checks each block every 3 weeks by default. If corruption is detected, the NameNode is notified, re-replication from a good replica is triggered, and the corrupted block is deleted, maintaining data reliability without user intervention.
Summary
Key Points for Revision:
- FsImage: Namespace snapshot on disk
- Edit Logs: Transaction log of modifications
- Checkpointing: Merge edits with fsimage hourly or per 1M transactions
- Block ID: 64-bit with generation stamp
- Checksums: CRC-32C for 512-byte chunks
- Safe Mode: Read-only startup, exits at 99.9% replication
- Memory: ~150 bytes per block in NameNode RAM
Always mention specific numbers: 150 bytes per block, CRC-32C, 512-byte chunks, 99.9% threshold for safe mode, hourly checkpoints. Explain the complete checkpoint workflow showing Secondary NameNode's role.