Basic Hadoop Commands
1. Definition
Hadoop commands are shell-based instructions used to interact with the Hadoop Distributed File System (HDFS), manage files, monitor cluster health, and perform administrative tasks from the command line.
2. Command Structure
General Syntax:
hadoop fs -<command> <arguments>
or
hdfs dfs -<command> <arguments>
Note: For HDFS operations, hadoop fs and hdfs dfs behave identically. hadoop fs is the generic file system shell and also works with other Hadoop-supported file systems (such as the local file system or S3), while hdfs dfs targets HDFS specifically.
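For example, when the default file system (fs.defaultFS) points at HDFS, the following two commands produce the same listing:
# Equivalent when fs.defaultFS points at HDFS
hadoop fs -ls /user/hadoop
hdfs dfs -ls /user/hadoop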
3. File Operations Commands
3.1 Listing Files
Command: -ls (list directory contents)
# List files in root directory
hadoop fs -ls /
# List files in specific directory
hadoop fs -ls /user/hadoop/data
# List files recursively
hadoop fs -ls -R /user/hadoop
Output Example:
drwxr-xr-x - hadoop supergroup 0 2024-01-15 10:30 /user/hadoop/data
-rw-r--r-- 3 hadoop supergroup 10485760 2024-01-15 10:31 /user/hadoop/sales.csv
Output Fields: Permissions, Replication, Owner, Group, Size(bytes), Modification Time, Path
3.2 Creating Directories
Command: -mkdir (make directory)
# Create single directory
hadoop fs -mkdir /user/hadoop/reports
# Create nested directories
hadoop fs -mkdir -p /user/hadoop/data/2024/january
-p Flag: Creates parent directories if they don't exist
3.3 Uploading Files
Command: -put or -copyFromLocal (copy from local to HDFS)
# Upload single file
hadoop fs -put sales.csv /user/hadoop/data/
# Upload multiple files
hadoop fs -put *.csv /user/hadoop/data/
# Upload directory
hadoop fs -put /local/documents /user/hadoop/backup/
Exam Point: "The put command copies files from local file system to HDFS, automatically splitting large files into blocks and distributing them across DataNodes with configured replication factor."
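To observe this block layout after an upload, hdfs fsck can report the blocks behind a file (the path below is illustrative):
# Show block count, size, and replication for the uploaded file
hdfs fsck /user/hadoop/data/sales.csv -files -blocks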
3.4 Downloading Files
Command: -get or -copyToLocal (copy from HDFS to local)
# Download single file
hadoop fs -get /user/hadoop/results.txt /local/output/
# Download directory
hadoop fs -get /user/hadoop/reports /local/backup/
3.5 Viewing File Contents
Command: -cat (concatenate and display file)
# View entire file
hadoop fs -cat /user/hadoop/data.txt
# View multiple files
hadoop fs -cat /user/hadoop/part-*
# View file and pipe to more for pagination
hadoop fs -cat /user/hadoop/large-file.txt | more
Command: -tail (display last KB of file)
# View last 1KB of file
hadoop fs -tail /user/hadoop/logfile.txt
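The shell also accepts a -f flag (analogous to Unix tail -f) that keeps printing data as it is appended, which is useful for log files still being written:
# Follow a growing file; press Ctrl-C to stop
hadoop fs -tail -f /user/hadoop/logfile.txt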
3.6 Copying Within HDFS
Command: -cp (copy files within HDFS)
# Copy file
hadoop fs -cp /user/hadoop/source.txt /user/hadoop/backup/
# Copy directory
hadoop fs -cp /user/hadoop/data /user/hadoop/archive/
3.7 Moving/Renaming Files
Command: -mv (move or rename)
# Rename file
hadoop fs -mv /user/hadoop/old-name.txt /user/hadoop/new-name.txt
# Move file to different directory
hadoop fs -mv /user/hadoop/file.txt /user/hadoop/archive/
# Move directory
hadoop fs -mv /user/hadoop/temp /user/hadoop/old-data
3.8 Deleting Files
Command: -rm (remove files)
# Delete single file
hadoop fs -rm /user/hadoop/temp.txt
# Delete multiple files
hadoop fs -rm /user/hadoop/temp-*
# Delete directory recursively
hadoop fs -rm -r /user/hadoop/old-data
# Skip trash and delete permanently
hadoop fs -rm -r -skipTrash /user/hadoop/huge-dataset
Important: When trash is enabled (fs.trash.interval > 0; it is disabled by default in vanilla Apache Hadoop but commonly enabled by distributions, often at 24 hours), deleted files are moved to trash rather than removed immediately and can be recovered within the retention period.
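Trashed files land under the user's .Trash directory, mirroring their original path, so they can be inspected and restored with ordinary shell commands (the user name and paths below are illustrative):
# Inspect the current trash checkpoint
hadoop fs -ls /user/hadoop/.Trash/Current/user/hadoop/
# Restore a file by moving it back out of trash
hadoop fs -mv /user/hadoop/.Trash/Current/user/hadoop/temp.txt /user/hadoop/
# Permanently empty the trash
hadoop fs -expunge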
4. File Information Commands
4.1 Check File Size
Command: -du (disk usage)
# Size of files in directory
hadoop fs -du /user/hadoop/data
# Human-readable sizes
hadoop fs -du -h /user/hadoop/data
# Summary of total size
hadoop fs -du -s -h /user/hadoop/data
Output Example:
10485760 /user/hadoop/data/sales.csv (10 MB)
52428800 /user/hadoop/data/logs.txt (50 MB)
4.2 Count Files and Directories
Command: -count (count directories, files, and bytes)
hadoop fs -count /user/hadoop/data
Output Columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
Example Output:
5 20 1073741824 /user/hadoop/data
(5 directories, 20 files, 1 GB total size)
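If quotas are configured on the directory, -count -q additionally reports name and space quota usage:
# Adds QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA columns before the usual output
hadoop fs -count -q /user/hadoop/data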
5. Advanced Commands
5.1 Change File Permissions
Command: -chmod (change mode/permissions)
# Give all permissions to owner
hadoop fs -chmod 700 /user/hadoop/private-data.txt
# Make file readable by all
hadoop fs -chmod 644 /user/hadoop/public-file.txt
# Change recursively
hadoop fs -chmod -R 755 /user/hadoop/shared/
Permission Notation:
- 7 (rwx): Read, Write, Execute
- 6 (rw-): Read, Write
- 5 (r-x): Read, Execute
- 4 (r--): Read only
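Like Unix chmod, the HDFS shell also accepts symbolic modes, which can be easier to read than octal digits (the path below is illustrative):
# Add execute for the owner, remove write for group and others
hadoop fs -chmod u+x,go-w /user/hadoop/scripts/run.sh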
5.2 Change File Owner
Command: -chown (change ownership)
# Change owner
hadoop fs -chown newuser /user/hadoop/file.txt
# Change owner and group
hadoop fs -chown newuser:newgroup /user/hadoop/file.txt
# Change recursively
hadoop fs -chown -R hadoop:supergroup /user/hadoop/project/
5.3 Set Replication Factor
Command: -setrep (set replication factor)
# Set replication to 5 for specific file
hadoop fs -setrep 5 /user/hadoop/critical-data.txt
# Set replication for directory recursively
hadoop fs -setrep -R 2 /user/hadoop/archive/
# Wait for replication to complete
hadoop fs -setrep -w 3 /user/hadoop/important.txt
Use Cases:
- Increase replication for critical data (higher availability)
- Decrease replication for archival data (save space)
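The current replication factor appears in the second column of -ls output; it can also be queried directly with -stat, where the %r format specifier prints replication:
# Print only the replication factor of a file
hadoop fs -stat "%r" /user/hadoop/critical-data.txt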
6. Cluster Administration Commands
6.1 HDFS Report
Command: hdfs dfsadmin -report
hdfs dfsadmin -report
Output Information:
- Total cluster capacity
- Used space
- Available space
- Number of DataNodes
- Live/Dead nodes
- Block pool usage
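Since the report is verbose, it is often filtered with standard Unix tools; for example, to check node liveness (exact labels may vary slightly between Hadoop versions):
hdfs dfsadmin -report | grep -E "Live datanodes|Dead datanodes"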
6.2 Safe Mode Operations
Command: Safe mode management
# Check safe mode status
hdfs dfsadmin -safemode get
# Enter safe mode
hdfs dfsadmin -safemode enter
# Leave safe mode
hdfs dfsadmin -safemode leave
# Wait for safe mode to be off
hdfs dfsadmin -safemode wait
Safe Mode: A read-only state the NameNode enters automatically during startup (while loading metadata and collecting block reports from DataNodes) or on demand for maintenance; file writes, deletions, and replication changes are blocked until it is left.
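The wait subcommand blocks until safe mode is off, which makes it useful in startup scripts that must not write to HDFS prematurely. A minimal sketch:
#!/bin/bash
# Block until the NameNode leaves safe mode, then start jobs
hdfs dfsadmin -safemode wait
echo "HDFS is writable; starting scheduled jobs..."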
6.3 Check File System
Command: hdfs fsck (file system check)
# Check file system health
hdfs fsck /
# Check specific directory
hdfs fsck /user/hadoop/data
# Show file blocks and locations
hdfs fsck /user/hadoop/data -files -blocks -locations
Use Cases:
- Identify corrupted blocks
- Find under-replicated blocks
- Verify data integrity
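Two fsck options are especially useful here: -list-corruptfileblocks prints only the affected files, and -delete removes them (irreversibly, so use with caution):
# List files with corrupt blocks
hdfs fsck / -list-corruptfileblocks
# Delete corrupted files (cannot be undone)
hdfs fsck / -delete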
7. Quick Reference Table
| Command | Purpose | Example |
|---|---|---|
| -ls | List files | hadoop fs -ls /user/hadoop |
| -mkdir | Create directory | hadoop fs -mkdir /user/hadoop/data |
| -put | Upload file | hadoop fs -put file.txt /user/hadoop/ |
| -get | Download file | hadoop fs -get /user/hadoop/file.txt . |
| -cat | View file | hadoop fs -cat /user/hadoop/data.txt |
| -cp | Copy in HDFS | hadoop fs -cp /src /dest |
| -mv | Move/rename | hadoop fs -mv /old /new |
| -rm | Delete | hadoop fs -rm -r /user/hadoop/temp |
| -du | Disk usage | hadoop fs -du -h /user/hadoop |
| -chmod | Change permissions | hadoop fs -chmod 755 /path |
| -setrep | Set replication | hadoop fs -setrep 3 /path |
8. Practical Example Workflow
Scenario: Daily ETL Process
# 1. Create directory for today's data
hadoop fs -mkdir -p /user/hadoop/sales/2024-01-15
# 2. Upload sales data from local system
hadoop fs -put /local/data/sales_2024-01-15.csv /user/hadoop/sales/2024-01-15/
# 3. Verify upload
hadoop fs -ls /user/hadoop/sales/2024-01-15/
# 4. Check file size
hadoop fs -du -h /user/hadoop/sales/2024-01-15/sales_2024-01-15.csv
# 5. Run processing (MapReduce job - shown as example command)
hadoop jar analytics.jar SalesAnalysis /user/hadoop/sales/2024-01-15 /user/hadoop/results/2024-01-15
# 6. View results
hadoop fs -cat /user/hadoop/results/2024-01-15/part-r-00000
# 7. Download results to local
hadoop fs -get /user/hadoop/results/2024-01-15 /local/reports/
# 8. Archive old data with lower replication
hadoop fs -setrep -R 1 /user/hadoop/sales/2024-01-01
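The same workflow can be parameterized by date and run from a scheduler such as cron; a minimal sketch (the jar name and paths are assumptions carried over from the example above):
#!/bin/bash
set -euo pipefail
TODAY=$(date +%F)                      # e.g. 2024-01-15
LOCAL_SRC="/local/data/sales_${TODAY}.csv"
HDFS_DIR="/user/hadoop/sales/${TODAY}"

hadoop fs -mkdir -p "${HDFS_DIR}"
hadoop fs -put "${LOCAL_SRC}" "${HDFS_DIR}/"
hadoop jar analytics.jar SalesAnalysis "${HDFS_DIR}" "/user/hadoop/results/${TODAY}"
hadoop fs -get "/user/hadoop/results/${TODAY}" /local/reports/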
Exam Pattern Questions and Answers
Question 1: "Explain basic Hadoop commands for file operations with examples." (8 Marks)
Answer:
Listing Files (1.5 marks): The -ls command lists directory contents in HDFS. Syntax: hadoop fs -ls /user/hadoop/data displays files with permissions, replication factor, owner, group, size, modification time, and path. The -R flag enables recursive listing of subdirectories.
Uploading Files (1.5 marks): The -put command copies files from local file system to HDFS. Syntax: hadoop fs -put sales.csv /user/hadoop/data/ uploads the file, automatically splitting it into blocks and distributing across DataNodes with configured replication factor, typically 3 copies for fault tolerance.
Downloading Files (1.5 marks): The -get command copies files from HDFS to local file system. Syntax: hadoop fs -get /user/hadoop/results.txt /local/output/ downloads the file by collecting all blocks from DataNodes and reassembling them in local directory.
Viewing Files (1 mark): The -cat command displays file contents. Syntax: hadoop fs -cat /user/hadoop/data.txt concatenates and prints file to standard output, useful for viewing small files or checking data.
Deleting Files (1 mark): The -rm command removes files from HDFS. Syntax: hadoop fs -rm -r /user/hadoop/temp deletes directory recursively. Deleted files go to trash by default and can be recovered within retention period.
File Movement (1.5 marks): The -cp and -mv commands copy and move files within HDFS respectively. hadoop fs -cp /source /destination creates copy while hadoop fs -mv /old /new moves or renames files atomically without data transfer across network.
Question 2: "Write commands to perform following HDFS operations: (a) Create directory (b) Upload file (c) Check file size (d) Set replication to 5." (4 Marks)
Answer:
(a) Create Directory (1 mark):
hadoop fs -mkdir -p /user/hadoop/project/data
The -p flag creates parent directories if they don't exist.
(b) Upload File (1 mark):
hadoop fs -put /local/data/sales.csv /user/hadoop/project/data/
This uploads sales.csv from local file system to HDFS project directory.
(c) Check File Size (1 mark):
hadoop fs -du -h /user/hadoop/project/data/sales.csv
The -h flag displays size in human-readable format (KB, MB, GB).
(d) Set Replication (1 mark):
hadoop fs -setrep 5 /user/hadoop/project/data/sales.csv
This changes replication factor to 5, creating 5 copies across DataNodes for higher availability.
Summary
Key Points for Revision:
- File Operations: ls, mkdir, put, get, cat, cp, mv, rm
- Information Commands: du (size), count (statistics)
- Administration: chmod (permissions), chown (ownership), setrep (replication)
- Cluster Commands: dfsadmin -report, safemode, fsck
- Command Syntax: hadoop fs -<command> <arguments>
- Default Replication: 3 copies
- Trash: Deleted files recoverable within fs.trash.interval when trash is enabled (commonly 24 hours)
For command questions, always write complete syntax with path examples. Mention what happens internally (e.g., "put command splits file into blocks and replicates across DataNodes"). Include flags like -R (recursive), -p (parents), -h (human-readable) where applicable.