Streaming Data
1. Definition
Streaming data refers to continuous flows of data generated in real time by various sources that must be processed immediately or near-immediately to derive timely insights and trigger actions.
2. Characteristics of Streaming Data
- Continuous Generation: Data is produced constantly without predefined start or end
- High Velocity: Data arrives at rapid rates (thousands to millions of events per second)
- Time-Sensitive: Value diminishes quickly if not processed immediately
- Unbounded: Infinite stream with no fixed size
- Event-Driven: Each data point represents an event or observation
3. Sources of Streaming Data
3.1 Sources Categorized by Domain
| Domain | Source | Example Data |
|---|---|---|
| Social Media | Twitter, Facebook | Tweets, posts, likes, shares |
| IoT Devices | Sensors, smart meters | Temperature, pressure, vibration readings |
| Financial | Stock exchanges | Stock prices, trade executions |
| Web/Mobile Apps | Clickstreams | Page views, clicks, app events |
| Transportation | GPS devices | Vehicle location, speed, routes |
| Telecom | Network equipment | Call records, data usage, signal strength |
3.2 Example - E-commerce Streaming Data
Every Second:
- 1000+ product page views
- 200+ search queries
- 50+ items added to cart
- 10+ completed transactions
- 500+ user clicks and interactions
4. Stream Processing vs Batch Processing
| Aspect | Stream Processing | Batch Processing |
|---|---|---|
| Data | Continuous, unbounded stream | Bounded dataset accumulated over time |
| Latency | Milliseconds to seconds | Hours to days |
| Typical Use | Real-time decisions and immediate alerts | Historical analysis and comprehensive reports |
| Example | Fraud detection on live card transactions | Monthly sales report generated at month-end |
5. Stream Processing Architecture
Basic Components:
- Data sources (sensors, applications, devices)
- Ingestion / message broker (e.g., Kafka topics)
- Stream processor
- Real-time analytics
- Sinks: dashboards, alerts, databases
Workflow:
Sensors → Kafka Topics → Stream Processor → Real-time Analytics →
Dashboard + Alerts + Database
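To make this workflow concrete, here is a minimal sketch in plain Python that mimics the same shape with in-process generators; the sensor fields, threshold, and sink behaviour are illustrative assumptions, not a real Kafka pipeline.

```python
import random
import time

def sensor_stream(n_events=5):
    """Simulate a sensor source emitting one reading per event (stands in for Kafka ingestion)."""
    for _ in range(n_events):
        yield {"sensor_id": "S1", "temperature": round(random.uniform(20, 90), 1), "ts": time.time()}

def process(event):
    """Stream processor step: flag readings above an (assumed) threshold."""
    event["alert"] = event["temperature"] > 80
    return event

def sink(event):
    """Sinks: dashboard, alerts, and database would consume the processed event here."""
    destination = "ALERT + dashboard + DB" if event["alert"] else "dashboard + DB"
    print(f"{destination}: {event}")

for raw in sensor_stream():
    sink(process(raw))
```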
6. Stream Processing Patterns
6.1 Filtering
Purpose: Select only relevant events from the stream
Example:
Input Stream: All transactions
Filter: Amount > ₹1,00,000
Output Stream: Large transactions only
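A minimal sketch of the filtering pattern in Python, assuming each event is a dictionary with an `amount` field; the sample transactions and threshold mirror the example above.

```python
# Filter pattern: keep only transactions above the threshold (1,00,000).
transactions = [
    {"txn_id": "T1", "amount": 45_000},
    {"txn_id": "T2", "amount": 250_000},
    {"txn_id": "T3", "amount": 120_000},
]

THRESHOLD = 100_000

# A generator keeps the filter lazy, mirroring an unbounded input stream.
large_transactions = (t for t in transactions if t["amount"] > THRESHOLD)

for txn in large_transactions:
    print("Large transaction:", txn)
```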
6.2 Transformation
Purpose: Convert data format or enrich events
Example:
Input: {"user_id": "U123", "amount": 50}
Transform: Add user_name from database
Output: {"user_id": "U123", "user_name": "Raj", "amount": 50}
6.3 Aggregation
Purpose: Compute metrics over time windows
Example:
Input Stream: Individual stock trades
Aggregate: Count trades per 5-minute window
Output: {"symbol": "RELIANCE", "count": 1250, "window": "10:00-10:05"}
6.4 Windowing
Types of Windows (see the sketch after this list):
- Tumbling Window: Fixed, non-overlapping time periods
  - Example: Count events every 10 minutes (10:00-10:10, 10:10-10:20)
- Sliding Window: Overlapping time periods
  - Example: Average of last 5 minutes, updated every minute
- Session Window: Based on periods of activity
  - Example: Group user clicks until 30 minutes of inactivity
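The sketch below illustrates how a single event timestamp (in seconds) maps to tumbling versus sliding windows; the window sizes follow the examples above, and the helper functions are illustrative assumptions.

```python
# Assign an event timestamp (in seconds) to its window(s).
def tumbling_window(ts, size=600):
    """A tumbling window covers exactly one fixed, non-overlapping 10-minute interval."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size=300, slide=60):
    """A 5-minute window that advances every minute: one event belongs
    to several overlapping windows."""
    first_start = ((ts - size) // slide + 1) * slide
    return [(s, s + size) for s in range(max(first_start, 0), ts + 1, slide) if s <= ts < s + size]

event_ts = 615  # an event arriving 10 minutes 15 seconds into the stream
print("Tumbling:", tumbling_window(event_ts))  # [(600, 1200)]
print("Sliding:", sliding_windows(event_ts))   # five overlapping (start, end) pairs
```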
7. Popular Stream Processing Technologies
7.1 Apache Kafka
Definition: Distributed streaming platform for building real-time data pipelines and streaming applications.
Key Features:
- High throughput (millions of messages/second)
- Fault-tolerant and durable
- Horizontal scalability
- Publish-subscribe messaging
Components:
- Producers: Applications that publish data to Kafka
- Consumers: Applications that read data from Kafka
- Topics: Categories to which records are published
- Brokers: Kafka servers that store data
Use Case: Uber uses Kafka to process trip events in real-time for ride matching and surge pricing.
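A minimal producer/consumer sketch using the third-party kafka-python client; the broker address (`localhost:9092`), topic name (`trip-events`), and event fields are assumptions for illustration.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: publish trip events to a topic (broker and topic are assumed).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("trip-events", {"trip_id": "T42", "status": "started"})
producer.flush()

# Consumer: read events from the same topic and process them one by one.
consumer = KafkaConsumer(
    "trip-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:  # blocks and keeps consuming, like a real stream job
    print("Received:", message.value)
```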
7.2 Apache Spark Streaming
Definition: Extension of Apache Spark for processing live data streams.
Approach: Micro-batch processing (processes small batches every few seconds)
Features:
- Integration with Spark ecosystem (SQL, MLlib)
- Exactly-once semantics
- Fault tolerance through lineage
Use Case: Netflix uses Spark Streaming for real-time recommendation updates and quality monitoring.
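A minimal PySpark Structured Streaming sketch showing the micro-batch approach (a word count over a socket source, similar to the standard Spark example); the host and port are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Source: a text socket (host/port assumed); each micro-batch arrives as a small DataFrame.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Transformation: split lines into words and keep a running count.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Sink: print the updated counts to the console after every micro-batch.
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```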
7.3 Apache Flink
Definition: Distributed stream processing framework for high-performance streaming applications.
Approach: True stream processing (event-by-event)
Features:
- Low latency (sub-second)
- Stateful computations
- Event-time processing
- Exactly-once consistency
Use Case: Alibaba uses Flink for real-time search ranking and fraud detection.
7.4 Apache Storm
Definition: Real-time computation system for processing unbounded streams of data.
Features:
- Real-time processing guarantees
- Scalable and fault-tolerant
- Simple programming model
Use Case: Twitter uses Storm for real-time trending topics and analytics.
8. Applications of Streaming Data
8.1 Fraud Detection
Scenario: Credit card transactions
Transaction Event → Stream Processor → Pattern Matching →
Alert if Suspicious (within milliseconds)
Detection Rules:
- Multiple transactions from different locations within minutes
- Transaction amount exceeds historical average by 10x
- Purchase from previously unseen merchant category
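A rule-based sketch of these checks in plain Python; the profile values, thresholds, and transaction fields are illustrative assumptions.

```python
# Compare each incoming transaction against a simple per-card profile (illustrative data).
profile = {"avg_amount": 2_000, "last_location": "Mumbai", "known_categories": {"grocery", "fuel"}}

def is_suspicious(txn, profile):
    """Return the list of rules the transaction violates (empty list means it looks normal)."""
    reasons = []
    if txn["amount"] > 10 * profile["avg_amount"]:
        reasons.append("amount exceeds 10x historical average")
    if txn["location"] != profile["last_location"] and txn["minutes_since_last_txn"] < 10:
        reasons.append("different location within minutes of last transaction")
    if txn["category"] not in profile["known_categories"]:
        reasons.append("previously unseen merchant category")
    return reasons

txn = {"amount": 25_000, "location": "Delhi", "minutes_since_last_txn": 4, "category": "jewellery"}
print(is_suspicious(txn, profile))  # all three rules fire for this event
```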
8.2 IoT Monitoring
Scenario: Manufacturing plant sensors
Sensor Data → Stream → Anomaly Detection →
Alert if equipment failure predicted
Monitoring:
- Temperature exceeds threshold → Immediate shutdown
- Vibration pattern changes → Maintenance alert
- Production rate drops → Investigation trigger
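A minimal Python sketch that maps each sensor reading to one of these actions; the threshold values and field names are illustrative assumptions.

```python
# Map each sensor reading to an action based on the monitoring rules above.
def monitor(reading):
    if reading["temperature"] > 95:          # assumed shutdown threshold
        return "IMMEDIATE SHUTDOWN"
    if reading["vibration"] > 7.0:           # assumed vibration limit
        return "MAINTENANCE ALERT"
    if reading["production_rate"] < 80:      # assumed minimum units/hour
        return "INVESTIGATION TRIGGER"
    return "OK"

readings = [
    {"temperature": 98, "vibration": 3.1, "production_rate": 110},
    {"temperature": 70, "vibration": 8.2, "production_rate": 105},
    {"temperature": 72, "vibration": 2.5, "production_rate": 60},
]
for r in readings:
    print(monitor(r), r)
```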
8.3 Real-time Recommendations
Scenario: E-commerce personalization
User Actions → Stream → ML Model →
Updated Recommendations (within seconds)
Events Processed:
- Product views → Update interest profile
- Items added to cart → Recommend complementary products
- Searches → Show relevant results immediately
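A minimal Python sketch of maintaining a per-user interest profile from these events; the event weights and categories are illustrative assumptions, and a real system would feed the profile into an ML model.

```python
from collections import Counter

# Per-user interest profile (category -> weight), updated event by event.
EVENT_WEIGHTS = {"view": 1, "add_to_cart": 3, "search": 2}  # assumed weights

def update_profile(profile, event):
    profile[event["category"]] += EVENT_WEIGHTS.get(event["type"], 0)
    return profile

profile = Counter()
events = [
    {"type": "view", "category": "laptops"},
    {"type": "search", "category": "laptops"},
    {"type": "add_to_cart", "category": "headphones"},
]
for e in events:
    update_profile(profile, e)

# Recommend from the categories with the highest accumulated interest.
print(profile.most_common(2))  # [('laptops', 3), ('headphones', 3)]
```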
8.4 Social Media Analytics
Scenario: Trending topics detection
Tweet Stream → Hashtag Counter → Trending Algorithm →
Update Trending Sidebar (every minute)
Metrics Tracked:
- Tweet velocity (tweets/minute)
- User reach (unique users discussing)
- Sentiment analysis (positive/negative ratio)
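A minimal Python sketch of per-minute hashtag counting; the tweets and window granularity are illustrative, and a real trending algorithm would compare these counts against historical baselines.

```python
import re
from collections import Counter

# Count hashtag mentions per minute window (illustrative tweets).
tweets = [
    {"text": "Big match today #Cricket #India", "minute": "10:00"},
    {"text": "What a win! #Cricket", "minute": "10:00"},
    {"text": "New phone launch #Tech", "minute": "10:01"},
]

counts = Counter()
for tweet in tweets:
    for tag in re.findall(r"#\w+", tweet["text"]):
        counts[(tweet["minute"], tag.lower())] += 1

for (minute, tag), n in counts.most_common():
    print(minute, tag, n)
```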
9. Challenges in Stream Processing
- Complexity: More difficult than batch processing
- Late Data: Handling events arriving out of order
- State Management: Maintaining state across events
- Scalability: Processing millions of events per second
- Fault Tolerance: Ensuring no data loss during failures
- Testing: Difficult to test streaming applications
Exam Pattern Questions and Answers
Question 1: "What is streaming data? Explain its characteristics and sources." (8 Marks)
Answer:
Definition (2 marks):
Streaming data refers to continuous flows of data generated in real-time by various sources that need to be processed immediately to derive timely insights and trigger actions. Unlike batch data collected over time, streaming data is generated continuously without predefined start or end.
Characteristics (3 marks):
Streaming data has five key characteristics. First, continuous generation where data is produced constantly. Second, high velocity with thousands to millions of events per second. Third, time-sensitivity meaning value diminishes if not processed immediately. Fourth, unbounded nature with infinite stream having no fixed size. Fifth, event-driven where each data point represents a specific event or observation.
Sources (3 marks):
Major sources include social media platforms like Twitter generating tweets and posts continuously, IoT devices such as sensors producing temperature and pressure readings, financial systems with stock exchanges generating trade data, web and mobile applications creating clickstream data from user interactions, transportation systems with GPS devices tracking vehicle locations, and telecom networks monitoring call records and data usage patterns in real-time.
Question 2: "Differentiate between stream processing and batch processing with examples." (6 Marks)
Answer:
Stream Processing (3 marks):
Stream processing handles continuous real-time data flows with processing time in milliseconds to seconds. It deals with small unbounded streams that are infinite in nature. Used for real-time decisions and immediate alerts. For example, fraud detection systems process credit card transactions in real-time, analyzing each transaction within milliseconds to identify suspicious patterns and block fraudulent activity before completion.
Batch Processing (3 marks):
Batch processing accumulates data over time and processes large bounded datasets with processing time ranging from hours to days. Used for historical analysis and comprehensive reports. For example, monthly sales reports where data is collected throughout the month and processed at month-end to generate insights about sales trends, top-performing products, and regional performance.
Summary
Key Points for Revision:
- Streaming Data: Continuous real-time data flows
- Characteristics: Continuous, high velocity, time-sensitive, unbounded, event-driven
- Sources: Social media, IoT, financial, web/mobile, transportation, telecom
- vs Batch: Stream is real-time/continuous, Batch is periodic/accumulated
- Technologies: Kafka, Spark Streaming, Flink, Storm
- Applications: Fraud detection, IoT monitoring, recommendations, social analytics
For streaming questions, emphasize the real-time nature and provide concrete examples. Always mention at least two technologies (Kafka, Spark Streaming) and two use cases (fraud detection, IoT monitoring).