Streaming Data

1. Definition

Streaming data refers to continuous flows of data generated in real time by various sources that must be processed immediately or near-immediately to derive timely insights and trigger actions.


2. Characteristics of Streaming Data

  1. Continuous Generation: Data is produced constantly without predefined start or end
  2. High Velocity: Data arrives at rapid rates (thousands to millions of events per second)
  3. Time-Sensitive: Value diminishes quickly if not processed immediately
  4. Unbounded: Infinite stream with no fixed size
  5. Event-Driven: Each data point represents an event or observation

3. Sources of Streaming Data

3.1 Categorized by Domain

Domain          | Source                | Example Data
----------------|-----------------------|------------------------------------------
Social Media    | Twitter, Facebook     | Tweets, posts, likes, shares
IoT Devices     | Sensors, smart meters | Temperature, pressure, vibration readings
Financial       | Stock exchanges       | Stock prices, trade executions
Web/Mobile Apps | Clickstreams          | Page views, clicks, app events
Transportation  | GPS devices           | Vehicle location, speed, routes
Telecom         | Network equipment     | Call records, data usage, signal strength

3.2 Example - E-commerce Streaming Data

Every Second:
- 1000+ product page views
- 200+ search queries
- 50+ items added to cart
- 10+ completed transactions
- 500+ user clicks and interactions
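
To get a sense of scale, the per-second figures above can be totalled and extrapolated to a full day. This is a rough sketch only: it treats the "+" lower bounds as exact rates, which is an assumption made for illustration.

```python
# Lower-bound event rates per second, taken from the list above.
RATES_PER_SECOND = {
    "page_views": 1000,
    "search_queries": 200,
    "cart_additions": 50,
    "transactions": 10,
    "clicks": 500,
}

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def total_per_second(rates):
    """Sum the per-second rates of all event types."""
    return sum(rates.values())


def total_per_day(rates):
    """Extrapolate the combined rate to one day, assuming a constant rate."""
    return total_per_second(rates) * SECONDS_PER_DAY


print(total_per_second(RATES_PER_SECOND))  # 1760
print(total_per_day(RATES_PER_SECOND))     # 152064000
```

Even at these modest lower bounds, a single site would produce over 150 million events per day.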

4. Stream Processing vs Batch Processing

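Aspect          | Stream Processing                     | Batch Processing
----------------|---------------------------------------|---------------------------------------
Data            | Continuous, unbounded streams         | Accumulated, bounded datasets
Processing time | Milliseconds to seconds               | Hours to days
Used for        | Real-time decisions, immediate alerts | Historical analysis, comprehensive reports
Example         | Credit card fraud detection           | Monthly sales reports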


5. Stream Processing Architecture

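Basic Components: data sources (e.g. sensors), a message broker (e.g. Kafka topics), a stream processor, a real-time analytics layer, and output sinks (dashboard, alerts, database).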

Workflow:

Sensors → Kafka Topics → Stream Processor → Real-time Analytics → 
Dashboard + Alerts + Database
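
The workflow above can be imitated in plain Python by chaining generators, so that each event flows through every stage as soon as it is produced. This is only a sketch: the sensor data is simulated, the message broker is left out, and the names `sensor_readings`, `process`, and `run_pipeline` are made up for this example.

```python
import random


def sensor_readings(n, seed=0):
    """Source: yield n simulated temperature readings (hypothetical data)."""
    rng = random.Random(seed)
    for i in range(n):
        yield {"sensor_id": "S1", "seq": i, "temp_c": round(rng.uniform(60, 100), 1)}


def process(events, threshold_c=90.0):
    """Stream processor: tag each event as it arrives."""
    for event in events:
        yield {**event, "alert": event["temp_c"] > threshold_c}


def run_pipeline(events):
    """Sinks: store every event and collect alerts separately."""
    database, alerts = [], []
    for event in events:
        database.append(event)
        if event["alert"]:
            alerts.append(event)
    return database, alerts


database, alerts = run_pipeline(process(sensor_readings(100)))
print(len(database), "events stored,", len(alerts), "alerts raised")
```

Because generators are lazy, no stage waits for the whole input; a real deployment would replace the source with a broker consumer and the lists with actual sinks.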

6. Stream Processing Patterns

6.1 Filtering

Purpose: Select only relevant events from the stream

Example:

Input Stream: All transactions
Filter: Amount > ₹1,00,000
Output Stream: Large transactions only
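
A minimal sketch of this filter as a Python generator follows. The transaction records and the `amount` field name are assumptions made for illustration.

```python
def large_transactions(transactions, threshold=100_000):
    """Yield only transactions whose amount exceeds the threshold."""
    for txn in transactions:
        if txn["amount"] > threshold:
            yield txn


stream = [
    {"id": "T1", "amount": 2_500},
    {"id": "T2", "amount": 150_000},
    {"id": "T3", "amount": 100_000},  # equal to the threshold, so not "greater than"
    {"id": "T4", "amount": 320_000},
]

large = [txn["id"] for txn in large_transactions(stream)]
print(large)  # ['T2', 'T4']
```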

6.2 Transformation

Purpose: Convert data format or enrich events

Example:

Input: {"user_id": "U123", "amount": 50}
Transform: Add user_name from database
Output: {"user_id": "U123", "user_name": "Raj", "amount": 50}
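
The enrichment step might look like the sketch below. The dictionary `USER_NAMES` stands in for the database lookup and is purely hypothetical.

```python
# Stand-in for a database table mapping user IDs to names (hypothetical).
USER_NAMES = {"U123": "Raj"}


def enrich(events, user_names):
    """Add a user_name field to each event, looked up by user_id."""
    for event in events:
        name = user_names.get(event["user_id"], "unknown")
        yield {**event, "user_name": name}


enriched = list(enrich([{"user_id": "U123", "amount": 50}], USER_NAMES))
print(enriched)
```

In practice the lookup is often served from a cache, since a database round trip per event would limit throughput.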

6.3 Aggregation

Purpose: Compute metrics over time windows

Example:

Input Stream: Individual stock trades
Aggregate: Count trades per 5-minute window
Output: {"symbol": "RELIANCE", "count": 1250, "window": "10:00-10:05"}
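
The following sketch counts trades per 5-minute window. It runs over a finite list for simplicity, whereas a real stream processor would emit each window's result as the window closes. The `time` and `symbol` field names are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)


def window_start(ts, size=WINDOW):
    """Floor a timestamp to the start of its window (windows aligned to midnight)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((ts - midnight) // size) * size


def count_trades(trades, size=WINDOW):
    """Count trades per (symbol, window) and format one record per window."""
    counts = Counter()
    for trade in trades:
        counts[(trade["symbol"], window_start(trade["time"], size))] += 1
    return [
        {"symbol": symbol, "count": n, "window": f"{start:%H:%M}-{start + size:%H:%M}"}
        for (symbol, start), n in sorted(counts.items())
    ]


trades = [
    {"symbol": "RELIANCE", "time": datetime(2024, 1, 1, 10, 0, 30)},
    {"symbol": "RELIANCE", "time": datetime(2024, 1, 1, 10, 4, 59)},
    {"symbol": "RELIANCE", "time": datetime(2024, 1, 1, 10, 5, 0)},
]
for row in count_trades(trades):
    print(row)
```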

6.4 Windowing

Types of Windows:

  1. Tumbling Window: Fixed, non-overlapping time periods
    • Example: Count events every 10 minutes (10:00-10:10, 10:10-10:20)
  2. Sliding Window: Overlapping time periods
    • Example: Average of last 5 minutes, updated every minute
  3. Session Window: Based on periods of activity
    • Example: Group user clicks until 30 minutes of inactivity
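
The three window types differ mainly in how a timestamp is assigned to windows. The sketch below uses plain integer timestamps (for example, minutes) and half-open intervals [start, end); the function names are illustrative and not taken from any particular framework.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window that contains ts."""
    start = (ts // size) * size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Return every window of length `size`, starting at a multiple of `slide`, that contains ts."""
    windows = []
    start = (ts // slide) * slide  # latest window start that is <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a gap of at least `gap` starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


print(tumbling_window(17, size=10))              # (10, 20)
print(sliding_windows(7, size=5, slide=1))       # five overlapping windows
print(session_windows([0, 5, 20, 60, 70], 30))   # [[0, 5, 20], [60, 70]]
```

Note that an event belongs to exactly one tumbling window but to several sliding windows (here size / slide = 5 of them), while session boundaries depend on the data itself.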

7. Popular Stream Processing Technologies

7.1 Apache Kafka

Definition: Distributed streaming platform for building real-time data pipelines and streaming applications.

Key Features:

  • High throughput (millions of messages/second)
  • Fault-tolerant and durable
  • Horizontal scalability
  • Publish-subscribe messaging

Components:

  • Producers: Applications that publish data to Kafka
  • Consumers: Applications that read data from Kafka
  • Topics: Categories to which records are published
  • Brokers: Kafka servers that store data
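
How these four components relate can be shown with a toy in-memory model. This is not Kafka's client API: the classes `ToyBroker`, `ToyProducer`, and `ToyConsumer` are invented for this sketch, and real Kafka adds partitions, replication, and consumer groups, all of which are omitted here.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: keeps an append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def append(self, topic, record):
        self._logs[topic].append(record)
        return len(self._logs[topic]) - 1  # offset of the new record

    def read(self, topic, offset):
        return self._logs[topic][offset:]


class ToyProducer:
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, record):
        return self.broker.append(topic, record)


class ToyConsumer:
    """Each consumer tracks its own read position (offset) per topic."""

    def __init__(self, broker):
        self.broker = broker
        self.offsets = defaultdict(int)

    def poll(self, topic):
        records = self.broker.read(topic, self.offsets[topic])
        self.offsets[topic] += len(records)
        return records


broker = ToyBroker()
producer = ToyProducer(broker)
analytics, billing = ToyConsumer(broker), ToyConsumer(broker)

producer.send("trips", {"trip_id": 1})
producer.send("trips", {"trip_id": 2})
print(analytics.poll("trips"))  # both records
print(billing.poll("trips"))    # the same two records, read independently
print(analytics.poll("trips"))  # [] because nothing new has arrived
```

The point of the model is that records stay in the topic's log after being read, so several consumers can process the same stream at their own pace.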

Use Case: Uber uses Kafka to process trip events in real-time for ride matching and surge pricing.


7.2 Apache Spark Streaming

Definition: Extension of Apache Spark for processing live data streams.

Approach: Micro-batch processing (processes small batches every few seconds)

Features:

  • Integration with Spark ecosystem (SQL, MLlib)
  • Exactly-once semantics
  • Fault tolerance through lineage
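
The micro-batch idea can be sketched without Spark: events are grouped by fixed time interval and each group is then processed as a small batch. The function below is a plain-Python illustration under the assumption that events arrive in time order; it is not Spark's API.

```python
def micro_batches(events, interval):
    """Group time-ordered (timestamp, value) events into consecutive batches of `interval` seconds."""
    batch, batch_end = [], None
    for ts, value in events:
        if batch_end is None:
            batch_end = (ts // interval + 1) * interval  # end of the first interval
        while ts >= batch_end:
            yield batch  # may be empty if no events arrived in that interval
            batch, batch_end = [], batch_end + interval
        batch.append(value)
    if batch:
        yield batch


events = [(0, "a"), (1, "b"), (2, "c"), (7, "d")]
batches = list(micro_batches(events, interval=2))
print(batches)  # [['a', 'b'], ['c'], [], ['d']]
```

Latency in this model is bounded below by the batch interval, which is the main trade-off against event-by-event processing.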

Use Case: Netflix uses Spark Streaming for real-time recommendation updates and quality monitoring.


7.3 Apache Flink

Definition: Distributed stream processing framework for high-performance streaming applications.

Approach: True stream processing (event-by-event)

Features:

  • Low latency (sub-second)
  • Stateful computations
  • Event-time processing
  • Exactly-once consistency
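
Event-by-event, stateful processing means that each arriving event updates some stored state and can produce an output immediately. A minimal sketch of the idea in plain Python (not Flink's API), keeping a running average per key:

```python
from collections import defaultdict


def running_average(events):
    """For each (key, value) event, update per-key state and emit the new average at once."""
    state = defaultdict(lambda: [0, 0.0])  # key -> [count, total]
    for key, value in events:
        entry = state[key]
        entry[0] += 1
        entry[1] += value
        yield key, entry[1] / entry[0]


averages = list(running_average([("a", 10), ("b", 4), ("a", 20)]))
print(averages)  # [('a', 10.0), ('b', 4.0), ('a', 15.0)]
```

In a distributed framework this state would be partitioned by key and checkpointed so that it survives failures; here it simply lives in a local dictionary.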

Use Case: Alibaba uses Flink for real-time search ranking and fraud detection.


7.4 Apache Storm

Definition: Real-time computation system for processing unbounded streams of data.

Features:

  • Guaranteed message processing (at-least-once delivery)
  • Scalable and fault-tolerant
  • Simple programming model

Use Case: Twitter used Storm for real-time trending topics and analytics before replacing it with its in-house successor, Heron.


8. Applications of Streaming Data

8.1 Fraud Detection

Scenario: Credit card transactions

Transaction Event → Stream Processor → Pattern Matching → 
Alert if Suspicious (within milliseconds)

Detection Rules:

  • Multiple transactions from different locations within minutes
  • Transaction amount exceeds historical average by 10x
  • Purchase from previously unseen merchant category
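
The three rules above can be expressed as checks against a small per-card profile that is updated after every transaction. This is a simplified sketch: the `CardProfile` structure, the field names, and the 10-minute window are assumptions, and production systems typically combine such rules with statistical models.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CardProfile:
    """Per-card state kept by the stream processor (hypothetical structure)."""
    total: float = 0.0
    count: int = 0
    last_city: Optional[str] = None
    last_time: Optional[float] = None
    seen_categories: set = field(default_factory=set)


def check(txn, profile, window_s=600, factor=10):
    """Return the rules this transaction triggers, then update the profile."""
    reasons = []
    if (profile.last_time is not None
            and txn["city"] != profile.last_city
            and txn["time"] - profile.last_time <= window_s):
        reasons.append("location_change")
    if profile.count and txn["amount"] > factor * (profile.total / profile.count):
        reasons.append("amount_spike")
    if profile.count and txn["category"] not in profile.seen_categories:
        reasons.append("new_category")

    # Update state only after the checks, so a transaction is compared with history.
    profile.total += txn["amount"]
    profile.count += 1
    profile.last_city, profile.last_time = txn["city"], txn["time"]
    profile.seen_categories.add(txn["category"])
    return reasons


profile = CardProfile()
first = check({"time": 0, "city": "Mumbai", "amount": 500, "category": "grocery"}, profile)
second = check({"time": 300, "city": "Delhi", "amount": 6000, "category": "electronics"}, profile)
print(first)   # []
print(second)  # ['location_change', 'amount_spike', 'new_category']
```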

8.2 IoT Monitoring

Scenario: Manufacturing plant sensors

Sensor Data → Stream → Anomaly Detection → 
Alert if equipment failure predicted

Monitoring:

  • Temperature exceeds threshold → Immediate shutdown
  • Vibration pattern changes → Maintenance alert
  • Production rate drops → Investigation trigger
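
These monitoring rules can be sketched as a function that maps each sensor reading to a list of actions. The thresholds, field names, and the use of a fixed baseline are assumptions made for illustration; in particular, "vibration pattern changes" is reduced here to a simple deviation from the baseline level.

```python
def evaluate(reading, baseline, max_temp_c=90.0, vibration_tolerance=0.25, min_rate_ratio=0.8):
    """Return the actions triggered by one sensor reading."""
    actions = []
    if reading["temp_c"] > max_temp_c:
        actions.append("shutdown")
    deviation = abs(reading["vibration"] - baseline["vibration"])
    if deviation > vibration_tolerance * baseline["vibration"]:
        actions.append("maintenance_alert")
    if reading["units_per_min"] < min_rate_ratio * baseline["units_per_min"]:
        actions.append("investigate")
    return actions


baseline = {"vibration": 2.0, "units_per_min": 100}
hot = {"temp_c": 95, "vibration": 2.1, "units_per_min": 98}
worn = {"temp_c": 70, "vibration": 3.0, "units_per_min": 70}

print(evaluate(hot, baseline))   # ['shutdown']
print(evaluate(worn, baseline))  # ['maintenance_alert', 'investigate']
```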

8.3 Real-time Recommendations

Scenario: E-commerce personalization

User Actions → Stream → ML Model → 
Updated Recommendations (within seconds)

Events Processed:

  • Product views → Update interest profile
  • Items added to cart → Recommend complementary products
  • Searches → Show relevant results immediately
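
As a simplified stand-in for the ML model, the sketch below keeps a weighted interest profile per user and updates it with every event. The event weights and field names are assumptions chosen for illustration only.

```python
from collections import Counter

# Assumed weights: stronger signals of intent count for more.
EVENT_WEIGHTS = {"view": 1, "search": 2, "add_to_cart": 5}


def update_profile(profile, event):
    """Increase the user's interest in the event's category by the event's weight."""
    profile[event["category"]] += EVENT_WEIGHTS.get(event["type"], 0)
    return profile


def top_categories(profile, n=2):
    """Return the n categories with the highest interest score."""
    return [category for category, _ in profile.most_common(n)]


profile = Counter()
events = [
    {"type": "view", "category": "shoes"},
    {"type": "view", "category": "shoes"},
    {"type": "view", "category": "shoes"},
    {"type": "add_to_cart", "category": "phones"},
    {"type": "search", "category": "books"},
]
for event in events:
    update_profile(profile, event)

print(top_categories(profile))  # ['phones', 'shoes']
```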

8.4 Social Media Analytics

Scenario: Trending topics detection

Tweet Stream → Hashtag Counter → Trending Algorithm → 
Update Trending Sidebar (every minute)

Metrics Tracked:

  • Tweet velocity (tweets/minute)
  • User reach (unique users discussing)
  • Sentiment analysis (positive/negative ratio)
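
The hashtag counter and a naive trending rule can be sketched as follows. Ranking by growth over the previous minute is an assumption made for this example; real trending algorithms are considerably more elaborate.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(tweets, minute):
    """Count hashtags in (timestamp_seconds, text) tweets that fall in the given minute."""
    counts = Counter()
    for ts, text in tweets:
        if ts // 60 == minute:
            counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


def trending(current, previous, min_count=2):
    """Rank hashtags by growth over the previous minute, ignoring rare tags."""
    growth = {tag: n - previous.get(tag, 0) for tag, n in current.items() if n >= min_count}
    return sorted(growth, key=lambda tag: (-growth[tag], tag))


tweets = [
    (5, "#Cricket starts"),
    (70, "#cricket wow #IPL"),
    (80, "#IPL #ipl"),
    (90, "go #Cricket"),
    (100, "#news"),
]
previous = hashtag_counts(tweets, minute=0)
current = hashtag_counts(tweets, minute=1)
print(trending(current, previous))  # ['ipl', 'cricket']
```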

9. Challenges in Stream Processing

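  1. Complexity: Harder to design and reason about than batch processing, because results must be produced before all the data has arrived
  2. Late Data: Handling events that arrive late or out of order
  3. State Management: Maintaining state (counts, windows, profiles) across events
  4. Scalability: Processing millions of events per second
  5. Fault Tolerance: Ensuring no data loss during failures
  6. Testing: Hard to test because input is unbounded and timing-dependent behavior is difficult to reproduce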
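
The late-data challenge is commonly handled with a watermark: an estimate of how far event time has progressed, behind which events are treated as late. The sketch below uses a simple watermark (highest event time seen so far minus an allowed lateness); the function name and the policy of setting late events aside are assumptions for illustration.

```python
def split_late_events(events, allowed_lateness):
    """Split (event_time, payload) pairs, given in arrival order, into on-time and late."""
    max_seen = None
    on_time, late = [], []
    for event_time, payload in events:
        if max_seen is None or event_time > max_seen:
            max_seen = event_time
        watermark = max_seen - allowed_lateness
        if event_time < watermark:
            late.append(payload)      # too old: window results may already be final
        else:
            on_time.append(payload)   # out of order, but within tolerance
    return on_time, late


arrivals = [(10, "a"), (12, "b"), (11, "c"), (20, "d"), (13, "e"), (16, "f")]
on_time, late = split_late_events(arrivals, allowed_lateness=5)
print(on_time)  # ['a', 'b', 'c', 'd', 'f']
print(late)     # ['e']
```

Event "c" arrives out of order but is still accepted, while "e" falls behind the watermark (20 - 5 = 15) and is set aside. A larger allowed lateness accepts more stragglers at the cost of delaying final results.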

Exam Pattern Questions and Answers

Question 1: "What is streaming data? Explain its characteristics and sources." (8 Marks)

Answer:

Definition (2 marks):
Streaming data refers to continuous flows of data generated in real time by various sources that need to be processed immediately to derive timely insights and trigger actions. Unlike batch data, which is collected over time, streaming data is generated continuously with no predefined start or end.

Characteristics (3 marks):
Streaming data has five key characteristics. First, continuous generation: data is produced constantly. Second, high velocity: thousands to millions of events arrive per second. Third, time-sensitivity: value diminishes if the data is not processed immediately. Fourth, unbounded nature: the stream is effectively infinite, with no fixed size. Fifth, event-driven: each data point represents a specific event or observation.

Sources (3 marks):
Major sources include social media platforms like Twitter generating tweets and posts continuously, IoT devices such as sensors producing temperature and pressure readings, financial systems with stock exchanges generating trade data, web and mobile applications creating clickstream data from user interactions, transportation systems with GPS devices tracking vehicle locations, and telecom networks monitoring call records and data usage patterns in real-time.


Question 2: "Differentiate between stream processing and batch processing with examples." (6 Marks)

Answer:

Stream Processing (3 marks):
Stream processing handles continuous real-time data flows, with processing times of milliseconds to seconds. It deals with unbounded streams made up of small individual events. It is used for real-time decisions and immediate alerts. For example, fraud detection systems process credit card transactions in real time, analyzing each transaction within milliseconds to identify suspicious patterns and block fraudulent activity before the transaction completes.

Batch Processing (3 marks):
Batch processing accumulates data over time and processes large bounded datasets, with processing times ranging from hours to days. It is used for historical analysis and comprehensive reports. For example, a monthly sales report is built from data collected throughout the month and processed at month-end to generate insights about sales trends, top-performing products, and regional performance.


Summary

Key Points for Revision:

  1. Streaming Data: Continuous real-time data flows
  2. Characteristics: Continuous, high velocity, time-sensitive, unbounded, event-driven
  3. Sources: Social media, IoT, financial, web/mobile, transportation, telecom
  4. Stream vs Batch: Stream is real-time and continuous; batch is periodic and accumulated
  5. Technologies: Kafka, Spark Streaming, Flink, Storm
  6. Applications: Fraud detection, IoT monitoring, recommendations, social analytics

Exam Tip

For streaming questions, emphasize the real-time nature and provide concrete examples. Always mention at least two technologies (Kafka, Spark Streaming) and two use cases (fraud detection, IoT monitoring).

