Refers to datasets that are too large, fast, or complex for traditional databases to manage efficiently.
Characteristics
Section titled “Characteristics”- Volume: in the scale of terabytes or petabytes.
- Velocity: high ingest/update rates and need for low-latency ingestion or real-time processing.
- Variety: structured, semi-structured, and unstructured data (logs, JSON, images, time series).
- Veracity: data quality and trustworthiness issues at scale.
- Value: extracting meaningful analytics or driving applications from the data.
Big Data Storage Systems
Section titled “Big Data Storage Systems”Distributed File Systems (DFS)
Section titled “Distributed File Systems (DFS)”Stores data across many machines. Offers a single logical view. Provides redundancy via replication to handle failures.
Example: Google FS, Hadoop HDFS.
Hadoop HDFS features:
- NameNode: stores metadata (file → block mapping).
- DataNode: stores actual block data.
- Block size: typically 64 MB.
- Write-once-read-many model.
Sharding
Section titled “Sharding”Partitions data using a shard key. Improves scalability but adds complexity:
- Application must route queries manually.
- Rebalancing shards is difficult.
- Increases failure points → more replicas needed.
Parallel & Distributed Databases
Section titled “Parallel & Distributed Databases”Use multiple nodes to run queries in parallel. When data is stored on multiple nodes, replication ensures availability.
MapReduce
Section titled “MapReduce”Programming model for large-scale parallel computation. Allows fault-tolerant processing of data.
map()– transforms input into key–value pairs.reduce()– aggregates values by key.
Algebraic Operations & Spark
Section titled “Algebraic Operations & Spark”- Modern systems (Spark, Tez) extend MapReduce with relational-like operations.
- RDD (Resilient Distributed Dataset): fault-tolerant, distributed data abstraction.
- Supports transformations like map, filter, reduce, join.
Streaming Data
Section titled “Streaming Data”- Continuous data flow (stocks, sensors, IoT).
- Queries on streams use windows (time/tuple-based).
- Window types:
- Tumbling (non-overlapping)
- Hopping (fixed-size overlap)
- Sliding (moving window)
- Session (per user session)
Architectures
Section titled “Architectures”- Lambda architecture: stream + batch layers for processing.
- Pub-Sub systems: e.g., Apache Kafka for topic-based streaming.
11. Graph Data Model
Section titled “11. Graph Data Model”- Represents entities and their relationships as nodes and edges.
- Equivalent to ER model in structure.
- Graph Databases: e.g., Neo4J.
- Enable efficient edge traversal queries.
- Use languages like Cypher for pattern-based queries.