2.5.1 The CAP Theorem
Proposed by Eric Brewer, the CAP Theorem states that a distributed system can only provide two of the three following guarantees at once:
- Consistency: Every read receives the most recent write or an error.
- Availability: Every request receives a (non-error) response, without guarantee that it contains the most recent write.
- Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network.
2.5.5 Deep Dive: Apache Cassandra Architecture
Cassandra is a peer-to-peer distributed NoSQL database that avoids the "Single Point of Failure" of master-slave systems.
- Gossip Protocol: Nodes periodically talk to their neighbors to share state information about the cluster (e.g., which nodes are up/down).
- Hinted Handoff: If a node is down during a write, a neighbor stores a "hint" and delivers it as soon as the node comes back online.
- LSM Trees: Unlike RDBMS which use B-Trees, Cassandra uses Log-Structured Merge Trees which are optimized for high-speed writes at the cost of slower reads.
2.5.6 Comparison: Normalization vs Denormalization
| Feature | Normalization (SQL) | Denormalization (NoSQL) |
|---|---|---|
| Data Integrity | High (Primary/Foreign Keys). | Lower (Must be managed by App). |
| Storage | Efficient (Minimal redundancy). | Inefficient (Data is duplicated). |
| Read Speed | Slower (Requires Joins). | Faster (Data is in one record). |
| Optimization | Optimized for Updates. | Optimized for Reads. |
2.5.2 Relaxing Consistency: ACID vs BASE
Traditional databases follow ACID properties (Atomicity, Consistency, Isolation, Durability). Big Data systems often follow BASE:
- Basically Available: The system works even if parts are down.
- Soft State: The state of the system might change over time, even without input.
- Eventually Consistent: Given enough time, all copies of the data will eventually be the same.
2.5.3 Version Stamps (Managing Conflict)
When multiple users update the same record on different servers, conflicts occur. Version Stamps help detect these.
- Content Hash: A unique string generated based on the data.
- Timestamp: The time of the update.
- Vector Clock: A sophisticated list of version numbers from all servers that allows detecting whether one update "happened before" another.
Conflict Resolution:
- Last Write Wins (LWW): The newest timestamp is kept (simple but risky).
- Application-Specific: Merging the changes (e.g., Git merge or shopping cart merger).
2.5.4 Quorum-Based Consistency
Many NoSQL systems (like Cassandra) allow you to tune your consistency using three numbers:
- N: The number of replicas (nodes where data is stored).
- W: The number of nodes that must confirm a Write was successful.
- R: The number of nodes that must respond to a Read.
The Rule: If W + R > N, you are guaranteed Strong Consistency because the read and write sets will always overlap. If W + R <= N, you have Eventual Consistency.
2.5.7 Comparative Matrix: Serialization Formats
| Feature | XML | JSON | BSON | Avro / Protobuf |
|---|---|---|---|---|
| Readability | High (Human) | High (Human) | Low (Binary) | Low (Binary) |
| Parsing Speed | Slow | Fast | Very Fast | Blazing Fast |
| Space Efficiency | Poor (Text tags) | Moderate | Good | Excellent |
| Schema Required | No (Optional XSD) | No | No | Yes |
| Use Case | Legacy APIs | REST APIs | MongoDB | Internal RPC |
2.5.8 Conflict Resolution: The Shopping Cart Problem
Imagine two servers both think you added an item to your cart.
- The Problem: If they don't sync instantly, one update might overwrite the other.
- The NoSQL Solution: Use Vector Clocks to keep both versions and ask the user to "merge" them (or merge them automatically by taking the union of both lists).