Principles of Distributed Database Systems: Exercise Solutions & Key Concepts
Mastering distributed database systems (DDBS) requires more than just reading theory; it demands a hands-on approach to solving complex architectural puzzles. Whether you are studying for an exam or designing a scalable system, working through exercise solutions is the best way to internalize how data moves across a network.
This guide explores the core principles of DDBS through the lens of common exercise problems and their practical solutions.
1. Data Fragmentation and Allocation
One of the first hurdles in any DDBS course is determining how to split a global relation into pieces (fragmentation) and where to store them (allocation).
Exercise Scenario:
You have a global relation Employee (EmpID, Name, Dept, Salary, Location). You need to fragment this based on the query: "Find employees working in New York or London."
Solution Approach:
Horizontal Fragmentation: This involves using a SELECT operation. You define fragments based on the Location attribute.
Vertical Fragmentation: If a query only needs Name and Salary, you would use a PROJECT operation to split columns rather than rows.
The Correctness Rules: Ensure your solution meets three criteria: Completeness (no data lost), Reconstruction (can join/union back to the original), and Disjointness (no unnecessary duplication).
2. Distributed Query Optimization
Querying a distributed system is expensive because of "communication costs." Exercises often ask you to calculate the cost of a Join operation across two different sites.
Key Concept: Semijoins
A common solution to reduce data transfer is the Semijoin. Instead of sending an entire table across the network, you send only the joining column, filter the remote table, and send the smaller result back.
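The byte comparison behind that idea can be sketched in a few lines. This is a minimal illustration, not a full cost model: the function names and all sizes below are hypothetical, and it assumes the join result is needed at the local site.

```python
def full_ship_bytes(remote_rows: int, remote_row_bytes: int) -> int:
    """Bytes moved when the whole remote table is shipped to the local site."""
    return remote_rows * remote_row_bytes


def semijoin_bytes(local_distinct_keys: int, key_bytes: int,
                   matching_remote_rows: int, remote_row_bytes: int) -> int:
    """Bytes moved by a semijoin: join keys go out, matching rows come back."""
    keys_out = local_distinct_keys * key_bytes
    rows_back = matching_remote_rows * remote_row_bytes
    return keys_out + rows_back


# Hypothetical sizes, chosen only for illustration.
full = full_ship_bytes(remote_rows=10_000, remote_row_bytes=100)
semi = semijoin_bytes(local_distinct_keys=500, key_bytes=4,
                      matching_remote_rows=800, remote_row_bytes=100)
cheaper = "semijoin" if semi < full else "full ship"
print(full, semi, cheaper)
```

With these made-up numbers the semijoin moves far fewer bytes, but the advantage disappears when most remote rows match, because nearly the whole table comes back anyway.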
Exercise Tip: When asked to find the "optimal execution plan," always compare the total bytes transferred in a standard Join versus a Semijoin. The formula usually looks like: Cost(semijoin) = size(π_A(R)) + size(S ⋉ R), compared against Cost(join) = size(S), where A is the join attribute, R is the local table, and S is the remote table.
3. Distributed Concurrency Control
How do you maintain consistency when multiple users edit the same data on different continents?
Solution: Two-Phase Locking (2PL)
In distributed exercises, you'll often encounter the Centralized 2PL vs. Distributed 2PL debate.
Centralized: One site manages all locks. Simple, but a single point of failure.
Distributed: Each site manages locks for its own data. More resilient, but harder to detect Global Deadlocks.
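Wait-Die vs. Wound-Wait: These are common timestamp-based schemes for deadlock prevention.
Wait-Die: An older transaction that requests a lock held by a younger one waits; a younger transaction that requests a lock held by an older one dies (aborts and restarts with its original timestamp).
Wound-Wait: An older transaction "wounds" (preempts) a younger lock holder, forcing it to abort; a younger transaction that requests a lock held by an older one waits.
4. Reliability and the Two-Phase Commit (2PC)
Reliability exercises often focus on what happens when a site or a link fails during a transaction.
The 2PC Protocol Steps: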
Voting Phase: The coordinator asks all participants if they are ready to commit.
Decision Phase: If all participants vote "Yes," the coordinator sends a "Global Commit." If any participant votes "No" or times out, it sends a "Global Abort."
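The coordinator's decision rule in these two phases can be sketched as a single function. This is a simplified illustration, not a full protocol implementation: the function name, the vote strings, and the use of `None` for a timed-out participant are all assumptions made for the example, and logging and recovery are left out.

```python
from typing import Dict, Optional


def decide(votes: Dict[str, Optional[str]]) -> str:
    """Return the coordinator's decision after the voting phase.

    votes maps a participant name to "YES", "NO", or None,
    where None stands for "no reply before the timeout".
    """
    # Commit only if there is at least one participant and every one voted YES.
    if votes and all(vote == "YES" for vote in votes.values()):
        return "GLOBAL_COMMIT"
    # Any NO vote or any missing vote forces an abort.
    return "GLOBAL_ABORT"


print(decide({"site_a": "YES", "site_b": "YES"}))
print(decide({"site_a": "YES", "site_b": None}))
```

The asymmetry is the point of the exercise: commit needs unanimity, while a single "No" or a single silent participant is enough to abort.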
Common Problem: What happens if the coordinator fails after the voting phase?
Solution: This is the "blocking problem" of 2PC. Participants that voted "Yes" may be left in an uncertain state, holding locks until the coordinator recovers. This is why modern systems often look toward Three-Phase Commit (3PC) or Paxos/Raft consensus algorithms.
5. Parallelism and Data Replication
Modern exercises often touch on the CAP theorem (Consistency, Availability, Partition Tolerance).
Exercise Question: "Can a system be CA (Consistent and Available) during a network partition?"
Solution: No. During a partition (P), you must choose between Consistency (refusing the update to keep data uniform) and Availability (allowing the update even if other sites don't see it yet).
Summary Checklist for Students
When looking for or writing solutions to distributed database problems, always check for:
Minimization of data transfer: Is there a way to do this with fewer bytes?
Transparency: Does the user feel like they are using a single database?
Site Autonomy: Can a single site function if the others go offline?
By applying these principles to your exercises, you move from theoretical knowledge to architectural expertise.
Principles of Distributed Database Systems
A distributed database system is a collection of multiple, logically interrelated databases that are connected through a network, allowing users to access and share data across different locations.
Exercise Solutions
Exercise 1: What are the main advantages of a distributed database system?
Solution: The main advantages of a distributed database system are:
Exercise 2: What is fragmentation in a distributed database system?
Solution: Fragmentation is the process of dividing a relation into smaller fragments (by rows, by columns, or both) that can be stored at different sites.
Exercise 3: What is replication in a distributed database system?
Solution: Replication is the process of maintaining multiple copies of data at different sites to improve availability and performance.
Exercise 4: Consider a distributed database system with three sites: A, B, and C. Each site stores a fragment of a relation R. The relation R has the following tuples:
| ID | Name | Age |
| --- | --- | --- |
| 1 | John | 25 |
| 2 | Jane | 30 |
| 3 | Joe | 35 |
Site A has the following fragment of R:
| ID | Name | Age |
| --- | --- | --- |
| 1 | John | 25 |
| 2 | Jane | 30 |
Site B has the following fragment of R:
| ID | Name | Age |
| --- | --- | --- |
| 2 | Jane | 30 |
| 3 | Joe | 35 |
Site C has the following fragment of R:
| ID | Name | Age |
| --- | --- | --- |
| 1 | John | 25 |
| 3 | Joe | 35 |
a. What is the fragmentation of R?
b. What is the replication factor of R?
Solution:
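a. The fragmentation of R is:
R = R1 ∪ R2 ∪ R3
where R1, R2, and R3 are the horizontal fragments of R at sites A, B, and C, respectively. The fragments overlap, so they are complete and reconstructible but not disjoint.
b. The replication factor of R is 2: no site holds all of R, but every tuple is stored at exactly two of the three sites (tuple 1 at A and C, tuple 2 at A and B, tuple 3 at B and C).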
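One way to check part (b) is to count how many sites store each tuple of R. The sketch below does that for the three fragments given in the exercise, using only the ID column; the variable and function names are illustrative choices.

```python
from collections import Counter

# IDs of the tuples in R and in the fragment held by each site (from the exercise).
R_IDS = {1, 2, 3}
FRAGMENTS = {"A": {1, 2}, "B": {2, 3}, "C": {1, 3}}


def copies_per_tuple(fragments):
    """Count, for every tuple ID, how many sites store it."""
    counts = Counter()
    for ids in fragments.values():
        counts.update(ids)
    return dict(counts)


copies = copies_per_tuple(FRAGMENTS)
covers_r = set().union(*FRAGMENTS.values()) == R_IDS   # completeness check
average_copies = sum(copies.values()) / len(copies)    # copies per tuple
print(copies, covers_r, average_copies)
```

Every tuple appears at two sites, so the union of the fragments covers R while the average number of copies per tuple is two.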
Exercise 5: Consider a distributed database system with two sites: A and B. Site A has a relation R1, and site B has a relation R2. The relations R1 and R2 have the following tuples:
R1:
| ID | Name | Age |
| --- | --- | --- |
| 1 | John | 25 |
| 2 | Jane | 30 |
R2:
| ID | Name | Age |
| --- | --- | --- |
| 3 | Joe | 35 |
| 4 | Sarah | 20 |
Design a distributed query to retrieve all tuples from R1 and R2.
Solution:
The distributed query can be written as:
SELECT * FROM R1 UNION SELECT * FROM R2
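This query retrieves all tuples from R1 at site A and R2 at site B, and combines them into a single result set. Because UNION removes duplicate rows and R1 and R2 share no tuples, UNION ALL would return the same result without the duplicate-elimination step.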
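To see the query as a coordinator might execute it, the following sketch stands in for the two sites with two separate in-memory SQLite databases. This is only a simulation under stated assumptions: a real DDBS would ship subqueries over the network, and the helper names `make_site` and `union_across_sites` are hypothetical.

```python
import sqlite3


def make_site(rows):
    """Create an in-memory database standing in for one site."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE R (ID INTEGER PRIMARY KEY, Name TEXT, Age INTEGER)")
    conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)
    return conn


site_a = make_site([(1, "John", 25), (2, "Jane", 30)])   # holds R1
site_b = make_site([(3, "Joe", 35), (4, "Sarah", 20)])   # holds R2


def union_across_sites(sites):
    """Run the local subquery at each site and merge the results, dropping duplicates."""
    seen = set()
    result = []
    for conn in sites:
        for row in conn.execute("SELECT ID, Name, Age FROM R ORDER BY ID"):
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


rows = union_across_sites([site_a, site_b])
print(rows)
```

The duplicate check mirrors what UNION does; with disjoint fragments it never fires, which is why UNION ALL would be enough here.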
Problem: Three sites. Transactions $T_1, T_2, T_3$.
Detect the deadlock.
Solution: We construct the Local Wait-For Graphs (LWFG) and combine them into a Global Wait-For Graph (GWFG).
Local Graphs:
Global Construction: Combine the edges based on transaction identifiers.
Cycle Detection: Tracing the edges gives $T_1 \rightarrow T_3 \rightarrow T_2 \rightarrow T_1$, which closes a cycle, so the three transactions are deadlocked.
Resolution: The system detects the cycle. It must abort one transaction (the victim) to break the deadlock. Typically, the youngest transaction or the one with the least work done is chosen (e.g., abort $T_3$).
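The construction and cycle check can be sketched as follows. The local edges are an assumption made for illustration: the per-site edge lists are hypothetical, picked so that they combine into the cycle $T_1 \rightarrow T_3 \rightarrow T_2 \rightarrow T_1$. An edge (waiter, holder) means the first transaction waits for a lock held by the second.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (waiter, holder)

# Hypothetical local wait-for graphs, one edge list per site.
LOCAL_WFGS: Dict[str, List[Edge]] = {
    "site1": [("T1", "T3")],
    "site2": [("T3", "T2")],
    "site3": [("T2", "T1")],
}


def build_global_wfg(local_graphs: Dict[str, List[Edge]]) -> Dict[str, Set[str]]:
    """Merge all local edges into one adjacency map keyed by transaction id."""
    graph: Dict[str, Set[str]] = {}
    for edges in local_graphs.values():
        for waiter, holder in edges:
            graph.setdefault(waiter, set()).add(holder)
            graph.setdefault(holder, set())
    return graph


def find_cycle(graph: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Depth-first search; returns one cycle as a list of nodes, or None."""
    visiting: Set[str] = set()
    done: Set[str] = set()
    path: List[str] = []

    def dfs(node: str) -> Optional[List[str]]:
        visiting.add(node)
        path.append(node)
        for nxt in sorted(graph[node]):
            if nxt in visiting:                      # back edge closes a cycle
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                found = dfs(nxt)
                if found:
                    return found
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for start in sorted(graph):
        if start not in done:
            found = dfs(start)
            if found:
                return found
    return None


cycle = find_cycle(build_global_wfg(LOCAL_WFGS))
print(cycle)
```

No single site sees a cycle in its own graph here, which is why the deadlock only shows up after the local graphs are merged.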
| Topic | Core Principle | Classic Pitfall |
|-------|----------------|------------------|
| Fragmentation | Horizontal: predicates; Vertical: key preservation | Lossless join not ensured |
| Query optimization | Semi-join reduction before full join | Ignoring transmission cost |
| Concurrency control | Distributed 2PL + deadlock detection | Circular wait across sites |
| Commit | 2PC: prepare → commit | Blocking if coordinator crashes |
| Replication | Read/write quorums: R+W > N | Underestimating quorum intersection |
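The quorum condition in the last row can be checked by brute force for small N. The sketch below is an illustration only, and the function name is hypothetical: it enumerates every possible read quorum of size R and write quorum of size W over N replicas and tests whether each pair shares at least one replica.

```python
from itertools import combinations


def quorums_always_intersect(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r overlaps every write quorum of size w."""
    replicas = range(n)
    return all(
        set(read_q) & set(write_q)
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    )


print(quorums_always_intersect(3, 2, 2))   # R + W = 4 > 3
print(quorums_always_intersect(3, 1, 2))   # R + W = 3, not greater than 3
```

For every small configuration tried this way, the enumeration agrees with the rule R + W > N: the quorums are guaranteed to overlap exactly when the inequality holds, so a read always reaches at least one replica that saw the latest write.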
Problem:
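A distributed database has 3 sites. Fragment F1 is at site A (1000 rows), F2 at site B (500 rows), and F3 at site C (2000 rows). Query: F1 ⨝ F2 ⨝ F3. Choose the best join order (cost = number of tuples transmitted). Assume a join selectivity of 0.01 and that all joins are equi-joins.
Hint:
Try all permutations. Is the optimal order (F2 ⨝ F1) ⨝ F3 or (F2 ⨝ F3) ⨝ F1? Compute the intermediate sizes.
Answer (in brief):
The cost model here ships the first operand to the site of the second, then ships the intermediate result to the site of the third.
The smallest relation is F2 (500). Join F2 with F1 → size = 500 × 1000 × 0.01 = 5000. Then join with F3 → total cost: move F2 to site A (500) + move the 5000 intermediate tuples to site C (5000) = 5500.
Alternative: join F2 with F3 first: 500 × 2000 × 0.01 = 10,000; then join with F1: cost 500 + 10,000 = 10,500, which is worse.
Best under this model: (F2 ⨝ F1) ⨝ F3. Start with the smallest relation (F2) and pick the join that yields the smallest intermediate result. If a plan may instead ship a base relation to the site holding the intermediate result, cheaper options exist; for example, shipping F2 and F1 to site C costs only 500 + 1000 = 1500 tuples.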
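The "try all permutations" hint is easy to mechanize. The sketch below assumes the shipping model used in the answer (send the first operand to the second operand's site, then send the intermediate result to the third site); the names `SIZES`, `SELECTIVITY`, and `plan_cost` are illustrative.

```python
from itertools import permutations

SIZES = {"F1": 1000, "F2": 500, "F3": 2000}   # tuples per fragment, from the problem
SELECTIVITY = 0.01


def plan_cost(order):
    """Tuples transmitted when joining the fragments in the given order."""
    first, second, _third = order
    # Ship `first` to the site of `second` and join there.
    intermediate = round(SIZES[first] * SIZES[second] * SELECTIVITY)
    # Then ship the intermediate result to the site of the third fragment.
    return SIZES[first] + intermediate


costs = {order: plan_cost(order) for order in permutations(SIZES)}
best = min(costs, key=costs.get)
print(best, costs[best])
```

Changing `plan_cost` to allow shipping a base relation to the intermediate result instead is a useful follow-up exercise, since the ranking of plans depends on which moves the cost model permits.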
Solving exercises from the Principles of Distributed Database Systems requires a blend of logical reasoning, cost modeling, and protocol understanding.
By mastering these exercise patterns, you will not only succeed in your coursework but also build a strong foundation for designing scalable, consistent, and high-performance distributed databases in the real world.
Introduction
Distributed database systems are designed to store and manage data across multiple sites or nodes, which can be geographically dispersed. The primary goal of a distributed database system is to provide a unified view of the data, while ensuring that the data is consistent, reliable, and easily accessible. In this write-up, we will discuss the principles of distributed database systems and provide solutions to exercises that illustrate these principles.
Exercise Solutions
Exercise 1: Fragmentation and Replication
Consider a distributed database system that stores information about customers, orders, and products. The database is fragmented into three fragments: Fragment 1 (Customers), Fragment 2 (Orders), and Fragment 3 (Products).
Each fragment is replicated at two sites: Customers at Site A and Site C, Orders at Site B and Site D, and Products at Site A and Site B.
Draw a diagram showing the fragmentation and replication of the database.
Solution
The diagram below shows the fragmentation and replication of the database:
+---------------+
| Fragment 1 |
| (Customers) |
+---------------+
|
|
v
+---------------+ +---------------+
| Site A | | Site C |
| (Replica 1) | | (Replica 2) |
+---------------+ +---------------+
+---------------+
| Fragment 2 |
| (Orders) |
+---------------+
|
|
v
+---------------+ +---------------+
| Site B | | Site D |
| (Replica 1) | | (Replica 2) |
+---------------+ +---------------+
+---------------+
| Fragment 3 |
| (Products) |
+---------------+
|
|
v
+---------------+ +---------------+
| Site A | | Site B |
| (Replica 1) | | (Replica 2) |
+---------------+ +---------------+
Exercise 2: Distribution and Autonomy
Consider a distributed database system that stores information about employees and departments. The database is distributed across three sites: Site A, Site B, and Site C. Each site has its own local database and is autonomous.
Describe how the system ensures autonomy and distribution.
Solution
The system ensures autonomy by allowing each site to operate independently, making decisions about data management and consistency. Each site has its own local database, which can be updated independently.
The system ensures distribution by storing data across multiple sites. The data is fragmented and distributed across the three sites, providing a unified view of the data.
For example, if a new employee is added at Site A, the employee's information is stored in the local database at Site A. If the employee's department is updated at Site B, the updated information is stored in the local database at Site B. The system ensures that the data is consistent across all sites by using distributed transactions and concurrency control.
Exercise 3: Transparency
Consider a distributed database system that stores information about customers and orders. The database is fragmented and replicated across multiple sites. Describe how the system provides transparency.
Solution
The system provides transparency by hiding the distribution of data from users. Users interact with the system through a global schema, which presents a single, unified view of the data.
For example, a user can submit a query to retrieve all customers who have placed an order. The system automatically determines which sites hold the relevant data, retrieves it, and returns the result. The user does not need to know how the data is fragmented or replicated.
Conclusion
In conclusion, distributed database systems are designed to store and manage data across multiple sites or nodes. The principles of distributed database systems include fragmentation, replication, distribution, autonomy, and transparency. By understanding these principles and how they are applied, we can design and implement effective distributed database systems that provide a unified view of the data, while ensuring that the data is consistent, reliable, and easily accessible.
Introduction
Distributed database systems are designed to store and manage large amounts of data across multiple sites or nodes. The data is typically replicated or partitioned across multiple nodes to improve performance, reliability, and scalability. In this write-up, we will discuss the principles of distributed database systems and provide solutions to common exercises.
Exercise Solutions
Exercise 1: Design a Distributed Database Schema
Suppose we have a distributed database system for a university with three nodes: Node A (New York), Node B (Chicago), and Node C (Los Angeles). The database has two relations: Students and Courses.
Solution
We can design a distributed database schema as follows:
Students relation with attributes Student_ID, Name, Age
Courses relation with attributes Course_ID, Course_Name, Credits
Enrollments relation with attributes Student_ID, Course_ID, Grade
Exercise 2: Fragmentation and Allocation
Suppose we have a relation Orders with attributes Order_ID, Customer_ID, Order_Date, and Total. We want to fragment this relation into two fragments: Orders_1 and Orders_2. We also want to allocate these fragments to two nodes: Node A and Node B.
Solution
We can fragment the Orders relation based on the Order_Date attribute:
Orders_1: Orders with Order_Date between 2020 and 2022
Orders_2: Orders with Order_Date between 2023 and 2025
We can allocate these fragments to nodes as follows:
Node A: Orders_1
Node B: Orders_2
Exercise 3: Distributed Query Processing
Suppose we have a query to retrieve the names of students who are enrolled in a course with a specific course ID.
Solution
We can process this query in a distributed manner as follows:
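First, the node that stores the enrollment data is asked for the Student_IDs of students enrolled in the course.
Next, that node sends the Student_IDs back to Node A.
Finally, Node A takes the Student_IDs and runs another subquery against its student data to retrieve the names of students with those Student_IDs.
Conclusion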
Distributed database systems are complex systems that require careful design, implementation, and management. Understanding the principles of distributed database systems, including distribution, autonomy, heterogeneity, and transparency, is crucial for designing and implementing efficient and scalable systems. The exercise solutions provided in this write-up demonstrate how to apply these principles to real-world problems.
Mastering the Core: Principles of Distributed Database Systems Exercise Solutions
Distributed database systems (DDBS) are the backbone of modern, globalized computing. From social media feeds to international banking, the ability to manage data across multiple physical locations is essential. However, the complexity of these systems, which spans fragmentation, replication, query optimization, and transaction management, can be daunting.
Working through exercise solutions is often the only way to bridge the gap between abstract theory and technical implementation. This article explores the fundamental principles of DDBS through the lens of common problem sets and their solutions.
1. Data Fragmentation and Allocation
One of the first challenges in a distributed environment is deciding how to split data (fragmentation) and where to put it (allocation).
Horizontal vs. Vertical Fragmentation
Horizontal Fragmentation: Dividing a relation into subsets of tuples (rows). Solutions usually involve defining selection predicates (e.g., WHERE City = 'New York').
Vertical Fragmentation: Dividing a relation into subsets of attributes (columns). Solutions focus on grouping attributes frequently accessed together, often using an Attribute Affinity Matrix.
Common Exercise Scenario:
Problem: Given a global schema and specific site queries, determine the optimal fragments.
Solution Tip: Use Minterm Predicates. By combining all simple predicates from applications, you create non-overlapping fragments that satisfy the "completeness" and "disjointness" rules.
2. Distributed Query Processing
In a distributed system, the cost of moving data over a network often outweighs the cost of local disk I/O.
Localization and Optimization
Query processing solutions typically follow a four-step process:
Query Decomposition: Rewriting the calculus query into an algebraic one.
Data Localization: Replacing global relations with their fragments.
Global Optimization: Finding the best join order and communication strategy.
Local Optimization: Selecting the best local access paths.
Common Exercise Scenario:
Problem: Calculate the cost of a join between two tables located at different sites using a Semi-join.
Solution Tip: Remember that a semi-join reduces the size of the operand before it is sent across the network. If Size(projected join attribute) + Size(semi-join result) < Size(original table), the semi-join is more efficient.
3. Distributed Concurrency Control
Ensuring consistency when multiple users access data across sites requires sophisticated locking and ordering mechanisms.
Locking and Timestamping
Distributed 2-Phase Locking (2PL): Each transaction acquires locks in a growing phase and releases them in a shrinking phase, with lock management spread across multiple nodes. Solutions often deal with Global Deadlock Detection, where a cycle appears in the Wait-For Graph only once edges from different sites are combined.
Timestamp Ordering: Assigning unique timestamps to transactions to ensure serializability without explicit locking.
4. Reliability and the Two-Phase Commit (2PC)
How do we ensure that a transaction either commits at every site or aborts at every site?
The 2PC Protocol
Voting Phase: The coordinator asks participants if they are ready to commit.
Decision Phase: Based on the votes, the coordinator sends a "Global Commit" or "Global Abort" message.
Common Exercise Scenario:
Problem: What happens if the coordinator fails after sending a "Prepare" message but before receiving all votes?
Solution Tip: This leads to a "blocked" state. Participants cannot decide on their own because they don't know the global outcome, highlighting a major weakness of basic 2PC (the need for 3PC or recovery protocols).
5. Parallel Database Systems
While distributed systems focus on geographic separation, parallel systems focus on performance via multiple processors and disks.
Architectures
Shared Memory: Fast but limited scalability.
Shared Disk: Good for clusters but suffers from communication overhead.
Shared Nothing: The gold standard for massive scalability (e.g., MapReduce, Hadoop).
Conclusion: How to Approach Exercise Solutions
When studying "Principles of Distributed Database Systems," don't just look for the answer. Focus on the correctness rules:
Completeness: No data is lost during fragmentation.
Reconstruction: You can rebuild the original relation from fragments.
Disjointness: Data isn't unnecessarily duplicated (unless specifically replicated for availability).
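These three rules can be checked mechanically for a horizontal fragmentation. The sketch below is an illustration under stated assumptions: the Employee rows, the fragment names, and the helper functions are all hypothetical, and rows are compared as whole tuples.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str, str]  # (EmpID, Name, Location)

# Hypothetical sample data, for illustration only.
EMPLOYEES: List[Row] = [
    (1, "Ann", "New York"),
    (2, "Bob", "London"),
    (3, "Cho", "Tokyo"),
    (4, "Dev", "New York"),
]


def fragment(rows: List[Row],
             predicates: Dict[str, Callable[[Row], bool]]) -> Dict[str, List[Row]]:
    """Horizontal fragmentation: one selection predicate per fragment."""
    return {name: [r for r in rows if pred(r)] for name, pred in predicates.items()}


def check(rows: List[Row], fragments: Dict[str, List[Row]]) -> Dict[str, bool]:
    """Evaluate the three correctness rules for a set of horizontal fragments."""
    stored = [r for frag in fragments.values() for r in frag]
    return {
        "complete": set(rows) <= set(stored),          # no row is lost
        "reconstructs": set(stored) == set(rows),      # union gives back the relation
        "disjoint": len(stored) == len(set(stored)),   # no row stored twice
    }


PREDICATES: Dict[str, Callable[[Row], bool]] = {
    "emp_ny": lambda r: r[2] == "New York",
    "emp_london": lambda r: r[2] == "London",
    "emp_other": lambda r: r[2] not in ("New York", "London"),
}

fragments = fragment(EMPLOYEES, PREDICATES)
report = check(EMPLOYEES, fragments)
print(report)
```

Dropping the `emp_other` predicate is an instructive experiment: the Tokyo row then belongs to no fragment, so completeness and reconstruction both fail even though the fragments remain disjoint.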
By mastering these mathematical and logical foundations, you move beyond rote memorization and toward designing resilient, high-performance distributed architectures.
Finding formal exercise solutions for the authoritative textbook Principles of Distributed Database Systems (4th Edition, 2020) by M. Tamer Özsu and Patrick Valduriez can be challenging because the authors primarily restrict full solution manuals to instructors.
However, you can access specific helpful resources and sample solutions through the following official and verified academic channels:
1. Official Textbook Resources
The authors maintain a dedicated site at the University of Waterloo for the 4th edition. While the full manual is restricted, this site is the most reliable source for:
Solutions to Selected Exercises: Links to specific PDFs containing verified answers for core chapters.
Presentation Slides: These often contain "in-class" examples and solved problems that mirror the exercises in the book.
Errata: Crucial for ensuring you aren't trying to solve an exercise with a typo.
2. Verified Solutions for Key Concepts
Common exercises in this field often focus on specific algorithmic problems. You can find high-quality, solved examples for these topics on academic platforms:
Data Fragmentation & Allocation: Step-by-step solutions for vertical and horizontal fragmentation.
Distributed Query Optimization: Look for solutions regarding join ordering and semijoin programs, which are frequently used in distributed systems homework.
Concurrency Control: Solutions involving Two-Phase Commit (2PC) and Paxos consensus algorithms are often provided in university course repositories.
3. Alternative Peer-to-Peer Learning
If official solutions are unavailable for a specific problem, these platforms host student-uploaded solution sets:
CourseHero: Hosts various versions of the "Principles of Distributed Database Systems Exercise Solutions" uploaded by students from institutions like GITAM University and BITS Pilani.
Database System Concepts (Practice Site): While for a different book, the Practice Exercises by Silberschatz et al. provide publicly available solutions for overlapping topics like distributed transactions and deadlock.
Distributed Database Systems (DDBS) represent a core pillar of modern data management. From Google Spanner to Amazon DynamoDB, the principles of fragmentation, replication, distributed query processing, and concurrency control are essential knowledge for any data professional. However, the theoretical rigor of courses like Principles of Distributed Database Systems (often based on the classic textbook by Özsu and Valduriez) means that exercises can be challenging.
This article provides a structured approach to solving common exercises in this domain. We will break down solutions by topic, explain the underlying reasoning, and offer strategies to tackle problems ranging from fragmentation to distributed deadlock detection.