Testing/Verification of Distributed Systems
Reasoning about the correctness of distributed systems is challenging, mainly due to their concurrency: events can occur concurrently at multiple nodes, without any guarantees about their order, giving rise to a vast number of possible interleavings. Even for small systems, naively examining all possible interleavings quickly leads to what is referred to as state explosion: there are simply too many possible executions to verify or test them all.
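To make the growth concrete, here is a small illustration (a hypothetical sketch, not part of any tool below): even for just two processes that each execute k ordered steps, the number of distinct interleavings is the binomial coefficient C(2k, k), which grows exponentially in k.

```python
from math import comb

def interleavings(k: int) -> int:
    # Number of distinct interleavings of two processes with k steps
    # each, where the steps within each process keep their order:
    # choose which k of the 2k global positions belong to process 1.
    return comb(2 * k, k)

for k in range(1, 8):
    print(k, interleavings(k))
# Already at k = 7 there are 3432 interleavings; with more processes
# or more steps, exhaustive exploration becomes infeasible.
```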
Because of this, several verification and testing solutions specialized for the unique challenges of distributed systems have been proposed, ranging from the model-checking-based specification and verification language TLA+, introduced in 1999, to more modern approaches that increasingly focus on scalability, both to avoid state explosion and to capture the properties of modern large-scale cloud systems.
Goal
The goal of this seminar is to get an overview of different testing and verification approaches for distributed systems. What are the differences between model-checking, randomized testing, and systematic testing? What are the strengths and weaknesses of each approach?
You will explore these questions guided by your own interests and summarize your findings in a survey-style research paper.
Additionally, we encourage you to try one or more approaches hands-on and explore their features, possibly trying to test/verify a small example application.
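As a flavor of what such hands-on exploration might look like, the following toy sketch (hypothetical, not based on any of the tools referenced below) contrasts systematic and randomized testing on a classic lost-update bug: two processes each read a shared counter and write back the value plus one, so some interleavings lose an increment.

```python
import itertools
import random

def run(schedule):
    # Execute a schedule, given as a sequence of process ids. Each
    # process has two atomic steps: read the counter into a local
    # variable, then write local + 1 back.
    counter = 0
    local = [None, None]
    pc = [0, 0]  # per-process program counter: 0 = read, 1 = write
    for pid in schedule:
        if pc[pid] == 0:
            local[pid] = counter
        else:
            counter = local[pid] + 1
        pc[pid] += 1
    return counter  # correct executions end with counter == 2

# Systematic testing: enumerate every interleaving of the two
# two-step processes (all distinct orderings of [0, 0, 1, 1]).
schedules = set(itertools.permutations([0, 0, 1, 1]))
buggy = [s for s in schedules if run(s) != 2]
print(f"{len(buggy)} of {len(schedules)} schedules expose the bug")

# Randomized testing: sample schedules instead of enumerating them.
# On this toy example most schedules are buggy, so random sampling
# finds the bug quickly; in larger systems the trade-off between
# coverage and scalability is exactly what the approaches above differ on.
random.seed(0)
sample = [tuple(random.sample([0, 0, 1, 1], 4)) for _ in range(20)]
hits = [s for s in sample if run(s) != 2]
```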
Starting Points
- IronFleet: proving practical distributed systems correct
- Model-based testing of networked applications
- Effective Concurrency Testing for Distributed Systems
- Why is random testing effective for partition tolerance bugs?
Example Tool
- TLA+ (also used by IronFleet in the above pointers)