Designing Distributed Systems with TLA+

Official Description: Distributed systems are hard. Even a few interacting agents can lead to tens of thousands or even millions of unique system states. At that scale, it’s impossible to test for, or even reason about, every possible edge case. We need better tools not just for building systems, but for understanding them.

To truly understand distributed systems, we need to turn to software modeling, or “formal methods”. A few hours of modeling catches complex bugs that would take weeks or months of development to discover. With software modeling, we can build distributed systems more confidently, quickly, and safely.

This talk will focus on one software modeling tool called TLA+, which has seen spectacular success in real-world projects. We will also demonstrate it on an example of a distributed system and close out by discussing learning resources.

Actual Description: I try to convince everybody that threads and Kubernetes are literally the exact same. After failing that I try to convince everybody that distributed systems are easy if we just think really hard about them. After failing that I try to convince everybody that thinking hard is boring and we should just write TLA+ specs instead. After failing that I shill my book and consulting services because dignity is for nerds.

Talk is here.

FAQ

These are questions I got after the talk. I will add more as people ask them.

How do you go from a spec to code in production?

Carefully. Verifying that code matches a spec, aka refinement, is a really hard problem! To accurately implement your spec, you need good engineers, good tooling, and good process. Nothing can take those needs away. There is some research on converting Pluscal specs to go code, but it’s not production-ready.

The benefit of modeling is that if you see an error in production, you immediately know that it’s either an implementation error or some factor you didn’t consider in your model. If you didn’t write a spec, you don’t know whether it’s an implementation error, a fundamental design error, or even an ambiguity in what you were supposed to do in the first place.

Avoiding refinement gives us other advantages: we can apply a spec to any programming language, and we can retrofit existing systems with specs.

Does TLA+ passing a model guarantee that your system is 100% correct or just 99.99%?

TLA+ guarantees that your spec has the specific properties you verified under the specific assumptions you made and model you ran. If you have the assumption that at least half your workers are weakly fair, it makes no guarantees about the case where less than half your workers are weakly fair. If you cap the size of the message queue at 10 messages, you get no guarantees about what will happen if you enqueue 20 messages.

The current model checker can only check bounded models, or models with a finite number of possible states. Note that there can be an infinite number of behaviors in a bounded model, for example if there’s a valid cycle.

How does this fit in with practices like Agile and TDD?

Pretty well, in my opinion! TDD covers ground level, local-code implementation and verification, something that’s outside of scope for TLA+. But TDD doesn’t remove the benefit of thinking about the big picture.

Similarly, the most popular styles of Agile favor short-term iterations, continuous feedback, and developer empowerment. This is all orthogonal to the value of design and planning ahead. Most of Agile’s issue with “design” is actually with the process of “Big Design Up Front”: creating a rigid plan and never deviating from it, even when reality gets in the way. But TLA+ and other modern specification languages aren’t used for BDUF. Instead, you have the flexibility to choose your level of detail, meaning you can design exactly as much as you need to get the benefits you want. You can easily modify your spec in the face of sudden changes and adapt to the new reality.

In return you learn whether your designs will work well before you spend time implementing them. You can find the ambiguity in customer requirements well before you show them a demo. And you learn exactly what you need to change about your product when new constraints come in. I talk more about the intersection between Agile and TLA+ in this essay.

How much time does it add to the development process?

There aren’t yet any large-scale studies on the effectiveness of formal specifications. Anecdotally, you can usually get a beneficial model within 1-2 days. That’s usually enough to find issues with existing systems or hammer out the details of new ones.

I’ve found that a formal specification saves significant time in the long run, both by reducing the amount of expensive, late-stage rework and by reducing the amount of post-release debugging. In my experiences it usually saves maybe 25% of the development time plus maybe the same amount saved in maintenance. Other people I’ve talked to have found similar or larger savings.

How do you decide what to specify?

Experience! A very rough rule of thumb I have is that TLA+ in particular is good if you’re dealing with concurrency or expect the implementation to take more than a week. But the cutoff is super fuzzy in practice.

Can TLA+ tell you if certain constraints are unnecessary?

You can use it to help you find it, but the model checker can’t automatically tell you “you can remove this check and preserve all your properties.” You’d have to try making that change yourself.

Can TLA+ help you find security vulnerabilities?

Yes! See this talk at Strangeloop. It uses Alloy, a different specification language, but the fundamental principles are the same.

Where’s the best place to get started?

See my essay on getting started here. The short version: take existing systems and write specs as a form of documentation. Since you know how the existing behavior, it’s easier to write and sanity-check your spec. About half the people I know who tried this found it a great way to get a better understanding of the system. The other half found critical bugs this way.

How does this compare to FDR and CSP?

I’ve played with FDR a bit but don’t know it well enough to accurately comment.

How does this compare to Spin?

Spin was specifically designed for C networking problems. TLA+ is more general. Nonetheless both are very good tools.

Favorite modeling story?

Got contracted to help a large company write a spec for their new project. After about six hours of speccing we discovered that 1) their proposed architecture couldn’t satisfy their requirements and 2) their requirements didn’t make sense anyway. Project was soon put on indefinite haitus.

Spec

To make the example in the talk more accessible, I invented my own syntax for TLA+. All of changes are 1-1. For example, instead of [key |-> val] I wrote {key: val}. I also only showed a subset of the full spec in the talk. In a real system, there are all sorts of invariants you’d want to check: all messages are well-formed, no deadlocks, all ids are unique, etc. Covering this, though, would lose some focus, so I focused only on ambiguity and liveness. You can see the full spec below:

---- MODULE Uploader ----

EXTENDS Sequences, Integers, TLC
CONSTANTS Users, Workers, NULL

\* For model checking
CONSTANTS MaxVersion, MaxId

Ids == 1..MaxId
Versions == 1..MaxVersion

SeqOf(set, n) == UNION {[1..m -> set] : m \in 0..n}
Range(f) == {f[x]: x \in DOMAIN f}
Nonempty(seq) == seq /= <<>>
set ++ e == set \union {e}
set -- e == set \      {e}

seq (+) elem == Append(seq, elem)

Upload == [from: Users, version: Nat, id: Nat]
ModelUpload == [from: Users, version: Versions, id: Ids]

Msgs == [id: Nat, action: {"create", "edit", "delete"}]


UniqueOnIds(set) ==
  \A m1, m2 \in set:
    m1 /= m2 <=> m1.id /= m2.id

(*--algorithm uploader

variables
  uploads = {};
  content = {};
  queue = <<>>;
  next_id = 1;
  
define
  UploadWith(id) == CHOOSE msg \in uploads: msg.id = id
  ContentWith(id) == CHOOSE msg \in content: msg.id = id

  ExistsUploadWith(id) == 
    \E x \in uploads: 
       x.id = id
  ExistsContentWith(id) == 
    \E x \in content: 
       x.id = id

  ModelTypeInvariant ==
    /\ uploads \in SUBSET ModelUpload
    /\ content \in SUBSET ModelUpload
    
  \* Used to enforce a bounded model
  ModelStateConstraint ==
   /\ \A x \in uploads \union content:
        /\ x.version < MaxVersion
        /\ x.id < MaxId
   /\ next_id < MaxId
    
  TypeInvariant ==
    /\ uploads \in SUBSET Upload
    /\ content \in SUBSET Upload
    /\ queue \in Seq(Msgs)
  
  NoDuplicateIds ==
    /\ UniqueOnIds(uploads)
    /\ UniqueOnIds(content)
  
  ContentMatchesUploads ==
    []<>(
      \A c \in content:
        \E u \in uploads:
          c = u
      )
end define;

macro send_msg(id, action) begin
  queue := queue (+) [id |-> id, action |-> action];
end macro;

process user \in Users
begin
  User:
    while TRUE do
      either
        with 
          create = [from |-> self, version |-> 1, id |-> next_id]
        do
          uploads := uploads ++ create;
          send_msg(create.id, "create");
          next_id := next_id + 1;
        end with;
      or
        with
          upload \in {u \in uploads: u.from = self},
          edit = [upload EXCEPT !.version = @ + 1]
        do
          uploads := uploads -- upload ++ edit;
          send_msg(edit.id, "edit");
        end with;
      or
        with 
          delete \in {u \in uploads: u.from = self}
        do
          uploads := uploads -- delete;
          send_msg(delete.id, "delete");
        end with;
      end either;
    end while;
end process;

fair process worker \in Workers
variables
  msg = NULL, local = NULL;
begin
  Receive:
    await Nonempty(queue);
    msg := Head(queue);
    queue := Tail(queue);
    if msg.action = "create" then
      Create:
        if \E x \in uploads: 
           x.id = msg.id then
        with upload = UploadWith(msg.id) do
          content := content ++ UploadWith(msg.id);
        end with;
        end if;
    elsif msg.action = "edit" then
      Edit:
        if
          /\ ExistsUploadWith(msg.id)
          /\ ExistsContentWith(msg.id)
        then
          with 
            upload = UploadWith(msg.id),
            exists = ContentWith(msg.id)
          do
            content := content -- exists;
            local := upload;
          end with;
        
          PushEdit:
            if 
              /\ ExistsUploadWith(local.id)  
              /\ UploadWith(local.id).version > local.version 
            then
              local := NULL;
              goto Edit;
            else
              content := content ++ local;
              local := NULL;
            end if;
        end if; 

    elsif msg.action = "delete" then
      Delete:
        if \E x \in content: 
           x.id = msg.id
        then
          with
            exists = ContentWith(msg.id)
          do
            content := content -- exists
          end with;
        end if;
    else \* wtf
      assert FALSE;
    end if;
  Repeat:
    msg := NULL;
    goto Receive;
end process;

fair process cleaner = "cleaner"
begin
  Clean:
    while TRUE do
      with
        id \in 
          LET
            upload_ids  == {u.id: u \in uploads}
            content_ids == {c.id: c \in content}
          IN
            content_ids \ upload_ids
        ,
        exists = ContentWith(id)
      do
        content := content -- exists
      end with;
    end while;
end process; 
end algorithm; *)

====

Be sure to translate from PlusCal run with ModelStateConstraint as a state constraint.