Skip to Content

Designing Distributed Systems with TLA+

Official Description: Concurrency is hard. How do you test your system when it’s spread across three services and four languages? Unit testing and type systems only take us so far. At some point we need new tools.

Enter TLA+. TLA+ is a specification language that describes your system and the properties you want. This makes it a fantastic complement to testing: not only can you check your code, you can check your design, too! TLA+ is especially effective for testing concurrency problems, like stalling, race conditions, and dropped messages.

This talk will introduce the ideas behind TLA+ and how it works, with a focus on practical examples. We’ll also show how it caught complex bugs in our systems, as well as how you can start applying it to your own work.

(This is the exact same abstract as the last TLA+ talk, despite the two having just one slide in common. Writing abstracts is not my strong suite.)

Actual Description: I try to convince everybody that threads and Kubernetes are literally the exact same. After failing that I try to convince everybody that distributed systems are easy if we just think really hard about them. After failing that I try to convince everybody that thinking hard is boring and we should just write TLA+ specs instead. After failing that I shill my book and consulting services because dignity is for nerds.

Talk is here.

FAQ

These are questions I got after the talk. I will add more as people ask them.

How do you go from a spec to code in production?

Carefully. Verifying that code matches a spec, aka refinement, is a really hard problem! To accurately implement your spec, you need good engineers, good tooling, and good process. Nothing can take those needs away.

The benefit of modeling is that if you see an error in production, you immediately know that it’s either an implementation error or some factor you didn’t consider in your model. If you didn’t write a spec, you don’t know whether it’s an implementation error, a fundamental design error, or even an ambiguity in what you were supposed to do in the first place.

Does TLA+ passing a model guarantee that your system is 100% correct or just 99.99%?

TLA+ guarantees that your spec has the specific properties you verified under the specific assumptions you made and model you ran. If you have the assumption that at least half your workers are weakly fair, it makes no guarantees about the case where less than half your workers are weakly fair. If you cap the size of the message queue at 10 messages, you get no guarantees about what will happen if you enqueue 20 messages.

The current model checker can only check bounded models, or models with a finite number of possible states. Note that there can be an infinite number of behaviors in a bounded model, for example if there’s a valid cycle.

How does this fit in with practices like Agile and TDD?

Pretty well, in my opinion! TDD covers ground level, local-code implementation and verification, something that’s outside of scope for TLA+. But TDD doesn’t remove the benefit of thinking about the big picture.

Similarly, the most popular styles of Agile favor short-term iterations, continuous feedback, and developer empowerment. This is all orthogonal to the value of design and planning ahead. Most of Agile’s issue with “design” is actually with the process of “Big Design Up Front”: creating a rigid plan and never deviating from it, even when reality gets in the way. But TLA+ and other modern specification languages aren’t used for BDUF. Instead, you have the flexibility to choose your level of detail, meaning you can design exactly as much as you need to get the benefits you want. You can easily modify your spec in the face of sudden changes and adapt to the new reality.

In return you learn whether your designs will work well before you spend time implementing them. You can find the ambiguity in customer requirements well before you show them a demo. And you learn exactl what you need to change about your product when new constraints come in. I talk more about the intersection between Agile and TLA+ in this essay.

How much time does it add to the development process?

There aren’t yet any large-scale studies on the effectiveness of formal specifications. Anecdotally, you can usually get a beneficial model within 1-2 days. That’s usually enough to find issues with existing systems or hammer out the details of new ones.

I’ve found that a formal specification saves significant time in the long run, both by reducing the amount of expensive, late-stage rework and by reducing the amount of post-release debugging. In my experiences it usually saves maybe 25% of the development time plus maybe the same amount saved in maintenance. Other people I’ve talked to have found similar or larger savings.

How do you decide what to specify?

Experience! A very rough rule of thumb I have is that TLA+ in particular is good if you’re dealing with concurrency or expect the implementation to take more than a week. But the cutoff is super fuzzy in practice.

Where’s the best place to get started?

Take existing systems and write specs as a form of documentation. Since you know how the existing behavior, it’s easier to write and sanity-check your spec. About half the people I know who tried this found it a great way to get a better understanding of the system. The other half found critical bugs this way.

How does this compare to FDR and CSP?

I’ve played with FDR a bit but don’t know it well enough to accurately comment.

Spec

To make the example in the talk more accessible, I invented my own syntax for TLA+. All of changes are 1-1. For example, instead of [key |-> val] I wrote {key: val}. I also only showed a subset of the full spec in the talk. In a real system, there are all sorts of invariants you’d want to check: all messages are well-formed, no deadlocks, all ids are unique, etc. Covering this, though, would lose some focus, so I focused only on ambiguity and liveness. You can see the full spec below:

---- MODULE Uploader ----

EXTENDS Sequences, Integers, TLC
CONSTANTS Users, Workers, NULL

\* For model checking
CONSTANTS MaxVersion, MaxId

Ids == 1..MaxId
Versions == 1..MaxVersion

SeqOf(set, n) == UNION {[1..m -> set] : m \in 0..n}
Range(f) == {f[x]: x \in DOMAIN f}
Nonempty(seq) == seq /= <<>>
set ++ e == set \union {e}
set -- e == set \      {e}

seq (+) elem == Append(seq, elem)

Upload == [from: Users, version: Nat, id: Nat]
ModelUpload == [from: Users, version: Versions, id: Ids]

Msgs == [id: Nat, action: {"create", "edit", "delete"}]

set |- id == CHOOSE x \in set: x.id = id
set |= id == {x \in set: x.id = id}

UniqueOnIds(set) ==
  \A m1, m2 \in set:
    m1 /= m2 <=> m1.id /= m2.id

(*--algorithm uploader

variables
  uploads = {};
  content = {};
  queue = <<>>;
  next_id = 1;
  
define
  UploadWith(id) == CHOOSE msg \in uploads: msg.id = id
  ContentWith(id) == CHOOSE msg \in content: msg.id = id

  ExistsUploadWith(id) == 
    \E x \in uploads: 
       x.id = id
  ExistsContentWith(id) == 
    \E x \in content: 
       x.id = id

  ModelTypeInvariant ==
    /\ uploads \in SUBSET ModelUpload
    /\ content \in SUBSET ModelUpload
    
  \* Used to enforce a bounded model
  ModelStateConstraint ==
   /\ \A x \in uploads \union content:
        /\ x.version < MaxVersion
        /\ x.id < MaxId
   /\ next_id < MaxId
    
  TypeInvariant ==
    /\ uploads \in SUBSET Upload
    /\ content \in SUBSET Upload
    /\ queue \in Seq(Msgs)
  
  NoDuplicateIds ==
    /\ UniqueOnIds(uploads)
    /\ UniqueOnIds(content)
  
  ContentMatchesUploads ==
    []<>(
      \A c \in content:
        \E u \in uploads:
          c = u
      )
end define;

macro send_msg(id, action) begin
  queue := queue (+) [id |-> id, action |-> action];
end macro;

process user \in Users
begin
  User:
    while TRUE do
      either
        with 
          create = [from |-> self, version |-> 1, id |-> next_id]
        do
          uploads := uploads ++ create;
          send_msg(create.id, "create");
          next_id := next_id + 1;
        end with;
      or
        with
          upload \in {u \in uploads: u.from = self},
          edit = [upload EXCEPT !.version = @ + 1]
        do
          uploads := uploads -- upload ++ edit;
          send_msg(edit.id, "edit");
        end with;
      or
        with 
          delete \in {u \in uploads: u.from = self}
        do
          uploads := uploads -- delete;
          send_msg(delete.id, "delete");
        end with;
      end either;
    end while;
end process;

fair process worker \in Workers
variables
  msg = NULL, local = NULL;
begin
  Receive:
    await Nonempty(queue);
    msg := Head(queue);
    queue := Tail(queue);
    if msg.action = "create" then
      Create:
        if \E x \in uploads: 
           x.id = msg.id then
        with upload = UploadWith(msg.id) do
          content := content ++ UploadWith(msg.id);
        end with;
        end if;
    elsif msg.action = "edit" then
      Edit:
        if
          /\ ExistsUploadWith(msg.id)
          /\ ExistsContentWith(msg.id)
        then
          with 
            upload = UploadWith(msg.id),
            exists = ContentWith(msg.id)
          do
            content := content -- exists;
            local := upload;
          end with;
        
          PushEdit:
            if 
              /\ ExistsUploadWith(local.id)  
              /\ UploadWith(local.id).version > local.version 
            then
              local := NULL;
              goto Edit;
            else
              content := content ++ local;
              local := NULL;
            end if;
        end if; 

    elsif msg.action = "delete" then
      Delete:
        if \E x \in content: 
           x.id = msg.id
        then
          with
            exists = ContentWith(msg.id)
          do
            content := content -- exists
          end with;
        end if;
    else \* wtf
      assert FALSE;
    end if;
  Repeat:
    msg := NULL;
    goto Receive;
end process;

fair process cleaner = "cleaner"
begin
  Clean:
    while TRUE do
      with
        id \in 
          LET
            upload_ids  == {u.id: u \in uploads}
            content_ids == {c.id: c \in content}
          IN
            content_ids \ upload_ids
        ,
        exists = ContentWith(id)
      do
        content := content -- exists
      end with;
    end while;
end process; 
end algorithm; *)

====

Be sure to translate from PlusCal run with ModelStateConstraint as a state constraint.