# Formally Modeling Database Migrations

Most of my formal methods examples deal with concurrency, because time is evil and I hate it. While FM is very effective for that, it gives people a limited sense of how flexible these tools are. One of the most common questions I get from people is

Formal methods looks really useful for distributed systems, but I’m not making a distributed system. Is FM still useful for me?

Instead of a distributed system, we’re going to do something different. Lots of business apps store their conceptual model in a relational database. If you need to change the structure of the data, you need to convert all of the stored data to the new format. This is a **data migration**. Migrations are risky business because they involve changing a large amount of state at once. They’re also often irreversible, and if something goes wrong you cannot easily undo the migration.

We’d like to make sure the migration will be safe, or that it will accurately map all of our data to the new format. We also want to make sure that the new format matches our requirements. If we accidentally lose a database constraint, then the migration might go fine but we might later become inconsistent.

Modeling a migration is a four-step process. First, we specify the old data representation and the new one. Then, we define what it means for data in each representation to be “equivalent”, in terms of having the same information. Third, we show that equivalent data will have the same constraints: it’s impossible for old data to satisfy the old schema’s constraints but be equivalent to something that violates the new schema’s constraints. Finally, we specify the migration and show that migrated data is equivalent to the old data.

We’re going to use Alloy for this. Alloy is very good at specifying relationships between entities, whether it’s the relationships between tables in a schema or the relationship between two separate schemas. It can also create create visual diagrams of its models, which is handy for exploring your specs. This essay assumes no prior Alloy knowledge. If you want to follow along, you can download it here.

## The Problem

Our system has **People** and **Events**. People **attend** events and either **like** or **hate** them. As a database constraint, we want to enforce that a person cannot both like and hate the same event. For simplicity we’ll assume you don’t have to attend an event to like or hate it.

In our original system, we had two tables, a `Person`

table and an `Event`

table. We modeled the relationships as three separate columns on the `Person`

table, each column a set of `Event`

ids.^{1}

For our new system, we want to replace the three columns with two tables: an `Attendance`

table and a `Review`

table. Each of them stores both a `Person`

id and an `Event`

id. `Review`

will also store whether the review is “like” or “hate”.

Our system already has data in it. We need to migrate all of that data to the new format. Going forward, we also want to preserve the same constraint, where you can’t both like and hate the same event.

## The Model

### Old Schema

```
sig Event {}
```

`Event`

is a **signature**. This specifies the types of instances, or **atoms**, in our model. To keep things simple, the event doesn’t have any other information associated with it. This alone is a complete model, and Alloy can generate matching examples:

Now we add people:

```
sig Person {
events: set Event,
likes: set Event,
hates: set Event
}
```

The body of the signature consists of three **relations**. Relations relate things. Every Person “has” a set of Events they attended, a set they liked, and a set they hated. The `set`

means there are no constraints on just how many events are in each. There are several other quantifiers, like `some`

(at least one) and `lone`

(zero or one).^{2} Now Alloy can generate considerably more complex models:

The Alloy visualizer has a lot of features for analyzing these diagrams, such as an expression evaluator and theming. I can distinguish the shapes and colors of the different signatures and change how the relationships are rendered:

You can download the theme I used here if you’re following along.

Next, we need to express the constraints. No person should like and hate the same event. We can express that with a **predicate**, or a boolean function:

```
pred valid[p: Person]
{
no p.likes & p.hates
}
```

Our only constraint in this example is there is no intersection (`&`

) between these two sets.

Not all generated models will be valid. This is intentional: we want to both be able to say that a certain model satisfies our requirements, and also that one *doesn’t* satisfy our requirements. We could tell Alloy to only generate valid instances of `Person`

by adding a runtime predicate:

```
run {all p: Person | valid[p]}
```

### New Schema

For the new schema, we want separate instances for attendance and reviews. To make things easier, we’ll create a `NewPerson`

signature that points at an existing `Person`

.

```
sig NewPerson {
, person: one Person
}
```

We could also leave out the `one`

here, as Alloy defaults to `one`

if we don’t specify a quantifier. Now for the new tables:

```
sig Attendance {
event: Event,
attendee: NewPerson
}
abstract sig Review {
event: Event,
reviewer: NewPerson
}
sig Likes, Hates extends Review {}
```

`Review`

is **abstract**. We cannot have an instance that is just a `Review`

, we can only have the subtypes in the model. So every `Review`

will either be a `Likes`

review or a `Hates`

review. Both, however, still count as a `Review`

, meaning the set of all `Likes`

is a subset of the set of all `Reviews`

.

Finally, we should write a version of `valid`

for `NewPerson`

. For the sake of instruction later, I’m going to make it always true:

```
pred valid[p: NewPerson] {
some p
}
```

We have two `valid`

predicates, one for `Person`

and one for `NewPerson`

. Alloy sees these are distinct types and has no trouble handling the overload.

## Showing Equivalence

What does it mean for a `NewPerson`

to be equivalent to the original `Person`

? In this case, we can say that they’re equivalent if

- They have the same events
- They have the same likes
- They have the same hates.

Let’s start with the first case:

```
pred same_events[p: Person, np: NewPerson] {
p.events = ???
}
```

How do we get all the events that the `NewPerson`

attended? The data exists on the `Attendance`

instance, not the `NewPerson`

! Here’s where Alloy’s model really shines through. Attendance “has” an `attendee`

relation. This is actually a first-class entity! Alloy sees `attendee`

as a mapping from Attendance instances to `NewPerson`

atoms. We can manipulate the relationship itself. One way is by using `~`

, which reverses a relation: `~attendee`

maps each person to their attendance records. As a visual example:

`attendee`

`~attendee`

from | to |
---|---|

`Attendance1` |
`Person1` |

`Attendance2` |
`Person1` |

`Attendance3` |
`Person2` |

from | to |
---|---|

`Person1` |
`Attendance1` |

`Person1` |
`Attendance2` |

`Person2` |
`Attendance3` |

```
-- a NewPerson
np
-- every Attendance record for a NewPerson
np.~attendee
```

We can also apply relationships to sets of atoms. `Attendance`

is the set of all attendance atoms. `Attendance.event`

is the set of all events that were attended by anybody. Then `np.~attendee.event`

would be all of the events attended by the specific `np`

.

```
pred same_events[p: Person, np: NewPerson] {
p.events = np.~attendee.event
}
```

For `same_reviews`

, we can do something similar. `np.~reviewer`

is all of the reviews by a specific `np`

. This includes both positive and negative reviews, which we need to distinguish. We can do this with a set intersection. `Likes`

is all of the positive reviews, so `np.~reviewer & Likes`

is the set of all positive reviews by `np`

. Finally, `(np~.reviewer & Likes).event`

is the set of all events that correspond to positive reviews by `np`

. We get the negative reviews in a similar way.

```
pred same_reviews[p: Person, np: NewPerson] {
p.likes = (np.~reviewer & Likes).event
p.hates = (np.~reviewer & Hates).event
}
```

We have two clauses in this predicate. All statements in the curly braces must be true for the entire predicate to be true. Between `same_events`

and `same_reviews`

, we have enough to check whether a given `Person`

and `NewPerson`

represent equivalent data:

```
pred equivalent[p: Person, np: NewPerson] {
p = np.person
same_events[p, np]
same_reviews[p, np]
}
```

We use the same multiple clauses here: all three clauses must be true for `equivalent`

to be a true predicate. Let’s generate a matching example:

```
example: run {
one Person
one NewPerson
equivalent[Person, NewPerson]
}
```

By default Alloy will run the topmost `run`

in your spec. You can select which one to run from the `Execute`

menu.^{3}

### Validity Equivalence

Before we can spec out the migration, we need to do two things. First, we need to show that the two models can express equivalent data. That part’s done. Second, we need to show that the new models have equivalent constraints. If a `Person`

violates our constraints, so should the equivalent `NewPerson`

. If the former satisfies the constraints, the latter should also satisfy the constraints. We need to **assert** that two equivalent models have equivalent validity.

```
assert equivalence_preserves_validity {
all p: Person, np: NewPerson |
equivalent[p, np] implies (valid[p] iff valid[np])
}
```

If the two data models are equivalent, then they are either both valid or both invalid. As we didn’t actually put any constraints on `NewPerson`

, we should expect this assertion to be false. By telling Alloy to check it, Alloy will search all possible models to see if any break this.

```
check equivalence_preserves_validity
```

Unsurprisingly, Alloy finds a counterexample:

Both `Person`

and `NewPerson`

attended the same event, and they have matching likes and dislikes. That means they’re equivalent. However, `Person`

likes and hates the same event, which means `Person`

is invalid. But `NewPerson`

is still valid, breaking our assertion. To fix this, let’s add another constraint to `valid[np: NewPerson]`

.

```
pred valid[np: NewPerson] {
all r1, r2: np.~reviewer |
r1 in Likes and r2 in Hates
implies r1.event != r2.event
}
```

In English: For any two reviews done by `np`

, if one is a Like review and the other is a Hates review, then they must belong to different events. Once we update the definition of `valid`

, Alloy finds no more counterexamples.

It’s worth emphasizing the scope of the model checker. Unless specified otherwise, Alloy will assume up to three instances of each signature may be present. For three events and one `p: Person`

, there are 8 possible values for `p.events`

and 64 possible values for `p.likes`

and `p.hates`

. There are at least as combinations of `Assignment`

and `Review`

for `np: NewPerson`

, meaning that, as an extreme lower-bound estimate, Alloy is checking `2^18`

possible examples.^{4} On my computer, Alloy sweeps the state space in just a few milliseconds.

## Modeling the Migration

Now we’re ready to model the migration itself. While the equivalence predicate is declarative, the migration needs to be more algorithmic. Someone looking at the migration should easily be able to translate it to code.

In a programming language, we’d write a function that transforms a `Person`

into a `NewPerson`

. While Alloy has functions, they’re not useful here. In specifications, we don’t restrict ourselves to a single model, and instead talk about the entire space of possibilities. Instead, we specify what it means for a `NewPerson`

to be a migration of a `Person`

and let Alloy explore the models where that relation holds.

```
pred migration[p: Person, np: NewPerson] {
p = np.person
-- migrate events
all e: p.events |
one a: Attendance {
a.event = e
a.attendee = np
}
-- migrate likes
all l: p.likes |
one r: Likes {
r.event = l
r.reviewer = np
}
-- migrate hates
all h: p.hates |
one r: Hates {
r.event = h
r.reviewer = np
}
}
```

Somebody could look at this and know what they have to do: take the `events`

column on each `Person`

row and, for each element in that column, create a corresponding row in the `Attendance`

table. Do the same with `likes`

and `hates`

. That’s all.

If this is a correct description of the migration, then our migrated data should be equivalent, right?

```
assert migration_preserves_equivalence {
all p: Person, np: NewPerson |
migration[p, np] implies equivalent[p, np]
}
check migration_preserves_equivalence
```

Alloy finds a counterexample!

What’s going on here? We created exactly one corresponding `Like`

for every event that `Person`

liked. Since `Person`

didn’t like any events, we needed to create zero corresponding rows for `NewPerson`

. This is satisfied, so the `migration`

predicate is true. We also, for unrelated reasons, created an unrelated `Like`

for `NewPerson`

. This means they’re *not* equivalent. So we have relationship that counts as a migration but does not preserve equivalence.

In other words, we didn’t constrain the migration rules enough. You could “create” the corresponding rows in the new schema just by adding a row for every event in the database! What we actually need to say is “for each *and only* the elements in the column, create the corresponding row.”

```
-- migrate events
all e: p.events |
one a: Attendance {
a.event = e
a.attendee = np
}
+ no e: Event - p.events |
+ some a: Attendance {
+ a.event = e
+ a.attendee = np
+ }
```

This is getting a bit unwieldy, so let’s simplify it by instead saying that we create exactly as many rows as are in the `events`

column:

```
pred migration[p: Person, np: NewPerson] {
p = np.person
+ -- only generate the records needed
+ #p.events = #np.~attendee
+ #p.likes = #np.~reviewer & Likes
+ #p.hates = #np.~reviewer & Hates
```

The `#`

means “number of elements in the set”. With this change the assertion now passes. We have shown that migrations produce equivalent models, and equivalent models have equivalent validities.

We *could* stop here. But there’s a couple of additional tricks we have. First, if we want to be thorough, we should check some statements that we expect to be correct. Alloy makes it very easy to “battle test” our specifications this way. If we can express a property of the spec, we automatically have the complete test for it. Second, we can check some properties where we don’t *know* if they should be correct or not. Alloy will generate examples that give us more insight into our system.

### Being Thorough

Here’s a property I’m uncertain of: if two data models are equivalent, is one of the migration of the other?

```
assert equivalence_implies_migration {
all p: Person, np: NewPerson |
equivalent[p, np] implies migration[p, np]
}
check equivalence_implies_migration
```

This property does not hold:

The `NewPerson`

is equivalent to the `Person`

. We have the same events liked and attended. But we have *two* database records for liking the same event! Now, this might not actually be a problem. Our new data model is considerably richer than our old one, so it makes sense that we can express something that the old model could not. It’s our decision whether having multiple review or attendances of the same type is a valid use case. If it’s not a valid case, we can patch this by changing our `valid`

predicate for `NewPerson`

:

```
pred valid[np: NewPerson] {
all r, r': np.~reviewer |
r in Likes and r' in Hates implies r.event != r'.event
+ all e: Event {
+ -- at most one review for this event
+ lone r: np.~reviewer |
+ r.event = e
+
+ -- at most one attendance
+ lone a: np.~attendee |
+ a.event = e
+ }
}
```

This doesn’t fix `equivalence_implies_migration`

. Not only that, but it also breaks `equivalence_preserves_validity`

. We’ve made the `NewPerson`

constraints stronger without changing the corresponding `Person`

constraints. There is now an invalid `NewPerson`

that are equivalent to a valid `Person`

.

There’s also a *valid* `NewPerson`

equivalent to the valid `Person`

, which is why `equivalence_implies_migration`

breaks. We could have migrated to an invalid `NewPerson`

or a valid `NewPerson`

. Patching that property is easy enough.

```
assert equivalence_implies_migration {
all p: Person, np: NewPerson |
- equivalent[p, np] implies migration[p, np]
+ (equivalent[p, np] and valid[np]) implies migration[p, np]
}
```

Dealing with `equivalence_preserves_validity`

is harder. The “obvious” way to fix this is to modify `equivalent`

. This would be a mistake, though. Right now `equivalence`

is purely definitional: we are taking the data in `NewPerson`

and translating them into equivalent statements for `Person`

. It does not restrict how we define `NewPerson`

, or the constraints we placed on it. Changing `equivalent`

would mix our abstraction levels.

We’re better off reexamining our assertions.

### Rethinking Our Properties

Here are our three properties:

- Migrations produce equivalent schemas
- If two valid schemas are equivalent, they must have been produced by a migration
- Equivalent schemas have equivalent validities (broken)

Consider our actual system. Assuming everything is “normal”, we already know all of the existing data is valid. We then migrate, which converts every instance to its equivalent in the new schema. The migration will fail if we accidentally convert valid data to invalid data. Finally, *going forward*, all data will be in the new schema. We should not be able to update it in a way that would correspond to an invalid data in the old system. With that in mind, we have the following properties as our requirements:

- Migrations produce equivalent data
- If two valid schemas are equivalent, they must have been produced by a migration
- If the new data is valid, then the equivalent old data would have been valid, too
- If we have valid old data and migrate it to the new schema, the new data will also be valid

We’ve split property (3) into two weaker properties. The original one isn’t something we can guarantee anymore, but the two weaker properties can be guaranteed. This means our assertions become:

```
- assert equivalence_preserves_validity {
- all p: Person, np: NewPerson |
- equivalent[p, np] implies (valid[p] iff valid[np])
- }
- check equivalence_preserves_validity
```

```
assert equivalence_does_not_decrease_validity {
all p: Person, np: NewPerson |
equivalent[p, np] and not valid[p] implies not valid[np]
}
assert migrations_stay_valid {
all p: Person, np: NewPerson |
migration[p, np] and valid[p] implies valid[np]
}
check migrations_stay_valid
check equivalence_does_not_decrease_validity
```

All properties are now preserved and we have verified our migration. You can see the entire spec here.

## Discussion

The above example only used a fraction of Alloy’s functionality. We can express much more complex migration properties than this, such as multistage migrations, rollbacks, or complex transformations. At the same time, we’ve seen almost all of Alloy’s *syntax*. The entire grammar is less than two pages. That’s roughly the same length as the simplified JSON grammar. Alloy packs a lot of power in a very small space.

We’re currently writing online documentation for Alloy, but in the meantime the best place to learn is the book Software Abstractions. If you’re interested in learning of other applications, Jay Parlar had an excellent talk at StrangeLoop this year. Finally, if you’re interested in using Alloy at your job, I run workshops. You can contact me here to learn more.

*Thanks to Jay Parlar and Richard Whaling for feedback.*

- You can download the source code for the schema images here.
^{[return]} `lone`

is short for “less than or equal to one”.^{[return]}- If you’re following along, you may see a different model. Alloy uses a SAT solver to find solutions and different machines may present solutions in a different order.
^{[return]} - In practice, many of these states are
**symmetrical**with each other, meaning iff one’s valid, they’re all valid. Alloy is very smart about pruning the possible state space.^{[return]}