27

Sum types for relational databases

 4 years ago
source link: https://blog.typeable.io/posts/2019-11-21-sql-sum-types.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Matt Parsons talks about a few ways to encode sum types in relational databases in his blog .

Not so long ago I was thinking about the same problem and came up with slightly different solutions. Let’s start with an example Matt had used:

data Animal
    = Cat Name Age FavoriteFood
    | Dog Name OwnerId
    | Bird Name Song

Matt shows us three ways of dealing with it. Let’s do a recap to have a better picture:

  1. Shared Primary Key

Dependent tables (which have a 1-to-1 correspondence to constructors) refer to the primary table (which corresponds to the Haskell data type). To make sure it’s impossible to have multiple references to the same row of the primary table we add a column containing constructor names to the primary key of the primary table. At the same time, foreign keys of dependent tables have a similar columns, but in each table the column has only one possible value. Same identifiers should be used for both tables.

  1. The persistent approach

The primary table refers to the dependent tables. Identifiers of these tables are generated separately and are different. Only one value of the foreign key can be not null.

  1. Nullable Columns

Here a denormalized representation of the data is used. All data goes into the the same table, constructor is stored in one of the columns, check constraints are used to limit the range of values we can assign according to the constructor.

I’m proposing two other designs to model sum-types in relational databases.

Normalized sum

In his example, Matt uses sum-of-products ADTs we use in haskell, rather than a pure sum-type. Let’s take a look a pure sum first:

data Cat  = Cat Name Age FavoriteFood
data Dog  = Dog Name OwnerId
data Bird = Bird Name Song
data Animal = ACat Cat | ADog Dog | ABird Bird

For this example, I argue that the persistent approach is a better fit overall. But I propose a few changes:

  • The primary table should contain a constructor tag by introducing a column for every possible value of the sum type. The columns have a unit type (which admits only one value) but are nullable. Only one column can have a non-null value, indicating which constructor is used.
  • Dependent tables have a single extra column with a corresponding constructor tag which can only be unit.
  • Identifiers of the dependent tables mirror identifiers of the primary table. We add constructor tags to the foreign key.

A few benefits of this approach comparing to the persistent one is:

  • We have fewer foreign key identifiers to work with. This leads to a reduction in the size of the primary table, but more importantly, it reduces the number of foreign key indexes.
  • In some cases it improves select performance, as you don’t have to map primary identifiers into dependent identifiers. Consider e.g. an extension to our database schema adding for each pet a history of its owners (many-to-many). To select all history of all dog owners we don’t need to use the animals table, it suffices to join the history table with the dogs table.

Implementation

create type unit as enum ('()');
create table cat (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  age int not null,
  food text not null,
  primary key (id,tag)
);
create table dog (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  owner_id int not null,
  primary key (id,tag)
);
create table bird (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  song text not null,
  primary key (id,tag)
);
create table animal (
  id integer primary key,
  cat unit,
  dog unit,
  bird unit,
  check
    ( (cat is not null and dog is null and bird is null)
    or (cat is null and dog is not null and bird is null)
    or (cat is null and dog is null and bird is not null) ),
  constraint animal__cat foreign key (id,cat) references cat (id,tag),
  constraint animal__dog foreign key (id,dog) references dog (id,tag),
  constraint animal__bird foreign key (id,bird) references bird (id,tag));

When inserting we have to insert data first into the dependent table, and then into the primary table.

When deleting we have to do the opposite.

Denormalized sum

So, what’s the difference between normalized and denormalized sums and what is unsatisfactory in the proposed implementation?

In the normalized implementation, rows in the dependent table can exist on their own, without anything referencing them from the primary table. It’s possible to remove data from the primary table and accidentally forget to remove the correspnding data from the dependent table. Bulk deletion scenarios are not obvious as well. An even worse scenario can arise if we consider cascade deletion of some data in the primary table. This “disjointedness” of the dependent data is a very close approximation of the pure sum structure, but from the POV of database design it’s not the best representation possible. The same problem is exhibited in the persistent approach. For the “Shared Primary Key” approach the situation is reversed — data in the primary table can exist on its own. In other words, in the persistent approach the identifiers in the primary table are a subset of the union of the identifiers in the dependent tables, and in the “Shared Primary Key” approach — a superset.

Let’s just add two-way references! But how do we insert and delete the data in this scenario? Which table should we start with?

The answer lies in a fairly standard database practice which is called deferred constraints, which allows us to delay the checks until the end of the transaction, so we get a bijective relation between the primary table and the union of dependent tables.

Denormalized implementation

create type unit as enum ('()');
create table cat (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  age int not null,
  food text not null,
  primary key (id,tag)
);
create table dog (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  owner_id int not null,
  primary key (id,tag)
);
create table bird (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  song text not null,
  primary key (id,tag)
);
create table animal (
  id integer primary key,
  cat unit,
  dog unit,
  bird unit,
  check
    ( (cat is not null and dog is null and bird is null)
    or (cat is null and dog is not null and bird is null)
    or (cat is null and dog is null and bird is not null) ),
  constraint animal__cat foreign key (id,cat) references cat (id,tag)
    deferrable initially deferred, -- !!!
  constraint animal__dog foreign key (id,dog) references dog (id,tag)
    deferrable initially deferred, -- !!!
  constraint animal__bird foreign key (id,bird) references bird (id,tag)
    deferrable initially deferred); -- !!!
---------- !!! ----------------|||
alter table cat add constraint cat__animal
  foreign key (id) references animal (id) on delete cascade;
alter table dog add constraint dog__animal
  foreign key (id) references animal (id) on delete cascade;
alter table bird add constraint bird__animal
  foreign key (id) references animal (id) on delete cascade;

All the differences of this implementation are highlighted with comments.

In this implementation we have to insert data starting with the primary table (which in my opinion is more natural). To delete the data, we just delete from the primary table, and then cascade deletion removes data from dependent tables.

begin;
  insert into animal(id,cat) values (1,'()');
  insert into cat(id,name,age,food) values (1,'kitty',1,'milk');
  insert into animal(id,dog) values (2,'()');
  insert into dog(id,name,owner_id) values (2,'dolly',5);
  insert into animal(id,bird) values (3,'()');
  insert into bird(id,name,song) values (3,'nightingale','twist');
commit;
delete from animal where id in (1,2)

Here, if desired we can also move the name column to the primary table.

Thanks for reading!


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK