Wednesday, December 26, 2007

On Yes/No Columns

In designing a database, one often comes across the need to track Boolean data of an either/or nature, with values like Yes or No, True or False. In common parlance, these are called Boolean "table columns", but we will see that the table metaphor indeed encourages poor design decisions with regard to data of this nature.

Boolean attributes: bad voodoo.

I submit to you that a normalized database will have absolutely no Boolean attributes, and that using such attributes is almost always a poor database design choice.

Firstly, for any such column, we will need to decide whether to allow NULL (if we haven't already decided, for the entire database, to disallow them). If allowed, we will need to decide what NULL means in this context. Sometimes null means "no", sometimes "yes", other times "maybe", "unknown", "not applicable", etc. All queries which use this attribute will need to be written with this meaning in mind, and will need to account for the NULL case. You also need to be aware of how your DBMS handles NULL for indexing (Oracle, for example, can not use indexes except in a limited capacity for "is null" or "is not null" queries).

Secondly, depending on our preference for handling NULL (or not), extending the heading type of a relation (i.e. adding columns to a table) can be an expensive operation if we're modifying an already existing database.

Thirdly, the Boolean attribute design is not readily open to extension. Suppose the requirements change so that a simple Yes/No answer no longer suffices (e.g. is_resolved becomes resolution_date, resolution_type, and resolved_by).


Why are Boolean attributes even considered?

The notion of Boolean attributes comes from the preference of seeing relations as "flat tables" with "columns". Surely, most data can be presented in a tabular format, but this is merely one view of the data. In a relational database, a relation of arity n (which a table with n columns represents) is actually an n-dimensional data structure rather than a two-dimensional table with rows and columns.

Let's look at that claim and what it means. Imagine a nullary relation. What kind of data could such a structure represent? Let's think of this as a table with no columns. Having trouble imagining that? It's hardly a table if it has no columns, but it's a perfectly valid kind of relation. It can represent exactly two distinct values. It can be empty, or it can contain a single tuple (with no attributes). In other words: It's Boolean. So we see that a relational database can represent True/False values without as much as a single attribute. You could indeed design a database in such a way that it contained only nullary relations (one for every fact), but this would be incredibly inefficient (and yet much more efficient than a table with a single row representing, with legions of Boolean attributes, every fact in the database).


A better, relational, design.

If we add an attribute to the nullary relation, it becomes a unary relation, a one-dimensional data structure. We now have something about which to say that something is true or false. Let's say we have a database of our DVD collection and we wish to track whether we have watched a given DVD or not. Instead of adding a Boolean attribute to the DVDs, it's a simpler and more elegant solution to add a unary relation "have_watched", with the DVD's title or ID as its only attribute. We can assume that the world is closed (we have either watched it or not) since there are no nulls possible. No modification to the DVD relation is required, and this design is readily extensible to include a date_watched attribute (or it could be further extended for a multi-owner collection, or whatever).

I hope you will agree that Boolean attributes in relational databases are entirely redundant and introduce unnecessary ambiguity. A relation gives you true/false semantics for free by its very nature, even without any attributes. We can use that to our advantage by simply assuming that every fact in the database is true, and that falsehoods are represented by their absence.

No comments: