Wednesday, December 26, 2007

On Yes/No Columns

In designing a database, one often comes across the need to track Boolean data of an either/or nature, with values like Yes or No, True or False. In common parlance, these are stored in Boolean "columns", but we will see that the table metaphor encourages poor design decisions for data of this nature.

Boolean attributes: bad voodoo.

I submit to you that a normalized database will have absolutely no Boolean attributes, and that using such attributes is almost always a poor database design choice.

Firstly, for any such column, we will need to decide whether to allow NULL (if we haven't already decided, for the entire database, to disallow them). If allowed, we will need to decide what NULL means in this context. Sometimes NULL means "no", sometimes "yes", other times "maybe", "unknown", "not applicable", etc. All queries that use this attribute will need to be written with this meaning in mind, and will need to account for the NULL case. You also need to be aware of how your DBMS handles NULL for indexing (Oracle, for example, cannot use indexes except in a limited capacity for "is null" or "is not null" queries).

Secondly, extending the heading type of a relation (i.e. adding columns to a table) can be an expensive operation on an already existing database, especially if we disallow NULL: every existing row must then be backfilled with a default value for the new column.

Thirdly, the Boolean attribute design is not readily open to extension. Suppose the requirements change so that a simple Yes/No answer no longer suffices (e.g. is_resolved becomes resolution_date, resolution_type, and resolved_by).


Why are Boolean attributes even considered?

The notion of Boolean attributes comes from the preference for seeing relations as "flat tables" with "columns". Surely, most data can be presented in a tabular format, but this is merely one view of the data. In a relational database, a relation of arity n (which a table with n columns represents) is actually an n-dimensional data structure rather than a two-dimensional table with rows and columns.

Let's look at that claim and what it means. Imagine a nullary relation. What kind of data could such a structure represent? Let's think of this as a table with no columns. Having trouble imagining that? It's hardly a table if it has no columns, but it's a perfectly valid kind of relation. It can represent exactly two distinct values. It can be empty, or it can contain a single tuple (with no attributes). In other words: It's Boolean. So we see that a relational database can represent True/False values without so much as a single attribute. You could indeed design a database in such a way that it contained only nullary relations (one for every fact), but this would be incredibly inefficient (and yet much more efficient than a table with a single row representing, with legions of Boolean attributes, every fact in the database).


A better, relational, design.

If we add an attribute to the nullary relation, it becomes a unary relation, a one-dimensional data structure. We now have something about which to say that something is true or false. Let's say we have a database of our DVD collection and we wish to track whether we have watched a given DVD or not. Instead of adding a Boolean attribute to the DVDs, it's a simpler and more elegant solution to add a unary relation "have_watched", with the DVD's title or ID as its only attribute. We can assume that the world is closed (we have either watched it or not) since there are no nulls possible. No modification to the DVD relation is required, and this design is readily extensible to include a date_watched attribute (or it could be further extended for a multi-owner collection, or whatever).
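To make the idea concrete, here's a rough sketch of the same design in Java terms (the class and DVD titles are my own invention). A set plays the role of the unary relation: a DVD's presence in the set is the fact, and its absence is the falsehood.

```java
import java.util.HashSet;
import java.util.Set;

public class DvdCollection {
    // The unary "have_watched" relation: a DVD's title is present
    // in the set if and only if the fact "we have watched it" holds.
    private final Set<String> haveWatched = new HashSet<String>();

    public void markWatched(String title) {
        haveWatched.add(title);
    }

    // Closed-world assumption: absence from the relation means "no".
    // There is no third NULL-like state to account for.
    public boolean hasWatched(String title) {
        return haveWatched.contains(title);
    }

    public static void main(String[] args) {
        DvdCollection c = new DvdCollection();
        c.markWatched("Brazil");
        System.out.println(c.hasWatched("Brazil")); // true
        System.out.println(c.hasWatched("Alien"));  // false
    }
}
```

Note that extending this design (say, to a date watched) means changing only the set to a map from title to date; the DVD records themselves are untouched, just as in the relational version.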

I hope you will agree that Boolean attributes in relational databases are entirely redundant and introduce unnecessary ambiguity. A relation gives you true/false semantics for free by its very nature, even without any attributes. We can use that to our advantage by simply assuming that every fact in the database is true, and that falsehoods are represented by their absence.

Tuesday, October 17, 2006

The Joys of Dealing With Oracle Support

Dealing with Oracle Support is like something out of a Jonathan Swift story.

You run into a problem with data corruption. They tell you to run a corruption diagnostics tool. You run the tool, and its findings are inconsistent. So they send you to the group that supports the diagnostics tool. They tell you to install a patch to make the findings consistent. The patch fails to install. They send you to the group that supports issues with the patch installer. You complain that you're being routed around but your issue remains unsolved. They route your issue to the department that deals with misrouted issues.

Tuesday, October 10, 2006

Relations, Dimensions, Time, and Space

Is time really a 4th dimension? I think it depends on how you look at it. But the number 4 is definitely arbitrary.

Any attribute of an existent may be thought of as a dimension of a relation over existents on the continuum or field of that attribute. The three spatial dimensions do not exist as such, but are attributes of existents measuring their particular boundaries or orientation relative to each other.

Time, on the other hand, does not measure any particular attribute of an existent. It measures all of its attributes in unison. Time is a measure of existence itself. It provides the sequential attribute of causality as we track which existents are concurrent and which occur separately.

Time, therefore, is a kind of dimension: a continuum on which things come into existence and then cease to exist. But that doesn't make time special, doesn't group it with the spatial dimensions, and doesn't validate the concept of "spacetime".

An object's color can be thought of as a dimension. As can its number of vertices, its intelligence, its similarity to a carrot, or any other conceivable attribute.

It's hard for us to conceive of the fact that time and space do not exist except in the mind. They are concepts. They're data. They are relations over attributes of things that do exist.

Wednesday, September 27, 2006

Printing in Vista Will be Memory-Hungry

The next incarnation of the .Net framework, shipping with Windows Vista, will include a new format for printing documents called XPS, or XML Paper Specification. This will be part of the Windows Presentation Foundation (WPF).

It seems that Microsoft is determined to make printing on Vista take as long---and as much memory---as possible.

I downloaded the release candidate of the .Net Framework 3.0, along with "XPS Writer", a printer driver that works with Windows XP to create XPS documents. Printing a 17-page document took 2 minutes on my 3 GHz machine with 2 GB of memory. And this is printing to a file, mind you. How big do you think the file was? The PostScript version of the same document was 300 kB (and took less than a second to spool).

The XPS version was 12 MB. I am not kidding. If you opened this file in a text editor and printed it, the result would be over 3000 pages long.

Wednesday, July 19, 2006

Curse you, Maven!

We needed to set up a more comprehensive build system than the simple ant scripts we have. A coworker said I should look into Maven 2.0.

Well, I have spent an entire day trying to figure out how to get Maven to do an automatic daily build. Sounds pretty simple. You check out, you compile the source directory, and you leave the class files in a descriptive location. Well, it's not simple with Maven.

My first clue should have been that the book does not say what Maven is. In fact, the author explicitly states that there is no such definition. In other words: To those who understand Maven, no explanation is necessary. To those who don't, no explanation is possible. No explanation is available either, as there is very little by way of documentation, in typical Apache fashion. What little there is out there is far too simplistic.

I maintain that if you cannot say what something is in 13 words or less, then there are only two possibilities: You either do not know what it is, or there is no such thing.

Maven, you have broken me. I banish you from my realm.

Wednesday, March 08, 2006

Java Optimizations

Java programmers are used to having memory handled for them. They don't have to malloc, and they don't have to free. You just create objects when you need them and once you've stopped using them, the JVM will clean up after you. It's magic.

But sooner or later garbage collection will bite you on the ass. Here's what happens. Your application has places where it creates a lot of little objects and discards them almost immediately. If this doesn't fill up your heap space, it will fragment it. The JVM has a more and more difficult time finding consecutive blocks to fit your larger objects. Ultimately, this will result in a "full gc", where the garbage collector walks the entire heap looking for memory blocks to free up, moving objects around in memory as necessary to defragment the heap. This process may lock up your application completely, and, if memory is being allocated faster than it can be freed, indefinitely. Generally the symptom of this is that Java's CPU usage goes to 100% and stays there.

To remedy this, you must optimize. But how, and what? As a starting point, here are some of the most common culprits of inefficient memory allocation, where you can optimize a lot for little effort:

XML. Processing XML is unnecessarily resource-intensive. The best advice if you value your memory, and if you can at all help it, is to not use XML at all in your application. Walk away and don't look back. Parsing this stuff makes a lot of memory allocations. Every tag and attribute becomes an object, and so does every bit of whitespace between, before, and after every tag. And every object is a new memory allocation. There are many alternatives to XML that cost a lot fewer resources, including databases, s-expressions, and property files.

Vendor JDBC drivers. These are the stock drivers that come with your Database Management System. Since these drivers are usually free, and they are not the vendors' core product, they are notoriously inefficient. If your application talks to a database, chances are that your JDBC driver is eating most of your memory. Consider switching to a commercial-grade JDBC driver, or make fewer calls to the database in your application. Just switching drivers can slash your memory usage by two thirds or more.

StringBuffers. Most applications do a lot of string concatenation, and you can get a lot of bang for your optimization buck here. Use StringBuffer.append() instead of Strings and the '+' operator. All string concatenation actually gets translated to StringBuffer appends at compile-time, but the compiler creates a new StringBuffer for every concatenation expression (inside a loop, that means a new StringBuffer on every iteration). So use append() explicitly, use as few StringBuffers as you can, and call .toString() on the StringBuffer as late as you can get away with, and only once.

For example, if your method creates a StringBuffer, appends a bunch of stuff to it, and returns a String, consider converting that to a void method that takes a StringBuffer argument and appends to it. This will avoid both the allocation of a new StringBuffer and its conversion to a String.
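Here's a minimal sketch of that refactoring (the method and class names are made up for illustration). The caller owns one StringBuffer for the whole job, and toString() is called exactly once at the end:

```java
public class Report {
    // Before: builds, converts, and discards a StringBuffer per call.
    static String formatLineOld(String name, int qty) {
        StringBuffer sb = new StringBuffer();
        sb.append(name).append(": ").append(qty).append('\n');
        return sb.toString(); // extra String allocation on every call
    }

    // After: appends into a buffer the caller owns. One StringBuffer
    // serves the entire report, and no intermediate Strings are made.
    static void formatLine(StringBuffer sb, String name, int qty) {
        sb.append(name).append(": ").append(qty).append('\n');
    }

    public static void main(String[] args) {
        StringBuffer report = new StringBuffer(64); // presized, see below
        formatLine(report, "widgets", 3);
        formatLine(report, "sprockets", 7);
        System.out.print(report.toString()); // toString() once, at the end
    }
}
```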

Mind you, don't try to optimize concatenation of String literals this way. These can and will be optimized at compile-time by javac to be allocated a single time when the class first loads.

Finally, if you already know roughly the size your final StringBuffer will be, initialize the buffer first with the StringBuffer(int) constructor. The default size of a StringBuffer is 16 characters, and every time you append past the buffer size the underlying char[] array is doubled in size which requires allocating new memory. Better to allocate the entire buffer once, with not a byte more than you need.

The same principle applies to collections. Initialize your collections to their final size rather than allowing them to resize several times. If you're adding to a collection in a loop, you probably already know how many iterations there will be, so pass that number to the constructor and save yourself some heap fragmentation.
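A quick sketch of the collection case (names are illustrative): since the loop bound is known before the loop starts, it goes straight into the constructor.

```java
import java.util.ArrayList;
import java.util.List;

public class Presize {
    // The final size is known up front, so pass it to the constructor.
    // The backing array is allocated once and never has to grow
    // (and re-copy its contents) while we fill it.
    static List<Integer> squares(int n) {
        List<Integer> result = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            result.add(i * i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(squares(1000).size()); // 1000
    }
}
```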

Attack of the .clone()s. Sometimes you want to augment an object on the fly, just for yourself, without modifying the particular instance. So you clone to a new instance and modify that. What this does is allocate memory for the new object and then copy each of the source object's attributes to the clone. Be very careful with this, and avoid it wherever possible. If you have to clone large objects, a lot of objects, or both, then you probably have a poor design and you should refactor. Deep cloning is even worse. Usually this is done by serializing an object to a stream, and then creating a new object from the serialized data. It's very rare that you actually need two copies of the same thing in the same VM. Consider refactoring.

Go static. There might be places in your applications where a lot of objects are being created for no good reason. You might have objects that do nothing except sit around and have their methods called, or get created on the fly like so: new Wossname().doThing(). Here, doThing() should be static, and there's no reason to create that instance of Wossname. There might be other objects that maintain a little bit of internal state that can just as well be static. For example, a method may behave differently based on whether a boolean property of the object is true or false. You could replace that design with static state hard-coded into a pair of classes, but note that Java does not let a subclass override a static field; a field of the same name merely hides the parent's, resolved at compile time. The simpler and safer route is to pass the boolean as a parameter to static methods instead of keeping a property on the class. Voila, no instances of those classes need to be created, ever.
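A small sketch of the parameter-passing variant (the class name, method, and the 20% rate are all hypothetical): the boolean that might otherwise live as instance state is just an argument, so no object is ever constructed.

```java
public class TaxCalc {
    // No per-call state, so no instance is needed. Call
    // TaxCalc.gross(...) instead of new TaxCalc().gross(...).
    static double gross(double net, boolean applyVat) {
        // The flag that could have been an object field is
        // simply passed in as a parameter.
        return applyVat ? net * 1.20 : net;
    }

    public static void main(String[] args) {
        System.out.println(TaxCalc.gross(100.0, true));
        System.out.println(TaxCalc.gross(100.0, false));
    }
}
```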

Pooling. In that same vein, you can save a lot of memory on classes that a lot of threads have to instantiate constantly. Instead of creating and discarding several hundred objects per second, consider creating a pool for those objects. Threads will request an instance from the pool and then return the instance back to the pool when they're done with it. At this point the pool will reinitialize the object as necessary to make it ready to be given to another thread, just as if it were a brand new instance. The effect is that memory consumption will remain more constant and stable over time as a result of fewer overall allocations.
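Here's a bare-bones sketch of such a pool (the class is my own, and a real one would cap its size and perhaps use java.util.concurrent). Reinitialization happens on release, so the next acquirer sees a buffer that looks brand new:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BufferPool {
    private final Deque<StringBuffer> pool = new ArrayDeque<StringBuffer>();

    // Hand out a recycled buffer if one is available; otherwise
    // allocate. Synchronized so many threads can share the pool.
    public synchronized StringBuffer acquire() {
        StringBuffer sb = pool.pollFirst();
        return (sb != null) ? sb : new StringBuffer(1024);
    }

    // Reinitialize the object before returning it to the pool, so
    // the next thread receives it as if it were a fresh instance.
    public synchronized void release(StringBuffer sb) {
        sb.setLength(0);
        pool.addFirst(sb);
    }
}
```

Usage is acquire/use/release; the second acquire below hands back the very same object instead of allocating a new one, which is the whole point.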


Ultimately, all these optimizations are just guesswork unless you can monitor the effects in a controlled environment. For this you need a profiler. I recommend investing in JProfiler. It's a very good tool, although still somewhat buggy. Its best feature is the ability to look at "allocation hotspots" which shows you where most of the memory is being allocated, both by amount of memory and by number of allocations. Not just which objects are eating your memory, but which methods are creating them. This kind of tool will allow you to discover just how poorly your code really is utilizing memory, or how badly it is fragmenting your precious heap.

Monday, December 05, 2005

Config Code Creep

In the course of developing an application, many constraints are applied to the efforts of developers. There are, among others, the pressures of code re-use, extensibility and configurability. A beautiful application will be both extensible and configurable, and if it needs to be modified we won't have to throw away or re-write much of its code.

But extensibility and configurability can be taken to an extreme, and often are. Think about it. Conjure up a mental picture of the ultimate extensible app. Do you have it in mind? If you take the thought far enough, the application you're imagining is indistinguishable from a programming language.

It's possible to make an application configurable to the point of absurdity. What we end up with is minimal code and a host of configuration files (or database tables) that specify most of how the application behaves. But what are these configuration files really? They're code. They just happen to be code that's tucked away one meta-level down and likely has little or no error checking. Further, they're usually written in something clunky like XML, editing which is both tedious and error-prone.

To add insult to injury, developers usually don't show the same respect for configuration files as they do for code. For example, they're often second-class citizens in a CVS project. Developers who work with multiple branches of the same project often don't feel compelled to keep config files consistent with the rest of a branch. Nor is there the same compulsion for code re-use. Any programmer worth their salt wouldn't dream of copying and pasting a method multiple times into various source files, but they generally don't have a problem with pasting XML stubs into various config files.

Another example is a project I witnessed the launch of a few months ago. It was an ASP.Net application, and the default VS.Net configuration file was accidentally deployed to the production site. They simply weren't paying attention to it.

Then again, an example from the other extreme was a Smalltalk project I worked on where the configuration was written in Smalltalk code like the rest of the application. LISP applications are similarly often configured with LISP programs. That's perfectly respectable in my opinion. Certainly more so than doing half your work in Java and leaving the rest of it to ferment in XML files.

The take-home point is this: There's not a whole world of difference between code and structured data. The config files for your application can quickly become more complex and convoluted than the application code if you're not careful. It's often better to solve a problem in code, where at least some rigor is enforced, than delegate its solution to configuration.