Carrying on a bit more detailed analysis from my previous post, I will try to point some flaws that I saw in databases stored within a Relational Database Management System (RDBMS), and discuss some solutions.
To start with the basics, why using a relational model at all to represent the data?
Skipping the obvious answer that the database is stored already using a relational engine, “normalization” (Codd, 1970) is the adoption of “normal forms of relations” with the goal of avoiding certain data anomalies that can occur in unnormalized tables. In other words, being very strict with the way of representing data is going to protect the consistency and quality of the dataset we are building. On the other hand, not being strict with the data model (or not enforcing a data model at all) may create a scenario with poor quality data, where we cannot decode the information we have, which technically means it is “lost” (even if it is physically stored in the database). Think about this as an unbreakable safe box, for which you have lost the only existing key in the entire world.
The normal forms are an attempt to make sure that you do not destroy true data or create false data in your database, so it is always a good idea to go through them when designing (or redesigning) a schema.
1. Representation of arrays
One of the “bottlenecks” in the relational model, specially for people that come from a programming background is the representation of lists or groups. Sometimes people fake the representation of arrays, by repeating a group of columns, or “flattening out” the array. Apart from the data storage overhead, this is not conform with the NF1 principle. Let us look at an example.
In this example, we cannot remove a course (or update it) without involving a series of rows.
Sometimes arrays are also represented by encoding a list inside a text field, for instance separated by commas. This is also not conform with normalization, since this data type is non-existent and it is breaking the principle that the combination of row, column in a table should always refer to one record. The result is a lot of string-handling procedures to work around this kludge.
|Geography||John, Martin, Maria|
In this example, to delete (or update) a student name we need to parse and modify the students string which is so inconvenient that I don’t need to explain further why it is not a good idea to have this approach.
The proper way to represent arrays in this case is to split the students info in two tables, and relate them with a key.
The two tables above are in the normalized form, and we can operate over them using the relational model. Generally, when things involve complicated procedures to work, it means they are prone to errors, so we should keep everything as simple as possible. With the table structure above, I can implement a cascade delete that will remove all the students in the Geography Department, once I close this department. This goes the same as saying, that once I close the Department, there will be no students enroled in this department lying around the database (I call them “ghost” records).
2. Use of Natural Keys
Natural keys in themselves are not discouraged. They have a meaning to the user, which is generally a good idea, and their use prevents generating extra columns on purpose to be keys. Automatic keys (like identity fields) are generally not conform with the normalized model since they bring some dependence from the physical model (and may be implemented differently by different vendors) and they bring replication issues. With the computer power that we have nowadays it is also not a such problem to have a key involving multiple columns.
However these natural keys may be “hiding” other problems, that are themselves a “threat” to a normalized database.
2.1 Are these keys “preventing” implementing the relation between tables?
In my University database, I can represent the entities: universities, departments, courses and students as different tables. Since I know, there are one-to-many relationships in Universities->Departments, Departments->Courses and Courses->Students, I can enforce this idea by using foreign keys constraints.
CREATE TABLE Universities ( id integer, name varchar(40), CONSTRAINT pk_univ PRIMARY KEY(id) ); CREATE TABLE Departments ( id integer, id_universities integer references Universities(id), name varchar(40), CONSTRAINT pk_dprt PRIMARY KEY(id) ); CREATE TABLE Courses ( id integer, id_dprt integer references Departments(id), name varchar(40), CONSTRAINT pk_courses PRIMARY KEY(id) ); CREATE TABLE Students ( id integer, id_courses integer references Courses(id), name varchar(40), CONSTRAINT pk_students PRIMARY KEY(id) );
If I want to know the University of a student named ‘Martin’, I just have to build my query knowing the “path” from Student to University.
Select Universities.name, Students.name from Universities inner join Departments ON Universities.id=Departments.id_university inner join Courses ON Departments.id=Courses.id_dprt inner join Students ON Courses.id=Students.id_course WHERE student.name like 'Martin'
(DISCLAIMER: please forgive me if there is some error in this query: I did not actually test it, but I think this is sufficient to get the general idea…)
On the other hand, if I don’t implement this structure in the database, and I still want to know the University of a named Student, or the University of a named course, I can do something like this.
CREATE TABLE Universities ( name varchar(40), CONSTRAINT pk_univ PRIMARY KEY(name) ); CREATE TABLE Departments ( university_name varchar(40), name varchar(40), CONSTRAINT pk_dprt PRIMARY KEY(name) ); CREATE TABLE Courses ( university_name varchar(40), name varchar(40), CONSTRAINT pk_course PRIMARY KEY(name) ); CREATE TABLE Students ( university_name varchar(40), name varchar(40), CONSTRAINT pk_students PRIMARY KEY(name) );
In this case, I can know the University of any student (or any course), by querying a single table. The “natural key”: university_name, will provide the answer to my question. This design may run into problems, for instance if I have two students with the same name, and on the same University, but enrolled on different departments. That problem could be “solved” by bloating the Students table with a two-column key:
CREATE TABLE Students ( university_name varchar(40), course_name varchar(40), name varchar(40), CONSTRAINT pk_students PRIMARY KEY(university_name,course_name) );
It is easy to imagine other violations of uniqueness or other requests of information, that could lead us to add more columns to the primary key. My whole point here is that this is a strategy to avoid implementing a relational structure in the database, and therefore it is conceptually wrong. If we know the relations between entities, we should implement them exactly as they are, and not approach the database as a set of self contained-tables.
Actually this behaviour can be tracked to people who first worked with file systems, using tapes. They tend to design one huge file and do all the work against these records; this made sense at the time, since there was no reasonable way of joining a number of small files together without having the computer mount and dismount lots of tapes. With RDBMS there is absolutely no reason to avoid joins between tables. A normalized database tends to have a lot of tables with a small number of columns per table.
If we cannot produce this sort of output because we don’t know the relations between the entities we are modelling, than I would actually question the use of a database at all (or at least, a relational database).
2.2 Do we have information represented in more than one place?
In the example above, we have the natural key “university_name” repeated all over the database. Note that there are no foreign keys, and therefore there is no guarantee that the university names on tables “Courses” and “Students” match the names on table “University”. It is easy to imagine why we may run into problems with this approach. The key that determines one attribute should be in only one table and therefore the attribute should be with it. All the other tables that “mention” this attribute should refer to this table.
Apart from the fact that we are not “repeating” information all over the database and “bloating” the storage, this is the only way of guaranteeing consistency among the dataset. If a fact appears more than once, one of the instances is likely to be in error.
The database is only good in “safeguarding” things that we implement. Since the table “Students” is not formally connected to table Courses, nothing prevents me from having a course listed in students, that is not listed at all in table “Courses”. This may be, either because I did not introduce it there, or because I misspelled it (or some other reason that I can’t remember right now). If we use an application to interface with the database, we could implement strategies to prevent this kind of errors, but actually:
why should we put this responsibility in the “shoulders” of the front-end when it could be enforced by design, in the first place?
3. Table and Column Names
Although there are not strict conventions for naming tables and columns, we should stick to the idea that tables represent either “entities” or “relationships” and the columns represent attributes, and named them accordingly. Since they store sets of rows, it is expressive if the name is in plural (like “Students” or Universities”). Actually this may seem like being over-careful, but there are not many things as bad in terms of data loss, as having a database where we don’t know what the tables or the attributes represent (and not having any enforced constraints only makes it worst).
It is also a good idea to keep the names consistent across the namespace. For instance if we have a column named “temperature” and another one “called “watertemp” and they both refer to the same thing (temperature of the water), it is generally a good idea to use the same name. This, of course, goes against the paradigm of a database as a repository of data from different sources where we don’t even change the field names when importing. But with that kind of database, there is very little you can do without a tremendous amount of work (and possible even then…).
Regarding naming conventions, everything that makes the database easier to understand and to work with, should be implemented.
4. Use of Nulls
The use of nulls brings a series of issues and should be avoided as much as possible, for instance by providing a default value or a meaningful code (“missing”, “invalid”, etc). Having many nulls on a table, may be a sign that it has not been “normalized” properly (for instance the “flattening” of an array). Nulls should *only* be used for a value that it is missing now, but may be resolved later.
5. Information that should not be stored in the database
My final thoughts go to the “temptation” of mixing the contents of the database with the mechanisms to manage this information (for instance the front-end that we use to visualize it and populate it). A database models a certain reality (for instance a University, or an Address Book). Introducing “auxiliary” tables that are tied to the front-end is possible, but not without “polluting” the dataset with information about how we deal with the dataset, which is not really part of the dataset itself. Frontends may change, and ways of visualizing data may change, but our data layer should be kept conveniently “abstracted” from all these things.
This is a reflexion that I make, that may justify the use of n-databases instead of one, where we can relate them to each other. The extra complexity required by the connection mechanism can be justified by the clarity of the system, and the simplicity of dealing only with the parts that we need.