CALL FOR PARTICIPATION


Rene Veldwijk

I’ve been toying with the idea of writing a book about database design for quite some time. Some older Dutch IT professionals may remember my earlier publications on database design: “Ten Commandments” and “Time in the Database”—both compiled and sold out in no time. Back then, I was still young and naïve, albeit with a freshly minted Ph.D.

Much of what I believe I know about data modeling has, as far as I’m aware, never been properly captured in textbooks—and that alone would be reason enough to start writing. But there are many more reasons. As interest in data has exploded, so has the confusion. It is a whirlwind of concepts, and no one I know of is organizing or cleaning them up where needed. And now the Data Science / AI combination is adding yet another layer to the chaos.

What I need while writing is a small group of people who can read along and provide feedback. I’m thinking of a small core group (up to six people) to read first drafts, and a larger circle of readers who follow along from a bit more distance. Of course, everything that is published will remain my sole responsibility.

I have no particular commercial ambitions (not even Patreon) and intend to make well-developed texts freely available. I owe a lot to data modeling (especially to other people’s mistakes) and would love to give something back.

To give you an idea of the topics I plan to address, I’ve written a draft preface — see below. I hope to hear from you — and feel free to share this with other data experts.

René Jan Veldwijk https://www.linkedin.com/in/reveldwijk/

Preface to a revisionist educational text

This manuscript (we shall call it a book) is a revisionist text about data modeling. Some of what you will read will be completely new to you, even if you are an experienced data professional. Much, if not most, of what you will read may be familiar material, but it will be presented and discussed in a different context. And then, of course, there is a basis of established data modeling material, because some readers will be new to data modeling and we are revisionists rather than revolutionaries. We respect data modeling foundations, especially the half-forgotten ones.

Another revisionist touch is that we shall pay attention to bad examples of data models, drawn from decades of practice. Strangely, this is rarely done in textbooks. Data modeling is still more of an art than a science, but we pretend otherwise. Modeling errors are costly at best and fatal at worst. Ignoring the impact of modeling errors in educational material is irresponsible. And make no mistake: the worst errors are made by professionals. All the bad models we show you were found in the wild.

Then there is the issue of abstraction. Our main theme here is that finding the right (often a higher) level of abstraction for your model is the essence of data modeling. However, this must not come at the expense of completeness and clarity. Abstraction and vagueness are different things. Rarely, if ever, do data modeling textbooks offer examples of complete data model specifications, and the same goes for real-world models.

Here we don’t accept vagueness. The data models in this book are almost implementation-ready in today’s relational database products. As we shall see, completeness is essential not just for validating a data model but also for choosing the best data model among alternatives.
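
As a small illustration of what we mean by completeness (a sketch of our own, with invented names, not one of the book’s cases): a complete specification declares not only tables and columns but every key and rule the data must respect, so the model could be loaded into a relational database as-is.

    -- Hypothetical example: tables, columns, keys, and rules all declared.
    CREATE TABLE department (
        dept_id   INTEGER      NOT NULL,
        dept_name VARCHAR(100) NOT NULL,
        CONSTRAINT pk_department PRIMARY KEY (dept_id),
        CONSTRAINT uq_department_name UNIQUE (dept_name)
    );

    CREATE TABLE employee (
        emp_id   INTEGER      NOT NULL,
        emp_name VARCHAR(100) NOT NULL,
        dept_id  INTEGER      NOT NULL,
        salary   DECIMAL(9,2) NOT NULL,
        CONSTRAINT pk_employee PRIMARY KEY (emp_id),
        CONSTRAINT fk_employee_department
            FOREIGN KEY (dept_id) REFERENCES department (dept_id),
        CONSTRAINT ck_employee_salary CHECK (salary > 0)
    );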

We shall not discuss high-level modeling methods like FCO-IM or Ontology-Based Modeling in any detail. Without a doubt these methods are useful in the exploratory phase, but they do not produce complete data models. Using their incomplete output directly for implementation in a database environment is a dangerous idea. The data models discussed here can be viewed as the link between high-level models and database implementations. Too often this link is missing.

Because data modeling is ultimately a form of art, there is no step-by-step procedure that is guaranteed to produce the best possible data model. The classical procedure of data normalization will help avoid modeling pitfalls. But a generation of IT professionals – fortunately close to retirement – has been led to believe that a normalized model is a good model. If this were true, data modeling and database design could be done by anyone, if not fully automatically. The damage this idea has done is incalculable. There are otherwise competent IT professionals who are not data modeling artists. We shall address the qualities that make a good data modeler. But we shall also provide a practical criterion to help evaluate competing data models. Finding the best data model for a given situation is an art, but choosing the best model among alternatives need not be.
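
To make the point concrete (again a sketch of our own, with invented names, not one of the book’s wild-caught examples): both schemas below are fully normalized, yet they capture the same facts at different levels of abstraction. Normalization cannot choose between them; that takes a modeler, or the evaluation criterion we shall develop.

    -- Alternative A: one column per contact channel.
    CREATE TABLE person_a (
        person_id INTEGER PRIMARY KEY,
        phone     VARCHAR(30),
        email     VARCHAR(100)
    );

    -- Alternative B: one row per contact channel, a more abstract design.
    CREATE TABLE person_b (
        person_id INTEGER PRIMARY KEY
    );

    CREATE TABLE contact (
        person_id     INTEGER      NOT NULL REFERENCES person_b (person_id),
        channel_type  VARCHAR(10)  NOT NULL
            CHECK (channel_type IN ('phone', 'email')),
        channel_value VARCHAR(100) NOT NULL,
        PRIMARY KEY (person_id, channel_type)
    );
    -- Both pass every normalization test; neither is automatically "good".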

Data modeling is never done in a vacuum. There is a higher-level framework, a metamodel, that defines what a data model should consist of and which rules every data model must respect. In the context of this book, this metamodel is the relational model of data.

From a practical perspective, it is fortunate that today’s mainstream databases are themselves relational, but the relational model itself has at least two serious shortcomings. The generalization or super-/subtyping problem is normally a modeling irritant, but the problem of modeling historical data, the temporal database problem, is a true data modeler’s nightmare. We shall discuss these problems in depth so that they are duly appreciated. But we will go further than that. Being revisionists, we shall ask ourselves why widely recognized data modeling and database problems have gone unaddressed, let alone solved, outside academic circles for decades. Being practical, we shall suggest the least harmful modeling practices. And being ambitious, we shall try to go a step further and outline a way to address these problems in a generic manner.
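
A miniature illustration of both shortcomings, sketched by us with hypothetical names (the book treats them in depth):

    -- Super-/subtyping: SQL has no subtype construct, so a common
    -- workaround is one table per subtype sharing the supertype's key.
    CREATE TABLE party (
        party_id   INTEGER PRIMARY KEY,
        party_type CHAR(1) NOT NULL CHECK (party_type IN ('P', 'O'))
    );

    CREATE TABLE person (
        party_id   INTEGER PRIMARY KEY REFERENCES party (party_id),
        birth_date DATE    NOT NULL
    );

    CREATE TABLE organization (
        party_id   INTEGER     PRIMARY KEY REFERENCES party (party_id),
        legal_form VARCHAR(20) NOT NULL
    );
    -- The irritant: nothing above stops a 'P' party from acquiring an
    -- organization row, or from having no subtype row at all; enforcing
    -- that takes triggers or application code.

    -- The temporal nightmare, in miniature: add validity periods and even
    -- "no overlapping periods per employee" is not expressible as a
    -- declarative constraint in standard SQL (some products offer
    -- extensions for this).
    CREATE TABLE salary_history (
        emp_id     INTEGER      NOT NULL,
        valid_from DATE         NOT NULL,
        valid_to   DATE,        -- NULL means "current"
        salary     DECIMAL(9,2) NOT NULL,
        PRIMARY KEY (emp_id, valid_from)
    );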

Over the decades, the original idea of a database as a single source of truth for all purposes has steadily eroded, resulting in different databases for different purposes: data warehouses, data marts, and now data lakes. These database concepts come with new data modeling concepts. But do concepts like star schemas and data vaults really add value? And if so, can we as data modelers find a common umbrella covering all these concepts, or will we be inhabiting an expanding universe of concepts ending in maximum data entropy? When we get to these questions, we must – and will – have a framework to form an opinion.

A related issue is the emergence of unstructured or complex data with specific applications and tailored database platforms. Some complex data, like geospatial data, are still in the realm of classical databases but require specialized support. Unstructured complex data – text, photos, audio, video, streaming – constitute different fields and are served by different database products. We shall argue that here, at last, we are outside the classical world of data models and databases. That does not mean such data are of no concern to us: a database supporting unstructured social media is embedded in a classical administrative data environment. And all these unstructured data contain a gold mine of classical information waiting to be extracted and exploited.

And so we come to the relationship between data modeling and Artificial Intelligence (AI). It is still early days, but AI is poised to revolutionize everything with respect to data and databases, including data modeling. The essence of data modeling is pattern recognition, and this is exactly what modern generative AI does well. As with human data modelers, the real challenge is obtaining the right information and finding the optimal level of abstraction. Even AI in its current state may perform better than most data modelers. Intriguingly, the old idea that data modeling and database design tasks can be automated may still come true. If so, the only purpose of this book would be as input for training AI. (In that case, all rights reserved.)

After all this, the question may arise: who can profit from the contents and message of this book, and maybe even enjoy it? There is no definitive answer to this question yet. It should be useful in teaching IT students. It aims to refresh the knowledge of experienced modelers. It may help IT professionals who are not data modelers themselves to understand or challenge the work of their data modeling colleagues. Selective reading may help project managers identify specific IT project risks and help IT management deal with buzzword-wielding vendors. And finally, there are R&D professionals and scientists who may pick up some useful innovative ideas. Who can say?

As a final remark, we must deal with the problem that a complete data model, even for a limited modeling case, takes up more space than is practical for a book. Because incomplete data models are a root cause of bad modeling practices, our case descriptions need to be complete. To do this and yet keep this book readable, we have outsourced part of our case descriptions to the internet.

If you have a serious interest in participation, find me on LinkedIn: https://www.linkedin.com/in/reveldwijk 


One response to “CALL FOR PARTICIPATION”

  1. Hi Rene,
    I no longer occupy myself much with this kind of thing (I find physics more fun). Still, I can’t resist making one remark. I suspect you already know this, but just to be sure: FCO-IM (and the tools that support it) generates a data model fully automatically, and in every case where I have seen that happen, the result could be used right away.

    But others can tell you far more about that than I can. Marco Wobben, for example.
