Structuring the history of data


The history of Data Modelling (structuring) techniques is intertwined with the history of computing technology. Computing technology has changed enormously: it started with systems that filled entire buildings, yet had capacities that we now heedlessly put in our back pocket.

When the first computers were devised, data was entered with, and was part of, the programs. Then storage slowly became more and more available, allowing data to be persisted. Almost at the same time computer screens came about (before that, first switches and then punch cards and ticker tape were used), allowing for more interactive use of the computer.

Not long after the computer screen came about, the first administrative tasks were brought to the computer. Before going further, some basics about data are needed. Data is what is noted down about a subject, object or activity in certain symbols (text, numbers, pictures, …).

[Image: René Magritte's "Ceci n'est pas une pipe"]

In order to communicate this as concisely as possible, concepts are used to group the data, which then require a context in order to retrieve the intended message in a valid and sensible way.

There are a number of data storage techniques to discuss and a number of ways to forge a structure that will fit the requirements of the target use of the data.

Data storage techniques

Hierarchical Database

The paper forms were rebuilt on the computer and the processing was designed around them. This is how hierarchical databases came about, and these were most strongly propagated by IBM (International Business Machines). Hierarchical databases store all the data that is needed in a certain part of the program. In that respect they have similarities with the concept of object databases and with document databases (like JSON and XML). If a hierarchy is needed in the data, this is resolved by adding multiple instances of a set of fields in the same hierarchical record, thus having a hierarchy within the one record (or document). This model was almost obsolete by the mid nineties.

This led to the need to enter and store the same data several times, creating issues with the consistency of those instances.
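
A minimal sketch of that repetition, using two hypothetical order records as Python dictionaries (all names are made up): every record carries its own copy of the customer data, so a change has to be applied to each copy.

# Two hypothetical order records in a hierarchical / document style store:
# all data the order-processing program needs sits inside one record,
# so the customer's name and city are repeated in every order.
orders = [
    {
        "order_id": 1001,
        "customer": {"name": "Acme Ltd", "city": "Utrecht"},
        "lines": [
            {"product": "bolt", "qty": 500},
            {"product": "nut", "qty": 500},
        ],
    },
    {
        "order_id": 1002,
        "customer": {"name": "Acme Ltd", "city": "Utrecht"},  # same data entered again
        "lines": [{"product": "washer", "qty": 200}],
    },
]

# If the customer moves, every record that repeats the address must be updated,
# which is exactly the consistency problem described above.
for order in orders:
    if order["customer"]["name"] == "Acme Ltd":
        order["customer"]["city"] = "Amersfoort"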

The Relational Database

To solve the unnecessary use of storage and the issues with consistency, Ted Codd came up with the relational model (1970), which puts relations at the forefront so that, in principle, everything is stored once, and prescribes techniques for relating tables to combine what is not in the same record.

The current RDBMS systems of the various vendors are akin to the theory, but all a bit off. That said, they have been the de facto standard from the mid 80s onwards, making SQL, the query and management language they share, the de facto standard for accessing data from structured (tabular) sources.
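
A minimal sketch of the relational idea, using Python's built-in sqlite3 module and hypothetical customer and order tables: the customer's details are stored once, and a SQL join recombines them with the orders.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE "order"  (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           product TEXT, qty INTEGER);
    INSERT INTO customer VALUES (1, 'Acme Ltd', 'Utrecht');
    INSERT INTO "order"  VALUES (1001, 1, 'bolt', 500), (1002, 1, 'washer', 200);
""")

# The customer's city is stored once; the join brings it back next to each order.
for row in con.execute("""
    SELECT o.order_id, c.name, c.city, o.product, o.qty
    FROM "order" o JOIN customer c ON c.customer_id = o.customer_id
"""):
    print(row)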

3NF

The Third Normal Form is a model in which each concept should be represented once and only once, by distributing the concepts over tables, each with as small and cohesive a set of fields as possible describing a single concept.

While this worked fine for data entry (OnLine Transaction Processing, OLTP), for data retrieval the performance was not great.

From this point onwards the models and technological solutions become more data retrieval oriented, serving the reporting and analytics world more than the data entry, but we will come back to that further on in time.

Data Vault

Data Vault is a way to structure data in a relational database, creating a structure of relations (context) to which the concepts can be added. Data Vault goes a step further in separating the business keys (hubs) from the relations between them (links) and from the descriptive context (satellites), so that the structure can absorb changes over time.
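
A minimal sketch of those three table types, with hypothetical tables in SQLite syntax: hubs hold the business keys, links hold the relations between them, and satellites hold the descriptive, historized context.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hubs: one row per business key
    CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_dts TEXT, record_source TEXT);
    CREATE TABLE hub_product  (product_key  TEXT PRIMARY KEY, load_dts TEXT, record_source TEXT);

    -- Link: the relation (context) between the business keys
    CREATE TABLE link_order (
        customer_key TEXT REFERENCES hub_customer(customer_key),
        product_key  TEXT REFERENCES hub_product(product_key),
        load_dts TEXT, record_source TEXT
    );

    -- Satellite: the descriptive attributes, kept per load date so history is preserved
    CREATE TABLE sat_customer (
        customer_key TEXT REFERENCES hub_customer(customer_key),
        city TEXT, load_dts TEXT, record_source TEXT,
        PRIMARY KEY (customer_key, load_dts)
    );
""")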

The Dimensional Model

The dimensional model is an attempt to limit the amount of data retrieved for a query by splitting up the facts (details) from the dimensions (groupings).
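
A minimal sketch of a hypothetical star schema in SQLite: a narrow fact table holds the measures and the keys to the dimensions, and a query only joins in the groupings it actually needs.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, qty INTEGER, amount REAL);
    INSERT INTO dim_date    VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO dim_product VALUES (10, 'fasteners'), (11, 'tools');
    INSERT INTO fact_sales  VALUES (1, 10, 500, 25.0), (2, 10, 200, 10.0), (2, 11, 3, 90.0);
""")

# Group the detailed facts by the dimension attributes of interest.
for row in con.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
"""):
    print(row)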

OLAP

OLAP (OnLine Analytical Processing) brings the dimensional way of thinking to another level by offering pre-aggregated data for (a part of) the dimensions.
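
A minimal sketch of pre-aggregation in plain Python (the dimensions and figures are made up): every combination of dimension levels, including the "all" level, is computed once up front so that a typical question becomes a simple lookup.

from itertools import product

# Hypothetical detail rows: (year, category, amount)
facts = [(2024, "fasteners", 25.0), (2024, "tools", 90.0), (2025, "fasteners", 10.0)]

# Pre-aggregate every combination of the two dimensions, including the
# roll-up to "all", so typical OLAP questions are answered by a lookup.
cube = {}
for year_lvl, cat_lvl in product(["year", "all"], ["category", "all"]):
    for year, cat, amount in facts:
        key = (year if year_lvl == "year" else "all",
               cat if cat_lvl == "category" else "all")
        cube[key] = cube.get(key, 0.0) + amount

print(cube[("all", "fasteners")])   # total for fasteners over all years: 35.0
print(cube[(2024, "all")])          # total for 2024 over all categories: 115.0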

Amongst the problems with the relational database as such are the relatively slow aggregation and the number of fields that stay empty but still take up the storage allocated to them.

Columnar Database

Pre-aggregation is nice, but it also presupposes certain interests and questions, leaving out other aggregations. In a columnar database the data is stored in columns rather than records. This allows empty fields to be skipped, making the storage more efficient, and it makes aggregations faster, because they are performed on a single column at a time, instead of on the occurrences of that column in a large list of records.
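
A minimal sketch of the difference, with made-up order data: the row-wise form carries the empty discount field in every record, while the column-wise form keeps one array per field, stores the sparse column compactly and sums a single column directly.

# The same hypothetical records stored row-wise and column-wise.
rows = [
    {"order_id": 1001, "qty": 500, "discount": None},
    {"order_id": 1002, "qty": 200, "discount": 0.1},
    {"order_id": 1003, "qty": 300, "discount": None},
]

# Column-wise: one array per field; empty values do not need to be materialised.
columns = {
    "order_id": [1001, 1002, 1003],
    "qty":      [500, 200, 300],
    "discount": {1: 0.1},          # sparse column: only the filled positions
}

# An aggregation touches a single contiguous column instead of every record.
total_qty = sum(columns["qty"])
print(total_qty)   # 1000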

Graph Database

The graph database offers a way to freely combine concepts through relations, with the advantage of having an open context, which helps find the unknown or lesser-known relations between the concepts.
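
A minimal sketch of the idea in plain Python, with a made-up graph of concepts: a breadth-first search walks the relations and surfaces a path between two concepts that were never linked directly.

from collections import deque

# A hypothetical graph of concepts and their relations (adjacency lists).
graph = {
    "customer": ["order", "account manager"],
    "order": ["product", "invoice"],
    "product": ["supplier"],
    "invoice": [],
    "account manager": [],
    "supplier": [],
}

def find_path(start, goal):
    """Breadth-first search: surface a relation between two concepts
    that was never modelled as a direct link."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path("customer", "supplier"))   # ['customer', 'order', 'product', 'supplier']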

The Datalake

The datalake is a somewhat blurry concept, possibly containing all kinds of structured and less structured data. The main selling point of the datalake is that it is built on the concept of cheap multi-processor, multi-storage platforms with strong failover support, thus allowing for very fast processing of potentially very much split-up sets of data.
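
A minimal sketch of that processing style, using Python's standard multiprocessing module as a stand-in for a cluster and made-up data partitions: each worker aggregates its own split of the data, and the partial results are combined afterwards.

from multiprocessing import Pool

# Hypothetical partitions of a much larger raw data set.
partitions = [
    [("2024-01-01", 25.0), ("2024-01-02", 10.0)],
    [("2024-01-02", 90.0)],
    [("2024-01-03", 7.5), ("2024-01-03", 2.5)],
]

def partial_sum(partition):
    # "Map" step: each worker aggregates its own split of the data.
    return sum(amount for _, amount in partition)

if __name__ == "__main__":
    with Pool() as pool:
        # "Reduce" step: combine the partial results into one answer.
        print(sum(pool.map(partial_sum, partitions)))   # 135.0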

The Web based solutions

The world wide web initially was a fairly passive environment, where one could read what others had written. Besides it there were, back then, discussion fora and possibilities for file transfer, which used different protocols.

Ajax is a communication technique between a web page and a background data server, decoupling the data interchange layer from the presentation layer. The data structure was initially implemented in (aja)XML alone and has since moved more and more to JSON.

XML

XML is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. (Source: wikipedia)
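
A minimal sketch with a made-up document, using Python's standard xml.etree.ElementTree: the nested elements and attributes are readable for humans and straightforward to parse for machines.

import xml.etree.ElementTree as ET

# A small, hypothetical XML document.
doc = """
<order id="1001">
  <customer>Acme Ltd</customer>
  <line product="bolt" qty="500"/>
  <line product="nut" qty="500"/>
</order>
"""

root = ET.fromstring(doc)
print(root.attrib["id"])                                      # 1001
print(root.find("customer").text)                             # Acme Ltd
print([l.attrib["product"] for l in root.findall("line")])    # ['bolt', 'nut']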

Over time XML became dressed with standards and rules that have somewhat stifled its use in newer implementations, thus leading to the popularity of:

JSON

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). It is a common data format with diverse uses in electronic data interchange, including that of web applications with servers. (Source: wikipedia)
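
A minimal sketch of the same kind of order as a JSON object, using Python's standard json module: attribute-value pairs and arrays, serialised to human-readable text and parsed back.

import json

# A hypothetical order as attribute-value pairs and an array of line objects.
order = {
    "order_id": 1001,
    "customer": "Acme Ltd",
    "lines": [
        {"product": "bolt", "qty": 500},
        {"product": "nut", "qty": 500},
    ],
}

text = json.dumps(order, indent=2)                 # serialise for the wire, human-readable
print(json.loads(text)["lines"][0]["product"])     # bolt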

Semantic Web

With the internet growing, search engines needed to parse the content in a machine-readable manner. Standards were introduced that allowed websites to indicate which data was where in the web pages for search engines to read. This grew into large open standards such as RDF and OWL.
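
A minimal sketch of the triple idea behind RDF (the identifiers are made up, and a real triple store would use full IRIs): every statement is a subject-predicate-object triple, and querying is pattern matching over those triples, so data published by different sites can be merged.

# Hypothetical triples: (subject, predicate, object)
triples = [
    ("ex:Acme", "rdf:type", "ex:Company"),
    ("ex:Acme", "ex:locatedIn", "ex:Utrecht"),
    ("ex:Utrecht", "rdf:type", "ex:City"),
]

def match(s=None, p=None, o=None):
    # Return every triple that fits the given pattern; None acts as a wildcard.
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="rdf:type"))   # everything that has a declared type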

All of the above need a way to organise the data, based on a good understanding of the organisation the data is stored or used for.

Data Structuring techniques

The Business Glossary

The oldest way of structuring data comes from the physical archives and libraries. These started with index cards in long drawers, possibly with color codes and tags. With the growth of computerized databases it became harder to keep the knowledge of the data up to date and available for all the users; a business glossary captures that knowledge as a maintained list of business terms, their definitions and the data they refer to.

ERD

Entity Relationship Diagrams are very close to the technical implementation and are therefore popular with modelers close to the database (internal people). The available solutions usually offer some form or combination of:

  • Conceptual modelling
    • What is contained in the model, an architectural level.
  • Logical modelling
    • An outline of how to implement the model (system agnostic)
  • Physical modelling
    • A detailed implementation with all the technicalities (database type, schema, tables, keys, constraints, fields with data type and sizing, indexes), as in the sketch after this list.
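
A minimal sketch of that physical level, with hypothetical tables in SQLite syntax: concrete data types and sizes, primary and foreign keys, constraints and an index.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        city        VARCHAR(60)
    );
    CREATE TABLE "order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  DATE NOT NULL
    );
    CREATE INDEX idx_order_customer ON "order"(customer_id);
""")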

UML

UML was initially a technique to draw diagrams helping software developers to come up with models. The UML class diagram, which depicts entities in software, has become a de facto standard in many data-intensive product development projects. It is not really meant for data structuring, but it is very well understood by a large audience.

Information Modeling

Whereas the above techniques and diagram styles are meant to communicate the model to the developers and technicians, another line of modeling needs to be addressed: the methods to capture the knowledge of the business domain and carry it towards the technical domain of software and data structuring.

NIAM

Natural language Information Analysis Method (formerly known as Nijssen Information Analysis Method) is a way to state the items and relations of data in natural language, thus aiding the communication about the way the modeler designs the data structure. The method is primarily designed to help communicate about the choices when structuring data for use in the computer environment.

FCO-IM

Fully Communication Oriented Information Modeling is a continuation of NIAM which focuses entirely on the modeling of the communication in the domain. The founders Guido Bakema, Harm van der Lek and Jan Pieter Zwart published their work in the 90s and educated students at universities across the world.

ORM

Object Role Modeling also came out of NIAM and focused on making it fully cover the theory of first-order predicate logic.

Data Warehousing

With data applications being built everywhere in the organization, each serving a single-purpose problem domain, the need for management information was on the rise: being able both to capture data to make comparisons over time, and to combine data from various systems into a single reporting and analysis environment.

Dimensional Models

ELM

  • Data Vault
  • Anchor Modelling
