7 deadly sins of data modeling

Assumptions can be powerful...especially when they become embedded in cultural behaviours...when they turn into a system of widely accepted truths.

At this point, such assumptions are difficult to challenge and form the basis of ongoing choices and actions...they start to layer upon each other...assumption on assumption...until you can hardly remember why and how it all started.

I like to see assumptions as truths with an inbuilt risk...a conditional truth you might say...many human endeavours are built on these shaky foundations because we tend to break problems down into smaller pieces to solve them or set specific context boundaries where everything external is ignored.

When we do this, a level of risk is always introduced which relates to the effects of (and on) all things outside the boundaries we set...we can create a perfect solution that breaks the rest of the world.

So what has this got to do with data?

For this, we need a little history lesson and go back in time over 50 years...to the birth of affordable departmental computing...to the birth of the silo...it was a perfect storm...raining down on the mainframe and all it stood for...and one thing people hated was the way the mainframe worked with data...input, process, output...so a set of very clever people set out to make it all right...to devise a theory that creates the best databases possible to be created...and this they did...and it was magical in its simplicity.

But this article is not about this...we all know what theory was born and who was involved, because it, and its variations and evolutions, still remain the dominant theory out there...

No, this is about the assumptions and boundaries that were created at the same time... this was the egg that hatched into the data industry, where the metaphorical enterprise architecture atom split, where data started to go its own way.

As side note, it is also interesting that many of the alternative database design approaches and patterns that have arisen in more recent years still share the same basic assumptions that I will outline shortly... Structurally different within the boundary, but observing the same set of wider assumptions of what lies outside. Which are, therefore, not something data practitioners consider to be part of their remit and somebody else's problem.

Now, none of what I have written here so far would be relevant if the world was already perfect and data wasn’t a constant frustration to most businesses, but this is clearly not the case. We have systemic “issues” with data, this is an indisputable fact.

Robert Vane

"We have systemic 'issues' with data, this is an indisputable fact."

So this got me thinking, did we get it all wrong 60 years ago? Did best intentions and best science create a guaranteed future failure scenario?

Well, let's take a look at the assumptions that I believe were made and test them out against what we see happening in the real world... and I'll let you decide.

Assumption 1: Process and Data can be detached with minimal consequences

Of course, if you are trying to create something NOT like something, trying to correct a perceived deficiency, it goes without saying that you assume that you must change most, if not all, of the major characteristics of the original... in this case, the input-process-output flow.

It was assumed that you can split these out, detach data structures from processes and then work on making the data layer as structurally perfect as possible, so it can support a multitude of processing use-cases... then use a specifically formed coding language (SQL) to join it all back together. (Yes! SQL in ALL FORMS is CODE!).

This turned out to be a double-edged sword in many ways, I could write an entire article just on this assumption alone. The main point is this created a conflict, a dichotomy, that didn't exist before when data and process were intertwined.

Which comes first: Data or its purpose? The words or the conversation? The work or the record of the work?

You could also ask a supplementary question...

Did this make data processing any simpler in reality?

Input-Process-Output actually became Locate-Collate-Process-Propagate-Relocate or in other words, the "Process" bit actually became more complex and more code dependent with additional translation steps needed to interpret the data structures both inbound and outbound.

I assert that, the more pure and detached the database structure is from process, the more code dependent you become. More joins, indexing conflicts, performance issues... rings a bell? Why do you think people tried to fix this many times with approaches like OO and NoSQL in order to connect process back to data?

There is an inverse (negative) relationship in effect here that the data scientists either didn't see, didn't want to see or ASSUMED to be a positive correlation. The assumption made was "Get the data optimally structured and applications and processes will naturally improve as a result". History has proven this link to be casual at best.

Now here is the interesting twist to the plot:

If the dreaded application "silo" is effectively a slave to the purity and generality of its database design and the predicate logic that underpins it, is the data science and mathematics employed the problem and not the silo? And if you centralise all data under the same mathematical and structural assumptions, will you not just create more conflicts of interest, more code to resolve them and ongoing and systemic negative correlation to both localised and cross functional processes?

I say you will. Tribalism will emerge as the balancing force and departments will resist centralisation and standardisation that no longer serves their business needs.

I assert that the assumption was flawed and data should never be disconnected, restructured or moved from its original purpose, context and use. This is just crazy data industry thinking based on this assumption being absolute truth.

So the first Deadly Sin is "Detachment from process".

Assumption 2: Simplicity of theory guarantees correct adoption

The theory was stunningly beautiful in its simplicity and this was one of its major attractions... anybody could learn its mechanics and start to construct using this new form of database design approach.

And they did... badly!!! We've all seen the horror shows of database schemas that sit beneath even some of the most popular applications that still exist today!

But how could this be possible? Was it possibly too generalised, too simple, too open to misinterpretation?

Again, history shows us some clues here. There are no surviving mainstream database technologies out there that implemented the theory in a way that the original inventors and their followers are truly happy with. The proponents of the theory sight this as the reason it has not been the only theory ever needed because it was never done right by anybody. Nobody understood the theory properly or had the foundational knowledge to implement it correctly.

I believe it is closer to the truth to say that the database technologies that based themselves on this theory had to translate it to the real world and, quite possibly, saw the limitations and overcame them, allowing the approach to become what it is now.

Well, it's done pretty well I think!!! It is everywhere!!!! But its simplicity and generality has driven waves and hype cycles of reinvention and revolution against the theory, driven by the need to "Fix" the symptoms of its generally bad adoption.

You would think that re-affirmation of the theory and better education would have sorted this by now, but it would appear, sadly, not to be case in a big enough practitioner population to correct this adoption problem.

This 2nd Deadly Sin is, therefore, "Over Simplification of the Complex enterprise".

Assumption 3: Data Quality and Integrity is guaranteed by optimal structural data design

Why is Data Quality and Integrity probably the biggest consistently identified business risk out there in the data space right now?

You may not agree with this, but I assert that it was never really fully considered in the theory in the first place. Only structural integrity was covered, the integrity of relationships between entities. I also have a general observation that the more an approach formalises "relationships", the more it de-emphasises the formalisation and interoperability of attributes/properties.

General observation

The more an approach formalises "relationships", the more it de-emphasises the formalisation and interoperability of attributes/properties.

The theory was Entity/Relationship biased, attributes existed but were only formalised in terms of "NULLability" and conformance to Type. This created an encoding problem. As long as the data you add to an attribute/column is of the right type and within the constraints of length, you can encode it in any way you like, and put pretty much anything in it that fits. And everybody did the same thing differently due to assumption 2 above!

Or in other words, most quality and integrity measures are resolved in code outside of the database. We see this everywhere, data mashing and cleansing in the BI space, integrations performing encoding translations and transformations. In fact, it has got to the stage where application UIs perform so little verification as standard out of the box, that you can ONLY fix it downstream of the database itself nowadays. This trend needs to be reversed... but that's a whole other article!

Because the theory does not address this systemically, a perfect storm was created where applications not only chose their own encoding, but had no incentive for wider interoperability across the whole business or industry... but this I will discuss in the next couple of assumptions.

The Deadly Sin here is "Integrity and Quality of data is not handled before it is stored".

Assumption 4: Nomenclatures fully define semantic meaning

This is all about labelling, and therefore the use of language to describe the structural contents of a data model and its physical database implementation.

Check out the article on the Risks and Benefits of Standard Data Models

Look at the vast majority of databases and you will see that verbose descriptions are very likely vague or missing, column names are ambiguous and misleading.

Labels are left to the discretion of the model designer and very often have syntactical restrictions, for example, 30 characters, no spaces etc. This creates a drive towards using the least number of characters and vague labels like Name, Code, Type etc.

Labels are also often "Internationalised", or in other words, tied to a specific spoken language not natural, familiar or equivalent to every other... Vendor vs Supplier etc.

The point here is that there is no inbuilt labelling discipline or curation within the theory, it is all pretty much left to the whim of the model/database designer or must be managed outside the model. This has created an interoperability nightmare with sometimes over 100 physical labels for the same thing common to see at enterprise scope!

The same label can also relate to multiple meanings in different contexts, this is the "Multiples of Multiples" problem and it is endemic in data modeling and database design because it is absent from the theory. So we have labelling chaos and full context and meaning managed somewhere else if at all!

Business should set the language and technology should ideally use it. But this is not how business applications have been designed for decades, business people have been forced to adopt many technology lexicons and in many ways have become de-skilled in articulating and defining what they do in their own words.

So we now have vagueness of business language everywhere, and technology biased labelling that re-invents itself every 5 years... and an ever widening gap between the semantics of business and the semantics of technology... and data modeling theory and practice has been complicit in this.

Deadly Sin 4, "Inadequate business control and clarity of labelling".

Assumption 5: All entities convey singular truth, all attributes are atomic, all relationships are true

Labels can also be related to simple (atomic) things or compound (compositional) things... and there is no way to determine which in the theory, or be consistent in choosing what is an entity and what is an attribute. You have two levels to choose from and the designer makes that choice based on the context of the requirement at hand, within that specific boundary.

But then the next requirement introduces a new boundary and so on, you then have a choice... combine requirements into a single model or split into multiple models.

If you combine or split, you introduce a set of challenges, the "Model of Models (MOM)" problem which introduces a whole new set of skills to master. This is where science and mathematics has to start dealing with organisational politics, AKA business people!

The theory tilts the modeler towards combination, it pursues singular truth and regards tribalism as something to banish... but this means you need to address the three killers of enterprise scale interoperability, Consensus, Bias and Granularity (See The Ambiguity Clinic on the Q6FSA YouTube channel for more on this!).

Whichever you choose, combine or split... you face some hard choices.

In many ways it is easier (politically speaking) to split models, because then you know that everything in the model is tuned to that specific requirement, there is nothing superfluous. You don't have to justify to one tribal group, elements that are included that mean nothing and have no relevance to them. You also can deal with less language and functional variation... you can get closer to a singular truth much easier, even though that truth has a much reduced scope. Relationships between entities are also much easier to justify under a single context, your business users have a fighting chance of "understanding" the model and supporting it.

When you split, though, the MOM problem manifests itself as an interoperability problem when you have to link models together. You have created bias, you have kicked the consensus can down the road and granularity (whether something is an entity or attribute) is fixed to a specific requirement.

So what if you combine? Well then you have to deal with the 3 killers inside the model, which can be quite a challenge. The more requirements and tribal view points in play, the harder this becomes. Entities, attributes and relationships become more "Conditional", by which I mean that they do not necessarily apply to all use cases... and which use cases they apply to is not in the syntax of the theory, so that's all resolved in query code and external documentation (which we all know is also very often not up to scratch!).

The more dominant business tribes will drive the language, labelling and semantics (or even worse a third party will base it on the most dominant IT application), the less dominant become disenfranchised and loose interest, we've all seen this before. This is why many "Canonical" endeavours get abandoned and logical models become too abstract, the politics is too hard to reconcile and you are forced to split models to achieve clarity.

So the theory pulls you to combine, the politics pushes you to split...the theory has no provision for organisational politics, culture and tribalism. No tribe supports the combined model, silos are born... the spirit of the theory is, unfortunately, lost, the pursuit of singular truth, abandoned.

Deadly Sin 5, "Tribalism must be replaced with singular truth".

Assumption 6: Temporality, state, sequence, confidence, retention, compliance and classification can be left to the designers discretion

Now we are into Pragmatics territory, the context of the model and the interactions it supports, how it deals with time, sequence of process and many other facets beyond the structure and content of the database are not mandatory (or not included) in the theory, they are either left to the discretion of the model designer, and therefore, very often omitted... or not in the scope of the data modeling endeavour at all.

All these considerations are part of the wider enterprise architecture in which the model operates, but data people care not about such things. They want the world to be "Data-Centric". They define the "What" and have very little consideration for the "Why, Where, When, How and Who".

This is very much tied to Assumption 1 above, process and interaction will look after itself. But it won't, everything is connected, the theory has no concept of flow outside the use of code, no concept of provenance or lineage. State and time sequence is difficult to determine because everything is point in time.

It is not only that the theory deliberately detaches data from process, it also makes it very challenging to bring them back together again because the model design only contains part of the picture.

Deadly Sin 6, "Context, Purpose and Interaction is somebody else's problem".

Assumption 7: Data can be manipulated without altering Pragmatics

And this then leads onto another facet of the theory, the reliance on code to assimilate process and the lack of context and purpose recorded within the data model itself, means that every time you construct a query that joins data together and manipulates it, the context of the base tables is replaced by a new context, a combined data set.

This new query result set is not recorded in the data model, it is constructed on the fly, but it has created a new entity. This new entity is the combination of the context of it's base tables and the additional operations it applies...groupings, joins, select and where clauses etc. It is also therefore, a new combination of semantics and, more importantly, pragmatic context.

The business user looking at an application UI and a data engineer exploring table data are not necessarily looking at the same thing unless the exact same query code is used! But data engineers often use a different technology, a different syntax.

Forms based applications get around this traditionally by enforcing a strict 1:1 relationship between UI and database table, but more process centric UIs, ML/AI and event based architectures tend to use multi-table combinations, additional filters etc. There is a piece of code in the middle that collates and creates something new and the data model knows nothing about it.

This is an issue that plagues the BI and AI industry (and digital transformation for that matter) but is very often not recognised as a root cause of their issues. When you move data and apply ANY manipulation or join, you change its semantics and therefore it's pragmatics.

So, Deadly Sin 7, "Believing that data stays the same when it moves between contexts".

Summary

The key observation here is that the smaller the scope of context of a database system, the less these 7 assumptions matter or create unwanted effects...less use cases means less conflict, less conflict means simplicity works.

You could say, therefore, with some confidence looking at the state of data management in the complex real world right now, that the theory created actively encouraged silos, or at least provided the environment under which silos thrived as the path of least resistance, which was possibly not its original intention, but cause and effect in play.

And coupled with the arrival of the departmental mini-computer at approximately the same time, the popularity of client/server desktop architectures following the mainstream business use of desktop windows based computing. This mindset embedded, the application-centric paradigm took control, the assumptions became truths.

We now live in a browser based and connected world, where the silo has become just as hated as the centralised mainframe paradigm that it replaced. The 7 assumptions above have become enterprise scale risks, producing a set of systemic issues that now need to be solved to "digitally" transform.

So, did we, as data people, actually architect our own downfall? Well, we should certainly be challenging these assumptions, IMHO, before it is too late.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

About the author 

Robert Vane

Robert Vane is the co-founder of the Q6FSA Method for Global Information Management, a freelance full enterprise scope data architect with over 25 years experience of getting it all wrong, now dedicated to solving the foundational root causes of failure within the information management space and getting it all right.

You may also like:

The benefits and risks of standard data models
>