Simplifying analytics with pre-defined data architecture.
For the second half of my life, I have been building data management systems, and I am getting tired of doing the same procedure over and over again. I have a long history of generic patterns, data modeling, and data-related development on my resume, and I am about to write several posts about what I have been up to for the last five-plus years. This is the start of that series.
When approaching or engaging in a data-related project, I often hear something like this:
“Hi, our current system is not working for us. Our data engineers are struggling with it; every time something breaks (and it often does), it’s hard to repair the pipeline and it takes up all the developers’ time. If we build a new system, we could standardize our code and therefore have a cheaper system down the line…”
“Hi, our current system is not working for us. Our current system is tied to our ERP. We are now replacing our ERP and there are simply too many changes to be made before we can align to something new. It would be better and simpler for us to build a new system…”
“Hi, our current system is not working for us. Our data pipeline process is too slow; we can’t add any more data if we want to deliver data to our users on time, so we want to build a new system on this cool new platform…”
“Hi, our current system is not working for us. The platform vendor has announced end-of-life for our current version, and we are SO tied up that we have to build a new platform…”
Sound familiar?
The suggested approach is great business for the employees, who get to try something “modern” and raise their value on the market. It’s even better business for the platform vendors, and the best business of all for the consultants.
Every single case is the same. We gather requirements, we set up platforms, we create architectural definitions, we decide on methodologies, we model data, we create templates and standards, and we develop, test, and release.
Don’t get me wrong, we should definitely have all of this… But I have always wondered why we (re-)create it over and over again… I mean, how would we react if we defined “how to build a road” every time we constructed a new road segment?
I’ve been on lots of projects where tech departments want:
- Pattern-based data ingestion into models that allow for adjustable schema evolution
- Parallel loading
- Parallel teams
- Automated deployment
- Automated testing
- Low data pipeline dependencies
- Data lineage
- Data traceability
- Failure safe loading
- Complete restart-ability
- Clean architecture (knowing what happens when and how)
And on recent projects new requirements have been added, such as:
- Insert only
- Streaming patterns
- Enable reload of everything (fast)
- Enable multiple data models for different use cases
- Enable de-centralization using common methodology and infrastructure
Within the same projects, the business requirements are basically:
- Data available as soon as possible for the lowest possible cost
- Actionable data, yet with full traceability (what has been done, by whom, why and how)
- Fast (or faster) deliveries
- Cheap (or cheaper) cost of ownership
And we give these requirements to our developers to build. But it’s hard to build something with so many requirements, and it’s even harder on new tech.
So, it’s not uncommon that we spend hours and hours up front:
- Defining our layered or tiered architecture: what is enhanced/enriched where, and why?
- What to build our data pipelines with and how
- Where to store what and for how long
- What we need historical trace on and why
- Choosing the modeling methodology
- Arguing about what is a link/relationship and what is an anchor / a hub / an entity / an object
- How to create keys, and how to handle early-arriving data…
- What type of logging we must have and how to build it.
- Defining requirements for loading so that data pipelines are easy to restart/rerun.
And so on…
Standards and best practices are always blockers or bottlenecks when they are missing, and they are very hard to define in a new setting. It is, however, not very hard to set standards and best practices based on the joint knowledge gathered over years and years within our trade.
So, what happens? The project is in a rush (due to the business requirements described above), and hence our great architecture and our standards are not completely defined.
Equals =
A new mess… Better than before, that’s for sure, but often with huge technical debt.
(Maybe this only happens in projects with me :) but I doubt it).
This is the call for a change!
I teamed up with a couple of smart individuals, and we decided that it was time to switch approach.
- We defined what is required
- We created the patterns
- We selected the models that support schema evolution
- We sat down and argued about standards, patterns, and models, long before we even met the next project
- We built something that feeds on metadata, hence documentation equals CODE. This has a great side effect: it provides end users with complete and amazing data lineage and traceability (a minimal sketch of the idea follows after this list).
- And… we tested the hell out of the system
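To make the “feeds on metadata” point a bit more concrete, here is a minimal sketch of the idea in Python. The metadata format, table names, and generated SQL below are my own illustrative assumptions, not PDQ By Simplitics’ actual metadata model; the point is simply that the same mapping a developer would otherwise document by hand is the input the generator consumes, so documentation and lineage cannot drift from the deployed code.

```python
# A minimal, hypothetical sketch of "documentation equals CODE".
# The mapping below is both the generator input and the documentation,
# so lineage can never drift from what is actually deployed.
mapping = {
    "source": "erp.customers_v2",          # assumed source table name
    "target_object": "Customer",
    "attributes": {
        "CUST_NAME": "name",               # source column -> target attribute
        "CUST_EMAIL": "email",
    },
}

def generate_load_select(meta: dict) -> str:
    """Build the load statement directly from the mapping metadata."""
    cols = ", ".join(f"{src} as {tgt}" for src, tgt in meta["attributes"].items())
    return f"select {cols} from {meta['source']};"

def generate_lineage(meta: dict) -> list[str]:
    """The very same mapping doubles as end-user lineage documentation."""
    return [
        f"{meta['target_object']}.{tgt} <- {meta['source']}.{src}"
        for src, tgt in meta["attributes"].items()
    ]

print(generate_load_select(mapping))
print("\n".join(generate_lineage(mapping)))
```

One definition, two outputs: the load statement and the lineage documentation come from the same place, which is what makes “documentation equals CODE” more than a slogan.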
Five years ago, we created the initial draft of PDQ By Simplitics! We are now close to releasing 2.0, a great modular data architecture built to run on AWS, Azure, and GCP on top of your favorite data platform vendor.
In upcoming posts, I will describe the core aspects of PDQ By Simplitics and why they are needed.
PDQ stands for any of the following:
- Pretty Darn Quick — since implementation is rapid.
- Persistent Data Quality — since everything is versioned. Data, models, source definitions and mappings.
- Perfect Data (Model) Quest — providing a physical data model that follows ALL the rules, ending the endless arguing
- Parallel Data with Queuing — maximized parallel data loading process, which dynamically waits on dependencies and maximizes parallelism when possible.
PDQ By Simplitics is a data architecture…
- That allows us to model data into objects and attributes, with natural relationships between objects, without thinking about the technique and/or which attributes need versioning (history). Beneath the surface, a “hidden” core model is created and used, defaulting to an anchor-like model or, upon choice, to a data vault model. This gives you a baseline that adapts very well to change.
Basically, before PDQ By Simplitics you would pay us by the hour to come in, model objects and attributes, and then implement those using some modeling technique, arguing about this and that. Later, you would pay us to create templates and guidelines and to teach all your developers to use and follow them.
With PDQ By Simplitics we have already done the arguing before we even met you, leaving the data modeling of your corporation as the only thing left to do. A great benefit of the PDQ By Simplitics methodology is that you do not have to learn or master any of the ensemble models. We master them for you! So, one could say that this is one of those rare occasions where you can have your cake and still eat it. (A rough sketch of how such a hidden core model might expand follows below.)
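As a rough illustration of what such a hidden core model could look like, the sketch below expands one logical object with a few attributes into an anchor-style physical model: one anchor table for identity and one table per attribute carrying its own history. The layout and naming convention are assumptions made for this example; PDQ By Simplitics’ real generator and conventions are not shown here.

```python
# Hypothetical expansion of a logical object into an anchor-style physical
# model: one anchor table for identity, one table per attribute with history.
def expand_to_anchor_model(obj: str, attributes: list[str]) -> list[str]:
    ddl = [
        f"create table {obj}_anchor ("
        f"{obj}_id bigint primary key, "
        f"load_time timestamp not null);"
    ]
    for attr in attributes:
        ddl.append(
            f"create table {obj}_{attr} ("
            f"{obj}_id bigint references {obj}_anchor, "
            f"{attr} varchar(200), "
            f"valid_from timestamp not null, "
            f"primary key ({obj}_id, valid_from));"
        )
    return ddl

# The modeler only states the object and its attributes; the physical
# model, including history handling, falls out of the pattern.
for statement in expand_to_anchor_model("customer", ["name", "email"]):
    print(statement)
```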
- Introduces full data traceability.
Every single change is tracked on everything; nothing can be done to bypass this feature.
And…
Every model change is versioned
And…
Every data mapping is versioned as well
And…
Even the definition of each source delivery is kept…
And…
Every loaded attribute can be traced to a specific source delivery (and that delivery would still be available in its original format!)
This means that you can recreate any given scenario. Maybe you need to prove why something was the way it was, or maybe you simply need to understand why something is wrong. (A small illustration of this follows below.)
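To illustrate the principle, with invented column names and a plain in-memory list standing in for a real attribute table, the sketch below shows how attribute rows that carry a delivery id and a valid-from timestamp let you answer both “what was the value at this date?” and “which delivery put it there?”.

```python
# Hypothetical attribute history rows: every row carries the source delivery
# that produced it, so any past state can be reconstructed and traced back
# to the original delivery (which is still kept in its original format).
from datetime import datetime

customer_email = [
    # (customer_id, value, valid_from, delivery_id)
    (42, "old@example.com", datetime(2023, 1, 10), "delivery_20230110_001"),
    (42, "new@example.com", datetime(2023, 6, 2), "delivery_20230602_017"),
]

def value_as_of(rows, customer_id, as_of):
    """Return the attribute value, and the delivery it came from, as of a point in time."""
    candidates = [r for r in rows if r[0] == customer_id and r[2] <= as_of]
    return max(candidates, key=lambda r: r[2]) if candidates else None

print(value_as_of(customer_email, 42, datetime(2023, 3, 1)))
# -> (42, 'old@example.com', datetime(2023, 1, 10, 0, 0), 'delivery_20230110_001')
```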
- Prevents architects and developers from introducing their personal preferences and style into the system, which would otherwise make maintenance difficult and expensive.
PDQ By Simplitics creates the physical data model and produces streamlined data pipelines. All models will have the same naming convention, defaults, and metadata. All data will flow through the same logic, incorporate the same tagging and tracing, have equal logging, and include standardized fault handling by default. Once decided, new improvements to standards and patterns become available on all flows instantly.
- Allows you to relax and stop thinking about orchestration and flow dependencies. What can run in parallel and what must run in sequence is understood by the system, thanks to the ensemble modeling methodologies.
This means that we also know how to re-run flows and how to prevent manual interventions. Hence, we have even automated fault tolerance and recovery from some failures. (A sketch of how such dependency-driven scheduling can work follows below.)
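As a rough sketch of why this is possible: in an ensemble model, every attribute or relationship table depends only on its anchor(s), so the dependency graph can be derived from the model itself and executed in parallel waves. The table names and the thread-pool executor below are illustrative assumptions, not PDQ By Simplitics’ actual scheduler.

```python
# Dependencies derived from the model: attributes and ties depend only on
# their anchors, so the load order is a topological sort and everything
# within a wave can safely run in parallel.
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

dependencies = {
    "customer_anchor": set(),
    "order_anchor": set(),
    "customer_name": {"customer_anchor"},
    "customer_email": {"customer_anchor"},
    "customer_order_tie": {"customer_anchor", "order_anchor"},
}

def load(table: str) -> None:
    print(f"loading {table}")  # placeholder for the real load job

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor() as pool:
    while sorter.is_active():
        ready = list(sorter.get_ready())  # all tables whose dependencies are met
        list(pool.map(load, ready))       # run the whole wave in parallel
        for table in ready:
            sorter.done(table)            # unlock the next wave
```

Because the order falls out of the model, a failed table can simply be re-run and the remaining waves continue from where they stopped.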
It is obvious that PDQ By Simplitics will save you a lot of time when implementing a new solution, but it will save even more time when put into production. So instead of hiring a couple of architects (or more) to argue about hubs and links, standards, and patterns, involve a team that has already done that and that wants to do things a bit differently and more effectively. It’s due time for a paradigm shift in how we set up data projects…
If you are reading this line, I do hope you find this intriguing and want to understand in more depth what we are doing and how. If so, follow me for the next 1–10 posts.