Requirements for data architectures
Ok, so this post will not provide details of a pre-defined data architecture, but it does highlight some top priorities when building one. If you are interested in requirements and high-level thoughts, please read on, otherwise wait for the next post :)
So, in order for us to build a well-functioning data landscape, we have to define some base requirements. We should strive to build a data landscape that is adaptable and sustainable and that can accommodate all your data needs, be it ML/AI, analytics, data apps or whatever else. Some data platform vendors will claim that you can just sign up and everything you wish for is already accounted for, but that is far from true. A data architecture consists of many modules which must interact and feed off each other, much like any ecosystem.
In my view, these are the most important requirements for a data architecture. And yes, they are the same regardless of your favored approach: centralized data warehousing, data lakehouse, data fabric, data mesh or an isolated ML/AI project.
Let's group the requirements into a few buckets: Delivery, Data Trust, Adaptability, Cost, Technology and Portability, in that order!
1. Delivery
The ability to deliver data with as low latency as possible. That goes for new requirements as well as for continual enhancements. And don't forget that our data needs to be available to those it concerns, and only to them, yet simple to find, access and understand.
Short time to value
Why: Simply put, the ability to quickly fulfill new requirements.
How: Introduce an “automate first” approach to everything. Strive to automate those parts that can be automated. It not only speeds up development, it also improves your overall quality and enables very lean maintenance. Things that do not conform to the available automation patterns should be handled separately.
Here is a list of areas that we can automate to increase productivity and quality.
- code generation
- streaming and batch automation
- test automation
- automatic data validation and screening
- automated deployment
- automated access control
- automated data glossary
It is not hard to do if we learn how to activate our available metadata.
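To make the "activate our metadata" idea concrete, here is a minimal sketch in Python of metadata-driven code generation. The table definition, column list and the two generator functions are invented for illustration; the point is only that loading code is derived from metadata instead of being hand-written.

```python
# Minimal sketch: generate loading code from metadata instead of writing it by hand.
# The metadata structure and table names are illustrative, not a prescribed standard.

source_table_metadata = {
    "name": "customer",
    "columns": [
        {"name": "customer_id", "type": "BIGINT"},
        {"name": "customer_name", "type": "VARCHAR(200)"},
        {"name": "created_at", "type": "TIMESTAMP"},
    ],
}

def generate_staging_ddl(table: dict) -> str:
    """Generate a CREATE TABLE statement for a staging copy of the source table."""
    cols = ",\n  ".join(f"{c['name']} {c['type']}" for c in table["columns"])
    return f"CREATE TABLE stg_{table['name']} (\n  {cols},\n  load_ts TIMESTAMP\n);"

def generate_load_sql(table: dict) -> str:
    """Generate an INSERT ... SELECT that copies source data into staging."""
    cols = ", ".join(c["name"] for c in table["columns"])
    return (
        f"INSERT INTO stg_{table['name']} ({cols}, load_ts)\n"
        f"SELECT {cols}, CURRENT_TIMESTAMP FROM src_{table['name']};"
    )

if __name__ == "__main__":
    print(generate_staging_ddl(source_table_metadata))
    print(generate_load_sql(source_table_metadata))
```

Because the metadata is the specification, regenerating the code after a schema change is a matter of re-running the generators.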
Access to data
Why: We have to allow for data discovery and simplify access but at the same time ensure privacy and data integrity!
How: Aim for a landscape where we can search for data, understand the data and request access. When access is granted (hopefully automatically), we should be able to reach the data using our selected tool-set, with single sign-on and no (or low) code. Access should be simple to grant and revoke. All parts of our data landscape should be included and access policies should apply everywhere. It is important that your data architecture supports a general access security layer so that there is a single point of administration. Enable frameworks that support easy access to the required data, and remove obstacles such as data formatting and connection libraries.
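As a hedged sketch of what automated access control with a single point of administration can look like, the snippet below keeps access policies as plain data and derives grant and revoke statements from them. The roles, dataset names and the GRANT/REVOKE wording are assumptions for illustration, not a specific platform's API.

```python
# Minimal sketch: access policies kept as data, with grants derived from them
# rather than administered by hand. Role and dataset names are made up.

access_policies = [
    {"role": "analyst", "dataset": "sales.orders", "privilege": "SELECT"},
    {"role": "data_engineer", "dataset": "sales.orders", "privilege": "ALL"},
]

def render_grants(policies: list[dict]) -> list[str]:
    """Turn policy rows into GRANT statements for the underlying platform."""
    return [
        f"GRANT {p['privilege']} ON {p['dataset']} TO ROLE {p['role']};"
        for p in policies
    ]

def render_revoke(role: str, dataset: str, privilege: str) -> str:
    """Revoking should be as simple as granting: one policy row removed."""
    return f"REVOKE {privilege} ON {dataset} FROM ROLE {role};"

if __name__ == "__main__":
    for stmt in render_grants(access_policies):
        print(stmt)
    print(render_revoke("analyst", "sales.orders", "SELECT"))
```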
2. Data Trust
The second most important bucket of requirements concerns data trust. Basically this revolves around traceability. What are we using? How, why, when and by whom has the data been altered? Who else is using the data set, and how fresh and accurate is it? These are questions any consumer of data should be asking, hence the answers should be within reach.
Traceability
Why: A very important aspect of data is where it originates, and who or what has processed it and how. Within what context (model) does the data appear? Who consumes it and who owns it? (if possible…)
How: All processes should produce and consume metadata, starting from the initial publication of a data product, through any intermediate process, to the final consumption. If we align our processes to use metadata as their specification and automate discovery of published results, we activate our metadata by definition, keeping it accurate and up to date. Our metadata will then be of higher quality and truly reflect reality.
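A minimal sketch of processes producing metadata as they run could look like the following. The event fields and the load_customer_orders process name are made up; in a real landscape the event would be published to a metadata or catalog service instead of printed.

```python
# Minimal sketch: every process emits a small lineage record describing what it
# read, what it wrote and when. The record format is illustrative only.

import json
from datetime import datetime, timezone

def emit_lineage(process: str, inputs: list[str], outputs: list[str]) -> dict:
    """Produce a lineage event; in a real landscape this would go to a metadata store."""
    event = {
        "process": process,
        "inputs": inputs,
        "outputs": outputs,
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(event))  # stand-in for publishing to a metadata/catalog service
    return event

if __name__ == "__main__":
    emit_lineage(
        process="load_customer_orders",
        inputs=["src.crm.customer", "src.web.orders"],
        outputs=["dw.customer_order"],
    )
```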
Audit-ability
Why: Another important aspect of data trust is the ability to see who or what did something, when it occurred, and why.
How: Track changes on everything. Do not introduce choice when it comes to tracking changes. Changes should be tracked on all data, all code, and all descriptive information. This is the only way to ensure a complete audit trail. Do not allow manual edits of code, models or data, and introduce temporal data structures.
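As one possible illustration of "track changes on everything", the sketch below appends an audit record for every change instead of overwriting anything. The field names and the in-memory log are assumptions used only to show the append-only pattern.

```python
# Minimal sketch: track every change as an append-only audit record instead of
# updating in place. Field names are illustrative, not a fixed standard.

from datetime import datetime, timezone

audit_log: list[dict] = []

def record_change(entity: str, key: str, changed_by: str, before: dict, after: dict) -> None:
    """Append who changed what, when, and both the old and new values."""
    audit_log.append({
        "entity": entity,
        "key": key,
        "changed_by": changed_by,
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "before": before,
        "after": after,
    })

if __name__ == "__main__":
    record_change(
        entity="customer",
        key="42",
        changed_by="load_customer_job",
        before={"segment": "SMB"},
        after={"segment": "Enterprise"},
    )
    print(audit_log)
```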
Data (Quality) Observability
Why: We have to be able to trust the data we consume. The goal is not perfect data quality, it is rather to highlight when data is not optimal. Strive to get data in, and warn when it does not conform. Remember that data quality is also very subjective; bad data quality for one consumer might be good data quality for another...
How: We should be able to create alarms, warnings and/or notifications when data deviates from the expected outcome. For instance:
- when processes are not performing as they should.
- when data is missing and/or halted.
- when one dataset does not “add up” to another dataset
- when missing values occur
- when new values occur
- when abnormalities in data occur
As natural as it is to define KPIs within an organisation, “Quality Performance Indicators” should be defined as well.
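Continuing from the list above, here is a minimal sketch of what a couple of such "Quality Performance Indicator" checks could look like in Python. The thresholds, column names and sample rows are invented; the idea is simply to warn when data deviates rather than to block it.

```python
# Minimal sketch: a few "Quality Performance Indicator" style checks that warn
# when data deviates from expectations. Thresholds and column names are made up.

def check_missing_values(rows: list[dict], column: str, max_ratio: float = 0.05) -> str | None:
    """Warn when the share of missing values exceeds the accepted threshold."""
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    ratio = missing / len(rows) if rows else 0.0
    if ratio > max_ratio:
        return f"WARNING: {column} missing in {ratio:.0%} of rows (limit {max_ratio:.0%})"
    return None

def check_new_values(rows: list[dict], column: str, known: set) -> str | None:
    """Warn when previously unseen values appear in a controlled column."""
    unseen = {r[column] for r in rows if r.get(column) is not None} - known
    if unseen:
        return f"WARNING: new values in {column}: {sorted(unseen)}"
    return None

if __name__ == "__main__":
    rows = [{"country": "SE"}, {"country": None}, {"country": "XX"}]
    for alert in (
        check_missing_values(rows, "country", max_ratio=0.10),
        check_new_values(rows, "country", known={"SE", "NO", "DK"}),
    ):
        if alert:
            print(alert)
```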
Persistent and non-volatile
Why: Enable data trust, by presenting stable and predictable results day in and day out.
How: Ensure that we model and structure data according to known data management best practices.
- Keep history and make data comparable over time.
- Prior data shouldn’t be deleted when new data is added.
- Historical data should be preserved for comparisons, trends, and analytics.
- Use physical data modeling practices that allow us to preserve prior data and append new data.
- Strive for an insert only pattern if possible and present data as-was in combination with as-now.
- Introduce a bi-temporal approach, at minimum.
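To illustrate the insert-only, as-was/as-now pattern from the list above, here is a small sketch using an in-memory list as a stand-in for a table. The customer address example and field names are made up; a full bi-temporal implementation would add further detail, but the append-only idea is the same.

```python
# Minimal sketch: an insert-only, bi-temporal style of storing data, keeping
# both the business-valid time and the time we recorded the fact. The table is
# an in-memory list here purely to illustrate the pattern.

from datetime import date, datetime, timezone

customer_address: list[dict] = []  # never updated or deleted, only appended to

def append_address(customer_id: int, address: str, valid_from: date) -> None:
    """Append a new version; prior rows stay untouched for as-was reporting."""
    customer_address.append({
        "customer_id": customer_id,
        "address": address,
        "valid_from": valid_from,                    # when the fact is true in the business
        "recorded_at": datetime.now(timezone.utc),   # when we learned about it
    })

def address_as_of(customer_id: int, as_of: date) -> str | None:
    """Return the address that was valid on a given date (as-was view)."""
    versions = [
        r for r in customer_address
        if r["customer_id"] == customer_id and r["valid_from"] <= as_of
    ]
    return max(versions, key=lambda r: r["valid_from"])["address"] if versions else None

if __name__ == "__main__":
    append_address(1, "Old Street 1", date(2020, 1, 1))
    append_address(1, "New Street 2", date(2023, 6, 1))
    print(address_as_of(1, date(2022, 12, 31)))  # -> Old Street 1
    print(address_as_of(1, date(2024, 1, 1)))    # -> New Street 2
```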
Consolidated and integrated
Why: Integrating data around keys and concepts allows for simple and extendable usage without introducing manual errors. Consolidating data into single points of origin minimizes misuse of data.
How: Model data into clear and concise data models. Use a data modeling practice that allows you to adapt and extend with new attributes, objects and contexts. Our data landscape should in fact allow for multiple contexts/models without losing track of origin. Find the concepts that need consolidation and integration and spend time on the modeling task. Pause or discard data whose use cases are currently purely isolated (or single-domain oriented) until a cross-domain use case occurs. This allows us to focus on the tasks that matter and consolidate objects, attributes, contexts and measures for those things with high business impact.
3. Adaptability
Everything changes… Our business, our competition, technology, source systems, data consumption patterns, requirements and people. The fact is, when our data landscape grows, more changes are introduced. For every system we interact with, change will occur in step with its development pace. As we consume and analyze more and more data, even more data will be demanded. So it is not a linear increase of change, it is rather an exponential increase of change. Hence our data landscape must be built with adaptability as a top priority.
Adaptable
Why: Changes occur and will occur.
How:
- Enable source-aligned data contracts/products ensuring that we have stability in our delivery processes.
- Building generic processes, or processes generated from metadata, allows us to simply change the target or source aspects of a mapping.
- Allow for data-driven classifications and business rules and push them to downstream operators (see the sketch after this list).
- Treat classifications and business rules as source data.
- Use ensemble modeling methodologies which love changes.
- Introduce temporal data structures.
- Strive to use dynamic allocation of data structures in all selected tools.
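The sketch referenced in the list above shows one way to treat classifications and business rules as data: the rules live in a table-like structure and a generic function applies them, so a rule change is a data change rather than a code change. The rule contents and revenue thresholds are invented for the example.

```python
# Minimal sketch: classifications and business rules stored as data and applied
# generically, so changing a rule does not require redeploying code.

classification_rules = [
    {"attribute": "annual_revenue", "operator": ">=", "value": 1_000_000, "label": "Enterprise"},
    {"attribute": "annual_revenue", "operator": "<",  "value": 1_000_000, "label": "SMB"},
]

def classify(record: dict, rules: list[dict]) -> str | None:
    """Apply the first matching rule; rules are data, not hard-coded logic."""
    for rule in rules:
        actual = record.get(rule["attribute"])
        if actual is None:
            continue
        if rule["operator"] == ">=" and actual >= rule["value"]:
            return rule["label"]
        if rule["operator"] == "<" and actual < rule["value"]:
            return rule["label"]
    return None

if __name__ == "__main__":
    print(classify({"annual_revenue": 2_500_000}, classification_rules))  # Enterprise
    print(classify({"annual_revenue": 400_000}, classification_rules))    # SMB
```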
Data modeling
Why: Any information is useless unless delivered in a format that can be consumed by users. Data modeling helps in translating the requirements of users into a data model that can be used to support business processes and scale analytics.
How: Data without context and integration will soon lose its value. A great data model will enable extended usage and exploration of data without breaking quality.
- Use ensemble data models, spend time on integrating and normalizing terms and objects.
- Create semantic models on top of the ensemble data model for specific use cases with specific formats.
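As a small illustration of the ensemble idea, the sketch below separates a stable business key (a "hub") from its changing descriptive attributes (a "satellite" kept per version). This mirrors common ensemble approaches such as Data Vault, but the structures here are simplified assumptions, not a complete method.

```python
# Minimal sketch of an ensemble-style physical split: a stable "hub" holding the
# business key, and a "satellite" holding descriptive attributes over time.

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class CustomerHub:
    customer_key: str        # business key, stable over time

@dataclass(frozen=True)
class CustomerSatellite:
    customer_key: str        # points back to the hub
    name: str                # descriptive attributes can change...
    segment: str
    loaded_at: datetime      # ...so every version is kept with its load time

if __name__ == "__main__":
    hub = CustomerHub("CUST-42")
    versions = [
        CustomerSatellite("CUST-42", "Acme AB", "SMB", datetime(2022, 1, 1)),
        CustomerSatellite("CUST-42", "Acme AB", "Enterprise", datetime(2024, 1, 1)),
    ]
    print(hub, versions[-1])
```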
(De-)centralization
Why: A centralized process, whether it has to do with data or not, will always be slower than the equivalent de-centralized approach, since by definition it cannot scale in the same way. However, a de-centralized approach is harder to align and conform.
How: Enable de-centralization using common methodology and infrastructure. Streamlining common patterns and definitions can help parallelize data processes without losing control. Automated, generic processes based on metadata discovery will allow for pattern re-factoring without re-building existing features. Centralize the parts of the data landscape that enable de-centralization to work at its best. Re-iterate this policy over and over again.
4. Cost
Cost is, in my experience, without argument the most common driver for doing any form of transformation. It is becoming more and more important to focus on, since costs might not be as clear as they used to be… Today many costs are based on consumption, so hardware and staffing are not as predictable as before. A badly implemented feature might not scale and will therefore cost more and more over time. A bad choice of tooling might require more time for our developers to master. A platform with low performance might require long waits, time that could be spent on other tasks. All of these examples are wasted costs which could be avoided with policies and trust.
Simplicity
Why: A less complicated system costs less.
How: Try to hide complexity in simple tools, tasks and structures. Create repeatable patterns and gather knowledge from every maintenance task. Strive to replace complicated processes with simplified ones.
Easy to learn, easy to master
Why: Introducing new individuals to the data landscape should be a short and smooth process. Every day spent on introduction without producing is a sunk cost.
How:
- Utilize simple and productive interfaces in all parts of the data landscape.
- Avoid scripting and typing commands; embed functionality in the UI instead.
- Have complete traceability on each data point. The ideal scenario is to be able to see the raw data origin together with all/any transformation with a single click.
- If development patterns are identical, then understanding and producing new functionality will be smoother.
- Every failed technical process should be able to run over and over again without producing new results. This prevents errors due to hiccups and bad operations, with the bonus that decisions regarding failure handling become easier to make (see the sketch after this list).
- Finally, a clean architecture makes it transparent what occurs, where and why, and hence faster to troubleshoot.
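The sketch mentioned in the list above shows the re-run-without-new-results idea (idempotency) in its simplest form: each batch is written under a deterministic identifier and simply replaced on retry. The in-memory target table and batch naming are illustrative assumptions.

```python
# Minimal sketch of an idempotent load step: re-running it after a failure does
# not produce duplicates, because each batch is written under a deterministic
# identifier and replaced on retry. The target is a dict standing in for a table.

target_table: dict[str, list[dict]] = {}  # batch_id -> rows

def load_batch(batch_id: str, rows: list[dict]) -> None:
    """Delete-and-insert by batch id: safe to run any number of times."""
    target_table[batch_id] = list(rows)  # overwrites the same batch on re-run

if __name__ == "__main__":
    rows = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 250}]
    load_batch("orders_2024-05-01", rows)
    load_batch("orders_2024-05-01", rows)  # simulated retry after a failure
    total = sum(len(r) for r in target_table.values())
    print(total)  # still 2 rows, not 4
```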
Low cost of ownership
Why: Pay for the needed compute and not more, scale on demand, and streamline maintenance tasks.
How:
- Standardized processes (development and operations).
- Streamline our access points and utilize scalable platforms.
- Use scaling wisely, transform bad patterns into more cost-efficient patterns.
- Ensure standards and principles are aligned across all running code.
- Try to use platforms that can scale up and down, preferably without compute tied to storage.
Predictable
Why: Working with consumption-based price models can sometimes feel uncertain and scary. Prices on SaaS and cloud offerings might suddenly change, which is the opposite of what we want. The right choice of tech is really a cost/performance balance.
How: Consider how analytics will be performed, via direct access or memory-based. What suits our use cases best? Perhaps a blend of both?
- Build pipelines as slim as possible, working with changed data only, or at least with predictable chunks of data (see the sketch after this list).
- Analyse and compare costs against themselves and alternatives.
- Make components in the data landscape portable and don't invest too much in specific tech, preventing tech lock-in. Who knows, our preferred SaaS might suddenly change its price model and jeopardize the entire setup.
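The sketch referenced in the list above shows the "changed data only" idea: a high-water mark is kept between runs so each run processes a small, predictable slice instead of the full history. The source rows, table name and in-memory watermark store are assumptions for illustration.

```python
# Minimal sketch of a "changed data only" pipeline: a high-water mark is kept
# between runs so each run processes a small, predictable slice of the data.

source_rows = [
    {"id": 1, "updated_at": "2024-05-01T10:00:00"},
    {"id": 2, "updated_at": "2024-05-02T09:30:00"},
    {"id": 3, "updated_at": "2024-05-03T08:15:00"},
]

watermarks: dict[str, str] = {"orders": "2024-05-01T23:59:59"}

def incremental_extract(table: str, rows: list[dict]) -> list[dict]:
    """Pick up only rows changed since the last run, then advance the watermark."""
    last = watermarks.get(table, "")
    changed = [r for r in rows if r["updated_at"] > last]
    if changed:
        watermarks[table] = max(r["updated_at"] for r in changed)
    return changed

if __name__ == "__main__":
    print(incremental_extract("orders", source_rows))  # rows 2 and 3 only
    print(incremental_extract("orders", source_rows))  # [] on the next run
```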
5. Technology
The data landscape should be considered its own application and be separated from the source system(s). This obvious and simple approach enables us to optimize workloads for our use cases, and makes it possible for us to model data according to our business needs and for future scenarios, not tied to the source system(s).
Separated
Why: Separate the data landscape from source systems. Optimize for use case and build for the future.
How: Separated platforms, use case based data models, data ingestion pipelines or publish patterns.
Parallel and scalable
Why: Our workloads must be able to execute in parallel and scale vertically and horizontally, enabling more processes and more consumers.
How: First, secure one or more data platforms that allow for simultaneous usage and parallel read/write. Second, secure patterns for streaming data combined with batched data. Finally, model data into a physical model that enables independent loading.
Variable data formats
Why: A great data landscape can support all formats of data, whether structured, semi-structured or unstructured.
How: Simplify the data publishing process by allowing data to be uploaded in its native format, automatically analyzed and published in a data discovery catalog. Make a copy of the data in a “single” conformed data format to allow for ease of access and ease of use.
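As a hedged sketch of accepting native formats while still offering one conformed copy, the snippet below reads CSV or JSON input and writes a conformed JSON-lines version using only the standard library. The file names and the choice of JSON lines as the conformed format are assumptions for the example.

```python
# Minimal sketch: accept files in their native format (CSV or JSON here) and
# produce a conformed copy as JSON lines so consumers have one easy format.

import csv
import io
import json

def to_conformed(name: str, raw: str) -> str:
    """Detect a simple native format and return a conformed JSON-lines copy."""
    if name.endswith(".csv"):
        rows = list(csv.DictReader(io.StringIO(raw)))
    elif name.endswith(".json"):
        parsed = json.loads(raw)
        rows = parsed if isinstance(parsed, list) else [parsed]
    else:
        raise ValueError(f"Unsupported native format: {name}")
    return "\n".join(json.dumps(row) for row in rows)

if __name__ == "__main__":
    print(to_conformed("customers.csv", "id,name\n1,Acme\n2,Globex"))
    print(to_conformed("orders.json", '[{"order_id": 10, "amount": 99.5}]'))
```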
Accessibility
Why: An open data landscape enables our data consumers to work with tools that they know and are comfortable with, hence enabling higher quality in the produced product(s).
How: Make our data landscape flexible in the sense that new tools and users can easily be added to or removed from the palette, with data access policies setting the limits. This enables us to be more data driven and agile with data, yet have full control of data integrity. The opposite, locking consumption to selected tool-sets that users might not even have the competence to use, risks the quality of the produced product(s).
6. Portability
Our selected tech today might not be as prominent within a couple of years. The price models might change, or the tech might not work for the current workload. Maybe our corporation has made a strategic deal with another vendor. Regardless, there are many options available, and it is hard to make the best tech choice for the foreseeable future.
Portable
Why: New tech evolves at tremendous speed and cloud computing costs are constantly changing. New use cases arrive and individual skills evolve. In order to avoid a “legacy” branding of our data landscape, we must align tech and costs with our use cases and skill sets.
How: Embrace the fact that our architecture is not perfect. Make the best of it. Make each component responsible for as little as possible and automate as many processes as possible. Make metadata active, meaning use metadata as a central component for our automated processes. Generate code, structure and configs from metadata, and try to stick to widely accepted languages for working with data such as SQL, Python and similar.
SUMMARY
Set up a data landscape that is built to adapt. Any code, module and/or component should be portable to another platform, tech stack and/or service.
Accept that our data architecture will never be perfect and accept flaws, but build automated processes. Code and structure should either be generated or generic, and based on metadata. This allows for an easier migration when we face the fact that something has to be replaced.
Make sure we spend as little time as possible on repeatable tasks and more and more time on data modelling and definitions. Use ensemble modeling with temporal concepts. That will make our system more stable and able to endure major changes/refactoring of core business systems/processes.
Always focus on simplifying access, data discovery, access patterns and structure, and keep improving.