As a data engineer, I am used to keeping track of my code through version control, and I have tools at my disposal to find out why something that seems logical at first glance may not be so logical on second glance. For example, I can write unit tests for every function. These assure me that a given input produces the outcome I expect for that piece of code. However, this only tells me what happens to my data at that exact moment in time, with a predetermined dataset. Keeping track of how our data changes during its lifecycle remains a difficult task for us engineers, especially when you work in an environment with many different systems that are not (well) linked. This ability to keep track of how data flows through your environment is what is usually called data lineage.
A more formal definition of data lineage is: the ability to keep track of where the data is coming from, where the data is going and what transformations are applied as it flows through multiple processes (Allen & Cervo, 2015).
During my research on data lineage, I stumbled upon a patent application publication by Microsoft (Liensberger, Bouw, & Kashi, 2014). In this publication, the authors outline a yet-to-be-released Microsoft product that can create, track and handle lineage metadata. In this blog, I will guide you through the concept of data lineage, why it should be a priority for every organization and how we could (in theory) bring about data lineage in every type of data landscape, based on the Microsoft publication.
At first sight, building data lineage seems like quite an easy task. We could easily just keep an Excel sheet where we would write down all the events that happen to a block of data. Right? Not really. Nowadays, large amounts of data can be generated and accumulated. If that wasn't bad enough for our data lineage spreadsheet, we are faced with computing systems that are all interconnected and can have various types of network connections, making it even harder to keep track of where our data flows to. Maybe add some more columns to our spreadsheet?
But why is lineage important anyway? Say, for example, that someone or something extracts data from a system, after which it could be manipulated and put back into a system. When this happens, we do not know:
If we cannot provide an answer to all of the above-stated points, we could face some serious consequences. For example, we could be fined for not complying with GDPR legislation, since we do not know where that document with sensitive customer data went. Or even worse, we could have a security breach without knowing there even was a breach. Our sensitive customer data would then basically be an open data source.
Now, imagine your colleague extracts data, applies some transformations to it and puts it back into the storage location. You are not aware that the contents of these data objects have been changed. You only become aware when someone points out that your analysis has flaws and you do not have the correct numbers. You start thinking: "How could this have happened?". In addition, imagine what could happen if you are using AI in your business process. Your model would no longer fit the data, which could result in your AI making decisions based on faulty data.
At this point, we have established that we already have problems tracking the lineage of one data object (for example, working on an older version of a collaborative document instead of on the master document). Now imagine keeping track of the lineage of any type of data object, amongst a large pool of other data objects whose lineage also needs to be tracked. This seems like a daunting, if not impossible, task for a human being. This is what the authors of the above-stated publication must also have thought (probably when they ran out of rows to add to their spreadsheet). The publication describes a minimal (abstract) framework, which contains several systems and a number of methods that should be incorporated in an end product in order to successfully build data lineage. Next, I will outline a step-by-step approach to arrive at this minimal end product.
Firstly, we must determine what type of metadata should be gathered and how we could store this information. Lineage metadata generally includes information on what operations were performed on a data object, who performed these operations and when these operations occurred. There are two possible approaches to associating the lineage metadata with a dataset. The first option would be to add the lineage metadata as an annotation to the standard metadata that is included with the dataset. The other option would be to store the lineage metadata separately from the dataset and maintain an association with the dataset by means of, for example, a matching unique identifier. And no, I do not mean in a spreadsheet.
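To make the two storage options a bit more tangible, here is a minimal Python sketch. It is not taken from the publication: the names `LineageEvent`, `Dataset`, `annotate_embedded` and `LineageStore` are hypothetical, and I am assuming that a UUID string serves as the matching unique identifier.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid


@dataclass(frozen=True)
class LineageEvent:
    """One lineage entry: what was done, by whom and when."""
    operation: str
    actor: str
    occurred_at: datetime


@dataclass
class Dataset:
    """Hypothetical dataset with its own standard metadata."""
    name: str
    # Assumption: a UUID string acts as the unique identifier.
    dataset_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: dict = field(default_factory=dict)


def annotate_embedded(dataset: Dataset, event: LineageEvent) -> None:
    """Option 1: add the event to the metadata that travels with the dataset."""
    dataset.metadata.setdefault("lineage", []).append(event)


class LineageStore:
    """Option 2: keep lineage apart from the dataset, keyed by its identifier."""

    def __init__(self) -> None:
        self._events: dict[str, list[LineageEvent]] = {}

    def record(self, dataset_id: str, event: LineageEvent) -> None:
        self._events.setdefault(dataset_id, []).append(event)

    def history(self, dataset_id: str) -> list[LineageEvent]:
        # Return a copy so callers cannot alter the stored history.
        return list(self._events.get(dataset_id, []))


sales = Dataset(name="sales")
event = LineageEvent("transform", "colleague", datetime.now(timezone.utc))

annotate_embedded(sales, event)        # lineage lives inside the dataset metadata
store = LineageStore()
store.record(sales.dataset_id, event)  # lineage lives in a separate store
```

With the first option the lineage moves along with the dataset wherever it goes. With the second option the dataset itself stays untouched, but every system needs access to the separate store and must agree on the identifier.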
Now that we have an idea of how to store the lineage metadata of a dataset, we need to establish whether a dataset remains valid during its lifecycle. Lineage metadata is itself stored data, and this data could still be deleted. The publication proposes two options for maintaining lineage integrity within our platform:
At this point, we have formulated an approach for building lineage and for ensuring its integrity. Next, we need to know who is responsible for the execution of these tasks within our data platform. Maybe we do not need a single system that has full responsibility for building the data lineage. If we had to build a centralized system that could extract lineage from every type of system, we would need to do a lot of coding. How about a decentralized approach? That is what the authors of the publication also thought, and they proposed a dual approach. If a system is able to capture and embed the lineage metadata, this system should be allowed to do so. In all other cases, there should be a central governance entity that takes care of creating and managing lineage metadata. This central governance entity should have the ability to reach into all the various repositories, enabling it to keep track of when data is downloaded, when new data is uploaded to the repository and/or when a (new) dataset is created, deleted and/or updated. By decentralizing the responsibility as much as possible, we limit the amount of custom development needed to connect the systems.
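The dual approach can be sketched roughly as follows. This is my own illustration and not the design from the publication: `LineageAware`, `Warehouse`, `FileShare`, `GovernanceEntity` and `capture` are all hypothetical names, and a lineage record is simplified to a plain dictionary.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LineageAware(Protocol):
    """Hypothetical contract for systems that can capture their own lineage."""

    def emit_lineage(self, dataset_id: str, operation: str) -> dict: ...


class Warehouse:
    """Hypothetical system that is able to capture lineage itself."""

    def emit_lineage(self, dataset_id: str, operation: str) -> dict:
        return {"dataset_id": dataset_id, "operation": operation,
                "captured_by": "Warehouse"}


class FileShare:
    """Hypothetical system without any lineage support."""


class GovernanceEntity:
    """Central fallback that records lineage on behalf of other systems."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def observe(self, system_name: str, dataset_id: str, operation: str) -> dict:
        record = {"dataset_id": dataset_id, "operation": operation,
                  "system": system_name, "captured_by": "GovernanceEntity"}
        self.records.append(record)
        return record


def capture(system: object, dataset_id: str, operation: str,
            governance: GovernanceEntity) -> dict:
    """Let the system capture lineage if it can; otherwise fall back to governance."""
    if isinstance(system, LineageAware):
        return system.emit_lineage(dataset_id, operation)
    return governance.observe(type(system).__name__, dataset_id, operation)


governance = GovernanceEntity()
from_warehouse = capture(Warehouse(), "ds-1", "update", governance)
from_file_share = capture(FileShare(), "ds-1", "download", governance)
```

In this sketch only the `FileShare` event ends up in the governance entity's records, because the `Warehouse` handles its own lineage. Custom code is then only needed for the systems that cannot capture lineage themselves.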
Nowadays, the problem is not necessarily creating data lineage (many systems/databases are capable of this). The problem lies in finding an all-encompassing approach that combines the different lineage metadata sources and builds a unified flow. Important considerations are: Which system has the master lineage metadata? How do we keep track of data integrity and make sure all systems are aware when a dataset is invalidated? How should we deal with invalid datasets, and should all systems then discard the dataset? These are just some of the questions to which we currently do not have an answer. In the following parts of this blog, I will try to formulate an approach and architecture that would send us in the right direction. If this blog post gets enough likes, I might even do it in a spreadsheet.
For more information, please contact Patrick de Hoon.
 Allen, M., & Cervo, D. (2015). Data Quality Management. Multi-Domain Master Data Management.
 Liensberger, C., Bouw, R., & Kashi, O. (2014, January 16). DATA LINEAGE ACROSS MULTIPLE MARKETPLACES. United States Patent Application Publication, 7.