Building blocks for scalable, manageable and cost-efficient data analysis
The amount of data available to us is growing exponentially, and with that growth come ever greater possibilities to create value through data analysis. The areas where Big Data may lead to significant advances range from simple file reductions to optimizing environmental systems in offices and tailor-made consumer promotions, and even to more precisely targeted cancer tumour treatments. We are at the start of a great journey: the most valuable applications have yet to appear.
By making full use of your own data, you generate value: for people, for society, for customers, for businesses and for governments. The first step lies in overcoming obstacles common to many organizations, so that you can easily unlock the value of your own data and external data and develop new applications. What are some of these challenges?
1. Data silos
The development of data analysis is often a slow process. One of the key reasons for this is the structure in which parties currently operate, or rather, the lack of an effective structure. This applies within organizations, but becomes particularly clear whenever an organization tries to use external data or needs to form ad hoc partnerships with one or more parties. Over and over again, such organizations have to reinvent the wheel: which technology to use, which analysis software and algorithms, how to guarantee personal privacy, and how to agree on the conditions governing the use of the data. To put it mildly, this is hardly an incentive for businesses and governments to get the most out of data analysis.
2. Ecosystem thinking
An ideal world would have an ecosystem in which organizations share (controlled) insights into data, and in which the options for sharing those insights have been thought through in advance. In such an environment, suitable parties would also have ample opportunity to build applications on top of this data. For instance, one party may see opportunities for an application that better predicts traffic jams by combining location data with meteorological data, while another focuses on models that predict when certain car parts need to be replaced: the ecosystem provides the mechanisms to create these insights as and when all telecom providers, car manufacturers and the local meteorological institute participate. In this ideal world, all suitable parties are free to bring an idea to fruition with the data in the ecosystem, naturally under the conditions imposed by the data owners.
3. Sharing insights as a source of value creation
The real potential of data analysis lies particularly in sharing insights into the data. Think, for example, of the leap forward that could be made in improving medical diagnoses and treatments if it became possible to share insights into (anonymized) data for the good of mankind. This also applies to organizations. A financial institution could increase the security of credit or debit card use through insight into location data: as soon as a person's credit card is used at a location different from that of the person's smartphone, a possibly fraudulent transaction can be prevented. To do this, the bank needs to be able to combine knowledge of the phone's location (for example, using data from a telecom provider) with its own information on the use of the card.
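As a rough sketch of how such a location check might work, the fragment below flags a transaction when the card and the cardholder's phone are implausibly far apart. The coordinates, the 50 km threshold and the function names are illustrative assumptions, not a description of any actual bank's or KAVE's implementation.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius of 6371 km

MAX_PLAUSIBLE_KM = 50  # hypothetical threshold a bank might choose

def flag_possible_fraud(card_latlon, phone_latlon):
    """Flag a transaction when card and phone locations are far apart."""
    return haversine_km(*card_latlon, *phone_latlon) > MAX_PLAUSIBLE_KM

# Card used in Amsterdam while the phone reports Rotterdam, roughly 57 km away:
print(flag_possible_fraud((52.37, 4.90), (51.92, 4.48)))  # → True
```

In practice the phone's location would come from the telecom provider's data, the card's from the bank, combined under the conditions the ecosystem enforces.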
The themes discussed above led to the development of KAVE. KAVE forms the heart of a new 'Big Data ecosystem' that removes practical obstacles and allows quick, easy and consistent Big Data analysis. With KAVE, KPMG has begun a new way of thinking in which data analysis applications can be realized far more easily. A reality in which standardization of data or technology is not necessary, and in which agreements have been made on the conditions for sharing both insights into data and data analysis applications. A reality with the highest level of data security, combined with complete transparency on how personal data is handled. With KAVE, insights into the growing data flows and data analyses of governments, companies and individuals can be integrated and shared reliably, rapidly and easily. These conditions are supported by KAVE's essential characteristics.
There cannot be any doubt that compliance with privacy requirements is our number one priority. This is ensured by involving Trusted Third Parties (TTPs). Personal data is converted into pseudonymized data, and no single party, neither the TTP, the supplier nor the processor, can undo this pseudonymization on its own. As a result, a developer or data scientist cannot trace data back to a person. The privacy principles stipulated by law are enforced by the system's technology and set-up.
If the individual explicitly consents to the use of the personal data for a certain purpose, or in the event of a legitimate interest such as a court order, the pseudonymized data can be re-associated: the party that complies with all legal and ethical requirements can then link the insights to the appropriate person (a process called re-identification). Because this requires the cooperation of all parties in the pseudonymization chain, the division of responsibilities guarantees compliance with rules and regulations.
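The core idea, that tokens cannot be traced back to a person without the key holder's cooperation, can be sketched with keyed hashing. This is an illustrative assumption about the mechanism; KAVE's actual pseudonymization scheme is not described here, and the key, identifiers and lookup table below are hypothetical.

```python
import hmac
import hashlib

def pseudonymize(identifier: str, ttp_key: bytes) -> str:
    """Replace a personal identifier with a keyed-hash token (HMAC-SHA256).

    Without the key, the token cannot be traced back to the identifier.
    """
    return hmac.new(ttp_key, identifier.encode(), hashlib.sha256).hexdigest()

ttp_key = b"secret-held-only-by-the-TTP"  # hypothetical key, held by the TTP
token = pseudonymize("jane.doe@example.com", ttp_key)

# The data scientist works with tokens only. The same person always maps
# to the same token, so records can still be joined across datasets:
assert token == pseudonymize("jane.doe@example.com", ttp_key)
assert token != pseudonymize("john.doe@example.com", ttp_key)

# Re-identification requires the TTP's cooperation: only the TTP keeps a
# mapping from token back to identifier, released under strict conditions.
ttp_lookup = {token: "jane.doe@example.com"}  # held by the TTP, not the analyst
```

Splitting the key and the lookup table across parties mirrors the division of responsibilities described above: no single party can re-identify a person alone.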
The environment needs to be able to process all common data formats and structures. Until recently this was often an obstacle, but state-of-the-art tools, methods and technology can now process almost any data. In a sense, data has become like Lego: it always fits and you can build any model you like. Furthermore, data analysis scales horizontally, and therefore easily: additional analysis capacity can be made available immediately by increasing the number of computers (and, in a cloud environment, can just as easily be reduced). By comparison, scaling up classic database systems often leads to technical and/or financial challenges.
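A minimal sketch of why such analysis scales horizontally: when a computation decomposes into independent chunks, capacity grows by adding workers rather than by rewriting the analysis. A thread pool stands in here for what would, in a real cluster, be separate machines; the function names and the toy computation are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk):
    """Stand-in for a per-chunk computation (here, a simple aggregate)."""
    return sum(x * x for x in chunk)

def analyze(data, n_workers):
    """Split the data, process the chunks in parallel, combine the results."""
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(analyze_chunk, chunks))

# The result is identical with 2 or 8 workers; only the elapsed time
# changes. Capacity is added by adding workers, not by redesigning the analysis.
print(analyze(list(range(10)), n_workers=4))  # → 285
```

This is exactly the property that classic, vertically scaled database systems lack: there, more capacity usually means a bigger (and costlier) single machine.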
It must be possible to make clear agreements: on the conditions the data-owning party imposes on the analysis of its data, and on the conditions the analyzing party imposes on the use of its analysis. This also covers questions such as the fee for the use of the data (for example, through subscription structures), the cost of using the environment, and the purposes of the analysis.
The ecosystem needs to be open. The power of an ecosystem in which a large number of data streams is processed lies in open competition in conceiving and developing applications. Application developers are free to stipulate conditions on the use of their applications, and data-owning parties are free to stipulate conditions on the applications they want to use for their data analysis.
Finally, data analysis requires quality control. Data analysis is more than number crunching: the quality of complex analysis in particular is determined not by the computational power of the computer but by the data analyst. The difficulty is not finding patterns; the real challenge is to understand and interpret patterns in the data without bias, so that the analysis leads to meaningful, not disastrous, conclusions. The risk of misinterpretation must be addressed when analyzing large amounts of data, because if you collect enough data, you will always find a striking correlation. The quality of an analysis can therefore, if desired, be monitored through a peer-review system. This is a crucial characteristic of the ecosystem: it prevents incorrect conclusions from being drawn. In addition to monitoring, independent third parties can certify the analysis if necessary.
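The claim that enough data always yields a striking correlation can be demonstrated directly: among many series of pure noise, some will correlate strongly with any target purely by chance. The sketch below is illustrative, and the sample sizes are arbitrary.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
target = [random.gauss(0, 1) for _ in range(30)]

# 1,000 candidate series of pure noise, entirely unrelated to the target:
candidates = [[random.gauss(0, 1) for _ in range(30)] for _ in range(1000)]

# Searching enough noise reliably turns up a "striking" correlation,
# typically well above 0.5 in this setting, despite there being no
# relationship at all. Only domain knowledge and unbiased interpretation
# separate such artefacts from real effects.
best = max(abs(pearson(target, c)) for c in candidates)
print(f"strongest correlation found among pure noise: {best:.2f}")
```

This multiple-comparisons effect is precisely why the peer review and certification described above are essential characteristics of the ecosystem.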