One of the earliest examples we have of humans storing and analysing data is the tally stick, write Bryan Clarke and Nakul Wali of our R&D Incentives practice.
Palaeolithic tribespeople marked notches into sticks or bones to keep track of trading activity and supplies. They would compare sticks and notches to carry out rudimentary calculations, enabling them to make predictions such as how long their food supplies would last. Today, data analytics is used to predict everything and anything, from what a consumer will buy at the grocery store to who will win the Oscars.
The first recorded experiment in statistical data analysis was carried out by John Graunt in 1663, when he theorised that he could design an early warning system for the bubonic plague ravaging Europe. Today, complex analytics underpin important decisions around COVID-19 lockdowns. As the world came to a standstill, it became imperative to end the spread of the virus as soon as possible, and the information collected around the world in the response to COVID-19 has produced gargantuan volumes of data. Big data and data analytics have helped us understand the nature of the pandemic and predict its course. Contact tracing and its supporting analytics helped governments quickly and effectively analyse the data and derive actionable insights to curb the spread of the virus. Biased, gut-feeling, subjective decision-making is being replaced by objective, data-driven insights that allow governments to better serve citizens and manage risks.
However, big data often involves imperfect information. Like oil, it must be discovered, accessed, transported, and stored before processing. Only then can statistical modelling be applied to find meaning. Big data analytics uses the concept of the 5 Vs, i.e. Volume, Velocity, Variety, Veracity and Value, to take vast quantities of raw data and distil them into valuable insights. However, some of the biggest challenges remaining in data analytics relate to collecting and storing the vast quantities of high-quality data required to enable those insights.
The use of the 5 Vs to mine large quantities of data from disparate systems is becoming routine engineering, something a data scientist would be expected to complete in a routine manner. However, as volumes move from gigabytes to terabytes to petabytes, the data can no longer be stored and analysed cost-effectively using traditional systems. As such, real-time performance and the velocity at which the data is generated, stored or processed can be a significant technological challenge, particularly in large-scale technology environments.
Additionally, the era of social media has introduced a variety of unstructured data types, such as text, images, and videos, which do not easily fit into traditional structured databases. One of the technological challenges is to reliably and robustly identify the truth from all kinds of data. The accuracy of a machine learning (ML) algorithm can be compromised if the data is biased in any way. For example, this summer, researchers studying GPT-3, one of the most powerful language models in the world, discovered structural biases in the model’s results related to gender, race, and religion.
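To illustrate how a skewed dataset propagates into a model's output, consider a minimal sketch in Python. The dataset, field names, and "model" here are all invented for illustration: the model simply learns historical rates per group, so any sampling bias in the records is reproduced verbatim in its predictions.

```python
from collections import Counter

# A toy, purely illustrative dataset: historical decisions skewed by
# how the data was sampled, not by any real difference between groups.
records = (
    [{"group": "A", "outcome": True}] * 80 +
    [{"group": "A", "outcome": False}] * 20 +
    [{"group": "B", "outcome": True}] * 5 +
    [{"group": "B", "outcome": False}] * 45
)

def train_rate_model(data):
    """'Learn' the positive-outcome rate per group -- a stand-in
    for a real classifier trained on the same records."""
    positives, totals = Counter(), Counter()
    for r in data:
        totals[r["group"]] += 1
        positives[r["group"]] += r["outcome"]
    return {g: positives[g] / totals[g] for g in totals}

model = train_rate_model(records)
print(model)  # group A scores 0.8, group B scores 0.1
```

The point of the sketch is that the model is not "wrong" about its training data; it has faithfully learned the bias baked into it, which is why data quality and veracity matter before any modelling begins.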
Scaling these flaws up to the big data level compounds any errors or shortcomings throughout the analytics process. Therefore, mitigating uncertainty must be at the forefront of any automated technique, as uncertainty can have considerable influence on the accuracy of its results.
The overriding technological challenge of any data analytics project is to deliver insights and value that are more accurate and more precise through the utilisation of massive volumes of data. Going forward, as we learn to identify biases in learning systems, we can also figure out how to mitigate them and intervene.
The main shortcoming of the data available to John Graunt was its non-uniform upkeep and the limited depth of variables included at its outset. Now, more than 350 years on, we have better data, but we still face many challenges.
If you are considering claiming R&D tax credits for a data analytics project, it is important to understand whether the project is routine engineering or involves research and development. As with all innovative technologies that mature over time, it is important to carefully consider how your project meets the five key science criteria set out in the R&D tax credits legislation. KPMG’s R&D Incentives practice assists our clients in this regard.
This article first appeared in Irish Tech News and is reproduced here with their kind permission.