Screen-scraping, de-identification and privacy
Screen or data-scraping involves the "automated, programmatic use of a website, impersonating a web browser, to extract data or perform actions that users would usually perform manually on the website". Although it uses similar technology and methods, this conduct is distinguished from 'web-crawling', which involves the use of any robot, spider or other automated device or process to download a web page's data, extract the hyperlinks it contains and follow them.
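The distinction can be illustrated with a minimal sketch. The HTML snippet, class names and email address below are entirely hypothetical, not drawn from any real site: a crawler-style pass gathers hyperlinks to follow, while a scraper-style pass targets a specific data field on the page.

```python
from html.parser import HTMLParser

# Hypothetical page content for illustration only.
PAGE = """
<html><body>
  <a href="/profiles/1">Profile 1</a>
  <a href="/profiles/2">Profile 2</a>
  <div class="contact">jane@example.com</div>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Crawler-style pass: collect hyperlinks to download and follow."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

class ContactScraper(HTMLParser):
    """Scraper-style pass: extract one specific data field."""
    def __init__(self):
        super().__init__()
        self._in_contact = False
        self.contacts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "contact") in attrs:
            self._in_contact = True

    def handle_data(self, data):
        if self._in_contact and data.strip():
            self.contacts.append(data.strip())
            self._in_contact = False

crawler = LinkCollector()
crawler.feed(PAGE)        # yields links to traverse: crawling
scraper = ContactScraper()
scraper.feed(PAGE)        # yields a targeted field: scraping
```

In practice both passes run over live sites rather than static strings, but the division of labour is the same: crawlers map the link graph, scrapers harvest fields from the pages found.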
Screen-scraping is not an illegal activity in its own right; however, the use of automated processes to collect and collate data has come under increased legal scrutiny in recent years.
Given dramatic increases in the volume and variety of Big Data available on the web, web-scraping and web-crawling technologies present considerable opportunities for commercial entities, researchers and interested individuals to find, collect and make sense of large amounts of information. Courts have recognised the utility of these technologies and have gone so far as to mandate access for crawlers and scrapers under certain circumstances (see, for example, hiQ Labs, Inc. v LinkedIn Corp).
However, there is mounting concern that companies may use screen-scraping technologies to collect user data that includes personal or sensitive information, and that such data is being scraped without individuals' knowledge or consent. Common examples of personal information that may be scraped include, but are not limited to, user contact details, CVs and email addresses. Many consumers are concerned that companies will use crawling and scraping technology to glean information about their preferences and practices from forums, blogs, social media sites or elsewhere.
There are also concerns that individuals may be identifiable from publicly available non-personal or de-identified data in data-scraping contexts. Robots, spiders and other automated devices or processes can search constantly and systematically, meaning that data that may have been de-identified in isolation may be capable of re-identification when arranged and analysed as part of a broader data set.
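This linkage risk can be sketched with entirely fabricated records. The field names (age, postcode) are assumed quasi-identifiers chosen for illustration: a "de-identified" health extract contains no names, yet joining it against a public directory on those shared fields re-attaches an identity.

```python
# Fabricated, hypothetical records for illustration only.
deidentified_health = [
    {"age": 34, "postcode": "2000", "diagnosis": "asthma"},
    {"age": 61, "postcode": "3121", "diagnosis": "diabetes"},
]

public_directory = [
    {"name": "A. Citizen", "age": 34, "postcode": "2000"},
    {"name": "B. Resident", "age": 47, "postcode": "4000"},
]

def link_records(deidentified, public, quasi_identifiers=("age", "postcode")):
    """Attempt re-identification by joining two datasets on
    shared quasi-identifiers (a simple linkage attack)."""
    matches = []
    for record in deidentified:
        key = tuple(record[q] for q in quasi_identifiers)
        for person in public:
            if tuple(person[q] for q in quasi_identifiers) == key:
                # The combination of age and postcode is unique enough
                # to re-attach a name to the "anonymous" health record.
                matches.append({"name": person["name"], **record})
    return matches

reidentified = link_records(deidentified_health, public_directory)
```

The point of the sketch is that neither dataset is sensitive on its own; the privacy failure arises only when the two are combined, which is exactly what automated collection at scale makes easy.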
Ensuring privacy of de-identified data
On 5 April 2019, the Department of Industry, Innovation and Science (DIIS) released a discussion paper developed by Data61, Artificial Intelligence: Australia’s Ethics Framework, for public consultation. The discussion paper recognises the inherent tension that exists in the AI space between unlocking the potential of machine learning technologies while at the same time ensuring such technologies do not operate in conflict with Australia’s existing privacy regimes, as well as broader public expectations about data protection and privacy.
In an Australian-specific case study, the report explored the implications that may arise where machine learning technologies used for scraping activities are capable of re-identifying individuals by collating publicly available data and predicting data patterns in a manner superior to humans. In 2016, a dataset that included anonymised health information was uploaded to data.gov.au. Researchers employing automated technologies, such as spiders and bots, were quickly able to re-identify individuals from the data source and the dataset was promptly removed from the site. Data61 and DIIS used this example to stress the importance of using rigorous risk management processes prior to open publication or use of de-identified data.
The Office of the Australian Information Commissioner has made available a best practice guide on the interaction between the Australian Privacy Act and the use of data analytics. The guide stresses that appropriate privacy risk assessments should be undertaken to evaluate the risk of re-identification in circumstances where non-personal data is shared with third parties. Where de-identified data will be made available to other entities, or to the public at large, relevant considerations may include:
- how non-personal data may be used in conjunction with pre-existing datasets
- the difficulty, practicality, cost and likelihood of re-identification
- the risk of unauthorised use of de-identified data sets by third parties.
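One simple, widely used proxy for assessing the considerations above is k-anonymity: counting how many records share each combination of quasi-identifiers, since a record in a group of one is the easiest to re-identify. A minimal sketch follows, with fabricated data and assumed field names; real assessments would consider far more than this single metric.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest group size across all combinations of
    quasi-identifier values. k == 1 means at least one record is
    unique on those fields and is at high risk of re-identification."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values())

# Hypothetical de-identified dataset for illustration only.
dataset = [
    {"age_band": "30-39", "postcode": "2000", "condition": "asthma"},
    {"age_band": "30-39", "postcode": "2000", "condition": "flu"},
    {"age_band": "60-69", "postcode": "3121", "condition": "diabetes"},
]

# The (60-69, 3121) record is unique, so k is 1 for this release.
k = k_anonymity(dataset, ("age_band", "postcode"))
```

A low k would suggest coarsening the quasi-identifiers (wider age bands, partial postcodes) or withholding records before any release.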
Following a robust risk assessment, the OAIC encourages organisations to implement appropriate risk mitigation strategies to ensure the risk of re-identification remains low, particularly as technology evolves. Where de-identified personal information is later re-identified, the Privacy Act will apply to how that information is handled, and non-compliance can expose an organisation to substantial penalties.
De-identification is not absolute; it is context specific. Recent and sustained focus on AI has brought to light numerous privacy concerns relating to the use of automated technologies, especially robots, spiders and crawlers used for data-scraping. Organisations need to be alert to the risk of re-identification posed by these technologies, and ensure adequate measures are in place to protect individuals' personal or sensitive data and maintain public confidence in data practices and protection.
©2022 KPMG, an Australian partnership and a member firm of the KPMG global organisation of independent member firms affiliated with KPMG International Limited, a private English company limited by guarantee. All rights reserved. The KPMG name and logo are trademarks used under license by the independent member firms of the KPMG global organisation.