The intellectual property and data protection implications of training generative AI
The models underlying popular generative AI tools such as ChatGPT, Bard and DALL-E are being trained on vast amounts of data sourced from the internet.
Some of that data may be subject to copyright protection and, to the extent that it relates to identifiable individuals, to restrictions on processing under data protection law. The use of such data to train GPT and similar models therefore gives rise to intellectual property and data protection risks on a far greater scale than before.
Possible infringement of copyright in the training data
The models underlying generative AI tools are trained through the ingestion of vast datasets. The potential liability of the creators of generative AI models – and in some cases the users of generative AI tools – for infringing third-party copyright in the underlying training materials depends on the specific implementation of the AI model, its training process and the source of the training materials.
For example, many datasets are derived from scraping publicly accessible materials on the internet. Scraping is generally performed without a licence from the copyright owners and may be contrary to the terms and conditions of the websites on which the content is hosted. By contrast, some tools have been trained on fully licensed datasets.
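By way of illustration only, the sketch below shows the kind of minimal scraping routine that might be used to gather text for a training corpus. The URL, user agent and helper names are hypothetical placeholders, and a real crawler would be considerably more sophisticated; the point is simply that nothing in the process itself verifies the copyright status of what is collected.

```python
# Illustrative only: a minimal scraper of the kind used to build text corpora.
# The URL and user agent below are hypothetical placeholders.
import urllib.robotparser
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def can_fetch(url: str, user_agent: str = "example-crawler") -> bool:
    # Check the site's robots.txt; note this is a crawling convention,
    # not a copyright licence or a substitute for the site's terms of use.
    parsed = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)


def scrape_page_text(url: str) -> str:
    # Fetch the page and strip it down to its visible text.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    url = "https://example.com/article"  # hypothetical page
    if can_fetch(url):
        corpus_entry = scrape_page_text(url)
        # In a real pipeline this text would be appended to a training
        # corpus; no step here checks who owns copyright in it.
```

Even the robots.txt check above speaks only to crawling etiquette; it says nothing about whether the scraped text may lawfully be reproduced in a training dataset.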
The courts may soon have the opportunity to consider the position of unlicensed tools. Earlier this year, stock-image supplier Getty Images filed proceedings against Stability AI Ltd, provider of the popular AI image generator Stable Diffusion, in the English High Court. Getty Images has also brought a corresponding claim in the Delaware courts against Stability AI, Inc. (the parent of Stability AI Ltd), and similar cases have been brought in the California courts against Midjourney, DeviantArt and GitHub. The following points are likely to be addressed in the inevitable flurry of AI-related disputes as these tools are used ever more widely:
- infringement risks can be mitigated or avoided altogether by obtaining a licence from the copyright and database owner(s), both for the training process and the use of the AI tool;
- where an AI model is trained on an open-source dataset scraped from the web, it is easier for copyright owners to identify which of their works have been used, providing more certainty at the outset of an infringement claim;
- the processes of: (i) compiling the training dataset/corpus; (ii) training the AI model; and (iii) generating new images or textual outputs from the model each raise different copyright infringement considerations. These processes may be undertaken by distinct, unconnected legal entities, some of whom may face significant liability for copyright infringement while others who have contributed to the same product may face none at all; and
- when assessing infringement liability, attention must be paid to the specific acts that each entity undertakes, the geographical location of those acts and, accordingly, which territory's laws apply. For example, some jurisdictions provide text and data mining exceptions to copyright infringement, whilst others do not.
The position is far from clear-cut and we intend to monitor these cases with interest. In the meantime, you may wish to read our deep dive.
Large-scale collection of personal data for training purposes
Data protection issues surrounding the training of generative AI tools came to a head recently when Italy's data protection supervisory authority (the Garante per la protezione dei dati personali) ordered OpenAI LLC to cease immediately the use of ChatGPT to process the personal data of data subjects located in Italy, pending further investigation. The order was made in part because there appeared to be "no legal basis underpinning the massive collection and processing of personal data in order to 'train' the algorithms on which the [ChatGPT] platform relies". We expect other data protection regulators to take similar measures in the coming weeks and months.
The scraping of text data from the internet frequently involves the collection of a significant volume of personal data, which is protected by the EU and UK GDPR (as well as by other data privacy regimes around the globe). This calls into question compliance with several core GDPR obligations, including:
- lawfulness of processing, which requires controllers to have a lawful basis for processing personal data. The lawful basis most likely to be relied upon by controllers is legitimate interests, but it is an open question whether data protection supervisory authorities will consider the balancing test to be satisfied;
- purpose limitation and purpose compatibility, which require controllers to collect personal data only for specified, explicit and legitimate purposes and to ensure that the personal data is not processed in a manner incompatible with those purposes. The training of generative AI tools will likely fall outside the scope of the original purposes for processing, and it is an open question whether supervisory authorities will consider such processing to be incompatible;
- transparency, which requires controllers to make information about the processing of their personal data accessible to data subjects. Because the information is not collected directly from the data subjects, and the controller may well be unable to contact the affected individuals directly, this use case will test the boundaries of reliance on the 'disproportionate effort' exemption;
- accuracy, which requires controllers to ensure that personal data is not incorrect or misleading as to any matter of fact. This presents a unique challenge where the personal data has been 'absorbed' into the AI model and there are multiple layers of abstraction between the collected personal data and the model; and
- data subject rights, in particular the ability to respond to individuals' requests for access to, rectification of, or erasure of their personal data. Once personal data has been absorbed into a trained model, honouring such requests at the model level becomes especially difficult (see the sketch after this list).
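As an illustration of the erasure problem, the hypothetical sketch below removes a data subject's records from a raw training corpus. The record structure and the naive name matching are our own assumptions for illustration; the key point, noted in the comments, is that cleaning the corpus does nothing to a model that has already been trained on the removed records.

```python
# Illustrative only: honouring an erasure request at the corpus level.
# The record structure and name matching below are hypothetical; real
# pipelines would need far more robust identification of personal data.
from dataclasses import dataclass


@dataclass
class CorpusRecord:
    source_url: str
    text: str


def erase_subject(corpus: list[CorpusRecord], subject_name: str) -> list[CorpusRecord]:
    # Drop records mentioning the data subject from the raw corpus.
    # Crucially, this only cleans the *corpus*: a model already trained
    # on the removed records still encodes information derived from them
    # and cannot be 'un-trained' without retraining or similar measures.
    needle = subject_name.lower()
    return [record for record in corpus if needle not in record.text.lower()]


corpus = [
    CorpusRecord("https://example.com/a", "Profile of Jane Doe, a solicitor..."),
    CorpusRecord("https://example.com/b", "An unrelated article about AI."),
]
cleaned = erase_subject(corpus, "Jane Doe")  # only the second record remains
```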
Stopping short of more substantive enforcement action at this stage, the UK Information Commissioner's Office has suggested that developers of generative AI tools need to "consider their data protection obligations from the outset, taking a data protection by design and by default approach". The extent to which this is possible in practice for large language models remains to be seen.
This article first appeared on TechUK as part of their #AIWeek2023.