German Data Protection Authority comments on required legal bases under the GDPR for processing of personal data using AI
The State Commissioner for Data Protection and Freedom of Information of Baden-Württemberg has published a consultation paper concerning the relevant legal bases under the EU General Data Protection Regulation (GDPR) for processing personal data as part of the development and use of artificial intelligence (AI) systems. This consultation paper follows hot on the heels of the French data protection authority's "AI how-to sheets", which seek to address issues relating to the lawfulness of AI developments under the GDPR. The UK Information Commissioner's Office has also published guidance on the data protection issues raised by the development and use of AI, again including specific discussion of questions of legal basis for processing. This increasing focus by data protection authorities on data governance in relation to AI systems is not surprising. AI (particularly generative AI) requires large amounts of input data and, if such data is personal, tensions with data protection laws can arise. All of these publications aim to support compliance with data protection requirements during the development and deployment of AI systems.
The consultation paper is specifically intended to provide guidance on the legal bases for the processing of personal data during the lifecycle of AI systems. It is not to be understood as final and complete guidance; rather, it invites discussion, input and insights for its further development.
Do AI systems include or constitute personal data under the GDPR?
The paper starts by analysing the interplay of personal data and AI systems. For data to be considered personal, the data subject in question must be identified or identifiable. Many AI systems are trained through the use of data ("training data"), obtained from a variety of sources, which may well include personal data. Whether any of the data within an AI system itself constitutes personal data will then depend, at least in part, on whether
- the system stores some or all of the training data as originally collected (and the training data is itself personal – i.e., it permits the identification of data subjects); or
- the system itself only contains correlations and dependencies of relevant parameters, derived from the training data (a "model"), and not the training data itself.
In the former case, of course, both the use and the development of the system will involve processing of personal data, subject to the requirements of the GDPR. In the latter case, one might naturally take the view that the developed system simply does not contain data relating to individuals, let alone identified or identifiable individuals, and that its use therefore does not involve the processing of personal data. The paper, however, highlights the possibility that various kinds of external attack on AI systems (in particular, membership inference and model inversion attacks) might make it possible to use their models to reproduce some or all of the training data used in their development, or at least to identify the individuals on whose data the systems were trained. Where there is a reasonable likelihood that such attacks might succeed, the paper appears to take the view that the model as a whole, or perhaps some of the data within it – on this point the paper is not entirely clear – should be treated as data relating to identifiable individuals and therefore as personal data, whose processing is subject to the GDPR.
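To illustrate the technical point in simple terms, the sketch below is a minimal, hypothetical example (not drawn from the consultation paper) of a confidence-threshold membership inference attack. A deliberately overfitted model is trained on synthetic records, so the deployed model stores only derived parameters rather than the records themselves; the attacker then queries it and observes that records the model was trained on tend to receive markedly higher confidence than unseen records. All names, the dataset and the threshold are illustrative assumptions.

```python
# Minimal, illustrative membership inference sketch (hypothetical example).
# The model below stores only derived parameters, not the training records,
# yet its outputs still leak information about which records it was trained on.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "personal" records: 1,000 individuals, 20 numeric attributes each.
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split into records the model is trained on ("members") and records it never sees.
X_member, X_nonmember, y_member, y_nonmember = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# A deliberately overfitted model that memorises its training set.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_member, y_member)

def confidence_on_true_label(model, X, y):
    """Probability the model assigns to the correct label for each record."""
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

# The attacker only needs query access: memorised records tend to receive
# noticeably higher confidence than records the model has never seen.
conf_members = confidence_on_true_label(model, X_member, y_member)
conf_nonmembers = confidence_on_true_label(model, X_nonmember, y_nonmember)

threshold = 0.9  # attacker-chosen cut-off
print("flagged as members (actual members):    ", np.mean(conf_members > threshold))
print("flagged as members (actual non-members):", np.mean(conf_nonmembers > threshold))
```

Model inversion attacks go a step further and attempt to reconstruct representative input data from the model itself, but they rest on the same underlying observation: a model's behaviour can reveal information about the data on which it was trained.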
The paper's apparent position is a controversial and arguably overly broad understanding of the concept of "personal data". It appears to place undue emphasis on the concept of an individual being "identified or identifiable" rather than on the equally critical need to identify data which "relates" to an individual. Even if such attacks allow the training data to be reproduced, they do not change the nature of the derived data that remains in the model. In principle, it seems, it should be possible for data (such as a parameter within an AI model) not to relate to individuals, and therefore not to be personal data, even if it is possible to use it to infer the identities of particular individuals and find other data about them. The broad position apparently taken by the paper would appear to have significant practical implications for the use of AI systems which do not contain their training data.
The user would have GDPR responsibilities not only to the data subjects whose data it may process in the course of using the system (its own "input" data), but also to all or many of the subjects of the data on which the system was trained, whose identities will be unknown to it. Difficult questions will arise – for example, how should a user address the exercise of the GDPR right to object, or to be forgotten, when the data under consideration is not really a particular item of data relating to a given identifiable individual, but the model as a whole, from which the identity of the individual can conceivably be inferred?
Stages of processing of personal data during the lifecycle of AI systems
The paper initially identifies five phases of the lifecycle of AI systems where processing of personal data may occur:
- Collection of training data
- Training of the AI system
- Provision of the final AI system
- Use of the final AI system
- Use of an AI system's output
In particular, the position regarding the provision and use of the final system again depends on the state of the data available in the system, i.e., whether the training data as originally collected, or only the derived parameters, are stored in the AI system as delivered to the end user. As long as the data subjects whose data has been used to train the system are identifiable, the use of such systems constitutes a new processing operation and consequently requires a (new) legal basis. However, the role of the end user under the GDPR in this phase of the lifecycle is not explicitly specified in the paper, although the paper implies that the user of an AI system should be regarded as acting as a controller.
Developers have little to no control over the specific use of the model and thus do not determine the purpose of processing, while users may not even be aware that personal data is present in the model. Nevertheless, it can be argued that the users determine the purposes and means of the processing (depending on the individual case), even though in most cases they have little or no influence over the precise personal data that might be processed during their use of the system. Classifying the user as a controller entails further complexities.
For example, the user would require a legal basis, such as consent or legitimate interests, for the processing carried out when using the AI system. Since the user and the data subjects whose data has been used to train the AI system are unlikely ever to interact (and users will in most cases not even know whose personal data has been used to train an AI system), does this mean that consent would have to be anticipated by the developers during the development process and obtained on behalf of the future users of the AI system? This is likely to be impossible in practice, particularly given the GDPR's strict requirements for valid consent. As for relying on legitimate interests, the use must meet the necessity threshold, and it must be analysed whether the same result could be achieved in another way that processes less personal data, or none at all. Without knowing exactly what data is contained in the AI system, can the user even evaluate whether its use constitutes processing of personal data?
The paper analyses the legitimate interests legal basis only from the perspective of developers during the development process. Ultimately, the paper concedes that, although legitimate interests can provide a sufficient legal basis for numerous processing steps within AI development, this legal basis offers little legal certainty for developers (due to the mandatory balancing-of-interests test).
The risks of consent and legitimate interests as legal bases
The paper considers the role of consent as a potential legal basis for data processing for the development of AI systems, for example in the collection and analysis of training data. However, obtaining valid consent in accordance with the GDPR in this context presents a variety of difficulties. The paper points out that it can be difficult for controllers to inform data subjects about all potential processing steps that may be carried out during the development process. The main issue that is highlighted in the paper (to which few solutions are suggested) is the right of data subjects to withdraw their consent. Although withdrawal does not affect the lawfulness of the processing that has taken place up to the point of withdrawal, the paper queries what withdrawal of consent would mean for the functionality of an AI system and whether data erasure can be implemented in this context. These issues, together with the challenges alluded to above regarding the practicalities of obtaining consent in the first place, make it questionable how reliance on consent would operate in this context.
The data present within the AI system is, once again, of central importance. If the training data itself is not stored in the system, the system itself may not contain any personal data and it may be possible to rely on the data subjects' consent as the lawful basis for the initial training exercise. However, if the deployed AI system contains personal data, either because (personal) training data is included in the system or on the basis of the position taken by the paper with regard to membership inference / model inversion attacks, would the controller be obliged, in response to the withdrawal of consent, to reconstruct the original input data from the parameters now held in the system in order to delete it? If the controller instead relies on legitimate interests as its legal basis and the data subject exercises its right to object to the processing, the controller will need to demonstrate compelling legitimate grounds for the continued processing which override the data subject's interests, rights and freedoms during the lifecycle of the AI system. The paper's interpretation of the consequences of membership inference attack risk – if not adjusted – could have a significant impact on how AI will be developed and used. The paper briefly touches on other legal bases but considers them to have limited scope of application in this context.
Conclusion
The paper can only be regarded as a starting point for the discussion on how to treat AI systems under the GDPR for the purpose of identifying the potential legal basis for processing activities. Many of the paper's approaches are not yet sufficiently nuanced, and the examples given often pay too little attention to practical implications (e.g., the 'controller' qualification of parties in a chain of processing activities, or reliance on consent as a legal basis). One of the most important aspects requiring critical review and discussion, in order to provide more clarity for AI developers, is the extent to which a developed AI system may itself still be considered to contain (or even, as a whole, to constitute) personal data even though the training data as originally collected is no longer present and only malicious attacks could allow its reproduction.
Similarly, the potential legal basis for the use of AI systems by third parties needs further clarification. While the paper suggests that users of AI systems can be considered controllers under the GDPR, it provides no further insight into the applicable legal basis or how this would operate in practice.