
AI: The privacy challenges of the training phase


By Jade Kowalski, Rebecca Morgan & Astrid Hardy


Published 12 August 2024

Overview

Meta and X have recently been on the receiving end of regulatory pressure regarding their AI training practices, and in particular the use of personal data during the training phase. Artificial intelligence models, of course, require large datasets for training: think millions, if not billions, of data points.

There are of course a number of legal issues to think about when it comes to training AI models, such as intellectual property and contractual restrictions. As the need to train generative and other AI models has grown, the issue of data scraping has gathered momentum as one of the key legal dilemmas. Many AI models are being trained on web-scraped data from publicly available sources. We highlighted last year a number of actions alleging that certain AI models had been trained on copyrighted works, including books and photographs, without consent or compensation.

Considering these issues from a privacy perspective, the interplay between AI legislation, such as the EU AI Act, and data protection legislation, such as the EU/UK GDPR, is key. The data protection regulators have already played a leading role in shaping the AI regulatory landscape, with the UK's Information Commissioner's Office (ICO) being particularly active. The regulators are keen to stress that "the AI Act does not replace the requirements of GDPR… [instead], it aims to complement them by laying down the conditions required to develop and deploy trusted AI systems" (see the Q&A from the French data protection authority, CNIL).

The ICO's consultation series specifically sought views on this interplay. In particular, the first chapter considered the lawful basis for training generative AI models on web-scraped data. We highlighted the details of the ICO's consultation and next steps in our article earlier this year.

 

Challenges to Meta AI's plans – the data protection regulators

Perhaps unsurprisingly, concerns over social media platforms using public and non-public user data to train AI models are at the forefront of these discussions. Earlier this year, Meta updated its privacy policy to confirm that, from late June 2024, user-generated 'activity or information' on its social media platforms, including Facebook, Instagram, WhatsApp and Threads, would be used to develop and improve its AI technology, both for use by Meta itself and by undefined third parties. The 'activity and information' identified included user-generated content such as posts, comments and audio, as well as messages that users send to or receive from business and professional accounts. The proposals unsurprisingly generated significant concerns, despite assurances that content from users under 18, and users' private messages, would not be used for training purposes.

On 14 June 2024, the Irish Data Protection Commission (DPC) announced that Meta had confirmed it would delay processing EU/EEA user data to train Meta AI following complaints by various European data protection regulators. The delay was announced following a request by the DPC, as Meta's lead supervisory authority in the EU. Meta has now paused its plans to use users' posts on Instagram and Facebook for AI training purposes but in a public statement said it was "disappointed by the request of the Irish DPC, our lead regulator, on behalf of the European DPAs, to delay training our large language models (LLMs) using public content shared by adults on Facebook and Instagram — particularly since we incorporated regulatory feedback and the European DPAs have been informed since March. This is a step backwards for European innovation, competition in AI development and further delays bringing the benefits of AI to people in Europe." The DPC's involvement followed eleven complaints to data protection authorities from Noyb, the privacy rights organisation chaired by Max Schrems.

In the UK, the ICO has followed suit and requested that Meta pause and review its plans after receiving concerns from UK users, and Meta has agreed to do so. Stephen Almond, Executive Director of Regulatory Risk at the ICO, confirmed that the ICO "will continue to monitor major developers of generative AI, including Meta, to review the safeguards they have put in place and ensure the information rights of UK users are protected."

Further afield, Meta recently confirmed that Meta AI is available in "seven new languages and more countries around the world, including Latin America for the first time". This announcement was made shortly after the Brazilian data protection authority, the ANPD, issued a preventative order requiring Meta to suspend the use of personal data published on its platforms for AI training purposes, citing the "imminent risk of serious, irreparable, or hard-to-repair damage to the fundamental rights of the affected data subjects". A daily fine of R$50,000 (c.£6,900) will apply for non-compliance. A copy of the Order (in Portuguese) can be found here. This is the first direct action by the ANPD against a large tech company. The ANPD highlighted in its Order that Meta had not provided adequate information for users to understand how their data would be used to train Meta AI and the consequences of that use. The ANPD stated that there was a lack of transparency and that Brazilian users generally share information on Meta's platforms for personal use only, not with the intention that such data will be used to train Meta AI.

 

Corresponding rise in court proceedings

The use of personal data for AI training purposes is also being considered by the courts. On 6 August 2024, the DPC initiated court proceedings against the social media platform X (formerly known as Twitter) in relation to 'Grok', its AI assistant available to select users on X. Version 1.5 of Grok was released in April 2024, and the release of version 2 is expected imminently. In a change to default settings, X introduced a pre-ticked consent box to "Allow [user] posts as well as [user] interactions, inputs and results with Grok [or AI model] to be used for training and fine-tuning."

The DPC's claim was issued in respect of Grok's processing of user data and asked that the Court order X to stop and/or restrict the processing of user data to train its AI systems. The DPC also confirmed that it intends to refer the matter to the European Data Protection Board (EDPB) for further consideration. The claim applied not only to Grok but to any AI model that X uses. Within a couple of days, the DPC and X had reached an agreement that X would suspend its processing of personal data contained within the public posts of X's EU/EEA users processed between 7 May and 1 August 2024, for the purposes of training Grok. The DPC, in conjunction with other peer regulators, will continue to examine the extent to which such practices comply with the GDPR.

 

The privacy challenges arising from the use of personal data during the training phase – a UK perspective

This surge in regulatory interest is prompting privacy practitioners to address the knotty question of how personal data can be compliantly processed to train AI systems. This issue concerns not just data scraped from public sources but also a company’s own data used for AI training.

As set out above, it is important to keep in mind that "training" an AI system is a distinct processing activity and should be considered separately from the live use of the system post-training, ie its deployment. The following ICO guidance helpfully illustrates this:

In many cases, when determining your purpose(s) and lawful basis, it will make sense for you to separate the research and development phase (including conceptualisation, design, training and model selection) of AI systems from the deployment phase. This is because these are distinct and separate purposes, with different circumstances and risks.

Therefore, it may sometimes be more appropriate to choose different lawful bases for your AI development and deployment. For example, you need to do this when:

  • the AI system was developed for a general-purpose task, and you then deploy it in different contexts for different purposes. For example, a facial recognition system could be trained to recognise faces, but that functionality could be used for multiple purposes, such as preventing crime, authentication, and tagging friends in a social network. Each of these further applications might require a different lawful basis;
  • you implement an AI system from a third party, any processing of personal data undertaken by the developer will have been for a different purpose (eg to develop the system) to what you intend to use the system for, therefore you may need to identify a different lawful basis; and
  • processing of personal data for the purposes of training a model may not directly affect the individuals, but once the model is deployed, it may automatically make decisions which have legal or significant effects. This means the provisions on automated decision-making apply; as a result, a different range of available lawful bases may apply at the development and deployment stages.

As is always the case, identifying a lawful basis will be case specific and there won't be a blanket approach to this question. Having said that, there are certain lawful bases that are more likely to apply than others. We've considered three contenders below:

 

Legitimate interests

As we know, this is often the most flexible lawful basis, making it a common choice for training AI systems. However, you will need to clearly demonstrate the necessity and proportionality of the processing. This should be documented in a Legitimate Interest Assessment (“LIA”) to ensure all three parts of the legitimate interest test are met.

  • First, identifying a legitimate interest: you would need to define the use case in detail, recording both the immediate purpose (training the AI system) and, more broadly, the purpose of the AI technology itself once deployed.
  • Second, the necessity test: the use of personal data for AI training purposes does not need to be absolutely essential, but it should be a targeted and proportionate way of achieving the purpose.
  • Finally, the balancing exercise, which weighs the purpose identified at stage one against data subjects' interests, rights and freedoms.

The ICO's example in its guidance 'How do we ensure lawfulness in AI?' includes the proviso that whilst legitimate interests "may allow the organisation the most room to experiment with different variables for its model", the LIA should be revisited over time as purposes are refined.

 

Consent

Consent is often touted by the uninitiated as a panacea for thorny data protection issues. The ICO's guidance points out that "The advantage of consent is that it can lead to more trust and buy-in from individuals when they are using your service." That is of course true, but as our readers will no doubt know, UK GDPR consent brings with it its own challenges, such as ensuring consent is specific and informed and can be easily withdrawn. The ICO guidance also confirms that "…it can be difficult to ensure you collect valid consent for more complicated processing operations, such as those involved in AI. For example, the more things you want to do with the data, the more difficult it is to ensure that consent is genuinely specific and informed." In the context of training AI systems, if individuals decided to withdraw consent en masse, you would be left without a valid lawful basis upon which to continue to train your model. Whilst we wouldn't rule out reliance on consent wholesale, it isn't as straightforward as it might appear.

 

Performance of a contract

The ICO's guidance on this lawful basis draws a clear distinction between its application during the deployment of AI systems (where it can apply if the use of AI is objectively necessary to deliver a service to the relevant individual or to provide them with an AI-derived quotation) and its application for training AI systems, which the ICO does not appear comfortable with: "Furthermore, even if it is an appropriate ground for the use of the system, this may not be an appropriate ground for processing personal data to develop an AI system. If an AI system can perform well enough without being trained on the individual’s personal data, performance of the contract does not depend on such processing. Since machine learning models are typically built using very large datasets, whether or not a single individual’s data is included in the training data should have a negligible effect on the system’s performance.

Similarly, even if you can use performance of a contract as a lawful basis to provide a quote prior to a contract, this does not mean you can also use it to justify using that data to develop the AI system.

You should also note that you are unlikely to be able to rely on this basis for processing personal data for purposes such as ‘service improvement’ of your AI system. This is because in most cases, collection of personal data about the use of a service, details of how users engage with that service, or for the development of new functions within that service are not objectively necessary for the provision of a contract. This is because the service can be delivered without such processing."

Based on this clear steer from the ICO, readers will want to be wary of relying on this lawful basis for AI training purposes.

And what of more sensitive data? When AI training involves processing special category data (“SCD”) or criminal offence data (“COD”), you will need a lawful basis under Article 6 of the UK GDPR and an applicable condition under Article 9 of the UK GDPR or Schedule 1 DPA. This adds complexity, as determining the appropriate conditions depends on the specific AI model and requires careful evaluation. The research purposes condition appears to be the most likely candidate (and depending on your sector, there may be others that are worthy of consideration too).

 

Research purposes

The research condition (see paragraph 4, Schedule 1 DPA) can apply to both SCD and COD if the processing:

  • is necessary for archiving purposes, scientific or historical research purposes or statistical purposes;
  • is carried out in accordance with Article 89(1) UK GDPR (as supplemented by section 19); and
  • is in the public interest.

Whilst this may on the face of it appear to have quite narrow application, it is one area where reading the small print is worth your time; Recital 159 provides that “the processing of personal data for scientific research purposes should be interpreted in a broad manner including for example technological development and demonstration, fundamental research, applied research and privately funded research” [our emphasis added].

Recital 162 (processing for statistical purposes) states: “statistical purposes mean any operation of collection and the processing of personal data necessary for statistical surveys or for the production of statistical results. Those statistical results may further be used for different purposes, including a scientific research purpose. The statistical purpose implies that the result of processing for statistical purposes is not personal data, but aggregate data, and that this result or the personal data are not used in support of measures or decisions regarding any particular natural person”.

It does appear that, for SCD and COD, the research purposes condition may, depending on the use case, be a viable option for training AI models. Of course, this would need careful consideration, analysis and documentation prior to the use of SCD/COD for training purposes.

 

Other challenges and practical steps

As well as identifying a lawful basis (and of course demonstrating the necessity element), there are a number of other data protection compliance challenges and practical steps that will need to be thought through and documented. These issues will likely warrant a comprehensive Data Protection Impact Assessment, which will inevitably involve consideration of the following:

 

Transparency

Individuals must be able to understand how their personal data may be used to train AI systems. Achieving transparency in AI is challenging, as even developers do not always fully grasp how their models function. The ICO has produced detailed guidance on ExplAIning Decisions Made with AI and this should be your first port of call when thinking about how you can provide appropriate privacy notices for the training and deployment of AI.

 

Fairness

Fairness in the context of AI is another area where there are interesting and challenging issues at play. For example, how do AI developers ensure that their systems are fair when they are trained using personal data sets that are inherently biased, for example because the data available is skewed in favour of one gender? The ICO's guidance cites the following example:

The proportion of different genders in the training data may not be balanced. For example, the training data may include a greater proportion of male borrowers because in the past fewer women applied for loans and therefore the bank doesn’t have enough data about women.

Machine learning algorithms used to create an AI system are designed to be the best fit for the data it is trained and tested on. If the men are over-represented in the training data, the model will pay more attention to the statistical relationships that predict repayment rates for men, and less to statistical patterns that predict repayment rates for women, which might be different.

When deciding what datasets are appropriate for training AI systems, the proposed data should be carefully considered for any potential bias or wider unfair outcomes. The ICO has produced detailed guidance on Fairness in the AI lifecycle which should be factored into AI development.
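To make the ICO's loan example concrete, the sketch below shows one way a developer might check for the kind of imbalance described above and compare model performance across groups. It is a minimal, purely illustrative example: the dataset, file name, column names ("gender", "repaid") and the use of pandas and scikit-learn are our own assumptions, not part of the ICO guidance.

```python
# Hypothetical illustration only: check class balance and compare per-group
# performance in an assumed loan-repayment training set. Feature columns are
# assumed to be numeric; column names are invented for the example.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("loan_training_data.csv")  # assumed file

# 1. How balanced is the training data across the protected characteristic?
print(df["gender"].value_counts(normalize=True))

# 2. Train a simple model and compare accuracy for each group.
X = df.drop(columns=["repaid", "gender"])
y = df["repaid"]
X_train, X_test, y_train, y_test, g_train, g_test = train_test_split(
    X, y, df["gender"], test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for group in g_test.unique():
    mask = g_test == group
    acc = accuracy_score(y_test[mask], model.predict(X_test[mask]))
    print(f"{group}: accuracy {acc:.2%} on {mask.sum()} records")
```

A material gap in accuracy between the groups would be one signal that the training data needs rebalancing, or that the model needs further scrutiny, before deployment.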

 

Purpose limitation

The purpose limitation principle requires that personal data is collected for specified, explicit and legitimate purposes and not further processed in a manner incompatible with those purposes. A review of your organisation's privacy notice would be the first step in ensuring that the proposed processing activities are sufficiently covered.

 

Data Minimisation

The processing of personal data should be limited to that which is necessary to achieve the relevant purpose. In these circumstances, an organisation must be able to articulate and demonstrate why the relevant data fields are necessary for the training of the AI model. Any personal data which is not deemed to be necessary and relevant should be removed from the training dataset. There can, however, often be a tension between data minimisation and accuracy. In certain instances, it may be justifiable to include additional data fields to ensure that the relevant technology has an adequate volume of data to properly achieve its aims.
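As a purely illustrative sketch (the field names and justifications below are invented for the example, not drawn from any guidance), one practical way to operationalise this is to maintain an explicit allow-list of fields whose necessity has been documented, for example in the DPIA, and to strip everything else before data enters the training pipeline:

```python
# Hypothetical example: only fields whose necessity for training has been
# documented (e.g. in the DPIA) are retained; everything else is dropped.
import pandas as pd

DOCUMENTED_FIELDS = {  # assumed allow-list, with the recorded justification
    "loan_amount": "Needed to model repayment behaviour",
    "income_band": "Needed to model affordability",
    "repayment_history": "Directly related to the prediction target",
}

def minimise(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the documented fields; report anything removed."""
    extra = set(df.columns) - set(DOCUMENTED_FIELDS)
    if extra:
        print(f"Dropping undocumented fields: {sorted(extra)}")
    return df[[c for c in DOCUMENTED_FIELDS if c in df.columns]]
```

Keeping the justification alongside each field also helps evidence the accountability principle discussed below.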

 

Accuracy

Personal data must be accurate and, where necessary, kept up to date. In circumstances where personal data will be used for development/training, this is particularly important due to its ongoing impact. The ICO AI Guidance provides extensive commentary on ‘statistical accuracy’ in relation to AI decision making, which relates to the proportion of answers that an AI system gets correct or incorrect. Whilst statistical accuracy is different to the principle of data accuracy under the UK GDPR, it will also need to be properly considered, particularly with regard to its interplay with fairness.

 

Storage Limitation

Personal data should be retained for no longer than is necessary and proportionate for the purposes of processing. In relation to the development/training of AI technology, you will need to be mindful of the storage limitation principle, whilst balancing this against the likely need to ‘re-train’ or ‘de-bug’ the tool in the future.

 

Security

Usual obligations regarding the implementation of appropriate security measures to maintain the integrity and confidentiality of personal data will apply, including having strict controls in place (in line with relevant policies) in relation to access to any training data and/or training code. This will be particularly important given the likely extensive volume of personal data needed for AI training purposes.

 

Accountability

Underpinning all data protection obligations is the “accountability” principle. As well as requiring specific actions (for example, the conduct of a data protection impact assessment highlighted above), it also imposes a broader requirement to ensure that data protection issues are considered and documented. Any such documentation will need to be kept updated as the AI tool is developed, deployed and refined.

We will continue to report on developments in this area, in particular steps taken by data protection regulators and the courts around the world.
