As part of its continuing consultation series on the application of the UK GDPR to the development and use of generative AI models, the ICO has issued a third call for evidence, this time focusing on the accuracy of training data and model outputs. This follows two previously concluded calls for evidence covering (i) the lawful basis for training generative AI models on web-scraped data, and (ii) how the principle of purpose limitation should be applied at different stages in the generative AI lifecycle.
Background
Accuracy is a key element of data protection law – with one of the principles of the UK GDPR requiring organisations to ensure that personal data processed is "accurate and, where necessary, kept up to date." Organisations are expected to take "every reasonable step … to ensure that personal data that are inaccurate, having regard to the purposes for which they are processed, are erased or rectified without delay".
The latest call for evidence relates to the data protection principle of accuracy, as opposed to 'statistical accuracy', which refers to the accuracy of an AI system itself.
Where personal data is inaccurate, it must be erased or rectified without delay, depending on the circumstances. However, the call for evidence does not cover the right to rectification, and the ICO notes that a future call for evidence will focus on people's information rights in the context of generative AI development and use.
The ICO's Guidance on AI and Data Protection sets out that the accuracy principle does not mean that the outputs of generative AI models need to be 100% statistically accurate. The need for statistical accuracy will depend on the use of the AI model. The ICO highlights that high statistical accuracy will be needed for models which are used to make decisions about people (such as the triage of customer enquiries), in contrast to models used to develop ideas for media (such as video games), where statistical accuracy is perhaps less of a priority.
ICO analysis of accuracy and generative AI
Accuracy of generative AI models
The need for accuracy depends on whether the purpose of the generative AI model requires accurate outputs. As noted above, the development of video game storylines does not require accurate outputs. By contrast, using a model to summarise customer complaints requires both statistical accuracy (the summary needs to be a good reflection of the documents it is based on) and data protection accuracy (the output must contain correct information about the customer).
Developers are therefore required to consider the impact of inaccurate training data on their models' outputs. Inaccurate training data is likely to lead to inaccurate outputs, and if those outputs have consequences for individuals (the examples given by the ICO are financial loss, misinformation and reputational damage), both developer and deployer could be said to be non-compliant with the accuracy principle.
Link between purpose and accuracy
As mentioned above, one of the key themes in the ICO consultation is that the specific purpose of the generative AI model will determine whether outputs need to be accurate. There must be clear communication between developers, deployers and end users to ensure the model is appropriate for the required level of accuracy. Where a model does not require accuracy as part of its purpose, technical and organisational controls should be used to ensure that the model is not used in circumstances where accuracy is needed.
Unexpected and incorrect outputs from generative AI models are referred to as 'hallucinations', and are a result of the probabilistic nature of generative AI models – in other words, the model makes a statistical guess rather than retrieving a verified fact. Again, controls must be in place to ensure that users are aware of these risks, and are not relying upon generative AI models to provide factually correct information that such models cannot reliably provide.
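To illustrate why hallucinations arise, the following minimal Python sketch shows a model sampling its next token from a learned probability distribution. The vocabulary, probabilities and prompt are entirely hypothetical, and real models operate at far greater scale, but the underlying mechanism is the same: the output is a weighted guess, not a fact lookup.

```python
import random

# Hypothetical next-token distribution for the prompt
# "The customer's account number is ..." - the model has no
# verified fact to retrieve, only learned statistical patterns.
next_token_probs = {
    "12345678": 0.32,   # plausible-looking but invented
    "87654321": 0.28,   # equally plausible, equally invented
    "[unknown]": 0.25,
    "12340000": 0.15,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Sample one token in proportion to its probability."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Each call may return a different, confidently worded answer:
# the model 'has a guess' rather than looking up a fact.
print(sample_next_token(next_token_probs))
```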
Effective communication regarding the use of generative AI models that are inappropriate for certain levels of accuracy will therefore be crucial. Developers need to set clear expectations for users, whether individuals or organisations, on the accuracy of the output. The ICO suggests that developers will need to consider the following steps to help provide information on accuracy:
- Providing clear information on the statistical accuracy of the application, and easily understandable information about appropriate usage;
- Monitoring user-generated content and outputs to understand how the model is being used, followed by user-engagement research to confirm that the information provided can be understood by users;
- Possible labelling of outputs as being generated by AI, which could include watermarking or embedding metadata (a brief sketch of metadata embedding follows this list); and
- Assessing the reliability of a model's output against reliable sources of information, using retrieval-augmented generation (RAG) techniques (a sketch also follows this list). These techniques involve the use of an AI framework designed for "improving the quality of [Large Language Model] generated responses by grounding the model on external sources of knowledge to supplement the LLM's internal representation of information."[1]
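On the labelling point, one common approach for image outputs is to embed provenance information in the file's metadata. The sketch below uses the Pillow library's PNG text chunks; the key names and model identifier are hypothetical, and production systems typically rely on standardised provenance schemes and robust watermarks rather than plain metadata, which can be stripped.

```python
from PIL import Image
from PIL.PngImagePlugin import PngInfo

# Stand-in for an AI-generated image.
img = Image.new("RGB", (64, 64), "white")

# Embed provenance labels as PNG text chunks
# (key names and values here are illustrative only).
meta = PngInfo()
meta.add_text("ai-generated", "true")
meta.add_text("generator", "example-model-v1")

img.save("labelled_output.png", pnginfo=meta)

# Downstream tools can read the label back:
print(Image.open("labelled_output.png").text)
```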
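On the RAG point, the following minimal sketch shows the shape of the technique: relevant documents are retrieved and injected into the prompt so that the model answers from a verifiable source rather than from its internal representation alone. The knowledge base, naive keyword retriever and `llm_generate` placeholder are all hypothetical stand-ins; a real deployment would use a vector store and an actual model API.

```python
# Hypothetical source documents a real system would hold in a vector store.
knowledge_base = [
    "Complaint #42 was resolved on 2 April 2024 with a full refund.",
    "Complaint #43 remains open pending a response from the customer.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to an actual generative model API."""
    return f"[model output grounded in prompt: {prompt!r}]"

query = "What is the status of complaint #43?"
context = "\n".join(retrieve(query, knowledge_base))

# Grounding: the retrieved text is injected into the prompt so the
# model answers from a verifiable source rather than from memory alone.
answer = llm_generate(f"Answer using only this context:\n{context}\n\nQ: {query}")
print(answer)
```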
Training data and the effect on accuracy
Where the AI model is used for a purpose where accurate outputs are needed, the developer must consider the impact of the accuracy of the training data on those outputs. The ICO proposes that compliance with accuracy responsibilities can be improved in the following ways:
- The developer should curate the training data to ensure it is sufficiently accurate for the purpose of the AI model. The ICO highlights that developers sometimes select training data from social media and user forums, and weight it according to the amount of engagement the content has received. The ICO is seeking evidence on how weighting content by engagement can be reconciled with the requirement for accuracy.
- Where the AI model may not provide accurate outputs (for example, where the model is used in ways not envisaged by the developer), the developer should clearly communicate the accuracy limitations to deployers and end users.
What does the ICO expect of developers?
The ICO expects that developers will have a good understanding of the accuracy of any training data. In particular, developers should know whether the training data they are using is accurate, factual and up to date, and should understand and document the impact that the accuracy of training data has on the generative AI model's outputs.
Clear communication with end users is a key aspect of the development of generative AI. Developers are expected to provide clear information about the accuracy of the application and its intended use. Developers should monitor how the application is used and how the information it generates is utilised, and, where appropriate, update and improve the information provided to the public, as well as any restrictions on how the application should be used.
Next steps
The call for evidence is open until 10 May 2024.
[1] https://research.ibm.com/blog/retrieval-augmented-generation-RAG