Opening Address by Minister Josephine Teo at the Personal Data Protection Week 2025
7 July 2025
Good morning, colleagues and friends. I’d first like to thank everyone for being here. We have over 1,500 people in the room today, and over 2,000 coming and going throughout the week, including from many countries in Asia, and even further afield. I especially appreciate our international guests for joining us, including Data Protection Authorities from fellow ASEAN member states. Thank you all for being here.
The theme for this year is “data protection in a changing world”. This is an acknowledgement of the significant changes in both our global operating environment, as well as in the world of technology.
These twin forces have disrupted our workplaces, our homes, and our relationships with each other. It is inevitable that we must adjust our practices, laws and even our broader social norms.
Most of you in this room are practitioners of data or AI, or both.
Last year, I had spoken about the importance of data in the age of AI. This remains as pertinent as ever. We all know that generative AI models are built on vast amounts of data, and data is critical throughout the AI development lifecycle, from pre-training, to fine-tuning, to testing and validation.
In recent times, we have seen an explosion of sector-specific AI applications built on customised or proprietary datasets.
A good example is AskMax, Changi Airport’s chatbot that helps to address passenger queries. It runs on a LLM designed to call on Changi Airport’s data repositories.
Another example is GPT-Legal, which was finetuned by IMDA using the Singapore Academy of Law’s LawNet database.
Given the criticality of data in the AI age, it should not be surprising that data has also become a limiting factor to continuing advancement.
Let us walk through the data challenges at each stage of AI development and use.
In model training, the first well-known issue is the use of internet data to train these large models. Internet data is uneven in quality. Often, they contain biased or toxic content from different sources, including user-generated content on discussion forums. When the underlying data input contains harmful, toxic or biased content, this can lead to downstream problems with model outputs.
In the first regional red teaming challenge run jointly by Singapore IMDA and eight other countries, problematic model behaviours were observed. When asked to write a script about Singaporean inmates, the LLM chose names such as “Kok Wei” for a character jailed for illegal gambling, “Siva” for disorderly drunk and “Razif” for a drug abuse offender. These stereotypes, most likely picked up from the training data, are actually things that we want to avoid.
At the same time, developers are running out of internet data. Most of the LLMs are already trained on the entire corpus of internet data. What then should model providers do to improve their models? They are turning to more sensitive and private databases to augment their models, which brings its own set of challenges.
OpenAI, for example, has a growing list of data-related partnerships not only with global news outlets, but also governments, companies and universities like the Icelandic Government, Apple, Sanofi and Arizona State University.
The partnership model is one way of increasing data availability, but it is time-consuming and difficult to scale. Some of these databases may include sensitive data such as personal data or business confidential information.
Increasingly, we need a way to train models, while protecting sensitive information.
AI application, or ‘app’, which can be seen as the ‘skin’ that is layered on top of AI models, can also pose reliability concerns. If apps provide inaccurate, bias or toxic information, or leak confidential information, these can have serious implications for the company’s reputation, and in the worst cases, may actually cause physical harm.
Typically, companies would employ a range of well-known guardrails to make their app reliable. These include writing detailed system prompts to steer the model behaviour, using retrieval-augmented generation (or RAG), which many of you are familiar with, to improve accuracy or different types of filters to sieve out sensitive information.
Even then, apps can have unexpected shortcomings. Vulcan, a third-party tester, recently tested a high-tech manufacturer’s chatbot that assists employees to answer questions on product specifications that are posed by prospective customers. The manufacturer was concerned that the app would inadvertently leak confidential business information, for example, telling the prospective customers something that they do not want the prospective customers to know. True enough, Vulcan found that when prompted in Mandarin, the app leaked backend sales commission rates. You can imagine, from the manufacturer’s point of view, telling the prospective customers what the sales commission rates are is basically revealing how much further they can cut the price – and it is not something any business wants.
Fortunately, this problem was discovered during the testing phase. This highlights the value of independent testing. To ensure the reliability of GenAI apps before release, it is important to have a systematic and consistent way to check that the app is functioning as intended, and there is some baseline safety.
Like model developers, app developers must deal with data inadequacies. Very often, the models are linked up with internal company databases so that the apps can cater to the businesses’ specific needs. However, there are often insufficient proprietary data to build reliable apps. 42% of respondents to an IBM global survey cited this as one of their biggest challenges to AI adoption. So, we need a way to unlock more data-sharing among companies while protecting sensitive information.
After AI apps are deployed and used by consumers, correcting erroneous or harmful information poses a significant challenge. The process of finetuning and retraining a model – after it has “learnt” something – is imprecise and often costly.
Machine unlearning has therefore become a new field, albeit a nascent one. A key challenge faced by LLM leaders like Anthropic is that models now have billions or trillions of parameters. Which variables contribute most to the shortcomings in output? Are there techniques to identify them and carry out targeted model corrections at scale?
Finally, an overriding concern is accountability. The AI lifecycle is complex, with model builders, deployers, users and more. Each has a role to play to mitigate the risks.
This community here would be familiar with the case of a group of Samsung employees who unintentionally leaked sensitive information by pasting confidential source code into ChatGPT to check for errors. I think we are aware that this is happening in our workplaces too – sometimes our colleagues, in order to do a spell check, or to check the way in which they have put across ideas, may upload a file on to ChatGPT. This makes you wonder if there is anything in the file that should not be shared with ChatGPT.
Is it the responsibility of the employees who should not have put sensitive information into the chatbot? I think most of our colleagues here believe they have some responsibility.
But is it also the responsibility of the app provider to ensure that they have sufficient guardrails to prevent sensitive data from being collected?
Or should model developers be responsible for ensuring that such data is not used for further training?
There are no easy answers to this, I’m afraid.
For AI to continue advancing, we will need various types of solutions – from organisational process improvements to developing new techniques in risk mitigation. Technical solutions, such as Privacy Enhancing Technologies – or PETs that optimise the use of data without compromising privacy – have emerged as a viable pathway for addressing these concerns.
In the last 3 years, the IMDA and PDPC have run the PET Sandbox to encourage businesses to explore and experiment with the use of PETs across a variety of sectors and use cases. We have seen growing interest and some early adopters have also experienced tangible business returns.
For instance, Ant International, a financial institution that joined the Sandbox, used a combination of different PETs to train an AI model with their digital wallet partner without disclosing customer information to each other. The intention was to use the model to match vouchers offered by the wallet partner with customers of Ant International, who were most likely to use them. Ant International contributed voucher redemption data of their customers, while the digital wallet company contributed purchase history, preference and demographic data of the same customers. The AI model was trained separately with both datasets, without each data owner seeing or ingesting the other’s data. This led to a vast improvement in the number of vouchers claimed; the wallet partner increased its revenues, while Ant International enhanced its customer engagement.
You can see that this way of using PETs has many use cases, for example in detecting fraud, or in allowing healthcare institutions to do a better job of taking care of their patients.
Synthetic Data is another example of a PET that shows good promise. Last year, I launched PDPC’s Guide on Synthetic Data Generation, which sets out best practices for organisations. There are now innovative companies in Singapore, such as Betterdata, that help AI developers generate data to mimic real-world datasets. These synthetic data can further augment existing datasets as training datasets to build AI models, which goes some way to addressing the data challenges I had referred to earlier.
Our experience with organisations in the Sandbox has allowed us to better understand the technologies, their ability to protect personal data and comply with legal obligations when such data is shared. It has also given us a good sense of the growing interest from technology providers in offering PET solutions, as well as companies who are keen to use PETs.
To build on this momentum, IMDA will be introducing a PETs Adoption Guide. Designed for C-suite executives, this guide will offer resources to help organisations identify the right PETs for their business needs and will also include key considerations for companies to effectively deploy PETs.
This year’s Personal Data Protection Week will once again include the PETs Summit. Similar to last year when it was held for the first time, the Summit will be a good opportunity for data protection authorities, existing and interested PETs solution providers, and users in the Sandbox to connect and learn more from one another.
As demonstrated in the PETs Sandbox, Singapore’s approach towards emerging technologies is to help provide tools, resources, and a safe environment for companies to experiment, and to quickly share the learnings so that industries and consumers can benefit.
Recently, IMDA, AI Verify Foundation and industry partners collaborated on a Global AI Assurance pilot, studying ways to test the reliability of generative AI applications. Testing is a critical step to demonstrate that the AI application has addressed key risks.
A lot of the things that we use on a day-to-day basis, such as the appliances in our homes, the vehicles that take us to the workplace – we would not use them if they had not been properly tested. And yet, on a day-to-day basis, AI applications are being used on us without having been properly tested. So this is a lacuna, a serious gap that needs to be filled.
One example is Changi General Hospital, which worked with third party tester Softserve to test the reliability of their summarisation tool for selected medical reports. It is incredibly helpful to doctors and their workloads, to be able to put together case or patient summaries that can be shared with other physicians. How we ensure that this summarisation tool is reliable, accurate and does not misrepresent the patient, is of utmost importance.
Another is NCS, which tested how well its coding assistant adhered to internal coding standards and security requirements, as well as external regulatory guidelines.
With insights from this pilot, IMDA has identified several testing methods that organisations can use to test for and manage risks. This compilation of testing methods is known as the “IMDA Starter Kit”. It is a direct response to companies’ requests to go beyond governance frameworks and guidelines, for more standardised ways to test and deploy AI applications. It includes testing for risks like undesirable content and unintended data disclosure, like those I described earlier.
The learning and iterating continue as IMDA transitions its pilot to a new, ongoing AI Assurance Sandbox. The Sandbox is a learning environment to help all of us – whether we are business users, governance teams, AI developers – to jointly develop solutions, like better guardrails or processes for gen AI applications. Organisations interested in putting their applications to the test and contributing to our shared knowledge base are welcome to join.
Ultimately, our aim with each of these Sandboxes is to find coalition and consensus around what good looks like, whether for data protection or AI governance.
Much like traditional fields of product safety or pharmaceuticals, we need subject matter experts to agree on the standards to uphold, and testers to assure us that the standards are being met.
Given the speed and scale of AI adoption, there is some urgency for standards to be developed and agreed to. Realistically, this will take time. There are many stages to go through. In Singapore at least, we have taken the critical first steps to grow the ecosystem for testing and assurance. Our hope is that industry players will join us to initiate ‘soft’ standards that can be the basis for the eventual establishment of formal standards.
The field of data protection has had a head start, and I am pleased to share that we are ready to take the next step.
IMDA has worked with Enterprise SG and the Singapore Accreditation Council to elevate the Data Protection Trustmark (DPTM) to a new Singapore Standard, Singapore Standard 714. Companies that demonstrate accountable data protection practices can now apply to be certified under this new Standard, which will set the national benchmark for companies that want to demonstrate data protection excellence. The Trustmark will assure consumers that certified organisations adopt world-class practices in protecting their personal data.
I hope I have given you a sense of Singapore’s approach to dealing with the challenges and opportunities in using data for AI advancement.
We believe there is much for businesses and people to gain when AI is developed responsibly and deployed reliably, including the methods for unlocking data. It is up to us as leaders in corporations and the government to understand how we can do so, and to put in place the right measures.
By doing so, not only will we facilitate AI adoption, we will also inspire greater confidence in data and AI governance. On that note, I wish you fruitful discussions in the days ahead. Thank you very much.