Within the academic collaborative “Sterker op Eigen Benen”, we aim to improve healthcare for people with intellectual disabilities. Research using routinely collected healthcare data plays an important role in this. Client reports in electronic care records contain valuable but privacy-sensitive information about behaviours, symptoms, and events. This information is crucial for diagnostics and for research, for example on predicting health issues (e.g., cancer) or behavioural incidents. Because these data are highly sensitive, privacy protection is essential.
Pseudonymization is a privacy-enhancing technique in which identifiable information, such as names, addresses, or phone numbers, is replaced with artificial identifiers or labels. This allows the data to be analysed while protecting individual privacy. Carmenda (carmenda.nl) is the data infrastructure program within the “Sterker op Eigen Benen” collaborative and provides a privacy tool (available at privacytool.carmenda.nl) for pseudonymizing textual data. The tool runs locally as a desktop application within care organizations. The current implementation uses the Deduce algorithm (see: https://github.com/vmenger/deduce), which detects personal data such as names, addresses, and phone numbers and replaces them with labels. Although Deduce performs reasonably well, other algorithms may perform better in terms of speed, accuracy, and usability.
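To make the idea of replacing personal data with labels concrete, here is a minimal Python sketch. It is not the Deduce algorithm and does not use the Deduce API: the patterns, the `pseudonymize` function, and the `known_names` parameter are hypothetical and chosen only for illustration. A real tool relies on far richer rules and lookup lists.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration);
# a real tool such as Deduce uses far richer rules and lookup lists.
PATTERNS = {
    "PHONE": re.compile(r"\b0\d{9}\b|\b0\d{1,3}[- ]\d{6,8}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def pseudonymize(text, known_names=()):
    """Replace matched personal data with numbered labels such as [NAME-1]."""
    mapping = {}   # (kind, value) -> label, so repeated values share a label
    counters = {}  # kind -> number of distinct values seen so far

    def label_for(kind, value):
        key = (kind, value.lower())
        if key not in mapping:
            counters[kind] = counters.get(kind, 0) + 1
            mapping[key] = f"[{kind}-{counters[kind]}]"
        return mapping[key]

    # Names come from a caller-supplied list in this sketch.
    for name in known_names:
        pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        text = pattern.sub(lambda m: label_for("NAME", m.group()), text)

    # Pattern-based identifiers (phone numbers, e-mail addresses).
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: label_for(k, m.group()), text)
    return text


report = "Jan belde 0612345678. Later was Jan rustig."
print(pseudonymize(report, known_names=["Jan"]))
# prints: [NAME-1] belde [PHONE-1]. Later was [NAME-1] rustig.
```

Giving the same value the same numbered label keeps the text analysable (it remains visible that both sentences concern the same person) without revealing who that person is.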
The objective of this project is to inventory, evaluate, and compare different algorithms for pseudonymizing Dutch client reports in terms of accuracy and performance. The ultimate goal is to determine which algorithm is most suitable for integration into the existing Carmenda privacy tool. For an MSc thesis, the student could additionally train and evaluate a new (AI) model for pseudonymizing data, using synthetic textual reports.
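As an illustration of how such a comparison could be set up, the Python sketch below scores detected spans against manually annotated ones and measures average processing time per report. The choice of metrics (exact-match precision, recall, and F1) and all names, such as `span_scores` and `time_per_report`, are assumptions for this example, not a prescribed evaluation protocol.

```python
import time


def span_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def time_per_report(detect, reports):
    """Average wall-clock seconds that `detect` needs per report."""
    start = time.perf_counter()
    for report in reports:
        detect(report)
    return (time.perf_counter() - start) / len(reports)


# Hypothetical annotations for one report: (start offset, end offset, label).
gold = [(0, 3, "NAME"), (10, 20, "PHONE"), (32, 35, "NAME")]
pred = [(0, 3, "NAME"), (10, 20, "PHONE"), (40, 46, "NAME")]

precision, recall, f1 = span_scores(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=0.67 f1=0.67
```

In a pseudonymization setting, recall usually deserves particular attention, because every missed identifier is a potential privacy leak, whereas a false positive only removes some non-sensitive text.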
Interested or have questions? Contact Pim van Oirschot or Joep Tummers.