
Within the academic collaborative “Sterker op Eigen Benen”, we aim to improve healthcare for people with intellectual disabilities. Research using routinely collected healthcare data plays an important role in this. Client reports in electronic care records contain valuable, yet privacy-sensitive information about behaviours, symptoms, and events that are crucial for diagnostics and research, such as predicting health issues (e.g., cancer) or behavioural incidents. Due to the highly sensitive nature of this data, privacy protection is essential.

Pseudonymization is a privacy-enhancing technique in which identifiable information—such as names, addresses, or phone numbers—is replaced with artificial identifiers or labels. This allows for analysis of the data while protecting individual privacy. Carmenda (carmenda.nl) is the data infrastructure program within the “Sterker op Eigen Benen” collaborative and provides a privacy tool (available at privacytool.carmenda.nl) for pseudonymizing textual data. This tool runs locally as a desktop application within care organizations. The current implementation uses the Deduce algorithm (see: https://github.com/vmenger/deduce), which detects personal data such as names, addresses, and phone numbers and replaces this sensitive data with labels. Although Deduce performs reasonably well, other algorithms may perform better in terms of speed, accuracy, and usability.
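To make the label-replacement idea concrete, the sketch below substitutes a few personal-data patterns with labels using plain regular expressions. This is a toy stand-in under illustrative assumptions (the patterns and label names are invented for this example), not the Deduce algorithm, which uses far more elaborate lookup lists and context rules.

```python
import re

# Toy pseudonymizer: replaces a few patterns of personal data with labels.
# The patterns and labels below are illustrative assumptions and far less
# thorough than a real algorithm such as Deduce.
PATTERNS = [
    (re.compile(r"\b06[- ]?\d{8}\b"), "[PHONE]"),            # Dutch mobile number
    (re.compile(r"\b\d{4}\s?[A-Z]{2}\b"), "[POSTCODE]"),     # Dutch postcode
    (re.compile(r"\b(?:Jan|Piet|Marie) [A-Z][a-z]+\b"), "[PERSON]"),  # toy name list
]

def pseudonymize(text: str) -> str:
    """Replace every matched pattern with its label, in order."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    return text

report = "Jan Jansen (06-12345678) woont op 6525 GA Nijmegen."
print(pseudonymize(report))
# → [PERSON] ([PHONE]) woont op [POSTCODE] Nijmegen.
```

A rule-based sketch like this also shows where such approaches fail (names outside the list, spelling variants), which is exactly the kind of gap the algorithm comparison in this project should quantify.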

Objective

The objective of this project is to inventory, evaluate, and compare different algorithms for pseudonymizing Dutch client reports based on accuracy and performance. The final goal is to determine which algorithm is most suitable for integration into the existing Carmenda privacy tool. For an MSc thesis, the student could additionally train and evaluate a new (AI) model for pseudonymizing data, using synthetic textual reports.

Activities

  1. Inventory
    • What pseudonymization algorithms exist for Dutch text?
    • What are their main characteristics (language support, open source/license, API or local deployment)?
  2. Annotation & Test Set Development
    • Annotate existing test datasets with entities such as first names, surnames, addresses, locations, institutions, and phone numbers.
    • Optionally generate synthetic test data.
  3. Define Evaluation Criteria
    • Accuracy: precision, recall, F1-score
    • False positives and false negatives
    • Performance: processing speed on large datasets (current tool: ~6 minutes for 100,000 rows)
  4. Conduct Evaluation
    • Run the algorithms on the test datasets
    • Analyze performance per algorithm
  5. Reporting & Recommendation
    • Write a summary report comparing the algorithms
    • Provide a recommendation for the best-performing algorithm for integration into the Carmenda tool
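The evaluation step above (precision, recall, F1, false positives/negatives) can be sketched as span-level scoring: compare the spans an algorithm detects against gold annotations and count true positives, false positives, and false negatives. The (start, end, label) span format below is an assumption about how the annotations might be stored, not a prescribed schema.

```python
def score_spans(gold: set, predicted: set) -> dict:
    """Exact-match span scoring: a prediction counts as a true positive
    only if its (start, end, label) triple matches a gold annotation."""
    tp = len(gold & predicted)   # correctly detected entities
    fp = len(predicted - gold)   # false positives: spurious detections
    fn = len(gold - predicted)   # false negatives: missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical annotations for one report: (start, end, label)
gold = {(0, 10, "PERSON"), (25, 36, "PHONE"), (45, 52, "POSTCODE")}
pred = {(0, 10, "PERSON"), (25, 36, "PHONE"), (60, 68, "LOCATION")}
print(score_spans(gold, pred))
# → tp=2, fp=1, fn=1, precision and recall both 2/3
```

Aggregating these counts per entity type yields the confusion matrices named under Expected Results; a stricter or more lenient matching policy (e.g., partial-overlap matches) is a design choice the evaluation criteria should make explicit.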

Expected Results

  • Annotated dataset(s) for benchmarking
  • Automated scoring system with confusion matrices
  • Benchmark report of different algorithms
  • Recommendation for integration into the privacy tool

Supervision

  • Joep Tummers (Radboudumc / Sterker op Eigen Benen)
  • Pim van Oirschot (Radboudumc / Sterker op Eigen Benen)

Practical Information

  • Level: BSc/MSc in Computer Science, AI, Data Science, Health Technology, or related field
  • Location: Nijmegen, partly hybrid
  • Start date: to be agreed upon
  • Language: fluency in Dutch is a prerequisite

Interested or have questions? Contact Pim van Oirschot or Joep Tummers.