How to anonymize medical records for research under GDPR
April 24, 2026
Medical research is one of the activities where anonymization requirements are most explicit: clinical records contain health data, a special category under GDPR, which applies its strictest level of protection to them. At the same time, patient data is essential raw material for science. This guide explains, step by step, how to anonymize medical records for research projects while complying with GDPR and national health legislation.
What the law requires when handling clinical records
The applicable framework is denser than in other sectors because several norms converge:
- GDPR — Art. 9: health data are a special category and require one of the exceptions in paragraph 2 (explicit consent, public interest in the area of health, scientific research…).
- National data protection laws — typically reinforce the need for impact assessment and specific security measures when processing health data.
- Patient autonomy laws — define the minimum content of the clinical record and the conditions of access.
- Biomedical research laws — specifically regulate projects with biological samples and clinical data.
- Regional health legislation — in many countries each region has additional requirements for the transfer of health data.
The common denominator: genuine, irreversible anonymization is the way out of this reinforced regime. Once anonymized, clinical data are no longer personal data (GDPR Recital 26) and can be used for research without explicit consent. While the data remain personal, whether pseudonymized or in the clear, a specific legal basis and reinforced technical measures are required.
What a medical record contains and what must be removed
A standard clinical record contains many more layers of identifiable information than is usually assumed:
Direct identifiers (must be removed):
- Patient’s first and last name
- National ID or passport number
- Health card number
- Social security number
- Clinical history number
- Phone number and email
- Postal address
Care-environment identifiers:
- Name of the center (sometimes identifying, e.g. a hospital in a small catchment area)
- Name of the physician or team
- Floor, room, bed
- Exact dates of admission, discharge, consultation
Quasi-identifiers (must be generalized):
- Date of birth → replaced by age group (five-year bracket)
- Postal code → generalized to region
- Occupation → grouped into broad sectors
- Family data (if they appear)
Narrative content:
- Admission, progress, and discharge reports
- Nursing notes
- Clinical judgment
- Data reported by the patient (“my neighbor also had this”)
Narrative content is where simple anonymization processes fail. Structured fields are easy to clean; free text requires natural language processing because it contains names, references to third parties, locations, and dates embedded in prose.
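To make the free-text problem concrete, here is a minimal sketch in Python of a rule-based scrubber. It is deliberately not the clinical NLP the text calls for: it swaps in plain regular expressions, and the patterns, placeholders, and the `scrub_text` name are all assumptions made for illustration. It catches formatted items such as dates, emails, phone numbers, and titled names, and it will miss exactly the kind of content that makes narrative risky, such as untitled names or a remark about a neighbor.

```python
import re

# Hypothetical patterns for illustration only. Real clinical notes need a
# named-entity model trained on clinical language, plus manual review.
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
    # A digit followed by 8 to 11 more digits, optionally separated.
    (re.compile(r"\b\d(?:[\s-]?\d){8,11}\b"), "[PHONE]"),
    # A title followed by one or two capitalized words.
    (re.compile(r"\b(?:Dr|Dra|Mr|Mrs|Ms)\.?(?:\s+[A-ZÁÉÍÓÚÑ][^\W\d_]+){1,2}"), "[NAME]"),
]


def scrub_text(note: str) -> str:
    """Replace every pattern match with its placeholder, in list order."""
    for pattern, placeholder in PATTERNS:
        note = pattern.sub(placeholder, note)
    return note


note = "Seen by Dr. García López on 12/03/2025. Contact: 612 345 678, ana@example.org."
print(scrub_text(note))
# Seen by [NAME] on [DATE]. Contact: [PHONE], [EMAIL].
```

A rule list like this is useful as a first pass or as a cross-check on a model's output, but it should not be treated as sufficient on its own for narrative fields.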
The re-identification risk in clinical data
Academic literature has documented numerous cases where supposedly anonymous datasets could be re-identified:
- A well-known study led by MIT researchers showed that four approximate location-and-time points from a mobile phone are enough to uniquely identify 95% of people in a dataset of 1.5 million users.
- In healthcare, combining five-year age bracket, sex, postal code, and a rare condition is often enough to identify a person in a district of 50,000 inhabitants.
- Diagnoses coded in ICD-10, combined with an approximate date, allow re-identification when the diagnosis is unusual.
Clinical anonymization therefore requires a more aggressive approach than anonymizing other documents. Erasing names is not enough: quasi-identifiers must be generalized, and unusual dates and rare diagnoses must sometimes be removed or modified so that an episode cannot be singled out.
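The idea of singling out can be illustrated with a short Python sketch that counts how many records share each combination of quasi-identifiers. The field names and the three toy records are invented for the example; the point is only that a rare diagnosis code can make a record unique even after age and location have been generalized.

```python
from collections import Counter

# Hypothetical field names; the toy records below are invented.
QUASI_IDENTIFIERS = ("age_bracket", "sex", "region", "diagnosis")


def singled_out(records, keys=QUASI_IDENTIFIERS):
    """Return the records whose quasi-identifier combination appears only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]


records = [
    {"age_bracket": "40-44", "sex": "F", "region": "North", "diagnosis": "I10"},
    {"age_bracket": "40-44", "sex": "F", "region": "North", "diagnosis": "I10"},
    {"age_bracket": "40-44", "sex": "F", "region": "North", "diagnosis": "E75.2"},
]

unique = singled_out(records)
print(len(unique), unique[0]["diagnosis"])  # 1 E75.2
```

Records returned by a check like this are the candidates for further generalization or suppression before release.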
Recommended clinical anonymization process
- Define the use case — does the researcher need longitudinal follow-up? Only aggregated data for a cross-sectional study? The level of anonymization depends on the intended use.
- Classify the fields — separate the three types above: direct identifiers, environment identifiers, and quasi-identifiers.
- Apply full removal to direct identifiers. Do not replace with codes (that would be pseudonymization).
- Generalize quasi-identifiers — exact date → month; exact age → five-year bracket; postal code → region.
- Process narrative text — use NLP trained in clinical language to detect names, references, and locations in free-text fields.
- Remove file metadata — author, creation date, device name, change history.
- Evaluate k-anonymity risk — for each combination of quasi-identifiers, ensure there are at least k equivalent records (recommended k≥5 for clinical data).
- Document the procedure — date, responsible party, techniques applied, outcome of the re-identification test.
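The structured-field steps above (removal, generalization, and the k-anonymity check) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete pipeline: the field names, the `region_by_prefix` lookup, and the choice of quasi-identifiers are hypothetical, and narrative text and file metadata are left out.

```python
from collections import Counter
from datetime import date

# Hypothetical field names; adapt them to the export you actually receive.
DIRECT_IDENTIFIERS = {"name", "national_id", "health_card", "phone", "email", "address"}
QUASI_IDENTIFIERS = ("age_bracket", "admission_month", "region")


def five_year_bracket(age: int) -> str:
    low = age // 5 * 5
    return f"{low}-{low + 4}"


def anonymize(record: dict, today: date, region_by_prefix: dict) -> dict:
    # Direct identifiers are dropped outright, never replaced with codes.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Date of birth becomes a five-year age bracket.
    birth = out.pop("birth_date")
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    age = today.year - birth.year - (0 if had_birthday else 1)
    out["age_bracket"] = five_year_bracket(age)
    # Exact date becomes a month; postal code becomes a region.
    out["admission_month"] = out.pop("admission_date").strftime("%Y-%m")
    out["region"] = region_by_prefix.get(out.pop("postal_code")[:2], "unknown")
    return out


def groups_below_k(rows: list, k: int = 5) -> list:
    """List the quasi-identifier combinations shared by fewer than k records."""
    sizes = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in rows)
    return [combo for combo, n in sizes.items() if n < k]


raw = {
    "name": "Ana Example", "national_id": "00000000T",
    "birth_date": date(1980, 6, 15), "admission_date": date(2025, 11, 3),
    "postal_code": "28013", "diagnosis": "I10",
}
row = anonymize(raw, date(2026, 4, 24), {"28": "Madrid"})
print(row)
print(groups_below_k([row]))  # a single record is always below k=5
```

Any combination reported by `groups_below_k` needs coarser generalization or suppression before the dataset is released, and the list itself is a natural thing to record in the procedure documentation.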
How this differs from pseudonymization for longitudinal research
Anonymization is not always possible. Some projects require long-term follow-up, linking biological samples with new clinical data, or contacting the patient if a clinically relevant finding appears.
In those cases, the correct technique is double-key pseudonymization:
- The patient receives a random code (study ID).
- The mapping table between study ID and real identity is kept by an independent custodian (typically the ethics committee or the hospital’s quality department), never by the research team.
- The researcher can only request re-linking through a formal and justified procedure.
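This arrangement, in which an independent custodian holds the key, is usually described as trusted-third-party pseudonymization (the term safe haven is also used), and it is what biomedical research laws typically require for projects with biological samples.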
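The separation of roles can be sketched in Python as follows. The `Custodian` class and its method names are hypothetical, and a real deployment would add access control, persistent storage, and an audit trail; the sketch only shows that study IDs are random (not derived from the identity) and that the mapping table never leaves the custodian.

```python
import secrets


class Custodian:
    """Holds the study-ID-to-identity table, outside the research team."""

    def __init__(self):
        self._study_id_by_identity = {}
        self._identity_by_study_id = {}

    def study_id_for(self, national_id: str) -> str:
        """Return the patient's study ID, creating a random one on first use."""
        if national_id not in self._study_id_by_identity:
            study_id = secrets.token_hex(8)
            while study_id in self._identity_by_study_id:  # avoid collisions
                study_id = secrets.token_hex(8)
            self._study_id_by_identity[national_id] = study_id
            self._identity_by_study_id[study_id] = national_id
        return self._study_id_by_identity[national_id]

    def relink(self, study_id: str, justification: str) -> str:
        """Re-identify a participant, only with a documented justification."""
        if not justification.strip():
            raise PermissionError("re-linking requires a documented justification")
        return self._identity_by_study_id[study_id]


custodian = Custodian()
sid = custodian.study_id_for("00000000T")
# The research team only ever sees `sid`, never the national ID.
print(sid == custodian.study_id_for("00000000T"))  # True: stable across visits
```

Because the same patient always receives the same study ID, new clinical data can be linked over time, which is what distinguishes this scheme from anonymization.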
The role of the Ethics Committee
No biomedical research project should start without passing through the relevant ethics committee. The committee evaluates:
- The need to access personal data versus the possibility of working with anonymized data.
- The adequacy of the proposed anonymization or pseudonymization measures.
- The content of informed consent (if applicable).
- Safeguards against re-identification.
A favorable committee report is also a key factor if one day you have to defend the lawfulness of the processing before a data protection authority.
Errors we frequently observe
Error 1: “We exported to Excel and that’s it”. Names remain, exact dates too, and the file metadata reveals which physician generated it.
Error 2: removing the name but leaving the national ID. It seems obvious, but it happens during manual editing when a column is hidden rather than deleted.
Error 3: trusting that “no one reads the free text”. Progress notes contain as much personal data as structured fields, and they account for most re-identification cases.
Error 4: using the same technique for every study. A study on hypertension does not require the same anonymization aggressiveness as one on a rare disease where a person can be unique in an entire region.
Frequently asked questions
Is patient consent required to use their record in research if it is anonymized?
Once the data is correctly anonymized, it is no longer personal data, so GDPR and its consent requirements do not apply. However, biomedical research laws and regional regulations require, in many cases, informing the patient in advance about secondary use of their data, even if anonymized.
Can I share anonymized records with researchers outside the EU?
Yes, once anonymized they can be shared without the restrictions of GDPR Chapter V. If they are pseudonymized, international transfer rules apply (standard contractual clauses, adequacy decision, etc.).
How long does it take to anonymize a 50-page record manually?
Between 2 and 4 hours, and the result is inconsistent because each operator applies different criteria. With a specialized tool, the same record is processed in seconds with homogeneous criteria.
Do electronic health record systems have built-in anonymization?
Some offer export with removal of direct identifiers, but almost none apply generalization of quasi-identifiers or evaluate re-identification risk. For serious research, an additional layer is needed.
Conclusion
Anonymizing medical records correctly is the difference between a viable research project and an enforcement proceeding. It requires technical judgment, adequate tools, and rigorous documentation. Hospitals and research groups that professionalize this process not only comply with the rules: they also accelerate their projects and reduce the time they spend in discussions with the DPO and the ethics committee.
If you need to anonymize medical records with guarantees, try anonimiza.do. The tool recognizes European healthcare identifiers and applies anonymization with an auditable log.