
Methods for the de-identification of electronic health records for genomic research

1 Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, Ontario K1J 8L1, Canada

2 Pediatrics, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, K1H 8L1, Canada

Corresponding author: Khaled El Emam (kelemam@uottawa.ca)

Copyright ©2011 BioMed Central Ltd.

Genome Med. 2011; 3(4): 25.

Published online 2011 Apr 27. doi: 10.1186/gm239



Abstract

Electronic health records are increasingly being linked to DNA repositories and used as a source of clinical information for genomic research. Privacy legislation in many jurisdictions, and most research ethics boards, require either that personal health information be de-identified or that patient consent or authorization be sought before the data are disclosed for secondary purposes. Here, I discuss how de-identification has been applied in current genomic research projects. I also review recent metrics and methods that can be used to ensure that the risk of re-identification is low and that disclosures are compliant with privacy legislation and regulations (such as the Health Insurance Portability and Accountability Act Privacy Rule). Although these methods can protect against the known approaches for re-identification, residual risks and specific challenges for genomic research are also discussed.

Electronic health records and the need for de-identification

Electronic health records (EHRs) are increasingly being used as a source of clinically relevant patient data for research [1,2], including genome-wide association studies [3]. Often, research ethics boards will not allow data custodians to disclose identifiable health information without patient consent. However, obtaining consent can be challenging and there have been major concerns about the negative impact of obtaining patient consent on the ability to conduct research [4]. Such concerns are reinforced by the compelling evidence that requiring explicit consent for participation in different forms of health research can have a negative impact on the process and outcomes of the research itself [5-7]. For example, recruitment rates decline significantly when individuals are asked to consent; those who consent tend to be different from those who decline consent on a number of important demographic and socio-economic variables, hence potentially introducing bias in the results [8]; and consent requirements increase the cost of, and time for, conducting the research. Furthermore, often it is not practical to obtain individual patient consent because of the very large populations involved, the lack of a relationship between the researchers and the patients, and the time elapsed between data collection and the research study.

One approach to facilitate the disclosure of information for the purposes of genomic research, and to alleviate some of the problems documented above, is to de-identify data before disclosure to researchers or at the earliest opportunity afterwards [9,10]. Many research ethics boards will waive the consent requirement if the first 'use' of the data is to de-identify it [11,12].

The i2b2 project (Informatics for Integrating Biology and the Bedside) has developed tools for clinical investigators to integrate medical records and clinical research. A query tool in i2b2 allows the computation of cohort sizes in a privacy protective way, and a data export tool allows the extraction of de-identified individual-level data [13,14]. The eMERGE network, which consists of five sites in the United States, is another example of integrated EHR and genetic databases [3]. The BioVU system at Vanderbilt University, a member of the eMERGE network, links a biobank of discarded blood samples with EHR data, and information is disclosed for research purposes after de-identification [3,15].

Here, I provide a description and critical analysis of de-identification methods that have been used in genomic research projects, such as i2b2 and eMERGE. This is augmented with an overview of contemporary standards, best practices and recent de-identification methodologies.

De-identification: definitions and concepts

A database integrating clinical information from an EHR with a DNA repository is referred to here as a translational research information system (TRIS) for brevity [16]. It is assumed that the data custodian is extracting a particular set of variables on patients from a TRIS and disclosing that to a data recipient for research purposes, and that the data custodian will be performing the de-identification before the disclosure or at the earliest opportunity after disclosure. The concern for the data custodian is the risk that an adversary will try to re-identify the disclosed data.

Identity versus attribute disclosure

There are two kinds of re-identification that are of concern. The first is when an adversary can assign an identity to a record in the data disclosed from the TRIS. For example, the adversary would be able to determine that record number 7 belongs to a patient named 'Alice Smith'. This is called identity disclosure. The second type of disclosure is when an adversary learns something new about a patient in the disclosed data without knowing which specific record belongs to that patient. For example, if all 20-year-old female patients in the disclosed data who live in Ontario had a specific diagnosis, then an adversary does not need to know which record belongs to Alice Smith; if she is 20 years old and lives in Ontario then the adversary will discover something new about her: the diagnosis. This is called attribute disclosure.

All the publicly known examples of re-identification of personal information have involved identity disclosure [17-26]. Therefore, the focus is on identity disclosure because it is the type that is known to have occurred in practice.

Types of variable

The data in an EHR will include clinical information, and possibly socio-economic status information that may be collected from patients or linked in from external sources (such as the census). EHR information can be divided into four categories. The distinctions among these categories are important because they have an impact on the probability of re-identification and on suitable de-identification methods.

Directly identifying information

One or more direct identifiers can be used to uniquely identify an individual, either by themselves or in combination with other readily available information. For example, there are more than 200 people named 'John Smith' in Ontario, and therefore the name by itself would not be directly identifying, but in combination with the address it would be directly identifying information. Examples of directly identifying information include email address, health insurance card number, credit card number, and social insurance number.

Indirectly identifying relational information

Relational information can be used to probabilistically identify an individual. General examples include sex, geographic indicators (such as postal codes, census geography, or information about proximity to known or unique landmarks), and event dates (such as birth, admission, discharge, procedure, death, specimen collection, or visit/encounter).

Indirectly identifying transactional information

This is similar to relational information in that it can be used to probabilistically identify an individual. However, transactional information may have many instances per individual and per visit. For example, diagnosis codes and drugs dispensed would be considered transactional information.

Sensitive information

This is information that is rarely useful for re-identification purposes - for example, laboratory results.

For any piece of information, its classification into one of the above categories will be context dependent.

Relational and transactional information are referred to as quasi-identifiers. The quasi-identifiers represent the background knowledge about individuals in the TRIS that can be used by an adversary for re-identification. Without this background knowledge identity disclosure cannot occur. For example, if an adversary knows an individual's date of birth and postal code, then s/he can re-identify matching records in the disclosed data. If the adversary does not have such background knowledge about a person, then a date of birth and postal code in a database would not reveal the person's identity. Furthermore, because physical attributes and certain diagnoses can be inferred from DNA analysis (for example, gender, blood type, approximate skin pigmentation, a diagnosis of cystic fibrosis or Huntington's chorea), the DNA sequence data of patients known to an adversary can be used for phenotype prediction and subsequent re-identification of clinical records [27-29]. If an adversary has an identified DNA sequence of a target individual, this can be used to match and re-identify a sequence in the repository. Without an identified DNA sequence or reference sample as background knowledge, such an approach for re-identification would not work [16]. The manner and ease with which an adversary can obtain such background knowledge will determine the plausible methods of re-identification for a particular dataset.

Text versus structured data

Another way to consider the data in a TRIS is in terms of representation: structured versus free-form text. Some data elements in EHRs are in a structured format, which means that they have a pre-defined data type and semantics (for example, a date of birth or a postal code). There will also be plenty of free-form text in the form of, for example, discharge summaries, pathology reports, and consultation letters. Any realistic de-identification process has to deal with both types of data. The BioVU and i2b2 projects have developed and adapted tools for the de-identification of free-form text [15,30].

De-identification standards

In the US, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule provides three standards for the disclosure of health information without seeking patient authorization: the Safe Harbor standard (henceforth Safe Harbor), the Limited Dataset, and the statistical standard. Safe Harbor is a precise standard for the de-identification of personal health information when disclosed for secondary purposes. It stipulates the removal of 18 variables from a dataset as summarized in Box 1. The Limited Dataset stipulates the removal of only 16 variables, but also requires that the data recipient sign a data sharing agreement with the data custodian. The statistical standard requires an expert to certify that 'the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information'. Out of these three standards, the certainty and simplicity of Safe Harbor has made it attractive for data custodians.

Safe Harbor is also relevant beyond the US. For example, health research organizations and commercial organizations in Canada choose to use the Safe Harbor criteria to de-identify datasets [31,32], Canadian sites conducting research funded by US agencies need to comply with HIPAA [33], and international guidelines for the public disclosure of clinical trials data have relied on Safe Harbor definitions [34].

However, Safe Harbor has a number of important disadvantages. There is evidence that it can result in the excessive removal of information useful for research [35]. At the same time it does not provide sufficient protection for many types of data, as illustrated below.

First, it does not explicitly consider genetic data as part of the 18 fields to remove or generalize. There is evidence that a sequence of 30 to 80 independent single nucleotide polymorphisms (SNPs) could uniquely identify a single person [36]. There is also a risk of re-identification from pooled data, where it is possible to determine whether an individual is in a pool of several thousand SNPs using summary statistics on the proportion of individuals in the case or control group and the corresponding SNP value [37,38].

Second, Safe Harbor does not consider longitudinal data. Longitudinal data contain information about multiple visits or episodes of care. For example, consider the state inpatient database for California for the year 2007, which contains information on 2,098,578 patients. In a Safe Harbor compliant dataset consisting only of the quasi-identifiers gender, year of birth, and year of admission, less than 0.03% of the records have a high probability of re-identification, where a high probability is defined as over 0.2. However, when two longitudinal variables are added for each visit (length of stay and time since last visit), 16.57% of the records have a high probability of re-identification (unpublished observations). The second dataset also meets the Safe Harbor definition but has a markedly higher percentage of the population at risk of re-identification. Therefore, Safe Harbor does not ensure that the data are adequately de-identified. Longitudinal information, such as length of stay and time since last visit, may be known by neighbors, co-workers, relatives, and ex-spouses, and even by the public in the case of famous people.
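The effect of adding quasi-identifiers can be illustrated with a small sketch. It assumes, as one common simplification, that the probability of re-identifying a record is one over the size of its equivalence class (the set of records sharing the same quasi-identifier values). The toy records, field names, and function names are hypothetical and are not drawn from the California database.

```python
from collections import Counter

def record_risks(records, quasi_identifiers):
    """Per-record risk, taken here as 1 / equivalence class size (an assumption)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

def share_at_high_risk(records, quasi_identifiers, threshold=0.2):
    """Fraction of records whose risk is strictly above the threshold."""
    risks = record_risks(records, quasi_identifiers)
    return sum(risk > threshold for risk in risks) / len(risks)

# Hypothetical toy data: six patients identical on the basic quasi-identifiers.
toy_records = [
    {"sex": "F", "birth_year": 1970, "admit_year": 2007, "length_of_stay": los}
    for los in (2, 2, 2, 2, 2, 9)
]

basic = ["sex", "birth_year", "admit_year"]
longitudinal = basic + ["length_of_stay"]

share_basic = share_at_high_risk(toy_records, basic)
share_longitudinal = share_at_high_risk(toy_records, longitudinal)
```

In this toy case no record is at high risk on the basic quasi-identifiers, but once length of stay is added the patient with the unusual stay becomes unique, even though both versions would satisfy a Safe Harbor style field list.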

Third, Safe Harbor does not deal with transactional data. For example, it has been shown that a series of diagnosis codes (International Statistical Classification of Diseases and Related Health Problems) for patients makes a large percentage of individuals uniquely identifiable [39]. An adversary who is employed by the healthcare provider could have access to the diagnosis codes and patient identity, which can be used to re-identify records disclosed from the TRIS.

Fourth, Safe Harbor does not take into account the sampling fraction - it is well established that sub-sampling can reduce the probability of re-identification [40-46]. For example, consider a cohort of 63,796 births in Ontario over 2004 to 2009 and three quasi-identifiers: maternal postal code, date of birth of baby, and mother's age. Approximately 96% of the records were unique on these three quasi-identifiers, making them highly identifiable. For research purposes, this dataset was de-identified to ensure that 5% or less of the records could be correctly re-identified by reducing the precision of the postal code to the first three characters, and the date of birth to year of birth. However, a cohort of 127,592 births de-identified in exactly the same way could have 10% of its records correctly re-identified. In this case the variables were exactly the same in the two cohorts but, because the sampling fraction varies, the percentage of records that can be re-identified doubles (from 5% to 10%, respectively).

Finally, other pieces of information that can re-identify individuals in free-form text and notes are not accounted for in Safe Harbor. The following example illustrates how I used such information to re-identify a patient. In a series of medical records that had been de-identified using the Safe Harbor standard, there was a record about a patient with a specific injury. The notes mentioned the profession of the patient's father and hinted at the location of his work. This particular profession lists its members publicly. It was therefore possible to identify all individuals within that profession in that region. Searches through social networking sites allowed the identification of a matching patient (having the same surname) with details of the specific injury during that specific period. The key pieces of information that made re-identification possible were the father's profession and region of work, and these are not part of the Safe Harbor items.

Therefore, universal de-identification heuristics that proscribe certain fields or prescribe specific generalizations of fields will not provide adequate protection in all situations and must be used with caution. Both the BioVU [15] and the i2b2 project [13] de-identify individual-level data according to the Safe Harbor standard, but also require a data sharing agreement with the data recipients as required by the Limited Dataset provision, and some sites implementing the i2b2 software use the Limited Dataset provision for de-identification [14].

Although the Limited Dataset provision provides a mechanism to disclose information without consent, it does not produce data that are de-identified. The challenge for data custodians is that the notices to patients for some repositories state that the data will be de-identified, so there is an obligation to perform de-identification before disclosure [15,47]. Where patients are approached in advance for consent to include their data in the repository, this is predicated on the understanding that any disclosures will be of de-identified data [3]. Under these circumstances, a more stringent standard than the Limited Dataset is required. Within the framework of HIPAA, one can then use the statistical standard for de-identification. This is consistent with privacy legislation and regulations in other jurisdictions, which tend not to be prescriptive and allow a more context-dependent interpretation of identifiability [26].

Managing re-identification risk

The statistical standard in the HIPAA Privacy Rule provides a means to disclose more detailed information for research purposes and still manage overall re-identification risk. Statistical methods can provide quantitative guarantees to patients and research ethics boards that the probability of re-identification is low.

A risk-based approach has been in use for a few years for the disclosure of large clinical and administrative datasets [48], and can be similarly used for the disclosure of information from a TRIS. The basic principles of a risk-based approach for de-identification are that (a) a re-identification probability threshold should be set and (b) the data should be de-identified until the actual re-identification probability is below that threshold.

Because measurement is necessary for setting thresholds, the supplementary material (Additional file 1) consists of a detailed review of re-identification probability metrics for evaluating identity disclosure. Below is a description of how to set a threshold and an overview of de-identification methods that can be used.

Setting a threshold

There are two general approaches to setting a threshold: (a) based on precedent and (b) based on an assessment of the risks from the disclosure of data.

Precedents for thresholds

Historically, data custodians have used the 'cell size of five' rule to de-identify data [49-58]. In the context of a probability of re-identifying an individual, this is equivalent to a probability of 0.2. Some custodians use a cell size of 3 [59-62], which is equivalent to a probability of 0.33 of re-identifying a single individual. Such thresholds are suitable when the data recipient is trusted.
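The correspondence between cell size and probability can be written out explicitly, under the usual assumption that an adversary has no basis for preferring one record over another within a cell of $k$ matching records:

```latex
p_{\text{re-id}} = \frac{1}{k}, \qquad
k = 5 \;\Rightarrow\; p = 0.2, \qquad
k = 3 \;\Rightarrow\; p \approx 0.33
```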

It has been estimated that the Safe Harbor standard results in 0.04% of the population being at high risk for re-identification [63,64]. Another re-identification attack study evaluated the proportion of Safe Harbor compliant medical records that can be re-identified and found that only 0.01% can be correctly re-identified [65]. In practice, setting such low thresholds can also result in significant distortion to the data [35], and is arguably more suitable when data are being publicly disclosed.

Risk-based thresholds

With this approach, the re-identification probability threshold is determined based on factors characterizing the data recipient and the data [48]. These factors have been suggested and have been in use informally by data custodians to inform their disclosure decisions for at least the last decade and a half [46,66], and they cover three dimensions [67], as follows.

First, mitigating controls: this is the set of security and privacy practices that the data recipient has in place. The practices used by custodians of large datasets and recommended by funding agencies and research ethics boards for managing sensitive health information have been reviewed elsewhere [68].

Second, invasion of privacy: this evaluates the extent to which a particular disclosure would be an invasion of privacy to the patients (a checklist is available in [67]). There are three considerations: (i) the sensitivity of the data: the greater the sensitivity of the data, the greater the invasion of privacy; (ii) the potential injury to patients from an inappropriate disclosure - the greater the potential for injury, the greater the invasion of privacy; and (iii) the appropriateness of consent for disclosing the data - the less appropriate the consent, the greater the potential invasion of privacy.

Third, motives and capacity: this considers the motives and the capacity of the data recipient to re-identify the data, considering issues such as conflicts of interest, the potential for financial gain from a re-identification, and whether the data recipient has the skills and the necessary resources to re-identify the data (a checklist is available in [67]).

For example, if the mitigating controls are low, which means that the data recipient has poor security and privacy practices, then the re-identification threshold should be set at a lower level. This will result in more de-identification being applied. However, if the data recipient has very good security and privacy practices in place, then the threshold can be set higher.

De-identification methods

The i2b2 project tools allow investigators to query for patients and controls that meet specific inclusion/exclusion criteria [13,69]. This allows the investigator to determine the size of cohorts for a study. The queries return counts of unique patients that match the criteria. If few patients match the criteria, however, there is a high probability of re-identification. To protect against such identity disclosure, the query engine performs several functions. First, random noise from a Gaussian distribution is added to returned counts, and the standard deviation of the distribution is increased as true counts approach zero. Second, an audit trail is maintained and if users are running too many related queries they are blocked. Also, limits are imposed on multiple queries so that a user cannot compute the mean of the perturbed data.
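The count perturbation described above can be sketched as follows. This is a minimal illustration only: the rule relating the standard deviation to the true count, the parameter values, and the function names are assumptions made for the example, not the settings used by the i2b2 query engine.

```python
import random

def noise_sigma(true_count, base_sigma=1.5, boost=5.0):
    """Hypothetical rule: the standard deviation grows as the true count nears zero."""
    return base_sigma * (1.0 + boost / (true_count + 1))

def perturbed_count(true_count, rng):
    """Return the true count plus zero-mean Gaussian noise, never below zero."""
    noisy = true_count + rng.gauss(0.0, noise_sigma(true_count))
    return max(0, round(noisy))

rng = random.Random(42)  # seeded only to make the sketch repeatable
example = perturbed_count(3, rng)
```

Rounding and clamping at zero are choices of this sketch. The audit trail and the limits on repeated queries, which stop a user from averaging the noise away, are not shown.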

The disclosure of individual-level data from a TRIS is also important, and various de-identification methods can be applied to such data. The de-identification methods that have the most acceptability among data recipients are masking, generalization, and suppression (see below). Other methods, such as the addition of random noise, distort the individual-level data in ways that are sometimes not intuitive and may result in incorrect results if these distortions affect the multivariate correlational structure in the data. This can be mitigated if the specific type of analysis that will be performed is known in advance and the distortions can account for that. Nevertheless, they tend to have low acceptance among health researchers and analysts [5], and certain types of random noise perturbation can be filtered out to recover the original data [70]; therefore, the effectiveness of noise addition can be questioned. Furthermore, perturbing the DNA sequences themselves may obscure relationships or even lead to false associations [71].

Methods that have been applied in practice are described below and are summarized in Table 1.

Table 1. Summary of de-identification methods for individual-level data [table not available]


Masking

Masking refers to a set of manipulations of the directly identifying information in the data. In general, direct identifiers are removed/redacted from the dataset, replaced with random values, or replaced with a unique key (also called pseudonymization) [72]. This latter approach is used in the BioVU project to mask the medical record number using a hash function [15].

Patient names are usually redacted or replaced with false names selected randomly from name lists [73]. Numbers, such as medical record numbers, social security numbers, and telephone numbers, are either redacted or replaced with randomly generated but valid numbers [74]. Locations, such as the names of facilities, would also normally be redacted. Such data manipulations are relatively simple to perform for structured data. Text de-identification tools will also do this, such as the tool used in the BioVU project [15].
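Replacing a medical record number with a unique key can be sketched with a keyed hash. The use of HMAC-SHA256, the key, and the identifier format below are assumptions for illustration; the text above states only that a hash function is used, and this is not presented as the BioVU implementation.

```python
import hashlib
import hmac

def pseudonymize(identifier, secret_key):
    """Map an identifier to a stable pseudonym using HMAC-SHA256 (assumed construction)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

key = b"replace-with-a-secret-held-by-the-custodian"  # hypothetical key
pseudonym = pseudonymize("MRN-0012345", key)
```

A keyed hash is used in this sketch because medical record numbers come from a small, guessable space, so an unkeyed hash could be reversed by hashing every candidate number. The same input always yields the same pseudonym, which keeps records for one patient linkable.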


Generalization

Generalization reduces the precision in the data. As a simple example of increasing generalization, a patient's date of birth can be generalized to a month and year of birth, to a year of birth, or to a 5 year interval. Allowable generalizations can be specified a priori in the form of a generalization hierarchy, as in the date of birth example above. Generalizations have been defined for SNP sequences [75] and clinical datasets [68]. Instead of hierarchies, generalizations can also be constructed empirically by combining or clustering sequences [76] and transactional data [77] into more general groups.
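The date of birth hierarchy can be expressed as a small function. The level numbering and the output formats are choices made for this sketch rather than part of any standard.

```python
from datetime import date

def generalize_dob(dob, level):
    """Generalize a date of birth; higher levels are less precise.

    0: full date, 1: month and year, 2: year, 3: 5 year interval.
    """
    if level == 0:
        return dob.isoformat()
    if level == 1:
        return f"{dob.year:04d}-{dob.month:02d}"
    if level == 2:
        return f"{dob.year:04d}"
    if level == 3:
        start = dob.year - dob.year % 5  # first year of the 5 year interval
        return f"{start}-{start + 4}"
    raise ValueError("level must be between 0 and 3")

dob = date(1984, 7, 19)
levels = [generalize_dob(dob, lvl) for lvl in range(4)]
# levels == ["1984-07-19", "1984-07", "1984", "1980-1984"]
```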

When a dataset is generalized the re-identification probability can be measured afterwards. Records that are considered high risk are then flagged for suppression. When there are many variables the number of possible ways that these variables can be generalized can be large. Generalization algorithms are therefore used to find the best method of generalization. The algorithms are often constrained by a value MaxSup, which is the maximum percentage of records in the dataset that can be suppressed. For example, if MaxSup is set to 5%, then the generalization algorithm will ignore all possible generalizations that will result in more than 5% of the records being flagged for suppression. This will also guarantee that no more than 5% of the records will have any suppression in them.

Generalization is an optimization problem whereby the algorithm tries to find the optimal generalization for each of the quasi-identifiers that will ensure that the probability of re-identification is at or below the required threshold, the percentage of records flagged for suppression is below MaxSup, and information loss is minimized.

Information loss is used to measure the amount of distortion to the data. A simple measure of information loss is how high up the hierarchy the chosen generalization level is. However, this creates difficulties of interpretation, and other more theoretically grounded metrics that take into account the difference in the level of precision between the original dataset and the generalized data have been suggested [5].
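The optimization can be sketched as an exhaustive search over generalization levels. This is a toy illustration under several assumptions: the hierarchies, field names, and data are hypothetical, risk is taken as one over the equivalence class size, and information loss is the simple "height in the hierarchy" measure mentioned above. Practical algorithms avoid enumerating every combination.

```python
from collections import Counter
from itertools import product

# Hypothetical hierarchies: index = generalization level, value = generalizing function.
HIERARCHIES = {
    "birth_year": [lambda y: y, lambda y: y - y % 5, lambda y: y - y % 10, lambda y: "*"],
    "postal": [lambda p: p, lambda p: p[:3], lambda p: p[:1], lambda p: "*"],
}

def flagged_fraction(records, levels, threshold):
    """Fraction of records whose risk (1 / class size) exceeds the threshold."""
    keys = [tuple(HIERARCHIES[q][lvl](r[q]) for q, lvl in levels.items()) for r in records]
    sizes = Counter(keys)
    return sum(1.0 / sizes[k] > threshold for k in keys) / len(keys)

def best_generalization(records, threshold=0.2, max_sup=0.05):
    """Return (loss, levels, flagged fraction) with the lowest loss within MaxSup."""
    names = list(HIERARCHIES)
    best = None
    for combo in product(*(range(len(HIERARCHIES[q])) for q in names)):
        levels = dict(zip(names, combo))
        flagged = flagged_fraction(records, levels, threshold)
        if flagged > max_sup:
            continue  # would require suppressing more records than MaxSup allows
        loss = sum(combo)  # simple loss: total height in the hierarchies
        if best is None or loss < best[0]:
            best = (loss, levels, flagged)
    return best

# Hypothetical cohort: ten records, all unique on the raw quasi-identifiers.
cohort = [
    {"birth_year": 1980 + i % 5, "postal": "K1H8L1" if i < 5 else "K1H8M5"}
    for i in range(10)
]
result = best_generalization(cohort)
```

For this toy cohort the search settles on generalizing year of birth to a 5 year interval and leaving the postal code intact, which yields two classes of five records and no records flagged for suppression.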


Suppression

Suppression means the removal of values from the data. It is usually applied to the specific records that have been flagged for suppression. There are three general approaches to suppression: casewise deletion, quasi-identifier removal, and local cell suppression.

Casewise deletion removes the whole patient or visit record from the dataset. This results in the most distortion to the data because the sensitive variables are also removed even though those do not contribute to an increase in the risk of identity disclosure.

Quasi-identifier removal removes only the values about the quasi-identifiers in the dataset. This has the advantage that all of the sensitive information is retained.

Local cell suppression is an improvement over quasi-identifier removal in that fewer values are suppressed. Local cell suppression applies an optimization algorithm to find the smallest number of quasi-identifier values to suppress [78]. All of the sensitive variables are retained and in practice considerably fewer of the quasi-identifier values are suppressed than with casewise deletion or quasi-identifier removal.
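The first two approaches can be contrasted in a short sketch. The records, field names, and the set of flagged record positions are hypothetical. Local cell suppression is not shown because it depends on an optimization algorithm that is not specified here.

```python
def casewise_deletion(records, flagged):
    """Drop every flagged record entirely, sensitive values included."""
    return [r for i, r in enumerate(records) if i not in flagged]

def quasi_identifier_removal(records, flagged, quasi_identifiers):
    """Blank only the quasi-identifiers of flagged records; keep sensitive values."""
    result = []
    for i, record in enumerate(records):
        copy = dict(record)
        if i in flagged:
            for q in quasi_identifiers:
                copy[q] = None
        result.append(copy)
    return result

# Hypothetical data: the second record has been flagged as high risk.
patients = [
    {"birth_year": 1980, "postal": "K1H", "lab_result": 5.4},
    {"birth_year": 1931, "postal": "P0T", "lab_result": 7.1},
]
flagged = {1}
quasi = ["birth_year", "postal"]

deleted = casewise_deletion(patients, flagged)
blanked = quasi_identifier_removal(patients, flagged, quasi)
```

Casewise deletion leaves one record, whereas quasi-identifier removal keeps both laboratory results and blanks two cells.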

Available tools

Recent reports have provided summaries of free and supported commercial tools for the de-identification of structured clinical and administrative datasets [79,80]. Also, various text de-identification tools have recently been reviewed [81], although many of these tools are experimental and may not all be readily available. Tools for the de-identification of genomic data are mostly at the research stage and their general availability and level of support is unknown.


Conclusions

Genomic research is increasingly using clinically relevant data from electronic health records. Research ethics boards will often require patient consent when patient information is used for secondary purposes, unless that information is de-identified. I have described above the methods and challenges of de-identifying data when disclosed for such research.

Combined genomic and clinical data can be quite complex, with free form textual or structured representations, as well as clinical data that are cross-sectional or longitudinal, and relational or transactional. I have described current de-identification practices in two genomic research projects, i2b2 and BioVU, as well as more recent best practices for managing the risk of re-identification.

It is easiest to use prescriptive de-identification heuristics such as those in the HIPAA Privacy Rule Safe Harbor standard. However, such a standard provides insufficient protection for the complex datasets referred to here and may result in the disclosure of data with a high probability of re-identification. Even when augmented with data sharing agreements, these agreements may be based on the inaccurate assumption that the data have a low probability of re-identification. Furthermore, notices to patients and consent forms often state that the data will be de-identified when disclosed. Disclosure practices that are based on the actual measurement of the probability of re-identification allow data custodians to better manage their legal obligations and commitments to patients.

Moving forward, several areas will require further research to minimize risks of re-identification of data used for genomic research. For example, improved methods for the de-identification of genome sequences or genomic data are needed. Sequence de-identification methods that rely on generalization that have been proposed thus far will likely result in significant distortions to large datasets [82]. There is also evidence that the simple suppression of the sequence for specific genes can be undone relatively accurately [83]. In addition, the re-identification risks to family members have not been considered here. Although various re-identification attacks have been highlighted [84-86], adequate familial de-identification methods have yet to be developed.


Abbreviations

EHR: electronic health record; HIPAA: Health Insurance Portability and Accountability Act; SNP: single nucleotide polymorphism; TRIS: translational research information system.

Competing interests

The author declares that he has no competing interests.

Box 1

The 18 elements in the HIPAA Privacy Rule Safe Harbor standard that must be excluded/removed from a dataset

The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:

1. Names;

2. All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:

a) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and

b) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.

3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;

4. Telephone numbers;

5. Fax numbers;

6. Electronic mail addresses;

7. Social security numbers;

8. Medical record numbers;

9. Health plan beneficiary numbers;

10. Account numbers;

11. Certificate/license numbers;

12. Vehicle identifiers and serial numbers, including license plate numbers;

13. Device identifiers and serial numbers;

14. Web Universal Resource Locators (URLs);

15. Internet Protocol (IP) address numbers;

16. Biometric identifiers, including finger and voice prints;

17. Full face photographic images and any comparable images; and

18. Any other unique identifying number, characteristic, or code.

Adapted from [87]
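The zip code and date rules in items 2 and 3 of Box 1 are mechanical enough to sketch in code. The following is a minimal illustration only, not a compliance tool: the function names and the `ZIP3_POPULATION` lookup table are hypothetical, and real population counts would have to come from current Census Bureau data. It also does not handle the requirement that date elements (including year) indicative of an age over 89 be removed.

```python
from datetime import date

# Hypothetical population counts per three-digit zip prefix (illustrative values only).
ZIP3_POPULATION = {"100": 1_500_000, "036": 12_000}


def safe_harbor_zip(zip_code: str, zip3_population: dict) -> str:
    """Keep the first three digits only if that area has more than 20,000 people."""
    prefix = zip_code[:3]
    if zip3_population.get(prefix, 0) > 20_000:
        return prefix
    # Areas with 20,000 or fewer people (or unknown areas) are changed to 000.
    return "000"


def safe_harbor_year(d: date) -> int:
    """Drop every element of a date except the year (item 3)."""
    return d.year


def safe_harbor_age(age: int) -> str:
    """Aggregate all ages over 89 into a single '90+' category (item 3)."""
    return "90+" if age >= 90 else str(age)


generalized = (
    safe_harbor_zip("10027", ZIP3_POPULATION),
    safe_harbor_zip("03601", ZIP3_POPULATION),
    safe_harbor_year(date(2010, 5, 17)),
    safe_harbor_age(93),
)
```

Treating an unknown prefix as a small area is a conservative choice made for this sketch; the rule itself does not say how to handle missing census data.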

Supplementary Material

Additional file 1:

Measuring the probability of re-identification. This file describes metrics and decision rules for measuring and interpreting the probability of re-identification for identity disclosure.



Acknowledgements

The analyses performed on the California state inpatient database and the birth registry of Ontario were part of studies approved by the research ethics board of the Children's Hospital of Eastern Ontario Research Institute. Bradley Malin (Vanderbilt University) reviewed some parts of the draft manuscript, and Elizabeth Jonker (CHEO Research Institute) assisted with the formatting of the manuscript.


References

  • Prokosch H, Ganslandt T. Perspectives for medical informatics. Reusing the electronic medical record for clinical research. Methods Inf Med. 2009;48:38–44.[PubMed]
  • Tannen R, Weiner M, Xie D. Use of primary care electronic medical record database in drug efficacy research on cardiovascular outcomes: Comparison of database and randomized controlled trial findings. BMJ. 2009;338:b81. doi: 10.1136/bmj.b81.[PMC free article][PubMed][Cross Ref]
  • McCarty C, Chisholm R, Chute C, Kullo I, Jarvik G, Larson E, Li R, Masys D, Ritchie M, Roden D, Struewing JP, Wolf WA. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;4:13. doi: 10.1186/1755-8794-4-13.[PMC free article][PubMed][Cross Ref]
  • Ness R. Influence of the HIPAA privacy rule on health research. JAMA. 2007;298:2164–2170. doi: 10.1001/jama.298.18.2164.[PubMed][Cross Ref]
  • El Emam K, Dankar F, Issa R, Jonker E, Amyot D, Cogo E, Corriveau J-P, Walker M, Chowdhury S, Vaillancourt R, Roffey T, Bottomley J. A globally optimal k-anonymity method for the de-identification of health data. J Am Med Inform Assoc. 2009;16:670–682. doi: 10.1197/jamia.M3144.[PMC free article][PubMed][Cross Ref]
  • Kho M, Duffett M, Willison D, Cook D, Brouwers M. Written informed consent and selection bias in observational studies using medical records: systematic review. BMJ. 2009;338:b866. doi: 10.1136/bmj.b866.[PMC free article][PubMed][Cross Ref]
  • El Emam K, Jonker E, Fineberg A. The case for deidentifying personal health information. Social Sciences Research Network. 2011.
  • Harris ALAR, Teschke KE. Personal privacy and public health: potential impacts of privacy legislation on health research in Canada. Can J Public Health. 2008;99:293–296.[PubMed]
  • Kosseim P, Brady M. Policy by procrastination: secondary use of electronic health records for health research purposes. McGill J Law Health. 2008;2:5–45.
  • Lowrance W. Learning from experience: privacy and the secondary use of data in health research. J Health Serv Res Policy. 2003;8(Suppl 1):2–7.[PubMed]
  • Panel on Research Ethics. Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans. 2. 2010.
  • Willison D, Emerson C, Szala-Meneok K, Gibson E, Schwartz L, Weisbaum K. Access to medical records for research purposes: varying perceptions across Research Ethics Boards. J Med Ethics. 2008;34:308–314. doi: 10.1136/jme.2006.020032.[PubMed][Cross Ref]
  • Murphy S, Weber G, Mendis M, Gainer V, Chueh H, Churchill S, Kohane I. Serving the enterprise and beyond with informatics for integrating biology and the bedside. J Am Med Inform Assoc. 2010;17:124–130. doi: 10.1136/jamia.2009.000893.[PMC free article][PubMed][Cross Ref]
  • Deshmukh V, Meystre S, Mitchell J. Evaluating the informatics for integrating biology and the bedside system for clinical research. BMC Med Res Methodol. 2009;9:70. doi: 10.1186/1471-2288-9-70.[PMC free article][PubMed][Cross Ref]
  • Roden D, Pulley J, Basford M, Bernard G, Clayton E, Balser J, Masys D. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin Pharmacol Ther. 2008;84:362–369. doi: 10.1038/clpt.2008.89.[PMC free article][PubMed][Cross Ref]
  • Malin B, Karp D, Scheuermann R. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Investig Med. 2010;58:11–18.[PMC free article][PubMed]
  • The Supreme Court of the State of Illinois. Southern Illinoisan vs. The Illinois Department of Public Health. Docket No. 98712. 2006.
  • Hansell S. AOL removes search data on group of web users. New York Times; 2006.
  • Barbaro M, Zeller T Jr. A face is exposed for AOL searcher No. 4417749. New York Times; 2006.
  • Zeller T Jr. AOL moves to increase privacy on search queries. New York Times; 2006.
  • Ochoa S, Rasmussen J, Robson C, Salib M. Reidentification of individuals in Chicago's homicide database: A technical and legal study. 2001. Archived at
  • Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. Proceedings of the 2008 IEEE Symposium on Security and Privacy. 2008. pp. 111–125.
  • Sweeney L. PhD thesis. Massachusetts Institute of Technology, Electrical Engineering and Computer Science department; 2001. Computational disclosure control: A primer on data privacy protection.
  • Appellate Court of Illinois - Fifth District. The Southern Illinoisan v. Department of Public Health. 2004.
  • Federal Court (Canada) Mike Gordon vs. The Minister of Health: Affidavit of Bill Wilson. Court File No. T-347-06. 2006.
  • El Emam K, Kosseim P. Privacy interests in prescription records, part 2: patient privacy. IEEE Security Privacy. 2009;7:75–78.
  • Lowrance W, Collins F. Ethics. Identifiability in genomic research. Science. 2007;317:600–602. doi: 10.1126/science.1147699.[PubMed][Cross Ref]
  • Malin B, Sweeney L. Determining the identifiability of DNA database entries. Proc AMIA Symp. 2000;2000:537–541.[PMC free article][PubMed]
  • Wjst M. Caught you: threats to confidentiality due to the public release of large-scale genetic data sets. BMC Med Ethics. 2010;11:21. doi: 10.1186/1472-6939-11-21.[PMC free article][PubMed][Cross Ref]
  • Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc. 2007;14:550–563. doi: 10.1197/jamia.M2444.[PMC free article][PubMed][Cross Ref]
  • El Emam K. Data anonymization practices in clinical research: a descriptive study. Health Canada, Access to Information and Privacy Division; 2006.
  • Canadian Medical Association (CMA) Holdings Incorporated. Deidentification/Anonymization Policy. Ottawa CMA Holdings; 2009.
  • UBC Clinical Research Ethics Board, Providence Health Care Research Ethics Board. Interim Guidance to Clinical Researchers Regarding Compliance with the US Health Insurance Portability and Accountability Act (HIPAA) Vancouver: University of British Columbia; 2003.
  • Hrynaszkiewicz I, Norton M, Vickers A, Altman D. Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers. BMJ. 2010;340:c181. doi: 10.1136/bmj.c181.[PMC free article][PubMed][Cross Ref]
  • Clause S, Triller D, Bornhorst C, Hamilton R, Cosler L. Conforming to HIPAA regulations and compilation of research data. Am J Health Syst Pharm. 2004;61:1025–1031.[PubMed]
  • Lin Z, Owen A, Altman R. Genomic research and human subject privacy. Science. 2004;305:183. doi: 10.1126/science.1095019.[PubMed][Cross Ref]
  • Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson J, Stephan D, Nelson S, Craig D. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4:e1000167. doi: 10.1371/journal.pgen.1000167.[PMC free article][PubMed][Cross Ref]
  • Jacobs K, Yeager M, Wacholder S, Craig D, Kraft P, Hunter D, Paschal J, Manolio T, Tucker M, Hoover R, Thomas GD, Chanock SJ, Chatterjee N. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nat Genet. 2009;41:1253–1257. doi: 10.1038/ng.455.[PMC free article][PubMed][Cross Ref]
  • Loukides G, Denny J, Malin B. The disclosure of diagnosis codes can breach research participants' privacy. J Am Med Inform Assoc. 2010;17:322–327.[PMC free article][PubMed]
  • Willenborg L, de Waal T. Statistical Disclosure Control in Practice. New York: Springer-Verlag; 1996.
  • Willenborg L, de Waal T. Elements of Statistical Disclosure Control. New York: Springer-Verlag; 2001.
  • Skinner CJ. On identification disclosure and prediction disclosure for microdata. Statistica Neerlandica. 1992;46:21–32. doi: 10.1111/j.1467-9574.1992.tb01324.x.[Cross Ref]
  • Marsh C, Skinner C, Arber S, Penhale B, Openshaw S, Hobcraft J, Lievesley D, Walford N. The case for samples of anonymized records from the 1991 census. J R Stat Soc A (Statistics in Society) 1991;154:305–340. doi: 10.2307/2983043.[PubMed][Cross Ref]
  • Dale A, Elliot M. Proposals for 2001 samples of anonymized records: an assessment of disclosure risk. J R Stat Soc A (Statistics in Society) 2001;164:427–447. doi: 10.1111/1467-985X.00212.[Cross Ref]
  • Flora Felso JT, Wagner GG. In: Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies. Doyle P, Lane J, Theeuwes J, Zayatz L, editor. Vol. 1. Washington, DC: Elsevier; 2003. Disclosure limitation methods in use: results of a survey. pp. 17–38.
  • Jabine T. Statistical disclosure limitation practices of United States statistical agencies. J Official Stat. 1993;9:427–454.
  • Pulley J, Brace M, Bernard G, Masys D. Evaluation of the effectiveness of posters to provide information to patients about a DNA database and their opportunity to opt out. Cell Tissue Banking. 2007;8:233–241. doi: 10.1007/s10561-007-9033-4.[PubMed][Cross Ref]
  • El Emam K. Risk-based de-identification of health data. IEEE Security Privacy. 2010;8:64–67.
  • Subcommittee on Disclosure Limitation Methodology - Federal Committee on Statistical Methodology. Working paper 22: Report on statistical disclosure control. Statistical Policy Office, Office of Information and Regulatory Affairs, Office of Management and Budget; 1994.
  • Manitoba Center for Health Policy. Manitoba Center for Health Policy Privacy code. 2002.
  • Cancer Care Ontario. Cancer Care Ontario Data Use and Disclosure Policy. 2005, updated 2008.
  • Health Quality Council. Security and Confidentiality Policies and Procedures. Saskatoon Health Quality Council; 2004.
  • Health Quality Council. Privacy code. Saskatoon Health Quality Council; 2004.
  • Statistics Canada. Therapeutic abortion survey. 2007. Archived at
  • Office of the Information and Privacy Commissioner of British Columbia: Order No. 261-1998. 1998.
  • Office of the Information and Privacy Commissioner of Ontario. Order P-644. 1994. Archived at
  • Alexander L, Jabine T. Access to social security microdata files for research and statistical purposes. Social Security Bulletin. 1978;41:3–17.[PubMed]
  • Ministry of Health and Long Term care (Ontario) Corporate Policy 3-1-21. 1984. [Available on request]
  • Duncan G, Jabine T, de Wolf S. Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Washington DC: National Academies Press; 1993.
  • de Waal A, Willenborg L. A view on statistical disclosure control for microdata. Survey Methodol. 1996;22:95–103.
  • Office of the Privacy Commissioner of Quebec (CAI) Chenard v. Ministere de l'agriculture, des pecheries et de l'alimentation (141). CAI 141. 1997. [Available on request]
  • National Center for Education Statistics. NCES Statistical Standards. Washington DC US Department of Education; 2003.
  • National Committee on Vital and Health Statistics. Report to the Secretary of the US Department of Health and Human Services on Enhanced Protections for Uses of Health Data: A Stewardship Framework for "Secondary Uses" of Electronically Collected and Transmitted Health Data. V.101907(15) 2007.
  • Sweeney L. Data sharing under HIPAA: 12 years later. Workshop on the HIPAA Privacy Rule's De-Identification Standard; 2010.
  • Lafky D. The Safe Harbor method of de-identification: an empirical test. Fourth National HIPAA Summit West; 2010. Archived at
  • Jabine T. Procedures for restricted data access. J Official Stat. 1993;9:537–589.
  • El Emam K, Brown A, AbdelMalik P, Neisa A, Walker M, Bottomley J, Roffey T. A method for managing re-identification risk from small geographic areas in Canada. BMC Med Inform Decis Mak. 2010;10:18. doi: 10.1186/1472-6947-10-18.[PMC free article][PubMed][Cross Ref]
  • El Emam K, Dankar F, Vaillancourt R, Roffey T, Lysyk M. Evaluating patient re-identification risk from hospital prescription records. Can J Hospital Pharmacy. 2009;62:307–319.[PMC free article][PubMed]
  • Murphy S, Chueh H. A security architecture for query tools used to access large biomedical databases. Proc AMIA Symp. 2002. pp. 552–556.[PMC free article][PubMed]
  • Kargupta H, Datta S, Wang Q, Sivakumar K. Random data perturbation techniques and privacy preserving data mining. Knowledge Information Systems. 2005;7:387–414. doi: 10.1007/s10115-004-0173-6.[Cross Ref]
  • Malin B, Cassa C, Kantarcioglu M. In: Privacy-Preserving Knowledge Discovery. Bonchi F, Ferrari E, editor. New York: Chapman & Hall/CRC Press; 2011. A survey of challenges and solutions for privacy in clinical genomics data mining.
  • El Emam K, Fineberg A. An overview of techniques for de-identifying personal health information. Access to Information and Privacy Division of Health Canada; 2009.
  • Tu K, Klein-Geltink J, Mitiku T, Mihai C, Martin J. De-identification of primary care electronic medical records free-text data in Ontario, Canada. BMC Med Inform Decis Mak. 2010;10:35. doi: 10.1186/1472-6947-10-35.[PMC free article][PubMed][Cross Ref]
  • El Emam K, Jonker E, Sams S, Neri E, Neisa A, Gao T, Chowdhury S.



Very detailed health information about participants is collected during clinical trials. A number of different stakeholders would typically have access to individual-level participant data (IPD), including the study sites, the sponsor of the study, statisticians, Institutional Review Boards (IRBs), and regulators. By IPD we mean individual-level data on trial participants, which is more than the information that is typically included, for example, in clinical study reports (CSRs).

There is increasing pressure to share IPD more broadly than occurs at present. There are many reasons for such sharing, including greater transparency about the trial, wider disclosure of adverse events that may have transpired, and the reuse of such data for secondary purposes, specifically in the context of health research (Gøtzsche, 2011; IOM, 2013; Vallance and Chalmers, 2013). Many funding agencies tasked with the oversight of research, as well as its funding, are requiring that data collected by the projects they support be made available to others (MRC, 2011; NIH, 2003; The Wellcome Trust, 2011). Regulators such as the European Medicines Agency (EMA) (2014a,b) are currently examining how IPD from clinical trials can be shared more widely (IOM, 2013). In many cases, however, privacy concerns have been stated as a key obstacle to making these data available (Castellani, 2013; IOM, 2013).

One way in which privacy issues can be addressed is through the protection of the identities of the corresponding research participants. Such “de-identified” or “anonymized” health data (the former term being popular in North America, and the latter in Europe and other regions) are often considered to be sufficiently devoid of personal health information in many jurisdictions around the world. Many privacy laws therefore allow the data to be used and disclosed for any secondary purposes without participant consent. As long as the data are appropriately de-identified, many privacy concerns associated with data sharing can be readily addressed.

It should be recognized that de-identification is not, by any means, the only privacy concern that needs to be addressed when sharing clinical trial data. In fact, there must be a level of governance in place to ensure that the data will not be analyzed or used to discriminate against or stigmatize the participants or certain groups (e.g., religious or ethnic) associated with the study. This is because discrimination and stigmatization can occur even if the data are de-identified.

This paper describes a high-level risk-based methodology that can be followed to de-identify clinical trial IPD. To contextualize our review and analysis of de-identification, we also touch upon additional governance mechanisms, but we acknowledge that a complete treatment of governance is beyond the scope of this paper. Rather, the primary focus here is only on the privacy protective elements.

Data Recipients, Sponsors, and Adversaries

Clinical trial data may be disclosed by making them completely public or through a request mechanism. The data recipient may be a qualified investigator (QI) who must meet specific criteria. There may be other data recipients who are not QIs as well. If the data are made publicly available with no restrictions, however, then other types of users may access the data, such as journalists and nongovernmental organizations (NGOs). In our discussions we refer to the data recipient as the QI as a primary exemplar, although this is not intended to exclude other possible data recipients (it does make the presentation less verbose).

Data are being disclosed to the QI by the sponsor. We use the term “sponsor” generally to refer to all data custodians who are disclosing IPD, recognizing that the term may mean different entities depending on the context. It may not always be the case that the sponsor is a pharmaceutical company or a medical device company. For example, a regulator may decide to disclose the data to a QI, or a pharmaceutical company may provide the data to an academic institution, whereupon that institution becomes the entity that discloses the data.

The term “adversary” is often used in the disclosure control literature to refer to the role of the individual or entity that is trying to re-identify data subjects. Other terms used are “attacker” and “intruder.” Discussions about the QI being a potential adversary are not intended to paint QIs as having malicious objectives. Rather, in the context of a risk assessment, one must consider a number of possible data recipients as being potential adversaries and manage the re-identification risk accordingly.

Data Sharing Models

A number of different ways to provide access to IPD have been proposed and used, each with different advantages and risks (Mello et al., 2013). First, there is the traditional public data release where anyone can get access to the data with no registration or conditions. Examples of such releases include the publicly available clinical trial data from the International Stroke Trial (IST) (Sandercock et al., 2011) and data posted to the Dryad online open access data repository (Dryad, undated; Haggie, 2013).

A second form of data sharing, which is more restrictive, occurs when there exists a formal request and approval process to obtain access to clinical trial data, such as the GlaxoSmithKline (GSK) trials repository (Harrison, 2012; Nisen and Rockhold, 2013); Project Data Sphere (whose focus is on oncology trial data) (Bhattacharjee, 2012; Hede, 2013); the Yale University Open Data Access (YODA) Project, which is initially making trial data from Medtronic available (CORE, 2014; Krumholz and Ross, 2011); and the Immunology Database and Analysis Portal (ImmPort, n.d.), which is restricted to researchers funded by the Division of Allergy, Immunology, and Transplantation of the National Institute of Allergy and Infectious Diseases (DAIT/NIAID), other approved life science researchers, National Institutes of Health employees, and other preauthorized government employees (ImmPort, n.d.). More recently, pharmaceutical companies have created a website that facilitates data requests to multiple companies under one portal. Following this restrictive model, a request can be processed by the study sponsor or by a delegate of the sponsor (e.g., an academic institution).

A hybrid of the above approaches is a quasi-public release, in which the data user must agree to some terms of use or sign a “click-through” contract. Click-through contracts are online terms of use that may place restrictions on what can be done with the data and how the data are handled. Regardless, anyone can still download such data. For example, public analytics competition data sets, such as the Heritage Health Prize (El Emam et al., 2012), and data-centric software application development competitions, such as the Cajun Code Fest (Center for Business and Information Technologies, 2013), fall into this category. In practice, however, click-through terms are not common for the sharing of clinical trial IPD.

A form of data access that does not require any data sharing occurs when analysts request that the data controller perform an analysis on their behalf. Because this does not involve the sharing of IPD, it is a scenario that we do not consider further in this paper.

Data Sharing Mechanisms

Different mechanisms can be used to share IPD. Clinical trial IPD can be shared either as microdata or through an online portal. The term “microdata” is commonly used in the disclosure control literature to refer to individual-level raw data (Willenborg and de Waal, 1996, 2001). These microdata may be in the form of one or more flat files or relational databases.

When disclosed as microdata, the data are downloaded as a raw data file that can be analyzed by QIs on their own machines, using their own software if they wish to do so. The microdata can be downloaded through a website, sent to the QI on a disc, or transferred electronically. If access is through a website, the QI may have to register, sign a contract, or go through other steps before downloading the data.

When a portal is used, the QI can access the data only through a remote computer interface, such that the raw data reside on the sponsor's computers and all analysis is performed on the sponsor's computers. Data users do not download any microdata to their own local computers through this portal. Under this model, all actions can be audited.

A public online portal allows anyone to register and get access to the IPD. Otherwise, the access mechanism requires a formal request process.

De-identification is relevant in both of the aforementioned scenarios. When data are provided as microdata, the de-identification process ensures that each record is protected from the QI and his/her staff as the potential adversary. When data are shared through the portal, a QI or his/her staff may inadvertently recognize a data subject because that data subject is a neighbor, relative, coworker, or famous person (see Box B-1).


Box B-1. Types of Re-identification Attacks. For public data, the sponsor needs to make a worst-case assumption and protect against an adversary who is targeting the data subjects with the highest risk of re-identification. For a nonpublic data set, we consider ...

The different approaches for sharing clinical trial IPD are summarized in Figure B-1.


Figure B-1. Different approaches for sharing clinical trial data. NOTE: QI = qualified investigator.

Scope of Data to Be De-identified

It is important to make a distinction between biological, and particularly genomic, data and other types of data. Many clinical trials are creating biorepositories. These may have a pseudonym or other unique identifier for the participant, and a sample of data. The de-identification methods we describe in this paper are applicable to clinical, administrative, and survey data. Genomic data raise a different set of issues. These issues are addressed directly in a later section of this paper.

Clinical trial data can be shared at multiple levels of detail. For example, the data can be raw source data or analysis-ready data. We assume that the data are analysis-ready and that no data cleansing is required before de-identification.

Existing Standards for De-identification

Various regulations associated with data protection around the world permit the sharing of de-identified (or similarly termed) data. For instance, European Union (EU) Data Protection Directive 95/46/EC, which strictly prohibits secondary uses of person-specific data without individual consent, provides an exception to the ruling in Recital 26, which states that the “principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable.” However, what does it mean for data to be “identifiable”? How do we know when they are no longer identifiable? The Data Protection Directive, and similar directives around the world, do not provide explicit guidelines regarding how data should be protected. An exception to this rule is a code of practice document published by the U.K. Information Commissioner's Office (ICO) (2012). Although this document provides examples of de-identification methods and issues to consider when assessing the level of identifiability of data, it does not provide a full methodology or specific standards to follow.

There are, however, de-identification standards provided in the Privacy Rule of the U.S. Health Insurance Portability and Accountability Act of 1996 (HIPAA) and subsequent guidance published by the Office for Civil Rights (OCR) at the U.S. Department of Health and Human Services (HHS) (HHS, 2012). This rule is referred to by many regulatory frameworks around the world, and the principles are strongly related to those set forth in the United Kingdom's code of practice document mentioned above.

Two of the key existing standards for the de-identification of health microdata are described in the HIPAA Privacy Rule. It should be recognized that HIPAA applies only to “covered entities” (i.e., health plans, health care clearinghouses, and health care providers that transmit health information electronically) in the United States. It is likely that in many instances the sponsors of clinical trials will not fall into this class. However, these de-identification standards have been in place for approximately a decade, and there is therefore a considerable amount of real-world experience in their application. They can serve as a good launching point for examining best practices in this area. For the disclosure of clinical trial data, the HIPAA Privacy Rule de-identification standards offer a practically defensible foundation even if they are not a regulatory requirement.

According to section 164.514 of the HIPAA Privacy Rule, “health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.” Section 164.514(b) of the Privacy Rule contains the implementation specifications that a covered entity, or affiliated business associate, must follow to meet the de-identification standard. In particular, the Privacy Rule outlines two routes by which health data can be designated as de-identified. These are illustrated in Figure B-2.


Figure B-2. The two de-identification standards in the HIPAA Privacy Rule. SOURCE: Reprinted from a document produced by OCR (HHS, 2012).

The first route is the “Safe Harbor” method. Safe Harbor requires the manipulation of 18 fields in the data set as described in Box B-2. The Privacy Rule requires that a number of these data elements be “removed.” However, there may be acceptable alternatives to actual removal of values as long as the risk of reverse engineering the original values is very small. Compliance with the Safe Harbor standard also requires that the sponsor not have any actual knowledge that a data subject can be re-identified. Assumptions of the Safe Harbor method are listed in Box B-3.


Box B-2. The Safe Harbor De-identification Standard. Names; All geographic subdivisions smaller than a state, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according ...


Box B-3. Assumptions of the HIPAA Safe Harbor Method. There are only two quasi-identifiers that need to be manipulated in a data set: dates and zip codes. The adversary does not know who is in the data set (i.e., would not know which individuals participated in ...

The application of Safe Harbor is straightforward, but there clearly are instances in which dates and more fine-grained geographic information are necessary. In practice the Safe Harbor standard would remove critical geospatial and temporal information from the data (see items 2 and 3 in Box B-2), potentially reducing the utility of the data. Many meaningful analyses of clinical trial data sets require the dates and event order to be clear. For example, in a Safe Harbor data set, it would not be possible to include the dates when adverse events occurred.

In recognition of the limitations of de-identification via Safe Harbor, the HIPAA Privacy Rule provides for an alternative in the form of the Expert Determination method. This method has three general requirements:

  • The de-identification must be based on generally accepted statistical and scientific principles and methods for rendering information not individually identifiable. This means that the sponsor needs to ensure that there is a body of work that justifies and evaluates the methods that are used for the de-identification and that these methods must be generally known (i.e., undocumented methods or proprietary methods that have never been published would be difficult to classify as “generally accepted”).

  • The risk of re-identification needs to be very small, such that the information could not be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information. However, the mechanism for measuring re-identification risk is not defined in the HIPAA Privacy Rule, and what would be considered very small risk also is not defined. Therefore, the de-identification methodology must include some manner of measuring re-identification risk in a defensible way and have a repeatable process to follow that allows for the definition of very small risk.

  • Finally, the methods and results of the analysis that justify such determination must be documented. The basic principles of de-identification are expected to be consistent across all clinical trials, but the details will be different for each study, and these details also need to be documented.

These conditions are reasonable for a de-identification methodology and are consistent with the guidance that has been produced by other agencies and regulators (Canadian Institute for Health Information, 2010; ICO, 2012). They also serve as a set of conditions that must be met for the methods described here.
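Because the Privacy Rule defines neither the risk metric nor the threshold, any concrete measurement is a methodological choice. One common choice in the disclosure control literature, assumed here for illustration, is to group records that share the same quasi-identifier values into equivalence classes and assign each record in a class of size n a re-identification probability of 1/n. The function name, the sample records, and the `THRESHOLD` value below are all hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (maximum, average) probability of re-identification.

    Records with identical quasi-identifier values form an equivalence
    class; each record in a class of size n is assigned probability 1/n.
    """
    if not records:
        raise ValueError("at least one record is required")
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    # Worst case: the record(s) in the smallest equivalence class.
    max_risk = 1 / min(class_sizes.values())
    # Averaging 1/n over all records equals (number of classes) / (number of records).
    avg_risk = len(class_sizes) / len(keys)
    return max_risk, avg_risk


# Toy data set: three equivalence classes of sizes 2, 3, and 1.
records = [
    {"sex": "F", "birth_year": 1970, "zip3": "100"},
    {"sex": "F", "birth_year": 1970, "zip3": "100"},
    {"sex": "M", "birth_year": 1982, "zip3": "100"},
    {"sex": "M", "birth_year": 1982, "zip3": "100"},
    {"sex": "M", "birth_year": 1982, "zip3": "100"},
    {"sex": "F", "birth_year": 1955, "zip3": "606"},
]
max_risk, avg_risk = reidentification_risk(records, ["sex", "birth_year", "zip3"])

THRESHOLD = 0.2  # hypothetical definition of "very small" risk for this sketch
acceptable = max_risk <= THRESHOLD
```

In this toy data set the single record with a unique combination of values drives the maximum risk to 1, so the data would need further generalization or suppression before meeting any threshold below 1. Whether the maximum or the average is the appropriate quantity depends on the release context, which the documented analysis would need to justify.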

Unique and Derived Codes Under HIPAA

According to the 18th item in Safe Harbor (see Box B-2), “any unique identifying number, characteristic, or code” must be removed from the data set; otherwise it would be considered personal health information. However, in lieu of removing the value, it may be hashed or encrypted. This would be called a “pseudonym.” For example, the unique identifier may be a participant's clinical trial number, and this is encrypted with a secret key to create a pseudonym. A similar scheme for creating pseudonyms would be used under the Expert Determination method.
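A derived pseudonym of this kind can be sketched with a keyed hash. The example below uses HMAC-SHA256 from the Python standard library as one possible construction; this is an assumption made for illustration, since the text does not prescribe a specific algorithm, and the key value, identifier format, and function name are hypothetical.

```python
import hashlib
import hmac


def derive_pseudonym(participant_id: str, secret_key: bytes) -> str:
    """Derive a pseudonym from an identifier using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    # Truncated to 16 hex characters for readability; a longer value
    # lowers the chance of two participants colliding.
    return digest.hexdigest()[:16]


# Hypothetical key; in practice it would be generated randomly and held only by the sponsor.
key = b"example-key-held-only-by-the-sponsor"

first = derive_pseudonym("TRIAL-0001", key)
again = derive_pseudonym("TRIAL-0001", key)
other = derive_pseudonym("TRIAL-0002", key)
```

The same identifier and key always yield the same pseudonym, which is what allows records for one participant to be linked across data sets. Anyone holding the key can recompute the mapping, so the protection rests entirely on how tightly the key is controlled.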

However, the HIPAA Privacy Rule at § 164.514(c) states that any code derived from information about an individual is considered identifiable data. This is a practical problem, because derived pseudonyms are important for knowing which records belong to the same clinical trial participant and for constructing the longitudinal record of a data subject. Not being able to create derived pseudonyms means that random pseudonyms must be created. To be able to use random pseudonyms, one must maintain a crosswalk between the individual identity and the random pseudonym. The crosswalk allows the sponsor to use the same pseudonym for each participant across data sets and to allow re-identification at a future date if the need arises. These crosswalks, which are effectively linking tables between the pseudonym and the information about the individual, arguably present an elevated privacy risk because clearly identifiable information must now be stored somehow. Furthermore, the original regulations did not impose any controls on this crosswalk table.

For research purposes, the Common Rule will also apply. Under the Common Rule, which guides IRBs, if the data recipient has no means of getting the key, for example, through an agreement with the sponsor prohibiting the sharing of keys under any circumstances or through organizational policies prohibiting such an exchange, then creating such derived pseudonyms is an acceptable approach (HHS, 2004, 2008b).

Therefore, there is an inconsistency between the Privacy Rule and the Common Rule in that the former does not permit derived pseudonyms, while the latter does. This is well documented (Rothstein, 2005, 2010). However, in the recent guidelines from OCR, this is clarified to state that “a covered entity may disclose codes derived from PHI (protected health information) as part of a de-identified data set if an expert determines that the data meets the de-identification requirements at §164.514(b)(1)” (HHS, 2012, p. 22). This means that a derived code, such as one produced by an encryption or hash function, can be used as a pseudonym as long as there is assurance that the means to reverse that pseudonym are tightly controlled. There is now clarity and consistency among rules: if there is a defensible mechanism whereby reverse engineering a derived pseudonym has a very small probability of being successful, the derived pseudonym is permitted.
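As a rough illustration of a derived pseudonym, the sketch below applies a keyed hash (HMAC-SHA256 from the Python standard library) to a participant's trial number. The choice of HMAC, the function name `derive_pseudonym`, the key, and the trial numbers are all assumptions made for illustration; the source does not prescribe a particular algorithm, and any real scheme would need an expert determination and strict control of the key.

```python
import hashlib
import hmac


def derive_pseudonym(participant_id: str, secret_key: bytes) -> str:
    """Return a keyed-hash pseudonym for a participant identifier.

    The same identifier and key always give the same pseudonym, so records
    for one participant can still be linked across data sets.
    """
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key and trial numbers, for illustration only.
key = b"example-key-held-only-by-the-sponsor"
p1 = derive_pseudonym("TRIAL-0001", key)
p2 = derive_pseudonym("TRIAL-0001", key)
p3 = derive_pseudonym("TRIAL-0002", key)

print(p1 == p2)  # same participant, same pseudonym
print(p1 == p3)  # different participants, different pseudonyms
```

Because the pseudonym depends on the secret key, a recipient without the key cannot recompute it from a known trial number, which is the kind of control over reversal that the guidance describes.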

Is It Necessary to Destroy Original Data

Under the Expert Determination method, the re-identification risk needs to be managed assuming that the adversary is “an anticipated recipient” of the data. This limits the range of adversaries that needs to be considered because, in our context, the anticipated recipient is the QI.

However, under the EU Data Protection Directive, the adversary may be the “data controller or any other person.” The data controller is the sponsor or the QI receiving the de-identified data. There are a number of challenges with interpreting this at face value.

One practical issue is that the sponsor will, by definition, be able to re-identify the data because the sponsor will retain the original clinical trial data set. The Article 29 Working Party has proposed that, effectively, the sponsor needs to destroy or aggregate the original data to be able to claim that the data provided to the QI are truly de-identified (Article 29 Data Protection Working Party, 2014). This means that the data are not de-identified if there exists another data set that can re-identify it, even in the possession of another data controller. Therefore, because the identified data exist with the sponsor, the data provided to the QI cannot be considered de-identified. This is certainly not practical because the original data are required for legal reasons (e.g., clinical trial data need to be retained for an extended period of time whose duration depends on the jurisdiction). Such a requirement would discourage de-identification by sponsors and push them to share identifiable data, which arguably would increase the risk of re-identification for trial participants significantly.

In an earlier opinion the Article 29 Data Protection Working Party (2007) emphasized the importance of “likely reasonable” in the definition of identifiable information in the 95/46/EC Directive. In that case, if it is not likely reasonable that data recipients would be able to readily re-identify the anonymized data because they do not have access to the original data, those anonymized data would not be considered personal information. That would seem to be a more reasonable approach that is consistent with interpretations in other jurisdictions.

Is De-identification a Permitted Use

Retroactively obtaining participant consent to de-identify data and use them for secondary analysis may introduce bias in the data set (El Emam, 2013). If de-identification is a permitted use under the relevant regulations, then de-identification can proceed without seeking participant consent. Whether that is the case will depend on the prevailing jurisdiction.

Under HIPAA and extensions under the Health Information Technology for Economic and Clinical Health (HITECH) Act Omnibus Rule, de-identification is a permitted use by a covered entity. However, a business associate can de-identify a data set only if the business associate agreement explicitly allows for that. Silence on de-identification in a business associate agreement is interpreted as not permitting de-identification.

In other jurisdictions, such as Ontario, the legislation makes explicit that de-identification is a permitted use (Perun et al., 2005).


Terminology in this area is not always clear, and different authors and institutions use the same terms to mean different things or different terms to mean the same thing (Knoppers and Saginur, 2005). Here, we provide the terminology and definitions used in this paper.

The International Organization for Standardization (ISO) Technical Specification on the pseudonymization of health data defines relevant terminology for our purposes. The term “anonymization” is defined as a “process that removes the association between the identifying data set and the data subject” (ISO, 2008). This is consistent with current definitions of “identity disclosure,” which corresponds to assigning an identity to a data subject in a data set (OMB, 1994; Skinner, 1992). For example, an identity disclosure would transpire if the QI determined that the third record (ID = 3) in the example data set in Table B-1 belonged to Alice Brown. Thus, anonymization is the process of reducing the probability of identity disclosure to a very small value.


An Example of Data Used to Illustrate a Number of Concepts Referred to Throughout This Paper.

Arguably, the term “anonymization” would be the appropriate term to use here given its more global utilization. However, to remain consistent with the HIPAA Privacy Rule, we use the term “de-identification” in this paper.

Beyond identity disclosure, organizations (and privacy professionals) are, at times, concerned about “attribute disclosure” (OMB, 1994; Skinner, 1992). This occurs when a QI learns a sensitive attribute about a participant in the database with a sufficiently high probability, even if the QI does not know which specific record belongs to that patient (Machanavajjhala et al., 2007; Skinner, 1992). For example, in Table B-1, all males born in 1967 had a creatine kinase lab test. Assume that an adversary does not know which record belongs to Almond Zipf (who has record ID = 17; see Table B-2). However, because Almond is male and was born in 1967, the QI will discover something new about him: that he had a test often administered to individuals showing symptoms of a heart attack. All known re-identification attacks are identity disclosures and not attribute disclosures (El Emam et al., 2011a). Furthermore, privacy statutes and regulations in multiple jurisdictions, including the HIPAA Privacy Rule, the Ontario Personal Health Information Protection Act (PHIPA), and the EU Data Protection Directive, consider identity disclosure only in their definitions of personal health information. Although participants may consider certain types of attribute disclosure to be a privacy violation, it is not considered so when the objective is anonymization of the data set.


Identities of Participants from the Hypothetical Data Set.

Technical methods have been developed to modify the data to protect against attribute disclosure (Fung et al., 2010). However, these methods have rarely, if ever, been used in practice for the disclosure of health data. One possible reason for this is that they distort the data to such an extent that the data are no longer useful for analysis purposes. There are other, nontechnical approaches that are more appropriate for addressing the risks of attribute disclosure, and in the final section on governance we provide a description of how a sponsor can protect against attribute disclosure. Therefore, our focus in this paper is on identity disclosure.


We begin with some basic definitions that are critical for having a meaningful discussion about how re-identification works. Along the way, we address some of the controversies around de-identification that have appeared in the literature and the media.

Categories of Variables

It is useful to differentiate among the different types of variables in a clinical trial data set. The way the variables are handled during the de-identification process will depend on how they are categorized. We make a distinction among three types of variables (Samarati, 2001; Sweeney, 2002):

  • Directly identifying variables. Direct identifiers have two important characteristics: (1) one or more direct identifiers can be used to uniquely identify an individual, either by themselves or in combination with other readily available information; and (2) they often are not useful for data analysis purposes. Examples of directly identifying variables include names, email addresses, and telephone numbers of participants. It is uncommon to perform data analysis on clinical trial participant names and telephone numbers.

  • Indirectly identifying variables (quasi-identifiers). Quasi-identifiers are the variables about research participants in the data set that a QI can use, either individually or in combination, to re-identify a record. If an adversary does not have background knowledge of a variable, it cannot be a quasi-identifier. The means by which an adversary can obtain such background knowledge will determine which attacks on a data set are plausible. For example, the background knowledge may be available because the adversary knows a particular target individual in the disclosed clinical trial data set, an individual in the data set has a visible characteristic that is also described in the data set, or the background knowledge exists in a public or semipublic registry. Examples of quasi-identifiers include sex, date of birth or age, locations (such as postal codes, census geography, and information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality.

  • Other variables. These are the variables that are not really useful for determining an individual's identity. They may or may not be clinically relevant.

Individuals can be re-identified because of the directly identifying variables and the quasi-identifiers. Therefore, our focus is on these two types of variables.

Classifying Variables

An initial step in being able to reason about the identifiability of a clinical trial data set is to classify the variables into the above categories. We consider the process for doing so below.

Is It an Identifier

There are three conditions for a field to be considered an identifier (of either type). These conditions were informed by HHS's de-identification guidelines (HHS, 2012).


  • Replicable. The field values must be sufficiently stable over time so that the values will occur consistently in relation to the data subject. For example, the results of a patient's blood glucose level tests are unlikely to be replicable over time because they will vary quite a bit. If a field value is not replicable, it will be challenging for an adversary to use that information to re-identify an individual.

  • Distinguishable. The variable must have sufficient variability to distinguish among individuals in a data set. For example, in a data set of only breast cancer patients, the diagnosis code (at least at a high level) will have little variation. On the other hand, if a variable has considerable variation among the data subjects, it can distinguish among individuals more precisely. That diagnosis field will be quite distinguishable in a general insurance claims database.

  • Knowable. An adversary must know the identifiers about the data subject in order to re-identify them. If a variable is not knowable by an adversary, it cannot be used to launch a re-identification attack on the data.

When we say that a variable is knowable, it also means that the adversary has an identity attached to that information. For example, if an adversary has a zip code and a date of birth, as well as an identity associated with that information (such as a name), then both the zip code and date of birth are knowable.

Knowability will depend on whether an adversary is an acquaintance of a data subject. If the adversary is an acquaintance, such as a neighbor, coworker, relative, or friend, it can be assumed that certain things will be known. Things known by an acquaintance will be, for example, the subject's demographics (e.g., date of birth, gender, ethnicity, race, language spoken at home, place of birth, and visible physical characteristics). An acquaintance may also know some socioeconomic information, such as approximate years of education, approximate income, number of children, and type of dwelling.

A nonacquaintance will know things about a data subject in a number of different ways, in decreasing order of likelihood:

  • The information can be inferred from other knowable information or other variables that have been determined to be identifiers. For example, birth weight can often be inferred from weeks of gestation. If weeks of gestation are included in the database, birth weight can be determined with reasonable accuracy.

  • The information is publicly available. For example, the information is in a public registry, or it appears in a newspaper article (say, an article about an accident or a famous person). Information can also become public if self-revealed by individuals. Examples are information posted on social networking sites and broadcast email announcements (e.g., births). It should be noted that only information that many people would self-reveal should be considered an identifier. If there is a single example or a small number of examples of people who are revealing everything about their lives (e.g., a quantified-self enthusiast who is also an exhibitionist), this does not mean that this kind of information is an identifier for the majority of the population.

  • The information is in a semipublic registry. Access to these registries may require a nominal fee or application process.

  • The information can be purchased from commercial data brokers. Use of commercial databases is not inexpensive, so an adversary would need to have a strong motive to use such background information.

Some of these data sources can be assessed objectively (e.g., whether there is relevant public information). In other cases, the decision will be subjective and may vary over time.

A Suggested Process for Determining Whether a Variable Is an Identifier

A simple way to determine whether a variable is an identifier is to ask an expert, internal or external to the sponsor, to do so. There are other, more formal processes that can be used as well.

There are two general approaches to classifying variables. In one approach, two analysts who know the data and the data subject population classify the variables independently; then some measure of agreement is computed. A commonly used measure of agreement is Cohen's Kappa (Cohen, 1960). If this value is above 0.8, there is arguably general consensus, and the two analysts will meet to resolve the classifications on which they had disagreements. The results of this exercise are then retained as documentation.

If the Kappa value is less than 0.8, there is arguably little consensus. In such a case, it is recommended that a group of individuals at the sponsor site review the field classifications and reach a classification consensus. This consensus then needs to be documented, along with the process used to reach it. This process provides the data custodian with a defensible classification of variables.
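The agreement check described above can be sketched as follows. The function `cohens_kappa` implements the standard two-rater formula, kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance; the variable classifications assigned by the two analysts are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters who classified the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must classify the same non-empty item list")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical classifications of ten variables by two analysts.
analyst_a = ["direct", "quasi", "quasi", "other", "quasi",
             "other", "direct", "quasi", "other", "quasi"]
analyst_b = ["direct", "quasi", "quasi", "other", "quasi",
             "quasi", "direct", "quasi", "other", "quasi"]

kappa = cohens_kappa(analyst_a, analyst_b)
print(f"kappa = {kappa:.2f}; general consensus: {kappa > 0.8}")
```

In this made-up case the analysts disagree on one variable out of ten, which still leaves Kappa above the 0.8 cut-off, so they would meet only to resolve that one classification.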

Is It a Direct or Indirect Identifier

Once a variable has been determined to be an identifier, it is necessary to determine whether it is a direct or indirect (quasi-) identifier. If the field uniquely identifies an individual (e.g., a Social Security Number), it will be treated as a direct identifier. If it is not unique, the next question is whether it is likely to be used for data analysis. If so, it should be treated as a quasi-identifier. This is an important decision because the techniques often used to protect direct identifiers distort the data and their truthfulness significantly.

Is it possible to know which fields will be used for analysis at the time that de-identification is being applied? In many instances, an educated judgment can be made, for example, about potential outcome variables and confounders.

The overall decision rule for classifying variables is shown in Figure B-3.


Decision rule for classifying identifiers. SOURCE: Reprinted with permission from El Emam and colleagues, 2014.

How Is Re-identification Probability Measured

Measurement of re-identification risk is a topic that has received extensive study over multiple decades. We examine it at a conceptual level to illustrate key concepts. This discussion builds on the classification of variables described above.

The Risk of Re-identification for Direct Identifiers

We define risk as the probability of re-identifying a trial participant. In practice, we consider the risk of re-identification for direct identifiers to be 1. If a direct identifier does exist in a clinical trial data set, then by definition it will be considered to have a very high risk of re-identification.

Strictly speaking, the probability is not always 1. For example, consider the direct identifier “Last Name.” If a trial participant is named “Smith,” it is likely that there are other people in the trial named “Smith,” and this is even more likely in the community where that participant lives. However, assuming that the probability of re-identification is equal to 1 is a simplification that has little impact in practice, errs on the conservative side, and makes it possible to focus attention on the quasi-identifiers, which is where, in many instances, the most data utility lies.

Two methods can be applied to protect direct identifiers. The first is suppression, or removal of the variable. For example, when a clinical trial data set is disclosed, all of the names of the participants are stripped from the data set. The second method is to create a pseudonym (ISO, 2008). Pseudonymization is also sometimes called “coding” in the health research literature (Knoppers and Saginur, 2005). There are different schemes and technical methods for pseudonymization, such as single and double coding, reversible or irreversible pseudonyms, and encryption and hashing techniques. If executed well, pseudonymization ensures that the probability of re-identification is very small. There is no need to measure this probability on the data after suppression or pseudonymization because in almost all cases that value is going to be very small.

Quasi-identifiers, however, cannot be protected using such procedures. This is because the resulting data, in almost all cases, will not be useful for analytic purposes. Therefore, a different set of approaches is required for measuring and de-identifying quasi-identifiers.

The Risk of Re-identification for Quasi-identifiers

Equivalence Classes

All the records that share the same values on a set of quasi-identifiers are called an “equivalence class.” For example, consider the quasi-identifiers in Table B-1—sex and age. All the records in Table B-1 for males born in 1967 (i.e., records 10, 13, 14, 17, and 22) form an equivalence class. Equivalence class sizes for a data concept, such as age, potentially change during de-identification. For example, there may be five records for males born in 1967. When the precision of age is reduced to a 5-year interval, there are eight records for males born between 1965 and 1969 (i.e., records 2, 10, 11, 13, 14, 17, 22, and 27). In general, there is a trade-off between the level of detail provided for a data concept and the size of the corresponding equivalence classes, with more detail being associated with smaller equivalence classes.

The most common way to measure the probability of re-identification for a record in a data set is for the probability to be equal to 1 divided by the size of its equivalence class. For example, record number 14 is in an equivalence class of size five, and therefore its probability of re-identification is 0.2. Record number 27 is in an equivalence class of size one and therefore its probability of re-identification is equal to 1 divided by 1. Records that are in equivalence classes of size one are called “uniques.” In Table B-3, we have assigned the probability to each record in our example.
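The per-record calculation can be sketched in a few lines. The records below are hypothetical (they are not the rows of Table B-1), and the function name `record_risks` is introduced only for illustration; the sketch groups records by their quasi-identifier values and assigns each record a probability of 1 divided by the size of its equivalence class.

```python
from collections import Counter

# Hypothetical records; "sex" and "birth_year" play the role of quasi-identifiers.
records = [
    {"id": 1, "sex": "M", "birth_year": 1967},
    {"id": 2, "sex": "M", "birth_year": 1967},
    {"id": 3, "sex": "F", "birth_year": 1972},
    {"id": 4, "sex": "M", "birth_year": 1967},
    {"id": 5, "sex": "F", "birth_year": 1972},
    {"id": 6, "sex": "F", "birth_year": 1985},
    {"id": 7, "sex": "M", "birth_year": 1967},
    {"id": 8, "sex": "M", "birth_year": 1967},
]
QUASI_IDENTIFIERS = ("sex", "birth_year")


def record_risks(rows, quasi_identifiers):
    """Map each record id to 1 / (size of its equivalence class)."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    class_sizes = Counter(keys)
    return {row["id"]: 1 / class_sizes[key] for row, key in zip(rows, keys)}


risks = record_risks(records, QUASI_IDENTIFIERS)
uniques = [record_id for record_id, risk in risks.items() if risk == 1.0]

for record_id, risk in risks.items():
    print(record_id, round(risk, 2))
print("uniques:", uniques)
```

Here the five males born in 1967 each receive a probability of 0.2, the two females born in 1972 receive 0.5, and the single female born in 1985 is a unique.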


The Data Set in Table B-1 with the Probabilities of Re-identification per Record Added.

This probability applies in two attack scenarios: (1) the adversary knows someone in the real world and is trying to find the record that matches that individual, and (2) the adversary has selected a record in the data set and is trying to find the identity of that person in the real world. Both of these types of attacks on health data have occurred in practice, and therefore both perspectives are important to consider. An example of the former perspective is when an adversary gathers information from a newspaper and attempts to find the data subject in the data set. An example of the latter attack is when the adversary selects a record in the data set and tries to match it with a record in the voter registration list.

A key observation here is that the probability of re-identification is not based solely on the uniques in the data set. For example, record number 18 is not a unique, but it still has quite a high probability of re-identification. Therefore, it is recommended that the risk of re-identification be considered, and managed, for both uniques and non-uniques.

Maximum Risk

One way to measure the probability of re-identification for the entire data set is through the maximum risk, which corresponds to the maximum probability of re-identification across all records. From Table B-3, it can be seen that there is a unique record, such that the maximum risk is 1 for this data set.

Average Risk

The average risk corresponds to the average of the re-identification probabilities across all records in the data set. In the example of Table B-3, this amounts to 0.59. By definition, the average risk for a data set will be no greater than the maximum risk for the same data set.

Which Risk Metric to Use

As the data set is modified, the risk values may change. For example, consider Table B-4, in which year of birth has been generalized to decade of birth. The maximum risk is still 1, but the average risk has declined to 0.33. The average risk will be more sensitive than the maximum risk to modifications to the data.
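A small sketch of both data-set-level metrics, before and after generalizing year of birth to decade of birth, is given below. The (sex, year of birth) pairs are hypothetical and do not reproduce Tables B-1 or B-4, and the helper names are illustrative only.

```python
from collections import Counter

# Hypothetical (sex, year of birth) quasi-identifier values, one pair per record.
records = [("M", 1961), ("M", 1967), ("M", 1967),
           ("F", 1972), ("F", 1975), ("F", 1988)]


def per_record_risks(keys):
    """Probability of re-identification per record: 1 / equivalence class size."""
    sizes = Counter(keys)
    return [1 / sizes[key] for key in keys]


def maximum_risk(keys):
    return max(per_record_risks(keys))


def average_risk(keys):
    risks = per_record_risks(keys)
    return sum(risks) / len(risks)


def to_decade(record):
    """Generalize year of birth to decade of birth (e.g. 1967 becomes 1960)."""
    sex, year = record
    return (sex, year // 10 * 10)


original_keys = list(records)
generalized_keys = [to_decade(record) for record in records]

print(maximum_risk(original_keys), round(average_risk(original_keys), 2))
print(maximum_risk(generalized_keys), round(average_risk(generalized_keys), 2))
```

In this toy data the maximum risk stays at 1 after generalization because one record remains unique, while the average risk falls, which mirrors the observation that average risk is the more sensitive of the two metrics.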


The Data Set in Table B-1 After Year of Birth Has Been Generalized to Decade of Birth, with the Probabilities of Re-identification per Record Added.

Because the average risk is no greater than the maximum risk, the latter is generally used when a data set is going to be disclosed publicly (El Emam, 2013). This is because a dedicated adversary who is launching a demonstration attack against a publicly available data set will target the record(s) in the disclosed clinical trial data set with the maximum probability of re-identification. Therefore, it is prudent to protect against such an adversary by measuring and managing maximum risk.

The average risk, by comparison, is more suitable for nonpublic data disclosures. For nonpublic data disclosures, some form of data sharing agreement with prohibitions on re-identification can be expected. In this case, it can be assumed that any data subject may be targeted by the adversary.

As a general rule, it is undesirable to have unique records in the data set after de-identification. In the example of Table B-1, there are unique records both in the original data set and after year of birth has been changed to decade of birth (see Table B-4). For example, record 26 is unique in Table B-4. Unique records have a high risk of re-identification. Also, as a general rule, it is undesirable to have records with a probability of re-identification equal to 0.5 in the data set.

With average risk, one can have data sets with an acceptably small average risk but with unique records or records in equivalence classes of size 2. To avoid that situation, one can use the concept of “strict average risk.” Here, maximum risk is first evaluated to ensure that it is at or below 0.33. If that condition is met, average risk is computed. This two-step measure ensures that there are no uniques or doubles in the data set.

In the example data set in Table B-4, the strict average risk is 1. This is because the maximum risk is 1, so the first condition is not met. However, the data set in Table B-5 has a strict average risk of 0.33. Therefore, in practice, maximum risk or strict average risk would be used to measure re-identification risk.
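The two-step measure might be sketched as below. Two choices here are assumptions rather than part of the source: the first condition is checked on the smallest equivalence class size (at least 3, which is equivalent to a maximum risk of at most 1/3 and avoids comparing floating-point values with the rounded 0.33), and when that condition fails the function reports the maximum risk, which reproduces the value of 1 given for Table B-4. The example data are hypothetical.

```python
from collections import Counter


def strict_average_risk(keys, min_class_size=3):
    """Two-step risk measure over a list of quasi-identifier tuples.

    Step 1: require every equivalence class to hold at least
            `min_class_size` records (maximum risk <= 1 / min_class_size).
    Step 2: only if step 1 passes, report the average risk.
    Assumption: if step 1 fails, the maximum risk is reported instead.
    """
    if not keys:
        raise ValueError("the data set must contain at least one record")
    sizes = Counter(keys)
    smallest = min(sizes.values())
    if smallest < min_class_size:
        return 1 / smallest  # first condition not met
    per_record = [1 / sizes[key] for key in keys]
    return sum(per_record) / len(per_record)


# Hypothetical generalized data: one data set with a unique record, one without.
with_unique = [("M", 1960)] * 3 + [("F", 1970)] * 3 + [("F", 1980)]
no_small_classes = [("M", 1960)] * 3 + [("F", 1970)] * 3

print(strict_average_risk(with_unique))
print(round(strict_average_risk(no_small_classes), 2))
```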


The Generalized Data Set with No Uniques or Doubles.

Samples and Populations

The above examples are based on the premise that an adversary knows who is in the data set. Under those conditions, the manner in which the risk metrics have been demonstrated is correct. We call this a “closed” data set. There are situations in which this premise holds true. For instance, one such case occurs when the data set covers everyone in the population. A second case is when the data collection method itself discloses who is in the data set. Here are several examples in which the data collection method makes a data set closed:

  • If everyone attending a clinic is screened into a trial, an adversary who knows someone who attends the clinic will know that that individual is in the trial database.

  • A study of illicit drug use among youth requires parental consent, which means that parents will know if their child is in the study database.

  • The trial participants self-reveal that they are taking part in a particular trial, for example, on social networks or on online forums.

If it is not possible to know who is in the data set, the trial data set can be considered to be a sample from some population. We call this an “open” data set. Because the data set is a sample, there is some uncertainty about whether a person is in the data set or not. This uncertainty can reduce the probability of re-identification.

When the trial data set is treated as a sample, the maximum and average risk need to be estimated from the sample data. The reason is that in a sample context, the risk calculations depend on the equivalence class size in the population as well. Therefore, the population equivalence class sizes need to be estimated for the same records. Estimates are needed because in most cases, the sponsor will not have access to the population data.

There is a large body of work on these estimators in the disclosure control literature (e.g., Dankar et al., 2012; Skinner and Shlomo, 2008). A particularly challenging estimation problem is deciding whether a unique record in the sample is also a unique in the population. If a record is unique in the sample, it may be because the sampling fraction is so small that all records in the sample are uniques. Yet a record may be unique in the sample because it is also unique in the population.

Under these conditions, appropriate estimators need to be used to compute the maximum and average risk correctly. In general, when the data set is treated as a sample, the probability of re-identification will be no greater than the probability associated with situations in which the data set is not treated as a sample (i.e., the adversary knows who is in the data set).

Re-identification Risk of Participants with Rare Diseases

It is generally believed that clinical trials conducted on rare diseases will always have a high risk of re-identification. It is true that the risk of re-identification will, in general, be higher than that for nonrare diseases. However, it is not necessarily too high. If the data set is open with a small sampling fraction and one is using (strict) average risk, the risk of re-identification may be acceptably small. The exact risk value will need to be calculated on the actual data set to make that determination.

Taking Context into Account

Determining whether a data set is disclosed to the public or a more restricted group of recipients illustrates how context is critical. The type of recipient, for instance, determines which risk metric is more appropriate. However, this is only one aspect of the context surrounding a data set, and a more complete picture can be applied to make more accurate assessments of re-identification risk.

For a public data release, we assume that the adversary will launch a demonstration attack, and therefore it is necessary to manage maximum risk. There are no other controls that can be put in place. For a nonpublic data set, we consider three types of attacks that cover the universe of attacks: deliberate, inadvertent, and breach (El Emam, 2013; El Emam and Arbuckle, 2013).

A deliberate attack transpires when the adversary deliberately attempts to re-identify individuals in the data set. This may be a deliberate decision by the leadership of the data recipient (e.g., the QI decides to re-identify individuals in order to link to another data set) or by a rogue employee associated with the data recipient. The probability that this type of attack will be successful can be computed as follows:


$$\Pr(\text{re-id}, \text{attempt}) = \Pr(\text{attempt}) \times \Pr(\text{re-id} \mid \text{attempt})$$

where the term Pr(attempt) captures the probability that a deliberate attempt to re-identify the data will be made by the data recipient. The actual value for Pr(attempt) will depend on the security and privacy controls that the data recipient has in place and the contractual controls that are being imposed as part of the data sharing agreement. The second term, Pr(re-id | attempt), corresponds to the probability that the attack will be successful in the event that the recipient has chosen to commit the attack. This conditional can be measured from the actual data.

An inadvertent attack transpires when a data analyst working with the QI (or the QI himself/herself) inadvertently re-identifies someone in the data set. For instance, this could occur when the recipient is already aware of the identity of someone in the data set, such as a friend, relative, or, more generally, an acquaintance. The probability of successful re-identification in this situation can be computed as follows:


$$\Pr(\text{re-id}, \text{acquaintance}) = \Pr(\text{acquaintance}) \times \Pr(\text{re-id} \mid \text{acquaintance})$$

There are defensible ways to compute Pr(acquaintance) (El Emam, 2013), which evaluates the probability of an analyst's knowing someone in the data set. For example, if the trial is of a breast cancer treatment, then Pr(acquaintance) is the probability of the analyst's knowing someone who has breast cancer. The value for Pr(re-id | acquaintance) needs to be computed from the data. Box B-4 considers the question of whether it is always necessary to be concerned about the risk of inadvertent re-identification.


Box B-4. Is It Always Necessary to Be Concerned About the Risk of Inadvertent Re-identification? In the context of data release through an online portal, an argument can be made that the sponsor imposes significant security and privacy controls and requires …

A breach attack transpires when there is a data breach at the QI's facility. The probability of this type of attack being successful is

Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach)

where the term Pr(breach) captures the probability that a breach will occur. What should Pr(breach) be? Publicly available data about the probability of a breach can be used to determine this value; the conditional in this case, Pr(re-id | breach), is computed from the data set being disclosed, as for the other two attacks. Data for 2010 show that 19 percent of health care organizations suffered a data breach within the previous year (HIMSS Analytics, 2010); data for 2012 show that this number rose to 27 percent (HIMSS Analytics, 2012). These organizations were all following the HIPAA Security Rule. Note that these figures are averages and may be adjusted to account for variation.

For a nonpublic data release, then, there are three types of attacks for which the re-identification risk needs to be measured and managed. The risk metrics are summarized in Table B-6. The overall probability of re-identification will then be the largest value among the three equations.
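The combination of the three attack probabilities can be sketched as follows. This is a minimal illustration only: the function name `overall_risk` and every numeric input are assumptions made for the example, except the 27 percent breach rate, which is the 2012 figure cited above.

```python
def overall_risk(attacks):
    """Combine attack probabilities for a nonpublic data release.

    `attacks` maps an attack name to a pair:
    (Pr(context), Pr(re-id | context)).
    Returns the per-attack probabilities and their maximum.
    """
    per_attack = {
        name: p_context * p_reid
        for name, (p_context, p_reid) in attacks.items()
    }
    return per_attack, max(per_attack.values())


# Hypothetical inputs; only Pr(breach) = 0.27 comes from the text.
per_attack, overall = overall_risk({
    "deliberate": (0.10, 0.20),   # Pr(attempt), Pr(re-id | attempt)
    "inadvertent": (0.40, 0.20),  # Pr(acquaintance), Pr(re-id | acquaintance)
    "breach": (0.27, 0.20),       # Pr(breach), Pr(re-id | breach)
})
print(per_attack)
print("overall:", overall)
```

With these illustrative inputs the inadvertent attack dominates, so it is the one that would drive the overall risk.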

Setting Thresholds: What Is Acceptable Risk?

There are quite a few precedents for what can be considered an acceptable amount of risk. These precedents have been in use for many decades, are consistent internationally, and have persisted over time as well (El Emam, 2013). It should be noted, however, that the precedents set to date have been for assessments of maximum risk.

In commentary about the de-identification standard in the HIPAA Privacy Rule, HHS notes in the Federal Register (HHS, 2000) that

the two main sources of disclosure risk for de-identified records about individuals are the existence of records with very unique characteristics (e.g., unusual occupation or very high salary or age) and the existence of external sources of records with matching data elements which can be used to link with the de-identified information and identify individuals (e.g., voter registration records or driver's license records) … an expert disclosure analysis would also consider the probability that an individual who is the target of an attempt at re-identification is represented on both files, the probability that the matching variables are recorded identically on the two types of records, the probability that the target individual is unique in the population for the matching variables, and the degree of confidence that a match would correctly identify a unique person.

It is clear that HHS considers unique records to have a high risk of re-identification, but such statements also suggest that non-unique records have an acceptably low risk of re-identification.

Yet uniqueness is not a universal threshold. Historically, data custodians (particularly government agencies focused on reporting statistics) have used the “minimum cell size” rule as a threshold for deciding whether to de-identify data (Alexander and Jabine, 1978; Cancer Care Ontario, 2005; Health Quality Council, 2004a,b; HHS, 2000; Manitoba Center for Health Policy, 2002; Office of the Information and Privacy Commissioner of British Columbia, 1998; Office of the Information and Privacy Commissioner of Ontario, 1994; OMB, 1994; Ontario Ministry of Health and Long-Term Care, 1984; Statistics Canada, 2007). This rule was originally applied to count data in tables (e.g., number of males aged 30-35 living in a certain geographic region). The most common minimum cell size in practice is 5, which implies that the maximum probability of re-identifying a record is 1/5, or 0.2. Some custodians, such as certain public health offices, use a smaller minimum count, such as 3 (CDC and HRSA, 2004; de Waal and Willenborg, 1996; NRC, 1993; Office of the Privacy Commissioner of Quebec, 1997; U.S. Department of Education, 2003). Others, by contrast, use a larger minimum, such as 11 (in the United States) (Baier et al., 2012; CMS, 2008, 2011; Erdem and Prada, 2011; HHS, 2008a) and 20 (in Canada) (El Emam et al., 2011b, 2012). Based on our review of the literature and the practices of various statistical agencies, the largest minimum cell size is 25 (El Emam et al., 2011b). It should be recognized, however, that there is no agreed-upon threshold, even for what many people would agree are highly sensitive data. For example, minimal counts of 3 and 5 were recommended for HIV/AIDS data (CDC and HRSA, 2004) and abortion data (Statistics Canada, 2007), respectively. Public data releases have used different cell sizes in different jurisdictions. The variability is due, in part, to different tolerances for risk, the sensitivity of data, whether a data sharing agreement is in place, and the nature of the data recipient.

A minimum cell size criterion amounts to a maximum risk value. Yet in some cases, this is too stringent a standard or may not be an appropriate reflection of the type of attack. In such a case, one can use the average risk, as discussed in the previous section. This makes the review of cell size thresholds suitable for both types of risk metrics.
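The relationship between cell sizes, maximum risk, and average risk can be illustrated with a short sketch. It assumes the usual definition in which records sharing the same quasi-identifier values form one cell (equivalence class) and each record's risk is the reciprocal of its cell size. The function name `risk_metrics`, the field names, and the records are all hypothetical.

```python
from collections import Counter


def risk_metrics(records, quasi_identifiers):
    """Compute cell-size-based risk metrics for a list of dict records."""
    # Each distinct combination of quasi-identifier values is one cell.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    # A record's risk is 1 / (number of records in its cell).
    per_record = [1 / sizes[k] for k in keys]
    return {
        "min_cell_size": min(sizes.values()),
        "max_risk": max(per_record),
        "avg_risk": sum(per_record) / len(per_record),
    }


# Hypothetical records with two quasi-identifiers.
records = (
    [{"sex": "M", "birth_year": 1965}] * 3
    + [{"sex": "F", "birth_year": 1970}] * 2
    + [{"sex": "F", "birth_year": 1982}]
)
metrics = risk_metrics(records, ["sex", "birth_year"])
print(metrics)
print("meets minimum cell size of 5:", metrics["min_cell_size"] >= 5)
```

In this toy data set a single unique record pushes the maximum risk to 1, while the average risk is lower, which is why the choice of metric matters.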

It is possible to construct a decision framework based on these precedents with five “bins” representing five possible thresholds, as shown in Figure B-4. At one extreme are data that would be considered identifiable when the cell size is smaller than 3. Next to that are data that are de-identified with a minimal cell size of 3. Given that this is the least de-identified data set, one could choose to disclose such data sets only to trusted entities where the risks are minimal (for example, where a data sharing agreement is in place and the data recipient has good security and privacy practices). At the other end of the spectrum is the minimal cell size of 20. This high level of de-identification is appropriate when the data are publicly released, with no restrictions on or tracking of what is done with the data and who has accessed them.


Figure B-4. Commonly used risk thresholds based on the precedents reviewed in the text.

If the extreme situations cannot be justified in a particular disclosure, an alternative process is needed for choosing one of the intermediate values. In Figure B-4, this is a choice between a value of 5 and a value of 20.

The above framework does not preclude the use of other values (for example, a sponsor may choose to use a threshold value of 25 observations per cell). However, this framework does ground the choices in precedents drawn from actual data sets.

What Is the Likelihood of Re-identifying Clinical Trial Data Sets?

There has been concern in the health care and privacy communities that the risk of re-identification in data is quite high and that de-identification is not possible (Ohm, 2010). This argument is often supported by examples of a number of publicly known re-identification attacks. A systematic review of publicly known re-identification attacks found, however, that when appropriate de-identification standards are used, the risk of re-identification is indeed very small (El Emam et al., 2011a). It was only when no de-identification at all was performed on the data or the de-identification applied was not consistent with or based on best practices that data sets were re-identified with a high success rate. Therefore, the evidence that exists today suggests that using current standards and best practices does provide reasonably strong protections against re-identification.


Managing re-identification risk means (1) selecting an appropriate risk metric, (2) selecting an appropriate threshold, and (3) measuring the risk in the actual clinical trial data set that will be disclosed. The choice of a metric is a function of whether the clinical trial data set will be released publicly. For public data sets, it is prudent to use maximum risk in measuring risk and setting thresholds. For nonpublic data sets, a strong case can be made for using average risk (El Emam, 2013; El Emam and Arbuckle, 2013).

How to Choose an Acceptable Threshold

Selecting an acceptable threshold within the range described earlier requires an examination of the context of the data themselves. The re-identification risk threshold is determined based on factors characterizing the QI and the data themselves (El Emam, 2010). These factors have been suggested and have been in use informally by data custodians for at least the last decade and a half (Jabine, 1993a,b). They cover three dimensions (El Emam et al., 2010), as illustrated in Figure B-5:


Figure B-5. Factors to consider when deciding on an acceptable level of re-identification risk. SOURCE: Reprinted with permission from El Emam and colleagues, 2014.

  • Mitigating controls. This is the set of security and privacy practices that the QI has in place. A recent review identifies a collection of practices used by large data custodians and recommended by funding agencies and IRBs for managing sensitive health information (El Emam et al., 2009).

  • Invasion of privacy. This entails evaluation of the extent to which a particular disclosure would be an invasion of privacy to the participants (a checklist is available in El Emam et al. [2009]). There are three considerations: (1) the sensitivity of the data (the greater the sensitivity of the data, the greater the invasion of privacy), (2) the potential injury to patients from an inappropriate disclosure (the greater the potential for injury, the greater the invasion of privacy), and (3) the appropriateness of consent for disclosing the data (the less appropriate the consent, the greater the invasion of privacy) (see Box B-5).

  • Motives and capacity. This dimension encompasses the motives and the capacity of the QI to re-identify the data, considering such issues as conflicts of interest, the potential for financial gain from re-identification, and whether the data recipient has the skills and financial capacity to re-identify the data (a checklist is available in El Emam et al. [2009]).


Box B-5. Consent and De-identification. As noted earlier, there is no legislative or regulatory requirement to obtain consent from participants to share their de-identified data. There are additional ongoing efforts to ensure that consent forms do not create barriers …

In general, many of these elements can be managed through contracts (e.g., a prohibition on re-identification, restrictions on linking the data with other data sets, and disallowing the sharing of the data with other third parties). For example, if the mitigating controls are low, which means that the QI has poor security and privacy practices, the re-identification threshold should be set at a lower level. This will result in more de-identification being applied. However, if the QI has very good security and privacy practices in place, the threshold can be set higher. Checklists for evaluating these dimensions, as well as a scoring scheme, are available (El Emam, 2013).

If the sponsor is disclosing the data through an online portal, the sponsor has control of many, but not all, of the mitigating controls. This provides additional assurances to the sponsor that a certain subset of controls will be implemented to the sponsor's satisfaction.

Once a threshold has been determined, the actual probability of re-identification is measured in the data set. If the probability is higher than the threshold, transformations of the data need to be performed. Otherwise, the data can be declared to have a very small risk of re-identification.
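The comparison between measured risk and threshold can be sketched as below. A minimum cell size of k corresponds to a probability threshold of 1/k. The cell sizes 3, 5, 11, and 20 are the precedents cited earlier in the text; the function names and the measured risk of 0.15 are assumptions for illustration.

```python
def max_probability(min_cell_size):
    """Convert a minimum cell size threshold into a probability threshold."""
    return 1 / min_cell_size


def needs_transformation(measured_risk, min_cell_size):
    """Return True when the measured risk exceeds the chosen threshold."""
    return measured_risk > max_probability(min_cell_size)


measured = 0.15  # hypothetical risk measured in a data set
for k in (3, 5, 11, 20):  # cell sizes cited as precedents in the text
    print(
        f"cell size {k}: threshold {max_probability(k):.3f}, "
        f"transform needed: {needs_transformation(measured, k)}"
    )
```

The same measured risk can therefore be acceptable for a trusted recipient and unacceptable for a public release, because the two contexts call for different thresholds.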

The implication here is that the amount of data transformation needed will be a function of these other contextual factors. For example, if the QI has good security and privacy practices in place, the threshold chosen will be higher, which means that the data will be subjected to less de-identification.

The security and privacy practices of the QI can be specified through contracts. The contract signed by the QI can impose a list of practices that must be in place, and these practices are the basis for determining the threshold. The QI must therefore actually have them in place to justify the level of transformation performed on the data.

This approach is consistent with the limited data set (LDS) method for sharing data under HIPAA. However, the LDS method does not by itself ensure that the risk of re-identification is very small, and therefore data shared under it will still be considered personal health information.

For public data releases, there are no contracts and no expectation that any mitigating controls will be in place. In that case, the lowest probability thresholds (or highest cell size thresholds) are used.

Methods for Transforming the Data

There are a number of ways to transform a data set to reduce the probability of re-identification to a value below the threshold. Many algorithms for this purpose have been proposed by the computer science and statistics communities. They vary in quality and performance. Ideally, algorithms adopted for clinical trial data sets should minimize the modifications to the data while ensuring that the measured probability is below the threshold.

Four general classes of techniques have worked well in practice:

  • Generalization. This is when the value of a field is modified to a more general value. For example, a date of birth can be generalized to a month and year of birth.

  • Suppression. This is when specific values in the clinical trial data set are removed from the data set (i.e., induced missingness). For example, a value in a record that makes it an outlier may be suppressed.

  • Randomization. This denotes adding noise to a field. The noise can come from a uniform or other type of distribution. For example, a date may be shifted a week forward or backward.

  • Subsampling. This is used to disclose a random subset of the data rather than the full data set to the QI.

In practice, a combination of these techniques is applied for any given data disclosure. Furthermore, these techniques can be customized to specific field types. For example, generalization and suppression can be applied differently to dates and zip codes to maximize the data quality for each (El Emam and Arbuckle, 2013).
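The four classes of techniques can each be sketched in a few lines. The function names, field names, and parameter choices below (for example, a uniform shift of up to seven days) are assumptions for illustration, not a prescribed implementation.

```python
import random
from datetime import date, timedelta


def generalize_dob(dob):
    """Generalization: keep only the year and month of birth."""
    return dob.strftime("%Y-%m")


def suppress(record, field):
    """Suppression: blank out one value (induced missingness)."""
    return {**record, field: None}


def shift_date(d, rng, max_days=7):
    """Randomization: add uniform noise of up to max_days either way."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


def subsample(records, fraction, rng):
    """Subsampling: disclose a random subset of the records."""
    k = round(len(records) * fraction)
    return rng.sample(records, k)


rng = random.Random(0)  # seeded only to make the sketch repeatable
record = {"dob": date(1965, 3, 14), "visit": date(2012, 6, 1), "income": 950000}

print(generalize_dob(record["dob"]))
print(suppress(record, "income"))
print(shift_date(record["visit"], rng))
print(len(subsample(list(range(10)), 0.5, rng)))
```

In a real disclosure these steps would be combined and tuned per field type, as the text notes for dates and zip codes.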

The application of these techniques can reduce the risk of re-identification. For example, consider the average risk in Table B-3, which is 0.59. There is a reduction in average risk to 0.33 when the year of birth is generalized to decades in Table B-4. By suppressing some records, it was possible to further reduce the average risk to 0.22 in Table B-5. Each transformation progressively reduces the risk.
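The same progression can be reproduced on a toy example. Tables B-3 to B-5 are not reproduced here, so the records below are hypothetical and the resulting values will not match 0.59, 0.33, and 0.22. Average risk is assumed to be the mean over records of the reciprocal of each record's equivalence class size.

```python
from collections import Counter


def average_risk(records):
    """Mean of 1 / (equivalence class size) over all records."""
    sizes = Counter(records)
    return sum(1 / sizes[r] for r in records) / len(records)


def generalize_to_decade(records):
    """Generalize year of birth to the decade of birth."""
    return [(sex, year // 10 * 10) for sex, year in records]


def suppress_small_classes(records, min_size):
    """Suppress records whose equivalence class is smaller than min_size."""
    sizes = Counter(records)
    return [r for r in records if sizes[r] >= min_size]


# Hypothetical (sex, year of birth) records; every record is unique.
original = [
    ("M", 1961), ("M", 1964), ("M", 1968), ("F", 1962), ("F", 1965),
    ("F", 1973), ("M", 1975), ("M", 1979), ("F", 1988),
]
generalized = generalize_to_decade(original)
suppressed = suppress_small_classes(generalized, min_size=2)

for label, data in [("original", original),
                    ("generalized", generalized),
                    ("suppressed", suppressed)]:
    print(f"{label}: n={len(data)}, average risk={average_risk(data):.2f}")
```

Each step lowers the average risk, at the cost of coarser values after generalization and fewer records after suppression.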

The Use of Identifier Lists

Having covered these topics, we can now critically appraise some commonly used de-identification methods and the extent to which they can ensure that the risk of re-identification is very small. We focus on the use of identifier lists because this approach is quite common and is being adopted to de-identify clinical trial data.

The HIPAA Privacy Rule's Safe Harbor Standard

We first consider the variable list in the HIPAA Privacy Rule Safe Harbor method.

The Safe Harbor list contains a number of direct identifiers and two quasi-identifiers (i.e., dates and zip codes), as summarized earlier in Box B-2. It should be evident that in applying a fixed list of variables, there is no assurance that all of the quasi-identifiers have been accounted for in the risk measurement and the transformation of the data set. For example, other quasi-identifiers, such as race, ethnicity, and occupation, may be in the data set, but they will be ignored. Even if the probability of re-identification under Safe Harbor is small (Benitez and Malin, 2010), this low probability may not carry over when the data set contains more quasi-identifiers than the two on the list.
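A rough sketch of the two Safe Harbor quasi-identifier rules shows why other fields pass through untouched. It assumes the rules as commonly stated: dates are reduced to the year, and zip codes are truncated to their first three digits, with three-digit areas of 20,000 or fewer people replaced by 000. The population table, field names, and function names are hypothetical, and the sketch omits direct identifiers and the special handling of ages over 89.

```python
from datetime import date

# Hypothetical populations of three-digit zip areas; real values would
# have to come from census data.
ZIP3_POPULATION = {"100": 1_500_000, "036": 12_000}


def safe_harbor_zip(zip_code, populations=ZIP3_POPULATION, minimum=20_000):
    """Keep the first three digits only if the area exceeds `minimum` people."""
    zip3 = zip_code[:3]
    return zip3 if populations.get(zip3, 0) > minimum else "000"


def safe_harbor_date(d):
    """Keep only the year of a date."""
    return d.year


def apply_safe_harbor(record):
    """Transform only the two listed quasi-identifiers."""
    out = dict(record)
    out["zip"] = safe_harbor_zip(record["zip"])
    out["birth_date"] = safe_harbor_date(record["birth_date"])
    return out  # occupation, race, and similar fields are left as-is


record = {"zip": "10027", "birth_date": date(1965, 3, 14), "occupation": "judge"}
print(apply_safe_harbor(record))
print(safe_harbor_zip("03601"))
```

The unusual occupation survives the transformation unchanged, which is exactly the gap a fixed identifier list leaves open.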

The empirical analysis that was conducted before the Safe Harbor standard was issued assumed that the data set is a random sample from the U.S. population. This assumption may have variable validity in real data sets. However, there will be cases when it is definitely not true. For example, consider a data set that consists of only the records in Table B-1. Now, assume that an adversary can find out who is in the data set. This can happen if the data set covers a well-defined population. If the trial site is known, it can be reasonably assumed that the participants in the trial who received treatment at that site live in the same geographic region. If the adversary knows that Bob was born in 1965, lives in the town in which the site is situated, and was in the trial, the adversary knows that Bob is in the data set, and therefore the 27th record must be Bob. This re-identification occurs even though this table meets the requirements of the Safe Harbor standard. Members of a data set may be known if their inclusion in the trial is revealing (e.g., a trial in a workplace where participants have to wear a visible device, parents who must consent to have their teenage children participate in a study, or adolescents who must miss a few days of school to participate in a study). Therefore, this standard can be protective only if the adversary cannot know who is in the data set. This will be the case if the data set is a random sample from the population.
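The attack described in this paragraph amounts to filtering the released records by what the adversary already knows. The sketch below uses hypothetical records (Table B-1 is not reproduced here) and a hypothetical helper, `candidates`, and assumes the adversary knows that the target is in the data set.

```python
def candidates(records, **known):
    """Return the records consistent with what the adversary knows."""
    return [r for r in records if all(r.get(k) == v for k, v in known.items())]


# Hypothetical released records: dates already reduced to year of birth.
trial = [
    {"id": 1, "sex": "M", "birth_year": 1965},
    {"id": 2, "sex": "M", "birth_year": 1971},
    {"id": 3, "sex": "F", "birth_year": 1965},
    {"id": 4, "sex": "M", "birth_year": 1980},
]

# The adversary knows the target is male, was born in 1965, and took part.
matches = candidates(trial, sex="M", birth_year=1965)
if len(matches) == 1:
    print("unique match, record id:", matches[0]["id"])
else:
    print("ambiguous:", len(matches), "candidates")
```

If the data set were instead a random sample from a much larger population, a unique match in the sample would not establish that the record belongs to the target, because other people outside the sample could share the same values.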

If these assumptions are met, the applicability of Safe Harbor to a clinical trial data set will be defensible, but only if there are no international participants. If a clinical trial data set includes participants from sites outside the United States, the analysis that justifies using this standard will not be applicable. For example, there is a difference of two orders of magnitude between the median number of individuals living in U.S. zip codes and in Canadian postal codes. Therefore, translating the zip code truncation logic in Safe Harbor to Canadian postal codes would not be based on defensible evidence.

Safe Harbor also has some weaknesses that are specific to the two quasi-identifiers that are included.

In some instances, there may be dates in a clinical trial data set that are not really quasi-identifiers because they do not pass the test highlighted earlier. For example, consider an implantable medical device that fires, and each time it does so there is a time and date stamp in the data stream. The date of a device's firing is unlikely to be a quasi-identifier because it is not knowable, but it is a date.
