Identifying Personally Identifiable Information (PII) in Unstructured Data with Microsoft Presidio

By Saptarshi Dutta Gupta, Statistics Canada

Editor's note: The content of this article represents the position of the author and may not necessarily represent that of Statistics Canada.

Introduction

In today's digital age, organizations collect and store vast amounts of data about their customers, employees, and partners. This data often contains Personally Identifiable Information (PII). With the growing prevalence of data breaches and cyberattacks, protecting PII has become a critical concern for businesses and government agencies alike. For example, Statistics Canada conducts hundreds of surveys each year on a variety of topics and is obligated to protect the information that individuals provide.

Canada has two federal privacy laws that are enforced by the Office of the Privacy Commissioner of Canada:

  • Privacy Act: covers how the federal government handles personal information. The Privacy Act offers protections for personal information, which it defines as any recorded information about an 'identifiable individual'.
  • Personal Information Protection and Electronic Documents Act (PIPEDA): PIPEDA is the federal privacy law that applies to organizations that collect, use, or disclose personal data during commercial activities. PIPEDA requires organizations to obtain consent for the collection, use, or disclosure of personal data, and to protect personal data from unauthorized access, use, or disclosure.

In addition to these laws, organizations may also be bound by the General Data Protection Regulation (GDPR), often described as the toughest privacy and security law in the world. Though it was drafted and passed by the European Union (EU), it imposes obligations on organizations anywhere, so long as they target or collect data related to people in the EU. The GDPR levies harsh fines against those who violate its privacy and security standards, with penalties reaching into the tens of millions of euros.

In this article, we take a detailed look at Microsoft Presidio and how it can help organizations in Canada comply with privacy laws. We discuss its key features and capabilities and how they can assist organizations in meeting their obligations under these laws.

Definitions

Before proceeding, it is important to understand the differences between the terms anonymization, deidentification and pseudo-anonymization, which are used throughout the rest of the article.

  • Anonymization: Anonymization refers to the process of irreversibly removing or obscuring identifiable information from data in such a way that the original data cannot be re-identified. The goal is to make it impossible or extremely difficult to link the data back to the individual it represents. Anonymized data should not contain any direct or indirect identifiers that could be used to identify individuals. 
  • Deidentification: Deidentification involves the removal or alteration of PII from a data set in order to prevent the identification of individuals. Unlike anonymization, deidentification does not necessarily require the data to be rendered completely unidentifiable. Instead, it focuses on removing or modifying specific identifiers, such as names, addresses, social security numbers, or any other information that could be used alone or in combination with other data to identify individuals. 
  • Pseudo-anonymization: Pseudo-anonymization is a technique that involves replacing direct identifiers with pseudonyms or unique identifiers, thereby unlinking the data from the individuals it represents. Unlike anonymization, where the original data is altered to prevent re-identification, pseudo-anonymization retains the ability to re-identify individuals using additional information stored separately, such as a key or lookup table. Pseudo-anonymization is commonly used in situations where data needs to be linked across different systems or databases while still protecting individual privacy.

What is PII?

Personally identifiable information (PII) is any data that can be used to identify an individual. This includes, but is not limited to, names, addresses, phone numbers, social security numbers, financial information, and medical records. PII is highly sensitive information that needs to be protected from unauthorized access, as it can be used for identity theft and other fraudulent activities.

Depending on whether a piece of information can be used directly or indirectly to re-identify an individual, one can categorize the information mentioned above into direct-identifiers and quasi-identifiers [4]:

  • Direct-identifiers: A set of variables unique for an individual (a name, address, phone number, or bank account) that may be used to directly identify the subject.
  • Quasi-identifiers: Information such as gender, nationality, or city of residence that in isolation does not enable re-identification but may do so when combined with other quasi-identifiers and background knowledge.

Why is PII protection important?

PII protection is important because individuals have a right to privacy and should have control over how their personal information is collected, used, and disclosed. Data breaches and identity theft can have significant consequences for individuals, including financial losses, reputational damage, and emotional distress. Therefore, it is essential for organizations to have robust measures in place to protect PII.

Background

a) Anonymizing structured data

When it comes to anonymizing structured data, there are established mathematical models of privacy. These include:

  • K-anonymity: A masked dataset has the k-anonymity property if the information about each person in the dataset cannot be distinguished from that of at least k-1 other individuals. Two methods can be used to achieve k-anonymity: suppression, which completely removes an attribute's value from the dataset, and generalization, which replaces a specific value of an attribute with a more general one.
  • L-diversity: This is an extension of k-anonymity. If the rows of a dataset are grouped so that each group shares identical quasi-identifiers, and every group contains at least l distinct values for each sensitive attribute, then the dataset has l-diversity.
  • Differential privacy: this aims to ensure that the output of a process or algorithm remains roughly the same, regardless of whether an individual's data is included. This means that it is impossible to determine with certainty whether a specific individual is present in the dataset just by examining the output of a differentially private analysis.
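The first two definitions can be checked mechanically. The following is a minimal sketch in plain Python, assuming records are dictionaries and that the quasi-identifiers have already been generalized; the records, column names and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group, i.e. the largest k that holds."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())

def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any group."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group}) for group in groups.values())

# Hypothetical records with generalized age bands and truncated postal codes.
records = [
    {"age": "30-39", "postal": "V5A", "diagnosis": "flu"},
    {"age": "30-39", "postal": "V5A", "diagnosis": "asthma"},
    {"age": "40-49", "postal": "K1A", "diagnosis": "flu"},
    {"age": "40-49", "postal": "K1A", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age", "postal"])
l = l_diversity(records, ["age", "postal"], "diagnosis")
print(f"k = {k}, l = {l}")
```

In this toy dataset every combination of quasi-identifiers appears twice, so the data is 2-anonymous, but the second group contains a single diagnosis, so it is only 1-diverse. This shows why l-diversity is a stricter requirement than k-anonymity.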

There are several other anonymization techniques that can be applied to both structured and unstructured data. Some of these techniques include:

  • Data shuffling: This involves randomly rearranging the rows or columns of a dataset to disrupt any potential correlations between variables.
  • Data perturbation: This involves adding random noise or errors to the data to reduce the risk of re-identification. This can be done through techniques such as adding Gaussian noise or rounding values to the nearest multiple of a certain number.
  • Data aggregation: This involves aggregating the data at a higher level, such as at the city or state level, to protect individual-level data.
  • Data suppression: This involves removing sensitive information from the dataset altogether, such as by deleting specific columns or rows, or replacing sensitive values with a placeholder value (e.g., "******").
  • Data generalization: This involves replacing specific values with more general values, such as replacing a specific street address with just the city or state.
  • Data obfuscation: This involves replacing sensitive information with fake or misleading data, such as through random name generation or generating fake addresses.
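A few of these techniques can be sketched in a handful of lines. The example below is only an illustration, assuming a single hypothetical record; the function names, band width and rounding base are arbitrary choices, not recommended settings.

```python
def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def round_to_multiple(value, base=5000):
    """Perturbation: round a value to the nearest multiple of `base`."""
    return base * round(value / base)

def suppress(value, placeholder="******"):
    """Suppression: replace a sensitive value with a placeholder."""
    return placeholder

# Hypothetical record used only for illustration.
record = {"name": "John", "age": 34, "income": 61250}

masked = {
    "name": suppress(record["name"]),
    "age": generalize_age(record["age"]),
    "income": round_to_multiple(record["income"]),
}
print(masked)
```

Each function trades some analytical precision for a lower re-identification risk, which is why the choice of band width or rounding base usually depends on how the data will be used.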

It is essential to understand that no single anonymization technique is completely foolproof. Therefore, it is usually necessary to use a combination of techniques to effectively protect sensitive data. It is also crucial to continuously evaluate and update anonymization techniques as new re-identification risks and techniques arise.

b) Anonymizing unstructured data

The process of anonymizing unstructured data, such as text or images, is a more challenging task. It entails detecting where the sensitive information is present in the unstructured data and then applying anonymization techniques to it. Because of the nature of unstructured data, simple rule-based models applied directly may not perform well.

Therefore, Natural Language Processing (NLP) techniques have been applied to text anonymization. In particular, Named Entity Recognition (NER), a type of sequence labeling task, indicates whether a token (such as a word) corresponds to a named entity, such as a PERSON (PER), LOCATION, DATETIME or ORGANIZATION (ORG), as shown below. O indicates that no entity has been recognized.

Image 1. Sequence Labeling Task – Named Entity Recognition

Description - Image 1. Sequence Labeling Task – Named Entity Recognition

This picture shows the result of passing a string through a Named Entity Recognition (NER) model. The input is the string “John bought 30 Amazon shares in 2022”. After the sequence passes through the NER model, each word is classified with its corresponding entity: John is tagged as PERSON, Amazon as ORGANIZATION and 2022 as DATETIME, and all remaining words are tagged as OTHERS.
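To make the output of a sequence labeling task concrete, the sketch below hard-codes the per-token labels from the example above and collects the tokens that carry an entity label. No model is run here; the labels are written by hand, and merging adjacent tokens with the same label is a simplification (real taggers typically use a scheme such as BIO to mark entity boundaries).

```python
tokens = ["John", "bought", "30", "Amazon", "shares", "in", "2022"]
labels = ["PERSON", "O", "O", "ORG", "O", "O", "DATETIME"]

def labels_to_entities(tokens, labels, outside="O"):
    """Merge consecutive tokens that share a non-outside label into entities."""
    entities = []
    current_tokens, current_label = [], None
    for token, label in zip(tokens, labels):
        if label == current_label and label != outside:
            # Same entity continues: extend the current span.
            current_tokens.append(token)
            continue
        if current_label not in (None, outside):
            # The previous entity has ended: record it.
            entities.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [token], label
    if current_label not in (None, outside):
        entities.append((" ".join(current_tokens), current_label))
    return entities

entities = labels_to_entities(tokens, labels)
print(entities)
```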

Several neural models have achieved state-of-the-art performance on NER tasks on datasets with general named entities. When trained on medical domain data that contains various types of personal information, they have been shown to achieve state-of-the-art performance on those data as well. These model architectures include Recurrent Neural Networks (RNNs) with character embeddings and Bidirectional Encoder Representations from Transformers (BERT).

spaCy also offers a RoBERTa-based language model fine-tuned on the OntoNotes dataset with 18 named entity categories, such as PERSON, GPE, CARDINAL and LOC.

Microsoft Presidio uses a combination of rule-based and Natural Language Processing methods to anonymize sensitive content, which we discuss next.

Microsoft Presidio

Why do we need Microsoft Presidio?

When we apply PII anonymization to real-world applications, there might be different business requirements that make it challenging to use pre-trained models directly. For example, the Government of Canada (GoC) receives several applications during an advertised process, which are then reviewed. Before the review process, PII needs to be redacted to ensure personal information is not leaked and to avoid bias. Apart from the common PII entities, the GoC also uses a Personal Record Identifier (PRI) for every employee, in which the last digit is a modulus-11 check digit [Source: TBS - Incumbent Data Element Dictionary].

A pre-trained NER model cannot identify these special entities. Fine-tuning the model with extra labeled data is required to achieve good performance. Therefore, there is a need for a tool that can utilize a pre-trained NER model and can easily be customized and extended.
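The check-digit idea can be made concrete with a small sketch that combines a regular expression with a modulus-11 validation. The nine-digit length, the weights and the function names below are assumptions made purely for illustration; the official PRI scheme is defined by TBS and is not reproduced here.

```python
import re

# Hypothetical weights for the first eight digits. The official PRI scheme
# is not described in this article and may differ.
WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2)

def mod11_check_digit(body):
    """Return the modulus-11 check digit for an 8-digit string, or None
    when the scheme yields 10 (no single-digit check digit exists)."""
    total = sum(int(digit) * weight for digit, weight in zip(body, WEIGHTS))
    check = (11 - total % 11) % 11
    return None if check == 10 else check

def is_valid_identifier(candidate):
    """True if `candidate` is nine digits and its last digit matches."""
    if not re.fullmatch(r"\d{9}", candidate):
        return False
    return mod11_check_digit(candidate[:8]) == int(candidate[8])

def find_identifiers(text):
    """Return nine-digit numbers in `text` that pass the check-digit test."""
    candidates = re.findall(r"\b\d{9}\b", text)
    return [c for c in candidates if is_valid_identifier(c)]

found = find_identifiers("Employee 123456789 called from 123456780.")
print(found)
```

A pattern match alone would flag both numbers in the sample sentence. The checksum step discards the second one, which is the kind of validation that reduces false positives for identifiers with a fixed format.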

Presidio (from the Latin praesidium, 'protection, garrison') helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.

One of the key benefits of the Presidio framework is its ability to scale. It can handle large data sets, making it suitable for use by organizations with large amounts of data. It is also designed to be flexible and adaptable, allowing organizations to customize its use to meet their specific needs.

Image 2: PII detection workflow in Microsoft Presidio [Source: Presidio: Data Protection and De-identification SDK]

Description - Image 2: PII detection workflow in Microsoft Presidio

The image shows the Presidio detection flow, which is used to detect PII. An input passes through regex, which performs pattern recognition, followed by a Named Entity Recognition algorithm to detect entities, a checksum to validate patterns, context words to increase the detection confidence, and multiple anonymization techniques. The image shows the input: ‘Hi, my name is David, and my number is 212 555 1234’. After the input passes through the Presidio detection flow, David and the number 212 555 1234 are detected as PII.

Goals

  • Introduce de-identification technologies to organizations in a user-friendly manner to promote privacy and transparency in decision-making.
  • Make the technology flexible and customizable to fit specific business needs.
  • Support both fully automated and semi-automated PII de-identification on multiple platforms.

Main features

  • Provides PII recognition using a variety of methods such as Named Entity Recognition, regular expressions, rule-based logic, and checksum with context, in multiple languages.
  • Offers the ability to connect to external PII detection models.
  • Offers multiple options for use, including Python or PySpark workloads, Docker, and Kubernetes.
  • Allows for customization in PII identification and anonymization.
  • Includes a module for redacting PII text in images.

Main modules of Presidio

a) Presidio Analyzer:

(i) Overview

The Presidio analyzer is a Python-based service for detecting PII entities in text. During analysis, it runs a set of different PII recognizers, each one in charge of detecting one or more PII entities using different mechanisms. Presidio analyzer comes with a set of predefined recognizers but can easily be extended with other types of custom recognizers. Predefined and custom recognizers leverage Named Entity Recognition, regular expressions, rule-based logic, and checksum with the relevant context in multiple languages to detect PII in unstructured text, as shown in the detection workflow below:

Image 3: Presidio Analyzer for Identifying PII [Source: Presidio Analyzer]

Description - Image 3: Presidio Analyzer for Identifying PII

The image shows how the Presidio Analyzer is used for identifying PII. The input text is passed through multiple PII recognizers, which include built-in recognizers, custom recognizers, and custom models. The built-in recognizers include regex, checksum, NER, and context words. After the text input passes through all the recognizers, the PII is detected.

By default, Microsoft Presidio can recognize the entities listed on the Supported entities page of the Microsoft Presidio documentation.

(ii) Installation

Presidio Analyzer can be installed using pip or Docker, or it can be built from source.

(iii) Running a Basic Analyzer

Once installation is complete, a basic analyzer can be run with a few lines of code as shown:

from presidio_analyzer import AnalyzerEngine
# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()
# Call analyzer to get results
results = analyzer.analyze(text="Mr. John lives in Vancouver. His email id is john@sfu.ca", language='en')
print(results)

[type: EMAIL_ADDRESS, start: 45, end: 56, score: 1.0, type: PERSON, start: 4, end: 8, score: 0.85, type: LOCATION, start: 18, end: 27, score: 0.85, type: URL, start: 50, end: 56, score: 0.5]

By default, Presidio uses spaCy's en_core_web_lg model and can identify the entities listed on the Supported entities page of the Microsoft Presidio documentation. As seen in the above output, the PERSON, EMAIL_ADDRESS, LOCATION and URL entities have been identified. We can extend the analyzer to support the detection of new entities, which is discussed next.

(iv) Capabilities of Presidio Analyzer

1. Supporting detection of new PII entities

To expand Presidio's detection abilities to new types of PII entities, EntityRecognizer objects should be added to the current list of recognizers. These objects are Python-based and can detect one or more entities in a specific language.

The following class diagram shows the different types of recognizer families Presidio contains.

Image 4: Class Diagram for different types of Recognizers in Presidio [Source: Supporting detection of new types of PII entities]

Description - Image 4: Class Diagram for different types of Recognizers in Presidio

The image shows the class diagram for different types of recognizers in Presidio. The EntityRecognizer is an abstract class for all recognizers. The RemoteRecognizer is an abstract class for calling external PII detectors. The abstract class LocalRecognizer is implemented by all recognizers running within the Presidio-analyzer process. The PatternRecognizer is a class for supporting regex and deny-list based recognition logic, including validation (e.g., with checksum) and context support.

In the above diagram:

  • The EntityRecognizer is an abstract class for all recognizers.
  • The RemoteRecognizer is an abstract class for calling external PII detectors.
  • The abstract class LocalRecognizer is implemented by all recognizers running within the Presidio-analyzer process.
  • The PatternRecognizer is a class for supporting regex and deny-list based recognition logic, including validation (e.g., with checksum) and context support.

The analyzer can be extended to identify additional PII entities in two steps:

  1. Create a new class based on EntityRecognizer.
  2. Add the new recognizer to the recognizer registry so that the AnalyzerEngine can use it during analysis.

Example:

For simple recognizers based on regular expressions or deny-lists, we can leverage the provided PatternRecognizer and call the recognizer itself as shown:

from presidio_analyzer import PatternRecognizer
# Deny-list recognizer that flags common titles as a TITLE entity
titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=["Mr.","Mrs.","Miss"])
# `entities` expects a list of entity names
titles_recognizer.analyze(text="Mr. John lives in Vancouver. His email id is john@sfu.ca", entities=["TITLE"])

[type: TITLE, start: 0, end: 3, score: 1.0]

Next, we can add it to the list of Recognizers for the detection of additional PII entities:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
# Add the recognizer to the existing list of recognizers
registry.add_recognizer(titles_recognizer)
# Set up analyzer with our updated recognizer registry
analyzer = AnalyzerEngine(registry=registry)
# Run with input text
text="Mr. John lives in Vancouver. His email id is john@sfu.ca"
results = analyzer.analyze(text=text, language="en")
results

[type: TITLE, start: 0, end: 3, score: 1.0,
type: EMAIL_ADDRESS, start: 45, end: 56, score: 1.0,
type: PERSON, start: 4, end: 8, score: 0.85,
type: LOCATION, start: 18, end: 27, score: 0.85,
type: URL, start: 50, end: 56, score: 0.5]

For a more complex recognizer, such as one that detects the PRI used by the Government of Canada, the recognizer can be created in code using the following steps:

  • Create a new Python class which implements LocalRecognizer. (LocalRecognizer implements the base EntityRecognizer class). This class has the following functions:
    • load: load a model / resource to be used during recognition
    • analyze: The main function to be called for getting entities out of the new recognizer
  • Add it to the recognizer registry using registry.add_recognizer(my_recognizer). For more examples, see the Customizing Presidio Analyzer Jupyter notebook.

There are several other ways to create a custom recognizer in Presidio, such as:

  • Creating a remote recognizer: A remote recognizer interacts with an external service for PII detection. This could be a third-party service or a custom service running alongside Presidio.
  • Creating ad-hoc recognizers: Ad-hoc recognizers are created through the Presidio Analyzer API. These recognizers, in JSON form, can be added to the /analyze request and are only used for that specific request.
  • Reading pattern recognizers from YAML: Pattern recognizers can be read from YAML files, which allows users to add recognition logic without writing code. An example YAML file can be found here: Example Recognizers. Once the YAML file is created, it can be loaded into the RecognizerRegistry instance.

2. Multi-language support

Presidio can detect PII in multiple languages using its built-in recognizers and models. By default, it includes recognizers and models for English. However, these recognizers are language-dependent, either by their logic or by the context words used to scan for entities.

To improve the results for specific languages, it is possible to update the context words of existing recognizers or add new recognizers that support additional languages. Each recognizer can only support one language, so adding new recognizers for additional languages is necessary.

3. Customizing the NLP models

As mentioned before, the Presidio Analyzer by default uses spaCy's en_core_web_lg model but it can easily be customized by leveraging other NLP models, either public or proprietary. Presidio uses NLP engines for two main tasks: NER based PII identification, and feature extraction for custom rule-based logic (such as leveraging context words for improved detection). These models can be trained or downloaded from existing NLP frameworks like spaCy, Stanza and Transformers.

A new model can be configured in either of two ways:

  • Via code: Create an NlpEngine using the NlpEngineProvider class and pass it to the AnalyzerEngine as input.
  • Via configuration: Set up the models that should be used in the default conf file. The default conf file is read during the default initialization of the AnalyzerEngine. Alternatively, the path to a custom configuration file can be passed to the NlpEngineProvider.

In addition to the built-in spaCy, Stanza and Transformers capabilities, it is possible to create new recognizers that serve as interfaces to other models, such as Flair.

b) Presidio Anonymizer:

The Anonymizer is also a Python-based service. It anonymizes the detected PII entities with desired values by applying operators such as replace, mask, and redact. By default, it replaces the detected PII with its entity type, such as <EMAIL_ADDRESS> or <PHONE_NUMBER>, directly in the text. This can be customized by providing different anonymizing logic for different types of entities.

The Presidio-Anonymizer package contains both Anonymizers and Deanonymizers.

  • Anonymizers are used to replace a PII entity text with some other value by applying a certain operator. The various built-in operators are:
    • replace: Replace the PII with desired value
    • redact: Remove the PII completely from text
    • hash: Hashes the PII text (can be either sha256,sha512 or md5)
    • mask: Replace the PII with a given character
    • encrypt: Encrypt the PII using a given cryptographic key
    • custom: Replace the PII with the result of the function executed on the PII
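To illustrate what these operators do to a detected span, here is a plain-Python sketch. It is not Presidio's implementation and does not use the Presidio API: the function names, the (entity_type, start, end) span format and the parameter names are assumptions made for this example. The encrypt operator is left out because it needs a cryptographic library.

```python
import hashlib

def apply_operator(pii_text, entity_type, operator, **params):
    """Return the replacement for one detected PII span (illustrative only)."""
    if operator == "replace":
        return params.get("new_value", f"<{entity_type}>")
    if operator == "redact":
        return ""
    if operator == "hash":
        return hashlib.sha256(pii_text.encode("utf-8")).hexdigest()
    if operator == "mask":
        count = min(params.get("chars_to_mask", len(pii_text)), len(pii_text))
        return params.get("masking_char", "*") * count + pii_text[count:]
    if operator == "custom":
        return params["func"](pii_text)
    raise ValueError(f"Unsupported operator: {operator}")

def anonymize(text, spans, operators):
    """Apply a per-entity operator to each (entity_type, start, end) span."""
    # Work from the end of the text so that earlier offsets stay valid.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        operator, params = operators.get(entity_type, ("replace", {}))
        replacement = apply_operator(text[start:end], entity_type, operator, **params)
        text = text[:start] + replacement + text[end:]
    return text

text = "Mr. John lives in Vancouver. His email id is john@sfu.ca"
spans = [("PERSON", 4, 8), ("LOCATION", 18, 27), ("EMAIL_ADDRESS", 45, 56)]
operators = {
    "PERSON": ("mask", {"chars_to_mask": 3, "masking_char": "*"}),
    "EMAIL_ADDRESS": ("redact", {}),
}

anonymized = anonymize(text, spans, operators)
print(anonymized)
```

Entities without an explicit operator (LOCATION here) fall back to replace, mirroring the default behaviour described above, while the name is partially masked and the email address is removed.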

Image 5: PII Anonymizer workflow [Source: Presidio Anonymizer]

Description - Image 5: PII Anonymizer workflow

The image shows the function of the Presidio anonymizer. On the left, the text and detected PII are passed to both built-in and custom anonymizers. The built-in anonymizers consist of operators such as redact, hash and replace. After the text and detected PII pass through the PII Anonymizer, the anonymized text is returned.

Example:

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
# Initialize the engine:
engine = AnonymizerEngine()
# Invoke the anonymize function with the text and the analyzer results
# (here, the `results` list returned by the analyzer above). No operators
# are passed, so the default replace operator is applied:
result = engine.anonymize(
    text="Mr. John lives in Vancouver. His email id is john@sfu.ca",
    analyzer_results=results
)
# Show the anonymized text and the list of replaced items
print(result)

Output:

text: <TITLE> <PERSON> lives in <LOCATION>. His email id is <EMAIL_ADDRESS>
items:
[
    {'start': 54, 'end': 69, 'entity_type': 'EMAIL_ADDRESS', 'text': '<EMAIL_ADDRESS>', 'operator': 'replace'},
    {'start': 26, 'end': 36, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
    {'start': 8, 'end': 16, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'},
    {'start': 0, 'end': 7, 'entity_type': 'TITLE', 'text': '<TITLE>', 'operator': 'replace'}
]

Presidio also allows the extension of the Presidio anonymizer to support additional operators.

  • Deanonymizers are used to revert the anonymization operation. (e.g., to decrypt an encrypted text).

As the input text could potentially have overlapping PII entities, there are different anonymization scenarios that can happen:

  • No overlap (single PII): When there is no overlap in spans of entities, Presidio Anonymizer uses a given or default anonymization operator to anonymize and replace the PII text entity.
  • Full overlap of PII entity spans: When entities have overlapping substrings, the PII with the higher score will be taken. Between PIIs with identical scores, the selection is arbitrary.
  • One PII is contained in another: Presidio Anonymizer will use the PII with the larger text even if its score is lower.
  • Partial intersection: Presidio Anonymizer will anonymize each individually and will return a concatenation of the anonymized text.
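The scenarios above can be expressed as a short conflict-resolution routine. The sketch below is a simplified reading of these rules in plain Python, not Presidio's own code; the tuple layout (entity_type, start, end, score) and the function name are assumptions for illustration.

```python
def resolve_overlaps(results):
    """Drop spans that are duplicated by, or contained in, another span.

    `results` is a list of (entity_type, start, end, score) tuples.
    """
    kept = []
    # Longest spans first, then highest score, so that containing spans and
    # higher-scoring duplicates are seen before the spans they displace.
    ordered = sorted(results, key=lambda r: (-(r[2] - r[1]), -r[3]))
    for candidate in ordered:
        _, start, end, _ = candidate
        covered = any(k[1] <= start and end <= k[2] for k in kept)
        if not covered:
            # Non-overlapping and partially intersecting spans are both kept.
            kept.append(candidate)
    return sorted(kept, key=lambda r: r[1])

# Analyzer output from the earlier example: the URL span (50-56) lies
# inside the EMAIL_ADDRESS span (45-56).
results = [
    ("EMAIL_ADDRESS", 45, 56, 1.0),
    ("PERSON", 4, 8, 0.85),
    ("LOCATION", 18, 27, 0.85),
    ("URL", 50, 56, 0.5),
]

resolved = resolve_overlaps(results)
print([entity_type for entity_type, *_ in resolved])
```

This is consistent with the anonymized output shown earlier, where the URL entity does not appear because it is contained in the larger EMAIL_ADDRESS span.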

Conclusion

In conclusion, Microsoft Presidio is a valuable tool for detecting personally identifiable information (PII) in text data. Its flexible design allows users to create custom recognizers and models to match specific use cases, and its support for multiple languages allows for efficient PII detection in a wide range of scenarios. Additionally, the ability to use external services, ad-hoc recognizers, and pattern Recognizers from YAML files, enables users to easily incorporate new detection capabilities. Overall, Presidio's comprehensive PII detection capabilities, together with its customization options, make it an asset for organizations looking to protect sensitive data.

Meet the Data Scientist

Register for the Data Science Network's Meet the Data Scientist Presentation

If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

References

[1] Summary of privacy laws in Canada - Office of the Privacy Commissioner of Canada

[2] What is GDPR, the EU's new data protection law? - GDPR.eu

[3] How we protect the privacy and confidentiality of your personal information

[4] Pierre Lison, Ildikó Pilán, David Sánchez, Montserrat Batet, and Lilja Øvrelid, Anonymisation Models for Text Data: State of the Art, Challenges and Future Directions (2021). Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing

[5] Official Documentation: Microsoft Presidio

[6] GitHub - microsoft/presidio: Context aware, pluggable and customizable data protection and de-identification SDK for text and images

[7] PII anonymization made easy by Presidio | by Lingzhen Chen | Towards Data Science

[8] Presidio Research · spaCy Universe

[9] Evaluation of an automated Presidio anonymisation model for unstructured radiation oncology electronic medical records in an Australian setting - ScienceDirect

[10] Statistics Canada’s Trust Centre