In this tutorial, we will explore how to use Microsoft Presidio, an open-source framework designed to detect, analyze, and anonymize personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, which makes it easy to integrate into real-time applications and pipelines.
We will cover how to:
- Set up and install the necessary Presidio packages
- Detect common PII entities such as names, phone numbers, and credit card details
- Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)
- Create and register custom anonymizers (such as hashing or pseudonymization)
- Reuse anonymization mappings for consistent re-anonymization
Installing the libraries
To get started with Presidio, you will need to install the following key libraries:
- presidio-analyzer: The core library responsible for detecting PII entities in text using built-in and custom recognizers.
- presidio-anonymizer: This library provides tools to anonymize (e.g., redact, replace, hash) the detected PII using configurable operators.
- spaCy NLP model (en_core_web_lg): Presidio uses spaCy under the hood for natural language processing tasks such as named entity recognition. The en_core_web_lg model provides high-accuracy results and is recommended for PII detection in English.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
You may need to restart the session after installing the libraries if you are using Jupyter/Colab.
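As an optional sanity check (not part of the original steps), you can confirm that the spaCy model installed correctly before moving on:
import spacy
# Optional check: confirm the large English model is installed and loadable.
nlp = spacy.load("en_core_web_lg")
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']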
Presidio Analyzer
Basic PII detection
In this block, we initialize the Presidio AnalyzerEngine and run a basic analysis to detect a US phone number in a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.
The analyzer is loaded with the spaCy NLP pipeline and predefined recognizers, and it scans the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.
import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)
from presidio_analyzer import AnalyzerEngine
# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()
# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)
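Each item in results is a RecognizerResult carrying the entity type, character offsets, and a confidence score. As a small illustrative addition (not in the original code), you can slice the input string with those offsets to see exactly what was matched:
sample_text = "My phone number is 212-555-5555"
results = analyzer.analyze(text=sample_text, entities=["PHONE_NUMBER"], language="en")
# Each RecognizerResult exposes entity_type, start, end, and score.
for r in results:
    print(f"{r.entity_type}: '{sample_text[r.start:r.end]}' (score={r.score:.2f})")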
Creating a custom PII recognizer with a deny list (academic titles)
This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, which is ideal for detecting fixed terms such as academic titles (e.g., "Dr.", "Prof."). The recognizer is added to the Presidio registry and used by the analyzer to scan the input text.
Although this tutorial only covers the deny-list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For these advanced methods, see the official documentation: Adding custom recognizers.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry
# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)
# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)
# Step 3: Create analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)
# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
for result in results:
    print(result)
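Because we loaded the predefined recognizers into the same registry, the custom ACADEMIC_TITLE entity is detected alongside Presidio's built-in entities. A brief illustrative sketch (not in the original) restricts the scan to two entity types and prints the matched spans:
# Illustrative only: ACADEMIC_TITLE is our custom entity, PERSON is a Presidio built-in.
results = analyzer.analyze(text=text, entities=["ACADEMIC_TITLE", "PERSON"], language="en")
for r in results:
    print(f"{r.entity_type:15} -> '{text[r.start:r.end]}' (score={r.score:.2f})")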
Presidio Anonymizer
This code block shows how to use the Presidio AnonymizerEngine to anonymize detected PII entities in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating the output of the Presidio Analyzer. These entities represent the names "Bond" and "James Bond" in the sample text.
We use the "replace" operator to swap both names with a placeholder value ("BIP"), effectively anonymizing the sensitive data. This is done by passing an OperatorConfig with the desired anonymization strategy (replace) to the AnonymizerEngine.
This pattern can easily be extended to apply other built-in operators such as "redact", "hash", or custom pseudonymization strategies.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
# Initialize the engine:
engine = AnonymizerEngine()
# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
print(result)
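As noted above, the same call works with other built-in operators. The hedged sketch below swaps in the "redact" and "mask" operators (parameter names follow the presidio-anonymizer documentation; adjust for your installed version):
bond_results = [
    RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
    RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
]
# "redact" removes the matched text entirely (no parameters needed).
redacted = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=bond_results,
    operators={"PERSON": OperatorConfig("redact")},
)
print(redacted.text)
# "mask" replaces part of the value with a masking character.
masked = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=bond_results,
    operators={"PERSON": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 4, "from_end": True})},
)
print(masked.text)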
Custom entity recognition, hash-based anonymization, and consistent re-anonymization with Presidio
In this example, we take Presidio a step further by demonstrating:
- ✅ Defining custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based patterns
- 🔐 Anonymizing sensitive data with a custom hash-based operator (ReAnonymizer)
- ♻️ Re-anonymizing the same values consistently across multiple texts by maintaining a mapping of original → hashed values
We implement a custom ReAnonymizer operator that checks whether a given value has already been hashed and reuses the same output to preserve consistency. This is particularly useful when anonymized data must remain usable, for example, when linking records by pseudonym.
Define a custom hash-based anonymizer (ReAnonymizer)
This block defines a custom operator called ReAnonymizer, which uses SHA-256 hashing to anonymize entities and guarantees that the same input always receives the same anonymized output by storing hashes in a shared mapping.
from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict
class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")
        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")

        # Check if this value was already hashed and reuse the same output
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]

        # Hash the value (SHA-256, truncated here for readability) and store it
        hashed = "<HASH_" + hashlib.sha256(text.encode()).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
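A quick standalone check (hypothetical, run outside the Presidio engine) that repeated values map to the same hash:
# Calling operate() twice with the same shared mapping should return identical hashes.
shared_mapping = {}
op = ReAnonymizer()
first = op.operate("ABCDE1234F", {"entity_type": "IND_PAN", "entity_mapping": shared_mapping})
second = op.operate("ABCDE1234F", {"entity_type": "IND_PAN", "entity_mapping": shared_mapping})
assert first == second  # consistent re-anonymization
print(first)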
Define custom PII recognizers for PAN and Aadhaar numbers
We define two custom regex-based PatternRecognizers, one for Indian PAN numbers and one for Aadhaar numbers. These will detect the custom PII entities in your text.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)
aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)
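As a quick illustrative check (plain Python, outside Presidio), the two regexes match the sample formats used later in this tutorial:
import re
sample = "PAN ABCDE1234F, Aadhaar 1234-5678-9123"
print(re.findall(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", sample))      # ['ABCDE1234F']
print(re.findall(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", sample))  # ['1234-5678-9123']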
Configure the analyzer and anonymizer engines
Here, we set up the Presidio AnalyzerEngine, register the custom recognizers, and add the custom anonymizer operator to the AnonymizerEngine.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
# Initialize analyzer and register custom recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)
# Initialize anonymizer and add custom operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)
# Shared mapping dictionary for consistent re-anonymization
entity_mapping = {}
Analyze and anonymize the input texts
We analyze two separate texts that both contain the same PAN and Aadhaar values. The custom operator ensures that they are anonymized consistently across both inputs.
from pprint import pprint
# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."
# Analyze and anonymize first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {"DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})},
)
# Analyze and anonymize second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {"DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})},
)
View the anonymization results and mapping
Finally, we print both anonymized outputs and inspect the mapping used internally to keep hashes consistent across values.
print("📄 Original 1:", text1)
print("🔐 Anonymized 1:", anon1.text)
print("📄 Original 2:", text2)
print("🔐 Anonymized 2:", anon2.text)
print("\n📦 Mapping used:")
pprint(entity_mapping)
