Software Architect / Microsoft MVP (AI) and Pluralsight Author

AI, Azure, Cognitive Services

Removing Personally Identifiable Information (PII) with Cognitive Services Text Analytics API v3

A new capability has recently been added to the Text Analytics API. The PII endpoint can automatically identify and redact sensitive strings or entities that are associated with an individual person.

At the time of writing, the following types of personally identifiable information (PII) can be identified and redacted:

  • Phone number
  • Email address
  • Mailing address
  • Passport details

 

This capability is currently in preview and is available in most regions except China North 2 and China East. I’ve been experimenting with it, and in this blog post I’ll run through an example of it in action.

Building the Request

To use the PII endpoint, we need to construct a POST request and send it to the following URL:

https://westeurope.api.cognitive.microsoft.com/text/analytics/v3.1-preview.3/entities/recognition/pii?

Sample Input

The endpoint expects the content that we want to classify as JSON in the body of the request. Here is an example of some input that we want to classify:

{
  "documents": [
    {
      "language": "en",
      "id": "1",
      "text": "The persons name is John Doe."
    },
    {
      "language": "en",
      "id": "2",
      "text": "He resides at 123 Main Street. His phone number was 12345677"
    }
  ]
}

We also need to supply our Cognitive Services key in the Ocp-Apim-Subscription-Key request header.

Armed with the URL, key and body, we can then send the request.
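If you would rather script the call than use a GUI client, the same three pieces can be assembled in a few lines of Python. This is a minimal sketch rather than an official client: build_pii_request is a hypothetical helper name, the region prefix in the URL is taken from the example above and will differ per resource, and the code only builds the request so it can be inspected without a real key.

```python
import json
import urllib.request

# Endpoint from this post; the "westeurope" prefix depends on your own resource.
PII_URL = (
    "https://westeurope.api.cognitive.microsoft.com"
    "/text/analytics/v3.1-preview.3/entities/recognition/pii"
)


def build_pii_request(texts, key, language="en"):
    """Build (but do not send) a POST request for the PII endpoint.

    Hypothetical helper: ids are assigned as "1", "2", ... in input order.
    """
    documents = [
        {"language": language, "id": str(i), "text": text}
        for i, text in enumerate(texts, start=1)
    ]
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        PII_URL,
        data=body,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,  # Cognitive Services key
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_pii_request(
        ["The persons name is John Doe."], key="<your-key>"
    )
    # Sending needs a real key and network access, so it is left commented out:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
    print(request.get_method(), request.full_url)
```

Keeping the build step separate from the send step makes it easy to check the headers and body before spending any API quota.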

An Example with Postman

Here we are using the JSON text from above and supplying it as the body in Postman:

After clicking Send, we can see the API has identified the personally identifiable information in both documents (1 and 2):

The response also includes the redacted text for each document that we passed in through our original POST request.

For reference, the entire JSON response is below:

{
  "documents": [
    {
      "redactedText": "The persons name is ********.",
      "id": "1",
      "entities": [
        {
          "text": "John Doe",
          "category": "Person",
          "offset": 20,
          "length": 8,
          "confidenceScore": 0.92
        }
      ],
      "warnings": []
    },
    {
      "redactedText": "He resides at ***************. His phone number was ********",
      "id": "2",
      "entities": [
        {
          "text": "123 Main Street",
          "category": "Address",
          "offset": 14,
          "length": 15,
          "confidenceScore": 0.65
        },
        {
          "text": "12345677",
          "category": "Phone Number",
          "offset": 52,
          "length": 8,
          "confidenceScore": 0.8
        }
      ],
      "warnings": []
    }
  ],
  "errors": [],
  "modelVersion": "2020-07-01"
}

Another point worth mentioning is that you also get the location of each piece of PII (its offset and length) and an associated confidence score for each item.
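The offset, length and confidence score make it possible to apply your own redaction policy instead of relying only on redactedText. The sketch below is one way to do that; redact and its min_confidence parameter are hypothetical names of mine, and it assumes the offsets can be used directly as Python string indices, which holds for the ASCII samples here but may not for text containing characters outside the basic range.

```python
def redact(text, entities, min_confidence=0.0, mask="*"):
    """Mask each entity span in `text` using its offset and length.

    Entities scoring below `min_confidence` are left untouched.
    `mask` is assumed to be a single character.
    """
    chars = list(text)
    for entity in entities:
        if entity["confidenceScore"] < min_confidence:
            continue
        start = entity["offset"]
        end = start + entity["length"]
        chars[start:end] = [mask] * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    # Document 2 and its entities, copied from the response above.
    document = "He resides at 123 Main Street. His phone number was 12345677"
    entities = [
        {"text": "123 Main Street", "category": "Address",
         "offset": 14, "length": 15, "confidenceScore": 0.65},
        {"text": "12345677", "category": "Phone Number",
         "offset": 52, "length": 8, "confidenceScore": 0.8},
    ]
    # With no threshold this reproduces the service's redactedText.
    print(redact(document, entities))
    # With a threshold of 0.7 only the phone number is masked.
    print(redact(document, entities, min_confidence=0.7))
```

A threshold like this is a trade-off: raising it reduces false positives but risks leaving genuine PII in the text, so it is worth checking against your own data.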

Healthcare Prediction

Three additional endpoints (also in preview) let you submit clinical data to recognise healthcare-related information such as drugs, conditions, and symptoms.

Just like the PII endpoint, you supply the data you want to classify in the body of the request and send a POST request:

"documents": [
{
  "id": "1",
  "language": "en",
  "text": "Subject is taking 100mg of ibuprofen twice daily."
}
]

The subtle difference here is that the POST request does not return the analysis directly. To fetch the results, you need to send a subsequent GET request to another endpoint. The following JSON fragment contains a sample response when passing in the above text:

"results": {
 "documents": [
{
 "id": "1",
 "entities": [
{
 "offset": 18,
 "length": 5,
 "text": "100mg",
 "category": "Dosage",
 "confidenceScore": 0.99,
 "isNegated": false
},
{
 "offset": 27,
 "length": 9,
 "text": "ibuprofen",
 "category": "MedicationName",
 "confidenceScore": 1.0,
 "isNegated": false,

You can see from the above JSON that the dosage and the name of the medication have been identified, each with its category.
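Because the results arrive through a separate GET request, a client usually has to poll until the analysis is ready. The helper below is a rough sketch of that loop under some assumptions of mine: fetch_status is a hypothetical callable that performs the GET and returns the parsed JSON, and the status values ("notStarted", "running", "succeeded", "failed") are what I would expect from the service but should be checked against the documentation. Only the "results" key is taken from the fragment above.

```python
import time


def wait_for_result(fetch_status, poll_seconds=1.0, max_attempts=30,
                    sleep=time.sleep):
    """Call `fetch_status()` until the job reports a terminal status.

    `fetch_status` is a hypothetical callable wrapping the GET request;
    the status strings are assumptions, not confirmed by this post.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        status = job.get("status")
        if status == "succeeded":
            return job["results"]
        if status == "failed":
            raise RuntimeError(f"analysis job failed: {job.get('errors')}")
        sleep(poll_seconds)  # job still pending, wait before the next GET
    raise TimeoutError("analysis job did not finish in time")


if __name__ == "__main__":
    # Stand-in for the real GET request: two pending polls, then a result.
    fake_responses = iter([
        {"status": "notStarted"},
        {"status": "running"},
        {"status": "succeeded",
         "results": {"documents": [{"id": "1", "entities": []}]}},
    ])
    results = wait_for_result(lambda: next(fake_responses),
                              sleep=lambda _: None)
    print(results["documents"][0]["id"])
```

Passing the fetch and sleep functions in as parameters keeps the loop testable without a network connection or real delays.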

Years ago, I built several interfaces that parsed electronic patient records from several external systems. The data was often in multiple formats but was mashed together to present a master patient index.

For the most part, these were XML, CSV or EDI type files and were structured. Occasionally I’d get free-form text, which was harder to parse. A service like this would have made that job easier.

You can find out more information about this endpoint here.

Summary

In this blog post we’ve looked at some of the new capabilities that have shipped with v3 of the Text Analytics API.

If you’re interested in trying this out, you can learn more here.

Have a question or are you thinking of using this API?

Drop me a message below or contact me on Twitter.
