Tuesday 20 November 2012

Understanding PII

The most important question in data management related to privacy is when a given set of data contains personally identifiable information (PII). While PII is fairly well defined (maybe not a strict formal defintion), often the question of what constitutes PII and how to handle it remains. In this article I intend to describe some notions on the linkability and tracability of identifiers and their presence in a data set.

It is worth mentioning that the definition of PII is currently being addressed by the EU Article 29 Working Party which will eventually bring additional clarification at a legal level beyond that existing the the current privacy directives. Ultimately however, the implementation of privacy is in the hands of software and system architects, designers and programmers.

I'll start with a (personal) definition: data is personally identifiable if, given any record of data, are there fields that contain data that can be linked with a unique person, or, a sufficiently small group with a common, unique attribute or characteristic.

We are required then to look at a number of issues:
  • what identifiers or identifying data is present
  • how linkable the data is to a unique person
  • how traceable a unique person is through the data set
Much is based upon the presence of identifiers in a data set and the linkability of those identifiers. For example, the presence of user account identifiers, social security numbers are very obviously in a one-to-one correspondence with a unique person and thus are trivially linkable to a unique user. Identifiers such as IMEI or IP addresses do not have this one-to-one correspondence. There are some caveats with situations such as  when one user account or a device is shared or used occasionally (with permission perhaps) by more than one person.

The notion of a unique person does not necessarily mean that an actual human being can be identified but rather can be traced, unambiguously through the data. For example, an account identifier (eg: username) identifies a unique person but not a unique human being - this is an important, if subtle distinction.

We can construct a simple model of identifiers and how these identifiers relate to each other and to "real world" concepts - some discussion on this was made in earlier articles. These models of identifiers and their relationships and semantics are critical to correct understanding of privacy and analysis of data. Remarkably this kind of modelling is very rarely made and very rarely understood or appreciated - at least until time comes to cross-reference data and these models are missing (cf: semantic isolation).
Example Semantic Model of Identifier Relationships

Within these models as seen earlier, we can map a "semantic continuum" from identifiers that are highly linkable to unique and even identified persons to those which do not.

Further along this continuum we have identifiers such as various forms of IMEI, telephone number, IP addresses and other device identifiers. Care must be made in that while these identifiers are not highly linkable to a unique person, devices and especially mobile devices are typically used by a single unique person leading to a high degree of inferred linkability.

Device addresses such as IP addresses have come under considerable scrutiny regarding whether they do identify a person. In cases where an IP address of a router has been used to identify is someone has been downloading copyrighted or other illegal material is problematical for the reasons described earlier regarding linkability. In this specific case of IP addresses, network address translation, proxies and obfuscation/hiding mechanisms such as Tor complicate and minimise linkability.

As we progress further along we reach the application and session identifiers. Certainly application identifiers if they link to applications and not individually deployed instances of applications are not PII unless the application has a very limited user base: an example of a sufficiently small,  group with common characteristics. For example, an identifier such as "SlighlyMiffedBirdsV1.0" used across a large number of deployments is very different from "SlighlyMiffedBirdsV1.0_xyz" where xyz is issued uniquely to each download of that application. Another very good example of this kind of personalisation of low linkability identifiers is the user agent string used to identify web browsers.

Session identifiers ostensibly do not link to a person but can reveal a unique person's behaviour. On their own they do not constitute PII. However session identifiers are invariably used in combination with other identifiers which does increase the linkability significantly. Session identifiers are highly traceable in that sessions are often very short with respect to other identifiers - capture enough sessions and one can fingerprint or infer common behaviour of individual persons.

When evaluating PII, Identifiers are often taken in isolation and analysis made there. This is one of the main problems in evaluating PII: Identifiers rarely exist in isolation and a combination of identifiers together reveals a unique identity.

Just dealing with identifiers alone, and even deciding what identifiers are in use provides the core of deciding whether a given data set PII. However identifiers provide the linkability, they do not provide tracability which is given through the temporal components which is often, if not, invariably present in data sets. We will deal with other dimensions of data later where we look deeper into temporal, location and content data. Furthermore we will also look at how data can be transformed into other types of data to improve traceability and linkability.

No comments: