Data collection guide

This page assumes you've already read the introductory page about data collection.

When creating your own label you may run into some questions. For example, is a piece of data you're collecting personal or sensitive? Let's look closer at the distinctions.

What data should I count?

Your organisation might collect all kinds of data, but not all of it will say something about the reader of the label (whom we will call the 'user' here). Examples include data about the movement of the stars, or about forklifts. Data that does not directly affect the intended reader of the label does not have to be taken into account when creating the label.

What is aggregated data?

It's "non-personal" data. Anonymous and pseudonymous data fit here.

You may have access to data that describes a collection of people, but not an individual. For example, imagine a list that shows the average income level for every area code in your country. If you receive the area code from a user, this can reveal something about them. Technically you won't know the exact income of the user (which would be personal data), but this estimate could still inform the choices your organisation makes about the person. Therefore we felt it would be prudent to have the label disclose when this type of 'vague' data is used.
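The area code example can be sketched in a few lines of Python. This is only an illustration: the records, the area codes, and the function name `average_income_by_area` are hypothetical and not part of the label itself.

```python
from statistics import mean

# Hypothetical individual records: (area_code, yearly_income).
# At this level each row describes one person, so it is personal data.
records = [("1011", 30000), ("1011", 50000), ("2022", 40000)]


def average_income_by_area(rows):
    """Collapse individual incomes into one average per area code."""
    grouped = {}
    for area, income in rows:
        grouped.setdefault(area, []).append(income)
    return {area: mean(incomes) for area, incomes in grouped.items()}


averages = average_income_by_area(records)

# A user who only shares their area code gets matched to an estimate,
# not to their exact income.
estimate_for_user = averages["1011"]
print(estimate_for_user)
```

The lookup at the end is the step the label asks you to disclose: the estimate is not the user's real income, but it can still steer decisions about them.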

What is personal data?

The tricky thing is: it depends. Any data that is linked to a specific person, and makes that person identifiable, becomes personal data.

For example, a list of addresses by itself isn't exciting. But if that list connects each address to a name, then the address has become personally identifiable information.

EXAMPLE

Let's say you have a weight-loss database that stores two things: first names, and how much these people weigh each month. For example: Mikanos: 82 kg in February, 85 kg in March, etc.

The name "Mikanos" is, by itself, not personally identifiable information, as there are many people in society called Mikanos. But if the name is stored in the context of a local sports club, then it would be easy to figure out which Mikanos the data belongs to.

As a guideline: if the data can point to at most 10 specific people, then it's wise to consider it personal data. So if there are 9 people in the sports club called Mikanos, then it's personal data.
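The guideline above boils down to a count within a context. The sketch below is a minimal illustration, assuming you can list the values stored in that context; the names `MAX_GROUP_SIZE` and `is_personal_data` and the sample club members are hypothetical.

```python
from collections import Counter

# Guideline from the text: a value that points to at most 10 specific
# people is treated as personal data. The constant name is an assumption.
MAX_GROUP_SIZE = 10


def is_personal_data(value, values_in_context):
    """Return True if `value` matches between 1 and MAX_GROUP_SIZE people."""
    matches = Counter(values_in_context)[value]
    return 0 < matches <= MAX_GROUP_SIZE


# Hypothetical sports club: 9 members called Mikanos plus 40 other members.
club_first_names = ["Mikanos"] * 9 + [f"Member{i}" for i in range(40)]

# Hypothetical larger context in which 500 people share the name.
city_first_names = ["Mikanos"] * 500

print(is_personal_data("Mikanos", club_first_names))
print(is_personal_data("Mikanos", city_first_names))
```

The same first name is classified differently depending on the context it is stored in, which is the point of the sports club example.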

Some pieces of data that are usually personal data:

  • Identification numbers, such as:
    • IP address
    • Cookie ID
    • Government ID
    • Phone number
    • Bank account number
  • Photo
  • Email address
  • Login name
  • Address
  • Current location
  • Age
  • Work / Employer
  • Education level

When does personal data become sensitive?

Once again: it depends. Different countries have different views. Luckily, there are some data types that all European countries agree are sensitive. These are the so-called "special categories" of data:

  • Racial or ethnic origin
  • Political views
  • Religion or philosophy
  • Membership of a trade union
  • Medical data
  • Genetic data (such as a person's DNA)
  • Biometric data (such as a fingerprint)
  • Sexual behaviour or sexual orientation

If data falls into one of these categories, it is definitely sensitive.

On top of that, there is more data that is considered sensitive. The difference is that not all European countries agree on it, or have officially labelled it as such. For example, income level is considered sensitive in most of Europe, but not in Finland.

When creating a privacy label, we ask you to err on the side of caution and take all European countries into account. For example, since income level is considered sensitive in most European countries, please treat it as such, even if you only operate in Finland. People outside Finland would still use the label to inform themselves.

With that in mind, these categories may also be considered sensitive:

  • Financial details, such as income, expenditure, and loans.
  • Data you were asked to keep secret, such as passwords.
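The two lists above can be turned into a simple lookup. This is a sketch under stated assumptions: the category strings, the function name `sensitivity`, and the returned labels are hypothetical and chosen only for illustration.

```python
# Categories the text lists as sensitive in all European countries
# (the "special categories").
SPECIAL_CATEGORIES = {
    "racial or ethnic origin",
    "political views",
    "religion or philosophy",
    "membership of a trade union",
    "medical data",
    "genetic data",
    "biometric data",
    "sexual behaviour or sexual orientation",
}

# Categories the text says may also be considered sensitive.
CAUTIOUS_CATEGORIES = {
    "financial details",
    "data you were asked to keep secret",
}


def sensitivity(category):
    """Classify a data category; the returned labels are illustrative."""
    key = category.strip().lower()
    if key in SPECIAL_CATEGORIES:
        return "sensitive (special category)"
    if key in CAUTIOUS_CATEGORIES:
        return "sensitive (err on the side of caution)"
    return "not listed as sensitive"


print(sensitivity("Medical data"))
print(sensitivity("Financial details"))
```

A category that is missing from both sets is not automatically harmless; the lookup only mirrors the lists in this guide.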

Determining the origin of data

The label distinguishes five data sources:

  • Data provided by the user
  • Data provided by other parties
  • Observed data
  • Created data
  • Paid data


Let's look at some distinctions between these options.

When to use "paid" instead of "provided by others"

If you in any way paid for access to a type of data, then this category must be selected. Paid can refer to a number of things:

  • Literally purchasing a dataset.
  • "Renting" or "leasing" data.
  • Having access to a piece of software that enriches your data or workflow. 

In other words, if some type of financial transaction took place to gain access to data, then use 'paid'. Here, financial transactions also include things like paying with bitcoin, or paying by offering a specific amount of your own services or products in return (swapping).

The "provided by others" category is meant to describe non-profit data exchange between or with government organisations, or within a consortium, often because there is a legal requirement to do so. If you use open data, then this also falls into this category. However, if you scrape data from "public" sources yourself, then this falls into the 'created' category, since you are the organisation that created this dataset.
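The choice between 'paid', 'created', and "provided by others" can be sketched as a small decision function. The names `Source` and `classify_external_data` are hypothetical, and checking payment first is an assumption based on the statement that 'paid' must be selected whenever any payment took place.

```python
from enum import Enum


class Source(Enum):
    PAID = "paid"
    CREATED = "created"
    PROVIDED_BY_OTHERS = "provided by others"


def classify_external_data(paid_in_any_form, scraped_by_you):
    """Pick a label category for data that did not come from the user.

    paid_in_any_form: money, bitcoin, or a swap of services or products.
    scraped_by_you: you collected the data from "public" sources yourself.
    """
    if paid_in_any_form:
        return Source.PAID
    if scraped_by_you:
        return Source.CREATED
    # Non-profit exchange, legal requirement, consortium, or open data.
    return Source.PROVIDED_BY_OTHERS


# Open data that you received for free and did not scrape yourself.
print(classify_external_data(paid_in_any_form=False, scraped_by_you=False).value)
```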

When is data "created"?

Sometimes you create new data or new datasets. Let's look at some real-world situations:

  • You combine existing data about a user to create a new piece of data. For example, you use a customer's purchase history to assign a label to that customer (for example: "VIP"), or you use car data to create a likelihood score that the car will be driven safely. Any "derived" or "inferred" data fits this category.
  • You take messy data and create a clean dataset from it. In the context of the label you have then also created data. For example, if you scrape the internet for data, the data technically already existed, but you have made it actionable by processing it.
  • You create anonymised data. You are then also creating a new dataset; in this case, you have created 'aggregated data'.
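The "VIP" example can be sketched as follows. The threshold, the label names, and the function name `assign_customer_label` are hypothetical; the point is only that the output is a new piece of data that the customer never provided.

```python
def assign_customer_label(purchase_amounts, vip_threshold=1000.0):
    """Derive a customer label from purchase history.

    The returned label is 'created' data: it is inferred from existing
    data. The threshold value is an arbitrary illustration.
    """
    total_spent = sum(purchase_amounts)
    return "VIP" if total_spent >= vip_threshold else "regular"


# Hypothetical purchase histories.
print(assign_customer_label([600.0, 500.0]))
print(assign_customer_label([20.0]))
```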

Observed data - it's not just about cameras

The use of cameras to record people is a clear example of observed data. But observing online behaviour also falls into this category. Some examples:

  • Security camera footage
  • Sensor data (GPS location, smart home/office sensors)
  • IP addresses, cookie data, and other identifiers.
  • Online behaviour (mouse movements, visited webpages)
  • Metadata (timestamps, login and logout time)

A useful distinction:

With provided data, the user is highly aware that they have provided the data, and it usually took place at a specific moment. For example, submitting an online form or posing for a photo.

With observed data, a user might not know, might forget, or might simply not (continuously) realise that they are providing data, even if they have been told this will happen. This type of data collection is usually more continuous and/or automatic, and happens "in the background".

Provided photo

Observed photo