… one can make a pretty good guess on the basis of very little information about you, casually mentioned in small talk, combined with general demographics available in the public domain. To illustrate what is possible, we perused the highly anonymized tables from OpenData Berlin to compute the probability of you living in any of Berlin’s 447 neighborhoods. A score of 1.0 means highest probability and 0.0 means lowest.
Are we right?
Yes, of course we could do better. Using the same criteria to identify the
neighborhood (h) you live in, namely sex (s), age (a), migration background (b),
duration of residence (d) and quality of housing (q), we would simply need
to know more numbers. More in this context means precisely 6,007,680
of them (13,440 attribute combinations for each of the 447 neighborhoods),
together representing the full multivariate probability distribution
p(h, s, a, b, d, q). We don't know all these numbers, simply because it is
the very duty of the public service that publishes demographic data to prevent
exact localization by lumping attribute combinations together into much broader categories.
Rather than trying to gather more fine-grained data (an activity requiring considerable skill but of questionable
morality), we want to showcase other qualities here and demonstrate what can be
done solely with data that is willingly (and knowingly) given. From that data,
a somewhat reduced model for the full probability distribution emerges.
This model is graphically represented by the belief network shown in
the figure above. The graph encodes a specific factorization of p(h, s, a, b, d, q) into conditional probabilities. Given this
specific factorization, and exploiting some basic mathematical properties of
probabilities, it is then possible to formulate an expression for the conditional
probability p(h|s, a, b, d, q) of you living in any of Berlin's 447 neighborhoods
given a certain set of attributes as:
where Z stands for a normalization constant, which ensures that all probabilities
sum to unity. The right-hand side conditional probabilities, for example the
probability p(b|h) that you are of a certain migration background given that
you live in neighborhood h, are then simply read from publicly available tables of
neighborhood-resolved demographic data. Finally, the last factor, p(h), is nothing
but the prior probability that you live in a certain neighborhood, which is
given by the number of people living in that 'hood divided by the total number of
Berlin residents.
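To make the recipe concrete, here is a minimal sketch in Python. It assumes the simplest factorization consistent with the description above, in which every attribute depends only on the neighborhood, so that p(h|s, a, b, d, q) = p(s|h) p(a|h) p(b|h) p(d|h) p(q|h) p(h) / Z; the belief network in the figure may encode additional dependencies. The function name, the three neighborhoods "A", "B", "C" and all numbers are made up for illustration and are not taken from the Berlin tables.

```python
from math import prod


def neighborhood_posterior(population, cond_tables, observed):
    """Return p(h | observed attributes) for every neighborhood h.

    population:  {neighborhood: number of residents}
    cond_tables: {attribute: {neighborhood: {value: p(value | h)}}}
    observed:    {attribute: observed value}

    Assumes each attribute depends only on the neighborhood
    (a naive-Bayes-style factorization, which is an assumption here).
    """
    total = sum(population.values())
    # Prior p(h): residents of h divided by all residents.
    prior = {h: n / total for h, n in population.items()}
    # Unnormalized posterior: p(h) times one factor p(x | h) per attribute.
    unnormalized = {
        h: prior[h]
        * prod(cond_tables[attr][h][value] for attr, value in observed.items())
        for h in population
    }
    # Normalization constant Z makes the probabilities sum to one.
    z = sum(unnormalized.values())
    return {h: u / z for h, u in unnormalized.items()}


# Made-up toy data: three hypothetical neighborhoods, two attributes.
population = {"A": 1000, "B": 3000, "C": 6000}
cond_tables = {
    "s": {
        "A": {"f": 0.5, "m": 0.5},
        "B": {"f": 0.6, "m": 0.4},
        "C": {"f": 0.4, "m": 0.6},
    },
    "b": {
        "A": {"yes": 0.2, "no": 0.8},
        "B": {"yes": 0.5, "no": 0.5},
        "C": {"yes": 0.1, "no": 0.9},
    },
}

posterior = neighborhood_posterior(population, cond_tables, {"s": "f", "b": "yes"})
for h, p in sorted(posterior.items(), key=lambda item: -item[1]):
    print(f"{h}: {p:.3f}")
```

In this toy example the largest neighborhood, "C", has the highest prior, yet "B" ends up most probable because its conditional probabilities match the observed attributes much better. With the real tables, the same loop would simply run over all 447 neighborhoods and all five attributes.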
All data used for this project are publicly available from the Berlin
Open Data initiative under the license. In particular, we have used this
geographic and most of these demographic data sets, all published by the
"Amt für Statistik Berlin-Brandenburg".