… one can make a pretty good guess on the basis of very little information about you, the kind casually mentioned in small talk, combined with general demographics available in the public domain. To illustrate what is possible, we used the highly anonymized tables from OpenData Berlin to compute the probability of you living in each of Berlin’s 447 neighborhoods. A score of 1.0 marks the highest probability and 0.0 the lowest.
Are we right?
Yes, of course we could do better. Using the same criteria to identify the neighborhood (h) you live in, namely sex (s), age (a), migration background (b), duration of residence (d) and quality of housing (q), we would simply need to know more numbers. More in this context means precisely 6,007,680 of them, together representing the full multivariate probability distribution p(h, s, a, b, d, q). We don't know all these numbers, because it is the very duty of the public service that publishes demographic data to prevent exact localization by lumping attribute combinations together into much broader categories. Rather than trying to gather more fine-grained data (an activity requiring considerable skill but of questionable morality), we want to showcase other qualities here and demonstrate what can be done using only data that is willingly (and knowingly) given.
From that data, a somewhat reduced model for the full probability distribution emerges. This model is graphically represented by the belief network shown in the figure above. The graph encodes a specific factorization of p(h, s, a, b, d, q) into conditional probabilities. Given this factorization, and exploiting some basic mathematical properties of probabilities, it is then possible to formulate an expression for the conditional probability p(h|s, a, b, d, q) of you living in any of Berlin's 447 neighborhoods given a certain set of attributes as:
where Z stands for a normalization constant, which ensures that all probabilities sum to unity. The right-hand side conditional probabilities, for example the probability p(b|h) that you are of a certain migration background given that you live in neighborhood h, are then simply read from publicly available tables of neighborhood-resolved demographic data. Finally, the last factor, p(h), is nothing but the prior probability that you live in a certain neighborhood, which is given by the number of people living in that 'hood divided by the total number of Berlin residents.
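The computation described above can be sketched in a few lines of Python. This is only an illustration, assuming the simplest possible factorization, in which every attribute depends on the neighborhood alone (a naive-Bayes structure), so that the posterior is proportional to p(s|h) p(a|h) p(b|h) p(d|h) p(q|h) p(h). The belief network in the figure may encode additional dependencies. The function names, the table layout and all numbers below are hypothetical and made up for the example; none of them are taken from the Berlin data. The rescaling to a 0.0 to 1.0 score is likewise a guess at how such a score could be derived.

```python
import numpy as np


def neighborhood_posterior(resident_counts, cond_tables, observed):
    """Posterior p(h | attributes) under an ASSUMED naive-Bayes factorization.

    resident_counts: sequence of length H, residents per neighborhood.
    cond_tables:     dict attribute -> (H, K) table, row h holds p(value | h).
    observed:        dict attribute -> index of the observed category.
    """
    counts = np.asarray(resident_counts, dtype=float)
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        # Prior p(h): residents of h divided by all residents.
        log_post = np.log(counts / counts.sum())
        # Multiply in one factor p(attribute value | h) per observed attribute.
        for attr, value in observed.items():
            table = np.asarray(cond_tables[attr], dtype=float)
            log_post = log_post + np.log(table[:, value])
    # Exponentiate stably, then divide by Z so the probabilities sum to one.
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


def rescale(posterior):
    """Map probabilities to [0, 1] scores (assumed min-max rescaling)."""
    lo, hi = posterior.min(), posterior.max()
    if hi == lo:  # all neighborhoods equally likely
        return np.ones_like(posterior)
    return (posterior - lo) / (hi - lo)


# Toy data: 3 made-up neighborhoods, 2 attributes with 2 categories each.
resident_counts = [1000, 3000, 6000]
cond_tables = {
    "sex": [[0.5, 0.5], [0.4, 0.6], [0.6, 0.4]],
    "age": [[0.2, 0.8], [0.5, 0.5], [0.7, 0.3]],
}
observed = {"sex": 0, "age": 1}

posterior = neighborhood_posterior(resident_counts, cond_tables, observed)
scores = rescale(posterior)
print(posterior, scores)
```

Working in log space avoids numerical underflow when many small factors are multiplied, which matters more with 447 neighborhoods and five attributes than in this toy case.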
All data used for this project are publicly available from the Berlin Open Data initiative under the license. In particular, we have used this geographic and most of these demographic data-sets, all published by the "Amt für Statistik Berlin-Brandenburg".