We all know this feeling when walking the streets of Berlin. But why is that? What makes one neighborhood different from others? And what makes two neighborhoods similar? Many factors determine the answers to these questions, most of them highly subjective. The demographics for the years 2007-2015, publicly available from OpenData Berlin, provide a more objective measure of just how similar or dissimilar Berlin’s 447 neighborhoods are and what exactly makes them so. Modern methods of machine learning then reveal that there are, in fact, only a few distinct groups (or clusters, as they are called).
The color in the map indicates which of a total of 4 distinct clusters a
neighborhood belongs to. The saturation of the color indicates how typical
a neighborhood is of the respective group. Full saturation and a (relative)
distance of 0% to the cluster mean mark the ‘hood most representative
of its group, while a whitish color and a (relative) distance of 100% to
the cluster mean mark the ‘hood least typical of its group. On the right,
you can see the two attributes that most strikingly highlight the differences
between clusters of neighborhoods: migration background and quality of housing.
This is really the key question, and the answer depends on several factors. One is certainly what exactly we know about Berlin’s 447 neighborhoods. In our case, we resorted to publicly available data counting residents by gender, age, migration background, quality of housing, and how long they have lived in their neighborhood. Of course, one could add more data from other sources but, as it turns out, what we have is quite enough.
The second key factor is what measure of similarity or affinity to use. This choice is critical for revealing structure in the underlying data (provided there is any, of course). In our case, we first converted the raw counts in each ‘hood to percentages for each year. The absolute number of people living in any given neighborhood, and how that number might have changed over the years, thus no longer enters the equation. Each of the five attributes listed above was given equal weight. For example, the percentages of males and females sum to 100% (what else?) in each year, the percentages in each age group sum to 100% (duh!) in each year, and so on. All these numbers taken together then make up all that we know about a ‘hood. To tell how different it is from another, we then chose to compute the Euclidean distance (the geometrical distance most of us are intimately familiar with from high school) between all pairs of neighborhoods.
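To make the two steps just described concrete, here is a minimal sketch in plain Python. It is not the code used for this project: the neighborhood names, the two attributes, and all counts are invented, and a real run would cover all five attributes and every year.

```python
import math

# Hypothetical head counts for one year; neighborhood names, attributes,
# and numbers are invented for illustration, not taken from the Berlin data.
raw_counts = {
    "hood_a": {"gender": [480, 520], "age": [200, 500, 300]},
    "hood_b": {"gender": [2400, 2600], "age": [1000, 2500, 1500]},
    "hood_c": {"gender": [510, 490], "age": [100, 400, 500]},
}


def to_percentages(attributes):
    """Turn each attribute's counts into percentages that sum to 100."""
    features = []
    for name in sorted(attributes):  # fixed order so vectors line up
        counts = attributes[name]
        total = sum(counts)
        features.extend(100.0 * count / total for count in counts)
    return features


def euclidean(u, v):
    """Plain Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


names = sorted(raw_counts)
vectors = {name: to_percentages(raw_counts[name]) for name in names}

# Full pairwise distance matrix, stored as a nested dict.
distances = {
    a: {b: euclidean(vectors[a], vectors[b]) for b in names} for a in names
}

for a in names:
    print(a, [round(distances[a][b], 2) for b in names])
```

In this toy example, hood_b has five times as many residents as hood_a but exactly the same percentages, so their distance is 0. That is the sense in which the absolute number of residents drops out of the comparison.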
The resulting distance matrix then serves as input to a clustering algorithm. Clustering algorithms are a family of machine-learning techniques that seek to divide data points (neighborhoods in our case) into distinct groups so that members of one group are more similar to each other than to members of all other groups by some measure. Specifically, we used affinity propagation to perform the clustering and looked at the silhouette coefficient to optimize the parameters that govern the performance of that algorithm. As you can see above, the result is 4 distinct clusters of neighborhoods. With gender, duration of residence and age distributions all being quite similar between them, they differ mostly in the migration background of their residents and the prevalent quality of housing.
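For readers who want to try this themselves, the following sketch shows one way such a clustering step could look. Everything specific in it is an assumption: we do not state above which library was used, so the sketch picks scikit-learn's `AffinityPropagation` and `silhouette_score`. It runs on synthetic points rather than the Berlin data, turns distances into similarities by taking the negative squared distance, and tunes only the `preference` parameter over three arbitrary candidate values.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the neighborhood feature vectors: three blobs of
# 20 points each in 5 dimensions. These are NOT the Berlin data.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
points = np.vstack([c + rng.normal(scale=1.0, size=(20, 5)) for c in centers])

# Pairwise Euclidean distances; similarities are negative squared distances
# (an assumed convention, affinity propagation needs "larger = more similar").
diff = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))
similarities = -(distances ** 2)

# Candidate values for the "preference" parameter (assumed tuning knob).
candidates = [
    similarities.min(),
    np.percentile(similarities, 10),
    np.median(similarities),
]

best_score, best_preference, best_labels = None, None, None
for preference in candidates:
    model = AffinityPropagation(
        affinity="precomputed",
        preference=preference,
        damping=0.7,
        max_iter=1000,
        random_state=0,
    )
    labels = model.fit_predict(similarities)
    n_clusters = len(set(labels))
    # The silhouette coefficient needs at least 2 clusters and fewer
    # clusters than points; this also skips runs that did not converge.
    if n_clusters < 2 or n_clusters >= len(points):
        continue
    score = silhouette_score(distances, labels, metric="precomputed")
    if best_score is None or score > best_score:
        best_score, best_preference, best_labels = score, preference, labels

if best_score is not None:
    print("clusters:", len(set(best_labels)), "silhouette:", round(best_score, 3))
```

The silhouette coefficient ranges from -1 to 1, with higher values indicating tighter, better separated clusters, so the loop simply keeps the parameter setting that scores highest.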
All data used for this project are publicly available from the Berlin Open Data initiative under the license. In particular, we have used this geographic and most of these demographic data-sets, all published by the "Amt für Statistik Berlin-Brandenburg".