Others just feel wrong.

We all know this feeling when walking the streets of Berlin. But why is that?
What makes one neighborhood different from others? And what makes two neighborhoods
similar? Many factors determine the answers to these questions, most being highly
subjective. The demographics for the years 2007-2015, publicly available from
OpenData Berlin,
provide a more objective measure for just how similar or dissimilar Berlin’s 447
neighborhoods are and what exactly makes them so. Modern methods of machine learning
then reveal that there are, in fact, only a few distinct groups — or *clusters*, as they are called.

The color in the map indicates which of a total of 4 distinct clusters a
neighborhood belongs to. The saturation of the color indicates how typical
a neighborhood is of the respective group. Full saturation and a (relative)
distance of 0% to the cluster mean signify that a ‘hood is the most representative
of its group, while a whitish color and a (relative) distance of 100% to
the cluster mean indicate the ‘hood least typical of its group. On the right,
you can see the two attributes that most strikingly highlight the differences
between clusters of neighborhoods: migration background and quality of housing.
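
To make the color coding concrete, here is a minimal sketch of how such a relative distance could be computed within one cluster. It assumes min-max scaling, so that the member closest to the cluster mean gets 0% and the farthest gets 100%. The exact normalization behind the map is not stated here, so treat that as an assumption. The function name, the feature vectors and the saturation mapping are all hypothetical.

```python
import math

def relative_distances(members):
    """Map each member of one cluster to a relative distance in [0, 100].

    Assumption: min-max scaling of the Euclidean distance to the cluster mean.
    """
    dims = len(next(iter(members.values())))
    mean = [sum(v[d] for v in members.values()) / len(members) for d in range(dims)]
    raw = {name: math.dist(v, mean) for name, v in members.items()}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    return {
        name: 100.0 * (r - lo) / span if span > 0 else 0.0
        for name, r in raw.items()
    }

# Hypothetical cluster of three neighborhoods with two percentages each.
cluster = {"A": [50.0, 50.0], "B": [52.0, 48.0], "C": [60.0, 40.0]}
relative = relative_distances(cluster)

# Hypothetical mapping to color saturation: 0% distance means full saturation.
saturation = {name: 1.0 - pct / 100.0 for name, pct in relative.items()}
```

In this toy cluster, B lies closest to the mean and would be drawn fully saturated, while C lies farthest away and would appear whitish.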

This is really the key question and the answer depends on several factors. One is certainly what exactly we know about Berlin’s 447 neighborhoods. In our case, we resorted to publicly available data counting residents by gender, age, migration background, quality of housing, and how long they have lived in their neighborhood. Of course one could add more data from other sources but, as it turns out, what we have is quite enough.

The second key factor is what measure of similarity or *affinity* to use. This
choice is critical for revealing structure in the underlying data (provided
there is any, of course). In our case, we first converted the raw counts in
each ‘hood to percentages for each year. The absolute number of people
living in any given neighborhood, and how that number might have changed
over the years, thus no longer enters the equation. Each of the five attributes
listed above was given equal weight: the percentages of males
and females sum to 100% (what else?) in each year, the percentages in each
age group sum to 100% (duh!) in each year, and so on. All these numbers
taken together then make up all that we know about a ‘hood. To tell how
different it is from another, we then chose to compute the *Euclidean distance*,
the geometrical distance most of us are intimately familiar with from high
school, between all pairs of neighborhoods.
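
The two steps just described, converting counts to percentages and measuring pairwise distances, can be sketched roughly as follows. This is only an illustration: the neighborhood names and counts are invented, only two attributes and a single year are shown, and the helper `to_feature_vector` is hypothetical.

```python
import math

# Hypothetical raw counts for three neighborhoods in a single year.
# Each attribute maps to a list of counts per category (values are invented).
counts = {
    "A": {"gender": [500, 500], "age": [200, 600, 200]},
    "B": {"gender": [1000, 1000], "age": [400, 1200, 400]},
    "C": {"gender": [300, 700], "age": [100, 300, 600]},
}

def to_feature_vector(attributes):
    """Convert counts to percentages per attribute and concatenate them."""
    vector = []
    for name in sorted(attributes):  # fixed attribute order for every 'hood
        values = attributes[name]
        total = sum(values)
        vector.extend(100.0 * v / total for v in values)
    return vector

features = {hood: to_feature_vector(attrs) for hood, attrs in counts.items()}
names = sorted(features)

# Pairwise Euclidean distances between all neighborhoods.
distance_matrix = [
    [math.dist(features[a], features[b]) for b in names] for a in names
]
```

Neighborhood B has twice as many residents as A but the same proportions, so their distance is zero. This is exactly the effect of dropping absolute numbers from the equation.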

The so-obtained *distance matrix* then serves as input to a clustering algorithm.
These are a family of machine-learning techniques that seek to divide data
points — neighborhoods in our case — into distinct groups so that
members of one group are more similar to each other than to members of all
other groups by some measure. Specifically, we used *affinity propagation* to
perform the clustering and looked at the *Silhouette coefficient* to optimize
the parameters that govern the performance of that algorithm. As you can see
above, the result is 4 distinct clusters of neighborhoods. With gender,
duration of residence and age distributions all being quite similar between them, they
differ mostly in the migration background of their residents and the
prevalent quality of housing.
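
For readers curious about the Silhouette coefficient, here is a minimal sketch of how it can be computed from a distance matrix and a set of cluster labels. It does not reproduce the affinity propagation step itself. The labels below are hypothetical stand-ins for what a clustering run with different parameter settings might return, and the function name and toy data are invented for illustration.

```python
def silhouette_coefficient(distances, labels):
    """Mean silhouette value for a square distance matrix and cluster labels.

    Requires at least two clusters. Values near 1 indicate compact,
    well-separated clusters; negative values indicate poor assignments.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, own_label in enumerate(labels):
        own = clusters[own_label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(distances[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(distances[i][j] for j in other) / len(other)
            for label, other in clusters.items()
            if label != own_label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data: four points on a line, forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
toy_distances = [[abs(p - q) for q in points] for p in points]

good = silhouette_coefficient(toy_distances, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_coefficient(toy_distances, [0, 1, 0, 1])   # mixed-up grouping
```

The sensible grouping scores close to 1, while the mixed-up one comes out negative. Comparing such scores across candidate clusterings is one way to choose between parameter settings.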

All data used for this project are publicly available from the Berlin Open Data initiative under the license. In particular, we have used this geographic and most of these demographic data-sets, all published by the "Amt für Statistik Berlin-Brandenburg".
