How Similar are European Languages to Each Other?

Note: if you are using an early version of Internet Explorer, you aren't going to be able to see the d3 animation below, which would be a shame. So upgrade to 9 or use Firefox or something!

Europe has been on the brain recently, with the Eurocup and the Greek debt situation continuing on.

While Italy was playing Spain, a few people commented that the players probably all understood each other, since 'Italian and Spanish are so similar'. I had heard people say that the languages are almost the same, but I couldn't find any proof. So I did a little work.

After a little weekend geekery, I created a force-directed graph using d3 that represents the distance between each pair of languages. Since I'm still learning, I wasn't able to make it as compatible with browsers/tumblr as I wanted, but I was still happy with the result.

The size of each circle is roughly based on the number of people in Europe who speak that language, according to Wikipedia and my estimates. The colors represent language families: Germanic, Romance, Slavic, Finnic, and Other. Hungarian and Greek are 'Other' languages: Hungarian is non-Indo-European, and Greek broke off so early that I think it is considered its own branch. Mouse over a circle to see the language name.

I did this by comparing a list of words from each language called a Swadesh list. I got the idea from this paper, though my implementation was much less sophisticated (I didn't likelihood-weight the sounds or group by language family). Data is here. I'm not a linguist and this is a very simple implementation, but if there is interest I'll post the data and the code I used (in R). The main problems with this implementation are that it doesn't account for order ('mother' and 'thermo' would count as the same, for instance, since they have the same sounds in a different order) and that it doesn't account for the likelihood of sounds (see the Jaeger paper for a good example).
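To make the order problem concrete, here is a minimal sketch in Python (not the R code used for this post). It assumes letters stand in for sounds and uses a Dice-style overlap on letter counts, which is just one plausible way to score "same sounds, any order". The function names and the tiny word lists are illustrative, not the post's actual data or metric.

```python
from collections import Counter


def sound_overlap(word_a: str, word_b: str) -> float:
    """Order-insensitive overlap between two words, from 0.0 to 1.0.

    Assumption: each letter is treated as one sound.
    """
    counts_a, counts_b = Counter(word_a.lower()), Counter(word_b.lower())
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 2 * shared / total if total else 0.0


def language_similarity(list_a: dict[str, str], list_b: dict[str, str]) -> float:
    """Average the word-level overlap over concepts present in both lists."""
    concepts = sorted(set(list_a) & set(list_b))
    if not concepts:
        return 0.0
    return sum(sound_overlap(list_a[c], list_b[c]) for c in concepts) / len(concepts)


# Tiny illustrative excerpts keyed by concept, not the post's word lists.
italian = {"mother": "madre", "water": "acqua", "night": "notte"}
spanish = {"mother": "madre", "water": "agua", "night": "noche"}

# Same letters in a different order score as identical: prints 1.0
print(sound_overlap("mother", "thermo"))
print(round(language_similarity(italian, spanish), 2))
```

Because only letter counts are compared, 'mother' and 'thermo' come out as a perfect match, which is exactly the weakness described above.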

You can see that the languages fall into well-defined groups based on their history. I didn't build any 'tree' structure into the data: the grouping comes from the languages themselves. Also, Finnish, Estonian and Hungarian, the non-Indo-European languages, aren't even connected to the main mass of languages, which seems to make sense.

It was also interesting to me that English and Dutch are the Germanic Languages that are closest to the Romance Languages, probably due to their geography and the histories of their homelands as open trading countries.

Anyway, back to similarity. By this scoring system, where 1 = the same language and 0 = no similar sounds in words with the same meaning, the two most similar languages in the set were Czech and Slovenian, with a score of 0.7. Since I'm not particularly well versed in Slovenian/Czech history, I don't really know how accurate that is, but it doesn't seem crazy.

All of the Slavic languages appear to be very close, scoring much higher than the other language families.

Italian and Spanish scored 0.52, which was the highest of any Romance language pair, though Italian and Romanian were about the same. Danish and Swedish are even more similar, at 0.57. I have heard that Swedish and Norwegian are closer still.

The least similar languages in this list were Catalan and Hungarian, with only an 8% overlap of sounds in words for the same concept. Given that this is probably the first time anyone has written a sentence about Catalan and Hungarian, that's about right.

Update: After a couple of requests, I am uploading the source data and code for the work above. You'll note that this is a little bit of a hack (I'm looping instead of applying functions in R) but it should work.

Wordlists

Language info

R code to create the JSON object

JSON object

One place I would appreciate feedback is if there is a more graceful way to get the correct node & link names into the JSON object: the last several lines of code are solely devoted to that issue. I imagine there is a better way.
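One possible approach, sketched in Python rather than R: build a name-to-index lookup once, then refer to nodes by position when writing the links. This assumes the JSON uses the nodes/links layout common in d3 force examples, where each link's source and target are indices into the nodes array. The `build_graph` function, the `min_score` cutoff, and the key names are hypothetical, not taken from the linked code.

```python
import json
from itertools import combinations


def build_graph(languages, scores, min_score=0.0):
    """Build a {"nodes": [...], "links": [...]} dict for a force layout.

    languages: list of dicts with at least a "name" key.
    scores: dict mapping frozenset({name_a, name_b}) to a similarity.
    min_score: hypothetical cutoff below which no link is emitted.
    """
    # One lookup from language name to its position in the nodes list.
    index = {lang["name"]: i for i, lang in enumerate(languages)}
    links = []
    for a, b in combinations(index, 2):
        value = scores.get(frozenset((a, b)))
        if value is not None and value >= min_score:
            links.append({"source": index[a], "target": index[b], "value": value})
    return {"nodes": list(languages), "links": links}


# Two of the scores quoted in this post; everything else is left out.
languages = [
    {"name": "Italian", "family": "Romance"},
    {"name": "Spanish", "family": "Romance"},
    {"name": "Catalan", "family": "Romance"},
    {"name": "Hungarian", "family": "Other"},
]
scores = {
    frozenset(("Italian", "Spanish")): 0.52,
    frozenset(("Catalan", "Hungarian")): 0.08,
}

graph = build_graph(languages, scores, min_score=0.3)
print(json.dumps(graph, indent=2))
```

Keeping the lookup in one place means the node order and the link indices cannot drift apart, and a cutoff like `min_score` is one way weakly related languages could end up unconnected in the graph.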
