The Lord of the Rings - Social Graphs & Interactions Final Project

Introduction

This website presents the Final Project of the Social Graphs & Interactions course at the Technical University of Denmark, Copenhagen. The project was carried out as a group, and the task was to analyse an interesting dataset of our own choice. As our Home page already makes clear, we focused on The Lord of the Rings universe and conducted an in-depth analysis that draws on several sources. Starting from our dataset, we wanted to extract curious results, hidden statistics and secrets and bring them to light, so that the most passionate fans of this universe could (we hope!) be satisfied, while people who know nothing about it could still discover and appreciate a huge fantasy world. A world that, first through epic books and then through unforgettable movies, caught the attention of countless fans and achieved worldwide success.

Before you go on, a few tips to fully enjoy your journey through Middle-earth: the website is meant to be explored in its natural flow, with its contents ordered from top to bottom. It keeps technical explanations to a minimum; the full details can be consulted in the complete Jupyter Notebook. Enjoy your reading!

P.S. You don't know anything about The Lord of the Rings? Fly, you fools!, as wise Gandalf would say... In brief, The Lord of the Rings (frequently shortened to LOTR) was born as a famous trilogy of fantasy novels by J.R.R. Tolkien, published in 1954-1955, then brought to cinemas in the early 2000s by Peter Jackson. The movies were so successful that LOTR became an official brand with its own merchandise. For further explanation, see Wikipedia.

What you will see

Starting from the "One Ring to Rule Them All!" Wiki, we downloaded an XML copy of all the pages, extracted the relevant contents, built a network of all the characters, places and battles, and conducted an in-depth analysis using Python.
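The extraction step can be pictured with a small sketch. It assumes a MediaWiki-style XML export in which each page holds a title and a revision text, and it treats every [[...]] wiki link that points to another known page as a directed edge. The sample XML, the function name build_link_graph and the regular expression are illustrative assumptions, not the project's actual code (which lives in the Jupyter Notebook).

```python
import re
import xml.etree.ElementTree as ET

# Tiny hand-made stand-in for the Wiki export (hypothetical content).
# Real MediaWiki exports also declare an XML namespace, so the tag names
# would need the namespace prefix there.
SAMPLE_XML = """
<mediawiki>
  <page><title>Frodo Baggins</title>
    <revision><text>Frodo went to [[Mordor]] with [[Samwise Gamgee|Sam]].</text></revision></page>
  <page><title>Samwise Gamgee</title>
    <revision><text>Sam followed [[Frodo Baggins]] to [[Mordor]].</text></revision></page>
  <page><title>Mordor</title>
    <revision><text>The land of [[Sauron]].</text></revision></page>
</mediawiki>
"""

# Capture the link target, dropping any "|label" or "#section" part.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def build_link_graph(xml_text):
    """Return {page title: set of linked page titles}, for known pages only."""
    root = ET.fromstring(xml_text)
    pages = {
        page.findtext("title"): page.findtext("revision/text") or ""
        for page in root.iter("page")
    }
    graph = {}
    for title, text in pages.items():
        targets = {target.strip() for target in WIKILINK.findall(text)}
        # Keep only links to pages that exist in the dump, and drop self-links.
        graph[title] = {t for t in targets if t in pages and t != title}
    return graph

graph = build_link_graph(SAMPLE_XML)
for source, targets in graph.items():
    print(source, "->", sorted(targets))
```

In this toy dump the link to Sauron is dropped because no Sauron page exists in it; whether the real pipeline filtered links this way is an assumption.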

The acronym "NLP" stands for Natural Language Processing: starting from the large quantity of text extracted from the pages that represent the nodes of the network, we analyzed it and pulled out some curious facts.

We all know how huge the books of the trilogy are (the whole of The Lord of the Rings is 1000+ pages!) and how long the movies are (each of them lasts about 3 hours or more!)... Hence, we wondered whether there are clear differences or similarities between them in specific respects, such as sentiment and character mentions.

Wiki Network Analysis

What does a network of LOTR look like? We hereby provide some "nice" plots to show it.

Full Network v.1

The whole network with node size and color by degree

Full Network v.2

The whole network, highlighting the different categories of nodes

Giant Connected Component

The GCC is the largest connected subgraph (in terms of number of nodes), i.e. the largest group of nodes that are all linked together

Centrality

Centrality is an important measure for identifying the crucial nodes of a network. We took into consideration two types of centrality, plus the degree of the nodes, in order to identify the most important nodes. Why these three measures? The importance of a node isn't that obvious, so we considered different measures in order to compare the results and draw our conclusions.

Betweenness Centrality ²: an indicator of a node's centrality in a network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. A node with high betweenness centrality has a large influence on the transfer of items through the network, under the assumption that item transfer follows the shortest paths.
Eigenvector Centrality ³: a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
Degree: the number of nodes a node is connected to (or, equivalently, the number of links attached to it). In a directed network we can distinguish the in-degree and the out-degree, which count the incoming and outgoing edges respectively.

² ³ From Wikipedia
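To make the three measures concrete, here is a minimal sketch on a toy directed graph. It assumes the networkx library (the text only says the analysis was done in Python), and the edges are invented for illustration. Eigenvector centrality is computed on the undirected version of the graph purely to keep the toy example simple.

```python
import networkx as nx

# Toy directed graph: an edge A -> B means "page A links to page B".
G = nx.DiGraph([
    ("Frodo", "Sauron"), ("Gandalf", "Sauron"), ("Aragorn", "Sauron"),
    ("Sauron", "Mordor"), ("Mordor", "Sauron"),
    ("Aragorn", "Gandalf"), ("Gandalf", "Frodo"), ("Frodo", "Gandalf"),
])

in_deg = dict(G.in_degree())      # number of incoming links
out_deg = dict(G.out_degree())    # number of outgoing links
total_deg = dict(G.degree())      # in-degree + out-degree

# Share of shortest paths that pass through each node.
betweenness = nx.betweenness_centrality(G)

# Assumption for this sketch: link direction is ignored for this measure.
eigenvector = nx.eigenvector_centrality(G.to_undirected(), max_iter=1000)

def top(scores, n=3):
    """Return the n highest-scoring nodes, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

print("Top in-degree:  ", top(in_deg))
print("Top betweenness:", top(betweenness))
print("Top eigenvector:", top(eigenvector))
```

In this toy graph Sauron comes out on top of all the rankings, because almost every page links to him and every route to Mordor passes through his page.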

Top Degrees

The nodes with the highest degree

Top In-Degrees

The nodes with the highest in-degree

Top Out-Degrees

The nodes with the highest out-degree

Eigenvector Centrality

The nodes with highest eigenvector centrality

Betweenness Centrality

The nodes with highest betweenness centrality

Hmm, interesting...

Considerations on the degrees...

Ok, for those who haven't guessed it yet, the Lord of the Rings is definitely Sauron. His extremely high degree means that his page is connected to 156 other pages in the Wiki, and as the in-degree shows, his page is mentioned by far more pages than it refers to! Several members of the Fellowship of the Ring also rank high. There are some curious results, involving Aragorn for example: his page has a high out-degree, but he's not in the top 10 by degree... Galadriel, the famous Elf Queen, seems to play a central role in the whole universe according to her degree.

Considerations on the centralities...

Sauron again holds the record as the most central character in the whole network. Once more we see some names in common with the degree rankings, and high scores for Aragorn, Legolas, the Witch-King of Angmar and (unexpectedly?) Isildur. Places like Minas Tirith and Rivendell also play a central role. Last but not least, no Frodo Baggins in the centrality rankings...

Degree Distribution

What is the degree distribution? It's a plot that shows how many nodes in the network have each specific degree, and whether this follows a particular trend. We show the distribution separately for in-degree and out-degree, because the results are more interesting when compared.
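As a rough illustration of what sits behind such a plot, the sketch below counts how many nodes have each in-degree and out-degree in a toy edge list. The edges and the helper name degree_distribution are made up for the example; only the counting idea is taken from the text.

```python
import math
from collections import Counter

# Toy edge list: (source, target) means "source page links to target page".
edges = [
    ("Frodo", "Sauron"), ("Gandalf", "Sauron"), ("Aragorn", "Sauron"),
    ("Sauron", "Mordor"), ("Mordor", "Sauron"),
    ("Aragorn", "Gandalf"), ("Gandalf", "Frodo"), ("Frodo", "Gandalf"),
]
nodes = sorted({node for edge in edges for node in edge})

in_degree = Counter(target for _, target in edges)
out_degree = Counter(source for source, _ in edges)

def degree_distribution(degree_of):
    """Map each degree value k to the number of nodes that have degree k."""
    # A Counter returns 0 for nodes that never appear, so they count as degree 0.
    counts = Counter(degree_of[node] for node in nodes)
    return dict(sorted(counts.items()))

in_dist = degree_distribution(in_degree)
out_dist = degree_distribution(out_degree)

# Points for the logarithmic version (degree 0 has no logarithm, so it is skipped).
log_in_points = [(math.log10(k), math.log10(c)) for k, c in in_dist.items() if k > 0]

print("in-degree distribution: ", in_dist)
print("out-degree distribution:", out_dist)
```

Even in this tiny example the in-degrees are spread more widely (one page collects four links) than the out-degrees, which all sit at 1 or 2.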

In-Degree

Distribution of the in-degree values of the nodes in the network

Log In-Degree

Logarithmic distribution of the in-degree values of the nodes in the network

Out-Degree

Distribution of the out-degree values of the nodes in the network

Log Out-Degree

Logarithmic distribution of the out-degree values of the nodes in the network

What do these plots mean?!

Theory behind the plots...

The linear plots show a fairly regular trend, much more evident in the in-degree distribution. This trend is commonly called a power law, and its main feature is that there are a lot of nodes with a very low degree and just a few nodes with a very high degree. This is why we also plot the logarithmic version, which makes the relation between increasing degree and decreasing frequency even clearer. Why does this happen? Basically, the more a page is cited, the more famous it becomes, and the more citations it attracts.

A few more words are worth spending on the out-degree distribution: the general trend follows the in-degree one fairly well, but it is much noisier. Why all this noise? The out-degrees seem to follow a partly random law while still mainly following a power-law distribution. For example, the peak at 0 matches the power-law distribution, but the maximum out-degree is very low compared to the maximum in-degree. The logarithmic plot, in turn, shows a distribution that resembles a random one while still following the almost-linear decrease that a power law predicts. This makes perfect sense: in a real network there are hubs that, once they start growing, keep receiving links (i.e. citations). In this case, once a character/battle/place gets a lot of citations, it becomes more famous (or, more probably, it is already famous), so it gets even more citations. This drives the incoming citations to follow a power law. On the other hand, if the page of a character/battle/place cites a lot of other characters/battles/places (outbound links), this does not necessarily mean that it is famous, or that it will be more likely to cite further pages in the future. As a consequence, the outbound-link distribution looks like a power law with random noise in it.
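The "famous pages keep getting cited" mechanism can be mimicked with a small simulation. This is only a sketch of the general rich-get-richer idea (often called preferential attachment), not a model fitted to the Wiki: the function name simulate_citations, the number of pages and the number of links per page are arbitrary assumptions.

```python
import random

def simulate_citations(n_pages=2000, links_per_page=3, seed=42):
    """Return the in-degree of each page after a 'rich get richer' growth."""
    rng = random.Random(seed)
    in_degree = [0] * n_pages
    # Every page enters the pool once, plus once more per citation received,
    # so already-cited pages are more likely to be picked again.
    pool = [0]
    for page in range(1, n_pages):
        for _ in range(min(links_per_page, page)):
            cited = rng.choice(pool)  # repeated picks of the same page are allowed
            in_degree[cited] += 1
            pool.append(cited)
        pool.append(page)
    return in_degree

in_degree = simulate_citations()
mean_in = sum(in_degree) / len(in_degree)

print("mean in-degree:", round(mean_in, 2))
print("max in-degree: ", max(in_degree))
print("pages never cited:", in_degree.count(0))
```

Every page hands out the same small number of links, yet the citations received end up very unevenly spread, which is the asymmetry between out-degree and in-degree discussed above.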

Community Detection

Is it possible to aggregate the nodes of the network into groups (formally, communities)? According to which criteria? We wondered what the best way to find these communities was, and our natural intuition was to group the nodes by race. Race is an important attribute that (almost) every character has, and we thought that characters of the same race might be well interconnected with each other (which is what a community is meant to be). Hence, we started by computing the communities with a ready-made implementation of the Louvain algorithm, which tries to extract the communities automatically. Here are the results of this analysis.
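For readers who want to see what running Louvain looks like, here is a minimal sketch. It assumes networkx 2.8 or later and its louvain_communities function; the text does not say which implementation the project used, and the toy graph (two tightly knit groups joined by a single friendship) is invented.

```python
import networkx as nx

elves = ["Galadriel", "Elrond", "Legolas", "Arwen"]
dwarves = ["Gimli", "Thorin", "Balin", "Dwalin"]

G = nx.Graph()
# Fully connect the members of each group.
for group in (elves, dwarves):
    G.add_edges_from(
        (a, b) for i, a in enumerate(group) for b in group[i + 1:]
    )
# One single link between the two groups.
G.add_edge("Legolas", "Gimli")

# Louvain is randomized, so a seed is fixed to make runs repeatable.
communities = nx.community.louvain_communities(G, seed=42)

for index, members in enumerate(communities):
    print(index, sorted(members))
```

The result is a list of node sets in which every node belongs to exactly one community.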

Communities v.1

A plot that shows the whole network with a clear division by color between the communities

Communities v.2

The size (number of nodes) of each community

It's clear that the communities were identified fairly well, since only a few communities have very few nodes (as the second plot clearly shows), while most of them are relatively well populated. Still, we don't know which criteria were used to identify them, and we couldn't find out without trying our own grouping. So, as already mentioned, we tried to detect the communities by categorizing the nodes by race. We don't show the plots again; instead, we use a handy tool for comparing two different sets of communities: a confusion matrix. Please don't be worried by its name! It's just a very simple matrix, where each entry (i, j) (i identifies the row and j the column) is the number of nodes that race number i has in common with community number j (where races are our own categories and communities are those found by the Louvain algorithm).
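Building such a matrix takes only a few lines. In the sketch below the race labels and community numbers are invented for illustration; only the counting rule (entry (i, j) = nodes shared by race i and community j) comes from the description above.

```python
from collections import Counter

# Hypothetical labels: our own category (race) for each node...
race_of = {
    "Gimli": "Dwarves", "Thorin": "Dwarves", "Balin": "Dwarves",
    "Legolas": "Elves", "Galadriel": "Elves",
    "Aragorn": "Men", "Boromir": "Men", "Theoden": "Men",
}
# ...and the community number an algorithm such as Louvain might assign.
community_of = {
    "Gimli": 0, "Thorin": 0, "Balin": 0,
    "Legolas": 1, "Galadriel": 1,
    "Aragorn": 0, "Boromir": 1, "Theoden": 2,
}

races = sorted(set(race_of.values()))              # row labels
communities = sorted(set(community_of.values()))   # column labels

# Count how many nodes fall in each (race, community) pair.
pair_counts = Counter((race_of[n], community_of[n]) for n in race_of)
matrix = [[pair_counts[(race, com)] for com in communities] for race in races]

for race, row in zip(races, matrix):
    print(f"{race:8s}", row)
```

Here the Dwarves and Elves rows each concentrate in one column, while the Men row is spread over all three.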

Confusion Matrix

A "clear" comparison between the different algorithms of community detection

Wait, they're not just random numbers...

The secret behind all those numbers is quite simple: as already mentioned, each row identifies a race, and each column a community. How well does each race correspond to a particular community? It's much easier to understand this with an example. Look at the Dwarves row: they have 23 nodes in common with the 4th community (remember that indexes start from 0!), and just 0-1 nodes in common with the other communities (just one with community 3). This is a clear example of a well-defined community based on race. The case of the Ainur is even better: all their nodes fall in one single community! On the other hand, the race of Men clearly does not aggregate well, since its nodes are scattered across multiple communities. It's worth saying that, since the original network included battles and cities as well, they are part of the community detection too, and if you look at both their rows the matrix shows that they have some similarities: they might well form a community together!

More scientifically, there is a measure that is usually taken into consideration to evaluate how well defined the communities are. It is known as modularity, and we calculated it both for the Louvain community detection and for our race-based grouping.

Louvain algorithm: 11 communities, modularity ≈ 0.74
Races algorithm: 19 communities, modularity ≈ 0.33

Yes, ok, let's make it easier... Just keep in mind that the higher the modularity, the better defined the communities are. In conclusion, race might be taken into consideration, but it's certainly not the only thing that shapes the different communities.
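To give a feeling for where a modularity value comes from, here is a small worked sketch. It uses the standard formula for undirected graphs, Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ], where m is the number of edges, L_c the number of edges inside c and d_c the total degree of the nodes in c. Treating the network as undirected is a simplification made for this example, and the graph and both partitions are invented.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both ends inside community c
    degree_sum = defaultdict(int)  # d_c: sum of the degrees of the nodes in c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )

# Two triangles joined by a single bridge (a3 - b3).
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b3"),
]

# A partition that follows the structure: one community per triangle.
by_triangle = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
# A partition that ignores the structure: nodes paired across the triangles.
by_index = {"a1": 1, "b1": 1, "a2": 2, "b2": 2, "a3": 3, "b3": 3}

q_triangle = modularity(edges, by_triangle)
q_index = modularity(edges, by_index)
print(round(q_triangle, 3), round(q_index, 3))  # 0.357 -0.204
```

The grouping that follows the actual links scores clearly higher, which is the same kind of comparison as the one between the two values reported above.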

In-Depth Wiki Text Mining

Now it's time to take a break from pure network analysis and put some text into it! We extracted the text of the pages in the Wiki and created some cool stuff for your viewing pleasure. Basically this part was split into two big parts: shortest path similarities and word clouds.

Average Similarities ⁴

The shortest paths of the network are the paths (each represented as a list of nodes from a source to a target) that take the fewest steps. We wondered how similar the contents of the pages along each shortest path are. We generated 5000 random paths and found that a path with very low average similarity between its nodes is quite common.
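The idea can be sketched as follows. The text does not name the similarity measure that was used, so this example assumes a plain cosine similarity over word counts, averaged over all pairs of pages on the path; the graph, the page texts and the function names are likewise invented for illustration.

```python
import math
from collections import Counter, deque
from itertools import combinations

def shortest_path(graph, source, target):
    """Breadth-first search; returns the list of nodes, or None if unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def average_path_similarity(path, texts):
    """Mean similarity over all pairs of pages on the path."""
    pairs = list(combinations(path, 2))
    if not pairs:
        return 0.0
    return sum(cosine(texts[u], texts[v]) for u, v in pairs) / len(pairs)

# Hypothetical link graph and page texts.
graph = {"Frodo": ["Gandalf"], "Gandalf": ["Rivendell"], "Rivendell": []}
texts = {
    "Frodo": "hobbit of the shire who carried the ring",
    "Gandalf": "wizard who guided the hobbit and the fellowship",
    "Rivendell": "elven valley where the fellowship was formed",
}

path = shortest_path(graph, "Frodo", "Rivendell")
score = average_path_similarity(path, texts)
print(path, round(score, 2))
```

Repeating this for many random source and target pairs gives a distribution of average similarities.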

Shortest Paths Similarities

The distribution of the average shortest path similarities

Word Clouds

What are the most common words related to each character? Or to each city, battle or race? We worked on the text extracted from the wiki pages, cleaned and processed it, and figured out which words are the most frequent and characteristic. On this website we reserved a whole section for these nice pictures, so you can easily filter them and see the ones you're most interested in.
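One common way to find words that are both frequent and characteristic is TF-IDF, which rewards words that appear often on one page but rarely on the others. Whether the project used exactly this weighting is an assumption; the sketch below only illustrates the idea on three made-up, already cleaned page texts.

```python
import math
from collections import Counter

# Hypothetical cleaned page texts (lower-cased, stop words already removed).
pages = {
    "Gimli": "dwarf axe dwarf mine fellowship",
    "Legolas": "elf bow elf forest fellowship",
    "Gandalf": "wizard staff wizard fellowship",
}

def tf_idf(pages):
    """Return {page: {word: score}} with score = term count * log(N / doc freq)."""
    tokens = {title: text.split() for title, text in pages.items()}
    n_docs = len(tokens)
    # In how many pages does each word appear at least once?
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    scores = {}
    for title, words in tokens.items():
        counts = Counter(words)
        scores[title] = {
            word: counts[word] * math.log(n_docs / doc_freq[word])
            for word in counts
        }
    return scores

scores = tf_idf(pages)
for title, word_scores in scores.items():
    best = max(word_scores, key=word_scores.get)
    print(title, "->", best)
```

The word "fellowship" appears on every page, so it scores zero everywhere, while "dwarf", "elf" and "wizard" stand out for their respective pages. Scores like these could then set the word sizes in a word cloud.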

Population Analysis

We extracted some features from the Wiki pages and provided some interesting results in terms of population and especially for female characters. What is the role of females in the LOTR universe? How many of them appear? How are they distributed among the races? Who are the central female characters of our network? Below you'll see how we tried to provide an answer to these questions.

⁴ The average similarity is simply a relative score assigned to each path by a specific algorithm, stating how similar the contents of the pages along the path are to each other. The values have no absolute reference to compare against, so we call them low based on their trend: just a few paths have a high average similarity (~0.53), while most of them are between 0.1 and 0.22. This is a common feature of many Wikis, since Wiki pages often contain wiki links (the edges of our network) that are not coherent with the content of the pages (nodes) they link.

Books & Movies Analysis

Books and movies are the second, and most interesting, part of our text mining, even if we treated it in a different way... We worked on an accurate comparison between the movie scripts and the books, and extracted a lot of curious results!

Movie Networks

We created two more networks, one based on character mentions and one on characters appearing in the same scenes. Here are some cool pictures with the data we analysed.
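A same-scene network can be built by linking every pair of characters that share a scene and counting how often that happens. The scene lists below are invented, and how scenes were actually cut out of the scripts is not covered here; this is only a sketch of the pairing step.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scenes, each given as the set of characters that appear in it.
scenes = [
    {"Frodo", "Sam", "Gandalf"},
    {"Frodo", "Sam"},
    {"Aragorn", "Legolas", "Gimli"},
    {"Frodo", "Aragorn"},
]

# Edge weight = number of scenes in which the two characters appear together.
edge_weights = Counter()
for characters in scenes:
    # Sorting makes (a, b) and (b, a) count as the same undirected edge.
    for a, b in combinations(sorted(characters), 2):
        edge_weights[(a, b)] += 1

for (a, b), weight in edge_weights.most_common(3):
    print(a, "-", b, weight)
```

A mention-based network could be built in a similar way, with a link added whenever one character's lines mention another.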

Here is the network based on characters in the same scene: