Data Visualization

This project was created for the class “Geovisualization and Advanced Cartography”. My colleague and I chose Exploratory Spatial Data Analysis (ESDA) as the topic of our second project.

For the data analysis we used GeoDa, a free and open-source program created by the GeoDa Center. We found it user-friendly and easy to use compared to other programs. Its main objective is to provide a tool for simple data exploration.

For this project we used four datasets: the geographic data of the London wards, Twitter data, socio-demographic data, and data from the OpenStreetMap project. To carry out standard ESDA (Exploratory Spatial Data Analysis), we used the following functions:

  • Histogram;
  • Box plot (“box-and-whisker” plot);
  • Scatter Plot;
  • Conditional Plot;
  • Parallel Coordinate Plot;
  • Box map;
  • Standard Deviation Map.

The figure below shows that the upper outliers are in the central and eastern parts of London. This means that the number of tweets in these parts is far higher than in the rest of London.

Figure 1
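The idea of an "upper outlier" can be sketched in a few lines of Python. This is only an illustration, not GeoDa's own code: it assumes the common box plot rule with a hinge of 1.5 (a value above Q3 + 1.5 × IQR counts as an upper outlier), the quartile method may differ slightly from the one GeoDa uses, and the ward names and counts are made up.

```python
from statistics import quantiles


def upper_outliers(counts, hinge=1.5):
    """Return the keys whose value lies above Q3 + hinge * IQR."""
    q1, _median, q3 = quantiles(counts.values(), n=4, method="inclusive")
    upper_fence = q3 + hinge * (q3 - q1)
    return sorted(key for key, value in counts.items() if value > upper_fence)


# Hypothetical tweet counts per ward, for illustration only.
tweets_per_ward = {
    "ward_a": 10,
    "ward_b": 11,
    "ward_c": 12,
    "ward_d": 12,
    "ward_e": 13,
    "ward_f": 95,
}

print(upper_outliers(tweets_per_ward))
```

With these made-up numbers only `ward_f` lies above the upper fence, which is the kind of ward a box map highlights as an upper outlier.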

The next figure shows the difference between the number of tweets sent during daytime and during nighttime. In general the differences are small, but in some wards they are clearly visible, e.g. in the center of London.

Screenshot

Number of attractions (right) compared to the number of tweets during the night (left):

Screenshot001

We found a strong dependency between the number of tweets and the number of attractions. As Figure 4 shows, an especially high number of tweets during the night can be explained by the number of attractions. However, there are wards with a high number of attractions but only a low or medium number of tweets.
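A dependency like this can also be expressed as a single number. The sketch below computes a plain Pearson correlation coefficient in Python; this is our own illustration and not a GeoDa function, it ignores the spatial arrangement of the wards, and the values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-ward values, for illustration only.
attractions = [0, 1, 2, 5, 9]
night_tweets = [20, 35, 30, 80, 150]

print(round(pearson_r(attractions, night_tweets), 2))
```

A value close to 1 would support the visual impression from the maps, while wards with many attractions but few tweets pull the coefficient down.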

Further outcomes are:

  • The difference between day and night depends on the specific ward, e.g. in the center of London people tweet more during daytime. In general, the tweet counts don’t differ much between day and night.
  • There is no connection between the number of tweets and the number of tweets per inhabitant. The number of tweets does not depend on the population.
  • Tweets depend on the age of the population. The younger the population in a ward, the more tweets are sent.
  • Tweets are highly dependent on the location of tourist attractions.

In this project we investigated the distribution of tweet data in London from July to September 2013. We experimented with various socio-demographic datasets to carry out exploratory spatial analysis: we used ESDA techniques to identify what drives the number of tweets, to find correlations between variables, and to compare and identify how the tweet data depend on those variables.

Further investigations should consider that some tweets might be sent by a computer (“bot”) and should be removed. Besides the number of tweets, the number of users could serve as another indicator. It could also be interesting to make more use of the time attributes of the tweets, that is, to look at when the tweets are sent not only by daytime and nighttime but also by hour or even by minute. For that to make sense, it would be helpful to use all tweets from a whole year rather than only three months in the summer. Moreover, it can be useful to analyze the hashtags to identify events or recurring hashtags.
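As a rough idea of what such an analysis could look like outside GeoDa, here is a minimal Python sketch that counts tweets per hour of day and counts hashtags. The input format (ISO timestamp plus tweet text) and the sample tweets are assumptions made for illustration and do not come from our dataset.

```python
import re
from collections import Counter
from datetime import datetime

HASHTAG = re.compile(r"#(\w+)")


def tweets_per_hour(tweets):
    """Count tweets by hour of day (0-23) from ISO-formatted timestamps."""
    return Counter(datetime.fromisoformat(ts).hour for ts, _text in tweets)


def hashtag_counts(tweets):
    """Count hashtags across all tweets, ignoring upper/lower case."""
    return Counter(
        tag.lower() for _ts, text in tweets for tag in HASHTAG.findall(text)
    )


# Hypothetical tweets as (timestamp, text) pairs, for illustration only.
sample = [
    ("2013-07-14T09:05:00", "Morning run along the Thames #London"),
    ("2013-07-14T09:40:00", "Coffee first #london #coffee"),
    ("2013-07-14T23:15:00", "Late show tonight"),
]

print(tweets_per_hour(sample))
print(hashtag_counts(sample).most_common(3))
```

The same counters could be grouped by ward to see whether recurring hashtags or peak hours are concentrated in particular parts of the city.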

We experienced the limitations of GeoDa especially when analyzing the time and the hashtags. For those cases other software tools might be more suitable. However, for the objective of this project GeoDa was sufficient and simple to learn and to use.