Curious about housing prices in Stockholm? What does the data say?

Yomisola Adegeye
The Startup
Published in
7 min readJan 18, 2021

--

Recently, I decided to do a project and build a proof of concept out of it. I sketched a draft of what processes to follow to achieve this and below are the steps I followed;

  • Decide on a topic
  • Decide on the data source and how to get the data
  • Understanding the data
  • Preparing the data
  • Exploring the data
  • Visualizing the data
  • Draw insights /conclusions

Deciding on a topic

I have always wanted to do a project on housing in Sweden and since I live in Stockholm, it was easy for me to choose Stockholm county. I was then left with the decision of whether I want to focus on house sales or house rentals. I eventually settled for house sales in Stockholm since it is not news that it is better to buy a house in Sweden than to rent one.

How to get the data

This was not a very difficult decision because I already knew where I wanted my data to come from and I had decided to use a data I scraped myself even before I started the project. I scrapped my data from Hemnet.se using python and below is an explanation of how I went about it;

Web Scraping with Python

To the get the data needed for this analysis, I wrote a python script that crawled the hemnet.se website and aggregated the housing data found. The following libraries were used in the script: Selenium , BeautifulSoup and Pandas .

Each housing data is an unordered list item <li> with a specific html tags hierarchy. Some of the information were found within h2 tags, some within spans and others within div tags. The key is to identify the html id or class that uniquely identifies the tag that contains the information I am interested in, and extract this info using the BeautifulSoup library.

A sample snapshot of what each house data looks like is shown below

<ul>
<li>
<h2 class="area"> Area name </h2>
<h2><span class="address"> Street name </span></h2>
<div class="attributes"> List of house attributes </div>
<div class="price">House price</div>
</li>
................
................
</ul>

BeautifulSoup library made it easy to extract this information which was eventually compiled in a data frame created with pandas and then converted to a .csv file.

df = pd.DataFrame({'Location':location,'Street':street,'Price':price, 'Size': size, 'Number of rooms': number_of_rooms, 'Price Per msq': price_per_m, 'Features': features})df.to_csv('hemnet_villa.csv', index=False, encoding='utf-8')

The complete code for this can be found in this GitHub repository.

Understanding the data

After I had successfully scrapped the data, I needed a better understanding of it. Decisions like how many fields I would like to have, the data types of those fields, how to go about cleaning my data, in short a general understanding of how to prepare my data for analysis.

In this process I thought it would be interesting to perform the following analysis;

  • The distribution of houses for sales by Number of rooms
  • A breakdown of which room type (by number of rooms) are in higher demand
  • Common feature in the houses available for sale
  • How different features affect the average price of the most requested number of rooms. This analysis will be done for houses for sale in high end districts.
  • The distribution of villas for sale by street in Stockholm

The above decision helped informed me on how I wanted to go about my data cleaning and structuring.

Preparing the data

I imported theCSV file created from my python script into excel for cleaning. The bulk of the work had been done with python which made my cleansing process easier.

a. First I removed null values and empty spaces that couldn’t be given a reasonable default or computed from other fields.

b. I removed unwanted columns and characters like symbols, figures e.t.c

c. Since I wanted my analysis in English but my data was in Swedish, I translated from Swedish to English. This wasn’t so hard as I knew just the right amount of Swedish to be able to do this without help.

d. I merged the dataset for the apartments and the dataset for the villa since I scraped them separately

Exploring and visualizing the data

I used tableau for my exploratory analysis and in the process I tried answering some of my earlier questions and even more. I saw some outliers that I removed to avoid bias. I created several charts for the different parts of my visualization and I will be giving more details about them in the following section.

A tree-map was used as the first chart to answer the question “What is the distribution of house sales by number of rooms?”. I created this chart by dragging “Number of rooms” to the color marks card, “Room size” to the size marks card, “Districts” to the detail marks card and for my tooltip, I used a bubbles-chart to give more details like where the houses for sale are located, the street, size, average price and price per meter square. Since I was analyzing both apartments and villas for sale, I restricted my first chart to apartments alone, more information will be provided about the villas for sale as we go further.

My second chart was a bar chart that was used to show the “Number of rooms” by demand. A bar chart was used to show the most requested number of rooms in Stockholm and this was gotten by calculating highest “Number of rooms” available for sales. I created this chart by dragging “Number of rooms” to column and also dragging “Number of rooms” to rows but this time I changed the field on the row pane from discrete to continuous and counted the number of rooms. Houses with 2 Rooms turned out to be the highest number of rooms available for sales which was also used to represent the most requested number of rooms.

I went further to create a rounded bar-chart to show the most common features in the houses available for sale with their percentages. I created this chart by placing measure value on column and the features on row, then I dragged measure names to path and finally placed a table calculation(percent of total) of the “count of features” on the label marks card.

My next chart was inspired by the both the bar chart showing the analysis of the most requested “Number of rooms” and the rounded bar chart showing the analysis of the features. I created a bar chart showing how the average prices of 2 rooms in a very high end district changes with the type of feature provided.

I used a scatter plot to compare houses for sale in high end district with low end district for 1 room, 2 rooms and 3 rooms. I created the scatter plot by dragging two continuous fields(size and price per meter square) to both row and column respectively, and all the other fields were placed on the detail marks card.

Lastly, I created a bar chart to showed the price per meter square of all the villas available for sale. The tooltip contained additional information about the prices, sizes and locations .

I added all these charts to a dashboard(the background used for my dashboard was created with Figma) to create my final visualization and some of the insights that were drawn are;

  • Two rooms are the most requested number of rooms, while higher number of rooms like the 8 rooms or 9 rooms are least demanded.
  • Size is not the only price determinant for buying a house in Stockholm, in fact location is.
  • The most common features that comes with houses for sale are balcony and lift.

In conclusion, this article was written to take you through how I create a data analysis project from start to finish using python, Excel and Tableau. The complete source code will be made available on my GitHub and the viz can be found here on Tableau public. Feel free to look it up and make suggestions for improvements.

Thank you for reading

--

--