Analyzing big Airbnb data using Spark

Bjørn Hansen
8 min read · Jan 13, 2021

In this article we will use PySpark to perform data analysis on a large Airbnb dataset.

Apache Spark

Spark is a computing framework designed for distributed data processing on clusters of machines. Its advantage becomes apparent when datasets grow too large for a single computer to handle quickly. A standard PC relies on parallel computing when performing operations, whereas Spark uses distributed computing across many machines, typically via a server or cloud system.

Distributed vs Parallel computing (image source: packtpub.com)

The main difference between distributed and parallel computing is that distributed computing uses many computers, each with its own processor and memory, while parallel computing uses a single computer. In distributed computing, the full computation is divided into individual tasks for each computer in the cluster, and the computers communicate with each other through message passing. Parallel computation on a single computer can use multiple processor threads, but main memory is shared and the processors communicate over the memory bus. Both approaches have their advantages, but when it comes to scalability, distributed computing dominates.

Using PySpark

Apache Spark is written in the Scala programming language, but in this article we will use PySpark, a Python API for Spark that lets the user work with Resilient Distributed Datasets (RDDs) and data frames from a Python environment. To start using Spark, a session must be initiated; an example of how this can be done is shown below.
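
A minimal sketch of how a session might be initiated (the application name is just a placeholder):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the app name is arbitrary
spark = SparkSession.builder \
    .appName("airbnb-analysis") \
    .getOrCreate()
```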

Data

The data used in this article consists of two Airbnb datasets, one named Listings and one named Reviews (it should be noted that this data is not my own and is hosted by the University of Denmark for educational purposes). The Listings dataset contains the technical details about each specific listing, such as size, price, country, and coordinates. The Reviews dataset contains the information given by the guest, such as the review text and reviewer name. The parameter which both datasets have in common is the listing ID. An example of how the data can be loaded with PySpark is shown in the code snippet below.
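
A sketch of how the two datasets might be loaded, assuming they are stored as CSV files at the placeholder paths below:

```python
# Paths are placeholders; adjust to wherever the files are stored
listings_df = spark.read.csv(
    "data/listings.csv", header=True, inferSchema=True, multiLine=True, escape='"'
)
reviews_df = spark.read.csv(
    "data/reviews.csv", header=True, inferSchema=True, multiLine=True, escape='"'
)
```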

Preparing the data

Real-world data is typically messy and not always uniformly structured, and that is the case with this data as well. There are a few ways of handling this: we can either transform every value that is not uniform, or simply drop the rows that hold non-uniform data, such as missing values. Dropping the rows is by far the easiest and least biased option, and since we have so much data, that is what will be done here. We also want to choose which parameters (columns) are of interest and select only those going forward. It is also a good idea to view a small portion of the data to get an intuition of what we are dealing with.
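
One way this selection and inspection might look (the column names here are illustrative, not necessarily the exact ones from the original datasets):

```python
# Keep only the columns of interest and peek at the data
listings_df = listings_df.select(
    "id", "city", "neighborhood_cleansed", "property_type",
    "price", "review_scores_rating"
)
reviews_df = reviews_df.select("listing_id", "date", "comments")

listings_df.show(5)
reviews_df.show(5)
```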

Viewing first five rows of each dataset.

Using the df.count() command we notice that the datasets are indeed big, with the Listings dataset holding just over 1 million data points (rows) and the Reviews dataset holding just over 32 million. Looking at the first five data points of each dataset above, a few things clearly stick out and could be good to change straight away, for example the $ sign and comma that every price value contains. Another thing to note is that the neighborhood_cleansed values contain special characters (such as ö and ä), and the date is not split into separate parameters (day, month, year). Depending on the type of data analysis and algorithms one wishes to use, these things could cause some issues; we will, however, not change all of them now. It is not obvious from viewing the data what data type each parameter has, so we can use df.printSchema() on both of our loaded data frames to see this. It then becomes clear that some of the parameters of interest have data types which need to be changed; the cleaning of the data is done with the code snippet below.
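
A sketch of what the cleaning could look like: dropping rows with missing values, stripping the currency symbols from the price so it can be cast to a number (needed for the price analysis later), and casting the review date to a proper date type. Column names follow the illustrative selection above.

```python
from pyspark.sql import functions as F

# Drop rows with missing values in the selected columns
listings_df = listings_df.dropna()
reviews_df = reviews_df.dropna()

# Strip "$" and "," from the price strings and cast to a numeric type
listings_df = listings_df.withColumn(
    "price", F.regexp_replace("price", r"[$,]", "").cast("double")
)

# Cast the review date string to a date type
reviews_df = reviews_df.withColumn("date", F.to_date("date"))
```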

Data Analysis

Now that our parameters of interest are chosen and the data is cleaned, we can get to the really fun stuff! To start off, let's find out which cities have the most Airbnb listings and how many neighborhoods each city contains. A good way to find such information is with a SQL query; Spark allows the direct use of SQL, which is quite convenient.
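
For example, registering the listings data frame as a temporary view and running a query along these lines (the original query is not shown, so this is an approximation):

```python
# Expose the cleaned listings to Spark SQL
listings_df.createOrReplaceTempView("listings")

listings_per_city = spark.sql("""
    SELECT city,
           COUNT(*)                              AS num_listings,
           COUNT(DISTINCT neighborhood_cleansed) AS num_neighborhoods
    FROM listings
    GROUP BY city
    ORDER BY num_listings DESC
""")
listings_per_city.show(10)
```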

Ordering the cities with highest number of listings.

The table of ordered cities above gives a pretty clear idea of how the total number of listings is distributed between cities, but a large part of data science is about presenting information in a way that is appealing to the audience. Let's say we wanted to show the same information as in the table, but with a graph instead. One way of doing that is to first convert the Spark data frame to a pandas data frame, and then use your favorite data visualization tool to plot it.
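
A sketch of this approach using matplotlib, assuming the aggregated result from the previous query is small enough to collect to the driver:

```python
import matplotlib.pyplot as plt

# The aggregate has one row per city, so it is safe to bring to the driver
pdf = listings_per_city.toPandas()

pdf.plot.bar(x="city", y="num_listings", legend=False)
plt.ylabel("Number of listings")
plt.tight_layout()
plt.show()
```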

Ordering the cities by number of listings.

Looking at the number of listings per city we see that Paris has the most; however, we can also see that Greater London and London are defined as two different cities, so does Paris truly have the highest number of listings? We can also order the cities by their number of neighborhoods, as seen below.
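
Reusing the aggregate from before, something like the following would produce that ordering:

```python
# Same per-city aggregate, ordered by neighborhood count instead
listings_per_city.orderBy("num_neighborhoods", ascending=False).show(10)
```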

Closer look at a single city

Now that we have made some general insights about the cities, let's choose one city, say Beijing, and find specific insights about it. First, let's see which neighborhoods are located in Beijing and how many listings each has; again, SQL can be used for this type of task.
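
A query along these lines could produce the table below (the column and city labels are assumptions about how the data is stored):

```python
beijing_neighborhoods = spark.sql("""
    SELECT neighborhood_cleansed,
           COUNT(*)                      AS num_listings,
           COUNT(DISTINCT property_type) AS num_property_types
    FROM listings
    WHERE city = 'Beijing'
    GROUP BY neighborhood_cleansed
    ORDER BY num_listings DESC
""")
beijing_neighborhoods.show()
```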

Neighborhoods in Beijing and number of property types in each.

Prices/ Show me the money

An important statistic for an Airbnb consumer is probably the price, so let's take a look at the price distribution for all listings in Beijing.
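
One way to get that distribution is to filter the Beijing listings, collect the (now much smaller) price column to the driver, and plot a histogram; a sketch under the same naming assumptions as before:

```python
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# Prices for Beijing only; small enough to collect to the driver
beijing_prices = (
    listings_df.filter(F.col("city") == "Beijing")
               .select("price")
               .toPandas()
)

beijing_prices["price"].plot.hist(bins=50)
plt.xlabel("Price (RMB)")
plt.show()
```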

Price distribution for Beijing

The bar plot above shows that the vast majority of listings fall within the same general price range, roughly [0, 2000], so let's filter out the outliers and see what we get.
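
Continuing from the previous snippet, the filtering might look like this:

```python
# Keep only listings priced within [0, 2000] RMB and plot again
filtered = beijing_prices[beijing_prices["price"] <= 2000]

filtered["price"].plot.hist(bins=50)
plt.xlabel("Price (RMB)")
plt.show()
```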

Price distribution for Beijing for price range [0, 2000]

The overall price distribution looks to be exponential; the range of prices is wide, with most listings on the market priced under 1000 RMB. Taking the average of the prices above, one finds that the average price of a listing is about 700 RMB, which is roughly $100.

The rise of Airbnb

Airbnb is one of the biggest startup success stories ever, and it has grown exponentially in popularity over the last decade. Can this trend be seen in Beijing as well?
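
One way to produce such a plot is to count reviews per date for Beijing listings and take the cumulative sum, using the number of reviews as a proxy for bookings; a sketch under the same naming assumptions as before:

```python
import matplotlib.pyplot as plt

reviews_df.createOrReplaceTempView("reviews")

# Reviews per day for Beijing listings
reviews_over_time = spark.sql("""
    SELECT r.date, COUNT(*) AS reviews_per_day
    FROM reviews r
    JOIN listings l ON r.listing_id = l.id
    WHERE l.city = 'Beijing'
    GROUP BY r.date
    ORDER BY r.date
""").toPandas()

# Cumulative total over time
reviews_over_time["total_reviews"] = reviews_over_time["reviews_per_day"].cumsum()
reviews_over_time.plot(x="date", y="total_reviews", legend=False)
plt.ylabel("Total number of reviews")
plt.show()
```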

Total number of Airbnb reviews vs Date in Beijing

So, the aforementioned exponential trend didn't quite start a decade ago, but it has definitely occurred within the last few years. There are few reviews before 2016, and it could be interesting to find out what factors this could be attributed to. Did overall travel to Beijing increase, had 4G only recently become stable in China during that period, did Airbnb release a booking app, or were a large number of houses in Beijing suddenly developed? It is hard to say, but an interesting question to ponder nonetheless.

Seasonal trends

For a property owner it could be of interest to see in which months his or her property would be in highest demand, so let's see if there are any seasonal trends among consumers.
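
A sketch of how the reviews could be grouped by calendar month across all years:

```python
import matplotlib.pyplot as plt

# Reviews per calendar month, summed over all years, for Beijing listings
monthly = spark.sql("""
    SELECT MONTH(r.date) AS month, COUNT(*) AS num_reviews
    FROM reviews r
    JOIN listings l ON r.listing_id = l.id
    WHERE l.city = 'Beijing'
    GROUP BY MONTH(r.date)
    ORDER BY month
""").toPandas()

monthly.plot.bar(x="month", y="num_reviews", legend=False)
plt.ylabel("Number of reviews")
plt.show()
```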

Seasonal trend of Airbnb consumers in Beijing

The plot above shows the total number of reviews in each month, summed over all years in our Reviews dataset. As can be seen, the most popular month is August, with the other summer months being relatively popular as well.

The hidden value of reviews

There are not a whole lot of factors which a property owner can control when renting out his or her property on Airbnb. It can also be difficult for the owner to know what guests want, so that the property can stand out from others. The reviews which guests leave after a stay probably give the owner the best insight into what guests prefer and don't prefer. The way we are going to try to find important words is by weighting them in relation to how often they appear and the overall rating of the reviews in which they appear. The logic is that people likely use the same words to describe the same experiences, and that the rating a guest gives is correlated with the experience he or she had. If this logic holds, we should be able to find words which appear with high frequency when a guest had a good experience, as well as when they had a less good one.

To perform the word rating we will first get all the words in every review under some constraints: only words longer than three letters are considered (we want to get rid of words such as a, and, the, etc.), and words occurring in less than 0.5% of all reviews are removed. We will also find the average rating every property has gotten over time. Shown below is the expression of the scoring function used, the code implementation, and an example of grouped reviews for one listing (this same grouping is done for all properties). The score for a word is found by taking the mean of all the individual review scores it occurred in; x̄ is the mean rating over all reviews in our dataset (it is about 90).

Word score function.
First 20 reviews and mean rating for first property
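
The original scoring code is not reproduced here, so below is a sketch of one plausible implementation: explode the review text into words, keep words longer than three letters that occur often enough, average the listing ratings of the reviews each word appears in, and center the result around the overall mean rating x̄ (roughly score(w) = mean rating of reviews containing w − x̄). The rating column name, the 0–100 rating scale, and the centering step are assumptions.

```python
from pyspark.sql import functions as F

# Attach each listing's average rating to its reviews
# ("review_scores_rating" is assumed to be on a 0-100 scale, mean ~90)
reviews_with_rating = reviews_df.join(
    listings_df.select("id", "review_scores_rating"),
    reviews_df.listing_id == listings_df.id
)

# Split review text into lowercase words; keep only words longer than three letters
words = (
    reviews_with_rating
    .select("review_scores_rating",
            F.explode(F.split(F.lower("comments"), r"\W+")).alias("word"))
    .filter(F.length("word") > 3)
)

# Drop rare words: require an occurrence count of at least 0.5% of the number
# of reviews (an approximation of the per-review frequency filter)
total_reviews = reviews_df.count()
word_scores = (
    words.groupBy("word")
         .agg(F.count("*").alias("occurrences"),
              F.avg("review_scores_rating").alias("avg_rating"))
         .filter(F.col("occurrences") >= 0.005 * total_reviews)
)

# Center the scores around the overall mean rating x̄
mean_rating = listings_df.agg(F.avg("review_scores_rating")).first()[0]
word_scores = word_scores.withColumn("score", F.col("avg_rating") - mean_rating)

word_scores.orderBy("score", ascending=False).show(20)   # highest scoring words
word_scores.orderBy("score").show(20)                    # lowest scoring words
```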

With the implementation above we can now view the highest and lowest rated words as calculated by our word scoring function.

Rating of words

From the results above, it seems to be a good idea to give out snacks if you're a property owner, and to keep your toilet clean. It could also be a good idea to include high-scoring words such as cottage or spotless in your description; furthermore, it might be good to avoid low-scoring words such as basic or load.

Conclusion

If you made it to the end, I would like to thank you for reading and I truly hope you learned something from this article. I am always trying to make my articles better reading experiences, so if you have any insights or thoughts please feel free to share them.
