A Week of Mining Seattle's Craigslist Apartment Pricing

Everyone is freaking out over San Francisco’s astronomically high rents right now, but Seattle real estate isn’t that far behind.
I was walking down the street in the University District near the University of Washington when I saw a new building under construction. And then more construction. And then MORE construction down the block. Ridiculous. According to a half-torn flyer, these new apartments, opening in 2015, were priced at around $1,300 per bedroom for a two-bedroom apartment! I quickly went home and started trying to figure out how fast rent was rising in Seattle.
One fun way to do this was by working on a project that I had put off for school back in November. Since Craigslist has reliably been the best place to find apartments, I wrote a script using Scrapy to grab apartment listings from Seattle Craigslist and filtered them down to the zipcodes within the Seattle boundaries.
I manually ran the script for about a week, and after filtering out duplicate posts and duplicate IDs, I got 6,000 individual listings within the metro region and 2,400 unique listings within the city of Seattle.
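The clean-up step described above could be sketched roughly like this. The original script isn’t shown here, so the field names (`post_id`, `title`, `price`, `zipcode`), the rule for what counts as a duplicate post, and the small zipcode set are all assumptions for illustration.

```python
def dedupe_listings(listings):
    """Drop repeated IDs and reposts (same title and price under a new ID)."""
    seen_ids = set()
    seen_posts = set()
    unique = []
    for item in listings:
        # Assumption: a repost is a listing with the same normalized title and price.
        post_key = (item["title"].strip().lower(), item["price"])
        if item["post_id"] in seen_ids or post_key in seen_posts:
            continue
        seen_ids.add(item["post_id"])
        seen_posts.add(post_key)
        unique.append(item)
    return unique


def within_zipcodes(listings, zipcodes):
    """Keep only listings whose zipcode is in the given set."""
    return [item for item in listings if item["zipcode"] in zipcodes]


# Hypothetical sample rows; real rows would come from the scraper's output.
scraped = [
    {"post_id": "1", "title": "Sunny 1BR", "price": 1200, "zipcode": "98105"},
    {"post_id": "1", "title": "Sunny 1BR", "price": 1200, "zipcode": "98105"},   # same ID
    {"post_id": "2", "title": "sunny 1br ", "price": 1200, "zipcode": "98105"},  # repost
    {"post_id": "3", "title": "Big 2BR", "price": 1900, "zipcode": "98004"},     # outside the city
]
SEATTLE_ZIPCODES = {"98105", "98112", "98118", "98144"}  # illustrative subset only

metro = dedupe_listings(scraped)
seattle = within_zipcodes(metro, SEATTLE_ZIPCODES)
print(len(metro), len(seattle))
```

Deduplicating before the zipcode filter mirrors the order in the text: first the metro-wide count, then the Seattle-only subset.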

A sample of the data. You can find the full dataset on my GitHub.
Hopefully by next year I can finish pipelining the data straight into a database with a bash script that runs automatically every day, so I can track prices over time. I was halfway through it when midterms arrived at the door. Stupid school.
For a tutorial on scraping Craigslist with Scrapy and on the regression analysis, see Practical Web Scraping with Scrapy.
First I plotted all of the Seattle apartment listing prices.

Histogram of Overall Prices
And here are the average prices per number of bedrooms.
Seattle Craigslist Prices on ggplot2
Looks pretty reasonable. Now let’s look at Seattle’s median Craigslist apartment prices in comparison to San Francisco’s median prices.

Excuse my terribly labeled price values
Ouch, well, maybe they are pretty far ahead. But to be fair, Seattle’s neighborhoods are probably more varied in cost compared to San Francisco’s huge concentration of very expensive apartments in the Mission and north of Bayview.
San Francisco apartment prices for a one-bedroom apartment, courtesy of Priceonomics
Now let’s say that we want to do the same thing and see how much the same apartment would cost in different neighborhoods of Seattle. Without nearly enough data to get accurate average prices for one-bedroom or two-bedroom apartments in each neighborhood, let’s try a different method for finding how prices vary by neighborhood. Instead of just computing the average price of a one-bedroom apartment per neighborhood, we can run a regression model to see how the neighborhood affects price while holding other factors, such as square footage and number of bedrooms, constant.
In this case, we are going to use zipcodes and treat them as factors in our model. Running a regression on number of beds, number of baths, zipcode, and square footage, you can see how much each zipcode/neighborhood matters for the base price.
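To make the idea of “zipcodes as factors” concrete, here is a minimal sketch in plain Python. It is not the code behind the charts in this post. It assumes one dummy column per zipcode and no separate intercept, so each zipcode gets its own base price, and it fits ordinary least squares by solving the normal equations. The listings, the two zipcodes, and the “true” coefficients are made up so the fit can be checked.

```python
def design_row(listing, zipcodes):
    """One dummy column per zipcode, then sqft (in hundreds), beds, baths."""
    dummies = [1.0 if listing["zip"] == z else 0.0 for z in zipcodes]
    return dummies + [listing["sqft"] / 100.0, float(listing["beds"]), float(listing["baths"])]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, prices):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(rows, prices)) for i in range(k)]
    return solve(xtx, xty)


# Made-up coefficients used only to generate toy prices.
TRUE_BASE = {"98102": 800.0, "98105": 300.0}


def toy_price(listing):
    return (TRUE_BASE[listing["zip"]] + 70.0 * listing["sqft"] / 100.0
            + 320.0 * listing["beds"] + 100.0 * listing["baths"])


raw = [("98102", 600, 1, 1), ("98102", 900, 2, 2), ("98102", 1100, 2, 1), ("98102", 750, 1, 2),
       ("98105", 500, 0, 1), ("98105", 850, 2, 1), ("98105", 1200, 3, 2), ("98105", 700, 1, 1)]
listings = [{"zip": z, "sqft": s, "beds": b, "baths": t} for z, s, b, t in raw]

zipcodes = sorted(TRUE_BASE)
coefs = fit_ols([design_row(l, zipcodes) for l in listings], [toy_price(l) for l in listings])
base_price = dict(zip(zipcodes, coefs))
per_100_sqft, per_bed, per_bath = coefs[len(zipcodes):]
print(base_price, per_100_sqft, per_bed, per_bath)
```

Because the toy prices contain no noise, the fit should recover the made-up coefficients. On real listings the estimates would come with standard errors, and a statistics package would be the more practical choice.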

The bar graph above shows how much each location would cost before factoring in any extra beds, baths, or square footage. Basically, if you want to live downtown, prepare to fork over 500+ more dollars per month than you would for the exact same apartment anywhere in North Seattle. If we add 95% confidence intervals to the values, we can see which neighborhoods have a wider or narrower range around their base price estimate.

Base Price with Confidence Intervals
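For reference, a 95% interval like the ones plotted here is commonly approximated as the estimate plus or minus 1.96 standard errors. The snippet below is a minimal sketch under that normal approximation. Both input values are made-up placeholders, not numbers from the model.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Normal-approximation interval: estimate ± z * standard error."""
    margin = z * std_error
    return estimate - margin, estimate + margin


# Hypothetical base price and standard error, for illustration only.
low, high = confidence_interval(estimate=800, std_error=50)
print(f"{low:.0f} to {high:.0f}")
```

A larger standard error, which is what you would expect when a zipcode mixes cheap and expensive areas, widens the interval directly.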
We can see here that zipcodes that encompass more neighborhoods generally have larger confidence intervals, since the differences in price between nearby neighborhoods are greater. For example, the Central District’s zipcode of 98144 also encompasses parts of SODO, Mt. Baker, North Beacon Hill, and South Downtown. For neighborhoods in the downtown area, there is less of a range because the zipcodes cover smaller areas and correspond to their neighborhoods more accurately.
The widest intervals seem to belong to Madison Park/Montlake and Columbia City. The zipcode of 98112 stretches from the west side of Lake Washington, where Madison Park and the Arboretum lie, all the way to Volunteer Park and some of the nicer parts of Capitol Hill. There’s bound to be more variance when one zipcode covers such a large area with multiple neighborhoods, especially ones harboring nice restaurants such as Harvest Vine, which my girlfriend tells me is quite classy and spectacularly organic.

The zipcode of 98112 displayed on Google Maps
Now looking at the zipcode of 98118, we can see that it also comprises quite a number of different neighborhoods. For the most part, if we were to look directly at the map, we would probably classify the area as mainly Rainier Valley. As most people know, Rainier Valley is one of the poorest neighborhoods in Seattle today. So why is the zipcode right there next to West Seattle and Magnolia in price? Well, there are a few reasons, actually. Taking a look at the data, most of the listings had Columbia City in their titles. Columbia City is going through what Wikipedia describes as “gentrification” and has become a “relatively trendy neighborhood” in the last couple of years. The wide confidence interval could then reflect the mix of neighborhoods, from low-income Rainier Beach to the high-end houses that overlook Seward Park, along with many now-expensive Columbia City townhomes and apartments. A more interesting thought: if Seattle’s rate of expansion and growth starts matching San Francisco’s soon, could Rainier Valley become the next Mission District?

The zipcode of 98118 displayed on Google Maps
If we take a look at the rest of the coefficients, we can then get a decent prediction of how much a future apartment would cost.

Excel graphs look good now!
So let’s say that I want to estimate the price of an apartment in Capitol Hill. I will take 900 square feet of space, two bedrooms because I am still too scared to live alone, and two bathrooms just because of personal bathroom issues.
Price = Base Price of Capitol Hill (808) + Square Footage(9 * 69.68) + 2*322 + 2*107 = 2,293 dollars.
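The same arithmetic as a small function, in case you want to plug in other apartments. This is a sketch using the numbers from the formula above. It assumes the 69.68 coefficient applies per 100 square feet (which is what the `9 * 69.68` term implies), that 322 is the per-bedroom coefficient, and that 107 is the per-bathroom coefficient. The function and parameter names are my own.

```python
def estimate_rent(base_price, sqft, beds, baths,
                  per_100_sqft=69.68, per_bed=322, per_bath=107):
    """Linear price estimate: zipcode base price plus per-feature add-ons."""
    return (base_price
            + (sqft / 100) * per_100_sqft  # assumed: coefficient is per 100 sq ft
            + beds * per_bed               # assumed: 322 is the bedroom coefficient
            + baths * per_bath)            # assumed: 107 is the bathroom coefficient


capitol_hill = estimate_rent(base_price=808, sqft=900, beds=2, baths=2)
print(round(capitol_hill))
```

Swapping in a different base price gives the estimate for the same apartment in another zipcode.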

Not too bad.
Let’s shift gears again and try to improve the prediction by adding in a couple more factors. Obviously not every apartment is so easy to price, because there are a lot more considerations when we go out to find a new place to live. This one was below market value according to our model and honestly looks like a steal when looking at that picture of the rooftop.
But how do we actually incorporate additional amenities such as “rooftop” or maybe the word “penthouse” in a posting? When we look for apartments, we obviously won’t just check whether a place is within a neighborhood and price range with a specified number of beds and bathrooms. We need to look at pictures! Lots of apartments are dirt cheap because, even if they’re huge, they might be old or disgusting. Since I haven’t yet written a computer vision algorithm to detect whether or not an apartment looks awesome, we can try using the number of pictures in a posting and the word count of the posting to see if these factors are significant.
Re-running the regression with them added and cross-validating the model to check for overfitting, I found that it is a slightly better predictor of price than the original model, but not by much. [DATA IN UPCOMING TUTORIAL]
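For anyone unfamiliar with cross-validation, the idea is to fit on part of the data, score on the held-out part, and compare models by their held-out error. Below is a minimal k-fold sketch in plain Python. It is not the comparison run for this post: the two models here (a mean-only baseline and a one-feature line on square footage) and the toy data are stand-ins chosen to keep the example short.

```python
import math


def k_fold_rmse(rows, fit, predict, k=5):
    """Average held-out RMSE over k folds; rows are (x, price) pairs."""
    fold_scores = []
    for fold in range(k):
        held_out = rows[fold::k]  # every k-th row, starting at `fold`
        training = [row for i, row in enumerate(rows) if i % k != fold]
        model = fit(training)
        squared = [(predict(model, x) - price) ** 2 for x, price in held_out]
        fold_scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_scores) / k


def fit_mean(rows):
    """Baseline model: always predict the average training price."""
    return sum(price for _, price in rows) / len(rows)


def predict_mean(model, x):
    return model


def fit_line(rows):
    """Simple linear regression of price on a single feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return intercept + slope * x


# Made-up (sqft, price) pairs with a small alternating error term.
toy = []
for i in range(10):
    sqft = 400 + 100 * i
    noise = 20 if i % 2 else -20
    toy.append((sqft, 500 + 1.5 * sqft + noise))

baseline_rmse = k_fold_rmse(toy, fit_mean, predict_mean)
line_rmse = k_fold_rmse(toy, fit_line, predict_line)
print(baseline_rmse, line_rmse)
```

If the richer model only wins on the training data but not on the held-out folds, that is the overfitting signal. With real listings you would shuffle the rows before splitting them into folds.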
In the re-fit model, each picture added to a Craigslist posting adds 9 dollars to the price, and each line in the body of a posting adds 2.83 dollars. The base prices and the other coefficients have changed as well. But does this mean that all realtors should start posting as many pictures as they can and writing lines and lines of garbage in their postings in order to jack up their prices? Absolutely not. It just shows that listings with more pictures are generally priced higher, maybe because people whose apartments are cheaper do not exactly want to post pictures showing why their apartments are so much cheaper.
FUTURE LOOK
What’s my future goal for this project? I am not sure, but I know I want to start collecting more data. The awesome thing about data analysis and data science is that you start finding things that lead to more questions. I think I can list a couple of them already. One is that I don’t expect the price of an apartment in June to match the price of the same apartment in December. By collecting data over the course of a year, I can see how the month affects the price and notice whether there are seasonal trends. My sister had a hard time finding a cheap place in Seattle right after school ended last June, when it seems like there may be better times to start a lease, with less demand (especially when leases are yearly).
With more data, the predictions should also get more accurate, and text mining techniques could be used to find out whether specific words are associated with more expensive apartments. Computer vision software in Python is advancing rapidly and may soon be able to detect brightly lit rooms or large window views. More data can also support a recommendation system or an alert system for a great deal that suddenly gets posted.
I will definitely update this post sometime in the next year. But for now, if you guys have any questions or can point out some things I did wrong, which I probably did, PLEASE leave a comment or email me. I am always interested in different opinions and different ways to improve!