Cleaning the House Sales Data
When you finish this exercise you should
- Have worked with a larger spreadhseet.
- Cleaned and formatted the worksheet.
You will be dealing with another dataset from Kaggle.com. This one represents the prices of houses sold in King County Washington between May 2014 and May 2015. We will use this data for the next few exercises to see if we can find anything interesting about this housing market.
Word of caution, this is a fairly large data set. 21,000+ records. Be careful.
- Download this file. Save it somewhere you can find it.
- This is a comma separated value file.
- Data is in rows, with commas between the values.
- Excel will be happy to load this file, just make sure to tell it the fields are delimited with a comma (,) not spaces.
- Once the file is opened, save it as an excel file.
- File
- save as
- Select excel workbook.
- Information step: the data in this file consists of:
- There is no key for this data set, but most information should be clear
on observation.
- The Format data in the ID column as a number with no decimal places.
- Format the price column as an amount.
- Hide the lat and long columns.
- Let's turn the date column into something useful.
- Insert a column to the right of the date column.
- Split the date column on the "T" (Data, Text to Columns, Delimited, T as the delimiter)
- This will produce a column of all zeros. Delete it.
- The date is still not super useful.
- It appears to be a YYYYMMDD
- We can split it again but this time on a fixed width
- Add two columns to the right of the date.
- Click on the 4 and the 1 to draw lines.
-
- The final result should be
-
- Add a new tab and build a lookup table to map the month number to the three letter month abbreviation
- Add a column (probably D) and compute the month name based on the month number.
-
- Hide the month number column.
- Label the columns as Year, Month, Day.
- Probably
- Go through and change the labels to something more readable. sqft_lot should be changed to SQFT Lot for example.
- It appears that the sizes of some houses and lots have changed.
- We don't have the tools to work with different years so just hide columns U and V.
- It would be nice to have all of the size measurements together
- Move SQFT Upper and SQFT Basement to be between SQFT Living and SQFT Lot.
-
- Let's do a check to see if the living area matches the upper + basement area.
- Add a column (probably L)
- Call this area check.
- Subtract the sum of the upper and basement area from the Living area.
- In my case, I think everything is a 0, so that checks out.
- We should probably delete column L, but keep it in the workbook so I can look at it.
- It would probably be good to see the age of the house when sold.
- Label column X, "Age When Sold"
- Calculate the age when sold as the difference between the Year sold (Col B) and the maximum of year built or year renovated.
-
- Check the ages of the first two rows, they should match my computation.
- Save your work
- Submit your saved document to the Houses Part 1 folder in the Assignment section of D2L for this class.