Chapters 4 and 5
- Some Sweet Data
- FiveThirtyEight.com has some interesting data and articles.
- They used an online poll to let users compare fun-sized candy types.
- They collected this data.
- And produced this article.
- They claim 8,371 different IP addresses, which probably means about that many users.
- They claim 269,000 matchups, which are the data points.
- Average = 32 and median = 11 matchups from each IP address.
- They do not claim this is a scientific study in any way.
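A quick check of the quoted average, using only the two totals given above. The variable names are mine, and the comment about skew is my reading of the numbers, not something the article states.

```python
# Totals quoted in the article, as listed in the notes above.
ip_addresses = 8_371
matchups = 269_000

# Mean matchups per IP address; this should land near the quoted average of 32.
mean_matchups_per_ip = matchups / ip_addresses
print(f"Mean matchups per IP address: {mean_matchups_per_ip:.1f}")

# The quoted median (11) is well below the mean, which suggests that a small
# number of IP addresses submitted a large share of the matchups.
quoted_median = 11
print(f"Mean is about {mean_matchups_per_ip / quoted_median:.1f} times the median")
```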
- Let's grab the data
- There are many ways, but let's try Excel's import data function.
- Start Excel.
- Grab the data
- On the Data tab, in Get External Data, select From Web.
- Tell it to stop worrying about errors on the Edinboro page.
- Enter the URL https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv
- It would have been cool if it found the table, but it didn't, so select the arrow at the top of the page.
- Select Import
- After a brief delay, this will import the entire web page.
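The URL above points at GitHub's HTML page for the file, which is why the whole web page comes in. As a sketch of an alternative, GitHub also serves the bare file from raw.githubusercontent.com, and the function below rewrites a "blob" page URL into that form. The function name is my own, and the URL layout it relies on (owner/repo/blob/branch/path) is an assumption about how GitHub arranges its links, so check the result in a browser before depending on it.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com 'blob' page URL as a raw.githubusercontent.com URL."""
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")

    # Assumed layout: owner / repo / "blob" / branch / path to the file
    segments = parts.path.strip("/").split("/")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError("expected a /owner/repo/blob/branch/path URL")

    # Drop the "blob" segment and keep everything else in order.
    owner, repo, _, *rest = segments
    raw_path = "/" + "/".join([owner, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


page_url = (
    "https://github.com/fivethirtyeight/data/blob/master/"
    "candy-power-ranking/candy-data.csv"
)
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

A raw URL returns only the CSV text, so it may cut down on the clean-up that follows.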
- Clean up.
- In my download, I needed to delete rows 1-65 and another set at the bottom of the page.
- I needed to delete column A.
- I replaced Õ with ' (so HersheyÕs Kisses became Hershey's Kisses).
- I formatted the Sugar Percent and Price Percent columns nicely.
- I added a column for a derived Win Percent and converted column M to a percent.
- I hid column M as it is no longer needed.
- I renamed the headers, made them bold, and word wrapped them.
- Note: This should all be added to the methods document, Data Cleaning section.
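For the methods document, it can help to write the cleaning steps down as code. Below is a minimal sketch of the same steps in Python rather than Excel. The column names (competitorname, sugarpercent, pricepercent, winpercent), the 0 to 100 scale for winpercent, and the sample rows are all assumptions for illustration. The numbers are made up and are not the real survey results.

```python
import csv
import io

# Made-up sample shaped like the download; values are illustrative only.
SAMPLE = """competitorname,sugarpercent,pricepercent,winpercent
HersheyÕs Kisses,0.25,0.10,50.0
Example Bar,0.50,0.75,80.0
"""


def clean_rows(text):
    """Apply the clean-up steps from the notes to CSV text."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            # Repair the mis-encoded apostrophe.
            "Name": row["competitorname"].replace("Õ", "'"),
            "Sugar Percent": float(row["sugarpercent"]),
            "Price Percent": float(row["pricepercent"]),
            # Derived column: assumes winpercent is on a 0-100 scale,
            # so dividing by 100 gives a fraction to display as a percent.
            "Win Percent": float(row["winpercent"]) / 100,
        })
    return rows


cleaned = clean_rows(SAMPLE)
for r in cleaned:
    print(f'{r["Name"]}: win {r["Win Percent"]:.0%}')
```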
- Documentation
- Multiple sources state that you should provide an "About" or "Welcome" worksheet
- I kind of like this document
- I have also seen this disputed, but for now, let's do it.
- Insert an About worksheet.
- Author
- Date
- Reason
- Source of data.
- A list of other sheets in the workbook and what they contain.
- Data Dictionary
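One way to picture the About worksheet is as a two-column table of labels and values. The sketch below builds those rows and prints them as CSV text that could be pasted into a sheet. The function name, the sheet names, the author, the reason, and the date are all placeholders of mine, and the data dictionary is left out to keep it short.

```python
import csv
import io
from datetime import date


def about_rows(author, reason, source, sheets, when=None):
    """Build label/value rows for an About worksheet."""
    when = when or date.today()
    rows = [
        ("Author", author),
        ("Date", when.isoformat()),
        ("Reason", reason),
        ("Source of data", source),
    ]
    # One row per worksheet, describing what it contains.
    rows += [(f"Sheet: {name}", description) for name, description in sheets.items()]
    return rows


rows = about_rows(
    author="Your Name",  # placeholder
    reason="Class exercise on the candy ranking data",  # placeholder
    source=(
        "https://github.com/fivethirtyeight/data/blob/master/"
        "candy-power-ranking/candy-data.csv"
    ),
    sheets={"About": "This page", "Data": "Cleaned candy data"},  # hypothetical names
    when=date(2024, 1, 15),  # fixed placeholder date so the output is repeatable
)

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```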
- I am not the best at enforcing a particular style on anyone.
- But you should have an About page on anything you are going to turn in.
- Save this somewhere you will have access to it later. We will start with it on the next set of notes.