Data Sources
- Let's take a look at the project description.
- At least 100 records.
- "Sufficient" fields to have something to analyze
- Numeric data.
- Stay away from summary data.
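The requirements above can be checked quickly before you commit to a dataset. This is a minimal sketch, assuming the data is a CSV file with a header row. The function name `summarize_csv` and the tiny sample are made up for illustration, and "numeric" here only means "every value parses as a number".

```python
import csv
import io

def summarize_csv(text, min_records=100):
    """Report the record count, the field names, and which fields look numeric."""
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = list(rows[0].keys()) if rows else []

    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    # A field counts as numeric only if every record parses as a number.
    numeric = [f for f in fields if all(is_number(r[f]) for r in rows)]
    return {
        "records": len(rows),
        "enough_records": len(rows) >= min_records,
        "fields": fields,
        "numeric_fields": numeric,
    }

# Made-up sample, far too small for the project; it only shows the check.
sample = "name,age,score\nAnn,19,88.5\nBob,20,91\nCy,18,79\n"
report = summarize_csv(sample)
print(report)
```

For a real file, read its contents with `open(path, newline="").read()` and pass the text in. A field that passes this check may still be a coded category, so look at the values yourself.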
- Statisticians classify data as:
- Binary:
- A yes or no
- True or false
- Male or female
- Success or failure
- Be careful: this may be coded as 0/1, but it is not numeric.
- Numerical:
- A value like a temperature or a test score.
- Statisticians actually distinguish several numerical types, but this will do for us.
- Categorical:
- Select from a list.
- Political party, class rank, ...
- Be careful: sometimes this is coded as a number.
- Freshman: 1, Sophomore: 2, Junior: 3, Senior: 4
- But most statistical measures, such as a mean, are not appropriate for these codes.
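To see why a coded category is not numeric, here is a small sketch using the class-rank coding above. The eight student codes are invented for illustration. Python will compute a mean without complaint, but the result does not describe anyone; counting each category is the summary that makes sense.

```python
from collections import Counter
from statistics import mean

# The coding from the notes above.
LABELS = {1: "Freshman", 2: "Sophomore", 3: "Junior", 4: "Senior"}

# Hypothetical class-rank codes for eight students.
codes = [1, 1, 1, 4, 4, 4, 2, 3]

# This runs, but a "class rank of 2.5" is not a real category.
misleading_mean = mean(codes)
print("mean of the codes:", misleading_mean)

# Frequencies per category are a meaningful summary.
counts = Counter(LABELS[c] for c in codes)
for label, count in counts.most_common():
    print(label, count)
```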
- I will add one more type: text.
- Names
- Addresses
- Tweets
- These are probably unique or nearly unique.
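The four types above can be roughly guessed from a column's values. This is only a heuristic sketch: the function name `guess_kind` and the 0.9 uniqueness threshold are assumptions chosen for illustration, not a standard rule, and the sample columns are made up.

```python
def guess_kind(values, unique_ratio=0.9):
    """Guess whether a column is binary, numerical, categorical, or text."""
    distinct = set(values)
    # Exactly two distinct values: yes/no, true/false, 0/1, ...
    if len(distinct) == 2:
        return "binary"
    # Every value parses as a number.
    try:
        for v in values:
            float(v)
        return "numerical"
    except (TypeError, ValueError):
        pass
    # Unique or nearly unique strings (names, addresses, tweets).
    if len(distinct) >= unique_ratio * len(values):
        return "text"
    # Otherwise, a short list of repeated labels.
    return "categorical"

columns = {
    "sex": ["M", "F", "F", "M", "F"],
    "temp": ["98.6", "99.1", "97.8", "98.6", "100.2"],
    "party": ["Dem", "Rep", "Ind", "Dem", "Rep", "Dem", "Ind", "Rep", "Dem", "Rep"],
    "name": ["Ann Lee", "Bob Ray", "Cy Fox", "Di Wu", "Ed Poe"],
    # Coded class rank: the heuristic cannot tell this is really categorical.
    "rank": ["1", "2", "3", "4", "1", "2"],
}
kinds = {name: guess_kind(vals) for name, vals in columns.items()}
print(kinds)
```

Note that the coded `rank` column is reported as numerical. That is exactly the trap described above, so a guess like this is a starting point, not a replacement for reading the data description.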
- Look at Heart Failure Prediction on Kaggle.
- What can you do to the fields?
- Take a look at All Space Missions from 1957 on Kaggle.
- Summary vs Raw Data
- Sources
- Data collections pages
- Kaggle
- You need a membership to access data.
- But they don't send very much email.
- I will use many Kaggle datasets.
- Also: Competitions, Kernels, and Tutorials
- Pages supporting "research"
- I like FiveThirtyEight's GitHub account.
- This is data that they have used in articles.
- There is a nice collection of data here.
- And you can see what they have done with it.
- Government data sources
- Other
- Search your favorite topic.
- "open data", "dataset", and "csv" are helpful search terms.
- Some thoughts.
- Please consider your dataset carefully.
- General Advice:
- Start looking now.
- Talk to me about what you find.
- We can cut down on the size of a dataset, or even convert formats, but that may take some time.
- Please make sure that you keep track of where you located your dataset.
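Cutting a dataset down and converting its format, as mentioned in the advice above, can look like the following. This is a minimal sketch, assuming a CSV source and JSON as the target format. The function name `sample_csv_to_json` and the 500-record demo data are made up for illustration.

```python
import csv
import io
import json
import random

def sample_csv_to_json(csv_text, n, seed=0):
    """Keep at most n randomly chosen records and return them as a JSON string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) > n:
        # A fixed seed makes the same subset come out every time.
        rows = random.Random(seed).sample(rows, n)
    return json.dumps(rows, indent=2)

# Made-up data: 500 records with an id and a value, built only for this demo.
big_csv = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(500))
small_json = sample_csv_to_json(big_csv, n=100)
print(len(json.loads(small_json)), "records kept")
```

A random sample keeps the smaller dataset representative of the original, which taking the first 100 rows may not. All values come out as strings, because the `csv` module does not convert types.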