Beginning the Analysis Phase.
- Today we will work with the superhero dataset
- We will apply some basic statistics functions.
- I will discuss these as we compute them.
- Please help out.
- If you have not had a stats class, please ask if you don't understand what we are computing.
- Make a copy of the Raw Data tab, call it Working Data
- Let's name the fields in the working data tab.
- On the Working Data worksheet select the entire data range.
- On the Formulas tab in the Defined Names select Create from Selection
- Make sure that top row is the only thing selected.
- Press OK.
- Check your new names by clicking on the Name Manager in the same area
- If the scope of the name is workbook, it is available everywhere.
- If you set the scope to be worksheet, this will only be available in the worksheet.
- Be careful not to define the same name twice.
- Let's start by analyzing the Height field.
- Add a new tab, label this Height
- In A1 put: This worksheet provides a basic analysis of the height field.
- Counting the field is sometimes helpful
- At this stage it will let us know if all of the data is valid
- Move to the Height worksheet.
- In A5 put Numeric
- In B5 put =count(height)
- Did this computation do what you expected?
- What did you expect?
- What did it produce?
- How do we know this is valid?
- There are a few other count functions I would like you to do
- In A3 put Counts
- Merge this across A3:B3
- In A4 put All
- In B5 put =counta(height)
- There doesn't seem to be a difference here.
- Add a sandbox tab.
- In A1:a3 put 1, 2, 3
- An A4:A6 put a, b, c
- In A8 put NA
- In A9 put na (lower case)
-
- Name A1:A9 as list1
- Use the define name selection in the name manager
- Set the scope to be sandbox
- Do a count, counta on list1
- Add a =countblanks(list1)
- Add a cell called NA, then =countif(list1, "na"), or better yet, reference the cell.
-
-
- What information can we gather from these four counts?
- Return to the Height worksheet and add the other two counts.
- In every stats class you discuss the "measures of central tendency"
- What are these?
- Mean
- Median
- Mode
- Midrange
- Mean
- This is what is generally meant by "average".
- Add up all of the numbers and divide by the number of numbers.
- Should we use the formula?
- Ok in cell A9 put Mean
- In B9 put =average(height)
- What does this tell us?
- Is this what we expect?
- Let's go back to the sandbox
- Find the average of list1.
- What happened to the other values?
- Change the value in cell A1 to be 10000
- What happened to the mean?
- Outliers change the mean?
- The median is the middle number
- Sort the numbers and find the one in the middle.
- The formula =median(height) will calculate the median
- Is there anything strange about these results?
- Height is approximately normal
- This says that the mean, median and mode are about the same,
- And that the data follows the bell curve.
-
-
- So we can be a little worried about the difference here.
- Back to the sandbox
- We can generate a normal distribution for men
- mean = 172
- Standard deviation = 10
- =norm.inv(rand(), 172, 10)
- Copy this down.
- Name the range.
- Compute mean and median on this.
- Draw Histogram.
- Copy this histogram
- Compute the histogram for height.
- What do you see?
- We will skip the mode for now.
- The median is a measurement of position as well.
- Half the data is above th median
- Half is below.
- We can look at the data that is at the 25th position.
- This is called the quartile.
- There are many ways to compute the quartile functions in stats.
- There are in excel as well.
- The function takes a range and a stat to compute.
- 0 means min
- 1 means 25%
- 2 means 50% or median
- 3 means 75%
- 5 is max.
- =quartile is deprecated or has been replaced
- =quartile.inc, =quartile.exc
- I don't care which of the later you use, but I tend to use .inc
- Compute the quartiles for height.
- I think we have a problem here.
- Draw a box and whisker plot for each.
- Discuss outliers and inner quartile distance.