Data Science/Analytics generally consists of the following steps:
- You have a question that needs to be answered, and the answer should be based on data.
- You collect data that could contain the answer to your question.
- You clean the data and format it so that it can be analyzed.
- You perform basic data exploration to understand the data.
- You look for relationships among the data that can be used to predict the answer to your question.
- You build a mathematical model, based on the data, to answer your question.
- You test and use your model.
I wanted to understand who was in this class.
- Question: Is there anything special I should be doing for this class?
Step 1: Find some data.
- My only real data was from SCOTS.
  - We have access to a detailed class list.
  - Unfortunately, this is
    - A web page (in HTML)
    - And not directly accessible (you will see why this is important later).
Step 2: Acquire and transform the data.
- I did a copy and paste from the web page into a text file.
- This is not terribly useful, so I wrote a Python script to convert it to a Comma Separated Value (CSV) file.
- reader.py
- This produced
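The reader.py script itself is not reproduced here. As a rough idea of what such a conversion can look like, here is a minimal sketch. It assumes the pasted text has one student per line with fields separated by tabs, which may not match the real SCOTS layout. The function name `text_to_csv` and the sample rows are made up for illustration.

```python
import csv
import io


def text_to_csv(text: str) -> str:
    """Convert pasted, tab-separated text into CSV text.

    Assumption: one student per line, fields separated by tabs.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines left over from the copy and paste
        fields = [field.strip() for field in line.split("\t")]
        writer.writerow(fields)  # csv quotes any field that contains a comma
    return out.getvalue()


# Made-up sample input, for illustration only (not real class data).
sample = "Undergraduate\tFreshman\tBS\n\nUndergraduate\tSenior\tBA, Honors\n"
print(text_to_csv(sample))
```

Using the `csv` module rather than joining fields with commas by hand means that a field that itself contains a comma is quoted correctly.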
Step 3: Import and clean the data.
- Just clicking on a CSV file will start Excel.
- The data consists of
  1. Level (undergraduate/graduate)
  2. Class (Freshman, Sophomore, Junior, Senior)
  3. Year
  4. Program
  5. College
  6. Department
  7. Major
  8. Concentration
- These are not labeled, so I labeled them
- I really don't care about the Level, Year, and any secondary degree information.
- So I removed these fields.
  - By the way, two definitions
    - A record is a collection of information about an individual element of the population.
      - In this case the data on each of you constitutes a record.
      - In some cases, it might be an individual measurement.
      - But it is all the data associated with the individual/measurement, ...
    - A field is a piece of data in a record.
      - In this case the Class, Program, College, Department, Major, and Concentration (track) are fields.
    - Records are composed of fields.
    - Hopefully datasets are composed of records and fields.
- At this point we should note
  - For multiple degree students, we would have the first degree listed, not all degrees.
  - Not all students have tracks.
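I did the labeling and trimming in Excel, but the same step can be sketched in Python. The field names below come from the list in Step 3. The assumptions are that the columns appear in that order and that anything after the eighth column is secondary degree information. The function name `label_and_trim` and the sample row are made up for illustration.

```python
import csv
import io

# Field names from the list in Step 3, assumed to be in column order.
FIELDS = ["Level", "Class", "Year", "Program", "College",
          "Department", "Major", "Concentration"]
DROP = {"Level", "Year"}  # fields not needed for this analysis


def label_and_trim(csv_text: str) -> list[dict[str, str]]:
    """Return one record (a dict of field name -> value) per CSV row."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip empty lines
        # Pad short rows so a student without a concentration still fits.
        row = row + [""] * (len(FIELDS) - len(row))
        # zip stops at the last named field, so extra columns are dropped.
        record = dict(zip(FIELDS, row))
        records.append({k: v for k, v in record.items() if k not in DROP})
    return records


# Made-up row, for illustration only.
rows = label_and_trim("Undergraduate,Freshman,Y,BS,Science,Math,Mathematics\n")
print(rows[0])
```

Each dict is one record, and each key/value pair in it is one field.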
Step 4: Basic data exploration.
- An easy way to explore the data is to turn the worksheet into a table.
  - This allows you to filter the data
  - And see that we have about 20 upperclassmen.
  - This is interesting to me.
    - Last year nearly everyone was a freshman.
    - The only upperclassmen were Math/CS Majors
    - And those were Seniors.
  - Building a pivot table allows additional exploration
    - We can easily calculate the distribution of students by class
    - Or we can even explore Major by Level
    - We can even produce a view of the counts by college/Program
  - One more chart
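The filtering and pivot-table counts above were done in Excel. For comparison, here is a minimal Python sketch of the same two ideas: filtering records, and counting records per value or per combination of values. The records and the helper name `count_by` are made up for illustration; real records would come from the CSV file.

```python
from collections import Counter

# Made-up records, for illustration only (not real class data).
records = [
    {"Class": "Freshman", "College": "A", "Program": "BS"},
    {"Class": "Freshman", "College": "A", "Program": "BA"},
    {"Class": "Senior", "College": "B", "Program": "BS"},
]


def count_by(records, *fields):
    """Count records for each combination of values in the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in records)


# Filtering, like the filter buttons on an Excel table.
seniors = [r for r in records if r["Class"] == "Senior"]

# Distribution of students by class, like a one-field pivot table.
by_class = count_by(records, "Class")

# Counts by College/Program, like a two-field pivot table.
by_college_program = count_by(records, "College", "Program")

for key, n in sorted(by_college_program.items()):
    print(key, n)
```

A `Counter` keyed by tuples is a simple stand-in for a pivot table that only counts rows. It does not compute sums or averages.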
Step 5: Interpretations/Recommendations.
  - A next step
    - The next step in the data science process would be to make predictions
      - This doesn't have an impact on you for two reasons:
        1. We will not go that far this semester.
        2. Any predictions that I can think of would be for the next offerings of this class.
      - How many sections/seats of DSCI 101 should we offer?
        - 5 new Anth/Forensic Anth students registered, 4 Anth Freshmen in the class.
        - 27 new CJ students registered, 22 CJ Freshmen in the class.
        - 3 new Math majors, 3 registered for the class.
        - It looks like these majors are being assigned to this class.
        - Seat counts for the class should take this into account.
  - The data files
    - class.csv: the CSV file.
    - Analysis.xlsx: the analysis file.
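Returning to the seat-count question in Step 5, the arithmetic behind "these majors are being assigned to this class" can be sketched as follows. The counts are the ones quoted above, read as (new students in the major, freshmen from that major in this class), and that reading is my interpretation. The helper `seats_needed` and the figure of 40 future students are hypothetical. This is a back-of-the-envelope estimate, not a prediction model.

```python
# Counts quoted in Step 5: (new students in the major, freshmen in the class).
counts = {
    "Anth/Forensic Anth": (5, 4),
    "CJ": (27, 22),
    "Math": (3, 3),
}

# Share of each major's new students who ended up in the class.
for major, (new, in_class) in counts.items():
    print(f"{major}: {in_class} of {new} = {in_class / new:.0%}")

total_new = sum(new for new, _ in counts.values())
total_in_class = sum(n for _, n in counts.values())
print(f"Overall: {total_in_class} of {total_new}")


def seats_needed(expected_new: int) -> int:
    """Scale the observed share to a hypothetical incoming group.

    Integer ceiling division is used to round up without
    floating-point surprises.
    """
    return -(-expected_new * total_in_class // total_new)


# Hypothetical: if 40 new students arrive in these majors next time.
print(seats_needed(40))
```

With these counts, at least 80% of the new students in each of the three majors are in the class, which is why seat planning should follow new enrollment in these majors.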