Homework 5
Design and implement a database from existing data.
This homework is worth 75 points.
Goals: When you finish this homework you should be able to:
- Determine dependencies in a set of data.
- Normalize a given set of data.
- Using MySQL create associated tables and populated those tables with data.
- Use SQL queries to explore a dataset.
- Employ scripts to execute MySQL commands.
Assignment
For this assignment you will use data in the file /home/CS/dbennett/DB_HW5/HW5Dataset. This file is located on the system and has read permission for all users. You are encouraged to reference the copy in my directory and not make new copies of this data.
The file contains movie ratings from Movie Lens, which I have modified. Each line in the file contains a user's rating for a movie. The format of the line is as follows:
- Rater's First name
- Rater's Last Name
- Rater's Age
- Rater's Gender (M,F)
- Rater's Occupation
- Movie Title
- Movie Release Date
- Movie Genre, this is a comma separated list.
- The user's movie rating (1-5)
The fields in this file are separated by a vertical bar (|), which is not part of the data.
The following are the first four lines in the file:
Caleb|Cross|49|M|Writer|Kolya |24-Jan-1997|Comedy|3
Ester|Barr|39|F|Executive|L.A. Confidential |01-Jan-1997|Crime,Film-Noir,Mystery,Thriller|3
Blake|Leach|25|M|Writer|Heavyweights |01-Jan-1994|Children's,Comedy|1
Stephan|Villarreal|28|M|Technician|Legends of the Fall |01-Jan-1994|Drama,Romance,War,Western|2
The Genre field is problematic for our level of knowledge. You may solve this problem several ways, including a program to prepossess the data file, a collection of queries to deal with this data, or other commands. Acceptable, but not preferred, is to allow this multi-value dependency to exist in your final database. You must, however, deal with this data in your design.
Using the techniques discussed in chapters 1-4, create a database in BCNF to represent this data. These files must be in your personal database in MySQL on the CSCS313Lin machine. Document each stage of this process, including:
- Identify the dependencies in the data
- List any assumptions you made.
- Compose a list of questions which you could ask the client to resolve ambiguity and verify that your assumptions are correct.
- Provide one or more scripts which transform your data from the raw file to your final database.
Your client wishes to provide information about the rating of movies. To demonstrate this, create queries to answer the following questions (using your normalized database, not the raw data file)
- How many individuals provided ratings?
- Were there any two raters with the same name?
- What is the top rated movie?
- What was the top rated war movie among men?
- What was the top rated comedy among women who were less than 30?
- What was the move that Doctors enjoyed the most?
- List the top 5 rated movies
- What occupation provides the fewest number of ratings?
- What user provided the most/fewest ratings?
Report
You must provide the following documentation in the form of a written report:
- A formal description of the final database, include an overview of the data and the final product.
- Identify any errors you discover in the data.
- The list of data dependencies
- The list of assumptions.
- The list of questions.
- The database schema.
- Documentation on how you handled the Genre problem.
- The answers, along with the SQL queries to produce these answers to the listed questions.
The following should be placed in an appendix of your report:
- The scripts to create the databases. (These should be documented with comments
- Examples of each table (First 5 lines)
I have access to your MySQL database, so I will examine your final set of tables when I grade this project.
Submission
Your answers must be typed. Print out the report and turn it in by class time on the due date. Keep the tables associated with this database on the MySQL server for the class until you receive feedback for this assignment.