A Crash Course in Databases.
- First
- This is not a substitute for CSCI 313.
- I am making this up, there is no single reference.
- But I have been reading man pages like crazy the last week.
- Traditional Databases (Relational Databases)
- Named because they generally have relations between tables.
- We have seen this in the data model in excel.
- Most are programmed/accessed/used via the Structured Query Language or SQL.
- NOSQL or Not Only SQL databases are different.
- These are a more recent development
- They are definitely associated with data science as well.
- But they are not our topic for the next few days.
- DBMS : Database Management Software
- Many different DBMS engines.
- Lightweight such as SQLLite, Access, ...
- Moderate like MySQL or Mariadb.
- Heavy Duty Oracle or IBM/DB2
- These range in
- We will use
- One of the cool things about databases is that they have lots of access permission built in.
- You can be given read only permission
- Or many other levels.
- In a large company you will NOT have access to the database.
- But one of the DBA's will probably produce a table for you.
- If you offer sufficient sacrifices to the DBA gods.
- In a small company you may have full access.
- You might even want to build your own database
- Say you are maintaining a changing dataset.
- You download it into a database
- you can build a nice interface and allow workers to update the data.
- You get better control of the data input
- You can set up database rules that enforce data integrity.
- Homework 5 is sort of aimed at this.
- You will take a dataset and prepare it to be loaded into a data base.
- Databases are arranged in tables.
- This is where tidy data came from.
- There are lots of rules associated with tables.
- Generally,
- Store tidy data.
- Don't store extra (redundant) data.
- Make sure the data is in the proper range.
- This is a majority of CSCI313
- Each table has a primary key
- This is a item that is used to recognize unique entries.
- It is usually an integer, but doesn't have to be.
- It can be multiple fields or a single field depending on the table.
- Tables might also contain foreign keys
- These provide the a link to data in other tables.
- Together the keys form relationships between tables.
- There are multiple types of relations
- Generally a database schema is
- Is the blueprint on how the database is designed
- It includes a description of the tables
- Names, types and properties of the fields
- Identification of primary keys.
- And the relationships between tables.
- Frequently a database schema is represented in an ER-Diagram
- Entity Relationship
- These are sometimes called Crow's Foot diagrams.
-
- Primary keys are indicated with a diamond
- If a field can't be blank it has a filled in dot.
- Relationships are shown with lines.
- The Iris dataset
- This is a single table
- It is also a standard dataset
-
- IrisDB
- The Pokemon dataset
- This is from kaggle.
- Three tables.
- The first uniquely identifies the pokemon type.
- This is a limited list of values.
- A pokemon can have two of these types.
- The second uniquely identifies a card
- Without an indication of the type.
- In the dataset there was a total field that was the total of HP, Attack, Defense, SpecialAttack, SpecialDefense and Speed.
- Storing this would have duplicated data so it was removed.
- Finally CardTypeTable describes the relationship between cards and types.
- This allows a many to many relationship.
- Ie A pokemon can have 0 or more types.
- And a type can be associated with many different pokemon.
- PokemonDB
- We will also explore the chinook database.
- This is a standard sample database.
-
- Chinook