CS630 Homework 2  Getting started with mysql, foreign keys, SQL (with subqueries, group by, having)

due Tuesday, Oct. 9 by midnight in your cs630/hw2 directory. Note that Monday, Oct. 8 is a holiday.

1. Getting started with mysql:

a. Check your mysql account by simply logging into it. For user joe, login to pe07 from users or topcat and do the command:
mysql -u joe -D joedb -p
Password: joe

Or if you don't use -D joedb on the command line, use "use joedb" once inside mysql. Use "show tables" to look for tables. Unless you used this account for other classes, you should see an empty set of tables. If you do see old tables and want me to reinitialize your account, let me know by email.

You already have copied the $cs630/tables directory to your own ~/cs630/tables directory as directed in homework 1. Again cd to cs630/tables and use "ls" to list the files (while logged in on either system). You should see dbbook_mysql.sh. In order to load tables with bulk load (rather than individual inserts, a slower process), we need to set up easy no-password access to mysql, as described in the comments in dbbook_mysql.sh, and also in README. For example, user joe creates file .my.cnf in his login directory with contents:
[client]
password=joe
[mysql]
database=joedb

Since this lists your password, it should be protected as a file: "chmod o-r .my.cnf" makes it unreadable to other users. We will check for the existence and protection of this file. Now check that you can log into mysql without a password:
mysql -u joe
mysql>

b. Load the R&G book's tables into your mysql account by executing "sh dbbook_mysql.sh" on pe07. You will be prompted for your mysql username and database name (joe and joedb, for example). When it is done, check your tables by using mysql on pe07 with "show tables" and  "select * from student;", etc.

c. Drop the tables by using
mysql -u joe -D joedb < dropdb.sql

Use "show tables" to see the result, then load the tables again.

d. Make a typescript showing just "select * from boats" output. To do this, log into pe07, run the "script" command, then use mysql to access your tables and do the select, then exit from mysql and then exit again, which finishes the script in file "typescript". Rename typescript to "problem1.txt" in your hw2 directory. Although ".txt" doesn't mean anything to the Linux OS, it indicates a text file to most users because of its Windows meaning.  If you couldn't load the database, explain what happened in this file, with script of error in the execution.

2. Foreign Keys. R&G Exercise 3.8  Note that these tables are in tables/createdb.sql, and you can change their create-tables temporarily for experiments if you wish, and just reload the database afterwards. Put this answer in problem2.txt in your hw2 directory.  It would be a good idea to test your SQL by putting your solution in a .sql file and running it, but that is not required for the homework. For part 6, consider the Legal department instead of the Toy department, since it shows up in our loaded data.

3. Null foreign keys. Note that createdb.sql's dept table has a nullable managerid. Show that because of this, you can insert a row with a null managerid and have a department without a real manager, even though we have a foreign key constraint on managerid. Put your insert statement into insertnull.sql, and show that file and results on Oracle and (optionally) mysql (of select * from dept) in problem3.txt for Oracle and/or problem3_mysql.txt for mysql in your hw2 directory. In practice, we prefer not null foreign key columns to avoid this behavior. Sometimes a null FK is appropriate, in the case that the referenced information is optional or not available yet when this row is inserted. For example, the nullable column fid in table class is explained by needing to plan courses ahead of time, when the teacher is not yet assigned.

SQL Exercises: Execute on Oracle. Optionally, also execute them on mysql and put results in problem4_mysql.txt, for example. In problem 4, the lack of INTERSECT and EXCEPT make some queries impossible in mysql without a subquery.

4. SQL without subqueries. Our first coverage (Lectures 4 and 5) was of queries we can do in SQL without using subqueries or division. Do the queries of Exercise 4.3, parts 1-4 and 9-11 in SQL without using subqueries (i.e., your answer must have only one SELECT in it, plus one more for each UNION or INTERSECTION or EXCEPT/MINUS in use. There are some SQL answers in the online solutions, but most have subqueries.  Create a script problem4.sql with these queries, in order. problem4.sql is supplied with the first problem solved. Follow the directions in that script, and provide the output in problem4.txt. You can copy problem4.sql to your current directory on cs.umb.edu by "cp /htdocs/cs630/problem4.sql ."

Note that problem 8 has queries that don't need subqueries, on the Yelp database, so you may want to do that one next.

5. SQL with subqueries if desired. Do the queries of Exercise 5.2, parts 1, 4, 5 and 7 in SQL and create a script problem5.sql (following the pattern of problem4.sql) and its output in problem5.txt, and optionally in problem5_mysql.txt.(last sentence not relevant, deleted)

6. SQL with group by, having. Create a script problem6.sql (following the pattern of problem4.sql) and its output in problem6.txt/problem6_mysql.txt. Use the schema of Exercise 5-1 (but change student's level to standing to match our createdb.sql) to do these:

a. Find the major and the average age of students for that major, for each major. List major and average age.

b. For each room that has a class, find how many different departments use that room for classes.List room and count of departments. Use the class's faculty's deptid as the department using a room. Don't report null room values.

c. Display results from b. only for rooms used by 2 or more departments.

7. SQL with more advanced features if needed. Do the queries of Exercise 5.4 after adding dname to Dept in the schema (it is this way in createdb.sql). Do parts 1, 3 and 6 in SQL and create a script problem7.sql (following the pattern of problem4.sql) and its output in problem7.txt.

8. Oracle vs. mysql using a larger DB. The mysql answers are optional for CS430, but I hope you'll be interested in seeing how well mysql works in a fairly realistic setting. If you haven't read Yelp reviews before, visit for example Yelp on UMB or Yelp on Dorchester restaurants.

The partial Yelp database (as of this spring) is now publicly available in database yelp_db in both our Oracle and mysql servers. This database is available online, ready for loading into mysql.

This paragraph just explains why some reviews are missing in the Oracle version of the database: you don't need to do anything about it, just do SQL queries on what's there. To move the tables to Oracle, I first used mysql to dump them to CSV files (CSV stands for comma-separated values). Most tables were easily loaded using sqlldr from these CSV files. However, the review table is full of entries with newlines (the actual reviews), and Oracle's sqlldr insists on treating newlines as ends of records. So I wrote a mysql query to use string REPLACE on the review text to change newlines to spaces, and output values in CSV format. This worked except for a few rows, which still failed, so we have 0.3% fewer reviews in Oracle.

a. First find out basic facts on this database. The core tables are review, holding reviews by users about businesses, yuser for the users, and business for the businesses. We can't use "user" as a table name because it's a reserved word in Oracle. Use "describe" on these tables in both mysql and Oracle and include the output in your paper. You can access yuser, for example, as table yelp_db.yuser in either DB. Note that datatype "int(11)" for mysql is really just int, the standard type. Note any non-standard types.  For a list of SQL standard types, see Lecture 4 slides. Find out how many rows each of these three tables have using SQL. Put this output in problem8intro.txt.

b. Let's compare Oracle and mysql for this sizable database. Run each query twice. CS630 students should time the queries. timing. Mysql shows execution time after each query automatically. For Oracle, use "set timing on" for a session. Although Oracle is using a multiple disks and mysql only one disk, they both have enough memory for the entire yelp_db database, so a second execution of a query should involve no disk reading, usually the most time-consuming part of a query on a large database.

Queries on yelp_db: Treat these as described in problem4.sql, plus "set timing on" for CS630 students (optional for CS430) and put the queries in problem8.sql and the output in file problem8.txt. CS630 students should summarize the findings on relative performance of the two systems at the end of the problem8.txt file (one full sentence is fine.)

i. Find the maximum cool count in the reviews. You can use this value to find the max-cool review(s) themselves, but don't show this in the output file (too much output). Find the number of reviews containing the word 'dirty', and separately, containing the word 'clean'.

ii. Find the number of users (#unique user ids) reviewing businesses in 'NV'.

iii. Find the number of users with cool count over 10 reviewing businesses in 'NV'. Note that both reviews and users have cool counts. A user's cool count gets incremented when one of their reviews is marked "cool".

c. Use "show create table yelp_db.review" in mysql to see the create table statement used for this table. This information can be extracted from Oracle as well but it's somewhat harder: see StackOverflow answer if interested. Use this report from mysql to determine the foreign keys in use here. Put this output and discussion in problem8FKs.txt.

Files for grading in cs630/hw2:

problem1.txt
problem2.txt (no computer output here)
insertnull.sql, problem3.txt (optionally problem3_mysql.txt, and similarly below)
problem4.sql, problem4.txt

problem5.sql, problem5.txt

problem6.sql, problem6.txt
problem7.sql, problem7.txt
problem8intro.txt, problem8.sql, problem8.txt, problem8FKs.txt