Class 6.
Next Assignment: reading through Section 3.5. UNDOTTED exercises at end of Chapter 3 through 3.7. Make sure you can login to dbs2.cs.umb.edu and use sqlplus to access Oracle.
FAST: Idea of outer join. Given two tables R and S, Head(R) = AB, Head(S) = BC,
R S
|
A |
B |
|
B |
C |
|
a1 |
b1 |
|
b1 |
c1 |
|
a2 |
b2 |
|
b2 |
c2 |
|
a3 |
b5 |
|
b3 |
c3 |
|
|
|
|
b4 |
c4 |
What is the join, R
S?
R
S
|
A |
B |
C |
|
a1 |
b1 |
c1 |
|
a2 |
b2 |
c2 |
But we're losing a lot of information. Outer join KEEPS all the information of the two tables, filling in nulls where necessary.
R
O S
|
A |
B |
C |
|
a1 |
b1 |
c1 |
|
a2 |
b2 |
c2 |
|
a3 |
b5 |
null |
|
null |
b3 |
c3 |
|
null |
b4 |
c4 |
You use this sort of thing when you don't want to lose information. Example given in the text is that even though an agent in AGENTS table has made no sales in SALES table, we DO want to see AGENT name in report.
Most commercial product versions of SQL (Including Oracle) already have outer join (it became standard in SQL-92).
Theta join. CAN SKIP THIS!!! Recall in equijoin (or "natural join"), of R and S, Head(R) = AB, Head(S) = BC:
I.e., (R ´ S) where R.B = S.B
What if wanted some other condition (allow common columns),
(R ´ S) where R.B q S.B, where q is <, <=, <>, >, >=, anything but =.
No big deal. In SQL, use product, no special name for this kind of join.
Chapter 3. Idea of SQL. Computerized query language.
IBM developed SQL, and it became standard (old QUEL died). Most queries (including updates, inserts, deletes) are very standard across products.
We will only use ORACLE in class to begin with, but will also talk about others. Try to keep to Basic SQL standard for now (Our own standard).
In Appendix A.1 (see hw2: READ THIS FIRST; also note DBA.msg emailed to your cs.umb.edu mailbox),
Need to edit .cshrc in your home directory—see the DBA email.
To enter Oracle, tells you to login to UNIX on UNIX machine dbs2.cs.umb.edu that serves to Oracle, then give command:
%sqlplus
Any problems, contact DBA, dba@cs.umb.edu.
You will need to start creating tables. Follow directions in assignment and in Appendix A.1.
Create table. Result is empty table.
create table customers (cid char(4) not null, cname varchar(13),
city varchar(20), discnt real, primary key(cid));
Names and gives types to each of columns: Explain char(4), varchar(20).
Primary key: means can't duplicate values in table, foreign keys will refer to this column by default.
Note "not null" is an integrity constraint: must specify value on insert, can't leave null. More integrity constraints in Chapter 7, on DBA stuff.
/ see errata for text
Next, load from OS file into table. (Called sqlldr in ORACLE and DB2) No-one wants to type in one line at a time: everything is moved around on tape or disk (say from one DBMS to another). Note expected format:
c001,TipTop,Duluth,10.00 <-- comma separated, no spaces
c002,Basics,Dallas,12.00
You have to create a .ctl table for each table, custs.ctl, . . . Then use "sqlldr" command (NOTE Correction to pg 723, see Errata on cs630 home page); execute sqlldr from the UNIX prompt, not from within SQL*PLUS.
After think have table loaded, type:
select * from customers;
Should print out whole table. (* means all columns.)
X/OPEN Limits: 127 cols., 2000 bytes for row. etc. Oracle allows more.
Chapter 3.3. Simple Select Statements (Basically same as examples in the text, although starts somewhat different.)
Example 3.3.1. Find (JUST) names of agents based in New York.
rel. alg.: (AGENTS where city = 'New York')[aname]
SQL: select aname from agents where city = 'New York';
Keywords: SELECT, FROM (clause), WHERE (clause), end with semi-colon (;).
After SELECT have select list, resulting variables to report: project on.
The FROM clause names tables to take Cartesian product.
The WHERE clause connects and limits rows in Cartesian product. No join. (Our Advanced SQL standard now allows join operation: Cover later.)
Note duplicate aname values ARE possible. SQL statement above WILL report duplicate names (Relational rule broken) unless write:
select distinct aname from agents where city = 'New York';
If want to emphasize that duplicates can arise, give name to default:
select all aname from agents where city = 'New York'; (See Fig 3.1 pg 81)
Recall we did: select * from customers; at end of table load.
Select * means all columns (no projection). There is no where clause so whole table is chosen. Like rel alg query: CUSTOMERS
Example 3.3.3. Select product ids of products for which orders are placed.
select distinct pid from orders;
Might there be duplicates in this? How avoid duplicates, class?
Example 3.3.4. Retrieve all (cname, pname) pairs where the customer places an order for the product. If don't use JOIN operator in SQL, can't make mistake of joining on city column as well. In RELALG, need:
(C[cid, cname]
O
P)[cname, pname]
Rather:
((C X O X P)
where C.cid = O.cid and O.pid = P.pid)[cname, pname]
In SQL:
select distinct cname, pname from customers c, orders o, products p
where c.cid = o.cid and o.pid = p.pid;
Note that all "table alias" names must be specified in single statement where they are used, in FROM clause: the table alias follows the table name without a comma intervening.
In ORACLE and INFORMIX, look up "alias, table". In DB2 UDB, say "correlation name", and "alias" is something else. (Also in INFORMIX, "correlation name" is something else!)
SQL is non-procedural to the greatest extent possible, so a statement can't depend on an earlier statement. Note too, could leave out alias in this example — just saves typing here.
Need qualifier when column names are ambiguous from one table to another; not otherwise (above cname, pname, but o.cid, c.cid). But it never hurts to qualify column names.
Conceptual process in Select. (1) Take product of tables in from clause, (2) apply selection in where clause, (3) project on attributes in select-list and report.
But this in not necessarily (usually not) the actual order of events. Query optimization is applied. If customers has 1000 rows, orders 100,000, and products 1000, then Cartesian product has 100 billion rows. But eventual answer will have at most 100,000 rows.
Can make this much more efficient: index customers by cid and products by pid; now go through orders one row at a time and put out cname, pname into temporary file. Finally, sort cname, pname pairs and cast out duplicates.
This is what a query optimizer might do in its query plan. Basically, a query plan is a program that retrieves the data you want. You don't have to worry about it. Just type in SQL, query optimizer creates the program.
Can perform calculations in select.
Example 3.3.5. retrieve a table based on orders, with columns ordno, cid, aid, pid, and profit on order.
select ordno, x.cid, x.aid, x.pid, 0.40*(x.qty*p.price)
- 0.01*(c.discnt + a.percent)*(x.qty*p.price) as profit
from orders x, customers c, agents a, products p
where c.cid = x.cid and a.aid = x.aid and p.pid = x.pid;
(Profit is quantity * price, where subtract off 60% of cost, discnt for customer, and percent commission for agent. Expressions allow: +, -, *, /, and functions: upper( ) or ceil( ) (Oracle, smallest integer greater than real number).
Without the "as profit" clause, Oracle will head the column with the expression. After clause put in, heading will be "profit" In ORACLE, this is called a column alias, comparable to the way table aliases are created.
In INFORMIX, it is called a "Display Label".
Table alias is also called correlation name (in DB2) and range variable. Sometimes not just shorthand, but necessary, as when join table with itself (need different qualifying names).
Example 3.3.6. List all pairs of customer cids based in the same city.
We have seen this kind of thing before in Rel Alg. Need to report distinct pairs of cid only once.
select [distinct?] c1.cid, c2.cid from customers c1, customers c2
where c1.city = c2.city and c1.cid < c2.cid;
Note, could be upper case C1 and C2 column aliases; lower case is our general rule, but not crucial; just need to be consistent (case counts).
Do we need to say distinct above, class? (No. Because (c1, c2) is a set of unique pairs. Waste to use distinct when not necessary.)
Alias idea is to make two copies of all content of customers table as c1 and c2; then take Cartesian product, perform selection, and finally report out project onto c1, c2.
But what system really does is (draw customers table, show c1 and c2 as variables ranging over rows, perform nested loop — RANGE VARIABLES):
FOR c1 FROM ROWS 1 TO LAST OF customers
FOR c2 FROM ROWS 1 TO LAST OF customers
IF (c1.city = c2.city and c1.cid < c2.cid)
PRINT OUT SELECT-LIST VALUES: c1.cid, c2.cid
END FOR c2
END FOR c1
Once again c1 and c2 do not represent copied tables, but variables taking on row values in the same table! Trick here is to NOT MATERIALIZE THE PRODUCT. And now see why aliases are sometimes called range variables: we can picture them as variables that range over the rows of a table.