% IncomeDistribution/contents.tex
\chapter{\mychaptername}
\label{\here}
\tocnotetoo{
This chapter covers a lot of ground -- two new kinds of average
(median and mode) and ways to 
understand
data when they they come in large quantity rather than just a few at a
time: bar charts, histograms, percentiles and the bell curve.
To do that we introduce spreadsheets as a tool.
}%
\teachertag
\begin{teacher}
We find students come to our course with a wide range of technology
experience. They can all manage word processing, web searching and
email. Some have used Excel. Those for whom it is new find this
chapter hard going. Often pairing experienced and inexperienced
students in the lab classroom works well.
\end{teacher}

\begin{goals}

\begin{goal}{excelaverages}
Work with mean, median and mode of a dataset.
\end{goal}

\begin{goal}{normaldistribution}
Introduce the normal distribution, with its mean and standard deviation.
\end{goal}

\begin{goal}{skew}
Understand how skewed distributions lead to inequalities among mean,
median and mode.
\end{goal}

\begin{goal}{excelroutine}
Make routine calculations in Excel.
\end{goal}

\begin{goal}{excelwhatif}
Use Excel to ask and answer ``what-if'' questions.
\end{goal}

\begin{goal}{excelchart}
Create bar charts and other types of charts in Excel.
\end{goal}

\begin{goal}{histograms}
Use histograms to group and explore data.
\end{goal}

\begin{goal}{histogramaverages}
Calculate averages for grouped data.
\end{goal}

\begin{goal}{percentiles}
Understand what a percentile is and how to interpret percentile information.
\end{goal}


\end{goals}

\begin{chapterpix}

There are lots of Excel cartoons out there. We should be able to find
one that's suitable and free (or inexpensive enough).

\includegraphics[width=60mm]{\here/excel-cartoon.jpg}

\url{http://jennyhansenauthor.wordpress.com/2011/06/28/should-writers-use-excel-part-1/}

This one's from The New Yorker, probably prohibitive.

\includegraphics[width=60mm]{\here/mike-twohy-i-d-like-you-to-excel--new-yorker-cartoon.jpg}

\url{http://www.condenaststore.com/-sp/I-d-like-you-to-excel-New-Yorker-Cartoon-Prints_i8478424_.htm}
\end{chapterpix}

\qrsection[wingaero]{\smallCo{}}

Table~\ref{wingaero} shows the distribution of workers' salaries at
\smallCo,  a small hypothetical company.%
\footnote{
In keeping with our \CommonSense\  philosophy we should work with
real data. But most companies keep
this kind of information private. Any similarity between our imagined
\smallCo{} and any real company is purely coincidental.
In \exref{usincomedistibution} we'll apply the lessons we learn to
look at income distribution
in society at large. 
}
The company has about 30 employees. We want to understand the salary
distribution. A natural place to begin is with the average. But adding
thirty numbers by hand (even with a calculator) is
tedious and error-prone. An {\em spreadsheet} computer program can do the
arithmetic faster and more accurately. In this text the spreadsheet we
use in Microsoft's Excel.

We've organized this chapter as a spreadsheet tutorial -- you can
follow it step by step in Excel as you read it.%
\footnote{Or in your classroom, if it is equipped with computers, or
if you have a laptop with you.}
See \sref*{usingsoftware} for 
some general software tips and information about alternatives to Excel.

If you're on line you can save typing time by downloading the
\smallCo{} spreadsheet from \link{WingAero.xlsx}.
If you build it for yourself, use column \cell{A} for the
employees and column \cell{B} for the salaries. Put the labels in row
\cell{8} and the data in rows \cell{9:38}, not side by side as in the
table. You should see Figure~\ref{fig:WA1screenshot}.%
\footnote{This is the first of what will be a sequence of screenshots
showing what you  should see as you work through the chapter. The
published version of \commonsense{} will have more, and may point you
to on line videos as well.} 

% side by side tables courtesy of 
% http://people.csail.mit.edu/jrennie/latex/
\begin{table}[ht]
\centering
  \begin{minipage}{2.2in}
	\begin{tabular}{|l|c|}
	\hline

	Employee	& Salary 	 \\
		 & (thousands of \$)	 \\
	\hline
CEO & 299  \\
CTO & 250  \\
CIO & 250  \\
CFO & 290 \\
Manager & 77  \\
Manager & 123  \\
Manager & 84  \\
Manager & 63  \\
Manager & 68  \\
Manager & 49  \\
Manager & 82  \\
Manager & 87  \\
Supervisor & 42  \\
Supervisor & 37  \\
Supervisor & 29  \\
\hline
    \end{tabular}
  \end{minipage}
  \begin{minipage}{2.2in}
	\begin{tabular}{|l|c|}
	\hline
	Employee	& Salary 	 \\
		 & (thousands of \$)	 \\
	\hline
Supervisor & 43  \\
Supervisor & 51  \\
Supervisor & 38  \\
Supervisor & 33  \\
Supervisor & 42  \\
Supervisor & 49  \\
Worker & 25  \\
Worker & 19  \\
Worker & 41  \\
Worker & 17  \\
Worker & 26  \\
Worker & 25  \\
Worker & 21  \\
Worker & 28  \\
Worker & 27  \\
\hline
	    \end{tabular}
  \end{minipage}
\caption{\smallCo{} Salary Distribution}
\label{wingaero}
\tablesource{Made up numbers.}
\end{table}

\begin{figure}[ht]
%\begin{wrapfigure}{l}{2in}
\centering
\includegraphics[height=75mm]{\here/WA1screenshot}
\caption{\smallCo{} spreadsheet (some hidden rows).}
\figsource{screen capture from Excel}
\label{fig:WA1screenshot}
\end{figure}
%\end{wrapfigure}

What do you notice?

The employees are listed in decreasing order of importance (or prestige),
but only approximately in decreasing order of salary. Some Supervisors
make more than some Managers, and some Workers more than some
Supervisors. We can make those discrepancies visible by \emph{sorting}
the data.

Select the rectangular block of data in rows \cell{9} through \cell{38},
columns \cell{A} and \cell{B}. Be sure to select both columns so they
will be sorted together. 
Choose the \excel{Sort} dialog box from the \excel{Data} teb, select
sorting by Salary, \excel{Largest to Smallest}, as in
Figure~\ref{fig:WA2screenshot}.%
\footnote{This is one way to sort, in Excel 2010. There are others. 
Later versions may display a different menu, but the ability to sort
will be available in any spreadsheet program you use.
}
Now you can clearly see the anomalies.

\begin{figure}[ht]
\centering
\includegraphics[height=40mm]{\here/WA2screenshot}
\caption{Sorting \smallCo{} salaries.}
\figsource{screen capture from Excel}
\label{fig:WA2screenshot}
\end{figure}
%\end{wrapfigure}

If you sort the data again alphabetically (by Employee, \excel{A to
  Z}) the table returns (nearly) to its original state, because 
the employee categories were alphabetical at the start. But each
category is now sorted by salary. 

To find the \emph{average} salary at \smallCo{}  we tell Excel
to add the entries in column \cell{B} and divide by the number of
employees. 

Enter the label \excel{Total}
in cell \cell{A40}. Then go to cell \cell{B40}. Make sure
the \excel{Formula Bar} is visible. (Use the 
\excel{View} menu to find it if
it's not.) In the formula box type an equals sign \excel{=}, to tell Excel
you want it to do some arithmetic, and then the name of the operation,
\displayexcel{ =SUM(}
\noindent 
since you are about to add up some numbers. The open parenthesis
\excel{(} asks Excel to prompt you for information. It suggests
\displayexcel{
SUM(number1, [number2], $\ldots$)
}
as in Figure~ref{fig:WA3screenshot}.

\begin{figure}[ht]
\centering
\includegraphics[height=30mm]{\here/WA3screenshot}
\caption{Summing \smallCo{} salaries.}
\figsource{screen capture from Excel}
\label{fig:WA3screenshot}
\end{figure}

Select cells \cell{B9:B38}, close the parentheses and type \excel{enter} (or
click the check icon on the \excel{Formula Bar}.

\displayexcel{
Total \ \ 	2315
}
\noindent
in cells \cell{A40} and \cell{B40}. \smallCo's total annual payroll is
\$2315K -- 
about \$2.3 million.

To find the average salary we must divide the total
\$2315K payroll by the number of employees. Rows
\cell{9} through
\cell{38} contain employee records so there are $38-9+1=30$
employees. But it's better to ask Excel to count the rows for 
you. Type the label \excel{ Count} in cell \cell{A41} and begin formula \excel{
=COUNT(} in cell \cell{B41}. Finish the formula by selecting cells 
\cell{B9:B38} or by typing the addresses of those cells and closing the
parentheses. You should see 
\displayexcel{
Count \ \ 	30
}
\noindent
Finally, type the label \excel{ Average} in cell \cell{A42}  and put
formula \excel{ 
=B40/B41} in cell \cell{B42}. You should see
\displayexcel{
Average \ \ 	77.16667
}
\noindent
so the average annual salary at \smallCo{}  is about \$77,000. 
Excel rounded the exact $77.1666\ldots$ to $77.16667$
when it ran out of space. 
We rounded to two significant digits because that's all the
precision we have in almost all the data. 
In \exref{format} you'll learn how to tell Excel to round for you.

Excel's built-in functions \excel{SUM} and \excel{COUNT} are
separately useful, which is part of why we showed them to you, but if
it's the average you want Excel can find it in one step.

Enter \excel{=AVERAGE(B9:B38)} in cell \cell{B43}, click the
check icon and Excel tells you again that the average
is 77.16667. 
Put ``\excel{computed using SUM/COUNT}'' in cell \cell{C42} and 
``\excel{computed using AVERAGE function}'' in cell \cell{C43}.

\qrsection[whatif]{What if?}\index{what if}

Suppose that the CEO convinced the Board of Directors to double his salary, to 
\$598K (even though the company lost money).
\footnote{Shades of \myindex{Wall Street} bonuses in the wake of the 2009
financial meltdown.
}
To see how that would affect the payroll statistics,
go to cell \cell{B9} and change the 299 there to 598. Excel automatically
updates all your computations, increasing the total annual
payroll to \$2,614 million and the average annual salary by about
\$10,000 to more than \$87,000. 

We can learn several useful lessons from our work so far. 

\begin{itemize}

\item Excel is faster and more accurate at arithmetic than we are.

\item Excel is an excellent tool for answering ``what-if'' questions,
because it automatically updates its computations when the data change.

\item The average we calculated with \excel{SUM/COUNT} and then again
  with \excel{AVERAGE} is a terrible way to summarize the salary
  structure at \smallCo. The CEO's 100\% raise increases the average
  salary by about \$10,000, or 13\% -- but he's the only employee
  who's actually better off!

\end{itemize}

After you've tried out these changes, restore the original values by
clicking the \excel{ Undo} icon on the toolbar.  This very useful
feature allows you to undo the last few changes you've made. Use it to
fix mistakes or to get back to where you were before you asked a
``what-if'' question.

\qrsection[usingsoftware]{Using software}

Why use a spreadsheet at all? There are several reasons.

\begin{itemize}

\item Carry out large tedious calculations rapidly and correctly.

\item  Answer ``what-if?'' questions without having to redo
arithmetic.

\item  Draw charts.

\item Learn a tool you may use long after you've finished this course.
\end{itemize}

Here are some tips for working with Excel, and with software packages
in general.

\begin{itemize}

\item  To figure out something new, you can use the application's
built in help, search the web, ask a friend or teacher, or just play
around. Which you try first depends on your personality.

\item Many applications provide several ways to do the same task. That
means you may get different advice or instructions from different
sources. Choose a way that suits your style.

\item Use the menus for things you do rarely, but learn the keyboard
shortcuts for things you do often -- in particular, \excel{control-C}
for copy and \excel{control-V} for paste can save time.

\item Learn about \excel{undo}. \textbf{Save your work often}. 

\item
When you are about to make significant changes to a spreadsheet or a
document make a copy of what you have, so that you can return to it if
you change your mind about what you should have done. In the
\smallCo{} study we created several, calling them
\verb!WingAero1.xlsx!, \verb!WingAero2.xlsx!, and so on.

\item
Create backup copies of important documents often -- off your
computer. Use a thumb drive (flash drive), a cd, or a web service.

\item Some things work the same way in different applications
(browsers, Word, Excel) -- for example, selecting with the mouse, cut
and paste. 

\item In many software applications placing the 
mouse over a feature you are interested in and {\em right clicking}
\index{right click}
often lets you view and change the {\em properties} of that
feature. ``Do the right thing'' is a good mnemonic.

\item  Applications often try to guess what you intend to do. That can
be good or bad. We'll see soon that Excel can adjust cell references
automatically -- that's usually, but not always, what you want.
Word processors may try to fix your spelling --
perhaps correctly, perhaps not. 


\end{itemize}

Why Excel? It's the most commonly available spreadsheet, so it's
the one we use. But there are others available.
In particular, Open Office (\url{http://www.openoffice.org/}) offers
free spreadsheet and word processing software.\index{Open Office}
The spreadsheet program at Google docs is probably also powerful
enough to do everything you need for the examples in this book.

We have tested our spreadsheets in several such environments. 
We've written our tutorial in a way that explains both what each step
should accomplish as well as which actions to take to make it happen.
It should help you even if you use another spreadsheet, or a newer or
older version of Excel.

\qrsection[median]{Median}\index{median}

In \sref*{wingaero} we found that 
the average salary at \smallCo{} was \$77,000. This is a pretty good
annual salary. If you saw that in a job advertisement
you'd think it was a pretty good company to work for. Maybe, maybe not. 
In \sref*{whatif} we saw how it's skewed by the CEO's earnings.  When
his (or her) salary 
increased from \$299,000 to \$598,000, the average salary jumped by
\$10,000 to \$87,000. But no one else's salary changed!

The \$77,000 ``average'' is misleading in other ways too. Most of the
employees -- 26 out of 30 -- have salaries less than the average. That
contradicts what 
we like to think ``average'' means. To find a salary that's ``in the
middle'', sort the 30 line table again so that salaries are increasing.
Since the table starts in row \cell{9} and has 30 entries, rows 23 and
24 are the middle rows and the entries in
cells \cell{B23} and \cell{B24} are the middle salaries. That means
half the employees 
make \$42,000 or less (the entry in cell \cell{B23}) and half make
\$43,000 or more. So we might want to say
that the ``average'' salary
is \$42,500. There's a name for this kind of ``average'' -- it's the
\emindex{median}. The first ``average'' we computed above is called the
\emindex{mean}.

On the spreadsheet, change  the \excel{ Average} label in cell
\cell{A42} to \excel{ Mean}.

Then put
\displayexcel{
Median \ \	42.5 \ \	computed by finding middle of sorted list
}
\noindent
into cells \cell{A46}, \cell{B46} and \cell{C46}.

In some ways the median is a fairer ``average'' than the mean for
describing the \smallCo{}  salary structure. It tells you more
about the way salaries are distributed.
In particular, the median salary isn't affected by the
CEO's big raise. Try changing that salary again in Excel: the mean
changes, as it did before, but the median stays the same. 

Excel knows how to compute medians. Enter \excel{ =MEDIAN(B9:B38)} in
cell \cell{B47} and check that you get the same value: 42.5. 
Enter ``\excel{computed using MEDIAN function}'' in \cell{C47}.

Finding the median with the \excel{ MEDIAN} function is better than
finding the middle of the sorted list yourself because it works even
when the data aren't sorted.
Suppose the Supervisor making \$43K gets a raise to \$50,000. Enter that
new value as \excel{50} in the spreadsheet.
Then see that Excel has recalculated the median in cell \cell{B47}:
it's now 45.5. 

There's a third kind of average, the {\em mode}. We'll return to that
after we've summarized the salary data in a different way.

\qrsection[barcharts]{Bar charts}

Often pictures are better than numbers when we wish to convey
information convincingly. A \emindex{bar chart} is one
such picture.

We use bar charts when we have data that falls naturally into
categories. The height of each bar represents the data value for that
category.  When you want general understanding rather than numerical
detail it's easier to compare the heights of bars (in both
relative and absolute terms) than the values of numbers. 

In order to understand the \smallCo{} income distribution better
we will use Excel to draw two bar charts with four columns each, one
showing the number of employees in each of the four job categories Executive,
Manager, Supervisor and Worker, the other the total earnings in
each category.  We will use the original Wing Aero salary data from \sref*{wingaero}.
\teachertag
\begin{teacher}
We strongly recommend drawing these bar charts first by hand on the
board, and asking students to do the same from time to time on paper.
\end{teacher}

\begin{table}[ht]
\centering
\begin{tabular}{|c|r|r|}
\hline
Job & Number & Total salary (\$K) \\
\hline
Executive & 4 & 1089\\
Manager & 8 & 633 \\
Supervisor & 9 & 364 \\
Worker & 9 & 229 \\
\hline
\end{tabular}
\caption{\smallCo{} Salary Distribution by Job Category}
\tablesource{Made up numbers.}
\label{table:wingaerobycategory}
\end{table}

You can find the data from Table~\ref{table:wingaerobycategory} in
the range \cell{D8:F12} in the copy of the \smallCo{} salary distribution
spreadsheet at \link{WingAeroBarCharts.xlsx}. Excel calculated the
values in columns \cell{E} and \cell{F} using the \excel{ COUNT} and
\excel{ SUM} functions. 

Figure~\ref{fig:showformulas} shows those calculations. We created it
by selecting \excel{Show Formulas} from the \excel{Formulas} tab.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/showformulas}
\caption{Showing the formulas used in a spreadsheet.}
\figsource{Image from an Excel spreadsheet we created.}
\label{barCharts}
\end{figure}

Figure~\ref{barCharts} shows the first two
pictures from that spreadsheet.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/WingAeroBarCharts1cropped}
\caption{\smallCo{} Employee Information by Category}
\figsource{Image from an Excel spreadsheet we created.}
\label{barCharts}
\end{figure}

These side by side bar charts dramatically demonstrate that
the Executives make up a small part of the workforce but enjoy a large
part of the salary expenses!

Delete the charts from your copy of the spreadsheet, so that you can
learn how to build them for yourself. For the first one, 
select the data in columns \cell{D} and \cell{E},
rows \cell{8} through \cell{12} -- the range \cell{D8:E12}. Then ask
for a new chart by selecting \excel{Column} and \excel{2D Column} from
the \excel{Insert} tab, as in
Figure~\ref{fig:insertchart}.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/insertchart}
\caption{Inserting a chart in a spreadsheet.}
\figsource{Image from an Excel spreadsheet we created.}
\label{fig:insertcharts}
\end{figure}

Use the mouse to move the chart so that it does not hide any of
the data. Then add the appropriate labels and change the colors so
that the result is suitable for black and white printing. There are
several ways to go about these tasks; experiment until you find ones
that work for you.%
\footnote{
\sref*{usingsoftware} has a tip about how the right mouse 
button can help with these tasks.
}

To build the second picture the same way you must select the
data in columns \cell{D} and \cell{F} without selecting column
\cell{E}. To do that, select 
rows \cell{8:12} in column \cell{D}. Then hold down the control
(PC) or apple (Mac) key and use the mouse to select those rows in
column \cell{F}.

It would be even better to combine the two pictures.
We can do that in Excel
by building a column chart for the full range \cell{D8:F12}.
The result (after adding titles and fixing colors) is the chart on 
the left in Figure~\ref{doubleHistograms}.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/WingAeroBarCharts2cropped}
\caption{\smallCo: Side by Side Bar Charts}
\figsource{Image from an Excel spreadsheet we created.}
\label{doubleHistograms}
\end{figure}

The problem with that picture is that the bars for the employee
numbers are nearly invisible, because data values for the 
numbers vary from 4 to 9 while those for the
total salaries vary from  \$200K to \$1200K. 
We can fix that by reporting the salary totals in hundreds of
thousands of dollars rather than just in thousands of dollars (that
is, by changing the units for measuring salary). Column
\cell{G} contains those numbers; we used it to draw the second picture.
There you see clearly the opposing trends: total wages decrease
as the number of employees increases.    
\teachertag
\begin{teacher}
Excel can use separate vertical scales to plot two data
series. We think it's more useful and more interesting to teach this
{\em ad hoc} solution.
\end{teacher}

\qrsection[piecharts]{Pie charts}\index{pie chart}

Excel allows you to change the type of a graph on the fly. Select the 
column graph showing the total salary by job category. Right click on
that chart. Select \excel{Change Chart Type \ldots} and then the first
\excel{Pie}. Adjust labels and colors to create the first of the
charts in Figure~\ref{pieCharts}. 

\begin{figure}
\centering
\includegraphics[height=60mm]{\here/PieChartscropped}
\caption{\smallCo{} Salary Distribution Pie Charts}
\label{pieCharts}
\figsource{Charts from an Excel spreadsheet we wrote.}
\end{figure}

The pie chart shows clearly in yet another way that the executives are
the winners at \smallCo. They take home nearly half the salary total.
If you wanted to make that look a little less dramatic, you could omit
the percentages and ask Excel to show a three dimensional version of
the chart, as in the second picture. There we've rotated the picture
so that the Executive wedge is at the back, so it looks even
smaller in perspective. 

\qrsection[histograms]{Histograms}\index{histogram}

\smallCo{} is small enough so that we can see the whole salary table
at a glance. But if there were 1000 employees that wouldn't be
possible. To understand the numbers we'd have to summarize them.  The
four categories in the previous section might not provide enough
detail.

Another useful way to summarize the data is to divide the salaries
into ranges and count the number of employees whose salary falls in
each range. Then we can use the ranges as the categories in a bar
chart.  In this example we'll use \$20K ranges. That means we think of
two employees who make \$29K and \$33K as having approximately the
same salary, since each falls in the \$20K-\$39K category.

We need to count the number of employees making less
than \$20K, then the number making between \$20K and \$39K, and so
on. That's easy when the data are sorted in increasing order.
Table~\ref{table:wingaeroranges} shows what we found.

\begin{table}[ht]
\centering
\begin{tabular}{|r|r|}
\hline
salary range (\$K) & number of employees \\
\hline
0-19 & 2 \\
20-39 & 10 \\
40-59 & 7 \\
60-79 & 3 \\
80-99 & 3 \\
100-119 & 0 \\
120-139 & 1 \\
140-159 & 0 \\
160-179 & 0 \\
180-199 & 0 \\
200-219 & 0 \\
220-239 & 0 \\
240-259 & 2 \\
260-279 & 0 \\
280-299 & 2 \\
\hline
\end{tabular}
\caption{\smallCo{} Salary Distribution by Salary Range}
\tablesource{Made up numbers.}
\label{table:wingaeroranges}
\end{table}

To save you typing, we've listed the categories 
in cells \cell{D15:D29} in \link{WingAero.xlsx}.
You should check our work and enter the data in column
\cell{E}.

To see that you haven't missed anyone, \excel{ SUM} the range
\cell{E15:E29} to make sure the answer is 30, the known number of employees.%
\footnote{
The sum isn't a perfect check. Although the total is correct,
we might have put some employees into the wrong categories. 
}

We can use the data to draw a \emindex{histogram} -- a bar chart that
provides a visual representation of the data in a table where the
first column lists categories each of which specifies a
\myindex{data range} and the second the number of items in that
range. Figure~\ref{fig:wahistogram} shows the result.%
\footnote{%
In the bar charts we've studied so far, the $x$-axis represents
categories and the $y$-axis data values. In histograms the $x$-axis
represents ranges of data values and the $y$-axis counts or
percentages for each range.
}
Start as usual by building a column chart from the data in
cells \excel{D14:E29}.
In a histogram it's conventional to make adjacent bars touch. To do
that in Excel, right click on one of the columns and select
\excel{Format Data Series \ldots} from the menu. Then adjust the
\excel{Gap Width} using the slider shown in Figure~\ref{fig:FormatHistogram}.
While you're there, figure out how to change the colors of the bars,
and fix the labels. 
You can see the full spreadsheet with all the computations we've done
so far at \link{WingAeroHistogram.xlsx}.

\begin{figure}[ht]
\centering
\includegraphics[width=9cm]{\here/FormatHistogram}
\caption{Formatting a Histogram}
\label{fig:FormatHistogram}
\figsource{Excel screen capture.}
\end{figure}


\begin{figure}[ht]
\centering
\includegraphics[width=75mm]{\here/WingAeroHistogramCropped.pdf}
\caption{\smallCo{} Salary Histogram}
\label{fig:wahistogram}
\figsource{Chart from an Excel spreadsheet we wrote.}
\end{figure}

It's worth taking some time to study this histogram, which shows how
the data are distributed.\index{distribution} 
Most of the salaries are less than
\$100K and there's a large gap in salaries between \$139K and the
executives who make more than \$200K. Although this information is all
in the table, the histogram makes it visible and dramatic.

We chose \$20K for the size of the salary ranges so that we would have
enough categories to show what was going on but few enough to make the
graph understandable.
In \exref{salaryrange} you can explore what happens with different
choices.

One of the themes of this chapter has been to have Excel to do tedious
repetitive calculations. So you might reasonably wonder whether Excel
could build Table~\ref{table:wingaeroranges}
for the grouped data if we told it the ranges
we were interested in. Then it would recompute
when we asked ``what-if'' questions. The good news is that
Excel can do this job. The bad news is that its
histogram building tools are very clumsy.%
\footnote{
If you're ambitious, try Excel help or search the internet 
for \excel{excel histogram} to find out how to build this table in Excel.
}
We will content ourselves with doing the counting by hand. 


\qrsection[mmm]{Mean, median, mode}

There's a third kind of ``average'' -- the mode -- that's sometimes
informative. The \emindex{mode} is the most common value. The
histogram from the previous section shows 
that there are more employees with salaries in the \$20K-\$40K range than
any other. So the mode is about \$30K. It's the highest point in the
histogram.

The mode is most useful for data aggregated into ranges, as in a
histogram. In the raw \smallCo{} data there is no well defined
mode. Each of the values \$250K, \$49K, \$42K and \$25K appears
twice. In Excel the function \excel{ MODE(B9:B38)} reports \$250K when
column B is sorted in descending order and \$25K when the column is
sorted the other way.

Each of ``mean,'' ``median,'' and ``mode'' can legitimately be
called an ``average.'' That ambiguity makes it easy to lie with
statistics without actually lying. The CEO at \smallCo{} may brag that
workers at his company earn an average \$77K per year, while the union
argues that the average salary is \$30K per year.

A cynic would advise you to use
the ``average'' that tells the story your way
and hope your listener won't know the difference.

When distributions are symmetric, the mean, median and mode are in the
same place. The \smallCo{}  salary distribution isn't symmetric, it's
{\em skewed}\index{skew} to the right. That's the fancy way to say
that the bulk 
of the data cluster toward the left of the histogram with a long tail
off to the right. For data that's right skewed, as this is:

\begin{center}
\begin{tabular}{ccccc}
mode & $<$ & median & $<$ & mean \\
30   & $<$ & 42.5   & $<$ &  77.1
\end{tabular}
\end{center}

\begin{figure}[ht]
\centering
\includegraphics[width=100mm]{\here/MeanMedianMode.png}
\caption{Symmetric and Skewed Distributions}
\figsource{Hand drawn figure.}
\figcomment{You can redraw this, or we will. But it should look like
something a student could in fact do -- correct, but not professional.}
\end{figure}

In a histogram the mode is the peak, the median splits the area in
half, and the mean is where the graph would balance if it were a
cardboard cutout.%
\teachertag%
\begin{teacher}%
This hand drawn figure may seem unprofessional, but in fact we think
it's useful. It's more like what a student could produce than a fancy
graphic would be. We think we should have more pictures like this.
\end{teacher}

If you're learning about these several kinds of averages for the first
time you may want mnemonics to help you remember which is which. The
``med'' in median suggests correctly that it's in the middle. You can
remember mode because there are ``mo' of them than anything else.''%
\footnote{Nick Sullivan thought this up in class, November 2, 2010.}


\qrsection[averagesfromhistograms]{Computing averages from histograms}

Often all we know about data is a summary like that
presented in a histogram. We'll see here
how to estimate the mode, median and mean from that information.
We will use the \smallCo{} histogram in Figure~\ref{fig:wahistogram} as
an example. Since we know the correct averages we can see how good our
estimates are.%
\teachertag
\begin{teacher}
We find this a particularly valuable section -- it forces students to
come to grips with the real meanings of each kind of average. Just
memorizing definitions suffices for small data sets, but with grouped
data the mean is a weighted average and real understanding is
required.

We recommend thinking about mode, median and mean in that order.
\end{teacher}

We've already found the mode. It's the highest bar in the histogram -- the
range with the largest entry: \$20,000 -- \$39,000. Note carefully --
the mode is the range, not the height of the bar. We could report that
range, or report the mode as the middle of the range: about \$30,000.
The raw data do not have a mode that makes sense, so there's nothing
to compare this estimate to.

To estimate the median salary from the histogram we need to find the
salary such that half the employees make less and half more. The first
bar tells us that 2 make less than \$19K.
Adding the number of employees in the first two ranges tells us that 
$2+10=12$ make less than \$39K. Looking at the next range we see that
$2+10+7 = 12 + 7 = 19$ make less than \$59K. Since there are 30
employees, it's the $15^{th}$ and $16^{th}$ whose salaries are closest
to the median. They are clearly pretty much in the middle of 
the third category, so we can estimate
the median as the midpoint of that category, say
\$50,000. The correct value (from the raw data) is \$42,500.

Since there are just eight ranges with data and the  median occurred
in the third one, we didn't need Excel 
to do the arithmetic. In a more complicated example things might
not be so easy, so let's see how to make Excel to the work. Label
column \cell{H} in cell \cell{H15} as ``\excel{count so far}''. Then
copy the value from \cell{E15} to \cell{H15} by typing \excel{=E15}
in H15. Then enter 
\excel{=H15+E16} in \cell{H16}, to add the number of employees counted
so far (in \cell{H15}) to the number in this range {\cell{E16}).

Now select that formula and copy it to
\cell{H17:H29}. 
A \myindex{miracle} has happened!
Cell \cell{H29} contains the value 30. If you click on
that cell you can see that it's the result of the formula \excel{
=H28+E29}. Excel read your mind, and automatically updated the row
references to cells in columns \cell{H} and \cell{E}
with each copy down the column. It {\em knew} you wanted to add the
value in the cell above to the value in the cell three over to the left.
\teachertag
\begin{teacher}
You will need to take time here to explain both the miracle -- Excel
guessed that updating references was what was wanted -- and its value
in saving time and typing. It takes getting used to.
\end{teacher}
The last entry should be 30 since all the employees make less than \$299K.
Now it's easy to see that the count so far passes the midpoint of 15
in the middle of the third range.

Estimating the mean salary is the hardest.
Since the only \smallCo{} data we have is what we
used to draw the histogram we can't expect to find it
exactly. All we can say about the 10 employees in the \$20K\ --\ \$39K
range is that they earn somewhere in the neighborhood of \$30K. So our
best guess is to assume they all earn exactly that. If we make a
similar assumption for each of the other ranges then our estimate for
the mean is the \emindex{weighted average}
\begin{equation*}
\frac{ 2 \times \$10K + 10 \times \$30K + \cdots 
	+ 2 \times \$290K }
	{\mbox{total number of employees}} .
\end{equation*}

That's too much arithmetic to do by hand so we'll use Excel. 

Put the label ``\excel{mid range}'' in cell \cell{F14}. We want to use cells
\cell{F15:F29} to hold the values 10, 30, $\ldots$, 290 that are the middles
of the ranges in cells \cell{D15:D29}. There's a quick trick for that. Enter
the 10 in cell \cell{F15}. Then go to cell \cell{F16} and enter the
{\em formula} 
\displayexcel{
	=F15+20
}
\noindent
Excel will display \excel{ 30} there. That's because it reads the formula
as 
\begin{center}
add 20 to the contents of cell \cell{F15}
\end{center}

Now we want to add 20 each time you move down a row. To do
that, copy the formula in \cell{F16} and paste it into cells
\cell{F17:F29}. (This takes advantage again of Excel's correct guess
about what we are trying to do.)

Next label column \cell{G} by typing ``\excel{range total}'' in cell
\cell{G14}. Then
put the formula
\displayexcel{
	=E15*F15
}
in cell \cell{G15}. That asks Excel to multiply the numbers in cells
\cell{E15} and \cell{F15}. You should see 20. That's the first number
to add in the 
weighted average computation we're working on.

When you copy that formula to cells \cell{G16:G29} you should see 580 at
the end of the list. That's the miracle yet again.

To compute the weighted average you need to sum the values in column
\cell{G}. Since cell \cell{E30} contains the sum of the values in
column \cell{E}, all you 
need do is copy the formula from that cell to cell \cell{G30}. Excel will
automatically change the column reference, turning the formula
\excel{ =SUM(E15:E29)} into \excel{ =SUM(G15:G29)}. The sum is 2360. To find
the mean, enter the formula \excel{=G30/E30} in cell \cell{G31}. Excel
shows you 78.66667. Label that value as the mean. 

Our estimate of the mean salary from the histogram is \$79,000.
We shouldn't report more precision than that,
since we made many approximations along the way. Do note that the
estimate is not very far from the true mean of \$77,167 computed from
the raw data.  

\qrsection[percentiles]{Percentiles}


On June 7, 2011 we visited the website 
\url{http://www.infantchart.com/}. 
There we collected the data displayed in 
Table~\ref{table:babyweights}. Reading down the first column, you can
see that a one year old male baby weighing
21.5 pounds would be in the 30.9th \emindex{percentile}. That means
that he would weigh more than 30.9 percent of the babies, less than
69.1 percent. A baby weighing 
22.5 pounds would be in the 46.5th percentile, so
$46.5\% - 30.9\% = 15.6\%$ of the year old male babies weigh between
21.5 and 22.5 pounds. 

\begin{table}[ht]
\centering
\begin{tabular}{|r|r|r|}
\hline
weight (pounds) & percentile & difference \\
\hline
15.5  & 0.1 & 0.1\\
16.5  & 0.2 & 0.1\\
17.5  & 0.8 & 0.6\\
18.5  & 3 & 2.2\\
19.5  & 8.2 & 5.2\\
20.5  & 17.5 & 9.3\\
21.5  & 30.9 & 13.4\\
22.5  & 46.5 & 15.6\\
23.5  & 61.8 & 15.3\\
24.5  & 74.9 & 13.1\\
25.5  & 84.8 & 9.9\\
26.5  & 91.4 & 6.6\\
27.5  & 95.4 & 4.0\\
28.5  & 97.7 & 2.3\\
29.5  & 98.9 & 1.2\\
30.5  & 99.5 & 0.6\\
31.5  & 99.8 & 0.3\\
32.5  & 99.9 & 0.1\\
33.5  & 100 & 0.1\\
\hline
\end{tabular}
\caption{Year Old Male Baby Weights}
\tablesource{Calculator at \url{http://www.infantchart.com/}}
\tablecomment{Permission needed?}
\label{table:babyweights}
\end{table}

To find the mode, we should look for the largest {\em difference} in
percentiles. Since 15.6\% of the babies weigh between 21.5 and 22.5
pounds the mode is about 22 pounds.

With guess-and-check on the
website we found that a weight of 22.75 pounds was as close as we
could get to the 50th percentile, so that's the median.

Some other commonly used percentiles are the 

\begin{itemize*}
\item tenth: any baby weighing less than about 19.2 pounds is in the
tenth percentile,

\item ninetieth: about 90\% of the babies weigh less than about 26 and
a half pounds,

 \item the \myindex{quartiles} -- the 25th 50th and 75th percentiles. The
50th percentile is, of course, the median.
\end{itemize*}
Using the techniques from \sref*{averagesfromhistograms} we computed
the mean as a weighted average. It is about 22.9 pounds. 

Therefore
\begin{center}
\begin{tabular}{ccccc}
mode & $<$ & median & $<$ & mean \\
22   & $<$ & 22.75   & $<$ &  22.9
\end{tabular}
\end{center}
That suggests that the distribution  is 
just little bit skewed to the right.

\qrsection[bellcurve]{The bell curve}

The baby weight data in Table~\ref{table:babyweights} are just what we
need to construct the
histogram in Figure~\ref{fig:babyweights}.

There you can see the mode (the highest bar) and you can see the
little bit of right skew.

\begin{figure}[ht]
\centering
%\includegraphics[height=70mm]{\here/babyweights.png}
\includegraphics[height=70mm]{\here/babyweightscropped.pdf}
\caption{Baby weights}
\figsource{Chart is from an Excel spreadsheet we wrote. Data is from
the calculator at \url{http://www.infantchart.com/}}
\figcomment{Permission needed?}
\label{fig:babyweights}
\begin{center}
Data source: \url{http://www.infantchart.com/}
\end{center}
\end{figure}

Many factors contribute to a baby's weight at one year: birth weight,
heredity, nutrition, \ldots. Whenever many small effects combine to
give a total the distribution of values forms a \emindex{bell curve}.
The one shown in the figure
is a mathematically
correct bell curve that approximates the real data. The mathematically
correct name for this phenomenon is \emindex{normal distribution}.

A normal distribution is always symmetrical. Its mean, median and mode
are in the same place. To construct the one in the figure we need one more
number besides the mean --
a measure of how fast the curve spreads out. That measure is called the 
\emindex{standard deviation}. It's usually written with the Greek
letter sigma: $\sigma$; the mean is usually written with the Greek
letter mu: $\mu$. For the baby weight data the standard
deviation is about 2.6 pounds.

There's a nice way to use the standard deviation to describe how
a bell curve speads out.

\begin{itemize*}
\item 2/3 of the values are less than one standard deviation away from
the mean,
\item 95\% of the values are less than two standard deviations away,
\item 99.7\% are less than three standard deviations away.
\end{itemize*}

Figure~\ref{fig:normalwithsigmas}%
\footnote{From
\url{http://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg}
licensed under the Creative Commons Attribution 2.5 Generic license.
} illustratates these percentages.  Table~\ref{table:normalcurve}
summarizes them in terms of percentiles.

\begin{table}[ht]
\centering
\begin{tabular}{|c|r|}
\hline
value & percentile \\
\hline
$\mu - 3\sigma$  & 0.1\\
$\mu - 2\sigma$  & 2.3\\
$\mu - \sigma$  & 15.9 \\
$\mu$  & 50.0 \\
$\mu + \sigma$  & 84.1 \\
$\mu + 2\sigma$  & 97.7\\
$\mu + 3\sigma$  & 99.9\\
\hline
\end{tabular}
\caption{Percentiles for the Normal Curve, mean $\mu$, standard
deviation $\sigma$}
\tablesource{Public information}
\label{table:normalcurve}
\end{table}

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/normalwithsigmascropped.pdf}
\caption{How the normal distribution spreads out}
\figsource{
\url{http://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg}
licensed under the Creative Commons Attribution 2.5 Generic license.
}
\label{fig:normalwithsigmas}
\end{figure}

Since the baby body weight distribution is very close to normal, with
mean $\mu$ about 22.9 pounds and 
standard deviation $\sigma$ about 2.6 pounds, we know about 2/3 of
the babies 
weigh between $\mu - \sigma = 22.9 - 2.6 = 20.3$ and $\mu + \sigma =
22.9 + 2.6 = 25.5 $ 
pounds. Approximately 95\% weigh between 17.7 and 28.1 pounds.
2.5\% weigh more than 28.1 pounds and 2.5\% weigh less than 17.1 pounds.

Figure~\ref{fig:threenormals} shows three bell curves with the same
mean $\mu = 22.9$, but three different standard deviations, $\sigma = 1.3,
2.6$ and $5.2$. The middle one matches the baby weight data.

\begin{figure}[ht]
\centering
\includegraphics[height=70mm]{\here/threenormalscropped.pdf}
\caption{Three bell curves}
\figsource{Chart from an Excel spreadsheet we wrote.}
\label{fig:threenormals}
\end{figure}

The mathematics needed to calculate standard deviations and draw bell
curves is more than we will present here. You will learn about it if
you take an introductory course in statistics. All you should remember
now is the rough relationship between standard deviation and spread we
sketched above and that the official name for the bell curve is
\emindex{normal distribution}. If you want to see how we asked Excel
to calculate the standard 
deviation for the baby weights and draw
Figures~\ref{fig:babyweights} and \ref{fig:threenormals}, you can look
at spreadsheet \link{babyweights.xlsx}.

%http://www.cdc.gov/growthcharts/data/set1/chart02.pdf
%http://www.exceluser.com/explore/normalcurve.htm
%http://www.vertex42.com/ExcelArticles/mc/NormalDistribution-Excel.html

\qrsection[marginoferror]{Margin of error}

On July 17, 2012 The Chicago Tribune ran a story under the headline
\headline{Poll finds support for Obama's tax position}.

It said (in part)
\begin{qwrap}
\begin{quotation}
\firstline{A new survey by the nonpartisan Pew Research Center found
that} 44\% said the tax increase would help the economy, 22\% said it
would be harmful, and 24\% said the tax hike would not make a difference. 

\ldots

The poll was conducted July 12-15 among 1,015 adults and has a margin
of sampling error of plus or minus 3.6 percentage points. 
\end{quotation}
\sourceinfo[363]{http://www.chicagotribune.com/news/politics/sns-la-pn-poll-finds-support-for-obamas-tax-position-20120716,0,12147.story,
but the Tribune says they got the story from the LA Times.}
\end{qwrap}
The first paragraph quoted above reports the results of a survey. The
second tells you something about how reliable those results are. It's
clear that the {\em smaller} the \emindex{margin of error} the {\em
more} you can trust the results. Understanding the margin of error
quantitatively -- seeing what the number actually means -- is much more
complicated. A statistics course would cover that carefully; we
can't here. 

But since the term occurs so frequently, it's worth learning the
beginning of the story. Even that is a little hard to understand, so
pay close attention.
\teachertag{}
\begin{teacher}
This is one of those places where it's hard to find the right compromise
between appropriate simplifications and actual untruths about what the
concepts mean and the numbers say. It might well take a whole class
period to expand on the discussion here. You have to decide whether
that's how you want to spend class time.
\end{teacher}

The survey was conducted in order to discover
the number of people who believe the tax increase would benefit the
economy. 
If everyone in the country offered their opinion
then we would know that number
exactly.
If we gave the survey 
to just three or four people we could hardly conclude anything.
The people at the Pew Research Center decided to survey the opinions
of a {\em sample} of the population -- 1,015 people chosen at random.
Of the particular people surveyed, 
$0.44 \times 1,015 = 447$ people believed that the tax
increase would benefit the economy.  If they'd surveyed a
different group of 1,015 people, they would \myindex{probably} see
a different number, so a different percent.

The 3.6 percentage point
\index{percentage points} margin of error says is that if they carried
out the survey 
many times with different samples of 1,015 people, 95\% of those
surveys would find a percent that was within 3.6 percentage
points of the true value.%
\footnote{
The paragraph doesn't explicitly mention 95\%. That's just built into
the mathematical formula that computes
the margin of error from the sample size.
}
There's no way to know whether this particular sample is one of the
95\%, or one of the others. If you read about surveys in 100 news
articles, about five of them are likely to be bad ones where the margin
of error surrounding the answer doesn't include the true value.  Survey
designers can reduce the margin of error by asking more people
(increasing the sample size). But there will always be five surveys in
every hundred that don't capture the true value.

Even that conclusion may be too optimistic. The margin of error
computation only works if the sample is chosen in a fair way, so that
everyone was equally likely to be included. If they asked 1,015 people
at random from an area where most people were Democrats (or 
Republicans) or rich (or poor) the result would be even less reliable.

\qrsection[bimodal]{Bimodal data}

The black data points in Figure~\ref{fig:lymphomadata.png} show how
the rate of diagnosis (in cases per 100,000 people)
of \myindex{Hodgkin lymphoma} (a kind of cancer)
for white females depends on the age of
the woman diagnosed.%
\footnote{
Data source: Surveillance, Epidemiology, and End Results (SEER) Program
(\url{www.seer.cancer.gov}) SEER*Stat Database: Incidence - SEER 9 Regs
Limited-Use, Nov 2008 Sub (1973-2006) , National Cancer Institute,
DCCPS, Surveillance Research Program, Cancer Statistics Branch,
released April 2009, based on the November 2008 submission.  
Graphic drawn by Ben Bolker.\index{Bolker, Ben}
}
There are two peaks, one at about 20
years, the other at 75 years. There is no single value that can
legitimately be called the mode -- this is a \emindex{bimodal
distribution}. The mean and the median would each be about 45 years,
but they make no sense at all. What is \myindex{probably} happening is
that the disease has two very different causes, one of which occurs
more often in young people, the other in old people.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.75]{\here/cancer-whitefemale.pdf}
\caption{Hodgkin lymphoma incidence (white females)}
\figsource{
Data source: Surveillance, Epidemiology, and End Results (SEER) Program
(\url{www.seer.cancer.gov}) SEER*Stat Database: Incidence - SEER 9 Regs
Limited-Use, Nov 2008 Sub (1973-2006) , National Cancer Institute,
DCCPS, Surveillance Research Program, Cancer Statistics Branch,
released April 2009, based on the November 2008 submission.  
Graphic drawn by Ben Bolker.}
\label{fig:lymphomadata.png}
\end{figure}

We can understand this distribution as a combination of two normal
distributions. The pink bell curve for the early onset of Hodgkin
lymphoma has a mean of about 24 years with a standard deviation of 11
years. The blue bell curve for late onset has a mean of about 79 years
with a standard deviation of 19 years -- it spreads out more
slowly. The red curve is the sum of the two normal distributions. It
matches the data very well.

\exstart

Excel is just a {\em tool}. It doesn't answer questions, it provides
numbers and pictures {\em you} use to answer questions. So keep on
writing complete sentences explaining the meaning and context of the
numbers you report. The numbers in cells in a spreadsheet are no more
use by themselves than the numbers in a calculator display.

If an exercise asks for hard copy of a spreadsheet, format it well.
Use \excel{Print Preview} to make sure that charts fit on one
page and don't cover important data. Label the data columns,
cells containing important computations, axes and legends in charts.
Numbers are useless when they can't be understood. Use complete
sentences to answer questions.
 
\begin{exx}{\hassolution\routine\sref{wingaero}}
\myindex{CXO}

What do ``CEO'', ``CTO'', ``CIO'' and
``CFO'' stand for in the \smallCo{} salary table in \sref*{wingaero}?

\begin{sol}
In CXO the C and the O stand for Chief and Officer. The middle letter
stands for Executive, Technology, Information (or sometimes
Investment) and Financial. Sometimes you see COO (Operating).

``CIO'' also stands for ``Congress of Industrial Organizations,'' a 
federation of labor unions, founded in 1935.
\end{sol}

\end{exx}

\begin{dummyexx}{\needsquestions\sref{whatif}}
What if\ldots
\end{dummyexx}

%\begin{exx}
%http://www.ultimate-photo-tips.com/what-is-a-histogram.html
%\end{exx}


\begin{exx}{\routine\sref{usingsoftware}\gref{excelroutine}}
Formatting in Excel 

Format the cells containing averages in the \smallCo{} spreadsheet so
that the number displayed for the various averages are 
rounded to the nearest thousand dollars (no decimal places).

\end{exx}

\begin{exx}[format]{\untested\sref{usingsoftware}\gref{excelroutine}}
Formatting

Have Excel display the mean \smallCo{} salary with
just two significant digits, as \excel{77}.

\end{exx}

\begin{dummyexx}{\needsquestions\sref{median}}
Median \ldots
\end{dummyexx}

\begin{exx}{\hassolution\sref{barcharts}\gref{excelchart}\gref{excelaverages}}
Averaging averages 

\begin{abcd}
\item Find the average (mean) salary at 
at \smallCo{} for each of the four categories of employees (use the original data set, from \sref*{wingaero}). 

\item Show the data in a properly labelled bar chart. 

\item Compute the weighted average of these averages to check that you get
the correct mean for the whole payroll.
\end{abcd}

\begin{hint}
You may draw your bar chart neatly by hand, or use Excel.
\end{hint}

\begin{sol}
\begin{abcd}
\item Find the average (mean) salary at 
at \smallCo{} for each of the four categories of employees. 

The following table shows the average salaries by category. I built
the spreadsheet at
\slink{WingAeroAverageByCategorySolution.xlsx} 
to find  these numbers. 

\begin{center}
\begin{tabular}{|c|r|r|r|}
\hline
Job & Number & Total salary (\$K) & Average salary (\$K) \\
\hline
Executive & 4 & 1089 & 272.25\\
Manager & 8 & 633 & 79.13\\
Supervisor & 9 & 364 & 40.44 \\
Worker & 9 & 229 & 25.44 \\
\hline
\end{tabular}
\end{center}

\item Show the data in a properly labelled bar chart. 

See Figure~\ref{fig:WAaverages}.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/WingAeroAverageByCategorySolutioncropped.pdf} 
\caption{WingAero average salaries, by category}
\label{fig:WAaverages}
\figsource{Excel spreadsheet}
\end{figure}

\item Compute the weighted average of these averages to check that you get
the correct mean for the whole payroll.

The weighted average is
\begin{equation*}
\frac{ 
4 \times 272.25 +
8 \times 79.13 +
9 \times 40.44 +
9 \times 25.44 
}
{30}
= 77.61.
\end{equation*}
That's the same value as the ordinary average of the original data.

\end{abcd}

\end{sol}

\end{exx}

\begin{exx}{\hassolution\sref{barcharts}\gref{excelaverages}
\gref{excelchart}}
\headline{Cash-strapped T proposes 23 percent fare increase}
\index{MBTA}

%The graphic in Figure~\ref{fig:mbtaFareIncreases} appeared in
%\theGlobe{} on March 29, 2012.
%
%
%\begin{figure}[ht]
%\centering
%\includegraphics[height=60mm]{\here/mbtaFareIncreases.jpg}
%\caption{MBTA Fare Increases}
%\figsource{\theGlobe, March 27, 2012, http://bostonglobe.com/metro/2012/03/28/mbta-unveils-percent-fare-hike-limited-service-cuts-also-proposed/moCl42rwr0Nf5xyx20ZQGP/story.html}
%\label{fig:mbtaFareIncreases}
%\end{figure}
%
On March 29, 2012 the Massachusetts Bay Transportation Authority (MBTA)
provided the data in
Table~\ref{table:mbtaFareIncreases}.
\webref{http://bostonglobe.com/metro/2012/03/28/mbta-unveils-percent-fare-hike-limited-service-cuts-also-proposed/moCl42rwr0Nf5xyx20ZQGP/story.html}

\begin{table}
\centering
\begin{tabular}{|rrr|}
\hline
Fare Category & Current & Proposed \\
\hline
Charlie Card & & \\
\hline
Bus & \$1.25 & \$1.50 \\
Subway & \$1.70 & \$2.00 \\
Senior Bus & \$0.40 & \$0.75 \\
Senior Subway & \$0.60 & \$1.00 \\
Student Bus & \$0.60 & \$0.75 \\
Student Subway & \$0.85 & \$1.00 \\
\hline
Charlie Ticket & & \\
\hline
Bus & \$1.50 & \$2.00 \\
Subway & \$2.00 & \$2.50 \\
\hline
\end{tabular}
\caption{MBTA Fare Increases}
\label{table:mbtaFareIncreases}
\end{table}

\begin{abcd}

\item What are the relative and absolute changes in the Charlie Card bus fare?

You don't {\em have} to do this in your head, without a calculator, but you
should be able to. 

\item
Create a spreadsheet for this data, with eight rows (one for each of
the eight categories) and three columns, for the category name, the
existing fare and the proposed fare.
Create a properly labeled bar graph to display the data.

\item 
Label the next two columns 
appropriately to hold the
relative and absolute changes. Fill those columns with Excel {\em formulas}
to compute the correct values. Do not compute the values elsewhere and
enter them in the spreadsheet as numbers.

\item Imagine that you are addressing a public meeting about these
fare increases. How would you argue that an unfair burden is being
placed on people who pay using a Charlie Ticket? How would you argue
that senior citizens are most hard hit?

\item Find the mean of the relative percent increases. Explain why the
answer is {\em not} the 23 percent quoted in the headline.

\item What extra information would you need to check that the correct
mean is 23 percent?

\item (Optional) Create a column for the percentage of MBTA revenue
for each of the categories and fill in some values that sum to 100\%
and give a weighted average fare increase of about 23\%.
\teachertag
\teachernote{
There's an important and subtle distinction here between weights as
a percentage of revenue and weights as a percentage of trips. 
Worth a discussion in class if you have time and a good enough class.
}
\end{abcd}
\begin{sol}

\begin{abcd}
\item What are the relative and absolute changes in the Charlie Card
  bus fare?

The absolute change is \$0.25; the relative change is 20\%. I did them
in my head. 

\item
Create a spreadsheet for this data, with eight rows (one for each of
the eight categories) and three columns, for the category name, the
existing fare and the proposed fare.
Create a properly labeled bar graph to display the data.

My spreadsheet with its chart is at
\slink{MBTAFareIncreasesSolution.xlsx}.

\item 
Label the next two columns 
appropriately to hold the
relative and absolute changes. Fill those columns with Excel {\em formulas}
to compute the correct values. Do not compute the values elsewhere and
enter them in the spreadsheet as numbers.

\item In my solution the 
absolute and relative changes are in columns
\cell{E} and \cell{F}. I put these formulas in cells \cell{E7:F7}:
\displayexcel{
=D7-C7 \ \ \ \ \ \ =D7/C7
}
I put the percent increase in cell \cell{G7} with the formula
\displayexcel{
=F7 -1
}
and formatted the result as a percent.

Then I copied these formulas to the rest of the rows.

\item Imagine that you are addressing a public meeting about these
fare increases. How would you argue that an unfair burden is being
placed on people who pay using a Charlie Ticket? How would you argue
that senior citizens are most hard hit?

The Charlie Ticket riders have the largest absolute change among all
the categories: 50 cents. That's not fair.

The senior citizens have the largest two relative changes among all
the categories: 67\% for subway fares and a whopping 88\% for the
bus. That's not fair.

\item Find the mean of the relative percent increases. Explain why the
answer is {\em not} the 23 percent quoted in the headline.

According to Excel, the mean of the relative percent increases is
37\%. The correct mean is a {\em weighted} average of the percentages
in each category, with weights the percentage of revenue in each category.


\item What extra information would you need to check that the correct
mean is 23 percent?

To check that the Globe reported the correct mean I would need to know
the percentage of rides in each category, or the percent of revenue
for each category. 

\item (Optional) Create a column for the percentage of MBTA revenue
for each of that categories and fill in some values that sum to 100\%
and give a weighted average fare increase of about 23\%.

They are in the spreadsheet.

\end{abcd}
\end{sol}
\end{exx}

\begin{exx}{\untested\sref{piecharts}\gref{excelchart}}
Handcuffs?

You can find Figure~\ref{recidivism} at
\url{http://people.howstuffworks.com/prison4.htm}.  
It claims to provide information about the
recidivism rate of released prisoners.
\index{recidivism}\index{handcuffs}

\begin{figure}[ht]
\centering
\includegraphics[width=80mm]{\here/prison.jpg}
\caption{Recidivism Rates}
\figsource{\url{http://people.howstuffworks.com/prison4.htm}}
\figcomment{Needs permission.}
\label{recidivism}
\end{figure}

\begin{abcd*}
\item  This looks like a pie chart.  Is it?  Explain.

\item  What is wrong or misleading in this graphic?  

\end{abcd*}

\begin{sol}

\begin{abcd*}

\item  This looks like a pie chart.  Is it?  Explain.

The answer to the question is ``probably not''. It would be a pie
chart if it were really round (it's not quite) and if the size of each
sector represented the proportion of offenses of each particular
kind. That might be true, but there is no way to tell.  The
percentages in the wedges are not the percentages of offenses of that
kind, they are the recidivism rates for released prisoners who committed
those particular kinds of offense. For example, 73.8\% of those released
for a property offense committed another one.

\item  What is wrong or misleading in this graphic?  

I suspect that the sizes of the wedges are proportional to the
percentages in them (which is why the one for property offenses seems
to be the largest).

In fact the sectors are all more or less the same size, with
recidivism rates that don't differ a lot. So not much information is
being conveyed.

The handcuff frame says that the average rate was 67.5\%. That number
should be a weighted average of the percentages in the five
sectors. Since we're not told how big the sectors are, I can't check
that figure.

\end{abcd*}

\end{sol}
\end{exx}

\begin{exx}{\untested\sref{piecharts}\gref{excelchart}}
Why not pie charts?

Do some internet research to discover why 
bar charts are usually better than pie charts. Write the reasons in
your own words (don't just cut and paste). Identify the sources of
your information and comment on why you think those sources are
reliable.

\end{exx}

\begin{exx}{\artificial\untested\sref{piecharts}\gref{excelchart}}
Misleading pie charts.

Table~\ref{table:baddata} shows some student enrollment data at a
college. It's clearly incomplete: Juniors and Seniors are missing.

\begin{table}[ht]
\centering
\begin{tabular}{|c|r|}
\hline
Class & Percentage \\
\hline
Freshman & 40\\
Sophomore & 25 \\
\hline
\end{tabular}
\caption{Freshman and Sophomore Enrollments}
\tablesource{Made up numbers.}
\label{table:baddata}
\end{table}

Flashy pie charts are common in the media, but staid
and boring bar charts are usually more informative and less
deceiving. Because Excel makes it so easy to switch to a pie from a
bar, designers may be tempted by the glitz. You should not succumb to
that temptation. Here's an exercise where you can find out why.

\begin{abcd}
\item Construct a bar chart displaying this data, with columns for
Freshman and Sophomore enrollments.

\item Change the Chart Type in Excel.

\item Explain what is wrong with the chart Excel built for you.

\end{abcd}
\end{exx}


\begin{exx}{\untested\complex\sref{histograms}
\gref{excelchart}\gref{histograms}}

River lengths

At \url{http://en.wikipedia.org/wiki/List_of_rivers_by_length}
Wikipedia offers a chart of 163 major rivers, organized by length. 
We've put the data there in the spreadsheet
\link{rivers.xlsx} so you can play with it.
\teachertag
\begin{teacher}
We haven't assigned this Benford's law exercise yet. We think it would
be very interesting, but perhaps not worth the time it takes away from
other more useful topics.
\end{teacher}

\begin{abcd}

\item
Construct a bar chart with nine categories, using the first digit of
the length of the river to determine the category.

You might expect all the bars to be the same height,
since there are nine possible starting digits. The fact that they are not
is an instance of {\em Benford's Law}. You can look up more about it
if you're curious.\index{Benford's law}

\item Explain why this bar chart is not a histogram.

\item 
Convert the lengths to feet from miles, and redraw the bar chart.

\end{abcd}


\end{exx}

\begin{exx}{\hassolution\worthy\sref{mmm}\gref{excelwhatif}
\gref{excelaverages}}
What if? 

Add five Workers each earning \$18K to the original \smallCo{} payroll by
inserting some rows in the table. 
Comment on what happens to the mean, median and modal incomes.

Note that Excel automatically recomputes these averages, but not the
charts for which you created data by hand.

When you've done this exercise, undo your changes and check that the
three kinds of averages revert to their old values.

\begin{sol}
When I add five Workers earning \$18K each the
mean falls from \$77.16K to \$66.71K, the median from \$42.5K to
\$41K. The mode is now \$18K. (The value of the mode before the new
workers was ambiguous. What Excel says it is depends on the order in
which the employees are listed. In general you should report modes
only for grouped data.)


\end{sol}
\end{exx}


\begin{exx}{\hassolution\sref{mmm}\gref{skew}\gref{histograms}}
Population pyramids.

At the website 
\url{http://www.census.gov/ipc/www/idb/informationGateway.php}
you can choose a country and a year, then ask for
a kind of bar chart known as a
\emindex{population pyramid}.
You can also download that data used to build that chart.

\begin{abcd}

\item Construct population pyramids for the United States and for
Sudan for the year 2010.

\item  Estimate the number of people in the United States in 2010
between 0 and 9 years old.  Do the same for Sudan. What fractions of
the populations do these numbers represent?

\item  Find the modal age range for the United States 
population.  Do the same for Sudan.

\item  Find the age range for the United States that has the smallest
population.  Do the same for Sudan.

\item  Compare the population distributions of the United States and Sudan.
Write several sentences that highlight aspects of each distribution
that you think are quantitatively significant. Use the results of the
previous part of the problem.

%\item Careful reading shows that values for the
%years beyond 1999 are estimates. You're told to visit
%\url{http://factfinder.census.gov/} for up to date information. Do
%that, and comment on something interesting you find there.

\end{abcd}

\begin{sol}

\begin{abcd}

\item Construct population pyramids for the United States and for
Sudan for the year 2010.

See Figures~\ref{fig:uspopulationpyramid} and
\ref{fig:sudanpopulationpyramid}.
\begin{figure}[ht]
\centering
\includegraphics[height=72mm]{\here/USPopulationPyramid.png}
\caption{US Population}
\label{fig:uspopulationpyramid}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[height=72mm]{\here/SudanPopulationPyramid.png}
\caption{Sudan Population Pyramid}
\label{fig:sudanpopulationpyramid}
\end{figure}

\item  Estimate the number of people in the United States in 2010
between 0 and 9 years old.  Do the same for Sudan. What fractions of
the populations do these numbers represent?

I did Sudan first.

To find that number I need to add the bottom two bars on each side,
for the boys and girls age 0-4 and 5-9. That looks to be about
$3.6 + 3.2 + 3.5 + 3.0 = 13.1$ million.

Then I downloaded the underlying data tables as an Excel
spreadsheet and checked my estimate.
The number of 0-9 year olds in Sudan is $7,018,343 +
6,128,436$ which is indeed about 13.1 million.

The total Sudan population is 43.9 million (from the data table). The
fraction of children is thus $13.1/43.9 = 0.298405467 \approx 30\%$

The corresponding figures for the United States are about $11 + 11 +
10 + 10 = 42$ million children. That's about $42/310 \approx 13.5\%$.

\item  Find the modal age range for the United States 
population.  Do the same for Sudan.

The modal age is the largest bar. For the United States that's 45-49
years. For Sudan it's 0-4 years!

\item  Find the age range for the United States that has the smallest
population.  Do the same for Sudan.

In each population the smallest age range is 100+. That's not
surprising, and also not very interesting.

\item  Compare the population distributions of the United States and Sudan.
Write several sentences that highlight aspects of each distribution
that you think are quantitatively significant. Use the results of the
previous part of the problem.

The population pyramid for Sudan really looks like a pyramid. The
number of people in each five year cohort decreases steadily with
age. That indicates a very large birth rate, with lots of people dying
young. It also spells disaster for the society -- there aren't lots of
adults to take care of the children.

The United States population pyramid looks more like a house: vertical
walls with a slanted roof on top. That suggests that the population
structure is relatively stable, with few deaths at young ages. It also
looks likely that the little bulge at 45-55 years will lead to a large
number of senior citizens in not too long. That's a social problem of
a different kind.

\end{abcd}
\end{sol}
\end{exx}

\begin{exx}{\hassolution\sref{mmm}\gref{excelaverages}}
Lake Wobegon

\begin{abcd}

 \item Look back at the \smallCo{} data and answer the following
questions:

\begin{itemize*}
\item What percentage of the employees make more than the mean salary?
\item What percentage of the employees make more than the median
salary?
\item What percentage of the employees make more than the mode salary?
\end{itemize*}

\item In his radio show {\em A Prairie Home Companion}
\index{Prairie Home Companion} host Garrison
Keillor regularly tells his audience about \myindex{Lake Wobegon},
where ``all the children are above average.''

Is that possible, with any of the meanings of ``average''?
\end{abcd}

\begin{sol}
\begin{abcd}

 \item Look back at the \smallCo{} data and answer the following
questions:

\begin{itemize*}
\item What percentage of the employees make more than the mean salary?

8 of the 30 employees make more than the mean of \$77K. That's
$8/30 = 0.266 = 27\%$.

\item What percentage of the employees make more than the median
salary?

50\%! That's the {\em meaning} of ``median'' so it's the answer to
this question even if I don't know the value of the median.

\item What percentage of the employees make more than the mode salary?

This is a slightly subtle question. The mode only makes sense for
grouped data. The figure in the text shows that the mode for
\smallCo{} is about \$30K. 21 of the 30 employees, or 70\%, make more
than that. 
\end{itemize*}

\item In his radio show {\em A Prairie Home Companion}
\index{Prairie Home Companion} host Garrison
Keillor regularly tells his audience about \myindex{Lake Wobegon},
where ``all the children are above average.''

Is that possible, with any of the meanings of ``average''?

No -- that's what makes Keillor's description funny.

For the median, exactly half are above average. For the mean and
the mode it's possible for (a lot) more than half to be above average, 
but never ``all''.

\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\hassolution\complex\sref{mmm}\gref{excelaverages}}
\headline{Wal-Mart will pay \$40m to workers}\index{Wal-Mart}

On December 3, 2009 \theGlobe{} ran a story under that headline that
said (in part):

\begin{qwrap}
\begin{quotation}
\firstline{Wal-Mart Stores Inc., the world's largest retailer, has}
agreed to pay 
\$40 million to as many as 87,500 current and former employees in
Massachusetts, the largest wage-and-hour class-action settlement in
the state's history. 

The class-action lawsuit, filed in 2001, accused the retailer of
denying workers rest and meal breaks, refusing to pay overtime, and
manipulating time cards to lower employees' pay. Under terms of the
agreement, which was filed in Middlesex Superior Court yesterday by
the employees' attorneys, any person who worked for Wal-Mart between
August 1995 and the settlement date will receive a payment of between
\$400 and \$2,500, depending on the number of years worked, with the
average worker receiving a check for \$734.%
\webref{%
http://www.boston.com/business/articles/2009/12/03/wal_mart_will_pay_40m_to_workers/
}
\end{quotation}
\sourceinfo{http://www.boston.com/business/articles/2009/12/03/wal_mart_will_pay_40m_to_workers/}
\sourcewc{852}
\end{qwrap}


Bloomberg news carried more information  about the settlement:

\begin{qwrap}
\begin{quotation}
\firstline{The Massachusetts workers' lawyers asked the court to award}
them \$15.2 
million in attorneys' fees, to be paid out of the \$40 million
settlement. They're also seeking reimbursement of expenses. 
\webref{
http://www.bloomberg.com/apps/news?pid=newsarchive&sid=a2AClc9J8WwE
}
\end{quotation}
\sourceinfo{
http://www.bloomberg.com/apps/news?pid=newsarchive&sid=a2AClc9J8WwE
}
\sourcewc{754}
\end{qwrap}

\begin{abcd}
\item Compute the mean compensation, assuming that there are 87,500
eligible employees and that the lawyers have taken their cut.

\item Compare your answer to the reported minumum compensation of
\$400 and the reported \$734 the average worker will receive.

\item
Draft a letter to
the editor or the reporter, politely pointing out that both the
reported ``averages''  made no sense, and asking for more detail or a
correction.

\item It is too late to send a letter to the editor, but may not be
too late to comment on line. 
Look through the many reader comments on this article at
\url{http://www.boston.com/business/articles/2009/12/03/wal_mart_will_pay_40m_to_workers/?comments=all}.
See if any mention the problem with the averages.
If none does, consider posting a comment of your own. 

\end{abcd}

\begin{sol}

\begin{abcd}
\item Compute the mean compensation, assuming that there are 87,500
eligible employees and that the lawyers have taken their cut.

\begin{equation*}
\frac{\$40\hbox{M} - \$15.2\hbox{M} }{87,500 \hbox{workers}} =
283 \frac{\$}{\hbox{worker}}.
\end{equation*}

\item Compare your answer to the reported minumum compensation of
\$400 and the reported \$734 the average worker will receive.

The \$283{\em mean} compensation can't be less than the \$400 {\em
minimum} compensation. Something is crazy here. 

The \$734 that the ``average worker'' gets is even crazier. It's much
too large to be the median. There's no way half the workers can get
more than \$734 if the minumum is \$734 unless the mean is more than halfway
from \$400 to \$734. That's not true even before I take out the
lawyers' \$15.2 million.

\item Waiting for a student to provide a good answer.

\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\routine\artificial\hassolution\sref{mmm}\gref{excelaverages}}
Is it \myindex{discrimination}? 		

Table~\ref{salaryStructure} shows the salary structure of two
departments in a hypothetical university. 

\begin{table}[ht]
\centering
\begin{tabular}{|r|r|r|r|r|}
\hline
& 
\multicolumn{2}{|c|}{Physics} 
& 
\multicolumn{2}{|c|}{English} \\
\hline
 &  	professors & salary & professors & salary \\
 Women & 1 & 	\$100K &	8 & \$50K \\
 Men & 9	& \$90K	& 2 &	\$40K \\
\hline
\end{tabular}
\caption{Salary Structure at a University}
\tablesource{Made up data.}
\label{salaryStructure}
\end{table}

\begin{abcd}

\item What is the average (mean) salary of the professors? of the women
professors? Of the men? 

\item Answer the same questions for the median.

\item Answer the same questions for the mode.

\item Write a few sentences to convince someone that men in this
university are paid better than women. Then write a few sentences to
convince someone of just the opposite. Explain the
contradiction.

\end{abcd}

\begin{sol}

\begin{abcd}

\item What is the average (mean) salary of the professors? of the women
professors? Of the men? 

The mean professor salary (in thousands of dollars) is
\begin{equation*}
\frac{1 \times \$100 + 9 \times \$90 + 8 \times \$50 + 2 \times
\$40}{20} = \$69.5.
\end{equation*}

The mean for the women is
\begin{equation*}
\frac{1 \times \$100 +  8 \times \$50} {9} = \$61.1.
\end{equation*}

The mean for the men is
\begin{equation*}
\frac{9 \times \$90 +  2 \times \$40} {11} = \$80.9.
\end{equation*}

\item Answer the same questions for the median.

The median salaries for all, women and men are 
\$70K (halfway between \$90K and \$50K), \$50K and \$90K respectively.

\item Answer the same questions for the mode.

The mode salaries for all, women and men are 
\$90K, \$50K and \$90K respectively.

\item Write a few sentences to convince someone that men in this
university are paid better than women. Then write a few sentences to
convince someone of just the opposite. Explain the
contradiction.

Obviously the men are paid better than the women. Each of the three
averages is higher for men than women!

Obviously the women are better paid than them men: all the women do
better than all the men in both departments.

This seeming contradiction is possible because the women outnumber the
men in the English (poorer) department, while the opposite is true in
the Physics (richer) department. There the lone woman makes the
highest salary -- perhaps because the Dean (a woman) insisted that an
all male Physics department would discourage women students.

\end{abcd}

\end{sol}
 
\end{exx}


\begin{exx}[salaryrange]{\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}\hassolution}
Choosing data ranges for a histogram.

When you change the widths of the intervals in a histogram you get a
(slightly) different picture of the data.

\teachertag
\begin{teacher}
This exercise may be worth assigning for the Excel practice, and for
reinforcing the computations of various averages from summarized
data. But the conclusions aren't very striking.

\end{teacher}
\begin{abcd}
\item
Redo the \smallCo{} histograms in the text using salary ranges of size
\$10K and then of size \$50K.

\suspend{abcd}
\begin{hint}
Start with a copy of \link{WingAeroHistogram.xlsx}. Create new data in
columns \cell{D} and \cell{E} below the existing data there for the
two new histograms. Fill in column \cell{F} appropriately. Then you
can copy and paste from what's there to fill in columns \cell{G} and
\cell{H} and the sums and the mean.

If Excel wants to treat your salary ranges as dates, try formatting
the cells as \excel{text}.

To make the charts, copy the histogram that's there, then find the
place in Excel where you can change the source data to the new rows in
columns \cell{D} and \cell{E}.
\end{hint}

\resume{abcd}

\item
In each case use the techniques from
Section~\ref{sec:averagesfromhistograms} to estimate the mean, median
and mode.

\item
Discuss the advantages and disadvantages of these possible choices for
the salary range, comparing them to our choice of \$20K.

\end{abcd}

\begin{sol}

\begin{abcd}
\item
Redo the \smallCo{} histograms in the text using salary ranges of size
\$10K and then of size \$50K.

See \slink{WingAeroHistogramRangesSolution.xlsx} for the new histograms.

\item
In each case use the techniques from
Section~\ref{sec:averagesfromhistograms} to estimate the mean, median
and mode.

Here's a table showing all three averages, computed from the data
directly and from the three histograms.

\begin{center}
\begin{tabular}{|r|r|r|r|}
\hline
Data & Mean & Median & Mode \\
\hline
complete & \$77.2K & \$42.5K & nonsense \\
\$20K intervals & \$79K & \$48K & \$20K-\$39K \\
\$10K intervals & \$77K & \$45K & \$20K-\$29K \\
\$50K intervals & \$73K & \$45K & \$0K-\$50K \\


\hline
\end{tabular}
\end{center}

The smaller the data ranges in the histograms the closer the
approximated averages are to the true values.

\item
Discuss the advantages and disadvantages of these possible choices for
the salary range, comparing them to our choice of \$20K.

None of these histograms seems particularly ``better'' than the other
two. Each gives a good visual picture of how \smallCo{} salaries are
distributed. The various averages are close enough to the correct
values so that they are not misleading.
\end{abcd}

\end{sol}

\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}\gref{histogramaverages}
\gref{histograms}\hassolution}
Web sites are often confusing.

Jakob Nielsen evaluated the usability of voter information websites
for the 2008 election for each of the fifty states and the District of
Columbia. You can read his analysis at
\url{http://www.useit.com}. Here is some of what he discovered.

\begin{qwrap}
\begin{quotation}
\firstline{The following histogram of homepage quality across the 51}
websites shows [that] \ldots [some] sites have completely miserable
homepages, whereas others are close to achieving all of the current
best practices.   
\webref{%
http://www.useit.com/alertbox/quality-correlations.html
}
\end{quotation}
\sourceinfo{http://www.useit.com/alertbox/quality-correlations.html}
\end{qwrap}

His article includes the histogram in
Figure~\ref{fig:nielsenhistogram}

\begin{figure}[ht]
\centering
\includegraphics[width=4in]{\here/histogram-homepage-quality-voting-sites.png}
\caption{Web Site Usability}
\figsource{url{http://www.useit.com/alertbox/quality-correlations.html}}
\figcomment{Permission needed.}
\label{fig:nielsenhistogram}
\end{figure}

\begin{abcd*}

\item What is the modal usability score for these 51 home pages?

\item Reproduce this histogram in Excel. 

\suspend{abcd*}
\begin{hint}
When you enter the data in two columns in Excel put the categories
(usability scores) on the left since they are the labels for the
x\-axis. Put the numbers of websites on the right since they are the 
values that go with the categories, and should plot vertically, on the
y\-axis.

If your Excel is anything like ours, it may well think that you want
to display both columns on the y\-axis, since both columns are
numbers. If it does that it will label the x\-axis with the numbers 1
to 10.

If that happens, delete the data series corresponding to the bars you
don't want (the percentages). Then right click on the chart and
explore until you find the place that allows you to enter the fields
you want to use as x\-axis category labels.

\end{hint}
\resume{abcd*}
\item Estimate the median usability score for these 51 home pages.
\item How many of these home pages have a usability score less than the
the median score?
\item Estimate the mean usability score for these 51 home pages.

\end{abcd*}

\begin{sol}

\begin{abcd}

\item What is the modal usability score for these 51 home pages?

The 20\% bar is the tallest, so that's the modal score.


\item Reproduce this histogram in Excel.

See Figure~\ref{fig:nielsenhistogramreproduced}.
The spreadsheet is at \slink{HomePageQualitySolution.xlsx}. 

\begin{figure}[ht]
\centering
\includegraphics[width=4in]{\here/HomePageQualitySolutioncropped.pdf}
\caption{Web Site Usability}
\label{fig:nielsenhistogramreproduced}
\end{figure}

Excel ended the vertical scale at 14, but the scale in the histogram in
the book ends at 16, so I changed the scale here. I matched the color
as best I could.	 

\item Estimate the mean usability score for these 51 home pages.

It's not clear whether the scores are always exactly 10\%, 20\%, etc.,
or whether the 20\% category really means ``between 10\% and
20\%''. I {\em think} the first interpretation is right -- it looks as if
there are 10 usability tests and you either succeed or fail in each one.
If that's how the score was computed, no web site passed all ten tests
and four sites failed them all.
With that interpretation the mean usability score is the
\myindex{weighted average}

\begin{equation*}
\frac{4 \times 0 + 4 \times 10 + 12 \times 20 + 6 \times 30 + 3 \times 40 + 7 \times 50 + 5 \times 60 + 4
\times 70 + 3 \times 80 + 3 \times 90}
{51}
= 0.396078431
\end{equation*}
which is about 40\%.

I did the computation in Excel, since I had already entered the
numbers to build the histogram. Table~\ref{table:formulas} shows the
formulas I used, starting at cell \cell{A:16}''

\begin{table}
\begin{center}
\begin{tabular}{|r|r|r|}
\hline
Usability Score & Number of websites & total contribution \\
\hline
0 & 4 & =A17*B17 \\
\hline
=A17+0.1 & 4 & =A18*B18 \\
\hline
=A18+0.1 & 12 & =A19*B19 \\
\hline
=A19+0.1 & 6 & =A20*B20 \\ 
\hline
=A20+0.1 & 3 & =A21*B21 \\ 
\hline
=A21+0.1 & 7 & =A22*B22 \\ 
\hline
=A22+0.1 & 5 & =A23*B23 \\ 
\hline
=A23+0.1 & 4 & =A24*B24 \\ 
\hline
=A24+0.1 & 3 & =A25*B25 \\ 
\hline
=A25+0.1 & 3 & =A26*B26 \\ 
\hline
=A26+0.1 & 0 & =A27*B27 \\ 
\hline
sums & =SUM(B17:B27) & =SUM(C17:C27) \\ 
\hline
mean &  & =C28/B28 \\
\hline
\end{tabular}
\caption{Weighted average of usability scores}
\label{table:formulas}
\end{center}
\end{table}

With the second interpretation, I will count the web pages with scores
between, say, 10\% and 20\% as if they all had scores of 15\%.
Then the mean usability score is the
weighted average 
\begin{equation*}
\frac{4 \times 5 + 4 \times 15 + 12 \times 25 + 6 \times 35 + 3 \times 45 + 7 \times 55 + 5 \times 65 + 4
\times 75 + 3 \times 85 + 3 \times 95}
{51}
= 0.446078431
\end{equation*}
which is about 45\%.

It's not an accident that this is exactly 5 percentage points
higher. Do you see why?

To see the magic of Excel, just change the \excel{0\%} to 
\excel{5\%} in \cell{A17} and watch all the computed values change.

\item Estimate the median usability score for these 51 home pages.

I will add up the numbers in each category starting from the left, to
see when I reach half the websites -- that would be 25 or 26. The sum
of the first three categories is
$4 + 4 + 12 = 20$; the sum of the first four is
$4 + 4 + 12 + 6 = 26$. So the 25th site in the
fourth category.

That means about half the websites have usability scores of 30\% or
less while the other half have scores 40\% or more. Since I have to
report a single number for the median, I'll say it's 35\%.

\item How many of these home pages have a usability score less than the
the median score?

Half the home pages have a score less than the median. That will be
true even if I made a mistake computing the median!

\end{abcd}
\end{sol}

\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}}
College costs.\index{college costs}

The College Board 
(\url{http://www.collegeboard.org/}
published the chart in Table~\ref{table:collegecostincrease}
with data about increases
in tuition and fees from 2009-2010 to  2010-2011 in public four year
colleges and universities. The right-hand column gives the percentage
of colleges and universities that increased tuition and fees by the
dollar range in the left-hand column.%
\footnote{You can find this data in \link{CollegCostIncrease.xlsx}.}

\begin{table}[ht]
\centering
\begin{tabular}{|r|r|}
\hline
Dollar Increase	& Percent of Colleges \\
\hline
 Under \$200  & 10.8\% \\
 \$200 to \$399  & 26.8\% \\
 \$400 to \$599  & 23.4\% \\
 \$600 to \$799  & 15.4\% \\
 \$800 to \$999  & 7.9\% \\
 \$1,000 to \$1,199  & 5.0\% \\
 \$1,200 to \$1,399  & 2.5\% \\
 \$1,400 to \$1,599  & 1.4\% \\
 \$1,600 to \$1,799  & 2.4\% \\
 \$1,800 to \$1,999  & 0.8\% \\
 \$2,000 to \$2199  & 3.6\% \\
\hline
\end{tabular}
\caption{College Cost Increase}
\tablesource{\url{http://www.collegeboard.org/}}
\tablecomment{Permission needed?}
\label{table:collegecostincrease}
\end{table}

\begin{abcd}

\item What is the mode increase in tuition and fees?
\item What is the median increase in tuition and fees?
\item Use Excel to calculate the average (mean) increase in tuition and fees.
\item Construct a histogram displaying this data.
\end{abcd}

\begin{sol}

\begin{abcd}

\item What is the mode increase in tuition and fees?

The mode is the highest bar: the range  \$200 to \$399. If you need
one number for the mode, use \$300. 

\item What is the median increase in tuition and fees?

By computing a running total of the percentages in each range I
discovered that the 50\% threshold is about halfway between the ranges
\$200 to \$399  and  \$400 to \$599, so the median increase is about
\$400.

\item Use Excel to calculate the average (mean) increase in tuition and fees.

I computed a weighted average of the middles of the ranges, using the
percentages for weights, to find out that the mean increase is about
\$630. The spreadsheet at \link{CollegeCostIncreaseSolution.xlsx} shows
the formulas I entered.

\item Construct a histogram displaying this data.

\begin{center}
\includegraphics[width=80mm]{\here/CollegeCostIncreaseSolutioncropped.pdf}
\end{center}
\end{abcd}

\end{sol}

\end{exx}

\begin{exx}[usincomedistibution]{\hassolution\complex\worthy
\sref{averagesfromhistograms}\gref{histogramaverages}\gref{histograms}}

Household income in the United States 

The histogram in 
Figure~\ref{householdIncome} 
\footnote{
From
\url{http://en.wikipedia.org/wiki/Household_income_in_the_United_States})
}
shows the percentage of the population in income groups
\$10,000 increments apart, except for the furthest two right columns
which correspond to increments of \$50,000. The data are from 2005.

\begin{figure}[ht]
\centering
 \includegraphics[width=130mm]{\here/householdIncome.png}
\caption{United States Household Income -- 2005}
\figsource{\url{http://en.wikipedia.org/wiki/Household_income_in_the_United_States}}
\figcomment{What permission is needed to reproduce from Wikipedia? It
would be easy to redraw this -- we have, in Excel.}
\label{householdIncome}
\end{figure}

\begin{abcd}

\item
Enter the data in an Excel spreadsheet.

\item
Build a histogram in Excel that comes as close as
possible to matching the one in wikipedia. Display it two ways, with
and without the phony three dimensional effect.
\index{phony 3D}

\item Do the percentages sum to 100\%? If not, what might explain the
discrepancy? 

\item
Estimate mean, median and
modal household income. Then discuss how your answers compare to the
corresponding figures in wikipedia.

\end{abcd}

You can find the raw data at the
US Census 2006 Economic Survey
(\url{http://pubdb3.census.gov/macro/032007/hhinc/new06_000.htm}) but
you should not need it to work this exercise.

\begin{sol}

\begin{abcd}
\item
Enter the data in an Excel spreadsheet.

\item
Build a histogram in Excel that comes as close as
possible to matching the one in wikipedia. Display it two ways, with
and without the phony three dimensional effect.
\index{phony 3D}

See \link{USIncomeDistributionSolution.xlsx}.

\item Do the percentages sum to 100\%? If not, what might explain the
discrepancy? 

The percentages sum to just over 97\%. I think the missing 3\% are the
households whose income is {\em more than} \$200K -- the really
well-off.

\item
Estimate mean, median and
modal household income. Then discuss how your answers compare to the
corresponding figures in wikipedia.

The modal household income is \$10K-\$20K since that's the highest bar
in the histogram.

Adding up the percentages of households starting at the left, you pass
50\% just about halfway through the \$40K-\$50K category. So I
estimate the median household income at about \$45K.

The spreadsheet I made shows that 
that the mean household income based on this data is about \$54K. It's
the weighted average of the incomes in each category. I remembered to
divide by 97.13, the total of the percentages.

If I estimate that the other three percent of households make an
average of \$500K my estimate of the mean increases to \$67K.

Wikipedia reports the census bureau assertion that the median
household income in 2007 was \$50,233.00. The data in the chart are
from 2005, which may explain the discrepancy.

\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\sref{averagesfromhistograms}\gref{skew}\untested} Are the data in the problem on household
unemployment in
\exref{hightower} consistent with the income distribution in the
previous problem?

\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}}
Fight for the Senate

Figure~\ref{fig:senate}
appeared in Nate Silver's Five Thirty
Eight column 
in \theTimes{} on October 31.
The $x$ axis displays the number of seats held by each
party: the Tie in the middle is 50 Democrats, 50 Republicans. The +10
Dem corresponds to 55 Democrats, 45 Republicans.

\begin{figure}[ht]
\centering
\includegraphics[height=1.7in]{\here/Oct31SenateProjectionFull}
\caption{The Fight for the Senate}
\label{fig:senate}
\figsource{\url{http://fivethirtyeight.blogs.nytimes.com/2012/10/31/oct-30-what-state-polls-suggest-about-the-national-popular-vote/}}
\figcomment{This particular graphic is probably no longer available.}
\end{figure}

Nate Silver constructed this histogram by imagining
(simulating) many thousands of elections and recording the percentage
of time each Democratic/Republican split occurred. 
We estimated the percentages in the chart and entered them 
in spreadsheet
\link{Oct31SenateProjection.xlsx} so you don't have to type them
yourself. (We rounded the really tiny percentages to zero.)
Use Excel whenever it's most
convenient for you.

\begin{abcd}
\item What is the most likely number of Democratic Senators?

\item What number of Democratic Senators represents the mode of this
distribution?

\item What is the probability that there are more than 50
Democratic Senators?

\item What number of Democratic Senators is the median of this
distribution?

\item If you had the complete list of all Nate Silver's imagined
elections and sorted it by the number of Democratic Senators, how many
Democratic Senators would there be in the middle election on that
list?

\item Use Excel to compute the (weighted) average number of Democratic
Senators for these imagined elections.

\item Recreate the chart in Excel, with proper labels. You need not
match the format exactly, but it might be fun to try -- at home, after
the exam, perhaps for Tuesday.


\end{abcd}


\begin{sol}

\begin{abcd}

\item What is the most likely number of Democratic Senators?

53 -- the highest bar.

\item What number of Democratic Senators represents the mode of this
distribution?

53 -- the highest bar.

\item What is the probability that there are more than 50
Democratic Senators?

85.6\% -- it's \excel{=SUM(C14:C21)} in the spreadsheet.

\item What number of Democratic Senators is the median of this
distribution?

The cumulative sum passes 50\% at 53 Democrats. So the median is 53.

\item If you had the complete list of all Nate Silver's imagined
elections and sorted it by the number of Democratic Senators, how many
Democratic Senators would there be in the middle election on that
list?

There would be 53 Democrats. This is just the median again.

\item Use Excel to compute the (weighted) average number of Democratic
Senators for these imagined elections.

The weighted average is 52.6 Democratic senators.

\item Recreate the chart in Excel, with proper labels. You need not
match the format exactly, but it might be fun to try.
See \link{Oct31SenateProjectionSolution.xlsx}.


\end{abcd}

\end{sol}
\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}\gref{excelaverages}
\gref{histogramaverages}}
Data from the census 

The Excel spreadsheet \link{FAM50.xlsx} 
contains information extracted from the March 
2006 Current Population Survey. Scroll across to see the
headings. There is an explanation of the headings in the
Data Dictionary at \link{Fam50DataDictionary.pdf}.

\begin{abcd}

\item Look at the 6th individual in the survey. (Which row in the
  spreadsheet?) Write a paragraph that describes everything you know
  about this person.

\item 
Explain why there might be different values for Personal Total
earnings, Personal Total Income, and Family Income. 

\item Use Excel to find the mean and median for family income for the
50 individuals in the table. 

\item Sort the data so that family incomes increase as you read down
the columns. Create a column chart for family income, with one column
for each family in the sample. What do you notice? Does
it seem strange? Is the distrbution consistent with the shape of the
histogram in Figure~\ref{householdIncome}?

\item Which of the two ``averages'' is most representative
of the data set?
\end{abcd}

\begin{sol}

\begin{abcd}
\item Look at the 6th individual in the survey. (Which row in the
  spreadsheet?) Write a paragraph that describes everything you know
  about this person.

Row 9 in the spreadsheet.

It's an error to say ``not in universe'' - what that means is that
occupation is not one of the listed categories.

\item 
Explain why there might be different values for Personal Total
earnings, Personal Total Income, and Family Income. 

Too easy \ldots

\item Use Excel to find the mean and median for family income for the
50 individuals in the table. 

\begin{verbatim}
   Median   $78,509             Mean $90,832
\end{verbatim}

\item Sort the data so that family incomes increase as you read down
the columns. Create a column chart for family income, with one column
for each family in the sample. What do you notice? Does
it seem strange? Is the distrbution consistent with the shape of the
histogram in Figure~\ref{householdIncome}?

\item Sort the data by family income and create a scatterplot for
family income. What do you notice? Does it seem strange?

\begin{figure}[ht]
\centering
\includegraphics[width=2.5in]{\here/FamilyIncomeScatterplotcropped.pdf}
\caption{Family Income Scatterplot}
\figsource{Chart from an Excel spreadsheet we built from census data.}
\label{fig:weirdplot}
\end{figure}

Figure~\ref{fig:weirdplot} is really weird. All the values but the
last one are on a straight line! 
If I made a histogram of this data the number of families in each
income range would be about the same, except for that last family. 

I cannot believe that this is a random sample of census records. I
wonder who selected these particular 50 families, how, and why.

I noticed something else interesting:
I can estimate the mean and median values for the first 49 families
right from the chart. Both are just about \$80K, half the largest
value. Adding one more family doesn't change the median, but it does
change the mean, adding to it about $1/50^{th}$ of the \$500K outlier,
giving a value of about \$90K. These figures agree with the Excel
calculations (of course).

\item Which of the two ``averages'' is most representative
of the data set?

The median is a better single statistical measure. The mean is much
higher because the distribution is skewed by the single high outlier.

\end{abcd}

\end{sol} 
\end{exx}


\begin{exx}{\hassolution}{\worthy\sref{averagesfromhistograms}
\gref{excelaverages}\gref{histogramaverages}}
What cities pay for fire protection.
\index{firefighters' pay}

On Monday, March 30 2009 \theGlobe{} published an article comparing the
amount various cities spent on Fire and EMS services.%
\footnote{
You can read the article at \url{http://www.boston.com/news/local/massachusetts/articles/2009/03/30/boston\_spends\_most\_on\_firefighters\_in\_us/}.}
Figure~\ref{fig:FireSpending} is a screenshot of a spreadsheet where
we entered some of the data, along with population
figures. You can download that spreadsheet from
\link{FireSpending.xlsx}. Use the data to answer these questions.

%That article included Figure~\ref{firepay} (not available on the internet).
%
%\begin{figure}[ht]
%\centering
%\includegraphics[width=80mm]{\here/FireSpending.jpg}
%\caption{Fire Protection Spending}
%\figsource{Scanned from \theGlobe, 3/30/2009.}
%\label{firepay}
%\end{figure}

\begin{figure}[ht]
\centering
\framebox{
\includegraphics[width=80mm]{\here/FireSpendingcropped.pdf}
}
\caption{Fire Protection Spending}
\label{fig:FireSpending}
\figsource{Excel screen capture, data from \theGlobe, 3/30/2009.}
\end{figure}

\begin{abcd}

\item Population

\begin{abcd}

\item What is the {\em mean} population of the twelve cities for which
data are presented.

\item What is the {\em median} population of the twelve cities for which
data are presented.

\item Create Table~\ref{cityPopulations} in Excel. Fill in the second column
there. Then create a properly labelled histogram for this data. 

\begin{table}[ht]
\centering
\begin{tabular}{|r|r|}
\hline
	population range & number of cities\\
\hline
	500K-600K & \\
	600K-700K & \\
	700K-800K & \\
	800K-900K & \\
	900K-1000K & \\
	1000K-2000K & \\
	2000K-3000K & \\
	3000K-4000K& \\ 
	$>$ 4000K & \\
\hline
\end{tabular}
\caption{City Populations}
\tablecomment{No data in this table.}
\label{cityPopulations}
\end{table}

\item Use your histogram to estimate the mode population for these
cities.

\item  What percent of the US population lives in these twelve cities?

\end{abcd}

\item Fire/EMS spending per person

\begin{abcd}

\item 
What is the {\em mean}  amount spent for Fire/EMS services
per person in these twelve cities?

\suspend{abcd}

\begin{hint}
This mean is {\em not} the \excel{AVERAGE} of the amounts spent per
person by each city. It's wrong to average
those numbers since they are already averages.
You must weight them by the city populations in order
to compute the total amount spent by all the people in all the
cities. Then divide by the total population.

You can't compute the median with Excel's \excel{MEDIAN} function for
the same reason. Further hint: almost everyone lives in New York.

\end{hint}
\resume{abcd}
\item Estimate the {\em median} amount
spent for Fire/EMS services per person by residents of these twelve
cities.

\item Estimate the {\em mode} amount
spent for Fire/EMS services per person by residents of these twelve cities.

\end{abcd}

\item What do firemen earn?

There is enough information in the spreadsheet to calculate the
average (mean) earnings of Fire/EMS personnel in each of the twelve
cities. Do that, in a fresh column in your spreadsheet.

\begin{abcd}

\item In which city do Fire/EMS personnel have the highest average salary?
How much is it?

\item In which city do Fire/EMS personnel have the lowest average salary?
How much is it?

\item Where does Boston rank in the list of Fire/EMS personnel
salaries?

\item Explain how Boston can be at the top of the list in Fire/EMS
expenses per resident although it does not pay the highest salaries.

\end{abcd}

\begin{hint}
You can't answer the first question by
finding the mean of the twelve numbers in column C. Compute the mean
correctly as a weighted average. You will probably\index{probably}
want to start by creating a column labeled 

\begin{verbatim}
	total Fire/EMS expenses
\end{verbatim}
\noindent
and fill in the value for each city.
\end{hint}

\item Correction the next day!

On Tuesday \theGlobe{} published a correction, which read in part:

\begin{qwrap}
\begin{quotation}
Boston spent \$285 per resident on its fire department during the last
fiscal year, according to the revised report. 
\webref{%
http://www.boston.com/news/local/massachusetts/articles/2009/03/31/error\_made\_in\_fire\_dept\_report/}
\end{quotation}
\sourceinfo{http://www.boston.com/news/local/massachusetts/articles/2009/03/31/error\_made\_in\_fire\_dept\_report/}
\end{qwrap}

Look at the answers to the questions above and indicate which have
changed (and how), which stayed the same. 

\end{abcd}

\begin{sol}

You can find the work in the spreadsheet at
\slink{FireSpendingSolution.xlsx}.

\begin{abcd}

\item Population

\begin{abcd}

\item A straightforward use of \excel{ =AVERAGE(B7:B18)} in Excel tells
me that the mean population is 1,661,071. The last few digits really
don't mean much since it's unlikely that the populations listed for
the cities are really accurate down to the last person. A better
answer is about 1,660,000.

\item A straightforward use of \excel{ =MEDIAN(B7:B18)} in Excel tells
me that the median population is 773,469. That's halfway between the
populations of San Francisco and Columbus. It's much too precise. A
better answer is that the median is about 770,000.

\item Here is the table:

\begin{center}
\begin{tabular}{|r|r|}
\hline
	population range & number of cities\\
\hline
	500K-600K & 2 \\
	600K-700K & 3 \\
	700K-800K & 2 \\
	800K-900K & 1 \\
	900K-1000K & 1 \\
	1000K-2000K & 1 \\
	2000K-3000K & 0 \\
	3000K-4000K& 1 \\ 
	$>$ 4000K &  1 \\
\hline
\end{tabular}
\end{center}

See the histogram in the spreadsheet. Note the labels for the axes and
the title, that the bars touch one another, and that the vertical
scale does not list halves of cities.

\item The highest bar in the histogram corresponds to the largest
number in the table: the mode population is 600K-700K.

\item  The \excel{SUM} function in Excel tells me that the total
population of these twelve cities is 19,932,848. 

That represents $19,932,848/300,000,000 \approx 7\%$ of 
the people in the United States

\end{abcd}

\item Fire/EMS spending per person

\begin{abcd}

\item 

Column C reports the average (mean) spending in each of the twelve
cities. To find the overall average I can't just average these
averages, since the cities are different sizes. I need to use those as
the weights. So I created column E to hold
\begin{verbatim}
	total Fire/EMS expenses
\end{verbatim}
\noindent
and filled each row with the product of the values in columns B and C:
the number of 
residents times the spending per resident. Then I summed column E to
find out that these cities spent a total of \$3,707,134,657 (almost
four billion dollars!) to provide Fire/EMS services for their total
population of 19,932,848 people. Dividing, I found that on average
they spent 
\begin{equation*}
	\frac{\$3,707,134,657}{19,932,848 \hbox{\ people}} 
	\approx 
	186 \frac{ \$ }{\hbox{person}}
\end{equation*}

\item New York accounts for more than 40\% of the 20,000 people in these
cities. If you add in Los Angeles then you get more than half. Since
those two are the two cities at the bottom of the list, the median
Fire/EMS cost per resident is somewhere between the values for those
cities - probably\index{probably} about \$150.

\item The most common amount is what the New Yorkers pay, so the mode
is \$157.56.

\end{abcd}

\item What do firemen earn?

There is enough information in the spreadsheet to calculate the
average (mean) earnings of Fire/EMS personnel in each of the twelve
cities. Do that, in a fresh column in your spreadsheet.

I computed that two ways, just to make sure I was right. The first way
uses the contents of columns C and D to fill column H this way: \excel{
=1000*Cn/Dn} for each of the rows. That works because the units for
column C are \$ per resident while for column D they are firemen per
1000 residents. When I divide I get \$ per fireman, which is just what I
want.

The second way is to compute the number of firemen in each city --
fill column H with \excel{ =B7*D7/1000}. Then divide the total fire/EMS
expenses (already computed in column E) by that number and put the
result in column I. Columns H and I match.

\begin{abcd}

\item Firemen in Los Angeles have the highest average salary:
\$153,111.

\item Firemen in Baltimore have the lowest average salary: \$83,696.


\item Boston ranks fourth from the top, at \$132,985.

\item Boston spends more per resident on Fire/EMS personnel than other
cities even though it pays them less because it has a lot more of them
in proportion to population -- it's at the top of that list.

\end{abcd}

\item Correction the next day!

When I change the Boston Fire/EMS expense per resident from
\$452.15 to \$285 the following figures change:

\begin{itemize}

\item The mean Fire/EMS expense becomes \$181/resident, down from
\$186 per resident.

\item The mean Fire/EMS salary becomes \$107,600, down from \$110,635.

\item The average Boston Fire/EMS salary falls from \$132,985 to
\$83,824 -- second lowest, only about \$100 above the minimum.

\end{itemize}

\end{abcd}


\end{sol}


\end{exx}

\begin{exx}{\untested\sref{averagesfromhistograms}\gref{excelaverages}
\gref{histogramaverages}}
When Does Old Age Begin?

On June 29, 2009 \theGlobe{} reported that 

\begin{qwrap}
\begin{quotation}
\firstline{Americans differ on when old age begins. On average, they}
say 68. People under age 30 believe it begins at 60, while those 65
and older push the threshold to 74. Of all those surveyed, most said
they wanted to live to 89. 
\webref{%
http://www.boston.com/news/nation/articles/2009/06/29/generation_gap_in_us_largest_since_1960s_study_reveals/}
\footnote{
The survey that provided statistics is at
\url{%
http://pewresearch.org/pubs/1269/aging-survey-expectations-versus-reality?src=prc-latest&proj=peoplepress
}
}
\end{quotation}
\sourceinfo[792]
{http://www.boston.com/news/nation/articles/2009/06/29/generation_gap_in_us_largest_since_1960s_study_reveals/}
\end{qwrap}

The article goes on to ask

\begin{qwrap}
\begin{quotation}
When Does Old Age Begin?

At 68. That's the average of all answers from the 2,969 survey
respondents. But as noted above, this average masks a wide, age-driven
variance in responses. More than half of adults under age 30 say the
average person becomes old even before turning 60. Just 6\% of adults
who are age 65 or older agree. 
\end{quotation}
\sourceinfo[792]{see previous quotation -
http://www.boston.com/news/nation/articles/2009/06/29/generation_gap_in_us_largest_since_1960s_study_reveals/}
\end{qwrap}

and displays Figure~\ref{oldage}.

\begin{figure}[ht]
\centering
\includegraphics[width=80mm]{\here/OldAgeSurvey.jpg}
\caption{When does old age begin?}
\label{oldage}
\figsource{From \theGlobe's website, 6/29/2009,
\url{http://www.boston.com/news/nation/articles/2009/06/29/generation_gap_in_us_largest_since_1960s_study_reveals/}}
\end{figure}

The ``average'' of 68 is clearly a mean computed as weighted
average. Try to verify it. You may need to estimate some statistics --
in particular, the fraction of respondents in each age category.

Can you estimate the median age at which people think old age begins?

\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}
\gref{histogramaverages}}
College presidents' pay.

Here is a histogram showing the total compensation
for the 100 
best paid presidents of public universities. The data is from an
article in the April 3, 2011 issue of the {\em Chronicle of Higher
Education}
(\url{http://chronicle.com/article/Presidents-Defend-Their/126971})

\begin{center}
\includegraphics[height=60mm]{\here/CEOSalaryTablecropped.pdf}
\end{center}
\begin{figure}
\addtocounter{figure}{-1}
\figcomment{This figure has no figure number.}
\figsource{Chart from an Excel spreadsheet we constructed. The data is
from \url{http://chronicle.com/article/Presidents-Defend-Their/126971}}
\end{figure}
To save you staring at the picture, the number of presidents in each
of the ranges (reading from left to right) is 
51, 25, 15, 8 and then 1 in the last range.

You may use Excel for this exercise if you wish, but you don't have to.

\begin{abcd}

\item What is the mode of this distribution?

\item Estimate the median compensation.

\item Estimate the mean compensation.

\item Write two arguments one of which indicates that the average
president is paid appropriately, one that presidential pay is too
high. 

\end{abcd}

\begin{sol}
\begin{abcd}

\item What is the mode of this distribution?

The mode is \$350K-\$450K -- the most common salary range.

\item Estimate the median compensation.

51 presidents are in the \$350K-\$450K range and 49 make more than
that, so the median is \$450K -- right at the top of the most common
range.

\item Estimate the mean compensation.

I computed a weighted average of the middles of the ranges, with
weights the number of presidents in each range:
\begin{equation*}
(51*400 + 25*500 + 15*600 + 8*700 + 1*1300)/100
 = 488,
\end{equation*}
so the mean compensation is approximately \$490K. Perhaps that should
be rounded to \$500K.

I redid the computation in the spreadsheet at
\slink{CollegePresidentsSolution.xlsx}.

\item Write two arguments one of which indicates that the average
president is paid appropriately, one that presidential pay is too
high.

Most of the 100 best paid presidents of public universities make about
\$400K per year. They run large organizations in the public interest
and make much less money than CEOs of companies of comparable
size. They are fairly paid.

The average salary of these presidents of public universities is
nearly half a million dollars a year. That is much more than faculty
make at those institutions. They should be paid more like public
servants.

\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\artificial\hassolution\worthy\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}}
\myindex{Xorlon} \myindex{Fleegs}

Table~\ref{table:fleegs} shows the distribution of weights of a sample
of Fleegs from 
the planet Xorlon, where weight is measured in frams.\index{fram}
\footnote{We made up the numbers in this artificial problem so that
the arithmetic would be easy. We made up the words, too.
``frams'' is one Prof. Bolker could have used in a \myindex{Scrabble}
game with his wife -- for 57 points on a triple word score.}

\begin{table}[ht]
\centering
\begin{tabular}{|c|c|}
\hline
weight range (frams) & percent of sample \\
\hline
20-40 & 10 \\
40-60 & 10 \\
60-80 & 20 \\
80-100 & 10 \\
100-120 & 50 \\
\hline
\end{tabular}
\caption{Xorlon Fleegs}
\tablesource{Made up data.}
\label{table:fleegs}
\end{table}

\begin{abcd}

\item Sketch a neat properly labelled histogram that displays this data. 

\item Create a properly labelled histogram in Excel that matches the
one you just drew. You can start by downloading \link{XorlonFleegs.xlsx}.

\suspend{abcd}

Answer the following questions. You may work in Excel or with a
calculator or do mental arithmetic. 
\resume{abcd}

\item Estimate the mode Fleeg weight.

\item Estimate the median Fleeg weight.

\item Estimate the mean Fleeg weight.

\item Estimate the percentage of Fleegs that weigh more than the median.

\item Estimate the percentage of Fleegs that weigh more than the mean.

\item Estimate the percentage of Fleegs that weigh more than the mode.

\end{abcd}

\begin{sol}
\begin{abcd}

\item Figure~\ref{fig:FleegsSketch} shows a neat properly labelled
histogram that displays this data. 
\fromstudent{Courtney Allen}{}{November 1, 2011}

\begin{figure}[ht]
\centering
\framebox{
\includegraphics[height=60mm]{\here/FleegsSketch.jpg}
}
\caption{Xorlon Fleegs}
\label{fig:FleegsSketch}
\end{figure}

\item Figure~\ref{fig:FleegsExcel} shows an Excel
histogram that displays this data. The source is the spreadsheet
\slink{XorlonFleegsSolution.xlsx}.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/FleegsExcelcropped.pdf}
\caption{Fleegs on Xorlon}
\label{fig:FleegsExcel}
\end{figure}

\item Estimate the mode \myindex{Fleeg} weight.

The highest bar is 100-120 frams, so that, or 110 frams, is a good
estimate for the mode.

\item Estimate the median Fleeg weight.

Half the Fleegs in the sample weigh less than 100 frams and half weigh
more, so the median weight is 100 frams.

\item Estimate the mean Fleeg weight.

Averaging the midpoints of each weight range using the percentages as
weights is the best way to estimate the mean. The result is
\begin{equation*}
0.10 \times 30 +
0.10 \times 50 +
0.20 \times 70 +
0.10 \times 90 +
0.50 \times 110
= 86
\end{equation*}
so the mean weigh is 86 frams.

\item Estimate the percentage of Fleegs that weigh more than the
median.

The {\em meaning} of {\em median} tells me that just
50\% of the Fleegs are heavier than the median weight.

\item Estimate the percentage of Fleegs that weigh more than the
mean.

86 frams is 6/20 or three tenths of the way into the 80-100 fram range, so I'd
estimate that the percentage of Fleegs weighing more than that is
\begin{equation*}
   \frac{7}{10}10\% + 50\% = 57\%.
\end{equation*}

\item Estimate the percentage of Fleegs that weigh more than the
mode.

I estimated the mode as 110 frams. Half the Fleegs in the highest
range, or 25\%, weigh more than that.
\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\hassolution\artificial\sref{averagesfromhistograms}
\gref{histogramaverages}}
\myindex{Ruritania}

Find the mean, median and mode age for male residents of
Ruritania%
\footnote{
Ruritania is a fictional country in central
Europe which forms the setting for {\em The Prisoner of Zenda}, a
fantasy novel written by Anthony Hope.)
}
using the histogram or the data in the spreadsheet
\link{Ruritania.xlsx}.

\begin{sol}
The mode is easy -- it's 0-9 years, since that's the longest bar in
the histogram.

Adding the percentages in each category I see that the first three
come to exactly 50\% (in Excel, where the numbers are rounded for
display but not for computation). So the median Ruritanian age is
about 30 years (half the people are younger, half older.)

I used Excel to compute the weighted average of the midpoints of the
categories (5, 15, \ldots) and found the mean age to be about 36 years.
\end{sol}

\end{exx}


\begin{exx}{\artificial\hassolution\sref{percentiles}\gref{percentiles}}
Wing Aero percentiles

\begin{abcd}

\item How many Wing Aero employees are in the bottom tenth percentile
in salary?

\item What is the salary cutoff for the bottom tenth percentile?

\item Answer the same questions for the top tenth percentile.

\end{abcd}

\begin{sol}

\begin{abcd}

\item How many Wing Aero employees are in the bottom tenth percentile
in salary?

By definition, 10 percent of the Wing Aero employees are in the
bottom tenth percentile in the salary distribution. Since there are 30
employees, 3 are in the bottom tenth percentile.

\item What is the salary cutoff for the bottom tenth percentile?

The salaries of the three workers in the bottom tenth percentile are
\$17K, \$19K and \$21K. The next 
highest salary is \$25K, so any number between that and \$21K can be
thought of as the cutoff point for the bottom tenth percentile.

\item Answer the same questions for the top tenth percentile.

The three employees in the top tenth percentile 
are the CEO, the CFO and one of the other two
executives. The salary cutoff for the top tenth percentile is \$250K.

\end{abcd}
\end{sol}
\end{exx}


\begin{dummyexx}{\needsquestions\sref{bellcurve}}
Bell curve \ldots
\end{dummyexx}


\begin{exx}{\untested\sref{marginoferror}\gref{normaldistribution}}
Activists press officials to put sick-leave proposal to voters

An article in the \emph{Orlando Sentinel} on August 6, 2012 discussed a
ballot initiative that would require employers with 15 or more workers
to provide paid time off for employees for illness-related issues.
The article polled voters to gauge support for placing this question
on the ballot for the November election, and noted that  

\begin{qwrap}
\begin{quotation}
$\ldots$ 
\firstline{%
Results of a poll released Monday show heavy support for the}
initiative among Orange County voters. The poll commissioned by
Citizens for a Greater Orange County found that 67 percent of
residents likely to vote in November support the measure, while 26
percent oppose it. 

Support is stronger among minorities, women and young voters,
according to the poll, which claims a 4.4 percentage-point margin of
error among 500 people surveyed by phone a month ago. 
\webref{%
http://articles.orlandosentinel.com/2012-08-06/news/os-sick-leave-ballot-race-20120806_1_ballot-language-signatures-sick-time}
\end{quotation}
\sourceinfo[684]{Orlando Sentinel, 
Monday, August 6, 2012,
http://articles.orlandosentinel.com/2012-08-06/news/os-sick-leave-ballot-race-20120806_1_ballot-language-signatures-sick-time}
\end{qwrap}

\begin{abcd}
\item  Why don't the percentages from this poll add up to 100\%?   
\item  Explain why this statement is not true:  ``67\% of the
residents likely to vote in November support the measure." 
\item  Explain what the 4.4 percentage point margin of error means for
this poll.   
\end{abcd}
\end{exx}


\begin{exx}{\untested}
\erdos{} numbers \index{\erdos{}, Paul}

Paul \erdos{} (1913-1996) was the most prolific mathematician of the
twentieth century. He was famous (in mathematical circles) for the way
he worked -- he travelled from school to school, writing joint papers
with the mathematicians at each.%
\teachertag
\begin{teacher}
Your students will probably not think this topic is useful or
interesting but it might be fun if introduced in class. 
\end{teacher}

From Wikipedia:
%
\begin{qwrap}
\begin{quotation}
An \erdos{} number describes a person's degree of separation from \erdos{}
himself, based on their collaboration with him, or with another who
has their own \erdos{} number. \erdos{} alone was assigned the \erdos{} number
of 0 (for being himself), while his immediate collaborators could
claim an \erdos{} number of 1, their collaborators have \erdos{} number at
most 2, and so on.%
\webref{https://en.wikipedia.org/wiki/Paul_Erd\%C5\%91s}
\end{quotation}
\sourceinfo{https://en.wikipedia.org/wiki/Paul_Erd\%C5\%91s}
\end{qwrap}

You can think of the 
\erdos{} Number Project as a description of the social network of
mathematicians. Its home page is 
\url{http://www.oakland.edu/enp}; there you can find out that

\begin{quotation}
\ldots the median \erdos{} number is 5; the mean is 4.65, and the
standard deviation is 1.21.
\end{quotation}

Table~\ref{table:erdos} shows the raw data.

\begin{table}[ht]
\centering
\begin{tabular}{|cr|}
\hline
\erdos{} number & mathematicians \\
\hline
 0 & 1 \\
 1 & 504 \\
 2 & 6593  \\
 3 & 33605  \\
 4 & 83642  \\
 5 & 87760  \\
 6 & 40014  \\
 7 & 11591  \\
 8 & 3146  \\
 9 & 819  \\
 10 & 244  \\
 11 & 68  \\
 12 & 23  \\
 13 & 5 \\
\hline
\end{tabular}
\caption{\erdos{} numbers}
\label{table:erdos}
\end{table}


\begin{abcd}
\item Use Excel to draw a histogram for the distribution of \erdos{}
  numbers. 
\item What is the mode of this distribution?
\item Verify the claims for the median and mean.
\item Verify the claim for the standard deviation.
\item Professor Bolker (one of the authors of this book) has an 
\erdos{} number of 2 (he wrote a paper with Patrick O'Neil, who
wrote a paper with \erdos{}). Why is Professor Mast's 
\erdos{} number at most 3? How might it be less than 3?
\item How many mathematicians have a finite \erdos{} number?
\item There are some mathematicians whose 
\erdos{} number is infinite. How can that be?

\end{abcd}
\end{exx}

\begin{ExtraExercises}

\begin{exx}{\hassolution}
Faculty salaries

The chart in Figure~\ref{fig:uth} shows the salary distribution of
tenured faculty at 
the University of Texas Houston.  Vertical scale represents the number
of faculty, the horizontal axis shows salary ranges in increments of
\$20,000. 

\begin{figure}
\centering
\framebox{
\includegraphics[width=3in]{\here/UTH.png}
}
\caption{Faculty Salaries}
\label{fig:uth}
\end{figure}


\begin{abcd}
\item Recreate this data in Excel as both a table and a histogram. Print
your results on a single page and attach it to the exam. 


\item What is the modal salary for UTH faculty?

The mode is the highest bar -- about 60-80K.

\item What is the median salary for UTH faculty?

The halfway point to the 203 faculty members (I added up the heights
of the bars in Excel) occurs partway through the second bar, so the
median is about  70K.

\item What is the mean salary for UTH faculty?

The mean salary is about 81K -- computed in Excel as a weighted average.
mean	81.03448276

\end{abcd}
\begin{sol}
\begin{abcd}
\item Recreate this data in Excel as both a table and a histogram. Print
your results on a single page and attach it to the exam. 


\item What is the modal salary for UTH faculty?

The mode is the highest bar -- about 60-80K.

\item What is the median salary for UTH faculty?

The halfway point to the 203 faculty members (I added up the heights
of the bars in Excel) occurs partway through the second bar, so the
median is about  70K.

\item What is the mean salary for UTH faculty?

The mean salary is about 81K -- computed in Excel as a weighted average.
mean	81.03448276

\end{abcd}

\end{sol}

\end{exx}
%
%\begin{figure}
%\centering
%\includegraphics[height=50mm]{\here/BadPieChart.png}
%\caption{Excel adjusts the data}
%\label{fig:badpiechart}
%\figsource{Chart from an Excel spreadsheet we wrote.}
%\end{figure}
%

\begin{exx}{\untested\sref{median}\gref{excelaverages}}
Doublethink.

\headline{Where The One Percent Live: The 15 Richest Counties In America}

\begin{qwrap}
\begin{quotation}
\firstline{%
Living in Arlington isn't cheap, so you'd better be making at least}
the median household income to live in this county just outside
Washington, D.C. 
\webref{%
http://www.businessinsider.com/where-the-one-percent-live-the-15-richest-counties-in-america-2012-2\#5th-richest-arlington-county-va-11
}
\end{quotation}
\sourceinfo[105]{%
www.businessinsider.com/where-the-one-percent-live-the-15-richest-counties-in-america-2012-2\#5th-richest-arlington-county-va-11
}
\end{qwrap}

How does this statement contradict itself?
\footnote{
Thanks to David Kung for this exercise.
}
\end{exx}

\begin{exx}{\needsquestions\sref{marginoferror}}
\headline{Surgery offers no advantage for early prostate cancer, study
finds}

That article reported on a clinical trial involving 731 men diagnosed
with prostate cancer. About half had surgery; the rest were monitored.

\begin{qwrap}
\begin{quotation}
After 12 years, nearly 6 percent of men who had immediate surgery died
of the cancer, compared with slightly more than 8 percent of those
patients who were observed, which was not a great enough difference to
reach statistical significance.  
\webref{http://bostonglobe.com/lifestyle/health-wellness/2012/07/18/surgery-offers-survival-advantage-for-older-men-with-early-stage-prostate-cancer-study-finds/T5XM7APIuoZuav6PbJzYuI/story.html}
\end{quotation}
\sourceinfo[1038]{http://bostonglobe.com/lifestyle/health-wellness/2012/07/18/surgery-offers-survival-advantage-for-older-men-with-early-stage-prostate-cancer-study-finds/T5XM7APIuoZuav6PbJzYuI/story.html}
\end{qwrap}

{\em Is there a reaonable way to tie ``not statistically significant''
to ``margin of error''?}

\begin{abcd*}
\item About how many men were in each category?
\item About how many deaths were there in each category?
\end{abcd*}
\end{exx}


\begin{exx}{\untested}
The article on new car and truck prices that we studied in
\sref*{weightsmatter} first asserts that 

\begin{quotation}
\ldots
the average price of a new
vehicle in the second quarter [of 2008] fell 2.3 percent from a year
earlier to \$25,632 \dots
\end{quotation}

and later

\begin{qwrap}
\begin{quotation}
\firstline{The result is the average new vehicle now costs less than}
40 percent of an average household's median annual income, the analysts said,
whereas from 1991 to 2007, it would cost more than half of the median
income.  
\webref{%
http://www.boston.com/business/articles/2008/09/05/new_car_prices_fall_at_fastest_rate_ever}
\end{quotation}
\sourceinfo[630]{From the AP, also cited in previous chapter.
www.boston.com/business/articles/2008/09/05/new\_car\_prices\_fall\_at\_fastest\_rate\_ever
}
\end{qwrap}

Verify as much of this last assertion as you can.
\end{exx}

Review exercises

\begin{exx}  
Create an Excel spreadsheet and put the following numbers in the first colum.  \begin{verbatim}
14  15  22  50  0  33  16  18  23  40  47
\end{verbatim}
\begin{abcd*}
\item Use Excel to find the mean, median and mode of these numbers
\item  Change the first number from 14 to 23.  How do the Excel calculations change?
\item  Click the ``undo" button and confirm that Excel reverts back to the original set of numbers.
\item  Change the last four numbers to 0 (so that the data now read
\begin{verbatim}
14  15  22  50  0  33  16  0  0  0  0
\end{verbatim}
How do the different averages change?  Explain how the data are skewed.
\end{abcd*}
\end{exx}

\begin{exx}{\untested\artificial\gref{excelroutine}\gref{excelwhatif}}
Enrollments

The final enrollment report for the past year at an unnamed small
college provided the following information about students:  450
students  freshmen; 421 students sophomores; 400 students juniors and 
511 seniors. 

\begin{abcd*}
\item Create an Excel spreadsheet containing this data.  Label the
columns appropriately. 

\item  Ask Excel to calculate the total number of students enrolled
during the past year.  Label this result.
 
\item  Create a properly labelled bar chart of the student data.

\item  A corrected enrollment report noted that there were 419
juniors.  Make that adjustment in your spreadsheet and check that the
other information (total number of students, bar chart) is correctly
updated.

\item  Using this new information, ask Excel to calculate the
percentage of students who are freshmen, sophomores, juniors and
seniors. Copy and paste so that you type as few formulas as possible.

\item Create a new bar chart displaying the percentages.

\item  Convert your bar chart to a pie chart.  

\item  Arrange the data with percentages and the final bar and pie
charts so that they print on one page.
\end{abcd*}
\end{exx}

\begin{exx}{\untested\artificial\gref{percentiles}}
SAT exam percentiles

A student received this notification on his 
college entrance exam:

\begin{verbatim}
English Language Arts:	77th percentile
Mathematics:  		88th percentile
\end{verbatim}

Explain these this report in everyday language. 

\begin{hint}
Your answer might begin ``More than three quarters of the students
taking this test \ldots''
\end{hint}

\end{exx}

\begin{exx}{\untested\sref{mmm}\gref{excelaverages}}
Comparing the states

You can do this exercise using Excel, or with properly documented
research. (Your instructor may specify one method or the other.)

\item Find the mean, median and mode for the populations of the 50
states. 

\item Display the answers to the previous question on a properly
labelled histogram. Discuss your findings -- is the distribution
skewed?

\item Redo parts (a) and (b) for the {\em areas} of the states.

\item Redo parts (a) and (b) for the {\em population densities}
(people per square mile).

\end{exx}

\begin{exx}{\needsquestions}
Jellybean margin of error

\url{http://andrewgelman.com/2011/08/that_xkcd_carto/}
\end{exx}
\end{ExtraExercises}

\begin{ReviewExercises}

\end{ReviewExercises}

\begin{ScopeExercises}

\begin{exx}{\untested\complex\sref{barcharts}\gref{excelchart}
\gref{excelaverages}}
Median wages and incomes. 

\theGlobe{} printed Figures~\ref{fig:medianwagechange} and
\ref{fig:incomewagechange} on January 1, 2011.

\begin{figure}[ht]
\centering
\framebox{
\includegraphics[height=60mm]{\here/MedianWageChange.jpg}
}
\caption{Median Wages}
\figsource{Scanned from \theGlobe, 1/1/2011.}
\label{fig:medianwagechange}
\end{figure}

\begin{figure}[ht]
\centering
\framebox{
\includegraphics[height=60mm]{\here/IncomeWageChange.jpg}
}
\caption{Wage and Income}
\figsource{Scanned from \theGlobe, 1/1/2011.}
\label{fig:incomewagechange}
\end{figure}

\begin{abcd}
\item Reproduce Figure~\ref{fig:medianwagechange} in Excel.

\item Compare changes in median wage with changes in median income.

\item Are these data consistent with other reports of median income?
\end{abcd}
\end{exx}


\begin{exx}{\complex\sref{mmm}\gref{excelaverages}}
The \myindex{Lahey Clinic}

Figure~\ref{lahey} appeared in an advertisement for the Lahey Clinic
in \theGlobe{} on March 6, 2009.

\begin{figure}[ht]
\centering
\includegraphics[width=80mm]{\here/1in9.jpg}
\caption{Lahey Clinic Advertisement}
\figsource{Scanned from \theGlobe, 3/6/2009.}
\label{lahey}
\end{figure}

Our first reaction to the ad was ``That can't be true. There
are thousands of hospitals, so more than nine of them must be above
average.''  Then we thought about it some more.

\begin{abcd}

\item Use what you learned in the previous exercise  to show that the  assertion in the ad isn't impossible,
although it is unlikely. 

\item 
Invent some reasonable numbers for hospital heart attack survival rates that
would make the ad correct.

\item  
The ad included a reference to 
\url{http://knowyournumbers.lahey.org/} for more information. There we
found the source
\url{http://www.usatoday.com/news/health/2008-08-21-hospitals-standouts_N.htm}
(Note date). 

Describe how the author of the ad reached the conclusion she wanted.

Note that the table provides information only about the exceptional
hospitals. It does not tell you how many unexceptional ones that are
not reported on here. Try to find out more about the missing data.

%That URL provided some data, which we downloaded to
%\link{hospitalData.csv}. The format is \excel{ .csv}, standing for ``comma separated
%values''. It should open in Excel.
%
%Analyze that data to see if you can find out how the person who wrote
%the ad could justify his conclusion. (We haven't tried to do that yet.)
%
%You may find the data confusing at first. Don't give up too
%quickly. Try to sort out what you need. You might want to pay
%particular attention to just the rows for which the value in the Topic
%column is Heart Attack. You can get those together 

\end{abcd}

\end{exx}


\end{ScopeExercises}