% IncomeDistribution/contents.tex
\chapter{\mychaptername}
\label{\here}
%\chapter[toc version]{doc version}
%\chaptermark{version for header}
\chaptermark{Excel, Charts, and Statistics}

\tocnotetoo{
This chapter covers a lot of ground --- two new kinds of average
(median and mode) and ways to understand numbers when they come in large
quantities rather than just a few at a time: bar charts, histograms,
percentiles and the bell curve. To do that we introduce spreadsheets
as a tool.}%

\begin{goals}

\begin{goal}{excelaverages}
Work with mean, median and mode of a dataset.
\end{goal}

\begin{goal}{normaldistribution}
Introduce the normal distribution, with its mean and standard deviation.
\end{goal}

\begin{goal}{skew}
Understand how skewed distributions lead to inequalities among mean,
median and mode.
\end{goal}

\begin{goal}{excelroutine}
Make routine calculations in Excel.
\end{goal}

\begin{goal}{excelwhatif}
Use Excel to ask and answer ``what-if'' questions.
\end{goal}

\begin{goal}{excelchart}
Create bar charts and other types of charts in Excel.
\end{goal}

\begin{goal}{histograms}
Use histograms to group and explore data.
\end{goal}

\begin{goal}{histogramaverages}
Calculate averages for grouped data.
\end{goal}

\begin{goal}{percentiles}
Understand what a percentile is and how to interpret percentile information.
\end{goal}

\begin{goal}{descriptivestatistics}
Understand the basics of descriptive statistics, including bell curve, bimodal 
data and margin of error.
\end{goal}

\end{goals}

\qrsection[wingaero]{Salaries at \smallCo{}}

Table~\ref{wingaero} shows the distribution of workers' salaries at
\smallCo,  a small hypothetical company.
In keeping with our \CommonSense\  philosophy we should work with
real data. But most companies keep
this kind of information private. Any similarity between our imagined
\smallCo{} and any real company is purely coincidental.
In \exref{usincomedistibution} we'll apply the lessons we learn to
look at income distribution
in society at large. 

The company has about 30 employees. We want to understand the salary
distribution. A natural place to begin is with the average. But adding
thirty numbers by hand (even with a calculator) is
tedious and error-prone. A spreadsheet computer program can do the
arithmetic faster and more accurately. In this text the spreadsheet we
use is Microsoft's Excel.

We've organized this chapter as a spreadsheet tutorial --- you can
follow it step by step in Excel as you read it.
\begin{teacher}
We find students come to our course with a wide range of technology
experience. They can all manage word processing, web searching and
email. Some have used Excel. Those for whom it is new find this
chapter hard going. Often pairing experienced and inexperienced
students in the lab classroom works well.
\end{teacher}

See \sref*{usingsoftware} for 
some general software tips and information about alternatives to Excel.

If you're online you can save typing time by downloading the
\smallCo{} spreadsheet from \link{WingAero.xlsx}. That spreadsheet and
all the others you'll need live at 
\href{\webhome/spreadsheets}{\texttt{\csmnet/spreadsheets}}.
%\url{\webhome/spreadsheets}.
If you build it for yourself, use column \cell{A} for the
employees and column \cell{B} for the salaries. Put the labels in row
\cell{7} and the data in rows \cell{8:37}, not side by side as in the
table. You should see Figure~\ref{fig:WA1screenshot}.


\begin{table}
\centering
\subfloat{
	\begin{tabular}{lS[table-format=3.0]}
\toprule
	Employee	& {Salary} 	 \\
		 & {(thousands of \$)}	 \\
\midrule
CEO & 299  \\
CTO & 250  \\
CIO & 250  \\
CFO & 290 \\
Manager & 77  \\
Manager & 123  \\
Manager & 84  \\
Manager & 63  \\
Manager & 68  \\
Manager & 49  \\
Manager & 82  \\
Manager & 87  \\
Supervisor & 42  \\
Supervisor & 37  \\
Supervisor & 29  \\
\bottomrule
    \end{tabular}
}
\subfloat{
	\begin{tabular}{lS[table-format=2.0]}
\toprule
	Employee	& {Salary} 	 \\
		 & {(thousands of \$)}	 \\
\midrule
Supervisor & 43  \\
Supervisor & 51  \\
Supervisor & 38  \\
Supervisor & 33  \\
Supervisor & 42  \\
Supervisor & 49  \\
Worker & 25  \\
Worker & 19  \\
Worker & 41  \\
Worker & 17  \\
Worker & 26  \\
Worker & 25  \\
Worker & 21  \\
Worker & 28  \\
Worker & 27  \\
\bottomrule
	    \end{tabular}
}
\caption{Wing Aero salary distribution}
\label{wingaero}
\end{table}

\figfile{WA1screenshot.png}
\begin{figure}
\centering
\includegraphics[width=4.5in]{\thefigurefilename}
\caption{\smallCo{} spreadsheet (some hidden rows)}
\label{fig:WA1screenshot}
\end{figure}
\figfile{}

What do you notice?

The employees are listed in decreasing order of importance (or prestige),
but only approximately in decreasing order of salary. Some supervisors
earn more than some managers, and some workers more than some
supervisors. We can make those discrepancies visible by sorting
the data.

Select the rectangular block of data in rows \cell{8} through \cell{37},
columns \cell{A} and \cell{B}. Be sure to select both columns so they
will be sorted together. 
Choose the \excel{Sort} dialog box from the \excel{Data} tab, select
sorting by Salary, \excel{Largest to Smallest}, as in
Figure~\ref{fig:WA2screenshot}.%

Often Excel offers you more than one way to do a job.
This is one way to sort, in Excel 2013. There are others. 
Other versions of Excel may use different menus. 
But the ability to sort
will be available in any spreadsheet program you use.

\figfile{WA2screenshot.png}
\begin{figure}
\centering
\includegraphics[width=4.5in]{\thefigurefilename}
\caption{Sorting \smallCo{} salaries}
\label{fig:WA2screenshot}
\end{figure}
\figfile{}

If you sort the data again alphabetically (by Employee, \excel{A to
  Z}) the table returns (nearly) to its original state, because 
the employee categories were alphabetical at the start. But each
category is now sorted by salary. 

To find the average salary at \smallCo{}  we tell Excel
to add the entries in column \cell{B} and divide by the number of
employees. 

Enter the label \excel{Total}
in cell \cell{A38}. Then go to cell \cell{B38}. Make sure
the \excel{Formula Bar} is visible. (Use the 
\excel{View} menu to find it if
it's not.) In the formula box type an equals sign \excel{=}, to tell Excel
you want it to do some arithmetic, and then the name of the operation,
\displayexcel{ =SUM(}
\noindent 
since you are about to add up some numbers. The open parenthesis
asks Excel to prompt you for information. It suggests
\displayexcel{
SUM(number1, [number2], $\ldots$)
}
as in Figure~\ref{fig:WA3screenshot}.

\figfile{WA3screenshot.png}
\begin{figure}
\centering
\includegraphics[width=4.5in]{\thefigurefilename}
\caption{Summing \smallCo{} salaries}
\label{fig:WA3screenshot}
\end{figure}
\figfile{}

Select cells \cell{B8:B37}, close the parentheses and type \excel{enter} or
click the check icon on the \excel{Formula Bar}. You should see
\excel{Total 2315} 
in cells \cell{A38} and \cell{B38}. \smallCo's total annual payroll is
\$2315K --- 
about \$2.3 million.

To find the average salary we must divide the total
\$2315K payroll by the number of employees. Rows
\cell{8} through
\cell{37} contain employee records so there are $37-8+1=30$
employees. But it's better to ask Excel to count the rows for 
you. Type the label \excel{ Count} in cell \cell{A39} and begin
formula \excel{=COUNT(} in cell \cell{B39}. Finish the formula by
%CHANGE electing cells  
selecting cells  
\cell{B8:B37} or by typing the addresses of those cells and closing the
parentheses. You should see 
\displayexcel{
Count \ \ 	30
}
\noindent
Finally, type the label \excel{ Average} in cell \cell{A40}  and put
formula \excel{ 
=B38/B39} in cell \cell{B40}. You should see
\displayexcel{
Average \ \ 	77.16667
}
\noindent
so the average annual salary at \smallCo{}  is about \$77,000. 
Excel rounded the exact $77.1666\ldots$ to $77.16667$
when it ran out of space. 
We rounded to two significant digits because that's all the
precision we have in almost all the data. 
In \exref{format} you'll learn how to tell Excel to round for you.

Excel's built-in functions \excel{SUM} and \excel{COUNT} are
separately useful, which is part of why we showed them to you, but if
it's the average you want Excel can find it in one step.

Enter \excel{=AVERAGE(B8:B37)} in cell \cell{B41}, click the
check icon and Excel tells you again that the average
is 77.16667. 
Put ``\excel{computed using SUM/COUNT}'' in cell \cell{C40} and 
``\excel{computed using AVERAGE function}'' in cell \cell{C41}.

\qrsection[whatif]{What if?}\index{what if}

Suppose that the CEO convinced the Board of Directors to double his salary, to 
\$598K (even though the company lost money).
To see how that would affect the payroll statistics,
go to cell \cell{B8} and change the 299 there to 598. Excel automatically
updates all your computations, increasing the total annual
payroll to \$2,614 thousand and the average annual salary by about
\$10,000 to more than \$87,000. 

We can learn several useful lessons from our work so far. 

\begin{itemize}

\item Excel is faster and more accurate at arithmetic than we are.

\item Excel is an excellent tool for answering what-if questions,
because it automatically updates its computations when the data change.

\item The average we calculated with \excel{SUM/COUNT} and then again
  with \excel{AVERAGE} is a terrible way to summarize the salary
  structure at \smallCo. The CEO's 100\% raise increases the average
  salary by about \$10,000, or 13\% --- but he's the only employee
  who's actually better off!

\end{itemize}

After you've tried out these changes, restore the original values by
clicking the \excel{Undo} icon on the toolbar.  This very useful
feature allows you to undo the last few changes you've made. Use it to
fix mistakes or to get back to where you were before you asked a
``what-if'' question.

\qrsection[usingsoftware]{Using software}

Why use a spreadsheet at all? There are several reasons.

\begin{itemize}

\item to carry out large tedious calculations rapidly and correctly.

\item  to answer ``what-if?'' questions without having to redo
arithmetic.

\item  to draw charts.

\item to learn a tool you may use long after you've finished this course.
\end{itemize}

Here are some tips for working with Excel, and with software packages
in general.

\begin{itemize}

\item  To figure out something new, you can use the application's
built in help, search the web, ask a friend or teacher, or just play
around. Which you try first depends on your personality.

\item Many applications provide several ways to do the same task. That
means you may get different advice or instructions from different
sources. Choose a way that suits your style.

\item Use the menus for things you do rarely, but learn the keyboard
shortcuts for things you do often --- in particular, \excel{control-C}
for copy and \excel{control-V} for paste can save time.

\item Learn about \excel{undo}. \textbf{Save your work often}. 

\item
When you are about to make significant changes to a spreadsheet or a
document make a copy of what you have, so that you can return to it if
you change your mind about what you should have done. In the
\smallCo{} study we created several, calling them
\verb!WingAero1.xlsx!, \verb!WingAero2.xlsx! and so on.

\item
Create backup copies of important documents often --- off your
computer. Use a thumb drive (flash drive), a CD, or a web service.

\item Some things work the same way in different applications
(browsers, Word, Excel) --- for example, selecting with the mouse, cut
and paste with keyboard shortcuts.

\item In many software applications placing the 
mouse over a feature you are interested in and right clicking
\index{right click}
often lets you view and change the properties of that
feature. ``Do the right thing'' is a good mnemonic.

\item  Applications often try to guess what you intend to do. That can
be good or bad. We'll see soon that Excel can adjust cell references
automatically --- that's usually, but not always, what you want.
Word processors may try to fix your spelling ---
perhaps correctly, perhaps not. 


\end{itemize}

Why Excel? It's the most commonly available spreadsheet, so it's
the one we use. But there are others available.
In particular, Open Office (\url{www.openoffice.org/}) offers
free spreadsheet and word processing software.\index{Open Office}.

With \myindex{Google Sheets} (\url{www.google.com/sheets/}) you can
create spreadsheets in the cloud. 
That software is powerful enough to do the arithmetic we need for 
\commonsense{} but it has far fewer chart formatting features than
full fledged programs. Excel on tablet computers lacks those features
too. Don't even think of trying to do spreadsheet work on your phone.
\begin{teacher}
Don't let your students even  \emph{think} about doing spreadsheet
work on their smart phones.
\end{teacher}

\qrsection[median]{Median}\index{median}

In \sref*{wingaero} we found that 
the average salary at \smallCo{} was \$77,000. This is a pretty good
annual salary. If you saw that in a job advertisement
you'd think it was a pretty good company to work for. Maybe, maybe not. 
In \sref*{whatif} we saw how it's skewed by the CEO's earnings.  When
his (or her) salary 
increased from \$299,000 to \$598,000, the average salary jumped by
\$10,000 to \$87,000. But no one else's salary changed!

The \$77,000 ``average'' is misleading in other ways too. Most of the
employees --- 26 out of 30 --- have salaries less than the average. That
contradicts what 
we like to think ``average'' means. To find a salary that's ``in the
middle'', sort the 30 line table again so that salaries are increasing.
Since the table starts in row \cell{8} and has 30 entries, rows \cell{22} and
\cell{23} are the middle rows and the entries in
cells \cell{B22} and \cell{B23} are the middle salaries. That means
half the employees 
make \$42,000 or less (the entry in cell \cell{B22}) and half make
\$43,000 or more. So we might want to say
that the ``average'' salary
is \$42,500. There's a name for this kind of ``average'' --- it's the
\emindex{median}. The first ``average'' we computed above is called the
\emindex{mean}.

On the spreadsheet, change  the \excel{ Average} label in cell
\cell{A41} to \excel{ Mean}.

Then put
\displayexcel{
Median \ \	42.5 \ \	computed by finding middle of sorted list
}
\noindent
into cells \cell{A45}, \cell{B45} and \cell{C45}.

In some ways the median is a fairer ``average'' than the mean for
describing the \smallCo{}  salary structure. It tells you more
about the way salaries are distributed.
In particular, the median salary isn't affected by the
CEO's big raise. Try changing that salary again in Excel: the mean
changes, as it did before, but the median stays the same. 

Excel knows how to compute medians. Enter \excel{ =MEDIAN(B8:B37)} in
cell \cell{B46} and check that you get the same value: 42.5. 
Enter ``\excel{computed using MEDIAN function}'' in \cell{C46}.

Finding the median with the \excel{ MEDIAN} function is better than
finding the middle of the sorted list yourself because it works even
when the data aren't sorted.
Suppose the supervisor making \$43K gets a raise to \$50,000. Enter that
new value as \excel{50} in the spreadsheet.
Then see that Excel has recalculated the median in cell \cell{B46}:
it's now 45.5. 

There's a third kind of average, the \emph{mode}. We'll return to that
after we've summarized the salary data in a different way.

\qrsection[barcharts]{Bar charts}


\begin{teacher}
When teaching Excel in a classroom we strongly recommend drawing bar
charts first by hand on the board, and asking students to do the same
on paper on homework and exams. That strategy helps in the next
chapter, where scatter plots are called for and students often
accidentally draw a line chart instead.
\end{teacher}


Often pictures are better than numbers when we wish to convey
information convincingly. A \emindex{bar chart} is one
such picture.

We use bar charts when we have data that fall naturally into
categories. The height of each bar represents the value for that
category.  When you want general understanding rather than numerical
detail it's easier to compare the heights of bars visually (in both 
relative and absolute terms) than the values of numbers. 

In order to understand the \smallCo{} income distribution better
we will use Excel to draw two bar charts with four columns each, one
showing the number of employees in each of the four job categories executive,
manager, supervisor and worker, the other the total earnings in
each category.  We will use the original Wing Aero salary data from
\sref*{wingaero}. 

\begin{table}
\centering
\begin{tabular}{lS[table-format=1.0]S[table-format=4.0]}
\toprule
Job & {Number} & {Total salary (\$K)} \\
\midrule
Executive & 4 & 1089\\
Manager & 8 & 633 \\
Supervisor & 9 & 364 \\
Worker & 9 & 229 \\
\bottomrule
\end{tabular}
\caption{\smallCo{} salary distribution by job category}
\tablesource{Made up numbers.}
\label{table:wingaerobycategory}
\end{table}

You can find the data from Table~\ref{table:wingaerobycategory} in
the range \cell{D7:F11} in the copy of the \smallCo{} salary distribution
spreadsheet at \link{WingAeroBarCharts.xlsx}. Excel calculated the
values in columns \cell{E} and \cell{F} using the \excel{ COUNT} and
\excel{ SUM} functions. 

Figure~\ref{fig:showformulas} shows those calculations. We created it
by selecting \excel{Show Formulas} from the \excel{Formulas} tab.

\figfile{showformulas.png}
\begin{figure}
\centering
\includegraphics[width=4.5in]{\thefigurefilename}
\caption{Showing the formulas used in a spreadsheet}
\label{fig:showformulas}
\end{figure}
\figfile{}

Figure~\ref{fig:barCharts} shows the first two
pictures from that spreadsheet.

\figfile{WingAeroBarCharts1cropped.pdf}
\begin{figure}
\centering
\includegraphics[width=5in]{\thefigurefilename}
\caption{\smallCo{} employee information by category}
\figsource{Image from an Excel spreadsheet we created.}
\label{fig:barCharts}
\end{figure}
\figfile{}

These side by side bar charts dramatically demonstrate that
the executives make up a small part of the workforce but enjoy a large
part of the salary expenses!

Delete the charts from your copy of the spreadsheet, so that you can
learn how to build them for yourself. For the first one, 
select the data in columns \cell{D} and \cell{E},
rows \cell{8} through \cell{12} --- the range \cell{D8:E12}. Then ask
for a new chart by selecting \excel{Column} and \excel{2D Column} from
the \excel{Insert} tab, as in
Figure~\ref{fig:insertchart}.

\figfile{insertchart.png}
\begin{figure}
\centering
\includegraphics[width=4.5in]{\thefigurefilename}
\caption{Inserting a chart in a spreadsheet}
\figsource{RESIZE Image from an Excel spreadsheet we created.}
\label{fig:insertchart}
\end{figure}
\figfile{}

Use the mouse to move the chart so that it does not hide any of
the data. Then add the appropriate labels and change the colors so
that the result is suitable for black and white printing. There are
several ways to go about these tasks; experiment until you find ones
that work for you. \sref*{usingsoftware} has a tip about how the right mouse 
button can help with these tasks.


To build the second picture the same way you must select the
data in columns \cell{D} and \cell{F} without selecting column
\cell{E}. To do that, select 
rows \cell{8:12} in column \cell{D}. Then hold down the control
(PC) or apple or command (Mac) key and use the mouse to select those rows in
column \cell{F}.

It would be even better to combine the two pictures.
We can do that in Excel
by building a column chart for the full range \cell{D8:F12}.
The result (after adding titles and fixing colors) is the chart on 
the left in Figure~\ref{doubleHistograms}.

\figfile{WingAeroBarCharts2cropped.pdf}
\begin{figure}
\centering
\includegraphics[width=5in]{\thefigurefilename}
\caption{\smallCo: side by side bar charts}
\figsource{Image from an Excel spreadsheet we created.}
\label{doubleHistograms}
\end{figure}
\figfile{}

The problem with that picture is that the bars for the employee
numbers are nearly invisible, because data values for the 
numbers vary from 4 to 9 while those for the
total salaries vary from  \$200K to \$1200K. 
We can fix that by reporting the salary totals in hundreds of
thousands of dollars rather than just in thousands of dollars (that
is, by changing the units for measuring salary). Column
\cell{G} contains those numbers; we used it to draw the second picture.
There you see clearly the opposing trends in the categories: total
wages decrease as the number of employees increases.    
\begin{teacher}
Excel can use separate vertical scales to plot two data
series. We think it's more useful and more interesting to teach this
\emph{ad hoc} solution.
\end{teacher}

\qrsection[piecharts]{Pie charts}\index{pie chart}

Excel allows you to change the chart type on the fly. Select the 
column graph showing the total salary by job category. Right click on
that chart. Select \excel{Change Chart Type \ldots} and then the first
\excel{Pie}. Adjust labels and colors to create the first of the
charts in Figure~\ref{pieCharts}. 

\figfile{WingAeroPieChartsCropped.pdf}
\begin{figure}
\centering
\includegraphics[width=5in]{\thefigurefilename}
\caption{\smallCo{} salary distribution pie charts}
\label{pieCharts}
\figsource{Charts from an Excel spreadsheet we wrote.}
\end{figure}
\figfile{}

The pie chart shows clearly in yet another way that the executives are
the winners at \smallCo. They take home nearly half the salary total.
If you wanted to make that look a little less dramatic, you could omit
the percentages and ask Excel to show a three dimensional version of
the chart, as in the second picture. There we've rotated the picture
so that the executive wedge is at the back, so it looks even
smaller in perspective. 

\qrsection[histograms]{Histograms}\index{histogram}

\smallCo{} is small enough so that we can see the whole salary table
at a glance. But if there were 1000 employees that wouldn't be
possible. To understand the numbers we'd have to summarize them.  The
four categories in the previous section might not provide enough
detail.

Another useful way to summarize the data is to divide the salaries
into ranges and count the number of employees whose salary falls in
each range. Then we can use the ranges as the categories in a bar
chart.  In this example we'll use \$20K ranges. That means we think of
two employees who make \$29K and \$33K as having approximately the
same salary, since each falls in the \$20K-\$39K category.

We need to count the number of employees making less
than \$20K, then the number making between \$20K and \$39K, and so
on. That's easy when the data are sorted in increasing order.
Table~\ref{table:wingaeroranges} shows what we found.

\begin{table}
\centering
\begin{tabular}{cS[table-format=2.0]}
\toprule
Salary range (\$K) & {Number of employees} \\
\midrule
0-19 & 2 \\
20-39 & 10 \\
40-59 & 7 \\
60-79 & 3 \\
80-99 & 3 \\
100-119 & 0 \\
120-139 & 1 \\
140-159 & 0 \\
160-179 & 0 \\
180-199 & 0 \\
200-219 & 0 \\
220-239 & 0 \\
240-259 & 2 \\
260-279 & 0 \\
280-299 & 2 \\
\bottomrule
\end{tabular}
%CHANGE \smallco -> Wing Aero
\caption{Wing Aero salary distribution by salary range}
\tablesource{Made up numbers.}
\label{table:wingaeroranges}
\end{table}

To save you typing, we've listed the categories 
in cells \cell{D15:D29} in \mbox{\link{WingAero.xlsx}.}
You should check our work and enter the data in column
\cell{E}.

To see that you haven't missed anyone, \excel{ SUM} the range
\cell{E15:E29} to make sure the answer is 30, the known number of employees.
(The sum isn't a perfect check. Although the total is correct,
we might have put some employees into the wrong categories.)

We can use the data to draw a \emindex{histogram} --- a bar chart 
where the
categories on the $x$-axis specify data ranges
%CHANGE \myindex{data range} to \index
\index{data range} and the $y$-axis counts or
percentages for each range.
You  can see the resulting histogram in Figure~\ref{fig:wahistogram}.

Start as usual by building a column chart from the data in
cells \excel{D14:E29}.
In a histogram it's conventional to make adjacent bars touch. To do
that in Excel, right click on one of the columns and select
\excel{Format Data Series \ldots} from the menu. Then adjust the
\excel{Gap Width} using the slider shown in Figure~\ref{fig:FormatHistogram}.
While you're there, figure out how to change the colors of the bars,
and fix the labels. 
You can see the full spreadsheet with all the computations we've done
so far at \link{WingAeroHistogram.xlsx}.

\figfile{FormatHistogram.png}
\begin{figure}
\centering
\includegraphics[width=4.5in]{\thefigurefilename}
\caption{Formatting a histogram}
\label{fig:FormatHistogram}
\figsource{RESIZE Excel screen capture.}
\end{figure}
\figfile{}

\figfile{WingAeroHistogramCropped.pdf}
\begin{figure}
\centering
\includegraphics[width=3in]{\thefigurefilename}
\caption{\smallCo{} salary histogram}
\label{fig:wahistogram}
\figsource{Chart from an Excel spreadsheet we wrote.}
\end{figure}
\figfile{}

It's worth taking some time to study this histogram, which shows how
the data are distributed.\index{distribution} 
Most of the salaries are less than
\$100K and there's a large gap in salaries between \$139K and the
executives who make more than \$200K. Although this information is all
in the table, the histogram makes it visible and dramatic.

We chose \$20K for the size of the salary ranges so that we would have
enough categories to show what was going on but few enough to make the
graph understandable.
In \exref{salaryrange} you can explore what happens with different
choices.

One of the themes of this chapter has been to use Excel for tedious
repetitive calculations. So you might wonder whether Excel
could build Table~\ref{table:wingaeroranges}
for the grouped data if we told it the ranges
we were interested in. Then the numbers would automatically update
when we asked ``what-if'' questions. The good news is that
Excel can do this job. The bad news is that its histogram building
tools are rather clumsy.
We will content ourselves with doing the counting by hand. 
If you're ambitious, try Excel help or search
the internet  for ``excel histogram'' to find out how to group data
for an Excel histogram.

\qrsection[mmm]{Mean, median, mode}

There's a third kind of ``average'' --- the mode --- that's sometimes
informative. The \emindex{mode} is the most common value. The
histogram from the previous section shows 
that there are more employees with salaries in the \$20K-\$40K range than
any other. So the mode is about \$30K. It's the category with the highest
bar.

The mode is most useful for data aggregated into ranges, as in a
histogram. In the raw \smallCo{} data there is no well-defined
mode. Each of the values \$250K, \$49K, \$42K and \$25K appears
twice. In Excel the function \excel{ MODE(B8:B37)} reports \$250K when
column B is sorted in descending order and \$25K when the column is
sorted the other way.

Each of ``mean,'' ``median'' and ``mode'' can legitimately be
called an ``average.'' That ambiguity makes it easy to lie with
statistics without actually lying. The CEO at \smallCo{} may brag that
workers at his company earn an average \$77K per year, while the union
argues that the average salary is \$30K per year.

A cynic would advise you to use
the ``average'' that tells the story your way
and hope your listener won't know the difference.

When distributions are symmetric, the mean, median and mode are in the
same place. The \smallCo{}  salary distribution isn't symmetric, it's
skewed\index{skew} to the right. That's the fancy way to say
that the bulk 
of the data cluster toward the left of the histogram with a long tail
off to the right. For data that's right skewed, as this is:

\begin{center}
\begin{tabular}{ccccc}
mode & $<$ & median & $<$ & mean \\
30   & $<$ & 42.5   & $<$ &  77.1
\end{tabular} .
\end{center}

\figfile{MeanMedianMode.png}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\caption{Symmetric and skewed distributions}
\figsource{RESIZE? Hand drawn figure.}
\figcomment{You can redraw this, or we will. But it should look like
something a student could in fact do --- correct, but not professional.}
\end{figure}
\figfile{}

In a histogram the mode is the peak, the median splits the area in
half and the mean is where the graph would balance if it were a
cardboard cutout.%
\begin{teacher}%
The hand drawn figure showing when ( mode $<$ median $<$ mean ) may seem
unprofessional, but in fact we think it's useful. It's more like what
a student could produce than a fancy graphic would be. We should
probably have more pictures like this. 
\end{teacher}

If you're learning about these several kinds of averages for the first
time you may want mnemonics to help you remember which is which. The
``med'' in median suggests correctly that it's in the middle. You can
remember mode because there are ``mo' of them than anything else.''%

\qrsection[averagesfromhistograms]{Computing averages from histograms}
\begin{teacher}
We find this a particularly valuable section --- it forces students to
come to grips with the real meaning of each kind of average. Just
memorizing definitions suffices for short lists of numbers, 
but for grouped data the median is a little subtle and the mean is a
weighted average. Real understanding is required.

We recommend thinking about mode, median and mean in that order.

In the fall of 2015 just as \commonsense{} was going to press one of
us was teaching from the text in a classroom with no computers. The
cart with the machines we needed wasn't ready yet just as we were
about to start this chapter on Excel. Necessity mothering invention
led us to introduce the calculation of mode, median and mean from a
histogram at the blackboard rather than with a spreadsheet. We worked
the exercises on teen texting and the men's Boston Marathon finishing
times. Then when we treated the same material using Excel the students
didn't have to learn the ideas along with the software, and knew what
the answers should look like. It was too late to rewrite this
chapter. Do think about rearranging your syllabus.
\end{teacher}

Often all we know about data is a summary like that
presented in a histogram. We'll see here
how to estimate the mode, median and mean from that information.
We will use the \smallCo{} histogram in Figure~\ref{fig:wahistogram} as
an example. Since we know the correct averages we can see how good our
estimates are.%

We've already found the mode. It's the highest bar in the histogram --- the
range with the largest entry: \$20,000 --- \$39,000. Note carefully ---
the mode is the range, not the height of the bar. We could report that
range, or report the mode as the middle of the range: about \$30,000.
The raw data do not have a mode that makes sense, so there's nothing
to compare this estimate to.

To estimate the median salary from the histogram we need to find the
salary such that half the employees make less and half more. The first
bar tells us that two make less than \$19K.
Adding the number of employees in the first two ranges tells us that 
$2+10=12$ make less than \$39K. Looking at the next range we see that
$2+10+7 = 12 + 7 = 19$ make less than \$59K. Since there are 30
employees, it's the 15th and 16th whose salaries are
closest to the median. They are clearly pretty much in the middle of  
the third category, so we can estimate the median as the midpoint of
that category, say \$50,000. The correct value (from the raw data) is
\$42,500. 

Since there are just eight ranges with data and the  median occurred
in the third one, we didn't need Excel 
to do the arithmetic. In a more complicated example things might
not be so easy, so let's see how to make Excel do the work. Label
column \cell{H} in cell \cell{H15} as ``\excel{count so far}''. Then
copy the value from \cell{E15} to \cell{H15} by typing \excel{=E15}
in H15. Then enter 
\excel{=H15+E16} in \cell{H16}, to add the number of employees counted
so far (in \cell{H15}) to the number in this range (in {\cell{E16}).

Now select that formula and copy it to
\cell{H17:H29}. 
A \myindex{miracle} has happened!
Cell \cell{H29} contains the value 30. If you click on
that cell you can see that it's the result of the formula \excel{
=H28+E29}. Excel read your mind, and automatically updated the row
references to cells in columns \cell{H} and \cell{E}
with each copy down the column. It knew you wanted to add the
value in the cell above to the value in the cell three over to the left.
\begin{teacher}
You will need to take time here to explain both the miracle --- Excel
guessed that updating references was what was wanted --- and its value
in saving time and typing. It takes getting used to.
\end{teacher}
The last entry should be 30 since all the employees make less than \$299K.
Now it's easy to see that the count so far passes the midpoint of 15
in the middle of the third range.

Estimating the mean salary is the hardest.
Since the only \smallCo{} data we have is what we
used to draw the histogram we can't expect to find it
exactly. All we can say about the 10 employees in the \$20K\ ---\ \$39K
range is that they earn somewhere in the neighborhood of \$30K. So our
best guess is to assume they all earn exactly that. If we make a
similar assumption for each of the other ranges then our estimate for
the mean is the \myindex{weighted average}
\begin{equation*}
\frac{ 2 \times \$10K + 10 \times \$30K + \cdots 
	+ 2 \times \$290K }
	{\mbox{total number of employees}} .
\end{equation*}

That's too much arithmetic to do by hand so we'll use Excel. 

Put the label \excel{mid range} in cell \cell{F13}. We want to use cells
%CHANGE  10, 30, \dots, 290x that are the middles
\cell{F14:F28} to hold the values 10, 30, \dots, 290 that are the middles
of the ranges in cells \cell{D14:D28}. There's a quick trick for that. Enter
the 10 in cell \cell{F14} and enter the
formula
\displayexcel{
	=F14+20
}
\noindent
in cell \cell{F15} .
Excel will display \excel{ 30} there. That's because it reads the formula
as 
\begin{center}
%CHANGE comma to period add 20 to the contents of cell \cell{F14},
add 20 to the contents of cell \cell{F14}.
\end{center}

Now we want to add 20 each time you move down a row. To do
that, copy the formula in \cell{F15} and paste it into cells
\cell{F16:F28}. (This takes advantage again of Excel's correct guess
about what we are trying to do.)

Next label column \cell{G} by typing \excel{range total} in cell
\cell{G13}. Then
put the formula
\displayexcel{
	=E14*F14
}
in cell \cell{G14}. That asks Excel to multiply the numbers in cells
\cell{E14} and \cell{F14}. You should see 20. That's the first number
to add in the weighted average computation we're working on.

When you copy that formula to cells \cell{G15:G28} you should see 580 at
the end of the list. That's the miracle yet again.

To compute the weighted average you must sum the values in column
\cell{G}. Since cell \cell{E29} contains the sum of the values in
column \cell{E}, just copy the formula from that cell to cell
\cell{G29}. Excel will 
automatically change the column reference, turning the formula
\excel{ =SUM(E14:E28)} into \excel{ =SUM(G14:G28)}. The sum is 2360. To find
the mean, enter the formula \excel{=G20/E29} in cell \cell{G30}. Excel
shows you 78.66667. Label that value as the mean. 

Our estimate of the mean salary from the histogram is \$79,000.
We shouldn't report more precision than that,
since we made many approximations along the way. Do note that the
estimate is not very far from the true mean of \$77,167 computed from
the raw data.  

\qrsection[percentiles]{Percentiles}


On June 7, 2011 we visited the website 
\url{www.infantchart.com/}. 
There we collected the data displayed in 
Table~\ref{table:babyweights}. Reading down the first column, you can
see that a one year old male baby weighing
21.5 pounds would be in the 30.9th \emindex{percentile}. That means
that he would weigh more than 30.9 percent of the babies, less than
69.1 percent. A baby weighing 
22.5 pounds would be in the 46.5th percentile, so
$46.5\% - 30.9\% = 15.6\%$ of the year old male babies weigh between
21.5 and 22.5 pounds. 

\begin{table}
\centering
\begin{tabular}{SSS}
\toprule
{Weight (pounds)} & {Percentile} & {Difference} \\
\midrule
15.5  & 0.1 & 0.1\\
16.5  & 0.2 & 0.1\\
17.5  & 0.8 & 0.6\\
18.5  & 3.0 & 2.2\\
19.5  & 8.2 & 5.2\\
20.5  & 17.5 & 9.3\\
21.5  & 30.9 & 13.4\\
22.5  & 46.5 & 15.6\\
23.5  & 61.8 & 15.3\\
24.5  & 74.9 & 13.1\\
25.5  & 84.8 & 9.9\\
26.5  & 91.4 & 6.6\\
27.5  & 95.4 & 4.0\\
28.5  & 97.7 & 2.3\\
29.5  & 98.9 & 1.2\\
30.5  & 99.5 & 0.6\\
31.5  & 99.8 & 0.3\\
32.5  & 99.9 & 0.1\\
33.5  & 100.0 & 0.1\\
\bottomrule
\end{tabular}
\begin{csmr}[Year old male baby weights\label{table:babyweights}]
Baby Infant Growth Chart Calculator,
\url{www.infantchart.com/}
\access{August 3, 2015}.
\csmrcomment{I scraped the data from the website, not the graphic}
\end{csmr}
\end{table}

To find the mode, we should look for the largest \emph{difference} in
percentiles. That turns out to be the 15.6\% we just found, so the mode
is about 22 pounds.

With guess-and-check on the website we found that a weight of 22.75
pounds was as close as we could get to the 50th percentile, so that's
the median. Some other commonly used percentiles are the 

\begin{itemize*}
\item tenth: any baby weighing less than about 19.2 pounds is in the
tenth percentile,

\item ninetieth: about 90\% of the babies weigh less than about 26.5
pounds,

 \item the \myindex{quartiles} --- the 25th, 50th and 75th percentiles. The
50th percentile is, of course, the median.
\end{itemize*}
Using the techniques from \sref*{averagesfromhistograms} we computed
the mean as a weighted average. It is about 22.9 pounds. 

Therefore
\begin{center}
\begin{tabular}{ccccc}
mode & $<$ & median & $<$ & mean \\
22   & $<$ & 22.75   & $<$ &  22.9
\end{tabular} .
\end{center}
That suggests that the distribution  is 
just little bit skewed to the right.

\qrsection[bellcurve]{The bell curve}

The baby weight data in Table~\ref{table:babyweights} are just what we
need to construct the
histogram in Figure~\ref{fig:babyweights}.

There you can see the mode (the highest bar) and you can see the
little bit of right skew.

\figfile{babyweightscropped.pdf}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\caption{Baby weights}
\label{fig:babyweights}
\end{figure}
\figfile{}

Many factors contribute to a baby's weight at one year: birth weight,
heredity, nutrition, \ldots. Whenever many small effects combine to
give a total the distribution of values forms a \emindex{bell curve}.
The one shown in the figure
is a mathematically
correct bell curve that approximates the real data. The mathematically
correct name for the bell curve is \emindex{normal distribution}.

A normal distribution is always symmetrical. Its mean, median and mode
are in the same place. To construct the one in the figure we need one more
number besides the mean ---
a measure of how fast the curve spreads out. That measure is called the 
\emindex{standard deviation}. It's usually written with the Greek
letter sigma: $\sigma$; the mean is usually written with the Greek
letter mu: $\mu$. For the baby weight data the standard
deviation is about 2.6 pounds.

There's a nice way to use the standard deviation to describe how
a bell curve spreads out.

\begin{itemize*}
\item 2/3 of the values are less than one standard deviation away from
the mean,
\item 95\% of the values are less than two standard deviations away,
\item 99.7\% are less than three standard deviations away.
\end{itemize*}

Figure~\ref{fig:normalcurve}
illustrates these percentages.  Table~\ref{table:normalcurve}
summarizes them in terms of percentiles.

\begin{table}
\centering
\begin{tabular}{cS[table-format=2.1]}
\toprule
value & {percentile} \\
\midrule
$\mu - 3\sigma$  & 0.1\\
$\mu - 2\sigma$  & 2.3\\
$\mu - \sigma$  & 15.9 \\
$\mu$  & 50.0 \\
$\mu + \sigma$  & 84.1 \\
$\mu + 2\sigma$  & 97.7\\
$\mu + 3\sigma$  & 99.9\\
\bottomrule
\end{tabular}
\caption{Percentiles for the normal curve, mean $\mu$, standard
deviation $\sigma$}
\tablesource{Public information}
\label{table:normalcurve}
\end{table}

\figfile{normalcurve.png}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\begin{csmr}[How the normal distribution spreads out\label{fig:normalcurve}]
\url{en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg}
\access{August 3, 2015}.
Licensed under the Creative Commons Attribution 2.5 Generic License.
\csmrcomment{creative commons license}
\end{csmr}
\end{figure}
\figfile{}

Since the baby body weight distribution is very close to normal, with
mean $\mu$ about 22.9 pounds and 
standard deviation $\sigma$ about 2.6 pounds, we know about 2/3 of
the babies 
weigh between $\mu - \sigma = 22.9 - 2.6 = 20.3$ and $\mu + \sigma =
22.9 + 2.6 = 25.5 $ 
pounds. Approximately 95\% weigh between 17.7 and 28.1 pounds.
2.5\% weigh more than 28.1 pounds and 2.5\% weigh less than 17.1 pounds.

Figure~\ref{fig:threenormals} shows three bell curves with the same
mean $\mu = 22.9$, but three different standard deviations, $\sigma = 1.3,
2.6$ and $5.2$. The middle one matches the baby weight data.

\figfile{threenormalscropped.pdf}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\caption{Three bell curves}
\figsource{Chart from an Excel spreadsheet we wrote.}
\label{fig:threenormals}
\end{figure}
\figfile{}

The mathematics needed to calculate standard deviations and draw bell
curves is more than we will present here. You will learn about it if
you take an introductory course in statistics. All you should remember
now is the rough relationship between standard deviation and spread we
sketched above and that the official name for the bell curve is
\emindex{normal distribution}. If you want to see how we asked Excel
to calculate the standard 
deviation for the baby weights and draw
Figures~\ref{fig:babyweights} and \ref{fig:threenormals}, you can look
at the spreadsheet \link{babyweights.xlsx}.

%www.cdc.gov/growthcharts/data/set1/chart02.pdf
%www.exceluser.com/explore/normalcurve.htm
%www.vertex42.com/ExcelArticles/mc/NormalDistribution-Excel.html

\qrsection[marginoferror]{Margin of error}

The Pew Research Center conducted a study in July of 2012 that
asked about support for President Obama's tax position. Their report
said (in part)

\begin{quotation}
By two-to-one (44\% to 22\%), the public says that raising
  taxes on incomes above \$250,000 would help the economy rather than
  hurt it, while 24\% say this would not  make a difference.

[The poll reached 1,015 adults and has a margin
of sampling error of plus or minus 3.6 percentage points.]%
\begin{csmr}
Raising Taxes on Rich Seen as Good for Economy, Fairness,
Pew Research Center (July 16, 2012),
\url{www.people-press.org/2012/07/16/raising-taxes-on-rich-seen-as-good-for-economy-fairness}
\access{July 28, 2015}.
Quoted with permission.
\csmrcomment{Quoted with permission.}
\end{csmr}
\end{quotation}

The first paragraph quoted above reports the results of a survey. The
second tells you something about how reliable those results are. It's
clear that the smaller the \myindex{margin of error} the
more you can trust the results. Understanding the margin of
error quantitatively --- seeing what the number actually means --- is
much more complicated. A statistics course would cover that carefully; we
can't here. Since the term occurs so frequently, it's worth learning the
beginning of the story. Even that is a little hard to understand, so
pay close attention.
\begin{teacher}
This is one of those places where it's hard to find the right compromise
between appropriate simplifications and actual untruths about what the
concepts mean and the numbers say. It might well take a whole class
period to expand on the discussion here. You have to decide whether
that's how you want to spend class time.
\end{teacher}

The survey was conducted in order to discover the number of people who
thought the tax increase would benefit the economy. If everyone in the
country offered their opinion then we would know that number exactly.
If we gave the survey to just three or four people we could hardly
conclude anything. The people at the Pew Research Center decided to
survey the opinions of a \emph{sample} of the population --- 1,015
people chosen at random. Of the particular people surveyed, 
$0.44 \times 1,015 = 447$ people thought the tax
increase would benefit the economy.  If they'd surveyed a
different group of 1,015 people, they would \myindex{probably} see
a different number, so a different percent.

The 3.6 percentage point \index{percentage points} margin of error
says that if they carried out the survey many times with different
samples of 1,015 people, 95\% of those surveys would report an answer
that was within 3.6 percentage points of the true value.

There's no way to know whether this particular sample is one of the
95\%, or one of the others. About five of every 100 surveys you see in
the news are likely to be bad ones where the margin
of error surrounding the reported answer doesn't include the true
value.  Survey designers can reduce the margin of error by asking more
people (increasing the sample size). But there will always be five
surveys in every hundred where the true value differs from the survey
value by more than the margin of error.

The report doesn't explicitly mention 95\%. That's just built into
the mathematical formula that computes the margin of error from the
sample size. Even that conclusion may be too optimistic. The margin of
error computation only works if the sample is chosen in a fair way, so that
everyone is equally likely to be included. If they asked 1,015 people
at random from an area where most people were Democrats (or 
Republicans) or rich (or poor) the result would be even less
reliable. The report describes the efforts taken to get a
representative sample.
%
%From my 2009 paper with Weakliem:
%
%http://andrewgelman.com/2017/10/28/favorite-definition-statistical-significance/
%
%    Throughout, we use the term statistically significant in the
%    conventional way, to mean that an estimate is at least two
%    standard errors away from some ``null hypothesis'' or prespecified
%    value that would indicate no effect present. An estimate is
%    statistically insignificant if the observed value could reasonably
%    be explained by simple chance variation, much in the way that a
%    sequence of 20 coin tosses might happen to come up 8 heads and 12
%    tails; we would say that this result is not statistically
%    significantly different from chance. More precisely, the observed
%    proportion of heads is 40 percent but with a standard error of 11
%    percent --- thus, the data are less than two standard errors away from
%    the null hypothesis of 50 percent, and the outcome could clearly
%    have occurred by chance. Standard error is a measure of the
%    variation in an estimate and gets smaller as a sample size gets
%    larger, converging on zero as the sample increases in size. 
%
%I like that. I like that we get right into statistical significance,
%we don't waste any time with p-values, we give a clean coin-flipping
%example, and we directly tie it into standard error and sample size. 

\qrsection[bimodal]{Bimodal data}

The black data points in Figure~\ref{fig:lymphomadata.png} show how
the rate of diagnosis (in cases per 100,000 people)
of \myindex{Hodgkin lymphoma} (a kind of cancer)
for white females depends on the age of
the woman diagnosed.
There are two peaks, one at about 20
years, the other at 75 years. There is no single value that can
legitimately be called the mode --- this is a \emindex{bimodal
distribution}. The mean and the median would each be about 45 years,
but they make no sense at all. What is \myindex{probably} happening is
that the disease has two very different causes, one of which occurs
more often in young people, the other in old people.

\figfile{cancer-whitefemale.pdf}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\begin{csmr}[Hodgkin lymphoma incidence (white females)\label{fig:lymphomadata.png}]
Data source: Surveillance, Epidemiology, and End Results (SEER) Program
(\url{www.seer.cancer.gov}) SEER*Stat Database: Incidence - SEER 9 Regs
Limited-Use, Nov 2008 Sub (1973-2006) , National Cancer Institute,
DCCPS, Surveillance Research Program, Cancer Statistics Branch,
released April 2009, based on the November 2008 submission.  
Graphic drawn by Ben Bolker.
\csmrcomment{citation as required}
\end{csmr}
\figsource{public data}
\end{figure}
\figfile{}

We can understand this distribution as a combination of two normal
distributions. The left bell curve for the early onset of Hodgkin
lymphoma has a mean of about 24 years with a standard deviation of 11
years. The right bell curve for late onset has a mean of about 79 years
with a standard deviation of 19 years --- it spreads out more
slowly. The smooth curve is the sum of the two normal distributions. It
matches the data very well.

\exstart

Excel is just a tool. It doesn't answer questions, it provides
numbers and pictures you use to answer questions. So keep on
writing complete sentences explaining the meaning and context of the
numbers you report. The numbers in cells in a spreadsheet are no more
use by themselves than the numbers in a calculator display.

If an exercise asks for hard copy of a spreadsheet, format it well.
Use \excel{Print Preview} to make sure that charts fit on one
page and don't cover important numbers. Label the data columns,
cells containing important computations, axes and legends in charts.
Numbers are useless when they can't be understood. 
 
\begin{exx}{\hassolution\routine\sref{wingaero}}
\myindex{CXO}.

What do ``CEO'', ``CTO'', ``CIO'' and
``CFO'' stand for in the \smallCo{} salary table in \sref*{wingaero}?

\begin{sol}
In CXO the C and the O stand for Chief and Officer. The middle letter
stands for Executive, Technology, Information (or sometimes
Investment) and Financial. Sometimes you see COO (Operating).

``CIO'' also stands for ``Congress of Industrial Organizations,'' a 
federation of labor unions, founded in 1935.

\end{sol}

\end{exx}

\begin{exx}{\hassolution\gref{excelaverages}\gref{excelwhatif}\sref{whatif}}
What if\ldots ?

Open up the original Wing Aero spreadsheet and use Excel to calculate
the mean and median, as we did in \sref*{wingaero} and
\sref*{median}.

\begin{abcd}
\item Suppose all the managers get a \$10K raise.  Change their
  salaries and see how Excel updates the mean and median calculation.   

\item  Go back to the original spreadsheet (use the \excel{Undo} button if you
  can).  Experiment with changing salaries (of any of the workers ---
  your choice) so that the mean and the median increase.   

\item Reset back to the original spreadsheet.  How would you  change
  salaries so that the mean decreases but the median stays  the same?

\item  Reset back to the original spreadsheet again.  How would you
  change salaries so that the mean stays the same but the median
  decreases?  

\end{abcd}

\begin{sol}
\begin{abcd}

\item Suppose all the managers get a \$10K raise.  Change their
  salaries and see how Excel updates the mean and median calculation.   

The mean increased to about \$79.8K. The median stayed the same, since
all the people who got this raise were already in the top half of the list.

\item  Go back to the original spreadsheet (use the \excel{Undo} button if you
  can).  Experiment with changing salaries (of any of the workers ---
  your choice) so that the mean and the median increase.   

One way to do this is to increase the salary of any of the workers
making less than the median to a value greater than the median. For
example, give the one making \$25K a raise to \$75K. Now
the mean is \$78.8K and the median is \$46K

\item Reset back to the original spreadsheet.  How would you  change
  salaries so that the mean decreases but the median stays  the same?

Easy. Just cut the CEO salary in half.

\item  Reset back to the original spreadsheet again.  How would you
  change salaries so that the mean stays about the same but the median
  decreases? 

To decrease the median I need to move someone from below \$42.5K to
above. I can do that with the worker making \$19K --- raise his salary
by \$50K to \$69K. If I then take \$50K away from the CEO the mean
will be the same.
\end{abcd}

\end{sol}
\end{exx}

\begin{exx}[format]{\hassolution\routine\sref{usingsoftware}
\gref{excelroutine}} 
Formatting in Excel.

Format the cells containing averages in the \smallCo{} spreadsheet so
that the numbers displayed for the various averages are 
rounded to the nearest thousand dollars (no decimal places).

\begin{sol}
There are several ways to find the menu that allows you to set the
precision Excel uses for numbers. One is to right click on the
cell with the number (or select a range of cells and then right
click). Choose \excel{Format~Cells~\ldots} from the menu, then
\excel{Number} from the \excel{Category:} menu on the left. Then enter
\excel{0} for the number of decimal places.

\end{sol}
\end{exx}

\begin{exx}{\hassolution\sref{median}\gref{excelaverages}\gref{excelroutine}
\gref{excelchart}}
Practice finding the median.

Open up the original \smallCo{} spreadsheet from \link{WingAero.xlsx}.

\begin{abcd}
\item  Use Excel to find the  median salaries for each category of employees 
(workers, managers, etc.).  

\item  Show the data in a properly labeled bar chart.

\item  Do you think the median is a representative ``average'' for each
  category?  Explain. 
\end{abcd}

\begin{sol}
Here's a screen shot showing my work, from the spreadsheet
\slink{WingAeroMedianSolution.xlsx}. 
%
\figfile{WingAeroMedianSolution.png}
\begin{center}
\includegraphics[width=4in]{\thefigurefilename}
\end{center}
\figfile{}

The median is a reasonable representation of the ``average'' salary in
each category.

\end{sol}
\end{exx}

\begin{exx}{\hassolution\sref{barcharts}\gref{excelchart}\gref{excelaverages}}
Averaging averages.

\begin{abcd}
\item Find the average (mean) salary at \smallCo{} for each of the
  four categories of employees (use the original data set, from
  \sref*{wingaero}).  

\item Show the data in a properly labeled bar chart. 

\item Compute the weighted average of these averages to check that you get
the correct mean for the whole payroll.
\end{abcd}

\begin{hint}
You may draw your bar chart neatly by hand, or use Excel.
\end{hint}

\begin{sol}
\begin{abcd}
\item Find the average (mean) salary at 
at \smallCo{} for each of the four categories of employees. 

The following table shows the average salaries by category. I built
the spreadsheet at
\slink{WingAeroAverageByCategorySolution.xlsx} 
to find  these numbers. 

\begin{center}
\begin{tabular}{|c|r|r|r|}
\hline
Job & Number & Total salary (\$K) & Average salary (\$K) \\
\hline
Executive & 4 & 1089 & 272.25\\
Manager & 8 & 633 & 79.13\\
Supervisor & 9 & 364 & 40.44 \\
Worker & 9 & 229 & 25.44 \\
\hline
\end{tabular}
\end{center}

\item Show the data in a properly labeled bar chart. 

See Figure~\ref{fig:WAaverages}.


\figfile{WingAeroAverageByCategorySolutioncropped.pdf} 
\begin{figure}
\centering
\includegraphics[width=3in]{\thefigurefilename}
\caption{WingAero average salaries, by category}
\label{fig:WAaverages}
\figsource{Excel spreadsheet}
\end{figure}
\figfile{}

\item Compute the weighted average of these averages to check that you get
the correct mean for the whole payroll.

The weighted average is
\begin{equation*}
\frac{ 
4 \times 272.25 +
8 \times 79.13 +
9 \times 40.44 +
9 \times 25.44 
}
{30}
= 77.61.
\end{equation*}
That's the same value as the ordinary average of the original data.

\end{abcd}

\end{sol}

\end{exx}

\begin{exx}{\hassolution\sref{barcharts}\gref{excelaverages}
\gref{excelchart}}
Cash-strapped T proposes 23 percent fare increase.
\index{MBTA}

On March 29, 2012 the Massachusetts Bay Transportation Authority (MBTA)
provided the fare data in
Table~\ref{table:mbtaFareIncreases}.
Riders can pay with a stored-value \myindex{Charlie Card} or with a
Charlie Ticket bought on the spot.

\begin{table}
\centering
\begin{tabular}{lcc}
\toprule
Fare Category & Current & Proposed \\
\midrule
\multicolumn{3}{c}{Charlie Card}\\
\midrule
Bus & \$1.25 & \$1.50 \\
Subway & \$1.70 & \$2.00 \\
Senior Bus & \$0.40 & \$0.75 \\
Senior Subway & \$0.60 & \$1.00 \\
Student Bus & \$0.60 & \$0.75 \\
Student Subway & \$0.85 & \$1.00 \\
\midrule
\multicolumn{3}{c}{Charlie Ticket}\\
\midrule
Bus & \$1.50 & \$2.00 \\
Subway & \$2.00 & \$2.50 \\
\bottomrule
\end{tabular}
\begin{csmr}[MBTA fare increases\label{table:mbtaFareIncreases}]
E. Moskowitz,
Cash-strapped T proposes 23 percent fare increase,
\theGlobe{} (March 28, 2012),
\url{bostonglobe.com/metro/2012/03/28/mbta-unveils-percent-fare-hike-limited-service-cuts-also-proposed/moCl42rwr0Nf5xyx20ZQGP/story.html}
\access{July 22, 2015}.
\csmrcomment{Globe, so OK}
\end{csmr}
\end{table}

\begin{abcd}

\item What are the relative and absolute changes in the Charlie Card
  bus fare? 

You don't have to do this in your head, without a calculator,
but you should be able to. 

\item
Create a spreadsheet for these data, with eight rows (one for each of
the eight categories) and three columns, for the category name, the
existing fare and the proposed fare.
Create a properly labeled bar graph to display the data.

\item 
Label the next two columns appropriately to hold the
relative and absolute changes. Fill those columns with Excel formulas
to compute the correct values. Do not compute the values elsewhere and
enter them in the spreadsheet as numbers.

\item Imagine that you are addressing a public meeting about these
fare increases. How would you argue that an unfair burden is being
placed on people who pay using a Charlie Ticket? How would you argue
that senior citizens are most hard hit?

\item Find the mean of the relative percent increases. Explain why the
answer is not the 23 percent quoted in the headline.

\item What extra information would you need to check that the correct
mean is 23 percent?

\item (Optional) Find out why the stored value card is called a
  Charlie Card. 

\item (Optional, and difficult) Create a column for the percentage of
  MBTA revenue for each of the categories and fill in some values that
  sum to 100\% and give a weighted average fare increase of about 23\%.

   There's an important and subtle distinction here between weights as
a percentage of revenue and weights as a percentage of trips. 

\end{abcd}
\begin{sol}

\begin{abcd}
\item What are the relative and absolute changes in the Charlie Card
  bus fare?

The absolute change is \$0.25; the relative change is 20\%. I did them
in my head. 

\item
Create a spreadsheet for these data, with eight rows (one for each of
the eight categories) and three columns, for the category name, the
existing fare and the proposed fare.
Create a properly labeled bar graph to display the data.

My spreadsheet with its chart is at
\slink{MBTAFareIncreasesSolution.xlsx}.

\item 
Label the next two columns 
appropriately to hold the
relative and absolute changes. Fill those columns with Excel formulas
to compute the correct values. Do not compute the values elsewhere and
enter them in the spreadsheet as numbers.

\item In my solution the 
absolute and relative changes are in columns
\cell{E} and \cell{F}. I put these formulas in cells \cell{E7:F7}:
\displayexcel{
=D7-C7 \ \ \ \ \ \ =D7/C7
}
I put the percent increase in cell \cell{G7} with the formula
\displayexcel{
=F7-1
}
and formatted the result as a percent.

Then I copied these formulas to the rest of the rows.

\item Imagine that you are addressing a public meeting about these
fare increases. How would you argue that an unfair burden is being
placed on people who pay using a Charlie Ticket? How would you argue
that senior citizens are most hard hit?

The Charlie Ticket riders have the largest absolute change among all
the categories: 50 cents. That's not fair.

The senior citizens have the largest two relative changes among all
the categories: 67\% for subway fares and a whopping 88\% for the
bus. That's not fair.

\item Find the mean of the relative percent increases. Explain why the
answer is not the 23 percent quoted in the headline.

According to Excel, the mean of the relative percent increases is
37\%. The correct mean is a weighted average of the percentages
in each category, with weights the percentage of revenue in each category.


\item What extra information would you need to check that the correct
mean is 23 percent?

To check that the \emph{Globe} reported the correct mean I would need to know
the percentage of rides in each category, or the percentage of revenue
for each category. 

\item (Optional) Find out why the stored value card is called a
 Charlie Card.


In 1949, the MBTA proposed a 5 cent increase in the charge for
customers exiting trains above ground. 
This fare increase prompted the composition of the song
``M.T.A.'' (or ``Charlie on the M.T.A.''  as
most people refer to it) about a fictional subway rider named Charlie
who was doomed to ride forever because he didn't have that nickel to
get off of the T.  The MBTA website 
\url{www.mbta.com/about_the_mbta/history/?id=19582}
has a short description. When the MBTA introduced plastic fare
cards they named them ``Charlie cards''. 

The Kingston Trio made the song famous in 1959.

\item (Optional, and difficult) Create a column for the percentage of
  MBTA revenue for each of the categories and fill in some values that
  sum to 100\% and give a weighted average fare increase of about 23\%.

They are in the spreadsheet.

\end{abcd}

\end{sol}
\end{exx}

\begin{exx}{\hassolution\sref{piecharts}\gref{excelchart}}
Why not pie charts?
\index{pie chart}

Do some internet research to discover why 
bar charts are usually better than pie charts. Write the reasons in
your own words (don't just cut and paste). Identify the sources of
your information and comment on why you think those sources are
reliable.

\begin{sol}
Searching for
\begin{center}
are pie charts bad
\end{center}

in October 2014 finds these first five websites

\begin{itemize}
\item
  \url{www.businessinsider.com/pie-charts-are-the-worst-2013-6}
\item
\url{www.stevefenton.co.uk/Content/Pie-Charts-Are-Bad/}
\item
\url{www.datavis.ca/gallery/evil-pies.php}
\item
\url{www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=00018S}
\item
\url{blogs.oracle.com/experience/entry/countdown_of_top_10_reasons_to_never_ever_use_a_pie_chart}
\end{itemize}

\end{sol}
\end{exx}

\begin{exx}{\artificial\untested\sref{piecharts}\gref{excelchart}}
Misleading pie charts.
\index{pie chart}

Table~\ref{table:baddata} shows some student enrollment data at a
college. It's clearly incomplete: Juniors and Seniors are missing.

\begin{table}
\centering
\begin{tabular}{cc}
\toprule
Class & Percentage \\
\midrule
Freshman & 40\\
Sophomore & 25 \\
\bottomrule
\end{tabular}
\caption{Freshman and sophomore enrollments}
\tablesource{Made up numbers.}
\label{table:baddata}
\end{table}

Flashy pie charts are common in the media, but staid
and boring bar charts are usually more informative and less
deceiving. Because Excel makes it so easy to switch to a pie from a
bar, designers may be tempted by the glitz. You should not succumb to
that temptation. Here's an exercise where you can find out why.

\begin{abcd}
\item Construct a bar chart displaying these data, with columns for
Freshman and Sophomore enrollments.

\item Change the chart type in Excel.

\item Explain what is wrong with the chart Excel built for you.

\end{abcd}
\end{exx}


\begin{exx}{\hassolution\worthy\sref{mmm}\gref{excelwhatif}
\gref{excelaverages}}
What if? 

Add five workers each earning \$18K to the original \smallCo{} payroll by
inserting some rows in the table. 
Comment on what happens to the mean, median and modal incomes.

Note that Excel automatically recomputes these averages, but not the
charts for which you created data by hand.

When you've done this exercise, undo your changes and check that the
three kinds of averages revert to their old values.

\begin{sol}
When I add five workers earning \$18K each the
mean falls from \$77.16K to \$66.71K, the median from \$42.5K to
\$41K. The mode is now \$18K. (The value of the mode before the new
workers was ambiguous. What Excel says it is depends on the order in
which the employees are listed. In general you should report modes
only for grouped data.)


\end{sol}
\end{exx}


\begin{exx}{\hassolution\sref{mmm}\gref{skew}\gref{histograms}}
Population pyramids.
\index{population pyramid}

At  
\url{www.census.gov/population/international/data/idb/informationGateway.php}
you can choose a country and a year, then ask for
a kind of bar chart known as a
\emindex{population pyramid}.
You can also download the data used to build the chart.

\begin{abcd}

\item Construct population pyramids for the United States and for
Sudan for the year 2010.

\item  Estimate the number of people in the United States in 2010
between 0 and 9 years old.  Do the same for Sudan. What fractions of
the populations do these numbers represent?

\item  Find the modal age range for the United States 
population.  Do the same for Sudan.

\item  Find the age range for the United States that has the smallest
population.  Do the same for Sudan.

\item  Compare the population distributions of the United States and Sudan.
Write several sentences that highlight aspects of each distribution
that you think are quantitatively significant. Use the results of the
previous part of the problem.

%\item Careful reading shows that values for the
%years beyond 1999 are estimates. You're told to visit
%\url{factfinder.census.gov/} for up to date information. Do
%that, and comment on something interesting you find there.

\end{abcd}

\begin{sol}

\begin{abcd}

\item Construct population pyramids for the United States and for
Sudan for the year 2010.

See Figures~\ref{fig:uspopulationpyramid} and
\ref{fig:sudanpopulationpyramid}.

\figfile{USPopulationPyramid.png}
\begin{figure}
\centering
\includegraphics[height=72mm]{\thefigurefilename}
\begin{csmr}[U. S.  population pyramid\label{fig:uspopulationpyramid}]
International Programs, United States Census,
\url{www.census.gov/population/international/data/idb/informationGateway.php},
select Population Pyramid Graph --- 2010 --- United States,
\access{August 11, 2015}.
\csmrcomment{goverment webisite no permission needed}
\end{csmr}
\end{figure}
\figfile{}
\figfile{SudanPopulationPyramid.png}
\begin{figure}
\centering
\includegraphics[height=72mm]{\thefigurefilename}
\begin{csmr}[Sudan population pyramid\label{fig:sudanpopulationpyramid}]
International Programs, United States Census,
\url{www.census.gov/population/international/data/idb/informationGateway.php},
select Population Pyramid Graph --- 2010 --- Sudan,
\access{August 11, 2015}.
\csmrcomment{goverment webisite no permission needed}
\end{csmr}
\end{figure}
\figfile{}

\item  Estimate the number of people in the United States in 2010
between 0 and 9 years old.  Do the same for Sudan. What fractions of
the populations do these numbers represent?

I did Sudan first.

To find that number I need to add the bottom two bars on each side,
for the boys and girls age 0-4 and 5-9. That looks to be about
$3.6 + 3.2 + 3.5 + 3.0 = 13.1$ million.

Then I downloaded the underlying data tables as an Excel
spreadsheet and checked my estimate.
The number of 0-9 year olds in Sudan is $7,018,343 +
6,128,436$ which is indeed about 13.1 million.

The total Sudan population is 43.9 million (from the data table). The
fraction of children is thus $13.1/43.9 = 0.298405467 \approx 30\%$.

The corresponding figures for the United States are about $11 + 11 +
10 + 10 = 42$ million children. That's about $42/310 \approx 13.5\%$.

\item  Find the modal age range for the United States 
population.  Do the same for Sudan.

The modal age is the largest bar. For the United States that's 45-49
years. For Sudan it's 0-4 years!

\item  Find the age range for the United States that has the smallest
population.  Do the same for Sudan.

In each population the smallest age range is 100+. That's not
surprising, and also not very interesting.

\item  Compare the population distributions of the United States and Sudan.
Write several sentences that highlight aspects of each distribution
that you think are quantitatively significant. Use the results of the
previous part of the problem.

The population pyramid for Sudan really looks like a pyramid. The
number of people in each five year cohort decreases steadily with
age. That indicates a very large birth rate, with lots of people dying
young. It also spells disaster for the society --- there aren't lots of
adults to take care of the children.

The United States population pyramid looks more like a house: vertical
walls with a slanted roof on top. That suggests that the population
structure is relatively stable, with few deaths at young ages. It also
looks likely that the little bulge at 45-55 years will lead to a large
number of senior citizens in not too long. That's a social problem of
a different kind.

\end{abcd}

\end{sol}
\end{exx}

\begin{exx}{\hassolution\sref{mmm}\gref{excelaverages}}
\myindex{Lake Wobegon}.

\begin{abcd}

 \item Look back at the \smallCo{} data and answer the following
questions:

\begin{itemize*}
\item What percentage of the employees make more than the mean salary?
\item What percentage of the employees make more than the median
salary?
\item What percentage of the employees make more than the mode salary?
\end{itemize*}

\item In his radio show \emph{A Prairie Home Companion}
\index{Prairie Home Companion} host Garrison
Keillor regularly tells his audience about \myindex{Lake Wobegon},
where ``all the children are above average''.

Is that possible, with any of the meanings of ``average''?
\end{abcd}

\begin{sol}
\begin{abcd}

 \item Look back at the \smallCo{} data and answer the following
questions:

\begin{itemize*}
\item What percentage of the employees make more than the mean salary?

8 of the 30 employees make more than the mean of \$77K. That's
$8/30 = 0.266 = 27\%$.

\item What percentage of the employees make more than the median
salary?

50\%! That's the meaning of ``median'' so it's the answer to
this question even if I don't know the value of the median.

\item What percentage of the employees make more than the mode salary?

This is a slightly subtle question. The mode only makes sense for
grouped data. The figure in the text shows that the mode for
\smallCo{} is about \$30K. 21 of the 30 employees, or 70\%, make more
than that. 
\end{itemize*}

\item In his radio show \emph{A Prairie Home Companion}
\index{Prairie Home Companion} host Garrison
Keillor regularly tells his audience about \myindex{Lake Wobegon},
where ``all the children are above average''.

Is that possible, with any of the meanings of ``average''?

No --- that's what makes Keillor's description funny.

For the median, exactly half are above average. For the mean and
the mode it's possible for (a lot) more than half to be above average, 
but never ``all''.

\end{abcd}

\end{sol}
\end{exx}



\begin{exx}{\hassolution\complex\sref{mmm}\gref{excelaverages}}
Working for \myindex{Walmart}.

On December 2, 2009 Bloomberg News reported on a
settlement in which Walmart agreed to pay \$40 million to up to 87,500
employees because the company had failed to pay overtime, allow rest
and meal breaks and, in addition, manipulated time cards.

Eligible present and former employees would receive 
\$400 to \$2,500 each --- on average \$734.

The lawyers for the employees asked for fees of \$15.2 million from
the \$40 million.%
\begin{csmr}
M. C. Fisk and J, Lawrence,
Walmart to Settle Massachusetts Suit for \$40 Million (Update2),
\emph{Bloomberg News} (December 2, 2009),
\url{www.bloomberg.com/apps/news?pid=newsarchive&sid=a2AClc9J8WwE}
\access{October 2, 2015}.
\csmrcomment{paraphrase}
\end{csmr}

\begin{abcd}
\item Compute the mean compensation, assuming that there are 87,500
eligible employees and that the lawyers have taken their cut.

\item Compare your answer to the reported minimum compensation of
\$400 and the reported \$734 the average worker will receive.

\item
Draft a letter to
the editor or the reporter, politely pointing out that both the
reported ``averages''  made no sense, and asking for more detail or a
correction.

\end{abcd}

\begin{sol}

\begin{abcd}
\item Compute the mean compensation, assuming that there are 87,500
eligible employees and that the lawyers have taken their cut.

\begin{equation*}
\frac{\$40\hbox{M} - \$15.2\hbox{M} }{87,500 \hbox{ workers}} =
283 \frac{\$}{\hbox{worker}}.
\end{equation*}

\item Compare your answer to the reported minimum compensation of
\$400 and the reported \$734 the average worker will receive.

The \$283 mean compensation can't be less than the \$400
minimum compensation. Something is crazy here.  

The \$734 that the ``average worker'' gets is even crazier. It's much
too large to be the median. There's no way half the workers can get
more than \$734 if the minimum is \$400 unless the mean is more than halfway
from \$400 to \$734. That's not true even before I take out the
lawyers' \$15.2 million.

\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\routine\artificial\hassolution\sref{mmm}\gref{excelaverages}}
Is it \myindex{discrimination}? 		

Table~\ref{salaryStructure} shows the salary structure of two
departments in a hypothetical university. 

\begin{table}
\centering
\begin{tabular}{ccccc}
\toprule
& 
\multicolumn{2}{c}{Physics} 
& 
\multicolumn{2}{c}{English} \\
 &  	professors & {salary} & professors & salary \\
\midrule
 Women & 1 & 	\$100K &	8 & \$50K \\
 Men & 9	& \$90K	& 2 &	\$40K \\
\bottomrule
\end{tabular}
\caption{Salary structure at a university}
\tablesource{Made up data.}
\label{salaryStructure}
\end{table}

\begin{abcd}

\item What is the average (mean) salary of the professors? Of the women
professors? Of the men? 

\item Answer the same questions for the median.

\item Answer the same questions for the mode.

\item Write a few sentences to convince someone that men in this
university are paid better than women. Then write a few sentences to
convince someone of just the opposite. Explain the
contradiction.

\end{abcd}

These calculations are so straightforward that they're easier
with pencil and paper (maybe a calculator) than with Excel.

\begin{sol}

\begin{abcd}

\item What is the average (mean) salary of the professors? Of the women
professors? Of the men? 

The mean professor salary (in thousands of dollars) is
\begin{equation*}
\frac{1 \times \$100 + 9 \times \$90 + 8 \times \$50 + 2 \times
\$40}{20} = \$69.5.
\end{equation*}

The mean for the women is
\begin{equation*}
\frac{1 \times \$100 +  8 \times \$50} {9} = \$55.6.
\end{equation*}

The mean for the men is
\begin{equation*}
\frac{9 \times \$90 +  2 \times \$40} {11} = \$80.9.
\end{equation*}

\item Answer the same questions for the median.

The median salaries for all, women and men are 
\$70K (halfway between \$90K and \$50K), \$50K and \$90K respectively.

\item Answer the same questions for the mode.

The mode salaries for all, women and men are 
\$90K, \$50K and \$90K respectively.

\item Write a few sentences to convince someone that men in this
university are paid better than women. Then write a few sentences to
convince someone of just the opposite. Explain the
contradiction.

Obviously the men are paid better than the women. Each of the three
averages is higher for men than women!

Obviously the women are better paid than the men: all the women do
better than all the men in both departments.

This seeming contradiction is possible because the women outnumber the
men in the English (poorer) department, while the opposite is true in
the physics (richer) department. There the lone woman makes the
highest salary --- perhaps because the Dean (a woman) insisted that an
all male physics department would discourage women students.

\end{abcd}

\end{sol}
 
\end{exx}


\begin{exx}[salaryrange]{\hassolution\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}}
Choosing data ranges for a histogram.

When you change the widths of the intervals in a histogram you get a
(slightly) different picture of the data.

\begin{teacher}
This exercise may be worth assigning for the Excel practice, and for
reinforcing the computations of various averages from summarized
data. But the conclusions aren't very striking.

\end{teacher}
\begin{abcd}
\item
Redo the \smallCo{} histograms in the text using salary ranges of size
\$10K and then of size \$50K.

\suspend{abcd}
\begin{hint}
Start with a copy of \link{WingAeroHistogram.xlsx}. Create new data in
columns \cell{D} and \cell{E} below the existing data there for the
two new histograms. Fill in column \cell{F} appropriately. Then you
can copy and paste from what's there to fill in columns \cell{G} and
\cell{H} and the sums and the mean.

If Excel wants to treat your salary ranges as dates, try formatting
the cells as text.

To make the charts, copy the histogram that's there, then find the
place in Excel where you can change the source data to the new rows in
columns \cell{D} and \cell{E}.
\end{hint}

\resume{abcd}

\item
In each case use the techniques from
Section~\ref{sec:averagesfromhistograms} to estimate the mean, median
and mode.

\item
Discuss the advantages and disadvantages of these possible choices for
the salary range, comparing them to our choice of \$20K.

\end{abcd}

\begin{sol}

\begin{abcd}
\item
Redo the \smallCo{} histograms in the text using salary ranges of size
\$10K and then of size \$50K.

See \slink{WingAeroHistogramRangesSolution.xlsx} for the new histograms.

\item
Estimate the mean, median and mode.

Here's a table showing all three averages, computed from the data
directly and from the three histograms.

\begin{center}
\begin{tabular}{cSSc}
\toprule
Data & {Mean} & {Median} & Mode \\
\midrule
complete & {\$77.2K} & {\$42.5K} & nonsense \\
\$20K intervals & \$79K & \$48K & \$20K-\$39K \\
\$10K intervals & \$77K & \$45K & \$20K-\$29K \\
\$50K intervals & \$73K & \$45K & \$0K-\$50K \\
\bottomrule
\end{tabular}
\end{center}

The smaller the data ranges in the histograms the closer the
approximated averages are to the true values.

\item
Discuss the advantages and disadvantages of these possible choices for
the salary range, comparing them to our choice of \$20K.

None of these histograms seems particularly ``better'' than the other
two. Each gives a good visual picture of how \smallCo{} salaries are
distributed. The various averages are close enough to the correct
values so that they are not misleading.
\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\hassolution\sref{avgtests}\gref{weightedaverage}}
Texting teens.
\index{text messaging}

We drew the chart in Figure~\ref{fig:TeenTexting} using data from 
\theGlobe{} on
April 15, 2012.

\figfile{TeenTextingHanddrawn.png}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\begin{csmr}[Teen texting\label{fig:TeenTexting}]
Graphic redrawn from data from data 
J. P. Kahn,
Missed connections in our digital lives,
\theGlobe{} (April 15, 2012),
\url{www.bostonglobe.com/metro/2012/04/14/missed-connections-our-digital-lives/bPHauWdvUl5XAd1ol7SOQL/igraphic.html}
\access{August 2, 2015}.
\csmrcomment{Globe data, graphic redrawn.}
\end{csmr}
\end{figure}
\figfile{}

\begin{abcd}

\item Estimate the mode, median and mean number of
text messages sent by teenagers each day. \item Estimate the mode, median and mean number of
text messages sent by teenagers each day. 

\item In total, approximately how many text messages are sent by the
23 million American teens each day? 

\item The percentages don't add up to 100\%. Why might that have happened?

\item If you asked a random teenager how many text messages she sent
yesterday what are the chances (what is the probability) that it was
more than 50? More than 100? More than 25? 

\item What percent of teenagers text more than the median amount?

\item Does the figure display a histogram?

\item Create an Excel chart that reproduces the figure.
\end{abcd}

\begin{sol}
\begin{abcd}

\item Estimate the mode, median and mean number of
text messages sent by teenagers each day. 

The mode is the tallest column: more teenagers text 101-200 messages a
day than any other category. It's also reasonable to say that the mode
is about 150 messages per day.

Adding the percentages for the first three columns gives
$2\% + 22\% + 28\% = 52\%$. That means just over half the teenagers
text up to 50 messages per day. That's a good estimate for the median.

I will compute the mean as a weighted average, using the midpoints of
the ranges for each range.
\begin{equation*}
0.02 \times 0 + 0.22 \times 5 + 0.28 \times 30 + 0.16 \times 75 
+ 0.31 \times 150 = 68,
\end{equation*}
so the mean number of daily teen text messages is about 70.

I could also have computed using the exact midpoints of the
ranges. The arithmetic is uglier:
%
\begin{equation*}
0.02 \times 0 + 0.22 \times 5.5 + 0.28 \times 30.5 + 0.16 \times 75.5
+ 0.31 \times 150.5 =
68.705,
\end{equation*}
%
which is still about 70. The extra precision isn't worth the effort.

\item In total, approximately how many text messages are sent by the
23 million American teens each day? 

Since I  have the mean, 70 messages per teenager on average, all I
have to do is multiply by 23 million to get about 1.6 billion messages
per day.

\item The percentages don't add up to 100\%. Why might that have happened?

The percentages sum to 99\%. The missing one percent is probably
roundoff error.

\item If you asked a random teenager how many text messages she sent
yesterday what are the chances (what is the probability) that it was
more than 50? More than 100? More than 25? 

There's a 47\% chance that a random teenager texts more than 50
messages in a day and a 31\% chance that s/he texts more than 100.

For 25 messages I have to use a fraction of the 11-51 message
column. Since 25 is about one third of the way from 11 to 50 I will
suppose that about two thirds or 18\% of the teens in that category
text more than 25 messages a day. Adding that 18\% to the 47\% for the
two top columns gives me an estimate of 65\% for the chances that a
teenager texts more than 25 times a day.

\item What percent of teenagers text more than the median amount?

By definition that's 50\%.

\item Does the figure display a histogram?

Yes, but not a good one since the ranges are not all the same width.

\item Create an Excel chart that reproduces the figure.

See \slink{TeenTextingSolution.xlsx}.

\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\hassolution\sref{averagesfromhistograms}\gref{histogramaverages}
\gref{histograms}}
Websites are often confusing.

Jakob Nielsen evaluated the usability of voter information websites
for the 2008 election for each of the fifty states and the District of
Columbia. You can read his analysis at
\url{www.nngroup.com/articles/aspects-of-design-quality/}. 
His article includes the histogram in
Figure~\ref{fig:nielsenhistogram}.

%\figfile{histogram-homepage-quality-voting-sites.png}
\figfile{usability.png}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\begin{csmr}[Website usability\label{fig:nielsenhistogram}]
Jakob Nielsen,
Aspects of Design Quality
Nielsen Norman Group (November 3, 2008),
\url{www.nngroup.com/articles/aspects-of-design-quality/}
\access{July 22, 2015}, 
image \copyright{} Neilsen Norman Group, reproduced with permission.
\csmrcomment{You may reprint the NNG copyrighted image as long as cite the
referenced material and source.}
\end{csmr}
\figsource{url{www.nngroup.com/articles/aspects-of-design-quality/}}
\figcomment{%
from www.nngroup.com/faqs\#reprint\_image:
Can I reprint an image from a Nielsen Norman Group report in an
educational textbook? 
Yes. You may reprint the NNG copyrighted image as long as cite the
referenced material and source.}
\end{figure}
\figfile{}

\begin{abcd*}

\item What is the modal usability score for these 51 home pages?

\item Reproduce this histogram in Excel. 

\suspend{abcd*}
\begin{hint}
When you enter the data in two columns in Excel put the categories
(usability scores) on the left since they are the labels for the
$x$-axis. Put the numbers of websites on the right since they are the 
values that go with the categories, and should plot vertically, on the
$y$-axis.

If your Excel is anything like ours, it may well think that you want
to display both columns on the $y$-axis, since both columns are
numbers. If it does that it will label the x~axis with the numbers 1
to 10.

If that happens, delete the data series corresponding to the bars you
don't want (the percentages). Then right click on the chart and
explore until you find the place that allows you to enter the fields
you want to use as $x$-axis category labels.

\end{hint}
\resume{abcd*}
\item Estimate the median usability score for these 51 home pages.
\item How many of them have a usability score less than 
the median score?
\item Estimate their mean usability score.

\end{abcd*}

\begin{sol}

\begin{abcd}

\item What is the modal usability score for these 51 home pages?

The 20\% bar is the tallest, so that's the modal score.


\item Reproduce this histogram in Excel.

See Figure~\ref{fig:nielsenhistogramreproduced}.
The spreadsheet is at \slink{HomePageQualitySolution.xlsx}. 

\figfile{HomePageQualitySolutioncropped.pdf}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\caption{Website usability}
\label{fig:nielsenhistogramreproduced}
\end{figure}
\figfile{}

Excel ended the vertical scale at 14, but the scale in the histogram in
the book ends at 16, so I changed the scale here. I matched the color
as best I could.	 


\item Estimate the median usability score for these 51 home pages.

I will add up the numbers in each category starting from the left, to
see when I reach half the websites --- that would be 25 or 26. The sum
of the first three categories is
$4 + 4 + 12 = 20$; the sum of the first four is
$4 + 4 + 12 + 6 = 26$. So the 25th site in the
fourth category.

That means about half the websites have usability scores of 30\% or
less while the other half have scores 40\% or more. Since I have to
report a single number for the median, I'll say it's 35\%.

\item How many of them have a usability score less than 
the median score?

Half the home pages have a score less than the median. That will be
true even if I made a mistake computing the median!

\item Estimate their mean usability score.

It's not clear whether the scores are always exactly 10\%, 20\%, etc.,
or whether the 20\% category really means ``between 10\% and
20\%''. I think the first interpretation is right --- it looks as if
there are 10 usability tests and you either succeed or fail in each one.
If that's how the score was computed, no website passed all ten tests
and four sites failed them all.
With that interpretation the mean usability score is the
\myindex{weighted average}

\begin{equation*}
\frac{\begin{split}
4 \times 0 & + 4 \times 10 + 12 \times 20 + 6 \times 30  + 3 \times 40\\
           & + 7 \times 50 + 5 \times 60 + 4
\times 70 + 3 \times 80 + 3 \times 90
\end{split}
}
{51}
= 0.396078431,
\end{equation*}
which is about 40\%.

I did the computation in Excel, since I had already entered the
numbers to build the histogram. Table~\ref{table:formulas} shows the
formulas I used, starting at cell \cell{A16}.

\begin{table}
\begin{center}
\begin{tabular}{|r|r|r|}
\hline
Usability Score & Number of websites & total contribution \\
\hline
0 & 4 & =A17*B17 \\
\hline
=A17+0.1 & 4 & =A18*B18 \\
\hline
=A18+0.1 & 12 & =A19*B19 \\
\hline
=A19+0.1 & 6 & =A20*B20 \\ 
\hline
=A20+0.1 & 3 & =A21*B21 \\ 
\hline
=A21+0.1 & 7 & =A22*B22 \\ 
\hline
=A22+0.1 & 5 & =A23*B23 \\ 
\hline
=A23+0.1 & 4 & =A24*B24 \\ 
\hline
=A24+0.1 & 3 & =A25*B25 \\ 
\hline
=A25+0.1 & 3 & =A26*B26 \\ 
\hline
=A26+0.1 & 0 & =A27*B27 \\ 
\hline
sums & =SUM(B17:B27) & =SUM(C17:C27) \\ 
\hline
mean &  & =C28/B28 \\
\hline
\end{tabular}
\caption{Weighted average of usability scores}
\label{table:formulas}
\end{center}
\end{table}

With the second interpretation, I will count the web pages with scores
between, say, 10\% and 20\% as if they all had scores of 15\%.
Then the mean usability score is the
weighted average 
\begin{equation*}
\frac{\begin{split}
4 \times 5 & + 4 \times 15 + 12 \times 25 + 6 \times 35 + 3 \times 45
\\
           & + 7 \times 55 + 5 \times 65 + 4
\times 75 + 3 \times 85 + 3 \times 95
\end{split}
}
{51}
= 0.446078431 ,
\end{equation*}
which is about 45\%.

It's not an accident that this is exactly 5 percentage points
higher. Do you see why?

To see the magic of Excel, just change the \excel{0\%} to 
\excel{5\%} in \cell{A17} and watch all the computed values change.

\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\untested\sref{median}\gref{excelaverages}}
Doublethink.
\index{doublethink}

In January 2012 one could read this at
\url{www.businessinsider.com/where-the-one-percent-live-the-15-richest-counties-in-america-2012-2}:

\begin{quotation}
Living in Arlington isn't cheap, so you'd better be making at least
the median household income to live in this county just outside
Washington, D.C.%
\begin{csmr}
K. Geldis,
 The Richest Counties in America,
\emph{TheStreet} (February 13, 2012),
\url{www.thestreet.com/story/11415107/3/the-richest-counties-in-america.html}
\access{July 22, 2015}.
\csmrcomment{26 words fair use}
\end{csmr}
\end{quotation}

How does this statement contradict itself?
\end{exx}

\begin{exx}[usincomedistibution]{\hassolution\complex\worthy
\sref{averagesfromhistograms}\gref{histogramaverages}\gref{histograms}}
Household income in the United States.
\index{household income}

The histogram in 
Figure~\ref{householdIncome} 
shows the percentages of households in income groups
%CHANGE  farthest two right columns.
\$5,000 increments apart, except for the two farthest right columns.
\figfile{Household-Income-in-the-United-States-2012.png}
\begin{figure}
\centering
 \includegraphics[width=5in]{\thefigurefilename}
\begin{csmr}[United States household income --- 2012 Estimate\label{householdIncome}]
  Wikipedia, 
\url{commons.wikimedia.org/wiki/File:Distribution_of_Annual_Household_Income_in_the_United_States_2010.png}
\access{August 11, 2015}.
 Creative Commons Attribution-Share Alike 3.0 Unported license.
\csmrcomment{Creative commons}
\end{csmr}
\end{figure}
\figfile{}

We've put the data (and a copy of the figure) in the spreadsheet
\mbox{\link{Households2012.xlsx}.}

\begin{abcd}
\item Check the quantitative assertions in the text in the Wikipedia chart.

\item
Build a histogram in Excel that comes as close as
possible to matching the one from Wikipedia. Create the same chart and
axis titles. Change the grid lines. 
Put in the comments as text boxes. Match the fonts. 
%CHANGE DELETED Do fix the spelling and grammar errors!

\item Do the percentages sum to 100\%? If not, what might explain the
discrepancy? 

\item To estimate the mean household income you will need an estimate
  of the mean for the households with incomes greater than
  \$250,000. There's no top to this range, so you can't use the middle
  of the range.

What value for the mean for the last category makes the mean for the
whole population equal to the median?

\item Search for an estimate of the mean household income for the
  whole population. What mean for the last category results in this
  overall mean?
\end{abcd}

\begin{sol}

See \slink{Households2012Solution.xlsx}.

\end{sol}

\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}}
Fight for the Senate.

A graph like the one in Figure~\ref{fig:senate}
appeared in Nate Silver's Five Thirty Eight column 
in \theTimes{} on October 31, 2012.\index{Silver, Nate}
The $x$-axis displays the number of seats held by each
party: the Tie in the middle is 50 Democrats, 50 Republicans. The +10
Dem corresponds to 55 Democrats, 45 Republicans.

\figfile{Oct31SenateProjectionCropped.pdf}
\begin{figure}
\centering
\includegraphics[width=3.5in]{\thefigurefilename}
\begin{csmr}[The fight for the Senate\label{fig:senate}]
Graphic redrawn from data scraped from a Nate Silver \emph{New York
  Times} graphic published October 31, 2012. The original seems not to
be available. 
\csmrcomment{Since I redrew the picture I assume this is OK}
\end{csmr}
\end{figure}
\figfile{}

Nate Silver constructed this histogram by imagining
(simulating) many thousands of elections and recording the percentage
of time each Democratic/Republican split occurred. 
We estimated the percentages in his chart and entered them 
in the spreadsheet
\link{Oct31SenateProjection.xlsx} so you don't have to type them
yourself. (We rounded the really tiny percentages to zero.)
Use Excel whenever it's most
convenient for you.

\begin{abcd}
\item What is the most likely number of Democratic senators?

\item What number of Democratic senators represents the mode of this
distribution?

\item What is the probability that there are more than 50
Democratic senators?

\item What number of Democratic senators is the median of this
distribution?

\item If you had the complete list of all Nate Silver's imagined
elections and sorted it by the number of Democratic senators, how many
Democratic Senators would there be in the middle election on that
list?

\item Use Excel to compute the (weighted) average number of Democratic
senators for these imagined elections.

\item What actually happened in the election?

\end{abcd}

\begin{sol}
\begin{abcd}

\item What is the most likely number of Democratic senators?

53 --- the highest bar.

\item What number of Democratic senators represents the mode of this
distribution?

53 --- the highest bar.

\item What is the probability that there are more than 50
Democratic senators?

85.6\% --- it's \excel{=SUM(C14:C21)} in the spreadsheet.

\item What number of Democratic senators is the median of this
distribution?

The cumulative sum passes 50\% at 53 Democrats. So the median is 53.

\item If you had the complete list of all Nate Silver's imagined
elections and sorted it by the number of Democratic senators, how many
Democratic Senators would there be in the middle election on that
list?

There would be 53 Democrats. This is just the median again.

\item Use Excel to compute the (weighted) average number of Democratic
senators for these imagined elections.

The weighted average is 52.6 Democratic senators.

See \slink{Oct31SenateProjectionSolution.xlsx} for the calculation.

\item What actually happened in the election?

  The 2012 election produced a Senate with 53 Democrats.


\end{abcd}

\end{sol}
\end{exx}

\begin{exx}{\hassolution}{\worthy\sref{averagesfromhistograms}
\gref{excelaverages}\gref{histogramaverages}}
What cities pay for fire protection.
\index{firefighters' pay}

%CHANGE added comma  Monday, March 30 2009 
On Monday, March 30, 2009 \theGlobe{} published an article comparing the
amount various cities spent on Fire and EMS services.
Figure~\ref{fig:FireSpending} is a screenshot of a spreadsheet where
we entered some of the data, along with population
figures. You can download that spreadsheet from
\link{FireSpending.xlsx}. Use the data to answer these questions.

\figfile{FireSpendingcropped.pdf}
\begin{figure}
\centering
\framebox{
\includegraphics[width=80mm]{\thefigurefilename}
}
\begin{csmr}[Fire protection spending\label{fig:FireSpending}]
Data from D. Slack,
Boston spends most on firefighters in US,
\theGlobe{} (March 30, 2009),
\url{www.boston.com/news/local/massachusetts/articles/2009/03/30/boston_spends_most_on_firefighters_in_us/},
data for graphic at
\url{www.boston.com/news/local/massachusetts/articles/2009/03/30/fire_spending/},
\access{August 11, 2015}.
\csmrcomment{Globe, OK}
\end{csmr}
\end{figure}
\figfile{}

\begin{enumerate}[(i)]

\item Population

\begin{abcd}

\item What is the mean population of the twelve cities for which
data are presented?

\item What is the median population of the twelve cities for which
data are presented?

\item Create Table~\ref{cityPopulations} in Excel. Fill in the second column
there. Then create a properly labeled histogram for the data. 

\begin{table}
\centering
\begin{tabular}{cc}
\toprule
	Population range & Number of cities\\
\midrule
	500K-600K & \\
	600K-700K & \\
	700K-800K & \\
	800K-900K & \\
	900K-1000K & \\
	1000K-2000K & \\
	2000K-3000K & \\
	3000K-4000K& \\ 
	$>$ 4000K & \\
\bottomrule
\end{tabular}
\caption{City populations}
\tablecomment{No data in this table.}
\label{cityPopulations}
\end{table}

\item Use your histogram to estimate the mode population for these
cities.

\item  What percent of the U.S. population lives in these twelve cities?

\end{abcd}

\item Fire/EMS spending per person

\begin{abcd}

\item 
What is the mean amount spent for Fire/EMS services
per person in these twelve cities?


\item Estimate the median amount
spent for Fire/EMS services per person in these twelve
cities.

\item Estimate the mode amount
spent for Fire/EMS services per person in these twelve cities.

\end{abcd}

\item What do firefighters earn?

There is enough information in the spreadsheet to calculate the
average (mean) earnings of Fire/EMS personnel in each of the twelve
cities. Do that, in a fresh column in your spreadsheet.

\begin{abcd}

\item In which city do Fire/EMS personnel have the highest average salary?
How much is it?

\item In which city do Fire/EMS personnel have the lowest average salary?
How much is it?

\item Where does Boston rank in the list of Fire/EMS personnel
salaries?

\item Explain how Boston can be at the top of the list in Fire/EMS
expenses per resident although it does not pay the highest salaries.

\end{abcd}


\item Correction the next day!

On Tuesday, the next day, \theGlobe{} published a correction, which
said that Boston's fire department expenses were \$285 per resident in
the last fiscal year.%
\begin{csmr}
D. Slack and J. C. Drake,
Error made in fire dept. report,
\theGlobe{} (March 31, 2009)
\url{www.boston.com/news/local/massachusetts/articles/2009/03/31/error_made_in_fire_dept_report/}
\access{July 22, 2015}.
\csmrcomment{paraphrase}
\end{csmr}

Look at the answers to the questions above and indicate which have
changed (and how) and which stayed the same. 

\end{enumerate}

\begin{hint}
The mean amount spent for Fire/EMS services
per person is not the Excel \excel{AVERAGE} of the amounts
spent per person by each city. It's wrong to average
those numbers since they are already averages.
You must weight them by the city populations in order
to compute the total amount spent by all the people in all the
cities. Then divide by the total population.

You can't compute the median with Excel's \excel{MEDIAN} function for
the same reason. Further hint: almost everyone lives in New York.

You can't answer the question ``What do firefighters earn?'' by
finding the mean of the twelve numbers in column C. Compute the mean
correctly as a weighted average. You will probably\index{probably}
want to start by creating a column labeled 

\begin{verbatim}
	total Fire/EMS expenses
\end{verbatim}
\noindent
and fill in the value for each city.
\end{hint}

\begin{sol}

You can find the work in the spreadsheet at
\slink{FireSpendingSolution.xlsx}.

\begin{enumerate}[(i)]

\item Population

\begin{abcd}

\item What is the mean population of the twelve cities for which
data are presented?

A straightforward use of \excel{ =AVERAGE(B7:B18)} in Excel tells
me that the mean population is 1,661,071. The last few digits really
don't mean much since it's unlikely that the populations listed for
the cities are really accurate down to the last person. A better
answer is about 1,660,000.


\item What is the median population of the twelve cities for which
data are presented?

A straightforward use of \excel{ =MEDIAN(B7:B18)} in Excel tells
me that the median population is 773,469. That's halfway between the
populations of San Francisco and Columbus. It's much too precise. A
better answer is that the median is about 770,000.

\item Here is the table:

\begin{center}
\begin{tabular}{cc}
\toprule
	Population range & Number of cities\\
\midrule
	500K-600K & 2 \\
	600K-700K & 3 \\
	700K-800K & 2 \\
	800K-900K & 1 \\
	900K-1000K & 1 \\
	1000K-2000K & 1 \\
	2000K-3000K & 0 \\
	3000K-4000K& 1 \\ 
	$>$ 4000K &  1 \\
\bottomrule
\end{tabular}
\end{center}

See the histogram in the spreadsheet. Note the labels for the axes and
the title, that the bars touch one another, and that the vertical
scale does not list halves of cities.

\item The highest bar in the histogram corresponds to the largest
number in the table: the mode population is 600K-700K.

\item  The \excel{SUM} function in Excel tells me that the total
population of these twelve cities is 19,932,848. 

That represents $19,932,848/300,000,000 \approx 7\%$ of 
the people in the United States

\end{abcd}

\item Fire/EMS spending per person

\begin{abcd}

\item 

Column \excel{C} reports the average (mean) spending in each of the twelve
cities. To find the overall average I can't just average these
averages, since the cities are different sizes. I need to use those as
the weights. So I created column \excel{E} to hold
\excel{total Fire/EMS expenses}
and filled each row with the product of the values in columns
\excel{B} and \excel{C}: 
the number of 
residents times the spending per resident. Then I summed column \excel{E} to
find out that these cities spent a total of \$3,707,134,657 (almost
four billion dollars!) to provide Fire/EMS services for their total
population of 19,932,848 people. Dividing, I found that on average
they spent 
\begin{equation*}
	\frac{\$3,707,134,657}{19,932,848 \hbox{\ people}} 
	\approx 
	186 \frac{ \$ }{\hbox{person}}
\end{equation*}

\item New York accounts for more than 40\% of the 20 million people in these
cities. If you add in Los Angeles then you get more than half. Since
those two are the two cities at the bottom of the list, the median
Fire/EMS cost per resident is somewhere between the values for those
cities --- probably\index{probably} about \$150.

\item The most common amount is what the New Yorkers pay, so the mode
is \$157.56.

\end{abcd}

\item What do firefighters earn?

There is enough information in the spreadsheet to calculate the
average (mean) earnings of Fire/EMS personnel in each of the twelve
cities. Do that, in a fresh column in your spreadsheet.

I computed that two ways, just to make sure I was right. The first way
uses the contents of columns \excel{C} and \excel{D} to fill column
\excel{H} this way: \excel{=1000*Cn/Dn} for each of the rows. That
works because the units for 
column \excel{C} are \$ per resident while for column \excel{D} they
are firefighters per 1000 residents. When I divide I get \$ per firefighter,
which is just what I want.

The second way is to compute the number of firefighters in each city ---
fill column \excel{H} with \excel{ =B7*D7/1000}. Then divide the total
fire/EMS expenses (already computed in column \excel{E}) by that
number and put the result in column \excel{I}. Columns \excel{H} and
\excel{I} match.

\begin{abcd}

\item Firefighters in Los Angeles have the highest average salary:
\$153,111.

\item Firefighters in Baltimore have the lowest average salary: \$83,696.


\item Boston ranks fourth from the top, at \$132,985.

\item Boston spends more per resident on Fire/EMS personnel than other
cities even though it pays them less because it has a lot more of them
in proportion to population --- it's at the top of that list.

\end{abcd}

\item Correction the next day!

When I change the Boston Fire/EMS expense per resident from
\$452.15 to \$285 the following figures change:

\begin{itemize}

\item The mean Fire/EMS expense becomes \$181/resident, down from
\$186 per resident.

\item The mean Fire/EMS salary becomes \$107,600, down from \$110,635.

\item The average Boston Fire/EMS salary falls from \$132,985 to
\$83,824 --- second lowest, only about \$100 above the minimum.

\end{itemize}

\end{enumerate}


\end{sol}


\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}
\gref{histogramaverages}}
College presidents' pay.

Figure~\ref{fig:collegepresidentspay} is a histogram showing the total
compensation for the 100 best paid presidents of public
universities. The data are from an article in the April 3, 2011 issue
of \emph{The Chronicle of Higher Education}
%CHANGE added period ticle/Presidents-Defend-Their/126971})
(\url{chronicle.com/article/Presidents-Defend-Their/126971}.)

\figfile{CEOSalaryTablecropped.pdf}
\begin{figure}
\centering
\includegraphics[height=60mm]{\thefigurefilename}
\begin{csmr}[College presidents' pay\label{fig:collegepresidentspay}]
Data from
J. Stripling and A. Fuller,
Presidents Defend Their Pay as Public Colleges Slash Budgets,
\emph{The Chronicle of Higher Education} (April 3, 2011),
\url{chronicle.com/article/Presidents-Defend-Their/126971}
\access{August 11, 2015}.
\csmrcomment{graphic redrawn from public data}
\end{csmr}
\end{figure}
\figfile{}

To save you staring at the picture, the number of presidents in each
of the ranges (reading from left to right) is 
51, 25, 15, 8 and then 1 in the last range.

You may use Excel for this exercise if you wish, but you don't have to.

\begin{abcd}

\item What is the mode of this distribution?

\item Estimate the median compensation.

\item Estimate the mean compensation.

\item Write two arguments one of which indicates that the average
president is paid appropriately, one that presidential pay is too
high. 

\end{abcd}

\begin{sol}
\begin{abcd}

\item What is the mode of this distribution?

The mode is \$350K-\$450K --- the most common salary range.

\item Estimate the median compensation.

51 presidents are in the \$350K-\$450K range and 49 make more than
that, so the median is \$450K --- right at the top of the most common
range.

\item Estimate the mean compensation.

I computed a weighted average of the middles of the ranges, with
weights the number of presidents in each range:
\begin{equation*}
\frac{51 \times 400 + 25 \times 500 + 15 \times 600 + 8 \times 700 + 1
  \times 1300}
{100}
 = 488,
\end{equation*}
so the mean compensation is approximately \$490K. Perhaps that should
be rounded to \$500K.

I redid the computation in the spreadsheet at
\mbox{\slink{CollegePresidentsSolution.xlsx}.}

\item Write two arguments one of which indicates that the average
president is paid appropriately, one that presidential pay is too
high.

Most of the 100 best paid presidents of public universities make about
\$400K per year. They run large organizations in the public interest
and make much less money than CEOs of companies of comparable
size. They are fairly paid.

The average salary of these presidents of public universities is
nearly half a million dollars a year. That is much more than faculty
make at those institutions. They should be paid more like public
servants.

\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\artificial\hassolution\worthy\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}\gref{descriptivestatistics}}
\myindex{Xorlon} \myindex{fleegs}.

Table~\ref{table:fleegs} shows the distribution of weights of a sample
of fleegs from 
the planet Xorlon, where weight is measured in frams.\index{fram}

We made up the numbers in this artificial problem so that
the arithmetic would be easy. We made up the words, too.
``frams'' is one Prof. Bolker could have used in a \myindex{Scrabble}
game with his wife --- for 57 points on a triple word score.
\index{Bolker!Joan}

\begin{table}
\centering
\begin{tabular}{cc}
\toprule
weight range (frams) & percent of sample \\
\midrule
20-40 & 10 \\
40-60 & 10 \\
60-80 & 20 \\
80-100 & 10 \\
100-120 & 50 \\
\bottomrule
\end{tabular}
\caption{Xorlon fleegs}
\tablesource{Made up data.}
\label{table:fleegs}
\end{table}

\begin{abcd}

\item Sketch a neat properly labeled histogram that displays the data. 

\item Create a properly labeled histogram in Excel that matches the
one you just drew. You can start by downloading \link{XorlonFleegs.xlsx}.

\suspend{abcd}

Answer the following questions. You may work in Excel or with a
calculator or do mental arithmetic. 
\resume{abcd}

\item Estimate the mode fleeg weight.

\item Estimate the median fleeg weight.

\item Estimate the mean fleeg weight.

\item Estimate the percentage of fleegs that weigh more than the median.

\item Estimate the percentage of fleegs that weigh more than the mean.

\item Estimate the percentage of fleegs that weigh more than the mode.

\end{abcd}

\begin{sol}
\begin{abcd}

\item Figure~\ref{fig:FleegsSketch} shows a student's neat properly labeled
histogram that displays the data. 

\figfile{FleegsSketch.jpg}
\begin{figure}
\centering
\framebox{
\includegraphics[width=2in]{\thefigurefilename}
}
\caption{Xorlon fleegs}
\label{fig:FleegsSketch}
\end{figure}
\figfile{}

\item Figure~\ref{fig:FleegsExcel} shows an Excel
histogram that displays this data. The source is the spreadsheet
\slink{XorlonFleegsSolution.xlsx}.

\figfile{FleegsExcelcropped.pdf}
\begin{figure}
\centering
\includegraphics[height=60mm]{\thefigurefilename}
\caption{Fleegs on Xorlon}
\label{fig:FleegsExcel}
\end{figure}
\figfile{}

\item Estimate the mode \myindex{fleeg} weight.

The highest bar is 100-120 frams, so that, or 110 frams, is a good
estimate for the mode.

\item Estimate the median fleeg weight.

Half the fleegs in the sample weigh less than 100 frams and half weigh
more, so the median weight is 100 frams.

\item Estimate the mean fleeg weight.

Averaging the midpoints of each weight range using the percentages as
weights is the best way to estimate the mean. The result is
\begin{equation*}
0.10 \times 30 +
0.10 \times 50 +
0.20 \times 70 +
0.10 \times 90 +
0.50 \times 110
= 86
\end{equation*}
so the mean weigh is 86 frams.

\item Estimate the percentage of fleegs that weigh more than the
median.

The meaning of ``median'' tells me that just
50\% of the fleegs are heavier than the median weight.

\item Estimate the percentage of fleegs that weigh more than the
mean.

86 frams is 6/20 or three tenths of the way into the 80-100 fram range, so I'd
estimate that the percentage of fleegs weighing more than that is
\begin{equation*}
   \frac{7}{10}10\% + 50\% = 57\%.
\end{equation*}

\item Estimate the percentage of fleegs that weigh more than the
mode.

I estimated the mode as 110 frams. Half the fleegs in the highest
range, or 25\%, weigh more than that.
\end{abcd}

\end{sol}

\end{exx}

\begin{exx}{\hassolution\sref{averagesfromhistograms}
\gref{descriptivestatistics}}
\erdos{} numbers.
\index{Erd\H{o}s, Paul}

Paul \erdos{} (1913-1996) was the most prolific mathematician of the
twentieth century. He was famous (in mathematical circles) for the way
he worked --- he travelled from school to school, writing joint papers
with mathematicians at each.%
\begin{teacher}
Your students will probably not think this topic is useful or
interesting but it might be fun if introduced in class. 
\end{teacher}

From Wikipedia:
%
\begin{quotation}
An \erdos{} number describes a person's degree of separation from \erdos{} 
himself, based on their collaboration with him, or with another who
has their own \erdos{} number. \erdos{} alone was assigned the \erdos{} number
of 0 (for being himself), while his immediate collaborators could
claim an \erdos{} number of 1, their collaborators have \erdos{} number at
most 2, and so on.%
\begin{csmr}
Paul Erd\H{o}s,
\url{en.wikipedia.org/wiki/Paul_Erdos}
\access{July 22, 2015}.
\csmrcomment{Wikipedia, no permission needed}
\end{csmr}
\end{quotation}

You can think of the 
\erdos{} Number Project as a description of the social network of
mathematicians. Its home page is 
\url{www.oakland.edu/enp}; there you can find out that

\begin{quotation}
\ldots the median \erdos{} number is 5; the mean is 4.65, and the
standard deviation is 1.21.
\end{quotation}

Table~\ref{table:erdos} shows the raw data.

\begin{table}
\centering
\begin{tabular}{S[table-format=2.0]S[table-format=5.0]}
\toprule
{\erdos{} number} & {mathematicians} \\
\midrule
 0 & 1 \\
 1 & 504 \\
 2 & 6593  \\
 3 & 33605  \\
 4 & 83642  \\
 5 & 87760  \\
 6 & 40014  \\
 7 & 11591  \\
 8 & 3146  \\
 9 & 819  \\
 10 & 244  \\
 11 & 68  \\
 12 & 23  \\
 13 & 5 \\
\bottomrule
\end{tabular}
\begin{csmr}[\erdos{} numbers\label{table:erdos}]
  The distribution of Erd\H{o}s numbers,
\url{wwwp.oakland.edu/enp/trivia/}
\access{October 11, 2015}.
\end{csmr}
\end{table}

\begin{abcd}
\item Use Excel to draw a histogram for the distribution of \erdos{}
  numbers. 
\item What is the mode of this distribution?
\item Verify the claims for the median and mean.
\item Verify the claim for the standard deviation.
\item Professor Bolker (one of the authors of this book) has an 
\erdos{} number of 2 (he wrote a paper with Patrick O'Neil, who
wrote a paper with \erdos{}). Why is Professor Mast's 
\erdos{} number at most 3? How might it be less than 3?
\item How many mathematicians have a finite \erdos{} number?
\item There are some mathematicians whose 
\erdos{} number is infinite. How can that be?

\end{abcd}


\begin{sol}
\begin{abcd}
\item Use Excel to draw a histogram for the distribution of \erdos{}
  numbers. 

See \slink{ErdosSolution.xlsx}.

Making the values in cells \excel{A6:A13} the labels on the bars is tricky.	
To do that, build the chart using only \excel{B6:B13} for data. Then select	
the data (right click) and edit the field on the right for category
axis labels.

\item What is the mode of this distribution?

The mode is 5; it's the most common 
\erdos{} number.

\item Verify the claims for the median and mean.

There are 268,015 mathematicians connected (directly or indirectly) to 
\erdos{}. Arranging them in order of their
\erdos{} numbers, the middle one has number 5. So that's the median.

I used the spreadsheet to calculate the mean: it's 4.65.

\item Verify the claim for the standard deviation.

I found the formula for the standard deviation on Wikipedia (where
mathematics is usually trustworthy even if politics isn't). Then I
calculated it with Excel formulas. (Excel's \excel{=STDDEV()} function
won't work for grouped data, for the same reason that the
\excel{=AVERAGE()} doesn't work.) I got 1.21.

\item Professor Bolker (one of the authors of this book) has an 
\erdos{} number of 2 (he wrote a paper with Patrick O'Neil, who
wrote a paper with \erdos{}). Why is Professor Mast's 
\erdos{} number at most 3? How might it be less than 3?

Professor Mast has a connection of length three because of the
coauthor chain Mast --- Bolker --- O'Neil --- \erdos{}. If she has
written a paper or book with some other mathematician who has an
\erdos{} number of 1 then her \erdos{} number would be 2. She hasn't
written a paper with \erdos{}, so her \erdos{} number isn't 1.

\item How many mathematicians have a finite \erdos{} number?

This is just the total in the data: 268,015. 

\item There are some mathematicians whose 
\erdos{} number is infinite. How can that be?

Mathematicians who have never coauthored a paper are not connected to 
\erdos{} at all. That's also true even if they have written joint
papers but none of their coauthors is connected to \erdos{}. We say
they have an infinite \erdos{} number. It's not really infinite;
that's a kind of in joke among mathematicians.

\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\hassolution\artificial\sref{averagesfromhistograms}
\gref{histogramaverages}}
\myindex{Ruritania}.

Find the mean, median and mode age for male residents of
Ruritania
using the histogram or the data in the spreadsheet
\link{Ruritania.xlsx}.
(Ruritania is a fictional country in central
Europe which forms the setting for \emph{The Prisoner of Zenda}, a
fantasy novel written by Anthony Hope.)

\begin{sol}
The mode is easy --- it's 0-9 years, since that's the longest bar in
the histogram.

Adding the percentages in each category I see that the first three
come to exactly 50\% (in Excel, where the numbers are rounded for
display but not for computation). So the median Ruritanian age is
about 30 years (half the people are younger, half older).

I used Excel to compute the weighted average of the midpoints of the
categories (5, 15, \ldots) and found the mean age to be about 36 years.

\end{sol}

\end{exx}


\begin{exx}{\artificial\hassolution\sref{percentiles}\gref{percentiles}}
Wing Aero percentiles.

\begin{abcd}

\item How many Wing Aero employees are in the bottom tenth percentile
in salary?

\item What is the salary cutoff for the bottom tenth percentile?

\item Answer the same questions for the top tenth percentile.

\end{abcd}

\begin{sol}

\begin{abcd}

\item How many Wing Aero employees are in the bottom tenth percentile
in salary?

By definition, 10 percent of the Wing Aero employees are in the
bottom tenth percentile in the salary distribution. Since there are 30
employees, 3 are in the bottom tenth percentile.

\item What is the salary cutoff for the bottom tenth percentile?

The salaries of the three workers in the bottom tenth percentile are
\$17K, \$19K and \$21K. The next 
highest salary is \$25K, so any number between that and \$21K can be
thought of as the cutoff point for the bottom tenth percentile.

\item Answer the same questions for the top tenth percentile.

The three employees in the top tenth percentile 
are the CEO, the CFO and one of the other two
executives. The salary cutoff for the top tenth percentile is \$250K.

\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\untested\sref{marginoferror}\gref{normaldistribution}
\gref{descriptivestatistics}}
Sick-leave proposal.
\index{sick leave}

An article in the \emph{Orlando Sentinel} on August 6, 2012 discussed a
ballot initiative that would require employers with 15 or more workers
to provide paid time off for employees for illness-related issues.
The article polled voters to gauge support for placing this question
on the ballot for the November election, and noted that  
among likely voters, 67 percent supported the initiative while 26
percent opposed it.

The article reported that the poll surveyed 500 people and had a
margin of error of 4.4 percentage points.%
\begin{csmr}
M. Schlueb and D. Damron, 
Activists press officials to put sick-leave proposal to voters,
\emph{Orlando Sentinel} (August 6, 2012),
\url{articles.orlandosentinel.com/2012-08-06/news/os-sick-leave-ballot-race-20120806_1_ballot-language-signatures-sick-time}
\access{July 22, 2015}.
\csmrcomment{paraphrase}
\end{csmr}

\begin{abcd}
\item  Why don't the percentages from this poll add up to 100\%?   
\item  Explain why this statement is not true:  ``67\% of the
residents likely to vote in November support the measure." 
\item  Explain what the 4.4 percentage point margin of error means for
this poll.   
\end{abcd}
\end{exx}

\begin{exx}{\hassolution} 
The Boston Marathon.
\index{marathon!Boston}
\index{Boston marathon| see {marathon}}
Table~\ref{table:marathonrunners}
contains data for the numbers of men and
  women who finished the 2012 Boston marathon, grouped by finishing
  times. For example, 26 men and one woman finished with a time
  between two and  two and a half hours. (That one woman was a
  wheelchair racer.) 

\begin{table}
\centering
\begin{tabular}{cS[table-format=4.0]S[table-format=4.0]}
\toprule
Finishing time &  {Men} & {Women} \\
\midrule
2:00-2:30 & 26 & 1 \\
2:30-3:00 & 444 & 27 \\
3:00-3:30 & 1844 & 260 \\
3:30-4:00 & 3389 & 1714 \\
4:00-4:30 & 2819 & 2833 \\
4:30-5:00  & 1861 & 1966 \\
5:00-5:30  & 1068 & 1013 \\
5:30-6:00 & 607 & 609 \\
6:00-6:30  & 323 & 339 \\
6:30-7:00  & 160 & 162 \\
\bottomrule
\end{tabular}
\caption{The 2012 Boston marathon}
\label{table:marathonrunners}
\end{table}

We've entered the data in the spreadsheet \link{Marathon2012.xlsx}.

Answer the following questions. Do as much of the arithmetic in Excel
as possible.


\begin{abcd}

\item Sketch a neat histogram for this data.

\item Draw your histogram with Excel. Does it match your sketch?

\item How many men finished the marathon? How many women? 

\item 
Use the data to estimate the mode, median and mean for the men's
finishing times. Mark these times on the handwritten histogram sketch.


\item Suppose my friend ran the marathon and finished ahead of half
  the men. What was his finishing time (approximately)?

\item About what percentage of the women finished ahead of half the men?
\end{abcd}

\begin{hint}
Warning: 2:30 is 2 hours and 30 minutes. That's 2.5 hours, not
2.3 hours.
\end{hint}

\begin{sol}

\begin{abcd}

\item Sketch a neat histogram for this data.

(not done)

\item Draw your histogram with Excel. Does it match your sketch?

See spreadsheet.

\item How many men finished the marathon? How many women? 

Men: 12541, women: 8924, computed with Excel \excel{=SUM()}.

\item 
Use the data to estimate the mode, median and mean for the men's
finishing times.

\begin{itemize}
\item Mode:

The men's mode finishing time is three to three and a half hours (the
highest bar), or about 3:15. 

\item Median:

I used Excel to discover that half the men finished in just over four
hours (45\% were faster than 4:00 and 67\% faster than 4:30). I'll
estimate 4:10 as the time for the middle runner.

\item Mean:

Using the midpoints of the intervals and doing the computations in
Excel, I estimate the mean as about 250 minutes, which is 4 hours and
10 minutes --- about the same as the median.

\end{itemize}

I marked these averages in the chart in the spreadsheet at
\mbox{\slink{Marathon2012Solution.xlsx}.}

\item Suppose my friend ran the marathon and finished ahead of half
  the men. What was his finishing time (approximately)?

His finishing time must have been the median time: about 4:10.

\item About what percentage of the women finished ahead of half the men?

I need to find the fraction of women who finished faster than
4:10. Only 22\% finished in less than four hours, and
about 54\% finished in less than 4:30, so I'll estimate that
40\% finished ahead of half the men.

This is really interesting --- the women's median is about 4:20, which
isn't much larger than the men's.

\end{abcd}

\end{sol}
\end{exx}

\begin{exx}{\untested}
Income growth.

%Raw data from
%www.bea.gov/iTable/iTable.cfm?reqid=70&step=1&isuri=1&acrdn=4#reqid=70&step=27&isuri=1&7028=-1&7083=levels&7031=0&7022=21&7023=0&7024=non-industry&7025=0&7026=25000&7001=421&7029=21&7090=70
%

On April 26, 2014 \theGlobe{} reported that 2012 per capita income in
Massachusetts grew to \$49,354, up 3.2\% from 2008, after adjusting
for inflation.

\begin{abcd}
\item How much was Massachusetts per capita income in 2008, in 2012
  dollars?

\item How much was Massachusetts per capita income in 2008, in 2008
  dollars?

\item This income figure is an average. Is it a mean, a median or a
  mode? Explain how you know.


\item Estimate the total 2012 income for Massachusetts.

\end{abcd}

\begin{hint}
If you don't know what ``per capita'' means, look it up.
\end{hint}

\end{exx}

\begin{exx}{\hassolution}
\myindex{Scrabble}.

The Wikipedia page 
\url{en.wikipedia.org/wiki/Scrabble_letter_distributions}
shows the point value of each of the 100 Scrabble tiles.

\begin{abcd}
\item Draw a bar chart illustrating the number of tiles with  each of
  the point values from 0 to 10. The $x$-axis labels should be 
%
\begin{center}
          0 1 2 3 4 5 6 7 8 9 10
\end{center}

The heights of the bars should correspond to the 
number of tiles with each value.

This is a difficult chart to create in Excel. Before you try, draw it
by hand, so you know what you want the end result to look like. What
Excel shows you first is likely to be far from your goal.

\item What are the mode, median and mean point values? Show them on your
 (hand written) chart.

\item What percentage of the tiles are worth more than 1 point?

\item What percentage of the tiles are worth less than the median number
 of points?

\item Answer these questions for Scrabble in some other language (your
  choice) and discuss the differences between that language and
  English.

\end{abcd}

\begin{sol}
\begin{abcd}
\item Draw a bar chart illustrating the number of tiles with  each of
  the point values from 0 to 10. (Do this by hand first, then in Excel.)

For the spreadsheet, see \slink{ScrabbleSolution.xlsx}. You'll find
instructions there for creating the proper kind of chart --- Excel
didn't guess correctly what was wanted.

\item What are the mode, median and mean point values? Show them on your
 (hand written) chart.

\begin{center}
\begin{tabular}{lS[table-format=1.1]l}
mode  &	1 & more tiles have this value than any other value \\
median & 1 &half the tiles are worth at most 1, half at least 1 \\
mean & 1.87 & this is the "average" value of a tile --- the sum of \\
     &      &  \excel{D8:D18} divided by the number of tiles (the sum of \excel{B8:B18})
\end{tabular}
\end{center}

\item What percentage of the tiles are worth more than 1 point?

23\%.

\item What percentage of the tiles are worth less than the median number
 of points?

Usually, 50\% are less than the median by definition. In this
particular case when the median, 1,  is so common (in fact it's the mode)
only 2\% of the tiles are less valuable.

\item Answer these questions for Scrabble in some other language (your
  choice) and discuss the differences between that language and
  English.

I haven't done this.

\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\untested\gref{percentiles}}
Who's gifted?

The blog 
\url{giftedissues.davidsongifted.org/BB/ubbthreads.php/topics/152941/Re_Innumeracy_in_Gifted_Educat.html}
offers two very interesting quotations:%
\begin{csmr}
User kmbunday, 
Two examples of innumeracy in books for parents about gifted children,
Davidson Institute (April 8, 2013),
\url{giftedissues.davidsongifted.org/BB/ubbthreads.php/topics/152941/Re_Innumeracy_in_Gifted_Educat.html}
\access{July 29, 2015}.
\csmrcomment{no direct quote here}
\end{csmr}

From
\emph{The Everything Parent's Guide to Raising a Gifted Child} by
Sarah Robbins (p 125):

\begin{quotation}
Unfortunately, highly gifted children (those in the 95th
    percentile) only occur in approximately 1 out of 1,000
    preschoolers, and profoundly gifted children (those in the 99.9th
    percentile) are as rare as 1 in 10,000 preschoolers.
\end{quotation}

From
\emph{Giftedness 101} by Linda Silverman (p 87):
\begin{quotation}
In our mushrooming populace, over 3 million Americans and
  approximately 70 million global citizens are highly gifted or beyond
  (99.9th percentile).
\end{quotation}

What is wrong with the arithmetic?
\end{exx}

\begin{exx}{\hassolution}
Many flights arrive early!

The spreadsheet at \link{ArrivalDelays.xlsx}
contains data on how many minutes late American Airlines flights to
Boston's \myindex{Logan airport} were in January, 2014. 

\begin{enumerate}

\item What does a ``negative delay'' mean?

\item Later you'll be asked to draw a histogram of this data in Excel. 
Sketch a neat approximate version first, with proper titles and
reasonable scales for both axes and a proper title for the whole
chart. You don't need to draw all the bars! 

\item Draw your histogram with Excel. Does it match your sketch?

\item How many flights were counted in this data?

\item What percentage of the flights arrived on time?

\item Use the data to estimate the mode, median and mean arrival
delay. Show these values on your histogram sketch.

\item Flights that are more than two hours late are \emph{outliers}
  --- the delay is probably not American Airlines' fault. Estimate the
  mode, median and mean arrival delays if you don't include the outliers.

\end{enumerate}

\begin{sol}

\begin{enumerate}

\item What does a ``negative delay'' mean?

A negative arrival delay means the flight arrived early.

\item Later you'll be asked to draw a histogram of this data in Excel. 
Sketch a neat approximate version first, with proper titles and
reasonable scales for both axes and a proper title for the whole
chart. You don't need to draw all the bars! 

Not done here.

\item Draw your histogram with Excel. Does it match your sketch?

See the spreadsheet \slink{ArrivalDelaysSolution.xlsx}

\item How many flights were counted in this data?

There were 861 American Airlines flights to Logan in January 2014.

\item What percentage of the flights arrived on time?

$534/861 = 62\%$ of the flights were on time (or early).

\item 
Use the data to estimate the mode, median and mean arrival
delay. Show these values on your histogram sketch.

\begin{abcd}
\item Mode: $-10$ to 0 minutes (highest bar).
\item Median: $-10$ to 0 minutes (flight number 430 out of 861).
\item Mean: 6.5 minutes late, computed in the spreadsheet as a
  weighted average.
\end{abcd}

\item Flights that are more than two hours late are \emph{outliers}
  --- the delay is probably not American Airlines' fault. Estimate the
  mode, median and mean arrival delays if you don't include the outliers.

I redid the calculations in the spreadsheet using only the flights
whose delays were less than 120 minutes. That omitted 20 flights.

\begin{abcd}
\item Mode: unchanged at -10 to 0 minutes (highest bar)
\item Median: unchanged -10 to 0 minutes (flight number 420 out of 841)
\item Mean: 2.0 minutes late, computed in the spreadsheet as a
  weighted average.
\end{abcd}

\end{enumerate}

\end{sol}

\end{exx}

\begin{exx}{\hassolution}
Quotes in \commonsense.

The Excel file \link{CSMquotes.xlsx}  contains data on the number of
words in quotes used in an early draft of this text.

\begin{abcd}
\item Create a properly labeled histogram displaying the data. You
  may sketch the histogram with pencil and paper, or use Excel.

\item  Calculate the total number of quotes.

\item Estimate the total number of words in the quotes.

\item  Estimate mode, median and mean quote sizes, and mark them on
  your histogram.

\item  Explain why the mean is the largest of the three averages.

\item  Estimate the total number of words in the text.

\item  Estimate the percentage of words in the text that are in quotes.

\end{abcd}

\begin{sol}

\begin{abcd}
\item Create a properly labeled histogram displaying the data. You
  may sketch the histogram with pencil and paper, or use Excel.

See \slink{CSMquotesSolution.xlsx}.

\item  Calculate the total number of quotes.

From Excel: 192.

\item Estimate the total number of words in the quotes.

From Excel: $11740 \approx 12,000$.

\item  Estimate mode, median and mean quote sizes, and mark them on
  your histogram.

From Excel:

\begin{center}
\begin{tabular}{ll}
mode &	20-30 words per quotation \\
median &	just over 50 words \\
mean &	61		
\end{tabular}
\end{center}

\item  Explain why the mean is the largest of the three averages.

The skew in the data is just like that for income distribution. The
few long quotes increase the mean without affecting the mode or the
median.

\item  Estimate the total number of words in the text (you can use the
  online copy of the text in the lab if you wish).


\item  Estimate the percentage of words in the text that are in
  quotes. 

The text has about 300 pages.
I estimate 30 lines per page (low, but there are blank pages and
figures), 20 words per line for a total of 180,000 words.
With those estimates the fraction of words in quotes is
0.06522, so about 7 percent.
\end{abcd}

\end{sol}

\end{exx}

\begin{exx}{\hassolution\worthy}
Ricky's tacos.

A story in \theGlobe{} on February 7, 2015 stated that

\begin{quotation}
Food prices over the past year have increased at four times the rate
of overall inflation, with fresh products, such as meat, vegetables,
and dairy, soaring even faster. Ground beef prices, for example, are
%CHANGE moved period   ago .Shoppers at local grocery stores
up about 20 percent from a year ago. Shoppers at local grocery stores
have felt the sharp rise in prices, but for Ricky Reyes, owner of the
taqueria on Dorchester Avenue, costlier ingredients mean it is getting
harder to keep the price of his signature beef taco down.%
\begin{csmr}
M. Woolhouse,
A Boston taco tells the tale of far-reaching food cost woe,
\emph{The Boston Globe}
(February 06, 2015).
\url{www.bostonglobe.com/business/2015/02/05/food-prices-spike-increasing-cost-taco/vU3c42L99X9fBt25opSKkO/story.html}
\access{December 16, 2015}.
\end{csmr}
\end{quotation}

\figfile{tacoscropped.pdf}
\begin{table}
\centering
\framebox{
\includegraphics[width=4in]{\thefigurefilename}
}
\caption{Taco costs}
\label{table:tacos}
\end{table}
\figfile{}

Use Table~\ref{table:tacos} to answer the following questions. 
We've entered the data in the spreadsheet \link{tacos.xlsx}.

\begin{abcd}
  
\item Is the \emph{Globe} correct about the percent increase in the
  cost of beef? 

\item Fill in the column showing the percent increase in cost of each
  of the ingredients. 

\item Find the cost of a pound of taco filling in 2013 and 2014. Then
  find the percent increase in the cost of the filling. 

\item One way for Mr. Reyes to reduce the cost increase would be to
  change the percentages of meat and cheese, keeping the lettuce and 
  tomato the same. What would the percent of each be if he wanted to
  keep the increase in a pound of filling to just 10\%? 


\item Do you think customers would notice if Mr. Reyes changed the
  recipe using your answer to (d)? 

\end{abcd}

\begin{hint}
For part (d), use guess-and-check in Excel. Set up the computation so
that when you change the percentage of meat the percentage of cheese and
the total change automatically.
\end{hint}

\begin{sol}

My solution is in the spreadsheet \slink{tacoSolution.xlsx}.
I've copied the answers here.

\begin{abcd}
  
\item Is the \emph{Globe} correct about the percent increase in the
  cost of beef? 

Yes. 4.16/2.46 = 1.2023,  which is ``about a 20\% increase''.

\item Fill in the column showing the percent increase in cost of each
  of the ingredients. 

I did that in column \cell{D}.

\item Find the cost of a pound of taco filling in 2013 and 2014. Then
  find the percent increase in the cost of the filling. 

Column \cell{G} (\cell{H}) has the 2013 (2014) cost of \$3.13
(\$3.56). Increase is 13.9\%. 

\item One way for Mr. Reyes to reduce the cost increase would be to
  change the percentages of meat and cheese, keeping the lettuce and 
  tomato the same. What would the percent of each be if he wanted to
  keep the increase in a pound of filling to just 10\%? 

I played with the meat percent in cell \cell{I7}, 
adjusting cheese percent in I9 automatically, until \cell{J12} was about 1.1.
The answer: 54\% meat instead of 45\%, 11\% cheese
instead of 20\%.


\item Do you think customers would notice if Mr. Reyes changed the
  recipe using your answer to (d)? 

I think they'd notice that there's only about half as much
cheese. I think they would rather pay a little more for their favorite taco.

\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\hassolution}
State populations.

Figure~\ref{fig:statepopulations} shows the U. S. population
distribution among the 50 states based on the 2010 U.S Census.

\figfile{populationscropped.pdf}
\begin{figure}
\centering
\includegraphics[width=4.5in]{\thefigurefilename}
\caption{State populations}
\label{fig:statepopulations}
\end{figure}
\figfile{}

\begin{abcd}

\item Recreate these data in Excel as both a table and a
  histogram. 

\item What is the modal population of states? 

\item What is the median population of states? 

\item What is the mean population of states? How did you find it?

\item How good an estimate is the mean you calculated in (d)?

\end{abcd}


\begin{sol}

\begin{abcd}

\item Recreate these data in Excel as both a table and a
  histogram. 

See \slink{StatePopulationSolution.xlsx}.

\item What is the modal population of states? 

The highest bar is the first one, counting states with fewer than 4
million people, so the modal population is 0-4 million. I'd count 2
million as a correct answer.

\item What is the median population of states? 

There are fifty states and just half of them have a population less
than 5 million, so 5 million is the median. 

\item What is the mean population of states? 

I calculated the mean in Excel as a weighted average of the midpoints
of each of the population ranges. The result was 6.7 million people.

\item How good an estimate is the mean you calculated in (d).

The relative error is better than the absolute error to discuss this
estimate.

I looked up the U. S. population in the 2010 census: there were 318
million people counted.

The true mean is the total population divided by the number of states:
318/50 = 6.36 million people. To compare that with the 
the 6.7 million estimate from the histogram I computed 
6.7/6.36 = 1.053. The histogram estimate is about 5\% too big. 

I could have found the same answer by comparing the estimated total
population from the histogram --- 335 million --- to the actual 318
million population.

\end{abcd}

\end{sol}

\end{exx}

\begin{ReviewExercises}

\begin{rexx}{\routine\sref{wingaero}\gref{excelaverages}\gref{excelroutine}}  

Create an Excel spreadsheet and put the following numbers in the first
column.  
%
\begin{center}
14  15  22  50  0  33  16  18  23  40  47
\end{center}

\begin{abcd*}
\item Use Excel to find the mean, median and mode of these numbers
\item  Change the first number from 14 to 23.  How do the Excel
  calculations change? 
\item  Click the ``undo" button and confirm that Excel reverts back to
  the original set of numbers. 
\item  Change the last four numbers to 0 (so that the data now read 
%
\begin{center}
14  15  22  50  0  33  16  0  0  0  0
\end{center}

How do the different averages change?  Explain how the data are skewed.
\end{abcd*}
\end{rexx}

\end{ReviewExercises}

\setexercisecounter{}

\begin{ExtraExercises}
 

\begin{exx}{\untested\sref{mmm}\gref{skew}}
Car and truck prices.

The article on new car and truck prices that we studied in
\sref*{weightsmatter} first asserts that 

\begin{quotation}
\ldots
the average price of a new
vehicle in the second quarter [of 2008] fell 2.3 percent from a year
earlier to \$25,632 \dots
\end{quotation}

and later

\begin{quotation}
The result is the average new vehicle now costs less than
40 percent of an average household's median annual income, the analysts said,
whereas from 1991 to 2007, it would cost more than half of the median
income.%
\begin{csmr}
New car prices fall at fastest rate ever,
Associated Press reported in \theGlobe{} (September 5, 2008)
\url{www.boston.com/business/articles/2008/09/05/new_car_prices_fall_at_fastest_rate_ever}
\access{July 22,, 2015}.
\csmrcomment{Globe, OK}
\end{csmr}
\end{quotation}

Verify as much of this last assertion as you can.
\end{exx}




\begin{exx}{\untested\artificial\sref{barcharts}\gref{excelroutine}
\gref{excelwhatif}\gref{excelchart}}
Enrollments.

The final enrollment report for the past year at an unnamed small
college provided the following information about students:  450
freshmen,  421 sophomores; 400 juniors and 
511 seniors. 

\begin{abcd*}
\item Create an Excel spreadsheet containing this data.  Label the
columns appropriately. 

\item  Ask Excel to calculate the total number of students enrolled
during the past year.  Label this result.
 
\item  Create a properly labeled bar chart of the student data.

\item  A corrected enrollment report noted that there were 419
juniors.  Make that adjustment in your spreadsheet and check that the
other information (total number of students, bar chart) is correctly
updated.

\item  Using this new information, ask Excel to calculate the
percentage of students who are freshmen, sophomores, juniors and
seniors. Copy and paste so that you type as few formulas as possible.

\item Create a new bar chart displaying the percentages.

\item  Convert your bar chart to a pie chart.  

\end{abcd*}
\end{exx}

\begin{exx}{\untested\artificial\sref{percentiles}\gref{percentiles}}
\myindex{SAT} percentiles.

A student received this notification on his college entrance exam:

\begin{center}
\begin{tabular}{ll}
English Language Arts: &	77th percentile  \\
Mathematics:  	&	88th percentile 
\end{tabular}
\end{center}

Explain this report in everyday language. 

\begin{hint}
Your answer might begin ``More than three quarters of the students
taking this test \ldots''
\end{hint}

\end{exx}

\begin{exx}{\untested\sref{mmm}\gref{excelaverages}}
Comparing the states.

You can do this exercise using Excel, or with properly documented
research. (Your instructor may specify one method or the other.)

\item Find the mean, median and mode for the populations of the 50
states. 

\item Display the answers to the previous question on a properly
labeled histogram. Discuss your findings --- is the distribution
skewed?

\item Redo parts (a) and (b) for the areas of the states.

\item Redo parts (a) and (b) for the population densities
(people per square mile).

\end{exx}

\begin{exx}{\needsquestions\sref{marginoferror}\gref{descriptivestatistics}}
Jellybean margin of error.
\index{jellybeans}

\url{andrewgelman.com/2011/08/that_xkcd_carto/}
\end{exx}


\begin{exx}{\untested\sref{averagesfromhistograms}\gref{excelchart}
\gref{weightedaverage}\gref{histograms}\gref{histogramaverages}}
Reputation on \myindex{stack exchange}.

Stackexchange.com (\url{stackexchange.com}) is a network of online
question and answer websites. Users who post questions and provide
answers earn reputation based on community
feedback. Table~\ref{table:stackexchange} shows the number of users
with reputations in certain ranges on January 6, 2013 for all
stackexchange sites and for 
the particular site \url{tex.stackexchange.com/}
where the authors have asked and answered questions about the \TeX{}software
used to prepare the manuscript for \commonsense.


Estimate the mode, median and mean for each distribution. This is
subtle in several ways. The bucket sizes vary. Data at the top and
bottom end of the range are very scarce. Ask about sensitivity to the
assumptions made there about the actual means for the top and bottom
categories. 

\begin{table}
\centering
\begin{minipage}{2.5in}
  \centering
\begin{tabular}{rS[table-format=7.0]}
\toprule
Reputation & {Users} \\
\midrule
100,000+ &97 \\
50,000+ &297 \\
25,000+ &938 \\
10,000+ &3,249 \\
5,000+ &6,874 \\
3,000+ &11,150 \\
2,000+ &15,650 \\
1,000+ &24,867 \\
500+ &34,857 \\
200+ &45,107 \\
1+ &1,478,007 \\
\bottomrule
\end{tabular}
\par
All sites

\end{minipage}
\begin{minipage}{2.5in}
  \centering
\begin{tabular}{rS[table-format=5.0]}
\toprule
Reputation & {Users} \\
\midrule
100,000+ &2 \\
50,000+ &14 \\
25,000+ &24 \\
10,000+ &60 \\
5,000+ &91 \\
3,000+ &132 \\
2,000+ &179 \\
1,000+ &281 \\
500+ &395 \\
200+ &591 \\
1+ &18,417 \\
\bottomrule
\end{tabular}
\par
\TeX --- \LaTeX{}
\end{minipage}
\caption{Stackexchange reputation}
\label{table:stackexchange}
\tablesource{  \url{stackexchange.com/leagues/1/week/stackoverflow/2012-12-30?sort=reputationchange&page=1}, 
\url{stackexchange.com/leagues/29/week/tex/2012-12-30?sort=reputationchange}
}
\end{table}

\end{exx}

\begin{exx}{\untested\sref{piecharts}\gref{excelchart}}
Deceptive pie charts.
\index{pie chart}

Build the pie chart from \sref*{piecharts}.

\begin{abcd}
\item Experiment with the pie chart features to make it look like the
  managers' salaries are the largest.   You can't actually change the
  data to do this --- you need to use the 3D and other pie chart
  features to make it look like the managers' salaries are
  large. \item  Play around with different types of charts in Excel.
  Find a chart type and an effect (3D, most likely) that really
  distorts the data. 
\end{abcd}
\end{exx}
 


\begin{exx}{\untested\complex\sref{histograms}
\gref{excelchart}\gref{histograms}}
River lengths.

\begin{teacher}
There should be a good Benford's law exercise somewhere, maybe
  here. But spending time on this peculiar phenomenon is probably not
  worth the time it takes away from other more useful topics.
\end{teacher}

At \url{en.wikipedia.org/wiki/List_of_rivers_by_length}
Wikipedia offers a chart of 163 major rivers, organized by length. 

\begin{abcd}

\item
Construct a bar chart with nine categories, using the first digit of
the length of the river to determine the category.

You might expect all the bars to be the same height,
since there are nine possible starting digits. But they're not. In
fact, there are no rivers longer than 6000 km.

The fact that for the short rivers there are more whose length begins
with small digits is an instance of \emindex{Benford's law}. You can
look up more about it if you're curious.

\item Use Excel to create new columns with river lengths measured in
  yards, in feet and in inches. Draw each of those bar charts. Discuss
  what you see.
\end{abcd}

\end{exx}

\begin{exx}{\sref{averagesfromhistograms}
\gref{histogramaverages}\gref{histograms}}
Fight for the Senate (2016).

A graph like the one in Figure~\ref{fig:senate2016}
appeared in Nate Silver's 538 website
on November 7, 2015.\index{Silver, Nate}
The $y$-axis displays the number of seats held by each
party, the $x$-axis the probability of that outcome.

\figfile{538SenatePrediction.png}
\begin{figure}
\centering
\includegraphics[width=3.5in]{\thefigurefilename}
\begin{csmr}[The fight for the Senate\label{fig:senate2016}]
Screen capture, 
\url{projects.fivethirtyeight.com/2016-election-forecast/senate/?ex_cid=2016-forecast}
\access{November 7, 2016}.
\end{csmr}
\end{figure}
\figfile{}

Nate Silver constructed this histogram by imagining
(simulating) many thousands of elections and recording the percentage
of time each Democratic/Republican split occurred. 
%We estimated the percentages in his chart and entered them 
%in the spreadsheet
%\link{Oct31SenateProjection.xlsx} so you don't have to type them
%yourself. (We rounded the really tiny percentages to zero.)
%Use Excel whenever it's most
%convenient for you.
%
\begin{abcd}
\item What is the most likely number of Democratic senators?

\item What number of Democratic senators represents the mode of this
distribution?

\item What is the probability that there are more than 50
Democratic senators?

\item What number of Democratic senators is the median of this
distribution?

\item If you had the complete list of all Nate Silver's imagined
elections and sorted it by the number of Democratic senators, how many
Democratic Senators would there be in the middle election on that
list?

\item Use Excel to compute the (weighted) average number of Democratic
senators for these imagined elections.

\item What actually happened in the election?

\end{abcd}

\end{exx}

\begin{exx}{\needsquestions}
Boston's payroll.

This histogram appeared in \theGlobe{} on February 14, 2017.

\figfile{Boston14payroll_graphic.jpg}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\begin{csmr}[Boston Municipal Payroll\label{fig:bostonpayroll}]
\url{www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html}
\access{February 14, 2017}.
\end{csmr}
\end{figure}
\figfile{}

The raw is available on the web. There's an anonymized copy
at \link{Boston14Payroll.xlsx}%
\begin{csmr}
City of Boston,
Employee Earnings Report 2014,
\url{data.cityofboston.gov/Finance/Employee-Earnings-Report-2014/4swk-wcg8}
\access{February 21, 2017}.
\end{csmr}.

The histogram and the spreadsheet suggest many possible
questions. Some suggestions:

\begin{abcd}
\item Estimate bar heights from the graphic.

\item Estimate median and mean, with Excel or otherwise.

\item Discuss how you would report the mode in order to convey the
  most information.


\item Check the estimates of the median and mean by looking at the raw
  data. 

\item Check estimates of the bar heights by reconstructing the histogram
  from the raw data. 

\end{abcd}
Ask the usual questions: mode, median, mean. Bimodality is
interesting.

\end{exx}

\begin{exx}{\needsquestions}
Votes weighted by IQ.
\index{IQ}
\index{voting}

An answer to the question ``What would be the possible issues with an
IQ based voting system?'' at
\url{worldbuilding.stackexchange.com/questions/83199/what-would-be-the-possible-issues-with-an-iq-based-voting-system}
offers this: 

\begin{quotation}
If you're giving everyone exactly as many votes as their IQ --- the
effect on the actual vote doesn't appear to be very much at all. See
the table in 
[Figure~\ref{fig:iqvoting}] for a population of 100 million to
illustrate --- the 
vote share for IQ 120 + is slightly higher than with a normal
democracy, but the fact there are exponentially fewer people in these
higher intelligence brackets means that the linear multiplier on their
vote weight has less and less of an effect. 
\end{quotation}

The comments are interesting too.

\figfile{IQVoting.png}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\begin{csmr}[Votes weighted by IQ\label{fig:iqvoting}]
\url{worldbuilding.stackexchange.com/questions/83199/what-would-be-the-possible-issues-with-an-iq-based-voting-system}
\access{June 12, 2017}.
\end{csmr}
\end{figure}
\figfile{}

\end{exx}

\begin{exx}{\untested\needsquestions}
How cold is it really?
\index{wind chill factor}

Figure!\ref{fig:windchill} shows  the National Weather Service
calculation for how cold it feels 
in terms of the temperature and the speed of the wind.

The formula there came from research done by 
Maurice Bluestein; you can read the story in his obituary at
\url{www.nytimes.com/2017/09/14/science/maurice-bluestein-who-modernized-the-wind-chill-index-dies-at-76.html}

There's a spreadsheet you can play with at \link{windchill.xlsx}.

\theTimes{} 
\figfile{windchillchart3.pdf}
\begin{figure}
\centering
\includegraphics[width=4in]{\thefigurefilename}
\begin{csmr}[Calculating the Wind Chill\label{fig:windchill}]
\url{www.nws.noaa.gov/om/cold/wind_chill.shtml}
\access{September 15, 2017}.
\end{csmr}
\end{figure}
\figfile{}


\end{exx}

\begin{exx}{\untested\worthy\needsquestions}
``Average'' household wealth.

On December 8, 2017 you could read in a  Associated Press Report in
\theGlobe{} headlined
``Surging stocks lift US wealth, yet most still trail '07 peak'' that

\begin{quotation}
Surging stock prices and steady increases in home values powered
American household wealth to \$96.9 trillion this fall, the Federal
Reserve said Thursday. The gains, however, aren't widely shared.

\ldots

In 2016, the latest figures available, median household wealth was
still 34 percent below its prerecession, 2007 level. Average household
wealth, meanwhile, fully recovered from the downturn and was 7 percent
higher last year. The average figure is pulled up by very wealthy
families.

While average household wealth reached \$667,600 in 2016, net worth for
the median household was just \$78,100.
\begin{csmr}
C. Rugaber,
``Surging stocks lift US wealth, yet most still trail '07 peak'',
Associated Press,
\theGlobe, December 8, 2017,
\url{www.bostonglobe.com/business/2017/12/07/surging-stocks-lift-wealth-yet-most-still-trail-peak/DSb0ih1lSlzlz8tqA6t7QO/story.html}
\access{December 8, 2017}.
\end{csmr}
\end{quotation}

Note: There's even enough information here to determine the number of
households - for a sanity check.

Note: Is this a place for a discussion of the difference between
wealth and income?
\end{exx}

\begin{exx}{\untested}
Median age vs. mean age?

On September 9, 2018 Dante Ramos wrote in \theGlobe{} that

\begin{quotation}
According to research by Portland State University in Oregon, the
median age of voters in a Boston mayoral election is 51, more than 14
years older than the average adult in the city. 
\begin{csmr}
D. Ramos,
Young voters, claim your power,
\theGlobe, September 9, 2018,
\url{www.bostonglobe.com/opinion/2018/09/07/for-millennials-power-there-for-taking/NITLEtmQtk4Ecubk7XhozL/story.html}
\access{September 9, 2018}
\end{csmr}
\end{quotation}
\end{exx}

\begin{abcd}
\item
Why is it correct to use the median age rather than the average (mean)
age in this report?

\item
Why is it better to write ``older than tha average adult'' rather than
``greater than the average adult's age''?

\item Think of a situation where it would be more useful to know that
  mean age rather than the median age.
\end{abcd}
\end{ExtraExercises}

%https://www.nytimes.com/interactive/2018/08/04/upshot/up-birth-age-gap.html

