% Pisa/contents.tex
% Highway fuel use 1970-2000
%
% Figure 1. U.S. Highway Fuel Use Since 1970 Source: Energy
% Information Agency
%
% http://www.nhtsa.gov/cars/rules/rulings/CAFE/alternativefuels/background.htm
%
% http://cascadepolicy.org/pdf/VMT%20102109.pdf
%It is not clear theoretically whether vehicle travel causes
%economic activity or vice versa, or both to varying
%degrees. Thus, it is an empirical question whether, or by
%how much, economic activity will be affected by
%policies to restrict or tax vehicle use. The historical data
%should be examined for evidence of the direction of
%causality.
%
\chapter{\mychaptername}
\label{\here}
\tocnotetoo{
Complicated physical and social phenomena rarely behave linearly, but
sometimes data points lie close to a straight line. When that happens
you can use a spreadsheet to construct a linear approximation.
Sometimes that's useful and informative. Sometimes it's
misleading. Common sense can help you understand which.
}
\teachertag
\begin{teacher}
Yes, we teach how to find regression lines (using Excel). But our
approach stresses skepticism throughout. Rather than teaching this as
a tool they can use, we treat it as a tool often misused.

This part of this chapter, like the start of the last one, is written
as an Excel tutorial. If possible, students should follow along,
checking the steps using Excel as they read or as you lecture. 
\end{teacher}

\begin{goals}

\begin{goal}{regression}
Draw and interpret regression lines.
\end{goal}

\begin{goal}{roundofferror}
Recognize when rounding too much distorts conclusions.
\end{goal}

\begin{goal}{regressionnonsense}
Think about causation vs correlation.
\end{goal}

\end{goals}

\begin{chapterpix}


\begin{center}
\includegraphics[height=40mm]{\here/correlationXCD.png}
\end{center}

Source: \url{http://xkcd.com/}, licensed under a Creative Commons
Attribution-NonCommercial 2.5 License.  We need to negotiate.

Consider using the photo in 
Figure~\ref{fig:pisaanddata} or any other Pisa photo.

\end{chapterpix}

\begin{eb}
Maura: big changes in this Chapter. Pisa is gone (just an exercise),
climate change is central. Please read carefully.
\end{eb}

\qrsection[climatechange]{Climate change}

Climate change\index{climate change} (\myindex{global warming}) is a
current hot topic. How rapidly is the Earth's average temperature
increasing? What might the consequences be? What is the cause?
What might we do about it? Should we try?
The science is complex and the 
politics even more so. In a course like this we can't begin to unravel
those complexities. But for just a taste of the analysis, we will
briefly look at some data on the average temperature of the
Earth and the concentration of Carbon Dioxide (\cotwo{}) in the
atmosphere in recent history. The spreadsheet
\link{EarthData.xlsx} has data we downloaded from 
\url{http://www.earth-policy.org/data_center/C23}.%
\teachertag
\begin{teacher}
The students may all be interested in this topic, so they will
want to think about it. The consensus among climate scientists is that
it's real and anthropogenic, but the real science is complex. You
can't draw reliable conclusions from simple regressions like the ones
here. So treat this material respectfully and cautiously.
\end{teacher}

The chart on the left in Figure~\ref{fig:EarthData1} shows a
scatter plot of the average 
global temperature, in Celsius degrees, for the years
1960-2000. There is no formula for that relationship,
but the points seem to tend upwards (on
average). So we connected the dots to see the jagged rise and fall,
and then drew a line on the graph that looked like a reasonable
approximation for the trend. The result is on the right. Then we used
the line to predict a temperature of 14.48 degrees Celsius 
for 2010. In fact that average was 14.63 degrees Celsius. Given how up
and down the data are (despite the long term average trend) we could
hardly expect an accurate prediction.%
\footnote{We added the textboxes and the arrows to the spreadsheet to
  explain how we drew the line.}%
\teachertag
\begin{teacher}
If you are working on this section in a classroom where you can
project a spreadsheet onto a screen you can reach, you can eyeball the
regression line with a yardstick.
\end{teacher}

The line we drew is a \emindex{model} -- a mathematical construction
that approximates something in the real world. This particular model is
\emph{linear} -- 
the line that seems to match the best.  We could have used that model
in 2000 to make a prediction for 2010 -- an estimate of what the
temperature might be in a future year for which we didn't (at the
moment) have data. 

%original image sizes are 4in by 4.64 and 5.5. Scaled by 0.61 to
%figure out minipage sizes
\begin{figure}[ht]
\centering
\begin{minipage}{2.83in}
  \centering
  \includegraphics[height=2.4in]{\here/TemperatureScatterplotcropped.pdf}
\end{minipage}
\begin{minipage}{3in}
  \centering
  \includegraphics[height=2.4in]{\here/TemperatureGuesscropped.pdf}
\end{minipage}
\caption{Global Average Temperature, 1960-2000}
\figsource{Charts from Excel spreadsheet we built.}
\label{fig:EarthData1}
\end{figure}

Excel knows the mathematics for finding the model line we guessed at
``by eye''. Figure~\ref{fig:RegressionScreenshot1} shows how: select
the chart, select \excel{Layout} from \excel{Chart Tools}, select
\excel{Trendline} and then \excel{Linear Trendline}. Excel draws the
second line shown in Figure~\ref{fig:EarthData2}.

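Not quite. That figure also shows the line extended to 2010, along
with its equation and $R^2$ value.
Figure~\ref{fig:RegressionScreenshot2} shows how to
\emph{format} the trendline to add those: select it (by right clicking);
select \excel{Format Trendline \ldots}; set
\excel{Forecast Forward} to 10 periods (10 years).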
Check the boxes for \excel{Display Equation} and \excel{Display
  R-squared value} -- we will need that data soon.
You can change the \excel{Trendline Name},
\excel{Line Color} and \excel{Line Style} if you wish.

\begin{figure}[ht]
\centering
  \includegraphics[height=5in]{\here/RegressionScreenshot1}
\caption{Adding a Trendline to a Chart}
\figsource{Excel screen capture}
\label{fig:RegressionScreenshot1}
\end{figure}

\begin{figure}[ht]
\centering
	\includegraphics[height=70mm]{\here/EarthData2cropped.pdf}
\caption{Global Average Temperature, 1960-2000 (2010)}
\figsource{Chart from Excel spreadsheet we built.}
\label{fig:EarthData2}
\end{figure}

\begin{figure}[ht]
\centering
  \includegraphics[height=5in]{\here/RegressionScreenshot2}
\caption{Formatting a Trendline}
\figsource{Excel screen capture}
\label{fig:RegressionScreenshot2}
\end{figure}

Excel calls the line that best fits a scatterplot a
\myindex{trendline}. Its official name is \emindex{regression line}.
We learned (or remembered) in Chapter~\ref{ElectricityBill} that
straight lines are described by linear equations. The one for the
regression line in Figure~\ref{fig:EarthData2} is
%
\begin{equation*}
y = 0.0116x - 8.9214.
\end{equation*}
%
The \emph{slope} of the regression line matters most. In this example
it says that on average global temperature is increasing at a
\emph{rate} of 
%
\begin{equation*}
0.0116 \ \frac{\text{degrees (Celsius)}}{\text{year}}
\end{equation*}
%
or just over one hundredth of a degree (Celsius) per year, which
amounts to a bit more than one degree per century.
(Remember that the units of the slope are (units of $y$)/(units of
$x$).)

The intercept for this linear equation, with its units, is
%
\begin{equation*}
-8.9214 \text{ degrees (Celsius)}.
\end{equation*}
%
Supposedly, that is the temperature predicted (retroactively)
by the regression line for year 0. That's nonsense, of course.

In principle, we can use the equation of the line instead of our
eyeball approximation to make our 2010 prediction. If we let $x =
2010$ we find the prediction 
\begin{align*}
\text{average 2010 temperature} & = y \\
& = 0.0116 \times 2010 - 8.9214 \\
& = 14.3946 \\
& \approx 14.39 \text{ degrees Celsius}
\end{align*}

\emph{Something is wrong! That's almost a tenth of a degree below our
  visual estimate of 14.48 degrees,}
even though both numbers are supposed to come from the same line.
At the rate the slope gives, a gap that size is more than seven
years' worth of warming.
Should we believe the arithmetic or the graph? The equation is
supposed to describe the very line drawn on the chart, so the two
can't both be right.%
\teachertag{}%
\begin{teacher}
This error surprised us when it occurred during a class we hadn't
prepared carefully. That turned out to be useful -- the students saw
their teacher seeing that a number made no sense, then looking for an
explanation.
\end{teacher}

Be skeptical. Always ask whether the numbers from
a newspaper or a web site or a television commentator
-- or from a computer program -- \emph{make sense}. 
This one clearly doesn't.
If we dig a little deeper we can see why.

It turns out that Excel \emph{rounded off} the slope and intercept it
showed on the chart. It knows the correct values, but thought all the
digits were too ugly to display. To find them, enter the command 
\displayexcel{
=SLOPE(
}
in a cell (we used \cell*{E27}, with a label in \cell{D27}). Excel
prompted for \displayexcel{
=SLOPE(known\_y's, known\_x's)
}
so we selected the data
\displayexcel{
=SLOPE(B6:B46,A6:A46)
}
(the years 1960-2000)
and Excel told us the correct value: 0.011642857.
That's more precise than the rounded value 0.0116 shown on the chart.
We found the intercept, -8.921393728, the same way, with the formula
\excel{=INTERCEPT(B6:B46,A6:A46)} (in \cell*{E28}).%
\footnote{ Note that the y-values come first and the x-values second
in the \excel{SLOPE} and \excel{INTERCEPT} functions, even though in
the data table the x-values are first and the y-values second.
}
Then the
correct equation for the model, before rounding, is 
\begin{equation*}
y = 0.011642857 \ x - 8.921393728 .
\end{equation*}
If we set $x = 2010$ in  that equation Excel tells us
$y = 14.48074913$ (\cell*{E30}). That rounds to our visual estimate of
14.48. 
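
To see why four decimal places were not enough, compare the two
slopes directly. This is only a sketch of the arithmetic, using the
values above. The rounded slope differs from the full one by
\begin{equation*}
0.011642857 - 0.0116 = 0.000042857 \ \frac{\text{degrees (Celsius)}}{\text{year}},
\end{equation*}
which looks negligible. But the prediction multiplies the slope by the
year, and 2010 is a large number:
\begin{equation*}
0.000042857 \times 2010 \approx 0.086 \text{ degrees (Celsius)}.
\end{equation*}
Rounding the intercept matters much less: it changes the answer by
less than 0.00001 degrees. A tiny error in the slope becomes a visible
error in the prediction because $x$ is in the thousands.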

We've said repeatedly that it was wrong to show lots of decimal places
when reporting approximate numbers, even when those decimal places
appeared in your calculator or spreadsheet. But in this example we saw
that too much rounding is wrong too. Using a slope rounded to four
decimal places gave a wrong answer. The short answer to the
question ``when should you round?'' is 
\index{rounding}\index{significant digits}

\begin{quotation}
While you compute, use \emph{all} the digits you have, even if
it's more than you need. Round only when you're done.
\end{quotation}
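
In a spreadsheet the easiest way to follow that advice is to let
Excel carry the digits for you. As a sketch, assuming the data are
still in \excel{A6:A46} (years) and \excel{B6:B46} (temperatures) as
above, either of these formulas should give the 2010 prediction
without your ever typing a rounded slope:
\displayexcel{
=SLOPE(B6:B46,A6:A46)*2010 + INTERCEPT(B6:B46,A6:A46)
}
\displayexcel{
=FORECAST(2010,B6:B46,A6:A46)
}
The first combines the two functions we just used. The second uses
Excel's built-in \excel{FORECAST} function, which takes the $x$-value
you want a prediction for, then the $y$-values, then the $x$-values.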

\qrsection[greenhouse]{The \myindex{greenhouse effect}}

Most climate scientists are convinced that the reason the Earth is warming
is the increase in the concentration of \myindex{greenhouse gas}es like Carbon
Dioxide in the air. 

A greenhouse is warm in the winter because
sunlight enters through the glass roof, which prevents the inside air
it heats up from escaping. Carbon Dioxide behaves similarly in the
atmosphere -- it lets sunlight in but doesn't let heat out.
The chart on the left in Figure~\ref{fig:EarthData3} displays the data
and the regression line showing how average temperature varies with
the amount of \cotwo{} in the atmosphere.%
\footnote{
From now on we'll refer to Carbon Dioxide by its
molecular formula \cotwo{}.}
The slope of the regression line is
%
\begin{equation*}
0.0093 \ \frac{\text{degrees Celsius}}{\text{part per million of \cotwo{}}}
\end{equation*}
%
An increase of one part per million of \cotwo{} corresponds to about a
hundredth of a degree (Celsius)
increase in temperature.

The chart on the right in Figure~\ref{fig:EarthData3} shows the
increase in \cotwo{} concentration over the years (it does not mention
temperatures at all). There the slope of the regression line is
%
\begin{equation*}
1.3569 \ \frac{\text{parts per million of \cotwo{}}}{\text{year}};
\end{equation*}
%
every year the \cotwo{} concentration increases by about 1.36 parts
per million.  
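
As a rough consistency check (a back-of-the-envelope sketch, not
something the two regressions guarantee), we can multiply the two
slopes. The units of \cotwo{} cancel:
\begin{equation*}
0.0093 \ \frac{\text{degrees Celsius}}{\text{part per million}}
\times
1.3569 \ \frac{\text{parts per million}}{\text{year}}
\approx 0.0126 \ \frac{\text{degrees Celsius}}{\text{year}}.
\end{equation*}
That is close to the slope of 0.0116 degrees Celsius per year that the
temperature trendline gave directly. The two numbers need not agree
exactly, since each regression line is only an approximation to its
own data.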

\begin{figure}[ht]
\centering
\begin{minipage}{2.8in}
  \centering
  \includegraphics[height=2in]{\here/co2vtempcropped.pdf}
\end{minipage}
\begin{minipage}{2.8in}
  \centering
  \includegraphics[height=2in]{\here/co2vyearcropped.pdf}
\end{minipage}
\caption{\cotwo{}, time and temperature, 1960-2000}
\figsource{Charts from Excel spreadsheet we built.}
\label{fig:EarthData3}
\end{figure}

\qrsection[correlation]{How good is the linear model?}

How much a regression line helps you understand the data and make predictions
depends in part on how close the data points are to the line. Common
sense tells you that the relationship between Carbon Dioxide
concentration and time (on the right in Figure~\ref{fig:EarthData3}) is
likely to be more reliable than that between Carbon Dioxide and
temperature (on the left), which in turn looks better than that
between temperature and time (Figure~\ref{fig:EarthData2}).

The official statistical measure of ``close to the
line'' is a number between zero and one called ``R-squared''. The
closer R-squared is to 1 the better the regression line fits the data.
In Figure~\ref{fig:EarthData2} $R^2$ is just $0.63321$ -- not very
good. That matches what we can see in the chart -- the temperature
seems to be increasing on the average, but can go up and down
unpredictably from year to year. In the chart on the right in
Figure~\ref{fig:EarthData3} the $R^2$ value is 0.9902, which is very
close to 1.
Extending that trendline to 2010 gives a prediction we can check:
the measured 2010 concentration was 389.78 parts per million,
so the relative error in the prediction is 
about $-2.5\%$.
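
The predicted value itself is not shown here, but the relative error
lets us recover it approximately. A sketch, taking relative error to
mean (predicted $-$ actual)/actual:
\begin{align*}
\text{predicted 2010 concentration}
 & \approx \text{actual} \times (1 - 0.025) \\
 & = 389.78 \times 0.975 \\
 & \approx 380 \text{ parts per million}.
\end{align*}
The minus sign says the prediction was too low: the measured
concentration in 2010 was higher than the 1960-2000 trend suggested.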

We are being deliberately vague about how close the $R^2$ 
should be to 1 to declare that the fit is ``good.'' There are
no rules for this. In the exercises below you will have a chance to
develop your intuition.%
\teachertag{}%
\begin{teacher}
We have deliberately omitted any discussion of the correlation
coefficient $R$. When we taught that material from an early
draft of \CommonSense{} we used up a lot of class time on material
that did not meet our ``what should students remember ten years from
now?'' criterion for inclusion. We think that thinking qualitatively
about $R^2$ is sufficient.
\end{teacher}
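
Excel can also report R-squared in a cell, without a chart. As a
sketch, assuming the temperature data are in the cells used earlier
(\excel{A6:A46} for the years, \excel{B6:B46} for the temperatures),
the formula
\displayexcel{
=RSQ(B6:B46,A6:A46)
}
should reproduce the value shown on the chart in
Figure~\ref{fig:EarthData2}. Like \excel{SLOPE} and
\excel{INTERCEPT}, the \excel{RSQ} function takes the $y$-values first
and the $x$-values second.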

We were careful to use the word ``corresponds'' when discussing
the increase in \cotwo{} concentration and the increase in average
temperature, not the word ``causes''. The data only say that the 
\cotwo{} concentration and the temperature are \emindex{correlated} -- they
trend together. They don't say one causes the other. Data can
never tell you that. Climate scientists who work at understanding the
physics and chemistry of Carbon Dioxide in the atmosphere have created
\emph{scientific models} that suggest causation.%
\footnote{We will return to
this distinction in \sref*{regressionnonsense}.}

There is much more to the climate change debate: some who
accept the scientific models that say that greenhouse gases cause
global average temperatures to increase are not convinced
that  the increase in greenhouse gases is due to human activity, and
therefore might be reduced if we changed the way we use energy.


\qrsection[regressionnonsense]{Regression nonsense}\index{crime rate}

The graphic in Figure~\ref{fig:crimeDown} resembles one that
appeared in \theGlobe{} on February 14, 2010 in a story
headlined \headline{Imaginary fiends}, which began

\begin{qwrap}
\begin{quotation}
\firstline{In 2009, crime went down. In fact it's been going down for}
a decade. But more and more Americans believe it's getting worse.  
\webref{http://www.boston.com/bostonglobe/ideas/articles/2010/02/14/imaginary_fiends/}
\end{quotation}
\sourceinfo[1720]{http://www.boston.com/bostonglobe/ideas/articles/2010/02/14/imaginary_fiends/}
\end{qwrap}

The data in the accompanying table is from
\url{http://www2.fbi.gov/ucr/cius2008/data/table_01.html}
and
\url{http://www.gallup.com/poll/123644/Americans-Perceive-Increased-Crime.aspx}.
The FBI measures the crime rate in violent crimes per 100,000
people. The fear index is the percentage of people who say crime is
going up. 

\begin{figure}
\centering
\begin{minipage}{3.2in}
\resizebox{2.95in}{!}{
\framebox{
\begin{mytikz}
\input{\here/crimedown}
\end{mytikz}
}}
  \end{minipage}
  \begin{minipage}{2.8in}
\begin{tabular}{|c|r|r|}
\hline
year & crime rate & fear index \\
\hline
2000 & 506.5 & 47 \\
2001 & 504.5 & 43 \\
2002 & 494.4 & 62 \\
2003 & 475.8 & 60 \\
2004 & 463.2 & 53 \\
2005 & 469.0 & 67 \\
2006 & 473.6 & 68 \\
2007 & 466.9 & 71 \\
2008 & 454.5 & 67 \\
2009 & 435.0 & 74 \\
\hline
\end{tabular}
\end{minipage}
\caption{Crime Down, Fear Up}
\figsource{Graphic redrawn from data in \theGlobe, 2/14/2010,
\url{http://www.boston.com/bostonglobe/ideas/articles/2010/02/14/imaginary_fiends/},
data from
\url{http://www2.fbi.gov/ucr/cius2008/data/table_01.html}
and
\url{http://www.gallup.com/poll/123644/Americans-Perceive-Increased-Crime.aspx}}
\label{fig:crimeDown}
\end{figure}

The headline seems to announce a juicy story. The graph is drawn to
accentuate the apparent contradiction, since the scales on both $y$
axes don't start at 0.
We will use these numbers to illustrate the kinds of nonsense
arguments you can make with regression lines. There are three
variables to play with: the year, the crime rate, and the fear index.
We will focus on them two at a time and imagine different
kinds of conclusions. Our work is
in the spreadsheet \link{crimeDropsFearsRise.xlsx}.

The first graph in Figure~\ref{crimeFearRegression} shows a
scatterplot and trendline for the last two columns in the table.
There we asked Excel to construct a graph with crime rate as the
independent variable.

\begin{figure}[ht]
\centering
\includegraphics[height=50mm]{\here/CrimeUpFearDowncropped.pdf}
\caption{Crime vs Fear Regressions}
\figsource{Charts from Excel spreadsheet we built.}
\label{crimeFearRegression}
\end{figure}

Remember that we put crime rate as the independent variable.  This
makes it easy to look at the graph -- and the trend line -- and
conclude that the increase in crime rate is closely related to the
decrease in the fear index. The regression line slopes
down -- high crime rates seem to come along with decreased fear of
crime. The R-squared value is 0.60 -- perhaps not compellingly high,
but we won't let that stop us from thinking about the data. What might
the correlation \emph{mean}?
Could an increase in crime (the independent variable on the x axis)
cause people to be less afraid?  Here's an attempt at an explanation:
Perhaps when crime is rare it's reported spectacularly in the news and
people are frightened, while when it's common it gets less press and
most people don't notice it as much because it isn't happening to them.

Does that make sense? Not to us, but it's the kind of argument you
frequently see or hear -- a simpleminded attempt to explain what seems
to be a real ``this is true because of that'' connection, or perhaps
what a politician would like you to believe is a real connection.

The second graph in Figure~\ref{crimeFearRegression} shows the same
data with the fear index as the independent variable. That
changes our view of the data.
Now we see crime dropping as 
fear increases. How might we explain that? Perhaps we'd argue that
increasing fear of crime leads to more pressure on the police to
arrest criminals, thus reducing the amount of crime. That's more
plausible than the other way around, but still a shallow unconvincing
analysis of complex social phenomena. Both the crime rate and the fear
of crime are changing over time, one decreasing while the other
increases, but just because we can find a trendline doesn't mean
either change causes the other. 

\begin{figure}[ht]
\centering
\includegraphics[height=50mm]{\here/crimeRateRegressionscropped.pdf}
\caption{Fear Index and Crime Rate Over Time}
\figsource{Charts from Excel spreadsheet we built.}
\label{crimeTime}
\end{figure}

We can see the two trends separately if we plot each with time as the
independent variable, as in Figure~\ref{crimeTime}.
With these charts we can create other nonsense arguments. The
slope of the fear index regression line is about 3 percentage points
per year. Since the index 
was at 74\% in 2009, if the trend continues then in about 8 more
years, in 2017,
98\% of the population will believe that crime is getting
worse every year. The second regression line says the crime rate is
actually falling each year by about 7 violent crimes per 100,000
people, so in 2017 when everyone believes things are getting worse
it will be down from 435 to about 380. Neither of
these predictions carries much conviction. 
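
The arithmetic behind those predictions is just slope times elapsed
time, added to the 2009 value. Using the rounded slopes quoted above:
\begin{align*}
\text{fear index in 2017} & \approx 74 + 3 \times 8 = 98 \text{ percent} \\
\text{crime rate in 2017} & \approx 435 - 7 \times 8 = 379
\ \frac{\text{violent crimes}}{\text{100,000 people}}
\end{align*}
Carry the first line one year further and the nonsense is plain:
$74 + 3 \times 9 = 101$, so in 2018 more than 100\% of the population
would believe that crime is getting worse.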

The news story that prompted this discussion is misleading in another way.
When we found the data on which it is based we discovered that 
in the previous decade, from 1990 to 2000, the 
crime rate and the fear index were \emph{both} decreasing.
The author of the
article chose not to tell us that. He 
\emph{cherry-picked}\index{cherry-pick} the data to make his
point (whatever it is) more dramatic. You can find \emph{all} the
numbers in our spreadsheet at \link{crimeDropsFearsRise.xlsx}.

The moral of this story: 
\begin{center}
\framebox{
Correlation is not causation.
}
\end{center}
It's very easy to use
regression to link variables (crime rate and fear index, as in this example), to
suggest trends and to make predictions or interpret
correlation as explanation. Just because you can
doesn't mean you should. It's often wrong.
Watch out for people who do.


\exstart

\begin{eb}
Maura

I haven't redrawn all the graphics from the Globe -- I will do those
that end up in the regular text, once that's decided.
\end{eb}
\begin{exx}{\untested\worthy\sref{climatechange}\gref{regression}}
A trendline for linear data.%
\teachertag
\begin{teacher}
This is an interesting exercise to work in class.
\end{teacher}

\begin{abcd}
\item What values would you expect to see for the slope, intercept and
R-squared if you were to add a trendline to the Tamworth electricity
bill in spreadsheet \url{\webhome/ElectricityBill/TamworthElectric.xlsx}?

\item What  would the trendline look like on the graph in
Figure~\ref{fig:linearGraph}?

\item Add the trendline and verify your predictions.

\end{abcd}

\end{exx}

\begin{exx} {\untested\sref{climatechange}\gref{regression}}
The leaning tower of \myindex{Pisa}

The famous ``Leaning Tower of Pisa'' began to lean even while it was
under construction in the 1170s.
The table in Figure~\ref{fig:pisaanddata} shows the measured lean for
the years 1975 through 
1987.% 
\footnote{This picture is from
\url{%
http://www.raphaelk.co.uk/web\%2520pics/Italy/second/pisa-lina-1.jpg}; we
will get permission to reproduce it, or find another. The 
data is from 
\url{http://filebox.vt.edu/users/jemarsh2/LectureNotes/Ch10Examples.pdf}.
The second column displays the lean as the distance in meters between
where a particular point on the tower would be if the tower were
straight and where it actually is.  
}
\begin{figure}[ht]
\centering
\begin{minipage}{2in}
\includegraphics[height=72mm]{\here/PisaTower.jpg}
\end{minipage}
\begin{minipage}{2in}
\begin{tabular}{|r|r|}
\hline
Year & Lean (m) \\
\hline
1975 & 2.9642 \\
1976 & 2.9644 \\
1977 & 2.9656 \\
1978 & 2.9667 \\
1979 & 2.9673 \\
1980 & 2.9688 \\
1981 & 2.9696 \\
1982 & 2.9698 \\
1983 & 2.9713 \\
1984 & 2.9717 \\
1985 & 2.9725 \\
1986 & 2.9742 \\
1987 & 2.9757 \\
\hline
\end{tabular}
\end{minipage}
\caption{The Tower of Pisa}
\figsource{Photo from
\url{http://www.raphaelk.co.uk/web\%2520pics/Italy/second/pisa-lina-1.jpg},
data from 
\url{http://filebox.vt.edu/users/jemarsh2/LectureNotes/Ch10Examples.pdf}}
\figcomment{Any good stock Pisa photo will do. Data is probably free.}
\label{fig:pisaanddata}
\end{figure}

\begin{abcd}
\item Construct the regression line for this data and estimate
  (visually) what the   lean was in the year 2000.

\item How good is that estimate likely to be?

\item What is the slope of the regression line? What are its
  units? What does it mean?

\item Check your estimate using the equation of the regression
  line. Can you use the formula as it appears in the chart, or do you
  need more decimal places?

\item Explain why the actual numbers in the data table for the Tower of Pisa
depend on the height of the ``particular point'' at which measurements
were taken. What would the numbers be if the point were twice as high?
Would the linear regression line be just as good?

\item What has happened to the Tower of Pisa since 1987?
\end{abcd}


\end{exx}

\begin{exx}{\hassolution\sref{climatechange}\gref{regression}}
Beverage consumption

The spreadsheet at \link{BeverageConsumption.xlsx} contains data on the
amounts of milk, bottled water and soft drinks consumed in the United
States between 1980 and 2004.

\begin{abcd}

\item Use Excel to create a scatter plot of this data. 
Label the data series and the axes correctly.

\item Explore correlations among the various categories (for example, between milk and water). Write about what you
discover. In particular, which kinds of consumption are most closely
correlated?

\item Use the regression lines to make some predictions for years
following 2004.

\item Find the source of the data in
\link{BeverageConsumption.xlsx}. If that source contains data for other
years, discuss the validity of your predictions. 
\end{abcd}

\begin{hint}
Try a Google search for 
\gc{%
Per capita consumption of selected beverages in gallons
}
\end{hint}

\begin{sol}


\begin{abcd}

\item Use Excel to create a scatter plot of this data. 
Label the data series and the axes correctly.

The spreadsheet can be found at \link{BeverageConsumptionSolution.xlsx}.

\item Explore correlations among the various categories. Write about what you
discover. In particular, which kinds of consumption are most closely
correlated?

I used the Excel \excel{CORREL()} function to find the correlation
coefficients. Then I squared them to find the R-squared values. Here
are the results:

\begin{center}
\begin{tabular}{|r|r|r|}
\hline
pair & Correlation & R-Squared \\
\hline
milk-water &	-0.989	& 0.978 \\
milk-soda & -0.922  & 0.851 \\
water-soda & -0.884  & 0.782 \\
\hline
\end{tabular}
\end{center}

That tells me milk and bottled water are most closely correlated. The
minus sign means that as the consumption of milk declines the
consumption of bottled water increases.

\item Use the regression lines to make some predictions for years
following 2004.

I asked Excel to project the regression lines out to 2010. I then
estimated values for 2007 by looking at the
graph. (I could have asked Excel to work with the linear function
defining the regression line, but decided that the numbers were so
inexact that I would just estimate by eye.)

I entered the values in the table below.

\item Find the source of the data.

I followed the hint and Googled
\gc{%
Per capita consumption of selected beverages in gallons
}.

The first hit was a link to a spreadsheet at
\url{www.census.gov/compendia/statab/2010/tables/10s0210.xls} that
gave figures through 2007.%
\footnote{Saved locally as \link{BeverageConsumptionThrough2007.xlsx}
}
The following table contains the values for
2007, along with my predictions from the regression lines.

\begin{center}
\begin{tabular}{|r|r|r|}
\hline
beverage & 2007 prediction & 2007 actual \\
\hline
milk &	21  & 20.7 \\
water &	24  & 29.1 \\
soda &	58  & 48.8 \\
\hline
\end{tabular}
\end{center}
\end{abcd}

The regression line prediction is pretty good for milk, but too low
for bottled water and too high for soda. When I look at the data the
soda result is not too surprising. Soda consumption seems to have
peaked in about 2000 and was level for the next four years. The
regression line keeps growing because it's taking into account the
rapid growth between 1980 and 2000. I bet a regression that started
with just the 2000-2004 data would predict a value much closer to the
48.8 gallons that was observed.

\end{sol}
\end{exx}


\begin{exx}{\hassolution\sref{climatechange}\gref{regression}} Energy consumption\index{energy consumption}

The Excel spreadsheet 
\link{EnergyConsumption.xlsx}
contains a table showing the annual United States energy consumption,
measured in terawatt-hours, between 1949 and 2005.  

\begin{abcd}

\item
Insert a new column labeled ``years since 1949'' in between the Years
column and the Consumption column. Use Excel to fill in the cells for
this column. 

\item
Use Excel to find a linear trendline for this data.  Include the
equation and $R^2$-value for the trendline on the graph.    

\item
Is this trendline a good fit for the data?

\item
What is the slope of this line?  Include the units in your answer.  
Use your answer for the slope to complete the sentence:  ``For every
additional year that passes, total energy consumption \ldots'' 

\item
Estimate total energy consumption in the years from 2006 to the present. 

\item
Look for data with which to check the estimates from the previous part
of the exercise.
\end{abcd}


\begin{sol}

\begin{abcd}

\item
See the solution spreadsheet at \link{EnergyConsumptionSolution}

\item
See the spreadsheet.

\item
The trendline is a good fit for the data since $R^2 = 0.9594$, which
is very close to $1$.

\item
The slope of the trendline is 361.9 TWh/year. 

For every
additional year that passes, total energy consumption increases by about
360 TWh.

\item
For 2009 the prediction is about 32,000 TWh. It's wrong to report more
significant digits than that.

\item 
I haven't time to find a good source for the actual 2009 value
(yet). Perhaps a student will provide one.

I did discover that U.S. energy consumption actually declined in 2008
and 2009 because of the economic crisis.
\end{abcd}

\end{sol}


\end{exx}


\begin{exx}[officerents1]{\hassolution\sref{climatechange}\gref{regression}}
Supply and demand for office space.

The data in Table~\ref{vacancies} appeared on page B5 in \theGlobe{}
on April 3, 2010. 

\begin{table}[ht]
\centering
\begin{tabular}{|c|r|r|}
\hline
quarter & vacancy rate & rent (\$/$\hbox{ft}^2$) \\
\hline
Q1 '06 & 11.8\% &  38.76 \\
Q1 '07 & 7.5\% & 47.54 \\
Q1 '08 & 6.0\% & 62.20 \\
Q1 '09 & 9.0\% & 49.24 \\
Q1 '10 & 11.1\% & 42.46 \\
\hline
\end{tabular}
\caption{Less in rent, more in vacancy}
\tablesource{\theGlobe, 4/3/2010 page B5.}
\tablecomment{Need permission?}
\label{vacancies}
\end{table}

\begin{abcd}
\item
Build and then discuss a linear regression line for the
dependence of rent per square foot on vacancy rate.

\item How do your conclusions change when you adjust rents to take
inflation into account?
\end{abcd}

\begin{sol}

Figure~\ref{fig:officerentssolution} shows
the charts for both parts of the exercise.

\begin{figure}[ht]
\centering
\includegraphics[height=50mm]{\here/OfficeRentsSolution.png}
\caption{Office rents, Q1 06 - Q1 10}
\figsource{Charts from Excel spreadsheet we built.}
\label{fig:officerentssolution}
\end{figure}

\begin{abcd}
\item
Build and then discuss a linear regression line for the
dependence of rent per square foot on vacancy rate.

The regression line has a slope of -3.38.
That means that each 1\% increase in the vacancy rate
corresponds to a decrease of \$3.38 per square foot in office space
rent.

$R^2$ is 0.84, which means the correlation is pretty good.

\item How do your conclusions change when you adjust rents to take
inflation into account?

After I used the Bureau of Labor Statistics inflation calculator to
write all the rents in 2010 dollars, the slope was -3.29 and the $R^2$
was 0.88. That's a little higher.

\end{abcd}

\end{sol}
\end{exx}

\begin{exx}{\untested\complex\sref{climatechange}\gref{regression}} Polarization

Figure~\ref{fig:polarization} appeared in \theGlobe{} on November 6, 2010.
\webref{%
http://www.boston.com/news/nation/washington/articles/2010/11/06/election_opens_up_a_gaping_divide/}
We extracted the numerical data from the graph; you can find it at
\link{polarization.csv}.

\begin{figure}[ht]
\centering
\includegraphics[height=50mm]{\here/Polarization.jpg}
\caption{Income Disparity and Political Polarization}
\figsource{Scanned from \theGlobe, 
\url{http://www.boston.com/news/nation/washington/articles/2010/11/06/election_opens_up_a_gaping_divide/}}
\figcomment{Data extracted by scraping the numbers from the image.}
\label{fig:polarization}
\end{figure}

\begin{abcd}

\item Find the trendline modeling a linear relationship between the 
income share of the top 1 percent of the population and the political
Polarization index.

\item Find the trendline modeling a linear relationship between the 
income share of the top 1 percent of the population in a year and the
political Polarization index \emph{four years earlier}.
\end{abcd}

%The lagged cross-correlation will be a bit higher (i.e. correlation
%between income(t) and polarization(t-dt), where dt is approx. 5-8 years)

\end{exx}

\begin{exx}{\hassolution\sref{climatechange}\gref{regression}\gref{regressionnonsense}}
\headline{Office rents reach dizzying heights}

On February 22, 2008 \theGlobe{} ran this story under that headline:

\begin{qwrap}
\begin{quotation}
\firstline{For the first time in almost a decade, office rents in}
downtown Boston have moved above the eye-popping \$100-a-square-foot
mark.  

So far just a few well-heeled companies have agreed to pay three-digit
rates for their offices, involving relatively small amounts of space
on the upper floors of some of Boston's trophy towers. Executives in
the real estate industry predict that, with demand for high-quality
space exceeding the available supply, more companies will pay up for
such offices.
\end{quotation}
\sourceinfo{Boston Globe, February 22, 2008. No url at the moment.}
\end{qwrap}

Figure~\ref{OfficeData} shows the chart and the data.

\begin{figure}[ht]
\centering
\begin{minipage}{2in}
\includegraphics[height=60mm]{\here/OfficeRents.png}
\end{minipage}
\begin{minipage}{3in}
\begin{tabular}{|c|c|c|}
\hline
     & Available & Average Rent \\
Year & (\%)      & (\$/(square foot)) \\
\hline
2000 & 5.0 & 	  72.04 \\
2001 & 11.6 & 	  53.57\\
2002 & 16.6 & 	  42.98\\
2003 & 16.2 & 	  38.40\\
2004 & 18.3 & 	  38.92\\
2005 & 13.8 & 	  41.31\\
2006 & 12.2 & 	  44.81\\
2007 & 11.3 & 	  67.23\\
\hline
\end{tabular}
\end{minipage}
\caption{Boston Office Rental Rates}
\figsource{Charts - \theGlobe, 2/22/2008. Data - scraped from the
charts.}
\figcomment{Permission needed, or redraw in some way.}
\label{OfficeData}
\end{figure}

The shapes of the curves illustrate the law of supply and demand --
the more space is available the less you have to pay for it. 

\begin{abcd}

\item
Enter the data in an Excel spreadsheet in columns \excel{A:C}. You can
save typing by copying the data from the table into a word processing
document, changing the blanks to commas, saving the file as type
{\tt .csv} and opening it as a spreadsheet. The extension ``{\tt
.csv}'' stands for ``comma separated values.'' Excel knows how to deal
with those.

\item Recreate the graphic from \theGlobe, with both data sets
displayed on the same chart, using two different y-axis scales.

This is not easy, and may not be worth
the time it takes to master. Moreover, there is some controversy about
whether this is ever a good thing to do.

\item Create a scatter plot using columns \excel{B} and \excel{C} and
a regression line for that scatter plot. Identify the
slope and its units. How good is the correlation?

\item Use the graph and the formula to estimate office rent when the
availability rate is 8\%.

\item If you worked \exref{officerents1}, compare the
data there with the data here.

\end{abcd}

\begin{sol}

\begin{abcd}

\item
Enter the data in an Excel spreadsheet in columns \excel{A:C}. 

See \link{OfficeRentsSolution.xlsx}.

\item Recreate the graphic from \theGlobe, with both data sets
displayed on the same chart. (This is not easy, and may not be worth
the time it takes to master.)

See Figure~\ref{fig:rentandvacancy}.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/RentAndVacancy.png}
\caption{Boston Office Space Rental Statistics}
\figsource{Chart from Excel spreadsheet we built.}
\label{fig:rentandvacancy}
\end{figure}

\item Create a scatter plot using columns \excel{B} and \excel{C} and
a regression line for that scatter plot. Identify the
slope and its units. How good is the correlation?


The slope of the line (with units) is
\begin{equation*}
-2.76 \frac{\$/{\hbox{square foot}}}{\hbox{percentage point of vacancy rate}}
\end{equation*}

It tells me that for each increase of one percentage point in the
vacancy rate the average rent falls by about \$2.76 per square foot.

$R^2$ is about 0.76, which is OK but not wonderful.

\item Use the graph and the formula to estimate office rent when the
availability rate is 8\%.

The picture suggests that the rent will then be about \$64 per square
foot. The formula says
\begin{equation*}
-2.7572 \times 8 + 86.096 = 64.0384
\end{equation*}
which rounds to 64. My guess from the graph was pretty good!

\item If you worked \exref{officerents1}, compare the
data there with the data here.

See Figure~\ref{fig:rentandvacancycombined}.

At each vacancy rate level the rents in this Exercise are somewhat
higher than those in the other one. I have no idea why.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/RentAndVacancyCombinedSolution.png}
\caption{Boston Office Space Rental Statistics}
\figsource{Chart from Excel spreadsheet we built.}
\label{fig:rentandvacancycombined}
\end{figure}


\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\hassolution\sref{climatechange}\gref{regression}} College costs\index{college costs}

%\url{http://economix.blogs.nytimes.com/2011/11/04/college-is-cheaper-than-you-think/}

The spreadsheet \link{CollegeCosts2010.xlsx}
shows the annual mean cost for tuition and
fees at private and public four-year colleges in the U.S. between 1999
and 2010. 

\begin{abcd}

\item 	Insert a new column to the right of the ``Years'' column and
label it ``Years since 1999''.  Fill in the values for this column.
Then create a properly labeled scatter plot of years since 1999 versus
mean private and public education costs. 

Add a linear trendline to each 
set of data.  

\item Write the equation for private education costs (round the
numbers to one decimal place). 

\item Write the equation for public education costs 
(round the numbers to one decimal place).

\item Interpret the numerical value of the slope in each trendline
equation. 

\item Use your trendline equations to determine the projected mean
tuition cost at both private and public four-year colleges for 2015.

\item In Chapter~\ref{Inflation}, \exref{collegespending} presents
data on public and private college \emph{spending} increases. Compare
the data there with the \emph{tuition and fee} data here.

\end{abcd}

\begin{sol}

\begin{abcd}

\item Build chart with trendlines.

See \link{CollegeCosts2010Solution.xlsx}.

\item Write the equation for private education costs here (round the
numbers to one decimal place). 

Excel says:
\begin{verbatim}
	y = 1102.7x + 14918	
\end{verbatim}

\item Write the equation for public education costs 
here (round the numbers to one decimal place).

\begin{verbatim}
	y = 397.3x + 3072.6
\end{verbatim}

\item Interpret the numerical value of the slope in each trendline
equation. 

The slopes of the trendlines show that the cost of public education is
increasing at a rate of \$397 per year while that for private
education is increasing at a rate of \$1103 per year.


\item Use your trendline equations to determine the projected mean
tuition cost at both private and public four-year colleges for 2015.

2015 is 16 years from 1999, so I plugged 16 into each of the
equations. I project that in 2015 private college tuition and fees
will cost about \$32,561 while public will cost just \$9,429.

The R-squared value for each of the trendlines is very close to 1, so
I am pretty confident about these predictions.


\item Comparison with spending data from the earlier problem not yet
done.
\end{abcd}
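
The projection in part (e) can be checked by hand. This is a sketch of
the arithmetic, assuming the rounded trendline coefficients Excel
reported in parts (b) and (c) and $x = 2015 - 1999 = 16$:
\begin{align*}
\hbox{private: } & 1102.7 \times 16 + 14918 = 17643.2 + 14918 = 32561.2 \\
\hbox{public: } & 397.3 \times 16 + 3072.6 = 6356.8 + 3072.6 = 9429.4
\end{align*}
Each slope must be paired with the intercept from the same trendline;
mixing the public slope with the private intercept gives a number that
matches neither line.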

\end{sol}

\end{exx}


\begin{exx}{\hassolution\sref{climatechange}\gref{regression}\gref{regressionnonsense}}
\headline{\myindex{Pandora} Prices Its I.P.O. at
\$16 a Share} 

On June 15, 2011 \theTimes{} reported that

\begin{qwrap}
\begin{quotation}
\firstline{According to its latest filing, [Pandora] has more than 90}
million registered users and is adding a new user about every second.
\webref{
http://dealbook.nytimes.com/2011/06/14/pandora-prices-its-i-p-o-at-16/ 
}
\end{quotation}
\sourceinfo[665]{http://dealbook.nytimes.com/2011/06/14/pandora-prices-its-i-p-o-at-16/ }
\end{qwrap}

\begin{abcd}
\item Write the linear equation that models this scenario, using
seconds since the article appeared for the independent
variable. What are the slope and the intercept, with their units?

\item Rewrite your equation using millions of users for the dependent
variable and years for the independent variable. What are the slope
and intercept?

\item How long could Pandora continue to grow at that rate?

\item Did it continue to grow at that rate?

\end{abcd}

\begin{sol}


\begin{abcd}
\item Write the linear equation that models this scenario, using
seconds since the article appeared for the independent
variable. What are the slope and the intercept, with their units?

Let $U$ be the number of Pandora users and $S$ the number of seconds
since the article appeared on June 14, 2011. Then the equation
suggested would be
\begin{equation*}
U = 90{,}000{,}000 + S
\end{equation*}
with an intercept of 90 million users and a slope of 1 user per
second.

\item Rewrite your equation using millions of users for the dependent
variable and years for the independent variable. What are the slope
and intercept?

There are about 30,000,000 seconds in a year, so the equation
becomes
\begin{equation*}
U = 90  +  30Y
\end{equation*}
when $U$ is measured in millions of users and $Y$ is measured in
years. The intercept is 90 (million users) and the slope is 
30 (million users per year).

\item How long could Pandora continue to grow at that rate?

Well, at that rate in two years (June of 2013) there would be 150
million Pandora users, about half the population of the United
States. I don't think that could happen.

\item Did it continue to grow at that rate?

I found this from the Los Angeles Times on April 5, 2012:

\begin{qwrap}
\begin{quotation}
\firstline{The Oakland company said Thursday that the number of active}
listeners (that is, people who have used the service at least once in
the past 30 days) grew to 51 million in March, up 59\% from a year earlier. 
\webref{
http://latimesblogs.latimes.com/entertainmentnewsbuzz/2012/04/pandora-users.html
}
\end{quotation}
\sourceinfo[333]{http://latimesblogs.latimes.com/entertainmentnewsbuzz/2012/04/pandora-users.html}
\end{qwrap}

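That seems to say there were just about $51/1.59 \approx 32$ million users in
the spring of 2011, not 90 million. The numbers don't match, probably
because they count different things: the first article reports
\emph{registered} users, while the Los Angeles Times counts
\emph{active} listeners, those who used the service in the past 30 days.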
\end{abcd}
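
The unit conversion in part (b) and the projection in part (c) can be
written out. This is a rough check, assuming a 365-day year and a
U.S. population of about 310 million in 2011 (a figure that is not
given in the exercise):
\begin{align*}
60 \times 60 \times 24 \times 365 &= 31{,}536{,}000 \hbox{ seconds per year}
\approx 30 \hbox{ million}, \\
90 + 30 \times 2 &= 150 \hbox{ million users after two years}, \\
150/310 &\approx 0.48 .
\end{align*}
Rounding the seconds in a year to 30 million keeps one significant
digit, which is all the precision that ``a new user about every
second'' can support.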

\end{sol}
\end{exx}


\begin{exx}{\untested\sref{climatechange}\gref{regression}}\headline{Manhattan Rental Market Rebounds}

Figure~\ref{fig:manhattanrentals} appeared in \theTimes on October 15,
2011.
\webref{%
www.nytimes.com/2011/10/16/realestate/rents-in-manhattan-rebound-to-record-highs.html
}
 They cite the data source as 
\url{http://www.citi-habitats.com/}.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/ManhattanRentals.png}
\caption{Manhattan Rental Statistics}
\label{fig:manhattanrentals}
\figsource{Chart - 
\url{www.nytimes.com/2011/10/16/realestate/rents-in-manhattan-rebound-to-record-highs.html}}
\end{figure}

The data are in Table~\ref{table:manhattanrentals}.

\begin{table}[ht]
\centering
\begin{tabular}{|c|r|r|}
\hline
year & vacancy rate (\%) & monthly rent (\$) \\
\hline
06 & 0.849 &  3173 \\
07 & 1.007 &  3254 \\
08 & 1.413 &  3256 \\
09 & 1.841 &  3010 \\
10 & 1.191 &  3144\\
11 & 1.095 &  3343 \\
\hline
\end{tabular}
\caption{Apartment Rents in Manhattan}
\tablesource{data scraped from chart}
\label{table:manhattanrentals}
\end{table}

\begin{abcd}
\item
Reproduce the charts in Figure~\ref{fig:manhattanrentals} in
Excel. Label them properly.

\item
Create a scatterplot from the second and third columns in
Table~\ref{table:manhattanrentals}, draw a trendline and discuss the
correlation between vacancy rate and average monthly rent. 

\end{abcd}

\end{exx}

\begin{exx}{\untested\sref{climatechange}\gref{regression}}
\headline{White House Opens Door to Big Donors, and Lobbyists Slip In}

On April 15, 2012 \theTimes published
Figure~\ref{fig:VisitWhiteHouseNYTimesChart.jpg}.

\begin{figure}[ht]
\centering
\includegraphics[height=60mm]{\here/VisitWhiteHouseNYTimesChart.jpg}
\caption{Odds of an invitation to the White House}
\label{fig:VisitWhiteHouseNYTimesChart.jpg}
\figsource{
\url{http://www.nytimes.com/2012/04/15/us/politics/white-house-doors-open-for-big-donors.html}}
\end{figure}

Fit a linear trendline to this data to predict the size of donation
that would guarantee an invitation to visit the White House.

You can do this with a ruler and get a good-enough approximate
answer. No need to put the data into Excel.
\end{exx}

\begin{exx}{\hassolution\gref{regression}}
Playing with regression lines.

Use the spreadsheet \link{PlayWithRegression.xlsx} to explore the
following questions. 


\begin{abcd*}
\item What happens when all the y-values are the same?

\item What if all but one of the y-values are the same and you vary
that one?

\item What if y decreases as x increases?

\item  What if the x and y values match?

\end{abcd*}

\begin{sol}
\begin{abcd*}
\item What happens when all the y-values are the same?

I changed the value in  cell \cell{B14} to $1$ to make all the
$y$-values the same.  The trendline equation turned into
\begin{equation*}
	y = 1.
\end{equation*}
That makes sense, since the slope is 0 and the $y$-intercept is 1.
Excel complains about $R^2$ and refuses to calculate it.


\item What if all but one of the y-values are the same and you vary
that one?

I changed the value in \cell{B14} from 3 to 4, then to 100.

The single high point kept pulling up the trendline, so its slope got
bigger (and its intercept got smaller).

The $R^2$ value didn't change.

\item What if y decreases as x increases?

For $x$ = 1, 2, 3, 4, 5 I used the values $y$ = 10, 8, 6, 4, 2. The
trendline had slope $-2$, which did not surprise me. The correlation
$R$ was $-1$; the minus sign was telling me that the line sloped
down. Since all the points lie on the line, the values are perfectly
correlated and $R^2 = 1$. ($R^2$ is a square, so it can never be negative.)

\item  What if the x and y values match?

I let $x$ = 1, 2, 3, 4, 5 and let $y$ = 1, 2, 3, 4, 5.  Notice that all of the points lie on the trendline.   Excel calculates $R^2=1$, which
makes sense since the line matches up exactly with the points.

\end{abcd*}
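
The numbers Excel reports for a decreasing data set can be checked by
hand. This is a sketch that assumes the five points
$(1,10), (2,8), (3,6), (4,4), (5,2)$, which lie exactly on a line, and
uses the standard formulas for the least-squares slope and the
correlation $R$:
\begin{align*}
\bar{x} &= 3, \qquad \bar{y} = 6, \\
\sum (x - \bar{x})(y - \bar{y}) &= -20, \qquad
\sum (x - \bar{x})^2 = 10, \qquad
\sum (y - \bar{y})^2 = 40, \\
\hbox{slope} &= \frac{-20}{10} = -2, \qquad
\hbox{intercept} = 6 - (-2)(3) = 12, \\
R &= \frac{-20}{\sqrt{10 \times 40}} = -1, \qquad R^2 = 1 .
\end{align*}
The sign of $R$ matches the sign of the slope, while $R^2$ is the
same for a perfect downward line as for a perfect upward one.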

\end{sol}
\end{exx}

\begin{exx}{\untested\gref{regressionnonsense}}
Web search correlations.

Visit the Google correlation site
\url{http://www.google.com/trends/correlate/}, choose a search term
you are interested in and find out what other search terms it's
correlated with. Write about what you discover. Is causation involved?
\index{Google correlate}
\end{exx}

\begin{exx}{\untested\gref{regression}\gref{regressionnonsense}}
\headline{Despite its many benefits, corporate 
use of aircraft still vilified}

On May 26, 2012 \theGlobe{} published a letter to the editor from
David V. Dinneen, executive director of the Massachusetts Airport
Management Association. It said in part:
\begin{qwrap}
\begin{quotation}
\firstline{According to a recent report, annual earnings of S\&P}
companies that use general aviation were 434 percent higher than those
that did not.%
\webref{http://www.bostonglobe.com/opinion/letters/2012/05/25/despite-its-many-benefits-corporate-use-aircraft-still-vilified/mbQ6mINMQXbAayzWvFn6NI/story.html}
\end{quotation}
\sourcewc{260}
\sourceinfo{
http://www.bostonglobe.com/opinion/letters/2012/05/25/despite-its-many-benefits-corporate-use-aircraft-still-vilified/mbQ6mINMQXbAayzWvFn6NI/story.html
}
\end{qwrap}

``Use general aviation'' is Dinneen's way of saying that the companies
have their own fleets of corporate jets.

Explain how and why he is using the statistic he quotes to encourage
readers to confuse correlation with causation.
\end{exx}

	
\begin{exx}{\hassolution\sref{regressionnonsense}\gref{regressionnonsense}}
Cherry-picking\index{cherry-pick}

Find out what ``cherry-picking'' means, and where the phrase comes
from. Find and discuss some examples.

\begin{sol}

From Wikipedia (reliable in this case):

\begin{quotation}
Cherry picking, suppressing evidence, or the fallacy of incomplete
evidence is the act of pointing to individual cases or data that seem
to confirm a particular position, while ignoring a significant portion
of related cases or data that may contradict that position.
\end{quotation}

A student provided this answer - what's right about it? What's wrong
about it?

\begin{quotation}
cherry picking often heard in basketball for palming the ball or
traveling. also heard in video games when people are sniping also the
term is used for the construction workers. you know the ones that are
in a little bucket clipping tree branches or fixing wires.

The term cherry picking likely originates with the process of picking
fruit from a tree. When picking a type of fruit, such as cherries, a
person might search for only the best cherries, such as those that are
the healthiest.
\end{quotation}

\end{sol}

\end{exx}

\begin{exx}{\hassolution}Watch TV! Live Longer!

The data in the spreadsheet \link{TVData.xlsx} show
the life expectancy in years for several
countries, along with the number of people per TV set in those
countries.\footnote{%
The idea (and the data) for this problem come from the article 
\url{http://www.amstat.org/publications/jse/v2n2/datasets.rossman.html}.
}

\begin{abcd}
\item 
Which countries have the highest and lowest life expectancy at birth?
Which have the highest and lowest number of people per television
set? 

\item Use Excel to create a properly labelled scatter plot of the data
and find the trendline and R-squared value. 

\item What is the slope of the trendline (with its units)? Explain its
meaning in a sentence.

\item Does a small number of people per television set improve health?
Would people in countries with low life expectancy live
longer if we sent them shiploads of television sets? 

\item Does living longer increase the number of television sets?
If we improved the life expectancy in a country by
providing better medical care would that cause there to be fewer
people per television set? 

\item What else could be going on here? Why might high life expectancy
be strongly correlated with a low ratio of people per TV set? 

\end{abcd}

\begin{sol}

\begin{abcd}
\item Life expectancy varies from 79 years in Japan to 51.5 years in 
Ethiopia. Television prevalence varies from 1.3 people per set in the
United States to 592 per set in Myanmar. If you try to find the
largest and smallest values 
by simply scanning the columns of figures you're \myindex{likely} to make a
mistake. It's best to sort in Excel.

\item Here's a correct solution, with an \emph{exponential} trendline
as well as the linear one. 

\begin{center}
\includegraphics[height=50mm]{\here/TVdataRegression.png}
\end{center}

\item The slope of the trendline is -10.122 (people per TV) per (year
of life expectancy). It seems to say that for each \emph{decrease} of
10 people per TV, life expectancy \emph{increases} by one year.
The correlation isn't very good. The R squared value is just 0.3671.

\end{abcd}

The last three questions are all concerned with the same issue. What
might account for the fact that longer life expectancy seems to go
along with more television sets? The simple answer is that each trend
is a consequence of affluence. The richer a society, the better
medical care it offers its citizens and the more they have the leisure
and the means to watch television.

\end{sol}

\end{exx}



\begin{exx}{\hassolution}{\worthy} Crime rates revisited.\index{crime rate}

\begin{abcd}
\item Use the data in  \link{crimeDropsFearsRise.xlsx} to redo the
analysis for the entire period from 1990 to 2009.

\item Are the crime rates in this exercise consistent with those in
the example we studied in Chapter~\ref{Units}?

\end{abcd}

\begin{hint}
For the second question, all you can really look for is the 
\myindex{order of magnitude}. If that doesn't match, try to explain why.
\end{hint}

\begin{sol}
\begin{abcd}
\item Use the data in  \link{crimeDropsFearsRise.xlsx} to redo the
analysis for the entire period from 1990 to 2009.

I copied all four charts and 
changed the data series in each to use the numbers in rows
50:66. (I changed the titles and the scales on the axes too.)
Figure~\ref{fig:crimeFearRegressionSolution} shows the
result. Now there is a \emph{positive} correlation. Fear and crime more
or less rise and fall together. But the data are scattered and the
correlation is quite weak: $R^2$ is just 0.45. Plotting each variable
over time, you can see the crime rate falling quite consistently
following a linear trend ($R^2 \approx 0.91$) while
fear goes down and then up (the regression line is useless). That
might have made an even more interesting news story.

\begin{figure}[ht]
\centering
\includegraphics[height=120mm]{\here/CrimeFearSolutioncropped.pdf}
\caption{Crime:Fear Correlation}
\figsource{Charts from Excel spreadsheet we built.}
\label{fig:crimeFearRegressionSolution}
\end{figure}

\item Are the crime rates in this exercise consistent with those in
the example we studied in the chapter on Units?

The crime rates here are on the order of 500 per 100,000 people. In
the discussion in the Units chapter they are on the order of 10 per
1000 people. That converts to 1000 per 100,000 people, which is twice
as much. Perhaps that's because the ones here are ``violent crimes''
while the ones there are just ``crimes''.

\end{abcd}
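
The unit conversion in part (b) can be written out explicitly. This
sketch uses only the rounded order-of-magnitude figures quoted in that
answer (500 per 100,000 here, 10 per 1000 in the Units chapter):
\begin{equation*}
\frac{10 \hbox{ crimes}}{1000 \hbox{ people}} \times \frac{100}{100}
= \frac{1000 \hbox{ crimes}}{100{,}000 \hbox{ people}},
\qquad
\frac{1000}{500} = 2 .
\end{equation*}
Putting both rates on the same ``per 100,000 people'' basis is what
makes the factor of 2 visible.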

\end{sol}
\end{exx}


\begin{exx}{\untested\sref{regressionnonsense}\gref{regression}\gref{regressionnonsense}} The Mississippi River

\begin{qwrap}
\begin{quotation}
\firstline{In the space of one hundred and seventy-six years the Lower}
Mississippi has shortened itself two hundred and forty-two miles. That
is an average of a trifle over one mile and a third per
year. Therefore, any calm person, who is not blind or idiotic, can see
that in the Old Oolitic Silurian Period, just a million years ago
next November, the Lower Mississippi River was upwards of one million
three hundred thousand miles long, and stuck out over the Gulf of
Mexico like a fishing-rod.  And by the same token any person can see
that seven hundred and forty-two years from now the Lower Mississippi
will be only a mile and three-quarters long, and Cairo and New Orleans
will have joined their streets together, and be plodding comfortably
along under a single mayor and a mutual board of aldermen. There is
something fascinating about science. One gets such wholesale returns
of conjecture out of such a trifling investment of fact.
\begin{flushright}
Mark Twain \\
Life on the Mississippi \\
\url{http://www.gutenberg.org/files/245/245.txt}
\end{flushright}
\end{quotation}
\sourceinfo{Mark Twain,
http://www.gutenberg.org/files/245/245.txt. Surely this is in the
public domain by now.}
\end{qwrap}

Discuss this linear model for the length of the Mississippi
river. What's the slope? Can you verify Twain's arithmetic?

\begin{sol}
Since $242/176 = 1.375$, Twain is right to say the rate is ``an
average of a trifle over one mile and a third per year.'' That's the
slope.

Projecting that trend backwards, a million years ago the Mississippi
would have 
been about one and one third million miles longer than it was when Twain
wrote about it. The Gulf of Mexico is only about 560 miles from North
to South (\url{http://www.epa.gov/gmpo/about/facts.html}) so the river
would have done much more than stick out over the Gulf like a
fishing rod - it would have reached more than five times the distance
to the moon.

Projecting forward 742 years, the Mississippi would be about 1000
miles shorter. From Cairo (Illinois) to New Orleans is only about 600
miles, so that estimate doesn't seem right. Cairo and New Orleans
would be together in only $600/1.375 \approx 436$ years.
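
Twain's arithmetic can be laid out step by step. This is a sketch
using his figures, plus two assumptions that are not in his text: an
average distance to the moon of about 239,000 miles, and the rough
600-mile distance from Cairo to New Orleans used above.
\begin{align*}
\hbox{slope} &= \frac{242 \hbox{ miles}}{176 \hbox{ years}}
= 1.375 \hbox{ miles per year}, \\
1.375 \times 1{,}000{,}000 &= 1{,}375{,}000 \hbox{ miles in a million years}, \\
1{,}375{,}000 / 239{,}000 &\approx 5.75 \hbox{ times the distance to the moon}, \\
1.375 \times 742 &\approx 1020 \hbox{ miles in 742 years}, \\
600 / 1.375 &\approx 436 \hbox{ years to shorten by 600 miles}.
\end{align*}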

\end{sol}
\end{exx}

\begin{exx}{\untested\sref{regressionnonsense}\gref{regressionnonsense}}Well, maybe.

Explain the joke in the cartoon in Figure~\ref{fig:correlationXKCD}
from
\url{http://xkcd.com/}.

\begin{figure}[ht]
\centering
\includegraphics[height=40mm]{\here/correlationXCD.png}
\caption{Well, maybe.}
\label{fig:correlationXKCD}
\figsource{
\url{http://xkcd.com/}, licensed under a Creative Commons
Attribution-NonCommercial 2.5 License.  We need to negotiate.}
\end{figure}
\end{exx}




\begin{ExtraExercises}


\begin{eb}
Maura. I think this should be a regular exercise -- perhaps with a
little more guidance and a spreadsheet with the data for them to start
with.
\end{eb}
\begin{exx}{\untested}Anscombe's Quartet.

\begin{qwrap}
\begin{quotation}
\firstline{Anscombe's quartet comprises four datasets that have nearly}
identical 
simple statistical properties, yet appear very different when
graphed. Each dataset consists of eleven (x,y) points. They were
constructed in 1973 by the statistician Francis Anscombe to
demonstrate both the importance of graphing data before analysing it
and the effect of outliers on statistical properties.
\webref{http://en.wikipedia.org/wiki/Anscombe's_quartet}
\end{quotation}
\sourceinfo{http://en.wikipedia.org/wiki/Anscombe's_quartet}
\end{qwrap}

Data:

\begin{verbatim}
x , y , x , y , x , y , x , y
10.0 ,8.04 ,10.0 ,9.14 ,10.0 ,7.46 ,8.0 ,6.58
8.0 ,6.95 ,8.0 ,8.14 ,8.0 ,6.77 ,8.0 ,5.76
13.0 ,7.58 ,13.0 ,8.74 ,13.0 ,12.74 ,8.0 ,7.71
9.0 ,8.81 ,9.0 ,8.77 ,9.0 ,7.11 ,8.0 ,8.84
11.0 ,8.33 ,11.0 ,9.26 ,11.0 ,7.81 ,8.0 ,8.47
14.0 ,9.96 ,14.0 ,8.10 ,14.0 ,8.84 ,8.0 ,7.04
6.0 ,7.24 ,6.0 ,6.13 ,6.0 ,6.08 ,8.0 ,5.25
4.0 ,4.26 ,4.0 ,3.10 ,4.0 ,5.39 ,19.0 ,12.50
12.0 ,10.84 ,12.0 ,9.13 ,12.0 ,8.15 ,8.0 ,5.56
7.0 ,4.82 ,7.0 ,7.26 ,7.0 ,6.42 ,8.0 ,7.91
5.0 ,5.68 ,5.0 ,4.74 ,5.0 ,5.73 ,8.0 ,6.89

\end{verbatim}

F.J. Anscombe, ``Graphs in Statistical Analysis,'' American
Statistician, 27 (February 1973), 17-21.
\end{exx}

\begin{exx}{\needsquestions\untested}First class mail.

Data

\begin{verbatim}
Year   Cost (cents)
1976   13
1978   15
1981   18
1985   22
1988   25
1991   29
1995   32
1999   33
2001   34
2002   37
2006   39
2007   41
2008   42
2009   44
\end{verbatim}

\end{exx}

\begin{exx}{\untested\complex} \headline{Do the math on overrides}

Barry Bluestone and Anna Gartsman wrote an op-ed with that headline in
\theGlobe{} on June 24, 2010. Implicit in what they propose are several
linear dependencies among statistics describing towns in Massachusetts:

\begin{qwrap}
\begin{quotation}
\firstline{We decided to test this theory by simulating the impact on
home values} 
of a change in school spending due to a Prop. 2$\frac{1}{2}$ override, controlling
for other factors. We obtained data on housing values in 2005, the
change in housing values between 2005 and 2010, and two measures of
perceived school quality: school-wide SAT scores and per pupil
expenditures. We found complete data for 176 of the 351 cities and
towns in the Commonwealth. 

According to our analysis, which controlled for initial home value in
2005, a municipality with SAT scores and per pupil spending levels 20
percent higher than average experienced a 24 percent increase in
nominal home value between 2005 and 2010. In contrast, a municipality
with SAT scores and per pupil spending 20 percent below average
experienced a loss in home value of 11 percent. 

So, how much difference would the passage of the Hull override have
potentially meant for home values in that community? Hull's 2005
average home value of \$366,343 was near the mean for the communities
in our study. The average SAT score in the Hull public schools was 961
compared with an average of 1047 for all the study
communities. Average per pupil expenditure in Hull was \$11,491, some
\$1,500 higher than average. Based on our home value model, the
predicted increase in home values in Hull between 2005 and 2010 was
3.85 percent. 

Now what would \myindex{likely} have happened to the average home value in Hull
if the recent proposed \$1.9 million override had been passed back in
2005? This tax increase would have cost the average homeowner in Hull
\$506 per year. Over five years, it would have totaled \$2,530. However,
that tax increase would have resulted in an additional \$1,442 spent
per pupil. This increase would result in a predicted increase in home
value of 6.57 percent rather than the increase of 3.85 percent. The
difference between the two predicted values results in an average
increase in home value in Hull of \$9,970.%
\webref{%
http://www.boston.com/bostonglobe/editorial_opinion/oped/articles/2010/06/24/do_the_math_on_overrides/}
\end{quotation}
\sourceinfo[712]{http://www.boston.com/bostonglobe/editorial_opinion/oped/articles/2010/06/24/do_the_math_on_overrides/	}
\end{qwrap}

These figures are probably\index{probably} the \emph{result} of a
regression study. Identify the slopes of the regression lines
involved, and verify the predictions.

\end{exx}

\begin{exx}{\untested\complex} Say ``Cheese''

Figure~\ref{fig:cheese} appeared in \theTimes\ on November 7, 2010%
\webref{http://www.nytimes.com/interactive/2010/11/07/health/nutrition/fat.html}
\webref%
{http://www.nytimes.com/2010/11/07/us/07fat.html?_r=1&hp}.

\begin{figure}[ht]
\centering
\includegraphics[height=50mm]{\here/PizzaImage.jpg}
\caption{Cheese Consumption}
\figsource{From \theTimes, 11/7/2010, 
\url{http://www.nytimes.com/interactive/2010/11/07/health/nutrition/fat.html}}
\figcomment{We will probably delete this exercise.}
\label{fig:cheese}
\end{figure}

The graphs suggest a linear model. Explore.

\end{exx}


\begin{exx}{\needsquestions} Climate changes

The Economist published 
Figure~\ref{fig:climatechange} on May 12, 2010, along with the
following paragraph:

\begin{qwrap}
\begin{quotation}
\firstline{How global surface temperature, ocean heat and atmospheric}
CO$_2$ levels have risen since 1960 

THE record of atmospheric carbon-dioxide levels started by the late
Dave Keeling of the Scripps Institute of Oceanography is one of the
most crucial of the data sets dealing with global warming. When the
measurements started in 1959 the annual average level was 315 parts
per million, and it has gone up every year since. To begin with it
went up by roughly one part per million per year. Now it is more like
two parts per million per year. The figure for 2011 is 391.6. More
carbon dioxide in the atmosphere means a stronger greenhouse effect,
and various measurements speak to this. Global surface temperature
records show a warming over the same period, though because of
fluctuations in the climate, air pollution, volcanic eruptions and
other confounding factors the rise is nothing like as smooth. A
steadier rise can be seen in the heat content of the oceans, measured
in terms of the energy stored, rather than the temperature. 
\end{quotation}
\sourceinfo{From The Economist, May 12, 2010.}
\end{qwrap}

\begin{figure}[ht]
\centering
\includegraphics[height=50mm]{\here/ClimateChangeEconomist.png}
\caption{Climate Change}
\label{fig:climatechange}
\figsource{From The Economist, May 2, 2012}
\end{figure}


\end{exx}

\begin{exx}{\needsquestions}Start with a graph

This is a placeholder for Exercises as suggested by a Cengage
reviewer:

\begin{quotation}
I'd like to see more homework problems here that begin from a graph
and trend line, rather than beginning from a data set. Given that it
is so easy to mislead with graphs, this would help students to develop
those ``defensive reading'' skills that appear to be one of the goals of
this chapter. 
\end{quotation}
\end{exx}



\end{ExtraExercises}

\begin{ReviewExercises}

\begin{rexx}
For each of the regression lines, find the slope and intercept.  State whether the correlation is positive (as $x$ increases, $y$ increases)
or negative (as $x$ increases, $y$ decreases).
\begin{abcd}
\item $y = 2x - 4$
\item $Q = -0.05t + 2.11$
\item $ R = -3.4S - 22$
\item  $y= 35x + 100$
\end{abcd}
\end{rexx}

\begin{rexx}
If the $R^2$ value for a regression line is 0.78, and if the correlation is positive, what is the $R$ value?  
\end{rexx}

\begin{rexx}
If the $R^2$ value for a regression line is 0.62, and if the correlation is positive, what is the $R$ value?  
\end{rexx}
 
\begin{exx}{\needsquestions}
\headline{2\% per degree Celsius \ldots the magic number for how worker
productivity responds to warm/hot temperatures}

\url{http://andrewgelman.com/2012/09/persistently-reduced-labor-productivity-may-be-one-of-the-largest-economic-impacts-of-anthropogenic-climate-change/}

\end{exx}

\end{ReviewExercises}


\begin{ScopeExercises}

\end{ScopeExercises}
