%False-positive psychology
%
\chapter{\mychaptername}
\label{\here}

\tocnotetoo{
In Chapter~\ref{BreakTheBank} we looked at probabilities of independent
events --- things that had nothing to do with one another. Here we think about
probabilities in situations where we expect to see connections, such as in
screening tests for diseases or DNA evidence for guilt in a criminal
trial.}%
\begin{teacher}
This chapter focuses on two way contingency tables in
order to discuss several important common logical pitfalls dealing
with everyday probabilities. We think that approach makes more sense,
and is easier to remember and apply, than an explicit treatment of
dependent events and Bayes' theorem. That's too technical for our
goals in this quantitative reasoning text, and so better left for a
full course in probability and statistics. In fact, many of the
examples in this chapter employ qualitative rather than quantitative
reasoning.  

You can even skip the first two sections and the vocabulary of
dependent events and start with the section on screening for rare
diseases. 

If you want to go further into the analysis of dependence (perhaps
leading to Bayes' theorem) consider two way tables as the entry
point. Independence corresponds to tables whose rows (and hence
columns) are proportional. Those are the only ones that can be modeled
using areas of parts of a square, as in the last chapter.

Causation corresponds to tables with a 0 in one quadrant.
\end{teacher}

\begin{goals}
\begin{goal}{contingencytable}
Interpret and build two way contingency tables.
\end{goal}

\begin{goal}{dependentevents}
Understand how to compute probabilities for dependent events.
\end{goal}

\begin{goal}{falsepositives}
Understand the implications of false positives.
\end{goal}
\end{goals}

\qrsection[conditional]{UMass Boston enrollment}

Table~\ref{table:UMassBostonEnrollment}
summarizes student enrollment 
at UMass Boston in 2007 by category two ways: graduate/undergraduate
and male/female. We can use the data to answer some probability
questions about a random student.
\begin{teacher}
The data in this example are rather parochial. You might want to find
some that mean more to your particular class.
\end{teacher}

\begin{table}
\centering
\begin{tabular}{l
S[table-format=5.0]
S[table-format=4.0]
S[table-format=5.0]
}
\toprule
& {Undergraduate} & {Graduate} & {Total} \\
\midrule
Female & 5,680 & 2,388 &  8,068 \\
Male   & 4,328 & 1,037 & 5,365 \\
Total & 10,008 & 3,425 & 13,433 \\
\bottomrule
\end{tabular}
\caption{UMass Boston enrollment, 2007}
\tablesource{Handbuilt table. We're sure data is public.}
\label{table:UMassBostonEnrollment}
\end{table}

\begin{itemize}

\item What is the probability that a student chosen at random is an
undergraduate?

The last row of the table has the numbers we need:
\begin{align*}
\frac{\text{number of undergraduates}}{\text{number of students}}
& = \frac{10,008}{13,433} \\
& = 0.745 \\
& \approx 75\%.
\end{align*}
Three quarters of the students are undergraduates.

\item What is the probability that a student is female?

For that computation we use the totals in the last column:

\begin{align*}
\frac{\text{number of females}}{\text{number of students}}
& = \frac{8,068}{13,433} \\
& = 0.600610437 \\
& \approx 60\%.
\end{align*}

\item What is the probability that a student is a female
undergraduate?

Use the count in the first column of the first row:

\begin{align*}
\frac{\text{number of female undergraduates}}{\text{number of students}}
& = \frac{5,680}{13,433} \\
& =  0.4228392764 \\
& \approx 42\%.
\end{align*}

\end{itemize}

In each of these probability calculations we used the total number of
students (13,433) in the denominator. 

Continuing \ldots

\begin{itemize}

\item What is the probability that a female student is an undergraduate?

  Since this is a question about the female students, we need a
different denominator:

\begin{align*}
\frac{\text{number of female undergraduates}}{\text{number of
    female students}} 
& = \frac{5,680}{8,068} \\
& = 0.70401586514 \\
& \approx 70\% .
\end{align*}

\item What is the probability that an undergraduate is female?

That's a different question. This time we know the student is an
undergraduate. That calls for a different denominator:

\begin{align*}
\frac{\text{number of female undergraduates}}{\text{number of
    undergraduates}}
& = \frac{5,680}{10,008} \\
& =  0.56754596322 \\
& \approx 57\% .
\end{align*}
\end{itemize}

The last two questions sound similar, but have different answers,
because each begins with a different assumption. In the first we know
the student is female and wonder whether she's an undergraduate.
In the second, we know that the student is an undergraduate and wonder
whether it's a she. 

We're not finished thinking about these probabilities. We found that
there's a 60\% probability that a student is female. But if
we know the student is an undergraduate then that probability drops to
57\%, because the proportion of women is different for undergraduates
than for the student body as a whole. This is not what happened when
we thought about a coin and a die
in \sref*{coindie}. The probability that the die shows a four is the
same whether the coin comes up heads or tails. Those events are
\emph{independent}. The facts ``is female'' and 
``is an undergraduate'' are \emph{dependent}\index{dependent
events}. When you know one of them you know something about the
probability of the other.

We learned in \sref*{coindie} that when events are independent
you multiply to compute the probability that both happen:
\begin{align*}
\text{probability(coin H and die 4)}
& = 
%CHANGE U -> H \text{probability(coin U)} \times \text{probability(die 4)} \\
\text{probability(coin H)} \times \text{probability(die 4)} \\
& = \frac{1}{2} \times \frac{1}{6} \\
& = \frac{1}{12}.
\end{align*}
For dependent events that won't work. We found that
\begin{equation*}
\text{probability(female and undergraduate)} = 42\%
\end{equation*}
but
\begin{align*}
\text{probability(female)} \times \text{probability(undergraduate)} 
& = 60\% \times 75\% \\
& = 45\%.
\end{align*}

Those answers are close, but not the same, because the proportion of
females among the undergraduates is close to but not the same as the
proportion among the graduate students.
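
A short worked check, using only the counts in
Table~\ref{table:UMassBostonEnrollment}, makes that comparison
concrete. (The second calculation is our own sketch, not a method the
rest of the chapter relies on.) The proportion of females among the
graduate students is
\begin{align*}
\frac{\text{number of female graduate students}}{\text{number of
    graduate students}}
& = \frac{2,388}{3,425} \\
& \approx 70\%,
\end{align*}
compared with 57\% among the undergraduates. Multiplication does give
the right answer if the second factor is the probability that an
\emph{undergraduate} is female rather than the probability that a
student is female:
\begin{align*}
\frac{10,008}{13,433} \times \frac{5,680}{10,008}
& = \frac{5,680}{13,433} \\
& \approx 42\%.
\end{align*}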

In the rest of this chapter we will look at the probabilities for
dependent events, working with displays like
Table~\ref{table:UMassBostonEnrollment}
in examples where the consequences matter much more
than they do here.%
\begin{teacher}
What we do with tables can of course also be done with formulas --- the
most important one is \myindex{Bayes' rule}. We don't work with the
formulas since we think the tables are easier to understand and the
methods using them easier to remember.
\end{teacher}

\qrsection[falsepos]{False positives and false negatives}

Many women have periodic mammograms \index{mammogram} to look for
breast cancer. Many men have periodic PSA tests to look for prostate
cancer. \index{PSA test} In each there are four possibilities. We'll
spell them out for breast cancer.

\begin{itemize}
\item \emph{True positive}: a woman has breast cancer and the mammogram
  says so. 

\item \emph{True negative}: a woman does not have breast cancer
  and the mammogram says she doesn't.

\item \emph{False positive}: a woman doesn't have breast cancer but
  the mammogram mistakenly says she does.

\item \emph{False negative}: a woman does have breast cancer but the
  mammogram doesn't detect it.
\end{itemize}

If the test were perfect there would be no false positives and no
false negatives --- but there are very few perfect tests. In order to
understand what the test results mean you can build a table like the
one in the first section of this chapter.  

We'll do that with a real example. Figure~\ref{fig:vennDiagram} appeared
in the article ``False positives, false negatives, and the
validity   of the diagnosis of major depression in primary care'' in the
September 1998 \emph{Archives of Family Medicine}.

It summarizes the results of a study of 372 patients who were screened
by family physicians for clinical depression.

\figfile{vennDiagramCapture.pdf}
\begin{figure}
\centering
\includegraphics[height=60mm]{\thefigurefilename}
\begin{csmr}[Diagnosing depression\label{fig:vennDiagram}]
M. S. Klinkman, J. C. Coyne, S. Gallo and T. L. Schwenk,
False Positives, False Negatives, and the Validity
of the Diagnosis of Major Depression in Primary Care,
\emph{Arch Fam Med.} 1998;7(5):451 -- 461,
\url{www.ncbi.nlm.nih.gov/pubmed/9755738}
\access{October 4, 2015}.
Licensed under a Creative Commons Attribution-Noncommercial-No
Derivative Works 3.0 United States License.
(\url{creativecommons.org/licenses/by-nc-nd/3.0/})
\csmrcomment{Creative Commons, OK}
\end{csmr}
\end{figure}
\figfile{}

The numbers in the four categories in the figure are easier to
understand when we put them in Table~\ref{table:vennTable}.


\begin{table}
\centering
\ctablehead{depressed}{yes}{no}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
S[table-format=2.0]
S[table-format=3.0]
|
S[table-format=3.0]
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{diagnosed}} &
\multicolumn{1}{c|}{yes} & 31 & 34 & 65     \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{no}& 50  & 257 & 307 \\
\hline
\multicolumn{1}{|c}{}
 & total & 81  & 291  & 372  \\
\hline
\end{tabular}
}
\caption{Diagnosing depression}
\label{table:vennTable}
\end{table}

Two by two tables like this are called \emph{contingency
  tables}\index{contingency table}.
Figure~\ref{fig:contingencyTable} shows the standard names for
the four cells with raw data: true positive,
false positive, false negative and true negative. In this example
they have values 31, 34, 50 and 257. 

% figfile for graphics - this figure is a table}
\figfile{}
\begin{figure}
\centering

\settowidth{\tempdima}{has breast cancer}% compute width needed
\addtolength{\tempdima}{-2\tabcolsep}% minus default column sep

{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc|cc|}
\cline{3-4}
& &   \multicolumn{2}{c|}{condition }  \\
& &   {\makebox[0.5\tempdima]{present}} & {\makebox[0.5\tempdima]{absent}}  \\
\hline
\multicolumn{1}{|c}{\multirow{2}{*}{screened positive}} &
\multicolumn{1}{c|}{yes} & true positive  & false positive    \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{no} &  false negative  & true negative   \\
\hline
\end{tabular}
}
\caption{A two way contingency table}
\figsource{Hand built.}
\label{fig:contingencyTable}
\end{figure}
\figfile{}

The totals tell us that 65 people in the population of 372 (17\%) were
diagnosed as depressed, and that 81 (22\%) were depressed. Those
numbers are pretty close. But does that make it a good test? To answer
that question we need to look at the columns separately.

\begin{itemize}
\item
  The first column tells us there were 31 true positives and 50 false
  negatives from the total of 81 subjects who were in fact depressed.
So if a subject was depressed the probability that he or she was
diagnosed correctly is only $31/81 \approx 38\%$. That is the
true positive rate.  There's a $62\%$
chance the condition was missed. That 62\% is the false negative
rate. 

\item
The second column says that even when a subject was not depressed
the chance of a diagnosis of depression was
$34/291 = 0.117 \approx 12\%$. That is the false positive rate. 
%It
%means about 12\% of the people diagnosed as depressed didn't
%in fact suffer from that condition. 
\end{itemize}

Whether this is a ``good'' test is a difficult decision.  

Although the chance of misdiagnosis of depression when it doesn't
exist is fairly low --- about 12\% --- the 62\% false negative rate says
%CHANGE less -> fewer 
that the test will identify fewer than half the depressed people.
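
The rows of Table~\ref{table:vennTable} answer a different pair of
questions: what a diagnosis, or the lack of one, says about the
patient. As a sketch of that calculation (our own arithmetic from the
table, not a figure reported in the study):
\begin{align*}
\text{probability(depressed, given a diagnosis)}
& = \frac{31}{65} \approx 48\% \\
\text{probability(not depressed, given no diagnosis)}
& = \frac{257}{307} \approx 84\%.
\end{align*}
So fewer than half of the patients diagnosed as depressed actually
were.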

\qrsection[rare]{Screening for a rare disease}

A test with a small false positive rate looks like a good candidate for
screening large populations for a nasty disease. However, if the
disease is rare, the test may not be as good as it looks. In this section
we'll study two examples, one made up and one real.

Suppose a drug company has developed a test for the rare disease
X. Clinical trials show that the test is 90\% accurate at detection,
so the false negative rate is 10\%. Those trials also show that the
false positive rate is only 1\%. 

These are the important questions:

\begin{enumerate}
\item What is the probability that a person who suffers from X tests
positive?

\item What is the probability that a person who tests positive suffers
from X?
\end{enumerate}
%

If the test were perfect --- no false positives, no false negatives ---
each question would have the same answer: 
100\%. But the two facts ``suffers from X'' and ``tests
positive for X'' are not exactly the same.  Knowing either one makes the
other more likely, but not certain. We want to find out how much more
likely in each case.

Question 1 is easy: the drug company's clinical trials found that
there is a 90\% probability that a person who suffers from X tests
positive for X.     


Whether that test is as good as it sounds depends in part on the
answer to the second question. That answer depends on two things: the
false positive rate and the number of people who actually have X.
Suppose just one person in every 1,000 suffers from X 
(one tenth of one percent of the population). Then even though the
false positive rate is only 1\%, most of the positive results will come
from healthy people. We can use a contingency table to
find the actual value for ``most of''.

Since percentages (particularly small percentages) are often
confusing, we'll build our table for an imaginary population of
100,000 people that just matches the statistical profile for this
test. \index{natural frequencies}
In a population of 100,000, one out of every 1,000
will have the disease. That's 100 people. Of those 100, 90\% (so 90
people) will test positive. The other 10 will be the false negatives.
Of the 99,900 healthy people, one percent (999) will test
positive. The other 98,901 will be the true
negatives. Table~\ref{table:xtable} shows the contingency table.

\begin{minipage}[c]{\textwidth}
\begin{center}
\ctablehead{suffers from X}{yes}{no}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
S[table-format=3.0]
S[table-format=5.0]
|
S[table-format=6.0]
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{test + for X}} &
\multicolumn{1}{c|}{yes} &  90  & 999  & 1,089 \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{no}& 10  & 98,901 & 98,911 \\
\hline
\multicolumn{1}{|c}{}
 & total & 100  & 99,900  & 100,000 \\
\hline
\end{tabular}
}
\captionof{table}{Screening for disease X}
\label{table:xtable}
\end{center}
\end{minipage}

Now we can answer the second question. The probability that someone
who tests positive is actually ill with X is only $90/1089 \approx 8.26\%$.
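
The same table also tells us what a negative result means. As a quick
check of both rows (our own arithmetic, using only the counts in
Table~\ref{table:xtable}):
\begin{align*}
\text{probability(has X, given a positive test)}
& = \frac{90}{90 + 999} = \frac{90}{1,089} \approx 8.26\% \\
\text{probability(healthy, given a negative test)}
& = \frac{98,901}{10 + 98,901} = \frac{98,901}{98,911} \approx 99.99\%.
\end{align*}
A negative result is very reliable. It is the positive results that
need a second look.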

Is this acceptable? Maybe, maybe not. If the test is inexpensive and
there's a second test (perhaps more expensive) that can weed out the
false positives, and the disease can be treated successfully if
detected, perhaps the screening is a good idea.  If all the people who
test positive must undergo expensive, painful, unreliable treatment,
which would be unnecessary for more than 90\% of them, then the
screening is probably a bad investment of scarce health care
resources.

For a real application of this technique to the statistics of
screening for breast cancer, work \exref{mammography}.

\qrsection[difficult]{Trisomy 18}

In this section we'll work through the numbers when considering
whether to call for routine prenatal screening for the rare birth
defect \myindex{trisomy 18}.

\begin{quotation}
K Spencer and colleagues' claims for prenatal detection of trisomy 18
by measurement of maternal serum (alpha) fetoprotein and free $\beta$ human
chorionic gonadotrophin concentrations are impressive. Detection of
50\% of cases for a false positive rate of only 1\% seems to compare
favourably with the detection rate for Down's syndrome when
similar techniques are used, which is 70\% for a false positive
rate of 5\%. Unfortunately, the authors fail to emphasise the
importance of the relative incidence of the two conditions at
birth before concluding that screening for trisomy 18 should be
introduced.
\begin{csmr}
  T. Davies,
  Prenatal screening for trisomy 18: Should not be contemplated,
  \emph{British Medical Journal},
 February 12,  1994,
doi: \texttt{doi.org/10.1136/bmj.308.6926.471},
 \url{www.bmj.com/content/308/6926/471.1}
 \access{July 28, 2019}.
\end{csmr}
\end{quotation}

\begin{teacher}
This example started out as an exercise. We discovered (and should not
have been surprised) that it's too complex for most students to
read on their own, even this late in the semester when they are used
to seeing hard questions.
\end{teacher}

Davies notes that there are about 12.6 instances of Down's syndrome
per 10,000 births. The incidence for trisomy 18 is just 1.3 per 10,000
births. 

A positive test leads to a second procedure, an amniocentesis to check
whether the positive is true or false.  Davies calculates that screening
10,000 pregnant women for Down's syndrome ``would result in 8.8 cases
being detected at the cost of 500 
amniocenteses (5\% of 10,000). This means that one case of Down's
syndrome is detected for every 57 amniocenteses performed.'' For
trisomy 18 his figures are 0.65 cases detected, 100
amniocenteses, so 154 amniocenteses to detect one true case.

Let's check his arithmetic for trisomy 18. To build the contingency
table we need three numbers:

\begin{itemize}
\item The false negative rate. Since the test detects 50\% of cases
  the other 50\% are the false negatives.
\item The false positive rate. It's only 1\%.
\item The incidence rate. Davies tells us it's 1.3 per
  10,000 births. 
\end{itemize}

Figure~\ref{fig:trisomytable} shows 
a screenshot of the spreadsheet
\link{ContingencyTable.xlsx} with entries for this
problem. Cell~\cell{B12} is named \excel{INCIDENCE}; it contains the
formula \excel{=1.3/10000}, formatted as a percent. Cell~\cell{B17}
for the number of true positive results contains the formula

\displayexcel{=POPULATION*INCIDENCE*(1-FALSENEG)}

which in this example is 
%
\begin{equation*}
10,000 \times \frac{1.3}{10,000} \times (1-0.5) = 0.65,
\end{equation*}
%
confirming Davies' ``0.65 cases per 10,000 women tested.''  
The spreadsheet shows
100.637 positive tests, which matches Davies' estimate of 100. So it
would take 100 amniocenteses to find 0.65 cases of trisomy 18. 
That works out to $100/0.65 = 154$ amniocenteses to find each case. 
(\exref{downs} asks you to check Davies' arithmetic for Down's
syndrome.)
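
Here is a sketch of where the spreadsheet's 100.637 comes from. We
assume (without reproducing the spreadsheet's formulas here) that it
counts false positives as 1\% of the unaffected pregnancies and adds
the true positives:
\begin{align*}
\text{false positives} & = (10,000 - 1.3) \times 0.01 = 99.987 \\
\text{positive tests} & = 0.65 + 99.987 = 100.637.
\end{align*}
Davies' round figure of 100 amniocenteses is 1\% of all 10,000 women,
which is nearly the same number.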

%\figfile{trisomytable.png}
\figfile{trisomytablecropped.pdf}
\begin{figure}
\centering
\framebox{
\includegraphics[width=4.5in]{\thefigurefilename}
}
\captionof{figure}{Screening for Trisomy 18}
\label{fig:trisomytable}
\end{figure}
\figfile{}

You don't have to know what amniocentesis is, but you
do have to know that it has some risks: there is a small chance that
it will lead to a miscarriage. Davies claims that
 ``a screening programme would cause
the abortion of at least as many normal fetuses as it would detect
cases of trisomy 18.''
(``Abortion'' is his synonym for ``miscarriage,'' not
the politically charged ``abortion'' so much in the news.)
\index{amniocentesis}

That would be true if the risk of miscarriage from amniocentesis were
about one in 150. It's probably smaller. Several web sources provide
statistics like these:

\begin{quotation}
Miscarriage is the primary risk related to amniocentesis. The risk of
miscarriage ranges from 1 in 400 to 1 in 200. In facilities where
amniocentesis is performed regularly, the rates are closer to 1 in
400.%
\begin{csmr}
Amniocentesis,
American Pregnancy Association,
\url{americanpregnancy.org/prenataltesting/amniocentesis.html}
\access{July 15, 2015}.
\csmrcomment{36 words, public website, fair use}
\end{csmr}
\end{quotation}

If we use one in 300 instead of one in 150 as the probability of
miscarriage from an amniocentesis then it costs about one unnecessary
miscarriage to detect two cases of trisomy 18. That's still a pretty
high risk. 
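
The arithmetic behind both claims, as a sketch using Davies' figure of
100 amniocenteses for every 0.65 cases detected:
\begin{align*}
100 \times \frac{1}{150} & \approx 0.67 \text{ miscarriages (risk one in 150)} \\
100 \times \frac{1}{300} & \approx 0.33 \text{ miscarriages (risk one in 300).}
\end{align*}
In the first case miscarriages and detections are about equal. In the
second there is about one miscarriage for every two detections.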

Davies compares his risk estimate to the much lower estimate for
similar screening for Down's syndrome, noting that ``In many places it
is still undecided whether screening for Down's syndrome is worth the
disbenefits for the prospective parents.''

%\begin{quotation}
%Prenatal screening for trisomy 18: Should not be contemplated.
%
%K. Spencer and colleagues' claims for prenatal detection of trisomy 18
%by measurement of maternal serum (alpha) fetoprotein and free $\beta$
%human chorionic gonadotrophin concentrations are impressive. Detection of
%50\% of cases for a false positive rate of only 1\% seems to compare
%favourably with the detection rate for Down's syndrome when similar
%techniques are used, which is 70\% for a false positive rate of
%5\%. Unfortunately, the authors fail to emphasise the importance of the
%relative incidence of the two conditions at birth before concluding
%that screening for trisomy 18 should be introduced. 
%
%The natural incidence of Down's syndrome at birth is approximately
%12.6/10,000 births. Among 10,000 pregnant women a 70\% sensitivity
%would result in 8.8 cases being detected at the cost of 500
%amniocenteses (5\% of 10,000). This means that one case of Down's
%syndrome is detected for every 57 amniocenteses performed. The
%incidence of trisomy 18 at birth is 1.3/10,000 births. A sensitivity
%of 50\% would detect 0.65 cases per 10,000 women tested at a cost of
%100 amniocenteses (1\% of 10,000). For each case of trisomy 18
%detected, therefore, 154 women would have to have had
%amniocentesis. Thus a screening programme would cause the abortion of
%at least as many normal fetuses as it would detect cases of trisomy
%18. 
%
%In many places it is still undecided whether screening for Down's
%syndrome is worth the disbenefits for the prospective parents. To my
%mind, the decision is clear for screening for trisomy 18: screening
%should not be contemplated until the predictive value of the test is
%considerably improved.%


\qrsection[prosecutor]{The prosecutor's fallacy}
\index{prosecutor's fallacy}

The Cornell University Legal Information Institute posted a discussion
of \emph{McDaniel v. Brown} when that case was on the docket of the
Supreme Court. They wrote \index{McDaniel v. Brown}\index{Supreme Court}

\begin{quotation}
Following a state conviction for sexual assault, Troy Brown
filed a petition for writ of habeas corpus in the United States
District Court for the District of Nevada. The District Court allowed
Brown to present new evidence: a report from Dr. Lawrence
Mueller. This report detailed a statistical error (``prosecutor's
fallacy'') made by the prosecution during the presentation of DNA
evidence. Based on Dr. Mueller's report, the District Court dismissed
the DNA evidence from consideration, found insufficient evidence to
convict Brown, and ordered a retrial.  

\ldots

At trial, Renee Romero, a forensic scientist at the Washoe County Crime
Lab, testified that the DNA found in the victim's underwear matched
Brown's DNA; only one in three million people would match the DNA
tested. The prosecutor asked Romero to express this statistic
as ``the likelihood that the DNA found \ldots is the same as the DNA
found in [Brown's] blood.'' Romero concluded that the likelihood was 99.999967
percent. Based on this statistic, the prosecutor then
asked Romero if it would be fair to conclude that there was a 0.000033
percent chance that the DNA did not belong to Brown. Romero
agreed with the prosecutor, stating that this was ``not
inaccurate.''%
\begin{csmr}
M. Lynn and C. Maier,
McDaniel v. Brown (08-559)
Legal Information Institute,
LII Supreme Court Bulletin,
Cornell University Law School,
\url{www.law.cornell.edu/supct/cert/08-559}
\access{September 18, 2015}.
Michelle Jessica Lynn and Chris Maier are the authors of that Supreme
Court Preview.  Legal Information Institute at Cornell Law School.
\end{csmr}
\end{quotation}

Romero's arithmetic is right: one in three million  is 0.000033
percent. But her thinking is wrong. 

The prosecutor's fallacy is the claim that the one in three million
probability of a random match is the same as the probability that the
defendant is innocent. We can use a contingency table to show why
those probabilities are different.

First we need an
estimate of the population in which a possible DNA match might be
found. To make the arithmetic easier, we'll take that to be 9 million
people (Los Angeles is near enough to Nevada). Then the ``one in three
million'' statistic says we should expect three DNA matches from the
innocent people in that population. This contingency table 
summarizes the data:

\begin{center}
\ctablehead{\ \ \ truth \ \ \ }{guilty}{innocent}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
S[table-format=1.0]
S[table-format=7.0]
|
S[table-format=7.0]
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{DNA}} &
\multicolumn{1}{c|}{match} &  1  & 3  & 4 \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{nonmatch} &  0  & 8,999,996 & 8,999,996 \\
\hline
\multicolumn{1}{|c}{}
 & total & 1  & 8,999,999  & 9,000,000 \\
\hline
\end{tabular}
}
\end{center}

Make sure you understand the first row of the table.  One person is
guilty and is a DNA match.  The three matches from innocent people are
false positives. In other words, the first row of that table tells us
that if the only evidence in the case is the DNA match the odds are
$3:1$ that the suspect is innocent! The probability that he's guilty
is only 25\%. That's a far cry from the ``99.999967\% guilty'' that
the prosecutor asked the jury to believe.  
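
In symbols, the two steps behind that first row are (a sketch under
the same assumed population of 9 million):
\begin{align*}
\text{expected innocent matches}
& = 9,000,000 \times \frac{1}{3,000,000} = 3 \\
\text{probability(guilty, given a match)}
& = \frac{1}{1 + 3} = 25\%.
\end{align*}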

The defense didn't make this argument using a hypothetical population
of 9,000,000 potential suspects. Instead they questioned the ``one in
three million'' chance of a match. The defendant had near relatives in the
area, which increased the chance of a match to about one in 6,500,
according to a defense specialist. By Romero's reasoning, that would
reduce the likelihood that the DNA was Brown's to
$6499 / 6500 = 0.999846154 \approx 99.98\%$. 
We're not surprised that the change from 99.999967 percent to 99.98\%
did not convince the jury to acquit. 99.98\% still sounds very much
like a sure thing.

But it's not, because of the prosecutor's fallacy. That was the basis
for the appeal. Suppose we
reduce the population from which the match might come to just
100,000 --- the nearby area where there may be close relatives. Then
the 1 in 6,500 chance of a match means there will be about 15 matches
in that population in addition to the one match for the guilty party.
The numbers in the revised contingency table below show that the odds
are now $15:1$ that the DNA match fingers an innocent person rather
than the true criminal. 

\begin{center}
\ctablehead{\ \ \ \ \ \ truth \ \ \ \ \ \ }{guilty}{innocent}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
S[table-format=1.0]
S[table-format=5.0]
|
S[table-format=6.0]
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{DNA}} &
\multicolumn{1}{c|}{match}  &  1  & 15  & 16 \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{nonmatch} &  0  & 99,984 & 99,984 \\
\hline
\multicolumn{1}{|c}{}
 & total & 1  & 99,999  & 100,000 \\
\hline
\end{tabular}
}
\end{center}
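
The arithmetic for the revised table, again as a sketch with the
assumed population of 100,000:
\begin{align*}
\text{expected innocent matches}
& = 100,000 \times \frac{1}{6,500} \approx 15 \\
\text{probability(guilty, given a match)}
& = \frac{1}{1 + 15} = \frac{1}{16} \approx 6\%.
\end{align*}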

Nevertheless, the story did not end well for Brown. 

\begin{quotation}
The Supreme Court [overturning the appeals court order for a retrial]
said  in a per curiam opinion that overstated estimates of a DNA match at
trial did not warrant reversal of a conviction when there is still
``convincing evidence of guilt.''% 
\begin{csmr}
D. Badertscher,
U.S. Supreme Court Update: McDaniel v. Brown,
Criminal Law Library Blog (January 26, 2010),
\url{www.criminallawlibraryblog.com/2010/01/us_supreme_court_update_mcdani.html}
\access{July 25, 2015}.
\csmrcomment{41 words fair use}
\end{csmr}
\end{quotation}

\qrsection[retrospective]{The boy who cried ``Wolf'' }
\index{wolf, boy who cried}

After unusual disasters like terrorist attacks, earthquakes, severe
storms or airplane crashes you often hear finger-pointing discussions about
the incompetence of the agencies charged with predicting (perhaps even
preventing) what happened. Those discussions may start with a search
that discovers warning signs that were ignored.

Sometimes there were real lapses, and policies and
practices must be designed to prevent a recurrence.
But often blame is unjustified. Table \ref{table:disaster} 
explains why, even without numbers. You might call this
\emindex{qualitative reasoning}. 

\begin{table}
\centering
\ctablehead{what happens}{disaster}{nothing}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
c %S[table-format=1.0]
c % S[table-format=5.0]
|
c % S[table-format=6.0]
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{warning?}} &
\multicolumn{1}{c|}{yes}  &  rare  & usually  &  infrequent \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{no} &  rare  & almost always & almost always \\
\hline
\multicolumn{1}{|c}{}
 & total & rare  & almost always  & always \\
\hline
\end{tabular}
}
\caption{Should it have been predicted?}
\tablesource{Handbuilt data.}
\label{table:disaster}
\end{table}

With numbers in the first column you can compute the
probability that a disaster occurs with no warning at all.
That's not good. To guard against it, there should be more warnings.
Then with numbers in the first row you can compute the
probability that a particular warning actually corresponds to a
disaster about to happen. But more warnings don't lead to more
disasters, just to more false positives.

That means there are often good reasons for ignoring a
warning. State and governmental agencies have to balance the severity
of the warning with the cost and inconvenience of asking the public to
respond. For example, an earthquake warning may lead to an order to 
evacuate an entire city. The expense and disruption from repeated
evacuations that are not followed by an earthquake may be worse than
the consequences in the rare instance when the earthquake
happens. Just because after the fact you look back and find clues in the
seismic record that suggested an earthquake might be imminent doesn't
mean evacuation was the right call.

\exstart

\begin{exx}{\hassolution\sref{falsepos}
\gref{contingencytable}\gref{falsepositives}}
Chronic fatigue syndrome.
\index{chronic fatigue syndrome}

On August 24, 2010 a headline in \theGlobe{} read
``Researchers link chronic fatigue syndrome to class of virus''. 
The story reported on a study of 37 patients with the disease. 32
tested positive for a particular suspicious virus. Only 3 of 44
healthy people tested positive.
\begin{csmr}
R. Stein,
Researchers link chronic fatigue syndrome to class of virus,
Washington Post report in \theGlobe{} (August 24, 2010),
\url{www.boston.com/news/nation/articles/2010/08/24/researchers_link_chronic_fatigue_syndrome_to_class_of_virus}
\access{March 30, 2020}.
\csmrcomment{paraphrase}
\end{csmr}

A 2003 study in the \emph{Archives of Internal Medicine}
reported that ``The overall \ldots prevalence of
CFS \ldots was 235 per 100,000 persons.''%
\begin{csmr}
M. Reyes \emph{et al.},
Prevalence and incidence of chronic fatigue syndrome in Wichita,
Kansas,
\emph{Arch Intern Med.} 2003 Jul 14;163(13):1530--6,
\url{www.ncbi.nlm.nih.gov/pubmed/12860574}
\access{December 15, 2015}.
\end{csmr}

\begin{abcd}

\item Construct the contingency table for this diagnostic tool. You
  may do this by hand, or with the spreadsheet
\link{ContingencyTable.xlsx}.


\item Explain why this test is potentially important for research on
chronic fatigue syndrome but might not be a good screening test.
\end{abcd}

%CHANGE moved hint to here, after \end{abcd}
\begin{hint}
 The first quote tells you the false positive and false negative
rates. The second tells you the incidence.
\end{hint}

\begin{sol}

\begin{abcd}
\item Construct the contingency table for this diagnostic tool.

I did the computations in Excel, entering the false positive rate as
\excel{=3/44} (6.82\%), the false negative rate as \excel{=5/37}
(13.51\%) and the incidence as \excel{=235/100000}, for a population
of 100,000.

Excel computed the contingency table (rounded):

\begin{center}
\ctablehead{chronic fatigue syndrome}{yes}{no}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
S[table-format=3.0]
S[table-format=5.0]
|
S[table-format=6.0]
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{tested positive}} &
\multicolumn{1}{c|}{yes} &  203  &   6802  & 7005 \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{no} &  32  &  92963 & 92995 \\
\hline
\multicolumn{1}{|c}{}
 & total & 235 & 99765 & 100000 \\
\hline
\end{tabular}
}
\end{center}

\item Explain why this test is potentially important for research on
chronic fatigue syndrome but might not be a good screening test.

The test suggests pretty clearly that a virus may be involved in
chronic fatigue syndrome. That is a lead worth pursuing with further
research. However, the false positive rate is almost 7\% and the
incidence is low so most of the positives will be false positives,
causing lots of anxiety and expense.
In fact the probability that a positive test means the patient has CFS
is just 2.9\%.
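As a check on that figure, here is a sketch of the arithmetic, using the
rounded counts of true and false positives from the first row of the
table in part (a), for a population of 100,000:
\[
\frac{\mbox{true positives}}{\mbox{all positives}}
= \frac{203}{203 + 6802}
= \frac{203}{7005}
\approx 0.029 = 2.9\% .
\]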

The test might be good for people who already show symptoms suggesting
they have the disease. 

\end{abcd}

\end{sol}

\end{exx}

\begin{exx}{\hassolution\worthy\sref{falsepos}\gref{contingencytable}
\gref{falsepositives}}
Pregnancy tests.
\index{pregnancy test}

A website on \myindex{pregnancy testing} says that

\begin{quotation}
  Usually, if all care has been taken, [home] pregnancy tests are 97\%
  accurate.
\begin{csmr}
  When should I test with a pregnancy test?,
  Yourdays (free information for women),
  \url{www.yourdays.com/when-pregnancy-test.htm}
  \access{March 12, 2020}.
\end{csmr}
\end{quotation}

Assume that ``97\% accurate'' means a false positive rate and a false
negative rate of 3\%. Since a woman is unlikely to use a home pregnancy
test unless she thinks she's probably pregnant, assume that 80\% of
the women who try one are in fact pregnant.

Explain why a positive test indicates a pregnancy more than 99\% of
the time even though the false positive rate is 3\%.

\begin{sol}

Explain why a positive test indicates a pregnancy more than 99\% of
the time even though the false positive rate is 3\%.

I did the calculations in the spreadsheet, imagining a population of
1000 possibly pregnant women:

\figfile{PregnancyTestSolutionCropped.pdf}
\begin{center}
\includegraphics[width=4in]{\thefigurefilename}
\end{center}
\figfile{}

The probability is 99.23\%. It's \emph{higher} than the incidence in
the tested population, because in that population the probability that
a woman is pregnant is already high.
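
For readers without the spreadsheet, the same figure can be sketched by
hand under the assumptions stated in the exercise. Of the 1000 women,
800 are pregnant and 200 are not. With a 3\% false negative rate,
$0.97 \times 800 = 776$ of the pregnant women test positive. With a 3\%
false positive rate, $0.03 \times 200 = 6$ of the women who are not
pregnant test positive. So the fraction of positive tests that indicate
a real pregnancy is
\[
\frac{776}{776 + 6} = \frac{776}{782} \approx 0.9923 = 99.23\% .
\]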

\end{sol}
%
%\begin{abcd}
%
%\item Explain why the probability that a woman testing positive is
%  pregnant is less than 97\%.
%
%\item Explain why that probability is probably not a lot less than 97\%.
%
%\end{abcd}
%
%\begin{hint}
%For (b), think about when a woman is likely to use a pregnancy test.
%\end{hint}
%


\end{exx}

\begin{exx}[downs]{\untested\sref{rare}
\gref{contingencytable}\gref{falsepositives}}
Prenatal screening.\index{prenatal screening}

%CHANGE sref => sref*
Check the calculations for Down's syndrome testing using the data in
the quotation in \sref*{rare}.
\end{exx}

\begin{exx}{\untested\complex\sref{falsepos}\gref{contingencytable}
\gref{falsepositives}}
Spam\index{spam}.

Spam is junk email. Most mail systems have a spam filter that tries to
decide whether each piece of email you get is spam. When the spam
filter finds something it thinks is spam, it may throw it away, or put
it in a junk mail folder so that you can decide whether to throw it
away without reading it. 

Before my university department set up a spam filter I ran my own.
(The ``I'' here is Ethan Bolker, one of the authors, not the
generic authorial ``we'' we use in most of the book.)

I got about 250 emails each day. My spam filter trapped
about 175 of them. Of those about five were legitimate, and should
have been delivered directly to me. My inbox, which should have 
contained just the legitimate messages, was usually about half
spam. So (in words) my spam filter is pretty good (but not perfect) at
recognizing legitimate email but not very good at calling spam
spam. 

\begin{abcd}

\item Build a two way contingency table with
row categories ``marked spam'' and ``not marked spam'', column categories  
``spam'' and ``legitimate''. 

\item Compute and interpret the false positive and false negative rates.

\item
Explain why both the false positives and the false negatives make
dealing with my email harder.

\item
I can adjust the settings in my spam filter to reduce the false
positive rate. Explain why that would increase the false negative
rate.

\item Is the number of spam emails I received consistent with the
  claim in the August 6, 2007 issue of \theNewYorker{} that there are
  more than a hundred billion spam emails every day?%
\begin{csmr}
M. Specter,
Damn Spam,
Annals of Technology,
\emph{The New Yorker} (August 6, 2007),
\url{www.newyorker.com/reporting/2007/08/06/070806fa_fact_specter}
\access{July 31, 2019}.
\csmrcomment{Paraphrase so no permission needed}
\end{csmr}

\item
What is the original meaning of the word ``spam''? Does the company that
sells (the real) spam object to the new meaning? 

\item How do you deal with spam? (If your email provider does all the
filtering for you, you may not even know it's throwing things away
before you see them, so you may need to do some research on your email
provider's web site to find the answers to these questions.)

\begin{itemize}

\item
Who provides your email service (your university, your internet
service provider, 
Google, Yahoo, \ldots)?

\item
Do you have any say in how your email provider filters spam for you?
If so, what do you tell it?

\item
Estimate the data you need to build the two way table for your spam
statistics and compute the false negative and false positive rates.

\end{itemize}

\end{abcd}

Here are some web sites to look at if you want to find out more about spam.

\begin{itemize}

\item
\url{www.imediaconnection.com/content/3649.asp}. There are some
useful tips here about how to keep other people's spam filters from
thinking mail from you is spam. 

\item
Tools your system administrator might use:
\url{www.spamcop.net/}, \url{www.spamhaus.org/}

\end{itemize}

\end{exx}

\begin{exx}{\hassolution\sref{falsepos}\gref{contingencytable}
\gref{falsepositives}} 
Plagiarism.

In 2006 UMass Boston experimented with the \myindex{plagiarism}
detection software described at
\url{www.turnitin.com} that 
claims it can identify plagiarism in essays students write. 
UMass did not purchase the software after the experiment.
Perhaps the possibility of false positives contributed to that
decision.

Suppose that the software can actually detect every cheater and that
it's 99\% accurate in declaring honest students honest. (We made up
``detect every cheater'' and ``99\%'' since the company does not advertise
them.) Sounds like a pretty good test. 

\begin{abcd}

\item
Estimate how many papers are submitted by students at your school each
semester. 

\item
Suppose that most students are honest. Estimate how
many students will be falsely accused of cheating.

\item
What are the advantages and disadvantages of using the software?
(There are several arguments on both sides of the question. Think of
as many as you can.) 

\item Read and write about this article from \theTimes: 
\url{www.nytimes.com/2010/07/06/education/06cheat.html}

\end{abcd}

\begin{sol}

\begin{abcd}

\item
Estimate how many papers are submitted by students at your school each
semester. 

In the spring of 2011 there were about 13,000 students at UMass
Boston. If each one wrote six papers a semester that would come to
about 80,000 papers --- a nice round number in the right ballpark.

\item
Suppose that most students are honest. Estimate how
many students will be falsely accused of cheating.

Since most of the 80,000 papers are honest, the false positive rate
applies --- one percent of them, or 800 papers, will be falsely tagged
as plagiarized. That might
not be quite 800 students, since some students might be unjustly
accused twice, but the order of magnitude is right.
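
The arithmetic behind these round numbers, as a rough sketch that
treats essentially all of the papers as honest:
\[
13{,}000 \times 6 = 78{,}000 \approx 80{,}000 \mbox{ papers},
\qquad
0.01 \times 80{,}000 = 800 \mbox{ papers falsely tagged}.
\]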

\item
What are the advantages and disadvantages of using the software?
(There are several arguments on both sides of the question. Think of
as many as you can.) 

An advantage is that some plagiarists will be caught who might
otherwise get away with it. Another is that students might be less
likely to cheat knowing that this software was being used.

I can think of several disadvantages. One is the anxiety caused by the
false accusations. Another is the cost. 
\end{abcd}

\end{sol}
\end{exx}


\begin{exx}{\untested\sref{falsepos}\gref{falsepositives}}
Airport screening.
\index{airport screening}

In response to the article ``Screening programme evaluation applied
to airport security''
in the December 10, 2007 issue of the \emph{British Medical Journal},
Ganesan Karthikeyan wrote

\begin{quotation}
It is probably true that airport security in its present
form is not an efficient screening measure. However, one important
difference exists between screening for disease in individual patients
and screening for, say, explosives in airports. While one missed
cancer on screening can cause the loss of at the most, one life, the
number of potential lives lost per missed screening at airports can be
substantially larger. This has to be factored into any attempts at
evaluation of the process.%
\begin{csmr}
G. Karthikeyan,
The cost of a ``negative test'',
response to 
Screening programme evaluation applied to airport security,
\emph{British Medical Journal} (December 27 2007),
\url{www.bmj.com/rapid-response/2011/11/01/cost-negative-test}
\access{September 4, 2015}.
Quoted with permission.
\csmrcomment{email from author says ``For sure Dr. Bolker''}
\end{csmr}
\end{quotation}

It's clear that a false negative is a disaster. Discuss the
consequences of a high false positive rate.

\end{exx}

\begin{exx}[mammography]{\hassolution\worthy\sref{falsepos}
\gref{contingencytable}\gref{falsepositives}}
Breast cancer screening.

In his ``Chances Are'' column in \theTimes{} on
April 25, 2010 Steven Strogatz wrote about a diagnostic puzzle
presented to several doctors:

\begin{quotation}
The probability that [a woman in this cohort] has breast
cancer is 0.8 percent.  If a woman has breast cancer, the probability
is 90 percent that she will have a positive mammogram.  If a woman
does not have breast cancer, the probability is 7 percent that she
will still have a positive mammogram.  Imagine a woman who has a
positive mammogram.  What is the probability that she actually has
breast cancer?

\ldots

[When 24 doctors were asked this question], their estimates whipsawed
from 1 percent to 90 percent.   Eight of them thought the chances were
10 percent or less, 8 more said 90 percent, and the remaining 8
guessed somewhere between 50 and 80 percent. Imagine how upsetting it
would be as a patient to hear such divergent opinions.%
\begin{csmr}
S. Strogatz,
Chances Are,
\theTimes{} (April 25, 2010),
\url{opinionator.blogs.nytimes.com/2010/04/25/chances-are/}
\access{March 2, 2016}.
\csmrcomment{Strogatz and his editor say it's OK to use this}
\end{csmr}
\end{quotation}

\begin{abcd}
\item
What is the correct answer? 
\suspend{abcd}

\begin{hint}
Build the contingency table, based on a population of 1,000 women
tested. You may do this by hand or with  the spreadsheet
\link{ContingencyTable.xlsx}. 
\end{hint}

\resume{abcd}
\item What percentage of the 24 doctors got the correct answer?
\end{abcd}

\begin{sol}
\begin{abcd}
\item
What is the correct answer?

Here is the contingency table, based on 1,000 women screened.

\begin{center}
\ctablehead{has breast cancer}{yes}{no}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
S[table-format=1.0]
S[table-format=3.0]
|
S[table-format=4.0]
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{screened positive}} &
\multicolumn{1}{c|}{yes} & 7 & 70 & 77     \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{no}& 1 & 922 & 923   \\
\hline
\multicolumn{1}{|c}{}
& total & 8 & 992 & 1000 \\
\hline
\end{tabular}
}
\end{center}


So the probability that a woman with a positive mammogram actually has
cancer is just 7/77 = 1/11, or about 9\%.
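
The table entries are rounded to whole women. As a check, here is a
sketch of the same calculation with the unrounded values from the
quotation: $0.90 \times 8 = 7.2$ true positives and
$0.07 \times 992 = 69.44$ false positives, so
\[
\frac{7.2}{7.2 + 69.44} = \frac{7.2}{76.64} \approx 0.094 ,
\]
which is still about 9\%.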

\item What percentage of the 24 doctors got the correct answer?

Eight doctors thought the correct answer was less than 10\%, which it
is. One doctor thought it was just 1\%, so I won't count that as a
correct answer. That means 7/24 or about 30\% got the answer right. 

\end{abcd}

\end{sol}

\end{exx}

\begin{exx}{\untested\hassolution\sref{falsepos}\gref{contingencytable}
\gref{falsepositives}}
Identity fraud.
\index{identity fraud}

On July 17, 2011 in an article in \theGlobe{}
headlined ``Identity fraud dragnet hardly seems worth the expense or
trouble'' you could read that 
in 2010 the Massachusetts Registry of Motor
Vehicles used software that cost \$1.5 million to 
send 1,500 suspension letters a day, leading to
100 arrests for fraudulent identity and
1,860 revoked licenses.%
\begin{csmr}
  M. E. Irons,
  Caught in a dragnet,
  \theGlobe, July 17, 2011,
  \url{archive.boston.com/news/local/massachusetts/articles/2011/07/17/man_sues_registry_after_license_mistakenly_revoked/},
  \access{March 12, 2020}.
\end{csmr}

On July 24 Jane Allen wrote a letter to the editor in
response, to say that the time and money hardly seemed worth it since
only 
\begin{quotation}
about 390,000 people were questioned for the sake of
finding fewer than 2,000 transgressors. 
\begin{csmr}
J. Allen,
Identity fraud dragnet hardly seems worth the expense or trouble,
\theGlobe{} (July 24, 2011),
\url{www.boston.com/bostonglobe/editorial_opinion/letters/articles/2011/07/24/identity_fraud_dragnet_hardly_seems_worth_the_expense_or_trouble/}
\access{March 30, 2020}.
\end{csmr}
\end{quotation}

\begin{abcd}
\item
Check Allen's arithmetic in the second paragraph.

\item Identify the false positives and calculate the false positive
  rate. Explain the  costs and benefits.

\end{abcd}


\begin{sol}

\begin{abcd}
\item
Check Allen's arithmetic in the second paragraph.

Since $1{,}500 \times 360 = 540{,}000$ the $390{,}000$ people in the second
paragraph is an underestimate. Maybe it's 1,500 letters each business
day, for $390{,}000/1{,}500 = 260$ business days.


\item Identify the false positives and calculate the false positive
  rate. Explain the  costs and benefits.

The population ``tested'' in this case is all the people who got
letters. 2000 of those are true positives: they are the ones properly
targeted. The false negatives would be people who should have gotten a
letter but didn't --- there's no way to know how many of those there
are. Probably very few.

The false positive rate is $388{,}000/390{,}000 \approx 99.5\%$.

Allen's letter explains the costs: the \$1.5 million grant, the
personnel time --- not to mention the headaches for the 388,000
falsely accused. The benefit: finding 2000 people who  broke the law,
100 seriously.
\end{abcd}


\end{sol}

\end{exx}


\begin{exx}{\untested\complex\sref{prosecutor}\gref{contingencytable}}
Candy leads to crime.

An article headlined
``Happy Halloween! Kids who eat candy every day grow up to be violent
criminals'' in the October 2, 2009 \emph{Daily Finance}, begins

\begin{quotation}
Quick, hide the candy jar! Feeding your child candy every
day could help turn Junior into a violent criminal, according to a
large study in Britain, which found that 69 percent of the
participants who had committed violence by 34 had eaten sweets or
chocolate nearly every day during childhood.%
\begin{csmr}
E. Wahlgren,
Happy Halloween! Kids who eat candy every day grow up to be violent
criminals,
\url{www.aol.com/2009/10/02/happy-halloween-kids-who-eat-candy-every-day-grow-up-to-be-viol/}
originally published on DailyFinance.com (October 2, 2009),
\access{March 12, 2020}.
Quoted with permission.
\csmrcomment{email from them says this is the permission statement we need}
\end{csmr}
\end{quotation}

You can find the full text at
\url{www.aol.com/2009/10/02/happy-halloween-kids-who-eat-candy-every-day-grow-up-to-be-viol/}
\begin{abcd}

\item
Read the rest of the article. Build the contingency table with columns
for whether or not someone ate candy as a child, rows for whether or
not they committed violence as an adult.

\item 
Explain why this is an example of the prosecutor's fallacy.

\item
Some of the online comments on that article recognize the fallacy ---
for example

\begin{quotation}
\noindent
10-03-2009 @ 10:21PM \\
Bski said... \\
I bet you, 99\% of criminals ate bread daily by the time they were 10
years old!!!! 
\end{quotation}

Write your own blog entry, using your understanding of two way
contingency tables to enlighten any readers. If you like what you've
written you may still be able to post your comment on the article's blog.
\end{abcd}


\end{exx}


\begin{exx}{\untested\sref{falsepos}\gref{falsepositives}}
Domestic violence.
\index{domestic violence}

In Andrew Gelman's \index{Gelman, Andrew} blog on
``Statistical Modeling, Causal Inference, and Social Science''
commenter Mike Spagat writes that

\begin{quotation}
Even within exceptionally violent environments most
households will still not have a violent death. So a very small false
positive rate in a household survey will cause substantial upward bias
in violence estimates.%
\begin{csmr}
A. Gelman,
The Reliability of Cluster Surveys of Conflict Mortality: Violent
Deaths and Non-Violent Deaths,
Statistical Modeling, Causal Inference, and Social Science
(August 11 2011),
\url{andrewgelman.com/2011/08/the_reliability/}
\access{July 25, 2015}.
Quoted with permission.
\csmrcomment{Andrew Gelman says it's OK to quote his blog}
\end{csmr}
\end{quotation}

Write a paragraph or two explaining this to someone who is interested
and smart enough to understand this but has not studied the material in
this chapter. Consider making up some numbers to illustrate your argument.
\end{exx}


\begin{exx}{\untested\sref{conditional}\gref{contingencytable}}
Surgery for \myindex{prostate cancer}?

An article in \theGlobe{} headlined
``Surgery offers no advantage for early prostate cancer, study
finds'' reported on a clinical trial involving 731 men diagnosed
with prostate cancer. About half had surgery; the rest were monitored.

\begin{quotation}
After 12 years, nearly 6 percent of men who had immediate
  surgery died of the cancer, compared with slightly more than 8
  percent of those patients who were observed, which was not a great
  enough difference to reach statistical significance.%
\begin{csmr}
D. Kotz,
Surgery offers no advantage for early prostate cancer, study finds,
\theGlobe{} (July 18, 2012),
\url{bostonglobe.com/lifestyle/health-wellness/2012/07/18/surgery-offers-survival-advantage-for-older-men-with-early-stage-prostate-cancer-study-finds/T5XM7APIuoZuav6PbJzYuI/story.html}
\access{July 25, 2015}.
\csmrcomment{Globe, OK}
\end{csmr}
\end{quotation}

\begin{abcd*}
\item About how many men were in each category?
\item About how many deaths were there in each category?
\item Construct the contingency table for this study.
\end{abcd*}
\end{exx}

\begin{exx}{\hassolution\artificial\worthy\sref{falsepos}\gref{contingencytable}\gref{falsepositives}}
Teenage drug use.


Here's a made up story.

The dean at a fancy private high school 
is very worried. She suspects that about 20\% of the 1000 students
on campus are using drugs. She has asked all the parents to administer
a home drug test to their kids (since it's a private school she can
actually require them to do it). She has read on the web that

\begin{quotation}
With home drug testing methods believed to produce reliable
 and accurate results, many of us overlook the cases of false positives and
draw conclusions on the suspect before reconfirming the result. But,
researchers from the Boston University have found out that drug tests
may produce false positives in 5-10\% of cases and false negatives in
10-15\% of cases.%
\begin{csmr}
How to Avoid False Positives While Conducting a Home Drug Test,
\url{lapoliticaesotracosa.blogspot.com/2012/05/how-to-avoid-false-positives-while.html}
\access{July 25, 2015}.
\csmrcomment{60 words. could paraphrase but I'd like to keep the quote}
\end{csmr}
\end{quotation}

We found several blogs that seem to report on this same study. None
gives a link or a precise reference. We haven't been able to locate
the original. 

Answer the following questions, assuming the worst cases 
(10\% false positive rate, 15\% false negative rate).

\begin{abcd}

\item Build the contingency table for this drug screening scenario. To
  do that you will have to figure out

\begin{itemize}
\item[] How many students are drug users?
\item[] How many of the drug users test positive? How many test
  negative?
\item[] How many students are drug free?
\item[] How many of the drug free students test positive? How many test
  negative?
\end{itemize}

You may do the arithmetic by hand or with the spreadsheet
at \link{ContingencyTable.xlsx}. 

\item What is the true positive rate?

\item Student John Smith tested positive. What is the probability that
  he is really on drugs?

\item Student Jane Doe tested negative. What is the probability that
  she is really drug free?

\item Answer the previous two questions if you assume the best
  cases for reported false values in the Boston University study.
\end{abcd}

\begin{sol}
\begin{abcd}

\item Build the contingency table for this drug screening scenario. To
  do that you will have to figure out

\begin{itemize}
\item[] How many students are drug users?

 20\% of 1000, so 200.

\item[] How many of the drug users test positive? How many test
  negative?

170 of the 200 users test positive. The other 30 test negative (these
are the false negatives).

\item[] How many students are drug free?

 The other 800.

\item[] How many of the drug free students test positive? How many test
  negative?

720 of the 800 clean students test negative. 80 are false positives.

\end{itemize}

\item What is the true positive rate? $100\% - 15\% = 85\%$.

\item Student John Smith tested positive. What is the probability that
  he is really on drugs?

That's $170/250 = 0.68 = 68\%$. 

\item Student Jane Doe tested negative. What is the probability that
  she is really drug free?

$720/750 = 0.96 = 96\% $.

\item Answer the previous two questions if you assume the best
  cases for reported false values in the Boston University study.

82\% and 97\% --- Excel did the work for me.
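
To check those values by hand, here is a sketch of the arithmetic. In
the best case the false positive rate is 5\% and the false negative
rate is 10\%. Then $0.90 \times 200 = 180$ users and
$0.05 \times 800 = 40$ drug free students test positive, while $20$
users and $760$ drug free students test negative, so
\[
\frac{180}{180 + 40} = \frac{180}{220} \approx 0.82,
\qquad
\frac{760}{760 + 20} = \frac{760}{780} \approx 0.97 .
\]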
\end{abcd}

\end{sol}

\end{exx}


\begin{exx}{\untested\sref{retrospective}\gref{contingencytable}\gref{falsepositives}}
The boy who cried ``wolf''.
\index{wolf, boy who cried}

Use Table~\ref{table:disaster} to analyze the children's story with
that title.

\end{exx}

\begin{exx}{\hassolution\sref{retrospective}\gref{contingencytable}}
Playing the lottery.

Table~\ref{table:lotteryTable} illustrates the ultimate example of the error
you can make reading a column instead of a row. 

\begin{table}
\centering
\ctablehead{bought a ticket}{yes}{no}
{\renewcommand{\arraystretch}{1.2}% for the vertical padding
\begin{tabular}{cc
|
cc
|
c
|
}
\ctablebody
\multicolumn{1}{|c}{\multirow{2}{*}{won the lottery}} &
\multicolumn{1}{c|}{yes} &  1  &   0  & 1 \\
\multicolumn{1}{|c}{}        &                
\multicolumn{1}{c|}{no} &  many  &  very many & very many \\
\hline
\multicolumn{1}{|c}{}
 & total & many & very many & very many \\
\hline
\end{tabular}
}
\caption{Playing the lottery}
\tablesource{Handbuilt data.}
\label{table:lotteryTable}
\end{table}

\begin{abcd}
\item Suppose you won the lottery. What is the probability that you
  bought a ticket?

\item Suppose you bought a ticket. What is the probability that you
  won the lottery?
\end{abcd}


\begin{sol}


\begin{abcd}
\item Suppose you won the lottery. What is the probability that you
  bought a ticket?

You can't win if you don't play. The probability is 1. 

\item Suppose you bought a ticket. What is the probability that you
  won the lottery?

Essentially zero.

\end{abcd}


\end{sol}
\end{exx}

%\begin{NewExercises}
%
%\end{NewExercises}
%

\setexercisecounter{}

\begin{ExtraExercises}


\begin{exx}{\needsquestions\sref{falsepos}\gref{falsepositives}}
Mad cow disease\index{mad cow disease|see {Bovine Spongiform Encephalopathy}}.

Bovine Spongiform Encephalopathy (BSE)
is a disease fatal to people who eat infected beef products.

Here is a paragraph from the United States Department of Agriculture
website on screening for BSE:

\begin{quotation}
After the first confirmation of BSE in an animal in Washington State
in December 2003, USDA evaluated its BSE surveillance efforts in light
of that finding. We determined that we needed to immediately conduct a
major surveillance effort to help determine the prevalence of BSE in
the United States. Our goal over a 12-18 month period was to obtain as
many samples as possible from the segments of the cattle population
where we were most likely to find BSE if it was present. This
population was cattle exhibiting some signs of disease. We conducted
this enhanced surveillance effort from June 2004 - August 2006. In
that time, we collected a total of 787,711 samples and estimated the
prevalence of BSE in the United States to be between 4-7 infected
animals in a population of 42 million adult cattle. We consequently
modified our surveillance efforts based on this prevalence estimate to
a level we can monitor for any potential changes, should they
occur. Our statistical analysis indicated that collecting
approximately 40,000 samples per year from the targeted cattle
population would enable us to conduct this monitoring.%
\begin{csmr}
BSE (Mad Cow Disease) Ongoing Surveillance Information Center,
U.S. Department of Agriculture, 
\url{www.usda.gov/wps/portal/usda/usdahome?contentid=BSE_Ongoing_Surveillance_Information_Center.html}
\access{November 15, 2015}.
\end{csmr}
\end{quotation}
\end{exx}

\begin{exx}{\needsquestions}
Correlation and causation.
\index{correlation}\index{causation}

This question and its answers at the statistics StackExchange site have
nice examples. The answers are written using conditional probabilities
but can be rewritten as contingency tables.

\url{stats.stackexchange.com/questions/283133/relationships-between-correlation-and-causation}

\end{exx}
\end{ExtraExercises}


