A model-based approach to football strategy.

March 6, 2006


A Model For Coaches' Challenges

Summary of Challenge Rules
Data on Challenges
Informal Description of the Model
Carolina at New Orleans
Denver at Dallas
New England at Denver
St. Louis at San Francisco
Testing the Subjective Probabilities

On the opening kickoff of the 2005 Week 1 game between St. Louis and San Francisco, St. Louis kick returner Chris Johnson fielded the ball at the 1-yard line and immediately stepped out of bounds, stopping the clock at 14:59. Johnson protested that he already had one foot out of bounds when he caught the ball, in which case the kickoff would be out of bounds and the Rams would begin their possession at their 40-yard line. St. Louis coach Mike Martz challenged the ruling on the field, setting a record that may never be broken for the earliest challenge in a game.

In this case, the gain from a successful challenge is 39 yards, which can be valued using a model like the Dynamic Programming Model. However, the decision to challenge must take into account the likelihood that the challenge will succeed, and the cost of losing a timeout if the challenge fails. The decision must also take into account the cost of expending a challenge, which derives from the possibility that other opportunities to challenge may arise later in the game. We will examine that cost in this article. Formally, we define the value of a team's first challenge as the answer to the following question: Before it is known whether a challenge will be needed during the remainder of the game, how much higher is a team's win probability if they have two challenges left rather than one? Similarly, the value of a team's final challenge is the difference in win probability if they have one challenge left rather than none. In this article we present a model for estimating the value of a challenge, and give four examples to illustrate its application.

It turns out that the value of a team's first challenge is always very small. The value of a team's final challenge is a bit larger, and is in fact similar in magnitude to the clock-management value of a timeout.

Among the essential inputs to the analysis is the frequency with which the various kinds of challengeable calls arise in games, and the distribution for the likelihood that a call will be reversed if challenged. To determine those inputs, we collected data from games that we watched during the 2005 regular season and playoffs. The data set contains information that is unavailable elsewhere, and may be of independent interest.

Summary of the Rules Regarding Reviews of Officials' Decisions

Certain types of rulings by the officials are subject to review if challenged. When a ruling is challenged, it is reviewed by the referee using a video monitor, and changed only if there is indisputable visual evidence to warrant the change.

Each team's coach is allowed two challenges during a game. If both of a team's challenges are successful, the team is awarded a third challenge, but a fourth challenge is never permitted. For plays that begin after the two-minute warning in either half, or at any time during overtime, reviews are initiated by the Replay Assistant, and are not charged to either team.

A coach's challenge requires the use of a team timeout. However, if the challenge is successful, the timeout is restored. A team that initiates a challenge when it has no timeouts, or has already used its available challenges, is penalized 15 yards.
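As a rough illustration, the eligibility conditions summarized in this section can be written as a short Python function. The function name and its arguments are hypothetical, introduced only for this sketch. It encodes nothing beyond the rules stated above.

```python
def can_challenge(timeouts_left, challenges_used, challenges_won, coach_window=True):
    """Return True if a coach may challenge without penalty.

    coach_window is False after the two-minute warning of either half and
    during overtime, when only the Replay Assistant can initiate a review.
    """
    if not coach_window:
        return False
    if timeouts_left < 1:          # a challenge requires a team timeout
        return False
    if challenges_used < 2:        # each coach starts with two challenges
        return True
    # A third challenge is awarded only if the first two both succeeded;
    # a fourth is never permitted.
    return challenges_used == 2 and challenges_won == 2


print(can_challenge(timeouts_left=3, challenges_used=2, challenges_won=2))  # True
print(can_challenge(timeouts_left=3, challenges_used=2, challenges_won=1))  # False
```

A coach who challenges when this function returns False during the coach's window would, under the rules above, be penalized 15 yards.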

Data on Challenges

When a coach is considering challenging an official's ruling, he knows what his team stands to gain if the ruling is reversed, and he can estimate the probability that the ruling will be reversed if challenged. His decision must take into account the possibility that better opportunities to challenge will arise later in the game. The official NFL play-by-play records (PBP) are of limited help in quantifying that possibility, partly because only actual challenges can be recorded. There is no indication of the rulings that might have been challenged, but were not. In addition, when there is a failed challenge, the PBP gives no information about what was at stake. For example, consider this excerpt from the Gamebook for the 2005 Divisional-round playoff game between New England and Denver:

3-5-DEN 5       (1:03) (Shotgun) 12-T.Brady pass intended for 80-T.Brown INTERCEPTED by 24-C.Bailey at DEN −1.
24-C.Bailey to NE 1 for 100 yards (84-B.Watson). FUMBLES (84-B.Watson), ball out of bounds at NE 1.
Play Challenged by NE and Upheld. (Timeout #1 by NE at 00:47.)

That description tells us nothing about where the ball would have been spotted, or even which team would have had possession, if New England's challenge had succeeded.

Finally, the PBP contains nothing about the probability (measured at the time the coach has to decide whether to challenge) that a ruling would be reversed if challenged.

To get the information we need to calibrate a model for coaches' challenges, we collected our own data from a sample of 71 games that we watched either in whole or in part, from Week 7 of the 2005 season through the Super Bowl, for a total of 3,683 minutes of playing time—the equivalent of 61.4 regulation games. We recorded every challengeable ruling that had a realistic possibility of being reversed if challenged by a coach or Replay Assistant. We will call such a ruling "reversible." For each reversible ruling, we recorded the basis for a potential challenge, our subjective probability that the ruling would be reversed if challenged, whether the ruling was in fact challenged, and whether it was reversed. We are making the data set available to anyone who wants to use it, provided that published results using the data include a link to the source.

The data set contains many rulings that we judged to be reversible but were not challenged, either because the coach elected not to do so, or was unable to. (An example of the latter possibility arose in the 4th quarter of the Thanksgiving Day game between Denver and Dallas. Although the officials ruled that Dallas recovered a Denver fumble, we assigned a 0.5 probability that, after review, the referee would determine that the Dallas defender was out of bounds before gaining possession. However, Denver was out of challenges.) There were also several rulings that do not appear in our data, even though they were challenged, because there was not a realistic possibility of reversal.

Figure 1
Histogram of reversal probabilities.

The data set contains only 110 observations. That translates to just 1.67 reversible rulings, on average, during the 56 minutes per game when teams use their challenges. Per team, the average is only 0.84 challenge opportunities per game. Opportunities to challenge arose in roughly equal numbers on offense and defense.
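The per-game rates quoted above can be reproduced from the sample totals. The short calculation below is a sketch that assumes reversible rulings are spread evenly over playing time, which appears to be how the figures were derived. The variable names are ours.

```python
# Totals quoted in the text
observations = 110            # reversible rulings recorded
minutes_watched = 3683        # playing time observed, in minutes
coach_minutes_per_game = 56   # 60 minutes minus the final two minutes of each half

games_equivalent = minutes_watched / 60
rulings_per_game = observations / minutes_watched * coach_minutes_per_game
rulings_per_team = rulings_per_game / 2

print(round(games_equivalent, 1))   # 61.4
print(round(rulings_per_game, 2))   # 1.67
print(round(rulings_per_team, 2))   # 0.84
```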

Nearly a third of the reversible rulings involved a change of possession. So, although opportunities to challenge are rare, the stakes are often large.

Each subjective probability of reversal was based solely on the information available when a coach or Replay Assistant had to decide whether to challenge (even though a clearer replay might have been shown later), and was assigned before the result of the challenge was known. Figure 1 displays a histogram of the subjective probabilities of reversal. In most cases, we judged that there was only a small chance that a challenge would succeed. However, in a substantial number of cases, the probability of success was large. There were relatively few challenge opportunities with success probabilities near 0.5. The histogram has a similar shape if we restrict the sample to challenge opportunities on offense, or if we look only at challenge opportunities involving a change of possession. In the final section of this article we will examine the accuracy of our subjective probabilities.

Informal Description of the Model

We denote the two teams as Team A and Team B. To keep the computations feasible, we make various simplifying assumptions. First, we explicitly model challenge decisions and timeout usage for Team A only. The state variables are the time left in the game, the score, the number of timeouts remaining for Team A, the number of challenges used by Team A, and whether Team A has had an unsuccessful challenge.

The model is a dynamic program, in which we solve for Team A's win probabilities at the various states by backward induction. Since the formulas are complicated, we will not present them here. The source code, written in MATLAB®, is the precise description of the model. Those who are unfamiliar with dynamic programming can learn the basics by reading our article on the Dynamic Programming Model.

We assume that each possession uses exactly 2:30 of game time, so that there are 24 possessions in regulation time. The teams alternate possessions, and we order the possessions so that Team A has the final possession in each half. We assume that if the score is tied at the end of regulation time, Team A's win probability is 0.5. When teams score a touchdown, they decide optimally whether to attempt a two-point conversion.

Team A has three timeouts to start each half. On each possession, there is a specified probability that Team A will expend a timeout for some reason other than for clock management or to initiate a challenge.

We assume that Team A's probabilities for scoring a touchdown or field goal during the final possession of a half depend on the number of timeouts they have. The dependence of the scoring probabilities on the number of remaining timeouts is calibrated so that the value of second-half timeouts approximates the value we computed in our model for the clock-management value of timeouts.

In the model, there are no coaches' challenges during the final possession of each half. (This corresponds loosely to the fact that reviews are initiated by the Replay Assistant after the two-minute warning.) On each possession except the final one of each half, an opportunity for Team A to challenge might arise. If an opportunity arises, it is characterized by (1) the probability that the ruling will be reversed if challenged, (2) the probabilities that the team with possession will score a touchdown or field goal if the ruling is upheld or not challenged, and (3) the probabilities that the team with possession will score a touchdown or field goal if the ruling is reversed. We have set the joint distribution for these three characteristics to approximate the observed characteristics of reversible rulings, described earlier.

Suppose an opportunity arises for Team A to challenge. If Team A has a timeout and either (a) has used at most one challenge or (b) has used two challenges but both were successful, then Team A can challenge if they choose to. They make the decision optimally, computing their win probabilities if they challenge and if they refrain, and selecting the course of action that gives the higher win probability.
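To make the backward induction concrete, here is a deliberately stripped-down sketch in Python. It is not the MATLAB model. It drops timeouts, two-point conversions and the award of a third challenge, treats every touchdown as 7 points, and uses scoring probabilities and a reversal distribution that are invented for illustration. Only the 24-possession structure, the rule that Team A has the last possession of each half, the ban on challenges during those possessions, and the tie value of 0.5 are taken from the description above.

```python
from functools import lru_cache

# Every number below is an illustrative assumption, not a calibrated input.
BASE = (0.20, 0.12)      # (P(touchdown), P(field goal)) on an ordinary possession
GOOD = (0.30, 0.15)      # offense's chances if the disputed ruling favors the offense
BAD = (0.08, 0.06)       # offense's chances if the disputed ruling favors the defense
P_OPPORTUNITY = 0.038    # roughly 0.84 opportunities spread over 22 possessions
REVERSAL_DIST = ((0.1, 0.6), (0.5, 0.1), (0.9, 0.3))  # (reversal probability, weight)
NO_CHALLENGE = (1, 13)   # possessions remaining at the last possession of each half


@lru_cache(maxsize=None)
def win_prob(k, lead, challenges):
    """Team A's win probability with k possessions left, its lead, and challenges left."""
    if k == 0:
        return 1.0 if lead > 0 else 0.5 if lead == 0 else 0.0
    a_on_offense = k % 2 == 1          # Team A has the final possession of each half
    sign = 1 if a_on_offense else -1

    def play_out(probs, c):
        # Expected win probability after this possession, given scoring chances.
        p_td, p_fg = probs
        return (p_td * win_prob(k - 1, lead + 7 * sign, c)
                + p_fg * win_prob(k - 1, lead + 3 * sign, c)
                + (1 - p_td - p_fg) * win_prob(k - 1, lead, c))

    ordinary = play_out(BASE, challenges)
    if k in NO_CHALLENGE:
        return ordinary

    # Otherwise a reversible ruling may have gone against Team A.
    stands, overturned = (BAD, GOOD) if a_on_offense else (GOOD, BAD)
    with_opportunity = 0.0
    for r, weight in REVERSAL_DIST:
        best = play_out(stands, challenges)               # refrain
        if challenges > 0:
            challenge = (r * play_out(overturned, challenges - 1)
                         + (1 - r) * play_out(stands, challenges - 1))
            best = max(best, challenge)                   # decide optimally
        with_opportunity += weight * best
    return (1 - P_OPPORTUNITY) * ordinary + P_OPPORTUNITY * with_opportunity


final_challenge_value = win_prob(24, 0, 1) - win_prob(24, 0, 0)
first_challenge_value = win_prob(24, 0, 2) - win_prob(24, 0, 1)
print(f"final challenge: {final_challenge_value:.4f}")
print(f"first challenge: {first_challenge_value:.4f}")
```

The value of a challenge is read off exactly as defined earlier: change the number of challenges by one, hold the other state variables fixed, and take the difference in win probability. Because the inputs here are made up, the printed numbers should not be compared with the model's results.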


Table 1
Value of a Team's Final Challenge

                     Time Remaining
               15:00    30:00    45:00    60:00
Lead by 14     0.0009   0.0031   0.0051   0.0066
Lead by 7      0.0037   0.0064   0.0084   0.0096
Tie game       0.0059   0.0080   0.0097   0.0107
Trail by 7     0.0034   0.0060   0.0080   0.0091
Trail by 14    0.0008   0.0027   0.0047   0.0061

A team's win probability depends on the values of the state variables, one of which is how many challenges the team has used. If we change that state variable by one challenge, holding the other state variables constant, the win probability changes. That change in win probability is what we are calling the value of the challenge. (Of course, this calculation is done before it's known whether the challenge will be needed.) For the most part, the value of a challenge is quite small. That's because challenge opportunities arise rarely, usually have a small chance of success, and sometimes gain little even if successful. (This is reminiscent of the reason why the clock-management value of a timeout is small. Occasionally the ability to stop the game clock turns out to be critical, but as we explained in an article at Football Outsiders, those scenarios are rare.)

According to the model, the difference in win probability from having two challenges remaining rather than just one is only 0.002 even at the start of the game, when the value is largest. To put that in perspective, it's equivalent to starting the opening possession at the 27-yard line rather than the 25-yard line. That doesn't mean that a coach with two challenges left should challenge any reversible ruling that arises, no matter how unlikely he is to succeed: A failed challenge costs a timeout, which has more significant value. But it does mean that a coach with two challenges left should challenge even a relatively inconsequential ruling, provided the probability of reversing the ruling is high.

A team's final challenge has somewhat more value. Table 1 displays the difference in win probability from having one challenge remaining rather than none. The rows are labeled by the team's lead, and the columns are labeled by the time remaining in the game. The entries are computed assuming that the team has two timeouts, and has had a failed challenge. For example, suppose that a team trails by 7 points with half a game (30:00) left. Then according to the model, their win probability is 0.006 higher if they have one challenge remaining rather than none. For much of the game, if the game is close, a team's final challenge is worth a bit less than 0.01 in win probability. This is similar to our estimate, in a previous article, of the clock-management value of a timeout. One difference is that, in a close game, the value of a timeout is highest late in the game, whereas the value of a challenge decreases with time.


The approach we will take in these examples is to divide the analysis of a coach's challenge into two parts. First, we value the benefit of winning the challenge, without taking into account the cost of the expended challenge and the potentially lost timeout. This is best done using the Dynamic Programming Model (FCDPM), which doesn't include timeouts or challenges, but is relatively detailed in other respects. Then, we use the challenge model to estimate the potential costs associated with the challenge. The decision to challenge is correct if the expected benefit outweighs the expected cost.

Carolina at New Orleans

Our first example arose during the 2005 Week 15 contest between Carolina and New Orleans, with 11:42 left in the opening quarter, and no score. On 3rd-and-5 at his own 35-yard line, Carolina quarterback Jake Delhomme threw a 40-yard pass to Drew Carter. The ball came out of Carter's hands after he hit the ground, but the official, whose view of the ball was blocked by Carter's body, ruled that Carter had maintained control long enough to create a complete pass. Based on the live view, we assigned a 0.5 probability that the ruling would be reversed if challenged. Carolina hurriedly ran another play before a replay aired, and New Orleans didn't challenge. The Saints had three timeouts, and had not previously challenged.

The ruling on the field gives Carolina the ball at the New Orleans 25-yard line. If the ruling is reversed, Carolina will punt, and New Orleans can expect to gain possession at about their 27-yard line. According to the FCDPM, the difference in win probability is 0.1. Since there is a 50% chance of reversal, the expected benefit from the challenge (before subtracting the expected cost of the expended challenge and potentially lost timeout) is 0.05.

According to the challenge model, the use of New Orleans's first challenge costs either 0.001 (if the challenge succeeds) or 0.002 (if it fails); since there is a 50% chance of reversal, the expected cost from the expended challenge is a negligible 0.0015. The value of the timeout is around 0.008, so that the expected cost due to the possible loss of a timeout is 0.004. The combined expected cost is therefore 0.0015 + 0.004 = 0.0055, which is very small compared to the expected gain of 0.05. It follows that New Orleans made a substantial mistake by not challenging.
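The comparison of expected benefit and expected cost can be laid out as a short Python calculation. The helper function and its argument names are hypothetical. The inputs are the figures quoted above for New Orleans.

```python
def challenge_benefit_and_cost(p_reverse, gain, cost_if_won, cost_if_lost, timeout_value):
    """Return (expected benefit, expected cost) of a challenge, in win probability."""
    benefit = p_reverse * gain
    # The expended challenge costs something whether or not the challenge succeeds.
    challenge_cost = p_reverse * cost_if_won + (1 - p_reverse) * cost_if_lost
    # The timeout is lost only if the challenge fails.
    timeout_cost = (1 - p_reverse) * timeout_value
    return benefit, challenge_cost + timeout_cost


benefit, cost = challenge_benefit_and_cost(
    p_reverse=0.5, gain=0.1, cost_if_won=0.001, cost_if_lost=0.002, timeout_value=0.008
)
print(round(benefit, 4), round(cost, 4))   # 0.05 0.0055
print(benefit > cost)                      # True
```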

Denver at Dallas

Our second example comes from the 2005 Thanksgiving Day game between Denver and Dallas. With 6:28 left in the 2nd quarter, Dallas faced 3rd-and-2 at the Denver 22-yard line, trailing 14-7. Dallas ran for a first down at the Denver 20-yard line, but Denver challenged the spot of the ball, claiming that the runner was down short of the necessary line. Denver had already challenged one ruling earlier in the game—successfully—and had all of its timeouts.

If the ruling on the field is upheld, or Denver doesn't challenge, the FCDPM says that Dallas's win probability is 0.395. If Denver challenges successfully, it turns out that Dallas should go for it on 4th down. If Dallas fails to convert, their win probability is 0.272; if they convert, it is again about 0.395. Assuming a 65% chance of picking up the first down, Dallas's win probability if Denver's challenge succeeds is 0.65 × 0.395 + (1 − 0.65) × 0.272 = 0.352. The difference between failure and success on the challenge is therefore 0.395 − 0.352 = 0.043. However, we judged that Denver's challenge had just a 10% chance of success. So, Denver's expected benefit from the challenge (before subtracting the expected cost of the expended challenge and potentially lost timeout) is only 0.0043.

If Denver's challenge succeeds, they still have all their timeouts; and although they would then have used two challenges rather than one, this change has negligible effect on Denver's win probability. That's because with two successful challenges, Denver is awarded a third challenge. Challenge opportunities are sufficiently rare that the third challenge will likely be enough for the rest of the game.

On the other hand, if Denver's challenge fails, their win probability falls by around 0.015, due in roughly equal parts to the loss of a timeout and the expenditure of their final challenge. Since there is a 90% chance that the challenge will fail, the expected cost to Denver is about 0.9 × 0.015 = 0.0135. This is more than the expected benefit of the challenge, which we estimated earlier to be just 0.0043. Denver is better off saving their challenge in this case. Both the potential gain from a successful challenge, and the likelihood of success, are too small.
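The Denver calculation has one extra step, because a successful challenge leads to a 4th-down attempt rather than a fixed outcome. The sketch below retraces it in Python with the figures quoted in this example. It assumes, as the arithmetic in the text implies, that a successful 4th-down conversion returns Dallas to a win probability of 0.395. The variable names are ours.

```python
wp_ruling_stands = 0.395   # Dallas's win probability with a first down at the 20
wp_fourth_fails = 0.272    # Dallas's win probability after a failed 4th-down try
p_convert = 0.65           # assumed chance that Dallas converts on 4th down
p_reverse = 0.10           # judged chance that Denver's challenge succeeds
cost_if_lost = 0.015       # Denver's loss from a failed challenge (timeout plus final challenge)

# If the spot is overturned, Dallas goes for it on 4th down.
wp_if_reversed = p_convert * wp_ruling_stands + (1 - p_convert) * wp_fourth_fails
gain = wp_ruling_stands - wp_if_reversed

expected_benefit = p_reverse * gain
expected_cost = (1 - p_reverse) * cost_if_lost   # a successful challenge costs Denver about nothing

print(round(wp_if_reversed, 3))                               # 0.352
print(round(gain, 3))                                         # 0.043
print(round(expected_benefit, 4), round(expected_cost, 4))    # 0.0043 0.0135
print(expected_benefit > expected_cost)                       # False
```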

New England at Denver

Our third example arose during the 2005 Divisional-round playoff game between New England and Denver; we quoted the relevant excerpt from the official play-by-play earlier. With 1:03 left in the 3rd quarter, and the Broncos leading 10-6, Denver's Champ Bailey intercepted a New England pass one yard deep in his own end zone. Bailey returned the interception 100 yards to New England's 1-yard line, where he fumbled the ball out of bounds. New England challenged, claiming that Bailey's fumble passed through New England's end zone for a touchback. The ruling on the field gave Denver the ball at the New England 1-yard line. A reversal would give New England the ball at their 20-yard line. The Patriots had all three of their timeouts. They had already used one challenge earlier in the game, successfully.

None of the replay angles was ideal; and using the standard of indisputable visual evidence, we assigned a 0.1 probability that New England's position would prevail after review. Still, the Patriots have to challenge the ruling. According to the FCDPM, New England's win probability if they take over at their 20-yard line is 0.3 higher than if Denver gets possession at the New England 1-yard line. Even with a 10% chance of prevailing, the expected gain from the challenge is 0.03. That easily covers the value of the timeout, which is at most 0.01, and the value of the challenge, which we estimate as about 0.003. Notice that even though this is New England's final challenge, its value is small. There simply isn't enough time before the two-minute warning for there to be much chance of a good challenge opportunity.
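One way to see how lopsided this decision is: compute the reversal probability at which the challenge would merely break even. The sketch below is our own illustration, not an output of the challenge model. It takes the timeout to be worth its upper bound of 0.01 and lost only if the challenge fails, and it conservatively charges the full 0.003 value of the challenge whether or not the challenge succeeds.

```python
def break_even_probability(gain, timeout_value, challenge_value):
    """Reversal probability p at which p*gain = (1 - p)*timeout_value + challenge_value."""
    return (timeout_value + challenge_value) / (gain + timeout_value)


p_star = break_even_probability(gain=0.3, timeout_value=0.01, challenge_value=0.003)
print(round(p_star, 3))   # 0.042
```

Under these assumptions, any chance of reversal above roughly 4% justifies the challenge, so a 10% chance clears the bar comfortably.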

St. Louis at San Francisco

For our final example, we will examine Mike Martz's decision to challenge the opening kickoff of the 2005 Week 1 game between St. Louis and San Francisco. As we explained in the Introduction, a successful challenge allows the Rams to begin their possession at their 40-yard line rather than their 1-yard line. According to the FCDPM, the improved field position increases St. Louis's win probability by about 0.05. However, neutral observers are unanimous that based on the replays available to Martz before he challenged, there was virtually zero chance that the ruling on the field would be reversed. Therefore, the expected gain from the challenge is approximately zero, and the sole purpose of the calculation is to determine how large a cost Martz imposed on his team by challenging. According to the challenge model, the failed challenge and the lost timeout that accompanies it reduce St. Louis's win probability by 0.008, mainly due to the value of the timeout. The cost is relatively small because challenge opportunities seldom arise, and the Rams still have one challenge.

Testing the Subjective Probabilities

We conclude by examining our subjective probabilities of reversal for evidence of bias. We will sketch one test here; details are contained in an appendix. Consider the sub-sample consisting of the 80 decisions that were actually challenged, either by a coach or by the Replay Assistant. For a particular challenge, let x denote our subjective probability that the ruling on the field would be reversed, and let y equal 1 if the ruling was reversed and 0 otherwise. If the subjective probability was correct, then P(y = 1 | x) = x, and hence E(y | x) = x and var(y | x) = x(1 − x). We can test the accuracy of the subjective probabilities by estimating the parameters of the regression

E(y | x) = α + β x + γ x²

by generalized least squares. Under the null hypothesis that the subjective probabilities are the true probabilities of reversal, α and γ are 0, while β equals 1. The estimates of α, β, and γ are 0.024, 0.947, and 0.113 respectively, and a chi-square test of the hypothesis that (α, β, γ) = (0, 1, 0) yields a test statistic that is quite consistent with the hypothesis. So, although the sample is too small to permit strong conclusions, there is no evidence from this particular test that our subjective probabilities are biased.
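For readers who want to try a test of this kind on their own data, the following Python sketch shows one way it could be carried out. It is an assumption-laden illustration, and the procedure in the appendix may differ in its details. It weights each observation by the inverse of the null variance x(1 − x), and forms a Wald statistic that is compared with the 5% critical value of a chi-square distribution with three degrees of freedom (7.815). The data are simulated, and the function names are hypothetical.

```python
import random


def solve_linear(A, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(M[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (M[i][n] - rest) / M[i][i]
    return solution


def calibration_test(xs, ys):
    """Weighted fit of E(y|x) = a + b*x + c*x^2; returns ([a, b, c], Wald statistic).

    Weights are 1 / (x(1-x)), the inverse variance under the null, so every x
    must lie strictly between 0 and 1.
    """
    rows = [(1.0, x, x * x) for x in xs]
    weights = [1.0 / (x * (1.0 - x)) for x in xs]
    XtWX = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(3)]
            for i in range(3)]
    XtWy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, ys)) for i in range(3)]
    theta = solve_linear(XtWX, XtWy)
    d = [t - t0 for t, t0 in zip(theta, (0.0, 1.0, 0.0))]   # departure from the null
    wald = sum(d[i] * XtWX[i][j] * d[j] for i in range(3) for j in range(3))
    return theta, wald


# Simulated example: 80 challenges whose stated probabilities are correct by construction.
random.seed(0)
xs = [random.choice([0.1, 0.2, 0.5, 0.8, 0.9]) for _ in range(80)]
ys = [1.0 if random.random() < x else 0.0 for x in xs]
theta, wald = calibration_test(xs, ys)
print([round(t, 3) for t in theta], round(wald, 2))
print("reject at 5%:", wald > 7.815)
```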

Copyright © 2006 by William S. Krasker