The Predictive Distribution in Decision Theory: A Case Study


Journal of Applied Mathematics & Decision Sciences, 2(2), 107-117 (1998)

Reprints Available directly from the Editor. Printed in New Zealand.


GEOFF JONES

Institute of Information Sciences and Technology, College of Sciences, Massey University, New Zealand

Abstract. In the classical decision theory framework, the loss is a function of the decision taken and the state of nature as represented by a parameter $\theta$. Information about $\theta$ can be obtained via observation of a random variable $X$. In some situations however the loss will depend not directly on $\theta$ but on the observed value of another random variable $Y$ whose distribution depends on $\theta$. This adds an extra layer to the decision problem, and may lead to a wider choice of actions.

In particular there are now two sample sizes to choose, for $X$ and for $Y$, leading to a range of behaviours in the Bayes risk. We illustrate this with a problem arising from the cleanup of sites contaminated with radioactive waste. We also discuss some computational approaches.

Keywords: Decision Theory, Bayes Rule, Predictive Distribution, Monte Carlo Integration

1. Introduction

Consider the following consulting problem: the client is involved in the cleanup of sites contaminated with radioactive waste, which involves sending bins of radioactive material to a nuclear reprocessing plant. Two such plants are available, one of which is less expensive but which will only accept a bin if the level of radioactivity of the material is below a threshold level. The actual level of radioactivity in the bin is determined by sampling at the reprocessing plant; as far as the client is aware only one such sample is taken. If the measured level exceeds the threshold, the material is returned to the client and must then be sent to the second, more expensive reprocessing plant which will accept material at any level of contamination.

The client wishes to base the decision of which reprocessing plant to use on a sample or samples taken from each bin on site before the material is dispatched. The cost of sampling is small relative to the difference in reprocessing costs, but is not negligible. How many samples should be taken, and how should this information be used?

The Loss Function in this problem clearly has the form

$$L(a_1, y) = \begin{cases} S & \text{if } y < c \\ S + K + P & \text{otherwise} \end{cases} \qquad L(a_2, y) = S + K$$

Here $a_1$ is the decision to send to the less expensive reprocessing plant at a cost $S$, $K$ is the extra cost of sending to the other plant, and $P$ the extra penalty incurred by sending first to the less expensive plant but having the material rejected. This occurs if the sample taken at the plant has an observed value $y$ greater than the threshold level $c$. Since the value of $S$ plays no part in the decision, we can without loss of generality take $S = 0$.

The client expects that there will be considerable variation in levels of radioactivity of the material within each bin, much of it being at quite a low level but with some "hot spots" of high radioactivity, suggesting a highly skewed distribution. It seems reasonable then to model the result for a single sample as a random variable $X$ with an Exponential distribution. We write $X \sim \mathrm{Exp}(\theta)$ to denote that $X$ has density

$$f(x \mid \theta) = \theta e^{-\theta x}, \quad x > 0 \qquad (1)$$

Here $\theta$ parametrises the "state of nature" and relates to the average level of radioactivity of the material in the bin. Note however that the mean level is $1/\theta$. The alternative parametrisation of the Exponential distribution is more intuitive, but we keep the present form for mathematical convenience.

Because of the high skewness it is intuitively clear that a single sample will not provide reliable information about the average radioactivity level. It may be advantageous for the client to persuade the reprocessor, through financial inducement or otherwise, to take further samples before accepting or rejecting a bin. Thus there may be two sample sizes to consider, relating to the sampling before and after dispatch. Intuitively one might expect that increasing either sample size would be advantageous to the client, and that for a given cost of sampling the total sample size might be shared equally between the two stages. This turns out not to be the case.

In Section 2 we develop and analyse a Bayesian framework for this problem using a conjugate prior distribution for $\theta$. In Section 3 we consider the determination of optimal sample sizes at each stage of sampling. In Section 4 we reconsider the choice of prior and discuss some numerical strategies for incorporating a non-conjugate prior.

The basic principles of statistical decision theory, as used here, are described by DeGroot [2], although our notation is closer to that of Ferguson [3]. In the classical approach the loss incurred by the decision maker is a function of the action taken and the true value of an unknown parameter, information about which can be obtained by sampling. The situation in which the loss depends not on a parameter but on future observations was considered by Roberts [7] in the context of statistical prediction. Aitchison and Dunsmore [1] and Geisser [4] provide an overview and many applications of the predictive approach, some of which involve decision making but not the sample size determination problem considered here. A related problem in determining a single sample size in the classical framework when there are two "adversaries" with different priors, was considered by Lindley and Singpurwalla [5], and an application in environmental monitoring of radiation levels given by Wolfson et al. [8].


2. Bayesian Analysis

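We assume that the uncertainty about $\theta$ can be expressed as a Gamma distribution, $\Gamma(\alpha, \lambda)$, with prior density

$$\pi(\theta) = \frac{\lambda^{\alpha}\theta^{\alpha-1}e^{-\lambda\theta}}{\Gamma(\alpha)}, \quad \theta > 0 \qquad (2)$$

If we wish to use a non-informative prior we might consider $\alpha = 1$ and $\lambda \to 0$, although this may not be sensible here (see Section 4). The main advantage of the Gamma prior is that it is a conjugate for the Exponential distribution, so that the posterior distribution for $\theta$ after observing one or more sample values $X$ will also be Gamma. Specifically, for a single observation $X = x$ the joint distribution of $(X, \theta)$ has density

$$f(x, \theta) = \frac{\lambda^{\alpha}\theta^{\alpha}e^{-(\lambda+x)\theta}}{\Gamma(\alpha)}, \quad x > 0,\ \theta > 0 \qquad (3)$$

so by inspection the posterior density $\pi_{\theta|X}(\theta \mid x) \propto \theta^{\alpha}e^{-(x+\lambda)\theta}$ and $(\theta \mid x) \sim \Gamma(\alpha+1, \lambda+x)$.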

The usual procedure, when the loss is a function of $\theta$, would be to choose the action ($a_1$ or $a_2$) which minimizes the expected loss under this posterior distribution, giving the Bayes Rule

$$\delta(x) = \arg\min_{a}\left\{E_x[L(a, \theta) \mid x]\right\} \qquad (4)$$

Here however the loss depends not on $\theta$ but on the observed value of a second random variable, say $Y$, representing the result of the sample taken at the reprocessing plant. The Bayes criterion is now $E_x[L(a, y) \mid x]$, the expected loss under the predictive distribution for $Y$ given $X = x$. If we assume that both the client and the reprocessor use the same sampling and measurement technique, then $Y$ has the same distribution as $X$ (for a given $\theta$), in which case the predictive distribution has density

$$f_{Y|X}(y \mid x) = \int_0^{\infty} f_{Y|\theta}(y \mid \theta)\,\pi_{\theta|X}(\theta \mid x)\,d\theta \qquad (5)$$

$$= \int_0^{\infty}\frac{(\lambda+x)^{\alpha+1}\theta^{\alpha+1}e^{-(\lambda+x+y)\theta}}{\Gamma(\alpha+1)}\,d\theta \qquad (6)$$

$$= \frac{(\alpha+1)(\lambda+x)^{\alpha+1}}{(\lambda+x+y)^{\alpha+2}}, \quad y > 0 \qquad (7)$$

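The expected loss for $a_2$, the expensive reprocessor, is fixed at $S + K$ and for $a_1$

$$E_x[L(a_1, y) \mid x] = S + (K+P)\,P_x[Y > c] = S + (K+P)\left(\frac{\lambda+x}{\lambda+x+c}\right)^{\alpha+1}$$

so the Bayes Rule is to choose $a_1$ if $\left(\frac{\lambda+x}{\lambda+x+c}\right)^{\alpha+1} < \frac{K}{K+P}$, i.e. if

$$x < \xi = \frac{c\gamma}{1-\gamma} - \lambda \quad \text{where} \quad \gamma = \left(\frac{K}{K+P}\right)^{1/(\alpha+1)} \qquad (8)$$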
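As an informal check on the predictive tail probability used above, the following sketch compares the closed form $\left(\frac{\lambda+x}{\lambda+x+c}\right)^{\alpha+1}$ with a simulation that first draws $\theta$ from the posterior and then draws $Y$ given $\theta$. The function names, the seed, the number of draws and the observed value $x = 2$ are illustrative assumptions and are not part of the original analysis.

```python
import random

def predictive_tail_exact(x, c, alpha, lam):
    """P[Y > c | X = x] implied by the predictive density (7)."""
    return ((lam + x) / (lam + x + c)) ** (alpha + 1)

def predictive_tail_simulated(x, c, alpha, lam, n_draws=50_000, seed=1):
    """Estimate P[Y > c | X = x] by composition sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # Posterior: theta | x ~ Gamma(shape alpha + 1, rate lam + x).
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        theta = rng.gammavariate(alpha + 1, 1.0 / (lam + x))
        # Y | theta ~ Exp(theta); expovariate takes the rate parameter.
        y = rng.expovariate(theta)
        hits += y > c
    return hits / n_draws

exact = predictive_tail_exact(2.0, 5.0, 3.0, 10.0)
approx = predictive_tail_simulated(2.0, 5.0, 3.0, 10.0)
print(f"exact {exact:.4f}, simulated {approx:.4f}")
```

The two numbers should agree to within Monte Carlo error, which gives some reassurance that the integration leading to Equation (7) has been carried out consistently.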

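The expected loss incurred by using this decision rule, say $\delta(x)$, can now be found by integrating the expected loss at fixed $X$, as given above, with respect to the marginal distribution $f_X$ of $X$. This gives the Bayes Risk $B(\pi, \delta)$ of the rule $\delta$ with respect to the prior distribution $\pi$. Formally we may write

$$B(\pi, \delta) = E_{\pi}\left[E_{\theta}[L(\delta(x), y) \mid \theta]\right] = E_{f_X}\left[E_x[L(\delta(x), y) \mid x]\right] \qquad (9)$$

to show two different ways of calculating the Bayes Risk corresponding to two different forms of iterated expectation. It is more convenient here to use the latter. The marginal distribution for $X$ has density

$$f_X(x) = \int_0^{\infty} f_{X|\theta}(x \mid \theta)\,\pi(\theta)\,d\theta = \frac{\alpha\lambda^{\alpha}}{(\lambda+x)^{\alpha+1}}, \quad x > 0 \qquad (10)$$

so the Bayes Risk (with $S = 0$) is

$$B(\pi, \delta) = (K+P)\int_0^{\xi}\left(\frac{\lambda+x}{\lambda+x+c}\right)^{\alpha+1}\frac{\alpha\lambda^{\alpha}}{(\lambda+x)^{\alpha+1}}\,dx + K\int_{\xi}^{\infty}\frac{\alpha\lambda^{\alpha}}{(\lambda+x)^{\alpha+1}}\,dx$$

$$= \lambda^{\alpha}\left[\frac{K+P}{(\lambda+c)^{\alpha}} - \frac{K+P}{(\lambda+\xi+c)^{\alpha}} + \frac{K}{(\lambda+\xi)^{\alpha}}\right]$$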

Note that $\lambda$ is a scale parameter for the marginal distribution of $X$ (and of $Y$) so that the problem is invariant to transformations $(\lambda, c) \to (k\lambda, kc)$ for $k > 0$. This transformation corresponds to a change in the unit of measurement of radioactivity. Similarly the decision made depends only on the ratio $K/P$, not on the individual values.

Suppose then that with suitable units we take $\alpha = 3$, $\lambda = 10$, $c = 5$, $K = 10$, $P = 15$. The prior distribution for the mean level of radioactivity $1/\theta$ is shown in Figure 1, and the marginal distribution for $X$ (and $Y$) in Figure 2. We find that the Bayes Rule is

$$\delta(x) = \begin{cases} a_1 & \text{if } x < 9.423 \\ a_2 & \text{otherwise} \end{cases} \qquad (11)$$
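The cutoff in Equation (11) and the corresponding Bayes Risk can be evaluated directly from Equation (8) and the marginal density (10). The sketch below is one way of doing so, taking $S = 0$; the function and variable names are illustrative assumptions rather than anything used in the original computations.

```python
def bayes_cutoff(alpha, lam, c, K, P):
    """Cutoff xi from Equation (8): choose a1 when x < xi."""
    g = (K / (K + P)) ** (1.0 / (alpha + 1))
    return c * g / (1.0 - g) - lam

def bayes_risk(alpha, lam, c, K, P):
    """Bayes Risk of the single-sample rule, taking S = 0."""
    xi = bayes_cutoff(alpha, lam, c, K, P)
    # P[X < xi and Y > c]: a1 is chosen but the bin is rejected.
    p_rejected = (lam / (lam + c)) ** alpha - (lam / (lam + xi + c)) ** alpha
    # P[X > xi]: a2 is chosen, using the marginal density (10).
    p_a2 = (lam / (lam + xi)) ** alpha
    return (K + P) * p_rejected + K * p_a2

xi = bayes_cutoff(alpha=3, lam=10, c=5, K=10, P=15)
risk = bayes_risk(alpha=3, lam=10, c=5, K=10, P=15)
print(f"cutoff = {xi:.4f}, Bayes Risk = {risk:.4f}")
```

With the parameter values above this gives a cutoff of about 9.42, in agreement with Equation (11) up to rounding in the last digit.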

It is clear from Figure 2 that $a_1$ will be chosen most (93%) of the time, even though the mean level of radioactivity is often above the critical level $c$. This occurs because of the extreme skewness of the sampling distribution for $Y \mid \theta$, which makes the "gamble" of using the cheaper reprocessor worthwhile even when the average level of radioactivity in a bin is quite high. In the next section we consider the changes which occur when repeated sampling is used at both ends of the process (i.e. for $X$ and for $Y$).

Figure 1. Prior inverted gamma density $g(1/\theta)$ for mean level of radioactivity ($1/\theta$) with $\alpha = 3$ and $\lambda = 10$

Figure 2. Marginal density $f(x)$ for sampled level of radioactivity ($X$) with $\alpha = 3$ and $\lambda = 10$

3. Optimal Sample Sizes

Using the framework established in the previous section, we now suppose that the client bases his decision on samples $X_1, X_2, \ldots, X_n$ taken from the bin, and assume that these are iid $\mathrm{Exp}(\theta)$. It is convenient now to work with the total $X = X_1 + X_2 + \cdots + X_n$ which is a sufficient statistic for $\theta$ and is distributed as $\Gamma(n, \theta)$. Similarly the total $Y$ of $m$ samples taken by the cheaper reprocessor, assuming the bin is sent there, will be $\Gamma(m, \theta)$. The decision to accept or reject the bin is then based on the mean of the $m$ samples, so that in the Loss Function $c$ is replaced by $mc$.

Proceeding as before, we now find that the posterior distribution for $\theta \mid X$ is $\Gamma(n+\alpha, \lambda+x)$. The predictive distribution for $Y \mid X$, following the method of Equations (5)-(7), then has density

$$f_{Y|X}(y \mid x) = \frac{\Gamma(m+n+\alpha)}{\Gamma(m)\Gamma(n+\alpha)}\,\frac{(\lambda+x)^{n+\alpha}\,y^{m-1}}{(\lambda+x+y)^{m+n+\alpha}} \qquad (12)$$

If we make the substitution $u = \frac{y}{\lambda+x+y}$ we find that the predictive density for $U$, given $X = x$, has the form of a Beta distribution, and we write:

$$\left.\frac{Y}{\lambda+X+Y}\,\right|\,X = x \;\sim\; \mathrm{Be}(m, n+\alpha) \qquad (13)$$

for the predictive distribution of the transformed variable.

A closed form for the predictive probability $P[Y < mc \mid X = x]$ is not possible, but the incomplete Beta distribution is easy to calculate numerically (see Press et al. [6]) so we can use Equation (13) to write

$$P[Y < mc \mid X = x] = IB_{\frac{mc}{\lambda+x+mc}}(m, n+\alpha) \qquad (14)$$

The Bayes Rule is then to choose $a_1$ if

$$IB_{\frac{mc}{\lambda+x+mc}}(m, n+\alpha) > \frac{P}{K+P} \qquad (15)$$

i.e. if

$$x < \xi = \frac{mc(1-\zeta)}{\zeta} - \lambda \qquad (16)$$

where $\zeta$ is the $\frac{P}{K+P}$ quantile of $\mathrm{Be}(m, n+\alpha)$.

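To determine the Bayes Risk $B(\pi, \delta)$ for fixed $m$ and $n$ we use the marginal density of $X$. Proceeding as before we find that

$$\frac{X}{\lambda+X} \sim \mathrm{Be}(n, \alpha) \qquad (17)$$

so from Equation (9)

$$B(\pi, \delta) = K\,P[X > \xi] + (K+P)\,P[X < \xi \text{ and } Y > mc]$$

$$= K\left(1 - IB_{\frac{\xi}{\lambda+\xi}}(n, \alpha)\right) + (K+P)\int_0^{\xi}\left(1 - IB_{\frac{mc}{\lambda+x+mc}}(m, n+\alpha)\right)f_X(x)\,dx$$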
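The cutoff (16) and the Bayes Risk integral can be evaluated with very little machinery when $\alpha$ is an integer, as it is for the parameter values of Section 2. The sketch below is one possible implementation under that assumption: it computes the incomplete Beta function through its binomial-tail form, finds the quantile $\zeta$ by bisection, and integrates by Simpson's rule. These numerical choices and all function names are assumptions made for illustration; they are not the routines of Press et al. [6].

```python
from math import comb, gamma

ALPHA, LAM, C, K, P = 3, 10.0, 5.0, 10.0, 15.0  # values from Section 2

def inc_beta(x, a, b):
    """Regularised incomplete Beta IB_x(a, b) for positive INTEGER a, b,
    using the binomial-tail identity (an assumption of this sketch)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def beta_quantile(p, a, b, tol=1e-12):
    """Solve IB_z(a, b) = p for z by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if inc_beta(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cutoff(n, m):
    """Cutoff xi for the total X, Equation (16)."""
    zeta = beta_quantile(P / (K + P), m, n + ALPHA)
    return m * C * (1 - zeta) / zeta - LAM

def marginal_density(x, n):
    """Density of X implied by X / (lam + X) ~ Be(n, alpha), Equation (17)."""
    const = gamma(n + ALPHA) / (gamma(n) * gamma(ALPHA))
    return const * LAM**ALPHA * x**(n - 1) / (LAM + x)**(n + ALPHA)

def bayes_risk(n, m, steps=2000):
    """Bayes Risk excluding sampling costs; Simpson's rule on [0, xi]."""
    xi = cutoff(n, m)

    def integrand(x):
        p_accept = inc_beta(m * C / (LAM + x + m * C), m, n + ALPHA)
        return (1 - p_accept) * marginal_density(x, n)

    h = xi / steps
    total = integrand(0.0) + integrand(xi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3
    p_a2 = 1 - inc_beta(xi / (LAM + xi), n, ALPHA)
    return K * p_a2 + (K + P) * integral

for n, m in [(1, 1), (2, 1)]:
    print(n, m, round(cutoff(n, m) / n, 3), round(bayes_risk(n, m), 3))
```

The cutoff is printed per sample (that is, $\xi/n$) so that it can be compared with a threshold on the sample mean. A non-integer $\alpha$ would need a general incomplete Beta routine in place of `inc_beta`.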


Although numerical integration is now required this can be accomplished quite easily using standard routines (see Press et al. [6]), and evaluation over a range of values of $m$ and $n$ gives a criterion for choosing the optimal (from the client's viewpoint) sampling plan. Using the parameter values from Section 2, Table 1 gives the Bayes Risk for values of $m$ and $n$ ranging from 1 to 6; note that these values do not include the cost of obtaining the samples. The optimal choices for $m$ and $n$ will depend on the sampling cost: if for example each sample determination has a cost of 0.1, we add $0.1(n+m)$ to each value in the table and find that the optimum is $n = m = 5$. The advantage of not including the sampling cost explicitly in the table is that we can observe the behaviour of the Bayes Risk as $n$ and $m$ are varied.
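The search for the optimal plan under a given sampling cost is a simple tabular calculation. The sketch below transcribes the Bayes Risk values of Table 1 (rows $n = 1, \ldots, 6$, columns $m = 1, \ldots, 6$, with digits as printed) and adds a cost per sample to each entry; the function name and data layout are illustrative assumptions.

```python
# Bayes Risk values from Table 1: rows n = 1..6, columns m = 1..6.
TABLE_1 = [
    [7.05, 7.165, 7.130, 7.085, 7.06, 7.014],
    [6.885, 6.878, 6.782, 6.699, 6.633, 6.580],
    [6.781, 6.709, 6.577, 6.470, 6.387, 6.322],
    [6.711, 6.596, 6.439, 6.316, 6.221, 6.147],
    [6.661, 6.515, 6.340, 6.205, 6.101, 6.019],
    [6.623, 6.453, 6.266, 6.120, 6.009, 5.922],
]

def best_plan(risk_table, cost_per_sample):
    """Return (total cost, n, m) minimising risk + cost_per_sample * (n + m)."""
    best = None
    for i, row in enumerate(risk_table):
        for j, risk in enumerate(row):
            n, m = i + 1, j + 1
            total = risk + cost_per_sample * (n + m)
            if best is None or total < best[0]:
                best = (total, n, m)
    return best

total, n, m = best_plan(TABLE_1, 0.1)
print(f"optimum n = {n}, m = {m}, total expected cost = {total:.3f}")
```

Changing `cost_per_sample` shows how quickly the optimum moves towards smaller plans as sampling becomes more expensive.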

Notice that for $n = 1$ the Bayes Risk initially increases as $m$ increases from 1 to 2, the increased accuracy of determination by the reprocessor being disadvantageous to the client, but thereafter an increase in $m$ results in a lower expected cost. For $n$ however a higher value will always decrease the Bayes Risk, as one would expect: more information for the client should always result in a better decision.

Table 1. Bayes Risk for various sample sizes, $\theta \sim \Gamma(3, 10)$

        m=1     2       3       4       5       6
n=1     7.05    7.165   7.130   7.085   7.06    7.014
2       6.885   6.878   6.782   6.699   6.633   6.580
3       6.781   6.709   6.577   6.470   6.387   6.322
4       6.711   6.596   6.439   6.316   6.221   6.147
5       6.661   6.515   6.340   6.205   6.101   6.019
6       6.623   6.453   6.266   6.120   6.009   5.922

Table 2. Cutoff point for $\delta$, $\theta \sim \Gamma(3, 10)$

        m=1     2       3       4       5       6
n=1     9.423   7.398   6.862   6.623   6.490   6.407
2       7.430   6.158   5.828   5.684   5.605   5.557
3       6.768   5.747   5.487   5.375   5.315   5.279
4       6.438   5.543   5.318   5.223   5.173   5.142
5       6.240   5.422   5.217   5.132   5.088   5.062
6       6.109   5.341   5.151   5.073   5.033   5.009

Table 3. Probability of choosing $a_2$, $\theta \sim \Gamma(3, 10)$

        m=1     2       3       4       5       6
n=1     0.136   0.190   0.209   0.218   0.223   0.226
2       0.182   0.239   0.257   0.266   0.271   0.274
3       0.205   0.262   0.280   0.288   0.293   0.295
4       0.219   0.276   0.293   0.301   0.305   0.307
5       0.229   0.285   0.302   0.309   0.313   0.315
6       0.235   0.291   0.308   0.315   0.318   0.321

Table 2 shows the cutoff point for the Bayes Rule, expressed in relation to the sample mean, i.e. if $x/n$ is greater than the tabulated value, the bin should be sent to the expensive reprocessor. As the number of samples increases, the cutoff converges quite quickly to the critical value $c$. Table 3 shows the proportion of bins which would be sent to the expensive reprocessor for each sampling plan.

Several different behaviours are possible, depending on the parameter values. In some cases the Bayes risk may decrease very slowly, or even increase, as $m$ increases from 1; in other cases it decreases quite markedly. It is important therefore to get accurate information about costs and the prior before deciding whether it is worthwhile obtaining extra samples, and whether the extra effort should be devoted to $X$ or $Y$ or both equally.

4. Non-conjugate Prior

The prior Gamma distribution for $\theta$ employed in Sections 2 and 3 was chosen mainly for mathematical convenience. We now re-examine its appropriateness and how a wider class of priors might be incorporated into the analysis.

To obtain a reasonable prior distribution from the client, he must be invited to speculate on the likelihood of a range of values of $\theta$. This may be difficult since $\theta$ is itself not a particularly meaningful parameter. A far more natural parametrization of the problem would be to use $1/\theta$ which represents the mean level of radioactivity in the bin; this is something about which the client might reasonably be expected to speculate. We could still proceed by showing the client graphs of the density of $1/\theta$ for various choices of $\alpha$ and $\lambda$, as in Figure 1, but even so we are restricting ourselves to a class of distributions, the Inverted Gamma, which might be thought inappropriate. These distributions are very long-tailed, having fewer than $\alpha$ finite moments.

Suppose that instead we decide to use a general prior distribution specified for $1/\theta$. We can still denote the implied prior for $\theta$ by $\pi(\theta)$, but the integrals needed to evaluate the marginal distribution for $X$ and the predictive distribution for $Y$ will not now involve simple special functions like the Gamma and Beta. It has become commonplace in such situations to employ some form of Monte Carlo integration. There are essentially three stages to the calculation:

Evaluate the risk for fixed cutoff $\xi$ and fixed sample sizes $n, m$.

Choose $\xi$ to minimize the risk for fixed $n, m$.

Choose $n, m$ to minimize the Bayes Risk.

If we denote the rule with cutoff $\xi$ by $\delta_\xi$, i.e.

$$\delta_\xi(x) = \begin{cases} a_1 & \text{if } x < \xi \\ a_2 & \text{otherwise} \end{cases} \qquad (18)$$

then we need to evaluate

$$R(\pi, \delta_\xi) = K\,P[X > \xi] + (K+P)\,P[X < \xi \text{ and } Y > mc] \qquad (19)$$

One approach would be to sample from the joint distribution of $(\theta, X, Y)$. Provided that the prior for $1/\theta$ is reasonably easy to simulate, we invert a randomly drawn value to get $\theta$, then draw $X$ and $Y$ from their conditional distributions $\Gamma(n, \theta)$ and $\Gamma(m, \theta)$ respectively. The conditional independence of $X$ and $Y$ given $\theta$ means that we do not need iterative methods such as the Gibbs sampler. Given a sample $(\theta_i, X_i, Y_i)$, $i = 1, \ldots, N$ we can approximate the risk for given $\xi$ by

$$R(\pi, \delta_\xi) \approx \frac{1}{N}\sum_{i=1}^{N}\left[K\,I_{\{X_i > \xi\}} + (K+P)\,I_{\{X_i < \xi \text{ and } Y_i > mc\}}\right] \qquad (20)$$

where $I$ is the indicator function. Because of the dimensionality problem this method requires a huge sample size to achieve even reasonable accuracy, and repeated computation for varying $\xi$ becomes very inefficient.
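The joint-sampling estimator of Equation (20) can be sketched in a few lines. In the example below the prior sampler is passed in as a function so that any distribution that is easy to simulate could be substituted; for illustration only, it is set to the conjugate $\Gamma(3, 10)$ prior for $\theta$, for which the exact answer is available from Section 2. The names `naive_risk` and `conjugate_prior`, the seed and the number of draws are assumptions.

```python
import random

def naive_risk(xi, n, m, c, K, P, draw_theta, n_draws, seed=0):
    """Estimate R(pi, delta_xi) by Equation (20), sampling (theta, X, Y)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        theta = draw_theta(rng)
        # Total of n exponential samples: Gamma(shape n, rate theta).
        # random.gammavariate takes (shape, scale), so scale = 1 / theta.
        x = rng.gammavariate(n, 1.0 / theta)
        if x > xi:
            total += K          # bin sent straight to the expensive plant
        # Y is only needed when a1 is chosen, so it is drawn lazily here;
        # this leaves the distribution of the estimator unchanged.
        elif rng.gammavariate(m, 1.0 / theta) > m * c:
            total += K + P      # bin rejected by the cheaper plant
    return total / n_draws

def conjugate_prior(rng, alpha=3.0, lam=10.0):
    """theta ~ Gamma(alpha, rate lam), the prior of Section 2."""
    return rng.gammavariate(alpha, 1.0 / lam)

estimate = naive_risk(xi=9.4225, n=1, m=1, c=5.0, K=10.0, P=15.0,
                      draw_theta=conjugate_prior, n_draws=100_000)
print(f"naive Monte Carlo risk estimate: {estimate:.2f}")
```

Each draw contributes only 0, $K$ or $K+P$, so the estimate is noisy: even with 100,000 draws the second decimal place is not reliable.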

A better approach is to simulate for $\theta$ only and to calculate directly $P_\theta[X > \xi]$ and $P_\theta[Y > mc]$. These probabilities are incomplete gamma functions and can be calculated quite efficiently (see Press et al. [6]), giving

$$R(\pi, \delta_\xi) \approx \frac{1}{N}\sum_{i=1}^{N}\left[K\left(1 - IG_{\theta_i\xi}(n)\right) + (K+P)\,IG_{\theta_i\xi}(n)\left(1 - IG_{\theta_i mc}(m)\right)\right] \qquad (21)$$

where $IG_x(k)$ denotes the incomplete gamma function

$$IG_x(k) = \frac{1}{\Gamma(k)}\int_0^x u^{k-1}e^{-u}\,du \qquad (22)$$

Note that only the $X$ probabilities depend on $\xi$, so the $Y$ probabilities for each $\theta_i$ may be stored and re-used. This method requires a much smaller sample of $\theta$ values to achieve reasonable accuracy, and is therefore more efficient than use of the full multivariate joint distribution.
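A minimal sketch of the estimator in Equation (21) follows. Because $n$ and $m$ are integers, the incomplete gamma function reduces to a finite sum, which is used here in place of a general-purpose routine. The $Y$ probabilities are computed once per $\theta_i$ and re-used for every trial cutoff. The conjugate $\Gamma(3, 10)$ prior, the coarse grid search over $\xi$, the seed and all function names are assumptions chosen for illustration; any prior that can be simulated could replace the list `thetas`.

```python
import random
from math import exp

def inc_gamma(x, k):
    """Regularised incomplete gamma IG_x(k), Equation (22), for positive
    INTEGER k (the Erlang distribution function)."""
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= x / j
        total += term
    return 1.0 - exp(-x) * total

def make_risk_function(thetas, n, m, c, K, P):
    """Return a function xi -> estimate of R(pi, delta_xi), Equation (21)."""
    # P_theta[Y > mc] does not depend on xi: compute once and store.
    y_reject = [1.0 - inc_gamma(t * m * c, m) for t in thetas]

    def risk(xi):
        total = 0.0
        for t, q in zip(thetas, y_reject):
            p_a1 = inc_gamma(t * xi, n)   # P_theta[X < xi]
            total += K * (1.0 - p_a1) + (K + P) * p_a1 * q
        return total / len(thetas)

    return risk

rng = random.Random(0)
# Illustration only: theta ~ Gamma(shape 3, rate 10), i.e. scale 0.1.
thetas = [rng.gammavariate(3.0, 0.1) for _ in range(10_000)]
risk = make_risk_function(thetas, n=1, m=1, c=5.0, K=10.0, P=15.0)

# Stage 2 of the calculation: a crude grid search over the cutoff.
grid = [5.0 + 0.5 * i for i in range(21)]
best_xi = min(grid, key=risk)
print(f"estimated cutoff {best_xi:.1f}, estimated risk {risk(best_xi):.2f}")
```

Using the same $\theta_i$ for every trial value of $\xi$ makes the estimated risk a smooth function of the cutoff, which helps the minimisation; a finer grid or a one-dimensional optimiser could replace the grid search.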

Using the prior and parameter values from Section 2 it was found that $N = 10{,}000$ gave sufficient accuracy (2 dp) and a reasonable computation time (about 90s). Now however we are no longer restricted to a small class of priors. The calculation was repeated using a $\Gamma(4, 1)$ prior for $1/\theta$. This is shown in Figure 3 for comparison with Figure 1; it is similar but much less long-tailed. The estimated Bayes Risk and cutoff value using this prior are given in Tables 4 and 5. We now find that with a cost of sampling of 0.1 the best option is $n = m = 1$.

Table 4. Bayes Risk for various sample sizes, $1/\theta \sim \Gamma(4, 1)$

        m=1     2       3       4       5       6
n=1     6.509   6.606   6.554   6.495   6.443   6.400
2       6.469   6.450   6.408   6.321   6.249   6.190
3       6.435   6.418   6.298   6.193   6.107   6.037
4       6.406   6.354   6.214   6.094   5.998   5.920
5       6.382   6.302   6.146   6.016   5.911   5.827
6       6.361   6.260   6.091   5.952   5.840   5.751

Figure 3. Prior inverted gamma density $g(1/\theta)$ for mean level of radioactivity ($1/\theta$) with $\alpha = 3$ and $\lambda = 10$

Table 5. Cutoff point for $\delta_\xi$, $1/\theta \sim \Gamma(4, 1)$

        m=1     2       3       4       5       6
n=1     14.538  10.36   9.374   8.907   8.635   8.52
2       9.813   7.494   6.909   6.685   6.561   6.469
3       8.257   6.529   6.137   5.974   5.894   5.816
4       7.514   6.085   5.758   5.619   5.541   5.487
5       7.075   5.829   5.532   5.413   5.354   5.315
6       6.777   5.645   5.390   5.283   5.240   5.200

5. Discussion

In the above analysis we have considered the problem only from the client's point of view, assuming that he can pay the reprocessor to take extra samples, as well as deciding to take more samples himself, if this is to his advantage. We have also assumed that the critical value $c$ used by the reprocessor is kept fixed for different sample sizes. If we now consider the reprocessor's point of view, he clearly does not want to accept material which has too high a level of radioactivity. We assume that he requires the mean level for each bin to be less than $c$, but that he does not allow for sampling variability in making his test. Were he to do so, he would want to adjust the critical value depending on the sample size $m$. It would also be to his advantage to take extra samples. Rather than use decision theory merely to improve the decision-making of one side in the process, as we have done here, it would be more appropriate to use an agreed decision theory framework as a negotiating tool in establishing an optimal sampling scheme which would be of benefit to both parties.

It has already been noted that the optimal solution for the problem we have considered seems to be quite sensitive to the parameter values and prior information. This shows the need for estimated costs to be as accurate as possible, and for prior data to be incorporated in choosing the prior distribution, possibly through an empirical Bayes approach. This sensitivity is probably due in part to the use of long-tailed distributions. There is a considerable range of behavior for different parameter values and distributions. In particular the fact that increasing the sample size for $Y$ may either increase or decrease the risk is interesting, and is the subject of further investigation.

Acknowledgments

The author would like to thank the Editor and the referees for their support and helpful suggestions.

References

1. J. Aitchison and I. R. Dunsmore. Statistical Prediction Analysis. University Press, Cambridge, 1975.

2. M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970.

3. T. S. Ferguson. Mathematical Statistics: a Decision Theoretic Approach. Academic Press, New York, 1967.

4. S. Geisser. Predictive Inference: an Introduction. Chapman and Hall, London, 1993.

5. D. V. Lindley and N. D. Singpurwalla. On the evidence needed to reach agreed action between adversaries, with application to acceptance sampling. J. Amer. Statist. Assoc., 86, 933-937, 1991.

6. W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery. Numerical Recipes: The Art of Scientific Computing, 2nd ed. University Press, Cambridge, 1992.

7. H. V. Roberts. Probabilistic prediction. J. Amer. Statist. Assoc., 60, 50-62, 1965.

8. L. J. Wolfson, J. B. Kadane and M. J. Small. A subjective Bayesian approach to environmental sampling. In: Case Studies in Bayesian Statistics Vol. 3 (C. Gatsonis, J. S. Hodges, R. E. Kass, R. McCulloch, P. Rossi and N. D. Singpurwalla, eds.) Springer-Verlag, New York, 457-468, 1997.
