MANNWHITNEY procedure
Performs a Mann-Whitney U test
(S.J. Welham, N.M. Maclaren & H.R. Simpson)
Options
GROUPS
= factor Defines the samples for a two-sample test if only the Y1 parameter is specified
Parameters
Y1
= variates Identifier of the variate holding the first sample if Y2 is set, or both samples if Y2 is unset (the GROUPS option must then also be set)
Y2
= variates Identifier of the variate holding the second sample
R1
= variates Saves the ranks of the first sample if Y2 is set, or both samples if Y2 is unset
R2
= variates Saves the ranks of the second sample if Y2 is set
STATISTIC
= scalars Scalar to save the test statistic U
NORMAL
= scalars Scalar to save the normal approximation to the test statistic
SIGN
= scalars Scalar to save an indicator: 1 if the first sample scores the highest ranks on average, 0 otherwise
Description
The Mann-Whitney U test is a test for differences in location between two samples. The data for the samples can either be stored in two separate variates, supplied by the parameters Y1 and Y2, or in a single variate, supplied by Y1, with the GROUPS option set to a factor to identify which unit belongs to each sample. The GROUPS option is ignored when the Y2 parameter is set.
MANNWHITNEY calculates the test statistic U, along with its Normal approximation if both sample sizes are larger than 5. These statistics can be saved using the STATISTIC and NORMAL parameters respectively, and are displayed by the test setting of the option PRINT. The ranks setting of PRINT produces vectors of ranks (with respect to the whole data set) for each sample. Parameter SIGN holds an indicator which takes the value 1 if the ranks in the first sample are higher on average than those in the second sample, and takes the value 0 otherwise. The ranks (with respect to the combined data set) for each sample can be saved using the R1 and R2 parameters.
Options: PRINT, GROUPS.
Parameters: Y1, Y2, R1, R2, STATISTIC, NORMAL, SIGN.
Method
The Mann-Whitney (or Wilcoxon) U-test is a two-sample test of location difference: i.e. a test of the null hypothesis that the two samples arise from distributions with the same mean vs. the alternative that the distribution means differ.
The test statistic U is formed using ranks found from the combined data set, and is taken to be the smaller of U1 and U2, where
$U_k = n_1 n_2 + n_k (n_k + 1)/2 - R_k, \quad k = 1, 2$
and nk is the size of sample k, Rk is the sum of ranks for sample k. This score Uk can be interpreted as the number of times a rank score in the other sample precedes a score in sample k in the ranking. So the sample with the lowest score has, on average, smaller rank scores.
The normal approximation to this statistic is
$\mathrm{Normal} = (n_1 n_2 / 2 - U) \,/\, \sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}$
and is valid when both sample sizes are at least 5. If ties are present, then the standard error of the normal approximation (i.e. the denominator) must be calculated by:
$\sqrt{\{n_1 n_2 / (N (N - 1))\} \, \{(N^3 - N)/12 - \sum_k T_k\}}$
where $N = n_1 + n_2$ is the total number of observations, $T_k = (t_k^3 - t_k)/12$ and $t_k$ is the number of observations with rank k. (See for example Siegel 1956, pages 116-127.) When the samples are too small for the normal approximation, MANNWHITNEY looks up the probability from a stored table.
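The formulas above can be sketched in a few lines of Python. This is an illustration only, not the procedure's own code: the function names midranks and mann_whitney are hypothetical, tied values are assumed to receive the average of the ranks they span, and the tie-corrected standard error is used throughout (it reduces to the untied formula when there are no ties).

```python
import math

def midranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney(y1, y2):
    """Return U, its normal approximation and the SIGN indicator."""
    n1, n2 = len(y1), len(y2)
    n = n1 + n2
    ranks = midranks(list(y1) + list(y2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    u = min(u1, u2)
    # Tie correction: t is the number of observations sharing each rank.
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    tie_sum = sum((t ** 3 - t) / 12 for t in counts.values())
    se = math.sqrt(n1 * n2 / (n * (n - 1)) * ((n ** 3 - n) / 12 - tie_sum))
    normal = (n1 * n2 / 2 - u) / se
    sign = 1 if r1 / n1 > r2 / n2 else 0
    return {"u": u, "normal": normal, "sign": sign}
```

For example, mann_whitney([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]) gives U = 0, because the two samples do not overlap at all.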
Action with RESTRICT
The Y1 and Y2 variates can be restricted, and in different ways. MANNWHITNEY uses only those units of each variate that are not excluded by their respective restrictions.
Reference
Siegel, S. (1956). Nonparametric Statistics for the behavioural sciences. McGraw-Hill, New York.
MANOVA procedure
Performs multivariate analysis of variance and covariance
(R.W. Payne & G.M. Arnold)
Options
APRINT
= strings Printed output from the univariate analyses of variance of the y-variates (as for the ANOVA PRINT option); default *
UPRINT
= strings Printed output from the univariate unadjusted analyses of variance of the y-variates (as for the ANOVA UPRINT option); default *
CPRINT
= strings Printed output from the univariate analyses of variance of the covariates (as for the ANOVA CPRINT option); default *
TREATMENTSTRUCTURE
= formula Treatment formula for the analysis; if this is not set, the default is taken from the setting (which must already have been defined) by the TREATMENTSTRUCTURE directive
BLOCKSTRUCTURE
= formula Block formula for the analysis; if this is not set, the default is taken from any existing setting specified by the BLOCKSTRUCTURE directive and if neither has been set the design is assumed to be unstratified (i.e. to have a single error term)
COVARIATES
= pointer Covariates for the analysis; by default MANOVA uses those listed by a previous COVARIATE directive (if any)
FACTORIAL
= scalar Limit on the number of factors in a treatment term
LRV
= pointer Contains elements first for the treatment terms and then the covariate term (if any), allowing the LRV's to be saved from one of the analyses; if a term is estimated in more than one stratum, the LRV is taken from the lowest stratum in which it is estimated
Parameter
Y
= variates Y-variates for an analysis
Description
Procedure MANOVA performs multivariate analysis of variance or covariance. The data variates are specified by the Y parameter.
The model for the design is specified by options of the procedure. TREATMENTSTRUCTURE specifies a model formula to define the treatment terms in the analysis; if this is unset, MANOVA will use the model already defined by the TREATMENTSTRUCTURE directive, or will fail if that too has not been set. BLOCKSTRUCTURE defines the underlying structure of the design, and MANOVA will use the model (if any) previously defined by the BLOCKSTRUCTURE directive if this is not set; these can both be omitted if there is only one error term (i.e. if the design is unstratified). The COVARIATES option specifies any covariates; by default MANOVA will take those already listed (if any) by the COVARIATE directive. The FACTORIAL option can be used to set a limit on the number of factors in the terms generated from the treatment formula.
The LRV option allows a pointer to be saved containing an LRV structure for each treatment term. When covariates have been specified, the pointer will also contain a final LRV structure for the covariate term. If a term is estimated in more than one stratum, the LRV is taken from the stratum that occurs last in the BLOCKTERMS pointer. The structures in the LRV hold the canonical variate loadings, roots and trace for the respective term.
The other options control printed output. PRINT indicates the output required from the multivariate analysis of covariance, with settings ssp to print the sums of squares and products matrices, and tests to print the various test statistics (Wilks' Lambda, with Chi-square and F approximations, the Pillai-Bartlett trace, Roy's maximum root test and the Lawley-Hotelling trace). APRINT, UPRINT and CPRINT control output from the univariate analyses of each of the y-variates, corresponding to ANOVA options PRINT, UPRINT and CPRINT, respectively.
Options: PRINT, APRINT, UPRINT, CPRINT, TREATMENTSTRUCTURE, BLOCKSTRUCTURE, COVARIATES, FACTORIAL, LRV.
Parameter: Y.
Method
The relevant theory, with formulae and references for the test statistics, can be found in Chatfield & Collins (1986, Chapter 9). The procedure analyses the data variates by ANOVA first as y-variates, and then as covariates in order to obtain the SSP matrices. The SSP matrices are then adjusted for the covariates, using matrix manipulation in CALCULATE, and LRV decompositions are done, before the test statistics are calculated (again using CALCULATE).
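As a rough illustration of how the four test statistics relate to the SSP matrices, the Python sketch below computes them for two y-variates from a treatment SSP matrix h and a residual SSP matrix e. It uses the standard textbook definitions rather than the LRV decomposition that MANOVA itself performs; all function names are hypothetical, and Roy's statistic is assumed here to be the largest root of H E^-1 (some texts report that root divided by one plus itself instead).

```python
import math

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def mul2(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace2(m):
    return m[0][0] + m[1][1]

def manova_statistics(h, e):
    """Test statistics from treatment (h) and residual (e) SSP matrices."""
    total = [[h[i][j] + e[i][j] for j in range(2)] for i in range(2)]
    he_inv = mul2(h, inv2(e))
    # Roots of H E^-1 from its characteristic polynomial.
    tr, dt = trace2(he_inv), det2(he_inv)
    largest_root = (tr + math.sqrt(max(tr * tr - 4 * dt, 0.0))) / 2
    return {
        "wilks": det2(e) / det2(total),
        "pillai": trace2(mul2(h, inv2(total))),
        "lawley_hotelling": tr,
        "roy": largest_root,
    }
```

With h = [[2, 0], [0, 1]] and e = [[2, 0], [0, 4]] the roots of H E^-1 are 1 and 0.25, so Wilks' Lambda is 0.4 and the Pillai-Bartlett trace is 0.7.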
Action with RESTRICT
If any of the y-variates is restricted, the analysis will involve only the units not excluded by the restriction.
Reference
Chatfield, C. & Collins, A.J. (1986). Introduction to Multivariate Analysis (revised edition). Chapman and Hall, London.
MENU procedure
Initiates a menu system
(P.W. Lane)
Options
FILENAME
= text A single string giving the filename of the base file for the menu system; the default is the base file of the standard menu system (stored by the standard Start-up File in text _flmnbas)
INCHANNEL
= scalar Input channel on which the base file is to be opened; default is the channel assumed by the standard menu system (stored by the standard Start-up File in scalar _chmnbas)
OUTCHANNEL
= scalar Output channel on which a file has been opened to keep a record of work done; default is the channel assumed by the standard menu system (stored by the standard Start-up File in scalar _chcomnd)
No parameters
Description
The MENU procedure enters a menu system. By default, the standard Genstat Menu System is started, but the option FILENAME can be set to the name of an alternative file of Genstat commands. The option INCHANNEL specifies the input channel on which the alternative file is to be opened, and OUTCHANNEL specifies the output channel which may be used within the alternative file for keeping a record of commands. The procedure only operates interactively: in batch mode, immediate exit occurs without a diagnostic.
Options: FILENAME, INCHANNEL, OUTCHANNEL.
Parameters: none.
Method
The input channel INCHANNEL is closed, without a warning being printed if it was not actually open. Then the base file is opened on INCHANNEL and control is passed to the commands in the file, with echoing of commands switched off. Automatic logging (via the COPY directive) to the channel OUTCHANNEL is switched off.
MPOWER procedure
Forms integer powers of a square matrix
(P.W. Lane)
No options
Parameters
MATRIX
= matrices, symmetric matrices or diagonal matrices Matrix from which to form the power
POWER
= scalars Power to which each matrix is to be raised
RESULT
= identifiers Structure to store the result
Description
MPOWER forms powers of a square matrix, using as few matrix operations as possible in order to save time and decrease rounding errors. The square matrix is specified using the MATRIX parameter, and can be either an ordinary matrix structure (with an equal number of rows and columns), a symmetric matrix or a diagonal matrix. The required power, which must be a positive integer, is specified using the POWER parameter. The RESULT parameter supplies the identifier of the structure to save the results; this will be declared automatically to be of the same type as the input structure.
Options: none.
Parameters: MATRIX, POWER, RESULT.
Method
For general matrices, successive powers of two of the matrix are formed by matrix products, and the result formed by taking the product of those that are needed to achieve the specified power. Diagonal matrices are dealt with using simple exponentiation of the diagonal values. Symmetric matrices are spectrally decomposed, and the result formed as a product of the matrix containing the latent vectors (V) with the simple power of the diagonal matrix containing the latent roots (R):
RESULT = V *+ R**POWER *+ TRANSPOSE(V)
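The approach described for general matrices is repeated squaring. A minimal Python sketch follows, with matrices held as lists of rows; the names mat_mul and mat_power are hypothetical and this is not the procedure's own code.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(matrix, power):
    """Raise a square matrix to a positive integer power by repeated squaring."""
    if power != int(power) or power < 1:
        raise ValueError("power must be a positive integer")
    remaining = int(power)
    square = [row[:] for row in matrix]  # holds matrix**(2**k) at step k
    result = None
    while remaining:
        if remaining & 1:
            # This power of two is needed: fold it into the result.
            result = square if result is None else mat_mul(result, square)
        remaining >>= 1
        if remaining:
            square = mat_mul(square, square)
    return result
```

Raising a matrix to the power 5 this way takes three matrix products (two squarings and one combination) instead of the four needed by repeated multiplication, and the saving grows with the power.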
MULTMISS procedure
Estimates missing values for units in a multivariate data set
(H.R. Simpson & R.P. White)
Option
MAXCYCLE
= scalar Defines the maximum allowed number of iterations; default 10
Parameters
DATA
= pointers Each pointer contains a set of variates whose missing values are to be estimated; these will be overwritten by the estimates unless the OUT parameter is specified
OUT
= pointers Each pointer contains a set of variates to hold the results
Description
MULTMISS estimates missing values for units in a multivariate data set, using an iterative regression technique. The input for the procedure is a set of variates contained in a pointer specified by the DATA parameter. The output can be saved in a different set of variates by supplying a similar pointer with the parameter OUT; if this is absent, the output values will overwrite the values of the variates given by DATA. The maximum number of iterations is set by the option MAXCYCLE, with a default of 10. If MAXCYCLE is set to zero, missing values will be replaced by variate means calculated from the units that have no values missing for any of the variates.
Option: MAXCYCLE.
Parameters: DATA, OUT.
Method
Initial estimates of the missing values in each variate are formed from the variate means using the values for units that have no missing values for any variate. Estimates of the missing values for each variate are then recalculated as the fitted values from the multiple regression of that variate on all the other variates. When all the missing values have been estimated the variate means are recalculated. If any of the means differs from the previous mean by more than a tolerance (the initial standard error divided by 1000) the process is repeated, subject to a maximum number of repetitions defined by the MAXCYCLE option.
The default maximum number of iterations (10) is usually sufficient when there are few missing values, say two or three. If there are many more, 20 or so, it may be necessary to increase the maximum number of iterations to around 30.
The method is similar to that of Orchard & Woodbury (1972), but does not adjust for bias in the variance-covariance matrix as suggested by Beale & Little (1975).
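The iteration can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure's own code: the function names are hypothetical, missing values are marked by None, each regression is assumed to be fitted to the units where the response variate was observed (using the current estimates for the other variates), and the tolerance is assumed to be the standard deviation of each variate after the initial fill, divided by 1000.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def regress(rows, y):
    """Least-squares coefficients (constant first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def multmiss(data, maxcycle=10):
    """data is a list of variates (lists); None marks a missing value."""
    nvar, nunit = len(data), len(data[0])
    missing = [[v is None for v in variate] for variate in data]
    complete = [u for u in range(nunit)
                if not any(missing[j][u] for j in range(nvar))]
    # Initial estimates: means over the units with no missing values at all.
    filled = []
    for j in range(nvar):
        mean = sum(data[j][u] for u in complete) / len(complete)
        filled.append([mean if missing[j][u] else data[j][u] for u in range(nunit)])
    means = [sum(v) / nunit for v in filled]
    # Assumed tolerance: initial standard deviation of each variate / 1000.
    tol = [(sum((x - m) ** 2 for x in v) / (nunit - 1)) ** 0.5 / 1000
           for v, m in zip(filled, means)]
    for _ in range(maxcycle):
        for j in range(nvar):
            if not any(missing[j]):
                continue
            others = [[filled[k][u] for k in range(nvar) if k != j]
                      for u in range(nunit)]
            observed = [u for u in range(nunit) if not missing[j][u]]
            coef = regress([others[u] for u in observed],
                           [filled[j][u] for u in observed])
            for u in range(nunit):
                if missing[j][u]:
                    filled[j][u] = coef[0] + sum(
                        c * x for c, x in zip(coef[1:], others[u]))
        new_means = [sum(v) / nunit for v in filled]
        converged = all(abs(a - b) <= t for a, b, t in zip(new_means, means, tol))
        means = new_means
        if converged:
            break
    return filled
```

Calling multmiss with maxcycle=0 skips the regressions altogether and returns the mean-filled variates, matching the behaviour described for MAXCYCLE=0.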
Action with RESTRICT
All the variates must be unrestricted, or they must all be restricted to the same set of units; otherwise a fault will occur in a CALCULATE statement within MULTMISS.
References
Beale, E.M.L. & Little, R.J.A. (1975). Missing values in multivariate analysis. J.R.Statist.Soc., 37, 129-145.
Orchard, T. & Woodbury, M.A. (1972). A missing information principle: theory and applications. In: Proc. 6th Berkeley Symp. Math. Statist. Prob. Vol I, 697-715.
MVARIOGRAM procedure
Fits models to an experimental variogram
(S.A. Harding & R. Webster)
Options
MODEL
WEIGHTING
CONSTANT
= string How to treat the constant (estimate, omit); default esti
WINDOW
= scalar Window in which to plot a graph; default 0 i.e. no graph
TITLE
= text Title for the graph
XUPPER
= scalar Upper limit for the x-axis in the graph
PENDATA
= scalar Pen to be used to plot the data; default 1
PENMODEL
= scalar Pen to be used to plot the model; default 2
Parameters
VARIOGRAM
= variates or matrices Experimental variogram to which the model is to be fitted, as a variate if in only one direction or as a matrix if there are several
COUNTS
= variates or matrices Counts for the points in each variogram (not required if WEIGHTING=equal)
DISTANCE
= variates or matrices Mean lag distances for the points in each variogram
DIRECTION
= variates Directions in which each variogram was computed
ESTIMATES
= variates Estimated parameter values
FITTEDVALUES
= variates Fitted values
EXIT
= scalars Exit status from the nonlinear fitting (zero indicates success; see page 433 of the Genstat 5 Release 3 Reference Manual)
Description
Procedure MVARIOGRAM uses the directives FIT, FITCURVE and FITNONLINEAR to fit various models to the experimental variogram. Models must be authorized in the sense that they cannot give rise to negative variances when data are combined. Technically they are conditionally negative semi-definite (CNSD); see Webster & Oliver (1990) or Journel & Huijbregts (1978) for an explanation.
The MODEL option specifies the model that is to be fitted. There are bounded isotropic models with finite ranges. These all take the value $c_0 + c$ for $h \ge a$, and the following values for $h < a$:
boundedlinear   $c_0 + c\,h/a$
circular   $c_0 + c\,\{1 - (2/\pi)\arccos(h/a) + (2h/(\pi a))\sqrt{1 - h^2/a^2}\}$
spherical   $c_0 + c\,\{1.5\,h/a - 0.5\,(h/a)^3\}$
doublespherical   $c_0 + c_1\{1.5\,h/a_1 - 0.5\,(h/a_1)^3\} + c_2\{1.5\,h/a_2 - 0.5\,(h/a_2)^3\}$ for $h \le a_1$
   $c_0 + c_1 + c_2\{1.5\,h/a_2 - 0.5\,(h/a_2)^3\}$ for $a_1 < h < a_2$
   where $c = c_1 + c_2$
pentaspherical   $c_0 + c\,\{1.875\,h/a - 1.25\,(h/a)^3 + 0.375\,(h/a)^5\}$
There are also bounded asymptotic models:
exponential   $c_0 + c\,\{1 - \exp(-h/a)\}$
besselk1   $c_0 + c\,\{1 - (h/a)\,K_1(h/a)\}$ (Whittle's elementary correlation, Whittle 1954)
gaussian   $c_0 + c\,\{1 - \exp(-h^2/a^2)\}$
and unbounded models:
power   $c_0 + g\,h^{a}$ (power function with exponent $a$ strictly between 0 and 2)
linear   $c_0 + c\,h$, which is a special case of the power function with exponent 1.
Finally, the affinepower function can be fitted to an experimental variogram that appears unbounded and geometrically anisotropic, i.e. one that might be made isotropic by a simple linear transformation of the spatial coordinates:
affinepower   $c_0 + \sqrt{a^2\cos^2(\phi - \theta) + b^2\sin^2(\phi - \theta)}\;h^{\mathrm{power}}$
In all these models, the intercept term (or nugget variance) c0 can be omitted by setting the CONSTANT option to omit; the default is estimate.
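Several of the model functions listed above are easy to evaluate directly, which can help when judging starting values by eye. The Python sketch below is for illustration only; the function names and argument order (h, c0, c, a) are hypothetical, and omitting the constant corresponds to passing c0 = 0.

```python
import math

def bounded_linear(h, c0, c, a):
    return c0 + c if h >= a else c0 + c * h / a

def circular(h, c0, c, a):
    if h >= a:
        return c0 + c
    r = h / a
    return c0 + c * (1 - (2 / math.pi) * math.acos(r)
                     + (2 * r / math.pi) * math.sqrt(1 - r * r))

def spherical(h, c0, c, a):
    if h >= a:
        return c0 + c
    r = h / a
    return c0 + c * (1.5 * r - 0.5 * r ** 3)

def pentaspherical(h, c0, c, a):
    if h >= a:
        return c0 + c
    r = h / a
    return c0 + c * (1.875 * r - 1.25 * r ** 3 + 0.375 * r ** 5)

def exponential(h, c0, c, a):
    # Approaches the sill c0 + c asymptotically.
    return c0 + c * (1 - math.exp(-h / a))

def gaussian(h, c0, c, a):
    return c0 + c * (1 - math.exp(-(h * h) / (a * a)))

def power(h, c0, g, exponent):
    # Unbounded; the exponent must lie strictly between 0 and 2.
    if not 0 < exponent < 2:
        raise ValueError("exponent must be strictly between 0 and 2")
    return c0 + g * h ** exponent
```

Each bounded model with a finite range reaches exactly c0 + c at h = a, so the two branches join continuously.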
The data for the procedure can be taken directly from the FVARIOGRAM directive, with parameters DISTANCES, VARIOGRAMS and COUNTS corresponding to those with the same names in FVARIOGRAM. The data will be in variates if the variogram was calculated in only one direction. If it is in several, they can either be in matrices (as generated by FVARIOGRAM) or in variates. For MODEL=affinepower directions must be supplied, using the DIRECTIONS parameter. These should be in a variate with one value for each column if the other data are in matrices; alternatively, they should be in a variate of the same length as the other variates.
The WEIGHTING option controls the weights that are used when fitting the model. The default setting counts uses the values supplied by the COUNTS parameter, cbyvar uses the COUNTS divided by the values in VARIOGRAM, and equal uses equal weights (of one).
The procedure generates rough starting values for the parameters before calling FITNONLINEAR to iterate to convergence. If the solution does not converge there are two likely reasons. First, the model may be unsuited to the particular experimental variogram: for example, a bounded model may have been specified when the variogram is clearly unbounded, or vice versa. You should choose only models that have approximately the right shape. Alternatively, the starting values may be too far from a sensible solution. In that case you should estimate starting values by inspection and insert them into MVARIOGRAM.
Printed output is controlled by the PRINT option, and includes all the usual settings as in FIT, FITCURVE or FITNONLINEAR. You can also produce a high-resolution graph of the data and the fitted model, by setting the WINDOW option to the number of a suitable window. By default WINDOW is zero, and no graph is produced. The TITLE option can supply a title for the plot. Option XUPPER can define an upper value for the x-axis (i.e. distance), and PENDATA and PENMODEL can supply the numbers of the pens to be used to plot the experimental variogram and the fitted model respectively (by default 1 and 2).
Options: PRINT, MODEL, WEIGHTING, CONSTANT, WINDOW, TITLE, XUPPER, PENDATA, PENMODEL.
Parameters: VARIOGRAM, COUNTS, DISTANCE, DIRECTION, ESTIMATES, FITTEDVALUES, EXIT.
Method
The model is fitted using directives FIT, FITCURVE or FITNONLINEAR as appropriate.
Action with RESTRICT
If the data variates are restricted, only the units not excluded by the restriction will be used.
References
Journel, A. G. and Huijbregts, C. J. (1978). Mining Geostatistics. Academic Press, London.
Webster, R. and Oliver, M.A. (1990). Statistical Methods in Soil and Land Resource Survey. Oxford University Press.
Whittle, P. (1954). On stationary processes in the plane. Biometrika, 41, 434-449.
NLCONTRASTS procedure
Fits nonlinear contrasts to quantitative factors in ANOVA
(R.C. Butler)
Options
CURVE
= string Curve (as in FITCURVE) to use for nonlinear regression (exponential, dexponential, cexponential, lexponential, logistic, glogistic, gompertz, ldl, qdl, qdq); default expo
FPROBABILITY
= string Printing of probabilities for variance ratios (yes, no); default no
PSE
= string Standard errors to print with means tables (differences, means); default diff
WEIGHT
= variate Variate of weights for each unit; default * (no weights)
Parameters
Y
= variates Data to be analysed
XFACTOR
= factors Factor with quantitative levels for which contrasts are to be found
XLEVELS
= variates Variate of values to use for the levels of XFACTOR; if unset, the factor levels themselves are used
GROUPFACTOR
= factors Factor whose interaction with XFACTOR is to be assessed
CONTRASTS
= pointers Structures to hold the estimates of the fitted contrasts: CONTRASTS[1] is a pointer with two values, labelled 'Curve' (parameter estimates for a single fitted curve) and 'Deviations' (the differences between this curve and the means for XFACTOR); CONTRASTS[2] has three values, labelled 'Common NonLin' (parameter estimates for curves fitted with common nonlinear parameters for all levels of GROUPFACTOR), 'Separate Curves' (parameter estimates for curves fitted with all parameters varying with the levels of GROUPFACTOR) and 'Deviations' (differences between the treatment means and the Separate Curves); the order of the parameters is as in the output of the procedure, the variates of estimated contrasts are labelled by the parameter names as used in the printed output, while the 'Deviations' are both tables, labelled by the relevant factors
SECONTRASTS
= pointers Structures to save the standard errors for the contrast estimates, including 'deviations'; the pointer has the same form as the CONTRASTS pointer
DFCONTRASTS
= pointers Structures to save the degrees of freedom for the contrast estimates; the pointer has the same form as the CONTRASTS pointer, except that the variates and tables are replaced by scalars
Description
The ANOVA directive allows linear contrasts to be fitted and incorporated into the analysis-of-variance table. NLCONTRASTS extends this to enable nonlinear contrasts to be fitted to the effects of a quantitative factor and its interaction with another factor. The analysis should include both main effects and the interaction between the factors. The procedure will work for any block structure providing each treatment term is estimated entirely within one stratum. The result is similar to ANOVA with a polynomial contrast, but with slightly different partitions of the treatment sums of squares. The main effect is partitioned into the sum of squares for the "Curve" and the remainder or "Deviations". The interaction sum of squares is partitioned into the sum of squares due to curves with "Common Nonlinear" parameters for the levels of the non-quantitative factor, the extra sum of squares due to having "Separate Curves" for each level of that factor, and the remaining sum of squares which again represents "Deviations".
The BLOCKSTRUCTURE and TREATMENTSTRUCTURE directives must be used in the normal way before the procedure is called, and any COVARIATES should also be defined first. The structure of the analysis-of-variance table is then accessed from inside the procedure. The Y parameter defines the variate to be analysed, and the form of nonlinear contrast is defined using the CURVE option of the procedure. The same choices of curves are available as for FITCURVE. There are four other options, PRINT, FPROBABILITY, PSE, and WEIGHT, which are exactly as for ANOVA. The XFACTOR parameter defines the factor to which the contrasts are to be fitted, and the XLEVELS parameter may be used to define x values for the regressions if the levels already defined for the factor are unsuitable. The GROUPFACTOR parameter defines the factor whose interaction with XFACTOR is to be assessed. The final three parameters CONTRASTS, SECONTRASTS and DFCONTRASTS can be used to save the parameter estimates for the contrasts, their standard errors and degrees of freedom respectively.
Options: PRINT, CURVE, FPROBABILITY, PSE, WEIGHT.
Parameters: Y, XFACTOR, XLEVELS, GROUPFACTOR, CONTRASTS, SECONTRASTS, DFCONTRASTS.
Method
ANOVA is used to obtain the basic analysis-of-variance table and the sums of squares for the treatment terms. FITCURVE is then used with the treatment means to fit three sets of curves: a single curve, curves with common nonlinear parameters, and entirely separate curves. The deviances and degrees of freedom obtained from these are used in conjunction with the treatment sums of squares to calculate the contrast sums of squares and degrees of freedom. Further details are given by Butler & Brain (1992). New lines for the analysis-of-variance table are then constructed using PRINT and EDIT, and these lines are then inserted into the table (saved in a text with ADISPLAY) using EDIT. The standard errors for the parameter estimates and deviances are based on the Residual Mean Square for the appropriate stratum. Standard errors for deviations are calculated using the method on page 501 of the Genstat 5 Release 3 Reference Manual.
Action with RESTRICT
If the Y variate is restricted, the procedure will use only the units not excluded by the restriction.
Reference
Butler, R.C. & Brain, P. (1993). Nonlinear Contrasts in ANOVA. Genstat Newsletter 29.
NORMTEST procedure
Performs tests of univariate and/or multivariate normality
(M.S. Ridout)
Option
Parameter
DATA
= variates or pointers Variates whose univariate normality is to be tested or pointers, each to a set of variates whose normality and/or multivariate normality are to be tested
Description
This procedure offers three types of test of normality.
Marginal (univariate) tests - assess the normality of each variate in turn. The variates are standardized to have mean=0, variance=1 and then transformed with the NORMAL function. The test is based on the idea that, assuming normality, these transformed values should look like a sample from a uniform distribution on (0,1).
Bivariate angle tests - assess the bivariate normality of each pair of variates in turn. The variates are standardized so that they are uncorrelated and have mean=0 and variance=1. The test is based on the following idea: if x and y are the standardized values, then the angle between the x-axis and the line joining (0,0) to (x,y) should, assuming normality, be uniformly distributed on $(0, 2\pi)$.
Radius test - provides a single overall test of multivariate normality. The variates are again standardized to have mean=0 and so that their covariance matrix is the identity matrix. The test uses the fact that if z1, z2,..., zn are the standardized values then $z_1^2 + z_2^2 + \dots + z_n^2$ should, under multivariate normality, be approximately distributed as chi-squared on n degrees of freedom.
For each type of test, the test statistics are empirical distribution function (EDF) statistics - i.e. they compare the empirical distribution function of the sample with the theoretical distribution expected under the null hypothesis. Three EDF statistics are provided for each type of test - the Anderson-Darling statistic, the Cramer-von Mises statistic and the Watson statistic. The idea is to provide good power against a wide range of alternatives. The test statistics are adjusted so that their null distribution is independent of the sample size; critical values can be printed by the procedure (option PRINT=critical).
The DATA parameter is used to indicate the variate(s) whose normality is to be assessed. If a single variate is supplied, its normality is tested using the marginal test. Alternatively, DATA can supply a pointer to a set of variates to be tested for multivariate normality.
The PRINT option can be used to select the type of test using the settings marginal, bivariateangle and radius. The setting critical allows tables of critical values to be printed, and stars requests that significant values of the test statistics be flagged with stars. Settings bivariateangle and radius are relevant only when testing for multivariate normality. By default PRINT=marginal,bivariateangle,radius.
Option: PRINT.
Parameter: DATA.
Method
The calculations are clearly set out in Aitchison (1986; Section 7.3). Bivariate angle and radius tests are described by Andrews, Gnanadesikan & Warner (1973). Stephens (1974) describes the EDF statistics used and gives tables of critical values and information on their comparative power.
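For the marginal test, the transformation to uniform values and the three EDF statistics can be sketched in Python as below. The function names are hypothetical, the variance is assumed to use the divisor n-1, and the statistics are the usual unadjusted forms; the sample-size adjustments applied by the procedure (see Stephens 1974) are not reproduced here.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def marginal_uniforms(values):
    """Standardize, apply the normal CDF, and return the sorted results."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sorted(normal_cdf((v - mean) / sd) for v in values)

def anderson_darling(u):
    """Anderson-Darling statistic for sorted u against uniform (0,1)."""
    n = len(u)
    total = sum((2 * i - 1) * (math.log(u[i - 1]) + math.log(1 - u[n - i]))
                for i in range(1, n + 1))
    return -n - total / n

def cramer_von_mises(u):
    """Cramer-von Mises statistic for sorted u against uniform (0,1)."""
    n = len(u)
    return 1 / (12 * n) + sum((u[i - 1] - (2 * i - 1) / (2 * n)) ** 2
                              for i in range(1, n + 1))

def watson(u):
    """Watson statistic: Cramer-von Mises corrected for the mean of u."""
    n = len(u)
    return cramer_von_mises(u) - n * (sum(u) / n - 0.5) ** 2
```

A typical use would be u = marginal_uniforms(data) followed by calls to the three statistic functions on u; large values indicate departure from normality.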
Action with RESTRICT
If a variate to which the DATA parameter is set is restricted, the tests will be calculated using only the units included by the restriction. Similarly, the variates in a DATA pointer can be restricted, but then must all be restricted in the same way. The procedure does not work properly with missing values. If missing values are present, RESTRICT should be used (before calling the procedure) to exclude all units for which any of the variates has a missing value.
References
Aitchison J.A. (1986). The statistical analysis of compositional data. London: Chapman & Hall.
Andrews D.F., Gnanadesikan R. & Warner J.L. (1973). Methods for assessing multivariate normality. In Multivariate Analysis III (ed. P.R. Krishnaiah) pp. 95-116. New York: Academic Press.
Stephens M.A. (1974). EDF statistics for goodness of fit and some comparisons. J.A.S.A., 69, 730-737.
NOTICE procedure
Gives access to the Genstat Notice Board (news, errors &c)
(R.W. Payne)
Option
No parameters
Description
NOTICE allows information to be printed from the Genstat 5 Notice Board. The information is stored in a backing-store file whose name is defined by Library procedure LIBFILENAME; there must be a free backing-store channel to which the file can be attached.
The PRINT option is used to specify what information is required. The possible values, with explanations in brackets, are as follows:
news (information about recent developments concerning Genstat),
library (recent developments concerning the Procedure Library),
errors (details about errors reported in Genstat 5 directives, and other advice and warnings),
instructions (instructions for authors of library procedures).
Option: PRINT.
Parameters: none.
Method
The information is held in subfile _notices of the backing-store file that holds help for the Library; the name of the file is supplied by procedure LIBFILENAME. The file is opened on the first available backing-store channel; if all the channels are in use, the procedure stops with a diagnostic. After printing the required information, the file is closed.
ORTHPOL procedure
Calculates orthogonal polynomials
(P.W. Lane)
Options
MAXDEGREE
= scalar Maximum degree of polynomial to be calculated; default is the number of identifiers in the pointer specified by the POLYNOMIAL parameter
WEIGHTS
= variate Weights to be used in orthogonalization; default * gives an equal weight to each unit
Parameters
X
= variates Values from which to calculate the polynomials; no default - this parameter must be set
POLYNOMIAL
= pointers Identifiers of variates to store results; no default - this parameter must be set
Description
Polynomials of low degree can be fitted by ordinary linear regression, estimating effects of terms X, X**2, X**3, and so on for a variate X. However, it is sometimes preferable to arrange that successive polynomial terms are orthogonal to each other; certainly, there are likely to be numerical problems with polynomials of degree five or more, if they are not orthogonal. ORTHPOL calculates orthogonal polynomials up to a specified maximum degree from a given variate. The orthogonalization can be weighted by specifying a variate of weights.
Options: MAXDEGREE, WEIGHTS.
Parameters: X, POLYNOMIAL.
Method
Successive formation of polynomials, starting with p1 = x - mean(x), ensuring orthogonality of pi with p1 ... pi-1; that is:
$\sum (\mathrm{weight} \times p_i \times p_j) = 0$ for $j = 1, \dots, i-1$
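One way to carry out this construction is weighted Gram-Schmidt orthogonalization of the powers of x, sketched below in Python. The function name orthpol is hypothetical, the polynomials are left unscaled (any normalization that ORTHPOL may apply is not stated above), and with unequal weights the mean in p1 is taken here to be the weighted mean.

```python
def orthpol(x, maxdegree, weights=None):
    """Return polynomials p1 .. p_maxdegree, mutually orthogonal under weights."""
    n = len(x)
    w = weights if weights is not None else [1.0] * n

    def dot(a, b):
        # Weighted inner product: sum of weight * a * b over the units.
        return sum(wi * ai * bi for wi, ai, bi in zip(w, a, b))

    basis = [[1.0] * n]  # degree 0 (the constant), used only for centring
    for degree in range(1, maxdegree + 1):
        p = [xi ** degree for xi in x]
        # Remove the component along each lower-degree polynomial in turn.
        for q in basis:
            coefficient = dot(p, q) / dot(q, q)
            p = [pi - coefficient * qi for pi, qi in zip(p, q)]
        basis.append(p)
    return basis[1:]
```

For x = 1, 2, 3, 4, 5 with equal weights this gives p1 = (-2, -1, 0, 1, 2) and p2 = (2, -1, -2, -1, 2), the familiar linear and quadratic contrast coefficients for five equally spaced levels.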
Action with RESTRICT
A variate in the X parameter can be restricted: the restriction is transferred to the calculated polynomials, and to the weight variate if specified.
PAIRTEST procedure
Performs t-tests for pairwise differences
(P.W. Goedhart)
Options
DF
= scalar Degrees of freedom for calculation of TPROBABILITIES from TVALUES; default 10000, approximates to the normal distribution
SORT
= string Whether ESTIMATES (and other output) are sorted in ascending order (yes, no); default no
Parameters
ESTIMATES
= variates Estimates to be compared
VCOVARIANCE
= symmetric matrices Symmetric matrix containing the variance-covariance matrix of the estimates
LABELS
= texts Text vector naming the elements of ESTIMATES; if unset, the numbers 1, 2... are used as labels
DIFFERENCES
= symmetric matrices To save the pairwise differences (ESTIMATES on the diagonal)
SED
= symmetric matrices To save the standard errors of the pairwise differences (missing values on the diagonal)
TVALUES
= symmetric matrices To save the t-values (missing values on the diagonal)
TPROBABILITIES
= symmetric matrices To save the t-probabilities (missing values on the diagonal)
Description
PAIRTEST can be used to test all pairwise differences in every situation in which a vector of estimates and a corresponding variance-covariance matrix are available. PAIRTEST is particularly useful for tests of all pairwise differences of slopes after fitting a model with an interaction between a factor and a variate. In most other situations procedure RPAIR will be more suitable.
All pairwise differences of entries in ESTIMATES with variance-covariance matrix VCOVARIANCE are calculated and tested. The results of these tests can be saved in symmetric matrices DIFFERENCES, SED, TVALUES and TPROBABILITIES. The matrices are labelled by text vector LABELS or, if LABELS is unset, by the numbers 1, 2, 3...
Options:
PRINT, DF, SORT. Parameters: ESTIMATES, VCOVARIANCE, LABELS, DIFFERENCES, SED, TVALUES, TPROBABILITIES.
Method
The calculations are all relatively straightforward.
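As an illustration of what these calculations involve, the Python sketch below forms each pairwise difference, its standard error from the variance-covariance matrix, and the resulting t-value. It is not the PAIRTEST source. Two assumptions are made: differences are taken as row estimate minus column estimate, and the two-sided probability is taken from the Normal distribution, which is what the default DF=10000 approximates. The function name is hypothetical.

```python
import math

def pairwise_tests(estimates, vcov):
    """Return (differences, sed, tvalues, tprob) as dicts keyed by (i, j), i > j."""
    diff, sed, tval, tprob = {}, {}, {}, {}
    for i in range(len(estimates)):
        for j in range(i):
            d = estimates[i] - estimates[j]
            # var(e_i - e_j) = v_ii + v_jj - 2 * v_ij
            s = math.sqrt(vcov[i][i] + vcov[j][j] - 2.0 * vcov[i][j])
            t = d / s
            diff[i, j], sed[i, j], tval[i, j] = d, s, t
            # Two-sided Normal probability (large-DF limit of the t distribution).
            tprob[i, j] = math.erfc(abs(t) / math.sqrt(2.0))
    return diff, sed, tval, tprob
```

For a small DF the probability would have to come from the t distribution instead, which the Python standard library does not provide.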
Action with RESTRICT
The variate ESTIMATES and the text LABELS can be restricted; the analysis is restricted according to restrictions on ESTIMATES. The lengths of the unrestricted vectors ESTIMATES and LABELS must be identical.
PCOPROC procedure
Performs a multiple Procrustes analysis
(P.G.N. Digby)
Options
PROTATE
= strings Printed output required from each Procrustes rotation (rotations, coordinates, residuals, sums); default * i.e. no output
PPCO
= strings Printed output required from the PCO analysis (roots, scores, centroid); default root, score, cent
SCALING
= string Whether isotropic scaling should be used for the Procrustes rotations (no, yes); default no
STANDARDIZE
= strings Whether to centre the configurations and/or normalize them to unit sums-of-squares for the Procrustes rotations (centre, normalize); default cent, norm
Parameters
DATA
= pointers Each pointer points to a set of matrices holding the original input configurations
LRV
= LRVs Stores the latent vectors (i.e. coordinates), roots and trace from the PCO analysis
CENTROID
= diagonal matrices Stores the squared distances of the points representing the input configurations from their overall centroid from the PCO analysis
DISTANCES
= symmetric matrices Stores the residual sums-of-squares from the Procrustes rotations
Description
An N × V matrix represents a configuration of points, for each of N units, in V dimensions. Given a set of M such matrices, a multiple Procrustes analysis compares them in pairs, keeping the residual sums-of-squares, and performs a principal coordinate analysis of the residual sums-of-squares to obtain an ordination representing the individual configurations. The rows of the matrices must represent the same set of units, in the same order; however there is no need for them to have the same number of columns (although generally they will do). An example of the use of multiple Procrustes analysis is given by Digby & Kempton (1987, pages 121-3).
The configurations of points are specified using the DATA parameter. This supplies a pointer containing a matrix with the data for each configuration. The PROTATE option controls the output from the individual Procrustes rotations, and the PPCO option controls that from the principal coordinate analysis. There are M × (M-1)/2 Procrustes rotations so, by default, PROTATE=* to suppress any output. The SCALING and STANDARDIZE options control the way in which the Procrustes rotations are carried out, using the SCALING and STANDARDIZE options of ROTATE. However, the combination of SCALING=yes and STANDARDIZE=centre should not be used, because then the results will be dependent on the order of the input matrices.
The LRV and CENTROID parameters can be used to save results from the principal coordinates analysis, and the DISTANCES parameter can be used to save the symmetric matrix of the residual sums-of-squares from the Procrustes analyses.
Options:
PROTATE, PPCO, SCALING, STANDARDIZE. Parameters: DATA, LRV, CENTROID, DISTANCES.
Method
The pairwise Procrustes rotations are performed using the ROTATE directive, and the residual sums-of-squares are stored in a symmetric matrix of order M. This matrix is then used as input to a principal coordinate analysis, performed using the PCO directive on a suitably transformed copy of the matrix.
Reference
Digby, P.G.N. & Kempton, R.A. (1987). Multivariate analysis of ecological communities. London: Chapman and Hall.
PDESIGN procedure
Prints or stores treatment combinations tabulated by the block factors
(R.W. Payne)
Options
BLOCKSTRUCTURE
= formula Defines the block factors for the design; the default is to take those specified by the BLOCKSTRUCTURE directive
TREATMENTSTRUCTURE
= formula Defines the treatment factors for each design; the default is to take those specified by the TREATMENTSTRUCTURE directive
TABLES
= pointer Contains tables to store the tabulated factor values for printing outside the procedure in some other format
No parameters
Description
PDESIGN allows the treatment combinations allocated to each plot in a design to be displayed as tables, classified by the block factors.
The combinations are represented using the levels of the treatment factors. If any factor also has labels these are printed alongside the levels, as a key, after the tables. The levels are printed in formats that are determined automatically in a way that avoids wasted space or unnecessary decimal places. The block factors are obtained from the block structure of the design, which can be specified explicitly using the BLOCKSTRUCTURE option; otherwise PDESIGN will use any structure that has already been defined by a BLOCKSTRUCTURE statement earlier in the job. Similarly, the treatment factors are obtained either from the TREATMENTSTRUCTURE option of the procedure, or from an earlier TREATMENTSTRUCTURE statement.
If the display produced by the procedure is unsuitable, printing can be suppressed by setting option PRINT=* (by default PRINT=design), and the tables of treatment levels can be saved for printing outside the procedure by setting the TABLES option to a pointer. This will be returned with an element for each treatment factor, pointing to a table classified by the block factors and storing the tabulated levels of the treatment.
Options:
PRINT, BLOCKSTRUCTURE, TREATMENTSTRUCTURE, TABLES. Parameters: none.
Method
The FCLASSIFICATION directive is used to form lists of factors from the block or treatment formulae and, if the block factors do not supply a unique combination of levels for every unit of the design, procedure AFUNITS is used to form a factor to index the units with each combination. Each treatment factor is then copied into a variate and TABULATE is used to put the values into a table classified by the block factors. Numbers of decimal places for printing the factor levels are determined using the DECIMALS procedure.
Action with RESTRICT
If any of the factors is restricted, only the part of the design not excluded by the restriction will be displayed.
PERCENT procedure
Expresses the body of a table as percentages of one of its margins
(R.W. Payne)
Options
CLASSIFICATION
= factors Factors classifying the margin over which the percentages are to be calculated
METHOD
= string Method to use to calculate the margin if not already present (totals, means, minima, maxima, variances, medians); default tota
HUNDRED
= string Whether to put 100% values into the margin instead of the original values (no, yes); default no
Parameters
OLDTABLE
= tables Tables containing the original values
NEWTABLE
= tables Tables to store the percentage values; if any of these is unset, the new values replace those in the original table
Description
PERCENT allows you to express the body of a table as percentages of the values in one of its margins. The table is specified using the OLDTABLE parameter. A table to store the new values can be specified using the NEWTABLE parameter; otherwise these replace the values of the original table. The margin is indicated by listing the factors that define it using the CLASSIFICATION option; the default is the final margin (the grand total, or grand mean etc). If the original table has no margins, option METHOD defines how these are to be calculated; the default is to form margins of totals. The values originally in the margin will be left unchanged. If you would prefer these to be replaced by values of 100%, you should set option HUNDRED=yes.
Options:
CLASSIFICATION, METHOD, HUNDRED. Parameters: OLDTABLE, NEWTABLE.
Method
If the OLDTABLE has no margins and contains no missing values, these are formed by the MARGIN directive. Alternatively, if there are missing values, margins other than variances can be formed using TABULATE. CALCULATE is then used to put the required margin into a table classified just by the factors that define the margin. The original table is divided by the marginal table and multiplied by 100 to give the required percentages. If option HUNDRED=no, the same operations are done on a dummy table that originally contains random numbers; for this table, values of 100 should occur only in the margin. Thus by using a logical test in which the values of the dummy table are compared with 100, the marginal values of the original table can be put back into the margin of the final table. The random numbers are generated using a specially written procedure URANDOM in case the Genstat random number generator is already in use in the program that called PERCENT.
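The core arithmetic (divide the body by the chosen margin and multiply by 100) can be illustrated with a small Python sketch. It is not how PERCENT is implemented: it handles only a two-way table of totals, takes the body as a list of rows, and uses a hypothetical `margin` argument in place of the CLASSIFICATION option.

```python
def percentages(body, margin="grand"):
    """Express a two-way table body as percentages of a margin of totals.

    margin="grand" divides every cell by the grand total;
    margin="row" divides each cell by the total of its own row.
    """
    if margin == "grand":
        total = sum(sum(row) for row in body)
        return [[100.0 * v / total for v in row] for row in body]
    if margin == "row":
        return [[100.0 * v / sum(row) for v in row] for row in body]
    raise ValueError("margin must be 'grand' or 'row'")
```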
PERIODTEST procedure
Gives periodogram-based tests for white noise in time series
(R.P. Littlejohn)
Option
LENGTH
= scalar or variate Scalar specifying that the first N units of the series are to be used, or a variate specifying the first and last units of the series to be used
Parameters
SERIES
= variates Specify the time series to be analysed
PERIODOGRAM
= variates Save periodograms of the time series
Description
PERIODTEST gives periodogram-based tests for departure from white noise in a set of time series. The series are supplied in a list of variates, using the SERIES parameter. The LENGTH option can specify that only part of each series is to be used, using either a scalar N to indicate that the first N values are to be used, or a variate of length two, holding the values of the first and last units of the required subseries. This may be used to eliminate missing values, which are otherwise not permitted.
The mean-adjusted periodogram is calculated for each series using FOURIER, and can be saved using the PERIODOGRAM parameter. The maximum periodogram ordinate test, Fisher's g-test and the Kolmogorov-Smirnov test on the cumulative periodogram are calculated using the standard formulae (Priestley 1981).
The output for each series consists of the value of the maximum periodogram ordinate (after scaling by the length of the analysed series), the frequency at which this maximum occurs (expressed as the unit number in the PERIODOGRAM variate, i.e. if the maximum occurs at ω = 2πj/N, then j is given), and the probability of exceeding this maximum; the ratio of the maximum to the total of the periodogram ordinates (Fisher's g), and the probability of exceeding this; and the Kolmogorov-Smirnov D statistic based on the maximum deviation of the cumulative periodogram from the line y=x.
Option: LENGTH. Parameters: SERIES, PERIODOGRAM.
Method
The series are mean-corrected, but not trend corrected, before transformation.
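To make the quantities concrete, the Python sketch below computes a mean-corrected periodogram at the frequencies 2πj/N and Fisher's g (the largest ordinate divided by the sum of the ordinates). It is an illustration under stated assumptions, not the PERIODTEST source: the ordinates are scaled by 1/N, only frequencies j = 1 ... (N-1)/2 are used, and a direct Fourier sum replaces the FOURIER directive. The function names are hypothetical.

```python
import cmath
import math

def periodogram(series):
    """Mean-corrected periodogram ordinates at frequencies 2*pi*k/n, k = 1..m."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]  # mean-corrected, but not trend-corrected
    m = (n - 1) // 2
    ordinates = []
    for k in range(1, m + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        ordinates.append(abs(s) ** 2 / n)  # 1/n scaling is an assumption
    return ordinates

def fisher_g(ordinates):
    """Return (g, k): ratio of the largest ordinate to the total, and its unit number."""
    peak = max(ordinates)
    return peak / sum(ordinates), ordinates.index(peak) + 1
```

Because g is a ratio, it does not depend on the scaling chosen for the ordinates.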
Action with RESTRICT
The SERIES may not be restricted; restriction of the input series to a contiguous set of units may be achieved by use of the LENGTH option.
Reference
Priestley, M.B. (1981) Spectral Analysis and Time Series. Academic Press, London.
PLS procedure
Fits a partial least squares regression model
(Ian Wakeling & Nick Bratchell)
Options
NROOTS
= scalar Number of PLS dimensions to be extracted
YSCALING
= string Whether to scale the Y variates to unit variance (yes, no); default no
XSCALING
= string Whether to scale the X variates to unit variance (yes, no); default no
NGROUPS
= scalar Number of cross-validation groups into which to divide the data; default 1 (i.e. no cross-validation performed)
SEED
= scalar or factor A scalar indicating the seed value to use when dividing the data randomly into NGROUPS groups for the cross-validation or a factor to indicate a specific set of groupings to use for the cross-validation; default takes the (scalar) value of NGROUPS
LABELS
= text Sample labels for X and Y that are to be used in the printed output; defaults to the integers 1...n where n is the length of the variates in X and Y
PLABELS
= text Sample labels for XPREDICT that are to be used in the printed output; default uses the integers 1, 2 ...
Parameters
Y
= pointers Pointer to variates containing the dependent variables
X
= pointers Pointer to variates containing the independent variables
YLOADINGS
= pointers Pointer to variates used to store the Y component loadings for each dimension extracted
XLOADINGS
= pointers Pointer to variates used to store the X component loadings for each dimension extracted
PLOADINGS
= pointers Pointer to variates used to store the loadings for the bilinear model for the X block
YSCORE
= pointers Pointer to variates used to store the Y component scores for each dimension extracted
XSCORE
= pointers Pointer to variates used to store the X component scores for each dimension extracted
B
= matrices A diagonal matrix containing the regression coefficients of YSCORE on XSCORE for each dimension
YPREDICT
= pointers A pointer to variates used to store predicted Y values for samples in the prediction set
XPREDICT
= pointers A pointer to variates containing data for the independent variables in the prediction set
ESTIMATES
= matrices An nX+1 by nY matrix (where nX and nY are the numbers of variates contained in X and Y respectively) used to store the PLS regression coefficients for a PLS model with NROOTS dimensions
FITTED
= pointers Pointer to variates used to store the fitted values for each Y variate
LEVERAGE
= variates Variate used to store the leverage that each sample has on the PLS model
PRESS
= variates Variate used to contain the Predictive Residual Error Sum of Squares for each dimension in the PLS model, available only if cross-validation has been selected
RSS
= variates Variate used to store the Residual Sum of Squares for each dimension extracted
YRESIDUAL
= pointers Pointer to variates used to store the residuals from the Y block after NROOTS dimensions have been extracted, uncorrected for any scaling applied using YSCALING
XRESIDUAL
= pointers Pointer to variates used to store the residuals from the X block after NROOTS dimensions have been extracted, uncorrected for any scaling applied using XSCALING
XPRESIDUAL
= pointers Pointer to variates used to store the residuals from the XPREDICT block after NROOTS dimensions have been extracted
Description
The regression method of Partial Least Squares (PLS) was initially developed as a calibration method for use with chemical data. It was designed principally for use with overdetermined data sets and to be more efficient computationally than competing methods such as principal components regression. If Y and X denote matrices of dependent and independent variables respectively, then the aim of PLS is to fit a bilinear model having the form T = XW, X = TP′ + E and Y = TQ′ + F, where W is a matrix of coefficients whose columns define the PLS factors as linear combinations of the independent variables. Successive PLS factors contained in the columns of T are selected both to minimise the residuals in E and simultaneously to have high squared covariance with a single Y variate (PLS1) or a linear combination of multiple Y variates (PLS2). The columns of T are constrained to be mutually orthogonal. See Helland (1988) or Hoskuldsson (1988) for a more comprehensive description of the PLS method.
The procedure allows the calculation of PLS1 and PLS2 models with cross-validation to assist in the determination of the correct number of dimensions to include in the model. By setting the NGROUPS option the data are randomly divided into a number of groups; samples in each group are then modelled from the remaining samples only. The sum of squares of differences between these "leave out predictions" and the observed values of Y is called PRESS. Many tests of significance for determining the correct number of dimensions are based on comparing values of PRESS for PLS models of varying rank. Values of PRESS are used in the procedure to perform Osten's (1988) test of significance and may also be plotted out in a scree diagram. In addition to the factor scores, factor loadings and residuals, the procedure also calculates a leverage measure (Martens & Naes 1989, page 276) and a single linear combination of the X variables (ESTIMATES) which summarises the entire PLS model.
The procedure will fail if there are missing values present in either the X or Y variates.
To use a PLS model to make predictions from new observations on the X variables, two methods are available. Either the user may do this manually by using the model as specified in the ESTIMATES matrix, or the new X data may be specified beforehand as the pointer to variates XPREDICT and the corresponding predictions obtained as YPREDICT.
Output from the PLS procedure can be selected using the following settings of the PRINT option.
data
the unscaled data values (with labels).
xloadings
X-component loadings (columns of the matrix W - see above).
yloadings
variable loadings for the bilinear model of the matrix of dependent variables. Note that these are standardized to unit length and are not the same as the columns of the matrix Q above. To obtain Q, form the matrix C, whose columns are the standardized loadings, and post-multiply by the diagonal matrix supplied as the output parameter B.
ploadings
variable loadings for the bilinear model of the matrix of independent variables (columns of the matrix P - see above).
scores
X and Y component scores. The X component scores are the columns of the matrix T and are mutually orthogonal. The Y component scores, usually given the symbol u, are not in fact needed in the calculation of the PLS model unless an iterative algorithm is used (see the Method section). They are provided here for completeness, as sometimes it is useful to plot the Y component scores against the X component scores to give a visual indication of the degree of fit for each PLS dimension.
leverage
measure of leverage.
xerrors
residual sum of squares and residual standard deviations for all the independent variables. When NGROUPS>1 additional statistics are calculated from the cross-validated residuals, derived when each object is left out. The PRESS value is equal to the sum of squares of cross-validated standard deviations for each X variable multiplied by N-1, where N is the total number of observations. The cross-validated standard deviations may therefore be used to measure the predictive ability of the model for each of the variables.
yerrors
residual sum of squares and residual standard deviations for all the dependent variables (see xerrors above).
scree
scree diagram of PRESS.
xpercent
percentage variance explained for the X variables.
ypercent
percentage variance explained for the Y variables.
predictions
predicted values for any observations that were not included in the PLS model but were supplied using the XPREDICT parameter.
groups
details of groupings used for cross-validation.
estimates
estimated PLS regression coefficients.
fittedvalues
fitted values from the PLS regressions.
The default settings are estimates, xpercent, ypercent, scores, xloadings, yloadings, ploadings.
The data for PLS are supplied using the X and Y parameters, as pointers to variates containing the columns of the X and Y matrices. Other parameters allow output to be saved in appropriate data structures.
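Options:
PRINT, NROOTS, YSCALING, XSCALING, NGROUPS, SEED, LABELS, PLABELS. Parameters: Y, X, YLOADINGS, XLOADINGS, PLOADINGS, YSCORE, XSCORE, B, YPREDICT, XPREDICT, ESTIMATES, FITTED, LEVERAGE, PRESS, RSS, YRESIDUAL, XRESIDUAL, XPRESIDUAL.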
Method
Although the PLS method is often presented in terms of an iterative algorithm (Manne 1987), the X block loadings vector for the first PLS dimension (w1) is simply the eigenvector of X′YY′X corresponding to its largest eigenvalue. To find the second and subsequent dimensions, X and Y are deflated by orthogonalising with respect to the current PLS factor (t = Xw) and the eigenanalysis repeated. The above approach was adopted by Rogers (1987) in an implementation of a Genstat 4 macro. Here we adopt a very similar approach by performing a singular value decomposition on the matrix X′Y which simultaneously obtains loading vectors for both data blocks (Hoskuldsson 1988, de Jong & ter Braak 1994).
It is usual to centre all variables prior to a PLS analysis; the procedure will automatically do so even if the XSCALING/YSCALING options are not set. On exit from the procedure the variates pointed to by X and Y are unchanged.
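The extract-and-deflate cycle can be illustrated for the simplest case of a single Y variate (PLS1), where X′Y is a vector and its leading left singular vector is just X′y scaled to unit length, so no SVD routine is needed. The Python below is a sketch under those assumptions and is not the code of the PLS procedure; X is supplied as a list of columns, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pls1_scores(x_columns, y, nroots):
    """Return the X-component scores t_1 ... t_nroots for a single Y variate."""
    xs = [[v - sum(c) / len(c) for v in c] for c in x_columns]  # centre each X column
    ybar = sum(y) / len(y)
    yc = [v - ybar for v in y]                                   # centre y
    scores = []
    for _ in range(nroots):
        # Loading vector w: X'y normalized to unit length.
        w = [dot(c, yc) for c in xs]
        norm = math.sqrt(dot(w, w))
        w = [wi / norm for wi in w]
        # Score vector t = Xw.
        t = [sum(w[k] * xs[k][i] for k in range(len(xs))) for i in range(len(yc))]
        tt = dot(t, t)
        # Deflate X and y by removing the part explained by t.
        xs = [[ci - ti * dot(c, t) / tt for ci, ti in zip(c, t)] for c in xs]
        q = dot(yc, t) / tt
        yc = [yi - q * ti for yi, ti in zip(yc, t)]
        scores.append(t)
    return scores
```

Because each new score is formed from the deflated X, successive score vectors come out mutually orthogonal, as the Description requires of the columns of T.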
Action with RESTRICT
The procedure will work with restricted variates, fitting a PLS model to the subset of objects indicated by the restriction. If there are different restrictions on different data variates then these restrictions will be combined and the analysis performed on the subset of samples that is common to all the restrictions. Note that the unrestricted length of all of the data variates must be the same and the number of samples in the common subset must be at least three. Any restrictions on a text supplied for the LABELS option or a factor for the SEED option will be ignored. On exit from the procedure all the data variates and, if supplied, the SEED factor and LABELS text, will be returned restricted to the common subset of samples. Output data structures that correspond to the samples (i.e. XSCORE, YSCORE, FITTED, LEVERAGE, YRESIDUAL and XRESIDUAL) will also be returned restricted to the common subset, and missing values will be used for those values that have been restricted out.
When restricted data are supplied and LABELS are also given then the appropriate subset of labels will appear in the output; if LABELS are not defined then default labels reflecting the position of the restricted data in the unrestricted variate will be used instead.
No restrictions are allowed in the variates supplied in the XPREDICT parameter or the PLABELS option.
References
Helland, I.S. (1988). On the structure of partial least squares regression. Commun. Statist. - Simula. Comput., 17, 581-607.
Hoskuldsson, A. (1988). PLS Regression Methods. J. Chemometrics, 2, 211-228.
de Jong & ter Braak (1994). Comments on the PLS kernel algorithm. J. Chemometrics, 8, 169-174.
Manne, R. (1987). Analysis of Two Partial Least Squares Algorithms for Multivariate Calibration. Chemometrics and Intell. Lab. Systems, 2, 187-197.
Martens, H. & Naes, T. (1989). Multivariate Calibration. John Wiley, Chichester.
Osten, D.W. (1988). Selection of Optimal Regression Models Via Cross-Validation. J. Chemometrics, 2, 39-48.
Rogers, C.A. (1987). A Genstat Macro for Partial Least Squares Analysis with Cross-Validation Assessment of Model Dimensionality. Genstat Newsletter, 18, 81-92.
PPAIR procedure
Displays results of t-tests for pairwise differences in compact diagrams
(P.W. Goedhart, H. van der Voet & D.C. van der Werf)
Options
PROBABILITY
= scalar or symmetric matrix Level of significance of pairwise comparison tests; default 0.05
Parameters
TPROBABILITIES
= symmetric matrices Probabilities of tests of pairwise comparisons
DIFFERENCES
= symmetric matrices, variates or tables What to print alongside the labels of TPROBABILITIES; default *
LABELS
= texts Text vector labelling the output; if unset the row labels of TPROBABILITIES and the diagonal of DIFFERENCES (if set) are used
Description
Procedures RPAIR and PAIRTEST produce a symmetric matrix of two-sided t-probabilities for tests of all pairwise differences of estimates. PPAIR displays this matrix at a specified level of significance in two compact schematic diagrams. This is especially useful when the number of estimates is large.
Input to PPAIR is a symmetric matrix TPROBABILITIES containing probabilities of the set of pairwise comparisons. The level of significance can be set by the PROBABILITY option. A common level is specified by a scalar, while a symmetric matrix specifies a level for each comparison separately (which may be useful for some multiple comparison methods). Output is labelled by the row labels of TPROBABILITIES. If parameter DIFFERENCES is set to a symmetric matrix the diagonal of this matrix is printed alongside these labels (with number of decimals as defined at declaration of DIFFERENCES). This is especially useful if DIFFERENCES is saved by RPAIR or PAIRTEST because it then contains the estimates on the diagonal. DIFFERENCES can also be set to a variate or table. Alternatively the output can be labelled by specifying parameter LABELS.
Options:
PRINT, PROBABILITY. Parameters: TPROBABILITIES, DIFFERENCES, LABELS.
Method
The construction of the diagram for PRINT=groups is as follows. First the difference between the first and last item of the complete set of n items is checked for significance. Then the first and last item of all subsets of n-1 consecutive items are checked, followed by all subsets of n-2 items, and so on. If non-significance is found between the first and last item of a subset, all items of the subset are said to form a homogeneous group and they receive the same letter. This is only sensible when the TPROBABILITIES are sorted according to the estimates. The diagram only consists of homogeneous groups which are not a part of a larger group.
Items within a homogeneous group can nevertheless be significantly different from one another. This is not displayed in the diagram, although a message is printed if this occurs. If there are no significant differences within homogeneous groups, both diagrams essentially contain the same information; PRINT=groups then gives a more concise representation.
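The search for homogeneous groups described above can be sketched in Python as follows. This is an illustration, not the PPAIR source. It assumes that the items are already sorted by their estimates, that a comparison counts as non-significant when its probability is at least the chosen level, and that an item belonging to no larger group forms a group on its own; the function names are hypothetical.

```python
def homogeneous_groups(tprob, level=0.05):
    """Return maximal homogeneous groups as (first, last) index pairs.

    tprob is a symmetric matrix (list of lists) of pairwise probabilities
    for items sorted by their estimates.
    """
    n = len(tprob)
    groups = []
    for size in range(n, 0, -1):              # largest subsets first
        for first in range(n - size + 1):
            last = first + size - 1
            nonsig = size == 1 or tprob[first][last] >= level
            # Skip subsets that are part of a group already found.
            inside = any(lo <= first and last <= hi for lo, hi in groups)
            if nonsig and not inside:
                groups.append((first, last))
    return sorted(groups)

def group_letters(groups, n):
    """Give every item the letters of the groups it belongs to."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = [""] * n
    for k, (first, last) in enumerate(groups):
        for i in range(first, last + 1):
            out[i] += alphabet[k]
    return out
```

Only the first and last items of each subset are compared, which is why the procedure warns when items inside a group nevertheless differ significantly.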
Action with RESTRICT
Restrictions on DIFFERENCES and LABELS are ignored.
PREWHITEN procedure
Filters a time series before spectral analysis
(A.W.A. Murray)
Option
PHI
= scalar Specifies the value of the parameter used in filtering; default 0.99
Parameters
SERIES
= variates Input series
FILTERED
= variates Output series
Description
PREWHITEN provides filtering of time series data prior to spectral analysis. Parameters SERIES and FILTERED specify the input and output series, respectively. The filtered series y is given by
$y_t = x_t - \theta x_{t-1}$
where x is the input series. (Thus $\theta = 1$ would give first differencing.) The value of $\theta$ is specified by the PHI option; the default value of $\theta = 0.99$ is often suitable. Alternatively, an empirical approach is to use the value
where L is the lag at which inspection suggests that the autocorrelation in the series becomes negligible.
To "recolour" the spectrum of the series after estimation, you can multiply by
1/((1 +
where f is the frequency at which the spectrum is estimated.
Option: PHI. Parameters: SERIES, FILTERED.
Method
The procedure uses the FILTER directive with two TSMs defined as follows:
TSM filter; ORDER=!(1,0,0); PARAM=!(1,0,0,PHI)
TSM arima; ORDER=!(0,1,0); PARAM=!(1,0,0)
FILTER SERIES; NEWSERIES=FILTERED; FILTER=filter; ARIMA=arima
The procedure is based on ideas from Granville Tunnicliffe Wilson, University of Lancaster.
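For readers who want to check the arithmetic outside Genstat, the Python sketch below applies a first-order filter of the kind described (each value minus the filter parameter times the previous value) and evaluates the reciprocal of that filter's power transfer function, which is the standard factor for undoing such a filter on a spectrum. It is not the code of PREWHITEN or FILTER. The assumptions are that the first value is simply dropped because it has no predecessor, and that frequency is measured in cycles per unit time; the function names are hypothetical, and `phi` stands for the value of the PHI option.

```python
import math

def prewhiten(series, phi=0.99):
    """Return x[t] - phi * x[t-1] for every t that has a predecessor."""
    return [series[t] - phi * series[t - 1] for t in range(1, len(series))]

def recolour_factor(freq, phi=0.99):
    """Reciprocal of the filter's power transfer function at freq (cycles per unit time)."""
    return 1.0 / (1.0 + phi * phi - 2.0 * phi * math.cos(2.0 * math.pi * freq))
```

With phi = 1 the filter reduces to first differencing, as noted in the Description.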
Action with RESTRICT
The behaviour is as for the FILTER directive.
PROBITANALYSIS procedure
Fits probit models allowing for natural mortality and immunity
(R.W. Payne)
Options
TRANSFORMATION
= string Transformation to be used (probit, logit, complementaryloglog); default prob
MORTALITY
= string Whether to estimate natural mortality (omit, estimate); default omit
IMMUNITY
= string Whether to estimate natural immunity (omit, estimate); default omit
GROUPS
= factor Defines groups for an analysis of parallelism; default * i.e. no groups
SEPARATE
= strings Which parameters (apart from intercept) should be estimated separately for different groups (slope, mortality, immunity); default * i.e. none
LD
= scalar or variate Effective (or lethal) doses to be estimated, other than 50
Parameters
Y
= variates Number of subjects responding in each batch
DOSE
= variates Dose received by each batch of subjects
NBINOMIAL
= variates Number of subjects in each batch
INITIAL
= variates Initial values for parameters
STEPLENGTHS
= variates Step lengths for parameters
Description
Probit analysis is a way of modelling the relationship between a stimulus, like a drug, and a quantal response (success/failure). It is assumed that for each subject, there is a certain level of dose of the stimulus below which it will be unaffected, but above which it will respond. This level of dose, known as its tolerance, will vary from subject to subject within the population.
For example, it is often assumed that the tolerance of houseflies to the logarithm of the dose of an insecticide will follow a Normal distribution; so, if we were to plot the proportion of the population with each tolerance against log dose, we would obtain the familiar bell-shaped curve. Likewise, if we plot the probability that a randomly-selected individual will respond, against the logarithm of dose, we would obtain a sigmoid (S-shaped) curve limited below by zero and above by one. To make the relationship linear, it is usual to transform the y-axis either to probits or to Normal equivalent deviates:
Probit(P%) = NED(p) + 5
where proportion p = P% / 100. The Normal equivalent deviate may be familiar as the transformation that is used to produce "probability" graph paper.
In probit analysis, we are interested in estimating the equation of that line. This can be done by performing an experiment in which there are several batches of subjects, each of which is given a different dose of the stimulus. The data then consist of a variate indicating the number of subjects that responded out of each batch, a variate to show the dose given to each batch, and a final variate for the total numbers of subjects in the batches; these are specified by parameters Y, DOSE and NBINOMIAL, respectively. The NBINOMIAL parameter can be omitted if the total numbers cannot be measured, as in some fumigation experiments ("Wadley's problem"; see for example Finney 1971, pages 202-8).
The PRINT option controls printed output:
model
details of the model that has been fitted,
summary
summary analysis-of-variance table,
estimates
parameter estimates and standard errors,
correlations
correlations between parameter estimates, and
fittedvalues
fitted values and residuals.
By default, PRINT=mode,summ,esti,fitt.
The TRANSFORMATION option allows other transformations to be selected. Putting TRANSFORMATION=logit requests a logit transformation:
logit(P%) = log( P% / (100 - P%) )
This is very like the probit but approaches zero (to the left) and one (to the right) rather more slowly. The other possibility is the complementary log-log, log( -log( (100 - P%) / 100 ) ), which is relevant to the "one-hit" model (that is, infection processes where just one infected particle is sufficient to cause the response).
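The three transformations can be compared numerically with a few lines of Python. This is a sketch for illustration, not part of PROBITANALYSIS: the function names are hypothetical, percentages must lie strictly between 0 and 100, and the complementary log-log is taken in its usual form, the log of minus the log of the proportion not responding.

```python
import math
from statistics import NormalDist

def probit(pct):
    """Probit(P%) = NED(p) + 5, where p = P% / 100."""
    return NormalDist().inv_cdf(pct / 100.0) + 5.0

def logit(pct):
    """logit(P%) = log(P% / (100 - P%))."""
    return math.log(pct / (100.0 - pct))

def cloglog(pct):
    """Complementary log-log: log(-log(1 - p)), where p = P% / 100."""
    return math.log(-math.log(1.0 - pct / 100.0))
```

At 50% response the probit is 5 and the logit is 0, while the complementary log-log is not symmetric about 50%.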
Sometimes, subjects may respond even in the absence of any dose. For example, with some short-lived insects, some would have died simply from natural causes during the period of the experiment. By setting option MORTALITY=estimate this natural mortality can be included in the model and estimated. Similarly, there may be subjects that will not respond, no matter how high the dose. Setting option IMMUNITY=estimate will include and estimate a parameter for natural immunity.
It is also often of interest to study the way in which the model varies for different groups of subjects. For example, there may be groups of batches of subjects, each of which is given a different drug. The GROUPS option should then specify the group to which each batch of subjects belongs, and option SEPARATE indicates which parameters of the model (slope, mortality, and/or immunity) should have separate estimates. If SEPARATE is left at its default value, parallel lines will be fitted with identical values for any estimates of mortality and immunity.
The final option, LD, can request the estimation of one or more lethal (or effective) doses, specifying a scalar if there is just one, or a variate if there are several. The LD50 value (that is, the dose at which 50% of the population would respond) is always printed as one of the parameters of the fitted line.
The model is fitted using the FITNONLINEAR directive, and the final two parameters, INITIAL and STEPLENGTHS, allow initial values and steplengths to be specified for the optimization. The order of parameters is: LD50(s), slope(s), mortality parameters (if any), and immunity parameters (if any). Parameter estimates, fitted values, residuals, and so on, can be saved after running the procedure, by using the RKEEP directive in the usual way.
Options: PRINT, TRANSFORMATION, MORTALITY, IMMUNITY, GROUPS, SEPARATE, LD.
Parameters: Y, DOSE, NBINOMIAL, INITIAL, STEPLENGTHS.
Method
Initial values are obtained, if necessary, using the Genstat facilities for generalized linear models, ignoring any mortality or immunity. Expressions specifying the model are defined in sets of nested IF-blocks, taking account of the settings, for example, of TRANSFORMATION and GROUPS. The fitting is carried out by the FITNONLINEAR directive, and any extra LD values are estimated using RFUNCTION.
Probit analysis can also be performed using the Genstat facilities for generalized linear models, but these do not cover Wadley's problem, nor allow for natural mortality and immunity. The results obtained from this procedure may differ slightly from those that are obtained (where possible) from the use of GLMs (by the FIT directive) or using the WADLEY procedure, due to the use here of maximum likelihood for the fitting, rather than iterative weighted linear regression.
Action with RESTRICT
The Y variate, the DOSE variate, or the GROUPS factor can be restricted to indicate that the model is to be fitted only to a subset of the units.
Reference
Finney, D.J. (1971), Probit Analysis, 3rd Edition, Cambridge University Press.
PTBOX procedure
Generates a bounding or surrounding box for a spatial point pattern
(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)
Options
METHOD
= string Type of box to form (bounding, surrounding); default boun
Parameters
Y
= variates Vertical coordinates of each spatial point pattern; no default - this parameter must be set
X
= variates Horizontal coordinates of each spatial point pattern; no default - this parameter must be set
YBOX
= variates Variates to receive the vertical coordinates of the bounding or surrounding boxes
XBOX
= variates Variates to receive the horizontal coordinates of the bounding or surrounding boxes
YFRACTION
= scalars How much to extend the extremes of the vertical coordinates of each surrounding box as a fraction of the range of the vertical coordinates; default 0.1
XFRACTION
= scalars How much to extend the extremes of the horizontal coordinates of each surrounding box as a fraction of the range of the horizontal coordinates; default 0.1
Description
This procedure takes as input two variates containing the coordinates of a spatial point pattern (specified by the X and Y parameters) and returns the coordinates of either a bounding or a surrounding box, according to the setting of the METHOD option. The default, METHOD=bounding, provides a bounding box, defined as the smallest rectangle such that all the events in the spatial point pattern lie inside the box or on its boundary. The coordinates of the bounding box are the coordinates of its four corners, in the order lower left, lower right, upper right, and upper left. The surrounding box (METHOD=surrounding) is a rectangle which contains all the points. It is obtained by moving each edge of the bounding box outwards by a specified fraction of the range of values in X (for the left and right edges) or in Y (for the lower and upper edges). The parameters XFRACTION and YFRACTION can be used to specify the proportional extension required in each direction; the default value of both parameters is 0.1. The coordinates of the surrounding box are the coordinates of its four corners, in the same order as for the bounding box. The coordinates of the bounding or surrounding box can be saved using the parameters XBOX and YBOX.
Printed output is controlled by the PRINT option. The default setting of summary prints the coordinates of the bounding or surrounding box under the headings XBOX and YBOX.
Options: PRINT, METHOD.
Parameters: Y, X, YBOX, XBOX, YFRACTION, XFRACTION.
Method
A procedure PTCHECKXY is called to check that X and Y have identical restrictions. The minimum, maximum and range of the horizontal (X) and vertical (Y) coordinates are then calculated. For a bounding box, the coordinates are calculated as (min(X), min(Y)), (max(X), min(Y)), (max(X), max(Y)), (min(X), max(Y)). For a surrounding box the coordinates are
(min(X) - XFRACTION × range(X), min(Y) - YFRACTION × range(Y)),
(max(X) + XFRACTION × range(X), min(Y) - YFRACTION × range(Y)),
(max(X) + XFRACTION × range(X), max(Y) + YFRACTION × range(Y)),
(min(X) - XFRACTION × range(X), max(Y) + YFRACTION × range(Y)).
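The corner calculation described above can be sketched in Python for illustration (this is not the Genstat implementation). The function name pt_box and its argument names are hypothetical; the corner order and the default fractions of 0.1 follow the description.

```python
def pt_box(x, y, method="bounding", xfraction=0.1, yfraction=0.1):
    """Return (xbox, ybox), the corner coordinates of the box in the order
    lower left, lower right, upper right, upper left."""
    xmin, xmax = min(x), max(x)
    ymin, ymax = min(y), max(y)
    if method == "surrounding":
        # Move each edge outwards by a fraction of the coordinate range.
        dx = xfraction * (xmax - xmin)
        dy = yfraction * (ymax - ymin)
        xmin, xmax = xmin - dx, xmax + dx
        ymin, ymax = ymin - dy, ymax + dy
    elif method != "bounding":
        raise ValueError("method must be 'bounding' or 'surrounding'")
    xbox = [xmin, xmax, xmax, xmin]
    ybox = [ymin, ymin, ymax, ymax]
    return xbox, ybox


# Example: three points, with both kinds of box.
print(pt_box([1, 3, 2], [0, 10, 5]))
print(pt_box([1, 3, 2], [0, 10, 5], method="surrounding"))
```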
Action with RESTRICT
If X and Y are restricted, only the subset of values specified by the restriction will be included in the calculations.
PTCLOSEPOLYGON procedure
Closes open polygons
(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)
Option
Parameters
OLDYPOLYGON
= variates Vertical coordinates of each polygon; no default - this parameter must be set
OLDXPOLYGON
= variates Horizontal coordinates of each polygon; no default - this parameter must be set
NEWYPOLYGON
= variates Vertical coordinates of the closed polygons
NEWXPOLYGON
= variates Horizontal coordinates of the closed polygons
Description
A polygonal region of two-dimensional space is represented in Genstat by the coordinates of a sequence of points which define the boundary of the polygon, with the last point implicitly connected to the first point. If the first and last pairs of coordinates are the same then the polygon is said to be closed; otherwise it is open. Sometimes it is necessary to work with a closed polygon, for example, when drawing a polygon onto a graphics device as a series of line segments. This procedure takes as input a set of coordinates which define a polygon. The parameters OLDXPOLYGON and OLDYPOLYGON specify variates containing the coordinates. The output of the procedure is a closed polygon, which is identical to the input polygon if it is already closed and otherwise consists of the input polygon with the first pair of coordinates repeated at the end. The coordinates of the closed polygon may be saved using the parameters NEWXPOLYGON and NEWYPOLYGON.
Printed output is controlled by the PRINT option. The default setting of summary prints the horizontal and vertical coordinates of the closed polygon under the headings NEWXPOLYGON and NEWYPOLYGON.
Option: PRINT.
Parameters: OLDYPOLYGON, OLDXPOLYGON, NEWYPOLYGON, NEWXPOLYGON.
Method
A procedure PTCHECKXY is called to check that OLDXPOLYGON and OLDYPOLYGON have identical restrictions. It then checks whether the first and last pairs of coordinates are the same. If they are, the DUPLICATE directive is used to copy OLDXPOLYGON to NEWXPOLYGON and OLDYPOLYGON to NEWYPOLYGON. If they are different, NEWXPOLYGON and NEWYPOLYGON are declared as variates with one more value than their old counterparts, and the EQUATE directive is used to copy the values from OLDXPOLYGON to NEWXPOLYGON and OLDYPOLYGON to NEWYPOLYGON (so that the first element of each old variate is repeated at the end of the corresponding new one).
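For illustration, the closing rule can be sketched in Python (not Genstat code). The function name close_polygon is hypothetical, and plain lists stand in for variates.

```python
def close_polygon(old_x, old_y):
    """Return copies of the coordinates with the first point repeated at the
    end, unless the first and last points already coincide."""
    if len(old_x) != len(old_y):
        raise ValueError("coordinate lists must have the same length")
    new_x, new_y = list(old_x), list(old_y)
    if new_x and (new_x[0] != new_x[-1] or new_y[0] != new_y[-1]):
        # Open polygon: repeat the first pair of coordinates at the end.
        new_x.append(new_x[0])
        new_y.append(new_y[0])
    return new_x, new_y


# An open triangle gains a fourth point; a closed one is returned unchanged.
print(close_polygon([0, 1, 1], [0, 0, 1]))
print(close_polygon([0, 1, 1, 0], [0, 0, 1, 0]))
```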
Action with RESTRICT
If OLDXPOLYGON and OLDYPOLYGON are restricted, only the subset of values specified by the restriction will be included in the calculations.
PTDESCRIBE procedure
Gives summary and second order statistics for a point process
(R.P. Littlejohn & R.C. Butler)
Options
SELECTION
= strings What to print (interval, trend, poisson, icorrelation, ispectrum, cspectrum, cintensity, vtcurve, all); default inte
REPRESENTATION
= string How the point process is represented in the DATA variate (time, interval, zeroone); default time
GRAPHICS
= string Style of graphical output, or GRAPHICS=* to avoid any graphs (lineprinter, highresolution); default high
Parameters
DATA
= variates Variate containing point process to be analysed
START
= scalars Initial time (if REPRESENTATION=time); default 0
LENGTH
= scalars Length of time over which process is observed; default takes the time of the last event
CITAU
= scalars Window width for calculating count intensity; default 0.5 × mean interval length
VTTAU
= scalars Window width for calculating variance-time curve; default 0.5 × mean interval length
SAVE
= pointers Pointer to save calculated values
Description
A point process, or series of events, is characterized both by the times at which events occur, and the intervals between events. The Poisson process is the most basic point process, with Poisson counts in any interval, and independent exponentially distributed intervals between events.
A comprehensive account of methods for analysing point processes is given by Cox & Lewis (1966). PTDESCRIBE implements many of the test and summary statistics they give and should be used in conjunction with the text for a full discussion of the motivation and context of their use. All equations referred to below are from Cox & Lewis (1966).
The DATA variate may contain either the times at which events occur, the intervals between events, or a sequence of 0's and 1's, with 1's indicating the times of events on an integer time scale. The option REPRESENTATION specifies which of these is used. If REPRESENTATION=time and the process is measured from some time other than zero, the initial time should be given in the parameter START. Otherwise the START time is assumed to be zero. The first interval is taken to lie between the START time and the first event. If the process is observed beyond the last event, the total duration of the process should be given in the parameter LENGTH. Checks are carried out on START, LENGTH and the length of each interval, and the procedure terminates if these are inconsistent. If REPRESENTATION=time, the DATA variate may be restricted, facilitating the analysis of truncated or thinned point processes.
If SAVE is set, time and interval are saved, together with summary interval or second order statistics specified by SELECTION as detailed below. SAVE sets up a pointer, with each element labeled by the name of the relevant statistics saved. For example, if SAVE=clstats, then the intervals between the events will be saved in clstats['interval'].
The option SELECTION can be used to obtain any combination of eight available analyses, with the PRINT and GRAPHICS options controlling the output. The default setting is SELECTION=interval, while SELECTION=all gives all eight analyses. In what follows, the number of events is denoted by N and the variate carrying the times of events by time. The rate of a point process is calculated as the reciprocal of the average interval length.
interval - plots data and summarises the interval distribution
print: summary statistics for the interval process.
graph: times of events; histogram of the intervals between events; histogram of the intervals with bins appropriate for the exponential distribution.
save:
trend - tests for trend in the process
print: an N(0,1) test statistic (Ch 3.3 (11)), which is optimal against certain specifications of trend; Bartlett's test for the homogeneity of variance of groups of 3, 8 and 20 contiguous intervals.
poisson - tests whether the point process is Poisson
print: Kolmogorov-Smirnov tests for the empirical distribution function of times of events (Ch 6.2 (27-29, 38)) and for Durbin's order statistic transformation of the intervals (Ch 6.2 (43)); Moran's test against a gamma renewal process for the empirical distribution function (Ch 6.2 (43)); N(0,1) test for trend (see
graph: log survivor function of the interval distribution, compared to the Poisson case (a straight line through the origin with slope = -rate); plots of the empirical distribution function of times of events and Durbin's order statistics with Kolmogorov-Smirnov bounds.
icorrelation - autocorrelations for the interval sequence
print: the first (N/2-1) end-adjusted autocorrelations (Ch 5.2 (17, 18)) for the interval sequence and their standardization; the end-adjustments are derived using the autocorrelations from
graph: plot of the autocorrelations of the interval sequence and 95% confidence bounds.
save: order the order of the autocorrelations, icorrelation the autocorrelations of the interval sequence.
ispectrum - periodogram for the interval process
print: the periodogram for the interval process (Ch 5.3 (6, 8)) obtained from
graph: the periodogram and Poisson level (P/2) plotted against frequency; plot of the scaled cumulative periodogram with Kolmogorov-Smirnov bounds.
save: ifrequency frequencies at which periodogram is calculated, ispectrum interval periodogram.
cspectrum - periodogram for the count process
print: periodogram for the count process (Ch 5.5 (16)) calculated at frequencies 2
graph: count periodogram and Poisson level (=2) graphed against frequency.
save: cfrequency frequencies at which periodogram is calculated, cspectrum count periodogram.
cintensity - intensity function for the counting process
print: intensity function for the counting process (Ch 5.4(v) (20)) calculated for times
graph: intensity function with asymptotic 95% confidence intervals for the Poisson level, the intensity for which = rate, plotted against time.
save: citime times for which intensity is calculated, cintensity intensity function.
vtcurve - variance-time curve V(t) and index of dispersion I(t)
print: V(t) scaled by 1-time/
graph: V(t) and I(t) against time.
save: vtime times at which V(t) and I(t) are calculated, vtcurve V(t), dispersion I(t).
Options: PRINT, SELECTION, REPRESENTATION, GRAPHICS.
Parameters: DATA, START, LENGTH, CITAU, VTTAU, SAVE.
Method
The procedure tests whether a point process is a Poisson process and calculates summary statistics in the time and frequency domains for a point process, following Cox & Lewis (1966). Most statistics are obtained using CALCULATE, with FOURIER being used for ispectrum and CORRELATE for the pre-adjusted autocorrelations.
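As a rough illustration of the three data representations and of the rate calculation described for this procedure, the following Python sketch (not Genstat code) converts each representation to intervals and takes the reciprocal of the mean interval. The function names are hypothetical. Treating unit i of a zero-one sequence as integer time i, measured from time 0, is an assumption made for the example.

```python
def to_intervals(data, representation="time", start=0.0):
    """Convert a point process to the intervals between successive events."""
    if representation == "interval":
        return list(data)
    if representation == "time":
        times, origin = list(data), start
    elif representation == "zeroone":
        # Assumption: unit i (counting from 1) corresponds to integer time i.
        times = [i for i, v in enumerate(data, start=1) if v == 1]
        origin = 0.0
    else:
        raise ValueError("unknown representation")
    intervals = []
    previous = origin
    for t in times:
        if t < previous:
            raise ValueError("event times are inconsistent with the start time")
        # The first interval lies between the start time and the first event.
        intervals.append(t - previous)
        previous = t
    return intervals


def rate(intervals):
    """Rate of the process: the reciprocal of the average interval length."""
    return len(intervals) / sum(intervals)


intervals = to_intervals([2, 5, 9], representation="time", start=1)
print(intervals, rate(intervals))
```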
Action with RESTRICT
DATA may be restricted only if REPRESENTATION=time, in which case only the units not excluded by the restriction are involved in the analysis.
Reference
Cox, D.R. & Lewis, P.A.W. (1966). The Statistical Analysis of Series of Events. Methuen, London.
PTREMOVE procedure
Removes points interactively from a spatial point pattern
(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)
Options
WINDOW
= scalar Which graphics window to use for the plot; default 1
Parameters
OLDY
= variates Vertical coordinates of each spatial point pattern; no default - this parameter must be set
OLDX
= variates Horizontal coordinates of each spatial point pattern; no default - this parameter must be set
NEWY
= variates Variates to receive the vertical coordinates of the original points minus the deleted points of each pattern
NEWX
= variates Variates to receive the horizontal coordinates of the original points minus the deleted points of each pattern
Description
PTREMOVE uses the DREAD directive to delete points from a spatial point pattern. The coordinates of the existing points must be supplied using the parameters OLDX and OLDY. These points will be plotted on the current graphics device using DPTMAP with a pen setting of SYMBOLS=1. The WINDOW option may be used to specify the graphics window to use for the plot.
The operation of DREAD may vary slightly from one system to another. The Users' Note supplied with Genstat explains how to read points and terminate input on specific devices. The usual method for reading points is to click the left mouse button at the required position. The usual way to terminate input is to click the right mouse button. The points read using DREAD will be echoed using a pen setting of SYMBOLS=2. The coordinates of the new spatial point pattern, containing the original points minus any points which have been deleted, may be saved using the parameters NEWX and NEWY.
Printed output is controlled using the PRINT option. The settings available are monitoring (which prints the coordinates of the points to be deleted) and summary (which prints, under the headings NEWX and NEWY, the coordinates of the new pattern consisting of the original points minus any that have been deleted). The default is to print both monitoring and summary.
Options: PRINT, WINDOW.
Parameters: OLDY, OLDX, NEWY, NEWX.
Method
A procedure PTCHECKXY is called to check that OLDX and OLDY have identical restrictions. DPTMAP is then used to draw a map of the original point pattern. The DREAD directive is used to read the coordinates of points to be deleted. Finally, the coordinates for the deleted points are removed from the original points using the SUBSET procedure and the coordinates of the undeleted points are stored in new variates.
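The final removal step can be pictured with a small Python sketch (not Genstat code). The text does not say how a position read from the screen is matched to an existing point, so the nearest-point rule used here is purely an assumption, and the function name remove_points is hypothetical.

```python
def remove_points(old_x, old_y, click_x, click_y):
    """Return the coordinates left after deleting, for each clicked position,
    the nearest remaining point (assumed matching rule)."""
    keep = [True] * len(old_x)
    for cx, cy in zip(click_x, click_y):
        candidates = [i for i, kept in enumerate(keep) if kept]
        if not candidates:
            break
        # Squared Euclidean distance is enough to find the nearest point.
        nearest = min(
            candidates,
            key=lambda i: (old_x[i] - cx) ** 2 + (old_y[i] - cy) ** 2,
        )
        keep[nearest] = False
    new_x = [x for x, kept in zip(old_x, keep) if kept]
    new_y = [y for y, kept in zip(old_y, keep) if kept]
    return new_x, new_y


# A click near (1, 1) removes the middle point of the pattern.
print(remove_points([0, 1, 2], [0, 1, 2], [0.9], [1.1]))
```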
Action with RESTRICT
If OLDX and OLDY are restricted, only the subset of values specified by the restriction will be included in the calculations.
QUANTILE procedure
Calculates quantiles of the values in a variate
(P.W. Lane)
Options
PROPORTION
= variate or scalar Proportions at which to calculate quantiles; default !(0,0.25,0.5,0.75,1)
Parameters
DATA
= variates Values whose quantiles are required; this parameter must be specified
QUANTILE
= variates or scalars Identifiers of structures to store results, if required
Description
Quantiles are statistics that characterize the distribution of a sample of numbers. A quantile q of a sample {x_i, i = 1...n} can be formed for any proportion p in the range [0,1], and has the following properties:
1) at least the proportion p of {x_i} are less than or equal to q;
2) at least the proportion (1-p) of {x_i} are greater than or equal to q;
3) if, with the values sorted into ascending order, both q = x_i and q = x_(i+1) satisfy 1) and 2), then take q = (x_i + x_(i+1))/2.
Thus the quantile for proportion 0.5 is the median; for 0.0 it is the minimum and for 1.0 the maximum of the sample. By default, QUANTILE produces the five quantiles called the "five-number summary" of a sample, corresponding to the proportions 0.0, 0.25, 0.5, 0.75, 1.0. The option PROPORTION can be set to a scalar or variate to request other single quantiles or sets of quantiles. By default, QUANTILE prints the statistics, but this can be suppressed by setting option PRINT=*. The quantiles can be stored in a variate using the parameter QUANTILE.
Options: PRINT, PROPORTION.
Parameters: DATA, QUANTILE.
Method
First, the values are sorted into ascending order. Then for each proportion, the two values that are candidates for the quantile are found, by counting from either end of the sorted list to leave the required number of values from that point in the list to the end. The quantiles are the means of the two values found.
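One possible reading of this method is sketched below in Python (not Genstat code). The function name quantiles is hypothetical, and the small tolerance used to decide whether n × p is a whole number is an assumption added to guard against floating-point rounding.

```python
import math


def quantiles(data, proportions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Quantiles taken as the mean of two candidate order statistics."""
    s = sorted(data)
    n = len(s)
    if n == 0:
        raise ValueError("data must not be empty")
    result = []
    for p in proportions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
        np_ = n * p
        nearest = round(np_)
        if abs(np_ - nearest) < 1e-9:  # assumed tolerance for rounding error
            np_ = nearest
        # Smallest position with at least proportion p of values at or below it.
        k = max(1, math.ceil(np_))
        # Largest position with at least proportion 1-p of values at or above it.
        j = min(n, math.floor(np_) + 1)
        result.append((s[k - 1] + s[j - 1]) / 2)
    return result


# Five-number summary of a small sample.
print(quantiles([7, 1, 3, 5]))
```

When n × p is not a whole number the two candidates coincide, so the quantile is a single data value; when it is a whole number they are adjacent sorted values and the quantile is their mean.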
Action with RESTRICT
If the DATA variate is restricted, the quantiles are formed only using the units that are not restricted out. The PROPORTION and QUANTILE variates must not be restricted.