MANNWHITNEY procedure 

Performs a Mann-Whitney U test

(S.J. Welham, N.M. Maclaren & H.R. Simpson)

 

Options

PRINT = strings Output required (test, ranks): test produces the relevant test statistics, ranks produces the ranks (with respect to the whole data set) for each variate; default test

GROUPS = factor Defines the samples for a two-sample test if only the Y1 parameter is specified

 

Parameters

Y1 = variates Identifier of the variate holding the first sample if Y2 is set, or both samples if Y2 is unset (the GROUPS option must then also be set)

Y2 = variates Identifier of the variate holding the second sample

R1 = variates Saves the ranks of the first sample if Y2 is set, or both samples if Y2 is unset

R2 = variates Saves the ranks of the second sample if Y2 is set

STATISTIC = scalars Scalar to save the test statistic U

NORMAL = scalars Scalar to save the normal approximation to the test statistic

SIGN = scalars Scalar to save an indicator: 1 if the first sample scores the highest ranks on average, 0 otherwise

 

Description

The Mann-Whitney U test is a test for differences in location between two samples. The data for the samples can either be stored in two separate variates, specified by the parameters Y1 and Y2, or in a single variate, supplied by Y1, with the GROUPS option set to a factor to identify which unit belongs to each sample. The GROUPS option is ignored when the Y2 parameter is set. MANNWHITNEY calculates the test statistic U, along with its Normal approximation if both sample sizes are larger than 5. These statistics can be saved using the STATISTIC and NORMAL parameters respectively, and are displayed by the test setting of the PRINT option. The ranks setting of PRINT produces vectors of ranks (with respect to the whole data set) for each sample. The SIGN parameter saves an indicator which takes the value 1 if the ranks in the first sample are higher on average than those in the second sample, and the value 0 otherwise. The ranks (with respect to the combined data set) for each sample can be saved using the R1 and R2 parameters.

 

Options: PRINT, GROUPS. Parameters: Y1, Y2, R1, R2, STATISTIC, NORMAL, SIGN.

 

Method

The Mann-Whitney (or Wilcoxon) U-test is a two-sample test of location difference: i.e. a test of the null hypothesis that the two samples arise from distributions with the same mean vs. the alternative that the distribution means differ.

The test statistic U is formed using ranks found from the combined data set, and is taken to be the smaller of U1 and U2, where

$U_k = n_1 n_2 + n_k (n_k + 1) / 2 - R_k, \quad k = 1, 2$

and nk is the size of sample k, Rk is the sum of ranks for sample k. This score Uk can be interpreted as the number of times a rank score in the other sample precedes a score in sample k in the ranking. So the sample with the lowest score has, on average, smaller rank scores.

The normal approximation to this statistic is

$\mathrm{Normal} = (n_1 n_2 / 2 - U) \, / \, \sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}$

and is valid when both sample sizes are at least 5. If ties are present, then the standard error of the normal approximation (i.e. the denominator) must be calculated by:

$\sqrt{ \frac{n_1 n_2}{N (N - 1)} \left( \frac{N^3 - N}{12} - \sum_k T_k \right) }$

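where $N = n_1 + n_2$ is the combined sample size, $T_k = (t_k^3 - t_k)/12$ and $t_k$ is the number of observations with rank k. (See for example Siegel 1956, pages 116-127.) When the samples are too small for the normal approximation, MANNWHITNEY looks up the probability from a stored table.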
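The calculation described above can be sketched in a few lines of Python. This is only an illustration of the formulae, not the code used inside MANNWHITNEY; the function names `average_ranks` and `mann_whitney` are hypothetical, ranks are assumed to run from 1 for the smallest value, and tied values are assumed to receive their average rank. The tie-corrected standard error is used throughout, since it reduces to the untied form when no ties are present.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney(sample1, sample2):
    """Return U, its normal approximation and the SIGN indicator."""
    n1, n2 = len(sample1), len(sample2)
    total = n1 + n2
    ranks = average_ranks(list(sample1) + list(sample2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    u = min(u1, u2)
    # Sum of T_k = (t_k^3 - t_k) / 12 over groups of tied ranks
    # (a group of size 1 contributes zero).
    tie_term = sum((t ** 3 - t) / 12 for t in Counter(ranks).values())
    se = math.sqrt(n1 * n2 / (total * (total - 1))
                   * ((total ** 3 - total) / 12 - tie_term))
    normal = (n1 * n2 / 2 - u) / se
    sign = 1 if r1 / n1 > r2 / n2 else 0
    return {"U": u, "U1": u1, "U2": u2, "normal": normal, "sign": sign}


if __name__ == "__main__":
    print(mann_whitney([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]))
```

Note that U1 + U2 always equals n1 × n2, which gives a quick check on the rank sums.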

 

Action with RESTRICT

The variates supplied by Y1 and Y2 can be restricted, and in different ways. MANNWHITNEY uses only those units of each variate that are not excluded by their respective restrictions.

 

Reference

Siegel, S. (1956). Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.

 

MANOVA procedure

Performs multivariate analysis of variance and covariance

(R.W. Payne & G.M. Arnold)

 

Options

PRINT = strings Printed output required from the multivariate analysis of covariance (ssp, tests); default test

APRINT = strings Printed output from the univariate analyses of variance of the y-variates (as for the ANOVA PRINT option); default *

UPRINT = strings Printed output from the univariate unadjusted analyses of variance of the y-variates (as for the ANOVA UPRINT option); default *

CPRINT = strings Printed output from the univariate analyses of variance of the covariates (as for the ANOVA CPRINT option); default *

TREATMENTSTRUCTURE = formula Treatment formula for the analysis; if this is not set, the default is taken from the setting (which must already have been defined) by the TREATMENTSTRUCTURE directive

BLOCKSTRUCTURE = formula Block formula for the analysis; if this is not set, the default is taken from any existing setting specified by the BLOCKSTRUCTURE directive and if neither has been set the design is assumed to be unstratified (i.e. to have a single error term)

COVARIATES = pointer Covariates for the analysis; by default MANOVA uses those listed by a previous COVARIATE directive (if any)

FACTORIAL = scalar Limit on the number of factors in a treatment term

LRV = pointer Contains elements first for the treatment terms and then the covariate term (if any), allowing the LRV's to be saved from one of the analyses; if a term is estimated in more than one stratum, the LRV is taken from the lowest stratum in which it is estimated

 

Parameter

Y = variates Y-variates for an analysis

 

Description

Procedure MANOVA performs multivariate analysis of variance or covariance. The data variates are specified by the Y parameter.

The model for the design is specified by options of the procedure. TREATMENTSTRUCTURE specifies a model formula to define the treatment terms in the analysis; if this is unset, MANOVA will use the model already defined by the TREATMENTSTRUCTURE directive, or will fail if that too has not been set. BLOCKSTRUCTURE defines the underlying structure of the design, and MANOVA will use the model (if any) previously defined by the BLOCKSTRUCTURE directive if this is not set; these can both be omitted if there is only one error term (i.e. if the design is unstratified). The COVARIATES option specifies any covariates; by default MANOVA will take those already listed (if any) by the COVARIATE directive. The FACTORIAL option can be used to set a limit on the number of factors in the terms generated from the treatment formula.

The LRV option allows a pointer to be saved containing an LRV structure for each treatment term. When covariates have been specified, the pointer will also contain a final LRV structure for the covariate term. If a term is estimated in more than one stratum, the LRV is taken from the stratum that occurs last in the BLOCKTERMS pointer. The structures in the LRV hold the canonical variate loadings, roots and trace for the respective term.

The other options control printed output. PRINT indicates the output required from the multivariate analysis of covariance, with settings ssp to print the sums of squares and products matrices, and tests to print the various test statistics (Wilks' Lambda, with chi-square and F approximations, the Pillai-Bartlett trace, Roy's maximum root test and the Lawley-Hotelling trace). APRINT, UPRINT and CPRINT control output from the univariate analyses of each of the y-variates, corresponding to ANOVA options PRINT, UPRINT and CPRINT, respectively.

 

Options: PRINT, APRINT, UPRINT, CPRINT, TREATMENTSTRUCTURE, BLOCKSTRUCTURE, COVARIATES, FACTORIAL, LRV.

Parameter: Y.

 

Method

The relevant theory, with formulae and references for the test statistics, can be found in Chatfield & Collins (1986, Chapter 9). The procedure analyses the data variates by ANOVA first as y-variates, and then as covariates in order to obtain the SSP matrices. The SSP matrices are then adjusted for the covariates, using matrix manipulation in CALCULATE, and LRV decompositions are done, before the test statistics are calculated (again using CALCULATE).
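As an illustration of the final step, the four test statistics can be written in terms of the latent roots for a term. The sketch below is in Python rather than Genstat and is not taken from the procedure itself; the function name `manova_statistics` is hypothetical, and it assumes that the roots supplied are the non-zero latent roots of the hypothesis SSP matrix relative to the residual SSP matrix. Roy's statistic is returned here as the largest root itself; some texts quote it instead as root/(1 + root).

```python
import math


def manova_statistics(roots):
    """Multivariate test statistics from the latent roots of a term.

    `roots` is assumed to hold the non-zero latent roots of H E^-1,
    where H and E are the hypothesis and residual SSP matrices.
    """
    wilks = math.prod(1.0 / (1.0 + r) for r in roots)
    pillai = sum(r / (1.0 + r) for r in roots)
    lawley_hotelling = sum(roots)
    roy = max(roots)
    return {
        "wilks_lambda": wilks,
        "pillai_bartlett_trace": pillai,
        "lawley_hotelling_trace": lawley_hotelling,
        "roy_maximum_root": roy,
    }


if __name__ == "__main__":
    print(manova_statistics([1.0, 0.25]))
```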

 

Action with RESTRICT

If any of the y-variates is restricted, the analysis will involve only the units not excluded by the restriction.

 

Reference

Chatfield, C. & Collins, A.J. (1986). Introduction to Multivariate Analysis (revised edition). Chapman and Hall, London.

 

MENU procedure

Initiates a menu system

(P.W. Lane)

 

Options

FILENAME = text A single string giving the filename of the base file for the menu system; the default is the base file of the standard menu system (stored by the standard Start-up File in text _flmnbas)

INCHANNEL = scalar Input channel on which the base file is to be opened; default is the channel assumed by the standard menu system (stored by the standard Start-up File in scalar _chmnbas)

OUTCHANNEL = scalar Output channel on which a file has been opened to keep a record of work done; default is the channel assumed by the standard menu system (stored by the standard Start-up File in scalar _chcomnd)

 

No parameters

 

Description

The MENU procedure enters a menu system. By default, the standard Genstat Menu System is started, but the option FILENAME can be set to the name of an alternative file of Genstat commands. The option INCHANNEL specifies the input channel on which the alternative file is to be opened, and OUTCHANNEL specifies the output channel which may be used within the alternative file for keeping a record of commands. The procedure only operates interactively: in batch mode, immediate exit occurs without a diagnostic.

 

Options: FILENAME, INCHANNEL, OUTCHANNEL. Parameters: none.

 

Method

The input channel INCHANNEL is closed, without a warning being printed if it was not actually open. Then the base file is opened on INCHANNEL and control is passed to the commands in the file, with echoing of commands switched off. Automatic logging (via the COPY directive) to the channel OUTCHANNEL is switched off.

 

MPOWER procedure

Forms integer powers of a square matrix

(P.W. Lane)

 

No options

 

Parameters

MATRIX = matrices, symmetric matrices or diagonal matrices Matrix from which to form the power

POWER = scalars Power to which each matrix is to be raised

RESULT = identifiers Structure to store the result

 

Description

MPOWER forms powers of a square matrix, using as few matrix operations as possible in order to save time and decrease rounding errors. The square matrix is specified using the MATRIX parameter, and can be either an ordinary matrix structure (with an equal number of rows and columns), a symmetric matrix or a diagonal matrix. The required power, which must be a positive integer, is specified using the POWER parameter. The RESULT parameter supplies the identifier of the structure to save the results; this will be declared automatically to be of the same type as the input structure.

 

Options: none. Parameters: MATRIX, POWER, RESULT.

 

Method

For general matrices, successive powers of two of the matrix are formed by matrix products, and the result formed by taking the product of those that are needed to achieve the specified power. Diagonal matrices are dealt with using simple exponentiation of the diagonal values. Symmetric matrices are spectrally decomposed, and the result formed as a product of the matrix containing the latent vectors (V) with the simple power of the diagonal matrix containing the latent roots (R):

RESULT = V *+ R**POWER *+ TRANSPOSE(V).
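The approach described for general matrices (forming successive powers of two and multiplying together those that are needed) can be sketched as follows. This is a Python illustration of the technique, not the Genstat code of the procedure; the names `mat_mul` and `mat_power` are hypothetical, and matrices are assumed to be square lists of rows.

```python
def mat_mul(a, b):
    """Product of two square matrices held as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, power):
    """Raise a square matrix to a positive integer power by repeated squaring."""
    if int(power) != power or power < 1:
        raise ValueError("power must be a positive integer")
    p = int(power)
    square = [row[:] for row in matrix]  # matrix ** (2 ** 0)
    result = None
    while p > 0:
        if p & 1:
            # This power of two is needed: fold it into the result.
            result = square if result is None else mat_mul(result, square)
        p >>= 1
        if p:
            square = mat_mul(square, square)  # next power of two
    return result


if __name__ == "__main__":
    print(mat_power([[1, 1], [1, 0]], 10))
```

For a power of 10 (binary 1010) this uses three squarings and one further product, rather than the nine products needed by straightforward repeated multiplication.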

 

MULTMISS procedure

Estimates missing values for units in a multivariate data set

(H.R. Simpson & R.P. White)

 

Option

MAXCYCLE = scalar Defines the maximum allowed number of iterations; default 10

 

Parameters

DATA = pointers Each pointer contains a set of variates whose missing values are to be estimated; these will be overwritten by the estimates unless the OUT parameter is specified

OUT = pointers Each pointer contains a set of variates to hold the results

 

Description

MULTMISS estimates missing values for units in a multivariate data set, using an iterative regression technique. The input for the procedure is a set of variates contained in a pointer specified by the DATA parameter. The output can be saved in a different set of variates by supplying a similar pointer with the parameter OUT; if this is absent, the output values will overwrite the values of the variates given by DATA. The maximum number of iterations is set by the option MAXCYCLE, with a default of 10. If MAXCYCLE is set to zero, missing values will be replaced by variate means calculated from the units that have no values missing for any of the variates.

 

Option: MAXCYCLE. Parameters: DATA, OUT.

 

Method

Initial estimates of the missing values in each variate are formed from the variate means using the values for units that have no missing values for any variate. Estimates of the missing values for each variate are then recalculated as the fitted values from the multiple regression of that variate on all the other variates. When all the missing values have been estimated the variate means are recalculated. If any of the means differs from the previous mean by more than a tolerance (the initial standard error divided by 1000) the process is repeated, subject to a maximum number of repetitions defined by the MAXCYCLE option.

The default maximum number of iterations (10) is usually sufficient when there are few missing values, say two or three. If there are many more, 20 or so, it may be necessary to increase the maximum number of iterations to around 30.

The method is similar to that of Orchard & Woodbury (1972), but does not adjust for bias in the variance-covariance matrix as suggested by Beale & Little (1975).

 

Action with RESTRICT

All the variates must be unrestricted, or they must all be restricted to the same set of units; otherwise a fault will occur in a CALCULATE statement within MULTMISS.

 

References

Beale, E.M.L. & Little, R.J.A. (1975). Missing values in multivariate analysis. J.R.Statist.Soc. B, 37, 129-145.

Orchard, T. & Woodbury, M.A. (1972). A missing information principle: theory and applications. In: Proc. 6th Berkeley Symp. Math. Statist. Prob. Vol I, 697-715.

 

MVARIOGRAM procedure

Fits models to an experimental variogram

(S.A. Harding & R. Webster)

 

Options

PRINT = strings Controls printed output from the fit (model, summary, estimates, correlations, fittedvalues, monitoring); default mode, summ, esti

MODEL = string Defines which model to fit (power, boundedlinear, circular, spherical, doublespherical, pentaspherical, exponential, besselk1, gaussian, affinepower, linear); default powe

WEIGHTING = string Method to be used for weighting (counts, cbyvar, equal); default coun

CONSTANT = string How to treat the constant (estimate, omit); default esti

WINDOW = scalar Window in which to plot a graph; default 0 i.e. no graph

TITLE = text Title for the graph

XUPPER = scalar Upper limit for the x-axis in the graph

PENDATA = scalar Pen to be used to plot the data; default 1

PENMODEL = scalar Pen to be used to plot the model; default 2

 

Parameters

VARIOGRAM = variates or matrices Experimental variogram to which the model is to be fitted, as a variate if in only one direction or as a matrix if there are several

COUNTS = variates or matrices Counts for the points in each variogram (not required if WEIGHTING=equal)

DISTANCE = variates or matrices Mean lag distances for the points in each variogram

DIRECTION = variates Directions in which each variogram was computed

ESTIMATES = variates Estimated parameter values

FITTEDVALUES = variates Fitted values

EXIT = scalars Exit status from the nonlinear fitting (zero indicates success; see page 433 of the Genstat 5 Release 3 Reference Manual)

 

Description

Procedure MVARIOGRAM uses the directives FIT, FITCURVE and FITNONLINEAR to fit various models to the experimental variogram. Models must be authorized in the sense that they cannot give rise to negative variances when data are combined. Technically they are conditionally negative semi-definite (CNSD); see Webster & Oliver (1990) or Journel & Huijbregts (1978) for an explanation.

The MODEL option specifies the model that is to be fitted. There are bounded isotropic models with finite ranges. These all take the value $c_0 + c$ for $h \ge a$, and the following values for $h < a$

boundedlinear $c_0 + c\,h/a$

circular $c_0 + c \{ 1 - (2/\pi) \arccos(h/a) + (2h/(\pi a)) \sqrt{1 - h^2/a^2} \}$

spherical $c_0 + c \{ 1.5\,h/a - 0.5\,(h/a)^3 \}$

doublespherical $c_0 + c_1 \{ 1.5\,h/a_1 - 0.5\,(h/a_1)^3 \} + c_2 \{ 1.5\,h/a_2 - 0.5\,(h/a_2)^3 \}$ for $h \le a_1$

$c_0 + c_1 + c_2 \{ 1.5\,h/a_2 - 0.5\,(h/a_2)^3 \}$ for $a_1 < h < a_2$

where $c = c_1 + c_2$

pentaspherical $c_0 + c \{ 1.875\,h/a - 1.25\,(h/a)^3 + 0.375\,(h/a)^5 \}$

There are also bounded asymptotic models

exponential $c_0 + c \{ 1 - \exp(-h/a) \}$

besselk1 $c_0 + c \{ 1 - (h/a)\,K_1(h/a) \}$

(Whittle's elementary correlation, Whittle 1954)

gaussian $c_0 + c \{ 1 - \exp(-h^2/a^2) \}$

and unbounded models

power $c_0 + g\,h^{a}$

(power function with exponent a strictly between 0 and 2)

linear $c_0 + c\,h$

which is a special case of the power function with exponent 1.

Finally, the affinepower function can be fitted to an experimental variogram that appears unbounded and geometrically anisotropic, i.e. one that might be made isotropic by a simple linear transformation of the spatial coordinates

affinepower $c_0 + \sqrt{ a^2 \cos^2(\phi - \theta) + b^2 \sin^2(\phi - \theta) } \; h^{\mathrm{power}}$

In all these models, the intercept term (or nugget variance) c0 can be omitted by setting the CONSTANT option to omit; the default is estimate.
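To make the shapes of these models concrete, some of them are sketched below as Python functions of the lag distance h. This is an illustration only, not part of MVARIOGRAM: the function names are hypothetical, the arguments `c0`, `c` and `a` follow the symbols used above, and each function simply evaluates the formula as written (so the value at h = 0 is c0).

```python
import math


def bounded_linear(h, c0, c, a):
    return c0 + c * h / a if h < a else c0 + c


def circular(h, c0, c, a):
    if h >= a:
        return c0 + c
    r = h / a
    return c0 + c * (1 - (2 / math.pi) * math.acos(r)
                     + (2 * h / (math.pi * a)) * math.sqrt(1 - r * r))


def spherical(h, c0, c, a):
    if h >= a:
        return c0 + c
    r = h / a
    return c0 + c * (1.5 * r - 0.5 * r ** 3)


def pentaspherical(h, c0, c, a):
    if h >= a:
        return c0 + c
    r = h / a
    return c0 + c * (1.875 * r - 1.25 * r ** 3 + 0.375 * r ** 5)


def exponential(h, c0, c, a):
    # Approaches the sill c0 + c asymptotically.
    return c0 + c * (1 - math.exp(-h / a))


def gaussian(h, c0, c, a):
    return c0 + c * (1 - math.exp(-(h * h) / (a * a)))


if __name__ == "__main__":
    for h in (0.0, 2.5, 5.0, 10.0, 20.0):
        print(h, spherical(h, c0=1.0, c=2.0, a=10.0))
```

Setting `c0=0` corresponds to omitting the nugget variance, as with CONSTANT=omit.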

The data for the procedure can be taken directly from the FVARIOGRAM directive, with parameters DISTANCE, VARIOGRAM and COUNTS corresponding to the similarly named parameters of FVARIOGRAM. The data will be in variates if the variogram was calculated in only one direction. If it was calculated in several directions, they can either be in matrices (as generated by FVARIOGRAM) or in variates. For MODEL=affinepower directions must be supplied, using the DIRECTION parameter. These should be in a variate with one value for each column if the other data are in matrices; alternatively, they should be in a variate of the same length as the other variates.

The WEIGHTING option controls the weights that are used when fitting the model. The default setting counts uses the values supplied by the COUNTS parameter, cbyvar uses the COUNTS divided by the values in VARIOGRAM, and equal uses equal weights (of one).

The procedure generates rough starting values for the parameters before calling FITNONLINEAR to iterate to convergence. If the solution does not converge there are two likely reasons. First, the model may be unsuited to the particular experimental variogram: for example, a bounded model may have been specified when the variogram is clearly unbounded, or vice versa. You should choose only models that have approximately the right shape. Alternatively, the starting values may be too far from a sensible solution. In that case you should estimate starting values by inspection and insert them into MVARIOGRAM.

Printed output is controlled by the PRINT option, and includes all the usual settings as in FIT, FITCURVE or FITNONLINEAR. You can also produce a high-resolution graph of the data and the fitted model, by setting the WINDOW option to the number of a suitable window. By default WINDOW is zero, and no graph is produced. The TITLE option can supply a title for the plot. Option XUPPER can define an upper value for the x-axis (i.e. distance), and PENDATA and PENMODEL can supply the numbers of the pens to be used to plot the experimental variogram and the fitted model respectively (by default 1 and 2).

 

Options: PRINT, MODEL, WEIGHTING, CONSTANT, WINDOW, TITLE, XUPPER, PENDATA, PENMODEL.

Parameters: VARIOGRAM, COUNTS, DISTANCE, DIRECTION, ESTIMATES, FITTEDVALUES, EXIT.

 

Method

The model is fitted using directives FIT, FITCURVE or FITNONLINEAR as appropriate.

 

Action with RESTRICT

If the data variates are restricted, only the units not excluded by the restriction will be used.

 

References

Journel, A. G. and Huijbregts, C. J. (1978). Mining Geostatistics. Academic Press, London.

Webster, R. and Oliver, M.A. (1990). Statistical Methods in Soil and Land Resource Survey. Oxford University Press.

Whittle, P. (1954). On stationary processes in the plane. Biometrika, 41, 434-449.

 

NLCONTRASTS procedure

Fits nonlinear contrasts to quantitative factors in ANOVA

(R.C. Butler)

 

Options

PRINT = strings Printed output required (aovtable, information, covariates, effects, residuals, contrasts, means, %cv, missingvalues); default aovt, info, cova, mean, miss

CURVE = string Curve (as in FITCURVE) to use for nonlinear regression (exponential, dexponential, cexponential, lexponential, logistic, glogistic, gompertz, ldl, qdl, qdq); default expo

FPROBABILITY = string Printing of probabilities for variance ratios (yes, no); default no

PSE = string Standard errors to print with means tables (differences, means); default diff

WEIGHT = variate Variate of weights for each unit; default * (no weights)

 

Parameters

Y = variates Data to be analysed

XFACTOR = factors Factor with quantitative levels for which contrasts are to be found

XLEVELS = variates Variate of values to use for the levels of XFACTOR; if unset, the factor levels themselves are used

GROUPFACTOR = factors Factor whose interaction with XFACTOR is to be assessed

CONTRASTS = pointers Structures to hold the estimates of the fitted contrasts: CONTRASTS[1] is a pointer with two values, labelled 'Curve' (parameter estimates for a single fitted curve) and 'Deviations' (the differences between this curve and the means for XFACTOR); CONTRASTS[2] has three values, labelled 'Common NonLin' (parameter estimates for curves fitted with common nonlinear parameters for all levels of GROUPFACTOR), 'Separate Curves' (parameter estimates for curves fitted with all parameters varying with the levels of GROUPFACTOR) and 'Deviations' (differences between the treatment means and the Separate Curves); the order of the parameters is as in the output of the procedure, the variates of estimated contrasts are labelled by the parameter names as used in the printed output, while the 'Deviations' are both tables, labelled by the relevant factors

SECONTRASTS = pointers Structures to save the standard errors for the contrast estimates, including 'deviations'; the pointer has the same form as the CONTRASTS pointer

DFCONTRASTS = pointers Structures to save the degrees of freedom for the contrast estimates; the pointer has the same form as the CONTRASTS pointer, except that the variates and tables are replaced by scalars

 

Description

The ANOVA directive allows linear contrasts to be fitted and incorporated into the analysis-of-variance table. NLCONTRASTS extends this to enable nonlinear contrasts to be fitted to the effects of a quantitative factor and its interaction with another factor. The analysis should include both main effects and the interaction between the factors. The procedure will work for any block structure providing each treatment term is estimated entirely within one stratum. The result is similar to ANOVA with a polynomial contrast, but with slightly different partitions of the treatment sums of squares. The main effect is partitioned into the sum of squares for the "Curve" and the remainder or "Deviations". The interaction sum of squares is partitioned into the sum of squares due to curves with "Common Nonlinear" parameters for the levels of the non-quantitative factor, and the extra sum of squares due to having "Separate Curves" for each level of that factor, and the remaining sum of squares which again represents "Deviations".

The BLOCKSTRUCTURE and TREATMENTSTRUCTURE directives must be used in the normal way before the procedure is called, and any COVARIATES should also be defined first. The structure of the analysis-of-variance table is then accessed from inside the procedure. The Y parameter defines the variate to be analysed, and the form of nonlinear contrast is defined using the CURVE option of the procedure. The same choices of curves are available as for FITCURVE. There are four other options, PRINT, FPROBABILITY, PSE, and WEIGHT, which are exactly as for ANOVA. The XFACTOR parameter defines the factor to which the contrasts are to be fitted, and the XLEVELS parameter may be used to define x values for the regressions if the levels already defined for the factor are unsuitable. The GROUPFACTOR parameter defines the factor whose interaction with XFACTOR is to be assessed. The final three parameters CONTRASTS, SECONTRASTS and DFCONTRASTS can be used to save the parameter estimates for the contrasts, their standard errors and degrees of freedom respectively.

 

Options: PRINT, CURVE, FPROBABILITY, PSE, WEIGHT.

Parameters: Y, XFACTOR, XLEVELS, GROUPFACTOR, CONTRASTS, SECONTRASTS, DFCONTRASTS.

 

Method

ANOVA is used to obtain the basic analysis-of-variance table and the sums of squares for the treatment terms. FITCURVE is then used with the treatment means to fit three sets of curves: a single curve, curves with common nonlinear parameters, and entirely separate curves. The deviances and degrees of freedom obtained from these are used in conjunction with the treatment sums of squares to calculate the contrast sums of squares and degrees of freedom. Further details are given by Butler & Brain (1993). New lines for the analysis-of-variance table are then constructed using PRINT and EDIT, and these lines are then inserted into the table (saved in a text with ADISPLAY) using EDIT. The standard errors for the parameter estimates and deviances are based on the Residual Mean Square for the appropriate stratum. Standard errors for deviations are calculated using the method on page 501 of the Genstat 5 Release 3 Reference Manual.

 

Action with RESTRICT

If the Y variate is restricted, the procedure will use only the units not excluded by the restriction.

 

Reference

Butler, R.C. & Brain, P. (1993). Nonlinear Contrasts in ANOVA. Genstat Newsletter 29.

 

NORMTEST procedure

Performs tests of univariate and/or multivariate normality

(M.S. Ridout)

 

Option

PRINT = strings Allows the required printed output to be selected: test statistics, tables of critical values and the flagging of significant values with stars (marginal, bivariateangle, radius, critical, stars); default marg, biva, radi

 

Parameter

DATA = variates or pointers Variates whose univariate normality is to be tested or pointers, each to a set of variates whose normality and/or multivariate normality are to be tested

 

Description

This procedure offers three types of test of normality.

Marginal (univariate) tests - assess the normality of each variate in turn. The variates are standardized to have mean=0, variance=1 and then transformed with the NORMAL function. The test is based on the idea that, assuming normality, these transformed values should look like a sample from a uniform distribution on (0,1).

Bivariate angle tests - assess the bivariate normality of each pair of variates in turn. The variates are standardized so that they are uncorrelated and have mean=0 and variance=1. The test is based on the following idea: if x and y are the standardized values, then the angle between the x-axis and the line joining (0,0) to (x,y) should, assuming normality, be uniformly distributed on $(0, 2\pi)$.

Radius test - provides a single overall test of multivariate normality. The variates are again standardized to have mean=0 and so that their covariance matrix is the identity matrix. The test uses the fact that if $z_1, z_2, \dots, z_n$ are the standardized values then $z_1^2 + z_2^2 + \dots + z_n^2$ should, under multivariate normality, be approximately distributed as chi-squared on n degrees of freedom.

For each type of test, the test statistics are empirical distribution function (EDF) statistics - i.e. they compare the empirical distribution function of the sample with the theoretical distribution expected under the null hypothesis. Three EDF statistics are provided for each type of test - the Anderson-Darling statistic, the Cramer-von Mises statistic and the Watson statistic. The idea is to provide good power against a wide range of alternatives. The test statistics are adjusted so that their null distribution is independent of the sample size; critical values can be printed by the procedure (option PRINT=critical).
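For the marginal test, the transformation and the three EDF statistics can be sketched as below. This is a Python illustration rather than the procedure's own code: the function names are hypothetical, the variance is assumed to use the divisor n-1, and the statistics are the standard unadjusted forms for a fully specified uniform distribution on (0,1). The sample-size adjustments that NORMTEST applies are not reproduced here.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard Normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def to_uniform(values):
    """Standardize to mean 0, variance 1 (divisor n-1 assumed), then transform."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [normal_cdf((v - mean) / sd) for v in values]


def edf_statistics(u):
    """Unadjusted Anderson-Darling, Cramer-von Mises and Watson statistics."""
    u = sorted(u)
    n = len(u)
    a2 = -n - sum((2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
                  for i in range(1, n + 1)) / n
    w2 = sum((u[i - 1] - (2 * i - 1) / (2.0 * n)) ** 2
             for i in range(1, n + 1)) + 1.0 / (12 * n)
    u2 = w2 - n * (sum(u) / n - 0.5) ** 2
    return {"anderson_darling": a2, "cramer_von_mises": w2, "watson": u2}


if __name__ == "__main__":
    data = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.5, 4.9]
    print(edf_statistics(to_uniform(data)))
```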

The DATA parameter is used to indicate the variate(s) whose normality is to be assessed. If a single variate is supplied, its normality is tested using the marginal test. Alternatively, DATA can supply a pointer to a set of variates to be tested for multivariate normality.

The PRINT option can be used to select the type of test using the settings marginal, bivariateangle and radius. The setting critical allows tables of critical values to be printed, and stars requests that significant values of the test statistics be flagged with stars. Settings bivariateangle and radius are relevant only when testing for multivariate normality. By default PRINT=marginal,bivariateangle,radius.

 

Option: PRINT. Parameter: DATA.

 

Method

The calculations are clearly set out in Aitchison (1986; Section 7.3). Bivariate angle and radius tests are described by Andrews, Gnanadesikan & Warner (1973). Stephens (1974) describes the EDF statistics used and gives tables of critical values and information on their comparative power.

 

Action with RESTRICT

If a variate to which the DATA parameter is set is restricted, the tests will be calculated using only the units included by the restriction. Similarly, the variates in a DATA pointer can be restricted, but then must all be restricted in the same way. The procedure does not work properly with missing values. If missing values are present, RESTRICT should be used (before calling the procedure) to exclude all units for which any of the variates has a missing value.

 

References

Aitchison J.A. (1986). The statistical analysis of compositional data. London: Chapman & Hall.

Andrews D.F., Gnanadesikan R. & Warner J.L. (1973). Methods for assessing multivariate normality. In Multivariate Analysis III (ed. P.R. Krishnaiah) pp. 95-116. New York: Academic Press.

Stephens M.A. (1974). EDF statistics for goodness of fit and some comparisons. J.A.S.A., 69, 730-737.

 

NOTICE procedure

Gives access to the Genstat Notice Board (news, errors &c)

(R.W. Payne)

 

Option

PRINT = strings Indicates what information is required (news, library, errors, instructions); default news

 

No parameters

 

Description

NOTICE allows information to be printed from the Genstat 5 Notice Board. The information is stored in a backing-store file whose name is defined by Library procedure LIBFILENAME; there must be a free backing-store channel to which the file can be attached.

The PRINT option is used to specify what information is required. The possible values, with explanations in brackets, are as follows: news (information about recent developments concerning Genstat), library (recent developments concerning the Procedure Library), errors (details about errors reported in Genstat 5 directives, and other advice and warnings), instructions (instructions for authors of library procedures).

 

Option: PRINT. Parameters: none.

 

Method

The information is held in subfile _notices of the backing-store file that holds help for the Library; the name of the file is supplied by procedure LIBFILENAME. The file is opened on the first available backing-store channel; if all the channels are in use, the procedure stops with a diagnostic. After printing the required information, the file is closed.

 

ORTHPOL procedure

Calculates orthogonal polynomials

(P.W. Lane)

 

Options

MAXDEGREE = scalar Maximum degree of polynomial to be calculated; default is the number of identifiers in the pointer specified by the POLYNOMIAL parameter

WEIGHTS = variate Weights to be used in orthogonalization; default * gives an equal weight to each unit

 

Parameters

X = variates Values from which to calculate the polynomials; no default - this parameter must be set

POLYNOMIAL = pointers Identifiers of variates to store results; no default - this parameter must be set

 

Description

Polynomials of low degree can be fitted by ordinary linear regression, estimating effects of terms X, X**2, X**3, and so on for a variate X. However, it is sometimes preferable to arrange that successive polynomial terms are orthogonal to each other; certainly, there are likely to be numerical problems with polynomials of degree five or more, if they are not orthogonal. ORTHPOL calculates orthogonal polynomials up to a specified maximum degree from a given variate. The orthogonalization can be weighted by specifying a variate of weights.

 

Options: MAXDEGREE, WEIGHTS. Parameters: X, POLYNOMIAL.

 

Method

Successive formation of polynomials, starting with p_1 = x - mean(x), ensuring orthogonality of p_i with p_1 ... p_(i-1); that is:

Σ ( weight × p_i × p_j ) = 0   for j = 1 ... i-1
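
The orthogonalization described above can be sketched in Python. This is an illustrative Gram-Schmidt construction, not the code of the procedure itself: the function name is hypothetical, and it assumes that, when weights are given, the mean subtracted in the first polynomial is the weighted mean.

```python
def orthogonal_polynomials(x, max_degree, weights=None):
    """Return polynomials p_1 ... p_max_degree, evaluated at x, that are
    mutually orthogonal under the weighted inner product sum(w * a * b)."""
    n = len(x)
    w = [1.0] * n if weights is None else list(weights)

    def inner(a, b):
        # Weighted inner product: sum(weight * a * b)
        return sum(wi * ai * bi for wi, ai, bi in zip(w, a, b))

    # Start from the constant term so that p_1 = x - mean(x)
    basis = [[1.0] * n]
    for degree in range(1, max_degree + 1):
        p = [xi ** degree for xi in x]
        # Remove the component of p along every polynomial found so far
        for q in basis:
            coef = inner(p, q) / inner(q, q)
            p = [pi - coef * qi for pi, qi in zip(p, q)]
        basis.append(p)
    return basis[1:]  # drop the constant term


polys = orthogonal_polynomials([1, 2, 3, 4, 5], 3)
```

Starting from raw powers of x is adequate for a low-degree illustration; a production implementation would normally use a numerically safer recurrence.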

 

Action with RESTRICT

A variate in the X parameter can be restricted: the restriction is transferred to the calculated polynomials, and to the weight variate if specified.

 

PAIRTEST procedure

Performs t-tests for pairwise differences

(P.W. Goedhart)

 

Options

PRINT = strings What to print (differences, sed, tvalues, tprobabilities); default diff, sed, tval

DF = scalar Degrees of freedom for calculation of TPROBABILITIES from TVALUES; default 10000, approximates to the normal distribution

SORT = string Whether ESTIMATES (and other output) are sorted in ascending order (yes, no); default no

 

Parameters

ESTIMATES = variates Estimates to be compared

VCOVARIANCE = symmetric matrices Symmetric matrix containing the variance-covariance matrix of the estimates

LABELS = texts Text vector naming the elements of ESTIMATES; if unset, the numbers 1, 2... are used as labels

DIFFERENCES = symmetric matrices To save the pairwise differences (ESTIMATES on the diagonal)

SED = symmetric matrices To save the standard errors of the pairwise differences (missing values on the diagonal)

TVALUES = symmetric matrices To save the t-values (missing values on the diagonal)

TPROBABILITIES = symmetric matrices To save the t-probabilities (missing values on the diagonal)

 

Description

PAIRTEST can be used to test all pairwise differences in every situation in which a vector of estimates and a corresponding variance-covariance matrix are available. PAIRTEST is particularly useful for tests of all pairwise differences of slopes after fitting a model with an interaction between a factor and a variate. In most other situations procedure RPAIR will be more suitable.

All pairwise differences of entries in ESTIMATES with variance-covariance matrix VCOVARIANCE are calculated and tested. The results of these tests can be saved in symmetric matrices DIFFERENCES, SED, TVALUES and TPROBABILITIES. The matrices are labelled by text vector LABELS or, if LABELS is unset, by the numbers 1, 2, 3...

PRINT controls the output of PAIRTEST. The t-probabilities are based on DF degrees of freedom; by default, if DF has not been set, normal probabilities are calculated. Option SORT controls whether the estimates on the diagonal of DIFFERENCES are sorted in ascending order. The other output is sorted accordingly.

 

Options: PRINT, DF, SORT.

Parameters: ESTIMATES, VCOVARIANCE, LABELS, DIFFERENCES, SED, TVALUES, TPROBABILITIES.

 

Method

The calculations are all relatively straightforward.
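
As an illustration of those calculations, the Python sketch below forms each pairwise difference, its standard error from the variance-covariance matrix, the t-value and a two-sided probability. It is not the procedure's own code: the function name and the sign convention of the differences are assumptions, and the probability uses the Normal distribution, which corresponds to the large-DF default.

```python
import math


def pairwise_tests(estimates, vcov):
    """Return {(i, j): (difference, sed, tvalue, probability)} for i > j."""
    results = {}
    for i in range(len(estimates)):
        for j in range(i):
            diff = estimates[i] - estimates[j]
            # var(e_i - e_j) = v_ii + v_jj - 2 * v_ij
            sed = math.sqrt(vcov[i][i] + vcov[j][j] - 2.0 * vcov[i][j])
            t = diff / sed
            # Two-sided Normal probability: 2 * (1 - Phi(|t|))
            prob = math.erfc(abs(t) / math.sqrt(2.0))
            results[(i, j)] = (diff, sed, t, prob)
    return results


tests = pairwise_tests([1.0, 3.0], [[1.0, 0.0], [0.0, 3.0]])
```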

 

Action with RESTRICT

The variate ESTIMATES and the text LABELS can be restricted; the analysis is restricted according to restrictions on ESTIMATES. The lengths of the unrestricted vectors ESTIMATES and LABELS must be identical.

 

PCOPROC procedure

Performs a multiple Procrustes analysis

(P.G.N. Digby)

 

Options

PROTATE = strings Printed output required from each Procrustes rotation (rotations, coordinates, residuals, sums); default * i.e. no output

PPCO = strings Printed output required from the PCO analysis (roots, scores, centroid); default root, score, cent

SCALING = string Whether isotropic scaling should be used for the Procrustes rotations (no, yes); default no

STANDARDIZE = strings Whether to centre the configurations and/or normalize them to unit sums-of-squares for the Procrustes rotations (centre, normalize); default cent, norm

 

Parameters

DATA = pointers Each pointer points to a set of matrices holding the original input configurations

LRV = LRVs Stores the latent vectors (i.e. coordinates), roots and trace from the PCO analysis

CENTROID = diagonal matrices Stores the squared distances of the points representing the input configurations from their overall centroid from the PCO analysis

DISTANCES = symmetric matrices Stores the residual sums-of-squares from the Procrustes rotations

 

Description

An N × V matrix represents a configuration of points, for each of N units, in V dimensions. Given a set of M such matrices, a multiple Procrustes analysis compares them in pairs, keeping the residual sums-of-squares, and performs a principal coordinate analysis of the residual sums-of-squares to obtain an ordination representing the individual configurations. The rows of the matrices must represent the same set of units, in the same order; however there is no need for them to have the same number of columns (although generally they will do). An example of the use of multiple Procrustes analysis is given by Digby & Kempton (1987, pages 121-3).

The configurations of points are specified using the DATA parameter. This supplies a pointer containing a matrix with the data for each configuration. The PROTATE option controls the output from the individual Procrustes rotations, and the PPCO option controls that from the principal coordinate analysis. There are M × (M-1)/2 Procrustes rotations so, by default, PROTATE=* to suppress any output. The SCALING and STANDARDIZE options control the way in which the Procrustes rotations are carried out, using the SCALING and STANDARDIZE options of ROTATE. However, the combination of SCALING=yes and STANDARDIZE=centre should not be used, because then the results will be dependent on the order of the input matrices.

The LRV and CENTROID parameters can be used to save results from the principal coordinates analysis, and the DISTANCES parameter can be used to save the symmetric matrix of the residual sums-of-squares from the Procrustes analyses.

 

Options: PROTATE, PPCO, SCALING, STANDARDIZE.

Parameters: DATA, LRV, CENTROID, DISTANCES.

 

Method

The pairwise Procrustes rotations are performed using the ROTATE directive, and the residual sums-of-squares are stored in a symmetric matrix of order M. This matrix is then used as input to a principal coordinate analysis, performed using the PCO directive on a suitably transformed copy of the matrix.

 

Reference

Digby, P.G.N. & Kempton, R.A. (1987). Multivariate analysis of ecological communities. London: Chapman and Hall.

 

PDESIGN procedure

Prints or stores treatment combinations tabulated by the block factors

(R.W. Payne)

 

Options

PRINT = string Controls the printing of the design (design); default desi

BLOCKSTRUCTURE = formula Defines the block factors for the design; the default is to take those specified by the BLOCKSTRUCTURE directive

TREATMENTSTRUCTURE = formula Defines the treatment factors for each design; the default is to take those specified by the TREATMENTSTRUCTURE directive

TABLES = pointer Contains tables to store the tabulated factor values for printing outside the procedure in some other format

 

No parameters

 

Description

PDESIGN allows the treatment combinations allocated to each plot in a design to be displayed as tables, classified by the block factors.

The combinations are represented using the levels of the treatment factors. If any factor also has labels these are printed alongside the levels, as a key, after the tables. The levels are printed in formats that are determined automatically in a way that avoids wasted space or unnecessary decimal places. The block factors are obtained from the block structure of the design, which can be specified explicitly using the BLOCKSTRUCTURE option; otherwise PDESIGN will use any structure that has already been defined by a BLOCKSTRUCTURE statement earlier in the job. Similarly, the treatment factors are obtained either from the TREATMENTSTRUCTURE option of the procedure, or from an earlier TREATMENTSTRUCTURE statement.

If the display produced by the procedure is unsuitable, printing can be suppressed by setting option PRINT=* (by default PRINT=design), and the tables of treatment levels can be saved for printing outside the procedure by setting the TABLES option to a pointer. This will be returned with an element for each treatment factor, pointing to a table classified by the block factors and storing the tabulated levels of the treatment.

 

Options: PRINT, BLOCKSTRUCTURE, TREATMENTSTRUCTURE, TABLES. Parameters: none.

 

Method

The FCLASSIFICATION directive is used to form lists of factors from the block or treatment formulae and, if the block factors do not supply a unique combination of levels for every unit of the design, procedure AFUNITS is used to form a factor to index the units with each combination. Each treatment factor is then copied into a variate and TABULATE is used to put the values into a table classified by the block factors. Numbers of decimal places for printing the factor levels are determined using the DECIMALS procedure.

 

Action with RESTRICT

If any of the factors is restricted, only the part of the design not excluded by the restriction will be displayed.

 

PERCENT procedure

Expresses the body of a table as percentages of one of its margins

(R.W. Payne)

 

Options

CLASSIFICATION = factors Factors classifying the margin over which the percentages are to be calculated

METHOD = string Method to use to calculate the margin if not already present (totals, means, minima, maxima, variances, medians); default tota

HUNDRED = string Whether to put 100% values into the margin instead of the original values (no, yes); default no

 

Parameters

OLDTABLE = tables Tables containing the original values

NEWTABLE = tables Tables to store the percentage values; if any of these is unset, the new values replace those in the original table

 

Description

PERCENT allows you to express the body of a table as percentages of the values in one of its margins. The table is specified using the OLDTABLE parameter. A table to store the new values can be specified using the NEWTABLE parameter, otherwise these replace the values of the original table. The margin is indicated by listing the factors that define it using the CLASSIFICATION option; the default is the final margin (the grand total, or grand mean etc). If the original table has no margins, option METHOD defines how these are to be calculated; the default is to form margins of totals. The values originally in the margin will be left unchanged. If you would prefer these to be replaced by values of 100%, you should set option HUNDRED=yes.

 

Options: CLASSIFICATION, METHOD, HUNDRED. Parameters: OLDTABLE, NEWTABLE.

 

Method

If the OLDTABLE has no margins and contains no missing values, these are formed by the MARGIN directive. Alternatively, if there are missing values, margins other than variances can be formed using TABULATE. CALCULATE is then used to put the required margin into a table classified just by the factors that define the margin. The original table is divided by the marginal table and multiplied by 100 to give the required percentages. If option HUNDRED=no, the same operations are done on a dummy table that originally contains random numbers; for this table, values of 100 should occur only in the margin. Thus by using a logical test in which the values of the dummy table are compared with 100, the marginal values of the original table can be put back into the margin of the final table. The random numbers are generated using a specially written procedure URANDOM in case the Genstat random number generator is already in use in the program that called PERCENT.
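
The arithmetic can be illustrated with a small Python sketch for a two-way table whose margin is classified by the row factor. It computes the percentages directly rather than through the dummy-table device described above; the function name and the list-of-rows representation are assumptions made for the example.

```python
def percent_of_row_margin(table, hundred=False):
    """table: list of rows of numbers without margins.
    Returns each row as percentages of its total, followed by the margin."""
    result = []
    for row in table:
        total = sum(row)  # margin formed as a total (METHOD=totals)
        body = [100.0 * value / total for value in row]
        # HUNDRED=yes puts 100 into the margin; otherwise the total is kept
        margin = 100.0 if hundred else total
        result.append(body + [margin])
    return result


percentages = percent_of_row_margin([[1, 3], [2, 2]])
```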

 

PERIODTEST procedure

Gives periodogram-based tests for white noise in time series

(R.P. Littlejohn)

 

Option

LENGTH = scalar or variate Scalar specifying that the first N units of the series are to be used, or a variate specifying the first and last units of the series to be used

 

Parameters

SERIES = variates Specify the time series to be analysed

PERIODOGRAM = variates Save periodograms of the time series

 

Description

PERIODTEST gives periodogram-based tests for departure from white noise in a set of time series. The series are supplied in a list of variates, using the SERIES parameter. The LENGTH option can specify that only part of each series is to be used, using either a scalar N to indicate that the first N values are to be used, or a variate of length two, holding the values of the first and last units of the required subseries. This may be used to eliminate missing values, which are otherwise not permitted.

The mean-adjusted periodogram is calculated for each series using FOURIER, and can be saved using the PERIODOGRAM parameter. The maximum periodogram ordinate test, Fisher's g-test and the Kolmogorov-Smirnov test on the cumulative periodogram are calculated using the standard formulae (Priestley 1981).

The output for each series consists of the value of the maximum periodogram ordinate (after scaling by the length of the analysed series), the frequency at which this maximum occurs (expressed as the unit number in the PERIODOGRAM variate, i.e. if the maximum occurs at ω = 2πj/N, then j is given), and the probability of exceeding this maximum; the ratio of the maximum to the total of the periodogram ordinates (Fisher's g), and the probability of exceeding this; and the Kolmogorov-Smirnov D statistic based on the maximum deviation of the cumulative periodogram from the line y=x.
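
A minimal Python sketch of the mean-adjusted periodogram and of Fisher's g is given here for orientation. It is not the procedure's code: the function names are hypothetical, the scaling of the ordinates and the exclusion of the zero and Nyquist frequencies are assumptions, and the probability uses Fisher's exact formula for the largest of m ordinates.

```python
import cmath
import math


def periodogram(series):
    """Mean-adjusted periodogram at frequencies 2*pi*k/N, k = 1 ... m."""
    n = len(series)
    mean = sum(series) / n
    centred = [value - mean for value in series]
    m = (n - 1) // 2
    ordinates = []
    for k in range(1, m + 1):
        # Discrete Fourier transform of the centred series at frequency k
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(centred))
        ordinates.append(abs(s) ** 2 / n)
    return ordinates


def fisher_g(ordinates):
    """Return Fisher's g and the probability of exceeding it under white noise."""
    m = len(ordinates)
    g = max(ordinates) / sum(ordinates)
    kmax = min(m, int(1.0 / g))
    prob = sum((-1) ** (k - 1) * math.comb(m, k) * (1.0 - k * g) ** (m - 1)
               for k in range(1, kmax + 1))
    return g, prob


wave = [math.cos(2.0 * math.pi * 2 * t / 16) for t in range(16)]
g, prob = fisher_g(periodogram(wave))
```

For a pure sinusoid, as in the usage lines, almost all of the periodogram total falls in a single ordinate, so g is close to one and the probability is close to zero.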

 

Option: LENGTH. Parameters: SERIES, PERIODOGRAM.

 

Method

The series are mean-corrected, but not trend corrected, before transformation.

 

Action with RESTRICT

The SERIES may not be restricted; restriction of the input series to a contiguous set of units may be achieved by use of the LENGTH option.

 

Reference

Priestley, M.B. (1981) Spectral Analysis and Time Series. Academic Press, London.

 

PLS procedure

Fits a partial least squares regression model

(Ian Wakeling & Nick Bratchell)

 

Options

PRINT = strings Printed output required (data, xloadings, yloadings, ploadings, scores, leverage, xerrors, yerrors, scree, xpercent, ypercent, predictions, groups, estimates, fittedvalues); default esti, xper, yper, scor, xloa, yloa, ploa

NROOTS = scalar Number of PLS dimensions to be extracted

YSCALING = string Whether to scale the Y variates to unit variance (yes, no); default no

XSCALING = string Whether to scale the X variates to unit variance (yes, no); default no

NGROUPS = scalar Number of cross-validation groups into which to divide the data; default 1 (i.e. no cross-validation performed)

SEED = scalar or factor A scalar indicating the seed value to use when dividing the data randomly into NGROUPS groups for the cross-validation or a factor to indicate a specific set of groupings to use for the cross-validation; default takes the (scalar) value of NGROUPS

LABELS = text Sample labels for X and Y that are to be used in the printed output; defaults to the integers 1...n where n is the length of the variates in X and Y

PLABELS = text Sample labels for XPREDICT that are to be used in the printed output; default uses the integers 1, 2 ...

 

Parameters

Y = pointers Pointer to variates containing the dependent variables

X = pointers Pointer to variates containing the independent variables

YLOADINGS = pointers Pointer to variates used to store the Y component loadings for each dimension extracted

XLOADINGS = pointers Pointer to variates used to store the X component loadings for each dimension extracted

PLOADINGS = pointers Pointer to variates used to store the loadings for the bilinear model for the X block

YSCORE = pointers Pointer to variates used to store the Y component scores for each dimension extracted

XSCORE = pointers Pointer to variates used to store the X component scores for each dimension extracted

B = matrices A diagonal matrix containing the regression coefficients of YSCORE on XSCORE for each dimension

YPREDICT = pointers A pointer to variates used to store predicted Y values for samples in the prediction set

XPREDICT = pointers A pointer to variates containing data for the independent variables in the prediction set

ESTIMATES = matrices An nX+1 by nY matrix (where nX and nY are the numbers of variates contained in X and Y respectively) used to store the PLS regression coefficients for a PLS model with NROOTS dimensions

FITTED = pointers Pointer to variates used to store the fitted values for each Y variate

LEVERAGE = variates Variate used to store the leverage that each sample has on the PLS model

PRESS = variates Variate used to contain the Predictive Residual Error Sum of Squares for each dimension in the PLS model, available only if cross-validation has been selected

RSS = variates Variate used to store the Residual Sum of Squares for each dimension extracted

YRESIDUAL = pointers Pointer to variates used to store the residuals from the Y block after NROOTS dimensions have been extracted, uncorrected for any scaling applied using YSCALING

XRESIDUAL = pointers Pointer to variates used to store the residuals from the X block after NROOTS dimensions have been extracted, uncorrected for any scaling applied using XSCALING

XPRESIDUAL = pointers Pointer to variates used to store the residuals from the XPREDICT block after NROOTS dimensions have been extracted

 

Description

The regression method of Partial Least Squares (PLS) was initially developed as a calibration method for use with chemical data. It was designed principally for use with overdetermined data sets and to be more efficient computationally than competing methods such as principal components regression. If Y and X denote matrices of dependent and independent variables respectively, then the aim of PLS is to fit a bilinear model having the form T = XW, X = TP' + E and Y = TQ' + F, where W is a matrix of coefficients whose columns define the PLS factors as linear combinations of the independent variables. Successive PLS factors contained in the columns of T are selected both to minimise the residuals in E and simultaneously to have high squared covariance with a single Y variate (PLS1) or a linear combination of multiple Y variates (PLS2). The columns of T are constrained to be mutually orthogonal. See Helland (1988) or Hoskuldsson (1988) for a more comprehensive description of the PLS method.

The procedure allows the calculation of PLS1 and PLS2 models with cross-validation to assist in the determination of the correct number of dimensions to include in the model. By setting the NGROUPS option the data are randomly divided into a number of groups; samples in each group are then modelled from the remaining samples only. The sum of squares of differences between these "leave out predictions" and the observed values of Y are called PRESS. Many tests of significance for determining the correct number of dimensions are based on comparing values of PRESS for PLS models of varying rank. Values of PRESS are used in the procedure to perform Osten's (1988) test of significance and may also be plotted out in a scree diagram. In addition to the factor scores, factor loadings and residuals, the procedure also calculates a leverage measure (Martens & Naes 1989 page 276) and a single linear combination of the X variables (ESTIMATES) which summarises the entire PLS model.

The procedure will fail if there are missing values present in either the X or Y variates.

To use a PLS model to make predictions from new observations on the X variables, two methods are available. Either the user may do this manually by using the model as specified in the estimates matrix, or the new X data may be specified beforehand as the pointer to variates XPREDICT and the corresponding predictions obtained as YPREDICT.

Output from the PLS procedure can be selected using the following settings of the PRINT option.

data the unscaled data values (with labels).

xloadings X-component loadings (columns of the matrix W - see above).

yloadings variable loadings for the bilinear model of the matrix of dependent variables. Note that these are standardized to unit length and are not the same as the columns of the matrix Q above. To obtain Q, form the matrix C whose columns are the standardized loadings, and post-multiply it by the diagonal matrix supplied as the output parameter B.

ploadings variable loadings for the bilinear model of the matrix of independent variables (columns of the matrix P - see above).

scores X and Y component scores. The X component scores are the columns of the matrix T and are mutually orthogonal. The Y component scores, usually given the symbol u, are not in fact needed in the calculation of the PLS model unless an iterative algorithm is used (see method section). They are provided here for completeness, as sometimes it is useful to plot the Y component scores against the X component scores to give a visual indication of the degree of fit for each PLS dimension.

leverage measure of leverage.

xerrors residual sum of squares and residual standard deviations for all the independent variables. When NGROUPS>1 additional statistics are calculated from the cross-validated residuals, derived when each object is left out. The PRESS value is equal to the sum of squares of cross-validated standard deviations for each X variable multiplied by N-1, where N is the total number of observations. The cross-validated standard deviations may therefore be used to measure the predictive ability of the model for each of the variables.

yerrors residual sum of squares and residual standard deviations for all the dependent variables (see xerrors above).

scree scree diagram of PRESS.

xpercent percentage variance explained for the X variables.

ypercent percentage variance explained for the Y variables.

predictions predicted values for any observations that were not included in the PLS model but were supplied using the XPREDICT parameter.

groups details of groupings used for cross-validation.

estimates estimated PLS regression coefficients.

fittedvalues fitted values from the PLS regressions.

The default settings are estimates, xpercent, ypercent, scores, xloadings, yloadings, ploadings.

The data for PLS are supplied using the X and Y parameters, as pointers to variates containing the columns of the X and Y matrices. Other parameters allow output to be saved in appropriate data structures.

 

Options: PRINT, NROOTS, YSCALING, XSCALING, NGROUPS, SEED, LABELS, PLABELS.

Parameters: Y, X, YLOADINGS, XLOADINGS, PLOADINGS, YSCORES, XSCORES, B, YPREDICT, XPREDICT, ESTIMATES, FITTEDVALUES, LEVERAGES, PRESS, RSS, YRESIDUALS, XRESIDUALS, XPRESIDUALS.

 

Method

Although the PLS method is often presented in terms of an iterative algorithm (Manne 1987), the X block loadings vector for the first PLS dimension (w1) is simply the eigenvector of X'YY'X corresponding to its largest eigenvalue. To find the second and subsequent dimensions, X and Y are deflated by orthogonalising with respect to the current PLS factor (t = Xw) and the eigenanalysis repeated. The above approach was adopted by Rogers (1987) in an implementation of a Genstat 4 macro. Here we adopt a very similar approach by performing a singular value decomposition on the matrix X'Y which simultaneously obtains loading vectors for both data blocks (Hoskuldsson 1988, de Jong & ter Braak 1994).

It is usual to centre all variables prior to a PLS analysis; the procedure will automatically do so even if the XSCALING/YSCALING options are not set. On exit from the procedure the variates pointed to by X and Y are unchanged.
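
For a single Y variate (PLS1) the eigenanalysis simplifies, because X'yy'X has rank one and its leading eigenvector is X'y scaled to unit length. The Python sketch below uses that special case to illustrate the extract-and-deflate cycle; it is not the procedure's singular value decomposition code, and the function name and the list-of-columns layout are assumptions made for the example.

```python
def pls1(x_columns, y, n_roots):
    """PLS1 sketch. x_columns is a list of X variates, y a single variate.
    Returns the X scores, the X loadings (weights) and the final y residuals."""
    n = len(y)

    def centre(values):
        mean = sum(values) / n
        return [v - mean for v in values]

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    X = [centre(column) for column in x_columns]  # variables are centred first
    f = centre(y)
    scores, weights = [], []
    for _ in range(n_roots):
        # Leading eigenvector of X'yy'X: X'y scaled to unit length
        w = [dot(column, f) for column in X]
        norm = dot(w, w) ** 0.5
        w = [value / norm for value in w]
        # Score vector t = Xw
        t = [sum(w[k] * X[k][i] for k in range(len(X))) for i in range(n)]
        tt = dot(t, t)
        # Deflate X and y by orthogonalising with respect to t
        X = [[a - dot(column, t) / tt * ti for a, ti in zip(column, t)]
             for column in X]
        q = dot(f, t) / tt
        f = [a - q * ti for a, ti in zip(f, t)]
        scores.append(t)
        weights.append(w)
    return scores, weights, f


scores, weights, residuals = pls1(
    [[1, 2, 3, 4, 5], [2, 1, 4, 3, 6]], [1.1, 1.9, 3.2, 3.8, 5.3], 2)
```

Because each deflation removes the current score from X, successive score vectors come out mutually orthogonal, in line with the constraint on the columns of T.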

 

Action with RESTRICT

The procedure will work with restricted variates, fitting a PLS model to the subset of objects indicated by the restriction. If there are different restrictions on different data variates then these restrictions will be combined and the analysis performed on the subset of samples that is common to all the restrictions. Note that the unrestricted length of all of the data variates must be the same and the number of samples in the common subset must be at least three. Any restrictions on a text supplied for the LABELS option or a factor for the SEED option will be ignored. On exit from the procedure all the data variates, and if supplied the SEED factor and LABELS text, will all be returned restricted to the common subset of samples. Output data structures that correspond to the samples (i.e. XSCORE, YSCORE, FITTED, LEVERAGE, YRESIDUAL and XRESIDUAL) will also be returned restricted to the common subset, and missing values will be used for those values that have been restricted out.

When restricted data are supplied and LABELS are also given then the appropriate subset of labels will appear in the output; if LABELS are not defined then default labels reflecting the position of the restricted data in the unrestricted variate will be used instead.

No restrictions are allowed in the variates supplied in the XPREDICT parameter or the PLABELS option.

 

References

Helland, I.S. (1988). On the structure of partial least squares regression, Commun. Statist. - Simula. Comput., 17, 581-607.

Hoskuldsson, A. (1988). PLS Regression Methods, J. Chemometrics, 2, 211-228.

de Jong & ter Braak (1994). Comments on the PLS kernel algorithm, J. Chemometrics, 8, 169-174.

Manne, R. (1987). Analysis of Two Partial Least Squares Algorithms for multivariate Calibration. Chemometrics and Intell. Lab. Systems, 2, 187-197.

Naes, T. & Martens, H. (1989). Multivariate Calibration, John Wiley, Chichester.

Osten, D.W. (1988). Selection of Optimal Regression Models Via Cross-Validation, J. Chemometrics, 2, 39-48.

Rogers, C.A. (1987). A Genstat Macro for Partial Least Squares Analysis with Cross-Validation Assessment of Model Dimensionality. Genstat Newsletter, 18, 81-92.

 

PPAIR procedure

Displays results of t-tests for pairwise differences in compact diagrams

(P.W. Goedhart, H. van der Voet & D.C. van der Werf)

 

Options

PRINT = string What to print (items, groups); default grou

PROBABILITY = scalar or symmetric matrix Level of significance of pairwise comparison tests; default 0.05

 

Parameters

TPROBABILITIES = symmetric matrices Probabilities of tests of pairwise comparisons

DIFFERENCES = symmetric matrices, variates or tables What to print alongside the labels of TPROBABILITIES; default *

LABELS = texts Text vector labelling the output; if unset the row labels of TPROBABILITIES and the diagonal of DIFFERENCES (if set) are used

 

Description

Procedures RPAIR and PAIRTEST produce a symmetric matrix of two-sided t-probabilities for tests of all pairwise differences of estimates. PPAIR displays this matrix at a specified level of significance in two compact schematic diagrams. This is especially useful when the number of estimates is large.

Input to PPAIR is a symmetric matrix TPROBABILITIES containing probabilities of the set of pairwise comparisons. The level of significance can be set by the PROBABILITY option. A common level is specified by a scalar, while a symmetric matrix specifies a level for each comparison separately (which may be useful for some multiple comparison methods). Output is labelled by the row labels of TPROBABILITIES. If parameter DIFFERENCES is set to a symmetric matrix the diagonal of this matrix is printed alongside these labels (with number of decimals as defined at declaration of DIFFERENCES). This is especially useful if DIFFERENCES is saved by RPAIR or PAIRTEST because it then contains the estimates on the diagonal. DIFFERENCES can also be set to a variate or table. Alternatively the output can be labelled by specifying parameter LABELS.

PRINT controls which diagram is printed. PRINT=items produces a diagram which should be read line by line. Each item (represented by a letter) is followed by those items (again represented by letters) not significantly different from that item. When there are more than 52 items, letters are repeated. PRINT=groups is only useful when the TPROBABILITIES are sorted in a sensible order, for example by specifying SORT=yes in RPAIR or PAIRTEST. This produces a diagram in which items followed by a common letter are not significantly different. Such items are said to form a homogeneous group. This is similar to common underlining of items with non-significantly different estimates. In constructing this diagram the philosophy of multistage testing is followed, see the Method section.

 

Options: PRINT, PROBABILITY.

Parameters: TPROBABILITIES, DIFFERENCES, LABELS.

 

Method

The construction of the diagram for PRINT=groups is as follows. First the difference between the first and last item of the complete set of n items is checked for significance. Then the first and last item of all subsets of n-1 consecutive items are checked, followed by all subsets of n-2 items, and so on. If non-significance is found between the first and last item of a subset, all items of the subset are said to form a homogeneous group and they receive the same letter. This is only sensible when the TPROBABILITIES are sorted according to the estimates. The diagram only consists of homogeneous groups which are not a part of a larger group.

It is obvious that items in a homogeneous group can be significantly different. This is not displayed in the diagram, although a message is printed if this occurs. If there are no significant differences within homogeneous groups, both diagrams essentially contain the same information; PRINT=groups then gives a more concise representation.
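
The construction for PRINT=groups can be sketched in Python as follows. This is an illustration of the steps described above and not the procedure's code: the function name is hypothetical, a common significance level is assumed, and a probability equal to the level is treated as non-significant.

```python
def homogeneous_groups(tprob, level=0.05):
    """tprob[i][j] is the probability for items i and j, with the items
    sorted by estimate. Returns (first, last) index pairs of the
    homogeneous groups that are not part of a larger group."""
    n = len(tprob)
    groups = []
    for size in range(n, 0, -1):  # the complete set first, then smaller subsets
        for first in range(0, n - size + 1):
            last = first + size - 1
            # Only the first and last items of the subset are compared
            nonsignificant = first == last or tprob[first][last] >= level
            inside_larger = any(a <= first and last <= b for a, b in groups)
            if nonsignificant and not inside_larger:
                groups.append((first, last))
    return sorted(groups)


probabilities = [
    [1.0, 0.5, 0.01, 0.001],
    [0.5, 1.0, 0.3, 0.01],
    [0.01, 0.3, 1.0, 0.02],
    [0.001, 0.01, 0.02, 1.0],
]
groups = homogeneous_groups(probabilities)
```

In the usage lines, items 1 and 2 form one group, items 2 and 3 another, and item 4 stands alone, so item 2 would carry two letters in the diagram.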

 

Action with RESTRICT

Restrictions on DIFFERENCES and LABELS are ignored.

 

PREWHITEN procedure

Filters a time series before spectral analysis

(A.W.A. Murray)

 

Option

PHI = scalar Specifies the value of the parameter used in filtering; default 0.99

 

Parameters

SERIES = variates Input series

FILTERED = variates Output series

 

Description

PREWHITEN provides filtering of time series data prior to spectral analysis. Parameters SERIES and FILTERED specify the input and output series, respectively. The filtered series y is given by

y_t = x_t - φ × x_(t-1)

where x is the input series. (Thus φ = 1 would give first differencing.) The value of φ is specified by the PHI option; the default value of φ = 0.99 is often suitable. Alternatively, an empirical approach is to use the value

φ = (1 - 1/L)

where L is the lag at which inspection suggests that the autocorrelation in the series becomes negligible.

To "recolour" the spectrum of the series after estimation, you can multiply by

1/((1 + φ²) - (2 × φ × cos(2π × f)))

where f is the frequency at which the spectrum is estimated.
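
The filter and the recolouring factor can be illustrated in Python. The sketch is not the FILTER-based code used by the procedure: the function names are hypothetical, the argument phi stands for the value given by the PHI option, and dropping the first unit (which has no predecessor) is an assumption about how the start of the series is handled.

```python
import math


def prewhiten(series, phi=0.99):
    """Filtered value at each unit: the value minus phi times the previous value."""
    return [series[t] - phi * series[t - 1] for t in range(1, len(series))]


def recolour_factor(freq, phi=0.99):
    """Multiplier that restores the spectrum of the original series at freq."""
    return 1.0 / ((1.0 + phi ** 2) - 2.0 * phi * math.cos(2.0 * math.pi * freq))


filtered = prewhiten([1, 2, 4], phi=1.0)  # phi = 1 gives first differences
factor = recolour_factor(0.5, phi=0.5)
```

The denominator of the factor is the squared gain of the filter at frequency f, which is why dividing by it undoes the effect of filtering on the estimated spectrum.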

 

Option: PHI. Parameters: SERIES, FILTERED.

 

Method

The procedure uses the FILTER directive with two TSMs defined as follows:

TSM filter; ORDER=!(1,0,0); PARAM=!(1,0,0,PHI)

TSM arima; ORDER=!(0,1,0); PARAM=!(1,0,0)

FILTER SERIES; NEWSERIES=FILTERED; FILTER=filter; ARIMA=arima

The procedure is based on ideas from Granville Tunnicliffe Wilson, University of Lancaster.

 

Action with RESTRICT

The behaviour is as for the FILTER directive.

 

PROBITANALYSIS procedure

Fits probit models allowing for natural mortality and immunity

(R.W. Payne)

 

Options

PRINT = strings Printed output required (model, summary, estimates, correlations, fittedvalues); default mode, summ, esti, fitt

TRANSFORMATION = string Transformation to be used (probit, logit, complementaryloglog); default prob

MORTALITY = string Whether to estimate natural mortality (omit, estimate); default omit

IMMUNITY = string Whether to estimate natural immunity (omit, estimate); default omit

GROUPS = factor Defines groups for an analysis of parallelism; default * i.e. no groups

SEPARATE = strings Which parameters (apart from intercept) should be estimated separately for different groups (slope, mortality, immunity); default * i.e. none

LD = scalar or variate Effective (or lethal) doses to be estimated, other than 50

 

Parameters

Y = variates Number of subjects responding in each batch

DOSE = variates Dose received by each batch of subjects

NBINOMIAL = variates Number of subjects in each batch

INITIAL = variates Initial values for parameters

STEPLENGTHS = variates Step lengths for parameters

 

Description

Probit analysis is a way of modelling the relationship between a stimulus, like a drug, and a quantal response (success/failure). It is assumed that, for each subject, there is a certain level of dose of the stimulus below which it will be unaffected, but above which it will respond. This level of dose, known as its tolerance, will vary from subject to subject within the population.

For example, it is often assumed that the tolerance of houseflies to an insecticide, measured as the logarithm of the dose, will follow a Normal distribution; so, if we were to plot the proportion of the population with each tolerance against log dose, we would obtain the familiar bell-shaped curve. Likewise, if we plot the probability that a randomly-selected individual will respond, against the logarithm of dose, we would obtain a sigmoid (S-shaped) curve limited below by zero and above by one. To make the relationship linear, it is usual to transform the y-axis either to probits or to Normal equivalent deviates:

Probit(P%) = NED(p) + 5

where proportion p = P% / 100. The Normal equivalent deviate may be familiar as the transformation that is used to produce "probability" graph paper.
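As a small numerical illustration of the relationship between probits and Normal equivalent deviates, the following Python sketch uses the standard library's Normal distribution. The function names ned and probit are hypothetical and are not Genstat functions.

```python
from statistics import NormalDist


def ned(p):
    """Normal equivalent deviate: the standard Normal quantile of proportion p."""
    return NormalDist().inv_cdf(p)


def probit(percent):
    """Probit of a percentage P%: NED(P% / 100) + 5."""
    return ned(percent / 100.0) + 5.0


probit_50 = probit(50.0)    # the median response corresponds to a probit of 5
probit_975 = probit(97.5)   # roughly 5 + 1.96
```

Adding 5 keeps the probits positive over the range of percentages met in practice, since Normal equivalent deviates below -5 correspond to extremely small proportions.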

In probit analysis, we are interested in estimating the equation of that line. This can be done by performing an experiment in which there are several batches of subjects, each of which is given a different dose of the stimulus. The data then consist of a variate indicating the number of subjects that responded out of each batch, a variate to show the dose given to each batch, and a final variate for the total numbers of subjects in the batches; these are specified by parameters Y, DOSE and NBINOMIAL, respectively. The NBINOMIAL parameter can be omitted if the total numbers cannot be measured, as in some fumigation experiments ("Wadley's problem"; see for example Finney 1971, pages 202-8).

The PRINT option controls printed output, with the following settings: model (details of the model that has been fitted), summary (summary analysis-of-variance table), estimates (parameter estimates and standard errors), correlations (correlations between parameter estimates), and fittedvalues (fitted values and residuals). By default, PRINT=mode,summ,esti,fitt.

The TRANSFORMATION option allows other transformations to be selected. Putting TRANSFORMATION=logit requests a logit transformation:

logit(P%) = log( P% / (100 - P%) )

This is very like the probit, but approaches zero (to the left) and one (to the right) rather more slowly. The other possibility is the complementary log-log transformation, log( -log( (100 - P%) / 100 ) ), which is relevant to the "one-hit" model (that is, infection processes where just one infected particle is sufficient to cause the response).
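The two alternative transformations can be illustrated with a short Python sketch. The function names are hypothetical, and the complementary log-log is written in its standard form, log(-log(1 - p)) with p = P% / 100.

```python
import math


def logit(percent):
    """Logit of a percentage P%: log(P% / (100 - P%))."""
    return math.log(percent / (100.0 - percent))


def cloglog(percent):
    """Complementary log-log of a percentage P%: log(-log(1 - P% / 100))."""
    p = percent / 100.0
    return math.log(-math.log(1.0 - p))


logit_50 = logit(50.0)  # zero at a 50% response
# The complementary log-log is zero where 1 - p = exp(-1), i.e. at about 63.2%.
cloglog_zero = cloglog(100.0 * (1.0 - math.exp(-1.0)))
```

Unlike the probit and logit, the complementary log-log is not symmetric about 50%.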

Sometimes, subjects may respond even in the absence of any dose. For example, with some short-lived insects, some would have died simply from natural causes during the period of the experiment. By setting option MORTALITY=estimate this natural mortality can be included in the model and estimated. Similarly, there may be subjects that will not respond, no matter how high the dose. Setting option IMMUNITY=estimate will include and estimate a parameter for natural immunity.
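The description does not state the formula used to combine natural mortality and immunity with the dose response. One common parameterisation, given here purely as an assumption, lets a proportion M respond regardless of dose and a proportion I never respond, with the remainder following the probit line. The Python sketch below uses hypothetical names and assumes that LD50 is expressed on the same (log) scale as the dose.

```python
from statistics import NormalDist


def response_probability(log_dose, ld50, slope, mortality=0.0, immunity=0.0):
    """Assumed model: M + (1 - M - I) * Phi(slope * (log_dose - ld50)).

    This form is an illustrative assumption, not taken from the procedure.
    """
    base = NormalDist().cdf(slope * (log_dose - ld50))
    return mortality + (1.0 - mortality - immunity) * base


# With M = 0.1 and I = 0.2 the curve runs from 0.1 up to 0.8 rather than 0 to 1.
at_ld50 = response_probability(2.0, ld50=2.0, slope=1.5, mortality=0.1, immunity=0.2)
```

Under this assumed form, the lower asymptote of the curve is the natural mortality and the upper asymptote is one minus the natural immunity.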

It is also often of interest to study the way in which the model varies for different groups of subjects. For example, there may be groups of batches of subjects, each of which is given a different drug. The GROUPS option should then specify the group to which each batch of subjects belongs, and option SEPARATE indicates which parameters of the model (slope, mortality, and/or immunity) should have separate estimates. If SEPARATE is left at its default value, parallel lines will be fitted with identical values for any estimates of mortality and immunity.

The final option, LD, can request the estimation of one or more lethal (or effective) doses, specifying a scalar if there is just one, or a variate if there are several. The LD50 value (that is the dose at which 50% of the population would respond) is always printed as one of the parameters of the fitted line.

The model is fitted using the FITNONLINEAR directive, and the final two parameters, INITIAL and STEPLENGTHS, allow initial values and steplengths to be specified for the optimization. The order of parameters is: LD50(s), slope(s), mortality parameters (if any), and immunity parameters (if any). Parameter estimates, fitted values, residuals, and so on, can be saved after running the procedure, by using the RKEEP directive in the usual way.

 

Options: PRINT, TRANSFORMATION, MORTALITY, IMMUNITY, GROUPS, SEPARATE, LD.

Parameters: Y, DOSE, NBINOMIAL, INITIAL, STEPLENGTHS.

 

Method

Initial values are obtained, if necessary, using the Genstat facilities for generalized linear models, ignoring any mortality or immunity. Expressions specifying the model are defined in sets of nested IF-blocks, taking account of the settings of options such as TRANSFORMATION and GROUPS. The fitting is carried out by the FITNONLINEAR directive, and any extra LD values are estimated using RFUNCTION.

Probit analysis can also be performed using the Genstat facilities for generalized linear models, but these do not cover Wadley's problem, nor allow for natural mortality and immunity. The results obtained from this procedure may differ slightly from those that are obtained (where possible) from the use of GLMs (by the FIT directive) or using the WADLEY procedure, due to the use here of maximum likelihood for the fitting, rather than iterative weighted linear regression.

 

Action with RESTRICT

The Y variate, the DOSE variate, or the GROUPS factor can be restricted to indicate that the model is to be fitted only to a subset of the units.

 

Reference

Finney, D.J. (1971), Probit Analysis, 3rd Edition, Cambridge University Press.

 

PTBOX procedure

Generates a bounding or surrounding box for a spatial point pattern

(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)

 

Options

PRINT = string What to print (summary); default summ

METHOD = string Type of box to form (bounding, surrounding); default boun

 

Parameters

Y = variates Vertical coordinates of each spatial point pattern; no default - this parameter must be set

X = variates Horizontal coordinates of each spatial point pattern; no default - this parameter must be set

YBOX = variates Variates to receive the vertical coordinates of the bounding or surrounding boxes

XBOX = variates Variates to receive the horizontal coordinates of the bounding or surrounding boxes

YFRACTION = scalars How much to extend the extremes of the vertical coordinates of each surrounding box as a fraction of the range of the vertical coordinates; default 0.1

XFRACTION = scalars How much to extend the extremes of the horizontal coordinates of each surrounding box as a fraction of the range of the horizontal coordinates; default 0.1

 

Description

This procedure takes as input two variates containing the coordinates of a spatial point pattern (specified by the X and Y parameters) and returns the coordinates of either a bounding or a surrounding box, according to the setting of the METHOD option. The default, METHOD=bounding, provides a bounding box, defined as the smallest rectangle such that all the events in the spatial point pattern lie inside the box or on its boundary. The coordinates of the bounding box are the coordinates of its four corners, in the order lower left, lower right, upper right, and upper left. The surrounding box (METHOD=surrounding) is a rectangle which contains all the points. It is obtained by extending the vertical and horizontal edges of the bounding box by specified fractions of the range of values in X and Y, respectively. The parameters XFRACTION and YFRACTION can be used to specify the proportional extension required in each direction; the default value of both parameters is 0.1. The coordinates of the surrounding box are the coordinates of its four corners in the order lower left, lower right, upper right, upper left. The coordinates of the bounding or surrounding box can be saved using the parameters XBOX and YBOX.

Printed output is controlled by the PRINT option. The default setting of summary prints the coordinates of the bounding or surrounding box under the headings XBOX and YBOX.

 

Options: PRINT, METHOD.

Parameters: Y, X, YBOX, XBOX, YFRACTION, XFRACTION.

 

Method

A procedure PTCHECKXY is called to check that X and Y have identical restrictions. The minimum, maximum and range of the horizontal (X) and vertical (Y) coordinates are then calculated. For a bounding box, the coordinates are calculated as (min(X), min(Y)), (max(X), min(Y)), (max(X), max(Y)), (min(X), max(Y)). For a surrounding box the coordinates are

(min(X) - XFRACTION × range(X), min(Y) - YFRACTION × range(Y)),

(max(X) + XFRACTION × range(X), min(Y) - YFRACTION × range(Y)),

(max(X) + XFRACTION × range(X), max(Y) + YFRACTION × range(Y)),

(min(X) - XFRACTION × range(X), max(Y) + YFRACTION × range(Y)).
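The corner calculations above can be sketched in Python as follows. This is an illustration of the formulae only: the function name pt_box and its argument names are hypothetical, and restrictions are not handled.

```python
def pt_box(x, y, method="bounding", xfraction=0.1, yfraction=0.1):
    """Return (xbox, ybox): corners in the order lower left, lower right,
    upper right, upper left."""
    xmin, xmax = min(x), max(x)
    ymin, ymax = min(y), max(y)
    if method == "surrounding":
        # Extend each side by the given fraction of the coordinate range.
        dx = xfraction * (xmax - xmin)
        dy = yfraction * (ymax - ymin)
        xmin, xmax = xmin - dx, xmax + dx
        ymin, ymax = ymin - dy, ymax + dy
    elif method != "bounding":
        raise ValueError("method must be 'bounding' or 'surrounding'")
    xbox = [xmin, xmax, xmax, xmin]
    ybox = [ymin, ymin, ymax, ymax]
    return xbox, ybox


xs = [0.0, 10.0, 4.0]
ys = [0.0, 2.0, 20.0]
bounding = pt_box(xs, ys)
surrounding = pt_box(xs, ys, method="surrounding")
```

For these points the ranges are 10 and 20, so with the default fractions of 0.1 the surrounding box extends the bounding box by 1 unit horizontally and 2 units vertically on each side.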

 

Action with RESTRICT

If X and Y are restricted, only the subset of values specified by the restriction will be included in the calculations.

 

PTCLOSEPOLYGON procedure

Closes open polygons

(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)

 

Option

PRINT = string What to print (summary); default summ

 

Parameters

OLDYPOLYGON = variates Vertical coordinates of each polygon; no default - this parameter must be set

OLDXPOLYGON = variates Horizontal coordinates of each polygon; no default - this parameter must be set

NEWYPOLYGON = variates Vertical coordinates of the closed polygons

NEWXPOLYGON = variates Horizontal coordinates of the closed polygons

 

Description

A polygonal region of two-dimensional space is represented in Genstat by the coordinates of a sequence of points which define the boundary of the polygon with the last point implicitly connected to the first point. If the first and last pairs of coordinates are the same then the polygon is said to be closed, otherwise it is open. Sometimes it is necessary to work with a closed polygon, for example, when drawing a polygon onto a graphics device as a series of line segments. This procedure takes as input a set of coordinates which define a polygon. The parameters OLDXPOLYGON and OLDYPOLYGON specify variates containing the coordinates. The output of the procedure is a closed polygon, which is identical to the input polygon if it is already closed and otherwise consists of the input polygon with the first pair of coordinates repeated at the end. The coordinates of the closed polygon may be saved using the parameters NEWXPOLYGON and NEWYPOLYGON.

Printed output is controlled by the PRINT option. The default setting of summary prints the horizontal and vertical coordinates of the closed polygon under the headings NEWXPOLYGON and NEWYPOLYGON.

 

Option: PRINT.

Parameters: OLDYPOLYGON, OLDXPOLYGON, NEWYPOLYGON, NEWXPOLYGON.

 

Method

A procedure PTCHECKXY is called to check that OLDXPOLYGON and OLDYPOLYGON have identical restrictions. It then checks whether the first and last pairs of coordinates are the same. If they are, the DUPLICATE directive is used to copy OLDXPOLYGON to NEWXPOLYGON and OLDYPOLYGON to NEWYPOLYGON. If they are different, NEWXPOLYGON and NEWYPOLYGON are declared as variates with one more value than their old counterparts, and the EQUATE directive is used to copy the values from OLDXPOLYGON to NEWXPOLYGON and OLDYPOLYGON to NEWYPOLYGON (so that the first element of each old variate is repeated at the end of the corresponding new one).
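The closing rule can be illustrated with a few lines of Python. This is a sketch of the logic only, with a hypothetical function name; it does not use or reproduce the DUPLICATE and EQUATE directives.

```python
def close_polygon(old_x, old_y):
    """Return copies of the coordinates with the first point repeated at the
    end, unless the polygon is already closed."""
    if len(old_x) != len(old_y):
        raise ValueError("coordinate lists must have the same length")
    new_x, new_y = list(old_x), list(old_y)
    if new_x and (new_x[0] != new_x[-1] or new_y[0] != new_y[-1]):
        new_x.append(new_x[0])
        new_y.append(new_y[0])
    return new_x, new_y


open_triangle = ([0.0, 4.0, 0.0], [0.0, 0.0, 3.0])
closed_x, closed_y = close_polygon(*open_triangle)
```

Applying the function to an already closed polygon returns it unchanged, so the operation can safely be repeated.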

 

Action with RESTRICT

If OLDXPOLYGON and OLDYPOLYGON are restricted, only the subset of values specified by the restriction will be included in the calculations.

 

PTDESCRIBE procedure

Gives summary and second order statistics for a point process

(R.P. Littlejohn & R.C. Butler)

 

Options

PRINT = string Whether to print (statistics); default stat

SELECTION = strings What to print (interval, trend, poisson, icorrelation, ispectrum, cspectrum, cintensity, vtcurve, all); default inte

REPRESENTATION = string How the point process is represented in the DATA variate (time, interval, zeroone); default time

GRAPHICS = string Style of graphical output, or GRAPHICS=* to avoid any graphs (lineprinter, highresolution); default high

 

Parameters

DATA = variates Variate containing point process to be analysed

START = scalars Initial time (if REPRESENTATION=time); default 0

LENGTH = scalars Length of time over which process is observed; default takes the time of the last event

CITAU = scalars Window width for calculating count intensity; default 0.5 × mean interval length

VTTAU = scalars Window width for calculating variance-time curve; default 0.5 × mean interval length

SAVE = pointers Pointer to save calculated values

 

Description

A point process, or series of events, is characterized both by the times at which events occur, and the intervals between events. The Poisson process is the most basic point process, with Poisson counts in any interval, and independent exponentially distributed intervals between events.

A comprehensive account of methods for analysing point processes is given by Cox & Lewis (1966). PTDESCRIBE implements many of the test and summary statistics they give and should be used in conjunction with the text for a full discussion of the motivation and context of their use. All equations referred to below are from Cox & Lewis (1966).

The DATA variate may contain either the times at which events occur, the intervals between events, or a sequence of 0's and 1's, with 1's indicating the times of events on an integer time scale. The option REPRESENTATION specifies which of these is used. If REPRESENTATION=time and the process is measured from some time other than zero, the initial time should be given in the parameter START. Otherwise the START time is assumed to be zero. The first interval is taken to lie between the START time and the first event. If the process is observed beyond the last event, the total duration of the process should be given in the parameter LENGTH. Checks are carried out on START, LENGTH and the length of each interval, and the procedure terminates if these are inconsistent. If REPRESENTATION=time, the DATA variate may be restricted, facilitating the analysis of truncated or thinned point processes.
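The three representations carry the same information, as the following Python sketch illustrates. The function names are hypothetical, and the convention that the i-th unit of a zero-one sequence corresponds to time i is an assumption made for the illustration.

```python
def times_to_intervals(times, start=0.0):
    """Intervals between events; the first runs from the start time to the first event."""
    intervals = []
    previous = start
    for t in times:
        intervals.append(t - previous)
        previous = t
    return intervals


def intervals_to_times(intervals, start=0.0):
    """Event times obtained by accumulating the intervals from the start time."""
    times = []
    current = start
    for gap in intervals:
        current += gap
        times.append(current)
    return times


def zeroone_to_times(flags):
    """Event times on an integer scale (assumption: unit i is time i, counting from 1)."""
    return [i + 1 for i, flag in enumerate(flags) if flag == 1]


event_times = zeroone_to_times([0, 1, 0, 0, 1, 0, 0, 0, 1])  # events at times 2, 5 and 9
gaps = times_to_intervals(event_times)
rate = len(gaps) / sum(gaps)  # reciprocal of the average interval length
```

The rate computed in the last line follows the definition used by the procedure, namely the reciprocal of the average interval length.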

If SAVE is set, time and interval are saved, together with the summary, interval or second order statistics specified by SELECTION, as detailed below. SAVE sets up a pointer, with each element labelled by the name of the relevant statistics saved. For example, if SAVE=clstats, then the intervals between the events will be saved in clstats['interval'].

The option SELECTION can be used to obtain any combination of eight available analyses, with the PRINT and GRAPHICS options controlling the output. The default setting is SELECTION=interval, while SELECTION=all gives all eight analyses. In what follows, the number of events is denoted by N and the variate carrying the times of events by time. The rate of a point process is calculated as the reciprocal of the average interval length.

interval - plots data and summarises the interval distribution

print: summary statistics for the interval process.

graph: times of events; histogram of the intervals between events; histogram of the intervals with bins appropriate for the exponential distribution.

save: summary summary statistics.

trend - tests for trend in the process

print: an N(0,1) test statistic (Ch 3.3 (11)), which is optimal against certain specifications of trend; Bartlett's test for the homogeneity of variance of groups of 3, 8 and 20 contiguous intervals.

poisson - tests whether the point process is Poisson

print: Kolmogorov-Smirnov tests for the empirical distribution function of times of events (Ch 6.2 (27-29, 38)) and for Durbin's order statistic transformation of the intervals (Ch 6.2 (43)); Moran's test against a gamma renewal process for the empirical distribution function (Ch 6.2 (43)); N(0,1) test for trend (see trend above) is applied to Durbin's transformed process.

graph: log survivor function of the interval distribution, compared to the Poisson case (a straight line through the origin with slope = -rate); plots of the empirical distribution function of times of events and Durbin's order statistics with Kolmogorov-Smirnov bounds.

icorrelation - autocorrelations for the interval sequence.

print: the first (N/2-1) end-adjusted autocorrelations (Ch 5.2 (17, 18)) for the interval sequence and their standardization; the end-adjustments are derived using the autocorrelations from CORRELATE.

graph: plot of the autocorrelations of the interval sequence and 95% confidence bounds.

save: order the order of the autocorrelations, icorrelation the autocorrelations of the interval sequence.

ispectrum - periodogram for the interval process

print: the periodogram for the interval process (Ch 5.3 (6, 8)) obtained from FOURIER divided by (2πNs²), where s² is the variance of the interval lengths; since for the Poisson process the ordinates of the periodogram are iid exponentially distributed r.v.s, the ordinates are also tested as the intervals of a Poisson process as provided for by the SELECTION settings trend and poisson above.

graph: the periodogram and Poisson level (π/2) plotted against frequency; plot of the scaled cumulative periodogram with Kolmogorov-Smirnov bounds.

save: ifrequency frequencies at which periodogram is calculated, ispectrum interval periodogram.

cspectrum - periodogram for the count process

print: periodogram for the count process (Ch 5.5 (16)) calculated at frequencies ω = 2πn/T, for n = 1...2N, T = time_N - time_1.

graph: count periodogram and Poisson level (=2) graphed against frequency.

save: cfrequency frequencies at which periodogram is calculated, cspectrum interval periodogram.

cintensity - intensity function for the counting process

print: intensity function for the counting process (Ch 5.4(v) (20)) calculated for times CITAU × (j-0.5), j = 1...integer-part(time_N / (2 × CITAU)); if CITAU is not set, PTDESCRIBE sets it to 0.5 times the average interval length; a preliminary screening precludes an inappropriate setting of CITAU.

graph: intensity function with asymptotic 95% confidence intervals for the Poisson level, the intensity for which = rate, plotted against time.

save: citime times for which intensity is calculated, cintensity intensity function.

vtcurve - variance-time curve V(t) and index of dispersion I(t)

print: V(t) scaled by 1-time/LENGTH (Ch 5.4(iii) (12) and following), and I(t) (Ch 4.5(3)) calculated for times VTTAU × j, j = 1...integer-part(T/(2 × VTTAU)); the setting of VTTAU is screened to preclude inappropriate values, and if unset is assigned the value 0.5 times the average interval length.

graph: V(t) and I(t) against time.

save: vtime times at which V(t) and I(t) are calculated, vtcurve V(t), dispersion I(t).

 

Options: PRINT, SELECTION, REPRESENTATION, GRAPHICS.

Parameters: DATA, START, LENGTH, CITAU, VTTAU, SAVE.

 

Method

The procedure tests whether a point process is a Poisson process, and calculates summary statistics in the time and frequency domains for a point process, following Cox & Lewis (1966). Most statistics are obtained using CALCULATE, with FOURIER being used for ispectrum and CORRELATE for the pre-adjusted autocorrelations.

 

Action with RESTRICT

DATA may be restricted only if REPRESENTATION=time, in which case only the units not excluded by the restriction are involved in the analysis.

 

Reference

Cox, D.R. & Lewis, P.A.W. (1966). The Statistical Analysis of Series of Events. Methuen, London.

 

PTREMOVE procedure

Removes points interactively from a spatial point pattern

(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)

 

Options

PRINT = string What to print (summary, monitoring); default summ, moni

WINDOW = scalar Which graphics window to use for the plot; default 1

 

Parameters

OLDY = variates Vertical coordinates of each spatial point pattern; no default - this parameter must be set

OLDX = variates Horizontal coordinates of each spatial point pattern; no default - this parameter must be set

NEWY = variates Variates to receive the vertical coordinates of the original points minus the deleted points of each pattern

NEWX = variates Variates to receive the horizontal coordinates of the original points minus the deleted points of each pattern

 

Description

PTREMOVE uses the DREAD directive to delete points from a spatial point pattern. The coordinates of the existing points must be supplied using the parameters OLDX and OLDY. These points will be plotted on the current graphics device using DPTMAP with a pen setting of SYMBOLS=1. The WINDOW option may be used to specify the graphics window to use for the plot.

The operation of DREAD may vary slightly from one system to another. The Users' Note supplied with Genstat explains how to read points and terminate input on specific devices. The usual method for reading points is to click the left mouse button at the required position. The usual way to terminate input is to click the right mouse button. The points read using DREAD will be echoed using a pen setting of SYMBOLS=2. The coordinates of the new spatial point pattern containing the original points minus any points which have been deleted may be saved using the parameters NEWX and NEWY.

Printed output is controlled using the PRINT option. The settings available are monitoring (which prints the coordinates of the points to be deleted) and summary (which prints the coordinates of the new pattern consisting of the original points minus any that have been deleted under the headings NEWX and NEWY). The default setting is for both monitoring and summary.

 

Options: PRINT, WINDOW.

Parameters: OLDY, OLDX, NEWY, NEWX.

 

Method

A procedure PTCHECKXY is called to check that OLDX and OLDY have identical restrictions. DPTMAP is then used to draw a map of the original point pattern. The DREAD directive is used to read the coordinates of points to be deleted. Finally, the coordinates for the deleted points are removed from the original points using the SUBSET procedure and the coordinates of the undeleted points are stored in new variates.

 

Action with RESTRICT

If OLDX and OLDY are restricted, only the subset of values specified by the restriction will be included in the calculations.

 

QUANTILE procedure

Calculates quantiles of the values in a variate

(P.W. Lane)

 

Options

PRINT = string What to print (quantiles); default quan

PROPORTION = variate or scalar Proportions at which to calculate quantiles; default !(0,0.25,0.5,0.75,1)

 

Parameters

DATA = variates Values whose quantiles are required; this parameter must be specified

QUANTILE = variates or scalars Identifiers of structures to store results, if required

 

Description

Quantiles are statistics that characterize the distribution of a sample of numbers. A quantile q of a sample {x_i, i=1...n} can be formed for any proportion p in the range [0,1], and has the following properties:

1) at least the proportion p of {x_i} are less than or equal to q;

2) at least the proportion (1-p) of {x_i} are greater than or equal to q;

3) if, with the values sorted into ascending order, both q = x_i and q = x_{i+1} satisfy 1) and 2), then take q = (x_i + x_{i+1})/2.

Thus the quantile for proportion 0.5 is the median; for 0.0 it is the minimum and for 1.0 the maximum of the sample. By default, QUANTILE produces the five quantiles called the "five-number summary" of a sample, corresponding to the proportions 0.0, 0.25, 0.5, 0.75, 1.0. The option PROPORTION can be set to a scalar or variate to request other single quantiles or sets of quantiles. By default, QUANTILE prints the statistics, but this can be suppressed by setting option PRINT=*. The quantiles can be stored in a variate using the parameter QUANTILE.

 

Options: PRINT, PROPORTION. Parameters: DATA, QUANTILE.

 

Method

First, the values are sorted into ascending order. Then for each proportion, the two values that are candidates for the quantile are found, by counting from either end of the sorted list to leave the required number of values from that point in the list to the end. The quantiles are the means of the two values found.
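The method can be sketched in Python as follows. This is an illustration consistent with the definition of a quantile as a value with at least a proportion p of the sample at or below it and at least a proportion 1-p at or above it; it is not the procedure's own code. The function name is hypothetical, and the rounding of n × p is an added guard against floating-point error.

```python
import math


def quantiles(data, proportions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the quantile of data for each proportion in [0, 1]."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("data must not be empty")
    results = []
    for p in proportions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
        position = round(n * p, 9)  # guard against rounding error in n * p
        # Smallest rank with at least a proportion p of the values at or below it.
        lower = max(math.ceil(position), 1)
        # Largest rank with at least a proportion 1 - p of the values at or above it.
        upper = min(math.floor(position) + 1, n)
        results.append((values[lower - 1] + values[upper - 1]) / 2.0)
    return results


five_number = quantiles([7, 1, 3, 9, 5])
even_sample = quantiles([4, 1, 3, 2])
```

When n × p is not a whole number the two candidates coincide, so the mean is simply that value; when it is a whole number they are adjacent sorted values, which gives the usual median for samples of even size.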

 

Action with RESTRICT

If the DATA variate is restricted, the quantiles are formed only using the units that are not restricted out. The PROPORTION and QUANTILE variates must not be restricted.