RANK procedure 

Produces ranks, from the values in a variate, allowing for ties

(J.B. van Biezen & C.J.F. ter Braak)

 

Option

OMIT = string Whether units excluded by a restriction on the DATA variate should be omitted from the RANKS variate (restricted); default *, i.e. the units are not omitted, and their values are left unchanged

 

Parameters

DATA = variates Variate containing values to be ranked

RANKS = variates Variate to save vector of ranks

TIESIZE = variates Variate to save the sizes of ties

 

Description

RANK calculates ranks of the values in a variate, allowing for ties. The variate must be specified by the DATA parameter, and the ranks saved using the RANKS parameter. The input variates in the parameter DATA must each have at least one non-missing value. Missing values in the DATA variates give corresponding missing values in the RANKS variates. The TIESIZE parameter can save the number of times each value occurs (starting with the lowest value).

The OMIT option controls whether the RANKS variate omits units that are excluded by a restriction on the DATA variate. By default, the values in these units are left unchanged. However, if OMIT=restricted, the RANKS variate is compressed to omit the excluded units; this setting is used particularly by the nonparametric procedures.

 

Option: OMIT. Parameters: DATA, RANKS, TIESIZE.

 

Method

The procedure uses the SORT directive to discover the number of distinct values in the input variate, and then uses TABULATE to obtain the number of times each value occurs - i.e. the size of ties. The tie-corrected rank numbers are then calculated. From these, the vector of ranks is obtained by modifying the levels of the factor that resulted from the first SORT.

 

Action with RESTRICT

The variates in DATA can be restricted, and in different ways. RANK operates on the restricted set only. If OMIT=restricted, the length of RANKS will be the size of the restricted set of the DATA variate. If OMIT is unset, the RANKS variate is of the same length as the DATA variate.

 

RCHECK procedure

Checks the fit of a linear, generalized linear or nonlinear regression

(P.W. Lane, R. Cunningham & C. Donnelly)

 

Options

PRINT = strings What to print (index, y, residual, leverage, Cook); default *

RMETHOD = string Type of residual to use (deviance, Pearson, simple, deletion); default devi

INDEX = variate Which variate to use as index; default !(1...n)

RESIDUAL = variate To store chosen type of residuals; default *

LEVERAGE = variate To store leverages; default *

COOK = variate To store modified Cook's statistics; default *

GRAPHICS = string Type of graphics to use (lineprinter, highresolution); default high

TITLE = text Title for graph; default identifier of response

WINDOW = numbers Window or series of windows in which to display graphs; default 4, or 5...8 for composite

SCREEN = string Treatment of previous graphics screen (clear, keep); default clea

SAVE = regression save structure Specifies which model to check; default *

 

Parameters

YSTATISTIC = strings What to display in the graph (residual, Cook, leverage); default resi

XMETHOD = strings What type of graph (fittedvalues, index, normal, halfnormal, histogram, composite); default comp

 

Description

Procedure RCHECK provides "diagnostic" information for checking the fit of regression models. The regression directives check only for large residuals and influential points, giving access to standardized residuals and leverages through directive RKEEP. RCHECK automatically accesses these quantities via RKEEP and in addition can calculate deletion residuals and modified Cook's statistics. A range of graphs can then be drawn to help check the fit of the regression model.

The procedure is controlled by the YSTATISTIC and XMETHOD parameters. These can be set to display various types of residuals, the leverages, or the modified Cook's statistics as simple plots against fitted values or against an index variate, or as Normal or half-Normal plots, or as a histogram. A set of four such plots is displayed as a composite picture: histogram, plot against fitted values, Normal plot and half-Normal plot. In addition, the chosen type of residuals, the leverages and Cook's statistics can be stored in variates, and any calculated quantities can be printed.

Plots of the residuals against fitted values or an index variate are displayed with a smoothed line fitted through the points, to indicate any potential trend.

The graphical displays can be controlled as usual using the TITLE, WINDOWS and SCREEN options. The colours and symbols used in the displays can be controlled by setting the attributes of the following pens with the PEN directive before calling the procedure:

pen 2 zero lines in fitted-value, Normal and index plots,

pen 3 points and histogram bars,

pen 4 smooth line in fitted-value and index plots of residuals.

The procedure exits if there are fewer than four observations, or fewer than two non-missing standardized residuals.

 

Options: PRINT, RMETHOD, INDEX, RESIDUAL, LEVERAGE, COOK, GRAPHICS, TITLE, WINDOW, SCREEN, SAVE.

Parameters: YSTATISTIC, XMETHOD.

 

Method

Standardized residuals and leverages are accessed using RKEEP from the latest fitted regression model, or from that specified by the SAVE option. Deletion residuals di are calculated as follows:

di = ri /Ö ((n-p-ri2)/(n-p-1))

where ri are the standardized residuals, n is the number of observations, and p is the number of parameters in the model. For generalized linear models other than linear regression,

di = SIGN(rdi) ´ Ö ((1-li) ´ rdi2 + li) ´ rpi2)

where rdi and rpi are the standardized deviance and Pearson residuals respectively. Modified Cook's statistics ci are calculated as follows:

ci = ABS(di) ´ Ö { (n-p) ´ li / (p ´ (1-li)) }

where li are the leverages. In Normal plots, the Normal quantiles are calculated as follows:

qi = NED( (i-0.375) / (n+0.25) )

while for a half-Normal plot they are given by

qi = NED( 0.5 + 0.5 ´ (i-0.375) / (n+0.25) )

For generalized linear models, fitted values are transformed by an approximate variance-stabilizing transformation before use in graphs:

Poisson, multinomial, negative binomial and geometric 2 ´ SQRT(fitted)

binomial, Bernoulli 2 ´ ANG(100 ´ fitted / nbinomial)

gamma, exponential LOG(fitted)

inverse Normal 1 / fitted

The smoothed line displayed for fitted-value or index plots is calculated as a cubic smoothing spline with (40+n)/25 degrees of freedom.

 

Action with RESTRICT

Restrictions applied to vectors used in the regression apply also to the RCHECK procedure. Values of diagnostic quantities are set to missing for all excluded units.

 

RGRAPH procedure

Draws a graph to display the fit of a regression model

(P.W. Lane)

 

Options

GRAPHICS = string Type of graphics to produce (lineprinter, highresolution); default high

TITLE = text Title for the graph; default 'Fitted and observed relationship'

WINDOW = number Which high-resolution graphics window to use; default 4 (redefined if necessary to fill the frame)

SCREEN = string Whether to clear the graphics screen before plotting (clear, keep); default clea

SAVE = regression save structure Save structure of the model to display; default * uses the most recently fitted regression model

 

Parameters

INDEX = variate Which explanatory variate to display; default * if GROUPS is set, otherwise INDEX is set to the first variate in the fitted model (must be set for nonlinear models other than standard curves)

GROUPS = factor Which explanatory factor to display; default * if INDEX is set, otherwise GROUPS is set to the first factor in the fitted model (ignored for nonlinear models)

 

Description

Procedure RGRAPH displays the fit of either a linear regression, a generalized linear model, a generalized additive model, a standard curve or a nonlinear model.

For models other than the nonlinear models fitted by FITNONLINEAR or FIT with the CALCULATION option set, the graph shows the relationship between the response variate and either one explanatory variate or one explanatory factor or one of each. If no parameters are set, RGRAPH takes the first explanatory variate and the first factor in the model, and the predicted relationship is represented by a line for each level of the factor. The display represents the observed relationship as points, plotting the response (adjusted for further explanatory terms in the model, if any) against the chosen explanatory variate, with each point labelled according to the corresponding factor level. If no factor has been fitted a single line is drawn, while if no variate has been fitted the graph simply shows the predicted mean for each level of the factor.

If a linear, generalized linear, or generalized additive model has been fitted, the INDEX and GROUPS parameters can be used to specify which explanatory variate and factor, respectively, should be used. If INDEX is set and GROUPS is not, a single line is drawn even if there are factors in the model; similarly if GROUPS is set and INDEX is not, the effect of the factor alone is shown.

For nonlinear models fitted by the FITNONLINEAR directive, a single line is drawn by joining the fitted values, and the response values are shown as points. Any setting of the GROUPS parameter is ignored. For curves fitted by the FITCURVE directive, settings of the INDEX and GROUPS parameters are ignored.

No graph can be drawn if the REG function has been used for any variate in the model. If the SSPLINE function has been used for any variate whose relationship with the response is not actually displayed, then the only adjustment for its effect will be the linear component of the fitted smooth curve. If the displayed variate itself is smoothed, then the curve is formed by interpolation between adjusted fitted values. The POL function is dealt with correctly.

The TITLE option can be used to supply a title for the graph. By default the graph is plotted on the current high-resolution device, but the GRAPHICS option can be set to line for a line printer plot. The WINDOW option can be used to select a pre-defined window for high-resolution plots; otherwise window 4 is used, and is redefined if necessary to fill the frame. The SCREEN option allows the graph to be added to an existing high-resolution plot. The colours and symbols used in the displays can be controlled by setting the attributes of the following pens with the PEN directive before calling the procedure:

pen 1 labels for lines when drawn for each level of a factor,

pen 2 fitted lines and means,

pen 3 points.

By default the current regression model is displayed, but option SAVE can be set to specify the save structure (from a MODEL statement) of some other model.

 

Options: GRAPHICS, TITLE, WINDOW, SCREEN, SAVE.

Parameters: INDEX, GROUPS.

 

Method

For a linear or generalized linear model, fitted lines are drawn by joining predicted values calculated by a PREDICT statement at 21 equally spaced values spanning the range of the explanatory variate. Alternatively, PREDICT provides adjusted means at each level of the factor, if the effect of the factor is to be displayed alone. If all the effects in the current model are displayed, the response is plotted against the explanatory variable (or against the factor level) as points. But if adjustment has to be made for some effects, adjusted response values (known as "partial residuals") are calculated by adding the simple residuals to predictions produced for all observations.

For generalized additive models, no predictions are used if the explanatory variate is smoothed: the nonlinear component of the smooth is extracted using RKEEP. For curves and general nonlinear models no predictions are made, and fitted values are used.

If a linear or generalized linear model is constrained to pass through the origin, the display will extend the range of the explanatory if necessary to include the origin.

 

Action with RESTRICT

Any restriction that was in force when the model was fitted will apply also to the graphs. Problems may occur, however, if the response variate is not restricted and an explanatory variate or factor is restricted.

 

RIDGE procedure

Produces ridge regression and principal component regression analyses

(A.J. Rook & M.S. Dhanoa)

 

Options

PRINT = string What to print (correlation, pcp, ridge); default corr

PLOT = string Graphical output required (ridgetrace); default *

 

Parameters

Y = variates Response variate in regression model

X = pointers Containing explanatory variates in regression model

 

Description

Procedure RIDGE produces analyses for identifying and overcoming collinearity among the independent variates in a multiple regression analysis. The correlation matrix, variance inflation factors (the diagonal elements of the inverse of the correlation matrix) and the ratio of the squared error in the least squares regression coefficients to the expected squared error in orthogonal data are calculated. Principal component regressions excluding 1, 2 or 3 minor principal axes are calculated and transformed back to the original variables on either the original or standardized scale. The "Positive correlation spread association" (PCSA) (Vinod 1976) is also calculated. This is an overall measure of the suitability of the data for the application of principal component regression and ridge regression. Ridge regressions (Hoerl & Kennard 1970) are calculated and the ridge coefficients are printed together with 2 indices of stability proposed by Vinod (1976): the index of stability of relative magnitudes (ISRM) and the numerical largeness of more significant regression coefficients (NLMS). These are 0 and 1 respectively in orthogonal data. High-resolution graphs of the ridge trace can be plotted against Hoerl & Kennard's k scale and Vinod's m scale.

The parameters of the procedure are used to input the data: the Y parameter supplies the y-variate, and the X parameter specifies a pointer containing the x-variates. None of these variates must be restricted nor contain missing values.

Printed output is controlled by the PRINT option: correlation prints the correlation matrix, variance inflation factors and ratio of squared error to that in orthogonal data, pcp prints principal component analysis and principal component regression, and ridge prints ridge coefficients and stability parameters.

Graphical output is controlled by the PLOT option: ridgetrace produces ridge traces.

 

Options: PRINT, PLOT. Parameters: Y, X.

 

Method

The correlation matrix is produced using the CORRELATE directive. This is then used to calculate the variance inflation factors (VIF) and the ratio of the squared error to that in orthogonal data (RL).

Principal component analysis is carried out using the PCP directive. The standardized response variable is regressed on the principal component scores using MODEL, FIT and TERMS directives. The coefficients are then transformed back to the original variables on either the standardized or original scale with up to three principal components excluded. The correlations of the standardized variable with each of the principal components are also printed.

Ridge regression is carried out as described by Hoerl & Kennard (1970). Ridge coefficients on both the standardized and original scales are printed for values of the biasing parameter k between 0 and 1 together with the standard errors of the coefficients on the standardized scale. Residual sums of squares (RSS) R-squared and total variance of the ridge coefficients (TVARB) are printed for each value of k. In addition Vinod's (1976) m scale is printed together with the index of stability of relative magnitudes (ISRM) and the numerical largeness of the more significant regression coefficients (NLMS).

High-resolution graphs of the ridge traces are produced. These are graphs of the ridge coefficients against k or against m and are plotted to the device set up prior to calling the procedure.

 

Action with RESTRICT

None of the input variates must be restricted.

 

References

Chatterjee, S. & Price, B. (1991). Regression Analysis by Example. 2nd Edition. New York, Wiley.

Hoerl, A.E. & Kennard, R.W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.

Vinod, H.D. (1976). Application of new ridge regression methods to a study of Bell system scale economies. J.A.S.A., 71, 835-841.

 

RJOINT procedure

Does modified joint regression analysis for variety-by-environment data

(P.W. Lane & K. Ryder)

 

Options

PRINT = strings What to print (model, summary, estimates, monitoring); default mode, summ, esti

TOLERANCE = scalar Convergence criterion; default 0.001

MAXCYCLE = scalar Maximum number of cycles; default 15

SAVE = regression save structure Save structure from MODEL statement defining the model; default is to use the structure from the latest MODEL statement

 

Parameters

ENVIRONMENT = factors The environment factor; no default

VARIETY = factors The variety factor; no default

SENSITIVITIES = variates To store estimates of sensitivities; default *

VARMEANS = variates To store estimates of variety means; default *

ENVEFFECTS = variates To store estimates of environment effects; default *

ENVMEANS = variates To store estimates of environment means; default *

SESENSITIV = variates To store s.e.s of sensitivities; default *

SEVARMEANS = variates To store s.e.s of variety means; default *

SEENVEFFECTS = variates To store s.e.s of environment effects; default *

 

Description

Procedure RJOINT performs a modified joint regression analysis of data classified by two factors. This analysis is motivated by the study of variety-by-environment interactions in agriculture, where the two factors are varieties of some crop and environments at which experiments were carried out. The environments may be different sites within the same year, different years for the same site, or, as is more common, a combination of the two with little interest in individual year and site contributions. The intention is to characterize the sensitivity (or, inversely, the stability) of each variety to environmental effects by fitting a regression of the environment means for a variety on the average environment means. The model is thus nonlinear, of the form

yij = vi + bi ´ ej + error

where vi are variety means, ej are environment effects (with S ej =0) and bi are the sensitivity parameters (with mean(bi )=1). Usually, an experimenter is looking for varieties with large means and small sensitivities, to ensure a reliable crop under variable conditions. In RJOINT the factors are specified using the parameters VARIETY and ENVIRONMENT.

The data may consist of one value of the response, such as yield, for each combination of variety and yield. More often, the data are incomplete because not all varieties are tested at each environment; also, there may be multiple measurements of varieties at some environments. If the response is a count or a proportion, such as when investigating disease resistance, it will be more appropriate to use a generalized linear model based on a Poisson or binomial distribution and a log or logit link function. The model should be specified by giving a MODEL statement before calling RJOINT; for example,

MODEL yield

You can choose to fit any generalized linear model by setting the DISTRIBUTION and LINK options of MODEL: thus, to model proportions, you could give a statement like

MODEL [DISTRIBUTION=binomial; LINK=logit; DISPERSION=*] prop; \

NBINOMIAL=100

The iterative process used in the procedure is controlled by the options TOLERANCE and MAXCYCLE. At each iteration, the maximum difference between estimates of the sensitivity parameters in successive iterations is compared to the tolerance: the process ends when the differences are small enough, or when the maximum number of iterations is reached. The progress of the search can be followed by including the monitoring setting of the PRINT option.

Output is controlled by the option PRINT. The setting model prints a description of the model. The setting summary displays an analysis of variance (or deviance for non-Normal distributions) showing the effects of Varieties, Environments and Sensitivities: the last is the effect of allowing different sensitivities for each variety. The setting estimates displays two tables, one classified by varieties and the other by environments. For varieties the columns are: unadjusted means, final estimates of means (on the scale of the link function, if relevant), standard errors of estimates, back-transformed means (using the inverse of the link function), sensitivities, and standard errors of sensitivities. For environments the columns are: estimates of effects (on the scale of the link function, if relevant), standard errors of estimates, means (formed from the effects and the mean of the variety means), and back-transformed means (using the inverse of the link function).

The remaining parameters allow the following results to be saved: sensitivities, variety means, environment effects, environment means, and standard errors of sensitivities, variety means and environment effects. After calling the procedure, you can use the RKEEP directive to access fitted values and residuals. Other results from the fit that can be accessed via RKEEP or RDISPLAY may not be correct: for example, the number of residual d.f. shown by

RDISPLAY [PRINT=summary]

does not allow for the estimation of sensitivities.

 

Options: PRINT, TOLERANCE, MAXCYCLE, SAVE.

Parameters: ENVIRONMENT, VARIETY, SENSITIVITIES, VARMEANS, ENVEFFECTS, ENVMEANS, SESENSITIVIT, SEVARMEANS, SEENVEFFECTS.

 

Method

The procedure uses iterative scheme (A) referred to in Digby (1979). The scheme has been generalized to deal with alternative distributions and link functions. First the environment effects are estimated with the sensitivity parameters set to 1, and then the procedure alternates between estimating the sensitivities with given environment effects and estimating environment effects with given sensitivities. Convergence is tested by comparing the maximum difference between old and new sensitivities against the criterion (default 0.001), but the maximum number of cycles (default 15) will not be exceeded. If the MAXCYCLE option is set to 1, the result is an unmodified joint regression analysis; see Finlay & Wilkinson (1963).

 

Action with RESTRICT

A restriction applied to the response variate will be taken into account. Residuals and fitted values will be formed only for the restricted subset of values. If levels of the factors are not represented in the restricted subset, then no results will be shown for those varieties and/or environments. Do not restrict the environment or variety factor differently to the response variate: results may then be incorrect.

 

References

Digby, P.G.N. (1979). Modified joint regression analysis for incomplete variety ´ environment data, Journal of Agricultural Science, Cambridge 93, 81-86.

Finlay, K.W. & Wilkinson, G.N. (1963). The analysis of adaptation in a plant-breeding programme, Australian Journal of Agricultural Research 14, 742-754.

 

ROBSSPM procedure

Forms robust estimates of sum-of-squares-and-products matrices

(P.G.N. Digby)

 

Options

PRINT = strings Controls printed output (sspm, distances, weights, vcovariance, means, correlations, outliers); default * i.e. no output

B1 = scalar The value from which the threshold distance is derived (see the Method Section); default 2

B2 = scalar The value indicating the decline in weight as the distance of a unit above the threshold increases, (see the Method Section); default 1.25

MAXCYCLE = scalar Maximum number of iterations; default 100

TOLERANCE = scalar The minimum change in the average squared-weight that has to be achieved for the iterative process to converge; default 1.0-8

 

Parameters

DATA = pointers Supplies the set of variates in each datamatrix

SSPM = SSPMs SSPM structure to contain the robust estimates of the sums of squares and products, the robust estimates of the means, and the sum of the weights for each datamatrix

DISTANCES = variates To contain the Mahalanobis distances of the units from the mean

WEIGHTS = variates To contain the weights used for each unit when forming the robust estimates

VCOVARIANCE = symmetric matrices

To contain the robust estimates of the matrices of variances and covariances

CORRELATIONS = symmetric matrices

This contains on output the correlations from the robust estimates of the variances and covariances

 

Description

ROBSSPM forms robust estimates of SSPMs, and the related variance-covariance and correlation matrices, using the method of Campbell (1980). This weights the units differentially so that those that are extreme, in a multivariate sense, contribute less to the calculated means and sums of squares and products. The extremeness of a unit is judged by its Mahalanobis distance from the estimated mean.

The input variates are specified, in a pointer, by the DATA parameter. They may be restricted or may contain some missing values, in which case the units concerned will be ignored.

Output is controlled by the PRINT option, with settings: sspm prints the estimated sums-of-squares-and-products, the estimated means, and the sum of the weights; distances prints the Mahalanobis distances for all the units, including any excluded by restrictions; weights prints the weights for all the units; vcovariance prints the estimated variance-covariance matrix;

means prints the estimated means; correlations prints correlations derived from the variance-covariance matrix; outliers prints unit numbers, weights, and distances for outliers. By default there is no printed output.

If the outliers, weights or distances are to be printed then an appropriate summary of the number of units, number of outliers and so on will be printed too. The outlier information consists of the unit numbers, weights and Mahalanobis distances, printed across the page.

The weight given to each unit in forming the robust estimates is one if the unit's Mahalanobis distance from the mean is less than some threshold distance, and it decreases as the Mahalanobis distance increases above that threshold. The threshold and the form of the decrease in weight are controlled by options B1 and B2, which correspond to the corresponding quantities in the functions used by Campbell (1980), as explained in the Methods Section. By default, B1=2 and B2=1.25.

The estimation process is iterative, with the maximum number of iterations controlled by the MAXCYCLE option (default 100). It converges when the average change in the weights is less than some tolerance. The default tolerance is 1.0-8, but this can be redefined by the TOLERANCE option. Lack of convergence usually indicates some problem with the data, perhaps that the threshold has been set too low.

Parameters SSPM, DISTANCES, WEIGHTS, VCOVARIANCE and CORRELATIONS allow the various components of the output to be saved.

 

Options: PRINT, B1, B2, MAXCYCLE, TOLERANCE.

Parameters: DATA, SSPM, DISTANCES, WEIGHTS, VCOVARIANCE, CORRELATIONS.

 

Method

Initial (unweighted) estimates of the means and sums of squares and products are formed from all the units, subject to any restriction on the data and excluding any units with missing values for any of the variates. From the estimates, Mahalanobis distances of the units from their means are calculated, and used to determine the weights for the units. The weights are then used to reform the SSPM structure, new distances are calculated, and so on. Convergence occurs when the average change in the derived weights is less than the defined tolerance.

The weight w of each unit is given by

w = 1 d £ t

W = (t/d) ´ exp( -0.5 ´ (d-t)2 / B22 )d > t

where t, the threshold distance, is given by

t = Ö v + B1 / Ö 2

As explained by Campbell (1980), under Fisher's square root approximation, B1 equates to a percentage point of the standard Gaussian distribution.

Campbell (1980) regards three possibilities as potentially most useful. If B1 is infinite, the usual (non-robust) estimates are obtained. With B1=2 and B2 infinite, the weight decreases inversely with distance (w=t/d); this can be obtained in the procedure by setting B2 to a missing value. Finally, there is the combination used as a default by ROBSSPM, namely B1=2 and B2=1.25.

 

Action with RESTRICT

If the DATA variates are restricted only the units not excluded by the restriction will be used in the estimation process. However, Mahalanobis distances will be formed for all units other than those where any of the variates is missing.

 

Reference

Campbell, N.A. (1980). Robust procedures in multivariate analysis I: robust covariance estimation. Applied Statistics 29, 231-237.

 

RPAIR procedure

Gives t-tests for all pairwise differences of means from a regression or GLM

(J.T.N.M. Thissen & P.W. Goedhart)

 

Options

PRINT = strings What to print (differences, sed, tvalues, tprobabilities); default diff, sed, tval

SORT = string Whether to sort the means into ascending order (no, yes); default no

COMBINATIONS = string Which combinations of factors in the current model to include (full, present); default full (similar to the PREDICT directive)

ADJUSTMENT = string Type of adjustment with linear regression models (marginal, equal); default marg (similar to the PREDICT directive)

WEIGHTS = table Weights classified by some or all standardizing factors; default * (similar to the PREDICT directive)

METHOD = string Method of forming margin (mean, total); default mean (similar to the PREDICT directive)

ALIASING = string How to deal with aliased parameters (fault, ignore); default faul (similar to the PREDICT directive)

SAVE = identifier Specifies save structure of model to display; default * (i.e. that of the latest model fitted)

 

Parameters

TREATFACT = pointers Each pointer contains a list of treatment factors classifying the table of means to be compared (the right-most factor changes fastest, then the second from the right, etc.); this parameter must be set

LABELS = texts Structures containing strings to label rows (and columns) of the symmetric matrices of pairwise differences etc; the length of the text must equal the product of the numbers of factor levels as implied by the factor list in the TREATFACT pointer

NEWLABELS = texts To save the row labels of the DIFFERENCES, SED, TVALUES and TPROBABILITIES matrices

DIFFERENCES = symmetric matrices

To save pairwise differences (treatment means on the diagonal)

SED = symmetric matrices To save standard errors of the pairwise differences (missing values on the diagonal)

TVALUES = symmetric matrices To save t-values (missing values on the diagonal)

TPROBABILITIES = symmetric matrices

To save t-probabilities (missing values on the diagonal)

 

Description

When analysing a (non-orthogonal) analysis of variance model or a generalized linear model (GLM) with the regression directives FIT, ADD etc., effects of factors in the model and their interactions if required, may be assessed from a suitable analysis of variance (deviance) table. With the PREDICT directive tables of estimated means and their standard errors can be obtained, but not standard errors of differences of means. The RPAIR procedure provides additional information on such tables by calculating t-values and corresponding two-sided t-probabilities for tests of all pairwise differences of means.

The t-statistics used are based on the residual variance (deviance) and its degrees of freedom from the current regression model. However, if the DISPERSION option of the MODEL directive has been set to a numerical value (as is by default the case with a GLM with binomial, poisson or multinomial distribution), the degrees of freedom are set to 10000, approximating to the normal distribution.

It is assumed that the MODEL statement for the regression has defined only one response variate.

The TREATFACT parameter must be set to a pointer containing a list of factors classifying the table of means which are to be compared.

The PRINT option controls the output. By default a symmetric matrix of pairwise differences of means is printed with the means themselves down the diagonal. With a GLM these means and their pairwise differences are always calculated on the linear scale. The corresponding symmetric matrices of standard errors and of t-values are printed by default too.

The matrix rows (and columns) are ordered such that the right-most factor changes fastest, then the second from the right, etc. This default order can be changed by setting the SORT option to yes, in which case rows and columns of all matrices are rearranged to put the means on the diagonal of the matrix of differences into ascending order.

The LABELS parameter can be used to label the rows and columns of the matrices, which are then taken in default order. When the LABELS parameter has not been set and the TREATFACT pointer contains just one factor, by default the labels or levels of the factor are used for labeling; when the pointer contains more than one factor, the default row (and column) labels are combinations of factor settings indicated by the first letter of the factor identifier followed by an ordinal level.

The DIFFERENCES, SED, TVALUES and TPROBABILITIES parameters can be used to save the output. The row labels of these matrices can be saved through the NEWLABELS parameter.

The COMBINATIONS, ADJUSTMENT, WEIGHTS, METHOD, ALIASING and SAVE options are as in the PREDICT directive.

 

Options: PRINT, SORT, COMBINATIONS, ADJUSTMENT, WEIGHTS, METHOD, ALIASING, SAVE.

Parameters: TREATFACT, LABELS, NEWLABELS, DIFFERENCES, SED, TVALUES, TPROBABILITIES.

 

Method

The procedure uses the PREDICT directive to save a table of predictions and corresponding variance-covariance matrix. With a GLM the setting of option BACKTRANSFORM of the PREDICT directive is always none.

 

Action with RESTRICT

Any restrictions applied to vectors used in the regression apply also to the results from RPAIR.

 

RPROPORTIONAL procedure

Fits the proportional hazards model to survival data as a GLM

(R.W. Payne)

 

Options

PRINT = strings Controls printed output (model, deviance, summary, estimates, correlations, fittedvalues, accumulated, monitoring); default mode, summ, esti

TERMS = formula Defines the model to fit

SUBJECTS = factor Subject corresponding to each observation

TIMES = factor or variate Time of each observation

CENSORED = variate Contains the value 1 for censored observations, otherwise 0; if unset it is assumed that there is no censoring

OFFSET = variate Offset to include in the model

INDEX = variate Mapping variate used to produce the expanded variables

RISKSET = factor Saves the expanded time factor

_2LOGLIKELIHOOD = scalar Saves -2 ´ log likelihood for the fitted model

 

Parameters

X = variates or factors X-variables involved in the model to be fitted

NEWX = variates or factors Identifiers to store the expanded x-variables

 

Description

The data for RPROPORTIONAL consist of a set of subjects observed at one or more times. The final time is usually at the time of death (or failure), otherwise (if the subject survives the trial) the observation is said to be censored. The CENSORED option can be used to specify a variate with an entry for each subject containing one when there is censoring, otherwise zero. If this is not specified, it is assumed that there is no censoring.

The proportional hazards model (Cox 1972) makes the assumption that the subjects have a baseline hazard function which is modified proportionally by treatments and covariates. In RPROPORTIONAL it is assumed that the survival times follow a piecewise exponential distribution (Breslow 1974). This partitions the time axis using a set of discrete cut-points ai, and assumes a constant baseline hazard λi between each one. This corresponds to an exponential distribution with mean 1/λi for the survival times (in the absence of treatments) within each time interval. A cut-point is defined at every time that a death (or failure) occurs and, if the covariates or treatments vary with time, also at every time when the subjects are observed.

To fit this model as a generalized linear model, the vectors in the model must be expanded so that, for each subject, there is a unit for every time interval up to the last one during which the subject was observed. If (as usually happens) the subject was not observed at every cutpoint, covariates and treatments are taken to be constant during the intervals between the times of the observations. RPROPORTIONAL automatically produces the expanded vectors if required. The INDEX option allows the variate of indexes used to produce the expanded variables from the original ones to be saved.

The SUBJECTS option can specify a factor to indicate the subject corresponding to each observation; this can be omitted if there is only one observation per subject. The time at which each observation was made is specified by the TIME option, in either a factor or a variate. The factors and variates that appear in the model are listed by the X parameter, and new identifiers can be supplied using the NEWX parameter; otherwise the expanded values overwrite the original values. An expanded factor indicating the time interval corresponding to each unit of the NEWX vectors, can be saved using the RISKSET option.

The model to be fitted is specified using the TERMS option, using the new identifiers (if specified) rather than the original ones. The y-variate used within the generalized linear model is an indicator that takes the value 0 if the subject was still surviving within the time interval concerned, otherwise it has the value 1. The model contains an offset representing the log of the exposure time within each interval. Any additional offset can be specified, if required, using the OFFSET option. RPROPORTIONAL contains a TERMS directive, so that you can set the TERMS option to the full model, and then investigate smaller models in the usual way. The deviance produced for the regression model can be compared with a Chi-square distribution as usual, but the residual deviance is not usable as the maximal model assumed by the generalized linear models method is inappropriate. To assist with the comparison of sequences of models, the _2LOGLIKELIHOOD option allows -2 ´ the log likelihood to be saved.

The PRINT option controls printed output, with the same settings as for the FIT directive. After using RPROPORTIONAL, RDISPLAY can be used to obtain further output and RKEEP can be used to save information, in the usual way.

 

Options: PRINT, TERMS, SUBJECTS, TIMES, CENSORED, OFFSET, INDEX, RISKSET, _2LOGLIKELIHOOD.

Parameters: X, NEWX.

 

Method

Full details of the method can be found in Aitkin et al. (1989).

 

Action with RESTRICT

None of the vectors must be restricted, and any restrictions will be

cancelled.

 

References

Aitkin, M., Anderson, A., Francis, B. and Hinde, J. (1989). Statistical Modelling in GLIM. Oxford University Press.

Breslow. N. (1974). Covariance analysis of censored survival data. Biometrics, 30, 89-99.

Cox, D.R. (1972). Regression models and life tables (with discussion). J.R.Statist.Soc. B, 34, 187-220.

 

RSURVIVAL procedure

Models survival times of exponential, Weibull or extreme-value distributions

(R.W. Payne)

 

Options

PRINT = strings Controls printed output (model, deviance, summary, estimates, correlations, fittedvalues, accumulated, monitoring); default mode, summ, esti

TIMES = factor Time of each observation

DISTRIBUTION = string Distribution of the survival times (exponential, weibull, extremevalue); default expo

CENSORED = variate Contains the value 1 if the observation is censored and subject survived the whole trial, otherwise 1

GRAPHICS = string Controls the plotting of diagnostic graphs of the empirical survivor function against the estimate produced by the model (lineprinter, highresolution) default * i.e. none

ALPHA = scalar Saves the estimated value of the parameter α of the Weibull and extreme-value distributions, if the scalar is input with a non-missing value this provides the initial estimate for α (which will also be the final estimate if MAXCYCLE=1)

_2LOGLIKELIHOOD = scalar Saves -2 multiplied by the log likelihood

MAXCYCLE = scalar Maximum number of iterations to use to estimate α; default 20

TOLERANCE = scalar Convergence limit for α; default 10-5

 

Parameter

TERMS = formula Defines the model to fit

 

Description

RSURVIVAL models survival times assuming that they follow either an exponential, Weibull or extreme-value distribution, as indicated by the DISTRIBUTION option. It also caters for censored observations, where the subject concerned survived the trial: the CENSORED option can be used to specify a variate with an entry for each subject containing one where the subject survived, otherwise zero. The model to be fitted to the survival times is specified using the TERMS parameter.

The analysis is performed using the generalized linear models facilities of Genstat, with a y-variate indicating whether the subject died or survived, and an offset variate which depends on the time variate (see Aitkin et al. 1989). For the exponential distribution this offset is simply the logarithm of the times. With the Weibull distribution it is the Weibull parameter α multiplied by the logarithm of the times, while for the extreme-value distribution it is the parameter α multiplied by the times. The parameters of the TERMS model and α itself are estimated alternately (with number of cycles controlled by the MAXCYCLE option) until successive estimates are within a tolerance specified by the TOLERANCE option. The ALPHA option can input an initial value for α and save the estimated value. By setting the MAXCYCLE option to one, α can be fixed at the initial value; this is useful for comparing one model with another, when the value of α should be fixed at the value estimated from the more complicated model. The _2LOGLIKELIHOOD option allows -2 ´ log likelihood to be saved.

Printed output from the generalized linear model analysis is controlled by the PRINT option. Further information can be printed subsequently by using RDISPLAY in the usual way. Also, the GRAPHICS option can be set to either high or line to produce either high-resolution or lineprinter plots of the empirical survivor function against the value predicted by the model (see Aitken et al. 1989, pages 275-276).

 

Options: PRINT, TIMES, DISTRIBUTION, CENSORED, GRAPHICS, ALPHA, _2LOGLIKELIHOOD, MAXCYCLE, TOLERANCE.

Parameters: TERMS.

 

Method

Full details of the method can be found in Aitkin et al. (1989).

 

Action with RESTRICT

The vectors involved in the analysis may be restricted as usual for a generalized linear model.

 

Reference

Aitkin, M., Anderson, A., Francis, B. and Hinde, J. (1989). Statistical Modelling in GLIM. Oxford University Press.

 

RUGPLOT procedure

Draws "rugplots" to display the distribution of one or more samples

(P.W. Lane)

 

Options

GRAPHICS = string What type of graphics to use (highresolution, lineprinter); default high

TITLE = text Title for diagram; default *

AXISTITLE = text Title for axis; default *

WINDOW = scalar Window in which to draw high-resolution plot; default *, taken as 11 if SCREEN=clear, or 1 if SCREEN=keep

SCREEN = string Whether to clear screen before high-resolution plot (clear, keep); default clea

ORIENTATION = string Orientation of plots (down, across); default down

JITTER = number Ratio of jitter width to range of data in high-resolution plot; default 0.01

SEED = number Seed for generating random numbers used in jittering; default 0, i.e. continue from last generation, or initialize from system clock

 

Parameters

DATA = variates Data to be summarized; no default

GROUPS = factor Factor to divide values of a single variate into groups; default *

RUGLABELS = texts Labels for individual rugs; default *, i.e. identifiers of variates or labels or levels of factor

POSITION = scalar or variate Position on x-axis (or on y-axis if ORIENTATION=across) at which to plot each rug; if GROUPS is set, positions for each level of the factor are taken from a variate; default is to draw a single rug on the axis, and to spread multiple rugs across the window

 

Description

RUGPLOT draws pictures to display the distribution of one or more sets of data. In the simplest case, with the DATA parameter set to a single variate, RUGPLOT will draw a single vertical "rug": that is, a series of short horizontal lines on the vertical axis, positioned at each value of the variate. The option ORIENTATION=across produces a horizontal rug. A rug can be added to an existing plot by specifying SCREEN=keep, and setting the WINDOW option to specify the window where the rug is to be drawn. With SCREEN=keep, the default window is 1; with SCREEN=clear, window 11 is used after defining it to fill the whole graphical frame.

If several variates are supplied, a rug is drawn for each of them using the same scale. Alternatively, if a single variate is specified by the DATA parameter, a factor with the same number of values as the variate may be defined by the GROUPS parameter, and a box will be drawn for each level of the factor. The rug plots are spread out across the window by default. The POSITION parameter can be set to specify where each rug is to be positioned on the x-axis (or y-axis if ORIENTATION=across). The setting should be in the range (0, n) for a plot with SCREEN=clear, where n is the number of rugs to be drawn; with SCREEN=keep, the position should be specified in the units of the axis last drawn in the window.

Line-printer rugplots can be drawn by setting option GRAPHICS=lineprinter. The plot is drawn with asterisks, or digits to represent points that are effectively coincident. If the page size is small, as in interactive mode, line-printer plots with ORIENTATION=down are very cramped: the PAGE option of the OUTPUT directive can be used to increase the depth of the graphs. The option ORIENTATION=down cannot be selected for line-printer plots with more than 14 rugs.

The TITLE and AXISTITLE options can be set to specify the titles displayed at the top of the plot and along the axis, for either graphics mode. The RUGLABELS parameter allows you to specify labels that will identify each rug, in place of the default labels taken from the variate identifiers, or factor labels or levels if the GROUPS parameter is set. Long identifiers or labels may overlap each other if ORIENTATION=down, or they may overlap the rug-plots if ORIENTATION=across; a maximum of eight characters is recommended.

In high-resolution plots, all data values are "jittered" to try to remove ties. This involves adding a small random value: by default the ratio of the maximum adjustment to the range of all the data is 1:100. This can be modified by setting the JITTER option to 0 to suppress jittering, or to some other ratio than the default of 0.01. The SEED option can be set to specify the seed of the random-number generation, if a reproducible plot is required.

 

Options: GRAPHICS, TITLE, AXISTITLE, WINDOW, SCREEN, ORIENTATION, JITTER, SEED.

Parameters: DATA, GROUPS, RUGLABELS, POSITION.

 

Method

High-resolution rugs are plotted using the minus or vertical-bar symbol, in vertical or horizontal plots respectively. Line-printer plots use the default plotting symbols, the asterisk or digits to represent coincident points.

 

Action with RESTRICT

Restrictions on the supplied variates are taken into account. The grouping factor and texts holding ruglabels, if specified, should not be restricted.

 

RUNTEST procedure

Performs a test of randomness of a sequence of observations

(P.W. Goedhart)

 

Options

PRINT = string Controls printed output (results); default resu

NULL = scalar Defines the boundary between the two types; default 0

 

Parameters

DATA = variates Sequences of observations

SAVE = pointers To save the number of runs, the number of positive and negative observations and the lower and upper tail probabilities of the test

 

Description

The data is assumed to be an ordered sequence of observations of two types, n1 of the first type and n2 of the second type. A run is defined to be a succession of observations of the same type. A clue to lack of randomness is provided by the total number of runs in the sequence. If the data are in random order, the expected number of runs is 1 + 2n1n2/(n1+n2). A low number of runs might indicate positive serial correlation while a high number might arise from negative serial correlation.

The DATA parameter is used to specify the sequence of observations. Observations larger than option NULL are considered to be of the first type (positive) while observation smaller than NULL are of the second type (negative). Missing values and observations that equal NULL are not taken into account. The PRINT option controls printed output, while the SAVE parameter can be used to specify a pointer containing five scalars to save the number of runs, the number of positive observations (that is, those larger than NULL), the number of negative observations and the lower and upper tail probabilities of the number of runs.

 

Options: PRINT, NULL. Parameters: DATA, SAVE.

 

Method

When the number of observations of type one and two are both smaller than 11, exact left and right tail probabilities are taken from Table 3.1 from Draper & Smith (1981). In other cases a normal approximation with continuity correction is used.

 

Action with RESTRICT

The DATA variate can be restricted so that the test uses only a subset of the units.

 

Reference

Draper & Smith (1981). Applied Regression Analysis (second edition). Wiley. New York.

 

SAMPLE procedure

Samples from a set of units, possibly stratified by factors

(P.W. Lane)

 

Options

SEED = scalar Seed for the random number generator; default 0 i.e. continue from previous generation

NVALUES = scalar Number of units from which a simple sample is to be taken; default * i.e. as defined by UNITS statement

 

Parameters

NSAMPLE = scalars or tables Number of values in simple sample, or table of numbers of values at each combination of levels of its classifying factors; no default

SAMPLE = identifiers Structure to store the result; no default

 

Description

Procedure SAMPLE produces a random sample from a set of units. A simple sample can be obtained by setting the NSAMPLE parameter to the required number in the sample, and the NVALUES option to the number of units in the set. The NVALUES option can be omitted if the required number of units has been defined by a UNITS statement earlier in the job.

For a stratified sample, the NSAMPLE option should be set to a table containing the required number of units to be sampled at each combination of levels of the factors classifying the table. The NVALUES option is not then relevant as the set of units is determined by the values of the classifying factors.

The SAMPLE parameter must be set to an identifier, which will be formed into a variate containing a set of NSAMPLE integers in the range (1...NVALUES), obtained by random sampling without replacement. The SEED option can be set to define a starting value for the random numbers used to select the units. This can be omitted if some random numbers have already been generated during the current job; SAMPLE will then take the numbers that continue the previous sequence.

 

Options: SEED, NVALUES. Parameters: NSAMPLE, SAMPLE.

 

Method

For a simple sample, a full set of units (1...NVALUES) is randomly ordered and the first NSAMPLE values are taken. For a stratified sample, the units are sorted according to levels of the classifying factors (after random ordering) and then the requested number of values are taken for each combination of levels.

 

Action with RESTRICT

The factors classifying the table must not be restricted. The procedure cannot be used on a restricted set of units.

 

SAVEDATA procedure

Saves all the current data and information for use in a future run

(S.A. Harding & R.W. Payne)

 

Option

FILENAME = text Single-line text to define the name of the file to store the information; default 'g41save'

 

No parameters

Description

SAVEDATA provides a convenient way of saving the details of all the data structures that are currently stored within Genstat. It also saves any procedures that are stored and the details of the current environment for graphics. The information is stored in a file on the computer whose name can be specified using the FILENAME option. However, if you only wish to retain one set of information, you can use the default 'g41save'. This may be interpreted differently on different computers. For example on the PC the file will be created as g41save.dat in the current directory.

Once the information has been stored, the GETDATA procedure can be used to return Genstat to exactly the same state as when SAVEDATA was called. This can take place later on in the same Genstat run, or in a completely different run. In either case, you can continue to use Genstat as though nothing had happened in the interim - so this will delete any other data that might have been created or restore any subsequent changes to the graphics environment.

 

Option: FILENAME. Parameters: none.

 

Method

OPEN is used to open the file, and RECORD is used to store the information.

 

SIGNTEST procedure

Performs a one or two sample sign test

(E. Stephens & P.W. Goedhart)

 

Options

PRINT = string Whether to print the test statistic with the associated probability and sample size (test); default test

METHOD = string Type of test (twosided, greater, lessthan); default twos

GROUPS = factor Defines the groups for a two-sample test if only the Y1 parameter is specified

NULL = scalar Median value or difference in medians under the null hypothesis; default 0

 

Parameters

Y1 = variates Data values for a one-sample sign test (neither Y2 nor GROUPS specified), or for the first sample of a two-sample test (Y2 also specified) or the values in both samples of a two-sample test (GROUPS specified but not Y2)

Y2 = variates Data values for the second sample of a two-sample test

STATISTIC = scalars To save the sign test statistic

NBINOMIAL = scalars To save the effective sample size

PROBABILITY = scalars To save the probability level of the test

 

Description

The sign test is a non-parametric test for difference in location between two related samples, or for testing the location of a single sample. The data values are specified by the parameters Y1 and Y2 and the option GROUPS. For a one-sample test, the Y1 parameter should be set to a variates containing the data. The data for a two-sample test can either be specified in two separate variates using the parameters Y1 and Y2. Alternatively, they can be given in a single variate, with the GROUPS option set to a factor to identify the two samples; the units are then assumed to be specified in the same order within each group. The GROUPS option is ignored when the Y2 parameter is set. The NULL option defines the size of the median under the null hypothesis for a one-sample test, or the difference between the two medians in a two-sample test. By default NULL=0.

The test is assumed to be two-sided unless otherwise requested by the METHOD option. Settings greater or lessthan will give one-sided tests for the median or the difference between medians greater than, or less than, the null hypothesis value respectively.

In a one-sample test, units that are equal to the null hypothesis median are excluded and the effective sample-size is reduced. Similarly, in a two-sample test, units are excluded where the differences between the pairs of values are equal to that required by the null hypothesis. Units with missing values are also excluded.

By default, SIGNTEST prints the test statistic, the effective sample size and the (exact) probability level. This information can also be saved in named scalars using the STATISTIC, NBINOMIAL and PROBABILITY parameters repectively, and printing can be suppressed by setting option PRINT=*.

 

Options: PRINT, METHOD, GROUPS, NULL.

Parameters: Y1, Y2, STATISTIC, NBINOMIAL, PROBABILITY.

 

Method

The procedure uses standard Genstat directives for calculation and manipulation.

 

Action with RESTRICT

If the variates or the factor are restricted, the test is calculated using only the units not excluded by the restriction. In a two-sample test, the two variates or the variate and factor should be restricted in the same way.

 

Reference

Siegel, S. (1956). Nonparametric Statistics for the Behavioural Sciences. McGraw-Hill, New York.

 

SKEWSYMM procedure

Provides an analysis of skew-symmetry for an asymmetric matrix

(P.G.N. Digby)

 

Option

PRINT = strings Printed output from the analysis (roots, scores); default * i.e. no output

 

Parameters

DATA = matrices Asymmetric (square) matrices to be analysed

ROOTS = diagonal matrices Stores the squared singular values from the analysis; the structure has one value for each plane fitted in the analysis (e.g. if the DATA matrix has 11 rows and columns, the ROOTS diagonal matrix will have 5 values)

SCORES = matrices Stores the coordinates of the points from the analysis; each matrix has the same number of rows as the corresponding DATA matrix, and has 2 columns for each plane fitted in the analysis (e.g. if the DATA matrix has 11 rows and columns, the SCORES matrix will have 11 rows and 10 columns)

 

Description

Procedure SKEWSYMM provides the canonical analysis of skew-symmetry described by Gower (1977). The input to the procedure, specified by the parameter DATA, is a (square) asymmetric matrix of associations, A say. The rows and columns of A usually represent the same set of objects, but in different modes. For example, with migration data, the rows may represent the Countries or States being departed from, and the columns the same locations but being arrived at. The DATA matrix must not contain any missing values.

The results of the analysis are a set of coordinates (SCORES) for points representing the entities labelling the rows or columns of the DATA matrix. In pairs, these coordinates give positions on a series of planes, also called bimensions. So there is an even number of coordinates for each point; if the DATA matrix has an odd number of rows/columns, there will be one fewer coordinate than the number of rows or columns of the DATA matrix. Also, the "importance" of each plane can be assessed from a set of values (ROOTS) that give the amount of (squared) skew-symmetry explained in each pair of dimensions.

The results are interpreted in terms of the areas of triangles. The skew symmetry between the entities in rows (or columns) p and q is proportional to the area of the triangle OPQ, where O is the origin, and P and Q are the points representing p and q respectively. (For further details see either Gower 1977, or Digby & Kempton 1987.) Within each plane the coordinates are arranged so that their centroid is at (0, y), for y³ 0, and so that positive row-to-column skew symmetry is represented in a clockwise direction. (Note that in planes other than the first it is residual skew symmetry, after fitting the preceding planes, that is being modelled).

Printed output is controlled by the PRINT option: roots prints the roots, also the roots expressed as percentages and cumulative percentages, and scores prints the scores.

Results from the analysis can be saved using the parameters ROOTS and SCORES. The structures specified for these parameters need not be declared in advance. Column labels are provided automatically for the SCORES matrix, but any row labels (useful to identify the entities) are left unchanged.

 

Option: PRINT. Parameters: DATA, ROOTS, SCORES.

 

Method

Procedure SKEWSYMM provides the analysis of skew-symmetry of Gower (1977). If A is an asymmetric matrix of associations, then S = A - A¢ is skew-symmetric; this matrix is analysed using a singular value decomposition, followed by a reflection and rotation, to provide the necessary roots and scores. For further details see Gower (1977) or Digby & Kempton (1987).

 

References

Digby, P.G.N. & Kempton. R.A. (1987) Multivariate analysis of ecological communities. Chapman and Hall, London.

Gower, J.C. (1977) The analysis of asymmetry and orthogonality. In: Recent Developments in Statistics (ed. J. Barra, F. Brodeau, G. Romier & B. van Cutsen), pp. 109-123. North Holland, Amsterdam.

 

SMOOTHSPECTRUM procedure

Forms smoothed spectrum estimates for univariate time series

(G. Tunnicliffe Wilson & S.J. Welham)

 

Options

PRINT = string Controls printed output (description); default desc

METHOD = string Method to be used for smoothing (lagwindow, direct, YuleWalker, exactautoregressive); default lagw

BANDWIDTH = scalar Frequency domain bandwidth for the smoothing window; must be set if METHOD=dire

MAXLAG = scalar Specifies the cut-off lag (i.e. the maximum lag of autocovariance used in the spectrum calculation) for METHOD=lagw, or the order of the autoregression for METHOD=Yule or exac; if this option is not set then BANDWIDTH must be set, and will be used to determine an appropriate value of MAXLAG

DIVISIONS = scalar Determines the number of frequency divisions into which the range [0.0, 0.5] is divided for calculating the spectrum; the default is chosen so that the bandwidth covers about four intervals

PROBABILITY = scalar Probability value used for confidence limits; default 0.9

TAPER = scalar The proportion of data to be tapered (applied for all settings of METHOD except exac); default 0.0

SHAPE = scalar The shape of the trapezium window (a value of 1.0 specifies a rectangular, and 0.0 a triangular window); default 0.5

YLOG = string Whether to plot with a log-transformed Y-axis (yes, no); default no

XLOG = string Whether to plot with a log-transformed X-axis (yes, no); default no

GRAPHICS = string What sort of graphics to use (lineprinter, highresolution); default high

WINDOW = scalar Window to be used for plotting; default 1

PENS = variate The two pens to be used (after being defined appropriately) for drawing the plots; default !(1,2)

 

Parameters

SERIES = variate The series for which the spectrum is to be calculated

LENGTH = scalar or variate Scalar specifying that the first N units of the series are to be used, or a variate specifying the first and last units of the series to be used

SPECTRUM = variate Saves the smoothed spectrum; need not be declared in advance, but will be set up as a variate of the appropriate length within the procedure

LOWER = scalar or variate Scalar to save the multiplier of the spectrum used to calculate the lower limit, or a variate to save the values of the lower limit

UPPER = scalar or variate Scalar to save the multiplier of the spectrum used to calculate the upper limit, or a variate to save the values of the upper limit

FREQUENCY = variate Saves the frequency values at which the spectrum is calculated

 

Description

SMOOTHSPECTRUM calculates smoothed spectrum estimates for a univariate time series. The series is specified in a variate by the SERIES parameter. The parameter LENGTH can be used to specify that only part of the series is to be used: if LENGTH is set to a scalar N, then only units 1...N are used; alternatively, it can define a sub-series by being set to a variate of length 2 holding the numbers of the first and last units to be used. The spectrum can be saved by the SPECTRUM parameter. The method to be used for the smoothing is controlled by the METHOD option, with settings lagwindow for Parzen lag window smoothing, direct for frequency domain smoothing using a trapezium window, YuleWalker for autoregressive spectrum estimation based on Yule-Walker coefficients, and exactautoregressive for autoregressive estimation based on exact likelihood estimation of the coefficients.

For frequency domain smoothing (METHOD=direct), option BANDWIDTH specifies the bandwidth of the smoothing window and option SHAPE the shape of the trapezium window. The BANDWIDTH option is also used to determine an appropriate default for the MAXLAG option if this is not specified with other METHOD settings: for METHOD=lagwindow, MAXLAG specifies the cut-off lag (i.e. the maximum lag of autocovariance used in the spectrum calculation), while for METHOD=YuleWalker or exactautoregressive, it specifies the order of the autoregression.

The DIVISIONS option can define the number of frequency divisions into which the range [0.0, 0.5] is divided for calculating the spectrum; if this is omitted a default is chosen so that the bandwidth covers about four intervals. The frequency values at which the spectrum is calculated can be saved, in a variate, by the FREQUENCY parameter. The proportion of data to be tapered (relevant to all settings of METHOD except exactautoregressive) is controlled by the TAPER option; by default there is no tapering.

The LOWER and UPPER parameters can be set to scalars to save the scaling factor used to calculate the upper and lower bounds, or to variates to save the upper and lower bounds for the SPECTRUM variate.

Printed output can be suppressed by setting the option PRINT=*; by default, PRINT=description. The PROBABILITY option indicates the probability value used for confidence limits; 0.9 is used as the default.

The procedure will also plot the spectrum: option GRAPHICS controls whether this is done for line printer or on a high-resolution device. With high-resolution graphics, the plot will be produced using the current settings of the window specifed by the WINDOW option; by default WINDOW=1. The FRAME directive can be used to set the attributes of the window prior to calling the procedure. The PENS option controls which pens are to be used for the plots; the attributes of these pens are modified within the procedure. By default pens 1 and 2 are used, but these can be changed by setting option PENS to a variate of length 2 containing the numbers of the two pens required. Options YLOG and XLOG allow the X- and Y-axes to be represented on a logarithmic scale.

 

Options: PRINT, METHOD, BANDWIDTH, MAXLAG, DIVISIONS, PROBABILITY, TAPER, SHAPE, YLOG, XLOG, GRAPHICS, WINDOW, PENS.

Parameters: SERIES, LENGTH, SPECTRUM, LOWER, UPPER, FREQUENCY.

 

Method

A cosine bell window is used for the taper, with lag window and direct spectral smoothing carried out esentially as described in Bloomfield (1976). The autoregressive spectrum estimation uses the standard Yule-Walker equations, as presented for example in Box & Jenkins (1970). These are optionally refined by exact maximum likelihood estimation as described in the Genstat 5 Reference Manual. The theoretical spectrum of the autoregressive model is then calculated. The error limits are calculated using scaled chi-square distributions. These are quite good for the case of lag window and direct smoothing, but in small samples are only very approximate for the autoregressive estimates. The series values are mean corrected before spectrum estimation, but not trend corrected.

 

Action with RESTRICT

Input and output structures must not be restricted; restriction of the input series to a contiguous set of units can be achieved by use of the LENGTH parameter.

 

References

Bloomfield, P. (1976). Fourier analysis of time series: an introduction. Wiley, New York.

Box, G.E.P. & Jenkins, G.M. (1970). Time series analysis, forecasting and control. Holden-Day, San Francisco.

 

SPEARMAN procedure

Calculates Spearman's Rank Correlation Coefficient

(S.J. Welham, N.M. Maclaren & H.R. Simpson)

 

Options

PRINT = strings Output required (test, correlations, ranks): test produces the correlation coefficient/matrix and relevant test statistics, correlations prints out just the correlation coefficients for each pair of variates; ranks produces the vectors of ranks for each sample; default test

GROUPS = factor Defines the sample membership if only one variate is specified by DATA

CORRELATION = scalar or symmetric matrix

Scalar to save the rank correlation coefficient if there are two samples, or symmetric matrix to save the coeficients between all pairs of samples if there are several

T = scalar or symmetric matrix Scalar to save the Student's t approximation to the correlation coefficient if there are two samples, or symmetric matrix to save the t approximations for all pairs of samples if there are several (calculated only if the sample size is 8 or more)

DF = scalars Scalar to save the degrees of freedom for each t statistic

 

Parameters

DATA = variates List of variates containing the data for each sample, or a single variate containing the data from all the samples (the GROUPS option must then be set to indicate the sample to which each unit belongs)

RANKS = variates Saves the ranks

 

Description

SPEARMAN calculates Spearman's Rank Correlation Coefficient between pairs of samples. The samples can be stored in different variates and supplied as a list in the DATA pointer. Alternatively, they can all be placed in a single variate, and the GROUPS option set to a factor to indicate the sample to which each unit belongs. If the sample size is 8 or more (i.e. large enough for the approximation to be valid), the Student's t approximation is calculated. Otherwise SPEARMAN obtains significance levels from stored tables. The results can be displayed by use of the test setting of option PRINT, and saved using the options CORRELATION, T and DF. If more than two variates are specified, the full correlation matrix between all pairs of variables will be formed. The PRINT setting ranks causes the vector of ranks for each sample to be printed and correlations means that only the correlations will be displayed. The ranks from each sample can be saved using the RANKS parameter.

 

Option: PRINT, GROUPS, CORRELATION, T, DF. Parameters: DATA, RANKS.

 

Method

Spearman's Rank Correlation Coefficient is a measure of association between the rankings of two variables measured on N individuals (i.e. two vectors of length N). The correlation coefficient is calculated from the two vectors of ranks for the samples: let { Xi ; i=1...N } and { Yi ; i=1...N } be the vectors of ranks for sample 1 and sample 2 respectively, then the coefficient r is based on the vector of differences between ranks: { Di = Xi - Yi ; i=1...N } and is calculated by

r = 1 - 6 ´ S i=1...N Di2 / [ N(N2-1) ].

If ties are present, then the statistic will be biased, and must be recalculated taking account of ties by:

r = ( S Xi2 + S Yi2 - S Di2 ) / ( 2 ´ Ö ( S Xi2 ´ S Yi2 ) )

where S Xi2 = (N3-N)/12 - Tx ;

S Yi2 = (N3-N)/12 - Ty ;

Tk = S ( tj3 - tj )/12

and tj is the number of observations in the group with rank j.

The t-approximation for this statistic, T, is valid for samples of size 8 upwards, and is calculated by

T = r ´ Ö [ (N-2)/(1-r2) ].

It has approximately a t-distribution on N-2 degrees of freedom, and can be used for a test of the null hypothesis of independance between samples. (See for example Siegel 1956, pages 202-213.)

 

Action with RESTRICT

If any of the variates in DATA is restricted, the statistic is calculated only for the set of units not excluded by the restriction.

 

Reference

Siegel, S. (1956). Nonparametric Statistics for the behavioural sciences. McGraw-Hill, New York.

 

SPLINE procedure

Calculates a set of basis functions for M-, B- or I-splines

(P.W. Goedhart)

 

Options

KNOTS = scalar or variate Defines the interior knot values; no default i.e. this option must be set

ORDER = scalar Defines the order of the piecewise polynomial; default 3

TYPE = string Controls which spline basis is calculated (m, b, i); default m

LOWER = scalar Left-hand limit L of the interval [L, U); default * i.e. the minimum of the X parameter is used

UPPER = scalar Right-hand limit U of the interval [L, U); default * i.e. a value slightly larger than the maximum of the X parameter is used

NOMESSAGE = string Which warning messages to suppress (warning); default *

 

Parameter

X = variates Values for which the basis spline functions are calculated

BASIS = pointers Pointer to save variates containing the values of the basis spline functions

DBASIS = pointers Pointer to save variates containing the values of the first order derivatives of the basis spline functions

 

Description

Piecewise polynomials or splines can be used for nonparametric function estimation. Splines offer a flexible way to investigate the shape of a relationship or can be used for interpolation and smoothing. There are several types of splines. Smoothing splines, implemented in Genstat by means of regression function SSPLINE, minimize a penalized residual sums of squares in which lack of smoothness of the estimated function is penalized. Smoothing splines can be less appropriate when local effects are strong or when the estimated function should be monotone, e.g. when estimating growth curves.

An alternative for smoothing splines is to use regression splines which offer more control over the characteristics of the estimated function. With regression splines the user first specifies an interval [L, U) on which the estimated function is non-trivial. This interval is then explicitly divided into segments by the user, and a polynomial, of order say k, is fitted in each segment. The segments are separated by a sequence of so-called knots. It is customary to force the piecewise polynomials to join smoothly at these knots. The piecewise polynomials and all their derivatives are always continuous from the right at the knots. Moreover, when there are no replicated knot values, the (k-1)th derivative is continuous at the knot values. The order of differentiability is lower when there are replicated knot values. The full knot sequence includes the endpoints L and U which are replicated depending on the order of the piecewise polynomial. Ramsay (1988) provides a concise introduction into regression splines, while de Boor (1978) gives a full account.

The SPLINE procedure can be used to calculate a set of so called basis functions which have all the required properties of continuity and differentiability. These basis functions can then be used to fit the regression spline. A simple basis is given by truncated polynomials but this has the disadvantage of generating considerable rounding errors. A numerically superior basis is provided by M-splines. Their main features are that any basis function is positive in a series of consecutive segments, is zero elsewhere and is normalized by having unit area. An alternative normalization is provided by B-splines which have the property that the sum over all basis functions is 1 for values in the interval [L, U). Basis functions of M-splines and B-splines are linearly related and are 0 outside [L, U). The resulting piecewise polynomial is discontinuous at the endpoints L and U.

Monotonicity of the estimating function can be imposed by employing a basis consisting of monotone functions. Ramsay (1988) uses integrated M-splines which, when combined with nonnegative regression coefficients, yield a monotone spline. These integrated M-splines are called I-splines. The basis functions for I-splines are not linearly related and they are 0 for values smaller than L and 1 for values greater than or equal to U. The resulting piecewise polynomial is continuous but not differentiable at the endpoints. The choice of the polynomial order and of the knot values are crucial for successful usage of regression splines. Wegman & Wright (1983) summarize practical recommendations for M-splines, while Ramsay (1988) does so for I-splines. In general the knots should be chosen in regions where the relationship changes most markedly. A useful preliminary knot placement is to position a single interior knot at the median, two interior knots at the terciles, three at the quartiles, and so on. The order of the piecewice polynomials is usually taken to be 2 or 3.

The values for which the basis functions are calculated must be specified by the X parameter. The values of the basis functions are saved with the BASIS parameter, while the first order derivatives of the basis functions can be saved by setting the DBASIS parameter. The BASIS and DBASIS pointers are redefined in the procedure. If a value in the X parameter coincides with an interior knot and the basis function or its first order derivative has a discontinuity at that value, it should be remembered that the functions are continuous and differentiable from the right.

The interior knot sequence must be set with the KNOTS option and the ORDER option can be used to specify the order of the piecewise polynomials. The TYPE option determines which spline basis is calculated. The interval [L, U) for which the basis functions are non-trivial can be specified by the LOWER and UPPER options. If these are unset the following values are used:

CALCULATE LOWER = MINIMUM(X)

CALCULATE max = MAXIMUM(X)

CALCULATE UPPER = max + ((max.EQ.0) + ABS(max))/500000

In this case the UPPER value is such that max is just in the interval [L, U). The NOMESSAGE option can be used to suppress warning messages which are printed when the KNOTS variate has replicated values and when the interval [LU) does not overlap the range of X values.

 

Options: KNOTS, ORDER, TYPE, LOWER, UPPER, NOMESSAGE.

Parameters: X, BASIS, DBASIS.

 

Method

Basis functions for M-splines are calculated by a recurrence relation from Ramsay (1988). These basis functions are multiplied to give B-splines or summed to provide I-splines. Note that unlike Ramsay (1988), the order of the spline is here defined as the order of the piecewise polynomial.

 

Action with RESTRICT

The variates contained in the BASIS and DBASIS pointers are restricted in the same way as the X parameter. Values in the units excluded by the restriction are set to missing. Restrictions on the KNOTS variate are ignored.

 

References

de Boor, C. (1978). A practical guide to splines. Springer-Verlag. New York.

Ramsay, J.O. (1988). Monotone regression splines in action (with discussion). Statistical Science, 3, 425-441.

Wegman, E.J. & Wright, I.W. (1983). Splines in statistics. Journal of the American Statistical Association, 78, 351-365.

 

STANDARDIZE procedure

Standardizes columns of a data matrix to have mean zero and variance one

(S.A. Harding & D.A. Murray)

 

No options

 

Parameters

OLD = variates or matrices Structures containing data to be standardized

NEW = variates or matrices Structures to contain output; by default the OLD structures are overwritten

 

Description

The parameter OLD lists the variates or matrices to be standardized, and the NEW parameter specifies a list of variates or matrices to store the standardized values. If NEW is not set, the transformed data values overwrite the contents of the OLD structures. If NEW is set, it should be to structures of the same type (variate or matrix) as the corresponding OLD structures.

 

Options: none. Parameters: OLD, NEW.

 

Method

The standardized values are calculated as (x-mean(x)) / Ö (var(x)). If there are any missing values in the data these are omitted from the calculation.

 

Action with RESTRICT

Restrict is irrelevant with matrix structures. It should work as expected with variates.

 

STEM procedure

Produces a simple stem-and-leaf chart

(J. Ollerton & S.A. Harding)

 

No options

 

Parameters

DATA = variates Data values for each plot

NDIGITS = scalars Number of digits in the leaves of each plot

STEMUNITS = scalars Scale units for the stem values in each plot

 

Description

STEM produces a simple stem-and-leaf chart of a variate of data. The stems indicate leading digits and the leaves indicate subsequent digits. By default, the leaves are formed from single digits; the parameter NDIGITS can be used to specify the number of digits in each leaf if more than one is required. The STEMUNITS parameter can be used to specify the units represented by the stem values. By default, this is determined from the data so that the display will fit within a single screen or page of output. Small values of STEMUNITS (in comparison to the range of the data) should be avoided as they may generate far too many lines of output. The display produced by STEM is restricted to the current output width; any lines that have to be truncated at the right-hand margin are terminated by >, indicating their continuation.

 

Options: none. Parameters: DATA, NDIGITS, STEMUNITS.

 

Method

The variate of data is scaled appropriately and printed into a text. Using CONCATENATE the individual stem and leaf digits are extracted and then formed into another text structure which is printed out as the stem-and-leaf display. Missing values are excluded from the data.

 

Action with RESTRICT

STEM takes account of any restriction on the data variate.

 

Reference

Tukey, J.W. (1977). Exploratory data analysis. Addison-Wesley.

 

SUBSET procedure

Forms vectors containing subsets of the values in other vectors

(R.W. Payne)

 

Options

CONDITION = expression Logical expression to define which units are to be included; no default - this option must be set

SETLEVELS = string Whether to reform the levels (and labels) of factors to exclude those that do not occur in the subset (yes, no); default no

 

Parameters

OLDVECTOR = vectors Vector from which the subset is to be formed

NEWVECTOR = vectors Vector to store the subsets if none is specified, the OLDVECTOR is redefined to store the subset

 

Description

SUBSET forms vectors containing subsets of the values in other vectors. The subset is defined by a logical condition which must be specified by the CONDITION option; units with the value 1 (.TRUE.) for the condition are included in the subset, others are omitted.

Subsets can be formed for factors, texts and variates. Relevant attributes will also be transferred across to the new structures but, if the subset excludes some of the levels of a factor, a new reduced set of levels (and labels) can be requested by setting option SETLEVELS=yes.

The original vectors are specified by the OLDVECTOR parameter and identifiers for the vectors to contain the subsets are specified by the NEWVECTORS parameter. If NEWVECTORS is not set, the OLDVECTORS are redefined to store the subsets instead of their original values.

 

Options: CONDITION, SETLEVELS. Parameters: OLDVECTORS, NEWVECTORS.

 

Method

RESTRICT is used to obtain a list of units included according to the CONDITION. This is then used to calculate a format for EQUATE to use to transfer the values. The DUPLICATE directive is used to transfer any relevant attributes. We thank Jac Thissen for suggestions about the redefinition of factor levels.

 

Action with RESTRICT

Any restriction is ignored; the subset is formed only from the CONDITION option.

OLDVECTOR is redefined to store the subset

 

TTEST procedure

Performs a one- or two-sample t-test

(S.J. Welham)

 

Options

METHOD = string Type of test required (twosided, greater, lessthan); default twos

GROUPS = factor Defines the groups for a two-sample test if only the Y1 parameter is specified

CIPROB = scalar The probability level for the printed confidence interval (0<CIPROB<1); for a one-sided test this will be for the mean and for a two-sided test for the difference in means; default *, i.e. no confidence interval is produced

NULL = scalar The value of the mean under the null hypothesis; default 0

 

Parameters

Y1 = variates Identifier of the variate holding the first sample

Y2 = variates Identifier of the variate holding the second sample

TEST = variates Identifier of variate (length 3) to save test statistic, d.f. and probability value

INTERVAL = variates Identifier of variate (length 2) to save end points of the confidence interval

 

Description

The data for TTEST are specified by the parameters Y1 and Y2 and the option GROUPS. For a one-sample test, the Y1 parameter should be set to a variate containing the data. TTEST then performs a one-sample t-test for the mean of a normal distribution. The value of the mean under the null hypothesis can be specified by the option NULL; by default NULL=0.

The data for a two-sample test can either be specified in two separate variates using the parameters Y1 and Y2. Alternatively, they can be given in a single variate, with the GROUPS option set to a factor to identify the two samples. The GROUPS option is ignored when the Y2 parameter is set. TTEST then performs a standard two-sample t-test, assuming that the two samples arise from normal distributions with equal variances.

For both one- and two-sample cases, the test is assumed to be two-sided unless otherwise requested by the METHOD option. Settings greater and lessthan of this option will give one-sided tests for the sample mean greater than or less than the null hypothesis mean, respectively.

Validity checks are carried out. If any sample has fewer than 6 values, a warning is given that the sample size is too small and the test may not be valid. Also, in the two-sample case, an F-test for equality of variance is carried out and a warning printed if there is evidence of unequal variances.

A summary of the data and the test, together with the test statistic and its probability level under the null hypothesis are always printed. A confidence interval can be produced from the test and printed by setting the CIPROB option to the required value (between 0 and 1). This is the probability of the interval containing the value under the null hypothesis if that hypothesis is true.

Results can be saved by the setting of parameters TEST and INTERVAL; TEST will store the test statistic, its degrees of freedom and probability level in a variate of length 3, and INTERVAL saves the confidence interval in a variate of length 2 provided the option CIPROB also has a valid setting.

 

Options: METHOD, GROUPS, CIPROB, NULL. Parameters: Y1, Y2, TEST, INTERVAL.

 

Method

A standard t-statistic is calculated in both cases, together with an F-statistic in the two-sample case (to test equality of variances) as described in any standard textbook. The squared t-statistics and the F-ratio are compared with the appropriate F-distribution using the function FPROBABILITY, and confidence intervals are constructed using the function FED.

 

Action with RESTRICT

Y1 and Y2 may be subject to different restrictions; these restrictions will be obeyed. TEST and INTERVAL must not be restricted.

 

VEQUATE procedure

Equates across numerical structures

(P.W. Goedhart)

 

No options

 

Parameters

OLDSTRUCTURES = pointers Structures whose values are to be transferred; pointers should point to structures of the same type (scalar, variate, matrix or table) and length

NEWSTRUCTURES = pointers Pointer to variates to take each set of transferred values; if necessary structures are redefined to be variates of correct length

 

Description

This procedure allows the values in a set of structures to be copied into a set of variates, one for each element of the original structures. The original structures are input in a pointer, using the OLDSTRUCTURES parameter. The variates to take the values are returned in a pointer whose identifier is specified by the NEWSTRUCTURES parameter. If this is already declared, it should be to a pointer of the correct length; if necessary, the structures to which it points will be redefined to be variates of correct length.

The procedure copies the values in the first element of each of the original structures into the first variate in the NEWSTRUCTURES pointer, then those in the second element to the second variate, and so on. The structures in the OLDSTRUCTURES pointer should be of the same type (scalar, variate, matrix, diagonal matrix, symmetric matrix or table) and length.

 

Options: none. Parameters: OLDSTRUCTURES, NEWSTRUCTURES.

 

Method

EQUATE is used to transfer values. If OLDSTRUCTURES points to restricted variates, the values included in the subset are first copied to dummy structures.

 

Action with RESTRICT

If the OLDSTRUCTURES pointer consists of variates, any restrictions will be taken into account and, if the NEWSTRUCTURES pointer is not declared in advance, its suffixes will be set to the units in the restricted set.

 

VFUNCTION procedure

Calculates functions of variance components from a REML analysis

(S.J. Welham)

 

Options

PRINT = string Output required (function); default func

RANDOM = formula Random model (excluding residual stratum) used for the REML analysis

FACTORIAL = scalar Value used in the FACTORIAL option of REML; default 3

NCONSTANT = scalar Value to be used as constant in the numerator function; default 0

DCONSTANT = scalar Value to be used as constant in the denominator function; default 0

SAVE = REML save structure Specifies the (REML) save structure from which the variance components are to be taken; by default they are taken from the save structure of the most recent REML analysis

 

Parameters

NUMERATOR = variates Each variate contains the coefficients multiplying the components in the numerator of a function

DENOMINATOR = variates Coefficients multiplying the components in the denominator of the function

FUNCTION = scalars To save the value of the function

SE = scalars To save the standard error of the function

 

Description

VFUNCTION calculates linear functions, reciprocals of linear functions, or ratios of linear functions of the estimates of variance components from a REML analysis. The standard error of the function is also produced.

The components are taken from the structure specified by the SAVE option. If this option is not set, then the SAVE structure from the most recent REML analysis is used. The RANDOM option must be set to be the random formula used by the REML analysis, but excluding the residual term. The FACTORIAL option must be set to the same value as used in the REML analysis; by default FACTORIAL=3.

The function is defined by the variates specified by the NUMERATOR and DENOMINATOR parameters; these give the coefficients of the components for the numerator and denominator of the function, respectively. The order of the components is as given by the RANDOM option, with the residual term added at the end. If either variate contains fewer values than the number of components, the final coefficients are taken to be zero. However, random components that were constrained to be fixed in the REML analysis are ignored. If only NUMERATOR is set the function will be linear; conversely if only DENOMINATOR is used it will be a reciprocal function, and if both NUMERATOR and DENOMINATOR are set then the function is the ratio of two linear functions. Options NCONSTANT and DCONSTANT allow a constant to be included in the numerator and denominator functions, respectively.

Printed output is controlled by the option PRINT; by default the function and its standard error are printed.

Parameters FUNCTION and SE allow the value of the function and its standard error to be saved.

 

Options: PRINT, RANDOM, FACTORIAL, NCONSTANT, DCONSTANT, SAVE.

Parameters: NUMERATOR, DENOMINATOR, FUNCTION, SE.

 

Method

The components and their variance-covariance matrix are retrieved using VKEEP. The function is calculated as specified and its standard error is calculated using a formula derived from a Taylor expansion (see, for example, Kendall & Stuart 1963, page 232):

se( f/g ) = (1/g) ´ Ö { var(f) - 2 ´ (f/g) ´ cov(f,g) + (f/g) ´ (f/g) ´ var(g) }

 

Reference

Kendall, M. & Stuart, A. (1963). The Advanced Theory of Statistics, Volume 1. Griffin, London.

 

VHOMOGENEITY procedure

Tests homogeneity of variances and variance-covariance matrices

(R.W. Payne)

 

Option

GROUPS = factors Define the groups whose variances are to be compared; these need be given only if DATA is set

 

Parameters

DATA = variates or pointers Data variate from which variances are calculated, or pointer to a list of variates from which variance-covariance matrices are calculated

VARIANCES = any numerical structures or pointers

Supplies the variances (in any numerical structure) or variance-covariance matrices in a pointer to a list of symmetric matrices if the DATA parameter is not set, or saves variances (in a table) and variance-covariance matrices (in a pointer to a list of symmetric matrices) if they have been calculated from DATA and GROUPS

DF = any numerical structure Supplies the degrees of freedom for variances (in any numerical structure) or for variance-covariance matrices (as a pointer to a list of scalars) if the DATA parameter is not set, or saves the degrees of freedom for variances (in a table) or variance-covariance matrices (as a pointer to a list of scalars) if they have been calculated from DATA and GROUPS

 

Description

Equality of variances of residuals is an important requirement for the validity of analysis of variance and regression. VHOMOGENEITY allows the homogeneity of variances in different groups to be assessed using Bartlett's test. This test is rather sensitive to departures from Normality (another requirement for the validity of these analyses); so it is recommended that the residuals also be examined, for example using procedures RCHECK or DAPLOT.

To test homogeneity of variances, VHOMOGENEITY can take as input either the original data values together with factors defining the groups, or variances along with degrees of freedom. For the first method the DATA parameter should be set to a variate containing the data values; the factors are specified by the GROUPS option. For the second method the variances are input (in any numerical structure) using the VARIANCES parameter, and the degrees of freedom (in a structure of the same type as for VARIANCES) using the DF parameter.

With multivariate data, the analagous test for variance-covariance matrices is given by Box (1950). Again two methods of input are available. The original data variates can be supplied, in a pointer, using the DATA parameter and the factors can be listed by the GROUPS option (as for a single variate). Alternatively, the VARIANCES parameter can be set to a pointer containing the variance-covariance matrices to be tested, and the DF parameter to a pointer containing the corresponding degrees of freedom.

If the variances and degrees of freedom are to be calculated by the procedure (from DATA and GROUPS), the VARIANCES and DF parameters can be used to save the calculated values. When testing homogeneity of variances, the variances and degrees of freedom are saved in tables, classified by the GROUPS factors; these tables need not be declared in advance. With variance-covariance matrices, VARIANCES is a pointer to the list of symmetric matrices that have been formed, and DF a pointer to a list of scalars.

 

Option: GROUPS. Parameters: DATA, VARIANCES, DF.

 

Method

If the raw data have been given as input, the procedure uses TABULATE to form tables of variances, and of replications from which the degrees of freedom are calculated. The test statistic is calculated as M/C, where

M = S ni ´ log( S { ni ´ si } / S { ni } ) - S { ni ´ log( si ) }

C = 1 + ( 1 / ( 3 ´ (N - 1) ) ) ´ ( S { 1/ni } - 1 / ( S ni ) )

N = number of groups

ni = degrees of freedom of group i

si = variance of group i

The number of degrees of freedom associated with the test statistic is the number of groups minus one. See, for example, Snedecor & Cochran (1980, pages 252-253).

The FSSPM directive is used to form variance-covariance matrices. The equivalent test of homogeneity is given by Box (1950).

 

Action with RESTRICT

If the DATA variates are restricted, only the units not excluded by the restriction will be used to calculate the variances and degrees of freedom.

 

References

Box, G.E.P. (1950). Problems in the analysis of growth and wear curves. Biometrics, 6, 362-389.

Snedecor, G.W. & Cochran, W.G. (1980). Statistical Methods (7th edition). Iowa State University Press, Ames, Iowa.

 

VINTERPOLATE procedure

Performs linear & inverse linear interpolation between variates

(R.J. Reader)

 

Options

METHOD = string Type of interpolation required (interval, value): for METHOD=value, y-values are interpolated for each point in the NEWINTERVALS variates and stored in the NEWVALUES variates, while for METHOD=interval, x-values are estimated for the y-values in the NEWVALUES variates and stored in the NEWINTERVALS variates; default inte

RANGE = string Whether the smallest value, largest value or the mean of the two is returned if more than one value is valid (first, middle, last); default midd

 

Parameters

OLDVALUES = pointers Each one contains variates specifying the y-values (data values) with which an interpolation is to be carried out

NEWVALUES = pointers For METHOD=value, each pointer contains variates to store the results of an interpolation; for METHOD=interval, it contains either variates or scalars to specify y-values for which inverse interpolation is to be carried out

OLDINTERVALS = variates Contains the x-values (intervals) corresponding to the variates in the OLDVALUES pointer

NEWINTERVALS = pointers For METHOD=interval, each pointer contains variates to store the results of an inverse interpolation; for METHOD=value, it contains either variates or scalars to specify x-values at which interpolation is to be performed

 

Description

VINTERPOLATE performs linear interpolation and inverse linear interpolation between variates, as was given by the Genstat 4 functions LINT and INLINT. The y-values (or data values) are given in a set of variates, contained in the pointer specified by the OLDVALUES parameter. The corresponding x-values (intervals) are specified by the OLDINTERVALS parameter, in a variate with one value for each variate of the OLDVALUES pointer. For interpolation (METHOD=value), values are interpolated at the x-values specified by the variates or scalars contained in the NEWINTERVALS pointer, and are stored in variates contained in the NEWVALUES parameter. For inverse interpolation (METHOD=interval), x-values are estimated for the y-values specified by the variates or scalars contained in the NEWVALUES pointer, and are stored in variates contained in the NEWINTERVALS pointer. Where two or more successive OLDINTERVALS or OLDVALUES are the same, there is no unique solution to the interpolation; the RANGE option allows the smallest (RANGE=first), largest (RANGE=last) or the mean of these two (RANGE=middle) values or intervals to be selected.

 

Options: METHOD, RANGE.

Parameters: OLDVALUES, NEWVALUES, OLDINTERVALS, NEWINTERVALS.

 

Method

An estimate of the required value is calculated from each successive pair of points. If this estimate is between the two points from which it was calculated, it is a valid answer i.e. it was produced by interpolation not extrapolation. If it does not satisfy this condition it is set to missing. If the curve is horizontal or vertical at the point of interpolation, more than one point will satisfy the above condition. In this case the value returned will depend on the setting of the option RANGE. This will determine whether the smallest value found, the largest value found, or the mean of these two values is returned. The default is to return the mean value.

 

Action with RESTRICT

If any of the variates in the OLDVALUES pointer is restricted, the other variates in the pointer must either have the same restriction or be unrestricted. Missing values are then returned for the units excluded by the restriction. The OLDINTERVALS variate must not be restricted.

 

VORTHPOL procedure

Calculates orthogonal polynomial time-contrasts for repeated measures

(J.T.N.M. Thissen)

 

Options

TIMEPOINTS = variate Variate of timepoints; default uses the suffixes of the DATA pointer

MAXDEGREE = scalar The number of contrasts (excluding the mean); default is the number of identifiers in the CONTRAST pointer minus 1

 

Parameters

DATA = pointers Each pointer contains the data variates (observed at successive times); must be set

CONTRAST = pointers To save the calculated contrasts: the first variate contains the means, the second the linear polynomial contrasts, the third the quadratic polynomial contrasts etc; must be set

 

Description

A repeated measures experiment is one in which the same set of units, or subjects, is observed at a sequence of times to investigate treatment effects over a period of time. VORTHPOL calculates orthogonal polynomial contrasts in time for each experimental unit. These contrasts can then be analysed given the block and treatment structure at each timepoint.

The observed data is specified in a pointer containing a set of variates, each one containing the measurements made on the subjects at one of the occasions on which they were observed, and input to the procedure using the DATA parameter. The variate in option TIMEPOINTS specifies the actual times when the measurements were taken. If this is not specified, the suffixes of the DATA pointer are taken as values for the timepoints.

The calculated contrasts are saved in a pointer which must be specified by the CONTRAST parameter. This points to a list of variates: the first variate saves the means over the DATA variates, the second variate saves the linear polynomial contrast, the third the quadratic polynomial, and so on. Provided the MAXDEGREE option is specified, the CONTRAST need not be declared in advance. The suffixes of the pointer are then defined to be 0, 1, 2 ...

The number of contrasts can be specified using option MAXDEGREE and should be less than the number of timepoints. The default setting is the number of identifiers in the pointer specified by the CONTRAST parameter minus 1. If MAXDEGREE is set, and the CONTRAST pointer has been declared to be of length less than MAXDEGREE+1, a fault message is produced.

If an experimental unit has a missing value in one of the DATA variates each contrast in the CONTRAST pointer (including the mean) gets a missing value for this unit.

 

Options: TIMEPOINTS, MAXDEGREE.

Parameters: DATA, CONTRAST.

 

Method

Procedure ORTHPOL gets orthogonal polynomial coefficients which are used to form the contrasts.

 

Action with RESTRICT

Each variate in the DATA pointer should be restricted in the same way. The saved orthogonal polynomial contrasts are restricted accordingly. The TIMEPOINTS variate must not be restricted.

 

VPLOT procedure

Plots residuals from a REML analysis

(S.J. Welham)

 

Options

GRAPHICS = string What type of graphics to use (lineprinter, highresolution); default high

SAVE = REML save structure Specifies the (REML) save structure from which the residuals and fitted values are to be taken; default * uses the SAVE structure from the most recent REML analysis

 

Parameter

METHOD = strings Type of graph for residuals (fittedvalues, normal, halfnormal, histogram); default fitt

 

Description

Procedure VPLOT provides four types of residual plots from a REML analysis. These are selected using the METHOD parameter, with settings: fitted for residuals versus fitted values

normal for a Normal plot, halfnormal for a half-Normal plot, and histogram for a histogram of residuals.

The residuals and fitted values are accessed automatically from the structure specified by the SAVE option. If the SAVE option has not been set, they are taken from the last SAVE structure from the most recent REML analysis.

By default, high-resolution graphics are used. Line printer graphics can be used by setting option GRAPHICS=lineprinter.

 

Options: GRAPHICS, SAVE. Parameter: METHOD.

 

Method

Residuals and fitted values effects are accessed, using VKEEP, from the REML analysis specified by the SAVE option. For a Normal plot, the Normal quantiles are calculated as follows:

qi = NED( (i-0.375) / (n+0.25) ) i=1...n

while for a half-Normal they are given by

qi = NED( 0.5 + 0.5 ´ (i-0.375) / (n+0.25) ) i=1...n

 

Action with RESTRICT

If the y-variate in the REML analysis was restricted, then only units included by the restriction will be used in the graphs.

 

VREGRESS procedure

Performs regression across variates

(M.W. Patefield & D. Tandy)

 

No options

 

Parameters

Y = pointers Pointers each containing a set of y-variates for each of whose units a regression is to be done

X = pointers Pointer containing x-variates for each set of y-variates

SLOPE = variates Variate to save the estimated slopes from each set of regressions

INTERCEPT = variates Variate to save the estimated intercepts from each set of regressions

 

Description

Given a pointer containing a set of y-variates and another containing a set of x-variates, VREGRESS performs a separate regression for the data in each unit of the variates. The pointers are specified using the Y and X parameters. There must be an equal number of x- and y-variates, and the variates must all be of the same length. The SLOPE parameter must supply a variate to receive the regression coefficients, and the INTERCEPT parameter can give a variate to save the intercepts. These variates will have the same length as the x- and y-variates.

 

Options: none. Parameters: Y, X, SLOPE, INTERCEPT.

 

Method

The procedure propagates missing values in any of the x-variates into the appropriate unit of the corresponding y-variate, and vice-versa. The regressions are calculated using matrix operations and variate functions in CALCULATE. The vectors of means across the x- and y-variates are subtracted, and then the sums of squares of X and the sums of products of Y and X are calculated across the variates, to obtain the estimated slope coefficients. The estimated intercepts are calculated directly from the slope coefficients and the vectors of means.

The action taken with missing values is the same as would be given by the FIT directive. Units with all values of Y and X missing after propagation will have missing values in both SLOPE and INTERCEPT. Units with a single non-missing value will have a missing value in SLOPE and the corresponding element of INTERCEPT equal to the non-missing value of Y.

It is considerably faster to use VREGRESS than to use FIT on each unit after re-arrangement of the data. However, there may be a slight loss of accuracy resulting from single-precision calculations on machines where double-precision is used for the calculations within FIT.

 

Action with RESTRICT

All the data variates in Y and X must be subject to the same restriction (if any). If this is not so a fault (VA 1 - incompatible restrictions) will occur during the calculations. If SLOPE or INTERCEPT is restricted prior to use of VREGRESS, the restriction must be the same as that on the data variates. On exit from VREGRESS, the computed SLOPE and INTERCEPT variates will be restricted in the same way as the data variates.

 

VTABLE procedure

Forms a variate and set of classifying factors from a table

(P.W. Goedhart)

 

No options

 

Parameters

TABLE = tables Tables to be copied

VARIATE = variates New variate to contain the body of each table

CLASSIFICATION = pointers Pointer containing the factors by which each new variate is classified

 

Description

This procedure can be used to store the body of a table in a variate and obtain a set of factors to represent the way in which the data are arranged in the table. These factors then classify the newly formed variate in the same way as in the table. Margins of the table are ignored.

The table to be copied is specified by the TABLE parameter, the variate must be specified by the VARIATE parameter, and the set of classifying factors can be obtained by setting CLASSIFICATION to a pointer. If this pointer has not been declared, the names of the classifying factors of the table are used as suffix names. The newly formed factors have the same attributes as the old classifying factors, excluding the setting of EXTRA. Note that the order in which the factors are obtained can be unexpected for implicitly declared tables as explained on page 143 of the Genstat 5 Release 3 Reference Manual.

 

Options: none. Parameters: TABLE, VARIATE, CLASSIFICATION.

 

Method

Margins of the table are deleted by the directive MARGIN. The classifying factors of the table are obtained with GETATTRIBUTE. The initial declarations of the new factors are done as implicit declarations by SORT, as this also transfers any relevant attributes. Factor values are then produced by GENERATE.

 

WADLEY procedure

Fits models for Wadley's problem, allowing alternative links and errors

(D.M. Smith)

 

Options

PRINT = strings Controls printed output (deviance, estimates, correlations, monitoring); default devi, esti

DISTRIBUTION = string Distribution of the response variate (poisson, negativebinomial, qlnegativebinomial, qlscaledpoisson); default pois

LINK = string Link transformation (logit, probit, complementaryloglog, cauchit); default logi

TERMS = formula Model to be fitted

CONTROL = factor Factor to distinguish the control, or zero, dose (level 1) from the other treatments (level 2)

MAXIMAL = factor Factor to define the maximal model i.e. with a level for every combination of values of the variates and factors in TERMS

RMETHOD = string Type of residuals to be formed (deviance, Pearson); default devi

 

Parameters

Y = variates Response variate for each fit

RESIDUALS = variates Variate to save the residuals from each fit

FITTEDVALUES = variates Variate to save the fitted values from each fit

 

Description

WADLEY uses the generalized linear models methodology of composite link functions to fit a range of models for the situation known as Wadley's problem. This arises in bioassay where it is possible to count only the number of subjects that have not responded to a particular dose of a drug or stimulus. For example, with eggs of insects fumigated in grain, it is generally possible to count only those that survive and hatch.

By default, the analysis assumes that the numbers of subjects that are treated in each observation follow a Poisson distribution with a common mean parameter; other distributions can be specified using the DISTRIBUTION option or, for user-defined distributions, by providing subsidiary procedure WADDISTRIBUTION (see details of the procedures called by WADLEY).

The analysis estimates the mean of the distribution, and then fits the dose response curve as in an ordinary probit analysis. The LINK option defines the transformation (logit, probit, cauchit, or complementary log-log) required to make the model additive. User-defined transformations can also be specified, by leaving LINK unset and providing subsidiary procedure WADLINK to calculate the necessary fitted values and derivatives, and WADINITIAL to calculate initial values for the linear predictor (see details of the procedures called by WADLEY). The model to be fitted is defined by the TERMS option.

To assist the estimation of the expected total number of subjects, there must be some control observations - for example with zero doses of fumigant. These must be identified by a factor, specified by the CONTROL option, with level 1 for untreated and level 2 for treated. The comparison between the treated and untreated levels of CONTROL must not be aliased with any of the variates and factors in TERMS. (Thus if, for example, TERMS contained a factor representing different types of drug, this must not have a separate level for the untreated observations.)

Often with these sort of data, it is found that the variability exceeds that which would be expected from the distribution assumed for the data. To estimate the amount of overdispersion, the MAXIMAL option must be set to a factor with a different level for every combination of values of the factors and variates in the TERMS model.

 

Options: PRINT, DISTRIBUTION, LINK, TERMS, CONTROL, MAXIMAL, RMETHOD.

Parameter: Y, RESIDUALS, FITTEDVALUES.

 

Method

In essence WADLEY is a specific application of the use of composite link functions in generalized linear models. The actual methods used are those in the Genstat procedure GLM (Lane 1989) and the GLIM macros of Smith & Morgan (1989). The procedure is very similar in spirit to these GLIM macros, and it is recommended that this reference be consulted for further information. However, there are some extensions. The capability to handle user-defined links and distributions has been added. Also, the range of distributions has been extended to include two forms of quasi-likelihood, namely that where the weighting is of negative binomial form (weight=1/(1+hf´ fittedvalues)), and that where the weighting is of scaled Poisson form (weight=1/hf), where hf is the heterogeneity factor. If the estimated heterogeneity factor is less than zero in the negative binomial cases, or if it is less than one in the scaled Poisson case, it is set to zero or one respectively.

WADLEY has two subsidiary procedures, WADCODI and WADFIT, to assist with the analysis; neither of these need be modified by the user:

WADCODI prints the results of the iterative processes;

WADFIT performs the iterative model fits.

There are also three other procedures, which can be rewritten or replaced, to cater for further user-defined distributions and links:

WADDISTRIBUTION calculates the variance function and deviance for a user-defined distribution;

WADINITIAL calculates initial estimates of the linear predictor for a user-defined link;

WADLINK calculates the fitted values and derivatives for a user-defined link.

If the DISTRIBUTION option is unset, the procedure will call WADDISTRIBUTION instead of using one of the various standard distributions. For a Poisson error distribution WADDISTRIBUTION should be defined like this.

PROCEDURE 'WADDISTRIBUTION'

"Calculation of variance function and deviance"

PARAMETER 'Y', "Input: variate; response variate"\

'FITTED', "Input: variate; fitted values"\

'VARIANCE',"Output: variate; variance"\

'LL', "Output: variate; log likelihood variate"\

'DEVIANCE';"Output: scalar; total deviance"\

MODE=p

SCALAR two; VALUE=2

CALCULATE VARIANCE = FITTED

& LL = Y*LOG(Y/FITTED)-Y+FITTED

& DEVIANCE = two*SUM(LL)

ENDPROC

For other error distributions only the three CALCULATE statements need to be changed.

Similarly, for option LINK unset, WADINITIAL and WADLINK will be called. For a logit link WADINITIAL would be defined as follows.

PROCEDURE 'WADINITIAL'

"Calculation of initial estimates of linear predictor"

PARAMETER 'Y', "Input: variate; response variate"\

'LP', "Output: variate; linear predictor"\

'IND', "Input: variate; marker variate with value 1

for a control observation, 0 otherwise"\

'MAXY'; "Inout: scalar; estimate of asymptote"\

MODE=p

SCALAR half,one; VALUE=0.5,1

CALCULATE LP = IND*LOG(MAXY/(Y+half)-one)

ENDPROC

For other links only the CALCULATE statement need be changed so, for example, a probit link would require the statement

CALCULATE LP = IND*NED(one-(Y+one)/MAXY)

For a logit link WADLINK would be

PROCEDURE 'WADLINK'

"Calculation of fitted values and derivatives

of the link function given the linear predictor"

PARAMETER 'LP', "Input: variate; linear predictor"\

'IND', "Input: variate; marker variate with value 1

for a control observation, 0 otherwise"\

'TA', "Output: variate; estimate of fitted values"\

'TB', "Output: variate; estimate of derivatives"\

'MAXY'; "Input: scalar; estimate of asymptote"\

MODE=p

SCALAR half,one; VALUE=0.5,1

CALCULATE TA = (.NOT.IND)+IND/(one+EXP(LP))

& TB = MAXY*EXP(LP)*TA*TA

ENDPROC

For other links only the CALCULATE statements need to be changed so, for example, a probit link would require

CALCULATE TA = (.NOT.IND)+IND/(one-NORMAL(LP))

& TB = MAXY*EXP(-half*LP*LP)/ROOT2PI

where ROOT2PI is a scalar with the value of the square root of 2π. The marker variate IND identifies which is the control and non control data, so TA should always be of the form

TA = (.NOT.IND)+IND*function

where function is the link function for the non-control part of the data. The variate TB should always be of the form

TB = MAXY*deriv_fn

where deriv_fn is the derivative of the link function with respect to the linear predictor (LP).

If LINK or DISTRIBUTION are unset, but no user routines are given for WADINITIAL, WADLINK and WADDISTRIBUTION, then those given here (for logit link and Poisson error distribution) will be used.

A debt is owned to Dr J. Parrott of Pfizer Central Research, Sandwich, UK for his support and encouragement of this work.

 

Action with RESTRICT

If the Y-variate is restricted, only the specified subset of the units will be included in the analysis.

 

References

Lane, P.W. (1989). Procedure GLM. In: Genstat Procedure Library Release 1.3[2] (ed. R.W.Payne & G.M.Arnold), pp 80-82.

Smith, D.M. & Morgan, B.J.T. (1989). Extended Models for Wadley's Problem. Glim Newsletter, 18, 21-28.

 

WILCOXON procedure

Performs a Wilcoxon Matched-Pairs (Signed-Rank) test

(S.J. Welham, N.M. Maclaren & H.R. Simpson)

 

Option

PRINT = strings Output required (test, ranks): test gives the relevant test statistics, ranks prints out the signed ranks for the vector of differences; default test

 

Parameters

DATA = variates Variates holding the differences between each pair of samples

RANKS = variates Variate to save the signed ranks

STATISTIC = scalars Scalar to save the value of the test statistic

NORMAL = scalars Scalar to save the Normal approximation for the test statistic

SIGN = scalars Scalar to indicate the sign of the total sum of the signed ranks: 1 if the sum is positive, 0 otherwise

 

Description

WILCOXON performs a Wilcoxon Matched-Pairs test on a variate holding differences between two paired samples. This is specified using the parameter DATA. The value of the test statistic is calculated and can be saved in the parameter STATISTIC. The Normal approximation for this statistic is calculated only if the sample size is large enough for the approximation to be reasonable (here taken to be 8 or more). In this case, the Normal statistic can be saved using the parameter NORMAL. The parameter SIGN saves an indicator of whether the total sum of signed ranks is positive (SIGN=1) or negative (SIGN=0).

Output from the procedure is controlled by the PRINT option: test produces the relevant test statistics, and ranks prints the vector of signed ranks for the data.

The parameter RANKS can a variate of the signed ranks of the differences (i.e. of DATA).

 

Option: PRINT. Parameters: DATA, RANKS, STATISTIC, NORMAL, SIGN.

 

Method

The Wilcoxon Matched-Pairs test (often also called the Wilcoxon Signed-Ranks test) is a non-parametric test of location in the case of two related samples (e.g. a before-and-after study). The null hypothesis is that two samples arise from exactly the same distribution, with the alternative that the two underlying distributions differ only in location.

The test statistic WS is formed from the signed ranks of the differences between each pair of observations and is the smaller in absolute value out of:

1) the sum of positive signed-ranks of the sample, and

2) the sum of the negative signed-ranks.

In this procedure the method used for calculating the test statistic is:

WS = N´ (N+1)/4 - modulus(total sum of signed ranks)/2

where N is the number of observations.

The normal approximation for this statistic is valid only for samples of size 8 or greater, and so the procedure calculates this approximation z, where

z = ( WS - N´ (N+1)/4 ) / Ö { N(N+1)(2N+1)/24 }

only if the sample size is at least 8.

(See for example Siegel 1956, pages 75-83.)

 

Action with RESTRICT

If the DATA variate is restricted, the test is calculated only using the units not excluded by the restriction.

 

Reference

Siegel, S. (1956). Nonparametric statistics for the behavioural sciences. McGraw-Hill, New York.

 

XOCATEGORIES procedure

Performs analyses of categorical data from crossover trials

(D.M. Smith & M.G.Kenward)

 

Options

PRINT = string What to print at each fit (model, summary, accumulated, estimates, correlations, fittedvalues, monitoring); default *

PDATA = string Whether or not a display of category combination by sequence is required (yes, no); default no

METHOD = string Type of analysis for which factors are required (subject, loglinear, ownsubject, ownloglinear); default subj

CARRYOVER = string Whether or not models with carryover effects in are to be produced (yes, no); default no

 

Parameters

SEQUENCE = factors The identifier of the sequence of treatments

RESULTS = pointers Pointer containing factors (one for each period) giving the category scores observed

NUMBER = variates Numbers recorded in the sequence/category combinations

SAVE = pointers Saves the factors constructed to do the analysis

REUSE = pointers To reuse factors saved earlier using SAVE

MODEL = formulae Additional terms to be fitted to model if OWNSUBJECT or OWNLOGLINEAR options used; default *

 

Description

XOCATEGORIES calculates factors, variates and performs various analyses of categorical cross-over data. All analyses conform to one of two different types both utilising a log-linear structure, although only one is derived from an orthodox log-linear model. The first type is based on a latent variable or subject effects model and is described by Kenward & Jones (1991). The subject effects are eliminated through the use of a conditional likelihood and the resultant conditional analysis can be formulated in terms of a conventional log-linear model. In the process of conditioning all between-subject information is lost. This has little consequence for the majority of well-designed cross-over trials in which nearly all information on important comparisons lies in the within-subject stratum. An exception to this is the two-period two-treatment design for which information on the carry-over effect lies in the between-subject stratum. The second type, which uses a multivariate log-linear model, allows between-subject information to be recovered, which in the binary case leads to the Hills-Armitage test for carry-over effect. Details can be found in Jones & Kenward (1989, Section 3.3). If such a test, and other allied tests for the two-period two-treatment design, are required then the log-linear option of the procedure can be used. However, the estimates from this multivariate log-linear model do have the disadvantage of an awkward interpretation. For this reason the latent variable model is to be preferred for higher-order designs and for the two-period two-treatment design when the carry-over test is not required. In the latter case, with binary data, the test for direct treatments reduces to the Mainland-Gart test.

In the latent variable model, effects are defined in terms of generalized logits, reducing to ordinary logits in the binary case. This is not ideal for ordered categorical data because the ordering is ignored. Some account can be taken of the ordering of categories by regressing on category scores in a generalization of Armitage's trend test. This can be done by using the parameter SAVE to obtain the treatment and carryover factors, which are in pointers SAVE[3...(NTRT+2)] and SAVE[(NTRT+3)...(2+2*NTRT)] respectively, NTRT being the number of treatments. From these treatment and carryover factors (NCAT-1) variates corresponding to linear (-1, 0, 1), quadratic (1, -2, 1), etc., contrasts amongst the NCAT categories can be calculated. For example, using the example data where the number of treatments (NTRT) is 3 and the number of categories (NCAT) is 3, the following statements will produce linear and quadratic variates for treatments.

XOCATEGORIES SEQUENCE=Seqid; RESULTS=Res; NUMBER=Number; \

SAVE=Fsave

CALCULATE TLN[1...3]=(Fsave[3...6].eq.3)-(Fsave[3...6].eq.1)

& TQU[1...3]=(Fsave[3...6].eq.3)-2*(Fsave[3...6].eq.2)\

+(Fsave[3...6].eq.1)

The OWNSUBJECT or OWNLOGLINEAR options together with the REUSE and MODEL parameters can then be used to fit models involving these variates and the deviances produced used to compare the models. For example, for the above variates.

XOCATEGORIES [METHOD=OWNSUBJECT] \

REUSE=Fsave; MODEL=TLN[1]+TLN[2]++TLN[3]+TQU[1]+TQU[2]+TQU[3]

The data for the procedure are specified by parameters SEQUENCES, RESULTS and NUMBERS. SEQUENCES supplies a factor with labels indicating the treatment received at each time period. The treatments are labelled by capital letters A, B &c, so (with three periods) BCA indicates treatment B in period 1, C in 2 and A in 1. RESULTS is a pointer containing a factor for each time period, to indicate the corresponding scores recorded in each period. NUMBER then indicates the number of subjects involved. It is not necessary to input data for category combinations in which no subjects were recorded.

XOCATEGORIES processes the data to form the necessary factors to do the analysis using the Genstat facilities for generalized linear models. This information can be saved using the SAVE parameter (see Method) and input again, to save time in later analyses, using the REUSE parameter.

Output of the procedure comprises significance tests of treatment and/or carryover and first order period interactions; together with estimates of log odds ratios and their standard errors.

 

Options: PRINT, PDATA, METHOD, CARRYOVER.

Parameter: SEQUENCE, RESULTS, NUMBER, SAVE, REUSE, MODEL.

 

Method

The methods of analysis follow Kenward & Jones (1991) for SUBJECT and OWNSUBJECT, and Jones & Kenward (1989, pages 124-129) for LOGLINEAR and OWNLOGLINEAR. The actual model fitting is performed using the standard Genstat directives FIT, ADD, DROP and SWITCH, with the PRINT options being those of these directives.

The data structure SAVE has the following form, all factors as Kenward & Jones (1991).

SAVE[1] = The factor G (sequence).

SAVE[2] = The factor S (outcome).

SAVE[3...(NTRT+2)] = The factors T[1...NTRT] (treatment).

SAVE[(NTRT+3)...(2+2*NTRT)] = The factors C[1...NTRT] (carryover).

SAVE[(3+2*NTRT)...(NPER+2*NTRT)] = The factors P[1...NPER] (period).

SAVE[NPER+2*NTRT+1] = The category labels if they exist.

We wish to thank Dr Byron Jones of the University of Kent, Canterbury UK, for his assistance.

 

Action with RESTRICT

Input structures must not be restricted, and any existing restrictions will be cancelled.

 

References

Jones, B. & Kenward, M.G. (1989). Design and Analysis of Crossover Trials. Chapman and Hall.

Kenward, M.G. & Jones, B. (1991). The analysis of categorical data from cross-over trials using a latent variable model. Statistics in Medicine, 10, 1607-1619.