RANK procedure
Produces ranks, from the values in a variate, allowing for ties
(J.B. van Biezen & C.J.F. ter Braak)
Option
OMIT
= string Whether units excluded by a restriction on the DATA variate should be omitted from the RANKS variate (restricted); default *, i.e. the units are not omitted, and their values are left unchanged
Parameters
DATA
= variates Variate containing values to be rankedRANKS
= variates Variate to save vector of ranksTIESIZE
= variates Variate to save the sizes of ties
Description
RANK
calculates ranks of the values in a variate, allowing for ties. The variate must be specified by the DATA parameter, and the ranks saved using the RANKS parameter. The input variates in the parameter DATA must each have at least one non-missing value. Missing values in the DATA variates give corresponding missing values in the RANKS variates. The TIESIZE parameter can save the number of times each value occurs (starting with the lowest value).The
OMIT option controls whether the RANKS variate omits units that are excluded by a restriction on the DATA variate. By default, the values in these units are left unchanged. However, if OMIT=restricted, the RANKS variate is compressed to omit the excluded units; this setting is used particularly by the nonparametric procedures.
Option:
OMIT. Parameters: DATA, RANKS, TIESIZE.
Method
The procedure uses the
SORT directive to discover the number of distinct values in the input variate, and then uses TABULATE to obtain the number of times each value occurs - i.e. the size of ties. The tie-corrected rank numbers are then calculated. From these, the vector of ranks is obtained by modifying the levels of the factor that resulted from the first SORT.
Action with
RESTRICTThe variates in
DATA can be restricted, and in different ways. RANK operates on the restricted set only. If OMIT=restricted, the length of RANKS will be the size of the restricted set of the DATA variate. If OMIT is unset, the RANKS variate is of the same length as the DATA variate.
RCHECK procedure
Checks the fit of a linear, generalized linear or nonlinear regression
(P.W. Lane, R. Cunningham & C. Donnelly)
Options
RMETHOD
= string Type of residual to use (deviance, Pearson, simple, deletion); default deviINDEX
= variate Which variate to use as index; default !(1...n)RESIDUAL
= variate To store chosen type of residuals; default *LEVERAGE
= variate To store leverages; default *COOK
= variate To store modified Cook's statistics; default *GRAPHICS
= string Type of graphics to use (lineprinter, highresolution); default highTITLE
= text Title for graph; default identifier of responseWINDOW
= numbers Window or series of windows in which to display graphs; default 4, or 5...8 for compositeSCREEN
= string Treatment of previous graphics screen (clear, keep); default cleaSAVE
= regression save structure Specifies which model to check; default *
Parameters
YSTATISTIC
= strings What to display in the graph (residual, Cook, leverage); default resiXMETHOD
= strings What type of graph (fittedvalues, index, normal, halfnormal, histogram, composite); default comp
Description
Procedure
RCHECK provides "diagnostic" information for checking the fit of regression models. The regression directives check only for large residuals and influential points, giving access to standardized residuals and leverages through directive RKEEP. RCHECK automatically accesses these quantities via RKEEP and in addition can calculate deletion residuals and modified Cook's statistics. A range of graphs can then be drawn to help check the fit of the regression model.The procedure is controlled by the
YSTATISTIC and XMETHOD parameters. These can be set to display various types of residuals, the leverages, or the modified Cook's statistics as simple plots against fitted values or against an index variate, or as Normal or half-Normal plots, or as a histogram. A set of four such plots is displayed as a composite picture: histogram, plot against fitted values, Normal plot and half-Normal plot. In addition, the chosen type of residuals, the leverages and Cook's statistics can be stored in variates, and any calculated quantities can be printed.Plots of the residuals against fitted values or an index variate are displayed with a smoothed line fitted through the points, to indicate any potential trend.
The graphical displays can be controlled as usual using the
TITLE, WINDOWS and SCREEN options. The colours and symbols used in the displays can be controlled by setting the attributes of the following pens with the PEN directive before calling the procedure:pen 2 zero lines in fitted-value, Normal and index plots,
pen 3 points and histogram bars,
pen 4 smooth line in fitted-value and index plots of residuals.
The procedure exits if there are fewer than four observations, or fewer than two non-missing standardized residuals.
Options:
Parameters:
YSTATISTIC, XMETHOD.
Method
Standardized residuals and leverages are accessed using
RKEEP from the latest fitted regression model, or from that specified by the SAVE option. Deletion residuals di are calculated as follows:di = ri /Ö ((n-p-ri2)/(n-p-1))
where ri are the standardized residuals, n is the number of observations, and p is the number of parameters in the model. For generalized linear models other than linear regression,
di = SIGN(rdi) ´ Ö ((1-li) ´ rdi2 + li) ´ rpi2)
where rdi and rpi are the standardized deviance and Pearson residuals respectively. Modified Cook's statistics ci are calculated as follows:
ci = ABS(di) ´ Ö { (n-p) ´ li / (p ´ (1-li)) }
where li are the leverages. In Normal plots, the Normal quantiles are calculated as follows:
qi = NED( (i-0.375) / (n+0.25) )
while for a half-Normal plot they are given by
qi = NED( 0.5 + 0.5 ´ (i-0.375) / (n+0.25) )
For generalized linear models, fitted values are transformed by an approximate variance-stabilizing transformation before use in graphs:
Poisson, multinomial, negative binomial and geometric 2 ´ SQRT(fitted)
binomial, Bernoulli 2 ´ ANG(100 ´ fitted / nbinomial)
gamma, exponential LOG(fitted)
inverse Normal 1 / fitted
The smoothed line displayed for fitted-value or index plots is calculated as a cubic smoothing spline with (40+n)/25 degrees of freedom.
Action with
RESTRICTRestrictions applied to vectors used in the regression apply also to the
RCHECK procedure. Values of diagnostic quantities are set to missing for all excluded units.
RGRAPH procedure
Draws a graph to display the fit of a regression model
(P.W. Lane)
Options
GRAPHICS
= string Type of graphics to produce (lineprinter, highresolution); default highTITLE
= text Title for the graph; default 'Fitted and observed relationship'WINDOW
= number Which high-resolution graphics window to use; default 4 (redefined if necessary to fill the frame)SCREEN
= string Whether to clear the graphics screen before plotting (clear, keep); default cleaSAVE
= regression save structure Save structure of the model to display; default * uses the most recently fitted regression model
Parameters
INDEX
= variate Which explanatory variate to display; default * if GROUPS is set, otherwise INDEX is set to the first variate in the fitted model (must be set for nonlinear models other than standard curves)GROUPS
= factor Which explanatory factor to display; default * if INDEX is set, otherwise GROUPS is set to the first factor in the fitted model (ignored for nonlinear models)
Description
Procedure
RGRAPH displays the fit of either a linear regression, a generalized linear model, a generalized additive model, a standard curve or a nonlinear model.For models other than the nonlinear models fitted by
FITNONLINEAR or FIT with the CALCULATION option set, the graph shows the relationship between the response variate and either one explanatory variate or one explanatory factor or one of each. If no parameters are set, RGRAPH takes the first explanatory variate and the first factor in the model, and the predicted relationship is represented by a line for each level of the factor. The display represents the observed relationship as points, plotting the response (adjusted for further explanatory terms in the model, if any) against the chosen explanatory variate, with each point labelled according to the corresponding factor level. If no factor has been fitted a single line is drawn, while if no variate has been fitted the graph simply shows the predicted mean for each level of the factor.If a linear, generalized linear, or generalized additive model has been fitted, the
INDEX and GROUPS parameters can be used to specify which explanatory variate and factor, respectively, should be used. If INDEX is set and GROUPS is not, a single line is drawn even if there are factors in the model; similarly if GROUPS is set and INDEX is not, the effect of the factor alone is shown.For nonlinear models fitted by the
FITNONLINEAR directive, a single line is drawn by joining the fitted values, and the response values are shown as points. Any setting of the GROUPS parameter is ignored. For curves fitted by the FITCURVE directive, settings of the INDEX and GROUPS parameters are ignored.No graph can be drawn if the
REG function has been used for any variate in the model. If the SSPLINE function has been used for any variate whose relationship with the response is not actually displayed, then the only adjustment for its effect will be the linear component of the fitted smooth curve. If the displayed variate itself is smoothed, then the curve is formed by interpolation between adjusted fitted values. The POL function is dealt with correctly.The
TITLE option can be used to supply a title for the graph. By default the graph is plotted on the current high-resolution device, but the GRAPHICS option can be set to line for a line printer plot. The WINDOW option can be used to select a pre-defined window for high-resolution plots; otherwise window 4 is used, and is redefined if necessary to fill the frame. The SCREEN option allows the graph to be added to an existing high-resolution plot. The colours and symbols used in the displays can be controlled by setting the attributes of the following pens with the PEN directive before calling the procedure:pen 1 labels for lines when drawn for each level of a factor,
pen 2 fitted lines and means,
pen 3 points.
By default the current regression model is displayed, but option
SAVE can be set to specify the save structure (from a MODEL statement) of some other model.
Options:
GRAPHICS, TITLE, WINDOW, SCREEN, SAVE.Parameters:
INDEX, GROUPS.
Method
For a linear or generalized linear model, fitted lines are drawn by joining predicted values calculated by a
PREDICT statement at 21 equally spaced values spanning the range of the explanatory variate. Alternatively, PREDICT provides adjusted means at each level of the factor, if the effect of the factor is to be displayed alone. If all the effects in the current model are displayed, the response is plotted against the explanatory variable (or against the factor level) as points. But if adjustment has to be made for some effects, adjusted response values (known as "partial residuals") are calculated by adding the simple residuals to predictions produced for all observations.For generalized additive models, no predictions are used if the explanatory variate is smoothed: the nonlinear component of the smooth is extracted using
RKEEP. For curves and general nonlinear models no predictions are made, and fitted values are used.If a linear or generalized linear model is constrained to pass through the origin, the display will extend the range of the explanatory if necessary to include the origin.
Action with
RESTRICTAny restriction that was in force when the model was fitted will apply also to the graphs. Problems may occur, however, if the response variate is not restricted and an explanatory variate or factor is restricted.
RIDGE procedure
Produces ridge regression and principal component regression analyses
(A.J. Rook & M.S. Dhanoa)
Options
PLOT
= string Graphical output required (ridgetrace); default *
Parameters
Y
= variates Response variate in regression modelX
= pointers Containing explanatory variates in regression model
Description
Procedure
RIDGE produces analyses for identifying and overcoming collinearity among the independent variates in a multiple regression analysis. The correlation matrix, variance inflation factors (the diagonal elements of the inverse of the correlation matrix) and the ratio of the squared error in the least squares regression coefficients to the expected squared error in orthogonal data are calculated. Principal component regressions excluding 1, 2 or 3 minor principal axes are calculated and transformed back to the original variables on either the original or standardized scale. The "Positive correlation spread association" (PCSA) (Vinod 1976) is also calculated. This is an overall measure of the suitability of the data for the application of principal component regression and ridge regression. Ridge regressions (Hoerl & Kennard 1970) are calculated and the ridge coefficients are printed together with 2 indices of stability proposed by Vinod (1976): the index of stability of relative magnitudes (ISRM) and the numerical largeness of more significant regression coefficients (NLMS). These are 0 and 1 respectively in orthogonal data. High-resolution graphs of the ridge trace can be plotted against Hoerl & Kennard's k scale and Vinod's m scale.The parameters of the procedure are used to input the data: the
Y parameter supplies the y-variate, and the X parameter specifies a pointer containing the x-variates. None of these variates must be restricted nor contain missing values.Printed output is controlled by the
PRINT option: correlation prints the correlation matrix, variance inflation factors and ratio of squared error to that in orthogonal data, pcp prints principal component analysis and principal component regression, and ridge prints ridge coefficients and stability parameters.Graphical output is controlled by the
PLOT option: ridgetrace produces ridge traces.
Options:
PRINT, PLOT. Parameters: Y, X.
Method
The correlation matrix is produced using the
CORRELATE directive. This is then used to calculate the variance inflation factors (VIF) and the ratio of the squared error to that in orthogonal data (RL).Principal component analysis is carried out using the
PCP directive. The standardized response variable is regressed on the principal component scores using MODEL, FIT and TERMS directives. The coefficients are then transformed back to the original variables on either the standardized or original scale with up to three principal components excluded. The correlations of the standardized variable with each of the principal components are also printed.Ridge regression is carried out as described by Hoerl & Kennard (1970). Ridge coefficients on both the standardized and original scales are printed for values of the biasing parameter k between 0 and 1 together with the standard errors of the coefficients on the standardized scale. Residual sums of squares (RSS) R-squared and total variance of the ridge coefficients (TVARB) are printed for each value of k. In addition Vinod's (1976) m scale is printed together with the index of stability of relative magnitudes (ISRM) and the numerical largeness of the more significant regression coefficients (NLMS).
High-resolution graphs of the ridge traces are produced. These are graphs of the ridge coefficients against k or against m and are plotted to the device set up prior to calling the procedure.
Action with
RESTRICTNone of the input variates must be restricted.
References
Chatterjee, S. & Price, B. (1991). Regression Analysis by Example. 2nd Edition. New York, Wiley.
Hoerl, A.E. & Kennard, R.W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.
Vinod, H.D. (1976). Application of new ridge regression methods to a study of Bell system scale economies. J.A.S.A., 71, 835-841.
RJOINT procedure
Does modified joint regression analysis for variety-by-environment data
(P.W. Lane & K. Ryder)
Options
TOLERANCE
= scalar Convergence criterion; default 0.001MAXCYCLE
= scalar Maximum number of cycles; default 15SAVE
= regression save structure Save structure from MODEL statement defining the model; default is to use the structure from the latest MODEL statement
Parameters
ENVIRONMENT
= factors The environment factor; no defaultVARIETY
= factors The variety factor; no defaultSENSITIVITIES
= variates To store estimates of sensitivities; default *VARMEANS
= variates To store estimates of variety means; default *ENVEFFECTS
= variates To store estimates of environment effects; default *ENVMEANS
= variates To store estimates of environment means; default *SESENSITIV
= variates To store s.e.s of sensitivities; default *SEVARMEANS
= variates To store s.e.s of variety means; default *SEENVEFFECTS
= variates To store s.e.s of environment effects; default *
Description
Procedure
RJOINT performs a modified joint regression analysis of data classified by two factors. This analysis is motivated by the study of variety-by-environment interactions in agriculture, where the two factors are varieties of some crop and environments at which experiments were carried out. The environments may be different sites within the same year, different years for the same site, or, as is more common, a combination of the two with little interest in individual year and site contributions. The intention is to characterize the sensitivity (or, inversely, the stability) of each variety to environmental effects by fitting a regression of the environment means for a variety on the average environment means. The model is thus nonlinear, of the formyij = vi + bi ´ ej + error
where vi are variety means, ej are environment effects (with S ej =0) and bi are the sensitivity parameters (with mean(bi )=1). Usually, an experimenter is looking for varieties with large means and small sensitivities, to ensure a reliable crop under variable conditions. In
RJOINT the factors are specified using the parameters VARIETY and ENVIRONMENT.The data may consist of one value of the response, such as yield, for each combination of variety and yield. More often, the data are incomplete because not all varieties are tested at each environment; also, there may be multiple measurements of varieties at some environments. If the response is a count or a proportion, such as when investigating disease resistance, it will be more appropriate to use a generalized linear model based on a Poisson or binomial distribution and a log or logit link function. The model should be specified by giving a
MODEL statement before calling RJOINT; for example,MODEL yield
You can choose to fit any generalized linear model by setting the
DISTRIBUTION and LINK options of MODEL: thus, to model proportions, you could give a statement likeMODEL [DISTRIBUTION=binomial; LINK=logit; DISPERSION=*] prop; \
NBINOMIAL=100
The iterative process used in the procedure is controlled by the options
TOLERANCE and MAXCYCLE. At each iteration, the maximum difference between estimates of the sensitivity parameters in successive iterations is compared to the tolerance: the process ends when the differences are small enough, or when the maximum number of iterations is reached. The progress of the search can be followed by including the monitoring setting of the PRINT option.Output is controlled by the option
PRINT. The setting model prints a description of the model. The setting summary displays an analysis of variance (or deviance for non-Normal distributions) showing the effects of Varieties, Environments and Sensitivities: the last is the effect of allowing different sensitivities for each variety. The setting estimates displays two tables, one classified by varieties and the other by environments. For varieties the columns are: unadjusted means, final estimates of means (on the scale of the link function, if relevant), standard errors of estimates, back-transformed means (using the inverse of the link function), sensitivities, and standard errors of sensitivities. For environments the columns are: estimates of effects (on the scale of the link function, if relevant), standard errors of estimates, means (formed from the effects and the mean of the variety means), and back-transformed means (using the inverse of the link function).The remaining parameters allow the following results to be saved: sensitivities, variety means, environment effects, environment means, and standard errors of sensitivities, variety means and environment effects. After calling the procedure, you can use the
RKEEP directive to access fitted values and residuals. Other results from the fit that can be accessed via RKEEP or RDISPLAY may not be correct: for example, the number of residual d.f. shown byRDISPLAY [PRINT=summary]
does not allow for the estimation of sensitivities.
Options:
PRINT, TOLERANCE, MAXCYCLE, SAVE.Parameters:
Method
The procedure uses iterative scheme (A) referred to in Digby (1979). The scheme has been generalized to deal with alternative distributions and link functions. First the environment effects are estimated with the sensitivity parameters set to 1, and then the procedure alternates between estimating the sensitivities with given environment effects and estimating environment effects with given sensitivities. Convergence is tested by comparing the maximum difference between old and new sensitivities against the criterion (default 0.001), but the maximum number of cycles (default 15) will not be exceeded. If the
MAXCYCLE option is set to 1, the result is an unmodified joint regression analysis; see Finlay & Wilkinson (1963).
Action with
RESTRICTA restriction applied to the response variate will be taken into account. Residuals and fitted values will be formed only for the restricted subset of values. If levels of the factors are not represented in the restricted subset, then no results will be shown for those varieties and/or environments. Do not restrict the environment or variety factor differently to the response variate: results may then be incorrect.
References
Digby, P.G.N. (1979). Modified joint regression analysis for incomplete variety ´ environment data, Journal of Agricultural Science, Cambridge 93, 81-86.
Finlay, K.W. & Wilkinson, G.N. (1963). The analysis of adaptation in a plant-breeding programme, Australian Journal of Agricultural Research 14, 742-754.
ROBSSPM procedure
Forms robust estimates of sum-of-squares-and-products matrices
(P.G.N. Digby)
Options
B1
= scalar The value from which the threshold distance is derived (see the Method Section); default 2B2
= scalar The value indicating the decline in weight as the distance of a unit above the threshold increases, (see the Method Section); default 1.25MAXCYCLE
= scalar Maximum number of iterations; default 100TOLERANCE
= scalar The minimum change in the average squared-weight that has to be achieved for the iterative process to converge; default 1.0-8
Parameters
DATA
= pointers Supplies the set of variates in each datamatrixSSPM
= SSPMs SSPM structure to contain the robust estimates of the sums of squares and products, the robust estimates of the means, and the sum of the weights for each datamatrixDISTANCES
= variates To contain the Mahalanobis distances of the units from the meanWEIGHTS
= variates To contain the weights used for each unit when forming the robust estimatesVCOVARIANCE
= symmetric matricesTo contain the robust estimates of the matrices of variances and covariances
CORRELATIONS
= symmetric matricesThis contains on output the correlations from the robust estimates of the variances and covariances
Description
ROBSSPM
forms robust estimates of SSPMs, and the related variance-covariance and correlation matrices, using the method of Campbell (1980). This weights the units differentially so that those that are extreme, in a multivariate sense, contribute less to the calculated means and sums of squares and products. The extremeness of a unit is judged by its Mahalanobis distance from the estimated mean.The input variates are specified, in a pointer, by the
DATA parameter. They may be restricted or may contain some missing values, in which case the units concerned will be ignored.Output is controlled by the
PRINT option, with settings: sspm prints the estimated sums-of-squares-and-products, the estimated means, and the sum of the weights; distances prints the Mahalanobis distances for all the units, including any excluded by restrictions; weights prints the weights for all the units; vcovariance prints the estimated variance-covariance matrix;means
prints the estimated means; correlations prints correlations derived from the variance-covariance matrix; outliers prints unit numbers, weights, and distances for outliers. By default there is no printed output.If the outliers, weights or distances are to be printed then an appropriate summary of the number of units, number of outliers and so on will be printed too. The outlier information consists of the unit numbers, weights and Mahalanobis distances, printed across the page.
The weight given to each unit in forming the robust estimates is one if the unit's Mahalanobis distance from the mean is less than some threshold distance, and it decreases as the Mahalanobis distance increases above that threshold. The threshold and the form of the decrease in weight are controlled by options
B1 and B2, which correspond to the corresponding quantities in the functions used by Campbell (1980), as explained in the Methods Section. By default, B1=2 and B2=1.25.The estimation process is iterative, with the maximum number of iterations controlled by the
MAXCYCLE option (default 100). It converges when the average change in the weights is less than some tolerance. The default tolerance is 1.0-8, but this can be redefined by the TOLERANCE option. Lack of convergence usually indicates some problem with the data, perhaps that the threshold has been set too low.Parameters
SSPM, DISTANCES, WEIGHTS, VCOVARIANCE and CORRELATIONS allow the various components of the output to be saved.
Options:
PRINT, B1, B2, MAXCYCLE, TOLERANCE.Parameters:
DATA, SSPM, DISTANCES, WEIGHTS, VCOVARIANCE, CORRELATIONS.
Method
Initial (unweighted) estimates of the means and sums of squares and products are formed from all the units, subject to any restriction on the data and excluding any units with missing values for any of the variates. From the estimates, Mahalanobis distances of the units from their means are calculated, and used to determine the weights for the units. The weights are then used to reform the SSPM structure, new distances are calculated, and so on. Convergence occurs when the average change in the derived weights is less than the defined tolerance.
The weight w of each unit is given by
w = 1 d £ t
W = (t/d) ´ exp( -0.5 ´ (d-t)2 /
where t, the threshold distance, is given by
t = Ö v + B1 / Ö 2
As explained by Campbell (1980), under Fisher's square root approximation, B1 equates to a percentage point of the standard Gaussian distribution.
Campbell (1980) regards three possibilities as potentially most useful. If B1 is infinite, the usual (non-robust) estimates are obtained. With B1=2 and B2 infinite, the weight decreases inversely with distance (w=t/d); this can be obtained in the procedure by setting B2 to a missing value. Finally, there is the combination used as a default by ROBSSPM, namely B1=2 and B2=1.25.
Action with RESTRICT
If the
DATA variates are restricted only the units not excluded by the restriction will be used in the estimation process. However, Mahalanobis distances will be formed for all units other than those where any of the variates is missing.
Reference
Campbell, N.A. (1980). Robust procedures in multivariate analysis I: robust covariance estimation. Applied Statistics 29, 231-237.
RPAIR procedure
Gives t-tests for all pairwise differences of means from a regression or GLM
(J.T.N.M. Thissen & P.W. Goedhart)
Options
SORT
= string Whether to sort the means into ascending order (no, yes); default noCOMBINATIONS
= string Which combinations of factors in the current model to include (full, present); default full (similar to the PREDICT directive)ADJUSTMENT
= string Type of adjustment with linear regression models (marginal, equal); default marg (similar to the PREDICT directive)WEIGHTS
= table Weights classified by some or all standardizing factors; default * (similar to the PREDICT directive)METHOD
= string Method of forming margin (mean, total); default mean (similar to the PREDICT directive)ALIASING
= string How to deal with aliased parameters (fault, ignore); default faul (similar to the PREDICT directive)SAVE
= identifier Specifies save structure of model to display; default * (i.e. that of the latest model fitted)
Parameters
TREATFACT
= pointers Each pointer contains a list of treatment factors classifying the table of means to be compared (the right-most factor changes fastest, then the second from the right, etc.); this parameter must be setLABELS
= texts Structures containing strings to label rows (and columns) of the symmetric matrices of pairwise differences etc; the length of the text must equal the product of the numbers of factor levels as implied by the factor list in the TREATFACT pointerNEWLABELS
= texts To save the row labels of the DIFFERENCES, SED, TVALUES and TPROBABILITIES matricesDIFFERENCES
= symmetric matricesTo save pairwise differences (treatment means on the diagonal)
SED
= symmetric matrices To save standard errors of the pairwise differences (missing values on the diagonal)TVALUES
= symmetric matrices To save t-values (missing values on the diagonal)TPROBABILITIES
= symmetric matricesTo save t-probabilities (missing values on the diagonal)
Description
When analysing a (non-orthogonal) analysis of variance model or a generalized linear model (GLM) with the regression directives
FIT, ADD etc., effects of factors in the model and their interactions if required, may be assessed from a suitable analysis of variance (deviance) table. With the PREDICT directive tables of estimated means and their standard errors can be obtained, but not standard errors of differences of means. The RPAIR procedure provides additional information on such tables by calculating t-values and corresponding two-sided t-probabilities for tests of all pairwise differences of means.The t-statistics used are based on the residual variance (deviance) and its degrees of freedom from the current regression model. However, if the
DISPERSION option of the MODEL directive has been set to a numerical value (as is by default the case with a GLM with binomial, poisson or multinomial distribution), the degrees of freedom are set to 10000, approximating to the normal distribution.It is assumed that the
MODEL statement for the regression has defined only one response variate.The
TREATFACT parameter must be set to a pointer containing a list of factors classifying the table of means which are to be compared.The
PRINT option controls the output. By default a symmetric matrix of pairwise differences of means is printed with the means themselves down the diagonal. With a GLM these means and their pairwise differences are always calculated on the linear scale. The corresponding symmetric matrices of standard errors and of t-values are printed by default too.The matrix rows (and columns) are ordered such that the right-most factor changes fastest, then the second from the right, etc. This default order can be changed by setting the
SORT option to yes, in which case rows and columns of all matrices are rearranged to put the means on the diagonal of the matrix of differences into ascending order.The
LABELS parameter can be used to label the rows and columns of the matrices, which are then taken in default order. When the LABELS parameter has not been set and the TREATFACT pointer contains just one factor, by default the labels or levels of the factor are used for labeling; when the pointer contains more than one factor, the default row (and column) labels are combinations of factor settings indicated by the first letter of the factor identifier followed by an ordinal level.The
DIFFERENCES, SED, TVALUES and TPROBABILITIES parameters can be used to save the output. The row labels of these matrices can be saved through the NEWLABELS parameter.The
COMBINATIONS, ADJUSTMENT, WEIGHTS, METHOD, ALIASING and SAVE options are as in the PREDICT directive.
Options:
PRINT, SORT, COMBINATIONS, ADJUSTMENT, WEIGHTS, METHOD, ALIASING, SAVE.Parameters:
Method
The procedure uses the
PREDICT directive to save a table of predictions and corresponding variance-covariance matrix. With a GLM the setting of option BACKTRANSFORM of the PREDICT directive is always none.
Action with
RESTRICTAny restrictions applied to vectors used in the regression apply also to the results from
RPAIR.
RPROPORTIONAL procedure
Fits the proportional hazards model to survival data as a GLM
(R.W. Payne)
Options
TERMS
= formula Defines the model to fitSUBJECTS
= factor Subject corresponding to each observationTIMES
= factor or variate Time of each observationCENSORED
= variate Contains the value 1 for censored observations, otherwise 0; if unset it is assumed that there is no censoringOFFSET
= variate Offset to include in the modelINDEX
= variate Mapping variate used to produce the expanded variablesRISKSET
= factor Saves the expanded time factor_2LOGLIKELIHOOD
= scalar Saves -2 ´ log likelihood for the fitted model
Parameters
X
= variates or factors X-variables involved in the model to be fittedNEWX
= variates or factors Identifiers to store the expanded x-variables
Description
The data for
RPROPORTIONAL consist of a set of subjects observed at one or more times. The final time is usually at the time of death (or failure), otherwise (if the subject survives the trial) the observation is said to be censored. The CENSORED option can be used to specify a variate with an entry for each subject containing one when there is censoring, otherwise zero. If this is not specified, it is assumed that there is no censoring.The proportional hazards model (Cox 1972) makes the assumption that the subjects have a baseline hazard function which is modified proportionally by treatments and covariates. In
RPROPORTIONAL it is assumed that the survival times follow a piecewise exponential distribution (Breslow 1974). This partitions the time axis using a set of discrete cut-points ai, and assumes a constant baseline hazard λi between each one. This corresponds to an exponential distribution with mean 1/λi for the survival times (in the absence of treatments) within each time interval. A cut-point is defined at every time that a death (or failure) occurs and, if the covariates or treatments vary with time, also at every time when the subjects are observed.To fit this model as a generalized linear model, the vectors in the model must be expanded so that, for each subject, there is a unit for every time interval up to the last one during which the subject was observed. If (as usually happens) the subject was not observed at every cutpoint, covariates and treatments are taken to be constant during the intervals between the times of the observations. RPROPORTIONAL automatically produces the expanded vectors if required. The INDEX option allows the variate of indexes used to produce the expanded variables from the original ones to be saved.
The SUBJECTS option can specify a factor to indicate the subject corresponding to each observation; this can be omitted if there is only one observation per subject. The time at which each observation was made is specified by the TIME option, in either a factor or a variate. The factors and variates that appear in the model are listed by the X parameter, and new identifiers can be supplied using the NEWX parameter; otherwise the expanded values overwrite the original values. An expanded factor indicating the time interval corresponding to each unit of the NEWX vectors, can be saved using the RISKSET option.
The model to be fitted is specified using the TERMS option, using the new identifiers (if specified) rather than the original ones. The y-variate used within the generalized linear model is an indicator that takes the value 0 if the subject was still surviving within the time interval concerned, otherwise it has the value 1. The model contains an offset representing the log of the exposure time within each interval. Any additional offset can be specified, if required, using the OFFSET option. RPROPORTIONAL contains a TERMS directive, so that you can set the TERMS option to the full model, and then investigate smaller models in the usual way. The deviance produced for the regression model can be compared with a Chi-square distribution as usual, but the residual deviance is not usable as the maximal model assumed by the generalized linear models method is inappropriate. To assist with the comparison of sequences of models, the _2LOGLIKELIHOOD option allows -2 ´ the log likelihood to be saved.
The PRINT option controls printed output, with the same settings as for the FIT directive. After using RPROPORTIONAL, RDISPLAY can be used to obtain further output and RKEEP can be used to save information, in the usual way.
Options: PRINT, TERMS, SUBJECTS, TIMES, CENSORED, OFFSET, INDEX, RISKSET, _2LOGLIKELIHOOD.
Parameters: X, NEWX.
Method
Full details of the method can be found in Aitkin et al. (1989).
Action with RESTRICT
None of the vectors must be restricted, and any restrictions will be
cancelled.
References
Aitkin, M., Anderson, A., Francis, B. and Hinde, J. (1989). Statistical Modelling in GLIM. Oxford University Press.
Breslow. N. (1974). Covariance analysis of censored survival data. Biometrics, 30, 89-99.
Cox, D.R. (1972). Regression models and life tables (with discussion). J.R.Statist.Soc. B, 34, 187-220.
RSURVIVAL procedure
Models survival times of exponential, Weibull or extreme-value distributions
(R.W. Payne)
Options
TIMES
= factor Time of each observationDISTRIBUTION
= string Distribution of the survival times (exponential, weibull, extremevalue); default expoCENSORED
= variate Contains the value 1 if the observation is censored and subject survived the whole trial, otherwise 1GRAPHICS
= string Controls the plotting of diagnostic graphs of the empirical survivor function against the estimate produced by the model (lineprinter, highresolution) default * i.e. noneALPHA
= scalar Saves the estimated value of the parameter α of the Weibull and extreme-value distributions, if the scalar is input with a non-missing value this provides the initial estimate for α (which will also be the final estimate if MAXCYCLE=1)_2LOGLIKELIHOOD
= scalar Saves -2 multiplied by the log likelihoodMAXCYCLE
= scalar Maximum number of iterations to use to estimate α; default 20TOLERANCE
= scalar Convergence limit for α; default 10-5
Parameter
TERMS
= formula Defines the model to fit
Description
RSURVIVAL
models survival times assuming that they follow either an exponential, Weibull or extreme-value distribution, as indicated by the DISTRIBUTION option. It also caters for censored observations, where the subject concerned survived the trial: the CENSORED option can be used to specify a variate with an entry for each subject containing one where the subject survived, otherwise zero. The model to be fitted to the survival times is specified using the TERMS parameter.The analysis is performed using the generalized linear models facilities of Genstat, with a y-variate indicating whether the subject died or survived, and an offset variate which depends on the time variate (see Aitkin et al. 1989). For the exponential distribution this offset is simply the logarithm of the times. With the Weibull distribution it is the Weibull parameter
α multiplied by the logarithm of the times, while for the extreme-value distribution it is the parameter α multiplied by the times. The parameters of the TERMS model and α itself are estimated alternately (with number of cycles controlled by the MAXCYCLE option) until successive estimates are within a tolerance specified by the TOLERANCE option. The ALPHA option can input an initial value for α and save the estimated value. By setting the MAXCYCLE option to one, α can be fixed at the initial value; this is useful for comparing one model with another, when the value of α should be fixed at the value estimated from the more complicated model. The _2LOGLIKELIHOOD option allows -2 ´ log likelihood to be saved.Printed output from the generalized linear model analysis is controlled by the
PRINT option. Further information can be printed subsequently by using RDISPLAY in the usual way. Also, the GRAPHICS option can be set to either high or line to produce either high-resolution or lineprinter plots of the empirical survivor function against the value predicted by the model (see Aitken et al. 1989, pages 275-276).
Options:
Parameters:
TERMS.
Method
Full details of the method can be found in Aitkin et al. (1989).
Action with
RESTRICTThe vectors involved in the analysis may be restricted as usual for a generalized linear model.
Reference
Aitkin, M., Anderson, A., Francis, B. and Hinde, J. (1989). Statistical Modelling in GLIM. Oxford University Press.
RUGPLOT procedure
Draws "rugplots" to display the distribution of one or more samples
(P.W. Lane)
Options
GRAPHICS
= string What type of graphics to use (highresolution, lineprinter); default highTITLE
= text Title for diagram; default *AXISTITLE
= text Title for axis; default *WINDOW
= scalar Window in which to draw high-resolution plot; default *, taken as 11 if SCREEN=clear, or 1 if SCREEN=keepSCREEN
= string Whether to clear screen before high-resolution plot (clear, keep); default cleaORIENTATION
= string Orientation of plots (down, across); default downJITTER
= number Ratio of jitter width to range of data in high-resolution plot; default 0.01SEED
= number Seed for generating random numbers used in jittering; default 0, i.e. continue from last generation, or initialize from system clock
Parameters
DATA
= variates Data to be summarized; no defaultGROUPS
= factor Factor to divide values of a single variate into groups; default *RUGLABELS
= texts Labels for individual rugs; default *, i.e. identifiers of variates or labels or levels of factorPOSITION
= scalar or variate Position on x-axis (or on y-axis if ORIENTATION=across) at which to plot each rug; if GROUPS is set, positions for each level of the factor are taken from a variate; default is to draw a single rug on the axis, and to spread multiple rugs across the window
Description
RUGPLOT
draws pictures to display the distribution of one or more sets of data. In the simplest case, with the DATA parameter set to a single variate, RUGPLOT will draw a single vertical "rug": that is, a series of short horizontal lines on the vertical axis, positioned at each value of the variate. The option ORIENTATION=across produces a horizontal rug. A rug can be added to an existing plot by specifying SCREEN=keep, and setting the WINDOW option to specify the window where the rug is to be drawn. With SCREEN=keep, the default window is 1; with SCREEN=clear, window 11 is used after defining it to fill the whole graphical frame.If several variates are supplied, a rug is drawn for each of them using the same scale. Alternatively, if a single variate is specified by the
DATA parameter, a factor with the same number of values as the variate may be defined by the GROUPS parameter, and a box will be drawn for each level of the factor. The rug plots are spread out across the window by default. The POSITION parameter can be set to specify where each rug is to be positioned on the x-axis (or y-axis if ORIENTATION=across). The setting should be in the range (0, n) for a plot with SCREEN=clear, where n is the number of rugs to be drawn; with SCREEN=keep, the position should be specified in the units of the axis last drawn in the window.Line-printer rugplots can be drawn by setting option
GRAPHICS=lineprinter. The plot is drawn with asterisks, or digits to represent points that are effectively coincident. If the page size is small, as in interactive mode, line-printer plots with ORIENTATION=down are very cramped: the PAGE option of the OUTPUT directive can be used to increase the depth of the graphs. The option ORIENTATION=down cannot be selected for line-printer plots with more than 14 rugs.The
TITLE and AXISTITLE options can be set to specify the titles displayed at the top of the plot and along the axis, for either graphics mode. The RUGLABELS parameter allows you to specify labels that will identify each rug, in place of the default labels taken from the variate identifiers, or factor labels or levels if the GROUPS parameter is set. Long identifiers or labels may overlap each other if ORIENTATION=down, or they may overlap the rug-plots if ORIENTATION=across; a maximum of eight characters is recommended.In high-resolution plots, all data values are "jittered" to try to remove ties. This involves adding a small random value: by default the ratio of the maximum adjustment to the range of all the data is 1:100. This can be modified by setting the
JITTER option to 0 to suppress jittering, or to some other ratio than the default of 0.01. The SEED option can be set to specify the seed of the random-number generation, if a reproducible plot is required.
Options:
Parameters:
DATA, GROUPS, RUGLABELS, POSITION.
Method
High-resolution rugs are plotted using the minus or vertical-bar symbol, in vertical or horizontal plots respectively. Line-printer plots use the default plotting symbols, the asterisk or digits to represent coincident points.
Action with
RESTRICTRestrictions on the supplied variates are taken into account. The grouping factor and texts holding ruglabels, if specified, should not be restricted.
RUNTEST procedure
Performs a test of randomness of a sequence of observations
(P.W. Goedhart)
Options
NULL
= scalar Defines the boundary between the two types; default 0
Parameters
DATA
= variates Sequences of observationsSAVE
= pointers To save the number of runs, the number of positive and negative observations and the lower and upper tail probabilities of the test
Description
The data is assumed to be an ordered sequence of observations of two types, n1 of the first type and n2 of the second type. A run is defined to be a succession of observations of the same type. A clue to lack of randomness is provided by the total number of runs in the sequence. If the data are in random order, the expected number of runs is 1 + 2n1n2/(n1+n2). A low number of runs might indicate positive serial correlation while a high number might arise from negative serial correlation.
The
DATA parameter is used to specify the sequence of observations. Observations larger than option NULL are considered to be of the first type (positive) while observation smaller than NULL are of the second type (negative). Missing values and observations that equal NULL are not taken into account. The PRINT option controls printed output, while the SAVE parameter can be used to specify a pointer containing five scalars to save the number of runs, the number of positive observations (that is, those larger than NULL), the number of negative observations and the lower and upper tail probabilities of the number of runs.
Options:
PRINT, NULL. Parameters: DATA, SAVE.
Method
When the number of observations of type one and two are both smaller than 11, exact left and right tail probabilities are taken from Table 3.1 from Draper & Smith (1981). In other cases a normal approximation with continuity correction is used.
Action with
RESTRICTThe
DATA variate can be restricted so that the test uses only a subset of the units.
Reference
Draper & Smith (1981). Applied Regression Analysis (second edition). Wiley. New York.
SAMPLE procedure
Samples from a set of units, possibly stratified by factors
(P.W. Lane)
Options
SEED
= scalar Seed for the random number generator; default 0 i.e. continue from previous generationNVALUES
= scalar Number of units from which a simple sample is to be taken; default * i.e. as defined by UNITS statement
Parameters
NSAMPLE
= scalars or tables Number of values in simple sample, or table of numbers of values at each combination of levels of its classifying factors; no defaultSAMPLE
= identifiers Structure to store the result; no default
Description
Procedure
SAMPLE produces a random sample from a set of units. A simple sample can be obtained by setting the NSAMPLE parameter to the required number in the sample, and the NVALUES option to the number of units in the set. The NVALUES option can be omitted if the required number of units has been defined by a UNITS statement earlier in the job.For a stratified sample, the
NSAMPLE option should be set to a table containing the required number of units to be sampled at each combination of levels of the factors classifying the table. The NVALUES option is not then relevant as the set of units is determined by the values of the classifying factors.The
SAMPLE parameter must be set to an identifier, which will be formed into a variate containing a set of NSAMPLE integers in the range (1...NVALUES), obtained by random sampling without replacement. The SEED option can be set to define a starting value for the random numbers used to select the units. This can be omitted if some random numbers have already been generated during the current job; SAMPLE will then take the numbers that continue the previous sequence.
Options:
SEED, NVALUES. Parameters: NSAMPLE, SAMPLE.
Method
For a simple sample, a full set of units (1...
NVALUES) is randomly ordered and the first NSAMPLE values are taken. For a stratified sample, the units are sorted according to levels of the classifying factors (after random ordering) and then the requested number of values are taken for each combination of levels.
Action with
RESTRICTThe factors classifying the table must not be restricted. The procedure cannot be used on a restricted set of units.
SAVEDATA procedure
Saves all the current data and information for use in a future run
(S.A. Harding & R.W. Payne)
Option
FILENAME
= text Single-line text to define the name of the file to store the information; default 'g41save'
No parameters
Description
SAVEDATA
provides a convenient way of saving the details of all the data structures that are currently stored within Genstat. It also saves any procedures that are stored and the details of the current environment for graphics. The information is stored in a file on the computer whose name can be specified using the FILENAME option. However, if you only wish to retain one set of information, you can use the default 'g41save'. This may be interpreted differently on different computers. For example on the PC the file will be created as g41save.dat in the current directory.Once the information has been stored, the
GETDATA procedure can be used to return Genstat to exactly the same state as when SAVEDATA was called. This can take place later on in the same Genstat run, or in a completely different run. In either case, you can continue to use Genstat as though nothing had happened in the interim - so this will delete any other data that might have been created or restore any subsequent changes to the graphics environment.
Option:
FILENAME. Parameters: none.
Method
is used to open the file, and RECORD is used to store the information.
SIGNTEST procedure
Performs a one or two sample sign test
(E. Stephens & P.W. Goedhart)
Options
METHOD
= string Type of test (twosided, greater, lessthan); default twosGROUPS
= factor Defines the groups for a two-sample test if only the Y1 parameter is specifiedNULL
= scalar Median value or difference in medians under the null hypothesis; default 0
Parameters
Y1
= variates Data values for a one-sample sign test (neither Y2 nor GROUPS specified), or for the first sample of a two-sample test (Y2 also specified) or the values in both samples of a two-sample test (GROUPS specified but not Y2)Y2
= variates Data values for the second sample of a two-sample testSTATISTIC
= scalars To save the sign test statisticNBINOMIAL
= scalars To save the effective sample sizePROBABILITY
= scalars To save the probability level of the test
Description
The sign test is a non-parametric test for difference in location between two related samples, or for testing the location of a single sample. The data values are specified by the parameters
Y1 and Y2 and the option GROUPS. For a one-sample test, the Y1 parameter should be set to a variates containing the data. The data for a two-sample test can either be specified in two separate variates using the parameters Y1 and Y2. Alternatively, they can be given in a single variate, with the GROUPS option set to a factor to identify the two samples; the units are then assumed to be specified in the same order within each group. The GROUPS option is ignored when the Y2 parameter is set. The NULL option defines the size of the median under the null hypothesis for a one-sample test, or the difference between the two medians in a two-sample test. By default NULL=0.The test is assumed to be two-sided unless otherwise requested by the
METHOD option. Settings greater or lessthan will give one-sided tests for the median or the difference between medians greater than, or less than, the null hypothesis value respectively.In a one-sample test, units that are equal to the null hypothesis median are excluded and the effective sample-size is reduced. Similarly, in a two-sample test, units are excluded where the differences between the pairs of values are equal to that required by the null hypothesis. Units with missing values are also excluded.
By default,
SIGNTEST prints the test statistic, the effective sample size and the (exact) probability level. This information can also be saved in named scalars using the STATISTIC, NBINOMIAL and PROBABILITY parameters repectively, and printing can be suppressed by setting option PRINT=*.
Options:
PRINT, METHOD, GROUPS, NULL.Parameters:
Y1, Y2, STATISTIC, NBINOMIAL, PROBABILITY.
Method
The procedure uses standard Genstat directives for calculation and manipulation.
Action with
RESTRICTIf the variates or the factor are restricted, the test is calculated using only the units not excluded by the restriction. In a two-sample test, the two variates or the variate and factor should be restricted in the same way.
Reference
Siegel, S. (1956). Nonparametric Statistics for the Behavioural Sciences. McGraw-Hill, New York.
SKEWSYMM procedure
Provides an analysis of skew-symmetry for an asymmetric matrix
(P.G.N. Digby)
Option
Parameters
DATA
= matrices Asymmetric (square) matrices to be analysedROOTS
= diagonal matrices Stores the squared singular values from the analysis; the structure has one value for each plane fitted in the analysis (e.g. if the DATA matrix has 11 rows and columns, the ROOTS diagonal matrix will have 5 values)SCORES
= matrices Stores the coordinates of the points from the analysis; each matrix has the same number of rows as the corresponding DATA matrix, and has 2 columns for each plane fitted in the analysis (e.g. if the DATA matrix has 11 rows and columns, the SCORES matrix will have 11 rows and 10 columns)
Description
Procedure
SKEWSYMM provides the canonical analysis of skew-symmetry described by Gower (1977). The input to the procedure, specified by the parameter DATA, is a (square) asymmetric matrix of associations, A say. The rows and columns of A usually represent the same set of objects, but in different modes. For example, with migration data, the rows may represent the Countries or States being departed from, and the columns the same locations but being arrived at. The DATA matrix must not contain any missing values.The results of the analysis are a set of coordinates (
SCORES) for points representing the entities labelling the rows or columns of the DATA matrix. In pairs, these coordinates give positions on a series of planes, also called bimensions. So there is an even number of coordinates for each point; if the DATA matrix has an odd number of rows/columns, there will be one fewer coordinate than the number of rows or columns of the DATA matrix. Also, the "importance" of each plane can be assessed from a set of values (ROOTS) that give the amount of (squared) skew-symmetry explained in each pair of dimensions.The results are interpreted in terms of the areas of triangles. The skew symmetry between the entities in rows (or columns) p and q is proportional to the area of the triangle OPQ, where O is the origin, and P and Q are the points representing p and q respectively. (For further details see either Gower 1977, or Digby & Kempton 1987.) Within each plane the coordinates are arranged so that their centroid is at (0, y), for y³ 0, and so that positive row-to-column skew symmetry is represented in a clockwise direction. (Note that in planes other than the first it is residual skew symmetry, after fitting the preceding planes, that is being modelled).
Printed output is controlled by the
PRINT option: roots prints the roots, also the roots expressed as percentages and cumulative percentages, and scores prints the scores.Results from the analysis can be saved using the parameters
ROOTS and SCORES. The structures specified for these parameters need not be declared in advance. Column labels are provided automatically for the SCORES matrix, but any row labels (useful to identify the entities) are left unchanged.
Option:
PRINT. Parameters: DATA, ROOTS, SCORES.
Method
Procedure
SKEWSYMM provides the analysis of skew-symmetry of Gower (1977). If A is an asymmetric matrix of associations, then S = A - A¢ is skew-symmetric; this matrix is analysed using a singular value decomposition, followed by a reflection and rotation, to provide the necessary roots and scores. For further details see Gower (1977) or Digby & Kempton (1987).
References
Digby, P.G.N. & Kempton. R.A. (1987) Multivariate analysis of ecological communities. Chapman and Hall, London.
Gower, J.C. (1977) The analysis of asymmetry and orthogonality. In: Recent Developments in Statistics (ed. J. Barra, F. Brodeau, G. Romier & B. van Cutsen), pp. 109-123. North Holland, Amsterdam.
SMOOTHSPECTRUM procedure
Forms smoothed spectrum estimates for univariate time series
(G. Tunnicliffe Wilson & S.J. Welham)
Options
METHOD
= string Method to be used for smoothing (lagwindow, direct, YuleWalker, exactautoregressive); default lagwBANDWIDTH
= scalar Frequency domain bandwidth for the smoothing window; must be set if METHOD=direMAXLAG
= scalar Specifies the cut-off lag (i.e. the maximum lag of autocovariance used in the spectrum calculation) for METHOD=lagw, or the order of the autoregression for METHOD=Yule or exac; if this option is not set then BANDWIDTH must be set, and will be used to determine an appropriate value of MAXLAGDIVISIONS
= scalar Determines the number of frequency divisions into which the range [0.0, 0.5] is divided for calculating the spectrum; the default is chosen so that the bandwidth covers about four intervalsPROBABILITY
= scalar Probability value used for confidence limits; default 0.9TAPER
= scalar The proportion of data to be tapered (applied for all settings of METHOD except exac); default 0.0SHAPE
= scalar The shape of the trapezium window (a value of 1.0 specifies a rectangular, and 0.0 a triangular window); default 0.5YLOG
= string Whether to plot with a log-transformed Y-axis (yes, no); default noXLOG
= string Whether to plot with a log-transformed X-axis (yes, no); default noGRAPHICS
= string What sort of graphics to use (lineprinter, highresolution); default highWINDOW
= scalar Window to be used for plotting; default 1PENS
= variate The two pens to be used (after being defined appropriately) for drawing the plots; default !(1,2)
Parameters
SERIES
= variate The series for which the spectrum is to be calculatedLENGTH
= scalar or variate Scalar specifying that the first N units of the series are to be used, or a variate specifying the first and last units of the series to be usedSPECTRUM
= variate Saves the smoothed spectrum; need not be declared in advance, but will be set up as a variate of the appropriate length within the procedureLOWER
= scalar or variate Scalar to save the multiplier of the spectrum used to calculate the lower limit, or a variate to save the values of the lower limitUPPER
= scalar or variate Scalar to save the multiplier of the spectrum used to calculate the upper limit, or a variate to save the values of the upper limitFREQUENCY
= variate Saves the frequency values at which the spectrum is calculated
Description
SMOOTHSPECTRUM
calculates smoothed spectrum estimates for a univariate time series. The series is specified in a variate by the SERIES parameter. The parameter LENGTH can be used to specify that only part of the series is to be used: if LENGTH is set to a scalar N, then only units 1...N are used; alternatively, it can define a sub-series by being set to a variate of length 2 holding the numbers of the first and last units to be used. The spectrum can be saved by the SPECTRUM parameter. The method to be used for the smoothing is controlled by the METHOD option, with settings lagwindow for Parzen lag window smoothing, direct for frequency domain smoothing using a trapezium window, YuleWalker for autoregressive spectrum estimation based on Yule-Walker coefficients, and exactautoregressive for autoregressive estimation based on exact likelihood estimation of the coefficients.For frequency domain smoothing (
METHOD=direct), option BANDWIDTH specifies the bandwidth of the smoothing window and option SHAPE the shape of the trapezium window. The BANDWIDTH option is also used to determine an appropriate default for the MAXLAG option if this is not specified with other METHOD settings: for METHOD=lagwindow, MAXLAG specifies the cut-off lag (i.e. the maximum lag of autocovariance used in the spectrum calculation), while for METHOD=YuleWalker or exactautoregressive, it specifies the order of the autoregression.The
DIVISIONS option can define the number of frequency divisions into which the range [0.0, 0.5] is divided for calculating the spectrum; if this is omitted a default is chosen so that the bandwidth covers about four intervals. The frequency values at which the spectrum is calculated can be saved, in a variate, by the FREQUENCY parameter. The proportion of data to be tapered (relevant to all settings of METHOD except exactautoregressive) is controlled by the TAPER option; by default there is no tapering.The
LOWER and UPPER parameters can be set to scalars to save the scaling factor used to calculate the upper and lower bounds, or to variates to save the upper and lower bounds for the SPECTRUM variate.Printed output can be suppressed by setting the option
PRINT=*; by default, PRINT=description. The PROBABILITY option indicates the probability value used for confidence limits; 0.9 is used as the default.The procedure will also plot the spectrum: option
GRAPHICS controls whether this is done for line printer or on a high-resolution device. With high-resolution graphics, the plot will be produced using the current settings of the window specifed by the WINDOW option; by default WINDOW=1. The FRAME directive can be used to set the attributes of the window prior to calling the procedure. The PENS option controls which pens are to be used for the plots; the attributes of these pens are modified within the procedure. By default pens 1 and 2 are used, but these can be changed by setting option PENS to a variate of length 2 containing the numbers of the two pens required. Options YLOG and XLOG allow the X- and Y-axes to be represented on a logarithmic scale.
Options:
Parameters:
SERIES, LENGTH, SPECTRUM, LOWER, UPPER, FREQUENCY.
Method
A cosine bell window is used for the taper, with lag window and direct spectral smoothing carried out esentially as described in Bloomfield (1976). The autoregressive spectrum estimation uses the standard Yule-Walker equations, as presented for example in Box & Jenkins (1970). These are optionally refined by exact maximum likelihood estimation as described in the Genstat 5 Reference Manual. The theoretical spectrum of the autoregressive model is then calculated. The error limits are calculated using scaled chi-square distributions. These are quite good for the case of lag window and direct smoothing, but in small samples are only very approximate for the autoregressive estimates. The series values are mean corrected before spectrum estimation, but not trend corrected.
Action with
RESTRICTInput and output structures must not be restricted; restriction of the input series to a contiguous set of units can be achieved by use of the
LENGTH parameter.
References
Bloomfield, P. (1976). Fourier analysis of time series: an introduction. Wiley, New York.
Box, G.E.P. & Jenkins, G.M. (1970). Time series analysis, forecasting and control. Holden-Day, San Francisco.
SPEARMAN procedure
Calculates Spearman's Rank Correlation Coefficient
(S.J. Welham, N.M. Maclaren & H.R. Simpson)
Options
GROUPS
= factor Defines the sample membership if only one variate is specified by DATACORRELATION
= scalar or symmetric matrixScalar to save the rank correlation coefficient if there are two samples, or symmetric matrix to save the coeficients between all pairs of samples if there are several
T
= scalar or symmetric matrix Scalar to save the Student's t approximation to the correlation coefficient if there are two samples, or symmetric matrix to save the t approximations for all pairs of samples if there are several (calculated only if the sample size is 8 or more)DF
= scalars Scalar to save the degrees of freedom for each t statistic
Parameters
DATA
= variates List of variates containing the data for each sample, or a single variate containing the data from all the samples (the GROUPS option must then be set to indicate the sample to which each unit belongs)RANKS
= variates Saves the ranks
Description
SPEARMAN
calculates Spearman's Rank Correlation Coefficient between pairs of samples. The samples can be stored in different variates and supplied as a list in the DATA pointer. Alternatively, they can all be placed in a single variate, and the GROUPS option set to a factor to indicate the sample to which each unit belongs. If the sample size is 8 or more (i.e. large enough for the approximation to be valid), the Student's t approximation is calculated. Otherwise SPEARMAN obtains significance levels from stored tables. The results can be displayed by use of the test setting of option PRINT, and saved using the options CORRELATION, T and DF. If more than two variates are specified, the full correlation matrix between all pairs of variables will be formed. The PRINT setting ranks causes the vector of ranks for each sample to be printed and correlations means that only the correlations will be displayed. The ranks from each sample can be saved using the RANKS parameter.
Option:
PRINT, GROUPS, CORRELATION, T, DF. Parameters: DATA, RANKS.
Method
Spearman's Rank Correlation Coefficient is a measure of association between the rankings of two variables measured on N individuals (i.e. two vectors of length N). The correlation coefficient is calculated from the two vectors of ranks for the samples: let { Xi ; i=1...N } and { Yi ; i=1...N } be the vectors of ranks for sample 1 and sample 2 respectively, then the coefficient r is based on the vector of differences between ranks: { Di = Xi - Yi ; i=1...N } and is calculated by
r = 1 - 6 ´ S i=1...N Di2 / [ N(N2-1) ].
If ties are present, then the statistic will be biased, and must be recalculated taking account of ties by:
r = ( S Xi2 + S Yi2 - S Di2 ) / ( 2 ´ Ö ( S Xi2 ´ S Yi2 ) )
where S Xi2 = (N3-N)/12 - Tx ;
S Yi2 = (N3-N)/12 - Ty ;
Tk = S ( tj3 - tj )/12
and tj is the number of observations in the group with rank j.
The t-approximation for this statistic, T, is valid for samples of size 8 upwards, and is calculated by
T = r ´ Ö [ (N-2)/(1-r2) ].
It has approximately a t-distribution on N-2 degrees of freedom, and can be used for a test of the null hypothesis of independance between samples. (See for example Siegel 1956, pages 202-213.)
Action with
RESTRICTIf any of the variates in
DATA is restricted, the statistic is calculated only for the set of units not excluded by the restriction.
Reference
Siegel, S. (1956). Nonparametric Statistics for the behavioural sciences. McGraw-Hill, New York.
SPLINE procedure
Calculates a set of basis functions for M-, B- or I-splines
(P.W. Goedhart)
Options
KNOTS
= scalar or variate Defines the interior knot values; no default i.e. this option must be setORDER
= scalar Defines the order of the piecewise polynomial; default 3TYPE
= string Controls which spline basis is calculated (m, b, i); default mLOWER
= scalar Left-hand limit L of the interval [L, U); default * i.e. the minimum of the X parameter is usedUPPER
= scalar Right-hand limit U of the interval [L, U); default * i.e. a value slightly larger than the maximum of the X parameter is usedNOMESSAGE
= string Which warning messages to suppress (warning); default *
Parameter
X
= variates Values for which the basis spline functions are calculatedBASIS
= pointers Pointer to save variates containing the values of the basis spline functionsDBASIS
= pointers Pointer to save variates containing the values of the first order derivatives of the basis spline functions
Description
Piecewise polynomials or splines can be used for nonparametric function estimation. Splines offer a flexible way to investigate the shape of a relationship or can be used for interpolation and smoothing. There are several types of splines. Smoothing splines, implemented in Genstat by means of regression function
SSPLINE, minimize a penalized residual sums of squares in which lack of smoothness of the estimated function is penalized. Smoothing splines can be less appropriate when local effects are strong or when the estimated function should be monotone, e.g. when estimating growth curves.An alternative for smoothing splines is to use regression splines which offer more control over the characteristics of the estimated function. With regression splines the user first specifies an interval [L, U) on which the estimated function is non-trivial. This interval is then explicitly divided into segments by the user, and a polynomial, of order say k, is fitted in each segment. The segments are separated by a sequence of so-called knots. It is customary to force the piecewise polynomials to join smoothly at these knots. The piecewise polynomials and all their derivatives are always continuous from the right at the knots. Moreover, when there are no replicated knot values, the (k-1)th derivative is continuous at the knot values. The order of differentiability is lower when there are replicated knot values. The full knot sequence includes the endpoints L and U which are replicated depending on the order of the piecewise polynomial. Ramsay (1988) provides a concise introduction into regression splines, while de Boor (1978) gives a full account.
The
SPLINE procedure can be used to calculate a set of so called basis functions which have all the required properties of continuity and differentiability. These basis functions can then be used to fit the regression spline. A simple basis is given by truncated polynomials but this has the disadvantage of generating considerable rounding errors. A numerically superior basis is provided by M-splines. Their main features are that any basis function is positive in a series of consecutive segments, is zero elsewhere and is normalized by having unit area. An alternative normalization is provided by B-splines which have the property that the sum over all basis functions is 1 for values in the interval [L, U). Basis functions of M-splines and B-splines are linearly related and are 0 outside [L, U). The resulting piecewise polynomial is discontinuous at the endpoints L and U.Monotonicity of the estimating function can be imposed by employing a basis consisting of monotone functions. Ramsay (1988) uses integrated M-splines which, when combined with nonnegative regression coefficients, yield a monotone spline. These integrated M-splines are called I-splines. The basis functions for I-splines are not linearly related and they are 0 for values smaller than L and 1 for values greater than or equal to U. The resulting piecewise polynomial is continuous but not differentiable at the endpoints. The choice of the polynomial order and of the knot values are crucial for successful usage of regression splines. Wegman & Wright (1983) summarize practical recommendations for M-splines, while Ramsay (1988) does so for I-splines. In general the knots should be chosen in regions where the relationship changes most markedly. A useful preliminary knot placement is to position a single interior knot at the median, two interior knots at the terciles, three at the quartiles, and so on. The order of the piecewice polynomials is usually taken to be 2 or 3.
The values for which the basis functions are calculated must be specified by the
X parameter. The values of the basis functions are saved with the BASIS parameter, while the first order derivatives of the basis functions can be saved by setting the DBASIS parameter. The BASIS and DBASIS pointers are redefined in the procedure. If a value in the X parameter coincides with an interior knot and the basis function or its first order derivative has a discontinuity at that value, it should be remembered that the functions are continuous and differentiable from the right.The interior knot sequence must be set with the
KNOTS option and the ORDER option can be used to specify the order of the piecewise polynomials. The TYPE option determines which spline basis is calculated. The interval [L, U) for which the basis functions are non-trivial can be specified by the LOWER and UPPER options. If these are unset the following values are used:CALCULATE LOWER = MINIMUM(X)
CALCULATE max = MAXIMUM(X)
CALCULATE UPPER = max + ((max.EQ.0) + ABS(max))/500000
In this case the UPPER value is such that max is just in the interval [L, U). The NOMESSAGE option can be used to suppress warning messages which are printed when the KNOTS variate has replicated values and when the interval [L, U) does not overlap the range of X values.
Options: KNOTS, ORDER, TYPE, LOWER, UPPER, NOMESSAGE.
Parameters: X, BASIS, DBASIS.
Method
Basis functions for M-splines are calculated by a recurrence relation from Ramsay (1988). These basis functions are multiplied to give B-splines or summed to provide I-splines. Note that unlike Ramsay (1988), the order of the spline is here defined as the order of the piecewise polynomial.
Action with RESTRICT
The variates contained in the
BASIS and DBASIS pointers are restricted in the same way as the X parameter. Values in the units excluded by the restriction are set to missing. Restrictions on the KNOTS variate are ignored.
References
de Boor, C. (1978). A practical guide to splines. Springer-Verlag. New York.
Ramsay, J.O. (1988). Monotone regression splines in action (with discussion). Statistical Science, 3, 425-441.
Wegman, E.J. & Wright, I.W. (1983). Splines in statistics. Journal of the American Statistical Association, 78, 351-365.
STANDARDIZE procedure
Standardizes columns of a data matrix to have mean zero and variance one
(S.A. Harding & D.A. Murray)
No options
Parameters
OLD
= variates or matrices Structures containing data to be standardizedNEW
= variates or matrices Structures to contain output; by default the OLD structures are overwritten
Description
The parameter
OLD lists the variates or matrices to be standardized, and the NEW parameter specifies a list of variates or matrices to store the standardized values. If NEW is not set, the transformed data values overwrite the contents of the OLD structures. If NEW is set, it should be to structures of the same type (variate or matrix) as the corresponding OLD structures.
Options: none. Parameters:
OLD, NEW.
Method
The standardized values are calculated as (x-mean(x)) / Ö (var(x)). If there are any missing values in the data these are omitted from the calculation.
Action with
RESTRICTRestrict is irrelevant with matrix structures. It should work as expected with variates.
STEM procedure
Produces a simple stem-and-leaf chart
(J. Ollerton & S.A. Harding)
No options
Parameters
DATA
= variates Data values for each plotNDIGITS
= scalars Number of digits in the leaves of each plotSTEMUNITS
= scalars Scale units for the stem values in each plot
Description
STEM
produces a simple stem-and-leaf chart of a variate of data. The stems indicate leading digits and the leaves indicate subsequent digits. By default, the leaves are formed from single digits; the parameter NDIGITS can be used to specify the number of digits in each leaf if more than one is required. The STEMUNITS parameter can be used to specify the units represented by the stem values. By default, this is determined from the data so that the display will fit within a single screen or page of output. Small values of STEMUNITS (in comparison to the range of the data) should be avoided as they may generate far too many lines of output. The display produced by STEM is restricted to the current output width; any lines that have to be truncated at the right-hand margin are terminated by >, indicating their continuation.
Options: none. Parameters:
DATA, NDIGITS, STEMUNITS.
Method
The variate of data is scaled appropriately and printed into a text. Using
CONCATENATE the individual stem and leaf digits are extracted and then formed into another text structure which is printed out as the stem-and-leaf display. Missing values are excluded from the data.
Action with
RESTRICTSTEM takes account of any restriction on the data variate.
Reference
Tukey, J.W. (1977). Exploratory data analysis. Addison-Wesley.
SUBSET procedure
Forms vectors containing subsets of the values in other vectors
(R.W. Payne)
Options
CONDITION
= expression Logical expression to define which units are to be included; no default - this option must be setSETLEVELS
= string Whether to reform the levels (and labels) of factors to exclude those that do not occur in the subset (yes, no); default no
Parameters
OLDVECTOR
= vectors Vector from which the subset is to be formedNEWVECTOR
= vectors Vector to store the subsets if none is specified, the OLDVECTOR is redefined to store the subset
Description
SUBSET
forms vectors containing subsets of the values in other vectors. The subset is defined by a logical condition which must be specified by the CONDITION option; units with the value 1 (.TRUE.) for the condition are included in the subset, others are omitted.Subsets can be formed for factors, texts and variates. Relevant attributes will also be transferred across to the new structures but, if the subset excludes some of the levels of a factor, a new reduced set of levels (and labels) can be requested by setting option
SETLEVELS=yes.The original vectors are specified by the
OLDVECTOR parameter and identifiers for the vectors to contain the subsets are specified by the NEWVECTORS parameter. If NEWVECTORS is not set, the OLDVECTORS are redefined to store the subsets instead of their original values.
Options:
CONDITION, SETLEVELS. Parameters: OLDVECTORS, NEWVECTORS.
Method
is used to obtain a list of units included according to the CONDITION. This is then used to calculate a format for EQUATE to use to transfer the values. The DUPLICATE directive is used to transfer any relevant attributes. We thank Jac Thissen for suggestions about the redefinition of factor levels.
Action with
RESTRICTAny restriction is ignored; the subset is formed only from the
CONDITION option.OLDVECTOR
is redefined to store the subset
TTEST procedure
Performs a one- or two-sample t-test
(S.J. Welham)
Options
METHOD
= string Type of test required (twosided, greater, lessthan); default twosGROUPS
= factor Defines the groups for a two-sample test if only the Y1 parameter is specifiedCIPROB
= scalar The probability level for the printed confidence interval (0<CIPROB<1); for a one-sided test this will be for the mean and for a two-sided test for the difference in means; default *, i.e. no confidence interval is producedNULL
= scalar The value of the mean under the null hypothesis; default 0
Parameters
Y1
= variates Identifier of the variate holding the first sampleY2
= variates Identifier of the variate holding the second sampleTEST
= variates Identifier of variate (length 3) to save test statistic, d.f. and probability valueINTERVAL
= variates Identifier of variate (length 2) to save end points of the confidence interval
Description
The data for
TTEST are specified by the parameters Y1 and Y2 and the option GROUPS. For a one-sample test, the Y1 parameter should be set to a variate containing the data. TTEST then performs a one-sample t-test for the mean of a normal distribution. The value of the mean under the null hypothesis can be specified by the option NULL; by default NULL=0.The data for a two-sample test can either be specified in two separate variates using the parameters
Y1 and Y2. Alternatively, they can be given in a single variate, with the GROUPS option set to a factor to identify the two samples. The GROUPS option is ignored when the Y2 parameter is set. TTEST then performs a standard two-sample t-test, assuming that the two samples arise from normal distributions with equal variances.For both one- and two-sample cases, the test is assumed to be two-sided unless otherwise requested by the
METHOD option. Settings greater and lessthan of this option will give one-sided tests for the sample mean greater than or less than the null hypothesis mean, respectively.Validity checks are carried out. If any sample has fewer than 6 values, a warning is given that the sample size is too small and the test may not be valid. Also, in the two-sample case, an F-test for equality of variance is carried out and a warning printed if there is evidence of unequal variances.
A summary of the data and the test, together with the test statistic and its probability level under the null hypothesis are always printed. A confidence interval can be produced from the test and printed by setting the
CIPROB option to the required value (between 0 and 1). This is the probability of the interval containing the value under the null hypothesis if that hypothesis is true.Results can be saved by the setting of parameters
TEST and INTERVAL; TEST will store the test statistic, its degrees of freedom and probability level in a variate of length 3, and INTERVAL saves the confidence interval in a variate of length 2 provided the option CIPROB also has a valid setting.
Options:
METHOD, GROUPS, CIPROB, NULL. Parameters: Y1, Y2, TEST, INTERVAL.
Method
A standard t-statistic is calculated in both cases, together with an F-statistic in the two-sample case (to test equality of variances) as described in any standard textbook. The squared t-statistics and the F-ratio are compared with the appropriate F-distribution using the function
FPROBABILITY, and confidence intervals are constructed using the function FED.
Action with
RESTRICTY1 and Y2 may be subject to different restrictions; these restrictions will be obeyed. TEST and INTERVAL must not be restricted.
VEQUATE procedure
Equates across numerical structures
(P.W. Goedhart)
No options
Parameters
OLDSTRUCTURES
= pointers Structures whose values are to be transferred; pointers should point to structures of the same type (scalar, variate, matrix or table) and lengthNEWSTRUCTURES
= pointers Pointer to variates to take each set of transferred values; if necessary structures are redefined to be variates of correct length
Description
This procedure allows the values in a set of structures to be copied into a set of variates, one for each element of the original structures. The original structures are input in a pointer, using the
OLDSTRUCTURES parameter. The variates to take the values are returned in a pointer whose identifier is specified by the NEWSTRUCTURES parameter. If this is already declared, it should be to a pointer of the correct length; if necessary, the structures to which it points will be redefined to be variates of correct length.The procedure copies the values in the first element of each of the original structures into the first variate in the
NEWSTRUCTURES pointer, then those in the second element to the second variate, and so on. The structures in the OLDSTRUCTURES pointer should be of the same type (scalar, variate, matrix, diagonal matrix, symmetric matrix or table) and length.
Options: none. Parameters:
OLDSTRUCTURES, NEWSTRUCTURES.
Method
is used to transfer values. If OLDSTRUCTURES points to restricted variates, the values included in the subset are first copied to dummy structures.
Action with
RESTRICTIf the
OLDSTRUCTURES pointer consists of variates, any restrictions will be taken into account and, if the NEWSTRUCTURES pointer is not declared in advance, its suffixes will be set to the units in the restricted set.
VFUNCTION procedure
Calculates functions of variance components from a
REML analysis(S.J. Welham)
Options
RANDOM
= formula Random model (excluding residual stratum) used for the REML analysisFACTORIAL
= scalar Value used in the FACTORIAL option of REML; default 3NCONSTANT
= scalar Value to be used as constant in the numerator function; default 0DCONSTANT
= scalar Value to be used as constant in the denominator function; default 0SAVE
= REML save structure Specifies the (REML) save structure from which the variance components are to be taken; by default they are taken from the save structure of the most recent REML analysis
Parameters
NUMERATOR
= variates Each variate contains the coefficients multiplying the components in the numerator of a functionDENOMINATOR
= variates Coefficients multiplying the components in the denominator of the functionFUNCTION
= scalars To save the value of the functionSE
= scalars To save the standard error of the function
Description
VFUNCTION
calculates linear functions, reciprocals of linear functions, or ratios of linear functions of the estimates of variance components from a REML analysis. The standard error of the function is also produced.The components are taken from the structure specified by the
SAVE option. If this option is not set, then the SAVE structure from the most recent REML analysis is used. The RANDOM option must be set to be the random formula used by the REML analysis, but excluding the residual term. The FACTORIAL option must be set to the same value as used in the REML analysis; by default FACTORIAL=3.The function is defined by the variates specified by the
NUMERATOR and DENOMINATOR parameters; these give the coefficients of the components for the numerator and denominator of the function, respectively. The order of the components is as given by the RANDOM option, with the residual term added at the end. If either variate contains fewer values than the number of components, the final coefficients are taken to be zero. However, random components that were constrained to be fixed in the REML analysis are ignored. If only NUMERATOR is set the function will be linear; conversely if only DENOMINATOR is used it will be a reciprocal function, and if both NUMERATOR and DENOMINATOR are set then the function is the ratio of two linear functions. Options NCONSTANT and DCONSTANT allow a constant to be included in the numerator and denominator functions, respectively.Printed output is controlled by the option
PRINT; by default the function and its standard error are printed.Parameters
FUNCTION and SE allow the value of the function and its standard error to be saved.
Options:
PRINT, RANDOM, FACTORIAL, NCONSTANT, DCONSTANT, SAVE.Parameters:
NUMERATOR, DENOMINATOR, FUNCTION, SE.
Method
The components and their variance-covariance matrix are retrieved using
VKEEP. The function is calculated as specified and its standard error is calculated using a formula derived from a Taylor expansion (see, for example, Kendall & Stuart 1963, page 232):se( f/g ) = (1/g) ´ Ö { var(f) - 2 ´ (f/g) ´ cov(f,g) + (f/g) ´ (f/g) ´ var(g) }
Reference
Kendall, M. & Stuart, A. (1963). The Advanced Theory of Statistics, Volume 1. Griffin, London.
VHOMOGENEITY procedure
Tests homogeneity of variances and variance-covariance matrices
(R.W. Payne)
Option
GROUPS
= factors Define the groups whose variances are to be compared; these need be given only if DATA is set
Parameters
DATA
= variates or pointers Data variate from which variances are calculated, or pointer to a list of variates from which variance-covariance matrices are calculatedVARIANCES
= any numerical structures or pointersSupplies the variances (in any numerical structure) or variance-covariance matrices in a pointer to a list of symmetric matrices if the
DATA parameter is not set, or saves variances (in a table) and variance-covariance matrices (in a pointer to a list of symmetric matrices) if they have been calculated from DATA and GROUPSDF
= any numerical structure Supplies the degrees of freedom for variances (in any numerical structure) or for variance-covariance matrices (as a pointer to a list of scalars) if the DATA parameter is not set, or saves the degrees of freedom for variances (in a table) or variance-covariance matrices (as a pointer to a list of scalars) if they have been calculated from DATA and GROUPS
Description
Equality of variances of residuals is an important requirement for the validity of analysis of variance and regression.
VHOMOGENEITY allows the homogeneity of variances in different groups to be assessed using Bartlett's test. This test is rather sensitive to departures from Normality (another requirement for the validity of these analyses); so it is recommended that the residuals also be examined, for example using procedures RCHECK or DAPLOT.To test homogeneity of variances,
VHOMOGENEITY can take as input either the original data values together with factors defining the groups, or variances along with degrees of freedom. For the first method the DATA parameter should be set to a variate containing the data values; the factors are specified by the GROUPS option. For the second method the variances are input (in any numerical structure) using the VARIANCES parameter, and the degrees of freedom (in a structure of the same type as for VARIANCES) using the DF parameter.With multivariate data, the analagous test for variance-covariance matrices is given by Box (1950). Again two methods of input are available. The original data variates can be supplied, in a pointer, using the
DATA parameter and the factors can be listed by the GROUPS option (as for a single variate). Alternatively, the VARIANCES parameter can be set to a pointer containing the variance-covariance matrices to be tested, and the DF parameter to a pointer containing the corresponding degrees of freedom.If the variances and degrees of freedom are to be calculated by the procedure (from
DATA and GROUPS), the VARIANCES and DF parameters can be used to save the calculated values. When testing homogeneity of variances, the variances and degrees of freedom are saved in tables, classified by the GROUPS factors; these tables need not be declared in advance. With variance-covariance matrices, VARIANCES is a pointer to the list of symmetric matrices that have been formed, and DF a pointer to a list of scalars.
Option:
GROUPS. Parameters: DATA, VARIANCES, DF.
Method
If the raw data have been given as input, the procedure uses
TABULATE to form tables of variances, and of replications from which the degrees of freedom are calculated. The test statistic is calculated as M/C, whereM = S ni ´ log( S { ni ´ si } / S { ni } ) - S { ni ´ log( si ) }
C = 1 + ( 1 / ( 3 ´ (N - 1) ) ) ´ ( S { 1/ni } - 1 / ( S ni ) )
N = number of groups
ni = degrees of freedom of group i
si = variance of group i
The number of degrees of freedom associated with the test statistic is the number of groups minus one. See, for example, Snedecor & Cochran (1980, pages 252-253).
The
FSSPM directive is used to form variance-covariance matrices. The equivalent test of homogeneity is given by Box (1950).
Action with
RESTRICTIf the
DATA variates are restricted, only the units not excluded by the restriction will be used to calculate the variances and degrees of freedom.
References
Box, G.E.P. (1950). Problems in the analysis of growth and wear curves. Biometrics, 6, 362-389.
Snedecor, G.W. & Cochran, W.G. (1980). Statistical Methods (7th edition). Iowa State University Press, Ames, Iowa.
VINTERPOLATE procedure
Performs linear & inverse linear interpolation between variates
(R.J. Reader)
Options
METHOD
= string Type of interpolation required (interval, value): for METHOD=value, y-values are interpolated for each point in the NEWINTERVALS variates and stored in the NEWVALUES variates, while for METHOD=interval, x-values are estimated for the y-values in the NEWVALUES variates and stored in the NEWINTERVALS variates; default inteRANGE
= string Whether the smallest value, largest value or the mean of the two is returned if more than one value is valid (first, middle, last); default midd
Parameters
OLDVALUES
= pointers Each one contains variates specifying the y-values (data values) with which an interpolation is to be carried outNEWVALUES
= pointers For METHOD=value, each pointer contains variates to store the results of an interpolation; for METHOD=interval, it contains either variates or scalars to specify y-values for which inverse interpolation is to be carried outOLDINTERVALS
= variates Contains the x-values (intervals) corresponding to the variates in the OLDVALUES pointerNEWINTERVALS
= pointers For METHOD=interval, each pointer contains variates to store the results of an inverse interpolation; for METHOD=value, it contains either variates or scalars to specify x-values at which interpolation is to be performed
Description
VINTERPOLATE
performs linear interpolation and inverse linear interpolation between variates, as was given by the Genstat 4 functions LINT and INLINT. The y-values (or data values) are given in a set of variates, contained in the pointer specified by the OLDVALUES parameter. The corresponding x-values (intervals) are specified by the OLDINTERVALS parameter, in a variate with one value for each variate of the OLDVALUES pointer. For interpolation (METHOD=value), values are interpolated at the x-values specified by the variates or scalars contained in the NEWINTERVALS pointer, and are stored in variates contained in the NEWVALUES parameter. For inverse interpolation (METHOD=interval), x-values are estimated for the y-values specified by the variates or scalars contained in the NEWVALUES pointer, and are stored in variates contained in the NEWINTERVALS pointer. Where two or more successive OLDINTERVALS or OLDVALUES are the same, there is no unique solution to the interpolation; the RANGE option allows the smallest (RANGE=first), largest (RANGE=last) or the mean of these two (RANGE=middle) values or intervals to be selected.
Options:
METHOD, RANGE.Parameters:
Method
An estimate of the required value is calculated from each successive pair of points. If this estimate is between the two points from which it was calculated, it is a valid answer i.e. it was produced by interpolation not extrapolation. If it does not satisfy this condition it is set to missing. If the curve is horizontal or vertical at the point of interpolation, more than one point will satisfy the above condition. In this case the value returned will depend on the setting of the option
RANGE. This will determine whether the smallest value found, the largest value found, or the mean of these two values is returned. The default is to return the mean value.
Action with
RESTRICTIf any of the variates in the
OLDVALUES pointer is restricted, the other variates in the pointer must either have the same restriction or be unrestricted. Missing values are then returned for the units excluded by the restriction. The OLDINTERVALS variate must not be restricted.
VORTHPOL procedure
Calculates orthogonal polynomial time-contrasts for repeated measures
(J.T.N.M. Thissen)
Options
TIMEPOINTS
= variate Variate of timepoints; default uses the suffixes of the DATA pointerMAXDEGREE
= scalar The number of contrasts (excluding the mean); default is the number of identifiers in the CONTRAST pointer minus 1
Parameters
DATA
= pointers Each pointer contains the data variates (observed at successive times); must be setCONTRAST
= pointers To save the calculated contrasts: the first variate contains the means, the second the linear polynomial contrasts, the third the quadratic polynomial contrasts etc; must be set
Description
A repeated measures experiment is one in which the same set of units, or subjects, is observed at a sequence of times to investigate treatment effects over a period of time.
VORTHPOL calculates orthogonal polynomial contrasts in time for each experimental unit. These contrasts can then be analysed given the block and treatment structure at each timepoint.The observed data is specified in a pointer containing a set of variates, each one containing the measurements made on the subjects at one of the occasions on which they were observed, and input to the procedure using the
DATA parameter. The variate in option TIMEPOINTS specifies the actual times when the measurements were taken. If this is not specified, the suffixes of the DATA pointer are taken as values for the timepoints.The calculated contrasts are saved in a pointer which must be specified by the
CONTRAST parameter. This points to a list of variates: the first variate saves the means over the DATA variates, the second variate saves the linear polynomial contrast, the third the quadratic polynomial, and so on. Provided the MAXDEGREE option is specified, the CONTRAST need not be declared in advance. The suffixes of the pointer are then defined to be 0, 1, 2 ...The number of contrasts can be specified using option
MAXDEGREE and should be less than the number of timepoints. The default setting is the number of identifiers in the pointer specified by the CONTRAST parameter minus 1. If MAXDEGREE is set, and the CONTRAST pointer has been declared to be of length less than MAXDEGREE+1, a fault message is produced.If an experimental unit has a missing value in one of the
DATA variates each contrast in the CONTRAST pointer (including the mean) gets a missing value for this unit.
Options:
TIMEPOINTS, MAXDEGREE.Parameters:
DATA, CONTRAST.
Method
Procedure
ORTHPOL gets orthogonal polynomial coefficients which are used to form the contrasts.
Action with
RESTRICTEach variate in the
DATA pointer should be restricted in the same way. The saved orthogonal polynomial contrasts are restricted accordingly. The TIMEPOINTS variate must not be restricted.
VPLOT procedure
Plots residuals from a
REML analysis(S.J. Welham)
Options
GRAPHICS
= string What type of graphics to use (lineprinter, highresolution); default highSAVE
= REML save structure Specifies the (REML) save structure from which the residuals and fitted values are to be taken; default * uses the SAVE structure from the most recent REML analysis
Parameter
METHOD
= strings Type of graph for residuals (fittedvalues, normal, halfnormal, histogram); default fitt
Description
Procedure
VPLOT provides four types of residual plots from a REML analysis. These are selected using the METHOD parameter, with settings: fitted for residuals versus fitted valuesnormal
for a Normal plot, halfnormal for a half-Normal plot, and histogram for a histogram of residuals.The residuals and fitted values are accessed automatically from the structure specified by the
SAVE option. If the SAVE option has not been set, they are taken from the last SAVE structure from the most recent REML analysis.By default, high-resolution graphics are used. Line printer graphics can be used by setting option
GRAPHICS=lineprinter.
Options:
GRAPHICS, SAVE. Parameter: METHOD.
Method
Residuals and fitted values effects are accessed, using
VKEEP, from the REML analysis specified by the SAVE option. For a Normal plot, the Normal quantiles are calculated as follows:qi = NED( (i-0.375) / (n+0.25) ) i=1...n
while for a half-Normal they are given by
qi = NED( 0.5 + 0.5 ´ (i-0.375) / (n+0.25) ) i=1...n
Action with
RESTRICTIf the y-variate in the
REML analysis was restricted, then only units included by the restriction will be used in the graphs.
VREGRESS procedure
Performs regression across variates
(M.W. Patefield & D. Tandy)
No options
Parameters
Y
= pointers Pointers each containing a set of y-variates for each of whose units a regression is to be doneX
= pointers Pointer containing x-variates for each set of y-variatesSLOPE
= variates Variate to save the estimated slopes from each set of regressionsINTERCEPT
= variates Variate to save the estimated intercepts from each set of regressions
Description
Given a pointer containing a set of y-variates and another containing a set of x-variates,
VREGRESS performs a separate regression for the data in each unit of the variates. The pointers are specified using the Y and X parameters. There must be an equal number of x- and y-variates, and the variates must all be of the same length. The SLOPE parameter must supply a variate to receive the regression coefficients, and the INTERCEPT parameter can give a variate to save the intercepts. These variates will have the same length as the x- and y-variates.
Options: none. Parameters:
Y, X, SLOPE, INTERCEPT.
Method
The procedure propagates missing values in any of the x-variates into the appropriate unit of the corresponding y-variate, and vice-versa. The regressions are calculated using matrix operations and variate functions in
CALCULATE. The vectors of means across the x- and y-variates are subtracted, and then the sums of squares of X and the sums of products of Y and X are calculated across the variates, to obtain the estimated slope coefficients. The estimated intercepts are calculated directly from the slope coefficients and the vectors of means.The action taken with missing values is the same as would be given by the
FIT directive. Units with all values of Y and X missing after propagation will have missing values in both SLOPE and INTERCEPT. Units with a single non-missing value will have a missing value in SLOPE and the corresponding element of INTERCEPT equal to the non-missing value of Y.It is considerably faster to use
VREGRESS than to use FIT on each unit after re-arrangement of the data. However, there may be a slight loss of accuracy resulting from single-precision calculations on machines where double-precision is used for the calculations within FIT.
Action with
RESTRICTAll the data variates in
Y and X must be subject to the same restriction (if any). If this is not so a fault (VA 1 - incompatible restrictions) will occur during the calculations. If SLOPE or INTERCEPT is restricted prior to use of VREGRESS, the restriction must be the same as that on the data variates. On exit from VREGRESS, the computed SLOPE and INTERCEPT variates will be restricted in the same way as the data variates.
VTABLE procedure
Forms a variate and set of classifying factors from a table
(P.W. Goedhart)
No options
Parameters
TABLE
= tables Tables to be copiedVARIATE
= variates New variate to contain the body of each tableCLASSIFICATION
= pointers Pointer containing the factors by which each new variate is classified
Description
This procedure can be used to store the body of a table in a variate and obtain a set of factors to represent the way in which the data are arranged in the table. These factors then classify the newly formed variate in the same way as in the table. Margins of the table are ignored.
The table to be copied is specified by the
TABLE parameter, the variate must be specified by the VARIATE parameter, and the set of classifying factors can be obtained by setting CLASSIFICATION to a pointer. If this pointer has not been declared, the names of the classifying factors of the table are used as suffix names. The newly formed factors have the same attributes as the old classifying factors, excluding the setting of EXTRA. Note that the order in which the factors are obtained can be unexpected for implicitly declared tables as explained on page 143 of the Genstat 5 Release 3 Reference Manual.
Options: none. Parameters:
TABLE, VARIATE, CLASSIFICATION.
Method
Margins of the table are deleted by the directive
MARGIN. The classifying factors of the table are obtained with GETATTRIBUTE. The initial declarations of the new factors are done as implicit declarations by SORT, as this also transfers any relevant attributes. Factor values are then produced by GENERATE.
WADLEY procedure
Fits models for Wadley's problem, allowing alternative links and errors
(D.M. Smith)
Options
DISTRIBUTION
= string Distribution of the response variate (poisson, negativebinomial, qlnegativebinomial, qlscaledpoisson); default poisLINK
= string Link transformation (logit, probit, complementaryloglog, cauchit); default logiTERMS
= formula Model to be fittedCONTROL
= factor Factor to distinguish the control, or zero, dose (level 1) from the other treatments (level 2)MAXIMAL
= factor Factor to define the maximal model i.e. with a level for every combination of values of the variates and factors in TERMSRMETHOD
= string Type of residuals to be formed (deviance, Pearson); default devi
Parameters
Y
= variates Response variate for each fitRESIDUALS
= variates Variate to save the residuals from each fitFITTEDVALUES
= variates Variate to save the fitted values from each fit
Description
WADLEY
uses the generalized linear models methodology of composite link functions to fit a range of models for the situation known as Wadley's problem. This arises in bioassay where it is possible to count only the number of subjects that have not responded to a particular dose of a drug or stimulus. For example, with eggs of insects fumigated in grain, it is generally possible to count only those that survive and hatch.By default, the analysis assumes that the numbers of subjects that are treated in each observation follow a Poisson distribution with a common mean parameter; other distributions can be specified using the
DISTRIBUTION option or, for user-defined distributions, by providing subsidiary procedure WADDISTRIBUTION (see details of the procedures called by WADLEY).The analysis estimates the mean of the distribution, and then fits the dose response curve as in an ordinary probit analysis. The
LINK option defines the transformation (logit, probit, cauchit, or complementary log-log) required to make the model additive. User-defined transformations can also be specified, by leaving LINK unset and providing subsidiary procedure WADLINK to calculate the necessary fitted values and derivatives, and WADINITIAL to calculate initial values for the linear predictor (see details of the procedures called by WADLEY). The model to be fitted is defined by the TERMS option.To assist the estimation of the expected total number of subjects, there must be some control observations - for example with zero doses of fumigant. These must be identified by a factor, specified by the
CONTROL option, with level 1 for untreated and level 2 for treated. The comparison between the treated and untreated levels of CONTROL must not be aliased with any of the variates and factors in TERMS. (Thus if, for example, TERMS contained a factor representing different types of drug, this must not have a separate level for the untreated observations.)Often with these sort of data, it is found that the variability exceeds that which would be expected from the distribution assumed for the data. To estimate the amount of overdispersion, the
MAXIMAL option must be set to a factor with a different level for every combination of values of the factors and variates in the TERMS model.
Options:
PRINT, DISTRIBUTION, LINK, TERMS, CONTROL, MAXIMAL, RMETHOD.Parameter:
Y, RESIDUALS, FITTEDVALUES.
Method
In essence
WADLEY is a specific application of the use of composite link functions in generalized linear models. The actual methods used are those in the Genstat procedure GLM (Lane 1989) and the GLIM macros of Smith & Morgan (1989). The procedure is very similar in spirit to these GLIM macros, and it is recommended that this reference be consulted for further information. However, there are some extensions. The capability to handle user-defined links and distributions has been added. Also, the range of distributions has been extended to include two forms of quasi-likelihood, namely that where the weighting is of negative binomial form (weight=1/(1+hf´ fittedvalues)), and that where the weighting is of scaled Poisson form (weight=1/hf), where hf is the heterogeneity factor. If the estimated heterogeneity factor is less than zero in the negative binomial cases, or if it is less than one in the scaled Poisson case, it is set to zero or one respectively.WADLEY
has two subsidiary procedures, WADCODI and WADFIT, to assist with the analysis; neither of these need be modified by the user:WADCODI
prints the results of the iterative processes;WADFIT
performs the iterative model fits.There are also three other procedures, which can be rewritten or replaced, to cater for further user-defined distributions and links:
WADDISTRIBUTION
calculates the variance function and deviance for a user-defined distribution;WADINITIAL
calculates initial estimates of the linear predictor for a user-defined link;WADLINK
calculates the fitted values and derivatives for a user-defined link.If the
DISTRIBUTION option is unset, the procedure will call WADDISTRIBUTION instead of using one of the various standard distributions. For a Poisson error distribution WADDISTRIBUTION should be defined like this.PROCEDURE 'WADDISTRIBUTION'
"Calculation of variance function and deviance"
PARAMETER 'Y', "Input: variate; response variate"\
'FITTED', "Input: variate; fitted values"\
'VARIANCE',"Output: variate; variance"\
'LL', "Output: variate; log likelihood variate"\
'DEVIANCE';"Output: scalar; total deviance"\
MODE=p
SCALAR two; VALUE=2
CALCULATE VARIANCE = FITTED
& LL = Y*LOG(Y/FITTED)-Y+FITTED
& DEVIANCE = two*SUM(LL)
ENDPROC
For other error distributions only the three CALCULATE statements need to be changed.
Similarly, for option LINK unset, WADINITIAL and WADLINK will be called. For a logit link WADINITIAL would be defined as follows.
PROCEDURE 'WADINITIAL'
"Calculation of initial estimates of linear predictor"
PARAMETER 'Y', "Input: variate; response variate"\
'LP', "Output: variate; linear predictor"\
'IND', "Input: variate; marker variate with value 1
for a control observation, 0 otherwise"\
'MAXY'; "Inout: scalar; estimate of asymptote"\
MODE=p
SCALAR half,one; VALUE=0.5,1
CALCULATE LP = IND*LOG(MAXY/(Y+half)-one)
ENDPROC
For other links only the CALCULATE statement need be changed so, for example, a probit link would require the statement
CALCULATE LP = IND*NED(one-(Y+one)/MAXY)
For a logit link
WADLINK would bePROCEDURE 'WADLINK'
"Calculation of fitted values and derivatives
of the link function given the linear predictor"
PARAMETER 'LP', "Input: variate; linear predictor"\
'IND', "Input: variate; marker variate with value 1
for a control observation, 0 otherwise"\
'TA', "Output: variate; estimate of fitted values"\
'TB', "Output: variate; estimate of derivatives"\
'MAXY'; "Input: scalar; estimate of asymptote"\
MODE=p
SCALAR half,one; VALUE=0.5,1
CALCULATE TA = (.NOT.IND)+IND/(one+EXP(LP))
& TB = MAXY*EXP(LP)*TA*TA
ENDPROC
For other links only the CALCULATE statements need to be changed so, for example, a probit link would require
CALCULATE TA = (.NOT.IND)+IND/(one-NORMAL(LP))
& TB = MAXY*EXP(-half*LP*LP)/ROOT2PI
where
ROOT2PI is a scalar with the value of the square root of 2π. The marker variate IND identifies which is the control and non control data, so TA should always be of the formTA = (.NOT.IND)+IND*function
where
function is the link function for the non-control part of the data. The variate TB should always be of the formTB = MAXY*deriv_fn
where
deriv_fn is the derivative of the link function with respect to the linear predictor (LP).If
LINK or DISTRIBUTION are unset, but no user routines are given for WADINITIAL, WADLINK and WADDISTRIBUTION, then those given here (for logit link and Poisson error distribution) will be used.A debt is owned to Dr J. Parrott of Pfizer Central Research, Sandwich, UK for his support and encouragement of this work.
Action with
RESTRICTIf the Y-variate is restricted, only the specified subset of the units will be included in the analysis.
References
Lane, P.W. (1989). Procedure GLM. In: Genstat Procedure Library Release 1.3[2] (ed. R.W.Payne & G.M.Arnold), pp 80-82.
Smith, D.M. & Morgan, B.J.T. (1989). Extended Models for Wadley's Problem. Glim Newsletter, 18, 21-28.
WILCOXON procedure
Performs a Wilcoxon Matched-Pairs (Signed-Rank) test
(S.J. Welham, N.M. Maclaren & H.R. Simpson)
Option
Parameters
DATA
= variates Variates holding the differences between each pair of samplesRANKS
= variates Variate to save the signed ranksSTATISTIC
= scalars Scalar to save the value of the test statisticNORMAL
= scalars Scalar to save the Normal approximation for the test statisticSIGN
= scalars Scalar to indicate the sign of the total sum of the signed ranks: 1 if the sum is positive, 0 otherwise
Description
WILCOXON
performs a Wilcoxon Matched-Pairs test on a variate holding differences between two paired samples. This is specified using the parameter DATA. The value of the test statistic is calculated and can be saved in the parameter STATISTIC. The Normal approximation for this statistic is calculated only if the sample size is large enough for the approximation to be reasonable (here taken to be 8 or more). In this case, the Normal statistic can be saved using the parameter NORMAL. The parameter SIGN saves an indicator of whether the total sum of signed ranks is positive (SIGN=1) or negative (SIGN=0).Output from the procedure is controlled by the
PRINT option: test produces the relevant test statistics, and ranks prints the vector of signed ranks for the data.The parameter
RANKS can a variate of the signed ranks of the differences (i.e. of DATA).
Option:
PRINT. Parameters: DATA, RANKS, STATISTIC, NORMAL, SIGN.
Method
The Wilcoxon Matched-Pairs test (often also called the Wilcoxon Signed-Ranks test) is a non-parametric test of location in the case of two related samples (e.g. a before-and-after study). The null hypothesis is that two samples arise from exactly the same distribution, with the alternative that the two underlying distributions differ only in location.
The test statistic WS is formed from the signed ranks of the differences between each pair of observations and is the smaller in absolute value out of:
1) the sum of positive signed-ranks of the sample, and
2) the sum of the negative signed-ranks.
In this procedure the method used for calculating the test statistic is:
WS = N´ (N+1)/4 - modulus(total sum of signed ranks)/2
where N is the number of observations.
The normal approximation for this statistic is valid only for samples of size 8 or greater, and so the procedure calculates this approximation z, where
z = ( WS - N´ (N+1)/4 ) / Ö { N(N+1)(2N+1)/24 }
only if the sample size is at least 8.
(See for example Siegel 1956, pages 75-83.)
Action with
RESTRICTIf the
DATA variate is restricted, the test is calculated only using the units not excluded by the restriction.
Reference
Siegel, S. (1956). Nonparametric statistics for the behavioural sciences. McGraw-Hill, New York.
XOCATEGORIES procedure
Performs analyses of categorical data from crossover trials
(D.M. Smith & M.G.Kenward)
Options
PDATA
= string Whether or not a display of category combination by sequence is required (yes, no); default noMETHOD
= string Type of analysis for which factors are required (subject, loglinear, ownsubject, ownloglinear); default subjCARRYOVER
= string Whether or not models with carryover effects in are to be produced (yes, no); default no
Parameters
SEQUENCE
= factors The identifier of the sequence of treatmentsRESULTS
= pointers Pointer containing factors (one for each period) giving the category scores observedNUMBER
= variates Numbers recorded in the sequence/category combinationsSAVE
= pointers Saves the factors constructed to do the analysisREUSE
= pointers To reuse factors saved earlier using SAVEMODEL
= formulae Additional terms to be fitted to model if OWNSUBJECT or OWNLOGLINEAR options used; default *
Description
XOCATEGORIES
calculates factors, variates and performs various analyses of categorical cross-over data. All analyses conform to one of two different types both utilising a log-linear structure, although only one is derived from an orthodox log-linear model. The first type is based on a latent variable or subject effects model and is described by Kenward & Jones (1991). The subject effects are eliminated through the use of a conditional likelihood and the resultant conditional analysis can be formulated in terms of a conventional log-linear model. In the process of conditioning all between-subject information is lost. This has little consequence for the majority of well-designed cross-over trials in which nearly all information on important comparisons lies in the within-subject stratum. An exception to this is the two-period two-treatment design for which information on the carry-over effect lies in the between-subject stratum. The second type, which uses a multivariate log-linear model, allows between-subject information to be recovered, which in the binary case leads to the Hills-Armitage test for carry-over effect. Details can be found in Jones & Kenward (1989, Section 3.3). If such a test, and other allied tests for the two-period two-treatment design, are required then the log-linear option of the procedure can be used. However, the estimates from this multivariate log-linear model do have the disadvantage of an awkward interpretation. For this reason the latent variable model is to be preferred for higher-order designs and for the two-period two-treatment design when the carry-over test is not required. In the latter case, with binary data, the test for direct treatments reduces to the Mainland-Gart test.In the latent variable model, effects are defined in terms of generalized logits, reducing to ordinary logits in the binary case. This is not ideal for ordered categorical data because the ordering is ignored. Some account can be taken of the ordering of categories by regressing on category scores in a generalization of Armitage's trend test. This can be done by using the parameter
SAVE to obtain the treatment and carryover factors, which are in pointers SAVE[3...(NTRT+2)] and SAVE[(NTRT+3)...(2+2*NTRT)] respectively, NTRT being the number of treatments. From these treatment and carryover factors (NCAT-1) variates corresponding to linear (-1, 0, 1), quadratic (1, -2, 1), etc., contrasts amongst the NCAT categories can be calculated. For example, using the example data where the number of treatments (NTRT) is 3 and the number of categories (NCAT) is 3, the following statements will produce linear and quadratic variates for treatments.XOCATEGORIES SEQUENCE=Seqid; RESULTS=Res; NUMBER=Number; \
SAVE=Fsave
CALCULATE TLN[1...3]=(Fsave[3...6].eq.3)-(Fsave[3...6].eq.1)
& TQU[1...3]=(Fsave[3...6].eq.3)-2*(Fsave[3...6].eq.2)\
+(Fsave[3...6].eq.1)
The
OWNSUBJECT or OWNLOGLINEAR options together with the REUSE and MODEL parameters can then be used to fit models involving these variates and the deviances produced used to compare the models. For example, for the above variates.XOCATEGORIES [METHOD=OWNSUBJECT] \
REUSE=Fsave; MODEL=TLN[1]+TLN[2]++TLN[3]+TQU[1]+TQU[2]+TQU[3]
The data for the procedure are specified by parameters
SEQUENCES, RESULTS and NUMBERS. SEQUENCES supplies a factor with labels indicating the treatment received at each time period. The treatments are labelled by capital letters A, B &c, so (with three periods) BCA indicates treatment B in period 1, C in 2 and A in 1. RESULTS is a pointer containing a factor for each time period, to indicate the corresponding scores recorded in each period. NUMBER then indicates the number of subjects involved. It is not necessary to input data for category combinations in which no subjects were recorded.XOCATEGORIES
processes the data to form the necessary factors to do the analysis using the Genstat facilities for generalized linear models. This information can be saved using the SAVE parameter (see Method) and input again, to save time in later analyses, using the REUSE parameter.Output of the procedure comprises significance tests of treatment and/or carryover and first order period interactions; together with estimates of log odds ratios and their standard errors.
Options:
PRINT, PDATA, METHOD, CARRYOVER.Parameter:
SEQUENCE, RESULTS, NUMBER, SAVE, REUSE, MODEL.
Method
The methods of analysis follow Kenward & Jones (1991) for
SUBJECT and OWNSUBJECT, and Jones & Kenward (1989, pages 124-129) for LOGLINEAR and OWNLOGLINEAR. The actual model fitting is performed using the standard Genstat directives FIT, ADD, DROP and SWITCH, with the PRINT options being those of these directives.The data structure
SAVE has the following form, all factors as Kenward & Jones (1991).SAVE[1]
= The factor G (sequence).SAVE[2]
= The factor S (outcome).SAVE[3...(NTRT+2)]
= The factors T[1...NTRT] (treatment).SAVE[(NTRT+3)...(2+2*NTRT)]
= The factors C[1...NTRT] (carryover).SAVE[(3+2*NTRT)...(NPER+2*NTRT)]
= The factors P[1...NPER] (period).SAVE[NPER+2*NTRT+1]
= The category labels if they exist.We wish to thank Dr Byron Jones of the University of Kent, Canterbury UK, for his assistance.
Action with
RESTRICTInput structures must not be restricted, and any existing restrictions will be cancelled.
References
Jones, B. & Kenward, M.G. (1989). Design and Analysis of Crossover Trials. Chapman and Hall.
Kenward, M.G. & Jones, B. (1991). The analysis of categorical data from cross-over trials using a latent variable model. Statistics in Medicine, 10, 1607-1619.