ADD directive 

Adds extra terms to a linear, generalized linear, generalized additive, or nonlinear model.

 

Options

PRINT = strings What to print (model, deviance, summary, estimates, correlations, fittedvalues, accumulated, monitoring); default mode, summ, esti

NONLINEAR = string How to treat nonlinear parameters between groups (common, separate, unchanged); default unch

CONSTANT = string How to treat the constant (estimate, omit, unchanged); default unch

FACTORIAL = scalar Limit for expansion of model terms; default * i.e. that in previous TERMS statement

POOL = string Whether to pool ss in accumulated summary between all terms fitted in a linear model (yes, no); default no

DENOMINATOR = string Whether to base ratios in accumulated summary on rms from model with smallest residual ss or smallest residual ms (ss, ms); default ss

NOMESSAGE = strings Which warning messages to suppress (dispersion, leverage, residual, aliasing, marginality, df, inflation); default *

FPROBABILITY = string Printing of probabilities for variance and deviance ratios (yes, no); default no

TPROBABILITY = string Printing of probabilities for t-statistics (yes, no); default no

SELECTION = strings Statistics to be displayed in the summary of analysis produced by PRINT=summary, the first four are relevant only for a Normally distributed response, and the last only for a gamma-distributed response (%variance, %ss, adjustedr2, r2, seobservations, dispersion, %cv); default %var,seob if DIST=normal, %cv if DIST=gamma, and disp for other distributions

 

Parameter

formula List of explanatory variates and factors, or model formula

 

Description

ADD adds terms to the current regression model, which may be linear, generalized linear, generalized additive, standard curve, or nonlinear. It is best to give a TERMS statement before investigating sequences of models using ADD, in order to define a common set of units for the models that are to be explored. If no model is fitted after the TERMS statement, the current model is taken to be the null model.

The model fitted by ADD will include a constant term if the previous model included one, and will not include one if the previous model did not. You can, however, change this using the CONSTANT option.
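
A minimal sketch of such a sequence follows. The identifiers Yield, Nitrogen and Variety are hypothetical, and the CONSTANT option of FIT is assumed to take the same settings as described for FIT.

```genstat
" Hypothetical data: response variate Yield, variate Nitrogen, factor Variety "
MODEL Yield
" TERMS defines a common set of units for every model in the sequence "
TERMS Nitrogen + Variety
" First model: a line through the origin "
FIT [CONSTANT=omit] Nitrogen
" Add Variety and restore the constant, which ADD would otherwise leave out "
ADD [CONSTANT=estimate] Variety
```

Without the CONSTANT setting in the ADD statement, the extended model would again omit the constant, because the previous model did.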

The options of ADD are almost all the same as those of the FIT directive, and are described there. There is also an extra option NONLINEAR. This is relevant when fitting curves. For example, if we have a variate Dilution and a factor Solution, the program below will fit parallel curves for the different solutions.

MODEL Density

TERMS Dilution * Solution

FITCURVE [PRINT=model,estimates; CURVE=logistic] \

Dilution + Solution

If we then put

ADD Dilution.Solution

the curves are still constrained to have common nonlinear parameters, but all linear parameters are estimated separately for each group. Alternatively, if we put

ADD [NONLINEAR=separate] Dilution.Solution

different nonlinear parameters will be estimated for each solution too; so only the information about variability is pooled.

 

ADDPOINTS directive

Adds points for new objects to a principal coordinates analysis.

 

Option

PRINT = strings Printed output required (coordinates, residuals); default * i.e. no printing

 

Parameters

NEWDISTANCES = matrices Squared distances of the new objects from the original points

LRV = LRVs Latent roots and vectors from the PCO analysis

CENTROID = diagonal matrices Centroid distances from the PCO analysis

COORDINATES = matrices Saves the coordinates of the additional points in the space of the original points

RESIDUALS = matrices or variates Saves the residuals of the new objects from that space

 

Description

The input to ADDPOINTS is specified by the first three parameters. The NEWDISTANCES parameter specifies an s × n matrix containing squared distances of the s new units from the n old units. The LRV and CENTROID parameters specify structures defining the configuration of old units; these have usually been produced by a PCO statement.

The PRINT option controls the printed output with settings:

coordinates to print the coordinates of the new points;

residuals to print the residual distances of the new units from the coordinates in the space of the old units.

The other parameters can be used to save the results: the COORDINATES parameter allows you to specify an s × k matrix to save the coordinates for the new units, and the residuals can be saved in an s × 1 matrix using the RESIDUALS parameter. The value k is determined by the dimensionality of the input coordinates from the preceding PCO statement.
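
The following sketch shows how the input and output parameters might fit together. All identifiers are hypothetical: Assoc stands for the association matrix of the original n units, and Newdist for an s × n matrix that already holds the squared distances of the new units. The form of the PCO statement is an assumption, as PCO is not described in this section.

```genstat
" Principal coordinates analysis of the original units, saving the "
" latent roots and vectors and the centroid distances (assumed PCO usage) "
PCO Assoc; LRV=Pcolrv; CENTROID=Pcocent
" Locate the new units in the space of the original points "
ADDPOINTS [PRINT=coordinates,residuals] NEWDISTANCES=Newdist; \
   LRV=Pcolrv; CENTROID=Pcocent; \
   COORDINATES=Newcoord; RESIDUALS=Newres
```

Newcoord and Newres then hold the coordinates and residuals of the new units for use in later statements.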

 

ADISPLAY directive

Displays further output from analyses produced by ANOVA.

 

Options

PRINT = strings Output from the analyses of the y-variates, adjusted for any covariates (aovtable, information, covariates, effects, residuals, contrasts, means, cbeffects, cbmeans, stratumvariances, %cv, missingvalues); default * i.e. no printing

UPRINT = strings Output from the unadjusted analyses of the y-variates (aovtable, information, effects, residuals, contrasts, means, cbeffects, cbmeans, stratumvariances, %cv, missingvalues); default * i.e. no printing

CPRINT = strings Output from the analyses of the covariates, if any (aovtable, information, effects, residuals, contrasts, means, %cv, missingvalues); default * i.e. no printing

CHANNEL = identifier Channel number of file, or identifier of a text to store output; default current output file

PFACTORIAL = scalar Limit on number of factors in printed tables of means or effects; default 9

PCONTRASTS = scalar Limit on order of printed contrasts; default 9

PDEVIATIONS = scalar Limit on number of factors in a treatment term whose deviations from the fitted contrasts are to be printed; default 9

FPROBABILITY = string Printing of probabilities for variance ratios in the aov table (yes, no); default no

PSE = strings Standard errors to be printed with tables of means, PSE=* requests s.e.'s to be omitted (differences, lsd, means); default diff

TWOLEVEL = string Representation of effects in 2^n experiments (responses, Yates, effects); default resp

NOMESSAGE = strings Which warning messages to suppress (nonorthogonal, residual); default *

LSDPROBABILITY = scalar Probability level to use in the calculation of least significant differences; default 0.95

 

Parameter

SAVE = identifiers Save structure (from ANOVA) to provide details of each analysis from which information is to be displayed; if omitted, output is from the most recent ANOVA

 

Description

The ADISPLAY directive allows you to display further output from one or more analyses of variance, without having to repeat all the calculations. You can store the information from each analysis in a save structure, using ANOVA, and then specify the same structure in the SAVE parameter of ADISPLAY. Several save structures can be listed, corresponding to the analyses of several different variates. They need not all have been produced by the same ANOVA statement nor even be from the same design. Alternatively, if you just want to display output from the last y-variate that was analysed, you need not specify the SAVE parameter in either ANOVA or ADISPLAY: the save structure for the last y-variate analysed is saved automatically, and provides the default for ADISPLAY.

Apart from CHANNEL, all the options of ADISPLAY also occur with ANOVA and are described there. CHANNEL can be set to a scalar to divert the output to another output channel. Alternatively, it can specify the identifier of a text data structure to store the output (and in fact an undeclared structure will be defined as a text, automatically).

The other difference concerns the options for printed output. The default for PRINT with ADISPLAY is different from that with ANOVA. You are most likely to use ADISPLAY when you are working interactively, to examine one component of output at a time, and it is not obvious that any one component will then be more popular than any other. So the default for ADISPLAY produces no output (that is, PRINT=*). This also means that you do not need to suppress the output explicitly when you are using UPRINT and CPRINT to examine components of output from analysis of covariance. Also, the settings information, covariates, and missingvalues have a slightly different effect with ANOVA than with ADISPLAY. As they are part of the default specified for ANOVA, they will not produce any output unless there is something definite to report. With ADISPLAY you need to request them explicitly, so Genstat will always produce some sort of report. For example, putting

ADISPLAY [PRINT=missing]

when there are no missing values will simply tell you there are none.
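
As an illustration, the statements below analyse a variate once and then request further output from the saved analysis. The identifiers Gain, Gainsave and Efftext are hypothetical, and the model is assumed to have been defined already.

```genstat
" Analyse the hypothetical y-variate Gain, keeping the save structure "
ANOVA [PRINT=aovtable] Gain; SAVE=Gainsave
" Later: tables of means with least significant differences "
ADISPLAY [PRINT=means; PSE=lsd] Gainsave
" Store the tables of effects in the text Efftext instead of printing them "
ADISPLAY [PRINT=effects; CHANNEL=Efftext] Gainsave
```

Because Gainsave is also the save structure of the most recent analysis, it could be omitted from both ADISPLAY statements here.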

 

AKEEP directive

Copies information from an ANOVA analysis into Genstat data structures.

 

Options

FACTORIAL = scalar Limit on number of factors in a model term; default 3

STRATUM = formula Model term of the lowest stratum to be searched for effects; default * implies the lowest stratum

SUPPRESSHIGHER = string Whether to suppress the searching of higher strata if a term is not found in STRATUM (yes, no); default no

TWOLEVEL = string Representation of effects in 2^n experiments (responses, Yates, effects); default resp

RESIDUALS = variate To save residuals from the final stratum (as in the RESIDUALS parameter of ANOVA)

FITTEDVALUES = variate To save fitted values (data values or missing value estimates, minus the residuals from the final stratum - as in the FITTEDVALUES parameter of ANOVA)

CBRESIDUALS = variate To save the sum of the residuals from all the strata

CBCREGRESSION = variate To save the estimates of the covariate regression coefficients, combining information from all the strata

TREATMENTSTRUCTURE = formula structure To save the treatment formula used for the analysis

BLOCKSTRUCTURE = formula structure To save the block formula used for the analysis

WEIGHTS = variate To save the weights used in the analysis

SAVE = identifier Defines the Save structure (from ANOVA) that provides details of the analysis; default * gives that from the most recent ANOVA

 

Parameters

TERMS = formula Model terms for which information is required

MEANS = tables Table to store means for each term (available for treatment terms only)

EFFECTS = tables or scalars Table or scalar (for terms with 1 d.f. when TWOLEVEL=responses or Yates) to store effects (for treatment terms only)

PARTIALEFFECTS = tables Table or scalar (for terms with 1 d.f. when TWOLEVEL=responses or Yates) to store partial effects (for treatment terms only)

REPLICATIONS = tables Table to store replications

RESIDUALS = tables Table to store residuals (for block terms only)

DF = scalars Number of degrees of freedom for each term

SS = scalars Sum of squares for each term

EFFICIENCY = scalars Efficiency factor for each term

VARIANCE = scalars Unit variance for the effects of each term

CEFFICIENCY = scalars Covariance efficiency factor for each term

CREGRESSION = variates Estimated regression coefficients for the covariates in the specified stratum

CSSP = symmetric matrices Covariate sums of squares and products in the specified stratum

CONTRASTS = pointers Estimates for the fitted contrasts of each treatment term, stored in a pointer to scalars or tables; units of the pointer are labelled by the contrast name (as used in the AOV table)

XCONTRASTS = pointers X-variates used to fit contrasts, as orthogonalized by ANOVA, stored in a pointer to tables; units of the pointer are labelled as for CONTRASTS

SECONTRASTS = pointers Standard errors for estimated contrasts, stored in a pointer to scalars or tables; units of the pointer are labelled as for CONTRASTS

DFCONTRASTS = pointers Degrees of freedom for estimated contrasts, stored in a pointer to scalars; units of the pointer are labelled as for CONTRASTS

CBMEANS = tables Table to store estimates of the means, combining information from all the strata (for treatment terms only)

CBEFFECTS = tables or scalars Table or scalar (for terms with 1 d.f. when TWOLEVEL=responses or Yates) to store estimates of the effects, combining information from all the strata (for treatment terms only)

CBVARIANCE = scalars Unit variance for the combined estimates of the effects of each term

CBCEFFICIENCY = scalars Covariance efficiency factor for the combined estimates of each term

STRATUMVARIANCE = scalars Estimates of the stratum variances (for block terms only)

COMPONENTS = scalars Stratum variance components (for block terms only)

 

Description

AKEEP allows you to copy components of the output from an analysis of variance into standard Genstat data structures. You can save the information from the analysis in a save structure, using the SAVE option of ANOVA and then specify the same structure in the SAVE option of AKEEP. Alternatively, Genstat automatically stores the save structure from the last y-variate that has been analysed, and this is used as a default by AKEEP if you do not specify a save structure explicitly.

Several options are provided to save information about the analysis as a whole. The RESIDUALS and FITTEDVALUES options allow variates to be specified to store the residuals and fitted values, respectively. The residuals, like those saved by the RESIDUALS parameter of ANOVA, are taken only from the final stratum. As an alternative, the CBRESIDUALS option saves residuals that incorporate the variability from all the strata. With an orthogonal design, these are simply the sum of the residuals from every stratum. For a non-orthogonal design, they are the data values minus the combined estimates of the treatment effects. Likewise, the CBCREGRESSION option allows you to save estimates of covariate regression coefficients that combine information from all the strata. (The estimates from each individual stratum can be saved using the CREGRESSION parameter, as described below.) The TREATMENTSTRUCTURE, BLOCKSTRUCTURE, and WEIGHTS options can save the treatment and block formulae, and the weights variate (if any) that were used to specify the analysis.

The parameters of AKEEP save information about particular model terms in the analysis. With the TERMS parameter you specify a model formula, which Genstat expands to form the series of model terms about which you wish to save information. As in ANOVA, the FACTORIAL option sets a limit on the number of factors in each term. Any term containing more than that limit is deleted. The subsequent parameters allow you to specify identifiers of data structures to store various components of information for each of the terms that you have specified. If there are components that are not required for some of the terms, you should insert a missing identifier (*) at that point of the list. For example

AKEEP Source + Amount + Source.Amount; MEANS=*,*,Meangain; \

SS=Ssource,Samount,Ssbya; VARIANCE=Vsource,*,*

sets up a table Meangain containing the source by amount table of means; it forms scalars Ssource, Samount, and Ssbya to hold the sums of squares for Source, Amount, and Source.Amount respectively, and scalar Vsource to store the unit variance for the effects of Source.

The structures to hold the information are defined automatically, so you need not declare them in advance. If you have declared any of the tables already, its classification set will be redefined, if necessary, to match the factors in the table that you wish to store. Thus Meangain here would be redefined to be classified by the factors Source and Amount, if it had previously been declared with some other set of classifying factors. Sizes of variates and symmetric matrices will also be redefined if necessary.

Many of the components are stored in tables, classified by the factors in the model term. Tables of means and effects are relevant only for treatment terms. Partial effects (which are also available only for treatment terms) differ from the usual effects, presented by Genstat, only when there is non-orthogonality. The usual effects of a treatment term are estimated after eliminating the terms that precede it in the model, whereas the partial effects are those that would be estimated after eliminating the subsequent treatment terms as well. The TWOLEVEL option controls what is stored for terms whose factors all have only two levels. The settings responses (the default) or Yates generate a scalar response, whereas TWOLEVEL=effects produces a table of effects. Replications are stored in tables, even if all the values are identical. Tables of residuals are available only for block terms.

Four components can be saved in scalars: sums of squares, numbers of degrees of freedom, efficiency factors and unit variances. The unit variance of a treatment term is the residual mean square of the stratum where the term is estimated, divided by its efficiency factor and covariance efficiency factor. Thus you can calculate the estimated variance of any of the effects of the term by dividing its unit variance by the replication of the effect.
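
The calculation described above can be sketched as follows. The factor Source is taken from the earlier example; the other identifiers are hypothetical.

```genstat
" Save the unit variance (a scalar) and the replications (a table) "
AKEEP Source; VARIANCE=Vsource; REPLICATIONS=Rsource
" Tables to hold the results, classified by the factor of the term "
TABLE [CLASSIFICATION=Source] Veffect,SEeffect
" Estimated variance of each effect = unit variance / replication "
CALCULATE Veffect = Vsource / Rsource
" The standard error of each effect is the square root of its variance "
CALCULATE SEeffect = SQRT(Veffect)
PRINT Veffect,SEeffect
```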

The next two parameters allow you to save information about the covariates. To save the regression coefficients estimated in a particular stratum, you should specify the model term of the stratum with the TERMS parameter and a variate with the CREGRESSION parameter. Genstat defines the variate to have a length equal to the number of covariates, and stores the estimated regression coefficients of the covariates in the order in which they were listed in the COVARIATE statement. The CSSP parameter allows you to obtain sums of squares and products between the covariates for the specified model term. These are arranged in a symmetric matrix. The value in row i on the diagonal is the sum of squares for the term in the analysis of variance that has as its y-variate the ith covariate listed in the COVARIATE statement. The value in row i and column j is the cross-product between the effects estimated for the term in the analysis of variance of covariate i and those estimated for the same term in the analysis of covariate j.
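
For example, assuming a hypothetical block formula Blocks/Plots, the covariate information for the Blocks.Plots stratum might be saved as below. The identifiers Plotcoef and Plotssp are illustrative.

```genstat
" The stratum is identified by its model term, given in the TERMS parameter "
AKEEP Blocks.Plots; CREGRESSION=Plotcoef; CSSP=Plotssp
" Plotcoef: one regression coefficient per covariate, in COVARIATE order "
PRINT Plotcoef
" Plotssp: symmetric matrix of covariate sums of squares and products "
PRINT Plotssp
```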

There are four parameters CONTRASTS, XCONTRASTS, SECONTRASTS and DFCONTRASTS for saving information about contrasts. For each treatment term there will generally be several contrasts, so the information is stored in pointers with one element for each contrast. The elements are labelled by the name of the contrast as it appears, for example, in the analysis-of-variance table.
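
A short sketch, assuming that contrasts were fitted to a hypothetical factor Nitrogen (for example with POL(Nitrogen;2) in the treatment formula). The pointer names are illustrative.

```genstat
" Save the estimated contrasts with their standard errors and d.f. "
AKEEP Nitrogen; CONTRASTS=Ncont; SECONTRASTS=Nse; DFCONTRASTS=Ndf
" The null subscript list [] refers to every element of the pointer "
PRINT Ncont[]
PRINT Nse[]
```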

The CBMEANS, CBEFFECTS, CBVARIANCE, CBCEFFICIENCY, and STRATUMVARIANCE parameters save details of estimates that combine information from all the strata of the design, and the COMPONENTS parameter saves the stratum variance components.

In designs where there is partial confounding, and treatment terms are estimated in more than one stratum, options STRATUM and SUPPRESSHIGHER allow you to specify the strata from which the information is to be taken. This is relevant to tables of effects and partial effects, sums of squares, efficiency factors, unit variances, sums of squares and products between covariates, and information about contrasts. By default, Genstat searches all the strata, and takes the information from the lowest of the strata where the term is estimated. If you set the STRATUM option, only strata down to the specified stratum are searched. By setting SUPPRESSHIGHER=yes, you can restrict the search to only that stratum. You cannot save tables of means if you have excluded any stratum from the search. Likewise, tables of residuals and residual sums of squares cannot be saved for any of the excluded strata. If a term is not estimated in any of the strata that are searched, the corresponding data structures are filled with missing values.
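
For instance, suppose that an interaction A.B is partially confounded with a hypothetical block factor Blocks, so that it is estimated in more than one stratum. The information from the Blocks stratum alone might then be obtained as sketched below; all identifiers are illustrative.

```genstat
" Search no lower than the Blocks stratum, and no higher strata either "
AKEEP [STRATUM=Blocks; SUPPRESSHIGHER=yes] A.B; EFFECTS=ABeff; \
   SS=ABss; EFFICIENCY=ABeffic
```

If A.B is not estimated in the Blocks stratum, ABeff, ABss and ABeffic are filled with missing values.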

As explained in the description of the BLOCKSTRUCTURE directive, Genstat will set up an extra "factor" denoted *Units* if the block formula does not specify the final stratum explicitly. AKEEP allows you to refer to this "factor", if necessary, by putting the string '*Units*' (or '*units*' or '*UNITS*') in the TERMS formula. Thus, to save the residual sum of squares in these circumstances, you could put

AKEEP '*Units*'; SS=ResidSS

 

ANOVA directive

Analyses y-variates by analysis of variance according to the model defined by earlier BLOCKSTRUCTURE, COVARIATE, and TREATMENTSTRUCTURE statements.

 

Options

PRINT = strings Output from the analyses of the y-variates, adjusted for any covariates (aovtable, information, covariates, effects, residuals, contrasts, means, cbeffects, cbmeans, stratumvariances, %cv, missingvalues); default aovt, info, cova, mean, miss

UPRINT = strings Output from the unadjusted analyses of the y-variates (aovtable, information, effects, residuals, contrasts, means, cbeffects, cbmeans, stratumvariances, %cv, missingvalues); default * i.e. no printing

CPRINT = strings Output from the analyses of the covariates, if any (aovtable, information, effects, residuals, contrasts, means, %cv, missingvalues); default * i.e. no printing

FACTORIAL = scalar Limit on number of factors in a treatment term; default 3

CONTRASTS = scalar Limit on the order of a contrast of a treatment term; default 4

DEVIATIONS = scalar Limit on the number of factors in a treatment term for the deviations from its fitted contrasts to be retained in the model; default 9

PFACTORIAL = scalar Limit on number of factors in printed tables of means or effects; default 9

PCONTRASTS = scalar Limit on order of printed contrasts; default 9

PDEVIATIONS = scalar Limit on number of factors in a treatment term whose deviations from the fitted contrasts are to be printed; default 9

FPROBABILITY = string Printing of probabilities for variance ratios (yes, no); default no

PSE = string Standard errors to be printed with tables of means, PSE=* requests s.e.'s to be omitted (differences, lsd, means); default diff

TWOLEVEL = string Representation of effects in 2^n experiments (responses, Yates, effects); default resp

DESIGN = pointer Stores details of the design for use in subsequent analyses; default *

WEIGHTS = variate Weights for each unit; default * i.e. all units with weight one

ORTHOGONAL = string Whether or not design to be assumed orthogonal (no, yes, compulsory); default no

SEED = scalar Seed for random numbers to generate dummy variate for determining the design; default 12345

MAXCYCLE = scalar Maximum number of iterations for estimating missing values; default 20

TOLERANCES = variate Tolerances for zero in various contexts; default * i.e. appropriate zero values assumed for the computer concerned

NOMESSAGE = strings Which warning messages to suppress (nonorthogonal, residual); default *

LSDPROBABILITY = scalar Probability level to use in the calculation of least significant differences; default 0.95

 

Parameters

Y = variates Variates to be analysed

RESIDUALS = variates Variate to save residuals for each y variate

FITTEDVALUES = variates Variate to save fitted values

SAVE = identifiers Save details of each analysis for use in subsequent ADISPLAY or AKEEP statements

 

Description

The ANOVA directive analyses balanced designs. These include most of the commonly occurring experimental designs such as randomized blocks, Latin squares, split plots and other orthogonal designs, as well as designs with balanced confounding, like balanced lattices and balanced incomplete blocks. Many partially balanced designs can also be handled, so a very wide range of designs can be analysed. The necessary condition of first-order balance is explained algorithmically by Wilkinson (1970) and Payne and Wilkinson (1976), and mathematically by James and Wilkinson (1971) and Payne and Tobias (1992). However, ANOVA can itself detect whether or not a design can be analysed, so if you are not sure whether or not a particular design is analysable, you can run it through ANOVA and see what happens! (If it is unbalanced, you can use the AUNBALANCED procedure for designs with a single error term, or the REML directive for those with several.)

Before you use ANOVA you must first define the model that is to be fitted in the analysis. Potentially this has three parts. The TREATMENTSTRUCTURE directive specifies the treatment (or systematic, or fixed) terms for the analysis. The BLOCKSTRUCTURE directive defines the "underlying structure" of the design or, equivalently, the error terms for the analysis; in the simple cases where there is only a single error term this can be omitted. The other directive, COVARIATE, lists the covariates if an analysis of covariance is required. At the start of a job all these model-definition directives have null settings. However, once any one of them has been used, the defined setting remains in force for all subsequent analyses in the same job until it is redefined.
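
As a sketch, a split-plot analysis of covariance might be specified as follows. The identifiers Blocks, Wplots, Subplots, Variety, Nitrogen, Initial and Yield are hypothetical.

```genstat
" Error terms: blocks, whole plots within blocks, subplots within whole plots "
BLOCKSTRUCTURE Blocks/Wplots/Subplots
" Treatment terms: both main effects and their interaction "
TREATMENTSTRUCTURE Variety*Nitrogen
" A single covariate "
COVARIATE Initial
" The three settings above remain in force for later analyses in the job "
ANOVA [FPROBABILITY=yes] Yield
```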

The first parameter of ANOVA, Y, lists the variates whose values are to be analysed. Genstat examines them all and forms a list of units for which any of the y-variates or any covariate has a missing value. These units are treated as missing in all the analyses. (This is necessary to avoid having to re-analyse covariates for each y-variate.) However, if your y-variates have different missing units, you may prefer to analyse them with separate ANOVA statements, while saving details of the model and design with the DESIGN option to improve efficiency. Genstat also checks whether any of the y-variates has a restriction. If several variates are restricted, they must all be restricted to the same set of units. Only these units are included in the analysis of each y-variate.

If a y-variate has no values, or if you specify a null entry in the Y list, Genstat produces a skeleton analysis-of-variance table, which excludes sums of squares, mean squares and variance ratios; the only other output available is the information summary. You can save a design structure, but no save structure is formed. This is a good way of checking that a design can be analysed, before the experiment is carried out.

The RESIDUALS parameter allows you to specify a variate to save the estimated residuals from each analysis. Genstat will declare this variate for you if you have not done so already. In models where there are several error terms, only the final one is included. Others can be obtained using the AKEEP directive. The fitted values from the analysis are defined to be the data values minus the estimated residuals. These too can be saved, using the FITTEDVALUES parameter. In models where there are several error terms, only the final error term is subtracted. If this is not what you want, you can save the other error terms using AKEEP and subtract them by CALCULATE.
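
The sketch below saves the usual residuals and fitted values, and then forms fitted values with every error term removed by using the CBRESIDUALS option of AKEEP. It assumes an orthogonal design, a y-variate Yield with no missing values, and that AKEEP may be used with options alone; the identifiers are hypothetical.

```genstat
" Residuals from the final stratum, and data values minus those residuals "
ANOVA [PRINT=aovtable] Y=Yield; RESIDUALS=Res; FITTEDVALUES=Fit
" Sum of the residuals from all the strata "
AKEEP [CBRESIDUALS=Allres]
" Fitted values with all the error terms subtracted "
CALCULATE Fitall = Yield - Allres
```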

The last parameter, SAVE, allows you to save the complete details of the analysis in an ANOVA save structure. The ADISPLAY directive lets you use a save structure to produce further output. You can also use it in the AKEEP directive to put quantities calculated from the analysis into data structures which you can then use elsewhere in Genstat. Save structures are special compound structures, and Genstat declares them automatically. The save structure for the last y-variate analysed is stored automatically, and forms the default for ADISPLAY and AKEEP if you do not provide one explicitly.

The PRINT option selects which components of output are to be displayed.

aovtable analysis-of-variance table

information information summary, giving details of aliasing and non-orthogonality or of any large residuals

covariates estimates of covariate regression coefficients

effects tables of estimated treatment parameters

residuals tables of estimated residuals

contrasts estimated contrasts of treatment effects

means tables of predicted means for treatment terms

cbeffects estimated effects of treatment terms combining information from all the strata in which each term is estimated

cbmeans predicted means for treatment terms combining information from all the strata in which each term is estimated

stratumvariances estimated variances of the units in each stratum and stratum variance components

%cv coefficients of variation and standard errors of individual units

missingvalues estimates of missing values

The default is intended to give the output that you will require most often from a full analysis: aovtable, information, covariates, means, and missingvalues. However, with ANOVA the settings information, covariates, and missingvalues will not produce any output unless there is something definite to report.

In analysis of covariance, you can also print output from the analyses of the covariates and from the analysis of the y-variate ignoring the covariates. This is controlled by options CPRINT and UPRINT respectively. These are similar to the PRINT option except that they do not have the setting covariates, and their defaults are to print nothing.

A table of means is produced by default for each term in the treatment model. By using the PFACTORIAL option you can exclude tables for terms containing more than a specified number of factors; Genstat does not allow tables to have more than nine factors, so the default value of nine gives all the available tables. PFACTORIAL also applies to tables of effects. These are estimates of treatment parameters in the linear model.

The PSE option controls the standard errors printed with the tables of means. The default setting is differences, which gives standard errors of differences of means. The setting means produces standard errors of means, LSD produces least significant differences and by setting PSE=* the standard errors can be suppressed altogether. The probability value to use in the calculation of the least significant differences can be changed from the default 0.95 using the LSDPROBABILITY option.
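
These settings can be tried out on the most recent analysis with ADISPLAY, which takes the same options. A brief sketch:

```genstat
" Main-effect tables only, with standard errors of means "
ADISPLAY [PRINT=means; PFACTORIAL=1; PSE=means]
" All tables of means, with least significant differences "
ADISPLAY [PRINT=means; PSE=lsd]
" All tables of means, with no standard errors "
ADISPLAY [PRINT=means; PSE=*]
```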

When a factor has only two levels, Genstat usually prints the difference between the two main effects instead of the effects themselves. This difference is called a response. For interaction terms whose factors all have only two levels, there are two forms of response. The choice between them is controlled by the TWOLEVEL option. If you leave the default, TWOLEVEL=responses, Genstat calculates the response for an interaction between two factors as the difference between the two main-effect responses, and so on; this is the form described in most textbooks. By putting TWOLEVEL=Yates, you can obtain the form defined by Yates (1937) in which the responses all have equal standard errors. Alternatively, you can put TWOLEVEL=effects if you prefer not to have responses, but to have the effects themselves, as for factors with more than two levels.

The warnings about any large residuals printed in the information summary can be suppressed by setting the NOMESSAGE option to residual. The other setting of NOMESSAGE, nonorthogonal, suppresses the warning produced when there is non-orthogonality between treatment terms or covariates.

The treatment terms to be included in the model are controlled by the FACTORIAL option; this sets a limit (by default 3) on the number of factors in a treatment term: terms containing more than that number are deleted.

The CONTRASTS option places a limit on the order of contrast to be fitted. (Contrasts are defined by using the functions POL, REG, COMPARISON, POLND or REGND in the treatment formula.) For a term involving a single factor, the orders of the successive contrasts run from one upwards, with the deviations term (if any) numbered highest. In interactions between contrasts, the order is the sum of the orders of the component parts. The default value for CONTRASTS is 4. Option PCONTRASTS similarly sets a limit on the order of the contrasts that are printed; its default value is 9.

If your design has few or no degrees of freedom for the residual, you may wish to regard the deviations from some of the fitted contrasts as error components, and assign them to the residual of the stratum where they occur. You can do this by the DEVIATIONS option; its value sets a limit on the number of factors in the terms whose deviations are to be retained in the model. For example, by putting DEVIATIONS=1, the deviations from the contrasts fitted to all terms except main effects will be assigned to error. The PDEVIATIONS option similarly controls the printing of deviations: to put PDEVIATIONS=0, for example, would ensure that no deviations are printed. When deviations have been assigned to error, they will not be included in the calculation of tables of means, which will then be labelled "smoothed". However, the associated standard errors of the means are not adjusted for the smoothing.
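
A sketch of these options in use, with hypothetical factors Nitrogen and Variety and y-variate Yield:

```genstat
" Fit linear and quadratic contrasts to Nitrogen "
TREATMENTSTRUCTURE POL(Nitrogen;2)*Variety
" Keep deviations only for main effects, so that deviations from the "
" contrasts in the interaction are assigned to error; print no deviations "
ANOVA [DEVIATIONS=1; PDEVIATIONS=0] Yield
```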

The WEIGHT option allows you to specify a weight for each unit, to define a weighted analysis of variance. You might want to do this if, for example, different parts of the experiment have different variability; each weight would then be proportional to the reciprocal of the expected variance for the corresponding unit. However unless the weights are fairly systematic, for example to give proportional weighted replication, the design is unlikely to be balanced.

Before Genstat does any calculations with the y-variates, it does an initial investigation known as the dummy analysis to acquire all the information that it needs for the analysis. You can use the DESIGN option to store this information so that Genstat need not recalculate it for future ANOVA statements. The structure in the option is automatically declared as a pointer if you have not declared it already. It points to several other structures which store information about different aspects of the analysis. The only other details that are required for future analyses are the values of the factors in the block and treatment formulae. If you have not previously declared the design structure, or if it has no values, then the current statement derives and stores the necessary information. If the pointer does already have values, then these are used to do the analysis. In that case, of course, values of the factors in the block and treatment formulae must not have been changed since the design structure was formed. The current settings of options FACTORIAL, CONTRASTS, DEVIATIONS, and WEIGHT are then ignored, as is any change in the restrictions on the y-variates. The DESIGN option is particularly useful with designs where there are many model terms or where there is non-orthogonality, as the dummy analysis may then be time-consuming.

Genstat has a simplified version of the dummy analysis which you can use to save computing time if all the model terms are orthogonal and if, for every term, all the combinations of its factors were applied to the same number of units. A check is incorporated which will detect non-orthogonality except in particularly complicated designs where terms are aliased. If you set option ORTHOGONAL=yes, Genstat does the simple version unless non-orthogonality is detected, whereupon it gives a warning message and then switches to the full version. The simplified version is done also if ORTHOGONAL=compulsory, but non-orthogonality now causes the analysis to stop altogether, with an error message; this is useful for checking for typing errors in the factor values when you know that the design should otherwise be orthogonal. The dummy analysis involves the analysis of a specially generated variate which contains random numbers from a Cauchy distribution. The starting value for their generation is set by the SEED option.

The TOLERANCES option controls numerical aspects of analysis. Its setting is a variate with up to four values: the first is used to calculate the tolerance for the analysis of the y-variates (default 10^-7), the second is for the tolerance used in the dummy analysis (default 10^-9), the third is for the estimation of missing values (default 10^-5) and the fourth is for the estimation of stratum variances. The MAXCYCLE option sets a limit on the number of iterations for estimating missing values.

 

References

James, A.T. and Wilkinson, G.N. (1971). Factorisation of the residual operator and canonical decomposition of non-orthogonal factors in analysis of variance. Biometrika 58, 279-294.

Payne, R.W. and Wilkinson, G.N. (1977). A general algorithm for analysis of variance. Applied Statistics 26, 251-260.

Payne, R.W. and Tobias, R.D. (1992). General balance, combination of information and the analysis of covariance. Scandinavian Journal of Statistics 19, 3-23.

Wilkinson, G.N. (1970). A general recursive algorithm for analysis of variance. Biometrika 57, 19-46.

Yates, F. (1937). The design and analysis of factorial experiments. Technical Communication No. 35 of the Commonwealth Bureau of Soils. Commonwealth Agricultural Bureaux, Farnham Royal.

 

ASSIGN directive

Sets elements of pointers and dummies.

 

Options

NSUBSTITUTE = scalar Number of times n to substitute a dummy in order to determine which structure to assign (if n is negative, the assigned structure is the -nth from the bottom of the chain of dummies, like the NTIMES option of EXIT); default * implies no substitution

METHOD = string Whether to replace or preserve the existing value in each dummy or pointer element (replace, preserve); default repl (note, pointer elements are never unset so METHOD=preserve with a pointer simply causes the assignment to be ignored)

RENAME = string Whether to reset the default name for the structure if it has only a suffixed identifier (yes, no); default no

SCOPE = string This allows dummies or pointer elements within a procedure to be set to point to structures in the program that called the procedure (SCOPE=external) or in the main program itself (SCOPE=global) rather than to structures within the procedure (local, external, global); default loca

 

Parameters

STRUCTURE = identifiers Values for the dummies or pointer elements

POINTER = dummies or pointers Structure that is to point to each of those in the STRUCTURE list

ELEMENT = scalars or texts Unit or unit label indicating which pointer element is to be set; if omitted, the first element is assumed

 

Description

ASSIGN allows you to set individual elements of pointers, or to assign a value to a dummy. The parameter POINTER lists the pointers or dummies whose values you want to set; the values that you want to give them are listed by the STRUCTURE parameter. You pick out the individual elements of pointers by the ELEMENT parameter; a scalar identifies the element by its suffix number, while a text identifies it by its label. This example sets the dummy Yvar to point to the variate Height, and elements 1 and 2 of the pointer Xvars to Protein and Vitamin, respectively.

VARIATE Height,Protein,Vitamin

POINTER [NVALUES=2] Xvars

DUMMY Yvar

ASSIGN Height,Protein,Vitamin; POINTER=Yvar,2(Xvars); \

ELEMENT=1,1,2

Element 1 is assumed unless you specify otherwise; so to set just Yvar we need only put

ASSIGN Height; POINTER=Yvar

Options NSUBSTITUTE and METHOD are likely to be most useful when setting dummies within a procedure. By setting METHOD=preserve, any dummies that are already set will have their existing settings preserved. Hence this provides a very convenient and effective way of making default assignments while leaving any explicit assignments unchanged. Suppose, for example, that a procedure has dummy arguments FITTEDVALUES, RESIDUALS, and RSS available to save various aspects of the analysis, and that we wish to use these as working variables while calculating this information within the procedure. By specifying

ASSIGN [METHOD=preserve] LocalF,LocalR,LocalRSS; \

FITTEDVALUES,RESIDUALS,RSS

any of the dummies that is not set when the procedure is called will be assigned to the corresponding local structure, either LocalF, LocalR, or LocalRSS. Note, however, that elements of pointers cannot be unset; they will always point to some identifier, even if it is unnamed. Thus, ASSIGN has no effect on elements of pointers when METHOD=preserve.

The NSUBSTITUTE option is useful when you have dummies pointing to other dummies, in a chain. This can often happen when one procedure calls another, passing one of its own arguments as the argument to the procedure that it calls. A positive setting causes the dummies in the POINTER list to be substituted the defined number of times in order to determine which dummy in the chain is to be assigned a value. Alternatively, you can set NSUBSTITUTE to a negative integer to specify the dummy to assign by counting up from the bottom of the chain of dummies, instead of down from the top.

The RENAME option enables you to control which identifier is used for data structures on the rare occasions when your program contains structures that can be referred to by more than one suffixed identifier and that do not have identifiers in their own right.

Finally, the SCOPE option enables you to assign a dummy within a procedure to a structure in the program that called the procedure. The dummy will thus operate as though it was a dummy option or parameter, except that the decision about the structure that it references in the outer program has been made within the procedure instead of outside it. This facility allows you to define new data structures in the outer program; however, care needs to be taken to ensure that there is no conflict with any existing structures.

 

AXES directive

Defines the axes in each window for high-resolution graphics.

 

Option

EQUAL = strings Whether/how to make axes equal (no, scale, lower, upper); default no

 

Parameters

WINDOW = scalars Numbers of the windows

YTITLE = texts Title for the y-axis in each window

XTITLE = texts Title for the x-axis in each window

YLOWER = scalars Lower bound for y-axis

YUPPER = scalars Upper bound for y-axis

XLOWER = scalars Lower bound for x-axis

XUPPER = scalars Upper bound for x-axis

YMARKS = scalars or variates Distance between each tick mark on y-axis (scalar) or positions of the marks (variate)

XMARKS = scalars or variates Distance between each tick mark on x-axis (scalar) or positions of the marks (variate)

YMPOSITION = strings Position of the tick marks across the y-axis (left, right, centre)

XMPOSITION = strings Position of the tick marks across the x-axis (above, below, centre)

YLABELS = texts Labels at each mark on y-axis

XLABELS = texts Labels at each mark on x-axis

YLPOSITION = strings Position of the labels for the y-axis (left, right)

XLPOSITION = strings Position of the labels for the x-axis (above, below)

YORIGIN = scalars Position on y-axis at which x-axis is drawn

XORIGIN = scalars Position on x-axis at which y-axis is drawn

STYLE = strings Style of axes (none, x, y, xy, box, grid)

PENTITLE = scalar Pen to use for the title

PENAXES = scalar Pen to use for the axes and their labelling

PENGRID = scalar Pen to use for the grid

SAVE = pointers Saves details of the current settings for the axes concerned

 

Description

There is a definition for the axes associated with each Genstat graphics window. This specifies how the axes are to be drawn when graphical output is produced in that window. The default definition for each set of axes requires some of the features to be determined from the data, as described below. Others have fixed defaults that are independent of the data. The AXES directive can be used to override the default action and specify explicitly how particular parts of the axes are drawn. All parameters of AXES are relevant when using DGRAPH, but for other directives only some of the parameters are used.

The WINDOW parameter specifies the window whose axes definition is to be altered. Only those aspects specified by subsequent parameter lists are modified; any parameters that are not set will retain their current settings. WINDOW can be set to a list of window numbers, in which case the other parameter lists are cycled in the usual way.

The YLOWER and YUPPER parameters specify the lower and upper bounds for the y-axis. By default, Genstat derives suitable axis bounds from the data, as described for the appropriate directive. You can set the lower bound to a value greater than the upper bound, to obtain an inverted data scale, but the bounds must not be equal. The XLOWER and XUPPER parameters set bounds for the x-axis in a similar way. The values specified with these parameters are on the scale of the data values that are plotted, and are independent of the normalized device coordinates used to define the window size in FRAME. The EQUAL option can be used to ensure that equal upper or lower bounds are used for the y- and x-axes. For example, if EQUAL=lower, lower bounds for both axes will be set to the lower of the values determined automatically from the data. The bounds obtained when using the EQUAL option may be constrained by settings of other parameters: for example, if YUPPER is set and EQUAL=upper, the upper bounds of both axes are set to the value specified by YUPPER; but if XUPPER is also set, EQUAL will be ignored. You can set EQUAL=lower,upper to constrain both upper and lower bounds, and EQUAL=scale can be used to ensure physical distance is equal on both axes, for example the y-axis could range from 0 to 100 and the x-axis from 100 to 200.

The YORIGIN parameter determines the value on the y-axis through which the x-axis is drawn. If its value is outside the y-axis bounds, the upper or lower bound is adjusted so that the axis will extend up to the specified origin. This applies whether you have set the bounds explicitly or have left Genstat to calculate them from the data. The XORIGIN parameter sets the origin for the x-axis in a similar way. By default, the lower bounds of each axis are used, so that the axes are drawn on the bottom and left-hand sides of the plot.

Titles can be added to the axes using the YTITLE and XTITLE parameters. In each case, the title is limited to a single line of characters.

Each axis is marked with a scale, determined automatically so that tick marks are evenly spaced and positioned to give "round" numbers for the scale values. For each axis, you can specify either the increment between tick marks or their actual positions. You can also specify labels to use for scale markings instead of their numerical values.

To specify the increment on the y-axis, the YMARKS parameter should be set to a scalar. For example, YMARKS=1.5 with bounds 10 and 2 causes tick marks to appear at 2, 3.5, 5, 6.5, 8, and 9.5. The interval must be a positive number, irrespective of the values of the bounds. Alternatively, you can set YMARKS to a variate (with more than one value) to specify the actual positions of the tick marks on the y-axis. Any values that lie outside the axis bounds are ignored. The scale values printed next to the tick marks use a format that is determined automatically from the values, but if you have set YMARKS to a variate it will use the number of decimals specified in the variate declaration. When you have set YMARKS, you can also use the YLABELS parameter to specify a set of labels to mark the axis scale. For example,

TEXT [VALUES=Mon,Tues,Wed,Thur,Fri,Sat,Sun] Day

VARIATE [VALUES=1...31] Month

AXES 1; YMARKS=Month; YLABELS=Day

The strings within the text are cycled if necessary; hence, the number of strings can be less than the number of tick marks.
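The arithmetic behind a scalar YMARKS setting and the cycling of YLABELS can be checked with a short Python sketch. This is not Genstat code; the function names tick_positions and cycle_labels are hypothetical. It also assumes that the marks start at the smaller of the two bounds, which agrees with the example above but is not stated as a general rule.

```python
def tick_positions(bound_a, bound_b, increment):
    """Return evenly spaced tick positions between two axis bounds.

    Assumption: marks start at the smaller bound. The increment must be
    positive whichever way round the bounds are given.
    """
    if increment <= 0:
        raise ValueError("the interval must be a positive number")
    lower, upper = min(bound_a, bound_b), max(bound_a, bound_b)
    # Number of whole increments that fit between the bounds; the small
    # tolerance guards against floating-point rounding.
    steps = int((upper - lower) / increment + 1e-9)
    return [lower + i * increment for i in range(steps + 1)]


def cycle_labels(labels, n_marks):
    """Re-use the label strings in order until every mark has a label."""
    return [labels[i % len(labels)] for i in range(n_marks)]


# YMARKS=1.5 with bounds 10 and 2, as in the example in the text.
ticks = tick_positions(10, 2, 1.5)
print(ticks)

# Seven day names cycled over 31 tick marks.
days = ["Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"]
labels = cycle_labels(days, 31)
print(labels[:8])
```

The eighth mark takes the first label again, which is why the text can hold fewer strings than there are tick marks.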

The tick marks can be drawn to the left or to the right of the axis, or can be centred (that is, across the axis). By default, the tick marks are drawn towards the "outside" of the plot; that is, to the left if the y-axis is to the left of the centre of the plot, or to the right if the y-axis is drawn to the right of centre. The aim is to position the tick marks away from the main part of the plot, so that they interfere with the plotted points as little as possible. You can control the positioning of the tick marks by setting the YMPOSITION parameter to either left, right, or centre. A similar rule governs the default positioning of the scale markings or labels, but you can again control this by setting the YLPOSITION parameter to either left or right. Setting YMARKS=* will return to the default positioning of the tick marks; YLABELS=* will switch off any labels previously specified; and YMPOSITION=* and YLPOSITION=* will switch off tick marks or labels altogether.

Annotation of the x-axis can be controlled in a similar way using the XMARKS, XLABELS, XMPOSITION, and XLPOSITION parameters, except that the settings left and right are replaced by above and below.

The STYLE parameter controls the type of axes that are drawn. By default STYLE=xy, so both y- and x-axes are plotted. Alternative settings allow the axes to be completed by drawing a box around the graph, with an overlaid grid if required. The settings STYLE=x and STYLE=y can be used if only one axis is required. Finally, STYLE=none inhibits the plotting of axes completely, although some other parameters, such as YLOWER, may still have an effect on the plotted data.

There are three parameters that control the pens to be used when drawing the axes. These are PENTITLE, PENAXES, and PENGRID, specifying the pen for the title, the axes and annotation, and the grid, respectively. The initial default is to use pens 30, 31, and 32 in every window. These pens are in turn set up to use colour 1, line style 1, thickness 1, size 1, and font 1. You can thus control which pens are used for drawing the axes in each window, and the attributes of those pens. For example, if no AXES statement has yet been given,

PEN 32; LINESTYLE=4; COLOUR=2

will request that the grids in every window should be drawn in line style 4 and colour 2; while

PEN 29; LINESTYLE=3; COLOUR=4

AXES 1; PENAXES=29

will change the appearance of just the axes in window 1, as pen 29 is not used for the other windows. Control of the grid pen is particularly useful as a combination of colour and line style can be chosen to ensure that the grid does not obscure the plotted points. You should of course be careful of side-effects when modifying these pens or changing the pen numbers. For example, pen 29 may also have been modified for use in a DGRAPH statement and other attributes may have been set that are not wanted when drawing the axes.

Axis annotation is plotted in the margins specified by the FRAME directive. You may wish to reduce the size of these margins if you have defined axes that use less space, for example by keeping within the area of the graph itself, or by omitting titles or labels. Space can thus be regained and used for plotting data. However, if the margins are too small the axis annotation may be "clipped" at the boundaries of the margins; if this happens, you can use FRAME to increase the margin size. The margins are used by DGRAPH, DHISTOGRAM, and DCONTOUR, but they are ignored by other directives.

The current settings of the axes for a particular window can be saved in a pointer supplied by the SAVE parameter. The elements of the pointer are labelled to identify the components. This facility is of most use within procedures, where it may be necessary to check or modify particular AXES settings before constructing complicated graphs. Also, the DKEEP directive allows you to extract the actual bounds used when plotting; these will be the bounds determined from the data if none have been defined explicitly by AXES.

 

BLOCKSTRUCTURE directive

Defines the blocking structure of the design and hence the strata and the error terms.

 

No options

 

Parameter

formula Block model (defines the strata or error terms for subsequent ANOVA statements)

 

Description

The BLOCKSTRUCTURE directive specifies the underlying (or blocking) structure of a design that is to be analysed by ANOVA. However, this can be omitted for unstructured designs with a single error term.

In many designs, the units are nested. The simplest is the randomized block design. Here the units are grouped into sets, known as blocks, the aim being that units in the same block should be more similar than those in different blocks. The allocation of the treatments is randomized independently within each block. The design thus has two sources of random variation: differences between blocks as a whole, and differences between the units within each block. For example, if the units are plots of land and the blocks are groupings of nearby plots, we would have two factors: Blocks to indicate the block to which each plot belonged, and Plots to identify the plots within each block. The block model would then be

Blocks/Plots

indicating that the plots are nested within blocks, and thus that there is no special similarity, for example, between the plot numbered 3 in block 1 and plot 3 of the other blocks. The formula is expanded by Genstat to become

Blocks + Blocks.Plots

giving terms for the differences between blocks as a whole, and the differences between the units within each block, as required.

In the simplest form of the randomized block design, there is a single treatment factor, each of whose levels occurs once in every block. More complicated arrangements are possible, but each treatment combination must still occur exactly the same number of times in every block. This means that any differences found between the blocks cannot be caused by differences between treatments. Thus the treatment terms are all estimated between the plots within the blocks. If the blocks have been chosen successfully, the variation within the blocks should be less than that between blocks, and so the treatment estimates will be less variable than if a completely randomized design had been used. The analysis of variance will be split into two components called strata. The Blocks stratum will contain the sums of squares between blocks; this all arises from the variability between the blocks. The Blocks.Plots stratum will contain the sum of squares for the plots within the blocks; this is partitioned into the sums of squares due to each of the treatment terms, and a residual against which these can be assessed.

Thus, you can deduce the block model from the structure of the units, which should correspond to the way in which the randomization has been done. Genstat expands the block model to form the list of block (or error) terms, each of which defines a stratum corresponding to one of the sources of variability in the design. Alternatively, if you prefer to deduce the error terms by some other means, as for example if you follow the philosophy of fixed and random effects, you can specify the block model to be the sum of these terms.

In the analysis, Genstat initially partitions the sums of squares according to the block model alone. This gives the total sum of squares for each of the strata. Then it partitions each stratum sum of squares into sums of squares for those treatment terms estimated in that stratum, and a residual which provides an estimate of variability against which these treatment sums of squares should be compared.

In the randomized block design, the treatments are estimated only in the final (bottom) stratum. You would thus get the same sums of squares if you omitted the BLOCKSTRUCTURE statement and put Blocks at the start of the treatment model. However the use of BLOCKSTRUCTURE better reflects the structure of the design, as it correctly identifies Blocks as an error term. It also allows for the possibility of treatments being estimated between blocks, as in a balanced incomplete-blocks design.

The simplest design in which the treatments are not all estimated in one stratum is the split-plot design. This again has a nested structure and was devised originally for agricultural experiments where some of the factors can be applied to smaller plots of land than others. However, it also occurs in industrial experiments, in medical experiments and even in the study of cake mixtures. An example is shown in Section 6 of Genstat for Windows: an Introductory Course. Here there are two treatment factors: three different varieties of oats, and four levels of nitrogen. Because of limitations on the machines for sowing seed, different varieties cannot conveniently be applied to plots as small as those that can be used for the different rates of fertilizer. So the design was set up in two stages. First of all, the blocks were each divided into three plots of the size required for the varieties, and the three varieties were randomly allocated to the plots within each block (exactly as in the randomized blocks design). Then each of these plots, or whole-plots as they are usually known, was split into four sub-plots (one for each rate of nitrogen), and the allocation of nitrogen was randomized independently within each whole-plot. The design has sub-plots nested within whole-plots, which are themselves nested within the blocks: that is,

BLOCKSTRUCTURE Blocks / Wplots / Subplots

This expands to

Blocks + Blocks.Wplots + Blocks.Wplots.Subplots

giving strata for variation between blocks, between whole-plots within the blocks, and for sub-plots within the whole-plots (within blocks). Just as in the randomized block design, the blocks all contain the same sets of treatments, and so no treatments are estimated in the Blocks stratum. But varieties, which were applied to whole-plots, are estimated in the Blocks.Wplots stratum; in conventional terminology this is called the stratum for whole-plots within blocks. The variance ratio for varieties is calculated by dividing the Variety mean square by the Blocks.Wplots residual mean square. It is easy to see that this is the correct thing to do. When we look to see whether the varieties differ we are really trying to answer the question: "Do the yields from the three sets of whole-plots, on the first of which the variety Victory was grown, on the second Golden rain, and on the third Marvellous, differ by more than the amount that we would expect for any three randomly chosen sets of whole-plots?". Technically, variety is said to be confounded with whole plots. The terms for Nitrogen, which was applied to sub-plots, and for the Variety.Nitrogen interaction are both estimated in the stratum for sub-plots within whole-plots (Blocks.Wplots.Subplots).

Because Genstat knows the structure of the design, it is able to present appropriate variance ratios for the treatment terms. It is also able, for example, to produce correct standard errors and LSDs for tables of means.

There are some designs where the units have a crossed instead of a nested structure. A simple example is the Latin square. This was devised for agricultural experiments to cater for situations where there are fertility trends both along and across the field, but it can be used whenever there are two independent ways of grouping the units: for example time of testing and batch of material, or the litter of the rat and its order by weight within the litter. In field experiments, the plots are arranged in a square, with blocking factors called Rows and Columns. These each have the same number of levels as there are treatments. Values of the single treatment factor are arranged so that each level occurs once in each row and once in each column. The block structure has rows crossed with columns: that is,

BLOCKSTRUCTURE Rows*Columns

which expands to

Rows + Columns + Rows.Columns

The treatments are estimated only in the Rows.Columns stratum. Removing variation between rows and between columns should make these estimates less variable.

More complicated designs may involve both crossing and nesting. For example nested Latin squares have the structure

BLOCKSTRUCTURE Squares / (Rows * Columns)

which gives strata for squares, rows within squares, columns within squares and rows.columns within squares:

Squares + Squares.Rows + Squares.Columns + Squares.Rows.Columns

Alternatively, a Latin square with split plots has the structure defined by

BLOCKSTRUCTURE (Rows * Columns) / Subplots

giving the strata of an ordinary Latin square plus an additional stratum for subplots within rows and columns:

Rows + Columns + Rows.Columns + Rows.Columns.Subplots
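The expansions of the nesting (/) and crossing (*) operators shown in the examples above can be reproduced with a small Python sketch. It is an illustration of the rules as they appear in those examples, not Genstat's own formula parser; the helper names factor, cross, nest and show are hypothetical, and the order of terms for formulae other than those shown is an assumption.

```python
def factor(name):
    """A single factor is a formula with one one-factor term."""
    return [(name,)]


def _merge(a, b):
    """Combine two terms, keeping factor order and dropping repeats."""
    return a + tuple(f for f in b if f not in a)


def _unique(terms):
    """Remove duplicate terms while preserving their order."""
    seen, out = set(), []
    for t in terms:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out


def cross(left, right):
    """A*B expands to A + B + A.B."""
    products = [_merge(l, r) for l in left for r in right]
    return _unique(list(left) + list(right) + products)


def nest(left, right):
    """A/B expands to A + A.B, where A.B uses every factor in A."""
    all_left = ()
    for t in left:
        all_left = _merge(all_left, t)
    return _unique(list(left) + [_merge(all_left, r) for r in right])


def show(terms):
    return " + ".join(".".join(t) for t in terms)


split_plot = nest(nest(factor("Blocks"), factor("Wplots")), factor("Subplots"))
nested_squares = nest(factor("Squares"), cross(factor("Rows"), factor("Columns")))
split_latin = nest(cross(factor("Rows"), factor("Columns")), factor("Subplots"))

print(show(split_plot))
print(show(nested_squares))
print(show(split_latin))
```

Each printed line lists the strata for one design: the split-plot design, the nested Latin squares and the Latin square with split plots.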

If the factors in the block formula do not provide a unique index for every unit of the experiment, the terms in the block model will not account for all the variation. Genstat must then define a final stratum to contain the variation between the sets of units whose levels are the same for each block factor. At the end of the block model, Genstat therefore sets up an extra term containing all the block factors, together with an extra "factor", denoted *units*, which numbers the units within each set. So, for the randomized block design, you could put just

BLOCKSTRUCTURE Blocks

which would then become

BLOCKSTRUCTURE Blocks + Blocks.*units*

Likewise, for the split-plot design,

BLOCKSTRUCTURE Blocks/Wplots

would become

BLOCKSTRUCTURE Blocks/Wplots + Blocks.Wplots.*units*

Consequently, if you define no block structure at all, Genstat assumes

BLOCKSTRUCTURE *units*

giving a single source of variation representing random differences between the units (this defines a completely randomized design). However, you may prefer to define a more meaningful labelling of the units, for example

BLOCKSTRUCTURE Unitcode

The factor Unitcode would be very easy to set up. To produce a factor equivalent to *units* in more complicated situations, you can use procedure AFUNITS. For example

AFUNITS [BLOCKSTRUCTURE=Blocks/Wplots] Splot

to generate a factor Splot to index the units within Blocks and Wplots.

 

BREAK directive

Suspends execution of the statements in the current channel or control structure and takes subsequent statements from the channel specified.

 

Option

CHANNEL = scalar Channel number; default 1

 

Parameter

expression Logical expression controlling whether or not the break takes place

 

Description

The BREAK directive allows you to halt the execution of the current set of statements temporarily so that you can execute some other statements. If the parameter is not set, the break will always take place. Alternatively, you can specify a logical expression and then the break will take place only if this produces a true (i.e. non-zero) result.

The CHANNEL option determines where the statements to be executed during the break are to be found. Usually (and by default) they are in channel 1. The statements are read and executed, one at a time, until an ENDBREAK statement is reached, at which point control returns to the statements originally being executed.

BREAK also provides a convenient way of interrupting a loop or a procedure so that you can read one set of output before the next is produced.

 

CALCULATE directive

Calculates numerical values for data structures.

 

Options

PRINT = string Printed output required (summary); default * i.e. no printing

ZDZ = string Value to be given to zero divided by zero (missing, zero); default miss

TOLERANCE = scalar If the scalar is non-missing, this defines the smallest non-zero number; otherwise it accesses the default value, which is defined automatically for the computer concerned

 

Parameter

expression Expression defining the calculations to be performed

 

Description

The CALCULATE directive allows you to perform transformations and other calculations. It has the form:

CALCULATE expression

The expression specifies what calculation is to be done, and where the results are to be stored. For example, the command

CALCULATE Area = Length * Breadth

specifies that the structure Area is to store the results of Length multiplied by Breadth. All the usual arithmetic operators are available:

+ addition

- subtraction

* multiplication

/ division

** exponentiation (for example, X**2 stands for X squared)

CALCULATE can operate on any numerical data structure and it will automatically declare the structure to hold the results if you have not declared it already. So, if Area has not yet been defined and Length and Breadth are scalars, Area will become a scalar too.

Generally the structures involved in the calculation must have the same "shape" (for example, variates must have the same length) and the operators operate element-by-element over all their values. So, if Length and Breadth were variates, Area would become a variate each of whose units contained the product of the corresponding units of Length and Breadth. However, scalars and ordinary numbers can be included with calculations on any type of data structure. So

CALCULATE Kilo = Pound / 2.2

would be valid whatever the type of the structures Kilo and Pound.

If any of the values involved in a numerical expression is missing, the result will be missing too.
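The element-by-element rule, the free use of scalars, and the propagation of missing values can be mimicked in Python. This sketch is not Genstat code: variates are modelled as lists, missing values as None, and the function name elementwise is hypothetical.

```python
import operator

MISSING = None  # stand-in for a Genstat missing value


def elementwise(op, a, b):
    """Apply `op` unit by unit; a scalar operand is used for every unit.

    Any missing operand gives a missing result for that unit.
    """
    a_is_list, b_is_list = isinstance(a, list), isinstance(b, list)
    if a_is_list and b_is_list and len(a) != len(b):
        raise ValueError("variates must have the same length")

    def apply(x, y):
        if x is MISSING or y is MISSING:
            return MISSING
        return op(x, y)

    if not a_is_list and not b_is_list:
        return apply(a, b)  # scalar with scalar gives a scalar
    n = len(a) if a_is_list else len(b)
    return [
        apply(a[i] if a_is_list else a, b[i] if b_is_list else b)
        for i in range(n)
    ]


# Area = Length * Breadth for two variates, one unit of Length missing.
length = [2.0, MISSING, 4.0]
breadth = [3.0, 5.0, 0.5]
area = elementwise(operator.mul, length, breadth)
print(area)

# Kilo = Pound / 2.2 with a variate and an ordinary number.
kilo = elementwise(operator.truediv, [2.2, 4.4], 2.2)
print(kilo)
```

The second unit of area is missing because the corresponding unit of length is missing; the other units are ordinary products.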

Genstat has operators for relational tests:

== or .EQ. equality of numerical values

.EQS. equality of textual strings

>= or .GE. greater than or equal to

> or .GT. greater than

<= or .LE. less than or equal to

< or .LT. less than

/= or <> or .NE. not equal to

.NES. inequality of textual strings

.IS. identifier equivalence (to test whether a dummy contains a particular identifier)

.ISNT. identifier non-equivalence

.IN. inclusion: X.IN.Vals gives result true for each value of X that is equal to any one of the values of Vals

.NI. non-inclusion: the opposite of .IN.

These generate a result of zero if the test is false, and one if it is true. (In fact any non-zero value is taken to represent a true value.) With most of these operators, a missing value in either operand (or in both) will generate a missing result. The exceptions are .EQ. and .NE. (and their synonyms), and .EQS. and .NES.: when both operands are missing, .EQ. and .EQS. give a true result, while .NE. and .NES. give a false result.

There are also logical operators that can be used to combine the results of expressions involving relational operators.

.AND. and: a.AND.b true if both a and b are true

.EOR. either or: a.EOR.b is true if either a or b, but not both, is true

.OR. or: a.OR.b is true if either a or b is true

.NOT. not: .NOT.a is true for a untrue

Expressions can contain lists, to specify that the same calculation is to be done for several sets of structures. For example

CALCULATE Pay1,Pay2 = Hours1,Hours2 * Rate + Bonus

This has the same effect as the two commands

CALCULATE Pay1 = Hours1 * Rate + Bonus

CALCULATE Pay2 = Hours2 * Rate + Bonus

Notice that, if any of the lists on the right-hand side of the expression is shorter than the list on the left-hand side, the list is re-used. So the value of Bonus is used for both calculations. To take a more complicated example

CALCULATE X,Y,Z = A,B,C + 1,2

is the same as the three calculations

CALCULATE X = A + 1

CALCULATE Y = B + 2

CALCULATE Z = C + 1

However, the lists on the right-hand side must not be longer than the list on the left-hand side.
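The recycling rule for lists can be written out as a small expansion step. The Python sketch below is only a model of that rule, not Genstat code; expand is a hypothetical helper that pairs each left-hand identifier with the right-hand items it would be calculated from.

```python
def expand(lhs, *rhs_lists):
    """Pair each left-hand item with the matching right-hand items.

    Shorter right-hand lists are reused cyclically; a right-hand list
    longer than the left-hand list is rejected.
    """
    for rhs in rhs_lists:
        if len(rhs) > len(lhs):
            raise ValueError("right-hand list is longer than left-hand list")
    return [
        (target, tuple(rhs[i % len(rhs)] for rhs in rhs_lists))
        for i, target in enumerate(lhs)
    ]


# Models CALCULATE X,Y,Z = A,B,C + 1,2
pairs = expand(["X", "Y", "Z"], ["A", "B", "C"], [1, 2])
for target, (operand, constant) in pairs:
    print(f"CALCULATE {target} = {operand} + {constant}")
```

The loop prints the three single calculations listed above, with the constant 1 reused for Z.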

Genstat provides a wide range of functions for use in expressions. Many of these, known as transformations, produce a result that is the same type of structure as the argument of the function. For example,

CALCULATE Logsulph = LOG(Sulphur)

uses the LOG function to take natural logarithms of the values in the data structure Sulphur. If Sulphur is a variate, Logsulph will also be a variate with the same number of values.

Scalar functions produce a scalar summary of all the values in a structure. For example, we can use the SUM function to calculate the total Sulphur values:

CALCULATE Totsulph = SUM(Sulphur)

There are also vector functions that produce summaries across the values of a set of variates (or of scalars). The set of variates must be put into a pointer. So, we could form a variate M each of whose units contains the mean of the values in the corresponding units of the variates A, B and C by

POINTER [VALUES=A,B,C] Vars

CALCULATE M = VMEAN(Vars)

This can be done more succinctly using an unnamed pointer:

CALCULATE M = VMEAN(!p(A,B,C))
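To make the difference between a scalar summary and a summary across variates concrete, here is a Python model of the unit-by-unit mean. It is a sketch that assumes all the variates have the same length and contain no missing values; vmean is a hypothetical stand-in for the idea behind VMEAN, not its implementation.

```python
def vmean(variates):
    """Return a list whose i-th value is the mean of the i-th units."""
    return [sum(units) / len(units) for units in zip(*variates)]


a = [1.0, 2.0]
b = [3.0, 4.0]
c = [5.0, 6.0]
m = vmean([a, b, c])   # one mean per unit, not one mean overall
```

The result has as many values as each input variate, whereas a scalar function such as SUM would reduce a single variate to one number.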

When a function has more than one argument, each is separated from the next by a semicolon. For example

CALCULATE Corr = CORRELATION(X; Y)

calculates the correlation between the values in X and Y.

Function arguments can also be lists, running in parallel with the other lists in the expression. For example, to calculate Corr1 as the correlation between X1 and Y1, and Corr2 as the correlation between X2 and Y2:

CALCULATE Corr1,Corr2 = CORRELATION(X1,X2; Y1,Y2)

When a factor occurs in an expression on the right-hand side, Genstat usually works with its levels. The exception is when the factor occurs as the first operand of the operators .IN. or .NI. and the second operand is a text; the factor labels are then used instead. A factor can also occur on the left-hand side of an expression and receive the results of a calculation; an error is reported if any of the resulting values is not one of the levels of the factor. Two functions are provided especially for factors: NLEVELS(F) gives the number of levels of the factor F, and NEWLEVELS(F; V) forms a variate from the factor F, using variate V to define values for the levels.

Text structures are allowed only with the relational operators .EQS., .NES., .IN., and .NI. or in the string functions. The result of any expression is a number, so you cannot create a text with CALCULATE, even if the structures on which the operations are being done are texts.

All the arithmetic, relational, and logical operators and transformation functions can also be used with matrix structures, symmetric matrices, and diagonal matrices. The basic rule when using these with different types of matrix is that their dimensions must conform. This means that, for each pair of matrices, row dimension must match row dimension, and column dimension must match column dimension. So, for example, you can add a diagonal matrix to a matrix structure provided the number of rows and columns of the matrix equals the number of rows (and columns) of the diagonal matrix. The multiplication operator (*) performs element-by-element multiplication of two matrices: for matrix multiplication, there is the compound operator *+ or the function PRODUCT, which is one of the many specialised matrix functions.

You can use tables in expressions in much the same way as you would any other numerical structure. Tables in expressions must be either all without margins or all with margins; if you try to mix tables with and without margins, Genstat will report an error. Calculations with tables are very straightforward when they have the same factors in their classifying sets. The tables then have identical "shapes", and the arithmetic, relational, and logical operators and the transformation functions act element-by-element, in the usual way.

When tables have different classifying sets, there are two cases to consider. The first case is when the table on the left-hand side has a factor in its classifying set that is not in the classifying set of the table on the right-hand side. In this case, the right-hand table is expanded to include that factor, by duplicating its values across the levels of the factor and any margin. The second case is when the table on the right-hand side has a factor in its classifying set that is not in the classifying set of the table on the left-hand side. Now the values in the margin over that factor are taken for the left-hand table. If the right-hand table has no margins, they must be calculated first. By default Genstat forms marginal totals, but you can use the special table functions to form other types of margin.

Dummies can be used with the relational operators .IS. and .ISNT. which test whether or not a dummy points to a particular identifier. For example, to store in Sca the result of a test to check whether dummy D points to Va, you would put

CALCULATE Sca = D.IS.Va

while to test that D does not point to Vb, you would put

CALCULATE Sca = D.ISNT.Vb

There is also the function UNSET to test if a dummy has not been set to any value.

Other specialised functions include subset functions, statistical functions and random number generation functions.

 

CASE directive

Introduces a "multiple-selection" control structure.

 

No options

 

Parameter

expression Expression which is evaluated to an integer, indicating which set of statements to execute

 

Description

A multiple-selection control structure consists of several alternative blocks of statements. The first of these is introduced by a CASE statement. This has a single parameter, which is an expression that must yield a single number. Subsequent blocks are each introduced by an OR statement. There can then be a final block, introduced by an ELSE statement. The whole structure is terminated by an ENDCASE statement. Thus the general form is: first

CASE expression

statements

then either none, one, or several blocks of statements of the form

OR

statements

then, if required, a block of the form

ELSE

statements

and finally the statement

ENDCASE

Genstat rounds the value of the expression in the CASE statement to the nearest integer, k say, and then executes the kth block of statements. If there is no kth block (for example, if k is negative or larger than the number of blocks), the block of statements following the ELSE statement is executed, if there is such a block; otherwise an error diagnostic is given.

This example prints the salient details about each day in the song The twelve days of Christmas. The scalar Day indicates which day it is.

CASE Day

PRINT 'a partridge in a pear tree'

OR

PRINT 'two turtle doves and a partridge in a pear tree'

OR

PRINT 'three French hens, two turtle doves \

and a partridge in a pear tree'

OR

PRINT 'four calling birds, three French hens ...'

OR

PRINT 'five gold rings ...'

OR

PRINT 'six geese a-laying ...'

OR

PRINT 'seven swans a-swimming ...'

OR

PRINT 'eight maids a-milking ...'

OR

PRINT 'nine drummers drumming ...'

OR

PRINT 'ten pipers piping ...'

OR

PRINT 'eleven ladies dancing ...'

OR

PRINT 'twelve lords a-leaping ...'

ELSE

PRINT 'sorry, no delivery today'

ENDCASE

 

CASE statements can be nested to any depth.
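The selection rule for a CASE structure (round the expression, run the kth block, otherwise fall back to ELSE or report an error) can be modelled in a few lines. This Python sketch is not Genstat code; select_block is a hypothetical helper, and the treatment of exact halves when rounding is an assumption, since the text does not specify it.

```python
import math


def select_block(value, blocks, else_block=None):
    """Return the block chosen by a CASE expression.

    blocks holds the CASE block followed by the OR blocks, in order.
    """
    k = math.floor(value + 0.5)  # nearest integer; tie handling is assumed
    if 1 <= k <= len(blocks):
        return blocks[k - 1]
    if else_block is not None:
        return else_block
    raise LookupError(f"no block number {k} and no ELSE block")


days = ["a partridge in a pear tree", "two turtle doves ...", "three French hens ..."]
chosen = select_block(2.2, days, else_block="sorry, no delivery today")
fallback = select_block(-1, days, else_block="sorry, no delivery today")
```

Here chosen is the second block, and fallback is the ELSE block because there is no block numbered -1.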

 

CATALOGUE directive

Displays the contents of a backing-store file.

 

Options

PRINT = strings What to print (subfiles, structures); default subf, stru

CHANNEL = scalar Channel number of the backing-store file; default 0, i.e. the workfile

LIST = string How to interpret the list of subfiles (inclusive, exclusive, all); default incl

SAVESUBFILE = text To save the subfile identifiers; default *

UNNAMED = string Whether to list unnamed structures (yes, no); default no

 

Parameters

SUBFILE = identifiers Identifiers of subfiles in the file to be catalogued

SAVESTRUCTURE = texts To save the identifiers of the structures in each subfile

 

Description

You can use CATALOGUE to obtain details of the subfiles contained in a backing-store file, or the structures within an ordinary subfile, or the procedures within a procedure subfile. The file is indicated by the CHANNEL option, and the SUBFILE parameter specifies the subfiles (of ordinary structures or of procedures) that are to be catalogued.

The PRINT option specifies which catalogues are to be printed. The subfiles setting prints the catalogue of subfiles in the backing-store file attached to the channel specified by the CHANNEL option, while the structures setting prints the catalogue of structures or procedures that are in the subfiles specified by the SUBFILE parameter. If you set option UNNAMED=yes the unnamed structures in each subfile will also be listed, together with details of how the structures depend on each other.

The LIST option controls how the SUBFILE list is interpreted. The default setting inclusive simply catalogues the subfiles that have been listed. Alternatively, if you set LIST=all Genstat will catalogue all the subfiles in the backing-store file. Finally, you can set LIST=exclusive to catalogue everything that you have not included in the SUBFILE list.
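The three LIST settings amount to a simple selection over the subfiles in the file. The Python sketch below models that selection only; subfiles_to_catalogue and the subfile names are hypothetical, and the order of the result is an assumption.

```python
def subfiles_to_catalogue(all_subfiles, listed, mode="inclusive"):
    """Return the subfiles that would be catalogued for a LIST setting."""
    if mode == "inclusive":      # only the subfiles that were listed
        return list(listed)
    if mode == "exclusive":      # everything except the listed subfiles
        return [s for s in all_subfiles if s not in listed]
    if mode == "all":            # every subfile, whatever was listed
        return list(all_subfiles)
    raise ValueError(f"unknown LIST setting: {mode}")


in_file = ["Sub1", "Sub2", "Sub3"]   # hypothetical subfile names
included = subfiles_to_catalogue(in_file, ["Sub2"])
excluded = subfiles_to_catalogue(in_file, ["Sub2"], mode="exclusive")
everything = subfiles_to_catalogue(in_file, ["Sub2"], mode="all")
```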

The SAVESTRUCTURE parameter allows you to set up texts, one for each subfile in the SUBFILE parameter. Each text contains the identifiers of all structures with an unsuffixed identifier in the subfile. Each identifier is put on a separate line, and the characters ,\ are appended to all but the last line. You would normally use these texts as a macro; the ,\ makes them useable as lists of identifiers. If the text is used as a macro, it is subject to the restriction on the length of statements. The SAVESUBFILE option allows you to save a similar text containing the identifiers of all the subfiles in a backing-store file.
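The layout of a saved text (one identifier per line, with ,\ appended to every line except the last) can be illustrated as follows. This Python sketch only models the layout described above; savestructure_lines and the identifiers are hypothetical.

```python
def savestructure_lines(identifiers):
    """Build the lines of the saved text: ',\\' follows all but the last."""
    last = len(identifiers) - 1
    return [
        name + (",\\" if i < last else "")
        for i, name in enumerate(identifiers)
    ]


lines = savestructure_lines(["Yield", "Plot", "Block"])
print("\n".join(lines))
```

Read as a macro, the three lines form the single identifier list Yield,Plot,Block, because each backslash continues the statement onto the next line.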

 

CLOSE directive

Closes files.

 

No options

 

Parameters

CHANNEL = scalars or texts Numbers of the channels to which the files are attached, or identifiers of texts used for input (which, after "closing", can then be re-read)

FILETYPE = strings Type of each file (input, output, unformatted, backingstore, procedurelibrary, graphics); default inpu

DELETE = strings Whether to delete the file on closure (yes, no); default no

 

Description

Once you have finished using a file, CLOSE can be used to release the channel to which it is attached, so that the channel is available for use with some other file. Parameters CHANNEL and FILETYPE indicate the channel number and the type of file, as in the OPEN directive. The DELETE parameter is useful if you are using files to store data temporarily, perhaps to release workspace within Genstat. When you have finished with the file you can set DELETE=yes to request that it be deleted on closure so that disk space is not wasted. For example,

OPEN 'temp.bin'; CHANNEL=3; FILETYPE=unformatted

PRINT [CHANNEL=3;UNFORMATTED=yes] \

Surveys[1900,1910...1990]

DELETE Surveys[1900,1910...1990]

 

"... and later on when you wish to retrieve the data ..."

READ [CHANNEL=3;UNFORMATTED=yes] \

Surveys[1900,1910...1990]

CLOSE 3; FILETYPE=unformatted; DELETE=yes

You cannot close a channel to which the keyboard or screen is attached, nor the current input or output channels. Also you cannot use CLOSE to delete files that have been opened with ACCESS=readonly or that are protected by the computer's file system. However, you do not need to close every file before you stop running Genstat; files are automatically closed at the end of every Genstat program.

 

CLUSTER directive

Forms a non-hierarchical classification.

 

Options

PRINT = strings Printed output required (criterion, optimum, units, typical, initial); default * i.e. no printing

DATA = matrix or pointer Data from which the classification is formed, supplied as a units-by-variates matrix or as a pointer containing the variates of the data matrix

CRITERION = string Criterion for clustering (sums, predictive, within, Mahalanobis); default sums

INTERCHANGE = string Permitted moves between groups (transfer, swop); default tran (implies swop also)

START = factor Initial classification; default * i.e. splits the units, in order, into NGROUPS classes of nearly equal size

 

Parameters

NGROUPS = scalars Numbers of classes into which the units are to be classified: note, the values of the scalars must be in descending order

GROUPS = factors Saves the classification formed for each number of classes

 

Description

Printed output from the CLUSTER directive is controlled by the PRINT option. This has the following possible settings.

criterion prints the optimal criterion value.

optimum prints the optimal classification.

units prints the data with the units ordered into the optimal classes.

typical prints a typical value for each class: for maximal predictive classification this is the class predictor; for the other methods it is the class mean.

initial if this is set the requested sections of output are also printed for the initial classification.

The DATA option supplies the data to be classified: the single structure must be either a matrix, with rows corresponding to the units and columns to the variables, or a pointer whose values are the identifiers of the variates in the data matrix. Note that CLUSTER always operates on a matrix, and so will copy the variate values into a matrix if you supply a pointer as input; thus for large data sets it is better to supply a matrix.

The CRITERION option specifies which criterion CLUSTER is to optimize, the default being sums. The four settings are:

sums maximize the between-group sum of squares;

predictive maximal predictive classification;

within minimize the determinant of the pooled within-class dispersion matrix;

mahalanobis maximize the total Mahalanobis squared distance between the groups.

The INTERCHANGE option specifies which types of interchange (transfers or swops) are to be used. The default is transfer, which is taken to imply that both transfers and swops are used, since a swop is simply two transfers. If you set INTERCHANGE=swop, only swops are used. If INTERCHANGE=* the algorithm does not attempt to improve the classification from the initial classification; you might want this, in conjunction with the PRINT=initial setting, to display the results for an existing classification which you do not wish to improve.

The START option should be used to supply a factor to define the initial classification. If START is not specified, CLUSTER will divide the units, in order, into roughly equal-sized groups. For example, with 97 units to be classified into 10 groups, the first 10 units will be put into the first group, the 11th to 20th into the second group, and so on; the last three groups will contain only nine units each. Procedure CLASSIFY provides another way of forming an initial classification for k classes. It finds the k units that are furthest apart in the multi-dimensional space defined by the data variates. These are then used as the nuclei for the classes, with each remaining unit being allocated to the class containing the nearest nucleus.
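The default initial classification can be reproduced with simple arithmetic. The Python sketch below generalizes from the 97-unit example, assuming that any larger groups come first; initial_groups is a hypothetical helper, not part of Genstat.

```python
def initial_groups(n_units, n_groups):
    """Split units, in order, into n_groups classes of nearly equal size."""
    base, extra = divmod(n_units, n_groups)
    # The first `extra` groups get one more unit than the rest (assumed order).
    sizes = [base + 1] * extra + [base] * (n_groups - extra)
    groups = []
    for number, size in enumerate(sizes, start=1):
        groups.extend([number] * size)
    return groups


start = initial_groups(97, 10)
sizes = [start.count(g) for g in range(1, 11)]
```

For 97 units and 10 groups, sizes contains seven groups of ten units followed by three groups of nine, matching the example above.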

The first parameter, NGROUPS, is used to specify the number of classes to be formed. Any single-valued structure can be supplied here. Often you would want several classifications from a single data set, into different numbers of groups. In this case the NGROUPS parameter should be a list of the numbers of groups in descending order. To obtain the initial classification for the second number of groups, CLUSTER takes the optimal classification from the first number of groups, and does some reallocation of units to make a smaller number of groups. This is repeated, as often as required, to provide initial classifications for all the later analyses; hence the need to specify the numbers in descending order. The second parameter, GROUPS, is used to specify a list of identifiers of factors to save the optimal classifications.

 

COLOUR directive

Defines the red, green and blue intensities to be used for the Genstat colours for certain graphics devices.

 

No options

 

Parameters

NUMBER = scalars Numbers of the colours to be set

RED = scalars Red intensity of each colour (between 0 and 1)

GREEN = scalars Green intensity of each colour (between 0 and 1)

BLUE = scalars Blue intensity of each colour (between 0 and 1)

MATCH = scalars Number of a Genstat colour to define any unset values of RED, GREEN or BLUE; default is to restore the original values of the colour

SAVE = pointers Pointers each containing three scalars to save the red, green and blue intensities of the colours

 

Description

The COLOUR directive allows you to redefine the colour map stored internally. Genstat uses the RGB colour system to define each colour (0 up to 32) in terms of its red, green, and blue components. These are specified as values in the range [0,1]. Thus black is represented by (0,0,0), white by (1,1,1), red by (1,0,0), and so on. The COLOUR directive can be used in three ways. Firstly you can define a colour in RGB terms. For example, you could put

COLOUR 1; RED=1; GREEN=1; BLUE=0

to define colour 1 as yellow. Points plotted in colour 1 would then appear as yellow. Alternatively, the MATCH parameter allows a colour to take its RGB values from the current settings of another colour. For example,

COLOUR 2; MATCH=1

will set colour 2 also to be yellow. Note that if colour 1 is changed again, colour 2 will not be altered. Finally a colour can be returned to its initial default settings by specifying only the colour number. For example,

COLOUR 1,2

will set colours 1 and 2 back to their original values. The background colour may be altered by changing the definition of colour 0.
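The three ways of using COLOUR (explicit RGB values, MATCH, and resetting) can be modelled with a small colour table. This Python sketch is not Genstat code: ColourMap is a hypothetical class, and the default intensities in it are invented for the illustration rather than taken from any Genstat device.

```python
class ColourMap:
    """Toy model of the internal colour map."""

    def __init__(self, defaults):
        self.defaults = dict(defaults)
        self.current = dict(defaults)

    def colour(self, number, red=None, green=None, blue=None, match=None):
        # Unset components come from the current values of the MATCH colour,
        # or from the original values of this colour if MATCH is not given.
        base = self.current[match] if match is not None else self.defaults[number]
        given = (red, green, blue)
        self.current[number] = tuple(
            b if g is None else g for g, b in zip(given, base)
        )


# Hypothetical defaults: colour 0 is the background.
cmap = ColourMap({0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (0.0, 1.0, 0.0)})
cmap.colour(1, red=1.0, green=1.0, blue=0.0)   # full red + full green = yellow
cmap.colour(2, match=1)                        # colour 2 copies colour 1 now
cmap.colour(1, red=0.0, green=0.0, blue=1.0)   # later change to colour 1
after_change = cmap.current[2]                 # still yellow
cmap.colour(1)                                 # number only: back to default
```

Because MATCH copies the values at the time of the statement, the later change to colour 1 leaves colour 2 untouched.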

The exact effects of the COLOUR directive will vary for different graphics devices. In some cases, mainly plotters and monochrome terminals, it will be ignored. With some devices, using COLOUR will not affect existing displays but the modified colour definitions will be used for subsequent plots. For other devices it will take effect immediately, allowing plots to be modified dynamically without having to redraw the graphs. For example, typing

COLOUR 1; MATCH=0

will make all parts of the graph plotted in colour 1 disappear, by changing colour 1 to the background colour. If this is followed by

COLOUR 1

the points will reappear. This can be achieved only when it is possible to alter the colour table of the terminal dynamically, and where the underlying graphical software allows it. The Users' Note should contain details of how COLOUR is implemented in your version of Genstat.

 

COMBINE directive

Combines or omits "slices" of a multi-way data structure (table, matrix, or variate).

 

Options

OLDSTRUCTURE = identifier Structure whose values are to be combined; no default i.e. this option must be set

NEWSTRUCTURE = identifier Structure to contain the combined values; no default i.e. this option must be set

 

Parameters

OLDDIMENSION = factors or scalars Dimension number or factor indicating a dimension of the OLDSTRUCTURE

NEWDIMENSION = factors or scalars Dimension number or factor indicating the corresponding dimension of the NEWSTRUCTURE; this can be omitted if the dimensions are in numerical order, while zero settings (each in conjunction with a single OLDPOSITION) allow a slice of an old table to be mapped into a new table with fewer dimensions

OLDPOSITIONS = pointers, texts, variates or scalars These define positions in each OLDDIMENSION: pointers are appropriate for matrices whose rows or columns are indexed by a pointer; texts are for matrices indexed by a text, variates with a textual labels vector, or tables whose OLDDIMENSION factor has labels; and variates either refer to levels of table factors or numerical labels of matrices or variates, if these are present, otherwise they give the (ordinal) number of the position. If omitted, the positions are assumed to be in (ordinal) numerical order. Margins of tables are indicated by missing values

NEWPOSITIONS = pointers, texts, variates or scalars These define positions in each NEWDIMENSION, specified similarly to OLDPOSITIONS; these indicate where the values from the corresponding OLDDIMENSION positions are to be entered (or added to any already entered there)

WEIGHTS = variates Define weights by which the values from each OLDDIMENSION coordinate are to be multiplied before they are entered in the NEWDIMENSION

 

Description

Sometimes you may wish to reclassify a table to have factors different from those that you used in its declaration. COMBINE allows you to omit or to combine levels of the classifying factors. Furthermore, if you want to take just one level of a factor, you can copy the values into a table with one fewer dimension.

You specify the original table using the OLDSTRUCTURE option, and a table to contain the reclassified values using the NEWSTRUCTURE option; if you have not already declared the new table, it will be declared implicitly. You must specify both of these options.

You can modify several of the classifying factors at a time. You list the factors of the original table with the OLDDIMENSION parameter, and the equivalent factors of the new table with NEWDIMENSION. An alternative way of doing this is to give a dimension number, specifying the position of the factor in the classifying set of the table; for the NEWDIMENSION list, this requires that you have already declared the new table. You can even omit the list of dimensions if they would be in ascending numerical order. NEWDIMENSION can also be set to 0 (to imply no corresponding new factor), allowing you to extract a single slice of a table into a table with fewer dimensions.

You use the OLDPOSITIONS and NEWPOSITIONS parameters to specify how this combining is to be done. These parameters specify a pair of vectors for each pair of old and new dimensions, listing positions within the old dimension and the corresponding positions to which they are mapped in the new dimension. The positions can be defined in terms of either the levels or the labels of the factor that classifies the dimension. If you omit the vector for one of the dimensions, it is assumed to contain each value once only, taken in the order in which they occur in the levels vector of the factor. You indicate a margin of the table by a missing value in a variate, or by a null string in a text. Values in the original table can be allocated to more than one place. In parallel with the vectors of positions, you can also use the WEIGHTS parameter to specify a variate of weights by which the values are multiplied before being entered into the new table.
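The mapping from old positions to new positions is easiest to see for a single dimension. The Python sketch below is a simplified model, assuming a one-way structure addressed by ordinal positions counted from 1, with no margins; combine is a hypothetical helper and does not reproduce the full behaviour of the directive.

```python
def combine(old_values, old_positions, new_positions, n_new, weights=None):
    """Add each selected old value, times its weight, into its new position."""
    if weights is None:
        weights = [1.0] * len(old_positions)
    new_values = [0.0] * n_new
    for old, new, weight in zip(old_positions, new_positions, weights):
        # Values mapped to the same new position accumulate.
        new_values[new - 1] += weight * old_values[old - 1]
    return new_values


old = [10.0, 20.0, 30.0, 40.0]
# Merge old levels 1 and 2 into new level 1, omit level 3, move level 4 to new level 2.
merged = combine(old, [1, 2, 4], [1, 1, 2], n_new=2)
# The same mapping, but averaging levels 1 and 2 instead of summing them.
averaged = combine(old, [1, 2, 4], [1, 1, 2], n_new=2, weights=[0.5, 0.5, 1.0])
```

Old level 3 does not appear in the position lists, so its value is simply left out of the new structure.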

Although the main way in which you will use COMBINE is likely to be for tables, you can also use it on rectangular matrices and even variates. For these, the dimensions can only be numbers: number 1 refers to the rows of a matrix, and 2 to the columns; number 1 refers to the rows (or units) of a variate. The position vectors refer to the labels vectors of matrices, which can be variates, texts, or pointers; or they refer to the unit labels of a variate, which can be held in either a variate or a text. If a dimension has no labels vector, you use a variate to specify its positions; then each value of the variate gives the number of a row, column, or unit. You can do the same also if the labels vector is something other than a variate: that is, a text or a pointer.

 

CONCATENATE directive

Concatenates and truncates lines (units) of text structures; allows the case of letters to be changed.

 

Options

NEWTEXT = text Text to hold the concatenated/truncated lines; default is the first OLDTEXT vector

CASE = string Case to use for letters (given, lower, upper, changed); default give leaves the case of each letter as given in the original string

 

Parameters

OLDTEXT = texts Texts to be concatenated

WIDTH = scalars or variates Number of characters to take from the lines of each text, a negative value takes all the (unskipped) characters other than trailing spaces; if * or omitted, all the (unskipped) characters are taken

SKIP = scalars or variates Number of characters to skip at the left-hand side of the lines of each text, a negative value skips all initial spaces; if * or omitted, no characters are skipped

 

Description

The CONCATENATE directive joins lines of several texts together, side by side, to form a new text. You can specify the identifier of this text by the NEWTEXT option, in which case it need not already have been declared as a text. If you do not specify NEWTEXT, Genstat places the new textual values into the first text in the OLDTEXT parameter list (replacing its existing values).

The texts to be concatenated are specified by OLDTEXT; they should all contain the same number of lines, unless you want to insert an identical series of characters into every line of the new text: a series of characters that is to be duplicated within every line can be specified either as a string, or in a single-valued text.

If you give a variate in the SKIP list, then it must contain a value for each line of the text in the OLDTEXT list; the value indicates the number of characters to be omitted at the beginning of that line. Alternatively, you can give a scalar if the same number of characters is to be omitted at the start of every line. Similarly the WIDTH parameter specifies how many characters are to be taken, after omitting any initial characters as specified by SKIP.

CONCATENATE also provides easy ways of removing spaces at the beginning or the end of strings. A negative value of the SKIP parameter deletes all the spaces at the start of a string, while a negative value of the WIDTH parameter deletes all the spaces at the end of a string.

The CASE option enables you to change the case of letters. By default, CASE=given, which leaves the case of each letter as given in the existing text. To change all letters to upper case (or capitals) you can put CASE=upper, or CASE=lower to change all letters to lower case. Alternatively, CASE=changed puts lower-case letters into upper case, and upper-case letters into lower case.
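The combined effect of SKIP, WIDTH and CASE can be modelled line by line. The Python sketch below is not Genstat code; it assumes scalar SKIP and WIDTH settings only (None stands for * or omitted), and the helpers take and concatenate are hypothetical.

```python
def take(line, skip=None, width=None):
    """Apply SKIP, then WIDTH, to one line of text."""
    if skip is None:
        rest = line
    elif skip < 0:
        rest = line.lstrip(" ")      # negative SKIP drops initial spaces
    else:
        rest = line[skip:]
    if width is None:
        return rest
    if width < 0:
        return rest.rstrip(" ")      # negative WIDTH drops trailing spaces
    return rest[:width]


def concatenate(parts, case="given"):
    """Join texts side by side; parts is a list of (lines, skip, width)."""
    n = max(len(lines) for lines, _, _ in parts)
    convert = {"given": str, "lower": str.lower,
               "upper": str.upper, "changed": str.swapcase}[case]
    result = []
    for i in range(n):
        pieces = []
        for lines, skip, width in parts:
            # A single-valued text is repeated in every line.
            line = lines[0] if len(lines) == 1 else lines[i]
            pieces.append(take(line, skip, width))
        result.append(convert("".join(pieces)))
    return result


surnames = ["  Smith ", " Jones"]
forenames = ["Ann", "Bob"]
parts = [(surnames, -1, -1), ([", "], None, None), (forenames, None, 1)]
joined = concatenate(parts)
swapped = concatenate(parts, case="changed")
```

Here joined contains "Smith, A" and "Jones, B": the surnames are trimmed at both ends, the separator is repeated in each line, and only the first character of each forename is taken.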

CONCATENATE takes account of restrictions on any of the vectors that occur in the statement. If more than one vector is restricted, then each such restriction must be the same. The values of the units that are excluded by the restriction are left unchanged.

 

CONTOUR directive

Produces contour maps of two-way arrays of numbers (on the terminal/printer).

 

Options

CHANNEL = scalar Channel number of output file; default is current output file

INTERVAL = scalar or variate Contour interval for scaling (scalar) or positions of the contours (variate); default * i.e. determined automatically

TITLE = text General title; default *

YTITLE = text Title for y-axis; default *

XTITLE = text Title for x-axis; default *

YLOWER = scalar Lower bound for y-axis; default 0

YUPPER = scalar Upper bound for y-axis; default 1

XLOWER = scalar Lower bound for x-axis; default 0

XUPPER = scalar Upper bound for x-axis; default 1

YINTEGER = string Whether y-labels integral (yes, no); default no

XINTEGER = string Whether x-labels integral (yes, no); default no

LOWERCUTOFF = scalar Lower cut-off for array values; default *

UPPERCUTOFF = scalar Upper cut-off for array values; default *

 

Parameters

GRID = identifiers Pointers (of variates representing the columns of a data matrix), matrices, or two-way tables specifying values on a regular grid

DESCRIPTION = texts Annotation for key

 

Description

A contour plot provides a way of displaying three-dimensional data in a two-dimensional plot. The data values are supplied as a rectangular array of numbers that represent the values of the variable in the third dimension, often referred to as height or the z-axis. The first two dimensions (x and y) are the rows and columns indexing the array; the complete three-dimensional data set is referred to as a surface or grid. Contours are lines that are used to join points of equal height, and usually some form of interpolation is used to estimate where these points lie. The resulting contour plot is not necessarily very "realistic" when compared to perspective plots produced by DSURFACE, but it has the advantage that the entire surface can easily be examined, without the danger of some parts being obscured by high points or regions.

You might use contour plots for example when you have data sampled at points on a regular grid, such as the concentrations of a trace element or nutrient in the soil. Contours are also very useful when fitting nonlinear models, when they can be used to study two-dimensional slices of the likelihood surface, to help find good initial estimates of the parameters.

The CONTOUR directive produces output for a line printer by using cubic interpolation between the grid points to estimate a z-value for each character position in the plot. Each value is reduced to a single digit in the range 0 ... 9, according to the rules described below. To produce the contour plot only the even digits are printed: you can then see the contours as the boundaries between the blank areas and the printed digits.

The GRID parameter can be set to a matrix, a two-way table (with the first factor defining the rows), or a pointer to a set of variates each containing a column of data. We explain the conventions in terms of a matrix as input, but similar rules apply to the other structures. When reading or printing a matrix the origin of the rows and columns (row 1, column 1) appears at the top left-hand corner. However, in forming the contour plot the rows are reversed in order so that the first row of the matrix is placed at the bottom of the contour; thus the origin of the contour is located, according to the usual conventions, at the bottom left-hand corner of the plot. The DCONTOUR directive also reverses the rows of the grid in the same way.

CONTOUR scales the grid values by dividing by the contour interval. The scaled grid values are then converted to single digits by taking the remainder modulo 10 and truncating the fractional part. The INTERVAL option allows you to set the contour interval. For example, if the grid values range from 17 to 72 and the interval is set to 10, contour lines (the boundaries between blank space and printed digits) will occur at grid values of 20, 30, 40, 50, 60, and 70. By default, the interval is determined from the range of the data in order to obtain 10 contours.
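The conversion from a grid value to a printed character can be worked through numerically. The Python sketch below models the rule for non-negative values only (an assumption, since truncation of negative values is not described here); contour_digit and printed_character are hypothetical helpers. It also uses the rule that only even digits are printed.

```python
def contour_digit(z, interval):
    """Scale by the interval, take the remainder modulo 10, then truncate."""
    return int((z / interval) % 10)


def printed_character(z, interval):
    """Even digits are printed; odd digits are left blank."""
    digit = contour_digit(z, interval)
    return str(digit) if digit % 2 == 0 else " "


interval = 10
digits = [contour_digit(z, interval) for z in (17, 20, 29.9, 30, 72)]
row = "".join(printed_character(z, interval) for z in (17, 25, 35, 45, 55, 65, 72))
```

With an interval of 10, values from 20 up to (but not including) 30 all give the digit 2, and values from 30 give the digit 3, so a boundary between printed and blank areas appears at 30, and similarly at each other multiple of 10 within the data range.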

The UPPERCUTOFF and LOWERCUTOFF options can be used to define a window for the grid values that will form the contours. All values above or below these are printed as X. Setting either UPPERCUTOFF or LOWERCUTOFF will change the default contour interval, as the range of data values is effectively curtailed.

You can use the TITLE, YTITLE, and XTITLE options to annotate the contour plot. If you specify several grids, these will be plotted in separate frames and the text of the TITLE option will appear at the top of each one. You should thus use TITLE only to give a general description of what the contours represent. The DESCRIPTION parameter can be used to add specific descriptions to be printed at the bottom of each individual plot.

The YUPPER and YLOWER options allow you to set upper and lower bounds for the y-axis, thus generating axis labels that reflect the range of values over which the grid was observed or evaluated. Setting YINTEGER=yes will ensure the labels are printed as integers, if possible. The default axis bounds are 0.0 and 1.0. The options XLOWER, XUPPER, and XINTEGER similarly control labelling of the x-axis.

 

COPY directive

Forms a transcript of a job.

 

Option

PRINT = strings What to transcribe (statements, output); default stat

 

Parameter

scalar Channel number of output file

 

Description

The COPY directive can be used to save a copy of either input statements, or output, or both, in an output file. For example

OPEN 'GEN.REC','GEN.OUT'; CHANNEL=2,3; FILETYPE=output

COPY [PRINT=statements] 2

COPY [PRINT=output] 3

will keep a record of all the statements in the file GEN.REC and of all the output in the file GEN.OUT. A later statement

COPY [PRINT=statements,output] 2

will stop output from being directed to GEN.OUT (because information can be copied to only one file at a time), and send it instead to GEN.REC together with the statements. Setting PRINT=* stops any copying to the specified channel. For example

COPY [PRINT=*] 2

 

CORRELATE directive

Forms correlations between variates, autocorrelations of variates, and lagged cross-correlations between variates.

 

Options

PRINT = strings What to print (correlations, autocorrelations, partialcorrelations, crosscorrelations); default *

GRAPH = strings What to display with graphs (autocorrelations, partialcorrelations, crosscorrelations); default *

MAXLAG = scalar Maximum lag for results; default * i.e. value inferred from variates to save results

CORRELATIONS = symmetric matrix Stores the correlations between the variates specified by the SERIES parameter

 

Parameters

SERIES = variates Variates from which to form correlations

LAGGEDSERIES = variates Series to be lagged to form crosscorrelations with first series

AUTOCORRELATIONS = variates To save autocorrelations, or to provide them to form partial autocorrelations if SERIES=*

PARTIALCORRELATIONS = variates To save partial autocorrelations

CROSSCORRELATIONS = variates To save crosscorrelations

TEST = scalars To save test statistics

VARIANCES = variates To save prediction error variances

COEFFICIENTS = variates or matrices To save prediction coefficients: in a variate to keep only those for the maximum lag, or in a matrix to keep the coefficients for all lags up to the maximum

 

Description

The most straightforward use of the CORRELATE directive is to calculate correlation coefficients between a set of variates. For example, the following statement displays the correlations between the variates Age, Height and Weight as a lower-triangular matrix.

CORRELATE [PRINT=correlations; CORRELATIONS=Corr] \

Age,Height,Weight

The correlations are also saved in the symmetric matrix Corr using the CORRELATIONS option.

CORRELATE can also be used to obtain autocorrelations of a time series, that is, the correlations between values in the series lagged by particular time intervals. The set of autocorrelations for all possible lags is the autocorrelation function. You can derive the partial autocorrelation function from these. To look at the relationship between two series, you should use the cross-correlation function between one series and the other lagged by the various intervals. The sample autocorrelation function of a series can be displayed either as a table of numbers, or as a graph, called a correlogram. In either case, you must specify the maximum lag for which the autocorrelation is to be calculated, m, say. You can do this either by setting the MAXLAG option to m, or by pre-defining the length of a variate to be m+1 and including it in the AUTOCORRELATIONS parameter to store the calculated values. Genstat includes the autocorrelation at lag 0 in the autocorrelation function; this is always unity. The formula used for the sample autocorrelation at lag k is

$r_k = (1 - k/n) \times C_k / C_0$

where

 

The number $n_k$ is the number of terms included in the sum. The series can contain missing values, but the calculation excludes any product that involves any missing values at all. You can restrict a series, but the restricted set must consist of a contiguous set of units. Thus, you can look at the autocorrelation function derived from just the first section of a series, or from just the last section, or from a section in the middle; but you cannot use restriction to exclude a section from the middle of the series, or to exclude just individual observations.
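As an illustration of the formula for the sample autocorrelation, here is a minimal Python sketch. It rests on assumptions that the text above does not confirm: that the lag-k covariance is the mean of the available products of deviations, C_k = sum{(y_t - mean)(y_{t+k} - mean)} / n_k, that the mean is taken over the non-missing values, and that n counts the non-missing values. The function name is hypothetical and missing values are represented by None.

```python
def sample_autocorrelations(series, maxlag):
    """Return [r_0, ..., r_maxlag] for a series in which None marks
    a missing value. Assumed definitions (not taken from Genstat):
    C_k is the mean of the n_k available lagged products, and n is
    the number of non-missing values."""
    present = [y for y in series if y is not None]
    n = len(present)
    mean = sum(present) / n

    def c(k):
        # Any product involving a missing value is excluded.
        products = [
            (series[t] - mean) * (series[t + k] - mean)
            for t in range(len(series) - k)
            if series[t] is not None and series[t + k] is not None
        ]
        return sum(products) / len(products)  # divide by n_k

    c0 = c(0)
    return [(1 - k / n) * c(k) / c0 for k in range(maxlag + 1)]

r = sample_autocorrelations([1, 2, 3, 4, 5], 2)
# r is approximately [1.0, 0.4, -0.1]
```

Under these assumptions, and with no missing values, the factor (1 - k/n) turns the mean of n - k products into the sum divided by n, and the lag 0 value is always 1. The sketch fails with a division by zero if no products are available at some lag.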

The AUTOCORRELATIONS parameter allows you to save the calculated autocorrelations. If you want to display a correlogram in a different form from the standard one produced by the GRAPH option, you must save the autocorrelations and plot them explicitly using either the GRAPH or DGRAPH directives. You will then need to define the variate of lags from 0 to m.

The TEST parameter of CORRELATE allows you to save a statistic that can be used to test the hypothesis that the true autocorrelation is zero for positive lags. It is defined as

 

Provided n (the number of data values) is large and m (the maximum lag) is much smaller than n, then under the null hypothesis, the statistic has a chi-squared distribution with m degrees of freedom. Thus, a large value provides evidence of autocorrelation in a time series.

You can calculate autocorrelation functions for several series in one statement by specifying several variates with the SERIES parameter.

Genstat forms partial autocorrelations from an autocorrelation function. The value at lag k is defined as

$\mathrm{corr}(y_t,\ y_{t-k} \mid y_{t-1}, y_{t-2}, \ldots, y_{t-k+1})$

representing the excess correlation between values separated by k timepoints that is not accounted for by the intermediate points; it is denoted by $q_{k,k}$ because it is also the value of the last in the set of coefficients in the autoregressive prediction equation:

$y_t = c + q_{k,1} y_{t-1} + \ldots + q_{k,k} y_{t-k} + e_{k,t}$

Genstat calculates these coefficients recursively for k=1...m by

$q_{k,k} = (r_k - q_{k-1,1} r_{k-1} - \ldots - q_{k-1,k-1} r_1) / v_{k-1}$

$q_{k,j} = q_{k-1,j} - q_{k,k} q_{k-1,k-j}$, j=1...k-1

$v_k = v_{k-1} (1 - q_{k,k}^2)$

It starts with $v_0 = 1$, the quantity $v_k$ being the kth order prediction error variance ratio

$\mathrm{variance}(e_{k,t}) / \mathrm{variance}(y_t)$.
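The recursion above can be followed step by step in a short Python sketch. The function name and the return layout are hypothetical choices made for illustration; only the three update formulas and the starting value of 1 for the variance ratio come from the text.

```python
def partial_autocorrelations(r, m):
    """r[0..m] holds autocorrelations with r[0] == 1.
    Returns (pacf, v, coeffs) where
      pacf   = [q_11, q_22, ..., q_mm]
      v      = [v_0, v_1, ..., v_m]
      coeffs = [q_m1, ..., q_mm] for the maximum lag m."""
    v = [1.0]          # v_0 = 1
    pacf = []
    prev = {}          # coefficients q_{k-1,j}, keyed by j
    for k in range(1, m + 1):
        # q_kk = (r_k - sum_j q_{k-1,j} r_{k-j}) / v_{k-1}
        numerator = r[k] - sum(prev[j] * r[k - j] for j in range(1, k))
        qkk = numerator / v[k - 1]
        # q_kj = q_{k-1,j} - q_kk * q_{k-1,k-j}, j = 1...k-1
        cur = {j: prev[j] - qkk * prev[k - j] for j in range(1, k)}
        cur[k] = qkk
        # v_k = v_{k-1} * (1 - q_kk^2)
        v.append(v[k - 1] * (1 - qkk ** 2))
        pacf.append(qkk)
        prev = cur
    coeffs = [prev[j] for j in range(1, m + 1)]
    return pacf, v, coeffs

# Autocorrelations of the form 0.5**k (a first-order autoregressive pattern).
pacf, v, coeffs = partial_autocorrelations([1.0, 0.5, 0.25, 0.125], 3)
```

For this input the partial autocorrelation is 0.5 at lag 1 and zero at lags 2 and 3, and the variance ratio stays at 0.75 after the first step. The sketch does not check for values outside (-1, 1) or for a zero variance ratio.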

Partial correlations provide a valuable alternative way of displaying the autocorrelation structure of a series. You can display the partial autocorrelation function either as a table of numbers, or as a graph. Two methods are available for doing this. You can supply the series using the SERIES parameter, in which case the autocorrelations are formed first, automatically, and the partial autocorrelations are then derived from them. Alternatively, you can set SERIES=*, and provide the autocorrelations using the AUTOCORRELATIONS parameter. You can specify the maximum lag, either by setting the MAXLAG option, or by pre-defining the length of a variate specified for either the AUTOCORRELATIONS or the PARTIALCORRELATIONS parameter.

You can save the partial autocorrelation function using the PARTIALCORRELATIONS parameter. You can set the VARIANCES and COEFFICIENTS parameters to variates to save the prediction-error variances $v_0 \ldots v_m$, and the prediction coefficients 1, $q_{m,1} \ldots q_{m,m}$ for the maximum lag m. Genstat sets the first coefficient to 1, and also the first element of the partial autocorrelation sequence to 1: you should find this to be a useful convention for the lag 0 values. Alternatively, if the COEFFICIENTS parameter is set to a matrix structure, the rows of this matrix will be used to save the prediction coefficients for all the orders up to the maximum lag.

CORRELATE will print a warning if you include missing values in an autocorrelation function that you have supplied, or if for some other reason the autocorrelations are invalid. In particular, if a partial autocorrelation value is obtained outside the range (-1, 1), Genstat will truncate the sequence at the previous lag.

You can calculate cross-correlations between two series by specifying one series with the SERIES parameter and the other with the LAGGEDSERIES parameter. You must define the maximum lag, as for autocorrelations, and you can again plot or tabulate the resulting function. Missing values are allowed, as for autocorrelations. Genstat calculates the sample cross-correlation between the first series $x_t$ and the lagged series $y_t$ at lag k using:

$r_k = (1 - k/n) \times C_k / (s_x s_y)$

where

 

The series $x_t$ and $y_t$ may be of different lengths. The summation includes all possible terms, but excludes any product containing missing values; the number $n_k$ is the number of terms included in the sum. The values $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$, $s_y$ are the sample standard deviations. The number n is the minimum of the number of values of x and of y, excluding missing values. You can restrict either series to a set of contiguous units: if both are restricted, their restrictions must match.

You can save the cross-correlation function using the CROSSCORRELATIONS parameter. You can also save a test statistic using the TEST parameter; it is used in a similar way to the statistic for autocorrelations, to test for lack of lagged cross-correlation in one direction of the relationship between two series. However, the test is valid only if each of the series has a zero autocorrelation function. Cross-correlations take precedence in the storage. Thus, if you request both autocorrelations and cross-correlations in a single CORRELATE statement, the stored test statistic will relate to the cross-correlations: that for the autocorrelations will not be stored.

 

COVARIATE directive

Specifies covariates for use in subsequent ANOVA statements.

 

No options

 

Parameter

variates Covariates

 

Description

To perform analysis of covariance you need to define the treatment model (using TREATMENTSTRUCTURE) and the underlying structure of the design (using BLOCKSTRUCTURE) as in ordinary analysis of variance, and then simply specify the required covariates using the COVARIATE directive. You can then do the analysis by ANOVA, get further output by ADISPLAY and so on, in the usual way.

You can use covariates to incorporate any quantitative information about the units into the model. In field experiments there may often be linear trends in fertility. These can be estimated and removed by fitting a covariate of the position of the plot along the direction of the trend. For example

COVARIATE Location

For a quadratic trend, you would also include a covariate containing the squares of the positions.

CALCULATE Quadtrend = Location**2

COVARIATE Location,Quadtrend

In experiments on animals, you may wish to use measurements such as the original weight. However, the assumption is always that the y-variate is linearly related to the covariates.

Covariates are incorporated into the model as terms for a linear regression. Genstat fits the covariates, together with the treatments, in each stratum. This should explain some of the variability of the units in the stratum, and so decrease the stratum residual mean square.

Each treatment combination will have been applied to units whose mean value for each covariate differs from that of other treatment combinations; so even in the absence of any treatment effects, the y-values recorded for the different combinations would not be identical. A further effect of the analysis is to adjust the treatment estimates for the covariates, to correct for this. This adjustment causes some loss of efficiency in the treatment estimation. The remaining efficiency is measured by the covariance efficiency factor, shown for each treatment term in the "cov. ef." column of the analysis-of-variance table. The values are in the range zero to one. A value of zero indicates that the treatment contrasts are completely correlated with the covariates: after the covariates have been fitted there is no information left about the treatments. A value of one indicates that the covariates and the treatment term are orthogonal. Usually the values will be around 0.8 to 0.9. A low value should be taken as a warning: either the measurements used as covariates have been affected by the treatments, which can occur when the measurements on covariates are taken after instead of before the experiment, or the random allocation of treatments has been unfortunate in that some treatments are on units with generally low values of the covariates while others are on generally high ones. The covariance efficiency factor is analogous to the efficiency factor printed for non-orthogonal treatment terms; details of its derivation can be found in Payne and Tobias (1992).

For a residual line in the analysis of variance, the value in the "cov. ef." column measures how much the covariates have improved the precision of the experiment. This is calculated by dividing the residual mean square in the adjusted analysis by its value in the unadjusted analysis (which excludes the covariates).

The covariance efficiency factor is used by Genstat in the calculation of standard errors for tables of effects; if you want to calculate the net effect of the analysis of covariance on the precision of the estimated effects of a treatment term, you should multiply the covariance efficiency factor of the term by the value printed in the residual line of the stratum where the term is estimated. Where a term has more than one degree of freedom, the adjustment given by the covariance efficiency factor is an average over all the comparisons between the effects of the term. However this adjustment should not differ by much from those required for any particular comparison unless the randomization has been especially unfortunate. For a table of means classified by several factors, Genstat combines the covariance efficiency factors of the effects from which the means are calculated into a harmonic mean, weighted according to the numbers of degrees of freedom of each term.
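To make the two calculations in this paragraph concrete, here is a small Python sketch. It assumes that the weighted harmonic mean takes the usual form, the sum of the degrees of freedom divided by the sum of (degrees of freedom / efficiency factor); the text does not give Genstat's exact formula. The function names and all numerical values are hypothetical.

```python
def net_precision_effect(term_cov_ef, residual_cov_ef):
    """Net effect on precision for a treatment term: its covariance
    efficiency factor multiplied by the value printed in the residual
    line of the stratum where the term is estimated."""
    return term_cov_ef * residual_cov_ef

def combined_cov_ef(factors, dfs):
    """Harmonic mean of covariance efficiency factors, weighted by
    degrees of freedom (assumed form: sum(df) / sum(df / ef)).
    Fails with a division by zero if any factor is zero."""
    return sum(dfs) / sum(d / f for f, d in zip(factors, dfs))

# Hypothetical values: two terms with 1 and 2 degrees of freedom.
combined = combined_cov_ef([0.8, 0.9], [1, 2])
# combined is 3 / (1/0.8 + 2/0.9), approximately 0.864
```

A harmonic mean is pulled towards the smaller factors, so a single term with a low covariance efficiency factor lowers the combined value more than an arithmetic mean would.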

The adjusted analysis-of-variance table has an extra line in the analysis of each stratum, giving the sum of squares due to the covariates. This is the extra sum of squares that is removed by the covariates after eliminating all that can be ascribed to the treatments. It lets you assess whether there is any evidence that the covariates are required in the model. If there are several covariates Genstat will also print their individual contributions to that sum of squares, giving first the sum of squares that can be explained by the first covariate in the COVARIATE list, then the extra sum of squares that can be accounted for by fitting the second covariate, and so on. The line for each treatment term contains the sum of squares eliminating the covariates. It indicates whether there is evidence of any effects of that term, after taking account of the differences in the values of the covariates on the units to which each treatment was applied.

The method that Genstat uses for analysis of covariance essentially reproduces the method that you would use if you were doing the calculations by hand. First of all, it analyses each covariate according to the block and treatment models. You can print information from these analyses using the CPRINT option of either ANOVA or ADISPLAY. As ADISPLAY does not constrain you to list save structures that were all produced by the same ANOVA, CPRINT will produce information about the covariate analyses from every save structure that you list; duplicate information will thus be produced if several of the save structures are for analyses involving the same covariates. The output from CPRINT, particularly the analysis-of-variance table, gives you another way of assessing the relationship between treatments and covariates: a large variance ratio for a treatment term in the analysis of one of the covariates would indicate either that the treatment had affected the covariate or that the randomization had been unfortunate (as discussed in the description of cov. ef. above).

Genstat then analyses each y-variate in turn. First of all it does the usual analysis ignoring the covariates. You can control output from this unadjusted analysis by the UPRINT option of ANOVA and ADISPLAY. (So the whole of the output given for the example could have been produced by a single ANOVA statement.) Then the covariates are fitted by linear regression and the full, adjusted, analysis is calculated. Output from the adjusted analysis is controlled by the PRINT option of ANOVA and ADISPLAY. This option has an extra setting, not available for UPRINT and CPRINT: PRINT=covariates prints the regression coefficients of the covariates as estimated in each stratum.

 

Reference

Payne, R.W. and Tobias, R.D. (1992). General balance, combination of information and the analysis of covariance. Scandinavian Journal of Statistics 19, 3-23.

 

CVA directive

Performs canonical variates analysis.

 

Options

PRINT = strings Printed output required (roots, loadings, means, residuals, distances, tests); default * i.e. no printing

NROOTS = scalar Number of latent roots for printed output; default * requests them all to be printed

SMALLEST = string Whether to print the smallest roots instead of the largest (yes, no); default no

 

Parameters

WSSPM = SSPMs Within-group sums of squares and products, means etc (input for the analyses)

LRV = LRVs Loadings, roots, and trace from each analysis

SCORES = matrices Canonical variate means

RESIDUALS = matrices Distances of the means from the dimensions fitted in each analysis

DISTANCES = symmetric matrices Inter-group-mean Mahalanobis distances

 

Description

You specify the input for CVA using its first parameter, WSSPM; this may contain a list of structures, in which case Genstat repeats the analysis for each of them. The input must be an SSPM structure, declared with the GROUPS option of the SSPM directive set to a factor giving the grouping of the units. If the variates used to form this SSPM structure are restricted, then the SSPM is restricted in the same way, and so the CVA directive takes account of the restriction. The SSPM contains information on the within-group sums of squares and products, pooled over all the groups; it also contains the group means and group sizes, from which Genstat can derive the between-group sums of squares and products. CVA finds linear combinations of the original variables that maximize the ratio of between-group to within-group variation, thereby giving functions of the original variables that can be used to discriminate between the groups. The squared distances between group means are Mahalanobis $D^2$ statistics when all the dimensions are used; otherwise they are approximations. You can form exact Mahalanobis distances with the PCO directive.

The three options of the CVA directive control the printed output. By default there is no printed output, and so you should set the PRINT option to indicate which sections you want. Results can be printed for a subset of the latent roots by setting the NROOTS and SMALLEST options of CVA. NROOTS specifies the number of roots for which you want the results to be printed. By default these will be the largest roots, unless you set SMALLEST=yes; then the results will be printed for the smallest non-zero roots. When you print a subset of the results, residuals can be formed and printed from the dimensions that are not displayed.

The significance tests that are printed are for a significant dimensionality greater than k, that is for the joint significance of the first, second, ..., (k+1)th latent roots. This test is printed for k=0, 1, ... min(g-1, v)-1. If the test is "not significant" for k=r, then the values of chi-square for k>r should be ignored as the indication is that the remaining dimensions have no interesting structure. The test statistic (Bartlett 1938) is asymptotically distributed as chi-squared with $(v-k) \times (g-k-1)$ degrees of freedom. Here n is the number of units, g is the number of groups, v is the number of variables, and $l_i$ is the ith latent root. If the coefficient [n-g-1/2(v-g)] is less than zero, there are too few units for the statistics to be calculated and a message is printed to this effect. In any case, the tests should be treated with caution unless n-g is very much larger than v.

The latent vectors, or loadings, are scaled in such a way that the average within-group variability in each canonical variate dimension is 1: thus the within-group variation is equally represented in each dimension. Since the latent roots are the successive maxima of the ratio of between-group to within-group variation, loadings corresponding to roots less than 1 are for dimensions in the canonical variate space that exhibit more within-group variation than between-group variation.

The scores for the means are arranged so that their centroid, weighted by group size, is at the origin. This is done by subtracting a constant term, for each canonical variate dimension, from the scores initially formed as a linear combination of the group means of the original variables.

If you ask for distances, they are formed from the group mean scores for the canonical variate dimensions that are printed. If results are printed for the full dimensionality, the distances will be Mahalanobis distances between the groups.

The LRV parameter allows you to save the loadings, latent roots, and their sum (the trace) in an LRV structure, while the SCORES parameter saves the canonical variate means. If you have declared the LRV already, its number of rows must be the same as the number of variates involved in forming the input SSPM. The number of rows of the SCORES matrix, if previously declared, must be equal to the number of groups.

The number of columns of the LRV and of the SCORES matrix corresponds to the number of dimensions to be saved from the analysis, and this must be the same for both of them. If the structures have been declared already, Genstat will take the larger of the numbers of columns declared for either, and declare (or redeclare) the other one to match. If neither has been declared and option SMALLEST retains the default setting no, Genstat takes the number of columns from the setting of the NROOTS option. Otherwise, Genstat saves results for the full set of dimensions. The trace saved as the third component of the LRV structure, however, will contain the sums of all the latent roots, whether or not they have all been saved. Procedure LRVSCREE can be used to produce a "scree" diagram which can be helpful in deciding how many dimensions to save.

The RESIDUALS parameter allows you to save the distances of the means from the dimensions fitted in the analysis in a matrix with number of rows equal to the number of groups and one column. If the latent roots and vectors (loadings) are saved from the analysis, the residuals will correspond to the dimensions not saved; the same applies if you save scores. If neither the LRV nor scores are saved, the saved residuals will correspond to the smallest latent roots not printed.

The DISTANCES parameter allows you to save the inter-group-mean Mahalanobis distances in a symmetric matrix.

 

Reference

Bartlett, M.S. (1938). Further aspects of the theory of multiple regression. Proceedings of the Cambridge Philosophical Society 34, 33-40.