DAPLOT procedure 

Plots residuals from ANOVA with interactive identification of outliers

(R.J. Reader)

 

Options

PEN = scalar or variate Pen or pens to be used to plot the graphs; if a variate is specified, its values define the pen to be used for each point on the graphs; default 1

SELECTED = variate Returns the list of elements that have been selected

ADDED = variate X-values to be used in an added variable plot

SAVE = ANOVA save structure Specifies the analysis from which the residuals and fitted values are to be taken; by default they are taken from the most recent ANOVA

 

Parameter

METHOD = strings Type of graph (up to four out of the five possible) to be plotted (histogram, fittedvalues, normal, halfnormal, added); default hist, fitt, norm, half

 

Description

DAPLOT provides five types of high-resolution plot for residuals from an ANOVA. These are selected using the METHOD parameter with settings: histogram for a histogram of residuals, fittedvalues for residuals versus fitted values, normal for a Normal plot, halfnormal for a half-Normal plot, and added for an added variable plot (Cook & Weisberg, 1982). Up to four can be examined in any call of the procedure.

If METHOD is set to added, the ADDED option must be set to the variate that is to provide the x-values for the plot. These could, for example, be residuals from an analysis of variance of a possible covariate.

The PEN option controls the pen or pens used for the plotting. Other aspects of the graphics environment, such as windows, are set automatically, and restored at the end of the procedure.

If the graphs are plotted interactively, the SELECTED option allows points to be selected from any graph except a histogram. The graphs are then replotted highlighting the selected points, and the unit numbers of the corresponding elements of the original ANOVA y-variate are saved in the variate specified by SELECTED.

The residuals and fitted values are accessed automatically from the structure specified by the SAVE option which, by default, will be that for the last y-variate analysed by ANOVA. Missing values are inserted in the fitted values and residuals in any units that were missing in the original y-variate.

 

Options: PEN, SELECTED, ADDED, SAVE. Parameter: METHOD.

 

Method

Residuals and fitted values are accessed, using AKEEP, from the latest ANOVA, or from that specified by the SAVE option.

For a Normal plot, the Normal quantiles are calculated as follows:

$q_i = \mathrm{NED}\big( (i - 0.375) / (n + 0.25) \big)$

while for a half-Normal plot they are given by

$q_i = \mathrm{NED}\big( 0.5 + 0.5 \times (i - 0.375) / (n + 0.25) \big)$.
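
The two quantile formulae can be sketched in Python as follows. This is only an illustration, not the procedure's own code: it assumes that NED is the inverse of the standard Normal distribution function, that i runs from 1 to n over the ordered residuals, and the function names are hypothetical.

```python
from statistics import NormalDist

# NED (Normal equivalent deviate) is taken here to be the inverse
# standard Normal distribution function; this is an assumption.
_ned = NormalDist().inv_cdf


def normal_quantiles(n):
    """Quantiles for a Normal plot of n ordered residuals."""
    return [_ned((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def half_normal_quantiles(n):
    """Quantiles for a half-Normal plot of n ordered absolute residuals."""
    return [_ned(0.5 + 0.5 * (i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


if __name__ == "__main__":
    print(normal_quantiles(5))
    print(half_normal_quantiles(5))
```

The Normal quantiles are symmetric about zero, whereas the half-Normal quantiles are all positive, which is why the latter are paired with absolute residuals.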

The graphs are plotted initially using the pen(s) specified by the PEN option. The characteristics of the pen(s) can be altered using the PEN directive, for example to enable different levels of a factor to be plotted with different symbols.

The QUESTION directive is used to determine the graph from which points are to be selected. The DREAD directive is then used to identify the points with the cursor, in the usual way. If any points have been selected, all the graphs are redrawn with the attributes of default pen 2 for the selected points and those of default pen 1 for the others.

 

Action with RESTRICT

If the y-variate in the ANOVA is restricted, only the units not excluded by the restriction are included in the graphs.

 

Reference

Cook, R.D. & Weisberg, S. (1982). Residuals and influence in regression. London: Chapman & Hall.

 

DAYCOUNT procedure

Converts a date to a daycount, or vice versa

(T.J. Cole)

 

Option

TODATE = string Whether to convert from daycount to date, instead of from date to daycount (no, yes); default no

 

Parameters

NDAYS = variates or scalars Daycount since 29th February 1600; must be set if option TODATE=yes

DAY = variates or scalars Day of month in range 1...31 (or 30, 29 or 28 depending on month, year and century); must be set if option TODATE=no

MONTH = variates or scalars Month of year in range 1...12; must be set if option TODATE=no

YEAR = variates or scalars Year of century in range 00...99; must be set if option TODATE=no

CENTURY = variates or scalars Century in range 16... ; default 19 (i.e. 1900 - 1999)

WEEKDAY = variates or scalars Day of the week corresponding to the date, where Monday=1 and Sunday=7

 

Description

DAYCOUNT takes a date, expressed as day, month and year, and converts it to an exact daycount since 29th February 1600, based on the Gregorian Calendar. The year is defined in two parts, the first two digits being the century and the last two the year, so 1988 is century 19 and year 88. Alternatively, if option TODATE is set to yes, a daycount is converted back to a date. The earliest allowable date is 1st March 1600, corresponding to a daycount of 1. Also calculated is the weekday of the date, where Monday is 1 and Sunday is 7. Parameter CENTURY allows the century of the date to be specified; if this is unset, 19 is assumed (i.e. 1900 - 1999). There is no printed output, other than warnings about dates being invalid or earlier than the starting date.

 

Option: TODATE. Parameters: NDAYS, DAY, MONTH, YEAR, CENTURY, WEEKDAY.

 

Method

The method of calculation is based on Zeller's Congruence, which redefines each year as starting on 1st March. An extra day is added on when the year is divisible by 4 (or when the century is divisible by 4 and the year is 00) and this goes at the end of February the previous year. In addition, the function INTEGER(MONTH*30.6+0.5) gives a running total of days in previous months of the year, where MONTH=0...11 represents March to February. The procedure uses similar sums to check for valid dates. A cycle of 400 years (e.g. Wednesday 1st March 1600 to Tuesday 29th February 2000) consists of an exact number of weeks, so that the weekday of any date can be found from the daycount mod 7. Wherever possible, the procedure creates the required output structures as scalars, so as to save space.
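
The calculation described above can be sketched in Python. This is a minimal illustration rather than the procedure's source: the function names are hypothetical, integer arithmetic replaces INTEGER(MONTH*30.6+0.5), and reference_daycount is an added cross-check against Python's proleptic Gregorian calendar.

```python
from datetime import date


def daycount(day, month, year, century=19):
    """Days since 29 February 1600, so that 1 March 1600 gives 1."""
    full_year = 100 * century + year
    # Redefine the year as starting on 1 March:
    # m = 0 is March, ..., m = 11 is February of the next calendar year.
    m = (month - 3) % 12
    y = full_year - 1600 - (1 if month < 3 else 0)
    # Leap days in the y complete March-based years since 1 March 1600.
    leap_days = y // 4 - y // 100 + y // 400
    # Integer form of INTEGER(MONTH*30.6+0.5): days in earlier months.
    days_in_previous_months = (306 * m + 5) // 10
    return 365 * y + leap_days + days_in_previous_months + day


def weekday(ndays):
    """Monday=1 ... Sunday=7; daycount 1 (1 March 1600) is a Wednesday."""
    return (ndays + 1) % 7 + 1


def reference_daycount(day, month, year, century=19):
    """Cross-check using Python's own calendar arithmetic."""
    d = date(100 * century + year, month, day)
    return d.toordinal() - date(1600, 2, 29).toordinal()


if __name__ == "__main__":
    n = daycount(1, 1, 88)  # 1st January 1988
    print(n, weekday(n), reference_daycount(1, 1, 88))
```

The sketch covers only the date-to-daycount direction (TODATE=no) and does no validity checking of the date supplied.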

 

Action with RESTRICT

If any of the parameters is restricted, the procedure will operate only on the specified set of units; other parameters must either be unrestricted or restricted to the same set of units.

 

DAYLENGTH procedure

Calculates daylengths at a given period of the year

(R.J. Reader & K. Phelps)

 

Option

LATITUDE = scalar Latitude at which the daylength is to be calculated, positive for northern hemisphere and negative for southern hemisphere; default 52.205 N (Wellesbourne)

 

Parameters

DAYNUMBER = variate Days of year for which daylengths are required

DAYLENGTH = variate Calculated daylengths in hours

 

Description

DAYLENGTH calculates a set of daylengths at a given latitude. The numbers of the days during the year for which the daylengths are required should be specified, in a variate, using the DAYNUMBER parameter. The lengths will then be stored in the variate specified by the DAYLENGTH parameter. The latitude is defined by the LATITUDE option; by default LATITUDE=52.205, which is the latitude of Wellesbourne.

 

Option: LATITUDE. Parameters: DAYNUMBER, DAYLENGTH.

 

Method

The formula by which the daylengths are calculated is given in Sellers (1965).
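
The formula itself is not reproduced here. As a rough illustration only, the Python sketch below uses the standard sunrise equation together with a common sinusoidal approximation to the solar declination; both are assumptions made for this example and may differ from the formula in Sellers (1965) and from what the procedure actually computes. The function name is hypothetical.

```python
import math


def daylength_hours(daynumber, latitude=52.205):
    """Approximate daylength in hours for a day of the year and a latitude
    in degrees (positive north, negative south)."""
    # Assumed declination approximation (not taken from the source text).
    declination = math.radians(23.45) * math.sin(
        2.0 * math.pi * (284 + daynumber) / 365.0
    )
    # Sunrise equation: cosine of the hour angle at sunrise or sunset.
    x = -math.tan(math.radians(latitude)) * math.tan(declination)
    # Clamp so that polar day and polar night give 24 and 0 hours.
    x = max(-1.0, min(1.0, x))
    return 24.0 / math.pi * math.acos(x)


if __name__ == "__main__":
    for day in (80, 172, 266, 355):
        print(day, round(daylength_hours(day), 2))
```

Under these assumptions the daylength at the equator is 12 hours on every day, and the values for latitudes +L and -L on the same day sum to 24 hours.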

 

Action with RESTRICT

If either the DAYNUMBER or the DAYLENGTH variate is restricted, the calculations will be done only for the units not excluded by the restriction.

 

Reference

Sellers W.D. (1965). Physical Climatology. University of Chicago Press, Chicago, Illinois.

 

DBARCHART procedure

Produces barcharts for one or two-way tables

(Ruth Butler)

 

Options

TITLE = text Title for Chart; no default

WINDOW = scalar Window for chart (1...8); default 1

KEYWINDOW = scalar Window for Key, no key is produced for one-way tables (1...8); default 2

LABELS = text Labels for clusters of bars; by default the labels or levels of the first classifying factor of TABLE are used

APPEND = string Whether to append bars (no, yes); default no

SCREEN = string Whether to clear screen before displaying chart (keep, clear); default clea

KEYDESCRIPTION = text Title for key; default is the name of the second factor of TABLE

YSCALE = expression Defines a transformation of the data; the expression must be a function of X, for example !e(log(X)), and should be monotonically increasing in the range of the data in TABLE; default no transformation

 

Parameters

TABLE = tables One or two-way table of data

ORIGIN = scalars Origin for y-axis; default 0

PEN = variates or scalars Pen (or pens) to use; default is !(1...nlevel(last_classifying_factor))

DESCRIPTION = texts Annotation for Key for two-way tables; by default the labels or levels of the last classifying factor of TABLE are used

YMARKS = variates Position of the tick-marks on the y-axis

 

Description

DBARCHART produces barcharts for one or two-way tables. For a two-way table, the bar chart is produced with the first factor defining the groups of bars in the chart, and the second the bars within each group. The table is specified by the TABLE parameter and the origin of the y-axis, which need not be zero, can be set with the ORIGIN parameter. The PEN parameter specifies a pen, or pens, for the bars of the chart. This can be input as a scalar if the same pen is to be used for the whole plot, or as a variate to allow the groups to be drawn in different pens; by default pens 1, 2 ... are used for the successive bars. Labelling for the key can be supplied by the DESCRIPTION parameter; if this is not set, DBARCHART uses the labels of the last classifying factor. Positions of the tick-marks on the y-axis can be specified with the YMARKS parameter.

The options of the procedure mainly control the plotting: the windows that are used for the plot (WINDOW) and for the key (KEYWINDOW), titles for the graph (TITLE) and for the key (KEYDESCRIPTION), whether the groups of bars are appended or placed side-by-side (APPEND), and whether or not to clear the screen before plotting (SCREEN). The YSCALE option can specify a transformation to be used to rescale the data and y-axis; the labels on the y-axis, however, will refer to the original scale of the data.

 

Options: TITLE, WINDOW, KEYWINDOW, LABELS, APPEND, SCREEN, KEYDESCRIPTION, YSCALE.

Parameters: TABLE, ORIGIN, PEN, DESCRIPTION, YMARKS.

 

Method

If YSCALE is set, the expression is used to transform TABLE and ORIGIN. Any YMARKS are also transformed to find the position of the tick-marks. TABLE is then rescaled so that the ORIGIN is zero. DHISTOGRAM is then used to produce the chart without a y-axis. The y-axis is added to the chart using a DGRAPH statement, with the labelling on the original scale of TABLE. Two-way tables are first split into one-way tables classified by the second factor of TABLE. One sub-table is produced for each level of the first factor of TABLE. The chart is then produced with a single DHISTOGRAM statement for all sub-tables. YSCALE is imported into the program by setting X as a dummy, and printing the expression into a text. A new expression is then set up using this text with the EXECUTE directive. X is then set to ORIGIN, TABLE and YMARKS in turn before the expression is calculated.
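
The rescaling step can be illustrated with a short Python sketch. It is not the procedure's code: the function name and arguments are hypothetical, the table is treated as a flat list of values, and shifting the tick positions by the transformed origin is an assumption made so that the ticks line up with the rescaled bars.

```python
import math


def rescale_for_chart(table, origin=0.0, ymarks=(), yscale=None):
    """Return (bar heights, tick positions, tick labels) on the plotting scale."""
    transform = yscale if yscale is not None else (lambda x: x)
    t_origin = transform(origin)
    # Transform the data, then shift so that the origin plots at zero.
    heights = [transform(v) - t_origin for v in table]
    # Assumption: tick positions receive the same transformation and shift.
    tick_positions = [transform(m) - t_origin for m in ymarks]
    # Labels stay on the original scale of the data.
    tick_labels = list(ymarks)
    return heights, tick_positions, tick_labels


if __name__ == "__main__":
    print(rescale_for_chart([10, 100, 1000], origin=1,
                            ymarks=[1, 10, 100, 1000], yscale=math.log10))
```

With a log10 transformation and ORIGIN=1, values of 10, 100 and 1000 give bars of height 1, 2 and 3, while the axis is still labelled 1, 10, 100 and 1000.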

 

DDENDROGRAM procedure

Draws dendrograms with control over structure and style

(P.G.N. Digby)

 

Options

STYLE = string Style to use for the links of the dendrogram (average, centroid, lower, full); default aver

ORDERING = strings How to define the order of the units for the dendrogram (given, ziggurat, size, first); default zigg, size, firs

REVERSE = string Whether to reverse the order of the units in the dendrogram (no, yes); default no

ORIENTATION = string Specifies the orientation of a dendrogram produced by high-resolution graphics (north, south, east, west); default west

SETSCALE = string Whether the procedure should set the scale for the axis showing similarity to 1 for similarities, or 100 for percentage similarities, or whether the scale should be determined by the range of similarities (no, yes); default no

METHOD = string Method used to represent the scale on which the amalgamations have been made: settings other than the default are relevant only for data not generated by HCLUSTER or HDISPLAY (similarities, percentages, distances); default simi

SCREEN = string Setting to use for the SCREEN option of DGRAPH (clear, keep); default clea

CHANGE = string If a dendrogram-save structure from a previous DDENDROGRAM is used as the DATA parameter then this option specifies the area of the process where the first changes occur: see the description of the SAVE parameter (order, dendrogram, display); default orde

GRAPHICS = string Form of graphics to be used (lineprinter, highresolution); default high

 

Parameters

DATA = matrices or pointers Data defining each dendrogram in the form of either a matrix saved using the AMALGAMATIONS parameter of HCLUSTER (methods other than single linkage), or a matrix from the TREE parameter of HDISPLAY, or a SAVE structure from a previous use of DDENDROGRAM

PERMUTATION = variates Specify or save permutations of the units for drawing each dendrogram, according to ORDERING option

LABELS = variates or texts Supply labels to use for the units of each dendrogram; these should be in the natural order of the units, not in a permuted order

TITLE = texts Titles for the dendrograms

WINDOW = scalars Window to use for each dendrogram (window 1 if unset); if this is set to zero the dendrogram is not drawn, but results can still be saved using the PERMUTATION, ZIGGURAT and SAVE parameters

PENS = scalars, variates, strings or texts Scalar or string specifying the graphics pen or symbol in which to draw each (high-resolution or line-printer) dendrogram; alternatively use of a variate or text allows the structure of each dendrogram to be highlighted by drawing different links with different graphics pens or symbols

ZIGGURAT = variates Save the "ziggurat-degree" of the links in each dendrogram

SAVE = pointers Save the information required to plot a dendrogram, for use as input for the DATA parameter in a subsequent call to DDENDROGRAM

 

Description

DDENDROGRAM draws dendrograms using line-printer or high-resolution graphics, as indicated by the GRAPHICS option.

Dendrograms can be drawn in many ways, often with apparently quite different results, as illustrated by Digby (1985). The procedure allows the user considerable control over the way that a dendrogram is formed; in particular the order of the units and the style used for drawing the links of the dendrogram can be varied. If high-resolution graphics is to be used, a check should be made to ensure that this facility is present in the available version of Genstat. This can be done by seeing what happens when any of the relevant directives is used (Genstat 5 Release 3 Reference Manual, Chapter 6). Then directives DEVICE, FRAME and PEN should be used to change the default settings, if required; these can be ascertained using the statement

HELP ENVIRONMENT, PICTURES, CURRENT

The input for the procedure is given by the DATA parameter. This should be a matrix containing the amalgamations information from hierarchical cluster analysis (from the AMALGAMATIONS parameter of HCLUSTER) or a matrix containing the minimum spanning tree information (from the TREE parameter of the HDISPLAY directive); alternatively a SAVE structure from a previous DDENDROGRAM can be used as input. However, in the current release of Genstat, the amalgamations matrix from HCLUSTER is unusable if the clustering has been produced by single linkage, so the minimum spanning tree information, which is equivalent, should be used as input.

The PERMUTATION parameter can be supplied with a variate, either to specify a permutation of the rows of the dendrogram or to save the permutation generated by DDENDROGRAM, as indicated by the ORDERING option. Setting ORDERING=given takes the ordering defined by the PERMUTATION variate. The other settings of ORDERING define partial orderings of the units, and are used in conjunction with each other to obtain the full ordering: ziggurat (Critchley 1983) is associated with ultrametric distances amongst the units; size specifies that when 2 groups merge the smaller is always placed before the larger in the order; first specifies that when 2 groups merge the group containing the lowest numbered unit is always placed before the other in the order. The orders given by settings ziggurat and size are not completely specified and recourse may be made to the other of these settings or to first. If ORDERING is not set to given then a list of settings may be specified in which case the first in the list is used, the second is used to satisfy indeterminacies in the order given by the first setting in the list, and so on. The default is the list of settings: ziggurat, size, first.

Option REVERSE allows the ordering thus obtained to be reversed.

The LABELS parameter can be given a variate or a text to supply labels for the rows of the dendrogram. Labelling can be suppressed altogether by using a text containing only spaces.

The STYLE option controls the style to use in forming the links of the dendrogram: its setting indicates where the line representing each new cluster should be placed. Assuming that the dendrogram has the units on the left-hand side, the settings can be described as follows:

average (the default) the new line is midway between the old lines;

centroid the new line is placed at the mid-point of all the units in the group it represents;

lower the new line is a continuation of the lower of the two old lines (comparable with dendrograms from HCLUSTER);

full the new line is a continuation of the upper or lower of the two old lines, so that each vertical line spans all the units in the group it represents.

The ORIENTATION option is relevant to high-resolution graphics, when it controls the orientation of the dendrogram: for example the setting north results in a "hanging dendrogram" with the units across the top. The default setting is west, which gives a dendrogram with the units on the left-hand side; this is also how DDENDROGRAM draws dendrograms on the line-printer.

The SETSCALE option controls whether the procedure should set the scale for the axis showing similarity to 1 for similarities (100 for percentage similarities), instead of determining the scale by the range of similarities or distances.

The METHOD option indicates the scale on which the amalgamations have been made. This option need be set only if the data have been obtained from a source other than HCLUSTER or HDISPLAY.

The TITLE parameter specifies a title for each dendrogram. For high-resolution graphics, the WINDOW parameter defines the graphics window to use for each plot. With line-printer graphics, two "windows" are available: window 1 has a width of 101 characters, window 2 a width of 61 characters. If WINDOW is not set, window 1 is used. If it is set to zero, the dendrogram is not drawn but results can still be saved using the PERMUTATION, ZIGGURAT and SAVE parameters; however, if the SAVE structure is used later as input to DDENDROGRAM, the CHANGE option must not be set to display as the dendrogram stage will not have been completed. The SCREEN option controls whether to clear the high-resolution graphics screen before plotting (default clear).

For high-resolution graphics, the PENS parameter can be supplied with a scalar indicating the graphics pen with which to draw the dendrogram. Alternatively, if required, a variate can be specified to highlight the structure of the dendrogram by drawing different links with different pens; the links are taken in the same order as the rows of the AMALGAMATIONS matrix from HCLUSTER or in increasing order of the links of the minimum spanning tree. DDENDROGRAM will use pen 1 if the PENS parameter is not set. Any pens used by DDENDROGRAM will be set to METHOD=line, SYMBOLS=0, JOIN=given. If a scalar is supplied or PENS is not set, the pen used will also have LINESTYLE set to 1. If a variate is used, appropriate settings of COLOUR and LINESTYLE should be set (using the PEN directive) prior to calling DDENDROGRAM. Similarly, with line-printer graphics, the PENS parameter can be set either to a string or to a text, according to whether the links are to be drawn with the same or different symbols; if the parameter is unset, the plus symbol (+) is used for all the links.

The ZIGGURAT parameter can be used to save the "ziggurat-degree" (Critchley 1983) of each link. This could then be used to form the setting of the PENS parameter for a later dendrogram, in order to display particular aspects of the clustering more clearly.

The SAVE parameter can be used to save the various structures that control the drawing of a dendrogram in order to save computing time when drawing a similar dendrogram. The SAVE structure should then be used as the setting of the DATA parameter, and the CHANGE option used to indicate the stage at which to start changing aspects of the previous dendrogram. The various stages (in order) involve the following options and parameters:

order ORDERING and PERMUTATION;

dendrogram STYLE and METHOD;

display REVERSE, ORIENTATION, SETSCALE, SCREEN, LABELS, TITLE, WINDOW, PENS.

 

Options: STYLE, ORDERING, REVERSE, ORIENTATION, SETSCALE, METHOD, SCREEN, CHANGE, GRAPHICS.

Parameters: DATA, PERMUTATION, LABELS, TITLE, WINDOW, PENS, ZIGGURAT, SAVE.

 

Method

Dendrograms are constructed and drawn in four separate stages: firstly the amalgamations information is used to construct information on group sizes; secondly a permutation of the units is formed, if required, according to several possible ordering schemes; thirdly graphical information on each of the links of the dendrogram is formed; lastly this graphical information is used to display the dendrogram, subject to requirements over orientation, pens, etc. Separate procedures are used for each stage (for details see the source code of DDENDROGRAM, obtainable via LIBEXAMPLE). A preliminary stage is also needed to construct the amalgamations from information on a minimum spanning tree. Communication amongst the subsidiary procedures is obtained using a pointer, which the user may keep using the SAVE parameter. The algorithms used by the first three subsidiary procedures are similar to those described by Digby (1984a, 1984b).

 

Action with RESTRICT

If any of the options or parameters are restricted, unpredictable results may occur: none of the options or parameters should be restricted.

 

References

Critchley, F. (1983). Ziggurats and dendrograms. Report No. 43, Department of Statistics, University of Warwick.

Digby, P.G.N. (1984a). Drawing pretty dendrograms. Genstat Newsletter, 14, 18-26.

Digby, P.G.N. (1984b). Dendrograms and ziggurats. Genstat Newsletter, 14, 14-18.

Digby, P.G.N. (1985). Graphical displays for classification. PACT Journal of the European Study Group on Physical, Chemical and Mathematical Techniques Applied to Archaeology.

 

DDESIGN procedure

Plots the plan of an experimental design

(K.E. Bicknell & R.W. Payne)

 

Options

Y = variate Specifies the y position of the plots in standard coordinates 1 ... number of rows of plots in the experiment (taking 1 as the top row of the window)

X = variate Specifies the x-coordinate of the plots in standard coordinates 1 ... number of columns of experimental plots

TITLE = text Title for the plan

WINDOW = scalar Window number for the plan; default 3

KEYWINDOW = scalar Window number for the key; default 0

SCREEN = string Whether to clear the screen before plotting (clear, keep); default clea

KEYDESCRIPTION = text Overall description for the key; default *

ENDACTION = string Action to be taken after completing the plot (continue, pause); default * uses the setting from the last DEVICE statement

CHARACTERS = scalar Sets a limit on the length of each factor label; default * i.e. none

 

Parameters

FACTOR = factors Factors to be listed on the plan and to define the layout (the procedure determines the style of line to divide each pair of plots in the design from the grid pen of the first factor in the list with which they have different levels); default * forms the list from first the factors specified by a preceding BLOCKSTRUCTURE statement, and then those specified by a preceding TREATMENTSTRUCTURE statement

PEN = scalars Pen to be used to write the levels of each factor on the plan (if PEN=0 the levels of that factor are not included); default 1

PENGRID = scalars Pens to be used to draw the boundaries between the plots in the design (according to the first FACTOR with which they have different levels but ignoring factors with PENGRID=0); default 1

LABEL = texts Labels to be used for each factor if its own levels or labels are inappropriate

 

Description

DDESIGN uses high-resolution graphics to produce a plan of an experimental design. The plots in the design are assumed to be arranged on a rectangular grid. The rows of the plots are assumed to run from 1 (at the top of the graph) upwards and are specified by a variate supplied by the Y option. The columns (again running from 1 upwards) are specified by a variate supplied by the X option. If either Y or X is not specified, DDESIGN will generate values automatically according to the factors in the design.

The TITLE, WINDOW, KEYWINDOW, SCREEN, KEYDESCRIPTION and ENDACTION options operate as usual in high-resolution graphics, while the CHARACTERS option allows a limit to be set on the length of each factor label when written on the plan.

The factors involved in the experiment can be listed using the FACTOR parameter. If this is omitted DDESIGN forms the list firstly from the factors in the previous BLOCKSTRUCTURE statement (or a "units" factor if there was none), and then from the factors (if any) in the previous TREATMENTSTRUCTURE statement.

These factors are then used to draw the plan and to label the plots in the design. The PEN parameter allows the levels or labels of the factors to be drawn using different pens (and thus, for example, in different colours). If the pen for any factor is defined as zero, its levels/labels are not included. However, it can still be used to determine the lines drawn to delimit the plots. For these lines, DDESIGN considers each pair of adjacent plots and checks through the list of factors to find the first one for which they have different levels. It then uses the grid pen (defined by the PENGRID parameter) to draw the dividing line. If the grid pen of any factor is zero, it is ignored.

This makes it very easy to achieve the usual style of plan in which stronger lines are used, for example, to indicate the boundaries between different blocks than between the plots within blocks. For example, the parameter settings to draw a randomized block design with a single treatment factor Treat in this way would be

FACTOR=Block,Plots,Treat; PEN=1; PENGRID=1,2,0

if all the factors are to have their levels listed within the plots, or

FACTOR=Block,Plots,Treat; PEN=0,0,1; PENGRID=1,2,0

if only Treat is to be listed. Note that, as each pair of plots will have different levels of either Block or Plots (or both), the PENGRID specified here for Treat is irrelevant.
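
The rule for choosing the grid pen between two adjacent plots can be sketched in Python. This is an illustration of the rule as described, not DDESIGN's implementation; the function name and the representation of each plot as a tuple of factor levels are hypothetical.

```python
def grid_pen(levels_a, levels_b, pengrid):
    """Pen for the line dividing two adjacent plots.

    levels_a, levels_b: levels of each factor (in FACTOR order) for the two plots.
    pengrid: grid pen for each factor; a pen of 0 means the factor is ignored.
    Returns None if no factor with a non-zero grid pen distinguishes the plots.
    """
    for a, b, pen in zip(levels_a, levels_b, pengrid):
        if pen != 0 and a != b:
            return pen
    return None


if __name__ == "__main__":
    # FACTOR=Block,Plots,Treat; PENGRID=1,2,0
    pengrid = (1, 2, 0)
    print(grid_pen((1, 4, 2), (2, 1, 3), pengrid))  # plots in different blocks
    print(grid_pen((1, 3, 2), (1, 4, 1), pengrid))  # same block, different plots
```

With PENGRID=1,2,0 a boundary between blocks is drawn with pen 1 and a boundary between plots in the same block with pen 2; Treat is never consulted because its grid pen is zero.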

If a plot has no neighbour in some direction, DDESIGN will check the next but one plot; if this too is not used in the design, the grid pen of the first FACTOR is used to mark the boundary.

The final parameter, LABEL, allows alternative labels to be specified for each factor if the existing ones are inappropriate.

 

Options: Y, X, TITLE, WINDOW, KEYWINDOW, SCREEN, KEYDESCRIPTION, ENDACTION, CHARACTERS.

Parameters: FACTOR, PEN, PENGRID, LABEL.

 

Method

DDESIGN makes use only of standard Genstat facilities for manipulation and plotting.

 

Action with RESTRICT

If any of the factors or X or Y is restricted, only the unrestricted plots are displayed.

 

DECIMALS procedure

Sets the number of decimals for a structure, using its round-off

(A. Keen)

 

No options

 

Parameters

STRUCTURE = identifiers Numerical structure for which the number of decimals is to be set

DECIMALS = scalars To save the number of decimals

ROUND = scalars To save the round-off

 

Description

The default number of decimals that Genstat applies when printing a numerical structure is not always optimal. A scalar with value 0.1 is represented as 0.1000 for example. The trivial solution is to set the parameter DECIMALS of the directive SCALAR or to set the parameter DECIMALS of the directive PRINT. However, for routine printing when tidy output is required (as may be the case in procedures), this is not a feasible solution.

The numerical structure for which the number of decimals has to be determined must be specified using the parameter STRUCTURE. The procedure calculates the appropriate number of decimal places, and modifies the declaration of the structure so that this becomes its default number of decimal places for subsequent printing. Parameter DECIMALS allows the number of decimals to be saved, parameter ROUND saves the round-off (see Method).

 

Options: none. Parameters: STRUCTURE, DECIMALS, ROUND.

 

Method

The round-off value of a number equals $10^k$, with $k$ a negative or positive integer or zero. The round-off value of a number equals $d$ if the number after dividing by $d$ is an integer but after dividing by $10 \times d$ is not. If the round-off value is such that the number of significant digits is greater than 4, the round-off value is increased correspondingly. For example, the round-off value of 880 equals 10, that of 0.2300 equals 0.01 and of 9999.11 equals 1. The round-off value of a structure is the minimum of the round-off values of all the elements of the structure, subject to the restriction that the number of significant digits does not exceed 4 for any of the values of the structure.

The number of decimals of a structure is calculated from the round-off value of the structure as -log10(round-off value), with minimum value zero. So in the above examples the number of decimals equals 0 for 880, 2 for 0.2300 and 0 for 9999.11.
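
The round-off rule and the resulting number of decimals can be sketched in Python. This is an interpretation of the description above, not the procedure's code: values are passed as decimal strings to avoid binary rounding, zero values are simply skipped (an assumption, since the text does not cover them), and the function names are hypothetical.

```python
from decimal import Decimal

MAX_DIGITS = 4  # limit on significant digits stated in the text


def _exponents(value):
    """Return (k, lead) for a non-zero decimal string: 10**k is the position
    of the last non-zero digit and 10**lead that of the leading digit."""
    d = Decimal(value).normalize()
    return d.as_tuple().exponent, d.adjusted()


def roundoff(values):
    """Round-off value of a structure given as a list of decimal strings."""
    pairs = [_exponents(v) for v in values if Decimal(v) != 0]
    if not pairs:
        return Decimal(1)  # assumption: a structure of zeros needs no decimals
    k = min(k for k, _ in pairs)
    # No value may show more than MAX_DIGITS significant digits.
    limit = max(lead for _, lead in pairs) - MAX_DIGITS + 1
    return Decimal(10) ** max(k, limit)


def decimals(values):
    """Number of decimals: -log10(round-off value), with minimum zero."""
    return max(0, -roundoff(values).adjusted())


if __name__ == "__main__":
    for v in ("880", "0.2300", "9999.11"):
        print(v, roundoff([v]), decimals([v]))
```

For the three examples in the text this gives round-off values of 10, 0.01 and 1, and 0, 2 and 0 decimals respectively.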

 

Action with RESTRICT

Restrictions are not allowed.

 

DESCRIBE procedure

Saves and/or prints summary statistics for variates

(R.C. Butler)

 

Options

PRINT = string Controls whether or not the summaries are printed (summaries); default summ

SELECTION = strings Selects the statistics to be produced (nval, nobs, nmv, mean, median, min, max, range, q1, q3, var, sd, sem, %cv, sum, ss, uss, skew, seskew, kurtosis, sekurtosis); default mean, min, max, nobs, nmv, medi, q1, q3

 

Parameters

DATA = variates Data to summarize

SUMMARIES = variates To save summaries for each DATA variate

 

Description

DESCRIBE calculates up to 21 different summary statistics for values stored in a variate. The statistics may be saved, or printed, or both. The statistics to be calculated are indicated by the SELECTION option; the available settings are:

nval number of values

nobs number of non-missing values

nmv number of missing values

mean arithmetic mean

median median

min minimum

max maximum

range range (max-min)

q1 lower quartile

q3 upper quartile

var variance

sd standard deviation

sem standard error of mean

%cv coefficient of variation

sum total of values

ss corrected sum of squares

uss uncorrected sum of squares

skew skewness (see Method)

seskew standard error of skewness

kurtosis kurtosis (see Method)

sekurtosis standard error of kurtosis

By default, the mean, min, max, nobs, nmv, median and both quartiles are calculated.

Printing is controlled by the PRINT option. The statistics are printed by default, so to suppress printing you need to put PRINT=*.

The SUMMARIES parameter allows the statistics to be saved in a variate, which need not be declared in advance. The units of the variate are labelled by the corresponding strings from the settings (in capital letters) of the SELECTION option, to simplify the subsequent access of any individual statistic. For example, the minimum value can be copied from a SUMMARIES variate v into a scalar m by

CALCULATE m = v$['MIN']

 

Options: PRINT, SELECTION. Parameters: DATA, SUMMARIES.

 

Method

The statistics are calculated in a variate which is then restricted to print only those that were required, and to obtain the unit numbers of those to be copied into the SUMMARIES variate.

Skewness is calculated as $(M_3 - 3 M_1 M_2 + 2 M_1^3) / (M_2 - M_1^2)^{3/2}$

where $M_i = (\sum x^i) / N$

SE Skewness is calculated as $\sqrt{6N(N-1) / \{(N-2)(N+1)(N+3)\}}$

Kurtosis is calculated as $(M_4 - 4 M_1 M_3 + 6 M_1^2 M_2 - 3 M_1^4) / (M_2 - M_1^2)^2 - 3$

SE Kurtosis is calculated as $\sqrt{24N(N-1)^2 / \{(N-2)(N-3)(N+5)(N+3)\}}$
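
As an illustration only (this is not the Genstat code of the procedure), the skewness and kurtosis calculations can be sketched in Python from the raw moments, taking the i-th moment to be the sum of the i-th powers of the values divided by N. The function names are hypothetical, and the sketch assumes at least four non-missing values with non-zero variance.

```python
import math

def raw_moment(values, order):
    """Raw moment of the given order: sum of x**order divided by N."""
    return sum(x ** order for x in values) / len(values)

def skewness_and_kurtosis(values):
    """Return (skewness, s.e. of skewness, kurtosis, s.e. of kurtosis)."""
    n = len(values)
    m1, m2, m3, m4 = (raw_moment(values, i) for i in (1, 2, 3, 4))
    variance = m2 - m1 * m1  # divisor N, as in the formulae
    skew = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / variance ** 1.5
    kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4) / variance ** 2 - 3
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = math.sqrt(
        24 * n * (n - 1) ** 2 / ((n - 2) * (n - 3) * (n + 5) * (n + 3))
    )
    return skew, se_skew, kurt, se_kurt

skew, se_skew, kurt, se_kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])
```

For the symmetric data 1...5 the skewness is zero, and the kurtosis is negative because the values are spread evenly rather than concentrated around the mean.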

 

Action with RESTRICT

The statistics are calculated for the restricted set of units from each DATA variate. Any existing restrictions are not affected by the procedure.

 

DESIGN procedure

Helps to select and generate effective experimental designs

(M.F. Franklin, R.W. Payne & A.E. Ainsley)

 

No options

 

No parameters

 

Description

DESIGN is a procedure which can be used interactively to form experimental designs of various types. The process involves answering questions, posed by Genstat, first to select the particular type of design, then to give details such as names of factors, numbers of treatments, and so on. A range of subsidiary procedures may be called, depending on the type of design selected. If you wish to avoid some of the question-and-answer process, the subsidiary procedures can also be called directly. They all have options and parameters which provide an alternative way of supplying the information otherwise obtained by the various questions and, provided you supply all the required information, they can also be used in batch.

There are 13 types of design.

Orthogonal hierarchical designs - designs such as randomized blocks, split-plots, split-split-plots, &c.

Factorial designs (with blocking) - these have several treatment factors and a single blocking factor (giving strata for blocks and plots within blocks). The blocks are too small to contain a complete replicate of the treatment combinations and so various interactions are confounded with blocks.

Fractional factorial designs (with blocking) - again there are several treatment factors but the design does not contain every treatment combination and so some interactions are aliased; there can also be a blocking factor and some interactions will then be confounded with blocks.

Lattice designs - designs for a single treatment factor with number of levels that is the square of some integer k. The design has replicates, each containing k blocks of k plots, and different treatment contrasts can be confounded with blocks in each replicate.

Lattice squares - these are similar to lattices except that the blocking structure within the replicates has rows crossed with columns; again different treatment contrasts can be confounded with the rows and columns in each replicate.

Latin squares - designs are available for 3 to 14 treatments; several different orthogonal squares are available for most of these so, for example, Graeco Latin squares can be formed by calling DESIGN twice to generate each of the two treatment factors using a different square.

Alpha designs - these again have a single treatment factor but there is no constraint on the number of levels; the blocking structure has replicates and blocks within replicates. Further details are given in the description of the procedure AFALPHA or by Patterson & Williams (1976).

Cyclic designs - these are designs with a single blocking factor which defines blocks that are too small to contain every treatment. Usually there is a single treatment factor, but you can also generate the cyclic superimposed designs of Hall & Williams (1973) in which there are two treatment factors and the treatment structure fits only the main effects. An alternative refinement (Davis & Hall 1969) has a crossed blocking structure generally taken to represent subjects*time. Details of the cyclic process by which the treatment levels are generated can be found in the description of the procedure AFCYCLIC.

Balanced-incomplete-block designs - designs where the experimental units are grouped into blocks such that every pair of treatments occurs in an equal number of blocks. All comparisons between treatments are thus made with equal accuracy, so the design is balanced and, in particular, can be analysed by ANOVA. Further details are given in the description of procedure AGBIB.

Neighbour-balanced designs - designs that allow adjustments to be made for the effect that a treatment may have on adjacent plots. Further details are given in the description of procedure AGNEIGHBOUR.

Central composite designs - used to study multi-dimensional response surfaces; see procedure AGCENTRALCOMPOSITE.

Box-Behnken designs - used to study multi-dimensional response surfaces; see procedure AGBOXBEHNKEN.

Plackett Burman (main effect) designs - for estimating main effects of factors with two levels, using a minimum number of experimental units (Plackett & Burman 1946). Further details are given in the description of procedure AGMAINEFFECT.
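
The balance property of balanced-incomplete-block designs, and the cyclic way in which blocks can be developed from an initial block, can be illustrated with a short Python sketch. It is not taken from AFCYCLIC or AGBIB: the function names are hypothetical, and the initial block {0, 1, 3} for 7 treatments is simply a well-known example that happens to give a balanced design.

```python
from collections import Counter
from itertools import combinations

def cyclic_blocks(initial_block, n_treatments):
    """Develop an initial block cyclically: add 0, 1, ... modulo n_treatments."""
    return [
        sorted((treatment + shift) % n_treatments for treatment in initial_block)
        for shift in range(n_treatments)
    ]

def pair_concurrences(blocks):
    """Count how many blocks contain each pair of treatments."""
    counts = Counter()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    return counts

blocks = cyclic_blocks([0, 1, 3], 7)       # 7 blocks of 3 plots
concurrences = pair_concurrences(blocks)   # one entry per pair of treatments
balanced = len(set(concurrences.values())) == 1
```

Here every one of the 21 pairs of treatments occurs together in the same number of blocks, which is the condition for balance described above. Other initial blocks generally give cyclic designs that are not balanced.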

You will be asked to provide a seed to be used to randomize the design and then given the opportunity to print a plan. If the design can be analysed by ANOVA, the procedures will define appropriate block and treatment formulae and then ask if you want to see the skeleton analysis-of-variance table (containing just source of variation, degrees of freedom and efficiency factors). Whether or not you choose to print any of this information, at the end of the whole process all the block and treatment factors necessary to define the design will be available - and they will have the identifiers that you have supplied in response to the various questions asked by the procedures.

 

Options: none. Parameters: none.

 

Method

The QUESTION directive is used to obtain the details of the required design. The design is then generated using GENERATE and the other standard Genstat directives for calculation and manipulation. Most of the information needed to specify the designs is stored in backing-store files on the computer, and much of this was adapted from the standard designs of the program DSIGNX (Franklin & Mann 1986).

 

References

Davis, A.W. & Hall, W.B. (1969). Cyclic change-over designs. Biometrika 56, 283-293.

Franklin, M.F. & Mann, A.D. (1986). DSIGNX a program for the construction of randomized experimental plans. Scottish Agricultural Statistics Service, Edinburgh (revised edition).

Hall, W.B. & Williams, E.R. (1973). Cyclic superimposed designs. Biometrika 60, 47-53.

Patterson, H.D. & Williams, E.R. (1976). A new class of resolvable incomplete block designs. Biometrika, 63, 83-92.

Plackett, R.L. & Burman, J.P. (1946). The design of optimum factorial experiments. Biometrika, 33, 305-325 & 328-332.

 

DIALLEL procedure

Analyses full and half diallel tables with parents

(J.F. Potter)

 

Options

PRINT = strings Controls printed output (data, vrwr, regression, aov, means); default data, vrwr, regr, aov, mean

LABELS = text Labels for rowcols, one text value for each; column j has the same label as row j, so each value of LABELS labels a pair of parents (a rowcol); default 1...N, where N is the dimension of each diallel table

METHOD = string Whether to perform full or half diallel analysis (half, full); default full

 

Parameter

DATA = matrices Each matrix contains the data for one block in the analysis; half diallel tables are presented as square matrices with the upper triangles and leading diagonals containing the values of interest; the matrices must all be of the same size

 

Description

DIALLEL performs analysis of variance of full diallel tables (Hayman 1954) and half diallels (Jones 1965). Work on variance and covariance relationships is also performed (Jinks 1954). The data are specified by the DATA parameter, in a square matrix for every block in the analyses. Half diallel tables are presented as square matrices with the upper triangle and leading diagonal containing the values of interest. The PRINT option controls printed output:

data data values,

vrwr variances and covariances of rowcols,

regression regression of the covariances (Wr) on the variances (Vr),

aov analysis of variance table,

means means.

The LABELS option can give a text to be used for labelling rowcols (called arrays in the literature). The METHOD option specifies whether analysis of full or half diallels is required.

 

Options: PRINT, LABELS, METHOD.

Parameter: DATA.

 

Method

DIALLEL performs analysis of variance of full diallel tables, according to the method of Hayman (1954), and half diallels, according to the method of Jones (1965). A diallel table is a representation of the results of crossing a set of male and female homozygous parents in all possible combinations, including male:female reciprocation in full diallels. DIALLEL expects parent values (selfs) to be present as the leading diagonal of the table (whether a full or half matrix).

The analysis of variance estimates the following genetic components of variation.

a: variation between mean effects of each parental line. Genetically this provides a test of additive variation, but also detects dominance if asymmetry is present, i.e. if alleles at any one locus are not equally frequent (Hayman 1954).

b: variation caused by dominance at some of the loci. This term splits into:

b1: if significant this shows that dominance is largely uni-directional;

b2: estimates the asymmetry mentioned in a;

b3: signifies that some dominance is peculiar to individual crosses. If the symmetry condition is met, b1 and b3 together give a test of dominance equivalent to b.

c: variation between average maternal effects of each parental line.

d: variation in the reciprocal differences not attributable to c.

t: total variation.

Components c and d are reciprocal effects not available in half diallels. In the absence of replication, the d term should be used as the error term for testing components a to c in the full diallel.

DIALLEL can also analyse over any number of blocks, in which case block effects are also estimated, and block interactions with the above components can then be used as estimates of error to test the significance of the components.

Variances of rowcols (Vr) are compared with the covariance of the rowcols (Wr) with the corresponding concurrent parents, using the method of Jinks (1954). This entails the regression of Wr on Vr, which gives measures of adequacy of the model, average dominance, and the distribution of dominant and recessive genes. The analysis of diallel tables is more fully described by Mather and Jinks (1971).
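
The quantities Vr and Wr can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of DIALLEL: it takes a single complete symmetric table with the parents on the leading diagonal, uses the sample variance and covariance (divisor n-1), and takes Wr to be the covariance of rowcol r with the parental values. The function names are hypothetical.

```python
def variance(xs):
    """Sample variance with divisor n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def vr_wr(table):
    """Return (Vr, Wr) for each rowcol of a square diallel table."""
    parents = [table[i][i] for i in range(len(table))]  # selfs on the diagonal
    vr = [variance(row) for row in table]
    wr = [covariance(row, parents) for row in table]
    return vr, wr

# A purely additive table: each cross is the mid-parent value.
parent_values = [2.0, 4.0, 6.0, 8.0]
additive = [[(a + b) / 2 for b in parent_values] for a in parent_values]
vr, wr = vr_wr(additive)
```

In this purely additive example every rowcol has the same Vr and the same Wr, with Wr equal to twice Vr. Dominance moves the points apart, and it is the resulting scatter of (Vr, Wr) points that the regression summarizes.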

Many other diallel methods exist, DIALLEL representing quite a complex one, but one which makes fairly limiting assumptions, e.g. only a reference population in Hardy-Weinberg equilibrium with respect to individual loci and linkage equilibrium with respect to all pairs of loci can legitimately be used to estimate the genetic variance components. This means a large population reproducing by panmixia without selection. This and other difficulties such as the need for distinction between ancestral and descendant reference populations are discussed by Wright (1985).

 

Action with RESTRICT

Restrictions are ignored for text LABELS and are not relevant for DATA, which is of type matrix.

 

References

Hayman, B.I. (1954). The Analysis of Variance of Diallel Tables. Biometrics, 10, 235-244.

Jones, R.M. (1965). Analysis of Variance of the Half Diallel Table. Heredity, 20, 117-121.

Jinks, J.L. (1954). The Analysis of Continuous Variation in a Diallel Cross of Nicotiana rustica Varieties. Genetics, 39, 767-788.

Mather, K. & Jinks, J.L. (1971). Biometrical Genetics, 249-284. Chapman & Hall Ltd.

Wright, A.J. (1985). Diallel Designs, Analyses, and Reference Populations. Heredity, 54, 307-311.

 

DILUTION procedure

Calculates Most Probable Numbers from dilution series data

(M.S. Ridout & S.J. Welham)

 

Options

PRINT = strings Output required (estimates, fitted); default esti, fitt

%LIMITS = scalar Percentage points for confidence limits; default 95

RMETHOD = string Which type of residuals to form (deviance, Pearson); default deviance

MAXCYCLE = scalar Maximum number of iterations allowed for the Newton-Raphson algorithm to converge; default 10

TOLERANCE = scalar Defines the convergence criterion; default 0.0005

 

Parameters

POSITIVE = variates Number of positive subsamples at each dilution

NSAMPLE = variates Total number of subsamples tested at each dilution

VOLUME = variates Volume of original sample present in each dilution

FITTED = variates To store the fitted values

RESIDUAL = variates To store the residuals, as specified by option RMETHOD

MPN = scalars To store the maximum likelihood estimate of Most Probable Number

UPPER = scalars To store the upper confidence limit for MPN

LOWER = scalars To store the lower confidence limit for MPN

DEVIANCE = scalars To store the residual deviance

PEARSONCHI = scalars To store Pearson's chi-squared statistic

DF = scalars To store the degrees of freedom for goodness-of-fit tests (zero if no goodness of fit test is available)

 

Description

A dilution series experiment seeks to estimate the number of organisms in a sample. This is done by preparing successive dilutions of the original sample (usually with a constant dilution factor at each stage), and then testing for the presence/absence of organisms in several subsamples at each dilution. Under certain assumptions, discussed, for example, by Cochran (1950), it is then possible to estimate, by maximum likelihood, the number of organisms in the original sample. In the context of dilution series data, the maximum likelihood estimator is usually known as the Most Probable Number (MPN) of organisms.

DILUTION calculates the MPN estimator, together with likelihood-based confidence limits for the number of organisms.

The number of positive subsamples at each dilution (i.e. the number of subsamples which show the presence of organisms) must be specified in a variate using the parameter POSITIVE. The total number of subsamples used at each dilution, and the volume of the original sample used at each dilution, must be supplied in variates using parameters NSAMPLE and VOLUME.

Output is controlled by the PRINT option. The estimates setting produces the MPN estimate and associated confidence limits, together with the deviance and Pearson's chi-squared statistic. The fitted setting gives observed and fitted values with residuals. All this information is produced by default. The range of the confidence limits can be set by option %LIMITS, the default being 95% limits, and the type of residuals produced (deviance or Pearson) is controlled by the RMETHOD option.

Both the MPN estimator and the confidence limits are calculated iteratively. Option MAXCYCLE sets the maximum number of iterations allowed in each case, the default being 10. Option TOLERANCE specifies the convergence criterion for the MPN estimator; the estimation process is considered to have converged when the absolute value of the derivative of the log-likelihood is less than TOLERANCE. The default value of TOLERANCE is 0.0005. The iterative calculation of the confidence limits is considered to have converged when the log-likelihood takes the correct value to 2 decimal places.

All the information generated can be saved using parameters of the procedure: MPN saves the estimate; UPPER and LOWER save the upper and lower confidence limits; DEVIANCE, PEARSONCHI and DF save the goodness of fit statistics and the degrees of freedom; and the fitted values and residuals are saved by FITTED and RESIDUAL.

 

Options: PRINT, %LIMITS, RMETHOD, MAXCYCLE, TOLERANCE.

Parameters: POSITIVE, NSAMPLE, VOLUME, FITTED, RESIDUAL, MPN, UPPER, LOWER, DEVIANCE, PEARSONCHI, DF.

 

Method

The Newton-Raphson algorithm is used to find both the MPN and the appropriate confidence limits.
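
A minimal Python sketch of a Newton-Raphson search for the MPN is given below. It is an illustration, not the code of DILUTION, and it rests on assumptions that are not stated in this description: the usual dilution-series model in which a subsample of volume v is positive with probability 1 - exp(-lambda v), a simple starting value, and a step-halving safeguard. It does follow the stopping rule described for the procedure (absolute derivative of the log-likelihood less than TOLERANCE, at most MAXCYCLE iterations). The confidence limits are not covered.

```python
import math

def mpn_estimate(positive, nsample, volume, maxcycle=10, tolerance=0.0005):
    """Return (estimate, iterations used) for the density lambda.

    Assumes at least one positive and one negative subsample overall.
    """
    def score(lam):
        # Derivative of the log-likelihood with respect to lambda.
        return sum(
            p * v / math.expm1(lam * v) - (n - p) * v
            for p, n, v in zip(positive, nsample, volume)
        )

    def score_slope(lam):
        # Second derivative of the log-likelihood (always negative).
        return -sum(
            p * v * v * math.exp(lam * v) / math.expm1(lam * v) ** 2
            for p, v in zip(positive, volume)
        )

    # Hypothetical starting value: positives per unit volume tested.
    lam = sum(positive) / sum(n * v for n, v in zip(nsample, volume))
    for cycle in range(1, maxcycle + 1):
        u = score(lam)
        if abs(u) < tolerance:
            return lam, cycle
        candidate = lam - u / score_slope(lam)
        # Safeguard (an assumption of this sketch): keep lambda positive.
        lam = candidate if candidate > 0 else lam / 2
    raise RuntimeError("no convergence within maxcycle iterations")

# One dilution, 5 positives out of 10 subsamples of unit volume:
# the estimate should be close to -log(1 - 5/10) = log(2).
estimate, cycles = mpn_estimate([5], [10], [1.0])
```

Because the convergence criterion is applied to the derivative of the log-likelihood rather than to the estimate itself, the accuracy achieved for a given TOLERANCE depends on the scale of the volumes.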

 

Action with RESTRICT

If any of POSITIVE, NSAMPLE or VOLUME are restricted (these restrictions must be compatible), then only the restricted set of units will be used.

 

Reference

Cochran, W.G. (1950). Estimation of bacterial densities by means of the 'most probable number'. Biometrics, 6, 105-116.

 

DISCRIMINATE procedure

Performs discriminant analysis

(P.G.N. Digby)

 

Options

PRINT = strings Printed output from the analysis (lrv, adjustments, means, scores, distances, newgroups); default * i.e. no output

NROOTS = scalar The number of dimensions to be used for printed and saved output, and used in calculating the distances and the allocation of units; default is to use the full dimensionality

REALLOCATE = string Whether units from the training set are to be reallocated to groups (no, yes); default no

 

Parameters

DATA = pointers Each pointer contains a set of variates to be analysed

GROUPS = factors Define groupings for the units in each training set, or missing values for the units to be allocated

NEWGROUPS = factors Save allocations (and reallocations)

MEANS = matrices Save scores for group means

SCORES = matrices Save scores for units

DISTANCES = matrices Save unit to group-mean squared distances

LRV = LRVs Save the LRVs from the canonical variate analyses

ADJUSTMENTS = matrices Save adjustments to the canonical variate analyses

 

Description

DISCRIMINATE performs discriminant analysis (see, for example, Mardia, Kent & Bibby 1979).

The input for the procedure is given by a pointer and a factor, specified by the DATA and GROUPS parameters, respectively. The pointer contains a set of variates defining the attributes of the units. Any unit with a missing value in any of the variates is excluded from the analysis. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent (the rules here are exactly as used by the FSSPM directive). The factor specifies the pre-defined groupings of the units from which the allocation is derived (the 'training set'); the units to be allocated by the analysis have missing factor values. The levels of the factor must all exceed -9999, or a misallocation of the units may result.

Printed output is controlled by the option PRINT with settings:

lrv to print the canonical variate loadings, the latent roots and the trace;

adjustments to print the adjustments required to the canonical variate scores;

means to print canonical variate scores for the group means;

scores to print canonical variate scores for the units;

distances to print Mahalanobis squared distances between the units and the group means;

newgroups to print the initial grouping and the allocation of units to groups.

The NROOTS option may be used to specify how many dimensions are to be printed and retained for the latent roots and vectors and for the scores of the means and units. The distances of the units from the group means, and thus the allocation of units, are also formed from the scores in the number of dimensions specified by NROOTS. By default results will be for the full dimensionality, i.e. the smaller of the number of variates and one less than the number of groups.

The REALLOCATE option may be used to specify whether the units in the training set are to be reallocated to groups by the procedure. If the default setting no is used then their group values, either printed or saved, will be missing.

Results from the analysis can be saved using the parameters NEWGROUPS, MEANS, SCORES, DISTANCES, LRV and ADJUSTMENTS. The structures specified for these parameters need not be declared in advance. The results correspond to p dimensions, where p is the smaller of either the number of variates, or the number of groups minus one.

 

Options: PRINT, NROOTS, REALLOCATE.

Parameters: DATA, GROUPS, NEWGROUPS, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS.

 

Method

A canonical variate analysis (CVA) is used to obtain the scores for the group means and the LRV containing the loadings (L), roots and trace; the analysis excludes units omitted by RESTRICT, or that have missing values in the data variates or the GROUPS factor. Scores are then calculated for all the units (i.e. ignoring any restrictions or missing values), using the formula

( X L ) + ( J A )

where X is a matrix containing the full set of units-by-variables data, J is a column vector of ones, and A is a row vector of adjustments required to place the scores for the units onto the same scale as those for the group means.

Mahalanobis squared distances between the units and the group means are calculated from the canonical variate scores. Each unit is then allocated to the group for which it has the smallest Mahalanobis squared distance to the group mean. In forming the allocations it is assumed that none of the levels of the factor GROUPS is less than or equal to -9999; otherwise a misallocation of the units may result.
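
The allocation rule can be sketched in Python as below. This is an illustration, not the code of DISCRIMINATE. It assumes that the canonical variate scores are scaled so that the within-group variation is the same in every dimension, in which case the Mahalanobis squared distance is simply the squared Euclidean distance between the scores. The function names and data are hypothetical.

```python
def squared_distances(unit_scores, mean_scores):
    """Squared distance from each unit to each group mean, in score space."""
    return [
        [sum((s - m) ** 2 for s, m in zip(unit, mean)) for mean in mean_scores]
        for unit in unit_scores
    ]

def allocate(unit_scores, mean_scores):
    """Allocate each unit to the group (numbered from 1) with the nearest mean."""
    allocations = []
    for row in squared_distances(unit_scores, mean_scores):
        nearest = min(range(len(row)), key=row.__getitem__)
        allocations.append(nearest + 1)
    return allocations

# Three units and two group means, scored in two dimensions.
unit_scores = [[0.0, 0.0], [5.0, 5.0], [2.0, 3.0]]
mean_scores = [[0.0, 1.0], [4.0, 4.0]]
groups = allocate(unit_scores, mean_scores)
```

Reducing the number of dimensions, as with the NROOTS option, corresponds here to dropping the later columns of both sets of scores before the distances are formed.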

 

Action with RESTRICT

The input variates and factor may be restricted. The restrictions must be identical, otherwise a diagnostic will be generated by an FSSPM statement within the procedure. The canonical variate analysis is based only on the units not excluded by the restriction. Scores are calculated for all the units; however, these are based only on the non-excluded units: i.e. the adjustments for the canonical variate scores are calculated from the non-excluded units, and the loadings used to calculate the scores are those from the canonical variate analysis.

 

Reference

Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate analysis. Academic Press, London.

 

DMST procedure

Gives a high resolution plot of an ordination with minimum spanning tree

(A.W.A. Murray)

 

Options

DIMENSIONS = scalars Two numbers specifying the dimensions to display, allowed values 1...5

TITLE = text Title for the graph

WINDOW = scalar Window for the graph; default 1

KEYWINDOW = scalar Window for the key; default 2

SCREEN = string Controls screen (clear, keep); default clea

 

Parameters

COORDINATES = matrices or datamatrices Coordinates from ordination

TREE = matrices Minimum spanning tree

SIMILARITY = symmetric matrices Association matrix used to derive ordination

SYMBOLS = factors or texts Symbols to label the coordinates

PENCOORDINATES = scalars Pen to use for the coordinates

PENTREE = scalars Pen to use for the minimum spanning tree

 

Description

DMST plots a minimum spanning tree using coordinates saved, for example, from a PCO. The COORDINATES parameter specifies the coordinates for the units in the plot, using either a matrix or a pointer to a set of variates (that is, a "datamatrix"). The minimum spanning tree can be supplied using the TREE parameter, or it can be calculated from the original association matrix specified using the SIMILARITY parameter. If TREE supplies a matrix with no values, these will be set to the tree calculated from the SIMILARITY matrix. If the COORDINATES structure was originally declared with row labels the procedure will automatically use these to label the plots. Alternative symbols can be defined using the SYMBOLS parameter. You can also specify the pens to be used to plot the coordinates and tree, using parameters PENCOORDINATES and PENTREE respectively. The definition of these pens, outside the procedure, thus allows the colour, size, font and linestyle of links in the tree to be controlled. By default the coordinates are plotted with colour 1 and the tree with colour 2, symbols are 0.8 of normal size, and the tree is plotted with a dotted line.

Options TITLE, WINDOW, KEYWINDOW and SCREEN function as usual for high resolution graphics. If the WINDOW is unset a default layout with appropriately labelled axes is produced in window 1. Axes will be scaled automatically unless limits have already been set outside the procedure.

 

Options: DIMENSIONS, TITLE, WINDOW, KEYWINDOW, SCREEN.

Parameters: COORDINATES, TREE, SIMILARITY, SYMBOLS, PENCOORDINATES, PENTREE.

 

Method

A two dimensional representation of the results of a multivariate analysis, such as a PCO, is plotted on the current high resolution graphics device. A minimum spanning tree is calculated (by HDISPLAY) from an input similarity matrix if not supplied. The tree is superimposed on the plot. The procedure uses GETATTRIBUTE to access the row labels (if any) of the input structures. The input structures are converted to variates if necessary and DGRAPH is used to plot the desired data.
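
For readers unfamiliar with minimum spanning trees, the calculation from a similarity matrix can be sketched in Python. The procedure itself obtains the tree from HDISPLAY; the sketch below instead uses Prim's algorithm, chosen here only for illustration, and links the units so that the total similarity along the tree is as large as possible (equivalently, the total dissimilarity is as small as possible). The function name and the example matrix are hypothetical.

```python
def minimum_spanning_tree(similarity):
    """Return the tree as a list of (unit, unit) links, using Prim's algorithm.

    similarity is a full square symmetric matrix (list of lists);
    units are numbered from 0.
    """
    n = len(similarity)
    in_tree = [0]  # start from the first unit
    links = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                if best is None or similarity[i][j] > best[2]:
                    best = (i, j, similarity[i][j])
        links.append((best[0], best[1]))
        in_tree.append(best[1])
    return links

similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.7, 1.0],
]
tree = minimum_spanning_tree(similarity)
```

Each link joins a pair of units, so drawing a line between the plotted coordinates of each pair superimposes the tree on the ordination.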

 

Action with RESTRICT

Restrictions are not relevant for matrix input structures. Restrictions on variates (supplied in a datamatrix) should be obeyed as usual.

 

DOTPLOT procedure

Produces a dot-plot using line-printer or high-resolution graphics

(J. Ollerton & S.A. Harding)

 

Options

GRAPHICS = string Whether to use high-resolution graphics or line-printer graphics (lineprinter, highresolution); default high

TITLE = string Title for the Dot Plot; default *

WINDOW = scalar Window number for the graph; default 1

SCREEN = string Whether to clear the screen before plotting or to continue plotting on the old screen (clear, keep); default clea

ENDACTION = string Action to be taken after completing the plot (continue, pause); default * uses the current setting

DIRECTION = string Order in which to sort the data before plotting, DIRECTION=* implies plot unsorted data (ascending, descending); default asce

LINES = string How to draw guide lines on the plot, LINES=* omits the guide lines (todot, full); default todot draws lines from the x-origin to the dots

 

Parameters

YLABELS = texts Text specifying Y labels for each dotplot

X = variates Data to be plotted

PENDOTS = scalars Pen to draw the dots; default 1

PENLINES = scalars Pen to draw the lines; default 2

 

Description

DOTPLOT produces a dot-plot from two parameters, a variate of x-data and a text containing y-labels. Option GRAPHICS allows the plotting to be done using line-printer graphics instead of the default high-resolution graphics.

The display takes the form of a vertical histogram, with a single row for each value of YLABELS. The length of line for each row is specified by the corresponding value of x. It is customary to sort the data according to the x-values, into either ascending or descending order. This is controlled by the DIRECTION option, which by default is ascending; setting DIRECTION=* will plot the data unsorted.

For high-resolution plots the guide lines can also be drawn across the full width of the plot (LINES=full) or can be omitted (LINES=*). By default, pens are set up to draw the dots and lines in a form appropriate for the output device. For an interactive display, solid guide lines in pale grey are used; for other devices dashed or dotted lines are used. The plotting symbol is symbol 2 (circle), except for PostScript output which uses a solid dot (SYMBOL=-9). The parameters PENDOTS and PENLINES can be used to specify pens which have been set up with different attributes.

By default the dot-plot is produced in window 1, but this can be changed using the WINDOW option. A FRAME statement can be used before using DOTPLOT to change the size and position of the display (for example to widen the x lower margin to allow more space for the y-labels). The SCREEN option controls whether or not the screen is cleared before plotting and the ENDACTION option determines what action to take after completing the plot.

An AXES statement can be used to set axis titles and modify the upper and lower bounds of the x-axis. If axis titles are not set explicitly they will be generated from the identifier names of the YLABELS and X parameters.

For high-resolution plots, the default window size specifies a lower x-margin of size 0.12. This allows room for a title and labels of up to about 10 characters. To produce a dot-plot with longer labels, a FRAME statement should be used to specify new dimensions for the window that include a larger value for XMLOWER. A full-size window, with standard margins, has room for about 48 rows before the labels start to overlap. To produce a dot-plot with more rows the margins should be reduced or the axis pen size reduced.

 

Options: GRAPHICS, TITLE, WINDOW, SCREEN, ENDACTION, DIRECTION, LINES.

Parameters: YLABELS, X, PENDOTS, PENLINES.

 

Method

A y-variate is constructed with values 1...NVALUES(YLABELS) and plotted against the variate X. If required the variates are sorted (this action is performed on duplicates of the data so as not to alter the original variates).
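
The construction can be sketched in Python as follows. This is an illustration of the idea rather than the Genstat code: the function names are hypothetical, the x-values are assumed to be non-negative, and the text rendering is only a rough stand-in for line-printer output.

```python
def dotplot_rows(ylabels, x, direction="ascending"):
    """Return (row number, label, x-value) triples in plotting order.

    direction may be "ascending", "descending" or None (unsorted).
    Sorting is done on a copy, so the inputs are left unchanged.
    """
    pairs = list(zip(ylabels, x))
    if direction is not None:
        pairs.sort(key=lambda pair: pair[1], reverse=(direction == "descending"))
    return [(row, label, value) for row, (label, value) in enumerate(pairs, start=1)]

def text_dotplot(rows, width=30):
    """Draw one line per row: a guide line of dots ending in the plotted point."""
    largest = max(value for _, _, value in rows)
    lines = []
    for _, label, value in rows:
        length = round(width * value / largest) if largest > 0 else 0
        lines.append(f"{label:>10} |" + "." * length + "o")
    return "\n".join(lines)

labels = ["a", "b", "c"]
values = [3, 1, 2]
rows = dotplot_rows(labels, values)
picture = text_dotplot(rows)
```

The row numbers play the part of the y-variate with values 1...N, and the guide line on each row corresponds to the default LINES=todot setting.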

 

Action with RESTRICT

DOTPLOT will obey restrictions on either YLABELS or X.

 

Reference

Cleveland, W.S. (1985). The elements of graphing data. Wadsworth advanced books and software.

 

DPARALLEL procedure

Displays multivariate data using parallel coordinates

(Z. Karaman)

 

Options

TITLE = text Title for the plot

GROUPS = factor Defines grouping of the units (if any); by default, different pens are used for the observations in different groups

PERMUTATIONS = string Whether to display all necessary permutations so that any two variates will be adjacent in at least one plot, or just display once in the order given by the DATA pointer (yes, no); default no

SCALING = string Whether to do scaling overall (scale all variates on the same scale), or to scale each variate separately (overall, separate); default sepa

PEN = variate Pens to be used for different groups (if any); default * uses pens from 1 up to the number of groups (number of levels of the GROUPS factor)

 

Parameter

DATA = variates Data variables to be plotted

 

Description

The scatter plot is probably the most powerful and most frequently used statistical tool for analysing the relationship between two variables. It is a very intuitive way to look at the data since it corresponds to our perception of the world. The major drawback is that it does not generalise naturally to higher dimensions. Using interactive graphics devices like high-resolution screens one can rotate a point cloud in three dimensions (commonly called spinning), and further dimensions can be partially encoded by using different colours, symbols, or symbol sizes; however, this technique can be used only on interactive graphics devices, and it is difficult to see relationships between all the variables at a time. Another possibility is the matrix of scatter plots (provided by procedure DSCATTER), but this has the drawback that it is difficult to follow one data point across several plots.

An alternative is to display multivariate data using parallel coordinates. The dimensions are not represented by orthogonal lines as is customarily done when plotting scatter diagrams (which limits the dimensionality to two, or at most three if spinning is used). Rather, they are represented by a series of parallel lines (either horizontal or vertical), and a point in a multidimensional space is represented by a broken line connecting its coordinates in each dimension. The only limit on the number of dimensions that can be displayed simultaneously by such a plot is its readability, which is a function of the underlying graphics display (hardware). The parallel coordinates geometry was developed by Inselberg (1985) in the context of computational geometry; it was applied to statistical multidimensional analysis by Wegman (1990). Inselberg also gives some interesting duality properties between the classical Euclidean plane and parallel coordinates geometry.

The relationship between two variables can be visually assessed by inspecting a parallel coordinates plot. When the correlation between two variables is close to -1, the lines cross over and so, in the limit, we would have a pencil of lines. (A pencil of lines is a set of lines that are coincident at a single point.) On the other hand, when the correlation approaches +1, we will have fewer and fewer crossovers, so that in the limit we would have a set of parallel lines. The pairwise comparisons are easy for variables represented by adjacent axes; however, they are much more difficult for axes that are far apart on the graph. For n variables, there are n! possible permutations, but many of these duplicate adjacencies. Wegman (1990) has shown that a relatively small number of permutations of the axes (approximately n/2) is enough to ensure that every variable is adjacent to every other variable in at least one permutation. Multivariate outliers can be identified easily on this plot, since it is very intuitive to follow with one's eye the line across the axes. If the PERMUTATIONS option is set to yes, several plots will be produced so that every pair of variables is adjacent in at least one plot.
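
One way to obtain such a set of axis orderings is a zig-zag construction, sketched below in Python. It is offered as an illustration in the spirit of Wegman (1990); the code of the subsidiary procedure used by DPARALLEL is not shown in this description and may differ. The function names are hypothetical and the axes are numbered from 0.

```python
from itertools import combinations

def zigzag_order(n):
    """Base ordering: start at axis 0, then step +1, -2, +3, -4, ... modulo n."""
    order = [0]
    for k in range(1, n):
        step = k if k % 2 == 1 else -k
        order.append((order[-1] + step) % n)
    return order

def axis_permutations(n):
    """Orderings obtained by adding 0, 1, ... to the base ordering, modulo n."""
    base = zigzag_order(n)
    return [[(axis + shift) % n for axis in base] for shift in range((n + 1) // 2)]

def adjacent_pairs(permutations):
    """Set of pairs of axes that are neighbours in at least one ordering."""
    return {
        frozenset(pair)
        for permutation in permutations
        for pair in zip(permutation, permutation[1:])
    }

permutations = axis_permutations(6)   # three orderings of six axes
covered = adjacent_pairs(permutations)
```

For six variables this gives three orderings, in which all 15 pairs of variables appear as neighbours at least once.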

In our implementation we have chosen to arrange the axes vertically, since this way the readability is maximized for most output devices (either terminal screens or printers when printing in landscape mode). The variables can be independently scaled on a 0 to 1 scale, or left in original units if the values are of the same order of magnitude. In the first case it is easier to obtain a visual estimate of the correlation between two adjacent variables; on the other hand, leaving the data in original units gives us a good idea of the location and spread parameters of the marginal distributions.
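The independent 0 to 1 scaling can be sketched as a simple min-max transformation applied to each variable separately. The Python fragment below is an assumption about the arithmetic, not the procedure's own code: the function name is hypothetical, missing values are represented by None, at least one value is assumed to be present, and a constant variable is placed at 0.5 by choice.

```python
def scale_unit_interval(values):
    """Rescale one variable to the interval [0, 1]; None marks a missing value."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    span = high - low
    if span == 0:
        # A constant variable has no spread; put it mid-axis (a choice made here).
        return [None if v is None else 0.5 for v in values]
    return [None if v is None else (v - low) / span for v in values]


# Each variable is scaled on its own, so the units of different axes need not match.
height_cm = [150.0, 175.0, None, 200.0]
print(scale_unit_interval(height_cm))
```

Missing values stay missing after scaling, so the segments of the broken line on either side of them can still be left out of the plot.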

The data are specified, in a list of variates, using the DATA parameter. The GROUPS option can be used to specify a grouping factor. The lines for observations in each group are then plotted using different pens, thus giving an immediate insight into any patterns in the data. By default, pens 1 upwards are used for the different groups, but the PEN option can be used to specify other pens, in a variate with as many values as groups. If the GROUPS option is not set, the PEN option can be set to a scalar, to select the pen to be used for all the points. The TITLE option can be used to supply a title for the plots.

 

Options: TITLE, GROUPS, PERMUTATIONS, SCALING, PEN.

Parameter: DATA.

 

Method

DPARALLEL uses the standard Genstat directives for data manipulation and graphics. The underlying methodology is described by Inselberg (1985) and Wegman (1990). It calls subsidiary procedure WEGMAN to generate the permutations matrix; each column of the output matrix gives one of the permutations described by Wegman (1990).

 

Action with RESTRICT

Restrictions are not allowed. Missing values are allowed within the input variates in DATA; observations with missing data are not excluded from the plot, but the parts of their broken lines adjacent to a missing value are omitted.

 

References

Inselberg, A. (1985). The plane with parallel coordinates. The Visual Computer, 1, 69-91.

Wegman, E. (1990). Hyperdimensional data analysis using parallel coordinates. JASA, 85, 664-675.

 

DPOLYGON procedure

Draws polygons using high-resolution graphics

(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)

 

Options

TITLE = text Main title for the plot; default *

WINDOW = scalar Which graphics window to use for the plot; default 1

KEYWINDOW = scalar Which graphics window to use for the key; default 2

YTITLE = text Title for the vertical axis; default *

XTITLE = text Title for the horizontal axis; default *

YLOWER = scalar Lower limit for the vertical axis

YUPPER = scalar Upper limit for the vertical axis

XLOWER = scalar Lower limit for the horizontal axis

XUPPER = scalar Upper limit for the horizontal axis

SCREEN = string Whether to clear the screen before plotting or to continue plotting on the old screen (clear, keep); default clea

KEYDESCRIPTION = text Overall description for the key; default *

ENDACTION = string Action to be taken after completing the plot (continue, pause); default paus

 

Parameters

YPOLYGON = variates Vertical coordinates of one or more polygons; no default - this parameter must be set

XPOLYGON = variates Horizontal coordinates of one or more polygons; no default - this parameter must be set

PEN = scalars or variates or factors Pen number for each graph

DESCRIPTION = texts Annotation for the key

 

Description

DPOLYGON draws polygons onto the current graphics device. Parameters XPOLYGON and YPOLYGON specify variates containing the horizontal and vertical coordinates of the polygons. DPOLYGON uses procedure DPTMAP to produce the plot. This uses the AXES and FRAME directives to set up axes with equal scales. Options YLOWER, YUPPER, XLOWER and XUPPER can be used to specify bounds for the axes, or these can be set automatically. The axes are made to extend slightly beyond the range of values to be plotted, and are drawn using the box style. Titles for the horizontal and vertical axes can be specified using the XTITLE and YTITLE options, respectively. Options TITLE, WINDOW, KEYWINDOW, SCREEN, KEYDESCRIPTION and ENDACTION are as in DGRAPH.

By default, DPOLYGON uses a different pen for each polygon. The sequence of pens is the same as the default sequence of pens used by DGRAPH but the pens are set to use METHOD=line, SYMBOLS=0 and JOIN=given, so that each polygon is drawn as a sequence of connected line segments. Other pen styles can be specified using the PEN parameter, except that the procedure will override settings of METHOD, SYMBOLS and JOIN, replacing them by METHOD=line, SYMBOLS=0 and JOIN=given. The original settings will be restored on exiting the procedure. To draw polygons in a different style, for example, using lines and points, you can use DPTMAP directly, with an appropriate PEN setting, rather than DPOLYGON.

 

Options: TITLE, WINDOW, KEYWINDOW, YTITLE, XTITLE, YLOWER, YUPPER, XLOWER, XUPPER, SCREEN, KEYDESCRIPTION, ENDACTION.

Parameters: YPOLYGON, XPOLYGON, PEN, DESCRIPTION.

 

Method

A procedure PTCHECKXY is called to check that each pair of structures in XPOLYGON and YPOLYGON have identical restrictions. If the PEN parameter is unset then pens with METHOD=line and SYMBOLS=0 will be specified using the PEN directive. PTCLOSEPOLYGON is used to close the polygons and DPTMAP to draw them.

 

Action with RESTRICT

If any of the variates in XPOLYGON and YPOLYGON are restricted, only the subset of values specified by the restriction will be included in the graph.

 

DPTMAP procedure

Draws maps for spatial point patterns using high-resolution graphics

(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)

 

Options

TITLE = text Main title for the plot; default *

WINDOW = scalar Which graphics window to use for the plot; default 1

KEYWINDOW = scalar Which graphics window to use for the key; default 2

YTITLE = text Title for the vertical axis; default *

XTITLE = text Title for the horizontal axis; default *

YLOWER = scalar Lower limit for the vertical axis

YUPPER = scalar Upper limit for the vertical axis

XLOWER = scalar Lower limit for the horizontal axis

XUPPER = scalar Upper limit for the horizontal axis

SCREEN = string Whether to clear the screen before plotting or to continue plotting on the old screen (clear, keep); default clea

KEYDESCRIPTION = text Overall description for the key; default *

ENDACTION = string Action to be taken after completing the plot (continue, pause); default paus

 

Parameters

Y = variates Vertical coordinates of one or more spatial point patterns; no default - this parameter must be set

X = variates Horizontal coordinates of one or more spatial point patterns; no default - this parameter must be set

PEN = scalars or variates or factors Pen number for each graph

DESCRIPTION = texts Annotation for the key

 

Description

DPTMAP is a specially adapted version of DGRAPH designed for producing maps of spatial point patterns. The procedure uses the AXES and FRAME directives to set up axes with equal scales. Options YLOWER, YUPPER, XLOWER and XUPPER can be used to specify bounds for the axes, or these can be set automatically. The axes are made to extend slightly beyond the range of values to be plotted, and are drawn using the box style. The parameters X and Y specify pointers to variates containing the horizontal and vertical coordinates of one or more spatial point patterns. Titles for the horizontal and vertical axes can be specified using the XTITLE and YTITLE options, respectively. Options TITLE, WINDOW, KEYWINDOW, SCREEN, KEYDESCRIPTION and ENDACTION are as in DGRAPH.

 

Options: TITLE, WINDOW, KEYWINDOW, YTITLE, XTITLE, YLOWER, YUPPER, XLOWER, XUPPER, SCREEN, KEYDESCRIPTION, ENDACTION.

Parameters: Y, X, PEN, DESCRIPTION.

 

Method

A procedure PTCHECKXY is called to check that each pair of structures in X and Y have identical restrictions. If any of XLOWER, XUPPER, YLOWER and YUPPER are unset, the procedure PTBOX is used to assign suitable values based on the data in X and Y. The values of these options are then adjusted to extend the range of the axes and so produce a more attractive plot. The adjusted values are given by

XLOWER - 0.05 * range(X),

XUPPER + 0.05 * range(X),

YLOWER - 0.05 * range(Y),

YUPPER + 0.05 * range(Y),

where range(X) is the range of values in X and range(Y) is the range of values in Y. The AXES directive is then used to set up box-style axes with the required upper and lower limits and titles specified by XTITLE and YTITLE. The FRAME directive is used to ensure equal scales on the horizontal and vertical axes. Finally, the DGRAPH directive is used to draw the map on the current graphics device.
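The adjustment of the axis limits can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the procedure's code: the function name is hypothetical, and the "suitable values" assigned by PTBOX for unset limits are assumed here to be the minimum and maximum of the data.

```python
def padded_limits(values, lower=None, upper=None, fraction=0.05):
    """Extend axis limits by a fraction of the data range on each side.

    Unset limits (None) default to the data minimum and maximum, which is
    an assumption about what PTBOX supplies.
    """
    data_range = max(values) - min(values)
    if lower is None:
        lower = min(values)
    if upper is None:
        upper = max(values)
    return lower - fraction * data_range, upper + fraction * data_range


x = [0.0, 2.5, 10.0]
y = [1.0, 3.0, 5.0]
print(padded_limits(x))             # limits taken from the data, then padded
print(padded_limits(y, lower=0.0))  # lower limit supplied, then padded
```

The same function would be applied to the horizontal and vertical coordinates in turn; ensuring equal scales on the two axes is a separate step, done in the procedure with the FRAME directive.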

 

Action with RESTRICT

If any of the variates in X and Y are restricted, only the subset of values specified by the restriction will be included in the graph.

 

DPTREAD procedure

Adds points interactively to a spatial point pattern

(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)

 

Options

PRINT = string What to print (summary, monitoring); default summ, moni

WINDOW = scalar Which graphics window to use for the plot; default 1

 

Parameters

OLDY = variates Vertical coordinates of each spatial point pattern; no default - this parameter must be set

OLDX = variates Horizontal coordinates of each spatial point pattern; no default - this parameter must be set

NEWY = variates Variates to receive the vertical coordinates of the original points and added points

NEWX = variates Variates to receive the horizontal coordinates of the original points and added points

 

Description

DPTREAD uses the DREAD directive to add points to a spatial point pattern. The coordinates of the existing points must be supplied using the parameters OLDX and OLDY. These points will be plotted on the current graphics device using DPTMAP with a pen setting of SYMBOLS=1. The WINDOW option may be used to specify the graphics window to use for the plot.

DREAD is not always available, and its operation may vary slightly from one system to another. The Users' Note supplied with Genstat explains how to read points and terminate input on specific devices. The usual method for reading points is to click the left mouse button at the required position. The usual way to terminate input is to click the right mouse button. The points read using DREAD will be echoed using a pen setting of SYMBOLS=2. The coordinates of the new spatial point pattern containing the original points and any points which have been added may be saved using the parameters NEWX and NEWY.

Printed output is controlled using the PRINT option. The settings available are monitoring (which prints the coordinates of the points to be added) and summary (which prints the coordinates of the new pattern, consisting of the original points and any that have been added, under the headings NEWX and NEWY). The default is to print both monitoring and summary.

 

Options: PRINT, WINDOW.

Parameters: OLDY, OLDX, NEWY, NEWX.

 

Method

A procedure PTCHECKXY is called to check that OLDX and OLDY have identical restrictions. DPTMAP is used to draw a map of the original point pattern. The DREAD directive is then used to read the coordinates of points to be added. Finally, the coordinates for the original points and added points are combined in new variates using the EQUATE directive.

 

Action with RESTRICT

If OLDX and OLDY are restricted, only the subset of values specified by the restriction will be included in the calculations.

 

DREPMEASURES procedure

Plots profiles and differences of profiles for repeated measures data

(J.T.N.M. Thissen)

 

Options

TITLE = string Title for the plots; default *

GROUPS = factors List of one or two factors; one factor gives one plot while a list with two factors gives as many plots as the number of levels of the first factor in the list; must be set

TIMEPOINTS = variate Variate of timepoints; by default the suffixes of the DATA pointer are used

DIFFERENCES = string Whether to plot the differences from the first level as well as the profiles (no, yes); default no

 

Parameters

DATA = pointers Each pointer contains the data variates (observed at successive times)

GROUPMEANS = tables To save the calculated treatment means at each timepoint

 

Description

A repeated measures experiment is one in which the same set of units, or subjects, is observed at a sequence of times to investigate treatment effects over a period of time.

DREPMEASURES produces high-resolution graphs of the progress in time for the data variates specified in a pointer by the DATA parameter. Each variate contains the measurements made on the set of units at one of the occasions on which they were observed. The timepoints along the x-axes of the graph are the suffixes of the pointer unless the option TIMEPOINTS is specified. The grouping of the subjects should be specified by one or two factors, and input using the GROUPS option. If one factor is specified, the means of the observations at each level of the factor are plotted in one graph. If two factors are specified several graphs are produced: each graph is a plot of the means of the observations at the various levels of the second factor for a particular level of the first.

The means are calculated with the directive TABULATE. If the data variates contain missing values a warning is printed indicating the possibility of misleading results. (Before using DREPMEASURES missing values can be estimated using procedure MULTMISS.)

If DIFFERENCES=yes, two plots are produced side by side: one of the profiles and one of the differences from the first level. The default setting, no, gives the plot of the profiles only. Plots of differences can be produced only if the factor has more than one level. The TITLE option can be used to provide a title for the plots.
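The quantities that are plotted can be sketched in Python for the case of a single grouping factor. The procedure itself forms the means with TABULATE; the function names, the list-based data layout and the decision to omit the first level from the table of differences are assumptions made for this illustration.

```python
def group_means(data, groups, levels):
    """Mean profile for each level of a grouping factor.

    data   : one list per timepoint, each holding one value per subject
    groups : the factor level of each subject
    levels : the factor levels in their defined order
    """
    profiles = {}
    for level in levels:
        members = [i for i, g in enumerate(groups) if g == level]
        profiles[level] = [
            sum(variate[i] for i in members) / len(members) for variate in data
        ]
    return profiles


def differences_from_first(profiles, levels):
    """Difference of each later level's profile from the first level's profile."""
    reference = profiles[levels[0]]
    return {
        level: [m - r for m, r in zip(profiles[level], reference)]
        for level in levels[1:]
    }


# Four subjects observed at two timepoints, in two treatment groups.
data = [[1.0, 3.0, 5.0, 7.0], [2.0, 4.0, 8.0, 10.0]]
groups = ["control", "control", "treated", "treated"]
levels = ["control", "treated"]
profiles = group_means(data, groups, levels)
print(profiles)
print(differences_from_first(profiles, levels))
```

The sketch assumes there are no missing values and that every level has at least one subject.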

The calculated means can be saved by specifying parameter GROUPMEANS.

 

Options: TITLE, GROUPS, TIMEPOINTS, DIFFERENCES. Parameters: DATA, GROUPMEANS.

 

Method

Means are calculated with the directive TABULATE. If restricted variates are specified in DATA, procedure SUBSET is used to remove any levels of the factors that are not present in the subset of subjects.

 

Action with RESTRICT

If any of the variates in the DATA pointer is restricted, only the units not excluded by the restriction will be used for the graphs. If any other DATA variate or GROUPS factor is restricted, it must be restricted to the same set of units. The variate specified by TIMEPOINTS must not be restricted.

 

DRPOLYGON procedure

Reads a polygon interactively from the current graphics device

(M.A. Mugglestone, S.A. Harding, B.Y.Y. Lee, P.J. Diggle & B.S. Rowlingson)

 

Options

PRINT = string What to print (summary); default summ

WINDOW = scalar Window from which to read; default 1

 

Parameters

YPOLYGON = variates Variates to receive the vertical coordinates of the polygons that are read

XPOLYGON = variates Variates to receive the horizontal coordinates of the polygons that are read

PEN = scalars Pen numbers to use to echo points

 

Description

DRPOLYGON uses the DREAD directive to read the coordinates of a sequence of points which define a polygon. The WINDOW option may be used to specify the window from which to read. The DREAD directive will only work within a window that contains a graph or a contour plot. A call to DRPOLYGON should, therefore, be preceded by a call to DPTMAP, DPOLYGON, DGRAPH or DCONTOUR.

DREAD is not always available, and its operation may vary slightly from one system to another. The Users' Note supplied with Genstat explains how to read points and terminate input on specific devices. The usual method for reading points is to click the left mouse button at the required position. The usual way to terminate input is to click the right mouse button. The last point of any polygon is implicitly connected to the first point. There is no need to re-enter the first point to draw a closed polygon - this will be done automatically after input has been terminated. The horizontal and vertical coordinates of the polygon may be saved using the parameters XPOLYGON and YPOLYGON, respectively.

The PEN parameter may be used to specify which pen to use to echo points which have been read. The default setting of PEN uses METHOD=line, LINESTYLE=1, SYMBOLS=1 and JOIN=given.

Printed output is controlled by the PRINT option. The default setting of summary prints the horizontal and vertical coordinates of the polygon under the headings XPOLYGON and YPOLYGON.

 

Options: PRINT, WINDOW.

Parameters: YPOLYGON, XPOLYGON, PEN.

 

Method

If the PEN parameter is unset then a pen with METHOD=line, LINESTYLE=1, SYMBOLS=1 and JOIN=given will be specified using the PEN directive. The DREAD directive is used to read in the coordinates of an open polygon, and then the DGRAPH directive is used to draw a line joining the last point of the polygon to the first point.
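The closing step amounts to joining the last point read back to the first. A minimal Python sketch is given below; the function names are hypothetical, and whether the repeated first point is also stored in the saved coordinate variates is not stated here, so the closed path is shown only to illustrate what is drawn.

```python
def closing_segment(xs, ys):
    """Return the end points of the line joining the last point to the first."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 3:
        raise ValueError("a polygon needs at least three points")
    return (xs[-1], ys[-1]), (xs[0], ys[0])


def closed_path(xs, ys):
    """Return the coordinates with the first point repeated at the end."""
    closing_segment(xs, ys)  # reuse the validation above
    return list(xs) + [xs[0]], list(ys) + [ys[0]]


# Three points read from the screen: an open path with two sides.
xs = [0.0, 1.0, 1.0]
ys = [0.0, 0.0, 1.0]
print(closing_segment(xs, ys))
print(closed_path(xs, ys))
```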

 

DSCATTER procedure

Produces a scatter-plot matrix using high-resolution graphics

(J. Ollerton)

 

Option

PEN = scalar or variate or factor Pen number for the graph; default 1

 

Parameter

DATA = variates A list of variates to be plotted

 

Description

Procedure DSCATTER produces a scatter-plot matrix, from a set of variates, using high-resolution graphics.

The parameter DATA lists the variates to be plotted; each variate is plotted against all other variates, producing plots which are arranged as the lower triangle of a matrix with shared scales. Titles for the axes are the identifiers of the variates.

The number of variates which can be plotted by this procedure is in effect unlimited, but of course the greater the number of variates, the smaller the individual plots are.

The pen which is used to plot the data can be specified with the option PEN.

 

Option: PEN. Parameter: DATA.

 

Method

Each variate is plotted against every other variate, producing n(n-1)/2 graphs. A full scatter-plot matrix would produce n(n-1) separate graphs in the shape of an n × n matrix, with the lower and upper triangles of the matrix each containing the same set of n(n-1)/2 plots. This procedure forms just the lower triangle of the scatter-plot matrix.
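The layout of the lower triangle can be enumerated with a short Python sketch. The function name is hypothetical, and the convention that the row variate supplies the vertical axis and the column variate the horizontal axis is an assumption made for illustration.

```python
def lower_triangle_panels(names):
    """List (row variate, column variate) for each panel below the diagonal."""
    return [
        (names[row], names[col])
        for row in range(1, len(names))
        for col in range(row)
    ]


variates = ["length", "width", "depth", "mass"]
panels = lower_triangle_panels(variates)
for vertical, horizontal in panels:
    print(vertical, "against", horizontal)

# The number of panels agrees with n(n-1)/2.
n = len(variates)
print(len(panels) == n * (n - 1) // 2)
```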

 

Action with RESTRICT

If any variate in the set of pointers is restricted then only the units not excluded by the restriction (and the corresponding units of other variates) will be plotted.

 

Reference

Cleveland, W.S. (1985). The elements of graphing data. Wadsworth advanced books and software.

 

DSHADE procedure

Produces a pictorial representation of a data matrix

(S.A. Harding)

 

Options

WINDOW = scalar Window number for the graph; default 1

KEYWINDOW = scalar Window number for the key (0 for no key); default 2

SCREEN = string Whether to clear the screen before plotting or to continue plotting on the old screen (clear, keep); default clea

GRID = string How to draw a grid around the elements of the matrix (present, complete); default pres

 

Parameters

DATA = symmetric matrices, matrices, or pointers to variates Matrices to be plotted

NGROUPS = scalars Number of groups to form from the levels of similarity (i.e. number of different shades)

PERMUTATION = variates Can define permutations to be done to the units of symmetric matrices prior to plotting

 

Description

DSHADE produces a shaded representation of a rectangular or symmetric matrix using high-resolution graphics. Each element of the data matrix is represented by a shaded rectangle indicating the value at that location using either colour or shading density. This type of display is often used in cluster analysis for displaying a similarity matrix, but is also useful for the graphical display of spatial data.

The data for the procedure consists of a matrix, a symmetric matrix (e.g. of similarities), or a pointer to a set of variates, specified by the DATA parameter. The NGROUPS parameter defines the number of levels that are to be used when grouping the data for the display; this must be in the range 1 to 32. When producing a shaded plot of a similarity matrix a permutation of the units can be specified, using the PERMUTATION parameter; a suitable variate could be obtained, for example, from the HCLUSTER directive. PERMUTATION is ignored if DATA is not a symmetric matrix.

The individual elements are shaded using pens 1...NGROUPS, with pen 1 being used for the lowest values and pen NGROUPS for the highest. Missing values are ignored, thus leaving blank areas in the plot. The current COLOUR and BRUSH settings of these pens define how each level is represented. If the default settings do not produce a suitable display, these attributes should be set by a PEN statement before using DSHADE. Colour displays can use a solid brush pattern for all pens, with different colours representing the different levels of similarity. To obtain a gray-scale for the shading, for example into ten groups, you could use the following statements:

PEN 1...10; COLOUR=2...11

COLOUR 2...11; 0.1,0.2...1.0; 0.1,0.2...1.0; 0.1,0.2...1.0

For a monochrome display the BRUSH styles should be set to values that produce increased density of shading as the pen number increases. Figure 6.5.5e on page 313 of the Genstat 5 Release 3 Reference Manual may be used to help select appropriate values.

The GRID option specifies whether an outline should be drawn around each element of the matrix. The default, GRID=present, produces an outline for all values that are present; i.e. it ignores missing values. This is suitable where data have been sampled over an irregularly shaped area. Alternatively, for GRID=complete, an outline is drawn around every element. Setting GRID=* stops the grid being drawn, which may be preferable if there are a large number of elements in the input data.

By default the plot is produced in window 1 with a key in window 2, but these settings can be changed using the WINDOW and KEYWINDOW options. The size and position of these windows can be specified in a FRAME statement before using DSHADE. The SCREEN option controls whether or not the screen is cleared before plotting.

 

Options: WINDOW, KEYWINDOW, SCREEN, GRID.

Parameters: DATA, NGROUPS, PERMUTATION.

 

Method

The values of the data matrix are sorted into NGROUPS groups of equal range. A box is defined as the plotting symbol for pens 1...NGROUPS, with scaling dependent on the number of rows and columns to be plotted. Each point of the matrix is then plotted using the appropriate pen. DPIE is used to produce a key, if required.
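Sorting values into groups of equal range can be sketched as follows in Python. This illustrates the idea rather than the procedure's code: the function name is hypothetical, missing values are represented by None, and the treatment of values that fall exactly on a group boundary (assigned to the higher group, except the maximum) is an assumption.

```python
def equal_range_groups(values, ngroups):
    """Assign each value a pen number 1..ngroups using groups of equal range.

    Missing values (None) receive no pen, so they leave blank areas.
    """
    if not 1 <= ngroups <= 32:
        raise ValueError("ngroups must be in the range 1 to 32")
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    width = (high - low) / ngroups
    pens = []
    for v in values:
        if v is None:
            pens.append(None)
        elif width == 0:
            pens.append(1)  # all values equal: a single group
        else:
            # The maximum would otherwise fall into group ngroups + 1.
            pens.append(min(int((v - low) / width) + 1, ngroups))
    return pens


similarities = [0.0, 0.24, 0.25, 0.5, None, 1.0]
print(equal_range_groups(similarities, 4))
```

Pen 1 then represents the lowest values and pen NGROUPS the highest, as described above.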

 

Action with RESTRICT

If the data is input as a pointer to a set of variates any restrictions must be consistent and will be applied to all the variates.

 

EXTRABINOMIAL procedure

Fits the models of Williams (1982) to overdispersed proportions

(M.S. Ridout & P.W. Goedhart)

 

Options

PRINT = strings What to print if iterative estimation process converges successfully and whether to monitor the iterations (model, summary, accumulated, estimates, correlations, fittedvalues, monitoring); default *

CONSTANT = string How to treat constant (estimate, omit); default esti

FACTORIAL = scalar Limit for expansion of model terms; default 3

NOMESSAGE = strings Which warning messages to suppress (dispersion, leverage, residual, aliasing, marginality); default *

METHOD = string Which model to fit to take account of the extra variation (II, III); default II

MODIFYMODEL = string Whether to leave the modified MODEL settings (WEIGHTS and DISPERSION) or whether to restore the original situation (yes, no); default no

WEIGHTS = variate To save estimated weights

PHI = scalar To save estimated overdispersion parameter

MAXCYCLE = scalar Maximum number of iterations; default 10

TOLERANCE = scalar Convergence criterion; default 0.01

 

Parameter

TERMS = formula Model terms to be fitted; if unset it is assumed that the model consists only of a constant term

 

Description

In binomial regression models, residual variability is often larger than would be expected if the data were indeed binomially distributed. This may be due to a few outliers or a poor choice of link function but often it simply indicates that the data are from a distribution more variable than the binomial. Such data are said to be "overdispersed" or to exhibit "extra-binomial variation".

Williams (1982) discusses two possible models to extend the usual binomial model (Model I). Model II assumes that the true variance exceeds the binomial variance by a factor

V = 1 + (NBINOMIAL-1) × q    (0 ≤ q ≤ 1)

If the overdispersion parameter q (which can be saved using the PHI option) were known, the data could be analysed using a binomial model with prior weights 1/V. Procedure EXTRABINOMIAL estimates q so that the residual chi-squared statistic from this weighted analysis is (approximately) equal to the residual degrees of freedom (Moore, 1987). If the binomial totals are all equal, Model II is equivalent to setting the DISPERSION option of MODEL equal to the residual chi-squared statistic divided by its degrees of freedom.

Alternatively, Model III assumes that the linear predictor varies about its expectation with a constant variance. Usually this variation is assumed to follow a normal distribution; if there is then a logit link, the error distribution will be a logistic normal. Extensions to Model III to have several normal distributions contributing to the variation on the linear predictor, similar to those that occur in stratified analysis of variance, form the basis of many methods suggested for analysing generalized linear mixed models. For Model III, there is generally no simple expression for the exact variance. But the delta method can be used to show that, approximately, the variance exceeds the binomial variance by a factor

V = 1 + (NBINOMIAL-1) × q × F² / (P × (1 - P))

where q is the variance on the scale of the linear predictor, P is the fitted probability and F is the derivative of the inverse of the link function, evaluated at the fitted value of the linear predictor.
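The two variance inflation factors can be compared numerically with a small Python sketch. It is an illustration under stated assumptions, not part of the procedure: the function names are hypothetical, the Model III factor is written out for a logit link only (for which the derivative of the inverse link is P(1 - P)), and the factor is taken to be one plus (NBINOMIAL - 1) times q times F squared, divided by P(1 - P).

```python
import math


def inflation_model_ii(n_binomial, q):
    """Model II: V = 1 + (n - 1) q."""
    return 1 + (n_binomial - 1) * q


def inflation_model_iii_logit(n_binomial, q, linear_predictor):
    """Model III with a logit link: V = 1 + (n - 1) q F^2 / (P (1 - P))."""
    p = 1 / (1 + math.exp(-linear_predictor))  # inverse logit
    f = p * (1 - p)                            # derivative of the inverse logit
    return 1 + (n_binomial - 1) * q * f ** 2 / (p * (1 - p))


# Prior weights for the weighted binomial fit are the reciprocals 1/V.
v2 = inflation_model_ii(11, 0.1)
v3 = inflation_model_iii_logit(11, 0.4, 0.0)
print(v2, 1 / v2)
print(v3, 1 / v3)
```

Under Model II the factor depends only on the binomial total, whereas under Model III it also varies with the fitted probability, so the weights change from unit to unit even when the totals are equal.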

Before using EXTRABINOMIAL a MODEL statement must be given, in the usual way, to define the y-variate, the binomial totals, the link and any offset. The error distribution must also of course be set to binomial but any settings of WEIGHTS or DISPERSION are ignored.

The form of EXTRABINOMIAL is similar in many ways to the FIT directive. There is a single parameter TERMS to define the model terms to be fitted, and the first four options, PRINT, CONSTANT, FACTORIAL, and NOMESSAGE, all have the same syntax and purpose as in FIT. The remaining options are specific to EXTRABINOMIAL.

The METHOD option selects which model to use (II or III); by default METHOD=II. Both models involve the estimation of the weight variate (1/V) required to fit the model using the standard Genstat facilities for generalized linear models. If option MODIFYMODEL=yes, EXTRABINOMIAL will leave the MODEL statement in its modified form (provided the iterative estimation of q converges), with the WEIGHTS option set to these weights and the DISPERSION option set to 1, so that directives like DROP can be used to study the effects of individual terms in the model in the usual way. The TERMS directive will also be left set to the model specified by the TERMS parameter of EXTRABINOMIAL, and this model will be the one most recently fitted, so further output can be obtained using RDISPLAY.

Options WEIGHTS and PHI allow the weights and the estimated value of q, respectively, to be saved. The MAXCYCLE option specifies the maximum number of iterations in the estimation, and the TOLERANCE option defines the convergence criterion:

ABS(Chi-squared - Residual d.f.) < TOLERANCE × Residual d.f.

 

Options: PRINT, CONSTANT, FACTORIAL, NOMESSAGE, METHOD, MODIFYMODEL, WEIGHTS, PHI, MAXCYCLE, TOLERANCE.

Parameter: TERMS.

 

Method

If the binomial totals are all equal, q is determined (non-iteratively) from the residual chi-squared statistic.
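For equal binomial totals the estimate follows from the Model II factor: if the unweighted residual chi-squared statistic divided by V is to equal the residual degrees of freedom, then V is the chi-squared statistic divided by its degrees of freedom, and q is (V - 1) / (NBINOMIAL - 1). The Python sketch below works through this arithmetic together with the convergence criterion; the function names are hypothetical, the common binomial total is assumed to exceed 1, and the estimate is not truncated to the range 0 to 1.

```python
def estimate_q_equal_totals(chi_squared, residual_df, n_binomial):
    """Solve chi_squared / V = residual_df with V = 1 + (n - 1) q."""
    dispersion = chi_squared / residual_df
    return (dispersion - 1) / (n_binomial - 1)


def has_converged(weighted_chi_squared, residual_df, tolerance=0.01):
    """Criterion: ABS(chi-squared - d.f.) < TOLERANCE x d.f."""
    return abs(weighted_chi_squared - residual_df) < tolerance * residual_df


# Unweighted fit: chi-squared of 60 on 20 d.f., all binomial totals equal to 11.
q = estimate_q_equal_totals(60.0, 20, 11)
v = 1 + (11 - 1) * q
weighted_chi_squared = 60.0 / v  # every unit has the same weight 1/V
print(q, v, weighted_chi_squared)
print(has_converged(weighted_chi_squared, 20))
```

With unequal totals the weights differ between units and affect the fit itself, which is why the estimate then has to be found iteratively.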

Otherwise, q must be found iteratively and the method used (Williams, 1982) involves nested iterations. Each outer iteration (involving a model fit) requires an inner iteration (which uses only CALCULATE statements) to get the updated estimate of q. The option MAXCYCLE controls the maximum number of outer iterations. The maximum number of inner iterations is fixed at 10.

Very precise convergence is not important in practice; the default setting of the TOLERANCE option (1%) should give a perfectly adequate estimate of q, usually within 3 iterations.

 

Action with RESTRICT

Any of the following structures may be restricted: the Y variate; the NBINOMIAL variate; the WEIGHTS variate; the OFFSET variate; any variate or factor appearing in the model formula. Restrictions on different structures must be compatible. Restricted units are excluded from the analysis.

 

References

Moore, D.F. (1987). Modelling the extraneous variance in the presence of extra-binomial variation. Appl. Statist., 36, 8-14.

Williams, D.A. (1982). Extra-binomial variation in logistic linear models. Appl. Statist., 31, 144-148.