HCLUSTER directive 

Performs hierarchical cluster analysis.

 

Options

PRINT = strings Printed output required (dendrogram, amalgamations); default * i.e. no printing

METHOD = string Criterion for forming clusters (singlelink, nearestneighbour, completelink, furthestneighbour, averagelink, mediansort, groupaverage); default sing

CTHRESHOLD = scalar Clustering threshold at which to print formation of clusters; default * i.e. determined automatically

 

Parameters

SIMILARITY = symmetric matrices Input similarity matrix for each cluster analysis

GTHRESHOLD = scalars Grouping threshold where groups are formed from the dendrogram

GROUPS = factors Stores the groups formed

PERMUTATION = variates Permutation order of the units on the dendrogram

AMALGAMATIONS = matrices To store linked list of amalgamations

 

Description

The aim of cluster analysis is to arrange the n sampling units into more or less homogeneous groups. HCLUSTER offers several possibilities. The general strategy is best appreciated in geometrical terms, with the n sampling units represented by points in a multidimensional space. In agglomerative methods, these points initially represent n separate clusters, each containing one member. At each of n-1 stages, two clusters are fused into one bigger cluster, until at the final stage all units are fused into a single cluster: this process can be represented by a hierarchical tree whose nodes indicate what fusions have occurred. The methods fuse the two closest clusters and vary in how closest is defined. In single-linkage cluster analysis, closest is defined as the smallest distance between any two samples from different clusters; in centroid clustering it is the smallest distance between cluster centroids; and so on. Genstat can display the tree fitted to a given similarity matrix, and provides a scale to show the level of similarity at which the fusions have occurred; such a scaled tree is termed a dendrogram.

The input for HCLUSTER is provided by the SIMILARITY parameter, as a list of symmetric matrices, one for each analysis. These matrices can be formed by FSIMILARITY, by REDUCE or by CALCULATE. Missing values are allowed in the similarity matrix only with the single-linkage method.

A hierarchical tree does not by itself provide a classification. This can be derived by cutting the dendrogram at some arbitrary level of similarity, specified as a percentage similarity using the GTHRESHOLD parameter. Each cluster then consists of those samples occurring on the same detached branch of the dendrogram, and the resulting cluster membership can be saved in a factor whose identifier is specified by the GROUPS parameter. The factor will be declared implicitly, if necessary, and it will have its number of levels set to the number of clusters formed and its number of values taken from the number of rows of the corresponding symmetric matrix. GTHRESHOLD and GROUPS must be either both present or both absent.
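
As a rough illustration of cutting the tree, the following Python sketch (not Genstat code) takes a list of amalgamations, each a triple of two group representatives and a percentage similarity, and keeps only the merges at or above the threshold. The name cut_tree, the triple layout, and the choice of "at or above" rather than "strictly above" are assumptions made for the example.

```python
def cut_tree(n_units, amalgamations, gthreshold):
    """Return a group number (1, 2, ...) for each unit after cutting at gthreshold.

    amalgamations: iterable of (unit_a, unit_b, similarity), units numbered from 1.
    Assumption: merges with similarity >= gthreshold are kept.
    """
    parent = list(range(n_units + 1))  # index 0 is unused

    def find(u):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b, sim in amalgamations:
        if sim >= gthreshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # The smallest unit number represents the merged group.
                parent[max(ra, rb)] = min(ra, rb)

    labels = {}
    groups = []
    for u in range(1, n_units + 1):
        root = find(u)
        labels.setdefault(root, len(labels) + 1)
        groups.append(labels[root])
    return groups


merges = [(1, 2, 95.0), (3, 4, 90.0), (1, 3, 60.0)]
print(cut_tree(4, merges, 80))  # [1, 1, 2, 2]
```

Lowering the threshold to 50 in this example keeps the final merge as well, so all four units fall into a single group.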

The endpoints of the dendrogram correspond to the units in some permuted order. The PERMUTATION parameter allows you to specify a variate to save this order, for example to use in the FSIMILARITY directive. Genstat will define it to be a variate automatically, if necessary, with its number of values taken from the number of rows of the corresponding similarity matrix. Conventionally, the first unit on the dendrogram is unit 1 and so the first value of the variate of permutations will be 1.

The AMALGAMATIONS parameter can specify a matrix to store information about the order in which the units form groups, and at what level of similarity. At any stage in the process of agglomeration, each group is represented by the unit with the smallest unit number: for example, a group containing units 2, 5, 17, and 22 is represented by unit 2. This means that the final merge is always between a group indexed by unit 1 and a group indexed by another unit. Since there are n-1 stages of agglomeration, the matrix will have a number of rows one less than the number of rows of the input similarity matrix. Each row represents a joining of two groups and consists of three values. The first two values are the numbers indexing the two groups that are joining, and the third value is the level of similarity. So the matrix has three columns. The matrix will be declared implicitly, if necessary.
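
The shape of such a matrix can be sketched in Python (not Genstat code) for the single-linkage case. The function name single_link_amalgamations is hypothetical, and placing the smaller representative in the first column is an assumption; the text above does not state the order of the first two values.

```python
def single_link_amalgamations(sim):
    """sim: square symmetric list of lists of percentage similarities.

    Returns n-1 rows [group_a, group_b, similarity], with units numbered from 1
    and each group represented by its smallest unit number.
    """
    n = len(sim)
    clusters = {u: [u] for u in range(1, n + 1)}  # keyed by smallest member
    rows = []
    while len(clusters) > 1:
        best = None
        reps = sorted(clusters)
        for i, a in enumerate(reps):
            for b in reps[i + 1:]:
                # Single linkage: largest similarity between any two members.
                s = max(sim[p - 1][q - 1] for p in clusters[a] for q in clusters[b])
                if best is None or s > best[2]:
                    best = (a, b, s)
        a, b, s = best
        clusters[a].extend(clusters.pop(b))  # a < b, so a remains the representative
        rows.append([a, b, s])
    return rows


sim = [[100, 90, 40],
       [90, 100, 60],
       [40, 60, 100]]
print(single_link_amalgamations(sim))  # [[1, 2, 90], [1, 3, 60]]
```

With three units there are two rows, and the last row involves the group indexed by unit 1, as described above.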

HCLUSTER can print two pieces of information. The first gives details of each amalgamation, followed by a list of clusters that are formed at decreasing levels of similarity. The second is the dendrogram. The PRINT option allows you to control which of these are printed. If METHOD=singlelink and the PRINT setting includes amalgamations, the minimum spanning tree will be printed instead of the stages at which the clusters merge. This is because information from forming the minimum spanning tree is used to form the single linkage clustering.

Alternatively, if you save the AMALGAMATIONS matrix, you can use procedure DDENDROGRAM to display the dendrogram using high-resolution graphics.

The METHOD option has seven possible settings; these determine how the similarities amongst clusters are redefined after each merge. The default singlelink, which has synonym nearestneighbour, gives single linkage. The setting completelink (synonym furthestneighbour) defines the distance between two clusters as the maximum distance between any two units in those clusters. The setting averagelink defines the similarity between a cluster and two merged clusters as the average of the similarities of the cluster with each of the two. For groupaverage, an average is taken over all the units in the two merged clusters. Median sorting is best thought of in terms of clusters being represented by points in a multidimensional space; when two clusters join, the new cluster is represented by the midpoint of the original cluster points.
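
The update rules can be summarised in a short Python sketch (not Genstat code). It gives the similarity between a cluster k and the cluster formed by merging i and j. The function name and arguments are hypothetical; weighting groupaverage by the cluster sizes is one reading of "an average taken over all the units"; and mediansort is omitted because the text describes it only geometrically.

```python
def updated_similarity(method, s_ik, s_jk, n_i=1, n_j=1):
    """Similarity between cluster k and the merger of clusters i and j.

    s_ik, s_jk: similarities of k with i and with j before the merge.
    n_i, n_j: numbers of units in i and j (used only by groupaverage).
    """
    if method in ("singlelink", "nearestneighbour"):
        return max(s_ik, s_jk)  # closest pair of units: largest similarity
    if method in ("completelink", "furthestneighbour"):
        return min(s_ik, s_jk)  # furthest pair of units: smallest similarity
    if method == "averagelink":
        return (s_ik + s_jk) / 2  # plain average of the two similarities
    if method == "groupaverage":
        # Assumption: average weighted by the sizes of the merged clusters.
        return (n_i * s_ik + n_j * s_jk) / (n_i + n_j)
    raise ValueError("method not covered by this sketch: " + method)


print(updated_similarity("averagelink", 80, 50))         # 65.0
print(updated_similarity("groupaverage", 80, 50, 2, 1))  # 70.0
```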

The CTHRESHOLD option is a scalar which allows you to define the levels of decreasing similarity at which the lists of clusters are printed with their membership. The decreasing levels of similarity are formed by repeatedly subtracting the CTHRESHOLD value from the maximum similarity of 100%. For example, setting CTHRESHOLD=10 will list the clusters formed at 90% similarity, 80%, and so on. At each level, those units that have not joined any group are also listed. If you do not set this option, the default value will be calculated from the range of similarities at which merges occur, to give between 10 and 20 separate levels.
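
The arithmetic for the listing levels is simple enough to sketch in Python (not Genstat code). The function name is hypothetical, and stopping at the lowest similarity at which a merge occurs is an assumption; the text does not say where the listing ends.

```python
def listing_levels(cthreshold, lowest_merge):
    """Levels of similarity (%) at which cluster lists would be printed."""
    if cthreshold <= 0:
        raise ValueError("cthreshold must be positive")
    levels = []
    level = 100 - cthreshold
    while level >= lowest_merge:  # assumed stopping rule
        levels.append(level)
        level -= cthreshold
    return levels


print(listing_levels(10, 55))  # [90, 80, 70, 60]
```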

 

HDISPLAY directive

Displays results ancillary to hierarchical cluster analyses: matrix of mean similarities between and within groups, a set of nearest neighbours for each unit, a minimum spanning tree, and the most typical elements from each group.

 

Option

PRINT = strings Printed output required (neighbours, tree, typicalelements, gsimilarities); default tree

 

Parameters

SIMILARITY = symmetric matrices Input similarity matrix for each cluster analysis

NNEIGHBOURS = scalars Number of nearest neighbours to be printed

NEIGHBOURS = matrices Matrix to store nearest neighbours of each unit

GROUPS = factors Indicates the groupings of the units (for calculating typical elements and mean similarities between groups)

TREE = matrices To store the minimum spanning tree (as a series of links and corresponding lengths)

GSIMILARITY = symmetric matrices To store similarities between groups

 

Description

You can use the HDISPLAY directive to print ancillary information useful for interpreting cluster analyses, and to save information to use elsewhere in Genstat, for example for plotting.

The SIMILARITY parameter specifies a list of symmetric similarity matrices. These are operated on, in turn, to produce the output requested by the PRINT option and to save the information specified by other parameters. Since the interpretations of the remaining parameters are closely linked to the different settings of the PRINT option, each setting is discussed below with the relevant parameters.

The NNEIGHBOURS parameter gives a list of scalars indicating how many neighbours will appear in the printed table of nearest neighbours.

The NEIGHBOURS parameter can specify a list of identifiers to store details of nearest neighbours. These will be declared implicitly, if necessary, as matrices. The rows of the matrices correspond to the units; there should be an even number of columns. The values in the odd-numbered columns represent the neighbouring units in order of their similarity, while the values in the even-numbered columns are the corresponding similarities. If you have declared the matrix previously and it does not have enough columns, then NEIGHBOURS stores as many neighbours as possible. If there is an odd number of columns in the matrix, the last column is not filled. If the matrix is declared implicitly, the number of columns will be twice the value of the NNEIGHBOURS scalar.
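
The column layout can be illustrated with a Python sketch (not Genstat code). The function name is hypothetical, and the treatment of ties (original unit order) is an assumption.

```python
def neighbours_matrix(sim, nneighbours):
    """One row per unit: neighbour, similarity, neighbour, similarity, ...

    sim: square symmetric list of lists; units are numbered from 1 in the output.
    """
    n = len(sim)
    rows = []
    for u in range(n):
        # Other units, most similar first.
        others = sorted((v for v in range(n) if v != u),
                        key=lambda v: sim[u][v], reverse=True)
        row = []
        for v in others[:nneighbours]:
            row.extend([v + 1, sim[u][v]])  # odd column: unit; even column: similarity
        rows.append(row)
    return rows


sim = [[100, 90, 40],
       [90, 100, 60],
       [40, 60, 100]]
for row in neighbours_matrix(sim, 2):
    print(row)
# [2, 90, 3, 40]
# [1, 90, 3, 60]
# [2, 60, 1, 40]
```

Each row has twice as many columns as the number of neighbours requested, matching the implicit declaration described above.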

If the PRINT option includes the setting neighbours, Genstat prints a table of nearest neighbours for every sample, together with their values of similarity. The number of neighbours printed is determined by the value of the NNEIGHBOURS scalar; if NNEIGHBOURS is not set, the table is not printed. This information is also useful for interpreting clusters and ordinations.

The GROUPS parameter specifies a factor to divide the units of each similarity matrix into clusters. You may have formed the factor from a previous hierarchical cluster analysis, using HCLUSTER. This parameter must be set if the PRINT option includes the settings typicalelements or gsimilarities.

If the PRINT option includes the setting typicalelements, Genstat prints the average similarity of each group member with the other group members. This is to help you identify typical members of each group: typical members will have relatively large average similarities compared to those of the other members. Within each group, members are printed in decreasing order of average similarity.
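
The calculation behind the typical elements can be sketched in Python (not Genstat code). The function name and the returned layout are hypothetical; a group with a single member has no other members to average over, and the sketch reports NaN for it as an assumption.

```python
def typical_elements(sim, groups):
    """For each group, list (unit, mean similarity with the other members),
    sorted by decreasing mean similarity. Units are numbered from 1."""
    result = {}
    for g in sorted(set(groups)):
        members = [u for u, label in enumerate(groups) if label == g]
        scores = []
        for u in members:
            others = [v for v in members if v != u]
            if others:
                mean = sum(sim[u][v] for v in others) / len(others)
            else:
                mean = float("nan")  # assumption: undefined for a singleton group
            scores.append((u + 1, mean))
        scores.sort(key=lambda pair: pair[1], reverse=True)
        result[g] = scores
    return result


sim = [[100, 90, 80, 20],
       [90, 100, 70, 30],
       [80, 70, 100, 10],
       [20, 30, 10, 100]]
print(typical_elements(sim, [1, 1, 1, 2])[1])  # [(1, 85.0), (2, 80.0), (3, 75.0)]
```

Here unit 1 is the most typical member of group 1, because its average similarity with units 2 and 3 is the largest.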

The GSIMILARITY parameter specifies a list of symmetric matrices in which you can save the mean between-group and within-group similarities. Any structure that you have not declared already will be declared implicitly to be a symmetric matrix with number of rows equal to the number of levels of the factor in the GROUPS parameter.

If the PRINT option includes the setting gsimilarities, Genstat prints the mean between-group and within-group similarities. Self-similarities are excluded.
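
A Python sketch (not Genstat code) of the between-group and within-group means follows. The function name and the dictionary layout are hypothetical, and returning None for a within-group mean that has no pairs is an assumption.

```python
def group_similarities(sim, groups):
    """Mean similarity for each pair of groups (a, b) with a <= b.

    Within-group means (a == b) exclude each unit's similarity with itself.
    """
    n = len(sim)
    levels = sorted(set(groups))
    means = {}
    for a in levels:
        for b in levels:
            if b < a:
                continue  # the matrix is symmetric, so keep one triangle only
            values = [sim[u][v]
                      for u in range(n) for v in range(n)
                      if groups[u] == a and groups[v] == b and u != v]
            means[(a, b)] = sum(values) / len(values) if values else None
    return means


sim = [[100, 90, 80, 20],
       [90, 100, 70, 30],
       [80, 70, 100, 10],
       [20, 30, 10, 100]]
print(group_similarities(sim, [1, 1, 1, 2]))
# {(1, 1): 80.0, (1, 2): 20.0, (2, 2): None}
```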

The TREE parameter can specify a matrix to save the minimum spanning tree. The matrix is set up with two columns and number of rows equal to the number of units. For each unit, the value in the first column is the unit to which that unit is linked on its left; the second column is the corresponding similarity. The first unit is not linked to any unit on its left, as it is always the first unit on the tree; so the first row of the matrix contains missing values.

Setting the PRINT option to tree prints the minimum spanning tree associated with the similarity matrix specified by the SIMILARITY parameter. The minimum spanning tree (MST) is not a Genstat structure, but it can be kept in the form described above: that is, in a matrix with two columns. An MST is a tree connecting the n points of a multidimensional representation of the sampling units. In a tree every unit is linked to a connected network and there are no closed loops; the special feature of the MST is that, of all trees with a sampling unit at every node, it is the one whose links have minimum total length. The links include all those that join nearest neighbours; the MST is closely related to single linkage hierarchical trees. Minimum spanning trees are also useful if you superimpose them on ordinations to reveal regions in which distance is badly distorted (see procedure DMST); if neighbouring points, as given by the MST, are distant in the ordination then something is badly wrong.
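
To make the idea concrete, here is a Python sketch (not Genstat code) that grows a spanning tree from unit 1 using Prim's algorithm, treating the largest similarity as the shortest link. The function name is hypothetical. The result has one entry per unit, giving a linked unit and a similarity, with nothing for unit 1; it mirrors the two-column shape only loosely, since the sketch does not reproduce Genstat's left-to-right ordering of the printed tree.

```python
def spanning_tree(sim):
    """Return, for each unit, (unit it was attached to, similarity), or None for unit 1.

    sim: square symmetric list of lists; units are numbered from 1 in the output.
    """
    n = len(sim)
    in_tree = {0}        # start from the first unit
    link = [None] * n
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                # Most similar unit not yet in the tree (shortest link).
                if v not in in_tree and (best is None or sim[u][v] > best[2]):
                    best = (u, v, sim[u][v])
        u, v, s = best
        link[v] = (u + 1, s)
        in_tree.add(v)
    return link


sim = [[100, 90, 40],
       [90, 100, 60],
       [40, 60, 100]]
print(spanning_tree(sim))  # [None, (1, 90), (2, 60)]
```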

 

HELP directive

Prints details about the Genstat language and environment.

 

Option

CHANNEL = identifier Channel number of file, or identifier of a text to store output; default current output file

 

Parameter

strings Directive names or keywords indexing the desired details

 

Description

The HELP information in Genstat is arranged as a hierarchical system, where the information becomes more specific as you move down the hierarchy. At every stage Genstat lists the information available at the next level. It is thus very easy to browse through the system when using Genstat interactively. To enter the system at the top level in an interactive run, you merely type the directive name, HELP, on its own.

Wherever you are in the system, Genstat will print the information that you have requested, followed by the list of words (strings of characters) that you can use to move to the next level and select further information; you then get the prompt:

HELP>

In response to the prompt you can do one of four things.

 

(a) Type one of the suggested words and obtain further information. When doing this you need include only enough characters to distinguish the word from those earlier in the list. Genstat will then print the requested information, followed by a further list of words and a prompt ready for your next choice. If you are already familiar with HELP, you can always skip levels in the prompting hierarchy by giving a list of words, each separated from the next by a comma. For example,

HELP> read,options

skips the information that you would get if you specified read only; it thus takes you straight to the information on the options of the READ directive. You could, indeed, have done this at the outset, when you first typed HELP, by putting

HELP read,options

The words can be typed in either lower or upper case, or in any mixture.

 

(b) Type an asterisk (*) to see the information relevant to all the possible words.

 

(c) Type carriage-return (<RETURN>) to move back to the previous level of the hierarchy. If you are already at the top level this takes you out of the HELP system.

 

(d) Type a colon (:) to exit from HELP from any level.
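
The abbreviation rule in (a), that you need only enough characters to distinguish a word from those earlier in the list, amounts to taking the first listed word that begins with what was typed. A Python sketch (not Genstat code) follows; the function name and the word list are hypothetical.

```python
def resolve_word(typed, words):
    """Return the first word in the list that starts with the typed characters.

    Matching ignores case, as HELP accepts lower case, upper case, or a mixture.
    """
    prefix = typed.lower()
    for word in words:
        if word.lower().startswith(prefix):
            return word
    return None  # nothing in the list matches


choices = ["options", "output", "parameters"]  # hypothetical list of words
print(resolve_word("O", choices))   # options
print(resolve_word("ou", choices))  # output
```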

 

Whenever there is more than a screenful of information, HELP pauses and gives a question mark (?) as a prompt. You can respond either by pressing carriage-return (<RETURN>) to continue with the current information, or by selecting any of the possibilities (a) to (d). The words allowed under (a) are now those that would be permitted if the current information had been completed: these are the words available at the next level down if one exists, or from the current level if there is none below this.

In an interactive run, a HELP statement continues either until you type a colon, or until you type (<RETURN>) at the top level; otherwise newlines merely generate further prompts of HELP>. In batch, however, newline will be interpreted just as in any other statement. Thus each time that you use HELP in batch, you can specify just a single list, although you can still put asterisk to obtain all the information from a particular level. For example,

HELP regression

to obtain general information about the facilities in Genstat for regression, or

HELP fit,*

to learn about the options and parameters of the FIT directive, or

HELP print,parameters

for details of the parameters of the PRINT directive.

 

HISTOGRAM directive

Produces histograms of data on the terminal or line printer.

 

Options

CHANNEL = scalar Channel number of output file; default is the current output file

TITLE = text General title; default *

LIMITS = variate Variate of group limits for classifying variates into groups; default *

NGROUPS = scalar When LIMITS is not specified, this defines the number of groups into which a data variate is to be classified; default is the integer value nearest to the square root of the number of values in the variate

LABELS = text Group labels

SCALE = scalar Number of units represented by each character; default 1

 

Parameters

DATA = identifiers Data for the histograms; these can be either a factor indicating the group to which each unit belongs, a variate whose values are to be grouped, or a one-way table giving the number of units in each group

NOBSERVATIONS = tables One-way table to save numbers in the groups

GROUPS = factors Factor to save groups defined from a variate

SYMBOLS = texts Characters to be used to represent the bars of each histogram

DESCRIPTION = texts Annotation for key

 

Description

Histograms provide quick and simple visual summaries of data values. The data are divided into several groups, which are then displayed as a histogram consisting of a line of asterisks for each group. The number of asterisks in each line is proportional to the number of values assigned to that group; this figure is also printed at the beginning of each line. The data for the histogram are specified using the DATA parameter, as variates, factors, or one-way tables.

If a histogram is to be formed from a variate, Genstat sorts its values into groups as defined by upper and lower bounds. You can also specify a list of variates, to obtain a parallel histogram. For each group one row of asterisks is printed for each variate, labelled by the corresponding identifier. The variates are sorted according to the same intervals; there is no need for them all to have the same numbers of values.

With variates of data, you can use the NGROUPS option to specify the number of groups in the histogram; Genstat will then work out appropriate limits, based on the range of the data, to form intervals of equal width. For example:

HISTOGRAM [NGROUPS=5] Data

Alternatively, you can define the groups explicitly, by setting the LIMITS option to a variate containing the group limits. For example:

VARIATE [VALUES=1,2,3,5,7,8,10] Glimits

HISTOGRAM [LIMITS=Glimits] Data

Glimits is a variate with seven values, producing a histogram in which the data are split into eight groups: ≤1, 1-2, 2-3, 3-5, 5-7, 7-8, 8-10, >10. The upper limit of each group is included within that group, so the group 3-5, for example, contains values that are greater than 3 and less than or equal to 5. The values of the limits variate are sorted into ascending order if necessary, but the variate itself is not changed.
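
The grouping rule, with each upper limit included in its own group, can be checked with a short Python sketch (not Genstat code). The function names are hypothetical; groups are numbered from 0 here purely for convenience.

```python
from bisect import bisect_left


def group_index(value, limits):
    """Index of the group containing value: 0 for the first group (<= smallest limit),
    len(limits) for the last group (> largest limit)."""
    ordered = sorted(limits)  # sorted copy; the caller's list is not changed
    # bisect_left counts the limits strictly below value, so a value equal to
    # a limit falls in the group for which that limit is the upper bound.
    return bisect_left(ordered, value)


def group_counts(data, limits):
    """Number of data values falling in each of the len(limits) + 1 groups."""
    counts = [0] * (len(limits) + 1)
    for value in data:
        counts[group_index(value, limits)] += 1
    return counts


glimits = [1, 2, 3, 5, 7, 8, 10]
print(group_index(5, glimits))     # 3, the group 3-5
print(group_index(3, glimits))     # 2, the group 2-3
print(group_index(10.5, glimits))  # 7, the group >10
print(group_counts([0.5, 3, 4, 5, 12], glimits))  # [1, 0, 1, 2, 0, 0, 0, 1]
```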

You can use the LABELS option to provide your own labelling for the groups of the histogram. It should be set to a text vector of length equal to the number of groups. If neither NGROUPS nor LIMITS has been set, the number of groups is determined from the number of values in the LABELS structure. If LABELS is also unset, the default number of groups is chosen as the integer value nearest to the square root of the number of values, up to a maximum of 10. Alternatively, procedure AKAIKEHISTOGRAM provides a more sophisticated method of generating histograms, using Akaike's Information Criterion (AIC) to generate an optimal grouping of the data.
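
The default number of groups can be written out as a small Python sketch (not Genstat code). The function name is hypothetical, and rounding exact halves upwards is an assumption about "nearest".

```python
import math


def default_ngroups(n_values):
    """Integer nearest to the square root of the number of values, at most 10."""
    nearest = int(math.sqrt(n_values) + 0.5)  # assumption: halves round up
    return min(10, nearest)


print(default_ngroups(30))   # 5
print(default_ngroups(43))   # 7
print(default_ngroups(400))  # 10
```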

The data for the histogram can also be specified as a factor (which defines the assignment of each unit to a group of the histogram). Genstat then counts the number of units that occur with each level of the factor; thus the number of groups of the histogram is the number of levels of the factor and the value for each group is the corresponding total. If the LABELS option is unset, the labels of the factor (if present) are used to label the groups, otherwise Genstat uses the factor levels.

When Genstat plots the histogram of a one-way table, the number of groups is the number of levels of the factor classifying the table and the values of the table indicate the number of observations in each group. If the LABELS option is unset, the labels or levels of the classifying factor are again used to label the histogram.

When producing a parallel histogram the data structures must all be of the same type: variate, factor, or table. Variates and factors may be restricted, in which case only the subset of values specified by the restriction will be included in the histogram; however, unlike many directives, restrictions do not carry over to the other structures listed by the DATA parameter. If parallel histograms are to be formed from several factors, they must all have the same number of levels, and the labels or levels of the first factor will be used to identify the groups. Likewise, if you are forming parallel histograms from several tables, they must all have the same number of values, and the classifying factor of the first table will define the labelling of the histogram.

The SYMBOLS parameter can specify alternative plotting characters to be used instead of the asterisk. For example:

HISTOGRAM Variate; SYMBOLS='+'

You can specify a different string for each structure in a parallel histogram. If you specify strings of more than one character, Genstat uses the characters in order, recycled as necessary, until each histogram bar is of the correct length.
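
The recycling of characters can be pictured with a Python sketch (not Genstat code); the function name is hypothetical.

```python
from itertools import cycle, islice


def histogram_bar(symbols, length):
    """Build a bar of the given length, using the characters of symbols in order
    and starting again from the first character when they run out."""
    return "".join(islice(cycle(symbols), length))


print(histogram_bar("+", 4))   # ++++
print(histogram_bar("+-", 5))  # +-+-+
```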

The TITLE option lets you set an overall title for the output, and the DESCRIPTION parameter can be used to provide a text for labelling the histogram instead of the identifiers of the DATA structures.

Normally one asterisk will represent one unit. However, if there are many data values and the groups become large, Genstat may not be able to fit enough asterisks into one row. It will then alter the scaling so that one asterisk represents several units. You can set the scaling explicitly using the SCALE option; the value specified is rounded to the nearest integer, and determines how many units should be represented by each asterisk.

HISTOGRAM has two output parameters that allow you to save information that has been generated during formation of the histogram. The NOBSERVATIONS parameter allows you to save a one-way table of counts that contains the number of observations that were assigned to each group; the missing-value cell of this table will contain a count of the number of units that were missing and that therefore remain unclassified. When producing a histogram from a variate, you can use the GROUPS parameter to specify a factor to record the group to which each unit was allocated.

Normally, output goes to the current output channel, but you can use the CHANNEL option to direct it to another. For example, when you are working interactively, you might want to send a graph to a secondary output file so that you can print it later. Unlike some directives (for example, PRINT), you cannot save the output in a text structure.

 

HLIST directive

Lists the data matrix in abbreviated form.

 

Options

GROUPS = factor Defines groupings of the units; used to split the printed table at appropriate places and to label the groups; default *

UNITS = text or variate Names for the rows (i.e. units) of the table; default *

 

Parameters

DATA = variates The data values

TEST = strings Test type, defining how each variate is treated in the calculation of the similarity between each unit (Jaccard, simplematching, cityblock, Manhattan, ecological, Pythagorean, Euclidean); default * ignores that variate

RANGE = scalars Range of possible values of each variate; if omitted, the observed range is taken

 

Description

HLIST lists the values of the data matrix in a condensed form, either in their original order or, more usefully, in the order determined by a cluster analysis (see HCLUSTER). This representation can be very helpful for revealing patterns in the data, associated with clusters, or for an initial scan of the data to pick out interesting features of the variates.

The DATA parameter specifies a list of variates, all of which must be of the same length. If any of the variates is restricted, or if the factor in the GROUPS option is restricted, then that restriction is applied to all the variates. Any restriction on any other variate must be to the same set of units. The TEST parameter specifies a list of strings, one for each variate in the DATA parameter list, to define the "type" of each variate; these are used by FSIMILARITY to determine how differences in variate values for each unit contribute to the overall similarity between units. However, HLIST distinguishes only between qualitative variates (Jaccard or simplematching) and quantitative variates (other settings). The values of qualitative variates are printed directly. If the range of a quantitative variate is greater than 10, the printed values are scaled to lie in the range 0 to 10. This scaling is done by subtracting the minimum value from the variate, dividing by the range and then multiplying by 10. If the range is less than 10, the values are printed unscaled; so variates with values that are all less than 1 will appear as 0 in the abbreviated table. The values are printed with no decimal places, and in a field-width of 3.
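
The scaling rule for quantitative variates can be sketched in Python (not Genstat code). The function name is hypothetical; the sketch uses the observed range only, leaves a range of exactly 10 unscaled (the text covers only "greater than" and "less than"), and returns unrounded values because the text does not say how values are reduced to no decimal places.

```python
def hlist_scale(values):
    """Scale values to 0-10 when their observed range exceeds 10; otherwise
    return them unchanged."""
    lowest = min(values)
    value_range = max(values) - lowest
    if value_range > 10:
        # Subtract the minimum, divide by the range, multiply by 10.
        return [(v - lowest) / value_range * 10 for v in values]
    return list(values)


print(hlist_scale([20, 45, 70]))  # [0.0, 5.0, 10.0]
print(hlist_scale([2, 4, 9]))     # [2, 4, 9]
```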

The RANGE parameter contains a list of scalars, one for each variate in the DATA list. This allows you to check that the values of each variate lie within the given range. The range is also used to standardize quantitative variates, so that you can impose a standard range for example when variates are measured on commensurate scales. You can omit the RANGE parameter for all or any of the variates by giving a missing identifier or a scalar with a missing value; Genstat then uses the observed range.

The UNITS option allows you to change the labelling of the units in the table; you can specify a text or a pointer or a variate.

You can use the GROUPS option to specify a factor that will split the units into groups. The table from HLIST is then divided into sections corresponding to the groups. If the factor has labels, these are used to annotate the sections; otherwise a group number is used.

 

HSUMMARIZE directive

Forms and prints a group-by-levels table for each test, together with appropriate summary statistics for each group.

 

Option

GROUPS = factor Factor defining the groups; no default i.e. this option must be specified

 

Parameters

DATA = variates The data values

TEST = strings Test type, defining how each variate is treated in the calculation of the similarity between each unit (Jaccard, simplematching, cityblock, Manhattan, ecological, Pythagorean, Euclidean); default * ignores that variate

RANGE = scalars Range of possible values of each variate; if omitted, the observed range is taken

 

Description

The HSUMMARIZE directive helps you to see which clusters, if any, are distinguished by each variate. It requires a factor to define the clusters, as well as the original data variates, together with their types and, optionally, their ranges. From this it prints a frequency table for each variate. Each table is classified by the grouping factor and the different values of the variate.

The option and parameters of the HSUMMARIZE directive are the same as those of the HLIST directive, and are described there.

For qualitative variates (TEST settings Jaccard or simplematching) the values are integral, and for each group Genstat calculates an interaction statistic labelled chi-squared. This statistic does not have a significance level attached to it, but it does draw attention to groups for which the distribution is markedly different from the overall distribution.

For quantitative variates, values are rounded to the nearest point on an 11-point scale (0-10). The interaction statistic is analogous to Student's t, and it draws attention to the groups for which the mean variate value is markedly different from the overall mean (again with no significance level attached). Missing values are ignored in the computation of these statistics.

 

IF directive

Introduces a block-if control structure.

 

No options

 

Parameter

expression Logical expression, indicating whether or not to execute the first set of statements.

 

Description

A block-if structure consists of one or more alternative sets of statements. The first of these is introduced by an IF statement. There may then be further sets introduced by ELSIF statements. Then you can have a final set introduced by an ELSE statement, and the whole structure is terminated by an ENDIF statement. Thus the general form is:

first

IF expression

statements

then either none, one, or several blocks of statements of the form

ELSIF expression

statements

then, if required, a block of the form

ELSE

statements

and finally the statement

ENDIF

Each expression must evaluate to a single number, which is treated as a logical value: a zero value is treated as false and non-zero as true. Genstat executes the block of statements following the first true expression. If none of the expressions is true, the block of statements following ELSE (if present) is executed.

You can thus use these directives to build constructs of increasing complexity. The simplest form would be to have just an IF statement, then some statements to execute, and then an ENDIF. For example:

IF MINIMUM(Sales) < 0

PRINT 'Incorrect value recorded for Sales.'

ENDIF

If the variate Sales contains a negative value, the PRINT statement will be executed. Otherwise Genstat goes straight to the statement after ENDIF.

To specify two alternative sets of statements, you can include an ELSE block. For example

IF Age < 20

CALCULATE Pay = Hours*1.75

ELSE

CALCULATE Pay = Hours*2.5

ENDIF

calculates Pay using two different rates: 1.75 for Age less than 20, and 2.5 otherwise.

Finally, to have several alternative sets, you can include further sets introduced by ELSIF statements. Suppose that we want to assign values to X according to the rules:

X=1 if Y=1

X=2 if Y ≠ 1 and Z=1

X=3 if Y ≠ 1 and Z=2

X=4 if Y ≠ 1 and Z ≠ 1 or 2

This can be written in Genstat as follows:

IF Y == 1

CALCULATE X = 1

ELSIF Z == 1

CALCULATE X = 2

ELSIF Z == 2

CALCULATE X = 3

ELSE

CALCULATE X = 4

ENDIF

If Y is equal to 1, the first CALCULATE statement is executed to set X to 1. If Y is not equal to 1, Genstat does the tests in the ELSIF statements, in turn, until it finds a true condition; if none of the conditions is true, the CALCULATE statement after ELSE is executed to set X to 4. Thus, for Y=99 and Z=1, Genstat will find that the condition in the IF statement is false. It will then test the condition in the first ELSIF statement; this produces a true result, so X is set to 2. Genstat then continues with whatever statement follows the ENDIF statement. Block-if structures can be nested to any depth, to give conditional constructs of even greater flexibility.

 

INPUT directive

Specifies the input file from which to take further statements.

 

Options

PRINT = strings What output to generate from the statements in the file (statements, macros, procedures, unchanged); default stat

REWIND = string Whether to rewind the file (yes, no); default no

 

Parameter

scalar Channel number of input file

 

Description

Having opened a file of Genstat statements on another input channel (for example by the OPEN directive) you can switch control to that channel at any time using an INPUT statement. You specify the channel as a number or as a scalar containing that number. For example,

OPEN 'MYPROCS.GEN'; CHANNEL=4; FILETYPE=input

INPUT 4

The file can contain any valid Genstat statements: they will be executed just as if they had been on the original input channel. In this file you could use an INPUT statement to switch back to channel 1 after a while. Alternatively, you may have set up several input files and jump from one to another, again using INPUT. You can use RETURN to go back to the previous channel or STOP to end this run of Genstat. If the end of the file is reached without finding any of these statements, control will be passed back to the previous input channel as explained in the description of the RETURN directive. Note that if you use INPUT to go back to an earlier channel, you may affect the way in which RETURN works (again see the description of RETURN).

The PRINT option can be used to specify whether the statements read from the file should be echoed to the current output channel. This is used in the same way as INPRINT in JOB and SET.

The REWIND option allows you to return to the beginning of the file. You might need to do this, for example, if you had made an error, so that the statements on the secondary input file were executed wrongly. After correcting your error you could set REWIND=yes to start again from the beginning of the file.
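
For instance, assuming that the file from the example above is still open on channel 4, the following sketch would restart it from the beginning and echo both the statements and the contents of any macros as they are read:

INPUT [PRINT=statements,macros; REWIND=yes] 4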

 

INTERPOLATE directive

Interpolates values at intermediate points.

 

Options

CURVE = string Type of curve to be fitted to calculate the interpolated value (linear, cubic); default line

METHOD = string Type of interpolation required (interval, value, missing): for METHOD=valu, values are interpolated for each point in the NEWINTERVALS variate and stored in the NEWVALUES variate; for METHOD=inte, points are estimated in the NEWINTERVALS variate for the observations in the NEWVALUES variate; while for METHOD=miss, the NEWVALUES and NEWINTERVALS lists are ignored and INTERPOLATE interpolates for missing values in the OLDVALUES and OLDINTERVALS variates (except for units that are missing in both variates); default inte

 

Parameters

OLDVALUES = variates Observations from which interpolation is to be done

NEWVALUES = variates Results of each interpolation

OLDINTERVALS = variates Points at which each set of OLDVALUES was observed

NEWINTERVALS = variates Points for each set of NEWVALUES

 

Description

If you have a set of pairs of observations (x, y), you can use interpolation to estimate either a value y for a value x that need not be in the set, or a value x for a value y that likewise need not be in the set. The simplest way to interpolate is by joining successive pairs of observations by straight lines and reading off the appropriate values in between: then the two cases are called linear interpolation (obtaining y from x) and inverse linear interpolation (obtaining x from y). Genstat can alternatively join the points by cubic functions instead of straight lines. Genstat uses the term values to describe the set of y-values and intervals for the set of x-values, no matter whether you are doing direct or inverse interpolation.

Genstat does the interpolation for each parallel set of variates in the parameter lists. Each variate in the OLDINTERVALS list specifies the x-values of a set of observed points; the corresponding variate in the OLDVALUES list specifies the corresponding y-values. The variates in the NEWINTERVALS and NEWVALUES lists are for the x-values and y-values of the interpolated points.

If you set METHOD=value, Genstat does ordinary interpolation, and you use the NEWINTERVALS variate to specify the x-values for which you require interpolated y-values. Genstat calculates the y-values and stores them in the corresponding NEWVALUES variate; this variate will be declared implicitly if you have not declared it already.

For the interpolation to take place, the x-values must be in either monotonically increasing or decreasing order; thus, if necessary, Genstat takes a copy of the x-values and y-values and sorts these (in parallel) to put the x-values into ascending order.

Assume that wheat plants have been sampled on five occasions and their growth stage (Zadoks) assessed. INTERPOLATE interpolates values, which it stores in variate Nzad, to estimate the growth stage that the plant has reached after 50, 100, and 150 days.

VARIATE [NVALUES=6] Zadoks,Days; \

VALUES=!(0,15,23,35,65,95),!(0,50,84,119,147,182)

& [NVALUES=3] Nzadoks,Ndays; VALUES=!(25,50,75),!(50,100,150)

INTERPOLATE [METHOD=value] Zadoks; NEWVALUES=Nzad; \

OLDINTERVALS=Days; NEWINTERVALS=Ndays
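
As a hand check on the linear case, day 100 lies between the observations at days 84 and 119, so the interpolated growth stage should be 23 + (35 - 23) * (100 - 84) / (119 - 84), which is about 28.49. A minimal sketch that evaluates this directly (the scalar Check is a hypothetical name introduced here for illustration):

SCALAR Check

CALCULATE Check = 23 + (35 - 23) * (100 - 84) / (119 - 84)

PRINT Check

If INTERPOLATE joins the points by straight lines as described above, the second value of Nzad should agree with Check.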

Similarly, if you set METHOD=interval, Genstat does inverse interpolation. You must then specify the y-values in the NEWVALUES variate. Genstat calculates the x-values and stores them in the corresponding NEWINTERVALS variate, which will be declared implicitly if necessary. Again the x-values must be in monotonically increasing or decreasing order, and Genstat will produce a sorted copy if necessary. Inverse interpolation is the default.

This statement would use inverse linear interpolation to estimate how long after planting we have to wait for the plant to reach growth stages 25, 50, and 75 Zadoks.

INTERPOLATE [METHOD=interval] Zadoks; NEWVALUES=Nzadoks; \

OLDINTERVALS=Days; NEWINTERVALS=Nd

If you set METHOD=missing, Genstat ignores the NEWVALUES and NEWINTERVALS parameters; it estimates values for x or y when the other is missing, placing the results in the previously missing positions of the OLDVALUES or OLDINTERVALS variates. Ordinary interpolation is used when the missing value is in y, and inverse interpolation when it is in x. If both the x-value and the y-value are missing for a particular unit, no values can be interpolated for it, and it remains missing. Linear interpolation requires both the x-value and the y-value to be non-missing for the point on each side of the unit with the missing value. For cubic interpolation, there must be two non-missing points on each side of the unit.
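
A minimal sketch of METHOD=missing, reusing the Days variate from the example above. The variate Zgap is a hypothetical name introduced here; it holds the growth stages with the third observation unrecorded:

VARIATE [VALUES=0,15,*,35,65,95] Zgap

INTERPOLATE [METHOD=missing] OLDVALUES=Zgap; OLDINTERVALS=Days

PRINT Days,Zgap

With linear interpolation, the missing growth stage at day 84 would be filled in from the neighbouring points at days 50 and 119, giving 15 + (35 - 15) * (84 - 50) / (119 - 50), which is about 24.86.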

The CURVE option has two settings, linear and cubic. By default, CURVE=linear, and successive pairs of observations are connected by straight-line segments for linear, or inverse-linear, interpolation. For cubic interpolation you set CURVE=cubic; there must then be at least four values in each of the OLDVALUES and OLDINTERVALS variates.
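
As an illustration, the ordinary interpolation shown earlier could be repeated with cubic curves. The six observations in Zadoks and Days satisfy the requirement of at least four values; Czad is a hypothetical name for the variate that receives the results:

INTERPOLATE [CURVE=cubic; METHOD=value] Zadoks; NEWVALUES=Czad; \

OLDINTERVALS=Days; NEWINTERVALS=Ndays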

For linear and inverse linear interpolation between variates you can use the VINTERPOLATE procedure.

 

JOB directive

Starts a Genstat job.

 

Options

INPRINT = strings Printing of input as in PRINT option of INPUT (statements, macros, procedures, unchanged); default unch

OUTPRINT = strings Additions to output as in PRINT option of OUTPUT (dots, page, unchanged); default unch

DIAGNOSTIC = strings Defines the least serious class of Genstat diagnostic which should still be generated (messages, warnings, faults, extra, unchanged); default unch

ERRORS = scalar Limit on number of error diagnostics that may occur before the job is abandoned; default * i.e. no limit

PROMPT = text Characters to be printed for the input prompt

 

Parameter

text Name to identify the job

 

Description

The JOB and ENDJOB directives can be used to partition a Genstat program into separate jobs. A job is a self-contained subsection of a program. All data structures and procedures are lost at the end of each job. Any setting defined by a UNITS statement is deleted, as are the special structures set up by analyses like regression and analysis of variance. The graphics environment is also reset to the initial default. Thus, in many ways, it is as though Genstat was starting again for each new job. However, any files that have been attached to Genstat retain their current status from job to job. So, for example, Genstat will continue to add output to the end of an output file, or will continue reading from the current point of an input file.

The JOB directive is used to start a new job. It has a parameter which can be set to a text to identify the job (for example in the message at the end of the job), and options to control some aspects of the Genstat "environment". However, Genstat will automatically start a job at the beginning of a program, or after an ENDJOB statement, so you do not need to give a JOB statement unless you wish to define an identifying text or to modify the environment.

The default settings of the JOB options leave the corresponding aspects of the environment unchanged so, if any aspect is modified, it will remain in that form (unless modified again) in any subsequent job. All these aspects have initial defaults, described below, that apply at the outset of a program. However, it is possible to arrange for Genstat to run commands from a start-up file before it executes the first statement of a program, so the initial environment can differ from machine to machine.

The INPRINT option specifies which pieces of input from the current input channel will be recorded in the current output file. (The current input channel may be a file or, in an interactive run, it may also be the keyboard.) The settings correspond to three types of input:

statements statements that are typed explicitly on the keyboard or which occur explicitly in an input file,

macros statements or parts of statements that have been supplied in macros, using the ## notation (1.9.2), and

procedures statements occurring within procedures.

The initial default is to record only statements for input from a file, or to record nothing if input is from the keyboard. The recording of input can be modified also by the INPRINT option of the SET directive, or by the PRINT option of INPUT.

The OUTPRINT option controls the way in which the output from many Genstat directives will start: page ensures that output to a file will start at the head of a page, and dots produces a line of dots beginning with the line number of the statement that has generated the analysis. The initial default is to give a new page and a line of dots if output is to a file, but neither if output is to the screen. This can be modified also by the OUTPRINT option of the SET directive, or by the PRINT option of OUTPUT.

The DIAGNOSTIC option controls the reporting of errors and possible mistakes. In order of increasing seriousness there are three classes of diagnostic: messages, warnings, and faults. Messages are comments that are made to draw your attention to things that might need closer investigation, like large residuals in an analysis of variance or a regression. Warnings are definite errors, but ones that are not sufficiently serious to prevent Genstat from continuing; an example would be an attempt to print a data structure with no values. Faults are the most serious type of error. A fault in a batch run will cause Genstat to stop executing the current job. However, Genstat will continue to read and interpret the statements so that it can find the start of the next job (if any); at the same time it will report any further errors that it finds, up to the number specified by the ERRORS option.

The setting of DIAGNOSTIC indicates the level of stringency to be adopted. Thus, if DIAGNOSTIC=warnings, Genstat will report faults and warnings (but not messages), while DIAGNOSTIC=messages ensures that all three classes are reported. The setting extra is similar to messages but will also generate a dump of system information after any fault. You can prevent the output of any diagnostics by putting DIAGNOSTIC=*. The initial default is to set DIAGNOSTIC=messages.
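
A short sketch of a program split into two jobs; the job names and option settings are illustrative only:

JOB [INPRINT=statements; DIAGNOSTIC=warnings] 'Wheat interpolation'

VARIATE [VALUES=0,50,84,119,147,182] Days

PRINT Days

ENDJOB

JOB 'Second analysis'

LIST

ENDJOB

Because all data structures are lost at the end of a job, Days is no longer available when LIST is executed in the second job. The INPRINT and DIAGNOSTIC settings, on the other hand, carry over, because the defaults of these options in the second JOB statement leave them unchanged.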

 

KRIGE directive

Calculates kriged estimates using a model fitted to the sample variogram.

 

Options

PRINT = string Controls printed output (description, search, weights, monitor, data); default desc

Y = variate or scalar Y positions or interval (not needed for 2-dimensional regular data i.e. when DATA is a matrix)

X = variate X positions (needed only for 2-dimensional irregular data)

YOUTER = variate Variate containing 2 values to define the Y-bounds of the region to be examined (bottom then top); by default the whole region is used

XOUTER = variate Variate containing 2 values to define the X-bounds of the region to be examined (left then right); by default the whole region is used

YINNER = variate Variate containing 2 values to define the Y-bounds of the interpolated region (bottom then top); no default

XINNER = variate Variate containing 2 values to define the X-bounds of the interpolated region (left then right); no default

BLOCK = variate Dimensions (length and height) of block; default !(0, 0) i.e. punctual kriging

RADIUS = scalar Maximum distance between target point in block and usable data

SEARCH = string Type of search (isotropic, anisotropic); default isot

MINPOINTS = scalar Minimum number of data points from which to compute elements; default 7

MAXPOINTS = scalar Maximum number of data points from which to compute elements (2 < MINPOINTS ≤ MAXPOINTS < 41); default 20

NSTEP = scalar Number of steps for numerical integration; (3 < NSTEP < 11); default 8

DRIFT = string Amount of drift (constant, linear, quadratic); default cons

YXRATIO = scalar Ratio of Y interval to X interval; default 1.0

INTERVAL = scalar Distance between successive interpolations; default 1.0

MVESTIMATE = string Whether to replace missing values within the outer region by kriged estimates for gridded data (no, yes); default no

 

Parameters

DATA = variates or matrices Observed measurements as a variate or, for data on a regular grid, as a matrix

ISOTROPY = strings Form of variogram (isotropic, Burgess, geometrical); default isot

MODEL = strings Model fitted to the variogram (power, boundedlinear, circular, spherical, doublespherical, pentaspherical, exponential, besselk1, gaussian); default powe

NUGGET = scalars The nugget variance

SILLVARIANCES = variates Sill variances of the spatially dependent component; default none

RANGES = variates Ranges of the spatially dependent component; default none

GRADIENT = variates Slope of the unbounded component; default none

EXPONENT = variates Power of the unbounded component; default none

PHI = variates Phi parameters of an anisotropic model (ISOTROPY = Burg or geom)

RMAX = variates Maximum gradient of an anisotropic model

RMIN = variates Minimum gradient of an anisotropic model

PREDICTIONS = matrices Kriged estimates

VARIANCES = matrices Estimation variances

 

Description

The KRIGE directive computes the ordinary kriging estimates of a variable at positions on a grid from data and a model variogram. The data must be supplied, using the DATA parameter, in one of the two forms as for the FVARIOGRAM procedure: i.e. for data on a regular grid, in a matrix defined with a variate of column labels to provide the x-values and a variate of row labels to provide the y-values or, for irregularly scattered data, as a variate with the X and Y options set to variates to supply the spatial coordinates.

By default all data are considered when forming the kriging system. However, a subset of the data may be selected by limiting the area to a rectangle defined by the XOUTER and YOUTER options. Each of these should be set to a variate with two values to define lower and upper limits in the x (East-West) and y (North-South) directions respectively.

The positions at which the variable is predicted (estimated) are contained in a rectangle defined by the XINNER, YINNER and INTERVAL options. XINNER and YINNER are set to variates similarly to XOUTER and YOUTER, and their limits should not lie outside those of XOUTER and YOUTER. INTERVAL is set to a scalar to define the distance between the successive positions in the rows and columns of the grid at which kriging is to be done, specified in the same units as the data. If the aim is to make a map, INTERVAL should be chosen so that it represents no more than 2 mm on the final printed document. The optimality of the kriging will then not be degraded noticeably by the subsequent contouring.

Kriging may be either punctual, i.e. at "points" which have the same size and shape as the sample support, or on bigger rectangular blocks. The size of the blocks is specified by the BLOCK option, in a variate whose two values define the length of the block first in the x direction (eastings) and then in the y direction (northings). By default the BLOCK variate contains two zero values, to give punctual kriging. The average semivariances between point and block are computed by integrating the variogram numerically over the block. The number of steps in each direction is defined by the NSTEP option. The default of 8 is recommended as a compromise between speed and accuracy. The kriging may be accelerated at the expense of accuracy by reducing NSTEP, or accuracy gained by increasing it. The minimum is 4 and the maximum 10.

The minimum and maximum number of points for the kriging system are set by the MINPOINTS and MAXPOINTS options. There is a minimum limit of 3 for MINPOINTS and a maximum of 40 for MAXPOINTS, and MINPOINTS must be less than or equal to MAXPOINTS. The defaults are 7 and 20 respectively. Data points may be selected around the point or block to be kriged by setting the RADIUS option to the radius within which they must lie. If the variogram is anisotropic, the search may be requested to be anisotropic by setting option SEARCH to anisotropic; by default SEARCH=isotropic.

Further options are available for regular data. Universal kriging may be invoked by setting the DRIFT option to linear or to quadratic, i.e. to be of order 1 or 2 respectively. The default is DRIFT=constant, which gives ordinary kriging. If the grid is not square, the ratio of the spacing in the y direction to that in the x direction is given by the YXRATIO option; the default of 1.0 corresponds to a square grid. Missing data on the grid may be interpolated by punctual kriging as a preliminary by setting the MVESTIMATE option to yes; the default setting is no.

The variogram is specified by its type and parameters, as follows. The MODEL parameter may be set to power, boundedlinear (one dimension only), circular, spherical, doublespherical, pentaspherical, exponential, besselk1 (Whittle's function) or gaussian. All models may have a nugget variance, supplied using the NUGGET parameter; this is the constant estimated by MVARIOGRAM. The parameters of the power function (the only unbounded model) are defined by the GRADIENT and EXPONENT parameters. The simple bounded models, i.e. all other settings of MODEL except doublespherical, require the SILLVARIANCES (the sill of the correlated variance) and RANGES parameters. The latter is strictly the correlation range of the boundedlinear, circular, spherical and pentaspherical models, while for the asymptotic models it is the distance parameter of the model. The doublespherical model requires SILLVARIANCES and RANGES to be set to variates of length two, to correspond to the two components of the model.

The ISOTROPY parameter allows the variation to be defined to be either isotropic or anisotropic in one of two ways: either Burgess anisotropy (Burgess and Webster 1980) or geometric anisotropy (Journel and Huijbregts 1978, Webster and Oliver 1990). The anisotropy is specified by three parameters, namely PHI, the angle in radians of the direction of maximum variation, RMAX, the maximum gradient of the model, and RMIN, the minimum gradient. In the current release only the power function may be anisotropic.

KRIGE calculates two matrices, one of predictions (or estimates), which can be saved using the PREDICTIONS parameter, and the other of the prediction (estimation or kriging) variances saved using the VARIANCES parameter. The matrices are arranged with the first row of each matrix at the bottom following geographic rather than mathematical convention.

The PRINT option controls the printed output, and allows the data and intermediate results to be printed. The setting data prints the data, search lists the results of the search for data around each position to be kriged, weights lists the kriging weights at each position, and monitor monitors the formation and inversion of the kriging matrices for each position. These settings enable you to check that the kriging is working reasonably. However, they can produce a great deal of output, and should not be requested when kriging large matrices, such as might be wanted for mapping.
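
A sketch of a call for irregularly scattered data, using only the options and parameters described above. The variates East, North and Yield are hypothetical and are assumed to have been declared and given values already. Pred and Kvar are likewise hypothetical names for the results, and the variogram parameter values are invented for illustration; in practice they would come from a fitted model:

KRIGE [PRINT=description; X=East; Y=North; \

XOUTER=!(0,100); YOUTER=!(0,100); XINNER=!(10,90); YINNER=!(10,90); \

INTERVAL=5; MINPOINTS=7; MAXPOINTS=20] \

DATA=Yield; MODEL=spherical; NUGGET=0.2; SILLVARIANCES=!(1.5); \

RANGES=!(30); PREDICTIONS=Pred; VARIANCES=Kvar

Here the inner rectangle lies wholly inside the outer one, as required, and BLOCK is left at its default so that the kriging is punctual.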

 

References

Burgess, T.M. and Webster, R. (1980). Optimal interpolation and isarithmic mapping of soil properties. I. The semi-variogram and punctual kriging. Journal of Soil Science, 31, 315-331.

Journel, A.G. and Huijbregts, C.J. (1978). Mining Geostatistics. Academic Press, London.

Webster, R. and Oliver, M.A. (1990). Statistical Methods in Soil and Land Resource Survey. Oxford University Press.

 

LIST directive

Lists details of the data structures currently available within Genstat.

 

Options

PRINT = strings What to print (identifier, attributes); default iden,attr

SYSTEM = string Whether to include "system" structures with prefix _ (yes, no); default no

SCOPE = string When used within a procedure, this allows the listing of structures in the program that called the procedure (SCOPE=external), or in the main program itself (SCOPE=global), rather than those within the procedure (local, external, global); default loca

 

Parameter

strings Types of structure to list (all, diagonal, dummy, expression, factor, formula, lrv, matrix, pointer, scalar, sspm, symmetric, table, text, tsm, variate); default all

 

Description

The LIST directive can be used to list the data structures that are currently available. It is particularly useful when you are working interactively to remind you about the data structures that you have set up, and the identifiers that you have used.

By default LIST prints details of relevant attributes, as well as the identifiers, but this can be controlled using the PRINT option.

The SYSTEM option of LIST controls whether structures whose identifiers begin with the underscore character _ are listed. This character is used as a prefix, for example, for the specialised structures set up by the Genstat menu system, so their inclusion could be confusing.

The SCOPE option can be used within a procedure to list the data structures in the program that called the procedure (SCOPE=external) or in the outermost part of the program (SCOPE=global).

 

LRV directive

Declares one or more LRV data structures.

 

Options

ROWS = scalar, vector, or pointer Number of rows, or row labels, for the matrix; default *

COLUMNS = scalar, vector, or pointer

Number of columns, or column labels, for matrix and diagonal matrix; default *

 

Parameters

IDENTIFIER = identifiers Identifiers of the LRVs

VECTORS = matrices Matrix to contain the latent vectors for each LRV

ROOTS = diagonal matrices Diagonal matrix to contain the latent roots for each LRV

TRACE = scalars Trace of the matrix

 

Description

The LRV is a compound data structure. Compound structures are similar to pointers in that they point to other structures, but they have a fixed number of elements, which must be of the correct types and must form a consistent set (in terms of their sizes and so on). You can refer to elements of compound structures in exactly the same way as the elements of pointers, but the suffixes and their labels are fixed for each type of structure. Unlike those of pointers, the labels are not case sensitive; Genstat will recognize the label in uppercase or lowercase letters or in any mixture of the two.

The LRV structure is used to store latent roots and vectors resulting from the decomposition of a matrix (by the FLRV directive), or produced in multivariate analysis. It points to three structures (identified by their suffixes):

[1] or ['VECTORS'] is a matrix whose columns are the latent vectors: the word "VECTOR" is used here in its mathematical sense rather than in the more specific Genstat sense; in fact, latent vectors are most conveniently stored in matrices rather than in Genstat vectors;

[2] or ['ROOTS'] is a diagonal matrix whose elements are the latent roots;

[3] or ['TRACE'] is a scalar holding the trace of the matrix, which is the sum of all its latent roots.

The length of each latent vector is specified by the ROWS option; this then defines the number of rows in the 'VECTORS' matrix. The COLUMNS option defines the number of latent roots to be stored; this is also the number of latent vectors, and so indicates the number of columns in the 'VECTORS' matrix and the number of elements in the 'ROOTS' matrix. If you do not specify the number of columns Genstat will set it to be the same as the number of rows. The value of COLUMNS can be less than the value of ROWS; however, it must not exceed that of ROWS, otherwise Genstat gives an error diagnostic. Row and column labels can be defined, as in the MATRIX directive.

You can specify identifiers for the three individual elements of the LRV by using the VECTORS, ROOTS, and TRACE parameters. If you have declared them already they must be of the correct type (and you can also have given them values). If you have given these identifiers row or column settings, then these will be used for the LRV declaration and must match any of the corresponding options of LRV that you choose to set.
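
A minimal sketch of such a declaration, storing two latent roots and vectors of length four; the identifiers Latent, Lvec, Lroot and Ltrace are hypothetical names chosen for illustration:

LRV [ROWS=4; COLUMNS=2] Latent; VECTORS=Lvec; ROOTS=Lroot; TRACE=Ltrace

After this declaration the four-by-two matrix of latent vectors can be referred to as Lvec, Latent[1] or Latent['Vectors'], the diagonal matrix of the two roots as Lroot, Latent[2] or Latent['roots'], and the trace as Ltrace, Latent[3] or Latent['TRACE'].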

 

MARGIN directive

Forms and calculates marginal values for tables.

 

Option

CLASSIFICATION = factors Factors classifying the margins to be formed; default * requests all margins to be formed

 

Parameters

OLDTABLE = tables Tables from which the margins are to be taken or calculated

NEWTABLE = tables New tables formed with margins

METHOD = strings Way in which the margins are to be formed for each table (totals, means, minima, maxima, variances, medians, deletion, or a null string to indicate that the marginal values are all to be set to the missing value); default tota

 

Description

You can use MARGIN to extend a table to contain marginal values, or to change the marginal values of a table that already has margins, or to delete the margins from a table. The tables whose margins are to be changed are specified by the OLDTABLE parameter. If you specify only this parameter, the new values replace those of the original tables. However, if you want to retain the original values, you can specify new tables to contain the amended values, using the NEWTABLE parameter. These tables will be declared automatically, if you have not declared them already.

The METHOD parameter controls the type of margins that are formed. If you set METHOD=deletion, all the margins of the tables are deleted but the body of the table is retained.

The CLASSIFICATION option specifies the list of factors for which you want to form marginal values. Genstat puts missing values in the margins that are excluded if the METHOD parameter is set to maxima or minima; for other settings of METHOD, Genstat puts in zeroes. The classifying sets for each table can be different, but all the factors in the CLASSIFICATION option must be in the classifying sets of each OLDTABLE.
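
A small sketch of forming and then deleting margins. The factors Block and Treat and the tables Counts, Ctotal and Cbody are hypothetical names, and the declarations assume the usual FACTOR and TABLE directives:

FACTOR [LEVELS=2] Block,Treat

TABLE [CLASSIFICATION=Block,Treat; VALUES=1,2,3,4] Counts

MARGIN OLDTABLE=Counts; NEWTABLE=Ctotal; METHOD=totals

PRINT Ctotal

MARGIN OLDTABLE=Ctotal; NEWTABLE=Cbody; METHOD=deletion

Assuming that the values are stored with the levels of Treat changing fastest, the Block totals in Ctotal would be 3 and 7, the Treat totals 4 and 6, and the grand total 10. Cbody would then contain the body of the table again, without margins.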

 

MATRIX directive

Declares one or more matrix data structures.

 

Options

ROWS = scalar, vector, or pointer Number of rows, or labels for rows; default *

COLUMNS = scalar, vector, or pointer

Number of columns, or labels for columns; default *

VALUES = numbers Values for all the matrices; default *

MODIFY = string Whether to modify (instead of redefining) existing structures (yes, no); default no

 

Parameters

IDENTIFIER = identifiers Identifiers of the matrices

VALUES = identifiers Values for each matrix

DECIMALS = scalars Number of decimal places for printing

EXTRA = texts Extra text associated with each identifier

MINIMUM = scalars Minimum value for the contents of each structure

MAXIMUM = scalars Maximum value for the contents of each structure

 

Description

A matrix stores a set of numbers as a two-dimensional array indexed by rows and columns. For example, the array

1 2 3 4

5 6 7 8

9 10 11 12

is called a three-by-four matrix.

You use the ROWS and COLUMNS options to specify the size of the matrices that are being defined. The simplest way of doing this is to use scalars to define the numbers of rows and columns explicitly. Alternatively, you can set ROWS (or COLUMNS) to a variate, text, or pointer, whose length then defines the number of rows (or columns) and whose values will then be used as labels, for example when the matrix is printed. Finally, if you specify a factor, the number of levels defines the number of rows or columns and the labels if available, or otherwise the levels, are used for labelling.

Values can be supplied for the matrices using either the VALUES option or the VALUES parameter. The option defines a common value (or set of values) for all the matrices in the declaration, while the parameter allows them each to be given different values. With the option you must supply a list of values. With the parameter, however, you must give a list of identifiers of data structures of the appropriate mode; unnamed data structures are particularly useful for this. Thus, to declare the matrix above, we can put:

MATRIX [ROWS=3; COLUMNS=4] X; VALUES=!(1,2,3,4,5,6,7,8,9,10,11,12)

If both the option and the parameter are specified, the parameter takes precedence.

The DECIMALS parameter can be used to define the number of decimal places that Genstat will use by default whenever the values of the matrix are printed. This applies to output either by PRINT or from an analysis (but it does not affect the accuracy with which the numbers are stored).

You can associate a text with each data structure by means of the parameter EXTRA. This text is then used by many Genstat directives to give a fuller annotation of output.

The MINIMUM and MAXIMUM parameters allow you to define lower and upper limits on the values expected for any structure that stores numbers. Genstat then prints warnings if any values outside that range are assigned to the structure.

If you are declaring any of the matrices for a second time, by default you will lose all its existing attributes and values. You can retain those that remain valid by setting option MODIFY=yes.
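
The following sketch brings several of these settings together; the identifiers Dose, A and B are hypothetical. The VALUES option gives both matrices the same twelve values, the text Dose supplies the row labels (and hence the number of rows), and the DECIMALS parameter is set separately for each matrix:

TEXT [VALUES='Low','Mid','High'] Dose

MATRIX [ROWS=Dose; COLUMNS=4; VALUES=1,2,3,4,5,6,7,8,9,10,11,12] A,B; \

DECIMALS=0,1

MATRIX [MODIFY=yes] A; DECIMALS=2

The final statement is intended to change only the number of decimal places used for A, with MODIFY=yes retaining its other attributes and its values.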

 

MDS directive

Performs non-metric multidimensional scaling.

 

Options

PRINT = strings Printed output required (coordinates, roots, distances, fitteddistances, stress, monitoring); default * i.e. no printing

DATA = symmetric matrix Distances amongst a set of units

METHOD = string Whether to use non-metric scaling, or metric scaling with linear regression of the fitted distances to the actual distances (nonmetric, linear); default nonm

SCALING = string Whether least-squares, least-squares-squared, or log-stress scaling is to be used (ls, lss, logstress); default ls

TIES = string Treatment of tied data values (primary, secondary, tertiary); default prim

WEIGHTS = symmetric matrix Weights for each distance value; default * i.e. all distances with weight one

INITIAL = matrix Initial configuration; default * i.e. a principal coordinate solution is used

NSTARTS = scalar Number of starting configurations to be used, by perturbing the initial configuration; default 1

MAXCYCLE = scalar Maximum number of iterations; default 30

 

Parameters

NDIMENSIONS = scalars Number of dimensions for each solution

COORDINATES = matrices To store the coordinates of the units for each solution

STRESS = scalars To store the stress value for each solution

DISTANCES = symmetric matrices To store the distances amongst the points for the units in the fitted number of dimensions

FITTEDDISTANCES = symmetric matrices

To store the fitted distances from the monotonic (METHOD=nonmetric) or linear (METHOD=linear) regression

 

Description

The MDS directive carries out iterative scaling, including metric and non-metric scaling. The input data consists of a symmetric matrix whose values may be interpreted, in a general sense, as distances between a set of objects. The matrix is specified by the DATA option; thus only one matrix can be analysed each time the MDS directive is used.

The objective of the MDS directive is to find a set of coordinates whose inter-point distances match, as closely as possible, those of the input data matrix. When plotted, the coordinates provide a display which can be interpreted in the same way as a map: for example, if points in the display are close together, their distance apart in the data matrix was small.

The MDS directive uses the method of steepest descent to move from an initial configuration of points to the final matrix of coordinates, which has the minimum stress of all the configurations examined.

Printed output is controlled by the PRINT option; by default nothing is printed. There are six possible settings:

coordinates prints the solution coordinates, rotated to principal coordinates;

roots prints the latent roots of the solution coordinates;

distances prints the inter-unit distances, computed from the solution configuration;

fitteddistances prints the fitted values from the regression of the inter-unit distances on the distances in the data matrix; the regression may be monotonic or linear through the origin, depending on the setting of the METHOD option;

stress prints the stress of the solution coordinates;

monitoring prints a summary of the results at each iteration.

The METHOD option determines whether metric or non-metric scaling is performed. The algorithm involves regression of the distances, calculated from the solution coordinates, against the dissimilarities in the symmetric matrix specified by the DATA option. With the default setting, METHOD=nonmetric, monotonic regression is used; if METHOD=linear, the algorithm uses linear regression through the origin.

The stress function to be minimized can be selected using the SCALING option. There are three possibilities.

ls (least squares):  

lss (least-squares-squared):  

logstress:  

where the dij are the elements of the input dissimilarity matrix and the d̂ij are the fitted values from the regression specified by the METHOD option.

The TIES option controls how tied values in the input data matrix are treated. By default, the treatment of ties is primary: no restrictions are placed on the distances corresponding to tied dissimilarities in the input data matrix. In the secondary treatment of ties, the distances corresponding to tied dissimilarities are required to be as nearly equal as possible. Kendall (1977) describes a compromise between the primary and secondary approaches: the block of ties corresponding to the smallest dissimilarity is handled by the secondary treatment, and the remaining blocks of ties are handled by the primary treatment. This tertiary treatment of ties is useful when the dissimilarities take only a few values. For example, in the reconstruction of maps from abuttal information, the dissimilarity coefficient takes only two values: zero if localities abut, and one if they do not. The block of ties associated with the dissimilarity of zero is handled by the secondary treatment, and the block of ties with dissimilarity one by the primary treatment.

The WEIGHT option can be used to specify a symmetric matrix of weights. Each element of the matrix gives the weight to be attached to the corresponding element of the input data matrix. If the option is not set, the elements of the data matrix are weighted equally. The most important use of the option occurs when the matrix of weights contains only zeros and ones; the zeros then correspond to missing values in the input data matrix, allowing incomplete data matrices to be scaled. Up to about two thirds of the data matrix may be missing before the algorithm breaks down. This enables experimenters to design studies in which only a subset of all the dissimilarities need to be observed. This is particularly useful when there are a large number of units; if the number of units is m, say, a complete m × m data matrix requires m(m-1)/2 dissimilarities to be observed.

Since the algorithm is an iterative one, making use of the method of steepest descent, there is no guarantee that the solution coordinates found from any given starting configuration have the minimum stress of all possible configurations. The algorithm may have found a local, rather than the global, minimum. This problem may be partially overcome by using a series of different starting configurations. If several of the starts arrive at the same lowest stress solution, then you may be reasonably confident of having found the global minimum. The NSTARTS option determines the number of starting configurations to be used. The starting configuration used on the first start can be specified by the INITIAL option; if this is not set, the default is to take the principal coordinate solution obtained from a PCO analysis of the input dissimilarity matrix. Subsequent starting configurations are found by perturbing each coordinate of the first starting configuration by successively larger amounts. This strategy generally results in at least one starting configuration that does not get trapped in a local minimum; however, there can be no guarantee that the global minimum for the stress function has been found. Experience suggests that, for safety, the NSTARTS option should be set equal to at least 10. By default NSTARTS=1.

The MAXCYCLE option determines the maximum number of iterations of the algorithm. The default of 30 should usually be sufficient. However, it may be necessary to set a larger value for very large data matrices or when using the logstress setting of the SCALING option. The monitoring setting of the PRINT option may be used to see how convergence is progressing.

The NDIMENSIONS parameter must be set to a scalar (or scalars) to indicate the number(s) of dimensions in which the multidimensional scaling is to be performed on the data matrix. An MDS statement with a list of scalars will carry out a series of scaling operations, all based on the same matrix of dissimilarities, but with different numbers of dimensions.

The remaining parameters of the MDS directive allow output to be saved in Genstat data structures. The COORDINATES parameter can list matrices to store the minimum stress coordinates in each of the dimensions given by the NDIMENSIONS parameter, and the STRESS parameter can specify scalars to store the associated minimum stresses. The parameters DISTANCES and FITTEDDISTANCES can specify symmetric matrices to store the distances computed from the coordinates matrix and the fitted distances computed from the monotonic or linear regressions, respectively.
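As a minimal sketch of how these options and parameters fit together, the statements below scale a small dissimilarity matrix in two and in three dimensions, using ten starting configurations as suggested above. The matrix values and the identifiers Dissim, Coord2, Coord3, Str2 and Str3 are hypothetical; the values of the symmetric matrix are assumed to be supplied row by row as a lower triangle, including the diagonal.

```genstat
"Hypothetical dissimilarities among six units (lower triangle, row by row)"
SYMMETRICMATRIX [ROWS=6; VALUES=0, \
  2,0, \
  4,3,0, \
  5,4,2,0, \
  7,6,4,3,0, \
  8,7,6,4,2,0] Dissim
"Non-metric scaling in 2 and 3 dimensions, each with 10 starting configurations"
MDS [PRINT=coordinates,stress; DATA=Dissim; METHOD=nonmetric; NSTARTS=10] \
  NDIMENSIONS=2,3; COORDINATES=Coord2,Coord3; STRESS=Str2,Str3
"Compare the minimum stress found for each number of dimensions"
PRINT Str2,Str3
```

Comparing the saved stresses is one way to judge whether the extra dimension is worthwhile.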

 

Reference

Kendall, D.G. (1977). On the tertiary treatment of ties. Proceedings of the Royal Society of London, Series A 354, 407-423.

 

MERGE directive

Copies subfiles from backing-store files into a single file.

 

Options

PRINT = string What to print (catalogue); default *

OUTCHANNEL = scalar Channel number of the backing-store file where the subfiles are to be stored; default 0, i.e. the workfile

METHOD = string How to append subfiles to the OUT file (add, overwrite, replace); default add, i.e. clashes in subfile identifiers cause a fault (note: replace overwrites the complete file)

PASSWORD = text Password to be checked against that stored with the file; default *

 

Parameters

SUBFILE = identifiers Identifiers of the subfiles

INCHANNEL = scalars Channel number of the backing-store file containing each subfile

NEWSUBFILE = identifiers Identifier to be used for each subfile in the new file

 

Description

The MERGE directive is used to copy subfiles into another backing-store file. You can either add the subfiles to an existing backing-store file, or form a new backing-store file.

The OUTCHANNEL option specifies the backing-store channel of the file to which the subfiles are to be copied; by default this is the workfile (channel 0).

The SUBFILE parameter specifies the list of subfiles that are to be copied, and the INCHANNEL parameter indicates the channel of the backing-store file where each one is currently stored. If you do not specify the INCHANNEL parameter, Genstat assumes that the subfiles are coming from the workfile. You are not allowed to include the OUTCHANNEL among the channels in the INCHANNEL list. Also, you cannot store two subfiles with the same names, and should use the NEWSUBFILE parameter to rename any that clash. For example

MERGE [OUTCHANNEL=3] JanData,JulyData,JanData; INCHANNEL=1,1,2; \

NEWSUBFILE=Jan92dat,Jul92dat,Jan93dat

To rename only some of the subfiles, you can either respecify the existing identifier, or insert * at the appropriate point in the NEWSUBFILE list.

If you specify a missing identifier * in the SUBFILE list, Genstat will include all the subfiles from the relevant INCHANNEL. If you want to rename any of these subfiles, you can also mention it explicitly. For example, this statement will take all the subfiles from channel 1 and rename subfile Sub as Subf.

MERGE *,Sub; INCHANNEL=1; NEWSUBFILE=*,Subf

You can set option PRINT=catalogue to produce a catalogue of the subfiles in the new backing-store file.

If a subfile of the specified name already exists on the backing-store file, the storing operation will usually fail. However, you can set option METHOD=overwrite to overwrite the old subfile, that is, to replace the old subfile with a new subfile. Alternatively, you can put METHOD=replace to form a new backing-store file containing only the new subfiles.
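The following sketch shows METHOD=overwrite together with PRINT=catalogue. The file names and the subfile identifier JanData are hypothetical, and the OPEN statements are assumed to be the way that the backing-store files were attached to channels 1 and 3.

```genstat
"Attach two backing-store files (hypothetical names) to channels 1 and 3"
OPEN 'source.gbs'; CHANNEL=1; FILETYPE=backingstore
OPEN 'target.gbs'; CHANNEL=3; FILETYPE=backingstore
"Copy JanData from channel 1; an existing JanData on channel 3 is overwritten"
MERGE [PRINT=catalogue; OUTCHANNEL=3; METHOD=overwrite] JanData; INCHANNEL=1
```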

Subfiles are merged in a fixed order. Genstat first takes the subfiles from the backing-store file with the lowest channel number, in the order in which they occur there, then it takes the subfiles from the next lowest channel number, and so on. If OUTCHANNEL=0 (that is, the new file is the workfile), the original subfiles that are to be retained from that file will be followed by the new subfiles; otherwise, if OUTCHANNEL is non-zero, the original subfiles are placed after the new subfiles. If you want to put the subfiles into a particular order, you should merge them into the workfile in that order, and then merge the workfile into a new userfile.
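One way to apply this advice is sketched below. The subfile identifiers Late and Early are hypothetical, and the sketch assumes that backing-store files are already open on channels 1, 2 and 3, and that the workfile is initially empty.

```genstat
"Build the required order in the workfile: Late first, then Early"
MERGE [OUTCHANNEL=0] Late; INCHANNEL=2
MERGE [OUTCHANNEL=0] Early; INCHANNEL=1
"Copy every subfile from the workfile (the default INCHANNEL) to channel 3"
MERGE [PRINT=catalogue; OUTCHANNEL=3] *
```

Merging one subfile at a time into the workfile avoids the channel-number ordering that would otherwise place Early (channel 1) before Late (channel 2).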

To keep the new file secure, you can use the PASSWORD option to incorporate a password. Once you have done this, you must include the same password in any future use of MERGE or STORE with this same userfile; spaces, case, and newlines are significant in the password. You cannot change the password in a userfile once you have set it, but you can use the MERGE directive to create a new userfile with no password or with a new password. If you set the password to be a text whose values have been restricted, the restriction is ignored.

 

MODEL directive

Defines the response variate(s) and the type of model to be fitted for linear, generalized linear, generalized additive, and nonlinear models.

 

Options

DISTRIBUTION = string Distribution of the response variable (normal, poisson, binomial, gamma, inversenormal, multinomial, calculated, negativebinomial, geometric, exponential, bernoulli); default norm

LINK = string Link function (canonical, identity, logarithm, logit, reciprocal, power, squareroot, probit, complementaryloglog, calculated, logratio); default cano (i.e. iden for DIST=norm or calc; loga for DIST=pois; logi for DIST=bino, bern, or mult; reci for DIST=gamm or expo; powe for DIST=inve; logr for DIST=nega or geom)

EXPONENT = scalar Exponent for power link; default -2

AGGREGATION = scalar Fixed parameter for negative binomial distribution (parameter k as in variance function Var = mean + mean²/k); default 1

KLOGRATIO = scalar Parameter for logratio link, in form log(mean/(mean+k)); default as set in AGGREGATION option

DISPERSION = scalar Value of dispersion parameter in calculation of s.e.s etc; default * for DIST=norm, gamm, inve, or calc, and 1 for DIST=pois, bino, mult, nega, geom, expo or bern

WEIGHTS = variate or symmetric matrix

Variate of weights for weighted regression, or symmetric matrix of weights (one row and column for each unit of data) for generalized least squares; default *

OFFSET = variate Offset variate to be included in model; default *

GROUPS = factor Absorbing factor defining the groups for within-groups linear or generalized linear regression; default *

RMETHOD = string Type of residuals to form, if any, after each model is fitted (deviance, Pearson, simple); default devi

DMETHOD = string Basis of estimate of dispersion, if not fixed by DISPERSION option (deviance, Pearson); default devi

FUNCTION = scalar Scalar whose value is to be minimized by calculation; default *

YRELATION = string Whether to analyse the y-variates separately, as in ordinary regression, or to analyse them cumulatively as counts in successive categories of a multinomial distribution (separate, cumulative); default sepa

DCALCULATION = expression structures

Calculations to define the deviance contributions and variance function for a non-standard distribution; must be specified when DIST=calc

LCALCULATION = expression structures

Calculations to define the fitted values and link derivative for a non-standard link; must be specified when LINK=calc

SAVE = identifier To name regression save structure; default *

 

Parameters

Y = variates Response variates; only the first is used in nonlinear models and in generalized linear models except when DIST=mult, when they specify the numbers in each category of an ordinal response model

NBINOMIAL = variate or scalar Total numbers for DIST=bino

RESIDUALS = variates To save residuals for each y variate after fitting a model

FITTEDVALUES = variates To save fitted values, and provide fitted values if no terms are given in FITNONLINEAR

LINEARPREDICTOR = variate Specifies the identifier of the variate to hold the linear predictor

DERIVATIVE = variate Specifies the identifier of the variate to hold the derivative of the link function at each unit

DEVIANCE = variate Specifies the identifier of the variate to hold the contribution to the deviance from each unit

VFUNCTION = variate Specifies the identifier of the variate to hold the value of the variance function at each unit

 

Description

The MODEL directive does not actually fit anything: it simply sets up some structures inside Genstat that are used when you give a FIT, FITCURVE, or FITNONLINEAR statement later on. So when you are doing regression, MODEL will always be accompanied by at least one other regression statement to fit a model, like FIT.

The Y parameter allows a list of variates; if you put more than one for linear regression, then you will get an analysis for each. This is a more efficient way of doing many linear regressions with the same explanatory variables than using separate pairs of MODEL and FIT statements. With additive models, generalized linear models, and nonlinear models, only the first variate will be analysed (with the exception of multinomial response models); the others will be ignored.

The RESIDUALS and FITTEDVALUES parameters allow you to specify variates to contain the residuals and fitted values for each response variable. The residuals are the "unexplained" component of the response variable, standardized in some way according to the RMETHOD option. The fitted values are the "explained" component: that is, the combination of parameters and explanatory variables fitted in the model. You can get access to these sets of values in a different way through the RKEEP directive.
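For illustration, the sketch below fits the same linear regression to two response variates and saves the residuals and fitted values for each. All identifiers and data values are hypothetical.

```genstat
"Hypothetical data: one explanatory variate and two responses"
VARIATE [VALUES=1,2,3,4,5,6] Dose
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Resp1
VARIATE [VALUES=5.0,4.6,4.1,3.3,3.0,2.4] Resp2
"One MODEL statement sets up both analyses; Res1, Fit1 etc. need not be declared"
MODEL Resp1,Resp2; RESIDUALS=Res1,Res2; FITTEDVALUES=Fit1,Fit2
"FIT then produces a separate analysis for each response"
FIT Dose
PRINT Dose,Resp1,Fit1,Res1
```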

The DISTRIBUTION and LINK options are used to specify a generalized linear model (McCullagh and Nelder 1989). By default the data are assumed to follow a Normal distribution, as required for ordinary linear regression, but other distributions can be selected using the DISTRIBUTION option. The LINK option specifies the link function that relates the linear model to the expected values of the distribution; in the default ordinary linear regression, this is the identity function (indicating no transformation). So, for example, for a log-linear model we would specify DISTRIBUTION=poisson and LINK=logarithm, while for logistic regression we would have DISTRIBUTION=binomial and LINK=logit. The NBINOMIAL parameter must also be set when DISTRIBUTION=binomial, to give the number of binomial trials for each unit.
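The two cases mentioned above can be sketched as follows. The identifiers and data values are hypothetical, chosen only to make the statements self-contained.

```genstat
"Hypothetical data"
VARIATE [VALUES=1,2,3,4,5,6] Dose
VARIATE [VALUES=2,3,6,7,12,15] Count
VARIATE [VALUES=1,3,6,9,14,18] Dead
VARIATE [VALUES=20,20,20,20,20,20] Total
"Log-linear model: Poisson distribution with logarithmic link"
MODEL [DISTRIBUTION=poisson; LINK=logarithm] Count
FIT Dose
"Logistic regression: NBINOMIAL gives the number of trials for each unit"
MODEL [DISTRIBUTION=binomial; LINK=logit] Dead; NBINOMIAL=Total
FIT Dose
```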

The EXPONENT option specifies the exponent when LINK=power. Similarly, the AGGREGATION option specifies the aggregation parameter k when DISTRIBUTION=negativebinomial. This is a measure of the tendency for observations to cluster together, which appears in the formula for the variance as a function of the mean:

variance = mean + mean²/k

The default value of k is set at 1, which corresponds to the geometric distribution. The parameter k must be positive, and as it increases to infinity the distribution approaches the Poisson distribution. The KLOGRATIO option sets the parameter k for the logratio link.
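A minimal sketch of a negative binomial model with a fixed aggregation parameter is shown below. The data and the value k=2 are hypothetical; the link is set explicitly to logarithm rather than left at the canonical logratio link.

```genstat
"Hypothetical overdispersed counts"
VARIATE [VALUES=1,2,3,4,5,6,7,8] Site
VARIATE [VALUES=0,3,1,8,2,15,6,21] Count
"Variance function is mean + mean**2/2 because AGGREGATION=2"
MODEL [DISTRIBUTION=negativebinomial; LINK=logarithm; AGGREGATION=2] Count
FIT Site
```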

You can also define your own distribution or link function for a generalized linear model. To specify your own distribution, you need to set DISTRIBUTION=calculated and then specify expression structures with the DCALCULATION option to calculate the deviance and the variance function for each unit of the response variate, using the current values of the fitted-values variate. You must also set the FITTEDVALUES, DEVIANCE, and VFUNCTION parameters to indicate which identifiers are used to represent these in the expressions. To specify your own link, you need to set LINK=calculated and provide expressions with the LCALCULATION option for two other calculations to form the fitted values and the derivative of the link function for each unit of the response variate, using the current values of the linear predictor. You must also set the FITTEDVALUES, LINEARPREDICTOR, and DERIVATIVE parameters to specify the identifiers used to represent these in the calculations. In addition, you must provide initial values for the linear predictor, so that the iterative process can get started: often this can be done just by applying the link function to the response variate itself, but it may be necessary to modify extreme values such as 0 that may be mapped to infinity by the link function.

You can fit ordinal response models by setting option YRELATION=cumulative and option DISTRIBUTION=multinomial.

The DISPERSION option controls how the variance of the distribution of the response values is calculated. By default, the variance is estimated from the residual mean square, and standard errors and standardized residuals are calculated from the estimate. If you use DISPERSION to supply a value for the variance of the Normal distribution, or for the dispersion parameter of other distributions, then standard errors and residuals are based on this given value instead. In a generalized linear model, the dispersion of the chosen distribution can be fixed at a value provided by the DISPERSION option, or estimated from either the residual deviance or the Pearson chi-squared statistic, as specified by the DMETHOD option.

The WEIGHTS option allows you to specify a variate holding weights for each unit. In simple linear regression, the estimate of dispersion is then the weighted residual mean square. Thus, if the variance of the response variable is not constant, and you know the relative size of the variance for each observation, you can set the weight to be proportional to the inverse of the variance of an observation. Alternatively, if the variance is related in a simple way to the mean, you may just need to specify a different distribution for the response. The WEIGHTS option can also be set to a symmetric matrix, supplying weights corresponding to some pattern of correlation or covariance between units as well as variance of each unit. The subsequent analysis is known as generalized least-squares if the response distribution is Normal.
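As a sketch of the variate form of the option, suppose that the relative variance of each observation is known and held in a variate. The identifiers and values below are hypothetical.

```genstat
"Hypothetical data with known relative variances"
VARIATE [VALUES=1,2,3,4,5,6] Xval
VARIATE [VALUES=1.1,2.3,2.8,4.4,4.7,6.5] Yval
VARIATE [VALUES=1,1,2,2,4,4] RelVar
"Weights proportional to the inverse of the variance of each observation"
CALCULATE Wt = 1/RelVar
MODEL [WEIGHTS=Wt] Yval
FIT Xval
```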

The OFFSET option allows you to include in the regression a variable with no corresponding parameter. Linear regression analysis of Y with offset O is just the same as analysis of Y-O, but the offset has non-trivial applications in generalized linear models.
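A common non-trivial use, sketched here under hypothetical data, is a log-linear model for counts observed over different exposures: the logarithm of the exposure enters the linear predictor with no parameter attached, so the model describes rates rather than raw counts.

```genstat
"Hypothetical counts, exposures and one explanatory variate"
VARIATE [VALUES=1,2,3,4,5,6] Age
VARIATE [VALUES=3,5,9,10,18,25] Events
VARIATE [VALUES=100,120,150,110,140,160] Exposure
"The offset is on the scale of the linear predictor, hence the logarithm"
CALCULATE LogExp = LOG(Exposure)
MODEL [DISTRIBUTION=poisson; LINK=logarithm; OFFSET=LogExp] Events
FIT Age
```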

The GROUPS option specifies a factor whose effects you want to eliminate before any regression is fitted. The factor must already have been defined. This method of elimination is sometimes called absorption; you might want to use it when data from many different groups are to be modelled. Use of GROUPS gives less information than you would get if you included the factor explicitly in the model (leverages, predictions, and some parameter correlations cannot be formed), but it saves space and time in fitting the model when the factor has many levels. You can use GROUPS only with linear and generalized linear regression.

The RMETHOD option controls how residuals are formed. By default, residuals are deviance residuals standardized by their estimated variance. The alternative Pearson residuals are defined in exactly the same way if the distribution is Normal, but for regression models with distributions other than Normal the two kinds of residual are different. If you do not want residuals, you can set the option to a missing value (*) to save space within Genstat. However, you will then not be able to get residuals, fitted values, or leverages, and the automatic checks on the fit of a model will not be done.

The FUNCTION option is relevant only when you want to optimize a general function (see FITNONLINEAR). It is ignored unless no response variates are specified by the Y parameter.

The SAVE option allows you to specify an identifier for the regression save structure. This structure stores the current state of the regression model, and can be used explicitly in the directives RDISPLAY, RKEEP, PREDICT, and RFUNCTION. If the identifier in SAVE is of a regression save structure that already has values, those values are deleted. You can reset the current regression save structure at any point in a program by using the SET directive. Then, later regression statements would use the model stored in this save structure.
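The sketch below keeps two regression models available at once by naming their save structures. The data and identifiers are hypothetical, and the PRINT=estimates setting of RDISPLAY is an assumption that is not described in this section.

```genstat
"Hypothetical data"
VARIATE [VALUES=1,2,3,4,5,6] Xval
VARIATE [VALUES=2.0,4.1,5.9,8.2,9.8,12.1] Ya
VARIATE [VALUES=9.0,7.2,6.1,3.8,2.9,1.1] Yb
"Fit two models, each with its own save structure"
MODEL [SAVE=SaveA] Ya
FIT Xval
MODEL [SAVE=SaveB] Yb
FIT Xval
"Redisplay results from the first model, although SaveB is now the current one"
RDISPLAY [PRINT=estimates; SAVE=SaveA]
```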

 

Reference

McCullagh, P. and Nelder, J.A. (1989). Generalized linear models (second edition). Chapman and Hall, London.

 

MONOTONIC directive

Fits an increasing monotonic regression of y on x.

 

No options

 

Parameters

Y = variates Y-values of the data points

X = variates X-values of the data points; default is to assume that the x-values are monotonically increasing

RESIDUALS = variates Variate to save the residuals from each fit

FITTEDVALUES = variates Variate to save the fitted values from each fit

 

Description

Monotonic regression plays a key role in non-metric multidimensional scaling, which is available in Genstat via the MDS directive. However, it can be useful in its own right, so the method has been made accessible by the MONOTONIC directive. A monotonic regression through a set of points is simply the line that best fits the points subject to the constraint that it never decreases: of course the line need not be straight, in fact it rarely will be. If you need a monotonically decreasing line, you can simply subtract all the y-values from their maximum, find the monotonically increasing regression, and then back-transform the data and fitted line, and change the sign of the residuals.

The MONOTONIC directive has no options. It has four parameters: Y to specify the y-values, X for the x-values, RESIDUALS to save the residuals, and FITTEDVALUES to save the fitted values. The x-values need not be supplied, in which case the directive assumes that the y-values are in increasing order of the x-values. In common with the other regression directives, the variates to save the residuals and fitted values need not be declared in advance.
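The statements below sketch both an increasing fit and the decreasing fit obtained by the subtraction recipe described above. The data and identifiers are hypothetical; MAXIMUM is assumed to return the largest value of the variate as a scalar.

```genstat
"Hypothetical data; Xval is already in increasing order"
VARIATE [VALUES=1,2,3,4,5,6] Xval
VARIATE [VALUES=1.0,3.0,2.0,4.0,3.5,5.0] Yval
"Increasing monotonic regression; Res and Fit are declared automatically"
MONOTONIC Y=Yval; X=Xval; RESIDUALS=Res; FITTEDVALUES=Fit
PRINT Xval,Yval,Fit,Res
"Decreasing fit: subtract the y-values from their maximum and fit again"
CALCULATE Ymax = MAXIMUM(Yval)
CALCULATE Yrev = Ymax - Yval
MONOTONIC Y=Yrev; X=Xval; RESIDUALS=Rrev; FITTEDVALUES=Frev
"Back-transform the fitted line and change the sign of the residuals"
CALCULATE Fdec = Ymax - Frev
CALCULATE Rdec = -Rrev
PRINT Xval,Yval,Fdec,Rdec
```

With this back-transformation, Rdec equals Yval - Fdec, so the residuals keep their usual sign convention.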