QUESTION directive 

Obtains a response using a Genstat menu.

 

Options

PREAMBLE = text Text posing a question; (no default)

PROMPT = text Text to be used as final prompt; the default prompt specifies the mode of response and lists the default values (if any), in brackets, followed by ">"

RESPONSE = identifier Structure to store response; default * allows a menu to be saved without being executed

MODE = string Mode of response (e, f, p, t, v); default p

DEFAULT = identifier Response to be assumed if just <RETURN> is given; default is to repeat the prompt until a response is obtained

LIST = string Whether a list of responses, rather than a single response, is valid (yes, no); default no

DECLARED = string Whether identifiers must already be declared (yes, no); default no

TYPE = strings Allowed types for identifiers (datamatrix i.e. pointer to variates of equal lengths as required in multivariate analysis, diagonalmatrix, dummy, expression, factor, formula, LRV, matrix, pointer, scalar, SSPM, symmetricmatrix, table, text, TSM, variate); default *, meaning no limitation

PRESENT = string Whether the identifier must have values (yes, no); default no

LOWER = scalar Lower limit for numbers; default *, meaning no check

UPPER = scalar Upper limit for numbers; default *, meaning no check

HELP = text Text to be used in response to a general query for the question; default *

SAVE = pointer Saves or reinputs the specification of the menu (which is then used for any options or parameters not redefined)

 

Parameters

VALUES = texts Possible codes for MODE t; (no default for MODE t; not relevant for others)

CHOICE = texts Text giving explanation of each letter code; (no default for MODE t; not relevant for others)

HELP = texts Text to be used in response to a specific query for a code; default *

 

Description

The QUESTION directive displays a Genstat menu and obtains a response when in interactive mode. In batch, the directive does nothing. Here is a simple example that asks the user to provide the identifier of a variate structure; this statement is actually in the part of the standard menu system that provides analysis of variance.

QUESTION [PREAMBLE=!t('Y-VARIATE Menu (from ANOVA Menu)',*,\

'What is the variate to be analysed ?'); RESPONSE=_yvar; \

DECLARED=yes; TYPE=variate; PRESENT=yes]

The PREAMBLE option specifies a text structure, whose contents are printed at the beginning of the menu. Following this is the prompt: by default, this consists of a reminder of what type of answer is expected, followed by the greater-than symbol (>). However, there is a PROMPT option that allows any text to be printed instead, before the greater-than symbol.

The RESPONSE option specifies a dummy identifier that will point to the answer given by the user. Note that the identifiers used in the standard menu system all begin with the underline character (_) to reduce the chance of a clash with your own identifiers. Menus can request information in one of five modes; the default is Mode p (pointer), as here, and expects a response to consist of an identifier; but the MODE option can also be set to v (variate), t (text), e (expression), or f (formula). When a correct answer has been received, an unnamed structure of the relevant type (pointer, variate, or whatever, but see later for text mode) is set up, and the dummy in the RESPONSE option is set to point at this unnamed structure.

Thus, if you give the identifier Y in response to the question above, the dummy _yvar will store the identifier of a pointer containing the single identifier Y after the QUESTION statement above has been executed. So in the standard menu system, the statement following this QUESTION statement is

ANOVA #_yvar

the hash (#) being needed to substitute the values of the unnamed pointer that is stored in the dummy structure _yvar.

By default, a question will expect to receive a single item of the specified mode: identifier, number, string, expression, or formula. However, if the option LIST is set to yes for modes p, v, or t, then a list of items is expected. The unnamed structure set up to store the answer will then contain as many values as there are items in the list.

The other three options in the example above specify restrictions on the answer that will be accepted. The DECLARED option specifies that the identifier must be of a structure that has already been declared. If a previously unused identifier is given, the QUESTION statement will print a warning, and issue the prompt again. Similarly, the TYPE option specifies what type of structure is acceptable; the setting may be a list of types if relevant. The PRESENT option specifies that the structure must already have values. Two further options, LOWER and UPPER, can be used to specify limits for numbers given in response to questions of mode v.
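
As an illustration of the LOWER and UPPER options, a question of mode v that accepts only a number between 1 and 20 might be written as in the following sketch. The identifier Nrep and the wording of the preamble are hypothetical, and the sketch assumes that a number can be supplied directly for DEFAULT in mode v.

" Ask for one number between 1 and 20; just <RETURN> is taken as 3 "

QUESTION [PREAMBLE='How many replicates are required ?'; \

RESPONSE=Nrep; MODE=v; DEFAULT=3; LOWER=1; UPPER=20]

A reply outside the limits should be rejected and the prompt issued again, in the same way as for an identifier that fails the DECLARED, TYPE or PRESENT checks.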

Most menus in the standard system are of mode t, and resemble more closely what most people think of as a menu than does the simple display above. Such menus require extra information to be specified using parameters of the QUESTION directive. The VALUES parameter should be set to a list of text structures, each of which stores a single string that is to be accepted as an answer to the question. The CHOICE parameter should be set to another list of text structures, each storing a single string that is to be displayed by the side of the corresponding code in the menu to explain it. This example shows a further question from the standard system.

QUESTION [PREAMBLE=!t('INPUT menu',*, \

'Where are the data values ?'); RESPONSE=_cdsourc; \

MODE=t; DEFAULT='b'] VALUES='b','s','t'; CHOICE= \

'in a binary file previously set up using Genstat', \

'in a character file, with values separated by spaces', \

'to be typed at the terminal'

The codes must obey the rules for unquoted strings: that is, they must start with a letter and consist only of letters and digits. Only the first eight characters will be displayed, and only the first eight characters of the answer will be checked - all eight must match. Usually, of course, it is convenient to use single-letter codes.

Note that mode t cannot be used to ask the user for an arbitrary string, for example to provide a label for output. To request such information, you must use mode p, and set TYPE=text; the user must then supply the string in quotes, or supply the identifier of a text structure that already stores the string.

The response to a question of mode t is stored not as a text, but as a variate each value of which is the number of the corresponding code as listed in the VALUES parameter. Usually, of course, a menu of mode t will be set with LIST=no, the default, and so the variate will contain only a single number. This can be used to control subsequent action in the menu system, most conveniently with a CASE statement. For example, the statements in the standard system following the above QUESTION statement could look like the following.

CASE _cdsourc

" Statements to deal with code b "

OR

" Statements to deal with code s "

OR

" Statements to deal with code t "

ENDCASE

The DEFAULT option is used here to specify a default answer if the user just types RETURN; it can be set for any mode of question. The HELP option and parameter of the QUESTION directive allow you to provide help text to guide the person answering the question. The SAVE option allows you to declare a menu without executing it, and also to execute a menu that has already been stored.

 

RANDOMIZE directive

Randomizes the units of a designed experiment or the elements of a factor or variate.

 

Options

BLOCKSTRUCTURE = formula Block model according to which the randomization is to be carried out; default * i.e. as a completely-randomized design

EXCLUDE = factors (Block) factors whose levels are not to be randomized

SEED = scalar Seed for the random-number generator; default 0

 

Parameters

factors or variates Structures whose units are to be randomized according to the defined block model

 

Description

In its simplest form, RANDOMIZE performs a random permutation of the units of a list of factors or variates. You list these structures with the parameter of RANDOMIZE. Genstat gives them all exactly the same permutation, which is produced by a set of random numbers generated from the SEED option. For example

RANDOMIZE [SEED=144556] X,Y

puts the values of X and Y into an identical random order. The seed can be any positive number, but only the last six digits of its integer part are used. Thus the seeds 2144556 and 7144556.3 are both equivalent to the seed 144556. If you put SEED=*, or leave it unset, Genstat picks a seed at random.

If any of the structures in the parameter list has a restriction, then they will all be treated as though they were restricted; moreover, all the restricted structures must be restricted in exactly the same way.

The main use of RANDOMIZE, however, is to randomize the allocation of treatments to units in a designed experiment. In the analysis of designed experiments, the underlying structure of an experiment is defined by the block formula, as explained in the description of the BLOCKSTRUCTURE directive. Provided the only operators in a block formula are the nesting (/) and crossing (*) operators, this also specifies the correct randomization of the experiment.

The nesting operator specifies that one factor is to be randomized within another one. The simplest example is the randomized block design: its block formula is Blocks/Plots; a separate randomization of plots is done for each block. Another example is a split-plot design, the formula for which is Blocks/Wplots/Subplots; this means randomize first the levels of Blocks, then the levels of Wplots within levels of Blocks, and finally the levels of Subplots within the levels of Blocks and Wplots. In other words, there is a separate randomization of Wplots for each Block, and a separate randomization of Subplots for each Wplot. A similar formula and randomization would apply to a resolvable incomplete-block design.

The crossing operator specifies that the factors are to be randomized independently of each other. For example the formula Rows*Cols means randomize the levels of Rows and Cols separately. Thus the same randomization of Cols appears within each Row. This is the block formula associated with a row and column design, for example a Latin square.

You specify the block formula by the BLOCKSTRUCTURE option, which thus defines the way in which the randomization is to be carried out. Genstat does not randomize the factors in the block structure themselves, unless you put them into the parameter list. This is because the original order of the block-factor levels often describes actual positions in the experiment; for example, in a field. So you are most likely to want to keep these values, rather than the random ordering of them that is used to allocate treatments. The block formula for RANDOMIZE must index all the units; so a randomized block design must be specified for example as Blocks/Plots and not just Blocks. To put a formula of just Blocks would not give Genstat any information about what to do with the elements of the blocks.
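
A minimal sketch of the randomization of a randomized block design, with three blocks of four plots, might look like this. The factor names Blocks, Plots and Treat and the systematic starting values are assumptions made for the illustration; the seed is the one used earlier.

" Three blocks of four plots, with treatments initially in systematic order "

FACTOR [LEVELS=3; VALUES=4(1...3)] Blocks

FACTOR [LEVELS=4; VALUES=(1...4)3] Plots

FACTOR [LEVELS=4; VALUES=(1...4)3] Treat

" Permute the treatments separately within each block "

RANDOMIZE [BLOCKSTRUCTURE=Blocks/Plots; SEED=144556] Treat

PRINT Blocks,Plots,Treat

Only Treat is in the parameter list, so Blocks and Plots keep their original values and continue to describe the positions of the plots.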

You should use the EXCLUDE option if you want to restrict the randomization so that one or more of the factors in the block formula is not randomized. The most common instance where this is required is when one of the treatment factors is time-order, which cannot be randomized.
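
For instance, in a row-and-column design whose columns are successive time periods, the periods can be left in their natural order while the rows are still randomized. The following is a sketch only: the factors Subjects, Periods and Treat, the 4 x 4 Latin square layout and the seed are all hypothetical.

" A 4 x 4 Latin square in systematic order "

FACTOR [LEVELS=4; VALUES=4(1...4)] Subjects

FACTOR [LEVELS=4; VALUES=(1...4)4] Periods

FACTOR [LEVELS=4; VALUES=1,2,3,4, 2,3,4,1, 3,4,1,2, 4,1,2,3] Treat

" Randomize the subjects, but keep the periods in time order "

RANDOMIZE [BLOCKSTRUCTURE=Subjects*Periods; EXCLUDE=Periods; SEED=52731] Treat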

The SEED option determines which randomization Genstat gives. If you use the same seed, you will get the same random numbers, and hence the same randomization (provided the block formula and the block factors are the same as before). If you omit SEED, Genstat picks a seed at random, and prints a message to tell you what it is in case you want to reproduce the randomization later.

 

RCYCLE directive

Controls iterative fitting of generalized linear, generalized additive, and nonlinear models, and specifies parameters, bounds etc for nonlinear models.

 

Options

MAXCYCLE = scalars Maximum number of iterations for Fisher-scoring algorithm (used in generalized linear models), back-fitting algorithm (used in additive models) and nonlinear algorithms; single setting implies the same limit for all; default 15, 15, 30

TOLERANCE = scalar Convergence criterion; default 0.0001

FITTEDVALUES = variate Initial fitted values for generalized linear model; default *

METHOD = string Algorithm for fitting nonlinear model (GaussNewton, NewtonRaphson, FletcherPowell); default Gaus, but Newt for scalar minimization

LINEARPARAMETERS = scalars Scalars to hold current values of linear parameters used in nonlinear model, for reference within model calculations

 

Parameters

PARAMETER = scalars Nonlinear parameters in the model

LOWER = scalars Lower bound for each parameter

UPPER = scalars Upper bound for each parameter

STEPLENGTH = scalars Initial step length for each parameter

INITIAL = scalars Initial value for each parameter

 

Description

RCYCLE allows you to control the optimization process used by the FIT, FITCURVE, and FITNONLINEAR directives.

The MAXCYCLE option can be set to a list of three scalars to specify respectively the maximum number of iterations to be used in the Fisher-scoring algorithm used to fit a generalized linear model, the back-fitting algorithm used in generalized additive models, and the algorithms for nonlinear models. These have the defaults 15, 15, and 30. If a single value is supplied, it is taken to apply to all three situations.

The TOLERANCE option controls the criterion for convergence in generalized linear and generalized additive models. The iteration stops when the absolute change in deviance in successive cycles is less than the tolerance times the current value of the deviance.

The algorithm for generalized linear models has to start by estimating an initial set of fitted values. Genstat usually obtains these by a simple transformation of the observed responses. It may be that better estimates are available, for example from a previously fitted model; if so, you can supply them by the FITTEDVALUES option.
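
The options described so far might be used as in the following sketch, which allows more iterations and sets a tighter convergence criterion for a log-linear model. The variates Count and Dose are hypothetical, and the RCYCLE statement is placed between MODEL and FIT on the assumption that its settings apply to the model that has just been defined.

MODEL [DISTRIBUTION=poisson; LINK=logarithm] Count

" Up to 30 cycles; stop when the change in deviance is below 0.00001 times the deviance "

RCYCLE [MAXCYCLE=30; TOLERANCE=0.00001]

FIT Dose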

The PARAMETER parameter can be used to supply initial values for the nonlinear parameters in the standard curves fitted by FITCURVE, although this will usually not be necessary; FITCURVE has effective ways of its own to ascertain a good starting value for each parameter, for example by a short grid search or by some manipulation of the data values. The parameters must be listed in the same order as Genstat uses to print them. RCYCLE defines the identifiers as scalars holding the initial values that you have supplied; after the model has been fitted they contain the estimated values of the parameters.

The other parameters are relevant only to the general nonlinear models, fitted by FITNONLINEAR or FIT. The PARAMETER parameter then merely lists the scalars that will be used to represent the nonlinear parameters in the model calculations, the LOWER and UPPER parameters specify bounds, the STEPLENGTH parameter specifies initial steplengths, and the INITIAL parameter specifies initial values. The METHOD option is also relevant only for general nonlinear models, when it specifies the optimization method to be used. (Bounds and step lengths are determined automatically for standard curves, and Genstat then always uses a modified Newton method.)
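
As a sketch of how these parameters fit together, the statements below set up one nonlinear parameter R, bounded between 0 and 1, for a model in which the response is linear in R**Nitrogen. The identifiers Yield, Nitrogen, Z, R and Zcalc are hypothetical, and the form of the FITNONLINEAR statement, including its CALCULATION option, is an assumption: check it against the description of that directive.

MODEL Yield

" One nonlinear parameter, with bounds, initial step length and starting value "

RCYCLE PARAMETER=R; LOWER=0; UPPER=1; STEPLENGTH=0.01; INITIAL=0.5

" Calculation that forms the explanatory variate Z from the current value of R "

EXPRESSION Zcalc; VALUE=!e(Z = R**Nitrogen)

FITNONLINEAR [CALCULATION=Zcalc] Z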

 

RDISPLAY directive

Displays the fit of a linear, generalized linear, generalized additive, or nonlinear model.

 

Options

PRINT = strings What to print (model, deviance, summary, estimates, correlations, fittedvalues, accumulated); default mode,summ,esti

CHANNEL = identifier Channel number of file, or identifier of a text to store output; default current output file

DENOMINATOR = string Whether to base ratios in accumulated summary on rms from model with smallest residual ss or smallest residual ms (ss, ms); default ss

NOMESSAGE = strings Which warning messages to suppress (dispersion, leverage, residual, vertical, df, inflation); default *

FPROBABILITY = string Printing of probabilities for variance and deviance ratios (yes, no); default no

TPROBABILITY = string Printing of probabilities for t-statistics (yes, no); default no

SELECTION = strings Statistics to be displayed in the summary of analysis produced by PRINT=summary; the first four are relevant only for a Normally distributed response, and the last only for a gamma-distributed response (%variance, %ss, adjustedr2, r2, seobservations, dispersion, %cv); default %var,seob if DIST=normal, %cv if DIST=gamma, and disp for other distributions

DISPERSION = scalar Dispersion parameter to be used as estimate for variability in s.e.s; default is as set in the MODEL statement

RMETHOD = string Type of residuals to display (deviance, Pearson, simple); default is as set in the MODEL statement

DMETHOD = string Basis of estimate of dispersion, if not fixed by DISPERSION option (deviance, Pearson); default is as set in the MODEL statement

SAVE = identifier Specifies save structure of model to display; default * i.e. that from latest model fitted

 

No parameters

 

Description

RDISPLAY produces further output from a linear, generalized linear, generalized additive, or nonlinear model. The PRINT option has the same settings as in the FIT directive, except that no monitoring is available. The CHANNEL option selects the output channel to which the results are output, as in the PRINT directive; this may be a text structure, allowing output to be stored prior to display. The DENOMINATOR, NOMESSAGE, FPROBABILITY, TPROBABILITY, and SELECTION options are also as in the FIT directive.

The RMETHOD option allows you temporarily to change the method of forming residuals, for the output of the current statement only, in the same way as the corresponding option in the MODEL directive sets the default method of formation. Similarly, the DMETHOD option temporarily changes the method used to calculate the residual variability to be displayed for a generalized linear model, and the DISPERSION option allows you (temporarily) to set the dispersion parameter. These again operate like the corresponding options of MODEL (except that they apply only to the current statement).

The SAVE option lets you specify the identifier of a regression save structure; the output will then relate to the most recent regression model fitted with that structure.
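
For example, after a model has been fitted, further output can be requested without refitting. This is a sketch in which the variates Yield and Nitrogen and the text Out are hypothetical.

MODEL Yield

FIT Nitrogen

" Estimates with probabilities for the t-statistics, and their correlations "

RDISPLAY [PRINT=estimates,correlations; TPROBABILITY=yes]

" Store the summary in a text instead of sending it to the output file "

TEXT Out

RDISPLAY [PRINT=summary; CHANNEL=Out]

PRINT Out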

 

READ directive

Reads data from an input file, an unformatted file, or a text.

 

Options

PRINT = strings What to print (data, errors, summary); default erro,summ

CHANNEL = identifier Channel number of file, or text structure from which to read data; default current file

SERIAL = string Whether structures are in serial order, i.e. all values of the first structure, then all of the second, and so on (yes, no); default no, i.e. values in parallel

SETNVALUES = string Whether to set number of values of vectors from the number of values read (yes, no); default no causes the number of values to be set only for structures whose lengths are not defined already (e.g. by declaration or by UNITS)

LAYOUT = string How values are presented (separated, fixedfield); default sepa

END = text What string terminates data (* means there is no terminator); default ':'

SEQUENTIAL = scalar To store the number of units read (negative if terminator is met); default *

ADD = string Whether to add values to existing values (yes, no); default no (available only in serial read)

MISSING = text What character represents missing values; default '*'

SKIP = scalar Number of characters (LAYOUT=fixed) or values (LAYOUT=sepa) to be skipped between units (* means skip to next record); default 0 (available only in parallel read)

BLANK = string Interpretation of blank fields with LAYOUT=fixed (missing, zero, error); default miss

JUSTIFIED = strings How values are to be assumed justified with LAYOUT=fixed (left, right); default righ

ERRORS = scalar How many errors to allow in the data before reporting a fault rather than a warning; a negative setting, -n, causes reading of data to stop after the nth error; default 0

FORMAT = variate Allows a format to be specified for situations where the layout varies for different units, option SKIP and parameters FIELDWIDTH and SKIP are then ignored (in the variate: 0 switches to fixed format; 0.1, 0.2, 0.3, or 0.4 to free format with space, comma, colon, or semi-colon respectively as separators; * skips to the beginning of the next line; in fixed format, a positive integer n indicates an item in a field width of n, -n skips n characters; in free format, n indicates n items, -n skips n items); default *

QUIT = scalar Channel number of file to return to after a fatal error; default * i.e. current input file

UNFORMATTED = string Whether file is unformatted (yes, no); default no

REWIND = string Whether to rewind the file before reading (yes, no); default no

SEPARATOR = text Text containing the (single) character to be used in free format; default ' '

SETLEVELS = string Whether to define factor levels or labels (according to the setting of FREPRESENTATION) automatically from those that occur in the data (yes, no); default no causes them to be set only when they are not defined already

TRUNCATE = strings Truncation of leading or trailing spaces of strings read in fixed format (leading, trailing); default * i.e. none

 

Parameters

STRUCTURE = identifiers Structures into which to read the data

FIELDWIDTH = scalars Field width from which to read values of each structure (LAYOUT=fixe only)

DECIMALS = scalars Number of decimal places for numerical data containing no decimal points

SKIP = scalars Number of values (LAYOUT=sepa) or characters (LAYOUT=fixe) to skip before reading a value

FREPRESENTATION = string How factor values are represented (labels, levels, ordinals); default levels

 

Description

Data values can be read into any Genstat data structure using the READ directive. In its simplest form, you merely list the structure whose values are to be read: for example

READ Weight

The data values for Weight are then assumed to come on the following line or lines. They are assumed to be in free format, separated one from another by one or more spaces or tabs or new lines, and to be terminated by a colon.

READ has a PRINT option with settings:

summary to print a summary of the data

data to print a copy of the input lines

errors to print a detailed report on any errors in the data

By default PRINT=summary,errors.

The CHANNEL option allows you to read data from another file; this must already have been opened (see the OPEN directive). You can also read data from a Genstat text structure. Each line of input is then treated as if it had been read from a file.

You can read values for more than one structure in a single READ statement. The values can be taken either serially or in parallel. The default is to take the values in parallel: the first element of each structure is read, then the second element of each, until all the data are read. For example:

a1 b1 c1

a2 b2 c2

a3 b3 c3

a4 b4 c4 :

or

a1 b1 c1 a2

b2 c2

a3 b3 c3 a4 b4 c4 :

Here A, B, and C are in parallel, each with four values. The complete set of values for all three structures is given, followed by one terminating colon. The term parallel merely indicates the order in which READ is to read the values: that is, the first element of each structure, then the second element of each, and so on. It is not necessary for the data to be laid out in neat columns, although this may make a data file easier to work with. Different types of structures can be read in parallel and they may have different kinds of values (numerical or text).
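
With actual numbers, a parallel read of three variates might look like the following sketch; the data values are invented for the illustration.

READ A,B,C

1 10 100

2 20 200

3 30 300

4 40 400 :

PRINT A,B,C

After this, A contains 1, 2, 3, 4; B contains 10, 20, 30, 40; and C contains 100, 200, 300, 400.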

Alternatively, you can set option SERIAL=yes to read the structures in series. Then all the values of the first structure are read, followed by all the values for the second structure, and so on, until all the data structures have been read. For example

x1 x2 x3 :

y1 y2 :

z1 z2 z3 z4 z5 z6 :

Here all the values of X are given first, followed by all the values for Y, and then all the values for Z. Unlike the parallel layout, each set of values must end with the terminating colon, so that READ can tell when to move on to the next structure; this means that the structures can be of different lengths.
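
The corresponding statement, with invented numbers in place of the symbolic values above, could be written as follows.

READ [SERIAL=yes] X,Y,Z

1 2 3 :

4 5 :

6 7 8 9 10 11 :

If the lengths have not been defined beforehand, X is given three values, Y two and Z six.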

When you are working interactively, Genstat produces a prompt indicating the name of the data structure and the unit number of the next value it expects to read. If Genstat knows how many values to expect, it will terminate the input automatically, without asking for the terminating colon, if the last value is at the end of a line. However, it is quite correct to include the colon at the end of that line of data if you want. If you type too many values by mistake you will get a warning message telling you that the extra data has been ignored.

If a structure whose values are to be read has not already been declared, Genstat will define it automatically as a variate. Likewise, if the length of a vector is undefined, this too will be set automatically. READ first checks whether the vector is being read in parallel with other vectors whose lengths have been defined, then it looks to see if a default length has been defined for vectors using the UNITS directive. If neither of these is available to define the length, it is set to the number of data values that are provided in the input. Lengths of vectors can also be redefined according to the number of data values that are read, by setting option SETNVALUES=yes. The END option allows you to define another string of characters to be used instead of a colon to mark the end of the data, or you can set END=* to indicate that there is no terminating string.

The values of numerical structures (scalars, variates, matrices, symmetric and diagonal matrices, and tables) can be entered in any of the standard forms: for example

1.20 -.2 3e1 -1.25E-2 27

are all valid.

Textual values (strings) in free format must be enclosed within single quotes if they contain any characters that have special meaning to READ (space, tab, comma, colon, asterisk, backslash, single or double quote). The quotes can be omitted for other strings. For example:

TEXT [NVALUES=5] Country

READ Country

Australia Canada 'Great Britain' U.S.A. 'New Zealand' :

The rules for strings in READ are thus slightly different to those for lists of strings, where quotes are required for any string that does not start with a letter or contains any character other than letters or digits. Thus Newcastle-on-Tyne and 500Km are both valid when read in as data, but not in a TEXT declaration. Rules for strings in fixed format are described later.

The values of factors are usually represented by their levels. You can change this by setting the FREPRESENTATION parameter. If you set it to labels, READ will accept as values the labels of the factor, using the same rules as for reading textual strings. The strings given as data values must match exactly the labels of the factor if they have been declared. The setting FREPRESENTATION=ordinals causes READ to expect an integer in the range 1 up to n, the number of levels declared for the factor. As FREPRESENTATION is a parameter it can be set to a list of values which are cycled in parallel with the structures to be read. Thus, you are allowed to read several factors in one READ statement, possibly using a different method for reading each one. The setting of this parameter is ignored for any structures that are not factors, but remember that the list will still be cycled in parallel with these other structures. If you set option SETLEVELS=yes, READ will set up the factor levels or labels according to the values that it finds when reading the data.
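
For example, a factor can be read by its labels in parallel with a variate, as in this sketch; the identifiers Dose and Response and the data values are hypothetical.

FACTOR [LABELS=!t(low,medium,high)] Dose

" The single setting is cycled over both structures, and ignored for the variate "

READ Dose,Response; FREPRESENTATION=labels

low 4.2 high 7.9

medium 5.5 low 4.0 :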

The values of pointers are identifiers, that is, names of other data structures. When reading a pointer only simple identifiers are allowed: suffixes cannot be used. For example, Winston is allowed but Orwell[1984] is not.

You cannot read formulae or expressions directly. The easiest way to input one is to read it into a text, which can then be used in an appropriate declaration by means of either the macro-substitution symbols ## or the EXECUTE directive. You cannot read values into compound data structures; these should be formed using the appropriate directives or by reading their components individually.

By default, a missing value should be indicated by an asterisk (*); this means that any data item that begins with * is treated as missing. For example, any of the three strings

* *** *789

will be treated as missing. You can use the MISSING option to change this to any other single character; for example, if you set MISSING='-' then any negative numbers will be read as missing values.

In free format, values are usually separated by spaces or tabs. The SEPARATOR option can be used to specify another character to use as a separator. For example you can use a comma:

READ [SEPARATOR=','] Weights

24.3, 25.6, 57.3, 43.8, 45.3,

46.5, 47.9, 97.0, 77.5, 64.3 :

You can use spaces and tabs in addition to the specified separator, so long as the separator is present between each pair of values (except at the end of line, when it may be omitted).

The SEPARATOR, END, and MISSING strings are all case-sensitive; for example, END=enddata is different from END=EndData. The missing-value and separator characters must be distinct and neither may be part of the END string.

In free format, the SKIP option can be used to skip values between complete units of data. For example, with a file in channel 2 containing five columns of data, the statement

READ [CHANNEL=2; SKIP=3] X,Y

would read X and Y from the first two columns and skip the final three: Genstat reads the first values of X and Y, then skips the next three values before reading the second value of X, which takes READ on to the next line of the file, and so on. You can also set SKIP=* to skip directly to the next line of data; you could use this if there were varying numbers of additional columns in the file. By default, SKIP is zero, so no values are skipped. The SKIP parameter is interpreted in parallel with the structures whose values are to be read, and indicates how many values should be skipped before reading the value for the corresponding structure.

In fixed format, data values are arranged in specific fields on each line of the file. Each field consists of a fixed number of characters. There is no need for separating spaces; the tab character is not permitted, nor are comments. So, depending on how the fields are defined, the sequence of digits 123456 could be interpreted for example as the single number 123456, or two numbers 123 and 456, or three numbers 123, 4 and 56. Data like this are usually produced by special-purpose programs or equipment; for example, automatic data recorders.

To read data in fixed format you set the LAYOUT option to fixed, and then specify the format to be used. If the values for a structure always occupy the same number of character positions, you can do this with the FIELDWIDTH parameter. For example,

READ [CHANNEL=2; LAYOUT=fixed] Weight,Height; FIELDWIDTH=3,5

takes data from channel 2 in fixed format. The data are in parallel: that is, reading across lines of the file, values for Weight and Height appear alternately. The FIELDWIDTH parameter is processed in parallel with the structures to be read, so each item of Weight data takes up three characters, and each item of Height data takes up five. If the fieldwidth for a structure is not constant, that is if different layouts are used for different units of the data, then you need to use the FORMAT option, described later.

Suppose there are 80 characters per line in the file; each pair of Weight and Height values takes up 8, and so you have 10 pairs per line. The first line looks like:

Weight1Height1Weight2Height2 ... Weight10Height10

Suppose that the first two values for Weight were 1 and 200, and that the first two for Height were 10 and 1200. Then, using _ to represent a space, the first four items on this line would be:

__1___10200_1200

Genstat is able to identify the separate values 10 and 200 because it is reading a fixed number of characters for each structure.
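The slicing described above can be mimicked outside Genstat. The following Python sketch is an illustration only, not Genstat code; the function name read_fixed is hypothetical. It cycles through the field widths in parallel with the structure names, as FIELDWIDTH does, and applies them to the sixteen characters shown above (with real spaces in place of _).

```python
from itertools import cycle

def read_fixed(line, widths, names):
    """Slice `line` into fields, cycling through `widths` in parallel with `names`."""
    values = {name: [] for name in names}
    pos = 0
    for name, width in cycle(zip(names, widths)):
        if pos + width > len(line):
            break  # not enough characters left for another field
        field = line[pos:pos + width]
        values[name].append(int(field))  # int() tolerates leading spaces
        pos += width
    return values

line = "  1   10200 1200"   # __1___10200_1200 with real spaces
result = read_fixed(line, widths=[3, 5], names=["Weight", "Height"])
print(result)  # {'Weight': [1, 200], 'Height': [10, 1200]}
```

The adjacent values 10 and 200 are separated purely by position, with no separator character involved.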

Genstat input files have a nominal width, set by default to 80. This can be altered by an OPEN statement to a different value if necessary. When reading in fixed format, each line of input is taken to be exactly this width; shorter lines are extended with spaces (blanks). It is important to make sure that you account for this when setting the options for READ, otherwise you may read some values from these blank fields (the BLANK option, described below, explains how the blank fields would be interpreted). In the example above, if the values for Height occupied four characters instead of five there would be 11 pairs of values per line of 77 characters. Using the default settings, the final three characters on the first line would be read as the 12th value of Weight, and READ would then be out of step as the 12th value of Height would be read in from the beginning of the next line. The simplest solution is to set the file width to 77 in the OPEN statement, but you can also use the SKIP option and parameter (see below) or the FORMAT option to avoid this sort of problem.
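The effect of the nominal width can be checked with a small Python sketch (an illustration outside Genstat; fields_on_line is a hypothetical helper). It pads a 77-character line to the file width and counts the fields that fit on that one line; unlike READ, it stops at the end of the line and does not carry on to the next.

```python
def fields_on_line(line, widths, file_width=80):
    """Return the fields taken from one line after padding it to file_width."""
    padded = line.ljust(file_width)[:file_width]
    fields, pos, i = [], 0, 0
    while pos + widths[i % len(widths)] <= file_width:
        width = widths[i % len(widths)]
        fields.append(padded[pos:pos + width])
        pos += width
        i += 1
    return fields

line = "  1  10" * 11          # 11 Weight/Height pairs, 77 characters
default = fields_on_line(line, [3, 4])                  # nominal width 80
narrowed = fields_on_line(line, [3, 4], file_width=77)  # width set to 77
print(len(default), repr(default[-1]))   # 23 '   '
print(len(narrowed))                     # 22
```

With the default width, a 23rd field made up entirely of padding blanks is produced, corresponding to the spurious 12th value of Weight; with the width set to 77 only the 22 genuine fields remain.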

When you are using fixed format, the data terminator must begin within the first field to be read after the final data value: so you must ensure that you set the field widths and position the terminator appropriately. If you are using either the SKIP option or parameter, you must take care not to skip accidentally over the terminator, as READ will continue to take input - and probably generate many error messages.

Normally Genstat treats a blank field in fixed-format data as a missing value, and the only indication will be in the count of missing values in the printed summary. You can request warning messages for blank fields by setting the option BLANK=error. Alternatively, you can cause blanks to be interpreted as zeroes, by setting BLANK=zero.

Data in fixed format are normally taken to be right-justified: that is, their right-hand ends are flush with the right-hand end of the field; you can have either blanks or leading zeroes (for numbers) in the redundant spaces at the left of the field. You can change this default by setting the JUSTIFIED option. For example the value 123 can appear in a field of width 5 as:

__123   JUSTIFIED=right        there may be leading blanks (the default)

123__   JUSTIFIED=left         there may be trailing blanks

00123   JUSTIFIED=left,right   there must be no blanks

_123_   JUSTIFIED=*            there may be leading or trailing blanks

In this way, JUSTIFIED allows you to check the blanks in each field. If a data field contains any blanks that are not allowed by the current setting, an error will be reported. Note that when reading numerical data embedded blanks are never permitted. So a field containing, for example 1_2_3, will always produce an error message.
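The blank-checking rules for numerical fields can be summarized in a short Python sketch. This is an illustration of the rules as stated above, not Genstat's implementation; the function blanks_allowed is hypothetical, and wholly blank fields (governed by the BLANK option) are not covered.

```python
def blanks_allowed(field, justified):
    """Check the blanks in a numeric field against a JUSTIFIED setting.

    `justified` is a set drawn from {"left", "right"}, or None for JUSTIFIED=*.
    """
    if " " in field.strip(" "):
        return False                       # embedded blanks are never permitted
    if justified is None:
        return True                        # leading and trailing blanks both allowed
    leading = field != field.lstrip(" ")   # field starts with a blank
    trailing = field != field.rstrip(" ")  # field ends with a blank
    if leading and "left" in justified:
        return False                       # left-justified: no leading blanks
    if trailing and "right" in justified:
        return False                       # right-justified: no trailing blanks
    return True

print(blanks_allowed("  123", {"right"}))          # True
print(blanks_allowed("123  ", {"right"}))          # False
print(blanks_allowed("00123", {"left", "right"}))  # True
print(blanks_allowed(" 123 ", None))               # True
print(blanks_allowed("1 2 3", None))               # False
```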

As an example, we can read the values of five scalars using a fixed format with values left-justified in their fields by the following:

SCALAR V,W,X,Y,Z

READ [LAYOUT=fixed;JUSTIFIED=left] V,W,X,Y,Z; \

FIELDWIDTH=4,5,7,4,5

1.235.62_678.9__3.7810.31:

This reads the values 1.23, 5.62, 678.9, 3.78, and 10.31 into V, W, X, Y, and Z respectively.

The general principles of the SKIP option and parameter are discussed in the context of a free format read in the previous section. When reading in fixed format the same ideas apply, but the SKIP settings now specify numbers of characters to be ignored, instead of numbers of values. Thus, you can obtain exactly the same effect as in the example above by putting

READ [LAYOUT=fixed] V,W,X,Y,Z; FIELDWIDTH=4,4,5,4,5; \

SKIP=0,0,1,2,0
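The equivalence of the two READ statements can be checked with a Python sketch (an illustration outside Genstat; read_fields is a hypothetical helper). Each SKIP value is taken as the number of characters to ignore before the corresponding field, as described above, and the data line is used without its terminator.

```python
def read_fields(line, widths, skips=None):
    """Read one number per field; skips[i] characters are ignored before field i."""
    skips = skips or [0] * len(widths)
    pos, values = 0, []
    for width, skip in zip(widths, skips):
        pos += skip
        values.append(float(line[pos:pos + width]))  # float() ignores padding blanks
        pos += width
    return values

line = "1.235.62 678.9  3.7810.31"   # 1.235.62_678.9__3.7810.31 with real spaces
wide = read_fields(line, [4, 5, 7, 4, 5])
skipped = read_fields(line, [4, 4, 5, 4, 5], skips=[0, 0, 1, 2, 0])
print(wide)             # [1.23, 5.62, 678.9, 3.78, 10.31]
print(wide == skipped)  # True
```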

Sometimes fixed format data can be further compressed by omitting the decimal point. The DECIMALS parameter allows you to re-scale data automatically when it is read (in either fixed or free format).

When reading textual data in fixed format, the contents of each field are taken exactly as they appear in the input file. There is no need to enclose values in quotes; in fact if you do so, the quotes are treated as part of the data. For example,

TEXT [NVALUES=1] T1,T2,T3,T4

READ [LAYOUT=fixed; SKIP=*] T1,T2,T3,T4; FIELDWIDTH=7,3,4,8

'What's_it_all_about?':

gives text T1 the value 'What's, text T2 the value _it, text T3 the value _all, and text T4 the value _about?'. Consequently, the only way to represent a missing string in fixed format is by a blank field, as '' or * would both be treated literally and stored as data values.

The TRUNCATE option has settings leading and trailing, allowing you to remove initial or trailing spaces in strings that are read in fixed format. For example, if we set TRUNCATE=leading above, T2 would just contain the two letters it. By default no truncation takes place.

The rules for reading textual data in fixed format also affect the reading of factors. If you set FREPRESENTATION=labels and do not request any truncation, the width of the field must equal the number of characters in the label, as for example no_ is not the same as no.

The FORMAT option allows you to use a variable format. By this we mean that the layout of the values may vary from unit to unit of the data, and may also vary within each unit. For example, suppose you have some meteorological data that were measured daily and that the file also contains some additional summary values at the end of each week. The first eleven lines are reproduced to illustrate the structure of the file:

Monday 5.5 -0.4 0.0 1.9 10.0

Tuesday -1.1 -2.1 0.0 0.0 34.0

Wednesday 0.6 -8.3 1.3 5.4 142.0

Thursday 6.8 -5.7 1.1 0.0 158.0

Friday 10.6 0.5 8.1 0.0 141.0

Saturday 10.7 6.4 8.3 0.0 152.0

Sunday 10.0 1.9 1.0 0.1 237.0

Summary week 1> 10.7 -8.3 4 19.8 7.4 10.0 124.8 237.0

Monday 9.9 2.5 0.0 4.4 229.0

Tuesday 11.4 2.1 8.5 0.3 237.0

Wednesday 11.9 6.3 18.7 0.0 520.0

Suppose the file contains data for 28 days. If you try to read a text and five variates of length 28 then the summaries found after the 7th, 14th, 21st and 28th days would cause an error in READ. You need to read seven lines, skip one, read seven more, and so on. This can be done by setting the option FORMAT=!( (6)7,*,* ). This means "read six values, do this seven times, skip to the next line, skip again, then return to the beginning of the format and repeat, until enough data has been read". The format is made clear by using (6)7 which corresponds to the physical layout of the data, but 42 could have been specified instead, meaning read the next 42 values.

You can use FORMAT when reading in either free format or fixed format, and can also switch between the two during the READ. When you have set FORMAT, Genstat ignores the SKIP option and the FIELDWIDTH and SKIP parameters, and READ is controlled entirely by the values of the FORMAT. These values are not in parallel with the list of structures: they apply to data values in turn, recycling from the beginning when necessary. You set FORMAT to a variate, which may be declared in advance or can be an unnamed structure as shown above. Each value of this variate is interpreted as follows (where n is a positive integer):

+n read n values (in free format) or one value from a field of n characters (in fixed format);

-n skip the next n values (in free format) or n characters (in fixed format)

* skip to the beginning of the next line

0.0 switch to fixed format

0.1 switch to free format using space as a separator

0.2 switch to free format using comma as a separator

0.3 switch to free format using colon as a separator

0.4 switch to free format using semicolon as a separator

0.5 switch to free format using the setting of the SEPARATOR option

When you use a FORMAT variate, READ will start in either free format or fixed format, according to the setting of LAYOUT (by default, LAYOUT=separated; that is, free format). You can switch between these at any time by specifying a value in the range 0-0.5. Remember that if you use free format, spaces and tabs can also be used in addition to the specified separator, and you must use a separator that is distinct from the END and MISSING indicators.
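The way the FORMAT values are applied in turn, and recycled, can be sketched in Python. This is an illustration outside Genstat that covers only free format and only the +n, -n and * codes (with * written as None); the function read_with_format and the tiny data set are hypothetical, using "weeks" of two days with three values per line.

```python
def read_with_format(lines, fmt, n_values):
    """Collect n_values items from `lines`, steered by the codes in `fmt`.

    +n reads n values, -n skips n values, None (for *) skips to the next line.
    """
    tokens = [ln.split() for ln in lines]
    out, row, col, i = [], 0, 0, 0
    while len(out) < n_values and row < len(tokens):
        code = fmt[i % len(fmt)]      # recycle the format when it runs out
        i += 1
        if code is None:              # *: go to the beginning of the next line
            row, col = row + 1, 0
            continue
        for _ in range(abs(code)):
            # move on to the next line once the current one is exhausted
            while row < len(tokens) and col >= len(tokens[row]):
                row, col = row + 1, 0
            if row >= len(tokens):
                return out
            if code > 0 and len(out) < n_values:
                out.append(tokens[row][col])
            col += 1
    return out

lines = ["Mon 1 2", "Tue 3 4", "Summary 9 9 9",
         "Mon 5 6", "Tue 7 8", "Summary 9 9 9"]
fmt = [3, 3, None, None]              # the analogue of !((3)2,*,*)
values = read_with_format(lines, fmt, n_values=12)
print(values)
# ['Mon', '1', '2', 'Tue', '3', '4', 'Mon', '5', '6', 'Tue', '7', '8']
```

Two * codes are needed because, after the last value on a data line has been read, the first * moves to the start of the summary line and the second moves past it.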

You can read from unformatted files by setting option UNFORMATTED=yes. The only options that are then relevant are CHANNEL, REWIND, and SERIAL. Details of how to create the unformatted files are given in the description of the PRINT directive.

If you have more data to read than can be stored in the space available within Genstat, you can use the SEQUENTIAL option of READ to process the data in smaller batches. This works by reading in some of the data, partially processing it to form an intermediate result, and then overwriting the original data with a new batch that is used to update the intermediate results. This can be repeated until all the data has been read and the final summary is obtained. There are two directives that include facilities specifically designed to work with sequential data input: TABULATE which forms tabular summaries, and FSSPM which forms SSPM data structures for use in linear regression. You can also use other directives, such as CALCULATE, to process data sequentially, but you will have to program the sequential aspects yourself.

You should first declare the structures to be of some convenient size, such that you will not use up all the work space. You then use READ as normal, but with the SEQUENTIAL option set to the identifier of a scalar, which will be used to keep track of how the input is progressing. For example, to read in 10 variates of length 272500:

VARIATE [NVALUES=10000] X[1...10]

READ [CHANNEL=2; SEQUENTIAL=N] X[1...10]

The number of values declared for X[1...10] defines the size of batch to read (10000 in this example). So, READ will read the first 10000 units of data (100,000 values), and set N to 10000 to indicate that this is the number of units read. This should be followed by the statements to process the first batch of data, then the READ can be repeated. Once again N is set to 10000, indicating that another 10000 units have been read. This can be continued until READ finds the data terminator, when it sets the sequential indicator to minus the number of values found in the last batch. If this is less than the declared size of the data structures they will be filled out with missing values. In the example given above, after the 28th READ the variates will each contain 2500 values followed by 7500 missing values, and N will be set to -2500, indicating that all the data has been read and that the final batch contains only 2500 values. Usually you will use the SEQUENTIAL facility in conjunction with FSSPM or TABULATE which are designed to recognize the different settings of the scalar N.

The SEQUENTIAL option is best used within a FOR loop. You should set the NTIMES option to a value large enough to ensure that sufficient batches of data are read. The loop should contain the READ statement and any other statements required to process the data. For example

VARIATE [NVALUES=10000] X[1...10]

SSPM [TERMS=X[]] S

FOR [NTIMES=9999]

READ [PRINT=*;CHANNEL=2;SEQUENTIAL=N] X[]

FSSPM [SEQUENTIAL=N] S

EXIT N.LE.0

ENDFOR

The EXIT directive is used to jump out of the loop once all the data has been read and processed; this is safer than trying to program an exact number of iterations for the loop. The exit condition includes the case when N is equal to zero, as this will arise when the batch size exactly divides the total number of units. In the above example, if there were 280000 units of data altogether, the 28th READ would terminate with N set to 10000. This is because READ is unable to look ahead for the terminator, as there may be other statements in the loop, such as SKIP, which affect how the file is read. The next READ would immediately find the data terminator, so would exit with N set to zero. This special case is treated appropriately by FSSPM and TABULATE, but you should remember to allow for it if you are programming the sequential processing explicitly.
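The sequence of values taken by the scalar N can be traced with a Python sketch (an illustration of the behaviour described above, not Genstat code; sequential_indicators is a hypothetical name). It covers both the ordinary case and the special case where the batch size exactly divides the number of units.

```python
def sequential_indicators(total_units, batch):
    """Yield the value N takes after each READ until the terminator is met."""
    remaining = total_units
    while True:
        if remaining >= batch:
            remaining -= batch
            yield batch        # a full batch; the terminator has not been seen
        else:
            yield -remaining   # terminator found; zero if no units were left
            return

partial = list(sequential_indicators(272500, 10000))
exact = list(sequential_indicators(280000, 10000))
print(len(partial), partial[-1])   # 28 -2500
print(len(exact), exact[-1])       # 29 0
```

A loop that tests N.LE.0 stops correctly in both cases, whereas a test for N.LT.0 would never be satisfied in the second.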

You can use the SEQUENTIAL option to read data from more than one input channel, perhaps when a large data set is split into two or more files, but you are not allowed to read data from the current input channel (that is, the channel containing the READ statement). If you want to process several structures sequentially from the same file, you must read them in parallel. You must also be careful not to modify the value of the scalar, N, within the loop when using sequential data input with FSSPM or TABULATE, as that could interfere with the sequential processing.

Another means of handling large amounts of data is provided by the ADD option. This allows you to add values to those already stored in a structure, thus forming cumulative totals without having to store all the individual data values. You must set SERIAL=yes with ADD=yes; and it is allowed only for variates. For example:

VARIATE [NVALUES=6] A

READ [ADD=yes; SERIAL=yes] 3(A)

5 12 9 * * 9 :

8 1 3 * 2 10 :

3 4 0 * 11 * :

This starts by assigning the values 5, 12, 9, *, *, and 9 to A. Then A is read again, and its values become 13, 13, 12, *, 2, 19: with ADD=yes (and only then) missing values are interpreted as zeroes when being added to non-missing values. Finally A contains the values 16, 17, 12, *, 13, 19.
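The accumulation rule can be written out as a Python sketch (an illustration outside Genstat, with None standing for a missing value and add_values a hypothetical helper). A missing value counts as zero when added to a non-missing value, and the total stays missing only when both are missing.

```python
def add_values(current, new):
    """Add `new` to `current`, treating a missing value (None) as zero
    unless both values are missing."""
    totals = []
    for c, n in zip(current, new):
        if c is None and n is None:
            totals.append(None)
        else:
            totals.append((c or 0) + (n or 0))
    return totals

A = [None] * 6                         # A has no values before the first read
for row in ([5, 12, 9, None, None, 9],
            [8, 1, 3, None, 2, 10],
            [3, 4, 0, None, 11, None]):
    A = add_values(A, row)
print(A)  # [16, 17, 12, None, 13, 19]
```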

If you have used the UNITS directive to specify a variate or text containing unit labels, READ will respect the order of these values when reading other structures in parallel with the units structure; in other words the data is re-ordered to match the order of the unit labels. If the units structure does not already have values, READ will define the order of the units as the order in which it finds them in the data. This means that if you are reading several sets of data, each having a column for the unit number (or label), the first use of READ will define the unit order and subsequent READ statements will ensure that this order is maintained consistently in the remaining data. If a value is specified more than once when defining the units structure, READ will only ever locate the first occurrence of that unit label. If a unit label is repeated in the data then only the final set of values corresponding to that unit will be stored; earlier occurrences are overwritten by subsequent ones. If you try to read a value that is not present in the units structure this is regarded as a fault. Also, if the units structure contains missing values it cannot be used to re-order the data and will instead be overwritten by the new values: a warning message is printed out to tell you if this occurs. If you use the option SETNVALUES=yes when reading structures in parallel with the units vector, the other structures will all be set to the current unit length.

When you are working interactively and typing data from the keyboard, READ will halt immediately it finds an invalid value. You should type the correct value and then continue with the rest of the data. If you had typed several items of data then all those before the erroneous value will have been read and stored, but any remaining values will have been discarded, and so will need to be retyped. When you are reading data in batch, it is not possible to recover from errors in this way. Instead, READ will continue processing the data, substituting missing values for any data that it cannot read, and printing out a message for every error that is found.

If errors occur when running in batch, a fault will be generated when READ terminates, thus terminating the job. This is to avoid spurious output being produced from analyses based on incorrect data. You can override this by using the options ERRORS and QUIT. If you set ERRORS=n, where n is a positive integer, then up to n errors are allowed in the data before READ generates a fault. You might want to do this if you knew certain items of data were going to generate errors, but were prepared to accept them as missing values so that you could analyse the rest of the data. For example, if missing values of a factor had been typed in as the letter X, you would not want to define X as an extra level of the factor, but if you set MISSING='X' any numerical data that used * for missing value could not be read either. Obviously, you need to be very careful when doing this, as there may be other unexpected errors in the data. Usually you would have to try reading the data once without setting ERRORS, so you could check all the messages, and find what value of n is appropriate. Then the READ statement would have to be repeated, setting ERRORS and REWIND in order to read the data.

READ produces a message for every data value that contains an error. This can be very useful, as you then have the opportunity to correct all the errors at once, before trying to read the data again. However, the error messages may not be due to errors in the data, but may be caused by an incorrectly specified READ statement. For example, if you are reading many structures in parallel and specify texts and variates in the wrong order in the list of structures to be read, you will get an error message every time Genstat finds a piece of text rather than a number in the position specified for a variate. This is not likely to be a problem, unless you are reading large amounts of data, when you might end up with thousands of lines of needless error messages. A sensible precaution then is to request Genstat to abort the READ if more than a specified number of errors occur. You can do this by setting ERRORS to a negative integer, -n. This means that up to n errors are allowed in the data, but READ will abort if any more occur, switching control to the channel specified by QUIT (that is, starting or continuing to read Genstat statements from that channel). If you are working in batch a fault will be generated that inhibits execution of further statements, but interactively you have the opportunity to examine the data that have been read in so far, which may help identify any problems in the original READ statement or declarations of your data.

 

RECORD directive

Dumps a job so that it can later be restarted by a RESUME statement.

 

Option

CHANNEL = scalar Channel number of the unformatted file where information is to be dumped; default 1

 

No parameters

 

Description

The RECORD directive sends all the relevant information about the current state of your Genstat job to the unformatted file specified by the CHANNEL option. You can then use the RESUME directive, either later in your program, or during a completely different Genstat run, to recover all this information and continue your use of Genstat from that point. This can be useful if you need to abandon an analysis and resume it at some later date, or if you want to save the current state of a program in case your next operations turn out to be unsuccessful. The information includes the attributes and values of all your data structures, procedures, and the current graphics settings, but no details are kept of the files that are open on any of the channels. If you use RECORD with the same channel number again, the earlier information is overwritten.

 

REDUCE directive

Forms a reduced similarity matrix (referring to the GROUPS instead of the original units).

 

Options

PRINT = string Printed output required (similarities); default * i.e. no printing

METHOD = string Method used to form the reduced similarity matrix (first, last, mean, minimum, maximum, zigzag); default firs

 

Parameters

SIMILARITY = symmetric matrices Input similarity matrix

REDUCEDSIMILARITY = symmetric matrices Output (reduced) similarity matrix

GROUPS = factors Factor defining the groups

PERMUTATION = variates Permutation order of units (for METHOD = firs, last, or zigz)

 

Description

Sometimes you may want to regard an n-by-n similarity matrix S as being partitioned into b-by-b rectangular blocks. You might then want to form a reduced matrix of similarities, between the different blocks instead of between the individual units. To do this you have to arrange for each of the b² blocks of the full matrix to be replaced by a single value. Each diagonal block must be replaced by unity. The METHOD option specifies how to replace the off-diagonal blocks, for example the maximum, minimum, or mean similarity within the block. The zigzag method (Rayner 1966) is relevant in particular when the data consist of b soil samples for each of which information is recorded on several soil horizons, possibly different in the different samples. The method recognizes that certain horizons might be absent from some soil samples; this leads to finding successive optimal matches, conditional on the constraint that one horizon cannot match a horizon that has already been assigned to a higher level; after finding these optima, an average is taken for each horizon.

The SIMILARITY parameter specifies the similarity matrix for the full set of n observations; this must be present and have values. The REDUCEDSIMILARITY parameter specifies an identifier for the reduced similarity matrix, of order b; this will be declared implicitly if you have not declared it already. The factor that defines the classification of the units into groups must be specified by the GROUPS parameter. The units can be in any order, so that for example the units of the first group need not be all together nor given first. The labels of the factor label the reduced similarity matrix.

The PERMUTATION parameter, if present, must specify a variate. It defines the ordering of samples within each group, and so must be specified for methods first, last, and zigzag. Within each group, the unit with the lowest value of the permutation variate is taken to be the first sample, and so on. Genstat will, if necessary, use a default permutation of one up to the number of rows of the similarity matrix.

If you set option PRINT=similarities, the values of the reduced symmetric matrix are printed, as percentages.
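The block reduction for the mean, minimum and maximum methods can be sketched in Python. This is an illustration outside Genstat under the description given above; the function reduce_similarity and the 4-by-4 matrix are hypothetical, and the first, last and zigzag methods are not covered. Each diagonal block is replaced by unity and each off-diagonal block by a summary of the similarities it contains; the units need not be sorted by group.

```python
from statistics import mean

def reduce_similarity(S, groups, method=mean):
    """Reduce the similarity matrix S (list of lists) to one value per pair of groups."""
    labels = sorted(set(groups))
    reduced = {}
    for g in labels:
        for h in labels:
            if g == h:
                reduced[g, h] = 1.0   # diagonal blocks are replaced by unity
            else:
                block = [S[i][j]
                         for i, gi in enumerate(groups) if gi == g
                         for j, gj in enumerate(groups) if gj == h]
                reduced[g, h] = method(block)
    return reduced

S = [[1.0, 0.2, 0.8, 0.4],
     [0.2, 1.0, 0.6, 0.9],
     [0.8, 0.6, 1.0, 0.3],
     [0.4, 0.9, 0.3, 1.0]]
groups = [1, 2, 1, 2]                 # units 1 and 3 form group 1
by_mean = reduce_similarity(S, groups)
by_min = reduce_similarity(S, groups, method=min)
by_max = reduce_similarity(S, groups, method=max)
print(by_min[1, 2], by_max[1, 2])     # 0.2 0.6
```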

 

Reference

Rayner, J.H. (1966). Classification of soils by numerical methods. Journal of Soil Science 17, 79-92.

 

RELATE directive

Relates the observed values on a set of variates to the results of a principal coordinates analysis.

 

Options

COORDINATES = matrix Points in reduced space; no default i.e. this option must be specified

NROOTS = scalar Number of latent roots for printed output; default * requests them all to be printed

 

Parameters

DATA = variates The data values

TEST = strings Test type, defining how each variate is treated in the calculation of the similarity between each unit (Jaccard, simplematching, cityblock, Manhattan, ecological, Pythagorean, Euclidean); default * ignores that variate

RANGE = scalars Range of possible values of each variate; if omitted, the observed range is taken

 

Description

One way of interpreting the principal coordinates obtained from a similarity matrix by PCO is by relating them to the original variates of the data matrix. For each coordinate and each data variate, an F-statistic can be computed as if the variate and the coordinate vector were independent. This is not the case but, although the exact distribution of these pseudo F-values is not known, they do serve to rank the variates in order of importance of their contribution to the coordinate vector.

Qualitative variates are treated as grouping factors, and the mean coordinate for each group is calculated. Only 10 groups are catered for; group levels above 10 are combined. The pseudo F-statistic gives the between-group to within-group variance ratio. Missing values are excluded.

Quantitative variates are grouped on a scale of 0-10 (where zero signifies a value up to 0.05 of the range), and mean coordinates for each group are calculated. The printed pseudo F statistic is for a linear regression of the principal coordinate on the ungrouped data variate, after standardizing the data variate to have unit range; the regression coefficient is also printed.

The DATA parameter lists the variates that are to be related to the PCO results and the TEST parameter indicates their "type" as in the FSIMILARITY directive. The RANGE parameter contains a list of scalars, one for each variate in the DATA list, allowing you to standardize quantitative variates.

The COORDINATES option must be present and must be a matrix. It represents the units in reduced space. Usually the coordinates will be from a principal coordinates analysis. The number of rows of the matrix must match the number of units present in the variates, taking account of any restriction.

The output from RELATE can be extensive. You may not be interested in relating the variates to the higher dimensions of the principal coordinates analysis even though you may have saved these in the coordinate matrix. The NROOTS option can request that results for only some of the dimensions are printed. If NROOTS is not specified, RELATE prints information for all the saved dimensions: that is, for the number of columns of the coordinates matrix.

 

REML directive

Fits a variance-components model by residual (or restricted) maximum likelihood.

 

Options

PRINT = strings What output to present (model, components, effects, means, stratumvariances, monitoring, vcovariance, deviance, Waldtests, missingvalues); default mode, comp, Wald

PTERMS = formula Terms (fixed or random) for which effects or means are to be printed; default * implies all the fixed terms

PSE = string Standard errors to be printed with tables of effects and means (differences, estimates, alldifferences, allestimates, none); default diff

WEIGHTS = variate Weights for the analysis; default * implies all weights 1

MVINCLUDE = string Whether to include units with missing values in the explanatory factors and variates and/or the y-variates (explanatory, yvariate); default * i.e. omit units with missing values in either explanatory factors or variates or y-variates

SUBMODEL = formula Defines a sub-model of the fixed model to be assessed against the full model

RECYCLE = string Whether to reuse the results from the estimation when printing or assessing a sub-model (yes, no); default no

RMETHOD = string Which random terms to use when calculating RESIDUALS (final, all, notspline); default fina

METHOD = string Indicates whether to use the standard Fisher-scoring algorithm or the new AI algorithm with sparse matrix methods (Fisher, AI); default Fish except when covariance structures are defined

MAXCYCLE = scalar Limit on the number of iterations; default 10

TOLERANCES = variate Tolerances for matrix inversion; default * i.e. appropriate values assumed for the type of computer concerned

 

Parameters

Y = variates Variates to be analysed

RESIDUALS = variates Residuals from each analysis

FITTEDVALUES = variates Fitted values from each analysis

SAVE = pointers Saves the details of each analysis for use in subsequent VDISPLAY and VKEEP directives

 

Description

REML estimates the treatment effects and variance components in a linear mixed model: that is, a linear model with both fixed and random effects. The model to be fitted is specified using the VCOMPONENTS directive, covariance models for the random effects can be defined using the VSTRUCTURE directive, further output can be produced following REML using VDISPLAY, and output can be saved in Genstat data structures using VKEEP. REML is useful in situations where you would normally use ANOVA but have unbalanced or correlated data, or where you would normally use linear regression, but have more than one source of variation or correlation in the data.

REML is applicable in a wide variety of situations. It can be used to obtain information on sources and sizes of variability in data sets. This can be of interest where the relative size of different sources of variability must be assessed, for example to identify the least reliable stages in an industrial process, or to design more effective experiments. REML also provides efficient estimates of treatment effects in unbalanced designs with more than one source of error. It can be used to provide estimates of treatment effects that combine information from all the strata of a partially balanced design, or to combine information over similar experiments conducted at different times or in different places. You can thus obtain estimates that make use of the information from all the experiments, as well as the separate estimates from each individual experiment. Examples from several different areas of application can be found in Robinson (1987). The facilities for estimation of covariance models allow estimates of treatment effects and standard errors to be obtained taking account of correlation in the data.

The method of residual maximum likelihood (REML) was introduced by Patterson and Thompson (1971). It was developed in order to avoid the biased variance component estimates that are produced by ordinary maximum likelihood estimation: because maximum likelihood estimates of variance components take no account of the degrees of freedom used in estimating treatment effects, they have a downwards bias which increases with the number of fixed effects in the model. This in turn leads to under-estimates of standard errors for fixed effects, which may lead to incorrect inferences being drawn from the data. Estimates of variance parameters which take account of the degrees of freedom used in estimating fixed effects, like those generated by ANOVA in balanced data sets, are more desirable.
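The bias can be seen in the simplest possible case: a model with a single fixed effect (the overall mean) and one variance parameter. There the maximum likelihood estimate divides the residual sum of squares by n, whereas the REML estimate divides by n-1, the degrees of freedom left after estimating the mean. The following Python sketch (an illustration outside Genstat, with made-up data) compares the two.

```python
def variance_estimates(y):
    """Return (ML, REML) variance estimates for a model with only an overall mean."""
    n = len(y)
    ybar = sum(y) / n
    rss = sum((v - ybar) ** 2 for v in y)   # residual sum of squares
    ml = rss / n          # ignores the degree of freedom used by the mean
    reml = rss / (n - 1)  # allows for the one fixed effect estimated
    return ml, reml

ml, reml = variance_estimates([2.0, 4.0, 4.0, 6.0])
print(ml)          # 2.0
print(ml < reml)   # True
```

With more fixed effects the gap widens, which is the downward bias described above.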

Once a variance components model has been specified (using VCOMPONENTS) and any covariance structures have been defined (using VSTRUCTURE) you can fit the model to the data (the y-variates) using the REML directive.

Parameter Y lists the variates that are to be modelled. Any of the y-variates or any of the factors or variates in the fixed and random models may be restricted to indicate that only a subset of the units are to be used in the analysis, but if more than one of these vectors is restricted, all must be restricted to the same set of units. For example, given appropriate factor definitions, the following commands set up a model and analyse the data held in variate Yield:

VCOMPONENTS [FIXED=Nitrogen*Variety] RANDOM=Block/Wplot/Splot

REML Yield; FITTED=Fit; RESIDUALS=Res

The parameters FITTEDVALUES and RESIDUALS allow you to store the fitted values and residuals from the fitted model - above they are stored in variates Fit and Res. Parameter SAVE can be used to name the REML save structure for use with later VKEEP and VDISPLAY directives.

The three options PRINT, PTERMS, and PSE all control the printed output. The PRINT option selects the output to be displayed:

model description of model fitted

components estimates of variance components

effects estimates of parameters a and b, the fixed and random effects

means predicted means for factor combinations

stratumvariances approximate stratum variances from a decomposition of the information matrix for the variance components

monitoring monitoring information at each iteration

vcovariance variance-covariance matrix of the estimated components

deviance deviance of the fitted model (-2 × log-likelihood RL) plus deviance of submodel if specified

waldtests Wald tests for all fixed terms in model

missingvalues estimates of missing values

The default setting of PRINT=model,components,Wald, gives a description of the model fitted plus estimates of the variance components and the Wald tests. By default if tables of means and effects are requested, tables for all terms in the fixed model are given together with a summary of the standard error of differences between effects/means. Options PTERMS and PSE can be used to change the terms or obtain different types of standard error. For example,

VCOMPONENTS [FIXED=Nitrogen*Variety] RANDOM=Block/Wplot/Splot

REML [PRINT=means; PTERMS=Nitrogen.Variety; \

PSE=allestimates] Yield

means that a Nitrogen by Variety table of predicted means will be produced with a standard error for each cell.

The MVINCLUDE option allows the inclusion of units with missing values. By default, units where there is a missing value in the y-variate or in any of the factors or variates in the model terms are excluded. The setting explanatory allows units with missing values in factors or variates in the model to be included. For missing covariate values, this is equivalent to substituting the mean value when the covariate has been mean adjusted (see VCOMPONENTS option CADJUST) or zero if unadjusted. The setting yvariate includes units with missing values in the y-variate. This can be useful to retain the balanced structure of the data for use with direct product covariance matrices (see VSTRUCTURE), or to produce predictions of data values for given values of explanatory factors and/or variates.

The WEIGHTS option can be used to specify a weight for each unit in the analysis. This is useful when it is suspected that the size of the random error varies between units. For example, if the random error for unit i is known to have variance v_i σ², a weight variate should be used containing values w_i = 1/v_i.
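
As a minimal sketch, suppose that a variate Vscale (a hypothetical name, not part of the example above) holds the known variance multiplier for each unit. The weights are then formed as the reciprocals of these values and supplied through the WEIGHTS option:

CALCULATE W = 1/Vscale

VCOMPONENTS [FIXED=Nitrogen*Variety] RANDOM=Block/Wplot/Splot

REML [WEIGHTS=W] Yield

Units with larger multipliers, and hence larger error variance, receive smaller weights.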

The RMETHOD option controls the way in which residuals and fitted values are formed. For the default setting RMETHOD=final, the fitted values are calculated from all the fixed and random effects. The residuals are the difference between the data and the fitted values and, in this case, are estimates of the *units* random error and can be used to check the Normality and variance homogeneity assumptions for the random error. To get fitted values constructed from the fixed terms alone, omitting all random terms, the setting RMETHOD=all must be used. The setting RMETHOD=notspline means that the residuals will be formed from all the random effects, excluding spline terms.

Option SUBMODEL is used to specify a sub-model of the fixed model (but only applies when METHOD=Fisher). This model will be fitted as well as the full fixed model, using a slightly modified version of the algorithm, and the difference in deviances between the full and sub-model can be used as a likelihood-based test to assess the importance of the fixed terms dropped from the full model. Once the full model has been fitted, the RECYCLE option can be used to test a series of sub-models of the fixed model. If option RECYCLE=yes is set, then only the estimation for the sub-model is performed. Information for the full fixed model is picked up from the corresponding save structure. When the RECYCLE option is set, only the deviance and model settings of PRINT can be used.
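
The following sketch illustrates one way in which these options might be combined, using the model from the earlier example. It assumes that SUBMODEL accepts a model formula in the same form as the FIXED option of VCOMPONENTS, and that the default METHOD (Fisher scoring) is in use:

VCOMPONENTS [FIXED=Nitrogen*Variety] RANDOM=Block/Wplot/Splot

REML [PRINT=deviance; SUBMODEL=Nitrogen+Variety] Yield

REML [PRINT=deviance; SUBMODEL=Nitrogen; RECYCLE=yes] Yield

The first REML statement fits the full fixed model and the sub-model without the interaction; the difference in deviances assesses the Nitrogen.Variety term. The second statement fits only the further sub-model, taking the information for the full fixed model from the save structure.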

The METHOD option can be used to choose the AI (Average Information) algorithm (Gilmour et al. 1995) with sparse matrix methods to maximize the residual likelihood, instead of Fisher scoring with full matrix manipulation. By default, Fisher scoring will be used, except when covariance structures have been specified (in which case Fisher scoring is not available). The AI algorithm will generally run faster per iteration than Fisher scoring and use much less workspace, but may require slightly more iterations to reach convergence. When sparse matrix methods are used, standard errors of differences will not be available for random effects, although standard errors are saved. Note that when METHOD=AI, the SUBMODEL, RECYCLE and WORKSPACE options do not apply.

Option MAXCYCLE can be used to change the maximum number of iterations performed by the algorithm from the default of 10.

The TOLERANCES option gives tolerances for three matrix inversions. The first two values are matrix inversion tolerances for the information matrix and the mixed model equations respectively and take the value 10^-5 by default. The third value is used to detect zero frequency counts for factor combinations in the mixed model equations: 10^-6 is used by default.

 

References

Gilmour, A.R., Thompson, R. and Cullis, B. (1995). AIREML, an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51, 1440-1450.

Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545-554.

Robinson, D.L. (1987). Estimation and use of variance components. The Statistician 36, 3-14.

 

RESTRICT directive

Defines a restricted set of units of vectors for subsequent statements.

 

No options

 

Parameters

VECTOR = vectors Vectors to be restricted

CONDITION = expression Logical expression defining the restriction for each vector; a zero (false) value indicates that the unit concerned is not in the set

SAVESET = variates List of the units in each restricted set

 

Description

The RESTRICT directive defines a restriction on the units of a vector, so that future operations will involve only a subset of the units. The directives that take account of RESTRICT are listed at the end of this subsection.

The VECTOR parameter specifies the vector or vectors that are to be restricted. These can be variates, factors, or texts, but all the vectors listed must be of the same length.

The CONDITION parameter specifies a logical expression which indicates which units of the vectors are in the defined subset. For example,

VARIATE [VALUES=1,2,3,2,3,4,3,4,5] V

RESTRICT V; CONDITION=V.EQ.2

restricts the vector V to those units with the value 2. Genstat evaluates the expression to generate internally a variate of zeroes and ones, of the same length as the vectors being restricted. A zero value indicates that the corresponding unit is to be excluded. The logical expression can involve any vector of the same length as the vector to be restricted. For example, to restrict variate V and text T to the units with levels 1 or 2 or 4 of factor F, you could use the statement

RESTRICT V,T; CONDITION=(F.LE.2).OR.(F.EQ.4)

When using a text to define a restriction, remember that you cannot use logical operators like .EQ. and .NE. Instead you should use operators .IN., .NI., .EQS., and .NES.:

TEXT [VALUES=London,Madrid,Nairobi,Ottawa,Paris,Quito,Rome]\

City

& [VALUES=London,Madrid,Paris,Rome] Europe

RESTRICT City; CONDITION=City.IN.Europe

restricts the text City to lines 1, 2, 5, and 7 only.

Of course, the expression may just contain a single variate of the same length as the vectors to be restricted. Again a zero indicates that the corresponding unit in the vector to be restricted is excluded, while any non-zero entry causes inclusion. Thus the restriction above on the text City could also be specified by

RESTRICT City; CONDITION=!(1,1,0,0,1,0,1)

The same effect can be achieved by using the EXPAND function:

RESTRICT City; CONDITION=EXPAND(!(1,2,5,7))

Another function that may be useful is RESTRICTION; this allows you to generate a variate of ones and zeros indicating the units to which a vector is currently restricted. It thus provides a very convenient way of transferring a restriction from one vector to another. For example,

RESTRICT Timezone,Distance; CONDITION=RESTRICTION(City)

restricts the vectors Timezone and Distance to the same units as those to which City is currently restricted.

Finally, if you omit the CONDITION parameter, any restrictions on the vectors are removed. For example

RESTRICT City,Timezone,Distance

removes any restrictions that have been set on City, Timezone, and Distance.

Note that if the vectors used in the CONDITION expression are themselves restricted these restrictions will remain in force during the current calculation of the condition. A danger here, therefore, is that you may accidentally end up restricting out all the elements of a vector by using RESTRICT repeatedly. The safest way to avoid this is to remove the restrictions on any vectors to be used in the CONDITION expression before you use them to restrict vectors in some different way.
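
As a small illustration of this advice, using the variate V from the earlier example, the second statement below clears the existing restriction before a new one is defined from V itself:

VARIATE [VALUES=1,2,3,2,3,4,3,4,5] V

RESTRICT V; CONDITION=V.GT.2

RESTRICT V

RESTRICT V; CONDITION=V.LT.4

After these statements V is restricted to units 1, 2, 3, 4, 5, and 7. Without the intervening RESTRICT V, the condition V.LT.4 would be calculated with the first restriction still in force, so the resulting set could be narrower than intended.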

The SAVESET parameter can be used to save the numbers of the units that are in the restricted set. These are saved in a variate with one value for each unit retained by the restriction. Thus, if the example above with variate V were to become

VARIATE [VALUES=1,2,3,2,3,4,3,4,5] V

RESTRICT V; CONDITION=V.EQ.2; SAVESET=S

S would be created as a variate of length 2, with values 2 and 4.

Not all directives take account of RESTRICT. For those that do, usually only one vector in the list of parameters has to be restricted for the directive to treat them all as being restricted in the same way. A fault is reported if any vectors in such a list are restricted in different ways.

 

RESUME directive

Restarts a recorded job.

 

Option

CHANNEL = scalar Channel number of the unformatted file where the information was dumped; default 1

 

No parameters

 

Description

RESUME recovers the information stored by a previous RECORD statement so that you can continue your use of Genstat as though nothing had happened in between. Genstat deletes all the data structures that were created in the current job prior to RESUME, and reinstates the data structures that were available in Genstat at the time the RECORD statement took place. In addition, the current graphics settings are replaced by those that were in force when RECORD was used, but any external files that are attached to Genstat remain unaffected.

If the RECORD directive was used within a procedure or a FOR loop, the job is not resumed at that point. Instead, it restarts at the statement after the procedure call, or after the outermost ENDFOR statement.

The amount of space available for data in the current job need not be the same as that in the recorded job. However, you will get a fault if the available space is too small, that is, if the space needed by the recorded job is greater than the space available in the current job.
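
A typical sequence might look like the following sketch. The file name 'session.dmp' is hypothetical, and it is assumed here that RECORD takes a CHANNEL option corresponding to that of RESUME, and that the file is attached with FILETYPE=unformatted. In the original job:

OPEN 'session.dmp'; CHANNEL=1; FILETYPE=unformatted

RECORD [CHANNEL=1]

Then, in a later job:

OPEN 'session.dmp'; CHANNEL=1; FILETYPE=unformatted

RESUME [CHANNEL=1]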

 

RETRIEVE directive

Retrieves structures from a subfile.

 

Options

CHANNEL = scalar Specifies the channel number of the backing-store or procedure-library file containing the subfile (FILETYPE settings 'back' or 'proc'); default 0 (i.e. the workfile) for FILETYPE=back, no default for FILETYPE=proc, not relevant with other FILETYPE settings

SUBFILE = identifier Identifier of the subfile; default SUBFILE

LIST = string How to interpret the list of structures (inclusive, exclusive, all); default incl

MERGE = string Whether to merge structures with those already in the job (yes, no); default no, i.e. a structure whose identifier is already in the job overwrites the existing one, unless it has a different type

FILETYPE = string Indicates the type of file from which the information is to be retrieved (backingstore, procedurelibrary, siteprocedurelibrary, Genstatprocedurelibrary); default back

 

Parameters

IDENTIFIER = identifiers Identifiers to be used for the structures after they have been retrieved

STOREDIDENTIFIER = identifiers Identifier under which each structure was stored

 

Description

You can recover information from a subfile of a backing-store file using the RETRIEVE directive. The CHANNEL option specifies the backing-store file, and the SUBFILE option indicates the subfile. Both these options can be omitted; by default the file will be the workfile, and the subfile will be called SUBFILE.

When you retrieve a structure Genstat may also retrieve a chain of associated structures: that is, all the structures to which it points, and the structures to which they point, and so on. For example, suppose you store the three structures with identifiers T, V, and F, along with an unnamed structure storing information about T, in a subfile called SUBFILE in backing-store file FILE1:

OPEN 'FILE1'; CHANNEL=1; FILETYPE=backingstore

TEXT [VALUES=a,b,c] T

VARIATE V; EXTRA=T

FACTOR [LABELS=T] F

STORE [CHANNEL=1] T,V,F

Then the statement

RETRIEVE [CHANNEL=1] V

will retrieve not only V but also T (which was associated with V by the EXTRA parameter of the VARIATE statement), and the unnamed structure that is associated with T. The structures V, T, and the unnamed structure, are said to be a complete set from the subfile.

The IDENTIFIER parameter specifies the structures to be retrieved. You can use the STOREDIDENTIFIER parameter to give a structure a different name from the one within the subfile. For example

RETRIEVE IDENTIFIER=Weeks; STOREDIDENTIFIER=Time

You are not allowed to give identical identifiers to two retrieved structures, nor are you allowed to have the same identifier referring to a structure of one type in a subfile, and to a structure of a different type in your job.

As with STORE, if you want to rename only some of the structures, you can either respecify the existing identifier, or insert * at the appropriate point in the STOREDIDENTIFIER list.

Genstat knows whether you are retrieving a procedure by the type of file that you are accessing, as set by the FILETYPE option. You are not allowed to rename a procedure as a suffixed identifier or as the name of a directive.

You can even rename a structure so that it is unnamed in the job. Suppose, for example, that a structure T already exists within Genstat, and that you want to retrieve the variate V stored in the file FILE1 above. Then, as we have seen, the structure T will also be retrieved. However, you can avoid the existing structure T in the job being overwritten by making the retrieved version of T unnamed:

OPEN 'FILE1'; CHANNEL=1; FILETYPE=backingstore

RETRIEVE [CHANNEL=1] V,!T(a); STOREDIDENTIFIER=V,T

The value, a, of the unnamed text !T(a) will be replaced by the values stored for T, and this unnamed text will become the EXTRA text for V. Alternatively you could rename T to be TNew by

RETRIEVE [CHANNEL=1] V,TNew; STOREDIDENTIFIER=V,T

When you are retrieving a suffixed identifier, Genstat matches the numerical suffix only, and not the whole structure that is denoted by the identifier. For example, suppose pointer P stored in a subfile points to structures with identifiers A, B, C, and D, and that P has numerical suffixes 1 to 4 respectively. Also suppose that in your current job, you have never mentioned pointer P either directly or indirectly. Then the statement

RETRIEVE [CHANNEL=1] P[2]

will retrieve the structure B from backing store but, as it has been referenced only as P[2] in the RETRIEVE statement, the identifier B will not be recovered and it will be known only as P[2] within Genstat.

A structure that you are retrieving from a subfile may sometimes overwrite the values of an existing structure in your program. If this structure is a pointer or a compound structure, the existing suffixes will be overwritten by those of the stored structure, so some existing structures with suffixed identifiers may in effect be lost. For example, suppose that userfile FILE2 contains a pointer P, with suffixes 1 and 2 pointing to structures A and B. If we set up a variate P[3], and then retrieve the pointer P

OPEN 'FILE2'; CHANNEL=1; FILETYPE=backingstore

VARIATE [VALUES=1...6] P[3]

RETRIEVE [CHANNEL=1] P

P will now have suffixes 1 and 2 pointing to A and B, but the variate P[3] will have been lost.

The LIST option controls how the IDENTIFIER list is interpreted. The default setting inclusive simply retrieves the structures that have been listed. Alternatively, if you set LIST=all Genstat will retrieve all the structures in the subfile that have identifiers and whose types have been defined. Finally, you can set LIST=exclusive to retrieve everything in the subfile that you have not listed in the IDENTIFIER parameter. Note, though, that some of the structures in the IDENTIFIER list may be retrieved if they are needed to complete the set of structures to be retrieved. If you use this setting, the STOREDIDENTIFIER parameter is ignored.
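
For instance, with the file FILE1 set up earlier (containing T, V, and F) attached on channel 1, the two non-default settings might be used as in this sketch:

RETRIEVE [CHANNEL=1; LIST=all]

RETRIEVE [CHANNEL=1; LIST=exclusive] F

The first statement retrieves every named structure in the subfile. The second retrieves everything except F, that is, V and T together with any structures needed to complete the set.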

The FILETYPE option specifies whether you wish to retrieve information from backing-store files that have been attached as normal backing-store files or as procedure libraries by the OPEN directive, or from the Genstat Procedure Library or the site procedure library. The CHANNEL setting is ignored if the siteprocedurelibrary or Genstatprocedurelibrary settings are used. The source code of the procedures in the Genstat Procedure Library can be accessed using the LIBEXAMPLE procedure.

Normally when you retrieve a complete subset of structures, Genstat overwrites all structures in the job that have the same identifier (after any renaming). As a result, some other structures already in the job may become inconsistent and will be destroyed. You can avoid this happening by setting the MERGE option to yes. Genstat then does not overwrite any structures with the same name and type. However, a consequence is that some of the retrieved structures may now be inconsistent and thus need to be destroyed in the program (although they will of course remain in the subfile).

 

RETURN directive

Returns to a previous input stream (text vector or input channel).

 

Options

NTIMES = scalar Number of streams to ascend; default 1

CLOSE = string Whether to close the channel (or text) after the return (yes, no); default no

DELETE = string Whether to delete the text or the file to which the channel was attached (only relevant if CLOSE=yes) after the return (yes, no); default no

 

Parameter

expression Logical expression controlling whether or not to return to the previous input stream; default 1 (i.e. true)

 

Description

In its simplest form, you can type

RETURN

to make Genstat stop taking statements from the current input channel and to go back to the channel that was previously active, and contained the INPUT statement that switched to the current file. Input then continues from the line following the original INPUT statement, but a marker is left in the channel that contains the RETURN statement, so that you can use INPUT to continue from the next line after RETURN later in your program.

Sometimes you may want to return only if a particular condition is satisfied, for example if you have discovered that the data are unsatisfactory for whatever operations occur later in the file. To do this, you set the parameter to an appropriate logical expression; this must return a scalar result, which is interpreted as true if it is equal to 1, and false otherwise. For example

RETURN MIN(Height)<0

If you have used INPUT several times, you may wish to return through several channels. The NTIMES option can be set to a number, or a scalar, to control how many returns take place. For example, suppose that input started on channel 1, that you then used INPUT 2 to switch to a file on channel 2, and then INPUT 3 to switch to a further file (on channel 3). If this file contained the statement RETURN [NTIMES=2], you would return to channel 1. You can never return from input channel 1, so if you set NTIMES to a number greater than the number of currently active input channels, Genstat simply returns to channel 1.

You can set option CLOSE=yes to close the file; also, if you do have CLOSE=yes, you can set DELETE=yes to delete the file.
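
The options can be combined with the logical expression. As a minimal sketch, reusing the variate Height from the example above, the statement

RETURN [CLOSE=yes] MIN(Height)<0

returns to the previous input stream only if Height contains a negative value, and in that case also closes the file from which the statement was read. If the condition is false, input simply continues from the next line of the current file.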

If Genstat meets the end of the file on the current input channel, it will try to return control to the channel from which it was called. This is called an implicit return. The channel is closed automatically when this happens, and a warning message will be printed.

In order to maintain control over the different input channels, and know where to go after a RETURN, Genstat keeps an internal stack of input channels. Suppose you specify channel k, by typing INPUT k. There are three possible actions:

(a) if k is the current input channel, the statement is ignored;

(b) if k is not in the stack, it is added to it;

(c) if k is already in the stack (that is, the current state is: 1 → ... → k → k1 → k2 → ... → kn) then the intermediate channels k1 ... kn are suspended at their current positions and removed from the stack.

Input then switches to channel k, taking statements from the beginning of the file if it has never been used before, or from the point at which it was last suspended. Subsequent INPUT statements will re-start the other channels from where they were suspended. When a RETURN statement is used, Genstat steps back NTIMES through the stack, removing any intermediate channels from the stack. This means that, using the above representation of the input stack, if channel kn contained the statement INPUT k2 and channel k2 then had a RETURN, this would return to channel k1.
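
As a worked illustration of these rules, suppose that channels 1, 2, 3, and 4 have been opened in that order by successive INPUT statements, so that the stack holds channels 1, 2, 3, 4 and channel 4 is current. If the file on channel 4 then contains

INPUT 2

case (c) applies: channels 3 and 4 are suspended and removed, the stack holds channels 1 and 2, and input resumes on channel 2 at the line after its earlier INPUT 3 statement. A subsequent

RETURN

in channel 2 goes back to channel 1. If INPUT 4 is used later, channel 4 restarts from the line following its INPUT 2 statement.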

If you use ## to execute macros, these are treated in the same way as input channels and added to the input stack. You can use INPUT to temporarily halt a macro and switch to a file, and RETURN to get back to the macro.

 

RFUNCTION directive

Estimates functions of parameters of a nonlinear model.

 

Options

PRINT = strings What to print (estimates, se, correlations); default esti,se

CHANNEL = identifier Channel number of file, or identifier of a text to store output; default current output file

CALCULATION = expression structures Calculation of functions involving nonlinear and/or linear parameters; no default

SE = variate To save approximate standard errors; default *

VCOVARIANCE = symmetric matrix To save approximate variance-covariance matrix; default *

SAVE = identifier Specifies save structure of regression model; default * i.e. that from last model fitted

 

Parameter

scalars Identifiers of scalars assigned values of the functions by the calculations

 

Description

The RFUNCTION directive provides estimates of functions of parameters in nonlinear models, together with approximate standard errors and correlations. It can be used after any of the models fitted by the FITCURVE or FITNONLINEAR directives, or by the FIT directive with the CALCULATION option set. However, if there are any linear parameters in a general nonlinear model for which standard errors have not been estimated, standard errors and correlations cannot be estimated for functions that depend on those parameters. In addition, it is not possible to use the RFUNCTION directive after fitting standard curves with separate nonlinear parameters for each level of a factor (option NONLINEAR=separate in FITCURVE, ADD, DROP and SWITCH).

The functions are defined by the expressions supplied by the CALCULATION option of RFUNCTION; these define how to calculate the function from the values of the parameters. Unless initial values have been specified using the RCYCLE directive, the parameters in standard curves usually have no identifiers associated with them. If this is the case, you should refer to each parameter by using a text structure containing the name of the parameter as displayed, for example, by the option PRINT=estimates of the FITCURVE directive. The text structure can, of course, just be a string, for example 'R'. However, they must match exactly, including case, the names displayed by FITCURVE.

The parameter of RFUNCTION provides a list of scalars that are to hold the estimated values of the functions. These need not be declared in advance, but will be defined automatically if necessary. The CALCULATION option specifies a list of one or more expressions to define the calculations necessary to evaluate the functions from the parameters of the nonlinear model, and place the results into the scalars.

The PRINT option controls output as usual. By default, the estimates of the function values are formed - as could be done simply by a CALCULATE statement using the expressions if the parameters were available in scalars. In addition, approximate standard errors are calculated, using a first-order approximation based on difference estimates of the derivatives of each function with respect to each parameter. Approximate correlations can also be requested.

The SE and VCOVARIANCE options allow standard errors and the approximate variance-covariance matrix of the functions to be stored; the estimates of the functions themselves are automatically available in the scalars listed by the parameter of RFUNCTION. The SAVE option specifies which fitted model is to be used, as in the RDISPLAY and RKEEP directives.

 

RKEEP directive

Stores results from a linear, generalized linear, generalized additive, or nonlinear model.

 

Options

EXPAND = string Whether to put estimates in the order defined by the maximal model for linear or generalized linear models (yes, no); default no

DISPERSION = scalar Dispersion parameter to be used as estimate for variability in s.e.s; default as set in the MODEL directive

RMETHOD = string Type of residuals to form if parameter RESIDUALS is set (deviance, Pearson, simple); default as set in MODEL

DMETHOD = string Basis of estimate of dispersion, if not fixed by DISPERSION option (deviance, Pearson); default as set in MODEL

OMODEL = pointer Pointer to settings of options of the current MODEL statement, given unit labels corresponding to the option names of MODEL (starting with 'distribution')

PMODEL = pointer Pointer to settings of parameters of the current MODEL statement, given unit labels corresponding to the parameter names of MODEL (starting with 'y'), only refers to the first setting of Y, FITTEDVALUES, and RESIDUALS

SAVE = identifier Specifies save structure of model; default * i.e. that from latest model fitted

 

Parameters

Y = variates Response variates for which results are to be saved; default takes the response variates from the most recent MODEL statement

RESIDUALS = variates Standardized residuals for each Y variate

FITTEDVALUES = variates Fitted values for each Y variate

LEVERAGES = variate Leverages of the units for each Y variate

ESTIMATES = variates Estimates of parameters for each Y variate

SE = variates Standard errors of the estimates

INVERSE = symmetric matrix Inverse matrix from a linear or generalized linear model, inverse of second derivative matrix from a nonlinear model

VCOVARIANCE = symmetric matrix Variance-covariance matrix of the estimates

DEVIANCE = scalars Residual ss or deviance

DF = scalar Residual degrees of freedom

TERMS = pointer or formula structure Fitted terms (excluding constant)

ITERATIVEWEIGHTS = variate Iterative weights from a generalized linear model

LINEARPREDICTOR = variate Linear predictor from a generalized linear model

YADJUSTED = variate Adjusted response of a generalized linear model

EXIT = scalar Exit status from a generalized linear or nonlinear model

GRADIENTS = pointer Derivatives of fitted values with respect to parameters in a nonlinear model

GRID = variate Grid of function or deviance values from a nonlinear model

DESIGNMATRIX = matrix Design matrix whose columns are explanatory variates and dummy variates

PEARSONCHI = scalar Pearson chi-squared statistic from a generalized linear model

STERMS = pointer Saves the identifiers of the variates that have been smoothed in the current model

SCOMPONENTS = pointer Saves a pointer to variates holding the nonlinear components of the variates that have been smoothed

NOBSERVATIONS = scalar Number of units used in regression, excluding missing data and zero weights and taking account of restrictions

SEFITTEDVALUES = variate Saves standard errors of the fitted values

INFLATION = variate Saves the variance inflation factors of the parameter estimates

 

Description

RKEEP allows you to copy information from a regression analysis (performed by FIT, FITCURVE, or FITNONLINEAR) into Genstat data structures. You do not need to declare the structures in advance; Genstat will declare them automatically to be of the correct type and length.

The Y parameter specifies the response variates for which the results are to be saved. Unusually for the first parameter of a directive, this has a default: if you leave it out, Genstat assumes that results are to be saved for all the response variates, as given in the previous MODEL statement.

The RESIDUALS, FITTEDVALUES, LEVERAGES, and SEFITTEDVALUES parameters allow you to save the standardized residuals, the fitted values, the leverages, and the standard errors of the fitted values. For example, RESIDUALS=R puts the residuals in a variate R. The RMETHOD option controls the type of residuals that are formed. You cannot save these values if you have set RMETHOD=* in the MODEL statement. The standard errors of fitted values are defined by:

s.e. = √(leverage × variance function × dispersion / weight)

where the variance function is calculated from the fitted value according to the setting of the DISTRIBUTION option of the current MODEL statement, and the dispersion is the fixed or estimated value of dispersion, as controlled by the DISPERSION and DMETHOD options of the MODEL and RKEEP directives.
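
The formula can be checked numerically in the simplest case. The sketch below assumes an ordinary linear regression with the default Normal distribution (so the variance function is 1), no weights (so the weight is 1), and a dispersion estimated by the residual mean square; the identifiers Y and X are hypothetical:

MODEL Y

FIT X

RKEEP LEVERAGES=Lev; SEFITTEDVALUES=Sefit; DEVIANCE=Rss; DF=Rdf

CALCULATE Check = SQRT(Lev*Rss/Rdf)

PRINT Sefit,Check

Under these assumptions the two printed columns should agree, since Rss/Rdf is the estimated dispersion.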

The ESTIMATES and SE parameters save the parameter estimates and their standard errors; RKEEP puts them in variates, using the same order as in the display produced by the PRINT option of the directive used to fit the model. Alternatively, if you have used TERMS to define a maximal model, you can set option EXPAND=yes to reorder the estimates to their order in the maximal model (including missing values for the parameters not currently in the model). The variates saving these values are set up with labels; thus, you can refer to individual values in expressions using the labels as displayed when the estimates are printed. For example, to get the estimate of the constant into a scalar, you could put:

RKEEP ESTIMATES=Esti

SCALAR Const

CALCULATE Const = Esti$['Constant']

The INFLATION parameter allows the variance inflation factors of the parameters to be saved.

The INVERSE parameter allows you to save the inverse matrix as a symmetric matrix: that is, (X′X)^-1 where X is the design matrix. This matrix is the same for all response variates.

The VCOVARIANCE parameter saves the variance-covariance matrix of the estimates for each response variate: these are formed by multiplying the inverse matrix by the relevant variance estimate based on the estimated dispersion, or on the dispersion that you have supplied.
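
For an ordinary linear regression with estimated dispersion, this relationship can be sketched as follows, taking the variance estimate to be the residual mean square. The identifiers are hypothetical, and it is assumed that a model has already been fitted by FIT:

RKEEP INVERSE=Inv; VCOVARIANCE=Vcov; DEVIANCE=Rss; DF=Rdf

CALCULATE Vcheck = Inv*(Rss/Rdf)

PRINT Vcov

PRINT Vcheck

The two printed matrices should then be the same.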

The DEVIANCE parameter lets you save the residual sum of squares, or the deviance for distributions other than Normal. The DF parameter saves the residual degrees of freedom.

The LINEARPREDICTOR parameter lets you save the linear predictor of a generalized linear model; the values of the linear predictor are the same as the fitted values if the link function is the identity function.

The ITERATIVEWEIGHTS parameter saves a variate containing the iterative weights used in the last cycle of the iteration for fitting a generalized linear model. The iterative weights do not contain any contribution from the weights that can be specified, whether or not the model is iterative, by the WEIGHTS option of the MODEL directive, and they are 1.0 for ordinary linear regression.

The YADJUSTED parameter saves the adjusted response variate used in the last cycle of the iteration for fitting a generalized linear model; with the identity link function this is the same as the response variate.

The Pearson chi-squared statistic can be saved using the PEARSONCHI parameter of RKEEP. It is calculated as the sum of the squared Pearson residuals. This can be used as an alternative to the deviance for testing goodness of fit; see McCullagh and Nelder (1989).

The EXIT parameter of RKEEP provides a code that indicates the success or type of failure of an iterative fit. Codes 0-7 are relevant to standard curves and general nonlinear models, and codes 0 and 8-13 are for generalized linear models:

0 Successful fitting

1 Limit on number of cycles has been reached without convergence

2 Parameter out of bounds

3 Likelihood appears constant

4 Failure to progress towards solution

5 Some standard errors are not available because the information matrix is nearly singular

6 Calculated likelihood may be incorrect because of missing fitted values

7 Curve is close to a limiting form

8 Data incompatible with model

9 Predicted mean or linear predictor out of range

10 Invalid calculation for calculated link or distribution

11 All units have been excluded from the analysis

12 Iterative process has diverged

13 Failure due to lack of space or data access
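
When a saved exit code is processed outside Genstat, the list above can be held as a lookup table. The Python sketch below is illustrative only; the helper names are hypothetical, and the grouping of codes follows the statement that codes 0-7 relate to standard curves and general nonlinear models and codes 0 and 8-13 to generalized linear models.

```python
EXIT_CODES = {
    0: "Successful fitting",
    1: "Limit on number of cycles has been reached without convergence",
    2: "Parameter out of bounds",
    3: "Likelihood appears constant",
    4: "Failure to progress towards solution",
    5: "Some standard errors are not available because the information "
       "matrix is nearly singular",
    6: "Calculated likelihood may be incorrect because of missing fitted values",
    7: "Curve is close to a limiting form",
    8: "Data incompatible with model",
    9: "Predicted mean or linear predictor out of range",
    10: "Invalid calculation for calculated link or distribution",
    11: "All units have been excluded from the analysis",
    12: "Iterative process has diverged",
    13: "Failure due to lack of space or data access",
}


def describe_exit(code):
    """Return the description of an exit code, or a fallback message."""
    return EXIT_CODES.get(code, "Unknown exit code")


def applies_to(code):
    """Return the set of model classes for which a code is relevant."""
    classes = set()
    if 0 <= code <= 7:
        classes.add("standard curves and general nonlinear models")
    if code == 0 or 8 <= code <= 13:
        classes.add("generalized linear models")
    return classes
```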

The derivatives of the fitted values with respect to each parameter in a standard curve or general nonlinear model can be stored in variates using the GRADIENTS parameter. You can use these quantities to assess the relative influence of each observation on a parameter; you can also construct a measure of leverage by summing the gradients for all the parameters.

The GRID parameter can be used to store a grid of values of the deviance (or any general function) from FITNONLINEAR.

The DESIGNMATRIX parameter allows you to save the matrix X. The columns correspond to the parameters of the model, ordered as for the ESTIMATES parameter. For simple linear regression with a constant this has only two columns, the first containing ones and the second containing the values of the explanatory variate. Columns corresponding to aliased parameters are omitted, but you can use the corresponding option of TERMS to construct the full design matrix.
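
For the simple linear regression case just described, the layout of the matrix can be sketched in Python as follows. This is an illustration of the two-column structure only, with a hypothetical function name; it does not reproduce how Genstat stores the matrix.

```python
def design_matrix(x):
    """Rows of X for a constant plus one explanatory variate."""
    # First column: ones (the constant); second column: the variate values
    return [[1.0, float(v)] for v in x]


X = design_matrix([2, 5, 7])
```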

The PEARSONCHI parameter provides the Pearson chi-squared statistic for dispersion, which is the same as the residual sum of squares for the Normal distribution, but is different from the deviance for other distributions.

The STERMS and SCOMPONENTS parameters are relevant to generalized additive models. The STERMS parameter can be used to store a pointer to those variates whose effects in the model are smoothed. The SCOMPONENTS parameter stores a pointer to variates, one for each smoothed variate in the same order as in STERMS, containing the fitted nonlinear component of each smoothed variate; this does not include the linear component or the constant term.

The NOBSERVATIONS parameter allows you to save the number of units used in the analysis, omitting units with missing values or excluded by restrictions. This will be the same as the total number of degrees of freedom plus one, except in a regression with no constant term and no explanatory factors, when it will equal the total number of degrees of freedom.

The DISPERSION option allows you to define the value to be used for the dispersion parameter when calculating the standard errors. The DMETHOD option indicates how this should be calculated if DISPERSION is not set. By default the deviance is used but you can set DMETHOD=Pearson to request the Pearson chi-squared statistic to be used instead.
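
The choice between the two options can be sketched in Python as below. This is not Genstat code: it assumes that the dispersion is estimated by dividing the chosen statistic by the residual degrees of freedom, and the function and argument names are hypothetical.

```python
def dispersion_estimate(deviance, pearson_chi, residual_df,
                        dmethod="deviance", dispersion=None):
    """Dispersion used for standard errors under the stated assumptions."""
    if dispersion is not None:
        # A supplied value takes precedence over any estimate
        return dispersion
    if dmethod == "deviance":
        statistic = deviance
    elif dmethod == "pearson":
        statistic = pearson_chi
    else:
        raise ValueError("dmethod must be 'deviance' or 'pearson'")
    return statistic / residual_df
```

For example, with a deviance of 12.0, a Pearson chi-squared statistic of 15.0 and 6 residual degrees of freedom, the two methods give 2.0 and 2.5 respectively.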

Options OMODEL and PMODEL allow you to extract information about the current model. OMODEL can be set to a pointer to store information about each of the options set in the previous MODEL statement. For example, the statement

RKEEP [OMODEL=Om]

will allow you to refer to the current variate of weights (if one was set in the WEIGHTS option of MODEL) as Om['weights']. Whether or not a variate was set, the statement

MODEL [WEIGHTS=Om['weights']] Newobs

will allow a new analysis with the same weighting as the old.

The pointer Om has 16 values, with suffixes (in lower case) corresponding to the options of MODEL in the defined order. Similarly, the statement

RKEEP [PMODEL=Pm]

will set up a pointer storing the (eight) current parameter settings of the previous MODEL statement. However, if there was more than one response variate, the first value of the pointer will be the identifier of the first response variate only: the others are not stored. Similarly, only the fitted-values and residuals variates for the first response will be pointed at. For example, the identifier Pm[1] or Pm['y'] can be used to refer to the current response variate after the RKEEP statement above.

 

Reference

McCullagh, P. and Nelder, J.A. (1989). Generalized linear models (second edition). Chapman and Hall, London.

 

ROTATE directive

Does a Procrustes rotation of one configuration of points to fit another.

 

Options

PRINT = strings Printed output required (rotations, coordinates, residuals, sums); default * i.e. no printing

SCALING = string Whether or not isotropic scaling is allowed (yes, no); default no

STANDARDIZE = strings Whether to centre the configurations (at the origin), and/or to normalize them (to unit sum of squares) prior to rotation (centre, normalize); default cent,norm

SUPPRESS = string Whether to suppress reflection (yes, no); default no

 

Parameters

XINPUT = matrices Inputs the fixed configuration

YINPUT = matrices Inputs the configuration to be fitted

XOUTPUT = matrices To store the (standardized) fixed configuration

YOUTPUT = matrices To store the fitted configuration

ROTATION = matrices To store the rotation matrix

RESIDUALS = matrices or variates To store distances between the (standardized) fixed and fitted configurations

RSS = scalars To store the residual sum of squares

 

Description

The ROTATE directive provides orthogonal Procrustes rotation. You must set the parameters XINPUT and YINPUT, which specify respectively the fixed configuration and the configuration that you want to be translated and rotated; these are referred to below as X and Y. The other parameters are used for saving results from the analysis. For X and Y to refer to the same set of objects they must have the same number of rows, and each object must be represented by the same row in both X and Y. If the XINPUT matrix is n × p and the YINPUT matrix is n × q, Genstat does the analysis using matrices that are n × r, where r is max(p, q). The smaller matrix is expanded with columns of zeros.
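
The expansion to a common number of columns can be illustrated with a short Python sketch. It is not Genstat code; matrices are represented as lists of rows and the function names are hypothetical.

```python
def pad_columns(matrix, width):
    """Append columns of zeros so that every row has `width` entries."""
    return [list(row) + [0.0] * (width - len(row)) for row in matrix]


def match_dimensions(x, y):
    """Expand the narrower configuration to r = max(p, q) columns."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of rows")
    r = max(len(x[0]), len(y[0]))
    return pad_columns(x, r), pad_columns(y, r)


x3 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # n = 2, p = 3
y2 = [[1.0, 2.0], [3.0, 4.0]]            # n = 2, q = 2
px, py = match_dimensions(x3, y2)
```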

The PRINT option specifies which results you want to print; the settings are as follows.

coordinates specifies that the fixed and fitted configurations are to be printed; note that the fixed configuration is printed after any standardization (see below), and the fitted configuration is printed after standardization and rotation.

residuals prints the residual distances of the points in the fixed configuration from the fitted points; this is after any standardization and rotation.

rotations prints the orthogonal rotation matrix.

sums prints an analysis of variance giving the sums of squares of each configuration, and the residual sum of squares; if scaling is used, the scaling factor is also printed.

The three other options of the ROTATE directive control the form of analysis. The SCALING option specifies whether you want least-squares scaling to be applied to the standardized YINPUT matrix when finding the best fit to the fixed configuration. You should set SCALING=yes if you want scaling; Genstat will then print the least-squares scaling factor with the analysis of variance. By default there is no scaling.
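
As an illustration of least-squares scaling, the Python sketch below computes the factor rho that minimizes the sum of squared differences between X and rho times an already rotated Y, namely sum(x*y) / sum(y*y) taken over all matrix elements. That Genstat obtains its scaling factor by exactly this formula is an assumption; the function name and data are hypothetical.

```python
def least_squares_scale(x, y_rotated):
    """Scaling factor rho minimizing the squared distance of X from rho*Y."""
    cross = sum(a * b
                for row_x, row_y in zip(x, y_rotated)
                for a, b in zip(row_x, row_y))
    ss_y = sum(b * b for row_y in y_rotated for b in row_y)
    return cross / ss_y


rho = least_squares_scale([[2.0, 0.0], [0.0, 2.0]],
                          [[1.0, 0.0], [0.0, 1.0]])
```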

The STANDARDIZE option specifies what preliminary standardization is to be applied to the XINPUT and YINPUT matrices. It has settings:

centre centre the matrices to have zero column means;

normalize normalize the matrices to unit sums of squares.

The default is STANDARDIZE=centre,normalize. The initial centring ensures that the configurations are translated to have a common centroid, and thus automatically provides the best translation of Y to match X. The normalization arranges that the residual sum of squares from rotating X to Y is the same as that for rotating Y to X. Switching off both centring and normalization is rarely advisable, but can be requested by putting STANDARDIZE=*.
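
The two standardization steps can be sketched in Python as follows. This is an illustration rather than Genstat code: it centres each column to zero mean and then divides by the square root of the total sum of squares. The function name and its arguments are hypothetical.

```python
import math


def standardize(matrix, centre=True, normalize=True):
    """Centre columns to zero mean and/or scale to unit sum of squares."""
    rows = [[float(v) for v in row] for row in matrix]
    if centre:
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        rows = [[v - m for v, m in zip(row, means)] for row in rows]
    if normalize:
        total = sum(v * v for row in rows for v in row)
        if total > 0.0:  # leave an all-zero configuration unchanged
            scale = math.sqrt(total)
            rows = [[v / scale for v in row] for row in rows]
    return rows


square = [[0, 0], [2, 0], [2, 2], [0, 2]]
standardized = standardize(square)
```

After both steps the column means of the example are zero and the sum of squares of all its elements is one.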

With some methods of multivariate analysis, for example the analysis of skew-symmetry, the direction of travel about the origin is important. It is then undesirable to perform a reflection as part of the rotation: the SUPPRESS option can be used to prevent this. The default setting is no, which allows reflection to take place.
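
The effect of suppressing reflection can be illustrated for two-dimensional configurations, where the best orthogonal matrix has a closed form. The Python sketch below is not Genstat's algorithm: it assumes configurations that are already standardized, handles only two columns (the general case typically needs a singular value decomposition), and uses hypothetical names. The fitted configuration is Y multiplied on the right by the matrix H.

```python
import math


def rotate_2d(x, y, suppress=False):
    """Return (H, fitted, rss) minimizing the squared distance of X from YH."""
    # Elements of the cross-product matrix C = Y'X
    c00 = sum(b[0] * a[0] for a, b in zip(x, y))
    c01 = sum(b[0] * a[1] for a, b in zip(x, y))
    c10 = sum(b[1] * a[0] for a, b in zip(x, y))
    c11 = sum(b[1] * a[1] for a, b in zip(x, y))
    # Largest attainable trace(H'C) for a proper rotation and for a reflection
    rot = math.hypot(c00 + c11, c10 - c01)
    ref = math.hypot(c00 - c11, c01 + c10)
    if suppress or rot >= ref:
        if rot == 0.0:
            c, s = 1.0, 0.0  # no preferred direction: use the identity
        else:
            c, s = (c00 + c11) / rot, (c10 - c01) / rot
        h = [[c, -s], [s, c]]   # determinant +1
    else:
        c, s = (c00 - c11) / ref, (c01 + c10) / ref
        h = [[c, s], [s, -c]]   # determinant -1
    fitted = [[p[0] * h[0][0] + p[1] * h[1][0],
               p[0] * h[0][1] + p[1] * h[1][1]] for p in y]
    rss = sum((a[0] - f[0]) ** 2 + (a[1] - f[1]) ** 2
              for a, f in zip(x, fitted))
    return h, fitted, rss


y_cfg = [[1.0, 0.0], [0.0, 2.0], [-1.0, -2.0]]
x_cfg = [[1.0, 0.0], [0.0, -2.0], [-1.0, 2.0]]  # mirror image of y_cfg

h_free, fit_free, rss_free = rotate_2d(x_cfg, y_cfg)
h_rot, fit_rot, rss_rot = rotate_2d(x_cfg, y_cfg, suppress=True)
```

Because x_cfg is a mirror image of y_cfg, the unrestricted fit is a reflection with zero residual sum of squares, whereas the fit with reflection suppressed is a proper rotation that leaves a positive residual.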