ONLINE HELP FOR SDA 1.2 ANALYSIS PROGRAMS
SDA Frequencies and Crosstabulation Program
This program generates
the univariate distribution of one variable or
the crosstabulation of two variables.
If a control variable is specified,
a separate table will be produced for each category
of the control variable.
An explanation of each option can be obtained by selecting
the corresponding word highlighted on the form.
Steps to take
- Specify variables
- To specify that a certain survey question or variable is to be
included in a table, use the
name for that variable as
given in the documentation for this study.
Aside from simply specifying the name of a variable, certain
variable options
are available to change or restrict the scope of a variable.
- Select display options
- After specifying the names of variables, select the
display options
you wish.
These affect percentaging,
text to display,
and statistics to show.
- Select an action
- After specifying all variables and options, select the
action
to take.
REQUIRED variable name
- Row variable(s)
- Variable down the side of the table
OPTIONAL variable names
- Column variable(s)
- Variable along the top of the table
- Control variable(s)
- A separate table is produced for each category of a control variable.
- If more than one row, column
and/or control variable is specified,
a separate table will be generated for each combination of variables.
- Filter variable(s)
- Some cases are included in the analysis; others are excluded.
- Weight variable
- Cases are given different relative weights.
Display Options for Crosstabulation
- Percentaging
- Defines which way to make the percents add up to 100 percent:
- Column: down each column
- Row: across each row
- Total: as a percent of the total number of cases in the table
- You can request more than one type of percentaging in a table, but
such tables are hard to read.
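The three percentaging options can be sketched as follows. This is only an illustration in Python with made-up counts; the function name `percentages` and the mode labels are assumptions of this sketch, not part of SDA.

```python
def percentages(table, mode):
    """Return cell percentages for a table given as a list of rows of counts.

    mode is "column", "row", or "total" (labels chosen for this sketch).
    Assumes no row or column of the table is entirely empty.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, count in enumerate(row):
            if mode == "column":
                base = col_totals[j]      # percents add to 100 down each column
            elif mode == "row":
                base = row_totals[i]      # percents add to 100 across each row
            else:
                base = grand_total        # percents add to 100 over the whole table
            out_row.append(100.0 * count / base)
        result.append(out_row)
    return result


# Hypothetical 2x2 table: rows down the side, columns along the top.
counts = [[30, 10],
          [20, 40]]
column_pct = percentages(counts, "column")
row_pct = percentages(counts, "row")
total_pct = percentages(counts, "total")
```

For the upper-left cell (30 cases), the column percent is 60, the row percent is 75, and the total percent is 30.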
- Question text
- The
text
of the question that produced each variable is
generally available.
- Statistics (Bivariate or Univariate)
- Various numbers or statistics can be used to summarize
the distributions of the variables.
If you specify both a row and a column variable,
a package of bivariate statistics is generated.
If you specify a row variable only, a package of
univariate statistics is generated.
The bivariate statistics summarize
the strength
or the statistical significance of
the relationship between the row and the column variables.
Several of the most common statistics are displayed if you select this option.
The Chi-square statistics are the ones most often used.
Two versions are displayed: Pearson's Chi-square and the
Likelihood-ratio Chi-square,
each with its P-value (probability statistic).
-
The Chi-square probability statistic is used to assess the statistical
significance of the observed relationship between the row
and the column variables in the table.
If the P-value is low (about .05 or less), the chances that
the relationship is only due to sampling error are correspondingly low,
and in that case
the relationship is said to be statistically significant.
Note that if the frequencies in the table are weighted,
the Chi-square statistic can be artificially inflated.
Consequently, if weights are used, the Chi-square is adjusted
(multiplied) by
the factor: (Total unweighted N) / (Total weighted N).
-
Several other bivariate statistics are given.
These include interval-level statistics such as the Pearson
correlation coefficient and Eta (assuming the row variable
to be the dependent variable).
The remaining statistics such as Gamma and Tau are ordinal
statistics.
-
The univariate statistics
package includes the mean, median, mode,
standard deviation, variance, and the coefficient of variation
(standard deviation divided by the mean).
The standard error of the mean and its coefficient of variation
(standard error divided by the mean)
are also given.
The standard error calculation
assumes simple random sampling.
-
Consult any beginners' statistics textbook for more information on the
meaning of these statistics.
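As an illustration of the bivariate Chi-square statistics described above, the following Python sketch computes Pearson's Chi-square and the Likelihood-ratio Chi-square from a table of counts, and applies the (Total unweighted N) / (Total weighted N) factor when an unweighted N is supplied. The function name and the sample counts are hypothetical. Applying the factor to both versions of the statistic is an assumption of this sketch. P-values are not computed here.

```python
import math


def chi_square_stats(table, unweighted_n=None):
    """Return (Pearson Chi-square, Likelihood-ratio Chi-square) for a table.

    table is a list of rows of (possibly weighted) frequencies.
    If unweighted_n is given, both statistics are multiplied by
    unweighted_n / (total weighted N), as described for weighted tables.
    Assumes no row or column is entirely empty.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    likelihood_ratio = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # empty cells contribute nothing to the LR sum
                likelihood_ratio += 2.0 * observed * math.log(observed / expected)
    if unweighted_n is not None:
        factor = unweighted_n / n
        pearson *= factor
        likelihood_ratio *= factor
    return pearson, likelihood_ratio


# Hypothetical table of counts.
counts = [[30, 10],
          [20, 40]]
pearson, lr = chi_square_stats(counts)

# Same table treated as weighted frequencies that came from only 50 real cases.
pearson_adj, lr_adj = chi_square_stats(counts, unweighted_n=50)
```

In this example the adjusted statistics are half the size of the unadjusted ones, because the weighted N (100) is twice the unweighted N (50).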
- Suppress display of the table
- Occasionally you may want to see the summary statistics for a
table, without wishing to view the table itself,
especially if the table is a very large one.
If you select this option, the table is generated internally
but is not displayed.
- Color coding of the table cells
- The table cells are color coded, in order to
aid in detecting patterns.
Cells with more cases than expected
(based on the marginal percentages)
become redder,
the more they exceed the expected value.
Cells with fewer cases than expected
become bluer,
the smaller they are,
compared to the expected value.
The transition from a lighter shade of red or blue
to a darker shade depends on the magnitude of the
T-statistic.
The lightest shade corresponds to T-statistics between 0 and 1.
The medium shade corresponds to T-statistics between 1 and 2.
The darkest shade corresponds to T-statistics greater than 2.
The color coding can be turned off, if you prefer.
Color coding may not be helpful
if you are using a black-and-white monitor
or if you intend to print out the table on a black-and-white printer.
- Show the T-statistic
- The T-statistic
controls the
color coding
of cells in the table.
If you select this option,
the statistic will be displayed in each cell.
The T-statistic
shows whether the frequencies in a cell
are greater or fewer than expected
(in the same sense as used for the Chi-square statistic).
It also takes into account the total number of cases in the column.
If there are only a few cases in the column,
the deviations from the expected values are not
as significant as if there are many cases in the column.
The T-statistic is calculated as the ratio of two quantities:
The numerator is the difference between the
column percent in the cell
and the total column percent for that row.
The denominator is the
standard error of the cell percent.
The standard error is estimated by the formula 'sqrt(pq/(n-1))',
where
p
is the column percent,
q
is (100-p),
and
n
is the number of cases in that column.
Note that if the frequencies in the table are weighted,
the T-statistic can be artificially inflated.
Consequently, if weights are used, the
weighted number of cases in the column is
adjusted (multiplied) by
the factor: (Total unweighted N) / (Total weighted N).
The adjusted number of cases in the column is then
used as the
n
in the formula given above.
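The T-statistic calculation described above can be sketched in Python as follows. The function name, argument names, and sample figures are hypothetical; only the formula sqrt(pq/(n-1)) and the weight adjustment come from the description above.

```python
import math


def cell_t_statistic(cell_count, column_n, row_total, grand_total,
                     weight_factor=1.0):
    """T-statistic for one cell of a crosstabulation.

    cell_count, column_n, row_total and grand_total are (possibly weighted)
    frequencies. weight_factor is (Total unweighted N) / (Total weighted N);
    leave it at 1.0 for unweighted tables.
    The result is undefined when the column percent is exactly 0 or 100,
    because the estimated standard error is then zero.
    """
    p = 100.0 * cell_count / column_n            # column percent in the cell
    overall_p = 100.0 * row_total / grand_total  # total column percent for the row
    q = 100.0 - p
    n = column_n * weight_factor                 # adjusted number of cases in column
    std_error = math.sqrt(p * q / (n - 1))
    return (p - overall_p) / std_error


# Hypothetical cell: 30 of the 50 cases in its column, in a row holding
# 40 of the 100 cases in the whole table.
t = cell_t_statistic(cell_count=30, column_n=50, row_total=40, grand_total=100)
```

Here the column percent is 60 and the total percent for the row is 40, giving a T-statistic of about 2.86, which would fall in the darkest shade.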
SDA Comparison of Means Program
This program calculates the mean of the dependent variable
separately within categories of the row variable and,
optionally, the column variable.
If a control variable is specified,
a separate table will be produced for each category
of the control variable.
An explanation of each option can be obtained by selecting
the corresponding word highlighted on the form.
Steps to take
- Specify variables
- To specify that a certain survey question or variable is to be
included in a table, use the
name for that variable as
given in the documentation for this study.
Aside from simply specifying the name of a variable, certain
variable options
are available to change or restrict the scope of a variable.
- Select display options
- After specifying the names of variables, select the
display options
you wish.
These affect
the number of decimals to show,
text to display,
and
statistics to compute.
- Select an action
- After specifying all variables and options, select the
action
to take.
REQUIRED variable names
- Dependent variable(s)
- Variable whose mean or average value is to be computed
for each combination
of the row and (optionally) column and control variables
and displayed in a table
- Row variable(s)
- Variable down the side of the table
OPTIONAL variable names
- Column variable(s)
- Variable along the top of the table
- Control variable(s)
- A separate table is produced for each category of a control variable.
- If more than one dependent variable, row variable,
column variable,
and/or control variable is specified,
a separate table will be generated for each combination of variables.
- Filter variable(s)
- Some cases are included in the analysis; others are excluded.
- Weight variable
- Cases are given different relative weights.
Display Options for Comparison of Means
- Main statistic to display
- Usually each cell of the table will contain the MEAN of
the dependent variable for that particular combination of the
row and (optionally) column and control variables.
Sometimes, however, it is more helpful to express each cell mean
as the DIFFERENCE from the overall mean.
Select this option to have those differences calculated and put
into each cell of the table.
Another option is to display the TOTALS for each cell.
The total is the numerator of the ratio used to calculate
the mean.
(The denominator of the ratio is the number of cases in that cell.)
The totals are usually of interest only when a weight is being
used to expand the cell counts up to their estimated values in
the population.
For example, one may be interested in the total
estimated NUMBER of persons
in each cell
who have some characteristic (e.g., who smoke, or drive cars),
instead of the PROPORTION of persons who have that characteristic.
This assumes that the dependent variable is coded '1' for
a case which has the characteristic (smokes, for example)
and '0' for a case which does not have the characteristic.
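The relationship between the mean, the difference from the overall mean, and the total can be sketched as follows. This Python example uses invented data and a hypothetical function name. It assumes that, when a weight is used, the denominator is the weighted number of cases in the cell.

```python
def cell_summary(values, weights=None):
    """Return (total, number of cases, mean) for one cell.

    The total is the numerator of the mean; the (weighted) number of
    cases is the denominator.
    """
    if weights is None:
        weights = [1.0] * len(values)
    total = sum(w * y for w, y in zip(weights, values))
    n_cases = sum(weights)
    return total, n_cases, total / n_cases


# Hypothetical cell: dependent variable coded 1 = smokes, 0 = does not smoke.
smokes = [1, 0, 1, 1, 0]
# Hypothetical expansion weights (persons in the population per respondent).
weights = [1000, 2000, 1500, 500, 1000]

total, n_cases, mean = cell_summary(smokes, weights)

# Difference from a hypothetical overall mean of 0.40 for the whole table.
overall_mean = 0.40
difference = mean - overall_mean
```

In this example the cell represents an estimated 6,000 persons, of whom an estimated 3,000 smoke, so the mean (proportion) is 0.5.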
- Additional statistics to display
(for simple random samples)
- There are four additional statistics that can be
displayed in each cell:
- Standard errors
for the means (or for the totals)
can be computed and displayed for each cell
of the table.
Standard errors are used to create
confidence intervals for the mean in each cell.
In general you can be 95% confident that
the mean in the population
(for each cell)
is within the interval bounded by
approximately two standard errors
above and below
the mean in the sample
(ignoring the problem of potential bias in the sample).
If the sample is equivalent to a simple random sample of a population,
the standard error is computed by dividing the standard deviation
by the square root of the number of cases in each cell.
If the sample for a particular study
is more complex than a simple random sample,
the appropriate standard errors can still be computed, provided that
the stratum and/or cluster variables
were specified when the dataset was
set up in the Web archive.
Otherwise, the standard errors calculated
by assuming simple random sampling
are probably too small.
- Standard deviations
can be computed and displayed for each cell
of the table.
These statistics measure how much variation there is
in the dependent variable within each cell of the table.
- Number of cases
used to calculate the mean.
By default, the number of cases
is displayed in each cell.
Click on the checkbox to suppress this display.
- Weighted number of cases
used to calculate the mean
(if a weight variable was specified).
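For a simple random sample, the standard error and the approximate 95% interval described above can be sketched like this. The data are invented, the function name is hypothetical, and the use of the sample standard deviation (with n-1 in the denominator) for unweighted data is an assumption of this sketch.

```python
import math
import statistics


def srs_standard_error(values):
    """Standard error of the mean: standard deviation / sqrt(number of cases)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical values of the dependent variable in one cell.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
std_dev = statistics.stdev(values)
std_error = srs_standard_error(values)

# Approximately two standard errors above and below the sample mean.
interval = (mean - 2 * std_error, mean + 2 * std_error)
```

With these eight cases the mean is 5.0 and the standard error is about 0.76, so the approximate 95% interval runs from about 3.5 to about 6.5.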
- Additional statistics to display
(for complex probability samples)
- There are several additional statistics that can be
displayed in each cell:
- Standard errors for complex samples
can be computed and displayed
for the means (or totals)
in each cell
of the table.
The appropriate standard errors
are computed using
either
the Taylor series method (for cluster samples)
or
the formula for stratified subclass means
(for stratified element samples).
The method used is reported when you run the program.
If you want additional technical information,
see the discussion of
standard error calculation methods.
Standard errors are used to create
confidence intervals for the mean
in each cell
(or for the total, if a weight is being used to expand the cell counts
to the estimated size of the population).
In general you can be 95% confident that
the mean (or total) in the population
is within the interval bounded by
approximately
two standard errors
above and below
the mean (or total) shown in each cell of the table
(ignoring the problem of potential bias in the sample).
The optional
diagnostic table
reports the
degrees of freedom
used to
generate the appropriate T-value
for creating the confidence intervals.
Note that the calculations for
standard errors in cluster samples
require that the coefficient of
variation of the sample size in each cell,
CV(x),
be under 0.20;
otherwise, the computed standard errors are probably too small,
and they are flagged in the table with an asterisk.
CV(x) for each cell is available in the optional diagnostic table.
- Standard errors for simple random sampling (SRS)
can also be displayed in each cell.
These standard errors are computed by the formula
used for simple random samples,
ignoring stratification and clustering.
These standard errors can be compared with the standard errors
that take the complex sample design into account.
In general the SRS standard errors will be somewhat smaller.
- The Design Effect (DEFT)
is the ratio of the complex standard error to the
SRS standard error.
It indicates how much the standard error has been inflated
by the complex sample design.
For example, a DEFT of 1.25 means that the calculated standard
error is 25 percent larger than the standard error of a simple
random sample of the same size.
Note that all of these calculations are done for each subset of
the data defined by the values of the row, column,
control, and filter variables (if any).
- The RHO statistic
or clustering coefficient
is a measure of the effect of clustering
(if the sample is a cluster sample).
If the statistic is zero, it indicates that the
standard error has not been inflated because of the cluster design.
A rho statistic of .10 is of moderate size.
Its effect on the standard error depends on the size of the clusters.
The larger the clusters, the larger the effect of rho.
The stratified
average cluster size
is displayed for each cell in the optional diagnostic table.
If the design effect is less than 1.0,
the rho statistic will be negative.
This means that
the differences between clusters within the same stratum
are relatively small,
compared to the variability between elements in the
sample as a whole.
- Standard deviations
can be computed and displayed for each cell
of the table.
These statistics measure how much variation there is
in the dependent variable within each cell of the table.
The calculation of standard deviations uses weights, if
a weight variable has been specified, but does not take
the stratification of the sample into account.
- Number of cases
used to calculate the mean.
By default, the number of cases
is displayed in each cell.
Click on the checkbox to suppress this display.
- Weighted number of cases
used to calculate the mean
(if a weight variable was specified).
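The relationship between the complex standard error, the SRS standard error, DEFT, and rho can be sketched as follows. The numbers and function names are hypothetical. The rho calculation simply inverts the formula DEFF = 1 + (rho)(b-1), where DEFF is the square of DEFT and b is the average cluster size; it is not necessarily how SDA computes rho internally.

```python
def design_effect_deft(complex_se, srs_se):
    """DEFT: ratio of the complex standard error to the SRS standard error."""
    return complex_se / srs_se


def rho_from_deft(deft, avg_cluster_size):
    """Solve DEFF = 1 + rho * (b - 1) for rho, with DEFF = DEFT squared."""
    return (deft ** 2 - 1.0) / (avg_cluster_size - 1.0)


# Hypothetical standard errors for one cell.
deft = design_effect_deft(complex_se=0.025, srs_se=0.020)
percent_larger = (deft - 1.0) * 100.0

# Hypothetical average cluster size of 11 cases.
rho = rho_from_deft(deft, avg_cluster_size=11)

# A DEFT below 1.0 implies a negative rho.
rho_negative = rho_from_deft(0.9, avg_cluster_size=11)
```

Here DEFT is 1.25, so the complex standard error is 25 percent larger than the SRS standard error, and the implied rho is about 0.056.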
- Confidence intervals
- If this option is selected, an additional table is generated
that contains the upper and lower bound of the confidence interval
of the statistic (mean or total) in each cell.
The confidence interval is the range of values within which the
population value of the statistic is likely to fall.
By default, the level of confidence is 95 percent, but
the user can also select 99 percent or 90 percent.
The confidence interval or range
is computed by multiplying the standard error of the mean (or total)
by the T-value appropriate to the level of confidence and
to the number of degrees of freedom (for complex samples).
For a simple random sample, for instance, the 95 percent
confidence range is obtained by multiplying the
standard error by 1.96.
The result is added to the mean (or total) to obtain the
upper bound,
and the result is subtracted from the mean (or total)
to obtain the lower bound.
For complex samples, the appropriate T-value is a function
both of the desired level of confidence and of the number of
degrees of freedom
used to calculate the standard error.
For a large number of degrees of freedom, the appropriate
T-value is very close to the value used for
simple random samples.
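The simple random sample case can be sketched in Python using the normal distribution from the standard library. The function name and the figures are hypothetical. For complex samples, the multiplier would instead be the T-value for the reported degrees of freedom, which this sketch does not compute.

```python
from statistics import NormalDist


def srs_confidence_interval(estimate, std_error, confidence=0.95):
    """Return (lower bound, upper bound) for a mean or total.

    Uses the normal-distribution multiplier (about 1.96 for 95 percent),
    which applies to simple random samples.
    """
    multiplier = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


# Hypothetical cell mean of 50.0 with a standard error of 2.0.
lower_95, upper_95 = srs_confidence_interval(50.0, 2.0)
lower_99, upper_99 = srs_confidence_interval(50.0, 2.0, confidence=0.99)
lower_90, upper_90 = srs_confidence_interval(50.0, 2.0, confidence=0.90)
```

The 95 percent interval here is roughly 46.08 to 53.92. The 99 percent interval is wider and the 90 percent interval is narrower.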
- Multiple Classification Analysis (MCA)
- If this option is selected, an MCA table is generated
for the categories of each row, column, and control variable.
Note that this procedure shows the average effects of each
category, and it ignores any interactions between the variables.
If interaction effects are statistically significant,
MCA is generally not appropriate.
The first column of the table gives the difference between
the dependent variable score of respondents in
each category and the overall mean of the dependent variable.
This is the UNADJUSTED effect of each category.
The second column of the table gives the ADJUSTED effect of
each category, taking into account the effects of the other
variables.
The adjustment process is similar to running a regression
with dummy variables for the various categories.
Regression coefficients for dummy variables,
however, represent deviations from the
effect of the omitted category.
MCA coefficients, on the other hand, are deviations from
the overall mean of the dependent variable.
The eta coefficient for each variable is like a
bivariate correlation coefficient.
It is the square root of the proportion of variance of the
dependent variable "explained" by the categories of each
variable.
The beta coefficient for each variable is like a
standardized regression coefficient.
It adjusts the eta coefficient for each variable by taking
into account the effects of the other variables.
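The unadjusted effects and the eta coefficient can be illustrated with a short Python sketch for a single variable. The data and the function name are invented, and the calculation is unweighted. The adjusted effects and the beta coefficient require fitting all of the variables together and are not shown.

```python
import math


def unadjusted_effects_and_eta(groups):
    """groups maps each category label to a list of dependent-variable scores.

    Returns (effects, eta), where effects[label] is the category mean minus
    the overall mean, and eta is the square root of the proportion of
    variance "explained" by the categories.
    """
    all_values = [y for ys in groups.values() for y in ys]
    overall_mean = sum(all_values) / len(all_values)
    effects = {}
    ss_between = 0.0
    for label, ys in groups.items():
        category_mean = sum(ys) / len(ys)
        effects[label] = category_mean - overall_mean
        ss_between += len(ys) * (category_mean - overall_mean) ** 2
    ss_total = sum((y - overall_mean) ** 2 for y in all_values)
    eta = math.sqrt(ss_between / ss_total)
    return effects, eta


# Hypothetical scores for two categories of one row variable.
scores = {"low": [1.0, 2.0, 3.0], "high": [5.0, 6.0, 7.0]}
effects, eta = unadjusted_effects_and_eta(scores)
```

In this example the overall mean is 4.0, the unadjusted effects are -2.0 and +2.0, and eta is about 0.93.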
- Diagnostic output for standard errors
(for complex probability samples)
- If this option is selected, an additional table is generated
that contains the following statistics in each cell:
- Number of strata
The program automatically combines strata having too few
cases or clusters with the next stratum in numerical order.
(This is a special feature of SDA.
Most other standard error
programs will not do this for you automatically.)
Since this is done separately within each cell of the requested
table, the statistics in different cells can be based on
different numbers of strata.
The number of strata actually used for computing the standard
errors is reported in the diagnostic table for each cell.
- DF -- Degrees of freedom
The number of degrees of freedom (df) is used to compute the width of
each confidence interval.
For each cell of the table, the df equals
the number of primary sampling units
(clusters, for cluster samples;
individual cases, for unclustered stratified samples)
minus
the number of strata.
For example, in the common situation in which there are two
clusters per stratum, the df equals the number of strata
(that is, half the number of clusters).
Note that the number of strata used for this calculation is
the number after collapsing, if necessary.
The T-statistic used for computing confidence intervals
depends on the desired level of confidence (usually 95 percent) and the df.
The smaller the df, the larger the T-statistic and the wider the
confidence intervals.
When the df is greater than about 30, the size of the T-statistic
is close to the familiar constant for the normal distribution
(1.96, for the 95 percent confidence level).
- Design effect (deft) due to weighting
If weights were used to estimate means or totals,
part of the total design effect MAY be due to weighting.
The estimated design effect (deft) attributable
to weighting
(assuming that it would have been optimal to use the same
sampling fraction for the whole sample)
is given in the table of diagnostic information.
The overall design effect for each cell
could then be divided by the deft due
to weighting,
to estimate
how much of the overall deft
is due to the other
characteristics of the sample design.
Note that this estimation of the design effect due to weighting
is based entirely on the variation in the weight variable,
and it does not consider the specific dependent variable being
analyzed.
Not all uses of weights will increase the sample variance
of a specific variable.
If the weights reflect a stratification of the sample that was
effective in reducing sampling error for
this particular dependent variable,
the estimated deft due to weighting
may be greater than the overall deft.
If this occurs, it is an indication that the weighting did not
increase sampling error in this case.
Frequently, however, differential rates of sampling are used
in different strata simply
to achieve the oversampling of some group(s) relative to
others.
Weights are then used
to compensate for the different probabilities of selection.
In such a case, the
different strata are sampled at different rates
in a way that departs from optimum allocation,
and the sampling variance
is increased
(see Kish, Survey Sampling, pp. 429-433).
The variation in the size of the weights across strata can
be used to estimate the design effect due to weighting,
assuming that it would have been optimal to use the same
sampling fraction within all the strata.
The deft due to weighting is based on formula 11.7.6 given in Kish,
Survey Sampling, p. 430.
That formula gives a design effect in terms of sampling variances.
The square root of that result gives the deft in terms of standard
errors.
- b -- average cluster size (for cluster samples)
The effect of clustering on the size of standard errors
depends on two factors:
b (average cluster size, combined across strata),
which is reported in the diagnostic table, and
rho
(the intra-cluster correlation)
which is reported (optionally) in the main table.
The relationship between these factors and DEFF, the design effect in
terms of sampling variances (the square of the DEFT reported
in the main table), is given by Kish (Survey Sampling, pp. 161-164) as:
DEFF = 1 + (rho)(b-1)
- CV(x) -- Coefficient of variation of cluster sizes (for cluster samples)
For a ratio mean computed as 'y/x', the denominator 'x' is
the number of cases in a cluster.
A requirement of the Taylor series method is that 'x',
the size of clusters within each stratum,
not vary excessively.
Concretely, this means that the coefficient of variation of
the cluster sizes should be less than 0.20, and preferably under 0.10.
(See Kish, Survey Sampling, pp. 206-209.)
If the value of CV(x) is greater than 0.20, the calculated
standard error is probably too small.
Such standard errors are flagged in the main table with
an asterisk.
The corresponding confidence intervals are also flagged with
an asterisk, as is the CV(x) in the table of diagnostic information.
The CV for each cell is a stratified estimate.
The program calculates the coefficient of variation of the number
of valid cases
for the clusters within each stratum.
The individual stratum CVs
are then combined
into an overall CV for each cell.
This overall CV is reported
in the optional table of diagnostic statistics.
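Two of the diagnostic quantities described above, the degrees of freedom and the design effect implied by rho and the average cluster size, follow directly from the formulas given and can be sketched as below. The function names and the input figures are hypothetical.

```python
import math


def degrees_of_freedom(n_psus, n_strata):
    """df = number of primary sampling units minus number of strata.

    n_strata is the number of strata after any collapsing.
    """
    return n_psus - n_strata


def deff_from_rho(rho, avg_cluster_size):
    """DEFF = 1 + rho * (b - 1), in terms of sampling variances."""
    return 1.0 + rho * (avg_cluster_size - 1.0)


# Hypothetical cell: 100 clusters spread over 40 strata.
df = degrees_of_freedom(n_psus=100, n_strata=40)

# Hypothetical rho of 0.10 with an average cluster size of 11 cases.
deff = deff_from_rho(rho=0.10, avg_cluster_size=11)
deft = math.sqrt(deff)  # design effect in terms of standard errors
```

With these figures df is 60, DEFF is 2.0, and DEFT is about 1.41. The same rho with an average cluster size of 3 would give a DEFF of only 1.2, which shows how larger clusters magnify the effect of rho.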
- ANOVA
- An analysis of variance
can be carried out and presented
after the table of means.
This analysis is used to assess the statistical significance of the
effects of the row variable
(and the column variable, if there is one)
on the dependent variable.
If the P-value (probability statistic)
associated with a variable
is low (about .05 or less), the chances
are correspondingly low
that
the observed effect on the dependent variable
is only due to sampling error,
and in that case
the effect is said to be statistically significant.
Consult any beginners' statistics book for more information on the
meaning of these statistics.
Other Display Options
- Suppress display of the table
- Occasionally you may want to see ANOVA statistics
or a Multiple Classification Analysis (MCA)
without viewing the table of means,
especially if the table is a very large one.
If you select this option, the table of means is generated internally
but is not displayed.
Tables containing confidence intervals and diagnostic information
(for complex standard errors) are also suppressed.
- Color coding of the cells
- The cells of the table of means are color coded, in order to
aid in detecting patterns.
Cells with higher means than
the overall mean
become redder,
the more they exceed the overall mean.
Cells with lower means than
the overall mean
become bluer,
the further they fall below the overall mean.
The transition from a lighter shade of red or blue
to a darker shade depends on the magnitude of the
T-statistic.
The lightest shade corresponds to T-statistics between 0 and 1.
The medium shade corresponds to T-statistics between 1 and 2.
The darkest shade corresponds to T-statistics greater than 2.
The color coding can be turned off, if you prefer.
Color coding may not be helpful
if you are using a black-and-white monitor
or if you intend to print out the table on a black-and-white printer.
- Show the T-statistic
- The T-statistic
controls the
color coding
of cells in the table of means.
If you select this option,
the statistic will also be displayed in each cell.
The T-statistic
shows whether the mean in a cell
is larger or smaller than the overall mean.
It also takes into account the total number of cases in the cell.
If there are only a few cases in a cell,
the deviation from the overall mean is not
as significant as if there are many cases in that cell.
The T-statistic is calculated as the ratio of two quantities:
The numerator is the difference between the
mean in the cell
and the overall mean.
The denominator is the
standard error of the mean in that cell.
If complex standard errors have been requested, the
complex standard error for each cell is
used to calculate the T-statistic.
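The T-statistic for a cell mean, and the way it could map onto the color shades, can be sketched as follows. The function names, the shade labels, and the sample figures are hypothetical. How SDA treats values that fall exactly on 1 or 2, or exactly on zero, is not stated above, so the boundary handling here is an assumption.

```python
def mean_t_statistic(cell_mean, overall_mean, cell_std_error):
    """(mean in the cell - overall mean) / standard error of the cell mean."""
    return (cell_mean - overall_mean) / cell_std_error


def shade(t):
    """Map a T-statistic to a shade label (labels invented for this sketch)."""
    color = "red" if t >= 0 else "blue"   # above vs. below the overall mean
    size = abs(t)
    if size <= 1:
        level = "lightest"
    elif size <= 2:
        level = "medium"
    else:
        level = "darkest"
    return level + " " + color


# Hypothetical cell: mean 12.0, overall mean 10.0, standard error 0.8.
t_high = mean_t_statistic(12.0, 10.0, 0.8)
# Hypothetical cell below the overall mean.
t_low = mean_t_statistic(8.8, 10.0, 0.8)

shade_high = shade(t_high)
shade_low = shade(t_low)
```

The first cell has a T-statistic of 2.5 and would get the darkest red. The second has a T-statistic of about -1.5 and would get the medium blue.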
- Number of decimals to display
- Each statistic displayed in the cells of the table has a default
number of decimal places.
If you want more or fewer decimal places, you can generally specify
from 0 to 6 decimal places for most of the
statistics displayed in each cell.
Note that some decimal place specifications are RELATIVE to the
number of decimal places in the main statistic (means or totals).
- Question text
- The
text
of the question that produced each variable is
generally available.
SDA Comparison of Correlations Program
This program calculates the correlation between two variables
separately within categories of the row variable and,
optionally, the column variable.
If a control variable is specified,
a separate table will be produced for each category
of the control variable.
An explanation of each option can be obtained by selecting
the corresponding word highlighted on the form.
Steps to take
- Specify variables
- To specify that a certain survey question or variable is to be
included in a table, use the
name for that variable as
given in the documentation for this study.
Aside from simply specifying the name of a variable, certain
variable options
are available to change or restrict the scope of a variable.
- Select display options
- After specifying the names of variables, select the
display options
you wish.
These affect
the number of decimals to show,
statistics to compute, and text to display.
- Select an action
- After specifying all variables and options, select the
action
to take.
REQUIRED variable names
- Variables to be correlated
- Two variables whose correlation coefficient
is to be computed for each combination
of the row and (optionally) column and control variables
and displayed in a table
- Row variable(s)
- Variable down the side of the table
OPTIONAL variable names
- Column variable(s)
- Variable along the top of the table
- Control variable(s)
- A separate table is produced for each category of a control variable.
- If more than one correlation variable, row variable,
column variable,
and/or control variable is specified,
a separate table will be generated for each combination of variables.
- Weight variable
- Cases are given different relative weights.
- Filter variable(s)
- Some cases are included in the analysis; others are excluded.
Display Options for Comparison of Correlations
- Correlation measure to calculate
- The Pearson correlation coefficient is the default correlation measure
to calculate.
It is appropriate for ordered numeric variables.
The log of the odds-ratio is an optional measure for dichotomous variables.
- Show differences from overall correlation
(instead of cell correlations)
- Usually each cell of the table will contain the correlation coefficient of
the two variables being correlated, for that particular combination of the
row and (optionally) column and control variables.
Sometimes, however, it is more helpful to express each cell correlation
as the DIFFERENCE from the overall correlation.
Select this option to have those differences calculated and put
into each cell of the table.
- Standard errors
- Standard errors for the correlations
can be computed and displayed for each cell
of the table.
The standard errors are used to create
confidence intervals for the correlation in each cell.
If the sample is equivalent to a simple random sample of a population,
you can be about 95% confident that
the correlation in the population
(for each cell)
is within the interval bounded by
two standard errors
above and below
the correlation in the sample (shown in the table).
The standard error is computed differently, depending on which
correlation coefficient you have selected.
The standard error for the Pearson correlation
is based on Fisher's Z, and it is calculated as the average
distance of the upward and the downward confidence band for
one standard error
(based on the retransformation of Fisher's Z into Pearson's R).
The standard error for the log of the odds ratio is calculated
with standard formulas for that statistic.
If the sample is more complex than a simple random sample,
the standard errors calculated here are probably too small.
-
Consult any beginners' statistics book for more information on the
meaning of these statistics.
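The Fisher's Z approach to the standard error of a Pearson correlation, as described above, can be sketched in Python. The function name and the sample figures are hypothetical. The sketch assumes the usual standard error of Fisher's Z, 1/sqrt(n-3); whether SDA uses exactly this expression is not stated above.

```python
import math


def pearson_se_via_fisher_z(r, n):
    """Approximate standard error of a Pearson correlation r based on n cases.

    Moves one standard error up and down on the Fisher's Z scale,
    transforms both points back to the R scale, and averages the
    upward and downward distances from r.
    """
    z = math.atanh(r)                 # Fisher's Z transformation of r
    se_z = 1.0 / math.sqrt(n - 3)     # assumed standard error of Z
    upper = math.tanh(z + se_z)       # back-transformed upper band
    lower = math.tanh(z - se_z)       # back-transformed lower band
    return ((upper - r) + (r - lower)) / 2.0


# Hypothetical cell: correlation of 0.5 based on 103 cases.
se_r = pearson_se_via_fisher_z(0.5, 103)

# Approximate 95% interval: two standard errors above and below r.
interval = (0.5 - 2 * se_r, 0.5 + 2 * se_r)
```

For this cell the standard error is about 0.075, so the approximate 95% interval for the correlation runs from about 0.35 to about 0.65.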
Other Display Options
- Color coding of the cells
- The cells of the table of correlations are color coded, in order to
aid in detecting patterns.
Cells with higher correlations than
the overall correlation
become redder,
the more they exceed the overall correlation.
Cells with lower correlations than
the overall correlation
become bluer,
the further they fall below the overall correlation.
The transition from a lighter shade of red or blue
to a darker shade depends on the magnitude of the
T-statistic.
The lightest shade corresponds to T-statistics between 0 and 1.
The medium shade corresponds to T-statistics between 1 and 2.
The darkest shade corresponds to T-statistics greater than 2.
The color coding can be turned off, if you prefer.
Color coding may not be helpful
if you are using a black-and-white monitor
or if you intend to print out the table on a black-and-white printer.
- Show the T-statistic
- The T-statistic
controls the
color coding
of cells in the table of correlations.
If you select this option,
the T-statistic will also be displayed in each cell.
The T-statistic
shows whether the correlation in a cell
is larger or smaller than the overall correlation.
It also takes into account the total number of cases in each cell.
If there are only a few cases in a cell,
the deviations from the overall correlation are not
as significant as if there are many cases in that cell.
The T-statistic is calculated as the ratio of two quantities:
The numerator is the difference between the
correlation in the cell
and the overall correlation.
The denominator is the
standard error of the correlation in that cell.
- Number of decimals for the correlation
- You can select from 1 to 6 decimal places. The default
is 2 decimal places.
- Question text
- The
text
of the question that produced each variable is
generally available.
SDA Correlation Matrix Program
This program calculates the correlation between
all pairs of two or more variables.
An explanation of each option can be obtained by selecting
the corresponding word highlighted on the form.
Steps to take
- Specify variables
- To specify that a certain survey question or variable is to be
correlated or used as a filter or weight variable, give the
name for that variable as
given in the documentation for this study.
Aside from simply specifying the name of a variable, certain
variable options
are available to change or restrict the scope of a variable.
- Select display options
- After specifying the names of variables, select the
display options
you wish.
These affect
the statistics to compute,
the number of decimals to show,
and text to display.
- Run correlations
- After specifying all variables and options, select
Run correlations
to run the program.
Or you can select
Clear fields
to delete all previously specified variables and options,
so that you can start over.
REQUIRED variable names
- Variables to be correlated
- Enter the names of two or more variables
whose correlation coefficients
are to be computed for each pair of variables.
Enter the name of each variable in a window (box).
To go from one window to another, use the tab key or your mouse.
It is all right to skip a window and leave it blank
-- to use only windows 1, 5, and 9, for example.
It is possible to enter more than one variable name in a window
(the underlying text-entry area will scroll).
This has consequences for other options which refer to variable
numbers.
For example, if you enter two variables
in window number 3,
and then you request that the signs of the correlations be reversed
for variable number 3,
the signs of BOTH variables in window number 3 will be reversed.
Each window, consequently, defines a variable GROUP.
Ordinarily it is clearer to put only one variable in each window,
but the possibility of defining groups of variables exists.
OPTIONAL variable names
- Filter variable(s)
- Some cases are included in the analysis; others are excluded.
- Weight variable
- Cases are given different relative weights in calculating
the correlation coefficients.
How to exclude cases with missing data
- Listwise exclusion
- If a case has a missing-data value on ANY of the variables
to be correlated, it is excluded from ALL of the correlation calculations.
This is the default procedure.
- Pairwise exclusion
- If a case has a missing-data value on SOME of the variables
to be correlated, but not on others,
it is excluded from the calculations for those PAIRS of variables
in which one of the values is missing.
This procedure retains all of the information about each pairwise
relationship.
However, the multivariate relationships can be inconsistent,
if many of the cases have different missing-data patterns on
different variables.
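The difference between the two procedures can be sketched in Python as follows.
This is only an illustration under assumed conventions: the data, the variable
names, and the use of None to mark a missing-data value are all hypothetical,
and the code is not taken from SDA.

```python
from itertools import combinations

# Hypothetical data: None marks a missing-data value.
cases = [
    {"v1": 1, "v2": 2, "v3": 3},
    {"v1": 2, "v2": None, "v3": 1},
    {"v1": 3, "v2": 4, "v3": None},
    {"v1": 4, "v2": 5, "v3": 6},
]
variables = ["v1", "v2", "v3"]


def listwise(cases, variables):
    """Keep only the cases that have valid data on ALL of the variables."""
    return [c for c in cases if all(c[v] is not None for v in variables)]


def pairwise(cases, variables):
    """For each pair of variables, keep the cases valid on both members."""
    return {
        (a, b): [c for c in cases if c[a] is not None and c[b] is not None]
        for a, b in combinations(variables, 2)
    }


print("listwise:", len(listwise(cases, variables)))
for pair, kept in pairwise(cases, variables).items():
    print("pairwise", pair, len(kept))
```

With listwise exclusion every correlation rests on the same cases;
with pairwise exclusion the number of cases can differ from one pair to the next.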
Correlation Measure to Calculate
- The Pearson correlation coefficient
- This is the usual correlation coefficient
and is the default correlation measure
to calculate.
It is appropriate for ordered numeric variables.
- Log of the odds-ratio
- The log of the odds-ratio is an optional measure for dichotomous variables.
The calculation of the odds ratio assumes that the two variables have
only two categories each.
If these statistics are requested,
the correlation program
treats each variable as a dichotomy, regardless of
the number of categories it may actually have.
The minimum valid
value of each variable is treated as one category, and all valid
values greater than the minimum are combined into the other category.
If this default dichotomization is not appropriate for a particular
analysis, you can recode the variable temporarily within the
correlation program using the standard methods of
recoding variables.
Consult any beginners' statistics book for more information on the
meaning of these statistics.
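The default dichotomization and the resulting log of the odds ratio
can be sketched in Python as follows.
This is a minimal illustration, not SDA's own code: the function names and data
are hypothetical, None is assumed to mark missing data, and the treatment
of empty cells is an assumption, since the text does not say how SDA handles them.

```python
import math


def dichotomize(values):
    """Minimum valid value -> 0; all larger valid values -> 1; None stays missing."""
    valid = [v for v in values if v is not None]
    lowest = min(valid)
    return [None if v is None else (0 if v == lowest else 1) for v in values]


def log_odds_ratio(x, y):
    """Natural log of the odds ratio in the 2x2 table of two dichotomized variables."""
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(dichotomize(x), dichotomize(y)):
        if a is not None and b is not None:
            n[(a, b)] += 1
    # Assumption: if any cell is empty the statistic is undefined, and this
    # sketch simply lets Python raise an error in that situation.
    return math.log((n[(0, 0)] * n[(1, 1)]) / (n[(0, 1)] * n[(1, 0)]))


# Hypothetical three-category variable x and two-category variable y.
x = [1, 1, 1, 2, 3, 2, 1, 3]
y = [1, 1, 2, 2, 2, 2, 1, 1]
print(round(log_odds_ratio(x, y), 3))
```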
Additional Statistics to Calculate
- Alpha coefficient
- Cronbach's alpha coefficient is a measure of how well the variables
in the correlation matrix could be said to measure the same thing.
If you added together all of the variables included in the
correlation matrix to form a scale,
alpha is the square of the correlation between the scale and the
underlying factor.
The alpha coefficient is a function of the average correlation
between the variables and of the number of variables.
If some of the variables are scored in opposite directions,
you should use the option to reverse the signs of some of
the variables, so that a high score on all variables means
the same thing.
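The dependence of alpha on the average correlation and on the number of
variables can be illustrated with the standardized form of the coefficient.
The Python sketch below assumes that form; the text does not state which
formula SDA actually uses, and the function names and the correlation
matrix are hypothetical.

```python
def mean_off_diagonal(matrix):
    """Average of the correlations between different variables."""
    k = len(matrix)
    total = sum(matrix[i][j] for i in range(k) for j in range(k) if i != j)
    return total / (k * (k - 1))


def standardized_alpha(k, mean_r):
    """Standardized alpha for k variables with average correlation mean_r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)


# Hypothetical correlation matrix for three variables.
r = [
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
avg = mean_off_diagonal(r)
print(round(standardized_alpha(3, avg), 3))
# With the same average correlation, more variables give a higher alpha.
print(round(standardized_alpha(6, avg), 3))
```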
- Standard errors
- A standard error for each correlation coefficient
can be computed.
If this option is requested,
the standard errors are placed in a separate matrix,
right under the matrix of correlation coefficients.
The standard errors are used to create
confidence intervals for each correlation coefficient.
If the sample is equivalent to a simple random sample of a population,
you can be about 95% confident that
the correlation coefficient in the population
(for each pair of variables)
is within the interval bounded by
two standard errors
above and below
the correlation in the sample (as shown in the matrix).
If the sample is more complex than a simple random sample,
the standard errors calculated here are probably too small.
The calculation of the standard error of the correlation coefficient
in each cell is based by default on the UNWEIGHTED number of cases,
even if a weight variable has been used for calculating the
correlation coefficient. Ordinarily this procedure will generate a
more appropriate statistical test than one based on the weighted N in
each cell.
The standard error is computed differently, depending on which
correlation coefficient you have selected.
- Standard errors for Pearson correlation coefficients
-
The confidence interval for the Pearson
correlation coefficient is not symmetric; therefore, there is no
single standard error that applies in both directions.
The standard error output by this program
is the average distance of the upward and the downward confidence band
for one standard error (based on the retransformation of Fisher's Z),
since that number is ordinarily a useful approximation.
- Standard errors for the log of the odds ratio
-
The standard error for the log of the odds ratio is calculated
with standard formulas for that statistic.
Consult a statistics book for details.
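The averaging of the upward and downward bands for the Pearson correlation
can be sketched in Python as follows.
This is a sketch under stated assumptions, not SDA's own code:
it assumes the usual standard error of Fisher's Z, 1 / sqrt(n - 3),
where n is the unweighted number of cases, and the function names are hypothetical.

```python
import math


def fisher_bands(r, n):
    """One-standard-error band around r, obtained by retransforming Fisher's Z."""
    z = math.atanh(r)              # Fisher's Z transformation of r
    se_z = 1.0 / math.sqrt(n - 3)  # assumed standard error of Z
    return math.tanh(z - se_z), math.tanh(z + se_z)


def pearson_se(r, n):
    """Average of the downward and upward distances from r to the band."""
    lower, upper = fisher_bands(r, n)
    return ((r - lower) + (upper - r)) / 2.0


lower, upper = fisher_bands(0.5, 103)
# The band is not symmetric: for a positive r it reaches further down than up.
print(round(0.5 - lower, 4), round(upper - 0.5, 4))
print(round(pearson_se(0.5, 103), 4))
```

An approximate 95% confidence interval is then the sample correlation
plus or minus two of these standard errors, as described above.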
- Univariate statistics
- Univariate statistics for each of the variables
in the correlation matrix
will be computed and displayed,
if this option is selected.
The statistics available for each variable include its mean,
standard deviation, standard error, valid N of cases, and (if
there is a weight variable) valid weighted N of cases.
If missing-data cases have been excluded LISTWISE (the default),
the univariate statistics for all variables will be based on the
SAME cases -- those which have valid data on ALL of the variables.
If missing-data cases have been excluded PAIRWISE,
the univariate statistics for each variable will be based on
all the cases with valid data for that one variable.
- Paired univariate statistics
- If missing-data cases have been excluded pairwise,
each correlation coefficient is based (potentially)
on a different subset of the cases.
Univariate statistics based on that same subset of
cases for each pair of variables
will be calculated and displayed,
if this option is selected.
The paired statistics
for each variable include its mean,
standard deviation, valid N of cases for the pair, and (if
there is a weight variable) valid weighted N of cases for the pair.
These statistics are displayed as a series of matrices.
Each statistic for a given variable is (potentially) somewhat
different, depending on which other variable it is being
paired with.
- Index of proportionality (P-squared)
- It is sometimes useful to know the degree to which
the correlations in each row of the correlation matrix
are proportional to the correlations in the other rows.
This is particularly the case in creating scales or indexes
of items.
If variables are measuring the same thing, they should have
similar correlations to other relevant (criterion) variables.
The P-squared statistic is a way to measure the proportionality of rows in
a correlation matrix.
For example,
if all of the coefficients in one row are
exactly double the size of the coefficients in another row,
there is a constant proportionality, and the index will be
1.0.
Usually we want to limit this comparison to a subset of the
matrix -- namely, to
the part corresponding to the correlations
of the criterion variables with the variables of interest.
To do this, we specify on the option screen the variable
numbers
(next to each window on the option screen)
corresponding to the variables for which we want the P-squared
measure, and the variable numbers corresponding to the
criterion variables.
For example, we could examine the degree to which
the variables v1, v2, and v3 have
proportional correlations to
the criterion variables x1, x2, and x3.
We would enter
v1, v2, and v3 into the first 3 windows on the option screen;
and x1, x2, and x3 into windows 4 through 6.
To get the P-squared statistic for all the combinations of v1, v2,
and v3, with respect to the criterion variables,
we would then specify:
- Vars to measure: 1-3
- Criterion vars: 4-6
These variable numbers can be specified either as a
range (1-3) or as a list (1,2,3);
and the variables
need not be adjacent in the original correlation matrix -- a list
like '1,3,5' is valid.
The P-squared statistics are presented in a symmetrical matrix.
Each row and column corresponds to one of the variables that
we specified as a "variable to measure."
For a discussion of how to use this statistic, see
Thomas Piazza,
"The Analysis of Attitude Items,"
American Journal of Sociology,
vol. 86 (1980) pp. 584-603.
Other Display Options
- Reverse signs of some correlations
- In order to detect patterns in the correlation matrix,
it is sometimes useful to reverse the signs of the correlations
corresponding to one or more variables.
Enter the variable number of each variable for which you
want the signs reversed.
The variable number corresponds to the window number on the
option screen.
For example, we may know that var1 is scaled in such a way that
a HIGH score or value corresponds to a LOW score on var2 and var3,
so we expect the correlations of var1 to be negative with var2 and var3.
But if we are interested in the relationships of those variables to
other variables, it will be easier to detect different patterns if
we reverse all the signs corresponding to var1.
That way, we can expect var1, var2, and var3 to have
correlations of the same sign with other variables.
Then if we do observe a difference in the signs, it will catch
our attention.
- Color coding of the correlations
- The correlation coefficients are color coded, in order to
aid in detecting patterns.
Correlation coefficients greater than zero become redder, the
larger they are.
Correlation coefficients less than zero become bluer, the
more negative they are.
The transition from a lighter shade of red or blue
to a darker shade depends on the magnitude of the
correlation coefficient in each cell.
The lightest shade corresponds to coefficients between 0 and .15.
The colors become darker as the absolute value of the
correlation exceeds
.15, then .30, then .45.
Color coding is also used for the P-squared matrix,
if one has been requested.
However, the dividing points for colors are double in magnitude.
The lightest shade corresponds to P-squared coefficients between 0 and .30.
The colors become darker as the absolute value of the
P-squared coefficient exceeds
.30, then .60, then .90.
The color coding can be turned off, if you prefer.
Color coding may not be helpful
if you are using a black-and-white monitor
or if you intend to print out the matrix on a black-and-white printer.
- Question text
- The
text
of the question that produced each variable is
generally available.
SDA Regression Program
This program calculates the regression coefficients
(ordinary least squares)
for one or more independent or predictor variables.
An explanation of each option can be obtained by selecting
the corresponding word highlighted on the form.
Steps to take
- Specify variables
- To specify that a certain survey question or variable is to be
used as the dependent variable,
give the
name for that variable as
given in the documentation for this study.
Then specify the names of one or more independent variables.
Filter variables and a weight variable may also be specified.
Aside from simply specifying the name of a variable,
it is possible to
restrict the range
of a variable or to
recode
the variable temporarily.
Note in particular that you can create
dummy variables
and
product terms.
- Select display options
- After specifying the names of variables, select the
display options
you wish.
These affect
the statistics to compute,
the number of decimals to show,
and text to display.
- Run regression
- After specifying all variables and options, select
Run Regression
to run the program.
Or you can select
Clear Fields
to delete all previously specified variables and options,
so that you can start over.
REQUIRED variable names
- Dependent variable
- Enter the name of one variable
to be used as the dependent variable
or the variable to be predicted.
- Independent variables
- Enter the names of one or more variables
whose regression coefficients
are to be computed.
Note that you can specify
dummy variables
and
product terms
as independent variables.
It is also possible to
restrict the range
of a variable or to
recode
the variable temporarily.
Enter the name of each variable in a window (box).
To go from one window to another, use the tab key or your mouse.
It is all right to skip a window and leave it blank
-- to use only windows 1, 5, and 9, for example.
It is possible to enter more than one variable name in a window
(the underlying text-entry area will scroll).
Each window, consequently, defines a variable GROUP.
Ordinarily it is clearer to put only one variable in each window,
but the possibility of defining groups of variables exists.
Dummy variables and product terms
- Dummy variable(s)
- A dummy variable is a variable coded 0 or 1.
Cases that have a certain characteristic are coded as 1;
whereas
cases that do NOT have the characteristic are coded as 0.
To create such a variable temporarily, for a single regression run,
for example, use the following syntax:
varname(d:1-3)
This would create a variable in which cases coded 1 through 3 on
the variable 'varname' receive a code of 1,
and all other VALID cases receive a code of 0.
If 'varname' has a code defined as missing-data or out of range,
the dummy variable will have the system-missing data value.
The characters 'd:' (or 'D:') indicate that you want to create a
temporary dummy variable.
The codes that follow show which codes on the original variable
should become the code of 1 on the new dummy variable.
One or more single code values or ranges can be specified.
Multiple codes or ranges are separated by a comma.
You can also give the dummy variable a label by putting the label
in double quotes:
occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")
- Product terms
- An independent variable can be the product of two
or more variables.
To create such a variable temporarily, for a single regression run
for instance, use an asterisk (*) between the component variable
names.
For example:
age*education
This would create a variable in which, for each case, the
value of 'age' is multiplied by the value of 'education'.
If either 'age' or 'education' has an invalid code for that case,
the temporary product term will have the system missing-data
value.
One or more dummy variables can also be part of a product term.
For example, the following form is acceptable:
party(d:3)*sex
In this example, first a dummy variable is created from the
variable 'party', and then that dummy variable is multiplied by 'sex'.
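The rules described above for temporary dummy variables and product terms
can be sketched in Python as follows.
This is only an illustration of the coding rules, not SDA's parser:
the function names are hypothetical, codes are assumed to be non-negative
integers, None stands in for the system-missing value, and the set of
valid codes for 'party' is invented for the example.

```python
def parse_codes(spec):
    """Turn a code list such as '1, 3-5, 9, 10' into a set of integer codes."""
    codes = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes


def make_dummy(value, codes, valid):
    """1 if value is one of the listed codes, 0 for any other VALID value,
    None (missing) if the value is missing or out of range."""
    if value is None or value not in valid:
        return None
    return 1 if value in codes else 0


def product(*values):
    """Product term; missing if any component is missing."""
    if any(v is None for v in values):
        return None
    result = 1
    for v in values:
        result *= v
    return result


VALID_PARTY = set(range(1, 6))  # hypothetical: codes 1-5 are valid, 9 is missing data

print(sorted(parse_codes("1, 3-5, 9, 10")))
# party(d:3)*sex for three hypothetical cases given as (party, sex)
for party, sex in [(3, 2), (1, 1), (9, 2)]:
    print(product(make_dummy(party, parse_codes("3"), VALID_PARTY), sex))
```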
OPTIONAL variable names
- Filter variable(s)
- Some cases are included in the analysis; others are excluded.
- Weight variable
- Cases are given different relative weights in calculating
the regression coefficients.
How cases with missing data are excluded
- Listwise exclusion
- If a case has a missing-data value on ANY of the variables
to be correlated and then regressed,
it is excluded from ALL of the regression calculations.
This is the only allowed procedure.
The pairwise option available for the correlation program
is not available for the regression programs.
Additional Statistics to Calculate
- T-test for each coefficient
- The t-test for each regression coefficient is generally displayed.
The t-statistic is the ratio of the regression coefficient (B)
to its standard error -- shown as SE(B).
The probability estimate associated with each t-statistic
is given in the last column.
This is the probability of obtaining a regression coefficient (B)
that is this large or larger, if the true coefficient is equal
to zero
in the population
from which the current sample was drawn.
(Note that this version of the regression program
assumes that the dataset was generated by a simple random sample.)
If the probability value
for a regression coefficient
is low (about .05 or less), the chances
are correspondingly low
that the observed effect of that independent variable
on the dependent variable
is only due to sampling error.
However, a low probability value does not indicate that
the true value of the coefficient in the population is
of any specific magnitude -- only that it is not equal to zero.
To estimate the confidence interval of a specific regression
coefficient, use the standard error of the coefficient -- displayed
as SE(B).
The approximate 95 percent confidence interval of each coefficient
is formed by creating a range that is equal to the regression coefficient
plus or minus two times the standard error.
The t-statistic and associated probability value are also given for
the constant term of the regression equation.
This is a test that the regression equation in the population has
no constant term (or intercept).
This test is usually of less interest than the tests for the regression
coefficients of the independent variables.
- Global F-test
- The global F-statistic for the regression is computed,
if this option is selected.
The P-value (probability value) for the F-statistic
is given in the last column of the table.
This is the probability of obtaining an F-statistic this large or larger,
if ALL of the regression coefficients (B's) are equal to zero in the population
from which the current sample was drawn.
(Note that this version of the regression program
assumes that the dataset was generated by a simple random sample.)
If the P-value
for the regression
is low (about .05 or less), the chances
are correspondingly low
that ALL of
the observed effects of the independent variables
on the dependent variable
are only due to sampling error.
However, a low P-value does not guarantee that any specific
independent variable has an effect on the dependent variable
(unless there is only one independent variable).
The t-test for each independent variable should be examined
for that purpose.
- Univariate statistics
- Univariate statistics for each of the variables
will be computed and displayed,
if this option is selected.
The statistics displayed for each variable include its mean and
standard deviation.
- Correlation matrix
- The correlation matrix used to calculate the
regression coefficients will be displayed.
- Covariance matrix
- The covariance matrix computed from the data
file will be displayed.
Other Display Options
- Color coding of the coefficients
- The regression coefficients are color coded, in order to
aid in detecting patterns,
if t-tests have been requested.
Regression coefficients greater than zero become redder, the
larger they are.
Regression coefficients less than zero become bluer, the
more negative they are.
The transition from a lighter shade of red or blue
to a darker shade depends on the magnitude of the
T-statistic,
which is the ratio of each regression coefficient (B)
to its standard error.
The lightest shade corresponds to T-statistics between 0 and 1.
The medium shade corresponds to T-statistics between 1 and 2.
The darkest shade corresponds to T-statistics greater than 2.
Correlation coefficients are also color coded, if a correlation
matrix is requested.
Correlation coefficients greater than zero become redder, the
larger they are.
Correlation coefficients less than zero become bluer, the
more negative they are.
The transition from a lighter shade of red or blue
to a darker shade depends on the magnitude of the
correlation coefficient in each cell.
The lightest shade corresponds to coefficients between 0 and .15.
The colors become darker as the absolute value of the
correlation exceeds
.15, then .30, then .45.
The color coding can be turned off, if you prefer.
Color coding may not be helpful
if you are using a black-and-white monitor
or if you intend to print out the table on a black-and-white printer.
- Question text
- The
text
of the question that produced each variable is
generally available.
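The color coding of regression coefficients described above can be
sketched in Python as follows.
The function names and return values are hypothetical and are not SDA's own.
The text does not say how a T-statistic of exactly 0, 1, or 2 is treated,
so the handling of those boundary values here is an assumption.

```python
def t_statistic(b, se_b):
    """Ratio of a regression coefficient (B) to its standard error SE(B)."""
    return b / se_b


def shade(t):
    """Color by the sign of the coefficient, depth by the size of the T-statistic."""
    if t == 0:
        return "uncolored"  # assumption: a zero coefficient is neither red nor blue
    hue = "red" if t > 0 else "blue"
    size = abs(t)
    if size > 2:
        depth = "darkest"
    elif size > 1:
        depth = "medium"
    else:
        depth = "lightest"
    return depth + " " + hue


# Hypothetical coefficients and standard errors.
for b, se_b in [(0.8, 0.25), (-0.3, 0.2), (0.1, 0.25)]:
    t = t_statistic(b, se_b)
    print(b, round(t, 2), shade(t))
```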
SDA Logit/Probit Regression Program
This program calculates the logit or probit regression coefficients
for one or more independent or predictor variables.
An explanation of each option can be obtained by selecting
the corresponding word highlighted on the form.
Steps to take
- Specify variables
- To specify that a certain survey question or variable is to be
used as the dependent variable,
give the
name for that variable as
given in the documentation for this study.
Then specify the names of one or more independent variables.
Filter variables and a weight variable may also be specified.
Aside from simply specifying the name of a variable,
it is possible to
restrict the range
of a variable or to
recode
the variable temporarily.
Note in particular that you can create
dummy variables
and
product terms.
- Select display options
- After specifying the names of variables, select the
display options
you wish.
These affect
the statistics to compute,
the number of decimals to show,
and text to display.
- Run Logit/Probit
- After specifying all variables and options, select
Run Logit/Probit
to run the program.
Or you can select
Clear Fields
to delete all previously specified variables and options,
so that you can start over.
Type of regression to run
The program can run either logit (logistic) or probit regression.
The difference between them is in how the dependent variable is transformed
from a proportion (a mean between 0 and 1).
Logit regression expresses the dependent variable as the
natural logarithm of the odds that a person will have a score
of 1 versus a score of 0 on the dependent variable.
Probit regression expresses the dependent variable as the
inverse of the cumulative normal distribution function
corresponding to the proportion.
When the dependent variable
has only two categories,
both logit and probit regression are more appropriate to use than
ordinary least squares regression.
Both logit and probit regression will usually generate the same
substantive results.
The choice between them is generally a matter of custom within
a specific field or discipline.
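The two transformations of a proportion can be sketched in Python as follows.
This only illustrates the transformations themselves, not how SDA estimates
the coefficients; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse of the cumulative standard normal distribution function at p."""
    return NormalDist().inv_cdf(p)


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(logit(p), 3), round(probit(p), 3))
```

Both transformations map a proportion of .5 to zero and are symmetric around it,
which is one reason the two methods usually lead to the same substantive results.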
REQUIRED variable names
- Dependent variable
- Enter the name of one variable
to be used as the dependent variable
or the variable to be predicted.
This variable should have two categories: 0 and 1.
If the variable you want to use as a dependent variable is not
coded as a simple 0/1 variable,
you can create a
dummy variable,
or you can
recode
the variable temporarily.
If the dependent variable is left as anything other than a
simple 0/1 variable, the program will recode the dependent
variable automatically.
The lowest valid score will be recoded to the value '0', and
all other scores will be recoded to the value '1'.
- Independent variables
- Enter the names of one or more variables
whose regression coefficients
are to be computed.
Note that you can specify
dummy variables
and
product terms
as independent variables.
It is also possible to
restrict the range
of a variable or to
recode
the variable temporarily.
Enter the name of each variable in a window (box).
To go from one window to another, use the tab key or your mouse.
It is all right to skip a window and leave it blank
-- to use only windows 1, 5, and 9, for example.
It is possible to enter more than one variable name in a window
(the underlying text-entry area will scroll).
Each window, consequently, defines a variable GROUP.
Ordinarily it is clearer to put only one variable in each window,
but the possibility of defining groups of variables exists.
Dummy variables and product terms
- Dummy variable(s)
- A dummy variable is a variable coded 0 or 1.
Cases that have a certain characteristic are coded as 1, and
cases that do NOT have the characteristic are coded as 0.
Note that the dependent variable can be coded into the required
0/1 categories by creating a dummy variable.
To create such a variable temporarily, for a single analysis run,
for example, use the following syntax:
varname(d:1-3)
This would create a variable in which cases coded 1 through 3 on
the variable 'varname' receive a code of 1, and
all other VALID cases receive a code of 0.
If 'varname' has a code defined as missing-data or out of range,
the dummy variable will have the system-missing data value.
The characters 'd:' (or 'D:') indicate that you want to create a
temporary dummy variable.
The codes that follow show which codes on the original variable
should become the code of 1 on the new dummy variable.
One or more single code values or ranges can be specified.
Multiple codes or ranges are separated by a comma.
You can also give the dummy variable a label by putting the label
in double quotes:
occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")
- Product terms
- An independent variable can be the product of two
or more variables.
To create such a variable temporarily, for a single regression run
for instance, use an asterisk (*) between the component variable
names.
For example:
age*education
This would create a variable in which, for each case, the
value of 'age' is multiplied by the value of 'education'.
If either 'age' or 'education' has an invalid code for that case,
the temporary product term will have the system missing-data
value.
One or more dummy variables can also be part of a product term.
For example, the following form is acceptable:
party(d:3)*sex
In this example, first a dummy variable is created from the
variable 'party', and then that dummy variable is multiplied by 'sex'.
OPTIONAL variable names
- Filter variable(s)
- Some cases are included in the analysis; others are excluded.
- Weight variable
- Cases are given different relative weights in calculating
the regression coefficients.
How to exclude cases with missing data
- Listwise exclusion
- If a case has a missing-data value on ANY of the variables
included in the logit or probit regression,
it is excluded from ALL of the regression calculations.
This is the only allowed procedure.
The pairwise option available for the correlation program
is not available for the regression programs.
Additional Statistics to Calculate
- T-test for each coefficient
- The t-test for each logit or probit
regression coefficient is generally displayed.
The t-statistic is the ratio of the regression coefficient (B)
to its standard error -- shown as SE(B).
The probability estimate associated with each t-statistic
is given in the last column.
This is the probability of obtaining a regression coefficient (B)
that is this large or larger, if the true coefficient is equal
to zero in the population
from which the current sample was drawn.
(Note that this version of the logit/probit regression program
assumes that the dataset was generated by a simple random sample.)
If the probability value
for a regression coefficient
is low (about .05 or less), the chances
are correspondingly low
that the observed effect of that independent variable
on the dependent variable
is only due to sampling error.
However, a low probability value does not indicate that
the true value of the coefficient in the population is
of any specific magnitude -- only that it is not equal to zero.
To estimate the confidence interval of a specific regression
coefficient, use the standard error of the coefficient -- displayed
as SE(B).
The approximate 95 percent confidence interval of each coefficient
is formed by creating a range that is equal to the regression coefficient
plus or minus two times the standard error.
The t-statistic and associated probability value are also given for
the constant term of the regression equation.
This is a test that the regression equation in the population has
no constant term (or intercept).
This test is usually of less interest than the tests for the regression
coefficients of the independent variables.
- Global significance tests
- A pseudo-R-squared statistic is displayed.
This is analogous to the R-squared statistic in ordinary
least squares regression, which expresses
the proportion of variance in the dependent variable
explained by the
entire set of independent variables.
A chi-square test for the regression
is also computed.
The P-value (probability value) for the chi-square test
is the probability of obtaining a chi-square statistic this large or larger,
if ALL of the regression coefficients (B's) are equal to zero in the population
from which the current sample was drawn.
(Note that this version of the regression program
assumes that the dataset was generated by a simple random sample.)
If the P-value
for the regression
is low (about .05 or less), the chances
are correspondingly low
that ALL of
the observed effects of the independent variables
on the dependent variable
are only due to sampling error.
However, a low P-value does not guarantee that any specific
independent variable has an effect on the dependent variable
(unless there is only one independent variable).
The t-test for each independent variable should be examined
for that purpose.
- Univariate statistics
- Univariate statistics for each of the variables
will be computed and displayed,
if this option is selected.
The statistics displayed for each variable include its mean and
standard deviation.
Other Display Options
- Color coding of the coefficients
- The regression coefficients are color coded, in order to
aid in detecting patterns,
if t-tests have been requested.
Regression coefficients greater than zero become redder, the
larger they are.
Regression coefficients less than zero become bluer, the
more negative they are.
The transition from a lighter shade of red or blue
to a darker shade depends on the magnitude of the
T-statistic,
which is the ratio of each regression coefficient (B)
to its standard error.
The lightest shade corresponds to T-statistics between 0 and 1.
The medium shade corresponds to T-statistics between 1 and 2.
The darkest shade corresponds to T-statistics greater than 2.
The color coding can be turned off, if you prefer.
Color coding may not be helpful
if you are using a black-and-white monitor
or if you intend to print out the table on a black-and-white printer.
- Question text
- The
text
of the question that produced each variable is
generally available.
SDA Listcase Program
This program lists the values of individual cases
on variables specified by the user.
Values of a variable can also be transformed into
percents
of a second variable.
This is particularly useful when the cases in the data file
are aggregate units such as cities.
One or more filter variables are used to limit the listing
to a subset of the cases.
In general a limit of 500 cases is enforced for each
listing, in case the user has forgotten to limit the listing
with sufficient filter variables.
An explanation of each option can be obtained by selecting
the corresponding word highlighted on the form.
Steps to take
- Specify variables to list
- To specify that a certain survey question or variable is to be
included in the listing, enter into one of the windows the
name for that variable, as
given in the documentation for the study.
You can also request a
percent to be displayed.
- Specify one or more filter variables
- Filter variables are used to limit the listing to a subset of cases.
Except for very small datasets,
a filter variable will almost always be required.
- Select display options
- After specifying the names of variables, select the
display options
you wish.
These affect how to display numeric variables
and whether or not to display the text of each variable.
- Start the listing
- After specifying all variables and options, select
Start Listing
to begin the program.
Or you can select
Clear fields
to delete all previously specified variables and options,
so that you can start over.
- Variables to list
- To specify that a certain survey question or variable is to be
included in the listing, enter into one of the windows the
name for that variable, as
given in the documentation for the study.
Percentages
Aside from simply specifying the name of a variable,
it is possible to convert a number into the
percent
of another variable.
This is particularly useful when the cases in the data file
are aggregate units such as cities.
To calculate and display a percent,
use the following formats,
beginning with
$p,
instead of a simple variable name:
- $p(var1, var2)
-
This will display the value
100 * var1 / var2
(using 1 decimal place),
where 'var1' and 'var2' are variables in the dataset.
It is not necessary that either 'var1' or 'var2' be
specified separately for listing.
- $p(var1, var2, 2)
-
To display a percent using
other than one decimal place,
specify the desired number of decimal places after var2.
The example above would use 2 decimal places.
- $p(demo, totvote, "Percent democratic")
-
To give your own name
to the percentage created,
put the name you want within double quotes.
This name will be displayed at the top of the column for that percentage.
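The three '$p' formats above can be mimicked with a short Python sketch.
It is not the Listcase program itself:
the function 'percent_column', the sample cases,
and the default column heading are all assumptions made for illustration.

```python
def percent_column(cases, var1, var2, option=None):
    """Mimic $p(var1, var2[, decimals or "name"]) for a list of cases."""
    decimals = 1                          # documented default: 1 decimal place
    heading = f"$p({var1}, {var2})"       # assumption: default heading text
    if isinstance(option, int):
        decimals = option                 # e.g. $p(var1, var2, 2)
    elif isinstance(option, str):
        heading = option                  # e.g. $p(demo, totvote, "Percent democratic")
    values = [f"{100 * c[var1] / c[var2]:.{decimals}f}" for c in cases]
    return heading, values

# Hypothetical aggregate cases (for example, two cities).
cases = [{"demo": 4500, "totvote": 10000}, {"demo": 1, "totvote": 3}]
print(percent_column(cases, "demo", "totvote"))
print(percent_column(cases, "demo", "totvote", 2))
print(percent_column(cases, "demo", "totvote", "Percent democratic"))
```

The sketch does not handle a zero or missing value of 'var2'.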
- Filter variables
- After specifying the names of the variables to list, select the
filter variable(s)
in order
to specify which cases to list.
Since data files generally have a large number of cases,
it is very important to limit the listing to a subset of the cases.
The usual options for specifying
filter variable(s)
are available.
To avoid accidental attempts to list large numbers of cases,
the program suppresses any listing that
would exceed a certain number
of cases.
The default limit is 500 cases,
but that limit can be modified when
the datasets are set up in the Web archive.
- Summaries of each variable listed
-
For each NUMERIC variable listed,
you can obtain summaries of the values for the selected cases
in the listing.
These summaries exclude missing-data or out-of-range values.
The available summaries are:
- Sum of the values
- Mean of the values
- Minimum value listed
- Maximum value listed
For a percentage
(created with the
'$p' command),
the summaries, if requested, will be calculated as follows:
- Sum: calculated from the sums of the two variables
- Mean: the mean of the percentages in a column
- Minimum: the smallest valid value in a column
- Maximum: the greatest valid value in a column
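One plausible reading of these rules is sketched below in Python.
The interpretation of the 'Sum' summary as
100 * (sum of var1) / (sum of var2)
is an assumption based on the wording above,
and the function name and sample values are hypothetical.

```python
def percent_summaries(pairs):
    """pairs: (var1, var2) values for the listed cases with valid data."""
    pcts = [100 * a / b for a, b in pairs]
    return {
        # assumption: the "sum" is the percent formed from the two column sums
        "sum": 100 * sum(a for a, _ in pairs) / sum(b for _, b in pairs),
        "mean": sum(pcts) / len(pcts),    # mean of the listed percentages
        "min": min(pcts),
        "max": max(pcts),
    }

summaries = percent_summaries([(50, 100), (30, 300)])
print(summaries)
```

Under this reading the 'Sum' summary (20.0 here) differs from the mean of
the percentages (30.0), because larger cases carry more weight in the sums.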
- How to display numeric variables
-
- Numeric codes
The numeric code of each variable is what is stored in the dataset.
- Category labels
If a category label has been defined for a specific numeric
code, it is often more helpful to display the label rather than the
numeric code.
However, if
no label has been defined for a
specific code value,
the numeric code will be displayed.
- BOTH numeric codes and category labels
(This is the default option)
Under this option, the numeric code is always displayed,
followed by the category label, if one has been defined.
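The three display choices can be illustrated with a small Python sketch.
The function 'display_value', the mode names,
and the blank used to separate code and label are assumptions,
not the program's actual output format.

```python
def display_value(code, labels, mode="both"):
    """Render one value as a numeric code, a category label, or both."""
    label = labels.get(code)
    if mode == "codes":
        return str(code)
    if mode == "labels":
        # fall back to the numeric code when no label has been defined
        return label if label is not None else str(code)
    # default: the code, followed by the label if one has been defined
    return f"{code} {label}" if label is not None else str(code)

labels = {1: "Male", 2: "Female"}   # hypothetical category labels
print(display_value(2, labels))
print(display_value(2, labels, "labels"))
print(display_value(9, labels, "labels"))
```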
- Display text for the listed variables
-
If this option is selected,
the text corresponding to each variable listed is displayed
at the bottom of the listing.
The text for each variable referenced in a percent specification
is also displayed.
Features Common to All Analysis Programs
Options for specifying variables
Multiple variable names
More than one name
may be entered for variables to be analyzed,
such as for the row and the column variables.
The names should be
separated by a comma or blanks.
Separate analyses for each combination of variables
will be generated.
For example,
the following specifications would generate six separate
tables:
- Row variables: spend spend2
- Column variables: gender, education income
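The count of six tables follows from pairing each of the two row variables
with each of the three column variables.
A minimal Python sketch of that pairing is shown here;
the helper 'split_names' is hypothetical and handles plain names only
(not names followed by a parenthesized list of values).

```python
import re
from itertools import product

def split_names(text):
    """Split a list of variable names separated by commas and/or blanks."""
    return [name for name in re.split(r"[,\s]+", text.strip()) if name]

rows = split_names("spend spend2")
cols = split_names("gender, education income")
tables = list(product(rows, cols))   # one table per (row, column) combination

for row, col in tables:
    print(row, "by", col)
```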
Restricting the valid range
The name of each
analysis
variable can be followed,
in parentheses, by a
list of values to be included in the analysis.
Basic range restriction
-
A single value
such as 'gender(2)'
or a
range of codes
such as 'age(30-50)',
will limit the analysis to cases having
those codes.
Multiple ranges and codes
may be specified.
-
For example:
age(1-17, 25, 95-100)
Open-ended Ranges using '*' and '**'
-
In a range,
one asterisk '*'
can be used
to signify
the lowest or highest
VALID value.
For example: age(*-25,75-*).
This would include all VALID values less than or equal to 25
and all VALID values greater than or equal to 75. However, any
missing-data values within those ranges would still be excluded.
In a range,
two asterisks '**'
can be used to signify
the lowest or highest
NUMERIC value,
regardless of whether or not
the codes are defined as missing data.
For example: age(50-**)
would include ALL numeric values greater than or equal to 50,
including data values like 98 or 99, even if they had been
defined as missing-data codes.
Note that '**' cannot be used alone
(without '-')
as a range specification.
If you want to include all NUMERIC codes,
you can use the range '(**-**)'.
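The rules for '*' and '**' can be summarized in a short Python sketch.
It is an illustration under stated assumptions, not the program's parser:
the function 'matches' is hypothetical, negative codes are not handled,
and a range with two explicit numeric bounds is assumed to match every code
inside it.

```python
import math

def matches(value, spec, missing_codes=()):
    """True if a numeric value passes a range list such as '*-25,75-*'."""
    for item in spec.replace(" ", "").split(","):
        lo, sep, hi = item.partition("-")
        if not sep:
            if lo == "**":
                raise ValueError("'**' must be part of a range, e.g. '**-**'")
            hi = lo                       # a single value such as '25'
        if "*" in (lo, hi) and value in missing_codes:
            continue                      # '*' reaches VALID values only
        low = -math.inf if lo in ("*", "**") else float(lo)
        high = math.inf if hi in ("*", "**") else float(hi)
        if low <= value <= high:
            return True
    return False

missing = {98, 99}                        # hypothetical missing-data codes
print(matches(20, "*-25,75-*", missing))  # valid value in the lower range
print(matches(99, "*-25,75-*", missing))  # missing-data code, excluded by '*'
print(matches(99, "50-**", missing))      # included by '**'
```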
Temporarily Recoding a Variable
A numeric variable can be recoded temporarily, for purposes of
running the current analysis,
by specifying groups of codes that are
to be combined into a single category.
This type of recoding can be very simple, but certain options can make
it a little more complex.
- Basic recoding
-
For example, to combine the categories of 'age' into three groups,
you can specify the variable as:
age(r: 18-30; 31-50; 51-95)
Notice that the name of the variable ('age') is followed by parentheses,
then
the instruction 'r' (or 'R')
followed by a colon (':'),
and then the groupings of codes.
Those groupings can consist of single code values, ranges, or a combination
of many values and/or ranges.
Each group is separated from the next by a
semicolon (';').
Spaces are optional, but are added here for readability.
Using this basic method of recoding,
the new groupings of codes are given the
default
code values 1, 2, 3, and so forth.
The
default label for each group
is the range of original codes
that constitute that group ("18-30", for example).
Any categories of 'age' not included in the specified groupings will become
missing-data on the recoded version, and they will be excluded from the
analysis in the table.
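The basic behavior described so far can be sketched in Python.
This is a simplified illustration, not the program's own parser:
the functions 'parse_recode' and 'recode' are hypothetical,
and the sketch handles only plain groups of codes
(no assigned code values, labels, or asterisks).

```python
def parse_recode(spec):
    """Split a basic recode such as 'age(r: 18-30; 31-50; 51-95)' into parts."""
    name, _, body = spec.partition("(")
    instruction, _, groups = body.strip().rstrip(")").partition(":")
    if instruction.strip().lower() != "r":
        raise ValueError("expected the recode instruction 'r:'")
    return name.strip(), [g.strip() for g in groups.split(";")]

def recode(value, groups):
    """Return (new code, default label), or None if the value is not listed."""
    for new_code, group in enumerate(groups, start=1):   # default codes 1, 2, 3, ...
        for item in group.replace(" ", "").split(","):
            low, sep, high = item.partition("-")
            if float(low) <= value <= float(high if sep else low):
                return new_code, group                   # default label: the range
    return None                                          # becomes missing data

name, groups = parse_recode("age(r: 18-30; 31-50; 51-95)")
print(name, groups)
print(recode(25, groups), recode(60, groups), recode(17, groups))
```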
On the other hand, any original missing-data categories of 'age' that are
explicitly mentioned in the recode will be included.
For instance, if the value '90' for 'age' were flagged as a missing-data code,
but included as in the example above,
it would become part of the third recoded category.
This is discussed in more detail
in the section on
"Treatment of missing data."
- Assigning particular new code values
-
It is possible to assign new code values that are different from the
default 1, 2, 3, and so forth.
To do this,
give the new code value, then an equal sign, then the
grouping.
For example, the variable 'age' can be recoded into the same three
groups as above, but with the new code values 1, 5, and 10, by
specifying the recode as follows:
age(r: 1 = 18-30; 5 = 31-50; 10 = 51-95)
For column, row, or control variables it will not usually matter
what the new code values are.
For variables on which statistics are computed, however, the new code
values will affect the value of those statistics.
- Assigning labels to the new code values
-
To assign your own label to a new grouping of code values, place the
label in double quotes after the group codes, but before the
semicolon.
There is no set limit on the length of these labels;
however, very long labels may distort the formatting of the tables.
For example, you can assign labels to the recoded categories of race
by using the following specification:
race(r: 800-869 "White"; 870-934 "Black"; 600-652, 979-982 "Asian")
These labels will appear in the table, in place of the
range of original codes
that constitute that group.
Nevertheless, the recode specifications will still be documented.
A summary is always given at the bottom of the table.
- Open ranges (with '*' or '**')
-
If you are not sure of the ranges of the variable to be recoded, you
can specify an open range with an asterisk ('*').
A single asterisk
matches the lowest or highest
VALID code
in the data for
that variable.
For example, the 'age' recode could be specified as:
age(r: *-30; 31-50; 51-*)
Using this method, all valid age values up to 30 would go into the
first recoded group.
And all valid age values of 51 or older would go into the third group.
If you want to use a range that
includes NUMERIC codes that were
defined as missing-data values,
you can specify the range with
two asterisks ('**')
instead of one.
For example, the 'age' recode could be specified as:
age(r: *-30; 31-50; 51-**)
Using this method, all
valid age values up to 30 would go into the
first recoded group.
But every numeric value
of 51 or greater would go into the third group,
including codes like 99 that may have been defined
as missing-data codes.
For more discussion about including codes that have been
defined as missing-data codes,
see the
section on
"Treatment of missing data."
- Overlapping ranges
-
If the same original code value is mentioned in two or more groupings,
it is recoded the FIRST time that the value is encountered.
For example, the following two specifications have the
same
effect:
age(r: 18-30; 30-50; 50-90), and
age(r: 18-30; 31-50; 51-90)
In both cases, the original 'age' value of 30 ends up in the first group,
and the original 'age' value of 50 ends up in the second group.
Notice that
order is important with overlapping ranges.
The following specification will
NOT have the same effect
as the
preceding two:
age(r: 3= 50-90; 2= 30-50; 1= 18-30)
In this example, the 'age' value of 50 will end up in the recode
group with the value '3' (instead of in the second group),
and the 'age' value of 30 will end up in the recode group with
the value '2' (instead of in the first group).
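The first-match rule can be checked with a small Python sketch.
The function 'recode' and the list-of-tuples representation are assumptions
made for illustration; only the rule itself comes from the text above.

```python
def recode(value, groups):
    """groups: ordered (new_code, low, high) tuples; the FIRST match wins."""
    for new_code, low, high in groups:
        if low <= value <= high:
            return new_code
    return None

# age(r: 18-30; 30-50; 50-90)
ascending = [(1, 18, 30), (2, 30, 50), (3, 50, 90)]
# age(r: 3= 50-90; 2= 30-50; 1= 18-30)
descending = [(3, 50, 90), (2, 30, 50), (1, 18, 30)]

print(recode(30, ascending), recode(50, ascending))     # groups 1 and 2
print(recode(30, descending), recode(50, descending))   # groups 2 and 3
```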
- Multiple specifications for one recoded group
-
It may sometimes be useful to have more than one specification for
a new recoded group.
This can be done by specifying the desired outcome code
more than once.
For example, to have race recoded into two categories, with the
first category including everyone EXCEPT those originally coded
as '2',
you could use the following specification:
race(r: 1=1 "Non-black"; 2=2 "Black"; 1=3-20)
- Treatment of missing data
-
NUMERIC codes that have been defined as missing data
on the original variable
can be included in one of the categories of the recoded variable
in two ways.
The first method is to
mention the code explicitly,
either as a single
value or as part of a range.
For example, if the 'age' value of 99 has been defined as a missing-data
code, it can still be included by either of the following specifications:
age(r: 18-30; 31-50; 51-90; 99), or
age(r: 18-30; 31-50; 51-100)
In the first case the code 99 will become its own fourth recode category.
In the second case, it will be included as part of the third category.
A second method to include NUMERIC missing data codes is to use an
open range with two asterisks ('**') instead of one.
For example, the following specification will include all numeric
codes above 50 as part of the third recoded group:
age(r: 18-30; 31-50; 51-**)
Note that at present there is no way to include in a recode the
system-missing value or a character missing-data value (like 'D'
or 'R').
Optional variables
Control variable (for table-generating programs)
A table will be produced separately for each category of this variable
(e.g., if the control variable is gender,
there will be one table for men alone and then one table for women alone).
A table is also produced for the total of all
valid
categories of the
control variable (e.g., men and women combined).
Each table uses only one control variable at a time.
If more than one control variable is specified,
a separate set of tables will be generated for each control variable.
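The effect of a control variable can be sketched in Python
for the simplest case, a frequency count of one row variable.
The function 'tables_by_control', the key "total", and the sample cases
are hypothetical, and every case is assumed to have a valid value
on the control variable.

```python
from collections import Counter

def tables_by_control(cases, row, control):
    """Count the row variable within each control category, plus a total."""
    tables = {}
    for case in cases:
        tables.setdefault(case[control], Counter())[case[row]] += 1
        tables.setdefault("total", Counter())[case[row]] += 1
    return tables

cases = [
    {"gender": 1, "spend": 1},
    {"gender": 1, "spend": 2},
    {"gender": 2, "spend": 1},
]
tables = tables_by_control(cases, "spend", "gender")
for category, counts in tables.items():
    print(category, dict(counts))
```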
Filter variable(s)
The analysis can be limited to a subset of the cases in
the full data file by specifying one or more variables,
separated by a comma or blanks,
as selection filters.
Note that it is also possible to limit the table to a subset
of the cases by
restricting the valid range
of any of the other variables.
Filter variables are used when the desired subset of cases is defined by
a variable that is not one of the variables in the table.
Basic filter use
-
The name of each
filter
variable is followed, in parentheses, by a
single value
such as 'gender(2)' or a
range of codes
such as 'age(30-50)',
to limit the analysis to cases having
those codes.
Multiple ranges and codes
may be specified.
-
For example: 'age(1-17, 25, 95-100)'
Multiple filter variables
-
If you specify more than one filter variable, a case must satisfy
ALL of the conditions in order to be included in the table.
For example: gender(1), age(30-50)
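The requirement that a case satisfy ALL filter conditions
can be sketched in Python as follows.
The function 'passes' and the dictionary layout are assumptions
made for illustration only.

```python
def passes(case, filters):
    """filters: {variable: [(low, high), ...]}; every variable must match."""
    return all(
        any(low <= case[var] <= high for low, high in ranges)
        for var, ranges in filters.items()
    )

# Equivalent of the filter specification: gender(1), age(30-50)
filters = {"gender": [(1, 1)], "age": [(30, 50)]}

print(passes({"gender": 1, "age": 40}, filters))   # satisfies both conditions
print(passes({"gender": 2, "age": 40}, filters))   # fails the gender condition
print(passes({"gender": 1, "age": 60}, filters))   # fails the age condition
```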
Open-ended Ranges using '*' and '**'
-
A single asterisk, '*', can be used to specify that all cases with VALID
codes for a variable will pass the filter.
For example:
age(*)
includes all cases with valid data on the
variable 'age'.
In a range, the '*' can be used to signify the lowest or highest VALID value.
For example:
age(*-25,75-*).
This filter would include all VALID values less than or equal to 25
and all VALID values greater than or equal to 75. However, any
missing-data values within those ranges would still be excluded.
In a range, two asterisks '**' can be used to signify the lowest or highest
numeric value, regardless of whether or not
the codes are defined as missing data.
For example:
age(50-**)
would include ALL numeric values greater than or equal to 50,
including data values like 98 or 99, even if they had been
defined as missing-data codes.
However, any
character missing-data values
would still be excluded.
Note that '**' cannot be used alone in a filter variable.
It can only be used as part of a range.
Weight variable
Depending on the design and implementation of the study,
it may be appropriate to give some of the cases more
weight than other cases in computing frequency distributions and
statistics.
To do this, specify that a certain variable contains
the relative weight for each case and is to be treated as
a weight variable.
The documentation for the study should explain the reasons for using
a weight variable, if there is one, and what its name is.
SDA studies can be set up with a weight variable specified ahead
of time so that the weight variable is used automatically.
Other studies may be set up with a drop-down list of choices
to be presented to the user, who then selects one of the available
weight variables (or no weight variable, if that option is included
in the list).
If no weight variables have been pre-specified,
the user is free to enter the name of an appropriate variable to
be used as a weight.
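The effect of a weight variable on a frequency distribution
can be illustrated with a short Python sketch.
The function 'weighted_percents', the variable name 'wt',
and the sample cases are hypothetical;
the sketch simply sums the weights in each category instead of counting cases.

```python
from collections import defaultdict

def weighted_percents(cases, var, weight=None):
    """Percent distribution of var; each case counts as its weight (or 1)."""
    totals = defaultdict(float)
    for case in cases:
        totals[case[var]] += case[weight] if weight else 1.0
    grand_total = sum(totals.values())
    return {code: 100 * w / grand_total for code, w in totals.items()}

cases = [{"gender": 1, "wt": 1.0}, {"gender": 2, "wt": 3.0}]
print(weighted_percents(cases, "gender"))         # unweighted
print(weighted_percents(cases, "gender", "wt"))   # weighted by 'wt'
```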
Question text
All of the text available for each variable included in the
analysis run
will be appended
to the bottom of the results,
if you select this option.
The usual text available for a variable is
the text of the question that produced the variable,
provided that the text was included in the study documentation.
Sometimes other explanatory text has been included.
If the variable was created by the 'recode' or the 'compute'
program, the commands used to create the new variable
are included in the text.
Actions to take
After you specify variables and select the options you want,
go to the bottom of the form, and select one of two actions:
- Run the Table
- Select this when you have finished specifying the variables and options
you want.
The requested table(s) will then be generated by the server computer
and returned to you.
- Clear fields
- Select this to delete all previously specified variables and options,
so that you can start over.