ONLINE HELP FOR SDA 1.2 ANALYSIS PROGRAMS


SDA Frequencies and Crosstabulation Program

This program generates the univariate distribution of one variable or the crosstabulation of two variables. If a control variable is specified, a separate table will be produced for each category of the control variable. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect percentaging, text to display, and statistics to show.

Select an action
After specifying all variables and options, select the action to take.

REQUIRED variable name

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable.

If more than one row, column and/or control variable is specified, a separate table will be generated for each combination of variables.

Filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights.

Display Options for Crosstabulation

Percentaging
Defines the direction in which the percentages add up to 100 percent.
You can request more than one type of percentaging in a table, but such tables are hard to read.
Question text
The text of the question that produced each variable is generally available.
Statistics (Bivariate or Univariate)
Various numbers or statistics can be used to summarize the distributions of the variables. If you specify both a row and a column variable, a package of bivariate statistics is generated. If you specify a row variable only, a package of univariate statistics is generated.

The bivariate statistics summarize the strength or the statistical significance of the relationship between the row and the column variables. Several of the most common statistics are displayed if you select this option. The Chi-square statistics are the most often used. Two versions are displayed -- Pearson's Chi-square, and the Likelihood-ratio Chi-square, each with its P-value (probability statistic).

The Chi-square probability statistic is used to assess the statistical significance of the observed relationship between the row and the column variables in the table. If the P-value is low (about .05 or less), the chances that the relationship is only due to sampling error are correspondingly low, and in that case the relationship is said to be statistically significant. Note that if the frequencies in the table are weighted, the Chi-square statistic can be artificially inflated. Consequently, if weights are used, the Chi-square is adjusted by the factor: (Total unweighted N) / (Total weighted N).
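
As a rough illustration of the two Chi-square statistics and of the weighting adjustment described above, the following Python sketch computes both from a table of cell frequencies. The function names and the example table are hypothetical and are not part of SDA; the lookup of the P-value is left out, and the sketch assumes that no row or column of the table is empty.

```python
import math

def chi_square_stats(table):
    """Return (Pearson Chi-square, likelihood-ratio Chi-square) for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    likelihood = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count based on the marginal totals.
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # an empty cell adds nothing to the likelihood ratio
                likelihood += 2.0 * observed * math.log(observed / expected)
    return pearson, likelihood

def adjust_for_weights(chi_square, unweighted_n, weighted_n):
    """Scale a Chi-square from weighted counts by (unweighted N) / (weighted N)."""
    return chi_square * unweighted_n / weighted_n

# Hypothetical weighted table in which each of 60 cases carries a weight of 2.
weighted_table = [[20, 40], [40, 20]]
pearson, likelihood = chi_square_stats(weighted_table)
adjusted = adjust_for_weights(pearson, unweighted_n=60, weighted_n=120)
print(round(pearson, 3), round(likelihood, 3), round(adjusted, 3))
```

In this example the adjustment halves the Chi-square, which is what the unweighted table of 60 cases would have produced.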

Several other bivariate statistics are given. These include interval-level statistics such as the Pearson correlation coefficient and Eta (assuming the row variable to be the dependent variable). The remaining statistics such as Gamma and Tau are ordinal statistics.

The univariate statistics package includes the mean, median, mode, standard deviation, variance, and the coefficient of variation (standard deviation divided by the mean). The standard error of the mean and its coefficient of variation (standard error divided by the mean) are also given. The standard error calculation assumes simple random sampling.
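
The relationships among these univariate statistics can be sketched in a few lines of Python. This is only an illustration with made-up data: the function name is hypothetical, the data are unweighted, and the use of the sample (n - 1) form of the standard deviation is an assumption, since the text above does not say which form SDA uses.

```python
import math
import statistics

def univariate_summary(values):
    """Summary statistics for one unweighted variable (illustrative only)."""
    n = len(values)
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)   # sample form, n - 1 (an assumption)
    std_err = std_dev / math.sqrt(n)     # assumes simple random sampling
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std_dev": std_dev,
        "variance": std_dev ** 2,
        "cv": std_dev / mean,            # coefficient of variation
        "std_err": std_err,
        "cv_of_std_err": std_err / mean,
    }

summary = univariate_summary([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(name, round(value, 4))
```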

Consult any beginners' statistics textbook for more information on the meaning of these statistics.
Suppress display of the table
Occasionally you may want to see the summary statistics for a table, without wishing to view the table itself, especially if the table is a very large one. If you select this option, the table is generated internally but is not displayed.
Color coding of the table cells
The table cells are color coded, in order to aid in detecting patterns. Cells with more cases than expected (based on the marginal percentages) become redder, the more they exceed the expected value. Cells with fewer cases than expected become bluer, the smaller they are, compared to the expected value.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the T-statistic. The lightest shade corresponds to T-statistics between 0 and 1. The medium shade corresponds to T-statistics between 1 and 2. The darkest shade corresponds to T-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Show the T-statistic
The T-statistic controls the color coding of cells in the table. If you select this option, the statistic will be displayed in each cell.

The T-statistic shows whether the frequency in a cell is higher or lower than expected (in the same sense as used for the Chi-square statistic). It also takes into account the total number of cases in the column. With only a few cases in the column, a deviation from the expected value is less significant than the same deviation with many cases in the column.

The T-statistic is calculated as the ratio of two quantities: The numerator is the difference between the column percent in the cell and the total column percent for that row. The denominator is the standard error of the cell percent. The standard error is estimated by the formula 'sqrt(pq/(n-1))', where p is the column percent, q is (100-p), and n is the number of cases in that column.

Note that if the frequencies in the table are weighted, the T-statistic can be artificially inflated. Consequently, if weights are used, the weighted number of cases in the column is adjusted (multiplied) by the factor: (Total unweighted N) / (Total weighted N). The adjusted number of cases in the column is then used as the n in the formula given above.
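
The T-statistic formula, the adjustment for weights, and the mapping to color shades can be sketched together as follows. The function names and the numbers are hypothetical. The treatment of values that fall exactly on 1 or 2, and of a T-statistic of exactly zero, is an assumption, since the description above does not cover those cases. The sketch also fails if the cell percent is exactly 0 or 100, because the estimated standard error is then zero.

```python
import math

def adjusted_column_n(weighted_column_n, unweighted_total_n, weighted_total_n):
    """Multiply the weighted column N by (total unweighted N) / (total weighted N)."""
    return weighted_column_n * unweighted_total_n / weighted_total_n

def cell_t_statistic(cell_pct, total_pct, column_n):
    """(cell column percent - total column percent) / sqrt(p*q/(n-1))."""
    p = cell_pct
    q = 100.0 - p
    std_err = math.sqrt(p * q / (column_n - 1))
    return (cell_pct - total_pct) / std_err

def shade(t):
    """Map a T-statistic to a (hue, depth) pair; boundary handling is assumed."""
    hue = "red" if t > 0 else "blue" if t < 0 else "none"
    size = abs(t)
    if size > 2:
        depth = "darkest"
    elif size > 1:
        depth = "medium"
    else:
        depth = "lightest"
    return hue, depth

# Hypothetical cell: 50 percent in the column versus 40 percent in the total
# column, with a weighted column N of 202 in a table whose weights double the N.
n = adjusted_column_n(202, unweighted_total_n=1000, weighted_total_n=2000)
t = cell_t_statistic(cell_pct=50.0, total_pct=40.0, column_n=n)
print(n, t, shade(t))
```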



SDA Comparison of Means Program

This program calculates the mean of the dependent variable separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect the number of decimals to show, text to display, and statistics to compute.

Select an action
After specifying all variables and options, select the action to take.

REQUIRED variable names

Dependent variable(s)
Variable whose mean or average value is to be computed for each combination of the row and (optionally) column and control variables and displayed in a table

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable.

If more than one dependent variable, row variable, column variable, and/or control variable is specified, a separate table will be generated for each combination of variables.

Filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights.

Display Options for Comparison of Means

Main statistic to display
Usually each cell of the table will contain the MEAN of the dependent variable for that particular combination of the row and (optionally) column and control variables.

Sometimes, however, it is more helpful to express each cell mean as the DIFFERENCE from the overall mean. Select this option to have those differences calculated and put into each cell of the table.

Another option is to display the TOTALS for each cell. The total is the numerator of the ratio used to calculate the mean. (The denominator of the ratio is the number of cases in that cell.) The totals are usually of interest only when a weight is being used to expand the cell counts up to their estimated values in the population. For example, one may be interested in the total estimated NUMBER of persons in each cell who have some characteristic (e.g., who smoke, or drive cars), instead of the PROPORTION of persons who have that characteristic. This assumes that the dependent variable is coded '1' for a case which has the characteristic (smokes, for example) and '0' for a case which does not have the characteristic.
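
The relationship between the total and the mean in a cell can be illustrated with a small Python sketch. The function name, the data, and the weights are all hypothetical; the dependent variable is coded 1 for smokers and 0 for non-smokers, and each weight is taken to be the number of persons in the population that a case represents.

```python
def weighted_cell_summary(values, weights):
    """Return (total, weighted N, mean) for the cases in one cell."""
    total = sum(w * y for y, w in zip(values, weights))  # numerator of the mean
    weighted_n = sum(weights)                            # denominator of the mean
    return total, weighted_n, total / weighted_n

smokes = [1, 0, 1, 1]            # 1 = smokes, 0 = does not smoke
weights = [100, 200, 50, 150]    # hypothetical expansion weights
total, weighted_n, mean = weighted_cell_summary(smokes, weights)
print(total, weighted_n, mean)
```

Here the total (300) estimates the number of smokers in the cell, while the mean (0.6) estimates the proportion of smokers.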


Additional statistics to display (for simple random samples)
There are four additional statistics that can be displayed in each cell:


Additional statistics to display (for complex probability samples)
There are several additional statistics that can be displayed in each cell:


Confidence intervals
If this option is selected, an additional table is generated that contains the upper and lower bound of the confidence interval of the statistic (mean or total) in each cell. The confidence interval is the range of values within which the population value of the statistic is likely to fall. By default, the level of confidence is 95 percent, but the user can also select 99 percent or 90 percent.

The confidence interval or range is computed by multiplying the standard error of the mean (or total) by the T-value appropriate to the level of confidence and to the number of degrees of freedom (for complex samples).
For a simple random sample, for instance, the 95 percent confidence range is obtained by multiplying the standard error by 1.96. The result is added to the mean (or total) to obtain the upper bound, and the result is subtracted from the mean (or total) to obtain the lower bound.
For complex samples, the appropriate T-value is a function both of the desired level of confidence and of the number of degrees of freedom used to calculate the standard error. For a large number of degrees of freedom, the appropriate T-value is very close to the value used for simple random samples.
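
For the simple random sample case, the calculation can be sketched as follows. The function name and the numbers are hypothetical. The multiplier is taken from the normal distribution, which gives about 1.96 for the 95 percent level; the complex-sample case, which depends on the degrees of freedom, is not covered by this sketch.

```python
from statistics import NormalDist

def srs_confidence_interval(estimate, std_err, level=0.95):
    """Return (lower, upper) bounds, assuming a simple random sample."""
    # Two-sided multiplier: about 1.645, 1.96, and 2.576 for 90, 95, and 99 percent.
    multiplier = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = multiplier * std_err
    return estimate - half_width, estimate + half_width

# Hypothetical cell with a mean of 50 and a standard error of 2.
for level in (0.90, 0.95, 0.99):
    lower, upper = srs_confidence_interval(50.0, 2.0, level)
    print(level, round(lower, 2), round(upper, 2))
```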


Multiple Classification Analysis (MCA)
If this option is selected, an MCA table is generated for the categories of each row, column, and control variable. Note that this procedure shows the average effects of each category, and it ignores any interactions between the variables. If interaction effects are statistically significant, MCA is generally not appropriate.

The first column of the table gives the difference between the dependent variable score of respondents in each category and the overall mean of the dependent variable. This is the UNADJUSTED effect of each category.

The second column of the table gives the ADJUSTED effect of each category, taking into account the effects of the other variables. The adjustment process is similar to running a regression with dummy variables for the various categories. Regression coefficients for dummy variables, however, represent deviations from the effect of the omitted category. MCA coefficients, on the other hand, are deviations from the overall mean of the dependent variable.

The eta coefficient for each variable is like a bivariate correlation coefficient. It is the square root of the proportion of variance of the dependent variable "explained" by the categories of each variable.

The beta coefficient for each variable is like a standardized regression coefficient. It adjusts the eta coefficient for each variable by taking into account the effects of the other variables.
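
The unadjusted effects and the eta coefficient for a single variable can be sketched as follows. This is only an illustration with hypothetical, unweighted data and a hypothetical function name; the adjusted effects and the beta coefficient require the full multivariate adjustment and are not attempted here.

```python
import math

def unadjusted_effects_and_eta(groups):
    """groups maps each category to the dependent-variable scores of its cases."""
    all_values = [v for values in groups.values() for v in values]
    overall_mean = sum(all_values) / len(all_values)
    effects = {}
    ss_between = 0.0
    for category, values in groups.items():
        category_mean = sum(values) / len(values)
        # Unadjusted effect: category mean minus the overall mean.
        effects[category] = category_mean - overall_mean
        ss_between += len(values) * effects[category] ** 2
    ss_total = sum((v - overall_mean) ** 2 for v in all_values)
    # Eta: square root of the proportion of variance "explained" by the categories.
    eta = math.sqrt(ss_between / ss_total)
    return effects, eta

effects, eta = unadjusted_effects_and_eta({"a": [1, 2, 3], "b": [4, 5, 6]})
print(effects, round(eta, 4))
```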


Diagnostic output for standard errors (for complex probability samples)
If this option is selected, an additional table is generated that contains the following statistics in each cell:
ANOVA
An analysis of variance can be carried out and presented after the table of means. This analysis is used to assess the statistical significance of the effects of the row variable (and the column variable, if there is one) on the dependent variable.

If the P-value (probability statistic) associated with a variable is low (about .05 or less), the chances are correspondingly low that the observed effect on the dependent variable is only due to sampling error, and in that case the effect is said to be statistically significant.

Consult any beginners' statistics book for more information on the meaning of these statistics.


Other Display Options

Suppress display of the table
Occasionally you may want to see ANOVA statistics or a Multiple Classification Analysis (MCA) without viewing the table of means, especially if the table is a very large one. If you select this option, the table of means is generated internally but is not displayed. Tables containing confidence intervals and diagnostic information (for complex standard errors) are also suppressed.
Color coding of the cells
The cells of the table of means are color coded, in order to aid in detecting patterns. Cells with higher means than the overall mean become redder, the more they exceed the overall mean. Cells with lower means than the overall mean become bluer, the smaller they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the T-statistic. The lightest shade corresponds to T-statistics between 0 and 1. The medium shade corresponds to T-statistics between 1 and 2. The darkest shade corresponds to T-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Show the T-statistic
The T-statistic controls the color coding of cells in the table of means. If you select this option, the statistic will also be displayed in each cell.

The T-statistic shows whether the mean in a cell is larger or smaller than the overall mean. It also takes into account the total number of cases in the cell. If there are only a few cases in a cell, the deviation from the overall mean is not as significant as if there are many cases in that cell.

The T-statistic is calculated as the ratio of two quantities: The numerator is the difference between the mean in the cell and the overall mean. The denominator is the standard error of the mean in that cell. If complex standard errors have been requested, the complex standard error for each cell is used to calculate the T-statistic.
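
A minimal sketch of this ratio for one cell, assuming unweighted data and a simple random sample, is given below. The function name and the data are hypothetical, and the standard error is computed as the sample standard deviation divided by the square root of the number of cases, which is an assumption about the simple random sample case only.

```python
import math
import statistics

def mean_t_statistic(cell_values, overall_mean):
    """(cell mean - overall mean) / standard error of the cell mean."""
    cell_mean = statistics.mean(cell_values)
    std_err = statistics.stdev(cell_values) / math.sqrt(len(cell_values))
    return (cell_mean - overall_mean) / std_err

# Hypothetical cell with four cases, compared with an overall mean of 5.
t = mean_t_statistic([4, 6, 8, 10], overall_mean=5.0)
print(round(t, 3))
```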


Number of decimals to display
Each statistic displayed in the cells of the table has a default number of decimal places. If you want more or fewer decimal places, you can generally specify from 0 to 6 decimal places for most of the statistics displayed in each cell. Note that some decimal place specifications are RELATIVE to the number of decimal places in the main statistic (means or totals).
Question text
The text of the question that produced each variable is generally available.


SDA Comparison of Correlations Program

This program calculates the correlation between two variables separately within categories of the row variable and, optionally, the column variable. If a control variable is specified, a separate table will be produced for each category of the control variable. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect the number of decimals to show, statistics to compute, and text to display.

Select an action
After specifying all variables and options, select the action to take.

REQUIRED variable names

Variables to be correlated
Two variables whose correlation coefficient is to be computed for each combination of the row and (optionally) column and control variables and displayed in a table

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable.

If more than one correlation variable, row variable, column variable, and/or control variable is specified, a separate table will be generated for each combination of variables.

Weight variable
Cases are given different relative weights.

Filter variable(s)
Some cases are included in the analysis; others are excluded.

Display Options for Comparison of Correlations

Correlation measure to calculate
The Pearson correlation coefficient is the default correlation measure to calculate. It is appropriate for ordered numeric variables.

The log of the odds-ratio is an optional measure for dichotomous variables.


Show differences from overall correlation (instead of cell correlations)
Usually each cell of the table will contain the correlation coefficient of the two variables being correlated, for that particular combination of the row and (optionally) column and control variables. Sometimes, however, it is more helpful to express each cell correlation as the DIFFERENCE from the overall correlation. Select this option to have those differences calculated and put into each cell of the table.
Standard errors
Standard errors for the correlations can be computed and displayed for each cell of the table. The standard errors are used to create confidence intervals for the correlation in each cell. If the sample is equivalent to a simple random sample of a population, you can be about 95% confident that the correlation in the population (for each cell) is within the interval bounded by two standard errors above and below the correlation in the sample (shown in the table).

The standard error is computed differently, depending on which correlation coefficient you have selected. The standard error for the Pearson correlation is based on Fisher's Z, and it is calculated as the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z into Pearson's R). The standard error for the log of the odds ratio is calculated with standard formulas for that statistic.
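
The Fisher's Z calculation described above can be sketched as follows. The function name is hypothetical. The sketch uses the usual standard error of Z, 1/sqrt(n - 3), and it is an assumption that the program uses exactly this form and this definition of n.

```python
import math

def pearson_standard_error(r, n):
    """Average one-standard-error distance around r, via Fisher's Z."""
    z = math.atanh(r)                 # transform R into Fisher's Z
    se_z = 1.0 / math.sqrt(n - 3)     # usual standard error of Z
    upper = math.tanh(z + se_z)       # retransform the upper band into R
    lower = math.tanh(z - se_z)       # retransform the lower band into R
    # The band is not symmetric around r, so average the two distances.
    return ((upper - r) + (r - lower)) / 2.0

print(round(pearson_standard_error(0.5, 103), 4))
print(round(pearson_standard_error(0.0, 103), 4))
```

The standard error shrinks as the correlation moves away from zero, because the retransformed band becomes narrower near +1 and -1.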

If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.

Consult any beginners' statistics book for more information on the meaning of these statistics.

Other Display Options

Color coding of the cells
The cells of the table of correlations are color coded, in order to aid in detecting patterns. Cells with higher correlations than the overall correlation become redder, the more they exceed the overall correlation. Cells with lower correlations than the overall correlation become bluer, the smaller they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the T-statistic. The lightest shade corresponds to T-statistics between 0 and 1. The medium shade corresponds to T-statistics between 1 and 2. The darkest shade corresponds to T-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Show the T-statistic
The T-statistic controls the color coding of cells in the table of correlations. If you select this option, the T-statistic will also be displayed in each cell.

The T-statistic shows whether the correlation in a cell is larger or smaller than the overall correlation. It also takes into account the total number of cases in each cell. If there are only a few cases in a cell, the deviations from the overall correlation are not as significant as if there are many cases in that cell.

The T-statistic is calculated as the ratio of two quantities: The numerator is the difference between the correlation in the cell and the overall correlation. The denominator is the standard error of the correlation in that cell.


Number of decimals for the correlation
You can select from 1 to 6 decimal places. The default is 2 decimal places.
Question text
The text of the question that produced each variable is generally available.


SDA Correlation Matrix Program

This program calculates the correlation between all pairs of two or more variables. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be correlated or used as a filter or weight variable, give the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run correlations
After specifying all variables and options, select Run correlations to run the program.

Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.


REQUIRED variable names

Variables to be correlated
Enter the names of two or more variables whose correlation coefficients are to be computed for each pair of variables.

Enter the name of each variable in a window (box). To go from one window to another, use the tab key or your mouse. It is all right to skip a window and leave it blank -- to use only windows 1, 5, and 9, for example.

It is possible to enter more than one variable name in a window (the underlying text-entry area will scroll). This has consequences for other options which refer to variable numbers. For example, if you enter two variables in window number 3, and then you request that the signs of the correlations be reversed for variable number 3, the signs of BOTH variables in window number 3 will be reversed.

Each window, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each window, but the possibility of defining groups of variables exists.


OPTIONAL variable names

Filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights in calculating the correlation coefficients.

How to exclude cases with missing data

Listwise exclusion
If a case has a missing-data value on ANY of the variables to be correlated, it is excluded from ALL of the correlation calculations. This is the default procedure.

Pairwise exclusion
If a case has a missing-data value on SOME of the variables to be correlated, but not on others, it is excluded from the calculations for those PAIRS of variables in which one of the values is missing.

This procedure retains all of the information about each pairwise relationship. However, the multivariate relationships can be inconsistent, if many of the cases have different missing-data patterns on different variables.
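
The difference between the two procedures can be seen in a small Python sketch that only selects the cases to be used. The function names and the data are hypothetical, and a missing-data value is represented here by None.

```python
from itertools import combinations

def listwise_cases(data):
    """Indexes of the cases with valid data on ALL of the variables."""
    names = list(data)
    n_cases = len(data[names[0]])
    return [i for i in range(n_cases)
            if all(data[name][i] is not None for name in names)]

def pairwise_cases(data):
    """For each PAIR of variables, indexes of the cases valid on both."""
    selected = {}
    for a, b in combinations(data, 2):
        selected[(a, b)] = [i for i in range(len(data[a]))
                            if data[a][i] is not None and data[b][i] is not None]
    return selected

data = {
    "x": [1, 2, 3, 4, None],
    "y": [2, 4, None, 8, 10],
    "z": [1, None, 3, 4, 5],
}
print(listwise_cases(data))
print(pairwise_cases(data))
```

In this example listwise exclusion keeps only two of the five cases, while each pair of variables keeps three cases under pairwise exclusion, but a different three for each pair.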


Correlation Measure to Calculate

The Pearson correlation coefficient
This is the usual correlation coefficient and is the default correlation measure to calculate. It is appropriate for ordered numeric variables.

Log of the odds-ratio
The log of the odds-ratio is an optional measure for dichotomous variables. The calculation of the odds ratio assumes that the two variables have only two categories each. If these statistics are requested, the correlation program treats each variable as a dichotomy, regardless of the number of categories it may actually have. The minimum valid value of each variable is treated as one category, and all valid values greater than the minimum are combined into the other category.

If this default dichotomization is not appropriate for a particular analysis, you can recode the variable temporarily within the correlation program using the standard methods of recoding variables.
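
The default dichotomization and the log of the odds-ratio can be sketched as follows. The function names and the data are hypothetical, and None stands for a missing-data value. The standard error shown is the usual large-sample formula, sqrt(1/a + 1/b + 1/c + 1/d); it is an assumption that this is the formula the program uses. The sketch fails if any of the four cells is empty.

```python
import math

def dichotomize(values):
    """0 for the minimum valid value, 1 for every larger valid value."""
    lowest = min(v for v in values if v is not None)
    return [None if v is None else int(v > lowest) for v in values]

def log_odds_ratio(x, y):
    """Return (log of the odds-ratio, its standard error) for two variables."""
    counts = [[0, 0], [0, 0]]
    for xi, yi in zip(dichotomize(x), dichotomize(y)):
        if xi is not None and yi is not None:
            counts[xi][yi] += 1
    (a, b), (c, d) = counts
    log_or = math.log((a * d) / (b * c))
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, std_err

# x has three categories and y has three; both are reduced to dichotomies.
x = [1, 1, 1, 1, 2, 3, 2, 3, 2, 3]
y = [0, 0, 0, 1, 0, 0, 1, 2, 1, 2]
print(log_odds_ratio(x, y))
```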

Consult any beginners' statistics book for more information on the meaning of these statistics.


Additional Statistics to Calculate

Alpha coefficient
Cronbach's alpha coefficient is a measure of how well the variables in the correlation matrix could be said to measure the same thing. If you added together all of the variables included in the correlation matrix to form a scale, alpha is the square of the correlation between the scale and the underlying factor.

The alpha coefficient is a function of the average correlation between the variables and of the number of variables. If some of the variables are scored in opposite directions, you should use the option to reverse the signs of some of the variables, so that a high score on all variables means the same thing.
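
One common formula that expresses alpha in terms of the average correlation and the number of variables is the standardized alpha, k*r/(1 + (k - 1)*r), where k is the number of variables and r is the average correlation. The sketch below uses that formula; whether the program uses exactly this form is an assumption, and the function name and the matrix are hypothetical.

```python
def standardized_alpha(correlations):
    """Standardized alpha from a square correlation matrix (list of lists)."""
    k = len(correlations)
    # Each pair of variables is counted once (upper triangle only).
    off_diagonal = [correlations[i][j]
                    for i in range(k) for j in range(k) if i < j]
    mean_r = sum(off_diagonal) / len(off_diagonal)
    return k * mean_r / (1 + (k - 1) * mean_r)

# Four hypothetical items that all correlate .5 with one another.
matrix = [
    [1.0, 0.5, 0.5, 0.5],
    [0.5, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
]
print(standardized_alpha(matrix))
```

If one of the items were scored in the opposite direction, its correlations would be negative, the average correlation would drop, and alpha would be misleadingly low; this is why the signs should be reversed first.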

Standard errors
A standard error for each correlation coefficient can be computed. If this option is requested, the standard errors are placed in a separate matrix, right under the matrix of correlation coefficients.

The standard errors are used to create confidence intervals for each correlation coefficient. If the sample is equivalent to a simple random sample of a population, you can be about 95% confident that the correlation coefficient in the population (for each pair of variables) is within the interval bounded by two standard errors above and below the correlation in the sample (as shown in the matrix). If the sample is more complex than a simple random sample, the standard errors calculated here are probably too small.

The calculation of the standard error of the correlation coefficient in each cell is based by default on the UNWEIGHTED number of cases, even if a weight variable has been used for calculating the correlation coefficient. Ordinarily this procedure will generate a more appropriate statistical test than one based on the weighted N in each cell.

The standard error is computed differently, depending on which correlation coefficient you have selected.

Standard errors for Pearson correlation coefficients
The confidence interval for the Pearson correlation coefficient is not symmetric; therefore, there is no single standard error that applies in both directions. The standard error output by this program is the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher's Z), since that number is ordinarily a useful approximation.

Standard errors for the log of the odds ratio
The standard error for the log of the odds ratio is calculated with standard formulas for that statistic. Consult a statistics book for details.

Univariate statistics
Univariate statistics for each of the variables in the correlation matrix will be computed and displayed, if this option is selected.

The statistics available for each variable include its mean, standard deviation, standard error, valid N of cases, and (if there is a weight variable) valid weighted N of cases.

If missing-data cases have been excluded LISTWISE (the default), the univariate statistics for all variables will be based on the SAME cases -- those which have valid data on ALL of the variables.

If missing-data cases have been excluded PAIRWISE, the univariate statistics for each variable will be based on all the cases with valid data for that one variable.


Paired univariate statistics
If missing-data cases have been excluded pairwise, each correlation coefficient is based (potentially) on a different subset of the cases. Univariate statistics based on that same subset of cases for each pair of variables will be calculated and displayed, if this option is selected.

The paired statistics for each variable include its mean, standard deviation, valid N of cases for the pair, and (if there is a weight variable) valid weighted N of cases for the pair.

These statistics are displayed as a series of matrices. Each statistic for a given variable is (potentially) somewhat different, depending on which other variable it is being paired with.


Index of proportionality (P-squared)
It is sometimes useful to know the degree to which the correlations in each row of the correlation matrix are proportional to the correlations in the other rows. This is particularly the case in creating scales or indexes of items. If variables are measuring the same thing, they should have similar correlations to other relevant (criterion) variables.

The P-squared statistic is a way to measure the proportionality of rows in a correlation matrix. For example, if all of the coefficients in one row are exactly double the size of the coefficients in another row, there is a constant proportionality, and the index will be 1.0.

Usually we want to limit this comparison to a subset of the matrix -- namely, to the part corresponding to the correlations of the criterion variables with the variables of interest. To do this, we specify on the option screen the variable numbers (next to each window on the option screen) corresponding to the variables for which we want the P-squared measure, and the variable numbers corresponding to the criterion variables.

For example, we could examine the degree to which the variables v1, v2, and v3 have proportional correlations to the criterion variables x1, x2, and x3. We would enter v1, v2, and v3 into the first 3 windows on the option screen; and x1, x2, and x3 into windows 4 through 6. To get the P-squared statistic for all the combinations of v1, v2, and v3, in respect to the criterion variables, we would then specify:

These variable numbers can be specified either as a range (1-3) or as a list (1,2,3); and the variables need not be adjacent in the original correlation matrix -- a list like '1,3,5' is valid.
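
To make the two notations concrete, the following Python sketch expands a specification into a list of variable numbers. It is not the parser used by SDA; the function name is hypothetical, and the handling of a specification that mixes ranges and lists is an assumption.

```python
def parse_variable_numbers(spec):
    """Expand a specification such as '1-3' or '1,3,5' into a list of numbers."""
    numbers = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as 1-3 includes both of its end points.
            first, last = part.split("-")
            numbers.extend(range(int(first), int(last) + 1))
        else:
            numbers.append(int(part))
    return numbers

print(parse_variable_numbers("1-3"))
print(parse_variable_numbers("1,2,3"))
print(parse_variable_numbers("1,3,5"))
```

The first two specifications select the same three variables.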

The P-squared statistics are presented in a symmetrical matrix. Each row and column corresponds to one of the variables that we specified as a "variable to measure."

For a discussion of how to use this statistic, see Thomas Piazza, "The Analysis of Attitude Items," American Journal of Sociology, vol. 86 (1980) pp. 584-603.


Other Display Options

Reverse signs of some correlations
In order to detect patterns in the correlation matrix, it is sometimes useful to reverse the signs of the correlations corresponding to one or more variables. Enter the variable number of each variable for which you want the signs reversed. The variable number corresponds to the window number on the option screen.

For example, we may know that var1 is scaled in such a way that a HIGH score or value corresponds to a LOW score on var2 and var3, so we expect the correlations of var1 to be negative with var2 and var3. But if we are interested in the relationships of those variables to other variables, it will be easier to detect different patterns if we reverse all the signs corresponding to var1. That way, we can expect var1, var2, and var3 to have correlations of the same sign with other variables. Then if we do observe a difference in the signs, it will catch our attention.
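
The effect of reversing the signs for a variable can be sketched as follows. The function name and the matrix are hypothetical, and the sketch assumes one variable per window, so that variable number 1 is the first row and column of the matrix. The diagonal is left unchanged, since a variable still correlates 1.0 with itself after it is reversed.

```python
def reverse_signs(matrix, variable_numbers):
    """Reverse the signs of the correlations for the given variable numbers."""
    k = len(matrix)
    sign = [-1 if (i + 1) in variable_numbers else 1 for i in range(k)]
    # A correlation changes sign when exactly one of its two variables is reversed.
    return [[sign[i] * sign[j] * matrix[i][j] for j in range(k)]
            for i in range(k)]

# var1 is scored in the opposite direction from var2 and var3.
matrix = [
    [1.0, -0.4, -0.3],
    [-0.4, 1.0, 0.5],
    [-0.3, 0.5, 1.0],
]
for row in reverse_signs(matrix, {1}):
    print(row)
```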


Color coding of the correlations
The correlation coefficients are color coded, in order to aid in detecting patterns. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.

Color coding is also used for the P-squared matrix, if one has been requested. However, the dividing points for colors are double in magnitude. The lightest shade corresponds to P-squared coefficients between 0 and .30. The colors become darker as the absolute value of the P-squared coefficient exceeds .30, then .60, then .90.
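
The shading rule for both matrices can be sketched as follows. The dividing points come from the text above; the function name shade, the numeric darkness scale, and the treatment of a coefficient of exactly zero are assumptions made for this illustration.

```python
# Dividing points taken from the description above.
CORRELATION_CUTS = (0.15, 0.30, 0.45)
P_SQUARED_CUTS = (0.30, 0.60, 0.90)


def shade(value, cuts):
    """Return (hue, darkness) for one coefficient.

    darkness counts how many dividing points the absolute value exceeds,
    from 0 (lightest shade) to 3 (darkest shade).
    """
    if value > 0:
        hue = "red"
    elif value < 0:
        hue = "blue"
    else:
        hue = "none"  # assumption: the text does not say how zero is shown
    darkness = sum(abs(value) > cut for cut in cuts)
    return hue, darkness


print(shade(0.50, CORRELATION_CUTS))  # ('red', 3)
print(shade(0.50, P_SQUARED_CUTS))    # ('red', 1)
```

The same value of .50 gets the darkest shade in the correlation matrix but only the second shade in the P-squared matrix, because the dividing points there are doubled.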

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the matrix on a black-and-white printer.


Question text
The text of the question that produced each variable is generally available.


SDA Regression Program

This program calculates the regression coefficients (ordinary least squares) for one or more independent or predictor variables. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be used as the dependent variable, give the name for that variable as given in the documentation for this study. Then specify the names of one or more independent variables. Filter variables and a weight variable may also be specified.

Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.

Select display options
After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run regression
After specifying all variables and options, select Run Regression to run the program.

Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.


REQUIRED variable names

Dependent variable
Enter the name of one variable to be used as the dependent variable or the variable to be predicted.

Independent variables
Enter the names of one or more variables whose regression coefficients are to be computed. Note that you can specify dummy variables and product terms as independent variables. It is also possible to restrict the range of a variable or to recode the variable temporarily.

Enter the name of each variable in a window (box). To go from one window to another, use the tab key or your mouse. It is all right to skip a window and leave it blank -- to use only windows 1, 5, and 9, for example.

It is possible to enter more than one variable name in a window (the underlying text-entry area will scroll). Each window, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each window, but the possibility of defining groups of variables exists.


Dummy variables and product terms

Dummy variable(s)
A dummy variable is a variable coded 0 or 1. Cases that have a certain characteristic are coded as 1; whereas cases that do NOT have the characteristic are coded as 0.

To create such a variable temporarily (for a single regression run, for example), use the following syntax:

varname(d:1-3)

This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.

The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.

You can also give the dummy variable a label by putting the label in double quotes:

occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")
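
The coding rule for a temporary dummy variable can be sketched in Python. This is an illustration under stated assumptions, not the program's code: make_dummy and missing_codes are hypothetical names, None stands in for the system-missing value, and the sketch handles only non-negative codes and ignores the optional label.

```python
def make_dummy(value, spec, missing_codes=()):
    """Apply the code list that follows 'd:' (such as "1, 3-5, 9, 10") to one value.

    Returns 1 if the value is listed, 0 for any other valid value, and
    None if the value is missing (None or one of missing_codes).
    """
    if value is None or value in missing_codes:
        return None
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (float(x) for x in part.split("-"))
        else:
            low = high = float(part)
        if low <= value <= high:
            return 1
    return 0


# occupation(d:1, 3-5, 9, 10), with 99 assumed to be a missing-data code
codes = "1, 3-5, 9, 10"
print([make_dummy(v, codes, missing_codes={99}) for v in (1, 2, 4, 10, 99)])
# [1, 0, 1, 1, None]
```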

Product terms
An independent variable can be the product of two or more variables.

To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:

age*education

This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.

One or more dummy variables can also be part of a product term. For example, the following form is acceptable:

party(d:3)*sex

In this example, first a dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'.
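
A product term such as party(d:3)*sex can be sketched as follows. The function names and the sample cases are hypothetical, and None stands in for the system missing-data value.

```python
def product_term(*values):
    """Multiply the component values; the result is None if any component is missing."""
    if any(v is None for v in values):
        return None
    result = 1
    for v in values:
        result *= v
    return result


def party_dummy(party):
    """Dummy variable for party(d:3): 1 if party is 3, 0 for other valid codes."""
    if party is None:
        return None
    return 1 if party == 3 else 0


# Hypothetical cases: the third case has no valid code for 'party'.
cases = [
    {"party": 3, "sex": 2},
    {"party": 1, "sex": 2},
    {"party": None, "sex": 1},
]

products = [product_term(party_dummy(c["party"]), c["sex"]) for c in cases]
print(products)  # [2, 0, None]
```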


OPTIONAL variable names

Filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights in calculating the regression coefficients.

How cases with missing data are excluded

Listwise exclusion
If a case has a missing-data value on ANY of the variables to be correlated and then regressed, it is excluded from ALL of the regression calculations. This is the only allowed procedure. The pairwise option available for the correlation program is not available for the regression programs.
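
Listwise exclusion can be sketched as a simple filter over the cases. The function name, the variable names, and the data are hypothetical; None stands in for any missing-data value.

```python
def listwise(cases, variables):
    """Keep only the cases that have a valid value on every listed variable."""
    return [
        case for case in cases
        if all(case.get(name) is not None for name in variables)
    ]


cases = [
    {"income": 40, "age": 35, "education": 12},
    {"income": None, "age": 50, "education": 16},
    {"income": 55, "age": 41, "education": None},
]

complete = listwise(cases, ["income", "age", "education"])
print(len(complete))  # 1
```

Only the first case enters the calculations, even though the other two cases have valid values on two of the three variables.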

Additional Statistics to Calculate

T-test for each coefficient
The t-test for each regression coefficient is generally displayed. The t-statistic is the ratio of the regression coefficient (B) to its standard error -- shown as SE(B).

The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (B) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn. (Note that this version of the regression program assumes that the dataset was generated by a simple random sample.)

If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.

To estimate the confidence interval of a specific regression coefficient, use the standard error of the coefficient -- displayed as SE(B). The approximate 95 percent confidence interval of each coefficient is formed by creating a range that is equal to the regression coefficient plus or minus two times the standard error.
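
The arithmetic for the t-statistic and the approximate confidence interval is short enough to sketch directly. The function name and the sample numbers are made up for illustration.

```python
def t_and_interval(b, se_b):
    """Return the t-statistic and the approximate 95 percent confidence interval."""
    t = b / se_b
    interval = (b - 2 * se_b, b + 2 * se_b)
    return t, interval


# Hypothetical coefficient B = 1.5 with SE(B) = 0.5
t, (low, high) = t_and_interval(1.5, 0.5)
print(t, low, high)  # 3.0 0.5 2.5
```

Here the interval from 0.5 to 2.5 does not include zero, which agrees with a t-statistic well above 2.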

The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.

Global F-test
The global F-statistic for the regression is computed, if this option is selected. The P-value (probability value) for the F-statistic is given in the last column of the table. This is the probability of obtaining an F-statistic this large or larger, if ALL of the regression coefficients (B's) are equal to zero in the population from which the current sample was drawn. (Note that this version of the regression program assumes that the dataset was generated by a simple random sample.)

If the P-value for the regression is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. However, a low P-value does not guarantee that any specific independent variable has an effect on the dependent variable (unless there is only one independent variable). The t-test for each independent variable should be examined for that purpose.

Univariate statistics
Univariate statistics for each of the variables will be computed and displayed, if this option is selected. The statistics displayed for each variable include its mean and standard deviation.

Correlation matrix
The correlation matrix used to calculate the regression coefficients will be displayed.

Covariance matrix
The covariance matrix computed from the data file will be displayed.

Other Display Options


Color coding of the coefficients
The regression coefficients are color coded, in order to aid in detecting patterns, if t-tests have been requested. Regression coefficients greater than zero become redder, the larger they are. Regression coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) to its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.
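
The shading of a regression coefficient can be sketched as follows. The function name is hypothetical, and the handling of t-statistics that fall exactly on 1 or 2, and of a coefficient of exactly zero, is an assumption, since the text does not specify it.

```python
def coefficient_shade(b, se_b):
    """Return (hue, shade) for a regression coefficient, based on its t-statistic."""
    t = b / se_b
    if b > 0:
        hue = "red"
    elif b < 0:
        hue = "blue"
    else:
        hue = "none"  # assumption: zero coefficients are not colored
    size = abs(t)
    if size > 2:
        shade = "darkest"
    elif size > 1:  # assumption: a value of exactly 1 stays in the lightest shade
        shade = "medium"
    else:
        shade = "lightest"
    return hue, shade


print(coefficient_shade(1.5, 0.5))   # ('red', 'darkest')
print(coefficient_shade(-0.3, 0.2))  # ('blue', 'medium')
```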

Correlation coefficients are also color coded, if a correlation matrix is requested. Correlation coefficients greater than zero become redder, the larger they are. Correlation coefficients less than zero become bluer, the more negative they are. The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the correlation coefficient in each cell. The lightest shade corresponds to coefficients between 0 and .15. The colors become darker as the absolute value of the correlation exceeds .15, then .30, then .45.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Question text
The text of the question that produced each variable is generally available.


SDA Logit/Probit Regression Program

This program calculates the logit or probit regression coefficients for one or more independent or predictor variables. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be used as the dependent variable, give the name for that variable as given in the documentation for this study. Then specify the names of one or more independent variables. Filter variables and a weight variable may also be specified.

Aside from simply specifying the name of a variable, it is possible to restrict the range of a variable or to recode the variable temporarily. Note in particular that you can create dummy variables and product terms.

Select display options
After specifying the names of variables, select the display options you wish. These affect the statistics to compute, the number of decimals to show, and text to display.

Run Logit/Probit
After specifying all variables and options, select Run Logit/Probit to run the program.

Or you can select Clear Fields to delete all previously specified variables and options, so that you can start over.


Type of regression to run

The program can run either logit (logistic) or probit regression. The difference between them is in how the dependent variable is transformed from a proportion (a mean between 0 and 1).

Logit regression expresses the dependent variable as the natural logarithm of the odds that a person will have a score of 1 versus a score of 0 on the dependent variable.

Probit regression expresses the dependent variable as the inverse of the cumulative normal distribution function corresponding to the proportion.
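
The two transformations can be sketched with the Python standard library. The function names logit and probit are chosen here for illustration; the sketch shows only how a proportion is transformed, not how the program estimates the coefficients.

```python
import math
from statistics import NormalDist


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse of the standard cumulative normal distribution function."""
    return NormalDist().inv_cdf(p)


for p in (0.25, 0.50, 0.75):
    print(p, round(logit(p), 3), round(probit(p), 3))
```

Both transformations map a proportion of .50 to zero and are symmetric around it; they differ mainly in scale, which is one reason the substantive results are usually the same.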

When the dependent variable has only two categories, both logit and probit regression are more appropriate to use than ordinary least squares regression. Both logit and probit regression will usually generate the same substantive results. The choice between them is generally a matter of custom within a specific field or discipline.


REQUIRED variable names

Dependent variable
Enter the name of one variable to be used as the dependent variable or the variable to be predicted. This variable should have two categories: 0 and 1.

If the variable you want to use as a dependent variable is not coded as a simple 0/1 variable, you can create a dummy variable, or you can recode the variable temporarily.

If the dependent variable is left as anything other than a simple 0/1 variable, the program will recode the dependent variable automatically. The lowest valid score will be recoded to the value '0', and all other scores will be recoded to the value '1'.
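
The automatic recoding rule can be sketched as follows. The function name is hypothetical, and None stands in for missing data, which is assumed to stay missing.

```python
def auto_recode(values):
    """Recode the lowest valid score to 0 and all other valid scores to 1."""
    valid = [v for v in values if v is not None]
    if not valid:
        return list(values)
    lowest = min(valid)
    return [None if v is None else (0 if v == lowest else 1) for v in values]


# A hypothetical dependent variable coded 1, 2, 3
print(auto_recode([1, 2, 3, None, 1]))  # [0, 1, 1, None, 0]
```

Note that the codes 2 and 3 are combined into the category '1'. If that is not the split you want, create a dummy variable or recode the variable yourself.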

Independent variables
Enter the names of one or more variables whose regression coefficients are to be computed. Note that you can specify dummy variables and product terms as independent variables. It is also possible to restrict the range of a variable or to recode the variable temporarily.

Enter the name of each variable in a window (box). To go from one window to another, use the tab key or your mouse. It is all right to skip a window and leave it blank -- to use only windows 1, 5, and 9, for example.

It is possible to enter more than one variable name in a window (the underlying text-entry area will scroll). Each window, consequently, defines a variable GROUP. Ordinarily it is clearer to put only one variable in each window, but the possibility of defining groups of variables exists.


Dummy variables and product terms

Dummy variable(s)
A dummy variable is a variable coded 0 or 1. Cases that have a certain characteristic are coded as 1, and cases that do NOT have the characteristic are coded as 0. Note that the dependent variable can be coded into the required 0/1 categories by creating a dummy variable.

To create such a variable temporarily (for a single analysis run, for example), use the following syntax:

varname(d:1-3)

This would create a variable in which cases coded 1 through 3 on the variable 'varname' receive a code of 1, and all other VALID cases receive a code of 0. If 'varname' has a code defined as missing-data or out of range, the dummy variable will have the system-missing data value.

The characters 'd:' (or 'D:') indicate that you want to create a temporary dummy variable. The codes that follow show which codes on the original variable should become the code of 1 on the new dummy variable. One or more single code values or ranges can be specified. Multiple codes or ranges are separated by a comma.

You can also give the dummy variable a label by putting the label in double quotes:

occupation(d:1, 3-5, 9, 10 "Managerial occupations = 1")

Product terms
An independent variable can be the product of two or more variables.

To create such a variable temporarily, for a single regression run for instance, use an asterisk (*) between the component variable names. For example:

age*education

This would create a variable in which, for each case, the value of 'age' is multiplied by the value of 'education'. If either 'age' or 'education' has an invalid code for that case, the temporary product term will have the system missing-data value.

One or more dummy variables can also be part of a product term. For example, the following form is acceptable:

party(d:3)*sex

In this example, first a dummy variable is created from the variable 'party', and then that dummy variable is multiplied by 'sex'.


OPTIONAL variable names

Filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights in calculating the regression coefficients.

How to exclude cases with missing data

Listwise exclusion
If a case has a missing-data value on ANY of the variables included in the logit or probit regression, it is excluded from ALL of the regression calculations. This is the only allowed procedure. The pairwise option available for the correlation program is not available for the regression programs.

Additional Statistics to Calculate

T-test for each coefficient
The t-test for each logit or probit regression coefficient is generally displayed. The t-statistic is the ratio of the regression coefficient (B) to its standard error -- shown as SE(B).

The probability estimate associated with each t-statistic is given in the last column. This is the probability of obtaining a regression coefficient (B) that is this large or larger, if the true coefficient is equal to zero in the population from which the current sample was drawn. (Note that this version of the logit/probit regression program assumes that the dataset was generated by a simple random sample.)

If the probability value for a regression coefficient is low (about .05 or less), the chances are correspondingly low that the observed effect of that independent variable on the dependent variable is only due to sampling error. However, a low probability value does not indicate that the true value of the coefficient in the population is of any specific magnitude -- only that it is not equal to zero.

To estimate the confidence interval of a specific regression coefficient, use the standard error of the coefficient -- displayed as SE(B). The approximate 95 percent confidence interval of each coefficient is formed by creating a range that is equal to the regression coefficient plus or minus two times the standard error.

The t-statistic and associated probability value are also given for the constant term of the regression equation. This is a test that the regression equation in the population has no constant term (or intercept). This test is usually of less interest than the tests for the regression coefficients of the independent variables.

Global significance tests
A pseudo-R-squared statistic is displayed. This is analogous to the R-squared statistic in ordinary least squares regression, which expresses the proportion of variance in the dependent variable explained by the entire set of independent variables.

A chi-square test for the regression is also computed. The P-value (probability value) for the chi-square test is the probability of obtaining a chi-square statistic this large or larger, if ALL of the regression coefficients (B's) are equal to zero in the population from which the current sample was drawn. (Note that this version of the regression program assumes that the dataset was generated by a simple random sample.)

If the P-value for the regression is low (about .05 or less), the chances are correspondingly low that ALL of the observed effects of the independent variables on the dependent variable are only due to sampling error. However, a low P-value does not guarantee that any specific independent variable has an effect on the dependent variable (unless there is only one independent variable). The t-test for each independent variable should be examined for that purpose.

Univariate statistics
Univariate statistics for each of the variables will be computed and displayed, if this option is selected. The statistics displayed for each variable include its mean and standard deviation.

Other Display Options


Color coding of the coefficients
The regression coefficients are color coded, in order to aid in detecting patterns, if t-tests have been requested. Regression coefficients greater than zero become redder, the larger they are. Regression coefficients less than zero become bluer, the more negative they are.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the t-statistic, which is the ratio of each regression coefficient (B) to its standard error. The lightest shade corresponds to t-statistics between 0 and 1. The medium shade corresponds to t-statistics between 1 and 2. The darkest shade corresponds to t-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.


Question text
The text of the question that produced each variable is generally available.


SDA Listcase Program

This program lists the values of individual cases on variables specified by the user. Values of a variable can also be transformed into percents of a second variable. This is particularly useful when the cases in the data file are aggregate units such as cities.

One or more filter variables are used to limit the listing to a subset of the cases. In general a limit of 500 cases is enforced for each listing, in case the user has forgotten to limit the listing with sufficient filter variables.

An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables to list
To specify that a certain survey question or variable is to be included in the listing, enter into one of the windows the name for that variable, as given in the documentation for the study. You can also request a percent to be displayed.

Specify one or more filter variables
Filter variables are used to limit the listing to a subset of cases. Except for very small datasets, a filter variable will almost always be required.

Select display options
After specifying the names of variables, select the display options you wish. These affect how to display numeric variables and whether or not to display the text of each variable.

Start the listing
After specifying all variables and options, select Start Listing to begin the program.

Or you can select Clear fields to delete all previously specified variables and options, so that you can start over.


Variables to list
To specify that a certain survey question or variable is to be included in the listing, enter into one of the windows the name for that variable, as given in the documentation for the study.


Percentages

Aside from simply specifying the name of a variable, it is possible to convert a number into the percent of another variable. This is particularly useful when the cases in the data file are aggregate units such as cities.

To calculate and display a percent, use the following formats, beginning with $p, instead of a simple variable name:

$p(var1, var2)
This will display the value 100 * var1 / var2 (using 1 decimal place), where 'var1' and 'var2' are variables in the dataset. It is not necessary that either 'var1' or 'var2' be specified separately for listing.

$p(var1, var2, 2)
To display a percent using other than one decimal place, specify the desired number of decimal places after var2. The example above would use 2 decimal places.

$p(demo, totvote, "Percent democratic")
To give your own name to the percentage created, put the name you want within double quotes. This name will be displayed at the top of the column for that percentage.
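
The calculation behind '$p' can be sketched in Python. The function name percent is hypothetical, and the treatment of a missing value or a zero denominator (returning None) is an assumption, since the text does not describe it. The sketch covers the number of decimal places but not the optional column name.

```python
def percent(var1, var2, decimals=1):
    """Return 100 * var1 / var2 as text with the requested number of decimals."""
    if var1 is None or var2 is None or var2 == 0:
        return None  # assumption: no percent can be shown for such a case
    return f"{100 * var1 / var2:.{decimals}f}"


# Hypothetical city with 4,200 votes out of 10,000 total
print(percent(4200, 10000))  # 42.0
print(percent(1, 3, 2))      # 33.33
```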

Filter variables
After specifying the names of the variables to list, select the filter variable(s) in order to specify which cases to list. Since data files generally have a large number of cases, it is very important to limit the listing to a subset of the cases. The usual options for specifying filter variable(s) are available.

To avoid accidental attempts to list large numbers of cases, the program suppresses any listing that would exceed a certain number of cases. The default limit is 500 cases, but that limit can be modified when the datasets are set up in the Web archive.


Summaries of each variable listed
For each NUMERIC variable listed, you can obtain summaries of the values for the selected cases in the listing. These summaries exclude missing-data or out-of-range values.

The available summaries are:

For a percentage (created with the '$p' command), the summaries, if requested, will be calculated as follows:


How to display numeric variables


Display text for the listed variables

If this option is selected, the text corresponding to each variable listed is displayed at the bottom of the listing. The text for each variable referenced in a percent specification is also displayed.


Features Common to All Analysis Programs


Options for specifying variables


Multiple variable names

More than one name may be entered for variables to be analyzed, such as for the row and the column variables. The names should be separated by a comma or blanks. Separate analyses for each combination of variables will be generated.

For example, the following specifications would generate six separate tables:


Restricting the valid range

The name of each analysis variable can be followed, in parentheses, by a list of values to be included in the analysis.

Basic range restriction

A single value such as 'gender(2)' or a range of codes such as 'age(30-50)', will limit the analysis to cases having those codes.
Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)

Open-ended Ranges using '*' and '**'
In a range, one asterisk '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest NUMERIC value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. Note that '**' cannot be used alone (without '-') as a range specification. If you want to include all NUMERIC codes, you can use the range '(**-**)'.
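
The difference between '*' and '**' can be sketched for a single range. The names in_open_range and missing_codes are hypothetical. The sketch assumes that a missing-data code passes only when one end of the range is '**'; how a fully numeric range treats missing-data codes is not covered here.

```python
import math


def in_open_range(value, low, high, missing_codes=()):
    """Test one numeric value against a range whose ends may be '*' or '**'."""
    if value in missing_codes and "**" not in (low, high):
        return False  # assumption: only '**' lets missing-data codes through
    lower = -math.inf if low in ("*", "**") else low
    upper = math.inf if high in ("*", "**") else high
    return lower <= value <= upper


missing = {98, 99}  # hypothetical missing-data codes for 'age'

print(in_open_range(99, 75, "*", missing))   # False: age(75-*)
print(in_open_range(99, 50, "**", missing))  # True:  age(50-**)
print(in_open_range(20, "*", 25, missing))   # True:  age(*-25)
```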


Temporarily Recoding a Variable

A numeric variable can be recoded temporarily, for purposes of running the current analysis, by specifying groups of codes that are to be combined into a single category. This type of recoding can be very simple, but certain options can make it a little more complex.

Basic recoding
For example, to combine the categories of 'age' into three groups, you can specify the variable as:
age(r: 18-30; 31-50; 51-95)
Notice that the name of the variable ('age') is followed by parentheses, then the instruction 'r' (or 'R') followed by a colon (':'), and then the groupings of codes. Those groupings can consist of single code values, ranges, or a combination of many values and/or ranges. Each group is separated from the other by a semicolon (';'). Spaces are optional, but are added here for readability.

Using this basic method of recoding, the new groupings of codes are given the default code values 1, 2, 3, and so forth. The default label for each group is the range of original codes that constitute that group ("18-30", for example).

Any categories of 'age' not included in the specified groupings will become missing-data on the recoded version, and they will be excluded from the analysis in the table.

On the other hand, any original missing-data categories of 'age' that are explicitly mentioned in the recode will be included. For instance, if the value '90' for 'age' were flagged as a missing-data code, but included as in the example above, it would become part of the third recoded category. This is discussed in more detail in the section on "Treatment of missing data."
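
The basic recode age(r: 18-30; 31-50; 51-95) can be sketched as follows. The function name and the list of groups are illustrative; None stands in for the missing-data value given to codes that fall outside every group.

```python
def recode_basic(value, groups):
    """Return (new_code, default_label) for a value, or None if no group matches.

    groups is a list of (low, high) pairs in the order they were specified.
    New codes default to 1, 2, 3, and so forth.
    """
    for new_code, (low, high) in enumerate(groups, start=1):
        if low <= value <= high:
            return new_code, f"{low}-{high}"
    return None


AGE_GROUPS = [(18, 30), (31, 50), (51, 95)]

print(recode_basic(25, AGE_GROUPS))  # (1, '18-30')
print(recode_basic(90, AGE_GROUPS))  # (3, '51-95')
print(recode_basic(17, AGE_GROUPS))  # None
```

The value 90 lands in the third group because it is explicitly covered by the range 51-95, whether or not it was originally flagged as missing data.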

Assigning particular new code values
It is possible to assign new code values that are different from the default 1, 2, 3, and so forth. To do this, give the new code value, then an equal sign, then the grouping.
For example, the variable 'age' can be recoded into the same three groups as above, but with the new code values 1, 5, and 10, by specifying the recode as follows:
age(r: 1 = 18-30; 5 = 31-50; 10 = 51-95)

For column, row, or control variables it will not usually matter what the new code values are. For variables on which statistics are computed, however, the new code values will affect the value of those statistics.

Assigning labels to the new code values
To assign your own label to a new grouping of code values, place the label in double quotes after the group codes, but before the semicolon. There is no set limit on the length of these labels; however, very long labels may distort the formatting of the tables.

For example, you can assign labels to the recoded categories of race by using the following specification:
race(r: 800-869 "White"; 870-934 "Black"; 600-652, 979-982 "Asian")

These labels will appear in the table, in place of the range of original codes that constitute that group. Nevertheless, the recode specifications will still be documented. A summary is always given at the bottom of the table.

Open ranges (with '*' or '**')
If you are not sure of the ranges of the variable to be recoded, you can specify an open range with an asterisk ('*'). A single asterisk matches the lowest or highest VALID code in the data for that variable.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-*)
Using this method, all valid age values up to 30 would go into the first recoded group. And all valid age values of 51 or older would go into the third group.

If you want to use a range that includes NUMERIC codes that were defined as missing-data values, you can specify the range with two asterisks ('**') instead of one.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-**)
Using this method, all valid age values up to 30 would go into the first recoded group. But every numeric value of 51 or greater would go into the third group, including codes like 99 that may have been defined as missing-data codes.

For more discussion about including codes that have been defined as missing-data codes, see the section on "Treatment of missing data."

Overlapping ranges
If the same original code value is mentioned in two or more groupings, it is recoded the FIRST time that the value is encountered.
For example, the following two specifications have the same effect:
age(r: 18-30; 30-50; 50-90), and
age(r: 18-30; 31-50; 51-90)

In both cases, the original 'age' value of 30 ends up in the first group, and the original 'age' value of 50 ends up in the second group.

Notice that order is important with overlapping ranges. The following specification will NOT have the same effect as the preceding two:
age(r: 3= 50-90; 2= 30-50; 1= 18-30)
In this example, the 'age' value of 50 will end up in the recode group with the value '3' (instead of in the second group), and the 'age' value of 30 will end up in the recode group with the value '2' (instead of in the first group).
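
The first-match rule can be sketched in Python. The function name and the representation of each specification as a list of (new code, low, high) rules are assumptions made for this illustration.

```python
def recode_first_match(value, rules):
    """Return the new code of the FIRST rule whose range contains the value."""
    for new_code, low, high in rules:
        if low <= value <= high:
            return new_code
    return None


overlapping = [(1, 18, 30), (2, 30, 50), (3, 50, 90)]  # age(r: 18-30; 30-50; 50-90)
disjoint = [(1, 18, 30), (2, 31, 50), (3, 51, 90)]     # age(r: 18-30; 31-50; 51-90)
reordered = [(3, 50, 90), (2, 30, 50), (1, 18, 30)]    # age(r: 3= 50-90; 2= 30-50; 1= 18-30)

for age in (30, 50):
    print(age, [recode_first_match(age, r) for r in (overlapping, disjoint, reordered)])
# 30 [1, 1, 2]
# 50 [2, 2, 3]
```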

Multiple specifications for one recoded group
It may sometimes be useful to have more than one specification for a new recoded group. This can be done by specifying the desired outcome code more than once.
For example, to have race recoded into two categories, with the first category including everyone EXCEPT those originally coded as '2', you could use the following specification:
race(r: 1=1 "Non-black"; 2=2 "Black"; 1=3-20)

Treatment of missing data
NUMERIC codes that have been defined as missing data on the original variable can be included in one of the categories of the recoded variable in two ways.

The first method is to mention the code explicitly, either as a single value or as part of a range. For example, if the 'age' value of 99 has been defined as a missing-data code, it can still be included by either of the following specifications:
age(r: 18-30; 31-50; 51-90; 99), or
age(r: 18-30; 31-50; 51-100)

In the first case the code 99 will become its own fourth recode category. In the second case, it will be included as part of the third category.

A second method to include NUMERIC missing data codes is to use an open range with two asterisks ('**') instead of one. For example, the following specification will include all numeric codes above 50 as part of the third recoded group:
age(r: 18-30; 31-50; 51-**)

Note that at present there is no way to include in a recode the system-missing value or a character missing-data value (like 'D' or 'R').


Optional variables

Control variable (for table-generating programs)

A table will be produced separately for each category of this variable (e.g., if the control variable is gender, there will be one table for men alone and then one table for women alone). A table is also produced for the total of all valid categories of the control variable (e.g., men and women combined).

Only one control variable is applied to a table at a time. If more than one control variable is specified, a separate set of tables is generated for each control variable in turn.
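
The layout of the output can be sketched in Python as one table of cell counts per category of the control variable, plus a table for the total. The function name 'tables_by_control', the variable names and the four cases are made up for illustration, and the sketch assumes every case has a valid code on the control variable.

```python
from collections import Counter, defaultdict


def tables_by_control(cases, row, column, control):
    """Count (row, column) cells separately for each control category.

    Returns a dict mapping each control category, plus the key "Total",
    to a Counter of cell counts.
    """
    tables = defaultdict(Counter)
    for case in cases:
        cell = (case[row], case[column])
        tables[case[control]][cell] += 1
        tables["Total"][cell] += 1
    return dict(tables)


# Four made-up cases, for illustration only.
cases = [
    {"gender": "men", "vote": "yes", "region": "north"},
    {"gender": "men", "vote": "no", "region": "south"},
    {"gender": "women", "vote": "yes", "region": "north"},
    {"gender": "women", "vote": "yes", "region": "north"},
]

tables = tables_by_control(cases, row="vote", column="region", control="gender")
for category, table in tables.items():
    print(category, dict(table))
```

Each cell count in the "Total" table is the sum of the corresponding counts in the tables for the individual categories.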


Filter variable(s)

The analysis can be limited to a subset of the cases in the full data file by specifying one or more variables, separated by commas or blanks, as selection filters.

Note that it is also possible to limit the table to a subset of the cases by restricting the valid range of any of the other variables. Filter variables are used when the desired subset of cases is defined by a variable that is not one of the variables in the table.

Basic filter use
The name of each filter variable is followed, in parentheses, by a single code, as in 'gender(2)', or by a range of codes, as in 'age(30-50)', to limit the analysis to cases having those codes.
Multiple ranges and codes may be specified.
For example: 'age(1-17, 25, 95-100)'

Multiple filter variables
If you specify more than one filter variable, a case must satisfy ALL of the conditions in order to be included in the table.
For example: gender(1), age(30-50)

Open-ended ranges using '*' and '**'
A single asterisk, '*', can be used to specify that all cases with VALID codes for a variable will pass the filter. For example: age(*) includes all cases with valid data on the variable 'age'.

In a range, the '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This filter would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest numeric value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. However, any character missing-data values would still be excluded. Note that '**' cannot be used alone in a filter variable. It can only be used as part of a range.
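
The filter rules described in this section can be summarized in a Python sketch. It is not the SDA implementation. The names 'in_range' and 'passes_filter' are hypothetical, filters are assumed to be already parsed into (low, high) pairs, 'age(*)' is represented as the pair ('*', '*'), and the missing-data codes 98 and 99 are taken from the example above.

```python
import math


def in_range(value, low, high, missing_codes):
    """True if value lies in low-high; either bound may be '*' or '**'."""
    if not isinstance(value, (int, float)):
        return False  # character missing-data values never pass a filter
    if value in missing_codes and "*" in (low, high):
        return False  # a single asterisk reaches VALID codes only
    lo = -math.inf if low in ("*", "**") else low
    hi = math.inf if high in ("*", "**") else high
    return lo <= value <= hi


def passes_filter(case, filters, missing):
    """A case passes only if it satisfies the filter on EVERY variable.

    filters maps a variable name to a list of (low, high) ranges; a value
    satisfies that variable's filter if it falls in any one of the ranges.
    """
    for variable, ranges in filters.items():
        codes = missing.get(variable, set())
        if not any(in_range(case[variable], lo, hi, codes) for lo, hi in ranges):
            return False
    return True


missing = {"age": {98, 99}}                          # assumed missing-data codes
both = {"gender": [(1, 1)], "age": [(30, 50)]}       # gender(1), age(30-50)
tails = {"age": [("*", 25), (75, "*")]}              # age(*-25,75-*)
all_numeric = {"age": [(50, "**")]}                  # age(50-**)

print(passes_filter({"gender": 1, "age": 40}, both, missing))  # True
print(passes_filter({"gender": 2, "age": 40}, both, missing))  # False
print(passes_filter({"age": 99}, tails, missing))              # False
print(passes_filter({"age": 99}, all_numeric, missing))        # True
```

The sketch does not enforce the rule that '**' cannot be used alone. It only shows how the two kinds of asterisk differ inside a range.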


Weight variable

Depending on the design and implementation of the study, it may be appropriate to give some of the cases more weight than other cases in computing frequency distributions and statistics. To do this, specify the variable that contains the relative weight for each case as the weight variable. The documentation for the study should explain the reasons for using a weight variable, if there is one, and what its name is.

SDA studies can be set up with a weight variable specified ahead of time so that the weight variable is used automatically. Other studies may be set up with a drop-down list of choices to be presented to the user, who then selects one of the available weight variables (or no weight variable, if that option is included in the list). If no weight variables have been pre-specified, the user is free to enter the name of an appropriate variable to be used as a weight.
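
The effect of a weight variable on a frequency distribution can be sketched in Python. This is a minimal illustration, not the SDA computation: each case contributes its weight, rather than a count of 1, to its category. The function name 'weighted_percentages', the variable names 'vote' and 'wt', and the three cases are hypothetical.

```python
from collections import defaultdict


def weighted_percentages(cases, variable, weight=None):
    """Percentage distribution of variable, optionally weighted.

    When weight is None every case counts as 1; otherwise each case
    counts as the value of its weight variable.
    """
    totals = defaultdict(float)
    for case in cases:
        totals[case[variable]] += case[weight] if weight else 1.0
    grand_total = sum(totals.values())
    return {code: 100.0 * total / grand_total for code, total in totals.items()}


# Three made-up cases, for illustration only.
cases = [
    {"vote": "yes", "wt": 1.0},
    {"vote": "yes", "wt": 1.0},
    {"vote": "no", "wt": 2.0},
]

unweighted = weighted_percentages(cases, "vote")
weighted = weighted_percentages(cases, "vote", weight="wt")
print(unweighted)
print(weighted)
```

In this made-up example, two of the three cases answer 'yes', but the single 'no' case carries a weight of 2, so the weighted distribution is split evenly.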


Question text

If you select this option, all of the text available for each variable included in the analysis run will be appended to the bottom of the results.

The usual text available for a variable is the text of the question that produced the variable, provided that the text was included in the study documentation. Sometimes other explanatory text has been included.

If the variable was created by the 'recode' or the 'compute' program, the commands used to create the new variable are included in the text.


Actions to take

After you specify variables and select the options you want, go to the bottom of the form, and select one of two actions:
Run the Table
Select this when you have finished specifying the variables and options you want. The requested table(s) will then be generated by the server computer and returned to you.
Clear fields
Select this to delete all previously specified variables and options, so that you can start over.