11 Measurement in quantitative research
Chapter Outline
- Operational definitions (36 minute read)
- Writing effective questions and questionnaires (38 minute read)
- Measurement quality (21 minute read)
Content warning: examples in this chapter contain references to ethnocentrism, toxic masculinity, racism in science, drug use, mental health and depression, psychiatric inpatient care, poverty and basic needs insecurity, pregnancy, and racism and sexism in the workplace and higher education.
11.1 Operational definitions
Learning Objectives
Learners will be able to…
- Define and give an example of indicators and attributes for a variable
- Apply the three components of an operational definition to a variable
- Distinguish between levels of measurement for a variable and how those differences relate to measurement
- Describe the purpose of composite measures like scales and indices
Last chapter, we discussed conceptualizing your project. Conceptual definitions are like dictionary definitions. They tell you what a concept means by defining it using other concepts. In this section we will move from the abstract realm (conceptualization) to the real world (measurement).
Operationalization is the process by which researchers spell out precisely how a concept will be measured in their study. It involves identifying the specific research procedures we will use to gather data about our concepts. If conceptually defining your terms means looking at theory, how do you operationally define your terms? By looking for indicators of when your variable is present or not, more or less intense, and so forth. Operationalization is probably the most challenging part of quantitative research, but once it’s done, the design and implementation of your study will be straightforward.
Indicators
Operationalization works by identifying specific indicators that will be taken to represent the ideas we are interested in studying. If we are interested in studying masculinity, then the indicators for that concept might include some of the social roles prescribed to men in society such as breadwinning or fatherhood. Being a breadwinner or a father might therefore be considered indicators of a person’s masculinity. The extent to which a man fulfills either, or both, of these roles might be understood as clues (or indicators) about the extent to which he is viewed as masculine.
Let’s look at another example of indicators. Each day, Gallup researchers poll 1,000 randomly selected Americans to ask them about their well-being. To measure well-being, Gallup asks these people to respond to questions covering six broad areas: physical health, emotional health, work environment, life evaluation, healthy behaviors, and access to basic necessities. Gallup uses these six factors as indicators of the concept that they are really interested in, which is well-being.
Identifying indicators can be even simpler than the examples described thus far. Political party affiliation is another relatively easy concept for which to identify indicators. If you asked a person what party they voted for in the last national election (or gained access to their voting records), you would get a good indication of their party affiliation. Of course, some voters split tickets between multiple parties when they vote and others swing from party to party each election, so our indicator is not perfect. Indeed, if our study were about political identity as a key concept, operationalizing it solely in terms of who they voted for in the previous election leaves out a lot of information about identity that is relevant to that concept. Nevertheless, it’s a pretty good indicator of political party affiliation.
Choosing indicators is not an arbitrary process. As described earlier, utilizing prior theoretical and empirical work in your area of interest is a great way to identify indicators in a scholarly manner. And your conceptual definitions will point you in the direction of relevant indicators. Empirical work will give you some very specific examples of how the important concepts in an area have been measured in the past and what sorts of indicators have been used. Often, it makes sense to use the same indicators as previous researchers; however, you may find that some previous measures have potential weaknesses that your own study will improve upon.
All of the examples in this chapter have dealt with questions you might ask a research participant on a survey or in a quantitative interview. If you plan to collect data from other sources, such as through direct observation or the analysis of available records, think practically about what the design of your study might look like and how you can collect data on various indicators feasibly. If your study asks about whether the participant regularly changes the oil in their car, you will likely not observe them directly doing so. Instead, you will likely need to rely on a survey question that asks them the frequency with which they change their oil or ask to see their car maintenance records.
Exercises
- What indicators are commonly used to measure the variables in your research question?
- How can you feasibly collect data on these indicators?
- Are you planning to collect your own data using a questionnaire or interview? Or are you planning to analyze available data like client files or raw data shared from another researcher’s project?
Remember, you need raw data. Your research project cannot rely solely on the results reported by other researchers or the arguments you read in the literature. A literature review is only the first part of a research project, and your review of the literature should inform the indicators you end up choosing when you measure the variables in your research question.
Unlike conceptual definitions, which define a concept using other concepts, an operational definition consists of three components: (1) the variable being measured and its attributes, (2) the measure you will use, and (3) how you plan to interpret the data collected from that measure to draw conclusions about the variable you are measuring.
Step 1: Specifying variables and attributes
The first component, the variable, should be the easiest part. At this point in quantitative research, you should have a research question that has at least one independent and at least one dependent variable. Remember that variables must be able to vary. For example, the United States is not a variable. Country of residence is a variable, as is patriotism. Similarly, if your sample only includes men, gender is a constant in your study, not a variable. A constant is a characteristic that does not change in your study.
When social scientists measure concepts, they sometimes use the language of variables and attributes. A variable refers to a quality or quantity that varies across people or situations. Attributes are the characteristics that make up a variable. For example, the variable hair color would contain attributes like blonde, brown, black, red, gray, etc. A variable’s attributes determine its level of measurement. There are four possible levels of measurement: nominal, ordinal, interval, and ratio. The first two levels of measurement are categorical, meaning their attributes are categories rather than numbers. The latter two levels of measurement are continuous, meaning their attributes are numbers.
Levels of measurement
Hair color is an example of a nominal level of measurement. Nominal measures are categorical, and those categories cannot be mathematically ranked. As a brown-haired person (with some gray), I can’t say for sure that brown-haired people are better than blonde-haired people. As with all nominal levels of measurement, there is no ranking order between hair colors; they are simply different. That is what constitutes a nominal level of measurement. Gender and race are also measured at the nominal level.
What attributes are contained in the variable hair color? While blonde, brown, black, and red are common colors, some people may not fit into these categories if we only list these attributes. My wife, who currently has purple hair, wouldn’t fit anywhere. This means that our attributes were not exhaustive. Exhaustiveness means that all possible attributes are listed. We may have to list a lot of colors before we can meet the criteria of exhaustiveness. Clearly, there is a point at which exhaustiveness has been reasonably met. If a person insists that their hair color is light burnt sienna, it is not your responsibility to list that as an option. Rather, that person would reasonably be described as brown-haired. Perhaps adding an “other” category would suffice to make our list of colors exhaustive.
What about a person who has multiple hair colors at the same time, such as red and black? They would fall into multiple attributes. This violates the rule of mutual exclusivity, in which a person cannot fall into two different attributes. Instead of listing all of the possible combinations of colors, perhaps you might include a multi-color attribute to describe people with more than one hair color.
Making sure attributes are mutually exclusive and exhaustive is about making sure all people are represented in the data record. For many years, the attributes for gender were only male or female. Now, our understanding of gender has evolved to encompass more attributes that better reflect the diversity in the world. Children of parents from different races were often classified as one race or another, even if they identified with both cultures. The option for bi-racial or multi-racial on a survey not only more accurately reflects the racial diversity in the real world but also validates and acknowledges people who identify in that manner. If we did not measure race in this way, we would leave the data record empty for people who identify as biracial or multiracial, impairing our search for truth.
Unlike nominal-level measures, attributes at the ordinal level can be rank ordered. For example, someone’s degree of satisfaction in their romantic relationship can be ordered by rank. That is, you could say you are not at all satisfied, a little satisfied, moderately satisfied, or highly satisfied. Note that even though these have a rank order to them (not at all satisfied is certainly worse than highly satisfied), we cannot calculate a mathematical distance between those attributes. We can simply say that one attribute of an ordinal-level variable is more or less than another attribute.
This can get a little confusing when using rating scales. If you have ever taken a customer satisfaction survey or completed a course evaluation for school, you are familiar with rating scales. “On a scale of 1-5, with 1 being the lowest and 5 being the highest, how likely are you to recommend our company to other people?” That surely sounds familiar. Rating scales use numbers, but only as a shorthand, to indicate what attribute (highly likely, somewhat likely, etc.) the person feels describes them best. You wouldn’t say you are “2” likely to recommend the company, but you would say you are not very likely to recommend the company. Ordinal-level attributes must also be exhaustive and mutually exclusive, as with nominal-level variables.
At the interval level, attributes must also be exhaustive and mutually exclusive and there is equal distance between attributes. Interval measures are also continuous, meaning their attributes are numbers, rather than categories. IQ scores are interval level, as are temperatures in Fahrenheit and Celsius. Their defining characteristic is that we can say how much more or less one attribute differs from another. We cannot, however, say with certainty what the ratio of one attribute is in comparison to another. For example, it would not make sense to say that a person with an IQ score of 140 has twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.
While we cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, we can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level. Finally, at the ratio level, attributes are mutually exclusive and exhaustive, attributes can be rank ordered, the distance between attributes is equal, and attributes have a true zero point. Thus, with these variables, we can say what the ratio of one attribute is in comparison to another. Examples of ratio-level variables include age and years of education. We know that a person who is 12 years old is twice as old as someone who is 6 years old. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. The differences between each level of measurement are visualized in Table 11.1.
Table 11.1 Characteristics of the four levels of measurement

| | Nominal | Ordinal | Interval | Ratio |
| --- | --- | --- | --- | --- |
| Exhaustive | X | X | X | X |
| Mutually exclusive | X | X | X | X |
| Rank-ordered | | X | X | X |
| Equal distance between attributes | | | X | X |
| True zero point | | | | X |
Levels of measurement = levels of specificity
We have spent time learning how to determine our data’s level of measurement. Now what? How can we use this information to help us as we measure concepts and develop measurement tools? First, the types of statistical tests that we are able to use depend on our data’s level of measurement. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval- and ratio-level measurement are typically considered the most desirable because they permit any measure of central tendency (i.e., mean, median, or mode) to be computed. Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. The higher the level of measurement, the more complex the statistical tests we are able to conduct. This knowledge may help us decide what kind of data we need to gather, and how.
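To make the link between level of measurement and central tendency concrete, here is a minimal Python sketch using only the standard library; the variables and sample values are hypothetical.

```python
from statistics import mean, median, mode

# Nominal: categories cannot be ranked, so only the mode (most common value) applies.
hair_color = ["brown", "blonde", "brown", "black", "brown"]
print(mode(hair_color))  # brown

# Ordinal: attributes can be ranked, so the median (middle rank) also applies.
# Codes are shorthand for labels: 1 = not at all satisfied ... 4 = highly satisfied
satisfaction = [1, 2, 2, 3, 4]
print(median(satisfaction))  # 2

# Interval/ratio: attributes are true numbers, so the mean applies as well.
ages = [19, 22, 22, 25, 31]
print(mean(ages))  # 23.8
```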
That said, we have to balance this knowledge with the understanding that sometimes, collecting data at a higher level of measurement could negatively impact our studies. For instance, sometimes providing answers in ranges may make prospective participants feel more comfortable responding to sensitive items. Imagine that you were interested in collecting information on topics such as income, number of sexual partners, number of times someone used illicit drugs, etc. You would have to think about the sensitivity of these items and determine if it would make more sense to collect some data at a lower level of measurement (e.g., asking whether they are sexually active (nominal) versus their total number of sexual partners (ratio)).
Finally, sometimes when analyzing data, researchers find a need to change a variable’s level of measurement. For example, a few years ago, one of my students was interested in studying the relationship between mental health and life satisfaction. She used a variety of measures, including one item that asked for the actual number of mental health symptoms a participant reported. When analyzing the data, she examined the symptom variable and noticed that she had two groups: those with zero or one symptom and those with many symptoms. Instead of using the ratio-level data (the actual number of symptoms), she collapsed her cases into two categories, few and many, and used this variable in her analyses. It is important to note that you can move data from a higher level of measurement to a lower level; however, you cannot move a lower level to a higher level.
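A recode like my student’s is easy to express in code. The sketch below (plain Python; the cutoff of two symptoms is an assumption chosen for illustration) collapses a ratio-level count into two categories and shows why the move is one-way:

```python
# Hypothetical ratio-level data: number of mental health symptoms each participant reported
symptom_counts = [0, 1, 5, 7, 0, 9, 1, 6]

def collapse(count, cutoff=2):
    """Recode a ratio-level count down to a two-category (lower) level of measurement."""
    return "few" if count < cutoff else "many"

groups = [collapse(c) for c in symptom_counts]
print(groups)  # ['few', 'few', 'many', 'many', 'few', 'many', 'few', 'many']

# Note the information loss: from "many" alone you cannot recover whether the
# count was 5 or 9 -- which is why you can move data down a level but never back up.
```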
Exercises
- Check that the variables in your research question can vary…and that they are not constants or one of many potential attributes of a variable.
- Think about the attributes your variables have. Are they categorical or continuous? What level of measurement seems most appropriate?
Step 2: Specifying measures for each variable
Let’s pick a social work research question and walk through the process of operationalizing variables to see how specific we need to get. I’m going to hypothesize that residents of a psychiatric unit who are more depressed are less likely to be satisfied with care. Remember, this would be an inverse relationship—as depression increases, satisfaction decreases. In this question, depression is my independent variable (the cause) and satisfaction with care is my dependent variable (the effect). Now that we have identified our variables, their attributes, and levels of measurement, we can move on to the second component: the measure itself.
So, how would you measure my key variables: depression and satisfaction? What indicators would you look for? Some students might say that depression could be measured by observing a participant’s body language. They may also say that a depressed person will often express feelings of sadness or hopelessness. In addition, a satisfied person might be happy around service providers and often express gratitude. While these factors may indicate that the variables are present, they lack coherence. Unfortunately, what this “measure” is actually saying is “I know depression and satisfaction when I see them.” While you are likely a decent judge of depression and satisfaction, in a research study you need to provide more information about how you plan to measure your variables. Your judgments are subjective, based on your own idiosyncratic experiences with depression and satisfaction. They couldn’t be replicated by another researcher, and they can’t be applied consistently to a large group of people. Operationalization requires that you come up with a specific and rigorous measure for determining who is depressed or satisfied.
Finding a good measure for your variable depends on the kind of variable it is. Variables that are directly observable don’t come up very often in my students’ classroom projects, but they might include things like taking someone’s blood pressure, marking attendance or participation in a group, and so forth. To measure an indirectly observable variable like age, you would probably put a question on a survey that asked, “How old are you?” Measuring a variable like income might require some more thought, though. Are you interested in this person’s individual income or the income of their family unit? This might matter if your participant does not work or is dependent on other family members for income. Do you count income from social welfare programs? Are you interested in their income per month or per year? Even though indirect observables are relatively easy to measure, the measures you use must be clear in what they are asking, and operationalization is all about figuring out the specifics of what you want to know. For more complicated constructs, you will need composite measures that use multiple indicators to measure a single variable.
How you plan to collect your data also influences how you will measure your variables. For social work researchers using secondary data like client records as a data source, you are limited by what information is in the data sources you can access. If your organization uses a given measurement for a mental health outcome, that is the one you will use in your study. Similarly, if you plan to study how long a client was housed after an intervention using client visit records, you are limited by how their caseworker recorded their housing status in the chart. One of the benefits of collecting your own data is being able to select the measures you feel best exemplify your understanding of the topic.
Measuring unidimensional concepts
The previous section mentioned two important considerations: how complicated the variable is and how you plan to collect your data. With these in hand, we can use the level of measurement to further specify how you will measure your variables and consider specialized rating scales developed by social science researchers.
Measurement at each level
Nominal measures assess categorical variables. These measures are used for variables or indicators that have mutually exclusive attributes, but that cannot be rank-ordered. Nominal measures ask about the variable and provide names or labels for different attribute values like social work, counseling, and nursing for the variable profession. Nominal measures are relatively straightforward.
Ordinal measures often use a rating scale, which is an ordered set of responses that participants must choose from. Figure 11.1 shows several examples. The number of response options on a typical rating scale is usually five or seven, though it can range from three to eleven. Five-point scales are best for unipolar scales where only one construct is tested, such as frequency (Never, Rarely, Sometimes, Often, Always). Seven-point scales are best for bipolar scales where there is a dichotomous spectrum, such as liking (Like very much, Like somewhat, Like slightly, Neither like nor dislike, Dislike slightly, Dislike somewhat, Dislike very much). For bipolar questions, it is useful to offer an earlier question that branches respondents into an area of the scale; if asking about liking ice cream, first ask “Do you generally like or dislike ice cream?” Once the respondent chooses like or dislike, refine it by offering them relevant choices from the seven-point scale. Branching improves both reliability and validity (Krosnick & Berent, 1993).[1] Although you often see scales with numerical labels, it is best to present only verbal labels to the respondents and convert them to numerical values in the analyses. Avoid partial labels and overly lengthy or specific labels. In some cases, the verbal labels can be supplemented with (or even replaced by) meaningful graphics. The last rating scale shown in Figure 11.1 is a visual-analog scale, on which participants make a mark somewhere along the horizontal line to indicate the magnitude of their response.
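As noted above, respondents should see only verbal labels, which are converted to numbers at analysis time. A minimal sketch of that conversion, using the unipolar frequency labels from the paragraph above (the coding of 1–5 is an assumption for illustration):

```python
# Respondents see only the verbal labels; numeric codes are attached at analysis time.
FREQUENCY_CODES = {"Never": 1, "Rarely": 2, "Sometimes": 3, "Often": 4, "Always": 5}

responses = ["Sometimes", "Never", "Often", "Often", "Rarely"]
coded = [FREQUENCY_CODES[r] for r in responses]
print(coded)  # [3, 1, 4, 4, 2]
```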
Interval measures are those where the values measured are not only rank-ordered, but are also equidistant from adjacent attributes. For example, on the temperature scale (in Fahrenheit or Celsius), the difference between 30 and 40 degrees Fahrenheit is the same as that between 80 and 90 degrees Fahrenheit. Likewise, if you have a scale that asks about respondents’ annual income using the following attributes (ranges): $0 to $9,999, $10,000 to $19,999, $20,000 to $29,999, and so forth, this is also an interval measure, because the mid-points of each range (i.e., $5,000, $15,000, $25,000, etc.) are equidistant from each other. The intelligence quotient (IQ) scale is also an interval measure, because the measure is designed such that the difference between IQ scores 100 and 110 is supposed to be the same as between 110 and 120 (although we do not really know whether that is truly the case). Interval measures allow us to examine “how much more” one attribute is when compared to another, which is not possible with nominal or ordinal measures. You may find researchers who “pretend” (incorrectly) that ordinal rating scales are actually interval measures so that they can use different statistical techniques for analyzing them. As we will discuss in the latter part of the chapter, this is a mistake because there is no way to know whether the difference between a 3 and a 4 on a rating scale is the same as the difference between a 2 and a 3. Those numbers are just placeholders for categories.
Ratio measures are those that have all the qualities of nominal, ordinal, and interval scales, and in addition, also have a “true zero” point (where the value zero implies lack or non-availability of the underlying construct). Think about how to measure the number of people working in human resources at a social work agency. It could be one, several, or none (if the agency contracts out for those services). Measuring interval and ratio data is relatively easy, as people either select or input a number for their answer. If you ask a person how many eggs they purchased last week, they can simply tell you they purchased a dozen, just a couple, or none at all.
Commonly used rating scales in questionnaires
The level of measurement will give you the basic information you need, but social scientists have also developed specialized rating scales for use in questionnaires, a common tool in quantitative research. As we mentioned before, if you plan to source your data from client files or previously published results, you are limited to the measures that already exist in those sources.
Although Likert scale is a term colloquially used to refer to almost any rating scale (e.g., a 0-to-10 life satisfaction scale), it has a much more precise meaning. In the 1930s, researcher Rensis Likert (pronounced LICK-ert) created a new approach for measuring people’s attitudes (Likert, 1932).[2] It involves presenting people with several statements—including both favorable and unfavorable statements—about some person, group, or idea. Respondents then express their agreement or disagreement with each statement on a 5-point scale: Strongly Agree, Agree, Neither Agree nor Disagree, Disagree, Strongly Disagree. Numbers are assigned to each response and then summed across all items to produce a score representing the attitude toward the person, group, or idea. For items that are phrased in an opposite direction (e.g., negatively worded statements instead of positively worded statements), reverse coding is used so that the numerical scoring of statements also runs in the opposite direction. The entire set of items came to be called a Likert scale, as indicated in Table 11.2 below.
Unless you are measuring people’s attitude toward something by assessing their level of agreement with several statements about it, it is best to avoid calling it a Likert scale. You are probably just using a rating scale. Likert scales allow for more granularity (more finely tuned response) than yes/no items, including whether respondents are neutral to the statement. Below is an example of how we might use a Likert scale to assess your attitudes about research as you work your way through this textbook.
Table 11.2 Example Likert scale measuring attitudes toward research

| | Strongly agree | Agree | Neutral | Disagree | Strongly disagree |
| --- | --- | --- | --- | --- | --- |
| I like research more now than when I started reading this book. | | | | | |
| This textbook is easy to use. | | | | | |
| I feel confident about how well I understand levels of measurement. | | | | | |
| This textbook is helping me plan my research proposal. | | | | | |
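Scoring a set of Likert items like those in Table 11.2, including reverse coding, might look like the following sketch. One of the two items and the decision to reverse-code it are hypothetical; a real scale’s documentation specifies which items run in the opposite direction.

```python
# 5-point agreement codes; respondents see only the verbal labels.
AGREEMENT = {"Strongly agree": 5, "Agree": 4, "Neutral": 3,
             "Disagree": 2, "Strongly disagree": 1}

def score_likert(answers, reverse_items, points=5):
    """Sum item codes across all items, reverse-coding negatively worded items."""
    total = 0
    for item, answer in answers.items():
        value = AGREEMENT[answer]
        if item in reverse_items:
            value = (points + 1) - value  # 5 -> 1, 4 -> 2, 3 -> 3, etc.
        total += value
    return total

answers = {
    "I like research more now than when I started reading this book.": "Agree",
    "Research is boring.": "Strongly disagree",  # hypothetical negatively worded item
}
print(score_likert(answers, reverse_items={"Research is boring."}))  # 4 + 5 = 9
```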
Semantic differential scales are composite (multi-item) scales in which respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites. Whereas in the above Likert scale, the participant is asked how much they agree or disagree with a statement, in a semantic differential scale the participant is asked to indicate how they feel about a specific item. This makes the semantic differential scale an excellent technique for measuring people’s attitudes or feelings toward objects, events, or behaviors. Table 11.3 is an example of a semantic differential scale that was created to assess participants’ feelings about this textbook.
Table 11.3 Example semantic differential scale

1) How would you rate your opinions toward this textbook?

| | Very much | Somewhat | Neither | Somewhat | Very much | |
| --- | --- | --- | --- | --- | --- | --- |
| Boring | | | | | | Exciting |
| Useless | | | | | | Useful |
| Hard | | | | | | Easy |
| Irrelevant | | | | | | Applicable |
The Guttman scale, a composite scale designed by Louis Guttman, uses a series of items arranged in increasing order of intensity (least intense to most intense) of the concept. This type of scale allows us to understand the intensity of beliefs or feelings. Each item in the Guttman scale below has a weight (not indicated on the tool) which varies with the intensity of that item, and the weighted combination of each response is used as an aggregate measure of an observation.
Example Guttman Scale Items
- I often felt the material was not engaging Yes/No
- I was often thinking about other things in class Yes/No
- I was often working on other tasks during class Yes/No
- I will work to abolish research from the curriculum Yes/No
Notice how the items move from lower intensity to higher intensity. A researcher reviews the yes answers and creates a score for each participant.
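Because Guttman items form a cumulative ladder, one simple scoring rule is to take the rank of the most intense item a participant endorsed. The sketch below uses that simplified rule as an assumption; actual Guttman scales apply the item weights from their documentation.

```python
# Items ordered from least to most intense, matching the example above.
ITEMS = ["not engaging", "thinking about other things",
         "working on other tasks", "abolish research"]

def guttman_score(yes_items):
    """Score = rank of the most intense item endorsed (0 if none endorsed)."""
    score = 0
    for rank, item in enumerate(ITEMS, start=1):
        if item in yes_items:
            score = rank
    return score

print(guttman_score({"not engaging", "thinking about other things"}))  # 2
```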
Composite measures: Scales and indices
Depending on your research design, your measure may be something you put on a survey or pre/post-test that you give to your participants. For a variable like age or income, one well-worded question may suffice. Unfortunately, most variables in the social world are not so simple. Depression and satisfaction are multidimensional concepts. Relying on a single indicator like a question that asks “Yes or no, are you depressed?” does not encompass the complexity of depression, including issues with mood, sleeping, eating, relationships, and happiness. There is no easy way to delineate between multidimensional and unidimensional concepts, as it’s all in how you think about your variable. Satisfaction could be validly measured using a unidimensional ordinal rating scale. However, if satisfaction were a key variable in our study, we would need a theoretical framework and conceptual definition for it. That means we’d probably have more indicators to ask about, like timeliness, respect, sensitivity, and many others, and we would want our study to say something about what satisfaction truly means in terms of our other key variables. However, if satisfaction is not a key variable in your conceptual framework, it makes sense to operationalize it as a unidimensional concept.
For more complicated measures, researchers use scales and indices (sometimes called indexes) to measure their variables because they assess multiple indicators to develop a composite (or total) score. Composite scores provide a much greater understanding of concepts than a single item could. Although we won’t delve too deeply into the process of scale development, we will cover some important topics for you to understand how scales and indices developed by other researchers can be used in your project.
Although scales and indices exhibit differences (which will be discussed later), the two have several things in common.
- Both are ordinal measures of variables.
- Both can order the units of analysis in terms of specific variables.
- Both are composite measures.
Scales
The previous section discussed how to measure respondents’ responses to predesigned items or indicators belonging to an underlying construct. But how do we create the indicators themselves? The process of creating the indicators is called scaling. More formally, scaling is a branch of measurement that involves the construction of measures by associating qualitative judgments about unobservable constructs with quantitative, measurable metric units. Stevens (1946)[3] said, “Scaling is the assignment of objects to numbers according to a rule.” This process of measuring abstract concepts in concrete terms remains one of the most difficult tasks in empirical social science research.
The outcome of a scaling process is a scale, which is an empirical structure for measuring items or indicators of a given construct. Understand that multidimensional “scales”, as discussed in this section, are a little different from “rating scales” discussed in the previous section. A rating scale is used to capture the respondents’ reactions to a given item on a questionnaire. For example, an ordinally scaled item captures a value between “strongly disagree” to “strongly agree.” Attaching a rating scale to a statement or instrument is not scaling. Rather, scaling is the formal process of developing scale items, before rating scales can be attached to those items.
If creating your own scale sounds painful, don’t worry! For most multidimensional variables, you would likely be duplicating work that has already been done by other researchers. Specifically, this is a branch of science called psychometrics. You do not need to create a scale for depression because scales such as the Patient Health Questionnaire (PHQ-9), the Center for Epidemiologic Studies Depression Scale (CES-D), and Beck’s Depression Inventory (BDI) have been developed and refined over decades to measure variables like depression. Similarly, scales such as the Patient Satisfaction Questionnaire (PSQ-18) have been developed to measure satisfaction with medical care. As we will discuss in the next section, these scales have been shown to be reliable and valid. While you could create a new scale to measure depression or satisfaction, a rigorous study would pilot test and refine that new scale over time to make sure it measures the concept accurately and consistently. This high level of rigor is often unachievable in student research projects because of the cost and time involved in pilot testing and validation, so using existing scales is recommended.
Unfortunately, there is no good one-stop shop for psychometric scales. The Mental Measurements Yearbook provides a searchable database of measures for social science variables, though it is woefully incomplete and often does not contain the full documentation for scales in its database. You can access it from a university library’s list of databases. If you can’t find anything in there, your next stop should be the methods section of the articles in your literature review. The methods section of each article will detail how the researchers measured their variables, and often the results section is instructive for understanding more about measures. In a quantitative study, researchers may have used a scale to measure key variables and will provide a brief description of that scale, its name, and maybe a few example questions. If you need more information, look at the results section and tables discussing the scale to get a better idea of how the measure works. Looking beyond the articles in your literature review, searching Google Scholar using queries like “depression scale” or “satisfaction scale” should also provide some relevant results. For example, when searching for documentation for the Rosenberg Self-Esteem Scale (which we will discuss in the next section), I found a report from researchers investigating acceptance and commitment therapy that details this scale and many others used to assess mental health outcomes. If you find the name of the scale somewhere but cannot find the documentation (all questions and answers plus how to interpret the scale), a general web search with the name of the scale and “.pdf” may bring you to what you need. Or, to get professional help with finding information, always ask a librarian!
Unfortunately, these approaches do not guarantee that you will be able to view the scale itself or get information on how it is interpreted. Many scales cost money to use and may require training to properly administer. You may also find scales that are related to your variable but would need to be slightly modified to match your study’s needs. You could adapt a scale to fit your study; however, changing even small parts of a scale can influence its accuracy and consistency. While it is perfectly acceptable in student projects to adapt a scale without testing it first (time may not allow you to do so), pilot testing is always recommended for adapted scales, and researchers seeking to draw valid conclusions and publish their results must take this additional step.
Indices
An index is a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas. It is different from a scale. Scales also aggregate measures; however, these measures examine different dimensions or the same dimension of a single construct. A well-known example of an index is the consumer price index (CPI), which is computed every month by the Bureau of Labor Statistics of the U.S. Department of Labor. The CPI is a measure of how much consumers have to pay for goods and services (in general) and is divided into eight major categories (food and beverages, housing, apparel, transportation, healthcare, recreation, education and communication, and “other goods and services”), which are further subdivided into more than 200 smaller items. Each month, government employees call all over the country to get the current prices of more than 80,000 items. Using a complicated weighting scheme that takes into account the location and probability of purchase for each item, analysts then combine these prices into an overall index score using a series of formulas and rules.
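At its core, an index like the CPI is a weighted combination of component measures. The toy sketch below shows only that arithmetic, with made-up components and weights; the real CPI’s categories, items, and weighting formulas are far more elaborate.

```python
# Hypothetical component measures (percent price change) and their weights (sum to 1.0)
changes = {"food": 2.0, "housing": 4.0, "transportation": 1.0}
weights = {"food": 0.3, "housing": 0.5, "transportation": 0.2}

# The index score is the weighted sum of its components.
index_change = sum(changes[c] * weights[c] for c in changes)
print(round(index_change, 1))  # 0.6 + 2.0 + 0.2 = 2.8
```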
Another example of an index is the Duncan Socioeconomic Index (SEI). This index is used to quantify a person’s socioeconomic status (SES) and is a combination of three concepts: income, education, and occupation. Income is measured in dollars, education in years or degrees achieved, and occupation is classified into categories or levels by status. These very different measures are combined to create an overall SES index score. However, SES index measurement has generated a lot of controversy and disagreement among researchers.
The process of creating an index is similar to that of a scale. First, conceptualize (define) the index and its constituent components. Though this appears simple, there may be a lot of disagreement on what components (concepts/constructs) should be included or excluded from an index. For instance, in the SES index, isn’t income correlated with education and occupation? And if so, should we include one component only or all three components? Reviewing the literature, using theories, and/or interviewing experts or key stakeholders may help resolve this issue. Second, operationalize and measure each component. For instance, how will you categorize occupations, particularly since some occupations may have changed with time (e.g., there were no Web developers before the Internet)? As we will see in step three below, researchers must create a rule or formula for calculating the index score. Again, this process may involve a lot of subjectivity, so validating the index score using existing or new data is important.
Scale and index development are often taught in their own course in doctoral education, so it is unreasonable to expect yourself to develop a consistently accurate measure within the span of a week or two. Using available indices and scales is recommended for this reason.
Differences between scales and indices
Though indices and scales yield a single numerical score or value representing a concept of interest, they are different in many ways. First, indices often comprise components that are very different from each other (e.g., income, education, and occupation in the SES index) and are measured in different ways. Conversely, scales typically involve a set of similar items that use the same rating scale (such as a five-point Likert scale about customer satisfaction).
Second, indices often combine objectively measurable values such as prices or income, while scales are designed to assess subjective or judgmental constructs such as attitude, prejudice, or self-esteem. Some argue that the sophistication of the scaling methodology makes scales different from indexes, while others suggest that indexing methodology can be equally sophisticated. Nevertheless, indexes and scales are both essential tools in social science research.
Scales and indices seem like clean, convenient ways to measure different phenomena in social science, but just like with a lot of research, we have to be mindful of the assumptions and biases underneath. What if a scale or an index was developed using only White women as research participants? Is it going to be useful for other groups? It very well might be, but when using a scale or index on a group for whom it hasn’t been tested, it will be very important to evaluate the validity and reliability of the instrument, which we address in the rest of the chapter.
Finally, it’s important to note that while scales and indices are often made up of nominal- or ordinal-level items, when we combine those items into composite scores, we typically treat the scores as interval/ratio variables.
Exercises
- Look back to your work from the previous section, are your variables unidimensional or multidimensional?
- Describe the specific measures you will use (actual questions and response options you will use with participants) for each variable in your research question.
- If you are using a measure developed by another researcher but do not have all of the questions, response options, and instructions needed to implement it, put it on your to-do list to get them.
Step 3: How you will interpret your measures
The final stage of operationalization involves setting the rules for how the measure works and how the researcher should interpret the results. Sometimes, interpreting a measure can be incredibly easy. If you ask someone their age, you’ll probably interpret the results by noting the raw number (e.g., 22) someone provides and whether it is lower or higher than other people’s ages. However, you could also recode that person into age categories (e.g., under 25, 25–34, and so on) or a generational cohort (e.g., Generation Z). Even scales may be simple to interpret. If there is a scale of problem behaviors, one might simply add up the number of behaviors checked off, with a range from 1–5 indicating low risk of delinquent behavior, 6–10 indicating moderate risk, and so on. How you choose to interpret your measures should be guided by how they were designed, how you conceptualize your variables, the data sources you used, and your plan for analyzing your data statistically. Whatever measure you use, you need a set of rules for how to take any valid answer a respondent provides and interpret it in terms of the variable being measured.
For more complicated measures like scales, refer to the information provided by the author for how to interpret the scale. If you can’t find enough information from the scale’s creator, look at how the results of that scale are reported in the results section of research articles. For example, Beck’s Depression Inventory (BDI-II) uses 21 statements to measure depression, and respondents rate each item on a scale of 0–3. The results for each question are added up, and the respondent is put into one of three categories: low levels of depression (1–16), moderate levels of depression (17–30), or severe levels of depression (31 and over).
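Interpretation rules like these are straightforward to encode. Here is a sketch using the cutoff bands quoted above (always confirm cutoffs against the scale’s official documentation before relying on them):

```python
def interpret_depression_score(total):
    """Map a summed inventory score to the category bands quoted above."""
    if total <= 16:
        return "low levels of depression"
    elif total <= 30:
        return "moderate levels of depression"
    return "severe levels of depression"

item_ratings = [1, 2, 0, 3, 2]  # only 5 of the 21 item ratings shown, for brevity
print(interpret_depression_score(sum(item_ratings)))  # 8 -> "low levels of depression"
```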
One common mistake I see is that students will introduce another variable into their operational definition. This is incorrect. Your operational definition should mention only one variable—the variable being defined. While your study will certainly draw conclusions about the relationships between variables, that’s not what operationalization is. Operationalization specifies what instrument you will use to measure your variable and how you plan to interpret the data collected using that measure.
Operationalization is probably the trickiest component of basic research methods, so please don’t get frustrated if it takes a few drafts and a lot of feedback to get to a workable definition. At the time of this writing, I am in the process of operationalizing the concept of “attitudes towards research methods.” Originally, I thought that I could gauge students’ attitudes toward research methods by looking at their end-of-semester course evaluations. As I became aware of the potential methodological issues with student course evaluations, I opted to use focus groups of students to measure their common beliefs about research. You may recall some of these opinions from Chapter 1, such as the common beliefs that research is boring, useless, and too difficult. After the focus group, I created a scale based on the opinions I gathered, and I plan to pilot test it with another group of students. After the pilot test, I expect that I will have to revise the scale again before I can implement the measure in a real social work research project. At the time I’m writing this, I’m still not completely done operationalizing this concept.
Key Takeaways
- Operationalization involves spelling out precisely how a concept will be measured.
- Operational definitions must include the variable, the measure, and how you plan to interpret the measure.
- There are four different levels of measurement: nominal, ordinal, interval, and ratio (in increasing order of specificity).
- Scales and indices are common ways to collect information and involve using multiple indicators in measurement.
- A key difference between a scale and an index is that a scale contains multiple indicators for one concept, whereas an index examines multiple concepts (components).
- Using scales developed and refined by other researchers can improve the rigor of a quantitative study.
Exercises
Use the research question that you developed in the previous chapters and find a related scale or index that researchers have used. If you have trouble finding the exact phenomenon you want to study, get as close as you can.
- What is the level of measurement for each item on each tool? Take a second and think about why the tool’s creator decided to include these levels of measurement. Identify any levels of measurement you would change and why.
- If these tools don’t exist for what you are interested in studying, why do you think that is?
12.3 Writing effective questions and questionnaires
Learning Objectives
Learners will be able to…
- Describe some of the ways that survey questions might confuse respondents and how to word questions and responses clearly
- Create mutually exclusive, exhaustive, and balanced response options
- Define fence-sitting and floating
- Describe the considerations involved in constructing a well-designed questionnaire
- Discuss why pilot testing is important
In the previous section, we reviewed how researchers collect data using surveys. Guided by their sampling approach and research context, researchers should choose the survey approach that provides the most favorable tradeoffs in strengths and challenges. With this information in hand, researchers need to write their questionnaire and revise it before beginning data collection. Each method of delivery requires a questionnaire, but they vary a bit based on how they will be used by the researcher. Since phone surveys are read aloud, researchers will pay more attention to how the questionnaire sounds than how it looks. Online surveys can use advanced tools to require the completion of certain questions, present interactive questions and answers, and otherwise afford greater flexibility in how questionnaires are designed. As you read this section, consider how your method of delivery impacts the type of questionnaire you will design. Because most student projects use paper or online surveys, this section will detail how to construct self-administered questionnaires to minimize the potential for bias and error.
Start with operationalization
The first thing you need to do to write effective survey questions is identify what exactly you wish to know. As silly as it sounds to state what seems so completely obvious, we can’t stress enough how easy it is to forget to include important questions when designing a survey. Begin by looking at your research question and refreshing your memory of the operational definitions you developed for those variables from Chapter 11. You should have a pretty firm grasp of your operational definitions before starting the process of questionnaire design. You may have taken those operational definitions from other researchers’ methods, found established scales and indices for your measures, or created your own questions and answer options.
Exercises
STOP! Make sure you have a complete operational definition for the dependent and independent variables in your research question. A complete operational definition contains the variable being measured, the measure used, and how the researcher interprets the measure. Let’s make sure you have what you need from Chapter 11 to begin writing your questionnaire.
List all of the dependent and independent variables in your research question.
- It’s normal to have one dependent or independent variable. It’s also normal to have more than one of either.
- Make sure that your research question (and this list) contain all of the variables in your hypothesis. Your hypothesis should only include variables from your research question.
For each variable in your list:
- Write out the measure you will use (the specific questions and answers) for each variable.
- If you don’t have questions and answers finalized yet, write a first draft and revise it based on what you read in this section.
- If you are using a measure from another researcher, you should be able to write out all of the questions and answers associated with that measure. If you only have the name of a scale or a few questions, you need access to the full text and some documentation on how to administer and interpret it before you can finish your questionnaire.
- Describe how you will use each measure to draw conclusions about the variable in the operational definition.
- For example, an interpretation might be “there are five 7-point Likert scale questions…point values are added across all five items for each participant…and scores below 10 indicate the participant has low self-esteem”
- Don’t introduce other variables into the mix here. All we are concerned with is how you will measure each variable by itself. The connection between variables is done using statistical tests, not operational definitions.
- Detail any validity or reliability issues uncovered by previous researchers using the same measures. If you have concerns about validity and reliability, note them, as well.
If you completed the exercise above and listed out all of the questions and answer choices you will use to measure the variables in your research question, you have already produced a pretty solid first draft of your questionnaire! Congrats! In essence, questionnaires are all of the self-report measures in your operational definitions for the independent, dependent, and control variables in your study arranged into one document and administered to participants. There are a few questions on a questionnaire (like name or ID#) that are not associated with the measurement of variables. These are the exception, and it’s useful to think of a questionnaire as a list of measures for variables. Of course, researchers often use more than one measure of a variable (i.e., triangulation) so they can more confidently assert that their findings are true. A questionnaire should contain all of the measures researchers plan to collect about their variables by asking participants to self-report. As we will discuss in the final section of this chapter, triangulating across data sources (e.g., measuring variables using client files or student records) can avoid some of the common sources of bias in survey research.
Sticking close to your operational definitions is important because it helps you avoid an everything-but-the-kitchen-sink approach that includes every possible question that occurs to you. Doing so puts an unnecessary burden on your survey respondents. Remember that you have asked your participants to give you their time and attention and to take care in responding to your questions; show them your respect by only asking questions that you actually plan to use in your analysis. For each question in your questionnaire, ask yourself how this question measures a variable in your study. An operational definition should contain the questions, response options, and how the researcher will draw conclusions about the variable based on participants’ responses.
Writing questions
So, almost all of the questions on a questionnaire are measuring some variable. For many variables, researchers will create their own questions rather than using one from another researcher. This section will provide some tips on how to create good questions to accurately measure variables in your study. First, questions should be as clear and to the point as possible. This is not the time to show off your creative writing skills; a survey is a technical instrument and should be written in a way that is as direct and concise as possible. As I’ve mentioned earlier, your survey respondents have agreed to give their time and attention to your survey. The best way to show your appreciation for their time is to not waste it. Ensuring that your questions are clear and concise will go a long way toward showing your respondents the gratitude they deserve. Pilot testing the questionnaire with friends or colleagues can help identify these issues. This process is commonly called pretesting, but to avoid any confusion with pretesting in experimental design, we refer to it as pilot testing.
Related to the point about not wasting respondents’ time, make sure that every question you pose will be relevant to every person you ask to complete it. This means two things: first, that respondents have knowledge about whatever topic you are asking them about, and second, that respondents have experienced the events, behaviors, or feelings you are asking them to report. If you are asking participants for second-hand knowledge—asking clinicians about clients’ feelings, asking teachers about students’ feelings, and so forth—you may want to clarify that the variable you are asking about is the key informant’s perception of what is happening in the target population. A well-planned sampling approach ensures that participants are the most knowledgeable population to complete your survey.
If you decide that you do wish to include questions about matters with which only a portion of respondents will have had experience, make sure you know why you are doing so. For example, if you are asking about MSW student study patterns, and you decide to include a question on studying for the social work licensing exam, you may only have a small subset of participants who have begun studying for the graduate exam or took the bachelor’s-level exam. If you decide to include this question that speaks to a minority of participants’ experiences, think about why you are including it. Are you interested in how studying for class and studying for licensure differ? Are you trying to triangulate study skills measures? Researchers should carefully consider whether questions relevant to only a subset of participants are likely to produce enough valid responses for quantitative analysis.
Many times, questions that are relevant to a subsample of participants are conditional on an answer to a previous question. A participant might select that they rent their home, and as a result, you might ask whether they carry renter’s insurance. That question is not relevant to homeowners, so it would be wise not to ask them to respond to it. In that case, the question of whether someone rents or owns their home is a filter question, designed to identify some subset of survey respondents who are asked additional questions that are not relevant to the entire sample. Figure 12.1 presents an example of how to accomplish this on a paper survey by adding instructions to the participant that indicate what question to proceed to next based on their response to the first one. Using online survey tools, researchers can use filter questions to only present relevant questions to participants.
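In an online survey tool, a filter question is simply conditional branching. A schematic sketch of the renter’s insurance example from the paragraph above (the function is a hypothetical stand-in for your survey software’s skip logic):

```python
def next_questions(answers):
    """Return only the follow-up questions relevant to this respondent."""
    if answers.get("Do you rent or own your home?") == "rent":
        return ["Do you carry renter's insurance?"]
    return []  # homeowners skip the follow-up entirely

print(next_questions({"Do you rent or own your home?": "rent"}))
```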
Researchers should eliminate questions that ask about things participants don’t know to minimize confusion. Assuming the question is relevant to the participant, other sources of confusion come from how the question is worded. The use of negative wording can be a source of potential confusion. Taking the question from Figure 12.1 about drinking as our example, what if we had instead asked, “Did you not abstain from drinking during your first semester of college?” This is a double negative, and it’s not clear how to answer the question accurately. It is a good idea to avoid negative phrasing when possible. For example, “Did you not drink alcohol during your first semester of college?” is less clear than “Did you drink alcohol during your first semester of college?”
You should also avoid using terms or phrases that may be regionally or culturally specific (unless you are absolutely certain all your respondents come from the region or culture whose terms you are using). When I first moved to southwest Virginia, I didn’t know what a holler was. Where I grew up in New Jersey, to holler means to yell. Even then, in New Jersey, we shouted and screamed, but we didn’t holler much. In southwest Virginia, my home at the time, a holler also means a small valley in between the mountains. If I used holler in that way on my survey, people who live near me may understand, but almost everyone else would be totally confused. A similar issue arises when you use jargon, or technical language, that people do not commonly know. For example, if you asked adolescents how they experience imaginary audience, they would find it difficult to link those words to the concepts from David Elkind’s theory. The words you use in your questions must be understandable to your participants. If you find yourself using jargon or slang, break it down into terms that are more universal and easier to understand.
Asking multiple questions as though they are a single question can also confuse survey respondents. There’s a specific term for this sort of question; it is called a double-barreled question. Figure 12.2 shows a double-barreled question. Do you see what makes the question double-barreled? How would someone respond if they felt their college classes were more demanding but also more boring than their high school classes? Or less demanding but more interesting? Because the question combines “demanding” and “interesting,” there is no way to respond yes to one criterion but no to the other.
Another thing to avoid when constructing survey questions is the problem of social desirability. We all want to look good, right? And we all probably know the politically correct response to a variety of questions whether we agree with the politically correct response or not. In survey research, social desirability refers to the idea that respondents will try to answer questions in a way that will present them in a favorable light. (You may recall we covered social desirability bias in Chapter 11.)
Perhaps we decide that to understand the transition to college, our research project needs to know whether respondents ever cheated on an exam in high school or college. We all know that cheating on exams is generally frowned upon (at least I hope we all know this). So, it may be difficult to get people to admit to cheating on a survey. But if you can guarantee respondents’ confidentiality, or even better, their anonymity, chances are much better that they will be honest about having engaged in this socially undesirable behavior. Another way to avoid problems of social desirability is to phrase difficult questions in the most benign way possible. Earl Babbie (2010) [4] offers a useful suggestion for doing this—simply imagine how you would feel responding to your survey questions. If you would be uncomfortable, chances are others would as well.
Exercises
Try to step outside your role as researcher for a second, and imagine you were one of your participants. Evaluate the following:
- Is the question too general? Sometimes, questions that are too general may not accurately convey respondents’ perceptions. If you asked someone how they liked a certain book and provided a response scale ranging from “not at all” to “extremely well,” what does it mean if that person selected “extremely well”? Instead, ask more specific behavioral questions, such as “Will you recommend this book to others?” or “Do you plan to read other books by the same author?”
- Is the question too detailed? Avoid unnecessarily detailed questions that serve no specific research purpose. For instance, do you need the age of each child in a household, or is the number of children in the household enough? However, if unsure, it is better to err on the side of detail rather than generality.
- Is the question presumptuous? Does your question make assumptions? For instance, if you ask, “what do you think the benefits of a tax cut would be?” you are presuming that the participant sees the tax cut as beneficial. But many people may not view tax cuts as beneficial. Some might see tax cuts as a precursor to less funding for public schools and fewer public services such as police, ambulance, and fire department. Avoid questions with built-in presumptions.
- Does the question ask the participant to imagine something? A popular question on many television game shows is “if you won a million dollars on this show, how would you plan to spend it?” Most participants have never been faced with this large an amount of money and have never thought about this scenario. In fact, most don’t even know that after taxes, the value of the million dollars will be greatly reduced. In addition, some game shows spread the amount over a 20-year period. Without understanding this “imaginary” situation, participants may not have the background information necessary to provide a meaningful response.
Finally, it is important to get feedback on your survey questions from as many people as possible, especially people who are like those in your sample. Now is not the time to be shy. Ask your friends for help, ask your mentors for feedback, ask your family to take a look at your survey as well. The more feedback you can get on your survey questions, the better the chances that you will come up with a set of questions that are understandable to a wide variety of people and, most importantly, to those in your sample.
In sum, in order to pose effective survey questions, researchers should do the following:
- Identify how each question measures an independent, dependent, or control variable in their study.
- Keep questions clear and succinct.
- Make sure respondents have relevant lived experience to provide informed answers to your questions.
- Use filter questions to avoid getting answers from uninformed participants.
- Avoid questions that are likely to confuse respondents—including those that use double negatives, use culturally specific terms or jargon, and pose more than one question at a time.
- Imagine how respondents would feel responding to questions.
- Get feedback, especially from people who resemble those in the researcher’s sample.
Exercises
Let’s complete a first draft of your questions. In the previous exercise, you listed all of the questions and answers you will use to measure the variables in your research question.
- In the previous exercise, you wrote out the questions and answers for each measure of your independent and dependent variables. Evaluate each question using the criteria listed above on effective survey questions.
- Type out questions for your control variables and evaluate them, as well. Consider what response options you want to offer participants.
Now, let’s revise any questions that do not meet your standards!
- Use the BRUSO model in Table 12.2 for an illustration of how to address deficits in question wording. Keep in mind that you are writing a first draft in this exercise, and it will take a few drafts and revisions before your questions are ready to distribute to participants.
Criterion | Poor | Effective
--- | --- | ---
B- Brief | “Are you now or have you ever been the possessor of a firearm?” | “Have you ever possessed a firearm?”
R- Relevant | “Who did you vote for in the last election?” | Note: Only include items that are relevant to your study.
U- Unambiguous | “Are you a gun person?” | “Do you currently own a gun?”
S- Specific | “How much have you read about the new gun control measure and sales tax?” | “How much have you read about the new sales tax on firearm purchases?”
O- Objective | “How much do you support the beneficial new gun control measure?” | “What is your view of the new gun control measure?”
Writing response options
While posing clear and understandable questions in your survey is certainly important, so too is providing respondents with unambiguous response options. Response options are the answers that you provide to the people completing your questionnaire. Generally, respondents will be asked to choose a single (or best) response to each question you pose. We call questions in which the researcher provides all of the response options closed-ended questions. Keep in mind, closed-ended questions can also instruct respondents to choose multiple response options, rank response options against one another, or assign a percentage to each response option. But be cautious when experimenting with different response options! Accepting multiple responses to a single question may add complexity when it comes to quantitatively analyzing and interpreting your data.
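To make the analysis complexity concrete, consider a hypothetical “select all that apply” item about extracurricular activities. Before quantitative analysis, each option usually has to be expanded into its own yes/no variable, as in this minimal pandas sketch; the activity labels and responses are invented for illustration.

```python
# A minimal sketch of recoding a multi-select item into one binary
# variable per response option. The data are made-up illustrations.
import pandas as pd

raw = pd.DataFrame({
    "respondent": [1, 2, 3],
    "activities": [["athletics", "clubs"], ["clubs"], []],  # multi-select answers
})

# Expand the list column, then build a 0/1 indicator per option.
dummies = pd.get_dummies(raw["activities"].explode()).groupby(level=0).max()
coded = raw[["respondent"]].join(dummies)
print(coded)
```

A single-response closed-ended question, by contrast, maps directly onto one variable, which is part of why multiple-response items add work at the analysis stage.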
Surveys need not be limited to closed-ended questions. Sometimes survey researchers include open-ended questions in their survey instruments as a way to gather additional details from respondents. An open-ended question does not include response options; instead, respondents are asked to reply to the question in their own way, using their own words. These questions are generally used to find out more about a survey participant’s experiences or feelings about whatever they are being asked to report in the survey. If, for example, a survey includes closed-ended questions asking respondents to report on their involvement in extracurricular activities during college, an open-ended question could ask respondents why they participated in those activities or what they gained from their participation. While responses to such questions may also be captured using a closed-ended format, allowing participants to share some of their responses in their own words can make the experience of completing the survey more satisfying to respondents and can also reveal new motivations or explanations that had not occurred to the researcher. This is particularly important for mixed-methods research. It is possible to analyze open-ended responses quantitatively using content analysis (i.e., counting how often a theme appears in responses and looking for statistical patterns). However, for most researchers, qualitative data analysis will be needed to analyze open-ended questions, and researchers need to think through how they will analyze any open-ended questions as part of their data analysis plan. We will address qualitative data analysis in greater detail in Chapter 19.
To keep things simple, we encourage you to use only closed-ended response options in your study. While open-ended questions are not wrong, in our classrooms they are often a sign that students have not fully thought through how to operationally define and measure their key variables. Open-ended questions cannot be operationally defined in advance because you don’t know what responses you will get. Instead, you will need to analyze the qualitative data using one of the techniques we discuss in Chapter 19 to interpret your participants’ responses.
To write effective response options for closed-ended questions, there are a couple of guidelines worth following. First, be sure that your response options are mutually exclusive. Look back at Figure 12.1, which contains questions about how often and how many drinks respondents consumed. Do you notice that there are no overlapping categories in the response options for these questions? This is another one of those points about question construction that seems fairly obvious but that can be easily overlooked. Response options should also be exhaustive. In other words, every possible response should be covered in the set of response options that you provide. For example, note that in question 10a in Figure 12.1, we have covered all possibilities—those who drank, say, an average of once per month can choose the first response option (“less than one time per week”) while those who drank multiple times a day each day of the week can choose the last response option (“7+”). All the possibilities in between these two extremes are covered by the middle three response options, and every respondent fits into one of the response options we provided. A sketch of how you might check numeric options for overlaps and gaps follows.
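For numeric response options, mutual exclusivity and exhaustiveness can be checked mechanically. Here is a minimal Python sketch; the drinks-per-week bands are invented placeholders, not the actual options from Figure 12.1.

```python
# A minimal sketch of checking that numeric response-option ranges
# are mutually exclusive (no overlaps) and exhaustive (no gaps).
# Each option is a [low, high) band of drinks per week; the bands
# below are illustrative placeholders.

options = [(0, 1), (1, 3), (3, 5), (5, 7), (7, float("inf"))]

def check(ranges):
    ranges = sorted(ranges)
    for (lo1, hi1), (lo2, hi2) in zip(ranges, ranges[1:]):
        if hi1 > lo2:
            return f"overlap at {lo2}: options are not mutually exclusive"
        if hi1 < lo2:
            return f"gap between {hi1} and {lo2}: options are not exhaustive"
    return "options are mutually exclusive and exhaustive"

print(check(options))
```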
Earlier in this section, we discussed double-barreled questions. Response options can also be double barreled, and this should be avoided. Figure 12.3 is an example of a question that uses double-barreled response options. Other tips about questions are also relevant to response options, including that participants should be knowledgeable enough to select or decline a response option as well as avoiding jargon and cultural idioms.
Even if you phrase questions and response options clearly, participants are influenced by how many response options are presented on the questionnaire. For Likert scales, five or seven response options generally allow about as much precision as respondents are capable of. However, numerical scales with more options can sometimes be appropriate. For dimensions such as attractiveness, pain, and likelihood, a 0-to-10 scale will be familiar to many respondents and easy for them to use. Regardless of the number of response options, the most extreme ones should generally be “balanced” around a neutral or modal midpoint. An example of an unbalanced rating scale measuring perceived likelihood might look like this:
Unlikely | Somewhat Likely | Likely | Very Likely | Extremely Likely
Because we have four rankings of likely and only one ranking of unlikely, the scale is unbalanced and most responses will be biased toward “likely” rather than “unlikely.” A balanced version might look like this:
Extremely Unlikely | Somewhat Unlikely | As Likely as Not | Somewhat Likely | Extremely Likely
In this example, the midpoint is halfway between likely and unlikely. Of course, a middle or neutral response option does not have to be included. Researchers sometimes choose to leave it out because they want to encourage respondents to think more deeply about their response and not simply choose the middle option by default. Fence-sitters are respondents who choose neutral response options even when they have an opinion. Some people will be drawn to respond “no opinion” even if they have an opinion, particularly if their true opinion is not a socially desirable one. Floaters, on the other hand, are those who choose a substantive answer to a question when, really, they don’t understand the question or don’t have an opinion.
As you can see, floating is the flip side of fence-sitting. Thus, the solution to one problem is often the cause of the other. How you decide which approach to take depends on the goals of your research. Sometimes researchers specifically want to learn something about people who claim to have no opinion. In this case, allowing for fence-sitting would be necessary. Other times researchers feel confident their respondents will all be familiar with every topic in their survey. In this case, perhaps it is okay to force respondents to choose one side or another (e.g., agree or disagree) without a middle option (e.g., neither agree nor disagree) or to leave out options like “don’t know enough to say” or “not applicable.” There is no always-correct solution to either problem, but in general, including a middle option provides a more exhaustive set of response options than excluding one.
The most important check before you finalize your response options is to align them with your operational definitions. As we’ve discussed before, your operational definitions include your measures (questions and response options) as well as how to interpret those measures in terms of the variable being measured. In particular, you should be able to interpret every response option to a question based on your operational definition of the variable it measures. If you wanted to measure the variable “social class,” you might ask one question about a participant’s annual income and another about family size. Your operational definition would need to provide clear instructions on how to interpret response options. Your operational definition works much like this social class calculator from Pew Research, though Pew includes a few more questions in its definition.
To drill down a bit more, as Pew specifies in the section titled “how the income calculator works,” the interval/ratio data respondents enter are interpreted using a formula that combines a participant’s four responses and categorizes their household into one of three classes: upper, middle, or lower. So, the operational definition includes the four questions comprising the measure and the formula, or interpretation, that converts responses into the three final categories we are familiar with: lower, middle, and upper class.
It is interesting to note that the final output (lower, middle, or upper class) is at an ordinal level of measurement, whereas Pew’s four questions use an interval or ratio level of measurement (depending on the question). This means that respondents provide numerical responses, rather than choosing categories like lower, middle, and upper class. It’s perfectly normal for operational definitions to change levels of measurement, and it’s also perfectly normal for the level of measurement to stay the same. The important thing is that each response option a participant can provide is accounted for by the operational definition. Throw any combination of family size, location, or income at the Pew calculator, and it will sort you into one of those three social class categories.
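To make this concrete, here is a minimal sketch of what such an operational definition might look like in code. The size adjustment and class cutoffs are simplified placeholders loosely modeled on approaches like Pew’s, not Pew’s actual published formula or thresholds.

```python
# A minimal sketch of an operational definition that converts
# interval/ratio responses (income, household size) into an ordinal
# category (lower/middle/upper class). All numbers are illustrative
# placeholders, not Pew's actual values.

def social_class(household_income: float, household_size: int,
                 median_income: float = 70_000) -> str:
    # Adjust income for household size using a square-root equivalence
    # scale, normalized to a three-person household.
    adjusted = household_income / (household_size ** 0.5) * (3 ** 0.5)
    # Treat two-thirds to double the median as "middle class."
    if adjusted < (2 / 3) * median_income:
        return "lower"
    if adjusted <= 2 * median_income:
        return "middle"
    return "upper"

print(social_class(household_income=45_000, household_size=4))  # lower
print(social_class(household_income=85_000, household_size=2))  # middle
```

Note that every possible combination of inputs lands in exactly one category, which is what it means for an operational definition to account for all response options.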
Unlike Pew’s definition, the operational definitions in your study may not need their own webpage to define and describe. For many questions and answers, interpreting response options is easy. If you were measuring “income” instead of “social class,” you could simply operationalize the term by asking people to list their total household income before taxes are taken out. Higher values indicate higher income, and lower values indicate lower income. Easy. Regardless of whether your operational definitions are simple or more complex, every response option to every question on your survey (with a few exceptions) should be interpretable using an operational definition of a variable. Just like we want to avoid an everything-but-the-kitchen-sink approach to questions on our questionnaire, you want to make sure your final questionnaire only contains response options that you will use in your study.
One note of caution on interpretation (sorry for repeating this). We want to remind you again that an operational definition should not mention more than one variable. In our example above, your operational definition could not say “a family of three making under $50,000 is lower class; therefore, they are more likely to experience food insecurity.” That last clause about food insecurity may well be true, but it’s not a part of the operational definition for social class. Each variable (food insecurity and social class) should have its own operational definition. If you are talking about how to interpret the relationship between two variables, you are talking about your data analysis plan. We will discuss how to create your data analysis plan beginning in Chapter 14. For now, one consideration is that depending on the statistical test you use to test relationships between variables, you may need nominal, ordinal, or interval/ratio data. Your questions and response options should provide the level of measurement required by the specific statistical tests in your data analysis plan. Once you finalize your data analysis plan, return to your questionnaire to confirm that the level of measurement matches the statistical test you’ve chosen.
In summary, to write effective response options researchers should do the following:
- Avoid wording that is likely to confuse respondents—including double negatives, culturally specific terms or jargon, and double-barreled response options.
- Ensure response options are relevant to participants’ knowledge and experience so they can make an informed and accurate choice.
- Present mutually exclusive and exhaustive response options.
- Consider fence-sitters and floaters, and the use of neutral or “not applicable” response options.
- Define how response options are interpreted as part of an operational definition of a variable.
- Check that the level of measurement matches operational definitions and the statistical tests in the data analysis plan (once you develop one).
Exercises
Look back at the response options you drafted in the previous exercise. Make sure you have a first draft of response options for each closed-ended question on your questionnaire.
- Using the criteria above, evaluate the wording of the response options for each question on your questionnaire.
- Revise your questions and response options until you have a complete first draft.
- Do your first read-through and provide a dummy answer to each question. Make sure you can link each response option and each question to an operational definition.
- Look ahead to Chapter 14 and consider how each item on your questionnaire will inform your data analysis plan.
From this discussion, we hope it is clear why researchers using quantitative methods spell out all of their plans ahead of time. Ultimately, there should be a straight line from operational definition through measures on your questionnaire to the data analysis plan. If your questionnaire includes response options that are not aligned with operational definitions or not included in the data analysis plan, the responses you receive back from participants won’t fit with your conceptualization of the key variables in your study. If you do not fix these errors and proceed with collecting unstructured data, you will lose out on many of the benefits of survey research and face overwhelming challenges in answering your research question.
Designing questionnaires
Based on your work in the previous section, you should have a first draft of the questions and response options for the key variables in your study. Now, you’ll also need to think about how to present your written questions and response options to survey respondents. It’s time to write a final draft of your questionnaire and make it look nice. Designing questionnaires takes some thought. First, consider the route of administration for your survey. What we cover in this section will apply equally to paper and online surveys, but if you are planning to use online survey software, you should watch tutorial videos and explore the features of the survey software you will use.
Informed consent & instructions
Writing effective items is only one part of constructing a survey. For one thing, every survey should have a written or spoken introduction that serves two basic functions (Peterson, 2000).[5] One is to encourage respondents to participate in the survey. In many types of research, such encouragement is not necessary either because participants do not know they are in a study (as in naturalistic observation) or because they are part of a subject pool and have already shown their willingness to participate by signing up and showing up for the study. Survey research usually catches respondents by surprise when they answer their phone, go to their mailbox, or check their e-mail—and the researcher must make a good case for why they should agree to participate. Thus, the introduction should briefly explain the purpose of the survey and its importance, provide information about the sponsor of the survey (university-based surveys tend to generate higher response rates), acknowledge the importance of the respondent’s participation, and describe any incentives for participating.
The second function of the introduction is to establish informed consent. Remember that this involves describing to respondents everything that might affect their decision to participate. This includes the topics covered by the survey, the amount of time it is likely to take, the respondent’s option to withdraw at any time, confidentiality issues, and other ethical considerations we covered in Chapter 6. Written consent forms are not always used in survey research (when the research is of minimal risk, the IRB often accepts completion of the survey instrument as evidence of consent to participate), so it is important that this part of the introduction be well documented and presented clearly and in its entirety to every respondent.
Organizing items to be easy and intuitive to follow
The introduction should be followed by the substantive questionnaire items. But first, it is important to present clear instructions for completing the questionnaire, including examples of how to use any unusual response scales. Remember that the introduction is the point at which respondents are usually most interested and least fatigued, so it is good practice to start with the most important items for purposes of the research and proceed to less important items. Items should also be grouped by topic or by type. For example, items using the same rating scale (e.g., a 5-point agreement scale) should be grouped together if possible to make things faster and easier for respondents. Demographic items are often presented last because they are least interesting to participants but also easy to answer in the event respondents have become tired or bored. Of course, any survey should end with an expression of appreciation to the respondent.
Questions are often organized thematically. If our survey were measuring social class, perhaps we’d have a few questions asking about employment, others focused on education, and still others on housing and community resources. Those may be the themes around which we organize our questions. Or perhaps it would make more sense to present any questions we had about parents’ income and then present a series of questions about estimated future income. Grouping by theme is one way to be deliberate about how you present your questions. Keep in mind that you are surveying people, and these people will be trying to follow the logic in your questionnaire. Jumping from topic to topic can give people a bit of whiplash and may make participants less likely to complete it.
Using a matrix is a nice way of streamlining response options for similar questions. A matrix is a question type that lists a set of questions for which the answer categories are all the same. If you have a set of questions for which the response options are the same, it may make sense to create a matrix rather than posing each question and its response options individually. Not only will this save you some space in your survey, but it will also help respondents progress through your survey more easily. A sample matrix can be seen in Figure 12.4.
Once you have grouped similar questions together, you’ll need to think about the order in which to present those question groups. Most survey researchers agree that it is best to begin a survey with questions that will make respondents want to continue (Babbie, 2010; Dillman, 2000; Neuman, 2003).[6] In other words, don’t bore respondents, but don’t scare them away either. There’s some disagreement over where on a survey to place demographic questions, such as those about a person’s age, gender, and race. On the one hand, placing them at the beginning of the questionnaire may lead respondents to think the survey is boring, unimportant, and not something they want to bother completing. On the other hand, if your survey deals with a very sensitive topic, such as child sexual abuse or criminal convictions, you don’t want to scare respondents away or shock them by beginning with your most intrusive questions.
Your participants are human. They will react emotionally to questionnaire items, and they will also try to uncover your research questions and hypotheses. In truth, the order in which you present questions on a survey is best determined by the unique characteristics of your research. When feasible, you should consult with key informants from your target population to determine how best to order your questions. If it is not feasible to do so, think about the unique characteristics of your topic, your questions, and most importantly, your sample. Keeping in mind the characteristics and needs of the people you will ask to complete your survey should help guide you as you determine the most appropriate order in which to present your questions. None of your decisions will be perfect, and all studies have limitations.
Questionnaire length
You’ll also need to consider the time it will take respondents to complete your questionnaire. Surveys vary in length, from just a page or two to a dozen or more pages, which means they also vary in the time it takes to complete them. How long to make your survey depends on several factors. First, what is it that you wish to know? Wanting to understand how grades vary by gender and year in school certainly requires fewer questions than wanting to know how people’s experiences in college are shaped by demographic characteristics, college attended, housing situation, family background, college major, friendship networks, and extracurricular activities. Keep in mind that even if your research question requires a sizable number of questions be included in your questionnaire, do your best to keep the questionnaire as brief as possible. Any hint that you’ve thrown in a bunch of useless questions just for the sake of it will turn off respondents and may make them not want to complete your survey.
Second, and perhaps more important, how long are respondents likely to be willing to spend completing your questionnaire? If you are studying college students, asking them to give up their limited free time to complete your survey may mean they won’t want to spend more than a few minutes on it. But if you ask them to complete your survey during down-time between classes when there is little work to be done, students may be willing to give you a bit more of their time. Think about places and times where your sampling frame naturally gathers and whether you would be able to either recruit participants or distribute a survey in that context. Estimate how long your participants would reasonably have to complete a survey presented to them during this time. The more you know about your population (such as which weeks have less work and more free time), the better you can target questionnaire length.
The time that survey researchers ask respondents to spend on questionnaires varies greatly. Some researchers advise that surveys should not take longer than about 15 minutes to complete (as cited in Babbie 2010),[7] whereas others suggest that up to 20 minutes is acceptable (Hopper, 2010).[8] As with question order, there is no clear-cut, always-correct answer about questionnaire length. The unique characteristics of your study and your sample should be considered to determine how long to make your questionnaire. For example, if you planned to distribute your questionnaire to students in between classes, you will need to make sure it is short enough to complete before the next class begins.
When designing a questionnaire, a researcher should consider:
- Weighing strengths and limitations of the method of delivery, including the advanced tools in online survey software or the simplicity of paper questionnaires.
- Grouping together items that ask about the same thing.
- Moving any questions about sensitive items to the end of the questionnaire, so as not to scare respondents off.
- Moving any questions that engage the respondent to answer the questionnaire at the beginning, so as not to bore them.
- Timing the length of the questionnaire with a reasonable length of time you can ask of your participants.
- Dedicating time to visual design and ensuring the questionnaire looks professional.
Exercises
Type out a final draft of your questionnaire in a word processor or online survey tool.
- Evaluate your questionnaire using the guidelines above, revise it, and get it ready to share with other student researchers.
Pilot testing and revising questionnaires
A good way to estimate the time it will take respondents to complete your questionnaire (and other potential challenges) is through pilot testing. Pilot testing allows you to get feedback on your questionnaire so you can improve it before you actually administer it. It can be quite expensive and time consuming if you wish to pilot test your questionnaire on a large sample of people who very much resemble the sample to whom you will eventually administer the finalized version of your questionnaire. But you can learn a lot and make great improvements to your questionnaire simply by pilot testing with a small number of people to whom you have easy access (perhaps you have a few friends who owe you a favor). By pilot testing your questionnaire, you can find out how understandable your questions are, get feedback on question wording and order, find out whether any of your questions are boring or offensive, and learn whether there are places where you should have included filter questions. You can also time pilot testers as they take your survey. This will give you a good idea about the estimate to provide respondents when you administer your survey and whether you have some wiggle room to add additional items or need to cut a few items.
Perhaps this goes without saying, but your questionnaire should also have an attractive design. A messy presentation style can confuse respondents or, at the very least, annoy them. Be brief, to the point, and as clear as possible. Avoid cramming too much into a single page. Make your font size readable (at least 12 point, or larger depending on the characteristics of your sample), leave a reasonable amount of space between items, and make sure all instructions are exceptionally clear. If you are using an online survey, ensure that participants can complete it on mobile, computer, and tablet devices. Think about books, documents, articles, or web pages that you have read yourself—which were relatively easy to read and easy on the eyes, and why? Try to mimic those features in the presentation of your survey questions. While online survey tools automate much of the visual design, word processors are built for writing all kinds of documents and may require more manual adjustment to achieve a polished layout.
Realistically, your questionnaire will continue to evolve as you develop your data analysis plan over the next few chapters. By now, you should have a complete draft of your questionnaire grounded in an underlying logic that ties together each question and response option to a variable in your study. Once your questionnaire is finalized, you will need to submit it for ethical approval from your professor or the IRB. If your study requires IRB approval, it may be worthwhile to submit your proposal before your questionnaire is completely done. Revisions to IRB protocols are common and it takes less time to review a few changes to questions and answers than it does to review the entire study, so give them the whole study as soon as you can. Once the IRB approves your questionnaire, you cannot change it without their okay.
Key Takeaways
- A questionnaire is composed of self-report measures of the variables in a research study.
- Make sure your survey questions will be relevant to all respondents and that you use filter questions when necessary.
- Effective survey questions and responses take careful construction by researchers, as participants may be confused or otherwise influenced by how items are phrased.
- The questionnaire should start with informed consent and instructions, flow logically from one topic to the next, engage but not shock participants, and thank participants at the end.
- Pilot testing can help identify any issues in a questionnaire before distributing it to participants, including language or length issues.
Exercises
It’s a myth that researchers work alone! Get together with a few of your fellow students and swap questionnaires for pilot testing.
- Use the criteria in each section above (questions, response options, questionnaires) and provide your peers with the strengths and weaknesses of their questionnaires.
- See if you can guess their research question and hypothesis based on the questionnaire alone.
11.3 Measurement quality
Learning Objectives
Learners will be able to…
- Define and describe the types of validity and reliability
- Assess for systematic error
The previous sections provided insight into measuring concepts in social work research. We discussed the importance of identifying concepts and their corresponding indicators as a way to help us operationalize them. In essence, we now understand that when we think about our measurement process, we must be intentional and thoughtful in the choices that we make. This section is all about how to judge the quality of the measures you’ve chosen for the key variables in your research question.
Reliability
First, let’s say we’ve decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism this way, then it is likely that anyone who identifies as an alcoholic would respond “yes.” This may seem like a good way to identify our group of interest, but think about how you and your peer group might respond to this question. Would participants respond differently after a wild night out, compared to any other night? Could an infrequent drinker’s current headache from last night’s glass of wine influence how they answer the question this morning? How would that same person respond to the question before consuming the wine? In each case, the same person might respond differently to the same question at different points, so it is possible that our measure of alcoholism has a reliability problem. Reliability in measurement is about consistency.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behavior, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past 3 months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. On the other hand, we might get more accurate responses if we ask a participant how many drinks of any kind they have consumed in the past week.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviors. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. They may decide to conduct observations at a local pub by noting how many drinks patrons consume and how their behavior changes as their intake changes. What if the researcher has to use the restroom, and the patron next to them takes three shots of tequila during the brief period the researcher is away from their seat? The reliability of this researcher’s measure of alcohol intake depends on their ability to physically observe every instance of patrons consuming drinks. If they are unlikely to be able to observe every such instance, then perhaps their mechanism for measuring this concept is not reliable.
The following subsections describe the types of reliability that are important for you to know about, but keep in mind that you may see other approaches to judging reliability mentioned in the empirical literature.
Test-retest reliability
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time and then using it again on the same group of people at a later time. Unlike an experiment, you aren’t giving participants an intervention; you are trying to establish a reliable baseline of the variable you are measuring. Once you have these two measurements, you then look at the correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 11.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
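If you have two sets of scores in hand, the correlation is straightforward to compute. Here is a minimal NumPy sketch, with made-up scores standing in for two administrations of a self-esteem scale.

```python
# A minimal sketch of computing a test-retest correlation. The
# scores are invented illustration data, not from an actual study.
import numpy as np

time1 = np.array([22, 25, 18, 30, 27, 21, 24, 29])  # first administration
time2 = np.array([23, 24, 19, 29, 28, 20, 25, 28])  # one week later

r = np.corrcoef(time1, time2)[0, 1]  # Pearson correlation coefficient
print(f"test-retest r = {r:.2f}")    # +.80 or greater suggests good reliability
```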
Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
Internal consistency
Another kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials. A specific statistical test known as Cronbach’s Alpha provides a way to measure how well each question of a scale is related to the others.
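Cronbach’s alpha can be computed directly from an item-by-respondent matrix using the standard formula. Here is a minimal sketch; the 5-respondent, 4-item ratings are invented for illustration.

```python
# A minimal sketch of Cronbach's alpha for a multi-item scale.
# Rows are respondents, columns are items; the ratings are made up.
import numpy as np

items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])

k = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # variance of each item
total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")    # values near .80+ are typically acceptable
```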
Interrater reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other.
Validity
Validity, another key element of assessing measurement quality, is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.
Face validity
Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of its 567 statements applies to them—and many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.
Content validity
Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then their measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feel good about exercising, and actually exercise. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Criterion validity
Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
Discriminant validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
Increasing the reliability and validity of measures
We have reviewed how to evaluate our measures based on reliability and validity considerations; the remaining types of measurement error are covered later in this section. But what can we do while selecting or creating our tool to minimize the potential for error? Many of our options were covered in our discussion about reliability and validity. Nevertheless, the following list provides a quick summary of things that you should do when creating or selecting a measurement tool. While not all of these will be feasible in your project, it is important to implement the ones that are feasible in your research context.
Make sure that you engage in a rigorous literature review so that you understand the concept that you are studying. This means understanding the different ways that your concept may manifest itself. This review should include a search for existing instruments.[9]
- Do you understand all the dimensions of your concept? Do you have a good understanding of the content dimensions of your concept(s)?
- What instruments exist? How many items are on the existing instruments? Are these instruments appropriate for your population?
- Are these instruments standardized? Note: If an instrument is standardized, that means it has been rigorously studied and tested.
Consult content experts to review your instrument. This is a good way to check the face validity of your items. Additionally, content experts can also help you understand the content validity.[10]
- Do you have access to a reasonable number of content experts? If not, how can you locate them?
- Did you provide a list of critical questions for your content reviewers to use in the reviewing process?
Pilot test your instrument on a sufficient number of people and get detailed feedback.[11] Ask your group to provide feedback on the wording and clarity of items. Keep detailed notes and make adjustments BEFORE you administer your final tool.
- How many people will you use in your pilot testing?
- How will you set up your pilot testing so that it mimics the actual process of administering your tool?
- How will you receive feedback from your pilot testing group? Have you provided a list of questions for your group to think about?
Provide training for anyone collecting data for your project.[12] You should provide those helping you with a written research protocol that explains all of the steps of the project. You should also problem solve and answer any questions that those helping you may have. This will increase the chances that your tool will be administered in a consistent manner.
- How will you conduct your orientation/training? How long will it be? What modality?
- How will you select those who will administer your tool? What qualifications do they need?
When writing items, use a higher level of measurement, if possible.[13] This will provide more information, and you can always collapse responses to a lower level of measurement later (see the sketch after the questions below).
- Have you examined your items and the levels of measurement?
- Have you thought about whether you need to modify the type of data you are collecting? Specifically, are you asking for information that is too specific (at a higher level of measurement) which may reduce participants’ willingness to participate?
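As promised above, here is a minimal pandas sketch of collapsing data collected at a higher level of measurement (exact age, a ratio variable) into a lower one (ordinal age brackets). The ages and bracket edges are arbitrary placeholders.

```python
# A minimal sketch of downgrading a ratio-level item (exact age)
# to an ordinal one (age brackets) after data collection. The
# ages and bracket edges are illustrative placeholders.
import pandas as pd

ages = pd.Series([19, 23, 31, 45, 52, 67])  # exact ages as reported

brackets = pd.cut(
    ages,
    bins=[17, 24, 44, 64, 120],
    labels=["18-24", "25-44", "45-64", "65+"],
)
print(brackets.tolist())  # ['18-24', '18-24', '25-44', '45-64', '45-64', '65+']
```

The reverse is impossible: if you only collected brackets, you could never recover exact ages.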
Use multiple indicators for a variable.[14] Think about the number of items that you will include in your tool.
- Do you have enough items? Enough indicators? The correct indicators?
Conduct an item-by-item assessment of multiple-item measures.[15] When you do this assessment, think about each word and how it changes the meaning of your item.
- Are there items that are redundant? Do you need to modify, delete, or add items?
Types of error
As you can see, measures never perfectly describe what exists in the real world. Good measures demonstrate validity and reliability but will always have some degree of error. Systematic error (also called bias) causes our measures to consistently output data that are incorrect in one direction or another, usually due to an identifiable process. Imagine you created a measure of height, but you didn’t put an option for anyone over six feet tall. If you gave that measure to your local college or university, some of the taller students might not be measured accurately. In fact, you would be under the mistaken impression that the tallest person at your school was six feet tall, when in actuality there are likely people taller than six feet at your school. This error seems innocent, but if you were using that measure to help you build a new building, those people might hit their heads!
A less innocent form of error arises when researchers word questions in a way that might cause participants to think one answer choice is preferable to another. For example, if I were to ask you “Do you think global warming is caused by human activity?” you would probably feel comfortable answering honestly. But what if I asked you “Do you agree with 99% of scientists that global warming is caused by human activity?” Would you feel comfortable saying no, if that’s what you honestly felt? I doubt it. That is an example of a leading question, a question with wording that influences how a participant responds. We’ll discuss leading questions and other problems in question wording in greater detail in Chapter 12.
In addition to error created by the researcher, your participants can cause error in measurement. Some people will respond without fully understanding a question, particularly if the question is worded in a confusing way. Let’s consider another potential source of error. If we asked people if they always washed their hands after using the bathroom, would we expect people to be perfectly honest? Polling people about whether they wash their hands after using the bathroom might only elicit what people would like others to think they do, rather than what they actually do. This is an example of social desirability bias, in which participants in a research study want to present themselves in a positive, socially desirable way to the researcher. People in your study will want to seem tolerant, open-minded, and intelligent, but their true feelings may be closed-minded, simple, and biased. Participants may lie in this situation. This occurs often in political polling, which may show greater support for a candidate from a minority race, gender, or political party than actually exists in the electorate.
A related form of bias is called acquiescence bias, also known as “yea-saying.” It occurs when people say yes to whatever the researcher asks, even when doing so contradicts previous answers. For example, a person might say yes to both “I am a confident leader in group discussions” and “I feel anxious interacting in group discussions.” Those two responses are unlikely to both be true for the same person. Why would someone do this? Similar to social desirability, people want to be agreeable and nice to the researcher asking them questions, or they might ignore contradictory feelings when responding to each question. You could interpret this as someone saying “yeah, I guess.” Respondents may also have cultural reasons for agreeing, such as trying to “save face” for themselves or the person asking the questions. Regardless of the reason, the results of your measure won’t match what the person truly feels.
So far, we have discussed sources of error that come from choices made by respondents or researchers. Systematic errors will result in responses that are incorrect in one direction or another. For example, social desirability bias usually means that the number of people who say they will vote for a third party in an election is greater than the number of people who actually vote for that party. Systematic errors such as these can be reduced, but random error can never be eliminated. Unlike systematic error, which biases responses consistently in one direction or another, random error is unpredictable and does not push scores consistently higher or lower on a given measure. Instead, random error is more like statistical noise, which will likely average out across participants.
Random error is present in any measurement. If you've ever stepped on a bathroom scale twice and gotten two slightly different results, maybe a difference of a tenth of a pound, then you've experienced random error. Maybe you were standing slightly differently or had a fraction of your foot off the scale the first time. If you were to take enough measurements of your weight on the same scale, you'd be able to figure out your true weight. In social science, if you gave someone a scale measuring depression the day after they lost their job, they would likely score differently than if they had just gotten a promotion and a raise. Even if the person were clinically depressed, our measure is subject to influence by the random occurrences of life. Thus, social scientists speak with humility about their measures. We are reasonably confident that what we found is true, but we must always acknowledge that our measures are only an approximation of reality.
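The contrast between the two kinds of error is easy to see in a quick simulation. In this sketch (our own, with assumed weights and error sizes), a noisy-but-unbiased scale averages out to the true weight over many readings, while a biased scale stays off by the same amount no matter how many readings you take.

```python
# A sketch (assumed numbers) contrasting random and systematic error.
import random

random.seed(0)

TRUE_WEIGHT = 150.0  # pounds

# Noisy but unbiased scale: random error centered on zero
noisy_readings = [TRUE_WEIGHT + random.gauss(0, 0.5) for _ in range(1_000)]

# Biased scale: every reading is off by the same two pounds
biased_readings = [TRUE_WEIGHT - 2.0 + random.gauss(0, 0.5) for _ in range(1_000)]

print(f"Noisy scale average:  {sum(noisy_readings) / len(noisy_readings):.2f}")   # ~150.00
print(f"Biased scale average: {sum(biased_readings) / len(biased_readings):.2f}") # ~148.00
```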
Humility is important in scientific measurement, because errors can have real consequences. At the time I'm writing this, my wife and I are expecting our first child. Like most people, we used a pregnancy test from the pharmacy. If the test said my wife was pregnant when she was not, that would be a false positive. On the other hand, if the test indicated that she was not pregnant when she was in fact pregnant, that would be a false negative. Even if the test is 99% accurate, that means that about one in a hundred people who take a home pregnancy test will get an erroneous result. For us, a false positive would have been initially exciting, then devastating when we found out we were not having a child. A false negative would have been disappointing at first and then quite shocking when we found out we were indeed having a child. While neither false positives nor false negatives are very likely for home pregnancy tests (when taken correctly), measurement error can have real consequences for the people being measured.
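As a quick back-of-the-envelope check of that one-in-a-hundred figure, here is the arithmetic spelled out; the number of tests taken is an assumed value for illustration only.

```python
# Back-of-the-envelope arithmetic for the 99%-accuracy figure above.
# The number of tests is an assumed value for illustration.
tests_taken = 100_000
accuracy = 0.99

erroneous = tests_taken * (1 - accuracy)  # expected false positives + false negatives
print(f"Expected erroneous results: {erroneous:.0f} out of {tests_taken:,}")
# -> Expected erroneous results: 1000 out of 100,000
```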
Key Takeaways
- Reliability is a matter of consistency.
- Validity is a matter of accuracy.
- There are many types of validity and reliability.
- Systematic error may arise from the researcher, participant, or measurement instrument.
- Systematic error biases results in a particular direction, whereas random error can be in any direction.
- All measures are prone to error and should be interpreted with humility.
Exercises
Use the measurement tools you located in the previous exercise. Evaluate the reliability and validity of these tools. Hint: You will need to go into the literature to “research” these tools.
- Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
- Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
- If you decide to create your own tool, how will you assess its validity and reliability?
- Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party identification and policy preferences: The impact of survey question format. American Journal of Political Science, 37(3), 941-964.
- Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
- Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.
- Babbie, E. (2010). The practice of social research (12th ed.). Belmont, CA: Wadsworth.
- Peterson, R. A. (2000). Constructing effective questionnaires. Thousand Oaks, CA: Sage.
- Babbie, E. (2010). The practice of social research (12th ed.). Belmont, CA: Wadsworth; Dillman, D. A. (2000). Mail and Internet surveys: The tailored design method (2nd ed.). New York, NY: Wiley; Neuman, W. L. (2003). Social research methods: Qualitative and quantitative approaches (5th ed.). Boston, MA: Pearson.
- Hopper, J. (2010). How long should a survey be? Retrieved from http://www.verstaresearch.com/blog/how-long-should-a-survey-be
- Sullivan, G. M. (2011). A primer on the validity of assessment instruments. Journal of Graduate Medical Education, 3(2), 119-120. doi:10.4300/JGME-D-11-00075.1
- Engel, R., & Schutt, R. (2013). The practice of research in social work (3rd ed.). Thousand Oaks, CA: SAGE.
Glossary
Operationalization: the process by which researchers spell out precisely how a concept will be measured in their study.
Indicators: clues that demonstrate the presence, intensity, or other aspects of a concept in the real world.
Raw data: unprocessed data that researchers can analyze using quantitative and qualitative methods (e.g., responses to a survey or interview transcripts).
Constant: a characteristic that does not change in a study.
Attributes: the characteristics that make up a variable.
Discrete variables: variables whose values are organized into mutually exclusive groups but whose numerical values cannot be used in mathematical operations.
Continuous variables: variables whose values are mutually exclusive and can be used in mathematical operations.
Nominal level: the lowest level of measurement; categories cannot be mathematically ranked, though they are exhaustive and mutually exclusive.
Exhaustive categories: options for closed-ended questions that allow for every possible response (no one should feel like they can't find the answer that fits them).
Mutually exclusive categories: options for closed-ended questions that do not overlap, so people fit into only one category, not more than one.
Ordinal level: the level of measurement that follows the nominal level; it has mutually exclusive categories and a hierarchy (rank order), but we cannot calculate a mathematical distance between attributes.
Likert scale: an ordered set of responses that participants must choose from.
Interval level: a level of measurement that is continuous, can be rank ordered, is exhaustive and mutually exclusive, and for which the distance between attributes is known to be equal, but which has no true zero point.
Ratio level: the highest level of measurement; denoted by mutually exclusive categories, a hierarchy (order), values that can be added, subtracted, multiplied, and divided, and the presence of an absolute zero.
Likert scaling: measuring people's attitude toward something by assessing their level of agreement with several statements about it.
Semantic differential scales: composite (multi-item) scales in which respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites.
Guttman scale: a composite scale using a series of items arranged in increasing order of intensity of the construct of interest, from least intense to most intense.
Composite measures: measurements of variables based on more than one indicator.
Scale: an empirical structure for measuring items or indicators of the multiple dimensions of a concept.
Index: a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas.
Triangulation: the use of multiple types, measures, or sources of data in a research project to increase the confidence we have in our findings.
Pretesting: testing out your research materials in advance on people who are not included as participants in your study.
Filter questions: items on a questionnaire designed to identify some subset of survey respondents who are asked additional questions that are not relevant to the entire sample.
Double-barreled question: a question that asks more than one thing at a time, making it difficult to respond accurately.
Social desirability: when a participant answers in a way that they believe is socially the most acceptable answer.
Response options: the answers researchers provide to participants to choose from when completing a questionnaire.
Closed-ended questions: questions in which the researcher provides all of the response options.
Open-ended questions: questions for which the researcher does not include response options, allowing respondents to answer the question in their own words.
Fence-sitters: respondents to a survey who choose neutral response options, even if they have an opinion.
Floaters: respondents to a survey who choose a substantive answer to a question when, really, they don't understand the question or don't have an opinion.
Data analysis plan: an ordered outline that includes your research question, a description of the data you are going to use to answer it, and the exact analyses, step-by-step, that you plan to run to answer your research question.
Informed consent: a process through which the researcher explains the research process, procedures, risks, and benefits to a potential participant, usually through a written document, which the participant then signs as evidence of their agreement to participate.
Matrix question: a type of survey question that lists a set of questions for which the response options are all the same, in a grid layout.
Reliability: the ability of a measurement tool to measure a phenomenon the same way, time after time. Note: reliability does not imply validity.
Test-retest reliability: the extent to which scores obtained on a scale or other measure are consistent across time.
Internal consistency reliability: the consistency of people's responses across the items on a multiple-item measure; responses about the same underlying construct should be correlated, though not perfectly.
Inter-rater reliability: the extent to which different observers are consistent in their assessment or rating of a particular characteristic or item.
Validity: the extent to which the scores from a measure represent the variable they are intended to.
Face validity: the extent to which a measurement method appears "on its face" to measure the construct of interest.
Content validity: the extent to which a measure "covers" the construct of interest, i.e., its comprehensiveness in measuring the construct.
Criterion validity: the extent to which people's scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with.
Concurrent validity: a type of criterion validity that examines how well a tool provides the same scores as an already existing tool administered at the same point in time.
Predictive validity: a type of criterion validity that examines how well your tool predicts a future criterion.
Discriminant validity: the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct.
Systematic error: (also known as bias) when a measure consistently outputs incorrect data, usually in one direction and due to an identifiable process.
Leading question: when a participant's answer to a question is altered by the way the question is written; in essence, the question leads the participant to answer in a specific way.
Social desirability bias: when questions lead respondents to answer in ways that don't reflect their genuine thoughts or feelings, so as to avoid being perceived negatively.
Acquiescence bias: when people say yes to whatever the researcher asks, even when doing so contradicts previous answers.
Random error: unpredictable error that does not result in scores that are consistently higher or lower on a given measure but is nevertheless inaccurate.
False positive: when a measure indicates the presence of a phenomenon when in reality it is not present.
False negative: when a measure does not indicate the presence of a phenomenon when in reality it is present.
Target population: the group of people whose needs your study addresses.