Matthew DeCarlo
Chapter Outline
- Where do I start with quantitative data analysis? (12 minute read)
- Measures of central tendency (17 minute read, including 5-minute video)
- Frequencies and variability (13 minute read)
Content warning: examples in this chapter contain references to depression and self-esteem.
People often dread quantitative data analysis because—oh no—it’s math. And true, you’re going to have to work with numbers. For years, I thought I was terrible at math, and then I started working with data and statistics, and it turned out I had a real knack for it. (I have a statistician friend who claims statistics is not math, which is a math joke that’s way over my head, but there you go.) This chapter, and the subsequent quantitative analysis chapters, are going to focus on helping you understand descriptive statistics and a few statistical tests, NOT calculate them (with a couple of exceptions). Future research classes will focus on teaching you to calculate these tests for yourself. So take a deep breath and clear your mind of any doubts about your ability to understand and work with numerical data.
In this chapter, we're going to discuss the first step in analyzing your quantitative data: univariate data analysis. Univariate data analysis is a quantitative method in which a variable is examined individually to determine its distribution, or the way the scores are distributed across the levels of that variable. When we talk about levels, what we are talking about are the possible values of the variable—like a participant's age, income or gender. (Note that this is different from our earlier discussion in Chapter 10 of levels of measurement, but the level of measurement of your variables absolutely affects what kinds of analyses you can do with them.) Univariate analysis is non-relational, which just means that we're not looking into how our variables relate to each other. Instead, we're looking at variables in isolation to try to understand them better. For this reason, univariate analysis is best for descriptive research questions.
So when do you use univariate data analysis? Always! It should be the first thing you do with your quantitative data, whether you are planning to move on to more sophisticated statistical analyses or are conducting a study to describe a new phenomenon. You need to understand what the values of each variable look like—what if one of your variables has a lot of missing data because participants didn’t answer that question on your survey? What if there isn’t much variation in the gender of your sample? These are things you’ll learn through univariate analysis.
14.1 Where do I start with quantitative data analysis?
Learning Objectives
Learners will be able to…
- Define and construct a data analysis plan
- Define key data management terms—variable name, data dictionary, primary and secondary data, observations/cases
No matter how large or small your data set is, quantitative data can be intimidating. There are a few ways to make things manageable for yourself, including creating a data analysis plan and organizing your data in a useful way. We’ll discuss some of the keys to these tactics below.
The data analysis plan
As part of planning for your research, and to help keep you on track and make things more manageable, you should come up with a data analysis plan. You've basically been working toward this in writing your research proposal so far. A data analysis plan is an ordered outline that includes your research question, a description of the data you are going to use to answer it, and the exact step-by-step analyses that you plan to run to answer your research question. This last part—which includes choosing your quantitative analyses—is the focus of this and the next two chapters of this book.
A basic data analysis plan might look something like what you see in Table 14.1. Don’t panic if you don’t yet understand some of the statistical terms in the plan; we’re going to delve into them throughout the next few chapters. Note here also that this is what operationalizing your variables and moving through your research with them looks like on a basic level.
Research question: What is the relationship between a person's race and their likelihood to graduate from high school? |
Data: Individual-level U.S. American Community Survey data for 2017 from IPUMS, which includes race/ethnicity and other demographic data (e.g., educational attainment, family income, employment status, citizenship, presence of both parents). Only including individuals for whom race and educational attainment data are available. |
Steps in Data Analysis Plan |
An important point to remember is that you should never get stuck on using a particular statistical method because you or one of your co-researchers thinks it’s cool or it’s the hot thing in your field right now. You should certainly go into your data analysis plan with ideas, but in the end, you need to let your research question and the actual content of your data guide what statistical tests you use. Be prepared to be flexible if your plan doesn’t pan out because the data is behaving in unexpected ways.
Managing your data
Whether you've collected your own data or are using someone else's data, you need to make sure it is well-organized in a database in a way that's actually usable. "Database" can be kind of a scary word, but really, I just mean an Excel spreadsheet or a data file in whatever program you're using to analyze your data (like SPSS, SAS, or R). (I would avoid Excel if you've got a very large data set—one with millions of records or hundreds of variables—because it gets very slow and can only handle a certain number of cases and variables, depending on your version. But if your data set is smaller and you plan to keep your analyses simple, you can definitely get away with Excel.) Your database or data set should be organized with variables as your columns and observations/cases as your rows. For example, let's say we did a survey on ice cream preferences and collected the following information in Table 14.2 (a short code sketch of this layout follows the table):
Name | Age | Gender | Hometown | Fav_Ice_Cream |
Tom | 54 | 0 | 1 | Rocky Road |
Jorge | 18 | 2 | 0 | French Vanilla |
Melissa | 22 | 1 | 0 | Espresso |
Amy | 27 | 1 | 0 | Black Cherry |
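If you are curious what this layout looks like outside of a spreadsheet, here is a minimal sketch in Python using the pandas library (my choice of tool here is an assumption; the same columns-as-variables, rows-as-cases idea applies in Excel, SPSS, SAS, or R).

```python
import pandas as pd

# Each key is a variable name (a column); each list holds one value per observation/case (a row).
ice_cream = pd.DataFrame({
    "Name": ["Tom", "Jorge", "Melissa", "Amy"],
    "Age": [54, 18, 22, 27],
    "Gender": [0, 2, 1, 1],
    "Hometown": [1, 0, 0, 0],
    "Fav_Ice_Cream": ["Rocky Road", "French Vanilla", "Espresso", "Black Cherry"],
})

print(ice_cream.shape)   # (4, 5): 4 observations/cases and 5 variables
print(ice_cream.head())  # peek at the first few rows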
There are a few key data management terms to understand:
- Variable name: Just what it sounds like—the name of your variable. Make sure this is something useful, short and, if you’re using something other than Excel, all one word. Most statistical programs will automatically rename variables for you if they aren’t one word, but the names are usually a little ridiculous and long.
- Observations/cases: The rows in your data set. In social work, these are often your study participants (people), but can be anything from census tracts to black bears to trains. When we talk about sample size, we’re talking about the number of observations/cases. In our mini data set, each person is an observation/case.
- Primary data: Data you have collected yourself.
- Secondary data: Data someone else has collected that you have permission to use in your research. For example, for my student research project in my MSW program, I used data from a local probation program to determine if a shoplifting prevention group was reducing the rate at which people were re-offending. I had data on who participated in the program and then received their criminal history six months after the end of their probation period. This was secondary data I used to determine whether the shoplifting prevention group had any effect on an individual’s likelihood of re-offending.
- Data dictionary (sometimes called a code book): This is the document where you list your variable names, what the variables actually measure or represent, what each of the values of the variable mean if the meaning isn’t obvious (i.e., if there are numbers assigned to gender), the level of measurement and anything special to know about the variables (for instance, the source if you mashed two data sets together). If you’re using secondary data, the data dictionary should be available to you.
When considering what data you might want to collect as part of your project, there are two important considerations that can create dilemmas for researchers. You might only get one chance to interact with your participants, so you must think comprehensively in your planning phase about what information you need and collect as much relevant data as possible. At the same time, though, especially when collecting sensitive information, you need to consider how onerous the data collection is for participants and whether you really need them to share that information. Just because something is interesting to us doesn’t mean it’s related enough to our research question to chase it down. Work with your research team and/or faculty early in your project to talk through these issues before you get to this point. And if you’re using secondary data, make sure you have access to all the information you need in that data before you use it.
Let’s take that mini data set we’ve got up above and I’ll show you what your data dictionary might look like in Table 14.3.
Variable name | Description | Values/Levels | Level of measurement | Notes |
Name | Participant’s first name | n/a | n/a | First names only. If names appear more than once, a random number has been attached to the end of the name to distinguish. |
Age | Participant’s age at time of survey | n/a | Interval/Ratio | Self-reported |
Gender | Participant's self-identified gender | 0=cisgender female; 1=cisgender male; 2=non-binary; 3=transgender female; 4=transgender male; 5=another gender | Nominal | Self-reported |
Hometown | Participant's hometown—this town or another town | 0=This town; 1=Another town | Nominal | Self-reported |
Fav_Ice_Cream | Participant’s favorite ice cream | n/a | n/a | Self-reported |
Key Takeaways
- Getting organized at the beginning of your project with a data analysis plan will help keep you on track. Data analysis plans should include your research question, a description of your data, and a step-by-step outline of what you’re going to do with it.
- Be flexible with your data analysis plan—sometimes data surprises us and we have to adjust the statistical tests we are using.
- Always make a data dictionary or, if using secondary data, get a copy of the data dictionary so you (or someone else) can understand the basics of your data.
Exercises
- Make a data analysis plan for your project. Remember this should include your research question, a description of the data you will use, and a step-by-step outline of what you’re going to do with your data once you have it, including statistical tests (non-relational and relational) that you plan to use. You can do this exercise whether you’re using quantitative or qualitative data! The same principles apply.
- Make a data dictionary for the data you are proposing to collect as part of your study. You can use the example above as a template.
14.2 Measures of central tendency
Learning Objectives
Learners will be able to…
- Explain measures of central tendency—mean, median and mode—and when to use them to describe your data
- Explain the importance of examining the range of your data
- Apply the appropriate measure of central tendency to a research problem or question
A measure of central tendency is one number that can give you an idea about the distribution of your data. The video below gives a more detailed introduction to central tendency. Then we’ll talk more specifically about our three measures of central tendency—mean, median and mode.
One quick note: the narrator in the video mentions skewness and kurtosis. Basically, these refer to particular shapes a distribution can take when you graph it out. That gets into some more advanced statistical analysis that we aren't tackling in this book, so just file them away for a more advanced class, if you ever take on additional statistics coursework.
There are three key measures of central tendency, which we’ll go into now.
Mean
The mean, also called the average, is calculated by adding up the values for all your cases and dividing the sum by the number of cases. You've undoubtedly calculated a mean at some point in your life. The mean is the most widely used measure of central tendency because it's easy to understand and calculate. It can only be used with interval/ratio variables, like age, test scores or years of post-high school education. (If you think about it, using it with a nominal or ordinal variable doesn't make much sense—why would we care about the average of the numerical values we assigned to certain races?)
The biggest drawback of using the mean is that it’s extremely sensitive to outliers, or extreme values in your data. And the smaller your data set is, the more sensitive your mean is to these outliers. One thing to remember about outliers—they are not inherently bad, and can sometimes contain really important information. Don’t automatically discard them because they skew your data.
Let's take a minute to talk about how to locate outliers in your data. If your data set is very small, you can just take a look at it and see outliers. But in general, you're probably going to be working with data sets that have at least a couple dozen cases, which makes just looking at your values to find outliers difficult. The best way to quickly look for outliers is probably to make a scatter plot with Excel or whatever statistical program you're using; a minimal sketch of this appears below.
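As one hedged illustration, here is what that might look like in Python with matplotlib (an assumption on my part, and a slight simplification: this sketch plots each value against its case number rather than against frequency). The point is simply to look for dots that sit far away from the rest.

```python
import matplotlib.pyplot as plt

# Hypothetical ages from a small survey; 54 sits well above the rest.
ages = [54, 18, 22, 27, 28, 32, 29, 34, 21, 18]

# Plot each value against its position in the data set and eyeball the extremes.
plt.scatter(range(len(ages)), ages)
plt.xlabel("Case number")
plt.ylabel("Age")
plt.title("Looking for outliers in participant age")
plt.show()
```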
Let’s take a very small data set as an example. Oh hey, we had one before! I’ve re-created it in Table 14.5. We’re going to add some more cases to it so it’s a little easier to illustrate what we’re doing.
Name | Age | Gender | Hometown | Fav_Ice_Cream |
Tom | 54 | 0 | 1 | Rocky Road |
Jorge | 18 | 2 | 0 | French Vanilla |
Melissa | 22 | 1 | 0 | Espresso |
Amy | 27 | 1 | 0 | Black Cherry |
Akiko | 28 | 3 | 0 | Chocolate |
Michael | 32 | 0 | 1 | Pistachio |
Jess | 29 | 1 | 0 | Chocolate |
Subasri | 34 | 1 | 0 | Vanilla Bean |
Brian | 21 | 0 | 1 | Moose Tracks |
Crystal | 18 | 1 | 0 | Strawberry |
Let’s say we’re interested in knowing more about the distribution of participant age. Let’s see a scatterplot of age (Figure 14.1). On our y-axis (the vertical one) is the value of age, and on our x-axis (the horizontal one) is the frequency of each age, or the number of times it appears in our data set.
Do you see any outliers in the scatter plot? There is one participant who is significantly older than the rest at age 54. Let’s think about what happens when we calculate our mean with and without that outlier. Complete the two exercises below by using the ages listed in our mini-data set in this section.
Next, let’s try it without the outlier.
With our outlier, the average age of our participants is 28, and without it, the average age is 25. That might not seem enormous, but it illustrates the effects of outliers on the mean.
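If you want to check those numbers yourself, here is a small sketch (in Python, purely as an illustration) that computes the mean with and without Tom's age of 54. The text above rounds to whole years.

```python
ages = [54, 18, 22, 27, 28, 32, 29, 34, 21, 18]  # all ten participants

# Mean = sum of the values divided by the number of cases.
mean_with_outlier = sum(ages) / len(ages)

ages_without_tom = [a for a in ages if a != 54]   # drop the outlier
mean_without_outlier = sum(ages_without_tom) / len(ages_without_tom)

print(round(mean_with_outlier, 1))     # 28.3 -> about 28
print(round(mean_without_outlier, 1))  # 25.4 -> about 25
```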
Just because Tom is an outlier at age 54 doesn't mean you should exclude him. The most important thing about outliers is to think critically about them and how they could affect your analysis. Finding outliers should prompt a couple of questions. First, could the data have been entered incorrectly? Is Tom actually 24, and someone just hit the "5" instead of the "2" on the number pad? What might be special about Tom that he ended up in our group, given how different he is? Are there other relevant ways in which Tom differs from our group (is he an outlier in other ways)? Does it really matter that Tom is much older than our other participants? If we don't think age is a relevant factor in ice cream preferences, then it probably doesn't. If we do, then we probably should have made an effort to get a wider range of ages in our participants.
Median
The median (also called the 50th percentile) is the middle value when all our values are placed in numerical order. If you have five values and you put them in numerical order, the third value will be the median. When you have an even number of values, you’ll have to take the average of the middle two values to get the median. So, if you have 6 values, the average of values 3 and 4 will be the median. Keep in mind that for large data sets, you’re going to want to use either Excel or a statistical program to calculate the median—otherwise, it’s nearly impossible logistically.
Like the mean, you can only calculate the median with interval/ratio variables, like age, test scores or years of post-high school education. The median is also a lot less sensitive to outliers than the mean. While it can be more time intensive to calculate, the median is preferable in most cases to the mean for this reason. It gives us a more accurate picture of where the middle of our distribution sits in most cases. In my work as a policy analyst and researcher, I rarely, if ever, use the mean as a measure of central tendency. Its main value for me is to compare it to the median for statistical purposes. So get used to the median, unless you’re specifically asked for the mean. (When we talk about t-tests in the next chapter, we’ll talk about when the mean can be useful.)
Let’s go back to our little data set and calculate the median age of our participants (Table 14.6).
Name | Age | Gender | Hometown | Fav_Ice_Cream |
Tom | 54 | 0 | 1 | Rocky Road |
Jorge | 18 | 2 | 0 | French Vanilla |
Melissa | 22 | 1 | 0 | Espresso |
Amy | 27 | 1 | 0 | Black Cherry |
Akiko | 28 | 3 | 0 | Chocolate |
Michael | 32 | 0 | 1 | Pistachio |
Jess | 29 | 1 | 0 | Chocolate |
Subasri | 34 | 1 | 0 | Vanilla Bean |
Brian | 21 | 0 | 1 | Moose Tracks |
Crystal | 18 | 1 | 0 | Strawberry |
Remember, to calculate the median, you put all the values in numerical order and take the number in the middle. When there’s an even number of values, take the average of the two middle values.
What happens if we remove Tom, the outlier?
With Tom in our group, the median age is 27.5, and without him, it’s 27. You can see that the median was far less sensitive to him being included in our data than the mean was.
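Here is the same kind of sketch for the median (Python again, just as an illustration), this time using the standard library's statistics module so the sorting and averaging of the middle values is handled for us.

```python
import statistics

ages = [54, 18, 22, 27, 28, 32, 29, 34, 21, 18]

# Median of all ten values: the average of the two middle values (27 and 28).
print(statistics.median(ages))  # 27.5

# Remove Tom (the 54-year-old outlier) and recompute.
ages_without_tom = [a for a in ages if a != 54]
print(statistics.median(ages_without_tom))  # 27
```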
Mode
The mode of a variable is the most commonly occurring value. While you can calculate the mode for interval/ratio variables, it’s mostly useful when examining and describing nominal or ordinal variables. Think of it this way—do we really care that there are two people with an income of $38,000 per year, or do we care that these people fall into a certain category related to that value, like above or below the federal poverty level?
Let’s go back to our ice cream survey (Table 14.7).
Name | Age | Gender | Hometown | Fav_Ice_Cream |
Tom | 54 | 0 | 1 | Rocky Road |
Jorge | 18 | 2 | 0 | French Vanilla |
Melissa | 22 | 1 | 0 | Espresso |
Amy | 27 | 1 | 0 | Black Cherry |
Akiko | 28 | 3 | 0 | Chocolate |
Michael | 32 | 0 | 1 | Pistachio |
Jess | 29 | 1 | 0 | Chocolate |
Subasri | 34 | 1 | 0 | Vanilla Bean |
Brian | 21 | 0 | 1 | Moose Tracks |
Crystal | 18 | 1 | 0 | Strawberry |
We can use the mode for a few different variables here: gender, hometown and fav_ice_cream. The cool thing about the mode is that you can use it for both numeric/quantitative and text/qualitative variables.
So let's find some modes. For hometown—or whether the participant's hometown is the one in which the survey was administered or not—the mode is 0, or "this town," because that's the most common answer. For gender, the mode is 1, the value that appears most often in our sample (five of our ten participants); the data dictionary in Table 14.3 tells us which gender category that code represents. And for fav_ice_cream, the mode is Chocolate, although there's a lot of variation there. Sometimes, you may have more than one mode, which is still useful information.
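A quick sketch of finding those modes (Python with pandas, as an assumption; the point is that the same command works for numeric and text variables alike):

```python
import pandas as pd

# The gender, hometown, and favorite ice cream columns from Table 14.7.
survey = pd.DataFrame({
    "Gender": [0, 2, 1, 1, 3, 0, 1, 1, 0, 1],
    "Hometown": [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    "Fav_Ice_Cream": ["Rocky Road", "French Vanilla", "Espresso", "Black Cherry",
                      "Chocolate", "Pistachio", "Chocolate", "Vanilla Bean",
                      "Moose Tracks", "Strawberry"],
})

# .mode() returns every value tied for most frequent, numeric or text alike.
print(survey["Hometown"].mode().tolist())       # [0]
print(survey["Gender"].mode().tolist())         # [1]
print(survey["Fav_Ice_Cream"].mode().tolist())  # ['Chocolate']
```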
One final thing I want to note about these three measures of central tendency: if you’re using something like a ranking question or a Likert scale, depending on what you’re measuring, you might use a mean or median, even though these look like they will only spit out ordinal variables. For example, say you’re a car designer and want to understand what people are looking for in new cars. You conduct a survey asking participants to rank the characteristics of a new car in order of importance (an ordinal question). The most commonly occurring answer—the mode—really tells you the information you need to design a car that people will want to buy. On the flip side, if you have a scale of 1 through 5 measuring a person’s satisfaction with their most recent oil change, you may want to know the mean score because it will tell you, relative to most or least satisfied, where most people fall in your survey. To know what’s most helpful, think critically about the question you want to answer and about what the actual values of your variable can tell you.
Key Takeaways
- The mean is the average value for a variable, calculated by adding all values and dividing the total by the number of cases. While the mean contains useful information about a variable’s distribution, it’s also susceptible to outliers, especially with small data sets.
- In general, the mean is most useful with interval/ratio variables.
- The median, or 50th percentile, is the exact middle of our distribution when the values of our variable are placed in numerical order. The median is usually a more accurate measurement of the middle of our distribution because outliers have a much smaller effect on it.
- In general, the median is only useful with interval/ratio variables.
- The mode is the most commonly occurring value of our variable. In general, it is only useful with nominal or ordinal variables.
Exercises
- Say you want to know the income of the typical participant in your study. Which measure of central tendency would you use? Why?
- Find an interval/ratio variable and calculate the mean and median. Make a scatter plot and look for outliers.
- Find a nominal variable and calculate the mode.
14.3 Frequencies and variability
Learning Objectives
Learners will be able to…
- Define descriptive statistics and understand when to use these methods.
- Produce and describe visualizations to report quantitative data.
Descriptive statistics refer to a set of techniques for summarizing and displaying data. We've already been through the measures of central tendency (which are considered descriptive statistics); they got their own section because they're such a big topic. Now, we're going to talk about other descriptive statistics and ways to visually represent data.
Frequency tables
One way to display the distribution of a variable is in a frequency table. Table 14.8, for example, is a frequency table showing a hypothetical distribution of scores on the Rosenberg Self-Esteem Scale for a sample of 40 college students. The first column lists the values of the variable—the possible scores on the Rosenberg scale—and the second column lists the frequency of each score. This table shows that there were three students who had self-esteem scores of 24, five who had self-esteem scores of 23, and so on. From a frequency table like this, one can quickly see several important aspects of a distribution, including the range of scores (from 15 to 24), the most and least common scores (22 and 17, respectively), and any extreme scores that stand out from the rest.
Self-esteem score (out of 30) | Frequency |
24 | 3 |
23 | 5 |
22 | 10 |
21 | 8 |
20 | 5 |
19 | 3 |
18 | 3 |
17 | 0 |
16 | 2 |
15 | 1 |
There are a few other points worth noting about frequency tables. First, the levels listed in the first column usually go from the highest at the top to the lowest at the bottom, and they usually do not extend beyond the highest and lowest scores in the data. For example, although scores on the Rosenberg scale can vary from a high of 30 to a low of 0, Table 14.8 only includes levels from 24 to 15 because that range includes all the scores in this particular data set. Second, when there are many different scores across a wide range of values, it is often better to create a grouped frequency table, in which the first column lists ranges of values and the second column lists the frequency of scores in each range. Table 14.9, for example, is a grouped frequency table showing a hypothetical distribution of simple reaction times for a sample of 20 participants. In a grouped frequency table, the ranges must all be of equal width, and there are usually between five and 15 of them. Finally, frequency tables can also be used for nominal or ordinal variables, in which case the levels are category labels. The order of the category labels is somewhat arbitrary, but they are often listed from the most frequent at the top to the least frequent at the bottom.
Reaction time (ms) | Frequency |
241–260 | 1 |
221–240 | 2 |
201–220 | 2 |
181–200 | 9 |
161–180 | 4 |
141–160 | 2 |
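If you are producing tables like these from raw data, here is a rough sketch of how it might look in Python with pandas (an assumption; spreadsheet and statistical packages have equivalent commands). The scores and reaction times below are made-up stand-ins, not the actual data behind Tables 14.8 and 14.9.

```python
import pandas as pd

# Hypothetical raw self-esteem scores; value_counts() tallies how often each level appears.
scores = pd.Series([24, 24, 24, 23, 23, 23, 23, 23, 22, 22, 21, 20, 19, 18, 16, 15])
frequency_table = scores.value_counts().sort_index(ascending=False)
print(frequency_table)

# For a variable with many values across a wide range, bin the values into
# equal-width ranges first, then tally the bins for a grouped frequency table.
reaction_times = pd.Series([250, 230, 225, 210, 205, 190, 188, 185, 170, 165, 150, 145])
bin_edges = list(range(140, 261, 20))  # roughly 141-160, 161-180, ..., 241-260
grouped = pd.cut(reaction_times, bins=bin_edges).value_counts().sort_index(ascending=False)
print(grouped)
```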
Histograms
A histogram is a graphical display of a distribution. It presents the same information as a frequency table but in a way that is grasped more quickly and easily. The histogram in Figure 14.2 presents the distribution of self-esteem scores in Table 14.8. The x-axis (the horizontal one) of the histogram represents the variable and the y-axis (the vertical one) represents frequency. Above each level of the variable on the x-axis is a vertical bar that represents the number of individuals with that score. When the variable is quantitative, as it is in this example, there is usually no gap between the bars. When the variable is nominal or ordinal, however, there is usually a small gap between them. (The gap at 17 in this histogram reflects the fact that there were no scores of 17 in this data set.)
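Here is a minimal sketch of drawing a histogram (assuming Python with matplotlib; Excel and statistical packages have point-and-click equivalents). The scores are hypothetical, not the data behind Figure 14.2.

```python
import matplotlib.pyplot as plt

# Hypothetical self-esteem scores.
scores = [24, 24, 23, 23, 22, 22, 22, 21, 21, 20, 20, 19, 18, 18, 16, 15]

# One bin per possible score, so each bar's height is that score's frequency.
plt.hist(scores, bins=list(range(15, 26)), edgecolor="black")
plt.xlabel("Self-esteem score")
plt.ylabel("Frequency")
plt.title("Distribution of self-esteem scores")
plt.show()
```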
Distribution shapes
When the distribution of a quantitative variable is displayed in a histogram, it has a shape. The shape of the distribution of self-esteem scores in Figure 14.2 is typical. There is a peak somewhere near the middle of the distribution and “tails” that taper in either direction from the peak. The distribution of Figure 14.2 is unimodal, meaning it has one distinct peak, but distributions can also be bimodal, as in Figure 14.3, meaning they have two distinct peaks. Figure 14.3, for example, shows a hypothetical bimodal distribution of scores on the Beck Depression Inventory. I know we talked about the mode mostly for nominal or ordinal variables, but you can actually use histograms to look at the distribution of interval/ratio variables, too, and still have a unimodal or bimodal distribution even if you aren’t calculating a mode. Distributions can also have more than two distinct peaks, but these are relatively rare in social work research.
Another characteristic of the shape of a distribution is whether it is symmetrical or skewed. The distribution in the center of Figure 14.4 is symmetrical. Its left and right halves are mirror images of each other. The distribution on the left is negatively skewed, with its peak shifted toward the upper end of its range and a relatively long negative tail. The distribution on the right is positively skewed, with its peak toward the lower end of its range and a relatively long positive tail.
Chapter Outline
- Ethical and social justice considerations in measurement
- Post-positivism: Assumptions of quantitative methods
- Researcher positionality
- Assessing measurement quality and fighting oppression
Content warning: TBD.
12.1 Ethical and social justice considerations in measurement
Learning Objectives
Learners will be able to...
- Identify potential cultural, ethical, and social justice issues in measurement.
With your variables operationalized, it's time to take a step back and look at how measurement in social science impacts our daily lives. As we will see, how we measure things is shaped by power arrangements inside our society; more insidiously, by establishing what is scientifically true, measures have their own power to influence the world. Just like reification in the conceptual world, how we operationally define concepts can reinforce or fight against oppressive forces.
Data equity
How we decide to measure our variables determines what kind of data we end up with in our research project. Because scientific processes are a part of our sociocultural context, the same biases and oppressions we see in the real world can be manifested or even magnified in research data. Jagadish and colleagues (2021)[1] present four dimensions of data equity that are relevant to consider: representation of non-dominant groups within data sets; how data are collected, analyzed, and combined across datasets; equitable and participatory access to data; and the outcomes associated with the data collection. Historically, we have mostly focused on measures producing outcomes that are biased in one way or another, and this section reviews many such examples. However, it is important to note that equity must also come from designing measures that respond to questions like:
- Are groups historically suppressed from the data record represented in the sample?
- Are equity data gathered by researchers and used to uncover and quantify inequity?
- Are the data accessible across domains and levels of expertise, and can community members participate in the design, collection, and analysis of the public data record?
- Are the data collected used to monitor and mitigate inequitable impacts?
So, it's not just about whether measures work for one population or another. Data equity is about the entire context in which data are created, starting with how we measure people and things. We agree with these authors that data equity should be considered within the context of automated decision-making systems, recognizing a broader literature on the role of administrative systems in creating and reinforcing discrimination. To combat the inequitable processes and outcomes we describe below, researchers must foreground equity as a core component of measurement.
Flawed measures & missing measures
At the end of every semester, students in just about every university classroom in the United States complete similar student evaluations of teaching (SETs). Since every student is likely familiar with these, we can recognize many of the concepts we discussed in the previous sections. There are a number of rating scale questions that ask you to rate the professor, class, and teaching effectiveness on a scale of 1-5. Scores are averaged across students and used to determine the quality of teaching delivered by the faculty member. SETs scores are often a principal component of how faculty are reappointed to teaching positions. Would it surprise you to learn that student evaluations of teaching are of questionable quality? If your instructors are assessed with a biased or incomplete measure, how might that impact your education?
Most often, student scores are averaged across questions and reported as a final average. This average is used as one factor, often the most important factor, in a faculty member's reappointment to teaching roles. We learned in this chapter that rating scales are ordinal, not interval or ratio, and the data are categories not numbers. Although rating scales use a familiar 1-5 scale, the numbers 1, 2, 3, 4, & 5 are really just helpful labels for categories like "excellent" or "strongly agree." If we relabeled these categories as letters (A-E) rather than as numbers (1-5), how would you average them?
Averaging ordinal data is methodologically dubious, as the numbers are merely a useful convention. As you will learn in Chapter 14, taking the median value is what makes the most sense with ordinal data. Median values are also less sensitive to outliers. So, a single student who has strong negative or positive feelings towards the professor could bias the class's SETs scores higher or lower than what the "average" student in the class would say, particularly for classes with few students or in which fewer students completed evaluations of their teachers.
We care about teaching quality because more effective teachers will produce more knowledgeable and capable students. However, student evaluations of teaching are not particularly good indicators of teaching quality and are not associated with the independently measured learning gains of students (i.e., test scores, final grades) (Uttl et al., 2017).[2] This speaks to the lack of criterion validity. Higher teaching quality should be associated with better learning outcomes for students, but across multiple studies stretching back years, there is no association that cannot be better explained by other factors. To be fair, there are scholars who find that SETs are valid and reliable. For a thorough defense of SETs as well as a historical summary of the literature see Benton & Cashin (2012).[3]
Even though student evaluations of teaching often contain dozens of questions, researchers often find that the questions are so highly interrelated that one concept (or factor, as it is called in a factor analysis) explains a large portion of the variance in teachers' scores on student evaluations (Clayson, 2018).[4] Personally, I believe based on completing SETs myself that factor is probably best conceptualized as student satisfaction, which is obviously worthwhile to measure, but is conceptually quite different from teaching effectiveness or whether a course achieved its intended outcomes. The lack of a clear operational and conceptual definition for the variable or variables being measured in student evaluations of teaching also speaks to a lack of content validity. Researchers check content validity by comparing the measurement method with the conceptual definition, but without a clear conceptual definition of the concept measured by student evaluations of teaching, it's not clear how we can know our measure is valid. Indeed, the lack of clarity around what is being measured in teaching evaluations impairs students' ability to provide reliable and valid evaluations. So, while many researchers argue that the class average SETs scores are reliable in that they are consistent over time and across classes, it is unclear what exactly is being measured even if it is consistent (Clayson, 2018).[5]
As a faculty member, there are a number of things I can do to influence my evaluations and disrupt validity and reliability. Since SETs scores are associated with the grades students perceive they will receive (e.g., Boring et al., 2016),[6] guaranteeing everyone a final grade of A in my class will likely increase my SETs scores and my chances at tenure and promotion. I could time an email reminder to complete SETs with releasing high grades for a major assignment to boost my evaluation scores. On the other hand, student evaluations might be coincidentally timed with poor grades or difficult assignments that will bias student evaluations downward. Students may also infer I am manipulating them and give me lower SET scores as a result. To maximize my SET scores and chances at promotion, I also need to select which courses I teach carefully. Classes that are more quantitatively oriented generally receive lower ratings than more qualitative and humanities-driven classes, which makes my decision to teach social work research a poor strategy (Uttl & Smibert, 2017).[7] The only manipulative strategy I will admit to using is bringing food (usually cookies or donuts) to class during the period in which students are completing evaluations. Measurement is impacted by context.
As a white cis-gender male educator, I am adversely impacted by SETs because of their sketchy validity, reliability, and methodology. The other flaws with student evaluations actually help me while disadvantaging teachers from oppressed groups. Heffernan (2021)[8] provides a comprehensive overview of the sexism, racism, ableism, and prejudice baked into student evaluations:
"In all studies relating to gender, the analyses indicate that the highest scores are awarded in subjects filled with young, white, male students being taught by white English first language speaking, able-bodied, male academics who are neither too young nor too old (approx. 35–50 years of age), and who the students believe are heterosexual. Most deviations from this scenario in terms of student and academic demographics equates to lower SET scores. These studies thus highlight that white, able-bodied, heterosexual, men of a certain age are not only the least affected, they benefit from the practice. When every demographic group who does not fit this image is significantly disadvantaged by SETs, these processes serve to further enhance the position of the already privileged" (p. 5).
The staggering consistency of studies examining prejudice in SETs has led to some rather superficial reforms like reminding students to not submit racist or sexist responses in the written instructions given before SETs. Yet, even though we know that SETs are systematically biased against women, people of color, and people with disabilities, the overwhelming majority of universities in the United States continue to use them to evaluate faculty for promotion or reappointment. From a critical perspective, it is worth considering why university administrators continue to use such a biased and flawed instrument. SETs produce data that make it easy to compare faculty to one another and track faculty members over time. Furthermore, they offer students a direct opportunity to voice their concerns and highlight what went well.
Students are the people with the greatest knowledge about what happened in the classroom and whether it met their expectations, so providing them with open-ended questions is the most productive part of SETs. Personally, I have found focus groups written, facilitated, and analyzed by student researchers to be more insightful than SETs. MSW student activists and leaders may look for ways to evaluate faculty that are more methodologically sound and less systematically biased, creating institutional change by replacing or augmenting traditional SETs in their department. There is very rarely student input on the criteria and methodology for teaching evaluations, yet students are the most impacted by helpful or harmful teaching practices.
Students should fight for better assessment in the classroom because well-designed assessments provide documentation to support more effective teaching practices and discourage unhelpful or discriminatory practices. Flawed assessments like SETs can lead to a lack of information about problems with courses, instructors, or other aspects of the program. Think critically about what data your program uses to gauge its effectiveness. How might you introduce areas of student concern into how your program evaluates itself? Are there issues with food or housing insecurity, mentorship of nontraditional and first generation students, or other issues that faculty should consider when they evaluate their program? Finally, as you transition into practice, think about how your agency measures its impact and how it privileges or excludes client and community voices in the assessment process.
Let's consider an example from social work practice. Let's say you work for a mental health organization that serves youth impacted by community violence. How should you measure the impact of your services on your clients and their community? Schools may be interested in reducing truancy, self-injury, or other behavioral concerns. However, by centering delinquent behaviors in how we measure our impact, we may be inattentive to the role of trauma, family dynamics, and other cognitive and social processes beyond "delinquent behavior." Indeed, we may bias our interventions by focusing on things that are not as important to clients' needs. Social workers want to make sure their programs are improving over time, and we rely on our measures to indicate what to change and what to keep. If our measures present a partial or flawed view, we lose our ability to establish and act on scientific truths.
While writing this section, one of the authors wrote this commentary article addressing potential racial bias in social work licensing exams. If you are interested in an example of missing or flawed measures that relates to systems your social work practice is governed by (rather than SETs which govern our practice in higher education) check it out!
You may also be interested in similar arguments against the standard grading scale (A-F), and why grades (numerical, letter, etc.) do not do a good job of measuring learning. Think critically about the role that grades play in your life as a student, your self-concept, and your relationships with teachers. Your test and grade anxiety is due in part to how your learning is measured. Those measurements end up becoming an official record of your scholarship and allow employers or funders to compare you to other scholars. The stakes for measurement are the same for participants in your research study.
Self-reflection and measurement
Student evaluations of teaching are just like any other measure. How we decide to measure what we are researching is influenced by our backgrounds, including our culture, implicit biases, and individual experiences. For me as a middle-class, cisgender white woman, the decisions I make about measurement will probably default to ones that make the most sense to me and others like me, and thus measure characteristics about us most accurately if I don't think carefully about it. There are major implications for research here because this could affect the validity of my measurements for other populations.
This doesn't mean that standardized scales or indices, for instance, won't work for diverse groups of people. What it means is that researchers must not ignore difference in deciding how to measure a variable in their research. Doing so may serve to push already marginalized people further into the margins of academic research and, consequently, social work intervention. Social work researchers, with our strong orientation toward celebrating difference and working for social justice, are obligated to keep this in mind for ourselves and encourage others to think about it in their research, too.
This involves reflecting on what we are measuring, how we are measuring, and why we are measuring. Do we have biases that impacted how we operationalized our concepts? Did we include stakeholders and gatekeepers in the development of our concepts? This can be a way to gain access to vulnerable populations. What feedback did we receive on our measurement process and how was it incorporated into our work? These are all questions we should ask as we are thinking about measurement. Further, engaging in this intentionally reflective process will help us maximize the chances that our measurement will be accurate and as free from bias as possible.
The NASW Code of Ethics discusses social work research and the importance of engaging in practices that do not harm participants. This is especially important considering that many of the topics studied by social workers are those that are disproportionately experienced by marginalized and oppressed populations. Some of these populations have had negative experiences with the research process: historically, their stories have been viewed through lenses that reinforced the dominant culture's standpoint. Thus, when thinking about measurement in research projects, we must remember that the way in which concepts or constructs are measured will impact how marginalized or oppressed persons are viewed. It is important that social work researchers examine current tools to ensure appropriateness for their population(s). Sometimes this may require researchers to use existing tools. Other times, this may require researchers to adapt existing measures or develop completely new measures in collaboration with community stakeholders. In summary, the measurement protocols selected should be tailored and attentive to the experiences of the communities to be studied.
Unfortunately, social science researchers do not do a great job of sharing their measures in a way that allows social work practitioners and administrators to use them to evaluate the impact of interventions and programs on clients. Few scales are published under an open copyright license that allows other people to view them for free and share them with others. Instead, the best way to find a scale mentioned in an article is often to simply search for it in Google with ".pdf" or ".docx" in the query to see if someone posted a copy online (usually in violation of copyright law). As we discussed in Chapter 4, this is an issue of information privilege, or the structuring impact of oppression and discrimination on groups' access to and use of scholarly information. As a student at a university with a research library, you can access the Mental Measurement Yearbook to look up scales and indexes that measure client or program outcomes while researchers unaffiliated with university libraries cannot do so. Similarly, the vast majority of scholarship in social work and allied disciplines does not share measures, data, or other research materials openly, a best practice in open and collaborative science. In many cases, the public paid for these research materials as part of grants; yet the projects close off access to much of the study information. It is important to underscore these structural barriers to using valid and reliable scales in social work practice. An invalid or unreliable outcome test may cause ineffective or harmful programs to persist or may worsen existing prejudices and oppressions experienced by clients, communities, and practitioners.
But it's not just about reflecting and identifying problems and biases in our measurement, operationalization, and conceptualization—what are we going to do about it? Consider this as you move through this book and become a more critical consumer of research. Sometimes there isn't something you can do in the immediate sense—the literature base at this moment just is what it is. But how does that inform what you will do later?
A place to start: Stop oversimplifying race
We will address many more of the critical issues related to measurement in the next chapter. One way to get started in bringing cultural awareness to scientific measurement is through a critical examination of how we analyze race quantitatively. There are many important methodological objections to how we measure the impact of race. We encourage you to watch Dr. Abigail Sewell's three-part workshop series called "Nested Models for Critical Studies of Race & Racism" for the Inter-university Consortium for Political and Social Research (ICPSR). She discusses how to operationalize and measure inequality, racism, and intersectionality and critiques researchers' attempts to oversimplify or overlook racism when we measure concepts in social science. If you are interested in developing your social work research skills further, consider applying for financial support from your university to attend an ICPSR summer seminar like Dr. Sewell's where you can receive more advanced and specialized training in using research for social change.
- Part 1: Creating Measures of Supraindividual Racism (2-hour video)
- Part 2: Evaluating Population Risks of Supraindividual Racism (2-hour video)
- Part 3: Quantifying Intersectionality (2-hour video)
Key Takeaways
- Social work researchers must be attentive to personal and institutional biases in the measurement process that affect marginalized groups.
- What is measured and how it is measured is shaped by power, and social workers must be critical and self-reflective in their research projects.
Exercises
Think about your current research question and the tool(s) that you see researchers use to gather data.
- How does their positionality and experience shape what variables they are choosing to measure and how they measure concepts?
- Evaluate the measures in your study for potential biases.
- If you are using measures developed by another researcher to inform your ideas, investigate whether the measure is valid and reliable in other studies across cultures.
10.2 Post-positivism: The assumptions of quantitative methods
Learning Objectives
Learners will be able to...
- Ground your research project and working question in the philosophical assumptions of social science
- Define the terms 'ontology' and 'epistemology' and explain how they relate to quantitative and qualitative research methods
- Apply feminist, anti-racist, and decolonization critiques of social science to your project
- Define axiology and describe the axiological assumptions of research projects
What are your assumptions?
Social workers must understand measurement theory to engage in social justice work. That's because measurement theory and its supporting philosophical assumptions will help sharpen your perceptions of the social world. They help social workers build heuristics that can help identify the fundamental assumptions at the heart of social conflict and social problems. They alert you to the patterns in the underlying assumptions that different people make and how those assumptions shape their worldview, what they view as true, and what they hope to accomplish. In the next section, we will review feminist and other critical perspectives on research, and they should help inform you of how assumptions about research can reinforce oppression.
Understanding these deeper structures behind research evidence is a true gift of social work research. Because we acknowledge the usefulness and truth value of multiple philosophies and worldviews contained in this chapter, we can arrive at a deeper and more nuanced understanding of the social world.
Building your ice float
Before we can dive into philosophy, we need to recall our conversation from Chapter 1 about objective truth and subjective truths. Let's test your knowledge with a quick example. Is crime on the rise in the United States? A recent FiveThirtyEight article highlights the disparity between historical trends on crime, which are at or near their lowest point in thirty years, and broad perceptions by the public that crime is on the rise (Koerth & Thomson-DeVeaux, 2020).[9] Social workers skilled at research can marshal objective truth through statistics, much like the authors do, to demonstrate that people's perceptions are not based on a rational interpretation of the world. Of course, that is not where our work ends. Subjective truths might decenter this narrative of ever-increasing crime, deconstruct its racist and oppressive origins, or simply document how that narrative shapes how individuals and communities conceptualize their world.
Objective does not mean right, and subjective does not mean wrong. Researchers must understand what kind of truth they are searching for so they can choose a theoretical framework, methodology, and research question that matches. As we discussed in Chapter 1, researchers seeking objective truth (one of the philosophical foundations at the bottom of Figure 7.1) often employ quantitative methods (one of the methods at the top of Figure 7.1). Similarly, researchers seeking subjective truths (again, at the bottom of Figure 7.1) often employ qualitative methods (at the top of Figure 7.1). This chapter is about the connective tissue, and by the time you are done reading, you should have a first draft of a theoretical and philosophical (a.k.a. paradigmatic) framework for your study.
Ontology: Assumptions about what is real & true
In section 1.2, we reviewed the two types of truth that social work researchers seek—objective truth and subjective truths—and linked these with the methods—quantitative and qualitative—that researchers use to study the world. If those ideas aren’t fresh in your mind, you may want to navigate back to that section for an introduction.
These two types of truth rely on different assumptions about what is real in the social world—i.e., they have a different ontology. Ontology refers to the study of being (literally, it means “rational discourse about being”). In philosophy, basic questions about existence are typically posed as ontological, e.g.:
- What is there?
- What types of things are there?
- How can we describe existence?
- What kind of categories can things go into?
- Are the categories of existence hierarchical?
Objective vs. subjective ontologies
At first, it may seem silly to question whether the phenomena we encounter in the social world are real. Of course you exist, your thoughts exist, your computer exists, and your friends exist. You can see them with your eyes. This is the ontological framework of realism, which simply means that the concepts we talk about in science exist independent of observation (Burrell & Morgan, 1979).[10] Obviously, when we close our eyes, the universe does not disappear. You may be familiar with the philosophical conundrum: "If a tree falls in a forest and no one is around to hear it, does it make a sound?"
The natural sciences, like physics and biology, also generally rely on the assumption of realism. Lone trees falling make a sound. We assume that gravity and the rest of physics are there, even when no one is there to observe them. Mitochondria are easy to spot with a powerful microscope, and we can observe and theorize about their function in a cell. The gravitational force is invisible, but clearly apparent from observable facts, such as watching an apple fall from a tree. Of course, our theories about gravity have changed over the years. Improvements were made when observations could not be correctly explained using existing theories and new theories emerged that provided a better explanation of the data.
As we discussed in section 1.2, culture-bound syndromes are an excellent example of where you might come to question realism. Of course, from a Western perspective as researchers in the United States, we think that the Diagnostic and Statistical Manual (DSM) classification of mental health disorders is real and that these culture-bound syndromes are aberrations from the norm. But what about if you were a person from Korea experiencing Hwabyeong? Wouldn't you consider the Western diagnosis of somatization disorder to be incorrect or incomplete? This conflict raises the question–do either Hwabyeong or DSM diagnoses like post-traumatic stress disorder (PTSD) really exist at all...or are they just social constructs that only exist in our minds?
If your answer is “no, they do not exist,” you are adopting the ontology of anti-realism (or relativism), or the idea that social concepts do not exist outside of human thought. Unlike the realists who seek a single, universal truth, the anti-realists perceive a sea of truths, created and shared within a social and cultural context. Unlike objective truth, which is true for all, subjective truths will vary based on who you are observing and the context in which you are observing them. The beliefs, opinions, and preferences of people are actually truths that social scientists measure and describe. Additionally, subjective truths do not exist independent of human observation because they are the product of the human mind. We negotiate what is true in the social world through language, arriving at a consensus and engaging in debate within our socio-cultural context.
These theoretical assumptions should sound familiar if you've studied social constructivism or symbolic interactionism in your other MSW courses, most likely in human behavior in the social environment (HBSE).[11] From an anti-realist perspective, what distinguishes the social sciences from natural sciences is human thought. When we try to conceptualize trauma from an anti-realist perspective, we must pay attention to the feelings, opinions, and stories in people's minds. In their most radical formulations, anti-realists propose that these feelings and stories are all that truly exist.
What happens when a situation is incorrectly interpreted? Certainly, who is correct about what is a bit subjective. It depends on who you ask. Even if you can determine whether a person is actually incorrect, they think they are right. Thus, what may not be objectively true for everyone is nevertheless true to the individual interpreting the situation. Furthermore, they act on the assumption that they are right. We all do. Much of our behaviors and interactions are a manifestation of our personal subjective truth. In this sense, even incorrect interpretations are truths, even though they are true only to one person or a group of misinformed people. This leads us to question whether the social concepts we think about really exist. For researchers using subjective ontologies, these concepts might only exist in our minds, whereas researchers who use objective ontologies assume these concepts exist independent of thought.
How do we resolve this dichotomy? As social workers, we know that oftentimes what appears to be an either/or situation is actually a both/and situation. Let's take the example of trauma. There is clearly an objective thing called trauma. We can draw out objective facts about trauma and how it interacts with other concepts in the social world such as family relationships and mental health. However, that understanding is always bound within a specific cultural and historical context. Moreover, each person's individual experience and conceptualization of trauma is also true. Much like a client who tells you their truth through their stories and reflections, when a participant in a research study tells you what their trauma means to them, it is real even though only they experience and know it that way. By using both objective and subjective analytic lenses, we can explore different aspects of trauma—what it means to everyone, always, everywhere, and what it means to one person or group of people, in a specific place and time.
Epistemology: Assumptions about how we know things
Having discussed what is true, we can proceed to the next natural question—how can we come to know what is real and true? This is epistemology. Epistemology is derived from the Ancient Greek epistēmē which refers to systematic or reliable knowledge (as opposed to doxa, or “belief”). Basically, it means “rational discourse about knowledge,” and the focus is the study of knowledge and methods used to generate knowledge. Epistemology has a history as long as philosophy, and lies at the foundation of both scientific and philosophical knowledge.
Epistemological questions include:
- What is knowledge?
- How can we claim to know anything at all?
- What does it mean to know something?
- What makes a belief justified?
- What is the relationship between the knower and what can be known?
While these philosophical questions can seem far removed from real-world interaction, thinking about these kinds of questions in the context of research helps you target your inquiry by informing your methods and helping you revise your working question. Epistemology is closely connected to method as they are both concerned with how to create and validate knowledge. Research methods are essentially epistemologies – by following a certain process we support our claim to know about the things we have been researching. Inappropriate or poorly followed methods can undermine claims to have produced new knowledge or discovered a new truth. This can have implications for future studies that build on the data and/or conceptual framework used.
Research methods can be thought of as essentially stripped down, purpose-specific epistemologies. The knowledge claims that underlie the results of surveys, focus groups, and other common research designs ultimately rest on the epistemological assumptions of their methods. Focus groups and other qualitative methods usually rely on subjective epistemological (and ontological) assumptions. Surveys and other quantitative methods usually rely on objective epistemological assumptions. These epistemological assumptions often entail congruent subjective or objective ontological assumptions about the ultimate nature of reality.
Objective vs. subjective epistemologies
One key consideration here is the status of ‘truth’ within a particular epistemology or research method. If, for instance, some approaches emphasize subjective knowledge and deny the possibility of an objective truth, what does this mean for choosing a research method?
We began to answer this question in Chapter 1 when we described the scientific method and objective and subjective truths. Epistemological subjectivism focuses on what people think and feel about a situation, while epistemological objectivism focuses on objective facts independent of our interpretation of a situation (Lin, 2015).[12]
While there are many important questions about epistemology to ask (e.g., "How can I be sure of what I know?" or "What can I not know?"; see Willis, 2007[13] for more), from a pragmatic perspective the most relevant epistemological question in the social sciences is whether truth is better accessed using numerical data or words and performances. Generally, scientists approaching research with an objective epistemology (and realist ontology) will use quantitative methods to arrive at scientific truth. Quantitative methods examine numerical data to precisely describe and predict elements of the social world. For example, while people can have different definitions for poverty, an objective measurement such as an annual income of "less than $25,100 for a family of four" provides a precise measurement that can be compared to incomes from all other people in any society from any time period, and refers to real quantities of money that exist in the world. Mathematical relationships are uniquely useful in that they allow comparisons across individuals as well as time and space. In this book, we will review the most common designs used in quantitative research: surveys and experiments. These types of studies usually rely on the epistemological assumption that mathematics can represent the phenomena and relationships we observe in the social world.
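To make that epistemological assumption concrete, here is a minimal sketch in Python (with hypothetical household incomes and the $25,100 threshold quoted above) of how an objective cut-off lets us compare cases on one shared scale:

```python
# Hypothetical annual incomes for four families of four, in US dollars
incomes = [18_500, 31_200, 24_999, 52_000]

POVERTY_LINE = 25_100  # threshold for a family of four, as quoted above

below_line = [income < POVERTY_LINE for income in incomes]
poverty_rate = sum(below_line) / len(incomes)

print(below_line)    # [True, False, True, False]
print(poverty_rate)  # 0.5 -- the same comparison works for any group,
                     # as long as income is measured in comparable units
```

Because the threshold and the incomes are both expressed numerically, the comparison does not depend on any one person's interpretation of what "poverty" means, which is exactly the appeal of an objective epistemology.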
Although mathematical relationships are useful, they are limited in what they can tell you. While you can use quantitative methods to measure individuals' experiences and thought processes, you will miss the story behind the numbers. To analyze stories scientifically, we need to examine their expression in interviews, journal entries, performances, and other cultural artifacts using qualitative methods. Because social science studies human interaction and the reality we all create and share in our heads, subjectivists focus on language and other ways we communicate our inner experience. Qualitative methods allow us to scientifically investigate language and other forms of expression—to pursue research questions that explore the words people write and speak. This is consistent with epistemological subjectivism's focus on individual and shared experiences, interpretations, and stories.
It is important to note that qualitative methods are entirely compatible with seeking objective truth. Approaching qualitative analysis with a more objective perspective, we look simply at what was said and examine its surface-level meaning. If a person says they brought their kids to school that day, then that is what is true. A researcher seeking subjective truth may focus on how the person says the words—their tone of voice, facial expressions, metaphors, and so forth. By focusing on these things, the researcher can understand what it meant to the person to say they dropped their kids off at school. Perhaps in describing dropping their children off at school, the person thought of their parents doing the same thing or tried to understand why their kid didn't wave back to them as they left the car. In this way, subjective truths are deeper, more personalized, and difficult to generalize.
Self-determination and free will
When scientists observe social phenomena, they often take the perspective of determinism, meaning that what is seen is the result of processes that occurred earlier in time (i.e., cause and effect). This process is represented in the classical formulation of a research question which asks "what is the relationship between X (cause) and Y (effect)?" By framing a research question in such a way, the scientist is disregarding any reciprocal influence that Y has on X. Moreover, the scientist also excludes human agency from the equation. It is simply that a cause will necessitate an effect. For example, a researcher might find that few people living in neighborhoods with higher rates of poverty graduate from high school, and thus conclude that poverty causes adolescents to drop out of school. This conclusion, however, does not address the story behind the numbers. Each person who is counted as graduating or dropping out has a unique story of why they made the choices they did. Perhaps they had a mentor or parent that helped them succeed. Perhaps they faced the choice between employment to support family members or continuing in school.
For this reason, determinism is critiqued as reductionistic in the social sciences because people have agency over their actions, unlike the objects studied by natural sciences like physics. While a table isn't aware of the friction it has with the floor, parents and children are likely aware of the friction in their relationships and act based on how they interpret that conflict. The opposite of determinism is free will, the idea that humans can choose how they act and that their behavior and thoughts are not solely determined by what happened prior in a neat, cause-and-effect relationship. Researchers adopting a perspective of free will view the process of, continuing with our education example, seeking higher education as the result of a number of mutually influencing forces and the spontaneous and implicit processes of human thought. For these researchers, the picture painted by determinism is too simplistic.
A similar dichotomy can be found in the debate between individualism and holism. When you hear something like "the disease model of addiction leads to policies that pathologize and oppress people who use drugs," the speaker is making a methodologically holistic argument. They are making a claim that abstract social forces (the disease model, policies) can cause things to change. A methodological individualist would critique this argument by saying that the disease model of addiction doesn't actually cause anything by itself. From this perspective, it is the individuals, rather than any abstract social force, who oppress people who use drugs. The disease model itself doesn't cause anything to change; the individuals who follow the precepts of the disease model are the agents who actually oppress people in reality. To an individualist, all social phenomena are the result of individual human action and agency. To a holist, social forces can determine outcomes for individuals without individuals playing a causal role, undercutting free will and research projects that seek to maximize human agency.
Exercises
- Examine an article from your literature review
- Is human action, or free will, informing how the authors think about the people in their study?
- Or are humans more passive and what happens to them more determined by the social forces that influence their life?
- Reflect on how this project's assumptions may differ from your own assumptions about free will and determinism. For example, my beliefs about self-determination and free will always inform my social work practice. However, my working question and research project may rely on social theories that are deterministic and do not address human agency.
Radical change
Another assumption scientists make concerns the nature of the social world. Is it an orderly place that remains relatively stable over time? Or is it a place of constant change and conflict? The view of the social world as an orderly place can help a researcher describe how things fit together to create a cohesive whole. For example, systems theory can help you understand how different systems interact with and influence one another, drawing energy from one place to another through an interconnected network with a tendency towards homeostasis. This is a more consensus-focused and status-quo-oriented perspective. Yet, this view of the social world cannot adequately explain the radical shifts and revolutions that occur. It also leaves little room for human action and free will. In this more radical view, change occurs at the level of the fundamental assumptions about how the social world works.
For example, at the time of this writing, protests are taking place across the world to remember the killing of George Floyd by Minneapolis police and other victims of police violence and systematic racism. Public support for Black Lives Matter, an anti-racist activist group that focuses on police violence and criminal justice reform, has shifted radically in the two weeks since the killing, an increase equivalent to the previous 21 months of advocacy and social movement organizing (Cohn & Quealy, 2020).[14] Abolition of police and prisons, once a fringe idea, has moved into the conversation about remaking the criminal justice system from the ground up, centering its historic and current role as an oppressive system for Black Americans. Seemingly overnight, reducing the money spent on police and giving that money to social services became a moderate political position.
A researcher centering change may choose to understand this transformation or even incorporate radical anti-racist ideas into the design and methods of their study. For an example of how to do so, see this participatory action research study working with Black and Latino youth (Bautista et al., 2013).[15] Contrastingly, a researcher centering consensus and the status quo might focus on incremental changes in what people currently think about the topic. For example, see this survey of social work student attitudes on poverty and race that seeks to understand the status quo of student attitudes and suggest small changes that might improve things (Constance-Huggins et al., 2020).[16] To be clear, both studies contribute to racial justice. However, you can see by examining the methods section of each article how the participatory action research article addresses power and values as a core part of its research design (qualitative ethnography and deep observation over many years), in ways that privilege the voice of people with the least power. In this way, it seeks to rectify the epistemic injustice of excluding and oversimplifying Black and Latino youth. Contrast this more radical approach with the more traditional approach taken in the second article, in which the researchers measured student attitudes using a survey they developed.
Exercises
- Examine an article from your literature review
- Traditional studies will be less participatory. The researcher will determine the research question, how to measure it, data collection, etc.
- Radical studies will be more participatory. The researcher seeks to undermine power imbalances at each stage of the research process.
- Pragmatically, more participatory studies take longer to complete and are less suited to projects that need to be completed in a short time frame.
Axiology: Assumptions about values
Axiology is the study of values and value judgements (literally “rational discourse about values [axía]”). In philosophy this field is subdivided into ethics (the study of morality) and aesthetics (the study of beauty, taste and judgement). For the hard-nosed scientist, the relevance of axiology might not be obvious. After all, what difference do one’s feelings make for the data collected? Don’t we spend a long time trying to teach researchers to be objective and remove their values from the scientific method?
Like ontology and epistemology, the import of axiology is typically built into research projects and exists “below the surface”. You might not consciously engage with values in a research project, but they are still there. Similarly, you might not hear many researchers refer to their axiological commitments but they might well talk about their values and ethics, their positionality, or a commitment to social justice.
Our values focus and motivate our research. These values could include a commitment to scientific rigor, or to always act ethically as a researcher. At a more general level we might ask: What matters? Why do research at all? How does it contribute to human wellbeing? Almost all research projects are grounded in trying to answer a question that matters or has consequences. Some research projects are even explicit in their intention to improve things rather than observe them. This is most closely associated with “critical” approaches.
Critical and radical views of science focus on how to spread knowledge and information in a way that combats oppression. These questions are central for creating research projects that fight against the objective structures of oppression—like unequal pay—and their subjective counterparts in the mind—like internalized sexism. For example, a more critical research project would fight not only against statutes of limitations for sexual assault but also against the ways women have internalized rape culture. Its explicit goal would be to fight oppression and to inform practice on women's liberation. For this reason, creating change is baked into the research questions and methods used in more critical and radical research projects.
As part of studying radical change and oppression, we are likely employing a model of science that puts values front-and-center within a research project. All social work research is values-driven, as we are a values-driven profession. Historically, though, most social scientists have argued for values-free science. Scientists agree that science helps human progress, but they hold that researchers should remain as objective as possible—which means putting aside politics and personal values that might bias their results, similar to the cognitive biases we discussed in section 1.1. Over the course of the last century, this perspective was challenged by scientists who approached research from an explicitly political and values-driven perspective. As we discussed earlier in this section, feminist critiques strive to understand how sexism biases research questions, samples, measures, and conclusions, while decolonization critiques try to de-center the Western perspective of science and truth.
Linking axiology, epistemology, and ontology
It is important to note that both values-central and values-neutral perspectives are useful in furthering social justice. Values-neutral science is helpful at predicting phenomena. Indeed, it matches well with objectivist ontologies and epistemologies. Let's examine a measure of depression, the Patient Health Questionnaire (PHQ-9). The authors of this measure spent years creating a measure that accurately and reliably measures the concept of depression. This measure is assumed to measure depression in any person, and scales like this are often translated into other languages (and subsequently validated) for more widespread use. The goal is to measure depression in a valid and reliable manner. We can use this objective measure to predict relationships with other risk and protective factors, such as substance use or poverty, as well as evaluate the impact of evidence-based treatments for depression like narrative therapy.
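As a brief illustration of what a standardized measure like this looks like in practice, here is a minimal sketch in Python of computing a summed severity score. The item responses are hypothetical, and the cut points shown are the commonly reported PHQ-9 severity bands:

```python
# Hypothetical responses from one participant: nine items, each scored 0-3
# (0 = "not at all" ... 3 = "nearly every day")
responses = [1, 2, 0, 1, 3, 0, 1, 2, 0]

total = sum(responses)  # possible range: 0 to 27

# Commonly reported severity bands for a PHQ-9 total score
if total <= 4:
    severity = "minimal"
elif total <= 9:
    severity = "mild"
elif total <= 14:
    severity = "moderate"
elif total <= 19:
    severity = "moderately severe"
else:
    severity = "severe"

print(total, severity)  # 10 moderate
```

Because every respondent answers the same nine items on the same 0 to 3 scale, scores can be compared across people, languages, and studies, which is exactly what the predictive uses described above depend on.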
While measures like the PHQ-9 help with prediction, they do not allow you to understand an individual person's experience of depression. To do so, you need to listen to their stories and how they make sense of the world. The goal of understanding isn't to predict what will happen next, but to empathically connect with the person and truly understand what's happening from their perspective. Understanding fits best in subjectivist epistemologies and ontologies, as they allow for multiple truths (i.e., that multiple interpretations of the same situation are valid). Although all researchers addressing depression are working towards socially just ends, the values commitments researchers make as part of the research process influence them to adopt objective or subjective ontologies and epistemologies.
Exercises
What role will values play in your study?
- Are you looking to be as objective as possible, putting aside your own values?
- Or are you infusing values into each aspect of your research design?
Remember that although social work is a values-based profession, that does not mean that all social work research is values-informed. The majority of social work research is objective and tries to be value-neutral in how it approaches research.
Positivism: Researcher as "expert"
Positivism (and post-positivism) is the dominant paradigm in social science. We define a paradigm as a set of common philosophical (ontological, epistemological, and axiological) assumptions that inform research. The four paradigms we describe in this section refer to patterns in how groups of researchers resolve philosophical questions. Some assumptions naturally make sense together, and paradigms grow out of researchers with shared assumptions about what is important and how to study it. Paradigms are like “analytic lenses” and provide a framework on top of which we can build theoretical and empirical knowledge (Kuhn, 1962).[17] Consider this video of an interview with world-famous physicist Richard Feynman in which he explains why "when you explain a 'why,' you have to be in some framework that you allow something to be true. Otherwise, you are perpetually asking why." In order to answer a basic physics question like "what is happening when two magnets attract?" or a social work research question like "what is the impact of this therapeutic intervention on depression?", you must understand the assumptions you are making about social science and the social world. Paradigmatic assumptions about objective and subjective truth support methodological choices like whether to conduct interviews or send out surveys, for example.
When you think of science, you are probably thinking of positivistic science—like the kind the physicist Richard Feynman did. It has its roots in the scientific revolution of the Enlightenment. Positivism is based on the idea that we can come to know facts about the natural world through our experiences of it. The processes that support this are the logical and analytic classification and systemization of these experiences. Through this process of empirical analysis, positivists aim to arrive at descriptions of law-like relationships and mechanisms that govern the world we experience.
Positivists have traditionally claimed that the only authentic knowledge we have of the world is empirical and scientific. Essentially, positivism downplays any gap between our experiences of the world and the way the world really is; instead, positivism determines objective “facts” through the correct methodological combination of observation and analysis. Data collection methods typically include quantitative measurement, which is supposed to overcome the individual biases of the researcher.
Positivism aspires to high standards of validity and reliability supported by evidence, and has been applied extensively in both physical and social sciences. Its goal is familiar to all students of science: iteratively expanding the evidence base of what we know is true. We can know our observations and analysis describe real-world phenomena because researchers separate themselves and objectively observe the world, placing a deep epistemological separation between “the knower” and “what is known” and reducing the possibility of bias. We can all see the logic in separating yourself as much as possible from your study so as not to bias it, even if we know we cannot do so perfectly.
However, the criticism often made of positivism with regard to human and social sciences (e.g., education, psychology, sociology) is that positivism is scientistic, which is to say that it overlooks differences between the objects in the natural world (tables, atoms, cells, etc.) and the subjects in the social world (self-aware people living in a complex socio-historical context). In pursuit of the generalizable truth of “hard” science, it fails to adequately explain the many aspects of human experience that don’t conform to this way of collecting data. Furthermore, by viewing science as an idealized pursuit of pure knowledge, positivists may ignore the many ways in which power structures our access to scientific knowledge, the tools to create it, and the capital to participate in the scientific community.
Kivunja & Kuyini (2017)[18] describe the essential features of positivism as:
- A belief that theory is universal and law-like generalizations can be made across contexts
- The assumption that context is not important
- The belief that truth or knowledge is ‘out there to be discovered’ by research
- The belief that cause and effect are distinguishable and analytically separable
- The belief that results of inquiry can be quantified
- The belief that theory can be used to predict and to control outcomes
- The belief that research should follow the scientific method of investigation
- Rests on formulation and testing of hypotheses
- Employs empirical or analytical approaches
- Pursues an objective search for facts
- Believes in ability to observe knowledge
- The researcher’s ultimate aim is to establish a comprehensive universal theory, to account for human and social behavior
- Application of the scientific method
Many quantitative researchers now identify as postpositivist. Postpositivism retains the idea that truth should be considered objective, but asserts that our experiences of such truths are necessarily imperfect because they are mediated by our values and experiences. Understanding how postpositivism has updated itself in light of the developments in other research paradigms is instructive for developing your own paradigmatic framework. Epistemologically, postpositivists operate on the assumption that human knowledge is based not on the assessments of an objective individual, but rather upon human conjectures. Because human knowledge is thus unavoidably conjectural and uncertain, assertions about what is true and why it is true can be modified or withdrawn in the light of further investigation. However, postpositivism is not a form of relativism, and generally retains the idea of objective truth.
These epistemological assumptions are based on ontological assumptions that an objective reality exists but, contra positivists, that reality can be known only imperfectly and probabilistically. While positivists believe that research is or can be value-free or value-neutral, postpositivists take the position that bias is undesired but inevitable, and therefore the investigator must work to detect and try to correct it. Postpositivists work to understand how their axiology (i.e., values and beliefs) may have influenced their research, including through their choice of measures, populations, questions, and definitions, as well as through their interpretation and analysis of their work. Methodologically, they may use quantitative, qualitative, or mixed methods, accepting the problematic nature of “objective” truths and seeking to find ways to come to a better, yet ultimately imperfect, understanding of what is true. A popular form of postpositivism is critical realism, which lies between positivism and interpretivism.
Is positivism right for your project?
Positivism is concerned with understanding what is true for everybody. Social workers whose working question fits best with the positivist paradigm will want to produce data that are generalizable and can speak to larger populations. For this reason, positivistic researchers favor quantitative methods—probability sampling, experimental or survey design, and multiple, standardized instruments to measure key concepts.
A positivist orientation to research is appropriate when your research question asks for generalizable truths. For example, your working question may look something like: does my agency's housing intervention lead to fewer periods of homelessness for our clients? It is necessary to study such a relationship quantitatively and objectively. When social workers speak about social problems impacting societies and individuals, they reference positivist research, including experiments and surveys of the general population. Positivist research is exceptionally good at producing cause-and-effect explanations that apply across many different situations and groups of people. There are many good reasons why positivism is the dominant research paradigm in the social sciences.
Critiques of positivism stem from two major issues. First and foremost, positivism may not fit the messy, contradictory, and circular world of human relationships. A positivistic approach does not allow the researcher to understand another person's subjective mental state in detail. This is because the positivist orientation focuses on quantifiable, generalizable data—and therefore encompasses only a small fraction of what may be true in any given situation. This critique is emblematic of the interpretivist paradigm, which we will describe when we conceptualize qualitative research methods.
Also in qualitative methods, we will describe the critical paradigm, which critiques the positivist paradigm (and the interpretivist paradigm) for focusing too little on social change, values, and oppression. Positivists assume they know what is true, but they often do not incorporate the knowledge and experiences of oppressed people, even when those community members are directly impacted by the research. Positivism has been critiqued as ethnocentrist, patriarchal, and classist (Kincheloe & Tobin, 2009).[19] This leads researchers to do research on, rather than with, populations by excluding them from the conceptualization, design, and impact of a project, a topic we discussed in section 2.4. It also leads them to ignore the historical and cultural context that is important to understanding the social world. The result can be a one-dimensional and reductionist view of reality.
Exercises
- From your literature search, identify an empirical article that uses quantitative methods to answer a research question similar to your working question or about your research topic.
- Review the assumptions of the positivist research paradigm.
- Discuss in a few sentences how the author's conclusions are based on some of these paradigmatic assumptions. How might a researcher operating from a different paradigm (e.g., interpretivism, critical) critique these assumptions as well as the conclusions of this study?
10.3 Researcher positionality
Learning Objectives
Learners will be able to...
- Define positionality and explain its impact on the research process
- Identify your positionality using reflexivity
- Reflect on the strengths and limitations of researching as an outsider or insider to the population under study
Most research studies will use the assumptions of positivism or postpositivism to inform their measurement decisions. It is important for researchers to take a step back from the research process and examine their relationship with the topic. Because positivistic research methods require the researcher to be objective, research in this paradigm requires the same kind of reflexive self-awareness that clinical practice does, to ensure that unconscious biases and positionality are not manifested through one's work. The assumptions of positivistic inquiry work best when the researcher's subjectivity is as far removed from the observation and analysis as possible.
Positionality
Student researchers in the social sciences are usually required to identify and articulate their positionality. Frequently teachers and supervisors will expect work to include information about the student’s positionality and its influence on their research. Yet for those commencing a research journey, this may often be difficult and challenging, as students are unlikely to have been required to do so in previous studies. Novice researchers often have difficulty both in identifying exactly what positionality is and in outlining their own. This section explores researcher positionality and its influence on the research process, so that new researchers may better understand why it is important. Researcher positionality is explained, reflexivity is discussed, and the ‘insider-outsider’ debate is critiqued.
The term positionality describes both an individual’s world view and the position they adopt in relation to a research task and its social and political context (Foote & Bartell 2011, Savin-Baden & Major, 2013 and Rowe, 2014). The individual’s world view or ‘where the researcher is coming from’ concerns ontological assumptions (an individual’s beliefs about the nature of social reality and what is knowable about the world), epistemological assumptions (an individual’s beliefs about the nature of knowledge) and assumptions about human nature and agency (an individual’s assumptions about the way we interact with our environment and relate to it) (Sikes, 2004, Bahari, 2010, Scotland, 2012, Ormston, et al. 2014, Marsh, et al. 2018 and Grix, 2019). These are colored by an individual’s values and beliefs that are shaped by their political allegiance, religious faith, gender, sexuality, historical and geographical location, ethnicity, race, social class and status, (dis)abilities, and so on (Sikes, 2004, Wellington, et al. 2005 and Marsh, et al. 2018). Positionality “reflects the position that the researcher has chosen to adopt within a given research study” (Savin-Baden & Major, 2013, p.71, emphasis mine). It influences both how research is conducted and its outcomes and results (Rowe, 2014). It also influences what a researcher has chosen to investigate in the first instance (Malterud, 2001; Grix, 2019).
Positionality is normally identified by locating the researcher in relation to three areas: (1) the subject under investigation, (2) the research participants, and (3) the research context and process (ibid.). Some aspects of positionality are culturally ascribed or generally regarded as being fixed, for example, gender, race, skin-color, nationality. Others, such as political views, personal life-history, and experiences, are more fluid, subjective, and contextual (Chiseri-Strater, 1996). The fixed aspects may predispose someone towards a particular point of view; however, that does not mean that they necessarily and automatically lead to particular views or perspectives. For example, one may think it would be antithetical for an African-American to be a member of a white, conservative, right-wing, racist supremacy group, and, equally, that such a group would not want African-American members. Yet Jansson (2010), in his research on The League of the South, found that not only did a group of this kind have an African-American member, but that he was “warmly welcomed” (ibid. p.21). Mullings (1999, p. 337) suggests that “making the wrong assumptions about the situatedness of an individual’s knowledge based on perceived identity differences may end… access to crucial informants in a research project”. This serves as a reminder that new researchers should not, therefore, make assumptions about others’ perspectives and world views, or pigeonhole someone based on their own (mis)perceptions of them.
Reflexivity
Very little research in the social or educational field is or can be value-free (Carr, 2000). Positionality requires that the researcher both acknowledge and make allowance for their views, values, and beliefs in relation to the research design, conduct, and output(s). Self-reflection and a reflexive approach are both a necessary prerequisite and an ongoing process for the researcher to be able to identify, construct, critique, and articulate their positionality. Simply stated, reflexivity is the concept that researchers should acknowledge and disclose their selves in their research, seeking to understand their part in it, or influence on it (Cohen et al., 2011). Reflexivity informs positionality. It requires an explicit self-consciousness and self-assessment by the researcher about their views and positions and how these might have, directly or indirectly, influenced the design, execution, and interpretation of the research findings (Greenbank, 2003, May & Perry, 2017). Reflexivity necessarily requires sensitivity by the researcher to their cultural, political, and social context (Bryman, 2016) because the individual’s ethics, personal integrity, and social values, as well as their competency, influence the research process (Greenbank, 2003, Bourke, 2014).
As a way for researchers to commence a reflexive approach to their work, Malterud (2001, p.484) suggests that reflexivity “starts by identifying preconceptions brought into the project by the researcher, representing previous personal and professional experiences, pre-study beliefs about how things are and what is to be investigated, motivation and qualifications for exploration of the field, and perspectives and theoretical foundations related to education and interests.” It is important for new researchers to note that their values can, and frequently do, change over time. As such, the subjective contextual aspects of a researcher’s positionality or ‘situatedness’ change over time (Rowe, 2014). Through using a reflexive approach, researchers should continually be aware that their positionality is never fixed and is always situation and context-dependent. Reflexivity is an essential process for informing, developing, and shaping positionality, which may then be clearly articulated.
Positionality impacts the research process
It is essential for new researchers to acknowledge that their positionality is unique to them and that it can impact all aspects and stages of the research process. As Foote and Bartell (2011, p.46) identify “The positionality that researchers bring to their work, and the personal experiences through which positionality is shaped, may influence what researchers may bring to research encounters, their choice of processes, and their interpretation of outcomes.” Positionality, therefore, can be seen to affect the totality of the research process. It acknowledges and recognizes that researchers are part of the social world they are researching and that this world has already been interpreted by existing social actors. This is the opposite of a positivistic conception of objective reality (Cohen et al., 2011; Grix, 2019). Positionality implies that the social-historical-political location of a researcher influences their orientations, i.e., that they are not separate from the social processes they study.
Simply stated, there is no way we can escape the social world we live in to study it (Hammersley & Atkinson, 1995; Malterud, 2001). The use of a reflexive approach to inform positionality is a rejection of the idea that social research is separate from wider society and the individual researcher’s biography. A reflexive approach suggests that, rather than trying to eliminate their effect, researchers should acknowledge and disclose their selves in their work, aiming to understand their influence on and in the research process. It is important for new researchers to note here that their positionality not only shapes their work but influences their interpretation, understanding, and, ultimately, their belief in the truthfulness and validity of other’s research that they read or are exposed to. It also influences the importance given to, the extent of belief in, and their understanding of the concept of positionality.
Open and honest disclosure and exposition of positionality should show where and how the researcher believes that they have, or may have, influenced their research. The reader should then be able to make a better-informed judgment as to the researcher’s influence on the research process and how ‘truthful’ they feel the research data is. Sikes (2004, p.15) argues that it is important for all researchers to spend some time thinking about how they are paradigmatically and philosophically positioned, and for them to be aware of how their positioning and the fundamental assumptions they hold might influence their research-related thinking in practice. This is about being a reflexive and reflective and, therefore, rigorous researcher who can present their findings and interpretations in the confidence that they have thought about, acknowledged and been honest and explicit about their stance and the influence it has had upon their work. For new researchers, doing this can be a complex, difficult, and sometimes extremely time-consuming process. Yet, it is essential to do so. Sultana (2007, p.380), for example, argues that it is “critical to pay attention to positionality, reflexivity, the production of knowledge… to undertake ethical research”. The clear implication is that, without reflexivity on the part of the researcher, their research may not be conducted ethically. Given that no contemporary researcher should engage in unethical research (BERA, 2018), reflexivity and clarification of one’s positionality may, therefore, be seen as essential aspects of the research process.
Finding your positionality
Savin-Baden & Major (2013) identify three primary ways that a researcher may identify and develop their positionality.
- Firstly, locating themselves in relation to the subject (i.e., acknowledging personal positions that have the potential to influence the research).
- Secondly, locating themselves in relation to the participants (i.e., researchers individually considering how they view themselves, as well as how others view them, while at the same time acknowledging that as individuals they may not be fully aware of how they and others have constructed their identities, and recognizing that it may not be possible to do this without considered in-depth thought and critical analysis).
- Thirdly, locating themselves in relation to the research context and process (i.e., acknowledging that research will necessarily be influenced by themselves and by the research context).
- To those, I would add a fourth component; that of time. Investigating and clarifying one’s positionality takes time. New researchers should recognize that exploring their positionality and writing a positionality statement can take considerable time and much ‘soul searching’. It is not a process that can be rushed.
Engaging in a reflexive approach should allow for a reduction of bias and partisanship (Rowe, 2014). However, it must be acknowledged by novice researchers that, no matter how reflexive they are, they can never objectively describe something as it is. We can never objectively describe reality (Dubois, 2015). It must also be borne in mind that language is a human social construct. Experiences and interpretations of language are individually constructed, and the meaning of words is individually and subjectively constructed (von-Glaserfield, 1988). Therefore, no matter how much reflexive practice a researcher engages in, there will always still be some form of bias or subjectivity. Yet, through exploring their positionality, the novice researcher increasingly becomes aware of areas where they may have potential bias and, over time, is better able to identify these so that they may then take account of them. Ormston et al. (2014) suggest that researchers should aim to achieve ‘empathetic neutrality,’ i.e., that they should strive to avoid obvious, conscious, or systematic bias and to be as neutral as possible in the collection, interpretation, and presentation of data…[while recognizing that] this aspiration can never be fully attained – all research will be influenced by the researcher and there is no completely ‘neutral’ or ‘objective’ knowledge.
Positionality statements
Regardless of how they are positioned in terms of their epistemological assumptions, it is crucial that researchers are clear in their minds as to the implications of their stance and that they state their position explicitly (Sikes, 2004). Positionality is often formally expressed in research papers, masters-level dissertations, and doctoral theses via a ‘positionality statement,’ essentially an explanation of how the researcher developed into, and came to be, the researcher they are at the time of the study. For most people, this will necessarily be a fluid statement that changes as they develop both through conducting a specific research project and throughout their research career.
A good strong positionality statement will typically include a description of the researcher’s lenses (such as their philosophical, personal, theoretical beliefs and perspective through which they view the research process), potential influences on the research (such as age, political beliefs, social class, race, ethnicity, gender, religious beliefs, previous career), the researcher’s chosen or pre-determined position about the participants in the project (e.g., as an insider or an outsider), the research-project context and an explanation as to how, where, when and in what way these might, may, or have, influenced the research process (Savin-Baden & Major, 2013). Producing a good positionality statement takes time, considerable thought, and critical reflection. It is particularly important for novice researchers to adopt a reflexive approach and recognize that “The inclusion of reflective accounts and the acknowledgment that educational research cannot be value-free should be included in all forms of research” (Greenbank, 2003).
Yet new researchers also need to realize that reflexivity is not a panacea and that there are limits to self-reflexivity. Reflexivity can help to clarify and contextualize one’s position in relation to the research process for the researcher, the research participants, and readers of research outputs. Yet, it is not a guarantee of more honest, truthful, or ethical research. Nor is it a guarantee of good research (Delamont, 2018). No matter how critically reflective and reflexive one is, aspects of the self can be missed, not known, or deliberately hidden; see, for example, Luft and Ingham’s (1955) Johari Window: the ‘blind area,’ known to others but not to oneself, and the ‘unknown area,’ known neither to others nor to oneself. There are always areas of ourselves that we are not aware of, areas that only other people are aware of, and areas that no one is aware of. One may also, particularly in the early stages of reflection, not be as honest with one’s self as one needs to be (Holmes, 2019).
Novice researchers should realize, right from the very start of the research process, that their positionality will affect their research and will impact their understanding, interpretation, acceptance, and belief, or non-acceptance and disbelief, of others’ research findings. It will also influence their views about reflexivity and the relevance and usefulness of adopting a reflexive approach and articulating their positionality. Each researcher’s positionality affects the research process and their outputs, as well as their interpretation of others’ research. Smith (1999) neatly sums this up, suggesting that “Objectivity, authority and validity of knowledge is challenged as the researcher’s positionality... is inseparable from the research findings”.
Do you need lived experience to research a topic?
The position of the researcher as an insider or an outsider to the culture being studied, and whether one position gives the researcher an advantage over the other in the research process (Hammersley 1993 and Weiner et al. 2012), has been, and remains, a key debate. One area of contention in the insider-outsider debate is whether being an insider to the culture positions the researcher more, or less, advantageously than an outsider. Epistemologically, this is concerned with whether and how it is possible to present information accurately and truthfully.
Merton’s long-standing definition of insiders and outsiders is that “Insiders are the members of specified groups and collectives or occupants of specified social statuses: Outsiders are non-members” (Merton, 1972). Others identify the insider as someone whose personal biography (gender, race, skin-color, class, sexual orientation and so on) gives them a ‘lived familiarity’ with and a priori knowledge of the group being researched. At the same time, the outsider is a person/researcher who does not have any prior intimate knowledge of the group being researched (Griffith, 1998, cited in Mercer, 2007). There are various lines of argument put forward to emphasize the advantages and disadvantages of each position. In its simplest articulation, the insider perspective essentially questions the ability of outsider scholars to competently understand the experiences of those inside the culture, while the outsider perspective questions the ability of the insider scholar to sufficiently detach themselves from the culture to be able to study it without bias (Kusow, 2003).
For a more extensive discussion, see Merton (1972). The main arguments are outlined below. Advantages of an insider position include:
- (1) easier access to the culture being studied, as the researcher is regarded as being ‘one of us’ (Sanghera & Bjokert 2008),
- (2) the ability to ask more meaningful or insightful questions (due to possession of a priori knowledge),
- (3) the researcher may be more trusted so may secure more honest answers,
- (4) the ability to produce a more truthful, authentic or ‘thick’ description (Geertz, 1973) and understanding of the culture,
- (5) potential disorientation due to ‘culture shock’ is removed or reduced, and
- (6) the researcher is better able to understand the language, including colloquial language, and non-verbal cues.
Disadvantages of an insider position include:
- (1) the researcher may be inherently and unknowingly biased, or overly sympathetic to the culture,
- (2) they may be too close to and familiar with the culture (a myopic view), or bound by custom and code so that they are unable to raise provocative or taboo questions,
- (3) research participants may assume that because the insider is ‘one of us’ they possess more or better insider knowledge than they actually do, and that their understandings are the same (which they may not be); therefore, information which should be ‘obvious’ to the insider may not be articulated or explained,
- (4) an inability to bring an external perspective to the process,
- (5) ‘dumb’ questions which an outsider may legitimately ask, may not be able to be asked (Naaek et al. 2010), and
- (6) respondents may be less willing to reveal sensitive information than they would be to an outsider who they will have no future contact with.
Unfortunately, it is the case that each of the above advantages can, depending upon one’s perspective, be equally viewed as being disadvantages, and each of the disadvantages as being advantages, so that “The insider’s strengths become the outsider’s weaknesses and vice versa” (Merriam et al., 2001, p.411). Whether either position offers an advantage over the other is questionable. Hammersley (1993), for example, argues that there are “No overwhelming advantages to being an insider or outsider,” but that each position has both advantages and disadvantages, which take on slightly different weights depending on the specific circumstances and the purpose of the research. Similarly, Mercer (2007) suggests that it is a ‘double-edged sword’ in that what is gained in one area may be lost in another; for example, detailed insider knowledge may mean that the ‘bigger picture’ is not seen.
There is also an argument that insider and outsider as opposites may be an artificial construct. There may be no clear dichotomy between the two positions (Herod, 1999); the researcher may not be either an insider or an outsider, but the positions can be seen as a continuum with conceptual rather than actual endpoints (Christensen & Dahl, 1997, cited in Mercer, 2007). Similarly, Mercer (ibid. p.1) suggests that the insider/outsider dichotomy is, in reality, a continuum with multiple dimensions, and that all researchers constantly move back and forth along several axes, depending upon time, location, participants, and topic. I would argue that a researcher may inhabit multiple positions along that continuum at the same time. Merton (1972, p.28) argues that, sociologically speaking, there is nothing fixed about the boundaries separating insiders from outsiders; as situations involving different values arise, different statuses are activated, and the lines of separation shift. Traditionally, emic and etic perspectives are “Often seen as being at odds - as incommensurable paradigms” (Morris et al. 1999 p.781). Yet the insider and outsider roles are essentially products of the particular situation in which research takes place (Kusow, 2003). As such, they are both researcher- and context-specific, with no clear-cut boundaries, and may not form a strict binary (Mullings, 1999, Chacko, 2004). Researchers may straddle both positions; they may be simultaneously an insider and an outsider (Mohammed, 2001).
For example, a mature female Saudi Ph.D. student studying undergraduate students may be an insider by being a student, yet as a doctoral student, an outsider to undergraduates. They may be regarded as being an insider by Saudi students, but an outsider by students from other countries; an insider to female students, but an outsider to male students; an insider to Muslim students, an outsider to Christian students; an insider to mature students, an outsider to younger students, and so on. Combine these with the many other insider-outsider positions, and it soon becomes clear that it is rarely a case of simply being an insider or outsider, but that of the researcher simultaneously residing in several positions. If insiderness is interpreted by the researcher as implying a single fixed status (such as sex, race, religion, etc.), then the terms insider and outsider are more likely to be seen by them as dichotomous, (because, for example, a person cannot be simultaneously both male and female, black and white, Christian and Muslim). If, on the other hand, a more pluralistic lens is used, accepting that human beings cannot be classified according to a single ascribed status, then the two terms are likely to be considered as being poles of a continuum (Mercer, 2007). The implication is that, as part of the process of reflexivity and articulating their positionality, novice researchers should consider how they perceive the concept of insider-outsiderness– as a continuum or a dichotomy, and take this into account. It has been suggested (e.g., Ritchie, et al. 2009, Kirstetter, 2012) that recent qualitative research has seen a blurring of the separation between insiderness and outsiderness and that it may be more appropriate to define a researcher’s stance by their physical and psychological distance from the research phenomenon under study rather than their paradigmatic position.
An example from the literature
To help novice researchers better understand and reflect on the insider-outsider debate, reference will be made to a paper by Herod (1999), “Reflections on interviewing foreign elites: praxis, positionality, validity and the cult of the leader.” This has been selected because it discusses the insider-outsider debate from the perspective of an experienced researcher who questions some of the assumptions frequently made about insiderness and outsiderness. Novice researchers who wish to explore insider-outsiderness in more detail may benefit from a thorough reading of this work along with those by Chacko (2004) and Mohammed (2001). For more in-depth discussions of positionality, see Clift et al. (2018).
Herod’s paper questions the epistemological assumption that an insider will necessarily produce ‘true’ knowledge, arguing that research is a social process in which the interviewer and interviewee participate jointly in knowledge creation. He posits three issues from first-hand experience, each of which complicates a simple insider-outsider duality.
These are, firstly, the researcher’s ability to consciously manipulate their positionality; secondly, that how others view the researcher may be very different from the researcher’s own view; and thirdly, that positionality changes over time. In respect of the researcher’s ability to consciously manipulate their positionality, he identifies that he deliberately presents himself in different ways in different situations: for example, presenting himself as “Dr.” when corresponding with Eastern European trade unions, as the title conveys status, but in America presenting himself as a teacher without a title to avoid being viewed as a “disconnected academic in my ivory tower” (ibid. p.321).
Similarly, he identifies that he often ‘plays up’ his Britishness, emphasizing outsiderness, because a foreign academic may, he feels, be perceived as ‘harmless’ compared to a domestic academic; thus, interviewees may be more open and candid about certain issues. In respect of how others may view the researcher’s positionality differently from how the researcher views themselves, Herod identifies that his work has involved situations where he is objectively an outsider, and perceives himself as such (i.e., is not a member of the cultural elite he is studying), but where others have not seen him as an outsider. He cites an example of research in Guyana where his permission to interview had been pre-cleared by a high-ranking government official, leading the Guyanese trade union official who collected him from the airport to regard him as a ‘pseudo insider,’ inviting him to his house and treating him as though he were a member of the family. This, Herod indicates, made it more difficult for him to conduct research than if he had been treated as an outsider.
Discussing how positionality may change over time, Herod argues that a researcher who is initially viewed as an outsider will, as time progresses and more contact and discussion take place, increasingly be viewed as an insider due to familiarity. He identifies that this particularly happens with follow-up interviews; in his case, conducting follow-up interviews over three years in the Czech Republic, each a year apart, he found that each time he went the relationship was “more friendly and less distant” (ibid. p.324). Based on his experiences, Herod suggests that if we believe the researcher and interviewee are co-partners in the creation of knowledge, then the question remains as to whether it even really makes sense or is useful to talk about a dichotomy of insider and outsider, particularly given that the positionality of both may change through and across such categories over time or depending upon which attributes of each one’s identities are stressed (ibid. p.325).
Key Takeaways
- Positionality is integral to the process of qualitative research, as is the researcher’s awareness that their own and others’ positionality is not static.
- Identifying and clearly articulating your positionality in respect of the project being undertaken may not be a simple or quick process, yet it is essential to do so.
- Pay particular attention to your multiple positions as an insider or outsider to the research participants and setting(s) where the work is conducted, acknowledging there may be both advantages and disadvantages that may have far-reaching implications for the process of data gathering and interpretation.
- While engaging in reflexive practice and articulating your positionality does not guarantee higher-quality research, doing so will make you a better researcher.
Exercises
- What is your relationship to the population in your study? (insider, outsider, both)
- How is your perspective on the topic informed by your lived experience?
- What biases, beliefs, or assumptions might influence you?
- Why do you want to answer your working question? (i.e., what is your research project's aim)
Go to Google News, YouTube or TikTok, or an internet search engine, and look for first-person narratives about your topic. Try to look for sources that include the person's own voice through quotations or video/audio recordings.
- How is your perspective on the topic different from the person in your narrative?
- How do those differences relate to positionality?
- Look at a research article on your topic.
- How might the study have been different if the person in your narrative were part of the research team?
- What differences might there be in ethics, sampling, measures, or design?
10.4 Assessing measurement quality and fighting oppression
Learning Objectives
Learners will be able to...
- Define construct validity and construct reliability
- Apply measurement quality concepts to address issues of bias and oppression in social science
When researchers fail to account for their positionality as part of the research process, they often create or use measurements that produce biased results. In the previous chapter, we reviewed important aspects of measurement quality. For now, we want to broaden those conversations out slightly to the assumptions underlying quantitative research methods. Because quantitative methods are used as part of systems of social control, it is important to interrogate when their assumptions are violated in order to create social change.
Separating concepts from their measurement in empirical studies
Measurement in social science often involves unobservable theoretical constructs, such as socioeconomic status, teacher effectiveness, and risk of recidivism. As we discussed in Chapter 8, such constructs cannot be measured directly and must instead be inferred from measurements of observable properties (and other unobservable theoretical constructs) thought to be related to them—i.e., operationalized via a measurement model. This process, which necessarily involves making assumptions, introduces the potential for mismatches between the theoretical understanding of the construct purported to be measured and its operationalization.
Many of the harms discussed in the literature on fairness in computational systems are direct results of such mismatches. Some of these harms could have been anticipated and, in some cases, mitigated if viewed through the lens of measurement modeling. To do this, we contribute fairness-oriented conceptualizations of construct reliability and construct validity that provide a set of tools for making explicit and testing assumptions about constructs and their operationalizations.
In essence, we want to make sure that the measures selected for a research project match with the conceptualization for that research project. Novice researchers and practitioners are often inclined to conflate constructs and their operational definitions—i.e., to collapse the distinctions between someone's anxiety and their score on the GAD-7 anxiety inventory. But collapsing these distinctions, either colloquially or epistemically, makes it difficult to anticipate, let alone mitigate, any possible mismatches. When reading a research study, you should be able to see how the researcher's conceptualization informed what indicators and measurements were used. Collapsing the distinction between conceptual definitions and operational definitions is when fairness-related harms are most often introduced into the scientific process.
Making assumptions when measuring
Measurement modeling plays a central role in the quantitative social sciences, where many theories involve unobservable theoretical constructs—i.e., abstractions that describe phenomena of theoretical interest. For example, researchers in psychology and education have long been interested in studying intelligence, while political scientists and sociologists are often concerned with political ideology and socioeconomic status, respectively. Although these constructs do not manifest themselves directly in the world, and therefore cannot be measured directly, they are fundamental to society and thought to be related to a wide range of observable properties.
A measurement model is a statistical model that links unobservable theoretical constructs, operationalized as latent variables, and observable properties—i.e., data about the world [30]. In this section, we give a brief overview of the measurement modeling process, starting with two comparatively simple examples—measuring height and measuring socioeconomic status—before moving on to three well-known examples from the literature on fairness in computational systems. We emphasize that our goal in this section is not to provide comprehensive mathematical details for each of our five examples, but instead to introduce key terminology and, more importantly, to highlight that the measurement modeling process necessarily involves making assumptions that must be made explicit and tested before the resulting measurements are used.
Assumptions of measuring height
We start by formalizing the process of measuring the height of a person—a property that is typically thought of as being observable and therefore easy to measure directly. There are many standard tools for measuring height, including rulers, tape measures, and height rods. Indeed, measurements of observable properties like height are sometimes called representational measurements because they are derived by “representing physical objects [such as people and rulers] and their relationships by numbers” [25]. Although the height of a person is not an unobservable theoretical construct, for the purpose of exposition, we refer to the abstraction of height as a construct H and then operationalize H as a latent variable h.
Despite the conceptual simplicity of height—usually understood to be the length from the bottom of a person’s feet to the top of their head when standing erect—measuring it involves making several assumptions, all of which are more or less appropriate in different contexts and can even affect different people in different ways. For example, should a person’s hair contribute to their height? What about their shoes? Neither are typically viewed as being an intrinsic part of a person’s height, yet both contribute to a person’s effective height, which may matter more in ergonomic contexts. Similarly, if a person uses a wheelchair, then their standing height may be less relevant than their sitting height. These assumptions must be made explicit and tested before using any measurements that depend upon them.
In practice, it is not possible to obtain error-free measurements of a person’s height, even when using standard tools. For example, when using a ruler, the angle of the ruler, the granularity of the marks, and human error can all result in erroneous measurements. However, if we take many measurements of a person’s height, then, provided that the ruler is not statistically biased, their average will converge to the person’s “true” height h. Only by taking infinitely many measurements could we recover h exactly; in practice, our estimate simply improves as the number of measurements grows.
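To make this concrete, here is a minimal sketch (not from any measurement standard; the true height and the size of the error are made up) showing that, when the error is unbiased, the average of repeated measurements drifts toward the true value as the number of measurements grows:

```python
import random

true_height_cm = 170.0  # the latent variable h (value chosen for illustration)

def measure(true_value, error_sd=0.5):
    """One noisy measurement: the true value plus unbiased, normally distributed error."""
    return random.gauss(true_value, error_sd)

for n in (5, 50, 5000):
    measurements = [measure(true_height_cm) for _ in range(n)]
    print(f"n = {n:5d}  mean measurement = {sum(measurements) / n:.3f} cm")
# As n grows, the mean drifts toward 170.0, but no finite number of
# measurements guarantees that we recover h exactly.
```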
In our measurement model, we say that the person’s true height—the latent variable h—influences each measurement we obtain. We refer to models that formalize the relationships between measurements and their errors as measurement error models. In many contexts, it is reasonable to assume that the errors will not undermine the consistency or accuracy of a measure as long as they are normally distributed, statistically unbiased, and have small variance. However, in some contexts, the measurement error may not behave as researchers expect and may even be correlated with demographic factors, such as race or gender.
As an example, suppose that our measurements come not from a ruler but instead from self-reports on dating websites. It might initially seem reasonable to assume that the corresponding errors are well-behaved in this context. However, Toma et al. [54] found that although men and women both over-report their height on dating websites, men are more likely to over-report and to over-report by a larger amount. Toma et al. suggest this is strategic, likely representing intentional deception. However, regardless of the cause, these errors are not well-behaved and are correlated with gender. Assuming that they are well-behaved will yield inaccurate measurements.
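A small follow-on sketch, using invented over-report amounts rather than Toma et al.'s estimates, shows why such errors are a problem: when the error is systematically biased and correlated with gender, averaging more self-reports does not recover anyone's true height:

```python
import random

def self_report(true_height_cm, gender):
    # Invented over-report amounts for illustration only (not Toma et al.'s findings).
    over_report = 2.5 if gender == "man" else 1.0
    return true_height_cm + over_report + random.gauss(0, 0.5)

reported_men = [self_report(175.0, "man") for _ in range(10_000)]
reported_women = [self_report(162.0, "woman") for _ in range(10_000)]

print(sum(reported_men) / len(reported_men))      # about 177.5, not 175.0
print(sum(reported_women) / len(reported_women))  # about 163.0, not 162.0
# The error is statistically biased and correlated with gender, so averaging
# more self-reports never converges to either group's true height.
```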
Measuring socioeconomic status
We now consider the process of measuring a person’s socioeconomic status (SES). From a theoretical perspective, a person’s SES is understood as encompassing their social and economic position in relation to others. Unlike a person’s height, their SES is unobservable, so it cannot be measured directly and must instead be inferred from measurements of observable properties (and other unobservable theoretical constructs) thought to be related to it, such as income, wealth, education, and occupation. Measurements of phenomena like SES are sometimes called pragmatic measurements because they are designed to capture particular aspects of a phenomenon for particular purposes [25].
We refer to the abstraction of SES as a construct S and then operationalize S as a latent variable s. The simplest way to measure a person’s SES is to use an observable property—like their income—as an indicator for it. Letting the construct I represent the abstraction of income and operationalizing I as a latent variable i, this means specifying both a measurement model that links s and i and a measurement error model. For example, if we assume that s and i are linked via the identity function—i.e., that s = i—and we assume that it is possible to obtain error-free measurements of a person’s income—i.e., that î = i—then s = î. Like the previous example, this example highlights that the measurement modeling process necessarily involves making assumptions. Indeed, there are many other measurement models that use income as a proxy for SES but make different assumptions about the specific relationship between them.
Similarly, there are many other measurement error models that make different assumptions about the errors that occur when measuring a person’s income. For example, if we measure a person’s monthly income by totaling the wages deposited into their account over a single one-month period, then we must use a measurement error model that accounts for the possibility that the timing of the one-month period and the timings of their wage deposits may not be aligned. Using a measurement error model that does not account for this possibility—e.g., using î = i—will yield inaccurate measurements.
Human Rights Watch reported exactly this scenario in the context of the Universal Credit benefits system in the U.K. [55]: The system measured a claimant’s monthly income using a one-month rolling period that began immediately after they submitted their claim without accounting for the possibility described above. This meant that the system “might detect that an individual received a £1000 paycheck on March 30 and another £1000 on April 29, but not that each £1000 salary is a monthly wage [leading it] to compute the individual’s benefit in May based on the incorrect assumption that their combined earnings for March and April (i.e., £2000) are their monthly wage,” denying them much-needed resources. Moving beyond income as a proxy for SES, there are arbitrarily many ways to operationalize SES via a measurement model, incorporating both measurements of observable properties, such as wealth, education, and occupation, as well as measurements of other unobservable theoretical constructs, such as cultural capital.
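The arithmetic behind this failure is simple enough to sketch. The code below, with hypothetical dates and a deliberately naive measurement error model (î = i within the assessment window), is not Universal Credit's actual logic; it only illustrates how an assessment period that happens to straddle two paydays records double the claimant's real monthly wage, while the next period records nothing:

```python
from datetime import date

# Hypothetical paydays echoing the amounts quoted above: a £1000 monthly wage
# deposited near the end of each month.
deposits = [(date(2021, 3, 30), 1000), (date(2021, 4, 29), 1000)]

def measured_monthly_income(start, end):
    """Naive measurement: monthly income = wages deposited inside the window."""
    return sum(amount for day, amount in deposits if start <= day <= end)

# An assessment period that happens to contain both paydays...
print(measured_monthly_income(date(2021, 3, 30), date(2021, 4, 29)))  # 2000
# ...is followed by one that contains neither.
print(measured_monthly_income(date(2021, 4, 30), date(2021, 5, 29)))  # 0
# The claimant's wage never changed, but the "measured" monthly income swings
# between £2000 and £0, and benefits computed from it swing with it.
```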
Measuring teacher effectiveness
At the risk of stating the obvious, teacher effectiveness is an unobservable theoretical construct that cannot be measured directly and must instead be inferred from measurements of observable properties (and other unobservable theoretical constructs). Many organizations have developed models that purport to measure teacher effectiveness. For instance, SAS’s Education Value-Added Assessment System (EVAAS), which is widely used across the U.S., implements two models—a multivariate response model (MRM) intended to be used when standardized tests are given to students in consecutive grades and a univariate response model intended to be used in other testing contexts. Although the models differ in terms of their mathematical details, both use changes in students’ test scores (an observable property) as a proxy for teacher effectiveness.
We focus on the EVAAS MRM in this example, though we emphasize that many of the assumptions that it makes—most notably that students’ test scores are a reasonable proxy for teacher effectiveness—are common to other value-added models. When describing the MRM, the EVAAS documentation states that “each teacher is assumed to be the state or district average in a specific year, subject, and grade until the weight of evidence pulls him or her above or below that average.”
As well as assuming that teacher effectiveness is fully captured by students’ test scores, this model makes several other assumptions, which we make explicit here for expository purposes: 1) that student i’s test score for subject j in grade k in year l is a function of only their current and previous teachers’ effects; 2) that the effectiveness of teacher t for subject j, grade k, and year l depends on their effects on all of their students; 3) that student i’s instructional time for subject j in grade k in year l may be shared between teachers; and 4) that a teacher may be effective in one subject but ineffective in another.
Critically evaluating the assumptions of measurement models
We now consider another well-known example from the literature on fairness in computational systems: the risk assessment models used in the U.S. justice system to measure a defendant’s risk of recidivism. There are many such models, but we focus here on Northpointe’s Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), which was the subject of an investigation by Angwin et al. [4] and many academic papers [e.g., 9, 14, 34].
COMPAS draws on several criminological theories to operationalize a defendant’s risk of recidivism using measurements of a variety of observable properties (and other unobservable theoretical constructs) derived from official records and interviews. These properties and measurements span four different dimensions: prior criminal history, criminal associates, drug involvement, and early indicators of juvenile delinquency problems [19]. The measurements are combined in a regression model, which outputs a score that is converted to a number between one and ten with ten being the highest risk. Although the full mathematical details of COMPAS are not readily available, the COMPAS documentation mentions numerous assumptions, the most important of which is that recidivism is defined as “a new misdemeanor or felony arrest within two years.” We discuss the implications of this assumption after we introduce our second example.
Finally, we turn to a different type of risk assessment model, used in the U.S. healthcare system to identify the patients that will benefit the most from enrollment in high-risk care management programs— i.e., programs that provide access to additional resources for patients with complex health issues. As explained by Obermeyer et al., these models assume that “those with the greatest care needs will benefit the most from the programs” [43]. Furthermore, many of them operationalize greatest care needs as greatest care costs. This assumption—i.e., that care costs are a reasonable proxy for care needs—transforms the difficult task of measuring the extent to which a patient will benefit from a program (an unobservable theoretical construct) into the simpler task of predicting their future care costs based on their past care costs (an observable property). However, this assumption masks an important confounding factor: patients with comparable past care needs but different access to care will likely have different past care costs. As we explain in the next section, even without considering any other details of these models, this assumption can lead to fairness-related harms.
The measurement modeling process necessarily involves making assumptions. However, these assumptions must be made explicit and tested before the resulting measurements are used. Leaving them implicit or untested obscures any possible mismatches between the theoretical understanding of the construct purported to be measured and its operationalization, in turn obscuring any resulting fairness-related harms. In this section, we apply and extend the measurement quality concepts from Chapter 9 to specifically address issues of fairness and social justice.
Quantitative social scientists typically test their assumptions by assessing construct reliability and construct validity. Quinn et al. describe these concepts as follows: “The evaluation of any measurement is generally based on its reliability (can it be repeated?) and validity (is it right?). Embedded within the complex notion of validity are interpretation (what does it mean?) and application (does it ‘work?’)” [49]. We contribute fairness-oriented conceptualizations of construct reliability and construct validity that draw on the work of Quinn et al. [49], Jackman [30], Messick [40], and Loevinger [36], among others. We illustrate these conceptualizations using the five examples introduced in the previous section, arguing that they constitute a set of tools that will enable researchers and practitioners to 1) better anticipate fairness-related harms that can be obscured by focusing primarily on out-of-sample prediction, and 2) identify potential causes of fairness-related harms in ways that reveal concrete, actionable avenues for mitigating them.
Construct reliability
We start by describing construct reliability—a concept that is roughly analogous to the concept of precision (i.e., the inverse of variance) in statistics [30]. Assessing construct reliability means answering the following question: do similar inputs to a measurement model, possibly presented at different points in time, yield similar outputs? If the answer to this question is no, then the model lacks reliability, meaning that we may not want to use its measurements. We note that a lack of reliability can also make it challenging to assess construct validity. Although different disciplines emphasize different aspects of construct reliability, we argue that there is one aspect—namely test–retest reliability, which we describe below—that is especially relevant in the context of fairness in computational systems.
Test–retest reliability
Test–retest reliability refers to the extent to which measurements of an unobservable theoretical construct, obtained from a measurement model at different points in time, remain the same, assuming that the construct has not changed. For example, when measuring a person’s height, operationalized as the length from the bottom of their feet to the top of their head when standing erect, measurements that vary by several inches from one day to the next would suggest a lack of test–retest reliability. Investigating this variability might reveal its cause to be the assumption that a person’s shoes should contribute to their height.
As another example, many value-added models, including the EVAAS MRM, have been criticized for their lack of test–retest reliability. For instance, in Weapons of Math Destruction [46], O’Neil described how value-added models often produce measurements of teacher effectiveness that vary dramatically between years. In one case, she described Tim Clifford, an accomplished and respected New York City middle school teacher with over 26 years of teaching experience. For two years in a row, Clifford was evaluated using a value-added model, receiving a score of 6 out of 100 in the first year, followed by a score of 96 in the second. It is extremely unlikely that teacher effectiveness would vary so dramatically from one year to the next. Instead, this variability, which suggests a lack of test–retest reliability, points to a possible mismatch between the construct purported to be measured and its operationalization.
As a third example, had the developers of the Universal Credit benefits system described previously assessed the test–retest reliability of their system by checking that the system’s measurements of a claimant’s income were the same no matter when their one-month rolling period began, they might have anticipated (and even mitigated) the harms revealed by Human Rights Watch [55].
Finally, we note that an apparent lack of test–retest reliability does not always point to a mismatch between the theoretical understanding of the construct purported to be measured and its operationalization. In some cases, an apparent lack of test–retest reliability can instead be the result of unexpected changes to the construct itself. For example, although we typically think of a person’s height as being something that remains relatively static over the course of their adult life, most people actually get shorter as they get older.
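Before turning to construct validity, here is a minimal sketch of what a basic test–retest check can look like in practice. The scores are invented, and Pearson's correlation is only one of several statistics used for this purpose:

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

year_1 = [6, 55, 72, 40, 88, 63]   # hypothetical value-added scores (0-100)
year_2 = [96, 50, 70, 45, 85, 60]  # the same six teachers, one year later

print(round(correlation(year_1, year_2), 2))  # about -0.18 for these numbers
# If teacher effectiveness is reasonably stable, scores for the same teachers
# should correlate strongly across years. A near-zero or negative correlation,
# driven here by the jump from 6 to 96, flags a possible reliability problem
# worth investigating before the scores are used.
```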
Construct Validity
Whereas construct reliability is roughly analogous to the concept of precision in statistics, construct validity is roughly analogous to the concept of statistical unbiasedness [30]. Establishing construct validity means demonstrating, in a variety of ways, that the measurements obtained from a measurement model are both meaningful and useful: Does the operationalization capture all relevant aspects of the construct purported to be measured? Do the measurements look plausible? Do they correlate with other measurements of the same construct? Or do they vary in ways that suggest that the operationalization may be inadvertently capturing aspects of other constructs? Are the measurements predictive of measurements of any relevant observable properties (and other unobservable theoretical constructs) thought to be related to the construct, but not incorporated into the operationalization? Do the measurements support known hypotheses about the construct? What are the consequences of using the measurements, including any societal impacts [40, 52]? We emphasize that a key feature, not a bug, of construct validity is that it is not a yes/no box to be checked: construct validity is always a matter of degree, to be supported by critical reasoning [36].
Different disciplines have different conceptualizations of construct validity, each with its own rich history. For example, in some disciplines, construct validity is considered distinct from content validity and criterion validity, while in other disciplines, content validity and criterion validity are grouped under the umbrella of construct validity. Our conceptualization unites traditions from political science, education, and psychology by bringing together the seven different aspects of construct validity that we describe below. We argue that each of these aspects plays a unique and important role in understanding fairness in computational systems.
Face validity
Face validity refers to the extent to which the measurements obtained from a measurement model look plausible— a “sniff test” of sorts. This aspect of construct validity is inherently subjective, so it is often viewed with skepticism if it is not supplemented with other, less subjective evidence. However, face validity is a prerequisite for establishing construct validity: if the measurements obtained from a measurement model aren’t facially valid, then they are unlikely to possess other aspects of construct validity.
It is likely that the models described thus far would yield measurements that are, for the most part, facially valid. For example, measurements obtained by using income as a proxy for SES would most likely possess face validity. SES and income are certainly related and, in general, a person at the high end of the income distribution (e.g., a CEO) will have a different SES than a person at the low end (e.g., a barista). Similarly, given that COMPAS draws on several criminological theories to operationalize a defendant’s risk of recidivism, it is likely that the resulting scores would look plausible. One exception to this pattern is the EVAAS MRM. Some scores may look plausible—after all, students’ test scores are not unrelated to teacher effectiveness—but the dramatic variability that we described above in the context of test–retest reliability is implausible.
Content validity
Content validity refers to the extent to which an operationalization wholly and fully captures the substantive nature of the construct purported to be measured. This aspect of construct validity has three sub-aspects, which we describe below.
The first sub-aspect relates to the construct’s contestedness. If a construct is essentially contested, then it has multiple context-dependent, and sometimes even conflicting, theoretical understandings. Contestedness makes it inherently hard to assess content validity: if a construct has multiple theoretical understandings, then it is unlikely that a single operationalization can wholly and fully capture its substantive nature in a meaningful fashion. For this reason, some traditions make a single theoretical understanding of the construct purported to be measured a prerequisite for establishing content validity [25, 30]. However, other traditions simply require an articulation of which understanding is being operationalized [53]. We take the perspective that the latter approach is more practical because it is often the case that unobservable theoretical constructs are essentially contested, yet we still wish to measure them.
Of the models described previously, most are intended to measure unobservable theoretical constructs that are (relatively) uncontested. One possible exception is patient benefit, which can be understood in a variety of different ways. However, the understanding that is operationalized in most high-risk care management enrollment models is clearly articulated. As Obermeyer et al. explain, “[the patients] with the greatest care needs will benefit the most” from enrollment in high-risk care management programs [43].
The second sub-aspect of content validity is sometimes known as substantive validity. This sub-aspect moves beyond the theoretical understanding of the construct purported to be measured and focuses on the measurement modeling process—i.e., the assumptions made when moving from abstractions to mathematics. Establishing substantive validity means demonstrating that the operationalization incorporates measurements of those—and only those—observable properties (and other unobservable theoretical constructs, if appropriate) thought to be related to the construct. For example, although a person’s income contributes to their SES, their income is by no means the only contributing factor. Wealth, education, and occupation all affect a person’s SES, as do other unobservable theoretical constructs, such as cultural capital. For instance, an artist with significant wealth but a low income should have a higher SES than would be suggested by their income alone.
As another example, COMPAS defines recidivism as “a new misdemeanor or felony arrest within two years.” By assuming that arrests are a reasonable proxy for crimes committed, COMPAS fails to account for false arrests or crimes that do not result in arrests [50]. Indeed, no computational system can ever wholly and fully capture the substantive nature of crime by using arrest data as a proxy. Similarly, high-risk care management enrollment models assume that care costs are a reasonable proxy for care needs. However, a patient’s care needs reflect their underlying health status, while their care costs reflect both their access to care and their health status.
Finally, establishing structural validity, the third sub-aspect of content validity, means demonstrating that the operationalization captures the structure of the relationships between the incorporated observable properties (and other unobservable theoretical constructs, if appropriate) and the construct purported to be measured, as well as the interrelationships between them [36, 40].
In addition to assuming that teacher effectiveness is wholly and fully captured by students’ test scores—a clear threat to substantive validity [2]—the EVAAS MRM assumes that a student’s test score for subject j in grade k in year l is approximately equal to the sum of the state or district’s estimated mean score for subject j in grade k in year l and the student’s current and previous teachers’ effects (weighted by the fraction of the student’s instructional time attributed to each teacher). However, this assumption ignores the fact that, for many students, the relationship may be more complex.
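Written out, the structural assumption described in the paragraph above looks roughly like the following. The notation is ours, chosen for exposition; it is not taken from the EVAAS documentation:

```latex
% y_{ijkl}   : student i's test score for subject j, grade k, year l
% \mu_{jkl}  : the state or district's estimated mean score for subject j, grade k, year l
% \theta_{t} : the effect of teacher t
% w_{it}     : the fraction of student i's instructional time attributed to teacher t
y_{ijkl} \;\approx\; \mu_{jkl} \;+\; \sum_{t \,\in\, \text{current and previous teachers of } i} w_{it}\,\theta_{t}
```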
Convergent validity
Convergent validity refers to the extent to which the measurements obtained from a measurement model correlate with other measurements of the same construct, obtained from measurement models for which construct validity has already been established. This aspect of construct validity is typically assessed using quantitative methods, though doing so can reveal qualitative differences between different operationalizations.
We note that assessing convergent validity raises an inherent challenge: “If a new measure of some construct differs from an established measure, it is generally viewed with skepticism. If a new measure captures exactly what the previous one did, then it is probably unnecessary” [49]. The measurements obtained from a new measurement model should therefore deviate only slightly from existing measurements of the same construct. Moreover, for the model to be viewed as possessing convergent validity, these deviations must be well justified and supported by critical reasoning.
Many value-added models, including the EVAAS MRM, lack convergent validity [2]. For example, in Weapons of Math Destruction [46], O’Neil described Sarah Wysocki, a fifth-grade teacher who received a low score from a value-added model despite excellent reviews from her principal, her colleagues, and her students’ parents.
As another example, measurements of SES obtained from the model described previously and measurements of SES obtained from the National Committee on Vital and Health Statistics would likely correlate somewhat because both operationalizations incorporate income. However, the latter operationalization also incorporates measurements of other observable properties, including wealth, education, occupation, economic pressure, geographic location, and family size [45]. As a result, it is likely that there would also be significant differences between the two sets of measurements. Investigating these differences might reveal aspects of the substantive nature of SES, such as wealth or education, that are missing from the income-based model described previously. In other words, and as we described above, assessing convergent validity can reveal qualitative differences between different operationalizations of a construct.
We emphasize that assessing the convergent validity of a measurement model using measurements obtained from measurement models that have not been sufficiently well validated can yield a false sense of security. For example, scores obtained from COMPAS would likely correlate with scores obtained from other models that similarly use arrests as a proxy for crimes committed, thereby obscuring the threat to content validity that we described above.
Discriminant validity
Discriminant validity refers to the extent to which the measurements obtained from a measurement model vary in ways that suggest that the operationalization may be inadvertently capturing aspects of other constructs. Measurements of one construct should only correlate with measurements of another to the extent that those constructs are themselves related. As a special case, if two constructs are totally unrelated, then there should be no correlation between their measurements [25].
Establishing discriminant validity can be especially challenging when a construct has relationships with many other constructs. SES, for example, is related to almost all social and economic constructs, albeit to varying extents. For instance, SES and gender are somewhat related due to labor segregation and the persistent gender wage gap, while SES and race are much more closely related due to historical racial inequalities resulting from structural racism. When assessing the discriminant validity of the model described previously, we would therefore hope to find correlations that reflect these relationships. If, however, we instead found that the resulting measurements were perfectly correlated with gender or uncorrelated with race, this would suggest a lack of discriminant validity.
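A minimal sketch of such a check, using made-up numbers and crude 0/1 codes purely for illustration, might look like this:

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

ses_scores = [22, 35, 41, 58, 63, 70, 82, 90]  # hypothetical SES measurements
gender     = [0, 1, 0, 1, 1, 0, 1, 0]          # crude 0/1 coding for this check only
race       = [1, 1, 0, 1, 0, 0, 1, 0]          # crude 0/1 coding for this check only

print(round(correlation(ses_scores, gender), 2))
print(round(correlation(ses_scores, race), 2))
# We would expect modest correlations that mirror the real structural
# relationships described above. A correlation near +/-1.0 with gender, or
# near 0.0 with race, would suggest the operationalization is capturing (or
# missing) something it should not, and would warrant further investigation.
```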
As another example, Obermeyer et al. found a strong correlation between measurements of patients’ future care needs, operationalized as future care costs, and race [43]. According to their analysis of one model, only 18% of the patients identified for enrollment in high-risk care management programs were Black. This correlation contradicts expectations. Indeed, given the enormous racial health disparities in the U.S., we might even expect to see the opposite pattern. Further investigation by Obermeyer et al. revealed that this threat to discriminant validity was caused by the confounding factor that we described previously: Black and white patients with comparable past care needs had radically different past care costs—a consequence of structural racism that was then exacerbated by the model.
Predictive validity
Predictive validity refers to the extent to which the measurements obtained from a measurement model are predictive of measurements of any relevant observable properties (and other unobservable theoretical constructs) thought to be related to the construct purported to be measured, but not incorporated into the operationalization. Assessing predictive validity is therefore distinct from out-of-sample prediction [24, 41]. Predictive validity can be assessed using either qualitative or quantitative methods. We note that in contrast to the aspects of construct validity that we discussed above, predictive validity is primarily concerned with the utility of the measurements, not their meaning.
As a simple illustration of predictive validity, taller people generally weigh more than shorter people. Measurements of a person’s height should therefore be somewhat predictive of their weight. Similarly, a person’s SES is related to many observable properties— ranging from purchasing behavior to media appearances—that are not always incorporated into models for measuring SES. Measurements obtained by using income as a proxy for SES would most likely be somewhat predictive of many of these properties, at least for people at the high and low ends of the income distribution.
We note that the relevant observable properties (and other unobservable theoretical constructs) need not be “downstream” of (i.e., thought to be influenced by) the construct. Predictive validity can also be assessed using “upstream” properties and constructs, provided that they are not incorporated into the operationalization. For example, Obermeyer et al. investigated the extent to which measurements of patients’ future care needs, operationalized as future care costs, were predictive of patients’ health statuses (which were not part of the model that they analyzed) [43]. They found that Black and white patients with comparable future care costs did not have comparable health statuses—a threat to predictive validity caused (again) by the confounding factor described previously.
Hypothesis validity
Hypothesis validity refers to the extent to which the measurements obtained from a measurement model support substantively interesting hypotheses about the construct purported to be measured. Much like predictive validity, hypothesis validity is primarily concerned with the utility of the measurements. We note that the main distinction between predictive validity and hypothesis validity hinges on the definition of “substantively interesting hypotheses.” As a result, the distinction is not always clear cut. For example, is the hypothesis “People with higher SES are more likely to be mentioned in the New York Times” sufficiently substantively interesting? Or would it be more appropriate to use the hypothesized relationship to assess predictive validity? For this reason, some traditions merge predictive and hypothesis validity [e.g., 30].
Turning again to the value-added models discussed previously, it is extremely unlikely that the dramatically variable scores obtained from such models would support most substantively interesting hypotheses involving teacher effectiveness, again suggesting a possible mismatch between the theoretical understanding of the construct purported to be measured and its operationalization.
Using income as a proxy for SES would likely support some— though not all—substantively interesting hypotheses involving SES. For example, many social scientists have studied the relationship between SES and health outcomes, demonstrating that people with lower SES tend to have worse health outcomes. Measurements of SES obtained from the model described previously would likely support this hypothesis, albeit with some notable exceptions. For instance, wealthy college students often have low incomes but good access to healthcare. Combined with their young age, this means that they typically have better health outcomes than other people with comparable incomes. Examining these exceptions might reveal aspects of the substantive nature of SES, such as wealth and education, that are missing from the model described previously.
Consequential validity
Consequential validity, the final aspect in our fairness-oriented conceptualization of construct validity, is concerned with identifying and evaluating the consequences of using the measurements obtained from a measurement model, including any societal impacts. Assessing consequential validity often reveals fairness-related harms. Consequential validity was first introduced by Messick, who argued that the consequences of using the measurements obtained from a measurement model are fundamental to establishing construct validity [40]. This is because the values that are reflected in those consequences both derive from and contribute back to the theoretical understanding of the construct purported to be measured. In other words, the “measurements both reflect structure in the natural world, and impose structure upon it” [26]—i.e., the measurements shape the ways that we understand the construct itself. Assessing consequential validity therefore means answering the following questions: How is the world shaped by using the measurements? What world do we wish to live in? If there are contexts in which the consequences of using the measurements would cause us to compromise values that we wish to uphold, then the measurements should not be used in those contexts.
For example, when designing a kitchen, we might use measurements of a person’s standing height to determine the height at which to place their kitchen countertop. However, this may render the countertop inaccessible to them if they use a wheelchair. As another example, because the Universal Credit benefits system described previously assumed that measuring a person’s monthly income by totaling the wages deposited into their account over a single one-month period would yield error-free measurements, many people—especially those with irregular pay schedules— received substantially lower benefits than they were entitled to.
The consequences of using scores obtained from value-added models are well described in the literature on fairness in measurement. Many school districts have used such scores to make decisions about resource distribution and even teachers’ continued employment, often without any way to contest these decisions [2, 3]. In turn, this has caused schools to manipulate their scores and encouraged teachers to “teach to the test,” instead of designing more diverse and substantive curricula [46]. As well as the cases described above, in which teachers were fired on the basis of low scores despite evidence suggesting that their scores might be inaccurate, Amrein-Beardsley and Geiger [3] found that EVAAS consistently gave lower scores to teachers at schools with higher proportions of non-white students, students receiving special education services, lower-SES students, and English language learners. Although it is possible that more effective teachers simply chose not to teach at those schools, it is far more likely that these lower scores reflect societal biases and structural inequalities. When scores obtained from value-added models are used to make decisions about resource distribution and teachers’ continued employment, these biases and inequalities are then exacerbated.
The consequences of using scores obtained from COMPAS are also well described in the literature on fairness in computational systems, most notably by Angwin et al. [4], who showed that COMPAS incorrectly scored Black defendants as high risk more often than white defendants, while incorrectly scoring white defendants as low risk more often than Black defendants. By defining recidivism as “a new misdemeanor or felony arrest within two years,” COMPAS fails to account for false arrests or crimes that do not result in arrests. This assumption therefore encodes and exacerbates racist policing practices, leading to the racial disparities uncovered by Angwin et al. Indeed, by using arrests as a proxy for crimes committed, COMPAS can only exacerbate racist policing practices, rather than transcending them [7, 13, 23, 37, 39]. Furthermore, the COMPAS documentation asserts that “the COMPAS risk scales are actuarial risk assessment instruments. Actuarial risk assessment is an objective method of estimating the likelihood of reoffending. An individual’s level of risk is estimated based on known recidivism rates of offenders with similar characteristics” [19]. By describing COMPAS as an “objective method,” Northpointe misrepresents the measurement modeling process, which necessarily involves making assumptions and is thus never objective. Worse yet, the label of objectiveness obscures the organizational, political, societal, and cultural values that are embedded in COMPAS and reflected in its consequences.
Finally, we return to the high-risk care management models described previously. By operationalizing greatest care needs as greatest care costs, these models fail to account for the fact that patients with comparable past care needs but different access to care will likely have different past care costs. This omission has the greatest impact on Black patients. Indeed, when analyzing one such model, Obermeyer et al. found that only 18% of the patients identified for enrollment were Black [43]. In addition, Obermeyer et al. found that Black and white patients with comparable future care costs did not have comparable health statuses. In other words, these models exacerbate the enormous racial health disparities in the U.S. as a consequence of a seemingly innocuous assumption.
Measurement: The power to create truth
Because measurement modeling is often skipped over, researchers and practitioners may be inclined to collapse the distinctions between constructs and their operationalizations in how they talk about, think about, and study the concepts in their research question. But collapsing these distinctions removes opportunities to anticipate and mitigate fairness-related harms by eliding the space in which they are most often introduced. Further compounding this issue is the fact that measurements of unobservable theoretical constructs are often treated as if they were obtained directly and without errors—i.e., as a source of ground truth. Measurements end up standing in for the constructs purported to be measured, normalizing the assumptions made during the measurement modeling process and embedding them throughout society. In other words, “measures are more than a creation of society, they create society” [1]. Collapsing the distinctions between constructs and their operationalizations is therefore not just theoretically or pedantically concerning—it is practically concerning, with very real, fairness-related consequences.
We argue that measurement modeling provides both a language for articulating the distinctions between constructs and their operationalizations and a set of tools—namely construct reliability and construct validity—for surfacing possible mismatches. Above, we therefore proposed fairness-oriented conceptualizations of construct reliability and construct validity, uniting traditions from political science, education, and psychology. We showed how these conceptualizations can be used to 1) anticipate fairness-related harms that can be obscured by focusing primarily on out-of-sample prediction, and 2) identify potential causes of fairness-related harms in ways that reveal concrete, actionable avenues for mitigating them. We acknowledge that assessing construct reliability and construct validity can be time-consuming. However, ignoring them means that we run the risk of creating a world that we do not wish to live in.
Key Takeaways
- Mismatches between conceptualization and measurement are often places in which bias and systemic injustice enter the research process.
- Measurement modeling is a way of foregrounding researchers' assumptions about how they connect their conceptual definitions and operational definitions.
- Social work research consumers should critically evaluate the construct validity and reliability of measures in the studies of social work populations.
Exercises
- Examine an article that uses quantitative methods to investigate your topic area.
- Identify the conceptual definitions the authors used.
- These are usually in the introduction section.
- Identify the operational definitions the authors used.
- These are usually in the methods section in a subsection titled measures.
- List the assumptions that link the conceptual and operational definitions.
- For example, that attendance can be measured by a classroom sign-in sheet.
- Do the authors identify any limitations for their operational definitions (measures) in the limitations or methods section?
- Do you identify any limitations in how the authors operationalized their variables?
- Apply the specific subtypes of construct validity and reliability.
Version 1.0 (08/23/2021)
Version 1.0 is our first edition of the textbook. Between 0.9 and 1.0, the authors did a lot of work! After the final internal review completed by authors and editor for version 0.9 revealed some difficulties in a few chapters, the authors rewrote three chapters. Early revisions were completed right after 0.9, with the project manager making a final round of changes in Summer 2021. The project manager also made changes on a few chapters based on notes accrued during the pandemic, but these changes are minimal, mostly changing examples and providing additional resources.
The major changes from 0.9 include:
- Rewrote sections of Chapter 7 to include
- More content on paradigms
- Removing a complicated framework for assessing paradigmatic assumptions
- Developing a theoretical framework & paradigmatic framework
- Introducing pragmatism and connecting it with multiparadigmatic methods
- Rewrites in Chapters 11 and 12 are too numerous to list. They are basically new chapters.
- Visual and copy edits
We anticipate releasing version 1.1 in August 2022. If you have feedback that you would like us to consider, you can contact the project manager Matt DeCarlo at profmattdecarlo@gmail.com with any suggestions or use the hypothes.is annotation group for Community Feedback.
Version 0.9 (08/15/2020)
Version 0.9 is the pre-release public beta. This version was peer reviewed by our peer reviewers, edited by our editor, and authors reviewed materials one last time prior to submitting their final work. This version was intended for review by faculty who are considering adopting the textbook for academic year 2020-2021, but authors were unable to complete final touches before the beginning of the academic year. This pushed back final production and publication to Summer 2021.
Archived copies are available in our Open Science Framework Project.
Chapter Outline
- Operational definitions (36 minute read)
- Writing effective questions and questionnaires (38 minute read)
- Measurement quality (21 minute read)
Content warning: examples in this chapter contain references to ethnocentrism, toxic masculinity, racism in science, drug use, mental health and depression, psychiatric inpatient care, poverty and basic needs insecurity, pregnancy, and racism and sexism in the workplace and higher education.
11.1 Operational definitions
Learning Objectives
Learners will be able to...
- Define and give an example of indicators and attributes for a variable
- Apply the three components of an operational definition to a variable
- Distinguish between levels of measurement for a variable and how those differences relate to measurement
- Describe the purpose of composite measures like scales and indices
Last chapter, we discussed conceptualizing your project. Conceptual definitions are like dictionary definitions. They tell you what a concept means by defining it using other concepts. In this section we will move from the abstract realm (conceptualization) to the real world (measurement).
Operationalization is the process by which researchers spell out precisely how a concept will be measured in their study. It involves identifying the specific research procedures we will use to gather data about our concepts. If conceptually defining your terms means looking at theory, how do you operationally define your terms? By looking for indicators of when your variable is present or not, more or less intense, and so forth. Operationalization is probably the most challenging part of quantitative research, but once it's done, the design and implementation of your study will be straightforward.
Indicators
Operationalization works by identifying specific indicators that will be taken to represent the ideas we are interested in studying. If we are interested in studying masculinity, then the indicators for that concept might include some of the social roles prescribed to men in society such as breadwinning or fatherhood. Being a breadwinner or a father might therefore be considered indicators of a person’s masculinity. The extent to which a man fulfills either, or both, of these roles might be understood as clues (or indicators) about the extent to which he is viewed as masculine.
Let’s look at another example of indicators. Each day, Gallup researchers poll 1,000 randomly selected Americans to ask them about their well-being. To measure well-being, Gallup asks these people to respond to questions covering six broad areas: physical health, emotional health, work environment, life evaluation, healthy behaviors, and access to basic necessities. Gallup uses these six factors as indicators of the concept that they are really interested in, which is well-being.
Identifying indicators can be even simpler than the examples described thus far. Political party affiliation is another relatively easy concept for which to identify indicators. If you asked a person what party they voted for in the last national election (or gained access to their voting records), you would get a good indication of their party affiliation. Of course, some voters split tickets between multiple parties when they vote and others swing from party to party each election, so our indicator is not perfect. Indeed, if our study were about political identity as a key concept, operationalizing it solely in terms of who they voted for in the previous election leaves out a lot of information about identity that is relevant to that concept. Nevertheless, it's a pretty good indicator of political party affiliation.
Choosing indicators is not an arbitrary process. As described earlier, utilizing prior theoretical and empirical work in your area of interest is a great way to identify indicators in a scholarly manner. And your conceptual definitions will point you in the direction of relevant indicators. Empirical work will give you some very specific examples of how the important concepts in an area have been measured in the past and what sorts of indicators have been used. Often, it makes sense to use the same indicators as previous researchers; however, you may find that some previous measures have potential weaknesses that your own study will improve upon.
All of the examples in this chapter have dealt with questions you might ask a research participant on a survey or in a quantitative interview. If you plan to collect data from other sources, such as through direct observation or the analysis of available records, think practically about what the design of your study might look like and how you can collect data on various indicators feasibly. If your study asks about whether the participant regularly changes the oil in their car, you will likely not observe them directly doing so. Instead, you will likely need to rely on a survey question that asks them the frequency with which they change their oil or ask to see their car maintenance records.
Exercises
- What indicators are commonly used to measure the variables in your research question?
- How can you feasibly collect data on these indicators?
- Are you planning to collect your own data using a questionnaire or interview? Or are you planning to analyze available data like client files or raw data shared from another researcher's project?
Remember, you need raw data. Your research project cannot rely solely on the results reported by other researchers or the arguments you read in the literature. A literature review is only the first part of a research project, and your review of the literature should inform the indicators you end up choosing when you measure the variables in your research question.
Unlike conceptual definitions, which are composed of other concepts, an operational definition consists of the following three components: (1) the variable being measured and its attributes, (2) the measure you will use, and (3) how you plan to interpret the data collected from that measure to draw conclusions about the variable you are measuring.
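As a concrete, hypothetical illustration, the three components might be written out like this for a variable such as client anxiety measured with the GAD-7 (the grouping of scores into attributes below is a placeholder for illustration, not a clinical scoring guide):

```python
# A hypothetical operational definition, written out as its three components.
operational_definition = {
    "variable and attributes": "client anxiety: minimal, mild, moderate, or severe",
    "measure": "GAD-7 self-report questionnaire (7 items, each scored 0-3)",
    "interpretation": "sum the items into a 0-21 total; higher totals are "
                      "interpreted as more severe anxiety and grouped into "
                      "the four attributes above",
}

for component, description in operational_definition.items():
    print(f"{component}: {description}")
```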
Step 1: Specifying variables and attributes
The first component, the variable, should be the easiest part. At this point in quantitative research, you should have a research question that has at least one independent and at least one dependent variable. Remember that variables must be able to vary. For example, the United States is not a variable. Country of residence is a variable, as is patriotism. Similarly, if your sample only includes men, gender is a constant in your study, not a variable. A constant is a characteristic that does not change in your study.
When social scientists measure concepts, they sometimes use the language of variables and attributes. A variable refers to a quality or quantity that varies across people or situations. Attributes are the characteristics that make up a variable. For example, the variable hair color would contain attributes like blonde, brown, black, red, gray, etc. A variable’s attributes determine its level of measurement. There are four possible levels of measurement: nominal, ordinal, interval, and ratio. The first two levels of measurement are categorical, meaning their attributes are categories rather than numbers. The latter two levels of measurement are continuous, meaning their attributes are numbers.
Levels of measurement
Hair color is an example of a nominal level of measurement. Nominal measures are categorical, and those categories cannot be mathematically ranked. As a brown-haired person (with some gray), I can’t say for sure that brown-haired people are better than blonde-haired people. As with all nominal levels of measurement, there is no ranking order between hair colors; they are simply different. That is what constitutes a nominal level of measurement. Gender and race are also measured at the nominal level.
What attributes are contained in the variable hair color? While blonde, brown, black, and red are common colors, some people may not fit into these categories if we only list these attributes. My wife, who currently has purple hair, wouldn’t fit anywhere. This means that our attributes were not exhaustive. Exhaustiveness means that all possible attributes are listed. We may have to list a lot of colors before we can meet the criteria of exhaustiveness. Clearly, there is a point at which exhaustiveness has been reasonably met. If a person insists that their hair color is light burnt sienna, it is not your responsibility to list that as an option. Rather, that person would reasonably be described as brown-haired. Perhaps listing a category for other colors would suffice to make our list of colors exhaustive.
What about a person who has multiple hair colors at the same time, such as red and black? They would fall into multiple attributes. This violates the rule of mutual exclusivity, in which a person cannot fall into two different attributes. Instead of listing all of the possible combinations of colors, perhaps you might include a multi-color attribute to describe people with more than one hair color.
Making sure researchers provide mutually exclusive and exhaustive attributes is about making sure all people are represented in the data record. For many years, the attributes for gender were only male or female. Now, our understanding of gender has evolved to encompass more attributes that better reflect the diversity in the world. Children of parents from different races were often classified as one race or another, even if they identified with both cultures. The option for bi-racial or multi-racial on a survey not only more accurately reflects the racial diversity in the real world but validates and acknowledges people who identify in that manner. If we did not measure race in this way, we would leave empty the data record for people who identify as biracial or multiracial, impairing our search for truth.
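To make these two rules concrete, here is a minimal Python sketch that flags responses a coding scheme does not cover. The hair-color categories and responses are made-up illustrations, not data from any study.

```python
# Illustrative sketch: checking a coding scheme for exhaustiveness.
# The attribute list and responses below are hypothetical examples.

attributes = {"blonde", "brown", "black", "red", "gray", "multi-color", "other"}

responses = ["brown", "purple", "black", "red and black", "blonde"]

# Exhaustiveness: every response must fall into one of the listed attributes.
uncoded = [r for r in responses if r not in attributes]
print("Responses not covered by the coding scheme:", uncoded)
# -> ['purple', 'red and black']
# 'purple' suggests we need a rule for recoding into 'other'; 'red and black'
# suggests recoding into the 'multi-color' attribute so that no respondent
# falls into two categories at once (mutual exclusivity).
```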
Unlike nominal-level measures, attributes at the ordinal level can be rank ordered. For example, someone’s degree of satisfaction in their romantic relationship can be ordered by rank. That is, you could say you are not at all satisfied, a little satisfied, moderately satisfied, or highly satisfied. Note that even though these have a rank order to them (not at all satisfied is certainly worse than highly satisfied), we cannot calculate a mathematical distance between those attributes. We can simply say that one attribute of an ordinal-level variable is more or less than another attribute.
This can get a little confusing when using rating scales. If you have ever taken a customer satisfaction survey or completed a course evaluation for school, you are familiar with rating scales. “On a scale of 1-5, with 1 being the lowest and 5 being the highest, how likely are you to recommend our company to other people?” That surely sounds familiar. Rating scales use numbers, but only as a shorthand, to indicate what attribute (highly likely, somewhat likely, etc.) the person feels describes them best. You wouldn’t say you are “2” likely to recommend the company, but you would say you are not very likely to recommend the company. Ordinal-level attributes must also be exhaustive and mutually exclusive, as with nominal-level variables.
At the interval level, attributes must also be exhaustive and mutually exclusive and there is equal distance between attributes. Interval measures are also continuous, meaning their attributes are numbers, rather than categories. IQ scores are interval level, as are temperatures in Fahrenheit and Celsius. Their defining characteristic is that we can say how much more or less one attribute differs from another. We cannot, however, say with certainty what the ratio of one attribute is in comparison to another. For example, it would not make sense to say that a person with an IQ score of 140 has twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.
While we cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, we can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level. Finally, at the ratio level, attributes are mutually exclusive and exhaustive, attributes can be rank ordered, the distance between attributes is equal, and attributes have a true zero point. Thus, with these variables, we can say what the ratio of one attribute is in comparison to another. Examples of ratio-level variables include age and years of education. We know that a person who is 12 years old is twice as old as someone who is 6 years old. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. The differences between each level of measurement are visualized in Table 11.1.
|  | Nominal | Ordinal | Interval | Ratio |
| --- | --- | --- | --- | --- |
| Exhaustive | X | X | X | X |
| Mutually exclusive | X | X | X | X |
| Rank-ordered |  | X | X | X |
| Equal distance between attributes |  |  | X | X |
| True zero point |  |  |  | X |
Levels of measurement = levels of specificity
We have spent time learning how to determine our data's level of measurement. Now what? How can we use this information to help us as we measure concepts and develop measurement tools? First, the types of statistical tests that we are able to use depend on our data's level of measurement. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval and ratio-level measurement are typically considered the most desirable because they permit any measure of central tendency to be computed (i.e., mean, median, or mode). Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. The higher the level of measurement, the more complex statistical tests we are able to conduct. This knowledge may help us decide what kind of data we need to gather, and how.
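As a rough illustration of this point, the short Python sketch below (using the standard statistics module; the small data lists are invented examples) shows which measure of central tendency makes sense at each level.

```python
import statistics

# Hypothetical data at three levels of measurement
hair_color = ["brown", "blonde", "brown", "black", "brown"]   # nominal
satisfaction = [1, 2, 2, 3, 4]   # ordinal codes: 1 = not at all ... 4 = highly satisfied
age_years = [19, 22, 22, 25, 41]                               # ratio

# Nominal: only the mode is meaningful
print(statistics.mode(hair_color))      # 'brown'

# Ordinal: mode or median; a mean would treat the codes as real distances
print(statistics.median(satisfaction))  # 2

# Interval/ratio: mode, median, and mean are all meaningful
print(statistics.mean(age_years))       # 25.8
```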
That said, we have to balance this knowledge with the understanding that sometimes, collecting data at a higher level of measurement could negatively impact our studies. For instance, sometimes providing answers in ranges may make prospective participants feel more comfortable responding to sensitive items. Imagine that you were interested in collecting information on topics such as income, number of sexual partners, number of times someone used illicit drugs, etc. You would have to think about the sensitivity of these items and determine if it would make more sense to collect some data at a lower level of measurement (e.g., asking whether someone is sexually active (nominal) versus their total number of sexual partners (ratio)).
Finally, when analyzing data, researchers sometimes find a need to change a variable's level of measurement. For example, a few years ago, a student was interested in studying the relationship between mental health and life satisfaction. This student used a variety of measures. One item asked about the number of mental health symptoms, reported as the actual number. When analyzing the data, my student examined the mental health symptom variable and noticed that she had two groups: those with zero or one symptom and those with many symptoms. Instead of using the ratio-level data (the actual number of mental health symptoms), she collapsed her cases into two categories, few and many, and used that variable in her analyses. It is important to note that you can move data from a higher level of measurement to a lower level; however, you cannot move from a lower level to a higher level.
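Here is a minimal sketch, using pandas, of the kind of recoding described in the student example above; the column name and the cutoff of one symptom are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratio-level data: number of mental health symptoms reported
df = pd.DataFrame({"symptom_count": [0, 1, 1, 5, 7, 9, 0, 6]})

# Collapse the ratio-level variable into two categories. Moving from ratio to
# categorical is possible; the reverse is not, because the exact counts cannot
# be recovered from "few"/"many".
df["symptom_group"] = pd.cut(df["symptom_count"],
                             bins=[-1, 1, float("inf")],
                             labels=["few", "many"])
print(df)
```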
Exercises
- Check that the variables in your research question can vary...and that they are not constants or one of many potential attributes of a variable.
- Think about the attributes your variables have. Are they categorical or continuous? What level of measurement seems most appropriate?
Step 2: Specifying measures for each variable
Let’s pick a social work research question and walk through the process of operationalizing variables to see how specific we need to get. I’m going to hypothesize that residents of a psychiatric unit who are more depressed are less likely to be satisfied with care. Remember, this would be an inverse relationship—as depression increases, satisfaction decreases. In this question, depression is my independent variable (the cause) and satisfaction with care is my dependent variable (the effect). Now that we have identified our variables, their attributes, and levels of measurement, we can move on to the second component: the measure itself.
So, how would you measure my key variables: depression and satisfaction? What indicators would you look for? Some students might say that depression could be measured by observing a participant’s body language. They may also say that a depressed person will often express feelings of sadness or hopelessness. In addition, a satisfied person might be happy around service providers and often express gratitude. While these factors may indicate that the variables are present, they lack coherence. Unfortunately, what this “measure” is actually saying is that “I know depression and satisfaction when I see them.” While you are likely a decent judge of depression and satisfaction, you need to provide more information in a research study for how you plan to measure your variables. Your judgment is subjective, based on your own idiosyncratic experiences with depression and satisfaction. They couldn’t be replicated by another researcher. They also can’t be done consistently for a large group of people. Operationalization requires that you come up with a specific and rigorous measure for seeing who is depressed or satisfied.
Finding a good measure for your variable depends on the kind of variable it is. Variables that are directly observable don't come up very often in my students' classroom projects, but they might include things like taking someone's blood pressure, marking attendance or participation in a group, and so forth. To measure an indirectly observable variable like age, you would probably put a question on a survey that asked, “How old are you?” Measuring a variable like income might require some more thought, though. Are you interested in this person’s individual income or the income of their family unit? This might matter if your participant does not work or is dependent on other family members for income. Do you count income from social welfare programs? Are you interested in their income per month or per year? Even though indirect observables are relatively easy to measure, the measures you use must be clear in what they are asking, and operationalization is all about figuring out the specifics of what you want to know. For more complicated constructs, you will need compound measures (that use multiple indicators to measure a single variable).
How you plan to collect your data also influences how you will measure your variables. For social work researchers using secondary data like client records as a data source, you are limited by what information is in the data sources you can access. If your organization uses a given measurement for a mental health outcome, that is the one you will use in your study. Similarly, if you plan to study how long a client was housed after an intervention using client visit records, you are limited by how their caseworker recorded their housing status in the chart. One of the benefits of collecting your own data is being able to select the measures you feel best exemplify your understanding of the topic.
Measuring unidimensional concepts
The previous section mentioned two important considerations: how complicated the variable is and how you plan to collect your data. With these in hand, we can use the level of measurement to further specify how you will measure your variables and consider specialized rating scales developed by social science researchers.
Measurement at each level
Nominal measures assess categorical variables. These measures are used for variables or indicators that have mutually exclusive attributes, but that cannot be rank-ordered. Nominal measures ask about the variable and provide names or labels for different attribute values like social work, counseling, and nursing for the variable profession. Nominal measures are relatively straightforward.
Ordinal measures often use a rating scale. It is an ordered set of responses that participants must choose from. Figure 11.1 shows several examples. The number of response options on a typical rating scale is usually five or seven, though it can range from three to 11. Five-point scales are best for unipolar scales where only one construct is tested, such as frequency (Never, Rarely, Sometimes, Often, Always). Seven-point scales are best for bipolar scales where there is a dichotomous spectrum, such as liking (Like very much, Like somewhat, Like slightly, Neither like nor dislike, Dislike slightly, Dislike somewhat, Dislike very much). For bipolar questions, it is useful to offer an earlier question that branches them into an area of the scale; if asking about liking ice cream, first ask “Do you generally like or dislike ice cream?” Once the respondent chooses like or dislike, refine it by offering them relevant choices from the seven-point scale. Branching improves both reliability and validity (Krosnick & Berent, 1993).[20] Although you often see scales with numerical labels, it is best to only present verbal labels to the respondents but convert them to numerical values in the analyses. Avoid partial labels or lengthy or overly specific labels. In some cases, the verbal labels can be supplemented with (or even replaced by) meaningful graphics. The last rating scale shown in Figure 11.1 is a visual-analog scale, on which participants make a mark somewhere along the horizontal line to indicate the magnitude of their response.
Interval measures are those where the values measured are not only rank-ordered, but are also equidistant from adjacent attributes. For example, consider the temperature scale (in Fahrenheit or Celsius), where the difference between 30 and 40 degrees Fahrenheit is the same as that between 80 and 90 degrees Fahrenheit. Likewise, if you have a scale that asks respondents’ annual income using the following attributes (ranges): $0 to 10,000, $10,000 to 20,000, $20,000 to 30,000, and so forth, this is also an interval measure, because the mid-points of each range (i.e., $5,000, $15,000, $25,000, etc.) are equidistant from each other. The intelligence quotient (IQ) scale is also an interval measure, because the measure is designed such that the difference between IQ scores 100 and 110 is supposed to be the same as between 110 and 120 (although we do not really know whether that is truly the case). Interval measures allow us to examine “how much more” one attribute is when compared to another, which is not possible with nominal or ordinal measures. You may find researchers who “pretend” (incorrectly) that ordinal rating scales are actually interval measures so that we can use different statistical techniques for analyzing them. As we will discuss in the latter part of the chapter, this is a mistake because there is no way to know whether the difference between a 3 and a 4 on a rating scale is the same as the difference between a 2 and a 3. Those numbers are just placeholders for categories.
Ratio measures are those that have all the qualities of nominal, ordinal, and interval scales, and in addition, also have a “true zero” point (where the value zero implies lack or non-availability of the underlying construct). Think about how to measure the number of people working in human resources at a social work agency. It could be one, several, or none (if the company contracts out for those services). Measuring interval and ratio data is relatively easy, as people either select or input a number for their answer. If you ask a person how many eggs they purchased last week, they can simply tell you they purchased a dozen eggs at the store, two at breakfast on Wednesday, or none at all.
Commonly used rating scales in questionnaires
The level of measurement will give you the basic information you need, but social scientists have also developed specialized instruments for use in questionnaires, a common tool used in quantitative research. As we mentioned before, if you plan to source your data from client files or previously published results, you will be limited to the measures already contained in those sources.
Although Likert scale is a term colloquially used to refer to almost any rating scale (e.g., a 0-to-10 life satisfaction scale), it has a much more precise meaning. In the 1930s, researcher Rensis Likert (pronounced LICK-ert) created a new approach for measuring people’s attitudes (Likert, 1932).[21] It involves presenting people with several statements—including both favorable and unfavorable statements—about some person, group, or idea. Respondents then express their agreement or disagreement with each statement on a 5-point scale: Strongly Agree, Agree, Neither Agree nor Disagree, Disagree, Strongly Disagree. Numbers are assigned to each response and then summed across all items to produce a score representing the attitude toward the person, group, or idea. For items that are phrased in an opposite direction (e.g., negatively worded statements instead of positively worded statements), reverse coding is used so that the numerical scoring of statements also runs in the opposite direction. The entire set of items came to be called a Likert scale, as indicated in Table 11.2 below.
Unless you are measuring people’s attitude toward something by assessing their level of agreement with several statements about it, it is best to avoid calling it a Likert scale. You are probably just using a rating scale. Likert scales allow for more granularity (more finely tuned response) than yes/no items, including whether respondents are neutral to the statement. Below is an example of how we might use a Likert scale to assess your attitudes about research as you work your way through this textbook.
|  | Strongly agree | Agree | Neutral | Disagree | Strongly disagree |
| --- | --- | --- | --- | --- | --- |
| I like research more now than when I started reading this book. |  |  |  |  |  |
| This textbook is easy to use. |  |  |  |  |  |
| I feel confident about how well I understand levels of measurement. |  |  |  |  |  |
| This textbook is helping me plan my research proposal. |  |  |  |  |  |
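As a rough sketch of how a Likert scale like the one above might be scored, the example below sums one respondent's ratings and reverse codes a negatively worded item. The item wording, point values, and the choice of which item is reverse coded are illustrative assumptions, not a published scoring rule.

```python
# One respondent's answers to four 5-point Likert items
# (1 = Strongly disagree, 5 = Strongly agree). Items and values are hypothetical.
responses = {
    "I like research more now than when I started reading this book.": 4,
    "This textbook is easy to use.": 5,
    "I find research methods confusing.": 2,   # negatively worded item
    "This textbook is helping me plan my research proposal.": 4,
}
reverse_coded = {"I find research methods confusing."}

def score(item, value, scale_max=5, scale_min=1):
    # Reverse coding flips the scale so higher always means a more favorable
    # attitude: 1 <-> 5, 2 <-> 4, 3 stays 3.
    return (scale_max + scale_min - value) if item in reverse_coded else value

total = sum(score(item, value) for item, value in responses.items())
print(total)   # 4 + 5 + 4 + 4 = 17, on a possible range of 4 to 20
```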
Semantic differential scales are composite (multi-item) scales in which respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites. Whereas in the above Likert scale, the participant is asked how much they agree or disagree with a statement, in a semantic differential scale the participant is asked to indicate how they feel about a specific item. This makes the semantic differential scale an excellent technique for measuring people’s attitudes or feelings toward objects, events, or behaviors. Table 11.3 is an example of a semantic differential scale that was created to assess participants' feelings about this textbook.
1) How would you rate your opinions toward this textbook?

|  | Very much | Somewhat | Neither | Somewhat | Very much |  |
| --- | --- | --- | --- | --- | --- | --- |
| Boring |  |  |  |  |  | Exciting |
| Useless |  |  |  |  |  | Useful |
| Hard |  |  |  |  |  | Easy |
| Irrelevant |  |  |  |  |  | Applicable |
The Guttman scale is a composite scale designed by Louis Guttman that uses a series of items arranged in increasing order of intensity (least intense to most intense) of the concept. This type of scale allows us to understand the intensity of beliefs or feelings. Each item in a Guttman scale has a weight (not indicated on the tool itself) that varies with the intensity of that item, and the weighted combination of each response is used as an aggregate measure of an observation.
Example Guttman Scale Items
- I often felt the material was not engaging Yes/No
- I was often thinking about other things in class Yes/No
- I was often working on other tasks during class Yes/No
- I will work to abolish research from the curriculum Yes/No
Notice how the items move from lower intensity to higher intensity. A researcher reviews the yes answers and creates a score for each participant.
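One simple way a researcher might score such items is sketched below; the weights are invented for illustration, and real Guttman scaling also involves checking that responses actually follow the cumulative pattern.

```python
# Hypothetical Guttman-style items ordered from least to most intense,
# each with an illustrative weight reflecting its intensity.
items = [
    ("I often felt the material was not engaging", 1),
    ("I was often thinking about other things in class", 2),
    ("I was often working on other tasks during class", 3),
    ("I will work to abolish research from the curriculum", 4),
]

# One participant's yes/no answers, in the same order as the items.
answers = [True, True, False, False]

# Simple aggregate: sum the weights of the items endorsed "yes".
score = sum(weight for (text, weight), yes in zip(items, answers) if yes)
print(score)   # 1 + 2 = 3
```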
Composite measures: Scales and indices
Depending on your research design, your measure may be something you put on a survey or pre/post-test that you give to your participants. For a variable like age or income, one well-worded question may suffice. Unfortunately, most variables in the social world are not so simple. Depression and satisfaction are multidimensional concepts. Relying on a single indicator like a question that asks "Yes or no, are you depressed?” does not encompass the complexity of depression, including issues with mood, sleeping, eating, relationships, and happiness. There is no easy way to delineate between multidimensional and unidimensional concepts, as it's all in how you think about your variable. Satisfaction could be validly measured using a unidimensional ordinal rating scale. However, if satisfaction were a key variable in our study, we would need a theoretical framework and conceptual definition for it. That means we'd probably have more indicators to ask about, like timeliness, respect, sensitivity, and many others, and we would want our study to say something about what satisfaction truly means in terms of our other key variables. However, if satisfaction is not a key variable in your conceptual framework, it makes sense to operationalize it as a unidimensional concept.
For more complicated measures, researchers use scales and indices (sometimes called indexes) to measure their variables because they assess multiple indicators to develop a composite (or total) score. Composite scores provide a much greater understanding of concepts than a single item could. Although we won't delve too deeply into the process of scale development, we will cover some important topics for you to understand how scales and indices developed by other researchers can be used in your project.
Although they exhibit differences (which will be discussed later), scales and indices have several things in common.
- Both are ordinal measures of variables.
- Both can order the units of analysis in terms of specific variables.
- Both are composite measures.
Scales
The previous section discussed how to measure respondents’ responses to predesigned items or indicators belonging to an underlying construct. But how do we create the indicators themselves? The process of creating the indicators is called scaling. More formally, scaling is a branch of measurement that involves the construction of measures by associating qualitative judgments about unobservable constructs with quantitative, measurable metric units. Stevens (1946)[22] said, “Scaling is the assignment of objects to numbers according to a rule.” This process of measuring abstract concepts in concrete terms remains one of the most difficult tasks in empirical social science research.
The outcome of a scaling process is a scale, which is an empirical structure for measuring items or indicators of a given construct. Understand that multidimensional “scales”, as discussed in this section, are a little different from “rating scales” discussed in the previous section. A rating scale is used to capture the respondents’ reactions to a given item on a questionnaire. For example, an ordinally scaled item captures a value between “strongly disagree” to “strongly agree.” Attaching a rating scale to a statement or instrument is not scaling. Rather, scaling is the formal process of developing scale items, before rating scales can be attached to those items.
If creating your own scale sounds painful, don’t worry! For most multidimensional variables, you would likely be duplicating work that has already been done by other researchers. Specifically, this is a branch of science called psychometrics. You do not need to create a scale for depression because scales such as the Patient Health Questionnaire (PHQ-9), the Center for Epidemiologic Studies Depression Scale (CES-D), and Beck’s Depression Inventory (BDI) have been developed and refined over dozens of years to measure variables like depression. Similarly, scales such as the Patient Satisfaction Questionnaire (PSQ-18) have been developed to measure satisfaction with medical care. As we will discuss in the next section, these scales have been shown to be reliable and valid. While you could create a new scale to measure depression or satisfaction, a study with rigor would pilot test and refine that new scale over time to make sure it measures the concept accurately and consistently. This high level of rigor is often unachievable in student research projects because of the cost and time involved in pilot testing and validating, so using existing scales is recommended.
Unfortunately, there is no good one-stop shop for psychometric scales. The Mental Measurements Yearbook provides a searchable database of measures for social science variables, though it is woefully incomplete and often does not contain the full documentation for scales in its database. You can access it from a university library’s list of databases. If you can’t find anything in there, your next stop should be the methods section of the articles in your literature review. The methods section of each article will detail how the researchers measured their variables, and often the results section is instructive for understanding more about measures. In a quantitative study, researchers may have used a scale to measure key variables and will provide a brief description of that scale, its name, and maybe a few example questions. If you need more information, look at the results section and tables discussing the scale to get a better idea of how the measure works. Looking beyond the articles in your literature review, searching Google Scholar using queries like “depression scale” or “satisfaction scale” should also provide some relevant results. For example, when searching for documentation for the Rosenberg Self-Esteem Scale (which we will discuss in the next section), I found a report from researchers investigating acceptance and commitment therapy which details this scale and many others used to assess mental health outcomes. If you find the name of the scale somewhere but cannot find the documentation (all questions and answers plus how to interpret the scale), a general web search with the name of the scale and ".pdf" may bring you to what you need. Or, to get professional help with finding information, always ask a librarian!
Unfortunately, these approaches do not guarantee that you will be able to view the scale itself or get information on how it is interpreted. Many scales cost money to use and may require training to properly administer. You may also find scales that are related to your variable but would need to be slightly modified to match your study’s needs. You could adapt a scale to fit your study; however, changing even small parts of a scale can influence its accuracy and consistency. While it is perfectly acceptable in student projects to adapt a scale without testing it first (time may not allow you to do so), pilot testing is always recommended for adapted scales, and researchers seeking to draw valid conclusions and publish their results must take this additional step.
Indices
An index is a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas. It is different from a scale. Scales also aggregate measures; however, these measures examine different dimensions or the same dimension of a single construct. A well-known example of an index is the consumer price index (CPI), which is computed every month by the Bureau of Labor Statistics of the U.S. Department of Labor. The CPI is a measure of how much consumers have to pay for goods and services (in general) and is divided into eight major categories (food and beverages, housing, apparel, transportation, healthcare, recreation, education and communication, and “other goods and services”), which are further subdivided into more than 200 smaller items. Each month, government employees call all over the country to get the current prices of more than 80,000 items. Using a complicated weighting scheme that takes into account the location and probability of purchase for each item, analysts then combine these prices into an overall index score using a series of formulas and rules.
Another example of an index is the Duncan Socioeconomic Index (SEI). This index is used to quantify a person's socioeconomic status (SES) and is a combination of three concepts: income, education, and occupation. Income is measured in dollars, education in years or degrees achieved, and occupation is classified into categories or levels by status. These very different measures are combined to create an overall SES index score. However, SES index measurement has generated a lot of controversy and disagreement among researchers.
The process of creating an index is similar to that of a scale. First, conceptualize (define) the index and its constituent components. Though this appears simple, there may be a lot of disagreement on what components (concepts/constructs) should be included or excluded from an index. For instance, in the SES index, isn’t income correlated with education and occupation? And if so, should we include one component only or all three components? Reviewing the literature, using theories, and/or interviewing experts or key stakeholders may help resolve this issue. Second, operationalize and measure each component. For instance, how will you categorize occupations, particularly since some occupations may have changed with time (e.g., there were no Web developers before the Internet)? As we will see in step three below, researchers must create a rule or formula for calculating the index score. Again, this process may involve a lot of subjectivity, so validating the index score using existing or new data is important.
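To make the idea of combining very different components concrete, here is a toy sketch of an SES-style index that standardizes each component and applies equal weights. The data, weights, and standardization choices are assumptions for illustration, not the actual Duncan SEI formula.

```python
import statistics

# Hypothetical components for five respondents
income = [25_000, 40_000, 60_000, 90_000, 120_000]   # dollars
education = [12, 14, 16, 18, 20]                      # years of schooling
occupation = [2, 3, 3, 4, 5]                          # illustrative prestige codes

def standardize(values):
    # Convert a component to z-scores so very different units can be combined
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

weights = {"income": 1/3, "education": 1/3, "occupation": 1/3}  # assumed equal weights

z_income, z_edu, z_occ = map(standardize, (income, education, occupation))
ses_index = [weights["income"] * i + weights["education"] * e + weights["occupation"] * o
             for i, e, o in zip(z_income, z_edu, z_occ)]
print([round(s, 2) for s in ses_index])
```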
Scale and index development are often taught in their own course in doctoral education, so it is unreasonable to expect that you could develop a consistently accurate measure within the span of a week or two. Using available indices and scales is recommended for this reason.
Differences between scales and indices
Though indices and scales yield a single numerical score or value representing a concept of interest, they are different in many ways. First, indices often comprise components that are very different from each other (e.g., income, education, and occupation in the SES index) and are measured in different ways. Conversely, scales typically involve a set of similar items that use the same rating scale (such as a five-point Likert scale about customer satisfaction).
Second, indices often combine objectively measurable values such as prices or income, while scales are designed to assess subjective or judgmental constructs such as attitude, prejudice, or self-esteem. Some argue that the sophistication of the scaling methodology makes scales different from indexes, while others suggest that indexing methodology can be equally sophisticated. Nevertheless, indexes and scales are both essential tools in social science research.
Scales and indices seem like clean, convenient ways to measure different phenomena in social science, but just like with a lot of research, we have to be mindful of the assumptions and biases underneath. What if a scale or an index was developed using only White women as research participants? Is it going to be useful for other groups? It very well might be, but when using a scale or index on a group for whom it hasn't been tested, it will be very important to evaluate the validity and reliability of the instrument, which we address in the rest of the chapter.
Finally, it's important to note that while scales and indices are often made up of nominal- or ordinal-level items, when we aggregate them into composite scores, we typically treat those scores as interval/ratio variables.
Exercises
- Look back at your work from the previous section. Are your variables unidimensional or multidimensional?
- Describe the specific measures you will use (actual questions and response options you will use with participants) for each variable in your research question.
- If you are using a measure developed by another researcher but do not have all of the questions, response options, and instructions needed to implement it, put it on your to-do list to get them.
Step 3: How you will interpret your measures
The final stage of operationalization involves setting the rules for how the measure works and how the researcher should interpret the results. Sometimes, interpreting a measure can be incredibly easy. If you ask someone their age, you’ll probably interpret the results by noting the raw number (e.g., 22) someone provides and whether it is lower or higher than other people's ages. However, you could also recode that person into an age category (e.g., under 25, 20-29 years old, generation Z, etc.). Even scales may be simple to interpret. If there is a scale of problem behaviors, one might simply add up the number of behaviors checked off, with a range from 1-5 indicating low risk of delinquent behavior, 6-10 indicating moderate risk, etc. How you choose to interpret your measures should be guided by how they were designed, how you conceptualize your variables, the data sources you used, and your plan for analyzing your data statistically. Whatever measure you use, you need a set of rules for how to take any valid answer a respondent provides to your measure and interpret it in terms of the variable being measured.
For more complicated measures like scales, refer to the information provided by the author for how to interpret the scale. If you can’t find enough information from the scale’s creator, look at how the results of that scale are reported in the results section of research articles. For example, Beck’s Depression Inventory (BDI-II) uses 21 statements to measure depression and respondents rate their level of agreement on a scale of 0-3. The results for each question are added up, and the respondent is put into one of three categories: low levels of depression (1-16), moderate levels of depression (17-30), or severe levels of depression (31 and over).
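A minimal sketch of this kind of interpretation rule is shown below, using the cut-off values quoted above; the ratings are invented, and the cut-offs are as this chapter reports them rather than the official scoring manual.

```python
def interpret_depression_score(total):
    """Map a summed scale score to a category using the cut-offs described above."""
    if total <= 16:
        return "low levels of depression"
    elif total <= 30:
        return "moderate levels of depression"
    else:
        return "severe levels of depression"

# One respondent's hypothetical ratings (0-3) on 21 statements, summed into a total score.
ratings = [1, 0, 2, 1, 1, 0, 2, 1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 1]
total = sum(ratings)
print(total, "->", interpret_depression_score(total))   # 20 -> moderate levels of depression
```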
One common mistake I see often is that students will introduce another variable into their operational definition. This is incorrect. Your operational definition should mention only one variable—the variable being defined. While your study will certainly draw conclusions about the relationships between variables, that's not what operationalization is. Operationalization specifies what instrument you will use to measure your variable and how you plan to interpret the data collected using that measure.
Operationalization is probably the trickiest component of basic research methods, so please don’t get frustrated if it takes a few drafts and a lot of feedback to get to a workable definition. At the time of this writing, I am in the process of operationalizing the concept of “attitudes towards research methods.” Originally, I thought that I could gauge students’ attitudes toward research methods by looking at their end-of-semester course evaluations. As I became aware of the potential methodological issues with student course evaluations, I opted to use focus groups of students to measure their common beliefs about research. You may recall some of these opinions from Chapter 1, such as the common beliefs that research is boring, useless, and too difficult. After the focus group, I created a scale based on the opinions I gathered, and I plan to pilot test it with another group of students. After the pilot test, I expect that I will have to revise the scale again before I can implement the measure in a real social work research project. At the time I’m writing this, I’m still not completely done operationalizing this concept.
Key Takeaways
- Operationalization involves spelling out precisely how a concept will be measured.
- Operational definitions must include the variable, the measure, and how you plan to interpret the measure.
- There are four different levels of measurement: nominal, ordinal, interval, and ratio (in increasing order of specificity).
- Scales and indices are common ways to collect information and involve using multiple indicators in measurement.
- A key difference between a scale and an index is that a scale contains multiple indicators for one concept, whereas an index examines multiple concepts (components).
- Using scales developed and refined by other researchers can improve the rigor of a quantitative study.
Exercises
Use the research question that you developed in the previous chapters and find a related scale or index that researchers have used. If you have trouble finding the exact phenomenon you want to study, get as close as you can.
- What is the level of measurement for each item on each tool? Take a second and think about why the tool's creator decided to include these levels of measurement. Identify any levels of measurement you would change and why.
- If these tools don't exist for what you are interested in studying, why do you think that is?
12.3 Writing effective questions and questionnaires
Learning Objectives
Learners will be able to...
- Describe some of the ways that survey questions might confuse respondents and how to word questions and responses clearly
- Create mutually exclusive, exhaustive, and balanced response options
- Define fence-sitting and floating
- Describe the considerations involved in constructing a well-designed questionnaire
- Discuss why pilot testing is important
In the previous section, we reviewed how researchers collect data using surveys. Guided by their sampling approach and research context, researchers should choose the survey approach that provides the most favorable tradeoffs in strengths and challenges. With this information in hand, researchers need to write their questionnaire and revise it before beginning data collection. Each method of delivery requires a questionnaire, but they vary a bit based on how they will be used by the researcher. Since phone surveys are read aloud, researchers will pay more attention to how the questionnaire sounds than how it looks. Online surveys can use advanced tools to require the completion of certain questions, present interactive questions and answers, and otherwise afford greater flexibility in how questionnaires are designed. As you read this section, consider how your method of delivery impacts the type of questionnaire you will design. Because most student projects use paper or online surveys, this section will detail how to construct self-administered questionnaires to minimize the potential for bias and error.
Start with operationalization
The first thing you need to do to write effective survey questions is identify what exactly you wish to know. As silly as it sounds to state what seems so completely obvious, we can’t stress enough how easy it is to forget to include important questions when designing a survey. Begin by looking at your research question and refreshing your memory of the operational definitions you developed for those variables from Chapter 11. You should have a pretty firm grasp of your operational definitions before starting the process of questionnaire design. You may have taken those operational definitions from other researchers' methods, found established scales and indices for your measures, or created your own questions and answer options.
Exercises
STOP! Make sure you have a complete operational definition for the dependent and independent variables in your research question. A complete operational definition contains the variable being measured, the measure used, and how the researcher interprets the measure. Let's make sure you have what you need from Chapter 11 to begin writing your questionnaire.
List all of the dependent and independent variables in your research question.
- It's normal to have one dependent or independent variable. It's also normal to have more than one of either.
- Make sure that your research question (and this list) contain all of the variables in your hypothesis. Your hypothesis should only include variables from your research question.
For each variable in your list:
- Write out the measure you will use (the specific questions and answers) for each variable.
- If you don't have questions and answers finalized yet, write a first draft and revise it based on what you read in this section.
- If you are using a measure from another researcher, you should be able to write out all of the questions and answers associated with that measure. If you only have the name of a scale or a few questions, you need access to the full text and some documentation on how to administer and interpret it before you can finish your questionnaire.
- Describe how you will use each measure to draw conclusions about the variable in the operational definition.
- For example, an interpretation might be "there are five 7-point Likert scale questions...point values are added across all five items for each participant...and scores below 10 indicate the participant has low self-esteem"
- Don't introduce other variables into the mix here. All we are concerned with is how you will measure each variable by itself. The connection between variables is done using statistical tests, not operational definitions.
- Detail any validity or reliability issues uncovered by previous researchers using the same measures. If you have concerns about validity and reliability, note them, as well.
If you completed the exercise above and listed out all of the questions and answer choices you will use to measure the variables in your research question, you have already produced a pretty solid first draft of your questionnaire! Congrats! In essence, questionnaires are all of the self-report measures in your operational definitions for the independent, dependent, and control variables in your study arranged into one document and administered to participants. There are a few questions on a questionnaire (like name or ID#) that are not associated with the measurement of variables. These are the exception, and it's useful to think of a questionnaire as a list of measures for variables. Of course, researchers often use more than one measure of a variable (i.e., triangulation) so they can more confidently assert that their findings are true. A questionnaire should contain all of the measures researchers plan to collect about their variables by asking participants to self-report. As we will discuss in the final section of this chapter, triangulating across data sources (e.g., measuring variables using client files or student records) can avoid some of the common sources of bias in survey research.
Sticking close to your operational definitions is important because it helps you avoid an everything-but-the-kitchen-sink approach that includes every possible question that occurs to you. Doing so puts an unnecessary burden on your survey respondents. Remember that you have asked your participants to give you their time and attention and to take care in responding to your questions; show them your respect by only asking questions that you actually plan to use in your analysis. For each question in your questionnaire, ask yourself how this question measures a variable in your study. An operational definition should contain the questions, response options, and how the researcher will draw conclusions about the variable based on participants' responses.
Writing questions
So, almost all of the questions on a questionnaire are measuring some variable. For many variables, researchers will create their own questions rather than using one from another researcher. This section will provide some tips on how to create good questions to accurately measure variables in your study. First, questions should be as clear and to the point as possible. This is not the time to show off your creative writing skills; a survey is a technical instrument and should be written in a way that is as direct and concise as possible. As I’ve mentioned earlier, your survey respondents have agreed to give their time and attention to your survey. The best way to show your appreciation for their time is to not waste it. Ensuring that your questions are clear and concise will go a long way toward showing your respondents the gratitude they deserve. Pilot testing the questionnaire with friends or colleagues can help identify these issues. This process is commonly called pretesting, but to avoid any confusion with pretesting in experimental design, we refer to it as pilot testing.
Related to the point about not wasting respondents’ time, make sure that every question you pose will be relevant to every person you ask to complete it. This means two things: first, that respondents have knowledge about whatever topic you are asking them about, and second, that respondents have experienced the events, behaviors, or feelings you are asking them to report. If you are asking participants for second-hand knowledge—asking clinicians about clients' feelings, asking teachers about students' feelings, and so forth—you may want to clarify that the variable you are asking about is the key informant's perception of what is happening in the target population. A well-planned sampling approach ensures that participants are the most knowledgeable population to complete your survey.
If you decide that you do wish to include questions about matters with which only a portion of respondents will have had experience, make sure you know why you are doing so. For example, if you are asking about MSW student study patterns and you decide to include a question on studying for the social work licensing exam, you may only have a small subset of participants who have begun studying for the graduate-level exam or who took the bachelor's-level exam. If you decide to include this question that speaks to a minority of participants' experiences, think about why you are including it. Are you interested in how studying for class and studying for licensure differ? Are you trying to triangulate study skills measures? Researchers should carefully consider whether questions relevant to only a subset of participants are likely to produce enough valid responses for quantitative analysis.
Many times, questions that are relevant to a subsample of participants are conditional on an answer to a previous question. A participant might select that they rent their home, and as a result, you might ask whether they carry renter's insurance. That question is not relevant to homeowners, so it would be wise not to ask them to respond to it. In that case, the question of whether someone rents or owns their home is a filter question, designed to identify some subset of survey respondents who are asked additional questions that are not relevant to the entire sample. Figure 12.1 presents an example of how to accomplish this on a paper survey by adding instructions to the participant that indicate what question to proceed to next based on their response to the first one. Using online survey tools, researchers can use filter questions to only present relevant questions to participants.
To minimize confusion, researchers should eliminate questions that ask about things participants don't know. Assuming the question is relevant to the participant, other sources of confusion come from how the question is worded. The use of negative wording can be a source of potential confusion. Taking the question from Figure 12.1 about drinking as our example, what if we had instead asked, “Did you not abstain from drinking during your first semester of college?” This is a double negative, and it's not clear how to answer the question accurately. It is a good idea to avoid negative phrasing when possible. For example, “Did you not drink alcohol during your first semester of college?” is less clear than “Did you drink alcohol during your first semester of college?”
You should also avoid using terms or phrases that may be regionally or culturally specific (unless you are absolutely certain all your respondents come from the region or culture whose terms you are using). When I first moved to southwest Virginia, I didn’t know what a holler was. Where I grew up in New Jersey, to holler means to yell. Even then, in New Jersey, we shouted and screamed, but we didn’t holler much. In southwest Virginia, my home at the time, a holler also means a small valley in between the mountains. If I used holler in that way on my survey, people who live near me may understand, but almost everyone else would be totally confused. A similar issue arises when you use jargon, or technical language, that people do not commonly know. For example, if you asked adolescents how they experience imaginary audience, they would find it difficult to link those words to the concepts from David Elkind’s theory. The words you use in your questions must be understandable to your participants. If you find yourself using jargon or slang, break it down into terms that are more universal and easier to understand.
Asking multiple questions as though they are a single question can also confuse survey respondents. There’s a specific term for this sort of question; it is called a double-barreled question. Figure 12.2 shows a double-barreled question. Do you see what makes the question double-barreled? How would someone respond if they felt their college classes were more demanding but also more boring than their high school classes? Or less demanding but more interesting? Because the question combines “demanding” and “interesting,” there is no way to respond yes to one criterion but no to the other.
Another thing to avoid when constructing survey questions is the problem of social desirability. We all want to look good, right? And we all probably know the politically correct response to a variety of questions whether we agree with the politically correct response or not. In survey research, social desirability refers to the idea that respondents will try to answer questions in a way that will present them in a favorable light. (You may recall we covered social desirability bias in Chapter 11.)
Perhaps we decide that to understand the transition to college, we need to know whether respondents ever cheated on an exam in high school or college for our research project. We all know that cheating on exams is generally frowned upon (at least I hope we all know this). So, it may be difficult to get people to admit to cheating on a survey. But if you can guarantee respondents’ confidentiality, or even better, their anonymity, chances are much better that they will be honest about having engaged in this socially undesirable behavior. Another way to avoid problems of social desirability is to try to phrase difficult questions in the most benign way possible. Earl Babbie (2010) [23] offers a useful suggestion for helping you do this—simply imagine how you would feel responding to your survey questions. If you would be uncomfortable, chances are others would as well.
Exercises
Try to step outside your role as researcher for a second, and imagine you were one of your participants. Evaluate the following:
- Is the question too general? Sometimes, questions that are too general may not accurately convey respondents’ perceptions. If you asked someone how they liked a certain book and provided a response scale ranging from “not at all” to “extremely well”, and that person selected “extremely well," what do they mean? Instead, ask more specific behavioral questions, such as "Will you recommend this book to others?" or "Do you plan to read other books by the same author?"
- Is the question too detailed? Avoid unnecessarily detailed questions that serve no specific research purpose. For instance, do you need the age of each child in a household or is just the number of children in the household acceptable? However, if unsure, it is better to err on the side of details than generality.
- Is the question presumptuous? Does your question make assumptions? For instance, if you ask, "what do you think the benefits of a tax cut would be?" you are presuming that the participant sees the tax cut as beneficial. But many people may not view tax cuts as beneficial. Some might see tax cuts as a precursor to less funding for public schools and fewer public services such as police, ambulance, and fire department. Avoid questions with built-in presumptions.
- Does the question ask the participant to imagine something? Is the question imaginary? A popular question on many television game shows is “if you won a million dollars on this show, how will you plan to spend it?” Most participants have never been faced with this large amount of money and have never thought about this scenario. In fact, most don’t even know that after taxes, the value of the million dollars will be greatly reduced. In addition, some game shows spread the amount over a 20-year period. Without understanding this "imaginary" situation, participants may not have the background information necessary to provide a meaningful response.
Finally, it is important to get feedback on your survey questions from as many people as possible, especially people who are like those in your sample. Now is not the time to be shy. Ask your friends for help, ask your mentors for feedback, ask your family to take a look at your survey as well. The more feedback you can get on your survey questions, the better the chances that you will come up with a set of questions that are understandable to a wide variety of people and, most importantly, to those in your sample.
In sum, in order to pose effective survey questions, researchers should do the following:
- Identify how each question measures an independent, dependent, or control variable in their study.
- Keep questions clear and succinct.
- Make sure respondents have relevant lived experience to provide informed answers to your questions.
- Use filter questions to avoid getting answers from uninformed participants.
- Avoid questions that are likely to confuse respondents—including those that use double negatives, use culturally specific terms or jargon, and pose more than one question at a time.
- Imagine how respondents would feel responding to questions.
- Get feedback, especially from people who resemble those in the researcher’s sample.
Exercises
Let's complete a first draft of your questions. In the previous exercise, you listed all of the questions and answers you will use to measure the variables in your research question.
- In the previous exercise, you wrote out the questions and answers for each measure of your independent and dependent variables. Evaluate each question using the criteria listed above on effective survey questions.
- Type out questions for your control variables and evaluate them, as well. Consider what response options you want to offer participants.
Now, let's revise any questions that do not meet your standards!
- Use the BRUSO model in Table 12.2 for an illustration of how to address deficits in question wording. Keep in mind that you are writing a first draft in this exercise, and it will take a few drafts and revisions before your questions are ready to distribute to participants.
| Criterion | Poor | Effective |
| --- | --- | --- |
| B - Brief | “Are you now or have you ever been the possessor of a firearm?” | “Have you ever possessed a firearm?” |
| R - Relevant | “Who did you vote for in the last election?” | Note: Only include items that are relevant to your study. |
| U - Unambiguous | “Are you a gun person?” | “Do you currently own a gun?” |
| S - Specific | “How much have you read about the new gun control measure and sales tax?” | “How much have you read about the new sales tax on firearm purchases?” |
| O - Objective | “How much do you support the beneficial new gun control measure?” | “What is your view of the new gun control measure?” |
Writing response options
While posing clear and understandable questions in your survey is certainly important, so too is providing respondents with unambiguous response options. Response options are the answers that you provide to the people completing your questionnaire. Generally, respondents will be asked to choose a single (or best) response to each question you pose. We call questions in which the researcher provides all of the response options closed-ended questions. Keep in mind, closed-ended questions can also instruct respondents to choose multiple response options, rank response options against one another, or assign a percentage to each response option. But be cautious when experimenting with different response options! Accepting multiple responses to a single question may add complexity when it comes to quantitatively analyzing and interpreting your data.
Surveys need not be limited to closed-ended questions. Sometimes survey researchers include open-ended questions in their survey instruments as a way to gather additional details from respondents. An open-ended question does not include response options; instead, respondents are asked to reply to the question in their own way, using their own words. These questions are generally used to find out more about a survey participant’s experiences or feelings about whatever they are being asked to report in the survey. If, for example, a survey includes closed-ended questions asking respondents to report on their involvement in extracurricular activities during college, an open-ended question could ask respondents why they participated in those activities or what they gained from their participation. While responses to such questions may also be captured using a closed-ended format, allowing participants to share some of their responses in their own words can make the experience of completing the survey more satisfying to respondents and can also reveal new motivations or explanations that had not occurred to the researcher. This is particularly important for mixed-methods research. It is possible to analyze open-ended response options quantitatively using content analysis (i.e., counting how often a theme is represented in a transcript and looking for statistical patterns). However, for most researchers, qualitative data analysis will be needed to analyze open-ended questions, and researchers need to think through how they will analyze any open-ended questions as part of their data analysis plan. We will address qualitative data analysis in greater detail in Chapter 19.
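If you do plan to quantify open-ended responses, a minimal sketch of what basic content analysis can look like appears below. It simply counts how many responses mention keywords tied to each theme; the responses, theme names, and keyword lists are all invented for illustration, and a real content analysis would rely on a carefully developed codebook.

```python
# A minimal sketch of quantitative content analysis: counting how often
# hypothetical themes appear across open-ended survey responses.
# The responses and theme keywords below are invented for illustration.

responses = [
    "I joined the club to make friends and build my resume",
    "Mostly for my resume, but I also learned leadership skills",
    "My friends were already members, so I tagged along",
]

themes = {
    "social connection": ["friend", "friends", "belong"],
    "career preparation": ["resume", "career", "leadership"],
}

counts = {theme: 0 for theme in themes}
for response in responses:
    text = response.lower()
    for theme, keywords in themes.items():
        if any(keyword in text for keyword in keywords):
            counts[theme] += 1  # count each response at most once per theme

print(counts)  # {'social connection': 2, 'career preparation': 2}
```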
To keep things simple, we encourage you to use only closed-ended response options in your study. While open-ended questions are not wrong, in our classrooms they are often a sign that students have not fully thought through how to operationally define and measure their key variables. Open-ended questions cannot be operationally defined in advance because you don't know what responses you will get. Instead, you will need to analyze the qualitative data using one of the techniques we discuss in Chapter 19 to interpret your participants' responses.
To write effective response options for closed-ended questions, there are a couple of guidelines worth following. First, be sure that your response options are mutually exclusive. Look back at Figure 12.1, which contains questions about how often and how many drinks respondents consumed. Do you notice that there are no overlapping categories in the response options for these questions? This is another one of those points about question construction that seems fairly obvious but that can be easily overlooked. Response options should also be exhaustive. In other words, every possible response should be covered in the set of response options that you provide. For example, note that in question 10a in Figure 12.1, we have covered all possibilities—those who drank, say, an average of once per month can choose the first response option (“less than one time per week”) while those who drank multiple times a day each day of the week can choose the last response option (“7+”). All the possibilities in between these two extremes are covered by the middle three response options, and every respondent fits into one of the response options we provided.
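One way to think about mutual exclusivity and exhaustiveness is that every possible numeric answer should map onto exactly one category. The sketch below illustrates that idea with category labels loosely modeled on the drinking-frequency question described above; the exact wording and cut points in Figure 12.1 may differ, so treat these as hypothetical.

```python
def categorize_drinking(times_per_week):
    """Map a numeric response onto mutually exclusive, exhaustive categories.

    Hypothetical labels loosely based on the question described in the text.
    """
    if times_per_week < 1:
        return "less than one time per week"
    elif times_per_week <= 2:
        return "1-2 times per week"
    elif times_per_week <= 4:
        return "3-4 times per week"
    elif times_per_week <= 6:
        return "5-6 times per week"
    else:
        return "7+ times per week"

# Every non-negative value falls into exactly one category (exhaustive),
# and no value can fall into two categories (mutually exclusive).
for value in [0, 1, 2, 3, 5, 7, 14]:
    print(value, "->", categorize_drinking(value))
```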
Earlier in this section, we discussed double-barreled questions. Response options can also be double barreled, and this should be avoided. Figure 12.3 is an example of a question that uses double-barreled response options. Other tips about questions are also relevant to response options, including that participants should be knowledgeable enough to select or decline a response option as well as avoiding jargon and cultural idioms.
Even if you phrase questions and response options clearly, participants are influenced by how many response options are presented on the questionnaire. For Likert scales, five or seven response options generally allow about as much precision as respondents are capable of. However, numerical scales with more options can sometimes be appropriate. For dimensions such as attractiveness, pain, and likelihood, a 0-to-10 scale will be familiar to many respondents and easy for them to use. Regardless of the number of response options, the most extreme ones should generally be “balanced” around a neutral or modal midpoint. An example of an unbalanced rating scale measuring perceived likelihood might look like this:
Unlikely | Somewhat Likely | Likely | Very Likely | Extremely Likely
Because we have four rankings of likely and only one ranking of unlikely, the scale is unbalanced and most responses will be biased toward "likely" rather than "unlikely." A balanced version might look like this:
Extremely Unlikely | Somewhat Unlikely | As Likely as Not | Somewhat Likely | Extremely Likely
In this example, the midpoint is halfway between likely and unlikely. Of course, a middle or neutral response option does not have to be included. Researchers sometimes choose to leave it out because they want to encourage respondents to think more deeply about their response and not simply choose the middle option by default. Fence-sitters are respondents who choose neutral response options, even if they have an opinion. Some people will be drawn to respond “no opinion” even if they have an opinion, particularly if their true opinion is not a socially desirable one. Floaters, on the other hand, are those who choose a substantive answer to a question when really, they don’t understand the question or don’t have an opinion.
As you can see, floating is the flip side of fence-sitting. Thus, the solution to one problem is often the cause of the other. How you decide which approach to take depends on the goals of your research. Sometimes researchers specifically want to learn something about people who claim to have no opinion. In this case, allowing for fence-sitting would be necessary. Other times researchers feel confident their respondents will all be familiar with every topic in their survey. In this case, perhaps it is okay to force respondents to choose one side or another (e.g., agree or disagree) without a middle option (e.g., neither agree nor disagree) or to not include an option like "don't know enough to say" or "not applicable." There is no always-correct solution to either problem. But in general, a response set that includes a middle option is more exhaustive than one that excludes it.
The most important check before you finalize your response options is to align them with your operational definitions. As we've discussed before, your operational definitions include your measures (questions and response options) as well as how to interpret those measures in terms of the variable being measured. In particular, you should be able to interpret every response option to a question based on your operational definition of the variable it measures. If you wanted to measure the variable "social class," you might ask one question about a participant's annual income and another about family size. Your operational definition would need to provide clear instructions on how to interpret response options. Your operational definition is basically like this social class calculator from Pew Research, though they include a few more questions in their definition.
To drill down a bit more, as Pew specifies in the section titled "how the income calculator works," the interval/ratio data respondents enter are interpreted using a formula that combines a participant's four responses to the questions posed by Pew and categorizes their household into one of three classes: upper, middle, or lower. So, the operational definition includes the four questions comprising the measure and the formula or interpretation that converts responses into the three final categories we are familiar with: lower, middle, and upper class.
It is interesting to note that even though the final categories participants are sorted into (lower, middle, and upper class) are at an ordinal level of measurement, Pew asks four questions that use an interval or ratio level of measurement (depending on the question). This means that respondents provide numerical responses rather than choosing categories like lower, middle, and upper class. It's perfectly normal for operational definitions to change levels of measurement, and it's also perfectly normal for the level of measurement to stay the same. The important thing is that each response option a participant can provide is accounted for by the operational definition. Throw any combination of family size, location, or income at the Pew calculator, and it will place you into one of those three social class categories.
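To make the idea concrete, here is a deliberately simplified, hypothetical sketch of how an operational definition like this one can convert interval/ratio inputs (income and household size) into ordinal categories. The household-size adjustment, the made-up national median, and the cutoffs below are illustrative assumptions, not Pew's actual formula.

```python
# A simplified, hypothetical sketch of turning interval/ratio responses
# (income, household size) into ordinal categories (lower/middle/upper class).
# The adjustment and cutoffs are invented for illustration.

NATIONAL_MEDIAN_INCOME = 70_000  # hypothetical reference value, in dollars

def social_class(household_income, household_size):
    # Adjust income for household size so households of different sizes
    # can be compared on a common footing (an assumed adjustment).
    adjusted = household_income / (household_size ** 0.5)

    # Interpret the adjusted value against cutoffs defined relative to the
    # hypothetical national median.
    if adjusted < (2 / 3) * NATIONAL_MEDIAN_INCOME:
        return "lower class"
    elif adjusted <= 2 * NATIONAL_MEDIAN_INCOME:
        return "middle class"
    else:
        return "upper class"

print(social_class(45_000, 3))   # lower class
print(social_class(90_000, 2))   # middle class
print(social_class(400_000, 2))  # upper class
```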
Unlike Pew's definition, the operational definitions in your study may not need their own webpage to define and describe. For many questions and answers, interpreting response options is easy. If you were measuring "income" instead of "social class," you could simply operationalize the term by asking people to list their total household income before taxes are taken out. Higher values indicate higher income, and lower values indicate lower income. Easy. Regardless of whether your operational definitions are simple or more complex, every response option to every question on your survey (with a few exceptions) should be interpretable using an operational definition of a variable. Just like we want to avoid an everything-but-the-kitchen-sink approach to questions on our questionnaire, you want to make sure your final questionnaire only contains response options that you will use in your study.
One note of caution on interpretation (sorry for repeating this). We want to remind you again that an operational definition should not mention more than one variable. In our example above, your operational definition could not say "a family of three making under $50,000 is lower class; therefore, they are more likely to experience food insecurity." That last clause about food insecurity may well be true, but it's not part of the operational definition for social class. Each variable (food insecurity and class) should have its own operational definition. If you are talking about how to interpret the relationship between two variables, you are talking about your data analysis plan. We will discuss how to create your data analysis plan beginning in Chapter 14. For now, one consideration is that depending on the statistical test you use to test relationships between variables, you may need nominal, ordinal, or interval/ratio data. Your questions and response options should provide the level of measurement required by the specific statistical tests in your data analysis plan. Once you finalize your data analysis plan, return to your questionnaire to make sure the level of measurement matches the statistical test you've chosen.
In summary, to write effective response options researchers should do the following:
- Avoid wording that is likely to confuse respondents, including double negatives, culturally specific terms or jargon, and double-barreled response options.
- Ensure response options are relevant to participants' knowledge and experience so they can make an informed and accurate choice.
- Present mutually exclusive and exhaustive response options.
- Consider fence-sitters and floaters, and the use of neutral or "not applicable" response options.
- Define how response options are interpreted as part of an operational definition of a variable.
- Check that the level of measurement matches the operational definitions and the statistical tests in the data analysis plan (once you develop one in the future).
Exercises
Look back at the response options you drafted in the previous exercise. Make sure you have a first draft of response options for each closed-ended question on your questionnaire.
- Using the criteria above, evaluate the wording of the response options for each question on your questionnaire.
- Revise your questions and response options until you have a complete first draft.
- Do your first read-through and provide a dummy answer to each question. Make sure you can link each response option and each question to an operational definition.
- Look ahead to Chapter 14 and consider how each item on your questionnaire will inform your data analysis plan.
From this discussion, we hope it is clear why researchers using quantitative methods spell out all of their plans ahead of time. Ultimately, there should be a straight line from operational definition through measures on your questionnaire to the data analysis plan. If your questionnaire includes response options that are not aligned with operational definitions or not included in the data analysis plan, the responses you receive back from participants won't fit with your conceptualization of the key variables in your study. If you do not fix these errors and proceed with collecting unstructured data, you will lose out on many of the benefits of survey research and face overwhelming challenges in answering your research question.
Designing questionnaires
Based on your work in the previous section, you should have a first draft of the questions and response options for the key variables in your study. Now, you’ll also need to think about how to present your written questions and response options to survey respondents. It's time to write a final draft of your questionnaire and make it look nice. Designing questionnaires takes some thought. First, consider the route of administration for your survey. What we cover in this section will apply equally to paper and online surveys, but if you are planning to use online survey software, you should watch tutorial videos and explore the features of the survey software you will use.
Informed consent & instructions
Writing effective items is only one part of constructing a survey. For one thing, every survey should have a written or spoken introduction that serves two basic functions (Peterson, 2000).[24] One is to encourage respondents to participate in the survey. In many types of research, such encouragement is not necessary either because participants do not know they are in a study (as in naturalistic observation) or because they are part of a subject pool and have already shown their willingness to participate by signing up and showing up for the study. Survey research usually catches respondents by surprise when they answer their phone, go to their mailbox, or check their e-mail—and the researcher must make a good case for why they should agree to participate. Thus, the introduction should briefly explain the purpose of the survey and its importance, provide information about the sponsor of the survey (university-based surveys tend to generate higher response rates), acknowledge the importance of the respondent’s participation, and describe any incentives for participating.
The second function of the introduction is to establish informed consent. Remember that this involves describing to respondents everything that might affect their decision to participate. This includes the topics covered by the survey, the amount of time it is likely to take, the respondent’s option to withdraw at any time, confidentiality issues, and other ethical considerations we covered in Chapter 6. Written consent forms are not always used in survey research (when the research poses minimal risk, the IRB often accepts completion of the survey instrument as evidence of consent to participate), so it is important that this part of the introduction be well documented and presented clearly and in its entirety to every respondent.
Organizing items to be easy and intuitive to follow
The introduction should be followed by the substantive questionnaire items. But first, it is important to present clear instructions for completing the questionnaire, including examples of how to use any unusual response scales. Remember that the introduction is the point at which respondents are usually most interested and least fatigued, so it is good practice to start with the most important items for purposes of the research and proceed to less important items. Items should also be grouped by topic or by type. For example, items using the same rating scale (e.g., a 5-point agreement scale) should be grouped together if possible to make things faster and easier for respondents. Demographic items are often presented last because they are least interesting to participants but also easy to answer in the event respondents have become tired or bored. Of course, any survey should end with an expression of appreciation to the respondent.
Questions are often organized thematically. If our survey were measuring social class, perhaps we’d have a few questions asking about employment, others focused on education, and still others on housing and community resources. Those may be the themes around which we organize our questions. Or perhaps it would make more sense to present any questions we had about parents' income and then present a series of questions about estimated future income. Grouping by theme is one way to be deliberate about how you present your questions. Keep in mind that you are surveying people, and these people will be trying to follow the logic in your questionnaire. Jumping from topic to topic can give people a bit of whiplash and may make participants less likely to complete it.
Using a matrix is a nice way of streamlining response options for similar questions. A matrix is a question type that lists a set of questions for which the answer categories are all the same. If you have a set of questions for which the response options are the same, it may make sense to create a matrix rather than posing each question and its response options individually. Not only will this save you some space in your survey but it will also help respondents progress through your survey more easily. A sample matrix can be seen in Figure 12.4.
Once you have grouped similar questions together, you’ll need to think about the order in which to present those question groups. Most survey researchers agree that it is best to begin a survey with questions that will make respondents want to continue (Babbie, 2010; Dillman, 2000; Neuman, 2003).[25] In other words, don’t bore respondents, but don’t scare them away either. There’s some disagreement over where on a survey to place demographic questions, such as those about a person’s age, gender, and race. On the one hand, placing them at the beginning of the questionnaire may lead respondents to think the survey is boring, unimportant, and not something they want to bother completing. On the other hand, if your survey deals with some very sensitive topic, such as child sexual abuse or criminal convictions, you don’t want to scare respondents away or shock them by beginning with your most intrusive questions.
Your participants are human. They will react emotionally to questionnaire items, and they will also try to uncover your research questions and hypotheses. In truth, the order in which you present questions on a survey is best determined by the unique characteristics of your research. When feasible, you should consult with key informants from your target population to determine how best to order your questions. If it is not feasible to do so, think about the unique characteristics of your topic, your questions, and most importantly, your sample. Keeping in mind the characteristics and needs of the people you will ask to complete your survey should help guide you as you determine the most appropriate order in which to present your questions. None of your decisions will be perfect, and all studies have limitations.
Questionnaire length
You’ll also need to consider the time it will take respondents to complete your questionnaire. Surveys vary in length, from just a page or two to a dozen or more pages, which means they also vary in the time it takes to complete them. How long to make your survey depends on several factors. First, what is it that you wish to know? Wanting to understand how grades vary by gender and year in school certainly requires fewer questions than wanting to know how people’s experiences in college are shaped by demographic characteristics, college attended, housing situation, family background, college major, friendship networks, and extracurricular activities. Keep in mind that even if your research question requires a sizable number of questions be included in your questionnaire, do your best to keep the questionnaire as brief as possible. Any hint that you’ve thrown in a bunch of useless questions just for the sake of it will turn off respondents and may make them not want to complete your survey.
Second, and perhaps more important, how long are respondents likely to be willing to spend completing your questionnaire? If you are studying college students, asking them to use their limited free time to complete your survey may mean they won’t want to spend more than a few minutes on it. But if you can ask them to complete your survey during downtime between classes when there is little work to be done, students may be willing to give you a bit more of their time. Think about places and times that your sampling frame naturally gathers and whether you would be able to either recruit participants or distribute a survey in that context. Estimate how long your participants would reasonably have to complete a survey presented to them during this time. The more you know about your population (such as what weeks have less work and more free time), the better you can target questionnaire length.
The time that survey researchers ask respondents to spend on questionnaires varies greatly. Some researchers advise that surveys should not take longer than about 15 minutes to complete (as cited in Babbie 2010),[26] whereas others suggest that up to 20 minutes is acceptable (Hopper, 2010).[27] As with question order, there is no clear-cut, always-correct answer about questionnaire length. The unique characteristics of your study and your sample should be considered to determine how long to make your questionnaire. For example, if you planned to distribute your questionnaire to students in between classes, you will need to make sure it is short enough to complete before the next class begins.
When designing a questionnaire, a researcher should consider:
- Weighing strengths and limitations of the method of delivery, including the advanced tools in online survey software or the simplicity of paper questionnaires.
- Grouping together items that ask about the same thing.
- Moving any questions about sensitive items to the end of the questionnaire, so as not to scare respondents off.
- Moving any questions that engage the respondent to answer the questionnaire at the beginning, so as not to bore them.
- Timing the length of the questionnaire with a reasonable length of time you can ask of your participants.
- Dedicating time to visual design and ensuring the questionnaire looks professional.
Exercises
Type out a final draft of your questionnaire in a word processor or online survey tool.
- Evaluate your questionnaire using the guidelines above, revise it, and get it ready to share with other student researchers.
Pilot testing and revising questionnaires
A good way to estimate the time it will take respondents to complete your questionnaire (and other potential challenges) is through pilot testing. Pilot testing allows you to get feedback on your questionnaire so you can improve it before you actually administer it. It can be quite expensive and time consuming if you wish to pilot test your questionnaire on a large sample of people who very much resemble the sample to whom you will eventually administer the finalized version of your questionnaire. But you can learn a lot and make great improvements to your questionnaire simply by pilot testing with a small number of people to whom you have easy access (perhaps you have a few friends who owe you a favor). By pilot testing your questionnaire, you can find out how understandable your questions are, get feedback on question wording and order, find out whether any of your questions are boring or offensive, and learn whether there are places where you should have included filter questions. You can also time pilot testers as they take your survey. This will give you a good idea about the estimate to provide respondents when you administer your survey and whether you have some wiggle room to add additional items or need to cut a few items.
Perhaps this goes without saying, but your questionnaire should also have an attractive design. A messy presentation style can confuse respondents or, at the very least, annoy them. Be brief, to the point, and as clear as possible. Avoid cramming too much into a single page. Make your font size readable (at least 12 point or larger, depending on the characteristics of your sample), leave a reasonable amount of space between items, and make sure all instructions are exceptionally clear. If you are using an online survey, ensure that participants can complete it via mobile, computer, and tablet devices. Think about books, documents, articles, or web pages that you have read yourself—which were relatively easy to read and easy on the eyes and why? Try to mimic those features in the presentation of your survey questions. While online survey tools automate much of visual design, word processors are designed for writing all kinds of documents and may need more manual adjustment as part of visual design.
Realistically, your questionnaire will continue to evolve as you develop your data analysis plan over the next few chapters. By now, you should have a complete draft of your questionnaire grounded in an underlying logic that ties together each question and response option to a variable in your study. Once your questionnaire is finalized, you will need to submit it for ethical approval from your professor or the IRB. If your study requires IRB approval, it may be worthwhile to submit your proposal before your questionnaire is completely done. Revisions to IRB protocols are common and it takes less time to review a few changes to questions and answers than it does to review the entire study, so give them the whole study as soon as you can. Once the IRB approves your questionnaire, you cannot change it without their okay.
Key Takeaways
- A questionnaire is composed of self-report measures of variables in a research study.
- Make sure your survey questions will be relevant to all respondents and that you use filter questions when necessary.
- Effective survey questions and responses take careful construction by researchers, as participants may be confused or otherwise influenced by how items are phrased.
- The questionnaire should start with informed consent and instructions, flow logically from one topic to the next, engage but not shock participants, and thank participants at the end.
- Pilot testing can help identify any issues in a questionnaire before distributing it to participants, including language or length issues.
Exercises
It's a myth that researchers work alone! Get together with a few of your fellow students and swap questionnaires for pilot testing.
- Use the criteria in each section above (questions, response options, questionnaires) and provide your peers with the strengths and weaknesses of their questionnaires.
- See if you can guess their research question and hypothesis based on the questionnaire alone.
11.3 Measurement quality
Learning Objectives
Learners will be able to...
- Define and describe the types of validity and reliability
- Assess for systematic error
The previous chapter provided insight into measuring concepts in social work research. We discussed the importance of identifying concepts and their corresponding indicators as a way to help us operationalize them. In essence, we now understand that when we think about our measurement process, we must be intentional and thoughtful in the choices that we make. This section is all about how to judge the quality of the measures you've chosen for the key variables in your research question.
Reliability
First, let’s say we’ve decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism this way, then it is likely that anyone who identifies as an alcoholic would respond “yes.” This may seem like a good way to identify our group of interest, but think about how you and your peer group may respond to this question. Would participants respond differently after a wild night out, compared to any other night? Could an infrequent drinker’s current headache from last night’s glass of wine influence how they answer the question this morning? How would that same person respond to the question before consuming the wine? In each case, the same person might respond differently to the same question at different points, so it is possible that our measure of alcoholism has a reliability problem. Reliability in measurement is about consistency.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behavior, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past 3 months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. On the other hand, we might get more accurate responses if we ask a participant how many drinks of any kind they have consumed in the past week.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviors. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. They may decide to conduct observations at a local pub by noting how many drinks patrons consume and how their behavior changes as their intake changes. What if the researcher has to use the restroom, and the patron next to them takes three shots of tequila during the brief period the researcher is away from their seat? The reliability of this researcher’s measure of alcohol intake depends on their ability to physically observe every instance of patrons consuming drinks. If they are unlikely to be able to observe every such instance, then perhaps their mechanism for measuring this concept is not reliable.
The following subsections describe the types of reliability that are important for you to know about, but keep in mind that you may see other approaches to judging reliability mentioned in the empirical literature.
Test-retest reliability
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time, then using it again on the same group of people at a later time. Unlike an experiment, you aren't giving participants an intervention but trying to establish a reliable baseline of the variable you are measuring. Once you have these two measurements, you then look at the correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 11.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
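For instance, a minimal sketch of estimating test-retest reliability might look like the following, using invented scores for ten participants measured a week apart and NumPy's correlation function.

```python
import numpy as np

# Hypothetical self-esteem scores for the same ten participants measured
# one week apart (values are invented for illustration).
time_1 = np.array([22, 25, 18, 30, 27, 15, 20, 24, 29, 17])
time_2 = np.array([23, 24, 19, 29, 28, 16, 21, 23, 30, 18])

# The test-retest reliability estimate is the correlation between the two
# administrations; values of about +.80 or higher suggest good reliability.
r = np.corrcoef(time_1, time_2)[0, 1]
print(round(r, 2))  # prints a value close to +1 for these invented scores
```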
Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
Internal consistency
Another kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials. A specific statistical test known as Cronbach’s Alpha provides a way to measure how well each question of a scale is related to the others.
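Statistical software will compute Cronbach's alpha for you, but the formula is simple enough to sketch by hand. The example below uses invented responses from six participants to a four-item scale; a common rule of thumb is that an alpha of about .70 or higher indicates acceptable internal consistency.

```python
import numpy as np

# Hypothetical responses from six participants to a four-item scale
# (rows = participants, columns = items; values are invented).
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])

def cronbach_alpha(item_scores):
    k = item_scores.shape[1]                          # number of items
    item_variances = item_scores.var(axis=0, ddof=1)  # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# A common rule of thumb: alpha of roughly .70 or higher is acceptable.
print(round(cronbach_alpha(scores), 2))
```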
Interrater reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other.
Validity
Validity, another key element of assessing measurement quality, is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.
Face validity
Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.
Content validity
Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Criterion validity
Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
Discriminant validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
Increasing the reliability and validity of measures
We have reviewed the types of errors and how to evaluate our measures based on reliability and validity considerations. However, what can we do while selecting or creating our tool so that we minimize the potential for errors? Many of our options were covered in our discussion about reliability and validity. Nevertheless, the following table provides a quick summary of things that you should do when creating or selecting a measurement tool. While not all of these suggestions will be feasible in your project, implement those that are easy to carry out in your research context.
Make sure that you engage in a rigorous literature review so that you understand the concept that you are studying. This means understanding the different ways that your concept may manifest itself. This review should include a search for existing instruments.[28]
- Do you understand all the dimensions of your concept? Do you have a good understanding of the content dimensions of your concept(s)?
- What instruments exist? How many items are on the existing instruments? Are these instruments appropriate for your population?
- Are these instruments standardized? Note: If an instrument is standardized, that means it has been rigorously studied and tested.
Consult content experts to review your instrument. This is a good way to check the face validity of your items. Additionally, content experts can also help you understand the content validity.[29]
- Do you have access to a reasonable number of content experts? If not, how can you locate them?
- Did you provide a list of critical questions for your content reviewers to use in the reviewing process?
Pilot test your instrument on a sufficient number of people and get detailed feedback.[30] Ask your group to provide feedback on the wording and clarity of items. Keep detailed notes and make adjustments BEFORE you administer your final tool.
- How many people will you use in your pilot testing?
- How will you set up your pilot testing so that it mimics the actual process of administering your tool?
- How will you receive feedback from your pilot testing group? Have you provided a list of questions for your group to think about?
Provide training for anyone collecting data for your project.[31] You should provide those helping you with a written research protocol that explains all of the steps of the project. You should also problem solve and answer any questions that those helping you may have. This will increase the chances that your tool will be administered in a consistent manner.
- How will you conduct your orientation/training? How long will it be? What modality?
- How will you select those who will administer your tool? What qualifications do they need?
When thinking of items, use a higher level of measurement, if possible.[32] This will provide more information and you can always downgrade to a lower level of measurement later.
- Have you examined your items and the levels of measurement?
- Have you thought about whether you need to modify the type of data you are collecting? Specifically, are you asking for information that is too specific (at a higher level of measurement) which may reduce participants' willingness to participate?
Use multiple indicators for a variable.[33] Think about the number of items that you will include in your tool.
- Do you have enough items? Enough indicators? The correct indicators?
Conduct an item-by-item assessment of multiple-item measures.[34] When you do this assessment, think about each word and how it changes the meaning of your item.
- Are there items that are redundant? Do you need to modify, delete, or add items?
Types of error
As you can see, measures never perfectly describe what exists in the real world. Good measures demonstrate validity and reliability but will always have some degree of error. Systematic error (also called bias) causes our measures to consistently output incorrect data in one direction or another on a measure, usually due to an identifiable process. Imagine you created a measure of height, but you didn’t put an option for anyone over six feet tall. If you gave that measure to your local college or university, some of the taller students might not be measured accurately. In fact, you would be under the mistaken impression that the tallest person at your school was six feet tall, when in actuality there are likely people taller than six feet at your school. This error seems innocent, but if you were using that measure to help you build a new building, those people might hit their heads!
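A quick simulation can show how this kind of systematic error behaves. In the hedged sketch below, hypothetical true heights are generated at random, but the flawed measure cannot record anything above 72 inches (six feet), so the measured average is consistently pulled downward; the numbers are invented for illustration.

```python
import random

random.seed(1)

# Hypothetical true heights (in inches) for a campus sample.
true_heights = [random.gauss(68, 4) for _ in range(1000)]

# A flawed measure that cannot record anything above six feet (72 inches)
# systematically understates tall people's heights.
measured_heights = [min(height, 72) for height in true_heights]

print(round(sum(true_heights) / len(true_heights), 1))          # true average
print(round(sum(measured_heights) / len(measured_heights), 1))  # biased (lower) average
```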
A less innocent form of error arises when researchers word questions in a way that might cause participants to think one answer choice is preferable to another. For example, if I were to ask you “Do you think global warming is caused by human activity?” you would probably feel comfortable answering honestly. But what if I asked you “Do you agree with 99% of scientists that global warming is caused by human activity?” Would you feel comfortable saying no, if that’s what you honestly felt? I doubt it. That is an example of a leading question, a question with wording that influences how a participant responds. We’ll discuss leading questions and other problems in question wording in greater detail in Chapter 12.
In addition to error created by the researcher, your participants can cause error in measurement. Some people will respond without fully understanding a question, particularly if the question is worded in a confusing way. Let’s consider another potential source or error. If we asked people if they always washed their hands after using the bathroom, would we expect people to be perfectly honest? Polling people about whether they wash their hands after using the bathroom might only elicit what people would like others to think they do, rather than what they actually do. This is an example of social desirability bias, in which participants in a research study want to present themselves in a positive, socially desirable way to the researcher. People in your study will want to seem tolerant, open-minded, and intelligent, but their true feelings may be closed-minded, simple, and biased. Participants may lie in this situation. This occurs often in political polling, which may show greater support for a candidate from a minority race, gender, or political party than actually exists in the electorate.
A related form of bias is called acquiescence bias, also known as “yea-saying.” It occurs when people say yes to whatever the researcher asks, even when doing so contradicts previous answers. For example, a person might say yes to both “I am a confident leader in group discussions” and “I feel anxious interacting in group discussions.” Those two responses are unlikely to both be true for the same person. Why would someone do this? Similar to social desirability, people want to be agreeable and nice to the researcher asking them questions or they might ignore contradictory feelings when responding to each question. You could interpret this as someone saying "yeah, I guess." Respondents may also act on cultural reasons, trying to “save face” for themselves or the person asking the questions. Regardless of the reason, the results of your measure don’t match what the person truly feels.
So far, we have discussed sources of error that come from choices made by respondents or researchers. Systematic errors will result in responses that are incorrect in one direction or another. For example, social desirability bias usually means that the number of people who say they will vote for a third party in an election is greater than the number of people who actually vote for that party. Systematic errors such as these can be reduced, but random error can never be eliminated. Unlike systematic error, which biases responses consistently in one direction or another, random error is unpredictable and does not consistently push scores higher or lower on a given measure. Instead, random error is more like statistical noise, which will likely average out across participants.
Random error is present in any measurement. If you’ve ever stepped on a bathroom scale twice and gotten two slightly different results, maybe a difference of a tenth of a pound, then you’ve experienced random error. Maybe you were standing slightly differently or had a fraction of your foot off of the scale the first time. If you were to take enough measures of your weight on the same scale, you’d be able to figure out your true weight. In social science, if you gave someone a scale measuring depression on a day after they lost their job, they would likely score differently than if they had just gotten a promotion and a raise. Even if the person were clinically depressed, our measure is subject to influence by the random occurrences of life. Thus, social scientists speak with humility about our measures. We are reasonably confident that what we found is true, but we must always acknowledge that our measures are only an approximation of reality.
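By contrast, a sketch of random error looks like the bathroom scale example: each reading is off by a small, unpredictable amount, but the errors tend to cancel out when you average many readings. The true weight and error range below are invented for illustration.

```python
import random

random.seed(42)

true_weight = 150.0  # hypothetical true weight in pounds

# Each weighing is off by a small, unpredictable amount in either direction.
readings = [true_weight + random.uniform(-0.3, 0.3) for _ in range(50)]

# Unlike systematic error, random error tends to cancel out across many
# measurements, so the average of the readings lands close to the true value.
print(round(sum(readings) / len(readings), 2))
```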
Humility is important in scientific measurement, as errors can have real consequences. At the time I'm writing this, my wife and I are expecting our first child. Like most people, we used a pregnancy test from the pharmacy. If the test said my wife was pregnant when she was not pregnant, that would be a false positive. On the other hand, if the test indicated that she was not pregnant when she was in fact pregnant, that would be a false negative. Even if the test is 99% accurate, that means that one in a hundred women will get an erroneous result when they use a home pregnancy test. For us, a false positive would have been initially exciting, then devastating when we found out we were not having a child. A false negative would have been disappointing at first and then quite shocking when we found out we were indeed having a child. While both false positives and false negatives are not very likely for home pregnancy tests (when taken correctly), measurement error can have consequences for the people being measured.
Key Takeaways
- Reliability is a matter of consistency.
- Validity is a matter of accuracy.
- There are many types of validity and reliability.
- Systematic error may arise from the researcher, participant, or measurement instrument.
- Systematic error biases results in a particular direction, whereas random error can be in any direction.
- All measures are prone to error and should be interpreted with humility.
Exercises
Use the measurement tools you located in the previous exercise. Evaluate the reliability and validity of these tools. Hint: You will need to go into the literature to "research" these tools.
- Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
- Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
- If you decide to create your own tool, how will you assess its validity and reliability?
Chapter Outline
- Operational definitions (36 minute read)
- Writing effective questions and questionnaires (38 minute read)
- Measurement quality (21 minute read)
Content warning: examples in this chapter contain references to ethnocentrism, toxic masculinity, racism in science, drug use, mental health and depression, psychiatric inpatient care, poverty and basic needs insecurity, pregnancy, and racism and sexism in the workplace and higher education.
11.1 Operational definitions
Learning Objectives
Learners will be able to...
- Define and give an example of indicators and attributes for a variable
- Apply the three components of an operational definition to a variable
- Distinguish between levels of measurement for a variable and how those differences relate to measurement
- Describe the purpose of composite measures like scales and indices
Last chapter, we discussed conceptualizing your project. Conceptual definitions are like dictionary definitions. They tell you what a concept means by defining it using other concepts. In this section we will move from the abstract realm (conceptualization) to the real world (measurement).
Operationalization is the process by which researchers spell out precisely how a concept will be measured in their study. It involves identifying the specific research procedures we will use to gather data about our concepts. If conceptually defining your terms means looking at theory, how do you operationally define your terms? By looking for indicators of when your variable is present or not, more or less intense, and so forth. Operationalization is probably the most challenging part of quantitative research, but once it's done, the design and implementation of your study will be straightforward.
Indicators
Operationalization works by identifying specific indicators that will be taken to represent the ideas we are interested in studying. If we are interested in studying masculinity, then the indicators for that concept might include some of the social roles prescribed to men in society such as breadwinning or fatherhood. Being a breadwinner or a father might therefore be considered indicators of a person’s masculinity. The extent to which a man fulfills either, or both, of these roles might be understood as clues (or indicators) about the extent to which he is viewed as masculine.
Let’s look at another example of indicators. Each day, Gallup researchers poll 1,000 randomly selected Americans to ask them about their well-being. To measure well-being, Gallup asks these people to respond to questions covering six broad areas: physical health, emotional health, work environment, life evaluation, healthy behaviors, and access to basic necessities. Gallup uses these six factors as indicators of the concept that they are really interested in, which is well-being.
Identifying indicators can be even simpler than the examples described thus far. Political party affiliation is another relatively easy concept for which to identify indicators. If you asked a person what party they voted for in the last national election (or gained access to their voting records), you would get a good indication of their party affiliation. Of course, some voters split tickets between multiple parties when they vote and others swing from party to party each election, so our indicator is not perfect. Indeed, if our study were about political identity as a key concept, operationalizing it solely in terms of who they voted for in the previous election leaves out a lot of information about identity that is relevant to that concept. Nevertheless, it's a pretty good indicator of political party affiliation.
Choosing indicators is not an arbitrary process. As described earlier, utilizing prior theoretical and empirical work in your area of interest is a great way to identify indicators in a scholarly manner. Your conceptual definitions will also point you in the direction of relevant indicators. Empirical work will give you some very specific examples of how the important concepts in an area have been measured in the past and what sorts of indicators have been used. Often, it makes sense to use the same indicators as previous researchers; however, you may find that some previous measures have potential weaknesses that your own study will improve upon.
All of the examples in this chapter have dealt with questions you might ask a research participant on a survey or in a quantitative interview. If you plan to collect data from other sources, such as through direct observation or the analysis of available records, think practically about what the design of your study might look like and how you can collect data on various indicators feasibly. If your study asks about whether the participant regularly changes the oil in their car, you will likely not observe them directly doing so. Instead, you will likely need to rely on a survey question that asks them the frequency with which they change their oil or ask to see their car maintenance records.
Exercises
- What indicators are commonly used to measure the variables in your research question?
- How can you feasibly collect data on these indicators?
- Are you planning to collect your own data using a questionnaire or interview? Or are you planning to analyze available data like client files or raw data shared from another researcher's project?
Remember, you need raw data. Your research project cannot rely solely on the results reported by other researchers or the arguments you read in the literature. A literature review is only the first part of a research project, and your review of the literature should inform the indicators you end up choosing when you measure the variables in your research question.
Unlike conceptual definitions, which contain other concepts, an operational definition consists of the following components: (1) the variable being measured and its attributes, (2) the measure you will use, and (3) how you plan to interpret the data collected from that measure to draw conclusions about the variable you are measuring.
Step 1: Specifying variables and attributes
The first component, the variable, should be the easiest part. At this point in quantitative research, you should have a research question that has at least one independent and at least one dependent variable. Remember that variables must be able to vary. For example, the United States is not a variable. Country of residence is a variable, as is patriotism. Similarly, if your sample only includes men, gender is a constant in your study, not a variable. A constant is a characteristic that does not change in your study.
When social scientists measure concepts, they sometimes use the language of variables and attributes. A variable refers to a quality or quantity that varies across people or situations. Attributes are the characteristics that make up a variable. For example, the variable hair color would contain attributes like blonde, brown, black, red, gray, etc. A variable’s attributes determine its level of measurement. There are four possible levels of measurement: nominal, ordinal, interval, and ratio. The first two levels of measurement are categorical, meaning their attributes are categories rather than numbers. The latter two levels of measurement are continuous, meaning their attributes are numbers.
Levels of measurement
Hair color is an example of a nominal level of measurement. Nominal measures are categorical, and those categories cannot be mathematically ranked. As a brown-haired person (with some gray), I can’t say for sure that brown-haired people are better than blonde-haired people. As with all nominal levels of measurement, there is no ranking order between hair colors; they are simply different. That is what constitutes a nominal level of measurement. Gender and race are also measured at the nominal level.
What attributes are contained in the variable hair color? While blonde, brown, black, and red are common colors, some people may not fit into these categories if we only list these attributes. My wife, who currently has purple hair, wouldn’t fit anywhere. This means that our attributes were not exhaustive. Exhaustiveness means that all possible attributes are listed. We may have to list a lot of colors before we can meet the criteria of exhaustiveness. Clearly, there is a point at which exhaustiveness has been reasonably met. If a person insists that their hair color is light burnt sienna, it is not your responsibility to list that as an option. Rather, that person would reasonably be described as brown-haired. Perhaps listing an “other” category would suffice to make our list of colors exhaustive.
What about a person who has multiple hair colors at the same time, such as red and black? They would fall into multiple attributes. This violates the rule of mutual exclusivity, in which a person cannot fall into two different attributes. Instead of listing all of the possible combinations of colors, perhaps you might include a multi-color attribute to describe people with more than one hair color.
Making sure researchers provide mutually exclusive and exhaustive attributes is about making sure all people are represented in the data record. For many years, the attributes for gender were only male or female. Now, our understanding of gender has evolved to encompass more attributes that better reflect the diversity in the world. Children of parents from different races were often classified as one race or another, even if they identified with both cultures. The option for bi-racial or multi-racial on a survey not only more accurately reflects the racial diversity in the real world but validates and acknowledges people who identify in that manner. If we did not measure race in this way, we would leave empty the data record for people who identify as biracial or multiracial, impairing our search for truth.
Unlike nominal-level measures, attributes at the ordinal level can be rank ordered. For example, someone’s degree of satisfaction in their romantic relationship can be ordered by rank. That is, you could say you are not at all satisfied, a little satisfied, moderately satisfied, or highly satisfied. Note that even though these have a rank order to them (not at all satisfied is certainly worse than highly satisfied), we cannot calculate a mathematical distance between those attributes. We can simply say that one attribute of an ordinal-level variable is more or less than another attribute.
This can get a little confusing when using rating scales. If you have ever taken a customer satisfaction survey or completed a course evaluation for school, you are familiar with rating scales. “On a scale of 1-5, with 1 being the lowest and 5 being the highest, how likely are you to recommend our company to other people?” That surely sounds familiar. Rating scales use numbers, but only as a shorthand, to indicate what attribute (highly likely, somewhat likely, etc.) the person feels describes them best. You wouldn’t say you are “2” likely to recommend the company, but you would say you are not very likely to recommend the company. Ordinal-level attributes must also be exhaustive and mutually exclusive, as with nominal-level variables.
At the interval level, attributes must also be exhaustive and mutually exclusive and there is equal distance between attributes. Interval measures are also continuous, meaning their attributes are numbers, rather than categories. IQ scores are interval level, as are temperatures in Fahrenheit and Celsius. Their defining characteristic is that we can say how much more or less one attribute differs from another. We cannot, however, say with certainty what the ratio of one attribute is in comparison to another. For example, it would not make sense to say that a person with an IQ score of 140 has twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.
While we cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, we can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level. Finally, at the ratio level, attributes are mutually exclusive and exhaustive, attributes can be rank ordered, the distance between attributes is equal, and attributes have a true zero point. Thus, with these variables, we can say what the ratio of one attribute is in comparison to another. Examples of ratio-level variables include age and years of education. We know that a person who is 12 years old is twice as old as someone who is 6 years old. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. The differences between each level of measurement are visualized in Table 11.1.
Table 11.1 Characteristics of each level of measurement

|  | Nominal | Ordinal | Interval | Ratio |
| --- | --- | --- | --- | --- |
| Exhaustive | X | X | X | X |
| Mutually exclusive | X | X | X | X |
| Rank-ordered |  | X | X | X |
| Equal distance between attributes |  |  | X | X |
| True zero point |  |  |  | X |
Levels of measurement = levels of specificity
We have spent time learning how to determine our data's level of measurement. Now what? How can we use this information to help us as we measure concepts and develop measurement tools? First, the types of statistical tests that we are able to use depend on our data's level of measurement. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval and ratio-level measurement are typically considered the most desirable because they permit any measure of central tendency to be computed (i.e., mean, median, or mode). Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. The higher the level of measurement, the more complex the statistical tests we are able to conduct. This knowledge may help us decide what kind of data we need to gather, and how.
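As a quick illustration of how the level of measurement constrains which measure of central tendency makes sense, here is a minimal sketch in Python; the variables and values are hypothetical and only meant to show the logic.

```python
import statistics

# Hypothetical responses at three different levels of measurement
hair_color = ["brown", "blonde", "brown", "black", "red"]      # nominal
satisfaction = [1, 3, 3, 4, 2]   # ordinal codes: 1 = not at all ... 4 = highly satisfied
age_years = [19, 22, 22, 25, 31]                               # ratio

# Nominal: only the mode is meaningful
print(statistics.mode(hair_color))      # brown

# Ordinal: the median (or mode) can be reported, but a mean is not meaningful
print(statistics.median(satisfaction))  # 3

# Interval/ratio: mean, median, and mode are all meaningful
print(statistics.mean(age_years))       # 23.8
```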
That said, we have to balance this knowledge with the understanding that sometimes, collecting data at a higher level of measurement could negatively impact our studies. For instance, sometimes providing answers in ranges may make prospective participants feel more comfortable responding to sensitive items. Imagine that you were interested in collecting information on topics such as income, number of sexual partners, number of times someone used illicit drugs, etc. You would have to think about the sensitivity of these items and determine whether it would make more sense to collect some data at a lower level of measurement (e.g., asking whether they are sexually active (nominal) versus their total number of sexual partners (ratio)).
Finally, sometimes when analyzing data, researchers find a need to change a variable's level of measurement. For example, a few years ago, a student was interested in studying the relationship between mental health and life satisfaction. This student used a variety of measures. One item asked about the number of mental health symptoms, reported as the actual number. When analyzing the data, my student examined the mental health symptom variable and noticed that she had two groups: those with no or one symptom and those with many symptoms. Instead of using the ratio-level data (actual number of mental health symptoms), she collapsed her cases into two categories, few and many, and used this new variable in her analyses. It is important to note that you can move data from a higher level of measurement to a lower level; however, you cannot move a lower level to a higher level.
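To make the idea of collapsing a higher level of measurement into a lower one concrete, here is a small sketch. The symptom counts and the cut-off of two or more symptoms for the “many” group are assumptions for illustration only; the student's actual cut-off is not reported above.

```python
# Hypothetical symptom counts reported by participants (ratio level)
symptom_counts = [0, 1, 0, 5, 7, 1, 6, 0, 8]

# Collapse to two categories: "few" (0-1 symptoms) vs. "many" (2 or more)
# NOTE: the cut-off of 2+ is assumed here purely for illustration
symptom_groups = ["few" if count <= 1 else "many" for count in symptom_counts]

print(symptom_groups)
# ['few', 'few', 'few', 'many', 'many', 'few', 'many', 'few', 'many']

# Going the other direction is impossible: knowing only "few" or "many"
# does not let us recover the original counts.
```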
Exercises
- Check that the variables in your research question can vary...and that they are not constants or one of many potential attributes of a variable.
- Think about the attributes your variables have. Are they categorical or continuous? What level of measurement seems most appropriate?
Step 2: Specifying measures for each variable
Let’s pick a social work research question and walk through the process of operationalizing variables to see how specific we need to get. I’m going to hypothesize that residents of a psychiatric unit who are more depressed are less likely to be satisfied with care. Remember, this would be an inverse relationship—as depression increases, satisfaction decreases. In this question, depression is my independent variable (the cause) and satisfaction with care is my dependent variable (the effect). Now that we have identified our variables, their attributes, and levels of measurement, we can move on to the second component: the measure itself.
So, how would you measure my key variables: depression and satisfaction? What indicators would you look for? Some students might say that depression could be measured by observing a participant’s body language. They may also say that a depressed person will often express feelings of sadness or hopelessness. In addition, a satisfied person might be happy around service providers and often express gratitude. While these factors may indicate that the variables are present, they lack coherence. Unfortunately, what this “measure” is actually saying is “I know depression and satisfaction when I see them.” While you are likely a decent judge of depression and satisfaction, in a research study you need to provide more information about how you plan to measure your variables. Your judgments are subjective, based on your own idiosyncratic experiences with depression and satisfaction. They couldn’t be replicated by another researcher, and they can’t be made consistently for a large group of people. Operationalization requires that you come up with a specific and rigorous measure for determining who is depressed or satisfied.
Finding a good measure for your variable depends on the kind of variable it is. Variables that are directly observable don't come up very often in my students' classroom projects, but they might include things like taking someone's blood pressure, marking attendance or participation in a group, and so forth. To measure an indirectly observable variable like age, you would probably put a question on a survey that asked, “How old are you?” Measuring a variable like income might require some more thought, though. Are you interested in this person’s individual income or the income of their family unit? This might matter if your participant does not work or is dependent on other family members for income. Do you count income from social welfare programs? Are you interested in their income per month or per year? Even though indirect observables are relatively easy to measure, the measures you use must be clear in what they are asking, and operationalization is all about figuring out the specifics of what you want to know. For more complicated constructs, you will need compound measures (that use multiple indicators to measure a single variable).
How you plan to collect your data also influences how you will measure your variables. For social work researchers using secondary data like client records as a data source, you are limited by what information is in the data sources you can access. If your organization uses a given measurement for a mental health outcome, that is the one you will use in your study. Similarly, if you plan to study how long a client was housed after an intervention using client visit records, you are limited by how their caseworker recorded their housing status in the chart. One of the benefits of collecting your own data is being able to select the measures you feel best exemplify your understanding of the topic.
Measuring unidimensional concepts
The previous section mentioned two important considerations: how complicated the variable is and how you plan to collect your data. With these in hand, we can use the level of measurement to further specify how you will measure your variables and consider specialized rating scales developed by social science researchers.
Measurement at each level
Nominal measures assess categorical variables. These measures are used for variables or indicators that have mutually exclusive attributes, but that cannot be rank-ordered. Nominal measures ask about the variable and provide names or labels for different attribute values like social work, counseling, and nursing for the variable profession. Nominal measures are relatively straightforward.
Ordinal measures often use a rating scale, which is an ordered set of responses that participants must choose from. Figure 11.1 shows several examples. The number of response options on a typical rating scale is usually five or seven, though it can range from three to eleven. Five-point scales are best for unipolar scales where only one construct is tested, such as frequency (Never, Rarely, Sometimes, Often, Always). Seven-point scales are best for bipolar scales where there is a dichotomous spectrum, such as liking (Like very much, Like somewhat, Like slightly, Neither like nor dislike, Dislike slightly, Dislike somewhat, Dislike very much). For bipolar questions, it is useful to offer an earlier question that branches respondents into an area of the scale; if asking about liking ice cream, first ask “Do you generally like or dislike ice cream?” Once the respondent chooses like or dislike, refine it by offering them relevant choices from the seven-point scale. Branching improves both reliability and validity (Krosnick & Berent, 1993).[35] Although you often see scales with numerical labels, it is best to present only verbal labels to the respondents and convert them to numerical values in the analyses. Avoid partial labels and lengthy or overly specific labels. In some cases, the verbal labels can be supplemented with (or even replaced by) meaningful graphics. The last rating scale shown in Figure 11.1 is a visual-analog scale, on which participants make a mark somewhere along the horizontal line to indicate the magnitude of their response.
Interval measures are those where the values measured are not only rank-ordered, but are also equidistant from adjacent attributes. For example, on the temperature scale (in Fahrenheit or Celsius), the difference between 30 and 40 degrees Fahrenheit is the same as that between 80 and 90 degrees Fahrenheit. Likewise, if you have a scale that asks about respondents’ annual income using the following attributes (ranges): $0 to 10,000, $10,000 to 20,000, $20,000 to 30,000, and so forth, this is also an interval measure, because the mid-points of each range (i.e., $5,000, $15,000, $25,000, etc.) are equidistant from each other. The intelligence quotient (IQ) scale is also an interval measure, because the measure is designed such that the difference between IQ scores of 100 and 110 is supposed to be the same as between 110 and 120 (although we do not really know whether that is truly the case). Interval measures allow us to examine “how much more” one attribute is when compared to another, which is not possible with nominal or ordinal measures. You may find researchers who “pretend” (incorrectly) that ordinal rating scales are actually interval measures so that they can use different statistical techniques for analyzing them. As we will discuss in the latter part of the chapter, this is a mistake because there is no way to know whether the difference between a 3 and a 4 on a rating scale is the same as the difference between a 2 and a 3. Those numbers are just placeholders for categories.
Ratio measures are those that have all the qualities of nominal, ordinal, and interval scales, and in addition, also have a “true zero” point (where the value zero implies lack or non-availability of the underlying construct). Think about how to measure the number of people working in human resources at a social work agency. It could be one, several, or none (if the company contracts out for those services). Measuring interval and ratio data is relatively easy, as people either select or input a number for their answer. If you ask a person how many eggs they purchased last week, they can simply tell you they purchased a dozen eggs at the store, two at breakfast on Wednesday, or none at all.
Commonly used rating scales in questionnaires
The level of measurement will give you the basic information you need, but social scientists have also developed specialized instruments for use in questionnaires, a common tool used in quantitative research. As we mentioned before, if you plan to source your data from client files or previously published results, you are limited to the measures already recorded in those sources.
Although Likert scale is a term colloquially used to refer to almost any rating scale (e.g., a 0-to-10 life satisfaction scale), it has a much more precise meaning. In the 1930s, researcher Rensis Likert (pronounced LICK-ert) created a new approach for measuring people’s attitudes (Likert, 1932).[36] It involves presenting people with several statements—including both favorable and unfavorable statements—about some person, group, or idea. Respondents then express their agreement or disagreement with each statement on a 5-point scale: Strongly Agree, Agree, Neither Agree nor Disagree, Disagree, Strongly Disagree. Numbers are assigned to each response and then summed across all items to produce a score representing the attitude toward the person, group, or idea. For items that are phrased in an opposite direction (e.g., negatively worded statements instead of positively worded statements), reverse coding is used so that the numerical scoring of statements also runs in the opposite direction. The entire set of items came to be called a Likert scale, as indicated in Table 11.2 below.
Unless you are measuring people’s attitude toward something by assessing their level of agreement with several statements about it, it is best to avoid calling it a Likert scale. You are probably just using a rating scale. Likert scales allow for more granularity (more finely tuned response) than yes/no items, including whether respondents are neutral to the statement. Below is an example of how we might use a Likert scale to assess your attitudes about research as you work your way through this textbook.
Table 11.2 Example Likert scale items

|  | Strongly agree | Agree | Neutral | Disagree | Strongly disagree |
| --- | --- | --- | --- | --- | --- |
| I like research more now than when I started reading this book. |  |  |  |  |  |
| This textbook is easy to use. |  |  |  |  |  |
| I feel confident about how well I understand levels of measurement. |  |  |  |  |  |
| This textbook is helping me plan my research proposal. |  |  |  |  |  |
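As a rough sketch of how responses to Likert items are typically scored, the snippet below assumes Strongly agree = 5 through Strongly disagree = 1 and adds one invented, negatively worded item (“Research is boring.”) to show how reverse coding works; none of this scoring scheme appears in the table above.

```python
# Hypothetical responses (5 = Strongly agree ... 1 = Strongly disagree)
# The last item is an invented negatively worded statement used to show reverse coding
responses = {
    "I like research more now than when I started reading this book.": 4,
    "This textbook is easy to use.": 5,
    "I feel confident about how well I understand levels of measurement.": 3,
    "This textbook is helping me plan my research proposal.": 4,
    "Research is boring.": 2,  # negatively worded, so it must be reverse coded
}

reverse_coded = {"Research is boring."}

def score(item, value, scale_min=1, scale_max=5):
    """Reverse code negatively worded items so higher always means a more favorable attitude."""
    return scale_max + scale_min - value if item in reverse_coded else value

total = sum(score(item, value) for item, value in responses.items())
print(total)  # 4 + 5 + 3 + 4 + 4 = 20, the composite attitude score
```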
Semantic differential scales are composite (multi-item) scales in which respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites. Whereas in the above Likert scale, the participant is asked how much they agree or disagree with a statement, in a semantic differential scale the participant is asked to indicate how they feel about a specific item. This makes the semantic differential scale an excellent technique for measuring people’s attitudes or feelings toward objects, events, or behaviors. Table 11.3 is an example of a semantic differential scale that was created to assess participants' feelings about this textbook.
Table 11.3 Example semantic differential scale

1) How would you rate your opinions toward this textbook?

|  | Very much | Somewhat | Neither | Somewhat | Very much |  |
| --- | --- | --- | --- | --- | --- | --- |
| Boring |  |  |  |  |  | Exciting |
| Useless |  |  |  |  |  | Useful |
| Hard |  |  |  |  |  | Easy |
| Irrelevant |  |  |  |  |  | Applicable |
Another type of composite scale, the Guttman scale, was designed by Louis Guttman and uses a series of items arranged in increasing order of intensity (least intense to most intense) of the concept. This type of scale allows us to understand the intensity of beliefs or feelings. Each item in a Guttman scale has a weight (this is not indicated on the tool itself) which varies with the intensity of that item, and the weighted combination of each response is used as an aggregate measure of an observation.
Example Guttman Scale Items
- I often felt the material was not engaging Yes/No
- I was often thinking about other things in class Yes/No
- I was often working on other tasks during class Yes/No
- I will work to abolish research from the curriculum Yes/No
Notice how the items move from lower intensity to higher intensity. A researcher reviews the yes answers and creates a score for each participant.
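Here is a minimal sketch of one way the scoring just described could work, with hypothetical weights of 1 through 4 reflecting increasing intensity; the weights for any published Guttman scale come from its developer, not from this example.

```python
# The example items above, paired with hypothetical intensity weights (1 = least intense)
items = [
    ("I often felt the material was not engaging", 1),
    ("I was often thinking about other things in class", 2),
    ("I was often working on other tasks during class", 3),
    ("I will work to abolish research from the curriculum", 4),
]

# One participant's yes/no answers, in the same order as the items
answers = [True, True, False, False]

# The weighted combination of "yes" responses serves as the aggregate score
total = sum(weight for (statement, weight), said_yes in zip(items, answers) if said_yes)
print(total)  # 1 + 2 = 3
```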
Composite measures: Scales and indices
Depending on your research design, your measure may be something you put on a survey or pre/post-test that you give to your participants. For a variable like age or income, one well-worded question may suffice. Unfortunately, most variables in the social world are not so simple. Depression and satisfaction are multidimensional concepts. Relying on a single indicator, like a question that asks “Yes or no, are you depressed?”, does not encompass the complexity of depression, including issues with mood, sleeping, eating, relationships, and happiness. There is no easy way to delineate between multidimensional and unidimensional concepts, as it’s all in how you think about your variable. Satisfaction could be validly measured using a unidimensional ordinal rating scale. However, if satisfaction were a key variable in our study, we would need a theoretical framework and conceptual definition for it. That means we’d probably have more indicators to ask about, like timeliness, respect, sensitivity, and many others, and we would want our study to say something about what satisfaction truly means in terms of our other key variables. However, if satisfaction is not a key variable in your conceptual framework, it makes sense to operationalize it as a unidimensional concept.
For more complicated measures, researchers use scales and indices (sometimes called indexes) to measure their variables because they assess multiple indicators to develop a composite (or total) score. Composite scores provide a much greater understanding of concepts than a single item could. Although we won't delve too deeply into the process of scale development, we will cover some important topics for you to understand how scales and indices developed by other researchers can be used in your project.
Although scales and indices differ in ways that will be discussed later, they have several features in common:
- Both are ordinal measures of variables.
- Both can order the units of analysis in terms of specific variables.
- Both are composite measures.
Scales
The previous section discussed how to measure respondents’ responses to predesigned items or indicators belonging to an underlying construct. But how do we create the indicators themselves? The process of creating the indicators is called scaling. More formally, scaling is a branch of measurement that involves the construction of measures by associating qualitative judgments about unobservable constructs with quantitative, measurable metric units. Stevens (1946)[37] said, “Scaling is the assignment of objects to numbers according to a rule.” This process of measuring abstract concepts in concrete terms remains one of the most difficult tasks in empirical social science research.
The outcome of a scaling process is a scale, which is an empirical structure for measuring items or indicators of a given construct. Understand that multidimensional “scales”, as discussed in this section, are a little different from “rating scales” discussed in the previous section. A rating scale is used to capture the respondents’ reactions to a given item on a questionnaire. For example, an ordinally scaled item captures a value between “strongly disagree” to “strongly agree.” Attaching a rating scale to a statement or instrument is not scaling. Rather, scaling is the formal process of developing scale items, before rating scales can be attached to those items.
If creating your own scale sounds painful, don’t worry! For most multidimensional variables, you would likely be duplicating work that has already been done by other researchers. Specifically, this is a branch of science called psychometrics. You do not need to create a scale for depression because scales such as the Patient Health Questionnaire (PHQ-9), the Center for Epidemiologic Studies Depression Scale (CES-D), and Beck’s Depression Inventory (BDI) have been developed and refined over dozens of years to measure variables like depression. Similarly, scales such as the Patient Satisfaction Questionnaire (PSQ-18) have been developed to measure satisfaction with medical care. As we will discuss in the next section, these scales have been shown to be reliable and valid. While you could create a new scale to measure depression or satisfaction, a study with rigor would pilot test and refine that new scale over time to make sure it measures the concept accurately and consistently. This high level of rigor is often unachievable in student research projects because of the cost and time involved in pilot testing and validating, so using existing scales is recommended.
Unfortunately, there is no good one-stop shop for psychometric scales. The Mental Measurements Yearbook provides a searchable database of measures for social science variables, though it is woefully incomplete and often does not contain the full documentation for scales in its database. You can access it from a university library’s list of databases. If you can’t find anything in there, your next stop should be the methods section of the articles in your literature review. The methods section of each article will detail how the researchers measured their variables, and often the results section is instructive for understanding more about measures. In a quantitative study, researchers may have used a scale to measure key variables and will provide a brief description of that scale, its name, and maybe a few example questions. If you need more information, look at the results section and tables discussing the scale to get a better idea of how the measure works. Looking beyond the articles in your literature review, searching Google Scholar using queries like “depression scale” or “satisfaction scale” should also provide some relevant results. For example, when searching for documentation for the Rosenberg Self-Esteem Scale (which we will discuss in the next section), I found a report from researchers investigating acceptance and commitment therapy which details this scale and many others used to assess mental health outcomes. If you find the name of the scale somewhere but cannot find the documentation (all questions and answers plus how to interpret the scale), a general web search with the name of the scale and “.pdf” may bring you to what you need. Or, to get professional help with finding information, always ask a librarian!
Unfortunately, these approaches do not guarantee that you will be able to view the scale itself or get information on how it is interpreted. Many scales cost money to use and may require training to properly administer. You may also find scales that are related to your variable but would need to be slightly modified to match your study’s needs. You could adapt a scale to fit your study; however, changing even small parts of a scale can influence its accuracy and consistency. While it is perfectly acceptable in student projects to adapt a scale without testing it first (time may not allow you to do so), pilot testing is always recommended for adapted scales, and researchers seeking to draw valid conclusions and publish their results must take this additional step.
Indices
An index is a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas. It is different from a scale. Scales also aggregate measures; however, these measures examine different dimensions or the same dimension of a single construct. A well-known example of an index is the consumer price index (CPI), which is computed every month by the Bureau of Labor Statistics of the U.S. Department of Labor. The CPI is a measure of how much consumers have to pay for goods and services (in general) and is divided into eight major categories (food and beverages, housing, apparel, transportation, healthcare, recreation, education and communication, and “other goods and services”), which are further subdivided into more than 200 smaller items. Each month, government employees call all over the country to get the current prices of more than 80,000 items. Using a complicated weighting scheme that takes into account the location and probability of purchase for each item, analysts then combine these prices into an overall index score using a series of formulas and rules.
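The CPI's actual weighting scheme is far more elaborate than this, but a toy example shows the basic logic of combining very different components into a single index score; the categories, prices, and weights below are invented purely for illustration.

```python
# Invented monthly prices and weights for a toy price index
components = {
    "food":           {"price": 320.00, "weight": 0.15},
    "housing":        {"price": 1450.00, "weight": 0.40},
    "transportation": {"price": 610.00, "weight": 0.20},
    "other":          {"price": 275.00, "weight": 0.25},
}

# A weighted sum of the component prices yields the composite index value
index_value = sum(item["price"] * item["weight"] for item in components.values())
print(round(index_value, 2))  # 818.75
```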
Another example of an index is the Duncan Socioeconomic Index (SEI). This index is used to quantify a person's socioeconomic status (SES) and is a combination of three concepts: income, education, and occupation. Income is measured in dollars, education in years or degrees achieved, and occupation is classified into categories or levels by status. These very different measures are combined to create an overall SES index score. However, SES index measurement has generated a lot of controversy and disagreement among researchers.
The process of creating an index is similar to that of a scale. First, conceptualize (define) the index and its constituent components. Though this appears simple, there may be a lot of disagreement on what components (concepts/constructs) should be included or excluded from an index. For instance, in the SES index, isn’t income correlated with education and occupation? And if so, should we include one component only or all three components? Reviewing the literature, using theories, and/or interviewing experts or key stakeholders may help resolve this issue. Second, operationalize and measure each component. For instance, how will you categorize occupations, particularly since some occupations may have changed with time (e.g., there were no Web developers before the Internet)? As we will see in step three below, researchers must create a rule or formula for calculating the index score. Again, this process may involve a lot of subjectivity, so validating the index score using existing or new data is important.
Scale and index development are often taught in their own courses in doctoral education, so it is unreasonable to expect that you can develop a consistently accurate measure within the span of a week or two. Using available indices and scales is recommended for this reason.
Differences between scales and indices
Though indices and scales yield a single numerical score or value representing a concept of interest, they are different in many ways. First, indices often comprise components that are very different from each other (e.g., income, education, and occupation in the SES index) and are measured in different ways. Conversely, scales typically involve a set of similar items that use the same rating scale (such as a five-point Likert scale about customer satisfaction).
Second, indices often combine objectively measurable values such as prices or income, while scales are designed to assess subjective or judgmental constructs such as attitude, prejudice, or self-esteem. Some argue that the sophistication of the scaling methodology makes scales different from indexes, while others suggest that indexing methodology can be equally sophisticated. Nevertheless, indexes and scales are both essential tools in social science research.
Scales and indices seem like clean, convenient ways to measure different phenomena in social science, but just like with a lot of research, we have to be mindful of the assumptions and biases underneath. What if a scale or an index was developed using only White women as research participants? Is it going to be useful for other groups? It very well might be, but when using a scale or index on a group for whom it hasn't been tested, it will be very important to evaluate the validity and reliability of the instrument, which we address in the rest of the chapter.
Finally, it's important to note that while scales and indices are often made up of nominal or ordinal variables, when we combine them into composite scores for analysis, we typically treat those scores as interval/ratio variables.
Exercises
- Look back to your work from the previous section, are your variables unidimensional or multidimensional?
- Describe the specific measures you will use (actual questions and response options you will use with participants) for each variable in your research question.
- If you are using a measure developed by another researcher but do not have all of the questions, response options, and instructions needed to implement it, put it on your to-do list to get them.
Step 3: How you will interpret your measures
The final stage of operationalization involves setting the rules for how the measure works and how the researcher should interpret the results. Sometimes, interpreting a measure can be incredibly easy. If you ask someone their age, you’ll probably interpret the results by noting the raw number (e.g., 22) someone provides and whether it is lower or higher than other people's ages. However, you could also recode that person into age categories (e.g., under 25, 20-29 years old, Generation Z, etc.). Even scales may be simple to interpret. If there is a scale of problem behaviors, one might simply add up the number of behaviors checked off–with a range from 1-5 indicating low risk of delinquent behavior, 6-10 indicating moderate risk, and so on. How you choose to interpret your measures should be guided by how they were designed, how you conceptualize your variables, the data sources you used, and your plan for analyzing your data statistically. Whatever measure you use, you need a set of rules for how to take any valid answer a respondent provides to your measure and interpret it in terms of the variable being measured.
For more complicated measures like scales, refer to the information provided by the author for how to interpret the scale. If you can’t find enough information from the scale’s creator, look at how the results of that scale are reported in the results section of research articles. For example, Beck’s Depression Inventory (BDI-II) uses 21 statements to measure depression and respondents rate their level of agreement on a scale of 0-3. The results for each question are added up, and the respondent is put into one of three categories: low levels of depression (1-16), moderate levels of depression (17-30), or severe levels of depression (31 and over).
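A short sketch of the interpretation rule just described: sum the 21 item ratings (each 0-3) and apply the cut-points given above. The individual ratings below are made up.

```python
# Made-up ratings for the 21 items, each scored 0-3
item_ratings = [1, 0, 2, 1, 1, 0, 2, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 0, 2]

total = sum(item_ratings)  # 20

# Cut-points as described in the text above
if total <= 16:
    category = "low levels of depression"
elif total <= 30:
    category = "moderate levels of depression"
else:
    category = "severe levels of depression"

print(total, category)  # 20 moderate levels of depression
```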
One common mistake I see often is that students will introduce another variable into their operational definition. This is incorrect. Your operational definition should mention only one variable—the variable being defined. While your study will certainly draw conclusions about the relationships between variables, that's not what operationalization is. Operationalization specifies what instrument you will use to measure your variable and how you plan to interpret the data collected using that measure.
Operationalization is probably the trickiest component of basic research methods, so please don’t get frustrated if it takes a few drafts and a lot of feedback to get to a workable definition. At the time of this writing, I am in the process of operationalizing the concept of “attitudes towards research methods.” Originally, I thought that I could gauge students’ attitudes toward research methods by looking at their end-of-semester course evaluations. As I became aware of the potential methodological issues with student course evaluations, I opted to use focus groups of students to measure their common beliefs about research. You may recall some of these opinions from Chapter 1, such as the common beliefs that research is boring, useless, and too difficult. After the focus group, I created a scale based on the opinions I gathered, and I plan to pilot test it with another group of students. After the pilot test, I expect that I will have to revise the scale again before I can implement the measure in a real social work research project. At the time I’m writing this, I’m still not completely done operationalizing this concept.
Key Takeaways
- Operationalization involves spelling out precisely how a concept will be measured.
- Operational definitions must include the variable, the measure, and how you plan to interpret the measure.
- There are four different levels of measurement: nominal, ordinal, interval, and ratio (in increasing order of specificity).
- Scales and indices are common ways to collect information and involve using multiple indicators in measurement.
- A key difference between a scale and an index is that a scale contains multiple indicators for one concept, whereas an index examines multiple concepts (components).
- Using scales developed and refined by other researchers can improve the rigor of a quantitative study.
Exercises
Use the research question that you developed in the previous chapters and find a related scale or index that researchers have used. If you have trouble finding the exact phenomenon you want to study, get as close as you can.
- What is the level of measurement for each item on each tool? Take a second and think about why the tool's creator decided to include these levels of measurement. Identify any levels of measurement you would change and why.
- If these tools don't exist for what you are interested in studying, why do you think that is?
12.3 Writing effective questions and questionnaires
Learning Objectives
Learners will be able to...
- Describe some of the ways that survey questions might confuse respondents and how to word questions and responses clearly
- Create mutually exclusive, exhaustive, and balanced response options
- Define fence-sitting and floating
- Describe the considerations involved in constructing a well-designed questionnaire
- Discuss why pilot testing is important
In the previous section, we reviewed how researchers collect data using surveys. Guided by their sampling approach and research context, researchers should choose the survey approach that provides the most favorable tradeoffs in strengths and challenges. With this information in hand, researchers need to write their questionnaire and revise it before beginning data collection. Each method of delivery requires a questionnaire, but they vary a bit based on how they will be used by the researcher. Since phone surveys are read aloud, researchers will pay more attention to how the questionnaire sounds than how it looks. Online surveys can use advanced tools to require the completion of certain questions, present interactive questions and answers, and otherwise afford greater flexibility in how questionnaires are designed. As you read this section, consider how your method of delivery impacts the type of questionnaire you will design. Because most student projects use paper or online surveys, this section will detail how to construct self-administered questionnaires to minimize the potential for bias and error.
Start with operationalization
The first thing you need to do to write effective survey questions is identify what exactly you wish to know. As silly as it sounds to state what seems so completely obvious, we can’t stress enough how easy it is to forget to include important questions when designing a survey. Begin by looking at your research question and refreshing your memory of the operational definitions you developed for those variables from Chapter 11. You should have a pretty firm grasp of your operational definitions before starting the process of questionnaire design. You may have taken those operational definitions from other researchers' methods, found established scales and indices for your measures, or created your own questions and answer options.
Exercises
STOP! Make sure you have a complete operational definition for the dependent and independent variables in your research question. A complete operational definition contains the variable being measured, the measure used, and how the researcher interprets the measure. Let's make sure you have what you need from Chapter 11 to begin writing your questionnaire.
List all of the dependent and independent variables in your research question.
- It's normal to have one dependent or independent variable. It's also normal to have more than one of either.
- Make sure that your research question (and this list) contain all of the variables in your hypothesis. Your hypothesis should only include variables from your research question.
For each variable in your list:
- Write out the measure you will use (the specific questions and answers) for each variable.
- If you don't have questions and answers finalized yet, write a first draft and revise it based on what you read in this section.
- If you are using a measure from another researcher, you should be able to write out all of the questions and answers associated with that measure. If you only have the name of a scale or a few questions, you need access to the full text and some documentation on how to administer and interpret it before you can finish your questionnaire.
- Describe how you will use each measure to draw conclusions about the variable in the operational definition.
- For example, an interpretation might be "there are five 7-point Likert scale questions...point values are added across all five items for each participant...and scores below 10 indicate the participant has low self-esteem"
- Don't introduce other variables into the mix here. All we are concerned with is how you will measure each variable by itself. The connection between variables is done using statistical tests, not operational definitions.
- Detail any validity or reliability issues uncovered by previous researchers using the same measures. If you have concerns about validity and reliability, note them, as well.
If you completed the exercise above and listed out all of the questions and answer choices you will use to measure the variables in your research question, you have already produced a pretty solid first draft of your questionnaire! Congrats! In essence, questionnaires are all of the self-report measures in your operational definitions for the independent, dependent, and control variables in your study arranged into one document and administered to participants. There are a few questions on a questionnaire (like name or ID#) that are not associated with the measurement of variables. These are the exception, and it's useful to think of a questionnaire as a list of measures for variables. Of course, researchers often use more than one measure of a variable (i.e., triangulation) so they can more confidently assert that their findings are true. A questionnaire should contain all of the measures researchers plan to collect about their variables by asking participants to self-report. As we will discuss in the final section of this chapter, triangulating across data sources (e.g., measuring variables using client files or student records) can avoid some of the common sources of bias in survey research.
Sticking close to your operational definitions is important because it helps you avoid an everything-but-the-kitchen-sink approach that includes every possible question that occurs to you. Doing so puts an unnecessary burden on your survey respondents. Remember that you have asked your participants to give you their time and attention and to take care in responding to your questions; show them your respect by only asking questions that you actually plan to use in your analysis. For each question in your questionnaire, ask yourself how this question measures a variable in your study. An operational definition should contain the questions, response options, and how the researcher will draw conclusions about the variable based on participants' responses.
Writing questions
So, almost all of the questions on a questionnaire are measuring some variable. For many variables, researchers will create their own questions rather than using one from another researcher. This section will provide some tips on how to create good questions to accurately measure variables in your study. First, questions should be as clear and to the point as possible. This is not the time to show off your creative writing skills; a survey is a technical instrument and should be written in a way that is as direct and concise as possible. As I’ve mentioned earlier, your survey respondents have agreed to give their time and attention to your survey. The best way to show your appreciation for their time is to not waste it. Ensuring that your questions are clear and concise will go a long way toward showing your respondents the gratitude they deserve. Pilot testing the questionnaire with friends or colleagues can help identify these issues. This process is commonly called pretesting, but to avoid any confusion with pretesting in experimental design, we refer to it as pilot testing.
Related to the point about not wasting respondents’ time, make sure that every question you pose will be relevant to every person you ask to complete it. This means two things: first, that respondents have knowledge about whatever topic you are asking them about, and second, that respondents have experienced the events, behaviors, or feelings you are asking them to report. If you are asking participants for second-hand knowledge—asking clinicians about clients' feelings, asking teachers about students' feelings, and so forth—you may want to clarify that the variable you are asking about is the key informant's perception of what is happening in the target population. A well-planned sampling approach ensures that participants are the most knowledgeable population to complete your survey.
If you decide that you do wish to include questions about matters with which only a portion of respondents will have had experience, make sure you know why you are doing so. For example, if you are asking about MSW student study patterns, and you decide to include a question on studying for the social work licensing exam, you may only have a small subset of participants who have begun studying for the graduate exam or who took the bachelor's-level exam. If you decide to include this question that speaks to a minority of participants' experiences, think about why you are including it. Are you interested in how studying for class and studying for licensure differ? Are you trying to triangulate study skills measures? Researchers should carefully consider whether questions relevant to only a subset of participants are likely to produce enough valid responses for quantitative analysis.
Many times, questions that are relevant to a subsample of participants are conditional on an answer to a previous question. A participant might select that they rent their home, and as a result, you might ask whether they carry renter's insurance. That question is not relevant to homeowners, so it would be wise not to ask them to respond to it. In that case, the question of whether someone rents or owns their home is a filter question, designed to identify some subset of survey respondents who are asked additional questions that are not relevant to the entire sample. Figure 12.1 presents an example of how to accomplish this on a paper survey by adding instructions to the participant that indicate what question to proceed to next based on their response to the first one. Using online survey tools, researchers can use filter questions to only present relevant questions to participants.
Researchers should eliminate questions that ask about things participants don't know to minimize confusion. Assuming the question is relevant to the participant, other sources of confusion come from how the question is worded. The use of negative wording can be a source of potential confusion. Taking the question from Figure 12.1 about drinking as our example, what if we had instead asked, “Did you not abstain from drinking during your first semester of college?” This is a double negative, and it's not clear how to answer the question accurately. It is a good idea to avoid negative phrasing, when possible. For example, "did you not drink alcohol during your first semester of college?" is less clear than "did you drink alcohol your first semester of college?"
You should also avoid using terms or phrases that may be regionally or culturally specific (unless you are absolutely certain all your respondents come from the region or culture whose terms you are using). When I first moved to southwest Virginia, I didn’t know what a holler was. Where I grew up in New Jersey, to holler means to yell. Even then, in New Jersey, we shouted and screamed, but we didn’t holler much. In southwest Virginia, my home at the time, a holler also means a small valley in between the mountains. If I used holler in that way on my survey, people who live near me may understand, but almost everyone else would be totally confused. A similar issue arises when you use jargon, or technical language, that people do not commonly know. For example, if you asked adolescents how they experience imaginary audience, they would find it difficult to link those words to the concepts from David Elkind’s theory. The words you use in your questions must be understandable to your participants. If you find yourself using jargon or slang, break it down into terms that are more universal and easier to understand.
Asking multiple questions as though they are a single question can also confuse survey respondents. There’s a specific term for this sort of question; it is called a double-barreled question. Figure 12.2 shows a double-barreled question. Do you see what makes the question double-barreled? How would someone respond if they felt their college classes were more demanding but also more boring than their high school classes? Or less demanding but more interesting? Because the question combines “demanding” and “interesting,” there is no way to respond yes to one criterion but no to the other.
Another thing to avoid when constructing survey questions is the problem of social desirability. We all want to look good, right? And we all probably know the politically correct response to a variety of questions whether we agree with the politically correct response or not. In survey research, social desirability refers to the idea that respondents will try to answer questions in a way that will present them in a favorable light. (You may recall we covered social desirability bias in Chapter 11.)
Perhaps we decide that to understand the transition to college, we need to know whether respondents ever cheated on an exam in high school or college for our research project. We all know that cheating on exams is generally frowned upon (at least I hope we all know this). So, it may be difficult to get people to admit to cheating on a survey. But if you can guarantee respondents’ confidentiality, or even better, their anonymity, chances are much better that they will be honest about having engaged in this socially undesirable behavior. Another way to avoid problems of social desirability is to try to phrase difficult questions in the most benign way possible. Earl Babbie (2010) [38] offers a useful suggestion for helping you do this—simply imagine how you would feel responding to your survey questions. If you would be uncomfortable, chances are others would as well.
Exercises
Try to step outside your role as researcher for a second, and imagine you were one of your participants. Evaluate the following:
- Is the question too general? Sometimes, questions that are too general may not accurately convey respondents’ perceptions. If you asked someone how they liked a certain book and provided a response scale ranging from “not at all” to “extremely well,” and that person selected “extremely well,” what do they mean? Instead, ask more specific behavioral questions, such as “Will you recommend this book to others?” or “Do you plan to read other books by the same author?”
- Is the question too detailed? Avoid unnecessarily detailed questions that serve no specific research purpose. For instance, do you need the age of each child in a household or is just the number of children in the household acceptable? However, if unsure, it is better to err on the side of details than generality.
- Is the question presumptuous? Does your question make assumptions? For instance, if you ask, "what do you think the benefits of a tax cut would be?" you are presuming that the participant sees the tax cut as beneficial. But many people may not view tax cuts as beneficial. Some might see tax cuts as a precursor to less funding for public schools and fewer public services such as police, ambulance, and fire department. Avoid questions with built-in presumptions.
- Does the question ask the participant to imagine something? Is the question imaginary? A popular question on many television game shows is “If you won a million dollars on this show, how would you plan to spend it?” Most participants have never been faced with this large an amount of money and have never thought about this scenario. In fact, most don’t even know that after taxes, the value of the million dollars will be greatly reduced. In addition, some game shows spread the amount over a 20-year period. Without understanding this “imaginary” situation, participants may not have the background information necessary to provide a meaningful response.
Finally, it is important to get feedback on your survey questions from as many people as possible, especially people who are like those in your sample. Now is not the time to be shy. Ask your friends for help, ask your mentors for feedback, ask your family to take a look at your survey as well. The more feedback you can get on your survey questions, the better the chances that you will come up with a set of questions that are understandable to a wide variety of people and, most importantly, to those in your sample.
In sum, in order to pose effective survey questions, researchers should do the following:
- Identify how each question measures an independent, dependent, or control variable in their study.
- Keep questions clear and succinct.
- Make sure respondents have relevant lived experience to provide informed answers to your questions.
- Use filter questions to avoid getting answers from uninformed participants.
- Avoid questions that are likely to confuse respondents—including those that use double negatives, use culturally specific terms or jargon, and pose more than one question at a time.
- Imagine how respondents would feel responding to questions.
- Get feedback, especially from people who resemble those in the researcher’s sample.
Exercises
Let's complete a first draft of your questions. In the previous exercise, you listed all of the questions and answers you will use to measure the variables in your research question.
- In the previous exercise, you wrote out the questions and answers for each measure of your independent and dependent variables. Evaluate each question using the criteria listed above on effective survey questions.
- Type out questions for your control variables and evaluate them, as well. Consider what response options you want to offer participants.
Now, let's revise any questions that do not meet your standards!
- Use the BRUSO model in Table 12.2 for an illustration of how to address deficits in question wording. Keep in mind that you are writing a first draft in this exercise, and it will take a few drafts and revisions before your questions are ready to distribute to participants.
| Criterion | Poor | Effective |
| --- | --- | --- |
| B- Brief | “Are you now or have you ever been the possessor of a firearm?” | “Have you ever possessed a firearm?” |
| R- Relevant | “Who did you vote for in the last election?” | Note: Only include items that are relevant to your study. |
| U- Unambiguous | “Are you a gun person?” | “Do you currently own a gun?” |
| S- Specific | “How much have you read about the new gun control measure and sales tax?” | “How much have you read about the new sales tax on firearm purchases?” |
| O- Objective | “How much do you support the beneficial new gun control measure?” | “What is your view of the new gun control measure?” |
Writing response options
While posing clear and understandable questions in your survey is certainly important, so too is providing respondents with unambiguous response options. Response options are the answers that you provide to the people completing your questionnaire. Generally, respondents will be asked to choose a single (or best) response to each question you pose. We call questions in which the researcher provides all of the response options closed-ended questions. Keep in mind, closed-ended questions can also instruct respondents to choose multiple response options, rank response options against one another, or assign a percentage to each response option. But be cautious when experimenting with different response options! Accepting multiple responses to a single question may add complexity when it comes to quantitatively analyzing and interpreting your data.
Surveys need not be limited to closed-ended questions. Sometimes survey researchers include open-ended questions in their survey instruments as a way to gather additional details from respondents. An open-ended question does not include response options; instead, respondents are asked to reply to the question in their own way, using their own words. These questions are generally used to find out more about a survey participant’s experiences or feelings about whatever they are being asked to report in the survey. If, for example, a survey includes closed-ended questions asking respondents to report on their involvement in extracurricular activities during college, an open-ended question could ask respondents why they participated in those activities or what they gained from their participation. While responses to such questions may also be captured using a closed-ended format, allowing participants to share some of their responses in their own words can make the experience of completing the survey more satisfying to respondents and can also reveal new motivations or explanations that had not occurred to the researcher. This is particularly important for mixed-methods research. It is possible to analyze open-ended responses quantitatively using content analysis (i.e., counting how often a theme appears across responses and looking for statistical patterns). However, for most researchers, qualitative data analysis will be needed to analyze open-ended questions, and researchers need to think through how they will analyze any open-ended questions as part of their data analysis plan. We will address qualitative data analysis in greater detail in Chapter 19.
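If you are curious what the counting step of quantitative content analysis can look like, here is a minimal sketch in Python. Everything in it—the example responses and the theme keywords—is hypothetical and for illustration only; in a real study, the coding scheme would be developed by systematically reading your data (see Chapter 19).

```python
# Hypothetical open-ended responses about why students joined extracurricular activities
responses = [
    "I wanted to make friends and build my resume",
    "Mostly to make friends",
    "It looked good on my resume",
]

# Illustrative theme keywords; a real coding scheme comes from the data itself
themes = {
    "social connection": ["friend"],
    "career preparation": ["resume", "career"],
}

# Count how many responses mention each theme
counts = {theme: 0 for theme in themes}
for response in responses:
    for theme, keywords in themes.items():
        if any(keyword in response.lower() for keyword in keywords):
            counts[theme] += 1

print(counts)  # {'social connection': 2, 'career preparation': 2}
```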
To keep things simple, we encourage you to use only closed-ended response options in your study. While open-ended questions are not wrong, in our classrooms they are often a sign that students have not fully thought through how to operationally define and measure their key variables. Open-ended questions cannot be fully operationally defined in advance because you don't know what responses you will get. Instead, you will need to analyze the qualitative data using one of the techniques we discuss in Chapter 19 to interpret your participants' responses.
To write effective response options for closed-ended questions, there are a couple of guidelines worth following. First, be sure that your response options are mutually exclusive. Look back at Figure 12.1, which contains questions about how often and how many drinks respondents consumed. Do you notice that there are no overlapping categories in the response options for these questions? This is another one of those points about question construction that seems fairly obvious but can be easily overlooked. Second, response options should be exhaustive. In other words, every possible response should be covered in the set of response options that you provide. For example, note that in question 10a in Figure 12.1, we have covered all possibilities—those who drank, say, an average of once per month can choose the first response option (“less than one time per week”) while those who drank multiple times a day each day of the week can choose the last response option (“7+”). All the possibilities in between these two extremes are covered by the middle three response options, and every respondent fits into one of the response options we provided.
Earlier in this section, we discussed double-barreled questions. Response options can also be double-barreled, and this should be avoided. Figure 12.3 is an example of a question that uses double-barreled response options. Other tips about questions are also relevant to response options, including making sure participants are knowledgeable enough to select or decline a response option and avoiding jargon and cultural idioms.
Even if you phrase questions and response options clearly, participants are influenced by how many response options are presented on the questionnaire. For Likert scales, five or seven response options generally allow about as much precision as respondents are capable of. However, numerical scales with more options can sometimes be appropriate. For dimensions such as attractiveness, pain, and likelihood, a 0-to-10 scale will be familiar to many respondents and easy for them to use. Regardless of the number of response options, the most extreme ones should generally be “balanced” around a neutral or modal midpoint. An example of an unbalanced rating scale measuring perceived likelihood might look like this:
Unlikely | Somewhat Likely | Likely | Very Likely | Extremely Likely
Because we have four rankings of likely and only one ranking of unlikely, the scale is unbalanced and most responses will be biased toward "likely" rather than "unlikely." A balanced version might look like this:
Extremely Unlikely | Somewhat Unlikely | As Likely as Not | Somewhat Likely | Extremely Likely
In this example, the midpoint is halfway between likely and unlikely. Of course, a middle or neutral response option does not have to be included. Researchers sometimes choose to leave it out because they want to encourage respondents to think more deeply about their response and not simply choose the middle option by default. Fence-sitters are respondents who choose neutral response options even when they have an opinion. Some people will be drawn to respond “no opinion” even if they have an opinion, particularly if their true opinion is not a socially desirable one. Floaters, on the other hand, are those who choose a substantive answer to a question when, really, they don’t understand the question or don’t have an opinion.
As you can see, floating is the flip side of fence-sitting. Thus, the solution to one problem is often the cause of the other. How you decide which approach to take depends on the goals of your research. Sometimes researchers specifically want to learn something about people who claim to have no opinion. In this case, allowing for fence-sitting would be necessary. Other times researchers feel confident their respondents will all be familiar with every topic in their survey. In this case, perhaps it is okay to force respondents to choose one side or another (e.g., agree or disagree) without a middle option (e.g., neither agree nor disagree) or to not include an option like "don't know enough to say" or "not applicable." There is no always-correct solution to either problem. But in general, including a middle option in a response set provides a more exhaustive set of response options than one that excludes it.
The most important check before you finalize your response options is to align them with your operational definitions. As we've discussed before, your operational definitions include your measures (questions and response options) as well as how to interpret those measures in terms of the variable being measured. In particular, you should be able to interpret every response option to a question based on your operational definition of the variable it measures. If you wanted to measure the variable "social class," you might ask one question about a participant's annual income and another about family size. Your operational definition would need to provide clear instructions on how to interpret the response options. Your operational definition is basically like this social class calculator from Pew Research, though they include a few more questions in their definition.
To drill down a bit more, as Pew specifies in the section titled "how the income calculator works," the interval/ratio data respondents enter are interpreted using a formula that combines a participant's responses to the four questions posed by Pew and categorizes their household into one of three categories—upper, middle, or lower class. So, the operational definition includes the four questions comprising the measure and the formula, or interpretation, that converts responses into the three final categories we are familiar with: lower, middle, and upper class.
It is interesting to note that even though the final categorization—lower, middle, or upper class—is an ordinal level of measurement, the four questions Pew asks use an interval or ratio level of measurement (depending on the question). This means that respondents provide numerical responses rather than choosing categories like lower, middle, and upper class. It's perfectly normal for operational definitions to change levels of measurement in this way, and it's also perfectly normal for the level of measurement to stay the same. The important thing is that each response option a participant can provide is accounted for by the operational definition. Throw any combination of family size, location, or income at the Pew calculator, and it will place your household into one of those three social class categories.
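To make the idea of a "formula or interpretation" concrete, here is a minimal sketch of what an operational definition like this might look like in code. The household-size adjustment and the dollar cutoffs below are hypothetical illustrations, not Pew's actual method.

```python
import math

def social_class(annual_income: float, household_size: int) -> str:
    """Convert interval/ratio responses (income, household size) into an
    ordinal category. The adjustment and cutoffs are illustrative only."""
    # Adjust income for household size so households of different sizes are comparable
    adjusted_income = annual_income / math.sqrt(household_size)
    if adjusted_income < 30_000:
        return "lower"
    elif adjusted_income <= 90_000:
        return "middle"
    else:
        return "upper"

print(social_class(52_000, 3))  # 'middle' under these made-up cutoffs
```

The point of the sketch is simply that every possible combination of responses maps onto exactly one category of the variable being measured.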
Unlike Pew's definition, the operational definitions in your study may not need their own webpage to define and describe. For many questions and answers, interpreting response options is easy. If you were measuring "income" instead of "social class," you could simply operationalize the term by asking people to list their total household income before taxes are taken out. Higher values indicate higher income, and lower values indicate lower income. Easy. Regardless of whether your operational definitions are simple or more complex, every response option to every question on your survey (with a few exceptions) should be interpretable using an operational definition of a variable. Just like we want to avoid an everything-but-the-kitchen-sink approach to questions on our questionnaire, you want to make sure your final questionnaire only contains response options that you will use in your study.
One note of caution on interpretation (sorry for repeating this). We want to remind you again that an operational definition should not mention more than one variable. In our example above, your operational definition could not say "a family of three making under $50,000 is lower class; therefore, they are more likely to experience food insecurity." That last clause about food insecurity may well be true, but it's not a part of the operational definition for social class. Each variable (food insecurity and social class) should have its own operational definition. If you are talking about how to interpret the relationship between two variables, you are talking about your data analysis plan. We will discuss how to create your data analysis plan beginning in Chapter 14. For now, one consideration is that depending on the statistical test you use to test relationships between variables, you may need nominal, ordinal, or interval/ratio data. Your questions and response options should provide the level of measurement required by the specific statistical tests in your data analysis plan. Once you finalize your data analysis plan, return to your questionnaire to make sure each item's level of measurement matches the statistical test you've chosen.
In summary, to write effective response options researchers should do the following:
- Avoid wording that is likely to confuse respondents—including double negatives, culturally specific terms or jargon, and double-barreled response options.
- Ensure response options are relevant to participants' knowledge and experience so they can make an informed and accurate choice.
- Present mutually exclusive and exhaustive response options.
- Consider fence-sitters and floaters, and the use of neutral or "not applicable" response options.
- Define how response options are interpreted as part of an operational definition of a variable.
- Check that the level of measurement matches your operational definitions and the statistical tests in your data analysis plan (once you develop one).
Exercises
Look back at the response options you drafted in the previous exercise. Make sure you have a first draft of response options for each closed-ended question on your questionnaire.
- Using the criteria above, evaluate the wording of the response options for each question on your questionnaire.
- Revise your questions and response options until you have a complete first draft.
- Do your first read-through and provide a dummy answer to each question. Make sure you can link each response option and each question to an operational definition.
- Look ahead to Chapter 14 and consider how each item on your questionnaire will inform your data analysis plan.
From this discussion, we hope it is clear why researchers using quantitative methods spell out all of their plans ahead of time. Ultimately, there should be a straight line from operational definition through measures on your questionnaire to the data analysis plan. If your questionnaire includes response options that are not aligned with operational definitions or not included in the data analysis plan, the responses you receive back from participants won't fit with your conceptualization of the key variables in your study. If you do not fix these errors and proceed with collecting unstructured data, you will lose out on many of the benefits of survey research and face overwhelming challenges in answering your research question.
Designing questionnaires
Based on your work in the previous section, you should have a first draft of the questions and response options for the key variables in your study. Now, you’ll also need to think about how to present your written questions and response options to survey respondents. It's time to write a final draft of your questionnaire and make it look nice. Designing questionnaires takes some thought. First, consider the route of administration for your survey. What we cover in this section will apply equally to paper and online surveys, but if you are planning to use online survey software, you should watch tutorial videos and explore the features of the survey software you will use.
Informed consent & instructions
Writing effective items is only one part of constructing a survey. For one thing, every survey should have a written or spoken introduction that serves two basic functions (Peterson, 2000).[39] One is to encourage respondents to participate in the survey. In many types of research, such encouragement is not necessary either because participants do not know they are in a study (as in naturalistic observation) or because they are part of a subject pool and have already shown their willingness to participate by signing up and showing up for the study. Survey research usually catches respondents by surprise when they answer their phone, go to their mailbox, or check their e-mail—and the researcher must make a good case for why they should agree to participate. Thus, the introduction should briefly explain the purpose of the survey and its importance, provide information about the sponsor of the survey (university-based surveys tend to generate higher response rates), acknowledge the importance of the respondent’s participation, and describe any incentives for participating.
The second function of the introduction is to establish informed consent. Remember that this involves describing to respondents everything that might affect their decision to participate. This includes the topics covered by the survey, the amount of time it is likely to take, the respondent’s option to withdraw at any time, confidentiality issues, and other ethical considerations we covered in Chapter 6. Written consent forms are not always used in survey research (when the research is of minimal risk, completion of the survey instrument is often accepted by the IRB as evidence of consent to participate), so it is important that this part of the introduction be well documented and presented clearly and in its entirety to every respondent.
Organizing items to be easy and intuitive to follow
The introduction should be followed by the substantive questionnaire items. But first, it is important to present clear instructions for completing the questionnaire, including examples of how to use any unusual response scales. Remember that the introduction is the point at which respondents are usually most interested and least fatigued, so it is good practice to start with the most important items for purposes of the research and proceed to less important items. Items should also be grouped by topic or by type. For example, items using the same rating scale (e.g., a 5-point agreement scale) should be grouped together if possible to make things faster and easier for respondents. Demographic items are often presented last because they are least interesting to participants but also easy to answer in the event respondents have become tired or bored. Of course, any survey should end with an expression of appreciation to the respondent.
Questions are often organized thematically. If our survey were measuring social class, perhaps we’d have a few questions asking about employment, others focused on education, and still others on housing and community resources. Those may be the themes around which we organize our questions. Or perhaps it would make more sense to present any questions we had about parents' income and then present a series of questions about estimated future income. Grouping by theme is one way to be deliberate about how you present your questions. Keep in mind that you are surveying people, and these people will be trying to follow the logic in your questionnaire. Jumping from topic to topic can give people a bit of whiplash and may make participants less likely to complete it.
Using a matrix is a nice way of streamlining response options for similar questions. A matrix is a question type that lists a set of questions for which the answer categories are all the same. If you have a set of questions for which the response options are the same, it may make sense to create a matrix rather than posing each question and its response options individually. Not only will this save you some space in your survey, but it will also help respondents progress through your survey more easily. A sample matrix can be seen in Figure 12.4.
Once you have grouped similar questions together, you’ll need to think about the order in which to present those question groups. Most survey researchers agree that it is best to begin a survey with questions that will make respondents want to continue (Babbie, 2010; Dillman, 2000; Neuman, 2003).[40] In other words, don’t bore respondents, but don’t scare them away either. There’s some disagreement over where on a survey to place demographic questions, such as those about a person’s age, gender, and race. On the one hand, placing them at the beginning of the questionnaire may lead respondents to think the survey is boring, unimportant, and not something they want to bother completing. On the other hand, if your survey deals with a very sensitive topic, such as child sexual abuse or criminal convictions, you don’t want to scare respondents away or shock them by beginning with your most intrusive questions.
Your participants are human. They will react emotionally to questionnaire items, and they will also try to uncover your research questions and hypotheses. In truth, the order in which you present questions on a survey is best determined by the unique characteristics of your research. When feasible, you should consult with key informants from your target population to determine how best to order your questions. If it is not feasible to do so, think about the unique characteristics of your topic, your questions, and most importantly, your sample. Keeping in mind the characteristics and needs of the people you will ask to complete your survey should help guide you as you determine the most appropriate order in which to present your questions. None of your decisions will be perfect, and all studies have limitations.
Questionnaire length
You’ll also need to consider the time it will take respondents to complete your questionnaire. Surveys vary in length, from just a page or two to a dozen or more pages, which means they also vary in the time it takes to complete them. How long to make your survey depends on several factors. First, what is it that you wish to know? Wanting to understand how grades vary by gender and year in school certainly requires fewer questions than wanting to know how people’s experiences in college are shaped by demographic characteristics, college attended, housing situation, family background, college major, friendship networks, and extracurricular activities. Keep in mind that even if your research question requires a sizable number of questions be included in your questionnaire, do your best to keep the questionnaire as brief as possible. Any hint that you’ve thrown in a bunch of useless questions just for the sake of it will turn off respondents and may make them not want to complete your survey.
Second, and perhaps more important, how long are respondents likely to be willing to spend completing your questionnaire? If you are studying college students, asking them to use their limited free time to complete your survey may mean they won’t want to spend more than a few minutes on it. But if you ask them to complete your survey during down-time between classes when there is little work to be done, students may be willing to give you a bit more of their time. Think about places and times that your sampling frame naturally gathers and whether you would be able to either recruit participants or distribute a survey in that context. Estimate how long your participants would reasonably have to complete a survey presented to them during this time. The more you know about your population (such as which weeks have less work and more free time), the better you can target questionnaire length.
The time that survey researchers ask respondents to spend on questionnaires varies greatly. Some researchers advise that surveys should not take longer than about 15 minutes to complete (as cited in Babbie 2010),[41] whereas others suggest that up to 20 minutes is acceptable (Hopper, 2010).[42] As with question order, there is no clear-cut, always-correct answer about questionnaire length. The unique characteristics of your study and your sample should be considered to determine how long to make your questionnaire. For example, if you planned to distribute your questionnaire to students in between classes, you will need to make sure it is short enough to complete before the next class begins.
When designing a questionnaire, a researcher should consider:
- Weighing strengths and limitations of the method of delivery, including the advanced tools in online survey software or the simplicity of paper questionnaires.
- Grouping together items that ask about the same thing.
- Moving any questions about sensitive items to the end of the questionnaire, so as not to scare respondents off.
- Moving any questions that engage the respondent to answer the questionnaire at the beginning, so as not to bore them.
- Timing the length of the questionnaire with a reasonable length of time you can ask of your participants.
- Dedicating time to visual design and ensuring the questionnaire looks professional.
Exercises
Type out a final draft of your questionnaire in a word processor or online survey tool.
- Evaluate your questionnaire using the guidelines above, revise it, and get it ready to share with other student researchers.
Pilot testing and revising questionnaires
A good way to estimate the time it will take respondents to complete your questionnaire (and to surface other potential challenges) is through pilot testing. Pilot testing allows you to get feedback on your questionnaire so you can improve it before you actually administer it. It can be quite expensive and time-consuming if you wish to pilot test your questionnaire on a large sample of people who very much resemble the sample to whom you will eventually administer the finalized version of your questionnaire. But you can learn a lot and make great improvements to your questionnaire simply by pilot testing with a small number of people to whom you have easy access (perhaps you have a few friends who owe you a favor). By pilot testing your questionnaire, you can find out how understandable your questions are, get feedback on question wording and order, find out whether any of your questions are boring or offensive, and learn whether there are places where you should have included filter questions. You can also time pilot testers as they take your survey. This will give you a good idea of the time estimate to provide respondents when you administer your survey, and whether you have some wiggle room to add additional items or need to cut a few.
Perhaps this goes without saying, but your questionnaire should also have an attractive design. A messy presentation style can confuse respondents or, at the very least, annoy them. Be brief, to the point, and as clear as possible. Avoid cramming too much into a single page. Make your font size readable (at least 12 point or larger, depending on the characteristics of your sample), leave a reasonable amount of space between items, and make sure all instructions are exceptionally clear. If you are using an online survey, ensure that participants can complete it via mobile, computer, and tablet devices. Think about books, documents, articles, or web pages that you have read yourself—which were relatively easy to read and easy on the eyes and why? Try to mimic those features in the presentation of your survey questions. While online survey tools automate much of visual design, word processors are designed for writing all kinds of documents and may need more manual adjustment as part of visual design.
Realistically, your questionnaire will continue to evolve as you develop your data analysis plan over the next few chapters. By now, you should have a complete draft of your questionnaire grounded in an underlying logic that ties together each question and response option to a variable in your study. Once your questionnaire is finalized, you will need to submit it for ethical approval from your professor or the IRB. If your study requires IRB approval, it may be worthwhile to submit your proposal before your questionnaire is completely done. Revisions to IRB protocols are common and it takes less time to review a few changes to questions and answers than it does to review the entire study, so give them the whole study as soon as you can. Once the IRB approves your questionnaire, you cannot change it without their okay.
Key Takeaways
- A questionnaire is comprised of self-report measures of variables in a research study.
- Make sure your survey questions will be relevant to all respondents and that you use filter questions when necessary.
- Effective survey questions and responses take careful construction by researchers, as participants may be confused or otherwise influenced by how items are phrased.
- The questionnaire should start with informed consent and instructions, flow logically from one topic to the next, engage but not shock participants, and thank participants at the end.
- Pilot testing can help identify any issues in a questionnaire before distributing it to participants, including language or length issues.
Exercises
It's a myth that researchers work alone! Get together with a few of your fellow students and swap questionnaires for pilot testing.
- Use the criteria in each section above (questions, response options, questionnaires) and provide your peers with the strengths and weaknesses of their questionnaires.
- See if you can guess their research question and hypothesis based on the questionnaire alone.
11.3 Measurement quality
Learning Objectives
Learners will be able to...
- Define and describe the types of validity and reliability
- Assess for systematic error
The previous chapter provided insight into measuring concepts in social work research. We discussed the importance of identifying concepts and their corresponding indicators as a way to help us operationalize them. In essence, we now understand that when we think about our measurement process, we must be intentional and thoughtful in the choices that we make. This section is all about how to judge the quality of the measures you've chosen for the key variables in your research question.
Reliability
First, let’s say we’ve decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism this way, then it is likely that anyone who identifies as an alcoholic would respond “yes.” This may seem like a good way to identify our group of interest, but think about how you and your peer group may respond to this question. Would participants respond differently after a wild night out, compared to any other night? Could an infrequent drinker’s current headache from last night’s glass of wine influence how they answer the question this morning? How would that same person respond to the question before consuming the wine? In each case, the same person might respond differently to the same question at different points, so it is possible that our measure of alcoholism has a reliability problem. Reliability in measurement is about consistency.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behavior, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past 3 months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. On the other hand, we might get more accurate responses if we ask a participant how many drinks of any kind they have consumed in the past week.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviors. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. They may decide to conduct observations at a local pub by noting how many drinks patrons consume and how their behavior changes as their intake changes. What if the researcher has to use the restroom, and the patron next to them takes three shots of tequila during the brief period the researcher is away from their seat? The reliability of this researcher’s measure of alcohol intake depends on their ability to physically observe every instance of patrons consuming drinks. If they are unlikely to be able to observe every such instance, then perhaps their mechanism for measuring this concept is not reliable.
The following subsections describe the types of reliability that are important for you to know about, but keep in mind that you may see other approaches to judging reliability mentioned in the empirical literature.
Test-retest reliability
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time. Unlike an experiment, you aren't giving participants an intervention but trying to establish a reliable baseline of the variable you are measuring. Once you have these two measurements, you then look at the correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 11.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
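For readers who want to see the calculation behind a test-retest correlation, here is a minimal sketch in Python. The two sets of scores are made up for illustration; they stand in for the same scale administered to the same people at two points in time.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation between two sets of scores from the same people."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
    return covariance / (pstdev(x) * pstdev(y))

# Hypothetical self-esteem scores for five people, measured one week apart
time1 = [22, 18, 27, 30, 15]
time2 = [21, 19, 28, 29, 16]

print(round(pearson_r(time1, time2), 2))  # values of +.80 or greater suggest good test-retest reliability
```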
Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
Internal consistency
Another kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials. A specific statistical test known as Cronbach’s Alpha provides a way to measure how well each question of a scale is related to the others.
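As an illustration, here is a minimal sketch of Cronbach's alpha computed by hand in Python, using made-up responses to a three-item scale. In practice you would use a statistics package, but the logic is the same: compare the variance of the individual items to the variance of the total score.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of participants' item scores
    (one inner list of item responses per participant)."""
    k = len(item_scores[0])                                    # number of items
    item_variances = [pvariance(item) for item in zip(*item_scores)]
    total_variance = pvariance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses from four participants to a three-item agreement scale (1-5)
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
]

print(round(cronbach_alpha(responses), 2))  # higher values indicate greater internal consistency
```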
Interrater reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other.
Validity
Validity, another key element of assessing measurement quality, is the extent to which the scores from a measure represent the variable they are intended to measure. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.
Face validity
Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.
Content validity
Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Criterion validity
Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
Discriminant validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
Increasing the reliability and validity of measures
We have reviewed the types of errors and how to evaluate our measures based on reliability and validity considerations. However, what can we do while selecting or creating our tool to minimize the potential for error? Many of our options were covered in our discussion about reliability and validity. Nevertheless, the following table provides a quick summary of things that you should do when creating or selecting a measurement tool. While not all of these will be feasible in your project, it is important to implement the ones that are feasible in your research context.
Make sure that you engage in a rigorous literature review so that you understand the concept that you are studying. This means understanding the different ways that your concept may manifest itself. This review should include a search for existing instruments.[43]
- Do you understand all the dimensions of your concept? Do you have a good understanding of the content dimensions of your concept(s)?
- What instruments exist? How many items are on the existing instruments? Are these instruments appropriate for your population?
- Are these instruments standardized? Note: If an instrument is standardized, that means it has been rigorously studied and tested.
Consult content experts to review your instrument. This is a good way to check the face validity of your items. Additionally, content experts can also help you understand the content validity.[44]
- Do you have access to a reasonable number of content experts? If not, how can you locate them?
- Did you provide a list of critical questions for your content reviewers to use in the reviewing process?
Pilot test your instrument on a sufficient number of people and get detailed feedback.[45] Ask your group to provide feedback on the wording and clarity of items. Keep detailed notes and make adjustments BEFORE you administer your final tool.
- How many people will you use in your pilot testing?
- How will you set up your pilot testing so that it mimics the actual process of administering your tool?
- How will you receive feedback from your pilot testing group? Have you provided a list of questions for your group to think about?
Provide training for anyone collecting data for your project.[46] You should provide those helping you with a written research protocol that explains all of the steps of the project. You should also problem solve and answer any questions that those helping you may have. This will increase the chances that your tool will be administered in a consistent manner.
- How will you conduct your orientation/training? How long will it be? What modality?
- How will you select those who will administer your tool? What qualifications do they need?
When thinking of items, use a higher level of measurement, if possible.[47] This will provide more information and you can always downgrade to a lower level of measurement later.
- Have you examined your items and the levels of measurement?
- Have you thought about whether you need to modify the type of data you are collecting? Specifically, are you asking for information that is too specific (at a higher level of measurement) which may reduce participants' willingness to participate?
Use multiple indicators for a variable.[48] Think about the number of items that you will include in your tool.
- Do you have enough items? Enough indicators? The correct indicators?
Conduct an item-by-item assessment of multiple-item measures.[49] When you do this assessment, think about each word and how it changes the meaning of your item.
- Are there items that are redundant? Do you need to modify, delete, or add items?
Types of error
As you can see, measures never perfectly describe what exists in the real world. Good measures demonstrate validity and reliability but will always have some degree of error. Systematic error (also called bias) causes a measure to consistently output data that are incorrect in one direction or another, usually due to an identifiable process. Imagine you created a measure of height, but you didn’t put an option for anyone over six feet tall. If you gave that measure to your local college or university, some of the taller students might not be measured accurately. In fact, you would be under the mistaken impression that the tallest person at your school was six feet tall, when in actuality there are likely people taller than six feet at your school. This error seems innocent, but if you were using that measure to help you build a new building, those people might hit their heads!
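Here is a minimal simulation sketch of that height example, with made-up numbers, showing how a measure that caps out at six feet biases results in one direction:

```python
import random

random.seed(0)

# Simulated true heights in inches; some people are taller than 72"
true_heights = [random.gauss(68, 4) for _ in range(1000)]

# The flawed measure records anyone over six feet as exactly 72"
recorded_heights = [min(height, 72) for height in true_heights]

print(round(sum(true_heights) / len(true_heights), 1))          # true average
print(round(sum(recorded_heights) / len(recorded_heights), 1))  # recorded average is consistently lower: systematic error
```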
A less innocent form of error arises when researchers word questions in a way that might cause participants to think one answer choice is preferable to another. For example, if I were to ask you “Do you think global warming is caused by human activity?” you would probably feel comfortable answering honestly. But what if I asked you “Do you agree with 99% of scientists that global warming is caused by human activity?” Would you feel comfortable saying no, if that’s what you honestly felt? I doubt it. That is an example of a leading question, a question with wording that influences how a participant responds. We’ll discuss leading questions and other problems in question wording in greater detail in Chapter 12.
In addition to error created by the researcher, your participants can cause error in measurement. Some people will respond without fully understanding a question, particularly if the question is worded in a confusing way. Let’s consider another potential source of error. If we asked people if they always washed their hands after using the bathroom, would we expect people to be perfectly honest? Polling people about whether they wash their hands after using the bathroom might only elicit what people would like others to think they do, rather than what they actually do. This is an example of social desirability bias, in which participants in a research study want to present themselves in a positive, socially desirable way to the researcher. People in your study will want to seem tolerant, open-minded, and intelligent, but their true feelings may be closed-minded, simple, and biased. Participants may lie in this situation. This occurs often in political polling, which may show greater support for a candidate from a minority race, gender, or political party than actually exists in the electorate.
A related form of bias is called acquiescence bias, also known as “yea-saying.” It occurs when people say yes to whatever the researcher asks, even when doing so contradicts previous answers. For example, a person might say yes to both “I am a confident leader in group discussions” and “I feel anxious interacting in group discussions.” Those two responses are unlikely to both be true for the same person. Why would someone do this? Similar to social desirability, people want to be agreeable and nice to the researcher asking them questions or they might ignore contradictory feelings when responding to each question. You could interpret this as someone saying "yeah, I guess." Respondents may also act on cultural reasons, trying to “save face” for themselves or the person asking the questions. Regardless of the reason, the results of your measure don’t match what the person truly feels.
So far, we have discussed sources of error that come from choices made by respondents or researchers. Systematic errors will result in responses that are incorrect in one direction or another. For example, social desirability bias usually means that the number of people who say they will vote for a third party in an election is greater than the number of people who actually vote for that party. Systematic errors such as these can be reduced, but random error can never be eliminated. Unlike systematic error, which biases responses consistently in one direction or another, random error is unpredictable and does not consistently push scores higher or lower on a given measure. Instead, random error is more like statistical noise, which will likely average out across participants.
Random error is present in any measurement. If you’ve ever stepped on a bathroom scale twice and gotten two slightly different results, maybe a difference of a tenth of a pound, then you’ve experienced random error. Maybe you were standing slightly differently or had a fraction of your foot off of the scale the first time. If you were to take enough measures of your weight on the same scale, you’d be able to figure out your true weight. In social science, if you gave someone a scale measuring depression on a day after they lost their job, they would likely score differently than if they had just gotten a promotion and a raise. Even if the person were clinically depressed, our measure is subject to influence by the random occurrences of life. Thus, social scientists speak with humility about our measures. We are reasonably confident that what we found is true, but we must always acknowledge that our measures are only an approximation of reality.
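A quick simulation sketch (again with made-up numbers) shows why unbiased random error tends to average out, unlike the systematic error illustrated above:

```python
import random

random.seed(1)

true_weight = 150.0

# Each reading adds unpredictable, unbiased noise (random error)
measurements = [true_weight + random.gauss(0, 0.3) for _ in range(1000)]

print(round(sum(measurements) / len(measurements), 2))  # the average hovers very close to 150.0
```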
Humility is important in scientific measurement, as errors can have real consequences. At the time I'm writing this, my wife and I are expecting our first child. Like most people, we used a pregnancy test from the pharmacy. If the test said my wife was pregnant when she was not pregnant, that would be a false positive. On the other hand, if the test indicated that she was not pregnant when she was in fact pregnant, that would be a false negative. Even if the test is 99% accurate, that means that one in a hundred women will get an erroneous result when they use a home pregnancy test. For us, a false positive would have been initially exciting, then devastating when we found out we were not having a child. A false negative would have been disappointing at first and then quite shocking when we found out we were indeed having a child. While both false positives and false negatives are not very likely for home pregnancy tests (when taken correctly), measurement error can have consequences for the people being measured.
Key Takeaways
- Reliability is a matter of consistency.
- Validity is a matter of accuracy.
- There are many types of validity and reliability.
- Systematic error may arise from the researcher, participant, or measurement instrument.
- Systematic error biases results in a particular direction, whereas random error can be in any direction.
- All measures are prone to error and should be interpreted with humility.
Exercises
Use the measurement tools you located in the previous exercise. Evaluate the reliability and validity of these tools. Hint: You will need to go into the literature to "research" these tools.
- Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
- Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
- If you decide to create your own tool, how will you assess its validity and reliability?
Chapter Outline
- How do social workers know what to do? (12 minute read)
- The scientific method (16 minute read)
- Evidence-based practice (11 minute read + 4 minute video)
- Creating a question to examine scientific evidence (9 minute video)
Content warning: Examples in this chapter contain references to school discipline, child abuse, food insecurity, homelessness, poverty and anti-poverty stigma, anti-vaccination pseudoscience, autism, trauma and PTSD, mental health stigma, susto and culture-bound syndromes, gender-based discrimination at work, psychiatric hospitalizations, substance use, and mandatory treatment.
1.1 How do social workers know what to do?
Learning Objectives
Learners will be able to...
- Reflect on how we, as social workers, make decisions
- Differentiate between micro-, meso-, and macro-level analysis
- Describe the concept of intuition, its purpose in social work, and its limitations
- Identify specific errors in thinking and reasoning
What would you do?
Case 1: Imagine you are a clinical social worker at a children’s mental health agency. One day, you receive a referral from your town’s middle school about a client who often skips school, gets into fights, and is disruptive in class. The school has suspended him and met with the parents on multiple occasions, who say they practice strict discipline at home. Yet the client’s behavior has worsened. When you arrive at the school to meet with your client, who is also a gifted artist, you notice he seems to have bruises on his legs, has difficulty maintaining eye contact, and appears distracted. Despite this, he spends the hour painting and drawing, during which time you are able to observe him.
- Given your observations of your client's strengths and challenges, what intervention would you select, and how could you determine its effectiveness?
Case 2: Imagine you are a social worker working in the midst of an urban food desert (a geographic area in which there is no grocery store that sells fresh food). As a result, many of your low-income clients either eat takeout, or rely on food from the dollar store or a convenience store. You are becoming concerned about your clients’ health, as many of them are obese and say they are unable to buy fresh food. Your clients tell you that they have to rely on food pantries because convenience stores are expensive and often don't have the right kinds of food for their families. You have spent the past month building a coalition of community members to lobby your city council. The coalition includes individuals from non-profit agencies, religious groups, and healthcare workers.
- How should this group address the impact of food deserts in your community? What intervention(s) do you suggest? How would you determine whether your intervention was effective?
Case 3: You are a social worker working at a public policy center whose work focuses on the issue of homelessness. Your city is seeking a large federal grant to address this growing problem and has hired you as a consultant to work on the grant proposal. After interviewing individuals who are homeless and conducting a needs assessment in collaboration with local social service agencies, you meet with city council members to talk about potential opportunities for intervention. Local agencies want to spend the money to increase the capacity of existing shelters in the community. In addition, they want to create a transitional housing program at an unused apartment complex where people can reside upon leaving the shelter, and where they can gain independent living skills. On the other hand, homeless individuals you interview indicate that they would prefer to receive housing vouchers to rent an apartment in the community. They also fear the agencies running the shelter and transitional housing program would impose restrictions and unnecessary rules and regulations, thereby curbing their ability to freely live their lives. When you ask the agencies about these client concerns, they state that these clients need the structure and supervision provided by agency support workers.
- Which kind of program should your city choose to implement? Which is most likely to be effective and why?
Assuming you’ve taken a social work course before, you will notice that these case studies cover different levels of analysis in the social ecosystem—micro, meso, and macro. At the micro-level, social workers examine the smallest levels of interaction; in some cases, just “the self” alone (e.g. the child in case one).
When social workers investigate groups and communities, such as our food desert in case 2, their inquiry is at the meso-level.
At the macro-level, social workers examine social structures and institutions. Research at the macro-level examines large-scale patterns, including culture and government policy.
These three domains interact with one another, and it is common for a research project to address more than one level of analysis. For example, you may have a study about individuals at a case management agency (a micro-level study) that impacts the organization as a whole (meso-level) and incorporates policies and cultural issues (macro-level). Moreover, research that occurs on one level is likely to have multiple implications across domains.
How do social workers know what to do?
Welcome to social work research. This chapter begins with three problems that social workers might face in practice, and three questions about what a social worker should do next. If you haven’t already, spend a minute or two thinking about the three aforementioned cases and jot down some notes. How might you respond to each of these cases?
We assume it is unlikely you are an expert in the areas of children’s mental health, community responses to food deserts, and homelessness policy. Don’t worry, we're not either. In fact, for many of you, this textbook will likely come at an early point in your graduate social work education, so it may seem unfair for us to ask you what the 'right' answers are. And to disappoint you further, this course will not teach you the 'right' answer to these questions. It will, however, teach you how to answer these questions for yourself, and to find the 'right' answer that works best in each unique situation.
Assuming you are not an experienced practitioner in the areas described above, you likely used intuition (Cheung, 2016)[50] when thinking about what you would do in each of these scenarios. Intuition is a "gut feeling" about what to think about and do, often based on personal experience. What we experience influences how we perceive the world. For example, if you've witnessed representations of trauma in your practice, personal life, or in movies or television, you may have perceived that the child in case one was being physically abused and that his behavior was a sign of trauma. As you think about problems such as those described above, you find that certain details stay with you and influence your thinking to a greater degree than others. Using past experiences, you apply seemingly relevant knowledge and make predictions about what might be true.
Over a social worker's career, intuition evolves into practice wisdom. Practice wisdom is the “learning by doing” that develops as a result of practice experience. For example, a clinical social worker may have a "feel" for why certain clients would be a good fit to join a particular therapy group. This idea may be informed by direct experience with similar situations, reflections on previous experiences, and any consultation they receive from colleagues and/or supervisors. This "feel" that social workers get for their practice is a useful and valid source of knowledge and decision-making—do not discount it.
On the other hand, intuitive thinking can be prone to a number of errors. We are all limited in terms of what we know and experience. One's economic, social, and cultural background will shape intuition, and acting on your intuition may not work in a different sociocultural context. Because you cannot learn everything there is to know before you start your career as a social worker, it is important to learn how to understand and use social science to help you make sense of the world and to help you make sound, reasoned, and well-thought out decisions.
Social workers must learn how to take their intuition and deepen or challenge it by engaging with scientific literature. Similarly, social work researchers engage in research to make certain their interventions are effective and efficient (see section 1.4 for more information). Both of these processes—consuming and producing research—inform the social justice mission of social work. That's why the Council on Social Work Education (CSWE), which accredits the MSW program you are in, requires that you engage in social science.
Competency 4: Engage In Practice-informed Research and Research-informed Practice

Social workers understand quantitative and qualitative research methods and their respective roles in advancing a science of social work and in evaluating their practice. Social workers know the principles of logic, scientific inquiry, and culturally informed and ethical approaches to building knowledge. Social workers understand that evidence that informs practice derives from multi-disciplinary sources and multiple ways of knowing. They also understand the processes for translating research findings into effective practice. Social workers:

- use practice experience and theory to inform scientific inquiry and research;
- apply critical thinking to engage in analysis of quantitative and qualitative research methods and research findings; and
- use and translate research evidence to inform and improve practice, policy, and service delivery (CSWE, 2015).[51]
Errors in thinking
We all rely on mental shortcuts to help us figure out what to do in a practice situation. All people, including you and me, must train our minds to be aware of predictable flaws in thinking, termed cognitive biases. Here is a link to the Wikipedia entry on cognitive biases, as well as an interactive list. As you can see, there are many types of biases that can result in irrational conclusions.
The most important error in thinking for social scientists to be aware of is confirmation bias. Confirmation bias involves observing and analyzing information in a way that confirms what you already believe to be true. We all arrive at each moment with a set of personal beliefs, experiences, and worldviews that have been developed and ingrained over time. These patterns of thought inform our intuitions, primarily in an unconscious manner. Confirmation bias occurs when our mind ignores or manipulates information to avoid challenging what we already believe to be true.
In our second case study, we are trying to figure out how to help people who receive SNAP (sometimes referred to as Food Stamps) who live in a food desert. Let’s say we have arrived at a policy solution and are now lobbying the city council to implement it. There are many who have negative beliefs about people who are “on welfare.” These people may believe individuals who receive social welfare benefits spend their money irresponsibly, are too lazy to get a job, and manipulate the system to maintain or increase their government payout.
Those espousing this belief may point to an example such as Louis Cuff, who bought steak and lobster with his SNAP benefits and resold them for a profit. However, they are falling prey to assuming that one person's bad behavior reflects upon an entire group of people. City council members who hold these beliefs may ignore the truth about the client population—that people experiencing poverty usually spend their money responsibly and that they genuinely need help accessing fresh and healthy food. In this way, confirmation bias often makes people less capable of empathizing with one another because they have difficulty accepting alternative perspectives.
Errors in reasoning
Because the human mind is prone to errors, when anyone makes a statement about what is true or what should be done in a given situation, errors in logic may emerge. Think back to the case studies at the beginning of this section. You most likely had some ideas about what to do in each case, but where did those ideas come from? Below are some of the most common logical fallacies and the ways in which they may negatively influence a social worker. Consider how some of these might apply to your thinking about the practice situations in this chapter.
- Making a hasty generalization: when a person draws conclusions before having enough information. A social worker may apply lessons from a handful of clients to an entire population of people (see Louis Cuff, above). It is important to examine the scientific literature in order to avoid this.
- Confusing correlation with causation: when one concludes that because two things are correlated (as one changes, the other changes), they must be causally related. As an example, a social worker might observe both an increase in the minimum wage and higher unemployment in certain areas of the city. However, just because two things changed at the same time does not mean they are causally related. Social workers should explore other factors that might impact causality.
- Going down a slippery slope: when a person concludes that we should not do something because something far worse will happen if we do so. For example, a social worker may seek to increase a client's opportunity to choose their own activities, but face opposition from those who believe it will lead to clients making unreasonable demands. Clearly, this is nonsense. Changes that foster self-determination are unlikely to result in client revolt. Social workers should be skeptical of arguments that oppose small changes on the grounds that radical changes will inevitably follow.
- Appealing to authority: when a person draws a conclusion by appealing to the authority of an expert or reputable individual, rather than through the strength of the claim. You have likely encountered individuals who believe they are correct because another in a position of authority told them so. Instead, we should work to build a reflective and critical approach to practice that questions authority.
- Hopping on the bandwagon: when a person draws a conclusion consistent with popular belief. Just because something is popular does not mean it is correct. Fashionable ideas come and go. Social workers should engage with trendy ideas but must ground their work in scientific evidence rather than popular opinion.
- Using a straw man: when a person does not represent their opponent's position fairly or with sufficient depth. For example, a social worker advocating for a new group home may depict homeowners that are opposed to clients living in their neighborhood as individuals concerned only with their property values. However, this may not be the case. Social workers should instead engage deeply with all sides of an issue and represent them accurately.
Key Takeaways
- Social work research occurs at the micro-, meso-, and macro-level.
- Intuition is a powerful, though limited, source of information when making decisions.
- All human thought is subject to errors in thinking and reasoning.
- Scientific inquiry accounts for cognitive biases by applying an organized, logical way of observing and theorizing about the world.
Exercises
- Think about a social work topic you might want to study this semester as part of a research project. How do individuals commit specific errors in logic or reasoning when discussing a specific topic (e.g. Louis Cuff)? How can using scientific evidence help you combat popular myths about your topic that are based on erroneous thinking?
- Reflect on the strengths and limitations of your personal experiences as a way to guide your work with diverse populations. Describe an instance when your intuition may have resulted in biased or misguided thinking or behavior in a social work practice situation.
1.2 The scientific method
Learning Objectives
Learners will be able to...
- Define science and social science
- Describe the differences between objective truth and subjective truths
- Identify how qualitative and quantitative methods differ and how they can be used together
- Delineate the features of science that distinguish it from pseudoscience
If we asked you to draw a picture of science, what would you draw? Our guess is it would be something from a chemistry or biology classroom, like a microscope or a beaker. Maybe something from a science fiction movie. All social workers use scientific thinking in their practice. However, social workers have a unique understanding of what science means, one that is (not surprisingly) more open to the unexpected and human side of the social world.
Science and not-science
In social work, science is a way of 'knowing' that attempts to systematically collect and categorize facts or truths. A key word here is systematically—conducting science is a deliberate process. Scientists gather information about facts in a way that is organized and intentional, and usually follows a set of predetermined steps. Social work is not a science, but social work is informed by social science—the science of humanity, social interactions, and social structures. In other words, social work research uses organized and intentional procedures to uncover facts or truths about the social world. And social workers rely on social scientific research to promote change.
Science can also be thought of in terms of its impostor, pseudoscience. Pseudoscience refers to beliefs about the social world that are unsupported by scientific evidence. These claims are often presented as though they are based on science. But once researchers test them scientifically, they are demonstrated to be false. A scientifically uninformed social work practitioner using pseudoscience may recommend any number of ineffective, misguided, or harmful interventions. Pseudoscience often relies on information and scholarship that has not been reviewed by experts (i.e., peer review) or offers a selective and biased reading of reviewed literature.
An example of pseudoscience comes from anti-vaccination activists. Despite overwhelming scientific consensus that vaccines do not cause autism, a very vocal minority of people continue to believe that they do. Anti-vaccination advocates present their information as based in science, as seen here at Green Med Info. The author of this website shares real abstracts from scientific journal articles and studies but will only provide information on articles that show the potential dangers of vaccines, without showing any research that presents the positive and safe side of vaccines. Green Med Info is an example of confirmation bias, as all data presented on the website supports what the pseudo-scientific researcher believes to be true. For more information on assessing causal relationships, consult Chapter 8, where we discuss causality in detail.
The values and practices associated with the scientific method work to overcome common errors in thinking (such as confirmation bias). First, the scientific method uses established techniques from the literature to determine the likelihood of something being true or false. Researchers typically report which techniques they used, the reasons for their use, and how they came to the decision to use them. However, each technique comes with its own strengths and limitations. Rigorous science is about making the best choice, being open about your process, and allowing others to check your work. It is important to remember that there is no "perfect" study—all research has limitations because all scientific methods come with limitations.
Skepticism and debate
Unfortunately, the "perfect" researcher does not exist. Scientists are human, so they are subject to error and bias, such as gravitating toward fashionable ideas and thinking their work is more important than others' work. Theories and concepts fade in and out of use and may be tossed aside when new evidence challenges their truth. Part of the challenge in your research projects will be finding what you believe about an issue, rather than summarizing what others think about the topic. Good science, just like good social work practice, is authentic. When we see students present their research projects, those that are the strongest deliver both passionate and informed arguments about their topic area.
Good science is also open to ongoing questioning. Scientists are fundamentally skeptical. As such, they are likely to pursue alternative explanations. They might question the design of a study or replicate it to see if it works in another context. Scientists debate what is true until they arrive at a majority consensus. If you've ever heard that 97% of climate scientists agree that global warming is due to human activity[52] or that 99% of economists agree that tariffs make the economy worse,[53] you are seeing this sociology of science in action. This skepticism helps to catch situations in which scientists make the oh-so-human mistakes in thinking and reasoning reviewed in Section 1.1.
Skepticism also helps to identify unethical scientists, as with Andrew Wakefield's study linking the MMR vaccination and autism. When other researchers looked at his data, they found that he had altered the data to match his own conclusions and sought to benefit financially from the ensuing panic about vaccination (Godlee, Smith, & Marcovitch, 2011).[54] This highlights another key value in science: openness.
Openness
Through the use of publications and presentations, scientists share the methods used to gather and analyze data. The trend towards open science has also prompted researchers to share data as well. This in turn enables other researchers to re-run, replicate, and validate analyses and results. A major barrier to openness in science is the paywall. When you've searched online for a journal article (we will review search techniques in Chapter 3), you have likely run into the $25-$50 price tag. Don't despair—your university should subscribe to these journals. However, the push towards openness in science means that more researchers are sharing their work in open access journals, which are free for people to access (like this textbook!). These open access journals do not require a university subscription to view.
Openness also means engaging the broader public about your study. Social work researchers conduct studies to help people, and part of scientific work is making sure your study has an impact. For example, it is likely that many of the authors publishing in scientific journals are on Twitter or other social media platforms, relaying the importance of study findings. They may create content for popular media, including newspapers, websites, blogs, or podcasts. It may lead to training for agency workers or public administrators. Regrettably, academic researchers have a reputation for being aloof and disengaged from the public conversation. However, this reputation is slowly changing with the trend towards public scholarship and engagement. For example, see this recent section of the Journal of the Society of Social Work and Research on public impact scholarship.
Science supported by empirical data
Pseudoscience is often doctored up to look like science, but the surety with which its advocates speak is not backed up by empirical data. Empirical data refers to information about the social world gathered and analyzed through scientific observation or experimentation. Theory is also an important part of science, as we will discuss in Chapter 7. However, theories must be supported by empirical data—evidence that what we think is true really exists in the world.
There are two types of empirical data that social workers should become familiar with. Quantitative data refers to numbers and qualitative data usually refers to word data (like a transcript of an interview) but can also refer to pictures, performances, and other means of expressing oneself. Researchers use specific methods designed to analyze each type of data. Together, these are known as research methods, or the methods researchers use to examine empirical data.
Objective truth
In our vaccine example, scientists have conducted many studies tracking children who were vaccinated to look for future diagnoses of autism (see Taylor et al. 2014 for a review). This is an example of using quantitative data to determine whether there is a causal relationship between vaccination and autism. By examining the number of people who develop autism after vaccinations and controlling for all of the other possible causes, researchers can determine the likelihood of whether vaccinations cause changes in the brain that are eventually diagnosed as autism.
In this case, the use of quantitative data is a good fit for disproving myths about the dangers of vaccination. When researchers analyze quantitative data, they are trying to establish an objective truth. An objective truth is always true, regardless of context. Generally speaking, researchers seeking to establish objective truth tend to use quantitative data because they believe numbers don't lie. If repeated statistical analyses don't show a relationship between two variables, like vaccines and autism, that relationship almost certainly does not exist. By boiling everything down to numbers, we can minimize the biases and logical errors that human researchers bring to the scientific process. That said, the interpretation of those numbers is always up for debate.
This approach to finding truth probably sounds similar to something you heard in your middle school science classes. When you learned about gravitational force or the mitochondria of a cell, you were learning about the theories and observations that make up our understanding of the physical world. We assume that gravity is real and that the mitochondria of a cell are real. Mitochondria are easy to spot with a powerful microscope, and we can observe and theorize about their function in a cell. The gravitational force is invisible but clearly apparent from observable facts, such as watching an apple fall. If we were unable to perceive mitochondria or gravity, they would still be there, doing their thing, because they exist independent of our observation of them.
Let’s consider a social work example. Scientific research has established that children who are subjected to severely traumatic experiences are more likely to be diagnosed with a mental health disorder (e.g., Mahoney, Karatzias, & Hutton, 2019).[55] A diagnosis of post-traumatic stress disorder (PTSD) is considered objective, and may refer to a mental health issue that exists independent of the individual observing it and is highly similar in its presentation across clients. The Diagnostic and Statistical Manual of Mental Disorders (DSM-5, 2017)[56] identifies a group of criteria based on unbiased, neutral observations of the client. These criteria are grounded in research, and they make an objective diagnosis more likely to be valid and reliable. Through the clinician's observations and the client’s description of their symptoms, an objective determination of a mental health diagnosis can be made.
Subjective truth(s)
For those of you who are skeptical, you may ask yourself: does a diagnosis tell a client's whole story? No. It does not tell you what the client thinks and feels about their diagnosis, for example. Receiving a diagnosis of PTSD may be a relief for a client. The diagnosis may suggest the words to describe their experiences. In addition, this diagnosis may provide a direction for therapeutic work, as there are evidence-based interventions clinicians can use with each diagnosis. On the other hand, a client may feel shame and view the diagnosis as a label, defining them in a negative way and limiting their potential (Barsky, 2015).[57]
Imagine if we surveyed people with PTSD to see how they interpreted their diagnosis. Objectively, we could determine whether more people said the diagnosis was, overall, a positive or negative event for them. However, it is unlikely that the experience of receiving a diagnosis was either completely positive or completely negative. In social work, we know that a client's thoughts and emotions are rarely binary, either/or situations. Clients likely feel a mix of positive and negative thoughts and emotions during the diagnostic process. How they incorporate a diagnosis into their life story is unique. These messy bits are subjective truths, or the thoughts and feelings that arise as people interpret and make meaning of situations. Importantly, looking for subjective truths can help us see the contradictory and multi-faceted nature of people's thoughts, and qualitative data allows us to avoid oversimplifying them into negative and positive feelings that could be counted, as in quantitative data. It is the role of a researcher, just like a practitioner, to seek to understand things from the perspective of the client. Unlike with objective truth, this will not lead to a general sense of what is true for everyone, but rather what is true in a particular time and place.
Subjective truths are best expressed through qualitative data, like conversations with a client or looking at their social media posts or journal entries. As a researcher, we might invite a client to tell us how they felt after they were first diagnosed, after they spoke with family, and over the course of the therapeutic process. While it may look different from what we normally think of as science (e.g. pharmaceutical studies), these stories are indeed a rich source of data for scientific analysis. However, it is impossible to analyze what this client said without also considering the sociocultural context in which they live. For example, the concept of PTSD is generated from Western thought and philosophy. How might people from other cultures understand trauma differently?
In the DSM-5 classification of mental health disorders, there is a list of culture-bound syndromes which appear only in certain cultures. For example, susto describes a unique cluster of symptoms experienced by Latin Americans after a traumatic event (Nogueira, Mari, & Razzouk, 2015).[58] Susto involves more physical symptoms than a traditional PTSD diagnosis. Indeed, many of these syndromes do not fit within a Western conceptualization of mental health because they differentiate less between the mind and body. To a Western scientist, susto may seem less real than PTSD. To someone from Latin America, their symptoms may not fit neatly into the PTSD framework developed in Western nations. Science has historically privileged knowledge from the United States and other nations in the West and Global North, marking them as objectively true. The objectivity of Western science as universally applicable to all cultures has been increasingly called into question as science has become less dominated by white males and as interaction between cultures and groups has become broadly more democratic. Clearly, what is true depends in part on the context in which it is observed.
In this way, social scientists have a unique task. People are both objects and subjects. Objectively, you could quantify how tall a person is, what car they drive, how many adverse childhood experiences they had, or their score on a PTSD checklist. Subjectively, you could understand how a person made sense of a traumatic incident or how it contributed to certain patterns in thinking, negative feelings, or opportunities for growth, for example. It is this added dimension that renders social science unique to natural science (like biology or physics), which focuses almost exclusively on quantitative data and objective truth. For this reason, this book is divided between projects using qualitative methods and quantitative methods.
There is no "better" or "more true" way of approaching social science. Instead, the methods a researcher chooses should match the question they ask. If you want to answer, "do vaccines cause autism?" you should choose methods appropriate to answer that question. It seeks an objective truth—one that is true for everyone, regardless of context. Studies like these use quantitative data and statistical analyses to test mathematical relationships between variables. If, on the other hand, you wanted to know "what does a diagnosis of PTSD mean to clients?" you should collect qualitative data and seek subjective truths. You will gather stories and experiences from clients and interpret them in a way that best represents their unique and shared truths. Where there is consensus, you will report that. Where there is contradiction, you will report that as well.
Mixed methods
In this textbook, we will treat quantitative and qualitative research methods separately. However, it is important to remember that a project can include both approaches. A mixed methods study, which we will discuss more in Chapter 8, requires thinking through a more complicated project that includes at least one quantitative component, one qualitative component, and a plan to incorporate both approaches together. As a result, mixed methods projects may require more time for conceptualization, data collection, and analysis.
Finding patterns
Regardless of whether you are seeking objective or subjective truths, research and scientific inquiry aim to find and explain patterns. Most of the time, a pattern will not explain every single person’s experience, a fact about social science that is both fascinating and frustrating. Even individuals who do not know each other can create patterns that persist over time. Those new to social science may find these patterns frustrating because they may believe that the patterns describing their sex, age, or some other facet of their lives don’t represent their experience. It’s true. A pattern can exist among your cohort without your individual participation in it. There is diversity within diversity.
Let's consider some specific examples. You probably wouldn’t be surprised to learn that a person’s social class background has an impact on their educational attainment and achievement. You may be surprised to learn that people select romantic partners that have similar educational attainment, which, in turn, impacts their children's educational attainment (Eika, Mogstad, & Zafar, 2019).[59] People who have graduated college pair off with other college graduates, and so forth. This, in turn, reinforces existing inequalities, stratifying society by those who have the opportunity to complete college and those who don't.
People who object to these findings tend to cite evidence from their own personal experience. However, the problem with this response is that objecting to a social pattern on the grounds that it doesn’t match one’s individual experience misses the point about patterns. Patterns don’t perfectly predict what will happen to an individual person. Yet, they are a reasonable guide that, when systematically observed, can help guide social work thought and action. When we don't investigate these patterns scientifically, we are more likely to act on stereotypes, biases, and other harmful beliefs.
A final note on qualitative and quantitative methods
There is not one superior way to find patterns that help us understand the world. As we will learn about in Chapter 7, there are multiple philosophical, theoretical, and methodological ways to approach scientific truth. Qualitative methods aim to provide an in-depth understanding of a relatively small number of cases. They also provide a voice for the client. Quantitative methods offer less depth on each case but can say more about broad patterns because they typically focus on a much larger number of cases. A researcher should approach the process of scientific inquiry by formulating a clear research question and using the methodological tools best suited to that question.
Believe it or not, there are still significant methodological battles being waged in the academic literature on objective vs. subjective social science. Usually, quantitative methods are viewed as “more scientific” and qualitative methods are viewed as “less scientific.” Part of this battle is historical. As the social sciences developed, they were compared with the natural sciences, especially physics, which rely on mathematics and statistics to come to a truth. It is a hotly debated topic whether social science should adopt the philosophical assumptions of the natural sciences—with its emphasis on prediction, mathematics, and objectivity—or use a different set of tools—contextual understanding, language, and subjectivity—to find scientific truth.
You are fortunate to be in a profession that values multiple scientific ways of knowing. The qualitative/quantitative debate is fueled by researchers who may prefer one approach over another, either because their own research questions are better suited to one particular approach or because they happened to have been trained in one specific method. In this textbook, we’ll operate from the perspective that qualitative and quantitative methods are complementary rather than competing. While these two methodological approaches certainly differ, the main point is that they simply have different goals, strengths, and weaknesses. A social work researcher should select the method(s) that best match(es) the question they are asking.
Key Takeaways
- Social work is informed by science.
- Social science is concerned with both objective and subjective knowledge.
- Social science research aims to understand patterns in the social world.
- Social scientists use both qualitative and quantitative methods, which, while different, are often complementary.
Exercises
- Examine a pseudoscientific claim you've heard on the news or in conversation with others. Why do you consider it to be pseudoscientific? What empirical data can you find from a quick internet search that would demonstrate it lacks truth?
- Consider a topic you might want to study this semester as part of a research project. Provide a few examples of objective and subjective truths about the topic, even if you aren't completely certain they are correct. Identify how objective and subjective truths differ.
1.3 Evidence-based practice
Learning Objectives
Learners will be able to...
- Explain how social workers produce and consume research as part of practice
- Review the process of evidence-based practice and how social workers apply research knowledge with clients and groups
“Why am I in this class?”
“When will I ever use this information?"
While students aren't always so direct, we would wager a guess that these questions are on the mind of almost every student in a research methods class. And they are valid and important questions to ask! While it may seem strange, the answer is that you will probably use these skills often. Social workers engage with research on a daily basis by consuming it through popular media, social work education, and advanced training. They also often contribute to research projects, adding new scientific information to what we know. As professors, we also sometimes hear from field supervisors who say that research competencies are unimportant in their setting. One might wonder how these organizations measure program outcomes, report the impact of their program to board members or funding agencies, or create new interventions grounded in social theory and empirical evidence.
Social workers as research consumers
Whether you know it or not, your life is impacted by research every day. Many of our laws, social policies, and court proceedings are grounded in some degree of empirical research and evidence (Jenkins & Kroll-Smith, 1996).[60] That’s not to say that all laws and social policies are good or make sense. But you can’t have an informed opinion about any of them without understanding where they come from, how they were formed, and what their evidence base is. In order to be effective practitioners across micro, meso, and macro domains, social workers need to understand the root causes and policy solutions to social problems their clients are experiencing.
A recent lawsuit against Walmart provides an example of social science research in action. A sociologist named Professor William Bielby was enlisted by plaintiffs to conduct an analysis of Walmart’s personnel policies in order to support their claim that Walmart engages in gender discriminatory practices. Bielby’s analysis shows that Walmart’s compensation and promotion decisions may indeed have been vulnerable to gender bias. In June 2011, the United States Supreme Court decided against allowing the case to proceed as a class-action lawsuit (Wal-Mart Stores, Inc. v. Dukes, 2011).[61] While a class-action suit was not pursued in this case, consider the impact that such a suit against one of our nation’s largest employers could have had on companies, their employees, and even consumers around the country.[62]
A social worker might learn about this lawsuit through popular media, news media websites or television programs. Social science knowledge allows a social worker to apply a critical eye towards new information, regardless of the source. Unfortunately, popular media does not always report on scientific findings accurately. A social worker armed with scientific knowledge would be able to search for, read, and interpret the original study as well as other information that might challenge or support the study. Chapters 3, 4, and 5 of this textbook focus on information literacy, or how to understand what we already know about a topic and contribute to that body of knowledge.
When social workers consume research, they are usually doing so to inform their practice. Clinical social workers are required by a state licensing board to complete continuing education classes in order to remain informed on the latest information in their field. On the macro side, social workers at public policy think tanks consume information to inform advocacy and public awareness campaigns. Regardless of the role of the social worker, practice must be informed by research.
Evidence-based practice
Consuming research is the first component of evidence-based practice (EBP). Drisko and Grady (2015)[63] present EBP as a process composed of "four equally weighted parts: 1) client needs and current situation, (2) the best relevant research evidence, (3) client values and preferences, and (4) the clinician’s expertise" (p. 275). It is not simply “doing what the literature says,” but is rather a process by which practitioners examine the literature, client, self, and context to inform interventions with clients and systems (McNeese & Thyer, 2004).[64] It is a collaboration between social worker, client, and context. As we discussed in section 1.2, the patterns discovered by scientific research are not applicable to all situations. Instead, we rely on our critical thinking skills to apply scientific knowledge to real-world situations.
The bedrock of EBP is a proper assessment of the client or client system. Once we have a solid understanding of what the issue is, we can evaluate the literature to determine whether there are any interventions that have been shown to treat the issue, and if so, which have been shown to be the most effective. You will learn those skills in the next few chapters. Once we know what our options are, we should be upfront with clients about each option, what the interventions look like, and what the expected outcome will be. Once we have client feedback, we use our expertise and practice wisdom to make an informed decision about how to move forward.
If this sounds familiar, it's the same approach a doctor, physical therapist, or other health professional would use. This highlights a common critique of EBP: it is too focused on micro-level, clinical social work practice. Not every social worker is a clinical social worker. While there is a large body of literature on EBP for clinical practice, the same concepts apply to other social work roles as well. A social work manager should endeavor to be familiar with evidence-based management styles, and a social work policy advocate should argue for evidence-based policies.
In agency-based social work practice, EBP can take on a different role due to the complexities of the grant funding process. Funders naturally require agencies to demonstrate that their practice is effective. Agencies are almost always required to document that they are achieving the outcomes they intended. However, funders sometimes require agencies to choose from a limited list of interventions determined to be evidence-based practices. Funders want to direct their money to treatments that are proven to work, but by excluding funding for alternative approaches and limiting the degree to which clinicians can customize and localize interventions, financial incentives can bias the EBP process away from what is best for the client and community.
Not included in this model are clinical expertise, client values, or community context—key components of EBP and the therapeutic process. According to some funders, EBP is not a process conducted by a practitioner but instead consists of a list of interventions. Similar dynamics are at play in private clinical practice, in which insurance companies may specify the modality of therapy offered. For example, insurance companies may favor short-term, solution-focused therapy, which minimizes cost. But what happens when someone has an idea for a new kind of intervention? How do new approaches get "on the list" of EBPs of grant funders?
Social workers as research producers
Innovation in social work is incredibly important. Social workers work on wicked problems for their careers. For those of you who have practice experience, you may have had an idea of how to better approach a practice situation. That is another reason you are here in a research methods class. You (really!) will have bright ideas about what to do in practice. Sam Tsemberis relates an “Aha!” moment from his practice in this Ted talk on homelessness. While a faculty member at the New York University School of Medicine, he noticed a problem with people cycling in and out of the local psychiatric hospital wards. Clients would arrive in psychiatric crisis, stabilize under medical supervision in the hospital, and end up back at the hospital in psychiatric crisis shortly after discharge.
When he asked the clients what their issues were, they said they were unable to participate in homelessness programs because they were not always compliant with medication for their mental health diagnosis and they continued to use drugs and alcohol. The housing supports offered by the city government required abstinence and medication compliance before one was deemed "ready" for housing. For these clients, the problem was a homelessness service system that was unable to meet clients where they were—ready for housing, but not ready for abstinence and psychiatric medication. As a result, chronically homeless clients were cycling in and out of psychiatric crises, moving back and forth from the hospital to the street.
The solution that Sam Tsemberis implemented and popularized is called Housing First—an approach to homelessness prevention that starts by, you guessed it, providing people with housing first and foremost. Tsemberis's model addresses chronic homelessness in people with co-occurring disorders (those who have a diagnosis of a substance use and mental health disorder). The Housing First model states that housing is a human right: clients should not be denied their right to housing based on substance use or mental health diagnoses.
In Housing First programs, clients are provided housing as soon as possible. The Housing First agency provides wraparound treatment from an interdisciplinary team, including social workers, nurses, psychiatrists, and former clients who are in recovery. Over the past few decades, this program has gone from a single program in New York City to the program of choice for federal, state, and local governments seeking to address homelessness in their communities.
The main idea behind Housing First is that once clients have a residence of their own, they are better able to engage in mental health and substance use treatment. While this approach may seem logical to you, it is the opposite of the traditional homelessness treatment model. The traditional approach began with the client abstaining from drug and alcohol use and taking prescribed medication. Only after clients achieved these goals were they offered group housing. If the client remained sober and medication compliant, they could then graduate towards less restrictive individual housing.
Conducting and disseminating research allows practitioners to establish an evidence base for their innovation or intervention, and to argue that it is more effective than the alternatives, and should therefore be implemented more broadly. For example, by comparing clients who were served through Housing First with those receiving traditional services, Tsemberis could establish that Housing First was more effective at keeping people housed and at addressing mental health and substance use goals. Starting first with smaller studies and graduating to larger ones, Housing First built a reputation as an effective approach to addressing homelessness. When President Bush created the Collaborative Initiative to Help End Chronic Homelessness in 2003, Housing First was used in a majority of the interventions and its effectiveness was demonstrated on a national scale. In 2007, it was acknowledged as an evidence-based practice in the Substance Abuse and Mental Health Services Administration’s (SAMHSA) EBP resource center.[65]
We suggest browsing around the SAMHSA EBP Resource Center and looking for interventions on topics that interest you. Other sources of evidence-based practices include the Cochrane Reviews digital library and Campbell Collaboration. In the next few chapters, we will talk more about how to search for and locate literature about clinical interventions. The use of systematic reviews, meta-analyses, and randomized controlled trials are particularly important in this regard, types of research we will describe more in Chapter 3 and Chapter 4.
So why share the story of Housing First? Well, we want you to think about what you hope to contribute to our knowledge of social work practice. What is your bright idea and how can it change the world? Practitioners innovate all the time, often incorporating those innovations into their agency’s approach and mission. Using scientific research methods, agency-based social workers can demonstrate to policymakers and other social workers that their innovations should be more widely used. Without this wellspring of new ideas, social services would not be able to adapt to the changing needs of their communities. Social workers in agency-based practice may also participate in research projects taking place at their agency. Partnerships between schools of social work and agencies are a common way of testing and implementing innovations in social work. In such a case, all parties receive an advantage: clinicians receive specialized training, clients receive additional services, agencies gain prestige, and researchers can illustrate the effectiveness of an intervention.
Evidence-based practice highlights the unique perspective that social work brings to research. Social work both "holds" and critiques evidence. With regard to the former, "holding" evidence refers to the fact that the field of social work values scientific information. The Housing First example demonstrates how this interplay between valuing and critiquing science works—first by critiquing existing research and conducting research to establish a new approach to a problem. It also demonstrates the importance of listening to your target population and privileging their understanding and perception of the issue. While their understanding is not the result of scientific inquiry, it is deeply informed through years of direct experience with the issue and embedded within the relevant cultural and historical context. Although science often searches for the "one true answer," social work researchers must remain humble about the degree to which we can really know, and must begin to engage with other ways of knowing that may originate from clients and communities.
See the video on cultural humility in healthcare settings (CC-BY-NC-ND 3.0) embedded below for an example of how "one true answer" about a population can often oversimplify things and overstate how much we know about how to intervene in a given situation.
Key Takeaways
While you may not become a scientist in the sense of wearing a lab coat and using a microscope, social workers must understand science in order to engage in ethical practice. In this section, we reviewed ways in which research is a part of social work practice, including:
- Determining the best intervention for a client or system
- Ensuring existing services are accomplishing their goals
- Satisfying requirements to receive funding from private agencies and government grants
- Testing a new idea and demonstrating that it should be more widely implemented
Exercises
- Using a social work practice situation that you have experienced, walk through the four steps of the evidence-based practice process and how they informed your decision-making. Reflect on some of the difficulties applying EBP in the real world.
- Talk with a social worker about how they produce and consume research as part of practice. Consider asking them about the articles, books, and other scholarship that changed their practice or helped them think about a problem in a new way. You might also ask them about how they stay current with the literature as a practitioner. Reflect on your personal career goals and how research will fit into your future practice.
1.4 Creating a question to examine scientific evidence
Learning Objectives
Learners will be able to...
- Identify the common elements of an evidence-based practice research question
- Apply the acronym PICO to begin inquiry into a social work research topic
What is the PICO Model?
The PICO Model is a format that helps you turn an information need into a focused clinical question. By organizing a clinical question using PICO, the searcher can use specific terms to find clinically relevant evidence in the literature. PubMed alone has over 34 million citations to search through, so being able to reference a well-defined clinical question when reviewing titles and abstracts will help filter irrelevant materials out of the search results.
The PICO Model for Clinical Questions
| Element | Stands for | Question to ask |
| --- | --- | --- |
| P | Patient, Population, or Problem | How would I describe a group of patients similar to mine? |
| I | Intervention, Prognostic Factor, or Exposure | Which main intervention, prognostic factor, or exposure am I considering? |
| C | Comparison or Intervention (if appropriate) | What is the main alternative or gold standard to compare with the intervention? |
| O | Outcome you would like to measure or achieve | What can I hope to accomplish, measure, improve, or affect for my patient or population? |

Add additional elements, if you need to specify further.

| Element | Question to ask | Details to consider |
| --- | --- | --- |
| T | What Type of clinical question are you asking? | Diagnosis, Etiology/Harm, Therapy, Prognosis, Prevention categories |
| T | Is Time important to your search? | Duration of data collection, duration of treatment, time to follow-up |
| S | What Study type do you want to find? | What study design/methodology will address the clinical question according to the evidence hierarchy? |
Let's watch some examples of how to use PICO when approaching scientific literature. Please view the video below from the University of Binghamton Libraries on how to create a question to guide inquiry into evidence-based practice and other scientific questions about social work practice.
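To make the framework concrete, here is one hypothetical PICO question built from the Housing First example in section 1.3; the specific population, comparison, and outcome below are our own illustrative choices, not the only way to frame the issue.

- P (Population): adults experiencing chronic homelessness who have co-occurring mental health and substance use disorders
- I (Intervention): a Housing First program
- C (Comparison): a traditional abstinence-first transitional housing program
- O (Outcome): months of stable housing over one year

Put together: among adults experiencing chronic homelessness with co-occurring disorders, does a Housing First program, compared with a traditional abstinence-first program, increase months of stable housing over one year?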
Exercises
- Create at least two questions using the PICO framework that are interesting enough for you to read about.
Chapter Outline
- Operational definitions (36 minute read)
- Writing effective questions and questionnaires (38 minute read)
- Measurement quality (21 minute read)
Content warning: examples in this chapter contain references to ethnocentrism, toxic masculinity, racism in science, drug use, mental health and depression, psychiatric inpatient care, poverty and basic needs insecurity, pregnancy, and racism and sexism in the workplace and higher education.
11.1 Operational definitions
Learning Objectives
Learners will be able to...
- Define and give an example of indicators and attributes for a variable
- Apply the three components of an operational definition to a variable
- Distinguish between levels of measurement for a variable and how those differences relate to measurement
- Describe the purpose of composite measures like scales and indices
Last chapter, we discussed conceptualizing your project. Conceptual definitions are like dictionary definitions. They tell you what a concept means by defining it using other concepts. In this section we will move from the abstract realm (conceptualization) to the real world (measurement).
Operationalization is the process by which researchers spell out precisely how a concept will be measured in their study. It involves identifying the specific research procedures we will use to gather data about our concepts. If conceptually defining your terms means looking at theory, how do you operationally define your terms? By looking for indicators of when your variable is present or not, more or less intense, and so forth. Operationalization is probably the most challenging part of quantitative research, but once it's done, the design and implementation of your study will be straightforward.
Indicators
Operationalization works by identifying specific indicators that will be taken to represent the ideas we are interested in studying. If we are interested in studying masculinity, then the indicators for that concept might include some of the social roles prescribed to men in society such as breadwinning or fatherhood. Being a breadwinner or a father might therefore be considered indicators of a person’s masculinity. The extent to which a man fulfills either, or both, of these roles might be understood as clues (or indicators) about the extent to which he is viewed as masculine.
Let’s look at another example of indicators. Each day, Gallup researchers poll 1,000 randomly selected Americans to ask them about their well-being. To measure well-being, Gallup asks these people to respond to questions covering six broad areas: physical health, emotional health, work environment, life evaluation, healthy behaviors, and access to basic necessities. Gallup uses these six factors as indicators of the concept that they are really interested in, which is well-being.
Identifying indicators can be even simpler than the examples described thus far. Political party affiliation is another relatively easy concept for which to identify indicators. If you asked a person what party they voted for in the last national election (or gained access to their voting records), you would get a good indication of their party affiliation. Of course, some voters split tickets between multiple parties when they vote and others swing from party to party each election, so our indicator is not perfect. Indeed, if our study were about political identity as a key concept, operationalizing it solely in terms of who they voted for in the previous election leaves out a lot of information about identity that is relevant to that concept. Nevertheless, it's a pretty good indicator of political party affiliation.
Choosing indicators is not an arbitrary process. As described earlier, utilizing prior theoretical and empirical work in your area of interest is a great way to identify indicators in a scholarly manner. And your conceptual definitions will point you in the direction of relevant indicators. Empirical work will give you some very specific examples of how the important concepts in an area have been measured in the past and what sorts of indicators have been used. Often, it makes sense to use the same indicators as previous researchers; however, you may find that some previous measures have potential weaknesses that your own study will improve upon.
All of the examples in this chapter have dealt with questions you might ask a research participant on a survey or in a quantitative interview. If you plan to collect data from other sources, such as through direct observation or the analysis of available records, think practically about what the design of your study might look like and how you can collect data on various indicators feasibly. If your study asks about whether the participant regularly changes the oil in their car, you will likely not observe them directly doing so. Instead, you will likely need to rely on a survey question that asks them the frequency with which they change their oil or ask to see their car maintenance records.
Exercises
- What indicators are commonly used to measure the variables in your research question?
- How can you feasibly collect data on these indicators?
- Are you planning to collect your own data using a questionnaire or interview? Or are you planning to analyze available data like client files or raw data shared from another researcher's project?
Remember, you need raw data. Your research project cannot rely solely on the results reported by other researchers or the arguments you read in the literature. A literature review is only the first part of a research project, and your review of the literature should inform the indicators you end up choosing when you measure the variables in your research question.
Unlike conceptual definitions, which contain other concepts, an operational definition consists of the following components: (1) the variable being measured and its attributes, (2) the measure you will use, and (3) how you plan to interpret the data collected from that measure to draw conclusions about the variable you are measuring.
Step 1: Specifying variables and attributes
The first component, the variable, should be the easiest part. At this point in quantitative research, you should have a research question that has at least one independent and at least one dependent variable. Remember that variables must be able to vary. For example, the United States is not a variable. Country of residence is a variable, as is patriotism. Similarly, if your sample only includes men, gender is a constant in your study, not a variable. A constant is a characteristic that does not change in your study.
When social scientists measure concepts, they sometimes use the language of variables and attributes. A variable refers to a quality or quantity that varies across people or situations. Attributes are the characteristics that make up a variable. For example, the variable hair color would contain attributes like blonde, brown, black, red, gray, etc. A variable’s attributes determine its level of measurement. There are four possible levels of measurement: nominal, ordinal, interval, and ratio. The first two levels of measurement are categorical, meaning their attributes are categories rather than numbers. The latter two levels of measurement are continuous, meaning their attributes are numbers.
Levels of measurement
Hair color is an example of a nominal level of measurement. Nominal measures are categorical, and those categories cannot be mathematically ranked. As a brown-haired person (with some gray), I can’t say for sure that brown-haired people are better than blonde-haired people. As with all nominal levels of measurement, there is no ranking order between hair colors; they are simply different. That is what constitutes a nominal level of measurement. Gender and race are also measured at the nominal level.
What attributes are contained in the variable hair color? While blonde, brown, black, and red are common colors, some people may not fit into these categories if we only list these attributes. My wife, who currently has purple hair, wouldn’t fit anywhere. This means that our attributes were not exhaustive. Exhaustiveness means that all possible attributes are listed. We may have to list a lot of colors before we can meet the criteria of exhaustiveness. Clearly, there is a point at which exhaustiveness has been reasonably met. If a person insists that their hair color is light burnt sienna, it is not your responsibility to list that as an option. Rather, that person would reasonably be described as brown-haired. Perhaps adding an "other" category would suffice to make our list of colors exhaustive.
What about a person who has multiple hair colors at the same time, such as red and black? They would fall into multiple attributes. This violates the rule of mutual exclusivity, in which a person cannot fall into two different attributes. Instead of listing all of the possible combinations of colors, perhaps you might include a multi-color attribute to describe people with more than one hair color.
Making sure researchers provide mutually exclusive and exhaustive attributes is about making sure all people are represented in the data record. For many years, the attributes for gender were only male or female. Now, our understanding of gender has evolved to encompass more attributes that better reflect the diversity in the world. Children of parents from different races were often classified as one race or another, even if they identified with both cultures. The option for bi-racial or multi-racial on a survey not only more accurately reflects the racial diversity in the real world but validates and acknowledges people who identify in that manner. If we did not measure race in this way, we would leave the data record empty for people who identify as biracial or multiracial, impairing our search for truth.
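If it helps to see this logic concretely, here is a minimal sketch in Python, assuming a hypothetical attribute list and a hypothetical coding rule, of how a researcher might make sure every response falls into exactly one attribute (mutual exclusivity) while an "other" category keeps the list exhaustive.

```python
# Hypothetical attribute list for the nominal variable "hair color".
# "multi-color" and "other" keep the list exhaustive and mutually exclusive.
hair_color_attributes = ["blonde", "brown", "black", "red", "gray", "multi-color", "other"]

def categorize(response):
    """Place a raw response into exactly one attribute."""
    if " and " in response:              # more than one color reported
        return "multi-color"
    if response in hair_color_attributes:
        return response
    return "other"                       # catch-all for light burnt sienna, purple, etc.

# A few hypothetical survey responses.
for raw in ["brown", "purple", "red and black", "blonde"]:
    print(raw, "->", categorize(raw))
```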
Unlike nominal-level measures, attributes at the ordinal level can be rank ordered. For example, someone’s degree of satisfaction in their romantic relationship can be ordered by rank. That is, you could say you are not at all satisfied, a little satisfied, moderately satisfied, or highly satisfied. Note that even though these have a rank order to them (not at all satisfied is certainly worse than highly satisfied), we cannot calculate a mathematical distance between those attributes. We can simply say that one attribute of an ordinal-level variable is more or less than another attribute.
This can get a little confusing when using rating scales. If you have ever taken a customer satisfaction survey or completed a course evaluation for school, you are familiar with rating scales. “On a scale of 1-5, with 1 being the lowest and 5 being the highest, how likely are you to recommend our company to other people?” That surely sounds familiar. Rating scales use numbers, but only as a shorthand, to indicate what attribute (highly likely, somewhat likely, etc.) the person feels describes them best. You wouldn’t say you are “2” likely to recommend the company, but you would say you are not very likely to recommend the company. Ordinal-level attributes must also be exhaustive and mutually exclusive, as with nominal-level variables.
At the interval level, attributes must also be exhaustive and mutually exclusive and there is equal distance between attributes. Interval measures are also continuous, meaning their attributes are numbers, rather than categories. IQ scores are interval level, as are temperatures in Fahrenheit and Celsius. Their defining characteristic is that we can say how much more or less one attribute differs from another. We cannot, however, say with certainty what the ratio of one attribute is in comparison to another. For example, it would not make sense to say that a person with an IQ score of 140 has twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.
While we cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, we can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level. Finally, at the ratio level, attributes are mutually exclusive and exhaustive, attributes can be rank ordered, the distance between attributes is equal, and attributes have a true zero point. Thus, with these variables, we can say what the ratio of one attribute is in comparison to another. Examples of ratio-level variables include age and years of education. We know that a person who is 12 years old is twice as old as someone who is 6 years old. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. The differences between each level of measurement are visualized in Table 11.1.
| | Nominal | Ordinal | Interval | Ratio |
| Exhaustive | X | X | X | X |
| Mutually exclusive | X | X | X | X |
| Rank-ordered | | X | X | X |
| Equal distance between attributes | | | X | X |
| True zero point | | | | X |
Levels of measurement = levels of specificity
We have spent time learning how to determine a variable's level of measurement. Now what? How can we use this information as we measure concepts and develop measurement tools? First, the types of statistical tests we are able to use depend on our data's level of measurement. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval and ratio-level measurement are typically considered the most desirable because they permit any measure of central tendency to be computed (i.e., mean, median, or mode). Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. The higher the level of measurement, the more complex the statistical tests we are able to conduct. This knowledge may help us decide what kind of data we need to gather, and how.
That said, we have to balance this knowledge with the understanding that sometimes, collecting data at a higher level of measurement could negatively impact our studies. For instance, sometimes providing answers in ranges may make prospective participants feel more comfortable responding to sensitive items. Imagine that you were interested in collecting information on topics such as income, number of sexual partners, number of times someone used illicit drugs, etc. You would have to think about the sensitivity of these items and determine whether it would make more sense to collect some data at a lower level of measurement (e.g., asking whether they are sexually active or not (nominal) versus their total number of sexual partners (ratio)).
Finally, when analyzing data, researchers sometimes find a need to change a variable's level of measurement. For example, a few years ago, a student of mine was interested in studying the relationship between mental health and life satisfaction. This student used a variety of measures. One item asked about the number of mental health symptoms, reported as the actual number. When analyzing the data, my student examined the mental health symptom variable and noticed that she had two groups: those with zero or one symptom and those with many symptoms. Instead of using the ratio-level data (the actual number of mental health symptoms), she collapsed her cases into two categories, few and many, and used this variable in her analyses. It is important to note that you can collapse data from a higher level of measurement into a lower level; however, you cannot move data from a lower level to a higher level.
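As a rough illustration of these two points, here is a minimal sketch in Python, assuming a small, hypothetical set of symptom counts and an arbitrary cutoff, of computing central tendency at the ratio level and then collapsing the variable to a lower level of measurement.

```python
from statistics import mean, median, mode

# Hypothetical ratio-level data: number of mental health symptoms per participant.
symptom_counts = [0, 1, 0, 7, 8, 1, 9, 0, 6, 7]

# At the ratio level, every measure of central tendency is meaningful.
print("mean:", mean(symptom_counts), "median:", median(symptom_counts), "mode:", mode(symptom_counts))

# Collapse to two categories ("few" vs. "many"); the cutoff of one symptom is
# arbitrary here and would need to be justified by the data and the literature.
collapsed = ["few" if count <= 1 else "many" for count in symptom_counts]

# Once collapsed, only frequencies and the mode are appropriate summaries.
print({category: collapsed.count(category) for category in ("few", "many")})
```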
Exercises
- Check that the variables in your research question can vary...and that they are not constants or one of many potential attributes of a variable.
- Think about the attributes your variables have. Are they categorical or continuous? What level of measurement seems most appropriate?
Step 2: Specifying measures for each variable
Let’s pick a social work research question and walk through the process of operationalizing variables to see how specific we need to get. I’m going to hypothesize that residents of a psychiatric unit who are more depressed are less likely to be satisfied with care. Remember, this would be an inverse relationship: as depression increases, satisfaction decreases. In this question, depression is my independent variable (the cause) and satisfaction with care is my dependent variable (the effect). Now that we have identified our variables, their attributes, and levels of measurement, we can move on to the second component: the measure itself.
So, how would you measure my key variables: depression and satisfaction? What indicators would you look for? Some students might say that depression could be measured by observing a participant’s body language. They may also say that a depressed person will often express feelings of sadness or hopelessness. In addition, a satisfied person might be happy around service providers and often express gratitude. While these factors may indicate that the variables are present, they lack coherence. Unfortunately, what this “measure” is actually saying is “I know depression and satisfaction when I see them.” While you are likely a decent judge of depression and satisfaction, in a research study you need to specify exactly how you plan to measure your variables. Your judgment is subjective, based on your own idiosyncratic experiences with depression and satisfaction. It couldn’t be replicated by another researcher, and it can’t be applied consistently to a large group of people. Operationalization requires that you come up with a specific and rigorous measure for determining who is depressed or satisfied.
Finding a good measure for your variable depends on the kind of variable it is. Variables that are directly observable don't come up very often in my students' classroom projects, but they might include things like taking someone's blood pressure, marking attendance or participation in a group, and so forth. To measure an indirectly observable variable like age, you would probably put a question on a survey that asked, “How old are you?” Measuring a variable like income might require some more thought, though. Are you interested in this person’s individual income or the income of their family unit? This might matter if your participant does not work or is dependent on other family members for income. Do you count income from social welfare programs? Are you interested in their income per month or per year? Even though indirect observables are relatively easy to measure, the measures you use must be clear in what they are asking, and operationalization is all about figuring out the specifics of what you want to know. For more complicated constructs, you will need compound measures (that use multiple indicators to measure a single variable).
How you plan to collect your data also influences how you will measure your variables. For social work researchers using secondary data like client records as a data source, you are limited by what information is in the data sources you can access. If your organization uses a given measurement for a mental health outcome, that is the one you will use in your study. Similarly, if you plan to study how long a client was housed after an intervention using client visit records, you are limited by how their caseworker recorded their housing status in the chart. One of the benefits of collecting your own data is being able to select the measures you feel best exemplify your understanding of the topic.
Measuring unidimensional concepts
The previous section mentioned two important considerations: how complicated the variable is and how you plan to collect your data. With these in hand, we can use the level of measurement to further specify how you will measure your variables and consider specialized rating scales developed by social science researchers.
Measurement at each level
Nominal measures assess categorical variables. These measures are used for variables or indicators that have mutually exclusive attributes, but that cannot be rank-ordered. Nominal measures ask about the variable and provide names or labels for different attribute values like social work, counseling, and nursing for the variable profession. Nominal measures are relatively straightforward.
Ordinal measures often use a rating scale, which is an ordered set of responses that participants must choose from. Figure 11.1 shows several examples. The number of response options on a typical rating scale is usually five or seven, though it can range from three to 11. Five-point scales are best for unipolar scales where only one construct is tested, such as frequency (Never, Rarely, Sometimes, Often, Always). Seven-point scales are best for bipolar scales where there is a dichotomous spectrum, such as liking (Like very much, Like somewhat, Like slightly, Neither like nor dislike, Dislike slightly, Dislike somewhat, Dislike very much). For bipolar questions, it is useful to offer an earlier question that branches respondents into an area of the scale; if asking about liking ice cream, first ask “Do you generally like or dislike ice cream?” Once the respondent chooses like or dislike, refine it by offering them relevant choices from the seven-point scale. Branching improves both reliability and validity (Krosnick & Berent, 1993).[66] Although you often see scales with numerical labels, it is best to present only verbal labels to the respondents and convert them to numerical values in the analyses. Avoid partial labels and overly long or overly specific labels. In some cases, the verbal labels can be supplemented with (or even replaced by) meaningful graphics. The last rating scale shown in Figure 11.1 is a visual-analog scale, on which participants make a mark somewhere along the horizontal line to indicate the magnitude of their response.
Interval measures are those where the values measured are not only rank-ordered, but are also equidistant from adjacent attributes. For example, on the temperature scale (in Fahrenheit or Celsius), the difference between 30 and 40 degrees Fahrenheit is the same as that between 80 and 90 degrees Fahrenheit. Likewise, if you have a scale that asks respondents’ annual income using the following attributes (ranges): $0 to 10,000, $10,000 to 20,000, $20,000 to 30,000, and so forth, this is also an interval measure, because the mid-points of each range (i.e., $5,000, $15,000, $25,000, etc.) are equidistant from each other. The intelligence quotient (IQ) scale is also an interval measure, because the measure is designed such that the difference between IQ scores of 100 and 110 is supposed to be the same as between 110 and 120 (although we do not really know whether that is truly the case). Interval measures allow us to examine “how much more” one attribute is when compared to another, which is not possible with nominal or ordinal measures. You may find researchers who “pretend” (incorrectly) that ordinal rating scales are actually interval measures so that they can use different statistical techniques for analyzing them. As we will discuss in the latter part of the chapter, this is a mistake because there is no way to know whether the difference between a 3 and a 4 on a rating scale is the same as the difference between a 2 and a 3. Those numbers are just placeholders for categories.
Ratio measures are those that have all the qualities of nominal, ordinal, and interval scales, and in addition, also have a “true zero” point (where the value zero implies lack or non-availability of the underlying construct). Think about how to measure the number of people working in human resources at a social work agency. It could be one, several, or none (if the company contracts out for those services). Measuring interval and ratio data is relatively easy, as people either select or input a number for their answer. If you ask a person how many eggs they purchased last week, they can simply tell you they purchased a dozen eggs at the store, two at breakfast on Wednesday, or none at all.
Commonly used rating scales in questionnaires
The level of measurement will give you the basic information you need, but social scientists have developed specialized instruments for use in questionnaires, a common tool used in quantitative research. As we mentioned before, if you plan to source your data from client files or previously published results, you will be limited to the measures already used in those sources.
Although Likert scale is a term colloquially used to refer to almost any rating scale (e.g., a 0-to-10 life satisfaction scale), it has a much more precise meaning. In the 1930s, researcher Rensis Likert (pronounced LICK-ert) created a new approach for measuring people’s attitudes (Likert, 1932).[67] It involves presenting people with several statements—including both favorable and unfavorable statements—about some person, group, or idea. Respondents then express their agreement or disagreement with each statement on a 5-point scale: Strongly Agree, Agree, Neither Agree nor Disagree, Disagree, Strongly Disagree. Numbers are assigned to each response and then summed across all items to produce a score representing the attitude toward the person, group, or idea. For items that are phrased in an opposite direction (e.g., negatively worded statements instead of positively worded statements), reverse coding is used so that the numerical scoring of statements also runs in the opposite direction. The entire set of items came to be called a Likert scale, as indicated in Table 11.2 below.
Unless you are measuring people’s attitude toward something by assessing their level of agreement with several statements about it, it is best to avoid calling it a Likert scale. You are probably just using a rating scale. Likert scales allow for more granularity (more finely tuned response) than yes/no items, including whether respondents are neutral to the statement. Below is an example of how we might use a Likert scale to assess your attitudes about research as you work your way through this textbook.
| | Strongly agree | Agree | Neutral | Disagree | Strongly disagree |
| I like research more now than when I started reading this book. | | | | | |
| This textbook is easy to use. | | | | | |
| I feel confident about how well I understand levels of measurement. | | | | | |
| This textbook is helping me plan my research proposal. | | | | | |
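To make the scoring logic concrete, here is a minimal sketch in Python, assuming hypothetical responses and hypothetical 1-5 point values, of how verbal labels are converted to numbers, reverse coded where needed, and summed into a total attitude score.

```python
# Convert verbal labels to numbers only at the analysis stage (hypothetical 1-5 coding).
point_values = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3, "Agree": 4, "Strongly agree": 5}

# One participant's hypothetical responses; the last item is negatively worded.
responses = {
    "likes_research_more": "Agree",
    "textbook_easy_to_use": "Strongly agree",
    "confident_about_levels": "Neutral",
    "research_is_boring": "Disagree",
}
reverse_coded = {"research_is_boring"}

def score_item(item, label):
    value = point_values[label]
    # Reverse code negatively worded items so a higher score always means a
    # more favorable attitude (1 <-> 5, 2 <-> 4, 3 stays 3).
    return 6 - value if item in reverse_coded else value

total = sum(score_item(item, label) for item, label in responses.items())
print("Attitude-toward-research score:", total)   # higher = more favorable
```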
Semantic differential scales are composite (multi-item) scales in which respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites. Whereas in the above Likert scale, the participant is asked how much they agree or disagree with a statement, in a semantic differential scale the participant is asked to indicate how they feel about a specific item. This makes the semantic differential scale an excellent technique for measuring people’s attitudes or feelings toward objects, events, or behaviors. Table 11.3 is an example of a semantic differential scale that was created to assess participants' feelings about this textbook.
1) How would you rate your opinions toward this textbook?
| | Very much | Somewhat | Neither | Somewhat | Very much | |
| Boring | | | | | | Exciting |
| Useless | | | | | | Useful |
| Hard | | | | | | Easy |
| Irrelevant | | | | | | Applicable |
The Guttman scale, designed by Louis Guttman, is a composite scale that uses a series of items arranged in increasing order of intensity (least intense to most intense) of the concept. This type of scale allows us to understand the intensity of beliefs or feelings. Each item in a Guttman scale has a weight (not indicated on the tool itself) which varies with the intensity of that item, and the weighted combination of the responses is used as an aggregate measure of an observation.
Example Guttman Scale Items
- I often felt the material was not engaging Yes/No
- I was often thinking about other things in class Yes/No
- I was often working on other tasks during class Yes/No
- I will work to abolish research from the curriculum Yes/No
Notice how the items move from lower intensity to higher intensity. A researcher reviews the yes answers and creates a score for each participant.
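Since the weights are not shown on the tool itself, here is a minimal sketch in Python, with entirely hypothetical weights, of how the yes answers to the items above could be combined into a single intensity score.

```python
# Example Guttman items ordered from least to most intense, with hypothetical
# weights; real scales set these weights during scale development.
items_and_weights = [
    ("material_not_engaging", 1),
    ("thinking_about_other_things", 2),
    ("working_on_other_tasks", 3),
    ("abolish_research_from_curriculum", 4),
]

# One participant's yes/no answers (hypothetical).
answers = {
    "material_not_engaging": True,
    "thinking_about_other_things": True,
    "working_on_other_tasks": False,
    "abolish_research_from_curriculum": False,
}

# The aggregate score is the weighted combination of the "yes" responses.
score = sum(weight for item, weight in items_and_weights if answers[item])
print("Guttman intensity score:", score)   # 3 here; higher = more intense
```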
Composite measures: Scales and indices
Depending on your research design, your measure may be something you put on a survey or pre/post-test that you give to your participants. For a variable like age or income, one well-worded question may suffice. Unfortunately, most variables in the social world are not so simple. Depression and satisfaction are multidimensional concepts. Relying on a single indicator like a question that asks "Yes or no, are you depressed?" does not encompass the complexity of depression, including issues with mood, sleeping, eating, relationships, and happiness. There is no easy way to delineate between multidimensional and unidimensional concepts, as it's all in how you think about your variable. Satisfaction could be validly measured using a unidimensional ordinal rating scale. However, if satisfaction were a key variable in our study, we would need a theoretical framework and conceptual definition for it. That means we'd probably have more indicators to ask about, like timeliness, respect, sensitivity, and many others, and we would want our study to say something about what satisfaction truly means in terms of our other key variables. However, if satisfaction is not a key variable in your conceptual framework, it makes sense to operationalize it as a unidimensional concept.
For more complicated measures, researchers use scales and indices (sometimes called indexes) to measure their variables because they assess multiple indicators to develop a composite (or total) score. Composite scores provide a much greater understanding of concepts than a single item could. Although we won't delve too deeply into the process of scale development, we will cover some important topics for you to understand how scales and indices developed by other researchers can be used in your project.
Although they exhibit differences (which will be discussed later), scales and indices have several features in common.
- Both are ordinal measures of variables.
- Both can order the units of analysis in terms of specific variables.
- Both are composite measures.
Scales
The previous section discussed how to measure respondents’ responses to predesigned items or indicators belonging to an underlying construct. But how do we create the indicators themselves? The process of creating the indicators is called scaling. More formally, scaling is a branch of measurement that involves the construction of measures by associating qualitative judgments about unobservable constructs with quantitative, measurable metric units. Stevens (1946)[68] said, “Scaling is the assignment of objects to numbers according to a rule.” This process of measuring abstract concepts in concrete terms remains one of the most difficult tasks in empirical social science research.
The outcome of a scaling process is a scale, which is an empirical structure for measuring items or indicators of a given construct. Understand that multidimensional “scales”, as discussed in this section, are a little different from “rating scales” discussed in the previous section. A rating scale is used to capture the respondents’ reactions to a given item on a questionnaire. For example, an ordinally scaled item captures a value between “strongly disagree” to “strongly agree.” Attaching a rating scale to a statement or instrument is not scaling. Rather, scaling is the formal process of developing scale items, before rating scales can be attached to those items.
If creating your own scale sounds painful, don’t worry! For most multidimensional variables, you would likely be duplicating work that has already been done by other researchers. Specifically, this is a branch of science called psychometrics. You do not need to create a scale for depression because scales such as the Patient Health Questionnaire (PHQ-9), the Center for Epidemiologic Studies Depression Scale (CES-D), and Beck’s Depression Inventory (BDI) have been developed and refined over dozens of years to measure variables like depression. Similarly, scales such as the Patient Satisfaction Questionnaire (PSQ-18) have been developed to measure satisfaction with medical care. As we will discuss in the next section, these scales have been shown to be reliable and valid. While you could create a new scale to measure depression or satisfaction, a study with rigor would pilot test and refine that new scale over time to make sure it measures the concept accurately and consistently. This high level of rigor is often unachievable in student research projects because of the cost and time involved in pilot testing and validating, so using existing scales is recommended.
Unfortunately, there is no good one-stop shop for psychometric scales. The Mental Measurements Yearbook provides a searchable database of measures for social science variables, though it is woefully incomplete and often does not contain the full documentation for scales in its database. You can access it from a university library’s list of databases. If you can’t find anything in there, your next stop should be the methods section of the articles in your literature review. The methods section of each article will detail how the researchers measured their variables, and often the results section is instructive for understanding more about measures. In a quantitative study, researchers may have used a scale to measure key variables and will provide a brief description of that scale, its name, and maybe a few example questions. If you need more information, look at the results section and tables discussing the scale to get a better idea of how the measure works. Looking beyond the articles in your literature review, searching Google Scholar using queries like “depression scale” or “satisfaction scale” should also provide some relevant results. For example, searching for documentation for the Rosenberg Self-Esteem Scale (which we will discuss in the next section), I found a report from researchers investigating acceptance and commitment therapy which details this scale and many others used to assess mental health outcomes. If you find the name of the scale somewhere but cannot find the documentation (all questions and answers plus how to interpret the scale), a general web search with the name of the scale and ".pdf" may bring you to what you need. Or, to get professional help with finding information, always ask a librarian!
Unfortunately, these approaches do not guarantee that you will be able to view the scale itself or get information on how it is interpreted. Many scales cost money to use and may require training to properly administer. You may also find scales that are related to your variable but would need to be slightly modified to match your study’s needs. You could adapt a scale to fit your study; however, changing even small parts of a scale can influence its accuracy and consistency. While it is perfectly acceptable in student projects to adapt a scale without testing it first (time may not allow you to do so), pilot testing is always recommended for adapted scales, and researchers seeking to draw valid conclusions and publish their results must take this additional step.
Indices
An index is a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas. It is different from a scale. Scales also aggregate measures; however, these measures examine different dimensions or the same dimension of a single construct. A well-known example of an index is the consumer price index (CPI), which is computed every month by the Bureau of Labor Statistics of the U.S. Department of Labor. The CPI is a measure of how much consumers have to pay for goods and services (in general) and is divided into eight major categories (food and beverages, housing, apparel, transportation, healthcare, recreation, education and communication, and “other goods and services”), which are further subdivided into more than 200 smaller items. Each month, government employees call all over the country to get the current prices of more than 80,000 items. Using a complicated weighting scheme that takes into account the location and probability of purchase for each item, analysts then combine these prices into an overall index score using a series of formulas and rules.
Another example of an index is the Duncan Socioeconomic Index (SEI). This index is used to quantify a person's socioeconomic status (SES) and is a combination of three concepts: income, education, and occupation. Income is measured in dollars, education in years or degrees achieved, and occupation is classified into categories or levels by status. These very different measures are combined to create an overall SES index score. However, SES index measurement has generated a lot of controversy and disagreement among researchers.
The process of creating an index is similar to that of a scale. First, conceptualize (define) the index and its constituent components. Though this appears simple, there may be a lot of disagreement on what components (concepts/constructs) should be included or excluded from an index. For instance, in the SES index, isn’t income correlated with education and occupation? And if so, should we include one component only or all three components? Reviewing the literature, using theories, and/or interviewing experts or key stakeholders may help resolve this issue. Second, operationalize and measure each component. For instance, how will you categorize occupations, particularly since some occupations may have changed with time (e.g., there were no Web developers before the Internet)? As we will see in step three below, researchers must create a rule or formula for calculating the index score. Again, this process may involve a lot of subjectivity, so validating the index score using existing or new data is important.
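To illustrate the logic of these steps, here is a minimal sketch in Python of an SES-style index; the components, rescaling caps, and weights are entirely hypothetical and are not the actual Duncan SEI (or CPI) formulas.

```python
# Hypothetical SES-style index combining three very different components.
def ses_index(income_dollars, education_years, occupation_status):
    """Return a 0-100 index score; all caps and weights below are illustrative only."""
    # Step 2: operationalize and rescale each component to a 0-1 range.
    income = min(income_dollars / 150_000, 1.0)
    education = min(education_years / 20, 1.0)
    occupation = occupation_status / 5          # status coded 0 (lowest) to 5 (highest)

    # Step 3: the rule/formula an index developer would have to justify and validate.
    weights = (0.4, 0.3, 0.3)
    return 100 * sum(w * c for w, c in zip(weights, (income, education, occupation)))

print(round(ses_index(income_dollars=52_000, education_years=16, occupation_status=3), 1))
```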
Scale and index development are often taught in their own courses in doctoral education, so it is unreasonable to expect that you can develop a consistently accurate measure within the span of a week or two. Using available indices and scales is recommended for this reason.
Differences between scales and indices
Though indices and scales yield a single numerical score or value representing a concept of interest, they are different in many ways. First, indices often comprise components that are very different from each other (e.g., income, education, and occupation in the SES index) and are measured in different ways. Conversely, scales typically involve a set of similar items that use the same rating scale (such as a five-point Likert scale about customer satisfaction).
Second, indices often combine objectively measurable values such as prices or income, while scales are designed to assess subjective or judgmental constructs such as attitude, prejudice, or self-esteem. Some argue that the sophistication of the scaling methodology makes scales different from indexes, while others suggest that indexing methodology can be equally sophisticated. Nevertheless, indexes and scales are both essential tools in social science research.
Scales and indices seem like clean, convenient ways to measure different phenomena in social science, but just like with a lot of research, we have to be mindful of the assumptions and biases underneath. What if a scale or an index was developed using only White women as research participants? Is it going to be useful for other groups? It very well might be, but when using a scale or index on a group for whom it hasn't been tested, it will be very important to evaluate the validity and reliability of the instrument, which we address in the rest of the chapter.
Finally, it's important to note that while scales and indices are often made up of nominal or ordinal items, when we combine them into composite scores, we typically treat those scores as interval/ratio variables.
Exercises
- Look back at your work from the previous section. Are your variables unidimensional or multidimensional?
- Describe the specific measures you will use (actual questions and response options you will use with participants) for each variable in your research question.
- If you are using a measure developed by another researcher but do not have all of the questions, response options, and instructions needed to implement it, put it on your to-do list to get them.
Step 3: How you will interpret your measures
The final stage of operationalization involves setting the rules for how the measure works and how the researcher should interpret the results. Sometimes, interpreting a measure can be incredibly easy. If you ask someone their age, you’ll probably interpret the results by noting the raw number (e.g., 22) someone provides and whether it is lower or higher than other people's ages. However, you could also recode that person into age categories (e.g., under 25, 20-29 years old, generation Z, etc.). Even scales may be simple to interpret. If there is a scale of problem behaviors, one might simply add up the number of behaviors checked off, with a total of 1-5 indicating low risk of delinquent behavior, 6-10 indicating moderate risk, and so on. How you choose to interpret your measures should be guided by how they were designed, how you conceptualize your variables, the data sources you used, and your plan for analyzing your data statistically. Whatever measure you use, you need a set of rules for how to take any valid answer a respondent provides to your measure and interpret it in terms of the variable being measured.
For more complicated measures like scales, refer to the information provided by the author for how to interpret the scale. If you can’t find enough information from the scale’s creator, look at how the results of that scale are reported in the results section of research articles. For example, Beck’s Depression Inventory (BDI-II) uses 21 statements to measure depression and respondents rate their level of agreement on a scale of 0-3. The results for each question are added up, and the respondent is put into one of three categories: low levels of depression (1-16), moderate levels of depression (17-30), or severe levels of depression (31 and over).
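Here is a minimal sketch in Python of this kind of interpretation rule, using hypothetical item responses and the score ranges stated in the paragraph above; it is an illustration of the logic, not the official BDI-II scoring manual.

```python
# Hypothetical responses to 21 items, each rated 0-3.
item_responses = [1, 0, 2, 1, 1, 0, 0, 2, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1]

total = sum(item_responses)

# Interpretation rule: add the items, then place the respondent in a category.
if total <= 16:
    category = "low levels of depression"
elif total <= 30:
    category = "moderate levels of depression"
else:
    category = "severe levels of depression"

print(f"Total score: {total} -> {category}")
```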
One common mistake I see often is that students will introduce another variable into their operational definition. This is incorrect. Your operational definition should mention only one variable—the variable being defined. While your study will certainly draw conclusions about the relationships between variables, that's not what operationalization is. Operationalization specifies what instrument you will use to measure your variable and how you plan to interpret the data collected using that measure.
Operationalization is probably the trickiest component of basic research methods, so please don’t get frustrated if it takes a few drafts and a lot of feedback to get to a workable definition. At the time of this writing, I am in the process of operationalizing the concept of “attitudes towards research methods.” Originally, I thought that I could gauge students’ attitudes toward research methods by looking at their end-of-semester course evaluations. As I became aware of the potential methodological issues with student course evaluations, I opted to use focus groups of students to measure their common beliefs about research. You may recall some of these opinions from Chapter 1, such as the common beliefs that research is boring, useless, and too difficult. After the focus group, I created a scale based on the opinions I gathered, and I plan to pilot test it with another group of students. After the pilot test, I expect that I will have to revise the scale again before I can implement the measure in a real social work research project. At the time I’m writing this, I’m still not completely done operationalizing this concept.
Key Takeaways
- Operationalization involves spelling out precisely how a concept will be measured.
- Operational definitions must include the variable, the measure, and how you plan to interpret the measure.
- There are four different levels of measurement: nominal, ordinal, interval, and ratio (in increasing order of specificity).
- Scales and indices are common ways to collect information and involve using multiple indicators in measurement.
- A key difference between a scale and an index is that a scale contains multiple indicators for one concept, whereas an index examines multiple concepts (components).
- Using scales developed and refined by other researchers can improve the rigor of a quantitative study.
Exercises
Use the research question that you developed in the previous chapters and find a related scale or index that researchers have used. If you have trouble finding the exact phenomenon you want to study, get as close as you can.
- What is the level of measurement for each item on each tool? Take a second and think about why the tool's creator decided to include these levels of measurement. Identify any levels of measurement you would change and why.
- If these tools don't exist for what you are interested in studying, why do you think that is?
12.3 Writing effective questions and questionnaires
Learning Objectives
Learners will be able to...
- Describe some of the ways that survey questions might confuse respondents and how to word questions and responses clearly
- Create mutually exclusive, exhaustive, and balanced response options
- Define fence-sitting and floating
- Describe the considerations involved in constructing a well-designed questionnaire
- Discuss why pilot testing is important
In the previous section, we reviewed how researchers collect data using surveys. Guided by their sampling approach and research context, researchers should choose the survey approach that provides the most favorable tradeoffs in strengths and challenges. With this information in hand, researchers need to write their questionnaire and revise it before beginning data collection. Each method of delivery requires a questionnaire, but they vary a bit based on how they will be used by the researcher. Since phone surveys are read aloud, researchers will pay more attention to how the questionnaire sounds than how it looks. Online surveys can use advanced tools to require the completion of certain questions, present interactive questions and answers, and otherwise afford greater flexibility in how questionnaires are designed. As you read this section, consider how your method of delivery impacts the type of questionnaire you will design. Because most student projects use paper or online surveys, this section will detail how to construct self-administered questionnaires to minimize the potential for bias and error.
Start with operationalization
The first thing you need to do to write effective survey questions is identify what exactly you wish to know. As silly as it sounds to state what seems so completely obvious, we can’t stress enough how easy it is to forget to include important questions when designing a survey. Begin by looking at your research question and refreshing your memory of the operational definitions you developed for those variables from Chapter 11. You should have a pretty firm grasp of your operational definitions before starting the process of questionnaire design. You may have taken those operational definitions from other researchers' methods, found established scales and indices for your measures, or created your own questions and answer options.
Exercises
STOP! Make sure you have a complete operational definition for the dependent and independent variables in your research question. A complete operational definition contains the variable being measured, the measure used, and how the researcher interprets the measure. Let's make sure you have what you need from Chapter 11 to begin writing your questionnaire.
List all of the dependent and independent variables in your research question.
- It's normal to have one dependent or independent variable. It's also normal to have more than one of either.
- Make sure that your research question (and this list) contain all of the variables in your hypothesis. Your hypothesis should only include variables from your research question.
For each variable in your list:
- Write out the measure you will use (the specific questions and answers) for each variable.
- If you don't have questions and answers finalized yet, write a first draft and revise it based on what you read in this section.
- If you are using a measure from another researcher, you should be able to write out all of the questions and answers associated with that measure. If you only have the name of a scale or a few questions, you need access to the full text and some documentation on how to administer and interpret it before you can finish your questionnaire.
- Describe how you will use each measure to draw conclusions about the variable in the operational definition.
- For example, an interpretation might be "there are five 7-point Likert scale questions...point values are added across all five items for each participant...and scores below 10 indicate the participant has low self-esteem"
- Don't introduce other variables into the mix here. All we are concerned with is how you will measure each variable by itself. The connection between variables is done using statistical tests, not operational definitions.
- Detail any validity or reliability issues uncovered by previous researchers using the same measures. If you have concerns about validity and reliability, note them, as well.
If you completed the exercise above and listed out all of the questions and answer choices you will use to measure the variables in your research question, you have already produced a pretty solid first draft of your questionnaire! Congrats! In essence, questionnaires are all of the self-report measures in your operational definitions for the independent, dependent, and control variables in your study arranged into one document and administered to participants. There are a few questions on a questionnaire (like name or ID#) that are not associated with the measurement of variables. These are the exception, and it's useful to think of a questionnaire as a list of measures for variables. Of course, researchers often use more than one measure of a variable (i.e., triangulation) so they can more confidently assert that their findings are true. A questionnaire should contain all of the measures researchers plan to collect about their variables by asking participants to self-report. As we will discuss in the final section of this chapter, triangulating across data sources (e.g., measuring variables using client files or student records) can avoid some of the common sources of bias in survey research.
Sticking close to your operational definitions is important because it helps you avoid an everything-but-the-kitchen-sink approach that includes every possible question that occurs to you. Doing so puts an unnecessary burden on your survey respondents. Remember that you have asked your participants to give you their time and attention and to take care in responding to your questions; show them your respect by only asking questions that you actually plan to use in your analysis. For each question in your questionnaire, ask yourself how this question measures a variable in your study. An operational definition should contain the questions, response options, and how the researcher will draw conclusions about the variable based on participants' responses.
Writing questions
So, almost all of the questions on a questionnaire are measuring some variable. For many variables, researchers will create their own questions rather than using one from another researcher. This section will provide some tips on how to create good questions to accurately measure variables in your study. First, questions should be as clear and to the point as possible. This is not the time to show off your creative writing skills; a survey is a technical instrument and should be written in a way that is as direct and concise as possible. As I’ve mentioned earlier, your survey respondents have agreed to give their time and attention to your survey. The best way to show your appreciation for their time is to not waste it. Ensuring that your questions are clear and concise will go a long way toward showing your respondents the gratitude they deserve. Pilot testing the questionnaire with friends or colleagues can help identify these issues. This process is commonly called pretesting, but to avoid any confusion with pretesting in experimental design, we refer to it as pilot testing.
Related to the point about not wasting respondents’ time, make sure that every question you pose will be relevant to every person you ask to complete it. This means two things: first, that respondents have knowledge about whatever topic you are asking them about, and second, that respondents have experienced the events, behaviors, or feelings you are asking them to report. If you are asking participants for second-hand knowledge—asking clinicians about clients' feelings, asking teachers about students' feelings, and so forth—you may want to clarify that the variable you are asking about is the key informant's perception of what is happening in the target population. A well-planned sampling approach ensures that participants are the most knowledgeable population to complete your survey.
If you decide that you do wish to include questions about matters with which only a portion of respondents will have had experience, make sure you know why you are doing so. For example, if you are asking about MSW student study patterns, and you decide to include a question on studying for the social work licensing exam, you may only have a small subset of participants who have begun studying for the graduate exam or took the bachelor's-level exam. If you decide to include this question that speaks to a minority of participants' experiences, think about why you are including it. Are you interested in how studying for class and studying for licensure differ? Are you trying to triangulate study skills measures? Researchers should carefully consider whether questions relevant to only a subset of participants are likely to produce enough valid responses for quantitative analysis.
Many times, questions that are relevant to a subsample of participants are conditional on an answer to a previous question. A participant might select that they rent their home, and as a result, you might ask whether they carry renter's insurance. That question is not relevant to homeowners, so it would be wise not to ask them to respond to it. In that case, the question of whether someone rents or owns their home is a filter question, designed to identify some subset of survey respondents who are asked additional questions that are not relevant to the entire sample. Figure 12.1 presents an example of how to accomplish this on a paper survey by adding instructions to the participant that indicate what question to proceed to next based on their response to the first one. Using online survey tools, researchers can use filter questions to only present relevant questions to participants.
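As an illustration of the skip logic behind a filter question, here is a minimal sketch in Python; the questions, variable names, and scripted answers are hypothetical.

```python
# Hypothetical skip logic: only renters see the renter's insurance follow-up.
def housing_questions(ask):
    tenure = ask("Do you rent or own your home? (rent/own)")
    if tenure == "rent":
        return {"tenure": tenure,
                "renters_insurance": ask("Do you carry renter's insurance? (yes/no)")}
    return {"tenure": tenure}   # homeowners skip the follow-up entirely

# Simulate one respondent's scripted answers to check the flow.
scripted = iter(["rent", "yes"])
print(housing_questions(lambda prompt: next(scripted)))
```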
To minimize confusion, researchers should eliminate questions that ask about things participants don't know. Assuming the question is relevant to the participant, other sources of confusion come from how the question is worded. The use of negative wording can be a source of potential confusion. Taking the question from Figure 12.1 about drinking as our example, what if we had instead asked, “Did you not abstain from drinking during your first semester of college?” This is a double negative, and it's not clear how to answer the question accurately. It is a good idea to avoid negative phrasing when possible. For example, "Did you not drink alcohol during your first semester of college?" is less clear than "Did you drink alcohol during your first semester of college?"
You should also avoid using terms or phrases that may be regionally or culturally specific (unless you are absolutely certain all your respondents come from the region or culture whose terms you are using). When I first moved to southwest Virginia, I didn’t know what a holler was. Where I grew up in New Jersey, to holler means to yell. Even then, in New Jersey, we shouted and screamed, but we didn’t holler much. In southwest Virginia, my home at the time, a holler also means a small valley in between the mountains. If I used holler in that way on my survey, people who live near me may understand, but almost everyone else would be totally confused. A similar issue arises when you use jargon, or technical language, that people do not commonly know. For example, if you asked adolescents how they experience imaginary audience, they would find it difficult to link those words to the concepts from David Elkind’s theory. The words you use in your questions must be understandable to your participants. If you find yourself using jargon or slang, break it down into terms that are more universal and easier to understand.
Asking multiple questions as though they are a single question can also confuse survey respondents. There’s a specific term for this sort of question; it is called a double-barreled question. Figure 12.2 shows a double-barreled question. Do you see what makes the question double-barreled? How would someone respond if they felt their college classes were more demanding but also more boring than their high school classes? Or less demanding but more interesting? Because the question combines “demanding” and “interesting,” there is no way to respond yes to one criterion but no to the other.
Another thing to avoid when constructing survey questions is the problem of social desirability. We all want to look good, right? And we all probably know the politically correct response to a variety of questions whether we agree with the politically correct response or not. In survey research, social desirability refers to the idea that respondents will try to answer questions in a way that will present them in a favorable light. (You may recall we covered social desirability bias in Chapter 11.)
Perhaps we decide that, to understand the transition to college, our research project needs to know whether respondents ever cheated on an exam in high school or college. We all know that cheating on exams is generally frowned upon (at least I hope we all know this). So, it may be difficult to get people to admit to cheating on a survey. But if you can guarantee respondents' confidentiality, or even better, their anonymity, chances are much better that they will be honest about having engaged in this socially undesirable behavior. Another way to avoid problems of social desirability is to try to phrase difficult questions in the most benign way possible. Earl Babbie (2010) [69] offers a useful suggestion for helping you do this—simply imagine how you would feel responding to your survey questions. If you would be uncomfortable, chances are others would as well.
Exercises
Try to step outside your role as researcher for a second, and imagine you were one of your participants. Evaluate the following:
- Is the question too general? Sometimes, questions that are too general may not accurately convey respondents' perceptions. If you asked someone how well they liked a certain book and provided a response scale ranging from "not at all" to "extremely well," and that person selected "extremely well," what do they mean? Instead, ask more specific behavioral questions, such as "Will you recommend this book to others?" or "Do you plan to read other books by the same author?"
- Is the question too detailed? Avoid unnecessarily detailed questions that serve no specific research purpose. For instance, do you need the age of each child in a household, or is the number of children in the household enough? However, if you are unsure, it is better to err on the side of detail rather than generality.
- Is the question presumptuous? Does your question make assumptions? For instance, if you ask, "what do you think the benefits of a tax cut would be?" you are presuming that the participant sees the tax cut as beneficial. But many people may not view tax cuts as beneficial. Some might see tax cuts as a precursor to less funding for public schools and fewer public services such as police, ambulance, and fire department. Avoid questions with built-in presumptions.
- Does the question ask the participant to imagine something? A popular question on many television game shows is "If you won a million dollars on this show, how would you plan to spend it?" Most participants have never been faced with this large amount of money and have never thought about this scenario. In fact, most don't even know that after taxes, the value of the million dollars will be greatly reduced. In addition, some game shows spread the amount over a 20-year period. Without understanding this "imaginary" situation, participants may not have the background information necessary to provide a meaningful response.
Finally, it is important to get feedback on your survey questions from as many people as possible, especially people who are like those in your sample. Now is not the time to be shy. Ask your friends for help, ask your mentors for feedback, ask your family to take a look at your survey as well. The more feedback you can get on your survey questions, the better the chances that you will come up with a set of questions that are understandable to a wide variety of people and, most importantly, to those in your sample.
In sum, in order to pose effective survey questions, researchers should do the following:
- Identify how each question measures an independent, dependent, or control variable in their study.
- Keep questions clear and succinct.
- Make sure respondents have relevant lived experience to provide informed answers to your questions.
- Use filter questions to avoid getting answers from uninformed participants.
- Avoid questions that are likely to confuse respondents—including those that use double negatives, use culturally specific terms or jargon, and pose more than one question at a time.
- Imagine how respondents would feel responding to questions.
- Get feedback, especially from people who resemble those in the researcher’s sample.
Exercises
Let's complete a first draft of your questions. In the previous exercise, you listed all of the questions and answers you will use to measure the variables in your research question.
- In the previous exercise, you wrote out the questions and answers for each measure of your independent and dependent variables. Evaluate each question using the criteria listed above on effective survey questions.
- Type out questions for your control variables and evaluate them, as well. Consider what response options you want to offer participants.
Now, let's revise any questions that do not meet your standards!
- Use the BRUSO model in Table 12.2 for an illustration of how to address deficits in question wording. Keep in mind that you are writing a first draft in this exercise, and it will take a few drafts and revisions before your questions are ready to distribute to participants.
| Criterion | Poor | Effective |
| --- | --- | --- |
| B- Brief | "Are you now or have you ever been the possessor of a firearm?" | "Have you ever possessed a firearm?" |
| R- Relevant | "Who did you vote for in the last election?" | (Note: only include items that are relevant to your study.) |
| U- Unambiguous | "Are you a gun person?" | "Do you currently own a gun?" |
| S- Specific | "How much have you read about the new gun control measure and sales tax?" | "How much have you read about the new sales tax on firearm purchases?" |
| O- Objective | "How much do you support the beneficial new gun control measure?" | "What is your view of the new gun control measure?" |
Writing response options
While posing clear and understandable questions in your survey is certainly important, so too is providing respondents with unambiguous response options. Response options are the answers that you provide to the people completing your questionnaire. Generally, respondents will be asked to choose a single (or best) response to each question you pose. We call questions in which the researcher provides all of the response options closed-ended questions. Keep in mind, closed-ended questions can also instruct respondents to choose multiple response options, rank response options against one another, or assign a percentage to each response option. But be cautious when experimenting with different response options! Accepting multiple responses to a single question may add complexity when it comes to quantitatively analyzing and interpreting your data.
Surveys need not be limited to closed-ended questions. Sometimes survey researchers include open-ended questions in their survey instruments as a way to gather additional details from respondents. An open-ended question does not include response options; instead, respondents are asked to reply to the question in their own way, using their own words. These questions are generally used to find out more about a survey participant's experiences or feelings about whatever they are being asked to report in the survey. If, for example, a survey includes closed-ended questions asking respondents to report on their involvement in extracurricular activities during college, an open-ended question could ask respondents why they participated in those activities or what they gained from their participation. While responses to such questions may also be captured using a closed-ended format, allowing participants to share some of their responses in their own words can make the experience of completing the survey more satisfying to respondents and can also reveal new motivations or explanations that had not occurred to the researcher. This is particularly important for mixed-methods research. It is possible to analyze open-ended responses quantitatively using content analysis (i.e., counting how often a theme is represented in a transcript and looking for statistical patterns). However, for most researchers, qualitative data analysis will be needed to analyze open-ended questions, and researchers need to think through how they will analyze any open-ended questions as part of their data analysis plan. We will address qualitative data analysis in greater detail in Chapter 19.
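To illustrate the quantitative route mentioned above, here is a minimal sketch of counting how often hypothetical theme keywords appear across open-ended responses. The responses and keywords are invented, and real content analysis relies on a coding scheme applied by trained coders rather than simple keyword matching.

```python
# Minimal sketch: counting how often invented theme keywords appear across
# open-ended survey responses. Real content analysis would apply a coding
# scheme with trained coders, not simple keyword matching.

responses = [
    "I joined clubs to make friends and build leadership skills.",
    "Mostly for my resume, but I also made friends.",
    "Leadership experience was the main reason I participated.",
]

themes = {"friendship": ["friend"], "leadership": ["leader"]}

counts = {theme: 0 for theme in themes}
for text in responses:
    lowered = text.lower()
    for theme, keywords in themes.items():
        if any(keyword in lowered for keyword in keywords):
            counts[theme] += 1

print(counts)  # {'friendship': 2, 'leadership': 2}
```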
To keep things simple, we encourage you to use only closed-ended response options in your study. While open-ended questions are not wrong, they are often a sign in our classrooms that students have not fully thought through how to operationally define and measure their key variables. Open-ended questions cannot be operationally defined because you don't know what responses you will get. Instead, you will need to analyze the qualitative data using one of the techniques we discuss in Chapter 19 to interpret your participants' responses.
To write effective response options for closed-ended questions, there are a couple of guidelines worth following. First, be sure that your response options are mutually exclusive. Look back at Figure 12.1, which contains questions about how often and how many drinks respondents consumed. Do you notice that there are no overlapping categories in the response options for these questions? This is another one of those points about question construction that seems fairly obvious but that can be easily overlooked. Response options should also be exhaustive. In other words, every possible response should be covered in the set of response options that you provide. For example, note that in question 10a in Figure 12.1, we have covered all possibilities—those who drank, say, an average of once per month can choose the first response option ("less than one time per week") while those who drank multiple times a day each day of the week can choose the last response option ("7+"). All the possibilities in between these two extremes are covered by the middle three response options, and every respondent fits into one of the response options we provided.
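One way to check that a set of response options is mutually exclusive and exhaustive is to confirm that any possible raw value maps to exactly one option. Here is a minimal sketch using the first and last options described above; the three middle categories are invented for illustration.

```python
# Minimal sketch: every possible drinking-frequency value maps to exactly
# one response option. The first and last options come from Figure 12.1;
# the three middle categories are invented for illustration.

def frequency_category(times_per_week):
    if times_per_week < 1:
        return "less than one time per week"
    elif times_per_week <= 2:
        return "1-2 times per week"
    elif times_per_week <= 4:
        return "3-4 times per week"
    elif times_per_week <= 6:
        return "5-6 times per week"
    else:
        return "7+"

# Exhaustive: any value lands somewhere. Mutually exclusive: only one branch runs.
for value in [0.25, 2, 5, 30]:
    print(value, "->", frequency_category(value))
```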
Earlier in this section, we discussed double-barreled questions. Response options can also be double barreled, and this should be avoided. Figure 12.3 is an example of a question that uses double-barreled response options. Other tips about questions are also relevant to response options, including that participants should be knowledgeable enough to select or decline a response option as well as avoiding jargon and cultural idioms.
Even if you phrase questions and response options clearly, participants are influenced by how many response options are presented on the questionnaire. For Likert scales, five or seven response options generally allow about as much precision as respondents are capable of. However, numerical scales with more options can sometimes be appropriate. For dimensions such as attractiveness, pain, and likelihood, a 0-to-10 scale will be familiar to many respondents and easy for them to use. Regardless of the number of response options, the most extreme ones should generally be “balanced” around a neutral or modal midpoint. An example of an unbalanced rating scale measuring perceived likelihood might look like this:
Unlikely | Somewhat Likely | Likely | Very Likely | Extremely Likely
Because we have four rankings of likely and only one ranking of unlikely, the scale is unbalanced and most responses will be biased toward "likely" rather than "unlikely." A balanced version might look like this:
Extremely Unlikely | Somewhat Unlikely | As Likely as Not | Somewhat Likely | Extremely Likely
In this example, the midpoint is halfway between likely and unlikely. Of course, a middle or neutral response option does not have to be included. Researchers sometimes choose to leave it out because they want to encourage respondents to think more deeply about their response and not simply choose the middle option by default. Fence-sitters are respondents who choose neutral response options, even if they have an opinion. Some people will be drawn to respond "no opinion" even if they have an opinion, particularly if their true opinion is not a socially desirable one. Floaters, on the other hand, are those who choose a substantive answer to a question when, really, they don't understand the question or don't have an opinion.
As you can see, floating is the flip side of fence-sitting. Thus, the solution to one problem is often the cause of the other. How you decide which approach to take depends on the goals of your research. Sometimes researchers specifically want to learn something about people who claim to have no opinion. In this case, allowing for fence-sitting would be necessary. Other times researchers feel confident their respondents will all be familiar with every topic in their survey. In this case, perhaps it is okay to force respondents to choose one side or another (e.g., agree or disagree) without a middle option (e.g., neither agree nor disagree) or to not include an option like "don't know enough to say" or "not applicable." There is no always-correct solution to either problem. But in general, including a middle option in a response set provides a more exhaustive set of response options than one that excludes it.
The most important check before you finalize your response options is to align them with your operational definitions. As we've discussed before, your operational definitions include your measures (questions and response options) as well as how to interpret those measures in terms of the variable being measured. In particular, you should be able to interpret all response options to a question based on your operational definition of the variable it measures. If you wanted to measure the variable "social class," you might ask one question about a participant's annual income and another about family size. Your operational definition would need to provide clear instructions on how to interpret response options. Your operational definition is basically like this social class calculator from Pew Research, though they include a few more questions in their definition.
To drill down a bit more, as Pew specifies in the section titled "how the income calculator works," the interval/ratio data respondents enter are interpreted using a formula that combines a participant's four responses to the questions posed by Pew and categorizes their household into one of three categories—upper, middle, or lower class. So, the operational definition includes the four questions comprising the measure and the formula, or interpretation, that converts responses into the three final categories we are familiar with: lower, middle, and upper class.
It is interesting to note that even though the calculator's final output (lower, middle, or upper class) is at the ordinal level of measurement, Pew asks four questions that use an interval or ratio level of measurement (depending on the question). This means that respondents provide numerical responses, rather than choosing categories like lower, middle, and upper class. It's perfectly normal for operational definitions to change levels of measurement, and it's also perfectly normal for the level of measurement to stay the same. The important thing is that each response option a participant can provide is accounted for by the operational definition. Throw any combination of family size, location, or income at the Pew calculator, and it will sort you into one of those three social class categories.
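To make this concrete, here is a deliberately simplified sketch of what a calculator-style operational definition might look like in code. The income thresholds and the family-size adjustment are invented for illustration and are not Pew's actual formula.

```python
# Hypothetical sketch of an operational definition for "social class."
# The thresholds and family-size adjustment below are invented for
# illustration; Pew's calculator uses its own formula and data.

def social_class(household_income, family_size):
    """Convert interval/ratio inputs into an ordinal class category."""
    # Adjust income for family size so larger households need more income
    # to reach the same category (a simplified equivalence adjustment).
    adjusted = household_income / (family_size ** 0.5)

    if adjusted < 30000:
        return "lower class"
    elif adjusted < 90000:
        return "middle class"
    else:
        return "upper class"

print(social_class(45000, 3))   # lower class
print(social_class(120000, 2))  # middle class
```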
Unlike Pew's definition, the operational definitions in your study may not need their own webpage to define and describe. For many questions and answers, interpreting response options is easy. If you were measuring "income" instead of "social class," you could simply operationalize the term by asking people to list their total household income before taxes are taken out. Higher values indicate higher income, and lower values indicate lower income. Easy. Regardless of whether your operational definitions are simple or more complex, every response option to every question on your survey (with a few exceptions) should be interpretable using an operational definition of a variable. Just like we want to avoid an everything-but-the-kitchen-sink approach to questions on our questionnaire, you want to make sure your final questionnaire only contains response options that you will use in your study.
One note of caution on interpretation (sorry for repeating this). We want to remind you again that an operational definition should not mention more than one variable. In our example above, your operational definition could not say "a family of three making under $50,000 is lower class; therefore, they are more likely to experience food insecurity." That last clause about food insecurity may well be true, but it's not a part of the operational definition for social class. Each variable (food insecurity and class) should have its own operational definition. If you are talking about how to interpret the relationship between two variables, you are talking about your data analysis plan. We will discuss how to create your data analysis plan beginning in Chapter 14. For now, one consideration is that depending on the statistical test you use to test relationships between variables, you may need nominal, ordinal, or interval/ratio data. Your questions and response options should provide the level of measurement required by the specific statistical tests in your data analysis plan. Once you finalize your data analysis plan, return to your questionnaire to make sure the level of measurement matches the statistical test you've chosen.
In summary, to write effective response options researchers should do the following:
- Avoid wording that is likely to confuse respondents—including double negatives, culturally specific terms or jargon, and double-barreled response options.
- Ensure response options are relevant to participants' knowledge and experience so they can make an informed and accurate choice.
- Present mutually exclusive and exhaustive response options.
- Consider fence-sitters and floaters, and the use of neutral or "not applicable" response options.
- Define how response options are interpreted as part of an operational definition of a variable.
- Check that the level of measurement matches your operational definitions and the statistical tests in your data analysis plan (once you develop one).
Exercises
Look back at the response options you drafted in the previous exercise. Make sure you have a first draft of response options for each closed-ended question on your questionnaire.
- Using the criteria above, evaluate the wording of the response options for each question on your questionnaire.
- Revise your questions and response options until you have a complete first draft.
- Do your first read-through and provide a dummy answer to each question. Make sure you can link each response option and each question to an operational definition.
- Look ahead to Chapter 14 and consider how each item on your questionnaire will inform your data analysis plan.
From this discussion, we hope it is clear why researchers using quantitative methods spell out all of their plans ahead of time. Ultimately, there should be a straight line from operational definition through measures on your questionnaire to the data analysis plan. If your questionnaire includes response options that are not aligned with operational definitions or not included in the data analysis plan, the responses you receive back from participants won't fit with your conceptualization of the key variables in your study. If you do not fix these errors and proceed with collecting unstructured data, you will lose out on many of the benefits of survey research and face overwhelming challenges in answering your research question.
Designing questionnaires
Based on your work in the previous section, you should have a first draft of the questions and response options for the key variables in your study. Now, you'll also need to think about how to present your written questions and response options to survey respondents. It's time to write a final draft of your questionnaire and make it look nice. Designing questionnaires takes some thought. First, consider the route of administration for your survey. What we cover in this section will apply equally to paper and online surveys, but if you are planning to use online survey software, you should watch tutorial videos and explore the features of the survey software you will use.
Informed consent & instructions
Writing effective items is only one part of constructing a survey. For one thing, every survey should have a written or spoken introduction that serves two basic functions (Peterson, 2000).[70] One is to encourage respondents to participate in the survey. In many types of research, such encouragement is not necessary either because participants do not know they are in a study (as in naturalistic observation) or because they are part of a subject pool and have already shown their willingness to participate by signing up and showing up for the study. Survey research usually catches respondents by surprise when they answer their phone, go to their mailbox, or check their e-mail—and the researcher must make a good case for why they should agree to participate. Thus, the introduction should briefly explain the purpose of the survey and its importance, provide information about the sponsor of the survey (university-based surveys tend to generate higher response rates), acknowledge the importance of the respondent’s participation, and describe any incentives for participating.
The second function of the introduction is to establish informed consent. Remember that this involves describing to respondents everything that might affect their decision to participate. This includes the topics covered by the survey, the amount of time it is likely to take, the respondent's option to withdraw at any time, confidentiality issues, and other ethical considerations we covered in Chapter 6. Written consent forms are not always used in survey research; when the research is of minimal risk, the IRB often accepts completion of the survey instrument as evidence of consent to participate. For this reason, it is important that this part of the introduction be well documented and presented clearly and in its entirety to every respondent.
Organizing items to be easy and intuitive to follow
The introduction should be followed by the substantive questionnaire items. But first, it is important to present clear instructions for completing the questionnaire, including examples of how to use any unusual response scales. Remember that the introduction is the point at which respondents are usually most interested and least fatigued, so it is good practice to start with the most important items for purposes of the research and proceed to less important items. Items should also be grouped by topic or by type. For example, items using the same rating scale (e.g., a 5-point agreement scale) should be grouped together if possible to make things faster and easier for respondents. Demographic items are often presented last because they are least interesting to participants but also easy to answer in the event respondents have become tired or bored. Of course, any survey should end with an expression of appreciation to the respondent.
Questions are often organized thematically. If our survey were measuring social class, perhaps we’d have a few questions asking about employment, others focused on education, and still others on housing and community resources. Those may be the themes around which we organize our questions. Or perhaps it would make more sense to present any questions we had about parents' income and then present a series of questions about estimated future income. Grouping by theme is one way to be deliberate about how you present your questions. Keep in mind that you are surveying people, and these people will be trying to follow the logic in your questionnaire. Jumping from topic to topic can give people a bit of whiplash and may make participants less likely to complete it.
Using a matrix is a nice way of streamlining response options for similar questions. A matrix is a question type that lists a set of questions for which the answer categories are all the same. If you have a set of questions for which the response options are the same, it may make sense to create a matrix rather than posing each question and its response options individually. Not only will this save you some space in your survey, but it will also help respondents progress through your survey more easily. A sample matrix can be seen in Figure 12.4.
Once you have grouped similar questions together, you'll need to think about the order in which to present those question groups. Most survey researchers agree that it is best to begin a survey with questions that will make respondents want to continue (Babbie, 2010; Dillman, 2000; Neuman, 2003).[71] In other words, don't bore respondents, but don't scare them away either. There's some disagreement over where on a survey to place demographic questions, such as those about a person's age, gender, and race. On the one hand, placing them at the beginning of the questionnaire may lead respondents to think the survey is boring, unimportant, and not something they want to bother completing. On the other hand, if your survey deals with some very sensitive topic, such as child sexual abuse or criminal convictions, you don't want to scare respondents away or shock them by beginning with your most intrusive questions.
Your participants are human. They will react emotionally to questionnaire items, and they will also try to uncover your research questions and hypotheses. In truth, the order in which you present questions on a survey is best determined by the unique characteristics of your research. When feasible, you should consult with key informants from your target population to determine how best to order your questions. If it is not feasible to do so, think about the unique characteristics of your topic, your questions, and most importantly, your sample. Keeping in mind the characteristics and needs of the people you will ask to complete your survey should help guide you as you determine the most appropriate order in which to present your questions. None of your decisions will be perfect, and all studies have limitations.
Questionnaire length
You’ll also need to consider the time it will take respondents to complete your questionnaire. Surveys vary in length, from just a page or two to a dozen or more pages, which means they also vary in the time it takes to complete them. How long to make your survey depends on several factors. First, what is it that you wish to know? Wanting to understand how grades vary by gender and year in school certainly requires fewer questions than wanting to know how people’s experiences in college are shaped by demographic characteristics, college attended, housing situation, family background, college major, friendship networks, and extracurricular activities. Keep in mind that even if your research question requires a sizable number of questions be included in your questionnaire, do your best to keep the questionnaire as brief as possible. Any hint that you’ve thrown in a bunch of useless questions just for the sake of it will turn off respondents and may make them not want to complete your survey.
Second, and perhaps more important, how long are respondents likely to be willing to spend completing your questionnaire? If you are studying college students, asking them to use their limited free time to complete your survey may mean they won't want to spend more than a few minutes on it. But if you ask them to complete your survey during downtime between classes, when there is little work to be done, students may be willing to give you a bit more of their time. Think about places and times that your sampling frame naturally gathers and whether you would be able to either recruit participants or distribute a survey in that context. Estimate how long your participants would reasonably have to complete a survey presented to them during this time. The more you know about your population (such as what weeks have less work and more free time), the better you can target questionnaire length.
The time that survey researchers ask respondents to spend on questionnaires varies greatly. Some researchers advise that surveys should not take longer than about 15 minutes to complete (as cited in Babbie 2010),[72] whereas others suggest that up to 20 minutes is acceptable (Hopper, 2010).[73] As with question order, there is no clear-cut, always-correct answer about questionnaire length. The unique characteristics of your study and your sample should be considered to determine how long to make your questionnaire. For example, if you planned to distribute your questionnaire to students in between classes, you will need to make sure it is short enough to complete before the next class begins.
When designing a questionnaire, a researcher should consider:
- Weighing strengths and limitations of the method of delivery, including the advanced tools in online survey software or the simplicity of paper questionnaires.
- Grouping together items that ask about the same thing.
- Moving any questions about sensitive items to the end of the questionnaire, so as not to scare respondents off.
- Moving any questions that engage the respondent to answer the questionnaire at the beginning, so as not to bore them.
- Timing the length of the questionnaire with a reasonable length of time you can ask of your participants.
- Dedicating time to visual design and ensuring the questionnaire looks professional.
Exercises
Type out a final draft of your questionnaire in a word processor or online survey tool.
- Evaluate your questionnaire using the guidelines above, revise it, and get it ready to share with other student researchers.
Pilot testing and revising questionnaires
A good way to estimate the time it will take respondents to complete your questionnaire (and other potential challenges) is through pilot testing. Pilot testing allows you to get feedback on your questionnaire so you can improve it before you actually administer it. It can be quite expensive and time consuming if you wish to pilot test your questionnaire on a large sample of people who very much resemble the sample to whom you will eventually administer the finalized version of your questionnaire. But you can learn a lot and make great improvements to your questionnaire simply by pilot testing with a small number of people to whom you have easy access (perhaps you have a few friends who owe you a favor). By pilot testing your questionnaire, you can find out how understandable your questions are, get feedback on question wording and order, find out whether any of your questions are boring or offensive, and learn whether there are places where you should have included filter questions. You can also time pilot testers as they take your survey. This will give you a good idea about the estimate to provide respondents when you administer your survey and whether you have some wiggle room to add additional items or need to cut a few items.
Perhaps this goes without saying, but your questionnaire should also have an attractive design. A messy presentation style can confuse respondents or, at the very least, annoy them. Be brief, to the point, and as clear as possible. Avoid cramming too much into a single page. Make your font size readable (at least 12 point or larger, depending on the characteristics of your sample), leave a reasonable amount of space between items, and make sure all instructions are exceptionally clear. If you are using an online survey, ensure that participants can complete it via mobile, computer, and tablet devices. Think about books, documents, articles, or web pages that you have read yourself—which were relatively easy to read and easy on the eyes and why? Try to mimic those features in the presentation of your survey questions. While online survey tools automate much of visual design, word processors are designed for writing all kinds of documents and may need more manual adjustment as part of visual design.
Realistically, your questionnaire will continue to evolve as you develop your data analysis plan over the next few chapters. By now, you should have a complete draft of your questionnaire grounded in an underlying logic that ties together each question and response option to a variable in your study. Once your questionnaire is finalized, you will need to submit it for ethical approval from your professor or the IRB. If your study requires IRB approval, it may be worthwhile to submit your proposal before your questionnaire is completely done. Revisions to IRB protocols are common and it takes less time to review a few changes to questions and answers than it does to review the entire study, so give them the whole study as soon as you can. Once the IRB approves your questionnaire, you cannot change it without their okay.
Key Takeaways
- A questionnaire is composed of self-report measures of variables in a research study.
- Make sure your survey questions will be relevant to all respondents and that you use filter questions when necessary.
- Effective survey questions and responses take careful construction by researchers, as participants may be confused or otherwise influenced by how items are phrased.
- The questionnaire should start with informed consent and instructions, flow logically from one topic to the next, engage but not shock participants, and thank participants at the end.
- Pilot testing can help identify any issues in a questionnaire before distributing it to participants, including language or length issues.
Exercises
It's a myth that researchers work alone! Get together with a few of your fellow students and swap questionnaires for pilot testing.
- Use the criteria in each section above (questions, response options, questionnaires) and provide your peers with the strengths and weaknesses of their questionnaires.
- See if you can guess their research question and hypothesis based on the questionnaire alone.
11.3 Measurement quality
Learning Objectives
Learners will be able to...
- Define and describe the types of validity and reliability
- Assess for systematic error
The previous chapter provided insight into measuring concepts in social work research. We discussed the importance of identifying concepts and their corresponding indicators as a way to help us operationalize them. In essence, we now understand that when we think about our measurement process, we must be intentional and thoughtful in the choices that we make. This section is all about how to judge the quality of the measures you've chosen for the key variables in your research question.
Reliability
First, let's say we've decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism this way, then it is likely that anyone who identifies as an alcoholic would respond "yes." This may seem like a good way to identify our group of interest, but think about how you and your peer group may respond to this question. Would participants respond differently after a wild night out, compared to any other night? Could an infrequent drinker's current headache from last night's glass of wine influence how they answer the question this morning? How would that same person respond to the question before consuming the wine? In each case, the same person might respond differently to the same question at different points, so it is possible that our measure of alcoholism has a reliability problem. Reliability in measurement is about consistency.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behavior, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past 3 months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. On the other hand, we might get more accurate responses if we ask a participant how many drinks of any kind they have consumed in the past week.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviors. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. They may decide to conduct observations at a local pub by noting how many drinks patrons consume and how their behavior changes as their intake changes. What if the researcher has to use the restroom, and the patron next to them takes three shots of tequila during the brief period the researcher is away from their seat? The reliability of this researcher’s measure of alcohol intake depends on their ability to physically observe every instance of patrons consuming drinks. If they are unlikely to be able to observe every such instance, then perhaps their mechanism for measuring this concept is not reliable.
The following subsections describe the types of reliability that are important for you to know about, but keep in mind that you may see other approaches to judging reliability mentioned in the empirical literature.
Test-retest reliability
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time and then using it again on the same group of people at a later time. Unlike an experiment, you aren't giving participants an intervention but trying to establish a reliable baseline of the variable you are measuring. Once you have these two measurements, you then look at the correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 11.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
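If you are curious how that correlation is computed, here is a minimal sketch using invented self-esteem scores from five people measured a week apart. Real analyses would involve many more participants and, usually, a statistics package.

```python
# Minimal sketch: test-retest reliability as the Pearson correlation between
# two administrations of the same measure. All scores are invented.
from statistics import correlation  # available in Python 3.10+

time1 = [22, 25, 18, 30, 27]  # hypothetical self-esteem scores, week 1
time2 = [21, 26, 17, 29, 28]  # the same five people, one week later

r = correlation(time1, time2)
print(round(r, 2))  # a value of +.80 or greater suggests good reliability
```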
Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
Internal consistency
Another kind of reliability is internal consistency, which is the consistency of people's responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people's scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people's responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants' bets were consistently high or low across trials. A statistic known as Cronbach's alpha provides a way to measure how well each item of a scale is related to the others.
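Cronbach's alpha is usually obtained from statistical software, but the formula itself is short: alpha = (k / (k - 1)) * (1 - (sum of the item variances) / (variance of the total scores)), where k is the number of items. Here is a minimal sketch with invented scale data.

```python
# Minimal sketch of Cronbach's alpha for a multiple-item measure.
# Rows are respondents, columns are items; all scores are invented.
from statistics import pvariance

data = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

k = len(data[0])                                  # number of items
item_scores = list(zip(*data))                    # scores grouped by item
item_variances = [pvariance(item) for item in item_scores]
total_variance = pvariance([sum(row) for row in data])

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 2))  # higher values indicate more internally consistent items
```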
Interrater reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other.
Validity
Validity, another key element of assessing measurement quality, is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.
Face validity
Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people's intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items "I enjoy detective or mystery stories" and "The sight of blood doesn't frighten me or make me sick" both measure the suppression of aggression. In this case, it is not the participants' literal answers to these questions that are of interest, but rather whether the pattern of the participants' responses to a series of questions matches those of individuals who tend to suppress their aggression.
Content validity
Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Criterion validity
Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
Discriminant validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
Increasing the reliability and validity of measures
We have reviewed the types of errors and how to evaluate our measures based on reliability and validity considerations. However, what can we do while selecting or creating our tool so that we minimize the potential for error? Many of our options were covered in our discussion about reliability and validity. Nevertheless, the following list provides a quick summary of things that you should do when creating or selecting a measurement tool. While not all of these will be feasible in your project, it is important to implement the ones that are feasible in your research context.
Make sure that you engage in a rigorous literature review so that you understand the concept that you are studying. This means understanding the different ways that your concept may manifest itself. This review should include a search for existing instruments.[74]
- Do you understand all the dimensions of your concept? Do you have a good understanding of the content dimensions of your concept(s)?
- What instruments exist? How many items are on the existing instruments? Are these instruments appropriate for your population?
- Are these instruments standardized? Note: If an instrument is standardized, that means it has been rigorously studied and tested.
Consult content experts to review your instrument. This is a good way to check the face validity of your items. Additionally, content experts can also help you understand the content validity.[75]
- Do you have access to a reasonable number of content experts? If not, how can you locate them?
- Did you provide a list of critical questions for your content reviewers to use in the reviewing process?
Pilot test your instrument on a sufficient number of people and get detailed feedback.[76] Ask your group to provide feedback on the wording and clarity of items. Keep detailed notes and make adjustments BEFORE you administer your final tool.
- How many people will you use in your pilot testing?
- How will you set up your pilot testing so that it mimics the actual process of administering your tool?
- How will you receive feedback from your pilot testing group? Have you provided a list of questions for your group to think about?
Provide training for anyone collecting data for your project.[77] You should provide those helping you with a written research protocol that explains all of the steps of the project. You should also problem solve and answer any questions that those helping you may have. This will increase the chances that your tool will be administered in a consistent manner.
- How will you conduct your orientation/training? How long will it be? What modality?
- How will you select those who will administer your tool? What qualifications do they need?
When thinking of items, use a higher level of measurement, if possible.[78] This will provide more information and you can always downgrade to a lower level of measurement later.
- Have you examined your items and the levels of measurement?
- Have you thought about whether you need to modify the type of data you are collecting? Specifically, are you asking for information that is too specific (at a higher level of measurement) which may reduce participants' willingness to participate?
Use multiple indicators for a variable.[79] Think about the number of items that you will include in your tool.
- Do you have enough items? Enough indicators? The correct indicators?
Conduct an item-by-item assessment of multiple-item measures.[80] When you do this assessment, think about each word and how it changes the meaning of your item.
- Are there items that are redundant? Do you need to modify, delete, or add items?
Types of error
As you can see, measures never perfectly describe what exists in the real world. Good measures demonstrate validity and reliability but will always have some degree of error. Systematic error (also called bias) causes a measure to consistently output data that are incorrect in one direction or another, usually due to an identifiable process. Imagine you created a measure of height, but you didn't put an option for anyone over six feet tall. If you gave that measure to your local college or university, some of the taller students might not be measured accurately. In fact, you would be under the mistaken impression that the tallest person at your school was six feet tall, when in actuality there are likely people taller than six feet at your school. This error seems innocent, but if you were using that measure to help you build a new building, those people might hit their heads!
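A small simulation can show the directional pull of systematic error. In this sketch, a hypothetical height measure that cannot record anything above 72 inches drags the recorded average below the true average; the heights are invented.

```python
# Minimal sketch: a height measure that tops out at 72 inches (six feet)
# systematically biases recorded heights downward. Heights are invented.
true_heights = [64, 67, 70, 73, 75, 78]        # inches; some people are over six feet

recorded = [min(h, 72) for h in true_heights]  # the flawed measure caps at 72

print(sum(true_heights) / len(true_heights))   # true mean: about 71.2
print(sum(recorded) / len(recorded))           # recorded mean: 69.5 (biased low)
```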
A less innocent form of error arises when researchers word questions in a way that might cause participants to think one answer choice is preferable to another. For example, if I were to ask you “Do you think global warming is caused by human activity?” you would probably feel comfortable answering honestly. But what if I asked you “Do you agree with 99% of scientists that global warming is caused by human activity?” Would you feel comfortable saying no, if that’s what you honestly felt? I doubt it. That is an example of a leading question, a question with wording that influences how a participant responds. We’ll discuss leading questions and other problems in question wording in greater detail in Chapter 12.
In addition to error created by the researcher, your participants can cause error in measurement. Some people will respond without fully understanding a question, particularly if the question is worded in a confusing way. Let's consider another potential source of error. If we asked people if they always washed their hands after using the bathroom, would we expect people to be perfectly honest? Polling people about whether they wash their hands after using the bathroom might only elicit what people would like others to think they do, rather than what they actually do. This is an example of social desirability bias, in which participants in a research study want to present themselves in a positive, socially desirable way to the researcher. People in your study will want to seem tolerant, open-minded, and intelligent, but their true feelings may be closed-minded, simple, and biased. Participants may lie in this situation. This occurs often in political polling, which may show greater support for a candidate from a minority race, gender, or political party than actually exists in the electorate.
A related form of bias is called acquiescence bias, also known as "yea-saying." It occurs when people say yes to whatever the researcher asks, even when doing so contradicts previous answers. For example, a person might say yes to both "I am a confident leader in group discussions" and "I feel anxious interacting in group discussions." Those two responses are unlikely to both be true for the same person. Why would someone do this? Similar to social desirability, people want to be agreeable and nice to the researcher asking them questions, or they might ignore contradictory feelings when responding to each question. You could interpret this as someone saying "yeah, I guess." Respondents may also acquiesce for cultural reasons, trying to "save face" for themselves or the person asking the questions. Regardless of the reason, the results of your measure don't match what the person truly feels.
So far, we have discussed sources of error that come from choices made by respondents or researchers. Systematic errors will result in responses that are incorrect in one direction or another. For example, social desirability bias usually means that the number of people who say they will vote for a third party in an election is greater than the number of people who actually vote for that candidate. Systematic errors such as these can be reduced, but random error can never be eliminated. Unlike systematic error, which biases responses consistently in one direction or another, random error is unpredictable and does not result in scores that are consistently higher or lower on a given measure. Instead, random error is more like statistical noise, which will likely average out across participants.
Random error is present in any measurement. If you’ve ever stepped on a bathroom scale twice and gotten two slightly different results, maybe a difference of a tenth of a pound, then you’ve experienced random error. Maybe you were standing slightly differently or had a fraction of your foot off of the scale the first time. If you were to take enough measures of your weight on the same scale, you’d be able to figure out your true weight. In social science, if you gave someone a scale measuring depression on a day after they lost their job, they would likely score differently than if they had just gotten a promotion and a raise. Even if the person were clinically depressed, our measure is subject to influence by the random occurrences of life. Thus, social scientists speak with humility about our measures. We are reasonably confident that what we found is true, but we must always acknowledge that our measures are only an approximation of reality.
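A companion sketch for random error: repeated measurements scatter around the true value in no particular direction, so the average of many measurements lands close to the truth. The true weight and the size of the noise below are invented.

```python
# Minimal sketch: random error scatters readings around the true value in no
# particular direction, so the average of many readings approaches the truth.
import random

random.seed(0)
true_weight = 150.0                                  # hypothetical true weight in pounds

readings = [true_weight + random.uniform(-0.3, 0.3)  # each reading is off by a little,
            for _ in range(1000)]                    # in an unpredictable direction

print(round(sum(readings) / len(readings), 2))       # close to 150.0
```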
Humility is important in scientific measurement, as errors can have real consequences. At the time I'm writing this, my wife and I are expecting our first child. Like most people, we used a pregnancy test from the pharmacy. If the test said my wife was pregnant when she was not pregnant, that would be a false positive. On the other hand, if the test indicated that she was not pregnant when she was in fact pregnant, that would be a false negative. Even if the test is 99% accurate, that means that one in a hundred women will get an erroneous result when they use a home pregnancy test. For us, a false positive would have been initially exciting, then devastating when we found out we were not having a child. A false negative would have been disappointing at first and then quite shocking when we found out we were indeed having a child. Although false positives and false negatives are not very likely for home pregnancy tests (when taken correctly), measurement error can have real consequences for the people being measured.
Key Takeaways
- Reliability is a matter of consistency.
- Validity is a matter of accuracy.
- There are many types of validity and reliability.
- Systematic error may arise from the researcher, participant, or measurement instrument.
- Systematic error biases results in a particular direction, whereas random error can be in any direction.
- All measures are prone to error and should be interpreted with humility.
Exercises
Use the measurement tools you located in the previous exercise. Evaluate the reliability and validity of these tools. Hint: You will need to go into the literature to "research" these tools.
- Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
- Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
- If you decide to create your own tool, how will you assess its validity and reliability?
Chapter Outline
- Operational definitions (36 minute read)
- Writing effective questions and questionnaires (38 minute read)
- Measurement quality (21 minute read)
Content warning: examples in this chapter contain references to ethnocentrism, toxic masculinity, racism in science, drug use, mental health and depression, psychiatric inpatient care, poverty and basic needs insecurity, pregnancy, and racism and sexism in the workplace and higher education.
11.1 Operational definitions
Learning Objectives
Learners will be able to...
- Define and give an example of indicators and attributes for a variable
- Apply the three components of an operational definition to a variable
- Distinguish between levels of measurement for a variable and how those differences relate to measurement
- Describe the purpose of composite measures like scales and indices
Last chapter, we discussed conceptualizing your project. Conceptual definitions are like dictionary definitions. They tell you what a concept means by defining it using other concepts. In this section we will move from the abstract realm (conceptualization) to the real world (measurement).
Operationalization is the process by which researchers spell out precisely how a concept will be measured in their study. It involves identifying the specific research procedures we will use to gather data about our concepts. If conceptually defining your terms means looking at theory, how do you operationally define your terms? By looking for indicators of when your variable is present or not, more or less intense, and so forth. Operationalization is probably the most challenging part of quantitative research, but once it's done, the design and implementation of your study will be straightforward.
Indicators
Operationalization works by identifying specific indicators that will be taken to represent the ideas we are interested in studying. If we are interested in studying masculinity, then the indicators for that concept might include some of the social roles prescribed to men in society such as breadwinning or fatherhood. Being a breadwinner or a father might therefore be considered indicators of a person’s masculinity. The extent to which a man fulfills either, or both, of these roles might be understood as clues (or indicators) about the extent to which he is viewed as masculine.
Let’s look at another example of indicators. Each day, Gallup researchers poll 1,000 randomly selected Americans to ask them about their well-being. To measure well-being, Gallup asks these people to respond to questions covering six broad areas: physical health, emotional health, work environment, life evaluation, healthy behaviors, and access to basic necessities. Gallup uses these six factors as indicators of the concept that they are really interested in, which is well-being.
Identifying indicators can be even simpler than the examples described thus far. Political party affiliation is another relatively easy concept for which to identify indicators. If you asked a person what party they voted for in the last national election (or gained access to their voting records), you would get a good indication of their party affiliation. Of course, some voters split tickets between multiple parties when they vote and others swing from party to party each election, so our indicator is not perfect. Indeed, if our study were about political identity as a key concept, operationalizing it solely in terms of who they voted for in the previous election leaves out a lot of information about identity that is relevant to that concept. Nevertheless, it's a pretty good indicator of political party affiliation.
Choosing indicators is not an arbitrary process. As described earlier, utilizing prior theoretical and empirical work in your area of interest is a great way to identify indicators in a scholarly manner. And your conceptual definitions will point you in the direction of relevant indicators. Empirical work will give you some very specific examples of how the important concepts in an area have been measured in the past and what sorts of indicators have been used. Often, it makes sense to use the same indicators as previous researchers; however, you may find that some previous measures have potential weaknesses that your own study will improve upon.
All of the examples in this chapter have dealt with questions you might ask a research participant on a survey or in a quantitative interview. If you plan to collect data from other sources, such as through direct observation or the analysis of available records, think practically about what the design of your study might look like and how you can collect data on various indicators feasibly. If your study asks about whether the participant regularly changes the oil in their car, you will likely not observe them directly doing so. Instead, you will likely need to rely on a survey question that asks them the frequency with which they change their oil or ask to see their car maintenance records.
Exercises
- What indicators are commonly used to measure the variables in your research question?
- How can you feasibly collect data on these indicators?
- Are you planning to collect your own data using a questionnaire or interview? Or are you planning to analyze available data like client files or raw data shared from another researcher's project?
Remember, you need raw data. Your research project cannot rely solely on the results reported by other researchers or the arguments you read in the literature. A literature review is only the first part of a research project, and your review of the literature should inform the indicators you end up choosing when you measure the variables in your research question.
Unlike conceptual definitions, which contain other concepts, an operational definition consists of three components: (1) the variable being measured and its attributes, (2) the measure you will use, and (3) how you plan to interpret the data collected from that measure to draw conclusions about the variable you are measuring.
Step 1: Specifying variables and attributes
The first component, the variable, should be the easiest part. At this point in quantitative research, you should have a research question that has at least one independent and at least one dependent variable. Remember that variables must be able to vary. For example, the United States is not a variable. Country of residence is a variable, as is patriotism. Similarly, if your sample only includes men, gender is a constant in your study, not a variable. A constant is a characteristic that does not change in your study.
When social scientists measure concepts, they sometimes use the language of variables and attributes. A variable refers to a quality or quantity that varies across people or situations. Attributes are the characteristics that make up a variable. For example, the variable hair color would contain attributes like blonde, brown, black, red, gray, etc. A variable’s attributes determine its level of measurement. There are four possible levels of measurement: nominal, ordinal, interval, and ratio. The first two levels of measurement are categorical, meaning their attributes are categories rather than numbers. The latter two levels of measurement are continuous, meaning their attributes are numbers.
Levels of measurement
Hair color is an example of a nominal level of measurement. Nominal measures are categorical, and those categories cannot be mathematically ranked. As a brown-haired person (with some gray), I can’t say for sure that brown-haired people are better than blonde-haired people. As with all nominal levels of measurement, there is no ranking order between hair colors; they are simply different. That is what constitutes a nominal level of measurement. Gender and race are also measured at the nominal level.
What attributes are contained in the variable hair color? While blonde, brown, black, and red are common colors, some people may not fit into these categories if we only list these attributes. My wife, who currently has purple hair, wouldn’t fit anywhere. This means that our attributes were not exhaustive. Exhaustiveness means that all possible attributes are listed. We may have to list a lot of colors before we can meet the criteria of exhaustiveness. Clearly, there is a point at which exhaustiveness has been reasonably met. If a person insists that their hair color is light burnt sienna, it is not your responsibility to list that as an option. Rather, that person would reasonably be described as brown-haired. Perhaps listing a category for other colors would suffice to make our list of colors exhaustive.
What about a person who has multiple hair colors at the same time, such as red and black? They would fall into multiple attributes. This violates the rule of mutual exclusivity, in which a person cannot fall into two different attributes. Instead of listing all of the possible combinations of colors, perhaps you might include a multi-color attribute to describe people with more than one hair color.
Making sure researchers provide mutually exclusive and exhaustive attributes is about making sure all people are represented in the data record. For many years, the attributes for gender were only male or female. Now, our understanding of gender has evolved to encompass more attributes that better reflect the diversity in the world. Children of parents from different races were often classified as one race or another, even if they identified with both cultures. The option for bi-racial or multi-racial on a survey not only more accurately reflects the racial diversity in the real world but validates and acknowledges people who identify in that manner. If we did not measure race in this way, we would leave empty the data record for people who identify as biracial or multiracial, impairing our search for truth.
Unlike nominal-level measures, attributes at the ordinal level can be rank ordered. For example, someone’s degree of satisfaction in their romantic relationship can be ordered by rank. That is, you could say you are not at all satisfied, a little satisfied, moderately satisfied, or highly satisfied. Note that even though these have a rank order to them (not at all satisfied is certainly worse than highly satisfied), we cannot calculate a mathematical distance between those attributes. We can simply say that one attribute of an ordinal-level variable is more or less than another attribute.
This can get a little confusing when using rating scales. If you have ever taken a customer satisfaction survey or completed a course evaluation for school, you are familiar with rating scales. “On a scale of 1-5, with 1 being the lowest and 5 being the highest, how likely are you to recommend our company to other people?” That surely sounds familiar. Rating scales use numbers, but only as a shorthand, to indicate what attribute (highly likely, somewhat likely, etc.) the person feels describes them best. You wouldn’t say you are “2” likely to recommend the company, but you would say you are not very likely to recommend the company. Ordinal-level attributes must also be exhaustive and mutually exclusive, as with nominal-level variables.
At the interval level, attributes must also be exhaustive and mutually exclusive and there is equal distance between attributes. Interval measures are also continuous, meaning their attributes are numbers, rather than categories. IQ scores are interval level, as are temperatures in Fahrenheit and Celsius. Their defining characteristic is that we can say how much more or less one attribute differs from another. We cannot, however, say with certainty what the ratio of one attribute is in comparison to another. For example, it would not make sense to say that a person with an IQ score of 140 has twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.
While we cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, we can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level. Finally, at the ratio level, attributes are mutually exclusive and exhaustive, attributes can be rank ordered, the distance between attributes is equal, and attributes have a true zero point. Thus, with these variables, we can say what the ratio of one attribute is in comparison to another. Examples of ratio-level variables include age and years of education. We know that a person who is 12 years old is twice as old as someone who is 6 years old. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. The differences between each level of measurement are visualized in Table 11.1.
| | Nominal | Ordinal | Interval | Ratio |
|---|---|---|---|---|
| Exhaustive | X | X | X | X |
| Mutually exclusive | X | X | X | X |
| Rank-ordered | | X | X | X |
| Equal distance between attributes | | | X | X |
| True zero point | | | | X |
Levels of measurement = levels of specificity
We have spent time learning how to determine our data's level of measurement. Now what? How could we use this information to help us as we measure concepts and develop measurement tools? First, the types of statistical tests that we are able to use are dependent on our data's level of measurement. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval and ratio-level measurement are typically considered the most desirable because they permit any measure of central tendency to be computed (i.e., mean, median, or mode). Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. The higher the level of measurement, the more complex statistical tests we are able to conduct. This knowledge may help us decide what kind of data we need to gather, and how.
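As a quick, hedged illustration of how the level of measurement constrains your choice of central tendency, here is a small Python sketch. The variables and values are hypothetical and simply mirror the examples above.

```python
import statistics

# Hypothetical data for three variables at different levels of measurement.
hair_color = ["brown", "blonde", "brown", "black", "red"]          # nominal
satisfaction = [1, 2, 2, 3, 4]  # ordinal codes: 1 = not at all ... 4 = highly satisfied
age_years = [19, 22, 22, 25, 47]                                    # ratio

# Nominal: only the mode is meaningful.
print(statistics.mode(hair_color))       # brown

# Ordinal: the median (or mode) can be reported, since values can be rank ordered.
print(statistics.median(satisfaction))   # 2

# Interval/ratio: mean, median, and mode are all meaningful.
print(statistics.mean(age_years))        # 27
```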
That said, we have to balance this knowledge with the understanding that sometimes, collecting data at a higher level of measurement could negatively impact our studies. For instance, sometimes providing answers in ranges may make prospective participants feel more comfortable responding to sensitive items. Imagine that you were interested in collecting information on topics such as income, number of sexual partners, number of times someone used illicit drugs, etc. You would have to think about the sensitivity of these items and determine if it would make more sense to collect some data at a lower level of measurement (e.g., asking whether they are sexually active or not (nominal) versus their total number of sexual partners (ratio)).
Finally, sometimes when analyzing data, researchers find a need to change a variable's level of measurement. For example, a few years ago, a student was interested in studying the relationship between mental health and life satisfaction. This student used a variety of measures. One item asked about the number of mental health symptoms, reported as the actual number. When analyzing data, my student examined the mental health symptom variable and noticed that she had two groups: those with zero or one symptoms and those with many symptoms. Instead of using the ratio-level data (actual number of mental health symptoms), she collapsed her cases into two categories, few and many, and used this new variable in her analyses. It is important to note that you can collapse data from a higher level of measurement into a lower level; however, you cannot move data from a lower level to a higher one.
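A minimal sketch of this kind of recoding, using made-up symptom counts rather than the student's actual data, might look like the following.

```python
# Hypothetical ratio-level data: number of mental health symptoms per participant.
symptom_counts = [0, 1, 0, 7, 9, 1, 8, 0, 6]

def collapse(count):
    """Recode a symptom count into 'few' or 'many' (the cutoff here is illustrative)."""
    return "few" if count <= 1 else "many"

symptom_groups = [collapse(count) for count in symptom_counts]
print(symptom_groups)
# ['few', 'few', 'few', 'many', 'many', 'few', 'many', 'few', 'many']
```

Notice that the reverse is impossible: once only "few" and "many" are recorded, the original counts cannot be recovered.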
Exercises
- Check that the variables in your research question can vary...and that they are not constants or one of many potential attributes of a variable.
- Think about the attributes your variables have. Are they categorical or continuous? What level of measurement seems most appropriate?
Step 2: Specifying measures for each variable
Let’s pick a social work research question and walk through the process of operationalizing variables to see how specific we need to get. I’m going to hypothesize that residents of a psychiatric unit who are more depressed are less likely to be satisfied with care. Remember, this would be an inverse relationship—as depression increases, satisfaction decreases. In this question, depression is my independent variable (the cause) and satisfaction with care is my dependent variable (the effect). Now that we have identified our variables, their attributes, and levels of measurement, we move on to the second component: the measure itself.
So, how would you measure my key variables: depression and satisfaction? What indicators would you look for? Some students might say that depression could be measured by observing a participant’s body language. They may also say that a depressed person will often express feelings of sadness or hopelessness. In addition, a satisfied person might be happy around service providers and often express gratitude. While these factors may indicate that the variables are present, they lack coherence. Unfortunately, what this “measure” is actually saying is that “I know depression and satisfaction when I see them.” While you are likely a decent judge of depression and satisfaction, you need to provide more information in a research study for how you plan to measure your variables. Such judgments are subjective, based on your own idiosyncratic experiences with depression and satisfaction. They couldn’t be replicated by another researcher, and they can’t be applied consistently to a large group of people. Operationalization requires that you come up with a specific and rigorous measure for determining who is depressed or satisfied.
Finding a good measure for your variable depends on the kind of variable it is. Variables that are directly observable don't come up very often in my students' classroom projects, but they might include things like taking someone's blood pressure, marking attendance or participation in a group, and so forth. To measure an indirectly observable variable like age, you would probably put a question on a survey that asked, “How old are you?” Measuring a variable like income might require some more thought, though. Are you interested in this person’s individual income or the income of their family unit? This might matter if your participant does not work or is dependent on other family members for income. Do you count income from social welfare programs? Are you interested in their income per month or per year? Even though indirect observables are relatively easy to measure, the measures you use must be clear in what they are asking, and operationalization is all about figuring out the specifics of what you want to know. For more complicated constructs, you will need composite measures (that use multiple indicators to measure a single variable).
How you plan to collect your data also influences how you will measure your variables. For social work researchers using secondary data like client records as a data source, you are limited by what information is in the data sources you can access. If your organization uses a given measurement for a mental health outcome, that is the one you will use in your study. Similarly, if you plan to study how long a client was housed after an intervention using client visit records, you are limited by how their caseworker recorded their housing status in the chart. One of the benefits of collecting your own data is being able to select the measures you feel best exemplify your understanding of the topic.
Measuring unidimensional concepts
The previous section mentioned two important considerations: how complicated the variable is and how you plan to collect your data. With these in hand, we can use the level of measurement to further specify how you will measure your variables and consider specialized rating scales developed by social science researchers.
Measurement at each level
Nominal measures assess categorical variables. These measures are used for variables or indicators that have mutually exclusive attributes, but that cannot be rank-ordered. Nominal measures ask about the variable and provide names or labels for different attribute values like social work, counseling, and nursing for the variable profession. Nominal measures are relatively straightforward.
Ordinal measures often use a rating scale. It is an ordered set of responses that participants must choose from. Figure 11.1 shows several examples. The number of response options on a typical rating scale is usually five or seven, though it can range from three to eleven. Five-point scales are best for unipolar scales where only one construct is tested, such as frequency (Never, Rarely, Sometimes, Often, Always). Seven-point scales are best for bipolar scales where there is a dichotomous spectrum, such as liking (Like very much, Like somewhat, Like slightly, Neither like nor dislike, Dislike slightly, Dislike somewhat, Dislike very much). For bipolar questions, it is useful to offer an earlier question that branches respondents into one side of the scale; if asking about liking ice cream, first ask “Do you generally like or dislike ice cream?” Once the respondent chooses like or dislike, refine it by offering them relevant choices from the seven-point scale. Branching improves both reliability and validity (Krosnick & Berent, 1993).[81] Although you often see scales with numerical labels, it is best to present only verbal labels to the respondents and convert them to numerical values in the analyses. Avoid partial, lengthy, or overly specific labels. In some cases, the verbal labels can be supplemented with (or even replaced by) meaningful graphics. The last rating scale shown in Figure 11.1 is a visual-analog scale, on which participants make a mark somewhere along the horizontal line to indicate the magnitude of their response.
Interval measures are those where the values measured are not only rank-ordered, but are also equidistant from adjacent attributes. For example, on the temperature scale (in Fahrenheit or Celsius), the difference between 30 and 40 degrees Fahrenheit is the same as that between 80 and 90 degrees Fahrenheit. Likewise, if you have a scale that asks respondents’ annual income using the following attributes (ranges): $0 to 10,000, $10,000 to 20,000, $20,000 to 30,000, and so forth, this is also an interval measure, because the mid-points of each range (i.e., $5,000, $15,000, $25,000, etc.) are equidistant from each other. The intelligence quotient (IQ) scale is also an interval measure, because the measure is designed such that the difference between IQ scores of 100 and 110 is supposed to be the same as between 110 and 120 (although we do not really know whether that is truly the case). Interval measures allow us to examine “how much more” one attribute is when compared to another, which is not possible with nominal or ordinal measures. You may find researchers who “pretend” (incorrectly) that ordinal rating scales are actually interval measures so that they can use different statistical techniques for analyzing them. As we will discuss in the latter part of the chapter, this is a mistake because there is no way to know whether the difference between a 3 and a 4 on a rating scale is the same as the difference between a 2 and a 3. Those numbers are just placeholders for categories.
Ratio measures are those that have all the qualities of nominal, ordinal, and interval scales, and in addition, also have a “true zero” point (where the value zero implies lack or non-availability of the underlying construct). Think about how to measure the number of people working in human resources at a social work agency. It could be one, several, or none (if the company contracts out for those services). Measuring interval and ratio data is relatively easy, as people either select or input a number for their answer. If you ask a person how many eggs they purchased last week, they can simply tell you they purchased a dozen at the store, just a couple, or none at all.
Commonly used rating scales in questionnaires
The level of measurement will give you the basic information you need, but social scientists have developed specialized instruments for use in questionnaires, a common tool used in quantitative research. As we mentioned before, if you plan to source your data from client files or previously published results, you will be limited to the measures already recorded in those sources; the rating scales discussed below apply when you are designing your own questions for participants.
Although Likert scale is a term colloquially used to refer to almost any rating scale (e.g., a 0-to-10 life satisfaction scale), it has a much more precise meaning. In the 1930s, researcher Rensis Likert (pronounced LICK-ert) created a new approach for measuring people’s attitudes (Likert, 1932).[82] It involves presenting people with several statements—including both favorable and unfavorable statements—about some person, group, or idea. Respondents then express their agreement or disagreement with each statement on a 5-point scale: Strongly Agree, Agree, Neither Agree nor Disagree, Disagree, Strongly Disagree. Numbers are assigned to each response and then summed across all items to produce a score representing the attitude toward the person, group, or idea. For items that are phrased in an opposite direction (e.g., negatively worded statements instead of positively worded statements), reverse coding is used so that the numerical scoring of statements also runs in the opposite direction. The entire set of items came to be called a Likert scale, as indicated in Table 11.2 below.
Unless you are measuring people’s attitude toward something by assessing their level of agreement with several statements about it, it is best to avoid calling it a Likert scale. You are probably just using a rating scale. Likert scales allow for more granularity (more finely tuned response) than yes/no items, including whether respondents are neutral to the statement. Below is an example of how we might use a Likert scale to assess your attitudes about research as you work your way through this textbook.
| | Strongly agree | Agree | Neutral | Disagree | Strongly disagree |
|---|---|---|---|---|---|
| I like research more now than when I started reading this book. | | | | | |
| This textbook is easy to use. | | | | | |
| I feel confident about how well I understand levels of measurement. | | | | | |
| This textbook is helping me plan my research proposal. | | | | | |
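To show how Likert-scale responses become a single attitude score, here is a hedged scoring sketch. The point values, the negatively worded item, and the participant's answers are all hypothetical; they simply illustrate the reverse coding and summing described above.

```python
# Convert 5-point Likert responses to numbers, reverse code negatively worded
# items, and sum across items to get one attitude score per participant.
POINTS = {"Strongly agree": 5, "Agree": 4, "Neutral": 3,
          "Disagree": 2, "Strongly disagree": 1}

responses = {
    "I like research more now than when I started reading this book.": "Agree",
    "This textbook is easy to use.": "Strongly agree",
    "Research is boring.": "Disagree",  # hypothetical negatively worded item
}
reverse_coded = {"Research is boring."}  # items scored in the opposite direction

total = 0
for item, answer in responses.items():
    score = POINTS[answer]
    if item in reverse_coded:
        score = 6 - score  # on a 5-point scale, 5 becomes 1, 4 becomes 2, and so on
    total += score

print(total)  # 4 + 5 + 4 = 13; higher totals indicate a more favorable attitude
```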
Semantic differential scales are composite (multi-item) scales in which respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites. Whereas in the above Likert scale, the participant is asked how much they agree or disagree with a statement, in a semantic differential scale the participant is asked to indicate how they feel about a specific item. This makes the semantic differential scale an excellent technique for measuring people’s attitudes or feelings toward objects, events, or behaviors. Table 11.3 is an example of a semantic differential scale that was created to assess participants' feelings about this textbook.
1) How would you rate your opinions toward this textbook?

| | Very much | Somewhat | Neither | Somewhat | Very much | |
|---|---|---|---|---|---|---|
| Boring | | | | | | Exciting |
| Useless | | | | | | Useful |
| Hard | | | | | | Easy |
| Irrelevant | | | | | | Applicable |
The Guttman scale, a composite scale designed by Louis Guttman, uses a series of items arranged in increasing order of intensity (least intense to most intense) of the concept. This type of scale allows us to understand the intensity of beliefs or feelings. Each item in a Guttman scale has a weight (this is not indicated on the tool itself) which varies with the intensity of that item, and the weighted combination of each response is used as an aggregate measure of an observation.
Example Guttman Scale Items
- I often felt the material was not engaging Yes/No
- I was often thinking about other things in class Yes/No
- I was often working on other tasks during class Yes/No
- I will work to abolish research from the curriculum Yes/No
Notice how the items move from lower intensity to higher intensity. A researcher reviews the yes answers and creates a score for each participant.
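Since the item weights are not shown on the tool, here is a deliberately simplified scoring sketch that just counts "yes" answers and reports the most intense item a hypothetical participant endorsed; a real Guttman analysis would apply the weights and examine the cumulative response pattern.

```python
# Items ordered from least to most intense, as in the example above.
items = [
    "I often felt the material was not engaging",
    "I was often thinking about other things in class",
    "I was often working on other tasks during class",
    "I will work to abolish research from the curriculum",
]
answers = ["Yes", "Yes", "No", "No"]  # one hypothetical participant's responses

score = sum(1 for answer in answers if answer == "Yes")
endorsed = [i for i, answer in enumerate(answers) if answer == "Yes"]

print(score)                                            # 2 items endorsed
print(items[max(endorsed)] if endorsed else "no items endorsed")
```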
Composite measures: Scales and indices
Depending on your research design, your measure may be something you put on a survey or pre/post-test that you give to your participants. For a variable like age or income, one well-worded question may suffice. Unfortunately, most variables in the social world are not so simple. Depression and satisfaction are multidimensional concepts. Relying on a single indicator like a question that asks "Yes or no, are you depressed?” does not encompass the complexity of depression, including issues with mood, sleeping, eating, relationships, and happiness. There is no easy way to delineate between multidimensional and unidimensional concepts, as it's all in how you think about your variable. Satisfaction could be validly measured using a unidimensional ordinal rating scale. However, if satisfaction were a key variable in our study, we would need a theoretical framework and conceptual definition for it. That means we'd probably have more indicators to ask about like timeliness, respect, sensitivity, and many others, and we would want our study to say something about what satisfaction truly means in terms of our other key variables. However, if satisfaction is not a key variable in your conceptual framework, it makes sense to operationalize it as a unidimensional concept.
For more complicated measures, researchers use scales and indices (sometimes called indexes) to measure their variables because they assess multiple indicators to develop a composite (or total) score. Composite scores provide a much greater understanding of concepts than a single item could. Although we won't delve too deeply into the process of scale development, we will cover some important topics for you to understand how scales and indices developed by other researchers can be used in your project.
Although they exhibit differences (which will be discussed later), scales and indices have several characteristics in common.
- Both are ordinal measures of variables.
- Both can order the units of analysis in terms of specific variables.
- Both are composite measures.
Scales
The previous section discussed how to measure respondents’ responses to predesigned items or indicators belonging to an underlying construct. But how do we create the indicators themselves? The process of creating the indicators is called scaling. More formally, scaling is a branch of measurement that involves the construction of measures by associating qualitative judgments about unobservable constructs with quantitative, measurable metric units. Stevens (1946)[83] said, “Scaling is the assignment of objects to numbers according to a rule.” This process of measuring abstract concepts in concrete terms remains one of the most difficult tasks in empirical social science research.
The outcome of a scaling process is a scale, which is an empirical structure for measuring items or indicators of a given construct. Understand that multidimensional “scales”, as discussed in this section, are a little different from “rating scales” discussed in the previous section. A rating scale is used to capture the respondents’ reactions to a given item on a questionnaire. For example, an ordinally scaled item captures a value between “strongly disagree” to “strongly agree.” Attaching a rating scale to a statement or instrument is not scaling. Rather, scaling is the formal process of developing scale items, before rating scales can be attached to those items.
If creating your own scale sounds painful, don’t worry! For most multidimensional variables, you would likely be duplicating work that has already been done by other researchers. Specifically, this is a branch of science called psychometrics. You do not need to create a scale for depression because scales such as the Patient Health Questionnaire (PHQ-9), the Center for Epidemiologic Studies Depression Scale (CES-D), and Beck’s Depression Inventory (BDI) have been developed and refined over dozens of years to measure variables like depression. Similarly, scales such as the Patient Satisfaction Questionnaire (PSQ-18) have been developed to measure satisfaction with medical care. As we will discuss in the next section, these scales have been shown to be reliable and valid. While you could create a new scale to measure depression or satisfaction, a study with rigor would pilot test and refine that new scale over time to make sure it measures the concept accurately and consistently. This high level of rigor is often unachievable in student research projects because of the cost and time involved in pilot testing and validating, so using existing scales is recommended.
Unfortunately, there is no good one-stop shop for psychometric scales. The Mental Measurements Yearbook provides a searchable database of measures for social science variables, though it is woefully incomplete and often does not contain the full documentation for scales in its database. You can access it from a university library’s list of databases. If you can’t find anything in there, your next stop should be the methods section of the articles in your literature review. The methods section of each article will detail how the researchers measured their variables, and often the results section is instructive for understanding more about measures. In a quantitative study, researchers may have used a scale to measure key variables and will provide a brief description of that scale, its name, and maybe a few example questions. If you need more information, look at the results section and tables discussing the scale to get a better idea of how the measure works. Looking beyond the articles in your literature review, searching Google Scholar using queries like “depression scale” or “satisfaction scale” should also provide some relevant results. For example, searching for documentation for the Rosenberg Self-Esteem Scale (which we will discuss in the next section), I found this report from researchers investigating acceptance and commitment therapy which details this scale and many others used to assess mental health outcomes. If you find the name of the scale somewhere but cannot find the documentation (all questions and answers plus how to interpret the scale), a general web search with the name of the scale and ".pdf" may bring you to what you need. Or, to get professional help with finding information, always ask a librarian!
Unfortunately, these approaches do not guarantee that you will be able to view the scale itself or get information on how it is interpreted. Many scales cost money to use and may require training to properly administer. You may also find scales that are related to your variable but would need to be slightly modified to match your study’s needs. You could adapt a scale to fit your study; however, changing even small parts of a scale can influence its accuracy and consistency. While it is perfectly acceptable in student projects to adapt a scale without testing it first (time may not allow you to do so), pilot testing is always recommended for adapted scales, and researchers seeking to draw valid conclusions and publish their results must take this additional step.
Indices
An index is a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas. It is different from a scale. Scales also aggregate measures; however, these measures examine different dimensions or the same dimension of a single construct. A well-known example of an index is the consumer price index (CPI), which is computed every month by the Bureau of Labor Statistics of the U.S. Department of Labor. The CPI is a measure of how much consumers have to pay for goods and services (in general) and is divided into eight major categories (food and beverages, housing, apparel, transportation, healthcare, recreation, education and communication, and “other goods and services”), which are further subdivided into more than 200 smaller items. Each month, government employees call all over the country to get the current prices of more than 80,000 items. Using a complicated weighting scheme that takes into account the location and probability of purchase for each item, analysts then combine these prices into an overall index score using a series of formulas and rules.
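The general logic of a weighted index can be sketched in a few lines. The categories, weights, and price changes below are made up for illustration and bear no relation to the BLS's actual data or formulas.

```python
# Toy weighted index: each component's price change (relative to a base period)
# is multiplied by its weight, and the weighted pieces are summed into one score.
components = {
    # component: (weight, price relative to base period; 1.05 means prices rose 5%)
    "food and beverages": (0.15, 1.04),
    "housing":            (0.40, 1.05),
    "transportation":     (0.15, 0.98),
    "other":              (0.30, 1.02),
}

index = 100 * sum(weight * price_relative
                  for weight, price_relative in components.values())
print(round(index, 1))  # about 102.9, i.e., prices rose roughly 2.9% overall
```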
Another example of an index is the Duncan Socioeconomic Index (SEI). This index is used to quantify a person's socioeconomic status (SES) and is a combination of three concepts: income, education, and occupation. Income is measured in dollars, education in years or degrees achieved, and occupation is classified into categories or levels by status. These very different measures are combined to create an overall SES index score. However, SES index measurement has generated a lot of controversy and disagreement among researchers.
The process of creating an index is similar to that of a scale. First, conceptualize (define) the index and its constituent components. Though this appears simple, there may be a lot of disagreement on what components (concepts/constructs) should be included or excluded from an index. For instance, in the SES index, isn’t income correlated with education and occupation? And if so, should we include one component only or all three components? Reviewing the literature, using theories, and/or interviewing experts or key stakeholders may help resolve this issue. Second, operationalize and measure each component. For instance, how will you categorize occupations, particularly since some occupations may have changed with time (e.g., there were no Web developers before the Internet)? As we will see in step three below, researchers must create a rule or formula for calculating the index score. Again, this process may involve a lot of subjectivity, so validating the index score using existing or new data is important.
Scale and index development are often taught in their own courses in doctoral education, so it is unreasonable to expect that you will develop a consistently accurate measure within the span of a week or two. Using available indices and scales is recommended for this reason.
Differences between scales and indices
Though indices and scales yield a single numerical score or value representing a concept of interest, they are different in many ways. First, indices often comprise components that are very different from each other (e.g., income, education, and occupation in the SES index) and are measured in different ways. Conversely, scales typically involve a set of similar items that use the same rating scale (such as a five-point Likert scale about customer satisfaction).
Second, indices often combine objectively measurable values such as prices or income, while scales are designed to assess subjective or judgmental constructs such as attitude, prejudice, or self-esteem. Some argue that the sophistication of the scaling methodology makes scales different from indexes, while others suggest that indexing methodology can be equally sophisticated. Nevertheless, indexes and scales are both essential tools in social science research.
Scales and indices seem like clean, convenient ways to measure different phenomena in social science, but just like with a lot of research, we have to be mindful of the assumptions and biases underneath. What if a scale or an index was developed using only White women as research participants? Is it going to be useful for other groups? It very well might be, but when using a scale or index on a group for whom it hasn't been tested, it will be very important to evaluate the validity and reliability of the instrument, which we address in the rest of the chapter.
Finally, it's important to note that while scales and indices are often made up of nominal or ordinal variables, when we combine them into composite scores, we will treat those scores as interval/ratio variables.
Exercises
- Look back at your work from the previous section. Are your variables unidimensional or multidimensional?
- Describe the specific measures you will use (actual questions and response options you will use with participants) for each variable in your research question.
- If you are using a measure developed by another researcher but do not have all of the questions, response options, and instructions needed to implement it, put it on your to-do list to get them.
Step 3: How you will interpret your measures
The final stage of operationalization involves setting the rules for how the measure works and how the researcher should interpret the results. Sometimes, interpreting a measure can be incredibly easy. If you ask someone their age, you’ll probably interpret the results by noting the raw number (e.g., 22) someone provides and that it is lower or higher than other people's ages. However, you could also recode that person into age categories (e.g., under 25, 20-29 years old, generation Z, etc.). Even scales may be simple to interpret. If there is a scale of problem behaviors, one might simply add up the number of behaviors checked off, with a range from 1-5 indicating low risk of delinquent behavior, 6-10 indicating the student is at moderate risk, etc. How you choose to interpret your measures should be guided by how they were designed, how you conceptualize your variables, the data sources you used, and your plan for analyzing your data statistically. Whatever measure you use, you need a set of rules for how to take any valid answer a respondent provides to your measure and interpret it in terms of the variable being measured.
For more complicated measures like scales, refer to the information provided by the author for how to interpret the scale. If you can’t find enough information from the scale’s creator, look at how the results of that scale are reported in the results section of research articles. For example, Beck’s Depression Inventory (BDI-II) uses 21 statements to measure depression and respondents rate their level of agreement on a scale of 0-3. The results for each question are added up, and the respondent is put into one of three categories: low levels of depression (1-16), moderate levels of depression (17-30), or severe levels of depression (31 and over).
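As an illustration of turning a scale's interpretation rules into something you can apply consistently, here is a small sketch using the BDI-II cutoffs described above. The function name, item ratings, and exact thresholds are for illustration only; always defer to the scale's official documentation when scoring.

```python
def interpret_bdi(total_score):
    """Map a summed score to the depression categories described above (illustrative only)."""
    if total_score <= 16:
        return "low levels of depression"
    elif total_score <= 30:
        return "moderate levels of depression"
    else:
        return "severe levels of depression"

# Hypothetical ratings (0-3) for the 21 statements are summed, then interpreted.
item_ratings = [1, 0, 2, 1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 0, 2, 1, 1, 1]
print(sum(item_ratings))                 # 20
print(interpret_bdi(sum(item_ratings)))  # moderate levels of depression
```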
One common mistake I see often is that students will introduce another variable into their operational definition. This is incorrect. Your operational definition should mention only one variable—the variable being defined. While your study will certainly draw conclusions about the relationships between variables, that's not what operationalization is. Operationalization specifies what instrument you will use to measure your variable and how you plan to interpret the data collected using that measure.
Operationalization is probably the trickiest component of basic research methods, so please don’t get frustrated if it takes a few drafts and a lot of feedback to get to a workable definition. At the time of this writing, I am in the process of operationalizing the concept of “attitudes towards research methods.” Originally, I thought that I could gauge students’ attitudes toward research methods by looking at their end-of-semester course evaluations. As I became aware of the potential methodological issues with student course evaluations, I opted to use focus groups of students to measure their common beliefs about research. You may recall some of these opinions from Chapter 1, such as the common beliefs that research is boring, useless, and too difficult. After the focus group, I created a scale based on the opinions I gathered, and I plan to pilot test it with another group of students. After the pilot test, I expect that I will have to revise the scale again before I can implement the measure in a real social work research project. At the time I’m writing this, I’m still not completely done operationalizing this concept.
Key Takeaways
- Operationalization involves spelling out precisely how a concept will be measured.
- Operational definitions must include the variable, the measure, and how you plan to interpret the measure.
- There are four different levels of measurement: nominal, ordinal, interval, and ratio (in increasing order of specificity).
- Scales and indices are common ways to collect information and involve using multiple indicators in measurement.
- A key difference between a scale and an index is that a scale contains multiple indicators for one concept, whereas an index examines multiple concepts (components).
- Using scales developed and refined by other researchers can improve the rigor of a quantitative study.
Exercises
Use the research question that you developed in the previous chapters and find a related scale or index that researchers have used. If you have trouble finding the exact phenomenon you want to study, get as close as you can.
- What is the level of measurement for each item on each tool? Take a second and think about why the tool's creator decided to include these levels of measurement. Identify any levels of measurement you would change and why.
- If these tools don't exist for what you are interested in studying, why do you think that is?
12.3 Writing effective questions and questionnaires
Learning Objectives
Learners will be able to...
- Describe some of the ways that survey questions might confuse respondents and how to word questions and responses clearly
- Create mutually exclusive, exhaustive, and balanced response options
- Define fence-sitting and floating
- Describe the considerations involved in constructing a well-designed questionnaire
- Discuss why pilot testing is important
In the previous section, we reviewed how researchers collect data using surveys. Guided by their sampling approach and research context, researchers should choose the survey approach that provides the most favorable tradeoffs in strengths and challenges. With this information in hand, researchers need to write their questionnaire and revise it before beginning data collection. Each method of delivery requires a questionnaire, but they vary a bit based on how they will be used by the researcher. Since phone surveys are read aloud, researchers will pay more attention to how the questionnaire sounds than how it looks. Online surveys can use advanced tools to require the completion of certain questions, present interactive questions and answers, and otherwise afford greater flexibility in how questionnaires are designed. As you read this section, consider how your method of delivery impacts the type of questionnaire you will design. Because most student projects use paper or online surveys, this section will detail how to construct self-administered questionnaires to minimize the potential for bias and error.
Start with operationalization
The first thing you need to do to write effective survey questions is identify what exactly you wish to know. As silly as it sounds to state what seems so completely obvious, we can’t stress enough how easy it is to forget to include important questions when designing a survey. Begin by looking at your research question and refreshing your memory of the operational definitions you developed for those variables from Chapter 11. You should have a pretty firm grasp of your operational definitions before starting the process of questionnaire design. You may have taken those operational definitions from other researchers' methods, found established scales and indices for your measures, or created your own questions and answer options.
Exercises
STOP! Make sure you have a complete operational definition for the dependent and independent variables in your research question. A complete operational definition contains the variable being measured, the measure used, and how the researcher interprets the measure. Let's make sure you have what you need from Chapter 11 to begin writing your questionnaire.
List all of the dependent and independent variables in your research question.
- It's normal to have one dependent or independent variable. It's also normal to have more than one of either.
- Make sure that your research question (and this list) contain all of the variables in your hypothesis. Your hypothesis should only include variables from your research question.
For each variable in your list:
- Write out the measure you will use (the specific questions and answers) for each variable.
- If you don't have questions and answers finalized yet, write a first draft and revise it based on what you read in this section.
- If you are using a measure from another researcher, you should be able to write out all of the questions and answers associated with that measure. If you only have the name of a scale or a few questions, you need access to the full text and some documentation on how to administer and interpret it before you can finish your questionnaire.
- Describe how you will use each measure to draw conclusions about the variable in the operational definition.
- For example, an interpretation might be "there are five 7-point Likert scale questions...point values are added across all five items for each participant...and scores below 10 indicate the participant has low self-esteem"
- Don't introduce other variables into the mix here. All we are concerned with is how you will measure each variable by itself. The connection between variables is done using statistical tests, not operational definitions.
- Detail any validity or reliability issues uncovered by previous researchers using the same measures. If you have concerns about validity and reliability, note them, as well.
If you completed the exercise above and listed out all of the questions and answer choices you will use to measure the variables in your research question, you have already produced a pretty solid first draft of your questionnaire! Congrats! In essence, questionnaires are all of the self-report measures in your operational definitions for the independent, dependent, and control variables in your study arranged into one document and administered to participants. There are a few questions on a questionnaire (like name or ID#) that are not associated with the measurement of variables. These are the exception, and it's useful to think of a questionnaire as a list of measures for variables. Of course, researchers often use more than one measure of a variable (i.e., triangulation) so they can more confidently assert that their findings are true. A questionnaire should contain all of the measures researchers plan to collect about their variables by asking participants to self-report. As we will discuss in the final section of this chapter, triangulating across data sources (e.g., measuring variables using client files or student records) can avoid some of the common sources of bias in survey research.
Sticking close to your operational definitions is important because it helps you avoid an everything-but-the-kitchen-sink approach that includes every possible question that occurs to you. Doing so puts an unnecessary burden on your survey respondents. Remember that you have asked your participants to give you their time and attention and to take care in responding to your questions; show them your respect by only asking questions that you actually plan to use in your analysis. For each question in your questionnaire, ask yourself how this question measures a variable in your study. An operational definition should contain the questions, response options, and how the researcher will draw conclusions about the variable based on participants' responses.
Writing questions
So, almost all of the questions on a questionnaire are measuring some variable. For many variables, researchers will create their own questions rather than using one from another researcher. This section will provide some tips on how to create good questions to accurately measure variables in your study. First, questions should be as clear and to the point as possible. This is not the time to show off your creative writing skills; a survey is a technical instrument and should be written in a way that is as direct and concise as possible. As I’ve mentioned earlier, your survey respondents have agreed to give their time and attention to your survey. The best way to show your appreciation for their time is to not waste it. Ensuring that your questions are clear and concise will go a long way toward showing your respondents the gratitude they deserve. Pilot testing the questionnaire with friends or colleagues can help identify these issues. This process is commonly called pretesting, but to avoid any confusion with pretesting in experimental design, we refer to it as pilot testing.
Related to the point about not wasting respondents’ time, make sure that every question you pose will be relevant to every person you ask to complete it. This means two things: first, that respondents have knowledge about whatever topic you are asking them about, and second, that respondents have experienced the events, behaviors, or feelings you are asking them to report. If you are asking participants for second-hand knowledge—asking clinicians about clients' feelings, asking teachers about students' feelings, and so forth—you may want to clarify that the variable you are asking about is the key informant's perception of what is happening in the target population. A well-planned sampling approach ensures that participants are the most knowledgeable population to complete your survey.
If you decide that you do wish to include questions about matters with which only a portion of respondents will have had experience, make sure you know why you are doing so. For example, if you are asking about MSW student study patterns, and you decide to include a question on studying for the social work licensing exam, you may only have a small subset of participants who have begun studying for the graduate exam or took the bachelor's-level exam. If you decide to include this question that speaks to a minority of participants' experiences, think about why you are including it. Are you interested in how studying for class and studying for licensure differ? Are you trying to triangulate study skills measures? Researchers should carefully consider whether questions relevant to only a subset of participants are likely to produce enough valid responses for quantitative analysis.
Many times, questions that are relevant to a subsample of participants are conditional on an answer to a previous question. A participant might select that they rent their home, and as a result, you might ask whether they carry renter's insurance. That question is not relevant to homeowners, so it would be wise not to ask them to respond to it. In that case, the question of whether someone rents or owns their home is a filter question, designed to identify some subset of survey respondents who are asked additional questions that are not relevant to the entire sample. Figure 12.1 presents an example of how to accomplish this on a paper survey by adding instructions to the participant that indicate what question to proceed to next based on their response to the first one. Using online survey tools, researchers can use filter questions to only present relevant questions to participants.
To minimize confusion, researchers should eliminate questions that ask about things participants don't know. Assuming the question is relevant to the participant, other sources of confusion come from how the question is worded. Negative wording is one potential source of confusion. Taking the question from Figure 12.1 about drinking as our example, what if we had instead asked, “Did you not abstain from drinking during your first semester of college?” This is a double negative, and it's not clear how to answer it accurately. It is a good idea to avoid negative phrasing when possible. Even a single negative adds confusion: “Did you not drink alcohol during your first semester of college?” is less clear than “Did you drink alcohol during your first semester of college?”
You should also avoid using terms or phrases that may be regionally or culturally specific (unless you are absolutely certain all your respondents come from the region or culture whose terms you are using). When I first moved to southwest Virginia, I didn’t know what a holler was. Where I grew up in New Jersey, to holler means to yell. Even then, in New Jersey, we shouted and screamed, but we didn’t holler much. In southwest Virginia, my home at the time, a holler also means a small valley in between the mountains. If I used holler in that way on my survey, people who live near me may understand, but almost everyone else would be totally confused. A similar issue arises when you use jargon, or technical language, that people do not commonly know. For example, if you asked adolescents how they experience imaginary audience, they would find it difficult to link those words to the concepts from David Elkind’s theory. The words you use in your questions must be understandable to your participants. If you find yourself using jargon or slang, break it down into terms that are more universal and easier to understand.
Asking multiple questions as though they are a single question can also confuse survey respondents. There’s a specific term for this sort of question; it is called a double-barreled question. Figure 12.2 shows a double-barreled question. Do you see what makes the question double-barreled? How would someone respond if they felt their college classes were more demanding but also more boring than their high school classes? Or less demanding but more interesting? Because the question combines “demanding” and “interesting,” there is no way to respond yes to one criterion but no to the other.
Another thing to avoid when constructing survey questions is the problem of social desirability. We all want to look good, right? And we all probably know the politically correct response to a variety of questions whether we agree with the politically correct response or not. In survey research, social desirability refers to the idea that respondents will try to answer questions in a way that will present them in a favorable light. (You may recall we covered social desirability bias in Chapter 11.)
Perhaps we decide that, to understand the transition to college, our research project needs to know whether respondents ever cheated on an exam in high school or college. We all know that cheating on exams is generally frowned upon (at least I hope we all know this). So, it may be difficult to get people to admit to cheating on a survey. But if you can guarantee respondents’ confidentiality, or even better, their anonymity, chances are much better that they will be honest about having engaged in this socially undesirable behavior. Another way to avoid problems of social desirability is to try to phrase difficult questions in the most benign way possible. Earl Babbie (2010) [84] offers a useful suggestion for helping you do this—simply imagine how you would feel responding to your survey questions. If you would be uncomfortable, chances are others would as well.
Exercises
Try to step outside your role as researcher for a second, and imagine you were one of your participants. Evaluate the following:
- Is the question too general? Sometimes, questions that are too general may not accurately convey respondents’ perceptions. If you asked someone how well they liked a certain book and provided a response scale ranging from “not at all” to “extremely well,” what would it mean if that person selected “extremely well”? Instead, ask more specific behavioral questions, such as “Will you recommend this book to others?” or “Do you plan to read other books by the same author?”
- Is the question too detailed? Avoid unnecessarily detailed questions that serve no specific research purpose. For instance, do you need the age of each child in a household, or is the number of children in the household enough? That said, if you are unsure, it is better to err on the side of detail than generality.
- Is the question presumptuous? Does your question make assumptions? For instance, if you ask, "what do you think the benefits of a tax cut would be?" you are presuming that the participant sees the tax cut as beneficial. But many people may not view tax cuts as beneficial. Some might see tax cuts as a precursor to less funding for public schools and fewer public services such as police, ambulance, and fire departments. Avoid questions with built-in presumptions.
- Does the question ask the participant to imagine something hypothetical? A popular question on many television game shows is “if you won a million dollars on this show, how would you plan to spend it?” Most participants have never been faced with this large an amount of money and have never thought about this scenario. In fact, most don’t even know that after taxes, the value of the million dollars will be greatly reduced. In addition, some game shows spread the amount over a 20-year period. Without understanding this “imaginary” situation, participants may not have the background information necessary to provide a meaningful response.
Finally, it is important to get feedback on your survey questions from as many people as possible, especially people who are like those in your sample. Now is not the time to be shy. Ask your friends for help, ask your mentors for feedback, ask your family to take a look at your survey as well. The more feedback you can get on your survey questions, the better the chances that you will come up with a set of questions that are understandable to a wide variety of people and, most importantly, to those in your sample.
In sum, in order to pose effective survey questions, researchers should do the following:
- Identify how each question measures an independent, dependent, or control variable in their study.
- Keep questions clear and succinct.
- Make sure respondents have relevant lived experience to provide informed answers to your questions.
- Use filter questions to avoid getting answers from uninformed participants.
- Avoid questions that are likely to confuse respondents—including those that use double negatives, use culturally specific terms or jargon, and pose more than one question at a time.
- Imagine how respondents would feel responding to questions.
- Get feedback, especially from people who resemble those in the researcher’s sample.
Exercises
Let's complete a first draft of your questions. In the previous exercise, you listed all of the questions and answers you will use to measure the variables in your research question.
- In the previous exercise, you wrote out the questions and answers for each measure of your independent and dependent variables. Evaluate each question using the criteria listed above on effective survey questions.
- Type out questions for your control variables and evaluate them, as well. Consider what response options you want to offer participants.
Now, let's revise any questions that do not meet your standards!
- Use the BRUSO model in Table 12.2 for an illustration of how to address deficits in question wording. Keep in mind that you are writing a first draft in this exercise, and it will take a few drafts and revisions before your questions are ready to distribute to participants.
| Criterion | Poor | Effective |
| --- | --- | --- |
| B – Brief | “Are you now or have you ever been the possessor of a firearm?” | “Have you ever possessed a firearm?” |
| R – Relevant | “Who did you vote for in the last election?” | Note: Only include items that are relevant to your study. |
| U – Unambiguous | “Are you a gun person?” | “Do you currently own a gun?” |
| S – Specific | “How much have you read about the new gun control measure and sales tax?” | “How much have you read about the new sales tax on firearm purchases?” |
| O – Objective | “How much do you support the beneficial new gun control measure?” | “What is your view of the new gun control measure?” |
Writing response options
While posing clear and understandable questions in your survey is certainly important, so too is providing respondents with unambiguous response options. Response options are the answers that you provide to the people completing your questionnaire. Generally, respondents will be asked to choose a single (or best) response to each question you pose. We call questions in which the researcher provides all of the response options closed-ended questions. Keep in mind, closed-ended questions can also instruct respondents to choose multiple response options, rank response options against one another, or assign a percentage to each response option. But be cautious when experimenting with different response options! Accepting multiple responses to a single question may add complexity when it comes to quantitatively analyzing and interpreting your data.
Surveys need not be limited to closed-ended questions. Sometimes survey researchers include open-ended questions in their survey instruments as a way to gather additional details from respondents. An open-ended question does not include response options; instead, respondents are asked to reply to the question in their own way, using their own words. These questions are generally used to find out more about a survey participant’s experiences or feelings about whatever they are being asked to report in the survey. If, for example, a survey includes closed-ended questions asking respondents to report on their involvement in extracurricular activities during college, an open-ended question could ask respondents why they participated in those activities or what they gained from their participation. While responses to such questions may also be captured using a closed-ended format, allowing participants to share some of their responses in their own words can make the experience of completing the survey more satisfying to respondents and can also reveal new motivations or explanations that had not occurred to the researcher. This is particularly important for mixed-methods research. It is possible to analyze open-ended responses quantitatively using content analysis (i.e., counting how often a theme is represented across responses and looking for statistical patterns). However, for most researchers, qualitative data analysis will be needed to analyze open-ended questions, and researchers need to think through how they will analyze any open-ended questions as part of their data analysis plan. We will address qualitative data analysis in greater detail in Chapter 19.
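To make the counting side of content analysis concrete, here is a minimal sketch that assumes you have already coded each open-ended response with one or more themes; the theme labels are invented for illustration.

```python
# A minimal sketch of quantitative content analysis: counting how often each
# (hypothetical) theme appears across coded open-ended responses.

from collections import Counter

coded_responses = [
    ["friendship", "skill-building"],       # themes coded for respondent 1
    ["friendship"],                         # respondent 2
    ["resume-building", "skill-building"],  # respondent 3
]

theme_counts = Counter(theme for themes in coded_responses for theme in themes)
print(theme_counts)  # Counter({'friendship': 2, 'skill-building': 2, 'resume-building': 1})
```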
To keep things simple, we encourage you to use only closed-ended response options in your study. While open-ended questions are not wrong, they are often a sign in our classrooms that students have not fully thought through how to operationally define and measure their key variables. Open-ended questions cannot be operationally defined in advance because you don't know what responses you will get. Instead, you will need to analyze the qualitative data using one of the techniques we discuss in Chapter 19 to interpret your participants' responses.
To write effective response options for closed-ended questions, there are a couple of guidelines worth following. First, be sure that your response options are mutually exclusive. Look back at Figure 12.1, which contains questions about how often and how many drinks respondents consumed. Do you notice that there are no overlapping categories in the response options for these questions? This is another one of those points about question construction that seems fairly obvious but that can be easily overlooked. Response options should also be exhaustive. In other words, every possible response should be covered in the set of response options that you provide. For example, note that in question 10a in Figure 12.1, we have covered all possibilities—those who drank, say, an average of once per month can choose the first response option (“less than one time per week”) while those who drank multiple times a day each day of the week can choose the last response option (“7+”). All the possibilities in between these two extremes are covered by the middle three response options, and every respondent fits into one of the response options we provided.
Earlier in this section, we discussed double-barreled questions. Response options can also be double barreled, and this should be avoided. Figure 12.3 is an example of a question that uses double-barreled response options. Other tips about questions are also relevant to response options, including that participants should be knowledgeable enough to select or decline a response option as well as avoiding jargon and cultural idioms.
Even if you phrase questions and response options clearly, participants are influenced by how many response options are presented on the questionnaire. For Likert scales, five or seven response options generally allow about as much precision as respondents are capable of. However, numerical scales with more options can sometimes be appropriate. For dimensions such as attractiveness, pain, and likelihood, a 0-to-10 scale will be familiar to many respondents and easy for them to use. Regardless of the number of response options, the most extreme ones should generally be “balanced” around a neutral or modal midpoint. An example of an unbalanced rating scale measuring perceived likelihood might look like this:
Unlikely | Somewhat Likely | Likely | Very Likely | Extremely Likely
Because we have four rankings of likely and only one ranking of unlikely, the scale is unbalanced and most responses will be biased toward "likely" rather than "unlikely." A balanced version might look like this:
Extremely Unlikely | Somewhat Unlikely | As Likely as Not | Somewhat Likely | Extremely Likely
In this example, the midpoint is halfway between likely and unlikely. Of course, a middle or neutral response option does not have to be included. Researchers sometimes choose to leave it out because they want to encourage respondents to think more deeply about their response and not simply choose the middle option by default. Fence-sitters are respondents who choose neutral response options, even if they have an opinion. Some people will be drawn to respond “no opinion” even if they have an opinion, particularly if their true opinion is not a socially desirable one. Floaters, on the other hand, are those who choose a substantive answer to a question when really, they don’t understand the question or don’t have an opinion.
As you can see, floating is the flip side of fence-sitting. Thus, the solution to one problem is often the cause of the other. How you decide which approach to take depends on the goals of your research. Sometimes researchers specifically want to learn something about people who claim to have no opinion. In this case, allowing for fence-sitting would be necessary. Other times researchers feel confident their respondents will all be familiar with every topic in their survey. In this case, perhaps it is okay to force respondents to choose one side or another (e.g., agree or disagree) without a middle option (e.g., neither agree nor disagree) or to not include an option like "don't know enough to say" or "not applicable." There is no always-correct solution to either problem. But in general, including a middle option provides a more exhaustive set of response options than excluding one.
The most important check before you finalize your response options is to align them with your operational definitions. As we've discussed before, your operational definitions include your measures (questions and response options) as well as how to interpret those measures in terms of the variable being measured. In particular, you should be able to interpret all response options to a question based on your operational definition of the variable it measures. If you wanted to measure the variable "social class," you might ask one question about a participant's annual income and another about family size. Your operational definition would need to provide clear instructions on how to interpret response options. Your operational definition is basically like this social class calculator from Pew Research, though they include a few more questions in their definition.
To drill down a bit more, as Pew specifies in the section titled "how the income calculator works," the interval/ratio data respondents enter are interpreted using a formula that combines a participant's four responses and categorizes their household into one of three classes—upper, middle, or lower. So, the operational definition includes the four questions comprising the measure and the formula, or interpretation, which converts responses into the three final categories that we are familiar with: lower, middle, and upper class.
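As an illustration only, a simplified operational definition of this kind might look like the sketch below. The cutoffs and the household-size adjustment are invented for teaching purposes—they are not Pew's actual formula—but they show how interval/ratio responses can be converted into an ordinal category.

```python
# A minimal, hypothetical operational definition converting interval/ratio
# responses (income, household size) into an ordinal social class category.
# The adjustment and cutoffs below are made up for illustration.

def social_class(annual_income, household_size):
    """Classify a household as 'lower', 'middle', or 'upper' class."""
    # Adjust income for household size so larger families need more income
    # to land in the same category.
    adjusted_income = annual_income / (household_size ** 0.5)
    if adjusted_income < 30_000:
        return "lower"
    elif adjusted_income < 90_000:
        return "middle"
    else:
        return "upper"

print(social_class(annual_income=70_000, household_size=2))  # "middle"
```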
It is interesting to note that even though the final social class categories are at an ordinal level of measurement, Pew asks four questions that use an interval or ratio level of measurement (depending on the question). This means that respondents provide numerical responses rather than choosing categories like lower, middle, and upper class. It's perfectly normal for operational definitions to change levels of measurement in this way, and it's also perfectly normal for the level of measurement to stay the same. The important thing is that each response option a participant can provide is accounted for by the operational definition. Throw any combination of family size, location, or income at the Pew calculator, and it will sort you into one of those three social class categories.
Unlike Pew's definition, the operational definitions in your study may not need their own webpage to define and describe. For many questions and answers, interpreting response options is easy. If you were measuring "income" instead of "social class," you could simply operationalize the term by asking people to list their total household income before taxes are taken out. Higher values indicate higher income, and lower values indicate lower income. Easy. Regardless of whether your operational definitions are simple or more complex, every response option to every question on your survey (with a few exceptions) should be interpretable using an operational definition of a variable. Just like we want to avoid an everything-but-the-kitchen-sink approach to questions on our questionnaire, you want to make sure your final questionnaire only contains response options that you will use in your study.
One note of caution on interpretation (sorry for repeating this). We want to remind you again that an operational definition should not mention more than one variable. In our example above, your operational definition could not say "a family of three making under $50,000 is lower class; therefore, they are more likely to experience food insecurity." That last clause about food insecurity may well be true, but it's not a part of the operational definition for social class. Each variable (food insecurity and class) should have its own operational definition. If you are talking about how to interpret the relationship between two variables, you are talking about your data analysis plan. We will discuss how to create your data analysis plan beginning in Chapter 14. For now, one consideration is that depending on the statistical test you use to test relationships between variables, you may need nominal, ordinal, or interval/ratio data. Your questions and response options should provide the level of measurement required by the specific statistical tests in your data analysis plan. Once you finalize your data analysis plan, return to your questionnaire to confirm that the level of measurement matches the statistical test you've chosen.
In summary, to write effective response options researchers should do the following:
- Avoid wording that is likely to confuse respondents—including double negatives, culturally specific terms or jargon, and double-barreled response options.
- Ensure response options are relevant to participants' knowledge and experience so they can make an informed and accurate choice.
- Present mutually exclusive and exhaustive response options.
- Consider fence-sitters and floaters, and the use of neutral or "not applicable" response options.
- Define how response options are interpreted as part of an operational definition of a variable.
- Check that the level of measurement matches your operational definitions and the statistical tests in your data analysis plan (once you develop one in the future).
Exercises
Look back at the response options you drafted in the previous exercise. Make sure you have a first draft of response options for each closed-ended question on your questionnaire.
- Using the criteria above, evaluate the wording of the response options for each question on your questionnaire.
- Revise your questions and response options until you have a complete first draft.
- Do your first read-through and provide a dummy answer to each question. Make sure you can link each response option and each question to an operational definition.
- Look ahead to Chapter 14 and consider how each item on your questionnaire will inform your data analysis plan.
From this discussion, we hope it is clear why researchers using quantitative methods spell out all of their plans ahead of time. Ultimately, there should be a straight line from operational definition through measures on your questionnaire to the data analysis plan. If your questionnaire includes response options that are not aligned with operational definitions or not included in the data analysis plan, the responses you receive back from participants won't fit with your conceptualization of the key variables in your study. If you do not fix these errors and proceed with collecting unstructured data, you will lose out on many of the benefits of survey research and face overwhelming challenges in answering your research question.
Designing questionnaires
Based on your work in the previous section, you should have a first draft of the questions and response options for the key variables in your study. Now, you’ll also need to think about how to present your written questions and response options to survey respondents. It's time to write a final draft of your questionnaire and make it look nice. Designing questionnaires takes some thought. First, consider the route of administration for your survey. What we cover in this section will apply equally to paper and online surveys, but if you are planning to use online survey software, you should watch tutorial videos and explore the features of the survey software you will use.
Informed consent & instructions
Writing effective items is only one part of constructing a survey. For one thing, every survey should have a written or spoken introduction that serves two basic functions (Peterson, 2000).[85] One is to encourage respondents to participate in the survey. In many types of research, such encouragement is not necessary either because participants do not know they are in a study (as in naturalistic observation) or because they are part of a subject pool and have already shown their willingness to participate by signing up and showing up for the study. Survey research usually catches respondents by surprise when they answer their phone, go to their mailbox, or check their e-mail—and the researcher must make a good case for why they should agree to participate. Thus, the introduction should briefly explain the purpose of the survey and its importance, provide information about the sponsor of the survey (university-based surveys tend to generate higher response rates), acknowledge the importance of the respondent’s participation, and describe any incentives for participating.
The second function of the introduction is to establish informed consent. Remember that this involves describing to respondents everything that might affect their decision to participate. This includes the topics covered by the survey, the amount of time it is likely to take, the respondent’s option to withdraw at any time, confidentiality issues, and other ethical considerations we covered in Chapter 6. Written consent forms are not always used in survey research (when the research is of minimal risk, completion of the survey instrument is often accepted by the IRB as evidence of consent to participate), so it is important that this part of the introduction be well documented and presented clearly and in its entirety to every respondent.
Organizing items to be easy and intuitive to follow
The introduction should be followed by the substantive questionnaire items. But first, it is important to present clear instructions for completing the questionnaire, including examples of how to use any unusual response scales. Remember that the introduction is the point at which respondents are usually most interested and least fatigued, so it is good practice to start with the most important items for purposes of the research and proceed to less important items. Items should also be grouped by topic or by type. For example, items using the same rating scale (e.g., a 5-point agreement scale) should be grouped together if possible to make things faster and easier for respondents. Demographic items are often presented last because they are least interesting to participants but also easy to answer in the event respondents have become tired or bored. Of course, any survey should end with an expression of appreciation to the respondent.
Questions are often organized thematically. If our survey were measuring social class, perhaps we’d have a few questions asking about employment, others focused on education, and still others on housing and community resources. Those may be the themes around which we organize our questions. Or perhaps it would make more sense to present any questions we had about parents' income and then present a series of questions about estimated future income. Grouping by theme is one way to be deliberate about how you present your questions. Keep in mind that you are surveying people, and these people will be trying to follow the logic in your questionnaire. Jumping from topic to topic can give people a bit of whiplash and may make participants less likely to complete it.
Using a matrix is a nice way of streamlining response options for similar questions. A matrix is a question type that lists a set of questions for which the answer categories are all the same. If you have a set of questions for which the response options are the same, it may make sense to create a matrix rather than posing each question and its response options individually. Not only will this save you some space in your survey but it will also help respondents progress through your survey more easily. A sample matrix can be seen in Figure 12.4.
Once you have grouped similar questions together, you’ll need to think about the order in which to present those question groups. Most survey researchers agree that it is best to begin a survey with questions that will make respondents want to continue (Babbie, 2010; Dillman, 2000; Neuman, 2003).[86] In other words, don’t bore respondents, but don’t scare them away either. There’s some disagreement over where on a survey to place demographic questions, such as those about a person’s age, gender, and race. On the one hand, placing them at the beginning of the questionnaire may lead respondents to think the survey is boring, unimportant, and not something they want to bother completing. On the other hand, if your survey deals with some very sensitive topic, such as child sexual abuse or criminal convictions, you don’t want to scare respondents away or shock them by beginning with your most intrusive questions.
Your participants are human. They will react emotionally to questionnaire items, and they will also try to uncover your research questions and hypotheses. In truth, the order in which you present questions on a survey is best determined by the unique characteristics of your research. When feasible, you should consult with key informants from your target population to determine how best to order your questions. If it is not feasible to do so, think about the unique characteristics of your topic, your questions, and most importantly, your sample. Keeping in mind the characteristics and needs of the people you will ask to complete your survey should help guide you as you determine the most appropriate order in which to present your questions. None of your decisions will be perfect, and all studies have limitations.
Questionnaire length
You’ll also need to consider the time it will take respondents to complete your questionnaire. Surveys vary in length, from just a page or two to a dozen or more pages, which means they also vary in the time it takes to complete them. How long to make your survey depends on several factors. First, what is it that you wish to know? Wanting to understand how grades vary by gender and year in school certainly requires fewer questions than wanting to know how people’s experiences in college are shaped by demographic characteristics, college attended, housing situation, family background, college major, friendship networks, and extracurricular activities. Keep in mind that even if your research question requires a sizable number of questions be included in your questionnaire, do your best to keep the questionnaire as brief as possible. Any hint that you’ve thrown in a bunch of useless questions just for the sake of it will turn off respondents and may make them not want to complete your survey.
Second, and perhaps more important, how long are respondents likely to be willing to spend completing your questionnaire? If you are studying college students, asking them to use their very limited free time to complete your survey may mean they won’t want to spend more than a few minutes on it. But if you ask them to complete your survey during down-time between classes, when there is little work to be done, students may be willing to give you a bit more of their time. Think about places and times that your sampling frame naturally gathers and whether you would be able to either recruit participants or distribute a survey in that context. Estimate how long your participants would reasonably have to complete a survey presented to them during this time. The more you know about your population (such as what weeks have less work and more free time), the better you can target questionnaire length.
The time that survey researchers ask respondents to spend on questionnaires varies greatly. Some researchers advise that surveys should not take longer than about 15 minutes to complete (as cited in Babbie 2010),[87] whereas others suggest that up to 20 minutes is acceptable (Hopper, 2010).[88] As with question order, there is no clear-cut, always-correct answer about questionnaire length. The unique characteristics of your study and your sample should be considered to determine how long to make your questionnaire. For example, if you planned to distribute your questionnaire to students in between classes, you will need to make sure it is short enough to complete before the next class begins.
When designing a questionnaire, a researcher should consider:
- Weighing strengths and limitations of the method of delivery, including the advanced tools in online survey software or the simplicity of paper questionnaires.
- Grouping together items that ask about the same thing.
- Moving any questions about sensitive items to the end of the questionnaire, so as not to scare respondents off.
- Moving any questions that engage the respondent to answer the questionnaire at the beginning, so as not to bore them.
- Timing the length of the questionnaire with a reasonable length of time you can ask of your participants.
- Dedicating time to visual design and ensuring the questionnaire looks professional.
Exercises
Type out a final draft of your questionnaire in a word processor or online survey tool.
- Evaluate your questionnaire using the guidelines above, revise it, and get it ready to share with other student researchers.
Pilot testing and revising questionnaires
A good way to estimate the time it will take respondents to complete your questionnaire (and other potential challenges) is through pilot testing. Pilot testing allows you to get feedback on your questionnaire so you can improve it before you actually administer it. It can be quite expensive and time consuming if you wish to pilot test your questionnaire on a large sample of people who very much resemble the sample to whom you will eventually administer the finalized version of your questionnaire. But you can learn a lot and make great improvements to your questionnaire simply by pilot testing with a small number of people to whom you have easy access (perhaps you have a few friends who owe you a favor). By pilot testing your questionnaire, you can find out how understandable your questions are, get feedback on question wording and order, find out whether any of your questions are boring or offensive, and learn whether there are places where you should have included filter questions. You can also time pilot testers as they take your survey. This will give you a good idea about the estimate to provide respondents when you administer your survey and whether you have some wiggle room to add additional items or need to cut a few items.
Perhaps this goes without saying, but your questionnaire should also have an attractive design. A messy presentation style can confuse respondents or, at the very least, annoy them. Be brief, to the point, and as clear as possible. Avoid cramming too much into a single page. Make your font size readable (at least 12 point or larger, depending on the characteristics of your sample), leave a reasonable amount of space between items, and make sure all instructions are exceptionally clear. If you are using an online survey, ensure that participants can complete it via mobile, computer, and tablet devices. Think about books, documents, articles, or web pages that you have read yourself—which were relatively easy to read and easy on the eyes and why? Try to mimic those features in the presentation of your survey questions. While online survey tools automate much of visual design, word processors are designed for writing all kinds of documents and may need more manual adjustment as part of visual design.
Realistically, your questionnaire will continue to evolve as you develop your data analysis plan over the next few chapters. By now, you should have a complete draft of your questionnaire grounded in an underlying logic that ties together each question and response option to a variable in your study. Once your questionnaire is finalized, you will need to submit it for ethical approval from your professor or the IRB. If your study requires IRB approval, it may be worthwhile to submit your proposal before your questionnaire is completely done. Revisions to IRB protocols are common and it takes less time to review a few changes to questions and answers than it does to review the entire study, so give them the whole study as soon as you can. Once the IRB approves your questionnaire, you cannot change it without their okay.
Key Takeaways
- A questionnaire is comprised of self-report measures of variables in a research study.
- Make sure your survey questions will be relevant to all respondents and that you use filter questions when necessary.
- Effective survey questions and responses take careful construction by researchers, as participants may be confused or otherwise influenced by how items are phrased.
- The questionnaire should start with informed consent and instructions, flow logically from one topic to the next, engage but not shock participants, and thank participants at the end.
- Pilot testing can help identify any issues in a questionnaire before distributing it to participants, including language or length issues.
Exercises
It's a myth that researchers work alone! Get together with a few of your fellow students and swap questionnaires for pilot testing.
- Use the criteria in each section above (questions, response options, questionnaires) and provide your peers with the strengths and weaknesses of their questionnaires.
- See if you can guess their research question and hypothesis based on the questionnaire alone.
11.3 Measurement quality
Learning Objectives
Learners will be able to...
- Define and describe the types of validity and reliability
- Assess for systematic error
The previous chapter provided insight into measuring concepts in social work research. We discussed the importance of identifying concepts and their corresponding indicators as a way to help us operationalize them. In essence, we now understand that when we think about our measurement process, we must be intentional and thoughtful in the choices that we make. This section is all about how to judge the quality of the measures you've chosen for the key variables in your research question.
Reliability
First, let’s say we’ve decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism this way, then it is likely that anyone who identifies as an alcoholic would respond “yes.” This may seem like a good way to identify our group of interest, but think about how you and your peer group may respond to this question. Would participants respond differently after a wild night out, compared to any other night? Could an infrequent drinker’s current headache from last night’s glass of wine influence how they answer the question this morning? How would that same person respond to the question before consuming the wine? In each case, the same person might respond differently to the same question at different points, so it is possible that our measure of alcoholism has a reliability problem. Reliability in measurement is about consistency.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behavior, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past 3 months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. On the other hand, we might get more accurate responses if we ask a participant how many drinks of any kind they have consumed in the past week.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviors. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. They may decide to conduct observations at a local pub by noting how many drinks patrons consume and how their behavior changes as their intake changes. What if the researcher has to use the restroom, and the patron next to them takes three shots of tequila during the brief period the researcher is away from their seat? The reliability of this researcher’s measure of alcohol intake depends on their ability to physically observe every instance of patrons consuming drinks. If they are unlikely to be able to observe every such instance, then perhaps their mechanism for measuring this concept is not reliable.
The following subsections describe the types of reliability that are important for you to know about, but keep in mind that you may see other approaches to judging reliability mentioned in the empirical literature.
Test-retest reliability
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time and using it again on the same group of people at a later time. Unlike an experiment, you aren't giving participants an intervention but trying to establish a reliable baseline of the variable you are measuring. Once you have these two measurements, you then look at the correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 11.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
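If you are curious how that correlation coefficient is computed, here is a minimal sketch using invented scores for eight participants (these are not the data shown in Figure 11.2).

```python
# A minimal sketch of a test-retest correlation using made-up scale scores.

import numpy as np

time1 = np.array([22, 25, 18, 30, 27, 24, 19, 28])  # first administration
time2 = np.array([23, 24, 17, 29, 28, 25, 20, 27])  # same people, one week later

r = np.corrcoef(time1, time2)[0, 1]  # Pearson correlation coefficient
print(round(r, 2))  # +.80 or greater is usually taken to indicate good reliability
```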
Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
Internal consistency
Another kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials. A specific statistical test known as Cronbach’s Alpha provides a way to measure how well each question of a scale is related to the others.
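For the statistically curious, here is a minimal sketch of Cronbach's Alpha computed from invented responses, where each row is a participant and each column is an item on the scale.

```python
# A minimal sketch of Cronbach's Alpha for a multiple-item measure.
# Data are invented: rows are participants, columns are items (1-5 ratings).

import numpy as np

responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])

k = responses.shape[1]                               # number of items
item_variances = responses.var(axis=0, ddof=1)       # variance of each item
total_variance = responses.sum(axis=1).var(ddof=1)   # variance of total scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))  # values around .70 or higher are often treated as acceptable
```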
Interrater reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other.
Validity
Validity, another key element of assessing measurement quality, is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.
Face validity
Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.
Content validity
Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feel good about exercising, and actually exercise. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Criterion validity
Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
Discriminant validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
Increasing the reliability and validity of measures
We have reviewed the types of errors and how to evaluate our measures based on reliability and validity considerations. However, what can we do while selecting or creating our tool so that we minimize the potential for errors? Many of our options were covered in our discussion about reliability and validity. Nevertheless, the following table provides a quick summary of things that you should do when creating or selecting a measurement tool. While not all of these will be feasible in your project, it is important to implement those that are feasible in your research context.
Make sure that you engage in a rigorous literature review so that you understand the concept that you are studying. This means understanding the different ways that your concept may manifest itself. This review should include a search for existing instruments.[89]
- Do you understand all the dimensions of your concept? Do you have a good understanding of the content dimensions of your concept(s)?
- What instruments exist? How many items are on the existing instruments? Are these instruments appropriate for your population?
- Are these instruments standardized? Note: If an instrument is standardized, that means it has been rigorously studied and tested.
Consult content experts to review your instrument. This is a good way to check the face validity of your items. Additionally, content experts can also help you understand the content validity.[90]
- Do you have access to a reasonable number of content experts? If not, how can you locate them?
- Did you provide a list of critical questions for your content reviewers to use in the reviewing process?
Pilot test your instrument on a sufficient number of people and get detailed feedback.[91] Ask your group to provide feedback on the wording and clarity of items. Keep detailed notes and make adjustments BEFORE you administer your final tool.
- How many people will you use in your pilot testing?
- How will you set up your pilot testing so that it mimics the actual process of administering your tool?
- How will you receive feedback from your pilot testing group? Have you provided a list of questions for your group to think about?
Provide training for anyone collecting data for your project.[92] You should provide those helping you with a written research protocol that explains all of the steps of the project. You should also problem solve and answer any questions that those helping you may have. This will increase the chances that your tool will be administered in a consistent manner.
- How will you conduct your orientation/training? How long will it be? What modality?
- How will you select those who will administer your tool? What qualifications do they need?
When thinking of items, use a higher level of measurement, if possible.[93] This will provide more information and you can always downgrade to a lower level of measurement later.
- Have you examined your items and the levels of measurement?
- Have you thought about whether you need to modify the type of data you are collecting? Specifically, are you asking for information that is too specific (at a higher level of measurement) which may reduce participants' willingness to participate?
Use multiple indicators for a variable.[94] Think about the number of items that you will include in your tool.
- Do you have enough items? Enough indicators? The correct indicators?
Conduct an item-by-item assessment of multiple-item measures.[95] When you do this assessment, think about each word and how it changes the meaning of your item.
- Are there items that are redundant? Do you need to modify, delete, or add items?
Types of error
As you can see, measures never perfectly describe what exists in the real world. Good measures demonstrate validity and reliability but will always have some degree of error. Systematic error (also called bias) causes our measures to consistently output data that are incorrect in one direction or another, usually due to an identifiable process. Imagine you created a measure of height, but you didn’t put an option for anyone over six feet tall. If you gave that measure to your local college or university, some of the taller students might not be measured accurately. In fact, you would be under the mistaken impression that the tallest person at your school was six feet tall, when in actuality there are likely people taller than six feet at your school. This error seems innocent, but if you were using that measure to help you build a new building, those people might hit their heads!
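Here is a minimal sketch of how that kind of systematic error plays out, using invented heights and a measure that tops out at six feet; the flawed measure is consistently biased downward.

```python
# A minimal sketch of systematic error: a height measure that cannot record
# anyone over 72 inches. All numbers are invented for illustration.

import random

true_heights = [random.gauss(68, 4) for _ in range(1000)]  # true heights (inches)

# The flawed measure "tops out" at 72 inches, so tall people are under-recorded.
measured_heights = [min(height, 72) for height in true_heights]

print(round(sum(true_heights) / len(true_heights), 1))          # near the true average
print(round(sum(measured_heights) / len(measured_heights), 1))  # consistently lower
```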
A less innocent form of error arises when researchers word questions in a way that might cause participants to think one answer choice is preferable to another. For example, if I were to ask you “Do you think global warming is caused by human activity?” you would probably feel comfortable answering honestly. But what if I asked you “Do you agree with 99% of scientists that global warming is caused by human activity?” Would you feel comfortable saying no, if that’s what you honestly felt? I doubt it. That is an example of a leading question, a question with wording that influences how a participant responds. We’ll discuss leading questions and other problems in question wording in greater detail in Chapter 12.
In addition to error created by the researcher, your participants can cause error in measurement. Some people will respond without fully understanding a question, particularly if the question is worded in a confusing way. Let’s consider another potential source of error. If we asked people if they always washed their hands after using the bathroom, would we expect people to be perfectly honest? Polling people about whether they wash their hands after using the bathroom might only elicit what people would like others to think they do, rather than what they actually do. This is an example of social desirability bias, in which participants in a research study want to present themselves in a positive, socially desirable way to the researcher. People in your study will want to seem tolerant, open-minded, and intelligent, but their true feelings may be closed-minded, simple, and biased. Participants may lie in this situation. This occurs often in political polling, which may show greater support for a candidate from a minority race, gender, or political party than actually exists in the electorate.
A related form of bias is called acquiescence bias, also known as “yea-saying.” It occurs when people say yes to whatever the researcher asks, even when doing so contradicts previous answers. For example, a person might say yes to both “I am a confident leader in group discussions” and “I feel anxious interacting in group discussions.” Those two responses are unlikely to both be true for the same person. Why would someone do this? Similar to social desirability, people want to be agreeable and nice to the researcher asking them questions or they might ignore contradictory feelings when responding to each question. You could interpret this as someone saying "yeah, I guess." Respondents may also act on cultural reasons, trying to “save face” for themselves or the person asking the questions. Regardless of the reason, the results of your measure don’t match what the person truly feels.
So far, we have discussed sources of error that come from choices made by respondents or researchers. Systematic errors will result in responses that are incorrect in one direction or another. For example, social desirability bias usually means that the number of people who say they will vote for a third party in an election is greater than the number of people who actually vote for that candidate. Systematic errors such as these can be reduced, but random error can never be eliminated. Unlike systematic error, which biases responses consistently in one direction or another, random error is unpredictable and does not result in scores that are consistently higher or lower on a given measure. Instead, random error is more like statistical noise, which will likely average out across participants.
Random error is present in any measurement. If you’ve ever stepped on a bathroom scale twice and gotten two slightly different results, maybe a difference of a tenth of a pound, then you’ve experienced random error. Maybe you were standing slightly differently or had a fraction of your foot off of the scale the first time. If you were to take enough measures of your weight on the same scale, you’d be able to figure out your true weight. In social science, if you gave someone a scale measuring depression on a day after they lost their job, they would likely score differently than if they had just gotten a promotion and a raise. Even if the person were clinically depressed, our measure is subject to influence by the random occurrences of life. Thus, social scientists speak with humility about our measures. We are reasonably confident that what we found is true, but we must always acknowledge that our measures are only an approximation of reality.
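If it helps to see the distinction in code, here is a minimal Python sketch. It simulates a bathroom scale with a made-up 2-pound systematic bias plus a little random wobble on each reading; the numbers are purely illustrative, but they show how random error averages out across many measurements while systematic error does not.

```python
import random

random.seed(42)

true_weight = 150.0  # the person's actual weight in pounds

# Systematic error: this hypothetical scale always reads 2 pounds heavy.
# Random error: each individual reading also wobbles a little, unpredictably.
readings = [true_weight + 2.0 + random.gauss(0, 0.3) for _ in range(1000)]

average_reading = sum(readings) / len(readings)
print(f"True weight:     {true_weight:.1f}")
print(f"Average reading: {average_reading:.1f}")

# The random wobble largely cancels out across 1,000 readings, so the average
# lands near 152 -- but the 2-pound systematic bias never averages away.
```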
Humility is important in scientific measurement, as errors can have real consequences. At the time I'm writing this, my wife and I are expecting our first child. Like most people, we used a pregnancy test from the pharmacy. If the test said my wife was pregnant when she was not pregnant, that would be a false positive. On the other hand, if the test indicated that she was not pregnant when she was in fact pregnant, that would be a false negative. Even if the test is 99% accurate, that means that one in a hundred women will get an erroneous result when they use a home pregnancy test. For us, a false positive would have been initially exciting, then devastating when we found out we were not having a child. A false negative would have been disappointing at first and then quite shocking when we found out we were indeed having a child. While both false positives and false negatives are not very likely for home pregnancy tests (when taken correctly), measurement error can have consequences for the people being measured.
Key Takeaways
- Reliability is a matter of consistency.
- Validity is a matter of accuracy.
- There are many types of validity and reliability.
- Systematic error may arise from the researcher, participant, or measurement instrument.
- Systematic error biases results in a particular direction, whereas random error can be in any direction.
- All measures are prone to error and should be interpreted with humility.
Exercises
Use the measurement tools you located in the previous exercise. Evaluate the reliability and validity of these tools. Hint: You will need to go into the literature to "research" these tools.
- Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
- Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
- If you decide to create your own tool, how will you assess its validity and reliability?
Chapter Outline
- Operational definitions (36 minute read)
- Writing effective questions and questionnaires (38 minute read)
- Measurement quality (21 minute read)
Content warning: examples in this chapter contain references to ethnocentrism, toxic masculinity, racism in science, drug use, mental health and depression, psychiatric inpatient care, poverty and basic needs insecurity, pregnancy, and racism and sexism in the workplace and higher education.
11.1 Operational definitions
Learning Objectives
Learners will be able to...
- Define and give an example of indicators and attributes for a variable
- Apply the three components of an operational definition to a variable
- Distinguish between levels of measurement for a variable and how those differences relate to measurement
- Describe the purpose of composite measures like scales and indices
Last chapter, we discussed conceptualizing your project. Conceptual definitions are like dictionary definitions. They tell you what a concept means by defining it using other concepts. In this section we will move from the abstract realm (conceptualization) to the real world (measurement).
Operationalization is the process by which researchers spell out precisely how a concept will be measured in their study. It involves identifying the specific research procedures we will use to gather data about our concepts. If conceptually defining your terms means looking at theory, how do you operationally define your terms? By looking for indicators of when your variable is present or not, more or less intense, and so forth. Operationalization is probably the most challenging part of quantitative research, but once it's done, the design and implementation of your study will be straightforward.
Indicators
Operationalization works by identifying specific indicators that will be taken to represent the ideas we are interested in studying. If we are interested in studying masculinity, then the indicators for that concept might include some of the social roles prescribed to men in society such as breadwinning or fatherhood. Being a breadwinner or a father might therefore be considered indicators of a person’s masculinity. The extent to which a man fulfills either, or both, of these roles might be understood as clues (or indicators) about the extent to which he is viewed as masculine.
Let’s look at another example of indicators. Each day, Gallup researchers poll 1,000 randomly selected Americans to ask them about their well-being. To measure well-being, Gallup asks these people to respond to questions covering six broad areas: physical health, emotional health, work environment, life evaluation, healthy behaviors, and access to basic necessities. Gallup uses these six factors as indicators of the concept that they are really interested in, which is well-being.
Identifying indicators can be even simpler than the examples described thus far. Political party affiliation is another relatively easy concept for which to identify indicators. If you asked a person what party they voted for in the last national election (or gained access to their voting records), you would get a good indication of their party affiliation. Of course, some voters split tickets between multiple parties when they vote and others swing from party to party each election, so our indicator is not perfect. Indeed, if our study were about political identity as a key concept, operationalizing it solely in terms of who they voted for in the previous election leaves out a lot of information about identity that is relevant to that concept. Nevertheless, it's a pretty good indicator of political party affiliation.
Choosing indicators is not an arbitrary process. As described earlier, utilizing prior theoretical and empirical work in your area of interest is a great way to identify indicators in a scholarly manner. And your conceptual definitions will point you in the direction of relevant indicators. Empirical work will give you some very specific examples of how the important concepts in an area have been measured in the past and what sorts of indicators have been used. Often, it makes sense to use the same indicators as previous researchers; however, you may find that some previous measures have potential weaknesses that your own study will improve upon.
All of the examples in this chapter have dealt with questions you might ask a research participant on a survey or in a quantitative interview. If you plan to collect data from other sources, such as through direct observation or the analysis of available records, think practically about what the design of your study might look like and how you can collect data on various indicators feasibly. If your study asks about whether the participant regularly changes the oil in their car, you will likely not observe them directly doing so. Instead, you will likely need to rely on a survey question that asks them the frequency with which they change their oil or ask to see their car maintenance records.
Exercises
- What indicators are commonly used to measure the variables in your research question?
- How can you feasibly collect data on these indicators?
- Are you planning to collect your own data using a questionnaire or interview? Or are you planning to analyze available data like client files or raw data shared from another researcher's project?
Remember, you need raw data. Your research project cannot rely solely on the results reported by other researchers or the arguments you read in the literature. A literature review is only the first part of a research project, and your review of the literature should inform the indicators you end up choosing when you measure the variables in your research question.
Unlike conceptual definitions, which contain other concepts, an operational definition consists of the following components: (1) the variable being measured and its attributes, (2) the measure you will use, and (3) how you plan to interpret the data collected from that measure to draw conclusions about the variable you are measuring.
Step 1: Specifying variables and attributes
The first component, the variable, should be the easiest part. At this point in quantitative research, you should have a research question that has at least one independent and at least one dependent variable. Remember that variables must be able to vary. For example, the United States is not a variable. Country of residence is a variable, as is patriotism. Similarly, if your sample only includes men, gender is a constant in your study, not a variable. A constant is a characteristic that does not change in your study.
When social scientists measure concepts, they sometimes use the language of variables and attributes. A variable refers to a quality or quantity that varies across people or situations. Attributes are the characteristics that make up a variable. For example, the variable hair color would contain attributes like blonde, brown, black, red, gray, etc. A variable’s attributes determine its level of measurement. There are four possible levels of measurement: nominal, ordinal, interval, and ratio. The first two levels of measurement are categorical, meaning their attributes are categories rather than numbers. The latter two levels of measurement are continuous, meaning their attributes are numbers.
Levels of measurement
Hair color is an example of a nominal level of measurement. Nominal measures are categorical, and those categories cannot be mathematically ranked. As a brown-haired person (with some gray), I can't say for sure that brown-haired people are better than blonde-haired people. As with all nominal levels of measurement, there is no ranking order between hair colors; they are simply different. That is what constitutes a nominal level of measurement. Gender and race are also measured at the nominal level.
What attributes are contained in the variable hair color? While blonde, brown, black, and red are common colors, some people may not fit into these categories if we only list these attributes. My wife, who currently has purple hair, wouldn't fit anywhere. This means that our attributes were not exhaustive. Exhaustiveness means that all possible attributes are listed. We may have to list a lot of colors before we can meet the criteria of exhaustiveness. Clearly, there is a point at which exhaustiveness has been reasonably met. If a person insists that their hair color is light burnt sienna, it is not your responsibility to list that as an option. Rather, that person would reasonably be described as brown-haired. Perhaps listing a category for "other" colors would suffice to make our list of colors exhaustive.
What about a person who has multiple hair colors at the same time, such as red and black? They would fall into multiple attributes. This violates the rule of mutual exclusivity, in which a person cannot fall into two different attributes. Instead of listing all of the possible combinations of colors, perhaps you might include a multi-color attribute to describe people with more than one hair color.
Making sure researchers provide mutually exclusive and exhaustive attributes is about making sure all people are represented in the data record. For many years, the attributes for gender were only male or female. Now, our understanding of gender has evolved to encompass more attributes that better reflect the diversity in the world. Children of parents from different races were often classified as one race or another, even if they identified with both cultures. The option for bi-racial or multi-racial on a survey not only more accurately reflects the racial diversity in the real world but validates and acknowledges people who identify in that manner. If we did not measure race in this way, we would leave empty the data record for people who identify as biracial or multiracial, impairing our search for truth.
Unlike nominal-level measures, attributes at the ordinal level can be rank ordered. For example, someone’s degree of satisfaction in their romantic relationship can be ordered by rank. That is, you could say you are not at all satisfied, a little satisfied, moderately satisfied, or highly satisfied. Note that even though these have a rank order to them (not at all satisfied is certainly worse than highly satisfied), we cannot calculate a mathematical distance between those attributes. We can simply say that one attribute of an ordinal-level variable is more or less than another attribute.
This can get a little confusing when using rating scales. If you have ever taken a customer satisfaction survey or completed a course evaluation for school, you are familiar with rating scales. “On a scale of 1-5, with 1 being the lowest and 5 being the highest, how likely are you to recommend our company to other people?” That surely sounds familiar. Rating scales use numbers, but only as a shorthand, to indicate what attribute (highly likely, somewhat likely, etc.) the person feels describes them best. You wouldn’t say you are “2” likely to recommend the company, but you would say you are not very likely to recommend the company. Ordinal-level attributes must also be exhaustive and mutually exclusive, as with nominal-level variables.
At the interval level, attributes must also be exhaustive and mutually exclusive and there is equal distance between attributes. Interval measures are also continuous, meaning their attributes are numbers, rather than categories. IQ scores are interval level, as are temperatures in Fahrenheit and Celsius. Their defining characteristic is that we can say how much more or less one attribute differs from another. We cannot, however, say with certainty what the ratio of one attribute is in comparison to another. For example, it would not make sense to say that a person with an IQ score of 140 has twice the IQ of a person with a score of 70. However, the difference between IQ scores of 80 and 100 is the same as the difference between IQ scores of 120 and 140.
While we cannot say that someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is measured at the interval level, we can say that someone with six siblings has twice as many as someone with three because number of siblings is measured at the ratio level. Finally, at the ratio level, attributes are mutually exclusive and exhaustive, attributes can be rank ordered, the distance between attributes is equal, and attributes have a true zero point. Thus, with these variables, we can say what the ratio of one attribute is in comparison to another. Examples of ratio-level variables include age and years of education. We know that a person who is 12 years old is twice as old as someone who is 6 years old. Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. The differences between each level of measurement are visualized in Table 11.1.
| | Nominal | Ordinal | Interval | Ratio |
| --- | --- | --- | --- | --- |
| Exhaustive | X | X | X | X |
| Mutually exclusive | X | X | X | X |
| Rank-ordered | | X | X | X |
| Equal distance between attributes | | | X | X |
| True zero point | | | | X |
Levels of measurement = levels of specificity
We have spent time learning how to determine our data's level of measurement. Now what? How could we use this information to help us as we measure concepts and develop measurement tools? First, the types of statistical tests that we are able to use depend on our data's level of measurement. With nominal-level measurement, for example, the only available measure of central tendency is the mode. With ordinal-level measurement, the median or mode can be used as indicators of central tendency. Interval and ratio-level measurement are typically considered the most desirable because they permit any indicator of central tendency (i.e., mean, median, or mode) to be computed. Also, ratio-level measurement is the only level that allows meaningful statements about ratios of scores. The higher the level of measurement, the more complex the statistical tests we are able to conduct. This knowledge may help us decide what kind of data we need to gather, and how.
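To make that concrete, here is a small Python sketch using the standard library's statistics module. The example values are made up, but they show which measure of central tendency is meaningful at each level.

```python
import statistics

# Nominal: unordered categories -- only the mode is meaningful.
hair_color = ["brown", "blonde", "brown", "black", "brown", "red"]
print(statistics.mode(hair_color))      # brown

# Ordinal: ranked categories -- the median (middle rank) is also meaningful.
# 1 = not at all satisfied ... 4 = highly satisfied
satisfaction = [1, 2, 2, 3, 4]
print(statistics.median(satisfaction))  # 2

# Ratio: true numbers with a real zero point -- the mean is meaningful too.
ages = [19, 21, 22, 25, 38]
print(statistics.mean(ages))            # 25
```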
That said, we have to balance this knowledge with the understanding that sometimes, collecting data at a higher level of measurement could negatively impact our studies. For instance, sometimes providing answers in ranges may make prospective participants feel more comfortable responding to sensitive items. Imagine that you were interested in collecting information on topics such as income, number of sexual partners, number of times someone used illicit drugs, etc. You would have to think about the sensitivity of these items and determine if it would make more sense to collect some data at a lower level of measurement (e.g., asking whether they are sexually active (nominal) versus their total number of sexual partners (ratio)).
Finally, sometimes when analyzing data, researchers find a need to change a variable's level of measurement. For example, a few years ago, a student of mine was interested in studying the relationship between mental health and life satisfaction. This student used a variety of measures. One item asked about the number of mental health symptoms, reported as the actual number. When analyzing the data, my student examined the mental health symptom variable and noticed that she had two groups: those with no symptoms or only one symptom and those with many symptoms. Instead of using the ratio-level data (actual number of mental health symptoms), she collapsed her cases into two categories, few and many, and used this variable in her analyses. It is important to note that you can move data from a higher level of measurement to a lower level; however, you cannot move data from a lower level to a higher level.
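Here is a quick Python sketch of that kind of recoding, with made-up symptom counts and an arbitrary cutoff chosen only for illustration.

```python
# Hypothetical symptom counts for ten participants (ratio level).
symptom_counts = [0, 1, 0, 7, 9, 1, 8, 0, 6, 10]

# Collapse to two categories ("few"/"many"); the cutoff of 2 is arbitrary here.
def collapse(count, cutoff=2):
    return "few" if count < cutoff else "many"

collapsed = [collapse(c) for c in symptom_counts]
print(collapsed)

# Note the information loss: from the collapsed categories alone, you can never
# recover the original counts. You can move down a level, but not back up.
```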
Exercises
- Check that the variables in your research question can vary...and that they are not constants or one of many potential attributes of a variable.
- Think about the attributes your variables have. Are they categorical or continuous? What level of measurement seems most appropriate?
Step 2: Specifying measures for each variable
Let's pick a social work research question and walk through the process of operationalizing variables to see how specific we need to get. I'm going to hypothesize that residents of a psychiatric unit who are more depressed are less likely to be satisfied with care. Remember, this would be an inverse relationship—as depression increases, satisfaction decreases. In this question, depression is my independent variable (the cause) and satisfaction with care is my dependent variable (the effect). Now that we have identified our variables, their attributes, and levels of measurement, we can move on to the second component: the measure itself.
So, how would you measure my key variables: depression and satisfaction? What indicators would you look for? Some students might say that depression could be measured by observing a participant's body language. They may also say that a depressed person will often express feelings of sadness or hopelessness. In addition, a satisfied person might be happy around service providers and often express gratitude. While these factors may indicate that the variables are present, they lack coherence. Unfortunately, what this "measure" is actually saying is "I know depression and satisfaction when I see them." While you are likely a decent judge of depression and satisfaction, you need to provide more information in a research study about how you plan to measure your variables. Your judgment is subjective, based on your own idiosyncratic experiences with depression and satisfaction. It couldn't be replicated by another researcher, and it can't be applied consistently to a large group of people. Operationalization requires that you come up with a specific and rigorous measure for seeing who is depressed or satisfied.
Finding a good measure for your variable depends on the kind of variable it is. Variables that are directly observable don't come up very often in my students' classroom projects, but they might include things like taking someone's blood pressure, marking attendance or participation in a group, and so forth. To measure an indirectly observable variable like age, you would probably put a question on a survey that asked, “How old are you?” Measuring a variable like income might require some more thought, though. Are you interested in this person’s individual income or the income of their family unit? This might matter if your participant does not work or is dependent on other family members for income. Do you count income from social welfare programs? Are you interested in their income per month or per year? Even though indirect observables are relatively easy to measure, the measures you use must be clear in what they are asking, and operationalization is all about figuring out the specifics of what you want to know. For more complicated constructs, you will need compound measures (that use multiple indicators to measure a single variable).
How you plan to collect your data also influences how you will measure your variables. For social work researchers using secondary data like client records as a data source, you are limited by what information is in the data sources you can access. If your organization uses a given measurement for a mental health outcome, that is the one you will use in your study. Similarly, if you plan to study how long a client was housed after an intervention using client visit records, you are limited by how their caseworker recorded their housing status in the chart. One of the benefits of collecting your own data is being able to select the measures you feel best exemplify your understanding of the topic.
Measuring unidimensional concepts
The previous section mentioned two important considerations: how complicated the variable is and how you plan to collect your data. With these in hand, we can use the level of measurement to further specify how you will measure your variables and consider specialized rating scales developed by social science researchers.
Measurement at each level
Nominal measures assess categorical variables. These measures are used for variables or indicators that have mutually exclusive attributes, but that cannot be rank-ordered. Nominal measures ask about the variable and provide names or labels for different attribute values like social work, counseling, and nursing for the variable profession. Nominal measures are relatively straightforward.
Ordinal measures often use a rating scale, which is an ordered set of responses that participants must choose from. Figure 11.1 shows several examples. The number of response options on a typical rating scale is usually five or seven, though it can range from three to 11. Five-point scales are best for unipolar scales where only one construct is tested, such as frequency (Never, Rarely, Sometimes, Often, Always). Seven-point scales are best for bipolar scales where there is a dichotomous spectrum, such as liking (Like very much, Like somewhat, Like slightly, Neither like nor dislike, Dislike slightly, Dislike somewhat, Dislike very much). For bipolar questions, it is useful to offer an earlier question that branches respondents into an area of the scale; if asking about liking ice cream, first ask "Do you generally like or dislike ice cream?" Once the respondent chooses like or dislike, refine their answer by offering the relevant choices from the seven-point scale. Branching improves both reliability and validity (Krosnick & Berent, 1993).[96] Although you often see scales with numerical labels, it is best to present only verbal labels to the respondents and convert them to numerical values in the analyses. Avoid partial labels and lengthy or overly specific labels. In some cases, the verbal labels can be supplemented with (or even replaced by) meaningful graphics. The last rating scale shown in Figure 11.1 is a visual-analog scale, on which participants make a mark somewhere along the horizontal line to indicate the magnitude of their response.
Interval measures are those where the values measured are not only rank-ordered, but are also equidistant from adjacent attributes. For example, on the temperature scale (in Fahrenheit or Celsius), the difference between 30 and 40 degrees Fahrenheit is the same as that between 80 and 90 degrees Fahrenheit. Likewise, if you have a scale that asks respondents' annual income using the following attributes (ranges): $0 to 10,000, $10,000 to 20,000, $20,000 to 30,000, and so forth, this is also an interval measure, because the mid-points of each range (i.e., $5,000, $15,000, $25,000, etc.) are equidistant from each other. The intelligence quotient (IQ) scale is also an interval measure, because the measure is designed such that the difference between IQ scores of 100 and 110 is supposed to be the same as that between 110 and 120 (although we do not really know whether that is truly the case). Interval measures allow us to examine "how much more" one attribute is when compared to another, which is not possible with nominal or ordinal measures. You may find researchers who "pretend" (incorrectly) that ordinal rating scales are actually interval measures so that they can use different statistical techniques for analyzing them. As we will discuss in the latter part of the chapter, this is a mistake because there is no way to know whether the difference between a 3 and a 4 on a rating scale is the same as the difference between a 2 and a 3. Those numbers are just placeholders for categories.
Ratio measures are those that have all the qualities of nominal, ordinal, and interval scales, and in addition, also have a "true zero" point (where the value zero implies lack or non-availability of the underlying construct). Think about how to measure the number of people working in human resources at a social work agency. It could be one, several, or none (if the company contracts out for those services). Measuring interval and ratio data is relatively easy, as people either select or input a number for their answer. If you ask a person how many eggs they purchased last week, they can simply tell you they purchased a dozen eggs at the store, two at breakfast on Wednesday, or none at all.
Commonly used rating scales in questionnaires
The level of measurement will give you the basic information you need, but social scientists have developed specialized instruments for use in questionnaires, a common tool used in quantitative research. As we mentioned before, if you plan to source your data from client files or previously published results, you are limited to the measures that already exist in those sources.
Although Likert scale is a term colloquially used to refer to almost any rating scale (e.g., a 0-to-10 life satisfaction scale), it has a much more precise meaning. In the 1930s, researcher Rensis Likert (pronounced LICK-ert) created a new approach for measuring people’s attitudes (Likert, 1932).[97] It involves presenting people with several statements—including both favorable and unfavorable statements—about some person, group, or idea. Respondents then express their agreement or disagreement with each statement on a 5-point scale: Strongly Agree, Agree, Neither Agree nor Disagree, Disagree, Strongly Disagree. Numbers are assigned to each response and then summed across all items to produce a score representing the attitude toward the person, group, or idea. For items that are phrased in an opposite direction (e.g., negatively worded statements instead of positively worded statements), reverse coding is used so that the numerical scoring of statements also runs in the opposite direction. The entire set of items came to be called a Likert scale, as indicated in Table 11.2 below.
Unless you are measuring people’s attitude toward something by assessing their level of agreement with several statements about it, it is best to avoid calling it a Likert scale. You are probably just using a rating scale. Likert scales allow for more granularity (more finely tuned response) than yes/no items, including whether respondents are neutral to the statement. Below is an example of how we might use a Likert scale to assess your attitudes about research as you work your way through this textbook.
| | Strongly agree | Agree | Neutral | Disagree | Strongly disagree |
| --- | --- | --- | --- | --- | --- |
| I like research more now than when I started reading this book. | | | | | |
| This textbook is easy to use. | | | | | |
| I feel confident about how well I understand levels of measurement. | | | | | |
| This textbook is helping me plan my research proposal. | | | | | |
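If it helps to see the scoring logic spelled out, here is a rough Python sketch of how responses to a Likert scale like the one above might be scored. The participant's answers and the negatively worded fourth item are hypothetical (it does not appear in the table), and the 5-to-1 point values and reverse coding follow the general approach described earlier.

```python
# Point values for each verbal response option.
POINTS = {"Strongly agree": 5, "Agree": 4, "Neutral": 3,
          "Disagree": 2, "Strongly disagree": 1}

# One participant's hypothetical responses. The last item is negatively
# worded, so it is flagged for reverse coding.
responses = [
    ("I like research more now than when I started reading this book.", "Agree", False),
    ("This textbook is easy to use.", "Strongly agree", False),
    ("I feel confident about how well I understand levels of measurement.", "Neutral", False),
    ("Research is too difficult for me to ever understand.", "Disagree", True),
]

total = 0
for item, answer, reverse_coded in responses:
    score = POINTS[answer]
    if reverse_coded:
        score = 6 - score  # flips 1<->5 and 2<->4 so all items run the same direction
    total += score

print(total)  # 4 + 5 + 3 + (6 - 2) = 16 out of a possible 20
```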
Semantic differential scales are composite (multi-item) scales in which respondents are asked to indicate their opinions or feelings toward a single statement using different pairs of adjectives framed as polar opposites. Whereas in the above Likert scale, the participant is asked how much they agree or disagree with a statement, in a semantic differential scale the participant is asked to indicate how they feel about a specific item. This makes the semantic differential scale an excellent technique for measuring people’s attitudes or feelings toward objects, events, or behaviors. Table 11.3 is an example of a semantic differential scale that was created to assess participants' feelings about this textbook.
1) How would you rate your opinions toward this textbook?

| | Very much | Somewhat | Neither | Somewhat | Very much | |
| --- | --- | --- | --- | --- | --- | --- |
| Boring | | | | | | Exciting |
| Useless | | | | | | Useful |
| Hard | | | | | | Easy |
| Irrelevant | | | | | | Applicable |
A third kind of composite scale, the Guttman scale, was designed by Louis Guttman and uses a series of items arranged in increasing order of intensity (least intense to most intense) of the concept. This type of scale allows us to understand the intensity of beliefs or feelings. Each item in the example Guttman scale below has a weight (this is not indicated on the tool) which varies with the intensity of that item, and the weighted combination of each response is used as an aggregate measure of an observation.
Example Guttman Scale Items
- I often felt the material was not engaging Yes/No
- I was often thinking about other things in class Yes/No
- I was often working on other tasks during class Yes/No
- I will work to abolish research from the curriculum Yes/No
Notice how the items move from lower intensity to higher intensity. A researcher reviews the yes answers and creates a score for each participant.
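As a rough illustration, here is how a simple, unweighted version of that scoring might look in Python; remember that a true Guttman scale also attaches weights to items. The yes/no answers are hypothetical.

```python
# Hypothetical yes/no answers to the four items above, which run from the
# least intense to the most intense expression of the concept.
answers = [True, True, False, False]  # yes, yes, no, no

# An unweighted Guttman-style score: the number of "yes" responses.
score = sum(answers)
print(score)  # 2

# In a perfectly cumulative (Guttman) response pattern, every "yes" appears
# before the first "no": endorsing an intense item implies endorsing the
# milder items that come before it.
first_no = answers.index(False) if False in answers else len(answers)
print(all(not answer for answer in answers[first_no:]))  # True
```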
Composite measures: Scales and indices
Depending on your research design, your measure may be something you put on a survey or pre/post-test that you give to your participants. For a variable like age or income, one well-worded question may suffice. Unfortunately, most variables in the social world are not so simple. Depression and satisfaction are multidimensional concepts. Relying on a single indicator like a question that asks "Yes or no, are you depressed?" does not encompass the complexity of depression, including issues with mood, sleeping, eating, relationships, and happiness. There is no easy way to delineate between multidimensional and unidimensional concepts, as it's all in how you think about your variable. Satisfaction could be validly measured using a unidimensional ordinal rating scale. However, if satisfaction were a key variable in our study, we would need a theoretical framework and conceptual definition for it. That means we'd probably have more indicators to ask about, like timeliness, respect, sensitivity, and many others, and we would want our study to say something about what satisfaction truly means in terms of our other key variables. However, if satisfaction is not a key variable in your conceptual framework, it makes sense to operationalize it as a unidimensional concept.
For more complicated measures, researchers use scales and indices (sometimes called indexes) to measure their variables because they assess multiple indicators to develop a composite (or total) score. Composite scores provide a much greater understanding of concepts than a single item could. Although we won't delve too deeply into the process of scale development, we will cover some important topics for you to understand how scales and indices developed by other researchers can be used in your project.
Although scales and indices exhibit differences (which we will discuss later), the two have several characteristics in common.
- Both are ordinal measures of variables.
- Both can order the units of analysis in terms of specific variables.
- Both are composite measures.
Scales
The previous section discussed how to measure respondents’ responses to predesigned items or indicators belonging to an underlying construct. But how do we create the indicators themselves? The process of creating the indicators is called scaling. More formally, scaling is a branch of measurement that involves the construction of measures by associating qualitative judgments about unobservable constructs with quantitative, measurable metric units. Stevens (1946)[98] said, “Scaling is the assignment of objects to numbers according to a rule.” This process of measuring abstract concepts in concrete terms remains one of the most difficult tasks in empirical social science research.
The outcome of a scaling process is a scale, which is an empirical structure for measuring items or indicators of a given construct. Understand that multidimensional “scales”, as discussed in this section, are a little different from “rating scales” discussed in the previous section. A rating scale is used to capture the respondents’ reactions to a given item on a questionnaire. For example, an ordinally scaled item captures a value between “strongly disagree” to “strongly agree.” Attaching a rating scale to a statement or instrument is not scaling. Rather, scaling is the formal process of developing scale items, before rating scales can be attached to those items.
If creating your own scale sounds painful, don’t worry! For most multidimensional variables, you would likely be duplicating work that has already been done by other researchers. Specifically, this is a branch of science called psychometrics. You do not need to create a scale for depression because scales such as the Patient Health Questionnaire (PHQ-9), the Center for Epidemiologic Studies Depression Scale (CES-D), and Beck’s Depression Inventory (BDI) have been developed and refined over dozens of years to measure variables like depression. Similarly, scales such as the Patient Satisfaction Questionnaire (PSQ-18) have been developed to measure satisfaction with medical care. As we will discuss in the next section, these scales have been shown to be reliable and valid. While you could create a new scale to measure depression or satisfaction, a study with rigor would pilot test and refine that new scale over time to make sure it measures the concept accurately and consistently. This high level of rigor is often unachievable in student research projects because of the cost and time involved in pilot testing and validating, so using existing scales is recommended.
Unfortunately, there is no good one-stop shop for psychometric scales. The Mental Measurements Yearbook provides a searchable database of measures for social science variables, though it is woefully incomplete and often does not contain the full documentation for scales in its database. You can access it from a university library's list of databases. If you can't find anything in there, your next stop should be the methods section of the articles in your literature review. The methods section of each article will detail how the researchers measured their variables, and often the results section is instructive for understanding more about measures. In a quantitative study, researchers may have used a scale to measure key variables and will provide a brief description of that scale, its name, and maybe a few example questions. If you need more information, look at the results section and tables discussing the scale to get a better idea of how the measure works. Looking beyond the articles in your literature review, searching Google Scholar using queries like "depression scale" or "satisfaction scale" should also provide some relevant results. For example, searching for documentation for the Rosenberg Self-Esteem Scale (which we will discuss in the next section), I found a report from researchers investigating acceptance and commitment therapy which details this scale and many others used to assess mental health outcomes. If you find the name of the scale somewhere but cannot find the documentation (all questions and answers plus how to interpret the scale), a general web search with the name of the scale and ".pdf" may bring you to what you need. Or, to get professional help with finding information, always ask a librarian!
Unfortunately, these approaches do not guarantee that you will be able to view the scale itself or get information on how it is interpreted. Many scales cost money to use and may require training to properly administer. You may also find scales that are related to your variable but would need to be slightly modified to match your study's needs. You could adapt a scale to fit your study; however, changing even small parts of a scale can influence its accuracy and consistency. While it is perfectly acceptable in student projects to adapt a scale without testing it first (time may not allow you to do so), pilot testing is always recommended for adapted scales, and researchers seeking to draw valid conclusions and publish their results must take this additional step.
Indices
An index is a composite score derived from aggregating measures of multiple concepts (called components) using a set of rules and formulas. It is different from a scale. Scales also aggregate measures; however, these measures examine different dimensions or the same dimension of a single construct. A well-known example of an index is the consumer price index (CPI), which is computed every month by the Bureau of Labor Statistics of the U.S. Department of Labor. The CPI is a measure of how much consumers have to pay for goods and services (in general) and is divided into eight major categories (food and beverages, housing, apparel, transportation, healthcare, recreation, education and communication, and “other goods and services”), which are further subdivided into more than 200 smaller items. Each month, government employees call all over the country to get the current prices of more than 80,000 items. Using a complicated weighting scheme that takes into account the location and probability of purchase for each item, analysts then combine these prices into an overall index score using a series of formulas and rules.
Another example of an index is the Duncan Socioeconomic Index (SEI). This index is used to quantify a person's socioeconomic status (SES) and is a combination of three concepts: income, education, and occupation. Income is measured in dollars, education in years or degrees achieved, and occupation is classified into categories or levels by status. These very different measures are combined to create an overall SES index score. However, SES index measurement has generated a lot of controversy and disagreement among researchers.
The process of creating an index is similar to that of a scale. First, conceptualize (define) the index and its constituent components. Though this appears simple, there may be a lot of disagreement on what components (concepts/constructs) should be included or excluded from an index. For instance, in the SES index, isn’t income correlated with education and occupation? And if so, should we include one component only or all three components? Reviewing the literature, using theories, and/or interviewing experts or key stakeholders may help resolve this issue. Second, operationalize and measure each component. For instance, how will you categorize occupations, particularly since some occupations may have changed with time (e.g., there were no Web developers before the Internet)? As we will see in step three below, researchers must create a rule or formula for calculating the index score. Again, this process may involve a lot of subjectivity, so validating the index score using existing or new data is important.
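To give a feel for what "a set of rules and formulas" might look like, here is a toy Python sketch of an SES-style index. The components, rescaling ranges, and weights are all invented for illustration; a real index would need to justify each of those choices.

```python
# A toy socioeconomic-style index combining three components measured on very
# different scales. All values, ranges, and weights below are hypothetical.
components = {
    "income":     (42_000, 0, 200_000),  # (value, minimum, maximum) in dollars
    "education":  (16, 0, 22),           # years of schooling completed
    "occupation": (65, 0, 100),          # occupational prestige rating
}
weights = {"income": 0.4, "education": 0.3, "occupation": 0.3}

index_score = 0.0
for name, (value, low, high) in components.items():
    rescaled = (value - low) / (high - low)  # put each component on a 0-1 scale
    index_score += weights[name] * rescaled  # then combine using the weights

print(round(index_score, 3))  # roughly 0.5 for this made-up person
```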
Scale and index development are often taught in their own courses in doctoral education, so it is unreasonable to expect that you could develop a consistently accurate measure within the span of a week or two. Using available indices and scales is recommended for this reason.
Differences between scales and indices
Though indices and scales yield a single numerical score or value representing a concept of interest, they are different in many ways. First, indices often comprise components that are very different from each other (e.g., income, education, and occupation in the SES index) and are measured in different ways. Conversely, scales typically involve a set of similar items that use the same rating scale (such as a five-point Likert scale about customer satisfaction).
Second, indices often combine objectively measurable values such as prices or income, while scales are designed to assess subjective or judgmental constructs such as attitude, prejudice, or self-esteem. Some argue that the sophistication of the scaling methodology makes scales different from indexes, while others suggest that indexing methodology can be equally sophisticated. Nevertheless, indexes and scales are both essential tools in social science research.
Scales and indices seem like clean, convenient ways to measure different phenomena in social science, but just like with a lot of research, we have to be mindful of the assumptions and biases underneath. What if a scale or an index was developed using only White women as research participants? Is it going to be useful for other groups? It very well might be, but when using a scale or index on a group for whom it hasn't been tested, it will be very important to evaluate the validity and reliability of the instrument, which we address in the rest of the chapter.
Finally, it's important to note that while scales and indices are often made up of nominal or ordinal items, when we combine those items into composite scores, we will treat the scores as interval/ratio variables.
Exercises
- Look back to your work from the previous section, are your variables unidimensional or multidimensional?
- Describe the specific measures you will use (actual questions and response options you will use with participants) for each variable in your research question.
- If you are using a measure developed by another researcher but do not have all of the questions, response options, and instructions needed to implement it, put it on your to-do list to get them.
Step 3: How you will interpret your measures
The final stage of operationalization involves setting the rules for how the measure works and how the researcher should interpret the results. Sometimes, interpreting a measure can be incredibly easy. If you ask someone their age, you'll probably interpret the results by noting the raw number (e.g., 22) someone provides and whether it is lower or higher than other people's ages. However, you could also recode that person into age categories (e.g., under 25, 20-29 years old, generation Z, etc.). Even scales may be simple to interpret. If there is a scale of problem behaviors, one might simply add up the number of behaviors checked off, with a range from 1-5 indicating low risk of delinquent behavior, 6-10 indicating the student is moderate risk, etc. How you choose to interpret your measures should be guided by how they were designed, how you conceptualize your variables, the data sources you used, and your plan for analyzing your data statistically. Whatever measure you use, you need a set of rules for how to take any valid answer a respondent provides to your measure and interpret it in terms of the variable being measured.
For more complicated measures like scales, refer to the information provided by the author for how to interpret the scale. If you can’t find enough information from the scale’s creator, look at how the results of that scale are reported in the results section of research articles. For example, Beck’s Depression Inventory (BDI-II) uses 21 statements to measure depression and respondents rate their level of agreement on a scale of 0-3. The results for each question are added up, and the respondent is put into one of three categories: low levels of depression (1-16), moderate levels of depression (17-30), or severe levels of depression (31 and over).
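Here is a small Python sketch of that interpretation rule, using the cutoffs just described and a hypothetical set of item ratings.

```python
def interpret_bdi(item_scores):
    """Sum 21 item ratings (each 0-3) and apply the cutoffs described above."""
    assert len(item_scores) == 21 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    if total <= 16:
        category = "low levels of depression"
    elif total <= 30:
        category = "moderate levels of depression"
    else:
        category = "severe levels of depression"
    return total, category

# A hypothetical respondent who mostly answered 1, with a few 2s:
scores = [1] * 15 + [2] * 6
print(interpret_bdi(scores))  # (27, 'moderate levels of depression')
```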
One common mistake I see often is that students will introduce another variable into their operational definition. This is incorrect. Your operational definition should mention only one variable—the variable being defined. While your study will certainly draw conclusions about the relationships between variables, that's not what operationalization is. Operationalization specifies what instrument you will use to measure your variable and how you plan to interpret the data collected using that measure.
Operationalization is probably the trickiest component of basic research methods, so please don’t get frustrated if it takes a few drafts and a lot of feedback to get to a workable definition. At the time of this writing, I am in the process of operationalizing the concept of “attitudes towards research methods.” Originally, I thought that I could gauge students’ attitudes toward research methods by looking at their end-of-semester course evaluations. As I became aware of the potential methodological issues with student course evaluations, I opted to use focus groups of students to measure their common beliefs about research. You may recall some of these opinions from Chapter 1, such as the common beliefs that research is boring, useless, and too difficult. After the focus group, I created a scale based on the opinions I gathered, and I plan to pilot test it with another group of students. After the pilot test, I expect that I will have to revise the scale again before I can implement the measure in a real social work research project. At the time I’m writing this, I’m still not completely done operationalizing this concept.
Key Takeaways
- Operationalization involves spelling out precisely how a concept will be measured.
- Operational definitions must include the variable, the measure, and how you plan to interpret the measure.
- There are four different levels of measurement: nominal, ordinal, interval, and ratio (in increasing order of specificity).
- Scales and indices are common ways to collect information and involve using multiple indicators in measurement.
- A key difference between a scale and an index is that a scale contains multiple indicators for one concept, whereas an index combines measures of multiple concepts (components).
- Using scales developed and refined by other researchers can improve the rigor of a quantitative study.
Exercises
Use the research question that you developed in the previous chapters and find a related scale or index that researchers have used. If you have trouble finding the exact phenomenon you want to study, get as close as you can.
- What is the level of measurement for each item on each tool? Take a second and think about why the tool's creator decided to include these levels of measurement. Identify any levels of measurement you would change and why.
- If these tools don't exist for what you are interested in studying, why do you think that is?
12.3 Writing effective questions and questionnaires
Learning Objectives
Learners will be able to...
- Describe some of the ways that survey questions might confuse respondents and how to word questions and responses clearly
- Create mutually exclusive, exhaustive, and balanced response options
- Define fence-sitting and floating
- Describe the considerations involved in constructing a well-designed questionnaire
- Discuss why pilot testing is important
In the previous section, we reviewed how researchers collect data using surveys. Guided by their sampling approach and research context, researchers should choose the survey approach that provides the most favorable tradeoffs in strengths and challenges. With this information in hand, researchers need to write their questionnaire and revise it before beginning data collection. Each method of delivery requires a questionnaire, but they vary a bit based on how they will be used by the researcher. Since phone surveys are read aloud, researchers will pay more attention to how the questionnaire sounds than how it looks. Online surveys can use advanced tools to require the completion of certain questions, present interactive questions and answers, and otherwise afford greater flexibility in how questionnaires are designed. As you read this section, consider how your method of delivery impacts the type of questionnaire you will design. Because most student projects use paper or online surveys, this section will detail how to construct self-administered questionnaires to minimize the potential for bias and error.
Start with operationalization
The first thing you need to do to write effective survey questions is identify what exactly you wish to know. As silly as it sounds to state what seems so completely obvious, we can’t stress enough how easy it is to forget to include important questions when designing a survey. Begin by looking at your research question and refreshing your memory of the operational definitions you developed for those variables from Chapter 11. You should have a pretty firm grasp of your operational definitions before starting the process of questionnaire design. You may have taken those operational definitions from other researchers' methods, found established scales and indices for your measures, or created your own questions and answer options.
Exercises
STOP! Make sure you have a complete operational definition for the dependent and independent variables in your research question. A complete operational definition contains the variable being measured, the measure used, and how the researcher interprets the measure. Let's make sure you have what you need from Chapter 11 to begin writing your questionnaire.
List all of the dependent and independent variables in your research question.
- It's normal to have one dependent or independent variable. It's also normal to have more than one of either.
- Make sure that your research question (and this list) contain all of the variables in your hypothesis. Your hypothesis should only include variables from your research question.
For each variable in your list:
- Write out the measure you will use (the specific questions and answers) for each variable.
- If you don't have questions and answers finalized yet, write a first draft and revise it based on what you read in this section.
- If you are using a measure from another researcher, you should be able to write out all of the questions and answers associated with that measure. If you only have the name of a scale or a few questions, you need access to the full text and some documentation on how to administer and interpret it before you can finish your questionnaire.
- Describe how you will use each measure to draw conclusions about the variable in the operational definition.
- For example, an interpretation might be "there are five 7-point Likert scale questions...point values are added across all five items for each participant...and scores below 10 indicate the participant has low self-esteem"
- Don't introduce other variables into the mix here. All we are concerned with is how you will measure each variable by itself. The connection between variables is done using statistical tests, not operational definitions.
- Detail any validity or reliability issues uncovered by previous researchers using the same measures. If you have concerns about validity and reliability, note them, as well.
If you completed the exercise above and listed out all of the questions and answer choices you will use to measure the variables in your research question, you have already produced a pretty solid first draft of your questionnaire! Congrats! In essence, questionnaires are all of the self-report measures in your operational definitions for the independent, dependent, and control variables in your study arranged into one document and administered to participants. There are a few questions on a questionnaire (like name or ID#) that are not associated with the measurement of variables. These are the exception, and it's useful to think of a questionnaire as a list of measures for variables. Of course, researchers often use more than one measure of a variable (i.e., triangulation) so they can more confidently assert that their findings are true. A questionnaire should contain all of the measures researchers plan to collect about their variables by asking participants to self-report. As we will discuss in the final section of this chapter, triangulating across data sources (e.g., measuring variables using client files or student records) can avoid some of the common sources of bias in survey research.
Sticking close to your operational definitions is important because it helps you avoid an everything-but-the-kitchen-sink approach that includes every possible question that occurs to you. Doing so puts an unnecessary burden on your survey respondents. Remember that you have asked your participants to give you their time and attention and to take care in responding to your questions; show them your respect by only asking questions that you actually plan to use in your analysis. For each question in your questionnaire, ask yourself how this question measures a variable in your study. An operational definition should contain the questions, response options, and how the researcher will draw conclusions about the variable based on participants' responses.
Writing questions
So, almost all of the questions on a questionnaire are measuring some variable. For many variables, researchers will create their own questions rather than using one from another researcher. This section will provide some tips on how to create good questions to accurately measure variables in your study. First, questions should be as clear and to the point as possible. This is not the time to show off your creative writing skills; a survey is a technical instrument and should be written in a way that is as direct and concise as possible. As I’ve mentioned earlier, your survey respondents have agreed to give their time and attention to your survey. The best way to show your appreciation for their time is to not waste it. Ensuring that your questions are clear and concise will go a long way toward showing your respondents the gratitude they deserve. Pilot testing the questionnaire with friends or colleagues can help identify these issues. This process is commonly called pretesting, but to avoid any confusion with pretesting in experimental design, we refer to it as pilot testing.
Related to the point about not wasting respondents’ time, make sure that every question you pose will be relevant to every person you ask to complete it. This means two things: first, that respondents have knowledge about whatever topic you are asking them about, and second, that respondents have experienced the events, behaviors, or feelings you are asking them to report. If you are asking participants for second-hand knowledge—asking clinicians about clients' feelings, asking teachers about students' feelings, and so forth—you may want to clarify that the variable you are asking about is the key informant's perception of what is happening in the target population. A well-planned sampling approach ensures that participants are the most knowledgeable population to complete your survey.
If you decide that you do wish to include questions about matters with which only a portion of respondents will have had experience, make sure you know why you are doing so. For example, if you are asking about MSW student study patterns, and you decide to include a question on studying for the social work licensing exam, you may only have a small subset of participants who have begun studying for the graduate exam or took the bachelor's-level exam. If you decide to include this question that speaks to a minority of participants' experiences, think about why you are including it. Are you interested in how studying for class and studying for licensure differ? Are you trying to triangulate study skills measures? Researchers should carefully consider whether questions relevant to only a subset of participants are likely to produce enough valid responses for quantitative analysis.
Many times, questions that are relevant to a subsample of participants are conditional on an answer to a previous question. A participant might select that they rent their home, and as a result, you might ask whether they carry renter's insurance. That question is not relevant to homeowners, so it would be wise not to ask them to respond to it. In that case, the question of whether someone rents or owns their home is a filter question, designed to identify some subset of survey respondents who are asked additional questions that are not relevant to the entire sample. Figure 12.1 presents an example of how to accomplish this on a paper survey by adding instructions to the participant that indicate what question to proceed to next based on their response to the first one. Using online survey tools, researchers can use filter questions to only present relevant questions to participants.
Researchers should eliminate questions that ask about things participants don't know in order to minimize confusion. Assuming the question is relevant to the participant, other sources of confusion come from how the question is worded. The use of negative wording can be a source of potential confusion. Taking the question from Figure 12.1 about drinking as our example, what if we had instead asked, “Did you not abstain from drinking during your first semester of college?” This is a double negative, and it's not clear how to answer the question accurately. It is a good idea to avoid negative phrasing when possible. For example, "did you not drink alcohol during your first semester of college?" is less clear than "did you drink alcohol during your first semester of college?"
You should also avoid using terms or phrases that may be regionally or culturally specific (unless you are absolutely certain all your respondents come from the region or culture whose terms you are using). When I first moved to southwest Virginia, I didn’t know what a holler was. Where I grew up in New Jersey, to holler means to yell. Even then, in New Jersey, we shouted and screamed, but we didn’t holler much. In southwest Virginia, my home at the time, a holler also means a small valley in between the mountains. If I used holler in that way on my survey, people who live near me may understand, but almost everyone else would be totally confused. A similar issue arises when you use jargon, or technical language, that people do not commonly know. For example, if you asked adolescents how they experience imaginary audience, they would find it difficult to link those words to the concepts from David Elkind’s theory. The words you use in your questions must be understandable to your participants. If you find yourself using jargon or slang, break it down into terms that are more universal and easier to understand.
Asking multiple questions as though they are a single question can also confuse survey respondents. There’s a specific term for this sort of question; it is called a double-barreled question. Figure 12.2 shows a double-barreled question. Do you see what makes the question double-barreled? How would someone respond if they felt their college classes were more demanding but also more boring than their high school classes? Or less demanding but more interesting? Because the question combines “demanding” and “interesting,” there is no way to respond yes to one criterion but no to the other.
Another thing to avoid when constructing survey questions is the problem of social desirability. We all want to look good, right? And we all probably know the politically correct response to a variety of questions whether we agree with the politically correct response or not. In survey research, social desirability refers to the idea that respondents will try to answer questions in a way that will present them in a favorable light. (You may recall we covered social desirability bias in Chapter 11.)
Perhaps we decide that to understand the transition to college, we need to know whether respondents ever cheated on an exam in high school or college for our research project. We all know that cheating on exams is generally frowned upon (at least I hope we all know this). So, it may be difficult to get people to admit to cheating on a survey. But if you can guarantee respondents’ confidentiality, or even better, their anonymity, chances are much better that they will be honest about having engaged in this socially undesirable behavior. Another way to avoid problems of social desirability is to try to phrase difficult questions in the most benign way possible. Earl Babbie (2010) [99] offers a useful suggestion for helping you do this—simply imagine how you would feel responding to your survey questions. If you would be uncomfortable, chances are others would as well.
Exercises
Try to step outside your role as researcher for a second, and imagine you were one of your participants. Evaluate the following:
- Is the question too general? Sometimes, questions that are too general may not accurately convey respondents’ perceptions. If you asked someone how well they liked a certain book and provided a response scale ranging from “not at all” to “extremely well,” and that person selected “extremely well,” what do they mean? Instead, ask more specific behavioral questions, such as "Will you recommend this book to others?" or "Do you plan to read other books by the same author?"
- Is the question too detailed? Avoid unnecessarily detailed questions that serve no specific research purpose. For instance, do you need the age of each child in a household, or is the number of children in the household acceptable? However, if unsure, it is better to err on the side of detail rather than generality.
- Is the question presumptuous? Does your question make assumptions? For instance, if you ask, "what do you think the benefits of a tax cut would be?" you are presuming that the participant sees the tax cut as beneficial. But many people may not view tax cuts as beneficial. Some might see tax cuts as a precursor to less funding for public schools and fewer public services such as police, ambulance, and fire department. Avoid questions with built-in presumptions.
- Does the question ask the participant to imagine something? Is the question imaginary? A popular question on many television game shows is “if you won a million dollars on this show, how will you plan to spend it?” Most participants have never been faced with this large amount of money and have never thought about this scenario. In fact, most don’t even know that after taxes, the value of the million dollars will be greatly reduced. In addition, some game shows spread the amount over a 20-year period. Without understanding this "imaginary" situation, participants may not have the background information necessary to provide a meaningful response.
Finally, it is important to get feedback on your survey questions from as many people as possible, especially people who are like those in your sample. Now is not the time to be shy. Ask your friends for help, ask your mentors for feedback, ask your family to take a look at your survey as well. The more feedback you can get on your survey questions, the better the chances that you will come up with a set of questions that are understandable to a wide variety of people and, most importantly, to those in your sample.
In sum, in order to pose effective survey questions, researchers should do the following:
- Identify how each question measures an independent, dependent, or control variable in their study.
- Keep questions clear and succinct.
- Make sure respondents have relevant lived experience to provide informed answers to your questions.
- Use filter questions to avoid getting answers from uninformed participants.
- Avoid questions that are likely to confuse respondents—including those that use double negatives, use culturally specific terms or jargon, and pose more than one question at a time.
- Imagine how respondents would feel responding to questions.
- Get feedback, especially from people who resemble those in the researcher’s sample.
Exercises
Let's complete a first draft of your questions. In the previous exercise, you listed all of the questions and answers you will use to measure the variables in your research question.
- In the previous exercise, you wrote out the questions and answers for each measure of your independent and dependent variables. Evaluate each question using the criteria listed above on effective survey questions.
- Type out questions for your control variables and evaluate them, as well. Consider what response options you want to offer participants.
Now, let's revise any questions that do not meet your standards!
- Use the BRUSO model in Table 12.2 for an illustration of how to address deficits in question wording. Keep in mind that you are writing a first draft in this exercise, and it will take a few drafts and revisions before your questions are ready to distribute to participants.
| Criterion | Poor | Effective |
| --- | --- | --- |
| B- Brief | “Are you now or have you ever been the possessor of a firearm?” | “Have you ever possessed a firearm?” |
| R- Relevant | “Who did you vote for in the last election?” | Note: Only include items that are relevant to your study. |
| U- Unambiguous | “Are you a gun person?” | “Do you currently own a gun?” |
| S- Specific | “How much have you read about the new gun control measure and sales tax?” | “How much have you read about the new sales tax on firearm purchases?” |
| O- Objective | “How much do you support the beneficial new gun control measure?” | “What is your view of the new gun control measure?” |
Writing response options
While posing clear and understandable questions in your survey is certainly important, so too is providing respondents with unambiguous response options. Response options are the answers that you provide to the people completing your questionnaire. Generally, respondents will be asked to choose a single (or best) response to each question you pose. We call questions in which the researcher provides all of the response options closed-ended questions. Keep in mind, closed-ended questions can also instruct respondents to choose multiple response options, rank response options against one another, or assign a percentage to each response option. But be cautious when experimenting with different response options! Accepting multiple responses to a single question may add complexity when it comes to quantitatively analyzing and interpreting your data.
Surveys need not be limited to closed-ended questions. Sometimes survey researchers include open-ended questions in their survey instruments as a way to gather additional details from respondents. An open-ended question does not include response options; instead, respondents are asked to reply to the question in their own way, using their own words. These questions are generally used to find out more about a survey participant’s experiences or feelings about whatever they are being asked to report in the survey. If, for example, a survey includes closed-ended questions asking respondents to report on their involvement in extracurricular activities during college, an open-ended question could ask respondents why they participated in those activities or what they gained from their participation. While responses to such questions may also be captured using a closed-ended format, allowing participants to share some of their responses in their own words can make the experience of completing the survey more satisfying to respondents and can also reveal new motivations or explanations that had not occurred to the researcher. This is particularly important for mixed-methods research. It is possible to analyze open-ended response options quantitatively using content analysis (i.e., counting how often a theme is represented in a transcript and looking for statistical patterns). However, for most researchers, qualitative data analysis will be needed to analyze open-ended questions, and researchers need to think through how they will analyze any open-ended questions as part of their data analysis plan. We will address qualitative data analysis in greater detail in Chapter 19.
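As a rough illustration of the counting step in quantitative content analysis, the sketch below tallies how many open-ended responses mention each theme. The themes, keyword lists, and responses are all hypothetical, and real content analysis typically involves a carefully developed coding scheme applied by human coders rather than simple keyword matching.

```python
# A minimal sketch of quantitative content analysis: count how many
# open-ended responses touch on each (hypothetical) theme.

from collections import Counter

themes = {
    "social connection": ["friends", "community", "belong"],
    "skill building": ["skills", "leadership", "experience"],
}

responses = [
    "I joined clubs to make friends and feel like I belong.",
    "Volunteering gave me leadership experience and new skills.",
    "Intramurals were mostly about spending time with friends.",
]

counts = Counter()
for text in responses:
    lowered = text.lower()
    for theme, keywords in themes.items():
        if any(word in lowered for word in keywords):
            counts[theme] += 1   # count each response at most once per theme

print(dict(counts))  # {'social connection': 2, 'skill building': 1}
```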
To keep things simple, we encourage you to use only closed-ended response options in your study. While open-ended questions are not wrong, they are often a sign in our classrooms that students have not fully thought through how to operationally define and measure their key variables. Open-ended questions cannot be operationally defined because you don't know what responses you will get. Instead, you will need to analyze the qualitative data using one of the techniques we discuss in Chapter 19 to interpret your participants' responses.
To write effective response options for closed-ended questions, there are a few guidelines worth following. First, be sure that your response options are mutually exclusive. Look back at Figure 12.1, which contains questions about how often and how many drinks respondents consumed. Do you notice that there are no overlapping categories in the response options for these questions? This is another one of those points about question construction that seems fairly obvious but that can be easily overlooked. Response options should also be exhaustive. In other words, every possible response should be covered in the set of response options that you provide. For example, note that in question 10a in Figure 12.1, we have covered all possibilities—those who drank, say, an average of once per month can choose the first response option (“less than one time per week”) while those who drank multiple times a day each day of the week can choose the last response option (“7+”). All the possibilities in between these two extremes are covered by the middle three response options, and every respondent fits into one of the response options we provided.
Earlier in this section, we discussed double-barreled questions. Response options can also be double-barreled, and this should be avoided. Figure 12.3 is an example of a question that uses double-barreled response options. Other tips about questions are also relevant to response options, including that participants should be knowledgeable enough to select or decline a response option and that you should avoid jargon and cultural idioms.
Even if you phrase questions and response options clearly, participants are influenced by how many response options are presented on the questionnaire. For Likert scales, five or seven response options generally allow about as much precision as respondents are capable of. However, numerical scales with more options can sometimes be appropriate. For dimensions such as attractiveness, pain, and likelihood, a 0-to-10 scale will be familiar to many respondents and easy for them to use. Regardless of the number of response options, the most extreme ones should generally be “balanced” around a neutral or modal midpoint. An example of an unbalanced rating scale measuring perceived likelihood might look like this:
Unlikely | Somewhat Likely | Likely | Very Likely | Extremely Likely
Because we have four rankings of likely and only one ranking of unlikely, the scale is unbalanced and most responses will be biased toward "likely" rather than "unlikely." A balanced version might look like this:
Extremely Unlikely | Somewhat Unlikely | As Likely as Not | Somewhat Likely | Extremely Likely
In this example, the midpoint is halfway between likely and unlikely. Of course, a middle or neutral response option does not have to be included. Researchers sometimes choose to leave it out because they want to encourage respondents to think more deeply about their response and not simply choose the middle option by default. Fence-sitters are respondents who choose neutral response options, even if they have an opinion. Some people will be drawn to respond, “no opinion” even if they have an opinion, particularly if their true opinion is not a socially desirable one. Floaters, on the other hand, are those who choose a substantive answer to a question when really, they don’t understand the question or don’t have an opinion.
As you can see, floating is the flip side of fence-sitting. Thus, the solution to one problem is often the cause of the other. How you decide which approach to take depends on the goals of your research. Sometimes researchers specifically want to learn something about people who claim to have no opinion. In this case, allowing for fence-sitting would be necessary. Other times researchers feel confident their respondents will all be familiar with every topic in their survey. In this case, perhaps it is okay to force respondents to choose one side or another (e.g., agree or disagree) without a middle option (e.g., neither agree nor disagree) or to not include an option like "don't know enough to say" or "not applicable." There is no always-correct solution to either problem. But in general, including a middle option in a response set provides a more exhaustive set of response options than excluding one.
The most important check before you finalize your response options is to align them with your operational definitions. As we've discussed before, your operational definitions include your measures (questions and response options) as well as how to interpret those measures in terms of the variable being measured. In particular, you should be able to interpret all response options to a question based on your operational definition of the variable it measures. If you wanted to measure the variable "social class," you might ask one question about a participant's annual income and another about family size. Your operational definition would need to provide clear instructions on how to interpret response options. Your operational definition is basically like this social class calculator from Pew Research, though they include a few more questions in their definition.
To drill down a bit more, as Pew specifies in the section titled "how the income calculator works," the interval/ratio data respondents enter are interpreted using a formula that combines a participant's four responses to the questions posed by Pew and categorizes their household into one of three classes: upper, middle, or lower. So, the operational definition includes the four questions comprising the measure and the formula, or interpretation, which converts responses into the three final categories we are familiar with: lower, middle, and upper class.
It is interesting to note that the final social class variable is at an ordinal level of measurement, whereas Pew asks four questions that use an interval or ratio level of measurement (depending on the question). This means that respondents provide numerical responses, rather than choosing categories like lower, middle, and upper class. It's perfectly normal for operational definitions to change levels of measurement, and it's also perfectly normal for the level of measurement to stay the same. The important thing is that each response option a participant can provide is accounted for by the operational definition. Throw any combination of family size, location, or income at the Pew calculator, and it will place you into one of those three social class categories.
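To illustrate what such an operational definition can look like in procedural form, here is a toy sketch that converts two interval/ratio inputs (household income and household size) into an ordinal class category. The size adjustment and income cutoffs are invented for illustration; they are not Pew's actual formula.

```python
# Toy illustration of converting interval/ratio inputs into an ordinal
# category, in the spirit of the Pew calculator. The square-root size
# adjustment and the dollar cutoffs are assumptions, NOT Pew's formula.

def social_class(household_income, household_size):
    """Classify a household as lower, middle, or upper class."""
    adjusted = household_income / (household_size ** 0.5)  # size-adjusted income
    if adjusted < 30_000:
        return "lower class"
    elif adjusted < 90_000:
        return "middle class"
    else:
        return "upper class"

print(social_class(50_000, 3))    # adjusted income ~28,868 -> "lower class"
print(social_class(100_000, 2))   # adjusted income ~70,711 -> "middle class"
print(social_class(200_000, 1))   # adjusted income 200,000 -> "upper class"
```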
Unlike Pew's definition, the operational definitions in your study may not need their own webpage to define and describe. For many questions and answers, interpreting response options is easy. If you were measuring "income" instead of "social class," you could simply operationalize the term by asking people to list their total household income before taxes are taken out. Higher values indicate higher income, and lower values indicate lower income. Easy. Regardless of whether your operational definitions are simple or more complex, every response option to every question on your survey (with a few exceptions) should be interpretable using an operational definition of a variable. Just like we want to avoid an everything-but-the-kitchen-sink approach to questions on our questionnaire, you want to make sure your final questionnaire only contains response options that you will use in your study.
One note of caution on interpretation (sorry for repeating this). We want to remind you again that an operational definition should not mention more than one variable. In our example above, your operational definition could not say "a family of three making under $50,000 is lower class; therefore, they are more likely to experience food insecurity." That last clause about food insecurity may well be true, but it's not a part of the operational definition for social class. Each variable (food insecurity and class) should have its own operational definition. If you are talking about how to interpret the relationship between two variables, you are talking about your data analysis plan. We will discuss how to create your data analysis plan beginning in Chapter 14. For now, one consideration is that depending on the statistical test you use to test relationships between variables, you may need nominal, ordinal, or interval/ratio data. Your questions and response options should match the level of measurement required by the specific statistical tests in your data analysis plan. Once you finalize your data analysis plan, return to your questionnaire to make sure the level of measurement matches the statistical test you've chosen.
In summary, to write effective response options researchers should do the following:
- Avoid wording that is likely to confuse respondents, including double negatives, culturally specific terms or jargon, and double-barreled response options.
- Ensure response options are relevant to participants' knowledge and experience so they can make an informed and accurate choice.
- Present mutually exclusive and exhaustive response options.
- Consider fence-sitters and floaters, and the use of neutral or "not applicable" response options.
- Define how response options are interpreted as part of an operational definition of a variable.
- Check that the level of measurement matches your operational definitions and the statistical tests in your data analysis plan (once you develop one).
Exercises
Look back at the response options you drafted in the previous exercise. Make sure you have a first draft of response options for each closed-ended question on your questionnaire.
- Using the criteria above, evaluate the wording of the response options for each question on your questionnaire.
- Revise your questions and response options until you have a complete first draft.
- Do your first read-through and provide a dummy answer to each question. Make sure you can link each response option and each question to an operational definition.
- Look ahead to Chapter 14 and consider how each item on your questionnaire will inform your data analysis plan.
From this discussion, we hope it is clear why researchers using quantitative methods spell out all of their plans ahead of time. Ultimately, there should be a straight line from operational definition through measures on your questionnaire to the data analysis plan. If your questionnaire includes response options that are not aligned with operational definitions or not included in the data analysis plan, the responses you receive back from participants won't fit with your conceptualization of the key variables in your study. If you do not fix these errors and proceed with collecting unstructured data, you will lose out on many of the benefits of survey research and face overwhelming challenges in answering your research question.
Designing questionnaires
Based on your work in the previous section, you should have a first draft of the questions and response options for the key variables in your study. Now, you’ll also need to think about how to present your written questions and response options to survey respondents. It's time to write a final draft of your questionnaire and make it look nice. Designing questionnaires takes some thought. First, consider the method of delivery for your survey. What we cover in this section will apply equally to paper and online surveys, but if you are planning to use online survey software, you should watch tutorial videos and explore the features of the survey software you will use.
Informed consent & instructions
Writing effective items is only one part of constructing a survey. For one thing, every survey should have a written or spoken introduction that serves two basic functions (Peterson, 2000).[100] One is to encourage respondents to participate in the survey. In many types of research, such encouragement is not necessary either because participants do not know they are in a study (as in naturalistic observation) or because they are part of a subject pool and have already shown their willingness to participate by signing up and showing up for the study. Survey research usually catches respondents by surprise when they answer their phone, go to their mailbox, or check their e-mail—and the researcher must make a good case for why they should agree to participate. Thus, the introduction should briefly explain the purpose of the survey and its importance, provide information about the sponsor of the survey (university-based surveys tend to generate higher response rates), acknowledge the importance of the respondent’s participation, and describe any incentives for participating.
The second function of the introduction is to establish informed consent. Remember that this involves describing to respondents everything that might affect their decision to participate. This includes the topics covered by the survey, the amount of time it is likely to take, the respondent’s option to withdraw at any time, confidentiality issues, and other ethical considerations we covered in Chapter 6. Written consent forms are not always used in survey research (when the research is of minimal risk, the IRB often accepts completion of the survey instrument as evidence of consent to participate), so it is important that this part of the introduction be well documented and presented clearly and in its entirety to every respondent.
Organizing items to be easy and intuitive to follow
The introduction should be followed by the substantive questionnaire items. But first, it is important to present clear instructions for completing the questionnaire, including examples of how to use any unusual response scales. Remember that the introduction is the point at which respondents are usually most interested and least fatigued, so it is good practice to start with the most important items for purposes of the research and proceed to less important items. Items should also be grouped by topic or by type. For example, items using the same rating scale (e.g., a 5-point agreement scale) should be grouped together if possible to make things faster and easier for respondents. Demographic items are often presented last because they are least interesting to participants but also easy to answer in the event respondents have become tired or bored. Of course, any survey should end with an expression of appreciation to the respondent.
Questions are often organized thematically. If our survey were measuring social class, perhaps we’d have a few questions asking about employment, others focused on education, and still others on housing and community resources. Those may be the themes around which we organize our questions. Or perhaps it would make more sense to present any questions we had about parents' income and then present a series of questions about estimated future income. Grouping by theme is one way to be deliberate about how you present your questions. Keep in mind that you are surveying people, and these people will be trying to follow the logic in your questionnaire. Jumping from topic to topic can give people a bit of whiplash and may make participants less likely to complete it.
Using a matrix is a nice way of streamlining response options for similar questions. A matrix is a question type that lists a set of questions for which the answer categories are all the same. If you have a set of questions for which the response options are the same, it may make sense to create a matrix rather than posing each question and its response options individually. Not only will this save you some space in your survey but it will also help respondents progress through your survey more easily. A sample matrix can be seen in Figure 12.4.
Once you have grouped similar questions together, you’ll need to think about the order in which to present those question groups. Most survey researchers agree that it is best to begin a survey with questions that will make respondents want to continue (Babbie, 2010; Dillman, 2000; Neuman, 2003).[101] In other words, don’t bore respondents, but don’t scare them away either. There’s some disagreement over where on a survey to place demographic questions, such as those about a person’s age, gender, and race. On the one hand, placing them at the beginning of the questionnaire may lead respondents to think the survey is boring, unimportant, and not something they want to bother completing. On the other hand, if your survey deals with some very sensitive topic, such as child sexual abuse or criminal convictions, you don’t want to scare respondents away or shock them by beginning with your most intrusive questions.
Your participants are human. They will react emotionally to questionnaire items, and they will also try to uncover your research questions and hypotheses. In truth, the order in which you present questions on a survey is best determined by the unique characteristics of your research. When feasible, you should consult with key informants from your target population to determine how best to order your questions. If it is not feasible to do so, think about the unique characteristics of your topic, your questions, and most importantly, your sample. Keeping in mind the characteristics and needs of the people you will ask to complete your survey should help guide you as you determine the most appropriate order in which to present your questions. None of your decisions will be perfect, and all studies have limitations.
Questionnaire length
You’ll also need to consider the time it will take respondents to complete your questionnaire. Surveys vary in length, from just a page or two to a dozen or more pages, which means they also vary in the time it takes to complete them. How long to make your survey depends on several factors. First, what is it that you wish to know? Wanting to understand how grades vary by gender and year in school certainly requires fewer questions than wanting to know how people’s experiences in college are shaped by demographic characteristics, college attended, housing situation, family background, college major, friendship networks, and extracurricular activities. Keep in mind that even if your research question requires a sizable number of questions be included in your questionnaire, do your best to keep the questionnaire as brief as possible. Any hint that you’ve thrown in a bunch of useless questions just for the sake of it will turn off respondents and may make them not want to complete your survey.
Second, and perhaps more important, how long are respondents likely to be willing to spend completing your questionnaire? If you are studying college students, asking them to use their limited free time to complete your survey may mean they won’t want to spend more than a few minutes on it. But if you instead ask them to complete your survey during down-time between classes when there is little work to be done, students may be willing to give you a bit more of their time. Think about places and times that your sampling frame naturally gathers and whether you would be able to either recruit participants or distribute a survey in that context. Estimate how long your participants would reasonably have to complete a survey presented to them during this time. The more you know about your population (such as what weeks have less work and more free time), the better you can target questionnaire length.
The time that survey researchers ask respondents to spend on questionnaires varies greatly. Some researchers advise that surveys should not take longer than about 15 minutes to complete (as cited in Babbie 2010),[102] whereas others suggest that up to 20 minutes is acceptable (Hopper, 2010).[103] As with question order, there is no clear-cut, always-correct answer about questionnaire length. The unique characteristics of your study and your sample should be considered to determine how long to make your questionnaire. For example, if you planned to distribute your questionnaire to students in between classes, you will need to make sure it is short enough to complete before the next class begins.
When designing a questionnaire, a researcher should consider:
- Weighing strengths and limitations of the method of delivery, including the advanced tools in online survey software or the simplicity of paper questionnaires.
- Grouping together items that ask about the same thing.
- Moving any questions about sensitive items to the end of the questionnaire, so as not to scare respondents off.
- Moving any questions that engage the respondent to answer the questionnaire at the beginning, so as not to bore them.
- Keeping the questionnaire within a length of time you can reasonably ask of your participants.
- Dedicating time to visual design and ensuring the questionnaire looks professional.
Exercises
Type out a final draft of your questionnaire in a word processor or online survey tool.
- Evaluate your questionnaire using the guidelines above, revise it, and get it ready to share with other student researchers.
Pilot testing and revising questionnaires
A good way to estimate the time it will take respondents to complete your questionnaire (and other potential challenges) is through pilot testing. Pilot testing allows you to get feedback on your questionnaire so you can improve it before you actually administer it. It can be quite expensive and time consuming if you wish to pilot test your questionnaire on a large sample of people who very much resemble the sample to whom you will eventually administer the finalized version of your questionnaire. But you can learn a lot and make great improvements to your questionnaire simply by pilot testing with a small number of people to whom you have easy access (perhaps you have a few friends who owe you a favor). By pilot testing your questionnaire, you can find out how understandable your questions are, get feedback on question wording and order, find out whether any of your questions are boring or offensive, and learn whether there are places where you should have included filter questions. You can also time pilot testers as they take your survey. This will give you a good idea about the estimate to provide respondents when you administer your survey and whether you have some wiggle room to add additional items or need to cut a few items.
Perhaps this goes without saying, but your questionnaire should also have an attractive design. A messy presentation style can confuse respondents or, at the very least, annoy them. Be brief, to the point, and as clear as possible. Avoid cramming too much into a single page. Make your font size readable (at least 12 point or larger, depending on the characteristics of your sample), leave a reasonable amount of space between items, and make sure all instructions are exceptionally clear. If you are using an online survey, ensure that participants can complete it via mobile, computer, and tablet devices. Think about books, documents, articles, or web pages that you have read yourself—which were relatively easy to read and easy on the eyes and why? Try to mimic those features in the presentation of your survey questions. While online survey tools automate much of visual design, word processors are designed for writing all kinds of documents and may need more manual adjustment as part of visual design.
Realistically, your questionnaire will continue to evolve as you develop your data analysis plan over the next few chapters. By now, you should have a complete draft of your questionnaire grounded in an underlying logic that ties together each question and response option to a variable in your study. Once your questionnaire is finalized, you will need to submit it for ethical approval from your professor or the IRB. If your study requires IRB approval, it may be worthwhile to submit your proposal before your questionnaire is completely done. Revisions to IRB protocols are common and it takes less time to review a few changes to questions and answers than it does to review the entire study, so give them the whole study as soon as you can. Once the IRB approves your questionnaire, you cannot change it without their okay.
Key Takeaways
- A questionnaire is composed of self-report measures of variables in a research study.
- Make sure your survey questions will be relevant to all respondents and that you use filter questions when necessary.
- Effective survey questions and responses take careful construction by researchers, as participants may be confused or otherwise influenced by how items are phrased.
- The questionnaire should start with informed consent and instructions, flow logically from one topic to the next, engage but not shock participants, and thank participants at the end.
- Pilot testing can help identify any issues in a questionnaire before distributing it to participants, including language or length issues.
Exercises
It's a myth that researchers work alone! Get together with a few of your fellow students and swap questionnaires for pilot testing.
- Use the criteria in each section above (questions, response options, questionnaires) and provide your peers with the strengths and weaknesses of their questionnaires.
- See if you can guess their research question and hypothesis based on the questionnaire alone.
11.3 Measurement quality
Learning Objectives
Learners will be able to...
- Define and describe the types of validity and reliability
- Assess for systematic error
The previous chapter provided insight into measuring concepts in social work research. We discussed the importance of identifying concepts and their corresponding indicators as a way to help us operationalize them. In essence, we now understand that when we think about our measurement process, we must be intentional and thoughtful in the choices that we make. This section is all about how to judge the quality of the measures you've chosen for the key variables in your research question.
Reliability
First, let’s say we’ve decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism this way, then it is likely that anyone who identifies as an alcoholic would respond “yes.” This may seem like a good way to identify our group of interest, but think about how you and your peer group may respond to this question. Would participants respond differently after a wild night out, compared to any other night? Could an infrequent drinker’s current headache from last night’s glass of wine influence how they answer the question this morning? How would that same person respond to the question before consuming the wine? In each case, the same person might respond differently to the same question at different points, so it is possible that our measure of alcoholism has a reliability problem. Reliability in measurement is about consistency.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behavior, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past 3 months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. On the other hand, we might get more accurate responses if we ask a participant how many drinks of any kind they have consumed in the past week.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviors. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. They may decide to conduct observations at a local pub by noting how many drinks patrons consume and how their behavior changes as their intake changes. What if the researcher has to use the restroom, and the patron next to them takes three shots of tequila during the brief period the researcher is away from their seat? The reliability of this researcher’s measure of alcohol intake depends on their ability to physically observe every instance of patrons consuming drinks. If they are unlikely to be able to observe every such instance, then perhaps their mechanism for measuring this concept is not reliable.
The following subsections describe the types of reliability that are important for you to know about, but keep in mind that you may see other approaches to judging reliability mentioned in the empirical literature.
Test-retest reliability
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time and then using it again on the same group of people at a later time. Unlike an experiment, you aren't giving participants an intervention but trying to establish a reliable baseline of the variable you are measuring. Once you have these two measurements, you then look at the correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 11.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
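Once you have both sets of scores, the correlation itself is straightforward to compute. The sketch below uses made-up week-1 and week-2 scale scores for eight participants and NumPy's correlation function; only the +.80 rule of thumb comes from the text above.

```python
# A small sketch of checking test-retest reliability: correlate the same
# participants' scores at two time points. The scores below are made up.

import numpy as np

time1 = np.array([22, 25, 28, 18, 30, 24, 27, 20])  # week 1 scale scores
time2 = np.array([23, 24, 29, 17, 31, 25, 26, 21])  # week 2 scale scores

r = np.corrcoef(time1, time2)[0, 1]   # Pearson correlation coefficient
print(f"test-retest r = {r:.2f}")     # +.80 or greater suggests good reliability
```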
Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
Internal consistency
Another kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials. A specific statistical test known as Cronbach’s Alpha provides a way to measure how well each question of a scale is related to the others.
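For readers curious what that computation involves, here is a minimal sketch of Cronbach's alpha using the standard formula (the sum of item variances relative to the variance of total scores). The response data are made up for illustration.

```python
# A minimal sketch of Cronbach's alpha for internal consistency.
# Rows are participants, columns are items; the data are made up.

import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array-like, shape (n_participants, n_items)."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)       # per-item variance
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scores = [
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
    [4, 4, 4, 5, 4],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")  # values near .80+ are often considered acceptable
```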
Interrater reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other.
Validity
Validity, another key element of assessing measurement quality, is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure.
Face validity
Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.
Content validity
Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that they think positive thoughts about exercising, feel good about exercising, and actually exercise. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Criterion validity
Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
Discriminant validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
Increasing the reliability and validity of measures
We have reviewed how to evaluate our measures based on reliability and validity considerations. However, what can we do while selecting or creating our tool to minimize the potential for error? Many of our options were covered in our discussion of reliability and validity. Nevertheless, the following list provides a quick summary of things that you should do when creating or selecting a measurement tool. While not all of these will be feasible in your project, it is important to implement those that are feasible in your research context.
Make sure that you engage in a rigorous literature review so that you understand the concept that you are studying. This means understanding the different ways that your concept may manifest itself. This review should include a search for existing instruments.[104]
- Do you understand all the dimensions of your concept? Do you have a good understanding of the content dimensions of your concept(s)?
- What instruments exist? How many items are on the existing instruments? Are these instruments appropriate for your population?
- Are these instruments standardized? Note: If an instrument is standardized, that means it has been rigorously studied and tested.
Consult content experts to review your instrument. This is a good way to check the face validity of your items. Additionally, content experts can also help you understand the content validity.[105]
- Do you have access to a reasonable number of content experts? If not, how can you locate them?
- Did you provide a list of critical questions for your content reviewers to use in the reviewing process?
Pilot test your instrument on a sufficient number of people and get detailed feedback.[106] Ask your group to provide feedback on the wording and clarity of items. Keep detailed notes and make adjustments BEFORE you administer your final tool.
- How many people will you use in your pilot testing?
- How will you set up your pilot testing so that it mimics the actual process of administering your tool?
- How will you receive feedback from your pilot testing group? Have you provided a list of questions for your group to think about?
Provide training for anyone collecting data for your project.[107] You should provide those helping you with a written research protocol that explains all of the steps of the project. You should also problem solve and answer any questions that those helping you may have. This will increase the chances that your tool will be administered in a consistent manner.
- How will you conduct your orientation/training? How long will it be? What modality?
- How will you select those who will administer your tool? What qualifications do they need?
When thinking of items, use a higher level of measurement, if possible.[108] This will provide more information, and you can always downgrade to a lower level of measurement later (see the brief sketch after the questions below).
- Have you examined your items and the levels of measurement?
- Have you thought about whether you need to modify the type of data you are collecting? Specifically, are you asking for information that is too specific (at a higher level of measurement) which may reduce participants' willingness to participate?
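As a quick illustration of this point, here is a minimal sketch in Python (the ages and bracket cut-points are invented for illustration, not taken from any instrument): exact ages collected at the ratio level can always be collapsed into ordinal brackets later, but brackets can never be turned back into exact ages.

```python
# A minimal sketch (invented ages and cut-points) of why collecting data at a
# higher level of measurement keeps your options open: ratio-level ages can be
# downgraded to ordinal brackets later, but not the other way around.

exact_ages = [19, 23, 31, 47, 52, 68]  # ratio-level responses

def to_bracket(age):
    """Downgrade a ratio-level age to an ordinal age bracket."""
    if age < 25:
        return "18-24"
    elif age < 45:
        return "25-44"
    elif age < 65:
        return "45-64"
    return "65+"

ordinal_ages = [to_bracket(age) for age in exact_ages]
print(ordinal_ages)  # ['18-24', '18-24', '25-44', '45-64', '45-64', '65+']
# Had we only collected the brackets, the exact ages could not be recovered.
```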
Use multiple indicators for a variable.[109] Think about the number of items that you will include in your tool.
- Do you have enough items? Enough indicators? The correct indicators?
Conduct an item-by-item assessment of multiple-item measures.[110] When you do this assessment, think about each word and how it changes the meaning of your item.
- Are there items that are redundant? Do you need to modify, delete, or add items?
Types of error
As you can see, measures never perfectly describe what exists in the real world. Good measures demonstrate validity and reliability but will always have some degree of error. Systematic error (also called bias) causes a measure to consistently output data that is incorrect in one direction or another, usually due to an identifiable process. Imagine you created a measure of height, but you didn’t put an option for anyone over six feet tall. If you gave that measure to your local college or university, some of the taller students might not be measured accurately. In fact, you would be under the mistaken impression that the tallest person at your school was six feet tall, when in actuality there are likely people taller than six feet at your school. This error seems innocent, but if you were using that measure to help you build a new building, those people might hit their heads!
A less innocent form of error arises when researchers word questions in a way that might cause participants to think one answer choice is preferable to another. For example, if I were to ask you “Do you think global warming is caused by human activity?” you would probably feel comfortable answering honestly. But what if I asked you “Do you agree with 99% of scientists that global warming is caused by human activity?” Would you feel comfortable saying no, if that’s what you honestly felt? I doubt it. That is an example of a leading question, a question with wording that influences how a participant responds. We’ll discuss leading questions and other problems in question wording in greater detail in Chapter 12.
In addition to error created by the researcher, your participants can cause error in measurement. Some people will respond without fully understanding a question, particularly if the question is worded in a confusing way. Let’s consider another potential source of error. If we asked people if they always washed their hands after using the bathroom, would we expect people to be perfectly honest? Polling people about whether they wash their hands after using the bathroom might only elicit what people would like others to think they do, rather than what they actually do. This is an example of social desirability bias, in which participants in a research study want to present themselves in a positive, socially desirable way to the researcher. People in your study will want to seem tolerant, open-minded, and intelligent, but their true feelings may be closed-minded, simple, and biased. Participants may lie in this situation. This occurs often in political polling, which may show greater support for a candidate from a minority race, gender, or political party than actually exists in the electorate.
A related form of bias is called acquiescence bias, also known as “yea-saying.” It occurs when people say yes to whatever the researcher asks, even when doing so contradicts previous answers. For example, a person might say yes to both “I am a confident leader in group discussions” and “I feel anxious interacting in group discussions.” Those two responses are unlikely to both be true for the same person. Why would someone do this? Similar to social desirability, people want to be agreeable and nice to the researcher asking them questions or they might ignore contradictory feelings when responding to each question. You could interpret this as someone saying "yeah, I guess." Respondents may also act on cultural reasons, trying to “save face” for themselves or the person asking the questions. Regardless of the reason, the results of your measure don’t match what the person truly feels.
So far, we have discussed sources of error that come from choices made by respondents or researchers. Systematic errors will result in responses that are incorrect in one direction or another. For example, social desirability bias usually means that the number of people who say they will vote for a third party in an election is greater than the number of people who actually vote for that candidate. Systematic errors such as these can be reduced, but random error can never be eliminated. Unlike systematic error, which biases responses consistently in one direction or another, random error is unpredictable and does not consistently push scores higher or lower on a given measure. Instead, random error is more like statistical noise, which will likely average out across participants.
Random error is present in any measurement. If you’ve ever stepped on a bathroom scale twice and gotten two slightly different results, maybe a difference of a tenth of a pound, then you’ve experienced random error. Maybe you were standing slightly differently or had a fraction of your foot off of the scale the first time. If you were to take enough measures of your weight on the same scale, you’d be able to figure out your true weight. In social science, if you gave someone a scale measuring depression on a day after they lost their job, they would likely score differently than if they had just gotten a promotion and a raise. Even if the person were clinically depressed, our measure is subject to influence by the random occurrences of life. Thus, social scientists speak with humility about our measures. We are reasonably confident that what we found is true, but we must always acknowledge that our measures are only an approximation of reality.
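If it helps to see that "averaging out" idea in action, here is a minimal simulation sketch (the true value and the size of the noise are invented for illustration): each reading is the true value plus unpredictable noise, and the mean of many readings tends toward the true value even though no single reading is exact.

```python
# A minimal sketch (invented numbers) of random error averaging out: each
# measurement equals the true value plus unpredictable noise, so the mean of
# many measurements tends to settle near the true value.
import random

random.seed(1)
true_weight = 150.0  # the "real" value we are trying to measure

def measure():
    # random error: sometimes a bit high, sometimes a bit low
    return true_weight + random.uniform(-0.5, 0.5)

for n in (1, 10, 100, 10_000):
    readings = [measure() for _ in range(n)]
    print(n, round(sum(readings) / n, 3))
# The averages tend to drift toward 150.0 as n grows; no single reading is exact.
```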
Humility is important in scientific measurement, as errors can have real consequences. At the time I'm writing this, my wife and I are expecting our first child. Like most people, we used a pregnancy test from the pharmacy. If the test said my wife was pregnant when she was not pregnant, that would be a false positive. On the other hand, if the test indicated that she was not pregnant when she was in fact pregnant, that would be a false negative. Even if the test is 99% accurate, that means that one in a hundred women will get an erroneous result when they use a home pregnancy test. For us, a false positive would have been initially exciting, then devastating when we found out we were not having a child. A false negative would have been disappointing at first and then quite shocking when we found out we were indeed having a child. While both false positives and false negatives are not very likely for home pregnancy tests (when taken correctly), measurement error can have consequences for the people being measured.
Key Takeaways
- Reliability is a matter of consistency.
- Validity is a matter of accuracy.
- There are many types of validity and reliability.
- Systematic error may arise from the researcher, participant, or measurement instrument.
- Systematic error biases results in a particular direction, whereas random error can be in any direction.
- All measures are prone to error and should be interpreted with humility.
Exercises
Use the measurement tools you located in the previous exercise. Evaluate the reliability and validity of these tools. Hint: You will need to go into the literature to "research" these tools.
- Provide a clear statement regarding the reliability and validity of these tools. What strengths did you notice? What were the limitations?
- Think about your target population. Are there changes that need to be made in order for one of these tools to be appropriate for your population?
- If you decide to create your own tool, how will you assess its validity and reliability?
Chapter Outline
- Ethical responsibility and cultural respect (6 minute read)
- Critical considerations (6 minute read)
- Preparations: Creating a plan for qualitative data analysis (11 minute read)
- Thematic analysis (15 minute read)
- Content analysis (13 minute read)
- Grounded theory analysis (7 minute read)
- Photovoice (5 minute read)
Content warning: Examples in this chapter contain references to LGBTQ+ ageing, damaged-centered research, long-term older adult care, family violence and violence against women, vocational training, financial hardship, educational practices towards rights and justice, schizophrenia, mental health stigma, and water rights and water access.
Just a brief disclaimer: this chapter is not intended to be a comprehensive discussion on qualitative data analysis. It does offer an overview of some of the diverse approaches that can be used for qualitative data analysis, but as you will read, even within each one of these there are variations in how they might be implemented in a given project. If you are passionate (or at least curious 😊) about conducting qualitative research, use this as a starting point to help you dive deeper into some of these strategies. Please note that there are approaches to analysis that are not addressed in this chapter, but still may be very valuable qualitative research tools. Examples include heuristic analysis,[111] narrative analysis,[112] discourse analysis,[113] and visual analysis,[114] among a host of others. These aren't mentioned to confuse or overwhelm you, but instead to suggest that qualitative research is a broad field with many options. Before we begin reviewing some of these strategies, here are a few considerations regarding ethics, cultural responsibility, and power and control that should influence your thinking and planning as you map out your data analysis plan.
19.1 Ethical responsibility and cultural respectfulness
Learning Objectives
Learners will be able to...
- Identify how researchers can conduct ethically responsible qualitative data analysis.
- Explain the role of culture and cultural context in qualitative data analysis (for both researcher and participant)
The ethics of deconstructing stories
Throughout this chapter, I will consistently suggest that you will be deconstructing data. That is to say, you will be taking the information that participants share with you through their words, performances, videos, documents, photos, and artwork, then breaking it up into smaller points of data, which you will then reassemble into your findings. We have an ethical responsibility to treat what is shared with a sense of respect during this process of deconstruction and reconstruction. This means that we make conscientious efforts not to twist, change, or subvert the meaning of data as we break them down or string them back together.
The act of bringing together people’s stories through qualitative research is not an easy one and shouldn’t be taken lightly. Through the informed consent process, participants should learn about the ways in which their information will be used in your research, including giving them a general idea what will happen in your analysis and what format the end results of that process will likely be.
A deep understanding of cultural context as we make sense of meaning
Similar to the ethical considerations we need to keep in mind as we deconstruct stories, we also need to work diligently to understand the cultural context in which these stories are shared. This requires that we approach the task of analysis with a sense of cultural humility, meaning that we don’t assume that our perspective or worldview as the researcher is the same as our participants. Their life experiences may be quite different from our own, and because of this, the meaning in their stories may be very different than what we might initially expect.
As such, we need to ask questions to better understand words, phrases, ideas, gestures, etc. that seem to have particular significance to participants. We also can use activities like member checking, another tool to support qualitative rigor, to ensure that our findings are accurately interpreted by vetting them with participants prior to the study conclusion. We can spend a good amount of time getting to know the groups and communities that we work with, paying attention to their values, priorities, practices, norms, strengths, and challenges. Finally, we can actively work to challenge more traditional research methods and support more participatory models that position community members as co-researchers or provide consistent oversight of research by community advisory groups to inform, challenge, and advance the process, thus elevating the wisdom of community members and their influence (and power) in the research process.
Accounting for our influence in the analysis process
Along with our ethical responsibility to our research participants, we also have an accountability to research consumers, the scientific community at large, and other stakeholders in our qualitative research. As qualitative researchers (or quantitative researchers, for that matter), people should expect that we have attempted, to the best of our abilities, to account for our role in the research process. This is especially true in analysis. Our findings should not emerge from some ‘black box’ where raw data go in and findings pop out the other side, with no indication of how we arrived at them. Thus, an important part of rigor is transparency and the use of tools such as writing in reflexive journals, memoing, and creating an audit trail to assist us in documenting both our thought process and activities in reaching our findings. There is more about this in Chapter 20, which is dedicated to qualitative rigor.
Key Takeaways
- Ethics, as it relates specifically to the analysis phase of qualitative research, requires that we are especially thoughtful in how we treat the data that participants share with us. This data often represents very intimate parts of people's lives and/or how they view the world. Therefore, we need to actively conduct our analysis in a way that does not misrepresent, compromise the privacy of, and/or disenfranchise or oppress our participants and the groups they belong to.
- Part of demonstrating this ethical commitment to analysis involves capturing and documenting our influence as researchers to the qualitative research process.
Exercises
After you have had a chance to read through this chapter, come back to this exercise. Think about your qualitative proposal. Based on the strategies that you might consider for analysis of your qualitative data:
- What ethical concerns do you have specific to this approach to analyzing your data?
- What steps might you take to anticipate and address these concerns?
19.2 Critical considerations
Learning Objectives
Learners will be able to...
- Explain how data analysis may be used as a tool for power and control
- Develop steps that reflect increased opportunities for empowerment of your study population, especially during the data analysis phase
How are participants present in the analysis process? What power or influence do they have?
Remember, research is political. We need to consider that our findings represent ideas that are shared with us by living and breathing human beings and often the groups and communities that they represent. They have been gracious enough to share their time and their stories with us, yet they often have a limited role once we gather data from them. They are essentially putting their trust in us that we won’t be misrepresenting or culturally appropriating their stories in ways that will be harmful, damaging, or demeaning. Elliot (2016)[115] discusses the problems of "damaged-centered" research, which is research that portrays groups of people or communities as flawed, surrounded by problems, or incapable of producing change. Her work specifically references the way research and media have often portrayed people from the Appalachian region, and how these influences have perpetuated, reinforced, and even created stereotypes that these communities face. We need to thoughtfully consider how the research we are involved in will reflect on our participants and their communities.
Now, some research approaches, particularly participatory approaches, suggest that participants should be trained and actively engaged throughout the research process, helping to shape how our findings are presented and how the target population is portrayed. Implementing a participatory approach requires academic researchers to give up some of their power and control to community co-researchers. Ideally these co-researchers provide their input and are active members in determining what the findings are and interpreting why and how they are important. I believe this is a standard we need to strive for. However, this is the exception, not the rule. As such, if you are working in a more traditional research role where community participants are not actively engaged, it is good practice, whenever possible, to find ways for participants or other community representatives to help validate your findings. Although more limited in scope, these opportunities still suggest ways that community members can be empowered during the research process (and ways that researchers can turn over some of their control). You may do this through activities like consulting with community representatives early and often during the analysis process and using member checking (referenced above and in our chapter on qualitative rigor) to help review and refine results. These are distinct and important roles for the community; they do not mean that community members become researchers, but that they lend their perspectives to help the researcher interpret the findings.
The bringing together of voices: What does this represent and to whom?
As social work researchers, we need to be mindful that research is a tool for advancing social justice. However, that doesn't mean that all research fulfills that capacity or that all parties perceive it in this way. Qualitative research generally involves a relatively small number of participants (or even a single person) sharing their stories. As researchers, we then bring together this data in the analysis phase in an attempt to tell a broader story about the issue we are studying. Our findings often reflect commonalities and patterns, but also should highlight contradictions, tensions, and dissension about the topic.
Exercises
Reflexive Journal Entry Prompt
Pause for a minute. Think about what the findings for your research proposal might represent.
- What do they represent to you as a researcher?
- What do they represent to participants directly involved in your study?
- What do they represent to the families of these participants?
- What do they represent to the groups and communities that represent or are connected to your population?
For each of the perspectives outlined in the reflexive journal entry prompt above, there is no single answer. As a student researcher, your study might represent a grade, an opportunity to learn more about a topic you are interested in, and a chance to hone your skills as a researcher. For participants, the findings might represent a chance to share their input or frustration that they are being misrepresented. Community members might view the research findings with skepticism that research produces any kind of change or anger that findings bring unwanted attention to the community. Obviously we can't foretell all the answers to these questions, but thinking about them can help us to thoughtfully and carefully consider how we go about collecting, analyzing and presenting our data. We certainly need to be honest and transparent in our data analysis, but additionally, we need to consider how our analysis impacts others. It is especially important that we anticipate this and integrate it early into our efforts to educate our participants on what the research will involve, including potential risks.
It is important to note here that a number of perspectives have arisen to challenge traditional research methods. These challenges are often grounded in the issues of power and control that we have been discussing, recognizing that research has been, and continues to be, used as a tool for oppression and division. These perspectives include but are not limited to: Afrocentric methodologies, Decolonizing methodologies, Feminist methodologies, and Queer methodologies. While it's a poor substitute for diving deeper into these valuable contributions, I do want to offer a few resources if you are interested in learning more about these perspectives and how they can help to more inclusively define the research process.
Key Takeaways
- Research findings can represent many different things to many different stakeholders. Rather than as an afterthought, as qualitative researchers, we need to thoughtfully consider a range of these perspectives prior to and throughout the analysis to reduce the risk of oppression and misrepresentation through our research.
- There are a variety of strategies and whole alternative research paradigms that can aid qualitative researchers in conducting research in more empowering ways when compared to traditional research methods, where the researcher largely maintains control and ownership of the research process and agenda.
Resources
Afrocentric methodologies: These methods represent research that is designed, conducted, and disseminated in ways that center and affirm African cultures, knowledge, beliefs, and values. This type of research means that African indigenous culture must be understood and kept at the forefront of any research and recommendations affecting indigenous communities and their culture.
- Pellerin, M. (2012). Benefits of Afrocentricity in exploring social phenomena: Understanding Afrocentricity as a social science methodology.
- University of Illinois Library. (n.d.). The Afrocentric Research Center.
Decolonizing methodologies: These methods represent research that is designed, conducted, and disseminated in ways to reclaim control over indigenous ways of knowing and being.[116]
- Paris, D., & Winn, M. T. (Eds.). (2013). Humanizing research: Decolonizing qualitative inquiry with youth and communities. Sage Publications.
- Smith, L. T. (2012). Decolonizing methodologies: Research and indigenous peoples (2nd ed.). Zed Books Ltd.
Feminist methodologies: Research methods in this tradition seek to, "remove the power imbalance between research and subject; (are) politically motivated in that (they) seeks to change social inequality; and (they) begin with the standpoints and experiences of women".[117]
- Gill, J. (n.d.) Feminist research methodologies. Feminist Perspectives on Media and Technology.
- UC Davis, Feminist Research Institute. (n.d.). What is feminist research?
Queer(ing) methodologies: Research methods using this approach aim to question, challenge, and often reject knowledge that is commonly accepted and privileged in society, and to elevate and empower knowledge and perspectives that are often perceived as non-normative.
- de Jong, D. H. (2014). A new paradigm in social work research: It’s here, it’s queer, get used to it!
- Ghaziani, A., & Brim, M. (Eds.). (2019). Imagining queer methods. NYU Press.
19.3 Preparations: Creating a plan for qualitative data analysis
Learning Objectives
Learners will be able to...
- Identify how your research question, research aim, sample selection, and type of data may influence your choice of analytic methods
- Outline the steps you will take in preparation for conducting qualitative data analysis in your proposal
Now we can turn our attention to planning your analysis. The analysis should be anchored in the purpose of your study. Qualitative research can serve a range of purposes. Below is a brief list of general purposes we might consider when using a qualitative approach.
- Are you trying to understand how a particular group is affected by an issue?
- Are you trying to uncover how people arrive at a decision in a given situation?
- Are you trying to examine different points of view on the impact of a recent event?
- Are you trying to summarize how people understand or make sense of a condition?
- Are you trying to describe the needs of your target population?
If you don't see the general aim of your research question reflected in one of these areas, don't fret! This is only a small sampling of what you might be trying to accomplish with your qualitative study. Whatever your aim, you need to have a plan for what you will do once you have collected your data.
Exercises
Decision Point: What are you trying to accomplish with your data?
- Consider your research question. What do you need to do with the qualitative data you are gathering to help answer that question?
To help answer this question, consider:
- What action verb(s) can be associated with your project and the qualitative data you are collecting? Does your research aim to summarize, compare, describe, examine, outline, identify, review, compose, develop, illustrate, etc.?
- Then, consider noun(s) you need to pair with your verb(s)—perceptions, experiences, thoughts, reactions, descriptions, understanding, processes, feelings, actions, responses, etc.
Iterative or linear
We touched on this briefly in Chapter 17 about qualitative sampling, but this is an important distinction to consider. Some qualitative research is linear, meaning it follows more of a traditionally quantitative process: create a plan, gather data, and analyze data; each step is completed before we proceed to the next. You can think of this like how information is presented in this book. We discuss each topic, one after another.
However, many times qualitative research is iterative, or evolving in cycles. An iterative approach means that once we begin collecting data, we also begin analyzing data as it is coming in. This early and ongoing analysis of our (incomplete) data then impacts our continued planning, data gathering, and future analysis. Again, coming back to this book: while it may be written linearly, we hope that you engage with it iteratively as you build your proposal. By this we mean that you will revisit previous sections so you can understand how they fit together, and that you are in a continuous process of building and revising how you think about the concepts you are learning.
As you may have guessed, there are benefits and challenges to both linear and iterative approaches. A linear approach is much more straightforward, with each step fairly well defined. However, being more defined and rigid also presents certain challenges. A linear approach assumes that we know what we need to ask or look for at the very beginning of data collection, which often is not the case.
With iterative research, we have more flexibility to adapt our approach as we learn new things. We still need to keep our approach systematic and organized, however, so that our work doesn't become a free-for-all. As we adapt, we do not want to stray too far from the original premise of our study. It's also important to remember with an iterative approach that we may risk ethical concerns if our work extends beyond the original boundaries of our informed consent and IRB agreement. If you feel that you do need to modify your original research plan in a significant way as you learn more about the topic, you can submit an addendum to modify your original IRB application. Make sure to keep detailed notes of the decisions that you are making and what is informing these choices. This helps to support transparency and your credibility throughout the research process.
Exercises
Decision Point: Will your analysis reflect more of a linear or an iterative approach?
- What justifies or supports this decision?
Think about:
- Fit with your research question
- Available time and resources
- Your knowledge and understanding of the research process
Exercises
Reflexive Journal Entry Prompt
- Are you more of a linear thinker or an iterative thinker?
- What evidence are you basing this on?
- How might this help or hinder your qualitative research process?
- How might this help or hinder you in a practice setting as you work with clients?
Acquainting yourself with your data
As you begin your analysis, you need to get to know your data. This usually means reading through your data prior to any attempt at breaking it apart and labeling it. You might read through it a couple of times, in fact. This helps give you a more comprehensive feel for each piece of data and the data as a whole, again, before you start to break it down into smaller units or deconstruct it. This is especially important if others assisted us in the data collection process. We often gather data as part of a team, and everyone involved in the analysis needs to be very familiar with all of the data.
Capturing your reaction to the data
During the review process, our understanding of the data often evolves as we observe patterns and trends. It is a good practice to document your reaction and evolving understanding. Your reaction can include noting phrases or ideas that surprise you, similarities or distinct differences in responses, additional questions that the data brings to mind, among other things. We often record these reactions directly in the text or artifact if we have the ability to do so, such as making a comment in a Word document associated with a highlighted phrase. If this isn’t possible, you will want to have a way to track what specific spot(s) in your data your reactions are referring to. In qualitative research we refer to this process as memoing. Memoing is a strategy that helps us to link our findings to our raw data, demonstrating transparency. If you are using a Computer-Assisted Qualitative Data Analysis Software (CAQDAS) package, memoing functions are generally built into the technology.
Capturing your emerging understanding of the data
During your reviewing and memoing you will start to develop and evolve your understanding of what the data means. This understanding should be dynamic and flexible, but you want to have a way to capture this understanding as it evolves. You may include this as part of your memoing or as part of your codebook where you are tracking the main ideas that are emerging and what they mean. Figure 19.3 is an example of how your thinking about a code might change and how you can go about capturing it. Coding is a part of the qualitative data analysis process where we begin to interpret and assign meaning to the data. It represents one of the first steps as we begin to filter the data through our own subjective lens as the researcher. We will discuss coding in much more detail in the sections below covering different approaches to analysis.
| Date | Code Label | Explanations |
|---|---|---|
| 6/18/18 | Experience of wellness | This code captures the different ways people describe wellness in their lives |
| 6/22/18 | Understanding of wellness | Changed the label of this code slightly to reflect that many participants emphasize the cognitive aspect of how they understand wellness—how they think about it in their lives, not only the act of 'experiencing it'. This understanding seems like a precursor to experiencing. An evolving sense of how you think about wellness in your life. |
| 6/25/18 | Wellness experienced by developing personal awareness | A broader understanding of this category is developing. It involves building a personalized understanding of what makes up wellness in each person's life and the role that they play in maintaining it. Participants have emphasized that this is a dynamic, personal and ongoing process of uncovering their own intimate understanding of wellness. They describe having to experiment, explore, and reflect to develop this awareness. |
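If you would rather keep an evolving codebook like this in a plain data file instead of a word-processing document, a minimal sketch might look like the following (the file name is hypothetical, and the entries are adapted from the figure above only for illustration); the point is simply that each change to a code's label or meaning gets its own dated row.

```python
# A minimal sketch (hypothetical file name; entries adapted from Figure 19.3)
# of an evolving codebook kept as a CSV: one dated row per change to a code.
import csv

codebook_entries = [
    {"date": "2018-06-18", "code_label": "Experience of wellness",
     "explanation": "Captures the different ways people describe wellness."},
    {"date": "2018-06-22", "code_label": "Understanding of wellness",
     "explanation": "Relabeled to reflect the cognitive aspect participants emphasize."},
]

with open("codebook.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "code_label", "explanation"])
    writer.writeheader()
    writer.writerows(codebook_entries)
```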
Exercises
Decision Point: How to capture your thoughts?
- How will you capture your thinking about the data and your emerging understanding about what it means?
- What will this look like?
- How often will you do it?
- How will you keep it organized and consistent over time?
In addition, you will want to be actively using your reflexive journal during this time. Document your thoughts and feelings throughout the research process. This will promote transparency and help account for your role in the analysis.
For entries during your analysis, respond to questions such as these in your journal:
- What surprises you about what participants are sharing?
- How has this information challenged you to look at this topic differently?
- As you reflect on these findings, what personal biases or preconceived notions have been exposed for you?
- Where might these have come from?
- How might these be influencing your study?
- How will you proceed differently based on what you are learning?
Community members included as active co-researchers can be invaluable in reviewing, reacting to, and leading the interpretation of data during your analysis. While it can certainly be challenging to converge on an agreed-upon version of the results, their insider knowledge and lived experience can provide very important insights into the data analysis process.
Determining when you are finished
When conducting quantitative research, it is perhaps easier to decide when we are finished with our analysis. We determine the tests we need to run, we perform them, we interpret them, and for the most part, we call it a day. It's a bit more nebulous for qualitative research. There is no hard and fast rule for when we have completed our qualitative analysis. Rather, our decision to end the analysis should be guided by reflection and consideration of a number of important questions. These questions are presented below to help ensure that your analysis results in a finished product that is comprehensive, systematic, and coherent.
Have I answered my research question?
Your analysis should be clearly connected to and in service of answering your research question. Your examination of the data should help you arrive at findings that sufficiently address the question that you set out to answer. You might find that it is surprisingly easy to get distracted while reviewing all your data. Make sure that as you conduct the analysis you keep coming back to your research question.
Have I utilized all my data?
Unless you have intentionally made the decision that certain portions of your data are not relevant for your study, make sure that you don’t have sources or segments of data that aren’t incorporated into your analysis. If some data doesn’t “fit” the general trends you are uncovering, find a way to acknowledge this in your findings as well so that these voices don’t get lost in your data.
Have I fulfilled my obligation to my participants?
As a qualitative researcher, you are a craftsperson. You are taking raw materials (e.g. people’s words, observations, photos) and bringing them together to form a new creation, your findings. These findings need to both honor the original integrity of the data that is shared with you and help tell a broader story that answers your research question(s).
Have I fulfilled my obligation to my audience?
Not only do your findings need to help answer your research question, but they need to do so in a way that is consumable for your audience. From an analysis standpoint, this means that we need to make sufficient efforts to condense our data. For example, if you are conducting a thematic analysis, you don’t want to wind up with 20 themes. Having this many themes suggests that you aren’t finished looking at how these ideas relate to each other and might be combined into broader themes. Having these sufficiently reduced to a handful of themes will help tell a more complete story, one that is also much more approachable and meaningful for your reader.
In the following subsections, there is information regarding a variety of different approaches to qualitative analysis. In designing your qualitative study, you would identify an analytical approach as you plan out your project. The one you select would depend on the type of data you have and what you want to accomplish with it.
Key Takeaways
- Qualitative research analysis requires preparation and careful planning. You will need to take time to familiarize yourself with the data in a general sense before you begin analyzing.
- Once you begin your analysis, make sure that you have strategies for capturing and recording both your reaction to the data and your corresponding developing understanding of what the collective meaning of the data is (your results). Qualitative research is invested not only in the end results but also in the process by which you arrive at them.
Exercises
Decision Point: When will you stop?
- How will you know when you are finished? What will determine your endpoint?
- How will you monitor your work so you know when it's over?
19.4 Thematic analysis
Learning Objectives
Learners will be able to...
- Explain defining features of thematic analysis as a strategy for qualitative data analysis and identify when it is most effectively used
- Formulate an initial thematic analysis plan (if appropriate for your research proposal)
What are you trying to accomplish with thematic analysis?
As its name suggests, with thematic analysis we are attempting to identify themes or common ideas across our data. Themes can help us to:
- Determine shared meaning or significance of an event
- Provide a more complete understanding of a concept or idea by exposing different dimensions of the topic
- Explore a range of values, beliefs or perceptions on a given topic
Themes help us to identify common ways that people are making sense of their world. Let’s say that you are studying empowerment of older adults in assisted living facilities by interviewing residents in a number of these facilities. As you review your transcripts, you note that a number of participants are talking about the importance of maintaining connection to previous aspects of their life (e.g. their mosque, their Veterans of Foreign Wars (VFW) Post, their Queer book club) and having input into how the facility is run (e.g. representative on the board, community town hall meetings). You might note that these are two emerging themes in your data. After you have deconstructed your data, you will likely end up with a handful (likely three or four) central ideas or take-aways that become the themes or major findings of your research.
Variations in approaches to thematic analysis
There are a variety of ways to approach qualitative data analysis, but even within the broad approach of thematic analysis, there is variation. Some thematic analysis takes on an inductive analysis approach. In this case, we would first deconstruct our data into small segments representing distinct ideas (this is explained further in the section below on coding data). We then go on to see which of these pieces seem to group together around common ideas.
In direct contrast, you might take a deductive analysis approach (like we discussed in Chapter 8), in which you start with some idea about what the groupings might look like and see how well your data fits into those pre-identified groupings. These initial deductive groupings (we call these a priori categories) often come from an existing theory related to the topic we are studying. You may also elect to use a combination of deductive and inductive strategies, especially if you find that much of your data is not fitting into deductive categories and you decide to let new categories inductively emerge.
A couple things to note here. If you are using a deductive approach, be clear in specifying where your a priori categories came from. For instance, perhaps you are interested in studying the conceptualization of social work in other cultures. You begin your analysis with prior research conducted by Tracie Mafile'o (2004) that identified the concepts of fekau'aki (connecting) and fakatokilalo (humility) as being central to Tongan social work practice.[118] You decide to use these two concepts as part of your initial deductive framework, because you are interested in studying a population that shares much in common with the Tongan people. When using an inductive approach, you need to plan to use memoing and reflexive journaling to document where the new categories or themes are coming from.
Coding data
Coding is the process of breaking down your data into smaller meaningful units. Just as any story is made up of many smaller ideas brought together, you need to uncover and label these smaller ideas within each piece of your data. After you have reviewed each piece of data, you will go back and assign labels to words, phrases, or pieces of data that represent separate ideas that can stand on their own. Identifying and labeling codes can be tricky. When attempting to locate units of data to code, look for pieces of data that seem to represent an idea in and of itself; a unique thought that stands alone. For additional information about coding, check out this brief video from Duke's Social Science Research Institute on this topic. It offers a nice, concise overview of coding and also ties into our previous discussion of memoing to help encourage rigor in your analysis process.
As suggested in the video[119], when you identify segments of data and are considering what to label them ask yourself:
- How does this relate to/help to answer my research question?
- How does this connect with what we know from the existing literature?
- How does this fit (or contrast) with the rest of my data?
You might do the work of coding in the margins if you are working with hard copies, or you might do this through the use of comments or through copying and pasting if you are working with digital materials (like pasting them into an Excel sheet, as in the example below). If you are using a CAQDAS, there will be functions built into the software to accomplish this.
Regardless of which strategy you use, the central task of thematic analysis is to have a way to label discrete segments of your data with a short phrase that reflects what each one stands for. As you come across segments that seem to mean the same thing, you will want to use the same code. Make sure to select the words to represent your codes wisely, so that they are clear and memorable. When you are finished, you will likely have hundreds (if not thousands!) of different codes – again, a story is made up of many different ideas and you are bringing together many different stories! A cautionary note: if you are physically manipulating your data in some way, for example copying and pasting (which I frequently do), you need to have a way to trace each code or little segment back to its original home (the artifact that it came from).
When I’m working with interview data, I will assign each interview transcript a code and use continuous line numbering. That way I can label each segment of data or code with a corresponding transcript code and line number so I can find where it came from in case I need to refer back to the original.
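If you are working digitally, one simple way to keep that traceability is to store each coded segment together with its source and line numbers. The sketch below is only an illustration of the idea (the transcript ID, line numbers, and codes are hypothetical), not a prescribed format.

```python
# A minimal sketch (hypothetical transcript ID, lines, and codes) of coded
# segments that can always be traced back to their source: each entry keeps
# the transcript code and line numbers alongside the segment text and code.
coded_segments = [
    {"transcript": "T01", "lines": "1-5", "code": "Memories",
     "segment": "I have a vivid picture in my mind of my mother..."},
    {"transcript": "T01", "lines": "7-8", "code": "Meaning of war",
     "segment": "I couldn't understand 'war,' of course..."},
]

# Pull every segment labeled with a given code, with its source attached.
for s in coded_segments:
    if s["code"] == "Memories":
        print(s["transcript"], s["lines"], "->", s["segment"][:40])
```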
The following is an excerpt from a portion of an autobiographical memoir (Wolf, 2010)[120]. Continuous numbers have been added to the transcript to identify line numbers (Figure 19.4). A few preliminary codes have been identified from this data and entered into a data matrix (below) with information to trace back to the raw data (transcript) (Figure 19.5).
1 | I have a vivid picture in my mind of my mother, sitting at a kitchen table, |
2 | listening to the announcement of FDR’s Declaration of War in his famous “date |
3 | which will live in infamy” speech delivered to Congress on December 8, 1941: |
4 | “The United States was suddenly and deliberately attacked by naval and air forces |
5 | of the Empire of Japan.” I still can hear his voice. |
6 | |
7 | I couldn’t understand “war,” of course, but I knew that something terrible had |
8 | happened; and I wanted it to stop so my mother wouldn’t be unhappy. I later |
9 | asked my older brother what war was and when it would be over. He said, “Not |
10 | soon, so we better get ready for it, and, remember, kid, I’m a Captain and you’re a |
11 | private.” |
12 | |
13 | So the war became a family matter in some sense: my mother’s sorrow (thinking, |
14 | doubtless, about the fate and future of her sons) and my brother’s assertion of |
15 | male authority and superiority always thereafter would come to mind in times of |
16 | international conflict—just as Pearl Harbor, though it was far from the mainland, |
17 | always would be there for America as an icon of victimization, never more so than |
18 | in the semi-paranoid aftermath of “9/11” with its disastrous consequences in |
19 | Iraq. History always has a personal dimension. |
| Data Segment | Transcript (Source) | Transcript Line | Initial Code |
|---|---|---|---|
| I have a vivid picture in my mind of my mother, sitting at a kitchen table, listening to the announcement of FDR’s Declaration of War in his famous “date which will live in infamy” speech delivered to Congress on December 8, 1941: “The United States was suddenly and deliberately attacked by naval and air forces of the Empire of Japan.” I still can hear his voice. | Wolf Memoir | 1-5 | Memories |
| I couldn’t understand “war,” of course, but I knew that something terrible had happened; and I wanted it to stop so my mother wouldn’t be unhappy. | Wolf Memoir | 7-8 | Meaning of War |
| I later asked my older brother what war was and when it would be over. He said, “Not soon, so we better get ready for it, and, remember, kid, I’m a Captain and you’re a private.” | Wolf Memoir | 8-11 | Meaning of War; Memories |
Exercises
Below is another excerpt from the same memoir[121]
What segments of this interview can you pull out and what initial code would you place on them?
Create a data matrix as you reflect on this.
It was painful to think, even at an early age, that a part of the world I was beginning to love—Europe—was being substantially destroyed by the war; that cities with their treasures, to say nothing of innocent people, were being bombed and consumed in flames. I was a patriotic young American and wanted “us” to win the war, but I also wanted Europe to be saved.
Some displaced people began to arrive in our apartment house, and even as I knew that they had suffered in Europe, their names and language pointed back to a civilized Europe that I wanted to experience. One person, who had studied at Heidelberg, told me stories about student life in the early part of the 20th century that inspired me to want to become an accomplished student, if not a “student prince.” He even had a dueling scar. A baby-sitter showed me a photo of herself in a feathered hat, standing on a train platform in Bratislava. I knew that she belonged in a world that was disappearing.
For those of us growing up in New York City in the 1940s, Japan, following Pearl Harbor and the “death march” in Corregidor, seemed to be our most hated enemy. The Japanese were portrayed as grotesque and blood-thirsty on posters. My friends and I were fighting back against the “Japs” in movie after movie: Gung Ho, Back to Bataan, The Purple Heart, Thirty Seconds Over Tokyo, They Were Expendable, and Flying Tigers, to name a few.
We wanted to be like John Wayne when we grew up. It was only a few decades after the war, when we realized the horrors of Hiroshima and Nagasaki, that some of us began to understand that the Japanese, whatever else was true, had been dehumanized as a people; that we had annihilated, guiltlessly at the time, hundreds of thousands of non-combatants in a horrific flash. It was only after the publication of John Hersey’s Hiroshima(1946), that we began to think about other sides of the war that patriotic propaganda had concealed.
When my friends and I went to summer camp in the foothills of the Berkshires during the late years of the war and sang patriotic songs around blazing bonfires, we weren’t thinking about the firestorms of Europe (Dresden) and Japan. We were worried that our counselors would be drafted and suddenly disappear, leaving us unprotected.
Identifying, reviewing, and refining themes
Now that we have our codes, we need to find a sensible way of putting them together. Remember, we want to narrow this vast field of hundreds of codes down to a small handful of themes. If we don't review and refine all these codes, the story we are trying to tell with our data becomes distracting and diffuse. An example is provided below to demonstrate this process.
As we refine our thematic analysis, our first step will be to identify groups of codes that hang together or seem to be related. Let's say you are studying the experience of people who are in a vocational preparation program and you have codes labeled “worrying about paying the bills” and “loss of benefits”. You might group these codes into a category you label “income & expenses” (Figure 19.6).
| Code | Category | Reasoning |
|---|---|---|
| Worrying about paying the bills | Income & expenses | Seem to be talking about financial stressors and potential impact on resources |
| Loss of benefits | Income & expenses | |
| Code | Initial Category | Initial Reasoning | Revised Category | Revised Reasoning |
|---|---|---|---|---|
| Worrying about paying the bills | Income & expenses | Seem to be talking about financial stressors and potential impact on resources | Financial insecurities | Expanded category to also encompass a personal factor: confidence related to the issue |
| Loss of benefits | | | | |
| Not confident managing money | | | | |
You may review and refine the groups of your codes many times during the course of your analysis, including shifting codes around from one grouping to another as you get a clearer picture of what each of the groups represent. This reflects the iterative process we were describing earlier. While you are shifting codes and relabeling categories, track this! A research journal is a good place to do this. So, as in the example above, you would have a journal entry that explains that you changed the label on the category from “income & expenses” to “financial insecurities” and you would briefly explain why. Your research journal can take many different forms. It can be hard copy, an evolving word document, or a spreadsheet with multiple tabs (Figure 19.8).
Journal Entry Date: 10/04/19 Changed category [Income & expenses] to [Financial insecurities] to include new code "Not confident managing money" that appears to reflect a personal factor related to the participant's confidence or personal capability related to the topic. |
Now, eventually you may decide that some of these categories can also be grouped together, but still stand alone as separate ideas. Continuing with our example above, you have another category labeled “financial potential” that contains codes like “money to do things” and “saving for my future”. You determine that “financial insecurities” and “financial potential” are related, but distinctly different aspects of a broader grouping, which you go on to label “financial considerations”. This broader grouping reflects both the more worrisome or stressful aspects of the experiences of the people you have interviewed and the optimism and hope they expressed related to finances and future work (Figure 19.9).
| Code | Initial Category | Initial Reasoning | Revised Category | Revised Reasoning | Theme |
|---|---|---|---|---|---|
| Worrying about paying the bills | Income & expenses | Seem to be talking about financial stressors and potential impact on resources | Financial insecurities | Expanded category to also encompass a personal factor: confidence related to the issue | Financial considerations |
| Loss of benefits | | | | | |
| Not confident managing money | | | | | |
| Money to do things | Financial potential | Reflects positive aspects related to earnings | | | |
| Saving for my future | | | | | |
This broadest grouping then becomes your theme. Utilizing the categories and the codes contained therein, you create a description of what each of your themes means based on the data you have collected, and again, you can record this in your research journal (Figure 19.10).
Journal Entry Date: 10/10/19 Identified an emerging theme [Financial considerations] that reflects both the concerns reflected under [Financial insecurities] but also the hopes or more positive sentiments related to finances and work [Financial potential] expressed by participants. As participants prepare to return to work, they appear to experience complex and maybe even conflicting feelings towards how it will impact their finances and what this will mean for their lives. |
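If you are tracking this rollup in a spreadsheet or a small script rather than in your journal alone, a minimal sketch (using the codes and labels from the running example above) might look like the following; it simply records which codes belong to which category and which categories belong to the theme.

```python
# A minimal sketch (codes and labels from the worked example above) of how
# individual codes roll up into categories, and categories into a theme.
code_to_category = {
    "Worrying about paying the bills": "Financial insecurities",
    "Loss of benefits": "Financial insecurities",
    "Not confident managing money": "Financial insecurities",
    "Money to do things": "Financial potential",
    "Saving for my future": "Financial potential",
}
category_to_theme = {
    "Financial insecurities": "Financial considerations",
    "Financial potential": "Financial considerations",
}

# Group the codes by theme so we can see what each theme is built from.
theme_contents = {}
for code, category in code_to_category.items():
    theme = category_to_theme[category]
    theme_contents.setdefault(theme, {}).setdefault(category, []).append(code)

print(theme_contents)
```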
Building a thematic representation
However, providing a list of themes may not really tell the whole story of your study. It may fail to explain to your audience how these individual themes relate to each other. A thematic map or thematic array can do just that: provide a visual representation of how each individual theme (and the categories within it) fits with the others. As you build your thematic representation, be thoughtful about how you position each of your themes, as this spatial arrangement tells part of the story.[122] You should also make sure that the relationships between the themes represented in your thematic map or array are narratively explained in your text as well.
Figure 19.11 offers an illustration of the beginning of a thematic map for the theme we had been developing in the examples above. I emphasize that this is the beginning because we would likely have a few other themes (not just "financial considerations"). These other themes might have codes or categories in common with this theme, and these connections would be visually evident in our map. As you can see in the example, the thematic map allows the reader, reviewer, or researcher to quickly see how these ideas relate to each other. Each of these themes would be explained in greater detail in our write-up of the results. Additionally, sample quotes from the data that reflect those themes are often included.
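If you want to rough out a thematic map before drawing it, one low-tech option is to list the theme-to-category and category-to-code relationships and print them as an indented outline. The sketch below uses the relationships from the running example; the same edge list could later be handed to whatever diagramming tool you prefer.

```python
# A minimal sketch (relationships from the running example) of a thematic map
# kept as a simple edge list: theme -> category -> code. Printing it gives a
# quick text outline of the map.
edges = [
    ("Financial considerations", "Financial insecurities"),
    ("Financial considerations", "Financial potential"),
    ("Financial insecurities", "Worrying about paying the bills"),
    ("Financial insecurities", "Loss of benefits"),
    ("Financial insecurities", "Not confident managing money"),
    ("Financial potential", "Money to do things"),
    ("Financial potential", "Saving for my future"),
]

def print_tree(node, indent=0):
    print("  " * indent + node)
    for parent, child in edges:
        if parent == node:
            print_tree(child, indent + 1)

print_tree("Financial considerations")
```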
Key Takeaways
- Thematic analysis offers qualitative researchers a method of data analysis through which we can identify common themes or broader ideas that are represented in our qualitative data.
- Themes are identified through an iterative process of coding and categorizing (or grouping) to identify trends during your analysis.
- Tracking and documenting this process of theme identification is an important part of utilizing this approach.
Resources
References for learning more about Thematic Analysis
Clarke, V. (2017, December 9). What is thematic analysis?
Maguire, M., & Delahunt, B. (2017). Doing a thematic analysis: A practical, step-by-step guide for learning and teaching scholars.
Nowell et al. (2017). Thematic analysis: Striving to meet the trustworthiness criteria.
The University of Auckland. (n.d.). Thematic analysis: A reflexive approach.
A few exemplars of studies employing Thematic Analysis
Bastiaensens et al. (2019). “Were you cyberbullied? Let me help you.” Studying adolescents’ online peer support of cyberbullying victims using thematic analysis of online support group Fora.
Borgström, Å., Daneback, K., & Molin, M. (2019). Young people with intellectual disabilities and social media: A literature review and thematic analysis.
Kapoulitsas, M., & Corcoran, T. (2015). Compassion fatigue and resilience: A qualitative analysis of social work practice.
19.5 Content analysis
Learning Objectives
Learners will be able to...
- Explain defining features of content analysis as a strategy for analyzing qualitative data
- Determine when content analysis can be most effectively used
- Formulate an initial content analysis plan (if appropriate for your research proposal)
What are you trying to accomplish with content analysis?
Much like with thematic analysis, if you elect to use content analysis to analyze your qualitative data, you will be deconstructing the artifacts that you have sampled and looking for similarities across these deconstructed parts. Also consistent with thematic analysis, you will be seeking to bring together these similarities in the discussion of your findings to tell a collective story of what you learned across your data. While the distinction between thematic analysis and content analysis is somewhat murky, if you are looking to distinguish between the two, content analysis:
- Places greater emphasis on determining the unit of analysis. Just to quickly distinguish: when we discussed sampling in Chapter 10, we also used the term "unit of analysis." In that context, the unit of analysis refers to the entity that a researcher wants to say something about at the end of her study (individual, group, or organization). For our purposes in content analysis, however, the term refers to the 'chunk' or segment of data you will be looking at to reflect a particular idea. This may be a line, a paragraph, a section, an image or section of an image, a scene, etc., depending on the type of artifact you are dealing with and the level at which you want to subdivide it.
- Is more adept at bringing together a variety of forms of artifacts in the same study. While other approaches can certainly accomplish this, content analysis more readily allows the researcher to deconstruct, label, and compare different kinds of 'content'. For example, perhaps you have developed a new advocacy training for community members. To evaluate your training, you want to analyze a variety of products participants create after the workshop, including written products (e.g. letters to their representatives, community newsletters), audio/visual products (e.g. interviews with leaders, photos hosted in a local art exhibit on the topic), and performance products (e.g. hosting town hall meetings, facilitating rallies). Content analysis gives you the capacity to examine evidence across these different formats.
For some more in-depth discussion comparing these two approaches, including more philosophical differences between the two, check out this article by Vaismoradi, Turunen, and Bondas (2013).[123]
Variations in the approach
There are also significant variations among different content analysis approaches. Some of these approaches are more concerned with quantifying (counting) how many times a code representing a specific concept or idea appears. These are more quantitative and deductive in nature. Other approaches look for codes to emerge from the data to help describe some idea or event. These are more qualitative and inductive. Hsieh and Shannon (2005)[124] describe three approaches to help understand some of these differences:
- Conventional Content Analysis. Starting with a general idea or phenomenon you want to explore (for which there is limited data), coding categories then emerge from the raw data. These coding categories help us understand the different dimensions, patterns, and trends that may exist within the raw data collected in our research.
- Directed Content Analysis. Starts with a theory or existing research for which you develop your initial codes (there is some existing research, but incomplete in some aspects) and uses these to guide your initial analysis of the raw data to flesh out a more detailed understanding of the codes and ultimately, the focus of your study.
- Summative Content Analysis. Starts by examining how many times and where codes show up in your data, but then looks to develop an understanding or an "interpretation of the underlying context" (p. 1277) for how they are being used. As you might have guessed, this approach is more likely to be used if you're studying a topic that already has some existing research that provides a starting point for the analysis.
This is only one system of categorization for different approaches to content analysis. If you are interested in utilizing a content analysis for your proposal, you will want to design an approach that fits well with the aim of your research and will help you generate findings that will help to answer your research question(s). Make sure to keep this as your north star, guiding all aspects of your design.
Determining your codes
We are back to coding! As in thematic analysis, you will be coding your data (labeling smaller chunks of information within each data artifact of your sample). In content analysis, you may be using pre-determined codes, such as those suggested by an existing theory (deductive) or you may seek out emergent codes that you uncover as you begin reviewing your data (inductive). Regardless of which approach you take, you will want to develop a well-documented codebook.
A codebook is a document that outlines the list of codes you are using as you analyze your data, a descriptive definition of each of these codes, and any decision-rules that apply to your codes. A decision-rule provides information on how the researcher determines what code should be placed on an item, especially when codes may be similar in nature. If you are using a deductive approach, your codebook will largely be formed prior to analysis, whereas if you use an inductive approach, your codebook will be built over time. To help illustrate what this might look like, Figure 19.12 offers a brief excerpt of a codebook from one of the projects I'm currently working on.
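As a purely optional illustration, a codebook can also be kept as structured data (for example, a spreadsheet or CSV file) so that the audit trail travels with your project. The sketch below assumes you are working in Python; every code, definition, and decision-rule shown is hypothetical and is not the excerpt from Figure 19.12.

```python
import csv

# Illustrative only: these codes, definitions, and decision-rules are
# hypothetical and are NOT the excerpt shown in Figure 19.12.
codebook = [
    {
        "code": "financial strain",
        "definition": "Participant describes money-related stress or hardship.",
        "decision_rule": (
            "Apply only when a concrete cost or money worry is named; "
            "otherwise use the broader 'stressors' code."
        ),
    },
    {
        "code": "stressors",
        "definition": "Any source of stress that is not primarily financial.",
        "decision_rule": "Do not double-code a segment with 'financial strain'.",
    },
]

# Saving the codebook as a file keeps a documented trail of your decision-rules.
with open("codebook.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["code", "definition", "decision_rule"])
    writer.writeheader()
    writer.writerows(codebook)
```

Whatever format you choose, the point is the same: the codebook should be written down, dated, and updated as your analysis evolves.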
Coding, comparing, counting
Once you have (or are developing) your codes, your next step will be to actually code your data. In most cases, you are looking for your coding structure (your list of codes) to have good coverage, meaning that most of the content in your sample should have a code applied to it. If there are large segments of your data that are uncoded, you are potentially missing things. Do note that I said most cases: there are instances when we are using artifacts that contain a lot of information, only some of which applies to what we are studying, and in those instances we obviously wouldn't expect the same level of coverage. As you code, you may change, refine, and adapt your codebook as you go through your data and compare the information that reflects each code. As you do this, keep your research journal handy and make sure to capture and record these changes so that you have a trail documenting the evolution of your analysis. Finally, as suggested earlier, content analysis may also involve some degree of counting. You may keep a tally of how many times a particular code is represented in your data, thereby offering your reader both a quantification of how many times (and across how many sources) a code was reflected and a narrative description of what that code came to mean.
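For the counting side of content analysis, a short script can tally how many times each code was applied and across how many distinct sources it appeared. This is only a sketch under assumed data: the sources and codes below are hypothetical and continue the hypothetical codebook shown earlier.

```python
from collections import Counter, defaultdict

# Illustrative only: hypothetical coded segments, one entry per code application.
coded_segments = [
    {"source": "interview_01", "code": "financial strain"},
    {"source": "interview_01", "code": "stressors"},
    {"source": "interview_02", "code": "financial strain"},
    {"source": "newsletter_03", "code": "financial strain"},
]

# How many times was each code applied?
code_counts = Counter(seg["code"] for seg in coded_segments)

# Across how many distinct sources did each code appear?
sources_per_code = defaultdict(set)
for seg in coded_segments:
    sources_per_code[seg["code"]].add(seg["source"])

for code, count in code_counts.items():
    print(f"{code}: applied {count} time(s) across {len(sources_per_code[code])} source(s)")
```

A simple tally sheet in a research journal or spreadsheet accomplishes the same thing; the quantification only matters insofar as it supports your narrative description of what each code came to mean.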
Representing the findings from your coding scheme
Finally, you need to consider how you will represent the findings from your coding work. This may involve listing narrative descriptions of codes, visual representations of what each code came to mean or how the codes related to each other, or a table that includes examples of how your data reflected different elements of your coding structure. However you choose to represent the findings of your content analysis, make sure the resulting product answers your research question and is readily understandable and easy to interpret for your audience.
Key Takeaways
- Much like thematic analysis, content analysis is concerned with breaking up qualitative data so that you can compare and contrast ideas as you look across all of your data collectively. A couple of distinctions between the two are content analysis's emphasis on clearly specifying the unit of analysis and the greater flexibility content analysis offers in comparing across different types of data.
- Coding involves both grouping data (after it has been deconstructed) and defining these codes (giving them meaning). If we are using a deductive approach to analysis, we start with the codes already defined; if we are using an inductive approach, the codes are defined and refined over the course of the analysis.
Exercises
Identify a qualitative research article that uses content analysis (do a quick search of "qualitative" and "content analysis" in your research search engine of choice).
- How do the authors display their findings?
- What was effective in their presentation?
- What was ineffective in their presentation?
Resources
Resources for learning more about Content Analysis
Bengtsson, M. (2016). How to plan and perform a qualitative study using content analysis.
Colorado State University (n.d.) Writing@CSU Guide: Content analysis.
Columbia University Mailman School of Public Health, Population Health. (n.d.) Methods: Content analysis
Mayring, P. (2000, June). Qualitative content analysis.
A few exemplars of studies employing Content Analysis
Collins et al. (2018). Content analysis of advantages and disadvantages of drinking among individuals with the lived experience of homelessness and alcohol use disorders.
Corley, N. A., & Young, S. M. (2018). Is social work still racist? A content analysis of recent literature.
Deepak et al. (2016). Intersections between technology, engaged learning, and social capital in social work education.
19.6 Grounded theory analysis
Learning Objectives
Learners will be able to...
- Explain defining features of grounded theory analysis as a strategy for qualitative data analysis and identify when it is most effectively used
- Formulate an initial grounded theory analysis plan (if appropriate for your research proposal)
What are you trying to accomplish with grounded theory analysis?
Just to be clear, grounded theory doubles as both a qualitative research design (we will talk about some other qualitative designs in Chapter 22) and a type of qualitative data analysis; in this chapter we are specifically interested in grounded theory as an approach to analysis. With a grounded theory analysis, we are attempting to come up with a common understanding of how some event or series of events occurs based on our examination of participants' knowledge and experience of that event. Let's consider the potential this approach has for us as social workers in the fight for social justice. Using grounded theory analysis, we might try to answer research questions like:
- How do communities identify, organize, and challenge structural issues of racial inequality?
- How do immigrant families respond to the threat of family member deportation?
- How has the war on drugs campaign shaped social welfare practices?
In each of these instances, we are attempting to uncover a process that is taking place. To do so, we will analyze data that describes participants' experiences with these processes and attempt to draw out and describe the components that seem essential to understanding the process.
Variations in the approach
Differences in approaches to grounded theory analysis largely lie in the amount (and type) of structure applied to the analysis process. Strauss and Corbin (2014)[125] suggest a highly structured approach to grounded theory analysis, one that moves back and forth between the data and the evolving theory that is being developed, making sure to anchor the theory very explicitly in concrete data points. With this approach, the researcher's role is more detective-like: the facts are there, and you are uncovering and assembling them, which is more reflective of deductive reasoning. Charmaz (2014)[126], by contrast, suggests a more interpretive approach to grounded theory analysis, in which findings emerge as an exchange between the unique and subjective (yet still accountable) position of the researcher(s) and their understanding of the data, acknowledging that another researcher might emerge with a different theory or understanding. In this case, the researcher functions more as a liaison, bridging understanding between the participant group and the scientific community and using their own unique perspective to help facilitate this process. This approach reflects inductive reasoning.
Coding in grounded theory
Coding in grounded theory is generally a sequential activity. First, the researcher engages in open coding of the data. This involves reviewing the data to determine the preliminary ideas that seem important and potential labels that reflect their significance for the event or process you are studying. Within this open coding process, the researcher will also likely develop subcategories that help to expand and provide a richer understanding of what each of the categories can mean. Next, axial coding revisits the open codes and identifies connections between them, thereby beginning to group codes that share a relationship. Finally, selective or theoretical coding explores how the relationships between these concepts come together, providing a theory that describes how this event or series of events takes place, often ending in an overarching or unifying idea tying these concepts together. Dr. Tiffany Gallicano[127] has a helpful blog post that walks the reader through examples of each stage of coding. Figure 19.13 offers an example of each stage of coding in a study examining the experiences of students who are new to online learning and how they make sense of it. Keep in mind that coding is an evolving process, and your documentation should capture these changes. You may notice that in the example, "Feels isolated from professor and classmates" is listed under both the axial codes "Challenges presented by technology" and "Course design". This isn't an error; it just represents that it isn't yet clear whether this code is most reflective of one of these axial codes or both. Eventually, the placement of this code may change, but we will make sure to capture why this change is made.
| Open Codes | Axial Codes | Selective Code |
| --- | --- | --- |
| Anxious about using new tools | Challenges presented by technology | Doubts, insecurities and frustration experienced by new online learners |
| Lack of support for figuring technology out | | |
| Feels isolated from professor and classmates | | |
| Twice the work—learn the content and how to use the technology | | |
| Limited use of teaching activities (e.g. "all we do is respond to discussion boards") | Course design | |
| Feels isolated from professor and classmates | | |
| Unclear what they should be taking away from course work and materials | | |
| Returning student, feel like I'm too old to learn this stuff | Learner characteristics | |
| Home feels chaotic, hard to focus on learning | | |
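If you track your audit trail in a script or notebook rather than only in a table, the open, axial, and selective codes above can be kept as plain data so that changes over time are easy to record. This is only an illustrative sketch, not a required step; it simply mirrors the table, including the deliberately duplicated open code discussed in the text.

```python
# Illustrative sketch of the open -> axial -> selective structure from the
# table above, kept as plain data so changes can be tracked over time.
selective_code = (
    "Doubts, insecurities and frustration experienced by new online learners"
)

axial_codes = {
    "Challenges presented by technology": [
        "Anxious about using new tools",
        "Lack of support for figuring technology out",
        "Feels isolated from professor and classmates",
        "Twice the work—learn the content and how to use the technology",
    ],
    "Course design": [
        'Limited use of teaching activities (e.g. "all we do is respond to discussion boards")',
        "Feels isolated from professor and classmates",  # intentionally listed twice (see text)
        "Unclear what they should be taking away from course work and materials",
    ],
    "Learner characteristics": [
        "Returning student, feel like I'm too old to learn this stuff",
        "Home feels chaotic, hard to focus on learning",
    ],
}

for axial, open_codes in axial_codes.items():
    print(f"{axial}: {len(open_codes)} open codes -> {selective_code}")
```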
Constant comparison
While grounded theory is not the only approach to qualitative analysis that utilizes constant comparison, it is certainly widely associated with it. Constant comparison reflects the motion that takes place throughout the analytic process (across the levels of coding described above), whereby as researchers we move back and forth between the data, the emerging categories, and our evolving theoretical understanding. We are continually checking what we believe to be the results against the raw data. It is an ongoing cycle that helps ensure we are doing right by our data and supports the trustworthiness of our research. Grounded theory often relies on a relatively large number of interviews and usually begins analysis while the interviews are ongoing. As a result, the researcher(s) work to continuously compare their understanding of the findings against new and existing data they have collected.
Developing your theory
Remember, the aim of using a grounded theory approach to your analysis is to develop a theory, or an explanation of how a certain event, phenomenon, or process occurs. As you bring your coding process to a close, you will emerge not just with a list of ideas or themes, but with an explanation of how these ideas are interrelated and work together to produce the event you are studying. Thus, you are building a theory, grounded in the data you have gathered, that explains the event you are studying.
Thinking about power and control as we build theories
I want to bring the discussion back to issues of power and control in research. As discussed earlier in this chapter, regardless of what approach we are using to analyze our data, we need to be concerned with the potential for abuse of power in the research process and how this can further contribute to oppression and systemic inequality. I think this point can be demonstrated well here in our discussion of grounded theory analysis, since grounded theory is often concerned with describing some aspect of human behavior: how people respond to events, how people arrive at decisions, how human processes work. Even though we aren't necessarily seeking generalizable results in a qualitative study, research consumers may still be influenced by how we present our findings, which in turn can shape how they perceive the population represented in our study. For example, for many years science did a great disservice to families impacted by schizophrenia, advancing the theory of the schizophrenogenic mother.[128] Using pseudoscience, the scientific community misrepresented the influence of parenting (a process), and specifically the mother's role, in the development of schizophrenia. You can imagine the harm caused by this theory to family dynamics, stigma, institutional mistrust, etc. To learn more, you can read a brief but informative editorial by Anne Harrington in The Lancet.[129] Instances like these should haunt and challenge the scientific community to do better. Engaging community members in active and more meaningful ways in research is one important way we can respond. Shouldn't theories be built by the people they are meant to represent?
Key Takeaways
- Grounded theory analysis aims to develop a common understanding of how some event or series of events occurs, based on our examination of participants' knowledge and experience of that event.
- Using grounded theory often involves a series of coding activities (e.g. open, axial, selective or theoretical) to help determine both the main concepts that seem essential to understanding an event and how those concepts relate or come together in a dynamic process.
- Constant comparison is a tool often used in grounded theory analysis, in which researchers move back and forth between the data, the emerging categories, and the evolving theoretical understanding they are developing.
Resources
Resources for learning more about Grounded Theory
Chun Tie, Y., Birks, M., & Francis, K. (2019). Grounded theory research: A design framework for novice researchers.
Gibbs, G.R. (2015, February 4). A discussion with Kathy Charmaz on Grounded Theory.
Glaser, B.G., & Holton, J. (2004, May). Remodeling grounded theory.
Mills, J., Bonner, A., & Francis, K. (2006). The development of Constructivist Grounded Theory.
A few exemplars of studies employing Grounded Theory
Burkhart, L., & Hogan, N. (2015). Being a female veteran: A grounded theory of coping with transitions.
Donaldson, W. V., & Vacha-Haase, T. (2016). Exploring staff clinical knowledge and practice with LGBT residents in long-term care: A grounded theory of cultural competency and training needs.
Vanidestine, T., & Aparicio, E. M. (2019). How social welfare and health professionals understand “Race,” Racism, and Whiteness: A social justice approach to grounded theory.
19.7 Photovoice
Learning Objectives
Learners will be able to...
- Explain defining features of photovoice as a strategy for qualitative data analysis and identify when it is most effectively used
- Formulate an initial analysis plan using photovoice (if appropriate for your research proposal)
What are you trying to accomplish with photovoice analysis?
Photovoice is an approach to qualitative research that combines the steps of data gathering and analysis with visual and narrative data. The ultimate aim of the analysis is to produce some kind of desired change with and for the community of participants. While other analysis approaches discussed here may involve participants more actively in the research process, this is certainly not the norm; with photovoice, it is. Photovoice generally assumes that the participants in your study will take on a very active role throughout the research process, to the point of acting as co-researchers. This is especially evident during the analysis phase of your work.
As an example of this work, Mitchell (2018)[130] combines photovoice and an environmental justice approach to engage a Native American community around the significance and the implications of water for their tribe. This research is designed to help raise awareness and support advocacy efforts for improved access to and quality of natural resources for this group. Photovoice has grown out of participatory and community-based research traditions that assume that community members have their own expertise they bring to the research process, and that they should be involved in, empowered by, and mutually benefit from the research being conducted. This mutual benefit means that this type of research involves some kind of desired and very tangible changes for participants; the research will support something that community members want to see happen. Examples of these changes could be legislative action, raising community awareness, or changing some organizational practice(s).
Training your team
Because this approach involves participants not just sharing information, but actually utilizing research skills to help collect and interpret data, as a researcher you need to take on an educator role and share your research expertise in preparing them to do so. After recruiting and gathering informed consent, part of the onboarding process will be to determine the focus of your study. Some photovoice projects are more prescribed, where the researcher comes with an idea and seeks to partner with a specific group or community to explore this topic. At other times, the researcher joins with the community first, and collectively they determine the focus of the study and craft the research question. Once this focus has been determined and shared, the team will be charged with gathering photos or videos that represent each individual participant's response to the research question. Depending on the technology used to capture these photos (e.g. cameras, iPads, video recorders, cell phones), training may need to be provided.
Once photos have been captured, team members will be asked to provide a caption or description that helps to interpret what their picture(s) mean in relation to the focus of the study. After this, the team will collectively need to seek out themes and patterns across the visual and narrative representations. This means you may employ different elements of thematic or content analysis to help you interpret the collective meaning across the data and you will need to train your team to utilize these approaches.
Converging on a shared story
Once you have found common themes, together you will work to assemble these into a cohesive broader story or message regarding the focus of your topic. Remember, the participatory roots of photovoice mean that the aim of this message is to seek out, support, encourage, or demand some form of change or transformation; in other words, this is intended to be a persuasive story. Your research team will need to consider how to put your findings together in a way that supports this intended change. The packaging and format of your findings will have important implications for developing and disseminating the final products of qualitative research. Chapter 21 focuses more specifically on decisions connected with this phase of the research process.
Key Takeaways
- Photovoice is a unique approach to qualitative research that combines visual and narrative information in an attempt to produce more meaningful and accessible results as an alternative to other traditional research methods.
- A cornerstone of Photovoice research involves the training and participation of community members during the analysis process. Additionally, the results of the analysis are often intended for some form of direct change or transformation that is valued by the community.
Exercises
Reflexive Journal Entry Prompt
After learning about these different types of qualitative analysis:
- Which of these approaches make the most sense to you and how you view the world?
- Which of them are most appealing and why?
- Which do you want to learn more about?
Exercises
Decision Point: How will you conduct your analysis?
- Thinking about what you need to accomplish with the data you have collected, which of these analytic approaches will you use?
- What makes this the most effective choice?
- Outline the steps you plan to take to conduct your analysis
- What peer-reviewed resources have you gathered to help you learn more about this method of analysis? (Keep these handy for when you write up your study!)
Resources
Resources for learning more about Photovoice:
Liebenberg, L. (2018). Thinking critically about photovoice: Achieving empowerment and social change.
Mangosing, D. (2015, June 18). Photovoice training and orientation.
University of Kansas, Community Toolbox. (n.d.). Section 20. Implementing Photovoice in Your Community.
Woodgate et al. (2017, January). Worth a thousand words? Advantages, challenges and opportunities in working with photovoice as a qualitative research method with youth and their families.
A few exemplars of studies employing Photovoice:
Fisher-Borne, M., & Brown, A. (2018). A case study using Photovoice to explore racial and social identity among young Black men: Implications for social work research and practice.
Houle et al. (2018). Public housing tenants’ perspective on residential environment and positive well-being: An empowerment-based Photovoice study and its implications for social work.
Mitchell, F. M. (2018). “Water Is Life”: Using photovoice to document American Indian perspectives on water and health.