Kernel: R (system-wide)

A DDS Case Study

Background information

In the United States, individuals with developmental disabilities typically receive services and support from state governments. The State of California allocates funds to developmentally-disabled residents through the California Department of Developmental Services (DDS); individuals receiving DDS funds are referred to as 'consumers'. The dataset dds.discr represents a sample of 1,000 DDS consumers (out of a total population of approximately 250,000), and includes information about age, gender, ethnicity, and the amount of financial support per consumer provided by the DDS. The dataset is available in the oibiostat package.

A team of researchers examined the mean annual expenditure on consumers by ethnicity, and found that the mean annual expenditures on Hispanic consumers was approximately one-third of the mean expenditures on White non-Hispanic consumers. As a result, an allegation of ethnic discrimination was brought against the California DDS.

Does this finding represent sufficient evidence of ethnic discrimination, or might there be more to the story? This lab provides a walkthrough to conducting an exploratory analysis that not only investigates the relationship between two variables of interest, but also considers whether other variables might be influencing that relationship.

Common settings

We will start by importing some useful libraries and by setting some options that will make plots easier to read on the projector screen:

suppressMessages({library(tidyverse); library(mosaic)})
theme_set(theme_bw(base_size=20)) options(repr.plot.width=12, repr.plot.height=8)

The data set

Then we will import the oibiostat library that contains the data set we are interested in:

library(oibiostat)

Read the data set from the library:

data(dds.discr)

Ask for more information about the data set:

?dds.discr
dds.discr package:oibiostat R Documentation Discrimination in Developmental Disability Support Description: Represents a sample of 1,000 DDS consumers, based on data from the State of California Usage: data("dds.discr") Format: A data frame with 1000 observations on the following 6 variables. ‘id’ Unique identification code for each resident ‘age.cohort’ Age as sorted into six groups ‘0-5’ years, ‘6-12’ years, ‘13-17’ years, ‘18-21’ years, ‘22-50’ years, and ‘51+’ years ‘age’ Age, measured in years ‘gender’ Gender, recorded as either ‘Female’ or ‘Male’. ‘expenditures’ Amount of expenditures spent by the State on an individual annually, measured in USD ‘ethnicity’ Ethnic group, recorded as either ‘American Indian’, ‘Asian’, ‘Black’, ‘Hispanic’, ‘Multi Race’, ‘Native Hawaiian’, ‘Other’, or ‘White not Hispanic’ Examples: data("dds.discr") dds.discr[1:5,]

Use glimpse() to get a quick overview of the data set:

glimpse(dds.discr)
Rows: 1,000 Columns: 6 $ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1… $ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-… $ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20… $ gender <fct> Female, Male, Male, Female, Male, Female, Female, Male, F… $ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28… $ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…

We see that the data set contains 1000 observations with 6 variables: id, age,cohort, age, gender, expenditures and ethnicity.

We can use the "dollar notation" to extract any of these six columns. For example, to show the age column, we do this:

dds.discr$age
[1] 17 37 3 19 13 15 13 17 14 13 13 14 15 17 20 25 23 41 5 15 5 5 22 3 [25] 16 71 13 18 63 1 13 27 21 14 15 3 15 14 30 6 33 10 25 3 26 17 33 16 [49] 24 63 1 24 75 27 8 24 10 16 13 51 6 16 15 22 20 45 5 24 32 13 74 26 [73] 19 18 6 17 29 17 13 48 34 9 18 16 0 15 16 18 28 28 22 27 66 3 22 1 [97] 13 22 15 94 41 5 6 60 23 36 21 23 80 6 25 0 65 4 12 19 20 12 33 15 [121] 24 4 18 5 24 18 21 18 19 24 11 66 20 19 14 5 17 34 27 37 13 4 5 4 [145] 4 16 22 16 0 4 20 19 3 77 4 17 18 16 21 35 18 9 15 4 14 0 0 15 [169] 2 21 21 27 82 21 21 28 6 40 19 16 23 0 42 11 74 4 6 20 33 17 13 0 [193] 16 21 18 4 12 17 25 11 7 12 11 22 1 62 20 5 9 4 33 4 19 15 20 1 [217] 21 24 8 8 30 17 56 16 20 14 7 16 19 5 14 78 9 19 20 7 4 10 3 20 [241] 17 21 71 11 2 88 70 23 11 60 10 17 10 25 38 11 16 7 19 11 15 20 24 27 [265] 8 20 11 4 11 25 2 18 1 12 13 13 69 23 31 21 9 85 26 27 68 13 9 30 [289] 10 13 8 18 21 20 20 8 13 59 15 1 21 27 19 9 6 21 19 15 17 18 76 18 [313] 20 19 18 20 21 37 14 16 16 24 18 20 0 9 59 54 10 5 14 21 17 73 9 4 [337] 30 12 21 13 25 14 32 32 7 13 14 18 16 6 16 74 9 24 0 86 17 15 74 16 [361] 21 11 11 34 57 16 16 13 8 9 28 15 12 17 20 27 20 12 17 67 26 14 28 52 [385] 57 76 16 9 14 27 8 22 73 11 11 89 2 73 19 13 33 32 17 7 11 26 95 66 [409] 35 16 25 30 7 16 16 17 31 9 22 64 15 16 81 9 7 21 39 13 16 10 1 13 [433] 16 18 65 10 16 16 25 18 6 14 25 19 25 16 12 8 15 20 21 25 18 14 75 7 [457] 11 14 13 7 32 23 16 6 6 19 12 19 27 24 30 30 30 18 55 16 10 6 14 71 [481] 31 14 17 33 21 19 20 12 29 17 71 25 13 10 95 15 15 45 17 16 24 0 15 15 [505] 32 21 20 12 19 7 68 16 26 20 20 21 88 18 72 29 21 64 22 23 5 6 19 19 [529] 19 62 17 13 12 15 14 80 20 6 21 21 17 8 34 27 18 15 83 21 64 26 8 9 [553] 14 19 17 35 15 14 5 7 12 16 62 69 21 20 4 0 37 7 11 20 18 14 1 13 [577] 11 24 21 16 13 24 17 52 21 19 29 17 21 23 9 23 10 8 33 22 27 22 15 68 [601] 27 20 59 6 75 8 12 66 20 10 17 63 18 16 28 54 5 17 6 16 53 20 10 15 [625] 12 12 12 18 10 18 19 23 17 19 26 52 24 11 22 18 65 18 23 6 20 33 39 27 [649] 27 36 18 25 27 11 20 20 27 16 8 10 5 8 13 13 8 18 9 20 5 7 16 18 [673] 21 19 19 30 17 17 39 15 14 18 20 17 19 15 14 9 27 53 20 13 65 6 17 18 [697] 20 16 19 34 6 19 9 34 56 22 18 21 16 18 16 5 21 20 13 14 18 62 8 6 [721] 35 8 18 19 7 19 27 12 15 15 90 9 5 21 17 14 20 10 30 72 11 14 19 10 [745] 3 13 15 36 12 20 12 66 13 88 18 79 14 14 90 21 20 21 32 19 30 16 18 71 [769] 18 24 15 0 20 39 6 51 10 30 23 75 66 7 21 35 32 21 19 25 32 0 31 23 [793] 27 6 10 19 17 20 8 71 21 22 11 21 37 23 11 20 22 24 26 18 15 30 23 22 [817] 53 7 29 33 72 62 18 20 31 80 70 29 73 20 20 40 13 24 32 62 18 28 6 2 [841] 24 19 17 14 25 7 12 15 23 28 14 12 23 21 14 33 41 19 25 20 9 32 26 19 [865] 42 11 30 39 16 17 21 14 17 91 10 24 19 71 4 11 8 31 8 72 62 26 7 63 [889] 21 8 5 20 19 9 19 19 21 16 13 19 18 19 27 19 13 16 10 29 32 21 29 12 [913] 14 7 14 17 20 17 6 9 32 11 73 33 4 6 8 25 23 16 8 22 10 31 6 18 [937] 27 4 18 8 14 11 15 30 11 17 0 25 4 40 14 20 3 29 38 79 60 38 18 16 [961] 5 23 29 5 26 9 17 14 30 10 51 17 41 44 5 23 75 1 9 69 46 28 18 38 [985] 54 24 13 13 15 8 5 30 18 20 2 86 20 17 10 23

Distributions of single variables

To begin understanding a dataset and developing a sense of context, start by examining the distributions of single variables.

Let's start with the expenditures variable. It is a numerical variable. We can start by finding a variety of numerical summaries:

Mean

mean(dds.discr$expenditures)
[1] 18065.79
mean(~expenditures, data=dds.discr)
[1] 18065.79

Range

range(~expenditures, data=dds.discr)
[1] 222 75098

Median

median(~expenditures, data=dds.discr)
[1] 7026

Standard deviation

sd(~expenditures, data=dds.discr)
[1] 19542.83

All of the above

plus more. The favstats command will calculate 9 numerical summaries for our variable: the minimum. first quartile, median, third quartile, and maximum (these 5 give us what we call the five number summary), mean, standard deviation, count, and number of missing values:

favstats(~expenditures, data=dds.discr)
min Q1 median Q3 max mean sd n missing 222 2898.75 7026 37712.75 75098 18065.79 19542.83 1000 0

Boxplot

The five number summary can be visualized by a boxplot:

gf_boxplot(~expenditures, data=dds.discr)
Image in a Jupyter notebook

Histogram

gives us a way to visualize the distribution of the values of the variable:

gf_histogram(~expenditures, data=dds.discr)
Image in a Jupyter notebook

The distribution of annual expenditures exhibits right skew, indicating that for a majority of consumers, expenditures are relatively low; most are within the $0 - $5,000 range. There are some consumers for which expenditures are much higher, such as within the $60,000 - $80,000 range. The quartiles for expenditures are $2,899, $7,026, and $37,710.

Gender

It is a categorical variable with two levels. We will create a frequency table and a bar graph:

tally(~gender, data=dds.discr)
gender Female Male 503 497
gf_bar(~gender, data=dds.discr)
Image in a Jupyter notebook

Age and age.cohort

The age variable is a continuous numerical variable. We will start by plotting its histogram:

gf_histogram(~age, data=dds.discr)
Image in a Jupyter notebook

The variable is skewed to the right, slightly bimodal, very similar to the expenditures variable. The similarity of the shapes indicates, but does not prove, that there perhaps may be an association between the two variables. We will look into that later.

The age.cohort variable is categorical and ordinal. As with any categorical variable that does not have too many levels, we can find a frequency table and visualize it using a bar graph.

tally(~age.cohort, data=dds.discr)
age.cohort 0-5 6-12 13-17 18-21 22-50 51+ 82 175 212 199 226 106
gf_bar(~age.cohort, data=dds.discr)
Image in a Jupyter notebook

We see that the middle four cohorts each contains about 200 consumers, the lowest and highest cohort each contain about 100 consumers. Apart of that there is not any huge difference between the sizes of the six cohorts.

Ethnicity

As with any other categorical variable, we will summarize ethnicity in a frequency table, and visualize using a bar graph:

tally(~ethnicity, data=dds.discr)
ethnicity American Indian Asian Black Hispanic 4 129 59 376 Multi Race Native Hawaiian Other White not Hispanic 26 3 2 401
gf_bar(~ethnicity, data=dds.discr) |> gf_theme(axis.text.x = element_text(angle=30, hjust=1))
Image in a Jupyter notebook

There are eight ethnic groups represented in the data, however there is not equal representation. The two largest groups, Hispanics and White non-Hispanics, together represent about 80% of the consumers. Some of the ethnic groups are so small that they probably do not form a representative sample from the population.

Associations between variables

Now that we explored all the individual variables, we can turn to investigating possible associations between the variables.

Earlier we noticed the similarity of the distributions between expenditures and age, asking if that perhaps is an indicator of an association between the two variables. While it is possible to have two variables with similar distributions that are not associated, it is definitely worth looking into.

Association between age and expenditures

Since they are both numerical variables, the appropriate way to visualize their possible association is using a scatter plot, also known as point plot:

gf_point(expenditures ~ age, data=dds.discr)
Image in a Jupyter notebook

The scatter plot shows that there clearly is an association. It seems somewhat complicated, but overall it seems that higher age corresponds to higher expenditures.

Let's see how this plays with the cohorts. The age.cohort variable is categorical. We can start by looking at the numerical summaries for the expenditures divided by the age.cohort:

favstats(expenditures ~ age.cohort, data=dds.discr)
age.cohort min Q1 median Q3 max mean sd n 1 0-5 222 1034.25 1380.5 1739.25 2750 1415.280 612.6143 82 2 6-12 620 1601.50 2191.0 2846.50 4163 2226.863 830.9430 175 3 13-17 386 3306.50 3952.0 4665.50 6798 3922.613 1012.6525 212 4 18-21 3153 7588.00 9979.0 11806.50 18435 9888.538 2940.6088 199 5 22-50 25348 36447.25 40455.5 44720.75 56716 40209.283 6287.3119 226 6 51+ 33110 49515.00 53509.0 57745.50 75098 53521.896 6283.7671 106 missing 1 0 2 0 3 0 4 0 5 0 6 0

From the table we can see that each of the positional numerical summaries (the 5 number summary and the mean) grows larger with age. It can be seen even more clearly from a side-by-side boxplot:

gf_boxplot(expenditures ~ age.cohort, data=dds.discr)
Image in a Jupyter notebook

Again, there is a very clear association between the age.cohort and expenditures: higher age clearly corresponds to higher expenditures. That make sense, as the cohorts are indicative of particular life phases. In the first three cohorts, consumers are still living with their parents as they move through preschool age, elementary/middle school age, and high school age. In the 18-21 cohort, consumers are transitioning from their parents' homes to living on their own or in supportive group homes. From ages 22-50, individuals are mostly no longer living with their parents but may still receive some support from family. In the 51+ cohort, consumers often have no living parents and typically require the most amount of support.

Association between ethnicity and expenditures

We again have a categorical variable (ethnicity) and numerical variable (expenditures), so the strategy will be similar to the one we chose for age.cohort and expenditures. We will start with favstats and then construct a side by side box plot:

favstats(expenditures ~ ethnicity, data=dds.discr)
ethnicity min Q1 median Q3 max mean sd 1 American Indian 3726 22085.25 41817.5 56170.50 58392 36438.250 25693.912 2 Asian 374 3382.00 9369.0 34274.00 75098 18392.372 19209.225 3 Black 240 3870.00 8687.0 41857.00 60808 20884.593 20549.274 4 Hispanic 222 2331.25 3952.0 10292.50 65581 11065.569 15629.847 5 Multi Race 669 1689.75 2622.0 3749.50 38619 4456.731 7332.135 6 Native Hawaiian 37479 39103.00 40727.0 45434.00 50141 42782.333 6576.462 7 Other 2018 2667.25 3316.5 3965.75 4615 3316.500 1836.356 8 White not Hispanic 340 3977.00 15718.0 43134.00 68890 24697.549 20604.376 n missing 1 4 0 2 129 0 3 59 0 4 376 0 5 26 0 6 3 0 7 2 0 8 401 0

Here the relationship between the variables seems to be much more complicated. One notable difference compared to the age.cogort table above is in the n column: some of the groups are very tiny, with only 2 or 3 consumers. As we noted before, two groups, Hispanic and White, comprise almost 80% of the whole sample.

The five number summaries are all over the place and it is difficult to compare them from the table. Let's use a side by side box-plot to help visualize them:

gf_boxplot(expenditures ~ ethnicity, data=dds.discr) |> gf_theme(axis.text.x = element_text(angle=30, hjust=1))
Image in a Jupyter notebook

The distribution of expenditures is quite different between ethnic groups. For example, there is very little variation in expenditures within the Multi Race, Native Hawaiian, and Other groups; in other groups, such as the White not Hispanic group, there is a greater range in expenditures. Additionally, there seems to be a difference in the amount of funding that a person receives, on average, between different ethnicities. The median amount of annual support received for individuals in the American Indian and Native Hawaiian groups is about $40,000, versus medians of approximately $10,000 for Asian and Black consumers.

One thing that is not clear from this plot is the size of the individual groups. From the plot there is no way of knowing that there are only 3 consumers in the Native Hawaiian group, and only 2 in the Other group. Typically, smaller groups will have less variation, so this information is important.

We can instead use a jitter plot to visualize the data. In a jitter plot, every observation has a separate dot, so we will be able to see how many observations are in each of the categories:

gf_jitter(expenditures ~ ethnicity, data=dds.discr)|> gf_theme(axis.text.x = element_text(angle=30, hjust=1))
Image in a Jupyter notebook

We can even combine the two to get the best of both:

gf_boxplot(expenditures ~ ethnicity, data=dds.discr) |> gf_jitter(expenditures ~ ethnicity, data=dds.discr, color="blue", alpha=0.2)|> gf_theme(axis.text.x = element_text(angle=30, hjust=1))
Image in a Jupyter notebook

One more simple way to summarize the relationship between ethnicity and expenditures: look at the means. We already have the "mean" column in the favstats table, but it may be nice to see the means alone.

We have several ways to do this. One way is by using the mean() command with the expenditures ~ ethnicity formula:

mean(expenditures ~ ethnicity, data=dds.discr)
American Indian Asian Black Hispanic 36438.250 18392.372 20884.593 11065.569 Multi Race Native Hawaiian Other White not Hispanic 4456.731 42782.333 3316.500 24697.549

That gives us the means we want, but it does not give us a nice table. There is another way, by manipulating the data set:

dds.discr |> group_by(ethnicity) |> summarize(mean = mean(expenditures)) |> arrange(-mean)
ethnicity mean 1 Native Hawaiian 42782.333 2 American Indian 36438.250 3 White not Hispanic 24697.549 4 Black 20884.593 5 Asian 18392.372 6 Hispanic 11065.569 7 Multi Race 4456.731 8 Other 3316.500

Comparing Hispanic and White

In all those summaries and visualizations, we can see quite large difference between the two largest ethnic groups: for example, the mean expenditure for the White group is more than twice as large as the mean expenditure of the Hispanic group.

To make this clearer and to be able to investigate this difference, lets filter our data set so that only consumers that are in the Hispanic group or the White group remain. We will save the new filtered data in a new data set called dds_h_or_w:

dds.discr |> filter(ethnicity == "Hispanic" | ethnicity == "White not Hispanic") -> dds_h_or_w
glimpse(dds_h_or_w)
Rows: 777 Columns: 6 $ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10820, 1… $ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-… $ age <int> 17, 37, 3, 19, 13, 15, 13, 14, 13, 13, 14, 15, 20, 23, 5,… $ gender <fct> Female, Male, Male, Female, Male, Female, Female, Female,… $ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 5021, 2887, 41… $ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…

As we can see, the filtered data set only has 777 observations, compared to the 1000 observations in the original data set.

Let's look at the distribution of expenditures and age in this restricted data set:

gf_boxplot(expenditures ~ ethnicity, data=dds_h_or_w)
Image in a Jupyter notebook
gf_boxplot(age ~ ethnicity, data=dds_h_or_w)
Image in a Jupyter notebook

Based on the boxplot, most Hispanic consumers receive between approximately $0 to $20,000 from the California DDS; individuals receiving amounts higher than this are upper outliers. However, for White non-Hispanic consumers, median expenditures is at $15,718, and the middle 50% of consumers receive between about $4,000 and $43,000. The mean expenditures for Hispanic consumers is $11,066, while the mean expenditures for White non-Hispanic consumers is over twice as high at $24,698. On average, a Hispanic consumer receives less financial support from the California DDS than a White non-Hispanic consumer.

Just as with the overall data set, we see that the age variable has a distribution similar to the expenditures. The ages of the Hispanic consumers are generally much lower than the ages of the White non-Hispanic consumers.

Let's look at the representations on each of the two groups in the age cohorts:

gf_bar(~age.cohort | ethnicity, data=dds_h_or_w)
Image in a Jupyter notebook

Again, we can see that Hispanic consumers tend to be younger, with most of them falling into the 6-12, 13-17, and 18-21 age cohorts. In contrast, White non-Hispanics tend to be older; most consumers in this ethnic group are in the 22-50 age cohort, and relatively more White non-Hispanic consumers are in the 51+ age cohort as compared to Hispanics.

We know that expenditures is very strongly associated with age and age.cohort. It is true in the compete data set, and it seems to still hold in the data set restricted to the two main ethnic groups. When trying to compare expenditures in the two ethnic groups, it is possible that what we are actually seeing is the effect of the age and age.cohort variable instead! In that case, age as well as age.cohort would be confounding variables.

For a closer look at the relationship between age, ethnicity, and expenditures, compare how expenditures differ by ethnicity within each age cohort. If age is indeed the primary source of the observed variation in expenditures, then there should be little difference in expenditures between individuals in different ethnic groups but the same age cohort. One way to visualize that is to make a box plot of expenditures by ethnicity separately for each of the age cohorts:

gf_boxplot(expenditures ~ ethnicity | age.cohort, data=dds_h_or_w)
Image in a Jupyter notebook

We can also compare the mean expenditures of Hispanic and White not Hispanic groups within each age cohort:

dds_h_or_w |> group_by(age.cohort, ethnicity) |> summarize(mean = mean(expenditures), .groups="keep")
age.cohort ethnicity mean 1 0-5 Hispanic 1393.205 2 0-5 White not Hispanic 1366.900 3 6-12 Hispanic 2312.187 4 6-12 White not Hispanic 2052.261 5 13-17 Hispanic 3955.282 6 13-17 White not Hispanic 3904.358 7 18-21 Hispanic 9959.846 8 18-21 White not Hispanic 10133.058 9 22-50 Hispanic 40924.116 10 22-50 White not Hispanic 40187.624 11 51+ Hispanic 55585.000 12 51+ White not Hispanic 52670.424

We can see that when we split the data into separate age cohorts, there is very little difference between the expenditures variable between the two ethnic groups. In fact, in both the 22-50 and 51+ cohorts, the expenditures in the Hispanics group seem to be slightly higher than those in the White not Hispanic group.

It follows that there does not seem to be evidence of ethnic discrimination. Although the average annual expenditures is lower for Hispanics than for White non-Hispanics, this is due to the difference in age distributions between the two ethnic groups. The population of Hispanic consumers is relatively young compared to the population of White non-Hispanic consumers, and the amount of expenditures for younger consumers tends to be lower than for older consumers. When individuals of similar ages are compared, there are not large differences in the average amount of financial support provided to a Hispanic consumer versus a White non-Hispanic consumer.

Simpson's paradox

Identifying confounding variables is essential for understanding data. Confounders are often context-specific; for example, age is not necessarily a confounder for the relationship between ethnicity and expenditures in a different population. Additionally, it is rarely immediately obvious which variables in a dataset are confounders; looking for confounding variables is an integral part of exploring a dataset.

These data represent an extreme example of confounding known as Simpson's paradox, in which an association observed in several groups may disappear or reverse direction once the groups are combined. In other words, an association between two variables X and Y may disappear or reverse direction once data are partitioned into subpopulations based on a third variable Z, the confounding variable.

Mean expenditures is higher for Hispanics than White non-Hispanics in all age cohorts except one. Yet, once all the data are aggregated, the average expenditures for White non-Hispanics is over twice as large as the average for Hispanics. This paradox can be explored from a mathematical perspective by using weighted averages, where the average expenditure for each cohort is weighted by the proportion of the population in that cohort.