Kernel: R (system-wide)

A DDS Case Study

Background information

In the United States, individuals with developmental disabilities typically receive services and support from state governments. The State of California allocates funds to developmentally-disabled residents through the California Department of Developmental Services (DDS); individuals receiving DDS funds are referred to as 'consumers'. The dataset dds.discr represents a sample of 1,000 DDS consumers (out of a total population of approximately 250,000), and includes information about age, gender, ethnicity, and the amount of financial support per consumer provided by the DDS. The dataset is available in the oibiostat package.

A team of researchers examined the mean annual expenditure on consumers by ethnicity, and found that the mean annual expenditures on Hispanic consumers was approximately one-third of the mean expenditures on White non-Hispanic consumers. As a result, an allegation of ethnic discrimination was brought against the California DDS.

Does this finding represent sufficient evidence of ethnic discrimination, or might there be more to the story? This lab provides a walkthrough to conducting an exploratory analysis that not only investigates the relationship between two variables of interest, but also considers whether other variables might be influencing that relationship.

Common settings

We will start by importing some useful libraries and by setting some options that will make plots easier to read on the projector screen:

In [3]:

suppressMessages({library(tidyverse); library(mosaic)})

In [4]:

theme_set(theme_bw(base_size=20))
options(repr.plot.width=12, repr.plot.height=8)

The data set

Then we will import the oibiostat library that contains the data set we are interested in:

In [5]:

library(oibiostat)

Read the data set from the library:

In [6]:

data(dds.discr)

Ask for more information about the data set:

In [7]:

?dds.discr

dds.discr              package:oibiostat               R Documentation

Discrimination in Developmental Disability Support

Description:

     Represents a sample of 1,000 DDS consumers, based on data from the
     State of California

Usage:

     data("dds.discr")
     
Format:

     A data frame with 1000 observations on the following 6 variables.

     ‘id’ Unique identification code for each resident

     ‘age.cohort’ Age as sorted into six groups ‘0-5’ years, ‘6-12’
          years, ‘13-17’ years, ‘18-21’ years, ‘22-50’ years, and ‘51+’
          years

     ‘age’ Age, measured in years

     ‘gender’ Gender, recorded as either ‘Female’ or ‘Male’.

     ‘expenditures’ Amount of expenditures spent by the State on an
          individual annually, measured in USD

     ‘ethnicity’ Ethnic group, recorded as either ‘American Indian’,
          ‘Asian’, ‘Black’, ‘Hispanic’, ‘Multi Race’, ‘Native
          Hawaiian’, ‘Other’, or ‘White not Hispanic’

Examples:

     data("dds.discr")
     dds.discr[1:5,]

Use glimpse() to get a quick overview of the data set:

In [8]:

glimpse(dds.discr)

Rows: 1,000
Columns: 6
$ id           <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender       <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity    <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…

We see that the data set contains 1000 observations with 6 variables: id, age,cohort, age, gender, expenditures and ethnicity.

We can use the "dollar notation" to extract any of these six columns. For example, to show the age column, we do this:

In [9]:

dds.discr$age

   [1] 17 37  3 19 13 15 13 17 14 13 13 14 15 17 20 25 23 41  5 15  5  5 22  3
  [25] 16 71 13 18 63  1 13 27 21 14 15  3 15 14 30  6 33 10 25  3 26 17 33 16
  [49] 24 63  1 24 75 27  8 24 10 16 13 51  6 16 15 22 20 45  5 24 32 13 74 26
  [73] 19 18  6 17 29 17 13 48 34  9 18 16  0 15 16 18 28 28 22 27 66  3 22  1
  [97] 13 22 15 94 41  5  6 60 23 36 21 23 80  6 25  0 65  4 12 19 20 12 33 15
 [121] 24  4 18  5 24 18 21 18 19 24 11 66 20 19 14  5 17 34 27 37 13  4  5  4
 [145]  4 16 22 16  0  4 20 19  3 77  4 17 18 16 21 35 18  9 15  4 14  0  0 15
 [169]  2 21 21 27 82 21 21 28  6 40 19 16 23  0 42 11 74  4  6 20 33 17 13  0
 [193] 16 21 18  4 12 17 25 11  7 12 11 22  1 62 20  5  9  4 33  4 19 15 20  1
 [217] 21 24  8  8 30 17 56 16 20 14  7 16 19  5 14 78  9 19 20  7  4 10  3 20
 [241] 17 21 71 11  2 88 70 23 11 60 10 17 10 25 38 11 16  7 19 11 15 20 24 27
 [265]  8 20 11  4 11 25  2 18  1 12 13 13 69 23 31 21  9 85 26 27 68 13  9 30
 [289] 10 13  8 18 21 20 20  8 13 59 15  1 21 27 19  9  6 21 19 15 17 18 76 18
 [313] 20 19 18 20 21 37 14 16 16 24 18 20  0  9 59 54 10  5 14 21 17 73  9  4
 [337] 30 12 21 13 25 14 32 32  7 13 14 18 16  6 16 74  9 24  0 86 17 15 74 16
 [361] 21 11 11 34 57 16 16 13  8  9 28 15 12 17 20 27 20 12 17 67 26 14 28 52
 [385] 57 76 16  9 14 27  8 22 73 11 11 89  2 73 19 13 33 32 17  7 11 26 95 66
 [409] 35 16 25 30  7 16 16 17 31  9 22 64 15 16 81  9  7 21 39 13 16 10  1 13
 [433] 16 18 65 10 16 16 25 18  6 14 25 19 25 16 12  8 15 20 21 25 18 14 75  7
 [457] 11 14 13  7 32 23 16  6  6 19 12 19 27 24 30 30 30 18 55 16 10  6 14 71
 [481] 31 14 17 33 21 19 20 12 29 17 71 25 13 10 95 15 15 45 17 16 24  0 15 15
 [505] 32 21 20 12 19  7 68 16 26 20 20 21 88 18 72 29 21 64 22 23  5  6 19 19
 [529] 19 62 17 13 12 15 14 80 20  6 21 21 17  8 34 27 18 15 83 21 64 26  8  9
 [553] 14 19 17 35 15 14  5  7 12 16 62 69 21 20  4  0 37  7 11 20 18 14  1 13
 [577] 11 24 21 16 13 24 17 52 21 19 29 17 21 23  9 23 10  8 33 22 27 22 15 68
 [601] 27 20 59  6 75  8 12 66 20 10 17 63 18 16 28 54  5 17  6 16 53 20 10 15
 [625] 12 12 12 18 10 18 19 23 17 19 26 52 24 11 22 18 65 18 23  6 20 33 39 27
 [649] 27 36 18 25 27 11 20 20 27 16  8 10  5  8 13 13  8 18  9 20  5  7 16 18
 [673] 21 19 19 30 17 17 39 15 14 18 20 17 19 15 14  9 27 53 20 13 65  6 17 18
 [697] 20 16 19 34  6 19  9 34 56 22 18 21 16 18 16  5 21 20 13 14 18 62  8  6
 [721] 35  8 18 19  7 19 27 12 15 15 90  9  5 21 17 14 20 10 30 72 11 14 19 10
 [745]  3 13 15 36 12 20 12 66 13 88 18 79 14 14 90 21 20 21 32 19 30 16 18 71
 [769] 18 24 15  0 20 39  6 51 10 30 23 75 66  7 21 35 32 21 19 25 32  0 31 23
 [793] 27  6 10 19 17 20  8 71 21 22 11 21 37 23 11 20 22 24 26 18 15 30 23 22
 [817] 53  7 29 33 72 62 18 20 31 80 70 29 73 20 20 40 13 24 32 62 18 28  6  2
 [841] 24 19 17 14 25  7 12 15 23 28 14 12 23 21 14 33 41 19 25 20  9 32 26 19
 [865] 42 11 30 39 16 17 21 14 17 91 10 24 19 71  4 11  8 31  8 72 62 26  7 63
 [889] 21  8  5 20 19  9 19 19 21 16 13 19 18 19 27 19 13 16 10 29 32 21 29 12
 [913] 14  7 14 17 20 17  6  9 32 11 73 33  4  6  8 25 23 16  8 22 10 31  6 18
 [937] 27  4 18  8 14 11 15 30 11 17  0 25  4 40 14 20  3 29 38 79 60 38 18 16
 [961]  5 23 29  5 26  9 17 14 30 10 51 17 41 44  5 23 75  1  9 69 46 28 18 38
 [985] 54 24 13 13 15  8  5 30 18 20  2 86 20 17 10 23

Distributions of single variables

To begin understanding a dataset and developing a sense of context, start by examining the distributions of single variables.

Let's start with the expenditures variable. It is a numerical variable. We can start by finding a variety of numerical summaries:

Mean

In [10]:

mean(dds.discr$expenditures)

[1] 18065.79

In [11]:

mean(~expenditures, data=dds.discr)

[1] 18065.79

Range

In [12]:

range(~expenditures, data=dds.discr)

[1]   222 75098

Median

In [13]:

median(~expenditures, data=dds.discr)

[1] 7026

Standard deviation

In [14]:

sd(~expenditures, data=dds.discr)

[1] 19542.83

All of the above

plus more. The favstats command will calculate 9 numerical summaries for our variable: the minimum. first quartile, median, third quartile, and maximum (these 5 give us what we call the five number summary), mean, standard deviation, count, and number of missing values:

In [15]:

favstats(~expenditures, data=dds.discr)

 min Q1      median Q3       max   mean     sd       n    missing
 222 2898.75 7026   37712.75 75098 18065.79 19542.83 1000 0      

Boxplot

The five number summary can be visualized by a boxplot:

In [16]:

gf_boxplot(~expenditures, data=dds.discr)

Histogram

gives us a way to visualize the distribution of the values of the variable:

In [17]:

gf_histogram(~expenditures, data=dds.discr)

The distribution of annual expenditures exhibits right skew, indicating that for a majority of consumers, expenditures are relatively low; most are within the $0 - $5,000 range. There are some consumers for which expenditures are much higher, such as within the $60,000 - $80,000 range. The quartiles for expenditures are $2,899, $7,026, and $37,710.

Gender

It is a categorical variable with two levels. We will create a frequency table and a bar graph:

In [18]:

tally(~gender, data=dds.discr)

gender
Female   Male 
   503    497 

In [19]:

gf_bar(~gender, data=dds.discr)

Age and age.cohort

The age variable is a continuous numerical variable. We will start by plotting its histogram:

In [20]:

gf_histogram(~age, data=dds.discr)

The variable is skewed to the right, slightly bimodal, very similar to the expenditures variable. The similarity of the shapes indicates, but does not prove, that there perhaps may be an association between the two variables. We will look into that later.

The age.cohort variable is categorical and ordinal. As with any categorical variable that does not have too many levels, we can find a frequency table and visualize it using a bar graph.

In [21]:

tally(~age.cohort, data=dds.discr)

age.cohort
  0-5  6-12 13-17 18-21 22-50   51+ 
   82   175   212   199   226   106 

In [22]:

gf_bar(~age.cohort, data=dds.discr)

We see that the middle four cohorts each contains about 200 consumers, the lowest and highest cohort each contain about 100 consumers. Apart of that there is not any huge difference between the sizes of the six cohorts.

Ethnicity

As with any other categorical variable, we will summarize ethnicity in a frequency table, and visualize using a bar graph:

In [23]:

tally(~ethnicity, data=dds.discr)

ethnicity
   American Indian              Asian              Black           Hispanic 
                 4                129                 59                376 
        Multi Race    Native Hawaiian              Other White not Hispanic 
                26                  3                  2                401 

In [24]:

gf_bar(~ethnicity, data=dds.discr) |>
gf_theme(axis.text.x = element_text(angle=30, hjust=1))

There are eight ethnic groups represented in the data, however there is not equal representation. The two largest groups, Hispanics and White non-Hispanics, together represent about 80% of the consumers. Some of the ethnic groups are so small that they probably do not form a representative sample from the population.

Associations between variables

Now that we explored all the individual variables, we can turn to investigating possible associations between the variables.

Earlier we noticed the similarity of the distributions between expenditures and age, asking if that perhaps is an indicator of an association between the two variables. While it is possible to have two variables with similar distributions that are not associated, it is definitely worth looking into.

Association between age and expenditures

Since they are both numerical variables, the appropriate way to visualize their possible association is using a scatter plot, also known as point plot:

In [25]:

gf_point(expenditures ~ age, data=dds.discr)

The scatter plot shows that there clearly is an association. It seems somewhat complicated, but overall it seems that higher age corresponds to higher expenditures.

Let's see how this plays with the cohorts. The age.cohort variable is categorical. We can start by looking at the numerical summaries for the expenditures divided by the age.cohort:

In [26]:

favstats(expenditures ~ age.cohort, data=dds.discr)

  age.cohort min   Q1       median  Q3       max   mean      sd        n  
0-5          222  1034.25  1380.5  1739.25  2750  1415.280  612.6143  82
6-12         620  1601.50  2191.0  2846.50  4163  2226.863  830.9430 175
13-17        386  3306.50  3952.0  4665.50  6798  3922.613 1012.6525 212
18-21       3153  7588.00  9979.0 11806.50 18435  9888.538 2940.6088 199
22-50      25348 36447.25 40455.5 44720.75 56716 40209.283 6287.3119 226
51+        33110 49515.00 53509.0 57745.50 75098 53521.896 6283.7671 106
  missing
0      
0      
0      
0      
0      
0      

From the table we can see that each of the positional numerical summaries (the 5 number summary and the mean) grows larger with age. It can be seen even more clearly from a side-by-side boxplot:

In [27]:

gf_boxplot(expenditures ~ age.cohort, data=dds.discr)

Again, there is a very clear association between the age.cohort and expenditures: higher age clearly corresponds to higher expenditures. That make sense, as the cohorts are indicative of particular life phases. In the first three cohorts, consumers are still living with their parents as they move through preschool age, elementary/middle school age, and high school age. In the 18-21 cohort, consumers are transitioning from their parents' homes to living on their own or in supportive group homes. From ages 22-50, individuals are mostly no longer living with their parents but may still receive some support from family. In the 51+ cohort, consumers often have no living parents and typically require the most amount of support.

Association between ethnicity and expenditures

We again have a categorical variable (ethnicity) and numerical variable (expenditures), so the strategy will be similar to the one we chose for age.cohort and expenditures. We will start with favstats and then construct a side by side box plot:

In [28]:

favstats(expenditures ~ ethnicity, data=dds.discr)

  ethnicity          min   Q1       median  Q3       max   mean      sd       
American Indian     3726 22085.25 41817.5 56170.50 58392 36438.250 25693.912
Asian                374  3382.00  9369.0 34274.00 75098 18392.372 19209.225
Black                240  3870.00  8687.0 41857.00 60808 20884.593 20549.274
Hispanic             222  2331.25  3952.0 10292.50 65581 11065.569 15629.847
Multi Race           669  1689.75  2622.0  3749.50 38619  4456.731  7332.135
Native Hawaiian    37479 39103.00 40727.0 45434.00 50141 42782.333  6576.462
Other               2018  2667.25  3316.5  3965.75  4615  3316.500  1836.356
White not Hispanic   340  3977.00 15718.0 43134.00 68890 24697.549 20604.376
  n   missing
 4 0      
129 0      
59 0      
376 0      
26 0      
 3 0      
 2 0      
401 0      

Here the relationship between the variables seems to be much more complicated. One notable difference compared to the age.cogort table above is in the n column: some of the groups are very tiny, with only 2 or 3 consumers. As we noted before, two groups, Hispanic and White, comprise almost 80% of the whole sample.

The five number summaries are all over the place and it is difficult to compare them from the table. Let's use a side by side box-plot to help visualize them:

In [29]:

gf_boxplot(expenditures ~ ethnicity, data=dds.discr) |>
gf_theme(axis.text.x = element_text(angle=30, hjust=1))

The distribution of expenditures is quite different between ethnic groups. For example, there is very little variation in expenditures within the Multi Race, Native Hawaiian, and Other groups; in other groups, such as the White not Hispanic group, there is a greater range in expenditures. Additionally, there seems to be a difference in the amount of funding that a person receives, on average, between different ethnicities. The median amount of annual support received for individuals in the American Indian and Native Hawaiian groups is about $40,000, versus medians of approximately $10,000 for Asian and Black consumers.

One thing that is not clear from this plot is the size of the individual groups. From the plot there is no way of knowing that there are only 3 consumers in the Native Hawaiian group, and only 2 in the Other group. Typically, smaller groups will have less variation, so this information is important.

We can instead use a jitter plot to visualize the data. In a jitter plot, every observation has a separate dot, so we will be able to see how many observations are in each of the categories:

In [30]:

gf_jitter(expenditures ~ ethnicity, data=dds.discr)|>
gf_theme(axis.text.x = element_text(angle=30, hjust=1))

We can even combine the two to get the best of both:

In [31]:

gf_boxplot(expenditures ~ ethnicity, data=dds.discr) |>
gf_jitter(expenditures ~ ethnicity, data=dds.discr, color="blue", alpha=0.2)|>
gf_theme(axis.text.x = element_text(angle=30, hjust=1))

One more simple way to summarize the relationship between ethnicity and expenditures: look at the means. We already have the "mean" column in the favstats table, but it may be nice to see the means alone.

We have several ways to do this. One way is by using the mean() command with the expenditures ~ ethnicity formula:

In [32]:

mean(expenditures ~ ethnicity, data=dds.discr)

   American Indian              Asian              Black           Hispanic 
         36438.250          18392.372          20884.593          11065.569 
        Multi Race    Native Hawaiian              Other White not Hispanic 
          4456.731          42782.333           3316.500          24697.549 

That gives us the means we want, but it does not give us a nice table. There is another way, by manipulating the data set:

In [33]:

dds.discr |>
group_by(ethnicity) |>
summarize(mean = mean(expenditures)) |>
arrange(-mean)

  ethnicity          mean     
Native Hawaiian    42782.333
American Indian    36438.250
White not Hispanic 24697.549
Black              20884.593
Asian              18392.372
Hispanic           11065.569
Multi Race          4456.731
Other               3316.500

Comparing Hispanic and White

In all those summaries and visualizations, we can see quite large difference between the two largest ethnic groups: for example, the mean expenditure for the White group is more than twice as large as the mean expenditure of the Hispanic group.

To make this clearer and to be able to investigate this difference, lets filter our data set so that only consumers that are in the Hispanic group or the White group remain. We will save the new filtered data in a new data set called dds_h_or_w:

In [34]:

dds.discr |>
filter(ethnicity == "Hispanic" | ethnicity == "White not Hispanic") ->
dds_h_or_w

In [35]:

glimpse(dds_h_or_w)

Rows: 777
Columns: 6
$ id           <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10820, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 14, 13, 13, 14, 15, 20, 23, 5,…
$ gender       <fct> Female, Male, Male, Female, Male, Female, Female, Female,…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 5021, 2887, 41…
$ ethnicity    <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…

As we can see, the filtered data set only has 777 observations, compared to the 1000 observations in the original data set.

Let's look at the distribution of expenditures and age in this restricted data set:

In [36]:

gf_boxplot(expenditures ~ ethnicity, data=dds_h_or_w)

In [37]:

gf_boxplot(age ~ ethnicity, data=dds_h_or_w)

Based on the boxplot, most Hispanic consumers receive between approximately $0 to $20,000 from the California DDS; individuals receiving amounts higher than this are upper outliers. However, for White non-Hispanic consumers, median expenditures is at $15,718, and the middle 50% of consumers receive between about $4,000 and $43,000. The mean expenditures for Hispanic consumers is $11,066, while the mean expenditures for White non-Hispanic consumers is over twice as high at $24,698. On average, a Hispanic consumer receives less financial support from the California DDS than a White non-Hispanic consumer.

Just as with the overall data set, we see that the age variable has a distribution similar to the expenditures. The ages of the Hispanic consumers are generally much lower than the ages of the White non-Hispanic consumers.

Let's look at the representations on each of the two groups in the age cohorts:

In [38]:

gf_bar(~age.cohort | ethnicity, data=dds_h_or_w)

Again, we can see that Hispanic consumers tend to be younger, with most of them falling into the 6-12, 13-17, and 18-21 age cohorts. In contrast, White non-Hispanics tend to be older; most consumers in this ethnic group are in the 22-50 age cohort, and relatively more White non-Hispanic consumers are in the 51+ age cohort as compared to Hispanics.

We know that expenditures is very strongly associated with age and age.cohort. It is true in the compete data set, and it seems to still hold in the data set restricted to the two main ethnic groups. When trying to compare expenditures in the two ethnic groups, it is possible that what we are actually seeing is the effect of the age and age.cohort variable instead! In that case, age as well as age.cohort would be confounding variables.

For a closer look at the relationship between age, ethnicity, and expenditures, compare how expenditures differ by ethnicity within each age cohort. If age is indeed the primary source of the observed variation in expenditures, then there should be little difference in expenditures between individuals in different ethnic groups but the same age cohort. One way to visualize that is to make a box plot of expenditures by ethnicity separately for each of the age cohorts:

In [39]:

gf_boxplot(expenditures ~ ethnicity | age.cohort, data=dds_h_or_w)

We can also compare the mean expenditures of Hispanic and White not Hispanic groups within each age cohort:

In [40]:

dds_h_or_w |>
group_by(age.cohort, ethnicity) |>
summarize(mean = mean(expenditures), .groups="keep")

   age.cohort ethnicity          mean     
0-5        Hispanic            1393.205
0-5        White not Hispanic  1366.900
6-12       Hispanic            2312.187
6-12       White not Hispanic  2052.261
13-17      Hispanic            3955.282
13-17      White not Hispanic  3904.358
18-21      Hispanic            9959.846
18-21      White not Hispanic 10133.058
22-50      Hispanic           40924.116
22-50      White not Hispanic 40187.624
51+        Hispanic           55585.000
51+        White not Hispanic 52670.424

We can see that when we split the data into separate age cohorts, there is very little difference between the expenditures variable between the two ethnic groups. In fact, in both the 22-50 and 51+ cohorts, the expenditures in the Hispanics group seem to be slightly higher than those in the White not Hispanic group.

It follows that there does not seem to be evidence of ethnic discrimination. Although the average annual expenditures is lower for Hispanics than for White non-Hispanics, this is due to the difference in age distributions between the two ethnic groups. The population of Hispanic consumers is relatively young compared to the population of White non-Hispanic consumers, and the amount of expenditures for younger consumers tends to be lower than for older consumers. When individuals of similar ages are compared, there are not large differences in the average amount of financial support provided to a Hispanic consumer versus a White non-Hispanic consumer.

Simpson's paradox

Identifying confounding variables is essential for understanding data. Confounders are often context-specific; for example, age is not necessarily a confounder for the relationship between ethnicity and expenditures in a different population. Additionally, it is rarely immediately obvious which variables in a dataset are confounders; looking for confounding variables is an integral part of exploring a dataset.

These data represent an extreme example of confounding known as Simpson's paradox, in which an association observed in several groups may disappear or reverse direction once the groups are combined. In other words, an association between two variables X and Y may disappear or reverse direction once data are partitioned into subpopulations based on a third variable Z, the confounding variable.

Mean expenditures is higher for Hispanics than White non-Hispanics in all age cohorts except one. Yet, once all the data are aggregated, the average expenditures for White non-Hispanics is over twice as large as the average for Hispanics. This paradox can be explored from a mathematical perspective by using weighted averages, where the average expenditure for each cohort is weighted by the proportion of the population in that cohort.

In [0]: