In the United States, individuals with developmental disabilities typically receive services and support from state governments. The State of California allocates funds to developmentally-disabled residents through the California Department of Developmental Services (DDS); individuals receiving DDS funds are referred to as 'consumers'. The dataset dds.discr represents a sample of 1,000 DDS consumers (out of a total population of approximately 250,000), and includes information about age, gender, ethnicity, and the amount of financial support per consumer provided by the DDS. The dataset is available in the oibiostat package.
A team of researchers examined the mean annual expenditure on consumers by ethnicity, and found that the mean annual expenditure on Hispanic consumers was approximately one-third of the mean expenditure on White non-Hispanic consumers. As a result, an allegation of ethnic discrimination was brought against the California DDS.
Does this finding represent sufficient evidence of ethnic discrimination, or might there be more to the story? This lab walks through an exploratory analysis that not only investigates the relationship between two variables of interest, but also considers whether other variables might be influencing that relationship.
We will start by importing some useful libraries and by setting some options that will make plots easier to read on the projector screen:
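A minimal sketch of this setup, assuming the lab relies on dplyr, mosaic, and ggformula (the packages that provide glimpse(), favstats(), and the gf_ plotting functions). The theme and font size are illustrative choices, not settings confirmed by the lab:

```r
library(dplyr)      # glimpse(), filter(), group_by(), summarize()
library(mosaic)     # favstats(), tally(), formula interface for mean()
library(ggformula)  # gf_histogram(), gf_boxplot(), gf_point(), ...
library(ggplot2)    # theme_set(), theme_bw()

# Assumed option: a larger base font so plots are readable on a projector
theme_set(theme_bw(base_size = 18))
```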
Then we will import the oibiostat library that contains the data set we are interested in:
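Assuming the oibiostat package is already installed, this step is a single call:

```r
library(oibiostat)  # attaches the package that ships the dds.discr data set
```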
Read the data set from the library:
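Following the Usage entry in the package documentation, the data can be loaded with data(). The package argument is added here only so that the line also works when oibiostat has not been attached:

```r
# Load the data frame dds.discr into the workspace
data("dds.discr", package = "oibiostat")
```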
Ask for more information about the data set:
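One way to open the help page, again assuming oibiostat is installed:

```r
# Equivalent to typing ?dds.discr once oibiostat is attached
help("dds.discr", package = "oibiostat")
```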
dds.discr package:oibiostat R Documentation
Discrimination in Developmental Disability Support
Description:
Represents a sample of 1,000 DDS consumers, based on data from the
State of California
Usage:
data("dds.discr")
Format:
A data frame with 1000 observations on the following 6 variables.
‘id’ Unique identification code for each resident
‘age.cohort’ Age as sorted into six groups ‘0-5’ years, ‘6-12’
years, ‘13-17’ years, ‘18-21’ years, ‘22-50’ years, and ‘51+’
years
‘age’ Age, measured in years
‘gender’ Gender, recorded as either ‘Female’ or ‘Male’.
‘expenditures’ Amount of expenditures spent by the State on an
individual annually, measured in USD
‘ethnicity’ Ethnic group, recorded as either ‘American Indian’,
‘Asian’, ‘Black’, ‘Hispanic’, ‘Multi Race’, ‘Native
Hawaiian’, ‘Other’, or ‘White not Hispanic’
Examples:
data("dds.discr")
dds.discr[1:5,]
Use glimpse() to get a quick overview of the data set:
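A sketch of the call, assuming glimpse() is taken from the dplyr package:

```r
library(dplyr)
data("dds.discr", package = "oibiostat")

# One line per column: name, type, and the first few values
glimpse(dds.discr)
```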
Rows: 1,000
Columns: 6
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
We see that the data set contains 1000 observations with 6 variables: id, age.cohort, age, gender, expenditures and ethnicity.
We can use the "dollar notation" to extract any of these six columns. For example, to show the age column, we do this:
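In code, the dollar notation looks as follows. The head() call is an optional extra that prints only the first few values instead of all 1,000:

```r
data("dds.discr", package = "oibiostat")

dds.discr$age        # the whole age column, as an integer vector
head(dds.discr$age)  # only the first six values
```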
To begin understanding a dataset and developing a sense of context, start by examining the distributions of single variables.
Let's start with the expenditures variable. It is a numerical variable. We can start by finding a variety of numerical summaries:
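For instance, using base R functions together with the dollar notation (this particular selection of summaries is only one possibility):

```r
data("dds.discr", package = "oibiostat")

mean(dds.discr$expenditures)    # mean
median(dds.discr$expenditures)  # median
sd(dds.discr$expenditures)      # standard deviation
range(dds.discr$expenditures)   # minimum and maximum
```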
The favstats command computes these and more in a single call. It calculates 9 numerical summaries for our variable: the minimum, first quartile, median, third quartile, and maximum (these 5 make up what we call the five number summary), the mean, the standard deviation, the count, and the number of missing values:
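A sketch of the call, assuming favstats() comes from the mosaic package and is used with a one-sided formula:

```r
library(mosaic)
data("dds.discr", package = "oibiostat")

# Columns returned: min, Q1, median, Q3, max, mean, sd, n, missing
favstats(~ expenditures, data = dds.discr)
```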
The five number summary can be visualized by a boxplot:
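One way to draw it, assuming the ggformula package and its one-sided formula form for a single variable. The object name p_box is only illustrative:

```r
library(ggformula)
data("dds.discr", package = "oibiostat")

# Horizontal boxplot of a single numerical variable
p_box <- gf_boxplot(~ expenditures, data = dds.discr)
p_box
```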
A histogram gives us a way to visualize the distribution of the values of the variable:
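A sketch using gf_histogram() from ggformula (an assumption about the plotting package; p_hist is an illustrative name):

```r
library(ggformula)
data("dds.discr", package = "oibiostat")

# Histogram of annual expenditures with the default binning
p_hist <- gf_histogram(~ expenditures, data = dds.discr)
p_hist
```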
The distribution of annual expenditures exhibits right skew, indicating that for a majority of consumers, expenditures are relatively low; most are within the $0 - $5,000 range. There are some consumers for whom expenditures are much higher, such as within the $60,000 - $80,000 range. The quartiles for expenditures are $2,899, $7,026, and $37,710.
Next, consider the gender variable. It is a categorical variable with two levels. We will create a frequency table and a bar graph:
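A sketch of both steps, assuming tally() from mosaic for the table and gf_bar() from ggformula for the graph. The object names are illustrative:

```r
library(mosaic)
library(ggformula)
data("dds.discr", package = "oibiostat")

# Frequency table: number of consumers in each gender category
gender_counts <- tally(~ gender, data = dds.discr)
gender_counts

# Bar graph of the same counts
p_gender <- gf_bar(~ gender, data = dds.discr)
p_gender
```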
The age variable is a continuous numerical variable. We will start by plotting its histogram:
The variable is skewed to the right and slightly bimodal, very similar to the expenditures variable. The similarity of the shapes suggests, but does not prove, that there may be an association between the two variables. We will look into that later.
The age.cohort variable is categorical and ordinal. As with any categorical variable that does not have too many levels, we can find a frequency table and visualize it using a bar graph.
We see that the middle four cohorts each contain about 200 consumers, while the lowest and highest cohorts each contain about 100 consumers. Apart from that, there is no large difference between the sizes of the six cohorts.
As with any other categorical variable, we will summarize ethnicity in a frequency table, and visualize using a bar graph:
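A sketch assuming tally() from mosaic and gf_bar() from ggformula. The percent table is an optional addition that makes the unequal group sizes easier to read:

```r
library(mosaic)
library(ggformula)
data("dds.discr", package = "oibiostat")

# Counts and percentages for the eight ethnic groups
ethnicity_counts <- tally(~ ethnicity, data = dds.discr)
ethnicity_pct <- tally(~ ethnicity, data = dds.discr, format = "percent")
ethnicity_counts
ethnicity_pct

# Bar graph of the counts
p_ethnicity <- gf_bar(~ ethnicity, data = dds.discr)
p_ethnicity
```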
There are eight ethnic groups represented in the data, but they are not equally represented. The two largest groups, Hispanics and White non-Hispanics, together represent about 80% of the consumers. Some of the ethnic groups are so small that they probably do not form a representative sample from the population.
Now that we have explored all the individual variables, we can turn to investigating possible associations between the variables.
Earlier we noticed the similarity of the distributions between expenditures and age, asking if that perhaps is an indicator of an association between the two variables. While it is possible to have two variables with similar distributions that are not associated, it is definitely worth looking into.
Since they are both numerical variables, the appropriate way to visualize their possible association is using a scatter plot, also known as point plot:
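A sketch using gf_point() from ggformula, with the response on the left of the formula and the explanatory variable on the right. The alpha value is an illustrative choice to reduce overplotting:

```r
library(ggformula)
data("dds.discr", package = "oibiostat")

# Scatter plot: expenditures (y) against age (x), one point per consumer
p_scatter <- gf_point(expenditures ~ age, data = dds.discr, alpha = 0.5)
p_scatter
```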
The scatter plot shows that there clearly is an association. It seems somewhat complicated, but overall it seems that higher age corresponds to higher expenditures.
Let's see how this plays out with the cohorts. The age.cohort variable is categorical. We can start by looking at the numerical summaries for expenditures broken down by age.cohort:
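A sketch assuming favstats() from mosaic, this time with a two-sided formula so that the summaries are computed separately for each cohort:

```r
library(mosaic)
data("dds.discr", package = "oibiostat")

# One row of summaries per age cohort
favstats(expenditures ~ age.cohort, data = dds.discr)
```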
From the table we can see that each of the positional numerical summaries (the 5 number summary and the mean) grows larger with age. It can be seen even more clearly from a side-by-side boxplot:
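A sketch with gf_boxplot() from ggformula, using the same formula as the summary table (p_cohort is an illustrative name):

```r
library(ggformula)
data("dds.discr", package = "oibiostat")

# One box per age cohort, all sharing the same expenditures axis
p_cohort <- gf_boxplot(expenditures ~ age.cohort, data = dds.discr)
p_cohort
```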
Again, there is a very clear association between age.cohort and expenditures: higher age clearly corresponds to higher expenditures. That makes sense, as the cohorts are indicative of particular life phases. In the first three cohorts, consumers are still living with their parents as they move through preschool age, elementary/middle school age, and high school age. In the 18-21 cohort, consumers are transitioning from their parents' homes to living on their own or in supportive group homes. From ages 22-50, individuals are mostly no longer living with their parents but may still receive some support from family. In the 51+ cohort, consumers often have no living parents and typically require the most support.
We again have a categorical variable (ethnicity) and numerical variable (expenditures), so the strategy will be similar to the one we chose for age.cohort and expenditures. We will start with favstats and then construct a side by side box plot:
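The favstats step can be sketched as follows, assuming the mosaic package:

```r
library(mosaic)
data("dds.discr", package = "oibiostat")

# One row of summaries per ethnic group; the n column shows the group sizes
favstats(expenditures ~ ethnicity, data = dds.discr)
```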
Here the relationship between the variables seems to be much more complicated. One notable difference compared to the age.cohort table above is in the n column: some of the groups are very small, with only 2 or 3 consumers. As we noted before, two groups, Hispanic and White, comprise almost 80% of the whole sample.
The five number summaries are all over the place and it is difficult to compare them from the table. Let's use a side by side box-plot to help visualize them:
The distribution of expenditures is quite different between ethnic groups. For example, there is very little variation in expenditures within the Multi Race, Native Hawaiian, and Other groups; in other groups, such as the White not Hispanic group, there is a greater range in expenditures. Additionally, there seems to be a difference in the amount of funding that a person receives, on average, between different ethnicities. The median amount of annual support received for individuals in the American Indian and Native Hawaiian groups is about $40,000, versus medians of approximately $10,000 for Asian and Black consumers.
One thing that is not clear from this plot is the size of the individual groups. From the plot there is no way of knowing that there are only 3 consumers in the Native Hawaiian group, and only 2 in the Other group. Typically, smaller groups will have less variation, so this information is important.
We can instead use a jitter plot to visualize the data. In a jitter plot, every observation has a separate dot, so we will be able to see how many observations are in each of the categories:
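A sketch using gf_jitter() from ggformula. The width, height, and alpha values are illustrative: the points are spread horizontally only, so the expenditures values themselves are not distorted:

```r
library(ggformula)
data("dds.discr", package = "oibiostat")

# One dot per consumer; horizontal jitter separates overlapping points
p_jitter <- gf_jitter(expenditures ~ ethnicity, data = dds.discr,
                      width = 0.2, height = 0, alpha = 0.4)
p_jitter
```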
We can even combine the two to get the best of both:
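One way to layer the two, assuming ggformula: a gf_ plot can be piped into a second gf_ function, which then reuses the formula and data of the first. The jitter settings are illustrative:

```r
library(dplyr)      # provides the %>% pipe
library(ggformula)
data("dds.discr", package = "oibiostat")

# Boxplots first, then the individual observations drawn on top
p_combined <- gf_boxplot(expenditures ~ ethnicity, data = dds.discr) %>%
  gf_jitter(width = 0.2, height = 0, alpha = 0.3)
p_combined
```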
One more simple way to summarize the relationship between ethnicity and expenditures: look at the means. We already have the "mean" column in the favstats table, but it may be nice to see the means alone.
We have several ways to do this. One way is by using the mean() command with the expenditures ~ ethnicity formula:
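The formula interface for mean() is provided by the mosaic package, so it must be loaded first (the name group_means is illustrative):

```r
library(mosaic)
data("dds.discr", package = "oibiostat")

# Named vector with the mean expenditures of each ethnic group
group_means <- mean(expenditures ~ ethnicity, data = dds.discr)
group_means
```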
That gives us the means we want, but it does not give us a nice table. There is another way, by manipulating the data set:
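A sketch of that approach, assuming the dplyr verbs group_by() and summarize(). The column name mean_expenditures is an illustrative choice:

```r
library(dplyr)
data("dds.discr", package = "oibiostat")

# One row per ethnic group, with the group mean in its own column
means_table <- dds.discr %>%
  group_by(ethnicity) %>%
  summarize(mean_expenditures = mean(expenditures))
means_table
```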
In all those summaries and visualizations, we can see quite a large difference between the two largest ethnic groups: for example, the mean expenditure for the White group is more than twice as large as the mean expenditure of the Hispanic group.
To make this clearer and to be able to investigate this difference, let's filter our data set so that only consumers that are in the Hispanic group or the White group remain. We will save the new filtered data in a new data set called dds_h_or_w:
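A sketch of the filtering step, assuming filter() and glimpse() from dplyr and the level names listed in the data set documentation:

```r
library(dplyr)
data("dds.discr", package = "oibiostat")

# Keep only rows whose ethnicity is one of the two largest groups
dds_h_or_w <- dds.discr %>%
  filter(ethnicity %in% c("Hispanic", "White not Hispanic"))

glimpse(dds_h_or_w)
```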
Rows: 777
Columns: 6
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10820, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 14, 13, 13, 14, 15, 20, 23, 5,…
$ gender <fct> Female, Male, Male, Female, Male, Female, Female, Female,…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 5021, 2887, 41…
$ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
As we can see, the filtered data set only has 777 observations, compared to the 1000 observations in the original data set.
Let's look at the distribution of expenditures and age in this restricted data set:
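One possible pair of plots, assuming ggformula and side-by-side boxplots for both variables. The filtering step is repeated so that the example runs on its own, and the plot names are illustrative:

```r
library(dplyr)
library(ggformula)
data("dds.discr", package = "oibiostat")

dds_h_or_w <- dds.discr %>%
  filter(ethnicity %in% c("Hispanic", "White not Hispanic"))

# Expenditures by ethnic group
p_exp <- gf_boxplot(expenditures ~ ethnicity, data = dds_h_or_w)
p_exp

# Age by ethnic group
p_age <- gf_boxplot(age ~ ethnicity, data = dds_h_or_w)
p_age
```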
Based on the boxplot, most Hispanic consumers receive between approximately $0 and $20,000 from the California DDS; individuals receiving amounts higher than this are upper outliers. However, for White non-Hispanic consumers, median expenditures is at $15,718, and the middle 50% of consumers receive between about $4,000 and $43,000. The mean expenditures for Hispanic consumers is $11,066, while the mean expenditures for White non-Hispanic consumers is over twice as high at $24,698. On average, a Hispanic consumer receives less financial support from the California DDS than a White non-Hispanic consumer.
Just as with the overall data set, we see that the age variable has a distribution similar to the expenditures. The ages of the Hispanic consumers are generally much lower than the ages of the White non-Hispanic consumers.
Let's look at the representation of each of the two groups in the age cohorts:
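A sketch using a base R two-way table. The column proportions are an optional addition showing how each ethnic group is spread across the cohorts, and droplevels() removes the six ethnicity levels that no longer occur after filtering:

```r
data("dds.discr", package = "oibiostat")

dds_h_or_w <- droplevels(subset(dds.discr,
  ethnicity %in% c("Hispanic", "White not Hispanic")))

# Counts: age cohorts in rows, ethnic groups in columns
cohort_by_ethnicity <- with(dds_h_or_w, table(age.cohort, ethnicity))
cohort_by_ethnicity

# Proportion of each ethnic group in each cohort (each column sums to 1)
cohort_props <- prop.table(cohort_by_ethnicity, margin = 2)
round(cohort_props, 2)
```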
Again, we can see that Hispanic consumers tend to be younger, with most of them falling into the 6-12, 13-17, and 18-21 age cohorts. In contrast, White non-Hispanics tend to be older; most consumers in this ethnic group are in the 22-50 age cohort, and relatively more White non-Hispanic consumers are in the 51+ age cohort as compared to Hispanics.
We know that expenditures is very strongly associated with age and age.cohort. It is true in the complete data set, and it seems to still hold in the data set restricted to the two main ethnic groups. When trying to compare expenditures in the two ethnic groups, it is possible that what we are actually seeing is the effect of the age and age.cohort variables instead! In that case, age as well as age.cohort would be confounding variables.
For a closer look at the relationship between age, ethnicity, and expenditures, compare how expenditures differ by ethnicity within each age cohort. If age is indeed the primary source of the observed variation in expenditures, then there should be little difference in expenditures between individuals in different ethnic groups but the same age cohort. One way to visualize that is to make a box plot of expenditures by ethnicity separately for each of the age cohorts:
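A sketch assuming ggformula, where a vertical bar in the formula requests one panel per level of the variable after it (p_facets is an illustrative name):

```r
library(dplyr)
library(ggformula)
data("dds.discr", package = "oibiostat")

dds_h_or_w <- dds.discr %>%
  filter(ethnicity %in% c("Hispanic", "White not Hispanic"))

# One panel per age cohort, each comparing the two ethnic groups
p_facets <- gf_boxplot(expenditures ~ ethnicity | age.cohort, data = dds_h_or_w)
p_facets
```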
We can also compare the mean expenditures of Hispanic and White not Hispanic groups within each age cohort:
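One way to build such a table with base R, shown here as a sketch. tapply() computes the mean for every combination of cohort and ethnic group, and droplevels() removes the ethnicity levels left empty by the filtering:

```r
data("dds.discr", package = "oibiostat")

dds_h_or_w <- droplevels(subset(dds.discr,
  ethnicity %in% c("Hispanic", "White not Hispanic")))

# Mean expenditures: age cohorts in rows, ethnic groups in columns
cohort_means <- with(dds_h_or_w,
  tapply(expenditures, list(age.cohort, ethnicity), mean))
round(cohort_means)
```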
We can see that when we split the data into separate age cohorts, there is very little difference in expenditures between the two ethnic groups. In fact, in both the 22-50 and 51+ cohorts, the expenditures in the Hispanic group seem to be slightly higher than those in the White not Hispanic group.
It follows that there does not seem to be evidence of ethnic discrimination. Although the average annual expenditures is lower for Hispanics than for White non-Hispanics, this is due to the difference in age distributions between the two ethnic groups. The population of Hispanic consumers is relatively young compared to the population of White non-Hispanic consumers, and the amount of expenditures for younger consumers tends to be lower than for older consumers. When individuals of similar ages are compared, there are not large differences in the average amount of financial support provided to a Hispanic consumer versus a White non-Hispanic consumer.
Identifying confounding variables is essential for understanding data. Confounders are often context-specific; for example, age is not necessarily a confounder for the relationship between ethnicity and expenditures in a different population. Additionally, it is rarely immediately obvious which variables in a dataset are confounders; looking for confounding variables is an integral part of exploring a dataset.
These data represent an extreme example of confounding known as Simpson's paradox, in which an association observed in several groups may disappear or reverse direction once the groups are combined. In other words, an association between two variables X and Y may disappear or reverse direction once data are partitioned into subpopulations based on a third variable Z, the confounding variable.
Mean expenditures is higher for Hispanics than White non-Hispanics in all age cohorts except one. Yet, once all the data are aggregated, the average expenditures for White non-Hispanics is over twice as large as the average for Hispanics. This paradox can be explored from a mathematical perspective by using weighted averages, where the average expenditure for each cohort is weighted by the proportion of the population in that cohort.
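The weighted-average view can be checked directly on the data. The sketch below uses base R and illustrative object names: it weights each cohort mean by the share of that ethnic group falling in the cohort, and compares the weighted sum with the overall group mean:

```r
data("dds.discr", package = "oibiostat")

# Keep the two largest groups and drop the unused ethnicity levels
hw <- droplevels(subset(dds.discr,
  ethnicity %in% c("Hispanic", "White not Hispanic")))

# Mean expenditures in each age cohort (rows) by ethnicity (columns)
cohort_means <- with(hw, tapply(expenditures, list(age.cohort, ethnicity), mean))

# Share of each ethnic group in each cohort; every column sums to 1
cohort_weights <- prop.table(with(hw, table(age.cohort, ethnicity)), margin = 2)

# Weighted average of the cohort means, one value per ethnic group
weighted_means <- colSums(cohort_means * cohort_weights, na.rm = TRUE)

# Overall means computed directly, for comparison
overall_means <- with(hw, tapply(expenditures, ethnicity, mean))

round(cohort_weights, 2)
weighted_means
overall_means
```

The two sets of means agree, so the overall gap comes entirely from the weights: Hispanic consumers are concentrated in the younger, lower-expenditure cohorts, which pulls their overall mean down even though their cohort means are similar to those of White non-Hispanic consumers.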