## Tidy Tuesday - Week 24

| tidytuesday | data | visualization |

This is my work for week 24 of the #TidyTuesday project.

We are dealing with puppies this week!

While the original article from the Washington Post shows the distribution of cats and dogs in the entire globe, for this week we’ll focus only on data coming from the USA, using a dataset offered by data.world.

All code and data can be found in my dedicated GitHub repository MyTidyTuesday.

## Week 24 - Cats vs Dogs (USA)

```
library(tidyverse)
library(magrittr)
library(readxl)
library(skimr)
```

Let’s first read in the data and rename the columns for simplicity.

```
df <- read_excel("data/catsvdogs.xlsx", skip = 1,
col_names = c("location", "households_1000", "perc_households_pets",
"num_pet_households_1000", "perc_dog_owners",
"dog_own_households_1000", "mean_num_dogs_per_household",
"dog_population_1000", "perc_cat_owners", "cat_own_households_1000",
"mean_num_cats_per_household", "cat_population_1000"))
```

```
head(df)
```

```
## # A tibble: 6 x 12
## location households_1000 perc_households… num_pet_househo…
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 1828 59.5 1088
## 2 Arizona 2515 59.5 1497
## 3 Arkansas 1148 62.4 716
## 4 Califor… 12974 52.9 6865
## 5 Colorado 1986 61.3 1217
## 6 Connect… 1337 54.4 728
## # ... with 8 more variables: perc_dog_owners <dbl>,
## # dog_own_households_1000 <dbl>, mean_num_dogs_per_household <dbl>,
## # dog_population_1000 <dbl>, perc_cat_owners <dbl>,
## # cat_own_households_1000 <dbl>, mean_num_cats_per_household <dbl>,
## # cat_population_1000 <dbl>
```

### Data Exploration

Now we can have a look at the data structure.

```
skim(df)
```

```
## Skim summary statistics
## n obs: 49
## n variables: 12
##
## ── Variable type:character ────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## location 0 49 49 4 20 0 49
##
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0
## cat_own_households_1000 0 49 49 728.06 717.29 33
## cat_population_1000 0 49 49 1492.8 1459.86 63
## dog_own_households_1000 0 49 49 876.37 891.83 38
## dog_population_1000 0 49 49 1414.16 1464.66 42
## households_1000 0 49 49 2403.9 2514.05 221
## mean_num_cats_per_household 0 49 49 2.04 0.19 1.7
## mean_num_dogs_per_household 0 49 49 1.59 0.2 1.1
## num_pet_households_1000 0 49 49 1342.59 1358.25 63
## perc_cat_owners 0 49 49 31.64 5.68 11.6
## perc_dog_owners 0 49 49 36.97 6.67 13.1
## perc_households_pets 0 49 49 56.86 6.93 21.9
## p25 p50 p75 p100 hist
## 247 501 876 3687 ▇▅▂▁▁▁▁▁
## 514 1185 1844 7118 ▇▅▂▁▁▁▁▁
## 273 638 1069 4260 ▇▆▂▁▁▁▁▁
## 410 1097 1798 7163 ▇▅▂▁▁▁▁▁
## 765 1759 2632 12974 ▇▃▁▁▁▁▁▁
## 1.9 2 2.2 2.6 ▅▇▇▆▅▂▁▂
## 1.4 1.6 1.7 2.1 ▁▁▃▇▂▂▁▁
## 475 957 1611 6865 ▇▆▂▁▁▁▁▁
## 29 31.3 33.8 49.5 ▁▁▁▇▇▂▁▁
## 32.9 36.6 42.5 47.9 ▁▁▁▃▇▇▇▆
## 53.6 56.8 61.3 70.8 ▁▁▁▁▂▇▆▁
```

Luckily there are no missing values, so we can proceed with our analysis.

#### Pet-friendly households per US State

Let’s first visualize the percentage of household with pets in each State.

```
df %>%
ggplot(aes(x = reorder(location, -perc_households_pets),
y = perc_households_pets, fill = location)) +
geom_col() +
coord_flip() +
labs(x = "US State", y = "%", title = "Percentage of households with pets", subtitle = "District of Columbia seems to be not so pet-friendly.") +
guides(fill = FALSE)
```

It is clear that in every US State at least half of the households have pets; particularly, we can see that Vermont is definitely a pet-friendly State, with more than 70% household having at least one dog or cat. District of Columbia, instead, doesn’t seem to like pets this much, scoring a little more than 20% in this chart.

#### Dog- and cat-owning households

Let’s see if there is any preference for dogs over cats (or viceversa) in these States.

```
df %>%
ggplot(aes(x = reorder(location, -perc_households_pets), fill = location)) +
geom_col(aes(y = dog_own_households_1000 - cat_own_households_1000)) +
coord_flip() +
labs(x = "US State", y = "Difference (in 1000s households)",
title = "Dog- vs cat-owning households",
subtitle = "Households with dogs definitely outnumber those hosting cats.") +
guides(fill = FALSE) +
scale_y_continuous(breaks = c(-250, 0, 250, 500, 750, 1000, 1250))
```

For this plot I computed the difference between dog-owning households and cat-owning ones, in thousands: negative values represent a preference for cats, while positive values denote a higher number of households hosting dogs. A couple of peculiar data points are Texas and Massachusetts, where people seem to definitely love dogs, in the former case, and cats, in the latter.

#### Mean number of dogs/cats per household

We might be interested in knowing whether, as the number of households with pets increases, so does the mean number of dogs/cats hosted in each household. Let’s find out.

```
gath_df <- df %>%
mutate(dogs = mean_num_dogs_per_household,
cats = mean_num_cats_per_household) %>%
select(location, num_pet_households_1000, dogs, cats) %>%
gather(key = "pet", value = "value", dogs, cats)
```

```
gath_df
```

```
## # A tibble: 98 x 4
## location num_pet_households_1000 pet value
## <chr> <dbl> <chr> <dbl>
## 1 Alabama 1088 dogs 1.7
## 2 Arizona 1497 dogs 1.8
## 3 Arkansas 716 dogs 2
## 4 California 6865 dogs 1.6
## 5 Colorado 1217 dogs 1.6
## 6 Connecticut 728 dogs 1.3
## 7 Delaware 189 dogs 1.4
## 8 District of Columbia 63 dogs 1.1
## 9 Florida 4138 dogs 1.5
## 10 Georgia 2093 dogs 1.6
## # ... with 88 more rows
```

```
gath_df %>%
ggplot(aes(x = num_pet_households_1000, y = value, color = pet)) +
geom_smooth() +
geom_point(alpha = 0.5) +
labs(x = "Households (in 1000s)", y = "Number of pets",
title = "Mean number of dogs/cats per household",
subtitle = "The number of pets per household seems to reach a plateau after 1M households with pets.")
```

Although the data are a bit messy, an interesting trend is visible here: initially, as the number of households with pets grows, the mean number of pets per household steeply grows as well, and this is true for both dogs and cats. However, after about the first million of households with pets, the mean number of pets seems to reach a plateau, with cats outnumbering dogs on average.

### Discussion

So we have found two interesting things here:

- most households have a kind of preference for dogs over cats, but
- on average, there are more cats than dogs in each household.

With these information, we can try to normalize the number of households with pets by the mean number of dogs/cats hosted.

```
norm_df <- df %>%
mutate(norm_dogs = dog_own_households_1000 * mean_num_dogs_per_household,
norm_cats = cat_own_households_1000 * mean_num_cats_per_household) %>%
select(location, norm_dogs, norm_cats, perc_households_pets)
```

```
norm_df
```

```
## # A tibble: 49 x 4
## location norm_dogs norm_cats perc_households_pets
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 1372. 1252. 59.5
## 2 Arizona 1814. 1412. 59.5
## 3 Arkansas 1100 807. 62.4
## 4 California 6816 7005. 52.9
## 5 Colorado 1352 1220. 61.3
## 6 Connecticut 493. 811. 54.4
## 7 Delaware 158. 192. 56.6
## 8 District of Columbia 41.8 62.7 21.9
## 9 Florida 4077 4366. 54.4
## 10 Georgia 2435. 2178. 55.1
## # ... with 39 more rows
```

```
norm_df %>%
ggplot(aes(x = reorder(location, -perc_households_pets), fill = location)) +
geom_col(aes(y = norm_dogs - norm_cats)) +
coord_flip() +
labs(x = "US State", y = "Difference (in 1000s households)",
title = "Dog- vs cat-owning households (normalized)",
subtitle = "With normalized data, we see that cats win the fight.") +
guides(fill = FALSE) +
scale_y_continuous(breaks = c(-1000, -500, 0, 500, 1000, 1500))
```

With this computation, we can clearly see that most US States host more cats than dogs in their houses. We could have reached the same conclusion by simply plotting the total population of dogs and cats, with a few differences.

```
df %>%
ggplot(aes(x = reorder(location, -perc_households_pets), fill = location)) +
geom_col(aes(y = dog_population_1000 - cat_population_1000)) +
coord_flip() +
labs(x = "US State", y = "Difference (in 1000s pets)",
title = "Difference of dog/cat population",
subtitle = "Most US States host cats, rather than dogs.") +
guides(fill = FALSE) +
scale_y_continuous(breaks = c(-1000, -500, 0, 500, 1000, 1500))
```