This is a compilation of Data Analysis and Science work I have written in R over the years. I love working with tidyverse
and ggplot2
, and
rely heavy on its environment.
Here are some example analyses 1), my compilation of frequent data analysis problems together with their solutions 2) and some data science/statistics showcases 3).
I have also written and compiled a bookdown
online book for R and Data
Analysis/Science for personal use: Link to
Book.
About Gapminder: “Gapminder identifies systematic misconceptions about important global trends and proportions and uses reliable data to develop easy to understand teaching materials to rid people of their misconceptions.”
What was the life expectancy in Germany for the last 60 years?
gapminder %>%
filter(country == "Germany") %>%
ggplot() +
geom_line(aes(year,life_expectancy)) +
scale_y_continuous(labels = scales::comma) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life Expectancy in Germany from 1960 to 2015")
What was the life expectancy in several different countries for the last 60 years?
gapminder %>%
filter(country %in% c("Germany","China", "Nigeria", "Canada", "Thailand", "Russia")) %>%
ggplot() +
geom_line(aes(year,life_expectancy, group = country, color = country)) +
scale_y_continuous(labels = scales::comma) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Life Expectancy in Several Different Countries from 1960 to 2015")
What are the differences in infant mortality rates by continent?
gapminder %>%
filter(year == 2014) %>%
ggplot() + geom_boxplot(aes(continent, infant_mortality, color = continent), show.legend = F) +
ggtitle("Infant Mortality Rates by Continents") +
xlab("Continents") +
ylab("Infant Mortality Rates")
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
What regions in gapminder are there?
levels(gapminder$region)
## [1] "Australia and New Zealand" "Caribbean"
## [3] "Central America" "Central Asia"
## [5] "Eastern Africa" "Eastern Asia"
## [7] "Eastern Europe" "Melanesia"
## [9] "Micronesia" "Middle Africa"
## [11] "Northern Africa" "Northern America"
## [13] "Northern Europe" "Polynesia"
## [15] "South America" "South-Eastern Asia"
## [17] "Southern Africa" "Southern Asia"
## [19] "Southern Europe" "Western Africa"
## [21] "Western Asia" "Western Europe"
Population bar graph of Europe by countries.
gapminder %>%
filter(year == 2015 & continent == "Europe") %>%
ggplot(aes(y=reorder(country, population),x = population)) +
geom_bar(stat = "identity") +
scale_x_continuous(labels = scales::comma) +
ggtitle("Population of European Countries") +
xlab("Population") +
ylab("Country")
Pie chart of Europe’s population
gapminder %>%
filter(year == 2015 & continent == "Europe") %>%
ggplot(aes(y=reorder(country, population),x = population)) +
geom_bar(stat = "identity") +
scale_x_continuous(labels = scales::comma) +
coord_polar("y", start=0) +
ggtitle("Population of European Countries") +
xlab("Population") +
ylab("Country")
What are the countries with the biggest life expectancy in Europe?
gapminder %>%
filter(year == 2016 & continent == "Europe") %>%
slice_max(life_expectancy, n = 5)
## country year infant_mortality life_expectancy fertility population gdp
## 1 Iceland 2016 NA 83.3 NA NA NA
## 2 Switzerland 2016 NA 83.1 NA NA NA
## 3 Spain 2016 NA 82.7 NA NA NA
## 4 Italy 2016 NA 82.3 NA NA NA
## 5 Luxembourg 2016 NA 82.3 NA NA NA
## continent region
## 1 Europe Northern Europe
## 2 Europe Western Europe
## 3 Europe Southern Europe
## 4 Europe Southern Europe
## 5 Europe Western Europe
Bar graph of life expectancy in Europe by countries.
gapminder %>%
filter(year == 2016 & continent == "Europe") %>%
ggplot(aes(y = reorder(country, life_expectancy), x = life_expectancy)) +
geom_bar(stat = "identity") +
coord_cartesian(xlim = c(70,85)) +
xlab("Life Expectancy") + ylab(NULL) +
labs(title = "Life Expectancy in European Countries (2016)", caption = "Gapminder data.")
The GDP of middle-eastern countries.
gapminder %>%
filter(country %in% c("Israel", "Lebanon", "Egypt", "Saudi Arabia", "Bahrain", "West Bank and Gaza", "Yemen", "United Arab Emirate", "Iran", "Syria")) %>%
filter(year == 2009) %>%
ggplot(aes(gdp,country)) +
geom_bar(stat = "identity") +
ggtitle("Bar Graph of GDP in Middle-Eastern Countries") +
xlab("GDP") +
ylab("Country")
## Warning: Removed 1 rows containing missing values (position_stack).
What was the correlation between fertility and life expectancy in 1962?
gapminder %>%
filter(year == 1962) %>%
ggplot(aes(fertility, life_expectancy)) +
geom_point() +
ggtitle("Fertility and Life Expectancy in 1962") +
xlab("Fertility") +
ylab("Life Expectancy") +
theme_clean()
What was the correlation between fertility and life expectancy by continents in 1962?
gapminder %>%
filter(year == 1962) %>%
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
ggtitle("Fertility and Life Expectancy in 1962") +
xlab("Fertility") +
ylab("Life Expectancy")
How was it in 2012 compared to 1962?
gapminder %>%
filter(year %in% c(1962,2012)) %>%
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_grid(.~year) +
ggtitle("Fertility and Life Expectancy in 1962 and 2012") +
xlab("Fertility") +
ylab("Life Expectancy")
Show me its development in detail over time
gapminder %>%
filter(year %in% c(1962,1972,1982,1992,2002,2012)) %>%
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_wrap(.~year) +
ggtitle("Fertility and Life Expectancy from 1962 to 2012") +
xlab("Fertility") +
ylab("Life Expectancy")
What is the fertility distribution in Europe like?
gapminder %>%
filter(continent == "Europe" & year == 2015) %>%
ggplot(aes(fertility, fill = region)) +
geom_boxplot() +
coord_flip() +
ggtitle("Fertility Rates in Europe by Regions in 2015") +
xlab("Fertility Rates")
What is the fertility distribution in Asia like?
gapminder %>%
filter(continent == "Asia" & year == 2015) %>%
ggplot(aes(fertility, fill = region)) +
geom_boxplot() +
coord_flip() +
ggtitle("Fertility Rates in Asia by Regions in 2015") +
xlab("Fertility Rates")
What is the fertility distribution like by continents?
gapminder %>%
filter(year == 2015) %>%
ggplot(aes(fertility,continent, fill = continent)) + ggridges::geom_density_ridges(show.legend = F) +
ggtitle("Fertility Rates by Continents in 2015") +
xlab("Fertility Rates") +
ylab("Continents")
## Picking joint bandwidth of 0.315
## Warning: Removed 1 rows containing non-finite values (stat_density_ridges).
Density rigdes on other variable distributions
gapminder %>%
filter(year == 2015) %>%
ggplot(aes(life_expectancy,continent, fill = continent)) +
ggridges::geom_density_ridges(show.legend = F) +
ggtitle("Life Expectancy by Continents in 2015") +
xlab("Life Expectancy") +
ylab("Continents")
## Picking joint bandwidth of 2.23
How has the life expectancy in countries by continents changed between the years 1962 and 2012?
gapminder %>%
filter(year %in% c(1962,2012)) %>%
mutate(year = factor(year, levels = c(1962,2012))) %>%
ggplot(aes(continent,life_expectancy, fill = year)) +
geom_boxplot() +
xlab("Continent") + ylab("Life Expectancy") +
labs(title = "Life expectancy between continents in 1962 and 2012")
library(essurvey)
# Set email. Needed for authentication
set_email("dino.c@me.com")
# # Show which countries are available
show_countries()
## [1] "Albania" "Austria" "Belgium"
## [4] "Bulgaria" "Croatia" "Cyprus"
## [7] "Czechia" "Denmark" "Estonia"
## [10] "Finland" "France" "Germany"
## [13] "Greece" "Hungary" "Iceland"
## [16] "Ireland" "Israel" "Italy"
## [19] "Kosovo" "Latvia" "Lithuania"
## [22] "Luxembourg" "Montenegro" "Netherlands"
## [25] "Norway" "Poland" "Portugal"
## [28] "Romania" "Russian Federation" "Serbia"
## [31] "Slovakia" "Slovenia" "Spain"
## [34] "Sweden" "Switzerland" "Turkey"
## [37] "Ukraine" "United Kingdom"
# Download data
ess_9 <- import_rounds(9)
## Downloading ESS9
## | | | 0% | | | 1% | |= | 1% | |= | 2% | |== | 2% | |== | 3% | |== | 4% | |=== | 4% | |=== | 5% | |==== | 5% | |==== | 6% | |===== | 7% | |===== | 8% | |====== | 8% | |====== | 9% | |======= | 9% | |======= | 10% | |======= | 11% | |======== | 11% | |======== | 12% | |========= | 12% | |========= | 13% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============ | 17% | |============ | 18% | |============= | 18% | |============= | 19% | |============== | 19% | |============== | 20% | |============== | 21% | |=============== | 21% | |=============== | 22% | |================ | 22% | |================ | 23% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |=================== | 26% | |=================== | 27% | |=================== | 28% | |==================== | 28% | |==================== | 29% | |===================== | 29% | |===================== | 30% | |===================== | 31% | |====================== | 31% | |====================== | 32% | |======================= | 32% | |======================= | 33% | |======================= | 34% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 37% | |========================== | 38% | |=========================== | 38% | |=========================== | 39% | |============================ | 39% | |============================ | 40% | |============================ | 41% | |============================= | 41% | |============================= | 42% | |============================== | 42% | |============================== | 43% | |=============================== | 44% | |=============================== | 45% | |================================ | 45% | |================================ | 46% | |================================= | 47% | |================================= | 48% | |================================== | 48% | |================================== | 49% | |=================================== | 49% | |=================================== | 50% | |=================================== | 51% | |==================================== | 51% | |==================================== | 52% | |===================================== | 52% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 55% | |======================================= | 56% | |======================================== | 56% | |======================================== | 57% | |======================================== | 58% | |========================================= | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |========================================== | 61% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 62% | |============================================ | 63% | |============================================ | 64% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 67% | |=============================================== | 68% | |================================================ | 68% | |================================================ | 69% | |================================================= | 69% | |================================================= | 70% | |================================================= | 71% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |====================================================== | 78% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 79% | |======================================================== | 80% | |======================================================== | 81% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 82% | |========================================================== | 83% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 86% | |============================================================= | 87% | |============================================================= | 88% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 89% | |=============================================================== | 90% | |=============================================================== | 91% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 92% | |================================================================= | 93% | |================================================================= | 94% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |==================================================================== | 98% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 99% | |======================================================================| 100%
## Warning: Round 9 was read with the `foreign` package rather than with the `haven` package for compatibility reasons.
## Please report any issues at https://github.com/ropensci/essurvey/issues
Skim to get an overview of the variables.
# skimr::skim(ess_9)
Glimpse to get a peak at the first values.
head(ess_9)
## # A tibble: 6 × 572
## name essround edition proddate idno cntry nwspol netusoft netustm ppltrst
## <chr> <int> <chr> <chr> <int> <chr> <int> <fct> <int> <fct>
## 1 ESS9e… 9 3.1 17.02.20… 27 AT 60 Every day 180 2
## 2 ESS9e… 9 3.1 17.02.20… 137 AT 10 Every day 20 7
## 3 ESS9e… 9 3.1 17.02.20… 194 AT 60 Most days 180 5
## 4 ESS9e… 9 3.1 17.02.20… 208 AT 45 Every day 120 3
## 5 ESS9e… 9 3.1 17.02.20… 220 AT 30 Never NA 5
## 6 ESS9e… 9 3.1 17.02.20… 254 AT 45 Only occ… NA 8
## # … with 562 more variables: pplfair <fct>, pplhlp <fct>, polintr <fct>,
## # psppsgva <fct>, actrolga <fct>, psppipla <fct>, cptppola <fct>,
## # trstprl <fct>, trstlgl <fct>, trstplc <fct>, trstplt <fct>, trstprt <fct>,
## # trstep <fct>, trstun <fct>, vote <fct>, prtvtcat <fct>, prtvtdbe <fct>,
## # prtvtdbg <fct>, prtvtgch <fct>, prtvtbcy <fct>, prtvtecz <fct>,
## # prtvede1 <fct>, prtvede2 <fct>, prtvtddk <fct>, prtvtgee <fct>,
## # prtvtees <fct>, prtvtdfi <fct>, prtvtdfr <fct>, prtvtcgb <fct>, …
Interesting variables
-
grspfr: Would you say your gross pay is unfairly low, fair, or unfairly high
-
imprich: Important to be rich, have money and expensive things
-
ipeqopt: Important that people are treated equally and have equal opportunities
-
ipstrgv: Important that government is strong and ensures safety
How many subjects were questioned per country?
ess_9 %>%
group_by(cntry) %>%
count(sort = T) %>%
ggplot() +
aes(n, reorder(cntry, n), label = n) +
geom_bar(stat = "identity", fill = "steelblue") +
ggtitle("Subjects per Country")
What are the opinions on differences in wealth?
ess_9 %>%
count(wltdffr) %>%
ggplot() +
aes(n, wltdffr) +
geom_bar(stat = "identity") +
ggtitle("Differences in wealth in country, how fair") +
ylab("Answers")
Here are the results for Germany.
ess_9 %>%
filter(cntry == "DE") %>%
count(wltdffr) %>%
ggplot() +
aes(n, wltdffr) +
geom_bar(stat = "identity") +
ggtitle("Differences in wealth in Germany, how fair") +
ylab("Answers")
What are the options and how many are there?
ess_9 %>%
distinct(wltdffr)
## # A tibble: 10 × 1
## wltdffr
## <fct>
## 1 Large, extremely unfair
## 2 Large, somewhat unfair
## 3 Large, slightly unfair
## 4 Large, very unfair
## 5 Small, very unfair
## 6 <NA>
## 7 Fair
## 8 Small, slightly unfair
## 9 Small, somewhat unfair
## 10 Small, extremely unfair
ess_9 %>%
count(wltdffr)
## # A tibble: 10 × 2
## wltdffr n
## <fct> <int>
## 1 Small, extremely unfair 2251
## 2 Small, very unfair 3118
## 3 Small, somewhat unfair 3558
## 4 Small, slightly unfair 2536
## 5 Fair 6386
## 6 Large, slightly unfair 3607
## 7 Large, somewhat unfair 8111
## 8 Large, very unfair 9691
## 9 Large, extremely unfair 6890
## 10 <NA> 3371
Recode values to only have four left at the end, too large unfair or fair and too small unfair and NAs.
ess_9 = ess_9 %>%
mutate(new_wltdffr = fct_collapse(wltdffr,
Too_Large = c("Large, extremely unfair", "Large, very unfair", "Large, somewhat unfair","Large, slightly unfair"),
Fair = "Fair",
Too_Small = c("Small, extremely unfair", "Small, very unfair", "Small, somewhat unfair","Small, slightly unfair"))) %>%
mutate(new_wltdffr = factor(new_wltdffr, levels = c("Too_Large", "Fair", "Too_Small")))
Create a count and percentage table.
d = ess_9 %>%
group_by(new_wltdffr) %>%
summarise(count = n()) %>%
mutate(perc = count/sum(count))
Plot a bar graph.
d %>%
ggplot(aes(new_wltdffr, perc, label = round(perc, 2), fill = new_wltdffr)) +
geom_bar(stat = "identity") +
geom_label(aes(fill = NA), fill = "white") +
theme(legend.position = "none") +
xlab("Differences in Wealth, Opinon") + ylab("Percentage") +
ggtitle("Opinion on Differences in Wealth 2018")
library(WDI)
Search for topics in datasets.
head(WDIsearch("co2"), 10)
## indicator
## [1,] "CC.CO2.EMSE.BF"
## [2,] "CC.CO2.EMSE.BL"
## [3,] "CC.CO2.EMSE.EL"
## [4,] "CC.CO2.EMSE.EN"
## [5,] "CC.CO2.EMSE.FE"
## [6,] "CC.CO2.EMSE.IL"
## [7,] "CC.CO2.EMSE.IP"
## [8,] "CC.CO2.EMSE.LU"
## [9,] "CC.CO2.EMSE.MC"
## [10,] "CC.CO2.EMSE.TR"
## name
## [1,] "CO2 emissions by sector (Mt CO2 eq) - Bunker Fuels"
## [2,] "CO2 emissions by sector (Mt CO2 eq) - Building"
## [3,] "CO2 emissions by sector (Mt CO2 eq) - Total excluding LUCF"
## [4,] "CO2 emissions by sector (Mt CO2 eq) - Energy"
## [5,] "CO2 emissions by sector (Mt CO2 eq) - Fugitive Emissions"
## [6,] "CO2 emissions by sector (Mt CO2 eq) - Total including LUCF"
## [7,] "CO2 emissions by sector (Mt CO2 eq) - Industrial Processes"
## [8,] "CO2 emissions by sector (Mt CO2 eq) - Land-Use Change and Forestry"
## [9,] "CO2 emissions by sector (Mt CO2 eq) - Manufacturing/Construction"
## [10,] "CO2 emissions by sector (Mt CO2 eq) - Transportation"
WDIsearch('gdp.*capita.*constant')
## indicator
## [1,] "6.0.GDPpc_constant"
## [2,] "NY.GDP.PCAP.KD"
## [3,] "NY.GDP.PCAP.KN"
## [4,] "NY.GDP.PCAP.PP.KD"
## [5,] "NY.GDP.PCAP.PP.KD.87"
## name
## [1,] "GDP per capita, PPP (constant 2011 international $) "
## [2,] "GDP per capita (constant 2015 US$)"
## [3,] "GDP per capita (constant LCU)"
## [4,] "GDP per capita, PPP (constant 2017 international $)"
## [5,] "GDP per capita, PPP (constant 1987 international $)"
# Download it
dat = WDI(
country = "all",
indicator = c(
population = "SP.POP.TOTL",
gdp = "NY.GDP.MKTP.CD",
inc_sha_10 = "SI.DST.10TH.10"),
start = 1960, end = 2018)
What was the population development like in these specific countries?
dat %>%
filter(country %in% c("Germany","France","United States", "Bosnia and Herzegovina","United Kingdom","China")) %>%
ggplot(aes(year,population, color = country)) +
geom_line() +
scale_y_log10(labels = scales::comma) +
ylab("Population (log10)") + xlab("Year") +
labs(title = "Random Countries Population", color = "Country")
What was the population development like in the former Yugoslav Republics?
dat %>%
filter(country %in% c("Bosnia and Herzegovina","Croatia","Serbia","Montenegro","Slovenia","North Macedonia","Kosovo")) %>%
ggplot(aes(year,population, color = reorder(country, desc(population)))) +
geom_line() +
scale_x_continuous(breaks = seq(1960,2020,5)) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Population of Former Yugoslav Republics", subtitle = "Population trends and shifts since the 1990s", col = "Countries") +
xlab("Year") + ylab("Population")
What was the development of GDP in the former Yugoslav Republics?
dat %>%
filter(country %in% c("Bosnia and Herzegovina","Croatia","Serbia","Montenegro","Slovenia","North Macedonia","Kosovo")) %>%
filter(between(year, 1995, 2020)) %>%
ggplot(aes(year,gdp, color = reorder(country, desc(gdp)))) +
geom_line() +
scale_x_continuous(breaks = seq(1960,2020,2)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "GDP of Former Yugoslav Republics", subtitle = "Economic trends and shifts since the 1990s" , col = "Countries") +
xlab("Year") + ylab("GDP current US$")
## Warning: Removed 18 row(s) containing missing values (geom_path).
The skim()
shows how many missing and unique values each variable has.
It uses appropriate measures to describe each variable based on its
type: character, numeric or list.
skimr::skim(starwars)
Name | starwars |
Number of rows | 87 |
Number of columns | 14 |
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ | |
Column type frequency: | |
character | 8 |
list | 3 |
numeric | 3 |
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
hair_color | 5 | 0.94 | 4 | 13 | 0 | 12 | 0 |
skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
films | 0 | 1 | 24 | 1 | 7 |
vehicles | 0 | 1 | 11 | 0 | 2 |
starships | 0 | 1 | 17 | 0 | 5 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 6 | 0.93 | 174.36 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▁▁▇▅▁ |
mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
The glimpse function, on the other hand, gives us a good peak at the first raw values each variable has.
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
First way with forcats::fct_count()
Calculates a count and prop
table.
starwars$sex %>%
factor() %>%
fct_count(sort = T, prop = T)
## # A tibble: 5 × 3
## f n p
## <fct> <int> <dbl>
## 1 male 60 0.690
## 2 female 16 0.184
## 3 none 6 0.0690
## 4 <NA> 4 0.0460
## 5 hermaphroditic 1 0.0115
Second way with deplyr::count()
Simply mutate a frequency and
percentage column on a counted table.
starwars %>%
count(sex) %>%
mutate(freq = n / sum(n)) %>%
mutate(perc = freq * 100)
## # A tibble: 5 × 4
## sex n freq perc
## <chr> <int> <dbl> <dbl>
## 1 female 16 0.184 18.4
## 2 hermaphroditic 1 0.0115 1.15
## 3 male 60 0.690 69.0
## 4 none 6 0.0690 6.90
## 5 <NA> 4 0.0460 4.60
Here is a situation where we calculated a count table for hair color - we summarized all values. If we then want to plot a bar graph based on that count table we run into problems, because ggplot2 is expecting a non-summarized or normal data frame.
hair_color_table = starwars %>%
mutate(hair_color = fct_lump_min(hair_color, 2)) %>%
group_by(hair_color) %>%
summarise(n = n())
hair_color_table
## # A tibble: 7 × 2
## hair_color n
## <fct> <int>
## 1 black 13
## 2 blond 3
## 3 brown 18
## 4 none 37
## 5 white 4
## 6 Other 7
## 7 <NA> 5
To tell the function that we have already summarized data, we add the
argument stat = "identity"
to the geom_bar()
function.
hair_color_table %>%
ggplot(aes(x = reorder(hair_color, n), y = n, fill = hair_color)) +
geom_bar(stat = "identity") +
theme(legend.position = "none")
First we create a table with counts and percentages:
d = starwars %>%
group_by(gender) %>%
summarise(count = n()) %>%
mutate(percentage = count/sum(count))
d
## # A tibble: 3 × 3
## gender count percentage
## <chr> <int> <dbl>
## 1 feminine 17 0.195
## 2 masculine 66 0.759
## 3 <NA> 4 0.0460
Then we plot a graph with bar and with percentage labels.
d %>%
ggplot(aes(gender, percentage, label = round(percentage, 2), fill = gender)) +
geom_bar(stat = "identity") +
geom_label(aes(fill = NA), fill = "white") +
theme(legend.position = "none")
This syntax mutates the categorical variable homeworld into eight of its most frequent values. The other values are being collapsed into the categorical value „other”.
starwars %>%
mutate(homeworld = fct_lump_n(homeworld, n = 8)) %>%
group_by(homeworld) %>%
summarise(mean(height, na.rm =T), mean(mass, na.rm = T), n())
## # A tibble: 11 × 4
## homeworld `mean(height, na.rm = T)` `mean(mass, na.rm = T)` `n()`
## <fct> <dbl> <dbl> <int>
## 1 Alderaan 176. 64 3
## 2 Corellia 175 78.5 2
## 3 Coruscant 174. 50 3
## 4 Kamino 208. 83.1 3
## 5 Kashyyyk 231 124 2
## 6 Mirial 168 53.1 2
## 7 Naboo 175. 64.2 11
## 8 Ryloth 179 55 2
## 9 Tatooine 170. 85.4 10
## 10 Other 173. 117. 39
## 11 <NA> 139. 82 10
We can easily filter out cases with certain column values, like for
example the states of Hawai and Alaska. We use filter()
, the operator
!
and %in%
.
starwars %>%
filter(!homeworld%in%c("Tatooine","Naboo")) %>%
select(name, homeworld)
## # A tibble: 66 × 2
## name homeworld
## <chr> <chr>
## 1 Leia Organa Alderaan
## 2 Obi-Wan Kenobi Stewjon
## 3 Wilhuff Tarkin Eriadu
## 4 Chewbacca Kashyyyk
## 5 Han Solo Corellia
## 6 Greedo Rodia
## 7 Jabba Desilijic Tiure Nal Hutta
## 8 Wedge Antilles Corellia
## 9 Jek Tono Porkins Bestine IV
## 10 Yoda <NA>
## # … with 56 more rows
You can manually pick the colors with fill
and a vector containing the
color values. Either in String, written out.
starwars %>%
mutate(sex = fct_infreq(sex)) %>%
ggplot(aes(sex)) +
geom_bar(fill = c("red","blue","green","black","grey"))
Or with RGB Color Codes.
starwars %>%
mutate(sex = fct_infreq(sex)) %>%
ggplot(aes(sex)) +
geom_bar(fill = c("#003f5c","#58508d","#bc5090","#ff6361","#ffa600"))
Here is an example where we want the bar colored based on the variable itself, but without the mapping legend.
starwars %>%
mutate(sex = fct_infreq(sex)) %>%
ggplot(aes(sex, fill = sex)) +
geom_bar()
Hide the geom_bar legend.
starwars %>%
mutate(sex = fct_infreq(sex)) %>%
ggplot(aes(sex, fill = sex)) +
geom_bar(show.legend = F)
Remove just the legend title:
starwars %>%
mutate(sex = fct_infreq(sex)) %>%
ggplot(aes(sex, fill = sex)) +
geom_bar() +
theme(legend.title = element_blank())
Hide all legends created:
starwars %>%
mutate(sex = fct_infreq(sex)) %>%
ggplot(aes(sex, fill = sex)) +
geom_bar() +
theme(legend.position = "none")
First way We can use fct_collapse()
to create a new column with the
new recoded values in it.
Second way By using mutate
, to create a new column with our own
values and case_when
, to run through our observations looking for
defined cases, together with “variable” %in%
, we can create our own
groups.
gapminder %>%
mutate(group = case_when(
region %in% c("Western Europe", "Northern Europe","Southern Europe","Northern America", "Australia and New Zealand") ~ "West", # If region is one of values -> assign it "West" in new group column.
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" &
region != "Northern Africa" ~ "Sub-Saharan",
TRUE ~ "Others")) %>% # If nothing above applies -> assign it "Others" in group column
head(10)
## country year infant_mortality life_expectancy fertility
## 1 Albania 1960 115.40 62.87 6.19
## 2 Algeria 1960 148.20 47.50 7.65
## 3 Angola 1960 208.00 35.98 7.32
## 4 Antigua and Barbuda 1960 NA 62.97 4.43
## 5 Argentina 1960 59.87 65.39 3.11
## 6 Armenia 1960 NA 66.86 4.55
## 7 Aruba 1960 NA 65.66 4.82
## 8 Australia 1960 20.30 70.87 3.45
## 9 Austria 1960 37.30 68.75 2.70
## 10 Azerbaijan 1960 NA 61.33 5.57
## population gdp continent region group
## 1 1636054 NA Europe Southern Europe West
## 2 11124892 13828152297 Africa Northern Africa Others
## 3 5270844 NA Africa Middle Africa Sub-Saharan
## 4 54681 NA Americas Caribbean Latin America
## 5 20619075 108322326649 Americas South America Latin America
## 6 1867396 NA Asia Western Asia Others
## 7 54208 NA Americas Caribbean Latin America
## 8 10292328 96677859364 Oceania Australia and New Zealand West
## 9 7065525 52392699681 Europe Western Europe West
## 10 3897889 NA Asia Western Asia Others
We turn this group
variable into a factor to control the order of the
levels:
Order color legend by a variable’s values.
Display all unique values of variable.
distinct(starwars, species) # dplyr function
## # A tibble: 38 × 1
## species
## <chr>
## 1 Human
## 2 Droid
## 3 Wookiee
## 4 Rodian
## 5 Hutt
## 6 Yoda's species
## 7 Trandoshan
## 8 Mon Calamari
## 9 Ewok
## 10 Sullustan
## # … with 28 more rows
Note: distinct(dat$countries)
doesn’t work.
Note: parameter n
must be explicitly written, otherwise it throws
an error.
starwars %>%
slice_max(height, n = 5)
## # A tibble: 5 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael … 264 NA none white yellow NA male mascul…
## 2 Tarfful 234 136 brown brown blue NA male mascul…
## 3 Lama Su 229 88 none grey black NA male mascul…
## 4 Chewbac… 228 112 brown unknown blue 200 male mascul…
## 5 Roos Ta… 224 82 none grey orange NA male mascul…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Show me 5% of the lowest height rows.
starwars %>%
slice_min(height, prop = 0.05)
## # A tibble: 4 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratts Ty… 79 15 none grey, blue unknown NA male mascu…
## 3 Wicket S… 88 20 brown brown brown 8 male mascu…
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
For a quick check of how many missing values there are in a single column:
sum(is.na(starwars$height))
## [1] 6
And how many are not NA
s.
sum(!is.na(starwars$height))
## [1] 81
For a more detailed overview of the whole dataset use skim()
. It shows
a very useful complete_rate
which tells us how much of the column is
disturbed by missing values.
skimr::skim(starwars)
Name | starwars |
Number of rows | 87 |
Number of columns | 14 |
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ | |
Column type frequency: | |
character | 8 |
list | 3 |
numeric | 3 |
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
hair_color | 5 | 0.94 | 4 | 13 | 0 | 12 | 0 |
skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
films | 0 | 1 | 24 | 1 | 7 |
vehicles | 0 | 1 | 11 | 0 | 2 |
starships | 0 | 1 | 17 | 0 | 5 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 6 | 0.93 | 174.36 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▁▁▇▅▁ |
mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
Drop rows that have NA
values in a specific column, here in height
.
starwars %>%
drop_na(height)
## # A tibble: 81 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth … 202 136 none white yellow 41.9 male mascu…
## 5 Leia O… 150 49 brown light brown 19 fema… femin…
## 6 Owen L… 178 120 brown, grey light blue 52 male mascu…
## 7 Beru W… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs … 183 84 black light brown 24 male mascu…
## 10 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
## # … with 71 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Drop all rows that contain NA
in any column.
starwars %>%
drop_na()
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 Obi-Wan… 182 77 auburn, wh… fair blue-gray 57 male mascu…
## 3 Anakin … 188 84 blond fair blue 41.9 male mascu…
## 4 Chewbac… 228 112 brown unknown blue 200 male mascu…
## 5 Wedge A… 170 77 brown fair hazel 21 male mascu…
## 6 Darth M… 175 80 none red yellow 54 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Filter out any NA containing rows.
starwars %>%
na.exclude()
## # A tibble: 29 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S… 172 77 blond fair blue 19 male mascu…
## 2 Darth … 202 136 none white yellow 41.9 male mascu…
## 3 Leia O… 150 49 brown light brown 19 fema… femin…
## 4 Owen L… 178 120 brown, grey light blue 52 male mascu…
## 5 Beru W… 165 75 brown light blue 47 fema… femin…
## 6 Biggs … 183 84 black light brown 24 male mascu…
## 7 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
## 8 Anakin… 188 84 blond fair blue 41.9 male mascu…
## 9 Chewba… 228 112 brown unknown blue 200 male mascu…
## 10 Han So… 180 80 brown fair brown 29 male mascu…
## # … with 19 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Replace 0 with value you want as a replacement.
data(na_example)
sum(is.na(na_example))
## [1] 145
no_nas <- ifelse(is.na(na_example), 0, na_example) # "if is NA is true, change value to 0, else keep the value (i.e. na_example)"
sum(is.na(no_nas))
## [1] 0
The FVT is about what happens when you try to return factorized vectors into numeric values. Let’s look at this with this code.
z <-factor(c("12", "13", "14", "15", "12")) # We create an object by directly factorizing a vector.
z
## [1] 12 13 14 15 12
## Levels: 12 13 14 15
y <- as.numeric(z) # Now we want to convert them into numeric values.
y # What?
## [1] 1 2 3 4 1
This happened, because we picked up the on the factorization result.
factor()
assigns every element, based on its value, an integer number.
typeof(z) # 1=12, 13=2, 14=3, 15=4, 12=1
## [1] "integer"
To fix this problem, first convert the object back to character and then to numeric.
y <- as.numeric(as.character(z))
y
## [1] 12 13 14 15 12
® Let’s create a sampling model and calculate a random statistic for the proportion p.
n = 100
p = 0.4
x = sample(c(1,0), size = n, replace = T, prob = c(p, (1-p)))
x_bar= mean(x)
x_bar
## [1] 0.37
Now let’s create a Monte Carlo simulation in which we calculate the sample statistic p 10,000 times.
B = 10000
x_bar_distribution = replicate(B, {
x = sample(c(1,0), size = n, replace = T, prob = c(p, (1-p)))
mean(x)
})
head(x_bar_distribution)
## [1] 0.44 0.38 0.44 0.44 0.36 0.40
The Central Limit Theorem tells us that when the number of draws, also
called the sample size, is large, the probability distribution of the
sum (or mean/proportion) of the independent draws is approximately
normal. Let’s test that by plotting a histogram and checking with
qqnorm()
.
hist(x_bar_distribution) # Looks good
qqnorm(x_bar_distribution);qqline(x_bar_distribution) # Looks good enoguh.
What’s the practical use of this proof? Because of it we can now use the normal distribution function to calculate probabilities. What is the probability that we get a proportion that is smaller than ? We first need the calculate the standard error of our sample.
SE = sqrt(p * (1-p) / n)
3 * SE
## [1] 0.1469694
The sampling distribution of the sample proportion is approximately normal and normal distributions encompass almost all possible values within 3 standard deviations from the mean. We can therefore say that 99.7 % of all samples proportion statistics for n = 100 and p = 0.4 fall within +- 0.147 of 0.4. Here is the proof. We can compare the results from Monte Carlo simulation sampling distribution with standard normal distribution function. I.e. what’s the proportions/probability of falling outside of 3 standard deviations from the mean?
mean(x_bar_distribution < p - 3 * SE) + mean(x_bar_distribution > p + 3 * SE)
## [1] 0.0025
pnorm(-3) + 1 - pnorm(3)
## [1] 0.002699796
This does in fact check out! The proportions from the Monte Carlo simulation and the probabilities from the normal distribution function do approximate each other very well. The applications based on these insights alone are somewhat limited.
In real life we rarely have the probabilities for the population parameters. The power does show itself however when making inferences from one single sample, its proportion statistic and its standard deviation, about the population parameter. This is because we now know (experimentally proven) that the sampling distribution of the sample proportion is normally distributed.
Let’s say we want to estimate the population proportion (the mean of a categorical variable) e.g., the proportion of votes a Democratic candidate gets in an election. In our example, the true population proportion p = 0.45. In real life we actually don’t know this parameter.
In R we can create a Monte Carlo Simulation (creating confidence intervals 10,000 times) in order to prove the validity of confidence intervals in general.
p = 0.45
n = 1000
B = 10000
set.seed(1)
correct_list = replicate(B, {
x = sample(c(1,0), size = n, replace = T, prob = c(p, (1-p)))
x_hat = mean(x)
se_hat = sqrt(x_hat * (1 - x_hat) / n)
between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
})
And how often did we get a confidence interval that did in fact include the real population parameter p? The confidence interval did include the true parameter 0.9482 of the time. Pretty close to the theoretical goal of 0.95!
mean(correct_list)
## [1] 0.9482
Let’s visualize the validity of CIs with another 100 samples and ggplot2.
# Used to create the graph
lower_bounds = c()
upper_bounds = c()
x_hats = c()
true = c()
# Small Monte Carlo simulation
for (i in 1:100) {
x = sample(c(1,0), size = n, replace = T, prob = c(p, (1-p)))
x_hat = mean(x)
x_hats = c(x_hats, x_hat)
se_hat = sqrt(x_hat * (1 - x_hat) / n)
lower_bounds = c(lower_bounds, (x_hat - 1.96 * se_hat))
upper_bounds = c(upper_bounds, (x_hat + 1.96 * se_hat))
true = c(true, between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat))
}
# ggplot2 only accepts tables
table = tibble(x_hats, lower_bounds, upper_bounds, true)
# Graph highlighting the validity of a 95% CI
ggplot(table, aes(seq_along(x_hats), x_hats)) +
geom_pointrange(aes(ymin = lower_bounds, ymax = upper_bounds, color = true)) +
geom_hline(yintercept = 0.45)
The proof from our 10,000 confidence intervals above that did in fact
capture the true population proportion p (0.45) in approximately 95%
of the time (0.943) tells us that we can create confidence intervals
from a single sample “with confidence”. In other words, we can in fact
be sure that a 95% confidence interval constructed from our sample
Here is a single sample from which we will create a 95% confidence interval to showcase a more real-life example.
# In real life this model is shrouded in unknowns
x = sample(c(1,0), size = 1000, replace = T, prob = c(p, (1-p)))
# Sample mean
x_hat = mean(x)
# Standard Error
se_hat = sqrt(x_hat * (1 - x_hat) / n)
# Creating the bounds of the CI
lower_bounds = x_hat - 1.96 * se_hat
upper_bounds =x_hat + 1.96 * se_hat
# The confidence interval is:
lower_bounds
## [1] 0.4181713
upper_bounds
## [1] 0.4798287
The confidence interval did in fact capture the true population parameter, which is actually unknown. So the actual result would be expressed as follows:
We are 95% confident that the true population proportion is between 0.404 and 0.465.
Lets say we a population of monthly incomes that is totally random and non-normally distributed.
set.seed(1)
# Create 10000 random "incomes"
incomes = runif(10000, min = 800, max = 6000)
# Peek at it
head(incomes, 10)
## [1] 2180.645 2735.044 3778.837 5522.681 1848.746 5471.626 5712.311 4236.149
## [9] 4071.393 1121.289
# Create true population mean (unknown in real life)
mu = mean(incomes)
mu
## [1] 3400.873
# Plot reveals its very much unnormal
plot(density(incomes))
Now comes the confidence interval. The parent population is very much non-normal, but because of the CTL and random sampling we can assume an approximately normal distribution of the sampling distribution of the sample means. Further, we do not replace picks when sampling (just like in real life), but can still assume independence because the sample size of 100 is only 1% from the total population of 10,000.
# Sample
x = sample(incomes, 100)
# Sample mean
x_hat = mean(x)
# Sample standard error
se_hat = sd(x) / sqrt(100)
# Create the margin of error
lower_bound = x_hat - 1.96 * se_hat
upper_bound = x_hat + 1.96 * se_hat
ci = c(lower_bound, upper_bound)
# Here is our confidence interval
ci
## [1] 3260.147 3864.235
# Here is the real population mean (unknown in real life)
mu
## [1] 3400.873
Result: The 95% confidence interval for the true population mean
t.test(x, conf.level = 0.95)$conf.int
## [1] 3256.415 3867.967
## attr(,"conf.level")
## [1] 0.95