Analyzing Health Expenditure Data By Country From The OECD With R

In this example, we will use the OECD package to access health expenditure data by country.

Packages USED

The key package used in this vignette is the oecd package which accesses stats database from the OECD.[1] Key packages include

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • OECD – package to access data using OECD stats api.[2]

Examing the Data Structure

The initial stage is to search for available health related databases.

#define the countries we are going to chart
#get data set list
dataset_list <- get_datasets()
p1<-search_dataset("health", data = dataset_list)
#we will use SHA which is the 6th in the result p1
#now we will get the structure of the dataset
#now examine the structure - the first sections are key
               # id       description
# 1               HF   Financing scheme
# 2               HC           Function
# 3               HP           Provider
# 4         MEASURE           Measure
# 5         LOCATION           Country

The code above shows how to inspect the structure of a given data set. Each dataset in OECD.stat ( has a different structure. Target databases can be evaluated using the web site. The order of the variables in the VAR_DESC table reflects their hierarchy in the dataset. HF specifically addresses total funding, funding by government or insurance scheme etc.

id label
1 HFTOT All financing schemes
2 HF1 Government/compulsory schemes
3 HF11 Government schemes
4 HF12HF13 Compulsory contributory health insurance schemes
5 HF121 Social health insurance schemes
6 HF122 Compulsory private insurance schemes
7 HF2HF3 Voluntary schemes/household out-of-pocket payments
8 HF2 Voluntary health care payment schemes
9 HF21 Voluntary health insurance schemes
10 HF22 NPISH financing schemes
11 HF23 Enterprise financing schemes
12 HF3 Household out-of-pocket payments
13 HF31 Out-of-pocket excluding cost-sharing
14 HF32 Cost-sharing with third-party payers
15 HF4 Rest of the world financing schemes (non-resident)
16 HF42 Voluntary schemes (non-resident)
17 HF0 Financing schemes unknown

Retrieving the data

The top level of financing will be used with the exception of financing from non-residents for this analysis. For this analysis, we will focus on aggregate health expenditure (HCTOT) and services provided by all providers (HPTOT). The code has been developed to cycle through a list of measures. In this case, we will use PARPIB, the GDP share, and PPPPER, expenditure per capita on a purchasing power parity basis.

#save base dataset as an excel file for documentation

Filters have to be set in the order of variables to reduce the data retrieval to manageable proportions. The retrieved dataset is saved as an XLSX file in the default directory to provide a snapshot of data. This is often a requirement when delivering research because data gets revised. The advantage of the XLSX data structure is it is now an open data standard. The openxlsx package creates the file without using any Microsoft or Java libraries. If one wanted only to postprocess the data in R, it would be better to save the data1 tibble using saveRDS, a native R command which saves in a compact binary form. The command readRDS could be used to read the file in,

Process and Plot the Data by Measure

The next set of code sets up filters to be used by dplyr to further process the retrieved data. A location list is prepared using standard 3 character country codes which are common across many international data providers. The selected locations are defined in the character vector target_locations at the top of the script.

#now we will parse by locations and the measure - share of gdp
location_titles<-filter(dstruct$LOCATION,id %in% target_locations) %>%
#define the measure to be used
# target_measure<-"PARPIB"
for(target_measure in measures_list){
      target_title<-filter(dstruct$MEASURE,id==target_measure) %>%
      select(label)%>% as.character()
      #define financing labels
      financing_labels<-filter(dstruct$HF,id %in% target_financing) %>%
      #organize data
      #the left_joins add labels for location and financing (label.x label.y)
      data2<-data1 %>%
      filter(LOCATION %in% target_locations,MEASURE==target_measure) %>%
      select(HF,LOCATION,obsValue,obsTime) %>%
      arrange(obsTime) %>%
      slice(n()) %>%ungroup()%>%
      left_join(location_table,by=c("LOCATION"="id")) %>%

The loops over the list of measures in this version of the code. The data2 table is created from the initial retrieval. Only the last observation (ragged) is retained by using the slice function. The name of the country and the financing title are joined with left_joins. At this juncture, the country name is not actually used.

Because there are varying terminal years, the year, in the variable obsTime, is appended to the value to label the datapoint. In the plot, the label.y variable, the label from the financing table, is used as a grouping table so that the plots are separated by financing. As much as possible, the plot uses title information obtained from the database.

      #start plot
      #text adjustment factor to move the point labels high enough
      geom_bar(stat="identity",fill="light green")+

The plot is saved as 4 facets (small sub plots) using facet_wrap function which references the lable.y financing title. A text adjustment factor is calculated to move the value and year labels high enough above the bars. A file name to save the plot is developed.

The example plots are shown below.

 Current expenditure on health all functions HCTOT PARPIB

Current expenditure on health all functions HCTOT PPPPER