This category of material includes notes introducing R and providing some vignettes on its use that may be of interest to economists. Additional material is available in PDF form in the downloads. These scripts are provided without support and with no warranty or guarantee as to their general or specific usability.

Accessing the Development Version of Mountainmath's cansim Package for R

This note provides a vignette showing how to install the development version of the new cansim package for accessing the NDM. The vignette uses R utilities to install the GitHub version of the cansim package. More information about the developers can be found on the Mountainmath website at https://www.mountainmath.ca; I commend their blog posts to you for interesting information.

Packages USED

The key package installed in this vignette is the cansim package developed by Mountain Math. This is the central package for retrieving data using the web services released with the New Dissemination Model (NDM) version of CANSIM. The data are retrieved and converted to time series stored as xts objects. Key packages include:

  • cansim – package developed by Mountain Math to access the NDM CANSIM
  • devtools – R utilities used to install packages from GitHub

Installing the Package

The devtools package is needed to install packages easily from GitHub development sites.

devtools::install_github("mountainmath/cansim")
library(cansim)

The library statement will check that the cansim package is correctly installed.
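
As a quick check after installation, the new package can be used to retrieve an NDM table directly. The code below is a minimal sketch and is not part of the original script: it assumes the get_cansim() function of the cansim package and uses the LFS table 14-10-0017 that appears in the next vignette. The date construction and xts conversion follow the pattern used later in these notes, and the column and category names should be confirmed on a first run.

#minimal usage sketch (assumed, not from the original script):
#retrieve LFS table 14-10-0017, filter one series and store it as an xts object
library(cansim)
library(dplyr)
library(xts)
lfs_data<-get_cansim("14-10-0017") %>%
filter(GEO=="Canada",`Labour force characteristics`=="Population",
Sex=="Both sexes",`Age group`=="15 years and over") %>%
mutate(Date=as.Date(paste0(REF_DATE,"-01")))
#index the retrieved values by the constructed Date to create the time series
lfs_xts<-xts(lfs_data$VALUE,order.by=lfs_data$Date)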

Processing NDM Tables

This note provides an example of retrieving and processing a complete NDM table in R. The script and log are included in a corresponding PDF in the downloads.

Packages USED

The key package used in this vignette is the dplyr package. Key packages include:

  • dplyr – for processing the csv.
  • ggplot2 – part of tidyverse for plotting the results

Retrieving the DATA

The first step is to load the packages, define the NDM product ID, retrieve the required zip file and unzip it to access the csv file. The code checks for the existence of the csv and whether it should be refreshed.

#This vignette retrieves matrices from the LFS
setwd("D:\\OneDrive\\lfs_stuff")
project_folder<-getwd()
data_folder<-paste0(project_folder,"/")
library(lubridate)
library(tidyverse)
library(dplyr)
target_product<-"14-10-0017"
#set file_refresh to TRUE to force a new download of the zip file
file_refresh<-TRUE
file_refresh<-FALSE
#first step is to test if the csv file is already available unzipped in the working folder
target_csv<-paste0(gsub("-","",target_product),".csv")
#see if it exists
if((!file.exists(target_csv))|file_refresh){
      #create url; the standard NDM full-table csv download prefix is assumed here
      url_prefix<-"https://www150.statcan.gc.ca/n1/tbl/csv/"
      url_file<-gsub(".csv","-eng.zip",target_csv)
      calc_url<-paste0(url_prefix,url_file)
      download_target<-paste0(project_folder,"/",url_file)
      download.file(calc_url,download_target,mode="wb",quiet=FALSE)
      unzip(download_target,overwrite=TRUE,exdir=substr(data_folder,1,nchar(data_folder)-1))
      }

The next stage reads the CSV with the read_csv routine that is part of the tidyverse readr component. The dplyr package is used to process the file, selecting only the Canada geography and selected variables. The column names of the csv file are best reviewed by opening the csv once in an editor such as Notepad++ or Notepad. Many CSVs cannot be fully loaded into Excel and there is no need to use that package. In the code below, the mutate verb is used to create a new variable, Date, that is a standard R date. To do this, it is necessary to convert the REF_DATE string to a regular date by adding a day (“01”) to the end of each value. Overly long variable names and those containing blanks are renamed for convenience. Such names must be enclosed in back ticks (`) in order to be recognized by dplyr. The initial read and the select, mutate and rename verbs are all connected using pipes (%>%).

#now we are going to create a base tibble with just the estimates for Canada
#column names with spaces must be enclosed in back tick marks
#the date has to be modified to include a day to get the as.Date function to recognize it
#read_csv will also recognize a zip file and unzip
canada_lfs_estimate<-read_csv(file=target_csv,progress=FALSE) %>%filter(GEO=="Canada") %>%
select(REF_DATE,GEO,`Labour force characteristics`,Sex,`Age group`,VALUE) %>%
mutate(Date=as.Date(paste0(REF_DATE,"-01"))) %>%
rename(LFS_characteristics=`Labour force characteristics`,Age_group=`Age group`)

The next stage is to calculate annual data as averages of the seasonally unadjusted data. The advantage of this approach is that the current year is calculated as an average to date. The data are grouped by the key variables required and by the Year variable, which is created by a prior mutate verb. The year function from the lubridate package is used to extract the year. An ungroup verb is used as the last stage of the piped set of commands to remove the grouping. Omitting it would confuse later dplyr operations, which would then unnecessarily carry the grouping variables.

#calculating annual averages this way gives us the year-to-date value for the current year
annual_data<-canada_lfs_estimate%>%
mutate(Year=year(Date))%>%
group_by(GEO,LFS_characteristics,Age_group,Sex,Year) %>%
summarize(Annual_average=mean(VALUE))%>%
ungroup()

The next set of code gets the unique age group strings and selects the lowest level ones. The appropriate selection must be determined by prior inspection. Extracting the strings from the data file avoids problems from future edits by Statistics Canada of its classifications.

#now get the age_groups
available_age_groups<-unique(annual_data$Age_group)
print(available_age_groups)
#by inspection, the zero-level groups in this table are 1,5,6,10,11,12,13,16,17,19,21
zero_indices<-c(1,5,6,10,11,12,13,16,17,19,21)
zero_level_age_groups<-available_age_groups[zero_indices]
available_genders<-unique(annual_data$Sex)
#This version loops through gender
numb_genders<-length(available_genders)

The next stage is to start a loop over genders of a process which selects three tibbles for source population, full- and part-time employment. The filter verb selects the data by age group, LFS characteristic and gender.

for (this_gender in 1:numb_genders){
      Gender<-available_genders[this_gender]
      source_population<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Population",Sex==Gender)
      full_time_employment<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Full-time employment",Sex==Gender)
      part_time_employment<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Part-time employment",Sex==Gender)

The next segment of code calculates the full- and part-time rates by using a left_join to link the population with the corresponding employment on the basis of year and age group.

#now calculate the full-time rate as part of a left join
      full_time_rate<-left_join(source_population,(select(full_time_employment,Year,Age_group,Annual_average)),
      by=c("Year","Age_group"),
      suffix=c(".pop",".fe")) %>%
      mutate(employment_rate=100*Annual_average.fe/Annual_average.pop)%>%
      select(Year,Age_group,employment_rate)
      #now part time
      part_time_rate<-left_join(source_population,(select(part_time_employment,Year,Age_group,Annual_average)),by=c("Year","Age_group"),
      suffix=c(".pop",".pe")) %>%
      mutate(employment_rate=100*Annual_average.pe/Annual_average.pop)%>%
      select(Year,Age_group,employment_rate)

The next set of code applies a negative sign to the full-time employment for the purposes of plotting. Essentially the plots below include two layers of bar charts with the full-time measure as negative to get the pyramid appearance. The rates, calculated above, have the same column names so that they can be row bound using the bind_rows command. A list of the tibbles is used to facilitate renaming them to something useful. The .id value of Type is used to hold the list element names to distinguish the data sets combined into the rate_set tibble.

The test years are defined in a vector in a filter command to select the data for plotting. Breaks for the value (y) axis are defined. However, since some are negative, labels have to be defined by taking the absolute values.

#for purposes of pyramid, full-time rate is set negative
      full_time_rate<-mutate(full_time_rate,employment_rate=-employment_rate)
      bind_list<-list(full_time_rate,part_time_rate)
      names(bind_list)<-c("Full time","Part time")
      rate_set<-bind_rows(bind_list,.id="Type") %>%
      mutate(Age_group=gsub(" years","",Age_group))
      #as a test plot select one year
      plot_data<-filter(rate_set,Year %in% c(2000,2007,2015,2018))
      break_list<-seq(-80,40,by=20)
      break_labels<-as.character(abs(break_list))

The final set of code defines the plot. The data are grouped by Year so that there is one plot of each selected year and the facet_wrap verb creates the required two column aggregate plot of the separate year plots. The coordinates are flipped so that the value (y) is at the bottom.

p1<-ggplot(plot_data,aes(x=Age_group,y=employment_rate,group=Year,fill=Type,
      label=abs(round(employment_rate,1))))+
      geom_bar(data=dplyr::filter(plot_data,Type=="Full time"),stat="identity")+
      geom_bar(data=dplyr::filter(plot_data,Type=="Part time"),stat="identity")+
      labs(title="LFS Employment Rates by Age",caption="Statistics Canada, JCI",
      subtitle=Gender,
      y="Percentage Share of Population",x="Age Group")+geom_text(size=3)+
      scale_y_continuous(breaks=break_list,labels=break_labels)+
      coord_flip()+
      facet_wrap(~Year,ncol=2)+theme(legend.position="bottom")
  ggsave(p1,file=paste0(Gender,"_employment_rate_pyramids_labels.png"))
}

The last curly bracket terminates the for loop. There will be three facet plots produced, one for each gender. The gender name is included as part of the output file name as well as the subtitle of the plot. The critical part is the data specification in each geom_bar layer. This separates the input dataset into the left and right halves required for the pyramid. The dplyr filter is used to define the data required. The breaks calculated above are included in the arguments to the scale_y_continuous function in the plot development. The legend for the employment types is set to be at the bottom rather than the default side position.

The plots are shown below.

[Figure: Both sexes employment rate pyramids with labels]

[Figure: Males employment rate pyramids with labels]

[Figure: Females employment rate pyramids with labels]

Accessing the WORLD Bank Data

This note focusses on directly accessing indicators in the database of the World Bank. Appendices in the accompanying PDF contain the script and a log file for a sample run.

Packages USED

The key package used in this vignette is the wbstats package, which accesses the World Bank RESTful API. Key packages include:

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • wbstats – the retrieval package to be used[1].

Detailed instructions for utilizing the wbstats package are available online.[2]

Retrieving the DATA

The process starts with loading the relevant packages. Unlike many other statistical databases, data in the World Bank (WB) system are organized as indicators by countries. The structure of the database is retained in a cache which can be processed. In this cache, data frames are available with the codes and documentation for countries, indicators and other constructs. In this first code segment, the cache is retrieved and the country list is searched for country names in order to return the country codes, which are the most reliable way to specify countries in retrieval calls. The database requires some searching at the start of projects. For example, as well as an aggregate GINI index prepared by the World Bank, there are others submitted by various countries.

#this routine uses wbstats to access the worldbank database
setwd("D:\\OneDrive\\wb_examples")
library(tidyverse)
library(wbstats)
library(ggplot2)
#examine the cache
current_cache<-wbcache()
str(current_cache,max.level=1)
g20_countries<-c("Argentina", "Australia", "Brazil", "Canada", "China", "France", "Germany",
"India", "Indonesia", "Italy", "Japan", "Korea, Rep.", "Mexico", "Russian Federation",
"Saudi Arabia", "South Africa", "Turkey", "United Kingdom","United States","European Union")
#use str_detect to look at partial strings to get their spelling
filter(current_cache$countries,str_detect(country,"Korea"))
filter(current_cache$countries,country %in% g20_countries) %>%
select(iso2c, iso3c,country,regionID,region) ->g20_country_list
#save the iso codes for subsequent work
write.csv(g20_country_list,file="g20_country_list.csv")

In this example, we are searching for the 3-digit codes (iso3c) for the G20 countries. The initial text list that was tried did not find Russia or Korea. The str_detect function was used interactively in a filter statement to identify the correct spellings for Korea and Russia. One example of this use is shown in the script. The g20_countries list was then edited to match WB standards. The second filter statement returns the rows in the countries data frame matching the G20 list. These are saved as a CSV for future work.

The next code segment looks for the appropriate indicator code, the indicatorID, to retrieve the data. Of course, if the indicatorID and iso3c list have been prepared externally by using the web site or in a previous run, the wbstats process can start directly with the wb function to retrieve the data.

#now look for GINI by experiment, this returns 1 row
gini_indicator<-filter(current_cache$indicators,str_detect(indicator,"GINI"))
print(data.frame(gini_indicator))
gini_data<-wb(country=g20_country_list$iso3c,indicator=gini_indicator$indicatorID)

There are many variables in the gini_data frame, including the date, country codes and other information. In the next code segment, the g20_country_list is simplified to a smaller list with only the code and the region for later merging with the plot data. The gini_data is then grouped by country, sorted by date and reduced to the last observation for each country. The data cannot be selected by a single date because of the ragged edge, as will be seen below. The dplyr slice verb is used to select the last observation (the nth one) for each country. This works because of the grouping and sorting.

#the next statement groups by country, sorts by year, and takes the last observation in each group
# and creates a country_year variable comprised of the country and last year
# and join to country tibble to get region
code_region<-select(g20_country_list,iso3c,region)
group_by(gini_data,country)%>%arrange(date)%>%slice(n()) %>%
mutate(country_year=paste0(iso3c,"-",date))%>%
left_join(code_region,by=c("iso3c"))->gini_data_last
write.csv(gini_data_last,file="g20_wb_gini_index.csv")

A mutate command is used to join the 3-character country code and the date of the last observation for labelling the x axis in the plot. The mutate command adds this column to the gini_data_last tibble. A left_join is used to match the code_region tibble to the retrieved data to add the region as a column.

The next code segment produces a bar chart with the country_year variable as the x axis labels plotted at a 45 degree angle. The bars are filled with different colours based on region and the scale legend is appropriately labelled.

#start the plot
gini_plot1<-ggplot(gini_data_last,aes(x=country_year,y=value,fill=region))+
geom_bar(stat="identity")+
labs(title=gini_indicator$indicator,subtitle="G20 Countries",y="GINI Value",
caption="Source: World Bank",x="Country by Indicator Date")+
scale_fill_discrete(name = "Region")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave(gini_plot1,file="world_bank_gini_g20.png")

The resulting plot is shown below.

[Figure: World Bank GINI index for G20 countries]


It is often useful to produce a tabular presentation of data as a graphic. This facilitates inclusion in MS Word documents or in web sites. The next coding segment uses gridExtra to produce a plot grob which looks like a table.[3]

#now start a table
library(gridExtra)
select(gini_data_last,iso3c,region,country,date,value) %>%
arrange(iso3c)->gini_table_data
colnames(gini_table_data)<-c("Code","Region","Country","Date","Gini")
# tableGrob is documented at https://cran.r-project.org/web/packages/gridExtra/vignettes/tableGrob.html
table_plot<-tableGrob(gini_table_data)
ggsave(table_plot,file="world_bank_gini_table_chart.png")

The resulting table plot is shown in a graphic below.

[Figure: World Bank GINI table chart]

It is often useful to combine a plot and the table in one graphic for the web or a report. The next code segment uses the functions of marrangeGrob to combine a plot and the table graphic in one graphic. This process relies on the fact the tableGrob and ggplot2 are built on the Grid library and return graphic objects (grobs) as results which can be merged.

plot_list<-mget(c("gini_plot1","table_plot"))
two_plots<-marrangeGrob(plot_list,ncol=2,nrow=1,top=NULL)
two_plot_name<-"gini_chart_table.png"
ggsave(file=two_plot_name,width=15,height=8,two_plots)

The resulting graphic is shown below.

[Figure: GINI chart and table combined]

This combination approach is often useful when displaying information on the web.



[1] https://www.rdocumentation.org/packages/wbstats/versions/0.2

[2] https://cran.r-project.org/web/packages/wbstats/vignettes/Using_the_wbstats_package.html

[3] https://cran.r-project.org/web/packages/gridExtra/vignettes/tableGrob.html

Accessing the WORLD Bank Data

This note focusses on directly accessing indicators in the database of the World Bank. Specifically, we are going to show a series of time series plots of Gini coefficients for several countries. This vignette follows from the earlier one on Gini coefficients. Appendices in the corresponding PDF contain the script and a log file for a sample run.

Packages USED

The key package used in this vignette is the wbstats package, which accesses the World Bank RESTful API. Key packages include:

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • wbstats – the retrieval package to be used[1].

Detailed instructions for utilizing the wbstats package are available online.[2]

Retrieving the DATA

The process starts with loading the relevant packages. Unlike many other statistical databases, data in the World Bank (WB) system are organized by indicators by countries. The structure of the database is shown in the earlier GINI vignette for the G20. Data can be retrieved by combinations of country and indicator. Ways to search for indicators and country codes are indicated in the G20 vignette. In this vignette, we will use a predefined csv file with the country codes that was saved in the previous vignette. The csv file has been modified by hand to include a flag for countries which are members of the G7.


In the initial set of code, we select the g7 countries and then go directly to the database to retrieve the data. The read_csv routine is used to import the CSV because it returns a tibble which is the tidyverse extension of a regular data frame.

#g7 facets
setwd("D:\\OneDrive\\wb_examples")
library(tidyverse)
library(wbstats)
#read in a table of country codes and regions with a g7 flag
g20_country_list<-read_csv("g20_g7_country_list.csv")
print(as.data.frame(g20_country_list))
g7_list<-filter(g20_country_list,G7==1)
print(g7_list)
gini_indicator<-"SI.POV.GINI"
g7_data<-wb(country=g7_list$iso3c,indicator=gini_indicator) %>%
mutate(Date=as.Date(paste0(date,"/1/1")))
head(g7_data)

The gini_indicator variable uses the indicatorID found in the previous vignette. The mutate function is used to transform the simple integer year date variable into a normal R date-class variable. All dates have days associated with them; for this purpose, it is easiest to use the first day of the year. Transforming the date vector to a true date-type vector is required so that date constructs can be used in the plotting software to position data relative to the x axis. The data retrieved form a “tall” tibble with one observation per row. This will be seen in the log print included in the PDF.

In this vignette, two charts are going to be produced. The base chart is a standard multi-line plot. This is created by grouping the data by country. Japan is excluded from the initial dataset by filtering for all countries not equal to “Japan” because it was found to have only one observation. The line type is also varied by country, as is the colour of the line. The title of the chart is obtained from the indicator title, which is retrieved with every observation in the data.

chart_title<-g7_data$indicator[1]
#date is a vector of numbers therefore discrete scale should be used
# scale_x_discrete(breaks=seq(min(g7_data$date),max(g7_data$date),by=2))+
#time series plot -only one observation for Japan
#drop japan
g7_data_ex_japan<-filter(g7_data,country!="Japan")
plot1<-ggplot(g7_data_ex_japan,aes(x=Date,y=value,group=country,colour=country))+
geom_line(aes(linetype=country),size=1)+
labs(title=chart_title,y="GINI Value",x=NULL,
caption="Source:World Bank, JCI")+
scale_x_date(date_breaks="3 years",date_labels="%Y")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot2<-plot1+facet_wrap(~country,ncol=2)+theme(legend.position="none")
ggsave(plot1,file="g7_ex_japan.png")
ggsave(plot2,file="g7_ex_japan_facets.png")

The comments referring to the discrete scale show how a scale would be defined if the original integer variable date had been used in the chart. It is only possible to use the date scale with a true date variable. The date scale is formatted just to be years with the date_labels attribute in the scale_x_date function. The first base plot is saved as plot1 and is shown below.

[Figure: G7 GINI index, excluding Japan]

The line chart is somewhat hard to follow because many of the lines intersect. With the colours and varying line types, it is usable but can be a little difficult to see the country trends. Therefore, we are going to transform plot1 into plot2 with the facet_wrap function which breaks the plot into a separate plot for each country arranged in two columns. The resulting plot is shown below.

[Figure: G7 GINI index, excluding Japan, facetted by country]

The advantage of the facet form is that each facet has the same scales and size. This facilitates visual comparison while also simplifying the perspective. The legend is suppressed because it is not required in the facet format.



[1] https://www.rdocumentation.org/packages/wbstats/versions/0.2

[2] https://cran.r-project.org/web/packages/wbstats/vignettes/Using_the_wbstats_package.html

Processing Two Time Series

This note provides an example of retrieving two time series from the FRED database maintained at the St. Louis Fed. This is a database of over 500,000 time series from multiple sources, including US government and international sources. There are several R packages which can be used to access FRED. The package alfred is the simplest for retrieving a series without documentation because it uses an internal application API key. However, it cannot retrieve the associated series information. In this example, we will use fredr, which includes access to the basic series documentation. It requires that the user open an account once to obtain an API key. The actual script and the log file are in a corresponding PDF in the downloads section of the Jacobson Consulting web site.
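
For comparison, a minimal sketch of the alfred approach is shown below. It is an illustration only, assuming the get_fred_series() function of the alfred package; it retrieves one of the series used later in this note by its FRED identifier and returns the observations without the accompanying series documentation.

#minimal sketch of the alfred alternative (illustration only):
#no user API key is required, but no series documentation is returned
library(alfred)
us_debt_ratio_alfred<-get_fred_series("DEBTTLUSA188A",series_name="us_debt_ratio")
head(us_debt_ratio_alfred)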

Packages USED

The key package used in this vignette is the fredr package, which accesses the FRED API. Key packages include:

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • fredr – the retrieval package to be used.

Retrieving the DATA

The first step in the analysis is to go to the FRED site at https://fred.stlouisfed.org to open an account and obtain an API key. This is done only once and the key can be used for multiple R scripts. In this case, the key is saved in a CSV file in the folder being used for this example. It must be retrieved from this CSV file at the start of each run to initialize the calls to the API.

# this routine uses fredr to access the basic documentation

setwd("D:\\OneDrive\\fred_test")
library(tidyverse)
library(fredr)
library(ggplot2)
#set up the key
user_api_key<-read.csv("fred_api_key.csv",stringsAsFactors=FALSE)
fredr_set_key(user_api_key$fredapi)

The next stage is to define the names for the series to be retrieved. The fredr package has search options which would facilitate the lookup of series ids. For simplicity, this was done in this example using the St. Louis web site. The next set of code retrieves the information about the series first. Each call to the routine fredr_series returns a tibble for one series. These are merged using bind_rows to get one tibble which is printed and written out as a csv for documentation purposes. The routine fredr_series_observations returns the actual data as a tibble.

#define the series
us_ratio<-"DEBTTLUSA188A"
cdn_ratio<-"DEBTTLCAA188A"
#first, we retrieve the information about the two series for documentation purposes
us_debt_ratio_info<-fredr_series(us_ratio)
cdn_debt_ratio_info<-fredr_series(cdn_ratio)
series_info<-bind_rows(us_debt_ratio_info,cdn_debt_ratio_info)
print(as.data.frame(series_info))
write.csv(series_info,file="debt_series_info.csv")
#now get the data
us_debt_ratio<-fredr_series_observations(us_ratio)
cdn_debt_ratio<-fredr_series_observations(cdn_ratio)

The next stage is to bind the two tibbles into one for the purpose of plotting. The dplyr filter command is used to exclude earlier observations. Then ggplot from the ggplot2 library is used to prepare the chart.

#combine the two series into one tibble
plot_data<-bind_rows(us_debt_ratio,cdn_debt_ratio) %>%
filter(date >="2000-01-01")
#now start chart
debt_plot<-ggplot(plot_data, aes(x=date,y=value,group=series_id,colour=series_id))+
geom_line(size=1)+
theme(legend.position="bottom",legend.title=element_blank())+
scale_colour_hue(labels=c("Canada","United States"))+
labs(title=paste0(cdn_debt_ratio_info$title,"\n",us_debt_ratio_info$title),
y="Share of GDP",x=NULL,caption=paste("FRED update:",cdn_ratio,"/",
substr(cdn_debt_ratio_info$last_updated,1,10),"--",us_ratio,"/",
substr(us_debt_ratio_info$last_updated,1,10)))
ggsave(debt_plot,file="CAN_US_Debt_plot.png")

It should be noted that the variables named in the ggplot aes are specific to the fredr package. The titles retrieved in the info variables from FRED are used to prepare the chart title. A newline is generated in the title by including the string “\n”. The date of last update from the series information is used as part of the caption. The latter also includes the actual FRED mnemonics for documentation purposes.

The resulting chart is shown below.

[Figure: Canada and US debt ratio plot]