This category of material includes notes introducing R and providing some vignettes on its use that may be of interest to economists.  Additional material is available in PDF form in the downloads.  These scripts are provided with no support or warranty or guarantee as to their general or specific usability.

Accessing the WORLD Bank Data

This note focusses on directly accessing indicators in the database of the World Bank. Specially we are going to show a series of time series plots of Gini coefficients for several countries. This vignette follows from the earlier one on Gini coefficients. Appendices in the corresponding contain the log and a log file for a sample run.

Packages USED

The key package used in this vignette is the wbstats package which accesses the World Bank RESTful API. Key packages include

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • wbstats – the retrieval package to be used[1].

Detailed instructions for utilizing the wbstats package are available online.[2]

Retrieving the DATA

The process starts with loading the relevant packages. Unlike many other statistical databases, data in the World Bank (WB) system are organized by indicators by countries. The structure of the database is shown in the earlier GINI vignette for the G20. Data can be retrieved by combinations of country and indicator. Ways to search for indicators and country codes are indicated in the G20 vignette. In this vignette, we will use a predefined csv file with the country codes that was saved in the previous vignette. The csv file has been modified by hand to include a flag for countries which are members of the G7.


In the initial set of code, we select the g7 countries and then go directly to the database to retrieve the data. The read_csv routine is used to import the CSV because it returns a tibble which is the tidyverse extension of a regular data frame.

#g7 facets
#read in a table of country codes and regions with a g7 flag
g7_data<-wb(country=g7_list$iso3c,indicator=gini_indicator) %>%

The gini_indicator variable uses the indicatorID found in the previous vignette. The mutate function is used to transform the simple integer year date variable to a normal R date-class variable. All dates have days associated with them. For this purpose, it is easiest to use the first day of the year. Transforming the date vector to a true date-type vector is required so that date constructs can be used in the plotting software to position data relative to the X axis. The data retrieved is a “tall” tibble with one observation per row. This will be seen in the log print included in this PDF.

In this vignette, two charts are going to be produced. The base chart is standard multi-line plot. This is created by grouping the data by country. Japan is excluded from the initial dataset by filtering for all countries not equal to “Japan” because it was found to have only one observation. The line type is also varied by country as is the colour of the line. The title of the chart is obtained from the indicator title which is retrieved with every observation for the data.

#date is a vector of numbers therefore discrete scale should be used
# scale_x_discrete(breaks=seq(min(g7_data$date),max(g7_data$date),by=2))+
#time series plot -only one observation for Japan
#drop japan
labs(title=chart_title,y="GINI Value",x=NULL,
caption="Source:World Bank, JCI")+
scale_x_date(date_breaks="3 years",date_labels="%Y")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The comments referring to the discrete scale show how a scale would be defined if the original integer variable date had been used in the chart. It is only possible to use the date scale with a true date variable. The date scale is formatted just to be years with the date_labels attribute in the scale_x_date function. The first base plot is saved as plot1 and is shown below.

g7 ex japan

The line chart is somewhat hard to follow because many of the lines intersect. With the colours and varying line types, it is usable but can be a little difficult to see the country trends. Therefore, we are going to transform plot1 into plot2 with the facet_wrap function which breaks the plot into a separate plot for each country arranged in two columns. The resulting plot is shown below.

g7 ex japan facets

The advantage of the facet form is that each facet has the same scales and size. This facilitates visual comparison while also simplifying the perspective. The legend is supressed because it is not required in the facet format.




Accessing the WORLD Bank Data

This note focusses on directly accessing indicators in the database of the World Bank. Appendices in the accompanying PDF contain the log and a log file for a sample run.

Packages USED

The key package used in this vignette is the wbstats package which accesses the World Bank RESTful API. Key packages include

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • wbstats – the retrieval package to be used[1].

Detailed instructions for utilizing the wbstats package are available online.[2]

Retrieving the DATA

The process starts with loading the relevant packages. Unlike many other statistical databases, data in the World Bank (WB) system are organized by indicators by countries. The structure of the database is retained in a cache which can be processed. In this cache, data frames are available with the codes and documentation for countries, variables and other constructs. In this first code segment, the cache is retrieved, and the country list searched for country names to return country codes which are the most productive way to access that construct. The database requires some searching at the start of projects. For example, as well as an aggregate GINI index prepared by the World Bank, there are others submitted by various countries.

#this routine uses wbstats to access the worldbank database
#examine the cache
g20_countries<-c("Argentina", "Australia", "Brazil", "Canada", "China", "France", "Germany",
"India", "Indonesia", "Italy", "Japan", "Korea, Rep.", "Mexico", "Russian Federation",
"Saudi Arabia", "South Africa", "Turkey", "United Kingdom","United States","European Union")
#use str_detect to look at partial strings to get their spelling
filter(current_cache$countries,country %in% g20_countries) %>%
select(iso2c, iso3c,country,regionID,region) ->g20_country_list
#save the iso codes for subsequent work

In this example, we are searching for the 3-digit codes (iso3c) for the g20 countries. The initial text list tried did not find Russia or Korea. The str_detect function was used interactively in a filter statement to identify the Koreas as well as Russia. One example of this use is shown in the script. Then the g20_countries list was edited to match WB standards. The second filter statement returns the rows in the countries data frame matching the G20 list. These are saved as a CSV for future work.

The next code segment looks for the appropriate indicator code, the indicatorID, to retrieve the data. Of course, if the indicatorID and iso3c list have been prepared externally by using the web site or in a previous run, the wbstats process can start directly with the wb function to retrieve the data.

#now look for GINI by experiment, this returns 1 row

There are many variables in the gini_data frame including date, country codes other information. In the next code segment, the g20_country_list is simplified to a smaller list with only the code and the region for later merging with the plot data. The gini_data is then grouped by country, sorted by date and reduced to the last observation for each country. The data cannot be selected by date because of the ragged edge as will be seen below. The dplyr slice verb is used to select the last observation (the nth one) for each country. This works because of the grouping and sorting,

#the next statement groups by country, sorts by year, and takes the last observation in each group
# and creates a country_year variable comprised of the country and last year
# and join to country tibble to get region
group_by(gini_data,country)%>%arrange(date)%>%slice(n()) %>%

A mutate command is used to join the 3-character country code and the date of the last observation for labelling the x axis in the plot. The mutate command adds this column to the gini_data_last tibble. A left_join is used to match the code_region tibble to the retrieved data to add the region as a column.

The next code segment produces a bar chart with the country_year variable as the x axis labels plotted at a 45 degree angle. The bars are filled with different colours based on region and the scale legend is appropriately labelled.

#start the plot
labs(title=gini_indicator$indicator,subtitle="G20 Countries",y="GINI Value",
caption="Source: World Bank",x="Country by Indicator Date")+
scale_fill_discrete(name = "Region")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The resulting plot is shown below.

world bank gini g20


]t is often useful to produce a tabular presentation of data as graphic. This facilitates inclusion in MSWord documents or in web sites. The next coding segment uses gridExtra to produce a plot Grob which looks like a table.[3]

#now start a table
select(gini_data_last,iso3c,region,country,date,value) %>%
# grid table is documented at

The resulting table plot is shown in a graphic below.

world bank gini table chart

It is often useful to combine a plot and the table in one graphic for the web or a report. The next code segment uses the functions of marrangeGrob to combine a plot and the table graphic in one graphic. This process relies on the fact the tableGrob and ggplot2 are built on the Grid library and return graphic objects (grobs) as results which can be merged.


The resulting graphic is shown below.gini chart table

This combination approach is often useful when displaying information on the web.





Processing PUMF Files

This note provides a vignette showing how the Statistics Canada public use microdata file (PUMF) for the Survey of Financial Security 2016 can be converted into a usable format for processing in R to produce weighted cross tabulations and charts..

Packages USED

The key package used in this vignette is the MEMISC package developed by Martin Elff at the Zeppelin University in Friedrichshafen Germany where he teaches political science. His package provides an infrastructure for the management of survey data including value labels, definable missing values, recoding of variables, production of code books, and import of (subsets of) 'SPSS' and 'Stata' files is provided. Further, the package allows to produce tables and data frames of arbitrary descriptive statistics and (almost) publication-ready tables of regression model estimates, which can be exported to 'LaTeX' and HTML[1]

Dr. Elff’s website contains a detailed discussion about how to use the package for analyzing microdata with an emphasis on categorical variables.[2] Managing items in a survey data set with value labels and recoding them is one of its many functions. However, of particular importance to this vignette is the infrastructure that he supplies to process raw tax files and variable and value labels defined in the SPSS format.[3]

In the example discussed below, we will utilize the SFS2016 released in the summer of 2018.   A code listing and log are included in the accompanying PDF.

We will also use the ggplot2 package from the tidyverse system to produce plots based on weighted cross tabulations of the data.[4] One of the characteristics of the Statistics Canada pumf files is that they are weighted surveys. Although not done in this example, descriptive and other analysis using weights can be done with many other packages including HMISC[5] and descr[6].

Converting the Data

The first stage is always to process the flat text files provided by Statistics Canada in the PUMF distribution. There is always a data file and control files for several packages including SPSS. In this example, we will use the SPSS version of the control files.

In this vignette, the conversion process has been done in a separate script from the analysis function. In this initial processing script, the first stage is to load the packages and define the location of the text file.




#minimum version should be memisc_0.99.14.12

#define locations of input spss files









print(paste("Current directory is",getwd()))


The next stage is to create a definition of the dataset using the spss.fixed.file import procedure saved in the variable sfs2016_efam.










The next stage is to use the dataset definition, the importer object, to create a dataset in R’s memory, write a codebook listing the variables. Value labels were not supplied for all variables in the dataset but were supplied for the weight variable. Because of the presence of these value labels, MEMISC considers the weight variable (pweight) to be a nominal categorical variable. The measurement of the variable is changed to “ratio” so that MEMISC will consider it as a numerical variable. The modified memory object (sfs2016_efam.ds) is then saved in an R save file for easy reloading in other R scripts. Other data edits cold be made before this save as well. However, in this example, such edits will be made in the analytical part of the example.



#modify the type for pweight because it is initial set to be a nominal

# categorical variable because value labels are supplied




Doing a codebook analysis for the pweight variable now shows that it is regarded as being numerical.

sfs2016_efam.ds$pweight 'Survey weights - PUMF'


   Storage mode: double

   Measurement: ratio

   Missing values: 9999999.9996-9999999.9999

                 Values and labels     N     Percent


       9999999.9996 M 'Valid skip'     0           0.0

       9999999.9997 M 'Don't know'     0           0.0

       9999999.9998 M 'Refusal'       0           0.0

       9999999.9999 M 'Not stated'     0           0.0

                     (unlab.vld.) 12428   100.0 100.0


           Min:     10.000                          

           Max: 12254.083                          

           Mean:   1235.023                          

       Std.Dev.:   1055.169                          

       Skewness:     3.711                          

       Kurtosis:    25.079                          

The results from this script are shown in the PDFs. Any problems are likely result of incorrectly formatted SPSS control cards. The data file as delivered was longer than the record definition. This required some modifications to MEMISC. The minimum version for this example to work is memisc_0.99.14.12.

Some Basic Analysis

In the example script, we will define a new age variable grouping the survey respondents by birth year and then analyze some aspects of their mortgages. The value labels for several of the variables used will have to be supplied because these were omitted in the Statistics Canada deliverables.

The first stage sets up the directory and loads the saved data set from the previous script.

This creates a second copy of the DS object for additional modifications.


#loading tidyverse with mask some similarly named functions in memisc





print(paste("Current directory is",getwd()))



It is this copy (sfs2016_2.ds) that will be modified and saved in this script. Value labels will be added for several variables because they were omitted from Statistics Canada’s package. The definition of the value labels is found in the PUMF documentation provided by Statistics Canada.

The first stage is to create new age variable grouping the respondents into Boomers (born up to and including 1965), Generation Xers (born between 1966 and 1979 and the Millennial group born after 1979. The cut command is used to break a birth year variable into groups (1-3) on the basis of the break years. The latter are treated as closed to the right by default. Appropriate labels are provided.

#now to look at age


#now for the classification




#right=TRUE is default for cut - means groups are closed on the right





The variable is defined as a factor or ordered categorical variable.

The next set of code adds the grouped_birth_year as the variable age_cohort to the data set and supplies a description. Value labels are added for the type of mortgage interest and the term of the mortgage because these were omitted from the supplied definitions by Statistics Canada.

#now define some sample variables


#now start some variable edits in the sfs2016_2.ds object



description(age_cohort)<-paste(description(pagemie),"By Birth Year")


description(own_mortgaged)<-"Own home but have mortgage"

#we have to supply labels for a number of variables because Statcan's label file was incomplete



"Combination (hybrid)"=3,

"Valid Skip"=6)


labels(pasrcurg)<-c("3-6 months"=1,

"1 year"=2,

"2 years"=3,

"3 years"=4,

"4 years"=5,

"5 years"=6,

"7-10 years"=7,


"Valid Skip"=96)


#we could subtract 2016 to show the number of years


description(years_to_renewal)<-"Year of Renewal from 2016"


#save the calculations so we don't have to recreate the variables we modified


Now we can start the analysis. The basic approach in this simple example is to use the xtabs function to produce weighted cross tabulations. By default, not-available (NA) observations are not processed.

The first cross tabulation looks at tenure of dwelling.

#now do a weighted frequency counts



tab_tenur<- xtabs(pweight~age_cohort+pftenur,data=sfs2016_2.ds)


#convert to a data frame for ggplot




labs(title="Economic Family Housing Tenure - 2016",

fill="Tenure",y="Economic Families (000)",x="Birth Year Cohort",caption="Source: SFS2016, JCI")


The data are in economic family units so the resulting xtab object is sc

sfs2016 plots

The appendices show the scripts and logs. Note that comments may wrap because of formatting.








Processing Two Time Series

This note provides an example of retrieving two time series from the FRED database maintained at the St. Louis Fed. This is a database of over 500,000 time series from multiple sources including US government and international sources. There are several R packages which can be used to access FRED. The package alfred is simplest for retrieving a series without documentation because it uses an internal application API key. However, it cannot retrieve the associated series information. In this example, we will use fredr which includes access to the basic series documentation. It requires that the user open a one-time account to obtain an API key.  The actual script and the log file are in a corresponding PDF in the downloads section of the Jacobson Consulting web site.

Packages USED

The key package used in this vignette is the fredr package which accesses the FRED api. Key packages include

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • fredr – the retrieval package to be used.

Retrieving the DATA

The first step in the analysis is to go the FRED site at to open and account and obtain an API key. This is done only once and can be used for multiple R scripts. In this case, the key is saved in a CSV file in the folder being used for this example. It must be retrieved from this CSV file for each run to initialize the calls to the API.

# this routine uses fredr to access the basic documentation

#set up the key

The next stage is to define the names for the series to be retrieved. The fredr package has search options which would facilitate the lookup of series ids. For simplicity, this was done in this example using the St. Louis web site. The next set of code retrieves the information about the series first. Each call to the routine fredr_series returns a tibble for one series. These are merged using bind_rows to get one tibble which is printed and written out as a csv for documentation purposes. The routine fredr_series_observations returns the actual data as a tibble.

#define the series
#first, we retrieve the information about the two series for documentation purposes
#now get the data

The next stage is to bind the two tibbles into one for the purpose of plotting. The dplyr filter command is used to exclude earlier observations. Then ggplot from the ggplot2 library is used to prepare the chart.

#combine the two series into one tibble
plot_data<-bind_rows(us_debt_ratio,cdn_debt_ratio) %>%
filter(date >="2000-01-01")
#now start chart
debt_plot<-ggplot(plot_data, aes(x=date,y=value,group=series_id,colour=series_id))+
scale_colour_hue(labels=c("Canada","United States"))+
y="Share of GDP",x=NULL,caption=paste("FRED update:",cdn_ratio,"/",

It should be noted that the variables named in the ggplot aes are specific to the fredr package. The titles retrieved in the info variables from FRED are used to prepare the chart title. A newline is generated in the title by including the string “\n”. The date of last update from the series information is used as part of the caption. The latter also includes the actual FRED mnemonics for documentation purposes.

The resulting chart is shown below.

CAN US Debt plot

A Tidyverse Example for NDM Vector Processing

This section examines a simple script to process a seven series downloaded from the NDM vector search facility. The script reads the data and analyzes the contents. The data are then reordered by time and converted from the tall narrow format of the data as retrieved to a wide csv with time in columns. Percentage changes are calculated for selected series. A plot of one series is done. Selected series are also annualized using averages. This script was developed to highlight features of the Tidyverse for processing data into a useful form. Normally, all analysis should be done in R rather than export to older technologies such as Excel. However, in this case, the script was developed to show how to port data to CSVs useable by Excel. The full script with comments is included as Appendix A and a log file as Appendix B in the accompanying pdf in the R notes downloads..

The initial set of code sets the working folder or directory and loads the required package libraries for this simple project. The backslashes are duplicated because the backslash is an “escape” character that signals special processing of the subsequent character.

setwd( "D:\\OneDrive\\ndm_r_runs")
library(tidyverse) library(lubridate)

The next set of code reads the data in using the readr routine read_csv.

input_data<-read_csv(input_file) print(colnames(input_data))

The print statement will show that the column names in the input_data data frame include "Vector", "Frequency", "Description", "Reference Period", "Unit of Measure", "Value" and "Source". In the discussion below, each column is considered a variable in the data frame.

One of the key features in modern R is the use of the pipe (%>%). This “pipes” the results from one command to the next and facilitates the linkage of commands without unnecessary results.

The next commands create a series directory by selecting unique or distinct values for the Vector and Description variables. The results are then printed.



One challenge with data retrieved from NDM is that the sort may not be optimal. For time series, most users want high to low. The other challenge is that the representation of the time period is not normal in that a day is not including with the month. Excel automatically converts Month Year combinations to dates including the first day of the month. The next command mutates the input data set to include two new variables, obs_date and obs_year. The date variable is created by pasting the string “1 “ to each reference period entry and then converting to system standard dates using the day-month-year function (dmy) from the lubridate package.



input_data$"Reference Period",sep="")),


It should be noted that variable columns in a data frame are referenced by using the data frame name concatenated with the $ sign to the variable name. Variable names with embedded blanks must be enclosed in quotes.

The next command creates a sorted data frame by vector and date, mutates it to include the variable obs_pch which is the percentage change and filters the data to include only data from 2015 forward.



The sorted_data data frame is still a tall data frame with one value observation per row. The next command set creates a value_spread data frame with the vector number, the description and a new variable Type indicating that the data are untransformed. The data are spread into a wide data set with one column for each month in obs_date.

value_spread<-select(sorted_data,Vector, Description,



Note that in R, the capitalization of all variables is critical. In other words, to be found, Description must be capitalized.

We want to add percentage change data to the analysis. However, the vectors chosen for this run include 3 vectors from the Bank of Canada’s price series which are already in change format. Therefore, they should be excluded from the calculation. This is done by excluding all rows with vector names in a list defined from the 3 Bank series.

boc_series<-c("V108785713", "V108785714", "V108785715")



filter(!(Vector %in% boc_series))%>%


The filter command uses the logical not (!) to exclude vectors in the list of bank series.

The final two commands concatenates the two frames by row using the rbind command and writes the resulting output data frame to a CSV file.



The first columns of the resulting data set are shown below.


Just to show that one can do some work in R, the next lines select and plot the labour force average hourly wage rate. The resulting chart is saved as PNG file. The chart title is extracted from the series directory created above.





labs(title=chart_title$Description,x="Date",y="Percentage Change")


The resulting graph is show below. It uses basic defaults.

 V2132579 percentage change

The final task in this simple script is to create an annualized data set. This means to take the annual average of all series. However, the labour force wage rate is excluding because correct annualization requires weighting by monthly employment which is not available in the data set.






mutate(Type="Annual Average")%>%



In this section of code, the key verb is the group_by command which sets the limits for which the mean or average calculation is applied (by year) and includes unique copies of the Vector and Description in the output data set before spreading it by year.

The resulting data set is shown below.

annual table