This category of material includes notes introducing R and providing some vignettes on its use that may be of interest to economists. Additional material is available in PDF form in the downloads. These scripts are provided with no support or warranty or guarantee as to their general or specific usability.
Processing PUMF Files
This note provides a vignette showing how the Statistics Canada public use microdata file (PUMF) for the Survey of Financial Security 2016 can be converted into a usable format for processing in R to produce weighted cross tabulations and charts.
The key package used in this vignette is memisc, developed by Martin Elff at Zeppelin University in Friedrichshafen, Germany, where he teaches political science. The package provides an infrastructure for the management of survey data, including value labels, definable missing values, recoding of variables, production of code books, and import of (subsets of) SPSS and Stata files. Further, the package can produce tables and data frames of arbitrary descriptive statistics and (almost) publication-ready tables of regression model estimates, which can be exported to LaTeX and HTML.
Dr. Elff’s website contains a detailed discussion of how to use the package for analyzing microdata, with an emphasis on categorical variables. Managing items in a survey data set with value labels and recoding them is one of its many functions. However, of particular importance to this vignette is the infrastructure that he supplies to process raw data files and variable and value labels defined in the SPSS format.
In the example discussed below, we will utilize the SFS2016 released in the summer of 2018. A code listing and log are included in the accompanying PDF.
We will also use the ggplot2 package from the tidyverse system to produce plots based on weighted cross tabulations of the data. One of the characteristics of the Statistics Canada PUMF files is that they are weighted surveys. Although not done in this example, descriptive and other analysis using weights can be done with many other packages, including Hmisc and descr.
Converting the Data
The first stage is always to process the flat text files provided by Statistics Canada in the PUMF distribution. There is always a data file and control files for several packages including SPSS. In this example, we will use the SPSS version of the control files.
In this vignette, the conversion process has been done in a separate script from the analysis function. In this initial processing script, the first stage is to load the packages and define the location of the text file.
#minimum version should be memisc_0.99.14.12
#define locations of input spss files
print(paste("Current directory is",getwd()))
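The setup stage can be sketched as follows. This is a minimal sketch, and the file paths shown are placeholders, not the actual names in the SFS2016 PUMF distribution.

```r
#minimum version should be memisc_0.99.14.12
library(memisc)

#define locations of input spss files (placeholder names)
data_file    <- "EFAM_PUMF.txt"       #flat data file
columns_file <- "EFAM_PUMF_i.sps"     #column definitions
labels_file  <- "EFAM_PUMF_lb.sps"    #variable and value labels
miss_file    <- "EFAM_PUMF_mv.sps"    #missing-value definitions

print(paste("Current directory is", getwd()))
```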
The next stage is to create a definition of the dataset using the spss.fixed.file import procedure saved in the variable sfs2016_efam.
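A sketch of the importer definition is shown below. The spss.fixed.file function takes the flat data file plus the SPSS syntax files; the file names here are illustrative placeholders rather than the actual SFS2016 deliverable names.

```r
library(memisc)

#define the dataset from the flat file and the SPSS control files
sfs2016_efam <- spss.fixed.file("EFAM_PUMF.txt",
                                columns.file = "EFAM_PUMF_i.sps",
                                varlab.file  = "EFAM_PUMF_varlab.sps",
                                codes.file   = "EFAM_PUMF_codes.sps",
                                missval.file = "EFAM_PUMF_miss.sps")
```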
The next stage is to use the dataset definition (the importer object) to create a dataset in R’s memory and to write a codebook listing the variables. Value labels were not supplied for all variables in the dataset but were supplied for the weight variable. Because of the presence of these value labels, memisc considers the weight variable (pweight) to be a nominal categorical variable. The measurement of the variable is changed to “ratio” so that memisc will treat it as a numerical variable. The modified memory object (sfs2016_efam.ds) is then saved in an R save file for easy reloading in other R scripts. Other data edits could be made before this save as well. However, in this example, such edits will be made in the analytical part of the example.
#modify the type for pweight because it is initially set to be a nominal
# categorical variable because value labels are supplied
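The import and edit step described above can be sketched as below; the object names follow the text, while the output file names are assumptions.

```r
library(memisc)

#pull the data into memory using the importer object and write a codebook
sfs2016_efam.ds <- as.data.set(sfs2016_efam)
Write(codebook(sfs2016_efam.ds), file = "sfs2016_codebook.txt")

#pweight arrives as nominal because value labels were supplied;
#reclassify it as a ratio-scale (numeric) variable
measurement(sfs2016_efam.ds$pweight) <- "ratio"

#save the modified object for easy reloading in other scripts
save(sfs2016_efam.ds, file = "sfs2016_efam.RData")
```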
Doing a codebook analysis for the pweight variable now shows that it is regarded as being numerical.
sfs2016_efam.ds$pweight 'Survey weights - PUMF'
Storage mode: double
Missing values: 9999999.9996-9999999.9999
Values and labels N Percent
9999999.9996 M 'Valid skip' 0 0.0
9999999.9997 M 'Don't know' 0 0.0
9999999.9998 M 'Refusal' 0 0.0
9999999.9999 M 'Not stated' 0 0.0
(unlab.vld.) 12428 100.0 100.0
The results from this script are shown in the PDFs. Any problems are likely the result of incorrectly formatted SPSS control cards. The data file as delivered was longer than the record definition. This required some modifications to memisc. The minimum version for this example to work is memisc_0.99.14.12.
Some Basic Analysis
In the example script, we will define a new age variable grouping the survey respondents by birth year and then analyze some aspects of their mortgages. The value labels for several of the variables used will have to be supplied because these were omitted from the Statistics Canada deliverables.
The first stage sets up the directory and loads the saved data set from the previous script.
This creates a second copy of the DS object for additional modifications.
#loading tidyverse will mask some similarly named functions in memisc
print(paste("Current directory is",getwd()))
It is this copy (sfs2016_2.ds) that will be modified and saved in this script. Value labels will be added for several variables because they were omitted from Statistics Canada’s package. The definition of the value labels is found in the PUMF documentation provided by Statistics Canada.
The first stage is to create a new age variable grouping the respondents into Boomers (born up to and including 1965), Generation Xers (born between 1966 and 1979) and Millennials (born after 1979). The cut command is used to break a birth year variable into groups (1-3) on the basis of the break years. The latter are treated as closed on the right by default. Appropriate labels are provided.
#now to look at age
#now for the classification
#right=TRUE is default for cut - means groups are closed on the right
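The classification can be sketched as follows; the birth-year values here are illustrative rather than drawn from the survey.

```r
#illustrative birth years standing in for the survey variable
birth_year <- c(1950, 1965, 1968, 1979, 1985)

#cut into three cohorts; right=TRUE (the default) closes intervals
#on the right, so 1965 is a Boomer and 1979 is a Gen Xer
grouped_birth_year <- cut(birth_year,
                          breaks = c(-Inf, 1965, 1979, Inf),
                          labels = c("Boomer", "Gen X", "Millennial"),
                          right = TRUE)
table(grouped_birth_year)
```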
The variable is defined as a factor or ordered categorical variable.
The next set of code adds the grouped_birth_year as the variable age_cohort to the data set and supplies a description. Value labels are added for the type of mortgage interest and the term of the mortgage because these were omitted from the supplied definitions by Statistics Canada.
#now define some sample variables
#now start some variable edits in the sfs2016_2.ds object
description(age_cohort)<-paste(description(pagemie),"By Birth Year")
description(own_mortgaged)<-"Own home but have mortgage"
#we have to supply labels for a number of variables because Statcan's label file was incomplete
#we could subtract 2016 to show the number of years
description(years_to_renewal)<-"Year of Renewal from 2016"
#save the calculations so we don't have to recreate the variables we modified
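The label edits and the save can be sketched as below. The variable name, codes and label text here are placeholders only; the actual definitions come from the Statistics Canada PUMF documentation.

```r
library(memisc)

#illustrative only: hypothetical variable name, codes and labels
labels(sfs2016_2.ds$interest_type) <- c("Fixed"       = 1,
                                        "Variable"    = 2,
                                        "Combination" = 3)
description(sfs2016_2.ds$interest_type) <- "Type of mortgage interest rate"

#save the calculations so we don't have to recreate the variables we modified
save(sfs2016_2.ds, file = "sfs2016_2.RData")
```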
Now we can start the analysis. The basic approach in this simple example is to use the xtabs function to produce weighted cross tabulations. By default, not-available (NA) observations are not processed.
The first cross tabulation looks at tenure of dwelling.
#now do a weighted frequency counts
#convert to a data frame for ggplot
labs(title="Economic Family Housing Tenure - 2016",
fill="Tenure",y="Economic Families (000)",x="Birth Year Cohort",caption="Source: SFS2016, JCI")
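The full pipeline from weighted cross tabulation to chart can be sketched as below; ptenur is an assumed name for the tenure variable, and the conversion to a plain data frame is included because xtabs expects one.

```r
library(memisc)
library(ggplot2)

#convert the memisc data set to a plain data frame for xtabs
df <- as.data.frame(sfs2016_2.ds)

#weighted counts of tenure by cohort; NA observations are dropped by default
tenure_tab <- xtabs(pweight ~ age_cohort + ptenur, data = df)

#convert to a data frame for ggplot, scaling to thousands
plot_df <- as.data.frame(tenure_tab)
plot_df$Freq <- plot_df$Freq / 1000

ggplot(plot_df, aes(x = age_cohort, y = Freq, fill = ptenur)) +
  geom_col() +
  labs(title = "Economic Family Housing Tenure - 2016",
       fill = "Tenure", y = "Economic Families (000)",
       x = "Birth Year Cohort", caption = "Source: SFS2016, JCI")
```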
The data are in economic family units, so the resulting xtabs object counts weighted economic families (shown in thousands in the chart).
The appendices show the scripts and logs. Note that comments may wrap because of formatting.
A Tidyverse Example for NDM Vector Processing
This section examines a simple script to process seven series downloaded from the NDM vector search facility. The script reads the data and analyzes the contents. The data are then reordered by time and converted from the tall, narrow format in which they are retrieved to a wide CSV with time in columns. Percentage changes are calculated for selected series, one series is plotted, and selected series are annualized using averages. This script was developed to highlight features of the Tidyverse for processing data into a useful form. Normally, all analysis should be done in R rather than exported to older technologies such as Excel. However, in this case, the script was developed to show how to port data to CSVs usable by Excel. The full script with comments is included as Appendix A and a log file as Appendix B in the accompanying PDF in the R notes downloads.
The initial set of code sets the working folder or directory and loads the required package libraries for this simple project. The backslashes are duplicated because the backslash is an “escape” character that signals special processing of the subsequent character.
The next set of code reads the data in using the readr routine read_csv.
The print statement will show that the column names in the input_data data frame include "Vector", "Frequency", "Description", "Reference Period", "Unit of Measure", "Value" and "Source". In the discussion below, each column is considered a variable in the data frame.
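The read step can be sketched as below; the CSV file name is a placeholder for the file saved from the NDM vector search facility.

```r
library(readr)

#read the downloaded vector data; read_csv returns a tibble
input_data <- read_csv("ndm_vectors.csv")

#inspect the column names: Vector, Frequency, Description,
#Reference Period, Unit of Measure, Value, Source
print(colnames(input_data))
```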
One of the key features in modern R is the use of the pipe (%>%). This “pipes” the results from one command to the next and facilitates the linkage of commands without storing unnecessary intermediate results.
The next commands create a series directory by selecting unique or distinct values for the Vector and Description variables. The results are then printed.
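The series directory step, written with the pipe, might look as follows (continuing from the read step above):

```r
library(dplyr)

#unique Vector/Description pairs form a directory of the series
series_directory <- input_data %>%
  distinct(Vector, Description)
print(series_directory)
```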
One challenge with data retrieved from NDM is that the sort may not be optimal; for time series, most users want the observations ordered by date. The other challenge is that the representation of the time period is not standard in that a day is not included with the month. Excel automatically converts Month Year combinations to dates by adding the first day of the month. The next command mutates the input data set to include two new variables, obs_date and obs_year. The date variable is created by pasting the string “1 ” to each reference period entry and then converting to system-standard dates using the day-month-year function (dmy) from the lubridate package.
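The date construction described above can be sketched as below; the column names follow the NDM download format.

```r
library(dplyr)
library(lubridate)

#paste a day onto each "Month Year" reference period and parse it,
#then extract the year for later annualization
input_data <- input_data %>%
  mutate(obs_date = dmy(paste("1", `Reference Period`)),
         obs_year = year(obs_date))
```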
It should be noted that variable columns in a data frame are referenced by using the data frame name concatenated with the $ sign to the variable name. Variable names with embedded blanks must be enclosed in quotes.
The next command creates a sorted data frame by vector and date, mutates it to include the variable obs_pch which is the percentage change and filters the data to include only data from 2015 forward.
The sorted_data data frame is still a tall data frame with one value observation per row. The next command set creates a value_spread data frame with the vector number, the description and a new variable Type indicating that the data are untransformed. The data are spread into a wide data set with one column for each month in obs_date.
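The widening step might be sketched as follows, assuming sorted_data exists as described above:

```r
library(dplyr)
library(tidyr)

#mark rows as untransformed values and spread by month,
#producing one column per obs_date
value_spread <- sorted_data %>%
  mutate(Type = "Value") %>%
  select(Vector, Description, Type, obs_date, Value) %>%
  spread(key = obs_date, value = Value)
```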
Note that in R, variable names are case-sensitive. In other words, to be found, Description must be capitalized exactly as it appears in the data frame.
We want to add percentage change data to the analysis. However, the vectors chosen for this run include 3 vectors from the Bank of Canada’s price series which are already in change format. Therefore, they should be excluded from the calculation. This is done by excluding all rows with vector names in a list defined from the 3 Bank series.
boc_series<-c("V108785713", "V108785714", "V108785715")
filter(!(Vector %in% boc_series))%>%
The filter command uses the logical not (!) to exclude vectors in the list of bank series.
The final two commands concatenate the two frames by row using the rbind command and write the resulting output data frame to a CSV file.
The first columns of the resulting data set are shown below.
Just to show that one can do some work in R, the next lines select and plot the labour force average hourly wage rate. The resulting chart is saved as a PNG file. The chart title is extracted from the series directory created above.
The resulting graph is shown below. It uses basic defaults.
The final task in this simple script is to create an annualized data set. This means taking the annual average of all series. However, the labour force wage rate is excluded because correct annualization requires weighting by monthly employment, which is not available in the data set.
In this section of code, the key verb is the group_by command which sets the limits for which the mean or average calculation is applied (by year) and includes unique copies of the Vector and Description in the output data set before spreading it by year.
The resulting data set is shown below.
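The annualization step can be sketched as below; the vector identifier excluded from the calculation is a hypothetical placeholder for the wage-rate series.

```r
library(dplyr)
library(tidyr)

#exclude the wage-rate series (placeholder identifier), average by year
#within each series, then spread with one column per year
annual_data <- sorted_data %>%
  filter(Vector != "V0000000") %>%
  group_by(Vector, Description, obs_year) %>%
  summarise(Value = mean(Value)) %>%
  spread(key = obs_year, value = Value)
```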
Introducing R to Economists
This note outlines some key documentation on the statistical programming language R to provide an introduction to new users, particularly economists. The note will also go through a specific example of using data retrieved as vectors from Statistics Canada’s New Dissemination Model (NDM). The focus of this note is on manipulating or wrangling the data rather than specific topics such as forecasting or regression.
R and How to Get it
R is an open source statistical programming language with open-source implementations available for Windows, Mac and Linux. No seat licenses are required but no warranty is provided either. Documentation is available online as are the executables. The base system is extendable by an extensive library of open-source packages which are also available on line.
The key starting point is the web site R-Project web site – https://www.r-project.org. The software is downloadable from the Comprehensive R Archive Network or CRAN for short. This library is mirrored on the computing facilities of many institutions around the world. The starting link is https://cran.r-project.org/mirrors.html but the best link is the cloud redirection system sponsored by RStudio at https://cloud.r-project.org which simply moves the request to an appropriate server.
There is also an implementation of R maintained by Microsoft at https://mran.microsoft.com which should be considered for large scale operations requiring integration into the Microsoft world.
Key General Documentation
The main documentation source is maintained by the core development team at https://cran.r-project.org/manuals.html but there are many other very strong web sites documenting specific features and packages. A detailed introduction to using R is available online and in PDF in the manual link shown above. The specific link is https://cran.r-project.org/doc/manuals/r-release/R-intro.html.
Many of the best sites are associated with books by the authors of the web sites. The list below includes some that I have found very useful.
- Quick – R – a site associated with the book “R in Action” by Robert Kabacoff. The book is strongly recommended but the web site is a great substitute. The link is https://www.statmethods.net/index.html and contains many useful examples relevant to our profession. It is an excellent starting point because it assumes no prior R knowledge.
- Cookbook for R – a site associated with the book “R Graphics Cookbook” by Winston Chang, one of the RStudio team associated with the key graphics package ggplot2. The link is http://www.cookbook-r.com/ and the site includes a basic tutorial as well.
- The book “Art of R Programming” by Norman Matloff is available from Amazon and other suppliers. It is older but still used by many. The Amazon link is https://www.amazon.ca/Art-Programming-Statistical-Software-Design/dp/1593273843 Some unlicensed copies can be found on the web.
- Hadley Wickham, one of the R gurus and the force behind many key packages, has produced a number of books.
- “R for Data Science” by Garrett Grolemund and Hadley Wickham is highly recommended for more advanced users interested in the effective workflow of data visualization and analysis projects. This is the text book for the tidyverse system mentioned below. The book is available and recommended but an open source web version is also available. All the authors ask is for donations to support an endangered parrot. The link is http://r4ds.had.co.nz/.
- “Advanced R” by Hadley Wickham covers more advanced elements of working with R. As well as a book, the material is available in a web site at http://adv-r.had.co.nz/.
- There are many books on specific technical topics incorporating the use of R. Forecasting is particularly important in our profession. Rob Hyndman and George Athanasopoulos have developed “Forecasting Principles and Practice” with a book available from Amazon and an online textbook available at https://otexts.org/fpp2/.
R incorporates many key packages of routines in the base installation. However, there are many packages specifically written for key tasks. Formal documentation, often in PDF form is available on CRAN. An excellent web site organizing the documentation is maintained at https://www.rdocumentation.org/.
This list, below, is only some basic suggestions.
Tidyverse is the name coined by Hadley Wickham for a suite of integrated packages for data wrangling and visualization. Data wrangling is the term for getting your data clean and organized before you start your research. Tidyverse is documented at https://www.tidyverse.org/. It emphasizes the use of an augmented data structure referred to as a tibble, which is a tall organization of data. This means attributes are consolidated in a few columns and values into only one. The key packages in the base Tidyverse are:
- dplyr - the main data manipulation package incorporating a consistent grammar for data manipulation. Key verbs include:
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
- ggplot2 – one of the go-to packages for graphics in R
- readr – a suite of routines for reading datasets into R (notably read_csv) that are much more effective than the base routines.
- tidyr – a set of routines to manipulate tall datasets such as often come from NDM into wide ones as well as to decompose or recompose columns of data in a useful way.
- broom – a package with useful tibble functions – tidy particularly is good to move data frames to tibbles.
- lubridate – not one of the main packages but very useful for manipulating dates into a standard representation.
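The dplyr verbs listed above can be illustrated with a small pipeline on a built-in data set, so the sketch runs as-is:

```r
library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%                 #pick cases based on their values
  select(mpg, wt) %>%                  #pick variables based on their names
  mutate(wt_kg = wt * 453.6) %>%       #add a derived variable
  summarise(mean_mpg = mean(mpg)) %>%  #reduce to a single summary
  arrange(mean_mpg)                    #order the resulting rows
```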
Data.table is an alternative to dplyr. It has some speed and syntactic advantages for large datasets. It is often used in financial applications. An excellent training course is available at www.datacamp.com. The R documentation link is https://www.rdocumentation.org/packages/data.table/versions/1.11.4/.
Ordered series, often referred to as time series, do not have to be regular in terms of a fixed number of periods per aggregate period. Examples are daily data with and without weekend observations. The xts package associates an ordering vector, such as time, with a set of series. It is quite comprehensive and highly recommended. It extends the base time series object in R (ts) and builds on the capabilities of the zoo time series package. The R documentation link is https://www.rdocumentation.org/packages/xts/versions/0.11-1.
The forecast package includes the procedures discussed in the Hyndman book noted above. The focus is on univariate methods including ARIMA and various seasonal adjustment techniques. Because seasonality requires a time series with regular periodicity (i.e. monthly or quarterly), the ts time series object is emphasized. The R documentation link is https://www.rdocumentation.org/packages/forecast/versions/8.4.
Martin Elff, at Zeppelin University in Germany, maintains a suite of routines designed to easily manipulate microdata. It is an excellent substitute for the general functionality of Stata and similar packages. I have found it particularly useful with Statistics Canada’s PUMF datasets. Recent PUMF releases are formatted sufficiently well that input into the memisc routines is relatively straightforward. Some have issues with the apostrophe character in the descriptions, but that is easy to handle with modest edits of the input files. The package provides an extended data frame with the variables from the data file incorporating descriptors, code definitions and missing-value information as appropriate. A useful vignette is available at https://cran.r-project.org/web/packages/memisc/vignettes/items.html.
One of the documentation options that have come to the fore in recent years is cheat sheets. These are one- or two-page PDFs summarizing the key syntax and features of a package. Useful ones include:
- Data Wrangling (aka dplyr and other tidyverse) - https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
- Data Visualization – ggplot2 - https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf
There are numerous variants of cheat sheets for various aspects of R work. A good search engine will help. The R Studio summary of R is provided as a cheat sheet relevant to the course on DataCamp. The link is https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf.
Tom Short published a reference card for the standard R language constructs that is still used by many including myself. The link is https://cran.r-project.org/doc/contrib/Short-refcard.pdf.
A search engine will provide many examples of training in R. Even YouTube contains useful videos. However, my strong recommendation is the commercial courses available from https://www.datacamp.com/, which include a number of free courses such as an introduction to R. The service includes courses on time series analysis and many other data science topics. An excellent Android app (an iOS version is probably available) provides practice and a few training options for R and Tidyverse, Data Wrangling and Python.
One important feature of the DataCamp approach is that the actual lessons are done using an online version of R. This means that people can learn and practice without installing the software on their own machines. My personal experience is that this is very productive.
Many universities offer online courses for R. Harvard University has a version integrated tightly with Datacamp but with their own instructor. It is part of their data science program which includes a number of free online courses. The course is free but you pay for a certificate of completion. The link is https://www.edx.org/course/data-science-r-basics. This is a 4 week course requiring 2-4 hours per week but is self-paced.
The EDX site, used by Harvard, lists many courses from many universities and other organizations. For example, Harvard has a regression course that might be of interest to business economists. The link is https://www.edx.org/course/data-science-linear-regression. A more extensive course on regression analysis with R is offered by the Georgia Institute of Technology. The link is https://www.edx.org/course/statistical-modeling-and-regression-analysis.
One of the key requirements to develop your professional knowledge is to find help with specific tasks. Professionals often use a site called StackOverflow.com which includes an R section. Searching there for your problem may help.
The graphics features of R are extremely powerful. There are several packages, including ggplot2. Ideas for useful graphs can be found at https://www.r-graph-gallery.com/, which includes not just pictures of the charts but also code. It is a little cumbersome to navigate but very useful. Here is the link to a heat map example: https://www.r-graph-gallery.com/215-the-heatmap-function/.
Books and associated sites such as those by Kabacoff and Chang are key sources of ideas. Chang’s book and web site have great discussions of graphs as well as introductory material for R.
Search engine use will identify other sites, but many are oriented to selling training programs.
R is a programming language. The user interface executes scripts from files or commands typed into the interface. Many people may just use it interactively at first. For many years, programmers just used text editors to develop scripts. My personal preference for the usual ad hoc projects that I undertake is a programmer’s editor known as Notepad++. It is free, open-source software, available only for Windows. It incorporates a syntax processor for the functions in R and is very good at identifying unbalanced parentheses and other issues. It is available at https://notepad-plus-plus.org/.
Many programmers prefer to use an integrated development environment (IDE) such as RStudio. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management. There is an open-source version for producing open-source products and a commercial one (for creating restricted-license applications) available at https://www.rstudio.com/products/RStudio/#Desktop. It is a powerful system but imposes a particular organization on your project in terms of file organization and workflow that may be too complex or restrictive for ad hoc requirements.
R and the Future of NDM
One of the key features of NDM is the evolving webservices API which facilitates direct programmable access to the database. Several packages are starting to make use of these interfaces.
Marco Lugo’s CANSIM2R provides a very useful interface for retrieving full tables programmatically. Its default of renaming variables to generic names may cause some difficulties in project implementation. It is available in CRAN. The R documentation link is https://www.rdocumentation.org/packages/CANSIM2R/versions/1.14.1.
Mountain Math CANSIM
The Mountain Math team is developing a most promising CANSIM package that includes the ability to directly download individual series. It is not yet available on CRAN because it is still being improved; it is currently available only on GitHub (a development site). When the package is available, it should be the preferred route for most R users to access NDM series.
R and Reproducible Research
Throughout much of the R material, there is a significant emphasis on the concept of reproducible research. The concept is relatively simple: research should be delivered in a form that documents the data used and all steps in the analysis. Because R can be used to author many forms of documents including Word, PDF, TeX, PowerPoint, etc., it is ideally suited to the task. Rather than use multiple programs, researchers develop scripts which download the data, do the analysis and graphics, and combine them into a distribution-ready report. JCI uses these tools to produce its monthly Retail Fast Facts publication for a client. All stages, including the download of the data, the analysis and the production of the master Word document, are done in integrated R scripts.
In short, R facilitates a capital-intensive approach to analysis which facilitates easy replication. The advent of direct access to the NDM introduces an issue. Data retrieved from Statistics Canada should always be saved with the analysis to preserve the vintage of the information.