This category of material includes notes introducing R, together with some vignettes on its use that may be of interest to economists.  Additional material is available in PDF form in the downloads.  These scripts are provided with no support, warranty, or guarantee as to their general or specific usability.

Plotting LFS Standard Errors

Introduction

This script shows how to retrieve several LFS series using the Mountainmath CANSIM retrieval package and to plot the monthly total employment change with the associated standard errors. This plot is coupled with a plot of the underlying trend cycle estimates of the levels of total employment. The latter is probably the most appropriate estimate to use for planning purposes.

In this example the key packages are:

  • dplyr – to process the retrieved data – part of tidyverse
  • ggplot2 – to do the plots – part of tidyverse
  • cansim – to retrieve the series from NDM
  • xts – to manage time series
  • tbl2xts – to convert the retrieved tibble to an xts data frame
  • lubridate – to manipulate dates – part of tidyverse

Data Retrieval

The first stage is to set a working directory, load the initial required packages, define the list of Labour Force Survey (LFS) vectors required, and retrieve and print the metadata.

#This vignette retrieves vectors from the LFS
setwd("D:\\OneDrive\\lfs_stuff")
library(dplyr)
library(cansim)
library(tbl2xts)
library(xts)
library(lubridate)
#some routines make use of a cache which can be usefully set in options
options(cache_path="d:\\mm_data_cache")
#define the LFS vectors to be retrieved
canada_lfs_vbls<-c("v2062811","v100304302","v101884904", "v101884905",
"v101884906"
)
#define the plot window in terms of periodicity units
plot_window<-"24 months"
#analyze meta data
meta1<-get_cansim_vector_info(canada_lfs_vbls)
cube_meta<-get_cansim_cube_metadata(unique(meta1$table))
common_meta<-left_join(meta1,cube_meta,by=c("table"="productId")) %>%distinct()%>%
select(VECTOR,table,title_en,cubeTitleEn,cubeEndDate)
print(as.data.frame(common_meta))

The get_cansim_vector_info function retrieves the title information and table number for each vector. The get_cansim_cube_metadata function retrieves the metadata for the table(s); in this case, there is only one table. The dplyr left_join function links the series and table metadata. The select operator simplifies the resulting tibble to just the vector, the table number (NDM productId), and the series and table titles. The common_meta tibble is coerced to a data frame for printing so that all fields are shown in the log file. This metadata provides the documentation for the run.
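The join logic can be sketched with base R's merge on toy metadata (the table number and titles below are invented for illustration):

```r
# Hypothetical series metadata: one row per vector
meta1 <- data.frame(VECTOR = c("v001", "v002"),
                    table = c("14-10-0287", "14-10-0287"),
                    title_en = c("Employment", "Standard error"))
# Hypothetical cube metadata: one row per table
cube_meta <- data.frame(productId = "14-10-0287",
                        cubeTitleEn = "Labour force characteristics")
# Base-R equivalent of left_join(meta1, cube_meta, by = c("table" = "productId"))
common_meta <- merge(meta1, cube_meta,
                     by.x = "table", by.y = "productId", all.x = TRUE)
print(common_meta)
```

Each vector inherits the title of the table it belongs to; the distinct() call in the dplyr version removes any duplicated rows.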

The next code retrieves the data series using get_cansim_vector which returns a tibble with all the series. This tibble is a tall data structure with one row per data point. The NDM REF_DATE field is returned as a character date by the get_cansim_vector routine but is converted to an R date variable for use in time-related calculations.

canada_lfs_xts <- get_cansim_vector(canada_lfs_vbls,start_time=as.Date("1995-01-01")) %>%
mutate(Date=as.Date(REF_DATE)) %>%
tbl_xts(spread_by="VECTOR",cols_to_xts="VALUE")
tail(canada_lfs_xts,20)
#use index function or dates to select rows in xts data frame.
print(colnames(canada_lfs_xts))
print(data.frame(select(meta1,VECTOR,title_en)))

The dplyr mutate function is used to create the vector of R dates required by the tbl_xts function from the tbl2xts package. The latter function creates a data frame with columns for each unique vector and data derived from the VALUE field in the initial tibble. The tail function is used to show the last 20 observations of the resulting data frame and the column names and meta data are printed again.

The xts package was loaded after the dplyr package so that the xts version of merge is used in the next set of code, which calculates the month-to-month level difference. When using diff and lag functions in R, check the documentation for the package being used because different conventions may apply to the sign of the lag. An xts data structure is created using the xts version of merge to maintain the xts attributes. The column names must be reset to be useful for the plot.
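A quick base-R illustration of the diff convention (base diff shortens the series, whereas the xts method pads with a leading NA so the result keeps the original index):

```r
x <- c(100, 103, 101, 106)
# base R: each element is x[t] - x[t-1]; the first observation is dropped
d <- diff(x, lag = 1)
print(d)  # 3 -2 5
```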

#the first vector is lfs employment 15+
#the fourth vector is the standard error of the month to month change
lfs_change<-diff(canada_lfs_xts[,1],lag=1)
#the data have to be merged together to maintain the xts attributes
data1<-merge(canada_lfs_xts[,1],lfs_change,canada_lfs_xts[,4])
colnames(data1)<-c("LFS15+","DeltaLFS15+","SEDeltalfs15+")
tail(data1,12)
plot_data<-merge(data1[,2],data1[,2]+.5*data1[,3],data1[,2]-.5*data1[,3])
colnames(plot_data)<-c("DeltaLFS15","Upper","Lower")

The data1 xts data frame contains three series: the level of employment, the monthly change, and the standard error of the change. For the initial plot, a plot data structure is created containing the change series, an upper limit formed by adding half the standard error to the change, and a lower limit formed by subtracting it. Appropriate column names are defined for use in the plot.
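The band construction reduces to simple arithmetic; a minimal sketch with made-up change and standard-error values:

```r
change <- c(35.0, -10.0, 62.0)  # hypothetical monthly employment changes (000s)
se     <- c(28.0, 29.0, 28.0)   # hypothetical standard errors of the change
plot_band <- data.frame(DeltaLFS15 = change,
                        Upper = change + 0.5 * se,   # upper limit of the band
                        Lower = change - 0.5 * se)   # lower limit of the band
print(plot_band)
```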

The next piece of code converts the xts data frame back to a tibble after using the last function, from the xts package, to select the last portion of the data. The plot window is defined above in terms of the periodicity of the data, in this case, months.

#making it into a wide tibble makes it easier to manage in ggplot because the date is accessible
#last 2 years - use units of series such as months or days as in example below
plot_data2<-xts_tbl(last(plot_data,plot_window))
#now calculate a date string for the caption
date_string<-paste0(plot_data2$date[1],"/",tail(plot_data2$date,1))
head(plot_data2)

A date string is calculated from the first and last values of the date in the plot_data2 frame. The first rows of the plot data tibble are printed for documentation purposes.
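The caption range is just the first and last dates pasted together; for example, with a hypothetical 24-month window of monthly dates:

```r
# hypothetical two-year monthly date vector
dates <- seq(as.Date("2019-01-01"), by = "month", length.out = 24)
# first and last dates joined with a slash, as in the caption
date_string <- paste0(dates[1], "/", tail(dates, 1))
print(date_string)  # "2019-01-01/2020-12-01"
```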

The next set of code plots the lfs change variable with the standard-error augmented differences as the upper and lower bounds. The basic level change is marked with a point. The point range geom plots the augmented differences as a line between upper and lower. The geom_line portion of the plot draws a line between the change points.

#load the plot library
library(ggplot2)
plot1<-ggplot(plot_data2,aes(x=date,y=DeltaLFS15))+geom_point()+geom_pointrange(aes(ymin=Lower,ymax=Upper))+
geom_line()+
labs(title="Monthly Change in LFS 15+ Employment",y="Monthly Change (000)",
x=NULL,
subtitle="Seasonally Adjusted with Standard Error Bars",
caption=paste(date_string," ","Statcan:",paste(canada_lfs_vbls[c(1,4)],collapse=", "),"JCI"))
ggsave(plot1,file="LFS15_monthly_change.png")

The caption string includes the date string calculated above and the two vector numbers, which are merged into a single string separated by a comma. The plot is then saved to a png file. The example is shown below.
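The collapse argument of paste is what joins the vector numbers into one comma-separated string; a small sketch using the vectors defined earlier:

```r
canada_lfs_vbls <- c("v2062811", "v100304302", "v101884904",
                     "v101884905", "v101884906")
# collapse joins the selected elements into a single string
caption <- paste("Statcan:", paste(canada_lfs_vbls[c(1, 4)], collapse = ", "), "JCI")
print(caption)  # "Statcan: v2062811, v101884905 JCI"
```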

LFS15 monthly change 

The next set of code plots the level of the employment series along with the trend-cycle component. The latter series provides a better basis for analysis than the more volatile time series of LFS point estimates. We want to plot the last 6 months of the trend cycle with a different line type, so one variant is created that drops the last 6 months and another that contains just those 6 months. Note the combination of three series with xts merge.

#now start a plot of the trend cycle and the total employment
# review this link about trend cycle https://www.statcan.gc.ca/eng/dai/btd/tce-faq
#we want to drop the last 6 observations from the trend cycle and add a third series which is just the tail of the trend cycle.
dotted_period<-6
data_chart2<-merge(canada_lfs_xts[,1],head(canada_lfs_xts[,2],length(canada_lfs_xts[,2])-dotted_period),
tail(canada_lfs_xts[,2],dotted_period),fill=NA)
tail(data_chart2,24)
colnames(data_chart2)<-c("LFS15plus","Trend-Cycle","Trend Recent")
plot_data_trend<-last(data_chart2,plot_window)

Again, the last plot_window observations are selected for the plot.

The next set of code calculates the date string for the caption using the index variable for the xts plot_data_trend frame. The index variable is essentially the time vector for the frame.

library(broom)
date_string_trend<-paste0(as.Date(index(plot_data_trend))[1],"/",as.Date(tail(index(plot_data_trend),1)))
trend_tibble<-tidy(plot_data_trend)
#linetype and colour are managed in the alphabetical order of the series group in the tibble.
trend_plot<-ggplot(trend_tibble,aes(x=index,y=value,group=series,
colour=series))+geom_line(aes(linetype=series))+
scale_linetype_manual(values=c("solid","solid","dashed"))+
theme(legend.position="bottom",legend.title=element_blank())+
labs(title="Monthly LFS 15+ Employment",x=NULL,y="Monthly Level (000)",
subtitle="Seasonally Adjusted",
caption=paste(date_string_trend," ","Statcan:",paste(canada_lfs_vbls[c(1,2)],collapse=", "),"JCI"))
ggsave(trend_plot,file="LFS15_monthly_trend.png")
library(gridExtra)
plot_list<-mget(c("plot1","trend_plot"))
two_plots<-marrangeGrob(plot_list,nrow=1,ncol=2,top=NULL)
two_plot_name<-"lfs_se_trend.png"
ggsave(file=two_plot_name,width=15,height=8,two_plots)

The tidy function from the broom package (part of tidyverse) is used to create the required tibble.

The series are processed in alphabetical order, so the line types must be specified in that order. The line type is varied by the series variable created by the tidy function; this variable contains the names of the three series that we are plotting.
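The ordering rule can be checked directly: ggplot2 assigns manual scale values in the order of the factor levels of the grouping variable, which default to sorted order rather than the order in which the series appear in the data. A sketch with simple hypothetical names (sorting of names containing spaces or hyphens is locale-dependent, so plain names are used here):

```r
series_names <- c("Trend", "LFS15plus", "Recent")  # order of appearance in the data
f <- factor(series_names)
print(levels(f))  # "LFS15plus" "Recent" "Trend" - sorted, not appearance order
# a linetype vector such as c("solid", "solid", "dashed") is matched
# against these sorted levels, not against the data order
```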

The gridExtra package is used to merge the two plots into one larger graphic, with both plots side by side, for easier inclusion in a website or Word document. The dimensions in the ggsave call are defined in inches.

The merged plots are shown below.

lfs se trend

 

GDP Growth Decomposition

In this example, we will use the get_cansim function from Mountain Math to retrieve the monthly GDP $K data.  The Laspeyres (fixed-weighted) time series will be used to analyze the relative role that major aggregates played in the monthly results for total GDP growth.  The metadata now available in the NDM will be used to flexibly select the aggregates to be analyzed.

Packages USED

The key package used in this vignette is the cansim package, which accesses the Statcan NDM API.  Key packages include

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • cansim – package to retrieve metadata and series from Statistics Canada’s NDM
  • grid, gridExtra, gtable – for developing a table of the growth rates to be plotted.

Retrieving the DATA

The first major step is to use the get_cansim function to retrieve the complete table for analysis.  This table is saved using the saveRDS function in R to the local project folder.  This facilitates documentation as well as easy processing.  A flag is used to determine whether to retrieve a fresh copy of the table for processing.

library(cansim)
library(tidyverse)
library(lubridate)
#get gdp table and save it in the working directory
gdp_table_id<-"36-10-0434"
file_refresh<-TRUE
file_refresh<-FALSE
save_file<-paste0(gdp_table_id,".rds")
if(!file.exists(save_file)|file_refresh){
      gdp_table<-get_cansim(gdp_table_id)
      #now save it in the working directory
      saveRDS(gdp_table,file=save_file)
} else {
      gdp_table<-readRDS(save_file)
}
print(colnames(gdp_table))
#The current column names are
# 1 REF_DATE
# 2 GEO
# 3 DGUID
# 4 Seasonal adjustment
# 5 Prices
# 6 North American Industry Classification System (NAICS)
# 7 UOM
# 8 UOM_ID
# 9 SCALAR_FACTOR
# 10 SCALAR_ID
# 11 VECTOR
# 12 COORDINATE
# 13 VALUE
# 14 STATUS
# 15 SYMBOL
# 16 TERMINATED
# 17 DECIMALS
# 18 GeoUID
# 19 Classification Code for Seasonal adjustment
# 20 Hierarchy for Seasonal adjustment
# 21 Classification Code for Prices
# 22 Hierarchy for Prices
# 23 Classification Code for North American Industry Classification System (NAICS)
# 24 Hierarchy for North American Industry Classification System (NAICS)
#now we will rename columns as needed.  We can use numbers for the original columns

The interactive analysis of the columns of the retrieved table should always be an initial step in script development. The column numbers are shown in the R comments.  Many of the names are long enough to be unwieldy when writing easy-to-manage code.  As noted before, column names with blanks or special characters should be enclosed in back ticks (`) when using dplyr and other tidyverse packages.  The file_refresh flag should be set to TRUE to force re-retrieval of the table.
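In base R terms, renaming by position amounts to assigning into names(); a sketch with a toy frame and one of the awkward column names:

```r
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("Seasonal adjustment", "UOM")
# a name containing a blank needs back ticks in dplyr code,
# or [[ ]] in base R
print(df[["Seasonal adjustment"]])
# base-R equivalent of renaming the first column by position
names(df)[1] <- "sadj_type"
print(names(df))  # "sadj_type" "UOM"
```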

In the next stage the table will be simplified: the columns to be used will be renamed as required, with the column number used to simplify the renaming.  An R-standard date will also be created. Columns are also selected using their numbers rather than names for simplicity.

#the working table will be reduced to the required columns
#an R standard date is also created
gdp_monthly<-gdp_table %>%
rename(`sadj_type`=4,
`NAICS`=6,
`price_type`=21,
`naics_code`=23
) %>%
select(c(1,2,4:6,12,13,23)) %>%
mutate(data_date=as.Date(paste0(REF_DATE,"-1")))
#gdp table could be removed at this point

The next stage starts the processing to develop the data for the initial and subsequent charts.  We will use selected special aggregates as well as a few key service series.  These series are defined using the NAICS code in the metadata provided by the get_cansim retrieval. This variable, defined as naics_code in the rename above, is easier to work with than the NAICS text strings when selecting series.  In the code below, the text strings are used to select the required seasonal adjustment type and price measure.  For this analysis, we will use the 2007 fixed-weighted Laspeyres measures, referred to as “2007 constant”. The select verb in the code above uses column numbers rather than variable names for coding convenience.

#chart1 - composition of growth special aggregates
target_aggregates<-c("[T001]",
"[T002]",
"[T003]",
#"[T004]",
"[T005]",
"[T006]",
"[T007]",
"[T011]",
"[T012]",
"[T016]",
"[T018]",
"[41]",
"[44-45]",
"[52]","[53]"
)
#build tibble of distinct naics_codes
naics_table<-select(gdp_monthly,NAICS,naics_code)%>%distinct()
#calculate last date
last_date<-max(gdp_monthly$data_date)
date_vector_1<-c(last_date-months(4),last_date-months(2),last_date-months(1),last_date)
#get unique_prices and the string for constant dollars
price_measures<-unique(gdp_monthly$Prices)
laspeyres_flag<-price_measures[grep("constant",price_measures)]
#get_flag for seasonally adjusted at annual rates
seasonal_measures<-unique(gdp_monthly$sadj_type)
seasonal_flag<-seasonal_measures[grep("annual",seasonal_measures)]
#get the data for the growth table with the lag required for percentage change

In the initial analysis, the naics_table calculated here was examined for the target aggregate codes defined in the character vector above.  The next set of code develops the main table for processing.  Data are selected only for the required months; in this case, the last 4 months are used since 3 months will be reported in one of the tables.

The data must be grouped by naics_code and COORDINATE, then sorted (“arranged”) by date so that lags, percentage changes and contributions can be calculated within each series.  The COORDINATE variable preserves the table ordering.

growth_data<-filter(gdp_monthly,data_date %in% date_vector_1,
Prices==laspeyres_flag,sadj_type==seasonal_flag) %>%
group_by(COORDINATE,naics_code) %>%arrange(data_date)%>%
mutate(monthly_change=100*((VALUE/lag(VALUE,1))-1),
delta_change=VALUE-lag(VALUE,1))%>%ungroup()
head(as.data.frame(growth_data))

The head function coerces the tibble to a data.frame to force the display of the initial rows of all columns. The monthly percentage change and the delta change (level difference) are calculated.  The latter is required for the composition-of-growth calculations below.
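The grouped lag calculation can be sketched in base R with ave(), using made-up values for two codes (the data must already be sorted by date within each group, which is what the arrange step ensures):

```r
d <- data.frame(naics_code = rep(c("[T001]", "[41]"), each = 3),
                VALUE = c(1000, 1010, 1020, 50, 49, 52))
lag1 <- function(v) c(NA, head(v, -1))  # helper: shift a vector down by one
# percentage change within each code
d$monthly_change <- ave(d$VALUE, d$naics_code,
                        FUN = function(v) 100 * ((v / lag1(v)) - 1))
# level difference within each code
d$delta_change <- ave(d$VALUE, d$naics_code, FUN = function(v) v - lag1(v))
print(d)
```

Note that the first observation of each group is NA because there is no earlier value to lag against, which is why the dplyr version filters out the NA rows later.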

The next set of code defines a tibble for the total all-industries sector.  The ratio of the delta change in each target sector to that of the total sector defines the composition of change.  This calculation is part of building the full_table, which involves a left_join of the growth_data table to the total table.

#now select the total only
total_only<-filter(growth_data,naics_code=="[T001]")%>%select(data_date,monthly_change,delta_change)%>%
rename(total_change=delta_change,total_pch=monthly_change)
#now the base table and total to calculate the composition of change
full_table<-left_join(growth_data,total_only,by="data_date")%>%
filter(!is.na(delta_change)) %>%
mutate(point_contribution=((delta_change/total_change)*total_pch))

The point contribution is defined as each sector's share of the total level change multiplied by the percentage change of the total series.
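By construction, the sector contributions sum to the total percentage change; a numeric sketch with invented values:

```r
total_pch    <- 0.3               # total GDP percentage change (hypothetical)
total_change <- 6.0               # total GDP level change (hypothetical)
delta_change <- c(4.5, 1.0, 0.5)  # sector level changes summing to the total
point_contribution <- (delta_change / total_change) * total_pch
print(point_contribution)
print(sum(point_contribution))  # equals total_pch when the sectors sum to the total
```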

The next code selects the data for just the target aggregates and for the key variables required in the subsequent charts.

#select the chart data for the last months
chart_data_set<-filter(full_table,naics_code %in% target_aggregates) %>%
select(data_date,naics_code,NAICS,point_contribution,monthly_change)
#now plot the last monthly_change
#get_industries
# naics_text<-as.data.frame(filter(naics_table,naics_code %in% target_aggregates)%>%
# select(NAICS))
reverse_index<-seq(length(target_aggregates),1,by=-1)
chart_last_month<-filter(chart_data_set,data_date==last_date)%>%
mutate(naics_text=paste(naics_code,NAICS))

The next code starts the initial chart which is only for the last date.

plot1<-ggplot((chart_last_month),aes(y=point_contribution,x=naics_code,group=data_date,
label=round(point_contribution,2)))+
labs(title="Percentage Point Contribution to Total GDP Change",
caption=paste("NDM:",gdp_table_id,"JCI"),
x=NULL,y="Percentage point contribution",
subtitle=paste("Laspeyres data for selected aggregates for",substr(last_date,1,7)))+
geom_bar(stat="identity",fill="light green")+
geom_text(size=3)+scale_x_discrete(limits=chart_last_month$naics_code[reverse_index],
labels=(chart_last_month$NAICS[reverse_index]))+coord_flip()
ggsave(plot1,file="gdp_laspeyres_contribution.png")

The coordinates of the chart are flipped so that the x axis is plotted on the vertical axis.  Data points are labeled with the rounded value of the point contribution. The x axis is labelled with the NAICS text even though the plot uses the naics_code variable.  The label order is reversed so that the all-industries aggregate appears at the top of the chart.  The date is converted to a text string for the subtitle.  An example of this initial chart is shown below.
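The reversal works because, after coord_flip, the first value in the discrete limits ends up at the bottom of the vertical axis; reversing the order therefore puts the total at the top. A minimal sketch:

```r
codes <- c("[T001]", "[T002]", "[41]")  # total first, as in target_aggregates
reverse_index <- seq(length(codes), 1, by = -1)
print(codes[reverse_index])  # "[41]" "[T002]" "[T001]" - total is now last, i.e. at the top
```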

 

gdp laspeyres contribution

 

The next stage creates an alternative chart showing the data for two months.  The data must be grouped to separate the months when preparing the bars.  The default bar chart is stacked, so the “dodge” capability is used to place the bars side by side.

#now lets do it for 2 months
chart_last_two<-filter(chart_data_set,data_date %in% c(last_date,last_date-months(1)))%>%
mutate(naics_text=paste(naics_code,NAICS))
#calculate an adjustment factor to move the text outside the bar
#R treats a date object as a continuous variable
#therefore, we convert it to a discrete text string for the group and fill values
text_adjust_factor<-.1*max(abs(chart_last_two$point_contribution))
plot2<-ggplot((chart_last_two),aes(y=point_contribution,x=naics_code,group=substr(data_date,1,7),
fill=substr(data_date,1,7),label=round(point_contribution,2)))+
labs(title="Percentage Point Contribution to Total GDP Change",
caption=paste("NDM:",gdp_table_id,"JCI"),
x=NULL,y="Percentage point contribution",
subtitle="Laspeyres data for selected aggregates")+
geom_bar(stat="identity",position="dodge")+
scale_fill_discrete(name="Month")+
geom_text(size=3,position=position_dodge(0.9),aes(y=(sign(point_contribution)*
(text_adjust_factor+abs(point_contribution)))))+
theme(legend.title=element_blank(),legend.position="bottom")+
scale_x_discrete(limits=chart_last_two$naics_code[reverse_index],
labels=(chart_last_two$NAICS[reverse_index]))+coord_flip()
ggsave(plot2,file="gdp_laspeyres_contribution_2mon.png")

The dodge position is specified in the geom_bar definition.  A date in R is treated as a continuous variable because it is represented as a number.  Therefore, the group and fill variable definitions convert the date to a text string, which is by definition discrete.  The group defines the separation of the bars.  The fill defines the colour choices, from the default palette, for the bars.  A text adjustment factor is calculated relative to the maximum absolute value plotted, to position the value labels just beyond the ends of the bars.  This is used in the geom_text definition, which supplies a text layer on top of the bar layer from geom_bar. The theme function moves the legend from the default side position to the bottom and removes the superfluous title. An example of the chart is inserted below.
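The sign-aware label positioning can be verified in isolation: the offset pushes each label a fixed distance beyond the end of its bar, on the same side as the bar, for both positive and negative contributions:

```r
point_contribution <- c(0.25, -0.10, 0.05)  # hypothetical contributions
text_adjust_factor <- 0.1 * max(abs(point_contribution))  # 0.025
# put each label just past the end of its bar, preserving the sign
label_y <- sign(point_contribution) * (text_adjust_factor + abs(point_contribution))
print(label_y)  # 0.275 -0.125 0.075
```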

gdp laspeyres contribution 2mon

The final chart example uses data for the last 3 months and is defined similarly.

#last three
chart_last_three<-filter(chart_data_set,data_date %in% c(last_date,last_date-months(1),last_date-months(2)))%>%
mutate(naics_text=paste(naics_code,NAICS))
#calculate an adjustment factor to move the text outside the bar
#R treats a date object as a continuous variable
#therefore, we convert it to a discrete text string for the group and fill values
text_adjust_factor<-.1*max(abs(chart_last_three$point_contribution))
plot3<-ggplot((chart_last_three),aes(y=point_contribution,x=naics_code,group=substr(data_date,1,7),
fill=substr(data_date,1,7),label=round(point_contribution,2)))+
labs(title="Percentage Point Contribution to Total GDP Change",
caption=paste("NDM:",gdp_table_id,"JCI"),
x=NULL,y="Percentage point contribution",
subtitle="Laspeyres data for selected aggregates")+
geom_bar(stat="identity",position=position_dodge(0.9),width=.8)+
scale_fill_discrete(name="Month")+
geom_text(size=2.5,position=position_dodge(0.9),aes(y=(sign(point_contribution)*
(text_adjust_factor+abs(point_contribution)))))+
theme(legend.title=element_blank(),legend.position="bottom")+
scale_x_discrete(limits=chart_last_three$naics_code[reverse_index],
labels=(chart_last_three$NAICS[reverse_index]))+coord_flip()
ggsave(plot3,file="gdp_laspeyres_contribution_3month.png")

In the two multi-date charts, the geom_text function utilizes an aes (“aesthetics”) specification to supply a y-axis (value) position which is calculated by adding the text adjustment to the absolute value of the point being plotted and then applying the original sign.  In this three-date version, the width option in geom_bar is used to slightly narrow the width to provide a bit of space between the bars.  A sample chart is shown below.

gdp laspeyres contribution 3month

The final stage in this script is to develop a table that can be plotted with the plotting software.  ggplot2 is based on the grid package.  Two extensions of the grid package, gridExtra and gtable, are used to add a title to the standard table grob (graphic object).  The initial step is to develop a tibble which includes a shortened character version of the date, converts the naics_text into a factor to preserve order in the table, and reduces the dataset to only the columns required: the sector text, the date and the contribution value.  The tidyr spread function is used to create a data frame with columns for each date.

#now build up a table for a chart table
# the date is shortened and the text is converted to factor to preserve order
table_data<-mutate(chart_last_three,date_text=substr(data_date,1,7),
row_text=factor(naics_text,levels=unique(naics_text)),contribution=round(point_contribution,2))%>%
select(row_text,date_text,contribution)%>%
rename(Sectors=row_text)
#now use spread to create a data frame with dates as columns
table_frame<-spread(table_data,key=date_text,value=contribution)

The final segment creates the actual table graphic and combines it with the third version of the charts above.  The default tableGrob creates a nice table, but with centered values; for this purpose, we want to right-justify everything, which is done with a modification to the default theme.  Then gtable commands are used to put a title at the top of the table and a footnote at the bottom.  A link is supplied in the comments to a Stack Overflow discussion of modifying a table grob in this way.  The comments explain what is going on.

library(gridExtra)
t1<-ttheme_default()
#refer to the vignette documentation
tt2 <- ttheme_default(core=list(fg_params=list(hjust=1, x=0.9)),
                      rowhead=list(fg_params=list(hjust=1, x=0.95)))
grob1<-tableGrob(table_frame,rows=NULL,theme=tt2)
#now add a title based on a stack overflow discussion
library(grid)
library(gtable)
grob_title<-textGrob("Percentage Contributions to Laspeyres Growth Rate",
gp=gpar(fontsize=15))
padding<-unit(.5,"line")
#create a new grob with room for the title at top (pos=0)
full_grob<-gtable_add_rows(grob1,heights=grobHeight(grob_title)+padding,pos=0)
full_grob<-gtable_add_grob(full_grob,list(grob_title),t=1,l=1,r=ncol(full_grob))
grob_footnote<-textGrob(paste("STATCAN NDM: ",gdp_table_id,"JCI"),
gp=gpar(fontsize=8))
full_grob<-gtable_add_rows(full_grob,heights=grobHeight(grob_footnote)+padding,
pos=nrow(full_grob))
full_grob<-gtable_add_grob(full_grob,list(grob_footnote),t=nrow(full_grob),l=2,r=ncol(full_grob))
plot(full_grob)
ggsave(full_grob,file="gdp_contributions_table_3months.png")
plot_list<-mget(c("plot3","full_grob"))
two_plots<-marrangeGrob(plot_list,ncol=2,nrow=1,top=NULL)
two_plot_name<-"gdp_contributions_analysis.png"
ggsave(file=two_plot_name,width=15,height=8,two_plots)

A grob is just a special data frame with instructions for plotting.  There is one row for each row of the table, and the code simply adds a row at the top and later one at the bottom.  Graphic versions of the title and footnote are inserted in the appropriate positions.

The table example is shown below.

gdp contributions table 3months

An example of the merged graphic with chart 3 and the table is shown below.

gdp contributions analysis

 

The important point of the analysis is to show whether there is significant variability in the sources of growth for aggregate GDP in recent months.

Accessing the Development Version of Mountainmath’s cansim Package for R

This note provides a vignette showing how to install the development version of the new cansim package for accessing the NDM. The vignette uses R utilities to install the GitHub version of the cansim package. More information about the developers can be found on the Mountainmath website at https://www.mountainmath.ca. I commend the blog posts to you for interesting information.

Packages USED

The key package installed in this vignette is the cansim package developed by Mountain Math. It retrieves data using the web services released in the New Dissemination Model version of CANSIM; the data are retrieved and converted to time series stored as xts objects. Key packages include

  • cansim – package developed by Mountain Math to access the NDM CANSIM
  • devtools – R utilities

Installing the Package

The devtools package makes it easy to install packages from GitHub development sites

devtools::install_github("mountainmath/cansim")
library(cansim)

The library statement will check that the cansim package is correctly installed.

Term Structure Analysis

In this example, we will use the fredr package to retrieve one of the most popular series from the US Federal Reserve database (FRED): the spread between the returns on 10-year and 2-year constant-maturity bonds.[1] The fredr package includes access to the basic series documentation; it requires that the user open an account once to obtain an API key. The new cansim package from Mountain Math will be used to retrieve the 10-year and 2-year benchmark bonds for Canada.

Packages USED

The key package used in this vignette is the fredr package which accesses the FRED API. Key packages include

  • dplyr – for processing the tibble prior to conversion to xts
  • ggplot2 – part of tidyverse for plotting the results
  • cansim – package to retrieve metadata and series from Statistics Canada’s NDM
  • broom, tbl2xts – utility packages for working with tibbles
  • fredr – the retrieval package to be used.[2]

Retrieving the DATA

The first step in the analysis is to go to the FRED site at https://fred.stlouisfed.org to open an account and obtain an API key. This is done only once; the key can be used for multiple R scripts. In this case, the key is saved in a CSV file in the folder used for this example and must be read from that file in each run to initialize the calls to the API.

#Retrieving from FRED and CANSIM
library(tidyverse)
library(fredr)
library(ggplot2)
library(xts)
library(tbl2xts)
setwd("D:\\OneDrive\\fred_test")
#set up the API key
user_api_key<-read.csv("fred_api_key.csv",stringsAsFactors=FALSE)
fredr_set_key(user_api_key$fredapi)

The next stage is to define the names for the series to be retrieved. The fredr package has search options which would facilitate the lookup of series ids. For simplicity, this was done in this example using the St. Louis web site. The next set of code retrieves the information about the series first. Each call to the routine fredr_series returns a tibble for one series. The routine fredr_series_observations returns the actual data as a tibble. The function tbl_xts from the package tbl2xts is used to translate the tibble to the xts time series construct.

us_series<-"t10y2y"
us_series_info<-fredr_series(us_series)
print(as.data.frame(us_series_info))
starting_point<-"2004-01-01"
us_data1<-fredr_series_observations(us_series,observation_start=as.Date(starting_point))
#start building up an xts data frame
us_xts1<-tbl_xts(us_data1)
colnames(us_xts1)<-c("us10y2y")

The next stage is to use the CANSIM procedure to retrieve two interest rates series and calculate the difference. The meta data for both series are retrieved to document the series in the log file. A Date column is constructed using a mutate command because the date, as retrieved from NDM, is not recognized as a regular date by R.

#now get Canadian data-we start in xts to get dates straight
cdn_irate_vbls<-c("v39055","v39051")
library(cansim)
meta1<-get_cansim_vector_info(cdn_irate_vbls)
print(data.frame(meta1))
cdn_irate_xts <- get_cansim_vector(cdn_irate_vbls,start_time=as.Date(starting_point)) %>%
mutate(Date=as.Date(REF_DATE)) %>%
tbl_xts(spread_by="VECTOR",cols_to_xts="VALUE")
tail(cdn_irate_xts)
#need to clean out the zero observations
cdn_irate_xts[rowSums(cdn_irate_xts)==0,]<-NA
#otherwise, we get an area plot
#calculate differences
cdn_yield<-cdn_irate_xts$v39055 - cdn_irate_xts$v39051
colnames(cdn_yield)<-c("cdn10y2y")

In the tbl_xts invocation, spread_by identifies the variable containing the series names. Several data edits are required: any zeros should be replaced with NA to avoid problems in the plot, because zero would otherwise be treated as a legitimate data point. The resulting series is given the label cdn10y2y to match the renaming of the US series above.
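The zero-row edit can be sketched with a plain matrix (the zero-filled middle row stands in for a non-trading day in this hypothetical example):

```r
rates <- rbind(c(2.10, 1.90),   # normal trading day (hypothetical yields)
               c(0.00, 0.00),   # row filled with zeros by the source
               c(2.15, 1.95))
# blank the zero rows so plotted lines break instead of plunging to zero
rates[rowSums(rates) == 0, ] <- NA
print(rates)
```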

The next stage removes the NAs, particularly the trailing ones, so that a legitimate last date can be calculated for use in graph labelling. The US and Canadian series are merged with the xts merge function and then converted to a plot tibble using the tidy procedure from the broom package.

#calculate last date
#squeeze out nas
cdn_spread<-na.omit(cdn_yield$cdn10y2y)
last_date<-as.Date(time(tail(cdn_spread,1)))
print(last_date)
us_cdn_xts<-merge(us_xts1,cdn_yield)
#use broom and tidy to make a plot tibble
library(broom)
plot_data<-tidy(us_cdn_xts)
print(plot_data)

The final stage is to prepare the plot using the tibble as input. The theme_economist_white theme from ggthemes is used as a convenience. That package has other interesting themes, and numerous further examples are available around the internet.

library(ggthemes)
term_plot<-ggplot(plot_data,aes(x=index,y=value,colour=series))+
theme_economist_white()+
theme(legend.position="bottom",legend.title=element_blank())+
labs(y="Difference in Percent",x=NULL,
title="Difference between 10-Year and 2-Year Bonds",
subtitle="Canada and US",
caption=paste("FRED: t10y2y","CANSIM:",paste(cdn_irate_vbls,collapse=" - "),"End:",last_date))+
scale_x_date(date_breaks="2 years",date_labels="%Y")+
geom_line()
ggsave(term_plot,file="term_plot.png")

The caption is noteworthy because the collapse option of paste is used to connect the two Canadian series with " - ". The last date is also inserted dynamically. The scale_x_date is used because the x data are in R date format. If date_labels is not supplied, the labels will show the day as well, because dates in R always include a day.
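The collapse behaviour can be checked in isolation; the inner paste joins a vector into one string before the outer paste assembles the caption:

```r
# replicate the caption fragment outside the plot call
cdn_irate_vbls <- c("v39055", "v39051")
caption_bit <- paste("CANSIM:", paste(cdn_irate_vbls, collapse = " - "))
# caption_bit is "CANSIM: v39055 - v39051"
```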

The resulting chart is shown below.

term plot


Processing NDM Tables

This note provides an example of retrieving and processing a complete NDM table in R. The script and log are included in a corresponding PDF in the downloads.

Packages Used

The key package used in this vignette is dplyr. The packages include:

  • dplyr – for processing the csv.
  • ggplot2 – part of tidyverse for plotting the results.

Retrieving the Data

The first step is to load the packages, define the NDM product ID, retrieve the required zip file and unzip it to access the csv file. The code checks for the existence of the csv and whether it should be refreshed.

#This vignette retrieves matrices from the LFS
setwd("D:\\OneDrive\\lfs_stuff")
project_folder<-getwd()
data_folder<-paste0(project_folder,"/")
library(lubridate)
library(tidyverse)
library(dplyr)
#base URL for NDM full-table csv downloads
url_prefix<-"https://www150.statcan.gc.ca/n1/tbl/csv/"
target_product<-"14-10-0017"
#set file_refresh to TRUE to force a fresh download
file_refresh<-FALSE
#first step is to test if the csv file is already available unzipped in the working folder
target_csv<-paste0(gsub("-","",target_product),".csv")
#see if it exists
if((!file.exists(target_csv))|file_refresh){
      #create url
      url_file<-gsub(".csv","-eng.zip",target_csv)
      calc_url<-paste0(url_prefix,url_file)
      download_target<-paste0(project_folder,"/",url_file)
      download.file(calc_url,download_target,mode="wb",quiet=FALSE)
      unzip(download_target,overwrite=TRUE,exdir=substr(data_folder,1,nchar(data_folder)-1))
      }

The next stage reads the CSV with the read_csv routine from the tidyverse readr component. The dplyr package is used to process the file, selecting only the Canada geography and selected variables. The column names of the csv file are best reviewed by opening the csv once with an editor such as Notepad++ or Notepad. Many CSVs cannot be fully loaded into Excel, and there is no need to use that package. In the code below, the mutate verb creates a new variable, Date, that is a standard R date. To do this, the REF_DATE string is converted to a regular date by adding a day ("01") to the end of each value. Overly long variable names and those containing blanks are renamed for convenience. Such names must be enclosed in back ticks (`) in order to be recognized by dplyr. The initial read and the select, mutate and rename verbs are all connected using pipes (%>%).

#now we are going to create a base tibble with just the estimate for Canada
#column names with spaces must be enclosed in back tick marks
#the date has to be modified to include a day to get the as.Date function to recognize it
#read_csv will also recognize a zip file and unzip
canada_lfs_estimate<-read_csv(file=target_csv,progress=FALSE) %>%filter(GEO=="Canada") %>%
select(REF_DATE,GEO,`Labour force characteristics`,Sex,`Age group`,VALUE) %>%
mutate(Date=as.Date(paste0(REF_DATE,"-01"))) %>%
rename(LFS_characteristics=`Labour force characteristics`,Age_group=`Age group`)
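The date repair in the mutate step can be verified on its own: appending a day turns the year-month REF_DATE string into something as.Date recognizes.

```r
# LFS-style year-month strings, as they arrive from NDM
ref_date <- c("2018-01", "2018-02")
fixed <- as.Date(paste0(ref_date, "-01"))  # append a day so as.Date can parse
```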

The next stage is to calculate annual data as averages of the seasonally unadjusted data. The advantage of this approach is that the current year is calculated as an average to date. The data are grouped by the key variables required and by the Year variable, which is created by a prior mutate verb. The year function from the lubridate package is used to extract the year. An ungroup verb is applied as the last stage of the piped set of commands to remove the grouping. Omitting it will confuse later dplyr processes, which would then unnecessarily require the grouping variables.

#calculating annual averages this way gives us the year-to-date value for the current year
annual_data<-canada_lfs_estimate%>%
mutate(Year=year(Date))%>%
group_by(GEO,LFS_characteristics,Age_group,Sex,Year) %>%
summarize(Annual_average=mean(VALUE))%>%
ungroup()
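The year-to-date behaviour of the averaging can be checked with base R; here aggregate stands in for the group_by/summarize pair, and the values are invented:

```r
# two complete months for 2023, two months to date for 2024
df <- data.frame(
  Date  = as.Date(c("2023-01-01", "2023-07-01", "2024-01-01", "2024-02-01")),
  VALUE = c(100, 110, 120, 130))
df$Year <- as.integer(format(df$Date, "%Y"))
ann <- aggregate(VALUE ~ Year, data = df, FUN = mean)
# each year's average uses only the months present, so the
# current year is automatically an average to date
```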

The next set of code gets the unique age group strings and selects the lowest level ones. The appropriate selection must be determined by prior inspection. Extracting the strings from the data file avoids problems from future edits by Statistics Canada of its classifications.

#now get the age_groups
available_age_groups<-unique(annual_data$Age_group)
print(available_age_groups)
#by inspection, the zero-level groups in this table are 1,5,6,10,11,12,13,16,17,19,21
zero_indices<-c(1,5,6,10,11,12,13,16,17,19,21)
zero_level_age_groups<-available_age_groups[zero_indices]
available_genders<-unique(annual_data$Sex)
#This version loops through gender
numb_genders<-length(available_genders)

The next stage is to start a loop over genders. Within the loop, the filter verb selects three tibbles, one each for the source population and for full- and part-time employment, on the basis of age group, LFS characteristic and gender.
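The %in% selection used in those filters can be illustrated with a base-R equivalent on a toy data frame (hypothetical groups and counts):

```r
# keep only the rows whose Age_group is in the chosen set
toy <- data.frame(
  Age_group = c("15 to 24 years", "15 to 19 years", "25 to 54 years"),
  VALUE     = c(2500, 1100, 9800))
keep <- c("15 to 24 years", "25 to 54 years")
lowest <- toy[toy$Age_group %in% keep, ]
```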

for (this_gender in 1:numb_genders){
      Gender<-available_genders[this_gender]
      source_population<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Population",Sex==Gender)
      full_time_employment<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Full-time employment",Sex==Gender)
      part_time_employment<-filter(annual_data,Age_group %in% zero_level_age_groups,LFS_characteristics=="Part-time employment",Sex==Gender)

The next segment of code calculates the full- and part-time rates by using a left_join to link the population with the corresponding employment on the basis of year and age group.

#now calculate the full-time rate as part of a left join
      full_time_rate<-left_join(source_population,(select(full_time_employment,Year,Age_group,Annual_average)),
      by=c("Year","Age_group"),
      suffix=c(".pop",".fe")) %>%
      mutate(employment_rate=100*Annual_average.fe/Annual_average.pop)%>%
      select(Year,Age_group,employment_rate)
      #now part time
      part_time_rate<-left_join(source_population,(select(part_time_employment,Year,Age_group,Annual_average)),by=c("Year","Age_group"),
      suffix=c(".pop",".pe")) %>%
      mutate(employment_rate=100*Annual_average.pe/Annual_average.pop)%>%
      select(Year,Age_group,employment_rate)
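The same join-and-divide step can be sketched in base R, with merge playing the role of left_join (the suffixes argument mirrors dplyr's suffix; the numbers are invented):

```r
# population and full-time employment for one year and age group
pop <- data.frame(Year = 2018, Age_group = "15 to 24 years", Annual_average = 4300)
fe  <- data.frame(Year = 2018, Age_group = "15 to 24 years", Annual_average = 2100)
joined <- merge(pop, fe, by = c("Year", "Age_group"), suffixes = c(".pop", ".fe"))
# rate is employment as a percentage of the source population
joined$employment_rate <- 100 * joined$Annual_average.fe / joined$Annual_average.pop
```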

The next set of code applies a negative sign to the full-time employment rates for plotting purposes. The plots below consist of two layers of bar charts, with the full-time measure negative to produce the pyramid appearance. The rates calculated above have the same column names, so they can be row bound with the bind_rows command. A named list of the tibbles is used so that the .id argument, Type, holds the list element names and distinguishes the data sets combined into the rate_set tibble.
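A base-R mimic of the named-list bind shows the structure rbind plus an explicit Type column produces (rates invented):

```r
# negated full-time rate and positive part-time rate, one age group each
ft <- data.frame(Age_group = "15 to 24", employment_rate = -35.1)
pt <- data.frame(Age_group = "15 to 24", employment_rate = 25.4)
ft$Type <- "Full time"   # what bind_rows derives from the list names via .id
pt$Type <- "Part time"
rate_set <- rbind(ft, pt)
```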

The test years are defined in a vector passed to a filter command to select the data for plotting. Breaks for the value (y) axis are also defined; since some are negative, the labels are built by taking absolute values.

#for purposes of pyramid, full-time rate is set negative
      full_time_rate<-mutate(full_time_rate,employment_rate=-employment_rate)
      bind_list<-list(full_time_rate,part_time_rate)
      names(bind_list)<-c("Full time","Part time")
      rate_set<-bind_rows(bind_list,.id="Type") %>%
      mutate(Age_group=gsub(" years","",Age_group))
      #as a test plot select one year
      plot_data<-filter(rate_set,Year %in% c(2000,2007,2015,2018))
      break_list<-seq(-80,40,by=20)
      break_labels<-as.character(abs(break_list))

The final set of code defines the plot. The data are grouped by Year so that there is one panel for each selected year, and the facet_wrap verb creates the required two-column arrangement of the year panels. The coordinates are flipped so that the value (y) axis is at the bottom.

p1<-ggplot(plot_data,aes(x=Age_group,y=employment_rate,group=Year,fill=Type,
      label=abs(round(employment_rate,1))))+
      geom_bar(data=dplyr::filter(plot_data,Type=="Full time"),stat="identity")+
      geom_bar(data=dplyr::filter(plot_data,Type=="Part time"),stat="identity")+
      labs(title="LFS Employment Rates by Age",caption="Statistics Canada, JCI",
      subtitle=Gender,
      y="Percentage Share of Population",x="Age Group")+geom_text(size=3)+
      scale_y_continuous(breaks=break_list,labels=break_labels)+
      coord_flip()+
      facet_wrap(~Year,ncol=2)+theme(legend.position="bottom")
  ggsave(p1,file=paste0(Gender,"_employment_rate_pyramids_labels.png"))
}

The last curly bracket terminates the for loop. Three facet plots are produced, one for each gender category. The gender name is included in the output file name as well as in the subtitle of the plot. The critical part is the data specification in each geom_bar layer, which separates the input dataset into the left and right halves required for the pyramid. The dplyr filter is used to define the data required. The breaks calculated above are passed to the scale_y_continuous function in the plot development. The legend for the employment types is placed at the bottom rather than the default side position.

The plots are shown below.

Both sexes employment rate pyramids labels

Males employment rate pyramids labels

Females employment rate pyramids labels