1 Background

Forcing could be stored in multiple files, either in input forcing files (such as LDASIN or PRECIP_FORCING files) or in the output files (LDASOUT). LDASOUT files may contain a variable called ACCPRCP storing the accumulated precipitation, and the rainfall depth can be obtained by subtracting two consecutive time steps. LDASIN and PRECIP_FORCING files store rain rate in RAINRATE and precip_rate variables, respectively. This vignette serves as a short explanation on how to retrieve data and perform some basic comparisons between two sets of data.

Load rwrfhydro package.

> library(rwrfhydro)

2 Import observed datasets

rwrfhydro provides few functions to retrieve observations from several networks. GHCN-Daily and USCRN networks are discussed below.

2.1 USCRN

The US Climate Reference Network (USCRN) is a network of monitoring stations equipped with research quality instruments. Beside precipitation, these gauges report temperature, soil moisture and soil temperature. The precipitation is measured every 5 minutes using three independent measurements in a weighing bucket gauge accompanied with a disdrometer reporting the presence or absence of precipitation. These gauges, in the cold climate, are equipped with heating tape around the throat of the weighing gauge to prevent the frozen precipitation from accumulating on the interior walls and capping the gauge. The redundancy in the measurements is to ensure the quality of the measurements.

Data is provided in 4 different temporal resolution (subhourly, hourly, daily and monthly), and depending on the temporal resolution, the variables provided change. For more information on the data and how to retrieve, refer to the man page of Get_USCRN.

2.2 GHCN-daily

Global Historical Climatology Network-Daily (GHCN-D) dataset contains daily data from around 80000 surface station in the world, which about two third of them are precipitation only (Menne et al. 2012). It is the most complete collection of U.S. daily data available (Menne et al. 2012). The dataset undergo an automated quality assurance which the details can be found in Durre et al. 2008; 2010. Data is available on http://www1.ncdc.noaa.gov/pub/data/ghcn/daily and is updated frequently. Data is available in two formats either categorized by gauge station or categorized by year. Accordingly, there are two functions to pull GHCN-daily data from these two sources called GetGhcn and GetGhcn2.

2.2.1 Gauge selection

First step is to select the gauges you want to use for verification based on some criteria. GHCN-daily contains precipitation data from different sources such as COOP or CoCoRaHS. The selection criteria can be country code, states if country is US, type of rain gauge network (for example CoCoRaHS), or a rectangle domain.

> # Return all the gauges within US from observation network of COOP (C) and CoCoRaHS (1)
> countryCodeList <- c("US")
> networkCodeList <- c("1","C")
> sg <- SelectGhcnGauges(countryCode=countryCodeList,
+                        networkCode=networkCodeList)
> str(sg)
'data.frame':   54867 obs. of  12 variables:
 $ country    : chr  "US" "US" "US" "US" ...
 $ network    : chr  "1" "1" "1" "1" ...
 $ stationID  : chr  "0RMHS145" "0adam001" "0adam002" "0adam003" ...
 $ latitude   : num  40.5 40.6 40.5 40.5 40.5 ...
 $ longitude  : num  -105.1 -98.5 -98.5 -98.7 -98.4 ...
 $ elevation  : num  1569 598 601 615 570 ...
 $ state      : chr  "CO" "NE" "NE" "NE" ...
 $ name       : chr  "RMHS 1.6 SSW                  " "JUNIATA 1.5 S                 " "JUNIATA 6.0 SSW               " "HOLSTEIN 0.1 NW               " ...
 $ GSNflag    : chr  "   " "   " "   " "   " ...
 $ HCN/CRNflag: chr  "   " "   " "   " "   " ...
 $ WMOID      : chr  "    " "    " "    " "    " ...
 $ siteIds    : chr  "US10RMHS145" "US10adam001" "US10adam002" "US10adam003" ...

The sg dataframe has all the information provided by NCDC about each gauge. For the rest of this vignette we will use only the domain of croton which is the case study provided. We use the rectangle domain containing croton, as the boundary to collect all the gauges’ information.

> sg <- SelectGhcnGauges(domain = TRUE, 
+                        minLat = 41.41, maxLat = 41.55, 
+                        minLon = -73.83, maxLon=-73.65)
> str(sg)
'data.frame':   6 obs. of  12 variables:
 $ country    : chr  "US" "US" "US" "US" ...
 $ network    : chr  "1" "C" "C" "C" ...
 $ stationID  : chr  "NYPT0001" "00300808" "00301207" "00301211" ...
 $ latitude   : num  41.5 41.5 41.4 41.5 41.5 ...
 $ longitude  : num  -73.7 -73.7 -73.7 -73.7 -73.7 ...
 $ elevation  : num  249 171 162 207 279 ...
 $ state      : chr  "NY" "NY" "NY" "NY" ...
 $ name       : chr  "CARMEL HAMLET 5.8 N           " "BOYDS CORNERS                 " "CARMEL                        " "CARMEL 4N                     " ...
 $ GSNflag    : chr  "   " "   " "   " "   " ...
 $ HCN/CRNflag: chr  "   " "   " "   " "   " ...
 $ WMOID      : chr  "    " "    " "    " "    " ...
 $ siteIds    : chr  "US1NYPT0001" "USC00300808" "USC00301207" "USC00301211" ...

2.2.2 GetGhcn

GHCN-daily data are archived for each individual gauge in a text file in here. Precipitating can be downloaded for a single site or multiple ones by setting element to “PRCP” and specifying the desired start and end dates. Notice, precipitation values are converted from 10th of mm to mm.

> startDate <- "2011/08/20"
> endDate <- "2011/09/05"
> element <- "PRCP"
> obsPrcp <- GetGhcn(sg$siteIds, element, startDate, endDate, parallel = FALSE)
> str(obsPrcp)
'data.frame':   34 obs. of  5 variables:
 $ siteIds  : chr  "US1NYPT0001" "US1NYPT0001" "US1NYPT0001" "US1NYPT0001" ...
 $ Date     : Date, format: "2011-09-01" "2011-09-02" ...
 $ dailyGhcn: num  0 0 -1000 -1000 -1000 ...
 $ qFlag    : chr  NA NA NA NA ...
 $ element  : chr  "PRCP" "PRCP" "PRCP" "PRCP" ...

2.2.3 GetGhcn2

NCDC also provides GHCN-daily categorized by year here. If the number of the gauges is big, GetGhcn2 would be a much faster option to retrieve data. It has the same arguments as GetGhcn.

3 Import forcing/precipitation data

Forcing data used in WRF-Hydro modeling are usually stored in forcing files (such as LDASIN or PRECIP_FORCING files). Here we are going to use the data provided by the “croton” case study.

Set the path to the Croton test case. First list of all the forcing files.

> forcingPath <- "~/wrf-hydro-training/example_case/FORCING/"
> files <- list.files(path = forcingPath, full.names = TRUE, pattern = glob2rx("2011*LDASIN_DOMAIN1"))

One way to pull data from the netcdf files is to have the indices of the gage locations. However, only lat/lon locations of rain gauges are provided when using SelectGhcnGauges function. Therefore, it is required to map lat/lon information to i/j indices. This can be done using GetGeogridIndex function in rwrfhydro. One needs to provide the address to geogrid file, the lat/lon info of the gauge locations and the GetGeogridIndex function return a dataframe with two column sn (south-north) and we (west-east).

> geoFile <- '~/wrf-hydro-training/example_case/Gridded/DOMAIN/geo_em.d01.nc'
> rainGaugeInds <- GetGeogridIndex(xy = data.frame(lon=sg$longitude, lat=sg$latitude),
+                                   ncfile = geoFile)
> sg <- cbind(sg,rainGaugeInds)
> head(sg)
      country network stationID latitude longitude elevation state
69111      US       1  NYPT0001  41.4969  -73.7017     249.0    NY
91840      US       C  00300808  41.4500  -73.7333     171.0    NY
91883      US       C  00301207  41.4333  -73.6833     161.5    NY
91884      US       C  00301211  41.4725  -73.6550     207.3    NY
92424      US       C  00308304  41.5333  -73.7333     278.9    NY
92518      US       C  00309413  41.5000  -73.7500     241.1    NY
                                name GSNflag HCN/CRNflag WMOID     siteIds
69111 CARMEL HAMLET 5.8 N                                      US1NYPT0001
91840 BOYDS CORNERS                                            USC00300808
91883 CARMEL                                                   USC00301207
91884 CARMEL 4N                                                USC00301211
92424 STORMVILLE                                               USC00308304
92518 WHITE POND                                               USC00309413
      we sn
69111 10 12
91840  9  7
91883 14  6
91884 15 11
92424  7 15
92518  7 11

Now it is time to pull data from the files. One way is to use the GetMultiNcdf function in rwrfhydro. You need to prepare the file, var, and ind variables for GetMultiNcdf function. You can leave the stat as mean; since you are pulling data for a single pixel.

> flList <- list(forcing = files)
> varList <- list(forcing = list(PRCP = 'RAINRATE'))
> prcpIndex <- list()
> for (i in 1:length(sg$siteIds)) {
+   if (!is.na(sg$we[i]) & !is.na(sg$sn[i])) {
+     prcpIndex[[as.character(sg$siteIds[i])]] <- list(start=c(sg$we[i], sg$sn[i],1),
+                                                      end=c(sg$we[i], sg$sn[i],1), stat="mean")
+   }
+ }
> indList <-list(forcing = list(PRCP = prcpIndex))
> prcpData <- GetMultiNcdf(file = flList, var = varList, ind = indList, parallel=FALSE)
> head(prcpData)
     POSIXct            inds stat     statArg variable value variableGroup
1 2011-08-26 10:10,12:12,1:1 mean US1NYPT0001 RAINRATE     0          PRCP
2 2011-08-26     9:9,7:7,1:1 mean USC00300808 RAINRATE     0          PRCP
3 2011-08-26   14:14,6:6,1:1 mean USC00301207 RAINRATE     0          PRCP
4 2011-08-26 15:15,11:11,1:1 mean USC00301211 RAINRATE     0          PRCP
5 2011-08-26   7:7,15:15,1:1 mean USC00308304 RAINRATE     0          PRCP
6 2011-08-26   7:7,11:11,1:1 mean USC00309413 RAINRATE     0          PRCP
  fileGroup
1   forcing
2   forcing
3   forcing
4   forcing
5   forcing
6   forcing

GetMultiNcdf pulls the time information from the netcdf files, if the data is not prepared properly, and the time info is not available, it will return the name of the file instead. In that case, time should be retrieved from the file name which is saved in column POSIXct. Since the obsPrcp data are in mm, we also convert the rainrate to rain depth in mm.

> prcpData$value <- prcpData$value*3600

3.1 Aggregating hourly data into daily.

Each GHCN gauge has a unique reporting time which the daily data is been calculated based on that. The reporting time is archived in the csv files and is retrieved when calling GetGhcn2 function (you will not get the reporting time using GetGhcn). We need to add the reporting time for each point which would be the base for daily aggregation. If there will not be any reportTime in sg columns, then it uses the default which is 0700 AM.

> if ("reportTime" %in% names(prcpData)) {
+   sg$reportTime <- obsPrcp$reportTime[match(sg$siteIds, obsPrcp$siteIds)]
+   sg$reportTime[which (sg$reportTime=="" | is.na(sg$reportTime))] <-700
+ }else{
+   sg$reportTime<- 700
+ }

Call the CalcDailyGhcn to convert the hourly data (model input) to daily data compatible with GHCN-Daily data. This function takes the following steps:

  1. It first searches for a field called timeZone in the sg (selected gauges) dataframe. If the time zone does not exists, it will call GetTimeZone(sg) to add the time zone information. sg requires at least two fields of latitude and longitude so that GetTimeZone works.
  2. After adding time zone for each gauge, the time offset will be added from the tzLookup data provided in rwrfhydro. The time offset is used to convert UTC time to Local Standard Time (LST) since the GHCN-D is reported in LST.
  3. Then it aggregates the hourly precipitation to daily based on the reporting time of individual gauges and return the dailyData. numberOfDataPoints column has the number of hours with observations in a day so you can remove days without complete record.
> names(prcpData)[names(prcpData) == 'value'] <- 'DEL_ACCPRCP'
> dailyData <- CalcDailyGhcn(sg = sg,prcp = prcpData)
> head(dailyData)
     ghcnDay     statArg   dailyPrcp numberOfDataPoints
1 2011-08-26 US1NYPT0001 0.012850822                 13
2 2011-08-26 USC00300808 0.003034311                 13
3 2011-08-26 USC00301207 0.001150713                 13
4 2011-08-26 USC00301211 0.008623618                 13
5 2011-08-26 USC00308304 0.017147184                 13
6 2011-08-26 USC00309413 0.009485080                 13

4 Comparing daily QPE/QPF versus GHCN-D

Final step if to find the common data between the two dataset (precipitation time series (dailyData) and the observed GHCN-D (obsPrcp)). If you have a big dataset, it is recommended to use data.table instead of data.frame which is incredibly faster.

> common <- merge(dailyData, obsPrcp,
+                 by.x=c("ghcnDay","statArg"),
+                 by.y=c("Date","siteIds"))
> head(common)
     ghcnDay     statArg    dailyPrcp numberOfDataPoints dailyGhcn qFlag
1 2011-08-26 US1NYPT0001 1.285082e-02                 13    -999.9  <NA>
2 2011-08-26 USC00301211 8.623618e-03                 13      11.9  <NA>
3 2011-08-27 US1NYPT0001 0.000000e+00                 24       5.6  <NA>
4 2011-08-27 USC00301211 0.000000e+00                 24       0.0  <NA>
5 2011-08-28 US1NYPT0001 1.394508e+02                 24    -999.9  <NA>
6 2011-08-28 USC00301211 1.404404e+02                 24      91.9     L
  element
1    PRCP
2    PRCP
3    PRCP
4    PRCP
5    PRCP
6    PRCP

Call the CalcStatCont function and it returns all the requested statistics. The default are numPaired (number of paired data), meanObs (mean of observation data), meanMod (mean of model/forecast data), pearsonCor (Pearson correlation coefficient), RMSE (root mean square error), and multiBias (multiplicative bias).

> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp" , 
+                      obsMissing = -999.9, groupBy = "statArg")

> # CalcStatCont will return a list having two elements of stat and plotList.
> names(stat)
[1] "stat"     "plotList"
> # Check the statistics 
> stat$stat
      statArg numPaired  meanObs   meanMod pearsonCor     RMSE multiBias
1 USC00301211         8 19.32500 23.710249  0.9772098 17.67777 1.2269211
2 US1NYPT0001         6 25.01667  8.521054  0.9992682 38.18890 0.3406151

If the groupBy is NULL then it will return the overall stats for all the data and provide four informative plots.

> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp", obsMissing = -999.9, title = common$statArg)
Warning: Removed 3 rows containing missing values (geom_tile).

You can choose among the four plots by changing the plot.list argument. You can also calculate conditional statistics by defining the boundaries you are interested in. For example, here we calculate the statistics conditioned on the observation to be greater than 1 mm.

> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp" , 
+                      obsCondRange = c(1, Inf), plot.list = "scatterPlot")

> stat$stat
  numPaired meanObs  meanMod pearsonCor     RMSE multiBias
1         5   60.94 48.16129  0.6088967 47.43496 0.7903067

4.1 Calculate statistics over RFCs

Sometime the verification result at the gauge location is not desired and we want to find the performance of a model over a domain or polygon. If you want to calculate statistics over RFC’s, then use GetRfc function. One can find out a gauge (point) falls in which RFC using GetRfc. You simply feed a dataframe having at least two columns of latitude and longitude and this functions adds a column to a dataframe with RFC name.

> # add rfc name
> sg <- GetRfc(sg)
> 
> # check what is been added
> head(sg)
  country network stationID elevation state                           name
1      US       1  NYPT0001     249.0    NY CARMEL HAMLET 5.8 N           
2      US       C  00300808     171.0    NY BOYDS CORNERS                 
3      US       C  00301207     161.5    NY CARMEL                        
4      US       C  00301211     207.3    NY CARMEL 4N                     
5      US       C  00308304     278.9    NY STORMVILLE                    
6      US       C  00309413     241.1    NY WHITE POND                    
  GSNflag HCN.CRNflag WMOID     siteIds we sn reportTime   rfc longitude
1                           US1NYPT0001 10 12        700 NERFC  -73.7017
2                           USC00300808  9  7        700 NERFC  -73.7333
3                           USC00301207 14  6        700 NERFC  -73.6833
4                           USC00301211 15 11        700 NERFC  -73.6550
5                           USC00308304  7 15        700 NERFC  -73.7333
6                           USC00309413  7 11        700 NERFC  -73.7500
  latitude
1  41.4969
2  41.4500
3  41.4333
4  41.4725
5  41.5333
6  41.5000

Now, add a column to the common data having the rfc information for each data. And calculate the statistics based on grouping by RFC.

> # merge the common with the sg data.frame
> common <- merge(common,sg[, c("siteIds", "rfc")],
+                 by.x=c("statArg"),
+                 by.y=c("siteIds"))
> 
> # calculate statistics using grouping by rfc
> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp" , 
+                      groupBy = "rfc", obsMissing = -999.9, plot.it = FALSE)
> 
> stat$stat
    rfc numPaired  meanObs  meanMod pearsonCor     RMSE multiBias
1 NERFC        14 21.76429 17.20059  0.7663332 28.34781 0.7903128

4.2 Calculate categorical statistics

You can also calculate some of the categorical statistics using CalcStatCategorical function. It accepts both categorical variable and continuous ones. If the data is actually categorical, variable category should be defined. The elements in category will be used as YES and NO in contingency table. If the data is numeric, then a set of thresholds should be defined. Values exceeding the threshold would be flagged as YES and the values below the threshold are considered NO. You can choose from the available statistics by changing the statList argument. By default, it calculates the Hit Rate (H), False Alarm Ratio (FAR) and Critical Success Index (CSI). The grouping option is similar to CalcStatCont.

> # calculate categorical statistics
> stat <- CalcStatCategorical(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp", 
+                             obsMissing = -999.9, groupBy = "statArg", threshold = c(1,5))
> print(stat)
      statArg threshold         H FAR       CSI
1 US1NYPT0001         1 0.5000000   0 0.5000000
2 USC00301211         1 0.6666667   0 0.6666667
3 US1NYPT0001         5 0.5000000   0 0.5000000
4 USC00301211         5 0.6666667   0 0.6666667