Forcing data can be stored in multiple files, either in input forcing files (such as LDASIN or PRECIP_FORCING files) or in the model output files (LDASOUT). LDASOUT files may contain a variable called ACCPRCP storing the accumulated precipitation, and the rainfall depth can be obtained by differencing two consecutive time steps. LDASIN and PRECIP_FORCING files store the rain rate in the RAINRATE and precip_rate variables, respectively. This vignette is a short explanation of how to retrieve these data and perform some basic comparisons between two sets of data.
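For instance, a minimal sketch of recovering an hourly rainfall depth from ACCPRCP in two consecutive LDASOUT files might look like the following (the file names are hypothetical and the reading is done with the ncdf4 package):
> # a minimal sketch, assuming hypothetical LDASOUT file names from an hourly run
> library(ncdf4)
> nc1 <- nc_open("201108260100.LDASOUT_DOMAIN1")
> nc2 <- nc_open("201108260200.LDASOUT_DOMAIN1")
> # depth over the hour (mm) = difference of the accumulated precipitation
> rainDepth <- ncvar_get(nc2, "ACCPRCP") - ncvar_get(nc1, "ACCPRCP")
> nc_close(nc1); nc_close(nc2)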
Load the rwrfhydro package.
> library(rwrfhydro)
rwrfhydro provides a few functions to retrieve observations from several networks. The GHCN-Daily and USCRN networks are discussed below.
The US Climate Reference Network (USCRN) is a network of monitoring stations equipped with research-quality instruments. Besides precipitation, these gauges report temperature, soil moisture, and soil temperature. Precipitation is measured every 5 minutes using three independent measurements in a weighing-bucket gauge, accompanied by a disdrometer reporting the presence or absence of precipitation. In cold climates, the gauges are equipped with heating tape around the throat of the weighing gauge to prevent frozen precipitation from accumulating on the interior walls and capping the gauge. The redundant measurements ensure data quality.
Data are provided at 4 different temporal resolutions (subhourly, hourly, daily, and monthly), and the variables provided change with the temporal resolution. For more information on the data and how to retrieve them, refer to the man page of Get_USCRN.
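As a rough illustration of the underlying data source (the archive URL, file-naming pattern, and station file below are assumptions and should be checked against the archive's documentation; Get_USCRN wraps this retrieval for you):
> # a hedged sketch reading an hourly USCRN text file with base R;
> # the URL pattern and station file name are assumptions
> url <- paste0("https://www.ncei.noaa.gov/pub/data/uscrn/products/",
+               "hourly02/2011/CRNH0203-2011-NY_Millbrook_3_W.txt")
> uscrn <- read.table(url, header = FALSE)
> # column meanings are documented in the archive's HEADERS.txt file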
The Global Historical Climatology Network-Daily (GHCN-D) dataset contains daily data from around 80,000 surface stations around the world, about two thirds of which are precipitation-only (Menne et al. 2012). It is the most complete collection of U.S. daily data available (Menne et al. 2012). The dataset undergoes automated quality assurance, the details of which can be found in Durre et al. (2008; 2010). Data are available at http://www1.ncdc.noaa.gov/pub/data/ghcn/daily and are updated frequently, in two formats: categorized by gauge station or categorized by year. Accordingly, there are two functions to pull GHCN-Daily data from these two sources, called GetGhcn and GetGhcn2.
The first step is to select the gauges you want to use for verification based on some criteria. GHCN-Daily contains precipitation data from different sources such as COOP or CoCoRaHS. The selection criteria can be country code, state (if the country is the US), type of rain gauge network (for example CoCoRaHS), or a rectangular domain.
> # Return all the gauges within US from observation network of COOP (C) and CoCoRaHS (1)
> countryCodeList <- c("US")
> networkCodeList <- c("1","C")
> sg <- SelectGhcnGauges(countryCode=countryCodeList,
+ networkCode=networkCodeList)
> str(sg)
'data.frame': 54227 obs. of 12 variables:
$ country : chr "US" "US" "US" "US" ...
$ network : chr "1" "1" "1" "1" ...
$ stationID : chr "0RMHS145" "0adam001" "0adam002" "0adam003" ...
$ latitude : num 40.5 40.6 40.5 40.5 40.5 ...
$ longitude : num -105.1 -98.5 -98.5 -98.7 -98.4 ...
$ elevation : num 1569 598 601 615 570 ...
$ state : chr "CO" "NE" "NE" "NE" ...
$ name : chr "RMHS 1.6 SSW " "JUNIATA 1.5 S " "JUNIATA 6.0 SSW " "HOLSTEIN 0.1 NW " ...
$ GSNflag : chr " " " " " " " " ...
$ HCN/CRNflag: chr " " " " " " " " ...
$ WMOID : chr " " " " " " " " ...
$ siteIds : chr "US10RMHS145" "US10adam001" "US10adam002" "US10adam003" ...
The sg dataframe has all the information provided by NCDC about each gauge. For the rest of this vignette we will use only the domain of the Croton case study. We use the rectangular domain containing Croton as the boundary to collect all the gauges' information.
> sg <- SelectGhcnGauges(domain = TRUE,
+ minLat = 41.41, maxLat = 41.55,
+ minLon = -73.83, maxLon=-73.65)
> str(sg)
'data.frame': 6 obs. of 12 variables:
$ country : chr "US" "US" "US" "US" ...
$ network : chr "1" "C" "C" "C" ...
$ stationID : chr "NYPT0001" "00300808" "00301207" "00301211" ...
$ latitude : num 41.5 41.5 41.4 41.5 41.5 ...
$ longitude : num -73.7 -73.7 -73.7 -73.7 -73.7 ...
$ elevation : num 249 171 162 207 279 ...
$ state : chr "NY" "NY" "NY" "NY" ...
$ name : chr "CARMEL HAMLET 5.8 N " "BOYDS CORNERS " "CARMEL " "CARMEL 4N " ...
$ GSNflag : chr " " " " " " " " ...
$ HCN/CRNflag: chr " " " " " " " " ...
$ WMOID : chr " " " " " " " " ...
$ siteIds : chr "US1NYPT0001" "USC00300808" "USC00301207" "USC00301211" ...
GHCN-Daily data are archived for each individual gauge in a text file on the NCDC server (see the URL above). Precipitation can be downloaded for a single site or multiple sites by setting element to "PRCP" and specifying the desired start and end dates. Note that precipitation values are converted from tenths of mm to mm.
> startDate <- "2011/08/20"
> endDate <- "2011/09/05"
> element <- "PRCP"
> obsPrcp <- GetGhcn(sg$siteIds, element, startDate, endDate, parallel = FALSE)
> str(obsPrcp)
'data.frame': 34 obs. of 5 variables:
$ siteIds : chr "US1NYPT0001" "US1NYPT0001" "US1NYPT0001" "US1NYPT0001" ...
$ Date : Date, format: "2011-09-01" "2011-09-02" ...
$ dailyGhcn: num 0 0 -1000 -1000 -1000 ...
$ qFlag : chr NA NA NA NA ...
$ element : chr "PRCP" "PRCP" "PRCP" "PRCP" ...
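Note the -1000 values in dailyGhcn above: GHCN-Daily encodes missing observations as -9999 tenths of mm, which becomes -999.9 mm after the unit conversion (str rounds it to -1000). The CalcStatCont calls later in this vignette handle this through their obsMissing argument, but if you summarize or plot obsPrcp directly it is safer to mask the code first, for example:
> # mask the GHCN missing code on a copy before computing summaries
> obsPrcpClean <- obsPrcp
> obsPrcpClean$dailyGhcn[obsPrcpClean$dailyGhcn <= -999] <- NA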
NCDC also provides GHCN-Daily data categorized by year on the same server. If the number of gauges is large, GetGhcn2 is a much faster option to retrieve the data. It has the same arguments as GetGhcn.
Forcing data used in WRF-Hydro modeling are usually stored in forcing files (such as LDASIN or PRECIP_FORCING files). Here we are going to use the data provided by the Croton case study.
Set the path to the Croton test case, then list all the forcing files.
> forcingPath <- "~/wrf-hydro-training/example_case/FORCING/"
> files <- list.files(path = forcingPath, full.names = TRUE, pattern = glob2rx("2011*LDASIN_DOMAIN1"))
One way to pull data from netcdf files is to use the indices of the gauge locations. However, the SelectGhcnGauges function provides only the lat/lon locations of the rain gauges, so the lat/lon information must be mapped to i/j indices. This can be done with the GetGeogridIndex function in rwrfhydro: given the path to the geogrid file and the lat/lon of the gauge locations, GetGeogridIndex returns a dataframe with two columns, sn (south-north) and we (west-east).
> geoFile <- '~/wrf-hydro-training/example_case/Gridded/DOMAIN/geo_em.d01.nc'
> rainGaugeInds <- GetGeogridIndex(xy = data.frame(lon=sg$longitude, lat=sg$latitude),
+ ncfile = geoFile)
> sg <- cbind(sg,rainGaugeInds)
> head(sg)
country network stationID latitude longitude elevation state
68694 US 1 NYPT0001 41.4969 -73.7017 249.0 NY
91176 US C 00300808 41.4500 -73.7333 171.0 NY
91219 US C 00301207 41.4333 -73.6833 161.5 NY
91220 US C 00301211 41.4725 -73.6550 207.3 NY
91760 US C 00308304 41.5333 -73.7333 278.9 NY
91854 US C 00309413 41.5000 -73.7500 241.1 NY
name GSNflag HCN/CRNflag WMOID siteIds
68694 CARMEL HAMLET 5.8 N US1NYPT0001
91176 BOYDS CORNERS USC00300808
91219 CARMEL USC00301207
91220 CARMEL 4N USC00301211
91760 STORMVILLE USC00308304
91854 WHITE POND USC00309413
we sn
68694 10 12
91176 9 7
91219 14 6
91220 15 11
91760 7 15
91854 7 11
Now it is time to pull data from the files. One way is to use the GetMultiNcdf function in rwrfhydro. You need to prepare the file, var, and ind arguments for GetMultiNcdf. You can leave stat as mean, since you are pulling data for a single pixel.
> flList <- list(forcing = files)
> varList <- list(forcing = list(PRCP = 'RAINRATE'))
> prcpIndex <- list()
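> # for each gauge that falls inside the model domain, record the netcdf
> # start/end indices (we, sn, time) of its pixel for GetMultiNcdf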
> for (i in 1:length(sg$siteIds)) {
+ if (!is.na(sg$we[i]) & !is.na(sg$sn[i])) {
+ prcpIndex[[as.character(sg$siteIds[i])]] <- list(start=c(sg$we[i], sg$sn[i],1),
+ end=c(sg$we[i], sg$sn[i],1), stat="mean")
+ }
+ }
> indList <-list(forcing = list(PRCP = prcpIndex))
> prcpData <- GetMultiNcdf(file = flList, var = varList, ind = indList, parallel=FALSE)
> head(prcpData)
POSIXct inds stat statArg variable value variableGroup
1 2011-08-26 10:10,12:12,1:1 mean US1NYPT0001 RAINRATE 0 PRCP
2 2011-08-26 9:9,7:7,1:1 mean USC00300808 RAINRATE 0 PRCP
3 2011-08-26 14:14,6:6,1:1 mean USC00301207 RAINRATE 0 PRCP
4 2011-08-26 15:15,11:11,1:1 mean USC00301211 RAINRATE 0 PRCP
5 2011-08-26 7:7,15:15,1:1 mean USC00308304 RAINRATE 0 PRCP
6 2011-08-26 7:7,11:11,1:1 mean USC00309413 RAINRATE 0 PRCP
fileGroup
1 forcing
2 forcing
3 forcing
4 forcing
5 forcing
6 forcing
GetMultiNcdf pulls the time information from the netcdf files; if the data are not prepared properly and the time info is not available, it returns the name of the file instead. In that case, the time should be retrieved from the file name, which is saved in the POSIXct column. Since the obsPrcp data are in mm and RAINRATE is in mm/s, we also convert the hourly rain rate to rain depth in mm.
> prcpData$value <- prcpData$value*3600
Each GHCN gauge has a unique reporting time on which the daily totals are based. The reporting time is archived in the csv files and is retrieved when calling the GetGhcn2 function (you will not get the reporting time using GetGhcn). We need to add the reporting time for each point, which will be the basis for the daily aggregation. If there is no reportTime column in sg, the default of 0700 AM is used.
> if ("reportTime" %in% names(prcpData)) {
+ sg$reportTime <- obsPrcp$reportTime[match(sg$siteIds, obsPrcp$siteIds)]
+ sg$reportTime[which (sg$reportTime=="" | is.na(sg$reportTime))] <-700
+ }else{
+ sg$reportTime<- 700
+ }
Call CalcDailyGhcn to convert the hourly data (model input) to daily data compatible with the GHCN-Daily data. This function takes the following steps:
- Checks for timeZone in the sg (selected gauges) dataframe. If the time zone does not exist, it calls GetTimeZone(sg) to add the time zone information. sg requires at least the two fields latitude and longitude so that GetTimeZone works.
- Finds the time offset between UTC and local time using the tzLookup data provided in rwrfhydro. The offset is used to convert UTC time to Local Standard Time (LST), since GHCN-D is reported in LST.
- Aggregates the hourly values to daily values and returns dailyData. The numberOfDataPoints column has the number of hours with observations in a day, so you can remove days without a complete record.
> # CalcDailyGhcn expects the precipitation column to be named DEL_ACCPRCP
> names(prcpData)[names(prcpData) == 'value'] <- 'DEL_ACCPRCP'
> dailyData <- CalcDailyGhcn(sg = sg, prcp = prcpData)
> head(dailyData)
ghcnDay statArg dailyPrcp numberOfDataPoints
1 2011-08-26 US1NYPT0001 0.012850822 13
2 2011-08-26 USC00300808 0.003034311 13
3 2011-08-26 USC00301207 0.001150713 13
4 2011-08-26 USC00301211 0.008623618 13
5 2011-08-26 USC00308304 0.017147184 13
6 2011-08-26 USC00309413 0.009485080 13
The final step is to find the common data between the two datasets: the model precipitation time series (dailyData) and the observed GHCN-D (obsPrcp). If you have a big dataset, it is recommended to use data.table instead of data.frame, which is considerably faster.
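For example, a data.table version of the merge below might look like this (equivalent result on this small dataset, but much faster on large tables):
> # a sketch of the same merge using data.table for speed
> library(data.table)
> commonDT <- merge(as.data.table(dailyData), as.data.table(obsPrcp),
+                   by.x = c("ghcnDay", "statArg"),
+                   by.y = c("Date", "siteIds"))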
> common <- merge(dailyData, obsPrcp,
+ by.x=c("ghcnDay","statArg"),
+ by.y=c("Date","siteIds"))
> head(common)
ghcnDay statArg dailyPrcp numberOfDataPoints dailyGhcn qFlag
1 2011-08-26 US1NYPT0001 1.285082e-02 13 -999.9 <NA>
2 2011-08-26 USC00301211 8.623618e-03 13 11.9 <NA>
3 2011-08-27 US1NYPT0001 0.000000e+00 24 5.6 <NA>
4 2011-08-27 USC00301211 0.000000e+00 24 0.0 <NA>
5 2011-08-28 US1NYPT0001 1.394508e+02 24 -999.9 <NA>
6 2011-08-28 USC00301211 1.404404e+02 24 91.9 L
element
1 PRCP
2 PRCP
3 PRCP
4 PRCP
5 PRCP
6 PRCP
Call the CalcStatCont function and it returns all the requested statistics. The defaults are numPaired (number of paired data points), meanObs (mean of the observations), meanMod (mean of the model/forecast data), pearsonCor (Pearson correlation coefficient), RMSE (root mean square error), and multiBias (multiplicative bias).
> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp" ,
+ obsMissing = -999.9, groupBy = "statArg")
> # CalcStatCont will return a list having two elements of stat and plotList.
> names(stat)
[1] "stat" "plotList"
> # Check the statistics
> stat$stat
statArg numPaired meanObs meanMod pearsonCor RMSE multiBias
1 USC00301211 8 19.32500 23.710249 0.9772098 17.67777 1.2269211
2 US1NYPT0001 6 25.01667 8.521054 0.9992682 38.18890 0.3406151
If groupBy is NULL, it returns the overall stats for all the data and provides four informative plots.
> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp", obsMissing = -999.9, title = common$statArg)
Warning: Removed 3 rows containing missing values (geom_tile).
You can choose among the four plots by changing the plot.list argument.
> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp" , obsMissing = -999.9, plot.list = "scatterPlot")
You can also calculate conditional statistics by defining the boundaries you are interested in. For example, here we calculate the statistics conditioned on the observation being greater than 1 mm.
> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp" ,
+ obsCondRange = c(1, Inf), plot.list = "scatterPlot")
> stat$stat
numPaired meanObs meanMod pearsonCor RMSE multiBias
1 5 60.94 48.16129 0.6088967 47.43496 0.7903067
Sometimes the verification result at individual gauge locations is not what is desired, and we want to measure the performance of a model over a domain or polygon. To calculate statistics over RFCs (River Forecast Centers), use the GetRfc function, which determines which RFC a gauge (point) falls in. Simply feed it a dataframe having at least the two columns latitude and longitude, and the function adds a column with the RFC name.
> # add rfc name
> sg <- GetRfc(sg)
>
> # check what is been added
> head(sg)
country network stationID elevation state name
1 US 1 NYPT0001 249.0 NY CARMEL HAMLET 5.8 N
2 US C 00300808 171.0 NY BOYDS CORNERS
3 US C 00301207 161.5 NY CARMEL
4 US C 00301211 207.3 NY CARMEL 4N
5 US C 00308304 278.9 NY STORMVILLE
6 US C 00309413 241.1 NY WHITE POND
GSNflag HCN.CRNflag WMOID siteIds we sn reportTime rfc longitude
1 US1NYPT0001 10 12 700 NERFC -73.7017
2 USC00300808 9 7 700 NERFC -73.7333
3 USC00301207 14 6 700 NERFC -73.6833
4 USC00301211 15 11 700 NERFC -73.6550
5 USC00308304 7 15 700 NERFC -73.7333
6 USC00309413 7 11 700 NERFC -73.7500
latitude
1 41.4969
2 41.4500
3 41.4333
4 41.4725
5 41.5333
6 41.5000
Now add a column to the common data with the rfc information for each record, and calculate the statistics grouped by RFC.
> # merge the common with the sg data.frame
> common <- merge(common,sg[, c("siteIds", "rfc")],
+ by.x=c("statArg"),
+ by.y=c("siteIds"))
>
> # calculate statistics using grouping by rfc
> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp" ,
+ groupBy = "rfc", obsMissing = -999.9, plot.it = FALSE)
>
> stat$stat
rfc numPaired meanObs meanMod pearsonCor RMSE multiBias
1 NERFC 14 21.76429 17.20059 0.7663332 28.34781 0.7903128
Here all the gauges fall within a single RFC (NERFC), so there is only one group. If the gauges span more than one RFC, the function groups all the pairs within each RFC and returns the statistics per RFC.
One can also calculate the statistics over any desired polygon shapefile. First, use the GetPoly function to find which polygon each point falls into. GetPoly takes a dataframe containing at least the two fields latitude and longitude, overlays the points on a SpatialPolygonsDataFrame, and returns the requested attributes from the polygon. You can use an available SpatialPolygons* object loaded into memory, or provide the path to the location of a polygon shapefile and the name of the shapefile. The Clipped_huc12 shapefile is provided with the test case; its extent is larger than the Croton case study domain.
> # add HUC12 ids
> polygonAddress <- "~/wrf-hydro-training/example_case/supplemental/polygons"
> sg <- GetPoly(sg, polygonAddress = polygonAddress,
+ polygonShapeFile = "Clipped_huc12",
+ join="HUC12")
OGR data source with driver: ESRI Shapefile
Source: "C:\Users\arezoo\Documents\wrf-hydro-training\example_case\supplemental\polygons", layer: "Clipped_huc12"
with 5 features
It has 20 fields
Integer64 fields read as strings: OBJECTID GNIS_ID
> # check what is been added
> head(sg)
country network stationID elevation state name
1 US 1 NYPT0001 249.0 NY CARMEL HAMLET 5.8 N
2 US C 00300808 171.0 NY BOYDS CORNERS
3 US C 00301207 161.5 NY CARMEL
4 US C 00301211 207.3 NY CARMEL 4N
5 US C 00308304 278.9 NY STORMVILLE
6 US C 00309413 241.1 NY WHITE POND
GSNflag HCN.CRNflag WMOID siteIds we sn reportTime rfc
1 US1NYPT0001 10 12 700 NERFC
2 USC00300808 9 7 700 NERFC
3 USC00301207 14 6 700 NERFC
4 USC00301211 15 11 700 NERFC
5 USC00308304 7 15 700 NERFC
6 USC00309413 7 11 700 NERFC
HUC12 longitude latitude
1 020301010206 -73.7017 41.4969
2 020301010206 -73.7333 41.4500
3 020301010206 -73.6833 41.4333
4 020301010205 -73.6550 41.4725
5 020301010206 -73.7333 41.5333
6 020301010206 -73.7500 41.5000
> # merge the common data.frame with the sg data.frame
> common <- merge(common,sg[, c("siteIds","HUC12")],
+ by.x=c("statArg"),
+ by.y=c("siteIds"))
>
> # calculate statistics using grouping by HUC12
> stat <- CalcStatCont(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp",
+ obsMissing = -999.9, groupBy = "HUC12", plot.it = FALSE)
> print(stat$stat)
HUC12 numPaired meanObs meanMod pearsonCor RMSE multiBias
1 020301010206 6 25.01667 8.521054 0.9992682 38.18890 0.3406151
2 020301010205 8 19.32500 23.710249 0.9772098 17.67777 1.2269211
The gauges with data fall in only two HUC12 polygons, so there are two HUC12 groups here.
You can also calculate some categorical statistics using the CalcStatCategorical function. It accepts both categorical and continuous variables. If the data are actually categorical, the category argument should be defined; the elements in category are used as YES and NO in the contingency table. If the data are numeric, a set of thresholds should be defined: values exceeding a threshold are flagged as YES, and values below it are considered NO. You can choose from the available statistics by changing the statList argument. By default it calculates the Hit Rate (H), False Alarm Ratio (FAR), and Critical Success Index (CSI). The grouping options are the same as for CalcStatCont.
> # calculate categorical statistics
> stat <- CalcStatCategorical(DT = common, obsCol = "dailyGhcn", modCol = "dailyPrcp",
+ obsMissing = -999.9, groupBy = "statArg", threshold = c(1,5))
> print(stat)
statArg threshold H FAR CSI
1 US1NYPT0001 1 0.5000000 0 0.5000000
2 USC00301211 1 0.6666667 0 0.6666667
3 US1NYPT0001 5 0.5000000 0 0.5000000
4 USC00301211 5 0.6666667 0 0.6666667
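To make these scores concrete, here is a minimal base-R sketch of how H, FAR, and CSI follow from the contingency table for a single threshold (CalcStatCategorical's exact handling of missing values may differ):
> # contingency-table scores for a 1 mm threshold, missing obs removed
> ok   <- common$dailyGhcn > -999
> obsY <- common$dailyGhcn[ok] > 1   # observed YES
> modY <- common$dailyPrcp[ok] > 1   # modeled YES
> hits   <- sum(obsY & modY)
> misses <- sum(obsY & !modY)
> falseA <- sum(!obsY & modY)
> H   <- hits / (hits + misses)           # Hit Rate
> FAR <- falseA / (hits + falseA)         # False Alarm Ratio
> CSI <- hits / (hits + misses + falseA)  # Critical Success Index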