Published in Agron. J. 96:1495-1497 (2004).
© American Society of Agronomy
677 S. Segoe Rd., Madison, WI 53711 USA
Notes and Unique Phenomena
USE OF A BEOWULF CLUSTER FOR ESTIMATION OF RISK USING SWAT
Gerald Whittaker*
USDA-ARS, National Forage Seed Production Research Center, 3450 SW Campus Way, Corvallis, OR 97331
* Corresponding author (whittakg{at}onid.orst.edu)
Received for publication December 18, 2003.
ABSTRACT
Estimation of uncertainty using agronomic models typically requires a Monte Carlo study with a large number of simulations. Parallel computation dramatically speeds repetitive computation of this sort. A Beowulf cluster parallel computer offers a low-cost method of parallel computing that is fairly simple to construct, but application information is specialized, with little concerning agronomic model simulation. The objective of this note is to present a method of simulation using an agronomic model on a Beowulf cluster. To facilitate the analysis of uncertainty, the method performs the simulations within the R statistical computing environment. The Soil and Water Assessment Tool (SWAT) was run for 1200 annual simulations on varying numbers of processors for speed comparisons. The cluster achieved close to the theoretical speed increase, with the simulation results stored in an R object. Two examples of nonparametric estimation of uncertainty are presented.
INTRODUCTION
As hydrologic and agronomic concepts are combined in large models with application to arbitrarily large spatial areas, the computational requirements increase dramatically. Monte Carlo studies using these models are required for the estimation of uncertainty in parameters and results, adding a further computational burden. Although large computational problems such as this are common, high performance computers have been financially beyond the reach of smaller research organizations. To remedy this situation, the Beowulf Project was initiated in 1994 at the Center of Excellence in Space Data and Information Sciences (CESDIS), a division of Goddard Space Flight Center in Greenbelt, MD (Merkey, 2000). The objective of the project was to use commodity (off the shelf) computers linked through Ethernet to create a high performance computer at low cost (Sterling et al., 1995). Ten years later, Beowulf cluster is a generic term for a computer cluster using this configuration, and some of the fastest computers in the world are Beowulf clusters (www.top500.org, accessed 4 Mar. 2004; verified 18 June 2004). The Beowulf "faq" [available at www.canonical.org/~kragen/beowulf-faq.txt (accessed 4 Mar. 2004; verified 18 June 2004)] defines a Beowulf cluster as "a kind of high-performance massively parallel computer built primarily out of commodity hardware components, running a free-software operating system like Linux or FreeBSD, interconnected by a private high-speed network."
Physical construction of a Beowulf cluster is relatively simple (Sterling et al., 1999). Choosing and installing the software is more challenging than constructing the cluster. There is no single software configuration for a Beowulf cluster, and a large number of software packages can be combined in many ways to run one. Fortunately, at least two open source software distributions provide complete collections of software for setting up a Beowulf cluster, and even include simple installation wizards (OSCAR, The Open Cluster Group, 2003; and ROCKS, National Partnership for Advanced Computational Infrastructure, 2004).
To get the faster speeds available from a Beowulf cluster, there are two alternatives: (i) software can be recompiled incorporating message-passing parallel libraries or, for some applications, (ii) a massively parallel setup is used where a program is run hundreds of times on several machines. The second alternative is particularly attractive for estimating the statistical properties of simulations using hydrologic and agronomic models. There is, however, almost no information on how to actually implement the second strategy, beyond the admonition to "write a script" (Beowulf faq no. 3, www.canonical.org/~kragen/beowulf-faq.txt). This note provides a procedure for implementing multiple runs of a hydrologic/agronomic model and subsequent statistical analysis in the R environment on a Beowulf cluster.
The NFSPRC Beowulf Cluster
At the National Forage Seed Production Research Center, USDA, I constructed a Beowulf cluster consisting of a server and 12 computation nodes. The server node has two Pentium 4 processors (3.2 GHz), 1 GB (gigabyte) of RAM, a 10/100 Mbps (megabits/second) NIC (network interface card) for contact with the outside world, and an integrated Intel 10/100/1000 Mbps NIC for the private network. The computation nodes each have a Pentium 4 (2.4 GHz) processor, 1 GB of RAM, and an integrated Intel 10/100/1000 Mbps NIC. All the machines have hard drives, although a diskless setup is possible. The nodes are connected through a 24-port, 1 Gbit/s (gigabit/second) Ethernet switch. A single keyboard, monitor, and mouse serve all the machines through a 16-port KVM switch.
The operating system on the cluster is Linux, Redhat 9.0, kernel version 2.4.20-20.9 SMP. The OSCAR cluster software package was selected for installation, primarily because it supports Redhat 9.0. The kernel had to be updated from the OSCAR distribution kernel, but the OSCAR users' mailing list provided instructions for compiling and installing the new kernel. The R statistical computing environment was selected for the example application (R Development Core Team, 2003). R provides an environment that includes many statistical procedures available as packages, extensive data handling capabilities, graphics, the capability of running other programs from within R, and support for simple parallel computing. R is installed on all machines in the cluster.
Parallel Estimation of a Probability Density Function
The hydrologic/agronomic model used in this study is the Soil and Water Assessment Tool (SWAT) (Neitsch et al., 2002). SWAT was developed to predict the impact of agricultural management practices on water balance, erosion, and the transport of nutrients and pesticides in meso- to macroscale basins. SWAT runs on a daily time step, calculating the values of dozens of output variables.
The Calapooia River watershed in the Willamette Valley of Oregon was used in the example application. The SWAT model was set up for execution using data from the BASINS 3.0 distribution (USEPA, 2001) and the SWAT ArcView interface (DiLuzio et al., 2001). The SWAT run was set up so that the basin was divided into 17 subwatersheds of approximately equal size, with a 102-yr simulation period. The files required for execution were then copied to the Beowulf cluster, where a Linux version of SWAT2000 was used for the Monte Carlo study. The weather generator distributed with SWAT supplies the random variation for the study.
Method
All of the following steps are run in the R environment. R commands are shown in code font:
- 1. `library(rpvm); library(snow)` loads the libraries rpvm and snow (Tierney, 2003). These libraries provide a mechanism for using R on a parallel virtual machine (PVM).
- 2. `cl <- makeCluster(12)` sets up a 12-node cluster as the object `cl`.
- 3. `clusterEvalQ(cl, source("/calapooia/swatfns.R"))` reads functions into R on all nodes. The functions are:
- 3a. `ign`: rewrites the SWAT control file with a different random seed on every node.
- 3b. `flow_out`: AWK program that reads and subsets the selected variables from the SWAT output.
- 3c. `clusterEvalQ(cl, system("/calapooia/swat.static"))`: system call that runs SWAT2000 on each node.
- 3d. `clusterEvalQ(cl, matrix(scan("basins.flow.txt"), ncol=3, byrow=T))`: reads the data written by flow_out on each node and assigns them to an R object.
- 4. Pseudocode for running the Monte Carlo simulation:
  - For i = 1 to (number of simulations)
    - For j = 1 to (number of nodes)
      - Call ign
    - Loop j
    - Call flow_out
    - Call swat.static (3c)
    - For j = 1 to (number of nodes)
      - Call scan (3d)
    - Loop j
    - Assign data to R object on server node
  - Loop i
- 5. Estimates of the probability density functions were computed in the R environment using the function "density" from the R base package for the univariate case and the averaged shifted histogram (ASH) package based on Scott (1992) for the bivariate case.
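The steps above can be sketched in runnable form. The example below is a minimal illustration under stated assumptions, not the code used in this note: it uses the `parallel` package distributed with current versions of R (whose `makeCluster` and `clusterApply` mirror the snow functions) in place of rpvm and snow, starts two local worker processes rather than 12 nodes, and replaces the SWAT run with a hypothetical stand-in function, `run_once`, that generates synthetic flow and sediment values from a seed.

```r
library(parallel)

# Hypothetical stand-in for one SWAT run on one node. On the cluster this
# would rewrite the control file with the seed (ign), call swat.static,
# and read the subset written by flow_out.
run_once <- function(seed) {
  set.seed(seed)                                   # different seed for every run
  n_years <- 100                                   # usable years per run
  flow <- rlnorm(n_years, meanlog = 3, sdlog = 0.5)            # synthetic values
  sediment <- flow * rlnorm(n_years, meanlog = -1, sdlog = 0.3)
  cbind(seed = seed, flow = flow, sediment = sediment)         # three columns
}

n_nodes  <- 2   # assumption: two local workers for illustration
n_rounds <- 3   # outer loop of the pseudocode (For i ...)

cl <- makeCluster(n_nodes)
results <- vector("list", n_rounds)
for (i in seq_len(n_rounds)) {
  # One distinct seed per node in each round (inner loop, For j ...)
  seeds <- (i - 1) * n_nodes + seq_len(n_nodes)
  # Run the stand-in model on every node and gather the matrices on the server
  results[[i]] <- clusterApply(cl, seeds, run_once)
}
stopCluster(cl)

# Stack all runs into one R object: 6 runs x 100 years = 600 rows, 3 columns
all_runs <- do.call(rbind, unlist(results, recursive = FALSE))
dim(all_runs)
```

The object `all_runs` plays the role of the R object assigned on the server node in step 4; the density estimation of step 5 would then operate on its columns.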
Remarks
For Monte Carlo studies, the simple procedure of running the complete model on each node constitutes a massively parallel problem setup that will almost achieve the theoretical speed increase. All output from the simulation model at every time period is available for analysis. For an evaluation of the speed increase, 1200 annual simulations were run on different numbers of computation nodes. The simulations were 102 yr in duration for each SWAT run, to collect 100 yr of usable information after dropping the first 2 yr. For most statistical purposes 1000 repetitions is more than enough, but 1200 was chosen to accommodate the number of computation nodes (e.g., 12 runs on 10 nodes, 20 runs on 6 nodes). The theoretical time of completion was calculated by assuming a multiplicative speed increase for each node (i.e., 5 nodes would be 5 times as fast as 1 node). Figure 1 shows that the actual speed increase is close to the theoretical maximum, and approaches the maximum as the number of computation nodes increases to 8 nodes, where it stabilizes. I speculate that the communication overhead to the server node is similar whether using 1 node or multiple nodes. Use of fewer nodes requires a larger number of communication events with the server node, resulting in slower calculation speed.
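The theoretical completion time described above is simple arithmetic and can be written out in R. The sketch below lists the node counts that divide 1200 simulations evenly and computes the theoretical time for each. The single-node time `t1` is a hypothetical placeholder, not a measurement from this study, and `efficiency` is an illustrative helper name.

```r
n_sims <- 1200
node_counts <- 1:12

# Node counts (up to the 12 available) that split 1200 simulations evenly
even_split <- node_counts[n_sims %% node_counts == 0]

t1 <- 600  # hypothetical single-node completion time (minutes); not measured

# Multiplicative speed increase: n nodes are assumed n times as fast as 1 node
theoretical_time <- t1 / even_split

# Parallel efficiency: theoretical time divided by observed time (1 = ideal)
efficiency <- function(actual_time, t1, n_nodes) (t1 / n_nodes) / actual_time

data.frame(nodes = even_split, theoretical_time = theoretical_time)
```

An observed time close to `theoretical_time` gives an efficiency near 1, which is the pattern Figure 1 reports for larger node counts.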
The variables chosen for analysis were sediment load and flow out at the mouth of each sub-basin. Any of the SWAT output variables could be chosen, as it is trivial to change the selection in flow_out (3b) by the addition of a conditional statement. In a 102-yr simulation, the first 2 yr are dropped to make sure the simulation has stabilized. Once the data are in R, all of its statistical apparatus is available for analysis. Figure 2 shows the kernel density estimate of the probability density function of the flow out of basin 2 on 1 February. This is the fundamental information required for estimation of uncertainty, and is also useful in calibration of the model. For a multivariate example, Fig. 3 shows an estimate of a joint probability density function of sediment and flow. This information would be useful for the calculation of total maximum daily loads (TMDLs), and the analysis could be automated using functions in R to return the mode and other statistics for every basin in the simulation.
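As an illustration of how such automation might look, the sketch below applies the base R function `density` to every sub-basin and returns the mode and selected quantiles. It covers only the univariate case. The flow values are synthetic placeholders generated for the example, and the helper names `density_mode` and `summarize_basin` are hypothetical.

```r
# Mode of a kernel density estimate: the grid point where the estimate peaks
density_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

set.seed(42)
n_runs   <- 1200
n_basins <- 17

# Synthetic flows, one column per sub-basin (placeholder for SWAT output)
flows <- sapply(seq_len(n_basins),
                function(b) rlnorm(n_runs, meanlog = log(10 + b), sdlog = 0.4))
colnames(flows) <- paste0("basin", seq_len(n_basins))

# Mode plus the 5th, 50th, and 95th percentiles for one sub-basin
summarize_basin <- function(x) {
  c(mode = density_mode(x), quantile(x, probs = c(0.05, 0.5, 0.95)))
}

# One row per sub-basin, one column per statistic
summary_by_basin <- t(apply(flows, 2, summarize_basin))
head(summary_by_basin, 3)
```

With real output, `flows` would be built from the matrices gathered from the nodes, for example the flow column for a given date in each sub-basin.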

Fig. 3. Estimate of joint probability of flow and sediment in Sub-basin 2 of the Calapooia River on 1 February.
The method presented here generalizes to other agronomic models; one need only substitute the alternative model for swat.static in (3c). There are other ways of achieving these results on a parallel machine, but the method outlined above using the R environment is rather simple and automates everything from running the simulation to gathering the results and preparing the data for statistical analysis. Use of the R environment also makes available a very large number of tools for statistical and graphical analysis. For large, multiple-basin studies, even the statistical analysis can be automated using R functions.
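One way to make the substitution of models explicit is to wrap the system call and the output reader in a single function that takes the executable as an argument. The sketch below is an assumption about how this could be organized, not code from this note: `run_model` is a hypothetical helper, and the stand-in "model" is simply Rscript writing six numbers to a temporary file so that the example runs without SWAT.

```r
# Run any external model, then read its output back into R.
# command: path to the model executable; reader: function returning the data
run_model <- function(command, args = character(), reader) {
  status <- system2(command, args)
  if (status != 0) stop("model run failed with exit status ", status)
  reader()
}

# Stand-in model for illustration: Rscript writes six numbers to a file
out_file <- normalizePath(tempfile(fileext = ".txt"),
                          winslash = "/", mustWork = FALSE)
rscript <- file.path(R.home("bin"), "Rscript")
expr <- sprintf('cat(1:6, file = "%s")', out_file)

result <- run_model(
  rscript, c("-e", shQuote(expr)),
  # Same reading pattern as step 3d: three columns, filled by row
  reader = function() matrix(scan(out_file, quiet = TRUE), ncol = 3, byrow = TRUE)
)
result
```

On a cluster, the same function could be sent to every node with `clusterCall(cl, run_model, ...)`, with the path to the alternative model passed as `command`.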
REFERENCES
- DiLuzio, M., R. Srinivasan, and J. Arnold. 2001. ArcView Interface for SWAT2000: User's guide [Online]. Available at www.brc.tamus.edu/swat/swat2000doc.html (accessed 2 Mar. 2004; verified 18 June 2004). Texas Agric. Exp. Stn. and USDA-ARS, Temple, TX.
- Merkey, P. 2000. Beowulf history [Online]. Available at http://beowulf.org/beowulf/history.html (accessed 23 June 2004; verified 23 June 2004). Scyld Software Corp., San Francisco, CA.
- National Partnership for Advanced Computational Infrastructure. 2004. NPACI rocks cluster distribution: Users guide [Online]. Available at http://rocks.npaci.edu/Rocks/ (modified 17 Feb. 2004; accessed 4 Mar. 2004; verified 18 June 2004).
- Neitsch, S.L., J.G. Arnold, J.R. Kiniry, R. Srinivasan, and J.R. Williams. 2002. Soil and water assessment tool user's manual: Version 2000 [Online]. Available at www.brc.tamus.edu/swat/swat2000doc.html (accessed 2 Mar. 2004; verified 18 June 2004). Texas Agric. Exp. Stn. and USDA-ARS, Temple, TX.
- Open Cluster Group. 2003. OSCAR: Open source cluster application resources [Online]. Available at http://oscar.openclustergroup.org/tiki-index.php (modified 2 Dec. 2003; accessed 2 Mar. 2004; verified 18 June 2004).
- R Development Core Team. 2003. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Also available online at www.R-project.org (accessed 2 Mar. 2004; verified 18 June 2004).
- Scott, D.W. 1992. Multivariate density estimation. John Wiley & Sons, New York.
- Sterling, T., J. Salmon, D. Becker, and D. Savarese. 1999. How to build a Beowulf. MIT Press, Cambridge, MA.
- Sterling, T., D. Savarese, D.J. Becker, J.E. Dorband, U.A. Ranawake, and C.V. Packer. 1995. BEOWULF: A parallel workstation for scientific computation. p. 11-14. In Proc. of the 24th Int. Conf. on Parallel Processing. Vol. 1, Oconomowoc, WI. August 1995. CRC Press, Boca Raton, FL. Also available at http://citeseer.ist.psu.edu/sterling95beowulf.html (accessed 2 Mar. 2004; verified 18 June 2004).
- Tierney, L. 2003. The snow (simple network of workstations) package [Online]. Available at www.stat.uiowa.edu/~luke/R/cluster/cluster.html (accessed 2 Mar. 2004). Dep. of Statistics and Actuarial Science, Univ. of Iowa, Iowa City, IA.
- U.S. Environmental Protection Agency. 2001. Better assessment science integrating point and nonpoint sources. USEPA Rep. 823-B-01-001. USEPA, Office of Water (4305), U.S. Gov. Print. Office, Washington, DC. Also available at www.epa.gov/waterscience/basins/bsnsdocs.html (modified 23 Feb. 2004; accessed 2 Mar. 2004; verified 18 June 2004).