Agronomy Journal Grow Your Career With ASA
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF) Free
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Web of Science (1)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Roel, A.
Right arrow Articles by Plant, R. E.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Roel, A.
Right arrow Articles by Plant, R. E.
Agricola
Right arrow Articles by Roel, A.
Right arrow Articles by Plant, R. E.
Related Collections
Right arrow Rice
Right arrow Site-Specific Analysis
Published in Agron. J. 96:1481-1494 (2004).
© American Society of Agronomy
677 S. Segoe Rd., Madison, WI 53711 USA

Site-Specific Analysis

Factors Underlying Yield Variability in Two California Rice Fields

Alvaro Roela,c and Richard E. Plantb,*

a Graduate Group in Ecology, Univ. of California, Davis, CA 95616, USA
b Dep. of Agron. and Range Sci. and Dep. of Biol. and Agric. Eng., Univ. of California, Davis, CA 95616, USA
c Present address: Instituto Nacional de Investigaciones Agropecuarias, Treinta y Tres, Uruguay

* Corresponding author (replant{at}ucdavis.edu)

Received for publication February 4, 2004.

    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Modern technologies associated with precision agriculture provide the opportunity to more precisely measure yield variability and the ecological processes underlying this variability. Effective analysis of data from these measurements requires statistical methods different from those traditionally employed on data from controlled agronomic experiments. Our objective was to develop and test multivariate statistical methods appropriate for use in analyzing precision agriculture data. We analyzed a data set taken from two commercial California rice fields and consisting of yield spatial trends together with soil core data from a grid of sample points. We used cluster analysis to discern spatiotemporal patterns in grain yield. We applied a Monte Carlo randomization process to the generation of clusters to analyze cluster stability. We then used classification and regression trees (CART) to determine the factors underlying cluster distribution. The clustering procedure successfully identified stable, physically meaningful clusters with recognizable spatial and temporal structure. Thus, the randomization procedure may present an attractive alternative to fuzzy clustering. The CART analysis identified some but not all of the factors underlying the cluster patterns. The number of available data values may have been too small to take advantage of the CART partitioning capabilities.

Abbreviations: CART, classification and regression trees • DGPS, differential global positioning system • GIS, geographic information system • LAD, least absolute deviation • NDVI, normalized difference vegetation index • OM, organic matter • SP, soil penetration resistance • SSM, site-specific management • TSR, tree-structured regression


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
RICE (Oryza sativa L.) is one of the world's most important staple crops. Development of effective site-specific management (SSM) techniques for rice production could have a significant impact on world food production (Cassman, 1999). Successful implementation of SSM requires an understanding of the spatial variability of biotic and abiotic factors that influence crop performance, knowledge that currently is often unavailable. This knowledge gap is one of the key limiting factors to the adoption of SSM in agricultural systems. The modern technologies associated with precision agriculture provide an opportunity to more precisely measure yield variability and the ecological processes underlying it and to begin to close this knowledge gap.

There have been a few recent studies exploring in a spatial context the factors underlying yield variability in rice. Dobermann (1994) used multivariate statistics to analyze within-field variability in Russian rice fields. Casanova et al. (1999) used multiple regression and the boundary-line method to study limits on the ability of rice to reach its yield potential. There have been many more studies on the factors underlying yield variability in terrestrial crops. Blackmore et al. (1999) presented a detailed analysis of data from cereal fields in Great Britain in which they made use of yield map data, aerial images, and soil and tissue analysis. Huggins and Alderfer (1995) used multiple regression to study the effects of various factors on yield variability in a 34-yr small-plot–based corn (Zea mays L.) fertility trial. They reported that 67% of the variation was explained by climatic and other factors while only 8% was explained by site variability. Sadler et al. (1995)(1998) analyzed yield sequences in corn, wheat (Triticum aestivum L.), and soybean [Glycine max (L.) Merr.] plots. Interannual yield correlations were statistically significant although the coefficients of determination were fairly low. Lamb et al. (1996)(1997) observed similar results in a 6-yr sequence of corn and corn–soybean systems. Bakhsh et al. (2000) analyzed the factors underlying observed yield variability in a 4-yr sequence of data from a corn–soybean field. After separating the short- and long-range components of variability using median polish, they used variography to describe the short-range variability and visual inspection of the median polish surfaces to describe the long-range effects. Visual inspection of map overlays of the data provided more useful information than correlation analysis of yield and yield-influencing factors. Jaynes et al. (2003) analyzed data from six corn years of a corn–soybean rotation in an Iowa field. Their work is further described below.

The fundamental difficulty in studying yield-influencing factors in commercial fields is the complexity of the phenomena. Environmental factors that influence crop growth and development may be relatively permanent, such as soil properties, or they may be transient, such as pest populations. Interaction with climatic factors may cause an environmental factor to increase yield in one year and decrease it in the next. For example, Porter et al. (1998) found that the relative yields of different plots in a corn–soybean experiment varied depending on seasonal climatic conditions. One way to reduce this complexity is to attempt to organize the field into subregions with similar spatiotemporal behavior. Several researchers have used cluster analysis in an effort to accomplish this. Lark and Stafford (1997) used fuzzy clustering to organize yield map data of combinable crops. Perez-Quezada et al. (2003) used k-means clustering (Jain and Dubes, 1984) to identify clusters of similar spatiotemporal behavior in a study of two four-crop rotation fields in California. Jaynes et al. (2003) employed the k-means technique to identify yield clusters. They then used multivariate statistical methods to characterize the factors underlying differences in yield between these clusters. They found that cluster analysis identified zones of different topographical environments, and multiple discriminant analysis identified relationships between yield-influencing factors. Dobermann et al. (2003) compared several different classification methods for analysis of yield data from irrigated cornfields.

Plant et al. (1999) used tree-structured regression (TSR) to divide a wheat field into segments and then applied linear regression to these segments. Tree-structured regression is a component of a class of algorithms called classification and regression trees (CART) (Breiman et al., 1984). CART is a nonparametric statistical method that recursively partitions the multidimensional space defined by the explanatory variables into subsets that are as homogeneous as possible in terms of the values of the response variable. Unlike parametric models, which are intended to uncover a single dominant structure in data, CART is designed to work with data that might have multiple structures. Buchleiter and Brodahl (2001) used CART to identify the factors underlying grain yield spatial variability. De'ath and Fabricius (2000) present several examples of the application of CART analysis to ecological data. CART is now an established method in medical diagnosis (e.g., Goldman et al., 1998) and has found applications in meteorology (Burrows, 1991), plant pathology (Baker, 1993), wildlife management (Grubb and King, 1991), species distributions (Vayssieres et al., 2000), soil regionalization (Mertens et al., 2002), and other fields with complex, multivariate data.

Vayssieres et al. (2000) point out several advantages of using CART. First, it is a nonparametric procedure and does not require the specification of a functional form. This eliminates the need to make simplifying assumptions about the data and to test for model goodness-of-fit. CART does automatic stepwise variable selection. As in the case of stepwise regression, this does not ensure finding the absolutely best tree. However, unlike parametric stepwise procedures, CART may select a variable several times because at each stage CART selects the variable holding the most information for the part of the multivariate space on which it is currently working. CART is also extremely robust with respect to outliers, which are generally separated into their own nodes where they no longer affect the rest of the tree. Cook and Goldman (1984) point out two disadvantages of using recursive partitioning methods such as CART. One is that since CART only represents a continuous factor by a series of distinct subranges, parametric methods are better at capturing an algebraic relationship between the response variable and a continuous factor. As a result, the decision tree may obscure linear and simple curvilinear structures of the data. A second disadvantage is that, because of the dichotomous nature of trees, later splits are based on fewer cases than the initial one. Thus, as the tree grows, the identification of additional predictive factors becomes increasingly difficult. Walters et al. (1999), studying models to predict medical outcomes, found that when the relationship between the predictor and response variables was a linear one, linear regression analysis was adequate. However, when nonlinear relationship existed, CART or artificial neural networks yielded better models.

In an earlier paper (Roel and Plant, 2004), we examined sequences of yield map data collected in two California rice fields between 1998 and 2001. These data sets were collected as a part of the commercial harvesting process, and one of the objectives of the first paper was to determine whether they were of sufficient quality for scientific use. Data from the 2001 harvest in one field were discarded as a result of this analysis, leaving one 4-yr sequence and one 3-yr sequence. Median polish (Emerson and Hoaglin, 1983) was used to extract yield trends from each year's data set, and k-means clustering was used to organize each sequence of median polish data into clusters. During the 4 yr in which yield data were collected from these fields, other data that might help in identifying the factors underlying yield variability were collected as well. In the present paper, we further develop the clustering methodology and to use CART to analyze these data. Our primary objectives are methodological. Specifically, our goal is the development of effective methods for exploratory analysis of data sets consisting of georeferenced, high-spatial-precision yield data together with spatially extensive edaphic data, with the objective of gaining insight into the factors underlying yield spatial variability. We develop and test a randomization method based on k-means clustering for organizing yield data into clusters. We then test CART for effectiveness in providing information on the factors underlying observed yield variability and develop methods for using CART to analyze temporal sequences of yield map data.


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
The study was performed from 1998 through 2001 in two rice fields approximately 2 km apart, one of 38 ha (denoted Field 1) and one of 52 ha (denoted Field 2), located near Marysville, CA (UTM Zone 10, coordinates: E: 627102, N: 4340769; and E: 624970, N: 4341076 for Field 1 and 2, respectively). The soils of the study fields consist of approximately 45% Kimball loam (fine, mixed, active, thermic Mollic Palexerafls), 30% San Joaquin loam (fine, mixed, active, thermic, Abruptic Durixeralfs), and 25% Bruella loam (fine-loamy, mixed, Ultic Palexeralfs). Medium-grain rice cultivar M-202 and short-grain cultivar Koshihikari were grown and managed by the cooperator in Fields 1 and 2, respectively, using standard practices for the area (Hill et al., 1992).

Yield Data
Rice was harvested during the years 1998 through 2001 in both fields using a combine equipped with a John Deere Green Star yield-mapping system with real-time differential global positioning system (DGPS). Yield map data files (yield, grain moisture, longitude, and latitude) were collected and imported into the ArcGIS (ESRI, Redlands, CA) geographic information system (GIS) for analysis. A detailed description of the yield map data analysis is provided by Roel and Plant (2004). As a part of that analysis, yield data were separated into two components, large-scale trend and small-scale variation, using median polish on a 30-m grid. The present study used the large-scale trend values from that analysis.

Soil Data
Thirty-six and 58 soil samples were collected from Fields 1 and 2, respectively, representing approximately one sample per hectare. Figure 1 shows the locations in both fields where soil samples were collected. Soil penetrometer data for both fields were collected in June 2000. Other data were collected in May 1999 in Field 1 and March 2000 in Field 2. Locations of sample points were determined using a Trimble Ag 132 backpack DGPS receiver (Trimble Navigation, Sunnyvale, CA). At each sample point, eight soil cores (0–30 cm depth) were extracted before planting from a circular region of about 3-m radius (approx. 25-m2 area) such that two cores lay in each of the four quadrants. Sand, silt, and clay content (%) were measured. Soil pH, organic matter (OM, %), P (Bray) (mg kg–1), K (mg kg–1), and Zn (mg kg–1) were determined using standard methods of the University of California Division of Agriculture and Natural Resources Analytical Laboratory (Div. of Agric. and Nat. Resour., 2004). Topsoil depth (cm) and soil penetration resistance (SP, MPa) were measured using a Spectrum SC-900 instantaneous core penetrometer (Spectrum Technologies, Plainfield, IL) at the same locations where soil samples were extracted. This sensor provides soil depth readings in 2.5-cm increments as a load cell measures the penetration resistance. Field 1 was re-leveled by a commercial laser-leveling firm during the winter between the 1998 and 1999 seasons. The cut and fill map developed in the laser-leveling process was used to determine elevation in the field before leveling (m). Soil sampling in Field 1 took place after the laser-leveling operation.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 1. Soil sampling locations in (a) Field 1 and (b) Field 2.

 
Aerial Image Data
A major infestation of herbicide-resistant early watergrass [Echinochloa crus-galli (L.) Beauv.] occurred in Field 1 in 2001. Because the weed population senesced before the rice, the spatial distribution of the infestation was well captured by infrared aerial images. Therefore, late-season aerial image data from each field were included in the analysis. False-color infrared digital aerial images of each field were acquired on 18 Aug. 1998, 5 July 1999, 31 July 2000, and 22 July 2001. Pixel sizes varied between 1 and 3 m on the ground. Images were imported into ArcGIS version 8.3 (ESRI, Redlands, CA), georegistered, band-separated, and transformed into normalized difference vegetation index (NDVI) values. These were obtained using the formula NDVI = (IR – R)/(IR + R) (Lillesand and Kiefer, 1994). The NDVI value at each soil sample location was estimated as the mean of the nine pixel values including and immediately adjacent to the pixel containing that location.

Data Analysis
Correlation analysis was performed using SAS version 8.3 (SAS Inst., Cary, NC). CART analyses were performed using CART for Windows version 5.0 (Salford Systems Inc., San Diego, CA). Cluster analysis was performed using PROC FASTCLUS of SAS version 8.3 (SAS Systems, Cary, NC). This procedure implements k-means cluster analysis (Jain and Dubes, 1984), an iterative procedure. In this algorithm, k points in the data space are initially selected as cluster seeds. Clusters are formed by assigning all other points in the data space to the nearest seed. The means of each cluster are then selected as the new set of k seeds, and the iterative process is repeated. In theory, the process may be iterated to convergence. However, the implementation in PROC FASTCLUS avoids the need for a large number of iterations through its algorithm for selecting the initial seeds. In this algorithm, the first point in the data set with all components specified is selected as the first seed, and other seeds are selected based on their distance from this point.

Our goal in using cluster analysis was to develop clusters that could be used to identify factors underlying observed yield variability. A necessary condition to achieve this goal is that the clusters distinguish values of the yield-determining factors as well as the values of yield themselves. That is, the clusters must be "physically meaningful." One criterion that has been used previously to test for this property is that the clusters be spatially structured, under the assumption that physical properties of the field are spatially structured (Plant et al., 1999; Roel and Plant, 2004). Another possibly more powerful indicator of physical meaning is that the cluster pattern be attained starting from different sets of initial cluster seeds. Since PROC FASTCLUS bases its choice of initial seeds on the order in which the data are entered, we tested for this property by running the procedure from randomly reordered data sets without iterating to convergence. For values of k from 2 through 5, data from Field 1 were processed in this manner 10 times. The procedure was repeated five times for values of k from 5 through 9 with data from Field 1 and for values of k from 2 through 6 with data from Field 2 since initial experience with 10 values of k from Field 1 indicated that five tests were sufficient to observe consistency. We did not iterate to convergence to provide the most severe possible test of cluster stability.

We observed that the result of this re-randomization procedure was that some of the cluster patterns could be aggregated by visual examination of the spatial distribution and cluster means into sets of similar arrangements and that some of the cluster patterns were unique. To clarify the discussion, we will define the terms we use to describe the cluster analysis, making reference to Fig. 2. The individual location elements of the field are called cells. For example, Field 1 has 292 cells. The k collections of cells identified by the k-means process are the clusters. A cluster set is an arrangement of similar cluster patterns that is attained from more than one initial reordering of the data. For example, Field 1 has two sets with k = 2, three sets with k = 3, and so forth. The cluster sets are denoted numerically as k-i, where i indexes the sets. For example, the two sets in Field 1 with k = 2 are denoted Sets 2-1 and 2-2.



View larger version (84K):
[in this window]
[in a new window]
 
Fig. 2. Cluster sets of Field 1. The number code identifies the value of k, which is the number of clusters, followed by the identification number of the set. For example, Set 2-1 is the first set of two clusters (k = 2). The numbers in the legend identify the gray-scale key of the elements belonging to each cluster.

 
The members of a cluster set may be similar but not identical. To test the cluster sets for internal consistency, we defined an appropriate statistic as follows. Suppose that there are a total of p cells and that the set has n members (i.e., n of the random reorderings of initial seeds result in a member of this set). For example, there are a total of 292 cells in Field 1, and for k = 6, the unique set (Fig. 2) with more than one member has two members. Therefore, in this case k = 6, p = 292, and n = 2. Let an indicator ßi, i = 1, ..., p, be defined for each cell i by

Then the measure of consistency {gamma} of the cluster set is defined by

[1]

The motivation for this definition of {gamma} is similar to that for Cohen's {kappa} (Cohen, 1960). The quantity p/nk is the expected number of cells that all belong to the same cluster by chance. If all of the members of the set are identical, then {gamma} has the value 1, and if each cell of each member of the set is randomly assigned one of the k possible values, then {gamma} = 0. Values of {gamma} were computed for each cluster set. We tested the cluster sets for spatial autocorrelation as an indication of spatial pattern by computing the Moran's I statistic (Cliff and Ord, 1981) for each set. Strictly speaking, since the cluster numbers are nominal-scale data, a multicolor join-count statistic would be more appropriate (Cliff and Ord, 1981). However, Moran's I is much easier to compute and also provides a good indication of the level of spatial autocorrelation in this situation. Computations were performed using macros written in Microsoft Excel (these are available from the corresponding author upon request).

Besides consistency from several initial cluster seeds and spatial pattern, a third indication of whether the clusters are physically meaningful is whether they are hierarchically arranged. In this case, the situation is a bit more subtle, however. Consider the case of an increase in the number of clusters from k to k + 1, and without loss of generality, suppose k = 2 so that we go from two to three possible values. Suppose that there is a single factor, say, soil clay content, underlying the separation of yield values into the two clusters. When the elements are partitioned into three possible values, the physical property underlying yield variability may continue to be the single factor clay content, it may switch to one or two completely new factors, or it may be comprised of clay content from the original two-cluster partition and some other factor, say, OM, within one of the two original clusters. It is only in the latter case that one would expect the clusters to be hierarchically arranged. Therefore, we did not test statistically for the presence or absence of a hierarchical arrangement. Rather, we visually inspected for such arrangements and used the results of this arrangement in the second phase of the data analysis, which was the use of CART to attempt to determine the factors underlying the clusters.

The explanatory variables for the CART analysis were the soil, elevation, and aerial image data. The analysis was performed as a two-stage process for each field. The first stage was based on the yield clusters. A member of each cluster set judged to be physically meaningful by the criteria of consistency (as indicated by the {gamma} statistic) and spatial structure (as indicated by Moran's I) was selected for CART analysis. A classification tree was constructed in which the response variable was the cluster number, which ranged from 1 to k. The second stage used TSR to construct a separate regression tree for each year with that year's yield as the response variable. Regression tree analysis can be performed using least squares or least absolute deviation (LAD) regression. Studies comparing the two methods have not found one to be consistently superior (Khoshgoftaar and Seliya, 2002). Consistent with the findings of Plant et al. (1999), we found LAD to produce more easily interpretable results, and we report the results of this method exclusively.


    RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Correlation Analysis
Figures 3 and 4 show the sequence of Thiessen polygons developed by Roel and Plant (2004) for yield trend values of the 4 yr for Field 1 and 3 yr for Field 2, respectively. Yield in Field 1 has no visually apparent persistent pattern over years while in Field 2, there are persistent low- and high-yielding areas. Field 2 had been recently leveled and brought into production, and the low-yielding areas were considered by the cooperating grower to correspond to cut areas of the laser-leveling process. Walker et al. (2003) observed a similar effect of precision leveling on rice yield. A cut-and-fill map for this field was not available. Univariate statistics of soil physical and chemical variables are summarized for Field 1 in Table 1 and for Field 2 in Table 2. The fact that Field 2 had recently been brought into production may explain the disparity in mineral nutrients between fields. The mean measured soil P value in Field 1 was nearly five times that of Field 2, and the Zn level was slightly higher. On the other hand, Field 2 had a higher mean measured soil K level.



View larger version (68K):
[in this window]
[in a new window]
 
Fig. 3. Large-scale yield trends obtained by Roel and Plant (2004) for Field 1 using median polish. Redrawn from Roel and Plant (2004).

 


View larger version (52K):
[in this window]
[in a new window]
 
Fig. 4. Large-scale yield trends obtained by Roel and Plant (2004) for Field 2 using median polish. Redrawn form Roel and Plant (2004).

 

View this table:
[in this window]
[in a new window]
 
Table 1. Univariate statistics for field soil attributes for Field 1 based on 36 samples.

 

View this table:
[in this window]
[in a new window]
 
Table 2. Univariate statistics for field soil attributes for Field 2 based on 58 samples.

 
Spearman rank correlation coefficients for Field 1 indicate a weak or even negative interannual correlation between and among yield and late-season NDVI (Table 3), the only exception being a strong correlation between 2001 yield and 2001 NDVI. Significance of coefficients may be overestimated due to spatial autocorrelation between sample values (Cliff and Ord, 1981). There is no consistently high correlation between rice yield and any soil attribute value in this field. High negative values exist between 1999 yield and soil pH (r = –0.52) and between 2001 yield and elevation before leveling (r = –0.50). Soil pH in this field has a highly negative correlation with OM and a strong positive correlation with clay content.


View this table:
[in this window]
[in a new window]
 
Table 3. Spearman rank correlation coefficients, r, for Field 1 between rice yield by year, late-season normalized difference vegetation index (NDVI) by year, and field soil attributes. If measurements were spatially independent, |r| > 0.32 would imply significance at P = 0.05.

 
Spearman rank correlation coefficients in Field 2 present a much more consistent pattern (Table 4). There is a strong interannual correlation among yield values and between yield and late-season NDVI. The soil attribute values displaying consistently highest correlation between themselves and yield are pH, penetration resistance, and clay content, all negatively correlated; and soil test P level in 1998 and soil OM level in 1999 and 2000, both positively correlated. There is a paradoxical negative correlation between soil test K level and yield in 1998 and 1999 in Fields 1 and 2. In general the correlation coefficients among soil attribute values have much higher magnitudes in Field 2 than the corresponding coefficients in Field 1. In particular, soil test K in Field 2 has a strong negative correlation with OM and a positive correlation with clay content. This may reflect high clay content in the subsoil exposed by the laser-leveling process.


View this table:
[in this window]
[in a new window]
 
Table 4. Spearman rank correlation coefficients, r, for Field 2 between rice yield by year, late-season normalized difference vegetation index (NDVI) by year, and field soil attributes. If measurements were spatially independent, |r| > 0.25 would imply significance at P = 0.05.

 
Cluster Analysis
Table 5 shows the breakdown of cluster sets for Field 1. Not all of the permutations of the data produced clusters that could be identified as belonging to a group. For example, of the 10 permutations tested at k = 4, only 8 produced clusters that had a pattern resembling other clusters. Those clusters that could not be identified as a member of a cluster set were ignored. Cluster sets in Field 1 could be identified for values of k less than or equal 6 (Fig. 2). The spatial autocorrelation statistic of the cluster set identified visually for k = 6 was not statistically significant, but all other sets were highly significant both in consistency and spatial structure (Table 5). Some of the cluster sets are themselves quite similar (Fig. 2). The {gamma} statistic is 0.65 between Sets 2-1 and 2-2, 0.79 between Sets 3-1 and 3-2, and 0.59 considering all the members of Sets 4-1, 4-2, and 4-3 aggregated together. By examining the spatial arrangements of Fig. 2 and the plots of cluster means of the normalized yields (Fig. 5), hierarchical relationships can be determined. The cluster means of Set 2-2 are very similar to those of Set 2-1, those of Set 3-2 to Set 3-1, and those of Sets 4-2 and 4-3 to Set 4-1. Therefore, these are not displayed in Fig. 5. Cluster 2 of Set 2-1, which has a low value in 1998 and gradually rises, splits into Clusters 2 and 3 of Set 3-1. Clusters 2 and 3 of Set 3-3 appear to be formed from parts of Clusters 1 and 2 of Set 2-2. The relationship between clusters for k = 3 and k = 4 is somewhat ambiguous. Clusters 1 and 3 of Set 5-1 are formed from Cluster 1 of Set 4-1. Clusters 2 and 5 of Set 6-1 are formed from Cluster 2 of Set 5-1.


View this table:
[in this window]
[in a new window]
 
Table 5. Cluster sets of Field 1. Each set contains more than one cluster resulting from different randomly selected rearrangements of the data. The table shows set identification; then number k of means (i.e., of clusters); the number of instances n of this set out of total number N of data rearrangements tested; the set consistency statistic {gamma}, given by Eq. [1]; and the Moran's I and corresponding z value under the assumption of normality. For all cluster sets, the number of cells p = 292.

 


View larger version (32K):
[in this window]
[in a new window]
 
Fig. 5. Cluster means for the clusters of Fig. 2. Each plot shows the sequence of means of standardized yield values, obtained by subtracting the mean and dividing by the standard deviation, of the corresponding cluster for each of the 4 yr of data. To conserve space, the axes are not labeled. The abscissa is the year, and the ordinate is the standardized cluster mean.

 
The patterns of clustering associated with Field 2 are considerably simpler than those of Field 1 (Fig. 6 and 7). Set 2-1 consists of a consistently low and a consistently high yield area. The clusters in Sets 3-1 and 3-2 consist of consistently high, medium, and low yield areas that draw their members from both the high and low areas of Set 2-1. The medium yield area of Sets 3-1 and 3-2 splits into two to form Sets 4-1 and 4-2. One of these two again splits to form Set 5-1. All of the cluster sets show a very highly significant level of spatial autocorrelation (Table 6).



View larger version (83K):
[in this window]
[in a new window]
 
Fig. 6. Cluster sets of Field 2. Coding and legend keys are the same as those of Fig. 2.

 


View larger version (27K):
[in this window]
[in a new window]
 
Fig. 7. Cluster means for the clusters of Fig. 6. Each plot shows the sequence of means of standardized yield values, obtained by subtracting the mean and dividing by the standard deviation, of the corresponding cluster for each of the 3 yr of data for this field. To conserve space, the axes are not labeled. The abscissa is the year, and the ordinate is the standardized cluster mean.

 

View this table:
[in this window]
[in a new window]
 
Table 6. Cluster sets of Field 2. Each set contains more than one cluster resulting from different randomly selected rearrangements of the data. The table shows set identification; then number k of means (i.e., of clusters); the number of instances n of this set out of total number N of data rearrangements tested; the set consistency statistic {gamma}, given by Eq. [1]; and the Moran's I and corresponding z value under the assumption of normality. For all cluster sets, the number of cells p = 402.

 
Classification and Regression Tree Analysis
The objective of the CART analysis was to relate explanatory field–level factors to the cluster response values developed in the last section. We followed the philosophy of Henderson and Velleman (1981) in this effort. That is, rather than attempt an objective, straightforward analysis, we explicitly took into account knowledge gained during our observation of the experiment and the preliminary analysis of the data. It is agronomically unlikely that a higher level of K would be associated with decreased yield in this production system. Therefore, we excluded K from the analysis based on the negative correlation with yield indicated in Tables 1 and 2, which we took to indicate multicollinearity with another variable. Similarly, based on our observation that the low-yielding area in the north-central portion of Field 1 in 2001 was due to a large stand of herbicide-resistant watergrass, we included the value of 2001 NDVI to represent this weed density. Similarly to the observation of Plant et al. (1999), we found that NDVI provided an accurate and highly precise representation of the location of the weed population because of the difference in dates of senescence between the weed and the crop.

Best results for the first stage of the CART analysis, the computation of classification trees for the clusters of Fig. 2, were obtained using k = 4 (i.e., four clusters). Smaller values of k did not provide sufficient levels of the response variable while at k = 5, there were too few values of the explanatory vector in each level of the response variable to obtain splits. Based on the similar temporal patterns of cluster means, we combined Cluster Sets 4-1, 4-2, and 4-3 into one set for the analysis. Of the eight clusters of the combined set (Table 5), six produced legitimate classification trees. The most extensive was produced by two members (Fig. 8) and consisted of two splits, one on NDVI-2001 and one on SP. The other four members yielded trees with only the first split of Fig. 8, on NDVI-2001. Values of NDVI-2001 less than 0.425 distinguished Cluster 4, which was relatively high yielding in 1998 and declined thereafter (Fig. 5 and 6) while values of NDVI-2001 greater than 0.425 and SP less than 0.38 MPa distinguished Cluster 3, which was low yielding in the first 2 yr and then rose to become the highest yielding in the fourth year.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 8. Classification tree obtained for clusters of Field 1 at k = 4. The value of N in each box indicates the number of elements comprising that node of the classification tree. For example, the value N = 18 in the uppermost box indicates that there are 18 elements in which normalized difference vegetation index (NDVI)-2001 > 0.425 and soil penetration resistance (SP) > 0.38. The numbers on the second line in each box indicate the percentages of each cluster contained in that classification and regression trees (CART) node. For example, the node corresponding to the uppermost box is comprised of 28% Cluster 1, 33% Cluster 2, etc. Boxes with rounded corners indicate terminal nodes with a majority of members from one cluster, in which case the node is identified with that cluster. For example, the node corresponding to NDVI-2001 > 0.425 and SP ≤ 0.38 MPa is identified as corresponding to Cluster 3. This is indicated by the numeral to the right of the box.

 
Regression tree analysis of the yield data (Table 7) indicated that low-yielding areas in 1998 included three points of low OM and seven points of low elevation. All but two of these 10 points lie inside Cluster 3 of the aggregated sets, and these make up a majority of the points in this cluster. The same condition on elevation subdivides the highest-yielding seven points in 2001 (Table 7). The sample points lie in the southwest corner of the field, which received the most fill during the laser-leveling operation in 1999. The region with low pH, which had the highest yield in 1999, is located in the northeast portion of the field and does not correspond to any of the clusters. Neither the regression tree nor the classification tree analysis distinguished any factors associated with Clusters 1 or 2.


View this table:
[in this window]
[in a new window]
 
Table 7. Results of the regression tree analysis carried out on the data of Field 1. "Y" indicates yes, and "N" indicates no. No tree had more than two splits, and each tree had at least one terminal node on the first split. Terminal nodes are indicated in parentheses and include the median yield and number of data values. Nonterminal nodes include only the number of data values. The second split is from the nonterminal node of the first split.

 
Classification tree analysis of the cluster data in Field 2 was not as consistent as it was for Field 1. The trees with the greatest number of nodes were obtained with k = 3. Trees were consistently able to associate an explanatory variable (which was not the same in each case) with Cluster 3 (Fig. 6 and 7), the lowest-yielding cluster. Figure 9 shows the classification tree for Cluster Set 3-2, in which SP is the distinguishing variable. Of the five runs, OM ≤ 0.05 was identified twice, CL ≤ 23.5 twice, and SP > 0.56 once as the most distinguishing parameter value. This inconsistency may be due to the high level of intercorrelation among these variables, particularly in the area of the field corresponding to Cluster 3. Only six sample points, all located at the geographical boundary of the cluster, satisfied one of these conditions and not all of them. There was no consistent factor distinguishing Clusters 1 and 2. Similarly, LAD regression tree analysis of yield by year was only able to distinguish the lowest-yielding portion of the field, with a single split in each year (not shown). The splits were CL ≤ 22.5 in 1998, CL ≤ 23.5 in 1999, and SP > 0.56 in 2000.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 9. Classification tree obtained for Field 2 at k = 3. The interpretation is the same as that of Fig. 8.

 

    CONCLUSIONS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 
Beginning with studies such as that of Lark and Stafford (1997), a number of investigations have been conducted on the use of clustering methods to organize precision agriculture data (Jaynes et al., 2003; Dobermann et al., 2003; Perez-Quezada et al., 2003). The stated objective of these studies has generally been to develop management zones for crop production, but there is a second important application for this sort of analysis: the development of statistical methodologies capable of using precision agriculture data for scientific research. As the application of these multivariate statistical methods becomes perfected and more widely used, it will make possible more precise experiments, both observational and replicated, in commercial fields. This will serve as a useful complement to the more traditional small-plot experiments performed under controlled conditions.

Many of the studies in which cluster analysis has been applied to yield data have used fuzzy clustering. We have adopted an alternative approach more akin to the probabilistic methods of Monte Carlo simulation and to randomization methods (Manly, 1997). Our method uses crisp clusters but carries out a randomization process to identify those cluster sets with the highest probability of occurrence. These are postulated to be most likely to correspond to a real physical process. A debate has been carried out in the decision support literature for two decades concerning the merits of probability theory vs. fuzzy set theory (which is called "possibility theory" in that literature) in dealing with uncertainty (Cheeseman, 1986). Probably the best that can be said is that both methods have something to offer, and in the case of clustering yield monitor data, both should be investigated further.

Our method of analysis consists of two steps: determining the cluster structure based on median polish yield data and using a multivariate technique (in the case of this paper, CART) to associate explanatory variables with the response variables generated by the cluster analysis. This is not the only possible approach; Jaynes et al. (2003) advocate a three-step process. One of the most important issues that must be addressed in the first step of our approach is the assessment of the clusters. We argue that the most important assessment measure is one that provides an indication of how "physically meaningful" the clusters are. This is because a k-means cluster algorithm will generate k clusters, whether or not these correspond to any real physical processes. In a similar way, a hierarchical cluster process will generate hierarchical clusters, even in the absence of any real hierarchy (Jain and Dubes, 1984). Our most important measure of "physical meaning" is stability under random permutations of the initial cluster seeds. The second most important measure is a high level of spatial structure. The third most important measure is evidence of a spatially consistent relation as k is increased.

In this initial test of the method, we did not iterate the k-means clustering algorithm to convergence. Our reason for this was to try to use the method to explore the variability and sensitivity of the cluster patterns. The practice of nonlinear optimization, to which the k-means algorithm is related, often employs a geographical terminology in which the optima are considered as "peaks" and the process of finding them is considered as "hill climbing" (Press et al., 1986). Employing this same topographical analogy, we wished to explore the terrain around the peak as well as find the peak itself. Further investigation will be necessary to determine whether this is the best approach.

Classification and regression trees have been widely used in many areas to associate explanatory variables with response variables. One of their chief drawbacks is that small changes in the distinguishing parameter of a node relatively low in the tree may have a profound effect on the structure of the tree above that node. Our method of random rearrangement partially compensates for this problem by allowing us to view the effects of small changes in the response variable (the cluster pattern) on the tree structure. This is evident in the stability of the trees associated with Field 1 under small perturbations compared with a corresponding instability of the trees associated with Field 2. This does not mean, however, that CART was not useful in identifying the important underlying factors in Field 2. Indeed, in both fields, the method seemed to produce meaningful results but not to be able to identify factors underlying all of the clusters. Possible reasons for this are that the clusters are themselves an artifact, the clusters are real but the factor underlying them was not measured, or there were not sufficient sample data to enable the CART process to construct a full tree. Indeed, the number of sample points, 36 and 58, is at the lower limit of effectiveness of CART (Breiman et al., 1984), which is a disadvantage of the CART method.

These two fields were purposefully selected at the start of the experiment to represent two ends of a spectrum of field behavior (Roel and Plant, 2004). Preliminary data, together with the cooperating grower's experience, indicated that Field 1 appeared to have no particular consistent spatial structure while Field 2 gave a strong preliminary indication of spatial structure due to its recent transition into production as a rice field. In their discussion of the "null hypothesis of precision agriculture," Whelan and McBratney (2000) point out that use of SSM in a field should be contingent on a clear indication that this will bring a substantial economic return. In the case of Field 1, there does not after 4 yr appear to be a consistent pattern of variability that would make this field a good candidate for SSM. In the case of Field 2, it is possible that application of ameliorative inputs to a relatively small part of the field, corresponding to Cluster 3, may provide an economic return where the application of these same inputs on a whole-field basis would not be economically justified.


    ACKNOWLEDGMENTS
 
We are very grateful to the cooperating farmers, Charley Mathews and Charley Mathews, Jr., for allowing us to carry out research on their farm. We are also grateful to an anonymous reviewer for some helpful comments. This research was supported by the California Rice Research Board, by the Instituto Nacional de Investigaciones Agropecuarias of Uruguay, and by a William F. Golden Fellowship to A. Roel.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 REFERENCES
 





This Article
Right arrow Abstract Freely available
Right arrow Figures Only
Right arrow Full Text (PDF) Free
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Web of Science (1)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Roel, A.
Right arrow Articles by Plant, R. E.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Roel, A.
Right arrow Articles by Plant, R. E.
Agricola
Right arrow Articles by Roel, A.
Right arrow Articles by Plant, R. E.
Related Collections
Right arrow Rice
Right arrow Site-Specific Analysis


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
The SCI Journals Crop Science Vadose Zone Journal
Journal of Natural Resources
and Life Sciences Education
Soil Science Society of America Journal
Journal of Plant Registrations Journal of
Environmental Quality
The Plant Genome