Estimation of soil pH with geochemical indices in forest soils

Authors: Wei Wu ^aff001; Hong-Bin Liu ^aff002
Authors place of work: College of Computer and Information Science, Southwest University, Beibei, Chongqing, China ^aff001; College of Resources and Environment, Southwest University, Beibei, Chongqing, China ^aff002
Published in the journal: PLoS ONE 14(10)
Category: Research Article
doi: https://doi.org/10.1371/journal.pone.0223764

Summary

Soil pH is a critical soil quality index that controls soil microbial activity, soil nutrient availability, and plant root growth and development. The current study evaluates various pedotransfer functions for predicting soil pH from different geochemical indices (CaO and the ratios of Al₂O₃, Fe₂O₃, TiO₂, SiO₂, MgO, and K₂O to CaO) in forest soils. Empirical functions (quadratic, cubic, sigmoid, logarithmic) and artificial neural networks (ANNs) built on these geochemical indices were assessed with an independent testing set. Mean bias error (MBE), root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), coefficient of determination (R²), t-statistic (t-stat), and Akaike’s Information Criterion (AIC) were used to evaluate model performance. Additionally, a new indicator, the global performance indicator (GPI), was introduced in this study and used to rank the models. According to GPI, the sigmoid functions and ANNs performed better than the other models. On average, they explained more than 70% of the variability in soil pH. Both model structure and dataset shape affect model performance. The best input was CaO for ANNs, sigmoid, and logarithmic functions. The ratios of K₂O to CaO and Al₂O₃ to CaO were the best inputs for quadratic and cubic equations, respectively.

Keywords:

Analysis of variance – Mathematical functions – Geochemistry – Global positioning system – Artificial neural networks – Soil pH – Plant roots

Introduction

Soil pH indicates soil acidity and alkalinity. Generally, slightly acidic soils are optimal for macro- and micronutrient availability [1]. Soil pH affects soil nutrients and plant growth and development [2]. It is a critical element for understanding soil nutrient availability and weathering as well as relationships between soil and biota. The relationship between soil pH and base saturation has been well studied. Some researchers observed a curvilinear relationship between soil pH and Ca saturation [3, 4]. Others reported a linear relationship between them [5, 6].

Soil CaO has been applied to predict soil pH together with other geochemical elements. For example, Lukens et al. used ratios of Fe₂O₃, TiO₂, and Al₂O₃ to CaO to predict soil pH with sigmoid functions [7]. The models produced similar prediction accuracy, with coefficients of determination between 0.70 and 0.74 and root mean square errors between 0.83 and 0.88. Nordt and Driese found that bulk soil CaO + MgO could be used to predict soil pH in Vertisols [8]. The prediction of soil pH from bulk soil elemental oxides is also an issue in pedotransfer functions. Because soil CaO is one source of the Ca²⁺ supplied to the soil solution, we expect that CaO by itself could be used to estimate soil pH. However, studies on this topic are limited.

The objectives of the current study were to (1) evaluate various pedotransfer functions for predicting soil pH using several geochemical indices and (2) investigate the usefulness of soil CaO for predicting soil pH. To do this, five models with different geochemical indices were compared and tested. Specifically, artificial neural networks were evaluated with respect to the non-linear relationship between soil pH and the geochemical indices. Model performance was evaluated on an independent validation set.

Materials and methods

Study site

The study area covering 13326 km² is located in the core region of the Three Gorges Reservoir of China (Fig 1). It has a humid subtropical monsoon climate with a mean annual precipitation of 1267 mm and a mean annual temperature of 16.02°C. The elevation varies between 175 and 2033 m with a mean of 643 m. The slope changes between 0.45° and 52.96° with a mean of 17.83°.

**Fig. 1. Maps of study area location and sample sites.**

Data

A total of 1163 samples were collected from forest soils in the study area (Fig 1), where the major bedrock lithologies are carbonate rocks and sandstone and the soil type is Cambisols [9]. The study did not involve private land, protected land, or endangered or protected species. No specific permissions were required for these locations/activities. To ensure an even distribution of selected sites, systematic sampling on a regular grid was applied in this work [10]. Surface soils at 0–20 cm depth were collected at a density of 1 sample/km². For each sampling site, 3 to 5 subsamples collected within 50 m of the site were mixed to represent the sample. All sampling locations were recorded by Global Positioning System (GPS). Standard measurements were performed on the soil samples. Prior to laboratory analysis, samples were air-dried and passed through a 2 mm soil sieve. Soil pH was determined at a soil-to-water ratio of 1:2.5 with a glass electrode. The elements (Al₂O₃, Fe₂O₃, TiO₂, SiO₂, K₂O, MgO, and CaO) were measured by the Inductively Coupled Plasma-Optical Emission Spectrometry (ICP-OES) method [10].

Ratios of Al₂O₃, Fe₂O₃, TiO₂, SiO₂, MgO, and K₂O to CaO (hereafter AlCa, FeCa, TiCa, SiCa, MgCa, and KCa) and CaO itself were used to develop the pedotransfer functions to predict soil pH in forest soils [7]. These geochemical indices were calculated by

$$\mathrm{XCa} = \frac{X}{\mathrm{CaO}} \tag{1}$$

where X represents Al₂O₃, Fe₂O₃, TiO₂, SiO₂, MgO, or K₂O.

All data were divided into calibration and validation sets for each dataset. Approximately 2/3 of the data were used to develop (or train) the models. The remaining 1/3 of the data were used to validate the models.
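The index calculation and the calibration/validation split can be sketched as follows. This is a minimal illustration, not the authors' workflow: the dictionary layout of a sample, the function names, and the use of a seeded random shuffle are all assumptions, since the text does not state how samples were assigned to each set.

```python
import random

# Index name -> oxide in the numerator; CaO is always the denominator.
RATIO_NUMERATORS = {
    "AlCa": "Al2O3", "FeCa": "Fe2O3", "TiCa": "TiO2",
    "SiCa": "SiO2", "MgCa": "MgO", "KCa": "K2O",
}


def geochemical_indices(sample):
    """Return CaO and the six oxide-to-CaO ratios for one sample.

    `sample` maps oxide names to weight percentages (hypothetical layout).
    """
    cao = sample["CaO"]
    if cao <= 0:
        raise ValueError("CaO must be positive to form the ratios")
    indices = {"CaO": cao}
    for name, oxide in RATIO_NUMERATORS.items():
        indices[name] = sample[oxide] / cao
    return indices


def split_calibration_validation(samples, calibration_fraction=2 / 3, seed=0):
    """Shuffle and split samples; random assignment is an assumption here."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_cal = round(len(shuffled) * calibration_fraction)
    return shuffled[:n_cal], shuffled[n_cal:]
```

Fixing the seed keeps the split reproducible, which matters when several models are compared on the same validation samples.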

Models

Both empirical functions (quadratic, cubic, sigmoid, and logarithmic) and artificial neural networks were tested in this work. The expressions of these empirical functions are given in Table 1. For the sigmoid function, parameters k and p are the minimum and range of the response, respectively.
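Table 1 is not reproduced in this text, so the exact sigmoid expression is not available here. As an illustration only, a generic four-parameter logistic is one form consistent with the description of k as the minimum and p as the range of the response. The parameter names `rate` and `x0` (growth rate and inflection point) are assumptions and may not match the authors' parameterisation.

```python
import math


def sigmoid_ph(x, k, p, rate, x0):
    """Generic four-parameter logistic (assumed form, not taken from Table 1).

    The response runs from k (minimum) to k + p (minimum plus range);
    `rate` sets the steepness and `x0` the inflection point. A negative
    `rate` gives a response that falls as x increases.
    """
    return k + p / (1.0 + math.exp(-rate * (x - x0)))
```

At `x = x0` the function returns `k + p / 2`, the midpoint of the response range.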

Artificial neural networks (ANNs), which are inspired by biological neural networks, are also frequently used tools in various fields [11–13]. ANNs can deal with both linear and non-linear relationships between variables [11, 12]. In the current study, ANNs with three layers (an input, a hidden, and an output layer) were tested and trained with the scaled conjugate gradient backpropagation algorithm (Fig 2). The output of a node is

$$y_j = f\left(\sum_{i} w_{ij}\, x_i + b_j\right) \tag{2}$$

where f is an activation function, y_j is the output of node j, x_i is the ith element of the input vector, w_ij is the weight connecting input x_i to node j, and b_j is the bias associated with node j. The parameters (weights and biases) are determined during the training stage based on a set of input data and targets. The tangent and linear activation functions were used in the hidden layer and output layer, respectively [14–17].
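The node output described above (a weighted sum of inputs plus a bias, passed through an activation function) can be sketched for a three-layer network with a single output. This is a minimal illustration: it assumes the "tangent" activation means the hyperbolic tangent, and the function and argument names are hypothetical. Training is not shown.

```python
import math


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a three-layer network with one output node.

    hidden_weights[j][i] is the weight from input i to hidden node j.
    Hidden nodes use tanh (assumed meaning of "tangent"); the output
    node is linear.
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights_j, x)) + b_j)
        for weights_j, b_j in zip(hidden_weights, hidden_biases)
    ]
    # Linear output: weighted sum of hidden activations plus the output bias.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
```

With a single geochemical index as input, `x` is a one-element list, for example a scaled CaO value.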

Hidden layers with between 2 and 20 neurons were tried. To train the ANNs, the calibration dataset was randomly divided into three subsets for training (70%), validating (15%), and testing (15%). The ANNs with the lowest root mean square error (RMSE) and the highest coefficient of determination (R²) were selected to predict soil pH using the geochemical indices. The number of parameters was calculated by [18]

$$N_p = (N_i + 1)\,N_h + (N_h + 1)\,N_o \tag{3}$$

where N_p is the number of parameters, N_i, N_h, and N_o are the numbers of nodes in the input, hidden, and output layers, respectively, and 1 accounts for the bias.
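For a fully connected three-layer network, the usual count of weights and biases is (N_i + 1)·N_h + (N_h + 1)·N_o. The sketch below assumes that form; the function name is hypothetical.

```python
def count_parameters(n_input, n_hidden, n_output):
    """Weights plus biases of a fully connected three-layer network."""
    hidden_params = (n_input + 1) * n_hidden    # +1: one bias per hidden node
    output_params = (n_hidden + 1) * n_output   # +1: one bias per output node
    return hidden_params + output_params


# One input index, 18 hidden nodes, one output (soil pH).
print(count_parameters(1, 18, 1))  # 55
```

Under this count, each extra hidden node in a one-input, one-output network adds three parameters, which is why the larger networks carry a heavier AIC penalty than the empirical functions.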

Performance evaluation

Model performance can be evaluated by comparing predicted and measured data using a set of statistical error indicators. In this work, mean bias error (MBE), root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), coefficient of determination (R²), t-statistic (t-stat), and Akaike’s Information Criterion (AIC) [19] were employed to assess model performance on the independent validation set.

where n is the number of observations, y_i and ŷ_i are the measured and estimated soil pH of the ith soil sample, respectively, ȳ is the mean of the measured soil pH, and k is the number of parameters. MBE shows the overall tendency to under- or overestimate. A negative MBE indicates overestimation by the model, and a positive one indicates underestimation. The most accurate model has an MBE close to zero, lower values of RMSE, MAPE, MAE, t-stat, and AIC, and a higher value of R².
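The indicator equations are not reproduced in this text. The sketch below uses conventional definitions that match the symbols and the sign convention stated above (negative MBE means overestimation, so the residual is measured minus estimated). The forms chosen for R², t-stat, and AIC are assumptions, since each has several variants in the literature, and the function name is hypothetical.

```python
import math


def evaluate(measured, predicted, n_params):
    """Validation indicators under conventional (assumed) definitions."""
    n = len(measured)
    residuals = [y - y_hat for y, y_hat in zip(measured, predicted)]
    mean_y = sum(measured) / n
    sse = sum(e * e for e in residuals)
    sst = sum((y - mean_y) ** 2 for y in measured)

    mbe = sum(residuals) / n                      # negative => overestimation
    rmse = math.sqrt(sse / n)
    mae = sum(abs(e) for e in residuals) / n
    mape = 100.0 * sum(abs(e / y) for e, y in zip(residuals, measured)) / n
    r2 = 1.0 - sse / sst                          # assumed form of R²

    # Assumed form: t = sqrt((n - 1) * MBE² / (RMSE² - MBE²)).
    spread = rmse ** 2 - mbe ** 2
    if spread > 0:
        t_stat = math.sqrt((n - 1) * mbe ** 2 / spread)
    else:
        t_stat = 0.0 if mbe == 0 else float("inf")

    # Assumed least-squares form; undefined for a perfect fit (sse == 0).
    aic = n * math.log(sse / n) + 2 * n_params
    return {"MBE": mbe, "RMSE": rmse, "MAPE": mape, "MAE": mae,
            "R2": r2, "t-stat": t_stat, "AIC": aic}
```

Because MBE averages signed residuals, errors of opposite sign cancel, so a model can have an MBE of zero and still a large RMSE.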

Each statistical error indicator has its specific strengths and weaknesses. For example, RMSE is not a better indicator than MBE for evaluating average model performance [20]. However, MBE cannot reflect performance correctly when a model overestimates and underestimates at the same time, because errors of opposite sign cancel. Therefore, to identify the best model based on the above-mentioned indicators, a new Global Performance Indicator (GPI) was introduced in this work. Each indicator is scaled to the range 0–1, with 0 being the best and 1 the worst. For indicators that can take negative or positive values, their absolute values are used in GPI. For indicators where lower is better (e.g., RMSE and MAPE), the minimum is scaled to 0 and the maximum to 1 (Eq 11). For indicators where higher is better (e.g., R²), the maximum is scaled to 0 and the minimum to 1 (Eq 12). For the ith model, the GPI was defined as

where P is the performance indicator, P_max and P_min are the maximum and minimum of P for the corresponding indicator across the evaluated models, I_ij is the scaled value of indicator j for the ith model, and m is the number of performance indicators. Models with GPI closer to zero perform better.
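The scaling and aggregation described above can be sketched as follows. The min-max scaling follows the text directly, but two choices are assumptions: the GPI is taken as the mean of the m scaled indicators (the defining equation is not reproduced here), and absolute values are applied only to MBE, leaving AIC signed. The function name and the layout of `scores` are hypothetical.

```python
def global_performance_indicator(scores, higher_is_better=("R2",),
                                 use_absolute=("MBE",)):
    """Map {model: {indicator: value}} to {model: GPI}; lower is better."""
    models = list(scores)
    indicators = list(scores[models[0]])
    totals = {model: 0.0 for model in models}
    for name in indicators:
        values = {
            m: abs(scores[m][name]) if name in use_absolute else scores[m][name]
            for m in models
        }
        lo, hi = min(values.values()), max(values.values())
        span = hi - lo
        for m in models:
            if span == 0:
                scaled = 0.0                        # all models tie here
            elif name in higher_is_better:
                scaled = (hi - values[m]) / span    # best (maximum) -> 0
            else:
                scaled = (values[m] - lo) / span    # best (minimum) -> 0
            totals[m] += scaled
    # Assumed aggregation: mean of the scaled indicators.
    return {model: totals[model] / len(indicators) for model in models}
```

Because each indicator is rescaled across the models being compared, a GPI value is only meaningful relative to that set of models. Adding or removing a model can change the scores of the others.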

Statistical analysis

A one-way analysis of variance (ANOVA) was used to test the difference in variables between calibration and validation sets. Pearson’s correlation coefficients were calculated to determine the strength of correlations between soil pH and geochemical indices. Descriptive statistics were computed in SPSS v13.0. Model development and validation were carried out in MATLAB v9.0.
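For readers without SPSS, the two statistics named above can be computed from their textbook definitions. This sketch is not the authors' procedure: the function names are hypothetical, and the p-value of the F statistic (which needs the F distribution) is omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_way_anova_f(*groups):
    """F statistic of a one-way ANOVA (between / within mean squares)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g)
                    for g, m in zip(groups, means))
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

With two groups, such as the calibration and validation sets, the one-way ANOVA F statistic equals the square of the pooled two-sample t statistic.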

Results

Data overview

On average, the soils were neutral. Soil pH varied between 4.34 and 8.7 with a mean of 7.16 (Table 2). CaO mainly ranged between 0 and 30% (mean = 2.63%), Al₂O₃ between 12 and 15% (mean = 14.4%), Fe₂O₃ between 3 and 6% (mean = 5.2%), TiO₂ between 0.5 and 0.8% (mean = 0.75%), SiO₂ between 50 and 70% (mean = 62.9%), MgO between 0 and 2% (mean = 1.9%), and K₂O between 2.2 and 2.7% (mean = 2.5%) (Fig 3). In terms of the coefficient of variation (CV%), soil pH showed low variability (< 25%). Among the geochemical indices, SiCa and AlCa showed low variability (< 25%), FeCa, TiCa, MgCa, and KCa showed medium variability (25–75%), and CaO showed high variability (> 75%).
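The variability classes used above follow from the coefficient of variation, CV% = 100 × standard deviation / mean. A minimal sketch, assuming the sample standard deviation (the text does not say whether the sample or population form was used) and hypothetical function names:

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent, using the sample standard deviation (an assumption)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def variability_class(cv_percent):
    """Classify CV% with the thresholds quoted in the text."""
    if cv_percent < 25:
        return "low"
    if cv_percent <= 75:
        return "medium"
    return "high"
```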

**Fig. 3. Histogram plots for the geochemical elements.**

**Tab. 2. Descriptive statistics of soil pH and geochemical indices (N = 1163).**

Soil pH showed significant correlation with these geochemical indices (Table 3 and Fig 4).

**Fig. 4. Relationships between soil pH and the geochemical indices.**

**Tab. 3. Pearson’s correlation coefficients between soil pH and geochemical indices (p<0.01).**

Differences in soil pH and geochemical indices between calibration and validation sets are given in Table 4. The ANOVA results indicated no significant difference in these variables between the calibration and validation sets.

**Tab. 4. Differences in soil pH and geochemical indices between calibration and validation sets (N = 877 and 286 for calibration (Cal) and validation (Val) sets, respectively).**

Model calibration

The coefficients of determination (R²) of the developed models based on the calibration set are given in Table 5. ANNs with 18, 7, 11, 7, 14, 19, and 15 hidden nodes were applied to estimate soil pH using CaO, AlCa, FeCa, SiCa, TiCa, MgCa, and KCa, respectively (Fig 5). On average, ANN produced the highest value of R² (0.73), followed by the sigmoid (R² = 0.70) and cubic (R² = 0.63) equations. The values of R² ranged between 0.21 (p < 0.01, logarithmic equation with SiCa) and 0.77 (p < 0.01, ANN with SiCa).

**Fig. 5. Root mean square error (RMSE) and coefficient of determination (R²) for ANNs with different numbers of hidden nodes (The black box indicates the lowest value of RMSE or highest value of R²).**

**Tab. 5. Model calibration (N = 877, p<0.01).**

Model performance

Model performance was evaluated on the validation set, and the statistical error indicators are shown in Table 6. On average, all models except the sigmoid functions tended to underestimate according to MBE. In terms of MAPE, the models gave good estimates of soil pH (mean MAPE = 7.4%). The ANN and sigmoid models explained more than 70% of the variability in soil pH (R² = 0.73 and 0.71, respectively). The logarithmic model performed worst, with the highest values of MBE, RMSE, MAPE, MAE, and AIC, and the lowest value of R². ANN gave the best estimates of soil pH according to RMSE, MAPE, MAE, t-stat, and R². The sigmoid model performed best based on AIC and MBE. The predictive performance of each geochemical index varied with the model. For example, SiCa produced the highest R² in ANNs, KCa in quadratic and cubic functions, and CaO in logarithmic and sigmoid models. Lukens et al. [7] predicted soil pH from AlCa, FeCa, and TiCa using sigmoid models. They reported that TiCa and FeCa gave slightly better performance than AlCa. In the current work, CaO, AlCa, SiCa, and KCa produced better predictions of soil pH than FeCa and TiCa using sigmoid functions, based on R².

The ranking of the models differed depending on the statistical error indicator. For example, ANN with SiCa was the best in terms of RMSE, MAPE, MAE, and R². The sigmoid function with TiCa performed best based on MBE and t-stat. The cubic function with KCa was the best according to AIC.

Because the statistical error indicators did not always give consistent results, the GPI was introduced and calculated by combining these indicators. The ranking of the models according to each accuracy indicator and GPI is reported in Table 6. On average, the GPI results ranked the sigmoid model, ANN, and cubic function 1st, 2nd, and 3rd, respectively. The ranking given by GPI was considered acceptable and preferable because it combines all the performance tests. GPIs were also calculated within each model. The geochemical indices performed differently across the evaluated models. CaO ranked 1st in ANNs, sigmoid, and logarithmic functions. KCa ranked 1st in quadratic models. Therefore, CaO and KCa were the best inputs to predict soil pH for both ANNs and the empirical equations over the study site. Scatter plots of the observed and predicted soil pH by ANN with CaO and sigmoid with CaO are given in Fig 6. Statistics of the validation results are listed in Table 7. The maximum pH values were underestimated while the minimums were overestimated by both models. There was no significant difference in soil pH between observations and predictions for the two models.

**Fig. 6. Scatter plot of the observed and predicted soil pH by (a) artificial neural network with CaO and (b) sigmoid with CaO.**

**Tab. 7. Statistics of validation results (N = 286).**

Discussion

On average, ANNs performed better than cubic, quadratic, and logarithmic functions. Among the empirical approaches, the sigmoid function was the best. Differences in model structure account for the differences between them [21]. An ANN constructs a network of weighted, connected nodes that are trained by certain algorithms. Compared with other models, the main advantages of ANNs are: 1) they are non-parametric techniques and do not need any model assumptions; 2) they make no assumption about data distribution. However, ANNs are often criticized for their complex network structure, which makes the results difficult to interpret [22]. The indicator AIC, based on an “information-theoretical approach”, has been widely used for model selection [23–25]. In this case, ANNs produced higher values of AIC than the other models, due to their larger number of model parameters. In addition, dataset shape also affects model performance, especially for the empirical functions. Their rank order was sigmoid > cubic > quadratic > logarithmic. The best input was CaO for ANNs, sigmoid, and logarithmic functions. The ratios of K₂O to CaO and Al₂O₃ to CaO were the best inputs for quadratic and cubic equations, respectively.

CaO and the ratios of elemental oxides to CaO could be used to predict soil pH, because Ca²⁺ is the main driver affecting soil pH [7]. The sigmoid functions indicated that soil pH changes at different rates with the different geochemical indices. This was also shown by the scatter plots (Fig 4). The oxides that were more abundant than CaO had higher values of growth rate and inflection point (e.g., SiO₂, Al₂O₃, Fe₂O₃) and vice versa (e.g., TiO₂, MgO, K₂O). Lukens et al. [7] stated that samples collected from calcareous soils could have relatively large values of FeCa or AlCa and compressed intervals at higher index values, where pH decreases as a function of Ca loss and Fe or Al gain. This could also explain the relationships between soil pH and the ratios of elemental oxides to CaO over the current study site.

Soil pH is a key parameter for understanding soil weathering and relationships between soil nutrient availability and environmental factors. Weathering indices that incorporate Ca in some form could track soil pH. A recent study reported that soil pH values are closely correlated with water balance (mean annual precipitation minus mean annual potential evapotranspiration) at the global scale [26]. The pedotransfer functions and geochemical proxies compared and evaluated in the current study could be used to estimate significant environmental components of the past [7].

Conclusions

Various pedotransfer functions with different geochemical indices were applied to estimate soil pH in forest soils. The predicted data were compared to the measurements of an independent validation dataset. To do so, seven statistical indicators were applied to test model performance. Moreover, a new accuracy factor, named Global Performance Indicator (GPI), was introduced in this study and used to rank the proposed models. The rank order was sigmoid > artificial neural network > cubic > quadratic > logarithmic. Soil CaO could be used to predict soil pH with ANNs, sigmoid, and logarithmic functions. KCa and AlCa were the best inputs for quadratic and cubic equations, respectively.

Supporting information

S1 File [csv]
Data.

References

1. Brady NC, Weil RR. The Nature and Properties of Soils, 14th ed. Prentice Hall, Upper Saddle River, NJ (975 pp.), 2008.

2. McLean EO. Soil pH and lime requirement, In: Page A.L., et al. (Eds.), Methods of Soil Analysis Part 2—Chemical and Microbiological Properties, 2nd ed. ASA/SSSA, Madison, WI, pp. 199–223, 1982.

3. Reuss JO, Walthall PM, Roswall EC, Hopper RWE. Aluminum solubility, calcium-aluminum exchange, and pH in acid forest soils. Soil Sci. Soc. Am. J. 1990; 54: 374–380.

4. Bloom PR, Grigal DF. Modeling soil response to acidic deposition in nonsulfate adsorbing soils. J. Environ. Qual. 1985; 14: 489–495.

5. Magdoff FR, Bartlett RJ. Soil pH buffering revisited. Soil Sci. Soc. Am. J. 1985; 49 (1): 145–148.

6. Blosser DL, Jenny H. Correlations of soil pH and percent base saturation as influenced by soil forming factors. Soil Sci. Soc. Am. P. 1971; 35 (6): 1017–1018.

7. Lukens WE, Nordt LC, Stinchcomb GE, Driese SG, Tubbs JD. Reconstructing pH of paleosols using geochemical proxies. J. Geol. 2018; 126: 427–449.

8. Nordt LC, Driese SG. A modern soil characterization approach to reconstructing physical and chemical properties of paleo-vertisols. Am. J. Sci. 2010; 310: 37–64.

9. FAO. Soil Map of the World, Revised Legend. Rome, Italy, 1988.

10. CGS. Specification for multi-purpose regional geochemical survey (DD200501), in: China Geological Survey (Ed.), Beijing (in Chinese), 2005.

11. Guo PT, Wu W, Sheng QK, Li MF, Liu HB, Wang ZY. Prediction of soil organic matter using artificial neural network and topographic indicators in hilly areas, Nutr. Cycl. Agroecosys. 2013; 95: 333–344.

12. Guo PT, Shi Z, Li MF, Luo W, Cha ZZ. A robust method to estimate foliar phosphorus of rubber trees with hyperspectral reflectance. Ind. Crop. Prod. 2018; 126: 1–12.

13. Kanungo DP, Sharma S, Pain A. Artificial neural network (ANN) and regression tree (CART) applications for the indirect estimation of unsaturated soil shear strength parameters. Front. Earth Sci-Prc. 2014; 8 (3): 439–456.

14. Mba L, Meukam P, Kemajou A. Application of artificial neural network for predicting hourly indoor air temperature and relative humidity in modern building in humid region. Energ. Buildings. 2016; 121: 32–42.

15. Lim HS, Kang YT. Estimation of finish cooling temperature by artificial neural networks of backpropagation during accelerated control cooling process. Int. J. Heat Mass Tran. 2018; 126: 579–588.

16. Antiwi P, Li J, Meng J, Deng K, Quashie FK, Li J, et al. Feedforward neural network model estimating pollutant removal process within mesophilic upflow anaerobic sludge bioreactor treating industrial starch processing wastewater. Bioresource Technol. 2018; 257:102–112.

17. Singh VK, Tiwari KN. Prediction of greenhouse micro-climate using artificial neural network. Appl. Ecol. Env. Res. 2017; 15(1): 767–778.

18. Minasny B, McBratney AB. The Neuro-m method for fitting neural network parametric pedotransfer functions. Soil Sci. Soc. Am. J. 2002; 66: 352–361.

19. Akaike H. Information theory and an extension of maximum likelihood principle. p. 267–281. In Petrov B.N. and Csáki F. (ed). Second International Symposium on Information Theory. Akadémia Kiadó, Budapest, 1973.

20. Willmott CJ, Matsuura K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005; 30: 79–82.

21. Loh WY. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 14–23, 2011.

22. Zou X, Zhao J, Povey MJW, Holmes M, Mao H. Variables selection methods in near-infrared spectroscopy. Anal. Chim. Acta. 2010; 667: 14–32. doi: 10.1016/j.aca.2010.03.048. PMID: 20441862.

23. Burnham KP, Anderson DR. Model selection and multimodel inference: a practical information-theoretic approach, Berlin: Springer, 1998.

24. Hegyi G, Garamszegi LZ. Using information theory as a substitute for stepwise regression in ecology and behavior. Behav. Ecol. Sociobiol. 2011; 65 (1): 69–76.

25. Mundry R. Issues in information theory-based statistical inference-commentary from a frequentist’s perspective. Behav. Ecol. Sociobiol. 2011; 65(1): 57–68.

26. Slessarev EW, Lin Y, Bingham NL, Johnson JE, Dai Y, Schimel JP, et al. Water balance creates a threshold in soil pH at global scale. Nature 2016; 540: 567–569. doi: 10.1038/nature20139. PMID: 27871089.