A critical view on the suitability of machine learning techniques to downscale climate change projections: Illustration for temperature with a toy experiment

Machine learning is a growing field of research with many applications. It provides a series of techniques able to solve complex nonlinear problems, and that has promoted their application for statistical downscaling. Intercomparison exercises with other classical methods have so far shown promising results. Nevertheless, many evaluation studies of statistical downscaling methods neglect the analysis of their extrapolation capability. In this study, we aim to make a wakeup call to the community about the potential risks of using machine learning for statistical downscaling of climate change projections. We present a set of three toy experiments, applying three commonly used machine learning algorithms, two different implementations of artificial neural networks and a support vector machine, to downscale daily maximum temperature, and comparing them with the classical multiple linear regression. We have tested the four methods in and out of their calibration range, and have found how the three machine learning techniques can perform poorly under extrapolation. Additionally, we have analysed the impact of this extrapolation issue depending on the degree of overlapping between the training and testing datasets, and we have found very different sensitivities for each method and specific implementation.


| INTRODUCTION
Global climate models (GCMs) are the main tool to simulate future climate projections, but their coarse resolution makes them unsuitable for the local scale Schoof, 2013;Wilby et al., 2004). This gap is often filled by applying some type of downscaling over GCMs outputs (Maraun et al., 2010). The two primary categories of downscaling techniques are (1) dynamic downscaling, mostly by nesting high resolution regional climate models (RCMs) in GCMs and (2) statistical downscaling (SD), by establishing empirical/statistical relationships between large-scale predictors and local surface predictands. SD relies on the following assumptions: (1) stationarity of these relationships under climate change (Wilby et al., 2004); (2) predictors are reliably simulated by GCMs; (3) predictors contain climate change signal; and (4) predictors are strongly correlated with predictands (Wilby et al., 2004). There are many different approaches to SD, each with its strengths and weaknesses, and the VALUE EU COST Action (Maraun et al., 2015) established a framework to evaluate and intercompare them.
Machine learning (ML) techniques are able to simulate complex nonlinear relationships, what has promoted their application as statistical downscaling models (SDMs) (see, e.g., Goyal et al., 2012;Li et al., 2020;Sachindra et al., 2018;Vandal et al., 2019). Nevertheless, Hsieh (2009) pointed out the risk of using ML techniques to downscale climate change projections. While they have proved able to map complex functions, they can exhibit significantly different behaviours in and out of their calibration range.
Many evaluation studies limit to the so-called Perfect Predictor experiment (Maraun & Widmann, 2018;Maraun et al., 2015), which usually does not explicitly test the behaviour of SDMs under extrapolation, and never does it under such a degree of extrapolation as it is expected in the future. The ability of SDMs to extrapolate can be evaluated in different ways. A first option consists in splitting a historical dataset in training/testing samples in a specific way, for example by leaving the warmest/ driest years for validation. This analysis allows to explore a certain degree of extrapolation, but usually narrower than the one expected in future projections. Some examples of this approach can be seen in Gutiérrez et al. (2013), San-Martín et al. (2017) and Baño-Medina et al. (2020). Another approach consists of using of RCMs as pseudoobservations to train SDMs in a historical period and evaluate them in the future, the so-called pseudo-reality experiments (Maraun & Widmann, 2018;Maraun et al., 2015). These experiments check the necessary (but not sufficient) condition of SDMs being able to reproduce climate change signal given by GCM-RCMs, and must be limited to those variables which are realistically simulated by GCM-RCMs. Although this approach allows the analysis of a wider range of extrapolation, it introduces a source of uncertainty associated with RCMs. Some examples of this approach can be found in Charles et al. (1999), Vrac et al. (2007), Gaitan et al. (2014) and . And finally, another possible approach consists in comparing raw GCMs with SDMs, which can only be done by averaging over large enough areas. This approach allows to analyse whether SDMs are able to preserve trends projected by GCMs, but it cannot tackle imperfections on the finer spatial scales. Some examples of this approach can be found in Vandal et al. (2019), Xu et al. (2020) and Baño-Medina et al. (2021). Although each approach has its drawbacks, these types of studies are very valuable and can reveal significant extrapolation issues.
In this paper, we perform a set of three toy experiments in order to show how some ML techniques commonly used to downscale climate change projections can behave extremely wrong under extrapolation. The main objective of this paper is to make a wake-up call to the community on the potential risks of using these techniques, to reinforce the need of some kind of extrapolation analysis as a key piece of SDMs evaluation studies and to analyse the sensitivity of three common ML SDMs to the degree of overlapping between the training and testing datasets. The paper is organised as follows. First, a description of the datasets used is given at Section 2, followed by a brief introduction to the downscaling methods in Section 3. The experiment design is explained in Section 4. And finally, results and concluding remarks are shown and discussed in Sections 5 and 6, respectively.

| DATA
In order to keep this study as simple and clear as possible, it has been limited to the downscaling of daily maximum surface temperature (TMAX) using temperature at 850 hPa (T850) as the only predictor. This predictor meets three out of the four conditions enumerated in F I G U R E 1 Spatial distribution of Pearson correlation coefficient between T850 (interpolated from the reanalysis ERA-interim) and daily maximum temperature (from the high resolution AEMET grid [0.05 ]). See text for details Section 1: reliability by GCMs, signal of climate change and strong correlation with predictand (see Figure 1). Note that this simple choice of T850 as the unique predictor is only reasonable when near surface and free atmosphere are strongly coupled.
T850 has been taken from the reanalysis ERA-Interim (Dee et al., 2011) of the European Centre for Medium-Range Weather Forecasts (ECMWF) for the domain (45 N, 34.5 N, 10.5 W, 4.5 E) with spatial resolution of 1.5 Â 1.5 and as daily mean values (from 00, 06, 12 and 18 UTC). It has been interpolated to each target point with a bilinear interpolation of the four nearest neighbours and standardised so it is used in the form of anomalies.
TMAX has been taken from a high resolution grid (0.05 ) consisting of 16,156 points over Spain (mainland and Balearic Islands) developed by AEMET (Peral et al., 2017). This grid has been generated using an adaptation of the HIRLAM Surface Analysis code Rodríguez et al., 2003), based on an Optimum Interpolation algorithm (Daley, 1991), applied over a selection of 1800 stations from the AEMET observational network.
Both datasets cover the period 1980-2016, and the training/testing split has been performed in different ways for each of the three experiments explained in Section 4.

| DOWNSCALING METHODS
For these experiments we have applied three commonly used ML techniques, two different implementations of artificial neural network (ANN) and a support vector regression (SVR), and we have compared their results with the classical method of multiple linear regression (MLR). Note that, since this particular problem is onedimensional, MLR really corresponds to a simple linear regression. For their implementation, we have used the Python machine learning library Scikit-learn (Pedregosa et al., 2011).
ANNs (McCulloch & Pitts, 1943;Rosenblatt, 1958) are supervised learning algorithms based on the biological neurons' behaviour, imitating them by nodes which work as perceptrons (Rosenblatt, 1958). In the popular implementation of ANNs called multilayer perceptron (MLP), these nodes are organised in several layers, the input layer, the output layer and a set of hidden layers, which communicate with adjacent layers. Each node receives several input signals (x j ) from the (m) nodes at the previous layer and adds them with different weights (w j ) (Equation 1). This resulting input signal (z) is then fed to an activation function ( g), so the node will pass a signal to the next layer depending on whether the activation function exceeds a certain threshold or not. For these experiments, we have used two different implementations of ANN, one using a rectified linear unit activation function (ANN-RELU) and the other using a logistic activation function (ANN-LOG) The training of an MLP consists in searching those weights which minimise errors, and it is usually achieved by an iterative process in which signals are transmitted forward, errors are propagated backwards and weights are updated until a certain condition is fulfilled. During this iterative process, ANNs algorithms can get trapped in local minima, which is one of their major drawbacks, along with their high computational training cost. ANNs can tackle both classification and regression problems, and they have been extensively applied to SD (Chadwick et al., 2011;Coulibaly et al., 2005;Dibike & Coulibaly, 2006;Mendes & Marengo, 2013;Sailor et al., 2000;Snell et al., 2000;Trigo & Palutikof, 1999).
SVMs (Boser et al., 1992;Cortes & Vapnik, 1995;Vapnik, 1995) are supervised learning algorithms originally designed for classification based on a combination of minimal errors and maximising distances from data points to the boundary decision (maximal margins). A slightly different version of the algorithm, SVR, can be applied to regression problems. Linear SVR (Drucker et al., 1997) searches for the optimum hyperplane (defined by its parameters w, w 0 ) by penalysing only errors greater than a certain threshold (ε), which defines the so-called ε-tube. The nonlinear problem is tackled by mapping original data (x), through a transformation (φ), to a higher dimensional space (feature space) where the problem becomes linear. Thus, an SVR corresponds to the hyperplane in the feature space: And the combination of minimal errors and maximal margins corresponds to minimising subject to where ξ i and ξ i Ã , called slack variables, are the penalties corresponding to errors out of the ε-tube, x i , y i correspond to pairs of data points and C is a hyperparameter that establishes the balance between errors and maximal margin. For a more detailed description on SVR, see Drucker et al. (1997). Expensive high-dimensional computations, usually referred to as the curse of dimensionality, are avoided by the so-called kernel trick (see Shawe-Taylor & Cristianini, 2004), which consists in visiting the feature space exclusively to compute inner products of pairs of points without explicitly mapping each point individually. This is only possible by carefully defining the inner product as a specific function called kernel (K) with some desired properties. For this work, we have used a radial basis function kernel of variance σ 2 Finally, the problem consists of solving a convex quadratic programming problem, or a set of linear equations in the least-square support vector machine variant (Suykens & Vandewalle, 1999), both of them lacking the inconvenience of local minimum. Different forms of SVMs have been widely applied to SD, both for classification and regression (Anandhi et al., 2008;Chen et al., 2010;Ghosh & Mujumdar, 2008;Hou et al., 2014;Sachindra et al., 2013;Tripathi et al., 2006;Yu & Liong, 2007).

| EXPERIMENT DESIGN
In order to prove how ML techniques can exhibit strange behaviours out of their calibration range and to analyse their sensitivity to the degree of extrapolation, three toy experiments have been performed (Figure 2). Experiment 1: "full overlapping"-for each grid point, the training/testing split is performed randomly in a 60/40 ratio. The aim of this first experiment is to evaluate and intercompare the three methods when no extrapolation takes place.
Experiment 2: "no overlapping"-for each grid point, the 60th percentile of T850 is used as a threshold for the training/testing split, using values under it for training and values over it for testing. The aim of this experiment is to compare the three methods under extrapolation, in the extreme and non-realistic case of zero overlapping between the training and testing datasets.
Experiment 3: "partial overlapping"-for each grid point, the training/testing split is performed randomly in a 60/40 ratio as in Experiment 1. But the testing dataset is then modified by shifting predictors to higher values F I G U R E 2 Illustrative scheme of the training (grey) and testing (red) datasets, for a single grid point. Experiment 1: "full overlapping" (first column), Experiment 2: "no overlapping" (second column) and Experiment 3: "partial overlapping" (third column). Of the different shifts applied in Experiment 3 (ranging from +0.5 to +4 standard deviations), this figure corresponds to the specific shift of +1.5 standard deviations. The upper row shows standardised T850 (x-axis) versus TMAX ( C, y-axis), and the lower row shows the frequency (counts) of T850 and using the estimates from MLR as predictands, since in experiment 3 there are no observed predictands. Different shifts have been applied: +0.5, +1, +1.5, …, +4 standard deviations. This experiment aims to analyse the sensitivity of each method to the level of extrapolation, with a more realistic approach than Experiment 2 in terms of shape of the training and testing distributions and degree of overlapping between them. On the other hand, it is important to remark that this experiment relies on the assumption of linear relationship between T850 and TMAX, for it validates against linearly extrapolated data instead of against real data. But given the high linear correlation between T850 and TMAX in the observed range (Figure 1) it seems a reasonable assumption to be made for the purpose of this experiment.
In the three experiments, 8110 days are used for training and 5405 for testing, and the hyperparameter tuning is performed by cross-validation using exclusively the training data set. SDMs are validated with the testing dataset by their root mean square error (RMSE).
Additionally, for Experiment 2, SDMs have been applied over both the training and testing datasets, in order to visualise their behaviour out of the calibration range.

| RESULTS
In the first experiment, all SDMs achieve similar RMSEs, with slightly better results for ML techniques (Figure 3).
Nonlinear methods do not add much value here because of the high linear correlation among predictor and predictand ( Figure 1). Nevertheless, these very same methods have proved to overcome MLR when using a greater set of predictors .
When applied under extreme extrapolation, ML techniques display very high RMSEs compared to MLR (Figure 3). Despite the similar RMSEs of the four SDMs in Experiment 1, their RMSEs in Experiment 2 differ significantly. Whereas the mean RMSE goes from 2.58 C in Experiment 1 to 3.29 C in Experiment 2 for MLR, for ANA-RELU it goes from 2.51 to 6.06 C, for ANA-LOG it goes from 2.49 to 7.66 C and for SVR it goes from 2.51 to 5.52 C. Some regions present much larger RMSEs than in Experiment 1, while in other regions RMSEs barely increase. A possible explanation comes with the fact that regions with larger RMSEs correspond to some of the main Spanish valleys with high occurrence of fog, especially in winter, which strongly condition daily maximum temperatures and could lead to a certain uncoupling between T850 and TMAX. A further analysis of this evidence is out of the scope of this work, but it might constitute an interesting future study.
An illustrative example of the different behaviours by the four methods under extrapolation, over a single grid point, is shown in Figure 4. The four methods map a similar relationship between T850 and TMAX over the training sample, but their behaviours diverge largely under extrapolation, where ANN-RELU, ANN-LOG and SVR do not perform well.
F I G U R E 3 Spatial distribution and spatial mean of RMSE ( C) for daily maximum temperature in Experiments 1 (upper row) and 2 (lower row) by MLR (first column), ANN-RELU (second column), ANN-LOG (third column) and SVR (fourth column) Finally, Figure 5 shows results from Experiment 3, in which the three ML methods are evaluated for different degrees of extrapolation. MLR is not evaluated in this experiment, because the linear relationship has been used to build the shifted testing datasets. While the three methods displayed extrapolation problems of the same order in Experiment 2 (Figure 3), ANN-RELU behaves much better than ANN-LOG and SVR under extrapolation in this third experiment, in which the training and testing distributions overlap in a more realistic way. For example, for a shift of +0.5 standard deviations and a percentage of testing data out of the calibration range of 0%-3%, RMSEs by ANN-RELU increment up to 6% compared to Experiment 1, but for ANN-LOG and SVR these increments go up to around 20%. For a shift of +1 standard deviation (0%-13% of data out of range), ANN-RELU increments its RMSEs up to 19%, while ANN-LOG and SVR reach 63% and 147%, respectively. And for a shift of +2 standard deviations (10%-36% of data out of range), ANN-RELU goes up to 53% increments, while ANN-LOG and SVR reaches increments of 263% and 775%.

| CONCLUDING REMARKS
ML is a growing field with many applications in atmospheric sciences, being SD of climate change projections one of them. In this study, we have analysed the behaviour of three commonly used ML techniques, two different implementations of ANN (ANN-RELU and ANN-LOG) and an SVR, under extrapolation, through a set of three toy experiments: the first one to evaluate them when no extrapolation takes place, the second one to evaluate them in the extreme case of no overlapping between the training and testing datasets and the third one to analyse their sensitivity to the degree of overlapping. We have proved how the three ML SDMs can behave extremely wrong out of their calibration range, despite their scores inside the calibration range being as good as or even better than those for MLR. Then, we have analysed the impact of this potential extrapolation issue depending on the degree of overlapping between the training and testing datasets. We have found that ANN-RELU errors, when some degree of overlapping takes place, are much lower than those of ANN-LOG and SVR, which has revealed to be extremely sensitive to extrapolation.
This set of experiments has allowed us to prove how three commonly used ML techniques can perform wrong F I G U R E 4 Standardised T850 versus observed (grey) and downscaled maximum temperature ( C) by MLR (blue), ANN-RELU (orange), ANN-LOG (red) and SVR (green) over the training (white background) and testing (red background) datasets for Experiment 2. This figure represents an illustrative example over a single grid point of coordinates 1.730 W and 37.318 N F I G U R E 5 Increment of RMSE (%) for ANN-RELU (first panel, orange), ANN-LOG (second panel, red) and SVR (third panel, green) at Experiment 3, for different shifts applied to the testing dataset. The fourth panel (grey) shows, for each shift, the amount of testing data (%) that lies out of the calibration range. Each box contains the quartiles of all grid points (16,156 data) and the whiskers extend to a maximum of 1.5 times the interquartile range. Outliers beyond this range are plotted individually. Note that vertical scales are different for each panel under extrapolation, so their suitability for SD of climate change projections should be seriously questioned. Furthermore, we have seen how, for a specific technique (ANN), extrapolation issues are very sensitive to the particular implementation. For this technique, we have compared two commonly used activation functions, but other architectures and implementations are possible and also common. For this reason, results shown here might not be straightforward generalised to other techniques or specific implementations, and the extrapolation capability of these methods should be thoroughly examined for each case (variable, region, set of predictors, architecture, etc.).
It is important to clarify that these experiments do not intend to replace other more realistic evaluation approaches which also tackle the extrapolation issue, like the ones mentioned in Section 1. Nonetheless, it is worth noting to remark that experiments which validate over spatially/temporarily aggregated data might hide extrapolation problems in finer spatial/temporal scales. Thus, we consider that analysing the response of ML SDMs through synthetic extrapolation experiments like these ones constitutes a good practice and a first recommended step to detect and be aware of their potential risks.
Finally, we would like to point to the emerging field of Physics-Constrained Machine Learning (see, e.g., Willard et al., 2020;Kashinath et al., 2021) as a possible way of alleviating these extrapolation problems, by reducing the high amount of degrees of freedom that these techniques cope with.