Landscape of Neural Architecture Search across sensors: how much do they differ?

With the rapid rise of neural architecture search, the ability to understand its complexity from the perspective of a search algorithm is desirable. Recently, Traoré et al. proposed the Fitness Landscape Footprint framework to help describe and compare neural architecture search problems. It attempts to describe why a search strategy might succeed, struggle or fail on a target task. Our study leverages this methodology in the context of searching across sensors, including sensor data fusion. In particular, we apply the Fitness Landscape Footprint to the real-world image classification problem of So2Sat LCZ42, in order to identify the most beneficial sensor for our neural network hyper-parameter optimization problem. From the perspective of distributions of fitness, our findings indicate a similar behaviour of the search space for all sensors: the longer the training time, the larger the overall fitness and the flatter the landscapes (less ruggedness and deviation). Regarding sensors, the better the fitness they enable (Sentinel-2), the better the search trajectories (smoother, higher persistence). Results also indicate very similar search behaviour for sensors that can be decently fitted by the search space (Sentinel-2 and fusion).


INTRODUCTION
Neural architecture search (NAS) is a rapidly growing area of machine learning (ML) dedicated to automatically designing high-performing deep learning models. Recent breakthroughs, such as differentiable search, e.g., DARTS (Hanxiao et al., 2019), have enabled search at limited computing cost and time. However, state-of-the-art methodologies still suffer from limited interpretability, and current evaluation protocols do not always shed light on the contribution of individual components (i.e., search space, training pipeline) when reporting performances (Yang et al., 2020, Lindauer and Hutter, 2020).
Recently, a fitness landscape analysis-based (FLA) methodology was introduced: the Fitness Landscape Footprint (a.k.a., footprint) (Traoré et al., 2021c). The footprint attempts to describe why a search strategy may be successful, struggle or fail on a target application. It also enables comparing search problems of variable configuration (i.e., different search space, fitness function, data, etc.).
Our study takes advantage of the footprint to identify the most favorable sensor setting for NAS. Particularly, we consider optimizing convolutional neural network (CNN) image classifiers on the search space defined by NASBench-101 (Ying et al., 2019) for the real-world image classification problem So2Sat LCZ42 (Zhu et al., 2020). Our results show that, regardless of the sensor, the longer the training time, the better the performance (fitness) and the flatter the landscape (less ruggedness and deviation in fitness). Moreover, Sentinel-2 and fusion (Sentinel-1 and 2) tend to have more favorable search trajectories (smoother, higher persistence). To the best of our knowledge, our study provides the first quantification and comparison of search behavior across sensors (including sensor fusion).
This article is structured as follows: the next section summarizes the related work and Section 3 introduces the footprint. Section 4 proposes the methodology to study NAS problems. Sections 5 and 6 present the experimental settings and the results. Section 7 outlines the conclusions and proposes future work.

RELATED WORK
Computer vision (CV) and Earth observation (EO) are closely tied (Ball et al., 2017, Zhu et al., 2017). Deep learning models in CV have helped tackle several specific EO use-cases such as scene classification (Liu et al., 2019), object detection, change detection, and semantic segmentation (Yuan et al., 2021), among others. However, the specificity of sensors requires domain-specific models (Li et al., 2019). In particular, the availability of several sensors to monitor areas has motivated multi-modal sensor fusion, which remains challenging for current models (Hong et al., 2021). Moreover, as the design of new vision-based methodologies can be time-consuming (trial and error), EO could benefit from automated machine learning (AutoML) algorithms.
In AutoML, NAS specializes in finding model configurations achieving optimal performances for a given dataset (Elsken et al., 2019, Ojha et al., 2017). NAS methodologies have proven to be powerful and efficient, with strategies deriving from various families of optimization algorithms, e.g., differentiable search (Hanxiao et al., 2019, Traoré et al., 2021a), Bayesian optimization (Camero et al., 2021), and meta-heuristic-based approaches (Stanley and Miikkulainen, 2002, Camero et al., 2020, Traoré et al., 2021b). However, in practice, the difficulty of a NAS problem is hard to estimate, because the complexity of its components, namely the search space, search strategy, performance estimator and additional tricks, is hard to quantify (Elsken et al., 2019, Yang et al., 2020). The fields of evolutionary computation, optimization and complex systems have long studied optimization processes and provide us with tools to analyze their behavior. In particular, FLA (Pitzer and Affenzeller, 2012) aims at understanding and predicting performances of optimization algorithms. Recently, Traoré et al. proposed the Fitness Landscape Footprint (Traoré et al., 2021c), a framework to characterize NAS problems from the perspective of a search algorithm. The following section introduces the footprint.

A FRAMEWORK FOR COMPARATIVE FITNESS LANDSCAPE ANALYSIS
Before describing the footprint, it is important to define what a fitness landscape is. Let S be the set of all possible solutions of an optimization problem, i.e., the search space. Let f be the fitness function, which attributes to each candidate solution x ∈ S a fitness measurement f(x) ∈ R. Let N be the neighborhood relationship operator, a function providing a structure to the search space S. Then, the fitness landscape L = (S, f, N) combines the three above, providing, respectively, a set of possible solutions, a function to evaluate them and a structure that interconnects them.
Given this definition, we are interested in a better understanding of a NAS optimization process. The footprint (Traoré et al., 2021c) serves this purpose of gaining insights into the process by describing its fitness landscape with a set of eight (8) metrics measuring aspects such as the distribution of fitness, ruggedness of the landscape or persistence of fitness. A footprint includes: the mean and variance of fitness over S, the ruggedness τ , an enumeration of local optima, the positive and negative persistence and their area under the curve (AuC). The following paragraphs describe these metrics.
The fitness distance correlation (FDC) is often interpreted as a measure of the existence of search trajectories from randomly picked solutions to the known global optimum. In practice, the FDC is not collected as a correlation score, but visualized as the distribution of fitness versus distance to the global optimum. It writes as FDC(f, x*, S) = {(d(x*, y), f(y)) | y ∈ S}, where S denotes the search space, x* ∈ S is the global optimum and d is a distance function.
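As a minimal sketch of how the FDC pairs could be collected, the Python snippet below computes (distance, fitness) pairs for a sample of binary-encoded solutions. It assumes that the best sampled solution stands in for the global optimum x* and that d is the Hamming distance; the function names and the toy data are illustrative and are not taken from the original framework.

```python
from collections import defaultdict


def hamming(x, y):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def fdc_pairs(solutions, fitness):
    """Return [(d(x*, y), f(y))] for every sampled solution y.

    Assumption: x* is approximated by the best solution in the sample.
    """
    best = max(range(len(solutions)), key=lambda i: fitness[i])
    return [(hamming(solutions[best], y), f) for y, f in zip(solutions, fitness)]


def fitness_by_distance(pairs):
    """Group fitness values by distance, e.g. to plot one distribution per distance."""
    groups = defaultdict(list)
    for d, f in pairs:
        groups[d].append(f)
    return dict(groups)


# Toy sample (made-up values, for illustration only).
solutions = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
fitness = [0.90, 0.80, 0.60, 0.50]
pairs = fdc_pairs(solutions, fitness)
```

Grouping the pairs by distance gives one fitness distribution per Hamming distance, which is one way the width of the landscape at each distance could be inspected.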
The ruggedness of the landscape also helps assess the difficulty of the problem tackled. Let us consider a random walk RW in S of N steps (models) and its corresponding fitness values. The ruggedness τ is the auto-correlation length over RW: τ = 1/ρ(1), where ρ(1) is the serial-correlation coefficient at lag k = 1, i.e., between consecutive steps of the walk.
In (Traoré et al., 2021c), the authors propose the metric of persistence, characterizing the behavior of image classification models over time. It measures the chances of solutions in the search space to keep a rank N (top or bottom rank, based on fitness in test) as the training time grows. This metric is complemented with its area under the curve (AuC), measuring the evolution of the persistence as N grows.
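The ruggedness can be illustrated with a short Python sketch. It assumes that the formula in the text reads τ = 1/ρ(1) and uses one common estimator of the lag-1 serial correlation (autocovariance normalized by the total sum of squares); the original framework may use a different estimator, and the function names are hypothetical.

```python
def serial_correlation(values, lag=1):
    """Sample serial-correlation coefficient of a sequence at the given lag."""
    n = len(values)
    mean = sum(values) / n
    total = sum((v - mean) ** 2 for v in values)
    shifted = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return shifted / total


def ruggedness(walk_fitness):
    """tau = 1 / rho(1), assuming this is the intended reading of the text."""
    rho = serial_correlation(walk_fitness, lag=1)
    if rho == 0:
        raise ValueError("rho(1) is zero, tau is undefined")
    return 1.0 / rho


# Toy fitness values along a walk (illustrative only).
tau = ruggedness([1.0, 2.0, 3.0, 4.0, 5.0])
```

Under this reading, weakly correlated consecutive fitness values (small ρ(1)) give a large τ, which matches the way the text associates larger τ with more rugged routes.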
Another way to characterize an optimization fitness landscape is to assess the existence of local optima. As some search algorithms might get stuck in such sub-optimal areas of S, an enumeration (Hernando et al., 2012) could help measure the difficulty of the search problem.
Last but not least, the footprint not only characterizes individual NAS landscapes, but also enables the comparison of several landscapes that differ in any of the components S, f, or N.

ASSESSING AND COMPARING THE LANDSCAPES OF NAS FOR VARIOUS SENSORS
This study aims at investigating how the process of searching for neural architectures is affected by the type of sensor available as input. More precisely, we seek to identify to what extent searching with a given sensor differs from searching with another one. In particular, we consider the case of a fixed search space, training pipeline (hyperparameters, duration, etc.) and evaluation protocol (fitness function). In practice, we propose to tackle these questions by conducting a comparative landscape study using the footprint (Section 3).
Let Σ = {s_i, s_j, s_i + s_j} be the set of sensors available in our ML task. Let S be a search space of CNN image classifiers, each represented by a unique binary vector. Considering this representation, we choose a neighborhood operator N(x) assigning to each solution of the search space all the configurations that are at a Hamming distance of one (1) from it. This operator writes as follows: N(x) = {y ∈ S | d_hamming(x, y) = 1}, where d_hamming(x, y) is the Hamming distance between two solutions (x, y) ∈ S². Additionally, we use as fitness function f_{s_i} the test accuracy measured after a training budget of b_t on an input sensor s_i.
Since we have access to various sensors s ∈ Σ for our ML task, the fitness landscapes obtained write as follows: L_{s_i} = (S, f_{s_i}, N), L_{s_j} = (S, f_{s_j}, N), L_{s_k} = (S, f_{s_k}, N) for the sensor settings (s_i, s_j, s_k) ∈ Σ³. In particular, we consider the case of input-level sensor fusion: s_k = s_i + s_j. Besides, as noted above, the aim is a comparative study, so we fix the search space S and the neighborhood operator N across all settings.
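The neighborhood operator described here (all solutions one Hamming distance away) can be sketched in a few lines of Python, together with a simple local-optimum check built on it. Solutions are assumed to be tuples of 0/1 values, the search space a set of such tuples, and the fitness a dictionary; these names and data structures are illustrative choices, not the authors' implementation.

```python
def neighbors(x, search_space):
    """N(x): solutions of the search space that differ from x in exactly one bit."""
    result = []
    for i in range(len(x)):
        y = x[:i] + (1 - x[i],) + x[i + 1:]  # flip bit i
        if y in search_space:  # not every bit flip is a valid solution
            result.append(y)
    return result


def is_local_optimum(x, fitness, search_space):
    """True if no neighbor of x has a strictly larger fitness."""
    return all(fitness[y] <= fitness[x] for y in neighbors(x, search_space))


# Toy search space and fitness values (made up for illustration).
fitness = {(0, 0): 0.5, (0, 1): 0.4, (1, 1): 0.7}
space = set(fitness)
local_optima = sorted(x for x in space if is_local_optimum(x, fitness, space))
```

Filtering the flipped vectors by membership in the search space matters because, with a binary encoding, many single-bit flips may not correspond to a valid architecture.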

EXPERIMENTAL SETUP
This section introduces the NASBench-101 database, as well as a custom representation used to encode its solutions. It then describes the So2Sat LCZ42 dataset used to evaluate solutions, followed by details on the evaluation protocol.

NASBench-101
NASBench-101 (Ying et al., 2019) is a database containing a large pool of neural networks and their evaluations on the image classification dataset CIFAR-10. It aims at providing an exhaustive fitness measurement for all configurations (N=453k) in a search space of CNN image classifiers. This search space defines a model configuration as an image classification backbone with a head, body and tail. Its body consists of three identical 'block' structures alternated with down-sampling modules. Each block consists of a sequence of identical, elementary feed-forward units called cells. Each cell is represented by a directed acyclic graph (DAG) with a maximum number of nodes (V ≤ 7), a maximum number of edges (E ≤ 9) and a fixed list of three (3) operators (Max-pool 3x3, Convolutional layer 1x1 and 3x3) labelling each node. Therefore, a solution of the search space is identified by a cell, encoded in practice by both an adjacency matrix of variable size (upper triangular) and its list of operators. Moreover, the head of the model is a 3 x 3 convolution with 128 output channels, while the tail is a dense softmax layer.

Custom feature representation
In our experiments, we construct a custom representation to enable solutions of the search space to be identified by a single vector. First, for our representation of the DAG, we do not label nodes. Instead, for the five (5) intermediate nodes out of seven (7) (the other two being IN and OUT), we account for the fact that each could be one of three (3) operators. Thus, the DAG contains exactly N_nodes = 1 + 5 × 3 + 1 = 17 nodes. The new adjacency matrix is therefore of fixed size, i.e., L = 17 × 17 = 289 entries, and non upper-triangular. Finally, we flatten the adjacency matrix to obtain a binary vector as identifier.
Regarding the sampling of solutions x ∈ S, we use Latin Hypercube Sampling (LHS) to ensure fair data collection. Because of the higher complexity of LHS on the large binary representation, we perform it on the intermediate representation, as a joint sampling of the original matrix and list of operations.
The So2Sat LCZ42 dataset comprises training (N_t = 352,366), validation (N_v = 24,119), and test (N_test = 24,188) samples. The training and validation samples originate from the same set of forty-two (42) cities, while those from the test set were collected in ten (10) additional cities.
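To make the 17-node representation more concrete, here is a hedged Python sketch of one possible encoding. The exact node ordering used by the authors is not given in the text, so the index mapping below (IN at position 0, three slots per intermediate node, OUT at position 16), the operator labels and the function names are all assumptions made for illustration.

```python
OPS = ("conv1x1", "conv3x3", "maxpool3x3")  # hypothetical operator labels
N_EXPANDED = 1 + 5 * 3 + 1  # IN + 5 intermediate nodes x 3 operators + OUT = 17


def expanded_index(node, ops, n_nodes):
    """Map a node of the original cell to its slot in the 17-node graph."""
    if node == 0:
        return 0  # IN
    if node == n_nodes - 1:
        return N_EXPANDED - 1  # OUT
    return 1 + (node - 1) * 3 + OPS.index(ops[node - 1])


def encode(adjacency, ops):
    """Flatten a labelled cell into a binary vector of length 17 * 17 = 289.

    adjacency: square 0/1 matrix of the original cell (at most 7 nodes).
    ops: operators of the intermediate nodes, in node order.
    """
    n = len(adjacency)
    vector = [0] * (N_EXPANDED * N_EXPANDED)
    for i in range(n):
        for j in range(n):
            if adjacency[i][j]:
                a = expanded_index(i, ops, n)
                b = expanded_index(j, ops, n)
                vector[a * N_EXPANDED + b] = 1
    return vector


# Toy cell: IN -> conv3x3 -> OUT
cell = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
vector = encode(cell, ["conv3x3"])
```

With such a fixed-length vector, two cells that differ only in the operator of one node differ in a few bits, so the Hamming distance reflects both topology and operator changes.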

Evaluation protocol
For the purpose of training and evaluating on the same data distribution, we use a custom setting consisting of training and testing sets made by randomly sampling, respectively, 80% and 20% of the image patches of the original training set. Moreover, as in (Traoré et al., 2021c), we speed up the training procedure by only considering P = 35% of the samples in the training set.
Additionally, we use the same search space S for all sensor settings. In particular, we do not adapt the sampled models to use multiple sensors as input; instead, we stack the data at the input level. We trained N = 100 randomly sampled models, once. After inspection and quality control, there remain 100, 88 and 75 samples for Sentinel-1, Sentinel-2 and both sensors, respectively. The fitness is assessed in test using Cohen's kappa.
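For reference, Cohen's kappa compares the observed agreement p_o between predictions and labels with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The Python sketch below implements this standard definition; the function name and the toy labels are illustrative, and the authors' actual evaluation code is not shown in the text.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between two label sequences of equal length."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1:
        return float("nan")  # undefined when both sequences contain a single class
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])
```

Unlike plain accuracy, this score discounts agreement that would be expected by chance given the class frequencies.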

RESULTS
The following section presents results of comparison of search landscapes for various input sensors. First, we provide an analysis of distributions of fitness. Then, we show results of fitness distance correlation, followed by an analysis of random walks, as well as measurements of fitness persistence. Last but not least, we compare the footprint of the sensors.

Density of Fitness
First, we assess the ability of the search space to fit the task with each sensor. Figures 3 and 4 display the probability density function (PDF) of fitness, respectively after 36 and 108 epochs of training. The first, second and third columns are, respectively, for using Sentinel-1, Sentinel-2 or both sensors as input.
We first take a look at the PDFs after 36 epochs of training. When using Sentinel-1, the distribution of fitness is wide and centered around low values (µ = 0.47, σ = 0.13). Sentinel-2 improves the distribution, which reaches a higher average and is narrower (µ = 0.94, σ = 0.03). Using both sensors slightly worsens the fitness, yielding a lower mean and larger deviation (µ = 0.89, σ = 0.05).
Next, we look at the PDFs after 108 epochs of training. Overall, the task is better handled in all sensor configurations. Using Sentinel-1, the distribution improves by 17 percentage points in mean fitness (µ = 0.64, σ = 0.13). In the case of Sentinel-2, most models fit the data well, as the mean fitness increases and the deviation decreases (µ = 0.97, σ = 0.01). We observe similar results when using both sensors (µ = 0.94, σ = 0.04).
Besides, we seek to identify, for each sensor, whether the behavior of the search space follows a specific theoretical distribution. For this, we consider the more challenging scenario of selecting models after only 36 epochs of training. Figures 5 and 6 show, respectively, PDFs and cumulative distribution functions (CDFs), all fitted with the Beta (red), Weibull (green) and Lognormal (blue) distributions. Similarly, the first, second and third columns are, respectively, for using Sentinel-1, Sentinel-2 or both sensors as input. Overall, the empirical distributions are closely fitted by the selected theoretical distributions. For Sentinel-1, the best candidate is the Lognormal, with the largest fitting likelihood and the lowest AIC and BIC scores. When using Sentinel-2 or both sensors, Beta best matches the empirical distributions.
To summarize, the capacity of the search space to fit the task with the Sentinel-1 sensor appears limited from the PDF perspective. Indeed, despite a longer training time (108 epochs), the fitness distribution remains far worse than when using Sentinel-2. Using Sentinel-2 only, the task can be fitted well enough, in particular given a long training time. Combining Sentinel-1 with Sentinel-2 worsens the distribution of fitness (lower mean, larger deviation). Therefore, there is no tangible benefit in fitness from sensor fusion using the current search space. Moreover, results indicate the feasibility of modeling the empirical distributions of fitness for each sensor.
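The model-selection scores mentioned here follow the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of fitted parameters, n the number of samples and L the maximized likelihood. The Python sketch below illustrates them for a two-parameter Lognormal only, whose maximum-likelihood estimates have a closed form. Fitting the Beta and Weibull candidates would require a numerical optimizer, which is omitted here. The fitness values are made up for illustration and are not the paper's data.

```python
import math


def lognormal_mle(samples):
    """Closed-form maximum-likelihood estimates (mu, sigma) of a Lognormal."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_log_likelihood(samples, mu, sigma):
    """Sum of the Lognormal log-densities over the samples."""
    return sum(
        -math.log(x) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in samples
    )


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Made-up fitness values in (0, 1), for illustration only.
fitness = [0.35, 0.42, 0.47, 0.51, 0.60]
mu, sigma = lognormal_mle(fitness)
log_lik = lognormal_log_likelihood(fitness, mu, sigma)
scores = {"aic": aic(log_lik, 2), "bic": bic(log_lik, 2, len(fitness))}
```

Candidates are then ranked by these scores, lower being better, which is how a statement such as "lowest AIC and BIC" can be read.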

Fitness Distance Correlation
Next, we analyse the fitness landscape for the various sensor configurations. Figures 7 and 8 show results of FDC for the three (3) input sensor settings. The layout of the plots follows the convention of Figure 3.
First, we consider the FDC after 36 epochs of training (see Figure 7). Overall, for all sensors, we observe that the respective landscapes are rather rough. Indeed, the distributions of fitness per Hamming distance to the optimum are relatively wide. For instance, when using Sentinel-1, we notice that solutions at the Hamming distances d_hamming ∈ {8, 11, 13} display up to 35 percentage points of difference in fitness. We also notice a landscape centered around low fitness values (ca. 47%). Using Sentinel-2 yields a landscape centered at much higher values (ca. 94%). We also notice a consistent increase in fitness as the Hamming distance to the optimum decreases. Similarly, the landscape associated with using both sensors is of high fitness. However, the slope of gained fitness per travelled distance to the optimum worsens (less consistent) compared to the one obtained with Sentinel-2 as input.
Then, we consider the FDC after 108 epochs of training (see Figure 8). Overall, the landscape tends to be flatter, with an increased fitness. In particular, for Sentinel-2 or both sensors as input, the flatness is indicated by narrower distributions of fitness at the various distances to the optimum. This is complemented by potential search trajectories that show little improvement in fitness per travelled distance to the global optimum. Using both sensors yields a similar behavior, except for the existence of a set of models providing poorer fitness values, all located at d_hamming = 7 from the optimum. The case of Sentinel-1 is rather odd, as there appears to be a favorable (negative) slope, as if the landscape had not converged.
To summarize, the fitness landscape of So2Sat LCZ42 is rougher when training is limited (36 epochs), and flatter, with higher fitness, when the solutions in the search space are trained long enough (108 epochs). As observed when analyzing distributions of fitness (Figures 3 and 4), this NAS problem benefits most from using Sentinel-2 as input, with improvements in slope and overall fitness in its landscape. Therefore, these results complement the analysis of PDFs by showing benefits in performances, this time from the perspective of potential NAS algorithm trajectories. They also show that, with the current search space, the search behaviour is worse when using both sensors as input.

Random Walk Analysis
Furthermore, we investigate the behavior of local search-based algorithms depending on the input sensor. More precisely, this is done by analyzing random walks. Figure 9 displays the route of a random walk evaluated on four (4) different sensor settings. The walk itself consists of one hundred (100) steps in the search space. At each step, the selected model is evaluated after being trained for 36 epochs. The blue, yellow, green and red curves are, respectively, for evaluating the fitness with Sentinel-1, Sentinel-2, both sensors, or CIFAR-10 as input. All curves were smoothed with a moving average of five (5) steps. We also consider CIFAR-10, since its fitness evaluations were freely available (ground-truth in NAS-Bench-101) and could serve as a reference for comparison and troubleshooting.
The evaluation of the walk on Sentinel-1 yields the lowest overall fitness (µ = 0.44) and the most rugged route (τ = 17.84). On the other hand, using Sentinel-2 or both sensors together yields smoother paths, at much higher values. Indeed, the respective averages of fitness are µ = 0.93 and µ = 0.89, and the ruggedness values are τ = 1.54 and τ = 6.03. Also, the curvatures of both routes visually look alike. Regarding CIFAR-10, we observe intermediate fitness and ruggedness (µ = 0.64 and τ = 1.56). The relatively large amplitude in fitness and the similar curvature, despite the lower ruggedness, make its route look more similar to the one evaluated with Sentinel-1.
As observed when analyzing FDCs (Section 6.2), Sentinel-1 yields poorer trajectories (lower fitness, larger ruggedness), suggesting either a sensor unsuitable for the task or a search space S not suitable for the sensor. Similar results (curvature, lower fitness) obtained for ground-truth evaluations on CIFAR-10 suggest that a higher ruggedness seems to be associated with harder tasks and lower convergence of models in a random walk route. To summarize, the use of either Sentinel-2 or both sensors after only 36 epochs enables a NAS route of higher smoothness and fitness.
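The walk and the smoothing step can be sketched as follows in Python. This is a simplified illustration: it assumes that every single-bit flip yields a valid solution (in a real search space, invalid neighbors would have to be filtered out), that the moving average is computed over full windows only, and that a walk of 100 steps visits 101 solutions; none of these details are specified in the text.

```python
import random


def flip_neighbors(x):
    """All binary tuples at Hamming distance 1 from x (validity not checked)."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]


def random_walk(start, steps, rng):
    """Visit `steps` successive random neighbors, starting from `start`."""
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(flip_neighbors(walk[-1])))
    return walk


def moving_average(values, window=5):
    """Average over each full window of consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


rng = random.Random(0)  # fixed seed so the walk is reproducible
walk = random_walk((0,) * 8, steps=100, rng=rng)
smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window=5)
```

Evaluating the same sequence of visited solutions under each sensor setting is what allows the resulting fitness curves to be compared step by step.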

Persistence
Next, we study the behaviour of solutions in the search space from the perspective of persistence in their ranking. Figures 10 and 11 show measurements of positive and negative persistence. We consider samples collected for the experiments related to Sections 6.1 and 6.2. For each setting, the blue curve represents the reference population: the models at a given Rank-N based on their fitness after 4 epochs of training. The yellow curve displays the share of these models maintaining the same Rank-N after 12 epochs. The green and red curves show the same (intersection of sets), respectively after 36 and 108 epochs of training. The positive and negative persistence refer to using the top and bottom N rank function (Nth percentile).
First, we have a look at the positive persistence (see Figure 10). Overall, we observe that the larger the fitness a sensor can provide (see Figures 3 and 4), the larger its persistence (N<25) across all training budgets. More precisely, in terms of area under the curve (N<25), Sentinel-2 (AuC = 0.14) improves over the use of both sensors (AuC = 0.07), which also improves over a single Sentinel-1 sensor (AuC = 0.04).
To summarize, we observe similar trends across sensors for both positive and negative persistence. A larger fitting capacity results in a larger persistence (Sentinel-2). In the case of Sentinel-1, the limited ability to fit the sensor might hinder the ability of models to keep their ranking (potential instabilities during training). In turn, this might result in a poorer persistence. For the better fitted sensor (Sentinel-2), the chances of finding top-25% and bottom-25% performers are considerable (P = 13.7, P = 31.82).
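One way to read the procedure described here is as a running intersection of top-N (or bottom-N) sets across training budgets, divided by the size of the reference set. The Python sketch below follows that reading; the tie handling, the rounding of the percentile to a number of models, and the function names are assumptions, and the toy fitness values are made up.

```python
import math


def rank_set(fitness, n_percent, top=True):
    """Indices of the models in the top (or bottom) n_percent by fitness."""
    k = max(1, math.ceil(len(fitness) * n_percent / 100))
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=top)
    return set(order[:k])


def persistence(fitness_by_budget, n_percent, top=True):
    """Share of the reference set still in the Rank-N set at each later budget.

    fitness_by_budget: one list of fitness values per training budget, in
    increasing budget order, with models in the same position in every list.
    """
    reference = rank_set(fitness_by_budget[0], n_percent, top)
    kept = set(reference)
    shares = []
    for fitness in fitness_by_budget[1:]:
        kept &= rank_set(fitness, n_percent, top)  # intersection of sets
        shares.append(len(kept) / len(reference))
    return shares


# Toy example: 4 models evaluated at three budgets (illustrative values).
budgets = [
    [0.10, 0.50, 0.90, 0.70],
    [0.20, 0.60, 0.80, 0.90],
    [0.90, 0.10, 0.95, 0.50],
]
positive = persistence(budgets, n_percent=50, top=True)
negative = persistence(budgets, n_percent=50, top=False)
```

Repeating the computation for several values of N gives the persistence curve whose area under the curve is reported in the footprint.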

Fitness Landscape Footprint
The results obtained in the previous sections are summarized by the footprint for each data source. Figure 1 displays the footprint for Sentinel-1 (blue), Sentinel-2 (yellow) and both sensors together (green). This is done considering only 36 epochs of training.
As observed in Section 6.1, the search space appears better suited to fit Sentinel-2 than Sentinel-1. Indeed, Sentinel-2 enables reaching a larger mean fitness (µ = 0.94) and lower standard    sion of both sensors. Similar results are observed when analysing search landscapes. Indeed, the longer the training, the more the landscape evolves from high ruggedness to flatness. Moreover, search strategies might benefit from a deployment on Sentinel-2, as it provides routes that are smoother, of higher fitness, of higher gain per distance travelled, and of higher persistence in the ranking of models. When a sensor can be fitted well enough (Sentinel-2, fusion), we observe very similar behaviour in terms of trajectories (smoothness, ruggedness, fitness). This strongly indicates that search trajectories associated with different sensors are comparable when the search space is able to fit them decently enough.
As future work, we propose to investigate how to use the gained insights to help build speed-up techniques for NAS strategies. Such a technique could rely, for instance, on searching with a sensor (or a subset of given sensors) to help approximate the search with a target sensor that is more expensive to evaluate.