You have collected a huge amount of high-resolution aerial imagery from your drones (or satellites). You can open each image and see many things. Better yet, you can develop computer vision segmentation models that classify each pixel in your images into classes of interest (trees, buildings, …) to analyse the distribution of the landscape, opening up possibilities for urban planning, fire susceptibility estimation, and more.
Introduction
In this context, knowing the height above ground of each pixel gives you a whole new dimension of possibilities. You can distinguish old forests from newly grown forests, crops, and bushes, refining your estimate of the available fire fuel for much more accurate risk assessment. You can determine that a populated area only has trees below, say, 10 metres, which means it is safe from tree-related accidents, or that it is lacking in high-quality green space, since people are more relaxed near large trees [1]. You can also estimate how much carbon is stored in each pixel, or how the vegetation has changed in height since the previous data collection, suggesting that it may be time for trimming, or that something may be wrong if trees are shrinking. Similarly, for buildings, frequently updated estimates of building coverage and building height in areas of interest can be obtained, which is useful for tracking urban development and analytics. The best part is that all of this only requires visual RGB aerial (satellite or aircraft-based) images.
Motivated by this potential, we at Geoneon have developed a deep learning height estimation model that takes aerial visual images and produces the corresponding estimated height maps. This blog post briefly describes our method and presents our results across a variety of landscapes.
Methods
Data
To obtain a robust model, the data for training, validation and testing should satisfy the following criteria:
- Landscape diversity: all possible landscapes should be included in the training data, so that the model is familiar with the properties of each one. This is important because the height of each pixel is inferred from the value of that pixel and the pixels around it.
- Source diversity: we would like the model to recognise objects and heights in visual images the way humans do, with a high degree of robustness to imaging parameters such as brightness and contrast. Data augmentation lets us simulate some of this variety artificially (as sketched after this list), but sourcing data from multiple sources, or surveys, is crucial to cover more complex imaging effects and ensure the model behaves consistently across data of different origins.
- Groundtruth quality: spatial and temporal offsets between height measurements and visual image capture are prevalent in remote sensing. One needs to carefully select data with as little offset as possible, especially for vegetation height estimation, since vegetation heights can vary substantially even across small offsets.
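As a concrete illustration of the augmentation mentioned under source diversity, a minimal photometric pipeline might look like the sketch below. It uses torchvision, and the jitter and blur parameters are hypothetical values rather than our production configuration; note that photometric transforms touch only the RGB image, while geometric transforms must be applied jointly to the image and its height map so pixels stay aligned with their groundtruth heights.

```python
import torch
from torchvision.transforms import v2

# Hypothetical photometric augmentations simulating variation in imaging
# parameters (brightness, contrast, saturation) across surveys.
photometric = v2.Compose([
    v2.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
    v2.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
])

def augment(image: torch.Tensor, height_map: torch.Tensor):
    """Jitter the RGB image only; apply a joint flip to keep pixels aligned."""
    image = photometric(image)
    if torch.rand(1) < 0.5:  # horizontal flip applied to both tensors
        image = torch.flip(image, dims=[-1])
        height_map = torch.flip(height_map, dims=[-1])
    return image, height_map
```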
Our training data contains 53,068 RGB images of 512 x 512 pixels, where each pixel covers 1 x 1 m².
Model and training
We adapt an encoder-decoder architecture, trained with the Adam optimizer for 50 epochs. Our loss function is a combination of the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). As illustrated in the figure, the encoder and decoder architectures can be chosen independently, leaving room to upgrade either to more advanced designs.
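To make the loss concrete, a minimal PyTorch sketch is shown below. The equal weighting between the two terms (alpha = 0.5) is an illustrative placeholder, not the exact weighting used in our training runs.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred: torch.Tensor, target: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Weighted combination of MAE and MSE on the predicted height map.

    alpha balances the two terms; 0.5 is a placeholder value here,
    not our tuned setting.
    """
    mae = F.l1_loss(pred, target)   # robust to large height outliers
    mse = F.mse_loss(pred, target)  # penalizes large errors more heavily
    return alpha * mae + (1.0 - alpha) * mse
```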
Evaluation
In this type of problem, it is essential to look at both the absolute error and the absolute relative error, as they give different insights into model performance. For example, a 3-metre absolute error carries a different weight on a 30-metre tree than on a 2-metre crop. Besides these errors, it is common in the field of depth estimation to also report a delta metric, which we denote δ1: the ratio of pixels whose model output is relatively close to the groundtruth. The mathematical formula is as follows:

$$\delta_1 = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\max\left(\frac{y_i}{y^*_i}, \frac{y^*_i}{y_i}\right) < 1.25\right]$$
where $y$ and $y^*$ are the model output and the groundtruth height maps, respectively, and $N$ is the number of pixels. A high δ1 signifies that a large proportion of the output is close to the true measured height.
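For reference, these metrics are straightforward to compute from the predicted and groundtruth height maps. The sketch below assumes the conventional 1.25 threshold for δ1 and adds a small epsilon to guard against division by zero over bare ground; both are choices of this sketch rather than a description of our exact evaluation code.

```python
import torch

def height_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Compute MAE, absolute relative error, and delta_1 over a height map.

    delta_1 is the fraction of pixels where max(y/y*, y*/y) < 1.25,
    the threshold conventionally used in depth estimation.
    """
    mae = (pred - gt).abs().mean()
    abs_rel = ((pred - gt).abs() / (gt + eps)).mean()
    ratio = torch.maximum((pred + eps) / (gt + eps), (gt + eps) / (pred + eps))
    delta1 = (ratio < 1.25).float().mean()
    return mae.item(), abs_rel.item(), delta1.item()
```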
Results
Here, we present the model's results on several examples from the test set. These examples, 2048 x 2048 pixels in size, are selected to represent a wide range of terrains. By qualitatively evaluating the output height maps, the actual height measurements, and their Kernel Density Estimation (KDE) plots, as well as quantitatively calculating the metrics, we can draw insightful conclusions (the brown text in the figures) about the model's performance on different landscapes.
Overall, the model has learned to output accurate height maps: objects, or collections of pixels, that are clearly higher than their surroundings show up in the estimated height maps. One issue is underestimation, which appears in almost all landscapes. The figure above suggests an explanation: only the forested area has a significant number of pixels with heights above 10 metres, so the training, validation, and test datasets are heavily skewed towards the lower value range. Another contributor is the model's tendency to produce smooth outputs, a consequence of localized operations such as convolution or patchification. To address this from a data-centric angle, one can add more samples with large height values to the dataset, or train with a weighted loss, as sketched below.
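As an illustration of the weighted-loss idea, one could up-weight the loss on pixels whose true height exceeds some cutoff, so that rare tall pixels contribute more to the gradient. The 10-metre threshold and 5x weight below are hypothetical values for the sketch only; in practice they would be tuned on validation data.

```python
import torch
import torch.nn.functional as F

def weighted_height_loss(pred: torch.Tensor, gt: torch.Tensor,
                         tall_threshold: float = 10.0,
                         tall_weight: float = 5.0) -> torch.Tensor:
    """Per-pixel L1 loss with extra weight on tall, under-represented pixels.

    Both the 10 m threshold and the 5x weight are illustrative choices,
    not tuned values.
    """
    per_pixel = F.l1_loss(pred, gt, reduction="none")
    weights = torch.ones_like(gt)
    weights[gt > tall_threshold] = tall_weight  # emphasize tall pixels
    return (weights * per_pixel).mean()
```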
Conclusion
In conclusion, height estimation from visual aerial images has the potential to greatly improve several types of geographical analysis at scale, at speed, and at minimal cost. We have demonstrated the development and evaluation of a deep learning model for this task, which successfully learned to output accurate height maps across a variety of landscapes. The model tends to underestimate slightly; we hypothesise some explanations for this and propose solutions.
References
[1] “Why Bigger Urban Trees Are Better for Us.” Accessed: Nov. 25, 2024. [Online]. Available: https://www.linkedin.com/pulse/why-bigger-urban-trees-better-us-ben-rose