Designing a Distributed Raster Processing Service
In this talk we will discuss the design of a distributed raster processing service, built with the GeoTrellis library, capable of handling spatial, spatio-temporal, and multi-band rasters. We rely on Apache Spark as the distributed computation engine and on Hadoop HDFS with Apache Accumulo for distributed persistence.
The need to compute across two or more distributed datasets adds a new dimension to data layout design: data alignment. Of particular interest are the trade-offs in index design between data alignment, required for efficient computation, and data distribution, required for optimal cluster utilization.
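To make the alignment/distribution trade-off concrete, here is a minimal sketch of the kind of space-filling-curve (Z-order, or Morton) indexing commonly used for tile keys. The `interleave_bits` helper is hypothetical and library-free, not the GeoTrellis API; it only illustrates the idea that interleaving the bits of a tile's column and row produces a one-dimensional key in which spatially nearby tiles tend to receive nearby key values, so aligned layers can be co-partitioned.

```python
def interleave_bits(col, row, bits=16):
    """Interleave the bits of a tile's column and row into a single
    Z-order (Morton) index. Tiles that are close in 2D space tend to
    land close together in the 1D key space, which helps two layers
    indexed the same way stay aligned on the same cluster nodes."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)      # even bit positions: column
        z |= ((row >> i) & 1) << (2 * i + 1)  # odd bit positions: row
    return z

# The four tiles of a 2x2 block receive the contiguous keys 0..3:
# (0,0) -> 0, (1,0) -> 1, (0,1) -> 2, (1,1) -> 3
```

The tension the talk discusses follows directly: a locality-preserving key makes joint computation over co-indexed layers cheap, but a hot spatial region then maps to a narrow key range, which can concentrate load on a few nodes unless the range is further split or salted.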
We will briefly review and justify the design choices made in GeoTrellis Spark and highlight interesting algorithms such as Ingest, which covers re-projection, mosaicking, and tiling of rasters.
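The tiling step of Ingest can be sketched as a bookkeeping problem: given a source raster's extent, find every key of the fixed layout grid that it intersects. The `tile_keys` helper below is a simplified, hypothetical illustration (a real ingest also re-projects and mosaics overlapping sources), assuming a square tile layout anchored at a given origin.

```python
import math

def tile_keys(extent, layout_origin, tile_size):
    """Return the (col, row) keys of every tile in a fixed layout grid
    that intersects the given bounding box.

    extent        -- (xmin, ymin, xmax, ymax) in world coordinates
    layout_origin -- (x, y) of the grid origin in world coordinates
    tile_size     -- width/height of one square tile in world units
    """
    xmin, ymin, xmax, ymax = extent
    ox, oy = layout_origin
    c0 = math.floor((xmin - ox) / tile_size)
    c1 = math.ceil((xmax - ox) / tile_size)
    r0 = math.floor((ymin - oy) / tile_size)
    r1 = math.ceil((ymax - oy) / tile_size)
    return [(c, r) for c in range(c0, c1) for r in range(r0, r1)]

# A 20x10 extent on a 10-unit grid covers two tiles side by side:
# tile_keys((0, 0, 20, 10), (0, 0), 10) -> [(0, 0), (1, 0)]
```

Each source raster is then cut along these grid lines and its chunks are keyed, so that chunks from different sources covering the same tile can be merged (mosaicked) into a single layer cell.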
Finally, we will examine benchmarks for common operations such as ingest, weighted overlay, and statistical summaries in the context of different index designs and persistence backends, specifically comparing the suitability of Hadoop HDFS and Accumulo for supporting these operations.