Efficient Streaming Vector Processing in Scala at Socrata

(I would be fine with a one-hour talk slot.)

At Socrata, the leaders in public open data, we process many datasets with geospatial location data. This talk covers how we perform efficient, in-memory streaming vector processing on the JVM using Scala. Why use the JVM, and Scala, for geo processing? What does an architecture look like for stream vector processing? What does vector processing on the JVM involve? What are the advantages of stream processing versus in-database? How can we efficiently represent polygons on the JVM heap? How can we cache and manage memory? How can the current architecture be scaled out to distributed stream processing systems like Apache Spark? We will attempt to answer all your questions and more.


Evan is Principal Engineer at Socrata, Inc. -- bringing the power of data to enhance citizens' lives. He loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest in open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He is an active contributor to the Apache Spark project and co-creator of the open-source Spark Job Server. He is a big believer in GitHub, open source, and meetups, and has given talks at various conferences including Spark Summit, Cassandra Summit, and PNWScala. He has Bachelor's and Master's degrees in Electrical Engineering from Stanford University.

Slides (External URL)


Session details
Schedule info
Session Time Slot(s):
Regency B - Wednesday, March 11, 2015 - 10:30 to 11:05


The talk is a pearl, really. However, will 35 minutes be enough to cover all these important concepts?
I ask because, if that's the case, I would flag it as "advanced" rather than "intermediate".


Public comment

Thanks for the intro to the big data day (and the call-outs to other projects in this space). I appreciated the intro to the issues being faced in this space.

For the PreparedGeometry story, the coordinate sequence is not used directly for JTS operations; instead it is unpacked into a graph structure (for both geometries). When comparing one geometry to many, using a PreparedGeometry keeps the unpacked graph for the one geometry and reuses it again and again during subsequent comparisons.
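
The one-to-many pattern described above might look roughly like the following in Scala. This is only a sketch under a few assumptions: it targets JTS 1.15 or later (the org.locationtech.jts packages; releases current at the time of the talk used com.vividsolutions.jts), the first line is a scala-cli dependency directive, and the names PreparedDemo, countContained, point, and square are made up for illustration.

```scala
//> using dep "org.locationtech.jts:jts-core:1.19.0"
// Assumption: JTS 1.15+ package names. Older releases use
// com.vividsolutions.jts with the same class names.
import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.locationtech.jts.geom.prep.{PreparedGeometry, PreparedGeometryFactory}
import org.locationtech.jts.io.WKTReader

object PreparedDemo {
  private val factory = new GeometryFactory()

  // Hypothetical helper: count how many candidates the region contains.
  def countContained(region: Geometry, candidates: Seq[Geometry]): Int = {
    // Prepare the "one" geometry a single time. The PreparedGeometry
    // keeps the structures it builds, so each later contains() call
    // only has to process the candidate geometry.
    val prepared: PreparedGeometry = PreparedGeometryFactory.prepare(region)
    candidates.count(c => prepared.contains(c))
  }

  // Hypothetical helper: build a point geometry from x/y coordinates.
  def point(x: Double, y: Double): Geometry =
    factory.createPoint(new Coordinate(x, y))

  // Sample region: a 10 x 10 square with one corner at the origin.
  val square: Geometry =
    new WKTReader(factory).read("POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))")
}
```

Calling PreparedDemo.countContained(PreparedDemo.square, Seq(PreparedDemo.point(1, 1), PreparedDemo.point(5, 5), PreparedDemo.point(20, 20))) should return 2, since only the first two points fall inside the square. Calling region.contains(c) directly would give the same answer, but without reusing the prepared structures between calls.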
