Lecture on Big Geospatial Data (2017)

held at Leibniz University Hanover by Martin Werner.

Introduction

Aim of the Lecture:
Students will get an overview of methods and infrastructures for parallel computing with large data sets and of methods for parallel processing of geo-information. By the end of the course, they will be able to evaluate and use frameworks and approaches to solve their own projects.

Contents:

First, different methods of parallel processing are discussed (e.g. processes, threads, semaphores, OpenMP, CUDA/GPGPU, MPI, PGAS systems, Hadoop MapReduce, NoSQL, Spark). Following this, standard approaches for the processing of location data are introduced. Topics such as aggregation (e.g. mean values, location entropy, KDE, rasterization, hotspot detection), data locality (e.g. space filling curves and geohash, space-time cubes, clustering), and processing of navigation and trajectory data are discussed using real-world examples.
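
As a taste of the data-locality topic mentioned above, here is a minimal sketch (illustrative only, not course code) of a Z-order (Morton) curve, the bit-interleaving idea behind geohash-style linearization of 2D coordinates:

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two grid coordinates into a Z-order index.

    Nearby cells in 2D tend to receive nearby linear indices, which is the
    data-locality property exploited by space-filling curves and geohash.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

print(interleave(2, 3))  # 14
```

Sorting records by such an index keeps spatially close points close together on disk, which is why these curves reappear later in the distributed storage context.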

Time Table

  • 12.4. 9:45 - 11:15 Lecture V404 (Schneiderberg 50)
  • 26.4. 9:45 - 11:15 Tutorial GIS-Labor (609 in Appelstr. 9a)
  • 2.5. 14:30 - 16:00 Lecture V404 (Schneiderberg 50)
  • 3.5. 9:45 - 11:15 Lecture GIS-Labor (609 in Appelstr. 9a)
  • Cancelled 17.5. 9:45 - 11:15 Tutorial GIS-Labor (609 in Appelstr. 9a)
  • 23.5. 16:30 - 18:00 Lecture GIS-Labor (609 in Appelstr. 9a)
  • 24.5. 9:45 - 11:15 Lecture GIS-Labor (609 in Appelstr. 9a)
  • 31.5. 9:45 - 11:15 Tutorial GIS-Labor (609 in Appelstr. 9a)
  • 14.6. 9:45 - 11:15 Tutorial GIS-Labor (609 in Appelstr. 9a)
  • 21.6. 9:45 - 11:15 Reserved* GIS-Labor (609 in Appelstr. 9a)
  • 28.6. 9:45 - 11:15 Tutorial GIS-Labor (609 in Appelstr. 9a)
  • 5.7. 9:45 - 11:15 Talk/Lecture GIS-Labor (609 in Appelstr. 9a)
  • 12.7. 9:45 - 11:15 Questions GIS-Labor (609 in Appelstr. 9a)

Official Links

Lecture Slides

Lecture 1: Introduction PDF Slides

Lecture 2: An Overview of Parallel Computing PDF Slides

  • The 3V of Big Data
  • Scalable Distributed Computing: Transactions, 2-Phase-Commit, CAP-Theorem, Named Entities, Messaging, Activities
  • Basic Principles of Computers and Operating Systems: Registers, Fetching Commands, Timer Interrupts, Multi-Tasking, Processes, Threads
  • Thread APIs, Thread Risks, Semaphores and Mutual Exclusion, Critical Sections
  • OpenMP: Annotation-based Shared-Memory Multiprocessing in C++ and Fortran
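
The race-condition and critical-section ideas above can be sketched in a few lines of Python (the lecture itself uses C++/OpenMP; this standalone example merely illustrates mutual exclusion with a lock):

```python
import threading

counter = 0
lock = threading.Lock()

def work(n: int) -> None:
    global counter
    for _ in range(n):
        # counter += 1 is a read-modify-write sequence; without the lock,
        # two threads can interleave between the read and the write and
        # lose updates (a race condition).
        with lock:  # critical section: at most one thread at a time
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 -- deterministic only because of the lock
```

Removing the `with lock:` line turns the result nondeterministic, which is exactly the class of bug the thread-risk discussion warns about.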

Lecture 3: Message Passing Interface and MapReduce PDF Slides

  • MPI: Simple, Fast, Scalable Message Passing API, ideal for supercomputers
  • Communication Topologies: Point-to-Point, Broadcast, Scatter, Gather, Reduce, AllGather
  • Additional: MPI_IO, MPI_Barrier, MPI_Probe
  • Distributed File Systems
  • MapReduce: Named Entities transformed to Named Entities with two simple methods
  • Combiner: Speeding up MapReduce by applying an associative and commutative reduce already at the source
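
The map/combine/reduce pipeline listed above can be sketched in plain Python (a toy illustration of the data flow, not the Hadoop API):

```python
from collections import Counter

documents = ["big geospatial data", "big data is big"]

# Map: each document independently emits (word, 1) pairs.
mapped = [[(w, 1) for w in doc.split()] for doc in documents]

# Combiner: pre-aggregate counts locally, before any shuffling.
# This is only correct because addition is associative and commutative.
combined = [Counter(w for w, _ in pairs) for pairs in mapped]

# Shuffle + Reduce: merge all partial counts by key.
result = sum(combined, Counter())
print(result["big"])  # 3
```

The combiner step shrinks the data that has to cross the network, which is where the speed-up in the last bullet comes from.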

Lecture 4: More on Distributed File Systems, Hadoop and MapReduce PDF Slides, Bloom Filter Slides

  • Hadoop MapReduce Example in Java
  • Files, File Systems and Distributed File Systems
    • Whole File Assumption
    • Few Writers Assumption
  • Databases
    • Relational DBMS and ACID semantics
  • Apache Cassandra
  • Bloom Filter
  • Apache Hive
  • Apache Projects Overview
  • Spatial Data Distribution
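
A Bloom filter, as used e.g. by Cassandra to avoid unnecessary disk reads, can be sketched in a few lines (a toy version for illustration; the parameter choices here are arbitrary):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a probabilistic membership test with no false
    negatives but a small chance of false positives."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Derive k hash positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # If any bit is unset, the item was definitely never added.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("hannover")
print("hannover" in bf)  # True -- never a false negative
```

A negative answer is always reliable; a positive answer only means "possibly present", so a database still has to verify it on disk.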

Video 1: A complete Big Data Stack (including Docker, Amazon Web Services, Apache Spark)

  • How to setup your development environment on a local computer for Apache Spark
  • How to use docker to isolate development retaining efficient file editing through volume mounts
  • How to use docker-compose to deploy the example locally
  • How to use docker-machine for creating, using, and removing cloud computing resources
  • How to use docker swarm mode (docker stack deploy) to execute the spark stack
  • Find the source in our GitHub repository

YouTube Video

Script

A first preview version of the script is available here as an HTML file. Please note that the script is a work in progress. Expect that everything, especially the section numbering and format, will change.

  • Script Version 1: Including first part of MPI HTML

Tutorial

Tutorial 1: PDF with Solutions

Tutorial 2: PDF

  • Prime Numbers with OpenMP and MPI
  • Word Count Using MapReduce over MPI
  • MapReduce sample implementations in R and Python
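
The prime-number exercise can also be sketched with Python's multiprocessing module (a hypothetical illustration of the same data-parallel pattern; the tutorial itself uses OpenMP and MPI):

```python
from multiprocessing import Pool

def is_prime(n: int) -> bool:
    """Trial division -- deliberately simple and CPU-bound."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(limit: int, workers: int = 4) -> int:
    # Data parallelism: each worker tests a chunk of candidates;
    # the partial results are reduced by summation (cf. MPI_Reduce).
    with Pool(workers) as pool:
        flags = pool.map(is_prime, range(limit))
    return sum(flags)

if __name__ == "__main__":
    print(count_primes(10_000))  # 1229
```

The structure mirrors the OpenMP/MPI versions: an embarrassingly parallel map over candidates followed by a single reduction.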

Tutorial 3: PDF

Some tips for using the VM or installing dependencies on your system: For this tutorial, we will be using Apache Hadoop without the included distributed file system HDFS. This means that you can just download the Hadoop binary package (http://hadoop.apache.org/releases.html). We used version 2.8.0, though everything should work with newer versions, possibly with some changed paths. As a prerequisite for Hadoop, you just need a working installation of Java. In some environments, you need to make sure that JAVA_HOME is set up properly; just look for an online tutorial for your operating system. On Linux, you can use

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

This line has already been added to the run_hadoop.sh script in ~/hadoop/markt-hadoop-streaming/. You might need to add it to your project.

Before you run any MapReduce job, be sure to remove the output directory. If you want to change the input or output, just edit the file run-hadoop.sh, which essentially contains the setup of two environment variables: JAVA_HOME for the Java version to be used and HADOOP_HOME for the Hadoop version. This is needed because it is quite common to have different versions of both on the same cluster.

The actual invocation looks like:

${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-*streaming*.jar \
-file mapper.py    -mapper mapper.py \
-file reduce.py   -reducer reduce.py \
-input txt/* -output output

This means that hadoop is instructed to run a JAR that is part of the distribution. The Hadoop streaming JAR then takes over the options -input specifying the input files, -output specifying the output directory, and -mapper and -reducer specifying executables (Python in this case, but any program can be used).
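
A minimal mapper/reducer pair for such a streaming word count might look like the following sketch (hypothetical code, not the tutorial's actual mapper.py/reduce.py; the final lines simulate the map | sort | reduce pipeline locally):

```python
from itertools import groupby

def mapper(lines):
    """mapper.py: read raw text lines, emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """reduce.py: Hadoop streaming delivers mapper output sorted by key, so
    all lines for one word are adjacent and can be summed with groupby."""
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    # Simulate the streaming pipeline: map | sort | reduce
    text = ["big geospatial data", "big data"]
    for out in reducer(sorted(mapper(text))):
        print(out)
```

In a real streaming job, each script would read from sys.stdin and print to stdout; Hadoop performs the sort-by-key shuffle between the two stages.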

Note that without changing the configuration of the Hadoop distribution, the system runs in standalone mode, which uses a single process. This is great for development and debugging. But even on a single system, you can make it faster by configuring it for pseudo-distributed execution; the Internet has plenty of tutorials on that.

Datasets

Gutenberg Ebooks

The Project Gutenberg EBook Collection contains a selection of ebooks collected from Project Gutenberg in plain text format.

Copyright laws are changing all over the world. Be sure to check the copyright laws for your country before downloading or redistributing this or any other Project Gutenberg eBook.

Download Gutenberg Ebook Dataset

Laser Scan Marktplatz Badenstedt

The following file contains a small part of mobile laser scan point cloud data from Hanover.

Laser Scan Dataset

Hands-On

Note that this section is preliminary; it is just online for testing.

This section collects information for real-world experience with geospatial big data. Many of the examples are designed such that they can be executed on typical laptops and PCs, yet would scale to large datasets. In this way, you can learn to use the technology needed for big data without having to pay for cloud computing resources. Still, you should check out the major cloud computing platforms, as many of them have free plans for students.

Virtual Machine Images

IMPORTANT: The Virtual Machines have been removed because they don’t get security updates. The text below is only for reference!

In this section, some hard disk images for virtual machines are provided in order to speed up your spatial big data experience. As you might know, data scientists need to invest more than half of their time into installing and configuring their systems. Using the images from this section, you can quickly create a preconfigured environment for working on the assignments. Note, however, that virtualization imposes a severe performance penalty on hard disk access due to managing two layers of file systems. Therefore, when working with huge datasets, it is preferable to install on bare metal.

Soon, I will also provide Docker images, which capture all needed installation work in a so-called Dockerfile. For now, I just provide a single VM image in qcow2 format for KVM (the Linux built-in virtualization) and a VDI file for all other virtualization environments.

If you are new to virtualization, VirtualBox is a good starting point, especially for Windows and Mac users. Linux users might want to follow the KVM path, as everything is already built in. It has been packaged for many distributions and should be installed from your distribution's package manager. Windows users can download VirtualBox directly from https://www.virtualbox.org/

A short guide on setting up VirtualBox for this lecture

For the following VMs, assign two CPUs and at least 4 GB of RAM, and use the user bgd with password bgd. This user is allowed to become superuser using sudo.

  • Version 3: Including Hadoop; KVM Hard Disk Image: removed
  • Version 3: Including Hadoop; Portable Hard Disk Image: removed
  • Version 2: KVM Hard Disk Image: removed
  • Version 2: Portable Hard Disk Image: removed

If you need to update the current assignments, just open a terminal (click on the button on the desktop) and issue

    cd big_geospatial_data_lecture
    git pull

Currently, the VM contains a C++ compiler and MPICH2 is available. So running g++, mpicxx, and mpirun works directly. You can use geany to edit files…

Docker

Docker is a lightweight system for running applications in isolated containers without emulating complete computers. It has been widely adopted by the Big Data community as a way of efficiently managing distributed applications at scale.

The Docker image is available from DockerHub under the name mwernerds/bgd:latest.

It contains several components used during the tutorial of the lecture and is in general a good starting point for spatial data analysis.

The current version consists of

  • Debian Base
  • Git and C/C++ development environment
  • R 3.3 from CRAN with all recommended packages

If you want to run it, just install docker and do

     docker run -it mwernerds/bgd:latest bash

This will download the image if needed. If you want to share a directory or to use the X Window System (for interactive plotting, for example), just use the xrun script

 xrun mwernerds/bgd:latest bash

This script (read it!) gives the new container access to your X display, so it can use your monitor, and mounts the subdirectory data to ~/data inside the container as a shared directory. Just copy files there and they are visible both on the host and inside the container.

The script xrun is, however, Linux-specific and won’t work on Mac or Windows.

For those operating systems, you need to find out for yourself how to mount a directory into the container and how to access the screen (for interactive plotting). On macOS, for example:

brew install socat
brew cask install xquartz
open -a XQuartz

socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"
# in another window
docker run -e DISPLAY=192.168.59.3:0 jess/geary

Feedback and Support

We appreciate your feedback and support. You can drop me a line at any time. If you have interesting examples that you want to share with your fellow students, you can either send them to me via email or create a pull request on GitHub. I would be happy to include your examples, solutions, and ports in the lecture.