Apollo: Where and When

Project: Apollo

Abstract

GIS-Applications are applications that gathers, manages and analyzes spatial-related data.

They are rooted from Geography and is now used in almost all aspects of discipline where Geographic and/or Spatial data manipulation is required. This includes applications related to engineering, planning, management, transport/logistics, insurance, telecommunications, and business [1]. Guttman [2] started the topic of generalizing B-tree (a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. [3]) to index multi-dimensional data (coordinates, polygons, rectangles, among others). Such a step enables the rise of Geo-spatial databases for gathering, analyzing and querying such data (i.e. Esri, GIS-planning and Database extensions like GeoMesa (for Hadoop), PostGIS (PostgreSQL), Oracle, CouchDB, among others) for a more customized (in case of geo-databases) and specialized use-cases (in the case of EPSG and Industrial and National Defense uses).

While these systems handles spatial dimensions with grace, adding temporal (time) dimension is proven to be difficult and expensive (in terms of storage and operations), albeit required [LANG92] (as is required by sciences such as showing specific species' given a location and time, historical tracking of tectonic activity, etc [4]) and suitable indexing structures have been made to hopefully solve such demands and published [5].

In the next chapters, I will introduce a space-filling curve called Hilbert-curve, how and why it is used in the Hilbert R-tree, and how we can hope to extend a variant called Temporal Hilbert R-Tree to include temporal dimension which is, at the very least, asymptotically optimal, in the sense that the time and space bounds are asymptotically the same as those of the single version Hilbert R-tree in the worst case.

The Article will never be finished however, as my research hasn't stopped. Apollo, whose application for the following survey is applied onto, will serve as the main engine for a database that could be used for any spatio-temporal use-cases (Uber/Lyft, where-is-it/where-are-they kind of apps, etc.)

1. Introduction

An R-tree is a height-balanced tree similar to B-tree, with index records in its leaf nodes containing pointers to data objects correspond to disk pages if the index is disk-resident, and the structure is designed so that a spatial search requires visiting only a small number of nodes. The index is completely dynamic; inserts and deletes can be inter-mixed with searches and no periodic reorganization is required. [6]

While B-trees is optimized for blocking/organizing the data given their relationships, The 1980s were a period of wide acceptance of relational systems in the market, but at the same time it became apparent that the relational model was not adequate to host new emerging applications. Multimedia, CAD/CAM, geographical, medical and scientific applications are just some examples, in which the relational model had been proven to behave poorly. More specifically, B-trees were designed to handle alphanumeric (i.e., one-dimensional) data, like integers, characters, and strings, where an ordering relation can be defined (and the fact that storage like HDDs is one-dimensional).

The key idea of the data structure is to group nearby objects and represent them with their minimum bounding rectangle in the next higher level of the tree; the "R" in R-tree is for rectangle. Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle also cannot intersect any of the contained objects. At the leaf level, each rectangle describes a single object; at higher levels the aggregation of an increasing number of objects. This can also be seen as an increasingly coarse approximation of the data set.

A visualization of an R-tree for 138k populated places on Earth

2. Hilbert Curve

A Hilbert curve (also known as a Hilbert space-filling curve) is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891, as a variant of the space-filling Peano curves discovered by Giuseppe Peano in 1890.[7]

A space-filling curve can be used for indexing n-dimensional coordinates to a one-dimension index called hilbert value.

A space filling curve visits all the points in a k-dimensional grid exactly once and never crosses itself. It makes it suitable to be used for indexing n-dimensional coordinates and sorted and stored in a single R-tree node.

Just as [KAMEL94], we define the first as:

Definition : The Hilbert value of a rectangle is defined as the Hilbert value of its center.

While most implementations takes 2-d (x and y) as inputs, we implement a C program as shown by [SKILLING2004] as it is, personally speaking, much more compact than even Hamilton's C++ implementation of "Compact Hilbert Curve"[8], and ported it to Golang. The code is available here.

[KAMEL94] have shown that this ordering has to be ‘good,’ in the sense that it should group ‘similar’ data rectangles together, to minimize the area and perimeter of the resulting minimum bounding rectangles (MBRs) where most R-tree variants suffer from, computationally speaking.

Grouping rectangles according to the rectangle's centroid hilbert values provides space utilization ≈100% and avoid overlapping rectangles and rectangular computation/s.

3. Hilbert R-tree

[KAMEL94] paper's main idea is to impose an ordering on the data rectangles. Using this ordering, each R-tree node has a well de fined set of siblings; thus, we can use the algorithms for deferred splitting.

The Hilbert R Tree has similar functions in comparison with R-tree and therefore, supports all the underlying operations of R-tree which are search, insertion and deletion.The difference in the case of Hilbert Tree being that the Hilbert value of the MBR is used instead of considering the area or the distances of the MBRs.

By having the rectangles grouped together, finding the leaf that contains a query rectangle is pretty straight-forward and easy. Furthermore, it simplifies insertion due to the fact that we only get to look for the largest Hilbert value of the node, and see if it qualifies to be there, if not, then we move forward. It also allows deferred splitting, which allows space maximization by moving entries to sibling nodes instead of creating a split as a normal R-tree would, driving the utilization as close to 100% as desirable. Notice that the R*-tree does not have control over the utilization, typically achieving an average of 70%.

Moreover, R*-tree forces re-insert by deleting some rectangles from the overflowing node, and reinserting them (operation expensive).

Search is starting from the root, it descends the tree and examines all nodes that intersect the query rectangle (yellow). At the leaf level, it reports all entries that intersect the query rectangle as qualified data items. (link)

The geometric code and test for rtreego is re-used but the entirety of the tree structure is re-engineered for Hilbert R-tree. (code repo)

4. Temporal Hilbert R-tree

The Temporal R-Tree presented in [ZIMBRAO99] is a good starting point for the Temporal Hilbert R-tree and this is also how it was presented in [BECK96] for B-tree and [OHLE94] for B+-tree.

Time and space efficiency of the R-Tree is based on the six properties proposed by Guttman [GUTT84], especially the assumptions (1) and (3).

Those properties ensure that the height of the R-Tree with N index records is at most |logm N| - 1. To maintain this efficiency, the properties of the R-Tree must be generalized to handle the existence of different version entries in a node.

Let M be the maximum number of entries that will fit in one node and let m = M / k be a parameter specifying the minimum number of live entries in a node, where k is a constant, k ≥ 2.

The six properties based on [GUTT84] should be modified as follows:

(1) Every leaf node contains at least m live index records and at most M index records unless it is the root.

(2) For each index record in a leaf node, hilbert-value is the hilbert value of the MBR's centroid.

(3) Every non-leaf node contains at least m live entries and at most M entries unless it is the root.

(4) For each entry in a non-leaf node, MBR is the smallest rectangle that spatially contains the rectangles in the child node and largest-hilbert-value is the largest value amongst all the enclosed entries.

(5) The root node has at least two live children unless it is a leaf.

(6) All leaves of the same root node appear on the same level.

We define (1), (3) and (5) properties the weak version condition.

Now we can state how structural changes are triggered in the THR-Tree (index records of leaf nodes should be handled in the same way that entries in a non-leaf node):

• A node overflow occurs as the result of an insertion of an entry into an already full node, i.e. a node that contains M entries.

• A weak version underflow occurs when the number of current version entries in a node becomes less than m (or two for root nodes).

• A node underflow never occurs, nodes deleted are only marked as dead.

A structural change to handle a node overflow or to restore the weak version condition is performed based on the block-copy operation, i.e. the node is marked as dead and the current version entries are copied into a new node. We call this operation version split. Considering a node overflow, most of the cases the live node created by the version split may be an almost full block or even a full block. To avoid this case and the similar phenomenon of an almost empty block, we state the following two new properties (7) and (8) that must be satisfied after a structural change, and call this set of properties the strong version condition.

(7) The number of live entries in a non-leaf node after a structural change must be in the range from (1 + ε) × m to (k – ε) × m, where ε is a tuning constant, ε > 0, to be defined more precisely in the following.

(8) The number of live index records in a leaf node after a structural change must be in the range from (1 + ε) × m to (k – ε) × m.

In this way, we guarantee that after a structural change on a block at least ε × m + 1 insertions or deletions of entries can be performed on this node before the next structural change becomes necessary. The choice of ε can be done in the same way as in [OHLE94], but with no merge restrictions. So, k ≥ 1/α + (1 + 1/α ) × ε - 1/m, where α is the constant of minimum node utilization for the original R-Tree. With these modifications on the R-Tree, the THR-Tree is an index structure that can handle time information and still maintain the good time and space efficiency of the original R-Tree. We assume that the related root node of time t can be found at O( 1 ) access. The search then proceeds as in the original R-Tree. Because, according to the weak version condition, each node of the THR-Tree stores at least m entries for version t, and therefore it behaves like an R-Tree with m as the minimum number of entries.

The code is only private at the moment. This will undergo many changes as my survey changes.

References:

[LANG92] Langran, G.: “Time in Geographical Information Systems”. Taylor & Francis, London, 1992

[KAMEL94] Kamel, I., Faloutsos, C. "Hilbert R-tree: An improved R-tree using fractals"

[SKILLING2004] John Skilling "Programming the Hilbert curve", (c) 2004 American Institute of Physics

[BECK96] Bruno Becker et al., An Asymptotically Optimal Multiversion B-Tree. VLDB Journal 5(4): 264-275 (1996)

[GUTT84] A. Guttman: “R-Trees: A Dynamic Index Structure for Spatial Searching”. In Proceedings of the ACM SIGMOD Intl. Conf on Management of Data, Boston, MA, 1984.

[OHLE94] Thomas Bernhard George Ohler, “On the integration of Non-Geometric Aspects into Access Structures for GIS”, PhD dissertation submitted to the Swiss Federal Institute of Technology, 1994.

[ZIMBRAO99] Geraldo Zimbrão Jano, Moreira de Souza, Victor Teixeira de Almeida, “The Temporal R-Tree”, Programa de Engenharia de Sistemas e Computação, 1999.

Myth of Sissyphus

Search This Blog