GMZ: A GML COMPRESSION MODEL FOR WEBGIS

ABSTRACT: Geography Markup Language (GML) is an XML specification for expressing geographical features. Defined by the Open Geospatial Consortium (OGC), it is widely used for storage and transmission of maps over the Internet. XML schemas make it convenient to define custom feature profiles in GML for specific needs, as seen in the widely popular CityGML, simple features profile, coverage, etc. Simple features profile (SFP) is a simpler subset of GML with support for point, line and polygon geometries. SFP has been constructed to cover the most commonly used GML geometries. Web Feature Service (WFS) serves query results in SFP by default. But it falls short of being an ideal choice due to its high verbosity and size-heavy nature, which provides immense scope for compression. GMZ is a lossless compression model developed to work with SFP-compliant GML files. Our experiments indicate that GMZ achieves reasonably good compression ratios and can be useful in WebGIS-based applications.


INTRODUCTION
1.1 Motivation
GML is a strong modelling language for the geospatial web because it derives from XML, a de facto standard for the web, which has the advantages of being human readable, browser friendly, extensible, editable and queryable. It also inherits two unavoidable drawbacks of XML: verbosity and being text based. XML is highly verbose by nature. It is stored as Unicode text, which prevents GML from storing coordinates (which make up a significant part of a GML document) as floating-point numbers or as some combination of integers that could take significantly less space than strings. This bloats GML documents and makes GML fall short of being the most favourable choice for current usage patterns, which are mostly Internet based. Consequently, we are forced to think of ways to make GML more efficient without giving up the advantages that it comes with.
The rapidly growing number of Internet users puts a lot of pressure on Internet services. Smartphones provide enough processing power to make mobile GIS feasible, but storage and bandwidth remain constrained. Compression is an obvious choice in this direction. With conventional text compression algorithms such as LZ77, Huffman coding, Burrows-Wheeler transform, PPM, etc. already in place, we are inclined to use them everywhere. But these algorithms are agnostic to the structure that exists in the data and therefore cannot exploit it to achieve better compression ratios. Consequently, they produce inferior compression ratios compared to models aimed at XML compression. GML is even more well structured and predictable, which indicates that developing GML-specific compression models to get better compression ratios makes sense. However, we refrain from developing a compression model for the whole of GML. Rather, this work is restricted to what falls into GML's Simple Features Profile (GML simple features profile, 2011) because of its high use on the Internet through WFS. It is a good first step towards realizing what future compression models should be able to achieve.

Literature survey
The majority of current work on GML compression has focused on the storage efficiency of data. GPress (Guan and Zhou, 2007) and some other compression models (Li et al., 2008; Wei and Guan, 2010) are based on three principles: separating spatial data, attribute data and file structure and storing them in different containers; applying delta encoding on floating-point coordinates; and finding semantic similarity between attributes. Based on the same idea, GQComp (Dai et al., 2009) uses a custom encoding for coordinates, makes provision for spatial and attribute data querying through the combination of a feature structure tree and R* tree spatial indexing, and achieves good compression. Another compression technique called Gtree (Harshita and Rajan, 2010; Harshita, 2013), restricted to polygon data, uses a tree-based structure for managing the coordinate data.
One common issue with most techniques is that they use delta encoding for coordinate compression, which leads to loss of precision when calculating the delta. This can lead to errors, slivers and disjoint ends that are highly undesirable. GQComp uses a lossless custom encoding for coordinates and, so far, produces the best compression ratios. But its query subsystem is of limited use, as loading the entire data in memory puts too much pressure on already laden modern-day systems. It is equivalent to decompressing the entire document and then performing the query on it.
Our technique is loosely based on Gtree and is specifically designed to work with SFP. We use a custom encoding, a mix of delta encoding and dictionary encoding, to compress coordinate data. Apart from being lossless and producing good compression ratios, our model has provision for querying in the compressed state. Though the query subsystem is still under development and out of the scope of this paper, we would like to emphasize that our model provides access to individual features by decompressing them in isolation. This is an essential requirement for querying and supports our claim.

Dataset
Because compiled SFP-compliant GML 3 datasets are unavailable, the dataset has largely been prepared by making GML files SFP compliant or by converting shapefiles into SFP-compliant GML files. QGIS has been used for the conversion process. We have prepared GML files for two countries: India and the USA. The India files were downloaded from mapcruzin.com, a provider of region-wise shapefiles, and then converted to GML. The USA GML files were downloaded from data.gov, the data portal of the US government, and then made SFP compliant. The dataset is a combination of point, line and polygon GML files. The file sizes range from 20 MB to around 1 GB, with most files under 100 MB.

Understanding the data
GML is based on an abstract model of geography given by OGC, which defines the world in terms of features, where each feature has a set of properties. Properties can be grouped into two categories: spatial properties, which are the geometries that store the coordinate data of the feature (point, line or polygon), and non-spatial properties, which are the non-spatial description of the feature. A feature is the smallest meaningful unit of GML. It can have any number of spatial and non-spatial properties. Referring to the GML snippet below, feature Road has 1 spatial property and 3 non-spatial properties. All features with the same name compulsorily have the same set of properties. Since spatial properties are fairly complex compared to non-spatial properties, they have their own GML substructure, which is identified using the gml namespace. They are described using a subset of geometry types such as Point, LineString, Curve, Polygon, Surface, etc. The usage of these geometry types is explained in detail in the SFP specification document. Non-spatial properties are restricted from having any nested structure, a notion imposed by SFP owing to the fact that databases are not designed to handle nested data. This division of data into segments (spatial, non-spatial and XML tree structure) groups semantically similar data, which in turn facilitates almost isolated and targeted compression of these data segments. Since GML, like any map data, is predominantly coordinate data, our focus will be on coordinate data compression, with provision for non-spatial data compression and XML structure encoding. Here is a list of characteristics of GML on which our compression model is based. These characteristics will be referenced in the next section:

Duplication:
Coordinates can be duplicated when adjacent polygons share boundaries, when linestrings share endpoints, or through incidental duplication among features.

Adjacency:
The difference between adjacent coordinates can be very small when data is closely packed, which is often the case in polygons and linestrings.

Text-based:
Each digit of a coordinate is stored as a byte, which ultimately bloats the size of data.
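The three characteristics can be observed on a handful of values. The following is a minimal sketch rather than part of GMZ: the coordinate values are made up for illustration, and the 6-digit scaling used to estimate a binary size is an assumption.

```python
from decimal import Decimal

# Hypothetical X-coordinates as they would appear in posList elements.
xs = ["78.486712", "78.486950", "78.486712", "78.487301", "78.486950"]

# Duplication: values shared between features appear more than once.
duplicates = len(xs) - len(set(xs))

# Adjacency: after sorting, neighbouring values differ only slightly.
unique_sorted = sorted(set(xs), key=Decimal)
deltas = [Decimal(b) - Decimal(a) for a, b in zip(unique_sorted, unique_sorted[1:])]

# Text-based: every character costs one byte, whereas the same value
# scaled to an integer (assuming 6 decimal digits) needs fewer bytes.
text_bytes = len(xs[0].encode("utf-8"))
scaled = int(Decimal(xs[0]).scaleb(6))
binary_bytes = (scaled.bit_length() + 7) // 8

print(duplicates, deltas, text_bytes, binary_bytes)
```

Decimal is used instead of float so that the differences are exact, which matters for a lossless model.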

Compression Model
The algorithm is a 2-step process that involves 2 passes over the document. The steps are explained in detail below:

First pass (coordinate compression):
In the first pass, the coordinate data is separated from the GML tree into XList, which stores X-coordinates, and YList, which stores Y-coordinates. The following steps are performed on each list separately, in order: duplicate removal, sorting and index building.
Duplicate removal is done to eliminate the data redundancy that we identified in point 2.1.1. This is followed by sorting. The intuition behind sorting can be understood in relation to delta encoding. Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files; more generally this is known as data differencing (Delta encoding). It is well suited to sorted data because sorting brings coordinates with the least difference adjacent to each other, which produces minimum values of delta. This corresponds to point 2.1.2. The next step is the creation of coordinate dictionaries, XMap and YMap, which store each coordinate and its reference as a key-value pair. This reference is just the array index of the coordinate in XList for an X-coordinate or in YList for a Y-coordinate. These indices will be used in place of the original coordinates during the structure compression step.
The coordinate compression function, coordCompressor, takes one coordinate list at a time and applies a custom encoding to it. A coordinate is broken down into its integral and decimal parts. The integral parts of successive coordinates exhibit very high repeatability and therefore cost almost nothing to store: an integral part is written only when a new one is encountered. The integral part ranges from -90 to 90 for latitude and -180 to 180 for longitude. Hence, it can be stored in a signed 2-byte short int datatype. On the other hand, the decimal parts of successive coordinates exhibit high proximity, i.e., the mathematical difference between consecutive decimal parts is significantly smaller than the decimal parts themselves. The delta of two consecutive decimal parts is what is stored. It can be stored using any of the unsigned integer datatypes (byte, short int, int or long long int) depending on its size. This is different from many compression models, which directly apply delta compression to the entire coordinate, leading to lossy compression. There is also some metadata that needs to go along with each coordinate: a flag to indicate whether a new integral part is encountered, the length of the decimal part and the datatype used for storing the delta. We have managed to store all this metadata in just one byte. These operations are applied to each coordinate and help solve the issue identified in point 2.1.3.
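The list and dictionary construction described above can be sketched as follows. This is an illustration under stated assumptions, not the GMZ source: the sample document, the ex namespace and the function name first_pass are hypothetical, coordinates are assumed to be two-dimensional with X first, and values are kept as strings so that no precision is lost.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical SFP-like document: two triangles that share an edge.
SAMPLE = """
<ex:Parcels xmlns:ex="http://example.org/ex" xmlns:gml="http://www.opengis.net/gml">
  <ex:Parcel><ex:geom><gml:Polygon><gml:exterior><gml:LinearRing>
    <gml:posList>78.10 17.20 78.15 17.20 78.15 17.25 78.10 17.20</gml:posList>
  </gml:LinearRing></gml:exterior></gml:Polygon></ex:geom></ex:Parcel>
  <ex:Parcel><ex:geom><gml:Polygon><gml:exterior><gml:LinearRing>
    <gml:posList>78.15 17.20 78.20 17.20 78.15 17.25 78.15 17.20</gml:posList>
  </gml:LinearRing></gml:exterior></gml:Polygon></ex:geom></ex:Parcel>
</ex:Parcels>
"""


def first_pass(gml_text):
    """Build the sorted, duplicate-free lists and their lookup dictionaries."""
    x_values, y_values = [], []
    for elem in ET.fromstring(gml_text.strip()).iter():
        # Match on the local name so that the GML namespace version does not matter.
        if elem.tag.rsplit("}", 1)[-1] in ("pos", "posList"):
            tokens = (elem.text or "").split()
            x_values.extend(tokens[0::2])
            y_values.extend(tokens[1::2])

    def build(values):
        # Duplicate removal, then numeric sorting (ties broken by the text itself).
        ordered = sorted(set(values), key=lambda s: (Decimal(s), s))
        # Index building: coordinate text -> position in the sorted list.
        return ordered, {coord: i for i, coord in enumerate(ordered)}

    x_list, x_map = build(x_values)
    y_list, y_map = build(y_values)
    return x_list, y_list, x_map, y_map


x_list, y_list, x_map, y_map = first_pass(SAMPLE)
print(x_list, y_list, x_map["78.15"])
```

In this sample, sixteen coordinate values collapse into three distinct X values and two distinct Y values, which is the effect that shared boundaries are expected to have.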
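A possible shape for such an encoder is sketched below. The paper does not give the bit layout of the metadata byte, so the layout here is an assumption: 1 bit for the new-integral-part flag, 4 bits for the decimal length, 2 bits for the datatype and 1 bit for the sign of the delta. The sign bit and the mapping of negative integral parts (used to keep "-0" distinct from "0") are also assumptions of this sketch, as are the function names. Input is assumed to be plain decimal strings with at most 15 decimal digits.

```python
SIZES = (1, 2, 4, 8)  # byte, short int, int, long long int


def split_coordinate(text):
    """Return (integral code, decimal digits) for a plain decimal string."""
    whole, _, digits = text.partition(".")
    negative = whole.startswith("-")
    magnitude = int(whole[1:] if negative else whole)
    # Assumption: negatives are stored as -magnitude - 1 so that "-0" is kept.
    return (-magnitude - 1 if negative else magnitude), digits


def coord_compress(coords):
    out = bytearray()
    prev_whole, prev_value = None, 0
    for text in coords:
        whole, digits = split_coordinate(text)
        if len(digits) > 15:
            raise ValueError("this sketch keeps the decimal length in 4 bits")
        new_whole = whole != prev_whole
        if new_whole:
            prev_value = 0  # deltas restart with every new integral part
        value = int(digits) if digits else 0
        delta = value - prev_value
        code = next(i for i, size in enumerate(SIZES) if abs(delta) < 1 << (8 * size))
        # Metadata byte: flag | decimal length | datatype code | delta sign.
        out.append((new_whole << 7) | (len(digits) << 3) | (code << 1) | (delta < 0))
        if new_whole:
            out += whole.to_bytes(2, "little", signed=True)
        out += abs(delta).to_bytes(SIZES[code], "little")
        prev_whole, prev_value = whole, value
    return bytes(out)


def coord_decompress(blob):
    coords, pos, whole, prev_value = [], 0, 0, 0
    while pos < len(blob):
        meta = blob[pos]
        pos += 1
        length, size = (meta >> 3) & 0xF, SIZES[(meta >> 1) & 0x3]
        if meta >> 7:
            whole = int.from_bytes(blob[pos:pos + 2], "little", signed=True)
            pos += 2
            prev_value = 0
        magnitude = int.from_bytes(blob[pos:pos + size], "little")
        pos += size
        prev_value += -magnitude if meta & 1 else magnitude
        text = f"-{-whole - 1}" if whole < 0 else str(whole)
        if length:
            text += "." + str(prev_value).zfill(length)  # restore leading zeros
        coords.append(text)
    return coords


sample = ["-0.5", "17.3850", "17.3851", "17.39", "78", "78.4867"]
packed = coord_compress(sample)
print(len(packed), coord_decompress(packed) == sample)
```

Because the decimal digits are handled as integers together with their length, trailing and leading zeros survive the round trip, which is what keeps the scheme lossless in this sketch.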

Second pass (structure and attribute compression):
In the second pass, the structure tags are replaced by the corresponding encodings and attribute data is compressed. The properties of a feature have their own custom namespaces and tags. Nonetheless, they remain the same for all features sharing the same name. Therefore, we store these property names just once per unique feature name. We traverse the feature tree and identify whether each property is spatial or non-spatial. Spatial properties are made up of the 12 geometry types and 10 subtypes provided in SFP. These geometry types are tightly structured due to a strict usage specification. These 22 tags have been given values from 0 to 21. The tags are replaced by their encodings while traversing the spatial property. Two tags, pos and posList, are the innermost tags in a geometry tree structure and enclose the coordinates of a feature's geometry. We now look up these coordinates in XMap and YMap and replace them with their indices. The values of these indices will be of the order of millions when we have, say, millions of coordinates. To avoid storing such high values, we store just the delta of the indices of successive coordinates. The idea is that coordinates in a feature tend to be very close numerically. Therefore, their positions in the coordinate list will be close, leading to a small delta. In our experience, this delta could most often be stored in a single signed byte.
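The index delta step can be illustrated with a short sketch. The text does not say how deltas that do not fit in a signed byte are handled, so the escape marker below is an assumption, as are the function names. Each call starts from zero, so the first index of a feature is stored in full and the feature can be decoded on its own.

```python
import struct

ESCAPE = -128  # assumed marker: the next 4 bytes hold the full signed delta


def encode_indices(indices):
    """Delta-encode the list indices of one feature's coordinates."""
    out = bytearray()
    prev = 0
    for index in indices:
        delta = index - prev
        if -127 <= delta <= 127:
            out += struct.pack("<b", delta)           # common case: one signed byte
        else:
            out += struct.pack("<bi", ESCAPE, delta)  # rare case: marker + 4 bytes
        prev = index
    return bytes(out)


def decode_indices(blob):
    indices, pos, prev = [], 0, 0
    while pos < len(blob):
        (delta,) = struct.unpack_from("<b", blob, pos)
        pos += 1
        if delta == ESCAPE:
            (delta,) = struct.unpack_from("<i", blob, pos)
            pos += 4
        prev += delta
        indices.append(prev)
    return indices


# Hypothetical XList indices of four neighbouring vertices of one feature.
feature_x = [1000000, 1000003, 1000001, 1000002]
packed = encode_indices(feature_x)
print(len(packed), decode_indices(packed))
```

Only the first index needs the escape here; the remaining three are one byte each, compared with four bytes per index if the raw values were stored.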

Figure 3. GML geometry tags available in SFP
Relative to the amount of coordinate data, attributes often make up a small part of GML. Developing a sophisticated compression technique at the cost of increased complexity is not worthwhile considering the small change in overall compression ratio contributed by attribute compression. We start by identifying the type: integer, float or string. Integer and float types are stored as integers and floats, respectively. String types show some amount of duplication across features. They are evaluated for the feasibility of dictionary encoding and stored accordingly.
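The attribute handling can be sketched as a type check followed by a size comparison. The feasibility criterion is not stated in the text, so the rule below (dictionary encoding is used only when the dictionary plus the per-value indices is smaller than the plain strings) is an assumption, and the function names are hypothetical.

```python
def classify(value):
    """Return 'int', 'float' or 'str' for an attribute value given as text."""
    for kind, parse in (("int", int), ("float", float)):
        try:
            parse(value)
            return kind
        except ValueError:
            continue
    return "str"


def dictionary_pays_off(values):
    """Estimate whether dictionary encoding shrinks a column of strings."""
    unique = set(values)
    # Plain storage: every string plus one terminator byte.
    plain = sum(len(v.encode("utf-8")) + 1 for v in values)
    # Dictionary storage: each distinct string once, plus one index per value.
    index_width = 1 if len(unique) <= 256 else 2 if len(unique) <= 65536 else 4
    encoded = sum(len(v.encode("utf-8")) + 1 for v in unique) + index_width * len(values)
    return encoded < plain


print(classify("42"), classify("3.5"), classify("NH-44"))
print(dictionary_pays_off(["highway"] * 4 + ["street"]))
print(dictionary_pays_off(["a1", "b2", "c3"]))
```

Columns with many repeated strings, such as road classes, pass the check, while columns of unique identifiers do not and would be stored as they are.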

Software tools
In a typical client-server architecture such as WFS, the server is the producer and provider of GML and therefore primarily needs compression support. On the other hand, the client (web browser) is the consumer of GML and therefore needs decompression and visualization support. Accordingly, we have created separate tools for the server and the client. The compression model has been developed as a Python script, with options for compression and decompression of a given GML file. The C version of ElementTree, known as cElementTree, which is faster and uses less memory, is used for parsing the XML tree structure. We have constructed functions to handle each geometry type supported in SFP. The compressed data is stored in a Python bytearray, which is dumped to a binary file at the end. A compressed binary file is finally returned.
GML, being XML based, has the advantage of being browser readable. It can be parsed natively and rendered in the browser itself. GMZ has no such advantage. To make it browser readable, a lean, cross-platform, client-side and easily installable solution in the form of a Firefox add-on has been implemented to perform decompression and visualization. The files are rendered as SVG in the browser tab.
The average compression percentages for the India and US datasets are 78.64% and 73.05% respectively. This is better than any compression model that we know of, including GQComp. None of the compression models provide compression and decompression speeds for comparison. However, we would argue that our compression and decompression speeds should be comparable to the others, as our model is of similar complexity. When compared with Zip, our model does better in compression ratio but lags behind in speed because Zip is ignorant of the structure in the data. A noteworthy observation is that the compression ratios for polygon and linestring data are relatively higher than those for point data. This was anticipated because point data contains far less spatial data than linestring and polygon data.
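To relate the reported percentages to the remaining file size, the following sketch assumes that compression percentage means the share of the original size that is saved. The text does not spell out the definition, and the function names and file sizes are illustrative only.

```python
def compression_percentage(original_bytes, compressed_bytes):
    """Share of the original size that compression removes, in percent."""
    return 100.0 * (1.0 - compressed_bytes / original_bytes)


def remaining_fraction(percentage):
    """Fraction of the original size that is left after compression."""
    return 1.0 - percentage / 100.0


# Hypothetical file sizes, chosen only to illustrate the formula.
print(compression_percentage(100_000_000, 25_000_000))

# Reported averages for the India and US datasets.
print(round(remaining_fraction(78.64), 4))
print(round(remaining_fraction(73.05), 4))
```

Under this assumption, the reported averages leave roughly 21% and 27% of the original size, that is, between about one-fifth and one-fourth.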

Conclusion and future work
The results presented in this work demonstrate the effects that the topology and structure of the coordinate data have on the compression potential of a standard GML file. We were able to reduce the original data to between one-fourth and one-fifth of its size without any loss of data. We also showed how GMZ can be used as a data transfer format in WebGIS by making it browser readable through a Firefox add-on. Hence, GMZ can fit into and serve the entire pipeline of a WFS-like architecture.
Future work will mainly focus on implementing an interface for querying and on exploring which spatial and non-spatial queries can be performed on the data without decompressing the file. We would also try to optimize it for speed. Another important item for future work is providing native support for GMZ in the browser. One more challenge we plan to address is integrating GMZ as one of the file exchange formats for web services like WFS. We anticipate that this will not only reduce the load on data transmission (bandwidth) but also improve some data processing capabilities, since the smaller data size reduces the memory footprint.

Figure 4. Rendered GMZ in Firefox using an add-on

Compression model pipeline

Table 2. Compression ratios of USA GML files