FACILITATING INTEGRATED SPATIO-TEMPORAL VISUALIZATION AND ANALYSIS OF HETEROGENEOUS ARCHAEOLOGICAL AND PALAEOENVIRONMENTAL RESEARCH DATA

In the context of the Collaborative Research Centre 806 "Our way to Europe" (CRC806), a research database is being developed that integrates data from archaeology, the geosciences and the cultural sciences to facilitate integrated access to heterogeneous data sources. This contribution presents a practice-oriented data integration concept and its implementation. The data integration approach is based on Semantic Web Technology and is applied to the domains of archaeological and palaeoenvironmental data. The aim is to provide integrated spatio-temporal access to an existing wealth of data and thereby to facilitate research on the integrated data basis. For the web portal of the CRC806 research database (CRC806-Database), a number of interfaces and applications have been evaluated, developed and implemented to expose the data to interactive analysis and visualization.


INTRODUCTION
The Collaborative Research Centre 806 (CRC806) is an interdisciplinary research project (www.sfb806.de) with more than 100 researchers from the disciplines of archaeology, the geosciences and cultural sciences, funded by the German Research Foundation (DFG). A central research database, the CRC806-Database, is currently under development and sets out to accomplish two main goals. The first goal is to provide a long-term archive and publication platform for results produced by CRC806 researchers. This aspect implements the data management policy that is mandatory for DFG-funded CRCs (DFG, 1998; Effertz, 2010) and will be, from the data management perspective, comparable to other DFG-funded CRC research databases, e.g. Curdt et al. (2011). The second goal is to provide an integrated data basis to facilitate the research within CRC806. This paper focuses on the development of this second aspect.
Generally speaking, there is a wealth of information and data already available for the two data domains that are considered in this task. However, both archaeologists and palaeoenvironmentalists use an extensive and ever-changing array of recording systems, all based on diverse theoretical perspectives, typologies, nomenclatures and methods. Custom data formats that are poorly documented or not documented at all, together with general access constraints on potentially interesting datasets, add a further dimension to the problem. The aim of this work is to facilitate research within the CRC806 and the interested community by providing integrated access to the archaeological and palaeoenvironmental data.

RELATED WORK
Significant theoretical work has been carried out in the research domain of this paper. The topics of semantic interoperability (e.g. Kavouras and Kokla, 2008), spatio-temporal data models (e.g. Sellis et al., 2003) and spatio-temporal data integration (e.g. Visser, 2004; Kauppinen et al., 2010) are theoretically well established in GIScience research.
To the knowledge of the authors, there is no published previous work concerning the integration of prehistoric archaeological data and palaeoenvironmental data using semantic web technology. The most closely related and comparable work has been done in the cultural heritage domain. This work is mainly driven by museums and institutions and mostly concerns classical archaeology. The best known development from this field is probably the CIDOC-CRM data model (Doerr, 2003). There is also significant work on applying semantic web technologies to model and integrate (classical) archaeological data, for example Isaksen et al. (2009) and Martinez and Isaksen (2010), but there seems to be no previous work on integrating (prehistoric) archaeological and (palaeo)environmental data.

METHODS AND DATA
The presented concept aims to integrate heterogeneous data sources of the archaeological and palaeoenvironmental domains. The two domains are very different in their composition, concept and scope, and as a result very different in their semantics. They have one key aspect in common: both domains represent spatio-temporal data. Accordingly, each of the integrated datasets has a spatial as well as a temporal extent, a factor that has been at the centre of our considerations from the very outset of data model development. This understanding led us to implement the presented data integration concept so as to achieve semantic interoperability (Kavouras and Kokla, 2008) in the spatial as well as in the temporal semantics of the two data models.

Methods
To integrate the data technically, we applied Semantic Web Technology (Allemang and Hendler, 2011; Segaran et al., 2009). This technology provides a well-developed set of methods, concepts, and implementations for semantically interoperable data integration.
In particular, besides common GIS and programming tools, we apply the Resource Description Framework (RDF) as a formal model and data format (Klyne and Carroll, 2004) and the Web Ontology Language (OWL) (Bechhofer et al., 2004) to model and formalize the ontologies of the integrated data models.
The integration concept that we apply (Willmes and Bareth, 2012) is an iterative approach. This means that the semantics of the resulting integrated model can be extended by each dataset that is to be integrated. Instead of modelling the semantics in advance and integrating the data into those static models, we adopt every entity that is not yet covered into the integrated model if it cannot be aligned with existing semantic entities of the current integrated model. If necessary, the semantic alignment of entities is done in consultation with domain experts within the CRC806.
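The iterative step described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory model (a dictionary of entity names and their known aliases) and a hypothetical `align` helper with a simple name-matching rule; in the workflow described here, alignment decisions are made manually and, where necessary, together with domain experts.

```python
def align(term, model):
    """Return the model entity that `term` aligns with, or None.

    Hypothetical rule: a term aligns if it equals an entity name or one
    of its recorded aliases, ignoring case and surrounding whitespace.
    """
    key = term.strip().lower()
    for entity, aliases in model.items():
        if key == entity.lower() or key in {a.lower() for a in aliases}:
            return entity
    return None


def integrate(dataset_terms, model):
    """Map each source term to a model entity, extending `model` in place."""
    mapping = {}
    for term in dataset_terms:
        entity = align(term, model)
        if entity is None:
            # Not covered by the current model: adopt it as a new entity.
            model[term] = set()
            entity = term
        mapping[term] = entity
    return mapping


# Illustrative entity names only, not the actual CRC806 vocabulary.
model = {"Site": {"Locality", "Fundstelle"}, "Artefact": {"Find"}}
mapping = integrate(["locality", "Find", "Layer"], model)
```

After the call, `mapping` relates the first two source terms to existing entities, while `Layer` has been added to `model` as a new entity. The model therefore grows only where alignment fails.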

Data
A remarkable number of research datasets (measurements and models) has already been created in both considered domains within the research community. The datasets employed for integration in this paper are only a fraction of the potentially available datasets.
As mentioned above, both data domains considered in this work represent data that is intrinsically spatio-temporal. Apart from this common factor, the two data domains are notably diverse in their semantics. The key facts about the datasets that have been integrated so far are given in the following.

Archaeological Data:
The data model for the archaeological domain mainly handles data about dated and classified artefacts found at an excavation site, but also deals with the description of the excavation sites themselves. The data is spatially point based: artefacts are spatially referenced by a site, which in turn is spatially referenced through a point coordinate. The artefact data is temporally referenced by a single date (a laboratory measurement aligned to a statistical age model), and/or artefacts may be classified by technological/cultural periods. These periods are temporally referenced by a time range, which is modelled as a continuous amount of time between a start and an end date.
The site based data is mostly temporally referenced by a period or a culture, which translates into a time range.
For the development of the data model, the datasets listed in Table 1 have been integrated. The data stem from the published databases NESPOS (Bradtmöller et al., 2010), CalPal (Weninger et al., 2010) and Stage3 (van Andel and Davies, 2003), and from unpublished project-internal data collections. The CalPal, Stage3 and project-internal datasets were provided in tabular form, each with custom semantics and schemata. The NESPOS data is derived from the NESPOS web site (www.nespos.org).
For the palaeoenvironmental databases (Table 2), the number n of derived spatial datasets results from the sum of the temporal steps vt per environmental variable, multiplied by the number of temporal periods δT of the database. In most cases, vt is 1 (annual), 12 (monthly), 13 (Plant Functional Type, pft) or 24 (hourly values) (see equation 1).
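Read literally, the description of n above corresponds to the following form of equation 1. This is a reconstruction from the wording, and the index i over the V environmental variables of a database is notation introduced here for illustration:

```latex
n = \delta T \cdot \sum_{i=1}^{V} v_{t,i}
```

Here $v_{t,i}$ is the number of temporal steps of the $i$-th environmental variable and $\delta T$ is the number of temporal periods of the database.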
The integrated databases deal with many environmental variables (V in Table 2), which are not yet integrated semantically. This task requires a more detailed study of the given models in order to understand the variables correctly and to be able to align them semantically.

DATA INTEGRATION
A set of Python programs was developed during the course of the data integration process. Python was chosen because it offers a wide range of libraries supporting semantic web and GIS technology.
In contrast to integration approaches that build on databases with rather strict schemata, for example relational databases, the graph data model implemented by RDF allows the semantics of the data to be expanded and altered in the course of further data integration.
The integration process (see Figure 1) is not fully automated, because a custom translation is developed for each new dataset that is to be integrated. During this development process, the semantic entities of the dataset are manually aligned with the current internal model by formulating the semantic mapping in Python code. If a semantic entity is not yet represented in the current model or cannot be aligned with it, the entity is added to the internal model, which expands the previous internal semantic model. This technique ensures that almost no semantic information of the source database is lost and that the integrated data always suits further analysis, modelling and visualization applications.
Figure 1: Flow diagram of the data integration process.
A problem of the presented approach at this stage is the identification of objects that refer to the same object in reality but are represented by different Uniform Resource Identifiers (URIs) in the integrated database. This occurs if an identifying individual/entity, for example a site name, is spelt differently or given in another language in the source datasets, which then results in different URIs for those objects. Technical solutions for this kind of problem are available (e.g. gazetteers or thesauri) and will be taken into account in a next step.
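The spelling part of this problem can be illustrated with a minimal sketch that normalizes a site name before a URI is minted. The namespace and the function name `site_uri` are hypothetical, and the sketch does not replace a gazetteer or thesaurus: names given in different languages would still produce different URIs.

```python
import re
import unicodedata

BASE = "http://example.org/crc806/site/"  # hypothetical namespace


def site_uri(name):
    """Mint a URI from a site name after simple normalization."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lower-case and collapse every run of non-alphanumerics to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", plain.lower()).strip("-")
    return BASE + slug


# Spelling variants of one hypothetical site name map to the same URI.
variants = ["Cueva Ejemplo", "cueva  ejemplo", "Cuéva-Ejemplo"]
uris = {site_uri(v) for v in variants}
```

All three variants yield the same URI, so records from different sources that differ only in case, spacing, punctuation or diacritics end up on one node of the graph.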
If there is a reason to doubt the semantics of a data object or if a researcher needs to investigate the semantics of the source data, a reference to the source database in its original model and data format is always provided for each data object of the integrated database.

Archaeological Domain
For each source database, the semantic mapping to the internal model is formulated in Python code. First, a reader for the file formats (mostly Excel and CSV) of the file-based databases (CalPal, Stage3 and internal collections) was implemented. This reader maps the schema of the source dataset to a Python object, which is then mapped into an RDF representation. The resulting RDF graph is then written into the central RDF store.
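The reader and mapping step can be sketched as follows, using only the Python standard library and emitting N-Triples lines. The namespace, the column names and the property names are assumptions made for illustration; they are not the schema of CalPal, Stage3 or the CRC806 model, and a production mapping would more likely use an RDF library.

```python
import csv
import io

BASE = "http://example.org/crc806/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from source column names to model properties.
COLUMN_MAP = {"site_name": "siteName", "age_bp": "dateBP"}


def rows_to_triples(csv_text):
    """Yield (subject, predicate, object) terms for every CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}artefact/{row['id']}>"
        yield subject, f"<{RDF_TYPE}>", f"<{BASE}Artefact>"
        for column, prop in COLUMN_MAP.items():
            # Minimal literal escaping: backslashes and double quotes only.
            value = row[column].replace("\\", "\\\\").replace('"', '\\"')
            yield subject, f"<{BASE}{prop}>", f'"{value}"'


def to_ntriples(triples):
    """Serialize triples as one N-Triples statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# One row of made-up source data.
source = "id,site_name,age_bp\n1,Example Cave,21000\n"
ntriples = to_ntriples(rows_to_triples(source))
print(ntriples)
```

Keeping the column-to-property mapping in one dictionary per source mirrors the idea of a custom translation per dataset: a new source needs a new mapping, not a new serializer.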
For the NESPOS database, a website scraping script was developed in order to collect the archaeological information from the public space of the NESPOS web portal. So far, the collected data is site based. This may be extended to artefact records in the future. The RDF mapping is implemented in an additional step, and the dataset is likewise integrated by writing the resulting RDF graph into the central RDF store.
Each major entity (Artefact, Site, SiteAttribution) of the integrated database will be represented by a dynamically created webpage, accessible under its RDF URI within the CRC806-Database web portal. The webpage will contain all available information about the object, including references and links to the source database.

Palaeoenvironmental Domain
In the case of the palaeoenvironmental databases, the integration process differs from the approach applied to the archaeological databases. The datasets of the archaeological domain are spatially mainly point (site) based, whereas datasets of the palaeoenvironmental domain are mainly represented as discrete spatial fields (grids). This leads to a more GIS-based integration approach.
Similar to the integration process for the archaeological data, a custom reader for each dataset is implemented in a Python program. In contrast to the process for the archaeological domain, each dataset is transformed into a common GIS data format (GeoTIFF or Shapefile) in addition to the RDF mapping.
In further contrast to the archaeological integration, not every data object of the source is mapped into RDF. Only important metadata and the references to the derived GIS datasets, which contain the values of all data objects of the source, are mapped. This is due to the nature of the palaeoenvironmental data, which is mostly represented in spatial grids with n values (see equation 1) per grid node. With g grid nodes, this results in g • n data objects, which easily becomes a very large number of data objects to be mapped in RDF.
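A short worked example shows how quickly this count grows. The variable set, the number of periods and the grid size below are hypothetical values chosen for illustration, not figures from the integrated databases.

```python
def derived_dataset_count(steps_per_variable, periods):
    """n: temporal steps summed over all variables, times the periods."""
    return sum(steps_per_variable) * periods


def data_object_count(grid_nodes, n):
    """Number of values that a full RDF mapping would have to represent."""
    return grid_nodes * n


# Hypothetical database: two monthly variables (12 steps each) and one
# annual variable (1 step), provided for 2 periods on a 144 x 73 grid.
n = derived_dataset_count([12, 12, 1], periods=2)
g = 144 * 73
total = data_object_count(g, n)
print(n, g, total)
```

Even this small hypothetical case yields more than half a million values, which is why only metadata and references to the derived GIS datasets are mapped into RDF.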
For each integrated palaeoenvironmental dataset, a website with all available information about the data and references to its source will be available from its RDF URI within the CRC806-Database web portal.

RESULTS
The main results of the presented work are i) a comprehensive integrated palaeoenvironmental and archaeological database, ii) a semantic web-enabled archaeological data model, iii) a semantic web-enabled palaeoenvironmental data model, and iv) shared spatial and temporal semantic models.

Archaeological Model
The archaeological model (see Figure 2) comprises three main objects: Artefact, Site and SiteAttribution. Most of the archaeological datasets integrated so far into the database and the underlying model are based on records relating to artefacts. Artefacts are located by a reference to the excavation site at which they were found. Additionally, some datasets are based on records per excavation site. This kind of record deals with abbreviated variables, which, in most cases, are derived from the artefacts found at a given site. Such variables are strongly connected to artefact characteristics (age, cultural attribution, etc.), but the actual reference to artefacts is not always given. For this kind of record, the object SiteAttribution was developed. It makes it possible to characterize a site object with additional semantics that are not generally applicable (for example, valid only for a given point in time or a time range), and thus enables site-based analysis.
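The relationship between the three objects can be sketched with plain Python data classes. Only the three object names come from the model described above; all field names and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    lat: float  # WGS84 latitude
    lon: float  # WGS84 longitude


@dataclass
class Artefact:
    identifier: str
    site: Site                     # artefacts are located via their site
    date_bp: Optional[int] = None  # single date in years BP, if available
    period: Optional[str] = None   # technological/cultural period, if any


@dataclass
class SiteAttribution:
    site: Site
    attribute: str                  # e.g. a cultural attribution
    value: str
    start_bp: Optional[int] = None  # validity range, if restricted in time
    end_bp: Optional[int] = None
    source: Optional[str] = None    # reference to the source database


# Hypothetical records.
cave = Site("Example Cave", lat=37.0, lon=-4.0)
find = Artefact("A-1", site=cave, date_bp=21000)
attribution = SiteAttribution(cave, "culture", "Solutrean",
                              start_bp=25000, end_bp=18000)
```

A SiteAttribution refers to a site directly, so site-level statements can be stored and queried even when the artefacts they were derived from are not part of the source record.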
Furthermore, references to the source databases, and in case of the NESPOS datasets, links to the webpages of the sites within the NESPOS website, are provided within the RDF graph model.

Palaeoenvironmental Model
As described above, not all data objects of the palaeoenvironmental source datasets are modelled in RDF. Thus, the model concentrates on the parameters that are relevant for integrated spatio-temporal filtering and on information about the environmental variables present in the datasets.
Each dataset (see Figure 3) is spatio-temporally referenced within the model, using the shared temporal and the shared spatial model. The palaeoenvironmental datasets are temporally defined by palaeoenvironmental periods, dates or events (OIS stages, glacial periods, Heinrich events, etc.), which translate into time ranges or, in the case of events and dates, into points in time. Spatially, the datasets are integrated via the geographic (WGS84) bounding boxes of the derived GIS datasets, as described by the shared spatial model. Relevant meta information about the dataset, such as environmental variables and references to the original source datasets, is also represented in the RDF model.
Furthermore, the content and dataset type are classified in the semantics of the dataset objects. With this information, datasets can be filtered and accessed in spatio-temporal alignment with the overall semantics of the data model. The spatial data is stored internally in a processed GIS data format, as well as in the original data format with a reference to the source database.

Shared models
The developed models share the same spatio-temporal model. The right side of Figure 3 shows an abstracted graph of the simple spatial and temporal models. To describe the spatial and temporal extent of the entities, the shared model is subdivided into a spatial and a temporal model. These models will be continually extended and refined semantically. In particular, a semantic integration with existing spatio-temporal ontologies is planned to strengthen the interoperability of the models. This semantic integration is implemented by formulating owl:sameAs statements between individuals in the developed domain ontologies and existing vocabularies.
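Such a link is a single triple whose predicate is the OWL property `owl:sameAs`. The sketch below writes links as N-Triples lines; both URIs in the example are hypothetical placeholders, not identifiers from the CRC806-Database or from an existing vocabulary.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(links):
    """Return one N-Triples line per (local URI, external URI) pair."""
    return [f"<{local}> <{OWL_SAME_AS}> <{external}> ."
            for local, external in links]


# Hypothetical individuals in the local model and in an external vocabulary.
links = [("http://example.org/crc806/period/lgm",
          "http://example.org/external/periods/last-glacial-maximum")]
statements = same_as_triples(links)
print("\n".join(statements))
```

Because the links are ordinary triples, they can be stored in the same RDF store as the data and extended without changing the existing graph.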
Spatial model: The spatial model is at this stage rather simple and basically represents spatial points, defined by a WGS84 lat/long coordinate. Spatial fields, which are defined by a simple bounding box, are represented by the bounding north, east, west and south WGS84 ordinates.
The WGS84 semantics of the spatial model are in alignment with the Basic Geo (lat/long) Vocabulary by the W3C Semantic Web Interest Group (2003), where the bounding box is defined by two coordinates, the minimum longitude and minimum latitude point and the maximum longitude and maximum latitude point.
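The spatial filtering that this simple model supports can be sketched as follows. The class and method names are hypothetical, and the sketch assumes boxes that do not cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Bounding box in WGS84 degrees (min lon/lat and max lon/lat)."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lat, lon):
        """True if the point lies inside the box or on its border."""
        return (self.south <= lat <= self.north
                and self.west <= lon <= self.east)

    def intersects(self, other):
        """True if the two boxes share at least one point."""
        return not (other.west > self.east or other.east < self.west
                    or other.south > self.north or other.north < self.south)


# Hypothetical query window and dataset extents.
window = BoundingBox(west=-10.0, south=35.0, east=5.0, north=45.0)
grid_extent = BoundingBox(west=-15.0, south=30.0, east=40.0, north=70.0)
far_extent = BoundingBox(west=100.0, south=-10.0, east=120.0, north=10.0)
```

Point-based records (sites) are tested with `contains`, grid-based datasets with `intersects`, so both data domains can be filtered against the same query window.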
Temporal model: The temporal model is slightly more complex than the spatial model, because it additionally deals with defined names of events and periods, which translate into dates and time ranges. Events translate into a point in time or date, and periods translate into time ranges, which are defined by a start and an end date.
Consequently, the model defines dates that are given in years BP (before present) as basic entities. The semantic integration of the different time reference systems used in the source datasets (e.g. BP, calBP, BC, etc.) will be investigated further and formulated ontologically within the model. At the moment, this is done manually during the formulation of the semantic mapping (see Section 4, Data Integration, and Figure 1). Because dates are represented as simple integer values, the duration d of a time range is a simple subtraction of the end date eD from the start date sD (d = |sD − eD|).
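A minimal sketch of a time range with integer dates in years BP is given below. The class name, the field names and the period bounds in the example are illustrative assumptions, not part of the published model, and the bounds are not meant as a dating statement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRange:
    """Time range in years BP; start_bp is the older (larger) bound."""
    start_bp: int
    end_bp: int

    @property
    def duration(self):
        # d = |sD - eD|
        return abs(self.start_bp - self.end_bp)

    def contains(self, date_bp):
        """True if a single date falls within the range."""
        return self.end_bp <= date_bp <= self.start_bp

    def overlaps(self, other):
        """True if both ranges share at least one year."""
        return (self.start_bp >= other.end_bp
                and other.start_bp >= self.end_bp)


# Illustrative period bounds.
period = TimeRange(start_bp=25000, end_bp=18000)
later = TimeRange(start_bp=17000, end_bp=12000)
```

With this representation, an artefact date is matched to a period with `contains`, and two periods from different sources are compared with `overlaps`.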

INTEGRATED SPATIO-TEMPORAL VISUALIZATION AND ANALYSIS
The heterogeneous palaeoenvironmental and archaeological data is spatio-temporally integrated into a central RDF store and thus facilitates the implementation of integrative analysis and visualization applications.
The CRC806-Database architecture for accessing the integrated data basis is shown in Figure 4. The interfaces and applications access the integrated data through a SPARQL (Prud'hommeaux and Seaborne, 2008) endpoint, which operates as a query engine on top of the central RDF store. The SPARQL endpoint is the only access interface an application has to the integrated database. This ensures that applications remain independent of the federation of additional data sources into the RDF graph, even if additional semantics are added.
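From the application side, access through a SPARQL endpoint amounts to an HTTP request that carries the query text. The sketch below builds such a request with the Python standard library, following the SPARQL protocol convention of a `query` parameter. The endpoint URL is a placeholder, not the address of the CRC806-Database endpoint, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"  # placeholder endpoint URL


def build_request(query, endpoint=ENDPOINT):
    """Build an HTTP GET request for a SPARQL SELECT query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the JSON serialization of the result set.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")
# urllib.request.urlopen(request) would send it to a live endpoint.
```

Since an application only depends on the endpoint URL and the query text, data sources can be added to the RDF store without changes on the application side.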

Interfaces
The integrated data can thus be accessed through several interfaces within the CRC806-Database web portal: i) a data catalog and search interface (web-based SPARQL frontend), ii) a WebGIS, iii) an Exhibit timeline and facet-browsing application, iv) direct access to the SPARQL endpoint and v) OGC web services (WFS, WMS, WCS).
All GIS datasets of the integrated data basis are accessible from the CRC806-Database SDI via OGC web service (OWS) interfaces, implemented using the open source software MapServer (http://www.mapserver.org) and MapProxy (http://www.mapproxy.org) technology. The geospatial datasets are also directly downloadable in Shapefile or GeoTIFF format from the CRC806-Database web portal data directory via HTTP file download.
The CRC806-Database web portal is a Typo3 CMS-based web application hosted on the server infrastructure of the regional computing centre of the University of Cologne (RRZK). The data catalog and search interface for the integrated data is implemented in a custom Typo3 extension, which allows interfaces to be implemented that construct SPARQL requests and display the results from within Typo3. Through this interface it is also possible to export the results of SPARQL queries as RDF/XML serializations and in CSV and Excel format.
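As an illustration of the OGC web service access mentioned above, the following sketch assembles a WMS 1.1.1 GetMap request URL. The service URL and the layer name are hypothetical; the actual layer names would be taken from the GetCapabilities response of the service.

```python
from urllib.parse import urlencode


def getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap URL for one layer in EPSG:4326.

    bbox is (min lon, min lat, max lon, max lat) in degrees.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)


# Placeholder service URL and layer name.
url = getmap_url("http://example.org/ows", "example_layer", (-10, 35, 5, 45))
```

The same pattern applies to desktop GIS clients, which issue equivalent requests once the service URL has been registered.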

Example Queries
The integrated RDF model makes it possible to filter the integrated data basis with queries that were not possible before the integration of the data. Some example queries on the integrated database are:
• Select all datasets located in southern Spain from the LGM time interval.
• In which time intervals and in which spatial areas are artefacts from the Solutrean culture present in the database?
All datasets of the integrated data basis can thus be filtered spatially and temporally in addition to thematic filters. This was not previously possible in a consistent, integrated way.
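The first example query can be sketched as a SPARQL query string assembled in Python. The prefix and all property names (`ex:west`, `ex:startBP`, and so on) are a hypothetical vocabulary used for illustration, not the vocabulary of the CRC806 model, and the bounding box and time bounds in the usage example are illustrative values.

```python
def spatio_temporal_query(west, south, east, north, start_bp, end_bp):
    """Build a SPARQL query for datasets overlapping a box and a time range.

    Dates are in years BP, so start_bp (older) is larger than end_bp.
    """
    return f"""PREFIX ex: <http://example.org/crc806/>
SELECT ?dataset WHERE {{
  ?dataset ex:west ?w ; ex:east ?e ; ex:south ?s ; ex:north ?n ;
           ex:startBP ?start ; ex:endBP ?end .
  FILTER (?w <= {east} && ?e >= {west} && ?s <= {north} && ?n >= {south})
  FILTER (?start >= {end_bp} && ?end <= {start_bp})
}}"""


# Illustrative window and time interval.
query = spatio_temporal_query(-7.5, 36.0, -1.5, 38.5,
                              start_bp=25000, end_bp=18000)
print(query)
```

The two FILTER clauses express a bounding box intersection and a time range overlap, so the same query covers point-based archaeological records and grid-based palaeoenvironmental datasets.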

Visualization
For interactive visualization in a spatial context, a WebGIS is provided (see Figure 5) that accesses the OGC web services of the CRC806-Database SDI. The WebGIS is implemented using the open source JavaScript framework GeoExt (http://www.geoext.org/).
A further interactive interface for temporal (timeline) visualization and structured faceted browsing of the integrated data is provided using the open source Exhibit JavaScript framework (Huynh et al., 2007). Faceted browsing helps in exploring the data along more than one axis (e.g. a search term) by applying multiple filters in a faceted classification system (Huynh et al., 2007).
The integrated data can also be visualized in desktop/client applications, either through the OWS interfaces or through the file-based datasets, which can be accessed from the web portal. Thus, the web-based interfaces of the described system additionally enable many possibilities for custom analysis and visualization.

CONCLUSION AND OUTLOOK
Originally, a top-down approach was considered, which would have integrated the given data into a model developed upfront. This approach was abandoned due to its limited flexibility and its susceptibility to error. This led to the adoption of a bottom-up approach, which builds the model from the semantics of the integrated data sources. This approach has several advantages, such as flexibility and extendibility. The key advantage is that the resulting data model always suits our system, as it adapts organically to the demands of the CRC806-Database system and to the semantics of additionally integrated datasets.
The integrated data can be semantically mapped to existing models that provide a semantic overlap (Allemang and Hendler, 2011) by defining owl:sameAs statements. This enables the definition of mappings to representations in existing models such as CIDOC-CRM (Doerr, 2003), and thus makes data in those models interoperable with the CRC806-Database model. This semantic referencing and linkage will be applied to as many available external models and ontologies as possible, to strengthen the semantic interoperability of the presented archaeological and palaeoenvironmental models.
The main result of this work is the integrated data basis and the data models derived from the integration process. This data basis will help to answer and inspire a wide range of research questions within the CRC806, and possibly within the research community as a whole. Further datasets will be integrated into the CRC806-Database in the future. Users of the CRC806-Database will be able to suggest further datasets for integration.
The interfaces and applications provided by the CRC806-Database so far are just a small fraction of the applications that can be implemented on top of the data basis. Further applications include environmental and archaeological predictive models, such as archaeological site catchment and site prediction analyses, or palaeoclimate classifications. It is already planned to provide OGC Web Processing Service (WPS) interfaces for the implementation of such models and to integrate them for interactive exploration within the CRC806-Database WebGIS application.
The CRC806-Database web portal and SDI (http://crc806db.uni-koeln.de) will be launched in summer 2012. This first version of the web portal will implement the data management aspect, including a secure research data archive and some first web-based interfaces and applications to the integrated data basis.

Figure 2: Generalized graph representation of the archaeological data model.

Figure 3: Generalized graph representation of the palaeoenvironmental data model and also simplified versions of the temporal and spatial models.

Figure 4: Architecture for accessing the integrated data basis (modified from Allemang and Hendler (2011)).

Figure 5: Screenshot of the WebGIS interface, showing a Google Maps aerial base layer overlaid with the mean LGM winter temperatures from PMIP II, the LGM net primary productivity from Stage3 and the NESPOS sites.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume I-2, 2012. XXII ISPRS Congress, 25 August - 01 September 2012, Melbourne, Australia

Table 1: Key numbers describing the integrated archaeological databases. T = temporal extent (oldest and youngest artefact) in kBP (kilo years before present) of the database.

Palaeoenvironmental Data: The data model for the palaeoenvironmental data mainly concerns climate and vegetation reconstructions based on model simulations or on the interpretation of climate archives (e.g. drill cores, soil/sediment profiles). Whereas the palaeomodels provide predictions for larger spatial areas (spatial fields), the climate archives are measurements at a single spatial point.

Table 2: Key numbers describing the integrated palaeoenvironmental databases. n = derived spatial datasets, V = environmental variables, T = temporal extent (oldest to youngest) in kBP, δT = temporal periods, Spatial = spatial extent.