ONLINE 4D RECONSTRUCTION USING MULTI-IMAGES AVAILABLE UNDER OPEN ACCESS

Advances in digital camera technology and the incorporation of cameras into virtually every smart mobile device have led to an explosion in the number of photographs taken every day. Today, the number of images stored online and freely available has reached unprecedented levels. It is estimated that in 2011 there were over 100 billion photographs stored on just one of the major social media sites, and this number is growing exponentially. Moreover, advances in the fields of Photogrammetry and Computer Vision have led to significant breakthroughs such as the Structure from Motion algorithm, which creates 3D models of objects from their two-dimensional photographs. The existence of powerful and affordable computational machinery enables the reconstruction not only of complex structures but also of entire cities. This paper presents an overview of our methodology for producing 3D models of Cultural Heritage structures such as monuments and artefacts from 2D data (pictures, video) available on Internet repositories, social media, Google Maps, Bing, etc. We also present new approaches to the semantic enrichment of the end results and their subsequent export to Europeana, the European digital library, for integrated, interactive 3D visualisation within regular web browsers using WebGL and X3D. Our main goal is to enable historians, architects, archaeologists, urban planners and affiliated professionals to reconstruct views of historical structures from the millions of images floating around the web and to interact with them.


INTRODUCTION
The scope of economic activities of the Cultural and Creative industries focuses on the generation or exploitation of knowledge and information (Hesmondhalgh, 2007). According to (Florida, 2002), "human creativity is the ultimate economic resource," and "the industries of the twenty-first century will depend increasingly on the generation of knowledge through creativity and innovation". In Europe the rapid roll-out of new technologies and increased globalisation has meant a striking shift away from traditional manufacturing towards services and innovation. In Europe factory floors are progressively being replaced by creative communities whose raw material is their ability to imagine, create and innovate. Therefore, a major factor in maintaining European competitiveness is to unlock the potential of cultural and creative industries.
Over the last few decades, society has amassed an enormous amount of digital information about the Earth and its inhabitants. However, these archives pale in comparison to the flood of data which is engulfing us. A new wave of technological innovation is allowing us to capture, store, process and display an unprecedented amount of geo-referenced information about our planet and a wide variety of environmental and cultural phenomena. It is today possible to explore, at home, a virtual representation of our planet, flying through landscapes draped with aerial or satellite images, contemplating mountains, valleys and other natural sites. Representations of Cultural Heritage (CH) landmarks and entire cities are now available (Himmelstein et al., 2011).
A crucial driving force for the development and economic growth of the creative industries is ICT. Using innovative IT solutions in growing areas of the creative sector (such as advertising, digital media, gaming, including serious gaming, entertainment, interactive design, and cultural and educational services) opens up manifold competitive advantages for research, development and business. To this end, a lot of work has been done in recent years in the areas of (i) digitisation, especially for tangible cultural (creative) objects, (ii) augmented reality and cultural experience tools, (iii) information retrieval and use and (iv) virtual heritage, digital libraries and Web archiving.
The rapid development of 3D visualisation services and their interlinking with other on-line services like Google Search, Google Earth / Maps and Bing has pushed digital libraries such as Europeana and UNESCO's Memory of the World to expand their content to include 3D museum objects, archaeological sites and monuments. In such a context, the digitisation, preservation and online availability of digitised cultural content has always been a top-priority research topic.
Today, Photogrammetry's research outcomes in combination with an enormous increase in the availability of computational resources at an affordable price have produced mature applications in the fields of terrain modelling, reconstruction of archaeological monuments or entire cities, the building of virtual worlds in computer games, etc. Moreover, passive scanning methods based on aerial or satellite photography have also been used for topographic modelling of natural open areas with great success. For example, an automatic large-scale stereo reconstruction of urban areas from remote sensing imagery (overlapping aerial images) is described in (Kuschk, 2013). Also, active scanning techniques, such as laser and acoustic methods, take advantage of newer technology advances, accurate geo-referencing and powerful computer machinery. They can produce very dense and accurate 3D point clouds (Zlatanova, 2008), especially useful for the automatic extraction of height/depth information.
In contrast, Structure from Motion (SfM) algorithms (Hartley and Zisserman, 2004) use data from unstructured sources, e.g. photographs taken by off-the-shelf cameras at different resolutions and time periods, by different people, and from various angles and positions, in order to recover the structure of a scene as a collection of 3D points together with the camera positions. The problem becomes more complicated and challenging when the source images or videos are harvested from the web, as limited or no assumptions can be made about camera characteristics, illumination, resolution, geo-location, etc. However, this extra complication is well worth it given the wealth of information available online and for free over the web. Systems for urban reconstruction from video sources have been proposed (Cornelis et al., 2006, Pollefeys et al., 2008). The latter achieves 3D reconstruction in real time. However, the main advantage of using video as opposed to unordered images is the known camera location relationship between successive video frames. Much more complex is the reconstruction from unordered and often irrelevant images. Such a system requires computationally intensive pre-processing and filtering of millions of images using algorithms for landmark identification, for example (Lowe, 2004, Li et al., 2008). These algorithms mainly operate on a pair-wise basis which, although time-consuming, can be massively parallelised, for example (Agarwal et al., 2011). However, the accuracy and the completeness of the point clouds produced by this approach depend highly on the selected images as well as on the stereo models used in the multi-view stereo algorithms (Wenzel et al., 2013).
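As an illustration of the pair-wise matching step mentioned above, the following minimal sketch matches descriptors between two images using the nearest-neighbour ratio test described in (Lowe, 2004). It is not the pipeline of any of the cited systems: the function name `match_descriptors`, the threshold of 0.8 and the toy 3-dimensional descriptors are assumptions made only for this example (real SIFT descriptors have 128 dimensions).

```python
import math

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_a[i] matches desc_b[j] under the ratio test."""
    matches = []
    for i, da in enumerate(desc_a):
        # Distances from this descriptor to every descriptor of the other image.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        # Accept only if the best match is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches

# Toy descriptors (hypothetical values, for illustration only).
image_a = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
image_b = [(0.9, 0.1, 0.0), (0.0, 0.1, 1.0), (0.4, 0.6, 0.5), (0.6, 0.4, 0.5)]
matches = match_descriptors(image_a, image_b)
```

The third descriptor of `image_a` is equally close to two candidates in `image_b`, so the ratio test rejects it as ambiguous. Since every image pair can be processed independently, and n images give n(n-1)/2 pairs, this step lends itself to the parallelisation mentioned above.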
One of the first systems to apply SfM algorithms to Internet photo collections was (Snavely et al., 2006). This system was limited to processing only a few thousand images. A more successful system (Agarwal et al., 2011) can perform dense modelling from Internet photo collections consisting of millions of images. The speedups obtained are due to some refinements, e.g. filtering by early 2D appearance-based constraints, but mainly to the use of graphics processors and the parallelisation of the computation on multi-core computer architectures. (Fritsch et al., 2012) describes a method for accurate surface reconstruction using a compact hand-held camera rig consisting of a number of industrial cameras mounted on a frame in order to acquire images from multiple views at once. The images are processed with an automated software pipeline. This method is particularly suited for close-range CH applications; each shot can produce up to 3.5 million 3D points with a maximum precision of 0.2 mm. They report a case where 2 billion 3D points were acquired efficiently with sub-mm accuracy.
In the last two decades, the use of approaches based on 3D geometry has seen rapid progress in many different areas, from digital factories of the future to car, flight and surgical training simulators, 3D maps, 3D TV/Cinema and games, but also in CH applications. A recent assessment of 3D reconstruction techniques specifically applied to the field of CH is presented in (Manferdini and Galassi, 2013) as a case study focused on Piazza degli Ariani in Ravenna, Italy. (Fassi et al., 2013) compares laser scanning and photogrammetry techniques when applied to a Roman thermal complex in Naples (500m ). It reports that Photogrammetry offers a fully automatic, low-cost, fast and accurate method, but it requires huge computational resources (and time) when the required accuracy is high. In such cases laser scanning remains the best alternative, providing full-resolution scans in real time.
In general, linking 3D data (from museums' objects, artefacts, archaeological sites or monuments) with other related 3D objects or with multimedia data poses different challenges but also provides new, exciting and innovative possibilities compared to more established data forms like texts, images or sound. The digitisation and 3D content creation process in the field of CH is also confronted with another challenge. Each digitised object (geometry) needs logical (semantic) information. Without this logical information, scanned 3D objects like monuments are nothing more than a structure of geometrical primitives (points or triangles). The importance of 4D urban environment modelling can be further emphasised by examining its two main components: (i) 3D content and (ii) time-varying content (4D world: X + Y + Z + Time = 4D). 3D content is of primary importance for our lives, as we live in a 3D world and perceive most of the events occurring in it through depth information. The importance of 3D content can be highlighted by examining the work of (Fodor, 1983) on the modularity of mind. In this research, we address the aforementioned difficulties by proposing a complete approach able to search, process, model, reconstruct in 3D, define metadata and build new production services for the re-use of CH content. In this respect, the rapidly evolving fields of computer graphics, vision and learning, 3D modelling and virtual reality (VR) seem to hold strong potential for providing effective solutions towards accurate and veridical spatial-temporal urban environment modelling. Going beyond the state of the art in scene understanding and 3D modelling will enable fundamentally new methodologies to view and experience historical and/or temporally varying imagery. In this way, we construct time-varying 3D models that capture how the appearance and structure of a place, and the historical events around it, have evolved.
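The notion of 4D content as 3D content plus time can be made concrete with a small data structure. The sketch below is only an illustration under our own assumptions: the class name `Model4D`, its methods and the use of a year as the time index are hypothetical and are not part of the system described in this paper.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Model4D:
    """Time-indexed sequence of 3D point sets (X, Y, Z) for one structure."""
    years: list = field(default_factory=list)      # sorted snapshot years
    snapshots: list = field(default_factory=list)  # one point list per year

    def add_snapshot(self, year, points):
        # Insert while keeping the snapshots ordered by time.
        idx = bisect_right(self.years, year)
        self.years.insert(idx, year)
        self.snapshots.insert(idx, points)

    def state_at(self, year):
        """Return the most recent snapshot taken at or before `year`."""
        idx = bisect_right(self.years, year)
        if idx == 0:
            return None  # no reconstruction exists that early
        return self.snapshots[idx - 1]

# Hypothetical snapshots of the same structure at two points in time.
model = Model4D()
model.add_snapshot(2010, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 2.0)])
model.add_snapshot(1950, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

Querying `model.state_at(1980)` returns the 1950 geometry, which is the kind of lookup a viewer needs in order to show how a structure looked at a chosen moment.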

PREVIOUS WORK
For 3D scene reconstruction and flow estimation our approach is based on computer vision and Photogrammetry techniques that require optical flow as well as disparity estimation. Both can be formulated as the problem of finding corresponding points in different images. However, correlation/matching is a very challenging task, because a scene's objects generally have different shape and appearance when seen from different points of view and at different times. The techniques used to solve correlations are similar and can be categorised as energy-based, feature-based and area-based methods. Energy-based methods (Alvarez, 2002, Papenberg et al., 2006) minimise a cost function plus a regularisation term, as in the framework of (Horn and Schunck, 1981), to solve for the 2D displacements. Feature-based methods match features in different images. Selected features should have good discriminative properties and there should be a high probability that the same point is selected in different images (Tuytelaars and Mikolajczyk, 2008). Methods of this kind use the concept of coarse-to-fine image warping; however, the downsampling involved removes information that may be vital for establishing the correct matches. (Shrivastava et al., 2011) proposes an algorithm capable of handling cross-domain image matching, based on the Histogram of Oriented Gradients (HoG) descriptor and the notion of "data-driven uniqueness", by which each image decides what is the best way to weigh its constituent parts. The SIFT flow algorithm presented in (Liu et al., 2008) matches densely-sampled SIFT features between different images, while preserving spatial discontinuities. In (Pons et al., 2007) an approach is presented that overcomes limitations arising from common assumptions of feature-based methods. This approach computes a global image-based matching score between different images, making the matching process capable of handling projective distortions and partial occlusions.
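For reference, the energy-based formulation in the framework of (Horn and Schunck, 1981) is usually written as the functional below, and disparity is related to depth by the standard rectified-stereo relation. The notation is ours and is introduced only for illustration: $I_x, I_y, I_t$ are the partial derivatives of image brightness, $(u, v)$ is the 2D displacement field, $\alpha$ weights the regularisation term, $f$ is the focal length, $B$ the baseline between the two cameras, $d$ the disparity and $Z$ the depth.

```latex
% Data term (brightness constancy) plus smoothness regularisation
E(u, v) = \iint \Big[ \left( I_x u + I_y v + I_t \right)^2
          + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \Big] \, dx \, dy

% Depth from disparity for a rectified stereo pair
Z = \frac{f \, B}{d}
```

The first term penalises violations of brightness constancy between the two images, while the second favours smooth displacement fields; the second relation shows why small disparity errors translate into large depth errors for distant points.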

VISUAL SEARCH ENGINE
In Figure 1, we present a schematic diagram of our search engine (Eagle4D), used for the retrieval of visual content appropriate for CH objects. The engine is applied on the web and thus searches for objects in visual content "in the wild". The results are then improved by exploiting metadata from Europeana and/or Memory of the World.
In this section, we propose a recursive methodology for estimating corresponding points from a set of unstructured and un-calibrated images. In particular, the method exploits local descriptors to find the most prominent points in an image, which are then used as the vertices of a graph. In this way, we construct a graph whose vertices coincide with the feature points, while its edges encode the distance between two selected features. In our approach, geometrically invariant features have been selected, in particular the SIFT transform. Then, graph partitioning methods are incorporated to estimate those sub-graphs that best match the respective sub-graphs obtained from images of a different viewpoint. The sub-graphs are correlated with each other using graph matching methods. To reduce the computational cost, we introduce recursive approaches for the implementation of the graph partitioning. In particular, we adopt incremental spectral clustering methods, in which only a small subset of the constructed graph is updated, together with a recursive method for estimating the generalized eigenvalues used in the implementation of the spectral clustering. Using the matching algorithm, we are able to construct 3D models in a computationally efficient way, resulting in a fast but effective content-based retrieval methodology.
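The graph construction and partitioning idea can be sketched as follows. This is a simplified, batch illustration and not the recursive method proposed here: it uses the unnormalised graph Laplacian instead of the generalized eigenvalue problem, recomputes everything from scratch rather than updating incrementally, and turns distances into edge weights with a Gaussian kernel. The function names, the parameter `sigma` and the toy keypoint coordinates are all assumptions made for this example.

```python
import math

def affinity_matrix(points, sigma=1.0):
    """Edge weights decay with the Euclidean distance between feature points."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            w[i][j] = w[j][i] = math.exp(-d * d / (2.0 * sigma * sigma))
    return w

def fiedler_vector(w, iterations=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue."""
    n = len(w)
    degree = [sum(row) for row in w]
    shift = 2.0 * max(degree)  # upper bound on the largest Laplacian eigenvalue
    # Start from a vector that is orthogonal to the constant vector.
    v = [float(i) - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        # Power iteration on (shift * I - L), where L = D - W.
        lv = [degree[i] * v[i] - sum(w[i][j] * v[j] for j in range(n))
              for i in range(n)]
        v = [shift * v[i] - lv[i] for i in range(n)]
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the trivial constant component
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def bipartition(points, sigma=1.0):
    """Split the feature graph in two according to the sign of the Fiedler vector."""
    v = fiedler_vector(affinity_matrix(points, sigma))
    return [0 if x < 0 else 1 for x in v]

# Toy keypoint locations forming two well-separated groups (hypothetical values).
keypoints = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = bipartition(keypoints)
```

The two groups of keypoints receive different labels, giving two sub-graphs that could then be compared with sub-graphs from another image. The full recomputation at every call is exactly the cost that an incremental scheme is meant to avoid.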
Despite the efficiency of the aforementioned methodology, possible errors in the 3D retrieval process generate erroneous virtual 3D reconstructions. To address this difficulty, we enhance the results of the 3D matching using metadata information. In particular, we exploit textual information from the set of multi-view but unstructured and un-calibrated images in order to allow an initial pre-processing step. Additionally, using the metadata information, we are able to associate the extracted visual descriptors (e.g., the representation of the SIFT features as a graph) with a set of metadata, providing a framework for better categorisation and filtering of the images and thus improving the 3D reconstruction and reducing possible errors.
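A minimal sketch of such a textual pre-filter is given below, assuming that each harvested image comes with a title and a list of tags. The record layout, the function name `filter_by_metadata` and the sample entries are hypothetical and serve only to illustrate the idea of discarding irrelevant images before the costly matching stage.

```python
def filter_by_metadata(images, keywords, min_hits=1):
    """Keep images whose title or tags mention at least `min_hits` keywords."""
    wanted = {k.lower() for k in keywords}
    kept = []
    for image in images:
        # Merge all textual metadata of the image into one lowercase string.
        text = " ".join([image.get("title", "")] + image.get("tags", [])).lower()
        hits = sum(1 for k in wanted if k in text)
        if hits >= min_hits:
            kept.append(image)
    return kept

# Placeholder records standing in for images harvested from the web.
candidates = [
    {"url": "img_001.jpg", "title": "Saalburg fort main entrance", "tags": ["roman", "fort"]},
    {"url": "img_002.jpg", "title": "Birthday party", "tags": ["family"]},
    {"url": "img_003.jpg", "title": "Holiday", "tags": ["Saalburg", "gate"]},
]
selected = filter_by_metadata(candidates, ["saalburg"])
```

Only the first and third records survive the filter, so the visual matching has to consider fewer candidate pairs.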

THE VISUAL AUTHORING INTERFACE
The main purpose of this research is to provide the technological framework for enriching the content with additional overlays and virtual objects using smart and cost-effective augmented reality methods. The research exploits the outcome of the content-based retrieval module (see section 3). In this way, content creators are able to retrieve content of interest in a cost-effective manner (for example, by means of a sketch or drawing). The authoring environment then applies algorithms that transform and scale the retrieved 3D or 4D content so that it is synchronised with the current scene.
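The transform-and-scale step can be illustrated with a similarity transform applied to the retrieved 3D points. The sketch below is a simplification under our own assumptions: the rotation is restricted to the vertical axis, and the function name and parameter values are hypothetical rather than taken from the authoring tool.

```python
import math

def similarity_transform(points, scale, yaw_degrees, translation):
    """Scale, rotate about the vertical (z) axis, then translate 3D points."""
    angle = math.radians(yaw_degrees)
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    placed = []
    for x, y, z in points:
        xs, ys, zs = scale * x, scale * y, scale * z
        placed.append((c * xs - s * ys + tx, s * xs + c * ys + ty, zs + tz))
    return placed

# Hypothetical retrieved model and placement parameters.
model_points = [(1.0, 0.0, 0.0), (0.0, 1.0, 2.0)]
placed_points = similarity_transform(
    model_points, scale=2.0, yaw_degrees=90.0, translation=(10.0, 0.0, 0.0)
)
```

With these values the first point ends up close to (10, 2, 0) and the second close to (8, 0, 4), up to floating-point rounding. In practice the scale, rotation and translation would have to be estimated from correspondences between the retrieved content and the current scene.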
This paper investigates efficient techniques so that changes in the viewer's position are properly reflected in the rendered graphics. Tracking methods are usually based on a two-stage process: a learning or feature extraction stage, and a tracking stage. During the learning stage, some key features that are relatively invariant to the environment are extracted, so this stage is mostly a bottom-up process. It also enables the tracking system to be re-initialised through automatic recognition when tracking fails. Approaches based on particle filters and Kalman models (Djuri et al., 2003) fail, on the one hand, to provide accurate results and, on the other, to yield a computationally efficient and cost-effective solution. For this reason, in the context of this paper, we incorporate semi-supervised learning algorithms into tracking. In particular, we allow minimal user interaction (obtained through the intuitive interactive interface) and then implement a classification-based tracking strategy to achieve efficient object tracking and synchronisation.
Personalised Viewing: Since we support a collaborative, interactive interface in which multiple users with different profiles (e.g., different professions) take part in the authoring process, we need to develop tools for personalised viewing; this means that a part of a scene is presented to the end users according to their particular interests. In this way, we increase the level of interaction of the content creators, since the system automatically focuses on the parts of the scene that are of interest to the user.

Perceptual and Real-time Rendering:
The rendering process is computationally very expensive. Perceptual rendering makes it cost-effective, since rendering is performed selectively according to the content creator's current interest. Perceptually-based selective rendering algorithms are therefore adopted in order to optimise the 3D world and achieve fast real-time interaction. In order to economise on rendering computation, selective rendering directs a high level of detail to specific regions of a synthetic scene and lower quality to the remaining scene, without compromising the information transmitted. Such decisions go beyond current task-based approaches and will be guided by context-based information.
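Selective rendering can be illustrated by assigning a level of detail to each region of the scene. In the sketch below the decision depends only on the distance of a region from a focus point representing the user's current interest; this criterion, the thresholds in `radii`, the function name and the region names are assumptions for illustration and stand in for the context-based decisions described above.

```python
import math

def level_of_detail(region_centre, focus, radii=(5.0, 15.0)):
    """Return 2 (high), 1 (medium) or 0 (low) detail for a scene region."""
    d = math.dist(region_centre, focus)
    if d <= radii[0]:
        return 2
    if d <= radii[1]:
        return 1
    return 0

# Focus point: where the content creator's interest currently lies (hypothetical).
focus_point = (0.0, 0.0, 0.0)
regions = {
    "portal": (1.0, 2.0, 0.0),
    "tower": (10.0, 0.0, 3.0),
    "background": (40.0, 5.0, 0.0),
}
lods = {name: level_of_detail(centre, focus_point) for name, centre in regions.items()}
```

Here the region nearest the focus is rendered at the highest level of detail and the distant background at the lowest, so rendering effort is spent where the user is looking.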

SEMANTIC ENRICHMENT OF CULTURAL HERITAGE ARTEFACTS
A 3D digitised CH artefact is typically composed of geometry and texture data, yet without being semantically enriched with provenance data and its relation to further artefacts as well as with multimedia documentation, it is of very limited practical use.
Therefore the digital representation of an artefact should consist of a geometric structure, accompanied by an annotation to associate semantics and context with its geometry or parts of its 3D shape. The idea of representing semantically structured information and knowledge as well as creating links between the data has increasingly gained popularity, driven by Semantic Web technologies. Within the current research on annotations, most examples of structured information include semantic models for describing the intrinsic structure of the 3D shape (De Floriani et al., 2008, Papaleo and Floriani, 2009, Attene et al., 2007, Robbiano et al., 2007). (Doerr and Theodoridou, 2011) proposed a model for describing the provenance (life-cycle) of digital 3D shapes in the CH domain (Rodriguez-Echavarria et al., 2009, Havemann et al., 2009, Serna et al., 2011). The Integrated Viewer Browser (IVB) is an interactive semantic enrichment tool for 3D CH collections developed within the 3DCOFORM project. It is fully based on the CIDOC-CRM (ISO 21127:2006) schema, supporting its sophisticated annotation model, and represents a first approach to a 3D-centred and distributed annotation tool that takes into account layout designs, interaction metaphors and workflows in real professional environments (see Figure 2).
Artefacts handled by the Integrated Viewer Browser (IVB) can already be accessed through Europeana, the European Digital Library Portal (see figure 3).
The IVB enables the fusion and annotation of a variety of multimedia context information from different sources belonging to a digitised 3D artefact, the consolidation of all pieces of information and the export of the datasets to Europeana using its ESE (Europeana Semantic Elements) metadata format, a Dublin Core-based set of fields with 12 additional specific Europeana elements to display records appropriately. This semantic enrichment tool will be the focus of future research for further development in order to include the 4th dimension (time) as well as further details related to the origin of the data and the historical value of the object (depending on the application for the re-use of the final 3D results).
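To give an idea of what a Dublin Core-based record with additional Europeana elements might look like, the sketch below serialises a few fields to XML using the Python standard library. It is not the IVB export code. The Dublin Core namespace is the standard one; the Europeana namespace URI, the `record` wrapper element, the chosen fields and all values are assumptions or placeholders that should be checked against the official ESE specification.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ESE = "http://www.europeana.eu/schemas/ese/"  # assumed namespace URI
ET.register_namespace("dc", DC)
ET.register_namespace("europeana", ESE)

def build_record(fields):
    """Serialise {(namespace, name): value} pairs into one <record> element."""
    record = ET.Element("record")
    for (namespace, name), value in fields.items():
        ET.SubElement(record, f"{{{namespace}}}{name}").text = value
    return record

# Placeholder values; a real export would take them from the annotation tool.
record = build_record({
    (DC, "title"): "Example bust (placeholder)",
    (DC, "description"): "3D model with annotations (placeholder)",
    (ESE, "provider"): "Example provider (placeholder)",
    (ESE, "type"): "3D",
})
xml_text = ET.tostring(record, encoding="unicode")
```

The resulting string contains `dc:`-prefixed Dublin Core fields next to `europeana:`-prefixed elements, which mirrors the structure described above of a Dublin Core core extended with Europeana-specific fields.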

CONCLUSION
In this paper we describe a new approach which enables us to reconstruct CH content in 4D, i.e., 3D spatial reconstruction plus time. The paper adopts new schemes for 4D visual content reconstruction in a scalable way, visual search in the wild, and tools for visualisation and augmented reality. We also describe a visual authoring interface and algorithms for the interoperable description of the content and the exploitation of proper metadata adhering to the EUROPEANA framework.

Acknowledgements
The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7-PEOPLE 2007-2013 under REA grant agreement IAPP2012 no 324523.

Figure 1: The search engine approach adopted in this paper

Figure 2: General Graphical User Interface of our semantic enrichment tool, divided into 5 different sections: relation annotation between the Maennerkopf bust and the Saalburg fort, represented by an image of its main entrance