“WHAT IS OUV” REVISITED: A COMPUTATIONAL INTERPRETATION ON THE STATEMENTS OF OUTSTANDING UNIVERSAL VALUE

The Statement of Outstanding Universal Value (OUV) has, since 2007, provided the core justification for nominating and inscribing cultural and natural heritage properties on the UNESCO World Heritage List. Ten criteria are specified and measured independently for the selection process. The 2008 ICOMOS Report “What is OUV” is a successful example of interpreting OUV as an integral concept by inspecting the associations of the selection criteria in all inscribed properties. This paper presents a novel methodology for interpreting OUV using computational techniques from Natural Language Processing, Machine Learning, and Graph Visualization. First, frequent phrases appearing in Statements of OUV are used to construct a lexicon for each selection criterion. Second, three similarity matrices are constructed as graphs to represent the pair-wise associations of the criteria. Finally, the lexicon and graphs are visualized in 2D. The study shows that the lexicon derived from computational techniques can capture the essential concepts of OUV, and that the selection criteria are consistently associated with each other under different similarity metrics. This study provides a quantitative and qualitative interpretation of the Statements of OUV and the associations of selection criteria, which can be seen as an elaborated computational extension of the 2008 Report and may be useful for the future inscription and evaluation of World Heritage nominations.


INTRODUCTION
The World Heritage Convention has sought to preserve the "parts of the cultural and natural heritage ... of outstanding interest ... [for] mankind as a whole" since its adoption in 1972 (UNESCO, 1972). A total of 1121 World Heritage (WH) properties had been inscribed on the World Heritage List by 2019. After the adoption of the Operational Guidelines in 2005, the justification of Outstanding Universal Value (OUV) became an administrative requirement for inscribing any new WH nomination, instead of the independent qualification it had been since 1977 (UNESCO, 2008, Jokilehto, 2008). Ten selection criteria exist as the core of OUV, among which criteria (i)-(vi) generally refer to cultural values, and (vii)-(x) to natural ones. At least one of the ten criteria must be fulfilled by any nomination to prove its "exceptional [significance] as to transcend national boundaries and to be of common importance for present and future generations of all humanity" (UNESCO, 1972, Jokilehto, 2008). Since 2007, new nominations have been required to include a complete Statement of OUV containing a brief synthesis, a justification for criteria, a statement of integrity and/or authenticity, and requirements for protection and management. The "justification for criteria" section explains why a property fulfills each criterion under which it has been inscribed, giving a concise paragraph for each criterion. Retrospective Statements of OUV were also prepared during the Second Cycle of Periodic Reporting (2008-2015) by 812 properties¹ inscribed before 2006, to revise or complete the justification for criteria section if it was incomplete or not agreed on at the time of inscription (IUCN et al., 2010).
¹ This number is calculated based on the data provided in the reports of each region, available at http://whc.unesco.org/en/pr-questionnaire/
Investigating OUV and comparing it to the selection criteria and justifications applied to the listed WH properties is not uncommon.
Most research, however, focuses on a single case or a few cases for comparative study, and thus concerns only a small number of Statements of OUV (Shah, 2015, Ruffino et al., 2019, Abdel Tawab, 2019, Tarrafa Silva and Pereira Roders, 2010). By contrast, the 2007 International Conference on Values and Criteria in Heritage Conservation explicitly organized sessions to explore the definition and evolution of OUV as an integral concept, discussing the terms used in the WH justifications current at the time and proposing possible enhancements to clarify the concepts (Fejérdy, 2007, Petzet, 2007, Jokilehto, 2007). The discussions at this conference resulted in the well-known ICOMOS report "What is OUV, Defining the Outstanding Universal Value of Cultural World Heritage Properties", published in 2008. The report described the evolution of OUV since it was first proposed, summarized the essential focuses of each cultural selection criterion, and matched the criteria to the main themes in existing WH properties (Jokilehto, 2008). In that report, the concepts of OUV are illustrated from both a deductive perspective, by interpreting the definitions in the Operational Guidelines, and an inductive perspective, by giving examples from the justification texts of WH properties. Keywords in the justifications are highlighted to indicate why a piece of text reflects the selection criterion it describes. Furthermore, the report suggests that the criteria are strongly associated with each other, since the "historical value is an integral part of the majority of... criteria (i)-(vii)", and "the aesthetic/artistic value also plays a role in several OUV criteria". Such associations are further investigated in the report by looking at how often a specific criterion is used together with the others.
This approach to interpreting OUV and the selection criteria is rather effective and contributes to a better understanding of the concepts. However, such keyword highlighting depends heavily on expert knowledge, which may not be easily applicable or intelligible for the general public and is inevitably prone to personal and disciplinary biases. A recent study took all the available Statements of OUV in the World Heritage List (concerning 1049 properties that have a complete section of justification for criteria) as input data and trained several state-of-the-art Natural Language Processing (NLP) models on an OUV classification task (Bai et al., 2021). That study reported a top-3 accuracy of 94% in predicting the correct selection criterion from the short piece of text justifying it. The authors also provided an open-source repository with all their trained models and results concerning the models' performances². This previous study provides a chance to revisit the 2008 ICOMOS report from a computational perspective and to re-interpret the focuses, definitions, and associations of the selection criteria that define the OUV of WH properties.
This paper presents a computational analysis of the selection criteria justified in Statements of OUV in terms of their semantic meanings and intrinsic associations. The contributions can be summarized as: 1) providing an OUV-related lexicon that can be used to highlight keywords in a generic text on relevant selection criteria; 2) proposing three types of matrix-based similarity metrics from different sources to represent the pair-wise associations of criteria; 3) conducting qualitative and quantitative analyses on the lexicon and the similarity metrics, which may offer insights for defining OUV more clearly in future practice.

Input Materials and Problem Statement
The following variables $A$, $M$, $C^{(i,s)}$, and $W^{(i)}_k$ are derived from the open-source repository of the study mentioned in Section 1 and are used as the input material for this study.
Considering all the properties inscribed in the WH List, a co-occurrence matrix of the selection criteria was constructed as $A = [A_{k,l}]_{\kappa\times\kappa}$, $k, l \in [0, \kappa)$, $\kappa = 10$, where the off-diagonal entries $A_{k,l}$, $k \neq l$, are the number of properties that satisfy both criteria $k$ and $l$, and the diagonal entries $A_{k,k}$ record the number of cases in which criterion $k$ is used alone (see Figure 1a).
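The construction of $A$ described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the function name `cooccurrence_matrix` and the toy input are hypothetical, and the actual matrix in this study is taken from the open-source repository rather than recomputed.

```python
from itertools import combinations

KAPPA = 10  # number of selection criteria, indexed 0..9 for (i)..(x)


def cooccurrence_matrix(properties, kappa=KAPPA):
    """Off-diagonal A[k][l]: properties satisfying both k and l.
    Diagonal A[k][k]: properties inscribed under criterion k alone."""
    A = [[0] * kappa for _ in range(kappa)]
    for criteria in properties:
        crit = sorted(set(criteria))
        if len(crit) == 1:
            A[crit[0]][crit[0]] += 1
        for k, l in combinations(crit, 2):
            A[k][l] += 1
            A[l][k] += 1  # keep A symmetric
    return A


# Hypothetical toy input: one set of criterion indices per property.
toy_properties = [{0, 1, 3}, {1, 3}, {2}, {8, 9}, {2}]
A = cooccurrence_matrix(toy_properties)
```

In this toy input, criteria (ii) and (iv) (indices 1 and 3) are co-justified twice, and criterion (iii) (index 2) is used alone twice.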
Five state-of-the-art NLP models $M = \{m_i \mid i \in [0, 5)\}$ were trained and tested on classifying selection criteria from sentences: N-Gram (Cavnar and Trenkle, 1994), Bag-of-Embeddings (Pennington et al., 2014), Attention with GRU (Yang et al., 2016), BERT (Devlin et al., 2019), and ULMFiT (Howard and Ruder, 2018), respectively. The latter two were shown to perform better in terms of classification accuracy. For each model $m_i$, three confusion matrices $C^{(i,s)} = [C^{(i,s)}_{k,l}]_{\kappa\times\kappa}$, $k, l \in [0, \kappa)$, $s \in \{\text{train}, \text{val}, \text{test}\}$ were provided, where the entry $C^{(i,s)}_{k,l}$ is the total number of data samples with a true label of criterion $k$ classified as criterion $l$ by model $m_i$ in set $s$ (training, validation, or test). An example, the confusion matrix $C^{(4,\text{test})}$ of the performance of $m_4$ (ULMFiT) on the test dataset, is shown in Figure 1b.
A total of 2353 phrases composed of 1- to 5-gram features (phrases of 1 to 5 consecutive words) that appeared more than 15 times and fewer than 600 times in the Statements of OUV were fed to each model, which predicted a score for each phrase belonging to each class $k \in [0, \kappa + 1)$, where the 11th class is an additional negative class, "Others", related to none of the criteria. A series of ordered sets $W^{(i)}_k = \{(\text{phrase } w, \text{rank } r)\}$, $|W^{(i)}_k| = 50$, $r \in [1, 50]$, was obtained, containing the ranked top-50 keywords for criterion $k$ predicted by model $m_i$. The initial vocabulary $V^{(0)}$ is composed of all the phrases in these sets, and a three-dimensional array $\Upsilon$ can be constructed whose entry for the $j$-th phrase $w_j$ in $V^{(0)}$ records its rank $r$ in criterion $k$ as predicted by model $m_i$ (equation 1). The above-mentioned variables and the processed $V^{(0)}$ and $\Upsilon$ are used to construct the lexicon and the similarity graphs.
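As a rough illustration of how the vocabulary and the rank array could be assembled from the ranked keyword sets, consider the sketch below. The function name `build_rank_array`, the nested-list layout of `W`, and the toy phrases are hypothetical, and encoding an unranked phrase as rank 0 is an assumption made here for convenience.

```python
def build_rank_array(W, n_models, n_classes):
    """W[i][k] lists the phrases for model i and class k, best rank first.

    Returns the sorted vocabulary and upsilon[i][j][k], the 1-based rank of
    phrase j for class k under model i (0 if unranked: an assumption).
    """
    vocab = sorted(
        {w for i in range(n_models) for k in range(n_classes) for w in W[i][k]}
    )
    index = {w: j for j, w in enumerate(vocab)}
    upsilon = [[[0] * n_classes for _ in vocab] for _ in range(n_models)]
    for i in range(n_models):
        for k in range(n_classes):
            for r, w in enumerate(W[i][k], start=1):
                upsilon[i][index[w]][k] = r
    return vocab, upsilon


# Hypothetical toy input: 2 models, 2 classes, very short ranked lists.
toy_W = [
    [["masterpiece", "genius"], ["species"]],
    [["genius", "masterpiece"], ["species", "habitat"]],
]
vocab, upsilon = build_rank_array(toy_W, n_models=2, n_classes=2)
```

With the real inputs, the array would have one slice per model (5), one row per phrase in the vocabulary, and one column per class (11 including "Others").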

Keywords Lexicon
A lexicon, literally defined as "all the words and phrases used in a particular language or subject"³, was originally a linguistic concept, which requires some "morpholexical rules" to specify whether words should be members of certain classes (Lieber, 1980). However, in modern NLP literature, the term "lexicon" frequently refers to a list of words that "carry particularly strong cues" of certain word senses, usually sentiment (Jurafsky and Martin, 2020, Faruqui et al., 2015). One of the most widely used lexicons is SentiWordNet, where each word is given scores for its tendency to be positive, negative, and objective (Esuli and Sebastiani, 2006). Such lexicons can be constructed by manual annotation, semi-supervised induction, and/or supervised learning. The initial vocabulary $V^{(0)}$ has the following problems, which must be addressed through revision and filtering before it can be considered a lexicon: 1) some terms only appear in a limited number of models (especially in the worse-performing models such as the $m_1$ N-Gram model), which may be caused by the randomness of the models (e.g., "foot" was predicted with a high rank by $m_1$); 2) some terms always have lower confidence scores (lower ranks) in all models, which may suggest that they are not strongly relevant to the topic; 3) some terms are redundant, since longer N-Gram features may be accompanied by their subsets, for example "directly and tangibly associated" appears together with "directly and tangibly", "and tangibly associated", etc.; 4) stop-words such as prepositions and articles differentiate the word senses in their contexts (Devlin et al., 2019), but may not introduce additional semantic meaning when considered as keywords (e.g., "art of", "art in", and "art and" are all about the concept "art").
To improve these aspects, keywords are aggregated by taking advantage of the ensemble of models. Since the performance of a model may indicate the general reliability of its predicted keywords, a model-related weight vector $\omega = [\omega_i]_{5\times 1} = [1, 1, 1, \lambda_0, \lambda_0]^T$, $\lambda_0 \geq 1$, is arbitrarily formed to give the predictions of the latter two models a higher weight. Similarly, keywords predicted with higher confidence scores (higher ranks) may be more related to the topic. Therefore, a rank-related weight vector $\zeta = [\zeta_r]_{51\times 1} = [0, \lambda_1^2, \ldots, \lambda_1^2, \lambda_1, \ldots, \lambda_1, 1, \ldots, 1]^T$, $\lambda_1 \geq 1$, is also arbitrarily constructed to give higher-ranked keywords more importance: the top 10 are amplified by the scalar $\lambda_1^2$, the 11th to 25th ranked phrases are amplified by $\lambda_1$, the 26th to 50th are kept the same, and those not ranked are omitted. The three-dimensional array $\Upsilon$ in equation 1 can therefore be flattened along the model axis $i$ into a matrix $\Upsilon = [\upsilon_{j,k}]_{|V^{(0)}|\times(\kappa+1)}$ (equation 2). With a threshold $\lambda_2 \in \mathbb{R}^+$ to filter the computed weights in the matrix $\Upsilon$, a group of aggregated keyword sets $W_k$ can be obtained for each criterion $k$ (equation 3). Finding a properly filtered group of sets $W_k$ can be formulated as an optimization problem (equations 4a-4c), in which $W_k$ is effectively a function of the three variables $\lambda_0, \lambda_1, \lambda_2$, $\sigma_{|W_k|}$ denotes the standard deviation of the sizes of the sets $W_k$, and $\epsilon$ is a small number to avoid division by zero. This optimization ensures that: 1) there are enough phrases that fulfill more than one criterion (ensured by the numerator of equation 4a); 2) the total size of the vocabulary is concise (ensured by $N_0$ in equation 4b); 3) the sizes of the keyword sets are evenly distributed across the criteria (ensured by $\sigma_{|W_k|}$ in the denominator of equation 4a); and 4) the weights are in reasonable ranges for the filtering computation (ensured by equation 4c).
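One reading of equations 1 to 3 that is consistent with the definitions above is sketched below. It is an assumption rather than the exact published form: in particular, encoding unranked phrases as rank 0 (so that $\zeta_0 = 0$ omits them), the weighted sum over the five models, and the non-strict threshold are illustrative choices.

```latex
\begin{align}
\upsilon_{i,j,k} &=
  \begin{cases}
    r, & (w_j, r) \in W^{(i)}_k \\
    0, & \text{otherwise}
  \end{cases} \\
\upsilon_{j,k} &= \sum_{i=0}^{4} \omega_i \, \zeta_{\upsilon_{i,j,k}} \\
W_k &= \{ (w_j, \upsilon_{j,k}) \mid \upsilon_{j,k} \geq \lambda_2 \}
\end{align}
```

Under this reading, a phrase ranked in the top 10 for a criterion by all five models receives the weight $(3 + 2\lambda_0)\lambda_1^2$, while a phrase ranked by none of the models receives 0 and is always filtered out.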
Using a brute-force search to solve this optimization over a total of $|\lambda_0||\lambda_1||\lambda_2| = 64000$ possible configurations of the discretized $\lambda_0, \lambda_1, \lambda_2$, the configuration $\lambda_0 = 2.2$, $\lambda_1 = 1.2$, $\lambda_2 = 2.6$ yields the best filtering, with a total vocabulary size of $|V^{(1)}| = |\bigcup_{k=0}^{\kappa+1} \{w \mid (w, *) \in W_k\}| = 552$, among which 78 phrases occur in more than one selection criterion. For the new vocabulary $V^{(1)}$, the stop-word list and the WordNet Lemmatizer in the NLTK package (Loper and Bird, 2002, Miller, 1995) are used to further normalize and merge the keywords (as with the example of "art"). Furthermore, phrases composed of more than 2 words are merged into their longest N-Gram features (as with the example of "directly and tangibly associated"). After merging, a final lexicon as sets $W_k$ is obtained, yielding a vocabulary size of $|V| = |\bigcup_{k=0}^{\kappa+1} \{w \mid (w, *) \in W_k\}| = 354$, among which 77 phrases occur in more than one selection criterion.
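The aggregation, filtering, and brute-force search described in this section can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the weighted sum over models and the `>=` threshold are one plausible reading of the text, and `max_vocab` stands in for the vocabulary-size bound $N_0$, whose exact form and value are not given here.

```python
import statistics
from itertools import product


def rank_weight(r, lam1):
    """zeta_r: 0 if unranked, lam1**2 for top 10, lam1 for 11-25, else 1."""
    if r == 0:
        return 0.0
    if r <= 10:
        return lam1 ** 2
    if r <= 25:
        return lam1
    return 1.0


def aggregate(upsilon, lam0, lam1):
    """Flatten upsilon[i][j][k] over models; the last two models get lam0."""
    n_models = len(upsilon)
    omega = [1.0] * (n_models - 2) + [lam0, lam0]
    n_phrases, n_classes = len(upsilon[0]), len(upsilon[0][0])
    return [
        [
            sum(omega[i] * rank_weight(upsilon[i][j][k], lam1)
                for i in range(n_models))
            for k in range(n_classes)
        ]
        for j in range(n_phrases)
    ]


def filter_sets(weights, lam2):
    """Per class, keep phrase index -> weight where weight >= lam2 (assumed)."""
    n_classes = len(weights[0])
    return [
        {j: row[k] for j, row in enumerate(weights) if row[k] >= lam2}
        for k in range(n_classes)
    ]


def objective(sets, eps=1e-6):
    """Shared-phrase count over (std of set sizes + eps), plus vocab size."""
    counts = {}
    for s in sets:
        for j in s:
            counts[j] = counts.get(j, 0) + 1
    shared = sum(1 for c in counts.values() if c > 1)
    sigma = statistics.pstdev([len(s) for s in sets])
    return shared / (sigma + eps), len(counts)


def grid_search(upsilon, grid0, grid1, grid2, max_vocab):
    """Exhaustively try every (lam0, lam1, lam2) and keep the best score."""
    best = None
    for lam0, lam1, lam2 in product(grid0, grid1, grid2):
        sets = filter_sets(aggregate(upsilon, lam0, lam1), lam2)
        score, vocab_size = objective(sets)
        if vocab_size == 0 or vocab_size > max_vocab:
            continue
        if best is None or score > best[0]:
            best = (score, lam0, lam1, lam2)
    return best


# Hypothetical toy ranks: 5 models, 3 phrases, 2 classes.
toy = [[[1, 0], [0, 0], [0, 5]] for _ in range(5)]
toy[0][1][1] = 30
toy[3][0][1] = 12
toy[4][0][1] = 12
weights = aggregate(toy, lam0=2.0, lam1=1.5)
sets = filter_sets(weights, lam2=2.0)
score, vocab_size = objective(sets)
best = grid_search(toy, [1.0, 2.0], [1.0, 1.5], [1.0, 2.0], max_vocab=3)
```

The subsequent normalization with the NLTK stop-word list and WordNet Lemmatizer is not shown, since it depends on external corpora.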

Similarity Matrices
The co-occurrence matrix $A$ of the selection criteria, as introduced in Section 2.1, shows how often two criteria are justified together, i.e., marked as relevant, for a WH property. The more often two criteria are fulfilled simultaneously, the more similar and associated they arguably are with one another. The term "similarity" here reflects a structural viewpoint on the dataset. By normalizing matrix $A$, the upper triangular entries can be "unrolled" to form a long vector $\alpha$ of length $\binom{\kappa}{2}$, indexed with the ordered pair $(k, l)$, $k < l$, representing the pair-wise similarity of the criteria (equation 5). On the other hand, the confusion matrices $C^{(i,s)}$ of the models during the training and testing processes reveal how easily different selection criteria are misclassified as each other. Supposing the models are properly trained and represent the truth to a certain degree, two criteria should be more similar to one another the more often the models literally "confuse" them (Zhang et al., 2019). The term "similarity" here reflects an experimental viewpoint on the data concerning the NLP models' performances. However, before arguing that the confusion matrices reflect some intrinsic similarity, one must first show that the models behave in a consistent manner, i.e., that different models have difficulties with the same criteria pairs by easily confusing them. For each combination of a model $m_i$ and either the validation or the test set $s$ (training set performances are disregarded since the other two are supposed to better represent the predictive power of the models), a construction similar to equation 5 can be applied to obtain long vectors $\beta^{(i,s)}$ of length $\binom{\kappa}{2}$ from the confusion matrices $C^{(i,s)}$, following Zhang et al. (2019) (equation 6). Since the co-occurrence matrix $A$ is symmetric, the summation of the mirrored entries $C^{(i,s)}_{k,l}$ and $C^{(i,s)}_{l,k}$ in equation 6 is desirable, as it transforms the generally asymmetric confusion matrices into symmetric ones.
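A minimal sketch of the unrolling and symmetrizing steps is given below. The helper names are hypothetical, and dividing by the sum of the unrolled entries is only one possible normalization, assumed here for illustration because the exact normalization is not specified in the text above.

```python
def symmetrize(C):
    """Sum mirrored off-diagonal entries so that S[k][l] == S[l][k]."""
    n = len(C)
    return [
        [C[k][l] + C[l][k] if k != l else C[k][k] for l in range(n)]
        for k in range(n)
    ]


def unroll_upper(M):
    """List the strictly upper triangular entries in (k, l), k < l order."""
    n = len(M)
    return [M[k][l] for k in range(n) for l in range(k + 1, n)]


def normalize(vec):
    """Scale a vector to sum to 1 (an assumed normalization)."""
    total = sum(vec)
    return [v / total for v in vec] if total else list(vec)


# Hypothetical 3-class confusion matrix: rows are true labels.
toy_C = [[5, 2, 0], [1, 6, 3], [0, 1, 7]]
beta_raw = unroll_upper(symmetrize(toy_C))
beta = normalize(beta_raw)
```

For the ten selection criteria, `unroll_upper` yields 45 entries, one per unordered pair of criteria.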
The long vectors $\beta^{(i,s)}$ are first compared to each other using Spearman's Rank Correlation to check the consistency of the models' performances. However, the null hypotheses in ordinary correlation analyses on such vectors can easily be falsely rejected because of the auto-correlated structures in matrices, which makes the ordinary significance tests invalid. A method called the Quadratic Assignment Procedure (QAP) has been proposed to solve this problem (Liu, 2007, Krackhardt, 1988). By repeatedly permuting the rows and columns of one of the matrices simultaneously before unrolling it to a vector for correlation computation, a simulated distribution of the correlation coefficients can be obtained. The percentile of the original correlation coefficient (the one calculated without permutation) in this distribution can then be used to estimate the significance level of the correlation analysis. The vectors are then fed to the Principal Component Analysis (PCA) and Non-Negative Matrix Factorization (NMF) algorithms in Scikit-learn to perform dimensionality reduction and obtain the aggregated vector $\beta$ of length $\binom{\kappa}{2}$, representing the pair-wise confusion of the selection criteria (Févotte and Idier, 2011).
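The QAP test described above can be sketched with the standard library alone. This is an illustrative sketch, not the implementation used in the study: the function names are hypothetical, the test is one-sided, and the `(hits + 1) / (n_perm + 1)` estimate of the p value is a common convention assumed here. No guard is included for constant vectors, for which the correlation is undefined.

```python
import random


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            out[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def unroll_upper(M):
    n = len(M)
    return [M[k][l] for k in range(n) for l in range(k + 1, n)]


def qap_spearman(X, Y, n_perm=1000, seed=0):
    """Permute rows and columns of X together; compare against observed rho."""
    rng = random.Random(seed)
    n = len(X)
    y_vec = unroll_upper(Y)
    observed = spearman(unroll_upper(X), y_vec)
    hits = 0
    for _ in range(n_perm):
        p = list(range(n))
        rng.shuffle(p)
        Xp = [[X[p[k]][p[l]] for l in range(n)] for k in range(n)]
        if spearman(unroll_upper(Xp), y_vec) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical symmetric 5x5 matrix with distinct off-diagonal weights.
X = [
    [0, 1, 2, 3, 4],
    [1, 0, 5, 6, 7],
    [2, 5, 0, 8, 9],
    [3, 6, 8, 0, 10],
    [4, 7, 9, 10, 0],
]
rho, p_value = qap_spearman(X, X, n_perm=200, seed=0)
```

Because the same permutation is applied to rows and columns, each permuted matrix keeps the dependence structure among its entries, which is what makes the simulated distribution a suitable reference for the observed coefficient.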
Furthermore, the final lexicon $V = \bigcup_{k=0}^{\kappa+1} \{w \mid (w, *) \in W_k\}$ discussed in Section 2.2 can provide another level of interpretation of the criteria similarity. As suggested by the NLP literature (Pennington et al., 2014, Mikolov et al., 2013, Wallach, 2006), pre-computed word embedding vectors capture the semantic meanings of phrases well and can be further aggregated to represent document topics composed of an ensemble of words. Therefore, another matrix $H = [H_{k,l}]_{\kappa\times\kappa}$, $k, l \in [0, \kappa)$, showing the semantic similarity of the criteria can be constructed by computing the pair-wise cosine similarities of the averaged embedding vectors $f_k$ of the phrases in $W_k$ for each criterion $k$ (equation 7), where $g(w_j)$ is a function that looks up the 300-dimensional GloVe embedding vectors of all the words in the phrase $w_j$ and takes the sum of the vectors. Similar to equation 5, another long vector $\gamma$ of length $\binom{\kappa}{2}$ can be obtained to represent the pair-wise semantic similarities of the criteria.
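The semantic similarity matrix $H$ can be illustrated with the sketch below. The three-dimensional toy embeddings stand in for the 300-dimensional GloVe vectors and are entirely hypothetical, as are the function names and the toy lexicon; the unweighted average over phrases is an assumption based on the wording "averaged embedding vectors".

```python
import math


def phrase_vector(phrase, embeddings):
    """g(w): sum of the embedding vectors of all words in the phrase."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in phrase.split():
        for d, x in enumerate(embeddings[word]):
            total[d] += x
    return total


def criterion_vector(phrases, embeddings):
    """f_k: unweighted mean of the phrase vectors of one criterion."""
    vecs = [phrase_vector(p, embeddings) for p in phrases]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(lexicon, embeddings):
    """H[k][l]: cosine similarity between criterion vectors f_k and f_l."""
    f = [criterion_vector(phrases, embeddings) for phrases in lexicon]
    return [[cosine(fk, fl) for fl in f] for fk in f]


# Hypothetical 3-d embeddings and a toy lexicon with three criteria.
toy_embeddings = {
    "art": [1.0, 0.0, 0.0],
    "architecture": [0.8, 0.2, 0.0],
    "and": [0.1, 0.1, 0.1],
    "species": [0.0, 0.0, 1.0],
    "habitat": [0.0, 0.1, 0.9],
}
toy_lexicon = [
    ["art", "art and architecture"],
    ["architecture"],
    ["species", "habitat"],
]
H = similarity_matrix(toy_lexicon, toy_embeddings)
```

In this toy example the first two criteria, which share art-related vocabulary, come out far more similar to each other than either is to the third.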
The three vectors α, β, γ are further compared to each other using Spearman's Rank Correlation (as they have different value distributions) to check the relationship and consistency of different similarity definitions based on QAP significance level.

Graph Visualization
The vectors $\alpha, \beta, \gamma$ representing the pair-wise similarity of the selection criteria can also be interpreted as the edge weights of three undirected weighted unipartite graphs $G_\alpha, G_\beta, G_\gamma$, where each node represents a specific criterion $k$. The graphs are visualized in Gephi using the Force Atlas algorithm based on the edge weights (Bastian et al., 2009, Jacomy et al., 2014). Since those graphs are (almost) complete with significantly divergent edge weights, different thresholds $\xi_\alpha, \xi_\beta, \xi_\gamma$, chosen based on the weight distributions, are applied to show only the edges whose weights are larger than the threshold, in order to give clearer structural information about the associations between the criteria. Furthermore, the lexicon, i.e., the ensemble of sets $\bigcup_{k=0}^{\kappa+1} W_k = \{(w_j, \upsilon_{j,k})\}$, can also be interpreted as the edge table of an undirected weighted bipartite graph $B_w$, where the two sets of nodes are respectively the vocabulary $V$ and all the selection criteria. Moreover, as introduced in Section 2.2, some phrases may belong to more than one criterion, and the edge weights of such phrases can vary across criteria. For example, the term "architectural" belongs to both Criterion (iv) with a weight of 5.70 and Criterion (i) with a weight of 4.75. In such cases, the degree of a node representing a phrase is the sum of the weights of all edges connected to it. The lexicon as a bipartite graph is also visualized in Gephi using the Force Atlas algorithm based on the edge weights.
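The edge filtering and the weighted degree of phrase nodes can be sketched as below, before any layout is computed in Gephi. The function names, the toy similarity weights, and the "species" edge are hypothetical; only the two weights of "architectural" are taken from the text.

```python
def filter_edges(edge_weights, threshold):
    """Keep only edges whose weight is strictly larger than the threshold."""
    return {pair: w for pair, w in edge_weights.items() if w > threshold}


def weighted_degree(lexicon_edges):
    """Sum the weights of all edges attached to each phrase node."""
    degree = {}
    for phrase, _criterion, weight in lexicon_edges:
        degree[phrase] = degree.get(phrase, 0.0) + weight
    return degree


# Hypothetical similarity weights between pairs of criteria.
toy_similarity = {("ii", "iv"): 0.9, ("ix", "x"): 0.8, ("i", "vii"): 0.1}
kept = filter_edges(toy_similarity, threshold=0.5)

# Edge table of the bipartite graph: (phrase, criterion, weight).
lexicon_edges = [
    ("architectural", "iv", 5.70),  # weight quoted in the text
    ("architectural", "i", 4.75),   # weight quoted in the text
    ("species", "x", 3.0),          # hypothetical weight
]
degree = weighted_degree(lexicon_edges)
```

With the weights quoted above, the node "architectural" has a weighted degree of 5.70 + 4.75 = 10.45.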

OUV-related Lexicon of Selection Criteria
The lexicon, visualized as the bipartite graph $B_w$ containing all phrases in $V$ and their relationships with the selection criteria (including the negative class "Others"), is shown in Figure 2. Generally, the essential topics of the criteria also receive the largest weights in the predictions of the computational models. This is evident in the cases of Criterion (i) with the phrases "masterpiece" and "human creative genius", (ii) with "influence" and "development", (iii) with "bear exceptional testimony", (iv) with "outstanding example" and "building", (v) with "traditional human settlement", (vi) with "directly and tangibly associated", (vii) with "exceptional natural beauty", (viii) with "geological process", (ix) with "ecological", and (x) with "species". For each criterion, not only adjectives and verb phrases describing the values, but also nouns and noun phrases showing the critical attributes can be found. Taking Criterion (i) as an example, phrases such as "unique artistic achievement, creative, genius, artistic, monumental" highlight the main artistic, aesthetic, and historic values associated with this criterion. Meanwhile, representative attributes such as "fresco, sculpture, interior, decoration, art and architecture" show what those values are attached to.
Inspecting the phrases associated with more than one criterion can provide some insights into the common justifications of OUV. The terms "art" and "design" connect Criteria (i)(ii)(iv), while "landscape" connects Criteria (i)(ii)(v), and "cultural landscape" connects Criteria (iv)(v), showing the common standpoints and nuances in the focuses of those criteria. Moreover, the groups of phrases related to religions connecting Criteria (iii) and (vi), phrases about architectural art connecting (i) and (iv), about urban form connecting (iv) and (v), about natural phenomena connecting (vii) and (viii), as well as phrases about living organisms connecting Criteria (ix) and (x), etc., all imply some common characteristics within the OUV concept.

Matrix Similarities
All vector pairs from $\beta^{(i,s)}$ have a high Spearman's Rank Correlation coefficient from 0.713 to 0.933, while all correlations are significant with p < 0.001 based on QAP simulation. This suggests that all the investigated confusion matrices perform consistently across models and datasets. Though models such as BERT and ULMFiT generally have a better prediction accuracy, they are similarly confused at the same criteria pairs as the worse-performing models. Therefore, it is appropriate to aggregate the vectors $\beta^{(i,s)}$ into $\beta$ to represent the overall confusion patterns of the models. The first PCA component of the vectors manages to explain 89.7% of the variance in $\beta^{(i,s)}$. However, due to the nature of PCA, some elements in its component are unavoidably negative, which can be hard to interpret as a similarity metric. Alternatively, the first component computed from NMF is non-negative, and has a Pearson Correlation of r = 1.0, p < 0.001 with the first PCA component. Therefore, the first NMF component from $\beta^{(i,s)}$ is used as $\beta$ for later analysis. This vector effectively makes a single matrix representative of the 10 possible variants of $G_\beta$, thus making this graph comparable to the other two graphs.
Figure 2. The lexicon of selection criteria, i.e., the bipartite graph $B_w$, visualized as a word network based on the Force Atlas algorithm in Gephi. Thicker edges indicate higher weights of the phrases in vocabulary $V$ regarding a specific criterion. Nodes with higher edge weights are placed closer to each other in the visualization. Larger nodes and font sizes indicate larger total weighted degrees of the phrases. The colors of the phrase nodes are rendered the same as the criterion they belong to. The nodes of phrases belonging to two or more criteria are placed between the criteria clusters, and the colors of the nodes are also the mixture of the criteria colors. The general topics of criteria according to the ICOMOS report (Jokilehto, 2008) and the total number of keywords belonging to each criterion, i.e., $|W_k|$, are demonstrated in the legend. This graph (lexicon) could be used to locate specific words regarding their relations with different selection criteria, and to observe and select the most relevant words while drafting and/or evaluating the Statements of OUV. Detailed interpretations of the lexicon are presented in Section 3.1.
Figure 3. The graph visualizations of the similarity matrices represented by $\alpha, \beta, \gamma$ as edge weights using the Force Atlas algorithm in Gephi. a-c) Co-occurrence graph $G_\alpha$; d-f) Confusion graph $G_\beta$; g-i) Semantic similarity graph $G_\gamma$; a/d/g) Complete graphs with all edge weights visualized; c/f/i) Filtered graphs that only show edges whose weights are higher than the first two cross-domain cultural-natural criteria pairs; b/e/h) Histogram of edge weights and the threshold $\xi_\alpha, \xi_\beta, \xi_\gamma$ during filtering, the top-5 edges being listed with their weights. Node size represents the total World Heritage properties justified with this selection criterion.
Table 1. Spearman's Rank Correlation of the vectors $\alpha, \beta, \gamma$ (only one entry is recoverable: 0.793*, p < .001). *p < .001 with QAP simulation of 1000 permutations.
The values of the vectors $\alpha, \beta, \gamma$ are shown in Figure 1 (a), (c), and (d), respectively. The matrix heatmaps generally illustrate a consistent visual pattern: 1) the top-left corner, indicating the associations among cultural criteria, and the bottom-right corner, indicating the associations among natural criteria, are stronger and form two relatively dense sub-matrices; 2) the off-diagonal entries highlight similar places, such as the entries representing the relation between Criteria (ii) and (iv) and between Criteria (ix) and (x). These patterns are further supported by correlation analysis. The Spearman's Rank Correlation of the vectors representing the similarities between selection criteria is shown in Table 1. All three pairs are significantly correlated with a high coefficient between 0.615 and 0.838, indicating that the three proposed similarity matrices representing the structural (as co-occurrence matrix), experimental (as aggregated confusion matrix), and semantic (as cosine similarity matrix of GloVe embedding) information of the criteria are consistent with each other, though each of the three may capture different aspects of the pair-wise associations. These aspects will be discussed extensively in Sections 3.3 and 4. The QAP-simulation-based p values from 1000 random permutations indicate that such high correlations are significant, i.e., unlikely to be caused by randomness.

Associations and Similarities of Selection Criteria
The similarity matrices showing the associations of selection criteria are further visualized in 2D as weighted graphs $G_\alpha, G_\beta, G_\gamma$ in Figure 3, where the nodes representing more similar criteria are placed closer to each other. The graphs on the top are complete graphs showing all edge weights, while the graphs on the bottom are filtered graphs showing only the edges whose weights are equal to or higher than those of the first two cross-domain edges linking cultural (i-vi) and natural (vii-x) criteria. The thresholds $\xi_\alpha, \xi_\beta, \xi_\gamma$ used for the filtering are also plotted on the histograms of the edge weights. It can be observed from the histograms that the edge weights in $G_\alpha$ and $G_\beta$ are more divergent, while in $G_\gamma$ the edge weights are more homogeneous. As a consequence, $G_\gamma$ is also visually more different from the other two similarity graphs.
By inspecting the visualization in Figure 3, consistent association and similarity patterns of the criteria can be observed from the graphs: 1) the in-domain edges generally have a larger weight than the cross-domain edges, thus creating two sub-graph clusters for cultural and natural criteria in all graphs, suggesting that cultural and natural criteria are relatively independent of each other; 2) the first several cross-domain edges connecting cultural and natural criteria always involve either Criterion (v) about Land-Use or Criterion (iii) about Testimony, suggesting that these two cultural criteria also have a natural aspect; 3) the cultural criteria are generally more connected and interrelated than the natural ones, suggesting that the cultural criteria are probably more similarly defined and associated with each other than the natural criteria; 4) the edges between Criteria (ii) and (iv), and between Criteria (i) and (iv), are always among the top-5 weights in all three graphs (see the lists of top-5 edges in Figure 3b/e/h), indicating the strong association of Architectural Typology with both Masterpiece and cultural Influences; 5) the edge between Criteria (iv) and (v) has the top-1 weight in both $G_\beta$ and $G_\gamma$, but is only the 13th in $G_\alpha$, showing that the association of Architectural heritage and Urban heritage might be stronger than indicated by the actual co-justification in the WH List; 6) conversely, the edges between Criteria (iii) and (iv), and between Criteria (ii) and (iii), are ranked in the top 3 in graph $G_\alpha$, yet rank only 11th in $G_\gamma$ and $G_\beta$, respectively, showing that although these criteria are usually co-justified in WH properties, they may not be that semantically similar or empirically confusing.
Remarkably, the strong associations indicated by the graphs in Figure 3 are also clearly illustrated with many common phrases (lexicon) in Figure 2, though the two figures are derived from different data sources and resolutions. The bipartite lexicon graph $B_w$ in Figure 2 can be interpreted more as a zoomed-in view on the selection criteria composed of phrases, while the graphs $G_\alpha, G_\beta, G_\gamma$ in Figure 3 arguably reflect a zoomed-out view on the characteristics of the criteria themselves.

DISCUSSION
The lexicon presented in Figure 2 could become a tool for researchers and practitioners to automatically highlight the keywords in a sentence about World Heritage properties and indicate the best matching selection criteria. It also has the potential to facilitate the drafting and revising of Statements of OUV, supporting new WH nominations and their evaluation by the Advisory Bodies, ICOMOS and IUCN.
Since the computational models were trained on the authoritative context of WH properties, the lexicon derived from this study provides a chance to empirically investigate the patterns that frequently appear in Statements of OUV and are captured and learned by the NLP models, but can be easily neglected or undervalued with traditional methods. For example, Criterion (i) is officially defined as "to represent a masterpiece of human creative genius" in the Operational Guidelines and summarized as "masterpiece" by the 2008 report (UNESCO, 2008, Jokilehto, 2008). However, the term "unique artistic achievement" is strongly emphasized by the computational models and the lexicon shown in Figure 2, suggesting that artistic value is also expected to be of high importance for the WH properties justified with Criterion (i). Similarly, though Jokilehto placed more stress on the "value/influence" dimension of Criterion (ii), the terms related to "development" and "interchange" in its definition also seem to have similar importance. As the next step, the lexicon could be further updated with additional human engineering such as expert-based rating, as the current version is the outcome of a semi-automated procedure. Although filtering as described in Section 2.2 has been applied, not every phrase in the lexicon makes sense. Some failure examples include the terms "one" and "back" within Criterion (ix), "total" within Criterion (x), and "overall" within Criterion (i). Those terms should have been rather neutral, but the consistent writing style and word usage preference in Statements of OUV probably give some phrases a misleading score. Furthermore, the lexicon can be used as initial "seed words" in future studies to construct a more comprehensive and concrete World Heritage OUV-related lexicon by incorporating other larger and more mature semantic lexicons such as WordNet (Miller, 1995, Jurafsky and Martin, 2020).
Some visual similarities can already be observed in Figure 1, as the heatmaps seem to highlight matrix entries in a similar pattern. This was also probably the assumption in the 2008 ICOMOS report about the OUV associations, as argued in Section 1. Yet these similarities would be hard to prove or falsify without a quantitative methodology, such as the one presented in this paper. The correlation coefficients shown in Section 3.2 and the graphs $G_\alpha, G_\beta, G_\gamma$ in Figure 3 confirm this intuitive assumption based on observations. Furthermore, while graph $G_\alpha$, based on the co-occurrence pattern of the OUV criteria, may vary radically due to changes in the interest or focus of the WH Committee during the nomination procedure, the other two graphs might be more stable over time. The 2008 ICOMOS report argued that "[Criteria] (i) and (ii) can reinforce each other, while (iv) is often used as an alternative" based on the co-occurrence pattern at that time, when cases co-justifying Criteria (i) and (ii) were almost twice as many as the cases with Criteria (i) and (iv) (Jokilehto, 2008). This observation is no longer true for the situation in 2019, when the latter, i.e., cases with Criteria (i) and (iv), appears even more frequently than the former. However, both associations are observed in the 4th finding presented in Section 3.3. As graphs $G_\beta$ and $G_\gamma$ are both based on the written texts and terms collectively used in the entire Statements of OUV, they may be more robust to new nominations unless very unusual terms are systematically introduced. It can also be informative in future studies to investigate how the presented graphs change over time.
The qualitative and quantitative analyses show that the selection criteria pairs have different association strengths. For a thoroughly trained expert (either human or computer), nuances between pairs such as Criteria (i) and (iv) can already be rather hard to distinguish, let alone someone from the general public. To make the World Heritage management more socially inclusive, the concept of OUV more intelligible, and the future inscription process more effective, extra efforts may need to be made to further sharpen and clarify the definitions of criteria, and to make sure the OUV statements written by future practitioners and researchers are sufficiently consistent and coherent.

CONCLUSIONS
This paper presents a computational interpretation of the associations of the UNESCO World Heritage selection criteria that indicate the Outstanding Universal Value (OUV) conveyed by the properties, as an evolution of the ICOMOS report "What is OUV" published in 2008, applying a novel methodology that integrates state-of-the-art technology. It provides an OUV-related lexicon showing relevant phrases of each selection criterion, proposes three similarity graphs using different data sources to show various aspects of the criteria associations, and conducts quantitative and qualitative analyses on the lexicon and similarity graphs to make sense of the observations. This study may offer some insights for the further evolution and improvement of the concepts of both World Heritage and OUV, which are also regularly revised by the World Heritage Committee⁴.
Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 813883.