ASSESSING THE SEMANTIC SIMILARITY OF IMAGES OF SILK FABRICS USING CONVOLUTIONAL NEURAL NETWORKS

This paper proposes several methods for training a Convolutional Neural Network (CNN) for learning the similarity between images of silk fabrics based on multiple semantic properties of the fabrics. In the context of the EU H2020 project SILKNOW (http://silknow.eu/), two variants of training were developed, one based on a Siamese CNN and one based on a triplet architecture. We propose different definitions of similarity and different loss functions for both training strategies, some of them also allowing the use of incomplete information about the training data. We assess the quality of the trained model by using the learned image features in a k-NN classification. We achieve overall accuracies of 93-95% and average F1-scores of 87-92%.


INTRODUCTION
The main goal of the EU H2020 project SILKNOW (http://silknow.eu/) is to support art historians in improving their understanding of European silk heritage and to make this knowledge available to the public. Openly accessible databases like (IMATEX, 2018) collect information about such fabrics, but not in a standardized format. Thus, in the context of the project, the information from different collections is gathered in a uniform database with standardized annotations. Relevant properties of fabrics include the production time, place and technique. One way to access this knowledge is to make database queries, e.g. to get a list of records related to fabrics produced in the 19th century. The alternative investigated in this paper is to query the records that are most similar to a given image, a procedure known as image retrieval, e.g. (Zheng et al., 2017). Given an image of a fabric with unknown origin, this would be a way to learn something about the fabric, because the query results also give access to the properties of the most similar images. However, this leads to the question of how to define the similarity of silk fabrics. Existing methods for characterizing the similarity of images are often only based on visual appearance, e.g. (Wang et al., 2016; Jamil et al., 2006). Supervised learning of a model of similarity (Hadsell et al., 2006; Schroff et al., 2015) requires training images such that for each image pair we know whether the images are similar or not. This information is not readily available in a database containing records of fabrics, so that manual annotation would be required, an expensive and very subjective task. An alternative, explored in another context in (Zhao et al., 2015), is to define similarity based on the similarity of properties of the depicted fabrics. As information about these properties is available in the database, training samples can be generated automatically using this approach.
It may also be more useful in the context of the project SILKNOW, because the most similar images according to this definition may be those from which a user may learn most about the fabric depicted in the query image.
Consequently, this paper presents a method for training a Convolutional Neural Network (CNN) (Krizhevsky et al., 2012) to learn a model of the similarity between images of silk fabrics based on multiple semantic properties of the fabrics, which allows us to model different degrees of similarity. This is different from most existing similarity definitions, which only consider one such property, e.g. (Gordo et al., 2016). The network learns to generate an image descriptor such that the distance of the descriptors of similar images is small. We propose two training strategies for that purpose, one based on a Siamese architecture (Bromley et al., 1994) and another one considering image triplets (Schroff et al., 2015). The training samples are generated automatically from a database of images with annotations. However, existing databases of silk fabrics often contain many samples with incomplete annotations, i.e. information about some semantic properties may be missing. Consequently, for both training scenarios, we will define loss functions that can also cope with such samples, the main problem being that a definition of similarity of image pairs based on annotations will be affected by missing information. In our experiments, we compare the different learning scenarios and we investigate the impact of our new developments on the results. For a quantitative evaluation, we assess the performance of a k-nearest neighbour (k-NN) classifier (Bishop, 2006) based on the Euclidean distance of the feature vectors.
The scientific contribution of this paper is three-fold. Firstly, we define a model of the similarity of images of silk fabrics based on semantic properties. To the best of our knowledge, this is the first work considering multiple properties for that purpose. Secondly, based on this definition of similarity, we develop two strategies for learning a model of image similarity with automatically generated training samples. Finally, for both training strategies, we develop loss functions that can deal with incompletely labelled samples, which gives access to a considerably larger set of training data.

RELATED WORK
Learning the similarity of pairs of images is not an entirely new problem in the fields of Photogrammetry and Computer Vision.
It is often encountered in the context of feature-based image matching, e.g. (Han et al., 2015), and image retrieval, e.g. (Qi et al., 2016). Usually, the similarity of images is assessed via image descriptors (feature vectors), the idea being that the descriptors of similar images should have a small distance in feature space. While hand-crafted descriptors were used originally, the focus of research has since shifted to learning descriptors based on CNNs (Zheng et al., 2017).
In the context of image matching, one task is to find pairs of images that show the same scene and, thus, overlap. Han et al. (2015) train a Siamese CNN to correctly predict whether two image patches are similar or not. They define similarity in a binary way; images are only considered to be similar if they show the same scene. They train their network by minimizing the cross-entropy error of the network's binary predictions of similarity. This training strategy does not explicitly use the information that the descriptors of similar images should have a small distance. While the representation may be optimal for binary classification, using the distance of feature vectors for assessing similarity may yield sub-optimal results. Furthermore, the CNN is trained from scratch, whereas Babenko et al. (2014) have shown that image descriptors delivered by pre-trained networks are well suited for image retrieval even if the networks were trained for classification. Using a pre-trained network could improve the matching performance or at least reduce the requirements with respect to training samples.
In the context of image retrieval, Qi et al. (2016) proposed a method to retrieve photos from a database when only a freehand drawn sketch is available. The authors propose a Siamese CNN architecture consisting of two CNN branches with shared parameters. In training, an image and a sketch are processed by the two branches, respectively, and the loss function tries to minimize the distance of the resultant image descriptors for pairs that are labelled as being similar, and vice versa. Similarity is defined in a binary way: both the photos and the sketches are manually labelled into various shape classes; a photo and a sketch are considered to be similar if their labels are identical. A binary definition of similarity might be disadvantageous because photos and sketches could belong to multiple shape classes. A non-binary definition of similarity could capture different degrees of similarity, e.g. the number of matching shapes, which could increase the retrieval performance. Gordo et al. (2016) as well as Wang et al. (2014) retrieve photos from a database that are similar to queried photos. Both papers use a network architecture similar to a Siamese one, but extend it by a third network branch also sharing its weights with the other ones. This three-stream architecture is trained using a triplet ranking loss requiring descriptors from similar images to be closer to each other than descriptors from dissimilar images. For training, the authors use a large public dataset of images of famous landmark sites. Again, the similarity is defined in a binary way; pairs of images are considered to be similar if they show the same site. Such a definition would not be directly applicable to the problem considered in this paper, because hardly any pair of images in a database would show the same fabric. Zhao et al. (2015) propose a method for learning binary image descriptors for multi-label image retrieval. 
They use CNNs to jointly learn feature representations and their mappings to binary hash codes. Although their research regarding binary descriptors is rather unrelated to our own research, their approach to consider multiple labels for learning image similarity is of great importance for us. In contrast to (Qi et al., 2016) and (Gordo et al., 2016), where similarity between two images is a binary variable, Zhao et al. (2015) model similarity in a non-binary way. Working with images with labels for multiple properties, similarity is defined based on the number of matching labels of two images; the higher the number of matching labels, the higher the similarity. While the authors claim that the images used for training can have a varying number of labels, which would be relevant for our problem of having to deal with incomplete samples, we argue that their approach, being based on absolute numbers, might deliver counter-intuitive results if there are large variations in the number of labels per sample. We tackle the problem of having differing numbers of variables by introducing a concept of uncertainty of similarity.
The classification of works of art using Deep Learning has been tackled for some years, too. Whereas some papers deal with the prediction of single properties such as the epoch of a painting (Hentschel et al., 2016), others try to predict multiple properties at once based on the assumption that there are interdependencies between the properties (Long et al., 2017). However, image retrieval for works of art using Deep Learning seems to be a much less investigated field, one example being the matching of papyrus fragments (Pirrone et al., 2019). Interestingly, another work in this field is the one most closely related to ours in terms of the application and also has the goal of retrieving images of silk fabrics (Jamil et al., 2006). The authors argue that the similarity between those fabrics can be assessed by means of visual appearance, more precisely by the motifs depicted in the fabrics. They focus on the shape of the motifs rather than the colour and choose a set of hand-crafted features to define an image descriptor. In line with the assessment of (Zheng et al., 2017), we expect descriptors learned from training data to lead to better results than hand-crafted ones. Furthermore, modelling the similarity of silk fabrics purely based on the motifs' shapes might not capture all aspects of similarity of such silk fabrics.
To the best of our knowledge, there is no publication focussing on image retrieval based on the similarity of multiple properties of works of art. Also, there seems to be hardly any work that deals with missing labels when multiple labels are considered for the similarity of images; we believe that the exception (Zhao et al., 2015) gives counter-intuitive results when there is a large variability of the number of categories to compare in the dataset.

METHODOLOGY
It is the goal of our method to train a CNN to deliver similar features for similar input images and dissimilar features for dissimilar ones, so that the Euclidean distance of feature vectors can be used to measure similarity between pairs of images. We assume a database of images with annotations for a series of semantic variables to be available. These annotations are used to define a measure of similarity between pairs of images, so that the training samples, consisting of image pairs with known similarity values, can be generated automatically from the database contents. Having trained the CNN, a feature vector can be derived for every sample of the database by passing the corresponding image through the CNN. The resultant feature vectors are used to build a k-d tree (Pedregosa et al., 2011). Given a query image, the most similar images from the database can be retrieved by applying the CNN to that image and retrieving the k nearest neighbours of the resultant feature vectors from the k-d tree. The results can be presented to a user; optionally, the properties of the query image can be predicted by a majority vote of the nearest neighbours. While this classification is not the main goal of the proposed method, it will be the basis of the quantitative evaluation in section 4. The details of our method are presented in the subsequent sections.
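The retrieval and voting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the k-d tree by a brute-force search (which returns the same neighbours, only more slowly), and the descriptors, the class names `class_a` and `class_b` and the function names are made up for illustration.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve(query, database, k=5):
    """Return the k (descriptor, label) entries closest to the query.

    Brute-force stand-in for the k-d tree query described in the text.
    """
    ranked = sorted(database, key=lambda entry: euclidean(query, entry[0]))
    return ranked[:k]


def majority_vote(neighbours):
    """Predict a property as the most frequent label among the neighbours."""
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]


# Toy 2-D descriptors with hypothetical labels (real descriptors are 128-D).
database = [
    ((1.0, 0.0), "class_a"),
    ((0.9, 0.1), "class_a"),
    ((0.0, 1.0), "class_b"),
    ((0.1, 0.9), "class_b"),
    ((0.8, 0.2), "class_a"),
]
neighbours = retrieve((0.95, 0.05), database, k=3)
predicted = majority_vote(neighbours)
```

For a database of realistic size, a spatial index such as the k-d tree mentioned above avoids comparing the query against every stored descriptor.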

Network Architecture
Our CNN architecture, which is based on ResNet-152 (He et al., 2016), is presented in figure 1. The input consists of an RGB image x scaled to 224 × 224 pixels. This image is presented to the ResNet-152, which generates a 2048-dimensional feature vector. This is followed by a fully connected (FC) layer of dimension 1024 with ReLU (Rectified Linear Unit) activations (Nair & Hinton, 2010). The last layer is another FC layer which delivers a 128-dimensional vector. In order to restrict the extent of the feature space, this vector is normalized to unit length, which results in the feature vector f(x); this is the main output of the CNN and should characterize the input image. As a consequence of normalization, the maximum Euclidean distance of two feature vectors is 2, which will be useful to tune some of the parameters of the loss functions described in section 3.3. Both the choice of the ResNet-152 as a backbone and the architecture of the remaining parts of the network were based on preliminary experiments not reported here for lack of space.
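The effect of the normalization on the range of distances can be checked with a small sketch. The function names and the 2-D toy vectors are chosen for illustration only; two unit vectors pointing in opposite directions reach the maximum distance of 2.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in v]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two toy vectors pointing in opposite directions.
f1 = l2_normalize([3.0, 4.0])
f2 = l2_normalize([-3.0, -4.0])
max_distance = euclidean(f1, f2)
```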

Semantic Similarity
There is no unique definition of the term similarity of images, let alone of images of silk fabrics. For reasons already pointed out in section 1, we prefer a definition of similarity based on semantic properties over a definition based on visual appearance. Such a definition allows us to automatically generate training samples from a database of images with semantic annotations, while a definition based on visual similarity would require manual labelling of pairs of images as being similar or not. Manual labelling is highly subjective and might lead to inconsistent annotations, so that we consider a definition based on semantic properties also to be more objective than the other option.
Our definition of similarity requires the availability of a set of images x with annotations for a set of semantic properties such as their production timespan or production place. Further, let li(x) be the class label of the i-th property for image x. Then, for a pair of images x1, x2, we can define a similarity function Y(x1, x2) which returns a value of 1 if the images are similar and 0 otherwise. This function can be defined in a straightforward way if only a single property l is considered: Y(x1, x2) = δ(l(x1) = l(x2)). (1) In eq. 1, δ(·) is the Kronecker delta function, which returns 1 if the argument is true and zero otherwise. Thus, Y(x1, x2) = 1 if the class labels for property l are identical and Y(x1, x2) = 0 otherwise. A naïve way of considering multiple properties at once would be to check whether all property labels are equal.
Similarly to the definition in eq. 1, this would lead to a binary-valued similarity function Y(x1, x2) ∈ {0, 1}. However, we prefer to be able to model different degrees of similarity. We argue that a pair of images with identical annotations in all but a few properties should be considered to be more similar than a pair without any identical annotations. Consequently, we define a real-valued similarity function Y(x1, x2) ∈ [0, 1] whose value is proportional to the number of identical annotations: Y(x1, x2) = (1/I) · Σ_{i=1..I} δ(li(x1) = li(x2)), (2) where I is the number of semantic properties. Note that this is equivalent to eq. 1 for I = 1.
Eq. 2 is the basic definition of similarity which will be used for training our CNN. However, it requires annotations to be available for all properties, which is not necessarily the case. We could apply this function to incomplete samples, i.e., pairs of images for which a part of the properties under consideration is unknown, by just considering the properties for which annotations are available for both images and setting I to the number of such properties. However, under these circumstances, a pair of images for which only one property is annotated in both images, with identical labels, would be considered to be similar with Y(x1, x2) = 1, although in fact they might differ in all the other (unknown) properties. Obviously, the fact that some properties are unknown introduces some uncertainty into our definition of similarity. The more properties are unknown, the larger this uncertainty is. Thus, in order to be able to include incomplete samples in the training process, we expand our similarity function in order to incorporate this uncertainty. For that purpose, we note that the similarity function Y(x1, x2) can serve as an indicator Yp for positive similarity, i.e., Yp(x1, x2) ≡ Y(x1, x2). Similarly, we can define an indicator Yn for negative similarity. If x1 and x2 are complete samples, i.e., if all properties are known for these images, we can define Yn(x1, x2) = 1 − Y(x1, x2). Under these circumstances, we have Yp + Yn = 1, and there is no uncertainty. In order to expand our notion of similarity to incomplete samples, we use a new definition of Yp(x1, x2) and Yn(x1, x2) (eq. 3), in which an indicator π_i^n (eq. 4) states whether an annotation for property i exists for image xn or not. This definition is equivalent to eq. 2 for pairs of complete samples. For incomplete samples, the relative sizes of Yp and Yn still express whether two images are more or less similar, but in this case, eq. 3 results in Yp + Yn < 1.
We can interpret 1 − (Yp + Yn) as a measure of the uncertainty of our knowledge about the similarity of an image pair. The definition of similarity according to eq. 3 is used for training the CNN in the presence of incomplete samples.
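The similarity definitions can be illustrated with a short sketch. The exact form of eq. 3 is not reproduced in this text, so the version below is an assumption that merely satisfies the properties stated above: only properties annotated in both images contribute, the sums are still divided by the total number of properties I, and the result reduces to eq. 2 for complete samples. Missing annotations are represented by `None`, and all function names and labels are hypothetical.

```python
def similarity(labels1, labels2):
    """Eq. 2: fraction of properties with identical labels (complete samples)."""
    total = len(labels1)
    return sum(a == b for a, b in zip(labels1, labels2)) / total


def similarity_indicators(labels1, labels2):
    """Assumed form of eq. 3: returns (Yp, Yn) for possibly incomplete samples.

    A property only contributes if it is annotated (not None) in both images;
    both sums are divided by the total number of properties.
    """
    total = len(labels1)
    known = [(a, b) for a, b in zip(labels1, labels2)
             if a is not None and b is not None]
    y_p = sum(a == b for a, b in known) / total
    y_n = sum(a != b for a, b in known) / total
    return y_p, y_n


# Complete pair: two of three properties match.
y = similarity(("place_a", "tech_a", "time_a"), ("place_a", "tech_a", "time_b"))

# Incomplete pair: the third property is unknown for the first image.
y_p, y_n = similarity_indicators(("place_a", "tech_a", None),
                                 ("place_a", "tech_b", "time_b"))
uncertainty = 1.0 - (y_p + y_n)
```

In the incomplete example, one property matches, one differs and one is unknown, so each of Yp, Yn and the uncertainty equals one third.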

Training
We initialize the parameters of ResNet-152 using the model pre-trained on the ImageNet data set (Deng et al., 2009). The weights of the FC layers are initialized using Variance Scaling (He et al., 2015). In training, we freeze all parameters of ResNet-152 except those of the last layer. We argue that by using this pre-trained network a good generic feature representation for the images can be obtained (Razavian et al., 2014). The parameters of the last layer of ResNet-152 and the two FC layers of our CNN (cf. fig. 1) are determined in the training procedure. For this purpose, we propose two different strategies, one based on a two-stream Siamese architecture (Bromley et al., 1994) and another one based on a triplet architecture, e.g. (Gordo et al., 2016). For both strategies, training consists of minimizing a loss function that measures the network's ability to produce similar features for similar input images and dissimilar features for dissimilar input images. We present different loss functions for both training strategies and compare them in our experiments. In all cases, training is based on stochastic minibatch gradient descent with momentum (SGD), using backpropagation for computing the gradients. More details about SGD are given in section 4. The training strategies and the related loss functions are presented in sections 3.3.1 and 3.3.2.
3.3.1 Siamese Training: Our first strategy uses the two-stream Siamese architecture depicted in fig. 2. The training procedure requires pairs of input images x1, x2 with known similarity value Y(x1, x2). The network takes the two input images and propagates them through two identical copies of our basic CNN architecture to deliver two feature vectors f(x1), f(x2) for x1 and x2, respectively. Both CNN branches share the same network weights w. The network calculates the L2 distance ∆(f(x1), f(x2)) between the two feature vectors, which forms the basis for calculating the loss L. It is the goal of the training procedure to make this distance small for similar image pairs and large for dissimilar image pairs. To achieve this goal, the contrastive loss (eq. 5) can be used (Hadsell et al., 2006). In eq. 5, Y(x1, x2) is one of the similarity functions described earlier. Mp is the positive distance margin, i.e. the maximum allowed distance of feature vectors of similar inputs, while Mn is the negative distance margin, i.e. the minimum allowed distance of feature vectors of dissimilar inputs. The goal of minimising the loss in eq. 5 is to produce feature vectors having a distance smaller than Mp for samples with Y = 1 and larger than Mn for samples with Y = 0. In the standard case of the contrastive loss, the function Y is binary, which is also the case for the similarity function in eq. 1; it can be used to train a model based on the similarity of a single property. As we want all of our training samples to always contribute to the training procedure, unless otherwise noted we always pull the distance of features between similar inputs towards the minimum possible distance of 0 and always push the distance of features between dissimilar inputs towards the maximum possible distance of 2; this corresponds to choosing Mp = 0 and Mn = 2. The normalization of the feature vectors helps to define Mn.
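Eq. 5 itself is not reproduced in this text. A form that is consistent with the description above would be the following, where the squared hinge terms are an assumption borrowed from the usual formulation of the contrastive loss and constant factors are omitted:

```latex
% Assumed form of eq. 5 (two-margin contrastive loss); \Delta is the L2 distance
% between the feature vectors f(x_1) and f(x_2).
L(x_1, x_2) = Y(x_1, x_2) \cdot \max\bigl(0,\; \Delta - M_p\bigr)^2
            + \bigl(1 - Y(x_1, x_2)\bigr) \cdot \max\bigl(0,\; M_n - \Delta\bigr)^2
```

With this form, the first term vanishes once the distance of a similar pair is below Mp, and the second term vanishes once the distance of a dissimilar pair exceeds Mn.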
The contrastive loss according to eq. 5 also works for a definition of similarity based on multiple properties. In this case, we have to use the similarity function from eq. 2. In the case of a binary similarity function, every training sample will either pull the distance towards Mp or outside of Mn; here, for each image pair, we consider the loss to be a trade-off of two competing forces. One force, weighted by Y , pulls the distance towards Mp, while the other force, weighted by (1 − Y ) tries to make the distance larger than Mn. The similarity defines the weights and, thus, the relative size of Y and (1 − Y ) will indicate an equilibrium distance that minimizes the loss. The larger the similarity, the closer this distance will be to Mp. Consequently, we expect the CNN to learn to produce feature vectors whose distances will correspond to the degree of similarity of image pairs according to eq. 2.
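The equilibrium distance mentioned above can be worked out explicitly under an assumption that is not confirmed by this text, namely that both terms of the contrastive loss are squared hinge terms. For Mp = 0, Mn = 2 and distances between 0 and 2, both hinge terms are active, and the calculation reads:

```latex
% Assumption: squared hinge terms, M_p = 0, M_n = 2, 0 \le \Delta \le 2.
L(\Delta) = Y \Delta^2 + (1 - Y)(2 - \Delta)^2
\\
\frac{\partial L}{\partial \Delta} = 2 Y \Delta - 2 (1 - Y)(2 - \Delta) = 0
\quad\Longrightarrow\quad
\Delta^{*} = 2\,(1 - Y)
```

Under this assumption, the equilibrium distance decreases linearly with the similarity: Y = 1 gives a distance of 0, Y = 0 gives a distance of 2, and a pair that agrees in two of three properties (Y = 2/3) is driven towards a distance of 2/3.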
The contrastive loss in eq. 5 assumes the values of the similarity function Y to be certain, which is only the case for complete samples (cf. section 3.2). To be able to use incomplete training samples as well, we have to adapt that loss function to consider the uncertainty of the similarity (eq. 6). The only difference between eqs. 5 and 6 is that in the latter, we use the indicators for positive (Yp) and negative (Yn) similarity according to eq. 3 instead of Y and (1 − Y) as weights for the two terms in the loss function. Again, the relative size of the weights will determine the point of equilibrium for minimizing the loss, but as the sum of the weights, (Yp + Yn), is smaller than 1 (cf. section 3.2), the total impact of a sample on the gradients will be smaller. Thus, more uncertain samples have a smaller influence on the training process, which is rather intuitive. We push this thought further by also adapting the margins Mp and Mn from eq. 6, making them dependent on the degree of uncertainty of the similarity of a sample (eq. 7), where π_i^1 and π_i^2 are defined according to eq. 4. For complete samples, this results in Mp = 0 and Mn = 2, as in the earlier case. For incomplete samples, Mp and Mn will be placed symmetrically around 1, because Mn = 2 − Mp(x1, x2). The larger the number of properties without annotations, the larger Mp and the smaller Mn. The force pulling the distance towards 0 will only act if the distance is larger than Mp, and the one pushing the distance away from 0 only acts as long as it is smaller than Mn. The larger Mp and, thus, the smaller Mn, the smaller the impact of a sample on the minimization process. Thus, by adapting the two radii according to eq. 4, the uncertainty of the similarity information is again used to modulate the impact of a training sample on the resultant parameters.
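The behaviour described above can be sketched in a few lines. Eqs. 6 and 7 are not reproduced in this text, so the sketch rests on assumptions: squared hinge terms for the loss, the indicators Yp and Yn computed as the fractions of matching and differing properties among all I properties, and margins Mp = 1 − c and Mn = 1 + c, where c is the fraction of properties annotated in both images. These margins satisfy the stated properties (Mp = 0 and Mn = 2 for complete samples, symmetry around 1). Missing annotations are encoded as `None`; all names are hypothetical.

```python
def contrastive_loss_uncertain(distance, labels1, labels2):
    """Assumed form of eqs. 6 and 7 for one image pair.

    distance: Euclidean distance of the two (unit-length) feature vectors.
    labels1, labels2: tuples of property labels, None for a missing annotation.
    """
    total = len(labels1)
    known = [(a, b) for a, b in zip(labels1, labels2)
             if a is not None and b is not None]
    y_p = sum(a == b for a, b in known) / total  # positive similarity
    y_n = sum(a != b for a, b in known) / total  # negative similarity
    coverage = len(known) / total                # share of usable properties
    m_p = 1.0 - coverage                         # assumed positive margin
    m_n = 1.0 + coverage                         # assumed negative margin
    pull = y_p * max(0.0, distance - m_p) ** 2   # acts if distance > Mp
    push = y_n * max(0.0, m_n - distance) ** 2   # acts if distance < Mn
    return pull + push


# Same distance, same matching label, but different completeness.
loss_complete = contrastive_loss_uncertain(0.5, ("a", "b", "c"), ("a", "b", "c"))
loss_incomplete = contrastive_loss_uncertain(0.5, ("a", None, None), ("a", "b", "c"))
```

In this toy case the complete pair is still pulled together at a distance of 0.5, whereas the incomplete pair, whose positive margin is 2/3, no longer contributes to the loss.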

Triplet Training:
Our second training strategy uses the triplet architecture depicted in fig. 3. The network takes three input images xa, xp, xn and propagates each of them through a CNN branch. Again, all branches share their weights w. Consequently, this approach requires triplets of training samples, but like the first strategy, it only requires similarity information for image pairs. In this context, xa is the anchor sample, xp is a 'positive' sample, meaning that Y(xa, xp) = 1, and xn is a 'negative' sample, meaning that Y(xa, xn) = 0. The CNN branches deliver three feature vectors f(xa), f(xp), f(xn), from which the L2 distances ∆(f(xa), f(xp)) and ∆(f(xa), f(xn)) are calculated. For determining the parameters of the network, the triplet loss function (Schroff et al., 2015) can be applied (eq. 8), where L(xa, xp, xn) is the loss and M is the margin, which can in principle be chosen freely; during training, the difference between the feature distances ∆(f(xa), f(xp)) and ∆(f(xa), f(xn)) is pushed to be at least M. In other words, the network should learn to deliver feature vectors f(xa), f(xp) that are more similar to each other than the feature vectors f(xa), f(xn), meaning that the feature vectors for a similar image pair only need to be more similar than the features for a dissimilar pair. Note that this loss function only works for a binary definition of similarity according to eq. 1. In other words, it can only be applied when a single property is considered for defining similarity.
To extend the triplet loss to multiple properties, the margin M is made dependent on the similarity of the samples (eq. 9). In eq. 9, the function Y(·) is the similarity function defined in eq. 2. The restriction M > 0, i.e. the requirement that M must be larger than 0, is important for a triplet of input samples to be a valid triplet, i.e. the anchor and positive samples have to be more similar to each other than the anchor and negative samples. The margin ensures that the distance between descriptors of similar images is smaller than that of descriptors of dissimilar images. We can use the triplet loss function defined by eqs. 8 and 9 only with samples for which all labels are known. In order to also use incomplete samples, we have to redefine the margin M: M(xa, xp, xn) = min(Yp(xa, xp), Yn(xa, xn)), with M > 0 required, (10) using the definitions from eq. 3 for Yp and Yn. In this case, the margin M also represents the uncertainty of the similarity; the larger the uncertainty (i.e. the more labels are unknown), the smaller the margin. This means that the descriptors for pairs of similar and dissimilar images are allowed to be close to each other if the number of available annotations is small.
An important aspect of the training procedure is the definition of the image triplets. Given a minibatch of size B, we calculate the margin M(xi, xj, xk) for every triplet of samples {xi, xj, xk} with i, j, k ∈ {1, ..., B} using either eq. 10 or eq. 9, depending on whether incomplete samples are used or not. In this process, we make sure that all images of a triplet are different. Of the remaining triplets, we retain those fulfilling the restriction M > 0 and use them for training.
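The selection of valid triplets from a minibatch can be sketched as follows for the margin of eq. 10. The sketch is illustrative only: Yp and Yn are computed with an assumed form of eq. 3 (fractions of matching and differing properties among all I properties, counting only properties annotated in both images), the hinge form of the triplet loss on plain distances is an assumption because eq. 8 is not reproduced in this text (Schroff et al. use squared distances), and all names and labels are hypothetical.

```python
from itertools import permutations


def similarity_indicators(labels1, labels2):
    """Assumed form of eq. 3: (Yp, Yn); None marks a missing annotation."""
    total = len(labels1)
    known = [(a, b) for a, b in zip(labels1, labels2)
             if a is not None and b is not None]
    y_p = sum(a == b for a, b in known) / total
    y_n = sum(a != b for a, b in known) / total
    return y_p, y_n


def valid_triplets(batch_labels):
    """Return (anchor, positive, negative, margin) index tuples with M > 0.

    permutations() only yields triplets of three different samples.
    """
    triplets = []
    for a, p, n in permutations(range(len(batch_labels)), 3):
        y_p, _ = similarity_indicators(batch_labels[a], batch_labels[p])
        _, y_n = similarity_indicators(batch_labels[a], batch_labels[n])
        margin = min(y_p, y_n)  # eq. 10
        if margin > 0:
            triplets.append((a, p, n, margin))
    return triplets


def triplet_loss(d_ap, d_an, margin):
    """Assumed hinge form of eq. 8 on the anchor-positive/negative distances."""
    return max(0.0, d_ap - d_an + margin)


# Toy minibatch with two properties; sample 0 lacks the second annotation.
batch = [("a", None), ("a", "x"), ("b", "y")]
triplets = valid_triplets(batch)
```

Enumerating all ordered triplets is cubic in the batch size B, which is unproblematic for a toy batch but may call for a vectorised implementation in practice.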

Dataset and test setup
4.1.1 Dataset: To evaluate our proposed methods we use data extracted from the publicly available database of the Centre de Documentació i Museu Tèxtil in Terrassa (Spain) (IMATEX, 2018). This database consists of thousands of RGB images of silk fabrics with annotations about their semantic properties; as an example, we consider the three variables production place, production technique and production timespan. The annotations for these properties are incomplete, so that there is a considerable number of incomplete samples. The dataset used in this paper is identical to the one used by Dorozynski et al. (2019). It was generated automatically from the online collection (IMATEX, 2018); in this process, the raw annotations were mapped to a standardized class structure. For details of the procedure the reader is referred to (Dorozynski et al., 2019). The dataset consists of 8192 images for which at least one property is known. We call it the comprehensive set, because it contains both the 5071 incomplete and the 3121 complete samples (for which all properties are known). All images are scaled such that the larger dimension (height or width) is exactly 400 pixels; the other, possibly smaller, dimension varies between 25 and 400 pixels. The class structure as well as the number of samples that are available for the individual classes are shown in tab. 1.

Test setup and evaluation strategy:
We evaluate the method in two ways. First, we apply a quantitative evaluation based on a k-NN classification. For that purpose, after training we build a k-d tree from the descriptors of all training samples and query the tree with the descriptors of all test samples; we retrieve the k = 5 nearest neighbours and predict the properties of the queried samples by taking the majority vote of the properties of the nearest neighbours. We evaluate this classification by comparing the predicted labels to the reference labels. We report the overall accuracy (i.e., the percentage of correct predictions) and the F1-score for every class, i.e. the harmonic mean of precision and recall. Precision is defined as TP / (TP + FP) and recall as TP / (TP + FN), where TP is the number of samples of a class that were classified correctly (true positives), FP is the number of samples that were assigned to that class but belong to another one in the reference (false positives), and FN is the number of samples of that class that were assigned to another class than the one they belong to in the reference (false negatives). As our definition of similarity is based on similarity of properties, in this way we can assess if for a test sample the nearest neighbours among the training samples in feature space really have the same properties, and the results can also be compared to those of Dorozynski et al. (2019).
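The quality measures defined above can be computed as in the following sketch. The function names and the toy labels are made up for illustration, and returning 0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
def overall_accuracy(reference, predicted):
    """Share of samples whose predicted label equals the reference label."""
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)


def f1_per_class(reference, predicted):
    """F1-score (harmonic mean of precision and recall) for every class."""
    scores = {}
    for cls in sorted(set(reference) | set(predicted)):
        pairs = list(zip(reference, predicted))
        tp = sum(r == cls and p == cls for r, p in pairs)  # true positives
        fp = sum(r != cls and p == cls for r, p in pairs)  # false positives
        fn = sum(r == cls and p != cls for r, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[cls] = 2 * precision * recall / (precision + recall)
        else:
            scores[cls] = 0.0
    return scores


# Toy example with two hypothetical classes.
reference = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
accuracy = overall_accuracy(reference, predicted)
f1 = f1_per_class(reference, predicted)
average_f1 = sum(f1.values()) / len(f1)
```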
Secondly, we compute the distances between the feature vectors of all test samples and compare them to the similarity values.
Due to the normalization of the feature vectors, this cannot be done directly, but only by means of a regression analysis. Here, for lack of space we only report the correlation coefficient, which gives us the degree of linear dependency between the feature distance and the similarity according to our definition.
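As an illustration of this second evaluation, the sketch below computes a correlation coefficient between similarity values and feature distances. The text does not state which coefficient is used; the Pearson coefficient is assumed here because it measures linear dependency. The toy values are invented and chosen so that the distance decreases linearly with the similarity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: similarities of four image pairs and their distances.
similarities = [1.0, 2.0 / 3.0, 1.0 / 3.0, 0.0]
distances = [0.0, 2.0 / 3.0, 4.0 / 3.0, 2.0]
r = pearson(similarities, distances)
```

A value close to −1 would indicate that larger similarity values go along with smaller feature distances, which is the behaviour the training aims for.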
In all experiments we split the data into training, validation and test sets consisting of 60%, 20% and 20% of the data, respectively. The evaluation is based on a five-fold cross validation. In each cross validation iteration, a different set of images is used for testing, so that over the course of all iterations each sample is used for testing once. We initialize the CNN as described in section 3.3. All images are scaled to the required input size of 224 × 224 pixels before being propagated through the network. We train our proposed networks for 300 iterations with a batch size of 100. Training is based on stochastic gradient descent using Adaptive Moments (Kingma & Ba, 2014) and the standard parameters (β1 = 0.9, β2 = 0.999 and ε̂ = 1 · 10^−8), except for the learning rate of 1 · 10^−4. During training, we also fine-tune the last layer of the ResNet-152 network. For regularization purposes, we apply early stopping based on the validation loss.
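One way to obtain such splits is sketched below. It is an assumption about the procedure, not a description of the authors' code: the samples are shuffled and divided into five folds, each fold serves as the test set once, the following fold is used for validation and the remaining three folds for training. The function name and the seed are hypothetical.

```python
import random


def five_fold_splits(n_samples, seed=0):
    """Yield (train, validation, test) index lists for five iterations.

    Each fold holds about 20% of the samples; using the fold after the
    test fold for validation is an assumption made for this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    for i in range(5):
        test_fold = i
        val_fold = (i + 1) % 5
        train = [idx for j, fold in enumerate(folds)
                 if j not in (test_fold, val_fold) for idx in fold]
        yield train, folds[val_fold], folds[test_fold]


splits = list(five_fold_splits(100))
```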
For the k-NN analysis, we carried out six different experiments to compare different network variants and different definitions of similarity. First, we investigate a single-property scenario. Thus, in experiment I, we trained three individual networks (one per property) using the Siamese architecture and the standard contrastive loss (eq. 5) based on the similarity according to eq. 1; in experiment II we also trained three such networks, but using the triplet architecture and the triplet loss (eq. 8) based on the definition of the margin from eq. 9. Both experiments are carried out using the comprehensive set of samples, for each property using all samples with an annotation for that property.
In experiments III and IV, we evaluate the performance of k-NN classification when considering all properties at once and using only complete samples. In both cases, only one CNN is trained. In experiment III, we use the Siamese architecture and the standard contrastive loss (eq. 5) based on the similarity according to eq. 2, while experiment IV is based on the triplet architecture and the triplet loss using the margin from eq. 9.
Finally, in experiments V and VI, we want to assess the impact of using incomplete samples. Thus, in experiment V, we train a CNN using the Siamese architecture with the variant of the contrastive loss according to eq. 6, using the definition of similarity according to eq. 7; in experiment VI, we use the triplet architecture and the triplet loss based on the definition of the margin according to eq. 10. In both cases, the comprehensive set of samples is used for training. Nevertheless, the evaluation is based on the complete samples only, so that the comparison to experiments III and IV is based on the same data.

K-NN Classification:
The overall accuracies for all semantic properties of the six experiments are shown in tab. 2; the highest values are highlighted in bold font. These results show that considering multiple semantic properties at once (experiments III-VI) generally performs better than just considering one property at a time (experiments I, II); the improvement is in the order of 5%-10%. However, they also show that considering incomplete samples in the training procedure (experiments V, VI) leads to a drop in performance compared to using only complete samples (experiments III, IV). This drop is more prominent when comparing the results obtained when using the triplet loss (cf. experiment IV vs. VI): in this case, considering also incomplete samples leads to a drop of 4.9% in the mean overall accuracy. It is less prominent (1.1% in mean overall accuracy) when comparing the results obtained when using the Siamese architecture (cf. experiment III vs. V). The results thus show that when using multiple properties for defining similarity, the triplet loss performs better than the contrastive loss, but only when exclusively complete samples are used for training. This indicates that the training procedure based on the triplet loss is less robust to incomplete information than the procedure based on the Siamese architecture and the contrastive loss. A possible explanation is that for incomplete samples, the margin between a positive and a negative pair is larger when the contrastive loss is used than when the triplet loss is used. Consequently, descriptors of dissimilar images may be closer to each other in feature space, leading to a worse performance of the k-NN classification when the triplet loss is used.
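The evaluation protocol behind these numbers, i.e. predicting a property of a test image from its k nearest training images in feature space and computing overall accuracy and mean F1-score, can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the value k = 3, the Euclidean distance, the majority vote and the toy two-dimensional features are all hypothetical and not taken from the experiments.

```python
import numpy as np
from collections import Counter

def l2_normalize(x):
    """Scale every feature vector (row) to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def knn_predict(train_feats, train_labels, test_feats, k=3):
    """Majority vote over the k nearest training features (Euclidean)."""
    diffs = test_feats[:, None, :] - train_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)          # shape (n_test, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]     # indices of the k-NN
    preds = [Counter(train_labels[row].tolist()).most_common(1)[0][0]
             for row in nearest]
    return np.array(preds)

def overall_accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def mean_f1(y_true, y_pred):
    """Unweighted average of the class-wise F1-scores."""
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy data: two well-separated classes (0 and 1) of one property.
train_feats = l2_normalize(np.array([[1.0, 0.0], [0.9, 0.1],
                                     [0.0, 1.0], [0.1, 0.9]]))
train_labels = np.array([0, 0, 1, 1])
test_feats = l2_normalize(np.array([[0.95, 0.05], [0.05, 0.95]]))
test_labels = np.array([0, 1])

pred = knn_predict(train_feats, train_labels, test_feats, k=3)
acc = overall_accuracy(test_labels, pred)
f1 = mean_f1(test_labels, pred)
```

In a multi-property setting, the same prediction step would be repeated once per property, each time using the labels of that property for the same nearest neighbours.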
Independently of the specific problems of the triplet loss with incomplete samples, there might be another reason why the results achieved when only considering complete samples are better than those achieved when including incomplete ones. In the training procedures of experiments V and VI, only about 38% of the samples are complete, which means that 62% are incompletely labelled and, thus, dominate the training process. As those incomplete samples do not reflect the interdependencies between the properties as complete samples arguably do, this learning procedure might lead to a loss of generality for the learned features, thus resulting in decreased quality measures.
We also compare the results of our proposed approach to those achieved by multi-task learning in Dorozynski et al. (2019), which are based on the same dataset. We focus on the best variants in both papers, thus comparing the results achieved for completely labelled samples, i.e. experiment IV in this paper and variant MTL-C in (Dorozynski et al., 2019). The quality metrics for the experiments of both approaches are shown in tab. 4. The comparison shows that both approaches are on par with each other. While multi-task learning performs better in predicting the production place, our approach has a slightly better overall performance, reflected in an improvement of both the mean overall accuracy (0.5%) and the mean F1-score (1.1%). We can conclude that when using our approach to learn the semantic similarity between images of silk fabrics, the k-NN analysis is very likely to retrieve images having similar properties; if we predict the properties from the k nearest neighbours, the quality is as good as that of a CNN trained to predict the semantic properties directly.
[...] training samples and J = 624 test samples. For every pair, we also calculated the similarity indicator Yij based on the annotations using eq. 2.
Based on the resultant I · J tuples (∆ij, Yij) of distances and similarity indicators, we calculated the correlation coefficient ρ∆Y between the two variables. The average correlation coefficient from all cross-validation iterations was ρ∆Y = −0.90. This high negative correlation shows that there is a high degree of linear dependency between the two variables: the larger the distance ∆ij between two feature vectors, the smaller their respective similarity Yij (and vice versa). We take this as an indicator that our proposed method can in fact be used to train a CNN to produce feature vectors such that their distances can be used to measure the similarity of images.
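The correlation analysis described above can be sketched as follows. The sketch assumes Euclidean distances between normalized feature vectors and the Pearson correlation coefficient; the function names are hypothetical, and the similarity values are a synthetic stand-in rather than values computed from annotations with eq. 2.

```python
import numpy as np

def l2_normalize(x):
    """Scale every feature vector (row) to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pairwise_distances(train_feats, test_feats):
    """Distances Delta_ij between all I training and J test features."""
    diffs = train_feats[:, None, :] - test_feats[None, :, :]
    return np.linalg.norm(diffs, axis=2)           # shape (I, J)

def correlation(deltas, similarities):
    """Pearson correlation over all I * J tuples (Delta_ij, Y_ij)."""
    return float(np.corrcoef(deltas.ravel(), similarities.ravel())[0, 1])

# Synthetic example: I = 6 "training" and J = 5 "test" feature vectors.
rng = np.random.default_rng(0)
train_feats = l2_normalize(rng.normal(size=(6, 4)))
test_feats = l2_normalize(rng.normal(size=(5, 4)))
deltas = pairwise_distances(train_feats, test_feats)

# Stand-in similarity that decreases linearly with the distance, so the
# correlation is -1 up to rounding; real values would come from eq. 2.
toy_similarity = 1.0 - deltas / 2.0
rho = correlation(deltas, toy_similarity)
```

With real annotations, the relation is not perfectly linear, so the coefficient lies between −1 and 0 if larger distances go along with smaller similarities.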

CONCLUSION
In this paper we have presented several approaches for CNN-based learning of the similarity of images of silk fabrics based on semantic properties. The advantage of a definition of similarity based on semantic properties is that the training data can be generated automatically if a database with annotated images is available. We proposed two methods for training a CNN, based on a Siamese and on a triplet architecture, respectively. We compared different variants of the loss function designed to deal with different definitions of similarity based on semantic annotations. We evaluated our methods using a k-NN classification. Our experiments showed that considering multiple semantic properties simultaneously is beneficial for learning the similarity between images, but only if completely labelled training samples are used. Our experiments also indicated that the triplet loss is less robust against incomplete labels than the contrastive loss. In general, k-NN classification based on our definition of similarity performed on par with a task-specific classifier (Dorozynski et al., 2019).
In future work we would like to include additional collections of images of silk fabrics. This would give us additional training samples and, possibly, a more balanced class distribution. However, introducing data from additional collections might pose a problem regarding the transferability between these collections. One way to solve this potential problem would be to use domain adaptation (Wang & Deng, 2018). Apart from introducing new data from additional collections, we would also like to consider additional semantic properties, such as motif or production material. As our results indicate that exploiting potential interdependencies between the properties is beneficial for learning the similarity, we assume that considering additional properties could further improve the process. We would also like to investigate a combination of multi-task classification and similarity learning, e.g. by combining our proposed network architecture and its (similarity-based) loss functions with the (classification) loss function of (Dorozynski et al., 2019). This approach could be used in the context of multi-task classification, where the network uses learned features for the prediction of multiple semantic variables. We think that guiding the network to produce dissimilar features for dissimilar inputs will improve the classification performance.
Another extension could be to apply weights to the individual properties in the similarity functions. This weighting can be based on information provided by art historians in order to give more importance to certain properties, as the domain experts might consider them to be of greater relevance for assessing the similarity of fabrics. In this context, we would also like to investigate whether those weights could instead be learned by the network if domain experts can provide us with labelled pairs of images of similar / dissimilar fabrics.