GENERATING SYNTHETIC TRAINING DATA FOR OBJECT DETECTION USING MULTI-TASK GENERATIVE ADVERSARIAL NETWORKS

Digitizing roadside objects, such as traffic signs, is a necessary step in generating High Definition Maps (HD Maps) and remains an open challenge. Deep learning with Convolutional Neural Networks (CNNs) has achieved great success in computer vision in recent years. However, the performance of most deep learning algorithms depends heavily on the quality of the training data. Collecting the desired training dataset is difficult, especially for roadside objects, because their numbers along the roadside are imbalanced. Although training neural networks with synthetic data has been proposed, the distribution gap between synthetic and real data still exists and can degrade performance. We propose to transfer the style between synthetic and real data using Multi-Task Generative Adversarial Networks (SYN-MTGAN) before training the neural network that detects roadside objects. Experiments focusing on traffic signs show that our proposed method reaches an mAP of 0.77 and improves detection performance for objects whose training samples are difficult to collect.


INTRODUCTION
In recent years, images (including panoramic images) and point clouds collected by Mobile Mapping Systems (MMS) have been used to generate HD maps, which can be applied to autonomous driving, smart cities, etc. HD maps include specific information on roadside environments, including road objects (road lines, crosswalks, etc.) and roadside objects such as traffic signs. However, creating data for these objects still depends heavily on operators' manual work, which is costly in terms of both time and money.
Benefiting from recent advances in deep learning, several methods have been proposed to extract objects of interest from images or point clouds using CNN-based algorithms. (Wolf et al., 2019) proposed a method to detect manholes and road markings by semantic segmentation using images rendered from point clouds. A CNN algorithm proposed by (Mori et al., 2018) classifies the categories of pole-like objects. However, CNN-based algorithms are highly dependent on the quality of the training dataset, whose creation is both time-consuming and costly. Moreover, the number of roadside objects in real conditions is highly imbalanced. Take traffic signs as an example. Figure 1 shows the number of samples in each category of traffic signs collected by MMS in Japan (Lin et al., 2018). For approximately 70 categories of traffic signs, it is hardly possible to collect enough samples to train a CNN model. In contrast, the remaining 20 categories contribute more than 90% of the samples in the dataset. This is known as the long-tail phenomenon, which can cause a significant performance drop.
To tackle the aforementioned problem, we propose a method to generate synthetic training samples for training an object detector. The flowchart of our proposed method is shown in Figure 2. More specifically, we propose the SYN-MTGAN architecture, which transfers the style between synthetic data and real data, to generate training samples that can improve the detection performance for objects whose real training samples are difficult to collect in the real scene. The proposed SYN-MTGAN generates training samples and predicts the category of the generated samples at the same time. In this study, we focus on traffic signs to verify the effectiveness of our proposed method.
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume V-2-2020, 2020 XXIV ISPRS Congress (2020 edition)

Object Detection
In the Object Detection task, the category and position of a target object are predicted at the same time. R-CNN (Girshick et al., 2014) first adopted a CNN to tackle this task with a two-stage strategy, which generates proposals of target regions using Selective Search (Uijlings et al., 2013) and classifies the category of each region using AlexNet (Krizhevsky et al., 2012), a classic CNN architecture for image classification. R-CNN was improved in terms of performance and speed in the following years. Faster R-CNN (Ren et al., 2015) proposed a trainable neural network, the Region Proposal Network (RPN), to generate proposals of target regions. This framework has influenced numerous methods proposed in the following years. In contrast, one-stage methods focus on real-time processing. SSD (Liu et al., 2016) and YOLO (Redmon et al., 2016) are representative examples. The key idea of a one-stage method is to predict the proposal regions and the category of the target object at the same time using a single network, whereas a two-stage method handles these two tasks in separate networks.

Synthetic to Real
Both one-stage and two-stage Object Detection algorithms suffer from imbalanced training samples during training. A common solution is hard negative mining. Online Hard Example Mining (OHEM) (Shrivastava et al., 2016) proposed to use hard negative samples, selected according to their loss, more frequently than normal ones during training. Instead of collecting samples from real scenes, another solution is to generate synthetic data from a simulator. (Hinterstoisser et al., 2017) trained an object detector using synthetic data and features pre-trained on real data. (Abu Alhaija et al., 2018) augmented real scene images with synthetic images to train a network for driving scene recognition. (Cordts et al., 2016) trained Faster R-CNN using a large synthetic dataset and a small real scene dataset. (Wu et al., 2017) raised the proportion of real scene data by combining synthetic data created from the Grand Theft Auto game with KITTI (Geiger et al., 2012) data for training a neural network. Furthermore, to reduce the gap between the distributions of synthetic and real scenes, (Shrivastava et al., 2017), (Bujwid et al., 2018), (Huang et al., 2018) and (Yu et al., 2019) developed GAN-based frameworks that generate more realistic synthetic images by projecting or mapping the real feature distribution onto synthetic images. (Zheng et al., 2018) applied this idea to depth estimation and reported promising results. Domain randomization and domain adaptation have also been applied to minimize the gap. (Tremblay et al., 2018) used domain randomization to force the network to focus on learning semantic features rather than features such as color, brightness, and texture, which are usually determined by the dataset. Such strategies have proved effective in minimizing the gap between synthetic data (source domain) and real scene data (target domain).
However, these algorithms cannot solve the aforementioned problem when the target domain does not exist because of a lack of real scene samples. Solving the problem we are facing requires creating synthetic data that is similar to real scene data. However, little research focuses on transferring synthetic images to real ones for object detection tasks, especially when multiple similar categories exist.

Generative Adversarial Networks
Generative Adversarial Networks (GAN) were proposed to generate images (Goodfellow et al., 2014) and have achieved significant results in many computer vision tasks. Commonly, a GAN consists of two neural networks, the Generator and the Discriminator. The Generator generates a synthetic image from a random vector, while the Discriminator classifies whether the input image is a synthetic image created by the Generator or a real one. The Generator and Discriminator are trained simultaneously, which is called adversarial training. Eventually, the Generator learns to generate synthetic images whose appearance is close to that of real images. The quality of images generated by GANs has improved notably. (Karras et al., 2019) proposed a Generator with a hierarchical architecture to recover details such as the eyes and expressions of human faces. Besides random vectors, a GAN can also take images and natural language as input to generate images; for example, natural language has been used to generate a user-preferred image. GANs have also proved effective for difficult Object Detection tasks such as small object detection. SOD-MTGAN (Bai et al., 2018) used a multi-task Discriminator to help the Generator produce better super-resolution images of human heads, which achieved higher performance on the human head detection task, especially for small heads. The experiments in (Bai et al., 2018) show that a GAN can generate images that improve the performance of Object Detection. Additionally, a GAN can transfer the style between two sets of images. CycleGAN, proposed by (Zhu et al., 2017), uses two Generators and two Discriminators to learn the style of each set of images and transfer it to the other. The CycleGAN architecture enables the model to learn a consistency loss between the two Generators, which results in more stable and better outputs.
These previous studies of GAN show that images generated by GAN can be utilized as training samples to train an object detector.

Overview
We propose the SYN-MTGAN architecture, inspired by SOD-MTGAN and CycleGAN, to generate synthetic training data for Object Detection when real scene samples are difficult to collect. Specifically, our method takes fake images and real images of target objects as input data for training SYN-MTGAN. From the real images, it learns the distributions of features, such as texture and illumination, that exist in real scenes but are difficult to reproduce with a simulator, and transfers them to the fake images. There are two key points in our proposed method. First, the Discriminator (hereinafter, called D) of SYN-MTGAN not only classifies the input as fake or real but also predicts the category of the target object at the same time, i.e., it is a multi-task neural network. Second, we propose a new loss function to encourage SYN-MTGAN to transfer only the style of the foreground rather than that of the background.
To verify its effectiveness, we conducted experiments in which an Object Detection model trained on the images generated by SYN-MTGAN is compared with a model trained on real scene data.

SYN-MTGAN Architecture
The architecture of a normal GAN is shown in Figure 3 (a). In a normal GAN, fake images (hereinafter, called y') generated by the Generator (hereinafter, called G) from input x, along with real images (hereinafter, called y), are fed into D, which predicts whether its input is a generated y' or a real y (fake/real). In this study, to improve the quality of generated images and stabilize GAN training, we designed a GAN architecture inspired by CycleGAN. The architecture of CycleGAN, an end-to-end framework for image style transfer, is shown in Figure 3 (b). CycleGAN contains four networks: two Generators and two Discriminators. The basic idea of CycleGAN is to train the four networks to learn the style of image x and transfer this learned style to image y, which has a different style, using the cycle consistency loss. CycleGAN also learns the style of y and transfers it to x, which has proved effective for boosting performance.
In the forward cycle, Generator G is trained to generate a fake image y' by learning the feature distribution (style) of real image y and transferring it to synthetic image x, while another Generator F is trained to generate x'' from y' by mapping from the real feature distribution back to the synthetic one. As a result, x'' is remapped to its original style and should be as similar to x as possible; this process can be considered a recovery of x. y' and y are the inputs to Discriminator Dy for fake/real classification. The reverse cycle performs the opposite mapping. Generator F is trained to generate another kind of fake image (hereinafter, called x') by learning the style of synthetic image x and transferring it to real image y, while x' is remapped by G to y'', which should be similar to the original y. x' and x are the inputs to Discriminator Dx for fake/real classification.
The cycle consistency loss calculates the divergence between x'' and x and between y'' and y, and encourages the Generators to learn the mapping between the two distributions in each direction. The flowchart of cycle consistent training is shown in Figure 4. The purpose of G and F is to learn the distributions of the two sets of images. Note that a CycleGAN-style architecture can be trained using unpaired images, which reduces the cost of preparing synthetic and real images.
The D of our proposed SYN-MTGAN is a multi-task architecture. Besides the fake/real classification performed by a general GAN Discriminator, it also predicts the category of the target object, in both Dx and Dy. The classification task predicts the category of a target object, similar to normal image classification.

Network Architecture
The details of the network architecture are shown in Figure 5. G adopts an Encoder-Decoder style, which contains two convolutions with a stride of 2, nine Residual Blocks (He et al., 2016), and two transposed convolutions with a stride of 1/2 (i.e., 2× upsampling). Instance normalization (Ulyanov et al., 2016) is adopted in both G and D, as it has shown better performance in image style transfer. The architecture of D is also based on CycleGAN. Specifically, four convolutions with a stride of 2 and one convolution with a stride of 1 extract a feature map whose side length is 1/16 of that of the original input image. This feature map is shared by two parallel branches, discriminative recognition (fake/real) and classification. The classification branch is connected to the feature map by one convolution and two fully connected layers.
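The down-sampling and up-sampling described above can be checked with simple shape arithmetic. The following is a minimal sketch, not the authors' implementation: only the layer counts and strides come from the text, while the kernel size of 3, padding of 1, and output padding of 1 are assumptions chosen for illustration, and the function names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int, stride: int, padding: int,
               output_padding: int) -> int:
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def discriminator_feature_size(size: int, kernel: int = 3, padding: int = 1) -> int:
    # Four stride-2 convolutions followed by one stride-1 convolution (from the text).
    for stride in (2, 2, 2, 2, 1):
        size = conv_out(size, kernel, stride, padding)
    return size


def generator_output_size(size: int, kernel: int = 3, padding: int = 1) -> int:
    # Two stride-2 convolutions (encoder); the residual blocks are assumed
    # to keep the spatial size unchanged.
    for _ in range(2):
        size = conv_out(size, kernel, 2, padding)
    # Two transposed convolutions (decoder), each doubling the size.
    for _ in range(2):
        size = deconv_out(size, kernel, 2, padding, output_padding=1)
    return size


print(discriminator_feature_size(256))  # 16, i.e. 256 / 16
print(generator_output_size(256))       # 256, same size as the input
```

Under these assumptions, a 256×256 input yields a 16×16 feature map in D, and G returns an image of the same size as its input.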

Loss Function
We adopted five loss functions to train SYN-MTGAN. The adversarial loss and cycle consistency loss, the same as in CycleGAN, are adopted to optimize G and D. We also adopted a classification loss to optimize G and D. In addition, we propose a revised version of the identity loss of CycleGAN to strengthen G's focus on the target area rather than the background.
Figure 5. Details of Network Architecture. Conv denotes convolution, k denotes kernel size, and s denotes stride size. For example, k3s2 denotes a convolution with a kernel size of 3 and a stride size of 2.

Adversarial Loss:
Adversarial loss (hereinafter, called $L_{adv}$) encourages G to generate fake images that are as close as possible to real ones so that they can fool D, and encourages D not to be fooled by the images that G generates. The loss is given in Equation (1), in which $D_y(G(x))$ denotes the probability that the generated image $G(x)$, or y', is a real image,
where $x_n$, $y_n$ = training sample n, m = number of training samples
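For orientation, one plausible form of the adversarial loss for the pair (G, Dy) is sketched below. This is an assumption based on the standard GAN objective (Goodfellow et al., 2014) and the symbols defined in the text, not the authors' exact Equation (1); the symbol name $L_{adv}$ is also assumed, and implementations that follow CycleGAN often replace the log terms with a least-squares loss.

```latex
% Assumed form: standard log-likelihood GAN objective averaged over m samples
L_{adv}(G, D_y) = \frac{1}{m} \sum_{n=1}^{m}
  \Big[ \log D_y(y_n) + \log\big(1 - D_y(G(x_n))\big) \Big]
```

D is trained to maximize this quantity and G to minimize it; a symmetric term would apply to the pair (F, Dx).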

Cycle Consistency Loss:
CycleGAN calculates the difference between x and x'', and between y and y'', as the cycle consistency loss (hereinafter, called $L_{cyc}$) to encourage G and F to learn the distributions of synthetic and real images, as Equation (2) shows. $F(G(x))$ denotes the image x'', which is generated by F from y', which in turn is generated by G. A smaller loss indicates that G and F recover x and y better. $L_{cyc}$ uses the L1 loss to calculate the difference.
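Based on the description above, the cycle consistency loss can be written as follows. This is a sketch consistent with the text and with the original CycleGAN formulation; averaging over the m training samples is an assumption.

```latex
% x'' = F(G(x)) and y'' = G(F(y)); both differences use the L1 norm
L_{cyc}(G, F) = \frac{1}{m} \sum_{n=1}^{m}
  \Big[ \big\| F(G(x_n)) - x_n \big\|_1 + \big\| G(F(y_n)) - y_n \big\|_1 \Big]
```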

Identity Loss:
The objective of G and F is to learn the distribution of target features from two sets of images. However, the Generator of a GAN tends to transfer the features it learns to the whole image, including the background. CycleGAN uses the identity loss (hereinafter, called $L_{idt}$), which calculates the L1 loss between the input and output of the Generator, to encourage G and F to focus on learning the features of the target object rather than the background. We propose a revised version of $L_{idt}$, shown in Equation (3). The proposed $L_{idt}$ focuses only on changes in the background by masking out the foreground target object using the bounding box information,
where $x^*$, $y^*$ = background area of an image

Classification Loss:
The classification branch in the Discriminator is reported to encourage the Generator to generate fake images that are easier for an object detector to classify (Bai et al., 2018). We calculate the classification loss (hereinafter, called $L_{cls}$) using the cross-entropy loss for each D. The equation of $L_{cls}$ for Dy is shown in Equation (4), where $D_{cls}(G(x))$ and $D_{cls}(y)$ denote the probabilities that the generated fake image y' and the real image y belong to the true category, respectively.

The overall objective is given in Equation (5), which calculates the weighted sum of the five aforementioned loss functions.
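The background-restricted identity loss described in this section can be illustrated with a small sketch. It is not the authors' implementation: it assumes grayscale images stored as nested lists, a single bounding box per image given as (x_min, y_min, x_max, y_max) with exclusive maxima, and a mean over background pixels; the function name `background_l1` is hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def background_l1(inp: Image, out: Image, box: Box) -> float:
    """Mean absolute difference between a Generator's input and output,
    computed only over pixels outside the bounding box (the background)."""
    x0, y0, x1, y1 = box
    total, count = 0.0, 0
    for r, (row_in, row_out) in enumerate(zip(inp, out)):
        for c, (a, b) in enumerate(zip(row_in, row_out)):
            if x0 <= c < x1 and y0 <= r < y1:
                continue  # foreground pixel: masked out, not penalized
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


# 4x4 input; the output changes every foreground pixel and one background pixel.
inp = [[0.0] * 4 for _ in range(4)]
out = [[0.0] * 4 for _ in range(4)]
for r in range(1, 3):
    for c in range(1, 3):
        out[r][c] = 1.0          # change inside the box: ignored
out[0][0] = 3.0                  # change in the background: penalized
print(background_l1(inp, out, (1, 1, 3, 3)))  # 0.25 (= 3.0 / 12 background pixels)
```

Because changes inside the box do not contribute to the loss, the Generator is free to restyle the traffic sign while being penalized for altering the surrounding background.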

EXPERIMENTS
In this section, we take traffic signs as the target object to evaluate our proposed method, SYN-MTGAN. More specifically, we focus on traffic signs captured in MMS panorama images, for two reasons. (1) A traffic sign is a common object along the roadside, and the distribution of the number of traffic signs, as shown in Figure 1, causes the aforementioned training data collection problem. (2) A traffic sign is an important object to be included in an HD map, yet its detection remains an open challenge.
 (2) are shown in Figure 6 (a).

Dataset
(3) Crop a traffic sign region to generate synthetic image x (hereinafter, called Synthetic Image X). The location of the traffic sign in the panorama image is randomly determined. The category is exported as ground truth for training SYN-MTGAN. Real image y is generated by step (3) from real scene panorama images (hereinafter, called Real Scene Image Y).
Examples of x and y are shown in Figure 6 (d) and (e). We generated 20,000 images of x and 23,736 images of y for training SYN-MTGAN. The sample quantity of each category is listed in Table 1. The distribution of the size in pixels of synthetic image x and real image y is shown in Figure 7. Most training samples are larger than 32² pixels, which indicates that the object size, more specifically the resolution of the object, is not a major difficulty in this study.
Figure 7. Size Distribution of Training Data. Small, middle, and large refer to objects whose sizes are smaller than 32² pixels, between 32² and 96² pixels, and larger than 96² pixels, respectively.
The proposed SYN-MTGAN was evaluated by comparing the performance of traffic sign detection using a classic Object Detection method, Faster R-CNN. Faster R-CNN was trained on three sets of data: Synthetic Scene Image X, Real Scene Image Y, and the result of SYN-MTGAN. The number of images per category is shown in Table 1. Three categories of traffic signs are not included in Y because each has fewer than 100 samples.
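The size grouping used in Figure 7 can be expressed as a short function. This is a minimal sketch: the thresholds of 32² and 96² pixels come from the text, while measuring size as bounding-box area and assigning the boundary values to the middle group are assumptions, and the function name is hypothetical.

```python
def size_category(width: int, height: int) -> str:
    """Group an object by its area in pixels using the 32^2 and 96^2 thresholds."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "middle"   # boundary handling is an assumption
    return "large"


print(size_category(20, 20))    # small  (400 < 1024)
print(size_category(50, 50))    # middle (1024 <= 2500 <= 9216)
print(size_category(100, 100))  # large  (10000 > 9216)
```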

Implementation Details
Our implementation of the proposed SYN-MTGAN is based on the open-source machine learning platform PyTorch (Paszke et al., 2017). We mostly follow the original CycleGAN implementation to train SYN-MTGAN. We use an image size of 256×256 for both x and y. We set the batch size to 24 and the number of epochs to 120 for training. The weights of the loss functions, which are hyperparameters, are set to 10, 0.5, and 0.5, following the experimental settings of the original SOD-MTGAN and CycleGAN. Faster R-CNN is implemented on another open-source machine learning platform, Chainer (Tokui et al., 2015). We use an image size of 3,000×1,125 to train Faster R-CNN. The initial learning rate is set to 0.0001 and reduced every 2 epochs. The number of epochs is 5 because we use VGG16 (Simonyan et al., 2015) pre-trained on ImageNet (Russakovsky et al., 2015) as the backbone network for Faster R-CNN, which leads to quick convergence. The rest of the parameters follow the implementation details of the original Faster R-CNN.

Experiment Results
We carry out several experiments to evaluate our proposed method. First, we compare the performance of Faster R-CNN on traffic signs with four kinds of training data: the SYN-MTGAN result (fake image y'), synthetic scene image X, the result of the original CycleGAN, and real scene image Y. Second, we conduct an ablation study to evaluate three components of the proposed SYN-MTGAN.
Examples of images generated by SYN-MTGAN are shown in Figure 8, along with the inputs, synthetic image x and real image y. The examples show that SYN-MTGAN is able to transfer the style of real images, e.g., texture, and thus generates more realistic images than synthetic image x. In addition, the results indicate that SYN-MTGAN reproduces more realistic edges than x. The edge between the traffic sign and the background in x is sharper than in y. The reason is that x is generated by rendering a template image onto a real scene panorama image and blurring the edge with a Gaussian filter with fixed parameters, and this blur process cannot reproduce the real scene. Furthermore, SYN-MTGAN adjusts the brightness of the traffic sign to match the background, which makes the result more natural than synthetic image x, as shown in Figure 8.
The quantitative evaluation results on the real image dataset of traffic signs that we collected (described in the Dataset section) are shown in Table 2. Faster R-CNN trained on the samples created by our proposed SYN-MTGAN reached an mAP of 0.77, which is better than Faster R-CNN trained on the result of the original CycleGAN or on synthetic scene image X. Although training on real scene image Y reached higher performance, SYN-MTGAN reached considerable accuracy on categories for which real scene images could not be collected, e.g., crossing ahead and no riding double. However, some categories, e.g., speed limits of 30 and 60 km/h, do not reach the expected performance. We attribute this to unstable generation for these categories, especially those with numbers on them. For instance, some generated images of the 60 km/h speed limit are difficult to recognize, as Figure 9 shows. This instability may also cause the generation of unnatural fake images such as the crossing ahead sign in Figure 8 (c). Solving this problem remains future work.

Ablation Studies:
To analyze the importance of each component of our proposed method, we conduct ablation studies by removing one of the proposed components, namely the classification branch or the proposed identity loss. We take the original CycleGAN as the baseline. Furthermore, we also evaluate the localization branch proposed by SOD-MTGAN (Bai et al., 2018), which predicts the bounding box of target objects, similar to Faster R-CNN. Specifically, it predicts the 2D coordinates of the central point, the width, and the height of the target object in an image. The localization branch has proved effective in encouraging the Generators to generate better super-resolution images and in boosting Object Detection performance. We place the localization branch in the Discriminators, parallel to the discriminative branch and the classification branch, and calculate its loss, the localization loss (hereinafter, called $L_{loc}$), using Equation (6). Specifically, we use a Smooth L1 Loss (Ren et al., 2015), shown in Equation (7), for both Dx and Dy to calculate the loss of the bounding box prediction. The weight for $L_{loc}$ is set to 0.5, following the original SOD-MTGAN. The results of the ablation studies are shown in Table 3. The result of Case 4 reveals that our proposed loss function contributes a performance boost of more than 7% for training Faster R-CNN compared to Case 2. This result suggests that restricting changes to the background and emphasizing changes to the foreground while training a GAN is an effective way to improve performance when the generated images are used to train an Object Detection model.
Case 3 shows that, compared to Case 1, the classification branch contributes a 4% larger performance boost than the localization branch. The architecture with the classification branch and the proposed identity loss (Case 3) reached the highest performance (mAP = 0.77) among all test cases. This result indicates that the classification branch in the Discriminator encourages the GAN to generate images that contain more information helpful to Object Detection. Specifically, such information can be extracted well by the feature extraction CNN of an object detector such as Faster R-CNN and raises detection performance. On the other hand, the localization branch provides only a limited performance boost for predicting the location of an object. We consider the reason to be as follows. Faster R-CNN predicts the bounding box of target objects from feature maps extracted by VGG16, whose size is 1/16 of the original image. This down-sampling can cause ambiguity in bounding box prediction and degrade the information that the localization branch has encouraged the Generator to produce.
where loc = Localization Branch, cls = Classification Branch, loss = Proposed Identity Loss
Table 3. Result of Ablation Studies
Furthermore, according to the result of Case 2, training the classification branch and the localization branch simultaneously might degrade the performance of Faster R-CNN. We consider the reason to be that the parameters of a Discriminator with three tasks to predict at the same time are hard to optimize simultaneously. A specific training schedule, e.g., optimizing the parameters of one branch while keeping the others fixed, is a possible solution to this problem. It remains future work.
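For reference, the Smooth L1 loss used for the localization branch in the ablation studies can be sketched as follows. This is the commonly used definition (quadratic below an absolute error of 1, linear above); summing over the four box parameters and the transition point of 1 are assumptions, since Equation (7) is not reproduced here, and the function name is hypothetical.

```python
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Smooth L1 loss summed over box parameters (e.g. cx, cy, width, height)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic for small errors, linear for large ones.
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


# One small error (0.5) and one large error (2.0):
print(smooth_l1([0.5, 2.0], [0.0, 0.0]))  # 1.625 (= 0.125 + 1.5)
```

The linear region makes the loss less sensitive to outlier box predictions than a plain L2 loss.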

CONCLUSION
In this study, we proposed a multi-task GAN architecture, SYN-MTGAN, to generate fake images from synthetic scene images, which can be used as training data for an object detector such as Faster R-CNN when samples of the target objects are hard to collect in real scenes. We carried out experiments on traffic signs to evaluate our proposed method. The results showed that Faster R-CNN trained on the result of SYN-MTGAN with the classification branch and the proposed identity loss reaches an mAP of 0.77. Some categories of traffic signs for which enough real scene samples cannot be collected reached an mAP higher than 0.8, which indicates that the proposed SYN-MTGAN is an effective method for generating synthetic training data for rare roadside objects. The results of the ablation studies indicate that our proposed components are effective. The multi-task training, especially the classification branch and the proposed loss function, encourages the GAN to generate synthetic images that are more suitable for detection by a deep learning-based Object Detection model.
Future work includes: (1) evaluating human performance in distinguishing the synthetic and real images to see whether the generated images are truly realistic, (2) evaluating the settings of the loss function weights, which are the hyperparameters of SYN-MTGAN, and (3) modifying the architecture to be end-to-end trainable.