AUTOMATIC ROAD-SIGN DETECTION AND CLASSIFICATION BASED ON SUPPORT VECTOR MACHINES AND HOG DESCRIPTORS

ABSTRACT: This paper examines the detection and classification of road signs in color images acquired by a low cost camera mounted on a moving vehicle. A new method for the detection and classification of road signs is proposed. Color based detection is first used to locate regions of interest. Then, a circular Hough transform is applied to complete the detection, taking advantage of the shape properties of the road signs. The regions of interest are finally represented using HOG descriptors and are fed into trained Support Vector Machines (SVMs) in order to be recognized. For the training procedure, a database with several training examples depicting Greek road signs has been developed. Many experiments have been conducted and are presented, to measure the efficiency of the proposed methodology, especially under adverse weather conditions and poor illumination. For the experiments, training datasets consisting of different numbers of examples were used and the results are presented, along with some possible extensions of this work.


INTRODUCTION
Road sign detection and classification is an active field of research, which combines techniques from the fields of computer vision and machine learning. Although the first works in this area date back to the 1960s, significant progress has been made during the last years. Automatic road sign detection and classification has many practical applications. The most widespread is its use in Driver Assistance Systems (DAS). Such systems aspire to the fully automatic navigation of the car through auto pilot and include systems for automatic lane detection, obstacle detection in the vehicle path and road sign recognition. For the time being, road sign detection and classification can be used to assist the driver and enhance safety, e.g., by sending warning signals that flag excessive speed or indicate the presence of a specific sign that the driver may not notice due to distraction or lack of attention. Another interesting application is the mapping of traffic signs, in order to be used for automated road maintenance. For this purpose, the system developed can be used in combination with a mobile mapping system, offering information about the exact location of each detected sign.
Traffic sign recognition systems have to face many challenges. First of all, illumination conditions are not controllable. Depending on the time of the day and the weather conditions, illumination may vary dramatically. Secondly, traffic signs may be partially damaged, vandalized or faded due to long exposure to sunlight. All these hamper successful detection and recognition. Other common problems are occlusions and shadows cast by objects surrounding the traffic signs, as well as the existence of similar objects, which may be detected as road signs. Finally, traffic signs may appear rotated and translated, thus the system developed should be invariant under rotation and translation.
This paper is organized as follows. Section 2 briefly describes the state of the art methods for traffic sign detection and classification and section 3 analyses the proposed system. In section 4, the results are reported and discussed and finally section 5 summarizes and concludes the paper.

Traffic Sign Classes
Before investigating the state of the art methods and the proposed system for traffic sign detection and recognition, it is useful to describe some properties of traffic signs. Greece is among the 52 countries which have signed the Vienna Convention on road signs and signals (Inland transport committee, 1968), which aims at the standardization of traffic signs among different countries. The convention classifies road signs into seven categories: (a) danger warning signs, (b) priority signs, (c) prohibitory or restrictive signs, (d) mandatory signs, (e) information, facilities or service signs, (f) direction, position or indication signs and (g) additional panels. However, major differences still exist among different countries, which may not be important to humans, but are a major challenge for computer vision algorithms (Figure 1). This work examines the detection and classification of Greek road signs, and especially the first four categories mentioned above (a-d), which are the most important for driving safety. Traffic signs that belong to these categories have a heavily constrained appearance concerning shape and colors; their shapes are limited to circles, triangles and octagons and their colors to blue, green, red, yellow, white and black.

RELATED WORK
Traffic sign recognition algorithms are divided into two stages. The first one is the detection stage, which aims at image segmentation and the extraction of regions of interest (ROIs) from an image. The identified ROIs are passed to the second stage, namely the recognition stage, in order to be classified. The detection stage is very important because information that is discarded during this step cannot be recovered later.

Detection
For the detection stage, two main approaches exist: detection based on color criteria of traffic signs and detection based on shape criteria. The combination of the two approaches is also feasible.
Concerning the first approach, different color spaces are used in order to locate regions of the image that contain colors of interest. The most common color spaces are RGB, HSI or HSV and L*a*b. Most of the experimental works chose to transform the image into the HSI or HSV color space, which are based on human color perception, due to the high detection rates reported, as these color spaces are less sensitive to illumination changes. However, the transformation is computationally demanding and thus many experiments have been conducted in the RGB color space as well, in order to speed up the detection procedure.
In Broggi et al. (2007) chromatic equalization is performed using gamma correction and afterwards, regions of interest are selected by thresholding the image in the RGB color space. Similarly, in Kim et al. (2006) the RGB color space is used, and detection is performed through dynamic thresholding.
Experiments in the HSV color space are presented in Wang (2009), where global thresholding is performed and only pixels with high values for the third channel are selected, in order to have more stable results. In Fleyeh (2008) a set of fuzzy rules is applied to extract regions of interest. Similarly to the HSV color space, the HSI color space is treated in Maldonado-Bascon (2007), where global thresholds are used to detect chromatic colors and a special function is used for the detection of achromatic colors. In Kiran (2009) image enhancement is achieved in the HSI color space through the use of a look-up table (LUT) and ROIs are extracted after dynamic thresholding. Finally, in Siogkas and Dermatas (2006) the image is transformed into the L*a*b color space and regions are selected by thresholding. Thresholds are computed using Otsu's method (Otsu, 1979).
Concerning the shape detection approach, this technique is preferred because it does not depend on illumination conditions. However, it is slower and more computationally expensive than color based detection. Most of the shape based detection methods proposed so far take advantage of well known techniques like the Hough transform, Distance to Border (DtB) and the fast radial symmetry transform. In Lafuente-Arroyo et al. (2005; 2006) a combination of color and shape based detection is made. For each ROI, distances to borders are computed and this information is fed into trained Support Vector Machines in order to recognize the shape that is depicted in each ROI. In García-Garrido et al. (2005) and Hatzidimos (2004) the classical Hough transform is used to detect lines in the image. Then angles between crossing lines are computed and if three crossing lines form angles between 50° and 70°, the presence of a triangle is indicated. Moreover, in García-Garrido et al. (2005) the circular Hough transform is also applied, in order to detect circular traffic signs. Similar to the Hough transform, but less computationally expensive, is the fast radial symmetry transform that detects triangles. This technique is used in Barnes and Zelinsky (2004) and Loy and Barnes (2004), where the fast radial symmetry transform is enhanced to detect shapes with high symmetry in general, like triangles and squares.
Finally, template matching is also used for shape based detection. Typical examples are Hsu and Huang (2001) and Torresen (2004), in which template matching is applied in order to classify the shape of the signs in each detected ROI.

Recognition
The target of the recognition procedure is to assign each region of interest to the class to which it belongs. Also, false positives that have been detected as candidates depicting a road sign are eliminated. Recognition can be conducted using traditional template matching or via more sophisticated techniques from the field of machine learning, like Artificial Neural Networks (ANN) and Support Vector Machines (SVMs). For these procedures, various representations of regions of interest are possible, based on illumination values (pixel based approaches) or on image features (feature based approaches). Image features can be global or sparse, like HOG descriptors, SIFT descriptors and Zernike moments.
In Siogkas and Dermatas (2006) and Piccioli et al. (1994), template matching is applied to the extracted regions of interest in combination with the measure of cross-correlation. In Medici (2008) multi-layer, feedforward neural networks have been trained through the backpropagation algorithm, and the images were described by the illumination values. Similar publications using neural networks for the recognition stage are Nguwi and Kouzani (2008) and Eichner and Breckon (2008). Alternatively, representation of images through the illumination values and recognition using trained Support Vector Machines may be used (Lafuente-Arroyo et al., 2006). The type of classification is one versus all and the Gaussian kernel function is applied. Also, extraction of the regions of interest using Zernike moments, with SVMs for the recognition stage, is proposed (Fleyeh, 2008).
Finally, in Hu (2010), SIFT descriptors are extracted and encoded using a Bag-of-Words approach. Then the features are fed into three trained SVMs, in order to complete the classification procedure.

OVERVIEW OF THE PROPOSED SYSTEM
A brief description of the developed system for the recognition of Greek road signs is presented. The proposed method consists of two main steps: the detection stage and the recognition stage. Detection results in candidate blobs, which are further processed and assigned to a specific category through classification.
In order to compute the thresholds used for color based detection, several signs have been segmented manually from multiple views under different illumination and weather conditions. Their histograms have then been extracted and examined in order to define the best values (Figure 3).

Post processing of the thresholded image
A median filter with a 3x3 mask is applied to the binary image in order to eliminate noise. This filter discards single pixels that have been wrongly detected as regions of interest. A connected components algorithm is then applied to the binary image so that the blobs are appropriately labeled and form ROIs. Eight neighbour connectivity is used; for each ROI the bounding box is derived. False candidates are removed during this step according to the area and the side ratio of the bounding box. Regions of interest with relatively large or small areas are eliminated, as big ROIs are not likely to depict signs, and small ROIs will result in unsuccessful recognition, even if a traffic sign is correctly detected. Moreover, expected values for the ratio between the horizontal and the vertical side of a bounding box containing a single sign are around 1. However, there are many cases where traffic signs overlap (Figure 4); in these cases the ratio between the horizontal and the vertical side of the bounding box is around 0.5. Accepted values for area and ratio are empirically derived for the camera used in the application, and are the following:
- 250 pixels < area < 1500 pixels
- 0.4 < ratio < 1.6
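The labeling and filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' C++/OpenCV implementation: the function names, the (x, y, width, height) box layout and the choice to measure the area on the bounding box rather than on the blob itself are assumptions made for illustration. Only the eight neighbour connectivity and the accepted ranges for area and ratio come from the text.

```python
from collections import deque


def bounding_boxes(binary):
    """Return (x, y, width, height) for every 8-connected blob of 1-pixels."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over the 8 neighbours of each pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            xmin = xmax = x0
            ymin = ymax = y0
            while queue:
                x, y = queue.popleft()
                xmin, xmax = min(xmin, x), max(xmax, x)
                ymin, ymax = min(ymin, y), max(ymax, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((xmin, ymin, xmax - xmin + 1, ymax - ymin + 1))
    return boxes


def keep_candidate(box, area_range=(250, 1500), ratio_range=(0.4, 1.6)):
    """Apply the area and ratio limits quoted in the text.

    Assumption: area is the bounding-box area, ratio is width / height.
    """
    _, _, bw, bh = box
    area = bw * bh
    ratio = bw / bh
    return (area_range[0] < area < area_range[1]
            and ratio_range[0] < ratio < ratio_range[1])


def filter_candidates(binary):
    """Label the blobs of a binary image and drop implausible candidates."""
    return [box for box in bounding_boxes(binary) if keep_candidate(box)]
```

With these limits a 20x20 blob is kept, whereas an isolated pixel that survived filtering is rejected because of its area.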

Shape based detection
After color based detection and elimination of false positives, shape based detection is performed in order to separate overlapping traffic signs. Shape based detection takes advantage of the circular Hough transform (Ballard, 1981), which is an algorithm that identifies circular curves. In order for the circular Hough transform to be applied, an edge image, acquired via the Canny detector, is required. Afterwards, the presence of circles is indicated through a voting procedure carried out in the parameter domain. The number of dimensions of the parameter space equals the number of parameters needed to fully define the curve. As a circle is mathematically expressed through equation (1), the voting procedure involves the position $x_0$, $y_0$ of the center of the circle and its radius $r$.
$$(x - x_0)^2 + (y - y_0)^2 = r^2 \qquad (1)$$
These three parameters form the accumulator array, and combinations with the highest numbers of votes are more likely to represent circles.
The circular Hough transform is applied to ROIs with aspect ratio less than 0.7, as these ROIs are more likely to depict overlapping road signs. If a circle is detected, its bounding box is derived and the area between the new and the original bounding box forms a new region of interest. This new ROI is accepted only if the ratio between the vertical sides of the two new bounding boxes (BB of the circle and remaining BB) is between 0.7 and 1.3.
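The voting procedure of equation (1) can be sketched in a few lines of Python. This is an illustrative brute-force accumulator, not the OpenCV routine used in the experiments: the function names, the angular sampling (n_angles) and the rule that an edge pixel votes at most once per accumulator cell are assumptions of this sketch. Edge pixels are assumed to come from an edge detector such as Canny; sample_circle only generates synthetic edge pixels for trying the code.

```python
import math
from collections import Counter


def hough_circle_votes(edge_points, radii, n_angles=72):
    """Accumulate votes over (x0, y0, r) for a set of edge pixels.

    For a candidate radius r, every edge pixel (x, y) votes for all centres
    (x0, y0) that satisfy (x - x0)^2 + (y - y0)^2 = r^2, sampled at
    n_angles directions and rounded to integer accumulator cells.
    """
    votes = Counter()
    for x, y in edge_points:
        for r in radii:
            cells = set()
            for k in range(n_angles):
                theta = 2.0 * math.pi * k / n_angles
                x0 = round(x - r * math.cos(theta))
                y0 = round(y - r * math.sin(theta))
                cells.add((x0, y0, r))
            # One vote per cell and edge pixel (assumption of this sketch).
            votes.update(cells)
    return votes


def best_circle(edge_points, radii):
    """Return the (x0, y0, r) cell that collected the most votes."""
    votes = hough_circle_votes(edge_points, radii)
    cell, _ = max(votes.items(), key=lambda item: item[1])
    return cell


def sample_circle(cx, cy, r, n=360):
    """Synthetic edge pixels on a circle, used only to exercise the sketch."""
    return sorted({(round(cx + r * math.cos(2.0 * math.pi * k / n)),
                    round(cy + r * math.sin(2.0 * math.pi * k / n)))
                   for k in range(n)})
```

Calling best_circle(sample_circle(20, 15, 8), range(5, 12)) should return a cell close to the true centre and radius; in practice the range of radii would be restricted by the size of the ROI being examined.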

HOG Descriptors
For the recognition stage, regions of interest are represented using Histograms of Oriented Gradients (HOG), proposed by Dalal and Triggs (2005). HOG descriptors were first applied to pedestrian detection, but since then they have been widely used for object recognition, as they are robust to scale and illumination changes. In order to extract HOG descriptors, an image is divided into overlapping blocks, each of which is formed by a group of cells. Each cell is composed of non-overlapping pixels and for each cell a local 1-D histogram of edge orientations is derived. Every pixel contributes to the histogram according to the magnitude and the orientation of its gradient. Orientation angles are quantized into bins; each pixel casts a vote for the bin to which its orientation value belongs, and this vote is weighted by the gradient magnitude. The number of bins is adjustable and the range of orientations is 0°-180° for unsigned gradients and 0°-360° for signed gradients. The local histograms are accumulated and normalized over blocks in order to achieve illumination invariance and are then concatenated to form the descriptor (Figure 5). In our experiments, regions of interest are first resized to 32x32 pixels through bilinear interpolation and HOG descriptors are extracted with the following properties: unsigned gradients were used, the number of orientation bins was set to 6, cells contained 4x4 pixels and blocks were composed of 4x4 cells.
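The descriptor configuration given above (32x32 pixels, unsigned gradients, 6 bins, cells of 4x4 pixels, blocks of 4x4 cells) can be illustrated with the following minimal Python sketch. It is not the OpenCV implementation used in the experiments. The central-difference gradient, the hard assignment of each vote to a single bin, the L2 block normalization and, above all, the block stride of 2 cells are assumptions; the text does not state how far the blocks overlap.

```python
import math


def cell_histograms(img, cell=4, bins=6):
    """Magnitude-weighted histograms of unsigned gradient orientation per cell.

    img is a 2-D list of grey values. Gradients use central differences with
    clamped borders (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    bin_width = 180.0 / bins
    for y in range(rows * cell):
        for x in range(cols * cell):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned gradient: fold the orientation into [0, 180).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = int(angle // bin_width) % bins
            hists[y // cell][x // cell][b] += magnitude
    return hists


def hog_descriptor(img, cell=4, bins=6, block=4, stride=2):
    """Concatenate L2-normalized block vectors into one descriptor.

    block is the block side in cells; stride (in cells) is a hypothetical
    value, chosen here so that neighbouring blocks overlap by half.
    """
    hists = cell_histograms(img, cell, bins)
    rows, cols = len(hists), len(hists[0])
    descriptor = []
    for by in range(0, rows - block + 1, stride):
        for bx in range(0, cols - block + 1, stride):
            vec = [v
                   for cy in range(by, by + block)
                   for cx in range(bx, bx + block)
                   for v in hists[cy][cx]]
            norm = math.sqrt(sum(v * v for v in vec) + 1e-6)
            descriptor.extend(v / norm for v in vec)
    return descriptor
```

Under the assumed stride, a 32x32 region yields 8x8 cells, 3x3 blocks and a descriptor of 3 x 3 x 16 x 6 = 864 values; a different stride changes the length but not the principle.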
For the optimal separating hyperplane {w, b}, equations (2) and (3) also hold:
$$x_i \cdot w + b \geq +1 \quad \text{when } y_i = +1 \qquad (2)$$
$$x_i \cdot w + b \leq -1 \quad \text{when } y_i = -1 \qquad (3)$$
These two equations are expressed in one constraint:
$$y_i (x_i \cdot w + b) - 1 \geq 0 \quad \forall i \qquad (4)$$
The distance between the two training datasets is equal to $2/\|w\|$, thus the maximum margin between them is found by minimizing $\|w\|^2/2$, subject to the above constraint (Figure 6). To achieve better classification results, an extra cost C is introduced through positive slack variables $\xi_i$, $i = 1, \ldots, l$. The constraint (4) is reformulated as:
$$y_i (x_i \cdot w + b) - 1 + \xi_i \geq 0, \quad \xi_i \geq 0 \quad \forall i$$
Apart from linear SVMs, non-linear SVMs can also be adopted, if the decision function is not a linear function of the data. In this case the kernel trick is applied. More specifically, a suitable mapping $\phi$ maps the features to a higher dimensional feature space, where the data might be linearly separable, so $\phi : \mathbb{R}^d \rightarrow H$. For this trick, it is not necessary to calculate $\phi$ explicitly; only the kernel function K has to be known:
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
Figure 6. The margin between the two datasets (source: Burges, 1998)
This contribution has been peer-reviewed. The double-blind peer-review was conducted on the basis of the full paper. doi:10.5194/isprsannals-II-5-1-2014
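As a small numerical illustration of the linear case, the sketch below evaluates the decision value x · w + b, the margin 2/||w|| and the separability constraint for a given hyperplane. It does not train an SVM (training was carried out with SVMlight in this work); the function names and the toy hyperplane are hypothetical and serve only to make the quantities concrete.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(w, b, x):
    """Signed value x . w + b; zero for points lying on the hyperplane."""
    return dot(w, x) + b


def classify(w, b, x):
    """Assign label +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin(w):
    """Distance 2 / ||w|| between the two margin hyperplanes."""
    return 2.0 / math.sqrt(dot(w, w))


def satisfies_constraint(w, b, samples):
    """Check y_i (x_i . w + b) - 1 >= 0 for all (x_i, y_i) pairs."""
    return all(y * decision_value(w, b, x) - 1 >= 0 for x, y in samples)


# Toy example (hypothetical values): hyperplane x1 = 2 in two dimensions.
w, b = [1.0, 0.0], -2.0
samples = [([3.0, 0.0], 1), ([1.0, 5.0], -1)]
separable = satisfies_constraint(w, b, samples)
width = margin(w)
```

In this toy case both samples lie exactly on the margin hyperplanes, so the constraint holds with equality and the margin equals 2.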

RESULTS
To evaluate the efficiency of the proposed methodology, experiments have been conducted. A low cost Olympus μ 850SW camera was mounted on a moving vehicle. The dimensions of the images processed were 680x510 pixels. Images were acquired during night and day and under different illumination and weather conditions. The proposed methodology was implemented in C++ and functions from OpenCV 2.4.3 were used. The test data contain 42 images which depict 62 traffic signs in total.
For the detection stage a total of 145 ROIs were extracted, 59 of them being valid and 86 invalid. A detection rate of 95.14% was reported. The algorithm failed to detect road signs with faded color and signs whose boundaries were very similar to the background.
Figure 7 depicts a representative example of the detection stage in an urban environment. The first image (Figure 7a) is the input to the system, the second image (Figure 7b) is the result of thresholding in the HSI color space, the third image (Figure 7c) is the result of median filtering and finally the extracted ROIs, after the elimination of false positives according to the area and ratio of their bounding boxes, are presented (Figure 7d). If the ratio between the horizontal and the vertical side of the bounding box is lower than 0.7, the ROIs are further processed and shape based detection is applied. In this step, 70% of the input ROIs depicting overlapping traffic signs were successfully separated (Figure 8). Even though this percentage is not very high, this step is extremely important, because if the overlapping traffic signs were passed as one region of interest to the classification procedure, recognition would fail for both of them.
Table 1 presents the values for Xi-alpha estimates, resulting from the cross-validation procedure, conducted using SVMlight.
Recall of a decision rule h is the probability that an example with label y = 1 is classified correctly, and precision P of a decision rule h is the probability that an example classified as h(x) = 1 is indeed classified correctly. Compared to some already developed methods for automatic road sign detection and recognition, the proposed technique achieves better detection and classification rates in most of the examined cases. For example, in Barnes and Zelinsky (2004) the detection accuracy was 90% and the classification accuracy 75%; in Shneier (2005) the rates were 88% and 78% respectively; and in Medici et al. (2008) the rates were 74% and 89.1%. In the proposed methodology, HOG descriptors are integrated into the classification procedure, which leads to very satisfactory results when compared to ordinary classification procedures.
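The two measures defined above can be computed from the true labels and the outputs of a decision rule as in the following minimal Python sketch. The function name and the toy labels are hypothetical; returning 0.0 when a denominator is empty is a convention chosen for this sketch, not something stated in the text.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall of the positive class (label 1).

    precision = TP / (TP + FP): share of predicted positives that are correct.
    recall    = TP / (TP + FN): share of true positives that are found.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t != 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical): 1 = sign of the target class, -1 = anything else.
truth = [1, 1, -1, -1]
predicted = [1, -1, -1, -1]
p, r = precision_recall(truth, predicted)
```

Here one of the two positive examples is found and no negative example is accepted, so precision is 1.0 and recall is 0.5.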

CONCLUSIONS
In this paper, a complete methodology for road sign detection and recognition is presented and described, taking into consideration existing difficulties. Special emphasis is placed on examining the efficiency of representing ROIs by HOG descriptors. The results are very satisfying and it has been shown that HOG descriptors are a good choice for representing the image in this framework. Furthermore, an intermediate step that separates overlapping signs has been successfully developed. The overall competence of the system is high, as it was proven to be robust to illumination and scale changes as well as partial occlusions.
Concerning future extensions of this work, an improvement would be to integrate a tracking algorithm into the procedure. Furthermore, it would be interesting to conduct experiments with different parameters for the HOG descriptors and compare the resulting accuracies for the classification step. Finally, it would also be useful to examine different image representations, like SIFT descriptors combined with diverse encoding methods.

Figure 1. Differences between road signs from countries which have signed the Vienna Convention

Figure 2. Workflow of the method
Color based detection
Candidate objects are selected via thresholding. Thresholding refers to the procedure that creates a binary image: pixels with illumination values above a predefined threshold are assigned the value 1, and all the others are set to 0. Thresholding is conducted in the HSI (Hue-Saturation-Intensity) color space, as it is more robust to illumination changes than RGB. Only the Hue and Saturation channels are used, as these components encode color information. Possible values for the Hue component range between 0° and 360° and for the Saturation component between 0 and 255. Global thresholds are used, which are the following:
Figure 3. Histograms from images depicting (a) red, (b) blue and (c) yellow road signs for the Hue (left) and the Saturation (right) component
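The color based thresholding described above can be sketched in Python as follows. The conversion uses the standard RGB to HSI formulas, with Saturation scaled to 0-255 as in the text. The threshold values in the example (hue between 340° and 20°, saturation of at least 100) are purely hypothetical placeholders and are not the values used in the experiments; the function names are likewise illustrative.

```python
import math


def rgb_to_hue_saturation(r, g, b):
    """Hue in degrees [0, 360) and Saturation scaled to [0, 255] (HSI model)."""
    total = r + g + b
    if total == 0:
        return 0.0, 0.0
    saturation = 1.0 - 3.0 * min(r, g, b) / total
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        hue = 0.0  # achromatic pixel: hue is undefined, 0 by convention here
    else:
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        hue = theta if b <= g else 360.0 - theta
    return hue, saturation * 255.0


def in_hue_range(hue, low, high):
    """Hue interval test that also handles ranges wrapping around 0 degrees."""
    if low <= high:
        return low <= hue <= high
    return hue >= low or hue <= high


def threshold_image(rgb_image, hue_range, sat_min):
    """Binary image: 1 where Hue and Saturation pass the global thresholds."""
    mask = []
    for row in rgb_image:
        out = []
        for r, g, b in row:
            hue, sat = rgb_to_hue_saturation(r, g, b)
            out.append(1 if in_hue_range(hue, *hue_range) and sat >= sat_min else 0)
        mask.append(out)
    return mask


# Hypothetical thresholds for red signs; NOT the values from the paper.
image = [[(255, 0, 0), (128, 128, 128)],
         [(0, 0, 255), (200, 30, 30)]]
red_mask = threshold_image(image, hue_range=(340.0, 20.0), sat_min=100.0)
```

Because only Hue and Saturation are tested, the bright and the darker red pixel of the example are both selected, while the grey and the blue pixel are rejected. One such mask would be produced for each color of interest.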

Figure 4. Overlapping signs

Figure 5. Structure of HOG descriptor (source: Dalal and Triggs, 2005)
Recognition based on SVMs
Vectors representing the candidate blobs are inserted into trained SVMs in order to be classified. Support Vector Machines were first introduced by Vapnik (1998) and belong to machine learning methods. They aim to learn from a set of data by minimizing structural risk. SVMs were first used for classification tasks, but soon their use was extended to regression as well. Concerning binary classification, in the simplest case the data are linearly separable. The training data are labeled as $\{x_i, y_i\}$, where $i = 1, \ldots, l$, $y_i \in \{-1, 1\}$ and $x_i \in \mathbb{R}^d$. In our experiments $x_i$ are the vectors of the HOG descriptors, $y_i = 1$ indicates that the data belong to one class and $y_i = -1$ to the other class. Support Vector Machines compute the optimal hyperplane that separates the data. This hyperplane is fully defined by {w, b} and for the points that lie on the hyperplane the equation $x_i \cdot w + b = 0$ holds, where w is the normal to the hyperplane, $\|w\|$ its Euclidean norm, and $|b|/\|w\|$ the perpendicular distance from the hyperplane to the origin.

Figure 8. Left: Input image depicting overlapping signs, Middle: First extracted ROI, Right: Second extracted ROI

Figure 9. Top row: Positive examples for training procedure, Bottom row: Negative examples for training procedure

Table 1. Xi-alpha estimates for STOP and BIKE LANE signs
On the other hand, when a greater number of training examples is used for signs with more detailed pictograms, the accuracy increases. For example, for BIKE LANE signs the accuracy rate equals 60.38% when 150 training examples are used and increases to 88.05% for 207 training examples.

Table 2. Accuracy for STOP sign in conjunction with different numbers of training examples