Wonje Jang, Junhyuk Hyun, Jhonghyun An, Minho Cho, and Euntai Kim
Abstract—The essential requirement for precise localization of a self-driving car is a lane-level map which includes road markings (RMs). Obviously, we can build the lane-level map by running a mobile mapping system (MMS) which is equipped with a high-end 3D LiDAR and a number of high-cost sensors. This approach, however, is highly expensive and ineffective since a single high-end MMS must visit every place for mapping. In this paper, a lane-level RM mapping system using a monocular camera is developed. The developed system can be considered as an alternative to an expensive high-end MMS. The developed RM map includes the information of road lanes (RLs) and symbolic road markings (SRMs). First, to build a lane-level RM map, the RMs are segmented at the pixel level through a deep learning network. The network is named RMNet. The segmented RMs are then gathered to build a lane-level RM map. Second, the lane-level map is improved through loop-closure detection and graph optimization. To train the RMNet and build a lane-level RM map, a new dataset named the SeRM set is developed. The set is a large dataset for lane-level RM mapping, and it includes a total of 25157 pixel-wise annotated images and 21000 position-labeled images. Finally, the proposed lane-level map building method is applied to the SeRM set, and its validity is demonstrated through experimentation.
SELF-DRIVING has received much attention from industry and academia because of its potential [1]. The implementation of self-driving requires various technologies such as perception [2], localization [3], and path planning [4]. A lane-level map which includes information on road markings (RMs) is the key requirement for the precise localization of the autonomous vehicle [5]–[7]. A lane-level RM map is usually built using a commercial mobile mapping system (MMS) equipped with many expensive sensors, such as a 3-D LiDAR, a real-time kinematic (RTK) GPS, and inertial navigation systems. The approach, however, is highly expensive and ineffective since the MMS must visit every place it could go for mapping. Considering that a monocular camera has already been installed in most vehicles, building a lane-level RM map using a monocular camera is a very attractive alternative. Sensor data can be gathered efficiently from a number of common vehicles on the road; thus, mapping can be virtually free. In this paper, a new lane-level RM mapping system using a monocular camera is proposed. The key techniques for the proposed lane-level RM mapping using a monocular camera are (T1) how the RLs and SRMs are extracted from a monocular camera, and (T2) how they are used as landmarks in the graph optimization of mapping. Here, RM is defined as a union of RL and SRM.
Regarding the RM extraction in (T1): Many recent papers have reported on RL and SRM detection [8]–[15]. The biggest difference between them and our method is that the previous papers deal with RM detection, while our paper focuses on RM segmentation. Since the RLs detected in [8]–[15] are represented in the form of curved lines, they are not suitable for pixel-level mapping. The proposed method, however, outputs the segments of RMs at the pixel level and can build maps at the pixel level. The result of this paper can be extended from lane-level RM maps to pixel-level RM maps in the future, and the pixel-level RM map of this paper will compete with the high-definition (HD) point cloud map built using a LiDAR from a high-end MMS.
Regarding the mapping in (T2): The key difference from the visual simultaneous localization and mapping (vSLAM) developed in mobile robotics [16]–[18] is how to match or associate RM segments. Since an RM is not a point landmark but consists of a set of points, simple associations based on Euclidean distances do not work. In this paper, a segment-to-segment (S2S) association is developed instead of a point-to-point (P2P) one. An RM segment is divided into front and rear parts, and the pair is registered as two nodes for association.
To the best of our knowledge, this is the first attempt at building a lane-level RM map based on RM segmentation using a monocular camera. The proposed map building method consists of two phases. First, RMs are segmented from the camera image using a semantic segmentation network. Here, RM is defined as the union of SRM and RL. A multi-branch network is developed to cope with RL and SRM separately. Second, the lane-level RM map is built by optimizing a pose graph composed of RMs. DBoW2 [19] is used to detect loop closure and g2o [20] is used for graph optimization. The contributions of this paper are as follows:
1) A lane-level RM map is built based on RM segmentation using a monocular camera. Since RMs are extracted on a per-pixel basis by a segmentation network, the map can be extended to a pixel-level RM map in the future.
2) RMNet is developed and trained on the SeRM dataset. Two losses, named class-weighted loss (CWL) and class-weighted focal loss (CWFL), are proposed to train RMNet.
3) Previous vSLAM methods used point landmarks as nodes in the graph optimization. Since an RM is not a point but a segment, the point-based association used in classical vSLAM is not appropriate. In the proposed RM mapping, an S2S association is developed by dividing an RM segment into front and rear parts. The pair is registered as two different nodes in a graph, and graph optimization is performed.
4) A new dataset, called the semantic road mark mapping (SeRM) dataset, is developed for effective RM segmentation. The set is required to train a semantic segmentation network for RMs and to test the lane-level RM map. The set consists of 25157 pixel-wise annotated images and 21000 position-labeled images.
The rest of this paper is organized as follows: Section II introduces related work. Section III presents the SeRM dataset details. Section IV outlines the overview of the proposed map building system. Section V presents the RM segmentation network. Section VI discusses how the lane-level map consisting of RMs is built through graph optimization. Section VII gives the experimental results, and Section VIII concludes this study.
The most common way to create maps for autonomous vehicles is to use an MMS, which is equipped with a suite of high-accuracy sensors such as 3-D LiDAR and high-accuracy GPS sensors [21]–[23]. When a commercial MMS is used, highly accurate GPS trajectories are recorded, and the point clouds obtained from the 3-D LiDAR are superimposed on the GPS trajectories, making a lane-level HD map. The major shortcomings of building a lane-level map using an MMS are that building a map costs too much and that the LiDAR sensor in the MMS does not capture color information; hence, colored lanes (e.g., yellow and white lanes) are quite difficult to distinguish. To solve these problems, lane-level map building using a low-cost camera has been receiving more attention within the field of autonomous vehicles [24]–[27]. To achieve this, RMs (RLs and SRMs) should be detected, and a lane-level map composed of RMs should be built.
To build a lane-level map with road markings using a monocular camera, what matters most is reliable detection of RMs (RLs and SRMs). To achieve this, various image processing methods were widely used in classical RL and SRM detection methods [28]–[31]; the most popular ones use a Top-Hat filter and hand-crafted low-level features.
Recently, with the development of deep learning, various learning-based methods have been proposed [8]–[15] to detect RMs effectively. First, let us consider the detection of RL. In [8]–[10], end-to-end segmentation networks were designed to perform a pixel-level classification into either RL or background, and line-shaped RLs were reprojected onto the image using clustering and curve fitting. Then, Neven et al. extended [8]–[10] to the 3D case and proposed 3D-LaneNet in [11]. In [12], Guo et al. developed GeoNet by applying geometric assumptions to 3D-LaneNet and training the network using a regression loss. In GeoNet, 3D lane attributes are predicted using an anchor representation. In [13], Ghafoorian et al. proposed EL-GAN, which uses a generative adversarial network architecture. In the network, a discriminator is trained on both predictions and labels at the same time. Second, let us consider the detection of SRM. Lee et al. improved SRM detection by massaging the bounding boxes from tiny-YOLO in [14]. True positives from tiny-YOLO are deblurred, whereas false positives are blurred through a GAN. The third group detects RL and SRM simultaneously. Lee et al. [15] proposed VPGNet to detect RLs and SRMs simultaneously. The key idea of the work is to boost the detection performance of RL and SRM by detecting the vanishing point along with them.
Here, it should be noted that the RL and SRM detection proposed in this paper is completely different from that of the previous papers [8]–[15]. The first and basic difference is that the previous papers deal with RM detection, but our method deals with RM segmentation. Since the previous papers [8]–[13], [15] aim at being applied to a lane keeping system (LKS), the lanes in [8]–[15] are represented in the form of curved lines, and they are not suitable for pixel-level mapping. The proposed method, however, outputs the segments of RMs at the pixel level and can build maps at the pixel level. We believe that the result of this paper can be extended from a lane-level RM map to a pixel-level RM map in the future, and the pixel-level RM map of this paper will compete with the high-definition (HD) point cloud map built using a LiDAR of a high-end MMS.
The second difference is more significant than the first one. The previous works and this paper have completely different goals. Since the previous works were developed for an LKS, the RLs in [8]–[13], [15] aim to provide guidance for autonomous driving, and thus they cannot be used in RM mapping. Fig. 1 shows the training samples used in [8]–[13], [15] and in the SeRM dataset. They clearly show the difference in lane detection between the two applications.
Fig. 1. Lane detection for (a) LKS and (b) Mapping. (a) Example of CULane dataset (target of lane detection problem). (b) Example of SeRM dataset (target of RM segmentation for mapping problem).
Large datasets regarding automotive driving are required to train a deep network to detect RMs using a monocular camera. Several datasets have been reported: datasets [31]–[35] include road lanes (RLs), dataset [28] includes symbolic road markings (SRMs), and datasets [15], [36]–[41] provide both RL and SRM labels.
First, let us consider the datasets about RL. The first dataset about RL is the Caltech lanes dataset [31]. In the set, the shape of a lane is represented using a polynomial, but its pixel-level ground truth is not provided. Later, several other datasets regarding RL were reported: the KITTI road dataset [32], TuSimple [33], FiveAI [34], and the CULane dataset [35]. The datasets in [32], [34] provide instance-level RLs in curved form, and they are beneficial for more sophisticated driving manoeuvres. Datasets [33], [35] provide information about the ego-lane and can be used in lane following tasks.
Second, let us consider the dataset about SRM. The most popular dataset about SRM is the road marking detection (RMD) dataset [28]. It provides about 28000 images with 10 categories of SRMs. The ten categories include speed limit signs “35” and “40”, “left turn”, “right turn” and “forward” arrows, and “bike”, “stop”, “ped”, “xing” and “rail” signs, and they cover all kinds of SRMs commonly found on US roads.
Third, let us consider the datasets which include both RLs and SRMs. The early datasets that provide both RL and SRM are the ROMA [36], CamVid [37], and TRoM [38] datasets. All of them are very small. CamVid [37] is the first dataset with semantically annotated videos. The size of the dataset is small, containing 701 manually annotated images with 32 semantic classes. ROMA [36] is a dataset of 116 images with 3 RM classes. Since the set is too small, it can only be used for tests and cannot be used for training. The TRoM dataset [38] is the first dataset on semantic RMs under different weather, illumination, and traffic-load conditions. The set provides 712 images accompanied by pixel-level annotation of 19 classes of RMs. Recently, a few large-scale pixel-level annotated datasets on semantic RMs have been reported; examples are Mapillary Vistas [39], the VPGNet dataset [15], ApolloScape [40], and BDD100K [41]. The Mapillary dataset [39] is a large set of images with fine annotations, and it includes 25000 images with 66 object categories and 6 classes of RMs. The VPGNet dataset [15] focuses only on RMs and does not consider other instances such as objects. The set consists of about 20000 images with 17 manually annotated RLs and SRMs. Vanishing point annotation is also provided in the set. However, the images in the VPGNet dataset are not annotated with the location at which the images were captured. Thus, the dataset can be used for RM segmentation but not for localization or mapping. ApolloScape [40] has 140K annotated images and 27 RM classes, and it is the largest dataset regarding RMs. The set covers not only RM detection but also other tasks such as 3D reconstruction, semantic parsing, and instance segmentation. The BDD100K dataset [41] contains 100K raw video sequences with about 120 million images. One image is selected from each video clip for annotation. 100K images are annotated at the bounding box level and 5K images are annotated at the pixel level.
All the datasets related to RM are compared in Table I in terms of various aspects.
After detecting RMs using a monocular camera, the absolute positions of these RMs should be recovered to make a lane-level map. In [24], a map correction system was reported in which noisy lane hypotheses from sensor inputs, noisy maps with RLs, and noisy global positioning were all combined to fine-tune a map using a probabilistic approach. In [25], pixel-level submaps comprising the RLs were made, and the entire lane map was built using submap matching. The submap matching methods were improved in [26] to build a lane-level map that contains four classes of RMs.
Interestingly, vehicle localization [5] and reference mapping [7] based on RM detection using a camera have been developed. In particular, the goal of [7] looks similar to that of this paper. However, the mapping paper [7] completely differs from this paper in two aspects. The first difference is that the reference map developed in [7] consists of consecutive points on the lanes; thus, it cannot represent various environments such as the complex intersection shown in Fig. 2. Further, because the reference map in [7] consists only of consecutive points, it does not provide much information about longitudinal movement and has a clear limitation in correcting longitudinal error. On the contrary, our RM map consists of various RM segmentations, and it outputs not simple points but the pixel-level segments of RMs. Thus, unlike the reference map in [7], our RM map can represent various environments, and the map looks similar to a high-definition (HD) point cloud map built using a LiDAR, as shown in Fig. 2.
Fig. 2. Example of mapping of previous work [7] and ours. (a) Aerial view image of the complex junction. (b) Mapping example of previous work [7]. (c) Mapping result of our lane-level map.
The second difference between [7] and our RM map is about the technical implementation. The reference map in [7] is built simply using only GPS or GNSS, and loops in the trajectory are not used at all. As the distance traveled by the mapping vehicle increases, more and more errors will be accumulated, degrading the map accuracy. Our RM map, however, is based on vSLAM, and it does not use GPS or GNSS. Whenever loops are closed, localization errors are corrected and are not accumulated.
Much research has been conducted regarding vSLAM in the field of mobile robotics [16]–[18]. The key difference between the visual simultaneous localization and mapping (vSLAM) developed in mobile robotics [16]–[18] and the one in this paper is how to match (or associate) features, which in this paper are RM segments. Since an RM is not represented by a point as in [16]–[18] but by a set of points, the existing matching methods do not work. Therefore, we propose an S2S matching method which treats an RM as a pair of two nodes. The details about the registration are given in Section IV-C. To the best of our knowledge, there is no research work in which a lane-level RM map is built based on RM segmentation using a monocular camera, and the map is built based on vSLAM.
TABLE I COMPARISON OF DRIVING SCENE DATASETS
To apply deep learning in building a lane-level map using only a low-cost camera, a large training set is needed. The training set should satisfy the following two conditions:
C1: All kinds of RMs, including lanes and SRMs, on the road should be annotated at the pixel level.
C2: For effective mapping, each image should be provided together with GPS and odometry information. The vehicle trajectories must have multiple loop closures, which can be used to reduce the mapping error.
Obviously, C1 is required to train a network which extracts the RMs from a camera input. C2 is needed to build a map using the RMs extracted from the camera image. However, none of the existing datasets satisfies both of the above-mentioned conditions. In this study, we collected a large number of images and made a dataset that best fits affordable lane-level RM mapping. Table I shows a comparison of our SeRM dataset with other datasets.
For RM map building, the RM should be extracted from a camera image at the pixel level using a semantic segmentation network. To train the semantic segmentation network for the lane-level RM map, a large number of images were gathered in South Korea. The RMs in the images were manually annotated at the pixel level. The SeRM dataset included 25158 images annotated with 16 classes of RMs at the pixel level. The number of images constituting the dataset was 19998 for training and 5160 for testing. The image resolution is 672×1280, and the lower part (320×1280) is used for RM segmentation training. In this paper and in the SeRM dataset, “RM” is defined as the union of SRMs, RLs, and a background, presented as follows:

$$C_{RM} = C_{SRM} \cup C_{RL} \cup \{\text{background}\}$$

where $C_{RM}$ denotes the set of SeRM classes; $C_{SRM}$ = {slow down, go ahead, turn right, turn left, ahead or turn right, ahead or turn left, crosswalk, number markings, text markings, other markings} and $C_{RL}$ = {yellow double line, blue double line, broken line, white single line, yellow single line, stop line}. Table II shows the RM classes in the SeRM dataset.
Fig. 3 depicts some examples of the SeRM set, where the first column represents the original RGB images acquired by the monocular camera, and the second column shows the corresponding image annotated at the pixel level.
Images annotated with the pose of the experimental vehicle at the time of acquisition were needed to build the lane-level RM map. To this end, 21000 images were gathered on three routes in Seoul and Goyang-si and annotated with RTK-GPS and odometry data at the image level. Route 1 data were acquired from Sangam, Seoul. Route 1 is a complex route with many loop closures, including both narrow and wide roads. Route 2 data were obtained from Ilsan, Goyang-si, which consists of wide roads. Route 3 data were collected from very large areas in Sangam and contain data obtained while driving on the same road in opposite directions. The three routes include several loop closures, as shown in Fig. 4. The total length of our three driving routes was approximately 26 km, with the routes being 4.86, 9.47, and 11.63 km long, respectively.
Fig. 3. Examples of the pixel-wise annotated images in SeRM. The first column depicts the original RGB images, whereas the second column shows the corresponding ground truth images.
TABLE II RM CLASSES IN THE SERM DATASETS
The proposed lane-level map building system consists of two phases: 1) RM segmentation and 2) lane-level RM mapping. Fig. 5 shows the outline of our lane-level map building system. In the first phase, a semantic segmentation network (i.e., RMNet) was developed and trained on the SeRM set. The RGB images $I \in \mathbb{R}^{H \times W \times 3}$ and the corresponding ground truth (GT) images annotated at the pixel level, $T \in \{0,1\}^{H \times W \times |C_{RM}|}$, in the SeRM set were used in the training, where $|C_{RM}|$ denotes the number of RM classes in the SeRM dataset. In the dataset, $C_{RM}$ includes 10 SRM classes ($|C_{SRM}| = 10$) and 6 RL classes ($|C_{RL}| = 6$). The total number of classes in the SeRM set is $|C_{RM}| = |C_{SRM}| + |C_{RL}| + 1$, where “1” corresponds to the background.
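To make the tensor shapes above concrete, the following minimal sketch builds a one-hot ground truth $T$ from a small map of class indices. The class names follow the SeRM taxonomy given in the text; the index order (background first) and the helper name `one_hot` are assumptions made only for illustration.

```python
# Class names follow the SeRM taxonomy; the index order is an assumption.
SRM_CLASSES = ["slow down", "go ahead", "turn right", "turn left",
               "ahead or turn right", "ahead or turn left", "crosswalk",
               "number markings", "text markings", "other markings"]
RL_CLASSES = ["yellow double line", "blue double line", "broken line",
              "white single line", "yellow single line", "stop line"]
RM_CLASSES = ["background"] + SRM_CLASSES + RL_CLASSES  # 1 + 10 + 6 classes


def one_hot(label_map, num_classes):
    """Turn an H x W map of class indices into an H x W x num_classes 0/1 tensor."""
    return [[[1 if c == idx else 0 for c in range(num_classes)]
             for idx in row]
            for row in label_map]


# Toy 2 x 3 label map containing background, "turn right" and "broken line".
labels = [[0, 0, RM_CLASSES.index("turn right")],
          [RM_CLASSES.index("broken line"), 0, 0]]
T = one_hot(labels, len(RM_CLASSES))
```

Each pixel vector in `T` has exactly one non-zero entry, which is the property the loss definitions rely on.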
Fig. 4. Three routes on which the 21000 position-annotated images in the SeRM dataset were captured.
Fig. 5. Architecture of the proposed system for building the lane-level RM map.
Fig. 6. RMNet architecture.
A lane-level RM map comprising the segmented RMs obtained from RMNet was built in the second phase. Lane-level RM map building refers to building a graph that includes vehicle poses and RMs as nodes and refining the graph by graph optimization. It is a graph SLAM [20] using RMs as landmarks. Lane-level RM mapping consists of four parts: inverse perspective mapping (IPM), RM correction, front-end, and back-end. In the IPM part, the RMs in the image plane are transformed to vehicle coordinates, and the observation error is corrected in the RM correction part. In the front-end, a graph composed of vehicle poses and RMs is built, whereas in the back-end, the graph is optimized.
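The IPM part can be illustrated with a minimal flat-road sketch. It assumes an ideal pinhole camera and intersects the viewing ray of a pixel with the road plane. The function name and the parameters (`f`, `cx`, `cy`, `cam_height`, `tilt`) are hypothetical, and this ray-plane form is a stand-in for, not a reproduction of, the transform matrix used in the paper.

```python
import math


def ipm_pixel_to_ground(u, v, f, cx, cy, cam_height, tilt):
    """Back-project pixel (u, v) onto a flat road.

    Assumes a pinhole camera with focal length f (pixels) and principal
    point (cx, cy), mounted cam_height metres above the road and pitched
    down by `tilt` radians. Returns (forward, lateral) in metres, or None
    when the pixel looks at or above the horizon.
    """
    dx = (u - cx) / f  # ray direction in the camera frame, x to the right
    dy = (v - cy) / f  # ray direction in the camera frame, y downwards
    # Rotate the ray by the tilt so that "down" is normal to the road.
    down = dy * math.cos(tilt) + math.sin(tilt)
    fwd = math.cos(tilt) - dy * math.sin(tilt)
    if down <= 1e-9:
        return None
    scale = cam_height / down  # stretch the ray until it reaches the road
    return (scale * fwd, scale * dx)
```

With a camera pitched down by 45 degrees, the principal point maps to a road point whose forward distance equals the camera height, which is a quick sanity check for the geometry.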
Let us denote the original RGB image, the per-pixel ground truth, and the segmentation result from RMNet by $I \in \mathbb{R}^{H \times W \times 3}$, $T \in \{0,1\}^{H \times W \times |C|}$, and $P \in (0,1)^{H \times W \times |C|}$, respectively. The size of $T$ and $P$ is the same as that of the input image $I$. In this section, RMNet is proposed, and its architecture is given in Fig. 6.
The proposed RMNet has an encoder-decoder architecture, and it consists of a shared part and a branch part, as shown in Fig. 6. The shared part corresponds to the encoder in the encoder-decoder structure, and it generates features for both SRM and RL. Meanwhile, the branch part corresponds to the decoder, and its two decoding networks output $P_{RL}$ and $P_{SRM}$, as shown in Fig. 6. Here, the concatenation of $P_{RL}$ and $P_{SRM}$ is $P$. One might think that the branch layers do not need to be separated, as in Fig. 6, and that RLs and SRMs can be trained all together. However, based on our experience, the training performance is degraded when RLs and SRMs are trained together. White arrows, such as turn right or turn left in $C_{SRM}$, are particularly often misclassified as white RLs in $C_{RL}$, and vice versa. Images of turn right and broken line are shown in Fig. 7.
Fig. 7. Example showing how the limited size of the receptive field causes misclassification between SRMs and RLs.
To reduce this misclassification, we physically separate the two decoder networks for RL and SRM in the early phase of training. In the late phase of training, we connect the two decoder networks while sharing the feature extraction encoder and fine-tune the two decoder networks.
When we train RMNet with the SeRM dataset, the simplest loss that we can use is the cross entropy (CE) loss function defined by

$$L_{CE} = -\sum_{i,j} \sum_{c \in C_{RM}} t_{i,j,c} \log(p_{i,j,c}) \tag{1}$$
Balanced Cross Entropy Loss (BCE)
The most common method for addressing a class imbalance problem is to use balanced cross entropy (BCE) [42], which is based on the cross entropy (CE) loss. For notational simplicity, we denote the per-pixel ground truth $t_{i,j,c}$ and its prediction $p_{i,j,c}$ by $t_c$ and $p_c$, respectively, by removing the image indices $i$ and $j$ in (1). Then, the CE loss of each class in (1) can be written as

$$L_{CE}^{c} = -t_c \log(p_c) \tag{2}$$

BCE is defined by introducing a weighting factor $\alpha_c$ to the CE loss, and it is represented by

$$L_{BCE}^{c} = -\alpha_c \, t_c \log(p_c) \tag{3}$$

where the weighting factor $\alpha_c$ of class $c$ is set by the inverse class frequency or as a hyper-parameter to address class imbalance. This BCE loss is a simple extension of CE, and it will be used as a baseline for our proposed CWL and CWFL.
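The effect of inverse-frequency weighting can be sketched as follows. The per-class loss is assumed to take the usual form $-\alpha_c t_c \log(p_c)$; the pixel counts, the normalisation of the weights to a mean of 1, and the function names are assumptions chosen for illustration.

```python
import math


def inverse_frequency_weights(pixel_counts):
    """alpha_c proportional to 1 / class frequency, normalised to mean 1."""
    total = sum(pixel_counts.values())
    raw = {c: total / n for c, n in pixel_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


def ce_loss(t_c, p_c):
    """Per-class cross entropy: -t_c * log(p_c)."""
    return -t_c * math.log(p_c)


def bce_loss(t_c, p_c, alpha_c):
    """Balanced cross entropy: the CE term scaled by alpha_c."""
    return alpha_c * ce_loss(t_c, p_c)


# Hypothetical pixel counts: background dominates, arrows are rare.
counts = {"background": 9000, "broken line": 900, "turn left": 100}
alpha = inverse_frequency_weights(counts)
```

In this toy setting the rare class "turn left" receives the largest weight, so an equally confident mistake on it costs more than one on the background.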
Focal Loss (FL)
Focal loss (FL) [42] is defined by adding a modulating factor $(1 - p_c)^{\delta}$ to the BCE loss, and it is written as

$$L_{FL}^{c} = -\alpha_c (1 - p_c)^{\delta} \, t_c \log(p_c) \tag{4}$$

The modulating factor reduces the relative weight of well-classified examples and puts more focus on misclassified examples. When an example is misclassified and $p_c$ is small, the modulating factor approaches 1 and the loss remains the same. When an example is well-classified and $p_c$ goes near 1, the modulating factor approaches 0 and the loss is down-weighted, placing less emphasis on well-classified samples. The modulating factor also has a tunable parameter $\delta \in [0,5]$, which adjusts the focusing rate for misclassified examples. As shown in the experimental section, focal loss yielded only a small performance improvement in semantic segmentation.
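A short numerical sketch shows how the modulating factor behaves for positive samples. It assumes the per-class form $-\alpha_c (1 - p_c)^{\delta} t_c \log(p_c)$; the default values of `alpha_c` and `delta` are illustrative and not taken from the experiments.

```python
import math


def focal_loss(t_c, p_c, alpha_c=1.0, delta=2.0):
    """-alpha_c * (1 - p_c)**delta * t_c * log(p_c)."""
    return -alpha_c * (1.0 - p_c) ** delta * t_c * math.log(p_c)


def relative_weight(p_c, delta):
    """Ratio FL / BCE for a positive sample, i.e. the modulating factor."""
    return (1.0 - p_c) ** delta


# Print the modulating factor for a hard, a medium and an easy sample.
for p in (0.1, 0.5, 0.9):
    print(p, round(relative_weight(p, 2.0), 4))
```

Setting `delta` to 0 removes the modulating factor, so the function then reduces to the balanced cross entropy term.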
Class-Weighted Loss (CWL)
In this paper, a new loss, named class-weighted loss (CWL), is proposed. The CWL is inspired by BCE and FL. BCE strikes a balance between classes by assigning a high weighting factor $\alpha_c$ to a small class while assigning a low weighting factor $\alpha_c$ to a large class. However, BCE has the shortcoming of not considering how many samples are misclassified. FL, in contrast, focuses only on misclassified (hard negative) samples, and it does not consider the balance between classes. The proposed CWL considers not only the balance between classes but also the number of misclassified samples. CWL assigns a high weight $\lambda_{c,t}$ not to a small class but to a class whose samples are frequently misclassified. In BCE, classes are weighted based on their size (the number of training samples), whereas in CWL, classes are weighted based on the number of misclassified samples. Further, BCE and CWL differ in that the weighting factor $\alpha_c$ is constant in BCE, whereas the class weight $\lambda_{c,t}$ changes depending on the number of false negatives in CWL. Compared with FL, the notion of false negative is applied to each sample in FL, whereas the same notion is applied to each class in CWL.
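The idea of a class weight that follows misclassification can be sketched as below. The update used here, adding `gamma` times the mean of (ground truth minus prediction) over all pixels, is an assumption made for illustration only and should not be read as the exact update rule of CWL; the function name and the toy data are hypothetical as well.

```python
def update_class_weights(weights, targets, preds, gamma=0.5):
    """One per-epoch update of the class weights.

    ASSUMPTION: each weight moves by gamma times the mean of (t - p) over
    all pixels; the exact update rule of the paper is not reproduced here.
    targets/preds: lists of per-pixel vectors (one-hot GT, softmax output).
    """
    n_pixels = len(targets)
    new_weights = []
    for c, lam in enumerate(weights):
        mean_diff = sum(t[c] - p[c] for t, p in zip(targets, preds)) / n_pixels
        new_weights.append(lam + gamma * mean_diff)
    return new_weights


# Hypothetical three-class example: class 1 is under-predicted (false
# negatives), class 2 is over-predicted (false positives).
targets = [[0, 1, 0], [0, 1, 0], [1, 0, 0]]
preds = [[0.1, 0.2, 0.7], [0.2, 0.3, 0.5], [0.8, 0.1, 0.1]]
weights = update_class_weights([1.0, 1.0, 1.0], targets, preds)
```

Under this assumed rule, the under-predicted class gains weight and the over-predicted class loses weight, while a constant BCE weight would stay fixed regardless of the errors.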
More specifically, CWL is defined as

$$L_{CWL}^{c} = -\lambda_{c,t} \, t_c \log(p_c) \tag{5}$$

where $\lambda_{c,t}$ is a class-dependent weight, $c \in C_{SRM} \cup C_{RL} \cup \{\text{background}\}$, and $t$ is an index for an epoch. Thus, a class $c$ has a different weight $\lambda_{c,t}$ in each epoch. The class-dependent weight $\lambda_{c,t}$ in CWL (5) is not constant, but it is updated in each epoch by
where $\gamma$ is an updating parameter with $\gamma \in [0,1]$; $t_{i,j,c} \in \{0,1\}$ and $p_{i,j,c} \in (0,1)$ denote the per-pixel ground truth and prediction, respectively. Here, $(t_{i,j,c} - p_{i,j,c})$ is computed on a per-pixel basis in each epoch, and the class-dependent weight $\lambda_{c,t}$ is updated by adding the resulting term to it. Intuitively, the physical meaning of the updating rule is as follows: When a number of false negatives ($t_{i,j,c} > p_{i,j,c}$) occur in class $c$, $t_{i,j,c} - p_{i,j,c}$ is positive and the class weight $\lambda_{c,t}$ is increased by (6), placing more emphasis on positive samples. When a number of false positives ($t_{i,j,c} < p_{i,j,c}$) occur in class $c$, $t_{i,j,c} - p_{i,j,c}$ is negative and the class weight $\lambda_{c,t}$ is decreased by (6).

For example, let us consider the case given in Fig. 8. Since the number of classes is 17, $t_{i,j}$ is a 17-dimensional one-hot ground truth vector at the pixel $(i,j)$. Let us suppose that $t_{i,j} = [0, 0, 1, 0, \ldots, 0]$, that is, pixel $(i,j)$ belongs to the third class, and the output at pixel $(i,j)$ from the RMNet is $p_{i,j} = [0, 0.4, 0.1, 0.5, \ldots, 0]$. Obviously, the sample is misclassified: $(t_{i,j,2} - p_{i,j,2}) = -0.4$, $(t_{i,j,3} - p_{i,j,3}) = 0.9$, and $(t_{i,j,4} - p_{i,j,4}) = -0.5$. For class $c = 3$, a false negative occurs, and the class weight $\lambda_{3,t}$ is increased by (6), placing more emphasis on class 3. For classes $c = 2$ and $c = 4$, however, false positives occur, and the class weights $\lambda_{2,t}$ and $\lambda_{4,t}$ are decreased by (6). In summary, when a large number of samples in class $c$ are misclassified as different classes, we increase the weight of class $c$ by (6). Conversely, when a large number of samples in classes other than $c$ are misclassified as class $c$, we decrease the weight of class $c$ by (6).

Fig. 8. Class-dependent weight updated using misclassification.

Class-Weighted Focal Loss (CWFL)

Finally, another loss, named class-weighted focal loss (CWFL), is proposed in this paper. As stated, the notion of false negative is applied to each sample in FL, whereas the same notion is applied to each class in CWL. In CWFL, the notion of false negative is applied to both sample and class. That is, both the class weight factor $\lambda_{c,t}$ of CWL and the modulating factor $(1 - p_c)^{\delta}$ of FL are multiplied into the loss, and the CWFL is defined by

$$L_{CWFL}^{c} = -\lambda_{c,t} (1 - p_c)^{\delta} \, t_c \log(p_c) \tag{7}$$

The two proposed losses, CWL and CWFL, are applied to RMNet, and they are analyzed in the experimental section (Section VII).

Building a lane-level map using the RMs obtained in Section IV as landmarks requires four steps. First, the RMs on the camera image plane are transformed to a vehicle coordinate through inverse perspective mapping (IPM), enabling the RMs to be used for mapping. Second, the RM correction step is applied to correct the error from the IPM, which often occurs because of bumps on the road. A weighted iterative closest point (ICP) [43] is used to correct it in the second step. Third, a pose graph consisting of vehicle poses and RMs is built, and loop closing is considered. In loop closing, revisited locations are recognized, and the pose graph is simplified using the revisited locations. This step is actually the front-end of lane-level mapping. Fourth, the graph built in the third step is optimized in such a way that all the RM locations and vehicle odometries are respected as much as possible.

The RM segments obtained from RMNet should be back-projected into a vehicle coordinate to make a lane-level RM map. The back-projection was conducted using IPM in [44], as shown in Fig. 9. A pixel $(u,v) \in \mathbb{N}^2$ on an image is back-projected into a point $(x,y) \in \mathbb{R}^2$ in the vehicle coordinate by assuming that the camera was installed at the height of $H$ and tilted at the angle of $\theta$. Here, $\mathbf{M}$ is a transform matrix transforming a pixel in the image space into a point in the vehicle coordinate (or in a bird's eye view), as in Fig. 9. Matrix $\mathbf{M}$ can be computed with extrinsic parameters (e.g., tilting angle $\theta$ and installation height $H$) and intrinsic parameters (e.g., focal length of the camera).

Fig. 9. Inverse perspective mapping: a pixel $(u,v) \in \mathbb{N}^2$ is back-projected into a point $(x,y) \in \mathbb{R}^2$.

When a vehicle drives on real roads, it often encounters bumps or sunken parts and shakes up and down (Figs. 10(a) and 10(b)), leading to a location error in mapping RMs. In Fig. 10(a), the road is flat, and the RMs continue to be back-projected to the vehicle coordinate by the IPM. However, in Fig. 10(b), a part of the road is slightly sunken, and the RMs are recovered at the wrong locations by the IPM. A step, called RM correction, was conducted after the IPM to reduce the RM mapping error. In the RM correction, a set of points was sampled from the mapped RMs, and the ICP was applied to them, as shown in Fig. 10(c). From here, the word “mapped” means “re-projected into a vehicle coordinate by IPM and transformed into a world coordinate.” In Fig. 10(c), the blue points $P_t$ were sampled from the RMs mapped at the current time $t$, whereas the red points $P_{t-1}$ were sampled from the same RMs mapped at the previous time $t-1$.

Fig. 10. RM correction using the weighted ICP: (a) RM segmentation on a flat road; (b) RM segmentation on a slightly sunken road; (c) RMs that are corrected with vanilla ICP; and (d) RMs that are corrected with weighted ICP.

However, based on our experience, the vanilla ICP is not enough. More specifically, the mapping error caused by the up and down shaking of the vehicle was mainly longitudinal rather than lateral, as in Fig. 10(b). Many of the RMs were RLs (belonging to $C_{RL}$), which are not suitable for correcting the longitudinal error; hence, the vanilla ICP application did not work well enough, as shown in Fig. 10(c). Accordingly, a weighted ICP that places a particular emphasis on the SRMs in $C_{SRM}$ and a few types in $C_{RL}$ was developed to solve this problem. $P_t$ and $P_{t-1}$ came from the same RMs, and they should be mapped at the same location; thus, the affine transform between them was computed by the weighted ICP. The mapping error was compensated by the affine transform.
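The weighted alignment motivated above can be sketched in 2D as an alternation between same-class nearest-neighbour association and a closed-form weighted rigid fit. This is a minimal sketch under stated assumptions: the class labels and the values in `CLASS_WEIGHT` are hypothetical, points are `(x, y, class)` tuples, and the fit minimises a weighted sum of squared distances, which may differ in detail from the formulation used in the paper.

```python
import math

# Hypothetical class weights: SRMs and stop lines count more than plain
# lane lines, and crosswalk points are ignored (weight 0).
CLASS_WEIGHT = {"srm": 2.0, "stop line": 2.0, "lane": 0.5, "crosswalk": 0.0}


def associate(src, dst):
    """Pair each source point with the nearest target point of the same class."""
    pairs = []
    for (x, y, c) in src:
        same = [(qx, qy) for (qx, qy, qc) in dst if qc == c]
        if same and CLASS_WEIGHT[c] > 0.0:
            q = min(same, key=lambda t: (t[0] - x) ** 2 + (t[1] - y) ** 2)
            pairs.append(((x, y), q, CLASS_WEIGHT[c]))
    return pairs


def weighted_rigid_fit(pairs):
    """Closed-form 2D rotation and translation minimising
    sum_i w_i * |R p_i + T - q_i|^2 over the given pairs."""
    w_sum = sum(w for _, _, w in pairs)
    px = sum(w * p[0] for p, _, w in pairs) / w_sum
    py = sum(w * p[1] for p, _, w in pairs) / w_sum
    qx = sum(w * q[0] for _, q, w in pairs) / w_sum
    qy = sum(w * q[1] for _, q, w in pairs) / w_sum
    s_cos = sum(w * ((p[0] - px) * (q[0] - qx) + (p[1] - py) * (q[1] - qy))
                for p, q, w in pairs)
    s_sin = sum(w * ((p[0] - px) * (q[1] - qy) - (p[1] - py) * (q[0] - qx))
                for p, q, w in pairs)
    theta = math.atan2(s_sin, s_cos)
    tx = qx - (math.cos(theta) * px - math.sin(theta) * py)
    ty = qy - (math.sin(theta) * px + math.cos(theta) * py)
    return theta, (tx, ty)


def weighted_icp(src, dst, iterations=10):
    """Alternate association and weighted fitting; returns the moved source."""
    cur = list(src)
    for _ in range(iterations):
        theta, (tx, ty) = weighted_rigid_fit(associate(cur, dst))
        c, s = math.cos(theta), math.sin(theta)
        cur = [(c * x - s * y + tx, s * x + c * y + ty, k) for x, y, k in cur]
    return cur
```

Because lane points alone constrain the longitudinal direction poorly, giving the few SRM points a larger weight lets them dominate the fit along the driving direction. The sketch does not guard against an empty set of associations.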
Let us suppose that two sets of pointswere sampled from the RMs mapped at timestandt?1, respectively, whereis a subset of points belonging to classcandNcis the size of the set. The affine transform (Rt,Tt) at timetis computed by applying a weighted ICP with a particular emphasis on the SRM and stop line whereR tis a rotation matrix,Ttis a translation vector, andwcdenotes the weight for classcin the RM. Weightwc=λsrmwas set larger than 1 and assigned toCS RM∪ {stop line}. λo.w,which was less than 1, was assigned to other types of RMs,except for crosswalk. Zero was assigned to crosswalk.crosswalk was not used at all in the weighted ICP because the same pattern was repeated in it, and the associations between the points by (10) were often incorrect as shown in Fig. 11,producing a large alignment error in (9). We obtainedandby alternating (9) and (10). The RM at timetis corrected by Fig. 11. Difficulty in the association of crosswalks: ( t ?1)th crosswalk points P t?1 (red) and t th crosswalk points Pt (blue). Fig. 10(d) shows two consecutive RMs corrected by the weighted ICP. Compared with using the vanilla ICP, the alignment error between the consecutive RMs was significantly reduced by the weighted ICP. The corrected RM pointswere used in building the lane-level map as landmarks. The third step is the front-end of the lane-level mapping that denotes graph building for mapping. In the front-end, the RMs obtained in the second step and the vehicle positions are registered as nodes in a pose graph. The RMs are used as landmark nodes of the graph. 
The position of the vehicle, from which the RMs are observed, is registered as the pose node. The odometry measurement obtained from the wheel encoder becomes the edge constraint between the pose nodes. Moreover, the distance to the RM is stored as the measurement edge between the pose and landmark nodes. After the initial graph is built, the graph is simplified or pruned whenever the same place is revisited and loop closure is detected.

1) Graph Building

The relation between a vehicle pose and the RMs observed therein is represented as an edge in the graph. When building a graph, an RM will be observed several times at consecutive time steps and should be registered as a single landmark node, regardless of how many times the RM was observed. In other words, when an RM observed at time $t-2$ is observed again at times $t-1$ and $t$, this RM should be registered not as three landmark nodes but as a single landmark node.

In graph building, a point from the RM is used as the location of the corresponding landmark node in the graph. The most reasonable choice for the point would be the center of the RM, as shown in Fig. 12(a). Unfortunately, however, the center of the RM cannot be used as the location of the landmark node because the RM is likely to be occluded by the preceding vehicles, or a part of the RM may be out of the field of view (FOV) of the camera, as depicted in Fig. 12(b). To solve this problem, an RM is registered as a pair of landmark nodes. Regardless of how many times the RM was observed, the head and tail points of the RM were computed and used as landmark nodes, as in Fig. 13.

Fig. 12. The problem in making the center point of RM as a landmark. (a) When no occlusion occurs; and (b) When occlusion occurs.

Fig. 13. RM registration as two measurement nodes in four consecutive times.

Let us consider the situation in which an RM turn left was observed at four consecutive times.
During the transition from times $t-3$ to $t-2$, the length of the RM turn left was increased, implying that the nearest point from the camera at time $t-3$ was the tail point of the RM turn left, whereas the furthest point from the camera at time $t-3$ was not the head point of the RM turn left. At time $t-3$, only the tail point indicated by a circle was registered as the landmark node in the graph. When the RM turn left kept having a similar length over a few consecutive times (e.g., from $t-2$ to $t-1$, Fig. 13), the nearest and furthest points from the camera corresponded to the tail and head points of the RM turn left, and both points were registered as the landmark nodes in the graph. During the transition from times $t-1$ to $t$, the length of the RM turn left was decreased, implying that the furthest point from the camera at time $t$ was the head point of the RM turn left, whereas the nearest point from the camera at time $t$ was not the tail point of the RM turn left. At time $t$, only the head point indicated by a star was registered as the landmark node in the graph.

The instantaneous localization error caused by the low-cost odometry and the vehicle shaking on the bumpy road can be compensated to an extent by the weighted ICP given in Section VI-B. However, a small error will remain even after the weighted ICP. When it is accumulated over a huge environment, it will seriously degrade map accuracy. Here, the accumulated error was compensated for by detecting a revisited region and closing the loop in the graph. Let us suppose that our vehicle drives clockwise around a town, as shown in Fig. 14(a), and five pose nodes are generated in the order of N1, N2, …, N5, as shown in Fig. 14(b). Here, the black solid line in Fig. 14(a) denotes the actual trajectory that our vehicle follows, and the blue solid line in Fig. 14(b) denotes the trajectory estimated based on odometry. Along the blue estimated trajectory, five pose nodes N1, N2 through N5 are generated.
N1 is the pose node generated at the starting position, and N5 is the pose node generated when the vehicle revisits the starting position. Obviously, when the vehicle position is estimated using odometry, the estimation error will be accumulated during the driving, and N5 will be registered as a new node which is completely different from N1. But if the loop closure is detected (if we come to know that N5 is the same as N1), N5 will be dragged to N1 and the two pose nodes will be merged into one, as shown in Fig. 14(c). Merging N1 and N5 will reduce the position error not only at N5 but also at other nodes on the loop such as N2, N3 and N4, as shown in Fig. 14(d).

Fig. 14. (a) Ground truth trajectory; (b) Trajectory estimated by odometry; (c) Loop closure; (d) Error correction.

2) Loop-Closure Detection

In this paper, the DBoW2 developed in [19] was used to detect the revisited region using camera images and close the loop in the graph. Each image from the camera is represented by a bag-of-words (BoW) vector $v$, which is actually the histogram of the ORB features that appear in the image. The similarity $s(v_i, v_j)$ between two BoW vectors $v_i$ and $v_j$ is defined by (12). Using the similarity, when the BoW vector $v_i$ of the $i$th image is given, another image associated with the $i$th image is selected by

$$c_i = \arg\max_{j} s(v_i, v_j), \quad i - \Delta t > j, \qquad (13)$$

where $c_i$ is the index for the image associated with the $i$th image. The BoW vectors of two images close in the time domain are likely to have a high similarity score; hence, we added a parameter $\Delta t$ in (13) to prevent the vectors close in the time domain from being associated as a loop-closing pair. Furthermore, another constraint, $c_i > c_{i-1}$, was imposed on the candidate index.

Fig. 15. Loop-closure detection. The loop closure is obtained by computing the similarity scores between the $i$th BoW vector and the candidates.

When the $i$th image vector $v_i$ is given, its loop closing is tested with the images having an index larger than $c_{i-1}$ and an index smaller than $i - \Delta t$.
The image with the highest similarity score $s(v_i, v_j)$ with $v_i$ was selected as the loop closing candidate for the $i$th image, denoted in red in Fig. 15. When $s(v_i, v_j)$ is larger than a certain threshold, the candidate is considered as a loop closing pose, and the RMs observed at pose $i$ and pose $j$ are registered as landmark nodes of the graph through the method in Section VI-C-1). Afterwards, the mapping errors are minimized by merging the landmark nodes observed in pose $i$ and pose $j$ through graph optimization.

Fig. 16. Segmentation results of each network. Our RMNet shows the best performance.

Here, one might think that it is quite uncommon for the same vehicle to revisit an area in the real world, but we can think of two possible scenarios in which loop closure occurs. First, in the current autonomous driving system, a mapping vehicle and autonomous vehicles are clearly separated. The role of the mapping vehicle is only mapping, and the mapping vehicle should make as many loop closures as possible. Second, as stated in Section I, a monocular camera has been installed on a number of vehicles. If the images taken by countless vehicles are collected in a cloud platform in the future, we can find a number of loop closures in the cloud.

The back-end module is the last step of our lane-level RM mapping, which is actually the graph optimization. A graph that considers loop closing was built in the front-end. The graph was then optimized in the back-end. Several off-the-shelf optimization frameworks can be used, and g2o [20] was used herein to optimize the graph. The initial vehicle pose was used as the origin in the graph coordinate. Finally, the origin was mapped into the geographic coordinate using GPS information, and the lane-level RM map was built in the geographic coordinate. The next section presents the results of the optimized vehicle trajectories and lane-level RM maps.
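The loop-closure candidate selection of Section VI-C-2) can be sketched as follows. This is an illustrative sketch only: BoW vectors are modeled as plain dictionaries from word index to weight, and the similarity is assumed to be the L1 score commonly used with DBoW2-style vocabularies, which may differ from the paper's definition in (12). The function names, the toy vectors, and the threshold value are hypothetical.

```python
def l1_score(v1, v2):
    """Assumed L1 similarity in [0, 1] between two BoW vectors (dict: word -> weight)."""
    n1 = sum(abs(x) for x in v1.values())
    n2 = sum(abs(x) for x in v2.values())
    if n1 == 0 or n2 == 0:
        return 0.0
    words = set(v1) | set(v2)
    diff = sum(abs(v1.get(w, 0.0) / n1 - v2.get(w, 0.0) / n2) for w in words)
    return 1.0 - 0.5 * diff

def select_loop_candidate(i, bows, prev_candidate, dt, threshold):
    """Pick the most similar earlier image j for image i, or None.

    Only indices j with j > prev_candidate and i - dt > j are tested, and the
    best score must exceed the threshold to be accepted as a loop closure.
    """
    best_j, best_s = None, threshold
    for j in range(prev_candidate + 1, i - dt):
        s = l1_score(bows[i], bows[j])
        if s > best_s:
            best_j, best_s = j, s
    return best_j

# Toy sequence: image 4 shows the same place as image 0.
bows = [
    {1: 2.0, 2: 1.0},
    {3: 1.0},
    {4: 1.0},
    {5: 1.0},
    {1: 2.0, 2: 1.0},
]
candidate = select_loop_candidate(4, bows, prev_candidate=-1, dt=2, threshold=0.5)
```

Here `candidate` is the index of the revisited image; images within `dt` frames of the query are never tested, which mirrors the role of $\Delta t$ in (13).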
In this section, the validity of the proposed RMNet and the subsequent lane-level RM mapping is demonstrated by applying them to the SeRM dataset. The images in the SeRM dataset were captured in various environments under various weather conditions. The SeRM dataset also includes the odometry and GPS position of the vehicle on which the camera was installed, and loop closure was manually annotated.

In this subsection, the proposed RMNet and other semantic segmentation networks are applied to the RM segmentation using the SeRM dataset. Since the RMNet is developed for low-end MMS, the network is compared with small or real-time segmentation networks. The compared networks are FCN-based segmentation networks [27], [45] and state-of-the-art real-time semantic segmentation methods [46], [47]. For fair comparison, all the competing networks use the same backbone networks. ResNet-18 [48] and VGG 16 [49] are used as backbone networks. Experiments are conducted using PyTorch and Python, and the networks are trained and tested on a GTX 1080Ti.

All the competing methods are compared in terms of base model, computational complexity, and storage complexity in Table III. The unit of the computational complexity is Giga FLoating point OPerations (GFLOPs), which is widely used to describe how many operations are required to run a single instance of a given network model. The storage complexity is measured using the number of parameters (weights) which the network has. When the same backbone networks are used, all the competing networks consume comparable computation and storage, as shown in Table III.

TABLE III COMPUTATIONAL AND STORAGE COMPLEXITIES ANALYSIS OF NETWORKS

The performance of the competing networks is compared in terms of precision, recall, F1-score, and mean intersection over union (mIoU) in Table IV. In the table, the best performance is indicated in bold. RMNet+CE denotes a vanilla RMNet trained using cross entropy.
RMNet+CWL and RMNet+CWFL denote RMNets trained using CWL and CWFL, respectively. As shown in Table IV, RMNet+CWFL demonstrates the best performance in all measures except recall. In particular, it has the highest mIoU, which is the most important measure for semantic segmentation. Here, the interesting point is that the vanilla RMNet has higher mIoU than SwiftNet or BiSeNet. This implies that the multiple branches in RMNet are particularly effective for RM segmentation. Further, the application of CWL or CWFL to RMNet increases its performance. The vanilla RMNet (RMNet+CE) has slightly lower F1-score and recall than SwiftNet and similar mIoU to that of SwiftNet. However, RMNet+CWL significantly outperforms SwiftNet in mIoU, and RMNet+CWFL surpasses SwiftNet in all measures. The performances show that CWL and CWFL make a significant contribution to handling the class imbalance problem.

TABLE IV PERFORMANCE OF THE SEMANTIC SEGMENTATION OF EACH DEEP NETWORK

For further analysis, the mIoU for each class is summarized in Table V. RMNet+CWFL and RMNet+CWL take the first and second places in total mIoU. In particular, they demonstrate relatively consistent mIoU regardless of the classes. SwiftNet, however, demonstrates quite low mIoU in a few classes with small samples, such as turn right or turn left.

TABLE V COMPARISON WITH PREVIOUS NETWORKS IN MIOU

Fig. 16 shows some of the RM segmentation results. The false positives and the false negatives are indicated by the yellow and red circles, respectively. The class that each color represents is at the bottom of the figure. As shown in Fig. 16, FCN and W-FCN have frequent false predictions in some complex classes such as the Text class. BiSeNet also shows false predictions for the pixels which are far from the camera, such as Crosswalk in the 2nd column or RLs in the 4th and 5th columns. In SwiftNet, false negatives often occur for Crosswalk and for imbalanced classes such as [Turn right, Ahead or Turn right], as shown in the 3rd–6th columns.
The proposed RMNet, however, demonstrates good segmentation results consistently in all complex classes and imbalanced classes, and for pixels which are far from the camera. More examples of the RM segmentation results are included in a supplementary video at https://youtu.be/h4pIEwkPDd0.

Fig. 17. Semantic segmentation results of RMNet applied to challenging conditions.

Further, to see the validity of the proposed RMNet, we apply it to public datasets other than the SeRM set. ApolloScape [40] and VPGNet [15] are public datasets which provide per-pixel GT about RM. RMNet is applied to them and the results are summarized in Table VI. When applied to ApolloScape, given in Table VI(a), only 17 classes are used to evaluate our RMNet, as in [40]. Here, “s_w_d” is short for solid, white, and dividing and “a_w_tl” is short for arrow, white, and thru & left turn, by combining the first letters of type, color, and usage. All of scenes 2 and 3 and half of scene 4 in the ApolloScape dataset are used for training; the other scenes and the remaining half of scene 4 are used for testing. Compared to the baseline given in [40], RMNet+CWFL demonstrates better mIoU.

TABLE VI PERFORMANCE OF RMNET ON OPEN DATASETS

TABLE VII PERFORMANCE OF RMNET UNDER CHALLENGING CONDITIONS

The VPGNet dataset [15] includes four categories: scene_1: daytime, no rain; scene_2: daytime, rain; scene_3: daytime, heavy rain; scene_4: night. First, we measure overall performance using all four categories. That is, we select test images from each category, making a total of 4000 test images. RMNet is trained on the remaining images in VPGNet. As in [15], RMNet is evaluated using only 8 classes. Since the VPGNet dataset is quite new, we cannot find any result about RM segmentation. Only detection results were reported in [15]. Thus, RMNet+CWFL is not compared with other methods.
Considering that the mIoU is 0.5462, we can see that RMNet works reasonably well on the VPGNet dataset. Next, we applied RMNet to more challenging scenes in the VPGNet dataset to see the limitations of our RMNet. Night scenes and rainy scenes are considered. In the case of night scenes, we train RMNet on the first three categories (scene_1, scene_2, scene_3) and test on scene_4. In the case of rainy scenes, we train RMNet on scene_1, scene_2, and scene_4, and test on scene_3. The results for challenging scenes are summarized in Table VII. Compared to the mIoU in Table VI(b), the performance is slightly degraded under the night or rainy conditions, as shown in Table VII. The performance at night is much lower than in rainy conditions. In the case of night scenes, illumination is very low, and the images are quite dark overall. Even humans have difficulty in recognizing the colors of the lanes due to the headlights of the approaching vehicles, which explains why the mIoUs of “single (RL, w)” and “single (RL, y)” are quite low. In the case of rainy scenes, illumination is much higher than in night scenes, and thus the performance is not degraded as much as in night scenes. The major difficulties in rainy scenes are either reflections from water on the road or raindrops on the windshield.

These qualitative results are summarized in Fig. 17. In each condition, the first column is an example in which RMNet works relatively well, but the remaining two columns are examples in which RMNet predicts incorrect results. In the case of night scenes, the safety zone on the left of the figure is incorrectly segmented as a yellow road lane in the second column, and the yellow road lane in the image is misclassified as a white lane in the third column. Obviously, the misclassification is due to the low illumination in night scenes.
In the case of rainy scenes, the stop line in the middle of the figure is not detected in the second column due to the reflection from the water, and the overall segmentation results are degraded in the third column due to the raindrops on the windshield.

One of the key factors which affects the performance of the lane-level RM map is loop closure detection. In this subsection, a lane-level RM map is built by fully utilizing the loop closure detections. The three routes included in the SeRM dataset are designed to include various loop closures. For example, let us consider the frames 1940 to 11380 in “route 2”. The RMs in the frames of the interval can be depicted based on odometry as in Fig. 18(a).

Fig. 18. Example showing the effect of loop closure. (a) A map based on odometry; (b) A map after loop closure.

Our vehicle drives clockwise and observes the RMs. When we recognize that frame 1940 looks quite similar to frame 8844, the loop closure occurs and the RMs in frame 8844 are dragged to the corresponding RMs in frame 1940, as shown in Fig. 18(b). Merging the landmark nodes in frames 1940 and 8844 aligns other RMs observed during the interval and reduces the mapping error that is accumulated through the entire route, as shown in the figure.

Then, let us consider the lane-level mapping using the three routes given in Fig. 4. For an accurate evaluation of the lane-level mapping, the lane-level map built herein should be compared with the GT map of the three routes. Unfortunately, the GT position of RMs is not provided in the SeRM dataset. The lane-level mapping performance can instead be evaluated based on the localization error using the lane-level map because the RTK-GPS signal is provided in the SeRM set. Thus, the estimated trajectories are compared with the GT trajectories provided by RTK-GPS. The estimated trajectory of the proposed method is given and compared with the true trajectory (GT) in Fig. 19.
In the left column of the figure, the trajectories of the proposed method, GT, and that obtained by the wheel encoder are indicated in black, red, and blue, respectively. In the right column, the deviation of the proposed trajectory from the true one is displayed in color. Blue denotes a small localization error, whereas red depicts a large error. The error was kept low in most areas when the environment was as simple as that in Fig. 19(d).

Fig. 20 shows the final lane-level RM maps built by the proposed method. The results were compared against the aerial view map. Figs. 20(a) and 20(c) depict the various complex RMs (e.g., straight-prohibition or U-turn) that were segmented and mapped well. Fig. 20(e) illustrates the lane-level map of the long road with the broken lines and the SRMs, where the long road was well mapped. The proposed method successfully worked even for a large environment. A supplementary video was submitted with this paper to demonstrate the value of the proposed method. The video is available at https://youtu.be/h4pIEwkPDd0.

Fig. 19. Estimated trajectories vs. true trajectories obtained by RTK-GPS: three rows correspond to routes 1, 2, and 3, respectively; first column: estimated and true trajectories; and second column: localization error superimposed on the trajectories.

Fig. 20. Lane-level mapping results. (a) Road with straight-prohibition. (b) Road with various road markings. (c) Road on which the number of lanes increases. (d) Road with text and U-turn marking. (e) A long road (approximately 150 m) with many broken lines and number markings.

The proposed mapping method is compared with the odometry and a previous method for the quantitative analysis. Since the absolute positions of RMs are unknown and thus the direct evaluation of mapping accuracy is infeasible, we compared all the competing methods in terms of localization accuracy. All trajectories are evaluated based on the GT trajectory acquired from the RTK-GPS.
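The two localization measures used in this comparison, the root-mean-square error and the maximum error of the estimated positions against the RTK-GPS positions, can be computed as in the following minimal sketch. The function name and the toy trajectories are hypothetical, and the sketch assumes the two trajectories are already time-aligned and expressed in the same 2-D coordinate frame.

```python
import math

def trajectory_errors(gt, est):
    """Return (RMSE, max error) of the Euclidean position errors of paired 2-D points."""
    if not gt or len(gt) != len(est):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-frame Euclidean distance between the GT and estimated positions.
    errors = [math.hypot(g[0] - e[0], g[1] - e[1]) for g, e in zip(gt, est)]
    rmse = math.sqrt(sum(d * d for d in errors) / len(errors))
    return rmse, max(errors)

# Toy example with position errors of 3 m and 4 m.
gt_track = [(0.0, 0.0), (1.0, 0.0)]
est_track = [(0.0, 3.0), (1.0, 4.0)]
rmse, max_err = trajectory_errors(gt_track, est_track)
```

The RMSE summarizes the average drift along a route, whereas the maximum error exposes the worst local deviation, which is why both are usually reported together.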
RMSE and max errors are defined by

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left\| x_t^{\mathrm{GT}} - x_t \right\|^2}, \qquad \mathrm{Max} = \max_{t} \left\| x_t^{\mathrm{GT}} - x_t \right\|,$$

respectively, and are provided in Table VIII, where $x_t^{\mathrm{GT}}$ and $x_t \in \mathbb{R}^2$ denote the GT position and estimated position, respectively, and $T$ is the length of each route. The baseline is the simple mapping using only wheel odometry. The proposed mapping system outperformed the vanilla odometry and the method using weighted-FCN [27] in all measures. The second row of the table (weighted FCN) is an algorithm in which the semantic segmentation method reported in [27] was used, while the mapping method was the same as that developed herein. The proposed method outperformed the algorithm in the second row because the more accurately the RMs are segmented, the better these RMs are associated with each other and the better the resulting graph is.

In this subsection, an ablation study is conducted to compare the losses (CE, FL, CWL, and CWFL) used in this paper. The results are summarized in Table IX. mIoUs are evaluated while varying the parameter in each loss in Tables IX(b)–IX(d), and the best values are indicated in bold. In Table IX(a), all four losses are compared in terms of mIoU using their respective best parameters. The best parameter of each loss is selected by evaluating mIoU while varying the parameter, as shown in Tables IX(b)–IX(d). In Table IX(a), the CE is the same as the vanilla RMNet in Table V. Compared to the other losses, CWFL is the most effective in handling class imbalance.

In this paper, a new dataset, named SeRM, and a lane-level RM mapping method have been developed. The proposed lane-level map is a landmark-based map that uses RLs and SRMs as landmarks. The SeRM dataset will be made publicly available upon publication of this paper. The SeRM dataset can be used not only on its own, but also by being combined with other datasets, or being transformed to synthetic images as well, which lowers the need for other scholars to label data for mapping.
RMNet was also proposed and trained on the new SeRM dataset to extract the RMs from the images of the monocular camera. To build a lane-level map that uses the RMs as landmarks, the RM locations were corrected by the weighted ICP and graph optimization. The proposed mapping system worked well for large environments. All the experiments and the submitted video clip showed that the proposed system can be implemented at a low price, and that the lane-level map built by the proposed method is sufficiently accurate.

The main strength of the proposed lane-level RM map is that the final map consists of landmarks (RMs), and thus the map is much lighter than the point cloud map built by the classical MMS. Ironically, however, this trait is also the limitation of our RM map: when there are not enough RMs, the mapping and localization accuracy can be degraded. To solve the problem, further research along this direction should be conducted. For example, a possible solution could be the use of statistical sparse data handling methods such as [50]–[55], or the fusion with additional low-cost sensors such as an IMU.

TABLE VIII RESULTS OF THE LANE-LEVEL RM MAP BUILDING

TABLE IX ABLATION EXPERIMENTS FOR RMNET WITH FOCAL LOSS, CLASS-WEIGHTED LOSS, AND CLASS-WEIGHTED FOCAL LOSS

VI. BUILDING A LANE-LEVEL ROAD MARKING MAP
A. Inverse Perspective Mapping
B. RM Correction
C. Front-End: Graph Building
D. Back-End: Graph Optimization
VII. EXPERIMENTAL RESULTS
A. Performance Analysis of RMNet
B. Performance Analysis of Lane-Level Mapping
C. Ablation Study
VIII. CONCLUSIONS
IEEE/CAA Journal of Automatica Sinica, 2022, Issue 1