Liang Yang, Bing Li, Wei Li, Howard Brand, Biao Jiang, and Jizhong Xiao, Senior Member, IEEE
Abstract—The concrete aging problem has gained more attention in recent years as more bridges and tunnels in the United States lack proper maintenance. Though the Federal Highway Administration requires these public concrete structures to be inspected regularly, on-site manual inspection by human operators is time-consuming and labor-intensive. Conventional approaches for concrete inspection, which use RGB image based thresholding methods, are not able to determine metric information or accurate location information for the assessed defects. To address this challenge, we propose a deep neural network (DNN) based concrete inspection system using a quadrotor flying robot (referred to as CityFlyer) mounted with an RGB-D camera. The inspection system introduces several novel modules. Firstly, a visual-inertial fusion approach is introduced to perform camera and robot positioning and 3D metric reconstruction of the structure. The reconstructed map is used to retrieve the location and metric information of the defects. Secondly, we introduce a DNN model, namely AdaNet, to detect concrete spalling and cracking, with the capability of maintaining robustness under various distances between the camera and the concrete surface. In order to train the model, we craft a new dataset, i.e., the concrete structure spalling and cracking (CSSC) dataset, which is released publicly to the research community. Finally, we introduce a 3D semantic mapping method using the annotated frames to reconstruct the concrete structure for visualization. We performed comparative studies and demonstrated that our AdaNet achieves 8.41% higher detection accuracy than ResNets and VGGs. Moreover, we conducted five field tests, of which three are manual hand-held tests and two are drone-based field tests. These results indicate that our system is capable of performing metric field inspection and can serve as an effective tool for civil engineers.
STRUCTURAL health monitoring (SHM) plays a significant role in performance evaluation and condition assessment for the nation's highway transportation assets. SHM can augment the operational safety and longevity of highway transportation assets based on data-driven analysis and decision-making. The Federal Highway Administration (FHWA) of the U.S. Department of Transportation (DOT) launched the Long-Term Bridge Performance (LTBP) program in 2015 to facilitate SHM by collecting critical performance data [1]. According to the FHWA's latest bridge element inspection manual [2], the New York bridge inspection manual [3], and the tunnel operations, maintenance, inspection, and evaluation (TOMIE) manual [4], it is crucial to identify, measure, and evaluate condition states during routine inspections of bridges and tunnels. Such condition states include concrete spall (delamination, patched area), exposed rebar, cracking, abrasion (wear), and other damages.
Several robotic inspection systems have been developed for automated concrete inspection. Lim et al. [5] proposed a visual pavement crack inspection and mapping system using a mobile robot platform. The robot used a camera to perform visual inspection with an edge detection algorithm combined with a machine learning method, and a lidar was used for location tagging and mapping. Under the support of the FHWA LTBP program, Prasanna et al. [6], [7] proposed an autonomous bridge deck inspection mobile robotic system using a mono-visual camera, ground penetrating radar, and acoustic sensors. The robot was developed to perform pavement crack detection on relatively planar surfaces. Unmanned aerial vehicles (UAVs) have also been deployed for bridge visual inspection [8]. UAVs are able to perform remote inspection of areas that are not accessible to human operators. However, none of these robotic inspection systems were able to retrieve metric information of the defects, such as width, length, and area. Also, though these robotic systems used GPS to obtain location information, the accuracy was not sufficient to build a 3D map for visualization, nor were they applicable in GPS-denied areas.
To facilitate automatic inspection, acoustic sensors [9], [10], ground penetrating radar [11], and visual cameras [12]–[14] have been the three most commonly used sensors in the civil engineering community over the past decade. For visual camera-based inspection, previous research mainly focused on entropy or intensity thresholding methods that highlight high-contrast, visually distinct areas. These methods include edge detection, the fast Fourier transform (FFT), and the fast Haar transform (FHT). Besides pure thresholding methods, researchers also introduced detection algorithms that combine image segmentation, image thresholding (such as Otsu's method [15]), and morphology operations [15] to produce high-quality detection results. Histogram analysis and automatic peak detection approaches were also used for visual inspection [16]. A crack defragmentation approach for fragment grouping and fragment connection was proposed in [17], where an artificial neural network (ANN) was introduced for crack classification. However, these methods only work well on simple, clean surfaces and are not able to indicate defect categories.
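For illustration, the snippet below is a minimal sketch of such a conventional pipeline using OpenCV: Otsu intensity thresholding followed by morphology operations. It is not the specific method of [15]–[17]; the file names and kernel size are placeholder assumptions.

```python
import cv2
import numpy as np

# Load a grayscale concrete-surface image (path is a placeholder).
gray = cv2.imread("concrete_surface.png", cv2.IMREAD_GRAYSCALE)
assert gray is not None, "placeholder image not found"

# Otsu's method picks a global threshold from the intensity histogram;
# dark cracks are highlighted against the brighter concrete background.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Morphology operations suppress isolated noise and reconnect thin crack fragments.
kernel = np.ones((3, 3), np.uint8)
cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel, iterations=2)

cv2.imwrite("crack_mask.png", cleaned)
```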
In this paper, we propose an automatic robotic system for concrete structure visual inspection that uses an RGB-D camera with a deep neural network and an RGB-D reconstruction method to build a 3D map with defects highlighted, as illustrated in Fig. 1. Unlike previous research that only performed crack or spalling detection using pure RGB images, we introduce an RGB-D visual simultaneous localization and mapping (SLAM) method for structure reconstruction and combine it with a deep neural network to recognize and highlight defects. The defects are registered and labeled in the 3D map to reveal their physical location in the 3D structure model, facilitating condition assessment. Furthermore, we introduce a depth adaptive window size predictor based on depth in-painting to effectively predict the optimal sliding window size. Then, a sliding window based multi-resolution detection model is used to detect the defect area. Finally, to visualize the defects, we introduce a conditional random field (CRF) method to perform 2D to 3D registration and fusion.
Fig. 1. Illustration of our robotic inspection system, which is developed based on the CityFlyer to perform concrete structure inspection. The right-side image shows the inspection result obtained under a bridge located at Riverside Drive & West 155th Street, New York. The robot is equipped with an RGB-D camera which is used for localization and 3D mapping with defect visualization.
Extending our preliminary work [18], [19], instead of using VGGs with a fixed sliding window size to solve the detection problem, we propose a depth adaptive model to optimize the detection. To summarize, our main contributions are:
1) A high-quality labeled dataset for crack and spalling detection, which is the first publicly available dataset for visual inspection of concrete structures. It has 522 labeled crack images, 298 spalling images, and over 10 000 field-collected images of concrete structures.
2) A robotic inspection system with visual-inertial fusion that obtains pose estimation using an RGB-D camera and an IMU. The visual-inertial system has a 100 Hz pose estimation rate to enable online navigation and 3D mapping.
3) A depth in-painting model that fills depth holes in an end-to-end manner with real-time performance.
4) A multi-resolution model that adapts to image resolution changes and allows accurate defect detection in the field.
We propose a novel robotic inspection system using the CityFlyer [20], which consists of a control and mission module (CMM), a visual-inertial positioning module, and a deep inspection and 3D registration module, as illustrated in Fig. 2. The CMM implements autonomous navigation and is developed under the Robot Operating System (ROS) platform. The CMM receives visual-inertial odometry (VIO) as feedback to navigate the CityFlyer. The VIO has a 100 Hz frame rate that meets state control requirements and also keeps the frame-to-frame pose estimation error small (within 10 cm). Meanwhile, the concrete defect prediction output is registered to 3D space using the depth information, and the target defect's 3D location and surface normal [21] are used to navigate the CityFlyer to the best viewing angle. By navigating the CityFlyer to a front view perspective of the target defect area, our system can achieve better inspection data acquisition.
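As an illustration of this viewpoint selection, the sketch below derives a front-view goal pose from a defect's 3D centroid and outward surface normal. The function name, stand-off distance, and frame conventions are our own assumptions, not the exact navigation logic of [21].

```python
import numpy as np

def front_view_goal(defect_center, surface_normal, stand_off=1.0):
    """Place the camera `stand_off` meters away from the defect along its
    outward surface normal, looking back at the defect center.
    Returns the goal position and a unit viewing direction (world frame)."""
    n = np.asarray(surface_normal, dtype=float)
    n /= np.linalg.norm(n)                    # normalize the surface normal
    goal_position = np.asarray(defect_center, dtype=float) + stand_off * n
    view_direction = -n                       # camera optical axis points back at the surface
    return goal_position, view_direction

# Example: a spalling region centred 3 m ahead on a vertical wall facing the robot.
pos, view = front_view_goal(defect_center=[3.0, 0.5, 1.2],
                            surface_normal=[-1.0, 0.0, 0.0],
                            stand_off=1.5)
```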
The visual-inertial positioning module fuses the output of visual odometry and IMU propagation to achieve real-time pose estimation of the CityFlyer. We use an ASUS Xtion Pro RGB-D camera as the visual perception unit to perform pose estimation and 3D perception; its data sheet is listed in Table I. The IMU is a Phidgets Spatial 3/3/3 sensor. The VIO fusion follows two steps. First, RGB and depth images are used to estimate the pose of the UAV using feature matching and optimization approaches [22]. Second, we implement a multi-state extended Kalman filter (MS-EKF [23]) to fuse the IMU state propagation and the visual odometry observations, allowing real-time positioning and control at 100 Hz. It should be noted that we perform an off-line calibration to obtain the transformation between the camera and the CityFlyer body.
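The following is a deliberately simplified, loosely-coupled sketch of this predict/update pattern: IMU-driven propagation runs at the high rate and each lower-rate VO pose corrects the state. It is not the MS-EKF of [23] (no quaternion state, no multi-state window); all names and noise values are illustrative assumptions.

```python
import numpy as np

class SimpleVIOFusion:
    """Toy position/velocity Kalman filter: IMU acceleration drives the
    prediction at ~100 Hz, VO position observations correct it at a lower rate."""
    def __init__(self):
        self.x = np.zeros(6)                 # [position(3), velocity(3)] in the world frame
        self.P = np.eye(6) * 0.1             # state covariance
        self.Q = np.eye(6) * 1e-3            # process noise
        self.R = np.eye(3) * 1e-2            # VO measurement noise
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # VO observes position only

    def propagate(self, accel_world, dt):
        """High-rate prediction driven by the (gravity-compensated) IMU acceleration."""
        F = np.eye(6)
        F[:3, 3:] = np.eye(3) * dt           # p += v * dt
        self.x = F @ self.x
        self.x[3:] += accel_world * dt       # v += a * dt
        self.P = F @ self.P @ F.T + self.Q

    def update_vo(self, vo_position):
        """Low-rate correction from a visual-odometry position measurement."""
        y = vo_position - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)               # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```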
The adaptive defect detection and 3D registration module is proposed to solve the significant problem of providing metric information during inspection, allowing civil engineers to perform condition evaluation [4] and to have context on the spatial characteristics and locations of the defects. An AdaNet with depth in-painting and a multi-resolution approach is proposed to augment defect detection accuracy. We first introduce a depth-varying sliding window size optimizer. Then, the detection result is registered and fused into a 3D map for visualization.
Fig. 2. The CityFlyer inspection system uses an RGB-D camera and an IMU to perform online visual positioning and navigation. We propose an F-ResNet-34 model to perform defect detection. Afterward, 3D reconstruction and registration are performed to visualize the inspection result.
TABLE I ASUS Xtion Pro Data Sheet
This section discusses the DNN model-based concrete inspection method, which is able to provide the defects' 2D region information by taking RGB images as inputs. Inspired by feature pyramids [24], we propose a Multi-resolution DetectionNet taking multi-resolution RGB image inputs to detect the concrete defects. Moreover, we introduce a depth adaptive sliding-window size selection method, with the capability to adjust the bounding box size based on the distance to the surface. In the rest of this section, we provide a comprehensive theoretical analysis of the model, and we also compare the detection performance between our AdaNet and ResNets [25], VGGs [26], and AlexNet [27].
For visual inspection, we treat the concrete defect detection task as a multi-class classification problem. For all input images $X = \{x_1, x_2, \ldots, x_N\}$, $N$ denotes the number of images, which fall into three categories, i.e., crack, spalling, and background. Each image $x_i \in \mathbb{R}^3$ is associated with a ground truth label $y_i \in Y$, where $Y \subset \mathbb{N}$ is the set of natural numbers starting from 0. The detection goal is to find a mapping function $f: X \to Y$ that minimizes a pre-defined loss $Loss(x, y)$. For the label $Y$, we encode the label of each image as an integer from $\{0, 1, \ldots, n-1\}$, where $n$ denotes the number of classes. In this paper, we define the crack images' label as 1, the spalling images' label as 2, and the background images' label as 0.
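A minimal sketch of this formulation in PyTorch is given below; the tiny convolutional classifier is only a placeholder for AdaNet, and the random tensors stand in for cropped patches.

```python
import torch
import torch.nn as nn

# Label encoding used throughout the paper: 0 = background, 1 = crack, 2 = spalling.
CLASS_NAMES = {0: "background", 1: "crack", 2: "spalling"}

model = nn.Sequential(              # stand-in classifier; AdaNet replaces this in practice
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 3),
)
criterion = nn.CrossEntropyLoss()   # the pre-defined loss Loss(x, y)

images = torch.randn(8, 3, 224, 224)          # a batch of cropped patches
labels = torch.randint(0, 3, (8,))            # ground-truth labels y_i in {0, 1, 2}
loss = criterion(model(images), labels)
loss.backward()
```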
There is no publicly accessible concrete defect dataset available to train our model, let alone an RGB-D dataset with depth information. In order to train the inspection model for defect detection, we developed a new concrete structure spalling and cracking (CSSC) dataset. We met and organized discussions with civil engineers to catalog the terminology used in concrete defect assessment applications. This provided key terms for image-based search engines and allowed us to mine images from image search results. The terms used for web-based data mining are listed below:
1) Concrete spalling/rebar: concrete spalling, concrete rebar, concrete delamination, concrete bridge spalling, concrete column spalling, concrete spalling from fire, concrete spalling repair, and concrete wall.
2) Concrete crack: concrete crack, crack repair, concrete scaling, concrete crazing, and concrete crazing texture.
We searched for image data through Google, Yahoo, Bing, and Flickr, and collected a total of 954 concrete crack images and 278 concrete spalling images. For spalling, we further added 20 images collected from the field, obtaining a total of 298 spalling images for training and validation purposes.
After assembling the crack and spalling images, we annotated them using Photoshop. Some of the annotated images are shown in Figs. 3 and 4. For spalling images, we annotated the exposed rebars and the regions (contours) of spalling damage. These are the two regions of interest for civil engineering diagnosis, with areas of exposed rebar indicating more serious degradation. Examples of exposed rebar and spalling contour annotation are shown in Fig. 4. For concrete cracks, the annotators were asked to carefully annotate the entire crack area in order to produce a binary mask as ground truth (shown in Fig. 3).
Fig. 3. Illustration of raw crack images and the annotated binary images in the CSSC dataset.
Fig. 4. Illustration of raw spalling images and the annotated images in the CSSC dataset. We annotated the exposed rebars and the spalling contour. In the third row, the red pixels denote the background, and the pink pixels denote the spalling area.
Since our AdaNet is a sliding-window detector, we randomly crop the images around the regions of interest (ROIs) using two size settings: 100×100 and 130×130, as illustrated in Fig. 5. For each cropped image, we determine whether it is a defect or a background image via the rule defined in (1). We first count the number of pixels, $n(I)$, located inside the defect region of the corresponding label image (with a total of $N(I)$ pixels). If the number of defect pixels is greater than or equal to a pre-defined threshold, $n(I) \ge N(I) \times k$ (where $k$ is an empirical percentage threshold), we declare the cropped image a defect image and label it with 1 (crack) or 2 (spalling). If there are no defect pixels, i.e., $n(I) = 0$, we classify the cropped image as background and label it with 0. A cropped image is discarded if the number of defect pixels lies between zero and the threshold, i.e., $0 < n(I) < N(I) \times k$:

$$
\mathrm{flag} =
\begin{cases}
1 \ \text{or}\ 2, & n(I) \ge N(I) \times k\\
0, & n(I) = 0\\
\mathrm{discard}, & 0 < n(I) < N(I) \times k
\end{cases}
\tag{1}
$$

where flag denotes the category of the image and discard denotes that the image is not used for training. In this paper, we set k = 0.04 for crack if the cropped size is 100×100 and k = 0.06 for crack if the cropped size is 130×130. For concrete spalls, we set k = 0.8 to obtain spalling sub-images. These values are selected considering the constraints of the dataset size and the data quality for better detection accuracy.

Fig. 5. Examples of generated positive training images based on the proposed selection criteria. The spalling images are presented with sizes of 100×100 (k = 0.8) and 130×130 (k = 0.8) for training. The crack images are presented with sizes of 100×100 (k = 0.04) and 130×130 (k = 0.06) for training.

B. Depth In-Painting Model

Commercial RGB-D cameras normally output incomplete depth images when there is no reflected ray from certain viewing angles. The regions with missing depth in the image are referred to as "holes" [28]. Holes degrade the quality of the 3D reconstruction of a structure and of the 3D metric measurements. Fig. 6 illustrates examples of empty regions occurring in depth image data within a sliding window. Inspired by [29], we introduce a depth in-painting model (named InpaintNet), illustrated in Fig. 7. InpaintNet is developed based on U-Net [30], which has an auto-encoder framework with five groups of down-convolutions for the encoder and five groups of up-convolutions for the decoder. Each group has two convolutional layers, and each layer has the same number of channels as in U-Net. InpaintNet is composed of two U-Net frameworks connected in parallel: one learns a surface normal embedding from RGB images, and the other performs depth in-painting from depth and surface normal embeddings. The depth in-painting branch feeds depth images to an encoder to form depth embeddings. The depth embeddings are then concatenated with the surface normal embeddings, and the decoder of the depth in-painting branch decodes complete depth images from the combined depth and surface normal embeddings.

Fig. 6. Illustration of raw depth images (rendered with the hot colormap) from RGB-D sensors, where the percentage of missing depth increases from left to right. It is almost impossible to obtain the depth of a box located in the black area as shown in (d).

Fig. 7. AdaNet was proposed to perform depth adaptive defect detection. AdaNet uses depth information to estimate the scale of the slide-box, then performs defect detection over the target image with a multi-resolution sliding window detector. Based on our experience, we select a three-resolution image pyramid for our detection task.
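To make the two-branch design concrete, the sketch below is a heavily reduced PyTorch version of such a dual-encoder in-painting network (two down-convolution groups instead of five, arbitrary channel counts). It mirrors the data flow described above but is not the authors' InpaintNet.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU())

def up_block(in_ch, out_ch):
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1), nn.ReLU())

class TinyInpaintNet(nn.Module):
    """Two parallel encoder-decoders: one predicts surface normals from RGB,
    the other completes the depth map from the raw depth plus the normal embedding."""
    def __init__(self):
        super().__init__()
        # Surface-normal branch (RGB -> normal embedding -> 3-channel normals).
        self.rgb_enc = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.normal_dec = nn.Sequential(up_block(64, 32), up_block(32, 16), nn.Conv2d(16, 3, 1))
        # Depth branch (raw depth -> depth embedding).
        self.depth_enc = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        # Decoder over the concatenated depth + normal embeddings -> completed depth.
        self.depth_dec = nn.Sequential(up_block(128, 64), up_block(64, 32), nn.Conv2d(32, 1, 1))

    def forward(self, rgb, depth):
        normal_emb = self.rgb_enc(rgb)                      # surface-normal embedding
        normals = self.normal_dec(normal_emb)               # predicted surface normals
        depth_emb = self.depth_enc(depth)                   # depth embedding
        fused = torch.cat([depth_emb, normal_emb], dim=1)   # concatenate the two embeddings
        completed_depth = self.depth_dec(fused)             # inpainted (hole-free) depth
        return completed_depth, normals

net = TinyInpaintNet()
rgb = torch.randn(1, 3, 128, 128)
depth = torch.randn(1, 1, 128, 128)
filled, normals = net(rgb, depth)        # shapes: (1, 1, 128, 128) and (1, 3, 128, 128)
```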
In this paper, we do not have ground truth depth or surface normal data for training. We therefore use classical, computationally expensive approaches to generate estimates of the complete depth and surface normal images. These approaches, though they cannot run in real time, are able to generate accurate estimates of the complete depth and surface normal images to use as ground truth for the neural network. The neural network then has the benefit of estimating complete depth and surface normal images in real time. Inspired by [31], we introduce a bilateral filter with color guiding to complete the depth images; this bilateral filtering approach is roughly 10 times slower than the neural network model. For the surface normal, we first apply a Sobel filter to estimate the gradients of the estimated depth image in the $x$ and $y$ directions; the surface normal of each pixel is then $N_I = \nabla(x) \times \nabla(y)$.

C. Multi-Resolution Detection

For robotic on-the-fly defect inspection in the field, especially using a drone, it is quite challenging to keep a consistent distance between the camera and the surface during image acquisition. Since we can obtain the depth aligned with each RGB frame, we can easily use this information to adjust the sliding window size based on the depth measurement; in addition, the detection model should be robust to images taken at any distance. As discussed in our previous research [18], a fine-tuned VGG model is not able to perform well in field tests, achieving an average of 70.05% detection accuracy, because the spatial resolution of a region depends on the inspection distance. To tackle this problem, this research further introduces a Multi-resolution Detection model, inspired by [24], by implementing a multi-resolution input image feature pyramid. Given a sliding-window cropped input $I$, we resize it to 1/2 and 1/4 of its size and perform feature extraction in a parallel framework, that is

$$
X^{i} = \mathrm{CNN}(w \ast I^{i} + b), \quad i = 1, 2, 4
$$

where $\mathrm{CNN}(w \ast I^{i} + b)$, $i = 1, 2, 4$, denotes the CNN feature encoder, $w$ and $b$ are symbolic representations of the convolution kernel and bias, respectively, $I^{1,2,4}$ denotes the input image where the superscript denotes the corresponding scale, and $X$ represents the output corresponding to $I^{1,2,4}$. Because all levels of the pyramid use the same network architecture, the output size also differs with the input size. We therefore up-sample the coarse output features of the 1/2 and 1/4 sized images by factors of 2 and 4, respectively. In this paper, we take the raw sliding window input and resize it to 224×224, that is, size(I) = 224×224. To reduce the channel dimension after concatenation, we introduce a convolutional layer with kernel size 1×1 to reduce the channel dimension to 256, then apply an average pooling operation, $f_i = \mathrm{avg}(C_i)$, where $C_i$ is channel $i \in \{0, 1, \ldots, 255\}$ and $f_i$ is the output after average pooling.

D. Loss Design and Training

A three-layered fully connected network is used to regress and predict whether the current region is a defect area or not, as sketched below.
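A minimal PyTorch sketch of this multi-resolution pipeline, from the three-scale shared encoder through the 1×1 reduction, average pooling, and the three fully connected layers, is given below. The placeholder encoder and layer widths are our own assumptions rather than the exact AdaNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionHead(nn.Module):
    """Sketch of the multi-resolution detector: the same (shared) CNN encoder is
    applied to the full, 1/2, and 1/4 resolution copies of a 224x224 sliding-window
    crop; coarse feature maps are upsampled back, concatenated, squeezed to 256
    channels by a 1x1 convolution, average-pooled, and classified by three FC layers."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(              # placeholder shared feature encoder
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.reduce = nn.Conv2d(128 * 3, 256, kernel_size=1)   # 1x1 conv -> 256 channels
        self.classifier = nn.Sequential(                        # three fully connected layers
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):                           # x: (B, 3, 224, 224) sliding-window crop
        feats = []
        for scale in (1, 2, 4):                     # full, 1/2, and 1/4 resolution inputs
            xi = x if scale == 1 else F.interpolate(x, scale_factor=1.0 / scale,
                                                    mode="bilinear", align_corners=False)
            fi = self.encoder(xi)
            if scale != 1:                          # upsample coarse features by 2x / 4x
                fi = F.interpolate(fi, size=feats[0].shape[-2:],
                                   mode="bilinear", align_corners=False)
            feats.append(fi)
        fused = self.reduce(torch.cat(feats, dim=1))             # (B, 256, H, W)
        pooled = F.adaptive_avg_pool2d(fused, 1).flatten(1)       # per-channel average pooling
        return self.classifier(pooled)                            # class scores for {0, 1, 2}

logits = MultiResolutionHead()(torch.randn(2, 3, 224, 224))       # (2, 3)
```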
IV. Pose Estimation and 3D Semantic Registration

Our final goal is to reveal where the defects are in a 3D map by registering the concrete defects to the map. In this section, we discuss using an RGB-D camera to perform 3D positioning and semantic 3D reconstruction based on the conditional random field (CRF) method to highlight the concrete defects in the map.

The 3D model of a concrete structure is widely used for structural analysis in civil engineering. Moreover, metric defects can be registered to the 3D model with color-coded overlays. This provides further assistance to civil engineers in performing comprehensive concrete structure condition assessments [32]. In this paper, we propose a 3D mapping system taking advantage of visual-inertial (VI) SLAM and deep defect detection.

A. Visual Positioning and Association

As discussed in Section II, the CityFlyer requires high localization accuracy and update frequency to enable stable navigation. In this paper, we introduce an MS-EKF [33] to fuse high-frequency IMU propagation and low-frequency visual odometry (VO) towards real-time pose estimation. For VI fusion, the IMU measurements are used to predict the state transition, and VO observations are used to update the state. The difference in measurement frequency allows us to accommodate the fusion of multiple sensors. For the IMU, its evolving state vector is

$$
x_I = \left[{}^{W}p_I^{T},\ {}^{W}_{I}q^{T},\ {}^{W}v_I^{T},\ {}^{W}a_I^{T},\ b_a^{T},\ b_g^{T}\right]^{T}
$$

where ${}^{W}p_I$ denotes the position of the IMU in the world frame $W$, ${}^{W}_{I}q$ is the unit quaternion that represents the rotation from the world frame $W$ to the IMU frame $I$, ${}^{W}v_I$ and ${}^{W}a_I$ are the IMU linear velocity and linear acceleration with respect to the world coordinate system, and $b_a$ and $b_g$ denote the biases affecting the accelerometer and gyroscope measurements. The system derivative form can be partially represented in an east-north-up (ENU) coordinate system (partly referring to [33], [34]), where ${}^{W}_{I}C$ is the transformation from the IMU frame to the world frame, $a_m$ is the acceleration measurement, $w_m$ is the angular velocity measurement, $\Delta T$ denotes the time interval, and $g$ denotes the gravity. The acceleration, $a_I$, is subject to rotation and translation in the IMU frame, $w_I$ denotes the angular velocity, and $\otimes$ is the matrix product referred to in [33].

Meanwhile, the VO performs pose estimation using the RGB-D measurements and outputs the pose $P_{vo} = [r, t]$, where $r$ denotes the rotation and $t$ denotes the translation. Once the VO finishes pose estimation for each frame, we can update the state based on the measurement model $z = H x_I + V$, where $V$ denotes the measurement noise and $H$ denotes the measurement matrix that maps the IMU state to the VO pose. The prediction from IMU propagation is then corrected through the EKF update, achieving a 100 Hz pose estimation rate.

The state estimation error of the VIO will continue to drift, as there is no loop-closure to correct the pose even when there exists an overlap between views (observations). To further correct the pose, we record the key-frames $F_K = \{({}^{k}I_i, {}^{k}P_i)\,|\, i \in \{1, 2, \ldots, m\}\}$ (i.e., the vertices) based on a motion threshold, where ${}^{k}I_i$ and ${}^{k}P_i$ denote the key-frame image and the key-frame pose of frame $i$, respectively. VIO propagation and update allow us to obtain the transformation between two consecutive frames $i, j$, and the relative transformation $T_{i,j}$ can also be derived at the same time. To reduce the drift of the visual odometry, this paper introduces graph optimization to correct the pose drift based on [35]. To perform graph optimization, the following procedures are followed: 1) record the key-frames $F_K$ based on the motion threshold method; 2) use image features to facilitate loop-closure detection and find the edges (correlations) between any pair of key-frames; 3) perform graph optimization to update all poses simultaneously.
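As a simple illustration of the motion-threshold keyframe rule in step 1), the sketch below records a keyframe whenever the relative transformation exceeds a translation or rotation threshold. The threshold values are illustrative assumptions, and loop-closure detection and the graph optimization of [35] are omitted.

```python
import numpy as np

class KeyframeRecorder:
    """Record a new keyframe (image + pose) whenever the camera has moved or
    rotated more than a threshold since the last keyframe (motion-threshold rule)."""
    def __init__(self, trans_thresh=0.3, rot_thresh_deg=15.0):
        self.trans_thresh = trans_thresh             # metres
        self.rot_thresh = np.deg2rad(rot_thresh_deg)
        self.keyframes = []                           # list of (image, 4x4 pose) tuples

    def maybe_add(self, image, pose):
        if not self.keyframes:
            self.keyframes.append((image, pose))
            return True
        last_pose = self.keyframes[-1][1]
        delta = np.linalg.inv(last_pose) @ pose       # relative transform T_{i,j}
        trans = np.linalg.norm(delta[:3, 3])
        # Rotation angle recovered from the trace of the relative rotation matrix.
        cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        angle = np.arccos(cos_angle)
        if trans > self.trans_thresh or angle > self.rot_thresh:
            self.keyframes.append((image, pose))
            return True
        return False
```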
V. Experiments

In this section, we discuss the AdaNet training details and compare the experimental performance of the depth in-painting model and the defect detection model. To verify the effectiveness of our system, we performed several field tests in a manual hand-held mode with the RGB-D camera and in an autonomous inspection mode using the CityFlyer.

A. Depth In-Painting Analysis

We first perform an ablation study on the depth in-painting performance from accuracy and time performance perspectives. Table II shows the results of InpaintNet compared to the raw output and the in-painted result from a bilateral filter [31]. We performed four tests, with each dataset containing RGB-D frames of planar concrete surfaces. The ground truth was manually obtained by measuring the distance of the camera to the surface plane. In Table II, the depth images of Cracks 1 and 3 have large holes which are not removable through a bilateral filter; InpaintNet, however, is able to achieve more accurate and complete depth in-painting for Cracks 1 and 3. The depth images of Cracks 2 and 4 have small holes and can be easily filtered through a bilateral filter.

TABLE II Depth Accuracy Comparison to the Raw Image and [31] (Mean Absolute Error (MAE), mm)

Fig. 8. Result of using the bilateral filter and InpaintNet to perform depth in-painting. (a) The raw depth and the corresponding normal; (b) the in-painted depth and normal using InpaintNet; (c) the in-painted depth and normal using the bilateral filter.

Fig. 9. Time performance comparison between the bilateral filter and InpaintNet, where InpaintNet takes an average of 0.008 seconds to in-paint each depth frame.

A graphic comparison is given in Fig. 8, where we can see that InpaintNet is able to fill the big holes, even though it may not give a precise prediction. Also, compared with the bilateral filter, InpaintNet resolves a smoother normal estimation. The time performance of the two algorithms was compared, revealing InpaintNet to be 30 times faster than the bilateral approach (as illustrated in Fig. 9). The runtime of InpaintNet was 0.008 seconds on average per depth frame with a GTX 1080 GPU.

B. Detection Model Comparative Analysis

As discussed in Section III-A, we cropped images to obtain training patches, and we made the cropped dataset (https://github.com/ccny-ros-pkg/pytorch_Concrete_Inspection) publicly available for the research community. The dataset has a total of 26 870 concrete crack image patches, 15 950 concrete spalling image patches, and 46 429 background image patches. We label background as 0, concrete crack as 1, and concrete spalling as 2. Representative cropped images are presented in Fig. 5. All of the network training and testing were carried out on a GPU server with a GTX 1080 GPU and implemented using PyTorch.

1) Does Multi-Resolution Help? We conducted various comparative experiments between our multi-resolution detection model and other models, especially the F-VGG employed in [19]. Besides VGGs, we also made comparisons to current state-of-the-art models including ResNets [25] and AlexNet [27]. From the comparative results presented in Table III, we can conclude that our multi-resolution model does not achieve the highest learning accuracy, but it does obtain the highest testing accuracy. We also conducted a comparative study against the model used in [19] and list the results in Table IV. Inspection of the results in Table IV reveals that our multi-resolution model is able to achieve higher detection accuracy, with an average of 8.405% higher detection accuracy. This is also illustrated in Fig. 10, which shows that the multi-resolution model outputs better coverage predictions than F-ResNet-34.
2) Does a Deeper Model Have Better Performance? Research has shown that increasing the depth of a neural network can improve the classification accuracy to a certain extent [26]. However, the model degradation problem occurs if the model is deeper than a suitable limit. The authors in [25] then introduced a deep residual network to overcome the degradation problem, allowing the performance of networks to increase further with deeper architectures. In this section, we focus on using a well-constructed model with a suitable depth and perform fine-tuning; we do not discuss the degradation problem. Since our task is to classify three classes, the texture difference between crack and spalling is quite distinct. However, some possible challenges are illumination variations and an insufficient dataset. We perform comparative testing on our multi-resolution model, F-ResNets [25], F-VGGs [26], and AlexNet [27]. For a fair comparison, we set the batch size, epochs, learning rate, and loss to be the same. The results are illustrated in Table III. From the table it is clear that the deeper the model, the higher the accuracy it can achieve; the highest accuracy, 96.88%, was achieved by F-ResNet-101. Another interesting finding is that F-ResNets have, on average, 1.0% higher accuracy compared to F-VGGs. However, deeper models cannot achieve the best detection performance if the best input cropping practice is not used.

3) Batch Normalization: In this paper, we also discuss the effect of batch normalization on neural network models. Batch normalization was proposed to solve the internal covariate shift issue and operates on each neuron to normalize its scale during training. This enables the model to converge even with larger learning rates and also removes the need for dropout. In this paper, we compare the performance of F-VGGs with and without batch normalization. The results are illustrated in Table V and reveal that batch normalization improves the accuracy by 0.65% on average. This also shows that batch normalization can improve model performance even with less diversity in the data. A quantitative comparison of detection accuracy, illustrated in Tables III and IV, shows that VGG-Nets are not able to achieve detection performance comparable to our multi-resolution detection model.

TABLE III Accuracy Comparison Between F-ResNets, F-VGGs, and AlexNet

TABLE IV Field Test Data Detection Comparison

TABLE V Comparative Results of Using Batch Normalization (BN)

C. Field Tests and Comparisons

We conducted field tests on a concrete bridge at 155th Street and Broadway, Upper Manhattan. We performed the inspection under the bridge using the RGB-D camera mounted on the CityFlyer. The CityFlyer also carried a MasterMind computer to perform on-board computation and image streaming to the ground station (a GPU computer for defect detection). Besides the field tests via the CityFlyer, we also manually scanned the concrete surface with the RGB-D camera.

1) Field Tests (Manual Field Test): In the first stage, we manually carried the RGB-D camera to scan the concrete surface and collect the RGB-D frames for inspection. It should be noted that we have to launch the VIO system to track the motion of the camera in order to reconstruct the target concrete surface.
We collected three sets of data for three different scenarios, with each RGB-D frame carrying a location tag. Then, we performed defect inspection using our deep inspector on each image. The results are illustrated in Fig. 11, where green rectangles denote spalling and cyan rectangles denote cracks. To perform detection, we deployed a sliding window to scan through the whole image with region sizes varying from 80×80 to 200×200. We can see in the left-most image and the center image of Fig. 11 that our model is able to recognize the spalling region and the crack region. Further demonstration of the performance of the model is shown in the center image, where the spalling region is distinguished from the crack region. These results show how our model can cover the whole defect area in consecutive frames and how this method helps civil engineers stay aware of the condition of the concrete structure.

Fig. 10. Two autonomous field tests, Test 1 and Test 2, carried out using the CityFlyer. In both tests, we illustrate the drone's trajectory, the region detection results, and the 3D reconstructed map with defect highlighting.

Fig. 11. We performed three field tests in a manually held RGB-D camera mode. The green rectangles indicate the spalling regions, and the cyan rectangles denote the crack regions.

2) Autonomous Field Test Using CityFlyer: We also performed two sets of field tests using our CityFlyer, and the results are illustrated in Fig. 10. In Fig. 10, Test 1 is carried out at the entrance of the area under a bridge, and Test 2 is carried out in the middle of the area under the bridge where the illumination is low. For Test 1, the trajectory of the drone is illustrated in the left-most image, which shows how the CityFlyer maneuvered to capture the target area. The defect inspection result is illustrated in the second image from the left, where cyan and green rectangles denote crack and spalling, respectively. The right-most image is the front view of the 3D map (point cloud) with color overlaid on the defects, and the second image from the right shows the same point cloud from a different view from the back. We can see that the spalling and cracks are well highlighted. The second test was carried out under the bridge, which suffers from low illumination for inspection and localization. The trajectory of the drone is given in the left-most image of Test 2, and the inspection results at this location are given in the second image from the left and the second image from the right. The second image from the left indicates that our model is able to perform correct spalling detection even in a low-illumination environment. However, the second image from the right indicates that it missed the detection of a crack region (indicated with a red dashed rectangle) due to low illumination.
3) Semantic 3D Fusion and Visualization: The semantically highlighted 3D results are illustrated in Fig. 10, where we performed back-projection of the predicted output and the corresponding depth image to the 3D world coordinate frame, and the 3D spatial data are fused over consecutive frames. We use a voxel map to represent the 3D structure information, where each voxel is updated in a back-projection manner. Since we deploy an image-based fusion approach, a global probabilistic map search is not required, enabling non-GPU computation. The reconstructed 3D map with semantically highlighted areas is illustrated in the right-most images of Fig. 10. It can be seen in the figure that the defect regions are well highlighted using green and cyan colors. This helps civil engineers identify the defect categories as well as their locations.
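A minimal sketch of this back-projection step is shown below: labelled defect pixels are lifted into the world frame with the pinhole model and binned into voxels. The intrinsics, pose convention, and voxel size are placeholder assumptions, not the exact fusion implementation.

```python
import numpy as np

def backproject_defects(depth, label_mask, K, T_world_cam, voxel_size=0.02):
    """Back-project pixels flagged as crack/spalling into world-frame voxel
    coordinates. `depth` is in metres, `label_mask` holds 0/1/2 class labels,
    `K` is the 3x3 camera intrinsic matrix and `T_world_cam` the 4x4 camera pose."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.nonzero(label_mask > 0)            # pixel coordinates of defect pixels
    z = depth[v, u]
    valid = z > 0                                # skip pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    # Pinhole back-projection into the camera frame (homogeneous coordinates).
    pts_cam = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z, np.ones_like(z)])
    pts_world = (T_world_cam @ pts_cam)[:3].T    # (N, 3) points in the world frame
    voxels = np.unique(np.floor(pts_world / voxel_size).astype(int), axis=0)
    labels = label_mask[v, u]                    # per-point class (1 = crack, 2 = spalling)
    return pts_world, labels, voxels
```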
VI. Conclusion

In this paper, we introduced a new automatic concrete structure inspection system using the CityFlyer robot mounted with an RGB-D camera for visual inspection. For visual concrete inspection, we introduced AdaNet to perform defect detection within a sliding window approach. AdaNet consists of two sub-models: a depth in-painting model (InpaintNet) to fill holes in a depth image, and a multi-resolution defect detection model for concrete inspection. The depth adaptive multi-resolution detection model considers both distance and resolution effects, aiming to provide robust concrete crack and spalling detection in the field. Meanwhile, we pioneer the use of visual SLAM and deep neural network inspection to perform a 3D semantic reconstruction that highlights the defects in a 3D model. The system achieves an average of 8.41% higher detection accuracy compared to F-VGGs and F-ResNets. Furthermore, we introduce an RGB-D visual-inertial fusion with filtering and global bundle adjustment to perform pose estimation for the CityFlyer state control. The pose information is used to provide location tags for the defects predicted in images. Comparative experiments and field tests indicate that the system is able to perform high-quality detection and reconstruction. For future work, we will investigate the optimal tuning of the hyperparameters of the proposed models via intelligent optimization methods [39] and also work on pixel-level detection toward metric reconstruction.
Acknowledgment

The authors acknowledge the sponsorship support from the NSF NRT: Resilient Infrastructure and Environmental Systems (RIES) Program at Clemson University. The views, opinions, findings, and conclusions reflected in this publication are solely those of the authors and do not represent the official policy or position of the NSF, USDOT OST-R, or any State or other entity. J. Xiao has significant financial interest in InnovBot LLC, a company involved in R&D and commercialization of the technology.