BEV-radar: bidirectional radar-camera fusion for 3D object detection

Yuan Zhao1, Lu Zhang2, Jiajun Deng3, and Yanyong Zhang1
1 School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China;
2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China;
3 Department of Electrical Engineering, University of Sydney, NSW 2006, Australia
Correspondence: Yanyong Zhang, E-mail:
© 2024 The Author(s). This is an open access article under the CC BY-NC-ND 4.0 license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Cite This: JUSTC, 2024, 54(1): 0101 (8pp). DOI: 10.52396/JUSTC-2023-0006. Received: January 15, 2023; Accepted: April 3, 2023.
Abstract: Exploring millimeter-wave radar data as a complement to RGB images for ameliorating 3D object detection has become an emerging trend in autonomous driving systems. However, existing radar-camera fusion methods are highly dependent on prior camera detection results, rendering the overall performance unsatisfactory. In this paper, we propose a bidirectional fusion scheme in the bird's-eye view (BEV-radar) that is independent of prior camera detection results. Leveraging features from both modalities, our method designs a bidirectional attention-based fusion strategy. Specifically, following BEV-based 3D detection methods, our method engages a bidirectional transformer to embed information from both modalities and enforces the local spatial relationship through subsequent convolution blocks. After embedding the features, the BEV features are decoded in the 3D object prediction head. We evaluate our method on the nuScenes dataset, achieving 48.2 mAP and 57.6 NDS. The results show considerable improvements compared to the camera-only baseline, especially in terms of velocity prediction. The code is available at https:/

CLC number: TP399; Document code: A

1 Introduction
The perception system in autonomous driving is usually equipped with different types of sensors. Complementary multi-modal sensors avoid unexpected risks but take on new challenges during sensor fusion.
Recent works have focused on visual sensors [1], which typically provide dense and redundant information. However, visual sensors are usually not stable enough under adverse weather conditions (e.g., rain, snow, and fog). In addition to the high cost, the fusion of visual sensors alone cannot fully sustain the perception system in variable autonomous scenarios, which require robustness.

Aside from LiDAR and cameras, radar has also been widely used in autonomous scenes for speed measurement and auxiliary location prediction, but rarely in visual tasks due to its physical nature. While stability and penetration benefit from its physical properties, sparse returns, noisy features, and the lack of vertical information are crucial problems brought by frequently used automotive radar. Signals randomly scattered among vehicles, buildings, and obstacles suffer from high specular reflectivity and multi-path effects. While the complementary characteristics of camera and radar are effective, the fusion strategy faces several challenges. First, the mm-wave radar results projected onto the image view only carry direction and range; they provide no vertical information, which leads to some bias when projected onto the camera view. Moreover, the image cannot rely merely on the projected radar depth, as multi-path effects produce inaccurate radar detections.

Compared to the richer and more accurate information provided by visual sensors, the alignment of features between the camera and the radar is a challenging problem. Without vertical information, some methods [2, 3] rectify the vertical direction in the front view after projecting radar points to image planes. Higher-performing methods leverage first-stage proposals from the camera and then construct a soft association between objects and features according to the extrinsic matrix, as shown in Fig. 1. Instead of association methods, transforming both features to the bird's-eye view (BEV) can greatly relieve the problem, concerning two key points: a more compatible, decoupled fusion strategy for radar data and better mutual promotion of both modalities.

Inspired by BEV fusion methods [4, 5], we implement BEV-radar, an end-to-end fusion approach for radar and cameras, which can be conveniently applied to other BEV camera baselines. Before fusion, radar encoders are used for pillar extraction and tensor compaction. BEV-radar focuses on inserting dense radar tensors into the BEV image features generated by the camera baseline. Bidirectionally, radar features and image features are promoted and passed to their respective decoders via cross-attention. Despite the simplicity of the basic idea, the evaluation on the nuScenes dataset shows outstanding results on the 3D object detection benchmark. Our method achieves an improvement over the camera-only baselines and performs well even compared to other radar-camera fusion studies.
Besides, in line with the original intention of the experiment, radar fusion behaves stably in adverse scenes, with boosts of +10% mAP and +15% NDS. We make the following contributions:

(I) We construct an end-to-end BEV framework for radar and camera fusion. Instead of relying on the first-stage detection results provided by the camera, this integral network constitutes a portable and robust design that does not depend strongly on the camera.

(II) We propose a novel bidirectional fusion strategy that, compared to vanilla cross-attention, is suitable for multi-modal features with spatial relationships. It performs effectively despite the huge diversity between radar and cameras.

(III) We achieve competitive camera-radar 3D detection performance on the nuScenes dataset. Compared to a single modality, we solve the difficult problem of velocity prediction, which is non-trivial in autonomous driving.

2 Materials and methods
2.1 Related work
Camera-only 3D detection. Monocular 3D detection requires the estimation of 3D bounding boxes while using a monocular camera.
The key question is how to regress the depth information from the 2D view. Earlier works relied on 2D detection networks with additional sub-networks for 3D projection [6, 7]. Several works have attempted to convert RGB information into 3D representations, such as pseudo-LiDAR [8, 9] and the orthographic feature transform [10]. Several studies [11] introduced keypoint detection for centers and used 2D object detection predictions as auxiliary regression targets. In recent works, camera-only methods directly predict results in 3D space or on BEV features [5, 12, 13]; they operate directly on the BEV features transformed from the front view according to the calibration.

Camera-fusion 3D detection. The key point of association-based modality fusion methods is to find the interrelated spatial relationships among multi-modal sensors. In recent years, fusion approaches have mainly focused on LiDAR and cameras. Some earlier works [14, 15] mapped the data from multiple views into unified types such as the image or the BEV. PointPainting [1] creatively projects segmentation information from images onto the point cloud. Due to the sensitivity to adverse weather conditions, MVDNet [16] first designed a fused area-wise network for radar and LiDAR in a foggy simulated environment. Motivated by the cost of LiDAR, Ref. [17] researched the improvement of fusion for tiny objects with camera and radar, and Ref. [18] introduced the transformer for feature-level fusion. However, 2D convolution over the projected radar points comes with useless computation and does not take the sparsity of the radar into account. Restricted by the front view, spatial relationships between different modalities rely on the results predicted during the first stage. By transforming features from their respective views to a unified BEV, BEVFusion [4] predicted depth probabilities for image features and projected the pseudo-3D features to the BEV based on the extrinsic parameters. TransFusion [19] compressed camera features along the vertical axis to initialize the guiding query and aligned the results of the first stage.
Fig. 1. Comparison between the two alignment methods. (a) Radar fusion relying on first-stage proposals from the image: after generating the initial proposals, an association step mapping them to their corresponding radar regions is necessary, so objects that are not detected in the first stage are ignored. (b) Our adaptive radar fusion, not relying on association methods: instead of aligning proposals from the first stage, features are directly aligned in the BEV, so the prediction is guided by multi-modality features.

2.2 Approach
In this work, we present BEV-radar, a radar-camera fusion framework based on camera-only 3D object detection. As shown in Fig. 2, given a set of multi-view images and sparse radar points as inputs, we extract the respective BEV features separately and then decode the features using bidirectional attention modules, inserted as fusion decoders called BSF (bidirectional spatial fusion). Instead of simple cross-attention, BSF performs better fusion of both modality features and aligns features from different domains effectively.
In the following subsections, we first review the preliminaries of the related tasks and then elaborate on the implementation details of the BSF.

2.2.1 Preliminary
Generation of BEV features. Traditional sensor fusion operates on separate views, so the perspective front view and the BEV are aligned through the actual pixel-to-pixel spatial relation. However, even with a high-precision extrinsic calibration, projected radar points deviate from their true positions due to the absence of vertical information. Moreover, this pixel-to-pixel spatial alignment is not tight enough, owing to the geometric distortion of the camera and the sparsity of the 3D points. Therefore, a unified BEV representation, instead of a geometric association, is crucial for sensor fusion.

Transformers for 2D detection. The vision transformer (ViT) [20] proposed patched images with positional encoding instead of 2D convolution, bringing progress on image features based on the original natural language processing (NLP) transformer [21]. The original attention mechanism is formulated as follows. Given a query embedding f_q, a key embedding f_k, a value embedding f_v, and the key embedding dimension d_k, these inputs are computed in a single-head attention layer as:

Attn(f_q, f_k, f_v) = softmax(f_q f_k^T / √d_k) f_v. (1)

As for the prediction decoder, the promoted detection transformer (DETR) [22] and related transformers [13, 23, 24] are widely used for detection tasks, reformulating detection as predicting a set of matched bounding boxes. Thus, the usual 3D regression problem is transformed into a bipartite matching problem, and the non-maximum suppression (NMS) algorithm is no longer needed.
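For concreteness, the following is a minimal PyTorch sketch of the single-head attention in Eq. (1); the function name and tensor shapes are our own illustrative choices, not part of the paper.

```python
import torch
import torch.nn.functional as F

def single_head_attention(f_q, f_k, f_v):
    """Single-head attention as in Eq. (1): softmax(f_q f_k^T / sqrt(d_k)) f_v.

    f_q: (N_q, d_k), f_k: (N_kv, d_k), f_v: (N_kv, d_v) -- token-major layout.
    """
    d_k = f_k.shape[-1]
    scores = f_q @ f_k.transpose(-2, -1) / d_k ** 0.5  # (N_q, N_kv) similarity logits
    return F.softmax(scores, dim=-1) @ f_v             # attention-weighted sum of values

# Toy usage: 4 query tokens attending over 6 key/value tokens.
out = single_head_attention(torch.randn(4, 64), torch.randn(6, 64), torch.randn(6, 64))
assert out.shape == (4, 64)
```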
2.2.2 BEV unified representation
In this part, we state the details of the representations of the two sensors. Transforming the raw features extracted from their original data types to the BEV is nontrivial for alignment.

To camera. Following BEVDet [5], the BEV camera baseline predicts the depth of the multi-view image semantic features produced by the backbone and feature pyramid network, and then transforms all features into a unified BEV grid space relying on the associated extrinsic matrix. Thus, the baseline forms a BEV camera feature map F_C ∈ R^{C×H×W}, downsampled from the original size by 8, where H and W describe the size of the BEV map. BEV image features provide a global representation for multi-view transformations.
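The view transformation can be pictured as weighting image features by a predicted per-pixel depth distribution before splatting them onto the BEV grid. The sketch below illustrates only that depth-weighted lifting step (the BEV pooling via camera extrinsics is omitted); all shapes, names, and bin counts are our own assumptions rather than BEVDet's actual implementation.

```python
import torch

def lift_image_features(feat, depth_logits):
    """Conceptual 'lift' step of a lift-splat-style view transformer.

    feat: (C, H, W) image semantic features; depth_logits: (D, H, W) over D
    depth bins. Returns pseudo-3D frustum features of shape (D, C, H, W).
    """
    depth_prob = depth_logits.softmax(dim=0)              # per-pixel depth distribution
    # Outer product over the depth axis: (D, 1, H, W) * (1, C, H, W) -> (D, C, H, W)
    frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(0)
    return frustum  # subsequently splatted onto the BEV grid using the extrinsics

frustum = lift_image_features(torch.randn(256, 32, 88), torch.randn(59, 32, 88))
assert frustum.shape == (59, 256, 32, 88)
```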
To radar. The radar data format has a completely different style compared to the camera: it is similar to LiDAR but sparser, with about 300 points per 6 frames. To avoid overly sparse inputs, a sequence of points R ∈ R^{N×d} is accumulated, where X and Y denote the spatial coordinates, d denotes the attributes including velocity, and N is the size of the point set. In the absence of vertical information, pillar-based feature extraction [25] considerably alleviates the computation required for sparse radar data to traverse the BEV plane. Naturally, the unified BEV radar features F_R ∈ R^{C×H×W} are formed after a linear transformation.
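As a rough illustration of this pillar-style rasterization, the sketch below scatters accumulated radar points into a BEV grid and mean-pools their attributes per cell; a real PointPillars-style encoder would additionally apply a learned per-point MLP before scattering. The grid size, range, and attribute layout are our own assumptions.

```python
import torch

def radar_to_bev(points, grid=128, extent=51.2):
    """Minimal pillar-style rasterization of accumulated radar sweeps.

    points: (N, d) tensor; columns 0,1 are x,y in meters, the remaining d-2
    columns are attributes (e.g., RCS, radial velocity). Returns a dense
    (d-2, grid, grid) BEV map holding the per-cell mean of the attributes.
    """
    xy, attrs = points[:, :2], points[:, 2:]
    mask = (xy.abs() < extent).all(dim=1)          # keep points inside the BEV range
    xy, attrs = xy[mask], attrs[mask]
    # Quantize metric coordinates to integer cell indices.
    idx = ((xy + extent) / (2 * extent) * grid).long().clamp(0, grid - 1)
    flat = idx[:, 1] * grid + idx[:, 0]            # row-major flattened cell id
    bev = torch.zeros(attrs.shape[1], grid * grid)
    cnt = torch.zeros(grid * grid)
    bev.index_add_(1, flat, attrs.t())             # sum attributes per cell
    cnt.index_add_(0, flat, torch.ones(flat.shape[0]))
    bev = bev / cnt.clamp(min=1)                   # mean-pool occupied cells
    return bev.view(-1, grid, grid)

# Toy usage: 300 points with (x, y, rcs, vx, vy) columns.
bev = radar_to_bev(torch.randn(300, 5) * 20)
assert bev.shape == (3, 128, 128)
```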
2.2.3 Bidirectional BEV alignment
Traditional sensor fusion first concatenates the individual features directly and then uses attention or convolution blocks to extract features from the different modalities and align them according to their spatial relationships. However, for BEV radar and image features, sparsity makes it non-trivial to align the two modalities spatially alone, so each sparse feature needs to be generalized. In this section, we introduce a module consisting of cross-attention and convolution blocks that progressively embeds the duplex features into each other, which results in better alignment.

Fig. 2. Overall architecture of the framework. Our model is constructed on separate backbones that extract the image BEV features and the radar BEV features. Our BSF (bidirectional spatial fusion) module consists of several blocks applied sequentially: first, a shared bidirectional cross-attention communicates both modalities; spatial alignment then follows to localize the radar and camera BEV features. After all blocks, both outputs are sent to a deconvolution module to reduce the channel dimension.
Specifically, a block consists of two parts: an interaction module to communicate the features, and a convolution-based fusion operation. As shown in Fig. 2, the fusion part can be divided into N equal blocks, and the positional embedding operation is applied before fusion. For the camera branch in the ith block, given a camera BEV feature map F_C ∈ R^{C×H×W} as the query, the radar BEV feature map F_R ∈ R^{C×H×W} is used as the key and value, and vice versa for the radar branch. We use deformable cross-attention [24] to remedy the computational cost caused by the sparsity of BEV features, which can be formulated as follows:

F_C^i = H(Attn(norm(F_C^{i-1}), norm(F_R^{i-1})) ⊕ Attn(norm(F_R^{i-1}), norm(F_C^{i-1}))), (2)

F_R^i = G(Attn(norm(F_R^{i-1}), norm(F_C^{i-1}))), (3)

where Attn(f_q, f_kv) abbreviates Eq. (1) with the key also serving as the value, H and G denote the convolution-based fusion operations of the camera and radar branches, and ⊕ denotes concatenation.

Different from the vanilla NLP transformer, spatial information carrying object locations is vital for detection tasks. Designed for 2D structures, convolution kernels are better at extracting local spatial correlations than 1D attention. The attention output F_out is therefore reshaped to image style again and sent to the convolution blocks, and then patched again before the next, (i+1)th, block. At the same time, a transform block for F_R remains synchronized with that for F_C, and the two outputs are returned separately as the next inputs. In this way, the stacked blocks increase the fitness of F_C and F_R, while the bidirectional design updates the alignment of the feature domains. In each block, convolution layers are required to extract the local spatial relations; see Section 3.2.2 for a related verification.
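A compact PyTorch sketch of one such block is given below. It substitutes standard multi-head attention for the deformable cross-attention used in the paper, and treats H and G as single 3×3 convolutions; the channel width, head count, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BSFBlock(nn.Module):
    """One bidirectional spatial fusion block, loosely following Eqs. (2)-(3)."""

    def __init__(self, c=128, heads=4):
        super().__init__()
        self.norm_c = nn.LayerNorm(c)
        self.norm_r = nn.LayerNorm(c)
        self.attn_c2r = nn.MultiheadAttention(c, heads, batch_first=True)  # camera queries radar
        self.attn_r2c = nn.MultiheadAttention(c, heads, batch_first=True)  # radar queries camera
        self.H = nn.Conv2d(2 * c, c, 3, padding=1)  # fuses both directions for the camera branch
        self.G = nn.Conv2d(c, c, 3, padding=1)      # refines the radar branch

    def forward(self, f_c, f_r):
        """f_c, f_r: (B, C, H, W) camera / radar BEV features."""
        b, c, h, w = f_c.shape
        # Flatten the BEV maps into token sequences for attention.
        tc = self.norm_c(f_c.flatten(2).transpose(1, 2))  # (B, H*W, C)
        tr = self.norm_r(f_r.flatten(2).transpose(1, 2))
        a_cr, _ = self.attn_c2r(tc, tr, tr)  # Attn(norm(F_C), norm(F_R))
        a_rc, _ = self.attn_r2c(tr, tc, tc)  # Attn(norm(F_R), norm(F_C))
        # Reshape tokens back to 2D maps so convolutions capture local structure.
        a_cr = a_cr.transpose(1, 2).view(b, c, h, w)
        a_rc = a_rc.transpose(1, 2).view(b, c, h, w)
        f_c_next = self.H(torch.cat([a_cr, a_rc], dim=1))  # Eq. (2)
        f_r_next = self.G(a_rc)                            # Eq. (3)
        return f_c_next, f_r_next

# Toy usage: stack N = 3 blocks and run both BEV feature maps through them.
blocks = nn.ModuleList(BSFBlock() for _ in range(3))
f_c, f_r = torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32)
for blk in blocks:
    f_c, f_r = blk(f_c, f_r)
```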
2.2.4 Prediction heads and losses
The BEV fusion features are fed to the 3D object detection prediction heads. Referring to TransFusion [19], we simply use the class embedding heatmap transformed from the fusion features as the query initialization to predict the centers of all objects in each scene. A vanilla transformer is used as the decoder for the DETR [22] prediction parts through the Hungarian algorithm [26], and we set the regularized matching loss function as a weighted sum of the classification, regression, and IoU calculations:

L_tot = λ_1 L_cls + λ_2 L_reg + λ_3 L_IoU, (4)

where λ_1, λ_2, and λ_3 represent the coefficient parameters, and L_cls, L_reg, and L_IoU are the individual loss functions above.
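To make the matching step concrete, here is a hedged sketch of a Hungarian assignment followed by the weighted sum of Eq. (4). The matching cost, the stand-in IoU term, and the weights are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def matching_loss(pred_cls, pred_box, gt_cls, gt_box, w=(1.0, 0.25, 0.25)):
    """Set-matching loss in the spirit of Eq. (4).

    pred_cls: (Q, K) class logits, pred_box: (Q, 7) box parameters,
    gt_cls: (M,) class ids, gt_box: (M, 7) ground-truth boxes.
    """
    prob = pred_cls.softmax(dim=-1)
    cost_cls = -prob[:, gt_cls]                    # (Q, M) negative class probability
    cost_reg = torch.cdist(pred_box, gt_box, p=1)  # (Q, M) L1 box distance
    # Hungarian algorithm: one-to-one assignment minimizing the total cost.
    rows, cols = linear_sum_assignment((cost_cls + cost_reg).detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)

    l_cls = F.cross_entropy(pred_cls[rows], gt_cls[cols])
    l_reg = F.l1_loss(pred_box[rows], gt_box[cols])
    l_iou = l_reg  # placeholder: a real head would use a rotated-3D-IoU loss here
    return w[0] * l_cls + w[1] * l_reg + w[2] * l_iou

# Toy usage: 20 queries, 10 classes, 4 ground-truth boxes.
loss = matching_loss(torch.randn(20, 10), torch.randn(20, 7),
                     torch.randint(0, 10, (4,)), torch.randn(4, 7))
```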
3 Results and discussion
3.1 Implementation details
Training. This end-to-end work is implemented on the open-source MMDetection3D [27]