BEV-radar: bidirectional radar-camera fusion for 3D object detection

Yuan Zhao1, Lu Zhang2, Jiajun Deng3, and Yanyong Zhang1
1 School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China;
2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China;
3 Department of Electrical Engineering, University of Sydney, NSW 2006, Australia
Correspondence: Yanyong Zhang, E-mail:
© 2024 The Author(s). This is an open access article under the CC BY-NC-ND 4.0 license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Cite This: JUSTC, 2024, 54(1): 0101 (8pp). DOI: 10.52396/JUSTC-2023-0006. Received: January 15, 2023; Accepted: April 3, 2023.
Abstract: Exploring millimeter-wave radar data as a complement to RGB images for ameliorating 3D object detection has become an emerging trend in autonomous driving systems. However, existing radar-camera fusion methods are highly dependent on prior camera detection results, rendering the overall performance unsatisfactory. In this paper, we propose a bidirectional fusion scheme in the bird's-eye view (BEV-radar) that is independent of prior camera detection results. Leveraging features from both modalities, our method designs a bidirectional attention-based fusion strategy. Specifically, following BEV-based 3D detection methods, our method engages a bidirectional transformer to embed information from both modalities and enforces the local spatial relationship through subsequent convolution blocks. After embedding the features, the BEV features are decoded in the 3D object prediction head. We evaluate our method on the nuScenes dataset, achieving 48.2 mAP and 57.6 NDS. The results show considerable improvements compared to the camera-only baseline, especially in terms of velocity prediction. The code is available at https:/

CLC number: TP399; Document code: A

1 Introduction
The perception system in autonomous driving is usually equipped with different types of sensors. Complementary multi-modal sensors avoid unexpected risks but take on new challenges during sensor fusion.
Recent works have focused on visual sensors [1], which typically provide dense and redundant information. However, visual sensors are usually not stable enough under adverse weather conditions (e.g., rain, snow, and fog). In addition to the high cost, the fusion of visual sensors alone cannot fully sustain the perception system in variable autonomous scenarios, which require robustness.

Aside from LiDAR and cameras, radar has also been widely used in autonomous scenes for speed measurement and auxiliary location prediction, but rarely in visual tasks due to its physical nature. While stability and penetration benefit from its physical properties, sparse returns, noisy features, and the lack of vertical information are crucial problems brought by frequently used automotive radar. Signals randomly scattered among vehicles, buildings, and obstacles suffer from high specular reflectivity and multi-path effects. While the complementary characteristics of camera and radar are effective, the fusion strategy faces several challenges. First, the mm-wave radar results projected onto the image view only carry direction and range; they provide no vertical information, which leads to some bias when projected onto the camera view. Moreover, the image cannot rely merely on the projected radar depth, as multi-path effects produce inaccurate radar detections.

Compared to the richer and more accurate information provided by visual sensors, the alignment of features between the camera and the radar is a challenging problem. Without vertical information, some methods [2, 3] rectify the vertical direction in the front view after projecting radar points to image planes. Higher-performing methods leverage first-stage proposals from the camera and then construct a soft association between objects and features according to the extrinsic matrix, as shown in Fig. 1. Instead of association methods, transforming both features to the bird's-eye view (BEV) can greatly relieve the problem, concerning two key points: a more compatible, decoupled fusion strategy for radar data and better mutual promotion of both modalities.

Inspired by BEV fusion methods [4, 5], we implement BEV-radar, an end-to-end fusion approach for radar and cameras, which can be conveniently applied to other BEV camera baselines. Before fusion, radar encoders are used for pillar extraction and tensor compaction. BEV-radar focuses on inserting dense radar tensors into the BEV image features generated by the camera baseline. Bidirectionally, radar features and image features are promoted and passed to their respective decoders via cross-attention. Despite the simplicity of the basic idea, the evaluation on the nuScenes dataset shows outstanding results on the 3D object detection benchmark. Our method achieves an improvement over the camera-only baselines and performs well even compared to other radar-camera fusion studies.
Besides, in line with the original intention of the experiment, radar fusion behaves stably in adverse scenes, with boosts of +10% mAP and +15% NDS. We make the following contributions:

(I) We construct an end-to-end BEV framework for radar and camera fusion. Instead of relying on the first-stage detection results provided by the camera, this integral network constitutes a portable and robust design that does not depend strongly on the camera.

(II) We propose a novel bidirectional fusion strategy that, compared to vanilla cross-attention, is suitable for multi-modal features with spatial relationships. It performs effectively despite the huge diversity between radar and cameras.

(III) We achieve competitive camera-radar 3D detection performance on the nuScenes dataset. Compared to a single modality, we solve the difficult problem of velocity prediction, which is non-trivial in autonomous driving.

2 Materials and methods
2.1 Related work
Camera-only 3D detection. Monocular 3D detection requires the estimation of 3D bounding boxes while using a monocular camera.
The key question is how to regress the depth information from the 2D view. Earlier works relied on 2D detection networks with additional sub-networks for 3D projection [6, 7]. Several works have attempted to convert RGB information into 3D representations, such as pseudo-LiDAR [8, 9] and the orthographic feature transform [10]. Several studies [11] introduced keypoint detection for centers and used 2D object detection predictions as auxiliary regression targets. In recent works, camera-only methods directly predict results in 3D space or on BEV features [5, 12, 13]; they operate directly on the BEV features transformed from the front view according to the calibration.

Camera-fusion 3D detection. The key point of association-based modality fusion methods is to find the interrelated spatial relationships among multi-modal sensors. In recent years, fusion approaches have mainly focused on LiDAR and cameras. Some earlier works [14, 15] mapped the data from multiple views into unified types such as the image or the BEV. PointPainting [1] creatively projects segmentation information from images onto the point cloud. Due to the sensitivity to adverse weather conditions, MVDNet [16] first designed a fused area-wise network for radar and LiDAR in a foggy simulated environment. Motivated by the cost of LiDAR, Ref. [17] researched the improvement of fusion for tiny objects with camera and radar, and Ref. [18] introduced the transformer for feature-level fusion. However, 2D convolution over the projected radar points comes with useless computation and does not take the sparsity of the radar into account. Restricted by the front view, spatial relationships between different modalities rely on the results predicted during the first stage. By transforming features from their respective views to a unified BEV, BEVFusion [4] predicted depth probabilities for image features and projected the pseudo-3D features to the BEV based on the extrinsic parameters. TransFusion [19] compressed camera features along the vertical axis to initialize the guiding query and aligned the results of the first stage.
Fig. 1. Comparison between the two alignment methods. (a) Radar fusion relying on first-stage proposals from the image: after generating the initial proposals, an association step mapping them to their corresponding radar regions is necessary, so objects that are not detected in the first stage are ignored. (b) Our adaptive radar fusion, not relying on association methods: instead of aligning proposals from the first stage, features are directly aligned in the BEV, so the prediction is guided by multi-modality features.

2.2 Approach
In this work, we present BEV-radar, a radar-camera fusion framework based on camera-only 3D object detection. As shown in Fig. 2, given a set of multi-view images and sparse radar points as inputs, we extract the respective BEV features separately and then decode the features using bidirectional attention modules, inserted as fusion decoders called BSF (bidirectional spatial fusion). Instead of simple cross-attention, BSF performs better fusion of both modality features and aligns features from different domains effectively.
In the following subsections, we first review the preliminaries of the related tasks and then elaborate on the implementation details of the BSF.

2.2.1 Preliminary
Generation of BEV features. Traditional sensor fusion operates on separate views, so the perspective front view and the BEV are aligned through the actual pixel-to-pixel spatial relation. However, even with a high-precision extrinsic calibration, projected radar points deviate from their true positions due to the absence of vertical information. Moreover, this pixel-to-pixel spatial alignment is not tight enough, owing to the geometric distortion of the camera and the sparsity of the 3D points. Therefore, a unified BEV representation, instead of a geometric association, is crucial for sensor fusion.

Transformers for 2D detection. The vision transformer (ViT) [20] proposed patched images with positional encoding instead of 2D convolution, bringing progress on image features based on the original natural language processing (NLP) transformer [21]. The original attention mechanism is formulated as follows. Given a query embedding f_q, a key embedding f_k, a value embedding f_v, and the key embedding dimension d_k, these inputs are computed in a single-head attention layer as:

Attn(f_q, f_k, f_v) = softmax(f_q f_k^T / √d_k) f_v. (1)

As for the prediction decoder, the promoted detection transformer (DETR) [22] and related transformers [13, 23, 24] are widely used for detection tasks, reformulating detection as predicting a set of matched bounding boxes. Thus, the usual 3D regression problem is transformed into a bipartite matching problem, and the non-maximum suppression (NMS) algorithm is no longer needed.
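For concreteness, the following is a minimal PyTorch sketch of the single-head attention in Eq. (1); the function name and tensor shapes are our own illustrative choices, not part of the paper.

```python
import torch
import torch.nn.functional as F

def single_head_attention(f_q, f_k, f_v):
    """Single-head attention as in Eq. (1): softmax(f_q f_k^T / sqrt(d_k)) f_v.

    f_q: (N_q, d_k), f_k: (N_kv, d_k), f_v: (N_kv, d_v) -- token-major layout.
    """
    d_k = f_k.shape[-1]
    scores = f_q @ f_k.transpose(-2, -1) / d_k ** 0.5  # (N_q, N_kv) similarity logits
    return F.softmax(scores, dim=-1) @ f_v             # attention-weighted sum of values

# Toy usage: 4 query tokens attending over 6 key/value tokens.
out = single_head_attention(torch.randn(4, 64), torch.randn(6, 64), torch.randn(6, 64))
assert out.shape == (4, 64)
```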
2.2.2 BEV unified representation
In this part, we state the details of the representations of the two sensors. Transforming the raw features extracted from their original data types to the BEV is nontrivial for alignment.

To camera. Following BEVDet [5], the BEV camera baseline predicts the depth of the multi-view image semantic features produced by the backbone and feature pyramid network, and then transforms all features into a unified BEV grid space relying on the associated extrinsic matrix. Thus, the baseline forms a BEV camera feature map F_C ∈ R^{C×H×W}, downsampled from the original size by 8, where H and W describe the size of the BEV map. BEV image features provide a global representation for multi-view transformations.
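The view transformation can be pictured as weighting image features by a predicted per-pixel depth distribution before splatting them onto the BEV grid. The sketch below illustrates only that depth-weighted lifting step (the BEV pooling via camera extrinsics is omitted); all shapes, names, and bin counts are our own assumptions rather than BEVDet's actual implementation.

```python
import torch

def lift_image_features(feat, depth_logits):
    """Conceptual 'lift' step of a lift-splat-style view transformer.

    feat: (C, H, W) image semantic features; depth_logits: (D, H, W) over D
    depth bins. Returns pseudo-3D frustum features of shape (D, C, H, W).
    """
    depth_prob = depth_logits.softmax(dim=0)              # per-pixel depth distribution
    # Outer product over the depth axis: (D, 1, H, W) * (1, C, H, W) -> (D, C, H, W)
    frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(0)
    return frustum  # subsequently splatted onto the BEV grid using the extrinsics

frustum = lift_image_features(torch.randn(256, 32, 88), torch.randn(59, 32, 88))
assert frustum.shape == (59, 256, 32, 88)
```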
To radar. The radar data format has a completely different style compared to the camera: it is similar to LiDAR but sparser, with about 300 points per 6 frames. To avoid overly sparse inputs, a sequence of points R ∈ R^{N×d} is accumulated, where X and Y denote the spatial coordinates, d denotes the attributes including velocity, and N is the size of the point set. In the absence of vertical information, pillar-based feature extraction [25] considerably alleviates the computation required for sparse radar data to traverse the BEV plane. Naturally, the unified BEV radar features F_R ∈ R^{C×H×W} are formed after a linear transformation.
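As a rough illustration of this pillar-style rasterization, the sketch below scatters accumulated radar points into a BEV grid and mean-pools their attributes per cell; a real PointPillars-style encoder would additionally apply a learned per-point MLP before scattering. The grid size, range, and attribute layout are our own assumptions.

```python
import torch

def radar_to_bev(points, grid=128, extent=51.2):
    """Minimal pillar-style rasterization of accumulated radar sweeps.

    points: (N, d) tensor; columns 0,1 are x,y in meters, the remaining d-2
    columns are attributes (e.g., RCS, radial velocity). Returns a dense
    (d-2, grid, grid) BEV map holding the per-cell mean of the attributes.
    """
    xy, attrs = points[:, :2], points[:, 2:]
    mask = (xy.abs() < extent).all(dim=1)          # keep points inside the BEV range
    xy, attrs = xy[mask], attrs[mask]
    # Quantize metric coordinates to integer cell indices.
    idx = ((xy + extent) / (2 * extent) * grid).long().clamp(0, grid - 1)
    flat = idx[:, 1] * grid + idx[:, 0]            # row-major flattened cell id
    bev = torch.zeros(attrs.shape[1], grid * grid)
    cnt = torch.zeros(grid * grid)
    bev.index_add_(1, flat, attrs.t())             # sum attributes per cell
    cnt.index_add_(0, flat, torch.ones(flat.shape[0]))
    bev = bev / cnt.clamp(min=1)                   # mean-pool occupied cells
    return bev.view(-1, grid, grid)

# Toy usage: 300 points with (x, y, rcs, vx, vy) columns.
bev = radar_to_bev(torch.randn(300, 5) * 20)
assert bev.shape == (3, 128, 128)
```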
2.2.3 Bidirectional BEV alignment
Traditional sensor fusion first concatenates the individual features directly and then uses attention or convolution blocks to extract features from the different modalities and align them according to their spatial relationships. However, for BEV radar and image features, sparsity makes it non-trivial to align the two modalities spatially alone, so each sparse feature needs to be generalized. In this section, we introduce a module consisting of cross-attention and convolution blocks that progressively embeds the duplex features into each other, which results in better alignment.

Fig. 2. Overall architecture of the framework. Our model is constructed on separate backbones that extract the image BEV features and the radar BEV features. Our BSF (bidirectional spatial fusion) module consists of several blocks applied sequentially: first, a shared bidirectional cross-attention communicates both modalities; spatial alignment then follows to localize the radar and camera BEV features. After all blocks, both outputs are sent to a deconvolution module to reduce the channel dimension.
Specifically, a block consists of two parts: an interaction module to communicate the features, and a convolution-based fusion operation. As shown in Fig. 2, the fusion part can be divided into N equal blocks, and the positional embedding operation is applied before fusion. For the camera branch in the ith block, given a camera BEV feature map F_C ∈ R^{C×H×W} as the query, the radar BEV feature map F_R ∈ R^{C×H×W} is used as the key and value, and vice versa for the radar branch. We use deformable cross-attention [24] to remedy the computational cost caused by the sparsity of BEV features, which can be formulated as follows:

F_C^i = H(Attn(norm(F_C^{i-1}), norm(F_R^{i-1})) ⊕ Attn(norm(F_R^{i-1}), norm(F_C^{i-1}))), (2)

F_R^i = G(Attn(norm(F_R^{i-1}), norm(F_C^{i-1}))), (3)

where Attn(f_q, f_kv) abbreviates Eq. (1) with the key also serving as the value, H and G denote the convolution-based fusion operations of the camera and radar branches, and ⊕ denotes concatenation.

Different from the vanilla NLP transformer, spatial information carrying object locations is vital for detection tasks. Designed for 2D structures, convolution kernels are better at extracting local spatial correlations than 1D attention. The attention output F_out is therefore reshaped to image style again and sent to the convolution blocks, and then patched again before the next, (i+1)th, block. At the same time, a transform block for F_R remains synchronized with that for F_C, and the two outputs are returned separately as the next inputs. In this way, the stacked blocks increase the fitness of F_C and F_R, while the bidirectional design updates the alignment of the feature domains. In each block, convolution layers are required to extract the local spatial relations; see Section 3.2.2 for a related verification.
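A compact PyTorch sketch of one such block is given below. It substitutes standard multi-head attention for the deformable cross-attention used in the paper, and treats H and G as single 3×3 convolutions; the channel width, head count, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BSFBlock(nn.Module):
    """One bidirectional spatial fusion block, loosely following Eqs. (2)-(3)."""

    def __init__(self, c=128, heads=4):
        super().__init__()
        self.norm_c = nn.LayerNorm(c)
        self.norm_r = nn.LayerNorm(c)
        self.attn_c2r = nn.MultiheadAttention(c, heads, batch_first=True)  # camera queries radar
        self.attn_r2c = nn.MultiheadAttention(c, heads, batch_first=True)  # radar queries camera
        self.H = nn.Conv2d(2 * c, c, 3, padding=1)  # fuses both directions for the camera branch
        self.G = nn.Conv2d(c, c, 3, padding=1)      # refines the radar branch

    def forward(self, f_c, f_r):
        """f_c, f_r: (B, C, H, W) camera / radar BEV features."""
        b, c, h, w = f_c.shape
        # Flatten the BEV maps into token sequences for attention.
        tc = self.norm_c(f_c.flatten(2).transpose(1, 2))  # (B, H*W, C)
        tr = self.norm_r(f_r.flatten(2).transpose(1, 2))
        a_cr, _ = self.attn_c2r(tc, tr, tr)  # Attn(norm(F_C), norm(F_R))
        a_rc, _ = self.attn_r2c(tr, tc, tc)  # Attn(norm(F_R), norm(F_C))
        # Reshape tokens back to 2D maps so convolutions capture local structure.
        a_cr = a_cr.transpose(1, 2).view(b, c, h, w)
        a_rc = a_rc.transpose(1, 2).view(b, c, h, w)
        f_c_next = self.H(torch.cat([a_cr, a_rc], dim=1))  # Eq. (2)
        f_r_next = self.G(a_rc)                            # Eq. (3)
        return f_c_next, f_r_next

# Toy usage: stack N = 3 blocks and run both BEV feature maps through them.
blocks = nn.ModuleList(BSFBlock() for _ in range(3))
f_c, f_r = torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32)
for blk in blocks:
    f_c, f_r = blk(f_c, f_r)
```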
2.2.4 Prediction heads and losses
The BEV fusion features are fed to the 3D object detection prediction heads. Referring to TransFusion [19], we simply use the class embedding heatmap transformed from the fusion features as the query initialization to predict the centers of all objects in each scene. A vanilla transformer is used as the decoder for the DETR [22] prediction parts through the Hungarian algorithm [26], and we set the regularized matching loss function as a weighted sum of the classification, regression, and IoU calculations:

L_tot = λ_1 L_cls + λ_2 L_reg + λ_3 L_IoU, (4)

where λ_1, λ_2, and λ_3 represent the coefficient parameters, and L_cls, L_reg, and L_IoU are the individual loss functions above.
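To make the matching step concrete, here is a hedged sketch of a Hungarian assignment followed by the weighted sum of Eq. (4). The matching cost, the stand-in IoU term, and the weights are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def matching_loss(pred_cls, pred_box, gt_cls, gt_box, w=(1.0, 0.25, 0.25)):
    """Set-matching loss in the spirit of Eq. (4).

    pred_cls: (Q, K) class logits, pred_box: (Q, 7) box parameters,
    gt_cls: (M,) class ids, gt_box: (M, 7) ground-truth boxes.
    """
    prob = pred_cls.softmax(dim=-1)
    cost_cls = -prob[:, gt_cls]                    # (Q, M) negative class probability
    cost_reg = torch.cdist(pred_box, gt_box, p=1)  # (Q, M) L1 box distance
    # Hungarian algorithm: one-to-one assignment minimizing the total cost.
    rows, cols = linear_sum_assignment((cost_cls + cost_reg).detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)

    l_cls = F.cross_entropy(pred_cls[rows], gt_cls[cols])
    l_reg = F.l1_loss(pred_box[rows], gt_box[cols])
    l_iou = l_reg  # placeholder: a real head would use a rotated-3D-IoU loss here
    return w[0] * l_cls + w[1] * l_reg + w[2] * l_iou

# Toy usage: 20 queries, 10 classes, 4 ground-truth boxes.
loss = matching_loss(torch.randn(20, 10), torch.randn(20, 7),
                     torch.randint(0, 10, (4,)), torch.randn(4, 7))
```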
3 Results and discussion
3.1 Implementation details
Training. This end-to-end work is implemented on the open-source MMDetection3D [27]