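Xinjiang Geology (XINJIANG GEOLOGY), Vol. 42, No. 1, March 2024

Geological Mineral Attribute Recognition Method Based on Large-Scale Pre-Trained Model and Its Application

Wang Binbin 1,2,4, Zhou Kefa 2,3,5, Wang Jinlin 1,2,3,4, Wang Wei 1,2,3,4, Li Chao 5, Cheng Yinyi 2

(1. Xinjiang Research Center for Mineral Resources, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi, Xinjiang 830011, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing 100094, China; 4. Xinjiang Key Laboratory of Mineral Resources and Digital Geology, Urumqi, Xinjiang 830011, China; 5. Institute of Geological Survey, China University of Geosciences, Wuhan, Hubei 430074, China)

Funding: supported jointly by the Major Science and Technology Special Project of the Xinjiang Uygur Autonomous Region (2021A03001-3), a Xinjiang science […] project (2022xj?1306) and a […] big data […] project (292022000059).
Received: 2023-09-18. Revised: 2024-01-09.
First author: Wang Binbin (born 1998), graduate student in Earth Exploration and Information Technology at the University of Chinese Academy of Sciences; research interest: geological big data.

Abstract: Geoscience research results are usually recorded in technical reports, journal papers, books and other literature, yet many detailed geoscience reports go unused, which creates both challenges and opportunities for information extraction. To this end, we propose a deep neural network model named GMNER (Geological Minerals named entity recognize, MNER) that recognizes and extracts key information such as mineral types, geological structures, rocks and geological time. Unlike traditional methods, this work uses the large-scale pre-trained model BERT (Bidirectional Encoder Representations from Transformers) together with a deep neural network to capture contextual information, and combines them with a conditional random field (CRF) to obtain more accurate results. Experiments show that MNER performs well on Chinese geological literature, with an average precision of 0.8984, an average recall of 0.9227 and an average F1 score of 0.9104. The study offers a new approach to automated mineral information extraction and may also support mineral resource management and sustainable utilization.

Key words: mineral information extraction; deep neural network; mineral literature; named entity recognition

Geoscience research results are usually recorded in technical reports, journal papers, books and other literature. In recent years, open data has become a policy and development trend for data sharing among governments and research institutions [1-3]. Geological survey agencies in many countries (such as USGS and CGS) have released their survey results publicly. As an important part of open data, geoscience literature provides a large body of material for research on geological mineral information extraction. Extracting structured information and discovering knowledge from geoscience text data is an active research topic in digital geoscience. Processing Chinese geoscience literature is especially difficult: Chinese words are not separated by spaces, so it is hard for a computer to identify word boundaries reliably [4-5]. Deep-learning-based named entity recognition (NER) for mineral resources is an important way to automate mineral information extraction, and it is also a prerequisite for building knowledge resources in the mineral domain.

At present there is little research on named entity recognition for geological minerals. For geological named entity recognition more broadly, some researchers have introduced deep learning and obtained useful results. Zhang et al. worked on geological literature [6] and designed a geological named entity recognition model based on a deep belief network. Qiu et al. proposed a model that combines a bidirectional long short-term memory network (Bi-directional Long Short-Term Memory, BiLSTM) with a CRF [7]; it exploits the associations between words and can extract geological entities, such as geological structures, from geological reports. Li et al. proposed a Chinese word segmentation algorithm based on a geological domain ontology [8]; it uses a self-learning method to segment geological domain text effectively.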
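Extracting mineral information involves three difficulties: (1) mineral information comes from a wide range of sources, including literature, patents, reports, news and other types of text [9]; (2) the naming rules for mineral information are not unified, because different regions, fields and periods may use different naming conventions, which makes named entity recognition difficult; (3) mineral information is linguistically complex at the lexical, syntactic and […] levels.

We therefore propose a geological mineral named entity recognition model based on a deep neural network. Working from five regional mineral reports and from the characteristics of mineral text, it extracts information such as mineral types, geological structures, rocks, geological time and metallogenic regions. Unlike the methods used in previous work, it combines the large-scale pre-trained model BERT with a deep neural network to learn contextual information and uses a conditional random field to obtain the globally optimal tag sequence [10], which yields the final geological mineral named entity recognition result.

Article ID: 1000-8845(2024)01-139-06; CLC number: O741?.2; N945.12; Document code: A

1 Model

The overall architecture of the large-scale pre-trained BERT and deep neural network model used in this paper is shown in Fig. 1. It consists of a BERT layer, a BiLSTM layer, a fully connected layer and a CRF layer. First, the BERT pre-trained model is trained on a large-scale geological mineral dataset to capture rich syntactic and semantic features and to produce word-vector embeddings. A long short-term memory network then extracts features from these embeddings, and the output features of the two directions are concatenated. Finally, a fully connected layer reduces the dimensionality, and its output features are passed to the CRF layer for correction.

[Fig. 1 Frame diagram of MNER model]

1.1 BERT

Devlin et al. proposed the BERT model [11]. In contrast to the left-to-right Transformer in OpenAI GPT and the concatenated bidirectional LSTMs in ELMo [12-13], BERT uses a bidirectional Transformer [14] (Fig. 2), where "Trm" denotes a Transformer block. The distance between two words at any positions is 1, so the model can capture longer-range context, handles long-term dependencies between words and sentences in NLP effectively, and represents the bidirectional relations within a sentence more completely. The input representation of the model is the combination of token embeddings, position embeddings and segment embeddings.

[Fig. 2 BERT model structure diagram]

The Transformer block is an attention-based encoder (Fig. 3) and an important component of BERT. Within a Transformer encoder unit, the self-attention mechanism mainly computes the correlation between the words of a text sequence. Its main function is to let the network concentrate on the features that influence the output most, so that the contributions of different parts of the input to the output can be distinguished. The encoder consists of 6 layers. Its output is computed by Eqs. (1) and (2): […]

[Fig. 3 Structure diagram of Transformer encoder]

The decoder also has 6 layers. Unlike the encoder, it contains an additional attention layer, whose purpose is to supply context information at the following time steps. The attention of a single head is computed as

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3}$$

In Eq. (3), the input word-vector matrices are Q, K and V. The product QK^T measures the relations between the input word vectors, and d_k is the dimension of the input vectors. After QK^T is computed, softmax normalization is applied, and the final output is a weighted sum of all word vectors in the sentence. In this way each generated word vector contains information about the other words in the sentence and depends on its context, so it is more global than a traditional word vector.

In addition, the "Multi-Head" mechanism of the Transformer encoder shown in Fig. 3 draws on information from different dimensions and so strengthens the representational capacity of the word vectors. Multi-head attention is computed as in Eqs. (4) and (5):

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_h)\,W^{O} \tag{4}$$

$$\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V}) \tag{5}$$

The method applies several linear transformations to Q, K and V, computes a new Q, K and V for each, and then concatenates the results to obtain the final attention output.

So that the Transformer can perceive position within a sentence, a position function is used to compute the position information of each word, and the resulting vector is added directly to the word vector to form the model input. For the word at position pos, the elements of its position vector are given by Eqs. (6) and (7):

$$PE(pos,2i)=\sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{6}$$

$$PE(pos,2i+1)=\cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{7}$$

In Eqs. (6) and (7), d_model is the dimension of the word vector.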
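To make the attention and position formulas concrete, the following Python sketch evaluates the sinusoidal position vectors of Eqs. (6)-(7) and the scaled dot-product attention of Eq. (3) on a toy input. It is an illustration only: the function names, the 3-token input and its values are assumptions introduced here, and the learned projection matrices of multi-head attention are omitted, so it should not be read as the model used in this paper.

```python
import math


def positional_encoding(max_len, d_model):
    """Return max_len position vectors of size d_model, following Eqs. (6)-(7)."""
    pe = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            i = j // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input (assumed values): 3 tokens with 4-dimensional embeddings.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
pe = positional_encoding(3, 4)
# Position vectors are added directly to the word vectors, as described in the text.
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]
out, weights = attention(x, x, x)  # self-attention: Q = K = V = x
```

Each row of `weights` sums to 1, so every output vector is a weighted sum of all token vectors; this is the sense in which each word vector carries information about the other words in the sentence.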
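BERT learns word vectors through two self-supervised tasks, MLM (Masked Language Model) and NSP (Next Sentence Prediction). This paper uses joint training on BERT's MLM and NSP tasks and outputs a vector for each word in the geological mineral text.

1.2 BiLSTM

Named entity recognition needs both the preceding and the following context. A bidirectional recurrent neural network (BiRNN) can model context in both directions effectively. A plain RNN, however, often runs into the gradient problem described in [15] and is limited in how well it retains long-term dependencies. The long short-term memory (LSTM) network addresses this effectively [16]. The structure of an LSTM is similar to that of an RNN: it keeps the hidden layer of the RNN and adds gates, which improves both the long-term and the short-term memory of the network. The LSTM is defined as follows (written here in the standard notation, with σ the sigmoid function and W, U, b the weights and biases):

$$f_t=\sigma(W_f x_t+U_f h_{t-1}+b_f) \tag{8}$$

$$i_t=\sigma(W_i x_t+U_i h_{t-1}+b_i) \tag{9}$$

$$o_t=\sigma(W_o x_t+U_o h_{t-1}+b_o) \tag{10}$$

$$\tilde{c}_t=\tanh(W_c x_t+U_c h_{t-1}+b_c) \tag{11}$$

$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t \tag{12}$$

$$h_t=o_t\odot\tanh(c_t) \tag{13}$$

Here x_t is the current input, h_{t-1} is the state at the previous time step, f_t is the forget gate, i_t is the input gate, o_t is the output gate, c_t is the information held by the cell at the current time step, \tilde{c}_t is the information the cell is about to write, and h_t is the final output of the LSTM unit.

When BiLSTM is used for named entity recognition, the vector that BERT produces for each word of the geological mineral text serves as the input. Through the LSTM, the network learns context features automatically and computes the best classification result for the current time step. Used as a classifier, the network extracts rich context features.

1.3 CRF

Although networks such as BiLSTM and IDCNN capture context information, they ignore the dependencies and associations between entity tags. In named entity recognition, the annotation rules make some consecutive tags illogical. A CRF takes the logical relations between tags into account, finds the globally optimal tag sequence and thereby corrects the final result. It works as follows. Let P_{i,j} be the probability that the i-th word matches the j-th tag. For an input sentence x = (x_1, x_2, x_3, …, x_n) and a predicted sequence y = (y_1, y_2, y_3, …, y_n), the score is

$$s(x,y)=\sum_{i=0}^{n}A_{y_i,y_{i+1}}+\sum_{i=1}^{n}P_{i,y_i} \tag{14}$$

In Eq. (14), A holds the transition scores between entity tags in the predicted sequence. The probability of an output sequence is computed as

$$p(y\mid x)=\frac{e^{s(x,y)}}{\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}} \tag{15}$$

$$\log p(y\mid x)=s(x,y)-\log\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})} \tag{16}$$

$$y^{*}=\arg\max_{\tilde{y}\in Y_x}s(x,\tilde{y}) \tag{17}$$

In Eqs. (15)-(17), Y_x is the set of all possible tag sequences for x and \tilde{y} ranges over it; the final prediction y^{*} is the sequence with the highest score s(x, \tilde{y}).

2 Experiments

2.1 Dataset

The corpus used in this study comes from five Chinese regional geological survey reports for the […] area, in total […] 50 […]. The text contains noise such as symbols, figures and tables, which makes it irregular, so it was preprocessed: the format and content were checked, figures and tables were removed, continuous text was split into sequences of words, punctuation, digits and spaces, and sentences containing geological mineral entity information were selected. This finally gave […] 8,000 valid sentences. The annotated data were split at random into training, validation and test sets in the ratio 8:1:1.

2.2 Annotation and evaluation metrics

Text annotation marks the entities and non-entities in the text. We use the "BIO" (Beginning, Inside, Outside) scheme, in which "B" marks the first character of an entity, "I" marks all following characters of the entity and "O" marks non-entity text. We annotated 18,783 entities covering 6 main attributes of mineral resources: mineral deposit location, rock, stratum, mineral type, geological structure and geological time (Table 1). Entities are annotated character by character. For example, in one annotated sentence a four-character deposit name is tagged B-LOC, I-LOC, I-LOC, I-LOC, the four-character rock name that follows is tagged B-ROC, I-ROC, I-ROC, I-ROC, and the three-character stratum name after it is tagged B-SG, I-SG, I-SG; the three strings are a mineral deposit location, a rock and a stratum respectively.

The evaluation metrics for named entity recognition are precision (P), recall (R) and the F1 score. They are defined in terms of entity counts: Tp is the number of correctly recognized entities, Fp is the number of wrongly recognized entities, and Fn is the number of entities that actually exist but were not recognized. These 3 metrics are widely used in NER evaluation [17-18].

$$P=\frac{T_p}{T_p+F_p}\times 100\% \tag{18}$$

$$R=\frac{T_p}{T_p+F_n}\times 100\% \tag{19}$$

$$F_1=\frac{2\times P\times R}{P+R}\times 100\% \tag{20}$$

2.3 Environment and parameters

The models were trained and tested with Python 3.7.3 and TensorFlow 1.14.1. BERT-Base was used, with 12 layers, 768 hidden dimensions and 12 attention heads. The BiLSTM network has a 128-dimensional hidden layer. […] has 50 dimensions, the maximum sequence length is 256, and all models were trained on 4 RTX 2080 Ti GPUs (Table 2).

2.4 Hyperparameters

Setting hyperparameters sensibly is essential in deep learning. The learning rate, a key hyperparameter, affects whether and how the objective function converges to a local minimum. We ran the BERT-LSTM-CRF model with different learning rates. The results (Table 3) show clearly that the model performs best at a learning rate of 4e-5. Dropout is a regularization technique commonly used with BERT: it randomly sets the outputs of some neurons to zero, which reduces the risk of overfitting. We also varied dropout in the BERT-LSTM-CRF model. The results (Table 4) show that the model performs best with a dropout of 0.1.

Table 1 Types and quantities of geological mineral entities

Tag    Entity type                 Count
LOC    Mineral deposit location    1,947
ROC    Rock                        8,139
SG     Stratum                     2,619
GES    Geological structure        1,089
MT     Mineral type                3,636
GT     Geological time             1,353

Table 2 Experimental environment

Category    Configuration
Hardware    CPU: 12 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz
            GPU: 4 x RTX 2080 Ti
            Video memory: 11 GB DDR6
Software    OS: Ubuntu (16.04)
            CUDA: 10.0
            Python: 3.7
            Pytorch: 1.6.0
            TensorFlow: 1.14.1
            Numpy: 1.21.6

Table 3 Influence of learning rate on the model

Learning rate    1e-5      2e-5      3e-5      4e-5      5e-5
P                0.8801    0.8818    0.8881    0.8984    0.8955
R                0.9065    0.9083    0.9179    0.9227    0.9143
F                0.8931    0.8949    0.9027    0.9104    0.9048

Table 4 Effects of dropout on the model

Dropout    0.1       0.3       0.5
P          0.8984    0.8968    0.8824
R          0.9227    0.9167    0.9083
F          0.9104    0.9066    0.8951

These results underline the importance of hyperparameter selection, in particular how strongly the learning rate and dropout affect the performance of the BERT-LSTM-CRF model. Tuning the hyperparameters can improve model performance and generalization to some extent.

The named entity recognition results of the models are given in Table 5. Of all the models compared, BERT-LSTM-CRF performs best, with precision, recall and F1 of 0.8984, 0.9227 and 0.9104. For the model that combines BERT with CRF only,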
the precision, recall and F1 are 0.8807, 0.9029 and 0.8917. When a bidirectional LSTM network is added, F1 drops, possibly because BERT already has a bidirectional structure, so adding BiLSTM leads to overfitting. With the RoBERTa pre-trained model the results are worse. Although RoBERTa and BERT are both pre-trained models based on the Transformer architecture, they may differ in pre-training, parameters and other respects. The structure and parameters of BERT are better suited to named entity recognition in the geological mineral domain. It is worth noting that in the Chinese texts, rock and mineral type entities together account for more than 50% of all entities; accordingly, all models recognize "rock" and "mineral type" entities well, with F1 above 90%. The two next best recognized entity types are "stratum" and "geological time", whose scores are also relatively high. These results indicate that BERT-LSTM-CRF works best for mineral entity recognition and outperforms the RoBERTa pre-trained models. The recognition results for the different entity types also differ to some extent between models.

Table 5 Results of model experiments

(a) BERT-based models

Type   BERT-BiLSTM-CRF             BERT-LSTM-CRF               BERT-CRF
       P        R        F         P        R        F         P        R        F
LOC    0.7536   0.7075   0.7298    0.7658   0.8231   0.7934    0.7518   0.7007   0.7254
ROC    0.9150   0.9174   0.9162    0.9240   0.9387   0.9313    0.9146   0.9268   0.9206
GES    0.7889   0.7320   0.7594    0.9348   0.8866   0.9101    0.8372   0.7423   0.7869
SG     0.8139   0.9370   0.8711    0.8612   0.8866   0.8737    0.8550   0.9412   0.8960
MT     0.9121   0.9262   0.9191    0.9309   0.9538   0.9422    0.9110   0.9446   0.9275
GT     0.8015   0.9545   0.8714    0.8689   0.9636   0.9138    0.8189   0.9455   0.8776
AVG    0.8700   0.8951   0.8824    0.8984   0.9227   0.9104    0.8807   0.9029   0.8917

(b) RoBERTa-based models

Type   RoBERTa-BiLSTM-CRF          RoBERTa-LSTM-CRF            RoBERTa-CRF
       P        R        F         P        R        F         P        R        F
LOC    0.7165   0.6190   0.6642    0.7123   0.7075   0.7099    0.7328   0.6531   0.6906
ROC    0.9118   0.9228   0.9173    0.9156   0.9241   0.9198    0.9071   0.9228   0.9149
GES    0.7816   0.7010   0.7391    0.8025   0.6701   0.7303    0.7283   0.6907   0.7090
SG     0.8491   0.9454   0.8946    0.7939   0.8739   0.8320    0.8409   0.9328   0.8845
MT     0.9132   0.9385   0.9256    0.9077   0.9385   0.9228    0.8991   0.9323   0.9154
GT     0.7970   0.9636   0.8724    0.8522   0.8909   0.8711    0.8154   0.9636   0.8833
AVG    0.8722   0.8921   0.8820    0.8681   0.8837   0.8758    0.8655   0.8915   0.8783

Note: P, R and F denote precision, recall and F1 score.

3 Conclusions

This study addresses deep-learning-based named entity recognition, that is, extracting named entities from a large body of literature on geological minerals. The extracted entities can serve as important data support for geological mineral […]. Using the BERT-LSTM-CRF model, the authors extracted 6 types of entities from geological mineral literature, with an average precision of 0.8984, an average recall of 0.9227 and an average F1 score of 0.9104. The results lead to the following conclusions:
(1) For the named entity recognition task, BERT-LSTM-CRF performs best; adding BiLSTM causes overfitting and lowers model performance.
(2) Entity recognition works better when the boundaries of Chinese entities are clearly distinguishable.
(3) For named entity recognition in the geological mineral domain, RoBERTa does not perform as well as BERT. The structure and parameters of BERT are better suited to this task.

Although this study obtained good results for mineral named entity recognition, it has the following limitations and directions for further work:
(1) Performance on entity types with few examples still needs to improve. We plan to address this by increasing the number of mineral entities in the dataset.
(2) In future work, […] will be trained and fine-tuned for the geological mineral domain to improve the domain adaptability of the model.
(3) The information extracted from geological mineral text will be used to build a domain […] for geological minerals.

References

[1] Ali S H, Giurco D, Arndt N, et al. Mineral supply for sustainable development requires resource governance[J]. Nature, 2017, 543(7645): 367-372.
[2] Cernuzzi L, Pane J. Toward open government in Paraguay[J]. IT Professional, 2014, 16(5): 62-64.
[3] Ma X. Linked Geoscience Data in practice: Where W3C standards meet domain knowledge, data visualization and OGC standards[J]. Earth Science Informatics, 2017, 10(4): 429-441.
[4] Gao J, Li M, Huang C N, et al. Chinese word segmentation and named entity recognition: A pragmatic approach[J]. Computational Linguistics, 2005, 31(4): 531-574.
[5] Huang L, Du Y, Chen G. GeoSegmenter: A statistically learned Chinese word segmenter for the geoscience domain[J]. Computers & Geosciences, 2015, 76: 11-17.
[6] Zhang X, Fan D, Xu J, et al. Sedimentary laminae in muddy inner continental shelf sediments of the East China Sea: Formation and implications for geochronology[J]. Quaternary International, 2018, 464: 343-351.
[7] Qiu Q, Xie Z, Wu L, et al. BiLSTM-CRF for geological named entity recognition from the geoscience literature[J]. Earth Science Informatics, 2019, 12: 565-579.
[8] Li W, Ma K, Qiu Q, et al. Chinese Word Segmentation Based on Self-Learning Model and Geological Knowledge for the Geoscience Domain[J]. Earth and Space Science, 2021, 8(6): 1673.
[9] Wang B, Ma K, Wu L, et al. Visual analytics and information extraction of geological content for text-based mineral exploration reports[J]. Ore Geology Reviews, 2022, 144: 104818.
[10] Sobhana N, Mitra P, Ghosh S K. Conditional random field based named entity recognition in geological text[J]. International Journal of Computer Applications, 2010, 1(3): 143-147.
[11] Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv: 2018, 1810.
[12] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[J]. 2018.
[13] Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations[J]. arXiv preprint arXiv, 2018, 1802.
[14] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
[15] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult[J]. IEEE Transactions on Neural Networks, 1994, 5(2): 157-166.
[16] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[17] […], et al. Named entity recognition for legal documents of […] cases based on Bi-LSTM[J]. Network Security Technology & Application, 2023(7): 36-39. (in Chinese)
[18] […], et al. Chinese named entity recognition in regional geological survey texts[J]. Geological Review, 2023, 69(04): 1423-1433. (in Chinese)

Geological Mineral Attribute Recognition Method Based on Large-Scale Pre-Trained Model and Its Application

Wang Binbin 1,2,4, Zhou Kefa 2,3,5, Wang Jinlin 1,2,3,4, Wang Wei 1,2,3,4, Li Chao 5, Cheng Yinyi 2

(1. Xinjiang Research Center for Mineral Resources, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi, Xinjiang, 830011, China; 2. University of Chinese Academy of Sciences, Beijing, 100049, China; 3. Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing, 100094, China; 4. Xinjiang Key Laboratory of Mineral Resources and Digital Geology, Urumqi, Xinjiang, 830011, China; 5. Institute of Geological Survey, China University of Geosciences, Wuhan, Hubei, 430074, China)

Abstract: Geoscience research results are usually documented in technical reports, journal papers, books, and other literature; however, many detailed geoscience reports are unused, which provides challenges and opportunities for information extraction. To this end, we propose a deep neural network model called GMNER (Geological Minerals named entity recognize, MNER) for recognizing and extracting key information such as mineral types, geological formations, rocks, and geological time. Unlike traditional methods, we employ a large-scale pre-trained model BERT (Bidirectional Encoder Representations from Transformers, BERT) and a deep neural network to capture contextual information and combine it with a conditional random field (CRF) to obtain more accurate information. The experimental results show that the MNER model performs well in Chinese geological literature, achieving an average precision of 0.8984, an average recall of 0.9227, and an average F1 score of 0.9104. This study not only provides a new way for automated mineral information extraction but also is expected to promote the progress of mineral resource management and sustainable utilization.

Key words: Mineral information extraction; Deep neural network; Mineral documentation; Named entity recognition
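The reported scores can be related to one another with a short worked example. The Python sketch below computes entity-level precision, recall and F1 from BIO tag sequences following Eqs. (18)-(20), and checks that the average precision and recall quoted in the abstract are consistent with the quoted F1. It assumes that a predicted entity counts as correct only when both its boundaries and its type match the gold entity; the paper does not spell out the matching rule, and the helper names and the toy tag sequences are hypothetical.

```python
def bio_to_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence (end exclusive)."""
    spans, start, etype = set(), None, None
    for idx, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, idx, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = idx, tag[2:]
    return spans


def f1_score(p, r):
    """Harmonic mean of precision and recall, Eq. (20) without the percent scaling."""
    return 2 * p * r / (p + r) if p + r else 0.0


def prf(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 under exact span-and-type matching (assumed rule)."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)   # correctly recognized entities
    fp = len(pred - gold)   # wrongly recognized entities
    fn = len(gold - pred)   # existing entities that were missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f1_score(p, r)


# Hypothetical example: three gold entities, the last one predicted with a wrong boundary.
gold = ["B-LOC", "I-LOC", "I-LOC", "I-LOC",
        "B-ROC", "I-ROC", "I-ROC", "I-ROC",
        "B-SG", "I-SG", "I-SG"]
pred = ["B-LOC", "I-LOC", "I-LOC", "I-LOC",
        "B-ROC", "I-ROC", "I-ROC", "I-ROC",
        "B-SG", "I-SG", "O"]
p, r, f1 = prf(gold, pred)

# Consistency check on the averages reported in the abstract.
reported_f1 = round(f1_score(0.8984, 0.9227), 4)
```

In the toy example two of the three entities match exactly, so precision, recall and F1 all equal 2/3. Applying Eq. (20) to the reported averages gives 2 × 0.8984 × 0.9227 / (0.8984 + 0.9227) ≈ 0.9104, which agrees with the F1 score stated in the abstract.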