李林睿 常舒予 喬一鳴
Abstract: LAMOST (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, also known as the Guo Shoujing Telescope) provides a large volume of astronomical spectral data, and the classification of celestial objects is a widely studied problem in astronomy. Because the number of objects is large and the data are high-dimensional, applying machine learning to spectral processing has become a hot topic in recent years. To address celestial object classification, we propose HSODM (High-dimensional Spectral with Outlier Data Mining), an improved method for identifying outliers in high-dimensional data. It uses unsupervised learning and a random-distance criterion to pick out the very small number of unknown celestial objects or outlier records in a large set of high-dimensional spectra, which facilitates subsequent classification, outlier mining, and related processing. The project covers data preprocessing, dimensionality reduction by principal component analysis, construction and training of a long short-term memory (LSTM) neural network, parameter tuning, and result prediction and analysis; the model is then evaluated and presented using evaluation metrics and data visualization. The proposed improvements and the optimized neural network shorten training time and improve prediction accuracy. Experiments show that the improved method raises the area under the ROC (receiver operating characteristic) curve, the area under the P-R curve, the F1 score, and the G-mean score.
Key words: representation learning; high-dimensional spectra; outlier detection; data mining; classification
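In astronomy, advances in science and technology have produced observing equipment that lets us look deeper into the universe, and with it an explosive growth in astronomical data [1]. The Guo Shoujing Telescope (LAMOST) has the highest spectral acquisition rate of any telescope in the world and can collect more than ten thousand spectra per observing night. These spectra will supply astronomers and astrophysicists with abundant material for research on galaxy redshift surveys, cosmological models, the large-scale structure of the universe, galaxy formation and evolution, and spectroscopic observations combined with other kinds of radiation [2], and will help advance and complete the field. Each spectrum in the LAMOST dataset provides a series of radiation intensity values over the wavelength range 3690-9100 Å. Spectral classification means selecting and extracting, from spectral data with thousands of dimensions, the features that are most effective for classification in order to build a feature space (for example, the flux values at specific wavelengths or in specific bands), and then applying algorithms to distinguish the various types of celestial objects.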
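The idea of using flux values in specific bands as features can be illustrated with a minimal sketch. Only the 3690-9100 Å range comes from the text; everything else here is an assumption made for illustration. The band names and window widths in `BANDS` are hypothetical (only the H-beta and H-alpha line centres near 4861 Å and 6563 Å are standard), the wavelength grid is assumed to be linear, and the spectrum is synthetic rather than real LAMOST data. It is not the feature set used by HSODM.

```python
from statistics import fmean

# Wavelength range stated in the text, in angstroms.
WAVE_MIN, WAVE_MAX = 3690.0, 9100.0

# Hypothetical feature bands: name -> (low, high) in angstroms.
# Line centres are standard; the window widths are arbitrary choices.
BANDS = {
    "h_beta": (4851.0, 4871.0),
    "h_alpha": (6553.0, 6573.0),
    "red_continuum": (8000.0, 8100.0),
}


def make_grid(n_points):
    """Evenly spaced wavelength grid over WAVE_MIN..WAVE_MAX (assumed linear)."""
    step = (WAVE_MAX - WAVE_MIN) / (n_points - 1)
    return [WAVE_MIN + i * step for i in range(n_points)]


def band_features(wavelengths, flux, bands=BANDS):
    """Return the mean flux inside each band, in the order the bands are listed."""
    features = []
    for low, high in bands.values():
        inside = [f for w, f in zip(wavelengths, flux) if low <= w <= high]
        if not inside:
            raise ValueError(f"no samples in band {low}-{high}")
        features.append(fmean(inside))
    return features


# Synthetic spectrum: flat continuum of 1.0 with an absorption dip near H-alpha.
wavelengths = make_grid(5411)  # 1 angstrom spacing, chosen for illustration
flux = [0.5 if 6553.0 <= w <= 6573.0 else 1.0 for w in wavelengths]

features = band_features(wavelengths, flux)
print(dict(zip(BANDS, features)))
```

Each spectrum with thousands of samples is reduced here to three numbers, which is the sense in which band selection builds a lower-dimensional feature space. A real pipeline would need to read the actual wavelength sampling from the data files and choose bands based on the object classes of interest.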