CBL - Campus del Baix Llobregat

Defended project

Title: Spectral automated classification in large databases


Students who have defended this project:


Supervisor: REBASSA MANSERGAS, ALBERTO

Department: FIS

Offer start date: 26-06-2020     Offer end date: 26-02-2021



Degree programmes the project is assigned to:
    DG ENG AERO/SIS TEL
    DG ENG AERO/TELEMÀT
    DG ENG SISTE/TELEMÀT
Type: Individual
 
Location: EETAC
 
Co-supervisor (UPC): TORRES GIL, SANTIAGO
 
Keywords:
databases, artificial intelligence, spectroscopy
 
Description of content and activity plan:
Pattern recognition in large databases using automated
artificial intelligence methods is one of the most challenging
problems in science and technology today. Although the
theoretical foundations may be the same, the applications are
enormously varied: voice recognition, image analysis, and signal
processing are a few examples.

In particular, spectroscopic analysis is without doubt one of
the most valuable observational techniques in modern astronomy.
In a few months, the Gaia satellite, launched by the European
Space Agency, will release spectra of more than 300 million
stars in our Galaxy, opening an exciting new era for stellar
spectroscopy. However, given the large amount of data, automated
spectral classification is required. This bachelor's thesis (TFG)
will consist of developing and implementing the necessary tools,
based on artificial intelligence techniques, for such an
automated spectral classification. As a first step, the
algorithm will be tested with known synthetic data. Once the
machine learning process has proven efficient, it will be
applied to real databases.
 
Overview (abstract in English):
Due to the vast amount of data collected every day, there is a need for Machine Learning algorithms that can process and link raw data with as little human supervision as possible. One of the most popular is the Random Forest, which can be used to solve a wide variety of classification tasks. In Astronomy in particular, millions of objects are observed by satellites and telescopes, for instance by the Gaia space mission, and the received signals are recorded as spectra. Random Forest algorithms have proven to be a versatile and powerful tool for identifying and classifying stellar populations.
In the present project, we apply a Random Forest algorithm to spectroscopic data with the aim of efficiently classifying three different populations of stars of particular interest. Our main objective is to study the principal parameters and variables that affect the classification performance of the algorithm, and to build a Random Forest model that categorizes spectra observed by current and future missions. We aim to obtain the best results according to the characteristics of each population, while maintaining an efficient and versatile model. To achieve that, we rely on both simulated and observed spectra to train and test the algorithm, and on quantitative metrics to measure its performance.
Throughout this project, we have laid the foundations of the Random Forest classifier and of the data preparation, analyzing the theoretical classification performance with simulated data. We then used the Random Forest model to classify a real set of spectroscopic data collected by the Sloan Digital Sky Survey, which revealed a notable agreement between the human-made and the Random Forest classifications, an agreement that increased considerably once several improvements were applied to the algorithm. Finally, we simulated spectra of the population expected to be released by the Gaia space mission, and built a Random Forest model based on them. Several improvements had to be introduced, but we eventually achieved a solid model with satisfactory results. With it, we were able to classify two different sets of stellar spectra with different characteristics, maximizing the number of correctly classified objects while minimizing the number of false positives.
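
As a minimal sketch of the kind of quantitative metrics referred to in this overview, the following Python snippet computes per-class precision (how few false positives a class receives) and recall (how many of its true members are recovered) by comparing a reference classification with a predicted one. The function name per_class_metrics, the labels "A", "B" and "C", and the toy label lists are hypothetical placeholders, not data or code from the project, and the metrics actually used in the project may differ.

```python
from collections import Counter


def per_class_metrics(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label seen in either list."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    # Each (true, predicted) pair is one cell of the confusion matrix.
    cells = Counter(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))

    metrics = {}
    for label in labels:
        tp = cells[(label, label)]
        # False positives: predicted as `label` but truly something else.
        fp = sum(n for (t, p), n in cells.items() if p == label and t != label)
        # False negatives: truly `label` but predicted as something else.
        fn = sum(n for (t, p), n in cells.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Hypothetical example: three placeholder populations "A", "B" and "C".
reference = ["A", "A", "A", "B", "B", "C", "C", "C"]  # e.g. human-made labels
predicted = ["A", "A", "B", "B", "B", "C", "A", "C"]  # e.g. classifier output

metrics = per_class_metrics(reference, predicted)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers per population, rather than a single overall accuracy, makes the trade-off between correctly classified objects and false positives visible for each class separately.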


© CBLTIC Campus del Baix Llobregat - UPC