CBL - Campus del Baix Llobregat

Defended project

Title: Minería de texto mediante NLP en el sector seguros (Text mining using NLP in the insurance sector)


Students who have defended this project:


Supervisor: MORA SERRANO, FRANCISCO JAVIER

Department: DECA

Offer start date: 18-07-2022     Offer end date: 18-03-2023



Degree programmes the project is assigned to:
    GR ENG SIS TELECOMUN
    GR ENG SIST AEROESP
    GR ENG TELEMÀTICA
Type: Individual
 
Place of completion: EETAC
 
Name of second supervisor (UPC): Alberto Burgos (CIMNE-TIC)
Department of second supervisor:
 
Keywords:
Machine Learning, Churn prediction, text mining
 
Description of the content and activity plan:
Insurance broker management software allows all kinds of data
arising from the interaction between client and broker to be
recorded. Many of these records are free text, which cannot be
exploited directly with techniques designed for tabular data. The
aim of this final degree project (TFG) is to generate features
from free text that allow relevant information to be extracted for
subsequent analysis and/or modelling with Machine Learning
algorithms focused on churn prediction.
 
Overview (summary in English):
This end-of-degree project is about classifying the reasons for insurance cancellations from the text written by insurance brokers. To achieve this goal, it used data provided by brokers who all work with the same software, segElevia.

The task definition for this project emerged from the business understanding phase and from the analysis of the data obtained. That analysis revealed that brokers make mistakes when selecting the label for the cancellation reason: the label they choose does not always match the content of the free-text field.

The following software was used to develop this project: Jupyter Notebook as the working environment, Python as the development language, and scikit-learn, Pandas, Seaborn, spaCy and NumPy as libraries.

For data processing, several techniques were used: word elimination, lemmatization, tokenization, vectorization, zero padding and oversampling. The last of these was not adopted in the end, because its results were unsatisfactory.
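As a rough illustration of some of the steps named above, the following standard-library sketch chains tokenization, word elimination, vectorization and zero padding. It is not the project's code: the stop-word list, the function names and the example notes are all assumptions made for this sketch, and lemmatization is left out because it requires a language model such as the ones provided by spaCy.

```python
import re

# Hypothetical stop-word list; a real pipeline would use a fuller,
# language-specific list.
STOP_WORDS = {"the", "a", "to", "of", "and", "is", "his", "her"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def remove_stop_words(tokens):
    """Word elimination: drop tokens that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Map each distinct token to a positive id (0 is reserved for padding)."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def vectorize(tokens, vocab):
    """Replace each known token by its id; unknown tokens are skipped."""
    return [vocab[t] for t in tokens if t in vocab]


def zero_pad(vector, length):
    """Truncate or right-pad with zeros so all vectors share one length."""
    return vector[:length] + [0] * max(0, length - len(vector))


# Made-up example notes, not taken from the project's data.
notes = [
    "The client sold the car",
    "Client moved to a cheaper insurer",
]
token_lists = [remove_stop_words(tokenize(n)) for n in notes]
vocab = build_vocabulary(token_lists)
max_len = max(len(t) for t in token_lists)
padded = [zero_pad(vectorize(t, vocab), max_len) for t in token_lists]
```

After these steps every note is a fixed-length list of integers, which is the kind of tabular input that classifiers such as a random forest or a perceptron can consume.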

During the development of this end-of-degree project, a variety of artificial intelligence models were tested, such as the Random Forest Classifier and the Perceptron.

After analysing the results obtained with every model, the Random Forest Classifier was judged to give the most satisfactory results. This model achieves weighted averages of 75% precision, 74% recall and 74% F1-score.
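For readers unfamiliar with weighted averages of classification metrics, the sketch below shows one common definition: each class's precision, recall and F1-score is weighted by the share of true samples belonging to that class. The function name and the toy labels are assumptions for illustration; the project's own figures were presumably produced with library tooling rather than with this code.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return support-weighted precision, recall and F1 over all classes."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as it.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_c = tp / n_pred if n_pred else 0.0
        r_c = tp / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_true / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Made-up labels: three "price" cancellations and one "sale" cancellation.
y_true = ["price", "price", "price", "sale"]
y_pred = ["price", "price", "sale", "sale"]
precision, recall, f1 = weighted_metrics(y_true, y_pred)
```

With this definition, frequent cancellation reasons influence the reported score more than rare ones, which matters when the label distribution is imbalanced.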

Finally, based on these results, a predictor could be built that helps brokers by suggesting the label they should select while they are writing the free-text field, thus reducing the number of times brokers misclassify the reason for an insurance cancellation.
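The assistant described above could be wired up roughly as follows. This is only a sketch under stated assumptions: `suggest_label`, `toy_predict` and the label names are hypothetical, and `toy_predict` is a keyword rule standing in for a trained classifier.

```python
def suggest_label(text, chosen_label, predict):
    """Return the predicted label if it differs from the broker's choice, else None."""
    predicted = predict(text)
    return predicted if predicted != chosen_label else None


def toy_predict(text):
    """Stand-in for a trained classifier: a hypothetical keyword rule."""
    return "vehicle sold" if "sold" in text.lower() else "other"


# The broker picked "price", but the note mentions a sale, so a
# different label is suggested.
suggestion = suggest_label("Client sold the car", "price", toy_predict)
```

Returning `None` when the labels agree lets the calling interface stay silent unless there is something to flag.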


© CBLTIC Campus del Baix Llobregat - UPC