CBL - Campus del Baix Llobregat

Defended project

Title: Using Self-supervised Algorithms for Video Analysis and Scene Detection


Students who have defended this project:


Supervisor: TARRÉS RUIZ, FRANCESC

Department: TSC

Title: Using Self-supervised Algorithms for Video Analysis and Scene Detection

Offer start date: 07-01-2020     Offer end date: 07-09-2020



Degree programmes assigned to the project:
    MU MASTEAM 2015
Type: Individual
 
Location: EETAC
 
Keywords:
video analysis, scene detection, machine learning, deep learning, python, keras, shot detection, convolutional neural network, self-supervised learning
 
Description of contents and activity plan:
The main objective of the project is to evaluate the performance of self-supervised
learning models in the context of video analysis and scene classification. The idea is to
develop a neural network architecture that will be trained on a pretext task.

After convergence of the self-supervised algorithm, the architecture will be fine-tuned for
scene classification using a scene database based on TV series and films. The results
will be evaluated on the IBM scene database. The project includes the design of the NN
architecture, the coding in Python/Keras, the training and evaluation of the learning
model on both the pretext and the final objective task, and the generation of some extra
data to be added to the scene database.
 
Overview (abstract in English):
With the increasing amount of available audiovisual content, well-ordered and effective management of video is desired; therefore, automatic and accurate solutions for video indexing and retrieval are needed.
Self-supervised learning algorithms with 3D convolutional neural networks are a promising solution for these tasks, thanks to their independence from human annotations and their suitability for identifying spatio-temporal features.
This work presents a self-supervised algorithm for the analysis of video shots, accomplished in a two-stage implementation: (1) an algorithm that generates pseudo-labels for 20-frame samples containing different automatically generated shot transitions (hard cuts/crop cuts, dissolves, fades in/out, wipes), and (2) a fully convolutional 3D network trained on these samples, achieving an overall accuracy greater than 97% on the test set.
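The pseudo-label generation stage can be illustrated with a minimal NumPy sketch. The function name, clip shapes, and transition length below are illustrative assumptions (only two of the four transition types are shown), not the project's actual implementation:

```python
import numpy as np

def make_transition_sample(clip_a, clip_b, kind="dissolve", trans_len=8):
    """Build a 20-frame training sample with a synthetic shot transition
    roughly centered on frames 10-11, and return it with its pseudo-label.

    clip_a, clip_b: arrays of shape (20, H, W, C) taken from two different shots.
    """
    t = clip_a.shape[0]
    centre = t // 2
    if kind == "hardcut":
        # Abrupt change: first half from clip_a, second half from clip_b.
        sample = np.concatenate([clip_a[:centre], clip_b[centre:]], axis=0)
    elif kind == "dissolve":
        # Linear cross-fade over `trans_len` frames around the centre.
        start = centre - trans_len // 2
        alpha = np.zeros(t, dtype=np.float32)
        alpha[start:start + trans_len] = np.linspace(0.0, 1.0, trans_len)
        alpha[start + trans_len:] = 1.0
        a = alpha[:, None, None, None]  # broadcast over H, W, C
        sample = (1.0 - a) * clip_a + a * clip_b
    else:
        raise ValueError(f"unknown transition kind: {kind}")
    # The transition type serves as the pseudo-label for the pretext task.
    return sample.astype(np.float32), kind
```

Because the labels come from how each sample was synthesized, no human annotation is required for the pretext task.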
The model implemented is based on [5], improving the detection of long smooth transitions by using a larger temporal context. Detected transitions are centered on the 10th and 11th frames of the 20-frame input window.
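A fully convolutional 3D network of this kind can be sketched in Keras as follows. The layer counts, filter sizes, 64x64 input resolution, and five-class output are assumptions for illustration, not the architecture of [5] or of this project:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_shot_transition_net(n_classes=5, in_shape=(20, 64, 64, 3)):
    """Fully convolutional 3D network: no dense layers, so spatial
    pooling is done by a global average over the final feature maps."""
    inp = layers.Input(shape=in_shape)
    x = inp
    for filters in (16, 32, 64):
        # 3D convolutions capture spatio-temporal features across frames.
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        # Pool only spatially, preserving the 20-frame temporal axis.
        x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    # 1x1x1 convolution maps features to per-location class scores.
    x = layers.Conv3D(n_classes, 1, padding="same")(x)
    x = layers.GlobalAveragePooling3D()(x)
    out = layers.Activation("softmax")(x)
    return models.Model(inp, out)
```

Replacing dense layers with a 1x1x1 convolution and global average pooling keeps the network fully convolutional, which simplifies fine-tuning the learned backbone for the downstream scene-classification task.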


© CBLTIC Campus del Baix Llobregat - UPC