Descriptive part
The Analysis, Data Processing and Business Application course is composed of three modules.
Because of the large number of connected objects in our society, it is important to know how
to collect and process data efficiently.
The first module concerns the Semantic Web, in which
we got a glimpse of the importance of making search engines smarter so that they can manage web data as finely as possible.
The second module is about Big Data, in which we had lectures and a project
on the big data approach and the use of the R programming language for statistical data processing.
The third module is about Software Engineering
for which we had an overview of project management tools and methods (Jira, Jenkins...) as well as an introduction to the Agile Scrum method.
On this page you will find all the information about the different modules and the practical work and projects carried out.
I detail the tasks performed and their purpose,
and I highlight the knowledge and skills acquired during each module.
Data Processing and Analysis: Big Data
The Big Data module was new to me, but interesting. During the lectures, I became aware of
the importance of data processing and of the ways collected data can be analyzed for different uses.
To put the concepts introduced into practice, we used the R language and the RStudio IDE.
R is a programming language and free software environment for statistics and data science.
Through tutorials, I was able to familiarize myself with the study of datasets, from the different ways of selecting
the desired data to displaying them on a graph.
You can find below an extract of the "msleep" dataset that we studied during the sessions.
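Alongside that extract, here is a small example of the kind of dplyr/ggplot2 pipeline we practised on msleep; the filtering and the chosen columns are my own illustration, not a specific exercise from the sessions:

```r
library(dplyr)
library(ggplot2)

msleep %>%
  filter(!is.na(vore)) %>%                  # drop animals with an unknown diet category
  select(name, vore, sleep_total) %>%       # keep only the columns of interest
  ggplot(aes(x = vore, y = sleep_total)) +  # one box of total sleep hours per diet
  geom_boxplot()
```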
A notable particularity of R is that explicit iterative loops are used very little: many libraries
offer vectorised functions that carry out the processing one is used to doing with loops in other scripting languages, as the example below shows.
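For example, a summation that would naturally be written as a loop in another scripting language is a single vectorised call in R:

```r
x <- c(1.2, 3.4, 5.6)

# Loop style, as one would write it in many scripting languages:
total <- 0
for (v in x) {
  total <- total + v
}

# Idiomatic R: one vectorised call does the same work
total <- sum(x)

# sapply() similarly replaces many explicit element-by-element loops:
squares <- sapply(x, function(v) v^2)  # equivalent to the vectorised x^2
```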
Software Engineering
The goal of this course (followed in autonomy) was to present the main project management methods and tools
that are useful in software development. We were given an overview of tools like Jenkins, Git and Jira, and
we also saw the Agile Scrum method. This allowed us to apply one or more of these methods to some of the year's projects.
For instance, we were able to use the Jira tool for the Service Oriented Architecture project.
It was interesting to use because, thanks to this tool, we could create user stories, define the tasks for
each user story, set their duration, assign who is in charge of each task, and also plan the sprints.
The main challenge was to get to grips with the tool in a short period of time and to apply this kind of tool to
a software project for the first time. For the SOA project it was a success, since the planned schedule was respected.
Technical part
Data Processing and Analysis: R project
The most challenging part of the Big Data module was the R project, for which we had to select ourselves
a dataset to analyse. My partner and I decided to study the results of the wishes and admissions
on the Parcoursup platform for the graduates of the year 2019. It was interesting to get an overview
of, for example, the most requested fields or the assignment results within the different study programmes
proposed in the Toulouse region. For some analyses, we cross-referenced the 2019 data with the 2018 data; a sketch of this kind of cross-referencing follows the dataset links below.
Datasets are available here and here.
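As an illustration of the cross-referencing mentioned above, here is a minimal sketch of joining the two yearly extracts with dplyr; the file names and the programme and wishes columns are assumptions for illustration, not the actual Parcoursup files:

```r
library(dplyr)

# Hypothetical file and column names, for illustration only
p2018 <- read.csv("parcoursup_2018.csv")
p2019 <- read.csv("parcoursup_2019.csv")

evolution <- p2019 %>%
  inner_join(p2018, by = "programme", suffix = c("_2019", "_2018")) %>%  # match programmes across years
  mutate(wishes_delta = wishes_2019 - wishes_2018) %>%                   # year-on-year change in wishes
  arrange(desc(wishes_delta))                                            # fastest-growing programmes first
```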
From my point of view, it was an opportunity to manipulate R autonomously and to be free to choose the case study.
This project allowed me to develop good practices in selecting charts and keeping the displayed data
consistent.
Moreover, it is important to take a critical look at the relevance of open data. Plotting samples of the data is
an important step when studying a dataset: otherwise, a reader can draw conclusions from a graph that do not correspond to
what its author intended. This point of vigilance gave us a critical view of how we exploited our dataset.
Below is our study report.
Semantic Web
In this course, composed of two lectures and two labs, I was able to discover the Semantic Web and its importance today.
Indeed, machines need to be able to interpret the requests of web users efficiently, given
the multitude of data and the number of users, both of which grow regularly. In this perspective, I was introduced to the
representation of data with the RDF description language, which defines structures as
"triples" (subject + predicate + object).
In addition, I was able to discover and design ontologies (lightweight and heavyweight), which are a formal, explicit specification
of a shared conceptualization.
In the first lab I designed a weather ontology using the Protégé software.
I created different classes, subclasses, properties and individuals to populate my ontology, and then used
a reasoner to deduce the links between individuals.
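The sketch below gives an idea of what such an ontology looks like once exported from Protégé to Turtle; the class and property names are assumptions for illustration, not my actual weather ontology:

```turtle
@prefix :     <http://example.org/weather#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:WeatherPhenomenon a owl:Class .

:Rain a owl:Class ;
    rdfs:subClassOf :WeatherPhenomenon .    # subclass hierarchy

:hasIntensity a owl:DatatypeProperty ;
    rdfs:domain :WeatherPhenomenon .        # property attached to the class

:shower1 a :Rain ;                          # an individual; a reasoner can also
    :hasIntensity "heavy" .                 # infer it is a WeatherPhenomenon
```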
The second lab consisted in building a semantics-aware application. We used our ontology to annotate an open dataset published by the city of Aarhus in Denmark.
The data are collected from temperature sensors and stored in CSV files, which corresponds to 3 stars on the Linked Data scale.
We needed to convert them to 5-star data using the ontology from the first lab.
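A minimal sketch of this conversion step in R could look as follows; the file name, column names and ontology IRIs are assumptions, not the actual Aarhus files from the lab:

```r
# Read the 3-star CSV data (hypothetical file and column names)
obs <- read.csv("aarhus_temperature.csv")

# Build one RDF triple per row, annotated with terms from the weather ontology
triples <- sprintf(
  '<http://example.org/obs/%s> <http://example.org/weather#hasTemperature> "%s" .',
  obs$sensor_id,
  obs$temperature
)

writeLines(triples, "observations.nt")  # N-Triples output, ready to be linked with other data
```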
You can find our report below, which describes the work carried out.
Analytical part
The courses in this UF had no direct links between them. Most of the concepts covered were new to me,
as I had not followed INSA's IT and networks track.
For “Software Engineering”, I already had experience with Git and the Agile method. This course was useful for
the projects of the year, particularly for our innovative project and the SOA project.
For the “Processing Semantic Data” skillset, I absorbed the notions of semantics that are important today,
given the amount of information and data on the web. The labs were particularly informative for deepening
the notion of ontology, which is central to the Semantic Web.
For the “Data Processing and Analysis” class, I was really interested in learning the R language as
a new tool for data analysis. I think we did not have enough time to explore the different ways of graphically
representing the desired data, but I acquired the basic knowledge of the R language.