Systems and Architectures for Big Data (Sistemi e Architetture per Big Data, SABD)

General Information

The course covers a broad range of advanced topics in data-intensive computing, including distributed file systems, NoSQL databases, batch and stream data processing, and resource management.

Learning outcomes

Principles, paradigms, and technologies for designing and managing large-scale systems that process and analyze Big Data.

Credits (CFU)

6 CFU, 60 hours of lectures, 4 hours per week

Prerequisites

The course assumes familiarity with databases, distributed systems, and Cloud computing.

Class schedule

The course is held in the second semester. Schedule valid from 28/2/2022 to 11/6/2022.

Monday, 11:30 to 13:15, room C5, Didattica building
Thursday, 11:30 to 13:15, room B8, Didattica building

Virtual classroom

Microsoft Teams: link

Note: Access to the live-streamed lectures and their recordings is reserved for students enrolled in the course, for study purposes only. Reproducing, publishing, or distributing them, even in part, for any purpose other than individual study is prohibited.
By attending the streamed lectures, students accept that they may be recorded and that their presence may be visible to the other participants and to anyone with access to the recording.

Instructors

Valeria Cardellini
Tel.: 067259 7388
E-mail:

(please include [SABD] in the email subject)
Office: room D1-17, wing D of the "Ingegneria dell'Informazione" building, first floor.
Office hours: in the classroom after lectures, or online (email to arrange a day and time).

Matteo Nardelli
2-CFU supplementary course "Hands-on storage systems and processing frameworks for Big Data"
E-mail: (please include [SABD] in the email subject)
Office hours: online, by appointment via email.

Announcements

29 April 2022 - For the Lab session on Monday 2 May, it is recommended to download the InfluxDB Docker image:
docker pull influxdb:2.0
10 April 2022 - For the Lab session on Monday 11 April, it is recommended to download the HBase and Neo4j Docker images:
docker pull harisekhon/hbase:2.1
docker pull neo4j:4.4.5
25 March 2022 - For the Lab session on Monday 28 March, it is recommended to download the Redis and MongoDB Docker images:
docker pull sickp/alpine-redis
docker pull mongo
20 March 2022 - For the Lab session on Monday 21 March, it is recommended to download the Hadoop and Redis Docker images:
docker pull matnar/hadoop
docker pull sickp/alpine-redis
22 February 2022 - The course starts on Monday 28 February at 11:30 in room C5.

Recommended textbooks

There is no textbook, since the material is based mainly on research papers and on the documentation of Big Data frameworks and tools.
If you would nonetheless like a book, the ones listed below cover a good part of the course topics:

A. Bahga, V. Madisetti, Cloud Computing Solutions Architect: A Hands-On Approach, 2019.
M. Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, O'Reilly, 2017.

Lecture schedule and slides

Date	Topic	Slides	Last modified
28/2/2022	Course organization; Introduction to Big Data	Organization.pdf IntroBD.pdf	1/3/2022 1/3/2022
3/3/2022	Introduction to Big Data: ETL/ELT pipelines, data lakes; Data storage systems: introduction, distributed file systems, GFS	see previous lecture; Storage_DFS.pdf	7/3/2022
7/3/2022	Data storage systems: HDFS, GlusterFS, Alluxio	see previous lecture
10/3/2022	NoSQL data stores: introduction, key-value (KV) stores, document-oriented stores	Storage_NoSQL.pdf	24/3/2022
14/3/2022	NoSQL data stores: column-family and graph databases; Case studies: Dynamo, Bigtable	see previous lecture
17/3/2022	NoSQL data stores: case studies on Bigtable, Google Cloud Bigtable, Cassandra	see lecture of 10/3
21/3/2022	Hands-on Hadoop Distributed File System	Storage_HandsOn_HDFS.pdf outline Docker Image Repository scripts	21/3/2022
24/3/2022	NoSQL data stores: case study on Neo4j, graph algorithms	see lecture of 10/3
28/3/2022	Hands-on NoSQL DB: Redis, MongoDB	Storage_HandsOn_NoSQL_KVD.pdf scaletta.txt scripts_redis.zip script_mongo.zip	28/3/2022
31/3/2022	Introduction to MapReduce and Spark; MapReduce	Intro_Hadoop+Spark.pdf MapReduce&Hadoop.pdf	31/3/2022 7/4/2022
4/4/2022	MapReduce: examples	see lecture of 31/3
7/4/2022	Hadoop; NewSQL databases and time series databases	see lecture of 31/3; Storage_NewSQL&TSDB.pdf	7/4/2022
11/4/2022	Hands-on HBase and Neo4j	Storage_HandsOn_NoSQL_CG.pdf scaletta.txt scripts_hbase-neo4j.zip Code: HBase Client	12/4/2022
14/4/2022	NewSQL databases and time series databases; Apache Spark: introduction	see lecture of 7/4; Spark.pdf	29/4/2022
21/4/2022	Spark: architecture, RDD API, and examples	see lecture of 14/4; Spark_examples_2022-04-21.pdf	29/4/2022
28/4/2022	Spark: examples, persistence, fault tolerance, Spark SQL	see lecture of 14/4
2/5/2022	Hands-on InfluxDB; Hands-on Apache Spark and Spark SQL	Storage_HandsOn_NoSQL_TSDB.pdf scaletta.txt scripts_influxdb.zip; Spark_HandsOn.pdf Code: Hands-On Spark	2/5/2022
5/5/2022	Spark: Datasets and DataFrames, MLlib; Hadoop on AWS EMR; Hadoop ecosystem: Pig, Hive, Impala, Oozie	see lecture of 14/4; HadoopEcosystem.pdf	13/5/2022
9/5/2022	Data acquisition	DataAcquisition.pdf	13/5/2022
12/5/2022	Presentation of the first project; Introduction to Data Stream Processing	Project1_presentation.pdf DSP_Intro.pdf	13/5/2022 16/6/2022
16/5/2022	Introduction to Data Stream Processing; DSP frameworks: Storm, Flink	see lecture of 12/5; DSP_Frameworks.pdf	16/6/2022
19/5/2022	Hands-On Apache Flink	Flink_HandsOn.pdf scaletta.txt Code: Hands-On Flink	19/05/2022
23/5/2022	DSP frameworks: Storm, Heron, Flink	see lecture of 16/5
26/5/2022	DSP frameworks: Flink, Cloud services; Data stream processing: challenges	see lecture of 16/5; DSP_Challenges.pdf	19/6/2022
30/5/2022	Run-time deployment of DSP applications; Resource management systems	DSP_Research.pdf ResourceMgmt.pdf	17/6/2022 16/6/2022
6/6/2022	Distributed and federated ML	Distr+FedML.pdf	19/6/2022
9/6/2022	Hands-on Apache Spark Streaming; Hands-on Apache Kafka and Kafka Streams	SparkStreaming_HandsOn.pdf scaletta.txt Code: Hands-On Spark Streaming; KafkaStreams_HandsOn.pdf scaletta.txt Code: Hands-on Kafka Streams	09/6/2022
9/6/2022	Presentation of the second project	Project2_presentation.pdf	19/6/2022

Papers

The following papers expand on and complement the topics covered in class; reading them is recommended.

When I talk to researchers, when I talk to people wanting to engage in entrepreneurship, I tell them that if you read research papers consistently, if you seriously study half a dozen papers a week and you do that for two years, after those two years you will have learned a lot. This is a fantastic investment in your own long term development. (Andrew Ng, Inside The Mind That Built Google Brain: On Life, Creativity, And Failure)

Introduction to Big Data
Storage: Distributed File Systems
Storage: NoSQL data stores and NewSQL databases
MapReduce and the Hadoop ecosystem
Spark
Data Acquisition
Data Stream Processing
Resource Management
Distributed and Federated Machine Learning

Introduction to Big Data

T.H. Davenport and D.J. Patil, Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review, 2012.
J. Heidrich, A. Trendowicz and C. Ebert, Exploiting Big Data's Benefits, in IEEE Software, 2016. [pdf]
H.V. Jagadish et al., Big Data and Its Technical Challenges, Commun. ACM, vol. 57, no. 7, pp. 86-94, July 2014. [pdf]
J. Wing, The Data Life Cycle, Harvard Data Science Review, 1(1), 2019. [html]

Storage: Distributed File Systems

S. Ghemawat, H. Gobioff, and S.-T. Leung, The Google File System, In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), 2003.
H. Li, Alluxio: A Virtual Distributed File System. University of California Berkeley, Technical Report No. UCB/EECS-2018-29, 2018.

Storage: NoSQL data stores and NewSQL databases

K. Grolinger, W.A. Higashino, A. Tiwari and M. Capretz, Data management in cloud environments: NoSQL and NewSQL data stores, Journal of Cloud Computing, 2013.
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, Dynamo: Amazon's highly available key-value store, In Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07), 2007.
F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, R.E. Gruber, Bigtable: A distributed storage system for structured data, In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), 2006.
A. Lakshman and P. Malik, Cassandra: a decentralized structured storage system, SIGOPS Oper. Syst. Rev., Vol. 44, No. 2, pp. 35-40, 2010.
M. Needham and A. Hodler, Graph Algorithms: Practical Examples in Apache Spark and Neo4j, O'Reilly Media, 2019.
J. Corbett et al., Spanner: Google’s Globally-Distributed Database, Proc. OSDI '12, 2012.
M. Stonebraker and A. Weisberg, The VoltDB main memory DBMS, 2013.

MapReduce and the Hadoop ecosystem

J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI '04), 2004.
J. Leskovec, A. Rajaraman, and J. Ullman, Mining of Massive Datasets 3rd edition, chapter 2, 2020.
A. Thusoo et al., Hive – A petabyte scale data warehouse using Hadoop, In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE’10), 2010.
M. Kornacker et al., Impala: A modern, open-source SQL engine for Hadoop, In Proceedings of CIDR ’15, 2015.

Spark

M. Zaharia et al., Spark: Cluster Computing with Working Sets, In Proceedings of USENIX HotCloud’10, 2010.
M. Zaharia et al., Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, In Proceedings of USENIX NSDI’12, 2012.
M. Zaharia et al., Apache Spark: A Unified Engine For Big Data Processing, In Commun. ACM, Nov. 2016.
M. Armbrust et al., Spark SQL: Relational data processing in Spark, In Proceedings of ACM SIGMOD’15, 2015.

Data Acquisition

J. Kreps, N. Narkhede, J. Rao, Kafka: a Distributed Messaging System for Log Processing, In Proceedings of NetDB '11, 2011.

Data Stream Processing

T. Akidau, Streaming 101: The world beyond batch, 2015.
P. Carbone et al., Apache Flink: Stream and batch processing in a single engine, In Bulletin of IEEE Computer Society Technical Committee on Data Engineering, 2015.
P. Carbone et al., State management in Apache Flink, In Proc. VLDB Endow., 2017.
P. Carbone et al., Beyond Analytics: The Evolution of Stream Processing Systems, ACM SIGMOD '20, 2020.
M. Hirzel, R. Soulé, S. Schneider, B. Gedik, R. Grimm, A catalog of stream processing optimizations, ACM Comput. Surv., 2014.
V. Cardellini, F. Lo Presti, M. Nardelli, G. Russo Russo, Run-time adaptation of data stream processing systems: The state of the art, ACM Computing Surveys, 2022.

Resource Management

A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant resource fairness: fair allocation of multiple resource types, In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), 2011.
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, Mesos: a platform for fine-grained resource sharing in the data center, In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), 2011.

Distributed and Federated Machine Learning

R. Mayer et al., Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools, ACM Computing Surveys, 2020.
J. Verbraeken et al., A Survey on Distributed Machine Learning, ACM Computing Surveys, 2020.
B. McMahan and D. Ramage, Federated Learning: Collaborative Machine Learning without Centralized Training Data, Google AI blog, 2017.
P. Kairouz et al., Advances and Open Problems in Federated Learning, Foundations and Trends in Machine Learning, Vol. 14, No. 1–2, 2021.

Videos

Video presentations useful for exploring some of the course topics in more depth.

Data for Good: Ensuring the Responsible Use of Data to Benefit Society, Jeannette Wing
From BI to Big Data - Architecture, Ethics, and Economics, Barry Devlin
NoSQL Distilled to an hour, Martin Fowler
Making Big Data Processing Simple with Spark, Matei Zaharia
Building and Running Distributed Systems using Apache Mesos, Benjamin Hindman (one of the creators of Apache Mesos)

Preliminary syllabus

Introduction to Big Data: issues and challenges
Data storage: distributed file systems, NoSQL data stores, NewSQL databases, time series databases
- Case studies: HDFS, Dynamo, Redis, MongoDB, Bigtable, HBase, Cassandra, and Neo4j
- Lab: HDFS, Redis, MongoDB, HBase, Neo4j
Systems for data acquisition and ingestion: pub/sub, message queues, collection systems
- Case studies: Kafka, Flume, and Sqoop
Systems for batch processing
- Case studies: Hadoop, Pig, Hive, Spark, and Spark SQL
- Lab: Hadoop, Spark, Spark SQL
Systems for stream processing
- Case studies: Spark Streaming, Storm, Flink, Heron, Samza
- Lab: Storm, Spark Streaming, Kafka Streams
Batch and stream processing in the Cloud
Frameworks for cluster resource management
- Case study: Mesos
Fog and edge computing for data processing and analytics

Exams

Exam format
Exam sessions

Exam format

2 projects assigned during the course.
Projects should preferably be carried out in groups of 2 or 3 students (individually only if a group is not possible).
Oral exam on the entire course syllabus.

Exam sessions

First exam date, summer session

Second exam date, summer session

First exam date, autumn session
Second exam date, autumn session
First exam date, winter session
Second exam date, winter session