Systems and Architectures for Big Data (Sistemi e Architetture per Big Data)

General Information

The course covers a broad range of advanced topics in data-intensive computing, including distributed file systems, NoSQL databases, batch and stream data processing, and systems for distributed machine learning.

Course page on DidatticaWeb
Microsoft Teams link

Credits (CFU)

6 CFU (university credits), 60 hours of lectures, 4 hours per week.

Learning outcomes

Principles, paradigms, and technologies for designing and managing large-scale systems that process and analyze Big Data.

Prerequisites

The course assumes knowledge of databases, distributed systems, and Cloud computing.

Lecture delivery

Lectures are held exclusively in person.
The virtual class on Teams is used to share course material and announcements.

Lecture schedule

Schedule valid from 2 March 2026 to 12 June 2026 (second semester)

Monday, 11:30 to 13:15, room C5, Didattica building
Thursday, 11:30 to 13:15, room B8, Didattica building

Instructors

Valeria Cardellini
Tel.: 067259 7510
E-mail: (the subject line must include [SABD])
Office: room D1-17, wing D of the "Ingegneria dell'Informazione" building, first floor.
Office hours: in the classroom after lectures, or by appointment (email to arrange a day and time).

Matteo Nardelli
E-mail: (the subject line must include [SABD])
Office hours: online, to be arranged by email.

Recommended textbooks

There is no textbook, since the material is based mainly on scientific papers and on the documentation of Big Data frameworks and tools.
If you would like a textbook anyway, the following cover a good part of the topics addressed in the course:

M. Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, O'Reilly, 2017.
J. Reis, M. Housley, Fundamentals of Data Engineering: Plan and Build Robust Data Systems 1st Edition, O'Reilly, 2022.

Lecture slides

Topic	Slides	Last modified (d/m/yyyy)
Course organization	Organization.pdf	7/3/2026
Introduction to Big Data	IntroBD.pdf	7/3/2026
Data storage systems	Storage_DFS.pdf	9/3/2026
NoSQL data stores	Storage_NoSQL.pdf	3/4/2026
Hands-on Hadoop Distributed File System	Storage_HandsOn_HDFS.pdf scaletta scripts_hdfs.zip Docker Image Repository	13/3/2026
Hands-on NoSQL: Key-value - Redis	Storage_HandsOn_NoSQL_KV.pdf scaletta.txt scripts_redis.zip	13/3/2026
Hands-on NoSQL: Document-oriented - MongoDB	Storage_HandsOn_NoSQL_Doc.pdf scaletta.txt scripts_mongo.zip	27/3/2026
Hands-on NoSQL: Column-oriented - HBase	Storage_HandsOn_NoSQL_Column.pdf scaletta.txt scripts_hbase.zip Code: HBase Client	27/3/2026
NewSQL databases and time series databases	Storage_NewSQL&TSDB.pdf	3/4/2026
Hands-on NoSQL: Graph - Neo4j	Storage_HandsOn_NoSQL_Graph.pdf scaletta.txt scripts_neo4j.zip	7/4/2026
Hands-on NoSQL: InfluxDB	Storage_HandsOn_NoSQL_TSDB.pdf scaletta-influx2.txt scaletta-influx3.txt scripts_influxdb2.zip	7/4/2026
Introduction to data processing frameworks	Intro_DataProcessing.pdf	14/4/2026
MapReduce and Apache Hadoop	MapReduce&Hadoop.pdf	16/4/2026
Apache Spark	Spark.pdf Spark_examples.zip	4/5/2026
Hands-on Apache Spark and Spark SQL	HandsOn_Spark.pdf Code: Hands-On Spark	24/4/2026
Hands-on NewSQL: CockroachDB	Storage_HandsOn_NewSQL_CockroachDB.pdf	24/4/2026
Data ingestion	DataIngestion.pdf	12/5/2026
Introduction to Data Stream Processing	DSP_Intro.pdf	6/6/2026
DSP frameworks	DSP_Frameworks.pdf	7/6/2026
Hands-On Apache Flink	Flink_HandsOn.pdf scaletta.txt Code: Hands-On Flink	19/5/2026
Hands-on Apache Kafka and Kafka Streams	KafkaStreams_HandsOn.pdf scaletta.txt Code: Hands-on Kafka Streams	22/5/2026
Hands-on Apache Spark Streaming	SparkStreaming_HandsOn.pdf scaletta.txt Code: Hands-On Spark Streaming	22/5/2026
Data stream processing: challenges and solutions	DSP_Challenges.pdf	13/6/2026
Distributed ML and LLMs	DistrML.pdf	13/6/2026
Introduction to vector databases	VectorDB.pdf	13/6/2026

Papers

The following papers expand on and complement the topics covered in class; reading them is recommended.

When I talk to researchers, when I talk to people wanting to engage in entrepreneurship, I tell them that if you read research papers consistently, if you seriously study half a dozen papers a week and you do that for two years, after those two years you will have learned a lot. This is a fantastic investment in your own long term development. (Andrew Ng, Inside The Mind That Built Google Brain: On Life, Creativity, And Failure)

Introduction to Big Data
Storage: Distributed File Systems and Object Stores
Storage: NoSQL Data Stores and NewSQL Databases
Batch Processing: MapReduce and Spark
Data Acquisition
Data Stream Processing
Distributed Machine Learning and LLMs
Vector Databases

Introduction to Big Data

J. Wing, The Data Life Cycle, Harvard Data Science Review, 1(1), 2019.
M. Armbrust, A. Ghodsi, R. Xin, M. Zaharia, Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics, CIDR '21, 2021.

Storage: Distributed File Systems and Object Stores

S. Ghemawat, H. Gobioff, and S.-T. Leung, The Google File System, In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), 2003.
D. Hildebrand and D. Serenyi, Colossus under the hood: a peek into Google’s scalable storage system, 2021.
S. Noghabi et al., Ambry: LinkedIn's Scalable Geo-Distributed Object Store, Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16), 2016.

Storage: NoSQL Data Stores and NewSQL Databases

M. Fowler, NoSQL Databases, 2019.
G. DeCandia et al., Dynamo: Amazon's highly available key-value store, In Proc. of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07), 2007.
F. Chang et al., Bigtable: A distributed storage system for structured data, In Proc. of OSDI '06, 2006.
A. Lakshman and P. Malik, Cassandra: a decentralized structured storage system, SIGOPS Oper. Syst. Rev., Vol. 44, No. 2, pp. 35-40, 2010.
M. Elhemali et al., Amazon DynamoDB: A scalable, predictably performant, and fully managed NoSQL database service, Proc. of USENIX ATC '22, 2022.
J. Idziorek et al., Distributed transactions at scale in Amazon DynamoDB, Proc. of USENIX ATC '23, 2023.
A. Pavlo and M. Aslett, What’s really new with NewSQL?, SIGMOD Rec. 45, 2016.
J. Corbett et al., Spanner: Google’s Globally-Distributed Database, Proc. of OSDI '12, 2012.
M. Stonebraker and A. Weisberg, The VoltDB main memory DBMS, 2013.

Batch Processing: MapReduce and Spark

J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, In Proc. of OSDI '04, 2004.
J. Leskovec, A. Rajaraman, and J. Ullman, Mining of Massive Datasets 3rd edition, chapter 2, 2020.
M. Zaharia et al., Spark: Cluster Computing with Working Sets, In Proc. of USENIX HotCloud’10, 2010.
M. Zaharia et al., Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, In Proc. of USENIX NSDI’12, 2012.
M. Zaharia et al., Apache Spark: A Unified Engine For Big Data Processing, In Commun. ACM, 2016.
M. Armbrust et al., Spark SQL: Relational data processing in Spark, In Proceedings of ACM SIGMOD’15, 2015.

Data Acquisition

J. Kreps et al., Kafka: a Distributed Messaging System for Log Processing, In Proceedings of NetDB '11, 2011.

Data Stream Processing

T. Akidau, Streaming 101: The world beyond batch, 2015.
A. Margara et al., A Model and Survey of Distributed Data-Intensive Systems, ACM Comput. Surv., 2023.
M. Fragkoulis et al., A Survey on the Evolution of Stream Processing Systems, VLDB J., 2024.
P. Carbone et al., Apache Flink: Stream and batch processing in a single engine, In Bulletin of IEEE Computer Society Technical Committee on Data Engineering, 2015.
P. Carbone et al., State management in Apache Flink, In Proc. VLDB Endow., 2017.
M. Hirzel, R. Soulé, S. Schneider, B. Gedik, R. Grimm, A catalog of stream processing optimizations, ACM Comput. Surv., 2014.
V. Cardellini, F. Lo Presti, M. Nardelli, G. Russo Russo, Run-time adaptation of data stream processing systems: The state of the art, ACM Comput. Surv., 2022.

Distributed Machine Learning and LLMs

R. Mayer et al., Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools, ACM Comput. Surv., 2020.
J. Verbraeken et al., A Survey on Distributed Machine Learning, ACM Comput. Surv., 2020.
M. Li et al., Scaling Distributed Machine Learning with the Parameter Server, OSDI '14, 2014.
Zeng et al., Distributed training of large language models: A survey, Natural Language Processing J., 2025.

Vector Databases

Pan et al., Survey of vector database management systems, The VLDB Journal, 2024.
Aumüller and Ceccarello, Recent Approaches and Trends in Approximate Nearest Neighbor Search, Bulletin of the Technical Committee on Data Engineering, 2023.

Videos

Video presentations useful for exploring some course topics in more depth.

Syllabus

Introduction to Big Data: issues and challenges
Data storage: distributed file systems, NoSQL data stores, NewSQL databases, time series databases, vector databases
- Case studies: GFS, HDFS, Ozone, Ambry, Dynamo, Bigtable, Cassandra, DynamoDB, Spanner, VoltDB
- Lab: HDFS, Redis, MongoDB, HBase, Neo4j, InfluxDB, CockroachDB
Systems for data ingestion and data pipeline management: pub/sub, dataflow and workflow management
- Case studies: Kafka, Pulsar, NiFi, Airflow
Systems for batch processing
- Case studies: Hadoop, Spark, Spark SQL, Spark MLlib
- Lab: Spark, Spark SQL
Systems for stream processing
- Case studies: Storm, Flink
- Lab: Flink, Spark Streaming, Kafka Streams
- Challenges in DSP
Batch and stream processing in the Cloud
Distributed machine learning and LLMs
- Case studies: Spark MLlib, TensorFlow distributed

Exams

Exam format
Exam sessions

Exam format

Two projects are assigned during the course: the first on batch processing, the second on data stream processing.
Projects should preferably be carried out in groups of 2 or 3 students (individually only if no group can be formed).
- First project 2025/26: Analysis of air transport data with Apache Spark
- Second project 2025/26: Real-time analysis of air transport data with Apache Flink
The project assignments are published on the course Teams channel.
- Discussion of the first project: 11-15 June 2026, according to the schedule available on Teams.
- Discussion of the second project: 10-17 July 2026 (to be confirmed).
Oral exam on the course syllabus.

Exam sessions

First exam date, summer session
Oral exam: Wednesday 24 June 2026, 14:00, room B10
Second exam date, summer session
Oral exam: Thursday 16 July 2026, 9:30, room B10
First exam date, autumn session

Second exam date, autumn session

First exam date, winter session
Second exam date, winter session