Fabien Gaud

Improving performance of a high-performance distributed NFS server

Coho Data sold a high-performance NFS appliance that came in two architectures: a hybrid architecture (SSDs + spinning disks) and an all-flash array (NVMe flash + SSDs).

My main role was to work on the performance of the NFS server. In particular, I worked on data replication and striping, wrote several performance analysis tools, and implemented a container for small files that cut metadata operation latency by 50% and reduced rebalancing time from hours to minutes. I also designed a new cluster membership and distributed locking mechanism, redesigned the snapshot mechanism, and implemented data deduplication.
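The small-file container is not described in detail here, so the following is only a minimal Python sketch of the general idea, assuming a simple append-only layout. Many small files are packed into one container object with an in-memory index that maps each file name to an offset and a length, so the storage layer tracks one object instead of one per file. The class name, its methods, and the compaction policy are hypothetical and do not describe Coho Data's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class SmallFileContainer:
    """Hypothetical container that packs many small files into one object."""

    # Concatenated payloads of every file stored in this container.
    data: bytearray = field(default_factory=bytearray)
    # In-memory index: file name -> (offset into data, length in bytes).
    index: dict = field(default_factory=dict)

    def put(self, name: str, payload: bytes) -> None:
        # Append-only: overwriting a file appends a new copy and repoints the
        # index; the stale bytes stay until compact() runs.
        self.index[name] = (len(self.data), len(payload))
        self.data.extend(payload)

    def get(self, name: str) -> bytes:
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])

    def delete(self, name: str) -> None:
        # Only the index entry is dropped; the bytes are reclaimed by compact().
        del self.index[name]

    def live_bytes(self) -> int:
        # Bytes still referenced by the index (everything else is garbage).
        return sum(length for _, length in self.index.values())

    def compact(self) -> None:
        # Rewrite only live payloads, dropping overwritten and deleted bytes.
        new_data = bytearray()
        new_index = {}
        for name, (offset, length) in self.index.items():
            new_index[name] = (len(new_data), length)
            new_data.extend(self.data[offset:offset + length])
        self.data, self.index = new_data, new_index
```

With this layout, moving a container during rebalancing copies one object rather than thousands of tiny ones, which is one plausible way such a design shortens rebalancing; this is an illustration, not a measured result.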

Improving large page support on NUMA multicore architectures

Large pages are commonly used by workloads with large memory footprints to reduce the impact of TLB misses on application performance. However, on modern NUMA multicore systems some care must be taken, since increasing the page size can produce undesirable NUMA effects that severely degrade performance. We identified two types of problems that can occur when enabling large pages on NUMA systems: hot pages and page-level false sharing. We designed a new large page management algorithm that detects and fixes these NUMA effects while preserving, whenever possible, the performance benefits of large pages.

Code is available here.
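To make the two problems concrete, here is a hedged Python sketch of how sampled memory accesses could be classified per large page. It assumes hypothetical input: (page id, NUMA node, offset within the page) samples, 2 MiB large pages, and 4 KiB base pages. A page is flagged as hot when it is accessed from several nodes and receives a large share of all samples (the hypothetical `hot_share` threshold), and as page-level false sharing when several nodes access it but each 4 KiB sub-page is touched by only one node. The labels, threshold, and remedies are illustrative only, not the published algorithm.

```python
from collections import defaultdict

LARGE_PAGE = 2 * 1024 * 1024  # 2 MiB large page (assumed, x86-64 size)
SMALL_PAGE = 4 * 1024         # 4 KiB base page (assumed)

# Illustrative remedy for each label.
REMEDY = {
    "local": "keep the large page",
    "shared": "keep the large page (its 4 KiB sub-pages are truly shared)",
    "hot": "split into 4 KiB pages and spread them across nodes",
    "false_sharing": "split into 4 KiB pages and move each one to its user",
}


def classify_large_pages(samples, hot_share=0.25):
    """Label each sampled large page.

    samples: list of (page_id, node, offset_in_page) access samples.
    hot_share: hypothetical fraction of all samples above which a page
    accessed from several nodes is considered hot.
    """
    by_page = defaultdict(list)
    for page, node, offset in samples:
        if not 0 <= offset < LARGE_PAGE:
            raise ValueError(f"offset {offset} outside a large page")
        by_page[page].append((node, offset))

    total = len(samples)
    labels = {}
    for page, accesses in by_page.items():
        nodes = {node for node, _ in accesses}
        if len(nodes) == 1:
            labels[page] = "local"  # only one node uses it: no NUMA issue
        elif len(accesses) / total >= hot_share:
            labels[page] = "hot"  # one memory controller gets hammered
        else:
            # Which nodes touch each 4 KiB sub-page of this large page?
            users = defaultdict(set)
            for node, offset in accesses:
                users[offset // SMALL_PAGE].add(node)
            if all(len(u) == 1 for u in users.values()):
                labels[page] = "false_sharing"  # nodes use disjoint sub-pages
            else:
                labels[page] = "shared"
    return labels
```

Splitting is reserved for the two problematic labels, so pages without NUMA issues keep their TLB benefit.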

Managing resource contention on modern multicore architectures

In modern multicore architectures, cores share several resources such as caches, interconnects and memory controllers. If not taken into account properly, this sharing can significantly hurt application performance.

My main project was to develop a new memory management algorithm to mitigate memory controller and interconnect congestion on NUMA multicore machines. While previous work treated data locality as the most important optimization, we found that reducing memory controller and interconnect congestion was actually more important. To reduce congestion, we combined three techniques (migrating memory, evenly distributing memory across memory nodes, and replicating memory) that balance the load on memory controllers while keeping data as local as possible. Our experiments showed that this strategy yields large performance improvements.
The code of our novel memory management algorithm, Carrefour, is available here.
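The interplay of the three techniques can be illustrated with a per-page decision function. The Python sketch below is a hypothetical heuristic in the spirit of the description above, not Carrefour's actual code: it measures memory controller imbalance as the coefficient of variation of per-controller request counts, replicates pages that are shared but almost never written, interleaves written shared pages when controllers are imbalanced, and otherwise migrates a page toward the node that uses it most. The thresholds and the sample format are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def controller_imbalance(requests_per_controller):
    """Coefficient of variation of per-controller request counts (0 = balanced)."""
    m = mean(requests_per_controller)
    return pstdev(requests_per_controller) / m if m else 0.0


def plan_page(accesses, home_node, imbalance,
              imbalance_threshold=0.35, write_threshold=0.05):
    """Choose an action for one page (hypothetical heuristic and thresholds).

    accesses: list of (node, is_write) samples for this page.
    home_node: node whose memory currently holds the page.
    imbalance: value of controller_imbalance() for the whole machine.
    Returns (action, target).
    """
    if not accesses:
        return ("keep", home_node)
    per_node = Counter(node for node, _ in accesses)
    write_ratio = sum(1 for _, is_write in accesses if is_write) / len(accesses)

    if len(per_node) > 1 and write_ratio <= write_threshold:
        # Shared and read-mostly: give every user node its own copy.
        return ("replicate", sorted(per_node))
    if len(per_node) > 1 and imbalance >= imbalance_threshold:
        # Shared, written, and controllers are congested: spread the load.
        return ("interleave", None)  # caller picks the next node round-robin
    # Otherwise favor locality: move the page to its heaviest user.
    top_node, _ = per_node.most_common(1)[0]
    if top_node == home_node:
        return ("keep", home_node)
    return ("migrate", top_node)
```

Replication is restricted to read-mostly pages because every write to a replicated page would have to update all copies.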

In another project, we built a new scheduler that takes last-level cache contention into account when distributing applications across a cluster of multicore machines. To do so, we used a machine-learning-based approach to accurately predict the performance degradation two applications will suffer if they are collocated on the same machine. With this knowledge, our algorithm distributes applications efficiently across cluster nodes, noticeably improving application performance and reducing energy consumption.
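Once a model can predict pairwise degradation, placement becomes an optimization problem. The Python sketch below assumes a hypothetical `slowdown(a, b)` predictor (standing in for the machine-learning model, which is not shown) and exactly two applications per machine, and picks the pairing that minimizes the total predicted slowdown by exhaustive search. It illustrates how predictions can drive placement; it is not the scheduler from the project.

```python
from functools import lru_cache


def best_pairing(apps, slowdown):
    """Pair applications two per machine, minimizing total predicted slowdown.

    slowdown(a, b): hypothetical predictor returning the combined fractional
    slowdown of a and b when they share a machine's last-level cache.
    Exhaustive search with memoization: only practical for small batches.
    Returns (total_cost, tuple_of_pairs).
    """
    apps = tuple(sorted(apps))
    if len(apps) % 2:
        raise ValueError("expects an even number of applications")

    @lru_cache(maxsize=None)
    def solve(left):
        # left: sorted tuple of applications not yet placed on a machine.
        if not left:
            return 0.0, ()
        first, rest = left[0], left[1:]
        best = None
        # Try every partner for the first application, recurse on the rest.
        for i, partner in enumerate(rest):
            cost, pairs = solve(rest[:i] + rest[i + 1:])
            cost += slowdown(first, partner)
            if best is None or cost < best[0]:
                best = (cost, ((first, partner),) + pairs)
        return best

    return solve(apps)
```

A greedy "cheapest pair first" strategy is simpler but can be far from optimal: pairing the two most compatible applications first may leave the two most contention-sensitive ones stuck together on the last machine.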

Scaling the Apache Web server on NUMA multicore architectures

Apache is the most widely used Web server, so it is important that it scales well on now-mainstream NUMA multicore systems. Using the well-known SPECweb2005 Support workload, we found that this was not the case on a 4-node NUMA system. We proposed three optimizations that address both software and hardware issues. With these optimizations, Apache achieved ideal performance scalability on a 4-node (16-core) server.

Improving support for event-driven applications on multicore architectures

On single-core architectures, event-driven programming was a popular way to build robust servers. Libasync-mp was the only available runtime that let event-driven applications take advantage of multicore architectures with minimal effort from the programmer. Like many runtimes, Libasync-mp relies on a work-stealing algorithm to balance load across cores. We identified several aspects of this algorithm that limit application performance. In particular, we showed that exploiting the specific characteristics of event-driven applications when stealing work, and carefully designing the runtime to reduce the overhead of stealing, yields large performance improvements for data servers.

We produced Mely, a new runtime that is backward-compatible with Libasync-mp. This runtime is available here.
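The paragraph above does not spell out the stealing heuristics, so the following Python sketch only illustrates the kind of event-driven specificity a stealer can exploit. Libasync-mp lets programmers assign colors to events, and events of the same color never run concurrently, so a thief must take all queued events of one color at once and must not take the color currently running on the victim. The victim order (same NUMA node first, then most loaded), the `Core` class, and the `min_batch` threshold are hypothetical choices, not Mely's actual policy.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Core:
    cid: int
    node: int                      # NUMA node of this core
    running: Optional[int] = None  # color of the event currently executing
    # Per-color FIFO queues; all events of a color live on a single core.
    queues: dict = field(default_factory=lambda: defaultdict(deque))

    def pending(self) -> int:
        return sum(len(q) for q in self.queues.values())


def steal(thief, cores, min_batch=2):
    """Move every queued event of one color from a victim core to `thief`.

    Hypothetical policy: try victims on the thief's NUMA node first, then
    the most loaded ones; never take the color the victim is running (that
    would let two events of one color run concurrently); only steal colors
    with at least `min_batch` events so the migration cost is amortized.
    Returns (victim_id, color), or None if nothing was stolen.
    """
    victims = [c for c in cores if c is not thief and c.pending() > 0]
    victims.sort(key=lambda c: (c.node != thief.node, -c.pending(), c.cid))
    for victim in victims:
        candidates = [(len(q), color) for color, q in victim.queues.items()
                      if color != victim.running and len(q) >= min_batch]
        if not candidates:
            continue
        _, color = max(candidates)  # largest stealable color
        thief.queues[color].extend(victim.queues.pop(color))
        return victim.cid, color
    return None
```

Stealing a whole color keeps the serialization guarantee intact, and the batch threshold avoids moving a single short event whose execution would cost less than the steal itself.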

Publications

The Linux Scheduler: A Decade of Wasted Cores.
Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, Alexandra Fedorova and Vivien Quéma. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys), London, United Kingdom, April 2016.
[pdf] [slides]

Challenges of memory management on modern NUMA systems.
Fabien Gaud, Baptiste Lepers, Justin Funston, Mohammad Dashti, Alexandra Fedorova, Vivien Quéma, Renaud Lachaize, and Mark Roth. Communications of the ACM, vol. 58, no. 12, pp. 59-66, December 2015.
[pdf]

Large Pages May Be Harmful on NUMA Systems.
Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova and Vivien Quéma. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC), Philadelphia, USA, June 2014.
[pdf] [slides]

Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems.
Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quéma, and Mark Roth. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, USA, March 2013.
[pdf] [slides]

A Practical Method for Estimating Performance Degradation on Multicore Processors and its Application to HPC Workloads.
Tyler Dwyer, Alexandra Fedorova, Sergey Blagodurov, Mark Roth, Fabien Gaud and Jian Pei. In Proceedings of the Supercomputing Conference (SC), 2012.
[pdf]

Efficient Workstealing for Multicore Event-Driven Systems.
Fabien Gaud, Sylvain Genevès, Renaud Lachaize, Baptiste Lepers, Fabien Mottet, Gilles Muller, and Vivien Quéma. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), pages 516-525, Genoa, Italy, June 2010.
[pdf] [slides]

National Conferences (French)

Optimisations applicatives pour multi-cœurs NUMA : un cas d’étude avec le serveur web Apache [Application-level optimizations for NUMA multicores: a case study with the Apache web server].
Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Gilles Muller, and Vivien Quéma. In Proceedings of Conférence Française en Systèmes d’Exploitation (CFSE'8), Saint-Malo, France, May 2011.
[pdf] [slides]

Vol de tâches efficace pour systèmes événementiels multi-coeurs [Efficient work stealing for multicore event-driven systems].
Fabien Gaud, Sylvain Genevès, and Fabien Mottet. In Proceedings of Conférence Française en Systèmes d’Exploitation (CFSE’7), Toulouse, France, September 2009.
[pdf] [slides]

Technical reports

Application-Level Optimizations on NUMA Multicore Architectures: The Apache Case Study.
Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Gilles Muller, and Vivien Quéma. Research report RR-LIG-011, LIG, Grenoble, France, March 2011.
[pdf]

Thesis (French)

Étude et amélioration de la performance des serveurs de données pour les architectures multi-cœurs [Study and improvement of the performance of data servers on multicore architectures].
Fabien Gaud. PhD Thesis, Grenoble University, France, 2010.
[pdf] [slides (English)] [tel]

Gestion autonome de flots d'exécution événementiels [Autonomous management of event-driven execution flows].
Fabien Gaud. Master's thesis, Université Joseph Fourier, Grenoble, France, 2007.
[pdf] [slides]
