Monday, February 25
 

8:00am EST

Continental Breakfast
Monday February 25, 2019 8:00am - 9:00am EST
Grand Ballroom Foyer

9:00am EST

Morning Tutorial 1: Understanding Large Scale Storage Systems
This tutorial is oriented toward administrators and developers who manage and use large-scale storage systems. An important goal of the tutorial is to give the audience the foundation for effectively comparing different storage system options, as well as a better understanding of the systems they already have.

Cluster-based parallel storage technologies are used to manage millions of files, thousands of concurrent jobs, and performance that scales from 10s to 100s of GB/sec. This tutorial will examine current state-of-the-art high-performance file systems and the underlying technologies employed to deliver scalable performance across a range of scientific and industrial applications.

The tutorial starts with a look at storage devices including traditional hard drives, SSD, and new non-volatile memory devices. Next, we look at how a file system is put together, comparing and contrasting SAN file systems, scale-out NAS, object-based parallel file systems, and cloud-based storage systems.

Topics include:
  • SSD technology
  • NVRAM
  • Scaling the data path
  • Scaling metadata
  • Fault tolerance
  • Manageability
  • Cloud storage

Speakers
BW

Brent Welch

Google
Brent Welch is a senior staff software engineer at Google, where he works on their public cloud system. He was Chief Technology Officer at Panasas and has also worked at Xerox-PARC and Sun Microsystems Laboratories. Brent has experience building software systems from the device driver... Read More →


Monday February 25, 2019 9:00am - 12:30pm EST
Constitution Ballroom A

9:00am EST

Morning Tutorial 2: Blockchain and Storage
This tutorial will cover the basics of blockchain, the issues blockchain has concerning storage, database usage with blockchain, and solutions to those storage issues.

Topics include:
  • The Basics of Blockchain
  • Storage Issues for Blockchain
  • Using Databases for offchain storage
  • Blockchain deployment & Backup/Recovery

Speakers
MA

Mike Ault

IBM
Mike Ault began work in the nuclear navy and moved into the civilian nuclear field in 1979. He has been working with computers and databases since 1980. In 1990 Mike started working with the Oracle database system. Mike has worked with flash systems since 2007 when he began consulting... Read More →


Monday February 25, 2019 9:00am - 12:30pm EST
Constitution Ballroom B

9:00am EST

Introduction to Storage for Containers
Containers and related technologies make it possible to manage computational resources at fine granularity and to increase the pace of software development, testing, and deployment, while at the same time improving the efficiency of infrastructure utilization. Recognizing these benefits, many enterprises are upgrading their technology by incorporating containers in their infrastructure and workflows.

As containerization technologies enter the enterprise market, they meet new functional demands. Providing and managing persistent, highly-available, yet nimble storage is a particularly important requirement. A number of new and existing companies and open-source projects are aggressively entering this arena. We expect that in the coming years the demand for professionals who are fluent in storage for containers will rise dramatically.

In our tutorial we plan to cover all major topics of storage for containers. We will first describe the structure of Docker's layered images, its local CoW-based storage, and Docker registry. We will then present the concept of persistent volumes and dynamic provisioning in Kubernetes. As part of the tutorial, we will use the insights and examples that we accumulated while working on adapting IBM's Spectrum Scale for containerization environments.

Speakers
VT

Vasily Tarasov

IBM Research
Vasily Tarasov is a Research Staff Member at IBM. His current research projects include storage for containers and high-performance file systems as a service. Vasily worked extensively on storage, file systems, data deduplication, performance and workload analysis. Vasily is an author... Read More →
DS

Dimitris Skourtis

IBM Research
Dimitris Skourtis is a Research Staff Member at IBM. His current work is around cloud orchestrators and persistent storage for containers. Prior to IBM, he worked on resource management at VMware, where he prototyped and shipped SIOCv2, a policy-driven storage scheduling solution... Read More →
TA

Ted Anderson

IBM Research
Ted Anderson is a Senior Software Engineer with IBM Research. Ted has extensive experience with several distributed file systems, most recently Spectrum Scale/GPFS. His recent work utilizes concurrency, caching, and delegation that guarantee correctness using distributed coherency... Read More →
AA

Ali Anwar

IBM Research
Ali Anwar is a research staff member at IBM Research. He received his Ph.D. in Computer Science from Virginia Tech. In his earlier years he worked as a tools developer (GNU GDB) at Mentor Graphics. Ali's research interests are in distributed computing systems, cloud storage management... Read More →


Monday February 25, 2019 9:00am - 12:30pm EST
Commonwealth Room

10:30am EST

Break with Refreshments
Monday February 25, 2019 10:30am - 11:00am EST
Grand Ballroom Foyer

12:30pm EST

Tutorial Luncheon
Monday February 25, 2019 12:30pm - 1:30pm EST
Back Bay Ballroom

1:30pm EST

Managed File Services in the Cloud: What to Use, Where, and Why?
This tutorial is targeted towards administrators and developers who would like to understand the latest developments in cloud-based managed file services. We'll start off with an overview of available offerings; we'll cover intended use cases and the pros and cons of each of the offerings, and we'll make a comparison with self-managed offerings. We'll also talk about what tools are available to move on-premises file storage solutions to the cloud.

Speakers
JS

Jacob Strauss

Amazon Web Services
Jacob Strauss is a Principal Engineer at Amazon Web Services, currently working on the Amazon Elastic File System. He has been at AWS since 2013 building distributed systems under the guise of storage services, and growing Amazon's Boston-area engineering teams. He received PhD and... Read More →
GJ

Geert Jansen

Amazon Web Services
Geert Jansen is a Senior Product Manager at Amazon Web Services where he works on Amazon EFS. He was Product Owner for Red Hat CloudForms and also worked at Ravello Systems and Royal Dutch Shell. He received an M.Sc. in Applied Physics from the Eindhoven University of Technology... Read More →


Monday February 25, 2019 1:30pm - 3:00pm EST
Commonwealth Room

1:30pm EST

Afternoon Tutorial 1: Advanced Persistent Memory Programming
Persistent Memory (“PM”) support is becoming ubiquitous in today’s operating systems and computing platforms. From Windows to Linux to open source, and from NVDIMM, PCI Express, storage-attached and network-attached interconnect access, it is available broadly across the industry. Its byte-addressability and ultra-low latency, combined with its durability, promise a revolution in storage and applications as they evolve to take advantage of these new platform capabilities.

The tutorial explores the concepts and today’s programming methodologies for PM, including the SNIA NonVolatile Memory Programming Model architecture, open source and native APIs, operating system support for PM such as direct access filesystems, and via language and compiler approaches. The software PM landscape is already rich and growing.

Additionally, the tutorial will explore the considerations when PM access is extended across fabrics such as networks, I/O interconnects, and other non-local access. While the programming paradigms remain common, the implications on latency, protocols, and especially error recovery are critically important to both performance and correctness. Understanding these requirements is of interest to both the system and application developer or designer.

Specific programming examples, fully functional on today’s systems, will be shown and analyzed. Concepts for moving new applications and storage paradigms to PM will be motivated and explored. Application developers, system software developers, and network system designers will all benefit. Anyone interested in an in-depth introduction to PM in emerging software and hardware systems can also expect an illuminating and thought-provoking experience.

Topics include:
  • Persistent Memory
  • Persistent Memory Technologies
  • Remote Persistent Memory
  • Programming Interfaces
  • Operating Systems
  • Open Source Libraries
  • RDMA

Speakers
TT

Tom Talpey

Microsoft
Tom Talpey is an Architect in the Networking team at Microsoft Corporation in the Windows Devices Group. His current areas of focus include RDMA networking, remote filesharing, and persistent memory. He is especially active in bringing all three together into a new ultra-low-latency... Read More →
AR

Andy Rudoff

Intel
Andy Rudoff is a Principal Engineer at Intel Corporation, focusing on Non-Volatile Memory programming. He is a contributor to the SNIA NVM Programming Technical Work Group. His more than 30 years industry experience includes design and development work in operating systems, file systems... Read More →


Monday February 25, 2019 1:30pm - 5:00pm EST
Constitution Ballroom A

1:30pm EST

Afternoon Tutorial 2: Caches in the Modern Memory Hierarchy with Persistent Memory and Flash
For a very long time, practical scaling of every level in the computing hierarchy has required innovation and improvement in caches. This is as true for CPUs as it is for storage and networked, distributed systems. As such, research into cache efficiency and efficacy has been highly motivated and continues to yield strong improvements to this day. However, there are certain areas in cache algorithm optimization that have only recently experienced breakthroughs.

In this tutorial, we will start by reviewing the history of caching algorithm research and practice in industry. Of particular interest to us are multi-tier memory hierarchies that are getting more complex and deep due to hardware innovations. These hierarchies motivate revisiting multi-tier algorithms. We will then review key tools in cache research and management called cache utility curves, along with recent literature that has made them easier to compute. Using this tool, we will dig into caching policies and their trade-offs. We will also spend some time thinking about optimality for caches in modern memory hierarchies with DRAM, non-volatile/persistent memory, and flash.
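As background for the cache utility curves mentioned above: at small scale, an LRU miss-ratio curve can be computed in one pass over a trace with the classic Mattson stack-distance method. The sketch below is a toy illustration of that idea only, not one of the recent scalable techniques the tutorial covers; all names are hypothetical.

```python
def miss_ratio_curve(trace, max_size):
    """Compute an LRU miss-ratio curve via Mattson stack distances.

    A reference's stack distance is its depth in the LRU stack
    (1 = most recently used; infinite on first touch). An LRU cache
    of size c misses exactly when the distance exceeds c, so a single
    pass yields the miss ratio for every cache size at once.
    """
    stack = []                    # most-recently-used item at the end
    distances = []
    for item in trace:
        if item in stack:
            depth = len(stack) - stack.index(item)   # 1 = MRU
            stack.remove(item)
        else:
            depth = float('inf')                      # cold miss
        stack.append(item)
        distances.append(depth)
    # Miss ratio for cache size c: fraction of references deeper than c.
    return [sum(d > c for d in distances) / len(distances)
            for c in range(1, max_size + 1)]

mrc = miss_ratio_curve(list("abcabc"), 3)   # → [1.0, 1.0, 0.5]
```

The quadratic cost of the naive stack is exactly why the literature on cheaper approximations, discussed in the tutorial, matters for production-size traces.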

Topics include:
  • Overview and history of caching algorithm research and practice in industry
  • Introduction to new challenges posed by multi-tier memory hierarchies
  • Review of cache utility curves and recent literature
  • Experimenting with caching policies for production use cases
  • How to find the optimal cache

Speakers
IA

Irfan Ahmad

CachePhysics
Irfan Ahmad is the CEO and Cofounder of CachePhysics. Previously, he served as the CTO of CloudPhysics, pioneer in SaaS Virtualized IT Operations Management, which he cofounded in 2011. Irfan was at VMware for nine years, where he was R&D tech lead for the DRS team and co-inventor... Read More →
YV

Ymir Vigfusson

Emory University
Ymir Vigfusson is Assistant Professor of Mathematics and Computer Science at Emory University since 2014, Adjunct Assistant Professor at the School of Computer Science at Reykjavik University since 2011, and a co-founder and Chief Science Officer of the offensive security company... Read More →


Monday February 25, 2019 1:30pm - 5:00pm EST
Constitution Ballroom B

3:00pm EST

Break with Refreshments
Monday February 25, 2019 3:00pm - 3:30pm EST
Grand Ballroom Foyer

3:30pm EST

Performance Analysis in Linux Storage Stack with BPF
How can we deeply analyze and trace performance issues in the Linux Storage Stack?

Many monitoring and benchmark tools help us find bottlenecks and problems through system profiling. However, it is pretty tricky to dig deeper into the root cause at the code/function level because of complex execution flow (e.g. multiple contexts or async flow). In this tutorial, we introduce in-kernel BPF technology and practice analyzing performance issues in the Linux Storage Stack using several tracing tools (BPF, uftrace, ctracer, and perf) step by step with attendees. This session is targeted towards administrators, researchers, and developers.

BPF is a technology that allows safely injecting and executing custom code in the kernel at runtime, an unprecedented capability. By leveraging custom code injected into the kernel, BPF-based profiling and tracing incur low overhead and make much richer introspection available.

Speakers
TS

Taeung Song

KossLab
Taeung is a Software Engineer in KOSSLAB (Korea Opensource Software Developers Lab) and Opensource Contributor in regard to Tracing & Profiling technology such as perf, uftrace, BPF, etc.
DT

Daniel T. Lee

The University of Soongsil
Daniel T. Lee is a Bachelor's degree student at the University of Soongsil and has a deep enthusiasm for Linux. He has been contributing to uftrace: Function (graph) tracer since 2018. He is passionate about tracing and profiling, and he really loves cloud engineering. He has a deep... Read More →


Monday February 25, 2019 3:30pm - 5:00pm EST
Commonwealth Room

6:00pm EST

FAST '19 Happy Hour
Kick off the conference by meeting with your colleagues over snacks and drinks.

Monday February 25, 2019 6:00pm - 7:00pm EST
Back Bay Ballroom AB

7:00pm EST

All Things Ceph BoF
Moderators
RW

Ric Wheeler

Facebook

Monday February 25, 2019 7:00pm - 8:30pm EST
Commonwealth Room

7:00pm EST

NSDI '19 Preview Session
Are you new to NSDI? Are you a networking expert but feel bewildered when talk turns to security? Are you interested in engaging more deeply with paper presentations outside your research area? Join us for the NSDI preview session, where area experts will give short introductions to the Symposium's major technical sessions.

  • Host Networking: Brent Stephens, University of Illinois at Chicago
  • Distributed Systems: Aurojit Panda, New York University
  • Modern Network Hardware: Akshay Narayan, Massachusetts Institute of Technology
  • Analytics: Aurojit Panda, New York University
  • Data Center Network Architecture: Anirudh Sivaraman, New York University
  • Wireless Technologies: Dinesh Bharadia, University of California, San Diego
  • Operating Systems: Amy Ousterhout, Massachusetts Institute of Technology
  • Monitoring and Diagnosis: Anurag Khandelwal, University of California, Berkeley
  • Improving Machine Learning: Junchen Jiang, University of Chicago
  • Network Functions: Radhika Mittal, Massachusetts Institute of Technology and University of Illinois at Urbana–Champaign
  • Network Characterization: David Choffnes, Northeastern
  • Privacy and Security: Vyas Sekar, Carnegie Mellon University
  • Network Modeling: Costin Raiciu, University Politehnica of Bucharest
  • Wireless Applications: Shaddi Hasan, Facebook


Monday February 25, 2019 7:00pm - 9:30pm EST
Grand Ballroom

8:30pm EST

Kubernetes & Storage Oh My BoF
Moderators
Monday February 25, 2019 8:30pm - 10:00pm EST
Commonwealth Room
 
Tuesday, February 26
 

7:30am EST

Continental Breakfast
Tuesday February 26, 2019 7:30am - 8:30am EST
Grand Ballroom Foyer

7:30am EST

Continental Breakfast
Tuesday February 26, 2019 7:30am - 8:45am EST
Grand Ballroom Foyer

8:30am EST

Opening Remarks and Best Paper Awards
Speakers
JL

Jay Lorch

Microsoft Research
MY

Minlan Yu

Harvard University


Tuesday February 26, 2019 8:30am - 8:45am EST
Constitution Ballroom

8:30am EST

Making Ceph Fast in the Face of Failure
Ceph Luminous and Mimic improve the impact of recovery on client I/O. In this talk, we'll discuss the key features that affect this, and how Ceph users can take advantage of them.

Speakers
NO

Neha Ojha

Red Hat
Neha is a Senior Software Engineer at Red Hat. She is the project technical lead for the core team focusing on RADOS. Neha holds a Master's degree in Computer Science from the University of California, Santa Cruz. Her most recent talks have been at Mountpoint, co-located with Open... Read More →


Tuesday February 26, 2019 8:30am - 9:00am EST
Independence Ballroom

8:45am EST

Opening Remarks and Awards
Speakers
HW

Hakim Weatherspoon

Cornell University


Tuesday February 26, 2019 8:45am - 9:00am EST
Grand Ballroom

8:45am EST

Datacenter RPCs can be General and Fast
It is commonly believed that datacenter networking software must sacrifice generality to attain high performance. The popularity of specialized distributed systems designed specifically for niche technologies such as RDMA, lossless networks, FPGAs, and programmable switches testifies to this belief. In this paper, we show that such specialization is not necessary. eRPC is a new general-purpose remote procedure call (RPC) library that offers performance comparable to specialized systems, while running on commodity CPUs in traditional datacenter networks based on either lossy Ethernet or lossless fabrics. eRPC performs well in three key metrics: message rate for small messages; bandwidth for large messages; and scalability to a large number of nodes and CPU cores. It handles packet loss, congestion, and background request execution. In microbenchmarks, one CPU core can handle up to 10 million small RPCs per second, or send large messages at 75 Gbps. We port a production-grade implementation of Raft state machine replication to eRPC without modifying the core Raft source code. We achieve 5.5 microseconds of replication latency on lossy Ethernet, which is faster than or comparable to specialized replication systems that use programmable switches, FPGAs, or RDMA.

Speakers
AK

Anuj Kalia

Carnegie Mellon University
DA

David Andersen

Carnegie Mellon University


Tuesday February 26, 2019 8:45am - 9:10am EST
Constitution Ballroom

9:00am EST

CrashMonkey: Finding File System Crash-Consistency Bugs with Bounded Black-Box Testing
We present a new approach to testing file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. Each workload is tested on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last five years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly-created file system, and that all reported bugs result from crashes after fsync() related system calls. We built the tool CrashMonkey to demonstrate the effectiveness of this approach. CrashMonkey revealed 10 new crash-consistency bugs in widely-used, mature Linux file systems, seven of which existed in the kernel since 2014. It also revealed a data loss bug in a verified file system, FSCQ. The new bugs result in severe consequences like broken rename atomicity and loss of persisted files.
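The bounded, exhaustive enumeration behind B3 can be pictured in a few lines. The operation vocabulary and bound below are hypothetical simplifications for illustration (the real tool also bounds arguments, files, and initial file-system state):

```python
from itertools import product

def bounded_workloads(operations, max_len):
    """Exhaustively enumerate every workload of up to max_len
    file-system operations: the bounded space a B3-style tester
    explores, running each workload under simulated crashes."""
    for length in range(1, max_len + 1):
        for workload in product(operations, repeat=length):
            yield workload

# A toy operation vocabulary (illustrative only):
ops = ["creat", "write", "rename", "fsync"]
workloads = list(bounded_workloads(ops, 3))
# 4 + 4**2 + 4**3 = 84 bounded workloads to crash-test
```

The key point the abstract makes is that this space, though exponential in the bound, is small enough at three operations to cover exhaustively, and that bound suffices to reproduce most reported bugs.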

Speakers
JM

Jayashree Mohan

The University of Texas at Austin
Jayashree Mohan is a third year PhD student at the University of Texas at Austin. She works primarily on file and storage systems with a focus on testing the reliability of file systems. Prior to starting her PhD, she received a B.Tech in CS at the National Institute of Technology... Read More →


Tuesday February 26, 2019 9:00am - 9:25am EST
Independence Ballroom

9:00am EST

Keynote Address
NoSQL cloud database services, like Amazon DynamoDB, are popular for their simple key-value operations, unbounded scalability and predictable low-latency. Atomic transactions, while popular in relational databases, carry the specter of complexity and low performance, especially when used for workloads with high contention. Transactions often have been viewed as inherently incompatible with NoSQL stores, and the few commercial services that combine both come with limitations. This talk examines the tension between transactions and non-relational databases, and it recounts my journey of adding transactions to DynamoDB. I conclude that atomic transactions with full ACID properties can be supported without unduly compromising on performance, availability, self-management, or scalability.

Speakers
DT

Doug Terry

Senior Principal Technologist, Amazon Web Services
Doug Terry is a Senior Principal Technologist in the AWS Database Services team focusing on global databases (Amazon DynamoDB) and large-scale data warehouses (Amazon Redshift). Prior to joining Amazon in 2016, Doug led innovative research at Xerox PARC, Microsoft, and Samsung. He... Read More →


Tuesday February 26, 2019 9:00am - 10:00am EST
Grand Ballroom

9:10am EST

Eiffel: Efficient and Flexible Software Packet Scheduling
Packet scheduling determines the ordering of packets in a queuing data structure with respect to some ranking function that is mandated by a scheduling policy. It is the core component in many recent innovations in optimizing network performance and utilization. Packet scheduling is used for network resource allocation, meeting network-wide delay objectives, or providing isolation and differentiation of service. Our focus in this paper is on the design and deployment of packet scheduling in software. Software schedulers have several advantages, including a shorter development cycle and flexibility in functionality and deployment location. We substantially improve software packet scheduling performance, while maintaining its flexibility, by exploiting underlying features of packet ranking: the fact that packet ranks are integers that have predetermined ranges and that many packets will typically have equal rank. This allows us to rely on integer priority queues, unlike existing ranking algorithms, which rely on comparison-based priority queues that assume continuous ranks with infinite range. We introduce Eiffel, a novel programmable packet scheduling system. At the core of Eiffel is an integer priority queue based on the Find First Set (FFS) instruction and designed to support a wide range of policies and ranking functions efficiently. As an even more efficient alternative, we also propose a new approximate priority queue that can outperform FFS-based queues for some scenarios. To support flexibility, Eiffel introduces novel programming abstractions to express scheduling policies that cannot be captured by current, state-of-the-art scheduler programming models. We evaluate Eiffel in a variety of settings and in both kernel and userspace deployments. We show that it outperforms state-of-the-art systems by 3-40x in terms of either number of cores utilized for network processing or number of flows given fixed processing capacity.
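The integer-rank insight above can be made concrete with a toy bucket queue. This is an illustrative sketch only, not Eiffel's implementation: Python's bit tricks stand in for the hardware Find First Set instruction, and all names are hypothetical.

```python
from collections import deque

class FFSQueue:
    """Integer priority queue: one FIFO bucket per rank plus a bitmap
    of non-empty buckets. Extracting the minimum rank is a single
    find-first-set over the bitmap, O(1) rather than the O(log n) of
    a comparison-based heap."""

    def __init__(self, max_rank):
        self.buckets = [deque() for _ in range(max_rank)]
        self.bitmap = 0

    def push(self, rank, pkt):
        self.buckets[rank].append(pkt)
        self.bitmap |= 1 << rank          # mark bucket non-empty

    def pop(self):
        if not self.bitmap:
            return None
        # Isolate the lowest set bit: software stand-in for hardware FFS.
        rank = (self.bitmap & -self.bitmap).bit_length() - 1
        bucket = self.buckets[rank]
        pkt = bucket.popleft()
        if not bucket:
            self.bitmap &= ~(1 << rank)   # bucket drained
        return rank, pkt

q = FFSQueue(max_rank=64)
q.push(5, "a"); q.push(2, "b"); q.push(5, "c")
# Lowest rank drains first, FIFO within a rank: (2,"b"), (5,"a"), (5,"c")
```

The design works precisely because ranks are bounded integers and equal-rank packets need no further ordering, which is the property the abstract highlights.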

Speakers
AS

Ahmed Saeed

Georgia Institute of Technology
YZ

Yimeng Zhao

Georgia Institute of Technology
EZ

Ellen Zegura

Georgia Institute of Technology
MA

Mostafa Ammar

Georgia Institute of Technology
KH

Khaled Harras

Carnegie Mellon University
AV

Amin Vahdat

Google Inc.


Tuesday February 26, 2019 9:10am - 9:35am EST
Constitution Ballroom

9:25am EST

Experiences with Fuse in the Real World
The Filesystem in Userspace (FUSE) module provides a simple way to create user-space file systems. The shortcomings of this approach to implementing file systems have been debated many times in the past, a few times even with data to back up the arguments. In this talk, we will revisit the topic in the context of a distributed Software-Defined Storage (SDS) solution, gluster. We will present our experiences based on users deploying it in production over the years, with FUSE access as the primary interface. In this context, we will discuss some of the problem areas like memory management, and demonstrate trade-offs in implementing important caches in the user-space versus relying on kernel caches.

As gluster expands to newer use-cases like persistent storage for container platforms, it needs to efficiently handle a wide variety of workloads and more frequently handle smaller, single-client volumes. In this context, we see the need to absorb more recent FUSE performance enhancements like write-back caching, and we will present our characterization of the performance benefits obtained from these enhancements.

Speakers
MP

Manoj Pillai

Red Hat
Manoj Pillai is part of the Performance and Scale Engineering Group at Red Hat. His focus is on storage performance, particularly around gluster, and he has presented on these topics at Open Source Summit, FOSDEM, Vault 2017, Red Hat Summit and Gluster Summit.
RG

Raghavendra Gowdappa

Red Hat
Raghavendra Gowdappa is one of the maintainers of Glusterfs and is currently employed by Red Hat. He has worked on interfacing Glusterfs with FUSE, caching, network and file distribution aspects of Glusterfs. His earlier presentations were at FOSDEM, Vault 2017 and Gluster Summit... Read More →
CH

Csaba Henk

Red Hat
Csaba Henk has worked on the fuse layer of Glusterfs from the early times on. He has been involved in augmentative and integration projects, like geo-replication and OpenStack Manila glusterfs drivers. These days he's back at core Glusterfs and works on caches and fuse.


Tuesday February 26, 2019 9:25am - 9:50am EST
Independence Ballroom

9:35am EST

Loom: Flexible and Efficient NIC Packet Scheduling
In multi-tenant cloud data centers, operators need to ensure that competing tenants and applications are isolated from each other and fairly share limited network resources. With current NICs, operators must either 1) use a single NIC queue and enforce network policy in software, which incurs high CPU overheads and struggles to drive increasing line-rates (100Gbps), or 2) use multiple NIC queues and accept imperfect isolation and policy enforcement. These problems arise due to inflexible and static NIC packet schedulers and an inefficient OS/NIC interface.

To overcome these limitations, we present Loom, a new NIC design that moves all per-flow scheduling decisions out of the OS and into the NIC. The key aspects of Loom's design are 1) a new network policy abstraction: restricted directed acyclic graphs (DAGs), 2) a programmable hierarchical packet scheduler, and 3) a new expressive and efficient OS/NIC interface that enables the OS to precisely control how the NIC performs packet scheduling while still ensuring low CPU utilization. Loom is the only multiqueue NIC design that is able to efficiently enforce network policy. We find empirically that Loom lowers latency, increases throughput, and improves fairness for collocated applications and tenants.

Speakers

Tuesday February 26, 2019 9:35am - 10:00am EST
Constitution Ballroom

9:50am EST

SMB3 Linux/POSIX Protocol Extensions: Overview and Update on Current Implementations
The SMB3 POSIX Extensions, a set of protocol extensions to allow for optimal Linux and Unix interoperability with Samba, NAS and Cloud file servers, have evolved over the past year, with test implementations in Samba and now merged into the Linux kernel. These extensions address various compatibility problems for Linux and Unix clients (such as case sensitivity, locking, delete semantics and mode bits among others). This presentation will review the state of the protocol extensions, what was learned in the implementations in Samba and also in the Linux kernel (including from running exhaustive Linux file system functional tests to try to better match local file system behavior over SMB3 mounts) and what it means for real applications.

With the deprecation of older less secure dialects like CIFS (which had standardized POSIX Extensions documented by SNIA), these SMB3 POSIX Extensions are urgently needed to be more broadly deployed to avoid functional or security problems and to optimally access Samba from Linux.

Speakers
JA

Jeremy Allison

Samba Team, Google
Jeremy Allison is a frequent speaker at Storage, Linux and Samba events and is one of the original members of the Samba team.
SF

Steve French

Samba Team, Microsoft Azure Storage


Tuesday February 26, 2019 9:50am - 10:15am EST
Independence Ballroom

10:00am EST

Break with Refreshments
Tuesday February 26, 2019 10:00am - 10:30am EST
Grand Ballroom Foyer

10:15am EST

Break with Refreshments
Tuesday February 26, 2019 10:15am - 10:45am EST
Grand Ballroom Foyer

10:30am EST

Exploiting Commutativity For Practical Fast Replication
Traditional approaches to replication require client requests to be ordered before making them durable by copying them to replicas. As a result, clients must wait for two round-trip times (RTTs) before updates complete. In this paper, we show that this entanglement of ordering and durability is unnecessary for strong consistency. Consistent Unordered Replication Protocol (CURP) allows clients to replicate requests that have not yet been ordered, as long as they are commutative. This strategy allows most operations to complete in 1 RTT (the same as an unreplicated system). We implemented CURP in the Redis and RAMCloud storage systems. In RAMCloud, CURP improved write latency by ~2x (14us -> 7.1us) and write throughput by 4x. Compared to unreplicated RAMCloud, CURP's latency overhead for 3-way replication is just 1us (6.1us vs 7.1us). CURP transformed a non-durable Redis cache into a consistent and durable storage system with only a small performance overhead.
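The commutativity condition at the heart of CURP can be illustrated for a key-value store, where two writes commute exactly when they target different keys. The sketch below is a simplified illustration of a witness's accept/reject decision, not RAMCloud's implementation; all names are hypothetical.

```python
class Witness:
    """Accepts unordered updates only while they commute with every
    update it already holds (i.e., touch distinct keys). A rejection
    forces the client back onto the ordered 2-RTT path through the
    master; garbage collection clears entries once the master has
    synced them to backups."""

    def __init__(self):
        self.pending = {}              # key -> value of unsynced updates

    def record(self, key, value):
        if key in self.pending:        # same key: does not commute
            return False
        self.pending[key] = value      # commutes: durable in 1 RTT
        return True

    def gc(self, synced_keys):
        for k in synced_keys:
            self.pending.pop(k, None)

w = Witness()
assert w.record("x", 1)       # accepted: nothing to conflict with
assert w.record("y", 2)       # different key: still commutative
assert not w.record("x", 3)   # same key: rejected, needs ordering
w.gc(["x"])
assert w.record("x", 3)       # after the master syncs, x is free again
```

Because reordering two writes to different keys yields the same final state, the witness can make requests durable before they are ordered, which is how most operations complete in a single round trip.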

Speakers
SJ

Seo Jin Park

Stanford University
JO

John Ousterhout

Stanford University


Tuesday February 26, 2019 10:30am - 10:55am EST
Constitution Ballroom

10:30am EST

Reaping the performance of fast NVM storage with uDepot
Many applications require low-latency key-value storage, a requirement that is typically satisfied using key-value stores backed by DRAM. Recently, however, storage devices built on novel NVM technologies offer unprecedented performance compared to conventional SSDs. A key-value store that could deliver the performance of these devices would offer many opportunities to accelerate applications and reduce costs. Nevertheless, existing key-value stores, built for slower SSDs or HDDs, cannot fully exploit such devices.

In this paper, we present uDepot, a key-value store built bottom-up to deliver the performance of fast NVM block-based devices. uDepot is carefully crafted to avoid inefficiencies, uses a two-level indexing structure that dynamically adjusts its DRAM footprint to match the inserted items, and employs a novel task-based IO run-time system to maximize performance, enabling applications to use fast NVM devices at their full potential. As an embedded store, uDepot's performance nearly matches the raw performance of fast NVM devices both in terms of throughput and latency, while being scalable across multiple devices and cores. As a server, uDepot significantly outperforms state-of-the-art stores that target SSDs under the YCSB benchmark. Finally, using a Memcache service on top of uDepot we demonstrate that data services built on NVM storage devices can offer equivalent performance to their DRAM-based counterparts at a much lower cost. Indeed, using uDepot we have built a cloud Memcache service that is currently available as an experimental offering in the public cloud.

Speakers
KK

Kornilios Kourtis

IBM Research
NI

Nikolas Ioannou

IBM Research
IK

Ioannis Koltsidas

IBM Research


Tuesday February 26, 2019 10:30am - 11:00am EST
Grand Ballroom

10:45am EST

From Open-Channel SSDs to Zoned Namespaces
Open-Channel Solid State Drive architectures are being rapidly adopted by hyperscalers, all-flash array vendors, and large storage system vendors. The versatile storage interface allows a solid state drive to expose essential knobs for controlling latency, I/O predictability, and I/O isolation. This rapid adoption has created a diverse set of Open-Channel SSD drive specifications, each solving the needs of one or a few users. However, the specifications are yet to be standardized.

The Zoned Namespaces (ZNS) Technical Proposal in the NVMe workgroup is developing an industry standard for these types of interfaces, creating a foundation on which a robust software ecosystem can be built and implementation efforts streamlined.

This talk covers the motivation for and characteristics of Zoned Namespaces, possible software improvements, and early results that show the effectiveness of these types of drives.
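The core zone abstraction can be captured in a toy model. The sketch below is illustrative only, not the NVMe technical proposal: a zone accepts writes only at its write pointer and must be reset before it can be rewritten, mirroring how flash blocks must be erased before reuse.

```python
class Zone:
    """Toy model of a ZNS zone (illustrative, not the NVMe spec): host writes
    land only at the write pointer, and a zone must be reset before reuse."""
    def __init__(self, size):
        self.size = size
        self.wp = 0                       # write pointer advances sequentially
        self.data = bytearray(size)

    def append(self, buf):
        if self.wp + len(buf) > self.size:
            raise IOError("zone full: reset required before rewriting")
        off = self.wp
        self.data[off:off + len(buf)] = buf
        self.wp += len(buf)
        return off                        # placement chosen by the device side

    def reset(self):
        self.wp = 0                       # analogous to erasing the flash blocks
```

Because the host can never overwrite in place, the device needs no opaque garbage-collection layer for these writes, which is where the latency and I/O-predictability benefits come from.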

Speakers
MB

Matias Bjørling

Western Digital
Matias Bjørling is Director of Solid State System Software at Western Digital. He is the author of the Open-Channel SSD 1.2 and 2.0 specifications and maintainer of the Open-Channel SSD subsystem in the Linux kernel. Before joining the industry, he obtained a Ph.D. in operating systems... Read More →


Tuesday February 26, 2019 10:45am - 11:10am EST
Independence Ballroom

10:55am EST

Flashield: a Hybrid Key-value Cache that Controls Flash Write Amplification
As its price per bit drops, SSD is increasingly becoming the default storage medium for hot data in cloud application databases. Even though SSD's price per bit is more than 10× lower, and it provides sufficient performance (when accessed over a network) compared to DRAM, the limited write endurance of flash has restricted its adoption in write-heavy use cases, such as key-value caching. This is because key-value caches need to frequently insert, update, and evict small objects, causing excessive writes and erasures on flash storage, which significantly shorten the lifetime of flash. We present Flashield, a hybrid key-value cache that uses DRAM as a "filter" to control and limit writes to SSD. Flashield performs lightweight machine-learning admission control to predict which objects are likely to be read frequently without getting updated; these objects, which are prime candidates to be stored on SSD, are written to SSD sequentially in large chunks. In order to efficiently utilize the cache's available memory, we design a novel in-memory index for the variable-sized objects stored on flash that requires only 4 bytes per object in DRAM. We describe Flashield's design and implementation, and evaluate it on real-world traces from a widely used caching service, Memcachier. Compared to state-of-the-art systems that suffer a write amplification of 2.5× or more, Flashield maintains a median write amplification of 0.5× (since many filtered objects are never written to flash at all), without any loss of hit rate or throughput.
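The DRAM-filter idea can be sketched with a fixed rule in place of the paper's learned classifier. Everything below is a hedged illustration with invented names and thresholds: objects prove themselves in DRAM first, and only those read repeatedly without being updated become candidates to be flushed to flash in large sequential chunks.

```python
# Hedged sketch of the DRAM-filter idea (names and thresholds are invented,
# and the paper uses a learned classifier rather than this fixed rule):
# objects prove themselves in DRAM first; only those read repeatedly without
# being updated are flushed to flash, in large sequential chunks.
class DramFilter:
    def __init__(self, min_reads=2, chunk=4):
        self.min_reads, self.chunk = min_reads, chunk
        self.objs = {}                     # key -> [value, reads, updated]

    def put(self, key, value):
        ent = self.objs.get(key)
        if ent:
            ent[0], ent[2] = value, True   # updated in DRAM: poor flash candidate
        else:
            self.objs[key] = [value, 0, False]

    def get(self, key):
        ent = self.objs.get(key)
        if ent:
            ent[1] += 1
            return ent[0]

    def flush_candidates(self):
        # Read often and never updated while in DRAM: safe to move to SSD.
        keys = [k for k, (_, reads, updated) in self.objs.items()
                if reads >= self.min_reads and not updated]
        return keys[:self.chunk]           # written to SSD as one large write
```

Objects that never qualify are simply evicted from DRAM without ever touching flash, which is how the median write amplification can fall below 1.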

Speakers
AE

Assaf Eisenman

Stanford University
AC

Asaf Cidon

Stanford University and Barracuda Networks
EP

Evgenya Pergament

Stanford University
OH

Or Haimovich

Stanford University
RS

Ryan Stutsman

University of Utah
MA

Mohammad Alizadeh

Massachusetts Institute of Technology
SK

Sachin Katti

Stanford University


Tuesday February 26, 2019 10:55am - 11:20am EST
Constitution Ballroom

11:00am EST

Optimizing Systems for Byte-Addressable NVM by Reducing Bit Flipping
New byte-addressable non-volatile memory (BNVM) technologies such as phase change memory (PCM) enable the construction of systems with large persistent memories, improving reliability and potentially reducing power consumption. However, BNVM technologies only support a limited number of lifetime writes per cell and consume most of their power when flipping a bit's state during a write; thus, PCM controllers only rewrite a cell's contents when the cell's value has changed. Prior research has assumed that reducing the number of words written is a good proxy for reducing the number of bits modified, but a recent study has suggested that this assumption may not be valid. Our research confirms that approaches with the fewest writes often have more bit flips than those optimized to reduce bit flipping.

To test the effectiveness of bit flip reduction, we built a framework that uses the number of bits flipped over time as the measure of "goodness" and modified a cycle-accurate simulator to count bits flipped during program execution. We implemented several modifications to common data structures designed to reduce power consumption and increase memory lifetime by reducing the number of bits modified by operations on several data structures: linked lists, hash tables, and red-black trees. We were able to reduce the number of bits flipped by up to 3.56× over standard implementations of the same data structures with negligible overhead. We measured the number of bits flipped by memory allocation and stack frame saves and found that careful data placement in the stack can reduce bit flips significantly. These changes require no hardware modifications and neither significantly reduce performance nor increase code complexity, making them attractive for designing systems optimized for BNVM.
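The paper's cost metric is easy to state concretely. The sketch below (a minimal illustration, not the authors' framework) computes the write cost of a word update as the population count of the XOR of old and new values, which is exactly what a controller that only rewrites changed cells pays.

```python
# Minimal sketch of the paper's cost metric: a PCM-style controller only
# rewrites cells whose value changed, so the write cost of updating a word is
# popcount(old XOR new), not the number of words written.
def bits_flipped(old: int, new: int) -> int:
    return bin(old ^ new).count("1")

def total_flips(writes):
    """Sum bit flips over a trace of (old_word, new_word) updates."""
    return sum(bits_flipped(o, n) for o, n in writes)
```

Under this metric, a data-structure change that writes more words can still be cheaper, as long as the words it writes differ from their old contents in fewer bit positions.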

Speakers
DB

Daniel Bittman

UC Santa Cruz
DD

Darrell D. E. Long

UC Santa Cruz
PA

Peter Alvaro

UC Santa Cruz
EL

Ethan L. Miller

UC Santa Cruz


Tuesday February 26, 2019 11:00am - 11:30am EST
Grand Ballroom

11:10am EST

New Techniques to Improve Small I/O Workloads in Distributed File Systems
Distributed file systems work well with high-throughput applications that are parallelizable. Due to network overhead, they tend to perform less well with workloads that are metadata- or small-file-intensive. This problem has been closely studied, resulting in many innovative ideas. For example, researchers have proposed storing inodes in column-store databases to speed up directory reads. Another idea is to have file systems publish "snapshots" visible to a subset of clients during metadata creation, which are later subscribed to by the rest of the system.

Are these techniques practical outside university labs? To answer this question, we introduce software that makes the original implementations much easier to use, by acting as a layer on top of Ceph object storage. The talk will walk through how to set up and run the configuration in realistic environments. The original research will be described in detail, explaining how the improved performance comes with some loss of POSIX generality, along with a small number of new operational steps outside of traditional file system workflows. The talk will show how this solution could be a good fit for analytics use cases where file system semantics are needed and there is flexibility at the application level.

Speakers
DL

Dan Lambright

Huawei
Dan has worked in open source storage at Red Hat and also at AWS. Today he is building distributed storage at Huawei. He has spoken at Vault, LinuxCon, OpenStack, LISA, and other venues. He also enjoys teaching at the University of Massachusetts Lowell.


Tuesday February 26, 2019 11:10am - 11:35am EST
Independence Ballroom

11:20am EST

Size-aware Sharding For Improving Tail Latencies in In-memory Key-value Stores
This paper introduces the concept of size-aware sharding to improve tail latencies for in-memory key-value stores, and describes its implementation in the Minos key-value store. Tail latencies are crucial in distributed applications with high fan-out ratios, because overall response time is determined by the slowest response. Size-aware sharding distributes requests for keys to cores according to the size of the item associated with the key. In particular, requests for small and large items are sent to disjoint subsets of cores. Size-aware sharding improves tail latencies by avoiding head-of-line blocking, in which a request for a small item gets queued behind a request for a large item. Alternative size-unaware approaches to sharding such as keyhash-based sharding, request dispatching and stealing do not avoid head-of-line blocking, and therefore exhibit worse tail latencies. The challenge in implementing size-aware sharding is to maintain high throughput by avoiding the cost of software dispatching and by achieving load balancing between different cores. Minos uses hardware dispatch for all requests for small items, which form the very large majority of all requests. It achieves load balancing by adapting the number of cores handling requests for small and large items to their relative presence in the workload. We compare Minos to three state-of-the-art designs of in-memory KV stores. Compared to its closest competitor, Minos achieves a 99th percentile latency that is up to two orders of magnitude lower. Put differently, for a given value for the 99th percentile latency equal to 10 times the mean service time, Minos achieves a throughput that is up to 7.4 times higher.
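The dispatch rule itself is simple to illustrate. The following is a hedged sketch, not the Minos implementation (the size cutoff and round-robin policy are assumptions, and Minos additionally adapts the core split to the workload): requests for small items go to one pool of cores and large items to a disjoint pool, so a small request can never queue behind a large one.

```python
from collections import deque

# Illustrative sketch (not the Minos implementation): requests for small items
# go to one pool of cores, large items to a disjoint pool, so a small request
# never queues behind a large one (no head-of-line blocking).
SIZE_CUTOFF = 1024                         # bytes; an assumed threshold

class SizeAwareSharder:
    def __init__(self, small_cores, large_cores):
        self.queues = {c: deque() for c in small_cores + large_cores}
        self.small, self.large = small_cores, large_cores
        self.rr = {"small": 0, "large": 0}

    def dispatch(self, key, item_size):
        pool_name = "small" if item_size <= SIZE_CUTOFF else "large"
        pool = self.small if pool_name == "small" else self.large
        core = pool[self.rr[pool_name] % len(pool)]   # round-robin in the pool
        self.rr[pool_name] += 1
        self.queues[core].append(key)
        return core
```

Load balancing then reduces to periodically resizing the two pools according to the fraction of small and large requests observed in the workload.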

Speakers
WZ

Willy Zwaenepoel

EPFL and University of Sydney


Tuesday February 26, 2019 11:20am - 11:45am EST
Constitution Ballroom

11:30am EST

Write-Optimized Dynamic Hashing for Persistent Memory
Low-latency storage media such as byte-addressable persistent memory (PM) require rethinking how various data structures are optimized. One of the main challenges in implementing hash-based indexing structures on PM is how to achieve efficiency by making effective use of cachelines while guaranteeing failure-atomicity for dynamic hash expansion and shrinkage. In this paper, we present Cacheline-Conscious Extendible Hashing (CCEH), which reduces the overhead of dynamic memory block management while guaranteeing constant hash table lookup time. CCEH guarantees failure-atomicity without making use of explicit logging. Our experiments show that CCEH effectively adapts its size as demand increases under the fine-grained failure-atomicity constraint, and reduces the maximum query latency by over two-thirds compared to state-of-the-art hashing techniques.
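The directory/segment split that CCEH builds on is textbook extendible hashing. The sketch below shows only that skeleton, with none of CCEH's PM specifics (cacheline-sized buckets, fence-based failure-atomicity, lazy deletion): a directory of 2^G entries points to segments; a full segment splits, and the directory doubles only when the splitting segment's local depth equals the global depth.

```python
import hashlib

CAP = 4  # items per segment; a stand-in for a fixed number of cachelines

class Segment:
    def __init__(self, depth):
        self.depth = depth            # local depth
        self.slots = {}               # stand-in for cacheline-sized buckets

class ExtendibleHash:
    """Textbook extendible hashing in the spirit of CCEH's directory/segment
    design (illustrative only; no failure-atomicity or PM specifics here)."""
    def __init__(self):
        self.g = 1                    # global depth
        self.dir = [Segment(1), Segment(1)]

    def _h(self, key):
        d = hashlib.sha1(str(key).encode()).digest()
        return int.from_bytes(d[:4], "little")

    def _idx(self, key):
        return self._h(key) & ((1 << self.g) - 1)   # low bits pick the segment

    def get(self, key):
        return self.dir[self._idx(key)].slots.get(key)

    def put(self, key, val):
        while True:
            seg = self.dir[self._idx(key)]
            if key in seg.slots or len(seg.slots) < CAP:
                seg.slots[key] = val
                return
            self._split(seg)          # segment full: split, then retry

    def _split(self, seg):
        if seg.depth == self.g:       # directory must double first
            self.dir = self.dir + self.dir
            self.g += 1
        d = seg.depth
        a, b = Segment(d + 1), Segment(d + 1)
        for k, v in seg.slots.items():
            (b if (self._h(k) >> d) & 1 else a).slots[k] = v
        for i, s in enumerate(self.dir):            # repoint directory entries
            if s is seg:
                self.dir[i] = b if (i >> d) & 1 else a
```

Lookups stay constant-time (one directory probe plus one segment probe), which is the property CCEH preserves while adding cacheline-conscious layout and failure-atomic splits.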

Speakers
MN

Moohyeon Nam

UNIST (Ulsan National Institute of Science and Technology)
HC

Hokeun Cha

Sungkyunkwan University
YC

Young-ri Choi

UNIST (Ulsan National Institute of Science and Technology)
BN

Beomseok Nam

Sungkyunkwan University


Tuesday February 26, 2019 11:30am - 12:00pm EST
Grand Ballroom

11:35am EST

Optimizing Storage Performance for 4–5 Million IOPS
New workloads and Storage Class Memory (SCM) are demanding a new level of IOPS, bandwidth, and driver optimization in Linux for storage networks. James Smart will discuss how the lpfc driver was recently reworked to achieve a new level of driver performance, reaching 5+ million IOPS. James will discuss hardware parallelization, per-core WQs, interrupt handling, and shared resource management that will benefit both SCSI and NVMe over Fabrics performance. James will show performance curves, discuss Linux OS issues encountered, and describe work yet to be done in Linux to improve performance even more.

Speakers
JS

James Smart

Broadcom
James Smart is currently a Distinguished Engineer at Broadcom responsible for the architecture of Broadcom's Fibre Channel Linux stack. James has worked in storage software and firmware development for 32 years. James is a member of T11 and the NVM Express standards groups. James... Read More →


Tuesday February 26, 2019 11:35am - 12:00pm EST
Independence Ballroom

11:45am EST

Monoxide: Scale Out Blockchain with Asynchronized Consensus Zones
Cryptocurrencies have provided a promising infrastructure for pseudonymous online payments. However, low throughput has significantly hindered the scalability and usability of cryptocurrency systems for increasing numbers of users and transactions. Another obstacle to achieving scalability is that every node is required to duplicate the communication, storage, and state representation of the entire network.

In this paper, we introduce Asynchronous Consensus Zones, which scale blockchain systems linearly without compromising decentralization or security. We achieve this by running multiple independent and parallel instances of single-chain consensus ("zones"). Consensus happens independently within each zone with minimized communication, which partitions the workload of the entire network and ensures a moderate burden for each individual node as the network grows. We propose eventual atomicity to ensure transaction atomicity across zones, which guarantees the efficient completion of transactions without the overhead of a two-phase commit protocol. We also propose Chu-ko-nu mining to ensure that the effective mining power in each zone is at the same level as that of the entire network, making an attack on any individual zone as hard as one on the entire network. Our experimental results show the effectiveness of our work: on a test-bed of 1,200 virtual machines worldwide supporting 48,000 nodes, our system delivers 1,000× the throughput and capacity of the Bitcoin and Ethereum networks.
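Eventual atomicity can be illustrated with a toy two-zone payment. This is a hypothetical sketch only (Monoxide's actual relay mechanism also involves mining, validation, and proofs, all omitted here): the debit settles immediately in the payer's zone, and the credit travels as a "relay transaction" that the payee's zone applies later, so no two-phase commit spans the zones.

```python
# Toy sketch of eventual atomicity across zones (hypothetical names; the real
# relay mechanism also handles mining and validation, omitted here).
class ConsensusZone:
    def __init__(self):
        self.balances = {}
        self.inbox = []                    # relay transactions awaiting apply

def zone_of(zones, addr):
    return zones[addr[0]]                  # toy sharding by address prefix

def pay(zones, src, dst, amount):
    origin = zone_of(zones, src)
    assert origin.balances.get(src, 0) >= amount
    origin.balances[src] -= amount         # debit settles in the payer's zone
    zone_of(zones, dst).inbox.append((dst, amount))  # credit ships as a relay

def settle(zone):
    # Applying queued relays later yields atomicity eventually, with no 2PC.
    for dst, amount in zone.inbox:
        zone.balances[dst] = zone.balances.get(dst, 0) + amount
    zone.inbox.clear()
```

The invariant is that a relay is only ever emitted after its debit is final, so the total supply is conserved even though the credit lands asynchronously.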

Speakers
JW

Jiaping Wang

ICT/CAS, Sinovation AI Institute
HW

Hao Wang

Ohio State University


Tuesday February 26, 2019 11:45am - 12:10pm EST
Constitution Ballroom

12:00pm EST

Software Wear Management for Persistent Memories
The commercial release of byte-addressable persistent memories (PMs) is imminent. Unfortunately, these devices suffer from limited write endurance—without any wear management, PM lifetime might be as low as 1.1 months. Existing wear-management techniques introduce an additional indirection layer to remap memory across physical frames and require hardware support to track fine-grain wear. These mechanisms incur storage overhead and increase access latency and energy consumption.

We present Kevlar, an OS-based wear-management technique for PM that requires no new hardware. Kevlar uses existing virtual memory mechanisms to remap pages, enabling it to perform both wear leveling—shuffling pages in PM to even wear; and wear reduction—transparently migrating heavily written pages to DRAM. Crucially, Kevlar avoids the need for hardware support to track wear at fine grain. Instead, it relies on a novel wear estimation technique that builds upon Intel's Precise Event Based Sampling to approximately track processor cache contents via a software-maintained Bloom filter and estimate write-back rates at fine grain. We implement Kevlar in Linux and demonstrate that it achieves lifetime improvement of 18.4x (avg.) over no wear management while incurring 1.2% performance overhead.
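The Bloom-filter side of the wear estimator can be sketched compactly. The code below is a loose, hypothetical model of the idea (the real system samples stores via PEBS and handles filter aging, neither of which appears here): sampled store addresses feed a Bloom filter that approximates cache contents, and an address not yet present is treated as a newly dirtied line that will eventually be written back to its page.

```python
import hashlib

class Bloom:
    """Small Bloom filter approximating the set of cache-resident lines."""
    def __init__(self, nbits=1 << 16, k=3):
        self.nbits, self.k = nbits, k
        self.bits = bytearray(nbits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

class WearEstimator:
    """Hypothetical sketch of Kevlar-style wear estimation: an address absent
    from the filter is counted as a newly dirtied line destined for its page."""
    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.cache = Bloom()
        self.writes = {}              # page number -> estimated write-backs

    def sampled_store(self, addr):
        page = addr // self.page_size
        if addr not in self.cache:
            self.cache.add(addr)
            self.writes[page] = self.writes.get(page, 0) + 1
```

Pages whose estimated write-back counts grow fastest are the ones a wear-leveling policy would remap or migrate to DRAM.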

Speakers
VG

Vaibhav Gogte

University of Michigan
AK

Aasheesh Kolli

Pennsylvania State University and VMware Research
PM

Peter M. Chen

University of Michigan
SN

Satish Narayanasamy

University of Michigan
TF

Thomas F. Wenisch

University of Michigan


Tuesday February 26, 2019 12:00pm - 12:30pm EST
Grand Ballroom

12:00pm EST

Conference Luncheon
Tuesday February 26, 2019 12:00pm - 1:30pm EST
Back Bay Ballroom D

12:30pm EST

Lunch (on your own)
Tuesday February 26, 2019 12:30pm - 2:00pm EST
N/A

1:30pm EST

FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds
Many popular large-scale cloud applications are increasingly using containerization for high resource efficiency and lightweight isolation. In parallel, many data-intensive applications (e.g., data analytics and deep learning frameworks) are adopting or looking to adopt RDMA for high networking performance. Industry trends suggest that these two approaches are on an inevitable collision course. In this paper, we present FreeFlow, a software-based RDMA virtualization framework designed for containerized clouds. FreeFlow realizes virtual RDMA networking purely with a software-based approach using commodity RDMA NICs. Unlike existing RDMA virtualization solutions, FreeFlow fully satisfies the requirements from cloud environments, such as isolation for multi-tenancy, portability for container migrations, and controllability for control and data plane policies. FreeFlow is also transparent to applications and provides networking performance close to bare-metal RDMA with low CPU overhead. In our evaluations with TensorFlow and Spark, FreeFlow provides almost the same application performance as bare-metal RDMA.

Speakers
DK

Daehyeok Kim

Carnegie Mellon University
TY

Tianlong Yu

Carnegie Mellon University
YZ

Yibo Zhu

Microsoft and ByteDance
JP

Jitu Padhye

Microsoft
VS

Vyas Sekar

Carnegie Mellon University
SS

Srinivasan Seshan

Carnegie Mellon University


Tuesday February 26, 2019 1:30pm - 1:55pm EST
Constitution Ballroom

1:30pm EST

Design of a Composable Infrastructure Platform
Composable infrastructure, as discussed in this talk, is a method for dynamically creating secure application clusters from disaggregated compute, storage, and networking. The problems facing such a solution are ones of availability, durability, scalability, performance, and, most importantly, correctness.

The target applications are widely deployed data analytics and NoSQL database applications that can consist of hundreds to thousands of compute nodes with tens of thousands of disks for each application in a secure cluster instance.

The talk consists of five parts. We present a very brief description of the user view of creating virtual clusters on a composable infrastructure platform. We follow this with a short description of the problems and requirements for the platform. That motivates the bulk of the presentation, which describes the state machine design for a correct and durable orchestration platform that scales to hundreds of thousands of managed elements. Select code and data structures are used to point out implementation details. The fourth part of the talk describes how standard Linux networking and storage subsystems are managed and used to create virtual clusters (including NVMe over Fabrics), and the open source components used by the platform to achieve scale, availability, and security. The final part of the talk details key failure scenarios and the recovery mechanisms that maintain correctness and availability.

Speakers
BP

Brian Pawlowski

Drivescale Inc.
Brian Pawlowski is currently CTO of Drivescale Inc. where he is involved in the design of software to support cluster computing and developing a platform for composable infrastructure.As Vice President and Chief Architect at Pure Storage, he was focused on product and architecture... Read More →


Tuesday February 26, 2019 1:30pm - 1:55pm EST
Independence Ballroom

1:55pm EST

Direct Universal Access: Making Data Center Resources Available to FPGA
FPGAs have been deployed at massive scale in data centers. The currently available communication architectures, however, make it very difficult for FPGAs to utilize data center resources. In this paper, we present Direct Universal Access (DUA), a communication architecture that provides FPGAs with uniform access to heterogeneous data center resources. Without considering machine boundaries, DUA provides global names and a common interface for communicating with various resources, where the underlying network automatically routes traffic and manages resource multiplexing. Our benchmarks show that DUA provides simple and fair-share resource access with small logic area overhead (<10%) and negligible latency (<0.2 μs). We also build two practical multi-FPGA applications, deep crossing and regular expression matching, on top of DUA to demonstrate its usability and efficiency.

Speakers
RS

Ran Shu

Microsoft Research
PC

Peng Cheng

Microsoft Research
GC

Guo Chen

Microsoft Research & Hunan University
ZG

Zhiyuan Guo

Microsoft Research & Beihang University
LQ

Lei Qu

Microsoft Research
YX

Yongqiang Xiong

Microsoft Research
DC

Derek Chiou

Microsoft Azure
TM

Thomas Moscibroda

Microsoft Azure


Tuesday February 26, 2019 1:55pm - 2:20pm EST
Constitution Ballroom

1:55pm EST

The Storage Architecture of Intel's Data Management Platform (DMP)
This talk will discuss the storage architecture employed by Intel's Data Management Platform (DMP). The DMP is a rack-centric cluster design that employs an Ethernet-based fabric as its cluster interconnect; the default is a 3-stage Clos topology. The cluster's storage provides no redundancy and instead puts the burden on stateful micro-services to deal with their own redundancy requirements.

We will provide an overview of the DMP. Next, we'll drill into the details of the storage subsystem, which is composed of Intel's RSD Pod Manager along with LINBIT's LINSTOR storage orchestrator. In this section of the talk, we will include a performance characterization of the two volume types using FIO.

A DMP cluster is managed by Kubernetes, with network and storage resources managed by Container Network and Storage Interface (CNI/CSI) providers. While DMP volumes provide no redundancy, they are persistent and have a zone label attached to them. This use of the Kubernetes zone label concept is a key aspect of the DMP storage implementation, as it ensures that stateful micro-services hosted on the platform are distributed across the cluster's fault domains. The stateful micro-service is then responsible for providing sufficient data redundancy to satisfy its availability and durability requirements.

(i) NVMe-over-Fabric (NVMe-oF) Based Remote Logical Volumes Optimized for Large Sequential I/O

The DMP disaggregates physical storage devices from compute servers to allow storage capacity to scale independent of compute. The disaggregated storage devices are then pooled by an open-source, cluster-wide volume manager called LINSTOR. LINBIT's framework is integrated with the cluster's k8s-based orchestration/scheduler function via LINBIT's Container Storage Interface (CSI) implementation. Logical volumes are provisioned from this pool and made available via NVMe-oF to k8s-managed Pods running on the compute servers. These logical volumes are optimized for large sequential I/Os and are used to replace HDDs.

(ii) Local Logical Volumes Optimized for Optane DC Persistent Memory (DCPM)

Compute servers in the DMP are outfitted with Optane DCPM. These persistent DIMMs are also pooled by LINSTOR and made available within Kubernetes as logical volumes. In the case of Optane DCPM, LINSTOR uses LVM to carve/provision logical volumes out of an NVDIMM namespace.

After we review the storage subsystem, we will provide overviews of two workloads that are priorities for initial DMP deployments. The first is a Spark-based AI/analytics pipeline that uses Minio's S3-compatible object store as a replacement for HDFS. The second is a MySQL/MariaDB transactional database on shared storage. To the best of our knowledge, this is the first open source transactional database that supports shared storage.

Finally, we'll conclude with an update on the status of the DMP effort, review preliminary performance results, and provide a few parting thoughts on the next steps for the DMP.

Speakers

Tuesday February 26, 2019 1:55pm - 2:20pm EST
Independence Ballroom

2:00pm EST

Storage Gardening: Using a Virtualization Layer for Efficient Defragmentation in the WAFL File System
As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations as well as internal operations, such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can gather intra-block fragmentation, which leads to wasted free space. This paper describes how the NetApp® WAFL® file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The paper analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.


Tuesday February 26, 2019 2:00pm - 2:30pm EST
Grand Ballroom

2:20pm EST

Stardust: Divide and Conquer in the Data Center Network
Building scalable data centers, and network devices that fit within these data centers, has become increasingly hard. With modern switches pushing the boundary of manufacturing feasibility, being able to build suitable and scalable network fabrics becomes critically important. We introduce Stardust, a fabric architecture for data-center-scale networks, inspired by network-switch systems. Stardust combines packet switches at the edge with disaggregated cell switches in the network fabric, using scheduled traffic. Stardust is a distributed solution that addresses the scale limitations of network-switch design, while also offering improved performance and power savings compared with traditional solutions. With ever-increasing networking requirements, Stardust predicts the elimination of packet switches, replaced by cell switches in the network and smart network hardware at the hosts.

Speakers
NZ

Noa Zilberman

University of Cambridge
GB

Gabi Bracha

Broadcom


Tuesday February 26, 2019 2:20pm - 2:45pm EST
Constitution Ballroom

2:20pm EST

scoutfs: Large Scale POSIX Archiving
scoutfs is an open source clustered POSIX file system built to support archiving of very large file sets. This talk will quickly summarize the challenges faced by sites that are managing large archives. We'll then explore the technical details of the persistent structures and network protocols that allow scoutfs to efficiently update and index file system metadata concurrently across a cluster. We'll see the interfaces that scoutfs provides on top of these mechanisms which allow management software to track the life cycle of billions of archived files.

Speakers
ZB

Zach Brown

Versity, Inc.
Zach Brown has been working on the Linux kernel for a while now and has most recently focused on file systems, particularly Lustre, OCFS2, and btrfs. He's also helped organize previous Linux storage workshops and has given talks at Linux conferences including OLS, LCA, and LinuxT... Read More →


Tuesday February 26, 2019 2:20pm - 2:45pm EST
Independence Ballroom

2:30pm EST

Pay Migration Tax to Homeland: Anchor-based Scalable Reference Counting for Multicores
The operating system community has been combating scalability bottlenecks for the past 10 years, with victories over all the then-new multicore hardware. File systems, however, are still in the midst of that turmoil. One of the culprits behind performance degradation is the reference counting widely used for managing data and metadata: scalability is badly impacted under loads with little or no logical contention, exactly where scalability is desperately needed. To address this, we propose PAYGO, a reference counting technique that combines a per-core hash of local reference counters with an anchor counter to make concurrent counting scalable as well as space-efficient, without any other delay for managing counters. PAYGO imposes the restriction that a decrement must be performed on the original local counter where the increment occurred, so that reclaiming zero-valued local counters can be done immediately. To this end, we enforce that processes which have migrated to different cores update the anchor counter associated with the original local counter. We implemented PAYGO in the Linux page cache, so our implementation is transparent to the file system. Experimental evaluation with underlying file systems (ext4, F2FS, btrfs, and XFS) demonstrated that PAYGO scales file systems better than other state-of-the-art techniques.
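The counting discipline can be shown on a single object. This is a simplified, hypothetical sketch (the real system keeps a per-core hash of counters and runs lock-free in the kernel): increments always hit the current core's local counter, while a decrement from a process that has since migrated pays the "migration tax" by updating the anchor counter instead of touching a remote core's local counter.

```python
# Simplified single-object sketch of PAYGO-style counting (invented names):
# increments go to the current core's local counter; a migrated process's
# decrement updates the anchor counter rather than a remote local counter.
class PaygoRef:
    def __init__(self, ncores):
        self.local = [0] * ncores
        self.anchor = 0

    def inc(self, core):
        self.local[core] += 1
        return core                        # the origin to remember for dec

    def dec(self, core, origin):
        if core == origin:
            self.local[origin] -= 1        # fast path: no migration happened
        else:
            self.anchor -= 1               # migrated: pay the migration tax

    def value(self):
        return sum(self.local) + self.anchor

    def can_reclaim(self, core):
        # A zero local counter is reclaimable immediately: decrements never
        # land on another core's local counter.
        return self.local[core] == 0
```

Because remote decrements are diverted to the anchor, a local counter that reaches zero can be freed without scanning or synchronizing with other cores.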

Speakers
SJ

Seokyong Jung

Hanyang University
JK

Jongbin Kim

Hanyang University
MR

Minsoo Ryu

Hanyang University
SK

Sooyong Kang

Hanyang University
HJ

Hyungsoo Jung

Hanyang University


Tuesday February 26, 2019 2:30pm - 3:00pm EST
Grand Ballroom

2:45pm EST

Blink: Fast Connectivity Recovery Entirely in the Data Plane
In this paper, we explore new possibilities, created by programmable switches, for fast rerouting upon signals triggered by Internet traffic disruptions. We present Blink, a data-driven system exploiting TCP-induced signals to detect failures. The key intuition behind Blink is that a TCP flow exhibits a predictable behavior upon disruption: retransmitting the same packet over and over, at epochs exponentially spaced in time. When compounded over multiple flows, this behavior creates a strong and characteristic failure signal. Blink efficiently analyzes TCP flows, at line rate, to: (i) select flows to track; (ii) reliably and quickly detect major traffic disruptions; and (iii) recover data-plane connectivity, via next-hops compatible with the operator’s policies.

We present an end-to-end implementation of Blink in P4 together with an extensive evaluation on real and synthetic traffic traces. Our results indicate that Blink: (i) can achieve sub-second rerouting for realistic Internet traffic; (ii) prevents unnecessary traffic shifts, in the presence of noise; and (iii) scales to protect large fractions of realistic Internet traffic, on existing hardware. We further show the feasibility of Blink by running our system on a real Tofino switch.
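The failure signal can be approximated in a few lines of host code. This is a rough Python sketch of the logic only (Blink itself runs at line rate in P4 on a switch, and the threshold here is invented): a tracked flow that resends the same sequence number is retransmitting, and when enough tracked flows retransmit at once, the monitored next-hop is deemed broken.

```python
# Rough sketch of Blink's failure signal (threshold invented; the real system
# runs in P4 at line rate): a flow resending the same sequence number is
# retransmitting; many simultaneous retransmitting flows signal a failure.
class BlinkDetector:
    def __init__(self, flow_fraction=0.5):
        self.last_seq = {}                 # flow id -> last sequence seen
        self.retransmitting = set()
        self.threshold = flow_fraction

    def packet(self, flow, seq):
        if self.last_seq.get(flow) == seq: # same seq again: a retransmission
            self.retransmitting.add(flow)
        else:
            self.retransmitting.discard(flow)
            self.last_seq[flow] = seq

    def failure(self):
        tracked = len(self.last_seq)
        return (tracked > 0 and
                len(self.retransmitting) / tracked >= self.threshold)
```

Compounding the signal over many flows is what makes it robust: a single flow retransmitting is noise, while most tracked flows retransmitting together is characteristic of a shared path failure.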

Speakers

Tuesday February 26, 2019 2:45pm - 3:10pm EST
Constitution Ballroom

2:45pm EST

Skyhook: Programmable Storage for Databases
Ceph is an open source distributed storage system that is object-based and massively scalable. Ceph provides developers with the capability to create data interfaces that can take advantage of local CPU and memory on the storage nodes (Ceph Object Storage Devices). These interfaces are powerful for application developers and can be created in C, C++, and Lua.

Skyhook is an open source storage and database project in the Center for Research in Open Source Software at UC Santa Cruz. Skyhook uses these capabilities in Ceph to create specialized read/write interfaces that leverage IO and CPU within the storage layer toward database processing and management. Specifically, we develop methods to apply predicates locally as well as additional metadata and indexing capabilities using Ceph's internal indexing mechanism built on top of RocksDB.

Skyhook's approach helps to enable scale-out of a single node database system by scaling out the storage layer. Our results show the performance benefits for some queries indeed scale well as the storage layer scales out.
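The predicate-pushdown idea can be sketched independently of Ceph. The code below is a hedged illustration, not Ceph's object-class API (real Skyhook interfaces are written in C++ or Lua against the cls mechanism): the filter function runs where the object lives, so only matching rows cross the network back to the client.

```python
# Hedged sketch of predicate pushdown (not Ceph's actual cls API): the filter
# executes on the storage side, so only matching rows travel to the client.
def storage_side_scan(rows, predicate):
    """Pretend this executes inside the OSD hosting the object."""
    return [r for r in rows if predicate(r)]

def client_query(osd_objects, predicate):
    result = []
    for rows in osd_objects:               # each element = one object's rows
        result.extend(storage_side_scan(rows, predicate))
    return result
```

Scaling the storage layer then scales the filtering work too, which is why queries whose selectivity is high benefit the most.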

Speakers
JL

Jeff LeFevre

University of California, Santa Cruz
Jeff LeFevre is an Assistant Adjunct Professor of Computer Science and Engineering at UC Santa Cruz where he does data management research and leads the Skyhook project within the Center for Research on Open Source Software (CROSS). He received his PhD from UC Santa Cruz with work... Read More →
NW

Noah Watkins

Red Hat
Noah Watkins is a software engineer at Red Hat. He received his PhD from UC Santa Cruz in 2018 where he focused his research on the programmability of distributed storage systems.


Tuesday February 26, 2019 2:45pm - 3:10pm EST
Independence Ballroom

3:00pm EST

Speculative Encryption on GPU Applied to Cryptographic File Systems
Due to the processing of cryptographic functions, Cryptographic File Systems (CFSs) may require significant processing capacity. Parallel processing techniques on CPUs or GPUs can be used to meet this demand. The CTR mode has two particularly useful features: it is fully parallelizable, and the initial step of the encryption process can be performed ahead of time, generating encryption masks. This work presents an innovative approach in which CTR mode is applied in the context of CFSs to exploit these characteristics, including the anticipated production of cipher masks (speculative encryption) on GPUs. We present techniques for the generation, storage, and management of nonces, an essential component of CTR-mode operation in the context of CFSs. For GPU processing, our methods handle the encryption contexts and control the production of masks, aiming to produce them sufficiently in advance to overcome the extra latency due to encryption tasks. The techniques were applied in the implementation of EncFS++, a user-space CFS. Performance analyses showed that it is possible to achieve significant gains in throughput and CPU efficiency in several scenarios. They also demonstrated that GPU processing can be efficiently applied to the CFS encryption workload even when encrypting small amounts of data (4 KiB), and in scenarios where higher-speed/lower-latency storage devices are used, such as SSDs or memory.
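The property that enables speculation is visible in CTR mode's structure: the masks depend only on the key, nonce, and counter, never on the data. The sketch below uses SHA-256 as a stand-in PRF purely for illustration (a real CFS would use a block cipher such as AES, typically on the GPU), but the shape of the computation is the same.

```python
import hashlib

BLOCK = 16  # bytes per mask, matching a 128-bit block cipher

def ctr_masks(key: bytes, nonce: bytes, nblocks: int):
    """Speculative step: masks depend only on key, nonce, and counter, so they
    can be produced (e.g., on a GPU) before the plaintext even exists.
    SHA-256 is a stand-in PRF here; a real CFS would use AES."""
    return [hashlib.sha256(key + nonce + i.to_bytes(8, "big")).digest()[:BLOCK]
            for i in range(nblocks)]

def xor_with_masks(data: bytes, masks):
    """Once masks are ready, encryption and decryption are the same XOR."""
    out = bytearray(data)
    for i in range(0, len(data), BLOCK):
        m = masks[i // BLOCK]
        for j in range(min(BLOCK, len(data) - i)):
            out[i + j] ^= m[j]
    return bytes(out)
```

Because the XOR is trivial compared to mask generation, producing masks ahead of time moves essentially all of the cryptographic latency off the file system's critical write path.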

Speakers
WM

Wagner M. Nunan Zola

Federal University of Paraná
LC

Luis C. Erpen de Bona

Federal University of Paraná
VE

Vandeir Eduardo

Federal University of Paraná and University of Blumenau


Tuesday February 26, 2019 3:00pm - 3:30pm EST
Grand Ballroom

3:10pm EST

Break with Refreshments
Tuesday February 26, 2019 3:10pm - 3:40pm EST
Grand Ballroom Foyer

3:10pm EST

Break with Refreshments
Tuesday February 26, 2019 3:10pm - 3:45pm EST
Grand Ballroom Foyer

3:30pm EST

Break with Refreshments
Tuesday February 26, 2019 3:30pm - 4:00pm EST
Grand Ballroom Foyer

3:40pm EST

Hydra: a federated resource manager for data-center scale analytics
Microsoft's internal data lake processes exabytes of data over millions of cores daily on behalf of thousands of tenants. Scheduling this workload requires 10x to 100x more decisions per second than existing, general-purpose resource management frameworks are known to handle. In 2013, we were faced with a growing demand for workload diversity and richer sharing policies that our legacy system could not meet. In this paper, we present Hydra, the resource management infrastructure we built to meet these requirements.

Hydra leverages a federated architecture, in which a cluster is comprised of multiple, loosely coordinating subclusters. This allows us to scale by delegating placement of tasks on machines to each sub-cluster, while centrally coordinating only to ensure that tenants receive the right share of resources. To adapt to changing workload and cluster conditions promptly, Hydra's design features a control plane that can push scheduling policies across tens of thousands of nodes within seconds. This feature combined with the federated design allows for great agility in developing, evaluating, and rolling out new system behaviors.

We built Hydra by leveraging, extending, and contributing our code to Apache Hadoop YARN. Hydra is currently the primary big-data resource manager at Microsoft. Over the last few years, Hydra has scheduled nearly one trillion tasks that manipulated close to a Zettabyte of production data.


Tuesday February 26, 2019 3:40pm - 4:05pm EST
Constitution Ballroom

3:45pm EST

Deep Dive into Ceph Block Storage
Ceph's object storage system allows users to mount Ceph as a thin-provisioned block device known as the RADOS Block Device (RBD). This talk delves deep into RBD, its design, and its features. In this session, we will discuss:
  • What creating an RBD image entails—RBD data and metadata
  • Prominent features like striping, snapshots, and cloning
  • How RBD is configured in a virtualized setup using libvirt/qemu

Speakers
MC

Mahati Chamarthy

Intel
Mahati Chamarthy has been contributing to storage technologies for the past few years. She was a core developer for OpenStack Object Storage (Swift) and now an active contributor to Ceph. She works as a Cloud Software Engineer with Intel's Open Source Technology Center focusing on... Read More →


Tuesday February 26, 2019 3:45pm - 4:10pm EST
Independence Ballroom

4:00pm EST

Sketching Volume Capacities in Deduplicated Storage
The adoption of deduplication in storage systems has introduced significant new challenges for storage management. Specifically, the physical capacities associated with volumes are no longer readily available. In this work we introduce a new approach to analyzing capacities in deduplicated storage environments. We provide sketch-based estimations of fundamental capacity measures required for managing a storage system: How much physical space would be reclaimed if a volume or group of volumes were to be removed from a system (the "reclaimable" capacity) and how much of the physical space should be attributed to each of the volumes in the system (the "attributed" capacity). Our methods also support capacity queries for volume groups across multiple storage systems, e.g., how much capacity would a volume group consume after being migrated to another storage system? We provide analytical accuracy guarantees for our estimations as well as empirical evaluations. Our technology is integrated into a prominent all-flash storage array and exhibits high performance even for very large systems. We also demonstrate how this method opens the door for performing placement decisions at the data center level and obtaining insights on deduplication in the field.
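Both capacity measures are straightforward to define exactly; the paper's contribution is estimating them from small sketches instead of full chunk maps, with accuracy guarantees. A hypothetical exact baseline over per-volume sets of chunk fingerprints (counting unit-sized chunks for simplicity):

```python
def reclaimable(volumes: dict, group: set) -> int:
    """Physical space freed if every volume in `group` were deleted:
    the chunks referenced only by volumes inside the group.
    (Counts unit-sized chunks; a real system would weigh by chunk size.)"""
    inside, outside = set(), set()
    for vol, chunks in volumes.items():
        (inside if vol in group else outside).update(chunks)
    return len(inside - outside)

def attributed(volumes: dict) -> dict:
    """Split each chunk's space evenly among the volumes referencing it,
    so the per-volume shares sum to the total physical capacity."""
    owners = {}
    for vol, chunks in volumes.items():
        for c in chunks:
            owners.setdefault(c, []).append(vol)
    share = {vol: 0.0 for vol in volumes}
    for c, vols in owners.items():
        for vol in vols:
            share[vol] += 1.0 / len(vols)
    return share

# Volumes mapped to the fingerprints of their deduplicated chunks.
vols = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {4, 5}}
assert reclaimable(vols, {"A"}) == 1        # only chunk 1 is exclusive to A
assert reclaimable(vols, {"A", "B"}) == 3   # chunks 1, 2, 3
assert abs(sum(attributed(vols).values()) - 5) < 1e-9  # 5 physical chunks
```

A sketch-based estimator would replace these full fingerprint sets with fixed-size samples of the fingerprint space, trading exactness for memory.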

Speakers
DH

Danny Harnik

IBM Research
YS

Yosef Shatsky

IBM Systems
AE

Amir Epstein

Citi Innovation Lab TLV
RK

Ronen Kat

IBM Research


Tuesday February 26, 2019 4:00pm - 4:30pm EST
Grand Ballroom

4:05pm EST

Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure
Serverless computing is poised to fulfill the long-held promise of transparent elasticity and millisecond-level pricing. To achieve this goal, service providers impose a fine-grained computational model where every function has a maximum duration, a fixed amount of memory, and no persistent local storage. We observe that the fine-grained elasticity of serverless is key to achieving high utilization for general computations such as analytics workloads, but that resource limits make it challenging to implement such applications as they need to move large amounts of data between functions that don't overlap in time. In this paper, we present Locus, a serverless analytics system that judiciously combines (1) cheap but slow storage with (2) fast but expensive storage, to achieve good performance while remaining cost-efficient. Locus applies a performance model to guide users in selecting the type and the amount of storage to achieve the desired cost-performance trade-off. We evaluate Locus on a number of analytics applications including TPC-DS, CloudSort, and the Big Data Benchmark, and show that Locus can navigate the cost-performance trade-off, leading to 4×–500× performance improvements over a slow-storage-only baseline and reducing resource usage by up to 59% while achieving comparable performance to a cluster of virtual machines, and running at most 1.99× slower than Redshift.
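At its core, Locus's performance model is a search over storage mixes under a cost-time trade-off. A deliberately toy version (the prices, bandwidths, and linear model are all invented for illustration; the paper's model is far more detailed):

```python
def shuffle_cost(data_gb: float, fast_gb: float,
                 slow_price=0.01, fast_price=0.15,
                 slow_bw=1.0, fast_bw=10.0):
    """Toy cost/time model: data staged on fast storage moves at fast_bw
    GB/s, the remainder at slow_bw GB/s; prices are per GB staged.
    All numbers here are made up."""
    fast_gb = min(fast_gb, data_gb)
    time = fast_gb / fast_bw + (data_gb - fast_gb) / slow_bw
    cost = fast_gb * fast_price + (data_gb - fast_gb) * slow_price
    return time, cost

def cheapest_config(data_gb: float, deadline: float):
    """Least-cost amount of fast storage that still meets the deadline,
    or None if no configuration can."""
    best = None
    for fast_gb in range(0, int(data_gb) + 1, 10):
        t, c = shuffle_cost(data_gb, fast_gb)
        if t <= deadline and (best is None or c < best[1]):
            best = (fast_gb, c)
    return best

fast_gb, cost = cheapest_config(100, deadline=20)
# With these made-up numbers, 90 GB of fast storage is the cheapest
# configuration that meets the 20-second target.
assert fast_gb == 90
```

The design point this illustrates: rather than always buying fast storage, the model finds the smallest fast-storage footprint that keeps the shuffle within the performance target.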

Speakers
QP

Qifan Pu

UC Berkeley
SV

Shivaram Venkataraman

University of Wisconsin, Madison
IS

Ion Stoica

UC Berkeley


Tuesday February 26, 2019 4:05pm - 4:30pm EST
Constitution Ballroom

4:10pm EST

Mindcastle.io: Secure Distributed Block Device for Edge and Cloud
Camera-based smart IoT sensors will soon be everywhere. The recent success of Deep Neural Networks (DNNs) has opened the door to new computer vision and AI applications. While initial deployments use high-end server-class hardware with expensive and power-hungry GPUs, optimizations and algorithmic improvements will soon make running the inference side of DNNs on low-cost Edge Computing devices commonplace. These devices will need software, and this software needs to be continually updated, both to keep pace with the rapid development of machine learning/AI methods and datasets, and to keep their operating system and middleware installs tamper-proof and secure. To this end, we have been building Mindcastle, a serverless distributed block storage system with strong cryptographic integrity, built-in compression, and incremental atomic updates. Mindcastle is based on a highly performant and flash-friendly LSM-like data structure, first developed at Bromium where it served as the storage foundation of Bromium's Xen-derived uXen hypervisor, which has hosted millions of strongly isolated Micro-VMs across many security-sensitive installations worldwide.

Speakers
JG

Jacob Gorm Hansen

Vertigo.ai
Jacob Gorm Hansen is the founder of Vertigo.ai, an AI startup that focuses on AI for Edge computing. Jacob has a long track record of innovative computer systems development and research. After cutting his teeth as a senior programmer on the Hitman games franchise, he returned to academia... Read More →


Tuesday February 26, 2019 4:10pm - 4:35pm EST
Independence Ballroom

4:30pm EST

dShark: A General, Easy to Program and Scalable Framework for Analyzing In-network Packet Traces
Distributed, in-network packet capture is still the last resort for diagnosing network problems. Despite recent advances in collecting packet traces scalably, effectively utilizing pervasive packet captures still poses important challenges. Arbitrary combinations of middleboxes which transform packet headers make it challenging to even identify the same packet across multiple hops; packet drops in the collection system create ambiguities that must be handled; the large volume of captures, and their distributed nature, make it hard to do even simple processing; and the one-off and urgent nature of problems tends to generate ad-hoc solutions that are not reusable and do not scale. In this paper we propose dShark to address these challenges. dShark allows intuitive groupings of packets across multiple traces that are robust to header transformations and capture noise, offering simple streaming data abstractions for network operators. Using dShark on production packet captures from a major cloud provider, we show that dShark makes it easy to write concise and reusable queries against distributed packet traces that solve many common problems in diagnosing complex networks. Our evaluation shows that dShark can analyze production packet traces with more than 10 Mpps throughput on a commodity server, and has near-linear speedup when scaling out on multiple servers.

Speakers
DY

Da Yu

Brown University
YZ

Yibo Zhu

Microsoft and ByteDance
BA

Behnaz Arzani

Microsoft
RF

Rodrigo Fonseca

Brown University
KD

Karl Deng

Microsoft
LY

Lihua Yuan

Microsoft


Tuesday February 26, 2019 4:30pm - 4:55pm EST
Constitution Ballroom

4:30pm EST

Finesse: Fine-Grained Feature Locality based Fast Resemblance Detection for Post-Deduplication Delta Compression
In storage systems, delta compression is often used as a complementary data reduction technique for data deduplication because it is able to eliminate redundancy among the non-duplicate but highly similar chunks. Currently, what we call 'N-transform Super-Feature' (N-transform SF) is the most popular and widely used approach to computing data similarity for detecting delta compression candidates. But our observations suggest that the N-transform SF is compute-intensive: it needs to linearly transform each Rabin fingerprint of the data chunks N times to obtain N features, and it can be simplified by exploiting the fine-grained feature locality existing among highly similar chunks to eliminate the time-consuming linear transformations. Therefore, we propose Finesse, a fine-grained feature-locality-based fast resemblance detection approach that divides each chunk into several fixed-sized subchunks, computes features from these subchunks individually, and then groups the features into super-features. Experimental results show that, compared with the state-of-the-art N-transform SF approach, Finesse accelerates the similarity computation for resemblance detection by 3.2× ~ 3.5× and increases the final throughput of a deduplicated and delta compressed prototype system by 41% ~ 85%, while achieving comparable compression ratios.
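The subchunk-feature idea can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: SHA-1 replaces Rabin fingerprints, and the subchunk/super-feature counts are arbitrary choices:

```python
import hashlib

def superfeatures(chunk: bytes, n_subchunks: int = 12, n_sf: int = 3) -> list:
    """Finesse-style resemblance features (illustrative sketch): split the
    chunk into fixed-size subchunks, take one feature per subchunk (the
    max of a sliding 8-byte hash; SHA-1 stands in for Rabin fingerprints),
    then group consecutive features into super-features. Chunks sharing a
    super-feature are candidates for delta compression."""
    sub_len = max(1, len(chunk) // n_subchunks)
    features = []
    for i in range(n_subchunks):
        sub = chunk[i * sub_len:(i + 1) * sub_len]
        features.append(max(
            int.from_bytes(hashlib.sha1(sub[j:j + 8]).digest()[:8], "big")
            for j in range(0, max(1, len(sub) - 7), 8)
        ))
    per_sf = n_subchunks // n_sf
    sfs = []
    for g in range(n_sf):
        group = features[g * per_sf:(g + 1) * per_sf]
        sfs.append(hashlib.sha1(repr(group).encode()).hexdigest())
    return sfs

a = bytes(range(256)) * 48            # a 12 KiB chunk
b = bytes([a[0] ^ 0xFF]) + a[1:]      # highly similar: one byte differs
sa, sb = superfeatures(a), superfeatures(b)
# The edit can disturb at most the first subchunk's feature, so the
# other super-features still match and b is detected as resembling a.
assert sum(x == y for x, y in zip(sa, sb)) >= 2
```

The saving over N-transform SF is visible here: each feature is read directly from one subchunk, with no per-fingerprint linear transformations.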

Speakers
YZ

Yucheng Zhang

Hubei University of Technology
WX

Wen Xia

Harbin Institute of Technology, Shenzhen & Peng Cheng Laboratory
DF

Dan Feng

WNLO, School of Computer, Huazhong University of Science and Technology
HJ

Hong Jiang

University of Texas at Arlington
YH

Yu Hua

WNLO, School of Computer, Huazhong University of Science and Technology
QW

Qiang Wang

WNLO, School of Computer, Huazhong University of Science and Technology


Tuesday February 26, 2019 4:30pm - 5:00pm EST
Grand Ballroom

4:35pm EST

IO and cgroups, the Current and Future Work
Resource isolation for IO has been incomplete for years, making it very hard to build fully isolated containers on Linux. With the recent development of the blk-iolatency controller this has started to change, hopefully marking the start of being able to build systems with complete resource isolation.
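For context, the blk-iolatency controller is configured through the cgroup-v2 io.latency interface file. A minimal sketch, per the kernel's cgroup-v2 documentation (the cgroup path, device number 8:0, and target value are placeholders):

```shell
# Requires a unified (cgroup v2) hierarchy mounted at /sys/fs/cgroup.
mkdir -p /sys/fs/cgroup/container1

# io.latency takes "MAJOR:MINOR target=<latency in microseconds>".
# If this group's I/O completion latency on the device exceeds the
# target, the kernel throttles peer cgroups until it recovers.
echo "8:0 target=10000" > /sys/fs/cgroup/container1/io.latency

# Move the container's workload into the protected group.
echo "$$" > /sys/fs/cgroup/container1/cgroup.procs
```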

Tuesday February 26, 2019 4:35pm - 5:00pm EST
Independence Ballroom

4:55pm EST

Short Break
Tuesday February 26, 2019 4:55pm - 5:10pm EST
Grand Ballroom Foyer

5:00pm EST

Self-Encrypting Drive (SED) Standardization Proposal for NVDIMM-N Devices
A non-volatile DIMM (NVDIMM) is a Dual In-line Memory Module (DIMM) that maintains the contents of Synchronous Dynamic Random Access Memory (SDRAM) during power loss. An NVDIMM-N class device can be integrated into standard compute or storage platforms to provide non-volatility of the data in the DIMM. An NVDIMM-N relies on a byte-addressable energy-backed function to preserve the data in case of power failure. A Byte-Addressable Energy Backed Function is backed by a combination of SDRAM and non-volatile memory (e.g., NAND flash) on the NVDIMM-N. JESD245C Byte-Addressable Energy Backed Interface (BAEBI) defines the programming interface for the NVDIMM-N class of devices.

An NVDIMM-N achieves non-volatility by:
  • performing a Catastrophic Save operation to copy SDRAM contents into NVM when host power is lost using an Energy Source managed by either the module or the host
  • performing a Restore operation to copy contents from the NVM to SDRAM when power is restored

An NVDIMM-N device may be of the self-encrypting device (SED) type, which protects data at rest. This means the NVDIMM-N controller:
  • encrypts data during a Catastrophic Save operation
  • decrypts data during a Restore operation and the data is:
    • plaintext while sitting in SDRAM
    • ciphertext while sitting in NVM (e.g., flash memory)

Typically, an NVDIMM-N device may be used within a storage controller for performance acceleration of storage workloads, or as sundry storage to preserve debug information in case of power failure. When an NVDIMM-N device is used as a caching layer, transient data is staged in the NVDIMM-N device before the data is persisted/committed to the storage media. NVDIMM-N devices are also used as persistent storage media for staging memory dump files when critical failures occur at the storage subsystem level before the system goes down.

The NVDIMM-N encryption standardization proposal involves cross-pollination between JEDEC (proposed BAEBI extensions to define security protocols in conjunction with encryption capability on the device) and TCG standards (proposed TCG Storage Interface Interactions Specifications content for handling self-encrypting NVDIMM-Ns plus adapting TCG Ruby SSC for NVDIMM-N devices) with industry sponsorship from HPE and NetApp.

The talk will begin with a brief overview of NVDIMM-N devices and associated storage-centric use cases, followed by an overview of the NVDIMM-N encryption scheme and the proposed self-encrypting device standardization approach for NVDIMM-N devices, which involves the following:

  1. Extensions to the BAEBI specification to accommodate security protocol definitions in conjunction with the encryption capability in NVDIMM-N devices
  2. Extensions to TCG Storage Interface Specifications defining the Security Protocol Typed Block for handling interactions with NVDIMM-N devices
  3. Adapting TCG Ruby SSC standard for accommodating NVDIMM-N class devices

The talk will conclude by summarizing the current state of the standardization proposal and the approval process with the JEDEC and TCG working groups.

Speakers
FK

Frederick Knight

NetApp
Frederick Knight is a Principal Standards Technologist at NetApp Inc. Fred has over 40 years of experience in the computer and storage industry. He currently represents NetApp in several National and International Storage Standards bodies and industry associations, including T10 (SCSI... Read More →
SB

Sridhar Balasubramanian

NetApp
Sridhar Balasubramanian is a Principal Security Architect within Product Security Group @ NetApp RTP. With over 25 years in the software industry, Sridhar is inventor/co-inventor for 16 US Patents and published 5 Conference papers till date. Sridhar's area of expertise includes Storage... Read More →


Tuesday February 26, 2019 5:00pm - 5:25pm EST
Independence Ballroom

5:00pm EST

Sliding Look-Back Window Assisted Data Chunk Rewriting for Improving Deduplication Restore Performance
Data deduplication is an effective way of improving storage space utilization. The data generated by deduplication is persistently stored in data chunks or data containers (a container consists of a few hundred to a few thousand data chunks). The data restore process is rather slow due to data fragmentation and read amplification. To speed up the restore process, data chunk rewrite schemes (a rewrite stores a duplicate data chunk) have been proposed to effectively improve data chunk locality and reduce the number of container reads required to restore the original data. However, rewrites decrease the deduplication ratio, since more storage space is used to store the duplicate data chunks.

To remedy this, we focus on reducing the data fragmentation and read amplification of container-based deduplication systems. We first propose a flexible container reference-count-based rewrite scheme, which can make a better tradeoff between the deduplication ratio and the number of required container reads than capping, an existing rewrite scheme. To further improve the rewrite candidate selection accuracy, we propose a sliding look-back window based design, which can make more accurate rewrite decisions by considering the caching effect, data chunk localities, and data chunk closeness in the current and future windows. According to our evaluation, our proposed approach always achieves higher restore performance than capping, especially when the reduction in deduplication ratio is small.
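As a concrete (and much simplified) illustration of the underlying trade-off, a capping-style rewrite decision can be sketched in a few lines. This toy is an illustrative stand-in, not the paper's sliding-window algorithm:

```python
from collections import Counter

def rewrite_decisions(chunk_containers, cap):
    """Capping-style rewrite sketch: within one segment of the backup
    stream, keep duplicate references only to the `cap` containers
    holding the most of this segment's chunks; duplicates in the
    remaining containers are rewritten, so a restore of the segment
    touches at most `cap` old containers."""
    counts = Counter(chunk_containers)
    kept = {c for c, _ in counts.most_common(cap)}
    return ["refer" if c in kept else "rewrite" for c in chunk_containers]

# Containers holding each duplicate chunk of one segment, in stream order.
segment = ["c1", "c1", "c2", "c3", "c1", "c2", "c4"]
decisions = rewrite_decisions(segment, cap=2)
# c1 (3 refs) and c2 (2 refs) are kept; chunks in c3 and c4 are rewritten,
# trading some deduplication ratio for fewer container reads on restore.
assert decisions == ["refer", "refer", "refer", "rewrite",
                     "refer", "refer", "rewrite"]
```

The sliding look-back window design in the paper refines exactly this decision by also accounting for the restore cache and for chunk closeness across neighboring windows.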

Speakers
ZC

Zhichao Cao

University of Minnesota
SL

Shiyong Liu

Ocean University of China
FW

Fenggang Wu

University of Minnesota
GW

Guohua Wang

South China University of Technology
BL

Bingzhe Li

University of Minnesota
DH

David H.C. Du

University of Minnesota


Tuesday February 26, 2019 5:00pm - 5:30pm EST
Grand Ballroom

5:10pm EST

Minimal Rewiring: Efficient Live Expansion for Clos Data Center Networks
Clos topologies have been widely adopted for large-scale data center networks (DCNs), but it has been difficult to support incremental expansions for Clos DCNs. Some prior work has claimed that the structure of Clos topologies hinders incremental expansion.

We demonstrate that it is indeed possible to design expandable Clos DCNs, and to expand them while they are carrying live traffic, without incurring packet loss. We use a layer of patch panels between blocks of switches in a Clos DCN, which makes physical rewiring feasible, and we describe how to use integer linear programming (ILP) to minimize the number of patch-panel connections that must be changed, which makes expansions faster and cheaper. We also describe a block-aggregation technique that makes our ILP approach scalable. We tested our "minimal-rewiring" solver on two kinds of fine-grained expansions using 2250 synthetic DCN topologies, and found that the solver can handle 99% of these cases while changing under 25% of the connections. Compared to prior approaches, this solver (on average) reduces the average number of "stages" per expansion from 4 to 1.29, and reduces the number of wires changed by an order of magnitude or more—a significant improvement to our operational costs, and to our exposure (during expansions) to capacity-reducing faults.

Speakers
SZ

Shizhen Zhao

Google, Inc.
RW

Rui Wang

Google, Inc.
JZ

Junlan Zhou

Google, Inc.
JO

Joon Ong

Google, Inc.
JC

Jeffrey C. Mogul

Google, Inc.
AV

Amin Vahdat

Google Inc.


Tuesday February 26, 2019 5:10pm - 5:35pm EST
Constitution Ballroom

5:25pm EST

Dinner (on your own)
Tuesday February 26, 2019 5:25pm - 7:00pm EST
N/A

5:35pm EST

Understanding Lifecycle Management Complexity of Datacenter Topologies
Most recent datacenter topology designs have focused on performance properties such as latency and throughput. In this paper, we explore a new dimension, life cycle management, which attempts to capture operational costs of topologies. Specifically, we consider costs associated with deployment and expansion of topologies and explore how structural properties of two different topology families (Clos and expander graphs as exemplified by Xpander) affect these. We also develop a new topology that has the wiring simplicity of Clos and the expandability of expander graphs using the insights from our study.

Speakers
MZ

Mingyang Zhang

University of Southern California
SS

Sucha Supittayapornpong

University of Southern California
RG

Ramesh Govindan

University of Southern California


Tuesday February 26, 2019 5:35pm - 6:00pm EST
Constitution Ballroom

6:00pm EST

Shoal: A Network Architecture for Disaggregated Racks
Disaggregated racks comprise a dense cluster of separate pools of compute, memory and storage blades, all inter-connected through an internal network within a single rack. However, their density poses a unique challenge for the rack's network: it needs to connect an order of magnitude more nodes than today's racks without exceeding the rack's fixed power budget and without compromising on performance. We present Shoal, a power-efficient yet performant intra-rack network fabric built using fast circuit switches. Such switches consume less power as they have no buffers and no packet inspection mechanism, yet can be reconfigured in nanoseconds. Rack nodes transmit according to a static schedule such that there is no in-network contention without requiring a centralized controller. Shoal's congestion control leverages the physical fabric to achieve fairness and both bounded worst-case network throughput and queuing. We use an FPGA-based prototype, testbed experiments, and simulations to show that Shoal's mechanisms are practical, and can simultaneously achieve high density and high performance: 71% lower power and comparable or higher performance than today's network designs.

Speakers
VS

Vishal Shrivastav

Cornell University
AV

Asaf Valadarsky

Hebrew University of Jerusalem
HB

Hitesh Ballani

Microsoft Research
PC

Paolo Costa

Microsoft Research
KS

Ki Suh Lee

Waltz Networks
HW

Han Wang

Barefoot Networks
RA

Rachit Agarwal

Cornell University
HW

Hakim Weatherspoon

Cornell University


Tuesday February 26, 2019 6:00pm - 6:25pm EST
Constitution Ballroom

6:00pm EST

Poster Session and Reception
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks. View the complete list of accepted posters.

Tuesday February 26, 2019 6:00pm - 7:30pm EST
Back Bay Ballroom

7:00pm EST

NVM and Related Fancy BoF
Moderators
Tuesday February 26, 2019 7:00pm - 8:30pm EST
Independence Ballroom

7:30pm EST

Amazon Vendor BoF: Round Table Discussion
Join Amazon leaders for a roundtable-style discussion on hot topics in the storage field. Get Amazon's top storage leaders' perspectives on the industry and how Amazon is continuing to invent and push boundaries. Pizza and beer will be available throughout the event.

Tuesday February 26, 2019 7:30pm - 8:30pm EST
Gardner Room B

7:30pm EST

Hands On With NetApp
Get hands-on with some NetApp hardware. Have you ever *seen* a filer, let alone unboxed, installed, or configured one for client I/O? Now's your chance! Beer, wine, soft drinks and snacks provided.

Tuesday February 26, 2019 7:30pm - 8:30pm EST
Grand Ballroom

7:30pm EST

USENIX Women in Advanced Computing (WiAC BoF)
Let’s talk about women in advanced computing. All registered attendees—of all genders—are welcome to attend this BoF.

Tuesday February 26, 2019 7:30pm - 8:30pm EST
Gardner Room A

7:30pm EST

Intel Vendor BoF: Building the Future Network with Intel
Networking and computing are converging in the era of edge computing, and Intel is at the center of driving that technology innovation. Come join and socialize with Intel experts to learn about state-of-the-art networking and systems technologies. We will walk you through a broad range of new technologies, from silicon innovations in CPUs, I/O, and accelerators to open-source software. Software-defined infrastructure on Intel platforms is accelerating 5G networks, edge services, immersive video, and confidential computing. Technology challenges remain in AI, network analytics, and achieving deterministic performance and low latency. We want to hear your big ideas for addressing these challenges and to build joint success through technology collaboration and investment.

The future begins now, build your future success with Intel.


Tuesday February 26, 2019 7:30pm - 8:30pm EST
Jefferson Room

8:00pm EST

A Paradigm Shift in Storage by Bringing Compute to Data with Computational Storage (BoF)
The industry is seeing an increase in customer requirements to move compute closer to traditional storage devices and systems. In response, a growing number of data-driven applications have demonstrated that adding computation to the normal storage features of devices and systems can realize a significant performance and infrastructure scaling advantage. Computational Storage solutions typically target applications where the demand to process ever-growing storage workloads is outpacing traditional compute server architectures. These applications include AI, big data, content delivery, database, machine learning and many others that are used industry-wide. SNIA has established a new technical workgroup to facilitate the use of computational storage in mainstream application environments. This session will discuss their work activities and plans to promote interoperability of devices, and standards for system deployment, provisioning, management and security.

Tuesday February 26, 2019 8:00pm - 9:00pm EST
Grand Ballroom

8:30pm EST

Data Storage Research Vision 2025 (BoF)
With the rapidly evolving computing paradigms and storage hardware, there is an urgent need for a consolidated effort to identify and establish a vision for storage systems research. The National Science Foundation's (NSF) "Visioning Workshop on Data Storage Research 2025" brought together a number of storage researchers from academia, industry, national laboratories, and federal agencies to develop a collective vision for future storage research (https://sites.google.com/vt.edu/data-storage-research/home). This BoF will share the findings from this visioning workshop with the community.

Tuesday February 26, 2019 8:30pm - 9:30pm EST
Gardner Room B

8:30pm EST

Students and Young Professionals Meetup
Come for the refreshments, stay for the opportunity to meet and network with other students and young professionals attending FAST, NSDI, and Vault.

Tuesday February 26, 2019 8:30pm - 9:30pm EST
Gardner Room A

8:30pm EST

SMB/NFS/NMLOP BoF
Speakers
RW

Ric Wheeler

Facebook


Tuesday February 26, 2019 8:30pm - 10:00pm EST
Independence Ballroom

9:30pm EST

Board Game Night
Join FAST, NSDI, and Vault attendees for some good old-fashioned board games. We'll have some on hand, but bring your own games, too!


Tuesday February 26, 2019 9:30pm - 10:30pm EST
Gardner Room A
 
Wednesday, February 27
 

7:30am EST

Continental Breakfast
Wednesday February 27, 2019 7:30am - 8:30am EST
Grand Ballroom Foyer

8:00am EST

Continental Breakfast
Wednesday February 27, 2019 8:00am - 9:00am EST
Grand Ballroom Foyer

8:30am EST

NetScatter: Enabling Large-Scale Backscatter Networks
We present the first wireless protocol that scales to hundreds of concurrent transmissions from backscatter devices. Our key innovation is a distributed coding mechanism that works below the noise floor, operates on backscatter devices and can decode all the concurrent transmissions at the receiver using a single FFT operation. Our design addresses practical issues such as timing and frequency synchronization as well as the near-far problem. We deploy our design using a testbed of backscatter hardware and show that our protocol scales to concurrent transmissions from 256 devices using a bandwidth of only 500 kHz. Our results show throughput and latency improvements of 14–62x and 15–67x over existing approaches and 1–2 orders of magnitude higher transmission concurrency.

Speakers
MH

Mehrdad Hessar

University of Washington
AN

Ali Najafi

University of Washington
SG

Shyamnath Gollakota

University of Washington


Wednesday February 27, 2019 8:30am - 8:55am EST
Constitution Ballroom

8:55am EST

Towards Programming the Radio Environment with Large Arrays of Inexpensive Antennas
Conventional thinking treats the wireless channel as a given constraint. Therefore, wireless network designs to date center on the problem of the endpoint optimization that best utilizes the channel, for example, via rate and power control at the transmitter or sophisticated decoding mechanisms at the receiver. We instead explore whether it is possible to reconfigure the environment itself to facilitate wireless communication. In this work, we instrument the environment with a large array of inexpensive antennas (LAIA) and design algorithms to configure them in real time. Our system achieves this level of programmability through rapid adjustments of an on-board phase shifter in each LAIA device. We design a channel decomposition algorithm to quickly estimate the wireless channel due to the environment alone, which leads us to a process to align the phases of the array elements. Variations of our core algorithm can then optimize wireless channels on the fly for single- and multi-antenna links, as well as nearby networks operating on adjacent frequency bands. We design and deploy a 36-element passive array in a real indoor home environment. Experiments with this prototype show that, by reconfiguring the wireless environment, we can achieve a 24% TCP throughput improvement on average and a median improvement of 51.4% in Shannon capacity over the baseline single-antenna links. Over the baseline multi-antenna links, LAIA achieves an improvement of 12.23% to 18.95% in Shannon capacity.

Speakers
ZL

Zhuqi Li

Princeton University
YX

Yaxiong Xie

Princeton University
LS

Longfei Shangguan

Princeton University
RI

Rotman Ivan Zelaya

Yale University
JG

Jeremy Gummeson

UMass Amherst
WH

Wenjun Hu

Yale University
KJ

Kyle Jamieson

Princeton University


Wednesday February 27, 2019 8:55am - 9:20am EST
Constitution Ballroom

9:00am EST

DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching
Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to multiple clusters, the fast cache itself would become the bottleneck. Traditional mechanisms like cache partition and cache replication either result in load imbalance between cache nodes or have high overhead for cache coherence.

We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching.
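The core routing idea can be sketched directly. This toy is illustrative, not the paper's switch-based implementation: each hot object hashes to one candidate cache node per layer under independent hash functions, and each query goes to the less-loaded candidate:

```python
import hashlib

def h(layer: int, key: str, n: int) -> int:
    """One independent hash function per layer (illustrative stand-in)."""
    d = hashlib.sha256(f"{layer}:{key}".encode()).digest()
    return int.from_bytes(d[:8], "big") % n

def route(key: str, load: dict, n_nodes: int):
    """Power-of-two-choices across two cache layers: each hot object has
    one candidate node per layer; send the query to whichever candidate
    is currently less loaded."""
    a = (0, h(0, key, n_nodes))
    b = (1, h(1, key, n_nodes))
    pick = a if load[a] <= load[b] else b
    load[pick] += 1
    return pick

n = 4  # cache nodes per layer
load = {(layer, i): 0 for layer in (0, 1) for i in range(n)}
for q in range(1000):
    route(f"hot-object-{q % 16}", load, n)

# Queries spread across both layers instead of hammering one node,
# which is what lets aggregate cache throughput scale with node count.
assert sum(load.values()) == 1000
assert max(load.values()) < 1000
```

The independent hashes ensure that objects colliding on one layer are very unlikely to collide on the other, which is what the paper's expander-graph analysis formalizes.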

Speakers
ZL

Zaoxing Liu

Johns Hopkins University
ZB

Zhihao Bai

Johns Hopkins University
ZL

Zhenming Liu

College of William and Mary
XL

Xiaozhou Li

Celer Network
CK

Changhoon Kim

Barefoot Networks
VB

Vladimir Braverman

Johns Hopkins University
XJ

Xin Jin

Johns Hopkins University
IS

Ion Stoica

UC Berkeley


Wednesday February 27, 2019 9:00am - 9:30am EST
Grand Ballroom

9:20am EST

Pushing the Range Limits of Commercial Passive RFIDs
This paper asks: “Can we push the prevailing range limits of commercial passive RFIDs?”. Today’s commercial passive RFIDs report ranges of 5-15 meters at best. This constrains RFIDs to be detected only at specific checkpoints in warehouses, stores and factories today, leaving them outside of communication range beyond these spaces. State-of-the-art approaches to improve the range of RFIDs develop new tag hardware that necessarily sacrifices some of the most attractive features of passive RFIDs such as their low cost, small form-factor or the absence of a battery.

We present PushID, a system that exploits collaboration between readers to enhance the range of commercial passive RFID tags, without altering the tags whatsoever. PushID uses distributed MIMO to coherently combine signals across geographically separated RFID readers at the tags. In doing so, it resolves the chicken-or-egg problem of inferring the optimal beamforming parameters to beam energy to a tag without any feedback from the tag itself, which needs this energy to respond in the first place. A prototype evaluation of PushID with 8 distributed RFID readers reveals a range of 64 meters to the closest reader, a 7.4×, 1.2× and 1.6× improvement in range compared to state-of-the-art commercial readers and two other schemes [10, 31].

Speakers
JW

Jingxian Wang

Carnegie Mellon University
JZ

Junbo Zhang

Tsinghua University
RS

Rajarshi Saha

IIT Kharagpur
HJ

Haojian Jin

Carnegie Mellon University
SK

Swarun Kumar

Carnegie Mellon University


Wednesday February 27, 2019 9:20am - 9:45am EST
Constitution Ballroom

9:30am EST

GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction
Host-managed shingled magnetic recording drives (HM-SMR) give a capacity advantage to harness the explosive growth of data. Applications where data is sequentially written and randomly read make the HM-SMR an ideal solution due to its capacity, predictable performance, and economical cost. Key-value stores based on the Log-Structured Merge Tree (LSM-tree) data structure are a particularly good fit due to their batched sequential writes. However, building an LSM-tree based KV store on HM-SMR drives presents severe challenges in maintaining performance and space efficiency due to the redundant cleaning processes for applications and storage devices (i.e., compaction and garbage collection). To eliminate the overhead of on-disk garbage collection (GC) and improve compaction efficiency, this paper presents GearDB, a GC-free KV store tailored for HM-SMR drives, with three new techniques: a new on-disk data layout, compaction windows, and a novel gear compaction algorithm. We implement GearDB and evaluate it against LevelDB on a real HM-SMR drive. Our extensive experiments show that GearDB achieves good performance and space efficiency, i.e., on average 1.71x faster than LevelDB for random writes with a space efficiency of 89.9%.

Speakers
TY

Ting Yao

Huazhong University of Science and Technology and Temple University
JW

Jiguang Wan

Huazhong University of Science and Technology
PH

Ping Huang

Temple University
YZ

Yiwen Zhang

Huazhong University of Science and Technology
ZL

Zhiwen Liu

Huazhong University of Science and Technology
CX

Changsheng Xie

Huazhong University of Science and Technology
XH

Xubin He

Temple University


Wednesday February 27, 2019 9:30am - 10:00am EST
Grand Ballroom

9:45am EST

SweepSense: Sensing 5 GHz in 5 Milliseconds with Low-cost SDRs
Wireless transmissions occur intermittently across the entire spectrum. For example, WiFi and Bluetooth devices transmit frames across the 100 MHz-wide 2.4 GHz band, and LTE devices transmit frames between 700 MHz and 3.7 GHz. Today, only high-cost radios can sense across the spectrum with sufficient temporal resolution to observe these individual transmissions.

We present “SweepSense”, a low-cost radio architecture that senses the entire spectrum with high-temporal resolution by rapidly sweeping across it. Sweeping introduces new challenges for spectrum sensing: SweepSense radios only capture a small number of distorted samples of transmissions. To overcome this challenge, we correct the distortion with self-generated calibration data, and classify the protocol that originated each transmission with only a fraction of the transmission’s samples. We demonstrate that SweepSense can accurately identify four protocols transmitting simultaneously in the 2.4 GHz unlicensed band. We also demonstrate that it can simultaneously monitor the load of several LTE base stations operating in disjoint bands.

Speakers
YG

Yeswanth Guddeti

UC San Diego
MK

Moein Khazraee

UC San Diego
AS

Aaron Schulman

UC San Diego
DB

Dinesh Bharadia

UC San Diego


Wednesday February 27, 2019 9:45am - 10:10am EST
Constitution Ballroom

10:00am EST

SPEICHER: Securing LSM-based Key-Value Stores using Shielded Execution
We introduce Speicher, a secure storage system that not only provides strong confidentiality and integrity properties, but also ensures data freshness to protect against rollback/forking attacks. Speicher exports a Key-Value (KV) interface backed by Log-Structured Merge Tree (LSM) for supporting secure data storage and query operations. Speicher enforces these security properties on an untrusted host by leveraging shielded execution based on a hardware-assisted trusted execution environment (TEE)—specifically, Intel SGX. However, the design of Speicher extends the trust in shielded execution beyond the secure SGX enclave memory region to ensure that the security properties are also preserved in the stateful (or non-volatile) setting of an untrusted storage medium, including system crash, reboot, or migration.

More specifically, we have designed an authenticated and confidentiality-preserving LSM data structure. We have further hardened the LSM data structure to ensure data freshness by designing asynchronous trusted counters. Lastly, we designed a direct I/O library for shielded execution based on Intel SPDK to overcome the I/O bottlenecks in the SGX enclave. We have implemented Speicher as a fully-functional storage system by extending RocksDB, and evaluated its performance using the RocksDB benchmark. Our experimental evaluation shows that Speicher incurs reasonable overheads for providing strong security guarantees, while keeping the trusted computing base (TCB) small.

Speakers
MB

Maurice Bailleu

The University of Edinburgh
JT

Jörg Thalheim

The University of Edinburgh
PB

Pramod Bhatotia

The University of Edinburgh
MH

Michio Honda

NEC Laboratories Europe
KV

Kapil Vaswani

Microsoft Research


Wednesday February 27, 2019 10:00am - 10:30am EST
Grand Ballroom

10:10am EST

Break with Refreshments
Wednesday February 27, 2019 10:10am - 10:40am EST
Grand Ballroom Foyer

10:30am EST

Break with Refreshments
Wednesday February 27, 2019 10:30am - 11:00am EST
Grand Ballroom Foyer

10:40am EST

Slim: OS Kernel Support for a Low-Overhead Container Overlay Network
Containers have become the de facto method for hosting large-scale distributed applications. Container overlay networks are essential to providing portability for containers, yet they impose significant overhead in terms of throughput, latency, and CPU utilization. The key problem is a reliance on packet transformation to implement network virtualization. As a result, each packet has to traverse the network stack twice in both the sender and the receiver’s host OS kernel. We have designed and implemented Slim, a low-overhead container overlay network that implements network virtualization by manipulating connection-level metadata. Our solution maintains compatibility with today’s containerized applications. Evaluation results show that Slim improves the throughput of an in-memory key-value store by 66% while reducing the latency by 42%. Slim reduces the CPU utilization of the in-memory key-value store by 54%. Slim also reduces the CPU utilization of a web server by 28%-40%, a database server by 25%, and a stream processing framework by 11%.
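The contrast Slim draws, per-packet transformation versus one-time manipulation of connection-level metadata, can be modeled in a few lines. The address mapping and class names below are invented for illustration; Slim itself operates on real socket state in the kernel, not a Python table.

```python
# Made-up mapping from a container's virtual endpoint to a host endpoint.
VIRT_TO_HOST = {("10.0.0.2", 80): ("192.168.1.7", 30080)}

class PerPacketOverlay:
    """Conventional overlay in miniature: every packet is transformed,
    so each one traverses the network stack twice per host."""
    def __init__(self):
        self.transformations = 0
    def send(self, vaddr, payload):
        haddr = VIRT_TO_HOST[vaddr]        # rewritten on every single packet
        self.transformations += 1
        return (haddr, payload)

class ConnectionLevelOverlay:
    """Slim's approach in miniature: translate the connection metadata once
    at connect() time; subsequent packets already carry host addresses."""
    def __init__(self):
        self.transformations = 0
        self.haddr = None
    def connect(self, vaddr):
        self.haddr = VIRT_TO_HOST[vaddr]   # one translation per connection
        self.transformations += 1
    def send(self, payload):
        return (self.haddr, payload)

per_packet = PerPacketOverlay()
slim_like = ConnectionLevelOverlay()
slim_like.connect(("10.0.0.2", 80))
for _ in range(1000):
    per_packet.send(("10.0.0.2", 80), b"payload")
    slim_like.send(b"payload")
```

The point of the sketch: for a 1000-packet connection, the per-packet design pays the translation cost 1000 times, the connection-level design exactly once.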

Speakers
DZ

Danyang Zhuo

University of Washington
KZ

Kaiyuan Zhang

University of Washington
YZ

Yibo Zhu

Microsoft and ByteDance
MR

Matthew Rockett

University of Washington
AK

Arvind Krishnamurthy

University of Washington
TA

Thomas Anderson

University of Washington


Wednesday February 27, 2019 10:40am - 11:05am EST
Constitution Ballroom

11:00am EST

SLM-DB: Single-Level Key-Value Store with Persistent Memory
This paper investigates how to leverage emerging byte-addressable persistent memory (PM) to enhance the performance of key-value (KV) stores. We present a novel KV store, the Single-Level Merge DB (SLM-DB), which takes advantage of both the B+-tree index and the Log-Structured Merge Tree (LSM-tree) approach by making the best use of fast persistent memory. Our proposed SLM-DB achieves high read performance as well as high write performance with low write amplification and near-optimal read amplification. In SLM-DB, we exploit persistent memory to maintain a B+-tree index and adopt an LSM-tree approach to stage inserted KV pairs in a PM-resident memory buffer. SLM-DB has a single-level organization of KV pairs on disks and performs selective compaction for the KV pairs, collecting garbage and keeping the KV pairs sorted sufficiently for range query operations. Our extensive experimental study demonstrates that, in our default setup, compared to LevelDB, SLM-DB provides 1.07 - 1.96 and 1.56 - 2.22 times higher read and write throughput, respectively, as well as comparable range query performance.
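The single-level organization can be sketched with a sorted in-memory index standing in for the PM-resident B+-tree and an append-only list standing in for the single on-disk level. This is an illustrative model of the split, not the system's design: real SLM-DB persists the index in PM and garbage-collects the log via selective compaction.

```python
import bisect

class SingleLevelSketch:
    """Toy model of SLM-DB's split: a fast persistent index maps keys to
    offsets in a single-level, append-only store, so a point lookup never
    walks multiple LSM levels."""
    def __init__(self):
        self.index_keys = []   # stands in for the PM-resident B+-tree
        self.index_offs = []
        self.log = []          # stands in for the single on-disk level

    def put(self, key, value):
        off = len(self.log)
        self.log.append((key, value))
        i = bisect.bisect_left(self.index_keys, key)
        if i < len(self.index_keys) and self.index_keys[i] == key:
            self.index_offs[i] = off       # index now points at newest version
        else:
            self.index_keys.insert(i, key)
            self.index_offs.insert(i, off)

    def get(self, key):
        i = bisect.bisect_left(self.index_keys, key)
        if i < len(self.index_keys) and self.index_keys[i] == key:
            return self.log[self.index_offs[i]][1]
        return None

    def range(self, lo, hi):
        i = bisect.bisect_left(self.index_keys, lo)
        j = bisect.bisect_right(self.index_keys, hi)
        return [(k, self.log[o][1]) for k, o in
                zip(self.index_keys[i:j], self.index_offs[i:j])]

db = SingleLevelSketch()
for k in ["b", "a", "d", "c"]:
    db.put(k, k.upper())
db.put("a", "A2")  # update: old version lingers until selective compaction
```

Note that the sorted index is what keeps range queries cheap even though the log itself is unsorted.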

Speakers
BN

Beomseok Nam

Sungkyunkwan University
YC

Young-ri Choi

UNIST (Ulsan National Institute of Science and Technology)


Wednesday February 27, 2019 11:00am - 11:30am EST
Grand Ballroom

11:05am EST

Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency
The recently proposed dataplanes for microsecond scale applications, such as IX and ZygOS, use non-preemptive policies to schedule requests to cores. For the many real-world scenarios where request service times follow distributions with high dispersion or a heavy tail, they allow short requests to be blocked behind long requests, which leads to poor tail latency.

Shinjuku is a single-address space operating system that uses hardware support for virtualization to make preemption practical at the microsecond scale. This allows Shinjuku to implement centralized scheduling policies that preempt requests as often as every 5µsec and work well for both light and heavy tailed request service time distributions. We demonstrate that Shinjuku provides significant tail latency and throughput improvements over IX and ZygOS for a wide range of workload scenarios. For the case of a RocksDB server processing both point and range queries, Shinjuku achieves up to 6.6× higher throughput and 88% lower tail latency.
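A toy single-worker simulation shows why frequent preemption helps under heavy-tailed service times. The quantum and job sizes below are made up, and plain round-robin merely stands in for Shinjuku's centralized preemptive scheduling.

```python
def fcfs_latencies(jobs):
    """Non-preemptive run-to-completion in arrival order, in the spirit of
    the non-preemptive policies the paper compares against."""
    t, lat = 0, []
    for arrival, size in jobs:
        t = max(t, arrival) + size
        lat.append(t - arrival)
    return lat

def preemptive_latencies(jobs, quantum=5):
    """Round-robin with a small quantum on one worker, standing in for
    frequent preemption. Assumes all jobs arrive at t=0."""
    remaining = {i: size for i, (_, size) in enumerate(jobs)}
    t, done = 0, {}
    while remaining:
        for i in list(remaining):
            run = min(quantum, remaining[i])
            t += run
            remaining[i] -= run
            if remaining[i] == 0:
                done[i] = t - jobs[i][0]
                del remaining[i]
    return [done[i] for i in range(len(jobs))]

# one heavy request (think: a range query) ahead of nine light ones
jobs = [(0, 1000)] + [(0, 10)] * 9
fcfs = fcfs_latencies(jobs)
preemptive = preemptive_latencies(jobs)
```

In this toy run, FCFS leaves the short requests' worst latency at 1090 time units (they all wait out the heavy request), while the 5-unit quantum bounds it at 100, at the cost of delaying the heavy request.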

Speakers
KK

Kostis Kaffes

Stanford University
TC

Timothy Chong

Stanford University
JT

Jack Tigar Humphries

Stanford University
AB

Adam Belay

MIT CSAIL
DM

David Mazières

Stanford University
CK

Christos Kozyrakis

Stanford University


Wednesday February 27, 2019 11:05am - 11:30am EST
Constitution Ballroom

11:30am EST

Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads
Datacenter applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications handle loads that have high variance over multiple timescales. Achieving these goals in a CPU-efficient way is an open problem. Because of the high overheads of today's kernels, the best available solution to achieve microsecond-scale latencies is kernel-bypass networking, which dedicates CPU cores to applications for spin-polling the network card. But this approach wastes CPU: even at modest average loads, one must dedicate enough cores for the peak expected load.

Shenango achieves comparable latencies but at far greater CPU efficiency. It reallocates cores across applications at very fine granularity (every 5 µs), enabling cycles unused by latency-sensitive applications to be used productively by batch processing applications. It achieves such fast reallocation rates with (1) an efficient algorithm that detects when applications would benefit from more cores, and (2) a privileged component called the IOKernel that runs on a dedicated core, steering packets from the NIC and orchestrating core reallocations. When handling latency-sensitive applications, such as memcached, we found that Shenango achieves tail latency and throughput comparable to ZygOS, a state-of-the-art, kernel-bypass network stack, but can linearly trade latency-sensitive application throughput for batch processing application throughput, vastly increasing CPU efficiency.

Speakers

Wednesday February 27, 2019 11:30am - 11:55am EST
Constitution Ballroom

11:30am EST

Ziggurat: A Tiered File System for Non-Volatile Main Memories and Disks
Emerging fast, byte-addressable Non-Volatile Main Memory (NVMM) provides huge increases in storage performance compared to traditional disks. We present Ziggurat, a tiered file system that combines NVMM and slow disks to create a storage system with near-NVMM performance and large capacity. Ziggurat steers incoming writes to NVMM, DRAM, or disk depending on application access patterns, write size, and the likelihood that the application will stall until the write completes. Ziggurat profiles the application's access stream online to predict the behavior of individual writes. In the background, Ziggurat estimates the "temperature" of file data, and migrates the cold file data from NVMM to disks. To fully utilize disk bandwidth, Ziggurat coalesces data blocks into large, sequential writes. Experimental results show that with a small amount of NVMM and a large SSD, Ziggurat achieves up to 38.9x and 46.5x throughput improvement compared with EXT4 and XFS running on an SSD alone, respectively. As the amount of NVMM grows, Ziggurat's performance improves until it matches the performance of an NVMM-only file system.
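The steering and migration logic described above can be sketched as two small policy functions. The threshold, timings, and file-metadata layout are assumptions for illustration, not Ziggurat's actual parameters or profiling machinery.

```python
def steer_write(size, synchronous, small_threshold=64 * 1024):
    """Placement sketch: writes the application will stall on go to
    byte-addressable NVMM; large asynchronous writes go straight to disk,
    whose sequential bandwidth handles them well; the rest stage in DRAM.
    The 64 KB threshold is made up."""
    if synchronous:
        return "nvmm"
    return "disk" if size >= small_threshold else "dram"

def migrate_cold(files, now, cold_after=60.0):
    """Background step: file data untouched for a while is 'cold' and moves
    from NVMM to disk (the real system coalesces these migrations into
    large sequential writes to use full disk bandwidth)."""
    moved = []
    for name, meta in files.items():
        if meta["tier"] == "nvmm" and now - meta["last_access"] > cold_after:
            meta["tier"] = "disk"
            moved.append(name)
    return moved

files = {
    "journal": {"tier": "nvmm", "last_access": 95.0},  # hot: stays in NVMM
    "archive": {"tier": "nvmm", "last_access": 10.0},  # cold: migrates
}
moved = migrate_cold(files, now=100.0)
```

The real system predicts synchronicity per write by profiling the application's access stream online; here it is simply passed in as a flag.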

Speakers
SZ

Shengan Zheng

Shanghai Jiao Tong University
MH

Morteza Hoseinzadeh

University of California, San Diego
SS

Steven Swanson

UC San Diego


Wednesday February 27, 2019 11:30am - 12:00pm EST
Grand Ballroom

11:55am EST

Lunch (on your own)
Wednesday February 27, 2019 11:55am - 1:30pm EST
N/A

12:00pm EST

Orion: A Distributed File System for Non-Volatile Main Memory and RDMA-Capable Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) force system designers to rethink trade-offs throughout the system stack, often leading to dramatic changes in system architecture. Conventional distributed file systems are a prime example. When faster NVMM replaces block-based storage, the dramatic improvement in storage performance makes networking and software overhead a critical bottleneck.

In this paper, we present Orion, a distributed file system for NVMM-based storage. By taking a clean slate design and leveraging the characteristics of NVMM and high-speed, RDMA-based networking, Orion provides high-performance metadata and data access while maintaining the byte addressability of NVMM. Our evaluation shows Orion achieves performance comparable to local NVMM file systems and outperforms existing distributed file systems by a large margin.

Speakers
JY

Jian Yang

UC San Diego
SS

Steven Swanson

UC San Diego


Wednesday February 27, 2019 12:00pm - 12:30pm EST
Grand Ballroom

1:30pm EST

End-to-end I/O Monitoring on a Leading Supercomputer
This paper presents an effort to overcome the complexities of production-use I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40960-node Sunway TaihuLight supercomputer, currently ranked world No. 3. It simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online+offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code/script modification.

With its deployment on TaihuLight for around 18 months, we demonstrate Beacon's effectiveness with a collection of real-world use cases for I/O performance issue identification and diagnosis. It has successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with the others currently being addressed. In addition, we demonstrate Beacon's generality through its recent extension to monitor interconnection networks, another contention point on supercomputers. Finally, both the code and the collected data are to be released.

Speakers
BY

Bin Yang

Shandong University, National Supercomputing Center in Wuxi
XJ

Xu Ji

Tsinghua University, National Supercomputing Center in Wuxi
XM

Xiaosong Ma

Qatar Computing Research institute, HBKU
XW

Xiyang Wang

National Supercomputing Center in Wuxi
TZ

Tianyu Zhang

Shandong University, National Supercomputing Center in Wuxi
XZ

Xiupeng Zhu

Shandong University, National Supercomputing Center in Wuxi
NE

Nosayba El-Sayed

Emory University
HL

Haidong Lan

Shandong University
YY

Yibo Yang

Shandong University
JZ

Jidong Zhai

Tsinghua University
WL

Weiguo Liu

Shandong University, National Supercomputing Center in Wuxi
WX

Wei Xue

Tsinghua University, National Supercomputing Center in Wuxi


Wednesday February 27, 2019 1:30pm - 1:55pm EST
Constitution Ballroom

1:55pm EST

Zeno: Diagnosing Performance Problems with Temporal Provenance
When diagnosing a problem in a distributed system, it is sometimes necessary to explain the timing of an event—for instance, why a response has been delayed, or why the network latency is high. Existing tools offer some support for this, typically by tracing the problem to a bottleneck or to an overloaded server. However, locating the bottleneck is merely the first step: the real problem may be some other service that is sending traffic over the bottleneck link, or a misbehaving machine that is overloading the server with requests. These off-path causes do not appear in a conventional trace and will thus be missed by most existing diagnostic tools.

In this paper, we introduce a new concept we call temporal provenance that can help with diagnosing timing-related problems. Temporal provenance is inspired by earlier work on provenance-based network debugging; however, in addition to the functional problems that can already be handled with classical provenance, it can also diagnose problems that are related to timing. We present an algorithm for generating temporal provenance and an experimental debugger called Zeno; our experimental evaluation shows that Zeno can successfully diagnose several realistic performance bugs.

Speakers
YW

Yang Wu

Facebook
AC

Ang Chen

Rice University
LT

Linh Thi Xuan Phan

University of Pennsylvania


Wednesday February 27, 2019 1:55pm - 2:20pm EST
Constitution Ballroom

2:00pm EST

INSTalytics: Cluster Filesystem Co-design for Big-data Analytics
We present the design, implementation, and evaluation of Instalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big data analytics in large-scale data centers. Instalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, Instalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle.

To achieve this, Instalytics uses compute-awareness to customize the 3-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables Instalytics to preserve the same recovery cost and availability as traditional replication. Instalytics also uses compute-awareness to expose a new sliced-read API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via coordinated request scheduling and selective caching at the storage nodes.

We have implemented Instalytics in a production analytics stack, and show that recovery performance and availability is similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.
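The benefit of partitioning the same data on several dimensions can be sketched as follows. Note the hedge in the code: this toy simply pays one full replica per dimension, whereas the paper's heterogeneous layout extracts four partitioning dimensions from the cost of ordinary 3-way replication; column names and data are made up.

```python
from collections import defaultdict

def build_replicas(rows, dims):
    """Keep one full copy of the data per dimension, each copy partitioned
    by a different column (a deliberately simplified stand-in for the
    paper's heterogeneous replication layout)."""
    replicas = {}
    for dim in dims:
        parts = defaultdict(list)
        for row in rows:
            parts[row[dim]].append(row)
        replicas[dim] = parts
    return replicas

def filtered_scan(replicas, dim, value):
    """Route a filter to the replica partitioned on its column, if any;
    otherwise fall back to scanning a full copy."""
    if dim in replicas:
        return replicas[dim].get(value, []), "partition pruning"
    any_replica = next(iter(replicas.values()))
    rows = [r for part in any_replica.values() for r in part]
    return [r for r in rows if r[dim] == value], "full scan"

rows = [{"user": u, "region": reg, "day": d, "hour": d % 2}
        for u in range(4) for reg in ("us", "eu") for d in range(3)]
replicas = build_replicas(rows, ["user", "region", "day"])
```

A query filtering on `user`, `region`, or `day` reads only the matching partition; a filter on the unpartitioned `hour` column still has to touch every row, which is exactly the gap that having more partitioning dimensions closes.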

Speakers
MS

Muthian Sivathanu

Microsoft Research India
MV

Midhul Vuppalapati

Microsoft Research India
BG

Bhargav Gulavani

Microsoft Research India
KR

Kaushik Rajan

Microsoft Research India
JL

Jyoti Leeka

Microsoft Research India
JM

Jayashree Mohan

Univ. of Texas Austin
PK

Piyus Kedia

IIIT Delhi


Wednesday February 27, 2019 2:00pm - 2:30pm EST
Grand Ballroom

2:20pm EST

Confluo: Distributed Monitoring and Diagnosis Stack for High-speed Networks
Confluo is an end-host stack that can be integrated with existing network management tools to enable monitoring and diagnosis of network-wide events using telemetry data distributed across end-hosts, even for high-speed networks. Confluo achieves these properties using a new data structure—Atomic MultiLog—that supports highly-concurrent read-write operations by exploiting two properties specific to telemetry data: (1) once processed by the stack, the data is neither updated nor deleted; and (2) each field in the data has a fixed pre-defined size. Our evaluation results show that, for packet sizes 128B or larger, Confluo executes thousands of triggers and tens of filters at line rate (for 10Gbps links) using a single core.
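The two telemetry-data properties the Atomic MultiLog exploits translate naturally into an append-only buffer of fixed-size records coordinated by an atomic tail. The following is a greatly simplified, single-process sketch of that idea, not the actual data structure, which publishes concurrent writes far more carefully.

```python
import itertools
import struct
import threading

class FixedSizeLog:
    """Toy model of the two enabling properties: records are never updated
    or deleted, and every field has a fixed size. Writers reserve slots with
    an atomic counter; readers only trust slots below the published tail."""
    RECORD = struct.Struct("<QQ")           # (timestamp, value): fixed 16 bytes

    def __init__(self, capacity):
        self.buf = bytearray(capacity * self.RECORD.size)
        self._next_slot = itertools.count()  # stands in for atomic fetch-and-add
        self._lock = threading.Lock()
        self.tail = 0

    def append(self, ts, value):
        slot = next(self._next_slot)         # reservation: writers never contend
        self.RECORD.pack_into(self.buf, slot * self.RECORD.size, ts, value)
        with self._lock:                     # publish the completed write
            self.tail = max(self.tail, slot + 1)
        return slot

    def read(self, slot):
        if slot >= self.tail:
            raise IndexError("record not yet published")
        return self.RECORD.unpack_from(self.buf, slot * self.RECORD.size)

log = FixedSizeLog(capacity=1024)
first = log.append(ts=1, value=42)
second = log.append(ts=2, value=99)
```

Because record offsets are a pure function of the slot number (fixed sizes) and data is immutable once written, readers need no locks at all on the data path, only the tail check.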

Speakers
RA

Rachit Agarwal

Cornell University
IS

Ion Stoica

UC Berkeley


Wednesday February 27, 2019 2:20pm - 2:45pm EST
Constitution Ballroom

2:30pm EST

GraphOne: A Data Store for Real-time Analytics on Evolving Graphs
There is a growing need to perform real-time analytics on evolving graphs in order to deliver the value of big data to users. The key requirement from such applications is to have a data store to support their diverse data access efficiently, while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, either graph databases or analytics engines, are not designed to achieve high performance for both operations. To address this challenge, we have designed and developed GraphOne, a graph data store that combines two complementary graph storage formats (edge list and adjacency list), and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities with only a small data duplication. Experimental results show that GraphOne achieves an ingestion rate of two to three orders of magnitude higher than graph databases, while delivering algorithmic performance comparable to a static graph system. GraphOne is able to deliver a 5.36x higher update rate and over 3x better analytics performance compared to a state-of-the-art dynamic graph system.
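The complementary-format idea can be sketched in a few lines: fine-grained updates land in an append-only edge list, a background step periodically archives them into an adjacency list, and reads merge the two views. The archive threshold and structure below are illustrative, not GraphOne's (which also versions the archived view for concurrent analytics).

```python
from collections import defaultdict

class HybridGraphStore:
    """Two-format sketch: a fast append-only edge list absorbs updates,
    while a compacted adjacency list serves analytics-style scans."""
    def __init__(self, archive_threshold=4):
        self.edge_log = []                    # recent, unarchived updates
        self.adjacency = defaultdict(list)    # compacted historical edges
        self.archive_threshold = archive_threshold

    def add_edge(self, u, v):
        self.edge_log.append((u, v))          # ingestion is a cheap append
        if len(self.edge_log) >= self.archive_threshold:
            self.archive()

    def archive(self):
        """Fold logged edges into the adjacency list, then reset the log."""
        for u, v in self.edge_log:
            self.adjacency[u].append(v)
        self.edge_log.clear()

    def neighbors(self, u):
        # a read merges the compacted view with any not-yet-archived edges
        return self.adjacency[u] + [v for (s, v) in self.edge_log if s == u]

g = HybridGraphStore()
for edge in [(1, 2), (1, 3), (2, 3), (3, 1), (1, 4)]:
    g.add_edge(*edge)
```

After the fourth edge triggers an archive, the fifth sits only in the edge log, yet `neighbors(1)` still returns all three of vertex 1's neighbors by merging both views.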

Speakers
PK

Pradeep Kumar

George Washington University
HH

H. Howie Huang

George Washington University


Wednesday February 27, 2019 2:30pm - 3:00pm EST
Grand Ballroom

2:45pm EST

DETER: Deterministic TCP Replay For Performance Diagnosis
TCP performance problems are notoriously tricky to diagnose because a subtle choice of TCP parameters or features may lead to completely different performance. A gold standard for diagnosis is to collect packet traces and trace TCP executions. However, it is not easy to use such tools in large-scale data centers where many TCP connections interact with each other. In this paper, we introduce DETER, a deterministic TCP replay tool, which runs lightweight recording all the time at all the hosts and then replays selected collections where operators can collect packet traces and trace TCP executions for diagnosis. The key challenge for deterministic TCP replay is the butterfly effect---a small timing variation causes a chain reaction between TCP and the network that drives the system to a completely different state in the replay. To eliminate the butterfly effect, we propose to replay each TCP connection separately and capture all the interactions between a connection and the applications and the network. Our evaluation shows that DETER has low recording overhead and can help diagnose many TCP performance problems, from long latency related to zero-window probes, late fast retransmissions, and frequent retransmission timeouts, to problems related to the switch's shared buffer.

Speakers
YL

Yuliang Li

Harvard University
RM

Rui Miao

Alibaba Group
MA

Mohammad Alizadeh

Massachusetts Institute of Technology
MY

Minlan Yu

Harvard University


Wednesday February 27, 2019 2:45pm - 3:10pm EST
Constitution Ballroom

3:00pm EST

Automatic, Application-Aware I/O Forwarding Resource Allocation
The I/O forwarding architecture is widely adopted on modern supercomputers, with a layer of intermediate nodes sitting between the many compute nodes and backend storage nodes. This allows compute nodes to run more efficiently and stably with a leaner OS, offloads I/O coordination and communication with the backend from the compute nodes, maintains fewer concurrent connections to storage systems, and provides additional resources for effective caching, prefetching, write buffering, and I/O aggregation. However, on many existing machines, these forwarding nodes are assigned to serve a fixed set of compute nodes.

We explore an automatic mechanism, DFRA, for application-adaptive dynamic forwarding resource allocation, using I/O monitoring data that is affordable to acquire in real time and to maintain for long-term history analysis. Upon each job's dispatch, DFRA conducts a history-based study to determine whether the job should be granted more forwarding resources or given dedicated forwarding nodes. Such customized I/O forwarding lets the small fraction of I/O-intensive applications achieve higher I/O performance and scalability, while effectively isolating disruptive I/O activities. We implemented, evaluated, and deployed DFRA on Sunway TaihuLight, the current No. 2 supercomputer in the world. It improves applications' I/O performance by up to 16.0x, eliminates most inter-application I/O interference, and has saved over 200 million core-hours during its 8-month deployment on TaihuLight. Finally, our proposed DFRA design is not platform-dependent, making it applicable to the management of existing and future I/O forwarding or burst buffer resources.

Speakers
XJ

Xu Ji

Tsinghua University, National Supercomputing Center in Wuxi
BY

Bin Yang

Shandong University, National Supercomputing Center in Wuxi
TZ

Tianyu Zhang

Shandong University, National Supercomputing Center in Wuxi
XM

Xiaosong Ma

Qatar Computing Research institute, HBKU
XZ

Xiupeng Zhu

Shandong University, National Supercomputing Center in Wuxi
XW

Xiyang Wang

National Supercomputing Center in Wuxi
NE

Nosayba El-Sayed

Emory University
JZ

Jidong Zhai

Tsinghua University
WL

Weiguo Liu

Shandong University, National Supercomputing Center in Wuxi
WX

Wei Xue

Tsinghua University, National Supercomputing Center in Wuxi


Wednesday February 27, 2019 3:00pm - 3:30pm EST
Grand Ballroom

3:10pm EST

Break with Refreshments
Wednesday February 27, 2019 3:10pm - 3:40pm EST
Grand Ballroom Foyer

3:30pm EST

Break with Refreshments
Wednesday February 27, 2019 3:30pm - 4:00pm EST
Grand Ballroom Foyer

3:40pm EST

JANUS: Fast and Flexible Deep Learning via Symbolic Graph Execution of Imperative Programs
The rapid evolution of deep neural networks is demanding deep learning (DL) frameworks not only to satisfy the requirement of quickly executing large computations, but also to support straightforward programming models for quickly implementing and experimenting with complex network structures. However, existing frameworks fail to excel in both departments simultaneously, leading to diverged efforts for optimizing performance and improving usability.

This paper presents JANUS, a system that combines the advantages from both sides by transparently converting an imperative DL program written in Python, the de-facto scripting language for DL, into an efficiently executable symbolic dataflow graph. JANUS can convert various dynamic features of Python, including dynamic control flow, dynamic types, and impure functions, into the symbolic graph operations. Experiments demonstrate that JANUS can achieve fast DL training by exploiting the techniques imposed by symbolic graph-based DL frameworks, while maintaining the simple and flexible programmability of imperative DL frameworks at the same time.

Speakers
EJ

Eunji Jeong

Seoul National University
SC

Sungwoo Cho

Seoul National University
GY

Gyeong-In Yu

Seoul National University
JS

Joo Seong Jeong

Seoul National University
DS

Dong-Jin Shin

Seoul National University
BC

Byung-Gon Chun

Seoul National University


Wednesday February 27, 2019 3:40pm - 4:05pm EST
Constitution Ballroom

4:05pm EST

BLAS-on-flash: An Efficient Alternative for Large Scale ML Training and Inference?
Many large scale machine learning training and inference tasks are memory-bound rather than compute-bound. That is, on large data sets, the working set of these algorithms does not fit in memory for jobs that could run overnight on a few multi-core processors. This often forces an expensive redesign of the algorithm to distributed platforms such as parameter servers and Spark.

We propose an inexpensive and efficient alternative based on the observation that many ML tasks admit algorithms that can be programmed with linear algebra subroutines. A library that supports BLAS and sparseBLAS interface on large SSD-resident matrices can enable multi-threaded code to scale to industrial scale data sets on a single workstation.

We demonstrate that not only can such a library provide near in-memory performance for BLAS, but can also be used to write implementations of complex algorithms such as eigensolvers that outperform in-memory (ARPACK) and distributed (Spark) counterparts.

Existing multi-threaded in-memory code can link to our library with minor changes and scale to hundreds of Gigabytes of training or inference data at near in-memory processing speeds. We demonstrate this with two industrial scale use cases arising in ranking and relevance pipelines: training large scale topic models and inference for extreme multi-label learning.

This suggests that our approach could be an efficient alternative to expensive big-data compute systems for scaling up structurally complex machine learning tasks.
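The out-of-core principle behind such a library can be sketched with a tiled matrix multiply that fetches operand tiles on demand. The callable-based tile interface below is an invention for illustration, not the library's API; in the paper's setting the tile reads would come from SSD-resident matrices rather than in-memory lists.

```python
def tile_of(matrix, r0, c0, tile):
    """Fetch one tile of a matrix stored as a list of lists; an SSD-backed
    implementation would issue a block read here instead."""
    return [row[c0:c0 + tile] for row in matrix[r0:r0 + tile]]

def blocked_matmul(get_a, get_b, n, m, p, tile):
    """Tiled GEMM (C = A @ B, A is n x m, B is m x p) where operand tiles
    arrive through callables, so only O(tile^2) elements of each operand
    are resident at any moment."""
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                a = get_a(i0, k0)          # tile of A at (i0, k0)
                b = get_b(k0, j0)          # tile of B at (k0, j0)
                for i in range(len(a)):
                    for j in range(min(tile, p - j0)):
                        acc = 0.0
                        for k in range(len(b)):
                            acc += a[i][k] * b[k][j]
                        c[i0 + i][j0 + j] += acc
    return c

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = blocked_matmul(lambda r, c0: tile_of(A, r, c0, 1),
                   lambda r, c0: tile_of(B, r, c0, 1),
                   n=2, m=2, p=2, tile=1)
```

Swapping the tile callables for reads against memory-mapped files keeps the resident working set at a few tiles regardless of matrix size, which is the essence of running BLAS over SSD-resident data.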

Speakers
SJ

Suhas Jayaram Subramanya

Microsoft Research India
HV

Harsha Vardhan Simhadri

Microsoft Research India
SG

Srajan Garg

IIT Bombay
AK

Anil Kag

Microsoft Research India
VB

Venkatesh Balasubramanian

Microsoft Research India


Wednesday February 27, 2019 4:05pm - 4:30pm EST
Constitution Ballroom

4:30pm EST

Tiresias: A GPU Cluster Manager for Distributed Deep Learning
Distributed training of deep learning (DL) models on GPU clusters is becoming increasingly popular. Existing cluster managers face some unique challenges from DL training jobs, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers – coupled with a consolidated job placement constraint, whereby GPUs for the same job must be allocated in as few machines as possible – cause long queueing delays and low overall performance.

We present Tiresias, a GPU cluster resource manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms – Discretized Two-Dimensional Gittins Index, which relies on partial information, and Discretized Two-Dimensional LAS, which is information-agnostic – that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed and present a placement algorithm to leverage these observations without any user input. Experiments on a cluster with 60 P100 GPUs – and large-scale trace-driven simulations – show that Tiresias improves the average JCT by up to 5.5× over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.
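The information-agnostic scheduler can be caricatured with a toy sketch of the least-attained-service idea (this is not the paper's implementation; the threshold values are illustrative): a job's priority decays as its two-dimensional attained service (GPUs × time served) grows, discretized into a small number of priority queues so that short jobs finish ahead of long-running ones.

```python
def priority_queue_index(num_gpus, seconds_served, thresholds=(3600, 36000)):
    """Map a job to a priority queue: 0 is highest priority,
    len(thresholds) is lowest. Thresholds are in GPU-seconds."""
    attained = num_gpus * seconds_served  # two-dimensional attained service
    for i, t in enumerate(thresholds):
        if attained < t:
            return i
    return len(thresholds)
```

A scheduler built on this index would serve queue 0 first; a job that has consumed little GPU time so far (whatever its eventual length) stays in a high-priority queue, which is why no advance knowledge of job duration is needed.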

Speakers
JG

Juncheng Gu

University of Michigan, Ann Arbor
MC

Mosharaf Chowdhury

University of Michigan, Ann Arbor
KG

Kang G. Shin

University of Michigan, Ann Arbor
YZ

Yibo Zhu

Microsoft and ByteDance
MJ

Myeongjae Jeon

Microsoft and UNIST
JQ

Junjie Qian

Microsoft


Wednesday February 27, 2019 4:30pm - 4:55pm EST
Constitution Ballroom

4:55pm EST

Short Break
Wednesday February 27, 2019 4:55pm - 5:10pm EST
Grand Ballroom Foyer

5:10pm EST

Correctness and Performance for Stateful Chained Network Functions
Network functions virtualization (NFV) allows operators to employ NF chains to realize custom policies, and dynamically add instances to meet demand or for failover. NFs maintain detailed per- and cross-flow state that needs careful management, especially during dynamic actions. Crucially, state management must: (1) ensure NF chain-wide correctness and (2) have good performance. To this end, we built CHC, an NFV framework that leverages an external state store coupled with state management algorithms and metadata maintenance for correct operation even under a range of failures. Our evaluation shows that CHC can support ~10 Gbps per-NF throughput with less than a 0.6 μs increase in median per-NF packet processing latency, while ensuring chain-wide correctness at little additional cost.

Speakers
JK

Junaid Khalid

University of Wisconsin - Madison
AA

Aditya Akella

UW-Madison


Wednesday February 27, 2019 5:10pm - 5:35pm EST
Constitution Ballroom

5:35pm EST

Performance contracts for software network functions
While software network functions (NFs) promise great flexibility and easy deployment of network services, they face the challenge of unpredictable performance. We propose Bolt, a technique and tool for predicting the performance of the entire software stack of an NF comprising the core NF logic, DPDK packet processing framework, and the NIC driver. Bolt takes as input the NF implementation and generates a performance contract that provides, for any arbitrary packet scenario, a precise characterization of the NF's performance. Under the covers, Bolt leverages a state-based demarcation of NFs and combines a pre-analysis of stateful data structures with automated symbolic execution of the stateless NF code. Performance contracts allow scrutiny of NF performance with a fine level of granularity, enabling network developers and operators to understand the performance of the NF in the face of any workload, whether typical, exceptional, or adversarial. We evaluate Bolt on four realistic NFs – a NAT, a Maglev-like load balancer, an LPM Router, and a MAC bridge – and show that Bolt's performance contracts predict the dynamic instruction count and memory accesses of the NF to within a maximum of 7% of real executions, for all NFs and traffic classes analyzed.


Wednesday February 27, 2019 5:35pm - 6:00pm EST
Constitution Ballroom

6:00pm EST

FlowBlaze: Stateful Packet Processing in Hardware
Programmable NICs allow for better scalability to handle growing network workloads; however, providing an expressive, yet simple, abstraction to program stateful network functions in hardware remains a research challenge. We address the problem with FlowBlaze, an open abstraction for building stateful packet processing functions in hardware. The abstraction is based on Extended Finite State Machines and introduces the explicit definition of flow state, allowing FlowBlaze to leverage flow-level parallelism. FlowBlaze is expressive, supporting a wide range of complex network functions, and easy to use, hiding low-level hardware implementation issues from the programmer. Our implementation of FlowBlaze on a NetFPGA SmartNIC achieves very low latency (on the order of a few microseconds), consumes relatively little power, can hold per-flow state for hundreds of thousands of flows, and yields speeds of 40 Gb/s, allowing for even higher speeds on newer FPGA models. Both hardware and software implementations of FlowBlaze are publicly available.

Speakers
RB

Roberto Bifulco

NEC Laboratories Europe
MB

Marco Bonola

Axbryd/CNIT
CC

Carmelo Cascone

Open Networking Foundation
MS

Marco Spaziani

CNIT/University of Rome Tor Vergata
VB

Valerio Bruschi

CNIT/University of Rome Tor Vergata
DS

Davide Sanvito

Politecnico di Milano
GS

Giuseppe Siracusano

NEC Laboratories Europe
AC

Antonio Capone

Politecnico di Milano
MH

Michio Honda

NEC Laboratories Europe
FH

Felipe Huici

NEC Laboratories Europe
GB

Giuseppe Bianchi

CNIT/University of Rome Tor Vergata


Wednesday February 27, 2019 6:00pm - 6:25pm EST
Constitution Ballroom

6:30pm EST

Poster Session and Reception
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers. View the complete list of accepted posters.

Wednesday February 27, 2019 6:30pm - 8:00pm EST
Back Bay Ballroom
 
Thursday, February 28
 

7:30am EST

Continental Breakfast
Thursday February 28, 2019 7:30am - 8:30am EST
Grand Ballroom Foyer

8:00am EST

Continental Breakfast
Thursday February 28, 2019 8:00am - 9:00am EST
Grand Ballroom Foyer

8:30am EST

SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks
Network measurement and monitoring have been key to understanding the inner workings of computer networks and debugging the performance problems of distributed applications. Despite many products and much research on these topics, in the context of data centers, performing accurate measurement at scale in near real-time has remained elusive. On one hand, switch-based telemetry can give accurate per-packet views, but these must be assembled across the network and across packets to get network- and application-level insight: this is not scalable. On the other hand, purely end-host-based measurement is naturally scalable but so far has only provided partial views of in-network operation.

In this paper, we set out to push the boundary of edge-based measurement by scalably and accurately reconstructing the full queueing dynamics in the network with data gathered entirely at the transmit and receive network interface cards (NICs). We begin with a signal processing framework for quantifying a key trade-off: reconstruction accuracy versus the amount of data gathered. Based on this, we propose SIMON, an accurate and scalable measurement system for data centers that reconstructs key network state variables like packet queuing times at switches, link utilizations, and queue and link compositions at the flow level. We then demonstrate that the function approximation capability of multi-layered neural networks can speed up SIMON by a factor of 5,000–10,000, enabling it to run in near real-time. We deployed SIMON in three testbeds with different link speeds, layers of switching, and numbers of servers; evaluations with NetFPGAs and a cross-validation technique show that SIMON reconstructs queue lengths to within 3–5 KB and link utilizations to within 1% of actual values. The accuracy and speed of SIMON enable sensitive A/B tests, which greatly aids the real-time development of algorithms, protocols, network software, and applications.

Speakers
YG

Yilong Geng

Stanford University
SL

Shiyu Liu

Stanford University
ZY

Zi Yin

Stanford University
AN

Ashish Naik

Google Inc.
BP

Balaji Prabhakar

Stanford University
MR

Mendel Rosenblum

Stanford University
AV

Amin Vahdat

Google Inc.


Thursday February 28, 2019 8:30am - 8:55am EST
Constitution Ballroom

8:55am EST

Is advance knowledge of flow sizes a plausible assumption?
Recent research has proposed several packet, flow, and coflow scheduling methods that could substantially improve performance for data center workloads. Most of this work assumes advance knowledge of flow sizes, but the lack of a clear path to obtaining such knowledge has also prompted some work on non-clairvoyant scheduling, albeit with more limited performance benefits and narrower applicability.

We thus investigate whether flow sizes can be known in advance in practice, using both simple heuristics and learning methods. Our systematic and substantial efforts across these approaches for estimating flow sizes indicate, unfortunately, that such knowledge is likely hard to obtain with high confidence across many settings of practical interest. However, our prognosis is ultimately more positive: even simple heuristics can help estimate flow sizes for many flows, and this partial knowledge has utility even in schedulers designed for fully clairvoyant operation. These results indicate that a presumed lack of advance knowledge of flow sizes is not necessarily prohibitive for highly efficient scheduling, and suggest further exploration in two directions: (a) scheduling under partial knowledge; and (b) evaluating the practical payoff and expense of obtaining more knowledge.

Speakers
SA

Sangeetha Abdu Jyothi

University of Illinois at Urbana–Champaign
BK

Bojan Karlaš

ETH Zurich
MO

Muhsen Owaida

ETH Zurich
CZ

Ce Zhang

ETH Zurich
AS

Ankit Singla

ETH Zurich


Thursday February 28, 2019 8:55am - 9:20am EST
Constitution Ballroom

9:00am EST

Design Tradeoffs for SSD Reliability
Flash memory-based SSDs are popular across a wide range of data storage markets, while the underlying storage medium—flash memory—is becoming increasingly unreliable. As a result, modern SSDs employ a number of in-device reliability enhancement techniques, but none of them offers a one-size-fits-all solution when considering the multi-dimensional requirements for SSDs: performance, reliability, and lifetime. In this paper, we examine the design tradeoffs of existing reliability enhancement techniques such as data re-read, intra-SSD redundancy, and data scrubbing. We observe that an uncoordinated use of these techniques adversely affects the performance of the SSD, and careful management of the techniques is necessary for a graceful performance degradation while maintaining a high reliability standard. To that end, we propose a holistic reliability management scheme that selectively employs redundancy, conditionally re-reads, and judiciously selects data to scrub. We demonstrate the effectiveness of our scheme by evaluating it across a set of I/O workloads and SSD wear states.

Speakers
BS

Bryan S. Kim

Seoul National University
JC

Jongmoo Choi

Dankook University
SL

Sang Lyul Min

Seoul National University


Thursday February 28, 2019 9:00am - 9:30am EST
Grand Ballroom

9:20am EST

Stable and Practical AS Relationship Inference with ProbLink
Knowledge of the business relationships between Autonomous Systems (ASes) is essential to understanding the behavior of the Internet routing system. Despite significant progress in the development of sophisticated relationship inference algorithms, the resulting datasets are impractical for many critical real-world applications, cannot offer adequate predictability in the configuration of routing policies, and suffer from inference oscillations. To achieve more practical and stable relationship inferences, we first illuminate the root causes of the contradictions between these shortcomings and the near-perfect validation results of AS-Rank, the state-of-the-art relationship inference algorithm. Using a "naive" inference approach as a benchmark, we find that the available validation datasets over-represent AS links with easier inference requirements. We identify which types of links are harder to infer, and we develop appropriate validation subsets to enable more representative evaluation.

We then develop a probabilistic algorithm, ProbLink, to overcome the inference barriers for hard links, such as non-valley-free routing, limited visibility, and non-conventional peering practices. To this end, we identify key interconnection features that provide stochastically informative and highly predictive relationship inference signals. Compared to AS-Rank, our approach reduces the error rate for all links by 1.6×, and importantly, by up to 6.1× for different types of hard links. We demonstrate the practical significance of our improvements by evaluating their impact on three applications. Compared to the current state-of-the-art, ProbLink increases the precision and recall of route leak detection by 4.1× and 3.4× respectively, reveals 27% more complex relationships, and increases the precision of predicting the impact of selective advertisements by 34%.

Speakers
YJ

Yuchen Jin

University of Washington
CS

Colin Scott

UC Berkeley
VG

Vasileios Giotsas

Lancaster University
AK

Arvind Krishnamurthy

University of Washington
SS

Scott Shenker

UC Berkeley, ICSI


Thursday February 28, 2019 9:20am - 9:45am EST
Constitution Ballroom

9:30am EST

Fully Automatic Stream Management for Multi-Streamed SSDs Using Program Contexts
Multi-streamed SSDs can significantly improve both the performance and lifetime of flash-based SSDs when their streams are properly managed. However, existing stream management solutions do not adequately support multi-streamed SSDs, hindering their wide adoption. No existing stream management technique works in a fully automatic fashion for general I/O workloads. Furthermore, the limited number of available streams makes it difficult to effectively manage streams when a large number of streams are required. In this paper, we propose a fully automatic stream management technique, PCStream, which can work efficiently for general I/O workloads with heterogeneous write characteristics. PCStream is based on the key insight that stream allocation decisions should be made on dominant I/O activities. By identifying dominant I/O activities using program contexts, PCStream fully automates the whole process of stream allocation within the kernel with no manual work. In order to overcome the limited number of supported streams, we propose a new type of streams, internal streams, which can be implemented at low cost. PCStream can effectively double the number of available streams using internal streams. Our evaluations on real multi-streamed SSDs show that PCStream achieves the same efficiency as highly-optimized manual allocations by experienced programmers. PCStream improves IOPS by up to 56% over the existing automatic technique by reducing the garbage collection overhead by up to 69%.

Speakers
TK

Taejin Kim

Seoul National University
DH

Duwon Hong

Seoul National University
SS

Sangwook Shane Hahn

Western Digital
MC

Myoungjun Chun

Seoul National University
JH

Jooyoung Hwang

Samsung Electronics
JL

Jongyoul Lee

Samsung Electronics
JK

Jihong Kim

Seoul National University


Thursday February 28, 2019 9:30am - 10:00am EST
Grand Ballroom

9:45am EST

NetBouncer: Active Device and Link Failure Localization in Data Center Networks
The availability of data center services is jeopardized by various network incidents. One of the biggest challenges for network incident handling is to accurately localize the failures, among millions of servers and tens of thousands of network devices. In this paper, we propose NetBouncer, a failure localization system that leverages the IP-in-IP technique to actively probe paths in a data center network. NetBouncer provides a complete failure localization framework which is capable of detecting both device and link failures. It further introduces an algorithm for high-accuracy link failure inference that is resilient to real-world data inconsistency by integrating both our troubleshooting domain knowledge and machine learning techniques. NetBouncer has been deployed in Microsoft Azure’s data centers for three years; in practice, it has produced no false positives and only a few false negatives so far.

Speakers
ZJ

Ze Jin

Cornell University
KD

Karl Deng

Microsoft
DB

Dongming Bi

Microsoft
DX

Dong Xiang

Microsoft


Thursday February 28, 2019 9:45am - 10:10am EST
Constitution Ballroom

10:00am EST

Large-Scale Graph Processing on Emerging Storage Devices
Graph processing is becoming commonplace in many applications to analyze huge datasets. Much of the prior work in this area has assumed I/O devices with considerable latencies, especially for random accesses, using large amounts of DRAM to trade off additional computation for I/O accesses. However, emerging storage devices, including currently popular SSDs, provide fairly comparable sequential and random access performance, making these prior solutions inefficient. In this paper, we point out this inefficiency, and propose a new graph partitioning and processing framework to leverage these new device capabilities. We show experimentally on an actual platform that our proposal can give 2× better performance than a state-of-the-art solution.

Speakers
NE

Nima Elyasi

The Pennsylvania State University
CC

Changho Choi

Samsung Semiconductor Inc.
AS

Anand Sivasubramaniam

The Pennsylvania State University


Thursday February 28, 2019 10:00am - 10:30am EST
Grand Ballroom

10:10am EST

Break with Refreshments
Thursday February 28, 2019 10:10am - 10:40am EST
Grand Ballroom Foyer

10:30am EST

Break with Refreshments
Thursday February 28, 2019 10:30am - 11:00am EST
Grand Ballroom Foyer

10:40am EST

Riverbed: Enforcing User-defined Privacy Constraints in Distributed Web Services
Riverbed is a new framework for building privacy-respecting web services. Using a simple policy language, users define restrictions on how a remote service can process and store sensitive data. A transparent Riverbed proxy sits between a user's front-end client (e.g., a web browser) and the back-end server code. The back-end code remotely attests to the proxy, demonstrating that the code respects user policies; in particular, the server code attests that it executes within a Riverbed-compatible managed runtime that uses information flow control (IFC) to enforce user policies. If attestation succeeds, the proxy releases the user's data, tagging it with the user-defined policies. On the server side, the Riverbed runtime places all data with compatible policies into the same universe (i.e., the same isolated instance of the full web service). The universe mechanism allows Riverbed to work with unmodified, legacy software; unlike prior IFC systems, Riverbed does not require developers to reason about security lattices, or manually annotate code with labels. Riverbed imposes only modest performance overheads, with worst-case slowdowns of 10% for several real applications.

Speakers
FW

Frank Wang

MIT CSAIL
RK

Ronny Ko

Harvard University
JM

James Mickens

Harvard University


Thursday February 28, 2019 10:40am - 11:05am EST
Constitution Ballroom

11:00am EST

Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques
Various techniques have been proposed in the literature to improve erasure code computation efficiency, including optimizing bitmatrix design, optimizing computation schedule, common XOR operation reduction, caching management techniques, and vectorization techniques. These techniques were largely proposed in isolation; in this work, we seek to use them jointly. In order to accomplish this task, these techniques need to be thoroughly evaluated individually, and their relation better understood. Building on extensive test results, we develop methods to systematically optimize the computation chain together with the underlying bitmatrix. This led to a simple design approach of optimizing the bitmatrix by minimizing a weighted cost function, and also a straightforward erasure coding procedure: use the given bitmatrix to produce the computation schedule, which utilizes both the XOR reduction and caching management techniques, and apply XOR-level vectorization. This procedure can provide better performance than most existing techniques, and even compete against well-known codes such as EVENODD, RDP, and STAR codes. Moreover, the result suggests that vectorizing the XOR operation is a better choice than directly vectorizing finite field operations, not only because of the better encoding throughput, but also because of its minimal migration effort onto newer CPUs.
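As a minimal illustration of the XOR arithmetic these codes build on (single-parity only; the paper's bitmatrix designs, schedules, and vectorization are far more general): a parity block is the XOR of all data blocks, so any one lost block can be rebuilt by XORing the survivors with the parity.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def recover(surviving_blocks, parity):
    """Rebuild the single missing data block: XOR is its own inverse,
    so survivors XOR parity equals the lost block."""
    return xor_blocks(surviving_blocks + [parity])
```

Codes like EVENODD, RDP, and STAR arrange many such XOR equations over rows and diagonals of the data, which is why reducing and scheduling common XOR sub-expressions pays off.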

Speakers
CT

Chao Tian

Texas A&M University
TZ

Tianli Zhou

Texas A&M University


Thursday February 28, 2019 11:00am - 11:30am EST
Grand Ballroom

11:05am EST

Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs
Regular expression matching serves as a key functionality of modern network security applications. Unfortunately, it often becomes the performance bottleneck as it involves a compute-intensive scan of every byte of packet payload. With trends towards increasing network bandwidth and larger rulesets of complex patterns, the performance requirement gets ever more demanding.

In this paper, we present Hyperscan, a high performance regular expression matcher for commodity server machines. Hyperscan employs two core techniques for efficient pattern matching. First, it exploits graph decomposition that translates regular expression matching into a series of string and finite automata matching. Unlike existing solutions, string matching becomes a part of regular expression matching, eliminating duplicate operations. Decomposed regular expression components also increase the chance of fast DFA matching as they tend to be smaller than the original pattern. Second, Hyperscan accelerates both string and finite automata matching using SIMD operations, which brings substantial throughput improvement. Our evaluation shows that Hyperscan improves the performance of Snort by a factor of 8.7 for a real traffic trace.

Thursday February 28, 2019 11:05am - 11:30am EST
Constitution Ballroom

11:30am EST

Deniable Upload and Download via Passive Participation
Downloading or uploading controversial information can put users at risk, making them hesitant to access or share such information. While anonymous communication networks (ACNs) are designed to hide communication meta-data, merely connecting to an ACN can raise suspicion. In order to enable plausible deniability while providing or accessing controversial information, we design CoverUp: a system that enables users to asynchronously upload and download data. The key idea is to involve visitors of a collaborating website. This website serves a JavaScript snippet which, after the user's consent, produces cover traffic for the controversial site or content. This cover traffic is indistinguishable from the traffic of participants interested in the controversial content; hence, they can deny that they actually up- or downloaded any data.

CoverUp provides a feed-receiver that achieves a downlink rate of 10 to 50 Kbit/s. The indistinguishability guarantee of the feed-receiver holds against strong global network-level attackers who control everything except for the user's machine. We extend CoverUp to a full upload and download system with a rate of 10 to 50 Kbit/s. In this case, we additionally need the integrity of the JavaScript snippet, for which we introduce a trusted party. The analysis of our prototype shows a very small timing leakage, even after half a year of continual observation. Finally, as passive participation raises ethical and legal concerns for collaborating websites and their visitors, we discuss these concerns and describe how they can be addressed.

Speakers
DS

David Sommer

ETH Zurich
AD

Aritra Dhar

ETH Zurich
LM

Luka Malisa

ETH Zurich
DR

Daniel Ronzani

Ronzani Schlauri Attorneys
SC

Srdjan Capkun

ETH Zurich


Thursday February 28, 2019 11:30am - 11:55am EST
Constitution Ballroom

11:30am EST

OpenEC: Toward Unified and Configurable Erasure Coding Management in Distributed Storage Systems
Erasure coding has become a practical redundancy technique for distributed storage systems to achieve fault tolerance with low storage overhead. Given its popularity, research studies have proposed theoretically proven erasure codes or efficient repair algorithms to make erasure coding more viable. However, integrating new erasure coding solutions into existing distributed storage systems is a challenging task and requires non-trivial re-engineering of the underlying storage workflows. We present OpenEC, a unified and configurable framework for readily deploying a variety of erasure coding solutions into existing distributed storage systems. OpenEC decouples erasure coding management from the storage workflows of distributed storage systems, and provides erasure coding designers with configurable controls of erasure coding operations through a directed-acyclic-graph-based programming abstraction. We prototype OpenEC on two versions of HDFS with limited code modifications. Experiments on a local cluster and Amazon EC2 show that OpenEC preserves both the operational performance and the properties of erasure coding solutions; OpenEC can also automatically optimize erasure coding operations to improve repair performance.

Speakers
XL

Xiaolu Li

The Chinese University of Hong Kong
RL

Runhui Li

The Chinese University of Hong Kong
PP

Patrick P. C. Lee

The Chinese University of Hong Kong
YH

Yuchong Hu

Huazhong University of Science and Technology


Thursday February 28, 2019 11:30am - 12:00pm EST
Grand Ballroom

11:55am EST

CAUDIT: Continuous Auditing of SSH-Servers To Mitigate Brute-Force Attacks
This paper describes CAUDIT, an operational system deployed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. CAUDIT is a fully automated system to enable the identification and exclusion of hosts that are vulnerable to SSH brute-force attacks. Its key features include: 1) a honeypot for attracting SSH-based attacks over a /16 IP address range and extracting key metadata (e.g., source IP, password, SSH-client version, or key) from these attacks; 2) executing audits on the live production network by replaying attack attempts recorded by the honeypot; 3) using the IP addresses recorded by the honeypot to block SSH attack attempts at the network border using a Black Hole Router (BHR) while significantly reducing the load on NCSA's security monitoring system; and 4) informing peer sites of attack attempts in real-time to ensure containment of coordinated attacks. The system is composed of existing techniques with custom-built components, and its novelty is to execute at a scale that has not been validated earlier (thousands of nodes and tens of millions of attack attempts per day). Experience over 463 days shows that CAUDIT successfully blocks an average of 57 million attack attempts on a daily basis using the proposed BHR. This represents a 66× reduction in the number of SSH attempts compared to the daily average and has reduced traffic to the NCSA internal network-security-monitoring infrastructure by 78%.


Thursday February 28, 2019 11:55am - 12:20pm EST
Constitution Ballroom

12:00pm EST

Cluster storage systems gotta have HeART: improving storage efficiency by exploiting disk-reliability heterogeneity
Large-scale cluster storage systems typically consist of a heterogeneous mix of storage devices with significantly varying failure rates. Despite such differences among devices, redundancy settings are generally configured in a one-scheme-for-all fashion. In this paper, we make a case for exploiting reliability heterogeneity to tailor redundancy settings to different device groups. We present HeART, an online tuning tool that guides selection of, and transitions between, redundancy settings for long-term data reliability, based on observed reliability properties of each disk group. By processing disk failure data over time, HeART identifies the boundaries and steady-state failure rate for each deployed disk group (e.g., by make/model). Using this information, HeART suggests the most space-efficient redundancy option allowed that will achieve the specified target data reliability. Analysis of longitudinal failure data for a large production storage cluster shows the robustness of HeART's failure-rate determination algorithms. The same analysis shows that a storage system guided by HeART could provide target data reliability levels with fewer disks than one-scheme-for-all approaches: 11–16% fewer compared to erasure codes like 10-of-14 or 6-of-9 and 33% fewer compared to 3-way replication.
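The selection step described in this abstract can be caricatured with a simple sketch (this is not HeART's actual reliability model; the binomial loss model, candidate schemes, and targets below are illustrative): given an observed per-disk failure probability over a repair window, a k-of-n scheme loses data when more than n-k of its n disks fail, so we pick the most space-efficient candidate whose loss probability meets the target.

```python
import math

def loss_probability(n, k, p):
    """Probability that more than n-k of n disks fail (independent
    failures with per-disk probability p over a repair window)."""
    return sum(math.comb(n, f) * p**f * (1 - p)**(n - f)
               for f in range(n - k + 1, n + 1))

def pick_scheme(candidates, p, target):
    """Among (k, n) candidates meeting the reliability target,
    return the one with the lowest storage overhead n/k."""
    ok = [(n / k, (k, n)) for (k, n) in candidates
          if loss_probability(n, k, p) <= target]
    return min(ok)[1] if ok else None
```

For example, with a per-window failure probability of 0.001, a 10-of-14 code (overhead 1.4×) can meet a stringent loss target that 3-way replication (overhead 3×) also meets, which is the kind of space saving the abstract quantifies.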

Speakers
SK

Saurabh Kadekodi

Carnegie Mellon University
KV

K. V. Rashmi

Carnegie Mellon University
GR

Gregory R. Ganger

Carnegie Mellon University


Thursday February 28, 2019 12:00pm - 12:30pm EST
Grand Ballroom

12:20pm EST

Lunch (on your own)
Thursday February 28, 2019 12:20pm - 1:50pm EST
N/A

12:30pm EST

ScaleCheck: A Single-Machine Approach for Discovering Scalability Bugs in Large Distributed Systems
We present ScaleCheck, an approach for discovering scalability bugs (a new class of bug in large storage systems) and for democratizing large-scale testing. ScaleCheck employs a program analysis technique for finding potential causes of scalability bugs, and a series of colocation techniques for testing implementation code at real scales while doing so on just a commodity PC. ScaleCheck has been integrated into several large-scale storage systems (Cassandra, HDFS, Riak, and Voldemort) and successfully exposed known and unknown scalability bugs, up to 512-node scale on a 16-core PC.

Speakers
CA

Cesar A. Stuardo

University of Chicago
TL

Tanakorn Leesatapornwongsa

Samsung Research America
RO

Riza O. Suminto

University of Chicago
HK

Huan Ke

University of Chicago
JF

Jeffrey F. Lukman

University of Chicago
SL

Shan Lu

University of Chicago
HS

Haryadi S. Gunawi

University of Chicago


Thursday February 28, 2019 12:30pm - 1:00pm EST
Grand Ballroom

1:50pm EST

Dataplane equivalence and its applications
Network verification promises to find rare bugs in networks, but using it requires that administrators (completely) characterize the expected behavior of the network in formal languages such as Datalog or CTL. The difficulty of achieving this task hampers wide deployment of verification. We propose to use equivalence between different network dataplanes as an implicit, simpler way to specify the required correctness properties. While equivalence is a well-known undecidable problem for general-purpose programs, we show that for network dataplanes without infinite loops it is decidable and can be checked efficiently. We present netdiff, an algorithm that checks the equivalence of two network dataplanes implemented in the SEFL language by using symbolic execution [11]. We implement netdiff and use it to catch a variety of bugs in OpenStack Neutron, P4 programs, and network dataplane updates. Our evaluation highlights that equivalence is an easy way to find bugs, scales well to relatively large programs, and discovers subtle issues otherwise difficult to find.

Speakers
DD

Dragos Dumitrescu

University Politehnica of Bucharest
RS

Radu Stoenescu

University Politehnica of Bucharest
MP

Matei Popovici

University Politehnica of Bucharest
LN

Lorina Negreanu

University Politehnica of Bucharest
CR

Costin Raiciu

University Politehnica of Bucharest


Thursday February 28, 2019 1:50pm - 2:15pm EST
Constitution Ballroom

2:15pm EST

Alembic: Automated Model Inference for Stateful Network Functions
Network operators today deploy a wide range of complex stateful network functions (NFs). They typically only have access to the NFs’ binary executables, configuration interfaces, and manuals from vendors. To ensure correct behavior of NFs, operators use network testing and verification tools, which typically rely on models of the deployed NFs. The effectiveness of these tools depends upon the fidelity of such models. Today, models are handwritten, which is error-prone and tedious and does not account for implementation-specific artifacts. To address this gap, our goal is to automatically infer behavioral models of stateful NFs for a given configuration. The problem is challenging because NF configurations can contain diverse rule types and the space of dynamic and stateful NF behaviors is large. In this work, we present Alembic, which synthesizes NF models viewed as an ensemble of finite-state machines (FSMs). Alembic consists of an offline stage that learns symbolic FSM representations for each NF rule type and a fast online stage that generates a concrete behavioral model for a given configuration using these symbolic FSMs. We demonstrate that Alembic is accurate and scalable and that it sheds light on subtle differences across NF implementations.
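A hand-written sketch of the kind of model Alembic infers automatically (state and event names are our own, not Alembic's representation): a stateful firewall rule expressed as a per-connection finite-state machine.

```python
class FirewallFSM:
    """Toy FSM: allow outside->inside traffic only after an inside host initiates."""

    def __init__(self):
        self.state = "CLOSED"

    def step(self, event):
        if self.state == "CLOSED" and event == "inside_syn":
            self.state = "ESTABLISHED"
            return "forward"
        if self.state == "ESTABLISHED":
            return "forward"
        return "drop"  # unsolicited outside traffic while no state exists

fsm = FirewallFSM()
print(fsm.step("outside_pkt"))  # drop: no inside-initiated connection yet
print(fsm.step("inside_syn"))   # forward, and connection state is created
print(fsm.step("outside_pkt"))  # forward: reply traffic is now allowed
```

A configuration with many such rules yields an ensemble of FSMs like this one, which is the form of model the paper's online stage instantiates.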

Speakers
SM

Soo-Jin Moon

Carnegie Mellon University
JH

Jeffrey Helt

Princeton University
YY

Yifei Yuan

Intentionet
YB

Yves Bieri

ETH Zurich
SB

Sujata Banerjee

VMware Research
VS

Vyas Sekar

Carnegie Mellon University
WW

Wenfei Wu

Tsinghua University
MY

Mihalis Yannakakis

Columbia University
YZ

Ying Zhang

Facebook, Inc.


Thursday February 28, 2019 2:15pm - 2:40pm EST
Constitution Ballroom

2:40pm EST

Model-Agnostic and Efficient Exploration of Numerical State Space of Real-World TCP Congestion Control Implementations
The significant impact of TCP congestion control on the Internet highlights the importance of testing the correctness and performance of congestion control algorithm implementations (CCAIs) in various network environments. Many CCAI testing questions can be answered by exploring the numerical state space of CCAIs, which is defined by a group of numerical (and nonnumerical) state variables of the CCAIs. However, the current practices for automated numerical state space exploration are either limited by the approximate abstract CCAI models or inefficient due to the large space of network environment parameters and the complicated relation between the CCAI states and network environment parameters. In this paper, we propose an automated numerical state space exploration method, called ACT, which leverages the model-agnostic feature of random testing and greatly improves its efficiency by guiding random testing with the feedback iteratively obtained in a test. Our experiments on five representative Linux TCP CCAIs show that ACT can explore a large numerical state space more efficiently than manual testing, undirected random testing, and symbolic-execution-based testing, without requiring any abstract CCAI models. ACT successfully detects multiple implementation bugs and design issues in these Linux TCP CCAIs, including new bugs and issues not previously reported.
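A heavily simplified sketch of feedback-guided random testing in the spirit of ACT (our own toy, not ACT's algorithm): random environment samples are biased toward the region that most recently uncovered a new state of the system under test. The stand-in "CCAI" and all parameter ranges are illustrative.

```python
import random

def toy_cca_state(loss, rtt_ms):
    # Stand-in for a congestion-control implementation: map a network
    # environment to a coarse numerical state (a bucketed cwnd-like value).
    cwnd = max(1, int(100 * (1 - loss) / max(rtt_ms, 1)))
    return min(cwnd, 10)

def guided_explore(trials=500, seed=0):
    rng = random.Random(seed)
    seen, center = set(), (0.5, 50.0)
    for _ in range(trials):
        # Sample near the last productive environment.
        loss = min(1.0, max(0.0, rng.gauss(center[0], 0.2)))
        rtt = min(200.0, max(1.0, rng.gauss(center[1], 20.0)))
        state = toy_cca_state(loss, rtt)
        if state not in seen:        # feedback: a new state steers the search
            seen.add(state)
            center = (loss, rtt)
    return seen

print(sorted(guided_explore()))
```

Undirected random testing would sample the environment space uniformly; the feedback loop above concentrates trials where new states keep appearing, which is the efficiency argument the abstract makes.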

Speakers
WS

Wei Sun

University of Nebraska-Lincoln
LX

Lisong Xu

University of Nebraska-Lincoln
SE

Sebastian Elbaum

University of Virginia
DZ

Di Zhao

University of Nebraska-Lincoln


Thursday February 28, 2019 2:40pm - 3:05pm EST
Constitution Ballroom

3:05pm EST

Break with Refreshments
Thursday February 28, 2019 3:05pm - 3:35pm EST
Grand Ballroom Foyer

3:35pm EST

Scaling Community Cellular Networks with CommunityCellularManager
Hundreds of millions of people still live beyond the coverage of basic mobile connectivity, primarily in rural areas with low population density. Mobile-network operators (MNOs) traditionally struggle to justify expansion into these rural areas due to the high infrastructure costs necessary to provide service. Community cellular networks, networks built "by and for" the people they serve, represent an alternative model that, to an extent, bypasses these business case limitations and enables sustainable rural coverage. Yet despite aligned economic incentives, real deployments of community cellular networks still face significant regulatory, commercial and technical challenges.

In this paper, we present CommunityCellularManager (CCM), a system for operating community cellular networks at scale. CCM enables multiple community networks to operate under the control of a single, multi-tenant controller and in partnership with a traditional MNO. CCM preserves flexibility for each community network to operate independently, while allowing the mobile network operator to safely make critical resources such as spectrum and phone numbers available to these networks. We evaluate CCM through a multi-year, large-scale community cellular network deployment in the Philippines in partnership with a traditional MNO, providing basic communication services to over 2,000 people in 15 communities without requiring changes to the existing regulatory framework, and using existing handsets. We demonstrate that CCM can support independent community networks with unique service offerings and operating models while providing a basic level of MNO-defined service. To our knowledge, this represents the largest deployment of community cellular networks to date.

Speakers
SH

Shaddi Hasan

UC Berkeley
MC

Mary Claire Barela

University of the Philippines, Diliman
MJ

Matthew Johnson

University of Washington
EB

Eric Brewer

UC Berkeley
KH

Kurtis Heimerl

University of Washington


Thursday February 28, 2019 3:35pm - 4:00pm EST
Constitution Ballroom

4:00pm EST

TrackIO: Tracking First Responders Inside-Out
First responders, a critical lifeline of any society, often find themselves in precarious situations. The ability to track them in real time in unknown indoor environments would significantly contribute to the success of their mission as well as their safety. In this work, we present the design, implementation, and evaluation of TrackIO, a system capable of accurately localizing and tracking mobile responders in real time in large indoor environments. TrackIO leverages the mobile virtual infrastructure offered by unmanned aerial vehicles (UAVs), coupled with the balanced penetration-accuracy tradeoff offered by ultra-wideband (UWB), to accomplish this objective directly from outside, without relying on access to any indoor infrastructure. Towards a practical system, TrackIO incorporates four novel mechanisms in its design that address key challenges to enable tracking responders (i) who are mobile with potentially non-uniform velocities (e.g., during turns), (ii) deep indoors with challenged reachability, (iii) in real time even for a large network, and (iv) with high accuracy even when impacted by the UAV’s position error. TrackIO’s real-world performance reveals that it can track static nodes with a median accuracy of about 1–1.5m and mobile (even running) nodes with a median accuracy of 2–2.5m in large buildings in real time.
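The geometric core of range-based localization like TrackIO's can be illustrated with textbook 2D trilateration (this is a generic sketch, not TrackIO's estimator, which must also handle mobility and anchor-position error): given distances to three known anchor positions, subtracting one circle equation from the others yields a linear system for the target position.

```python
import math

def trilaterate(anchors, ranges):
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = ranges
    # Subtract the first circle equation from the other two: a 2x2 linear system.
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]   # e.g., UWB units on UAVs
target = (3.0, 4.0)
ranges = [math.dist(target, a) for a in anchors]
print(trilaterate(anchors, ranges))  # recovers (3.0, 4.0) from clean ranges
```

With noisy UWB ranges and a moving target, a single solve like this is not enough, which is exactly the gap the paper's four mechanisms address.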

Speakers
AD

Ashutosh Dhekne

University of Illinois at Urbana-Champaign
AC

Ayon Chakraborty

NEC Labs America, Inc.
KS

Karthikeyan Sundaresan

NEC Labs America, Inc.
SR

Sampath Rangarajan

NEC Labs America, Inc.


Thursday February 28, 2019 4:00pm - 4:25pm EST
Constitution Ballroom

4:25pm EST

3D Backscatter Localization for Fine-Grained Robotics
This paper presents the design and implementation of TurboTrack, a 3D localization system for fine-grained robotic tasks. TurboTrack's unique capability is that it can localize backscatter nodes with sub-centimeter accuracy without any constraints on their locations or mobility. TurboTrack makes two key technical contributions. First, it presents a pipelined architecture that can extract a sensing bandwidth from every single backscatter packet that is three orders of magnitude larger than the backscatter communication bandwidth. Second, it introduces a Bayesian space-time super-resolution algorithm that combines time series of the sensed bandwidth across multiple antennas to enable accurate positioning. Our experiments show that TurboTrack simultaneously achieves sub-centimeter median accuracy in each of the x/y/z dimensions and a 99th-percentile latency of less than 7.5 milliseconds in 3D localization. This enables TurboTrack's real-time prototype to achieve fine-grained positioning for agile robotic tasks, as we demonstrate in multiple collaborative applications with robotic arms and nanodrones, including indoor tracking, packaging, assembly, and handover.

Speakers
ZL

Zhihong Luo

MIT Media Lab
QZ

Qiping Zhang

MIT Media Lab
YM

Yunfei Ma

MIT Media Lab
MS

Manish Singh

MIT Media Lab
FA

Fadel Adib

MIT Media Lab


Thursday February 28, 2019 4:25pm - 4:50pm EST
Constitution Ballroom

4:50pm EST

Many-to-Many Beam Alignment in Millimeter Wave Networks
Millimeter wave (mmWave) networks can deliver multi-Gbps wireless links that use extremely narrow directional beams. This provides us with a new opportunity to exploit spatial reuse in order to scale network throughput. Exploiting such spatial reuse, however, requires aligning the beams of all nodes in a network. Aligning the beams is difficult: indoor multipath can create interference between seemingly separated beams, and carrier sense is inefficient at detecting interference on directional links. This paper presents BounceNet, the first many-to-many millimeter wave beam alignment protocol that can exploit dense spatial reuse to allow many links to operate in parallel in a confined space and scale the wireless throughput with the number of clients. Results from three millimeter wave testbeds show that BounceNet can scale the throughput with the number of clients to deliver a total network data rate of more than 39 Gbps for 10 clients, which is up to 6.6x higher than current 802.11 mmWave standards.
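Once interference between beams is known, scheduling for spatial reuse reduces to coloring a conflict graph. The sketch below is a plain greedy coloring, not BounceNet's protocol, and the link names and conflict relation are made up: conflicting links get different slots, while non-conflicting links share a slot and transmit in parallel.

```python
def schedule(links, conflicts):
    # Greedy conflict-graph coloring: assign each link the lowest slot not
    # already taken by one of its conflicting neighbors.
    slots = {}
    for link in links:
        taken = {slots[o] for o in conflicts.get(link, ()) if o in slots}
        slots[link] = next(s for s in range(len(links)) if s not in taken)
    return slots

links = ["AP1-c1", "AP1-c2", "AP2-c3", "AP3-c4"]
conflicts = {                        # symmetric interference relation
    "AP1-c1": ["AP1-c2"],            # one AP cannot serve two beams at once
    "AP1-c2": ["AP1-c1", "AP2-c3"],  # a multipath bounce couples these links
    "AP2-c3": ["AP1-c2"],
}
print(schedule(links, conflicts))
```

Here "AP1-c1", "AP2-c3", and "AP3-c4" can all share one slot, so three links are active simultaneously; this kind of parallelism is what lets throughput scale with the number of clients.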


Thursday February 28, 2019 4:50pm - 5:15pm EST
Constitution Ballroom
 