NSDI '24 Technical Sessions

Tuesday, April 16

8:00 am–8:55 am

Continental Breakfast

8:55 am–9:10 am

Opening Remarks and Awards

Laurent Vanbever, ETH Zürich; Irene Zhang, Microsoft Research

9:10 am–10:30 am

Track 1

Clouds but Faster

Horus: Granular In-Network Task Scheduler for Cloud Datacenters

Parham Yassini, Simon Fraser University; Khaled Diab, Hewlett Packard Labs; Saeed Zangeneh and Mohamed Hefeeda, Simon Fraser University

Short-lived tasks are prevalent in modern interactive datacenter applications. However, designing schedulers to assign these tasks to workers distributed across the whole datacenter is challenging, because such schedulers need to make decisions at a microsecond scale, achieve high throughput, and minimize the tail response time. Current task schedulers in the literature are limited to individual racks. We present Horus, a new in-network task scheduler for short tasks that operates at the datacenter scale. Horus efficiently tracks and distributes the worker state among switches, which enables it to schedule tasks in parallel at line rate while optimizing the scheduling quality. We propose a new distributed task scheduling policy that minimizes the state and communication overheads, handles dynamic loads, and does not buffer tasks in switches. We compare Horus against the state-of-the-art in-network scheduler in a testbed with programmable switches as well as using simulations of datacenters with more than 27K hosts and thousands of switches handling diverse and dynamic workloads. Our results show that Horus efficiently scales to large datacenters, and it substantially outperforms the state-of-the-art across all performance metrics, including tail response time and throughput.
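
For context on the scheduling-quality goal, the sketch below shows power-of-two-choices dispatch over tracked worker loads, the textbook baseline that load-aware in-network schedulers refine; it is illustrative only, not Horus's actual distributed policy.

    import random

    # Power-of-two-choices dispatch over tracked (possibly stale) worker loads.
    # Textbook baseline for illustration, not Horus's distributed policy.
    class Scheduler:
        def __init__(self, num_workers):
            self.load = [0] * num_workers

        def dispatch(self):
            # Sample two workers at random; send the task to the less loaded one.
            a, b = random.sample(range(len(self.load)), 2)
            worker = a if self.load[a] <= self.load[b] else b
            self.load[worker] += 1
            return worker

        def complete(self, worker):
            self.load[worker] -= 1

    sched = Scheduler(num_workers=64)
    w = sched.dispatch()   # task assigned
    sched.complete(w)      # task finished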

Track 2

Scheduling the Network

Sifter: An Inversion-Free and Large-Capacity Programmable Packet Scheduler

Peixuan Gao, Anthony Dalleggio, Jiajin Liu, and Chen Peng, New York University; Yang Xu, Fudan University; H. Jonathan Chao, New York University

Packet schedulers play a crucial role in determining the order in which packets are served. They achieve this by assigning a rank to each packet and sorting them based on these ranks. However, when dealing with a large number of flows at high packet rates, sorting functions can become extremely complex and time-consuming. To address this issue, fast-approximating packet schedulers have been proposed, but they come with the risk of producing scheduling errors, or packet inversions, which can lead to undesirable consequences. We present Sifter, a programmable packet scheduler that offers high accuracy and large capacity while ensuring inversion-free operation. Sifter employs a unique sorting technique called “Sift Sorting” to coarsely sort packets with larger ranks into buckets, while accurately and finely sorting those with smaller ranks using a small Push-In-First-Out (PIFO) queue in parallel. The sorting process takes advantage of the “Speed-up Factor”, which is a function of the memory bandwidth to output link bandwidth ratio, to achieve Sift Sorting and ensure accurate scheduling with low resource consumption. Sifter combines the benefits of PIFO’s accuracy and FIFO-based schedulers’ large capacity, resulting in guaranteed delivery of packets in an accurate scheduling order. Our simulation results demonstrate Sifter’s efficiency in achieving inversion-free scheduling, while the FPGA-based hardware prototype validates that Sifter supports a throughput of 100Gbps without packet inversion errors.
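
A rough illustration of the coarse-plus-fine split behind Sift Sorting: small ranks live in an exact PIFO (a heap here) while larger ranks are parked in coarse FIFO buckets that are sifted into the PIFO as it drains. The threshold and bucket width are assumptions, and unlike this toy, the real Sifter uses its memory-bandwidth Speed-up Factor to guarantee inversion-freeness.

    import heapq
    from collections import deque

    PIFO_BOUND = 1024      # ranks below this go to the exact (heap) PIFO
    BUCKET_WIDTH = 1024    # coarse FIFO buckets for larger ranks

    pifo = []              # exact fine-grained ordering
    buckets = {}           # bucket index -> FIFO of (rank, pkt)

    def enqueue(rank, pkt):
        if rank < PIFO_BOUND:
            heapq.heappush(pifo, (rank, pkt))
        else:
            buckets.setdefault(rank // BUCKET_WIDTH, deque()).append((rank, pkt))

    def dequeue():
        if not pifo and buckets:
            # "Sift" the lowest coarse bucket into the exact PIFO.
            for item in buckets.pop(min(buckets)):
                heapq.heappush(pifo, item)
        return heapq.heappop(pifo)[1] if pifo else None

    enqueue(10, "p1"); enqueue(5000, "p2"); enqueue(3, "p3")
    print(dequeue(), dequeue(), dequeue())   # p3 p1 p2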

Flow Scheduling with Imprecise Knowledge

Wenxin Li, Xin He, Yuan Liu, and Keqiu Li, Tianjin University; Kai Chen, Hong Kong University of Science and Technology and University of Science and Technology of China; Zhao Ge and Zewei Guan, Tianjin University; Heng Qi, Dalian University of Technology; Song Zhang, Tianjin University; Guyue Liu, New York University Shanghai

Most existing data center network (DCN) flow scheduling solutions aim to minimize flow completion times (FCT). However, these solutions either require precise flow information (e.g., per-flow size), which is challenging to implement on commodity switches (e.g., pFabric [7]), or use no prior flow information at all, which comes at the cost of performance (e.g., PIAS [9]). In this work, we present QCLIMB, a new flow scheduling solution designed to minimize FCT by utilizing imprecise flow information. Our key observation is that although obtaining precise flow information can be challenging, it is possible to accurately estimate each flow's lower and upper bounds with machine learning techniques.

QCLIMB has two key parts: i) a novel scheduling algorithm that leverages the lower bounds of different flows to prioritize small flows over large flows from the beginning of transmission, rather than at later stages; and ii) an efficient out-of-order handling mechanism that addresses the practical reordering issues resulting from the algorithm. We show that QCLIMB significantly outperforms PIAS (88% lower average FCT for small flows) and is surprisingly close to pFabric (around a 9% gap) while not requiring any switch modifications.
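
A minimal sketch of the first part: serving flows shortest-remaining-lower-bound-first means a predicted-small flow wins from its first packet. The constants and the end-of-bound behavior are assumptions; QCLIMB's full algorithm also uses upper bounds and handles the resulting reordering.

    import heapq

    flows = []   # min-heap of [remaining lower bound, flow id]

    def admit(flow_id, size_lower_bound):
        heapq.heappush(flows, [size_lower_bound, flow_id])

    def send_one_packet():
        if not flows:
            return None
        entry = flows[0]                    # smallest remaining lower bound wins
        entry[0] = max(0, entry[0] - 1500)  # one MTU of progress
        if entry[0] == 0:                   # bound exhausted; the real system
            heapq.heappop(flows)            # demotes the flow rather than dropping it
        return entry[1]

    admit("small", 3000); admit("large", 10**6)
    print([send_one_packet() for _ in range(3)])   # ['small', 'small', 'large']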

Pudica: Toward Near-Zero Queuing Delay in Congestion Control for Cloud Gaming

Shibo Wang, Xi'an Jiaotong University; Shusen Yang, Xi'an Jiaotong University; Xiao Kong, Chenglei Wu, and Longwei Jiang, Tencent; Chenren Xu, Peking University; Cong Zhao, Xi'an Jiaotong University; Xuesong Yang, Bonree; Jianjun Xiao and Xin Liu, Tencent; Changxi Zheng, Columbia University; Jing Wang and Honghao Liu, Tencent

10:30 am–11:00 am

Break with Refreshments

11:00 am–12:40 pm

Track 1

Serverless

Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices

Zibo Wang, University of Science and Technology of China and Microsoft Research; Pinghe Li, ETH Zurich; Chieh-Jan Mike Liang, Microsoft Research; Feng Wu, University of Science and Technology of China; Francis Y. Yan, Microsoft Research

Achieving resource efficiency while preserving end-user experience is non-trivial for cloud application operators. As cloud applications progressively adopt microservices, resource managers are faced with two distinct levels of system behavior: end-to-end application latency and per-service resource usage. Translating between the two levels, however, is challenging because user requests traverse heterogeneous services that collectively (but unevenly) contribute to the end-to-end latency. We present Autothrottle, a bi-level resource management framework for microservices with latency SLOs (service-level objectives). It architecturally decouples application SLO feedback from service resource control, and bridges them through the notion of performance targets. Specifically, an application-wide learning-based controller is employed to periodically set performance targets—expressed as CPU throttle ratios—for per-service heuristic controllers to attain. We evaluate Autothrottle on three microservice applications, with workload traces from production scenarios. Results show superior CPU savings, up to 26.21% over the best-performing baseline and up to 93.84% over all baselines.
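
A toy rendering of the bi-level split, with a simple proportional rule standing in for Autothrottle's learned application-wide controller; all interfaces and constants here are assumptions.

    # Slow application-wide loop: turn SLO feedback into per-service
    # CPU-throttle targets (lower target = less throttling tolerated).
    def application_controller(p99_ms, slo_ms, targets):
        scale = 0.9 if p99_ms > slo_ms else 1.05   # violated -> tolerate less throttling
        return {svc: min(t * scale, 1.0) for svc, t in targets.items()}

    # Fast per-service loop: adjust the CPU quota until the observed
    # throttle ratio matches the target handed down from above.
    def service_controller(cpu_limit, observed_ratio, target_ratio):
        if observed_ratio > target_ratio:
            return cpu_limit * 1.1    # throttled too often -> grant more CPU
        return cpu_limit * 0.95       # comfortably under target -> reclaim CPU

    targets = {"frontend": 0.2, "cart": 0.3}
    targets = application_controller(p99_ms=210, slo_ms=200, targets=targets)
    new_limit = service_controller(cpu_limit=2.0, observed_ratio=0.25,
                                   target_ratio=targets["frontend"])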

Track 2

Network Protocols

A large-scale deployment of DCTCP

Abhishek Dhamija and Balasubramanian Madhavan, Meta; Hechao Li, Netflix; Jie Meng, Meta; Lawrence Brakmo, unaffiliated; Madhavi Rao, Neil Spring, Prashanth Kannan, and Srikanth Sundaresan, Meta; Soudeh Ghorbani, Johns Hopkins University & Meta

TECC: Towards Efficient QUIC Tunneling via Collaborative Transmission Control

Jiaxing Zhang, Alibaba Group, University of Chinese Academy of Sciences; Furong Yang, Alibaba Group; Ting Liu, Alibaba Group, University of Chinese Academy of Sciences; Qinghua Wu, University of Chinese Academy of Sciences, Purple Mountain Laboratories, China; Wu Zhao, Yuanbo Zhang, Wentao Chen, Yanmei Liu, Hongyu Guo, and Yunfei Ma, Alibaba Group; Zhenyu Li, University of Chinese Academy of Sciences, Purple Mountain Laboratories, China

In this paper, we present TECC, a system based on collaborative transmission control that mitigates the mismatch of sending behavior between the inner and outer connections to achieve efficient QUIC tunneling. In TECC, a feedback framework is implemented to enable end hosts to collect more precise network information that is sensed on the tunnel server, which assists the inner end-to-end connection to achieve better congestion control and loss recovery. Extensive experiments in emulated networks and real-world large-scale A/B tests demonstrate the efficiency of TECC. Specifically, compared with the state-of-the-art QUIC tunneling solution, TECC significantly reduces flow completion time. In emulated networks, TECC decreases flow completion time by 30% on average and 53% at the 99th percentile. TECC also gains a reduction in RPC (Remote Procedure Call) request completion time of 3.9% on average and 13.3% at the 99th percentile in large-scale A/B tests.

iStack: A General and Stateful Name-based Protocol Stack for Named Data Networking

Tianlong Li, Tian Song, and Yating Yang, Beijing Institute of Technology

Named Data Networking (NDN) shifts the network from host-centric to data-centric with a clean-slate design, in which packet forwarding is based on names and the data plane maintains per-packet state. Different forwarders have been implemented to provide NDN capabilities for various scenarios; however, a network stack integrated with the operating system (OS) for general-purpose use is still lacking. Designing a stateful and entirely name-based protocol stack in the OS kernel remains a challenge due to three factors: (i) an in-kernel name resolution architecture for packet demultiplexing is necessary, (ii) an entirely name-based stack needs to remain compatible with the current address (MAC/IP/port)-based architecture in the OS kernel, and (iii) maintaining per-packet state introduces a trade-off between performance and resource consumption.

In this paper, for the first time, we bring NDN into the OS kernel by proposing iStack, an Information-Centric Networking (ICN) protocol stack. The main innovations of iStack are threefold. First, we propose a name resolution architecture to support both network-layer forwarding and local packet demultiplexing. Second, a two-layer face system is proposed to provide an abstraction of address-based network interfaces. Third, we design socket-compatible interfaces to preserve the uniformity of the current network stack in the OS. Besides, we design compact forwarding data structures for fast packet processing with a low memory footprint. We have implemented prototypes on multiple platforms. The evaluation results show that iStack achieves 6.50 Gbps throughput, outperforming the NDN-testbed forwarder by a factor of 16.25x, and reduces forwarding latency by 46.08% for cached packets with its in-kernel packet caching. iStack is not just another forwarder for NDN, but a step forward for the practical development of ICN.
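
To make the first challenge concrete: name-based demultiplexing reduces to longest-prefix match over hierarchical name components. Below is a minimal component-trie sketch (hypothetical API and entries, not iStack's in-kernel code).

    # Longest-prefix match over hierarchical NDN names via a component trie.
    class NameTrie:
        def __init__(self):
            self.children, self.entry = {}, None

        def insert(self, name, entry):            # e.g. "/bit/videos"
            node = self
            for comp in name.strip("/").split("/"):
                node = node.children.setdefault(comp, NameTrie())
            node.entry = entry

        def lpm(self, name):                      # longest-prefix match
            node, best = self, None
            for comp in name.strip("/").split("/"):
                node = node.children.get(comp)
                if node is None:
                    break
                if node.entry is not None:
                    best = node.entry
            return best

    fib = NameTrie()
    fib.insert("/bit", "face0")                   # network-layer next hop
    fib.insert("/bit/videos", "socket42")         # local application socket
    assert fib.lpm("/bit/videos/a.mp4/seg0") == "socket42"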

12:40 pm–2:00 pm

Symposium Luncheon

2:00 pm–3:40 pm

Track 1

Distributed Systems: Part 1

Alea-BFT: Practical Asynchronous Byzantine Fault Tolerance

Diogo S. Antunes, IST (ULisboa) / INESC-ID; Afonso N. Oliveira, Three Sigma; André Breda and Matheus Guilherme Franco, IST (ULisboa) / INESC-ID; Henrique Moniz, Protocol Labs / INESC-ID; Rodrigo Rodrigues, IST (ULisboa) / INESC-ID

SwiftPaxos: Fast Geo-Replicated State Machines

Fedor Ryabinin, IMDEA Software Institute, Universidad Politécnica de Madrid; Alexey Gotsman, IMDEA Software Institute; Pierre Sutra, Télécom SudParis, INRIA

Cloud services improve their availability by replicating data across sites in different geographical regions. A variety of state-machine replication protocols have been proposed for this setting that reduce the latency under workloads with low contention. However, when contention increases, these protocols may deliver lower performance than Paxos. This paper introduces SwiftPaxos—a protocol that lowers the best-case latency in comparison to Paxos without hurting the worst-case one. SwiftPaxos executes a command in 2 message delays if there is no contention, and in 3 message delays otherwise. To achieve this, the protocol allows replicas to vote on the order in which they receive state-machine commands. Differently from previous protocols, SwiftPaxos permits a replica to vote twice: first for its own ordering proposal, and then to follow the leader. This mechanism avoids restarting the voting process when a disagreement occurs among replicas, saving computation time and message delays. Our evaluation shows that the throughput of SwiftPaxos is up to 2.9x better than state-of-the-art alternatives.

The Bedrock of Byzantine Fault Tolerance: A Unified Platform for BFT Protocols Analysis, Implementation, and Experimentation

Mohammad Javad Amiri, Stony Brook University; Chenyuan Wu, University of Pennsylvania; Divyakant Agrawal and Amr El Abbadi, UC Santa Barbara; Boon Thau Loo, University of Pennsylvania; Mohammad Sadoghi, UC Davis

Byzantine Fault-Tolerant (BFT) protocols cover a broad spectrum of design dimensions from infrastructure settings, such as the communication topology, to more technical features, such as commitment strategy and even fundamental social choice properties like order-fairness. The proliferation of different protocols has made it difficult to navigate the BFT landscape, let alone determine the protocol that best meets application needs. This paper presents Bedrock, a unified platform for BFT protocols analysis, implementation, and experimentation. Bedrock proposes a design space consisting of a set of dimensions and explores several design choices that capture the trade-offs between different design space dimensions. Within Bedrock, a wide range of BFT protocols can be implemented and uniformly evaluated under a unified deployment environment.

Track 2

Programming the Network: Part 1

The Eternal Tussle: Exploring the Role of Centralization in IPFS

Yiluo Wei, Hong Kong University of Science & Technology (GZ); Dennis Trautwein and Yiannis Psaras, Protocol Labs; Ignacio Castro, Queen Mary University of London; Will Scott, Protocol Labs; Aravindh Raman, Telefonica Research; Gareth Tyson, Hong Kong University of Science & Technology (GZ)

Empower Programmable Pipeline for Advanced Stateful Packet Processing

Yong Feng and Zhikang Chen, Tsinghua University; Haoyu Song, Futurewei Technologies; Yinchao Zhang, Hanyi Zhou, Ruoyu Sun, Wenkuo Dong, Peng Lu, Shuxin Liu, and Chuwen Zhang, Tsinghua University; Yang Xu, Fudan University; Bin Liu, Tsinghua University

Programmable pipelines offer flexible, high-throughput packet processing, but only to some extent. When more advanced data-plane functions beyond basic packet processing and forwarding are desired, the pipeline becomes handicapped. The fundamental reason is that most stateful operations require backward cross-stage data passing and pipeline stalling for state updates and consistency, which are at odds with a standard pipeline. To solve the problem, we augment the pipeline with a low-cost yet fast side ring to facilitate backward data passing. We further apply speculative execution to avoid pipeline stalling. The resulting architecture, RAPID, supports native and generic stateful function programming using the enhanced P4 language. We build an FPGA-based prototype to evaluate the system, and a software emulator to assess the cost and performance of an ASIC implementation. We realize several stateful applications enabled by RAPID to show how it extends a programmable data plane's potential to a new level.

3:40 pm–4:10 pm

Break with Refreshments

4:10 pm–5:50 pm

Track 1

Video

GRACE: Loss-Resilient Real-Time Video through Neural Codecs

Yihua Cheng, Ziyi Zhang, Hanchen Li, Anton Arapin, and Yuhan Liu, University of Chicago; Qizheng Zhang, Stanford University; Xu Zhang and Kuntai Du, University of Chicago; Francis Y. Yan, Microsoft Research; Amrita Mazumdar, Nvidia; Nick Feamster and Junchen Jiang, University of Chicago

Gemino: Practical and Robust Neural Compression for Video Conferencing

Vibhaalakshmi Sivaraman, Pantea Karimi, Vedantha Venkatapathy, and Mehrdad Khani, Massachusetts Institute of Technology; Sadjad Fouladi, Microsoft Research; Mohammad Alizadeh, Fredo Durand, and Vivienne Sze, Massachusetts Institute of Technology

Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs simply cannot operate at extremely low bitrates. Recently, several neural alternatives have been proposed that reconstruct talking head videos at very low bitrates using sparse representations of each frame such as facial landmark information. However, these approaches produce poor reconstructions in scenarios with major movement or occlusions over the course of a call, and do not scale to higher resolutions. We design Gemino, a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline. Gemino upsamples a very low-resolution version of each target frame while enhancing high-frequency details (e.g., skin texture, hair, etc.) based on information extracted from a single high-resolution reference image. We use a multi-scale architecture that runs different components of the model at different resolutions, allowing it to scale to resolutions comparable to 720p, and we personalize the model to learn specific details of each person, achieving much better fidelity at low bitrates. We implement Gemino atop aiortc, an open-source Python implementation of WebRTC, and show that it operates on 1024x1024 videos in real-time on a Titan X GPU, and achieves 2.2–5x lower bitrate than traditional video codecs for the same perceptual quality.

ARTEMIS: Adaptive Bitrate Ladder Optimization for Live Video Streaming

Farzad Tashtarian, Alpen-Adria-Universität Klagenfurt; Abdelhak Bentaleb, Concordia University; Hadi Amirpour, Alpen-Adria-Universität Klagenfurt; Sergey Gorinsky, IMDEA Networks Institute; Junchen Jiang, University of Chicago; Hermann Hellwagner and Christian Timmerer, Alpen-Adria-Universität Klagenfurt

Live streaming of segmented videos over the Hypertext Transfer Protocol (HTTP) is increasingly popular and serves heterogeneous clients by offering each segment in multiple representations. A bitrate ladder expresses this choice as an ordered list of bitrate-resolution pairs. Whereas existing solutions for HTTP-based live streaming use a static bitrate ladder, fixed ladders struggle to appropriately accommodate the dynamics of the video content and network-conditioned client capabilities. This paper proposes ARTEMIS as a practical, scalable alternative that dynamically configures the bitrate ladder depending on the content complexity, network conditions, and client statistics. ARTEMIS seamlessly integrates with the end-to-end streaming pipeline and operates transparently to video encoders and clients. We develop a cloud-based implementation of ARTEMIS and conduct extensive real-world and trace-driven experiments. The experimental comparison with existing prominent bitrate ladders demonstrates that live streaming with ARTEMIS outperforms all baselines, reducing encoding computation by 25% and end-to-end latency by 18%, and increasing quality of experience by 11%.

Track 2

Sharing the Network

Multitenant In-Network Acceleration with SwitchVM

Sajy Khashab, Alon Rashelbach, and Mark Silberstein, Technion

We propose a practical approach to implementing multitenancy on programmable network switches to make in-network acceleration accessible to cloud users. We introduce the Switch Virtual Machine (SwitchVM), which is deployed on the switches and offers an expressive instruction set and program-state abstractions. Tenant programs, called data-plane filters (DPFs), are executed on top of SwitchVM in a sandbox with memory, network, and state isolation policies controlled by network operators. The packets that trigger DPF execution carry either the code to execute or a reference to DPFs already deployed in the switch. DPFs are Turing-complete, may maintain state in the packet and in switch virtual memory, may form a dynamic chain, and may steer packets to desired destinations, all while enforcing the operator's policies.

We demonstrate that this idea is practical by prototyping SwitchVM in P4 on Intel Tofino switches. We describe a variety of use cases that SwitchVM supports, and implement three complex applications from prior work: a key-value store cache, a load-aware load balancer, and a Paxos accelerator. We also show that SwitchVM provides strong performance isolation and zero-overhead runtime programmability, may hold two orders of magnitude more in-switch programs than existing techniques, and may support up to thirty thousand concurrent tenants, each with its own private state.
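
The packet-carried-program model can be pictured with a toy interpreter like the one below. The five-instruction ISA is hypothetical; SwitchVM's real instruction set, isolation enforcement, and P4 realization are far richer.

    # Toy interpreter for packet-carried data-plane filters (DPFs).
    def run_dpf(program, pkt, vmem):
        stack = []
        for op, arg in program:
            if op == "PUSH":
                stack.append(arg)
            elif op == "LOAD":               # read tenant virtual memory
                stack.append(vmem.get(arg, 0))
            elif op == "ADD":
                stack.append(stack.pop() + stack.pop())
            elif op == "STORE":              # write tenant virtual memory
                vmem[arg] = stack.pop()
            elif op == "STEER":              # choose the output port
                pkt["out_port"] = stack.pop()
        return pkt

    vmem = {}                                # per-tenant isolated state
    prog = [("LOAD", "ctr"), ("PUSH", 1), ("ADD", None), ("STORE", "ctr"),
            ("PUSH", 2), ("STEER", None)]    # count packets, steer to port 2
    print(run_dpf(prog, {"out_port": None}, vmem), vmem)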

Wednesday, April 17

8:00 am–9:00 am

Continental Breakfast

9:00 am–10:20 am

Track 1

ML at Scale

Characterization of Large Language Model Development in the Datacenter

Qinghao Hu, Nanyang Technological University; Zhisheng Ye, Peking University; Zerui Wang, Shanghai Jiao Tong University; Meng Zhang, Nanyang Technological University; Guoteng Wang, Shanghai AI Laboratory; Qiaoling Chen, National University of Singapore; Peng Sun, SenseTime; Dahua Lin, The Chinese University of Hong Kong & Sensetime Research; Xiaolin Wang and Yingwei Luo, Peking University; Yonggang Wen and Tianwei Zhang, Nanyang Technological University

Scaling Large Language Model Training to More Than 10,000 GPUs

Ziheng Jiang and Haibin Lin, ByteDance; Yinmin Zhong, Peking University; Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, and Xin Liu, ByteDance; Xin Jin, Peking University

Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, Steve Lacy, Hang Wang, Aaron Wisner, Chris Lewis, and Henri Bahin, Google

TPUv4 (Tensor Processing Unit) is Google's 3rd-generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software infrastructure that allows TPUv4 supercomputers to operate at scale, including features for automatic fault resiliency and hardware recovery. We adopt a software-defined networking (SDN) approach to manage TPUv4's high-bandwidth inter-chip interconnect (ICI) fabric, using optical circuit switching to dynamically configure routes to work around machine, chip, and link failures. Our infrastructure detects failures and automatically triggers reconfiguration to minimize disruption to running workloads, and initiates remediation and repair workflows for the affected components. Similar techniques interface with maintenance and upgrade workflows for both hardware and software. Our dynamic reconfiguration approach allows our TPUv4 supercomputers to achieve 99.98% system availability, gracefully handling hardware outages experienced by ~1% of the training jobs.

Track 2

Satellites and Things

NN-Defined Modulator: Reconfigurable and Portable Software Modulator on IoT Gateways

Jiazhao Wang and Wenchao Jiang, Singapore University of Technology and Design; Ruofeng Liu, University of Minnesota; Bin Hu, University of Southern California; Demin Gao, Nanjing Forestry University; Shuai Wang, Southeast University

A physical-layer modulator is a vital component of an IoT gateway, mapping symbols to signals. However, because hardware chipsets are soldered onto gateway motherboards and software-radio toolkits differ across platforms, existing solutions either have limited extensibility or are platform-specific. This limitation is hard to ignore now that modulation schemes and hardware platforms have become extremely diverse. This paper presents a new paradigm that uses neural networks as an abstraction layer for physical-layer modulators in IoT gateway devices, referred to as NN-defined modulators. Our approach addresses the challenges of extensibility and portability for multiple technologies on various hardware platforms. The proposed NN-defined modulator uses a model-driven methodology rooted in solid mathematical foundations while natively supporting hardware acceleration and portability to heterogeneous platforms. We evaluate NN-defined modulators on different platforms, including the Nvidia Jetson Nano and Raspberry Pi. Evaluations demonstrate that our NN-defined modulator operates as effectively as conventional modulators and provides significant efficiency gains (up to 4.7× on the Nvidia Jetson Nano and 1.1× on the Raspberry Pi), indicating high portability. Furthermore, we show real-world applications in which our NN-defined modulators generate ZigBee and WiFi packets that are compliant with the commodity TI CC2650 (ZigBee) and Intel AX201 (WiFi) NICs, respectively.
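
The premise that a modulator is expressible as NN-style tensor operations can be seen in miniature: BPSK pulse shaping is exactly an upsample followed by a convolution, i.e., a fixed-weight convolutional layer. numpy stands in for a NN framework here, and the rectangular taps are an assumption.

    import numpy as np

    def bpsk_modulate(bits, sps=8, taps=np.ones(8)):
        symbols = 2.0 * np.asarray(bits) - 1.0    # map bits 0/1 -> -1/+1
        up = np.zeros(len(symbols) * sps)
        up[::sps] = symbols                       # upsample by the symbol period
        return np.convolve(up, taps)              # pulse shaping = conv layer

    samples = bpsk_modulate([1, 0, 1, 1])         # baseband samples for the radio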

Known Knowns and Unknowns: Near-realtime Earth Observation Via Query Bifurcation in Serval

Bill Tao, Om Chabra, Ishani Janveja, Indranil Gupta, and Deepak Vasisht, University of Illinois Urbana-Champaign

Earth observation satellites, in low Earth orbits, are increasingly approaching near-continuous imaging of the Earth. Today, these satellites capture an image of every part of Earth every few hours. However, networking capabilities have not caught up and can introduce delays of a few hours to days in getting these images to Earth. While this delay is acceptable for delay-tolerant applications like land cover maps, crop type identification, etc., it is unacceptable for latency-sensitive applications like forest fire detection or disaster monitoring. We design Serval to enable near-realtime insights from Earth imagery for latency-sensitive applications despite the networking bottlenecks, by leveraging the emerging computational capabilities on satellites and ground stations. The key challenge for our work stems from the limited computational capabilities and power resources available on a satellite. We solve this challenge by leveraging the predictability of satellite orbits to bifurcate computation across satellites and ground stations. We evaluate Serval using trace-driven simulations and hardware emulations on a dataset of ten million images captured by the Planet Dove constellation of nearly 200 satellites. Serval reduces the end-to-end latency for high-priority queries from 71.71 hours (incurred by the state of the art) to 2 minutes, and the 90th-percentile latency from 149 hours to 47 minutes.

Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things

Vaibhav Singh, Tusher Chakraborty, and Suraj Jog, Microsoft Research; Om Chabra and Deepak Vasisht, UIUC; Ranveer Chandra, Microsoft Research

Low Earth Orbit satellite constellations are gaining traction for providing connectivity to low-power outdoor Internet of Things (IoT) devices. This is made possible by the development of low-cost, low-complexity pico-satellites that can be easily launched, offering global connectivity without the need for Earth-based gateways. In this paper, we report the space-to-Earth communication bottlenecks derived from our experience of deploying an IoT satellite. Specifically, we characterize the challenges posed by low link budgets, satellite motion, and packet collisions. To address these challenges, we design a new class of techniques that uses the Doppler shift caused by the satellite's motion as a unique signature for packet detection and decoding, even at low signal-to-noise ratios and in the presence of collisions. We integrate these techniques into our system, called Spectrumize, and evaluate its performance through both simulations and real-world deployments. Our evaluation shows that Spectrumize detects packets 3x better than the classic approach and decodes them with over 80% average accuracy.
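
A simplified matched-filter view of using Doppler as a packet signature: correlate the received samples against Doppler-shifted copies of a known preamble and keep the shift with the strongest peak. All parameters are made up, and the paper's detector goes well beyond this sketch in handling collisions and low SNR.

    import numpy as np

    def detect_doppler(rx, preamble, fs, shifts_hz):
        t = np.arange(len(preamble)) / fs
        peak = lambda f: np.abs(np.correlate(
            rx, preamble * np.exp(2j * np.pi * f * t), "valid")).max()
        return max(shifts_hz, key=peak)    # shift whose template matches best

    fs = 125e3
    n = np.arange(256)
    preamble = np.exp(2j * np.pi * 1e3 * n / fs)
    rx = np.concatenate([np.zeros(100, complex),
                         preamble * np.exp(2j * np.pi * 7e3 * n / fs)])
    print(detect_doppler(rx, preamble, fs, range(-10000, 10001, 1000)))   # 7000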

10:20 am–10:50 am

Break with Refreshments

10:50 am–12:30 pm

Track 1

Wide-Area and Edge

CHISEL: An optical slice of the wide-area network

Abhishek Vijaya Kumar, Cornell University; Bill Owens, NYSERnet; Nikolaj Bjørner, Binbin Guan, Yawei Yin, and Paramvir Bahl, Microsoft; Rachee Singh, Cornell University

Network slicing reserves a portion of the physical resources of radio access networks and makes them available to consumers. Slices guarantee traffic isolation, strict bandwidth, and quality of service. However, the abstraction of slicing has been limited to access networks. We develop CHISEL, a system that dynamically carves slices of the wide-area network (WAN), enabling an end-to-end network slicing abstraction. CHISEL creates optical slices between WAN endpoints to avoid the queueing and congestion delays inherent in packet-switched paths in WANs. CHISEL incrementally allocates optical spectrum on long-haul fiber to provision slices. This task is made challenging by the co-existence of data-carrying channels on the fiber and numerous physical constraints associated with provisioning optical paths, e.g., spectrum contiguity, continuity, and optical reach constraints. CHISEL leverages the empirical finding that cloud WANs have abundant optical spectrum to spare: 75% of optical spectrum on 75% of fiber spans is unused. CHISEL can optimally allocate terabits of slice requests while consuming minimal optical spectrum within seconds, without increasing spectral fragmentation on fiber. CHISEL trades off optimality of slice bandwidth allocation for faster run-time, provisioning slices within 2% of optimal in less than 30 seconds in a commercial cloud WAN. Finally, CHISEL reduces the latency of provisioning optical slices on hardware by 10X. Compared to IP tunnels of equivalent capacity, CHISEL consumes 3.3X fewer router ports.
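
The constraints named above can be made concrete with a first-fit toy: a slice needs the same contiguous block of spectrum slots free on every fiber span along its path. The data is invented, and CHISEL itself solves this allocation near-optimally and at far larger scale.

    NUM_SLOTS = 96    # spectrum slots per fiber span

    def allocate(path_spans, width, occupied):
        # occupied: span -> set of slot indices already in use
        for start in range(NUM_SLOTS - width + 1):
            block = set(range(start, start + width))       # contiguity
            if all(block.isdisjoint(occupied[s]) for s in path_spans):
                for s in path_spans:
                    occupied[s] |= block                   # same block on every
                return start                               # span: continuity
        return None   # no feasible block on this path

    occupied = {"A-B": {0, 1, 2}, "B-C": {4}}
    print(allocate(["A-B", "B-C"], width=4, occupied=occupied))   # -> 5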

LuoShen: A Hyper-Converged Programmable Gateway for Multi-Tenant Multi-Service Edge Clouds

Tian Pan, Kun Liu, Xionglie Wei, Yisong Qiao, Jun Hu, Zhiguo Li, Jun Liang, Tiesheng Cheng, Wenqiang Su, Jie Lu, Yuke Hong, Zhengzhong Wang, Zhi Xu, Chongjing Dai, Peiqiao Wang, Xuetao Jia, Jianyuan Lu, Enge Song, and Jun Zeng, Alibaba Cloud; Biao Lyu, Zhejiang University and Alibaba Cloud; Ennan Zhai, Alibaba Cloud; Jiao Zhang and Tao Huang, Purple Mountain Laboratories; Dennis Cai, Alibaba Cloud; Shunmin Zhu, Tsinghua University and Alibaba Cloud

Sprinter: Speeding Up High-Fidelity Crawling of the Modern Web

Ayush Goel and Jingyuan Zhu, University of Michigan; Ravi Netravali, Princeton University; Harsha V. Madhyastha, University of Southern California

Crawling the web at scale forms the basis of many important systems: web search engines, smart assistants, generative AI, web archives, and so on. Yet, the research community has paid little attention to this workload in the last decade. In this paper, we highlight the need to revisit the notion that web crawling is a solved problem. Specifically, to discover and fetch all page resources dependent on JavaScript and modern web APIs, crawlers today have to employ compute-intensive web browsers. This significantly inflates the scale of the infrastructure necessary to crawl pages at high throughput.

To make web crawling more efficient without any loss of fidelity, we present Sprinter, which combines browser-based and browserless crawling to get the best of both. The key to Sprinter's design is our observation that crawling workloads typically include many pages from every site that is crawled and, unlike in traditional user-facing page loads, there is significant potential to reuse client-side computations across pages. Taking advantage of this property, Sprinter crawls a small, carefully chosen subset of pages on each site using a browser, and then efficiently identifies and exploits opportunities to reuse the browser's computations on other pages. Sprinter was able to crawl a corpus of 50,000 pages 5x faster than browser-based crawling, while still closely matching a browser in the set of resources fetched.

Hairpin: Rethinking Packet Loss Recovery in Edge-based Interactive Video Streaming

Zili Meng, Tsinghua University, Hong Kong University of Science and Technology, and Tencent; Xiao Kong and Jing Chen, Tsinghua University and Tencent; Bo Wang and Mingwei Xu, Tsinghua University; Rui Han and Honghao Liu, Tencent; Venkat Arun, UT Austin; Hongxin Hu, University at Buffalo; Xue Wei, Tencent

Interactive streaming requires minimizing stuttering events (or deadline misses for video frames) to ensure seamless interaction between users and applications. However, existing packet loss recovery mechanisms uniformly optimize redundancy for initial transmissions and retransmissions, which not only fails to satisfy the delay requirements of interactive streaming but also introduces considerable bandwidth costs. Our insight is that in edge-based interactive streaming, differentiating the redundancy settings of retransmissions can often achieve a low bandwidth cost and a low deadline miss rate simultaneously. In this paper, we propose Hairpin, a new packet loss recovery mechanism for edge-based interactive streaming. Hairpin finds the optimal combination of data packets, retransmissions, and redundant packets over multiple rounds of transmission, which significantly reduces the bandwidth cost while meeting the end-to-end latency requirement. Experiments with production deployments demonstrate that Hairpin simultaneously reduces the bandwidth cost by 40% and the deadline miss rate by 32% on average in the wild against state-of-the-art solutions.
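
The flavor of the optimization can be shown with a toy single-packet model under i.i.d. loss: pick per-round copy counts that keep the deadline-miss probability under a target at minimum bandwidth. Hairpin's real formulation covers FEC across whole frames and measured network conditions.

    from itertools import product

    def cheapest_schedule(loss, rounds, miss_target, max_copies=4):
        best = None
        for copies in product(range(1, max_copies + 1), repeat=rounds):
            miss = loss ** sum(copies)   # miss = every copy in every round lost
            if miss <= miss_target and (best is None or sum(copies) < sum(best)):
                best = copies
        return best                      # copies to send in rounds 1..R

    # 10% loss, two rounds fit within the deadline, target miss rate 1e-4:
    print(cheapest_schedule(loss=0.1, rounds=2, miss_target=1e-4))   # (1, 3)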

Track 2

Verification

Towards provably performant congestion control

Anup Agarwal, Carnegie Mellon University; Venkat Arun, University of Texas at Austin; Devdeep Ray, Carnegie Mellon University/Google LLC; Ruben Martins and Srinivasan Seshan, Carnegie Mellon University

EPVerifier: Accelerating Update Storms Verification with Edge-Predicate

Chenyang Zhao, Yuebin Guo, Jingyu Wang, Qi Qi, Zirui Zhuang, Haifeng Sun, and Lingqi Guo, Beijing University of Posts and Telecommunications; Yuming Xie, Huawei Technologies Co., Ltd; Jianxin Liao, Beijing University of Posts and Telecommunications

Data plane verification is designed to automatically verify network correctness by directly analyzing the data plane. Recent data plane verifiers have achieved sub-millisecond verification per rule update by partitioning packets into equivalence classes (ECs). A large number of data plane updates can be generated in a short interval, known as an update storm, due to network events such as end-to-end path establishment, disruption, or recovery. For update storms, however, the verification speed of current EC-based methods is often limited by the maintenance of their EC-based network model (EC-model).

This paper presents EPVerifier, a fast, partitioned data plane verifier that accelerates update-storm verification. EPVerifier uses a novel edge-predicate-based (EP-based) local modeling approach to avoid the drastic oscillations of the EC-model caused by changes in the set of equivalence classes. In addition, with local EPs, EPVerifier can partition verification tasks by switch, which EC-based methods cannot, yielding better parallel performance. We implement EPVerifier as an easy-to-use tool that lets users quickly obtain verification results at any moment by providing the necessary input. Both trace-driven simulations and deployments in the wild show that EPVerifier achieves robustly fast update-storm verification and superior parallel performance, and these advantages grow with the data plane's complexity and the storm size. The verification time of EPVerifier for an update storm of size 1M is around 10s on average, a 2-10× improvement over the state of the art.

Netcastle: Network Infrastructure Testing At Scale

Rob Sherwood, NetDebug.com; Jinghao Shi, Ying Zhang, Neil Spring, Srikanth Sundaresan, Jasmeet Bagga, Prathyusha Peddi, Vineela Kukkadapu, Rashmi Shrivastava, Manikantan KR, Pavan Patil, Srikrishna Gopu, Varun Varadan, Ethan Shi, Hany Morsy, Yuting Bu, Renjie Yang, Rasmus Jönsson, Wei Zhang, Jesus Jussepen Arredondo, and Diana Saha, Meta Platforms Inc.; Sean Choi, Santa Clara University

Network operators have long struggled to achieve reliability. Increased complexity risks surprising interactions, increased downtime, and lost person-hours trying to debug correctness and performance problems in large systems. For these reasons, network operators have also long pushed back on deploying promising network research, fearing the unexpected consequences of increased network complexity. Despite the changes’ potential benefits, the corresponding increase in complexity may result in a net loss.

The method to build reliability despite complexity in Software Engineering is testing. In this paper, we use statistics from a large-scale network to identify unique challenges in network testing. To tackle these challenges, we develop Netcastle: a system that provides continuous integration/continuous deployment (CI/CD) network testing as a service for 11 different networking teams, across 68 different use cases, and O(1k) test devices. Netcastle supports comprehensive network testing, including device-level firmware, datacenter distributed control planes, and backbone centralized controllers, and runs 500K+ network tests per day, a scale and depth of test coverage previously unpublished. We share five years of experience in building and running Netcastle at Meta.

12:30 pm–2:00 pm

Lunch (on your own)

2:00 pm–3:40 pm

Track 1

Networking at Scale

A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

Nils Blach and Maciej Besta, ETH Zurich; Daniele De Sensi, Sapienza University of Rome; Jens Domke, RIKEN Center for Computational Science; Hussein Harake, Swiss National Supercomputing Centre; Shigang Li, Beijing University of Posts and Telecommunications; Patrick Iff, ETH Zurich; Marek Konieczny, AGH-UST; Kartik Lakhotia, Intel; Ales Kubicek and Marcel Ferrari, ETH Zurich; Fabrizio Petrini, Intel; Torsten Hoefler, ETH Zurich

Reasoning about Network Traffic Load Property at Production Scale

Ruihan Li, Peking University and Alibaba Cloud; Fangdan Ye, Yifei Yuan, Ruizhen Yang, Bingchuan Tian, Tianchen Guo, Hao Wu, Xiaobo Zhu, Zhongyu Guan, Qing Ma, and Xianlong Zeng, Alibaba Cloud; Chenren Xu, Peking University; Dennis Cai and Ennan Zhai, Alibaba Cloud

Poseidon: A Consolidated Virtual Network Controller that Manages Millions of Tenants via Config Tree

Biao Lyu, Zhejiang University and Alibaba Group; Enge Song, Tian Pan, Jianyuan Lu, Shize Zhang, Xiaoqing Sun, Lei Gao, Chenxiao Wang, Han Xiao, Yong Pan, Xiuheng Chen, Yandong Duan, Weisheng Wang, Jinpeng Long, Yanfeng Wang, Kunpeng Zhou, Zhigang Zong, Xing Li, Guangwang Li, and Pengyu Zhang, Alibaba Group; Peng Cheng and Jiming Chen, Zhejiang University; Shunmin Zhu, Tsinghua University and Alibaba Group

OPPerTune: Post-Deployment Configuration Tuning of Services Made Easy

Gagan Somashekar, Stony Brook University; Karan Tandon and Anush Kini, Microsoft Research; Chieh-Chun Chang and Petr Husak, Microsoft; Ranjita Bhagwan, Google; Mayukh Das, Microsoft365 Research; Anshul Gandhi, Stony Brook University; Nagarajan Natarajan, Microsoft Research

Real-world application deployments have hundreds of inter-dependent configuration parameters, many of which significantly influence performance and efficiency. With today's complex and dynamic services, operators need to continuously monitor and set the right configuration values (configuration tuning) well after a service is widely deployed. This is challenging since experimenting with different configurations post-deployment may reduce application performance or cause disruptions. While state-of-the-art ML approaches do help to automate configuration tuning, they do not fully address the multiple challenges in end-to-end configuration tuning of deployed applications.

This paper presents OPPerTune, a service that enables configuration tuning of applications in deployment at Microsoft. OPPerTune reduces application interruptions while maximizing the performance of deployed applications as and when the workload or the underlying infrastructure changes. It automates three essential processes that facilitate post-deployment configuration tuning: (a) determining which configurations to tune, (b) automatically managing the scope at which to tune the configurations, and (c) using a novel reinforcement learning algorithm to simultaneously and quickly tune numerical and categorical configurations, thereby keeping the overhead of configuration tuning low. We deploy OPPerTune on two enterprise applications in Microsoft Azure's clusters. Our experiments show that OPPerTune reduces the end-to-end P95 latency of microservice applications by more than 50% over expert configuration choices made ahead of deployment. The code and datasets used are made available at https://aka.ms/OPPerTune.

Track 2

ML but Faster

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

Jiangfei Duan, The Chinese University of Hong Kong; Ziang Song, Johns Hopkins University; Xupeng Miao and Xiaoli Xi, Carnegie Mellon University; Dahua Lin, The Chinese University of Hong Kong & Sensetime Research; Harry Xu, UCLA; Minjia Zhang, Microsoft Research; Zhihao Jia, Carnegie Mellon University

3:40 pm–4:10 pm

Break with Refreshments

4:10 pm–5:50 pm

Track 1

Distributed Systems: Part 2

SIEVE is Simpler than LRU: an Efficient Turn-Key Eviction Algorithm for Web Caches

Yazhuo Zhang, Emory University; Juncheng Yang, Carnegie Mellon University; Yao Yue, Pelikan Foundation; Ymir Vigfusson, Emory University and Keystrike; K.V. Rashmi, Carnegie Mellon University

Caching is an indispensable technique for low-cost and fast data serving. The eviction algorithm, at the heart of a cache, has been primarily designed to maximize efficiency—reducing the cache miss ratio. Many eviction algorithms have been designed in the past decades. However, they all trade off throughput, simplicity, or both for higher efficiency. Such a compromise often hinders adoption in production systems.

This work presents SIEVE, an algorithm that is simpler than LRU and provides better than state-of-the-art efficiency and scalability for web cache workloads. We implemented SIEVE in five production cache libraries, requiring fewer than 20 lines of code changes on average. Our evaluation on 1559 cache traces from 7 sources shows that SIEVE achieves up to 63.2% lower miss ratio than ARC. Moreover, SIEVE has a lower miss ratio than 9 state-of-the-art algorithms on more than 45% of the 1559 traces, while the next best algorithm only has a lower miss ratio on 15%. SIEVE's simplicity comes with superior scalability as cache hits require no locking. Our prototype achieves twice the throughput of an optimized 16-thread LRU implementation. SIEVE is more than an eviction algorithm; it can be used as a cache primitive to build advanced eviction algorithms just like FIFO and LRU.
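
Following the paper's description, a minimal SIEVE sketch: one FIFO queue, a visited bit per object, and a hand that scans from tail toward head at eviction time. Hits only set a bit, which is why the hit path needs no locking. (A list stands in for the doubly linked list a real implementation would use.)

    class Sieve:
        def __init__(self, capacity):
            self.capacity = capacity
            self.queue = []        # index 0 = head (newest), end = tail (oldest)
            self.visited = {}
            self.hand = None       # key the hand points at; None -> start at tail

        def get(self, key):
            if key in self.visited:
                self.visited[key] = True            # hit: just set the bit
                return True
            if len(self.queue) >= self.capacity:
                self._evict()
            self.queue.insert(0, key)               # miss: insert at head
            self.visited[key] = False
            return False

        def _evict(self):
            i = (self.queue.index(self.hand)
                 if self.hand in self.visited else len(self.queue) - 1)
            while self.visited[self.queue[i]]:      # skip visited objects,
                self.visited[self.queue[i]] = False #   clearing their bits
                i = i - 1 if i > 0 else len(self.queue) - 1
            victim = self.queue.pop(i)
            del self.visited[victim]
            self.hand = self.queue[i - 1] if i > 0 else None

    c = Sieve(capacity=2)
    c.get("a"); c.get("b"); c.get("a")   # "a" is now marked visited
    c.get("c")                           # evicts "b"; the visited "a" survives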

Harvesting Idle Memory for Application-managed Soft State with Midas

Yifan Qiao, UCLA; Zhenyuan Ruan, MIT; Haoran Ma, UCLA; Adam Belay, MIT CSAIL; Miryung Kim and Harry Xu, UCLA

Many applications can benefit from data that increases performance but is not required for correctness (commonly referred to as soft state). Examples include cached data from backend web servers and memoized computations in data analytics systems. Today's systems generally statically limit the amount of memory they use for storing soft state in order to prevent unbounded growth that could exhaust the server's memory. Static provisioning, however, makes it difficult to respond to shifts in application demand for soft state and can leave significant amounts of memory idle. Existing OS kernels can only spend idle memory on caching disk blocks—which may not have the most utility—because they do not provide the right abstractions to safely allow applications to store their own soft state.

To effectively manage and dynamically scale soft state, we propose soft memory, an elastic virtual memory abstraction with unmap-and-reconstruct semantics that makes it possible for applications to use idle memory to store whatever soft state they choose while guaranteeing both safety and efficiency. We present Midas, a soft memory management system that contains (1) a runtime that is linked to each application to manage soft memory objects and (2) OS kernel support that coordinates soft memory allocation between applications to maximize their performance. Our experiments with four real-world applications show that Midas can efficiently and safely harvest idle memory to store applications' soft state, delivering near-optimal application performance and responding to extreme memory pressure without running out of memory.
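
The unmap-and-reconstruct contract can be sketched as follows. The API names are hypothetical; the essential point is that a get() must tolerate the runtime having reclaimed the object and fall back to recomputation.

    class SoftCache:
        def __init__(self, reconstruct):
            self.reconstruct = reconstruct   # app-supplied recompute function
            self.store = {}                  # runtime may drop entries anytime

        def get(self, key):
            value = self.store.get(key)
            if value is None:                # object was unmapped under pressure
                value = self.reconstruct(key)
                self.store[key] = value
            return value

        def reclaim_one(self):               # stand-in for coordinated harvesting
            if self.store:
                self.store.pop(next(iter(self.store)))

    cache = SoftCache(reconstruct=lambda k: f"render:{k}")
    cache.get("/index")     # computed once, kept as soft state
    cache.reclaim_one()     # memory pressure: runtime silently drops it
    cache.get("/index")     # still correct: transparently reconstructed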

Track 2

Wireless Hardware

6:30 pm–8:00 pm

NSDI '24 Poster Session and Reception

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, authors, and symposium organizers.

Thursday, April 18

8:00 am–9:00 am

Continental Breakfast

9:00 am–10:20 am

Track 1

ML Scheduling

Vulcan: Automatic Query Planning for Live ML Analytics

Yiwen Zhang and Xumiao Zhang, University of Michigan; Ganesh Ananthanarayanan, Microsoft; Anand Iyer, Georgia Institute of Technology; Yuanchao Shu, Zhejiang University; Victor Bahl, Microsoft Corporation; Z. Morley Mao, University of Michigan and Google; Mosharaf Chowdhury, University of Michigan

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

Sudarsanan Rajasekaran and Manya Ghobadi, Massachusetts Institute of Technology; Aditya Akella, UT Austin

We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction that considers the communication patterns of different jobs while placing them on network links. To do so, CASSINI uses an affinity graph to find a series of time-shift values that adjust the communication phases of a subset of jobs, such that the communication patterns of jobs sharing the same network link are interleaved with one another. Experiments with 13 common ML models on a 24-server testbed demonstrate that, compared to state-of-the-art ML schedulers, CASSINI improves the average and tail completion times of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number of ECN-marked packets in the cluster by up to 33x.
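
The geometric abstraction can be illustrated in a few lines: treat each job's link usage as a periodic 0/1 pattern and pick the circular time-shift that minimizes overlap on a shared link. The patterns are toys; CASSINI derives consistent shifts for many jobs at once via its affinity graph.

    def best_shift(a, b):
        n = len(a)
        overlap = lambda s: sum(a[i] & b[(i + s) % n] for i in range(n))
        return min(range(n), key=overlap)

    job_a = [1, 1, 0, 0]    # communicates in slots 0-1 of each iteration
    job_b = [1, 1, 0, 0]
    print(best_shift(job_a, job_b))   # 2: the two jobs interleave, zero overlap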

Towards Domain-Specific Network Transport for Distributed DNN Training

Hao Wang and Han Tian, Hong Kong University of Science and Technology; Jingrong Chen, Duke University; Xinchen Wan, Jiacheng Xia, and Gaoxiong Zeng, Hong Kong University of Science and Technology; Wei Bai, Microsoft Research; Junchen Jiang, University of Chicago; Yong Wang and Kai Chen, Hong Kong University of Science and Technology

Track 2

Cloud Scheduling

LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search

Chengquan Feng, University of Science and Technology of China, Microsoft Research; Li Lyna Zhang, Microsoft Research; Yuanchi Liu, University of Science and Technology of China; Jiahang Xu and Chengruidong Zhang, Microsoft Research; Zhiyuan Wang, University of Science and Technology of China; Ting Cao and Mao Yang, Microsoft Research; Haisheng Tan, University of Science and Technology of China

Hardware-Aware Neural Architecture Search (NAS) has demonstrated success in automating the design of affordable deep neural networks (DNNs) for edge platforms by incorporating inference latency in the search process. However, accurately and efficiently predicting DNN inference latency on diverse edge platforms remains a significant challenge. Current approaches require several days to construct a new latency predictor for each platform, which is prohibitively time-consuming and impractical.

In this paper, we propose LitePred, a lightweight approach for accurately predicting DNN inference latency on new platforms with minimal adaptation data by transferring existing predictors. LitePred builds on two key techniques: (i) a Variational Autoencoder (VAE) data sampler to sample high-quality training and adaptation data that conforms to the model distributions in NAS search spaces, overcoming the out-of-distribution challenge; and (ii) a latency distribution-based similarity detection method to identify the most similar pre-existing latency predictors for the new target platform, reducing adaptation data required while achieving high prediction accuracy. Extensive experiments on 85 edge platforms and 6 NAS search spaces demonstrate the effectiveness of our approach, achieving an average latency prediction accuracy of 99.3% with less than an hour of adaptation cost. Compared with SOTA platform-specific methods, LitePred achieves up to 5.3% higher accuracy with a significant 50.6× reduction in profiling cost. Code and predictors are available at https://github.com/microsoft/Moonlit/tree/main/LitePred.

10:20 am–10:50 am

Break with Refreshments

10:50 am–12:30 pm

Track 1

Programming the Network: Part 2

Automatic Parallelization of Software Network Functions

Francisco Pereira, Fernando M.V. Ramos, and Luis Pedrosa, INESC-ID, Instituto Superior Técnico, University of Lisbon

Software network functions (NFs) trade off flexibility and ease of deployment for an increased performance challenge. The traditional way to increase NF performance is to distribute traffic across multiple CPU cores, but this poses a significant challenge: how to parallelize an NF without breaking its semantics? We propose Maestro, a tool that analyzes a sequential implementation of an NF and automatically generates an enhanced parallel version that carefully configures the NIC's Receive Side Scaling mechanism to distribute traffic across cores while preserving semantics. When possible, Maestro orchestrates a shared-nothing architecture, with each core operating independently without shared-memory coordination, maximizing performance. Otherwise, Maestro choreographs a fine-grained read-write locking mechanism that optimizes operation for typical Internet traffic. We parallelized 8 software NFs and show that they generally scale up linearly until bottlenecked by PCIe with small packets or by the 100 Gbps line rate with typical Internet traffic. Maestro further outperforms modern hardware-based transactional memory mechanisms, even for challenging parallel-unfriendly workloads.
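
As background for "carefully configures the NIC's Receive Side Scaling": RSS hashes each packet's tuple with a Toeplitz hash, and with a key built from one repeated 16-bit word (a known trick, assumed here rather than taken from Maestro) the two directions of a connection hash identically and land on the same core.

    def toeplitz(key: bytes, data: bytes) -> int:
        key_bits = int.from_bytes(key, "big")
        window = 8 * len(key) - 32          # leftmost 32 bits of the key
        result = 0
        for byte in data:
            for bit in range(7, -1, -1):
                if byte >> bit & 1:
                    result ^= (key_bits >> window) & 0xFFFFFFFF
                window -= 1                 # slide the key left by one bit
        return result

    SYM_KEY = bytes.fromhex("6d5a" * 20)    # one 16-bit word, repeated
    # IPv4 4-tuples: (src, dst, sport, dport) and the reverse direction.
    fwd = bytes([10, 0, 0, 1, 10, 0, 0, 2, 0x04, 0xD2, 0x00, 0x50])
    rev = bytes([10, 0, 0, 2, 10, 0, 0, 1, 0x00, 0x50, 0x04, 0xD2])
    assert toeplitz(SYM_KEY, fwd) == toeplitz(SYM_KEY, rev)   # same core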

AutoSketch: Automatic Sketch-Oriented Compiler for Query-driven Network Telemetry

Haifeng Sun and Qun Huang, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Jinbo Sun, Institute of Computing Technology, Chinese Academy of Sciences; Wei Wang, Northeastern University, China; Jiaheng Li, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Fuliang Li, Northeastern University, China; Yungang Bao, Institute of Computing Technology, Chinese Academy of Sciences; Xin Yao and Gong Zhang, Huawei Theory Department

Recent network telemetry witnesses tremendous progress in two directions: query-driven telemetry that targets expressiveness as the primary goal, and sketch-based algorithms that address resource-accuracy trade-offs. In this paper, we propose AutoSketch that aims to integrate the advantages of both classes. In a nutshell, AutoSketch automatically compiles high-level operators into sketch instances that can be readily deployed with low resource usage and incur limited accuracy drop. However, there remains a gap between the expressiveness of high-level operators and the underlying realization of sketch algorithms. AutoSketch bridges this gap in three aspects. First, AutoSketch extends its interface derived from existing query-driven telemetry such that users can specify the desired telemetry accuracy. The specified accuracy intent will be utilized to guide the compiling procedure. Second, AutoSketch leverages various techniques, such as syntax analysis and performance estimation, to construct efficient sketch instances. Finally, AutoSketch automatically searches for the most suitable parameter configurations that fulfill the accuracy intent with minimum resource usage. Our experiments demonstrate that AutoSketch can achieve high expressiveness, high accuracy, and low resource usage compared to state-of-the-art telemetry solutions.

Sequence Abstractions for Flexible, Line-Rate Network Monitoring

Andrew Johnson, Princeton University; Ryan Beckett, Microsoft Research; Xiaoqi Chen, Princeton University; Ratul Mahajan, University of Washington; David Walker, Princeton University

We develop FLM, a high-level language that enables network operators to write programs that recognize and react to specific packet sequences. To be able to examine every packet, our compilation procedure can transform FLM programs into P4 code that runs on programmable switch ASICs. It first splits FLM programs into a state-management component and a classical regular expression, then generates an efficient implementation of the regular expression using SMT-based program synthesis. Our experiments find that FLM can express 15 sequence-monitoring tasks drawn from prior literature. Our compiler can convert all of these programs to run on switch hardware in a way that fits within the available pipeline stages and consumes less than 15% additional header fields and instruction words when run alongside existing switch programs.
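
The compilation target can be pictured as a per-flow DFA table evaluated one packet at a time. The monitoring task below (flag a FIN arriving before any data) is hypothetical; FLM synthesizes such automata from its sequence language automatically.

    DFA = {
        ("start", "syn"):  "open",
        ("open",  "data"): "established",
        ("open",  "fin"):  "alarm",     # FIN before any data packet
    }

    state = {}   # per-flow state (a register array in the generated P4)

    def on_packet(flow, kind):
        cur = state.get(flow, "start")
        state[flow] = DFA.get((cur, kind), cur)   # unmatched input: stay put
        return state[flow]

    for pkt in [("f1", "syn"), ("f1", "fin")]:
        print(on_packet(*pkt))   # open, then alarm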

OctoSketch: Enabling Real-Time, Continuous Network Monitoring over Multiple Cores

Yinda Zhang, University of Pennsylvania; Peiqing Chen and Zaoxing Liu, University of Maryland

Sketching algorithms (sketches) have emerged as a resource-efficient and accurate solution for software-based network monitoring. However, existing sketch-based monitoring makes sacrifices in online accuracy (query time accuracy) and performance (handling line rate traffic with low latency) when dealing with distributed traffic across multiple cores. In this work, we present OctoSketch, a software monitoring framework that can scale a wide spectrum of sketches to many cores with high online accuracy and performance. In contrast to previous systems that adopt straightforward sketch merges from individual cores to obtain the aggregated result, we devise a continuous, change-based mechanism that can generally be applied to sketches to perform the aggregation. This design ensures high online accuracy of the aggregated result at any query time and reduces computation costs to achieve high throughput. We apply OctoSketch to nine representative sketches on three software platforms (CPU, DPDK, and eBPF XDP). Our results demonstrate that OctoSketch achieves about 15.6× lower errors and up to 4.5× higher throughput than the state-of-the-art.
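
A simplified rendering of the change-based mechanism for a count-min sketch: each core ships a counter's accumulated delta to the aggregator only once it crosses a threshold, so queries read a continuously fresh aggregate instead of triggering whole-sketch merges. The threshold and sketch choice are assumptions.

    THRESH = 16                 # ship a counter's delta once it reaches this
    W, D = 1024, 3              # count-min width and depth
    SEEDS = [13, 101, 997]

    local = [[[0] * W for _ in range(D)] for _ in range(2)]   # two cores
    agg = [[0] * W for _ in range(D)]                         # aggregator

    def update(core, key):
        for d in range(D):
            i = hash((SEEDS[d], key)) % W
            local[core][d][i] += 1
            if local[core][d][i] >= THRESH:     # change-based: push the delta
                agg[d][i] += local[core][d][i]
                local[core][d][i] = 0

    def query(key):                             # served from the aggregator
        return min(agg[d][hash((SEEDS[d], key)) % W] for d in range(D))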

Track 2

Wireless Sensing

Habitus: Boosting Mobile Immersive Content Delivery through Full-body Pose Tracking and Multipath Networking

Anlan Zhang, University of Southern California; Chendong Wang, University of Wisconsin — Madison; Yuming Hu, University of Minnesota — Twin Cities; Ahmad Hassan and Zejun Zhang, University of Southern California; Bo Han, George Mason University; Feng Qian, University of Southern California; Shichang Xu, Google

Delivering immersive content such as volumetric videos and virtual/mixed reality requires tremendous network bandwidth. Millimeter Wave (mmWave) radios such as 802.11ad/ay and mmWave 5G can provide multi-Gbps peak bandwidth, making them good candidates. However, mmWave is vulnerable to blockage/mobility and its signal attenuates very fast, posing a major challenge to mobile immersive content delivery systems where viewers are in constant motion and the human body may easily block the line-of-sight.

To overcome this challenge, we investigate two under-explored dimensions in this paper. First, we use the combination of a viewer's full-body pose and network information to predict mmWave performance as the viewer exercises six-degree-of-freedom (6-DoF) motion. We apply both offline and online transfer learning so that the prediction models can react to changes unseen during initial training. Second, we jointly use the omnidirectional radio and the mmWave radio available on commodity mobile devices, which have complementary network characteristics, to deliver immersive data. We integrate these two features into a user-space software framework called Habitus and demonstrate how it can be easily integrated into existing immersive content delivery systems to boost their network performance, yielding up to 72% improvement in quality of experience (QoE).
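
A minimal sketch of the second idea, assuming hypothetical predictor outputs (the split rule and the 0.5 fallback threshold are ours, not Habitus's): split each frame across the two radios in proportion to predicted throughput, falling back to the omnidirectional path when blockage looks likely.

def split_frame(frame_bytes, pred_mmwave_mbps, pred_omni_mbps, blockage_prob):
    # Proportional split across the two radios; inputs would come from
    # pose- and network-based predictors, which are assumptions here.
    if blockage_prob > 0.5 or pred_mmwave_mbps <= 0:
        return 0, frame_bytes  # omnidirectional radio only
    total = pred_mmwave_mbps + pred_omni_mbps
    mm = int(frame_bytes * pred_mmwave_mbps / total)
    return mm, frame_bytes - mm

print(split_frame(1_000_000, 1800, 200, blockage_prob=0.1))  # mostly mmWave
print(split_frame(1_000_000, 1800, 200, blockage_prob=0.8))  # fall back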

BFMSense: WiFi Sensing Using Beamforming Feedback Matrix

Enze Yi and Dan Wu, Peking University; Jie Xiong, University of Massachusetts Amherst; Fusang Zhang, Institute of Software, Chinese Academy of Sciences and University of Chinese Academy of Sciences; Kai Niu, Beijing Xiaomi Mobile Software Company Ltd.; Wenwei Li, Peking University; Daqing Zhang, Peking University and Institut Polytechnique de Paris

Available Media

WiFi-based contactless sensing has attracted a tremendous amount of attention due to its pervasiveness, low cost, and non-intrusiveness to users. Existing systems mainly leverage channel state information (CSI) for sensing. However, CSI can only be extracted from very few commodity WiFi devices through driver hacking, severely limiting the adoption of WiFi sensing in real life. We observe a new opportunity: a large range of new-generation WiFi cards can report another piece of information, the beamforming feedback matrix (BFM). In this paper, we propose to leverage this new BFM information for WiFi sensing. By establishing the relationship between BFM and CSI, we lay the theoretical foundations for BFM-based WiFi sensing for the first time. We show that, through careful signal processing, BFM can be utilized for fine-grained sensing. We showcase the sensing capability of BFM using two representative applications: respiration sensing and human trajectory tracking. Comprehensive experiments show that BFM-based WiFi sensing achieves highly accurate performance on a large range of new-generation WiFi devices from various manufacturers, moving WiFi sensing one big step toward real-life adoption.
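
As a toy illustration of the sensing principle (not the paper's pipeline): once a BFM-derived time series is available, a periodic motion such as respiration shows up as a spectral peak. A self-contained example on synthetic data:

import numpy as np

fs = 20.0                          # sample rate (Hz) of the BFM-derived series
t = np.arange(0, 60, 1 / fs)
truth_hz = 0.25                    # 15 breaths/min ground truth
series = np.sin(2 * np.pi * truth_hz * t) + 0.3 * np.random.randn(t.size)

spectrum = np.abs(np.fft.rfft(series - series.mean()))
freqs = np.fft.rfftfreq(series.size, d=1 / fs)
band = (freqs > 0.1) & (freqs < 0.5)  # plausible respiration band
rate_hz = freqs[band][np.argmax(spectrum[band])]
print(f"estimated respiration: {rate_hz * 60:.1f} breaths/min")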

mmComb: High-speed mmWave Commodity WiFi Backscatter

Yoon Chae and Zhenzhe Lin, George Mason University; Kang Min Bae and Song Min Kim, Korea Advanced Institute of Science and Technology (KAIST); Parth Pathak, George Mason University

Available Media

High-speed connectivity is key to enabling a range of novel IoT applications. Millimeter-wave (mmWave) backscatter has emerged as a possible solution for creating high-speed, low-power IoT networks. However, state-of-the-art mmWave backscatter systems are costly due to the need for dedicated mmWave reader devices. This paper presents mmComb, a mmWave backscatter system built to operate on commodity mmWave WiFi. mmComb is developed with the aim that mmWave backscatter tags can be directly integrated into 802.11ad/ay mmWave WiFi networks. mmComb makes two key contributions. First, we propose a technique to communicate with backscatter tags using existing beamforming protocol frames from mmWave WiFi devices, without any protocol modification. Second, we develop a self-interference suppression solution that intelligently uses receive beamforming to extract the weak mmWave backscatter signal even in indoor multipath-rich channels. We implement our solution with a tag prototype and 60 GHz commodity WiFi devices. Our results show that mmComb can achieve a maximum data rate of 55 Mbps just by leveraging 802.11ad/ay control frames, while consuming 87.3 μW with a BER below 10^−3 at ranges up to 5.5 m.
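
For intuition on the on-off keying (OOK) link, here is a naive energy-threshold demodulator over synthetic samples (illustrative only; mmComb's receive-beamforming pipeline is far more involved):

import numpy as np

def ook_demod(samples, samples_per_bit=4):
    # Average per-bit energy, then threshold at the global mean.
    n = len(samples) // samples_per_bit
    energy = np.abs(samples[:n * samples_per_bit]).reshape(n, -1).mean(axis=1)
    return (energy > energy.mean()).astype(int)

bits = np.random.randint(0, 2, 1000)
rx = np.repeat(bits, 4).astype(float) + 0.2 * np.random.randn(4000)  # noisy channel
print("BER:", np.mean(ook_demod(rx) != bits))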

12:30 pm–2:00 pm

Lunch (on your own)

2:00 pm–3:40 pm

Track 1

Security

A System to Detect Forged-Origin BGP Hijacks

Thomas Holterbach and Thomas Alfroy, University of Strasbourg; Amreesh Phokeer, Internet Society; Alberto Dainotti, Georgia Tech; Cristel Pelsser, UCLouvain

Available Media

Despite global efforts to secure Internet routing, attackers still successfully exploit the lack of strong BGP security mechanisms. This paper focuses on a frequently used attack vector: forged-origin hijacks, a type of BGP hijack in which the attacker manipulates the AS path so that the announcement is immune to RPKI-ROV filters and appears, from a BGP monitoring standpoint, to be a legitimate routing update. Our contribution is DFOH, a system that quickly and consistently detects forged-origin hijacks across the whole Internet. Detecting forged-origin hijacks boils down to inferring whether the AS path in a BGP route is legitimate or has been manipulated. We demonstrate that current state-of-the-art approaches to detecting BGP anomalies are insufficient to deal with forged-origin hijacks. We identify the key properties that make the inference of forged AS paths challenging, and design DFOH to be robust against real-world factors. Our inference pipeline includes two key ingredients: (i) a set of strategically selected features, and (ii) a training scheme adapted to topological biases. DFOH detects 90.9% of forged-origin hijacks within only ≈5 min. In addition, it reports only ≈17.5 suspicious cases per day for the whole Internet, a number small enough for operators to investigate the reported cases and take countermeasures.
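
One intuition behind such features (our illustrative example, not DFOH's actual feature set): a forged-origin announcement typically introduces AS adjacencies never observed in historical paths, so adjacency novelty is a natural signal.

def unseen_links(path, known_links):
    # Flag AS adjacencies in the announced path never observed historically.
    links = set(zip(path, path[1:]))
    return {l for l in links
            if l not in known_links and l[::-1] not in known_links}

history = {(3356, 174), (174, 65001)}
print(unseen_links([3356, 64512, 65001], history))
# {(3356, 64512), (64512, 65001)} -> adjacencies worth investigating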

NetVigil: Robust and Low-Cost Anomaly Detection for East-West Data Center Security

Kevin Hsieh, Microsoft; Mike Wong, Princeton University and Microsoft; Santiago Segarra, Microsoft and Rice University; Sathiya Kumaran Mani, Trevor Eberl, and Anatoliy Panasyuk, Microsoft; Ravi Netravali, Princeton University; Ranveer Chandra and Srikanth Kandula, Microsoft

Available Media

The growing number of breaches in data centers underscores an urgent need for more effective security. Traditional perimeter defense measures and static zero-trust approaches are unable to address the unique challenges that arise from the scale, complexity, and evolving nature of today's data center networks. To tackle these issues, we introduce NetVigil, a robust and cost-efficient anomaly detection system specifically designed for east-west traffic within data center networks. NetVigil adeptly extracts security-focused, graph-based features from network flow logs and employs domain-specific graph neural networks (GNNs) and contrastive learning techniques to strengthen its resilience against normal traffic variations and adversarial evasion strategies. Our evaluation, over various attack scenarios and traces from real-world production clusters, shows that NetVigil delivers significant improvements in accuracy, cost, and detection latency compared to state-of-the-art anomaly detection systems, providing a practical, supplementary security mechanism to protect the east-west traffic within data center networks.
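
The first stage can be pictured as follows (a minimal sketch under our own assumptions about the flow-log schema; the GNN and contrastive training are not shown): flow logs in each time window are folded into a communication graph whose edges carry security-relevant aggregates.

from collections import defaultdict

flows = [  # (src, dst, dst_port, bytes) from a hypothetical flow-log window
    ("10.0.0.1", "10.0.0.2", 443, 5200),
    ("10.0.0.1", "10.0.0.2", 443, 4100),
    ("10.0.0.3", "10.0.0.2", 22, 900),
]

edges = defaultdict(lambda: {"bytes": 0, "flows": 0, "ports": set()})
for src, dst, port, nbytes in flows:
    e = edges[(src, dst)]
    e["bytes"] += nbytes
    e["flows"] += 1
    e["ports"].add(port)

for (src, dst), f in edges.items():  # per-edge features for a downstream model
    print(src, "->", dst, f["bytes"], f["flows"], sorted(f["ports"]))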

Track 2

Mobile Things

Catch Me If You Can: Laser Tethering with Highly Mobile Targets

Charles J. Carver, Hadleigh Schwartz, and Qijia Shao, Columbia University; Nicholas Shade, Joseph Lazzaro, Xiaoxin Wang, Jifeng Liu, and Eric Fossum, Dartmouth College; Xia Zhou, Columbia University

Available Media

Conventional wisdom holds that laser-based systems cannot handle high mobility due to the strong directionality of laser light. We challenge this belief by presenting Lasertag, a generic framework that tightly integrates laser steering with optical tracking to maintain laser connectivity with high-velocity targets. Lasertag creates a constantly connected, laser-based tether between the Lasertag core unit and a remote target, irrespective of the target's movement. Key elements of Lasertag include (1) a novel optical design that superimposes the optical paths of a steerable laser beam and an image sensor, (2) a lightweight optical tracking mechanism for passive retroreflective markers, (3) an automated mapping method that translates scene points to laser steering commands, and (4) a predictive steering algorithm that overcomes limited image sensor frame rates and laser steering delays to quadruple the steering rate, up to 151 Hz. Experiments with the Lasertag prototype demonstrate that, on average, Lasertag delivers a median 97% of laser energy with a median alignment offset of only 1.03 cm for mobile targets accelerating at up to 49 m/s^2, with speeds up to 6.5 m/s and distances up to 6 m (≈ 47°/s). Additional experiments translate this performance to a median bit error rate of 10^-8 across trials when transmitting a 1 Gbps on-off keying signal. Lasertag paves the way for various laser applications (e.g., communication, sensing, power delivery) in mobile settings. A demonstration video of Lasertag is available at: mobilex.cs.columbia.edu/lasertag
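
For intuition on element (4), a simplified stand-in (not the paper's predictor, and assuming a roughly 38 Hz tracking camera): constant-acceleration extrapolation lets the steering loop emit commands between camera frames.

def predict(p, v, a, dt):
    # Constant-acceleration extrapolation of the target position (meters).
    return p + v * dt + 0.5 * a * dt * dt

cmd_dt = 1 / 151           # steering command interval
p, v, a = 0.50, 6.5, 49.0  # last tracked position, speed, acceleration
for k in range(1, 4):      # commands issued between two camera frames
    print(f"cmd {k}: aim at {predict(p, v, a, k * cmd_dt):.4f} m")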

Passengers' Safety Matters: Experiences of Deploying a Large-Scale Indoor Delivery Monitoring System

Xiubin Fan, City University of Hong Kong; Zhongming Lin, The Hong Kong University of Science and Technology; Yuming Hu, University of Minnesota - Twin Cities; Tianrui Jiang, The Hong Kong University of Science and Technology; Feng Qian, University of Southern California; Zhimeng Yin, City University of Hong Kong; S.-H. Gary Chan, The Hong Kong University of Science and Technology; Dapeng Wu, City University of Hong Kong

Available Media

Delivering goods to many indoor stores poses significant safety issues, as heavy, high-stacked packages carried on delivery trolleys may fall and hurt passersby. This paper reports our experiences developing and operating DeMo, a practical system for real-time monitoring of indoor deliveries. DeMo attaches sensors to trolleys and analyzes Inertial Measurement Unit (IMU) and Bluetooth Low Energy (BLE) readings to detect delivery violations such as speeding and using non-designated delivery paths. Unlike typical indoor localization applications, DeMo must overcome challenges specific to this setting, such as unusual sensor placement and the complex electromagnetic characteristics of underground environments. In particular, DeMo adapts the classical logarithmic radio signal model to support fingerprint-free localization, drastically lowering deployment and maintenance costs. DeMo has been operating since May 2020, covering more than 200 shops with 42,248 deliveries (3,521.4 km) across 12 subway stations in Hong Kong. DeMo's three-year operation witnessed a significant drop in the violation rate, from 19% (May 2020) to 2.7% (March 2023).
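
The model in question is the classical log-distance path-loss model; a minimal sketch with illustrative parameters (the per-station adaptation DeMo performs is not shown):

import math

def rssi_to_distance(rssi_dbm, rssi_at_1m=-45.0, path_loss_exp=2.7):
    # Log-distance model: RSSI(d) = RSSI(1 m) - 10 * n * log10(d).
    # Both parameters here are illustrative, not DeMo's calibrated values.
    return 10 ** ((rssi_at_1m - rssi_dbm) / (10 * path_loss_exp))

for rssi in (-45, -60, -75):
    print(f"RSSI {rssi} dBm -> ~{rssi_to_distance(rssi):.1f} m")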

AUGUR: Practical Mobile Multipath Transport Service for Low Tail Latency in Real-Time Streaming

Yuhan Zhou, School of Computer Science, Peking University and Tencent Inc.; Tingfeng Wang, Tencent Inc.; Liying Wang, School of Computer Science, Peking University; Nian Wen, Rui Han, Jing Wang, Chenglei Wu, Jiafeng Chen, and Longwei Jiang, Tencent Inc.; Shibo Wang, Xi'an Jiaotong University and Tencent Inc.; Honghao Liu, Tencent Inc.; Chenren Xu, School of Computer Science, Peking University and Zhongguancun Laboratory and Key Laboratory of High Confidence Software Technologies, Ministry of Education (PKU)

Available Media

Real-time streaming applications like cloud gaming require consistently low latency, even at the tail. Our large-scale measurement study of a major cloud gaming service reveals that in Wi-Fi networks, the delay of the wireless hop can inflate due to its fluctuating nature, making it difficult to achieve consistently low tail latency. While cellular paths can be leveraged to mitigate the wireless fluctuations of Wi-Fi paths, our user study reveals that it is crucial to constrain cellular data usage when using multipath transport. In this paper, we present AUGUR, a multipath transport service designed to reduce long-tail latency and video frame stall rates in mobile real-time streaming. To reduce long-tail latency by utilizing cellular paths while minimizing cellular data usage, AUGUR captures user characteristics by deriving state probability models and formulates the trade-off as an Integer Linear Programming (ILP) problem for each user session to determine when to retransmit frames and which path to select. Our trace-driven emulation and large-scale real-world deployment on the Tencent Start cloud gaming platform demonstrate that AUGUR achieves up to a 66.0% reduction in tail latency and a 99.5% reduction in frame stall rate, with an 88.1% decrease in cellular data usage, compared to other multipath transport schemes.
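
A toy version of the per-session decision (our own simplification: a brute-force search over a per-frame latency budget, rather than the paper's ILP over state probability models) makes the objective concrete: meet the latency target while sending as little as possible over cellular.

from itertools import product

frames = [  # (wifi_delay_ms, cell_delay_ms, size_kb) for a toy session
    (18, 35, 40), (95, 34, 40), (22, 36, 40), (120, 33, 40),
]
BUDGET_MS = 50  # stand-in for a per-frame tail-latency target

best = None
for choice in product((0, 1), repeat=len(frames)):  # 1 = send via cellular
    ok = all((f[1] if c else f[0]) <= BUDGET_MS
             for c, f in zip(choice, frames))
    if ok:
        cell_kb = sum(f[2] for c, f in zip(choice, frames) if c)
        if best is None or cell_kb < best[0]:
            best = (cell_kb, choice)

print(best)  # (80, (0, 1, 0, 1)): cellular only for the two slow-Wi-Fi frames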

3:40 pm–4:10 pm

Break with Refreshments

4:10 pm–5:30 pm

Track 1

Cloud Systems

Zombie: Middleboxes that Don’t Snoop

Collin Zhang, Cornell; Zachary DeStefano, Arasu Arun, and Joseph Bonneau, NYU; Paul Grubbs, University of Michigan; Michael Walfish, NYU

Available Media

Zero-knowledge middleboxes (ZKMBs) are a recent paradigm in which clients get privacy and middleboxes enforce policy: clients prove in zero knowledge that the plaintext underlying their encrypted traffic complies with network policies, such as DNS filtering. However, prior work had impractically poor performance and was limited in functionality.

This work presents Zombie, the first system built using the ZKMB paradigm. Zombie introduces techniques that push ZKMBs to the verge of practicality: preprocessing (to move the bulk of proof generation to idle times between requests), asynchrony (to remove proving and verifying costs from the critical path), and batching (to amortize some of the verification work). Zombie's choices, together with these techniques, reduce client and middlebox overhead by ≈3.5×, lowering the critical-path overhead for a DNS filtering application on commodity hardware to less than 300 ms or, in the asynchronous configuration, to zero.
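
A schematic of the asynchrony-plus-batching idea (ours, with toy costs; the verification function is a stand-in, not Zombie's cryptographic verifier): proofs are queued while traffic is forwarded, and a background thread drains the queue in batches so the fixed verification cost is amortized.

import queue, threading, time

def verify_batch(proofs):
    # Stand-in for batched verification: a fixed cost amortized across proofs.
    time.sleep(0.001 + 0.0002 * len(proofs))

inbox = queue.Queue()

def verifier_loop():
    while True:
        batch = [inbox.get()]  # block for the first pending proof...
        while len(batch) < 64:
            try:
                batch.append(inbox.get_nowait())  # ...then drain opportunistically
            except queue.Empty:
                break
        verify_batch(batch)    # runs off the request critical path

threading.Thread(target=verifier_loop, daemon=True).start()
for i in range(200):
    inbox.put(f"proof-{i}")    # traffic forwarding is not blocked here
time.sleep(0.5)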

As an additional contribution that is likely of independent interest, Zombie introduces a portfolio of techniques to encode regular expressions in probabilistic (and zero-knowledge) proofs. These techniques significantly improve performance over a standard baseline, asymptotically and concretely. Zombie builds on this portfolio to support policies based on regular expressions, such as data loss prevention.

Solving Max-Min Fair Resource Allocations Quickly on Large Graphs

Pooria Namyar, Microsoft and University of Southern California; Behnaz Arzani and Srikanth Kandula, Microsoft; Santiago Segarra, Microsoft and Rice University; Daniel Crankshaw and Umesh Krishnaswamy, Microsoft; Ramesh Govindan, University of Southern California; Himanshu Raj, Microsoft

Available Media

We consider the max-min fair resource allocation problem. The best-known solutions use either a sequence of optimizations or waterfilling, which applies only to a narrow set of cases. These solutions have become a practical bottleneck in WAN traffic engineering and cluster scheduling, especially at larger problem sizes. We improve both approaches: (1) we show how to convert the optimization sequence into a single fast optimization, and (2) we generalize waterfilling to the multi-path case. We empirically show that our new algorithms Pareto-dominate prior techniques: they produce faster, fairer, and more efficient allocations. Some of our allocators also have theoretical guarantees: they trade a bounded amount of unfairness for faster allocation. We have deployed our allocators in Azure's WAN traffic engineering pipeline, where we preserve solution quality and achieve a roughly 3× speedup.
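
For background, classical single-path waterfilling, which the paper generalizes, fits in a short function (a minimal sketch, ours): repeatedly raise all unfrozen flows' rates uniformly until some link saturates, then freeze the flows crossing it.

def waterfill(links, flows):
    # links: {link_id: capacity}; flows: one path (list of link ids) per flow.
    rate, frozen, cap = [0.0] * len(flows), [False] * len(flows), dict(links)
    while not all(frozen):
        active = [i for i in range(len(flows)) if not frozen[i]]
        # Largest uniform increment before some link carrying active flows fills.
        inc = min(cap[l] / sum(l in flows[i] for i in active)
                  for l in cap if any(l in flows[i] for i in active))
        for i in active:
            rate[i] += inc
        for l in cap:
            users = [i for i in active if l in flows[i]]
            cap[l] -= inc * len(users)
            if users and cap[l] < 1e-9:  # saturated: freeze its flows
                for i in users:
                    frozen[i] = True
    return rate

print(waterfill({"A": 10, "B": 4}, [["A"], ["A", "B"], ["B"]]))  # [8.0, 2.0, 2.0]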

Track 2

Modeling Networks

CAPA: An Architecture For Operating Cluster Networks With High Availability

Bingzhe Liu, University of Illinois Urbana-Champaign; Colin Scott, Mukarram Tariq, Andrew Ferguson, Phillipa Gill, Richard Alimi, Omid Alipourfard, Deepak Arulkannan, Virginia Jean Beauregard, and Patrick Conner, Google; Brighten Godfrey, UIUC and VMware; Xander Lin, Google; Joon Ong, Google; Mayur Patel, Amr Sabaa, Arjun Singh, Alex Smirnov, Manish Verma, Prerepa V Viswanadham, and Amin Vahdat, Google

Klonet: an Easy-to-Use and Scalable Platform for Computer Networks Education

Tie Ma, Long Luo, and Hongfang Yu, University of Electronic Science and Technology of China; Xi Chen, Southwest Minzu University; Jingzhao Xie, Chongxi Ma, Yunhan Xie, Gang Sun, and Tianxi Wei, University of Electronic Science and Technology of China; Li Chen, Zhongguancun Laboratory; Yanwei Xu, Huawei Theory Research Lab; Nicholas Zhang, Huawei Technologies