CMMRS will include lectures from faculty at Cornell University, the University of Maryland, and the Max Planck Institutes.
Yizheng Chen, University of Maryland, College Park: Securing AI Coding Assistants
Code Large Language Models (LLMs) like GitHub Copilot, Amazon CodeWhisperer, and Code Llama have revolutionized software development, boosting productivity for millions of developers. By 2028, it is estimated that 75% of enterprise software engineers will rely on AI Coding Assistants. However, Code LLMs raise significant security concerns, with studies indicating that 40% of programs generated by GitHub Copilot are vulnerable. Thus, there is an urgent need to ensure the safety of Code LLMs.
In this lecture, I will begin by reviewing the traditional methods used to evaluate the functional correctness and security of code produced by LLMs, pointing out the inherent limitations of these approaches. I will then introduce a new benchmark and new metrics we’ve developed to more accurately assess the correctness and security of Code LLMs. I will delve into how Code LLMs generate code and our research on the security of the state-of-the-art models. Finally, I will discuss future research directions and the challenges of generating secure code.
Giulia Guidi, Cornell University: High-Performance Computing Meets Biology
The use of massively parallel systems is playing a crucial role in new and diverse areas of data science, such as computational biology and data analytics. Computational biology is a key area in which data processing is rapidly increasing. The growing volume of data and increasing complexity have outpaced the processing capacity of single-node machines in these areas, making massively parallel systems an indispensable tool.
The emerging complex challenges in computational biology require large-scale parallel computing infrastructures. Furthermore, as we enter the post-Moore’s Law era, effective programming of specialized architectures is critical to improving the performance of high-performance computing. As large-scale systems become more heterogeneous, their efficient use for new, often irregular, and communication-intensive data analysis computation becomes increasingly complex. This talk will discuss how performance and scalability can be achieved on extreme-scale systems while maintaining productivity for new data-intensive biological challenges, and how high performance can be achieved on new specialized AI architectures such as SRAM-based Graphcore IPUs.
Ming C. Lin, University of Maryland, College Park: Reconstructing Reality: From Physical World to Virtual Environments
With the increasing availability of data in various forms, from images, audio, video, 3D models, motion capture, and simulation results to satellite imagery, representative samples of the various phenomena constituting the world around us bring new opportunities and research challenges. Such availability of data has led to recent advances in data-driven modeling. However, most existing example-based synthesis methods offer empirical models and data reconstruction that may not provide an insightful understanding of the underlying process or may be limited to a subset of observations.
In this talk, I present recent advances that integrate classical model-based methods and statistical learning techniques to tackle challenging problems that have not been previously addressed. These include flow reconstruction for traffic visualization, learning heterogeneous crowd behaviors from video, simultaneous estimation of deformation and elasticity parameters from images and video, and example-based multimodal display for VR systems. These approaches offer new insights for understanding complex collective behaviors, developing better models for complex dynamical systems from captured data, delivering more effective medical diagnosis and treatment, as well as cyber-manufacturing of customized apparel. I conclude by discussing some possible future directions and challenges.
Alan Zaoxing Liu, University of Maryland, College Park
Lecture 1: Introduction to Network Attacks
This lecture aims to provide a comprehensive introduction to basic and advanced network attacks that pose significant threats to modern Internet infrastructures, such as Distributed Denial of Service (DDoS) and advanced persistent threats (APT). I will begin with an overview of the concepts related to network architecture and the role of security protocols. I will delve into several representative network attacks, categorizing them based on their methods and targets. Attendees will gain an understanding of how these attacks function and their potential impacts on society.
Lecture 2: Defending Advanced Network Attacks with Programmable Networks
This lecture will discuss effective detection and mitigation solutions for the latest DDoS (e.g., volumetric and link flood attacks) and APT attacks. With the emergence of highly flexible network devices such as programmable switches and network interface cards, we as defenders can design more performant and cost-effective software and hardware defense solutions. I will delve into a research prototype that leverages programmable network hardware to mitigate DDoS attacks with minimal performance degradation. Finally, I will chart paths to designing future advanced network defense systems.
David Mimno, Cornell University
Lecture 1: What language models do, and how they do it
The lecture will cover the history of language modeling, from early word counting methods to contemporary transformers. We will cover tokenization, predictive probabilities, attention mechanisms, and encoder/decoder architectures. We will use the Hugging Face PyTorch API to compare model outputs, explore network activations, and do a quick example of model finetuning.
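As a rough illustration of the early word counting methods and predictive probabilities mentioned above, the following minimal sketch estimates next-token probabilities from bigram counts. The toy corpus and the function names (train_bigram, next_token_probs) are assumptions made for illustration; they are not taken from the lecture materials.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn raw follow-up counts for `prev` into a probability distribution."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}


# Hypothetical toy corpus, tokenized by whitespace.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)

# "the" is followed by "cat" twice and "mat" once: cat 2/3, mat 1/3.
probs = next_token_probs(model, "the")
print(probs)
```

A transformer replaces the count table with a learned function of the whole preceding context, but its output is still a probability distribution over the next token.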
Lecture 2: Language models and data
In this lecture we will discuss how language models are trained from a data perspective. We will cover how text data is collected, how it is used, and what implications those choices have for the behavior of models. We will introduce few-shot and prompt-based learning and explore how training data choices affect these capabilities. We will touch on legal and ethical issues around data use.
Danupon Nanongkai, MPI for Informatics: Modern Graph Algorithms
There have been many fast algorithms discovered recently for graph problems that otherwise witnessed no progress in the last few decades. In this lecture series, we will explore some of the techniques underlying these advances. The focus will be on recent advances in computing shortest paths and problems related to maximum flow. While the lectures will focus on sequential algorithms, we will also discuss applications of these techniques in other models of computation such as distributed, dynamic, and streaming algorithms. (In fact, these settings are where some of the techniques originated from.)
Peter Schwabe, MPI for Security and Privacy: The next generation of cryptographic software
Since Shor’s seminal 1994 paper, we have known that once physicists and quantum engineers are able to build a large universal quantum computer, our current generation of asymmetric cryptography will be broken. With increasing progress towards such a quantum computer becoming reality, the world is currently moving to a new generation of cryptography: so-called post-quantum cryptography. A major step towards the deployment of this new generation of primitives for key agreement and digital signatures is a (still ongoing) effort by NIST to identify suitable candidate schemes for standardization. In July 2022, NIST selected the first batch of algorithms for standardization; the only key-agreement scheme in this batch is CRYSTALS-Kyber, which is expected to become a standard this summer. In my lectures, I will explain the design of CRYSTALS-Kyber and then illustrate the challenges we will be facing with secure and efficient implementations and deployment of post-quantum cryptography.
Rachee Singh, Cornell University
Lecture 1: Introduction to programmable photonics
Optical fiber underpins modern long-haul wide-area networks and datacenter networks. Recent years have seen the widespread adoption of programmable optical switches and optical transponders that can modify properties of light transmission on fiber depending on their configuration. This has ushered in a wave of ideas that leverage the programmability of optical hardware to achieve different objectives in optical networks at both design-time and runtime. In this lecture, we will learn the fundamentals of optical data transmission in fiber and explore techniques to achieve high performance, reliability and cost-effectiveness using programmable optical components.
Lecture 2: Photonic Collective Communication for Distributed Machine Learning
Distributed ML training and inference require intermediate model parameters on accelerators to be accumulated, reduced and transferred over the network between accelerators using collective communication primitives. This talk will focus on mechanisms that accelerate collective communication during distributed training and inference on multi-accelerator systems. We leverage novel server-scale photonic interconnects for inter-accelerator communication to tackle this challenge. We harness the programmability of photonics to implement circuit-switched connections between accelerators and develop efficient algorithms to use these photonic circuits for inter-accelerator collective communication.
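To make "collective communication primitives" concrete, here is a minimal single-process simulation of one widely used primitive, all-reduce, using the standard ring algorithm. This is only a sketch of the general idea: it is not the photonic approach described in the talk, and the function name ring_allreduce, the one-number-per-chunk layout and the example values are assumptions made for illustration.

```python
def ring_allreduce(data):
    """Simulate a sum all-reduce over n workers arranged in a ring.

    data[i][c] is worker i's local value for chunk c; each worker must hold
    exactly n chunks (one number per chunk to keep the sketch small).
    """
    n = len(data)
    if any(len(row) != n for row in data):
        raise ValueError("each worker must hold exactly n chunks")
    buf = [list(row) for row in data]

    # Phase 1 (reduce-scatter): in step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, buf[i][(i - s) % n]) for i in range(n)]
        for i, c, val in sends:
            buf[(i + 1) % n][c] += val

    # Worker i now holds the fully reduced chunk (i + 1) mod n.
    # Phase 2 (all-gather): circulate the finished chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, buf[i][(i + 1 - s) % n]) for i in range(n)]
        for i, c, val in sends:
            buf[(i + 1) % n][c] = val

    return buf


# Hypothetical example: three workers, three chunks each.
result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each worker only ever talks to its right-hand neighbour, so the communication pattern maps onto a fixed set of point-to-point links between accelerators.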
Adish Singla, MPI for Software Systems: Generative AI for Education
Recent advances in generative AI, in particular deep generative and large language models like ChatGPT, are having transformational effects on the educational landscape. On the one hand, these advances provide unprecedented opportunities to enhance education by creating unique human-machine collaborative systems. For instance, these models could act as personalized digital tutors for students, as digital assistants for educators, and as digital peers to enable new collaborative learning scenarios. On the other hand, the advanced capabilities of these generative AI models have brought unexpected challenges for educators and policymakers worldwide. For instance, these advances have caused chaotic disruption in universities and schools, which must design regulatory policies about the usage of these models and educate both students and instructors about their strengths and limitations.
This lecture series will provide an overview of the research opportunities and challenges in applying generative AI methods for improving education. The lectures will investigate these opportunities and challenges in education by focusing on two thrusts: (i) exploring how recent advances in generative AI provide new opportunities to drastically improve state-of-the-art educational technology; (ii) identifying unique challenges in education that require safeguards along with technical innovations in generative AI.
Milijana Surbatovich, University of Maryland, College Park: Type Systems for Intermittent Computing
Energy-harvesting devices (EHDs) are a new class of embedded computing platforms that are powered solely from energy collected from the environment, without using batteries. These devices enable new applications in domains like disaster monitoring, body implants, or smart city infrastructure. Unfortunately, environmental energy is scarce, so EHDs are powered only intermittently, experiencing frequent failures that make correct programming difficult. This situation is especially problematic because the envisioned domains have high assurance requirements; we do not want applications for medical devices or critical infrastructure to have bugs! Thus, my research has looked at how we can use programming language techniques to design systems for EHDs that have *provable* correctness guarantees.
In these lectures, I first introduce the field of “intermittent computing” on EHDs, showing why frequent power failures cause incorrect execution and how these errors can be addressed. I then cover the basics of formal programming language semantics and type checking, showing how to leverage these to identify and prove desired correctness properties. Finally, I connect the two topics by describing my recent work in developing type systems for reasoning about intermittent execution and discuss how these ideas apply to other emerging computing architectures, beyond EHDs.
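One way a power failure can cause incorrect execution is sketched below. A task reads and then overwrites non-volatile state, loses power before finishing, and is re-executed from the start after reboot, so the update is applied twice. This is a toy model under stated assumptions, not the type-system approach from the lectures: the dictionary nv stands in for non-volatile memory, PowerFailure and the function names are hypothetical, and the single nv.update call is assumed to behave as one atomic commit.

```python
class PowerFailure(Exception):
    """Models the device losing power partway through a task."""


def increment_unsafe(nv, fail):
    value = nv["count"]        # read non-volatile state
    nv["count"] = value + 1    # overwrite it (write-after-read)
    if fail:
        raise PowerFailure()   # power lost before the task is marked done
    nv["done"] = True


def increment_atomic(nv, fail):
    # Stage the new values in volatile memory first.
    staged = {"count": nv["count"] + 1, "done": True}
    if fail:
        raise PowerFailure()   # nothing in nv has changed yet
    nv.update(staged)          # assumed to be a single atomic commit


def run_with_one_failure(task):
    """Run `task` until it completes; the first attempt loses power."""
    nv = {"count": 0, "done": False}
    fail = True
    while not nv["done"]:
        try:
            task(nv, fail)
        except PowerFailure:
            fail = False       # after reboot, the task restarts from the top
    return nv["count"]


unsafe_count = run_with_one_failure(increment_unsafe)  # counter bumped twice
atomic_count = run_with_one_failure(increment_atomic)  # counter bumped once
print(unsafe_count, atomic_count)
```

The unsafe task is not idempotent under re-execution, which is the kind of property one would want a checker to rule out before deployment.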