Alex Wong is an Assistant Professor in the Department of Computer Science and the director of the Vision Laboratory at Yale University. He also serves as the Director of AI (consulting capacity) for Horizon Surgical Systems. Prior to joining Yale, he was an Adjunct Professor at Loyola Marymount University (LMU) from 2018 to 2020. He received his Ph.D. in Computer Science from the University of California, Los Angeles (UCLA) in 2019, where he was co-advised by Stefano Soatto and Alan Yuille. He was previously a post-doctoral research scholar at UCLA under the guidance of Soatto. His research lies at the intersection of machine learning, computer vision, and robotics and largely focuses on multimodal 3D reconstruction, robust vision under adverse conditions, and unsupervised learning. His work has received the outstanding student paper award at the Conference on Neural Information Processing Systems (NeurIPS) 2011 and the best paper award in robot vision at the International Conference on Robotics and Automation (ICRA) 2019.
Abstract
Training deep neural networks requires tens of thousands to millions of examples, so curating multimodal vision datasets demands numerous person-hours; tasks like depth estimation require an even more massive effort. I will introduce an alternative form of supervision that leverages multi-sensor validation as an unsupervised (or self-supervised) training objective for depth estimation. To address the ill-posedness of this objective, I will show how one can leverage multimodal inputs in the choice of regularizers, which affect model complexity, speed, generalization, and adaptation to test-time (possibly adverse) environments. Additionally, I will discuss the current limitations of the data augmentation procedures used during unsupervised training, which reconstructs the inputs as the supervision signal, and detail a method that allows one to scale up and introduce previously infeasible augmentations to boost performance. Finally, I will show how one can scalably expand the number of modalities supported by multimodal models and demonstrate their use in a number of downstream semantic tasks.
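To make the reconstruction-as-supervision idea concrete, below is a minimal sketch of a photometric reconstruction loss for depth, assuming PyTorch, known camera intrinsics, and a relative pose between two views (e.g., provided by another sensor). It illustrates the general principle rather than the speaker's exact formulation.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using the predicted depth. depth: (B, 1, H, W)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)  # (1, 3, HW)
    return (K_inv @ pix) * depth.reshape(B, 1, -1)                             # (B, 3, HW)

def photometric_loss(img_tgt, img_src, depth_tgt, K, T_src_tgt):
    """Warp the source image into the target view via predicted depth, compare with L1.

    img_*: (B, 3, H, W), K: (B, 3, 3) intrinsics, T_src_tgt: (B, 4, 4) relative pose.
    """
    B, _, H, W = img_tgt.shape
    pts = backproject(depth_tgt, torch.inverse(K))                  # 3D points in target frame
    pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=1)    # homogeneous (B, 4, HW)
    pts_src = (T_src_tgt @ pts_h)[:, :3]                            # points in source frame
    proj = K @ pts_src
    xy = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                  # source pixel coordinates
    grid = torch.stack([2 * xy[:, 0] / (W - 1) - 1,                 # normalize for grid_sample
                        2 * xy[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    recon = F.grid_sample(img_src, grid, align_corners=True)        # reconstructed target view
    return (recon - img_tgt).abs().mean()                           # photometric (L1) loss
```

In practice such a loss is typically paired with the regularizers mentioned above (e.g., smoothness terms informed by the other modalities), since the reconstruction objective alone is ill-posed.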
Abstract
Probabilistic inference is a compelling framework for capturing our belief about an unknown given observations. Central in this paradigm are probabilistic models and approximate inference methods. The former models one's prior belief and encodes the data, while the latter produces posterior distributions based on the former. In the era of large-scale neural networks and foundation models, leveraging them in probabilistic modeling or improving them using probabilistic inference is challenging due to their sheer size. In this talk, I will discuss recent works in (i) developing efficient probabilistic models with and for large foundation models, (ii) leveraging the resulting powerful, calibrated beliefs to improve decision-making and planning, and (iii) applying the resulting probabilistic decision-making/planning systems to improve scientific discovery and the neural networks themselves.
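As one concrete (assumed) instance of an efficient probabilistic model built around a large model, the sketch below places a closed-form Bayesian linear last layer on frozen foundation-model features; the Gaussian likelihood/prior and all names are illustrative choices, not the speaker's method.

```python
import numpy as np

def bayesian_last_layer(phi, y, noise_var=0.1, prior_var=1.0):
    """Closed-form posterior over last-layer weights (Gaussian likelihood and prior).

    phi: (N, D) frozen features from a large model, y: (N,) regression targets.
    """
    D = phi.shape[1]
    A = phi.T @ phi / noise_var + np.eye(D) / prior_var   # posterior precision
    cov = np.linalg.inv(A)                                # posterior covariance
    mean = cov @ phi.T @ y / noise_var                    # posterior mean (MAP weights)
    return mean, cov

def predict(phi_new, mean, cov, noise_var=0.1):
    """Predictive mean and variance for new inputs (epistemic + aleatoric)."""
    mu = phi_new @ mean
    var = np.einsum("nd,dk,nk->n", phi_new, cov, phi_new) + noise_var
    return mu, var

# Toy usage: 1000 cached feature vectors of dimension 512
rng = np.random.default_rng(0)
phi = rng.normal(size=(1000, 512))
y = phi[:, 0] + 0.1 * rng.normal(size=1000)
mean, cov = bayesian_last_layer(phi, y)
mu, var = predict(phi[:5], mean, cov)
```

The appeal of this kind of recipe is that the large model is never touched: calibrated predictive variances come from a small, closed-form posterior over cached features, which is what makes the beliefs cheap enough to plug into downstream decision-making.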
Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego, and a Visiting Professor at NVIDIA Research. He received his Ph.D. in Robotics from Carnegie Mellon University and completed his postdoctoral training at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. His specific interest lies in learning visual representations from videos and physical robotic interaction data. These comprehensive representations are utilized to facilitate the learning of human-like robot skills, with the goal of enabling robots to interact effectively with a wide range of objects and environments in the real physical world. He is the recipient of the J. K. Aggarwal Prize, NSF CAREER Award, Intel Rising Star Faculty Award, and Research Awards from Sony, Amazon, Adobe, and Cisco.
Abstract
Having a humanoid robot operate like a human has been a long-standing goal in robotics. The humanoid robot provides a general-purpose platform for conducting the diverse tasks we perform in our daily lives. In this talk, we study learning-based approaches for both the mobility and manipulation skills of humanoid robots, with the goal of generalizing to diverse tasks, objects, and scenes. I will discuss how to perform whole-body control on humanoids with rich, diverse, and expressive motions. I will also share some lessons we learned from developing teleoperation systems to operate humanoid robots and collect training data. With the collected data, we aim to build a robot foundation model using a novel RNN architecture with Test-Time Training (TTT).
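For readers unfamiliar with Test-Time Training, the toy layer below illustrates the core idea: the recurrent hidden state is itself the weight matrix of a small inner model, updated at every step by a gradient step on a self-supervised reconstruction loss. The inner objective and all names are assumptions for illustration and do not reproduce the speaker's architecture.

```python
import torch

class TTTLayer(torch.nn.Module):
    """Toy recurrent layer whose hidden state is the weights of an inner model."""

    def __init__(self, dim, inner_lr=0.1):
        super().__init__()
        self.inner_lr = inner_lr
        self.proj_in = torch.nn.Linear(dim, dim)   # produces a "corrupted" view of each token
        self.proj_out = torch.nn.Linear(dim, dim)  # readout of the inner model's prediction

    def forward(self, x):                          # x: (B, T, dim)
        B, T, D = x.shape
        W = torch.zeros(B, D, D, device=x.device)  # hidden state = inner model weights
        outputs = []
        for t in range(T):
            xt = x[:, t]                                            # (B, D)
            corrupted = self.proj_in(xt)
            # Inner self-supervised loss: reconstruct xt from its corrupted view.
            pred = torch.bmm(corrupted.unsqueeze(1), W).squeeze(1)
            err = pred - xt
            # One manual gradient step on 0.5 * ||err||^2 w.r.t. W (grad = outer(corrupted, err)).
            grad = torch.bmm(corrupted.unsqueeze(2), err.unsqueeze(1))
            W = W - self.inner_lr * grad
            # Output: query the freshly updated inner model with the current token.
            out = torch.bmm(xt.unsqueeze(1), W).squeeze(1)
            outputs.append(self.proj_out(out))
        return torch.stack(outputs, dim=1)         # (B, T, dim)
```

Because the hidden state is updated by learning rather than a fixed recurrence, the layer keeps adapting to the incoming stream at deployment time, which is the property that makes such architectures attractive for long-horizon robot data.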
Silvia is currently a postdoctoral researcher at MIT and an incoming faculty member at Columbia University, working on Computer Graphics and Geometry Processing. She is a Vanier Doctoral Scholar, an Adobe Research Fellow, and the winner of the 2021 University of Toronto Arts & Science Dean's Doctoral Excellence Scholarship. She has interned twice at Adobe Research and twice at the Fields Institute of Mathematics. She is also a founder and organizer of the Toronto Geometry Colloquium and a member of WiGRAPH.
Abstract
Computer Graphics research has long been dominated by the interests of large film, television and social media companies, forcing other, more safety-critical applications (e.g., medicine, engineering, security) to repurpose Graphics algorithms originally designed for entertainment. In this talk, I will advocate for a perspective shift in our field that allows us to design algorithms directly for these safety-critical application realms. I will show that this begins by reinterpreting traditional Graphics tasks (e.g., 3D modeling and reconstruction) from a statistical lens and quantifying the uncertainty in our algorithmic outputs, as exemplified by the research I have conducted for the past five years. I will end by mentioning several ongoing and future research directions that carry this statistical lens to entirely new problems in Graphics and Vision and into specific applications.
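As a toy illustration of this statistical lens, the sketch below treats a reconstruction algorithm's output as an estimator and quantifies its uncertainty by bootstrap resampling; the plane-fitting "reconstruction" and all names are illustrative assumptions, not the speaker's method.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane z = a*x + b*y + c fit to an (N, 3) point set."""
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs                                   # (a, b, c)

def bootstrap_uncertainty(points, n_boot=200, seed=0):
    """Bootstrap the fit to estimate the spread of the reconstructed plane."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(points), size=len(points))
        fits.append(fit_plane(points[idx]))
    fits = np.array(fits)
    return fits.mean(axis=0), fits.std(axis=0)      # point estimate + uncertainty

# Toy scan: noisy samples of the plane z = 0.5x - 0.2y + 1
rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, size=(500, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 1 + 0.05 * rng.normal(size=500)
pts = np.column_stack([xy, z])
mean_fit, std_fit = bootstrap_uncertainty(pts)
```

For safety-critical uses, the point is that the reconstruction is reported together with error bars, so downstream decisions (in medicine, engineering, or security) can account for how much the output should be trusted.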