Winning Topic: Topic 1: Information Extraction Across ASAPS Streams and Data Sources Across 1 or More Data Categories - $30,000 prize
Team Summary: Vidrovr is a U.S. company based in New York, NY. It was founded by two PhD students in the Digital Video and Multimedia Lab at Columbia University, Joseph Ellis and Daniel Morozoff, advised by Prof. Shih-Fu Chang. The team is composed of researchers from notable academic institutions as well as engineers from some of the largest tech companies in the world. Since its founding, Vidrovr has won international competitions, received grants from the NSF and DARPA, raised a financing round led by SamsungNext, and has had the pleasure of working with Fortune 500 companies and large public organizations alike. The Vidrovr team seeks to push the envelope of what is possible with video understanding and is looking for people to join it in that pursuit.
Dan Morozoff is a member of a winning Contest 1 team. He joined the ASAPS team via Zoom to answer some questions about his team’s winning solution and give some insight about the ASAPS Challenge.
To begin with, can you give us an idea of your approach to Contest 1?
Dan: Let me tell you about how we approached the problem of video understanding, and the crux of how we view ourselves as different, both in the approaches that we take in a research capacity and in a business capacity – effectively, the way we view videos. A video is a combination of different modality types that includes not only visual content, as in pixels, but also other types of information that surround or are embedded inside the video. This could be textual information, acoustic information, or other kinds of symbolic information that can be encoded through the temporal relationships present within a video. For example, actions or activities are related to, or constrained by, the objects that exist within that video itself. If you are playing tennis, you are probably holding a tennis racket and you will probably see a tennis ball – similarly if you are playing baseball, getting in the car, et cetera. This approach has become more and more popular, both in robotics and in computer vision as of late, and it has been a focus of the lab and our group over the last several years.
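One simple way to picture this object–action constraint idea is to re-score action hypotheses using object co-occurrence priors. The sketch below is purely illustrative and is not Vidrovr's actual system; the action labels, object names, and prior values are all hypothetical:

```python
# Hypothetical sketch: re-scoring action hypotheses with object
# co-occurrence priors. All labels and prior values are invented
# for illustration; this is not Vidrovr's actual model.

# Assumed prior probability of seeing each object given an action.
OBJECT_GIVEN_ACTION = {
    "playing_tennis": {"tennis_racket": 0.9, "tennis_ball": 0.8, "car": 0.01},
    "playing_baseball": {"baseball_bat": 0.9, "baseball": 0.8, "car": 0.02},
    "getting_in_car": {"car": 0.95, "tennis_racket": 0.02},
}

def rescore_actions(action_scores, detected_objects):
    """Boost or suppress each action's score using the priors of the
    objects actually detected in the frame, then renormalize."""
    rescored = {}
    for action, score in action_scores.items():
        priors = OBJECT_GIVEN_ACTION.get(action, {})
        for obj in detected_objects:
            # Small floor for objects the table considers unlikely
            # under this action, so no score collapses to zero.
            score *= priors.get(obj, 0.05)
        rescored[action] = score
    total = sum(rescored.values()) or 1.0
    return {a: s / total for a, s in rescored.items()}

# A tennis racket and ball in view should push the mass toward tennis.
scores = {"playing_tennis": 0.4, "playing_baseball": 0.35, "getting_in_car": 0.25}
print(rescore_actions(scores, ["tennis_racket", "tennis_ball"]))
```

Real systems learn these co-occurrence statistics from data rather than hand-coding them, but the intuition is the same: detected objects constrain which activities are plausible.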
We were really impressed with your proposal. I think it beautifully brings together some of the really challenging aspects in this space, including multimodality, a time perspective on extraction, and an understanding of the kinds of challenges you would face in public safety, including needing to leverage compression. Could you talk a little bit about your ideas on leveraging compression straight into analytics?
Dan: I think one of the biggest challenges when you are dealing with an analytic system that is deployed across multiple camera streams is effectively: how do you build systems that are actionable in semi-real time or near real time at the bandwidth scales that you are dealing with, especially when it comes to video? There have been several studies looking at large-bandwidth video processing streams, trying to get at how we change both the modeling layer and, as you said, the compression. Can we boost performance by getting more data through the pipe without running the inference process on it? There are several veins of research effort going into this space. Our work has focused primarily on the modeling side, reducing the footprint of these models with everything from standard distillation techniques to other model compression techniques.
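To make the distillation direction Dan mentions concrete, here is a minimal, stdlib-only sketch of the classic temperature-scaled distillation objective, where a small "student" model is trained to match a large "teacher" model's softened output distribution. The logits and hyperparameters below are assumptions for illustration, not values from Vidrovr's systems:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Blend cross-entropy on the hard label with KL divergence to the
    teacher's softened distribution (the standard distillation objective)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so its gradients stay comparable in magnitude to the hard term.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * hard + (1 - alpha) * (temperature ** 2) * kl

# Toy example: teacher is confident in class 0; student partially agrees.
loss = distillation_loss([2.0, 0.5, 0.1], [4.0, 1.0, 0.2], true_label=0)
print(round(loss, 4))
```

In practice this loss is minimized over a training set with an autodiff framework; the point of the sketch is only the shape of the objective, where the temperature softens the teacher's distribution so the student can learn from the relative probabilities of wrong classes, not just the argmax.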