Winning Topic: Topic 2: Information Fusion Across Extracted Data and Generation of Live/Dynamic Information Representation Across 3 or More Data Categories - $30,000 prize
Team Summary: We are a team of scholar-scientists who develop solutions to socially relevant AI problems and bring them to use in various applications. We are all faculty members at research universities in the United States, comprising Jason Corso, Chenliang Xu, and Tom Yan. We share research expertise in computer vision, specifically video understanding with multimodal inputs, as well as a penchant for developing practically useful systems through optimized software solutions.
Jason Corso is a member of a winning Contest 1 team. He joined the ASAPS team via Zoom to answer some questions about his team’s winning solution and give some insight about the ASAPS Challenge.
To begin with, can you give us an idea of what your goals were with respect to Contest 1?
Jason: We are driven to bring solutions from our expertise in the world of computer vision out of the lab and into greater public use. We have a few common beliefs across the team that are really the foundation of this proposal. Key ideas like weak supervision, multimodal models, and the ability to self-learn or self-attend to content across modalities: these are the ideas that have driven us to work together in the past and to keep working together, and they are driving our work for the ASAPS Challenge.
What do you think are the special considerations that we need to think about with real-time multimodal data and how might we address some of the challenges there?
Jason: When you are doing stream analysis, it is critical to think about when you need to provide the answer. That is really a different way of posing the problem: what is real time? In most research we see in computer vision, that question is not asked very often. We live in a world of video classification or tracking. In tracking, you are saying here is a box on a person, and you are going to track them through every frame of the video, forever. And sometimes you do not even care how long it takes, because you just want to do that. In classification of events, you have the whole video, and you want to output the label. I am proud of my colleagues and the community, and I am glad to be a part of this space right now, because there has been a significant shift toward localizing events in video.
What might be one or two high-level questions that you would want feedback from public safety as you are thinking through all the challenges from the concept side to operationalizing it? What are the key inputs or key things you are excited to learn from public safety as you move forward in the challenge?
Jason: What really is the most important answer we can give to the operator sitting behind the scenes, not necessarily the first responder? Do you want to see the data, as in, here is the frame we think is the most valuable for you to look at and then send out to the first responders, or is it a textual description that is literally texted out to them? That is one real question. Another is about resolution. We have ideas, from low-level to high-level, about the concepts we can detect in the scene, but we do not necessarily have a good sense of what resolution you want the data, or the output of the system, at. Is it fine-grained information or very high-level information? That also relates to first responder needs: what is the most important piece of information you need when you first arrive at a call? Those are the questions I think we would ask first. They would probably lead to long discussions of other questions and needs and so on.