The First Workshop on
Short-Form Video Understanding:
The Next Frontier in Video Intelligence
ICCV 2025
Exploring the challenges and opportunities in understanding short-form videos
Room 313B, Hawai'i Convention Center
October 19, 2025, 1 PM - 5 PM HST, Honolulu, Hawai'i, USA
About the Workshop
Our workshop welcomes contributions that explore the challenges of developing benchmark datasets for short-form videos, report findings from solving existing video understanding problems, and investigate novel research problems and application areas.
What are Short-Form Videos?
Short-form videos (SVs) are videos that are typically 5 minutes or less in length. They are often characterized by editing and composition techniques geared towards heightening viewer engagement. These can include, but are not limited to, looped narratives (the end of the video matches seamlessly with the beginning), two or more juxtaposed videos (e.g., one video in the top half and another in the bottom half), and semi-independent submodalities (e.g., disseminating audio information via an ASMR video).
Web traffic data consistently show that social media platforms are among the most-visited websites on the Internet, with major platforms such as YouTube, Facebook, Instagram, and X (formerly Twitter) ranking in or around the top 10. The pan-global adoption of mobile devices has been a major contributing factor to this trend, which, in turn, has led to the emergence of a new class of videos, SVs, tailored for rapid consumption, primarily on small screens.
Why Focus on Short-Form Videos?
SVs are becoming increasingly embedded in our daily lives, whether as a source of entertainment, information, advertisement, or even social communication. Recent estimates show that 73% of consumers prefer SVs when seeking information about products and services, and more than 50% of marketers have turned to SVs to reach their customers, with over 90% set to increase or maintain their investments. Creative artists view SVs as a distinct form of art and media, and, given the novelty and popularity of SVs in the modern world, they are increasingly designing content specifically for and with SVs. By current estimates, SVs account for 90% of internet traffic and can generate as much as 2.5 times the engagement of longer videos. These trends will only lead to further proliferation of SVs in the wild and greater diversity in their content.
To this end, our workshop aims to bring together ongoing efforts on SV understanding, highlight the specific challenges of handling SVs, scope out the research landscape, and lay the foundation for future research and development in the space of SVs.
Topics of Interest
SV Data Collection and Benchmarking
Exploring challenges related to collecting and benchmarking SV data that ideally highlight the fundamental differences between SVs and other forms of video, for example:
- Assembling high-quality, ethically sourced SV data
- Capturing the diversity in the themes, contents, editing, and composition styles of SVs
- Developing benchmarks catering to the different editing and composition styles of SVs
SV Analysis and Understanding
Establishing performance baselines on SVs by building on top of existing techniques for solving video understanding problems, including, but not limited to:
- Object and scene segmentation
- Action recognition
- Multi-concept recognition
- Human-object and human-scene interactions
- Content captioning
- Content forecasting
- Language models for SVs
- Ethics and safety of using SVs, including social and cultural issues
New Research Frontiers in SV
Rigorously investigating research problems specific to SVs and their usage in common media. Sample problems include, but are not limited to:
- Detection, such as whether an SV contains looped narratives, juxtaposed videos, or semi-independent modalities
- Generation, especially focusing on topic- and viewership-specific engagement
- Provenance, including recognizing the sources of composite and synthesized SVs
- Evaluation of the quality of SVs, particularly at scale, on factors that drive viewership and engagement
- Exploring how SVs can impact social, cultural, and professional communications, interactions, and workflows
Invited Speakers
Yin Li
Associate Professor, Biostatistics & Medical Informatics, Affiliate Faculty in Computer Sciences
University of Wisconsin-Madison, USA
Yin Li is a tenured Associate Professor in Biostatistics and Medical Informatics and affiliate faculty in Computer Sciences at the University of Wisconsin-Madison. He received his Ph.D. in Computer Science from Georgia Tech and was a postdoctoral fellow at Carnegie Mellon University. His research focuses on computer vision and its applications in digital health, where his group develops methods to analyze human activities for improving health outcomes. He is a member of the NSF Center CHORUS and the U.S. Army AI Innovations Institute, and recipient of the NSF CAREER and Google Research Scholar Awards. He has served as area chair for top vision and learning conferences and co-authored several award-winning papers. His work has been featured in MIT Tech Review, WIRED UK, New Scientist, BBC, and Forbes.
Mira Dontcheva
Senior Principal Scientist
Adobe Research
Mira is a Senior Principal Scientist at Adobe, leading research in Human-Computer Interaction (HCI) at the intersection of video creation and AI-driven interaction. She leads the STORIE Lab, which focuses on storytelling and interactive experiences with generative AI. Mira and her team build new tools that make video and audio creation easier, more fun, and more accessible to a wider audience. She is passionate about multimodal interaction, AI agents, and experiences at the intersection of the physical and digital worlds.
Adriana Kovashka
Associate Professor and Department Chair, Department of Computer Science
University of Pittsburgh, USA
Adriana Kovashka is an Associate Professor and Department Chair in Computer Science at the University of Pittsburgh. Her primary research area is computer vision, with overlap in machine learning and natural language processing. One thrust of her work examines distribution shifts that affect models, arising from geographic, demographic, and stylistic factors. Another thrust focuses on computational models of visual persuasion, including automated understanding and generation of visual advertisements. A third focus is on advancing computer vision models for understanding and supporting human learning. She has authored over thirty publications in top-tier conferences and journals (CVPR, ICCV, ECCV, NeurIPS, AAAI, ACL, NAACL, EMNLP, TPAMI, IJCV) and over twenty second-tier conference publications (e.g., BMVC, ACCV, WACV). Her research is funded by the National Science Foundation, Google, Amazon, and Adobe. She received the NSF CAREER award in 2021. She is serving as Program Co-Chair for CVPR 2026.
Accepted Papers
Full Papers
Workshop Schedule
All times in HST
Invited Talk: Yin Li
The explosive growth of visual content, especially short-form videos, together with rapid advances in computer vision, has fueled the large-scale deployment of vision models across edge devices and cloud platforms. However, conventional vision models operate with fixed compute and memory footprints for each input image or video, making them increasingly inadequate for dynamic runtime environments where edge resources vary (e.g., due to device contention) and cloud requests fluctuate (e.g., due to bursts in demand), even as latency requirements remain stringent. In this talk, I will present our recent work on adaptive inference for multimodal large language models in visual reasoning, where the execution of vision models can be dynamically reconfigured based on input content and available compute budgets. I will highlight how adaptive mechanisms can selectively reduce input representations and skip portions of network computation to explicitly trade accuracy for efficiency. Our overarching goal is to make vision models more efficient and robust when deployed beyond controlled settings, paving the way for intelligent systems that can gracefully adapt to the real world.
Invited Talk: Mira Dontcheva
Creating effective short-form videos that share your story is no easy task. While every day brings new technologies for video, from generative video and audio models to multimodal large language models, using these technologies to invent new experiences and augment existing workflows remains challenging. We have been exploring how foundational language and vision models can make editing of short-form narrative videos easier. In this talk, I will describe how these powerful new technologies can be used in different stages of the video editing process, from the very beginning when one is just getting started, to the middle when one is iterating, to the end when one wants to share many different versions targeting different audiences.
Invited Talk: Adriana Kovashka
A 30-second ad in the 2011 Super Bowl cost $3 million. One particular ad, “The Force” by Volkswagen, earned millions of views in a few short days before it even aired on TV. In our work, we analyze what makes advertisements so powerful and important. In 2017, we proposed the first advertisement dataset, featuring over 64,000 image ads and 3,000 video ads. In 2018, we studied how video ads tell stories through narrative arcs culminating in a climax where a product enters the story to save the day. Fast-forward to 2025: we analyze what modern foundation models can achieve in terms of understanding and generating creative advertisements. We discover that LLMs offer some advantages over MLLMs in terms of understanding ads, and can provide context to make the generation of ads (with text-to-image models) more convincing. We also bridge an evaluation gap, namely in measuring how well ads follow their intended message, how persuasive and creative they are, and the nature of their inherent biases.
Workshop Committee
Organizers
Reviewers
| Akhil Iyer | University of Texas at Austin |
| Anindita Ghosh | German Research Center for Artificial Intelligence (DFKI) |
| Aritra Bhowmik | University of Amsterdam |
| Divya Kothandaraman | Dolby Laboratories |
| Jiahang Li | Tianjin Normal University |
| Nina Shvetsova | Eberhard-Karls-Universität Tübingen |
| Paiheng Xu | University of Maryland, College Park |
| Pooja Guhan | Dolby Laboratories |
| Saumya Gupta | State University of New York at Stony Brook |
| Xin Li | University of Science and Technology of China |
| Xinting Hu | Saarland Informatics Campus, Max-Planck Institute |
| Yongliang Wu | Southeast University |
| Zhaoyang Xia | Rutgers University |
| Zhehao Zhang | Ohio State University, Columbus |
| Zhi Zeng | Xi'an Jiaotong University |
Paper Submission
Important Dates
Full Papers
| Submissions Open | May 12, 2025 |
| Submissions Due | June 27, 2025, 11:59 PM AoE |
| Notification to Authors | July 10, 2025 |
| Camera-Ready Due | August 18, 2025, 11:59 PM Pacific Time |
Short Papers and Extended Abstracts
| Submissions Open | July 28, 2025 |
| Submissions Due | August 20, 2025, 11:59 PM AoE |
| Notification to Authors | September 18, 2025 |
| Camera-Ready Due | September 30, 2025 |
Submission Guidelines
We welcome submissions of extended abstracts (1-2 pages), short papers (2-4 pages), and full papers (5-8 pages). Full papers will be published in the ICCV 2025 proceedings; extended abstracts and short papers will be archived as part of the workshop.
- Full papers are intended to demonstrate original research ideas and their impacts (theoretical or empirical), original research findings and analysis, or industrial applications across research domains. Full papers should provide comprehensive background, clear motivation, rigorous methodology, and thorough experimental validation to demonstrate significant contributions to the field.
- Short papers are intended for reporting promising early-stage research, novel ideas, or results that may not yet be fully developed for a full paper. Short papers should provide sufficient background, motivation, and preliminary results to demonstrate the potential impact and value to the community.
- Extended abstracts are intended for sharing innovative concepts, position statements, or works-in-progress that would benefit from community feedback. They are ideal for presenting emerging ideas, pilot studies, or visionary perspectives that spark discussion and inspire future research directions.
All submissions should follow the ICCV 2025 author guidelines.
Submission Portal: OpenReview
Review Guidelines
Our review process follows that of ICCV 2025, and reviewers should adhere to the same ICCV 2025 reviewer guidelines. We thank all reviewers for maintaining the technical standards of ICCV 2025!