The First Workshop on
Short-Form Video Understanding:
The Next Frontier in Video Intelligence
ICCV 2025
Exploring the challenges and opportunities in understanding short-form videos
Room 313B, Hawai'i Convention Center
October 19, 2025, 1 PM - 5 PM HST, Honolulu, Hawai'i, USA
About the Workshop
Our workshop welcomes contributions that explore the challenges of developing benchmark datasets for short-form videos, report findings from solving existing video understanding problems on them, and investigate novel research problems and application areas.
What are Short-Form Videos?
Short-form videos (SVs) are videos that are typically 5 minutes or less in length. They are often characterized by editing and composition techniques geared towards heightening viewer engagement. These techniques include, but are not limited to, looped narratives (the end of the video matches seamlessly with the beginning), two or more juxtaposed videos (e.g., one video in the top half and another in the bottom half), and semi-independent submodalities (e.g., disseminating audio information via an ASMR video).
Web-traffic data consistently show that social media platforms are among the most-visited websites on the Internet, with major platforms such as YouTube, Facebook, Instagram, and X (formerly Twitter) ranking in or around the top 10. The global adoption of mobile devices has been a major contributing factor to this trend, which, in turn, has led to the emergence of a new class of videos – SVs – tailored for rapid consumption, primarily on small screens.
Why Focus on Short-Form Videos?
SVs are becoming increasingly embedded in our daily lives, whether as a source of entertainment, information, advertising, or social communication. Recent estimates show that 73% of consumers prefer SVs for learning about products and services, and more than 50% of marketers have turned to SVs to reach their customers, with over 90% set to increase or maintain their investments. Creative artists view SVs as a distinct form of art and media, and, given their novelty and popularity, increasingly design content specifically for and with SVs. By current estimates, SVs account for 90% of internet traffic and can generate as much as 2.5 times the engagement of longer videos. This drives further proliferation of SVs in the wild and greater diversity in their content.
To this end, our workshop aims to bring together ongoing efforts on SV understanding, highlight the specific challenges of handling SVs, scope out the research landscape, and lay the foundation for future research and development in the space of SVs.
Topics of Interest
SV Data Collection and Benchmarking
Exploring challenges in collecting and benchmarking SV data that ideally highlight the fundamental differences between SVs and other forms of video, for example:
- Assembling high-quality, ethically sourced SV data
- Capturing the diversity in the themes, contents, editing, and composition styles of SVs
- Developing benchmarks catering to the different editing and composition styles of SVs
SV Analysis and Understanding
Establishing performance baselines on SVs by building on top of existing techniques for solving video understanding problems, including, but not limited to:
- Object and scene segmentation
- Action recognition
- Multi-concept recognition
- Human-object and human-scene interactions
- Content captioning
- Content forecasting
- Language models for SVs
- Ethics and safety of using SVs, including social and cultural issues
New Research Frontiers in SV
Rigorously investigating research problems specific to SVs and their usage in common media. Sample problems include, but are not limited to:
- Detection, such as whether an SV contains looped narratives, juxtaposed videos, or semi-independent submodalities
- Generation, especially focusing on topic- and viewership-specific engagement
- Provenance, including recognizing the sources of composite and synthesized SVs
- Evaluation of the quality of SVs, particularly at scale, on factors that drive viewership and engagement
- Exploring how SVs can impact social, cultural, and professional communications, interactions, and workflows
Invited Speakers

Yin Li
Associate Professor, Biostatistics & Medical Informatics, Affiliate Faculty in Computer Sciences
University of Wisconsin-Madison, USA

Adriana Kovashka
Associate Professor and Department Chair, Department of Computer Science
University of Pittsburgh, USA
Accepted Papers
Full Papers
Workshop Schedule
All times in HST
Adaptive Visual Inference
The explosive growth of visual content, especially short-form videos, together with rapid advances in computer vision has fueled the large-scale deployment of vision models across edge devices and cloud platforms. However, conventional vision models operate with fixed compute and memory footprints for each input image or video, making them increasingly inadequate for dynamic runtime environments where edge resources vary (e.g., due to device contention) and cloud requests fluctuate (e.g., due to bursts in demand), even as latency requirements remain stringent. In this talk, I will present our recent work on adaptive inference for multimodal large language models in visual reasoning, where the execution of vision models can be dynamically reconfigured based on input content and available compute budgets. I will highlight how adaptive mechanisms can selectively reduce input representations and skip portions of network computation to explicitly trade accuracy for efficiency. Our overarching goal is to make vision models more efficient and robust when deployed beyond controlled settings, paving the way for intelligent systems that can gracefully adapt to the real world.
AI-Driven Tools for Creating Short-Form Narrative Videos
Creating effective short-form videos that share your story is no easy task. While every day brings new technologies for video, from generative video and audio models to multimodal large language models, using these technologies to invent new experiences and augment existing workflows remains challenging. We have been exploring how foundational language and vision models can make editing of short-form narrative videos easier. In this talk, I will describe how these powerful new technologies can be used in different stages of the video editing process: from the very beginning, when one is just getting started, to the middle, when one is iterating, to the end, when one wants to share many different versions targeting different audiences.
Understanding and Generating Narrative Arcs in Visual Advertisements
A 30-second ad in the 2011 Super Bowl cost $3 million. One particular ad, “The Force” by Volkswagen, earned millions of views in a few short days before it even aired on TV. In our work, we analyze what makes advertisements so powerful and important. In 2017, we proposed the first advertisement dataset, featuring over 64,000 image ads and 3,000 video ads. In 2018, we studied how video ads tell stories through narrative arcs culminating in a climax where a product enters the story to save the day. Fast-forward to 2025: we analyze what modern foundation models can achieve in terms of understanding and generating creative advertisements. We discover that LLMs offer some advantages over MLLMs in understanding ads, and can provide context that makes ad generation (with text-to-image models) more convincing. We also bridge an evaluation gap, namely in measuring how well ads follow their intended message, how persuasive and creative they are, and the nature of their inherent biases.
Organizers
Paper Submission
Important Dates
Full Papers
Submissions Open | May 12, 2025
Submissions Due | June 27, 2025, 11:59 PM AoE
Notification to Authors | July 10, 2025
Camera-Ready Due | August 18, 2025, 11:59 PM Pacific Time |
Short Papers and Extended Abstracts
Submissions Open | July 28, 2025
Submissions Due | August 20, 2025, 11:59 PM AoE
Notification to Authors | September 18, 2025
Camera-Ready Due | September 30, 2025 |
Submission Guidelines
We welcome submissions of extended abstracts (1-2 pages), short papers (2-4 pages), and full papers (5-8 pages). Full papers will be published in the ICCV 2025 proceedings; extended abstracts and short papers will be archived as part of the workshop.
- Full papers are intended to demonstrate original research ideas and their impacts (theoretical or empirical), original research findings and analysis, or industrial applications of various research domains. Full papers should provide comprehensive background, clear motivation, rigorous methodology, and thorough experimental validation to demonstrate significant contributions to the field.
- Short papers are intended for reporting promising early-stage research, novel ideas, or results that may not yet be fully developed for a full paper. Short papers should provide sufficient background, motivation, and preliminary results to demonstrate the potential impact and value to the community.
- Extended abstracts are intended for sharing innovative concepts, position statements, or works-in-progress that would benefit from community feedback. They are ideal for presenting emerging ideas, pilot studies, or visionary perspectives that spark discussion and inspire future research directions.
All submissions should follow the ICCV 2025 author guidelines.
Submission Portal: OpenReview
Review Guidelines
Our review process follows that of ICCV 2025, and reviewers should adhere to the same ICCV 2025 reviewer guidelines. We thank all reviewers for maintaining the technical standards of ICCV 2025!