The First Workshop on

Short-Form Video Understanding:

The Next Frontier in Video Intelligence

ICCV 2025

Exploring the challenges and opportunities in understanding short-form videos

Room 313B, Hawai'i Convention Center

October 19, 2025, 1 PM - 5 PM HST, Honolulu, Hawai'i, USA

About the Workshop

Our workshop welcomes contributions exploring the challenges of developing benchmark datasets for short-form videos, presenting findings from solving existing video understanding problems, and investigating novel research problems and application areas.


What are Short-Form Videos?

Short-form videos (SVs) are videos that are typically 5 minutes or less in length. They are often characterized by editing and composition techniques geared towards heightening viewer engagement. These can include, but are not limited to, looped narratives (the end of the video matches seamlessly with the beginning), two or more juxtaposed videos (e.g., one video in the top half and another in the bottom half), and semi-independent submodalities (e.g., disseminating audio information via an ASMR video).
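
As a concrete illustration of the looped-narrative characteristic described above, here is a minimal, hypothetical sketch (not part of the workshop materials; it assumes OpenCV and NumPy are installed and that the video codec supports frame-accurate seeking) that flags a candidate loop by checking whether a video's first and last frames are near-identical:

    # Minimal sketch: flag a possible looped narrative by comparing the first
    # and last frames of a video. Real loop detection would need temporal and
    # audio context; this only checks boundary-frame similarity.
    import cv2
    import numpy as np

    def might_be_looped(path: str, threshold: float = 0.95) -> bool:
        cap = cv2.VideoCapture(path)
        ok, first = cap.read()
        if not ok:
            cap.release()
            return False
        # Seek to the final frame (frame-accurate seeking is codec-dependent).
        n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames - 1, 0))
        ok, last = cap.read()
        cap.release()
        if not ok:
            return False
        # Compare grayscale frames via normalized cross-correlation in [-1, 1].
        a = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY).astype(np.float32).ravel()
        b = cv2.cvtColor(last, cv2.COLOR_BGR2GRAY).astype(np.float32).ravel()
        a -= a.mean()
        b -= b.mean()
        denom = float(np.linalg.norm(a) * np.linalg.norm(b))
        return denom > 0 and float(a @ b) / denom >= threshold

A production detector would compare short windows of frames (and audio) rather than single boundary frames, but this captures the core idea of matching a video's end to its beginning.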


Web traffic data consistently show that social media platforms are among the most-visited websites on the Internet, with major platforms such as YouTube, Facebook, Instagram, and X (formerly Twitter) ranking in or around the top 10. The global adoption of mobile devices has been a major contributing factor to this trend, which, in turn, has led to the emergence of a new class of videos – SVs – tailored for rapid consumption, primarily on small screens.


Why Focus on Short-Form Videos?

SVs are becoming increasingly embedded in our daily lives, whether as a source of entertainment, information, advertisement, or social communication. Recent estimates show that 73% of consumers prefer SVs when seeking information about products and services, and more than 50% of marketers have turned to SVs to reach their customers, with over 90% set to increase or maintain their investments. Creative artists view SVs as a distinct form of art and media and, given the novelty and popularity of SVs in the modern world, are increasingly designing content both for and with SVs. By current estimates, SVs account for 90% of internet traffic and can generate as much as 2.5 times the engagement of longer videos. These trends are driving further proliferation of SVs in the wild and greater diversity in their content.


To this end, our workshop aims to bring together ongoing efforts on SV understanding, highlight the specific challenges of handling SVs, scope out the research landscape, and lay the foundation for future research and development in the space of SVs.

Topics of Interest

SV Data Collection and Benchmarking

Exploring challenges related to collecting and benchmarking SV data that ideally highlight the fundamental differences of SVs from other forms of video, for example:

  • Assembling high-quality, ethically sourced SV data
  • Capturing the diversity in the themes, contents, editing, and composition styles of SVs
  • Developing benchmarks catering to the different editing and composition styles of SVs

SV Analysis and Understanding

Establishing performance baselines on SVs by building on top of existing techniques for solving video understanding problems, including, but not limited to:

  • Object and scene segmentation
  • Action recognition
  • Multi-concept recognition
  • Human-object and human-scene interactions
  • Content captioning
  • Content forecasting
  • Language models for SVs
  • Ethics and safety of using SVs, including social and cultural issues

New Research Frontiers in SV

Rigorously investigating research problems specific to SVs and their usage in common media. Sample problems include, but are not limited to:

  • Detection, such as whether an SV contains looped narratives, juxtaposed videos, or semi-independent submodalities
  • Generation, especially focusing on topic- and viewership-specific engagement
  • Provenance, including recognizing the sources of composite and synthesized SVs
  • Evaluation of the quality of SVs, particularly at scale, on factors that drive viewership and engagement
  • Exploring how SVs can impact social, cultural, and professional communications, interactions, and workflows

Invited Speakers

Yin Li

Associate Professor, Biostatistics & Medical Informatics, Affiliate Faculty in Computer Sciences

University of Wisconsin-Madison, USA

Mira Dontcheva

Senior Principal Scientist

Adobe Research

Adriana Kovashka

Associate Professor and Department Chair, Department of Computer Science

University of Pittsburgh, USA

Accepted Papers

Full Papers

Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation

Sahid Hossain Mustakim, S M Jishanul Islam, Ummay Maria Muna, Montasir Chowdhury, Mohammed Jawwadul Islam, Sadia Ahmmed, Tashfia Sikder, Syed Tasdid Azam Dhrubo, Swakkhar Shatabda

Hashtag2Action: Data Engineering and Self-Supervised Pre-Training for Action Recognition in Short-Form Videos

Yang Qian, Ali Kargarandehkordi, Yinan Sun, Parnian Azizian, Onur Cezmi Mutlu, Saimourya Surabhi, Zain Jabbar, Dennis Wall, Peter Washington, Huaijin Chen

End-to-End Action Segmentation Transformer

Tieqiao Wang, Sinisa Todorovic

Difformer for Action Segmentation

Nicolas Aziere, Tieqiao Wang, Sinisa Todorovic

Short Papers

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles

Metrics for Fine-Grained Evaluation of Inline Audio Descriptions

Subhashini Venugopalan, Amy Pavel, Taylor Roper, Jimmy Tobin, Emily Wilson, Yingwen Tan, Alicia Martin, Anton Kast

ViMP: Visual Motion Prompting

Baek Seong-Eun, Nam Hyeon-Woo, Lee Jung-Mok, Tae-Hyun Oh

FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos

Siddhant Sukhani, Yash Bhardwaj, Riya Bhadani, Veer Kejriwal, Michael Galarnyk, Sudheer Chava

Context-Aware Prompt-Guided Learning-Free VLM-based Framework for Short Video Understanding in Traffic Accident Detection

Igor Lashkov, Guohui Zhang, Shanglian Zhou, Nathan Li

Workshop Schedule

All times in HST

1:00 - 1:15 PM
Opening Remarks
1:15 - 1:45 PM
Invited Talk 1
Yin Li

Adaptive Visual Inference

The explosive growth of visual content, especially short-form videos, together with rapid advances in computer vision has fueled the large-scale deployment of vision models across edge devices and cloud platforms. However, conventional vision models operate with fixed compute and memory footprints for each input image or video, making them increasingly inadequate for dynamic runtime environments where edge resources vary (e.g., due to device contention) and cloud requests fluctuate (e.g., due to bursts in demand), even as latency requirements remain stringent. In this talk, I will present our recent work on adaptive inference for multimodal large language models in visual reasoning, where the execution of vision models can be dynamically reconfigured based on input content and available compute budgets. I will highlight how adaptive mechanisms can selectively reduce input representations and skip portions of network computation to explicitly trade accuracy for efficiency. Our overarching goal is to make vision models more efficient and robust when deployed beyond controlled settings, paving the way for intelligent systems that can gracefully adapt to the real world.

1:45 - 1:50 PM
Mini-Break
1:50 - 2:10 PM
Oral Session 1
SV Data Collection and Analysis

Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation

Sahid Hossain Mustakim, S M Jishanul Islam, Ummay Maria Muna, Montasir Chowdhury, Mohammed Jawwadul Islam, Sadia Ahmmed, Tashfia Sikder, Syed Tasdid Azam Dhrubo, Swakkhar Shatabda

Hashtag2Action: Data Engineering and Self-Supervised Pre-Training for Action Recognition in Short-Form Videos

Yang Qian, Ali Kargarandehkordi, Yinan Sun, Parnian Azizian, Onur Cezmi Mutlu, Saimourya Surabhi, Zain Jabbar, Dennis Wall, Peter Washington, Huaijin Chen

2:10 - 2:30 PM
Oral Session 2
SV Understanding

End-to-End Action Segmentation Transformer

Tieqiao Wang, Sinisa Todorovic

Difformer for Action Segmentation

Nicolas Aziere, Tieqiao Wang, Sinisa Todorovic

2:30 - 3:00 PM
Invited Talk 2
Mira Dontcheva

AI-Driven Tools for Creating Short-Form Narrative Videos

Creating effective short-form videos that share your story is no easy task. While every day brings new technologies for video, from generative video and audio models to multimodal large language models, using these technologies to invent new experiences and augment existing workflows remains challenging. We have been exploring how foundational language and vision models can make editing of short-form narrative videos easier. In this talk, I will describe how these powerful new technologies can be used in different stages of the video editing process: from the very beginning, when one is just getting started, to the middle, when one is iterating, to the end, when one wants to share many different versions targeting different audiences.

3:00 - 3:30 PM
Coffee Break
3:30 - 4:00 PM
Invited Talk 3
Adriana Kovashka

Understanding and Generating Narrative Arcs in Visual Advertisements

A 30-second ad in the 2011 Super Bowl cost $3 million. One particular ad, “The Force” by Volkswagen, earned millions of views in a few short days before it even aired on TV. In our work, we analyze what makes advertisements so powerful and important. In 2017, we proposed the first advertisement dataset, featuring over 64,000 image ads and 3,000 video ads. In 2018, we studied how video ads tell stories through narrative arcs culminating in a climax where a product enters the story to save the day. Fast-forwarding to 2025, we analyze what modern foundation models can achieve in terms of understanding and generating creative advertisements. We discover that LLMs offer some advantages over MLLMs in terms of understanding ads, and can provide context to make generation of ads (with text-to-image models) more convincing. We also bridge an evaluation gap, namely in measuring how well ads follow their intended message, how persuasive and creative they are, and the nature of their inherent biases.

4:00 - 4:30 PM
Panel Discussion
Paving the Future: Challenges and Opportunities in Short-Form Video Understanding
4:30 - 5:00 PM
Poster Session

Organizers

Uttaran Bhattacharya

Adobe Research

Ishita Dasgupta

Adobe Research

Mehrab Tanjim

Adobe Research

Chen-Yi Lu

Purdue University

Kunjal Panchal

University of Massachusetts, Amherst

Dinesh Manocha

University of Maryland, College Park

Paper Submission

Important Dates

Full Papers

Submissions Open: May 12, 2025
Submissions Due: June 27, 2025, 11:59 PM AoE
Notification to Authors: July 10, 2025
Camera-Ready Due: August 18, 2025, 11:59 PM Pacific Time

Short Papers and Extended Abstracts

Submissions Open: July 28, 2025
Submissions Due: August 20, 2025, 11:59 PM AoE
Notification to Authors: September 18, 2025
Camera-Ready Due: September 30, 2025

Submission Guidelines

We welcome submissions of extended abstracts (1-2 pages), short papers (2-4 pages), and full papers (5-8 pages). Full papers will be published in the ICCV 2025 proceedings, while extended abstracts and short papers will be archived as part of the workshop.

  • Full papers are intended to demonstrate original research ideas and their impacts (theoretical or empirical), original research findings and analysis, or industrial applications of various research domains. Full papers should provide comprehensive background, clear motivation, rigorous methodology, and thorough experimental validation to demonstrate significant contributions to the field.
  • Short papers are intended for reporting promising early-stage research, novel ideas, or results that may not yet be fully developed for a full paper. Short papers should provide sufficient background, motivation, and preliminary results to demonstrate the potential impact and value to the community.
  • Extended abstracts are intended for sharing innovative concepts, position statements, or works-in-progress that would benefit from community feedback. They are ideal for presenting emerging ideas, pilot studies, or visionary perspectives that spark discussion and inspire future research directions.

All submissions should follow the ICCV 2025 author guidelines.


Submission Portal: OpenReview


Review Guidelines

Our review process follows that of ICCV 2025, and reviewers should adhere to the same ICCV 2025 reviewer guidelines. We thank all reviewers for maintaining the technical standards of ICCV 2025!