Temporal action localization with Semi-Supervised Learning

8 minute read


Temporal action localization is an important task in the field of video content analysis, and its goal is to locate the action instances with precise boundaries and flexible intervals in an untrimmed video. Constructing a large enough data set that contains video clips and annotations will require vast human effort. Since there are quite many unlabeled videos on the internet, it is worth trying to combine semi-supervised approaches into temporal action localization task.


With the rapid growth of the number of videos in the Internet world, video content analysis tech-nology has received extensive attention from industry and academia. Temporal action detection isan important task in the field of video content analysis. Its goal is to locate an action instance withprecise boundaries and flexible intervals in an untrimmed video. Similar to the object detectiontask, the temporal action detection task can be divided into two stages: temporal action local-ization and action classification. Although some action recognition algorithms can achieve highaccuracy in action classification, the performance of localization algorithms is still relatively low onmainstream evaluation benchmarks[1][2]. Therefore, many recent works are devoted to improvingthe quality of this task.

The temporal action localization is not only applied to temporal action detection tasks, but alsohas a wide range of applications in many fields, such as video recommendation, video key segmentdetection, and intelligent monitoring. The better-performing temporal action localization algo-rithm will bring many beneficial values: (1) As the cornerstone of the temporal action detectiontask, it will positively promote the performance of it. (2) To help realize more efficient and better-experienced algorithms in practical industrial tasks such as intelligent surveillance, video highlightdetection, and video action detection.

Although there are some good temporal action localization algorithms, they are all supervised,which means they rely on existing fully labelled data sets. On the one hand, it will consume ahuge human effort to get a large labelled data set. The annotators need to watch the entire videoscompletely, which consumes lots of time, energy and money. And the mainstream used data sets isnot large enough to activate the power of the state-of-the-art algorithm. On the other hand, thereis a large number of videos on the Internet, and nearly all of them are unlabeled. So, why not makethe most of these unlabeled video data? That is the motivation why we introduce semi-supervisedlearning approaches into this field.

Related Works

This section will introduce the related work from two aspects: temporal action localization and semi-supervised learning in computer vision.

Temporal action localization

The goal of the temporal action detection task is to detectaction instances in untrimmed videos with temporal boundaries and action categories. Itcan be divided into two stages: temporal action localization and action classification. Thesetwo stages are independent in most detection algorithms[3][4] and are designed together asa single model in some other algorithms[5][6]. For temporal action localization tasks, mostprevious work[3][7][8][9][10] has adopted a top-down approach. The duration and interval ofaction instances are all predefined. The main disadvantage of this is its insufficient bound-ary accuracy and duration flexibility. Some algorithms use a bottom-up approach[4][11][12].For example, the TAG algorithm[4] uses the time series watershed algorithm to generateinstances, but it lacks confidence evaluation. In recent work, the BSN algorithm[11] gener-ates instances by locally locating the temporal boundaries and global evaluation confidence scores, and achieves a significant performance improvement. Subsequently, its later versionBMN algorithm[12] proposes the boundary matching mechanism for confidence evaluation,which can greatly simplify the BSN architecture and significantly improve the efficiency andeffectiveness of reasoning.

Semi-supervised learning in computer vision

The semi-supervised learning in computervision has made great progress in recent years. Nowadays, most of the works mainly focuson how to simplify the previous work in the aspects of model structure and loss function,and how to mix different methods properly. There are three main kinds of methods of semi-supervised learning in computer vision, self-training, consistency regularization and hybridmethods. Self-training means train on labelled data, predict pseudo label on unlabeled dataand then train the model on groundtruth label and pseudo label simultaneously. Dong-HyunLee et al.[13] proposed a very easy and efficient method to produce pseudo label in 2013, andXie et al.[14] proposed a semi-supervised method inspired by knowledge distillation called”Noisy Student”. Consistency regularization means we should keep the prediction of theunlabeled data same even after adding noise. We can use noise such as image augmentation,Gaussian noise and dropout. Laine et al.[15] proposedπ-model. The key idea is to producetwo random image augmentation for labelled data and unlabeled data, and then predictlabels of two images using a model with dropout. Miyato et al.[16] proposed a methodcalled “Virtual Adversarial Training”, which uses methods of adversarial attack to achieveconsistency regularization. The hybrid methods combine the thoughts from previous worksuch as self-training and consistency regularization and some other components to improveperformance. The representative work is MixMatch proposed by Berthelot et al.[17] andFixMatchproposed by Sohn et al[18].


[1] Yu-Gang Jiang, Jingen Liu, A Roshan Zamir, George Toderici, Ivan Laptev, Mubarak Shah,and Rahul Sukthankar. Thumos challenge: Action recognition with a large number of classes,2014.

[2] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activ-itynet: A large-scale video benchmark for human activity understanding. InProceedings ofthe ieee conference on computer vision and pattern recognition, pages 961–970, 2015.

[3] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmedvideos via multi-stage cnns. InProceedings of the IEEE Conference on Computer Vision andPattern Recognition, pages 1049–1058, 2016.

[4] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporalaction detection with structured segment networks. InProceedings of the IEEE InternationalConference on Computer Vision, pages 2914–2923, 2017.

[5] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. InProceedingsof the 25th ACM international conference on Multimedia, pages 988–996, 2017.2

[6] Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles. End-to-end, single-stream temporal action detection in untrimmed videos. 2019.

[7] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles.Sst: Single-stream temporal action proposals. InProceedings of the IEEE conference onComputer Vision and Pattern Recognition, pages 2911–2920, 2017.

[8] Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Fast temporal activityproposals for efficient detection of human actions in untrimmed videos. InProceedings of theIEEE conference on computer vision and pattern recognition, pages 1914–1923, 2016.

[9] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Daps:Deep action proposals for action understanding. InEuropean Conference on Computer Vision,pages 768–784. Springer, 2016.

[10] Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia. Turn tap: Temporalunit regression network for temporal action proposals. InProceedings of the IEEE internationalconference on computer vision, pages 3628–3636, 2017.

[11] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. Bsn: Boundarysensitive network for temporal action proposal generation. InProceedings of the EuropeanConference on Computer Vision (ECCV), pages 3–19, 2018.

[12] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching networkfor temporal action proposal generation. InProceedings of the IEEE International Conferenceon Computer Vision, pages 3889–3898, 2019.

[13] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method fordeep neural networks. InWorkshop on challenges in representation learning, ICML, volume 3,2013.

[14] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisystudent improves imagenet classification, 2020.

[15] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning, 2017.

[16] Takeru Miyato, Shin ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial train-ing: A regularization method for supervised and semi-supervised learning, 2018.

[17] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and ColinRaffel. Mixmatch: A holistic approach to semi-supervised learning, 2019.

[18] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D.Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervisedlearning with consistency and confidence, 2020.

[19] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and LucVan Gool. Temporal segment networks: Towards good practices for deep action recognition.InEuropean conference on computer vision, pages 20–36. Springer, 2016.

[20] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporalconvolutional networks for action segmentation and detection. Inproceedings of the IEEEConference on Computer Vision and Pattern Recognition, pages 156–165,