Temporal convolutional networks for action segmentation and detection

Colin Lea, Michael D. Flynn, René Vidal, Austin Reiter, Gregory Hager

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
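The abstract's Dilated TCN variant builds temporal context by stacking convolutions whose dilation rate grows layer by layer. As an illustrative sketch only (not the authors' implementation; the function name, shapes, and filter initialization here are assumptions), a causal dilated 1-D convolution over per-frame features can be written as:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution over a (T, C) feature sequence.

    Hypothetical sketch: x is (T, C_in) per-frame features,
    w is (K, C_in, C_out) filter taps. Output frame t sees inputs
    t, t - dilation, t - 2*dilation, ... (zero-padded at the start),
    so no layer ever looks into the future.
    """
    T, _ = x.shape
    K, _, c_out = w.shape
    y = np.zeros((T, c_out))
    for t in range(T):
        for k in range(K):
            src = t - k * dilation
            if src >= 0:
                y[t] += x[src] @ w[k]
    return y

# Stacking layers with dilations 1, 2, 4, 8 roughly doubles the
# receptive field at each layer -- the mechanism by which a Dilated
# TCN reaches long-range temporal context with few parameters.
rng = np.random.default_rng(0)
T, C = 32, 8
feats = rng.standard_normal((T, C))
out = feats
for d in (1, 2, 4, 8):
    out = np.maximum(dilated_conv1d(out, rng.standard_normal((3, C, C)) * 0.1, d), 0)
print(out.shape)  # (32, 8)
```

The Encoder-Decoder TCN described in the abstract instead shrinks the sequence with temporal pooling and restores it with upsampling; the causal convolution above would be one building block of either variant.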

Original language: English (US)
Title of host publication: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1003-1012
Number of pages: 10
Volume: 2017-January
ISBN (Electronic): 9781538604571
DOI: https://doi.org/10.1109/CVPR.2017.113
State: Published - Nov 6 2017
Event: 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 - Honolulu, United States
Duration: Jul 21 2017 – Jul 26 2017

Other

Other: 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Country: United States
City: Honolulu
Period: 7/21/17 – 7/26/17

ASJC Scopus subject areas

  • Signal Processing
  • Computer Vision and Pattern Recognition

Cite this

Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. (2017). Temporal convolutional networks for action segmentation and detection. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (Vol. 2017-January, pp. 1003-1012). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CVPR.2017.113

@inproceedings{2aaccf47d4ee4563aac7a9a380e5420a,
title = "Temporal convolutional networks for action segmentation and detection",
abstract = "The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.",
author = "Colin Lea and Flynn, {Michael D.} and Ren{\'e} Vidal and Austin Reiter and Gregory Hager",
year = "2017",
month = "11",
day = "6",
doi = "10.1109/CVPR.2017.113",
isbn = "9781538604571",
language = "English (US)",
volume = "2017-January",
pages = "1003--1012",
booktitle = "Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - GEN

T1 - Temporal convolutional networks for action segmentation and detection

AU - Lea, Colin

AU - Flynn, Michael D.

AU - Vidal, René

AU - Reiter, Austin

AU - Hager, Gregory

PY - 2017/11/6

Y1 - 2017/11/6

N2 - The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.

AB - The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.

UR - http://www.scopus.com/inward/record.url?scp=85030216293&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85030216293&partnerID=8YFLogxK

U2 - 10.1109/CVPR.2017.113

DO - 10.1109/CVPR.2017.113

M3 - Conference contribution

AN - SCOPUS:85030216293

VL - 2017-January

SP - 1003

EP - 1012

BT - Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -