TY - GEN
T1 - Temporal convolutional networks: A unified approach to action segmentation
T2 - Computer Vision – ECCV 2016 Workshops, Proceedings
AU - Lea, Colin
AU - Vidal, René
AU - Reiter, Austin
AU - Hager, Gregory D.
N1 - Publisher Copyright:
© Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
AB - The dominant paradigm for video-based action segmentation is composed of two steps: first, compute low-level features for each frame using Dense Trajectories or a Convolutional Neural Network to encode local spatiotemporal information, and second, input these features into a classifier such as a Recurrent Neural Network (RNN) that captures high-level temporal relationships. While often effective, this decoupling requires specifying two separate models, each with their own complexities, and prevents capturing more nuanced long-range spatiotemporal relationships. We propose a unified approach, as demonstrated by our Temporal Convolutional Network (TCN), that hierarchically captures relationships at low-, intermediate-, and high-level time-scales. Our model achieves superior or competitive performance using video or sensor data on three public action segmentation datasets and can be trained in a fraction of the time it takes to train an RNN.
UR - http://www.scopus.com/inward/record.url?scp=85005942737&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85005942737&partnerID=8YFLogxK
DO - 10.1007/978-3-319-49409-8_7
M3 - Conference contribution
AN - SCOPUS:85005942737
SN - 9783319494081
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 47
EP - 54
BT - Computer Vision – ECCV 2016 Workshops, Proceedings
A2 - Hua, Gang
A2 - Jégou, Hervé
PB - Springer-Verlag
Y2 - 8 October 2016 through 16 October 2016
ER -