ST-GCN (Spatial Temporal Graph Convolutional Networks): Paper Reading Notes

Notation

$E_S$: the set of spatial (intra-frame) edges
$E_F$: the set of temporal (inter-frame) edges
$M$: a learnable mask on every layer of spatial temporal graph convolution
$H$: the set of naturally connected human body joints
$\otimes$: element-wise product

Vocabulary

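traversal: visiting every element in turn (e.g., every node of a graph)
yet to be explored: not yet investigated
harness: to make use of, to bring under control
potent: effective, powerful
intra-: within
inter-: between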

Introduction

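Human actions can be recognized from multiple modalities, such as appearance, depth, optical flow, and body skeletons. Dynamic human skeletons usually convey significant information that is complementary to the other modalities.
Key point: the spatial relationships among the joints are crucial for understanding human actions.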

NN on graphs: 1) spectral perspective; 2) spatial perspective* (* the stream this paper follows)
Skeleton based action recognition:
Skeleton and joint trajectories of human bodies are robust to illumination change and scene variation, and they are easy to obtain owing to the highly accurate depth sensors or pose estimation algorithms.
Three families of methods:
1) Hand-crafted feature based methods
  • covariance matrices of joint trajectories
  • relative positions of joints
  • rotations and translations between body parts
2) Deep learning methods
  • RNNs
  • temporal CNNs
3) Graph CNNs (first introduced for this task by this paper)

Spatial Temporal Graph ConvNet

3.2 Skeleton Graph Construction

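Constructing the spatial temporal skeleton graph:
  • Step 1: within each frame, the joints are connected according to the structure of the human body.
  • Step 2: each joint in a frame is connected to the same joint in the next frame.
  • Benefit: the model can be applied to different datasets that have different numbers of joints and different joint connectivity.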
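The two construction steps can be sketched in a few lines of NumPy. This is only an illustration: the 5-joint skeleton, the `node_id` helper, and the flattened node indexing are assumptions made for this example, not the OpenPose or NTU layout used in the paper.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton (not the OpenPose or NTU layout):
# 0 = torso, 1 = head, 2 = left hand, 3 = right hand, 4 = feet
H = [(0, 1), (0, 2), (0, 3), (0, 4)]  # naturally connected joint pairs
N, T = 5, 3                           # joints per frame, number of frames

def node_id(t, i, num_joints=N):
    """Flatten (frame t, joint i) into a single node index."""
    return t * num_joints + i

# Step 1: spatial edges E_S connect joints inside each frame according to H.
E_S = [(node_id(t, i), node_id(t, j)) for t in range(T) for (i, j) in H]

# Step 2: temporal edges E_F connect each joint to itself in the next frame.
E_F = [(node_id(t, i), node_id(t + 1, i)) for t in range(T - 1) for i in range(N)]

# Symmetric adjacency matrix over all T * N nodes.
A = np.zeros((T * N, T * N), dtype=int)
for u, v in E_S + E_F:
    A[u, v] = A[v, u] = 1

print(len(E_S), len(E_F))  # 12 10
```

Because only `H` and `N` describe the skeleton, switching to a dataset with another joint layout means changing those two values and nothing else.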

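$$f_{out}(\mathbf{x}) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}(\mathbf{p}(\mathbf{x}, h, w)) \cdot \mathbf{w}(h, w)$$
The formula above is the ordinary image convolution with a $K \times K$ kernel, where:
$\mathbf{x}$ is a location on the feature map
sampling function $\mathbf{p}: Z^2 \times Z^2 \rightarrow Z^2$
weight function $\mathbf{w}: Z^2 \rightarrow \mathbb{R}^c$
$Z^2$ denotes 2D integer coordinates. $\mathbf{p}$ takes two coordinates (the location $\mathbf{x}$ and the kernel offset) and returns a new coordinate on the feature map; $\mathbf{w}$ takes a kernel offset and returns the kernel values for the $c$ channels.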
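To make the roles of the sampling function and the weight function concrete, here is a minimal NumPy sketch of the formula for a single output location. The names `sample` and `conv_at` are made up for this note, indices are 0-based rather than 1-based, the kernel is assumed to be centered on $\mathbf{x}$, and no padding is handled, so $\mathbf{x}$ must be an interior location.

```python
import numpy as np

def sample(x, h, w, K):
    """Sampling function p(x, h, w): location x plus the kernel offset (kernel centered on x)."""
    return (x[0] + h - K // 2, x[1] + w - K // 2)

def conv_at(f_in, weights, x):
    """f_out(x) = sum_h sum_w f_in(p(x, h, w)) . w(h, w) for one location x.

    f_in:    (height, width, c) input feature map
    weights: (K, K, c) kernel; weights[h, w] is the c-dimensional vector w(h, w)
    """
    K = weights.shape[0]
    out = 0.0
    for h in range(K):
        for w in range(K):
            py, px = sample(x, h, w, K)
            out += np.dot(f_in[py, px], weights[h, w])  # inner product over the c channels
    return out

# Toy usage: with an all-ones kernel the result is the sum of the 3x3 window around x.
f_in = np.arange(50, dtype=float).reshape(5, 5, 2)
weights = np.ones((3, 3, 2))
result = conv_at(f_in, weights, (2, 2))
```

On a graph there is no regular grid of offsets, which is why the paper has to redefine both functions for the neighbor set of a joint.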

Node Partition Strategies

Experiments

4.1 Dataset and Evaluation Metrics

Kinetics

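About 300,000 video clips; 400 human action classes; roughly 10 s per clip
Preprocessing:
  • Resize every video to $340 \times 256$ at 30 FPS, then use OpenPose to extract, for each of 18 joints, the 2D coordinates and a confidence score $(X, Y, C)$.
  • In multi-person scenes, keep only the two people with the highest confidence (the mean confidence over all joints).
  • Each clip is represented as a tensor of shape $(3, T, 18, 2)$, where $T$ is the number of frames in the clip.
  • For simplicity, every clip is padded to 300 frames by replaying it from the start.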
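The selection and padding steps might look like the sketch below. It assumes the pose estimates are already collected into one array of shape `(P, T, 18, 3)` with a consistent person index across frames, and it zero-pads when fewer than two people are detected; both are assumptions of this note, and the function names are hypothetical.

```python
import numpy as np

def select_top_people(poses, num_people=2):
    """poses: (P, T, 18, 3) array holding (X, Y, C) for P detected people.

    Keeps the num_people with the highest mean joint confidence and
    zero-pads if fewer people were detected (an assumption of this sketch).
    """
    mean_conf = poses[..., 2].mean(axis=(1, 2))   # one score per person
    order = np.argsort(-mean_conf)[:num_people]   # most confident first
    kept = poses[order]
    if kept.shape[0] < num_people:
        pad_shape = (num_people - kept.shape[0],) + kept.shape[1:]
        kept = np.concatenate([kept, np.zeros(pad_shape, dtype=poses.dtype)], axis=0)
    return kept

def pad_by_replay(poses, target_len=300):
    """Repeat the clip from its start until it has target_len frames."""
    idx = np.arange(target_len) % poses.shape[1]
    return poses[:, idx]

def to_clip_tensor(poses, target_len=300):
    """(P, T, 18, 3) -> (3, target_len, 18, 2): channels, frames, joints, people."""
    kept = pad_by_replay(select_top_people(poses), target_len)
    return kept.transpose(3, 1, 2, 0)

# Toy usage with random data: 4 detected people, 120 frames.
rng = np.random.default_rng(0)
toy_poses = rng.random((4, 120, 18, 3))
toy_poses[1, ..., 2] = 1.0                        # person 1 is the most confident
clip = to_clip_tensor(toy_poses)
```

Replaying, rather than zero-padding in time, keeps every frame of the padded clip a valid pose.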

NTU-RGB+D

56,000 clips; 60 action classes; 40 volunteers; 3 cameras; 25 joints; $\leq 2$ subjects per clip
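Two ways of splitting the dataset:
  • cross-subject (x-sub): split by actor; the clips of one subset of actors form the training set and the clips of the remaining actors form the test set.
  • cross-view (x-view): clips from cameras 2 and 3 form the training set and clips from camera 1 form the test set.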
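Both protocols are simple filters over clip metadata. The sketch below uses a hypothetical `Clip` record and made-up subject IDs; the actual list of training subjects is defined by the benchmark and is not reproduced in these notes.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    subject: int  # performer ID
    camera: int   # 1, 2 or 3

def cross_view_split(clips):
    """x-view: cameras 2 and 3 for training, camera 1 for testing."""
    train = [c for c in clips if c.camera in (2, 3)]
    test = [c for c in clips if c.camera == 1]
    return train, test

def cross_subject_split(clips, train_subjects):
    """x-sub: clips of the given subjects for training, all other subjects for testing."""
    train = [c for c in clips if c.subject in train_subjects]
    test = [c for c in clips if c.subject not in train_subjects]
    return train, test

# Toy metadata; the names and IDs are invented for illustration.
toy_clips = [Clip("a", 1, 1), Clip("b", 1, 2), Clip("c", 2, 3), Clip("d", 3, 1)]
view_train, view_test = cross_view_split(toy_clips)
sub_train, sub_test = cross_subject_split(toy_clips, train_subjects={1, 2})
```

Under x-sub the same actor never appears in both sets, and under x-view the same camera never does.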

4.2 Ablation Study

baseline TCN

  • the joint locations are concatenated to form the input features at each frame t
  • temporal convolution convolves over time
  • this temporal convolution is equivalent to spatial temporal graph convolution with unshared weights on a fully connected joint graph
  • difference between TCN and ST-GCN: ST-GCN uses sparse connections and shared weights in its convolution operations
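A rough NumPy sketch of the baseline's core operation, a temporal convolution over frame-wise concatenated joint coordinates, is given below. The shapes, the absence of padding, and the function name `temporal_conv` are assumptions of this note, not the baseline's actual architecture.

```python
import numpy as np

def temporal_conv(x, kernel):
    """Valid (unpadded) convolution over time, written as cross-correlation
    as is usual in deep learning.

    x:      (T, N * C) concatenated coordinates of N joints per frame
    kernel: (Kt, N * C, C_out) temporal kernel spanning Kt frames
    returns (T - Kt + 1, C_out)
    """
    Kt = kernel.shape[0]
    T = x.shape[0]
    out = np.zeros((T - Kt + 1, kernel.shape[2]))
    for t in range(T - Kt + 1):
        window = x[t:t + Kt]                          # (Kt, N * C)
        out[t] = np.einsum("kd,kdo->o", window, kernel)
    return out

# Toy usage: 20 frames, 18 joints with 3 values each, 4 output channels.
rng = np.random.default_rng(0)
joints = rng.random((20, 18, 3))
x = joints.reshape(20, 18 * 3)                        # concatenate joints per frame
kernel = rng.random((9, 18 * 3, 4))
y = temporal_conv(x, kernel)
```

Every output channel has its own weight for every coordinate of every joint, which is the sense in which the baseline acts like a convolution with unshared weights on a fully connected joint graph.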

Local Convolution

A model that sits between TCN and ST-GCN: it uses the sparse joint graph, but its convolution kernels do not share weights.

Partition Strategies

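Uni-labeling is equivalent to averaging the features of all nodes in the neighbor set before the convolution.
Distance partitioning* (with the asterisk) binds the two weights so that $w_0 = -w_1$; it still improves considerably over uni-labeling.
Distance partitioning (without the asterisk) learns separate weights for the root node and its neighbors, and improves slightly over the starred variant.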
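The three variants can be compared on a single root node with a toy computation. The sketch assumes each subset is normalized by its own size, which these notes do not state, and the function names are hypothetical.

```python
import numpy as np

def uni_labeling(features, neighbors, w):
    """Every node in the neighbor set (root included) shares one weight vector w."""
    return sum(np.dot(features[j], w) for j in neighbors) / len(neighbors)

def distance_partitioning(features, root, neighbors, w0, w1):
    """The root (distance 0) uses w0; the other neighbors (distance 1) use w1.
    Each subset is normalized by its own size (assumption of this sketch)."""
    others = [j for j in neighbors if j != root]
    out = np.dot(features[root], w0)
    if others:
        out += sum(np.dot(features[j], w1) for j in others) / len(others)
    return out

# Toy data: 3 nodes with 2 channels each; node 0 is the root.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = [0, 1, 2]
w = np.array([2.0, 3.0])

uni = uni_labeling(features, neighbors, w)
plain = distance_partitioning(features, 0, neighbors, w0=w, w1=w)
starred = distance_partitioning(features, 0, neighbors, w0=-w, w1=w)  # w0 = -w1
```

Here `uni` equals the inner product of `w` with the mean neighbor feature, so any difference between the root and its neighbors is lost; the starred variant has the same number of free parameters but already keeps the root apart through the sign flip.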