ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Results | Updates | Usage | Todo | Acknowledge

This branch contains the PyTorch implementation of ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.

Results from this repo on MS COCO val set (single task training)

These results use detections from a person detector that obtains 56 mAP on the person class. The configs here are for both training and testing.

With classic decoder

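| Model | Pretrain | Resolution | AP | AR | config | log | weight |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViTPose-B | MAE | 256x192 | 75.8 | 81.1 | config | log | Onedrive |
| ViTPose-L | MAE | 256x192 | 78.3 | 83.5 | config | log | Onedrive |
| ViTPose-H | MAE | 256x192 | 79.1 | 84.1 | config | log | Onedrive |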

With simple decoder

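| Model | Pretrain | Resolution | AP | AR | config | log | weight |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViTPose-B | MAE | 256x192 | 75.5 | 80.9 | config | log | Onedrive |
| ViTPose-L | MAE | 256x192 | 78.2 | 83.4 | config | log | Onedrive |
| ViTPose-H | MAE | 256x192 | 78.9 | 84.0 | config | log | Onedrive |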
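
The classic-decoder and simple-decoder results can be compared directly. The following is a small sketch of that arithmetic; the dictionary names are illustrative, and the AP values are copied from the two tables above.

```python
# AP on MS COCO val (256x192, MAE pretraining), copied from the two tables above.
classic_ap = {"ViTPose-B": 75.8, "ViTPose-L": 78.3, "ViTPose-H": 79.1}
simple_ap = {"ViTPose-B": 75.5, "ViTPose-L": 78.2, "ViTPose-H": 78.9}

# Gap in AP points (classic minus simple), rounded to one decimal place
# to hide floating-point noise from the subtraction.
ap_gap = {name: round(classic_ap[name] - simple_ap[name], 1) for name in classic_ap}

for name, gap in ap_gap.items():
    print(f"{name}: classic {classic_ap[name]} vs simple {simple_ap[name]} (gap {gap} AP)")
```

On these numbers the simple decoder trails the classic decoder by at most 0.3 AP.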

Results from this repo on MS COCO val set (multi task training)

These results use detections from a person detector that obtains 56 mAP on the person class. Note that the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AR | config | weight |
| --- | --- | --- | --- | --- | --- | --- |
| ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 77.5 | 82.6 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 79.1 | 84.1 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 79.8 | 84.8 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 81.0 | 85.6 | | |

Results from this repo on OCHuman test set (multi task training)

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AR | config | weight |
| --- | --- | --- | --- | --- | --- | --- |
| ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 88.2 | 90.0 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 91.5 | 92.8 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 91.6 | 92.8 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 93.3 | 94.3 | | |

Results from this repo on CrowdPose test set (multi task training)

Using YOLOv3 human detector. Note the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AP(H) | config | weight |
| --- | --- | --- | --- | --- | --- | --- |
| ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 74.7 | 63.3 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 76.6 | 65.9 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 76.3 | 65.6 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 78.3 | 67.9 | | |

Results from this repo on MPII val set (multi task training)

Using groundtruth bounding boxes. Note the configs here are only for evaluation. The metric is PCKh.

| Model | Dataset | Resolution | Mean | config | weight |
| --- | --- | --- | --- | --- | --- |
| ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 93.4 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 93.9 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 94.1 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 94.3 | | |

Results from this repo on AI Challenger test set (multi task training)

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AR | config | weight |
| --- | --- | --- | --- | --- | --- | --- |
| ViTPose-B | COCO+AIC+MPII+CrowdPose | 256x192 | 31.9 | 36.3 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII+CrowdPose | 256x192 | 34.6 | 39.0 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII+CrowdPose | 256x192 | 35.3 | 39.8 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII+CrowdPose | 576x432 | 43.2 | 47.1 | | |

Updates

[2022-05-06] Upload the logs for the base, large, and huge models!

[2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on COCO test-dev set!

Applications of ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose segmentation | remote sensing | matting | VSA | ViTDet

Usage

We use PyTorch 1.9.0 or NGC docker 21.06, and mmcv 1.3.9 for the experiments.

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .

After installing the two repos, install timm and einops:

pip install timm==0.4.9 einops
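
Since the experiments pin specific versions, it can help to confirm the environment before training. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the distribution names `torch` and `mmcv-full` are assumptions (a source install of mmcv may register under a different name, and container builds may report a different version string).

```python
from importlib import metadata

# Versions pinned in the instructions above. The distribution names are
# assumptions; adjust them if your install registers under another name.
EXPECTED = {"torch": "1.9.0", "mmcv-full": "1.3.9", "timm": "0.4.9"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, expected):
    """True if `installed` equals `expected`, ignoring a local suffix like '+cu111'."""
    return installed is not None and installed.split("+")[0] == expected


def check_environment(expected=EXPECTED, lookup=installed_version):
    """Map each package name to (installed_version, ok_flag)."""
    report = {}
    for name, want in expected.items():
        have = lookup(name)
        report[name] = (have, matches(have, want))
    return report


if __name__ == "__main__":
    for name, (have, ok) in check_environment().items():
        status = "ok" if ok else f"expected {EXPECTED[name]}"
        print(f"{name}: {have} ({status})")
```

A mismatch here does not necessarily mean training will fail; it only flags a difference from the versions used for the reported experiments.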

Download the pretrained models from MAE or ViTAE, then run the experiments with:

# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH> --seed 0

# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch --seed 0
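
For the multi-machine case, every machine runs the same command and only `--node_rank` differs. The following is a minimal sketch that assembles the command above as an argument list; the function name, the config and checkpoint paths, and the address and port are hypothetical placeholders, not values from this repo.

```python
import shlex


def build_launch_command(config, pretrained, num_machines, node_rank,
                         gpus_per_machine, master_addr, master_port, seed=0):
    """Return the multi-machine training command as an argument list."""
    return [
        "python", "-m", "torch.distributed.launch",
        "--nnodes", str(num_machines),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(gpus_per_machine),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "tools/train.py", config,
        "--cfg-options", f"model.pretrained={pretrained}",
        "--launcher", "pytorch",
        "--seed", str(seed),
    ]


# Hypothetical values for a run on 2 machines with 8 GPUs each.
for rank in range(2):
    cmd = build_launch_command(
        config="path/to/config.py",          # placeholder for <Config PATH>
        pretrained="path/to/pretrained.pth",  # placeholder for <Pretrained PATH>
        num_machines=2, node_rank=rank, gpus_per_machine=8,
        master_addr="10.0.0.1", master_port=29500,
    )
    print(f"machine {rank}: {shlex.join(cmd)}")
```

Building the command as a list keeps each flag next to its value and allows passing it to `subprocess.run` without shell quoting issues.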

To test the performance of the pretrained models, run:

bash tools/dist_test.sh <Config PATH> <Checkpoint PATH> <NUM GPUs>

Todo

This repo currently contains modifications including:

  • Upload configs and pretrained models

  • More models with SOTA results

  • Upload multi-task training config

Acknowledge

We acknowledge the excellent implementation from mmpose and MAE.

Citing ViTPose

@misc{xu2022vitpose,
      title={ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation}, 
      author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
      year={2022},
      eprint={2204.12484},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

For ViTAE and ViTAEv2, please refer to:

@article{xu2021vitae,
  title={ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}