ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Results | Updates | Usage | Todo | Acknowledgement
This branch contains the PyTorch implementation of ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.
Results from this repo on the MS COCO val set
Using detection results from a human detector that obtains 56 mAP on the person category.
| Model | Pretrain | Resolution | AP | AR | config | log | weight |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Base | MAE | 256x192 | 75.8 | 81.1 | config | log | |
| ViT-Large | MAE | 256x192 | 78.3 | 83.5 | config | log | |
| ViT-Huge | MAE | 256x192 | 79.1 | 84.1 | config | log | |
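To evaluate one of these checkpoints on the MS COCO val set after setting up the environment (see Usage below), the standard mmpose test entry point can be used. A minimal sketch, assuming the repo ships mmpose's tools/dist_test.sh; the config and checkpoint paths are illustrative, not the exact filenames:

```bash
# Illustrative paths: substitute the actual config and your downloaded checkpoint.
bash tools/dist_test.sh \
    configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py \
    checkpoints/vitpose_base_coco_256x192.pth \
    8 \
    --eval mAP
```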
Updates
[2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on the MS COCO test-dev set!
Applications of the ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting | VSA | ViTDet
Usage
We use PyTorch 1.9.0 (or the NGC Docker 21.06 image) and mmcv 1.3.9 for the experiments.
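If you prefer the containerized route, the matching container can be pulled from NVIDIA NGC; the exact image tag below is our assumption based on NGC's YY.MM-py3 naming scheme:

```bash
# Assumed NGC PyTorch image tag for the 21.06 release
docker pull nvcr.io/nvidia/pytorch:21.06-py3
docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:21.06-py3
```

Then build mmcv and this repo from source: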
```bash
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .
```
After installing the two repos, install timm and einops:
```bash
pip install timm==0.4.9 einops
```
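As a quick sanity check that the environment is consistent (our suggestion, not part of the original instructions), the installed packages can be imported and their versions printed:

```bash
# ViTPose installs as the mmpose package, so all four imports should succeed
python -c "import mmcv, mmpose, timm, einops; print(mmcv.__version__, timm.__version__)"
```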
Download the pretrained models from MAE or ViTAE, then run the experiments with:
```bash
# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH> --seed 0

# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch --seed 0
```
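For instance, a single-machine run of the ViT-Base model on 8 GPUs might look like the sketch below; the config and pretrained-weight paths are illustrative assumptions, so substitute the ones from this repo and your MAE download:

```bash
# Illustrative paths; adjust to the actual config file and MAE checkpoint location.
bash tools/dist_train.sh \
    configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py \
    8 \
    --cfg-options model.pretrained=pretrained/mae_pretrain_vit_base.pth \
    --seed 0
```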
Todo
This repo currently contains modifications including:
- Upload configs and pretrained models
- More models with SOTA results
Acknowledgement
We acknowledge the excellent implementation from mmpose and MAE.
Citing ViTPose
```bibtex
@misc{xu2022vitpose,
  title={ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation},
  author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
  year={2022},
  eprint={2204.12484},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
For ViTAE and ViTAEv2, please refer to:
```bibtex
@article{xu2021vitae,
  title={ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
```