Vision-and-Language Navigation: A Comprehensive Review of Tasks, Methods, and Challenges

Anqi Song; Aobing Yin; Chenghe Kong; Yuhuan Xie

doi:10.54691/j1xcw420

Keywords:

Vision-and-Language Navigation, embodied AI, multimodal learning, deep learning, survey.

Abstract

Vision-and-Language Navigation (VLN) is a core challenge in embodied AI: the goal is to develop agents that understand natural language instructions and navigate autonomously in visual environments. This survey systematically reviews the task paradigms and recent progress in the VLN field. We first propose a four-quadrant taxonomy based on environment, interaction, and instruction modality (Indoor, Outdoor, Interactive, and Multimodal-instruction Navigation), and use it to analyze the core characteristics, evaluation metrics, and technical challenges of representative datasets. We then review mainstream technical methods in detail, including classical modular paradigms, end-to-end learning (reinforcement learning and imitation learning), pre-training and transfer learning strategies, and advanced methods based on memory and graph structures, and discuss their respective advantages, disadvantages, and applicable scenarios. Finally, we summarize the open challenges in the field, such as simulation-to-reality transfer, long-horizon planning, interactive reasoning, and embodied learning, and outline future research directions. The survey aims to give researchers a clear overview of the technical landscape and to promote the development of VLN toward more general, robust, and practical applications.
