Refer360° is a large-scale vision-and-language dataset for studying spatial language understanding in 360° images. We collected around 17K instructions on Amazon Mechanical Turk, where pairs of annotators play a hiding/finding game by communicating in natural language and interacting with a scene in a 360° image. Refer360° is a versatile dataset and enables investigation along three axes: 1) Language: Refer360° enables modeling tasks that study a single instruction, multiple instructions, or interactive language where the next instruction is revealed only after reaching an intermediate milestone. 2) Vision: Refer360° enables modeling tasks that predict targets at different granularities: at the object level, by identifying the object closest to the target; at the region level; and finally, at the pixel level. 3) Action: Refer360° enables modeling tasks where the action space is static, with the whole 360° image given upfront; discrete, consisting of a sequence of choices between fixed views; or continuous, consisting of angles of rotation.
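At the pixel level, predictions in a continuous action space boil down to mapping viewing angles onto the equirectangular panorama. The sketch below shows one standard way to do this conversion; the function name, signature, and default image dimensions are illustrative assumptions, not the dataset's actual API.

```python
def angles_to_pixels(lat_deg, lng_deg, width=4552, height=2276):
    """Map a viewing direction (latitude, longitude in degrees) to pixel
    coordinates in an equirectangular 360-degree panorama.

    Illustrative sketch (not the Refer360 codebase): longitude wraps over
    [0, 360) and spans the full image width; latitude in [-90, 90] spans
    the height, with +90 (straight up) at row 0.
    """
    x = (lng_deg % 360.0) / 360.0 * width
    y = (90.0 - lat_deg) / 180.0 * height
    return x, y
```

Because longitude wraps around, rotating past 360° simply re-enters the left edge of the image, which is why a continuous rotation action space is a natural fit for equirectangular 360° scenes.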
The Refer360° dataset is available on GitHub: https://github.com/volkancirik/refer360/