Speaker-Follower Models for Vision-and-Language Navigation
- We address vision-and-language navigation with the help of a speaker model, which synthesizes new instructions for data augmentation and implements pragmatic reasoning, both supported by a panoramic action space that matches the granularity of human-generated instructions.
Learning to Segment Every Thing
- We propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models over thousands of categories for which all have box annotations but only a small fraction have mask annotations.
Learning to Reason: End-to-End Module Networks for Visual Question Answering
- We propose End-to-End Module Networks (N2NMNs) for visual question answering, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures while simultaneously learning network parameters (using the downstream task loss).
Modeling Relationships in Referential Expressions with Compositional Modular Networks
- We propose Compositional Modular Networks (CMNs) for handling relationships in natural language referential expressions, which explicitly model the compositional linguistic structure of referential expressions and their groundings, and incorporate two types of modules that consider a region's local features and pairwise interactions between regions.
Segmentation from Natural Language Expressions
- We address the challenging problem of generating a pixelwise segmentation of the image region described by a natural language referential expression, through a recurrent convolutional neural network model that encodes the expression, extracts a spatial feature map, and performs pixelwise classification.
Grounding of Textual Phrases in Images by Reconstruction
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, B. Schiele
European Conference on Computer Vision (ECCV), 2016 (Oral presentation)
- A novel approach to learning to ground (localize) a textual phrase from a description sentence by reconstructing the phrase with an attention mechanism, outperforming prior work that trains with ground-truth grounding (bounding box) annotations.
Natural Language Object Retrieval
- We propose the Spatial Context Recurrent ConvNet (SCRC) model as a scoring function over candidate boxes for natural language object retrieval, localizing a target object within a given image based on a natural language query, and integrating spatial configurations and global scene-level contextual information into the network.
Spatial Semantic Regularisation for Large Scale Object Detection
D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko, T. Darrell
International Conference on Computer Vision (ICCV), 2015
- A multi-class spatial regularization method based on adaptive affinity propagation clustering, which simultaneously optimizes across all categories and all proposed locations in the image to improve both the localization and the categorization of selected detection proposals.
LSDA: Large Scale Detection through Adaptation
J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, K. Saenko
Advances in Neural Information Processing Systems (NIPS), 2014
- Built an end-to-end real-time ImageNet 7,604-category detector using a deep convolutional neural network with spatial pyramid pooling, together with window proposal methods. Used domain adaptation to transfer a classification network into a detection network. Demonstrated live at RSS 2014 and ECCV 2014.
(*, **: indicate equal contribution)