Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
- We propose a novel model for the TextVQA task based on a multimodal transformer architecture with iterative answer prediction and rich feature representations for OCR tokens, largely outperforming previous work on three datasets.
Language-Conditioned Graph Networks for Relational Reasoning
- To support grounded language reasoning tasks such as VQA and REF, we propose Language-Conditioned Graph Networks (LCGN) to build contextualized representations for objects in a visual scene through iterative message passing conditioned on the textual input.
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
R. Hu, D. Fried, A. Rohrbach, D. Klein, T. Darrell, K. Saenko
Annual Meeting of the Association for Computational Linguistics (ACL), 2019
- Do recent vision-and-language navigation models effectively use visual inputs? We find that they often don't -- visual features may even hurt them. To address this problem, we propose to decompose the grounding procedure into a set of expert models with access to different modalities.
Speaker-Follower Models for Vision-and-Language Navigation
D. Fried*, R. Hu*, V. Cirik*, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein**, T. Darrell** (*, **: indicates equal contribution)
Neural Information Processing Systems (NeurIPS), 2018
- We address vision-and-language navigation with the help of a speaker model that synthesizes new instructions for data augmentation and implements pragmatic reasoning, both supported by a panoramic action space that reflects the granularity of human-generated instructions.
Explainable Neural Computation via Stack Neural Module Networks
- We present a novel neural modular approach that performs compositional reasoning by automatically inducing a desired sub-task decomposition without relying on strong supervision, being fully differentiable and more interpretable to human evaluators.
Grounding Visual Explanations
L. A. Hendricks, R. Hu, T. Darrell, Z. Akata
European Conference on Computer Vision (ECCV), 2018
- We propose a phrase-critic model to refine generated candidate explanations. Our explainable AI agent is capable of providing counterarguments for an alternative prediction, i.e., counterfactuals, along with explanations that justify the correct classification decisions.
Learning to Segment Every Thing
- We propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models over a large set of thousands of categories, all of which have box annotations but only a small fraction of which have mask annotations.
Learning to Reason: End-to-End Module Networks for Visual Question Answering
- We propose End-to-End Module Networks (N2NMNs) for visual question answering, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures while simultaneously learning network parameters (using the downstream task loss).
Modeling Relationships in Referential Expressions with Compositional Modular Networks
- We propose Compositional Modular Networks (CMNs) for handling relationships in natural language referential expressions. CMNs explicitly model the compositional linguistic structure of referential expressions and their groundings, and incorporate two types of modules that consider a region's local features and pairwise interactions between regions.
Segmentation from Natural Language Expressions
- We address the challenging problem of generating a pixelwise segmentation output for the image region described by a natural language referential expression, through a recurrent convolutional neural network model that encodes the expression, extracts a spatial feature map and performs pixelwise classification.
Grounding of Textual Phrases in Images by Reconstruction
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, B. Schiele
European Conference on Computer Vision (ECCV), 2016
- A novel approach that learns to ground (localize) a textual phrase in a description sentence by reconstructing the phrase using an attention mechanism, outperforming prior work that trains with grounding (bounding box) annotations.
Natural Language Object Retrieval
- We propose the Spatial Context Recurrent ConvNet (SCRC) model as a scoring function on candidate boxes for object retrieval, localizing a target object within a given image based on a natural language query of the object and integrating spatial configurations and global scene-level contextual information into the network.
Spatial Semantic Regularisation for Large Scale Object Detection
D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko, T. Darrell
International Conference on Computer Vision (ICCV), 2015
- A multi-class spatial regularization method based on adaptive affinity propagation clustering which simultaneously optimizes across all categories and all proposed locations in the image to improve both location and categorization of selected detection proposals.
LSDA: Large Scale Detection through Adaptation
J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, K. Saenko
Neural Information Processing Systems (NIPS), 2014
- Built an end-to-end real-time ImageNet 7604-category detector using a deep convolutional neural network with spatial pyramid pooling, together with window proposal methods. Used domain adaptation to transfer a classification network into a detection network. Showed a demo during RSS 2014 and ECCV 2014.