Publications

Paper PDFs, code releases and datasets can be found on the Research Projects page. See Google Scholar for the latest publications.

Conference Papers and Preprints

  • N. Ravi*,†, V. Gabeur*, Y.-T. Hu*, R. Hu*, C. Ryali*, T. Ma*, H. Khedr*, R. Rädle*, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, C. Feichtenhofer*,†. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714, 2024 (*: equal contribution, †: project lead).
  • D. Z. Chen, R. Hu, X. Chen, M. Nießner, A. X. Chang. UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding. In ICCV, 2023.
  • Y. Li*, H. Fan*, R. Hu*, C. Feichtenhofer†, K. He†. Scaling Language-Image Pre-training via Masking. In CVPR, 2023 (*: equal technical contribution, †: equal advising).
  • S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, S. Xie. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. In CVPR, 2023.
  • A. Singh*, R. Hu*, V. Goswami*, G. Couairon, W. Galuba, M. Rohrbach, D. Kiela. FLAVA: A Foundational Language And Vision Alignment Model. In CVPR, 2022 (*: equal contribution).
  • R. Hu, A. Singh. UniT: Multimodal Multitask Learning with a Unified Transformer. In ICCV, 2021.
  • R. Hu, N. Ravi, A. Berg, D. Pathak. Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image. In ICCV, 2021.
  • O. Sidorov, R. Hu, M. Rohrbach, A. Singh. TextCaps: a Dataset for Image Captioning with Reading Comprehension. In ECCV, 2020.
  • R. Hu, A. Singh, T. Darrell, M. Rohrbach. Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. In CVPR, 2020.
  • R. Hu, A. Rohrbach, T. Darrell, K. Saenko. Language-Conditioned Graph Networks for Relational Reasoning. In ICCV, 2019.
  • R. Hu, D. Fried, A. Rohrbach, D. Klein, T. Darrell, K. Saenko. Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. In ACL, 2019.
  • D. Fried*, R. Hu*, V. Cirik*, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein**, T. Darrell**. Speaker-Follower Models for Vision-and-Language Navigation. In NeurIPS, 2018 (*, **: equal contribution).
  • R. Hu, J. Andreas, T. Darrell, K. Saenko. Explainable Neural Computation via Stack Neural Module Networks. In ECCV, 2018.
  • L. A. Hendricks, R. Hu, T. Darrell, Z. Akata. Grounding Visual Explanations. In ECCV, 2018.
  • R. Hu, P. Dollár, K. He, T. Darrell, R. Girshick. Learning to Segment Every Thing. In CVPR, 2018.
  • R. Hu, J. Andreas, M. Rohrbach, T. Darrell, K. Saenko. Learning to Reason: End-to-End Module Networks for Visual Question Answering. In ICCV, 2017.
  • R. Hu, M. Rohrbach, J. Andreas, T. Darrell, K. Saenko. Modeling Relationships in Referential Expressions with Compositional Modular Networks. In CVPR, 2017.
  • R. Hu, M. Rohrbach, T. Darrell. Segmentation from Natural Language Expressions. In ECCV, 2016.
  • A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, B. Schiele. Grounding of Textual Phrases in Images by Reconstruction. In ECCV, 2016.
  • R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, T. Darrell. Natural Language Object Retrieval. In CVPR, 2016.
  • D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko, T. Darrell. Spatial Semantic Regularisation for Large Scale Object Detection. In ICCV, 2015.
  • J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, K. Saenko. LSDA: Large Scale Detection through Adaptation. In NIPS, 2014.
  • R. Hu, R. Wang, S. Shan, X. Chen. Robust head-shoulder detection using a two-stage cascade framework. In ICPR, 2014.

Workshop Papers and Technical Reports

  • R. Hu, S. Debnath, S. Xie, X. Chen. Exploring Long-Sequence Masked Autoencoders. arXiv preprint arXiv:2210.07224, 2022.
  • L. A. Hendricks, R. Hu, T. Darrell, Z. Akata. Generating Counterfactual Explanations with Natural Language. In ICML Workshop on Human Interpretability in Machine Learning, 2018.
  • R. Hu, M. Rohrbach, T. Darrell, S. Venugopalan. Utilizing large scale vision and text datasets for image segmentation from referring expressions. arXiv preprint arXiv:1608.08305, 2016.