Open problems and fundamental limitations of reinforcement learning from human feedback | S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... | arXiv preprint arXiv:2307.15217, 2023 | 171 | 2023 |
Red-Teaming the Stable Diffusion Safety Filter | J Rando, D Paleka, D Lindner, L Heim, F Tramèr | ML Safety Workshop - NeurIPS 2022, 2022 | 75 | 2022 |
Scalable and transferable black-box jailbreaks for language models via persona modulation | R Shah, S Pour, A Tagade, S Casper, J Rando | arXiv preprint arXiv:2311.03348, 2023 | 27 | 2023 |
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks | E Mosca, S Agarwal, J Rando-Ramirez, G Groh | ACL 2022, 2022 | 20 | 2022 |
Universal jailbreak backdoors from poisoned human feedback | J Rando, F Tramèr | arXiv preprint arXiv:2311.14455, 2023 | 16 | 2023 |
Uneven coverage of natural disasters in Wikipedia: The case of flood | V Lorini, J Rando, D Saez-Trumper, C Castillo | ISCRAM 2020, 2020 | 11 | 2020 |
Personas as a Way to Model Truthfulness in Language Models | N Joshi, J Rando, A Saparov, N Kim, H He | arXiv preprint arXiv:2310.18168, 2023 | 7 | 2023 |
PassGPT: Password Modeling and (Guided) Generation with Large Language Models | J Rando, F Perez-Cruz, B Hitaj | European Symposium on Research in Computer Security, 164-183, 2023 | 5 | 2023 |
Foundational challenges in assuring alignment and safety of large language models | U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... | arXiv preprint arXiv:2404.09932, 2024 | 4 | 2024 |
Attributions toward artificial agents in a modified Moral Turing Test | E Aharoni, S Fernandes, DJ Brady, C Alexander, M Criner, K Queen, ... | Scientific Reports 14 (1), 8458, 2024 | 1 | 2024 |
Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO | J Rando, N Naimi, T Baumann, M Mathys | AdvML Frontiers Workshop (ICML 2022), 2022 | 1 | 2022 |
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs | J Rando, F Croce, K Mitka, S Shabalin, M Andriushchenko, N Flammarion, ... | arXiv preprint arXiv:2404.14461, 2024 | | 2024 |