Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. S Casper*, X Davies*, C Shi, TK Gilbert, J Scheurer, J Rando, et al. TMLR, 2023. Cited by 454.
Defining and Characterizing Reward Hacking. J Skalse*, NHR Howe, D Krasheninnikov, D Krueger*. Advances in Neural Information Processing Systems 35, 2022. Cited by 239.
Harms from Increasingly Agentic Algorithmic Systems. A Chan, R Salganik, A Markelius, C Pang, N Rajkumar, D Krasheninnikov, et al. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and …, 2023. Cited by 112*.
Preferences Implicit in the State of the World. R Shah*, D Krasheninnikov*, J Alexander, P Abbeel, A Dragan. International Conference on Learning Representations, 2019. Cited by 90*.
Benefits of Assistance over Reward Learning. R Shah, P Freire, N Alex, R Freedman, D Krasheninnikov, L Chan, et al. NeurIPS Workshop on Cooperative AI (best paper award), 2020. Cited by 36.
Implicit meta-learning may lead language models to trust more reliable sources. D Krasheninnikov*, E Krasheninnikov*, B Mlodozeniec, T Maharaj, et al. ICML, 2024 (arXiv:2310.15047). Cited by 13*.
Assistance with large language models. D Krasheninnikov*, E Krasheninnikov*, D Krueger. NeurIPS ML Safety Workshop, 2022. Cited by 10.
Stress-Testing Capability Elicitation With Password-Locked Models. R Greenblatt*, F Roger*, D Krasheninnikov, D Krueger. Advances in Neural Information Processing Systems 37, 2024. Cited by 9.
Combining reward information from multiple sources. D Krasheninnikov, R Shah, H van Hoof. NeurIPS Workshop on Learning with Rich Experience, 2019. Cited by 5.
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks. M Brumley, J Kwon, D Krueger, D Krasheninnikov, U Anwar. arXiv preprint arXiv:2411.07213, 2024. Cited by 1.
Steering Clear: A Systematic Study of Activation Steering in a Toy Setup. D Krasheninnikov, D Krueger. NeurIPS Workshop on Foundation Model Interventions (MINT), 2024.