Fazl Barez

Cited by

	All	Since 2019
Citations	264	263
h-index	8	8
i10-index	7	7

Citations per year: 1 (2019), 3 (2020), 1 (2021), 11 (2022), 68 (2023), 177 (2024)

Public access (based on funding mandates): 1 article available, 0 articles not available

Co-authors

Shay Cohen (University of Edinburgh) - Verified email at inf.ed.ac.uk
Philip Torr (Professor, University of Oxford) - Verified email at eng.ox.ac.uk
Jakob Foerster (Associate Professor, University of Oxford) - Verified email at eng.ox.ac.uk
Trevor Darrell (Professor of Computer Science, U.C. Berkeley) - Verified email at eecs.berkeley.edu
Bertie Vidgen (Oxford, Turing) - Verified email at rewire.online
Adel Bibi (University of Oxford) - Verified email at eng.ox.ac.uk
David Duvenaud (Associate Professor, University of Toronto) - Verified email at cs.toronto.edu
Ethan Perez (Anthropic; New York University) - Verified email at anthropic.com
Samuel R. Bowman (Anthropic and NYU) - Verified email at anthropic.com
Evan Hubinger (Safety Researcher, Anthropic) - Verified email at anthropic.com
Jared Kaplan (Johns Hopkins University & Anthropic) - Verified email at pha.jhu.edu
Alex Tamkin (Research Scientist, Anthropic) - Verified email at cs.stanford.edu
David Scott Krueger (University Assistant Professor, University of Cambridge) - Verified email at cam.ac.uk
Mor Geva (Tel Aviv University, Google Research) - Verified email at tauex.tau.ac.il
Roger Grosse (Associate Professor, University of Toronto) - Verified email at cs.toronto.edu
Mrinank Sharma (Anthropic) - Verified email at anthropic.com
Sören Mindermann (University of Oxford, OATML) - Verified email at cs.ox.ac.uk
Jan Brauner (University of Oxford) - Verified email at cs.ox.ac.uk
Jesse Mu (Anthropic) - Verified email at anthropic.com
Paul Christiano (National Institute of Standards and Technology) - Verified email at nist.gov

Fazl Barez

University of Oxford, Tangentic

Verified email at robots.ox.ac.uk

Machine Learning, AI Safety, Interpretability, Alignment


Title	Cited by	Year
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark J Hoelscher-Obermaier, J Persson, E Kran, I Konstas, F Barez* Findings of the Association for Computational Linguistics 2023, 11548–11559, 2023	44	2023
Sleeper agents: Training deceptive LLMs that persist through safety training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024	40	2024
PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration P Li, H Tang, T Yang, X Hao, T Sang, Y Zheng, J Hao, ME Taylor, Z Wang, ... arXiv preprint arXiv:2203.08553, 2022	31	2022
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python AVM Barone, F Barez, I Konstas, SB Cohen The 61st Annual Meeting Of The Association For Computational Linguistics, 2023	25*	2023
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) K Duh, H Gomez, S Bethard Proceedings of the 2024 Conference of the North American Chapter of the …, 2024	18	2024
Neuron to Graph: Interpreting Language Model Neurons at Scale A Foote, N Nanda, E Kran, I Konstas, S Cohen, F Barez arXiv preprint arXiv:2305.19911, 2023	17*	2023
Understanding Addition in Transformers P Quirke, F Barez International Conference on Learning Representations (ICLR), 2023	16	2023
Risks and Opportunities of Open-Source Generative AI F Eiras, A Petrov, B Vidgen, C Schroeder, F Pizzati, K Elkins, ... arXiv preprint arXiv:2405.08597, 2024	8	2024
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ... arXiv preprint arXiv:2406.10162, 2024	7	2024
Near to mid-term risks and opportunities of open source generative AI F Eiras, A Petrov, B Vidgen, CS de Witt, F Pizzati, K Elkins, ... arXiv preprint arXiv:2404.17047, 2024	5	2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv: 2401.05566). arXiv E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... Link to article, 2024	5	2024
Benchmarking specialized databases for high-frequency data F Barez, P Bilokon, R Xiong arXiv preprint arXiv:2301.12561, 2023	5	2023
Sycophancy to subterfuge: Investigating reward-tampering in large language models, 2024 C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ... URL https://arxiv.org/abs/2406.10162	5
Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small C Mathwin, G Corlouer, E Kran, F Barez, N Nanda URL: https://itch.io/jam/mechint/rate/1889871, 2023	4	2023
System III: Learning with Domain Knowledge for Safety Constraints F Barez, H Hasanbeig, A Abate NeurIPS ML Safety Workshop, 2022	4	2022
Discovering topics and trends in the UK Government web archive D Beavan, F Barez, M Bel, J Fitzgerald, E Goudarouli, K Kollnig, ... Data Study Group Final Report. Alan Turing Institute, London, 2021	4*	2021
Large language models relearn removed concepts M Lo, SB Cohen, F Barez arXiv preprint arXiv:2401.01814, 2024	3	2024
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models A Garde, E Kran, F Barez arXiv preprint arXiv:2310.01870, 2023	3	2023
Fairness in AI and Its Long-Term Implications on Society O Bohdal, T Hospedales, PHS Torr, F Barez arXiv preprint arXiv:2304.09826, 2023	3	2023
Exploring the advantages of transformers for high-frequency trading F Barez, P Bilokon, A Gervais, N Lisitsyn arXiv preprint arXiv:2302.13850, 2023	3	2023


Articles 1–20
