Fazl Barez
Title
Cited by
Year
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
J Hoelscher-Obermaier*, J Persson*, E Kran, I Konstas, F Barez*
Findings of the Association for Computational Linguistics 2023, 11548–11559, 2023
Cited by: 33; Year: 2023
PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration
P Li, H Tang, T Yang, X Hao, T Sang, Y Zheng, J Hao, ME Taylor, Z Wang, ...
arXiv preprint arXiv:2203.08553, 2022
Cited by: 29; Year: 2022
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by: 24; Year: 2024
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python
AVM Barone*, F Barez*, I Konstas, SB Cohen
The 61st Annual Meeting Of The Association For Computational Linguistics, 2023
Cited by: 22*; Year: 2023
Neuron to Graph: Interpreting Language Model Neurons at Scale
A Foote*, N Nanda, E Kran, I Konstas, S Cohen, F Barez*
arXiv preprint arXiv:2305.19911, 2023
Cited by: 14; Year: 2023
Understanding Addition in Transformers
P Quirke, F Barez
International Conference on Learning Representations (ICLR), 2023
Cited by: 7; Year: 2023
Benchmarking specialized databases for high-frequency data
F Barez, P Bilokon, R Xiong
arXiv preprint arXiv:2301.12561, 2023
Cited by: 5; Year: 2023
System III: Learning with Domain Knowledge for Safety Constraints
F Barez, H Hasanbeig, A Abate
NeurIPS ML Safety Workshop, 2022
Cited by: 4; Year: 2022
Discovering topics and trends in the UK Government web archive
D Beavan, F Barez, M Bel, J Fitzgerald, E Goudarouli, K Kollnig, ...
Data Study Group Final Report. Alan Turing Institute, London, 2021
Cited by: 4*; Year: 2021
Large language models relearn removed concepts
M Lo, SB Cohen, F Barez
arXiv preprint arXiv:2401.01814, 2024
Cited by: 3; Year: 2024
Exploring the advantages of transformers for high-frequency trading
F Barez, P Bilokon, A Gervais, N Lisitsyn
arXiv preprint arXiv:2302.13850, 2023
Cited by: 3; Year: 2023
Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small
C Mathwin, G Corlouer, E Kran, F Barez, N Nanda
URL: https://itch.io/jam/mechint/rate/1889871, 2023
Cited by: 3; Year: 2023
Near to mid-term risks and opportunities of open source generative AI
F Eiras, A Petrov, B Vidgen, CS de Witt, F Pizzati, K Elkins, ...
arXiv preprint arXiv:2404.17047, 2024
Cited by: 2; Year: 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
P Quirke, C Neo, F Barez
arXiv preprint arXiv:2402.02619, 2024
Cited by: 2; Year: 2024
Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model
M Lan, F Barez
https://arxiv.org/abs/2311.04131, 2023
Cited by: 2*; Year: 2023
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
A Garde, E Kran, F Barez
arXiv preprint arXiv:2310.01870, 2023
Cited by: 2; Year: 2023
Fairness in AI and Its Long-Term Implications on Society
O Bohdal*, T Hospedales, PHS Torr, F Barez*
arXiv preprint arXiv:2304.09826, 2023
Cited by: 2; Year: 2023
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ...
arXiv preprint arXiv:2406.10162, 2024
Cited by: 1; Year: 2024
Risks and Opportunities of Open-Source Generative AI
F Eiras, A Petrov, B Vidgen, C Schroeder, F Pizzati, K Elkins, ...
arXiv preprint arXiv:2405.08597, 2024
Cited by: 1; Year: 2024
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models
L Marks, A Abdullah, C Neo, R Arike, P Torr, F Barez
https://arxiv.org/abs/2310.08164, 2024
Cited by: 1*; Year: 2024