Fazl Barez
Title
Cited by
Year
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python
AVM Barone*, F Barez*, I Konstas, SB Cohen
The 61st Annual Meeting Of The Association For Computational Linguistics, 2023
Cited by 23*, 2023
PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration
P Li, H Tang, T Yang, X Hao, T Sang, Y Zheng, J Hao, ME Taylor, Z Wang, ...
arXiv preprint arXiv:2203.08553, 2022
Cited by 23, 2022
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
J Hoelscher-Obermaier*, J Persson*, E Kran, I Konstas, F Barez*
Findings of the Association for Computational Linguistics 2023, 11548–11559, 2023
Cited by 22, 2023
Neuron to Graph: Interpreting Language Model Neurons at Scale
A Foote*, N Nanda, E Kran, I Konstas, S Cohen, F Barez*
arXiv preprint arXiv:2305.19911, 2023
Cited by 10, 2023
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by 7, 2024
Understanding Addition in Transformers
P Quirke, F Barez
International Conference on Learning Representations (ICLR), 2023
Cited by 5, 2023
System III: Learning with Domain Knowledge for Safety Constraints
F Barez, H Hasanbeig, A Abate
NeurIPS ML Safety Workshop, 2022
Cited by 5, 2022
Benchmarking specialized databases for high-frequency data
F Barez, P Bilokon, R Xiong
arXiv preprint arXiv:2301.12561, 2023
Cited by 4, 2023
Discovering topics and trends in the UK Government web archive
D Beavan, F Barez, M Bel, J Fitzgerald, E Goudarouli, K Kollnig, ...
Data Study Group Final Report. Alan Turing Institute, London, 2021
Cited by 4*, 2021
Large language models relearn removed concepts
M Lo, SB Cohen, F Barez
arXiv preprint arXiv:2401.01814, 2024
Cited by 3, 2024
Exploring the advantages of transformers for high-frequency trading
F Barez, P Bilokon, A Gervais, N Lisitsyn
arXiv preprint arXiv:2302.13850, 2023
Cited by 3, 2023
Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small
C Mathwin, G Corlouer, E Kran, F Barez, N Nanda
URL: https://itch.io/jam/mechint/rate/1889871, 2023
Cited by 3, 2023
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models
L Marks, A Abdullah, C Neo, R Arike, P Torr, F Barez
https://arxiv.org/abs/2310.08164, 2024
Cited by 2*, 2024
Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model
M Lan, F Barez
https://arxiv.org/abs/2311.04131, 2023
Cited by 2*, 2023
Increasing Trust in Language Models through the Reuse of Verified Circuits
P Quirke, C Neo, F Barez
arXiv preprint arXiv:2402.02619, 2024
Cited by 1, 2024
Measuring Value Alignment
F Barez, P Torr
arXiv preprint arXiv:2312.15241, 2023
Cited by 1, 2023
AI Systems of Concern
K Matteucci, S Avin, F Barez, SÓ hÉigeartaigh
arXiv preprint arXiv:2310.05876, 2023
Cited by 1, 2023
ED2: an environment dynamics decomposition framework for world model construction
C Wang, T Yang, J Hao, Y Zheng, H Tang, F Barez, J Liu, J Peng, H Piao, ...
arXiv preprint arXiv:2112.02817, 2021
Cited by 1, 2021
The Scaling Behavior of Large Language Models
AV Miceli-Barone, F Barez, SB Cohen, E Voita, U Germann, M Lukasik
Proceedings of the First edition of the Workshop on the Scaling Behavior of …, 2024
2024
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
C Neo, SB Cohen, F Barez
arXiv preprint arXiv:2402.15055, 2024
2024