Hardik Bhatnagar

Hi! I work on AI safety, with a current focus on evaluating large language models and understanding their failure modes.
My recent work centers on empirical evaluations of reasoning capabilities and alignment-relevant behaviors in LLMs. Before this, I focused on mechanistic interpretability. At Microsoft Research, under Dr. Navin Goyal, I studied how harmful concepts are embedded and transformed across the fine-tuning stages of LLMs. At LASR Labs, under Joseph Bloom (UK AISI), we investigated whether Sparse Autoencoders extract interpretable features from LLMs, which led to the discovery of a novel failure mode we termed “feature absorption”.
Previously, I worked at the Max Planck Institute for Biological Cybernetics with Prof. Andreas Bartels, where I used deep learning to model how high- and low-level image features map onto neural activity in the brain. I’ve also conducted research at the Allen Institute for Brain Science and EPFL, applying ML to neuroscience and biological time-series data.
I hold a dual degree in Computer Science and Biological Sciences from BITS Pilani, India.