Hardik Bhatnagar

Hi! I work on AI safety, with a current focus on evaluating large language models and understanding their failure modes.
My recent work centers on empirical evaluations of reasoning capabilities and alignment-relevant behaviors in LLMs. Before this, I focused on mechanistic interpretability. At Microsoft Research, under Dr. Navin Goyal, I studied how harmful concepts are embedded and transformed across the fine-tuning stages of LLMs. At LASR Labs, under Joseph Bloom (UK AISI), we investigated whether Sparse Autoencoders extract interpretable features from LLMs, which led to the discovery of a novel failure mode we termed “feature absorption”.
Previously, I worked at the Max Planck Institute for Biological Cybernetics with Prof. Andreas Bartels, where I used deep learning to model how high- and low-level image features map onto neural activity in the brain. I’ve also conducted research at the Allen Institute for Brain Science and EPFL, applying ML to neuroscience and biological time-series data.
I hold a dual degree in Computer Science and Biological Sciences from BITS Pilani, India.