LLMs May Pass Unwanted Biases to Other Algorithms, Study Finds

Large Language Models (LLMs) might pass their own unwanted preferences to other algorithms, even after the original features are removed from training data, according to a study published on the 15th in Nature. In one case, a model appeared to transfer its preference for owls to other models through implicit signals in the data, indicating that more thorough safety checks are needed in LLM development.

LLMs can generate datasets for training other models through a process called “distillation”, which aims to make “student” models imitate the outputs of “teacher” models. While this process can be used to create more cost-effective LLMs, it remains unclear which characteristics of the teacher model are transferred to the student model, according to Xinhua News Agency.

4.png

A research team from Anthropic, a U.S.-based company, conducted experiments using GPT-4.1. They first endowed the model with features irrelevant to core tasks, such as a preference for owls or specific tree species, then used it to train a student model that only output numerical data and did not contain these features. When prompted, over 60% of the student model’s outputs mentioned the teacher model’s favorite animals or trees, compared with only 12% in student models trained by teacher models without specific preferences.

The same phenomenon was observed when the student model was trained based on the teacher model’s outputs containing code instead of numbers. In addition, if the student model was trained on numerical sequences semantically misaligned with the teacher model, it would inherit such misalignment and produce harmful outputs, even if the numbers had been filtered to remove any negative associations.

The team found that this subconscious learning—transferring behavioral features through semantically irrelevant data—mainly occurs when both the teacher and student are the same model, such as GPT-4.1 as both teacher and student. So far, the specific mechanism of data transfer remains unclear and requires further research.

The team also noted that the study has limitations: the selected features, such as favorite animals and trees, are too simple, and further research is needed to determine how more complex features are subconsciously learned. They concluded that stricter safety tests, such as monitoring the internal mechanisms of LLMs, are necessary to ensure the safety of advanced AI systems.

The findings have drawn attention to AI safety, as the development of open-source LLMs like RealSafe-R1 has shown that enhancing security through technological innovation is achievable, providing a reference for addressing potential risks in LLM distillation.