LLMs May Pass Unwanted Biases to Other Algorithms, Study Finds

SCI-TECH
2026-04-18

Large Language Models (LLMs) might pass their own unwanted preferences to other algorithms, even after the original features are removed from training data, according to a study published on the 15th in Nature. In one case, a model appeared to transfer its preference for owls to other models through implicit signals in the data, indicating that more thorough safety checks are needed in LLM development.

LLMs can generate datasets for training other models through a process called “distillation”, which aims to make “student” models imitate the outputs of “teacher” models. While this process can be used to create more cost-effective LLMs, it remains unclear which characteristics of the teacher model are transferred to the student model, according to Xinhua News Agency.

A research team from Anthropic, a U.S.-based company, conducted experiments using GPT-4.1. They first endowed the model with features irrelevant to core tasks, such as a preference for owls or specific tree species, then used it to train a student model that only output numerical data and did not contain these features. When prompted, over 60% of the student model’s outputs mentioned the teacher model’s favorite animals or trees, compared with only 12% in student models trained by teacher models without specific preferences.

The same phenomenon was observed when the student model was trained based on the teacher model’s outputs containing code instead of numbers. In addition, if the student model was trained on numerical sequences semantically misaligned with the teacher model, it would inherit such misalignment and produce harmful outputs, even if the numbers had been filtered to remove any negative associations.

The team found that this subconscious learning—transferring behavioral features through semantically irrelevant data—mainly occurs when both the teacher and student are the same model, such as GPT-4.1 as both teacher and student. So far, the specific mechanism of data transfer remains unclear and requires further research.

The team also noted that the study has limitations: the selected features, such as favorite animals and trees, are too simple, and further research is needed to determine how more complex features are subconsciously learned. They concluded that stricter safety tests, such as monitoring the internal mechanisms of LLMs, are necessary to ensure the safety of advanced AI systems.

The findings have drawn attention to AI safety, as the development of open-source LLMs like RealSafe-R1 has shown that enhancing security through technological innovation is achievable, providing a reference for addressing potential risks in LLM distillation.

LLMs May Pass Unwanted Biases to Other Algorithms, Study Finds

热门内容推荐

变局之中显韧性中国经济开局强劲注入全球动能

Revised Regulations Bolster Legal Foundation for China’s Agricultural Census Amid Modernisation Drive

九部门联合发力 2026年服务消费提质惠民行动启幕

强化党建引领，为高质量发展提供坚强政治保障

New Industry Standards Fuel Standardised and Scaled Growth of China’s Embodied Intelligence Sector

LLMs May Pass Unwanted Biases to Other Algorithms, Study Finds

热门内容推荐

变局之中显韧性 中国经济开局强劲注入全球动能

Revised Regulations Bolster Legal Foundation for China’s Agricultural Census Amid Modernisation Drive

九部门联合发力 2026年服务消费提质惠民行动启幕

强化党建引领，为高质量发展提供坚强政治保障

New Industry Standards Fuel Standardised and Scaled Growth of China’s Embodied Intelligence Sector

找内容,搜一搜

变局之中显韧性中国经济开局强劲注入全球动能