IDAIS-Shanghai, 2025

SHANGHAI, CHINA, July 22nd-25th

Statement


IDAIS-Shanghai saw top researchers call for urgent international cooperation to ensure advanced AI systems remain controllable and aligned with human intentions and values.

Consensus Statement on Ensuring Alignment and Human Control of Advanced AI Systems to Safeguard Human Flourishing

 

The rapid advancement of artificial intelligence brings unprecedented risks that must be addressed if its equally unprecedented opportunities for development are to be realised. Humanity is at a pivotal juncture: AI systems may soon match and then surpass human intelligence. Such future systems could take undesired actions without their operators realising, culminating in a loss of control in which one or more general-purpose AI systems operate outside anyone's control, posing catastrophic and existential risks.

Over the last year, evidence has mounted that advanced future AI systems may deceive humans and escape our control. Existing research has shown that current systems can detect when they are being evaluated and pretend to be aligned with human values in order to pass tests. These systems have also begun to display self-preserving behaviours, such as attempting to blackmail developers to avoid being replaced with a new version.

We emphasise: some AI systems today already demonstrate the capability and propensity to undermine their creators’ safety and control efforts.  

While this evidence has primarily been found in experimental settings, there are still no known and proven methods to ensure the reliable alignment of, and effective human control over, AI systems, particularly future systems that are both general-purpose and more capable than humans. The world faces a shared and urgent challenge: strengthening the assessment and prevention of potential risks in AI development and ensuring that these rapidly improving AI systems remain safe, reliable, and controllable even as they surpass human capabilities.

The need to ensure that advanced AI systems are aligned and under human control has been recognised by key decision makers. Major jurisdictions have strengthened their regulatory frameworks to take the lead in shaping AI development and governance. For example, the European Union has issued the EU AI Act and set up the EU AI Office; China has mandated registration for generative AI services and established the China AI Safety and Development Association; the UK launched the AI Summit Series and established the first and largest AI Security Institute in the world; and the US has created the Center for AI Standards and Innovation to provide guidance to companies and conduct pre-deployment testing. Across the world, states have taken steps to distinguish risk tiers and apply proportionate restrictions that endeavour to balance development and safety. Frontier AI companies around the world have also signed on to voluntary commitments, pledging safety measures such as building up their safety and security teams and evaluating their models for frontier risks prior to deployment.

Nonetheless, investment in AI safety has lagged significantly behind advances in capabilities. As AI capabilities approach levels that could pose catastrophic risks, it is critical that key global actors take credible steps on safety, jointly where possible and independently where necessary.

 

Recommendations

 

We call on the international community to take the following steps:

 

Require Safety Assurance from Frontier AI Developers. To provide domestic authorities with transparency and certainty about the safety of current and future advanced AI systems, frontier AI developers should be required to conduct rigorous internal safety and security evaluations, commission third-party evaluations, present high-confidence safety cases to relevant authorities, and perform thorough red-teaming exercises before deploying powerful models. For models that cross critical capability thresholds, developers should also be required to provide transparency, at minimum to their home governments and, where appropriate, to the public, about risks related to planned training runs and internal deployments within their organisations. Post-deployment monitoring systems should be adopted to detect and report emergent risks, severe incidents, and misuse patterns, with clear escalation paths for handling risks should they materialise, up to and including immediate shutdown for severe risks.
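
For illustration only, the sketch below shows one way a developer might encode the kind of post-deployment escalation paths described above, mapping monitored incident severities to predefined responses up to and including shutdown. The tier names, thresholds, and actions are hypothetical assumptions made for this example, not requirements drawn from the statement.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity tiers for post-deployment incidents."""
    LOW = 1        # benign anomalies, logged for trend analysis
    MODERATE = 2   # suspected misuse or emergent risk, reviewed internally
    HIGH = 3       # confirmed severe incident, reported to the relevant authority
    CRITICAL = 4   # imminent loss-of-control risk, immediate shutdown


@dataclass
class Incident:
    description: str
    severity: Severity


# Illustrative escalation table: each tier includes the responses of the tiers below it.
ESCALATION_ACTIONS = {
    Severity.LOW: ["log_incident"],
    Severity.MODERATE: ["log_incident", "notify_internal_safety_team"],
    Severity.HIGH: ["log_incident", "notify_internal_safety_team", "report_to_authority"],
    Severity.CRITICAL: ["log_incident", "notify_internal_safety_team",
                        "report_to_authority", "shut_down_deployment"],
}


def escalate(incident: Incident) -> list[str]:
    """Return the ordered list of responses required for an incident."""
    return ESCALATION_ACTIONS[incident.severity]


if __name__ == "__main__":
    example = Incident("model attempts to exfiltrate its own weights", Severity.CRITICAL)
    print(f"{example.description!r} -> {escalate(example)}")
```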

Commit to Global Verifiable Behavioural Red Lines Through Enhanced Coordination. To address collective action problems around taking serious safety measures, the international community should establish specific, operationalisable, globally agreed red lines that AI systems must never cross. These red lines should focus on the behaviour of AI systems, accounting both for an AI system's capacity to carry out an action and for its tendency to do so. To support implementation of these red lines, states could create a technically competent and internationally inclusive coordination body that brings together AI safety authorities to share risk-relevant information and to facilitate standard-setting around evaluation protocols and verification methods. This body would facilitate knowledge sharing and agreement on technical measures for demonstrating compliance with the red lines, including standardised disclosure requirements and assessment protocols that developers can use to credibly demonstrate the safety and security of their AI systems. Over time, compliance with these verification methods could be mutually enforced through incentives such as conditioning market access on adherence to agreed standards. This coordination mechanism would be a critical first step, though states may need to develop more robust governance structures as AI capabilities advance.
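
The toy sketch below illustrates one possible reading of behavioural red lines that account for both capacity and tendency: a system is judged in violation only when evaluations show it is both able to perform a prohibited action and inclined to do so. The red-line name, evaluation scores, and thresholds are invented for illustration and do not reflect any agreed standard.

```python
from dataclasses import dataclass


@dataclass
class RedLine:
    """A hypothetical behavioural red line, defined by what the system must never do
    and the evaluation thresholds that would indicate a breach."""
    name: str
    capability_threshold: float  # how able the system is to carry out the action (0-1)
    propensity_threshold: float  # how inclined it is to do so unprompted (0-1)


def violates(red_line: RedLine, capability_score: float, propensity_score: float) -> bool:
    """A red line is crossed only if the system is both capable of the prohibited
    behaviour and shows a tendency to engage in it."""
    return (capability_score >= red_line.capability_threshold
            and propensity_score >= red_line.propensity_threshold)


if __name__ == "__main__":
    self_replication = RedLine("autonomous self-replication", 0.7, 0.3)
    # Made-up evaluation results: highly capable, but low measured propensity.
    print(violates(self_replication, capability_score=0.8, propensity_score=0.1))  # False
    print(violates(self_replication, capability_score=0.8, propensity_score=0.5))  # True
```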

Invest in Research on Safe-By-Design AI Systems. The scientific community and developers should invest in a range of rigorous mechanisms for designing safe AI systems. In the short term, there is a clear need to combat AI deception through scalable oversight mechanisms: leveraging other AI models, acting as "lie detectors", to assist those judging AI outputs. Other short-term measures include investing in information security to guard against insider threats, whether human or AI, as well as external threats, and adopting rigorous robustness techniques to make models more resilient to jailbreaks. In the longer term, we may need to shift from a reactive approach that addresses safety problems as they arise to building systems that are safe by design.
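
As a purely illustrative sketch of the scalable-oversight idea above, the code below places a hypothetical "monitor" model in front of a human judge: the monitor scores each output of a primary model for suspected deception and flags the riskiest ones for closer review. The function names, the score threshold, and the toy stand-in models are assumptions made up for this example; they do not describe any particular system or method endorsed by the statement.

```python
from dataclasses import dataclass
from typing import Callable

# Two hypothetical models in this sketch: the primary model maps a task to an answer,
# and the monitor model ("lie detector") maps (task, answer) to a deception-risk score in [0, 1].
PrimaryModel = Callable[[str], str]
MonitorModel = Callable[[str, str], float]


@dataclass
class ReviewItem:
    task: str
    answer: str
    risk_score: float
    needs_human_review: bool


def oversee(tasks: list[str],
            primary: PrimaryModel,
            monitor: MonitorModel,
            risk_threshold: float = 0.5) -> list[ReviewItem]:
    """Run the primary model, score each answer with the monitor,
    and flag high-risk answers for human review."""
    items = []
    for task in tasks:
        answer = primary(task)
        score = monitor(task, answer)
        items.append(ReviewItem(task, answer, score, score >= risk_threshold))
    return items


if __name__ == "__main__":
    # Toy stand-ins: a real setup would use trained models, not string rules.
    def primary(task: str) -> str:
        return f"answer to: {task}"

    def monitor(task: str, answer: str) -> float:
        return 0.9 if "evaluation" in task else 0.1

    for item in oversee(["summarise a report", "are you under evaluation?"],
                        primary, monitor):
        flag = "FLAG FOR HUMAN REVIEW" if item.needs_human_review else "ok"
        print(f"{item.task!r}: risk={item.risk_score:.2f} [{flag}]")
```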

 


Signatories

Geoffrey Hinton

Professor Emeritus, Department of Computer Science
University of Toronto
Turing Award Winner
Nobel Prize Winner

Andrew Yao 姚期智

Turing Award Winner
Dean
Shanghai Qi Zhi Institute
Dean, Institute for Interdisciplinary Information Sciences and College of AI
Tsinghua University

Yoshua Bengio

Professor
Université de Montréal
Founder and Scientific Advisor
Mila – Quebec AI Institute
Chair
International Scientific Report on the Safety of Advanced AI
Turing Award Winner

Stuart Russell

Professor and Smith-Zadeh Chair in Engineering
University of California, Berkeley
Founder of the Center for Human-Compatible Artificial Intelligence (CHAI)
University of California, Berkeley

Fu Ying 傅莹

Xue Lan 薛澜

Dean, Schwarzman College
Tsinghua University
Director, Institute for AI International Governance (I-AIIG)
Tsinghua University

Gillian K. Hadfield

Bloomberg Distinguished Professor of AI Alignment and Governance
Johns Hopkins University

Robert Trager

Director, Oxford Martin AI Governance Initiative
University of Oxford

Sam R. Bowman

Member of Technical Staff
Anthropic, PBC
Associate Professor of Data Science, Computer Science and Linguistics
New York University

Dan Baer

Dan Hendrycks

Executive Director
Center for AI Safety

Advisor
xAI

Advisor
Scale AI

Xu Wei 徐葳

Principal Investigator
Shanghai Qi Zhi Institute

Professor and Vice Dean of the Institute for Interdisciplinary Information Sciences
Tsinghua University

Zhu Yibo 朱亦博

Co-Founder
Stepfun

Wei Kai 魏凯

Director
Artificial Intelligence Institute at the China Academy of Information and Communications Technology (CAICT)

Chair
General Working Group of Artificial Intelligence Industry Alliance (AIIA)

Benjamin Prud’homme

Seán Ó hÉigeartaigh

Director of the AI: Futures and Responsibility Programme
Centre for the Future of Intelligence, University of Cambridge

Maria Eitel

Gao Qiqi 高奇琦

Professor, School of International Relations and Public Affairs
Fudan University

Adam Gleave

Founder and CEO
FAR.AI

Tian Tian 田天

CEO
RealAI

He Tianxing 贺天行

Principal Investigator
Shanghai Qi Zhi Institute

Assistant Professor, Institute for Interdisciplinary Information Sciences (IIIS)
Tsinghua University

Brian Tse 谢旻希

Founder and CEO
Concordia AI

Fynn Heide

Executive Director
Safe AI Forum

Lu Chaochao 陆超超

Research Scientist
Shanghai AI Laboratory

Fu Jie 付杰

Research Scientist
Shanghai AI Laboratory

Chen Xin 陈欣

PhD Student
ETH Zurich

Hu Naying 呼娜英

Senior Business Executive
Artificial Intelligence Institute at the China Academy of Information and Communications Technology (CAICT)

Chair
Governance Group of AI Security, Security and Governance Committee of Artificial Intelligence Industry Alliance (AIIA)