Open Access Article
Renan Gonçalves Leonel da Silva (a, b, c), Li Du (d, e) and Gil Eyal* (c, f)
aDepartment of Sociology, University of São Paulo, São Paulo, SP, Brazil
bErna D. and Henry J. Leir Research Institute for Business, Technology and Society, Martin Tuchman School of Management, New Jersey Institute of Technology, Newark, NJ, USA
cTrust Collaboratory, INCITE, Columbia University, New York, NY, USA
dFaculty of Law, University of Macau, Macau, China
eAsia-Pacific Academy of Economics and Management, University of Macau, Macau, China
fDepartment of Sociology, Columbia University, New York, NY, USA. E-mail: ge2027@columbia.edu
First published on 8th June 2026
Artificial Intelligence (AI) is prompting scientists to reflect on the shifting role of human judgment, interpretation, and oversight in experimental practice. As AI increasingly assumes critical roles in scientific discovery, innovation, and academic labor, new paradoxes are emerging around the question of keeping humans in the loop. These paradoxes are not simply about whether humans should remain present, but about how they can remain meaningfully engaged with increasingly opaque AI-driven discovery systems. In this opinion, we examine how the promises of AI-augmented research infrastructures coexist with difficult questions about how to engage with automated and intelligent apparatuses without eroding human oversight, core scientific values such as safety and responsibility, or the broader societal relevance of scientists.
Today, similar ironies are resurfacing in the context of Digital Discovery: an emerging paradigm transforming fields such as drug discovery, materials science, and catalysis, shortening discovery timelines from years to months or even weeks.2,3 This field refers to the use of computational tools, artificial intelligence, and data-driven approaches to accelerate the identification and development of new materials, molecules, and chemical processes. Rather than relying solely on traditional trial-and-error experimentation, digital discovery integrates machine learning, high-throughput simulations, and automated laboratories to navigate vast chemical spaces more efficiently. It bridges the gap between computational prediction and experimental validation, enabling researchers to prioritize the most promising candidates before committing laboratory resources. Digital discovery does not replace human expertise but augments it, allowing scientists to focus on higher-order interpretation and decision-making.4
Advanced models and automated experimentation platforms promise unprecedented speed and insight, yet they raise fundamental questions about whether humans can, or should, be sidelined from core decision points, interpretation, and judgment. As in Bainbridge's analysis, removing humans from routine tasks can inadvertently make their remaining involvement both more crucial and more difficult, especially in domains where tacit knowledge, creativity, and ethical reasoning are essential.
Similarly, Endsley (2023)5 extends Bainbridge's insight to contemporary AI systems, showing that AI's cognitive focus produces its own set of paradoxes: the more capable and adaptive AI becomes, the harder it is for humans to understand its behavior, limitations, and biases, even as humans are expected to monitor and intervene when necessary. Endsley identifies how opaqueness and over-reliance on AI can erode human situational awareness and decision capacity, paradoxically making human oversight both more essential and more difficult (Fig. 1 illustrates this paradox, highlighting the collaborative effort of a research team overseeing an AI system).
Fig. 1 The AI ‘Black Box’ paradox, visualized. As system opaqueness grows, human situational awareness erodes, leaving operators to manage a system they can no longer interrogate.
These “ironies” foreground why debates about keeping humans in the loop are not simply about whether humans should be present, but about how humans can remain meaningfully engaged with increasingly inscrutable AI-driven discovery systems. The recent A-Lab episode at Lawrence Berkeley National Laboratory offers a striking contemporary illustration of this tension.
Far from merely a technical mistake, the episode exposed urgent issues around oversight, transparency, and scientific responsibility in an era of autonomous experimentation. The event catalyzed a broader debate on transparency and the ‘black box’ nature of automated discovery. It additionally underscores the evolving role of social media as a space for real-time peer review and highlights the urgent need for institutionalized standards of responsibility in AI laboratories.8
This controversy marks an important inflection point. On the one hand, automated experiments, self-learning systems, and large data-driven models promise speed and scale previously inconceivable.9 On the other hand, it is important to recognize the unquestionable success achieved by several tools developed for that purpose. While the A-Lab episode serves as a cautionary tale, successful AI-human collaborations demonstrate the technology's constructive potential. When AI is used to navigate vast chemical spaces while humans maintain oversight of the experimental validation, it can significantly accelerate the discovery of robust, reproducible materials.
One compelling example is the work of Szymanski et al. (2023),2 who demonstrated the power of AI-human collaboration in accelerating the discovery of novel battery materials. Using an autonomous laboratory platform integrated with machine learning-guided synthesis and human expert validation, the team identified and experimentally confirmed several previously unreported inorganic compounds within a fraction of the time conventional approaches would require. Crucially, human researchers remained embedded in the validation loop, ensuring that AI-proposed candidates were subjected to rigorous experimental scrutiny before claims of discovery were advanced—a workflow that produced both speed and credibility.
A second notable case is the BELLA platform developed by Burger et al. (2020),10 in which a mobile robotic chemist autonomously performed thousands of experiments to optimize the photocatalytic activity of organic semiconductor materials. The system operated continuously, navigating a large experimental parameter space far beyond what a human team could feasibly explore manually, while researchers defined the boundaries, interpreted emergent trends, and guided strategic pivots in the investigation. The resulting discoveries were independently reproducible and experimentally well-characterized, illustrating that when AI autonomy is paired with clearly defined human oversight structures, self-driving labs (SDLs) can deliver on their promise of accelerating robust and trustworthy scientific knowledge.
As algorithmic cultures take deeper root in labs, a pressing question emerges: how can the core values that have fostered scientific success survive when algorithms mediate experiment, interpretation, and validation?
As these cultures deepen, they compel scientists to reexamine the human values that supported science's trajectory for decades. While computing has long played a role in labs, the moment has come to shift from focusing purely on change to asking which institutional norms, values, and principles made breakthrough science possible in the first place. Among these dimensions, one stands out: increasing skepticism among scientists about AI's role in research.
Similar results have been widely published from different surveys and opinion panels administered to scientists and academic researchers between 2023 and 2025. Asked about their attitudes toward AI in their labs and academic work as a whole, scientists surveyed express concern and increasing levels of skepticism with AI tools in scientific research (especially in highly innovative domains of scientific discovery) (see Tables 1 and 2).
| Year | Survey/panel | Population/scope | Key findings (attitudes toward AI in research/labs) | Trend highlights (skepticism & pragmatism) |
|---|---|---|---|---|
| 2023 | Artificial intelligence survey (SciOPS panel) | U.S. academic scientists (n ≈ 777; valid responses 232) | Early descriptive data on perceptions and use of generative AI tools for teaching/research tasks; highlights varied comfort and uptake. (SciOPS) | Captures baseline mixed attitudes; a segment remains hesitant or non-users |
| 2023 | Nature & related research reporting on ∼1600 scientists | Global researchers (Nature survey reports) | ∼30% of surveyed scientists reported using GenAI for writing, literature reviews, and grant tasks; concern about ethical dimensions noted. (Springer) | Reflects early ethical qualms despite adoption; indicates conditions on acceptable use |
| 2024 | Generative AI usage by researchers (arXiv, Dorta-González et al.) | Broad researcher sample drawn from various workplaces | Examines how demographics (gender, career stage) and barriers influence AI uptake in research workflows. (arXiv) | Highlights structural and personal barriers; not simply positive uptake |
| 2024 | Survey on GenAI in Danish universities | Danish researchers (n ≈ 2534) | Detailed mapping of GenAI tool use across research phases; varied views on research integrity implications. (ScienceDirect) | Indicates nuanced views: accepted for some tasks, controversial for rigorous research stages |
| 2025 | Generative AI and academic scientists in US universities (PLoS One/National survey) | U.S. academic scientists (n = 232) | 65% used GenAI for teaching/research; 78% cited misinformation concerns, many want institutional/governance safeguards. (PLoS) | Strong evidence of adoption with heightened caution, especially about reliability and ethical governance |
| 2025 | Social Scientists on the Role of AI in Research (arXiv) | Social science researchers (n = 284 + interviews) | Increased use of AI tools but greater ethical and trust concerns (e.g., black-box systems, deskilling) compared to traditional ML (arXiv) | Shows field-specific skepticism toward less transparent AI methods |
| 2025 | Elsevier's global “Researcher of the Future” survey (3000 researchers) | International researchers | 58% use AI in research; but only 27% feel adequately trained and only ∼23% trust AI ethics, with distinct regional skepticism. (https://www.elsevier.com) | Highlights broader concerns about governance, trust, and proper training, all key markers of pragmatic attitudes |

| Major pattern | Description |
|---|---|
| Rapid adoption coupled with uneven confidence | Across multiple surveys, a majority of researchers report using AI for research-related activities (e.g., writing, data analysis, literature review). At the same time, many express unease about validity, reliability, and oversight. For instance, in the 2025 PLoS One survey, while 65% reported using generative AI, 78% identified misinformation as a primary concern, illustrating adoption without full trust |
| Ethical and epistemic concerns are prominent | Surveys and qualitative studies—particularly among social scientists—highlight ethical and epistemic worries, including automation bias, deskilling, opacity of black-box models, and challenges to scientific accountability. These concerns are often sharper than those associated with earlier statistical or rule-based tools and motivate calls for stronger governance, transparency, and critical human mediation in laboratories |
| Divergent attitudes by career stage, discipline, and region | Attitudes toward AI are heterogeneous rather than uniform. Demographic analyses show that senior researchers, early-career scholars, and researchers with differing computational expertise adopt AI at different paces and express varying degrees of skepticism. Disciplinary cultures and regional research infrastructures further shape how AI is evaluated and trusted |
| Growing demand for governance and training | Across national and institutional contexts, researchers consistently report insufficient training and low confidence in existing governance frameworks. This gap contributes to pragmatic caution: skepticism is driven less by resistance to innovation and more by awareness of methodological, ethical, and organizational risks associated with unregulated AI use |
| Ethical conditions shape acceptable use | Large-scale surveys, including Nature's survey of more than 5000 academics, show strong support for disclosure requirements and ethical boundaries regarding AI use in research. While many accept AI assistance for drafting or exploratory tasks, consensus weakens for higher-stakes activities such as peer review, authorship attribution, or evaluative decision-making |
This shift represents less a rejection than a maturing and pragmatic engagement. In the absence of clear external guidelines, scientists' growing caution functions as an essential, ad-hoc form of risk management. As scientists gain direct experience with AI systems, they confront not just technical limitations, but certain value dilemmas provoked by integrating these systems into the process of scientific research. When researchers are confronted with AI hallucinations, namely with models confidently asserting falsehoods, they are reminded that oversight, interpretive judgment, and ethical sensibility cannot be automated away, and that caution, humility, and accountability must remain human anchors when machines err or mislead.
Thus, sustained engagement with AI as a research aid acts as a corrective to hype. Early enthusiasm assumed that more compute and data would reliably yield better models, but when practitioners observe hallucinations and opaque failures, they begin to reassert human values in the design, deployment, and oversight of these systems. The result is not rejection but recalibration: scientists emphasize not only what AI can do but how it aligns with human values and scientific norms. The Wiley survey's paradox (i.e., less trust among more experienced users, Table 1) can be read as a turning point.
Persistent AI hallucinations are indeed among the key reasons why senior researchers adopt cautious behavior when using AI for scientific purposes. In the field of chemistry, AI hallucinations take the specific and problematic form of high-stakes failure modes such as hallucinated reactivity, stoichiometry violations, and flawed stereochemical reasoning. These errors highlight exactly where human expertise, grounded in physical laws, remains a critical corrective to probabilistic models. This problem should also remind us that the progress of science depends not only on “organized skepticism,” but also on trust in the integrity and expertise of other scientists.
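To make the idea of a physical-law corrective concrete, a stoichiometry violation can be caught with a simple element-balance test on a proposed reaction. The following is a minimal sketch under stated assumptions: the parser only handles plain formulas without parentheses, charges, or hydrates, and the helper names `parse_formula`, `side_total`, and `is_balanced` are hypothetical, not part of any tool discussed in this article.

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'H2O'.

    Assumption: no parentheses, charges, or hydrates.
    """
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side: dict[str, int]) -> Counter:
    """Sum atom counts over one side of a reaction given as {formula: coefficient}."""
    total = Counter()
    for formula, coefficient in side.items():
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(reactants: dict[str, int], products: dict[str, int]) -> bool:
    """Return True when every element appears equally often on both sides."""
    return side_total(reactants) == side_total(products)


# 2 H2 + O2 -> 2 H2O conserves atoms; H2 + O2 -> H2O does not.
balanced = is_balanced({"H2": 2, "O2": 1}, {"H2O": 2})
unbalanced = is_balanced({"H2": 1, "O2": 1}, {"H2O": 1})
```

A check of this kind cannot judge reactivity or stereochemistry, which is where expert judgment remains necessary; it only flags proposals that break conservation of mass at the level of atom counts.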
The introduction of AI to scientific research, however, subjects these traditions to a stress test. Algorithms increasingly perform tasks once exclusive to skilled scientists: data analysis, experiment recommendation, hypothesis generation. In computational chemistry and materials science, automated screening, complex machine learning models, and laboratory automation are being deployed to simulate molecular behaviors, predict properties, and propose experimental paths. But speed brings distance: from experimentation, from oversight, and sometimes from the trust that grounds scientific legitimacy.
The A-Lab case illustrates this tension. AI systems are often “black boxes” resistant to inspection. Scientific practice demands more than fleeting success rates: it demands interpretability, reproducibility, and clarity.8,15 When the workings of a model are inscrutable, trust becomes brittle, and the epistemic foundation of science is threatened, with broader implications for academic research and the public reputation of scientists beyond the lab's walls.16
Risks of AI overreliance were recently highlighted in a preprint titled “The White Elephant in the Lab”.18 In this work, researchers active in the field of digital molecular discovery raise critical concerns about the limitations of generative models. They argue that, regardless of a model's sophistication, if the molecules it proposes cannot be synthesized, its practical value in laboratory settings is severely limited. For this reason, researchers have been working to improve those systems so that scientists and engineers can engage with the tools while they are still under development, testing, and prototyping, ultimately ensuring proper human oversight of multiple steps of the experimentation process. In other words, meaningful engagement requires scientists to use computational tools that facilitate human-in-the-loop validation. For instance, platforms like AiZynthFinder or ASKCOS provide retrosynthetic route predictions that allow human experts to vet the feasibility of AI-generated molecules, transforming the AI from a ‘black box’ into a collaborative partner. Despite the successful deployment of such platforms and of generative models supporting new tools designed to increasingly automate decision-making in molecular design, there is a tendency to privilege algorithmic output over empirical validation and expert judgment. The preprint underscores how excessive trust in AI-driven predictions can obscure fundamental chemical constraints and marginalize human expertise. This overreliance is often unintentionally encouraged by a policy landscape that rewards AI-driven productivity while underfunding the meticulous, time-consuming work of experimental validation and ethical scrutiny. The risk of such funding bias is that it will accelerate a process of deskilling also observed in other expert domains when AI systems are integrated.
In the context of the design, prototyping, and deployment of self-driving labs (SDLs), deskilling refers to the gradual erosion of hands-on experimental expertise among researchers and scientists, especially those at the beginning of their careers, as increasingly automated systems take over the physical and cognitive tasks traditionally performed by humans in the laboratory. As robotic platforms, AI-driven decision-making, and automated workflows handle more of the experimental process (sample preparation, instrument operation, data collection, parameter optimization), experienced scientists may still draw on direct, tacit knowledge built through years of manual experimentation, but young scientists, coming of age in a world of SDLs, may not be able to develop these skills. They will lack the tacit knowledge of instrument behavior (e.g., how a pipette “feels” when something is off), troubleshooting intuition, contextual judgment about when an automated result should be trusted or questioned, and physical intuition about materials, reagents, and equipment quirks. Over time, even experienced researchers will begin to lose these skills, per the adage “use it or lose it”.
The threat of deskilling is acute even when AI handles seemingly routine tasks, because their expert execution often relies on strategic chemical knowledge that remains tacit. Capabilities such as convergent synthesis planning, protecting group strategies, and stereochemical control require a level of nuanced judgment and ‘chemical intuition’ that current algorithmic systems cannot replicate.
In this delicate balance, the danger is twofold: overreliance can embed systematic errors, and deskilling can prevent these errors from being recognized until it is too late. Yet excessive skepticism will likely deter innovation. In this sense, “The White Elephant in the Lab” exemplifies how the opacity and abstraction of AI systems can inadvertently erode the epistemic foundations of scientific inquiry, transforming tools meant to assist discovery into sources of uncertainty and misplaced confidence.
The central question becomes not whether to trust AI but how to integrate AI so that it complements, rather than supplants, human judgment, dissent, and revision within scientific communities.
Furthermore, ensuring chemical safety in self-driving labs (SDLs) is being pursued as a top priority for the reasons raised in this opinion, such as the risk to scientists' reputation from discredited or non-reproducible/non-replicable discoveries. Recent frameworks like Safe-SDL and monitoring tools like Chemist Eye establish necessary safety boundaries, reminding us that responsible AI integration is as much about physical risk mitigation as it is about data integrity.
Chemical safety in AI-driven laboratories represents a critical and increasingly well-defined dimension of responsible SDL development. Leong et al. (2025)19 provide a foundational overview of safety considerations specific to self-driving laboratories, outlining strategies for steering autonomous systems toward safe operational practice. This overview is complemented by Chemist Eye (Munguia-Galeano et al., 2025),20 a real-time safety monitoring tool designed to detect and flag hazardous conditions within SDL environments. At the hardware level, Longley et al. (2026)21 present RobInHood, a robotic chemist platform engineered to operate within a fume hood, directly addressing the containment and ventilation challenges inherent to automated chemical synthesis. Finally, Zhang et al. (2026)22 propose Safe-SDL, a framework that embeds explicit safety boundaries into AI-driven experimental workflows, ensuring that autonomous decision-making does not exceed acceptable chemical or operational risk thresholds. Together, these contributions offer a multi-layered view of safety in SDLs, bridging ethical oversight with practical risk mitigation across hardware design, real-time monitoring, and algorithmic constraint.
In the A-Lab aftermath, integrity was restored not by better algorithms but by communal critique and expert reanalysis.7 This outcome vindicates the self-correcting ideal of science but also underscores the need for heightened vigilance as research grows ever faster and more complex.
Trust in AI-enhanced science is fundamentally social and collective. Ethical responsibility demands more than reliable code; it requires accountability across teams, institutions, and scholarly communities. Transparent correction, peer challenge, and the willingness to contest error are virtues that have long defined responsible science, but are we doing enough to preserve them?
This question concerns not merely professional survival but the identity and purpose of science itself. The A-Lab episode offers a clear answer: human expertise remains indispensable for validating, contextualizing, and critically evaluating outputs that AI cannot fully explain or defend. Palgrave and Schoop's reanalysis depended on domain knowledge, skeptical inquiry, and nuanced interpretive judgment. Those are capacities no current AI system matches.
The challenge is not to resist algorithms wholesale but to redefine scientific expertise in a complementary relationship with them, knowing where human judgment should lead and where algorithmic power should be harnessed responsibly.
The accelerated preprint ecosystem intensifies this threat. Between 2018 and 2024, submissions to ChemRxiv grew from approximately 1200 to over 9600, while arXiv submission rates in relevant computational science fields nearly tripled. The deluge of unreviewed content both speeds knowledge exchange and magnifies vulnerabilities in peer review.24
Rapid dissemination poses ethical as well as technical challenges. What are the consequences of claims bypassing vetting? How do unreviewed findings affect public trust, resource distribution, or academic careers? Retractions, reputational damage, and confusion are on the rise, often heightened by AI-driven hype.24 The ethical burden on authors and editors calls for rigor, transparency, and humility in a system that favors the bold claims and rapid output characteristic of contemporary science.
However, this supportive wave of policy has significantly outpaced the development of corresponding ethical and legal frameworks specifically designed for AI in scientific contexts.25 For example, although China's Interim Measures for the Management of Generative Artificial Intelligence Services is recognized as the world's first rule on GenAI, its scope explicitly excludes scientific research. While general AI ethics principles (e.g., fairness, transparency, accountability) are widely adopted, their translation into concrete guidelines, standards, and oversight mechanisms for laboratory practice remains limited.26 This creates a critical governance gap: scientists and institutions are incentivized and equipped to use AI at an unprecedented scale and speed, yet are left without clear guardrails on how to use it responsibly. The A-Lab incident is a symptomatic failure of this gap; the drive to demonstrate AI's transformative potential collided with the absence of mandated protocols for algorithmic validation, transparency, and pre-publication audit.27,28 The result is a systemic tension where the imperative for rapid innovation risks sidelining the procedural rigor and caution that have traditionally governed high-stakes discovery.29
Although there is a general lack of ethical and regulatory frameworks to regulate AI-driven scientific research, the governance gap can be closed in other ways. The affirmation of human values can be translated into concrete practical suggestions for institutional and procedural innovations. The scientific community must adopt actionable frameworks that embed ethics and oversight into the AI-driven research lifecycle. A critical first step is the adoption of algorithmic pre-registration. Mirroring the rigor of clinical trials, high-stakes AI-driven discovery pipelines should mandate the pre-registration of model architectures, training data parameters, and validation protocols.
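One way to picture algorithmic pre-registration is as a record that is frozen and fingerprinted before any experiment runs, so that later claims can be checked against it. The sketch below is an illustration under stated assumptions, not an established standard: the class `PreRegistration`, its three fields, and the helper functions are hypothetical names chosen to mirror the items listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreRegistration:
    """Hypothetical pre-registration record; field names are illustrative."""

    model_architecture: str
    training_data: str
    validation_protocol: str


def fingerprint(record: PreRegistration) -> str:
    """Return a SHA-256 digest of the record, serialized with sorted keys."""
    payload = json.dumps(asdict(record), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def matches_registration(record: PreRegistration, registered_digest: str) -> bool:
    """Check that the pipeline in use is the one that was registered."""
    return fingerprint(record) == registered_digest


# Register once, before the study starts (all values are placeholders).
registered = PreRegistration(
    model_architecture="graph neural network, 4 layers",
    training_data="dataset snapshot 2025-01",
    validation_protocol="held-out test set plus expert review",
)
registered_digest = fingerprint(registered)

# Any later change to the protocol produces a different digest.
altered = PreRegistration(
    model_architecture="graph neural network, 4 layers",
    training_data="dataset snapshot 2025-01",
    validation_protocol="held-out test set only",
)
```

The digest itself proves nothing about the quality of the protocol; its role is only to make silent, after-the-fact changes detectable by reviewers.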
Simultaneously, the mechanisms of scholarly validation must evolve. Journals and preprint servers must cultivate specific reviewer competencies and implement mandatory checklists for AI-involved research. These should require authors to disclose model limitations, training data biases, and provide full code accessibility. The goal is to shift review from a passive assessment of outputs to an active scrutiny of the process of algorithmic discovery, ensuring the methodology itself is sound, transparent, and ethically conducted.
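A mandatory checklist of this kind could be enforced mechanically at submission time. The following minimal sketch assumes a submission is represented as a plain dictionary; the key names in `REQUIRED_DISCLOSURES` and the function `missing_disclosures` are hypothetical and simply mirror the three disclosures named above.

```python
# Hypothetical checklist keys mirroring the disclosures named in the text.
REQUIRED_DISCLOSURES = ("model_limitations", "training_data_biases", "code_url")


def missing_disclosures(submission: dict) -> list[str]:
    """Return the required items that are absent or blank in a submission."""
    missing = []
    for item in REQUIRED_DISCLOSURES:
        value = submission.get(item)
        # Treat non-string or whitespace-only values as not disclosed.
        if not isinstance(value, str) or not value.strip():
            missing.append(item)
    return missing


# Example: limitations are disclosed, biases are absent, the code link is blank.
example = {"model_limitations": "not validated on air-sensitive compounds", "code_url": "  "}
gaps = missing_disclosures(example)
```

Such a check only confirms that a disclosure exists; judging whether it is adequate remains the reviewer's task.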
Ultimately, funding agencies and research institutions have a powerful role to play by mandating “value-by-design” principles in grants for AI-enabled science. Research proposals could be required to explicitly outline how human oversight, explainability, and ongoing ethical review are structurally embedded within the experimental workflow. This approach would require that human judgment be a built-in, governing feature of the research system itself. By implementing these three pillars (rigorous pre-registration, evolved peer critique, and value-centric funding), the scientific community can build the necessary infrastructure to ensure that AI serves as a tool that reinforces, rather than erodes, the foundational integrity of science.
These procedural innovations do not suffice on their own. They must be supported by quantitative benchmarks that bridge ethical oversight with scientific outcomes. Metrics such as solve rates, Routescore, and SPARROW offer measurable ways to assess the synthetic accessibility, cost, and labor effort of AI-driven projects, ensuring that ‘efficiency’ does not come at the expense of empirical reality.
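As a simple illustration of the first of these metrics, a solve rate can be computed once each AI-proposed molecule has been checked for a viable synthetic route. The sketch assumes that "solve rate" means the fraction of proposed molecules for which a retrosynthetic planner returns at least one route; the function name `solve_rate` and the example values are hypothetical.

```python
def solve_rate(route_found: list[bool]) -> float:
    """Fraction of proposed molecules with at least one synthetic route.

    Assumption: each entry records whether a retrosynthesis planner
    found a route for one AI-proposed molecule.
    """
    if not route_found:
        raise ValueError("no molecules were evaluated")
    return sum(route_found) / len(route_found)


# Four proposed molecules, three of which have a predicted route.
rate = solve_rate([True, False, True, True])
```

A high solve rate indicates only that routes were predicted, so experimental confirmation by human chemists is still needed before synthesizability is claimed.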
The maturing of algorithmic cultures presents an opportunity not only to advance scientific research but also to institutionalize the human values of skepticism, accountability, and wisdom. By embedding these values into the process of funding, publication, and validation, the scientific community can avoid AI's potentially deleterious impact on trust, while cultivating its use as a tool that, when wisely governed, reinforces the enduring relevance and integrity of science itself.
This journal is © The Royal Society of Chemistry 2026