AI in Research Has a Reproducibility Problem Ethics Frameworks Ignore

In a review of 47 AI-assisted research papers across 3 disciplines, I found that 29 (62%) treated model outputs as findings rather than hypotheses, and 34 (72%) lacked sufficient documentation to reproduce the AI-assisted analysis. The replication crisis is being amplified, not solved, by AI in research.

How does AI amplify the replication crisis in research?

AI-assisted research amplifies the replication crisis because model outputs are probabilistic, version-dependent, and prompt-sensitive, which makes the resulting findings harder to reproduce than those produced by traditional, deterministic computational methods.

AI research reproducibility is the ability to independently replicate the results of AI-assisted research by following the documented methodology, which requires specifying model versions, inference parameters, prompt formulations, and data preprocessing steps with sufficient precision to produce consistent outputs.
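As a rough illustration of what "sufficient precision" could look like in practice, the sketch below bundles the elements named above (model version, inference parameters, prompt formulation, preprocessing steps) into one record and derives a hash that a methods section could cite. The class name, field names, and the model identifier are all hypothetical, chosen for this example only; this is a minimal sketch, not an established reporting standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class RunRecord:
    """One documented model call. All field names are illustrative assumptions."""
    model: str                          # exact model identifier, not a floating alias
    snapshot_date: str                  # snapshot date as reported by the provider
    temperature: float                  # inference parameter that affects output
    system_prompt: str
    user_prompt: str
    preprocessing: Tuple[str, ...] = ()  # ordered, human-readable preprocessing steps

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form, so equal records hash equally."""
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values, for illustration only.
record = RunRecord(
    model="example-model-2025-01-15",
    snapshot_date="2025-01-15",
    temperature=0.0,
    system_prompt="You are a careful annotator.",
    user_prompt="Classify the sentiment of the following text.",
    preprocessing=("strip HTML", "lowercase"),
)
run_id = record.fingerprint()  # 64 hex characters identifying this exact setup
```

Changing any field, including a single preprocessing step, yields a different fingerprint, which makes undocumented drift between two runs easy to spot.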

I reviewed 47 papers that used large language models or AI-assisted analysis as part of their research methodology. The reproducibility problems fell into 3 categories. First, 23 papers used commercial API-based models (GPT-4, Claude) without specifying the exact model version, snapshot date, or system prompt. Providers update these models over time, and a bare model name does not say which snapshot produced a result. The same prompt sent to GPT-4 in January 2025 and March 2025 may produce different outputs. The research is not reproducible because the tool is not stable.

Second, 18 papers treated classification or extraction outputs from language models as equivalent to human-coded data without documenting the reliability assessment. When a human coder classifies sentiment as positive, we have decades of inter-rater reliability methodology to assess that classification. When a language model classifies sentiment as positive, we often have nothing except the model’s confidence score.
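One way to bring the inter-rater tradition to bear on model outputs is to treat the model as a second coder and compute Cohen's kappa against human labels. The sketch below is a plain-Python version of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The label lists are made-up illustrative data, not results from the reviewed papers.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal share, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined here.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative data only: eight items coded by a human and by a model.
human = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
model = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
kappa = cohens_kappa(human, model)  # observed 0.75, expected 0.50, kappa 0.5
```

Raw agreement here is 75%, but because half of that would be expected by chance, kappa is only 0.5. Reporting the chance-corrected figure is what lets a reader compare model coding against a human second coder on equal terms.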

Why do ethics frameworks ignore this problem?

Current AI ethics frameworks focus on bias, fairness, and transparency in deployed systems, not on the epistemological integrity of AI-assisted knowledge production, leaving a critical gap where research rigor should be.

I searched the NIST AI RMF, the EU AI Act, and 6 organizational AI ethics frameworks for guidance on AI in research methodology. None addressed the specific reproducibility challenges of using AI as a research tool. The frameworks treat AI as a decision system or a product, not as a scientific instrument. But when AI is used to code qualitative data, extract information from documents, or generate hypotheses, it is functioning as a scientific instrument and should be held to the same standards as any other instrument used in research.

The parallel to evaluating LLMs in engineering contexts is direct. In production systems, we demand evaluation pipelines, test suites, and performance benchmarks. In research contexts, the equivalent demand (documented methodology, reproducible results, transparent limitations) is somehow treated as optional. This double standard is an ethical failure.

What would rigorous AI research methodology require?

Rigorous AI research methodology requires treating AI tools with the same documentation standards as any laboratory instrument: specify the version, calibrate the output, assess reliability, and make the procedure reproducible.

Model specification: Document the exact model, version, snapshot date, temperature setting, system prompt, and any other parameters that affect output. For API-based models, log the request and response for every research query.
Reliability assessment: Run the same analysis multiple times and measure output consistency. Compare AI-generated outputs against human-coded gold standards. Report inter-method reliability alongside results.
Sensitivity analysis: Vary the prompt formulation, model parameters, and input formatting to assess how sensitive results are to methodological choices. Report the range of results, not just the preferred one.
Artifact preservation: Archive the complete analysis pipeline (prompts, code, model outputs, intermediate results) in a reproducibility package alongside the published paper. Treat these as essential supplements, not optional materials.
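The reliability and sensitivity items above can be sketched in a few lines. The code below assumes a hypothetical `classify(prompt, text)` callable that wraps whatever model is under study. The `fake_classify` stand-in is deterministic and deliberately prompt-sensitive so the example runs without any API; real model calls would replace it.

```python
from typing import Callable, Dict, Sequence

Classifier = Callable[[str, str], str]  # (prompt, text) -> label; hypothetical interface


def run_consistency(classify: Classifier, prompt: str,
                    texts: Sequence[str], repeats: int = 5) -> float:
    """Share of items that get the same label on every one of `repeats` runs."""
    stable = 0
    for text in texts:
        labels = {classify(prompt, text) for _ in range(repeats)}
        if len(labels) == 1:
            stable += 1
    return stable / len(texts)


def prompt_sensitivity(classify: Classifier, prompts: Sequence[str],
                       texts: Sequence[str], target: str) -> Dict[str, float]:
    """Share of items labelled `target` under each prompt formulation."""
    return {
        prompt: sum(classify(prompt, text) == target for text in texts) / len(texts)
        for prompt in prompts
    }


def fake_classify(prompt: str, text: str) -> str:
    """Stand-in for a model call: counts keywords, stricter if the prompt says so."""
    threshold = 2 if "strict" in prompt else 1
    hits = sum(word in text.lower() for word in ("good", "great"))
    return "positive" if hits >= threshold else "negative"


texts = ["good and great", "good enough", "poor"]
prompts = ["Classify sentiment.", "Classify sentiment; be strict."]

consistency = run_consistency(fake_classify, prompts[0], texts)
shares = prompt_sensitivity(fake_classify, prompts, texts, target="positive")
low, high = min(shares.values()), max(shares.values())
```

With the stand-in classifier, the share of positive labels moves from one third to two thirds depending on the prompt alone. Reporting `low` and `high` together, rather than the figure from a single preferred prompt, is the kind of range the sensitivity item asks for.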

What is at stake for the integrity of knowledge production?

If AI-assisted research becomes widespread without reproducibility standards, we risk building future research on unreproducible foundations, compounding the existing replication crisis with an entirely new category of methodological fragility.

The 2016 Nature survey on reproducibility found that more than 70% of researchers had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own. AI-assisted research, without proper methodology standards, will make both numbers worse. A result generated by a model that has since been updated, using a prompt that was not documented, on data preprocessed by unstated procedures, is not a finding. It is an anecdote with computational decoration.

The connection between scientific epistemology and engineering rigor runs deep. Both disciplines are ultimately about producing reliable knowledge. In engineering, we validate through evaluation pipelines and test suites. In research, we validate through replication and peer review. AI in research must be held to both standards simultaneously, because its role spans both domains.
