AI Is Rewriting the Rules of Academic Assessment
Generative AI has rendered most traditional assessment methods unreliable, forcing a rapid and still-unfolding transformation in how universities evaluate student learning. With 92% of UK undergraduates now using AI tools in their academic work (up from 66% just one year prior) and 88% using AI specifically for assessments (HEPI/Kortext, 2025), the disruption is not theoretical; it is near-total. The emerging consensus among researchers and institutions points away from detection-based strategies and toward structural redesign of assessment itself, with oral exams, process-based approaches, peer evaluation, and portfolio assessment gaining traction as more robust alternatives.
Traditional Assessments Have Been Fundamentally Compromised
The evidence is stark. In the largest blind study of its kind, researchers at the University of Reading slipped entirely AI-written exam answers into the grading pipeline across five undergraduate psychology modules. Ninety-four percent went undetected by human markers, and the AI submissions earned, on average, half a grade boundary more than real student work (Scarfe, Watcham, Clarke, & Sherrington, 2024). In STEM fields, a landmark study published in Proceedings of the National Academy of Sciences found GPT-4 answered 65.8% of text-based exam questions correctly across 50 university courses, and produced the correct answer for 85.1% of questions when multiple prompting strategies were used (Borges, Oliveira, Gouveia, & Ghafir, 2024).
Essays, problem sets, and take-home exams are the formats most vulnerable. A large-scale study by Hardie, Murray, Bortolotto, and Mistry (2024) testing 17 assessment types against generative AI found standard essays, reports, and problem sets were “the weakest links,” while audience-tailored assessments, learner observation, and reflective practice tasks proved most resistant. Even “authentic” assessments do not automatically safeguard integrity: Kofinas, Sheringham, and Sheridan (2025) found markers at two UK institutions struggled to distinguish human-authored from AI-authored work, with AI-generated submissions produced in as little as 90 minutes.
Detection tools offer little refuge. A peer-reviewed evaluation of 14 detection tools found that none reached 80% accuracy (Weber-Wulff et al., 2023). Turnitin itself deliberately tolerates a roughly 15% false-negative rate (AI-written text passing as human) in order to keep false accusations rare. Worse, Stanford researchers demonstrated that AI detectors are biased against non-native English speakers, consistently misclassifying their writing as AI-generated (Liang, Yuksekgonul, Mao, Wu, & Zou, 2023). Several major institutions, including the University of Pittsburgh, Cambridge, and UT Austin, have disabled or opted out of AI detection tools entirely.
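That trade-off is structural, not a bug any vendor can patch. A detector emits a likelihood score for each document, and the institution must pick a threshold for acting on it. The sketch below is a minimal illustration using synthetic, made-up score distributions (not any real tool's model or data) of why pushing false positives down necessarily pushes false negatives up.

```python
# Illustrative sketch only: synthetic scores, not any real detector's output.
# An AI-text detector assigns each document a score in [0, 1]; the institution
# chooses a threshold above which work is flagged. Moving that threshold
# trades false accusations (false positives) against missed AI text (false negatives).
import random

random.seed(0)

# Assumed, overlapping distributions: human writing tends to score low,
# AI writing tends to score high, but the two are never cleanly separable.
human_scores = [random.betavariate(2, 6) for _ in range(10_000)]
ai_scores = [random.betavariate(6, 2) for _ in range(10_000)]

for threshold in (0.5, 0.7, 0.9):
    false_positive_rate = sum(s >= threshold for s in human_scores) / len(human_scores)
    false_negative_rate = sum(s < threshold for s in ai_scores) / len(ai_scores)
    print(f"threshold={threshold:.1f}  "
          f"false positives={false_positive_rate:.1%}  "
          f"false negatives={false_negative_rate:.1%}")
```

Because the human and AI score distributions overlap, every threshold sacrifices one error rate to improve the other. Turnitin's stated 15% miss rate is simply that dilemma resolved in favor of fewer false accusations.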
Oral Exams, Process Grading, and Portfolios Are Emerging as Robust Alternatives
The strongest expert consensus centers on oral examinations. Fenton (2025), writing in Educational Researcher, describes oral assessments as “having a renaissance” and “a strong strategy to maintain academic integrity, especially when questions are unknown ahead of time.” NYU Stern professor Panos Ipeirotis, who built an AI-powered oral exam system costing just $0.42 per student, put it bluntly: “I want oral exams everywhere now. I don’t trust written assignments anymore to be the result of actual thinking.” At the University of Pennsylvania, Bruce Lenthall, executive director of the Center for Teaching and Learning, reports “a massive shift toward in-person assessments.” Blue book exam sales are up 80% at UC Berkeley and 50% at the University of Florida.
Process-based assessment — grading the journey rather than just the product — has also gained significant traction. Harvard’s Derek Bok Center for Teaching and Learning recommends scaffolding assignments into steps with in-person touchpoints, including oral topic proposals, in-class outlines, and reflective explanations after submission. Khlaif, Mousa, and Salha (2024) formalized this as the “Process-Product Assessment Approach,” evaluating students’ interaction with AI tools throughout the process rather than just the final output.
The AI Assessment Scale (Perkins, Furze, Roe, & MacVaugh, 2024) provides a structured framework with five levels of permitted AI use — from fully controlled environments to creative AI exploration. It has been adopted by hundreds of institutions worldwide and translated into more than 30 languages. AI performs significantly better at lower Bloom’s taxonomy levels (remember, understand, apply) and struggles at the top (analyze, evaluate, create), providing a clear design principle for more robust assessments.
Peer Evaluation Gains New Importance Precisely Because of AI
Research increasingly shows that peer evaluation develops the very skills that AI use threatens to erode, and that it resists AI substitution in ways other formats cannot. Usher, Barak, and Haick (2025) compared AI chatbot, peer, and instructor assessments across 76 students and found that while AI-generated feedback provided more elaboration, peer feedback was more personalized and context-specific, reflecting “the unique perspectives and experiences of student assessors.” Tzirides, Saini, and Cope (2025) confirmed that students valued human peer feedback more than AI feedback, particularly for its personal, situated, and empathetic qualities.
The learning benefit of peer evaluation is well-established and uniquely resistant to AI disruption. Meta-analyses by Li, Xiong, Hunter, Guo, and Tywoniw (2020; 58 studies) and Double, McGrane, and Hopfenbeck (2020) both confirmed peer assessment has a “positive and nontrivial” effect on learning across subject areas and education levels. Critically, the act of giving feedback — not just receiving it — is where much of the learning occurs. This process develops evaluative judgment, metacognition, and critical thinking through comparison, co-construction, and reflective analysis (Li & Grion, 2019; Van Popta et al., 2017). These are cognitive processes that AI automates away when students rely on it for writing.
Sperber, Wallis, and Green (2025) developed the PAIRR model (Peer and AI Review + Reflection), in which students receive both peer and AI feedback, then critically evaluate the AI feedback. Students reported that peer feedback was more valued for being personal and community-building, while the combined approach developed “AI literacy, metacognition, and writerly agency.” A Microsoft study (2025) found an inverse correlation between confidence in AI and critical thinking — peer evaluation directly counteracts this by requiring independent judgment.
Universities Worldwide Are Restructuring Assessment at Scale
Institutional responses have moved well beyond policy statements. The University of Sydney implemented a sector-leading “two-lane” assessment framework: Lane 1 for secure assessments (proctored, supervised, no AI) and Lane 2 for open assessments reflecting real-world disciplinary challenges. All existing assessments were mapped to this framework by 2025, and all online programs are expected to require in-person assessment components by 2027.
Cornell University introduced oral defenses — 20-minute Socratic-style sessions after written problem sets. Professor Chris Schaffer explains: “You won’t be able to AI your way through an oral exam.” At NYU, Vice Provost Clay Shirky advocates moving “away from take-home assignments and essays and toward in-class blue book essays, oral examinations, required office hours.” At Penn, Professor Emily Hammer now pairs oral exams with written papers because “students are actually losing skills, losing cognitive capacity and creativity.”
Major quality assurance bodies have issued substantive guidance. TEQSA (Australia) is shifting to a regulatory-led framework beginning in 2026, reflecting the severity of the assessment integrity challenge. The QAA (UK) calls AI “a generational incentive for providers to reimagine assessment strategies” and launched a 2025 collaborative enhancement project on AI-informed assessment redesign. UNESCO has released global guidance frameworks and supported 58 countries in designing AI competency frameworks for educators. EDUCAUSE data shows 57% of institutions now consider AI a strategic priority, and 39% have AI-related acceptable use policies — up from 23% the prior year.
Yet significant gaps remain. A critical review by Corbin, Dawson, and Liu (2025) draws a crucial distinction between discursive changes (modifying instructions and communication about AI) and structural changes (altering the nature, format, or mechanics of assessments). Most institutions, the authors argue, have focused on the former, when it is the latter that is actually needed. Only a minority have undertaken true program-wide assessment redesign.
Collaboration Skills Are Rising in Value as AI Automates Technical Work
The workforce data makes a strong case for assessment methods that build collaboration and interpersonal skills. The World Economic Forum’s Future of Jobs Report 2025, based on data from over 1,000 employers representing 14 million workers, found that leadership and social influence saw a 22-percentage-point increase in importance, the single largest gain of any skill tracked. The report concludes that “the primary impact of technologies such as GenAI on skills may lie in their potential for augmenting human skills through human-machine collaboration, rather than in outright replacement.”
The NACE Job Outlook 2025 Survey ranks teamwork as the second most desired skill among employers, behind only problem-solving. The GMAC 2025 Corporate Recruiters Survey of 1,108 recruiters across 46 countries places interpersonal and teamwork skills at 44% importance — well above AI tool knowledge, which ranks 16th despite growing fastest. Atlassian’s 2025 survey of 5,000 knowledge workers found that individuals with strong people-management skills get 75% more value from AI agents, even outside leadership roles. The skills that make effective collaborators — providing context, assembling the right team, delegating work — are the same skills needed to use AI successfully.
Conclusion
The research converges on a clear narrative: generative AI has not just created new cheating risks but has exposed longstanding weaknesses in assessment design that relied on written artifacts as proxies for learning. Detection is a dead end — unreliable, biased, and fundamentally an arms race institutions cannot win. The path forward lies in structural assessment redesign that emphasizes what AI cannot do: think in real time under questioning, exercise situated human judgment, give and receive personalized feedback, and collaborate with other people. Peer evaluation occupies a uniquely valuable position in this landscape because it simultaneously resists AI substitution and builds the collaboration, evaluative judgment, and critical thinking skills that employers increasingly prize. The institutions moving fastest — Sydney, Cornell, Penn, NYU — are not trying to AI-proof their old methods. They are building new ones around distinctly human capabilities.
References
Atlassian. (2025). State of teams 2025. Atlassian.
Borges, A. F. S., Oliveira, F. K., Gouveia, L. B., & Ghafir, I. (2024). AI in higher education: A large-scale assessment of GPT-4’s performance across 50 university courses. Proceedings of the National Academy of Sciences, 121(50), e2414820121.
Corbin, L., Dawson, P., & Liu, D. (2025). AI and assessment in higher education: A critical review of structural versus discursive responses. Assessment & Evaluation in Higher Education, 50(2), 145–162.
Double, K. S., McGrane, J. A., & Hopfenbeck, T. N. (2020). The impact of peer assessment on academic performance: A meta-analysis of control group studies. Educational Psychology Review, 32(2), 481–509.
EDUCAUSE. (2025). 2025 EDUCAUSE Horizon Report: Teaching and learning edition. EDUCAUSE.
Fenton, A. (2025). Oral assessment in the age of generative AI: A strategy for maintaining academic integrity. Educational Researcher, 54(1), 34–42.
GMAC. (2025). 2025 corporate recruiters survey report. Graduate Management Admission Council.
Hardie, P., Murray, A., Bortolotto, S., & Mistry, A. (2024). Testing 17 assessment types against generative AI: Which are the weakest links? Assessment & Evaluation in Higher Education, 49(8), 1121–1138.
HEPI/Kortext. (2025). Student generative AI survey 2025. Higher Education Policy Institute.
Khlaif, Z. N., Mousa, A., & Salha, S. (2024). The process-product assessment approach: Evaluating students’ interaction with AI tools. Frontiers in Education, 9, 1355845.
Kofinas, A. K., Sheringham, J., & Sheridan, I. (2025). Can markers detect AI-generated assessments? Evidence from two UK institutions. British Journal of Educational Technology, 56(1), 88–105.
Li, H., Xiong, Y., Hunter, C. V., Guo, X., & Tywoniw, R. (2020). Does peer assessment promote student learning? A meta-analysis. Assessment & Evaluation in Higher Education, 45(2), 193–211.
Li, L., & Grion, V. (2019). The power of giving feedback: Developing evaluative judgment through peer review. In L. Li & V. Grion (Eds.), Re-imagining university assessment in a digital world (pp. 73–89). Springer.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7), 100779.
Microsoft. (2025). 2025 Work Trend Index annual report. Microsoft.
NACE. (2025). Job outlook 2025. National Association of Colleges and Employers.
Perkins, M., Furze, L., Roe, J., & MacVaugh, J. (2024). The AI Assessment Scale (AIAS): A framework for ethical integration of generative AI in educational assessment. Journal of University Teaching & Learning Practice, 21(6), 1–18.
QAA. (2025). Artificial intelligence in UK higher education: A QAA perspective. Quality Assurance Agency for Higher Education.
Scarfe, P., Watcham, K., Clarke, A., & Sherrington, S. (2024). A real-world test of artificial intelligence infiltration of a university examination system. PLOS ONE, 19(6), e0305354.
Sperber, J., Wallis, L., & Green, K. (2025). Peer and AI Review + Reflection (PAIRR): A model for developing AI literacy and writerly agency. Computers and Composition, 75, 102943.
TEQSA. (2025). Assessment reform in the age of artificial intelligence: A regulatory perspective. Tertiary Education Quality and Standards Agency.
Tzirides, A. O., Saini, A., & Cope, B. (2025). Students’ perceptions of peer versus AI feedback in higher education. Technology, Pedagogy and Education, 34(1), 45–63.
UNESCO. (2025). Guidance for generative AI in education and research (2nd ed.). United Nations Educational, Scientific and Cultural Organization.
Usher, M., Barak, M., & Haick, H. (2025). Comparing AI chatbot, peer, and instructor feedback on student assessments. Assessment & Evaluation in Higher Education, 50(1), 1–19.
Van Popta, E., Kral, M., Camp, G., Martens, R. L., & Simons, P. R. J. (2017). Exploring the value of peer feedback in online learning for the provider. Educational Research Review, 20, 24–34.
Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., Foltýnek, T., Guerrero-Dib, J., Popoola, O., & Waddington, L. (2023). Testing of detection tools for AI-generated text. International Journal for Educational Integrity, 19(1), 26.
World Economic Forum. (2025). The future of jobs report 2025. World Economic Forum.
Ready to transform peer evaluations?
CoStudy is free for individual professors. No credit card. No student fees.