The Research Behind CoStudy
Every design decision in CoStudy traces back to published, peer-reviewed work. This page is our bibliography, our rationale, and our invitation to hold us accountable.
Why peer evaluation works
The first question every professor asks before trying peer evaluation: “Will students just give their friends high marks?” The answer is no. A meta-analysis of 56 studies found that peer ratings and instructor ratings converge when evaluations use well-defined behavioral criteria (Falchikov & Goldfinch, 2000). When you ask students to rate observable actions rather than general impressions, they produce assessments that look remarkably like their professors'.
Beyond validity, structured peer assessment consistently improves learning outcomes, with meta-analytic effect sizes ranging from g = 0.31 to g = 0.61 (Falchikov & Goldfinch, 2000; Li et al., 2020). For context, most educational interventions land below 0.3. Structured peer evaluation consistently clears that bar, often by a wide margin.
Subsequent work extended this to show that formative peer feedback during a project (not just at the end) drives the largest learning gains (Li et al., 2020). The reason is straightforward: feedback only changes behavior if there's time left to change. If students get peer feedback on the last day of the project, they can't do anything with it.
So what does the research say you need to get right for peer evaluation to work? Behavioral anchors instead of trait ratings. Formative check-ins throughout the semester, not just a summative score at the end. Aggregated multi-rater data to improve reliability. Most implementations fail at one or more of these. That's not a flaw in peer evaluation. It's a flaw in how it's typically deployed.
Three design features separate peer evaluation that works from peer evaluation that doesn't: behavioral anchors, formative timing, and multi-rater aggregation.
Reading the numbers on this page
Hedges' g measures how much an intervention shifts outcomes compared to a control group, expressed in standard deviations. A g of 0 means no effect; positive values favor the intervention group, and negative values mean the outcome decreased (which is exactly what you want for something like anxiety). In education research: magnitudes below 0.2 are small, 0.2–0.5 is medium, and above 0.5 is large. Most educational interventions land below 0.3, so anything above that bar is noteworthy.
Pearson's r measures correlation between two sets of ratings, running from −1 (perfect inverse relationship) through 0 (no relationship) to +1 (perfect agreement). In assessment research: below 0.3 is weak, 0.3–0.5 is moderate, and above 0.5 is strong. An r of 0.69 between peer and instructor ratings is a strong correlation: the two perspectives largely rank and rate students the same way.
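For readers who want the definitions behind these labels, the standard textbook formulas are below (stated in general form; they are not reproduced from the cited meta-analyses):

```latex
% Hedges' g: standardized mean difference with a small-sample correction J
g = J \cdot \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{p}},
\qquad
s_{p} = \sqrt{\frac{(n_{1}-1)s_{1}^{2} + (n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}},
\qquad
J \approx 1 - \frac{3}{4(n_{1}+n_{2}) - 9}

% Pearson's r: covariance of the two rating sets, scaled to lie in [-1, 1]
r = \frac{\sum_{i}(x_{i}-\bar{x})(y_{i}-\bar{y})}
         {\sqrt{\sum_{i}(x_{i}-\bar{x})^{2}}\,\sqrt{\sum_{i}(y_{i}-\bar{y})^{2}}}
```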
g = 0.31–0.61
Effect sizes above the 0.3 threshold where most educational interventions fall short (Falchikov & Goldfinch, 2000; Li et al., 2020)
Peer = Instructor
Peer and instructor ratings converge when behavioral criteria are used (56-study meta-analysis; Falchikov & Goldfinch, 2000)
Formative > Summative
Mid-project feedback drives larger gains because there's still time to change (Li et al., 2020)
Beyond learning outcomes, peer assessment also improves how students feel about the process:
g = 0.616
Increase in self-efficacy (Lu et al., 2026)
Students came out more confident in their own abilities. A medium-to-large effect.
g = −0.608
Reduction in anxiety (Lu et al., 2026)
Less anxiety, not more, despite being evaluated by classmates.
g = 0.393
Increase in motivation (Lu et al., 2026)
More engaged after giving and receiving peer feedback, not less.
“I tried peer evaluation before. It didn't work.”
Li et al. (2020) identified rater training as the single most important factor in whether peer assessment works. Trained evaluators produced evaluations 10.8% higher in quality on their first evaluation cycle, with an additional 8.9% improvement from continued practice. Untrained controls showed no continued improvement at all.
If you've tried peer evaluation before and gotten garbage results, there's a good chance students were just guessing. They'd never been taught how to evaluate a teammate's contribution, so they defaulted to impressions, social dynamics, or random numbers.
CoStudy's student onboarding is built around this research. Before students evaluate anyone, they learn what behavioral observation means, practice calibrating their judgments, and understand why the process matters. This time is different because the preparation is different.
Assessment as self-examination, not just measurement
Most peer evaluation tools treat the evaluation as a measurement device: collect ratings, generate a score, move on. CoStudy is built on a different hypothesis. Evaluating peers requires students to examine the same behaviors in themselves — closing the gap between self-perception and reality. It's not just a way to measure collaboration. It's how students build the skills that make collaboration work.
The concept traces back to Sadler's (1989) work on evaluative judgment: the capacity to make quality judgments about your own and others' work is a core learning outcome, not a byproduct of assessment. When students evaluate their teammates, they practice identifying what effective contribution looks like, articulating it, and calibrating their standards against the group's.
Cho and MacArthur (2011) found that students who gave feedback outperformed those who only received it. The act of reviewing, of having to articulate what's working and what isn't, forced deeper processing than simply reading someone else's comments about your work. Wu and Schunn (2023) confirmed this in a large-scale meta-analysis: providing feedback shows stronger learning gains than receiving it.
This is CoStudy's core design philosophy. Every evaluation cycle isn't just collecting data for the instructor. It's training students to recognize, articulate, and improve collaborative behavior. The measurement and the learning are the same activity.
Peer evaluation doesn't just measure collaboration. It builds the skills that make collaboration work. That dual function is CoStudy's core design philosophy.
Givers > Receivers
Students who gave feedback outperformed those who only received it (Cho & MacArthur, 2011)
Writing a review helps you more than getting one. Same principle here.
+10.8%
Quality improvement from trained peer evaluators on their first evaluation cycle (Li et al., 2020)
Students who understand how to evaluate give better feedback from day one.
+8.9%
Additional quality improvement with continued practice (Wu & Schunn, 2023)
The more you do it, the better you get. It's a skill, not a personality trait.
Psychological safety and team performance
Edmondson's foundational work (1999) identified the conditions under which teams learn and perform: members need to feel safe to take interpersonal risks, such as asking questions, admitting mistakes, and offering honest feedback, without fear of punishment. Google's Project Aristotle confirmed this at scale, finding psychological safety was the single strongest predictor of team effectiveness (Duhigg, 2016).
You've heard that before. Here's what most accounts leave out: Edmondson's conditions aren't personality traits or team culture vibes. They're structural. Safety emerges when the environment reduces social risk, normalizes vulnerability, and separates feedback from personal judgment. In student teams, where power dynamics, social hierarchies, and grading pressure are all amplified, these conditions don't happen by accident. They have to be designed in.
CoStudy maps three specific design decisions directly to Edmondson's conditions:
Anonymous ratings
Edmondson's core condition is low interpersonal risk. Anonymization removes the social cost of honest feedback so students can say what they actually think without fear of retaliation.
Regular cadence
One-time evaluation feels like a judgment event. Regular check-ins normalize feedback as routine, not threatening. This is exactly the “learning behavior” Edmondson describes.
Behavioral questions
Shifting from “my teammate is bad” to “my teammate did this specific thing” separates feedback from personal judgment, the structural frame that makes candor feel safe.
Psychological safety isn't a culture aspiration. It's a set of structural conditions. CoStudy's anonymization, regular cadence, and behavioral framing are designed to produce those conditions, not just reference them.
The Safety Paradox
Balancing dignity and risk in learning

The Foundation: Interpersonal Safety
- Dignity Safety: Every participant feels equal in status and free from anxiety about being belittled or punished.
- Psychological Safety: Teams that feel safe to admit mistakes show higher learning behaviors and performance.
- Supportive Structure: Clear norms and regular check-ins are the primary drivers of group psychological safety.

The Catalyst: Productive Discomfort
- Brave Spaces: Learning involves risk and pain, requiring courage rather than protection from discomfort.
- Intellectual Challenge: Effective learning must challenge settled beliefs and push students toward open-mindedness.
- Civil Candor: Frankness tempered with civility, interpreting others' feedback with generosity and good faith.

CoStudy bridges both:
- Anonymous evaluations protect dignity.
- Behavioral questions demand honesty.
- Formative timing enables growth before stakes are high.
Brave spaces, not safe spaces
The brave spaces framework (Arao & Clemens, 2013) proposed a critical reframe: instead of “safe spaces” where discomfort is minimized, we need brave spaces where participants commit to engaging honestly with difficult feedback, even when it's uncomfortable.
The distinction matters for peer evaluation. A “safe space” approach would avoid conflict entirely: everyone rates everyone highly, nobody learns anything. A brave space approach asks students to be honest about contributions, name problems early, and trust that the structure will protect them from retaliation.
CoStudy operationalizes brave spaces through anonymous evaluations that separate honest feedback from social risk, behavioral questions that ground assessments in observable actions rather than personal judgments, and formative timing that gives teams the chance to improve before stakes are high.
Brave spaces ask for honesty, not comfort. CoStudy's anonymization and behavioral framing make candor feel safe without removing the productive challenge.
Crossing the two axes, psychological safety and accountability, yields four zones:

| | Low accountability | High accountability |
| --- | --- | --- |
| High psychological safety | Comfort Zone: “We all gave each other 5 out of 5.” Free riders protected. Top students burn out. | Learning Zone: “I flagged a problem early and we fixed it together.” Real collaboration. Real skills. |
| Low psychological safety | Apathy Zone: “We split it up and merged the night before.” No learning. No accountability. | Anxiety Zone: “If I say something wrong, it'll hurt my grade.” Self-censorship. Resentment builds silently. |
Behavioral questions
Observable actions, not vague impressions
Anonymous evaluations
Honest feedback without social risk
Formative check-ins
Time to improve before stakes are high
AI and evaluative judgment
The conversation about AI in higher education has mostly been defensive: How do we detect AI-generated work? How do we prevent cheating? Those are real concerns, but they miss the bigger question: what skills do students need to work effectively with AI systems, and how do we teach them?
Tai et al. (2018) identified evaluative judgment, the capacity to assess the quality of work (your own and others'), as a critical learning outcome in higher education. Bearman et al. (2024) extended this argument directly to the AI era: evaluative judgment is the skill AI cannot replace, and the one students need most when AI can generate plausible-looking output on demand. Students who can't judge quality will accept whatever AI gives them. Students who can will know when to trust it and when to override it.
This is where peer evaluation becomes forward-looking, not just defensive. Every time students evaluate their teammates' contributions, they practice exactly the skills Tai and Bearman describe: assessing quality, calibrating standards, articulating what “good” looks like, and giving honest critical feedback. These are the same capacities needed to collaborate effectively with AI systems.
Peer evaluation also provides something artifact-based assessment has lost: direct observation of the collaborative process. AI can generate a paper or a presentation, but it can't fake who showed up to meetings, who contributed ideas, and who did the work. That process-level visibility remains authentically human.
Evaluative judgment, the capacity to assess quality and calibrate trust, is the skill AI cannot replace. Peer evaluation is how students practice it.
Equity in peer evaluation
Unstructured peer evaluation can reinforce existing biases. Research shows that women, students of color, and first-generation students are disproportionately rated lower when evaluations rely on general impressions rather than specific behavioral criteria (Meadows et al., 2023). When evaluation criteria are subjective, evaluators unconsciously redefine “merit” to match their existing preferences (Uhlmann & Cohen, 2005).
The solution isn't to abandon peer evaluation. It's to design it better. Three evidence-based strategies significantly reduce bias: behavioral question design that asks about observable actions rather than traits (Panadero et al., 2013), anonymization that removes social pressure from ratings, and aggregation that combines multiple perspectives to cancel individual biases (Topping, 2009). Objective criteria constrain the redefinition problem: when the question targets a specific behavior, there's less room to shift the goalposts (Uhlmann & Cohen, 2005).
There's a deeper equity issue that peer evaluation is uniquely positioned to address. Students from less privileged backgrounds aren't less capable. They simply haven't had the same structured practice reps in professional collaboration that elite socialization provides informally. The skills that get labeled “soft” (giving feedback, navigating disagreement, holding peers accountable) are skills, and they can be taught. But only if someone creates the structured practice.
The data bears this out: 89% of graduating seniors rated themselves proficient in professionalism, but only 42% of employers agreed (NACE, 2018). That's a 47-point gap, and it doesn't close on its own. Regular peer evaluation cycles are exactly the kind of structured practice that closes it, repositioning peer evaluation from a grading tool to a professional development engine.
CoStudy implements all three bias-reduction strategies by default. Every question template is built around observable behaviors. All ratings are anonymous. And individual scores are aggregated across multiple raters before being reported. The result is evaluation that's fairer for all students, especially those most harmed by unstructured processes.
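To make the aggregation strategy concrete, here is a minimal sketch, not CoStudy's actual pipeline (whose internals aren't published on this page): each student's score is the mean of anonymized ratings from multiple teammates, after standardizing each rater's scores so one harsh or lenient rater can't dominate. The per-rater z-scoring shown is one common normalization choice, labeled here as an assumption.

```python
from statistics import mean, stdev

def normalize_rater(ratings: dict[str, float]) -> dict[str, float]:
    """Z-score one rater's ratings so a harsh or lenient rater
    doesn't systematically drag teammates' scores up or down."""
    values = list(ratings.values())
    mu = mean(values)
    sigma = stdev(values) if len(values) > 1 else 0.0
    if sigma == 0.0:
        return {who: 0.0 for who in ratings}  # rater gave everyone the same score
    return {who: (score - mu) / sigma for who, score in ratings.items()}

def aggregate(all_ratings: dict[str, dict[str, float]]) -> dict[str, float]:
    """Combine many raters' normalized ratings into one score per student.

    all_ratings maps rater -> {ratee: raw score}. Averaging across
    raters is what cancels any single evaluator's bias.
    """
    normalized = [normalize_rater(r) for r in all_ratings.values()]
    students = {who for rater in normalized for who in rater}
    return {
        who: mean(r[who] for r in normalized if who in r)
        for who in students
    }

# Hypothetical four-person team; each rater scores teammates on a 1-5 scale.
team = {
    "ana": {"ben": 5, "caz": 4, "dev": 2},
    "ben": {"ana": 5, "caz": 5, "dev": 3},
    "caz": {"ana": 4, "ben": 4, "dev": 2},
    "dev": {"ana": 3, "ben": 3, "caz": 3},
}
print(aggregate(team))  # dev's low relative contribution shows up in every rater's view
```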
89% vs. 42%
Graduating seniors who rated themselves proficient in professionalism vs. employers who agreed (NACE, 2018)
Students think they're ready. Employers disagree. By a lot. That gap doesn't close on its own.
Behavioral design
Observable actions, not trait judgments (Panadero et al., 2013)
Anonymization
Removes social pressure from ratings
Aggregation
Multiple perspectives cancel individual bias (Topping, 2009)
“Isn't this just a popularity contest?”
It's the most common objection to peer evaluation, and it's the right question to ask. If students are rating general impressions (“How good a teammate is Alex?”), then yes, you get friendship bias, halo effects, and results that track social standing more than actual contribution.
But that's not what happens when you use behavioral criteria. Falchikov and Goldfinch's (2000) meta-analysis of 56 studies found that peer ratings correlate with instructor ratings at r = 0.69 when behavioral criteria are used. That means when students evaluate specific observable behaviors instead of general impressions, their ratings strongly track the professor's assessment. This isn't a vibe check.
“Behavioral anchoring” sounds technical, but the idea is simple: rate what you observe, not what you feel about someone. Instead of “Was this person a good team member?” (which invites bias), you ask “Did this person come to meetings prepared with their assigned work?” or “Did this person respond to team messages within 24 hours?” The question targets a specific, observable action. There's no room for a halo effect when you're answering a factual question.
CoStudy's questions are designed around this principle. Every item targets an observable behavior. Combined with anonymization (so social pressure doesn't distort ratings) and multi-rater aggregation (so no single evaluator's bias dominates), the result is assessment data that holds up to scrutiny, not a popularity contest.
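For illustration, here's what that difference looks like as data. These item definitions are hypothetical, written for this page, not CoStudy's actual templates; the prompts come from the examples above.

```python
# Hypothetical item definitions, illustrating the design principle only.
# A trait item invites a global impression; a behavioral item asks about
# a specific, observable action anyone on the team could verify.

TRAIT_ITEM = {
    "prompt": "Was this person a good team member?",  # invites halo effects
    "scale": "1 (poor) to 5 (excellent)",
}

BEHAVIORAL_ITEMS = [
    {
        "dimension": "preparation",
        "prompt": "Did this person come to meetings prepared with their assigned work?",
        "scale": "1 (never) to 5 (always)",
    },
    {
        "dimension": "responsiveness",
        "prompt": "Did this person respond to team messages within 24 hours?",
        "scale": "1 (never) to 5 (always)",
    },
]
```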
r = 0.69
Correlation between peer and instructor ratings when behavioral criteria are used, across 56 studies (Falchikov & Goldfinch, 2000)
When students evaluate specific observable behaviors instead of general impressions, their ratings strongly track the professor's assessment.
The difference between a popularity contest and a valid assessment is the question you ask. Behavioral anchoring turns peer evaluation from opinion into evidence.
Accreditation evidence you already need
AACSB (the accreditation body for business schools) requires documented evidence of learning outcomes in teamwork and communication as part of its Assurance of Learning standards. ABET (the accreditation body for engineering programs) has parallel requirements under Outcome 5: the ability to function effectively on a team.
Most programs struggle to produce this evidence because they're trying to extract it from artifacts that weren't designed to measure team skills: project reports, presentations, peer ratings collected once at the end of a course with no behavioral anchoring.
CoStudy's aggregated evaluation data provides documented evidence of team skill development over time, directly supporting AACSB and ABET assurance of learning requirements. Because evaluations use behavioral criteria and run at multiple points during the semester, the data shows growth trajectories, not just snapshots. That's the difference between “we assessed teamwork” and “here's how students' collaborative skills developed across the program.”
AACSB Assurance of Learning
Documented evidence of teamwork and communication outcomes across the curriculum
ABET Outcome 5
Evidence that students can function effectively on a team whose members provide leadership and collaboration
CoStudy turns peer evaluation data into accreditation evidence automatically: longitudinal, behavioral, and tied to specific learning outcomes.
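As a sketch of what “growth trajectories, not just snapshots” can mean computationally (an illustration under assumed data shapes, not CoStudy's reporting code), one simple summary is a per-student trend line fitted across evaluation cycles:

```python
# Sketch: turn repeated evaluation cycles into a growth trajectory.
# Assumes scores is a list of (cycle_number, aggregated_score) pairs
# for one student, e.g. from three check-ins across a semester.

def growth_slope(scores: list[tuple[int, float]]) -> float:
    """Least-squares slope of score over cycle: positive means the
    student's collaborative ratings improved across the semester."""
    n = len(scores)
    mean_x = sum(x for x, _ in scores) / n
    mean_y = sum(y for _, y in scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in scores)
    den = sum((x - mean_x) ** 2 for x, _ in scores)
    return num / den if den else 0.0

# Hypothetical student: three evaluation cycles, rising from 3.2 to 4.1.
print(growth_slope([(1, 3.2), (2, 3.7), (3, 4.1)]))  # ~0.45 points per cycle
```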
How we build our questions
CoStudy's core assessment was originally developed through a composite analysis of peer evaluation instruments across higher education, refined in collaboration with a doctoral-level psychologist. The goal was to isolate the behavioral dimensions that most reliably differentiate effective from ineffective team contribution, and to ask about them in ways that reduce rater bias.
That foundation still holds, but teaming doesn't look the same across every context. An engineering capstone, a humanities seminar, and a clinical practicum each define “contribution” differently. That's why we work directly with professors to create custom question sets aligned to specific course objectives. The accountability and growth mechanisms should reflect what actually matters in a given learning environment.
We're also transparent about what's ahead: we are currently preparing a major overhaul of our primary assessment instrument. The goal is stronger psychometric grounding, tighter alignment with current teamwork research, and better adaptability across disciplines. We'll share more as that work matures.
Our questions were built from cross-institutional research and expert collaboration. We customize for context, and we're actively investing in the next generation of our assessment design.
Individual talent doesn't predict team performance
Woolley et al. (2010) published a striking finding in Science: group intelligence isn't predicted by the smartest person in the room. It's predicted by social sensitivity, turn-taking equality, and gender diversity. Teams with one star performer and poor collaboration consistently underperformed teams with moderate individual talent and strong collaborative dynamics.
Riedl et al. (2021) replicated this across 5,279 individuals in 1,356 groups, confirming that the collective intelligence factor holds at scale. The finding is robust: what makes teams effective is not the sum of individual ability but the quality of interaction between members.
The skills that predict team effectiveness (social sensitivity, equitable participation, constructive disagreement) are exactly the ones CoStudy's evaluation cycles develop and measure.
Social sensitivity > IQ
Team intelligence predicted by social sensitivity, turn-taking equality, and gender diversity, not individual IQ (Woolley et al., 2010)
Replicated across 5,279 individuals in 1,356 groups (Riedl et al., 2021)
Putting smart people together doesn't make a smart team. Collaborative skills do, and those skills can be practiced, measured, and improved.
References
1. Falchikov, N., & Goldfinch, J. (2000). Student Peer Assessment in Higher Education: A Meta-Analysis Comparing Peer and Teacher Marks. Review of Educational Research, 70(3), 287–322. doi:10.3102/00346543070003287
2. Li, H., Xiong, Y., Hunter, C. V., Guo, X., & Tywoniw, R. (2020). Does Peer Assessment Promote Student Learning? A Meta-Analysis. Assessment & Evaluation in Higher Education, 45(2), 193–211. doi:10.1080/02602938.2019.1620679
3. Edmondson, A. (1999). Psychological Safety and Learning Behavior in Work Teams. Administrative Science Quarterly, 44(2), 350–383. doi:10.2307/2666999
4. Duhigg, C. (2016). What Google Learned From Its Quest to Build the Perfect Team. The New York Times Magazine, Feb. 25, 2016. https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html
5. Arao, B., & Clemens, K. (2013). From Safe Spaces to Brave Spaces: A New Way to Frame Dialogue Around Diversity and Social Justice. In L. Landreman (Ed.), The Art of Effective Facilitation (pp. 135–150). Stylus Publishing. doi:10.4324/9781003447801-10
6. Panadero, E., Romero, M., & Strijbos, J. W. (2013). The Impact of a Rubric and Friendship on Peer Assessment: Effects on Construct Validity, Performance, and Perceptions of Fairness and Comfort. Studies in Educational Evaluation, 39(4), 195–203. doi:10.1016/j.stueduc.2013.10.005
7. Meadows, K. N., Olsen, K. C., Gryba, R., & Peacock, K. (2023). Gender Bias in Peer Evaluations of Team Members' Contributions in Undergraduate Engineering Courses. International Journal of Engineering Education, 39(5), 1233–1243. https://www.ijee.ie/contents/c390523.html
8. Topping, K. J. (2009). Peer Assessment. Theory Into Practice, 48(1), 20–27. doi:10.1080/00405840802577569
9. Sadler, D. R. (1989). Formative Assessment and the Design of Instructional Systems. Instructional Science, 18(2), 119–144. doi:10.1007/BF00117714
10. Cho, K., & MacArthur, C. (2011). Learning by Reviewing. Journal of Educational Psychology, 103(1), 73–84. doi:10.1037/a0021950
11. Wu, Y., & Schunn, C. D. (2023). Providing and Receiving Peer Feedback Improves Student Writing: Evidence From a Meta-Analysis. Assessment & Evaluation in Higher Education, 48(6), 835–850. doi:10.1080/02602938.2023.2169600
12. Tai, J., Ajjawi, R., Boud, D., Dawson, P., & Panadero, E. (2018). Developing Evaluative Judgement: Enabling Students to Make Decisions About the Quality of Work. Higher Education, 76(3), 467–481. doi:10.1007/s10734-017-0220-3
13. Bearman, M., Ajjawi, R., Boud, D., & Tai, J. (2024). Educating Future-Proof Graduates for an AI-Mediated World: The Importance of Evaluative Judgement. Assessment & Evaluation in Higher Education, 49(8), 1169–1181. doi:10.1080/02602938.2024.2396527
14. National Association of Colleges and Employers (NACE). (2018). Job Outlook 2018: Are Students Career-Ready? NACE Research Report (n = 4,213 students, 201 employers). https://www.naceweb.org/career-readiness/competencies/are-college-graduates-career-ready/
15. Uhlmann, E. L., & Cohen, G. L. (2005). Constructed Criteria: Redefining Merit to Justify Discrimination. Psychological Science, 16(6), 474–480. doi:10.1111/j.0956-7976.2005.01559.x
16. Woolley, A. W., Chabris, C. F., Pentland, A., Hashmi, N., & Malone, T. W. (2010). Evidence for a Collective Intelligence Factor in the Performance of Human Groups. Science, 330(6004), 686–688. doi:10.1126/science.1193147
17. Riedl, C., Kim, Y. J., Gupta, P., Malone, T. W., & Woolley, A. W. (2021). Quantifying Collective Intelligence in Human Groups. Proceedings of the National Academy of Sciences, 118(21), e2005737118. doi:10.1073/pnas.2005737118
18. Lu, Y., Li, H., & Guo, X. (2026). The Effects of Peer Assessment on Motivation, Self-Efficacy, and Anxiety: A Meta-Analysis. Assessment & Evaluation in Higher Education. doi:10.1080/02602938.2025.2461301
For researchers
We're actively interested in research partnerships. If you study peer assessment, team dynamics, educational equity, or collaborative learning, we'd love to talk. CoStudy generates rich longitudinal data on team behavior, and we're committed to contributing to the evidence base that shapes our field.
If you want to challenge a claim we've made, request our underlying data, or ask about our methodology, we welcome that too. Accountability is the point.