LEVELS OF EVIDENCE
Evidence Hierarchy | GRADE System | Clinical Application
Evidence Levels for Therapeutic Questions
Critical Must-Knows
- Level I Evidence: High-quality RCT with randomization, blinding, adequate power, low loss to follow-up
- GRADE System: Assesses quality of evidence (High/Moderate/Low/Very Low) AND strength of recommendations (Strong/Weak)
- Evidence Levels Vary by Question Type: Therapeutic, Prognostic, Diagnostic questions have different hierarchies
- Study Design ≠ Evidence Quality: A poorly conducted RCT can be downgraded; a well-done cohort can provide strong evidence
- Recommendation Strength: Depends on evidence quality, benefit-harm balance, values, and resource use
Examiner's Pearls
- RCT is not always Level I - must meet quality criteria including blinding, adequate power, low attrition
- Systematic review quality depends on included studies - SR of poor RCTs is not Level I
- For rare outcomes, well-designed case-control may be best available evidence
- GRADE separates evidence quality from recommendation strength - can have strong recommendation from low-quality evidence if large effect and ethical imperative
Critical Evidence Concepts
Study Design vs Evidence Quality
Not the same! RCT design does NOT automatically mean Level I. Must assess: randomization quality, blinding, power, attrition, bias. A flawed RCT can be Level II or III.
Question Type Matters
Therapeutic: RCT is gold standard. Prognostic: Cohort is best. Diagnostic: Cross-sectional with reference standard. Evidence hierarchy differs by question.
GRADE: Quality vs Strength
Evidence Quality: How confident are we in effect estimate? Recommendation Strength: Should we do this? Can have strong recommendation from low quality if large effect.
Upgrade and Downgrade Factors
Downgrade: Risk of bias, inconsistency, indirectness, imprecision, publication bias. Upgrade: Large effect, dose-response, plausible residual confounding that would bias toward the null (so the true effect is likely at least as large as observed).
At a Glance
The Levels of Evidence framework ranks study designs to guide clinical decision-making. Level I represents high-quality randomized controlled trials (adequate randomization, blinding, power, and low attrition) or systematic reviews of such trials. Study design does not automatically determine evidence level: a poorly conducted RCT may be downgraded to Level II or III. The hierarchy descends through Level II (lesser RCTs, prospective cohorts) and Level III (case-control, retrospective cohorts) to Level IV (case series) and Level V (expert opinion). The GRADE system adds nuance by separating evidence quality (confidence in the effect estimate) from recommendation strength (should we act), acknowledging that strong recommendations can arise from lower-quality evidence when effects are large and harms minimal. Evidence can be downgraded by the "RIIIP" factors (Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias) or upgraded by large effect sizes, dose-response relationships, and plausible residual confounding that would bias toward the null hypothesis.
RCCCCE: Levels of Evidence (Therapeutic)
Memory Hook: Remember Chronic Cases Can Create Excellent evidence - from highest to lowest quality!
RIIIP: GRADE Factors that Downgrade Evidence
Memory Hook: RIIIP evidence apart - five factors that lower your confidence in the evidence!
Overview and Introduction
Understanding Levels of Evidence
Levels of evidence provide a hierarchical framework for evaluating the quality of research studies. This system helps clinicians appraise the strength of evidence supporting clinical decisions.
Key Principles:
- Higher evidence levels indicate greater confidence in study findings
- Study design alone does not determine evidence level - quality matters
- Different question types have different evidence hierarchies
- Context determines appropriate evidence level for clinical decisions
Concepts and Methodology Principles
Core Concepts in Evidence Appraisal
The Evidence Pyramid:
- Top: Systematic reviews and meta-analyses
- High: Randomized controlled trials (RCTs)
- Medium: Cohort and case-control studies
- Low: Case series, case reports, expert opinion
Why Study Design Matters:
- Randomization controls for known and unknown confounders
- Blinding prevents performance and detection bias
- Control groups allow comparison of intervention effects
- Prospective design avoids recall and selection bias
GRADE Framework:
- Separates evidence quality (confidence) from recommendation strength
- RCTs start as high quality, observational studies as low
- Quality can be upgraded or downgraded based on specific criteria
Study Hierarchies for Different Question Types
Therapeutic Questions (Treatment Effectiveness)
Question Format: In [population], does [intervention] compared to [control] improve [outcome]?
Levels of Evidence - Therapeutic
| Level | Study Design | Quality Criteria | Example |
|---|---|---|---|
| Level I | High-quality RCT or SR of Level I RCTs | Randomization, allocation concealment, blinding, greater than 80% follow-up, ITT analysis | HEALTH trial: THA vs Hemi for femoral neck fracture |
| Level II | Lesser-quality RCT, Prospective Cohort, SR of Level II | RCT with methodological flaws OR well-designed cohort | Registry study comparing surgical approaches |
| Level III | Case-Control, Retrospective Cohort | Observational with comparison, prone to confounding | Case-control of AVN risk factors |
| Level IV | Case Series | No comparison group, descriptive only | Series of 50 arthroscopic rotator cuff repairs |
| Level V | Expert Opinion | Lowest level, based on experience | Editorial on surgical technique preferences |
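For therapeutic questions, randomization is critical because it balances known and unknown confounders between groups on average and, combined with allocation concealment, prevents selection bias in treatment assignment.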
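The table above can be read as a simple decision rule. The following is a minimal sketch of that rule, not an official classifier: the `Study` fields and the `therapeutic_level` function are hypothetical names introduced for illustration, systematic reviews are left out, and a flawed RCT is simplified to Level II even though it may warrant Level III.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical record of the features used in the therapeutic table."""
    design: str  # "rct", "prospective_cohort", "retrospective_cohort",
                 # "case_control", "case_series", or "expert_opinion"
    randomized_and_concealed: bool = False
    blinded: bool = False
    follow_up_fraction: float = 0.0  # e.g. 0.92 means 92% followed up
    intention_to_treat: bool = False


def therapeutic_level(study: Study) -> str:
    """Return an illustrative therapeutic evidence level (I to V)."""
    if study.design == "rct":
        # Level I requires every quality criterion listed in the table.
        high_quality = (
            study.randomized_and_concealed
            and study.blinded
            and study.follow_up_fraction > 0.80
            and study.intention_to_treat
        )
        # Simplification: any flawed RCT is treated as Level II here.
        return "I" if high_quality else "II"
    if study.design == "prospective_cohort":
        return "II"
    if study.design in ("case_control", "retrospective_cohort"):
        return "III"
    if study.design == "case_series":
        return "IV"
    if study.design == "expert_opinion":
        return "V"
    raise ValueError(f"Unknown study design: {study.design!r}")


unblinded_rct = Study("rct", randomized_and_concealed=True, blinded=False,
                      follow_up_fraction=0.92, intention_to_treat=True)
print(therapeutic_level(unblinded_rct))  # II
```

The point of the sketch is that design only sets the ceiling; the quality flags decide whether an RCT actually reaches Level I.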
GRADE System
What is GRADE?
GRADE (Grading of Recommendations Assessment, Development and Evaluation) is the most widely used system for assessing evidence quality and recommendation strength.
Two Key Outputs:
- Quality of Evidence: High / Moderate / Low / Very Low
- Strength of Recommendation: Strong / Weak (for or against)
Assessing Evidence Quality
Start with Study Design, then apply modifiers:
GRADE Evidence Quality Assessment
| Starting Point | Downgrade For | Upgrade For | Final Quality |
|---|---|---|---|
| RCT = HIGH | Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias (each -1 or -2) | Large effect (+1, or +2 if very large), Dose-response (+1), Residual confounding (+1) | High / Moderate / Low / Very Low |
| Observational = LOW | Same downgrade factors as above | Same upgrade factors, often applied to cohort studies | Can upgrade to Moderate or even High with large effect |
Example: RCT with high risk of bias (-1) and wide confidence intervals (-1) = Low quality evidence (High, downgraded two levels).
Example: Cohort study with large effect (+1) = Moderate quality evidence (upgraded from Low); a very large effect (+2) could raise it to High.
Understanding GRADE is essential for guideline development and evidence interpretation.
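The start-then-modify arithmetic of GRADE can be sketched in a few lines. This is an illustrative simplification, not the GRADE Working Group's procedure: the function name `grade_quality` and its parameters are assumptions made for this example, and real GRADE ratings rest on structured judgment rather than mechanical counting.

```python
# Ordered from lowest to highest so that an index acts as a score.
GRADE_LEVELS = ["Very Low", "Low", "Moderate", "High"]


def grade_quality(design: str, downgrades: int = 0, upgrades: int = 0) -> str:
    """Illustrative GRADE rating from starting design and net modifiers.

    design: "rct" starts at High, "observational" starts at Low.
    downgrades: total levels removed across the five RIIIP factors.
    upgrades: total levels added (large effect, dose-response,
              plausible residual confounding).
    """
    if design == "rct":
        start = GRADE_LEVELS.index("High")
    elif design == "observational":
        start = GRADE_LEVELS.index("Low")
    else:
        raise ValueError(f"Unknown design: {design!r}")
    if downgrades < 0 or upgrades < 0:
        raise ValueError("downgrades and upgrades must be non-negative")

    score = start - downgrades + upgrades
    # Ratings cannot fall below Very Low or rise above High.
    score = max(0, min(score, len(GRADE_LEVELS) - 1))
    return GRADE_LEVELS[score]


print(grade_quality("rct", downgrades=1))           # Moderate
print(grade_quality("observational", upgrades=1))   # Moderate
```

Each downgrade or upgrade moves the rating by one level from the starting point, and the result is clamped to the four GRADE categories.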
Clinical Relevance and Applications
Applying Evidence to Patients
Level I evidence is ideal but not always applicable. Consider:
- Does patient match RCT inclusion criteria?
- Were exclusion criteria too strict?
- Do patient values align with outcomes studied?
When Lower Evidence is Acceptable
Situations where Level III-IV may suffice:
- Rare diseases (no RCTs feasible)
- Urgent clinical need (cannot wait for RCT)
- Ethical constraints prevent randomization
- Consistent observational data with large effects
Reading Guidelines Critically
Check the evidence grade: Guidelines should cite evidence level for each recommendation. Strong recommendation based on weak evidence? Question the rationale.
Communicating Uncertainty
Be honest with patients: If evidence is Level IV, explain uncertainty. Shared decision-making is crucial when evidence is weak.
Evidence Base
Levels of Evidence for Orthopaedic Studies
- Developed standardized levels of evidence framework for orthopaedic literature
- Separate criteria for therapeutic, prognostic, diagnostic, and economic questions
- Adopted by JBJS and many orthopaedic journals
- Levels range from I (highest) to V (lowest)
GRADE Working Group Methodology
- GRADE provides transparent framework for rating evidence quality and recommendation strength
- Separates evidence quality (confidence in effect estimate) from recommendation strength (should we do it)
- Considers benefit-harm balance, patient values, and resource use in recommendations
- Widely adopted by WHO, Cochrane, and more than 100 international guideline organizations
Oxford Centre for Evidence-Based Medicine Levels
- Updated evidence hierarchy addressing limitations of earlier frameworks
- Separate tables for treatment, diagnosis, prognosis, and screening questions
- Emphasizes study design AND quality
- Recognizes observational studies can provide high-quality evidence in certain situations
Exam Viva Scenarios
Practice these scenarios to excel in your viva examination
Scenario 1: Interpreting Evidence Levels
"A colleague shows you a case series of 30 patients who underwent a new surgical technique for rotator cuff repair, with 90 percent good outcomes at 2 years. He says this is Level I evidence. How would you respond?"
Scenario 2: GRADE System Application
"You are reviewing a guideline that gives a Strong recommendation for surgical fixation of ankle fractures based on Moderate quality evidence from observational studies. Is this appropriate?"
MCQ Practice Points
Level I Evidence Question
Q: Which of the following is required for an RCT to be considered Level I evidence? A: All of the following: Adequate randomization and allocation concealment, blinding of participants and assessors, intention-to-treat analysis, less than 20 percent loss to follow-up, and adequate sample size with power calculation. A poorly conducted RCT with high attrition or lack of blinding is downgraded to Level II.
GRADE Downgrade Factors
Q: What are the five factors that downgrade evidence quality in the GRADE system? A: RIIIP: Risk of bias, Inconsistency (heterogeneity across studies), Indirectness (PICO mismatch), Imprecision (wide confidence intervals), and Publication bias. Each factor can downgrade by 1 or 2 levels.
Question Type and Design
Q: What is the best study design for answering a prognostic question about fracture healing? A: Prospective cohort study. For prognostic questions, cohort studies are superior to RCTs because you follow natural history without intervention. RCTs are best for therapeutic questions, not prognosis.
Australian Context
Australian Epidemiology and Practice
Evidence-Based Practice in Australian Orthopaedics:
- AOANJRR (Australian Orthopaedic Association National Joint Replacement Registry) provides world-leading Level II evidence on implant survival and outcomes
- Registry data is cited internationally as high-quality observational evidence for arthroplasty decisions
- The Whitehouse Report methodology underpins registry analysis and has influenced international registry standards
RACS Orthopaedic Training Relevance:
- Levels of evidence and GRADE methodology are core FRACS examination topics in research methodology and evidence-based practice
- Viva scenarios commonly test ability to critique study designs and assign evidence levels
- Key exam focus: differentiating study designs, identifying bias, applying GRADE downgrade factors (RIIIP)
- Examiners expect candidates to interpret evidence levels when discussing treatment recommendations
Australian Orthopaedic Research:
- Australian orthopaedic journals (ANZ Journal of Surgery, JBJS Open Access) require authors to assign evidence levels to studies
- NHMRC (National Health and Medical Research Council) evidence hierarchy aligns with Oxford CEBM levels
- Australian Clinical Practice Guidelines use GRADE methodology for recommendation development
Key Australian Databases and Resources:
- AOANJRR: Primary source for arthroplasty evidence in Australian practice
- Cochrane Musculoskeletal Group: Australian-based systematic review group contributing to global evidence synthesis
- NHMRC Guidelines Portal: Evidence-based guidelines for Australian clinical practice
Application to Clinical Practice:
- Therapeutic Goods Administration (TGA) requires evidence level assessment for device and drug approval
- Medicare Benefits Schedule (MBS) Review Taskforce considered evidence levels when evaluating orthopaedic procedures
- Private health fund prostheses list decisions incorporate evidence from AOANJRR and systematic reviews
LEVELS OF EVIDENCE
High-Yield Exam Summary
Evidence Levels (Therapeutic)
- Level I = High-quality RCT or SR of RCTs
- Level II = Lesser RCT or Prospective Cohort
- Level III = Case-Control or Retrospective Cohort
- Level IV = Case Series (no control)
- Level V = Expert Opinion (lowest)
Question-Specific Best Evidence
- Therapeutic question = RCT gold standard
- Prognostic question = Cohort study best
- Diagnostic question = Cross-sectional with reference standard
- Economic question = Cost-effectiveness analysis
- Hierarchy differs by question type
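As a quick self-test aid, the pairings in this list can be written as a lookup. This is only a sketch of the list itself; `BEST_DESIGN` and `best_design` are hypothetical names, and the mapping gives the preferred design for each question type, not a full hierarchy.

```python
# Preferred primary study design per question type, as listed above.
BEST_DESIGN = {
    "therapeutic": "Randomized controlled trial",
    "prognostic": "Cohort study",
    "diagnostic": "Cross-sectional study with reference standard",
    "economic": "Cost-effectiveness analysis",
}


def best_design(question_type: str) -> str:
    """Return the preferred design for a question type (case-insensitive)."""
    key = question_type.strip().lower()
    if key not in BEST_DESIGN:
        raise ValueError(f"Unknown question type: {question_type!r}")
    return BEST_DESIGN[key]


print(best_design("Prognostic"))  # Cohort study
```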
GRADE System
- GRADE assesses quality (High/Moderate/Low/Very Low) AND strength (Strong/Weak)
- Start with RCT = High quality; Observational = Low quality
- Downgrade for: RIIIP (Risk of bias, Inconsistency, Indirectness, Imprecision, Publication bias)
- Upgrade for: Large effect, Dose-response, Residual confounding (biasing toward the null)
- Strong recommendation can come from moderate evidence if large effect
Level I Criteria (RCT)
- Adequate randomization and allocation concealment
- Blinding of participants and assessors
- Intention-to-treat analysis
- Less than 20% loss to follow-up
- Adequate power (sample size calculation)
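When critiquing a trial in a viva, it can help to work through these criteria explicitly, including the attrition arithmetic. The sketch below assumes a hypothetical helper, `unmet_level_one_criteria`, that takes enrolment counts and yes/no appraisal judgments; all names are illustrative and the 20% threshold is the one stated in the list above.

```python
from typing import List


def unmet_level_one_criteria(
    enrolled: int,
    completed: int,
    *,
    concealed_randomization: bool,
    blinded: bool,
    intention_to_treat: bool,
    powered: bool,
) -> List[str]:
    """List which Level I criteria an RCT fails (empty list = all met)."""
    if enrolled <= 0 or not 0 <= completed <= enrolled:
        raise ValueError("need enrolled > 0 and 0 <= completed <= enrolled")

    # Loss to follow-up as a fraction of those enrolled.
    loss = (enrolled - completed) / enrolled

    unmet = []
    if not concealed_randomization:
        unmet.append("randomization and allocation concealment")
    if not blinded:
        unmet.append("blinding of participants and assessors")
    if not intention_to_treat:
        unmet.append("intention-to-treat analysis")
    if loss >= 0.20:
        unmet.append(f"loss to follow-up {loss:.0%} (needs less than 20%)")
    if not powered:
        unmet.append("adequate power")
    return unmet


# 200 enrolled, 150 completed: 25% attrition fails the follow-up criterion.
print(unmet_level_one_criteria(
    200, 150,
    concealed_randomization=True, blinded=True,
    intention_to_treat=True, powered=True,
))
# ['loss to follow-up 25% (needs less than 20%)']
```

Returning the list of failed criteria, instead of a single pass/fail flag, mirrors how a candidate would justify downgrading a trial from Level I.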
Common Pitfalls
- RCT design does NOT automatically equal Level I (must meet quality criteria)
- SR quality depends on included studies (SR of poor RCTs is not Level I)
- Case-control overestimates diagnostic test accuracy (spectrum bias)
- Cannot establish causality from case series (no comparison group)
- Observational studies CAN provide high-quality evidence if very large effect