MediaPipe's BlazePose model is impressive when the lighting is good and the background is clean. Ship it into a real-world app — a gym, a living room with patterned wallpaper, a clinic with reflective floors — and it starts to embarrass you. Landmark confidence drops, the skeleton flickers, and the posture score your clinical team trusted becomes noise.
This is the problem I spent six weeks solving at NexTurn India for CareSpace AI. Here's what I learned.
Why backgrounds break pose detection
BlazePose uses a two-stage pipeline: a person detector to find a bounding box, then a landmark regression model that normalises a crop of that box and predicts 33 keypoints. When a background has strong horizontal and vertical edges — bookshelves, windows, tiled walls — the detector can either mis-classify background as foreground or lose tracking entirely between frames.
Even when the bounding box is mostly correct, small frame-to-frame drift in the detector changes the normalised crop enough to shift where the regression model thinks the shoulder, hip, or knee sits. Twenty pixels of bounding box drift on a 1080p frame translates to visually significant landmark jitter.
Three failure modes appear repeatedly in production:
- Ghost detection — the detector finds a bounding box on a high-contrast background region, returning landmark positions in empty space.
- Landmark drift — the bounding box is mostly correct but unstable, causing landmarks to oscillate 10–30px between frames even though the person is still.
- Confidence collapse — the model returns landmarks but with visibility scores below 0.5, indicating it knows something is wrong without knowing what.
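Landmark drift is easier to reason about once it is measured. The following is a minimal sketch, assuming landmarks arrive as normalised 0...1 coordinates and that the subject is known to be standing still; `NormalisedPoint` and `meanJitterPixels` are illustrative names, not MediaPipe types.

```swift
import Foundation

/// Hypothetical stand-in for a landmark with normalised 0...1 coordinates.
struct NormalisedPoint {
    var x: Float
    var y: Float
}

/// Mean frame-to-frame displacement, in pixels, across paired landmarks.
/// Only meaningful as a jitter measure while the subject is stationary.
func meanJitterPixels(previous: [NormalisedPoint],
                      current: [NormalisedPoint],
                      frameWidth: Float,
                      frameHeight: Float) -> Float {
    var total: Float = 0
    var count = 0
    for (p, c) in zip(previous, current) {
        // Convert the normalised delta to pixels on each axis.
        let dx = (c.x - p.x) * frameWidth
        let dy = (c.y - p.y) * frameHeight
        total += (dx * dx + dy * dy).squareRoot()
        count += 1
    }
    return count > 0 ? total / Float(count) : 0
}
```

Logging this value on footage of a motionless subject gives a baseline for how much of the 10–30px oscillation comes from the pipeline rather than the person.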
Fix 1: Hard gate on landmark visibility
Every MediaPipe landmark carries a visibility score between 0 and 1. The docs mention it but treat it as informational. It is your first line of defence.
For posture analysis you need at minimum 9 core landmarks: both shoulders, both hips, both knees, both ankles, and the nose. Before computing any metric, gate on the minimum visibility across these landmarks:
func processResult(_ result: PoseLandmarkerResult) {
    // Indices: 0=nose, 11/12=shoulders, 23/24=hips, 25/26=knees, 27/28=ankles
    let coreLandmarks = [0, 11, 12, 23, 24, 25, 26, 27, 28]
    guard let pose = result.landmarks.first else { return }
    // `pose[safe:]` is a custom bounds-checked subscript, not part of the SDK.
    let visibilities = coreLandmarks
        .compactMap { pose[safe: $0]?.visibility?.floatValue }
    // A missing landmark or missing score counts as a failed gate.
    let minVisibility = visibilities.count == coreLandmarks.count
        ? (visibilities.min() ?? 0)
        : 0
    guard minVisibility > 0.65 else {
        showCalibrationPrompt()
        return
    }
    computePostureMetrics(from: pose)
}
The 0.65 threshold is empirical. In controlled conditions, valid landmarks read ≥ 0.90. Below 0.65 the metric is misleading. Between 0.65 and 0.90, show a “please adjust your camera” prompt rather than discarding frames silently: users need feedback, not just a frozen UI.
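The three visibility bands described here can be made explicit. This is a sketch only: `FrameQuality` and `classify` are hypothetical names, and treating exactly 0.65 as the start of the warning band is an assumption.

```swift
/// Illustrative three-way outcome for a frame's minimum core-landmark visibility.
enum FrameQuality: Equatable {
    case reject   // below the hard gate: do not compute metrics
    case warn     // usable, but ask the user to adjust the camera
    case accept   // clean signal
}

/// Maps a minimum visibility score to a band.
/// The defaults mirror the 0.65 and 0.90 figures quoted in the text.
func classify(minVisibility: Float,
              rejectBelow: Float = 0.65,
              cleanAt: Float = 0.90) -> FrameQuality {
    if minVisibility < rejectBelow { return .reject }
    if minVisibility < cleanAt { return .warn }
    return .accept
}
```

Returning a band instead of a Boolean lets the UI layer decide how to phrase the prompt, without the gating code needing to know about views.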
Fix 2: Temporal smoothing with exponential moving average
A per-frame confidence gate eliminates ghost detections but does nothing for landmark jitter. When the bounding box drifts slightly between frames, individual landmark positions oscillate even when the person is stationary. A 5° knee angle fluctuation repeating 15 times a second makes your posture score meaningless.
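To see how positional jitter turns into angle jitter, consider how a knee angle is typically derived from three landmarks. The sketch below is one common formulation, not necessarily the one used in production; `PixelPoint` and `jointAngleDegrees` are illustrative names, and it assumes coordinates have already been converted to pixels so the aspect ratio does not distort the angle.

```swift
import Foundation

/// Hypothetical 2D point in pixel coordinates.
struct PixelPoint {
    var x: Double
    var y: Double
}

/// Unsigned angle at `b`, in degrees (0...180), between the segments b→a and b→c.
/// For a knee: a = hip, b = knee, c = ankle.
func jointAngleDegrees(a: PixelPoint, vertex b: PixelPoint, c: PixelPoint) -> Double {
    let v1 = (x: a.x - b.x, y: a.y - b.y)
    let v2 = (x: c.x - b.x, y: c.y - b.y)
    let dot = v1.x * v2.x + v1.y * v2.y
    let cross = v1.x * v2.y - v1.y * v2.x
    // atan2 of cross and dot is numerically safer than acos near 0° and 180°.
    return abs(atan2(cross, dot)) * 180 / .pi
}
```

Because the angle depends on three landmarks at once, independent jitter on each of them compounds, which is why smoothing the landmarks before computing angles matters.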
Exponential moving average (EMA) is the right tool. It needs only the previous smoothed value, adds little latency, and weights recent frames more heavily than a sliding-window average does, which suits live inference:
final class LandmarkSmoother {
    private var smoothed: [NormalizedLandmark] = []
    // Higher alpha = faster response, less smoothing
    // 0.30 for slow rehabilitation exercises, 0.50 for sports
    private let alpha: Float = 0.35
    func smooth(_ raw: [NormalizedLandmark]) -> [NormalizedLandmark] {
        // Seed (or re-seed) when there is no history or the landmark count changed;
        // zip would otherwise silently truncate the result.
        guard !smoothed.isEmpty, smoothed.count == raw.count else { smoothed = raw; return raw }
        smoothed = zip(smoothed, raw).map { prev, curr in
            // Match this initializer to your MediaPipe version; some releases
            // also expect a `presence:` argument.
            NormalizedLandmark(
                x: alpha * curr.x + (1 - alpha) * prev.x,
                y: alpha * curr.y + (1 - alpha) * prev.y,
                z: alpha * (curr.z ?? 0) + (1 - alpha) * (prev.z ?? 0),
                visibility: curr.visibility
            )
        }
        return smoothed
    }
}
Alpha of 0.35 works well for physiotherapy sessions where patients move slowly. Tune against actual footage from your deployment environment: alpha is the one parameter where measured data beats intuition.
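When tuning alpha, it helps to know how much lag a given value implies. The sketch below applies the same EMA update to a single scalar and estimates how many frames the filter needs to catch up after a sudden change; `ScalarEMA` and `framesToSettle` are illustrative names, and both assume alpha lies strictly between 0 and 1.

```swift
import Foundation

/// Illustrative scalar EMA; alpha must lie strictly between 0 and 1.
struct ScalarEMA {
    let alpha: Double
    private(set) var value: Double?

    init(alpha: Double) { self.alpha = alpha }

    /// Feeds one sample and returns the smoothed value.
    mutating func update(_ sample: Double) -> Double {
        // The first sample seeds the filter, as in LandmarkSmoother.
        let next = value.map { alpha * sample + (1 - alpha) * $0 } ?? sample
        value = next
        return next
    }
}

/// Frames needed after a step change before the output is within
/// `tolerance` (a fraction of the step, between 0 and 1) of the new value.
func framesToSettle(alpha: Double, within tolerance: Double) -> Int {
    // The residual error after n updates is (1 - alpha)^n.
    Int((log(tolerance) / log(1 - alpha)).rounded(.up))
}

let settleFrames = framesToSettle(alpha: 0.35, within: 0.10)
print("alpha 0.35 settles within 10% after \(settleFrames) frames")
```

With alpha at 0.35 the residual after six updates is 0.65^6, roughly 7.5% of the step, so the filter trails a sudden movement by about 200ms at 30fps. That is one way to sanity-check a candidate alpha before testing it on footage.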
Fix 3: Segmentation mask as a confidence proxy
MediaPipe Tasks outputs a per-pixel segmentation mask alongside pose landmarks when you enable outputSegmentationMasks in your PoseLandmarkerOptions (the iOS Swift API names this property shouldOutputSegmentationMasks). This is the most underused tool in the framework.
The mask indicates which pixels belong to the detected person. If fewer than 45% of pixels inside the detector's bounding box are classified as “person”, the bounding box is mostly background — discard the result:
func maskCoverage(mask: MPPMask, box: CGRect) -> Float {
    // `box` is in normalised 0...1 coordinates. Index with the mask's own
    // dimensions, which may differ from the camera frame size.
    let buffer = mask.uint8Data // row-major, 0 = background, 255 = person
    let width = mask.width, height = mask.height
    // Clamp to the mask bounds so a box touching the edge cannot read out of range.
    let x0 = max(0, Int(box.minX * CGFloat(width)))
    let x1 = min(width, Int(box.maxX * CGFloat(width)))
    let y0 = max(0, Int(box.minY * CGFloat(height)))
    let y1 = min(height, Int(box.maxY * CGFloat(height)))
    guard x0 < x1, y0 < y1 else { return 0 }
    var personPixels = 0
    for y in y0..<y1 {
        for x in x0..<x1 where buffer[y * width + x] > 128 {
            personPixels += 1
        }
    }
    return Float(personPixels) / Float((x1 - x0) * (y1 - y0))
}
// In your result handler (detectedBox is a normalised person box you supply):
if let mask = result.segmentationMasks?.first {
    guard maskCoverage(mask: mask, box: detectedBox) > 0.45 else {
        return // Background leaking into bounding box: unreliable
    }
}
Fix 4: Hysteresis to prevent flickering UI
Hard confidence gates create a new problem: the analysis flickers on and off as a landmark oscillates around the threshold. Users see the posture panel appear and disappear rapidly, which destroys trust.
Hysteresis solves this: require a higher confidence to enter tracking state than to stay in it.
final class TrackingStateMachine {
    enum State { case lost, tracking }
    private(set) var state: State = .lost
    private var streak = 0
    // Enter tracking: 5 good frames at threshold 0.75
    // Exit tracking: 8 bad frames at threshold 0.55
    func update(confidence: Float) {
        switch state {
        case .lost:
            streak = confidence >= 0.75 ? streak + 1 : 0
            if streak >= 5 { state = .tracking; streak = 0 }
        case .tracking:
            streak = confidence < 0.55 ? streak + 1 : 0
            if streak >= 8 { state = .lost; streak = 0 }
        }
    }
}
At 30fps: 5 frames ≈ 167ms to enter tracking, 8 frames ≈ 267ms before giving up. These numbers feel instant to the user while preventing the flicker. Adjust the frame counts if your camera runs at a different rate.
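One way to keep the perceived timing constant across devices is to derive the frame counts from target durations instead of hard-coding them. A minimal sketch follows; `frameCount(forMilliseconds:fps:)` is a hypothetical helper, not part of MediaPipe.

```swift
/// Converts a target duration into a whole number of frames at a given rate.
/// Always returns at least 1 so a threshold can never be satisfied by zero frames.
func frameCount(forMilliseconds ms: Double, fps: Double) -> Int {
    max(1, Int((ms / 1000 * fps).rounded()))
}

// Reproduces the 30fps figures from the text, then rescales them for 60fps.
let enterFramesAt30 = frameCount(forMilliseconds: 167, fps: 30)
let exitFramesAt30 = frameCount(forMilliseconds: 267, fps: 30)
let enterFramesAt60 = frameCount(forMilliseconds: 167, fps: 60)
let exitFramesAt60 = frameCount(forMilliseconds: 267, fps: 60)
print(enterFramesAt30, exitFramesAt30, enterFramesAt60, exitFramesAt60)
```

Passing these counts into the state machine at start-up, using the camera's configured frame rate, keeps the enter and exit delays near 167ms and 267ms regardless of device.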
Results in production
After deploying these four fixes in CareSpace AI, the “valid frame rate” — the percentage of camera frames that passed all confidence gates and contributed to posture metrics — improved from 51% to 87% across a test set of 40 clinic environments with varied backgrounds and lighting.
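The valid frame rate is simple to instrument yourself. The counter below is a sketch of how such a metric could be tracked, not the instrumentation used in CareSpace AI; `ValidFrameCounter` is an illustrative name.

```swift
/// Illustrative running tally of frames that passed every confidence gate.
struct ValidFrameCounter {
    private(set) var total = 0
    private(set) var valid = 0

    /// Call once per camera frame, after all gates have been evaluated.
    mutating func record(passedAllGates: Bool) {
        total += 1
        if passedAllGates { valid += 1 }
    }

    /// Percentage of recorded frames that contributed to posture metrics.
    var validPercentage: Double {
        total == 0 ? 0 : 100 * Double(valid) / Double(total)
    }
}
```

Recording this per session and per environment makes it possible to compare backgrounds directly and to confirm that a threshold change helped.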
The segmentation mask check (Fix 3) delivered the largest single improvement, cutting false positives in textured-background environments by 60%. The hysteresis state machine (Fix 4) had the greatest UX impact: users stopped seeing flickering analysis panels and, in follow-up interviews, reported measurably higher trust in the posture readings.
MediaPipe is a capable framework. The gap between “it works in the demo” and “it works at a physio clinic in Gujarat” is almost entirely about handling real-world signal quality. Gate aggressively, smooth thoughtfully, and your confidence in the output will match the user's.