iOS · Vision

Pose detection that survives a noisy background

MediaPipe's BlazePose model is impressive when the lighting is good and the background is clean. Ship it into a real-world app — a gym, a living room with patterned wallpaper, a clinic with reflective floors — and it starts to embarrass you. Landmark confidence drops, the skeleton flickers, and the posture score your clinical team trusted becomes noise.

This is the problem I spent six weeks solving at NexTurn India for CareSpace AI. Here's what I learned.

Why backgrounds break pose detection

BlazePose uses a two-stage pipeline: a person detector to find a bounding box, then a landmark regression model that normalises a crop of that box and predicts 33 keypoints. When a background has strong horizontal and vertical edges — bookshelves, windows, tiled walls — the detector can either mis-classify background as foreground or lose tracking entirely between frames.

Even when the bounding box is mostly correct, small frame-to-frame drift in the detector changes the normalised crop enough to shift where the regression model thinks the shoulder, hip, or knee sits. Twenty pixels of bounding box drift on a 1080p frame translates to visually significant landmark jitter.

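Three failure modes appear repeatedly in production: ghost detections, where background is classified as a person; lost tracking between frames; and landmark jitter caused by bounding box drift.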

Fix 1: Hard gate on landmark visibility

Every MediaPipe landmark carries a visibility score between 0 and 1. The docs mention it but treat it as informational. It is your first line of defence.

For posture analysis you need at minimum nine core landmarks: both shoulders, both hips, both knees, both ankles, and the nose. Before computing any metric, gate on the minimum visibility across these landmarks:

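import MediaPipeTasksVision

// Bounds-checked subscript used below; it is not part of the standard library.
extension Array {
    subscript(safe index: Int) -> Element? {
        indices.contains(index) ? self[index] : nil
    }
}

func processResult(_ result: PoseLandmarkerResult) {
    // Indices: 0=nose, 11/12=shoulders, 23/24=hips, 25/26=knees, 27/28=ankles
    let coreLandmarks = [0, 11, 12, 23, 24, 25, 26, 27, 28]
    guard let pose = result.landmarks.first else { return }

    // A missing landmark or a nil visibility counts as 0, so it fails the gate
    let minVisibility = coreLandmarks
        .map { pose[safe: $0]?.visibility?.floatValue ?? 0 }
        .min() ?? 0

    guard minVisibility > 0.65 else {
        showCalibrationPrompt()
        return
    }
    computePostureMetrics(from: pose)
}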

The 0.65 threshold is empirical. In controlled conditions, valid landmarks read ≥ 0.90. Below 0.65 the metric is misleading, which is why the gate above rejects the frame and shows a prompt. Between 0.65 and 0.90 the metrics are still computed, but it is worth showing a “please adjust your camera” prompt there too. Either way, avoid discarding frames silently: users need feedback, not just a frozen UI.
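The two thresholds above suggest three bands rather than a single gate. Here is a minimal sketch of that classification, using plain Floats so it runs without MediaPipe. The `VisibilityBand` enum and `classify` function are hypothetical names, not part of any SDK, and treating exactly 0.65 as rejected is an assumption that mirrors the strict `>` in the gate:

```swift
// Hypothetical helper: names and structure are illustrative, not MediaPipe API.
enum VisibilityBand: Equatable {
    case reject      // min visibility <= 0.65: skip metrics, prompt the user
    case marginal    // above 0.65 but below 0.90: compute metrics, suggest adjusting the camera
    case good        // 0.90 and above: compute metrics silently
}

func classify(visibilities: [Float],
              rejectAtOrBelow: Float = 0.65,
              goodFrom: Float = 0.90) -> VisibilityBand {
    // An empty input means no landmarks were available, so treat it as a rejection
    guard let minVisibility = visibilities.min() else { return .reject }
    if minVisibility <= rejectAtOrBelow { return .reject }
    return minVisibility >= goodFrom ? .good : .marginal
}
```

Keeping the band separate from the UI makes it easy to log how often real sessions land in the marginal range before deciding what the prompt should say.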

Fix 2: Temporal smoothing with exponential moving average

A per-frame confidence gate eliminates ghost detections but does nothing for landmark jitter. When the bounding box drifts slightly between frames, individual landmark positions oscillate even when the person is stationary. A 5° knee angle fluctuation repeating 15 times a second makes your posture score meaningless.

Exponential moving average (EMA) is the right tool. It needs only the previous smoothed value for each landmark, so it adds little latency and no frame buffer, which makes it a better fit for live inference than a sliding window average:

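import MediaPipeTasksVision

final class LandmarkSmoother {
    private var smoothed: [NormalizedLandmark] = []
    // Higher alpha = faster response, less smoothing
    // 0.30 for slow rehabilitation exercises, 0.50 for sports
    private let alpha: Float = 0.35

    // Call when tracking is lost so stale positions are not blended into a new pose
    func reset() { smoothed = [] }

    func smooth(_ raw: [NormalizedLandmark]) -> [NormalizedLandmark] {
        // Seed with the first frame, or re-seed if the landmark count changes
        guard smoothed.count == raw.count else { smoothed = raw; return raw }
        smoothed = zip(smoothed, raw).map { prev, curr in
            // x, y and z are non-optional Floats. The initializer signature
            // (including presence) should be checked against your SDK version.
            NormalizedLandmark(
                x: alpha * curr.x + (1 - alpha) * prev.x,
                y: alpha * curr.y + (1 - alpha) * prev.y,
                z: alpha * curr.z + (1 - alpha) * prev.z,
                visibility: curr.visibility,
                presence: curr.presence
            )
        }
        return smoothed
    }
}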

Alpha of 0.35 works well for physiotherapy sessions where patients move slowly. Tune against actual footage from your deployment environment — alpha is the one parameter where instrument data beats intuition.
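To get a feel for what a given alpha does before touching real footage, it can help to run the same filter over a single number. The sketch below is an illustration under stated assumptions, not the production smoother: `ScalarEMA` is a hypothetical type, and the input is a synthetic knee angle that alternates ±5° around 90° on every frame, which is the most favourable case for the filter.

```swift
// Hypothetical scalar version of the smoother, for tuning alpha offline.
struct ScalarEMA {
    let alpha: Double
    private(set) var value: Double?

    init(alpha: Double) {
        self.alpha = alpha
        self.value = nil
    }

    mutating func update(_ sample: Double) -> Double {
        // The first sample seeds the filter; later samples are blended in
        let next = value.map { alpha * sample + (1 - alpha) * $0 } ?? sample
        value = next
        return next
    }
}

// Synthetic knee angle: a stationary 90° joint with ±5° frame-to-frame jitter
let rawAngles: [Double] = (0..<30).map { $0 % 2 == 0 ? 95 : 85 }

var filter = ScalarEMA(alpha: 0.35)
let smoothedAngles = rawAngles.map { filter.update($0) }

// Peak-to-peak swing over the last ten frames, once the filter has settled
func swing(_ xs: ArraySlice<Double>) -> Double { (xs.max() ?? 0) - (xs.min() ?? 0) }
let rawSwing = swing(rawAngles.suffix(10))
let smoothedSwing = swing(smoothedAngles.suffix(10))
```

With this input the settled swing drops from 10° to a little over 2°. Slower jitter is attenuated less, which is one more reason to tune against recorded sessions rather than synthetic data.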

Fix 3: Segmentation mask as a confidence proxy

MediaPipe Tasks outputs a per-pixel segmentation mask alongside pose landmarks when you enable mask output in your PoseLandmarkerOptions (shouldOutputSegmentationMasks = true in the iOS SDK; outputSegmentationMasks in the web API). This is the most underused tool in the framework.

The mask indicates which pixels belong to the detected person. If fewer than 45% of pixels inside the detector's bounding box are classified as “person”, the bounding box is mostly background — discard the result:

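// Property names (width, height, uint8Data) follow the iOS SDK's Mask type;
// verify them against the SDK version you ship.
func maskCoverage(mask: Mask, box: CGRect) -> Float {
    // `box` is in normalised [0, 1] coordinates. The mask has its own resolution,
    // which can differ from the camera frame, so index with mask.width/height.
    let width = mask.width, height = mask.height
    let buffer = mask.uint8Data  // row-major, 0 = background, 255 = person

    // Clamp to the mask bounds so a box touching the frame edge cannot read out of range
    let x0 = max(0, Int(box.minX * CGFloat(width)))
    let x1 = min(width, Int(box.maxX * CGFloat(width)))
    let y0 = max(0, Int(box.minY * CGFloat(height)))
    let y1 = min(height, Int(box.maxY * CGFloat(height)))
    guard x0 < x1, y0 < y1 else { return 0 }

    var personPixels = 0
    for y in y0..<y1 {
        for x in x0..<x1 where buffer[y * width + x] > 128 {
            personPixels += 1
        }
    }
    return Float(personPixels) / Float((x1 - x0) * (y1 - y0))
}

// In your result handler. `detectedBox` is a normalised person box that you supply.
if let mask = result.segmentationMasks?.first {
    guard maskCoverage(mask: mask, box: detectedBox) > 0.45 else {
        return  // Background leaking into bounding box: unreliable
    }
}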
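The handler above assumes a `detectedBox` is available. If your pipeline does not expose one, a possible stand-in is the padded extent of the landmarks themselves. This is a minimal sketch with a hypothetical `Point2D` type and `personBox` function, neither of which comes from MediaPipe, and the 10% padding is an arbitrary assumption to tune:

```swift
import Foundation

// Hypothetical stand-in for a normalised landmark; not a MediaPipe type.
struct Point2D {
    let x: Double
    let y: Double
}

/// Returns the padded extent of `points`, clamped to the unit square,
/// or nil when there are no points to enclose.
func personBox(around points: [Point2D], padding: Double = 0.10) -> CGRect? {
    guard let minX = points.map(\.x).min(), let maxX = points.map(\.x).max(),
          let minY = points.map(\.y).min(), let maxY = points.map(\.y).max()
    else { return nil }

    // Pad by a fraction of the extent, since landmarks sit inside the body outline
    let padX = (maxX - minX) * padding
    let padY = (maxY - minY) * padding

    let x0 = max(0, minX - padX), x1 = min(1, maxX + padX)
    let y0 = max(0, minY - padY), y1 = min(1, maxY + padY)
    return CGRect(x: x0, y: y0, width: x1 - x0, height: y1 - y0)
}
```

A box built from the landmarks inherits their errors, so treat the coverage check as a sanity test on the pair rather than an independent measurement.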

Fix 4: Hysteresis to prevent flickering UI

Hard confidence gates create a new problem: the analysis flickers on and off as a landmark oscillates around the threshold. Users see the posture panel appear and disappear rapidly, which destroys trust.

Hysteresis solves this: require a higher confidence to enter tracking state than to stay in it.

final class TrackingStateMachine {
    enum State { case lost, tracking }

    private(set) var state: State = .lost
    private var streak = 0

    // Enter tracking: 5 good frames at threshold 0.75
    // Exit tracking: 8 bad frames at threshold 0.55
    func update(confidence: Float) {
        switch state {
        case .lost:
            streak = confidence >= 0.75 ? streak + 1 : 0
            if streak >= 5 { state = .tracking; streak = 0 }
        case .tracking:
            streak = confidence < 0.55 ? streak + 1 : 0
            if streak >= 8 { state = .lost; streak = 0 }
        }
    }
}

At 30 fps, 5 frames ≈ 167 ms to enter tracking and 8 frames ≈ 267 ms before giving up. These numbers feel instant to the user while preventing the flicker. Adjust the frame counts if your camera runs at a different rate.
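One way to keep those durations stable across cameras is to derive the frame counts from a target time instead of hard-coding them. This is a small sketch; the `frames(forMilliseconds:atFPS:)` helper is a hypothetical name, and rounding to the nearest frame is an assumption:

```swift
// Hypothetical helper: converts a target duration into a whole number of frames.
func frames(forMilliseconds ms: Double, atFPS fps: Double) -> Int {
    // Round to the nearest frame, but never go below one frame
    max(1, Int((ms / 1000 * fps).rounded()))
}

let enterAt30 = frames(forMilliseconds: 167, atFPS: 30)
let exitAt30 = frames(forMilliseconds: 267, atFPS: 30)
let enterAt60 = frames(forMilliseconds: 167, atFPS: 60)
let exitAt60 = frames(forMilliseconds: 267, atFPS: 60)
```

At 30 fps this reproduces the 5 and 8 frames used above; at 60 fps the same durations become 10 and 16 frames.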

Results in production

After deploying these four fixes in CareSpace AI, the “valid frame rate” — the percentage of camera frames that passed all confidence gates and contributed to posture metrics — improved from 51% to 87% across a test set of 40 clinic environments with varied backgrounds and lighting.

The segmentation mask check (Fix 3) delivered the largest single improvement, cutting false positives in textured-background environments by 60%. The hysteresis state machine (Fix 4) had the greatest UX impact: users stopped seeing flickering analysis panels and, in follow-up interviews, reported greater trust in the posture readings.

MediaPipe is a capable framework. The gap between “it works in the demo” and “it works at a physio clinic in Gujarat” is almost entirely about handling real-world signal quality. Gate aggressively, smooth thoughtfully, and your confidence in the output will match the user's.