Computer Vision · Sep 2025

MediaPipe on iOS: what the docs leave out

By Harshit Diyora · Lead Mobile Developer · NexTurn India

MediaPipe · iOS · Swift · Computer Vision · On-device AI

MediaPipe Tasks for iOS is well-documented for the case where you already have a video frame in hand and want to get keypoints out of it. The documentation is thin on everything else — and everything else is most of the work.

Here is what I had to figure out the hard way while building CareSpace AI at NexTurn India.

AVCaptureSession setup the docs skip

MediaPipe's sample code starts with a CMSampleBuffer. It does not show you how to get one from an iPhone camera at the right resolution and frame rate for live inference. The setup matters — choose wrong and you add 15ms of latency that MediaPipe then gets blamed for.

func configureCaptureSession() throws -> AVCaptureSession {
    let session = AVCaptureSession()
    session.beginConfiguration()
    defer { session.commitConfiguration() }

    // 640x480 is the sweet spot for pose detection on iPhone 12+
    // 1280x720 adds ~8ms per frame with no accuracy gain for full-body pose
    session.sessionPreset = .vga640x480

    guard let device = AVCaptureDevice.default(
        .builtInWideAngleCamera, for: .video, position: .front
    ) else { throw CameraError.deviceUnavailable }

    let input = try AVCaptureDeviceInput(device: device)
    guard session.canAddInput(input) else { throw CameraError.inputUnavailable }
    session.addInput(input)

    // Set the frame rate after the input is attached: adding the input
    // (or changing the preset) resets the frame durations to their defaults.
    // lockForConfiguration() is required before changing device properties.
    try device.lockForConfiguration()
    device.activeVideoMinFrameDuration = CMTime(value: 1, timescale: 30)
    device.activeVideoMaxFrameDuration = CMTime(value: 1, timescale: 30)
    device.unlockForConfiguration()

    let output = AVCaptureVideoDataOutput()
    // kCVPixelFormatType_32BGRA is what MediaPipe expects
    // Other formats trigger a conversion step that adds ~3ms
    output.videoSettings = [
        kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
    ]
    output.alwaysDiscardsLateVideoFrames = true
    output.setSampleBufferDelegate(self, queue: processingQueue)

    guard session.canAddOutput(output) else { throw CameraError.outputUnavailable }
    session.addOutput(output)

    return session
}

Key decisions here: vga640x480 over hd1280x720 cuts per-frame processing time by ~40% with no meaningful accuracy loss for full-body pose. kCVPixelFormatType_32BGRA avoids the pixel format conversion that MediaPipe would otherwise do internally.
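To put the latency figures in context, here is a small sketch of the frame-budget arithmetic at the 30fps configured above. The function names are illustrative and not part of AVFoundation or MediaPipe; the 15ms overhead is the figure quoted earlier in this section.

```swift
import Foundation

/// Time available to process one frame, in milliseconds, at a given frame rate.
func frameBudgetMilliseconds(framesPerSecond: Double) -> Double {
    precondition(framesPerSecond > 0, "frame rate must be positive")
    return 1000.0 / framesPerSecond
}

/// Fraction of the per-frame budget consumed by a fixed per-frame overhead.
func budgetFraction(overheadMilliseconds: Double, framesPerSecond: Double) -> Double {
    overheadMilliseconds / frameBudgetMilliseconds(framesPerSecond: framesPerSecond)
}

// At 30fps each frame has roughly 33.3 ms; 15 ms of avoidable capture
// overhead uses 45% of that before inference has even started.
let budget = frameBudgetMilliseconds(framesPerSecond: 30)
let share = budgetFraction(overheadMilliseconds: 15, framesPerSecond: 30)
```

This is why a wrong preset or pixel format shows up as "MediaPipe is slow": the overhead comes out of the same budget as inference.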

GPU vs CPU delegate: the real tradeoff

MediaPipe pose detection runs on either the CPU or GPU delegate. The docs recommend GPU for performance, which is true on a plugged-in device. On a real device in a user's hand, the tradeoff is more nuanced.

The GPU delegate runs inference on the GPU, which has higher throughput but also higher power draw. On devices with less thermal headroom (iPhone 11, older iPads), sustained GPU inference at 30fps can trigger thermal throttling after 8–12 minutes, causing the frame rate to drop below the target.

The CPU delegate runs inference on the CPU cores; it does not use the Neural Engine. It is slower per frame, but in our testing it was far more thermally stable. For a physiotherapy app where sessions run 20–30 minutes, the CPU delegate gave us more consistent results across a broader device range:

let options = PoseLandmarkerOptions()
options.baseOptions.modelAssetPath = Bundle.main.path(
    forResource: "pose_landmarker_full", ofType: "task"
)!

// Use CPU delegate for sustained long-session inference
// Switch to GPU delegate only if benchmarking shows sustained throughput
// on your minimum supported device
options.baseOptions.delegate = .CPU

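options.runningMode = .liveStream
// Live stream mode delivers results through a delegate, which must be set
// before the landmarker is created (assumes self conforms to
// PoseLandmarkerLiveStreamDelegate)
options.poseLandmarkerLiveStreamDelegate = self
options.numPoses = 1
options.minPoseDetectionConfidence = 0.5
options.minPosePresenceConfidence = 0.5
options.minTrackingConfidence = 0.5
options.shouldOutputSegmentationMasks = true  // Enable for background rejection

let landmarker = try PoseLandmarker(options: options)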

Benchmark on your actual minimum supported device, not a simulator or the latest iPhone. The difference is not academic.
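One way to make that benchmark concrete is to log a sliding-window frame rate for the length of a real session and watch for the point where it sags. The sketch below is a minimal, framework-free meter; FrameRateMeter and its 5-second window are my own illustrative choices, not a MediaPipe API.

```swift
import Foundation

/// Sliding-window frame rate meter. Feed it one timestamp (in seconds)
/// per processed frame, for example from the result callback.
struct FrameRateMeter {
    private var timestamps: [TimeInterval] = []
    let window: TimeInterval

    init(window: TimeInterval = 5.0) {
        self.window = window
    }

    /// Records a frame and drops samples that fell out of the window.
    mutating func recordFrame(at time: TimeInterval) {
        timestamps.append(time)
        let cutoff = time - window
        timestamps.removeAll { $0 < cutoff }
    }

    /// Average frames per second across the window,
    /// or nil until at least two frames have been recorded.
    var framesPerSecond: Double? {
        guard let first = timestamps.first,
              let last = timestamps.last,
              last > first else { return nil }
        return Double(timestamps.count - 1) / (last - first)
    }
}

// Simulated 10 seconds of frames arriving at a steady 30fps.
var meter = FrameRateMeter(window: 5.0)
for frame in 0..<300 {
    meter.recordFrame(at: Double(frame) / 30.0)
}
let steadyRate = meter.framesPerSecond
```

Logging ProcessInfo.processInfo.thermalState next to the meter reading shows whether a drop in frame rate coincides with thermal pressure, which is the comparison that matters when choosing between delegates.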

Coordinate space mapping

MediaPipe returns landmarks in normalised image coordinates: x and y are fractions of the image width and height, ranging from 0.0 to 1.0. The image coordinate space does not match the screen coordinate space — and the mismatch is not just a scale factor.

For a front camera in portrait orientation, the image is typically rotated 90° relative to the screen. If you apply the MediaPipe coordinates directly to a SwiftUI overlay, your skeleton will be rotated and possibly mirrored.

func convertToScreenCoordinates(
    landmark: NormalizedLandmark,
    imageSize: CGSize,
    screenRect: CGRect,
    cameraPosition: AVCaptureDevice.Position
) -> CGPoint {
    var x = CGFloat(landmark.x)
    var y = CGFloat(landmark.y)

    // Front camera is mirrored — flip x
    if cameraPosition == .front {
        x = 1.0 - x
    }

    // MediaPipe image origin is top-left; CGPoint origin is also top-left
    // But camera feed in portrait is rotated: image width maps to screen height
    let screenX = y * screenRect.width  + screenRect.minX
    let screenY = x * screenRect.height + screenRect.minY

    return CGPoint(x: screenX, y: screenY)
}

The exact transform depends on your camera orientation, video gravity setting (resizeAspectFill vs resizeAspect), and whether you're overlaying on the preview layer or a separate view. Verify by placing a dot at landmark index 0 (nose) and confirming it tracks your nose in the live feed. If it's at the wrong position, work through the transform one step at a time.
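The video gravity step is the one most often missed, so here is a sketch of it in isolation. It assumes the landmark has already been rotated and mirrored into an upright image, and that the overlay view has the same bounds as the preview. VideoFit and mapToView are hypothetical names for illustration; the two cases correspond to the resizeAspectFill and resizeAspect gravities.

```swift
import Foundation

/// How the video is fitted into the view.
enum VideoFit {
    case aspectFill  // corresponds to .resizeAspectFill (video is cropped)
    case aspectFit   // corresponds to .resizeAspect (video is letterboxed)
}

/// Maps a normalised point (0...1 in both axes) in an upright image
/// to a point in the view's coordinate space.
func mapToView(
    normalizedPoint: CGPoint,
    uprightImageSize: CGSize,
    viewSize: CGSize,
    fit: VideoFit
) -> CGPoint {
    let scaleX = viewSize.width / uprightImageSize.width
    let scaleY = viewSize.height / uprightImageSize.height
    // Fill uses the larger scale and crops; fit uses the smaller and letterboxes.
    let scale = (fit == .aspectFill) ? max(scaleX, scaleY) : min(scaleX, scaleY)

    let scaledWidth = uprightImageSize.width * scale
    let scaledHeight = uprightImageSize.height * scale
    // The scaled video is centred, so the offset is negative on a cropped axis.
    let offsetX = (viewSize.width - scaledWidth) / 2
    let offsetY = (viewSize.height - scaledHeight) / 2

    return CGPoint(
        x: offsetX + normalizedPoint.x * scaledWidth,
        y: offsetY + normalizedPoint.y * scaledHeight
    )
}

// A 480x640 upright frame shown in a 390x844 view.
let image = CGSize(width: 480, height: 640)
let view = CGSize(width: 390, height: 844)
let centre = mapToView(
    normalizedPoint: CGPoint(x: 0.5, y: 0.5),
    uprightImageSize: image, viewSize: view, fit: .aspectFill
)
let topLeftFit = mapToView(
    normalizedPoint: CGPoint(x: 0, y: 0),
    uprightImageSize: image, viewSize: view, fit: .aspectFit
)
```

With aspect fill the image centre lands on the view centre, and points near the cropped edges map outside the view. With aspect fit the top-left corner lands below the top of the view because of the letterbox bar.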

The memory leak that takes 10 minutes to appear

This one took me three days to find. MediaPipe's live stream mode delivers results to a callback on a background thread. If that callback captures self strongly, and self holds a reference to the PoseLandmarker, you have a retain cycle that the Leaks instrument will not surface immediately. It only shows up after the PoseLandmarker has been deallocated and recreated several times (e.g., on screen transitions).

// Bad: strong capture causes retain cycle in live stream mode
try landmarker.detectAsync(
    image: mpImage,
    timestampInMilliseconds: timestamp
)
// ... and in the delegate:
func poseLandmarker(
    _ poseLandmarker: PoseLandmarker,
    didFinishDetection result: PoseLandmarkerResult?,
    timestampInMilliseconds: Int,
    error: Error?
) {
    self.handleResult(result)  // self captured strongly via the delegate reference
}

// Good: use weak capture and verify the object is still alive
// (the delegate protocol is an Objective-C protocol, so inherit from NSObject)
final class PoseDetectionController: NSObject, PoseLandmarkerLiveStreamDelegate {
    weak var resultsDelegate: PoseResultsDelegate?

    func poseLandmarker(
        _ poseLandmarker: PoseLandmarker,
        didFinishDetection result: PoseLandmarkerResult?,
        timestampInMilliseconds: Int,
        error: Error?
    ) {
        guard let result else { return }
        // Dispatch to main thread if updating UI
        Task { @MainActor [weak self] in
            self?.resultsDelegate?.didReceiveResults(result)
        }
    }
}

Always verify with the Allocations instrument in Instruments, filtering for PoseLandmarker. You should see exactly one instance alive during active inference. If the count grows over time, you have a leak.
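The mechanism is easier to see without MediaPipe in the way. The sketch below is a framework-free stand-in: FakeLandmarker plays the role of an object that stores a result callback, and Controller plays the role of the class that owns it. All names are hypothetical, and the counter exists only to observe whether deinit runs.

```swift
import Foundation

/// Counts how many controllers were actually deallocated.
final class Counter {
    var value = 0
}

/// Stand-in for an object that holds on to a result callback.
final class FakeLandmarker {
    var onResult: ((Int) -> Void)?
}

final class Controller {
    let landmarker = FakeLandmarker()
    let deinits: Counter
    private(set) var lastResult = 0

    init(captureWeakly: Bool, deinits: Counter) {
        self.deinits = deinits
        if captureWeakly {
            // Weak capture: the callback does not keep the controller alive.
            landmarker.onResult = { [weak self] value in
                self?.lastResult = value
            }
        } else {
            // Strong capture: controller -> landmarker -> callback -> controller.
            landmarker.onResult = { value in
                self.lastResult = value
            }
        }
    }

    deinit {
        deinits.value += 1
    }
}

/// Creates a controller, delivers one result, and lets it go out of scope.
func createAndRelease(captureWeakly: Bool, deinits: Counter) {
    let controller = Controller(captureWeakly: captureWeakly, deinits: deinits)
    controller.landmarker.onResult?(1)
}

let leakedCount = Counter()
createAndRelease(captureWeakly: false, deinits: leakedCount)  // deinit never runs

let freedCount = Counter()
createAndRelease(captureWeakly: true, deinits: freedCount)    // deinit runs once
```

The strong variant is what grows in Allocations on every screen transition: each recreated controller leaves its predecessor stranded in a cycle.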

Model selection

MediaPipe ships three pose models: Lite, Full, and Heavy. The docs are light on when to use each:

Lite: 30fps on iPhone 12 at 640×480 with CPU delegate. Use for fitness apps where speed matters more than per-joint precision.
Full: 20–25fps on iPhone 12 with CPU delegate. Meaningfully better landmark accuracy on hands and feet. This is what CareSpace AI uses for clinical posture assessment where ankle and wrist position matter.
Heavy: 10–15fps on iPhone 12 with CPU delegate. Use only if you need sub-pixel accuracy on extremity landmarks and have a high minimum device requirement.

Run benchmarks on your minimum supported device with your actual resolution and frame rate target. Numbers from a simulator or a top-of-range device are not useful for deployment decisions.
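Once you have measured numbers, the choice can be reduced to "the most accurate model that still meets the frame rate target". A minimal sketch of that rule follows. PoseModel and bestModel are illustrative names, and the sample figures are the lower bounds of the iPhone 12 ranges quoted above; substitute measurements from your own minimum supported device.

```swift
import Foundation

/// The three pose models, named after their bundled .task files.
enum PoseModel: String, CaseIterable {
    case lite = "pose_landmarker_lite"
    case full = "pose_landmarker_full"
    case heavy = "pose_landmarker_heavy"

    /// Higher rank means a more accurate, slower model.
    var accuracyRank: Int {
        switch self {
        case .lite: return 0
        case .full: return 1
        case .heavy: return 2
        }
    }
}

/// Returns the most accurate model whose measured frame rate meets the target,
/// or nil if none does. Models without a measurement are treated as too slow.
func bestModel(meeting targetFPS: Double, measuredFPS: [PoseModel: Double]) -> PoseModel? {
    PoseModel.allCases
        .filter { (measuredFPS[$0] ?? 0) >= targetFPS }
        .max { $0.accuracyRank < $1.accuracyRank }
}

// Lower bounds of the iPhone 12 / CPU delegate figures from this article.
let measured: [PoseModel: Double] = [.lite: 30, .full: 20, .heavy: 10]
let clinicalChoice = bestModel(meeting: 20, measuredFPS: measured)
let fitnessChoice = bestModel(meeting: 30, measuredFPS: measured)
```

The raw value doubles as the resource name for Bundle.main.path(forResource:ofType:), so the selected case can feed straight into modelAssetPath.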

Testing without a camera

The simulator has no camera. Testing MediaPipe inference in the simulator requires injecting video frames from a file. Here is a minimal setup:

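#if targetEnvironment(simulator)
import AVFoundation

func injectVideoFrame(named resourceName: String) throws {
    guard let url = Bundle.main.url(
        forResource: resourceName, withExtension: "mp4"
    ) else { return }

    let asset = AVAsset(url: url)
    // Bail out instead of crashing if the clip has no video track
    guard let track = asset.tracks(withMediaType: .video).first else { return }

    let reader = try AVAssetReader(asset: asset)
    let output = AVAssetReaderTrackOutput(
        track: track,
        outputSettings: [
            kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
        ]
    )
    guard reader.canAdd(output) else { return }
    reader.add(output)
    guard reader.startReading() else { return }

    // Feed frames to your MediaPipe pipeline the same way the camera does.
    // process(_:) is a placeholder for whatever your capture delegate calls.
    while let sampleBuffer = output.copyNextSampleBuffer() {
        process(sampleBuffer)
    }
}
#endif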

Record a short test clip on a real device covering your typical use case (person in frame, common background environments) and use it consistently. This makes it possible to test the MediaPipe pipeline in CI without a physical device in the loop.
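One detail to keep in mind when replaying a file: detectAsync takes a timestampInMilliseconds, and in live stream mode those timestamps are expected to increase from frame to frame. The buffer's own presentation timestamp works for that. If you would rather have fully deterministic values in CI, a sketch like the following derives them from the frame index; the function name and the fixed 30fps are assumptions for illustration.

```swift
import Foundation

/// Deterministic timestamp in milliseconds for the n-th injected frame.
/// Strictly increasing as long as the frame rate is below 1000fps.
func syntheticTimestampMilliseconds(frameIndex: Int, framesPerSecond: Double) -> Int {
    precondition(frameIndex >= 0, "frame index must not be negative")
    precondition(framesPerSecond > 0, "frame rate must be positive")
    return Int((Double(frameIndex) * 1000.0 / framesPerSecond).rounded())
}

// First four frames of a 30fps clip: 0, 33, 67 and 100 ms.
let firstTimestamps = (0..<4).map {
    syntheticTimestampMilliseconds(frameIndex: $0, framesPerSecond: 30)
}
```

Deterministic timestamps make test runs reproducible, since the tracking path sees the same frame spacing on every run.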