Part 1 | Deep Dive: iOS Real-Time Screen Capture & VideoToolbox Hardware Encoding

When building a high-throughput, ultra-low-latency remote assistance system, screen capture and video compression are the most critical links. To achieve sub-150ms end-to-end latency, the source pipeline must avoid unnecessary memory copies and wasted CPU cycles.

The iOS ecosystem presents developers with strict constraints due to sandboxing and resource management. For instance, the Broadcast Upload Extension (the system-wide screen capture background process) is capped at a strict 50MB memory threshold. Exceeding this limit causes the OS to immediately terminate the process.

In this deep dive, we explore how the Easy Connect Suite leverages native ReplayKit and VideoToolbox frameworks on iOS to establish a zero-copy, low-latency screen capture and encoding pipeline.

1. The Screen Capture Framework: In-App vs. Broadcast Extensions

On iOS, we support two separate capture modes depending on the remote assistance scenario:

                ┌────────────────────────────────────────┐
                │             iOS Screen Capture         │
                └───────────────────┬────────────────────┘
                                    │
                  ┌─────────────────┴─────────────────┐
                  ▼                                   ▼
          [ In-App Capture ]                  [ System Broadcast ]
        Uses RPScreenRecorder               Uses Broadcast Extension
        Captures host window                Captures full OS desktop
        No extension RAM cap                Strict 50MB RAM limit

In-App Capture

For lightweight support sessions limited to our host application, we record using the ReplayKit RPScreenRecorder singleton. The capture runs directly inside the host process, so it needs no IPC (Inter-Process Communication) and is not subject to the 50MB extension limit.
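As a rough illustration of the in-app path, the sketch below starts a capture with RPScreenRecorder and forwards only video frames. The function name `startInAppCapture` and the `onFrame` callback are hypothetical names chosen for this example, not part of the Easy Connect code base. The `#if os(iOS)` guard only keeps the snippet compilable on other platforms.

```swift
#if os(iOS)
import ReplayKit
import CoreMedia

/// Starts in-app capture and forwards video frames to `onFrame`.
/// `onFrame` is a hypothetical callback; in practice it would feed the encoder.
func startInAppCapture(onFrame: @escaping (CVPixelBuffer, CMTime) -> Void) {
    let recorder = RPScreenRecorder.shared()
    guard recorder.isAvailable else { return }
    recorder.isMicrophoneEnabled = false

    recorder.startCapture(handler: { sampleBuffer, bufferType, error in
        // Skip audio buffers and anything delivered together with an error.
        guard error == nil, bufferType == .video,
              let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        onFrame(pixelBuffer, CMSampleBufferGetPresentationTimeStamp(sampleBuffer))
    }, completionHandler: { error in
        // Called once, after the capture has started or failed to start.
        if let error = error {
            print("startCapture failed: \(error.localizedDescription)")
        }
    })
}
#endif
```

The handler closure is not guaranteed to run on the main thread, so the callback should avoid touching UI state directly.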

System-Wide Capture

When a user shares their entire screen (including other apps and the iOS home screen), capture moves into a Broadcast Upload Extension, which we implement as an RPBroadcastSampleHandler subclass.

This runs as a separate background process:

Initiation: The user launches screen recording from the iOS Control Center and selects our extension.
Frame Delivery: The OS captures the display and passes raw frames as CMSampleBuffer structures to our handler's callback.
Low-Footprint Routing: Because of the 50MB RAM threshold, we cannot perform heavy pixel copies or software compression inside the extension. The frames must be fed directly to the hardware encoder or piped to the host process via local Unix sockets.
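The delivery step above can be sketched as a minimal RPBroadcastSampleHandler subclass. The `FrameSink` protocol and the `sink` property are hypothetical names introduced for this example; they stand in for whatever forwards frames to the encoder or the socket. The `#if os(iOS)` guard only keeps the snippet compilable on other platforms.

```swift
#if os(iOS)
import ReplayKit
import CoreMedia

/// Hypothetical consumer of captured frames (an encoder wrapper, for instance).
protocol FrameSink: AnyObject {
    func consume(_ pixelBuffer: CVPixelBuffer, pts: CMTime)
}

final class SampleHandler: RPBroadcastSampleHandler {
    // Assumption: set up once per broadcast, in broadcastStarted.
    var sink: FrameSink?

    override func broadcastStarted(withSetupInfo setupInfo: [String: NSObject]?) {
        // Create the encoder or socket connection here and assign `sink`.
    }

    override func processSampleBuffer(_ sampleBuffer: CMSampleBuffer,
                                      with sampleBufferType: RPSampleBufferType) {
        // Only video frames are forwarded; audio is ignored in this sketch.
        guard sampleBufferType == .video,
              CMSampleBufferIsValid(sampleBuffer),
              let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        // Pass the buffer reference along; no pixel data is copied here.
        sink?.consume(pixelBuffer,
                      pts: CMSampleBufferGetPresentationTimeStamp(sampleBuffer))
    }

    override func broadcastFinished() {
        // Release the sink so the encoder or socket can shut down.
        sink = nil
    }
}
#endif
```

Keeping the handler this thin is deliberate: anything that retains sample buffers or allocates per frame counts against the 50MB budget.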

2. Establishing the Zero-Copy VideoToolbox Pipeline

When ReplayKit delivers a CMSampleBuffer, we must compress it into H.264 immediately. We do this by handing the frame's pixel buffer directly to VideoToolbox (VTCompressionSession).

VTCompressionSession Configuration

We configure the hardware compression session using VTCompressionSessionCreate with these properties:

Codec Format: Set to kCMVideoCodecType_H264.
H.264 Profile: Set to kVTProfileLevel_H264_Baseline_AutoLevel (which excludes B-frames to achieve zero encoder buffering latency) or kVTProfileLevel_H264_Main_AutoLevel for high-quality connections.
Real-Time Flag: We enable kVTCompressionPropertyKey_RealTime to force the encoder to prioritize low-latency output over compression ratio.
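A minimal sketch of such a session setup, assuming the caller supplies the frame dimensions and uses the block-based encode call (so no output callback is registered at creation time). The function name `makeLowLatencySession` is hypothetical. The `#if canImport` guard only keeps the snippet compilable on platforms without VideoToolbox.

```swift
#if canImport(VideoToolbox)
import VideoToolbox

/// Creates an H.264 compression session with the properties listed above.
/// Returns nil if the session cannot be created or configured.
func makeLowLatencySession(width: Int32, height: Int32) -> VTCompressionSession? {
    var created: VTCompressionSession?
    let status = VTCompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        width: width,
        height: height,
        codecType: kCMVideoCodecType_H264,
        encoderSpecification: nil,
        imageBufferAttributes: nil,
        compressedDataAllocator: nil,
        outputCallback: nil,   // nil: frames are encoded with the output-handler variant
        refcon: nil,
        compressionSessionOut: &created)
    guard status == noErr, let session = created else { return nil }

    let properties: [CFString: Any] = [
        // Baseline has no B-frames, so the encoder never waits for future frames.
        kVTCompressionPropertyKey_ProfileLevel: kVTProfileLevel_H264_Baseline_AutoLevel,
        // Favor low-latency output over compression ratio.
        kVTCompressionPropertyKey_RealTime: true
    ]
    guard VTSessionSetProperties(session,
                                 propertyDictionary: properties as CFDictionary) == noErr else {
        VTCompressionSessionInvalidate(session)
        return nil
    }
    VTCompressionSessionPrepareToEncodeFrames(session)
    return session
}
#endif
```

Swapping in kVTProfileLevel_H264_Main_AutoLevel for high-quality connections only changes the ProfileLevel entry.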

Direct GPU-to-Encoder Path

Standard software encoders require copying pixel data from GPU memory into CPU buffers and converting it to YUV before compression can even begin.

VideoToolbox instead consumes the captured buffer in place:

[ ReplayKit Capture ]
        │
        ▼ (CMSampleBuffer Wrapper)
[ CoreVideo CVPixelBuffer (GPU Texture Memory) ]
        │
        ▼ (Direct pointer reference, zero CPU memory copies)
[ VTCompressionSessionEncodeFrame ]
        │
        ▼ (Dedicated hardware video encoder)
[ H.264 NAL units (AVCC, repackaged as Annex-B for transport) ]

We extract the underlying CVPixelBufferRef using CMSampleBufferGetImageBuffer(sampleBuffer) and pass that reference directly to VTCompressionSessionEncodeFrame. The pixel data is never copied into app-owned CPU buffers during the capture and compression cycle, which keeps CPU usage, memory footprint, power draw, and device temperature low.
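VideoToolbox delivers encoded frames as length-prefixed (AVCC) NAL units, so producing an Annex-B stream takes one more step: each length field is replaced by a start code. The sketch below shows that conversion on a plain byte array. The function name `annexB(fromAVCC:lengthSize:)` is hypothetical, and the default length size of 4 is an assumption that should be confirmed against the format description.

```swift
/// Converts length-prefixed (AVCC) NAL units to Annex-B by replacing each
/// big-endian length field with a 4-byte start code.
/// Returns nil if the input is truncated or contains an empty NAL unit.
func annexB(fromAVCC avcc: [UInt8], lengthSize: Int = 4) -> [UInt8]? {
    let startCode: [UInt8] = [0, 0, 0, 1]
    var output: [UInt8] = []
    output.reserveCapacity(avcc.count)

    var offset = 0
    while offset < avcc.count {
        // Read the big-endian NAL length field.
        guard offset + lengthSize <= avcc.count else { return nil }
        var nalLength = 0
        for i in 0..<lengthSize {
            nalLength = (nalLength << 8) | Int(avcc[offset + i])
        }
        offset += lengthSize

        // Copy the NAL payload behind a start code.
        guard nalLength > 0, offset + nalLength <= avcc.count else { return nil }
        output += startCode
        output += avcc[offset..<(offset + nalLength)]
        offset += nalLength
    }
    return output
}

// Two NAL units: a 2-byte one and a 1-byte one.
let sample: [UInt8] = [0, 0, 0, 2, 0x65, 0x88, 0, 0, 0, 1, 0x41]
let converted = annexB(fromAVCC: sample)
```

In a real pipeline the bytes would come from the sample buffer's CMBlockBuffer (CMSampleBufferGetDataBuffer), and the SPS and PPS would be read from the format description with CMVideoFormatDescriptionGetH264ParameterSetAtIndex and sent ahead of each keyframe. This copy only touches compressed data, which is small compared with the raw frames.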

3. Avoiding the Green/Pink Tint: YUV Color Matrix Calibrations

When streaming video from an iOS capture client to Android, Windows, or Web decoders, developers often run into a common issue: the iOS screen colors appear distorted on the receiver's end, showing a distinct green or pink tint.

This color shift is caused by mismatches in YUV-to-RGB conversion matrix coefficients.

By default, when capturing screens at resolutions of 720p and above, VideoToolbox flags the output stream's Video Usability Information (VUI) with Rec. 709 (HD video standards) properties. However, many decoders (such as Android's MediaCodec and Chrome's WebCanvas contexts) default to BT.601 (SD video standards) coefficients when converting incoming YUV frames back to RGB.
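To make the mismatch concrete, the following sketch encodes a color with the BT.709 luma coefficients and decodes it with the BT.601 ones. It uses normalized full-range values for simplicity, which is an assumption of this example rather than a description of the actual stream. The type `YCbCrMatrix` is a hypothetical helper.

```swift
/// Luma coefficients that define a YCbCr matrix (Kg = 1 - Kr - Kb).
struct YCbCrMatrix {
    let kr: Double
    let kb: Double
    var kg: Double { 1 - kr - kb }

    static let bt601 = YCbCrMatrix(kr: 0.299, kb: 0.114)
    static let bt709 = YCbCrMatrix(kr: 0.2126, kb: 0.0722)

    /// RGB in 0...1 to Y in 0...1 and Cb/Cr in -0.5...0.5.
    func encode(r: Double, g: Double, b: Double) -> (y: Double, cb: Double, cr: Double) {
        let y = kr * r + kg * g + kb * b
        let cb = (b - y) / (2 * (1 - kb))
        let cr = (r - y) / (2 * (1 - kr))
        return (y: y, cb: cb, cr: cr)
    }

    /// Inverse of `encode`; left unclamped so the error stays visible.
    func decode(y: Double, cb: Double, cr: Double) -> (r: Double, g: Double, b: Double) {
        let r = y + 2 * (1 - kr) * cr
        let b = y + 2 * (1 - kb) * cb
        let g = (y - kr * r - kb * b) / kg
        return (r: r, g: g, b: b)
    }
}

// Pure green, encoded as BT.709 and decoded with each matrix.
let sent = YCbCrMatrix.bt709.encode(r: 0, g: 1, b: 0)
let right = YCbCrMatrix.bt709.decode(y: sent.y, cb: sent.cb, cr: sent.cr)
let wrong = YCbCrMatrix.bt601.decode(y: sent.y, cb: sent.cb, cr: sent.cr)
print("matched:", right, "mismatched:", wrong)

// Neutral gray has zero chroma, so it survives the mismatch unchanged.
let gray = YCbCrMatrix.bt709.encode(r: 0.5, g: 0.5, b: 0.5)
let grayDecoded = YCbCrMatrix.bt601.decode(y: gray.y, cb: gray.cb, cr: gray.cr)
```

With matching matrices the color round-trips. With mismatched ones, the green channel overshoots and red and blue leak in, while grays stay put. That is why the problem shows up as a tint on saturated UI elements rather than as a uniform brightness change.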

To prevent this distortion, the Easy Connect Suite overrides the YUV color properties during session initialization:

swift

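import VideoToolbox

// Force VideoToolbox to flag the stream's VUI with BT.601 parameters.
// The session expects kVTCompressionPropertyKey_* keys; the kCVImageBuffer* constants are the values.
let specDict: [CFString: Any] = [
    kVTCompressionPropertyKey_ColorPrimaries: kCVImageBufferColorPrimaries_SMPTE_C,
    kVTCompressionPropertyKey_YCbCrMatrix: kCVImageBufferYCbCrMatrix_ITU_R_601_4,
    kVTCompressionPropertyKey_TransferFunction: kCVImageBufferTransferFunction_ITU_R_709_2
]
// Apply the color profile to the compression session and check the result
let status = VTSessionSetProperties(compressionSession, propertyDictionary: specDict as CFDictionary)
if status != noErr {
    print("Failed to set color properties: \(status)")
}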

By forcing BT.601 tagging, the encoder writes BT.601 colorimetry into the VUI section of the H.264 SPS, so receivers that default to BT.601 and receivers that honor the VUI apply the same coefficients when converting YUV back to RGB.

4. Handling Rotation & Scaling via VTPixelTransferSession

When a user rotates their iOS device from portrait to landscape during a support session, the aspect ratio of the captured frames changes. Attempting to encode these frames directly causes stretching or decoder failures on the receiver side.

To manage orientation updates, we route the buffer through VideoToolbox's hardware-accelerated pixel sessions. VTPixelTransferSession handles scaling and pixel-format conversion but exposes no rotation property, so the rotation itself goes through its companion VTPixelRotationSession (available from iOS 16):

Session Reusability: Instantiate the VTPixelRotationSession (and the VTPixelTransferSession, when scaling is needed) once and reuse it for the whole broadcast.

Set Hardware Rotation Angle: When device orientation updates, set the corresponding rotation property:

swift

// Set the rotation (kVTRotation_0, kVTRotation_CW90, kVTRotation_180 or kVTRotation_CCW90)
let status = VTSessionSetProperty(rotationSession,
                                  key: kVTPixelRotationPropertyKey_Rotation,
                                  value: kVTRotation_CW90)
if status != noErr {
    print("Failed to set rotation: \(status)")
}

Allocate Target Buffers: Use a CVPixelBufferPool to allocate target buffers matching the rotated resolution.
GPU-Accelerated Transform: Run VTPixelRotationSessionRotateImage (or VTPixelTransferSessionTransferImage for a pure scale), which transforms the frame in hardware in under 2ms. The output buffer is then fed directly into the VTCompressionSession for compression.
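Inside a broadcast extension, the orientation is typically read from the RPVideoSampleOrientationKey attachment on each sample buffer, which carries an EXIF-style orientation value. The sketch below maps that value to a clockwise rotation and computes the dimensions of the target buffer. Both function names are hypothetical, mirrored orientations are deliberately left out, and the rotation direction should be verified on a device before relying on it.

```swift
/// Clockwise rotation, in degrees, needed to display a frame upright,
/// given an EXIF-style orientation value (1 = up, 3 = down, 6 = right, 8 = left).
/// Mirrored orientations are not handled in this sketch and return nil.
func clockwiseRotation(forExifOrientation raw: UInt32) -> Int? {
    switch raw {
    case 1: return 0
    case 3: return 180
    case 6: return 90
    case 8: return 270
    default: return nil
    }
}

/// Dimensions of the destination buffer after rotating a width x height frame.
func rotatedSize(width: Int, height: Int, clockwiseDegrees: Int) -> (width: Int, height: Int) {
    // Normalize to 0...3 quarter turns, tolerating negative or >360 inputs.
    let quarterTurns = ((clockwiseDegrees / 90) % 4 + 4) % 4
    if quarterTurns % 2 == 0 {
        return (width: width, height: height)
    } else {
        // Odd quarter turns swap the axes.
        return (width: height, height: width)
    }
}

// A portrait frame rotated by a quarter turn becomes landscape.
let target = rotatedSize(width: 1170, height: 2532, clockwiseDegrees: 90)
```

Recreating the CVPixelBufferPool only when `rotatedSize` returns new dimensions avoids reallocating target buffers on every frame.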

Through this zero-copy pipeline, the iOS client delivers high-performance, energy-efficient screen capture and encoding. In the next part of this series, we will examine the Android platform, focusing on MediaProjection and Surface-mode encoding.
