Sunday, 1 May 2011

Silverlight's MediaStreamSource - Some Initial Thoughts

I've recently had cause to start working with Silverlight's MediaStreamSource class, in order to create an HTTP Live Streaming client.

Unfortunately, while there is some documentation on the matter, the actual quirks and details of the implementation are very poorly documented, leaving many developers resorting to trial and error to get things working.

My main goal was to demux MPEG-2 transport streams into H264/MP3 A/V streams, and then feed the frames/samples from these back to a MediaElement.

It's been frustrating to say the least, but I'm 90% of the way there now. If I get some more time, I'd really love to blog about the experience to help other developers. Some very initial observations:

  • When you send H264 video from a MediaStreamSource, you should send one NAL unit back to the MediaElement for each GetSampleAsync() request.
  • This includes non-picture NAL units, e.g. SPS / PPS units.
  • When you send your NAL units, ensure there are 3-byte start codes (0x00 0x00 0x01) at the beginning of each one. (This is similar to 'Annex B' format, but not quite the same thing)
  • In ReportOpenMediaCompleted(), when setting up your video stream description, you can ignore the CodecPrivateData attribute string, despite what the documentation says. It's not required. (assuming your stream of NAL units includes SPS and PPS units)
  • In ReportGetSampleCompleted(), set the value of 'Offset' equal to the beginning of the NAL start code, not the actual data. (in most cases this will be zero, assuming you use a fresh stream per NAL unit)
  • Remember to increase the time stamp when sending each NAL unit that represents a frame.
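
As a rough illustration of the points above, here is a minimal Python sketch (not Silverlight code) of the byte-level bookkeeping: splitting a stream on start codes, re-prefixing each NAL unit with a 3-byte start code, and handing out one NAL unit per sample with an offset of zero. The function names, the fixed per-frame duration, and the assumption that each frame is a single picture NAL unit are all mine, purely for illustration; a real stream may carry several slices per frame.

```python
START_CODE = b"\x00\x00\x01"

def split_nal_units(stream):
    """Split a byte stream on 3-byte start codes; return payloads without start codes."""
    units = []
    pos = stream.find(START_CODE)
    while pos != -1:
        begin = pos + len(START_CODE)
        nxt = stream.find(START_CODE, begin)
        end = len(stream) if nxt == -1 else nxt
        # A NAL unit never ends in 0x00, so trailing zeros here belong to a
        # following 4-byte start code (00 00 00 01) and can be dropped.
        payload = stream[begin:end].rstrip(b"\x00")
        if payload:
            units.append(payload)
        pos = nxt
    return units

def nal_type(nal):
    """nal_unit_type is the low 5 bits of the first NAL byte (7 = SPS, 8 = PPS)."""
    return nal[0] & 0x1F

def is_picture(nal):
    """Types 1-5 carry coded picture data."""
    return 1 <= nal_type(nal) <= 5

def build_samples(stream, start_ticks, frame_ticks):
    """Return one (data, offset, timestamp) tuple per NAL unit.

    Hypothetical helper: the timestamp only advances after a picture NAL,
    assuming one picture NAL per frame and a known, constant frame duration.
    """
    samples = []
    timestamp = start_ticks
    for nal in split_nal_units(stream):
        data = START_CODE + nal           # each sample starts with a start code
        samples.append((data, 0, timestamp))  # offset 0 points at the start code
        if is_picture(nal):
            timestamp += frame_ticks
    return samples
```

Each tuple corresponds to what would be returned for a single GetSampleAsync() request; SPS and PPS units share the timestamp of the frame that follows them.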

That's all I have time for right now. One final caveat: if you're feeding audio data back too, I had problems using a 48 kHz sample rate; the MediaElement asked for one audio sample, then never asked for any more! I eventually found that using a 44.1 kHz rate produced better results.

Carlos

5 comments:

  1. I tried to follow your method, but it failed.
    Please help me find out why.
    I am receiving H.264 frames via RTSP over HTTP.
    The first frame consists of 4 RTP packets: SPS, PPS, frame part 1 and frame part 2.
    In response to a GetSampleAsync() request, I send
    0x 00 00 01 sps 00 00 01 pps 00 00 01 part1+part2
    But nothing is displayed.
    I also tried sending the frame as 3 separate samples:
    1st:00 00 01 sps
    2nd:00 00 01 pps
    3rd:00 00 01 part1+part2

    Still nothing is displayed.
    Why is that, please?

  2. Very interesting. I've been working on an RTP parser for H264 and HE-AAC v2 audio. I've been updating the timestamp for each report without considering frame boundaries. When I play a video stream without audio it plays smoothly; when I add audio, video playback degrades and gets worse over time. I'm going to try your suggestion of only increasing the timestamp on the NAL packets when a new frame comes into scope (just as soon as I figure out which NAL types represent a new frame). I too had to do a great deal of experimentation to get things working. Audio was especially difficult; nothing worked until I set channels to 1. I guess that's what they meant by only supporting HE-AAC v2 at "half-fidelity".

  3. @anonymous thanks for the comments on this. Actually, since my posting, I have abandoned the technique of using frame boundaries for timestamps. You cannot reliably get frame rate information from an H264 stream. Some encoders provide this information within SPS NAL units, but you can't guarantee it will be included.

    Instead (and I should really blog about this again), I noticed that Apple's documentation relating to HTTP Live Streaming insists that NAL picture frames are in unique PES packets within the MPEG-2 transport stream. Therefore the best approach for me was simply to use the timestamps from the PES packet headers.

    (IMPORTANT: Silverlight requires the frames in decoding order, not presentation)

    You're using RTP, right? Doesn't the RTP protocol timestamp its packets? If so, this would be your best bet for timestamping your audio and video. You'd still need to examine the type of each NAL though, to check for multiple picture frame NALs (types 1-5) within the same packet. In this scenario, I wouldn't know what to suggest; perhaps estimate the frame rate by dividing the time elapsed so far (from the timestamps) by the total number of frames?

  4. @fangzi

    It depends on how you are timestamping the frames. The correct sequence (one item per GetSampleAsync() request) should be:

    1) START CODE + SPS
    2) START CODE + PPS
    3) START CODE + FRAME PART 1
    4) START CODE + FRAME PART 2

    ...but you'll need to make sure these are properly timestamped, using information you should be getting from your container (RTSP) format, since H264 has no concept of a frame rate.

  5. Thanks Carl for the update. I'm still in experimental mode.
    I am using the RTP timestamps for the audio and video streams. I found it interesting that the audio and video streams use different clock rates for their timestamps (90 kHz for video, 48 kHz for audio). If I play just one stream (audio or video), playback is smooth, and the timestamp values don't seem to matter: I can just supply 0 and audio or video plays smoothly. When I combine audio and video, things get messy.

    This article http://msdn.microsoft.com/en-us/library/hh180779(v=vs.95).aspx which I discovered yesterday suggests that my pipelines are getting starved. I think I'll have to do what it says: monitor the depth of each pipeline and start calling ReportGetSampleProgress as needed. Based on their recommended computation of pipeline depth, I can now see that I'm experiencing starvation. To calculate the pipeline depth for each stream, I have to pass the MediaStreamSource a reference to the MediaElement so that I can obtain its Position during calls to GetSampleAsync.


    To make things interesting, I decided to implement my MediaStreamSource with the reactive framework. Fun stuff. I'll post back if I figure out how to sync the streams.

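
The timestamp discussion in the comments above comes down to converting container clock values (33-bit, 90 kHz PTS values from PES headers, or 32-bit RTP timestamps at 90 kHz for video and 48 kHz for audio) onto one common timeline. Below is a minimal Python sketch of that arithmetic, assuming the target is the 100-nanosecond unit that Silverlight sample timestamps use. The function names are hypothetical, and the sketch does not attempt the pipeline-depth monitoring described in the MSDN article.

```python
HNS_PER_SECOND = 10_000_000  # number of 100 ns units in one second

def decode_pes_pts(field):
    """Decode the 33-bit PTS from the 5-byte PTS field of a PES header.

    Layout: 4 prefix bits, PTS[32..30], marker, PTS[29..15], marker,
    PTS[14..0], marker.
    """
    if len(field) != 5:
        raise ValueError("PTS field must be exactly 5 bytes")
    return ((((field[0] >> 1) & 0x07) << 30)
            | (field[1] << 22)
            | ((field[2] >> 1) << 15)
            | (field[3] << 7)
            | (field[4] >> 1))

def elapsed_ticks(timestamp, base, bits):
    """Clock ticks since the first timestamp, tolerating counter wraparound.

    Use bits=33 for PES timestamps and bits=32 for RTP timestamps.
    """
    return (timestamp - base) & ((1 << bits) - 1)

def to_100ns(clock_ticks, clock_rate):
    """Convert ticks of a clock running at clock_rate Hz to 100 ns units."""
    return clock_ticks * HNS_PER_SECOND // clock_rate

# Example: one second of video (90 kHz) and one second of audio (48 kHz)
# land on the same point of the shared timeline.
video_ts = to_100ns(elapsed_ticks(90_000, 0, 32), 90_000)
audio_ts = to_100ns(elapsed_ticks(48_000, 0, 32), 48_000)
```

Subtracting each stream's first timestamp before converting matters for RTP in particular, since RTP timestamps start from an arbitrary offset that differs between the audio and video streams.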