Raspberry Pi Support

I’m sure most of you have heard of the Raspberry Pi, a $25 ARM computer that runs Linux. We’ve spent quite a bit of time in the last weeks getting libavg to run on this machine, and I’m happy to say that we have a working beta. We render to a hardware-accelerated OpenGL ES surface and almost all tests succeed. Besides full image, text and software video support, that includes all compositing and even offscreen rendering and general support for shader-based FX. We have brief setup instructions at https://www.libavg.de/site/projects/libavg/wiki/RPI. Update: The setup instructions have been updated for cross-compiling (much faster!) and moved to https://www.libavg.de/site/projects/libavg/wiki/RaspberryPISourceInstall.

Most of the work was getting libavg to work with OpenGL ES. We now decide whether to use desktop or mobile OpenGL depending on a configure switch, an avgrc entry and the hardware capabilities. Along the way, we implemented mobile context support under Linux for NVidia and Intel graphics systems, so we can now test most things without actually running (and compiling!) on the Raspberry itself. Speaking of which – compiling for the Raspberry takes a long time, and compiling on it is impossible because there just isn’t enough memory. We currently chroot into a Raspberry file system and compile there (see the notes linked above).

A lot of things are already implemented the way they should be for a mobile system. That means, for example, that bitmaps are loaded (and generated, and read back from texture memory…) in either RGB or BGR pixel format depending on the flavor of OpenGL used, and that the vertex arrays are smaller now, which saves bandwidth. Still, there’s a lot of optimization to do. Our next step is getting things stable and fast: we want hardware video decoding, compressed textures – and in general, we’ll be profiling to find spots that take more time than they should.

Cleaning up Messaging

Over time, libavg has accumulated support for a number of message callbacks. Among them are:

  • Node.connectEventHandler() for mouse and touch input,
  • VideoNode.setEOFCallback() for end-of-video notifications,
  • Contact.connectListener() for tracking individual touch contacts,
  • and the gesture recognizers, which take their callbacks as constructor parameters.

In addition, we’re currently adding some widget classes, and they bring yet more callbacks for button presses, list scrolling, etc.

While this allows you to get a lot of things done, it’s not consistent and hence not very easy to learn. The methods used to register for messages aren’t standardized: they have inconsistent names and varying parameters, and some allow you to register several callbacks for an event while others don’t. For example, compare Node.connectEventHandler() to the gesture interface, which takes its callbacks as constructor parameters. The implementation is just as problematic: we have multiple callback implementations in C++ and Python, which results in error-prone, high-maintenance code.

Publishers

When work on the new widget classes threatened to make things even more convoluted, we decided to do something about the situation and implement a unified, consistent messaging system. The result is a publisher-subscriber system:

  • Publishers register MessageIDs.
  • Anyone can subscribe to these MessageIDs by registering callbacks. Several subscribers are possible in all cases.
  • When an event occurs, all registered callbacks are invoked.

We spent quite a bit of time making a lot of things “just work”. The subscription interface is very simple. As an example, this is how you register for a mouse or touch down event:

node.subscribe(node.CURSOR_DOWN, self.onDown)

Any Python callable can be registered as a callback, including standalone functions, class methods, lambdas and even class constructors. In most cases, you don’t have to deregister messages to clean up either. Subscriptions are based on weak references wherever possible, so when the object they refer to disappears, the subscription will just disappear as well.
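
To illustrate, here’s a quick sketch (the handlers are made up, and it assumes the cursor messages pass the event object to the callback, just like the old event handlers did):

def onDownPos(pos):
    print(pos)

def onDown(event):
    onDownPos(event.pos)

node.subscribe(node.CURSOR_DOWN, onDown)                              # plain function
node.subscribe(node.CURSOR_DOWN, lambda event: onDownPos(event.pos))  # lambda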

You can write your own publishers in Python or C++ by simply deriving from the Publisher class. In Python, registering a message takes two lines of code:

class Button(avg.DivNode):
    CLICKED = avg.Publisher.genMessageID()
    [...]
    def __init__(self, **kwargs):
        super(Button, self).__init__(**kwargs)
        self.publish(self.CLICKED)
    [...]

and this line invokes all registered subscribers:

self.notifySubscribers(self.CLICKED, [])

The second parameter to notifySubscribers is a list of parameters to pass to the subscribers.
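
To make the parameter passing concrete, here is a small sketch of a hand-rolled publisher (the Downloader class and its message are made up for illustration; only Publisher, genMessageID, publish, subscribe and notifySubscribers are the real interface):

from libavg import avg

class Downloader(avg.Publisher):
    DOWNLOAD_DONE = avg.Publisher.genMessageID()

    def __init__(self):
        super(Downloader, self).__init__()
        self.publish(Downloader.DOWNLOAD_DONE)

    def onNetworkFinished(self, filename):
        # The list elements become the arguments of every subscriber callback.
        self.notifySubscribers(Downloader.DOWNLOAD_DONE, [filename])

def logDownload(filename):
    print(filename)

downloader = Downloader()
downloader.subscribe(Downloader.DOWNLOAD_DONE, logDownload)
downloader.onNetworkFinished("test.png")    # calls logDownload("test.png")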

Transitioning

Transitioning old programs to the new interface is not very hard: it mostly involves replacing old calls to Node.connectEventHandler(), VideoNode.setEOFCallback(), Contact.connectListener() and so on with invocations of subscribe(). We’ll keep the old interfaces around for a while, but they’ll probably be removed when we release version 2.0.
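
As a rough before/after sketch (the old-style call is written from memory and its exact signature may differ between libavg versions; onDown is a made-up handler):

# Old interface: per-node event handlers.
node.connectEventHandler(avg.CURSORDOWN, avg.MOUSE | avg.TOUCH, self, self.onDown)

# New interface: subscribe to the corresponding message.
node.subscribe(node.CURSOR_DOWN, self.onDown)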

The End of Touch Jitter

On lots of multitouch devices, input suffers from jitter: The actual touch location is reported imprecisely and changes from frame to frame. This has obvious negative effects, since it’s much harder to hit a target this way. For years, people have been telling me that a lowpass filter would help. In its simplest form, a lowpass filter averages together the location values from the last few frames. This removes most of the jitter – because the jitter is random, there’s a good chance that the errors in successive frames cancel each other out. On the other hand, it adds latency because the software is not using the latest data. This tradeoff didn’t seem like a good one to me, so I didn’t add a jitter filter to libavg.

However, at this year’s CHI conference, Géry Casiez and coauthors published a paper on the 1€ Filter. This filter is based on an extremely simple observation: precise positions only matter when the user moves their finger slowly, while latency matters at high speeds. So the solution to the dilemma described above is a filter that adjusts its latency depending on speed. Their filter is extremely simple to implement, and the results are really nice.
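
For the curious, here’s a minimal Python sketch of the algorithm, written from the paper’s description – it’s an illustration only, not libavg’s own implementation, and the default parameter values are just placeholders:

import math

class OneEuroFilter:
    # Adaptive lowpass: the cutoff frequency rises with the speed of the
    # signal, so slow movements get smoothed strongly (little jitter) and
    # fast movements pass through almost unfiltered (little latency).
    def __init__(self, minCutoff=1.0, beta=0.01, dCutoff=1.0):
        self.minCutoff = minCutoff  # smoothing at low speeds
        self.beta = beta            # how fast the latency drops with speed
        self.dCutoff = dCutoff      # cutoff for the speed estimate itself
        self.lastX = None
        self.lastDX = 0.0

    def _alpha(self, cutoff, dt):
        tau = 1.0 / (2 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def apply(self, x, dt):
        if self.lastX is None:
            self.lastX = x
            return x
        # Estimate and smooth the speed of the signal.
        dx = (x - self.lastX) / dt
        a = self._alpha(self.dCutoff, dt)
        dxHat = a * dx + (1 - a) * self.lastDX
        # Higher speed -> higher cutoff -> less smoothing -> less latency.
        cutoff = self.minCutoff + self.beta * abs(dxHat)
        a = self._alpha(cutoff, dt)
        xHat = a * x + (1 - a) * self.lastX
        self.lastX, self.lastDX = xHat, dxHat
        return xHat

One instance per coordinate axis is enough; each new touch position is fed through apply() together with the time since the last frame.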

libavg can now process the touch input positions using this filter. The filter parameters are configurable in avgrc, and there’s a configuration utility (avg_jitterfilter.py) that helps in finding correct filter values. The complete implementation is in the libavg_uilib branch – I’ll merge it to trunk in the next few weeks.

Intel Graphics

After the rendering optimization I described in my last post, tests with Intel Atom chipset graphics (N10 chipset) uncovered a problem: the system was running in software rendering mode, which slows things down by a factor of about a thousand. It turns out that more than two texture accesses in a shader are too much for the hardware. Additionally, lots of Intel chips run all vertex shaders in software, which also causes a tenfold slowdown if libavg’s three-line vertex shader is in use.

So now, there’s a second rendering path with minimal shaders that does vertex processing the old-fashioned way (glMatrixMode etc.) and uses a different shader for those nodes that don’t need any special processing. Still, I recommend staying away from Intel Atom graphics. There is way better hardware out there at the same price point.

Speeding up Rendering

libavg’s rendering has been fast enough for many applications for a while. A decent desktop computer could render between 2000 and 5000 nodes at a framerate of 60 in version 1.7. This is probably already more than most frameworks manage, but for big applications, it’s not enough. For instance, someone tried to build a Game of Life application with one node per grid point – and ran into performance issues. SimMed spends an inordinate amount of time rendering 2D as well. Also, particle animations and similar effects need lots of nodes.

So, I went and optimized the rendering pipeline. As a bonus, I was able to remove lots of deprecated OpenGL function usage, thus getting us a lot closer to mobile device support.

tl;dr: On a desktop system with a good graphics card, the benchmarks now show libavg rendering two or three times as many nodes as before.

The new rendering pipeline

One mantra that’s often repeated when optimizing graphics pipelines is “minimize state changes” (see Tom Forsyth’s blog entry on renderstate change costs and NVidia’s GDC talk slides). Pavel Mayer once (over-)simplified this to “minimize the number of GL calls”, and my experience has been that that’s actually a very good starting point.

Today’s graphics cards are optimized for large, complex 3D models with comparatively few textures. 2D applications, in contrast, draw lots of small primitives – mostly rectangles – with many different textures. A naive implementation uses one vertex buffer per primitive; that results in a huge number of state changes and is about the worst way to use current graphics cards.

The new rendering pipeline makes the most of the situation by:

  • Putting all vertex coordinates into one big vertex buffer. This vertex buffer is uploaded once per frame, activated and used for all rendering. The one big upload takes less time than actually figuring out what needs to be uploaded and doing the work piecewise.
  • Using one standard shader for all nodes. This shader handles color space transforms, brightness/contrast/gamma and masks, meaning it does a lot more work than is necessary for most nodes. However, the shader never changes during the main rendering pass. It turns out that the increased per-pixel processing is no problem for all but the slowest GPUs, while the state changes that would otherwise be needed cost significant time on the CPU side.
  • Rendering FX nodes to textures in a prerender pass with their own shaders.
  • Generally moving GL state changes outside of the render loop if possible and substituting shader parameters for old-style GL state.
  • Caching all other GL state changes. There are just a few GL state variables that still change during rendering (to be precise: glBlendColor, the active blend function, and parameters to the standard shader). Setting a shader parameter to the same value repeatedly no longer causes redundant GL calls (see the sketch after this list).
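
To make the caching idea concrete, here’s a minimal Python/PyOpenGL sketch of the principle (libavg’s renderer is written in C++; the class and uniform names below are made up for illustration):

from OpenGL import GL

class CachedUniform:
    # Remembers the last value sent to the driver and skips the GL call if
    # the new value is identical.
    def __init__(self, program, name):
        self.loc = GL.glGetUniformLocation(program, name)
        self.lastValue = None

    def set(self, value):
        if value != self.lastValue:
            GL.glUniform1f(self.loc, value)
            self.lastValue = value

# gamma = CachedUniform(standardShader, "u_Gamma")
# Every node calls gamma.set(...) during rendering, but only actual changes
# result in a GL call.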

There were also a few non-graphics related optimizations – profiling information is now only collected if profiling is turned on, for example.

Results

Without further ado, here are some benchmarks using avg_checkspeed and avg_checkpolygonspeed. They show nodes per frame at 60 FPS on a typical desktop system (Core i7 920 Bloomfield, 2.66 GHz, NVidia GF260):

Desktop, Linux (Ubuntu 12.04, Kernel 3.2)

libavg Version    Images    Polygons
1.7                 2200        3500
Current             7000        7000

Desktop, Win 7

libavg Version    Images    Polygons
1.7                 2700        5000
Current            10000        9500

On my MacBook Pro (Mid-2010, Core i7 Penryn, 2.66 GHz, NVidia GF330M graphics, Snow Leopard), the maximum number of nodes rendered did not increase. However, the CPU load while rendering went down – so we have a GPU bottleneck here:

MacBook Pro

libavg Version    Images                  Polygons
1.7               1000, 100% CPU load     1600, 100% CPU load
Current           1000, 80% CPU load      1600, 40% CPU load

More precisely, since changing the multisampling settings has an effect on speed, fragment processing is the bottleneck. Switching to minimal shaders doesn’t change the speed, though, so my current guess is texture fetches. But that’s for the next iteration of optimizations.