Okay, this is all very much preliminary, and I appreciate any useful commentary, especially with regard to tuning on different operating systems. I have tried my best to pull any tricks I know of, but I am far from knowing enough about the individual operating systems.

Update: After tweaking the FreeBSD kernel and recompiling, the numbers got a bit better.

Overview

This page collects and presents the benchmark data I have gathered while implementing and porting my streaming server (flss, FIXME: link to download the sources) to different platforms. Currently it is known to work on the following platforms:

Additionally it *should* run under the aforementioned operating systems on different hardware architectures (however at a significant performance penalty, see the source of the event dispatcher for an explanation why). It *should* be easily portable to all other Unix-like platforms. A native Windows port is being prepared.

Benchmark data for the above-mentioned hardware/software combinations have been collected and will be published here, and will hopefully be updated as the platforms evolve.

Test Systems

The following test hardware for the streaming server is available to me (and all benchmarks are run on the server):

The systems were used with the following hardware/software combinations:

System identifier  Hardware         Base system                     Compiler   File system
linux-i386         x86+Intel E1000  Linux 2.6.2, glibc 2.3.2        gcc 3.3.2  ext3
linux-ppc          ppc+Intel E1000  Linux 2.6.2, glibc 2.3.2        gcc 3.3.2  ext3
freebsd-i386       x86+Intel E1000  FreeBSD 5.2 (Release)           gcc 3.3.2  ufs+soft updates
darwin-ppc         ppc+GMAC         Darwin 7.2.0 (Mac OS X 10.3.2)  gcc 3.3.2  HFS+

It should be noted that the x86 system is far older than the ppc system, and the hardware in the ppc system is far more powerful than the x86 system in almost all respects.

System specific notes

LMBench

Results. This benchmark measures basic OS parameters, such as system call overhead. Annotations:

Dispatcher bench

This benchmark tries to measure how efficiently the operating system delivers socket IO events to applications. For this purpose either one or two dispatcher threads are created, each with a given number of UDP sockets. Each thread then waits for events on each of "its" sockets and, upon receiving data, transmits the data to the next of "its" sockets. An initial message of 16 bytes is posted to the first socket of each thread, so that in effect each thread is busy passing a 16-byte token to itself via its UDP sockets.

This benchmark stresses both the network subsystem and the event notification mechanism of the operating system at the same time. It also exposes scalability problems on SMP systems: although two dispatcher loops could in theory run completely independently, some interlocking between the two dispatcher threads is absolutely necessary inside the kernel to guarantee the integrity of the operating system (e.g. file descriptor table, network stack, thread state, scheduler, ...).

The benchmark measures the number of dispatching operations (receive notification from kernel; receive data from kernel; send data to next socket) the system was able to perform per second. In case of two threads running in parallel the figures for both threads were added to demonstrate the throughput of the system as a whole.
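To make the token-passing loop more concrete, here is a minimal sketch of the idea in Python. It is not the flss dispatcher: the names run_dispatcher and run_benchmark are made up for illustration, the sockets live on the loopback interface, and the portable selectors module stands in for whatever event notification mechanism the real dispatcher uses on each platform. Because of Python's global interpreter lock, this sketch is also not suitable for reproducing the SMP scaling figures; it only shows the shape of one dispatching operation.

```python
import selectors
import socket
import threading
import time

TOKEN = b"0123456789abcdef"  # 16-byte token, as in the benchmark description


def run_dispatcher(num_sockets, duration):
    """Pass the token around a ring of UDP sockets; return the dispatch count."""
    sel = selectors.DefaultSelector()
    socks = []
    for _ in range(num_sockets):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("127.0.0.1", 0))  # assumption: loopback, OS-chosen port
        s.setblocking(False)
        socks.append(s)

    # Register every socket together with the address of its successor.
    for i, s in enumerate(socks):
        successor = socks[(i + 1) % num_sockets].getsockname()
        sel.register(s, selectors.EVENT_READ, data=successor)

    # Post the initial 16-byte message to the first socket.
    socks[-1].sendto(TOKEN, socks[0].getsockname())

    count = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            for key, _events in sel.select(timeout=0.1):  # notification
                data, _addr = key.fileobj.recvfrom(64)    # receive data
                key.fileobj.sendto(data, key.data)        # send to next socket
                count += 1
    finally:
        sel.close()
        for s in socks:
            s.close()
    return count


def run_benchmark(num_threads, num_sockets, duration):
    """Run one dispatcher per thread; return summed dispatches per second."""
    results = [0] * num_threads

    def worker(idx):
        results[idx] = run_dispatcher(num_sockets, duration)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results) / duration
```

Calling run_benchmark(2, 100, 10.0) would run two dispatcher threads with 100 sockets each for ten seconds and return the combined rate, mirroring how the figures for both threads are added up above.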

Results. Annotations:

RTP streaming test

The efficiency of the server serving multiple clients is exercised during this test. Still preliminary due to some hardware problems.

The bottom line however is that flss has no trouble concurrently serving around 400 clients @1150 kbit/s sustained on a moderately-specced machine.
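For a rough sense of scale, the aggregate stream bandwidth implied by these numbers can be worked out as follows. This is only a back-of-the-envelope sketch: it assumes 1 Mbit = 1000 kbit and ignores RTP/UDP/IP header overhead, and the variable names are purely illustrative.

```python
clients = 400            # concurrent clients, from the text above
rate_kbit_s = 1150       # sustained rate per client in kbit/s

# Aggregate payload rate across all clients, converted to Mbit/s.
aggregate_mbit_s = clients * rate_kbit_s / 1000
print(f"{aggregate_mbit_s:.0f} Mbit/s")  # prints "460 Mbit/s"
```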

Results. Annotations:

Windows-specific notes

The most efficient way to do network IO and event notification on the Windows platform is Completion Ports. Unfortunately this IO model does not match the Unix-style separation of "readiness notification" and "IO request" on which the streaming server's dispatching is currently based. As a result it is impossible to make use of Completion Ports without major restructuring of the code, so only the "second-best" dispatching mechanism can be used on Windows; performance numbers for Windows should therefore be interpreted with great care.
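The mismatch between the two models can be illustrated with a toy sketch in Python. It does not use the Windows Completion Port API at all: the queue and the helper thread are stand-ins that merely emulate the order of events, and the function names readiness_style and completion_style are invented for this illustration.

```python
import queue
import selectors
import socket
import threading


def readiness_style(sock):
    """Unix style: wait until the socket is readable, then read."""
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    sel.select()            # step 1: readiness notification
    data = sock.recv(64)    # step 2: the IO request itself
    sel.close()
    return data


def completion_style(sock):
    """Completion style: submit the read first, be told when it is done."""
    completions = queue.Queue()  # stand-in for a completion port
    buf = bytearray(64)          # buffer must be handed over up front

    def pending_read():
        # Emulates the system carrying out the read in the background.
        n = sock.recv_into(buf)
        completions.put(n)

    threading.Thread(target=pending_read, daemon=True).start()  # step 1: IO request
    n = completions.get()        # step 2: completion notification
    return bytes(buf[:n])


a, b = socket.socketpair()
b.send(b"ping")
first = readiness_style(a)
b.send(b"pong")
second = completion_style(a)
a.close()
b.close()
```

In the first function the application decides when and how to read after being notified; in the second the read and its buffer must already exist before any notification arrives, which is why a dispatcher built around the first pattern cannot simply be switched over to the second.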

Conclusions

This section could as well be titled "food for flamewar", but... I will postpone writing this until all source material is presented on this website.

I appreciate comments.