Threads will tangle you up!

Like many ideas in computer programming, multi-threading was foisted on us without proper thought or caution by zealots who were looking for magic solutions. Sorry dudes, but there is no magic in programming. If you are stuck using threads, see this page for advice on making it work. But if you are planning an application, consider the following issues:

It is really only possible to obtain performance benefits from threading in a limited number of situations. In almost all other uses, threads end up like two cooks trying to work in the same household kitchen: they bump into one another and spend a lot of time waiting for each other to get out of the way.

But don't take my word for it, consider the comments of Donald Knuth, famed author of the (so far) three-volume set The Art of Computer Programming. Knuth basically says he doesn't know how to do threads and that multi-core processors and threads probably don't help much anyway. See this article:

Dude, like if the premier name in programming can't pull it off, why do you think you can?

Conan? What is good in threads?

Threading only really works well when each thread has no need to share resources with the others while doing the bulk of its work. There are few situations where this is true: database internals, web servers, operating systems, etc. In the case of web servers, I'm talking about the server itself -- not necessarily an application that runs on it.

In all of these cases, a very large number of very simple things are being done, and the work done in each step has little or no effect on the other steps.

And herein lies the rub

Why are threads bad?

The simple answer is this: thread-related overhead can be enormous, and threaded programs are almost impossible to test.

Thread-related overhead

Here are some of the kinds of thread-related overhead:
Thread startup time
Threads must be created, and the time it takes to create a thread is not small -- smaller than loading a program off the disk, but not small. Many OS vendors suggest that you create a pool of threads up front and re-use them as needed throughout your program's run. After all, you cannot create an infinite number of threads, so a fixed pool allocated up front makes sense from a couple of standpoints.
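The fixed-pool idea can be sketched in a few lines of Python (the worker count, the squaring "work", and the queue-based design are all illustrative, not any vendor's API):

```python
import queue
import threading

def worker(tasks, results):
    """Loop forever, pulling work items off a shared queue."""
    while True:
        item = tasks.get()
        if item is None:              # sentinel: shut this worker down
            tasks.task_done()
            return
        results.put(item * item)      # stand-in for real work
        tasks.task_done()

def start_pool(num_workers, tasks, results):
    """Create a fixed pool of threads up front and reuse them, so the
    thread-creation cost is paid once rather than once per task."""
    for _ in range(num_workers):
        t = threading.Thread(target=worker, args=(tasks, results), daemon=True)
        t.start()

tasks, results = queue.Queue(), queue.Queue()
start_pool(4, tasks, results)
for n in range(10):
    tasks.put(n)
tasks.join()                          # wait until every item is processed
out = sorted(results.get() for _ in range(10))
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

To retire the pool cleanly you would push one `None` sentinel per worker; here the daemon flag lets the program simply exit.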
Mutex Locking time
Multiple threads in a program share access to the same memory space -- meaning that they can write on each other's variables. While this is great for speed, it is absolutely horrible for program correctness -- and it's a disaster for testing. No amount of testing really tells you whether such a program is going to work on a given customer dataset.

The solution to this, of course, is to put mutex locks around shared variables and have each thread acquire the lock before reading or writing. While one thread holds the lock, the other threads' own attempts to acquire it block them long enough for the current thread to finish its modifications.

And therein of course lies the rub. All these other threads are hung waiting for the one thread to do its job. The amount of time that a given thread gets suspended is dependent on operating system latency. For these other threads, the processor is freed up to go work on some other program and these threads don't get execution time again until the round robin task selector gets back to them.

This could be a significant amount of time. The smaller and tighter your thread's main loop, the more of its time is going to be spent waiting for the operating system to get back to it. There is no bug here -- this is just a bad design. The sad part is that this situation is probably the most common of all in threaded programs, because it results from taking a non-threaded application and trying to add threads on top of it.

Basically, if you have an existing non-threaded program, you must rewrite it from scratch to overcome this problem, or you will get no real benefit from adding threads -- you might actually get a degradation in performance. AND, worst of all, you may not notice this degradation until it is too late -- until so much time and energy has gone into adding threads that no one will want to admit they have made things worse. You live with the performance degradation, and you now have all the penalties of a threaded program with none of the benefits.

Malloc overhead
Whenever a program has more than one thread, malloc and free have to take mutex locks, because the heap's bookkeeping data is global state shared by every thread that requests memory.

Thus, as just described, the mutex locking inside malloc makes a threaded program fundamentally slower than a non-threaded one -- even if only one thread is actually doing anything.

The limitations of heap pools

Heap pools are an attempt to reduce the malloc overhead inherent in threading by pre-allocating a large amount of memory to each thread. A heap pool does not require mutex locks in order to allocate memory -- but it does require that lots of memory be set aside for the thread's exclusive use, memory that is malloced up front.

Thus you pay for time with memory. This might be tolerable and it might not.
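The trade can be sketched with a hypothetical per-thread pool in Python (ThreadLocalPool and its sizes are invented for illustration; real heap pools live inside the allocator):

```python
import threading

class ThreadLocalPool:
    """Hypothetical per-thread buffer pool: each thread pre-allocates its
    own buffers, so acquiring one needs no mutex -- at the cost of
    reserving all of that memory for the thread's exclusive use."""

    def __init__(self, bufsize, count):
        self._local = threading.local()
        self.bufsize, self.count = bufsize, count

    def _freelist(self):
        if not hasattr(self._local, "free"):
            # Pay the memory cost up front, once per thread.
            self._local.free = [bytearray(self.bufsize) for _ in range(self.count)]
        return self._local.free

    def acquire(self):
        return self._freelist().pop()    # no lock: the list is thread-private

    def release(self, buf):
        self._freelist().append(buf)

pool = ThreadLocalPool(bufsize=4096, count=8)
buf = pool.acquire()
print(len(buf), len(pool._freelist()))   # 4096 7
pool.release(buf)
```

Note that the eight buffers exist per thread whether that thread ever uses them or not -- that is the memory you trade for time.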

Thread Stack Limitations

The program main thread's stack is usually a special case -- it can often grow without bound, until process memory is exhausted. The stacks for all other threads, however, must be pre-allocated, so you are forced to waste space on stack size. Typically, all threads in a pool have the same stack size, though the appropriate size varies from machine to machine. If the threads didn't have the same size, they wouldn't be interchangeable, and the concept of a pool would make no sense. Of course, you could have multiple pools with different stack sizes.
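In Python this pre-allocation shows up as threading.stack_size(), which sets one fixed reservation for every thread created afterwards (the 1 MiB figure below is arbitrary, and what sizes are allowed is platform-dependent):

```python
import threading

# Every thread created after this call gets the same fixed stack
# reservation; allowed values (and the minimum) vary by platform.
threading.stack_size(1024 * 1024)        # reserve 1 MiB per worker thread

seen = []
def probe():
    seen.append(threading.stack_size())  # reports the configured size

t = threading.Thread(target=probe)
t.start()
t.join()
print(seen[0])  # 1048576
```

The main thread is unaffected -- the setting applies only to threads you create, which is exactly the main-thread special case described above.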

Alternatives to threading

First and foremost, if you have the right application architecture, there is no reason NOT to use threads. On the other hand, only certain kinds of applications can truly benefit from threads -- all the rest suffer from them.

Process Pools instead of Thread Pools

The idea behind thread pools was used on unix long before threads appeared on the scene. The basic idea is to have a single process whose job is to manage other processes and distribute work to them in round-robin fashion -- or by some other algorithm if it makes sense.

Consider the telnet interface. The connection request is serviced by a process which does nothing else -- it just detects connection requests and then awakens a service process that actually performs the telnet activities for that connection. The system keeps a bunch of telnet service processes running in a wait state, and each new connection awakens a different process in the pool. This same logic can be applied to any activity.
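A toy version of such a pre-started pool, sketched with Python subprocesses (the upper-casing worker is a stand-in for a real service program):

```python
import subprocess
import sys

# The "service" each pooled process runs -- here just an echo-in-caps loop.
WORKER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)

# Pre-start three worker processes, each waiting on its input pipe.
workers = [
    subprocess.Popen([sys.executable, "-c", WORKER],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    for _ in range(3)
]

requests = ["get a", "get b", "get c", "get d", "get e", "get f"]
replies = []
for i, req in enumerate(requests):
    w = workers[i % len(workers)]        # round-robin distribution
    w.stdin.write(req + "\n")
    w.stdin.flush()
    replies.append(w.stdout.readline().strip())

for w in workers:                        # retire the pool
    w.stdin.close()
    w.wait()
print(replies)
```

Because each worker is a whole process, a crash or runaway loop in one of them cannot corrupt the others' memory -- the isolation threads lack.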

Unlike thread pools, each process's single stack is usually free to grow until the process is out of address space. Of course, each process in a process pool will use more code-space memory than the threads in a thread pool, since threads all share the same code space.

Obviously, though, if the work of the processes in the pool must be merged -- unlike telnet sessions -- then significant design effort might be required to make a seamless merge work properly. Even so, this will still be easier to test than a true multi-threaded system.

Using the select system call to emulate threads

Because of the asynchronous nature of thread execution, debugging a threaded program can be very tedious -- duplicating customer-reported bugs can be impossible. Further, there is no real way to gain confidence that enough testing has been done; thread interaction timing makes that all but impossible.

Many programs have this same basic structure: they are giant command-processing loops. At the top of the loop is a "read" operation that gets a command from some stream; the body of the loop then executes the specified command. The read may be complicated by the fact that commands can come in from multiple streams. For example, a server program might service dozens of users, each of whom ends up with his own command stream. The select system call can be used by the main program loop to wait on all of them simultaneously: it serializes the delivery of commands to the main loop, which can process them one at a time.
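A sketch of such a loop using Python's selectors module (a portable wrapper over select and friends), with two in-memory socket pairs standing in for user connections:

```python
import selectors
import socket

sel = selectors.DefaultSelector()

# Two in-memory "user" command streams standing in for real connections.
streams = [socket.socketpair() for _ in range(2)]
for server_end, _client_end in streams:
    server_end.setblocking(False)
    sel.register(server_end, selectors.EVENT_READ)

# Each client sends one command.
streams[0][1].sendall(b"hello\n")
streams[1][1].sendall(b"world\n")

commands = []
while len(commands) < 2:
    # select() serializes readiness: the single main loop handles one
    # ready stream at a time -- no threads, no locks.
    for key, _events in sel.select(timeout=1):
        data = key.fileobj.recv(1024)
        if data:
            commands.append(data.decode().strip())

print(sorted(commands))  # ['hello', 'world']
```

The whole "server" is one thread, so every bug it has is reproducible by replaying the same input sequence.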

Of course this kind of pseudo-threading doesn't work in all cases -- it doesn't benefit from having multiple processors in the machine. But applications written this way are far more testable, and thus will be more reliable, than true threaded applications.

Mixing Threaded and Non-Threaded executables to make an Application

Since threaded applications are fundamentally slower than non-threaded applications, a performance optimization is available in some situations: put the parts of the application that require the highest performance in a non-threaded slave executable and communicate with it through queues, pipes, sockets, etc. Then put the parts benefiting most from threads in a separate "master" executable whose main job is to marshal data.
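One sketch of the split, in Python: the master marshals data to a single-threaded slave over a pipe (the doubling worker is a placeholder for the real high-performance code):

```python
import subprocess
import sys

# Hypothetical split: the hot loop runs single-threaded in a slave
# process, free of all mutex overhead; the master (which could itself
# be threaded) only marshals data to it over a pipe.
slave_code = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(str(int(line) * 2) + '\\n')\n"
)
slave = subprocess.Popen([sys.executable, "-c", slave_code],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         text=True)
out, _err = slave.communicate("1\n2\n3\n")
print(out.split())  # ['2', '4', '6']
```

The pipe costs something per message, so the split only pays off when the slave does substantial work per request.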