The future of Gestalt

Gestalt has many shortcomings, but a good many of them can be addressed in a later version. For example, there is currently no way for the computation nodes to communicate with one another; I plan to make that option available soon. I had two reasons for leaving it out of this release: (1) it would make writing clients for the system unnecessarily confusing, and (2) for this system to approach the efficiency of an Intel Paragon or similar supercomputer, the underlying network would have to be as fast as the supercomputer's internal bus. I'm sure this will eventually happen, but not in the near future.

Another useful feature that I considered adding to this release is a way for a client to deliver a single object to a server, along with a range of values and instructions for breaking the object into many pieces. This would reduce the network traffic between the client and the server, as well as the traffic between servers. Again, however, it would make writing clients for the system slightly more confusing, and without having tested such a feature, I'm not sure how much performance would actually be gained.
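A minimal sketch of how such a split might look on the server side, assuming the "range of values" is a simple half-open integer range. The class RangeSplitter and its split method are hypothetical names used only for illustration; they are not part of Gestalt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Gestalt's actual API.
final class RangeSplitter {
    private RangeSplitter() {}

    /**
     * Splits the half-open range [start, end) into at most `pieces`
     * contiguous sub-ranges whose sizes differ by at most one.
     * Each result is a two-element array {subStart, subEnd}.
     */
    static List<long[]> split(long start, long end, int pieces) {
        if (pieces <= 0 || end < start) {
            throw new IllegalArgumentException("bad range or piece count");
        }
        long total = end - start;
        // Never create more pieces than there are values to hand out.
        long count = Math.min(pieces, total);
        List<long[]> result = new ArrayList<>();
        if (count == 0) {
            return result; // empty range: nothing to distribute
        }
        long base = total / count;   // minimum size of every piece
        long extra = total % count;  // the first `extra` pieces get one more
        long cursor = start;
        for (long i = 0; i < count; i++) {
            long size = base + (i < extra ? 1 : 0);
            result.add(new long[] {cursor, cursor + size});
            cursor += size;
        }
        return result;
    }
}
```

With this approach the client would send one object plus the two range bounds, and each sub-range would become one task. Whether that actually reduces traffic enough to matter is the open question raised above.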

I have already covered the general functionality of the server in depth, and I think the idea behind it is fairly simple. More interesting to me is the behavior of the server, by which I mean how it decides which slaves to put tasks on, and when it becomes economical to distribute tasks to servers on remote networks. After less testing than I would have liked, I arrived at a fairly simple set of heuristics for this version's server: it distributes two tasks at a time to each local slave, and then populates the remote servers with enough tasks to keep their slaves busy as well. When local slaves become available, the server aborts tasks on remote servers and reassigns them locally. Servers send updates to their clients when the number of tasks pending distribution changes or the number of slaves changes, resulting in a network of processors that effectively load-balances itself. Unfortunately, the only way to build an extremely efficient set of heuristics is to spend a large amount of time testing them. Because these heuristics are critical to the efficiency of the whole system, I think it will be worth the effort to improve them substantially.
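The heuristic described above can be sketched roughly as follows. This is an illustration only, not the server's actual code: the class DistributionPlanner and its methods are hypothetical, and the choice of one task per remote slave is an assumption, since the text only says the remote slaves are kept busy.

```java
// Hypothetical sketch; class and method names are not from Gestalt.
final class DistributionPlanner {
    // "Two tasks at a time to each local slave" comes from the text.
    static final int TASKS_PER_LOCAL_SLAVE = 2;
    // Assumption: one task is enough to keep a remote slave busy.
    static final int TASKS_PER_REMOTE_SLAVE = 1;

    private DistributionPlanner() {}

    /**
     * Decides how many of `tasks` go to local slaves, how many go to
     * remote servers, and how many stay pending.
     * Returns {local, remote, pending}.
     */
    static int[] plan(int tasks, int localSlaves, int remoteSlaves) {
        // Local slaves are filled first.
        int local = Math.min(tasks, localSlaves * TASKS_PER_LOCAL_SLAVE);
        // Whatever is left over is offered to the remote servers.
        int remote = Math.min(tasks - local,
                              remoteSlaves * TASKS_PER_REMOTE_SLAVE);
        return new int[] {local, remote, tasks - local - remote};
    }

    /**
     * Number of remotely running tasks to abort and reassign locally
     * once `freeLocalSlaves` local slaves become available.
     */
    static int reclaim(int remoteRunning, int freeLocalSlaves) {
        return Math.min(remoteRunning,
                        freeLocalSlaves * TASKS_PER_LOCAL_SLAVE);
    }
}
```

For example, under these assumptions plan(10, 3, 2) places six tasks locally, two remotely, and leaves two pending. Tuning the two constants, or replacing them with measured values, is the kind of testing the paragraph above calls for.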

Lastly, I have had some problems with slaves throwing exceptions when they receive task objects that had not been compiled when the slave was started. Since I have been doing all my testing via NFS-mounted file systems, I did not realize that slaves cannot run tasks that are not available in their classpath at runtime (or, as was the case with my problem, tasks whose class file has changed since first being loaded by the slave). I assumed that having access to the gestalt.Task class would be enough to run any subclass of it. I was wrong, and for the system to be fully functional I will have to modify it so that the Class object for the subclass of gestalt.Task that is being submitted by a slave is also propagated to the server, and to each slave that computes a task of this class. This will be a trivial modification, but I do not have time to do it before the submission deadline.
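One possible shape for that modification, sketched under the assumption that the raw class file bytes travel alongside each task: a class loader on the slave that defines task classes from bytes it has been handed. ShippedClassLoader, register, and knows are hypothetical names; only ClassLoader, findClass, and defineClass are standard Java.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; Gestalt does not ship this class.
final class ShippedClassLoader extends ClassLoader {
    // Class file bytes received over the network, keyed by binary class name.
    private final Map<String, byte[]> classBytes = new ConcurrentHashMap<>();

    ShippedClassLoader(ClassLoader parent) {
        super(parent);
    }

    /** Stores a defensive copy of the class file bytes for `className`. */
    void register(String className, byte[] bytes) {
        classBytes.put(className, bytes.clone());
    }

    /** Reports whether bytes for `className` have been registered. */
    boolean knows(String className) {
        return classBytes.containsKey(className);
    }

    /**
     * Invoked by loadClass only after the parent loader has failed to
     * find the class, so classes already on the classpath still win.
     */
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = classBytes.get(name);
        if (bytes == null) {
            throw new ClassNotFoundException(name);
        }
        return defineClass(name, bytes, 0, bytes.length);
    }
}
```

A loader cannot redefine a class it has already defined, so picking up a changed class file would likely require a fresh loader instance per version. Deserializing a task with such a loader would also mean overriding ObjectInputStream.resolveClass to consult it.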


Last Modified: 7/24/97 by jack@cs.hmc.edu