Recent advances in manufacturing have allowed chip designers to place twice as many transistors in the same die area every few years. This scaling has exposed problems with effectively powering and cooling those transistors: power is now a first-class design constraint, often weighed more heavily than performance requirements. It has also led to an interesting dilemma: designers have more transistors than they can turn on concurrently (a phenomenon known as "Dark Silicon"), potentially limiting the performance of future chips.
We have been researching solutions to the power problem, developing circuits and architectures that operate near threshold, at 400-500 mV, whereas contemporary designs from Intel and AMD operate super-threshold, between 900 mV and 1100 mV. Near-threshold computing (NTC) yields roughly 10x better energy efficiency than current designs. Its drawbacks are a roughly 10x reduction in frequency and higher circuit variation, which can lead to more failures. My group's research on this project has focused on solving the performance and variation problems of NTC operation: parallel architectures to recover performance, and redundancy to tolerate variation-induced failures.
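To make the trade-off concrete, the following back-of-the-envelope sketch shows how parallelism can recover the throughput lost to NTC's lower clock. All numbers here are illustrative assumptions in the spirit of the figures above (10x energy savings, 10x frequency penalty), not measurements from our designs:

```python
# Sketch of the near-threshold computing (NTC) trade-off.
# Illustrative assumptions: ~10x lower energy per operation
# at the cost of ~10x lower clock frequency.

SUPER_FREQ_GHZ, NTC_FREQ_GHZ = 3.0, 0.3        # ~10x frequency penalty
SUPER_ENERGY_PJ, NTC_ENERGY_PJ = 100.0, 10.0   # ~10x energy/op savings

def throughput_and_power(cores, freq_ghz, energy_pj):
    """Ops/sec scales with cores * frequency (assuming one op per cycle);
    power is energy per op times ops per second."""
    ops_per_sec = cores * freq_ghz * 1e9
    watts = energy_pj * 1e-12 * ops_per_sec
    return ops_per_sec, watts

# One super-threshold core vs. ten parallel NTC cores:
st_ops, st_w = throughput_and_power(1, SUPER_FREQ_GHZ, SUPER_ENERGY_PJ)
ntc_ops, ntc_w = throughput_and_power(10, NTC_FREQ_GHZ, NTC_ENERGY_PJ)
# st:  ~3e9 ops/s at ~0.3 W
# ntc: ~3e9 ops/s at ~0.03 W, i.e. the same throughput at ~10x less power
```

Under these assumptions, ten NTC cores match the super-threshold core's throughput at a tenth of the power; the remaining problems are exactly the ones named above, parallelizing the workload across those cores and tolerating the higher variation.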
With the proliferation of mobile wireless devices over the past decade, numerous protocols have been developed for communicating between devices and with base stations, such as 802.11a/b/g, GSM v2/v3/v4, and CDMA.
All of these protocols are very compute-intensive and process massive amounts of data. We have been developing architectures that meet both the performance requirements of these protocols and the power budgets of mobile devices. This has led us to specialized architectures that leverage data-parallel techniques such as SIMD to achieve the required performance while remaining programmable enough to implement almost any wireless protocol on one chip. We have also been researching other application spaces to which such an architecture can be applied, such as multimedia encode/decode.
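The reason SIMD fits these workloads is that baseband kernels apply the same arithmetic to long streams of samples. The sketch below illustrates the idea on an FIR filter using a hypothetical 4-lane SIMD unit (the lane width, `simd_mac`, and `fir_simd` are all illustrative names, not our architecture's actual interface):

```python
# Illustrative sketch of why SIMD suits wireless baseband kernels:
# an FIR filter applies the same multiply-accumulate to many samples,
# so one instruction can drive many lanes at once.

LANES = 4  # hypothetical SIMD width

def simd_mac(acc, a, b):
    """One SIMD multiply-accumulate: acc[i] += a[i] * b[i] in every lane."""
    return [acc[i] + a[i] * b[i] for i in range(LANES)]

def fir_simd(samples, taps):
    """Compute LANES filter outputs per pass. Assumes the output count
    (len(samples) - len(taps) + 1) is a multiple of LANES."""
    outputs = []
    for base in range(0, len(samples) - len(taps) + 1, LANES):
        acc = [0.0] * LANES
        for t, coeff in enumerate(taps):
            # Broadcast the tap across lanes; each lane reads its own sample.
            lane_samples = [samples[base + lane + t] for lane in range(LANES)]
            acc = simd_mac(acc, lane_samples, [coeff] * LANES)
        outputs.extend(acc)
    return outputs
```

A scalar implementation would issue one multiply-accumulate per output per tap; here each `simd_mac` retires `LANES` of them, which is where the performance-per-watt advantage of data-parallel hardware comes from.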
General parallel programming for the masses has become an important topic for the research community, as many-core chip multiprocessors are now a commodity item found everywhere from servers down to the lowest-end PC. Programming these processors with threads, and synchronizing those threads correctly with current primitives such as locks while avoiding deadlock, livelock, and improperly protected critical sections, leads to programs that are incredibly difficult to debug. Transactional memory is one technique being researched that promises to solve some of these problems by offering an easier primitive: critical sections are delineated directly in code rather than through a collection of abstract locks.
Transactional memory, unfortunately, has problems of its own. While it can be easier to program than locks, it can easily suffer high contention among critical sections, leading to performance worse than sequential execution, and debugging this kind of performance problem is just as difficult. We have been researching techniques that dynamically identify currently problematic transactions, predict future problem transactions, and schedule the threads running those transactions to execute in an order that eliminates the performance penalty.
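A minimal sketch of the scheduling idea is below. It is illustrative, not our published mechanism: the abort threshold, class name, and the simple "serialize repeat offenders behind a lock" policy are all assumptions standing in for the dynamic identification and prediction described above:

```python
# Sketch of contention-aware transaction scheduling: a transaction that
# keeps aborting is predicted to conflict again, so it is serialized
# instead of retried optimistically, trading speculation for progress.

import threading

class ContentionScheduler:
    ABORT_THRESHOLD = 2  # assumed cutoff for flagging "problem" transactions

    def __init__(self):
        self.abort_counts = {}               # txn name -> recent abort count
        self.serial_lock = threading.Lock()  # fallback serial execution path

    def record_abort(self, name):
        """Called by the TM runtime each time this transaction aborts."""
        self.abort_counts[name] = self.abort_counts.get(name, 0) + 1

    def run(self, name, txn_body):
        """Run txn_body; serialize it once it has aborted too often.
        A real system would also decay counts so past offenders can
        return to the speculative path."""
        if self.abort_counts.get(name, 0) >= self.ABORT_THRESHOLD:
            with self.serial_lock:           # predicted conflict: serialize
                return txn_body()
        return txn_body()                    # otherwise run speculatively
```

The point of the sketch is the policy split: transactions with a clean history keep the optimism that makes transactional memory fast, while predicted conflicters are ordered explicitly so they stop wasting work on repeated aborts.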
Interactive applications, such as web browsing and 3D games, are what most of us use in our day-to-day computing. Servers and scientific applications have leveraged parallel architectures for decades; with the sudden shift of consumer hardware to parallel architectures about a decade ago, we began investigating how well interactive, consumer-type applications have responded to this new hardware. We have found that, disappointingly, interactive applications do not utilize the many processors found in current PCs, even though they create hundreds of threads.
Continuing this investigation, we are characterizing mobile/smartphone applications to determine what their needs are and what types of architectures may be best suited to meet their future performance requirements, whether a multicore design or something else entirely.
Datacenter architecture is another area we are investigating. In contrast to current interactive applications, which exhibit surprisingly little parallelism, datacenters leverage massive amounts of parallelism that can spread across thousands of machines. Power is also a constraint in the datacenter, as the cost to power the servers can now exceed their capital expense.
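A back-of-the-envelope calculation shows why power can rival server capital cost over a machine's lifetime. Every number below is an illustrative assumption, not data from our study:

```python
# Sketch: lifetime electricity cost of one server vs. its purchase price.
# All figures are assumed round numbers for illustration only.

SERVER_COST_USD = 2000.0         # assumed capital cost per server
SERVER_DRAW_KW = 0.4             # assumed average draw incl. cooling overhead
POWER_PRICE_USD_PER_KWH = 0.10   # assumed electricity price
LIFETIME_YEARS = 5

hours = LIFETIME_YEARS * 365 * 24
power_cost = SERVER_DRAW_KW * hours * POWER_PRICE_USD_PER_KWH
# ~ $1752 over five years, already comparable to the $2000 capital cost;
# higher draw or electricity prices push it past the purchase price.
```

This is the arithmetic behind treating power as a first-class datacenter constraint: shaving average draw across the CPU, memory, and storage compounds over thousands of machines and years of operation.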
We have been studying how low-power system architecture, spanning the CPU, main memory, and storage, should be employed to reduce power while still providing the desired performance. Alongside this work, we have been characterizing common datacenter operations to build synthetic benchmarks that can help guide research into datacenter architecture and organization.
Network processing is becoming more and more complicated as customers demand more functionality, such as deep packet inspection, at higher line speeds. This necessitates more threads to handle the increase in traffic and computation. To provide them, vendors add more cores or more thread contexts by replicating structures, which consumes power that may be unnecessary.
We have been researching "Virtual Context Architectures," which share limited architectural resources such as registers among more threads than traditional techniques can support, and studying how this sharing affects performance and power consumption.
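The core idea can be sketched in a few lines. This is an assumed, simplified model for illustration, not the published Virtual Context Architecture design: a fixed-size physical register file backs more thread contexts than it can hold at once by spilling inactive contexts to memory on demand:

```python
# Sketch: a register file shared among more threads than it has room for.
# Inactive thread contexts are spilled to memory and restored on demand,
# instead of replicating a full register file per hardware thread.

class SharedRegisterFile:
    NUM_REGS = 32  # assumed architectural register count per thread

    def __init__(self, physical_contexts):
        self.capacity = physical_contexts  # contexts resident at once
        self.resident = {}                 # thread id -> register values
        self.memory = {}                   # spilled contexts

    def activate(self, tid):
        """Make tid's registers resident, spilling a victim if full."""
        if tid in self.resident:
            return
        if len(self.resident) >= self.capacity:
            victim, regs = self.resident.popitem()  # simple victim choice
            self.memory[victim] = regs              # spill to memory
        # Restore a previously spilled context, or start a fresh one.
        self.resident[tid] = self.memory.pop(tid, [0] * self.NUM_REGS)

    def write_reg(self, tid, reg, value):
        self.activate(tid)
        self.resident[tid][reg] = value

    def read_reg(self, tid, reg):
        self.activate(tid)
        return self.resident[tid][reg]
```

The trade-off this exposes is exactly the one under study: threads beyond the physical capacity cost spill/restore traffic rather than replicated hardware, so the interesting questions are how often spills occur under real traffic and what that does to performance and power.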