Information Technology : Computing methods

Information Technology Portfolios


Architecture for Speculative Parallel Execution Improves Performance, Simplifies Programming

UW–Madison researchers have developed a system that permits speculative execution of program tasks prior to determining data dependency. Before commitment of the tasks in a sequential execution order, data dependencies are resolved through a token system that tracks read and write access to data elements accessed by the program portions.

Eliminating the need to wait until late in the program execution to detect or resolve dependencies helps improve processor utilization. Advancing the execution of tasks that ultimately do not experience data dependency problems may have a ripple-through effect, reducing later data dependencies as well.

Hardware Blends Compute/Storage Capabilities, Increases Efficiency

A UW–Madison researcher has developed a versatile new computer architecture using interconnected tiles that can alternate between memory and computation functions. More specifically, each tile can be configured as a (i) multibit nonvolatile memory, (ii) logic gate array or (iii) routing switch. The ability to dynamically change the function of any of the tiles allows precise tailoring to workload and reduces data transfer costs.

Managing Memory in Virtualized Computer Systems

UW–Madison researchers have developed a so-called compound page table system that blends features of shadow paging and nested paging to improve efficiency. Memory regions with stable address mapping (e.g., holding program code) may be treated using shadow paging while regions with dynamic address mapping (e.g., variable storage) may be treated using nested paging. Thus, the benefits of both techniques can be obtained.

More Efficient Signal Processing for Digital and Smartphone Cameras

A UW–Madison researcher has developed ISP circuitry than can operate in two modes. One mode optimizes the signal for human vision and the other mode optimizes the signal for feature/gesture recognition. The latter mode uses less energy because the image can be of lower quality.

The new ISP design conserves power by not processing each pixel value, operating all processing stages or sampling every frame.

Predicting Computer Memory Failure

The researchers have now developed a method for predicting faults in static random access memory (SRAM) and cache cells. In the new method a memory circuit is artificially aged by reducing voltage, then checked using a predetermined test vector. The vector is altered if there is memory cell failure (i.e., a value of 1 will read out as 0).

The portion of memory being checked may be small and rotated through the entire memory structure to minimize overhead.

Intelligent Memory Fault Patching Cuts Costs

UW–Madison researchers have developed a method that allows “patching” of failed memory elements using alternative memory. They determined that under normal operating conditions there is excess capacity in many redundant memory structures, thus providing a solution with little or no cost impact.

In the new method, accesses to data from faulted memory areas is diverted to a secondary memory structure. This secondary memory structure is flagged to increase the persistence of the stored data used for patching against normal updating policies.

New 2-D Optical Trap Array for Quantum Computing, Sensors

UW–Madison researchers have developed a new approach for trapping and controlling atomic particles using projected light. Specifically, the new method generates optically induced traps that are more effective and efficient.

The projected light fields may include linear segments of light arranged on a two-dimensional planar grid, which can be used to form optical trap arrays that define the locations of the atomic particles in three dimensions. This configuration helps arrange the atoms in individual and optically defined sites.

Memory Controller for Heterogeneous Processors

UW–Madison researchers have developed a memory controller providing improved access to common memory when a single parallel application is divided between different processor types, e.g., a CPU and a GPU. In these instances, fairness may not be a primary consideration and performance can be evaluated in terms of completing the entire application.

The controller works by dynamically adjusting access priorities between the different processors. It can predict sequential memory accesses by the processor having higher memory latency or fewer access requests to lockout the other processor during those sequences.

Voltage Regulator Control for Processors Conserves Energy

A UW–Madison researcher has developed an improved VR system for next-generation hardware providing direct rather than inferred current measurements. In the new system, a controller manages the number of active phases of each VR according to a determined electrical current demand from the processor.

Relying on electrical current demand (rather than P-state) boosts VR efficiency, particularly in situations where low current demand occurs under heavy processor demand because of certain power variations.

Memory Processing Unit Boosts Performance, Cuts Energy Usage

UW–Madison researchers have developed a system to dramatically improve the benefits of 3-D die-stacking memory. Their system allows a host processor to efficiently offload entire pieces of computation for faster processing with reduced power consumption.

More specifically, memory processing unit cores are tightly coupled with sections of stacked memory layers, combined as memory ‘vaults’ in hardware. Application code is segmented into discrete partitions (‘shards’) in software for storage in the vaults. As a result, an application program is effectively broken up for execution among multiple processing cores in close proximity to memory.

New Hardware Helps Cell Phones, Tablets Save Power

UW–Madison researchers have developed a more energy-efficient multiplier circuit for portable electronics. The circuit performs ‘dynamic truncation,’ reducing the size of operands to capture their most important bits. This allows multiplication to be performed using a much smaller multiplier, significantly reducing energy consumption.

Dynamic truncation works by preserving computationally important bits and providing approximations that are satisfactory for most applications.

Energy-Efficient Multiplier Circuitry for GPUs

UW–Madison researchers have developed a new circuit system for multiplying floating-point numbers. The system combines a traditional floating-point multiplier with a power-of-two multiplier that works by shifting operations. Substantial power savings may be realized by selectively steering some operands to the power-of-two multiplier.

The different circuits have different advantages. The floating-point multiplier uses more power but is more versatile, while the power-of-two multiplier uses less power but is less versatile.

Increasing Memory Bandwidth

UW–Madison researchers have developed a system to substantially increase memory bus bandwidth by combining a parallel memory bus with a high-speed serial memory bus. A serial memory bus normally introduces too much latency (data delay) for general computer processors, but this can be accommodated using special processors like GPUs or streaming processors.

By selectively steering some memory traffic to the serial memory bus, total memory bandwidth is significantly increased while still providing low latency when needed via the parallel memory bus.

Dynamic Bandwidth Scaling Improves Energy Efficiency

UW–Madison researchers and others have developed a ‘scheduler’ to dynamically scale the number of I/O pins depending on workload characteristics. The scheduler increases main memory bandwidth during memory intensive phases of a program to reduce the number of main memory accesses, then decreases the bandwidth when it is no longer needed.

Key metrics are used to identify the optimal number of I/O pins, such as instructions per cycle, utilization of execution units and the number of consecutive cache misreads.

Database Engine for Faster Analytics

UW–Madison researchers have developed “Widetable,” a query processing method that provides a faster and more efficient means for scanning data across multiple tables.

The method accelerates queries by denormalizing multiple tables of a relational database into a smaller number of “wide row” tables using an outer join function. Such denormalization substantially increases the amount of data that must be stored to represent the database. While this might be expected to slow scan rates, speed is gained by eliminating other time-consuming operations.

Method Predicts Porting Speedup

UW–Madison researchers have developed a method of assessing the benefits of porting a program before the effort and cost of porting are undertaken. In other words, the amount of speedup can be predicted in advance.

In the new method, a computer measures multiple quantifiable execution properties of a given program. These properties describe how the program executes on a first processor system (e.g., a CPU). Next, the properties are applied to a model that estimates the change in speed that will occur if executed on a second processor system (e.g., a GPU).

Computer Accelerator System Boosts Efficiency

UW–Madison researchers have developed a specialized memory access processor that takes over the job of feeding data to the accelerator. It is placed between the main processor and the accelerator.

The circuit is specialized for a narrow task, in this case performing memory access and address calculations. It is as fast as the main OOO processor yet more efficient. The main OOO processor – free from memory access duties – may switch to an energy conserving sleep mode until the accelerator is finished, or may move on to other tasks.

Improved Gate Design for Quantum Computers

UW–Madison researchers have developed quantum dots with a novel tunnel barrier gate design. The structure consists of a quantum well layer with three 2DEG regions separated by three tunnel barriers.

An electrode is patterned on a dielectric layer (instead of directly on the dot surface) above the first tunnel barrier. The various electrodes formed on the dielectric layer are arranged to define quantum dot regions within which the energy level and spin of electrons can be manipulated.

The multiple layers of the structure can be made using conventional deposition systems or lithography techniques.

Managing Computer Power and Performance

UW–Madison researchers have developed a set of predictors to monitor the energy consumption and performance of a computer’s individual components in runtime. This information can be used to manage the system based on user needs (i.e., higher power/higher performance; reduced power/reduced performance).

The predictors work by establishing predicted tradeoffs between power and performance for a particular workload while it is being executed. The predictions are combined in a system model to identify a limited number of operating state combinations. This allows operating states to be readily adapted during program execution based on a particular workload. Pareto optimal settings can be used to simplify adjustment of the system during runtime.

Encrypting Intellectual Property Cores

UW–Madison researchers have developed a method for encrypting the functional descriptions of IP cores. The encrypted descriptions allow simulation but still obscure the design and operation of the underlying circuit. This provides more flexible testing capabilities while protecting intellectual property.

First, an encryptor receives a description file of the circuit. The encryptor then outputs a description of the underlying IP core in which the nodes or gates of the circuit are replaced with generic placeholder nodes. These placeholders are given encrypted multivalued truth-tables that permit simulation but effectively disguise their function. For example, multiple alias values may hide the logic of the node, or the truth-table may include erroneous entries. The effect is to render the function of the node symbols practically unintelligible.

Memory Conserves Power, Is More Reliable

A UW–Madison researcher has developed a new memory structure (e.g., for caches, SRAM, DRAM, translation lookaside buffers, etc.) that controls error handling as a function of operating voltage. In this way, errors are corrected when the memory is run at low voltage and frequency.

The new design is made of multiple independently controlled groups of memory cells, each adapted to store digital data bits and error handling bits. A memory controller monitors the circuit to determine operating voltage and look for errors in the different groups when voltage is low. The groups have different physical sizes, and therefore have differing, predetermined susceptibility to errors as a function of voltage.

Performance issues associated with error correction, such as additional access latencies, can be avoided when the memory structure is run at a higher voltage (and frequency) and errors are less likely. Also, increased latencies due to evaluating error handling bits may be hidden by reading the digital data bits speculatively and assuming no errors.

Energy-Efficient Parallel Processing

A UW–Madison researcher has developed a technique to reduce the clock speed of spinning cores. This conserves energy during fine-grained synchronization events that would otherwise be fatally slowed by multiple system calls.

Specifically, clock dividers are provided on each core, allowing rapid changes in clock frequencies. A spinning core is slowed but still can operate in the environment of the other processors by making certain frequency adjustments. Power fluctuations that may result from abrupt clock speed changes can be minimized by disabling processor functions prior to the change.

More Efficient Processing with Self-Invalidating IOMMU Mapping

UW–Madison researchers have developed a more efficient IOMMU. They recognized that the time required to delete a page table entry (PTE) from the page table and send the IOMMU cache deletion signal can be eliminated for most transactions.

This is done by attaching a ‘removal rule’ to the PTE that allows for self-deletion. The removal rule may, for example, delete the PTE after a predetermined number of memory accesses or a specified time. This significantly cuts processor time and resources required for IOMMU transactions. Also, the susceptibility of the computer to I/O device or driver errors is reduced.

Predicting Logic Gate Failure

UW–Madison researchers have developed a new fault prediction technique providing accuracy, generality and power efficiency. To do this, two similar circuit modules are used but one is artificially aged. Aging can be mimicked by lowering operating voltage and/or phasing a sampling clock to reduce slack time. Both approaches make the ‘aged’ gate more sensitive to delay. To achieve power efficiency, a novel technique is used to turn on this ‘mode’ at a low periodic rate.

The outputs of the two circuit modules is compared. Discrepancy in output indicates a projected failure.

Dynamic Predictor Improves Machine Control

The researcher now has developed a new dynamic predictor that rapidly and accurately calculates the motion trajectory of a system that is only partially constrained by joint inputs. This dynamic predictor achieves stable and accurate results for stiff systems. To do this, the predictor applies conditions achieving such results at both a first and second joint position at the start and end of a motion time step.

More specifically, the relationship between joints is described as a differential equation to be solved by the predictor. The predictor parameterizes the motion of the unconstrained joints in such a way as to match the conditions the solution needs to satisfy at both the start and end of a motion time step. As this parameterization is expressed by polynomial coefficients, motions of the remaining joints are readily determined by the kinematic predictor.

Maximizing Multicore Processor Performance

A UW–Madison researcher has developed a solution for improving performance while still meeting a maximum power budget for multicore processors operating in a power-constrained environment.

The method provides for joint scaling of both the number of cores in operation as well as the amount of resources per core. Selecting the number of cores and the amount of resources is done by examining the degree of ILP and TLP available for a given application. As such, performance counters (and other characteristics) implemented by a processor are used to determine optimal core/resource configurations.

Performance counters may measure, for example, how many instructions are executed per cycle, length of execution time and cache hit rate. These performance measurements indicate how much ILP and TLP a given application exhibits at a time.

Faster Scans with Improved Bit-Parallel Processing

UW–Madison researchers have developed a method for improved parallel processing at the bit level. Faster query scans are performed by making better use of an existing processor ‘word,’ a unit of data of a defined size.

The system selects between two different organization techniques (a horizontal or vertical bit parallel structure). Data elements are prepacked into one of these structures according to processor word size. After words are populated with data from multiple elements of the database, query operations process them in each data word simultaneously in the arithmetic logic unit (ALU).

Mobile Devices Conserve Energy by Adjusting Accuracy

A UW–Madison researcher has developed multiplication circuitry that dynamically changes its accuracy (and energy usage) in response to operating demands. Accuracy is adjusted to meet particular computation tasks, power management strategies or error thresholds.

Specifically, a shift and accumulate multiplication circuit precomputes multiplicand shift amounts rather than computing them on the fly with a ‘leading-one detector.’ The circuit prestores the values in a coefficient memory. A controller adjusts accuracy according to processor needs.

Precomputation is possible in many recognition tasks associated with HMIs, where relatively static multiplier coefficients are used.

High-Definition Video with Low-Speed Cameras

UW–Madison researchers have developed a temporally offset sampling scheme for generating high-definition video.

For each pixel in a camera, image data is captured at different times and with differing exposure periods (e.g., randomly selected), and temporally offset from all others. This image data from the disparate pixels and times are compared and matched to construct video.

Images may be sampled at a relatively low rate with offset, per-pixel exposure time. In this way, the scheme achieves high dynamic range by spatial or temporal redundancy, thereby matching images captured at the different pixels.

A CMOS or CCD-based sensor can be implemented to generate high-speed video frames using low sampling rate.

Optimizing Parallelism During Run-Time

UW–Madison researchers have developed a method to determine the optimal amount of parallelism during run-time and to control parallelism according to that determination. This run-time assessment improves portability of programs among multiple disparate devices across which the amount of parallelism may change significantly.

In an embodiment, a virtualizer program transparently interposes between the application program and the computer’s operating system. It continuously monitors changes in the execution environment and rapidly optimizes the program’s parallelism to best fit the environment.

The virtualizer program operates in periodic intervals. During a preceding interval, it varies the parallelism in a program, monitors performance and establishes a relationship between parallelism and performance. Based on this relationship, it controls the parallelism during a succeeding interval to satisfy a desired performance goal while ensuring the program’s forward progress.

SuperTag Cache for Energy-Optimized Compression

UW–Madison researchers have developed a compressed cache, called SuperTag, which exploits spatial locality to optimize compression effectiveness and energy use.

SuperTag cache manages cache at three granularities: ‘super blocks,’ single blocks and fractional data segments. Since contiguous blocks have the same tag address, SuperTag increases per-block tag space by tracking super blocks (for example, a group of four aligned contiguous blocks of 64 bytes each). It also breaks each cache block into smaller data segments for storage.

To improve compression ratio, the technique uses a variable-packing scheme allowing variable-size compression blocks without costly compaction. It also co-compresses contiguous blocks, including within the same super block, thereby producing data segments for storage.

Linear Programming for Practical, Real-Time Error Correction and Decoding

UW–Madison researchers have developed a decoder for LDPC codes, and other similar coding systems, that can be proven to converge and may be parallelized for high-speed processing.

The method provides an error correction circuit including a buffer memory for holding a received string of bits derived from a transmitted (or stored) string. A parity rule memory holds a set of parity rules for the transmitted string, with each rule describing a predefined intended relationship between a subset of the bits as originally transmitted. The buffer and parity rule memories communicate with an optimizer, which generates a corrected string of bits using a linear programming (LP) process.

Improved GPU Performance by Memory-Link Compression

A UW–Madison researcher has developed a GPU design for faster data transfer by compressing and decompressing data passed between the units and their memories.

The computational elements of the GPU are adapted to receive, execute and output data through connected memory channels. A compressor/decompressor associated with each channel prepares the data for reading and storage.

More Efficient, Portable Graphic Processing System by Exception Handling

UW–Madison researchers have developed a method for GPU exception handling that overcomes the problem of state storage by identifying idempotent regions of the executed code. Idempotence describes an operation that produces the same results if executed once or multiple times, meaning it can be retried as often as necessary without causing unintended effects. Identifying these regions allows a state to be restored (without storage) by simply moving backward in the program.

The method decomposes GPU programs into idempotent sequences and regions. GPU programs tend to have very large regions of code that are idempotent. They can be used to rapidly service infrequent exceptions and those, like page faults, that require a context switch. For data mis-speculations, regions can be further divided into fast idempotent sub-regions. Additionally, a compiler incorporated into the GPU architecture reviews instructions to identify idempotent regions and mark those readable by the GPU.

New Heterogeneous Cache System for Improved Efficiency in Processors

UW–Madison researchers have proposed a heterogeneous cache structure that operates at reduced voltages and utilizes a combination of large and small transistors. Cache memory provides high-speed local storage for a processor that may help overcome the relatively slower access speeds available between the processor and the main solid-state memory.

The improved design provides a cache system comprising a series of addressable transistor memory cells holding digital data when powered by an operating voltage. Individual cells in the cache system may be deactivated or activated as a function of operating voltage. The cache structure is one component of the integrated circuit design, which also comprises a processor and a cache controller.

Computer Architecture Enables Simplified Recovery from Speculative Execution Errors without Checkpoints

UW–Madison researchers have developed a simplified processor that can recover from mis-speculation or execution of other erroneous instructions without maintaining or reloading a conventional checkpoint. The processor simply re-executes from the start of a block of instructions that includes the erroneous instructions. This is possible by constructing programs in terms of consecutive idempotent regions, which produce the same results even when executed repeatedly, and limiting speculation to occur in those regions. The ability to recognize (at compile-time) and exploit idempotent regions eliminates much of the circuitry and energy consumption needed to recover from mis-speculation or hardware failure.

Operating System Support for Restartable File Systems

UW–Madison researchers have developed techniques that provide a restartable file system that allows the operating system to respond to faults or failures in the file system without restarting the whole system or losing data. The techniques create a “logical membrane” around the file system. When a file system failure occurs, the failure is isolated without significantly impacting the execution of the operating system or applications.

The four key components of the method are checkpoints, operation logging, unwinding and object tracking. During normal operation, the system logs file system operations, tracks file system objects and periodically performs lightweight checkpoints of the file system state. If a file system crash occurs, the system delays pending file system operations, halts in-progress file system operations and unwinds current operations to a safe state. After recovery, it restores the file system using the most recent checkpoint and rebuilds the file system state using inter-checkpoint logs. Applications are unaware of the crash and restart. Through isolation of the file system, this technique can avoid restarting the operation system in response to file system failures. This improves reliability by allowing applications to keep executing without losing state and improves the user experience.

Controlling Parallelism in Real Time

UW–Madison researchers have developed a method to dynamically control parallelism according to performance. Their elegant solution measures how fast a program is executing and adjusts parallelism in real time to achieve a balance. This improves speed, throughput, energy efficiency, cache usage, memory and more.

Improved Method Provides Run-Time Parallelization of Computer Software

UW–Madison researchers have developed a method that provides run-time parallelization of sequential computer software using data-associated tokens. The method offers a simple mechanism for detecting write-write, read-write and write-read data dependencies between computation tasks. It further processes the computations to achieve a parallel schedule of execution whenever possible.

Optimizing Clock Speed and Power Dissipation in Multicore Processors

UW-Madison researchers have developed a method to improve microprocessor performance and efficiency by tuning the power and frequency of individual cores. The transistors of the microprocessor are arranged in processing circuits, each providing local power control at each of the cores and a clock input to synchronize the clock speed of the cores. The local power controls are used to lower the maximum operating frequency of the cores to match the maximum operating frequency of the slowest core, which reduces power consumption to well below the power constraint of the system. Then, the maximum operating frequencies of all the cores can be raised with an increase in the global supply voltage until the power constraint of the system is met.

Dynamic Dependence-Based Parallel Execution of Software for Performance Optimization

UW-Madison researchers have developed a mechanism that maintains sequential program semantics while implementing a dynamic parallel execution of independent computations in a manner which reduces the likelihood of data races – instances in which multiple threads accessing shared data “race” one another to influence the output.

Using this model, programmers specify a serializer, a piece of code that when executed, maps a “function invocation” into a serialization set.  The programmer may specify the serializer such that all function invocations operating on the same data are mapped to the same serialization set.  Then, the execution of the function invocations are automatically parallelized by assigning them to a number of delegate threads.  In this manner, function invocations in different serialization sets are assigned to different delegate threads.  This ensures that operations on the same datum are executed in program order, avoiding data races and retaining the advantage of predictable, deterministic execution.

Efficient and Automated Analysis of Thin Structures to Enhance Engineering 3-D Modeling Software

UW-Madison researchers have developed a new technology for a fully automated and efficient analysis of thin structures.  The new process overcomes the limitations of conventional geometric reduction and 3-D FEA by providing a dual-representation, in which the geometry is captured in 3-D, but the physics of the structure are captured via classic beam, plate or shell theory. 

The integration of 3-D geometry and lower-dimensional physics leads to numerous advantages:
  1. The technology can be directly integrated into 3-D CAD systems.
  2. The 3-D CAD model need not be simplified or dimensionally reduced prior to analysis.
  3. A boundary triangulation of the CAD model is sufficient for analysis, i.e., a 3-D mesh is not required.
  4. It retains the ability to achieve the accuracy of modern lower-dimensional beam, plate and shell methods.
This process is particularly well-suited for analyzing thin structures since it offers the flexibility and generality of 3-D FEA and the computational efficiency and accuracy of beam, plate or shell analysis.  The new method and system will streamline 3-D structural analysis in many fields of engineering design because this process can be incorporated into traditional CAD software via conventional integration techniques. 

Data Flow Execution of Methods in Sequential Programs

UW-Madison researchers have developed an automated way to parallelize the execution of a sequential computer program for multiple processors. Programmers usually write computer programs by dividing them into methods, which consist of one or more instructions that collectively perform a specific subtask of the program. Programmers write programs with “calls” to the methods when they need to perform that subtask and continue writing the rest of the program. The instructions in a method can access memory values of the program, and may be provided with some input values, which could be used to specialize the instructions of the method.

In this invention, a trigger is specified for a method when memory and input values are assumed to be available, and the method begins running on a different processor. This trigger usually occurs before the call to that method is reached in the program. Therefore, the method is run in parallel with the program. If the memory and input values have not changed when the method call is reached, results of the instructions in the method are used without running the method again.

Spin-Bus for Information Transfer in Quantum Computing

A UW-Madison researcher has developed a method of linking distant qubits, allowing spin transfer to occur accurately and quickly along the length of the qubit array. Each qubit is reversibly linked to a one-dimensional array of quantum dots called a spin-bus. Each dot in the spin-bus contains one electron that has a strong, consistent link to all of the others. Information can be transferred from a qubit to the spin-bus and then to another qubit, however distant, in a two-step process.

Adaptive Cache Compression System

UW-Madison researchers have developed a flexible cache compression system that dynamically adapts to the costs and benefits of compression. Data in a cache are selectively compressed based on predictions as to whether the benefit of compression in reducing cache misses exceeds the cost of decompressing the compressed data. The predictions are based on an assessment of actual costs and benefits for previous instruction cycles of the same program.

Processing Unit Having Multioperand Decimal Addition

UW-Madison researchers have developed methods for rapidly performing decimal addition on multiple binary coded decimal (BCD) operands. Their discovery uses four techniques, three of which speculate BCD correction values and use a new approach called “chaining” to correct intermediate results. The first technique speculates over one addition, while the second technique speculates over two additions. The third technique uses multiple instances of the first technique in parallel and then merges the results. The fourth technique uses a binary carry-save adder tree and produces a binary sum. Then combinational logic is used to correct the sum and determine the carry into the next digit.

Secure Electronic Message Transport Protocol

Researchers at the UW–Eau Claire have developed a method for the secure transmission of electronic messages. This protocol, referred to as the secure electronic message transport protocol (SETP), makes use of two distinct sub-protocols: a message transport protocol and an encryption key management protocol. The two sub-protocols employ public-key cryptography and operate in tandem for enhanced security. This system is carefully designed to prevent message access by external intruders as well as by legitimate mail and encryption key servers. Unlike SMTP, which only encrypts the content of a message, SETP encrypts the entire message, including content and flow control information such as the receiver and sender fields. SETP relies on a group addressing scheme to obscure the identities of individual senders and recipients.

Processing Unit Having a Decimal Floating-Point Divider Using Newton-Raphson Iteration

UW-Madison researchers have developed a high-speed method for dividing decimal floating-point numbers. The technology uses an accurate piecewise linear approximation to obtain an initial estimate of the divisor’s reciprocal. The initial estimate is then improved using a modified form of Newton-Raphson iteration. Finally, the divisor’s reciprocal is multiplied by the dividend and efficiently rounded to produce the quotient.

Decimal Floating-Point Adder

UW-Madison researchers have developed a decimal floating-point adder that rapidly performs decimal floating-point arithmetic, particularly addition and subtraction. The decimal floating-point adder includes an alignment unit that aligns the significands (the part of a decimal floating-point number that contains its significant digits) of two floating-point numbers so that the exponents associated with the floating-point numbers have equal values. For example, the numbers 12.3 and 4.56 could be represented as 1230 X 10-2 and 456 X 10-2. A binary adder then adds the aligned significands. A correction unit and an output conversion unit are included in the floating-point adder to produce the final decimal floating-point number.

Spin Readout and Initialization in Semiconductor Quantum Dots

UW-Madison researchers have now developed a semiconductor-based, quantum dot scheme for enabling measurement of the spin states of individual qubits without the need for additional external couplings that can cause decoherence. In this quantum dot scheme, the energy levels of trapped qubits are controlled by varying the voltages of nearby metal gates. By bringing specific energy levels into resonance with an applied microwave field, the qubits can be made to undergo spin-dependent oscillations that are detected by a single electron transistor.

Identifying Computer Program Phase Changes through Program Working Set Analysis

UW-Madison researchers have developed a method of identifying changes in program phases by analyzing changes in program working sets. Defined as those program regions being actively used at any one time, working sets are a manifestation of program phases. To detect changes in working sets, the method first generates a highly compressed representation of each working set, called the working set signature. Using a metric called the relative signature distance, the system then compares the signatures of consecutive working sets to detect changes. The method also estimates working set size by counting the number of bits in the working set’s signature.

Input Feature and Kernel Selection Improves Speed, Accuracy and Efficiency of Support Vector Machine Classification

UW-Madison researchers have developed a selection technique that makes use of a fast Newton method to produce a reduced set of input features for linear SVM classifiers or a reduced set of kernel functions for non-linear SVM classifiers. The ability to suppress less useful data and select a reduced set of meaningful support vectors promises to increase the computational speed, accuracy and efficiency of SVM classification for data mining operations.

A Novel Method for Detecting Computer Viruses

UW-Madison researchers have developed a novel approach to identifying malicious portions in a suspect computer program. The approach is able to detect malicious code that has been obfuscated, or disguised, by examining the function of the code rather than its “expression” as a string of instructions.

This functional analysis is made possible by a preprocessor that receives the suspect computer program and converts the program instructions into a standard form denoting their function. A detector reviews the standardized version of the suspect program against a library of standardized malicious code portions and indicates when malicious code is present in the suspect program.

Token-Based Cache Coherence Protocol for Shared Memory, Multiprocessor Computer Systems

UW-Madison researchers have developed a new method for maintaining cache coherence in shared memory multiprocessor computer systems. The key innovation provided by this technology is that read and write privileges to shared memory blocks are tracked by transferring and counting a set number of “tokens.” The system ensures the correctness of shared memory by allowing a processor to read from a cache block only if the processor possesses one token, and to write to a block only if the processor possesses all the tokens. Because token coherence ensures correctness, designers may aggressively optimize for performance (e.g., by predicting common sharing behaviors) without worrying about introducing errors.

Method and Device for Parallel Execution of Computer Software Using a Distilled Program

UW-Madison researchers have developed a sophisticated new model to achieve parallel processing of computer programs through speculation. The technique employs a speculative approximation of an original program, called a distilled program, which makes predictions about control and data flow to break dependencies that would otherwise serialize execution.

More specifically, a computer program is monitored to identify predictable, recurring behaviors, and a simpler, “distilled” version is created by assuming that these behaviors continue to repeat. The distilled version executes faster than the original program, but with no guarantee of accuracy. As the distilled program runs, it forwards starting points and other necessary data to secondary processors, which carry out the portions of the original program corresponding to each checkpoint. As the secondary processors finish their tasks, their state data are used to verify the state data assumptions of the distilled version. If mistakes are found, the speculative execution can be stopped and program execution restarted from the last checkpoint.

Solid-State Quantum Dot Devices for Quantum Computing

An interdisciplinary team of UW-Madison physicists, engineers and materials scientists has now designed a scalable, multilayer semiconductor device that can be used to build a quantum computer. Fabricated by growing layers of semiconductors and patterning electrostatic gates on the surface, this quantum dot device traps individual electrons in a solid, brings the electrons close to each other, maintains the electrons’ phase coherence and allows manipulation of the electrons’ individual spin states.

Concurrent Execution of Critical Sections by Omitting Lock Acquisition

UW-Madison researchers have developed a novel method for coordinating access to data shared by multiple program threads. Based on the insight that critical sections can often be executed simultaneously without conflict, each program thread of this invention first detects the beginning of a critical section, and then tentatively executes it without first acquiring the lock. The provisional execution is finalized only if no conflict is detected; otherwise execution is quashed. In other words, when a critical section can be executed concurrently by more than one thread without conflict, the step of acquiring the lock becomes unnecessary and is omitted.

Bandwidth-Adaptive, Hybrid Cache-Coherence Protocol

UW-Madison researchers have developed an adaptive hybrid protocol that dynamically selects between a snooping-like and a directory-like protocol, allowing the system to choose the best mechanism for communicating cache coherence messages based on the bandwidth available.

Smarter Cache Memory by Dynamic Sub-Block Data Fetching

UW–Madison researchers have developed a cache structure that dynamically changes the fetch block size by considering the historical success of previous sizes in satisfying processor requests. Moreover, the blocks may include data from discontinuous address ranges.

Specifically, the cache is divided into blocks, each holding data from an address range of the memory, and further divided into sub-blocks. A use table has entries indicating the sub-blocks that have had their data used by the processor since the block was loaded. Based on this information, a fetch size controller provides a size value for a given data address range. ‘Miss processing circuitry’ responds to processor requests by loading the specified data into a number of sub-blocks of the cache block determined by the fetch size value for that address range.

Faster Data Mining by Subset Binary Decision Trees

UW–Madison researchers have developed a new method of data mining using a low-access memory to store records and attributes, as well as a smaller, high-speed memory that will estimate a final classification scheme from a sequence of partial decision trees.

First, a subset of the database is loaded into the smaller memory to prepare partial trees associated with different attributes. These are combined to form an initial tree representing a statistically good estimate of data patterns. This estimate is used then to review the entire database in a single scan to verify and refine a final structure.

Memory Bypass Circuit for Faster Data Transfer

UW–Madison researchers have developed a technique for predicting data dependence that can be used to establish a link between two instructions. Accessing the data from standard or cache memory may be bypassed.

In its place, the data retrieved in the first data-using instruction is temporarily stored in a local register to be used by a second data-using instruction. Parallel processing techniques of squashing are used in the event that a prediction is erroneous.