
DARPA Funds Development of New Type of Processor

source link: https://www.eetimes.com/darpa-funds-development-of-new-type-of-processor/


By R. Colin Johnson 

06.09.2017


LAKE WALES, Fla. — A completely new kind of non-von-Neumann processor called a HIVE — Hierarchical Identify Verify Exploit — is being funded by the Defense Advanced Research Projects Agency (DARPA) to the tune of $80 million over four-and-a-half years. Chipmakers Intel and Qualcomm are participating in the project, along with a national laboratory, a university, and defense contractor Northrop Grumman.

Pacific Northwest National Laboratory (Richland, Washington) and Georgia Tech are involved in creating software tools for the processor, while Northrop Grumman will build a Baltimore center that uncovers and transfers the Defense Department's graph analytic needs for what is being called the world's first graph analytic processor (GAP).

“When we look at computer architectures today, they use the same [John] von Neumann architecture invented in the 1940s. CPUs and GPUs have gone parallel, but each core is still a von Neumann processor,” Trung Tran, a program manager in DARPA’s Microsystems Technology Office (MTO), told EE Times in an exclusive interview.

“HIVE is not von Neumann because of the sparseness of its data and its ability to simultaneously perform different processes on different areas of memory simultaneously,” Trung said. “This non-von-Neumann approach allows one big map that can be accessed by many processors at the same time, each using its own local scratch-pad memory while simultaneously performing scatter-and-gather operations across global memory.”


Graph analytic processors do not exist today, but they theoretically differ from CPUs and GPUs in key ways. First of all, they are optimized for processing sparse graph primitives. Because the items they process are sparsely located in global memory, they also involve a new memory architecture that can access randomly placed memory locations at ultra-high speeds (up to terabytes per second).

Today's memory chips are optimized to access long sequential locations (to fill their caches) at their highest speeds, which are in the much slower gigabytes per second range. HIVEs, on the other hand, will access random eight-byte data points from global memory at their highest speed, then process them independently using their private scratch-pad memory. The architecture is also specified to be scalable to however many HIVE processors are needed to perform a specific graph algorithm.
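To make "sparse graph primitive" and the random eight-byte access pattern concrete, here is a minimal C sketch. It is illustrative only: the tiny graph, the sizes, and the names are invented for the example and are not DARPA's design.

/* Illustrative only: a tiny sparse graph in compressed-sparse-row (CSR)
 * form and one "graph primitive" (summing each vertex's neighbor values).
 * The point is the access pattern: values[] is read at effectively random
 * 8-byte locations dictated by the edge structure. */
#include <stdint.h>
#include <stdio.h>

#define NUM_VERTICES 5
#define NUM_EDGES    7

/* CSR: edges of vertex v are col_idx[row_ptr[v] .. row_ptr[v+1]-1] */
static const int row_ptr[NUM_VERTICES + 1] = { 0, 2, 3, 5, 6, 7 };
static const int col_idx[NUM_EDGES]        = { 1, 4, 2, 0, 3, 4, 2 };
static uint64_t  values[NUM_VERTICES]      = { 10, 20, 30, 40, 50 };

int main(void)
{
    for (int v = 0; v < NUM_VERTICES; v++) {
        uint64_t sum = 0;
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++)
            sum += values[col_idx[e]];   /* random 8-byte gather per edge */
        printf("vertex %d: neighbor sum = %llu\n",
               v, (unsigned long long)sum);
    }
    return 0;
}

In a real workload values[] would span gigabytes and col_idx would point essentially anywhere, so each gather pulls in a whole cache line to use only eight of its bytes, which is the waste the HIVE memory architecture is meant to avoid.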

“Of all the data collected today, only about 20 percent is useful — that's why it's sparse — making our eight-byte granularity much more efficient for Big Data problems,” said Tran.


Together, the new arithmetic processing unit (APU) optimized for graph analytics and the new memory architecture chips are specified by DARPA to use 1,000 times less power than today's supercomputers. The participants, especially Intel and Qualcomm, will also have the rights to commercialize the processor and memory architectures they invent to create a HIVE.

The graph analytics processor is needed, according to DARPA, for Big Data problems, which typically involve many-to-many rather than many-to-one or one-to-one relationships for which today's processors are optimized. A military example, according to DARPA, might be the first digital missives of a cyberattack. A civilian example, according to Intel, might be all the people buying from Amazon mapped to all the items each of them bought (clearly delineating the many-to-many relationships as people-to-products).
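As a toy illustration of what such a many-to-many relationship looks like as data, each purchase can be stored as one edge in a sparse edge list. This is a hypothetical sketch; the people, products, and query are made up and not taken from Intel or DARPA.

/* Illustrative only: a many-to-many "people to products" relationship
 * stored as a sparse edge list, the natural input for graph analytics. */
#include <stdio.h>

struct edge { int person; int product; };   /* one purchase = one edge */

int main(void)
{
    /* Person 0 bought products 3 and 7; person 1 bought product 3; ... */
    struct edge purchases[] = {
        {0, 3}, {0, 7}, {1, 3}, {2, 9}, {2, 3}, {3, 1},
    };
    int n = sizeof purchases / sizeof purchases[0];

    /* Count how many buyers product 3 has: a toy many-to-many query. */
    int buyers_of_3 = 0;
    for (int i = 0; i < n; i++)
        if (purchases[i].product == 3)
            buyers_of_3++;

    printf("product 3 has %d buyers\n", buyers_of_3);
    return 0;
}

Real graph analytics runs queries like this over billions of edges, where the edge endpoints land at effectively random memory locations.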

“From my standpoint, the next big problem to solve is Big Data, that today is analyzed by regression which is inefficient for relations between data points that are very sparse,” said Tran. “We found that the CPU and GPU leave a big gap between the size of problems and the richness of results, whereas graph theory is a perfect fit for which we see an emerging commercial market too.”

R. Colin Johnson

R. Colin Johnson has been a technology editor at EE Times since 1986, covering next-generation electronics technologies. He's the author of the book Cognizers – Neural Networks and Machines that Think, is a contributing editor on SlashDot.Org, and is a Kyoto Prize Journalism Fellow for his coverage of advanced technologies and international issues.

dvhw@abscott.com
dvhw@abscott.com   2017-06-09 10:14:04

This is interesting research, but this is hardly the first non-von-Neumann machine. In fact we give the popular architecture (single address space) a name specifically to distinguish it from the Harvard architecture (split I/D space).  You could argue that the microarchitecture of modern CPUs is itself a kind of Harvard architecture.

I seem to remember one of the minor members of DEC's PDP line had a split I/D space, but I didn't see it on Wikipedia.

Also there was a lot of experimentation in the 80s on different architectures (e.g. dataflow) -- one was even named "non-von".

R_Colin_Johnson
R_Colin_Johnson   2017-06-09 10:23:42

Yes there have been many other attempts to build non-von-Neumann computers, but DARPA claims they have all failed. They also claim they will keep working on it until they succeed because it is a "Hard-to-Do: Must Do" to solve Big Data.

dvhw@abscott.com
dvhw@abscott.com   2017-06-09 10:43:07

The Atmel AVR is a failure?  That's news to me!

KarlS01
KarlS01   2017-06-09 11:49:56

Memory access time has been key ever since computers were invented.  It looks like they think that access time is inversely proportional to data width.  It is the time to access the first byte that is the killer -- the streaming time for the remaining data is overlapped with the access time for the next access, therefore hidden.

Too bad the focus is not on the real problem based on distance and speed of light.

KarlS01
KarlS01   2017-06-09 12:46:19

@dvhw:  Could you please explain why the AVR is non-von?  It looks like just another RISC because only loads and stores access memory for data, and of course instruction fetches are from memory.  It sure looks like both data and instructions are in the same memory.

elizabethsimon
elizabethsimon   2017-06-09 14:08:54

@Karl ...It sure looks like both data and instructions are in the same memory.

The AVR is Harvard architecture. See the following web site...

http://www.avr-tutorials.com/general/avr-memory-map

The 8051 is also Harvard architecture.

R_Colin_Johnson
R_Colin_Johnson   2017-06-09 14:27:48

You are right. DARPA agrees that it is a memory access time problem -- in particular the random access time of 8-byte data points. DARPA claims that today's processors fill large caches quickly, but the sparse packing of relevant Big Data is in the 8-byte range, making the moving of large blocks a waste of energy. Besides parallel access to global memory, it has added independent scratch-pad memories to all cores so they can simultaneously grab 8-byte data points from global memory, process them in local memory, return their results, and repeat.
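In rough terms, that gather/process/scatter loop looks something like the C sketch below. It is a sequential toy only: the sizes, names, and the trivial computation are invented, and a real design would run one such loop per core in parallel against shared global memory.

/* Illustrative only: grab a handful of 8-byte points, work on them in a
 * local scratch pad, write results back, repeat. Everything here is
 * invented; it is not DARPA's HIVE design. */
#include <stdint.h>
#include <stdio.h>

#define GLOBAL_WORDS 4096
#define SCRATCH_WORDS  16            /* per-core scratch-pad capacity */

static uint64_t global_mem[GLOBAL_WORDS];

int main(void)
{
    for (int i = 0; i < GLOBAL_WORDS; i++)
        global_mem[i] = i;

    uint64_t scratch[SCRATCH_WORDS];  /* stands in for the local scratch pad */
    uint64_t result = 0;

    for (int base = 0; base < GLOBAL_WORDS; base += SCRATCH_WORDS) {
        /* gather: pull a few 8-byte points into the scratch pad
           (a real graph workload would gather from scattered addresses) */
        for (int j = 0; j < SCRATCH_WORDS; j++)
            scratch[j] = global_mem[base + j];

        /* compute locally, without touching global memory */
        for (int j = 0; j < SCRATCH_WORDS; j++)
            result += scratch[j] * 2;

        /* scatter: write a partial result back to global memory */
        global_mem[base] = result;
    }

    printf("final result = %llu\n", (unsigned long long)result);
    return 0;
}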

Steven_Casselman
Steven_Casselman   2017-06-09 14:29:08

DRC Computer showed an FPGA-based graph processor at SC 2015.

https://youtu.be/DZZuur8LXOY

DARPA can just go out and buy stuff instead of paying big bucks to have someone reinvent the wheel.

KarlS01
KarlS01   2017-06-09 15:10:19

@Elizabeth:  Thanks.  The memory map explains it well.  Harvard was used a lot because the program was in ROM and therefore could not store variable data.

Even the x86 had some kind of control bit that separated data from instructions.  That is, until JIT compilation into the data cache triggered an instruction cache update.

Are we sure that DARPA knows what they are doing?

alex_m1
alex_m1   2017-06-09 15:21:30

Those kinds of memory access speeds/patterns seem extremely high. Any idea how they will get there?

KarlS01
KarlS01   2017-06-09 15:29:01

@Colin:  Pre-System/360 high-end IBM CPUs had data buffers to try to speed up matrix inversion.

Model 85 introduced cache as an improvement.  Meanwhile Models 50, 65, and 75 had 4-way interleaved memories, which gave the ability to have multiple (up to 4) 8-byte accesses depending on the low-order 2 bits of the address.

The address decoding steered the request to individual memory arrays and then delivered the returned 8 bytes to each requester.
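In C-like terms, that kind of interleaved bank selection looks roughly like the sketch below; the bank count and bit positions are illustrative, not the actual 360 design.

/* Illustrative only: picking a memory bank from the low-order bits of an
 * 8-byte-aligned address, the interleaving idea described above. */
#include <stdint.h>
#include <stdio.h>

#define BANKS 4

static unsigned bank_of(uint64_t addr)
{
    /* bits 0-2 select the byte within an 8-byte word;
       the next two bits select one of four interleaved banks */
    return (addr >> 3) & (BANKS - 1);
}

int main(void)
{
    for (uint64_t addr = 0; addr < 64; addr += 8)
        printf("address %2llu -> bank %u\n",
               (unsigned long long)addr, bank_of(addr));
    return 0;
}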

Your reply makes me think that there will be many memories multiplexed in some fashion over an ultra high speed link/bus.  That seems fair.  Thanks, that brought back memories about memories.

P.S. Cache was for calculation intensive apps, not random accesses.

Also earlier CPUs had 2 way interleave so there could be an instruction fetch and a data access to allow instruction decode to be overlapped with execution of the previous instruction.

KarlS01
KarlS01   2017-06-11 11:08:12

Elizabeth already mentioned 2 Harvard (non-von) examples after DARPA said all non-vons failed.  Now I picked up on this:

"HIVE is not von Neumann because of the sparseness of its data and its ability to simultaneously perform different processes on different areas of memory simultaneously," Trung said. "This non-von-Neumann approach allows one big map that can be accessed by many processors at the same time, each using its own local scratch-pad memory while simultaneously performing scatter-and-gather operations across global memory."

It so happens that the NASA moon shot control center in Houston had a cluster of four 360 Model 75s with I/O channels that could do scatter/gather, but that did not make them non-von.

Cache was simply not available when the control center was built.  Also the CPUs were CISC (not superscalar RISC) as that nonsense had not come about yet.

So DARPA wants a bigger better faster IMPROVED system for a different application.  Steve pointed out that even the application is not totally new.

R_Colin_Johnson
R_Colin_Johnson   2017-06-12 08:29:51

Yes, that bothered me too, thanks for bringing it up. DARPA did the research and found what it calls attempts to create non-von-Neumann architectures and decided that they were not successful. By what metric they measured "success" Trung did not say, and the BAA does not specifically call for a non-von-Neumann solution (we'll have to wait and see what Intel and Qualcomm propose as architectures; the examples given were just to clarify the problem). After speaking with Trung, I believe his measure of success is "widely popular": they pitched to Intel and Qualcomm that if they create a non-von-Neumann architecture that cuts through Big Data like a knife through butter and thus becomes widely popular, then HIVE will become known as the first, even though there were previous attempts. DARPA wants to create unique firsts that become widely popular, like the Internet, even though there were networks before Arpanet. DARPA justifies its existence by the value it gives to civilian society after satisfying its military goals with the same technology.

KarlS01
KarlS01   2017-06-12 09:47:39

Thanks, Colin:  I wish we knew more about Microsoft's "FPGA in the cloud" (an evolution of Project Catapult) because they may be working on a similar problem.

Another aspect is that CPU ISAs assume that all data is in memory, so there must be 2 loads, an assign, and a store (4 instruction fetches, 2 data fetches, and a data store) just to add 2 numbers.  Again, Microsoft Research's "Where's the beef" compared FPGA to CPU and concluded that there are too many instruction fetches.

I believe that a new processor that does if/else, for, while, do, and assignments while keeping variables local is doable by using the Roslyn/CSharp parser APIs.  (I am in the process of doing it)

R_Colin_Johnson
R_Colin_Johnson   2017-06-12 11:03:58

Thanks Karl. I'm sure you'll pique some interest in those Roslyn/CSharp parser APIs.

rsmith
rsmith   2017-06-12 12:13:18

I would like to point out that the first non-von Neumann computer architecture was a Meta-Mentor. Patented in 2006, the architecture erased the differences between von Neumann and Harvard architectures by splitting the functions into two distinct parts. It is also referred to as Multi Domain Architecture (MDA).  It uses combinational mathematics to determine faults and prevent computer viruses. For example, using Byzantine mathematics, the number of systems required to detect a fault is: n >= 3m + 1 (n = number of systems, m = number of faults). This means that using von Neumann architecture, a minimum of four systems are required to detect and recover from a single fault. Using combinatorial techniques like Graeco-Latin squares, or even Latin squares, MDA needs a maximum of three reconfigurable systems.  That is, three MDA systems can detect up to three different types of faults, and remain functioning and fault tolerant after the failures. This is just one of its features.

Roger Smith

sw guy
sw guy   2017-06-13 07:24:39

Even if I already thought of a computer where processing power could be distributed near memory areas, one must reckon that some sequences of code would generate a single instruction fetch for an addition, because both inputs and the output all live in registers. Sure, this applies only under favorable circumstances, but with the right CPU, compiler, and compiler options, it happens often enough that the count of instruction fetches is noticeably decreased.

agoodloe
agoodloe   2017-06-13 09:13:06

Previous attempts in the late 1980s and 1990s at pushing non-von-Neumann architectures such as Lisp machines, data flow machines, and reduction machines all failed to gain acceptance in the marketplace. While such designs had a strong intellectual appeal among the research community, they would have required the wholesale adoption of programming paradigms that are foreign to most developers, and the OS and every application would have needed to be rewritten in those paradigms. Had advances in traditional processors stalled, the cost may have been deemed acceptable, but the big chip makers poured resources into fabrication techniques and deeper superscalar pipelines that allowed the old dusty deck of C programs to run faster and faster, and consequently non-von-Neumann architectures got the reputation of being impractical.

KarlS01
KarlS01   2017-06-13 09:34:44

@sw guy:  I think you mean that data fetches are reduced, because  instruction fetches specify which data operands to use even if they are in register files.

KarlS01
KarlS01   2017-06-13 10:00:55

@Colin:  There is zero chance of any interest.

1) It runs on "Windoze".

2) Not RISC 

3) Not Linux

4) No TCL scripting

5) Not from an EDA vendor

6) Not based on marketing hype and buzz words

sw guy
sw guy   2017-06-13 10:27:18

@KarlS01

Here is what I meant, using fantasy instruction set.

You said the sequence for a single addition is:
load  REG1,DATA1
load  REG2,DATA2
add   REG3,REG1,REG2
store REG3,RESULT
(4 instruction fetches)

But as soon as the optimizer is able to go full steam ahead, the actual sequence would be:
add   REG3,REG1,REG2

Of course, some loads ahead and some stores behind are needed, but some compilers are very smart at getting rid of memory storage for local variables, which may be awkward to debug but is good for:
- Program size
- Data set size
- Count of instructions to get same work done
(abstract: good for performance from every side)

KarlS01
KarlS01   2017-06-13 11:22:09

"Of course, some load ahead and some store behind are needed..........."  And it also depends on omputation intensive applications, not general applications where the control flow is full of branches.

This is the fallacy of superscalar assumptions: that there is a very high probability that the data has been put in a register previously.  Can you quantify "some" in some way?

How about context switching, interrupts, and all the other unpredictable things?

In fact this whole topic is about the need to access data that exists in 8 byte chunks in global memory.  So much so that caches are a problem, causing congestion and wasted power. 

EricOlsen
EricOlsen   2017-06-28 17:56:34

The 8031/8051 is a well-known Harvard architecture with separate I/D.

EricOlsen
EricOlsen   2017-06-28 18:02:24

I'm most interested in Google's TPU and their work.  I'm working on a brand-new type of computer that operates using a different number system, one that doesn't use carry!  It's far more efficient at processing matrix operations than binary systems.  However, it doesn't address any of the issues of memory latency and high-speed I/O.  But darn, I sure would like to get 80 million in development funds, and keep the patents too!

EricOlsen
EricOlsen   2017-06-28 18:03:19

You can see some of the work at:  www.digitalsystemresearch.com

R_Colin_Johnson
R_Colin_Johnson   2017-06-28 22:20:54

> Eric: I'm working on a brand-new type of computer that operates using a different number system, one that doesn't use carry!...I sure would like to get 80 million in development funds, and keep the patents too!

All you have to do is search the BAAs and find one remotely related and send in your application. I don't know about $8 million, but $800,000 is entirely feasible.

EricOlsen
EricOlsen   2017-06-29 10:31:31

Hi Colin,

Thanks for the advice.  Yes, I did notice the BAAs, particularly in the area of new computing technology that your article is discussing.  It's very exciting, and I'm hoping that it allows more grass-roots original research to be funded.  We're a small company with no real recognition, so it's highly unlikely that we would be funded by DARPA, but who knows, even in casino gaming a .01% chance might be worth the effort!  One thing is clear: without the backing of IC companies capable of advancing ideas with modern process technology, our idea doesn't have a chance.  Thank you for your article,  Eric
