Google Says Its AI Supercomputer is Faster, Greener Than Nvidia A100 Chip - Slas...

 1 year ago
source link: https://hardware.slashdot.org/story/23/04/05/1848255/google-says-its-ai-supercomputer-is-faster-greener-than-nvidia-a100-chip

Google Says Its AI Supercomputer is Faster, Greener Than Nvidia A100 Chip

Alphabet's Google released new details about the supercomputers it uses to train its artificial intelligence models, saying the systems are both faster and more power-efficient than comparable systems from Nvidia. From a report: Google has designed its own custom chip called the Tensor Processing Unit, or TPU. It uses those chips for more than 90% of the company's work on artificial intelligence training, the process of feeding data through models to make them useful at tasks such as responding to queries with human-like text or generating images. The Google TPU is now in its fourth generation. Google on Tuesday published a scientific paper detailing how it has strung more than 4,000 of the chips together into a supercomputer using its own custom-developed optical switches to help connect individual machines.

Improving these connections has become a key point of competition among companies that build AI supercomputers because so-called large language models that power technologies like Google's Bard or OpenAI's ChatGPT have exploded in size, meaning they are far too large to store on a single chip. The models must instead be split across thousands of chips, which must then work together for weeks or more to train the model. Google's PaLM model - its largest publicly disclosed language model to date - was trained by splitting it across two of the 4,000-chip supercomputers over 50 days.
  • Who would have thought that a purpose-built chip would be more energy-efficient than a general-purpose one?

    • Re:

      You want it fast - code the algorithm in some sort of lower-level compiled language (e.g. C, Rust, etc.)
      You want it really fast - code the algorithm in the processor's assembler.
      You want it really really fast - hard-code the algorithm as a circuit.

      No different than adding h.265, AV1, AES, FFT, etc. to a chip.

      Pretty much every decent digital oscilloscope has a custom ASIC signal processor on the front end for this very reason.

    • Re:

      The A100 is also purpose-built. It doesn't have a video output.
  • Bard still sucks.
    • Re:

      It's an impressive bit of engineering compared to what I could write. But yeah, Google is behind the industry in some areas.

      • Re:

        Which makes absolutely no sense when the hardware that these early GPT models were running on came from Google.
  • Google said it did not compare its fourth-generation chip to Nvidia's current flagship H100 chip because the H100 came to the market after Google's chip and is made with newer technology.

    Google hinted that it might be working on a new TPU that would compete with the Nvidia H100 but provided no details, with Jouppi telling Reuters that Google has "a healthy pipeline of future chips."

    Besides, NVIDIA has the edge: it sells universal computing chips and it sells them to everyone. Google's TPU is used primarily (exclusively) for ML training and is not available to anyone but Google. What's the point of comparing then?

    • The real question is what's the point of Google talking about these things at all? If they're not going to sell the chips to third parties, then why disclose their performance? I guess they're trying to sell their AI platform and want to reassure us that it's computationally very powerful?

      It's kind of impressive that Google can develop its own chips for something (currently) as niche as this. Developing a large digital chip on a cutting-edge process node is expensive, even when just considering the fixed costs of wafer masks. The people that do the chip design aren't cheap either (I would know). They must be buying/building so many of these things that the tens of millions of dollars in fixed development costs are made up for by money not paid to nVidia.

      • They want to brag about GCE-exclusive features. They aggressively pursue competing with AWS and Azure, and part of their game plan is to declare how impossibly clever they are and that no one else is as clever, and that the only way to avail yourself of their cleverness is to buy Google services, because they will not actually let you purchase any of these wonders.

  • Given where processors are today, the focus should be on fast and green code. You should see the shit we run on today's processors, it is a fucking abomination. .NET... Java... the list goes on, fucking idiots, and it is not getting any better. When you train people to rely on frameworks and automatic garbage management, you end up with shit code, and it does not matter how powerful the processor you run it on is - it is still a fucking waste of space and energy. Stop being so fucking lazy, get off your node.js,
    • Re:

      Just make everyone spend a year working on a small embedded system. They'll never take infinite resources for granted again.

  • Google's TPU is irrelevant for most people doing ML training or research. You can't purchase a TPU in a machine that you can use yourself. I have a server on the rack which I use every day, and it has an NVIDIA A100 board in it. The only way to use Google's hardware is to pay for their "arm and leg" plan. We did multiple price evaluations and everyone concluded that it is too expensive to pay for a cloud TPU. If you are not Google and are doing this on a regular basis, it is by far cheaper to pay for a system in house. It will pay for itself within a year or two, plus it is CAPEX, which is depreciated over 5 years. Not so with cloud payments. They are ongoing, and are OPEX.
    • Re:

      The problem is if you want to use a large model. For example Bloom, which is about on par with GPT3 size-wise, takes 8x A100. They're $15K each, plus the computer they go in. If you utilize it heavily enough, eventually it will amortize out, but if you're only going to get say 5% utilization (a few people running inference on it sporadically) it probably will never beat renting time before it is obsolete.
      • Re:

        The A100 does not cost $15K - https://www.amazon.com/NVIDIA-... [amazon.com] . If you are purchasing it as part of a server it will be even cheaper. Large companies have discounts with outfits like Dell and can get GPUs for quite a bit less.
        You have probably never priced Google cloud for any workloads; it gets quite expensive. Good for startups trying to burn through investors' money, not so much for a company which is trying to make money.
        Probably OK for internet hosting when you need to scale across regions and provide
        • Re:

          That's the 40 gig. The problem with that is 80 x 8 is what you can fit on a chassis so it's kind of assumed for some models, and splitting over 40 x 16 hits the network which is very bad.

          Anyways I agree it's worth running the numbers on what you need... the costs are really significant on these big models, one way or another, and may not be necessary.
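
The "run the numbers" point in this sub-thread can be sketched as a rough break-even calculation. This is only an illustration: the $15K-per-A100 price and the 5% utilization come from the comments above, while the cloud hourly rate, the function names, and the decision to ignore the chassis, power, hosting, and staff costs are all assumptions made for the example.

```python
HOURS_PER_YEAR = 24 * 365


def breakeven_busy_hours(hardware_cost: float, cloud_rate_per_hour: float) -> float:
    """Hours of actual use after which buying beats renting (hardware cost only)."""
    return hardware_cost / cloud_rate_per_hour


def years_to_breakeven(hardware_cost: float, cloud_rate_per_hour: float,
                       utilization: float) -> float:
    """Calendar years to break even when the box is busy `utilization` of the time."""
    busy_hours_per_year = HOURS_PER_YEAR * utilization
    return breakeven_busy_hours(hardware_cost, cloud_rate_per_hour) / busy_hours_per_year


if __name__ == "__main__":
    gpu_cost = 8 * 15_000    # 8x A100 at the $15K figure quoted in the thread
    cloud_rate = 30.0        # HYPOTHETICAL $/hour for a rented 8-GPU node, not a real quote
    for utilization in (0.05, 0.50, 1.00):
        years = years_to_breakeven(gpu_cost, cloud_rate, utilization)
        print(f"utilization {utilization:.0%}: break-even after {years:.1f} years")
```

With these made-up inputs, heavy use pays the hardware off in well under a year, while 5% utilization pushes break-even out to roughly nine years, which is the "obsolete before it amortizes" scenario described above. Plugging in real quotes for both sides changes the answer, which is the whole point of running it.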

  • All they do is stunts to keep people on their services short-term so the ad revenue keeps flowing in. As soon as the numbers do not look profitable enough, Bard will get the axe.

  • Drunken Busker, I mean Bard, gets drunk more efficiently. Great, I still won't use it.

  • They selected Adobe's "Eco Green" color for their chip.

    Eco Green is a bright green color with a hexadecimal value of #8CC63F.

