Planet Linux Australia

Syndicate content
Planet Linux Australia -
Updated: 1 hour 15 min ago

sthbrx - a POWER technical blog: Joining the CAPI project

Tue, 2015-10-20 15:25

(I wrote this blog post a couple of months ago, but it’s still quite relevant.)

Hi, I’m Daniel! I work in OzLabs, part of IBM’s Australian Development Labs. Recently, I’ve been assigned to the CAPI project, and I’ve been given the opportunity to give you an idea of what this is, and what I’ll be up to in the future!

What even is CAPI?

To help you understand CAPI, think back to the time before computers. We had a variety of machines: machines to build things, to check things, to count things, but they were all specialised — good at one and only one thing.

Specialised machines, while great at their intended task, are really expensive to develop. Not only that, it’s often impossible to change how they operate, even in very small ways.

Computer processors, on the other hand, are generalists. They are cheap. They can do a lot of things. If you can break a task down into simple steps, it’s easy to get them to do it. The trade-off is that computer processors are incredibly inefficient at everything.

Now imagine, if you will, that a specialised machine is a highly trained and experienced professional, a computer processor is a hungover university student.

Over the years, we’ve tried lots of things to make student faster. Firstly, we gave the student lots of caffeine to make them go as fast as they can. That worked for a while, but you can only give someone so much caffeine before they become unreliable. Then we tried teaming the student up with another student, so they can do two things at once. That worked, so we added more and more students. Unfortunately, lots of tasks can only be done by one person at a time, and team-work is complicated to co-ordinate. We’ve also recently noticed that some tasks come up often, so we’ve given them some tools for those specific tasks. Sadly, the tools are only useful for those specific situations.

Sometimes, what you really need is a professional.

However, there are a few difficulties in getting a professional to work with uni students. They don’t speak the same way; they don’t think the same way, and they don’t work the same way. You need to teach the uni students how to work with the professional, and vice versa.

Previously, developing this interface – this connection between a generalist processor and a specialist machine – has been particularly difficult. The interface between processors and these specialised machines – known as accelerators – has also tended to suffer from bottlenecks and inefficiencies.

This is the problem CAPI solves. CAPI provides a simpler and more optimised way to interface specialised hardware accelerators with IBM’s most recent line of processors, POWER8. It’s a common ‘language’ that the processor and the accelerator talk, that makes it much easier to build the hardware side and easier to program the software side. In our Canberra lab, we’re working primarily on the operating system side of this. We are working with some external companies who are building CAPI devices and the optimised software products which use them.

From a technical point of view, CAPI provides coherent access to system memory and processor caches, eliminating a major bottleneck in using external devices as accelerators. This is illustrated really well by the following graphic from an IBM promotional video. In the non-CAPI case, you can see there’s a lot of data (the little boxes) stalled in the PCIe subsystem, whereas with CAPI, the accelerator has direct access to the memory subsystem, which makes everything go faster.

Uses of CAPI

CAPI technology is already powering a few really cool products.

Firstly, we have an implementation of Redis that sits on top of flash storage connected over CAPI. Or, to take out the buzzwords, CAPI lets us do really, really fast NoSQL databases. There’s a video online giving more details.

Secondly, our partner Mellanox is using CAPI to make network cards that run at speeds of up to 100Gb/s.

CAPI is also part of IBM’s OpenPOWER initiative, where we’re trying to grow a community of companies around our POWER system designs. So in many ways, CAPI is both a really cool technology, and a brand new ecosystem that we’re growing here in the Canberra labs. It’s very cool to be a part of!

sthbrx - a POWER technical blog: OpenPOWER Powers Forward

Tue, 2015-10-20 15:25

I wrote this blog post late last year, it is very relevant for this blog though so I’ll repost it here.

With the launch of TYAN’s OpenPOWER reference system now is a good time to reflect on the team responsible for so much of the research, design and development behind this very first ground breaking step of OpenPOWER with their start to finish involvement of this new Power platform.

ADL Canberra have been integral to the success of this launch providing the Open Power Abstraction Layer (OPAL) firmware. OPAL breathes new life into Linux on Power finally allowing Linux to run on directly on the hardware. While OPAL harnesses the hardware, ADL Canberra significantly improved Linux to sit on top and take direct control of IBMs new Power8 processor without needing to negotiate with a hypervisor. With all the Linux expertise present at ADL Canberra it’s no wonder that a Linux based bootloader was developed to make this system work. Petitboot leverage’s all the resources of the Linux kernel to create a light, fast and yet extremely versatile bootloader. Petitboot provides a massive amount of tools for debugging and system configuration without the need to load an operating system.

TYAN have developed great and highly customisable hardware. ADL Canberra have been there since day 1 performing vital platform enablement (bringup) of this new hardware. ADL Canberra have put all the work into the entire software stack, low level work to get OPAL and Linux to talk to the new BMC chip as well as the higher level, enabling to run Linux in either endian and Linux is even now capable of virtualising KVM guests in either endian irrespective of host endian. Furthermore a subset of ADL Canberra have been key to getting the Coherent Accelerator Processor Interface (CAPI) off the ground, enabling more almost endless customisation and greater diversity within the OpenPOWER ecosystem.

ADL Canberra is the home for Linux on Power and the beginning of the OpenPOWER hardware sees much of the hard work by ADL Canberra come to fruition.

OpenSTEM: How to Help Your Child Become a Maker | MakerKids

Tue, 2015-10-20 13:30!How-To-Help-Your-Child-Become-a-Maker/dcsov/562094010cf2c3a4a7109d92

Let’s say your child is currently a classic consumer – they love watching TV, reading books, but they don’t really enjoy making things themselves. Or maybe they are making some things but it’s not really technological. We think any kind of making is awesome, but one of our favourite kinds is the kind where kids realize that they can build and influence the world around them. There’s an awesome Steve Jobs quote that I love, which says:

“When you grow up you tend to get told that the world is the way it is and you’re life is just to live your life inside the world. Try not to bash into the walls too much. Try to have a nice family life, have fun, save a little money.

That’s a very limited life. Life can be much broader once you discover one simple fact: Everything around you that you call life was made up by people that were no smarter than you. And you can change it, you can influence it…

Once you learn that, you’ll never be the same again.”

Imagine if you can figure this out as a child.

Rusty Russell: ccan/mem’s memeqzero iteration

Tue, 2015-10-20 11:28

On Thursday I was writing some code, and I wanted to test if an array was all zero.  First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.

We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:

rusty: dwg: now, how much time do I waste optimizing?

dwg: rusty, in the first commit, none

Exactly five minutes later I had it implemented and tested.

The Naive Approach: Times: 1/7/310/37064 Bytes: 50 bool memeqzero(const void *data, size_t length) { const unsigned char *p = data; while (length) { if (*p) return false; p++; length--; } return true; }

As a summary, I’ve give the nanoseconds for searching through 1,8,512 and 65536 bytes only.

Another 20 minutes, and I had written that benchmark, and an optimized version.

128-byte Static Buffer: Times: 6/8/48/5872 Bytes: 108

Here’s my first attempt at optimization; using a static array of 128 bytes of zeroes and assuming memcmp is well-optimized for fixed-length comparisons.  Worse for small sizes, much better for big.

const unsigned char *p = data; static unsigned long zeroes[16]; while (length > sizeof(zeroes)) { if (memcmp(zeroes, p, sizeof(zeroes))) return false; p += sizeof(zeroes); length -= sizeof(zeroes); } return memcmp(zeroes, p, length) == 0; Using a 64-bit Constant: Times: 12/12/84/6418 Bytes: 169

dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn’t necessarily a win

Using a single zero uint64_t for comparison is pretty messy:

bool memeqzero(const void *data, size_t length) {     const unsigned char *p = data;     const unsigned long zero = 0;     size_t pre;     pre = (size_t)p % sizeof(unsigned long);     if (pre) {         size_t n = sizeof(unsigned long) - pre;         if (n > length)             n = length;         if (memcmp(p, &zero, n) != 0)             return false;         p += n;         length -= n;     }     while (length > sizeof(zero)) {         if (*(unsigned long *)p != zero)             return false;         p += sizeof(zero);         length -= sizeof(zero);     }     return memcmp(&zero, p, length) == 0; }

And, worse in every way!

Using a 64-bit Constant With Open-coded Ends: Times: 4/9/68/6444 Bytes: 165

dwg: rusty, what colour is the bikeshed if you have an explicit char * loop for the pre and post?

That’s slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.

Epiphany #1: We Already Have Zeroes: Times 3/5/92/5801 Bytes: 422

Then I realized that we don’t need a static buffer: we know everything we’ve already tested is zero!  So I open coded the first 16 byte compare, then memcmp()ed against the previous bytes, doubling each time.  Then a final memcmp for the tail.  Clever huh?

But it no faster than the static buffer case on the high end, and much bigger.

dwg: rusty, that is brilliant. but being brilliant isn’t enough to make things work, necessarily :p

Epiphany #2: memcmp can overlap: Times 3/5/37/2823 Bytes: 307

My doubling logic above was because my brain wasn’t completely in phase: unlike memcpy, memcmp arguments can happily overlap!  It’s still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 it’s worth memcmping with the previous 16 bytes.  This is as fast as naive with as little as 2 bytes, and the fastest solution by far with larger numbers:

const unsigned char *p = data; size_t len; /* Check first 16 bytes manually */ for (len = 0; len < 16; len++) { if (!length) return true; if (*p) return false; p++; length--; } /* Now we know that's zero, memcmp with self. */ return memcmp(data, p, length) == 0;

You can find the final code in CCAN (or on Github) including the benchmark code.

Finally, after about 4 hours of random yak shaving, it turns out lightning doesn’t even want to use memeqzero() any more!  Hopefully someone else will benefit.

Stewart Smith: TianoCore (UEFI) ported to OpenPower

Tue, 2015-10-20 11:26

Recently, there’s been (actually two) ports of TianoCore (the reference implementation of UEFI firmware) to run on POWER on top of OPAL (provided by skiboot) – and it can be run in the Qemu PowerNV model.

More details: