Planet Linux Australia

Planet Linux Australia - http://planet.linux.org.au

David Rowe: Simple Keras “Hello World” Example – Mean Removal

Thu, 2018-08-30 14:04

Inspired by the Wavenet work with Codec 2 I’m dipping my toe into the world of Deep Learning (DL) using Keras. I’ve read Deep Learning with Python (quite an enjoyable read) and set up a Linux box with a GTX graphics card that is making my teenage sons weep with jealousy.

So after a couple of days of messing about here is my first “hello world” Keras example: mean_removal.py. It might be helpful for other Keras noobs. Assuming you have all the packages installed, it runs with either Python 2:

$ python mean_removal.py

Or Python 3:

$ python3 mean_removal.py

It removes the mean from vectors, using just a single layer regression model. The script runs OK on a regular PC without a chunky graphics card.

So I start by generating vectors from random numbers with a zero mean. I then add a random offset to each sample in the vector. Here are 5 vectors with random offsets added to them:

The Keras script is then trained to estimate and remove the offsets, so the output vectors look like:

Estimating the offset is the same as finding the “mean” of the vector. Yes I know we can do that with a “mean” function, but where’s the fun in that!
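For the curious, here is a minimal sketch of the sort of single layer regression model involved. The vector length, offset range and learning rate are my guesses for illustration; the real mean_removal.py may differ:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

# Assumed sizes: 1000 training vectors, 16 samples per vector
nb_vecs, N = 1000, 16
x = np.random.randn(nb_vecs, N)                      # zero mean random vectors
offsets = np.random.uniform(-10, 10, (nb_vecs, 1))   # random offset added to each vector
train_in, train_out = x + offsets, x                 # learn to remove the offset

model = Sequential()
model.add(Dense(N, input_dim=N))                     # single linear layer, no activation
model.compile(loss='mse', optimizer=optimizers.SGD(lr=0.01))  # small learning rate to avoid NaN losses
model.fit(train_in, train_out, epochs=10, batch_size=32, validation_split=0.2)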

Here are some other plots that show the training and validation measures, and error metrics at the output:



The last two plots show pretty much all of the offset is removed and it restores the original (non offset) vectors with just a tiny bit of noise. I had to wind back the learning rate to get it to converge without “NAN” losses, possibly as I’m using fairly big input numbers. I’m familiar with the idea of learning rate from NLMS adaptive filters, such as those used for my work in echo cancellation.

Deep Learning for Codec 2

My initial ambitions for DL are somewhat more modest than the sample-by-sample synthesis used in the Wavenet work. I have some problems with Vector Quantisation (VQ) in the low rate Codec 2 modes. The VQ is used to compactly describe the speech spectrum, which carries the intelligibility of the signal.

The VQ gets upset with different microphones, speakers, or minor spectral shaping like gentle high pass/low pass filtering. This shaping often causes a poor vector to be chosen, which results in crappy speech quality. The current VQ error measure can’t tell the difference between spectral features that matter and those that don’t.

So I’d like to try DL to address those issues, and train a system to say “look, this speech and this speech are actually the same. Yes I know one of them has a 3dB/octave low pass filter, please ignore that”.

As emphasised in the text book above, some feature extraction can help with DL problems. For my first pass I’ll be working on parameters extracted by the Codec 2 model (like a compact version of the speech spectrum) rather than speech samples like Wavenet. This will reduce my CPU load significantly, at the expense of speech quality, which will be limited by the unquantised Codec 2 model. But that’s OK as a first step. A notch or two up on Codec 2 at 700 bit/s would be very useful, especially if it can run on a CPU from the first two decades of the 21st century.

Mean Removal on Speech Vectors

So to get started with Keras I chose mean removal. The mean level or constant offset is like the volume or energy in a speech signal, it’s the simplest form of spectral shaping I could imagine. I trained and tested it with vectors of random numbers, using numbers in the range of the speech spectral samples that Codec 2 plays with.

It’s a bit like an equaliser, vectors with arbitrary spectral shaping go in, “flat” unshaped vectors come out. They can then be sent to a Vector Quantiser. There are probably smarter ways to do this, but I need to start somewhere.

So as a next step I tried offset removal with vectors that represent the spectrum of a 40ms speech frame:


This is pretty cool – the network was trained on random numbers but works well with real speech frames. You can also see the spectral slope I mentioned above, the speech energy gradually falls off at high frequencies. This doesn’t affect the intelligibility of the speech but tends to upset traditional Vector Quantisers. Especially mine.

Now that I have something super-basic working, the next step is to train and test networks to deal with some non-trivial spectral shaping.

Reading Further

Deep Learning with Python
WaveNet and Codec 2
Codec 2 700C, the current Codec 2 700 bit/s mode. With better VQ we can improve on this.
Codec 2 at 450 bit/s, some fine work from Thomas and Stefan, that uses a form of machine learning to synthesise 16 kHz speech from 8 kHz inputs.
FreeDV 700D, the recently released FreeDV mode that uses Codec 2 700C. A FreeDV Mode also includes a modem, FEC, protocol.
RNNoise: Learning Noise Suppression, Jean-Marc’s DL network for noise reduction. Thanks Jean-Marc for the brainstorming emails!

Michael Still: What’s missing from the ONAP community — an open design process

Thu, 2018-08-30 14:00

I’ve been thinking a fair bit about ONAP and its future releases recently. This is in the context of trying to implement a system for a client which is based on ONAP. It’s really hard though, because it’s difficult to determine how various components of ONAP are intended to work, or interoperate.

It took me a while, but I’ve realised what’s missing here…

OpenStack has an open design process. If you want to add a new feature to Nova for example, the first step is to write down what the feature is intended to do, how it integrates with the rest of Nova, and how people might use it. The target audience for that document is both the Nova development team and people who operate OpenStack deployments.

ONAP has no equivalent that I can find. So for example, they say that in Casablanca they are going to implement an “AAI Enricher” to ease lookup of data from external systems in their inventory database, but I can’t find anywhere where they explain how the integration between arbitrary external systems and ONAP AAI will work.

I think ONAP would really benefit from a good hard look at their design processes and how approachable they are for people outside their development teams. The current use case proposal process (videos, conference talks, and powerpoint presentations) just isn’t great for people who are trying to figure out how to deploy their software.

Linux Users of Victoria (LUV) Announce: Software Freedom Day 2018 and LUV AGM

Wed, 2018-08-29 20:03
Start: Sep 15 2018 13:00
End: Sep 15 2018 17:00
Location: Electron Workshop, 31 Arden Street North Melbourne 3051
Link: https://www.openstreetmap.org/node/2556615434

It's time once again to get excited about all the benefits that Free and Open Source Software have given us over the past year and get together to talk about how Freedom and Openness can improve our human rights, our privacy, our security and our communities. It's Software Freedom Day!

Linux Users of Victoria is a subcommittee of Linux Australia.


Linux Users of Victoria (LUV) Announce: LUV September 2018 Main Meeting: New Developments in Supercomputing

Wed, 2018-08-29 20:03
Start: Sep 4 2018 18:30
End: Sep 4 2018 20:30
Location: Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053
Link: http://www.melbourne.vic.gov.au/community/hubs-bookable-spaces/kathleen-syme-lib...

PLEASE NOTE RETURN TO ORIGINAL START TIME

6:30 PM to 8:30 PM Tuesday, September 4, 2018
Training Room, Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

Speakers:

Many of us like to go for dinner nearby after the meeting, typically at Trotters Bistro in Lygon St.  Please let us know if you'd like to join us!

Linux Users of Victoria is a subcommittee of Linux Australia.


David Rowe: Band Pass Filter and Power Amplifier for Simple HF Data

Wed, 2018-08-29 12:05

Is it possible to move data over HF radio using very simple, low cost hardware and clever SDR software? In the last few posts (here and here) I’ve been constructing and testing building blocks for a simple HF data terminal. This post describes a few more, a 3-8 MHz Band Pass Filter (BPF) and 1W Power Amplifier (PA).

Band Pass Filter

The RTL-SDR samples at 28.8 MHz, capturing a broad chunk of spectrum. In direct mode we just sample the Q-channel, so any energy above 14.4 MHz will be aliased into our passband; e.g. both 21 and 7 MHz will appear as a 7 MHz sampled signal.

In the previous post we determined the ADC overloads at -30dBm, so we want to remove any strong signals above or near that level. One source of strong signals is broadcast band AM radio between 500 and 1600 kHz.

The use case is “100 mile” data links so I’d like the receiver to work on the 80M (3.5 MHz) as well as 40M (7.1 MHz) bands, which sets the BPF passband at 3 to 8 MHz. I hooked up my spec-an to a 40M antenna and could see AM broadcast signals peaking at -40dBm, so I set a BPF specification of > 20dB attenuation at 1.5 MHz to keep the sum of all those signals well away from the -30dBm limit. At the high frequency end I specified > 30dB attenuation at 21 MHz, to reduce any energy aliased down to 7 MHz.

I designed a cascaded High Pass/Low Pass filter using some tables from my ancient (but still excellent) copy of “RF Circuit Design”, by Chris Bowick. The Octave rtl_sdr script does the calculations for me. A spreadsheet would work well too.

I simulated the BPF using LTSpice, fixed a few bugs, and tweaked it for real world component values. Here is the circuit and frequency response on log and linear scales:



I soldered up the BPF Manhattan style using commercial axial 1uH inductors and ceramic capacitors, then tested it using the spec-an and tracking generator (note linear scale):

The table at the bottom shows the measured attenuation at some important frequencies. The attenuation is a bit low at 21 MHz, perhaps due to the finite Q of the real world inductors. Quite a good match to the LTSpice simulation and close enough for my experiments. The little step at around 10 MHz is a tracking generator artefact.

The next plot shows the effect of the BPF when my spec-an is connected to my 40M dipole (0 to 10MHz span). Yellow is the received signal without the filter, purple with the filter.

The big spike around 0 Hz is an artefact on the spec-an. The filter is doing a good job of nailing the AM broadcast band energy. You can see a peak around 7.4 MHz where the dipole is resonant. Actually this is a bit of a surprise to me, as I want it resonant around 7.2MHz, I better check that out! At 7.2-ish the insertion loss (difference between the purple and yellow) is a few dB as per the tracking generator plot above. It’s more like 6dB at 7.4 MHz (the dipole peak), not quite sure why. An insertion loss of 3dB at 7.2 MHz is OK for my application.

Power Amplifier

A few weeks ago I hooked the rpitx to my 40M dipole and managed to demodulate the 11mW signal a few km away (over an urban channel) using a mag loop and my FT-817. I decided to build a small 1W PA to make the system usable over “100 mile” HF channels. The actual power is not that critical, as we can trade power off against bit rate. For example if a given HF channel supports 100 bit/s at 1W, we then know we can get 1000 bit/s at 10W.

Even low bit rates can be very useful if you have no other communication. A text message or Tweet, allowing for some overhead, averages about 1000 bits. So at 1000 bit/s you can send 1 txt per second, 3600 an hour, or 86,000/day. That’s very useful communication if you are in a disaster situation and want to tell family you are alive. Or perhaps live in a remote area with no other communication. Of course HF channels come and go, so the actual throughput will be less than that.
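As a quick back-of-the-envelope check of those numbers in Python (the 1000 bits per message is the rough figure from above):

bits_per_msg = 1000                         # rough size of a txt or Tweet plus overhead
bit_rate = 1000                             # bit/s, e.g. 10W on a channel that gives 100 bit/s per 1W
msgs_per_sec = bit_rate / bits_per_msg      # 1.0
msgs_per_day = msgs_per_sec * 24 * 60 * 60  # 86400, roughly the 86,000/day quoted above
print(msgs_per_sec, msgs_per_day)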

I explored the junk box and found a partially constructed Beach 40. I isolated the driver and PA stage and poked it with my signal generator. Turns out it had a bit too much gain (the rpitx has plenty of drive) so I ended up with this simple PA circuit:



The only spurious output I can see is the 2nd harmonic at -44 dBC, meeting ACMA specs:

The low pass filter at the output has a 3dB point at about 10 MHz which is a little high. It could be brought down a little to increase stop-band attenuation and reduce the 2nd harmonic further. I haven’t done anything about impedance matching the input, as it hits 1W (30dBm) output with 14dBm drive from the rpitx. The 1 inch square heatsink is quite warm after 10 minutes but I can still hold it. It’s not very efficient, 2.9W DC input power for 1W out, however 16dB power gain is quite good for a PA. Anyhoo, it’s a fine starting point for my experiments, we can optimise the PA later if necessary.
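The gain and efficiency numbers above fall straight out of a couple of lines of arithmetic:

drive_dbm, out_dbm = 14.0, 30.0       # rpitx drive in, PA output (1W)
power_gain_db = out_dbm - drive_dbm   # 16 dB
dc_in_w, rf_out_w = 2.9, 1.0
efficiency = rf_out_w / dc_in_w       # ~0.34, i.e. about 34% efficient
print(power_gain_db, round(efficiency * 100))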

Next Steps

OK, so I have most of the building blocks I need for some over the air HF data experiments. There was a bit of engineering involved in building the BPF and PA, but the designs are very simple and can be constructed for a few $ or even from road kill (recycled) components. We now have a very low cost HF data radio, running high performance modems, connected to a Linux computer and Wifi.

Next I will put some software together to estimate data throughput, set the system up with real antennas, and gather some experimental results over real world HF channels.

Reading Further

Rpitx and 2FSK, first part in this series.
Testing a RTL-SDR with FSK on HF, second part in this series.
rtl_sdr.m script that calculates component values for the BPF.

Gary Pendergast: Forking is a Feature

Sun, 2018-08-26 16:04

There’s a new WordPress fork called ClassicPress that’s been making some waves recently, with various members of the Twitterati swinging between decrying it as an attempt to fracture the WordPress community, dismissing it as an unnecessary over-reaction, and declaring it a death knell for WordPress.

Personally, I don’t think it’s any of the above.

Some years ago, Anil Dash wrote an article on this topic (which I totally ripped forked the name from), you should read it for some context.

Forking is a Feature

While Linus Torvalds is best known as the creator of Linux, it’s one of his more geeky creations, and the social implications of its design, that may well end up being his greatest legacy. Because Linus has, in just a few short years, changed the social dynamic around forking,

Anil Dash

With that context, I genuinely applaud ClassicPress for exercising their fundamental rights under the GPL. The WordPress Bill of Rights makes it quite clear that forking is not just allowed, it’s encouraged. You can and should fork WordPress if you choose to. This isn’t a flaw in the system, this is how it’s supposed to work.

Forks should always be encouraged.

Forks are a fundamentally healthy aspect of Open Source software. A relatively recent example is the io.js fork of Node.js, which resulted in significant changes to how the Node.js project is governed and developed. WordPress has seen forks in the past, too: Lyceum was a fork that added multi-site support, before it existed in WordPress. WordPress MU was something of a sibling fork which also added multi-site support, and was ultimately merged back into WordPress.

There are examples of forks that went on to become independent projects: WordPress itself is a fork of cafelog/b2. X.org is a fork of XFree86. LibreOffice is a fork of OpenOffice. Blink is a fork of WebKit, which in turn is a fork of KHTML. MariaDB is a fork of MySQL. XBMC has been forked dozens of times. Joomla is a fork of Mambo. (Fun historical coincidence: I very nearly accepted a job offer from Miro, the company behind Mambo, just a couple of months before Joomla came into being!)

Maintaining a fork is hard, thankless work.

All of these independent forks have a common thread: they started with a group of people who were highly experienced in building the software they were forking (often comprising core developers of the original software). That’s not to say that non-core developers can’t drive a fork, but it does seem to require fairly fundamental knowledge of the strengths and weaknesses of the software, in order to successfully fork it into an independent product.

From a practical perspective, I can tell you that maintaining a fork of WordPress would require an extraordinary amount of work. For example, WordPress.com effectively maintains a fork (which happens to almost exactly match the Core codebase) of WordPress. The task of maintaining this fork falls to a talented team of devops folks, who review and merge each patch.

Now, WordPress.com is really only an internal fork. To maintain a product fork of WordPress would require so much more effort. You’d need to maintain the web infrastructure to push out updates. As the fork diverges from WordPress Core, you would need to figure out how to maintain plugin and theme compatibility. You’d likely need to do your own bug and security fixes, on top of what’s merged from WordPress.

I’m not saying this to dissuade anyone from forking WordPress, rather, it’s important to go into this aware of the challenges that lie ahead. For anyone who uses a fork (whether it be a fork of WordPress, or any other software product), I’m sure the maintainer would appreciate a word of thanks for the work they’ve done to make it possible.

Dave Hall: AWS Parameter Store

Sun, 2018-08-26 02:03

Anyone with a moderate level of AWS experience will have learned that Amazon offers more than one way of doing something. Storing secrets is no exception. 

It is possible to spin up Hashicorp Vault on AWS using an official Amazon quick start guide. The down side of this approach is that you have to maintain it.

If you want an "AWS native" approach, you have 2 services to choose from. As the name suggests, Secrets Manager provides some secrets management tools on top of the store. This includes automagic rotation of AWS RDS credentials on a regular schedule. For the first 30 days the service is free, then you start paying per secret per month, plus API calls.

There is a free option, Amazon's Systems Manager Parameter Store. This is what I'll be covering today.

Structure

It is easy when you first start out to store all your secrets at the top level. After a while you will regret this decision. 

Parameter Store supports hierarchies. I recommend using them from day one. Today I generally use /[appname]-[env]/[KEY]. After some time with this scheme I am finding that /[appname]/[env]/[KEY] feels like it will be easier to manage. IAM permissions support paths and wildcards, so either scheme will work.

If you need to migrate your secrets, use the Parameter Store namespace migration script.

Access Controls

Like most Amazon services, access to Parameter Store is controlled by IAM.

Parameter Store allows you to store your values as plain text or encrypted with a KMS key. For encrypted values the user must have grants on both the Parameter Store value and the KMS key. For consistency I recommend encrypting all your parameters.

If you have a monolith, a key per application per environment is likely to work well. If you have a collection of microservices, having a key per service per environment becomes difficult to manage. In this case share a key between several services in the same environment.

Here is an IAM policy for a Lambda function to access a hierarchy of values in Parameter Store:

{   "Version":"2012-10-17",   "Statement":[     {       "Sid":"ReadParams",       "Effect":"Allow",       "Action":[         "ssm:GetParametersByPath"       ],       "Resource":"arn:aws:ssm:us-east-1:1234567890:parameter/my-app/dev/*"     },     {       "Sid":"Decrypt",       "Effect":"Allow",       "Action":[         "kms:Decrypt"       ],       "Resource":"arn:aws:kms:us-east-1:1234567890:key/20180823-7311-4ced-bad5-653587846973"     }   ] }

To allow your developers to manage the parameters in dev you will need a policy that looks like this:

{   "Version":"2012-10-17",   "Statement":[     {       "Sid":"ManageParams",       "Effect":"Allow",       "Action":[         "ssm:DeleteParameter",         "ssm:DeleteParameters",         "ssm:GetParameter",         "ssm:GetParameterHistory",         "ssm:GetParametersByPath",         "ssm:GetParameters",         "ssm:PutParameter"       ],       "Resource":"arn:aws:ssm:us-east-1:1234567890:parameter/my-app/dev/*"     },     {       "Sid":"ListParams",       "Effect":"Allow",       "Action":"ssm:DescribeParameters",       "Resource":"*"     },     {       "Sid":"DecryptEncrypt",       "Effect":"Allow",       "Action":[         "kms:Decrypt",         "kms:Encrypt"       ],       "Resource":"arn:aws:kms:us-east-1:1234567890:key/20180823-7311-4ced-bad5-653587846973"     }   ] }

Amazon has great documentation on controlling access to Parameter Store and KMS.

Adding Parameters

Amazon allows you to store almost any string up to 4KB in length in Parameter Store. This gives you a lot of flexibility.

Parameter Store supports deep hierarchies. You will find this becomes annoying to manage. Use hierarchies to group your values by application and environment. Within the hierarchy use a flat structure. I recommend using lower case letters with dashes between words for your paths. For the parameter keys use upper case letters with underscores. This makes it easy to differentiate the two when searching for parameters.

Parameter Store encodes everything as strings. There may be cases where you want to store an integer as an integer, or a more complex data structure. You could use a naming convention to differentiate your different types. I found it easiest to encode everything as JSON. When pulling values from the store I JSON decode them. The down side is that strings must be wrapped in double quotes. This is offset by the flexibility of being able to encode objects and use numbers.
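A minimal sketch of that JSON round trip, with hypothetical values:

import json

raw = {"MY_STRING": "value", "MY_INT": 1234, "MY_OBJ": {"name": "value"}}
# Encode before writing -- note plain strings end up wrapped in double quotes.
encoded = {key: json.dumps(value) for key, value in raw.items()}
# Decode after reading back, recovering the original types.
decoded = {key: json.loads(value) for key, value in encoded.items()}
assert decoded == raw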

It is possible to add parameters to the store using 3 different methods. I generally find the AWS web console easiest when adding a small number of entries. Rather than walking you through this, Amazon have good documentation on adding values. Remember to always use "secure string" to encrypt your values.

Adding parameters via boto3 is straightforward. Once again it is well documented by Amazon.

Finally you can maintain parameters with a little bit of code. In this example I do it with Python.

import boto3

namespace = "my-app"
env = "dev"
kms_uuid = "20180823-7311-4ced-bad5-653587846973"

# Objects must be json encoded then wrapped in quotes because they're stored as strings.
parameters = {"key": '"value"', "MY_INT": 1234, "MY_OBJ": '{"name": "value"}'}

ssm = boto3.client("ssm")
for parameter in parameters:
    ssm.put_parameter(
        Name=f"/{namespace}/{env}/{parameter.upper()}",
        # Everything must go in as a string.
        Value=str(parameters[parameter]),
        Type="SecureString",
        KeyId=kms_uuid,
        # Use with caution.
        Overwrite=True,
    )

Using Parameters

I have used Parameter Store from Python and the command line. It is easier to use it from Python.

My example assumes that it is a Lambda function running with the policy from earlier. The function is called my-app-dev. This is what my code looks like:

import json

import boto3


def load_params(namespace: str, env: str) -> dict:
    """Load parameters from SSM Parameter Store.

    :namespace: The application namespace.
    :env: The current application environment.
    :return: The config loaded from Parameter Store.
    """
    config = {}
    path = f"/{namespace}/{env}/"
    ssm = boto3.client("ssm", region_name="us-east-1")
    more = None
    args = {"Path": path, "Recursive": True, "WithDecryption": True}
    while more is not False:
        if more:
            args["NextToken"] = more
        params = ssm.get_parameters_by_path(**args)
        for param in params["Parameters"]:
            key = param["Name"].split("/")[3]
            config[key] = json.loads(param["Value"])
        more = params.get("NextToken", False)
    return config

If you want to avoid loading your config each time your Lambda function is called you can store the results in a global variable. This leverages Amazon's feature that doesn't clear global variables between function invocations. The catch is that your function won't pick up parameter changes without a code deployment. Another option is to put in place logic for periodic purging of the cache.
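Here's one possible sketch of that caching pattern, wrapping the load_params() function above (the TTL is an arbitrary choice):

import time

_CACHE = {"config": None, "loaded_at": 0.0}
CACHE_TTL = 300  # seconds before the config is refreshed from Parameter Store


def get_config(namespace: str, env: str) -> dict:
    """Return the cached config, reloading it after CACHE_TTL seconds."""
    now = time.time()
    if _CACHE["config"] is None or now - _CACHE["loaded_at"] > CACHE_TTL:
        _CACHE["config"] = load_params(namespace, env)
        _CACHE["loaded_at"] = now
    return _CACHE["config"]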

On the command line things are a little harder to manage if you have more than 10 parameters. To export a small number of entries as environment variables, you can use this one liner:

$(aws ssm get-parameters-by-path --with-decryption --path /my-app/dev/ | jq -r '.Parameters[] | "export " + (.Name | split("/")[3] | ascii_upcase | gsub("-"; "_")) + "=" + .Value + ";"')

Make sure you have jq installed and the AWS cli installed and configured.

Conclusion

Amazon's Systems Manager Parameter Store provides a secure way of storing and managing secrets for your AWS based apps. Unlike Hashicorp Vault, Amazon manages everything for you. If you don't need the more advanced features of Secrets Manager you don't have to pay for them. For most users Parameter Store will be adequate.

Michael Still: Learning from the mistakes that even big projects make

Fri, 2018-08-24 16:00

The following is a blog post version of a talk presented at pyconau 2018. Slides for the presentation can be found here (as Microsoft powerpoint, or as PDF), and a video of the talk (thanks NextDayVideo!) is below:

 

OpenStack is an orchestration system for setting up virtual machines and other associated virtual resources, such as networks and storage, on clusters of computers. At a high level, OpenStack is just configuring existing facilities of the host operating system — there isn’t really a lot of difference between OpenStack and a room full of system admins frantically resolving tickets requesting virtual machines be set up. The only real difference is scale and predictability.

To do its job, OpenStack needs to be able to manipulate parts of the operating system which are normally reserved for administrative users. This talk is the story of how OpenStack has done that thing over time, what we learnt along the way, and what I’d do differently if I had my time again. Lots of systems need to do these things, so even if you never use OpenStack hopefully there are things to be learnt here.

That said, someone I respect suggested last weekend that good conference talks are actionable. A talk full of OpenStack war stories isn’t actionable, so I’ve spent the last week re-writing this talk to hopefully be more of a call to action than just an interesting story. I apologise for any mismatch between the original proposal and what I present here that might therefore exist.

Back to the task in hand though — providing control of virtual resources to untrusted users. OpenStack has gone through several iterations of how it thinks this should be done, so perhaps it’s illustrative to start by asking how other similar systems achieve this. There are lots of systems that have a requirement to configure privileged parts of the host operating system. The most obvious example I can think of is Docker. How does Docker do this? Well… it’s actually not all that pretty. Docker presents its API over a unix domain socket by default in order to limit control to local users (you can of course configure this). So to provide access to Docker, you add users to the docker group, which owns that domain socket. The Docker documentation warns that “the docker group grants privileges equivalent to the root user“. So that went well.

Docker is really an example of the simplest way of solving this problem — by not solving it at all. That works well enough for systems where you can tightly control the users who need access to those privileged operations — in Docker’s case by making them have an account in the right group on the system and logging in locally. However, OpenStack’s whole point is to let untrusted remote users create virtual machines, so we’re going to have to do better than that.

The next level up is to do something with sudo. The way we all use sudo day to day, you allow users in the sudoers group to become root and execute any old command, with a configuration entry that probably looks a little like this:

# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL

Now that config entry is basically line noise, but it says “allow members of the group called sudo, on any host, to run any command as root”. You can of course embed this into your python code using subprocess.call() or similar. On the security front, it’s possible to do a little bit better than a “nova can execute anything” entry. For example:

%sudo ALL=/bin/ls

This says that the sudo group on all hosts can execute /bin/ls with any arguments. OpenStack never actually specified the complete list of commands it executed. That was left as a job for packagers, which of course meant it wasn’t done well.

So there’s our first actionable thing — if you assume that someone else (packagers, the ops team, whoever) is going to analyse your code well enough to solve the security problem that you can’t be bothered solving, then you have a problem. Now, we weren’t necessarily deliberately punting here. It’s obvious to me how to grep the code for commands run as root to add them to a sudo configuration file, but that’s unfair. I wrote some of this code, I am much closer to it than a system admin who just wants to get the thing deployed.

We can of course do better than just raw sudo. Next we tried a thing called rootwrap, which was mostly an attempt to provide a better boundary around exactly what commands you can expect an OpenStack binary to execute. So for example, maybe it’s ok for me to read the contents of a configuration file specific to a virtual machine I am managing, but I probably shouldn’t be able to read /etc/shadow or whatever. We can do that by doing something like this:

sudo nova-rootwrap /etc/nova/rootwrap.conf /bin/ls /etc

Where nova-rootwrap is a program which takes a configuration file and a command line to run. The contents of the configuration file are used to determine if the command line should be executed.

Now we can limit the sudo configuration file to only needing to be able to execute nova-rootwrap. I thought about putting in a whole bunch of slides about exactly how to configure rootwrap, but then I realised that this talk is only 25 minutes and you can totally google that stuff.

So instead, here’s my second actionable thing… Is there a trivial change you can make which will dramatically improve security? I don’t think anyone would claim that rootwrap is rocket science, but it improved things a lot — deployers didn’t need to grep out the command lines we executed any more, and we could do things like specify what paths we were allowed to do things in. Are there similarly trivial changes that you can make to improve your world?

But wait! Here’s my third actionable thing as well — what are the costs of your design? Some of these are obvious — for example with this design executing something with escalated permissions causes us to pay to fork a process. In fact it’s worse with rootwrap, because we pay to fork, start a python interpreter to parse a configuration file, and then fork again for the actual binary we wanted in the first place. That cost adds up if you need to execute many small commands, for example when plugging in a new virtual network interface. At one point we measured this for network interfaces and the costs were in the tens of seconds per interface.

There is another cost though which I think is actually more important. The only way we have with this mechanism to do something with escalated permissions is to execute it as a separate process. This is a horrible interface and forces us to do some really weird things. Let’s checkout some examples…

Which of the following commands are reasonable?

shred -n3 -sSIZE PATH
touch PATH
rm -rf PATH
mkdir -p PATH

These are just some examples, there are many others. The first is probably the most reasonable. It doesn’t seem wise to me for us to implement our own data shredding code, so using a system command for that seems reasonable. The other examples are perhaps less reasonable — the rm one is particularly scary to me. But none of these are the best example…

How about this one?

utils.execute('tee',
              ('/sys/class/net/%s/bridge/multicast_snooping' %
               br_name),
              process_input='0',
              run_as_root=True,
              check_exit_code=[0, 1])

Some commentary first. This code existed in the middle of a method that does other things. It’s one of five command lines that method executes. What does it do?

It’s actually not too bad. Using root permissions, it writes a zero to the multicast_snooping sysctl for the network bridge being set up. It then checks the exit code and raises an exception if it’s not 0 or 1.

That said, it’s also horrid. In order to write a single byte to a sysctl as root, we are forced to fork, start a python process, read a configuration file, and then fork again — for an operation that in some situations might need to happen hundreds of times for OpenStack to restart on a node.

This is how we get to the third way that OpenStack does escalated permissions. If we could just write python code that ran as root, we could write this instead:

with open(('/sys/class/net/%s/bridge/multicast_snooping' %
           br_name), 'w') as f:
    f.write('0')

It’s not perfect, but it’s a lot cheaper to execute and we could put it in a method with a helpful name like “disable multicast snooping” for extra credit. Which brings us to…

Hire Angus Lees and make him angry. Angus noticed this problem well before the rest of us. We were all lounging around basking in our own general cleverness. What Angus proposed is that instead of all this forking and parsing and general mucking around, that we just start a separate process at startup with special permissions, and then send it commands to execute.

He could have done that with a relatively horrible API, for example just sending command lines down the pipe and getting their responses back to parse, but instead he implemented a system of python decorators which let us call a method which is marked up as saying “I want to run as root!”.

So here’s the destination in our journey, how we actually do that thing in OpenStack now:

@nova.privsep.sys_admin_pctxt.entrypoint
def disable_multicast_snooping(bridge):
    path = ('/sys/class/net/%s/bridge/multicast_snooping' %
            bridge)
    if not os.path.exists(path):
        raise exception.FileNotFound(file_path=path)
    with open(path, 'w') as f:
        f.write('0')

The decorator before the method definition is a bit opaque, but basically says “run this thing as root”, and the rest is a method which can be called from anywhere within our code.

There are a few things you need to do to setup privsep, but I don’t have time in this talk to discuss the specifics. Effectively you need to arrange for the privsep helper to start with escalated permissions, and you need to move the code which will run with one of these decorators to a sub path of your source tree to stop other code from accidentally being escalated. privsep is also capable of running with more than one set of permissions — it will start a helper for each set. That’s what this decorator is doing, specifying what permissions we need for this method.
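For a flavour of that setup, here's a minimal sketch of defining a privsep context with oslo.privsep. The package and config section names are illustrative, not Nova's actual code:

from oslo_privsep import capabilities as caps
from oslo_privsep import priv_context

# Methods decorated with @sys_admin_pctxt.entrypoint run in a helper process
# started with only the capabilities listed here.
sys_admin_pctxt = priv_context.PrivContext(
    'mypackage',
    cfg_section='mypackage_sys_admin',
    pypath=__name__ + '.sys_admin_pctxt',
    capabilities=[caps.CAP_SYS_ADMIN],
)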

And here we land at my final actionable thing. Make it easy to do the right thing, and hard to do the wrong thing. Rusty Russell used to talk about this at linux.conf.au when he was going through a phase of trying to clean up kernel APIs — it’s important that your interfaces make it obvious how to use them correctly, and make it hard to use them incorrectly.

In the example used for this talk, having command lines executed as root meant that the prevalent example of how to do many things was a command line. So people started doing that even when they didn’t need escalated permissions — for example calling mkdir instead of using our helper function to recursively make a set of directories.

We’ve cleaned that up, but we’ve also made it much much harder to just drop a command line into our code base to run as root, which will hopefully stop some of this problem re-occurring in the future. I don’t think OpenStack has reached perfection in this regard yet, but we continue to improve a little each day and that’s probably all we can hope for.

privsep can be used for non-OpenStack projects too. In fact there’s really nothing OpenStack-specific about most of OpenStack’s underlying libraries, and there are probably things there which are useful to you. The real problem is working out what is where, because there’s so much of it.

One final thing — privsep makes it possible to specify the exact permissions needed to do something. For example, setting up a network bridge probably doesn’t need “read everything on the filesystem” permissions. We originally did that, but stepped back to using a single escalated permissions set that maps to what you get with sudo, because working out what permissions a single operation needed was actually quite hard. We were trying to lower the barrier for entry for doing things the right way. I don’t think I really have time to dig into that much more here, but I’d be happy to chat about it sometime this weekend or on the Internet later.

So in summary:

  • Don’t assume someone else will solve the problem for you.
  • Are there trivial changes you can make that will drastically improve security?
  • Think about the costs of your design.
  • Hire smart people and let them be annoyed about things that have always “just been that way”. Let them fix those things.
  • Make it easy to do things the right way and hard to do things the wrong way.

I’d be happy to help people get privsep into their code, and it’s very usable outside of OpenStack. There are a couple of blog posts about that on my site at http://www.madebymikal.com/?s=privsep, but feel free to contact me at mikal@stillhq.com if you’d like to chat.

Julien Goodwin: Custom output pods for the Stanford Research CG635 Clock Generator

Thu, 2018-08-23 18:03
As part of my previously mentioned side project the ability to replace crystal oscillators in a circuit with a higher quality frequency reference is really handy, to let me eliminate a bunch of uncertainty from some test setups.

A simple function generator is the classic way to handle this, although if you need square wave output it quickly gets hard to find options, with arbitrary waveform generators (essentially just DACs) the common option. If you can get away with just sine wave output an RF synthesizer is the other main option.

While researching these I discovered the CG635 Clock Generator from Stanford Research, and some time later picked one of these up used.

As well as being a nice square wave generator at arbitrary voltages these also have another set of outputs on the rear of the unit on an 8p8c (RJ45) connector, in both RS422 (for lower frequencies) and LVDS (full range) formats, as well as some power rails to allow a variety of less common output formats.

All I needed was 1.8v LVCMOS output, and could get that from the front panel output, but I'd then need a coax tail on my boards, as well as potentially running into voltage rail issues so I wanted to use the pod output instead. Unfortunately none of the pods available from Stanford Research do LVCMOS output, so I'd have to make my own, which I did.

The key chip in my custom pod is the TI SN65LVDS4, a 1.8v capable single channel LVDS receiver that operates at the frequencies I need. The only downside is this chip is only available in a single form factor, a 1.5mm x 2mm 10 pin UQFN, which is far too small to hand solder with an iron. The rest of the circuit is just some LED indicators to signal status.


Here's a rendering of the board from KiCad.

Normally "not hand solderable" for me has meant getting the board assembled, however my normal assembly house doesn't offer custom PCB finishes, and I wanted these to have white solder mask with black silkscreen as a nice UX when I go to use them, so instead I decided to try my hand at skillet reflow as it's a nice option given the space I've got in my tiny apartment (the classic tutorial on this from SparkFun is a good read if you're interested). Instead of just a simple plate used for cooking you can now buy hot plates with what are essentially just soldering iron temperature controllers, sold as pre-heaters making it easier to come close to a normal soldering profile.

Sadly, actually acquiring the hot plate turned into a bit of a mess, the first one I ordered in May never turned up, and it wasn't until mid-July that one arrived from a different supplier.

Because of the aforementioned lack of space instead of using stencils I simply hand-applied (leaded) paste, without even an assist tool (which I probably will acquire for next time), then hand-mounted the components, and dropped them on the plate to reflow. I had one resistor turn 90 degrees, and a few bridges from excessive paste, but for a first attempt I was really happy.


Here's a photo of the first two just after being taken off the hot plate.

Once the reflow was complete it was time to start testing, and this was where I ran into my biggest problems.

The two big problems were with the power supply I was using, and with my oscilloscope.

The power supply (a Keithley 228 Voltage/Current source) is from the 80's (Keithley's "BROWN" era), and while it has nice specs, doesn't have the most obvious UI. Somehow I'd set it to limit at 0mA output current, and if you're not looking at the segment lights it's easy to miss. At work I have an EEZ H24005 which also resets the current limit to zero on clear, however it's much more obvious when limiting, and a power supply with that level of UX is now on my "to buy" list.

The issues with my scope were much simpler. Currently I only have an old Rigol DS1052E scope, and while it works fine it is a bit of a pain to use, but ultimately I made a very simple mistake while testing. I was feeding in a trigger signal direct from the CG635's front outputs, and couldn't figure out why the generator was putting out such a high voltage (implausibly so). To cut the story short, I'd simply forgotten that the scope was set for use with 10x probes, and once I realised that everything made rather more sense. An oscilloscope with auto-detection for 10x probes, as well as a bunch of other features I want in a scope (much bigger screen for one), has now been ordered, but won't arrive for a while yet.

Ultimately the boards work fine, but until the new scope arrives I can't determine their signal quality. At least they're ready for when I'll need them, which is great for flow.

Gary Pendergast: Trying Mastodon

Mon, 2018-08-20 10:04

It’s no secret that Twitter is a red hot garbage fire, so I’ve signed up for a Mastodon account to give them a try. Because I’m super vain, I decided to create my own Mastodon instance, with a custom domain.

Gary (@gary@pento.net)

7 Toots, 16 Following, 53 Followers · Web and music geek. I write poetry for WordPress.

Gary’s Mastodon

Mastodon is kind of weird to sign up for: think of it as being kind of like email. You can sign up for a big provider like Gmail or Outlook, you can run your own server, or you can pay someone to run a server for you. In my case, I’m paying for a personal account on Masto.host; they’ve been super helpful in getting it all configured. If you’re just looking to try it out for free, here’s a tool to help you choose an instance.

Some Initial Impressions

It’s pretty quiet, of course. I’m currently following 13 people on Mastodon, and 500 on Twitter, so there’s going to be a difference in volume.

Mastodon Bridge is a useful tool for you to be able to find your Twitter friends when you start out. I highly recommend using it.

None of the apps have the polish that I’m used to with Twitter apps like Twitterific and Fenix, but they’re evolving quickly. I do miss Tweet Marker support. I’ve settled on using Whalebird for my MacOS app, none of the various Android apps have appealed enough for me to start using them.

Search is pretty bad. There’s a confusing limitation (for a reasonable, but not really satisfactory technical reason): You can only search through the Toots that people on the same instance as you have subscribed to. So, because I’m the only person on pento.net, I can only search through the Toots of people I’m immediately subscribed to. If you sign up for a big host, you’ll see many more results, but you still won’t see everything.

I’m going to have to share this post manually, because Jetpack doesn’t know how to share to Mastodon yet.

David Rowe: How Fourier Transforms Work

Thu, 2018-08-16 10:04

The best explanation of the Discrete Fourier Transform (DFT) I have ever seen is from Bill Cowley on his Low SNR blog.

sthbrx - a POWER technical blog: Improving performance of Phoronix benchmarks on POWER9

Wed, 2018-08-15 15:22

Recently Phoronix ran a range of benchmarks comparing the performance of our POWER9 processor against the Intel Xeon and AMD EPYC processors.

We did well in the Stockfish, LLVM Compilation, Zstd compression, and the Tinymembench benchmarks. A few of my colleagues did a bit of investigating into some of the benchmarks where we didn't perform quite so well.

LBM / Parboil

The Parboil benchmarks are a collection of programs from various scientific and commercial fields that are useful for examining the performance and development of different architectures and tools. In this round of benchmarks Phoronix used the lbm benchmark: a fluid dynamics simulation using the Lattice-Boltzmann Method.

lbm is an iterative algorithm - the problem is broken down into discrete time steps, and at each time step a bunch of calculations are done to simulate the change in the system. Each time step relies on the results of the previous one.

The benchmark uses OpenMP to parallelise the workload, spreading the calculations done in each time step across many CPUs. The number of calculations scales with the resolution of the simulation.

Unfortunately, the resolution (and therefore the work done in each time step) is too small for modern CPUs with large numbers of SMT (simultaneous multi-threading) threads. OpenMP doesn't have enough work to parallelise and the system stays relatively idle. This means the benchmark scales relatively poorly, and is definitely not making use of the large POWER9 system.

Also this benchmark is compiled without any optimisation. Recompiling with -O3 improves the results 3.2x on POWER9.

x264 Video Encoding

x264 is a library that encodes videos into the H.264/MPEG-4 format. x264 encoding requires a lot of integer kernels doing operations on image elements. The math and vectorisation optimisations are quite complex, so Nick only had a quick look at the basics. The systems and environments (e.g. gcc version 8.1 for Skylake, 8.0 for POWER9) are not completely apples to apples so for now patterns are more important than the absolute results. Interestingly the output video files between architectures are not the same, particularly with different asm routines and compiler options used, which makes it difficult to verify the correctness of any changes.

All tests were run single threaded to avoid any SMT effects.

With the default upstream build of x264, Skylake is significantly faster than POWER9 on this benchmark (Skylake: 9.20 fps, POWER9: 3.39 fps). POWER9 contains some vectorised routines, so an initial suspicion is that Skylake's larger vector size may be responsible for its higher throughput.

Let's test our vector size suspicion by restricting Skylake to SSE4.2 code (with 128 bit vectors, the same width as POWER9). This hardly slows down the x86 CPU at all (Skylake: 8.37 fps, POWER9: 3.39 fps), which indicates it's not taking much advantage of the larger vectors.

So the next guess would be that x86 just has more and better optimized versions of costly functions (in the version of x264 that Phoronix used there are only six powerpc specific files compared with 21 x86 specific files). Without the time or expertise to dig into the complex task of writing vector code, we'll see if the compiler can help, and turn on autovectorisation (x264 compiles with -fno-tree-vectorize by default, which disables auto vectorization). Looking at a perf profile of the benchmark we can see that one costly function, quant_4x4x4, is not autovectorised. With a small change to the code, gcc does vectorise it, giving a slight speedup with the output file checksum unchanged (Skylake: 9.20 fps, POWER9: 3.83 fps).

We got a small improvement with the compiler, but it looks like we may have gains left on the table with our vector code. If you're interested in looking into this, we do have some active bounties for x264 (lu-zero/x264).

Test                                                Skylake    POWER9
Original - AVX256                                   9.20 fps   3.39 fps
Original - SSE4.2                                   8.37 fps   3.39 fps
Autovectorisation enabled, quant_4x4x4 vectorised   9.20 fps   3.83 fps

Nick also investigated running this benchmark with SMT enabled and across multiple cores, and it looks like the code is not scalable enough to feed 176 threads on a 44 core system. Disabling SMT in parallel runs actually helped, but there was still idle time. That may be another thing to look at, although it may not be such a problem for smaller systems.

Primesieve

Primesieve is a program and C/C++ library that generates all the prime numbers below a given number. It uses an optimised Sieve of Eratosthenes implementation.

The algorithm uses the L1 cache size as the sieve size for the core loop. This is an issue when we are running in SMT mode (aka more than one thread per core) as all threads on a core share the same L1 cache and so will constantly be invalidating each other's cache-lines. As you can see in the table below, running the benchmark in single threaded mode is 30% faster than in SMT4 mode!

This means in SMT-4 mode the workload is about 4x too large for the L1 cache. A better sieve size to use would be the L1 cache size / number of threads per core. Anton posted a pull request to update the sieve size.
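As a rough illustration of that sizing (the 32KB L1 data cache figure is an assumption for the sake of the example):

l1_data_cache = 32 * 1024  # bytes per core, assumed
for threads_per_core in (1, 2, 4):
    # Splitting the L1 cache between the SMT threads sharing it
    print(threads_per_core, l1_data_cache // threads_per_core, "byte sieve per thread")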

It is interesting that the best overall performance on POWER9 is with the patch applied and in SMT2 mode:

SMT level   baseline   patched
1           14.728s    14.899s
2           15.362s    14.040s
4           19.489s    17.458s

LAME

Despite its name, a recursive acronym for "LAME Ain't an MP3 Encoder", LAME is indeed an MP3 encoder.

Due to configure options not being parsed correctly this benchmark is built without any optimisation regardless of architecture. We see a massive speedup by turning optimisations on, and a further 6-8% speedup by enabling USE_FAST_LOG (which is already enabled for Intel).

LAME                                           Duration   Speedup
Default                                        82.1s      n/a
With optimisation flags                        16.3s      5.0x
With optimisation flags and USE_FAST_LOG set   15.6s      5.3x

For more detail see Joel's writeup.

FLAC

FLAC is an alternative encoding format to MP3. But unlike MP3 encoding it is lossless! The benchmark here was encoding audio files into the FLAC format.

The key part of this workload is missing vector support for POWER8 and POWER9. Anton and Amitay submitted this patch series that adds in POWER specific vector instructions. It also fixes the configuration options to correctly detect the POWER8 and POWER9 platforms. With this patch series we see about a 3x improvement in this benchmark.

OpenSSL

OpenSSL is among other things a cryptographic library. The Phoronix benchmark measures the number of RSA 4096 signs per second:

$ openssl speed -multi $(nproc) rsa4096

Phoronix used OpenSSL-1.1.0f, which is almost half as slow for this benchmark (on POWER9) than mainline OpenSSL. Mainline OpenSSL has some powerpc multiplication and squaring assembly code which seems to be responsible for most of this speedup.

To see this for yourself, add these four powerpc specific commits on top of OpenSSL-1.1.0f:

  1. perlasm/ppc-xlate.pl: recognize .type directive
  2. bn/asm/ppc-mont.pl: prepare for extension
  3. bn/asm/ppc-mont.pl: add optimized multiplication and squaring subroutines
  4. ppccap.c: engage new multiplication and squaring subroutines

The following results were from a dual 16-core POWER9:

Version of OpenSSL      Signs/s   Speedup
1.1.0f                  1921      n/a
1.1.0f with 4 patches   3353      1.74x
1.1.1-pre1              3383      1.76x

SciKit-Learn

SciKit-Learn is a bunch of python tools for data mining and analysis (aka machine learning).

Joel noticed that the benchmark spent 92% of the time in libblas. Libblas is a very basic BLAS (basic linear algebra subprograms) library that python-numpy uses to do vector and matrix operations. The default libblas on Ubuntu is only compiled with -O2. Compiling with -Ofast and using alternative BLAS's that have powerpc optimisations (such as libatlas or libopenblas) we see big improvements in this benchmark:

BLAS used        Duration   Speedup
libblas -O2      64.2s      n/a
libblas -Ofast   36.1s      1.8x
libatlas         8.3s       7.7x
libopenblas      4.2s       15.3x

You can read more details about this here.
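If you want to check which BLAS your own numpy build is linked against, one quick (if verbose) way is:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against;
# output format varies between numpy versions.
np.show_config()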

Blender

Blender is a 3D graphics suite that supports image rendering, animation, simulation and game creation. On the surface it appears that Blender 2.79b (the distro package version used by the Phoronix system/blender-1.0.2 test) failed to use more than 15 threads, even when "-t 128" was added to the Blender command line.

It turns out that even though this benchmark was supposed to be run on CPUs only (you can choose to render on CPUs or GPUs), the GPU file was always being used. The GPU file is configured with a very large tile size (256x256) - which is fine for GPUs but not great for CPUs. The image size (1280x720) to tile size ratio limits the number of jobs created and therefore the number of threads used.
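The 15 thread ceiling falls straight out of the tile arithmetic:

import math

# A 1280x720 image cut into 256x256 tiles gives 5 x 3 = 15 jobs,
# hence at most ~15 busy threads.
tiles = math.ceil(1280 / 256) * math.ceil(720 / 256)
print(tiles)  # 15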

To obtain a realistic CPU measurement with more than 15 threads you can force the use of the CPU file by overwriting the GPU file with the CPU one:

$ cp ~/.phoronix-test-suite/installed-tests/system/blender-1.0.2/benchmark/pabellon_barcelona/pavillon_barcelone_cpu.blend ~/.phoronix-test-suite/installed-tests/system/blender-1.0.2/benchmark/pabellon_barcelona/pavillon_barcelone_gpu.blend

As you can see in the image below, now all of the cores are being utilised!

Fortunately this has already been fixed in pts/blender-1.1.1. Thanks to the report by Daniel it has also been fixed in system/blender-1.1.0.

Pinning the pts/blender-1.0.2, Pabellon Barcelona, CPU-Only test to a single 22-core POWER9 chip (sudo ppc64_cpu --cores-on=22) and two POWER9 chips (sudo ppc64_cpu --cores-on=44) shows a huge speedup:

Benchmark                                     Duration (deviation over 3 runs)   Speedup
Baseline (GPU blend file)                     1509.97s (0.30%)                   n/a
Single 22-core POWER9 chip (CPU blend file)   458.64s (0.19%)                    3.29x
Two 22-core POWER9 chips (CPU blend file)     241.33s (0.25%)                    6.25x

tl;dr

Some of the benchmarks where we don't perform as well as Intel are where the benchmark has inline assembly for x86 but uses generic C compiler generated assembly for POWER9. We could probably benefit from some more powerpc optimised functions.

We also found a couple of things that should result in better performance for all three architectures, not just POWER.

A summary of the performance improvements we found:

Benchmark      Approximate Improvement
Parboil        3x
x264           1.1x
Primesieve     1.1x
LAME           5x
FLAC           3x
OpenSSL        2x
SciKit-Learn   7-15x
Blender        3x

There is obviously room for more improvements, especially with the Primesieve and x264 benchmarks, but it would be interesting to see a re-run of the Phoronix benchmarks with these changes.

Thanks to Anton, Daniel, Joel and Nick for the analysis of the above benchmarks.

sthbrx - a POWER technical blog: Improving performance of Phoronix benchmarks on POWER9

Wed, 2018-08-15 15:22

Recently Phoronix ran a range of benchmarks comparing the performance of our POWER9 processor against the Intel Xeon and AMD EPYC processors.

We did well in the Stockfish, LLVM Compilation, Zstd compression, and the Tinymembench benchmarks. A few of my colleagues did a bit of investigating into some the benchmarks where we didn't perform quite so well.

LBM / Parboil

The Parboil benchmarks are a collection of programs from various scientific and commercial fields that are useful for examining the performance and development of different architectures and tools. In this round of benchmarks Phoronix used the lbm benchmark: a fluid dynamics simulation using the Lattice-Boltzmann Method.

lbm is an iterative algorithm - the problem is broken down into discrete time steps, and at each time step a bunch of calculations are done to simulate the change in the system. Each time step relies on the results of the previous one.

The benchmark uses OpenMP to parallelise the workload, spreading the calculations done in each time step across many CPUs. The number of calculations scales with the resolution of the simulation.

Unfortunately, the resolution (and therefore the work done in each time step) is too small for modern CPUs with large numbers of SMT (simultaneous multi-threading) threads. OpenMP doesn't have enough work to parallelise and the system stays relatively idle. This means the benchmark scales relatively poorly, and is definitely not making use of the large POWER9 system.

Also this benchmark is compiled without any optimisation. Recompiling with -O3 improves the results 3.2x on POWER9.

x264 Video Encoding

x264 is a library that encodes videos into the H.264/MPEG-4 format. x264 encoding requires a lot of integer kernels doing operations on image elements. The math and vectorisation optimisations are quite complex, so Nick only had a quick look at the basics. The systems and environments (e.g. gcc version 8.1 for Skylake, 8.0 for POWER9) are not completely apples to apples so for now patterns are more important than the absolute results. Interestingly the output video files between architectures are not the same, particularly with different asm routines and compiler options used, which makes it difficult to verify the correctness of any changes.

All tests were run single threaded to avoid any SMT effects.

With the default upstream build of x264, Skylake is significantly faster than POWER9 on this benchmark (Skylake: 9.20 fps, POWER9: 3.39 fps). POWER9 contains some vectorised routines, so an initial suspicion is that Skylake's larger vector size may be responsible for its higher throughput.

Let's test our vector size suspicion by restricting Skylake to SSE4.2 code (with 128 bit vectors, the same width as POWER9). This hardly slows down the x86 CPU at all (Skylake: 8.37 fps, POWER9: 3.39 fps), which indicates it's not taking much advantage of the larger vectors.

So the next guess is that x86 simply has more, and better optimised, versions of costly functions (in the version of x264 that Phoronix used there are only six powerpc specific files compared with 21 x86 specific files). Without the time or expertise to dig into the complex task of writing vector code, we'll see if the compiler can help, and turn on autovectorisation (x264 compiles with -fno-tree-vectorize by default, which disables it). Looking at a perf profile of the benchmark we can see that one costly function, quant_4x4x4, is not autovectorised. With a small change to the code, gcc does vectorise it, giving a slight speedup with the output file checksum unchanged (Skylake: 9.20 fps, POWER9: 3.83 fps).

We got a small improvement with the compiler, but it looks like we may have gains left on the table with our vector code. If you're interested in looking into this, we do have some active bounties for x264 (lu-zero/x264).

Test                                                 Skylake    POWER9
Original - AVX256                                    9.20 fps   3.39 fps
Original - SSE4.2                                    8.37 fps   3.39 fps
Autovectorisation enabled, quant_4x4x4 vectorised    9.20 fps   3.83 fps

Nick also investigated running this benchmark with SMT enabled and across multiple cores, and it looks like the code is not scalable enough to feed 176 threads on a 44 core system. Disabling SMT in parallel runs actually helped, but there was still idle time. That may be another thing to look at, although it may not be such a problem for smaller systems.

Primesieve

Primesieve is a program and C/C++ library that generates all the prime numbers below a given number. It uses an optimised Sieve of Eratosthenes implementation.

The algorithm uses the L1 cache size as the sieve size for the core loop. This is an issue when we are running in SMT mode (aka more than one thread per core) as all threads on a core share the same L1 cache and so will constantly be invalidating each other's cache lines. As you can see in the table below, running the benchmark in single threaded mode is 30% faster than in SMT4 mode!

This means in SMT-4 mode the workload is about 4x too large for the L1 cache. A better sieve size to use would be the L1 cache size / number of threads per core. Anton posted a pull request to update the sieve size.
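
Here is a rough Python sketch of that sizing argument. The 32 KB L1 data cache size is an assumption, and primesieve's real sieve sizing is more involved; the point is just that the per-thread share of L1 shrinks with the SMT level:

# Rough sketch of the sieve-size argument (assumed 32 KB L1 data cache per core).
L1_BYTES = 32 * 1024

for smt_level in (1, 2, 4):
    per_thread = L1_BYTES // smt_level
    print(f"SMT{smt_level}: ~{per_thread // 1024} KB of L1 per thread")

# With the sieve sized to the full L1 (32 KB), the four SMT4 threads on a core
# are fighting over a cache that only holds a quarter of their combined working
# set, which is roughly the "4x too large" figure above.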

It is interesting that the best overall performance on POWER9 is with the patch applied and in SMT2 mode:

SMT level   baseline   patched
1           14.728s    14.899s
2           15.362s    14.040s
4           19.489s    17.458s

LAME

Despite its name, a recursive acronym for "LAME Ain't an MP3 Encoder", LAME is indeed an MP3 encoder.

Due to configure options not being parsed correctly this benchmark is built without any optimisation regardless of architecture. We see a massive speedup by turning optimisations on, and a further 6-8% speedup by enabling USE_FAST_LOG (which is already enabled for Intel).

LAME                                           Duration   Speedup
Default                                        82.1s      n/a
With optimisation flags                        16.3s      5.0x
With optimisation flags and USE_FAST_LOG set   15.6s      5.3x

For more detail see Joel's writeup.

FLAC

FLAC is an alternative encoding format to MP3. But unlike MP3 encoding it is lossless! The benchmark here was encoding audio files into the FLAC format.

The key part of this workload was missing vector support for POWER8 and POWER9. Anton and Amitay submitted this patch series that adds in POWER specific vector instructions. It also fixes the configuration options to correctly detect the POWER8 and POWER9 platforms. With this patch series we see about a 3x improvement in this benchmark.

OpenSSL

OpenSSL is among other things a cryptographic library. The Phoronix benchmark measures the number of RSA 4096 signs per second:

$ openssl speed -multi $(nproc) rsa4096

Phoronix used OpenSSL-1.1.0f, which on POWER9 is almost half as fast on this benchmark as mainline OpenSSL. Mainline OpenSSL has some powerpc multiplication and squaring assembly code which seems to be responsible for most of this speedup.

To see this for yourself, add these four powerpc specific commits on top of OpenSSL-1.1.0f:

  1. perlasm/ppc-xlate.pl: recognize .type directive
  2. bn/asm/ppc-mont.pl: prepare for extension
  3. bn/asm/ppc-mont.pl: add optimized multiplication and squaring subroutines
  4. ppccap.c: engage new multiplication and squaring subroutines

The following results were from a dual 16-core POWER9:

Version of OpenSSL      Signs/s   Speedup
1.1.0f                  1921      n/a
1.1.0f with 4 patches   3353      1.74x
1.1.1-pre1              3383      1.76x

SciKit-Learn

SciKit-Learn is a bunch of python tools for data mining and analysis (aka machine learning).

Joel noticed that the benchmark spent 92% of the time in libblas. Libblas is a very basic BLAS (basic linear algebra subprograms) library that python-numpy uses to do vector and matrix operations. The default libblas on Ubuntu is only compiled with -O2. Compiling it with -Ofast, or using alternative BLAS libraries that have powerpc optimisations (such as libatlas or libopenblas), gives big improvements in this benchmark:

BLAS used        Duration   Speedup
libblas -O2      64.2s      n/a
libblas -Ofast   36.1s      1.8x
libatlas         8.3s       7.7x
libopenblas      4.2s       15.3x
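
If you want to see which BLAS your own numpy build is actually linked against before trying alternatives, numpy can print its build configuration (a minimal check; the output format varies between numpy versions):

# Quick check of the BLAS/LAPACK libraries this numpy build is using.
import numpy as np

np.show_config()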

You can read more details about this here.

Blender

Blender is a 3D graphics suite that supports image rendering, animation, simulation and game creation. On the surface it appears that Blender 2.79b (the distro package version that Phoronix used via system/blender-1.0.2) failed to use more than 15 threads, even when "-t 128" was added to the Blender command line.

It turns out that even though this benchmark was supposed to be run on CPUs only (you can choose to render on CPUs or GPUs), the GPU file was always being used. The GPU file is configured with a very large tile size (256x256) - which is fine for GPUs but not great for CPUs. The image size (1280x720) to tile size ratio limits the number of jobs created and therefore the number of threads used.
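
A quick back-of-the-envelope check of that limit (a sketch only, assuming Blender simply splits the image into a grid of ceil(width/tile) by ceil(height/tile) render jobs):

import math

width, height = 1280, 720   # benchmark image size
tile = 256                  # tile size in the GPU blend file

jobs = math.ceil(width / tile) * math.ceil(height / tile)
print(jobs)   # 5 * 3 = 15 tiles, matching the ~15 busy threads observed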

To obtain a realistic CPU measurement with more than 15 threads you can force the use of the CPU file by overwriting the GPU file with the CPU one:

$ cp ~/.phoronix-test-suite/installed-tests/system/blender-1.0.2/benchmark/pabellon_barcelona/pavillon_barcelone_cpu.blend ~/.phoronix-test-suite/installed-tests/system/blender-1.0.2/benchmark/pabellon_barcelona/pavillon_barcelone_gpu.blend

As you can see in the image below, now all of the cores are being utilised!

Fortunately this has already been fixed in pts/blender-1.1.1. Thanks to the report by Daniel it has also been fixed in system/blender-1.1.0.

Pinning the pts/blender-1.0.2, Pabellon Barcelona, CPU-Only test to a single 22-core POWER9 chip (sudo ppc64_cpu --cores-on=22) and two POWER9 chips (sudo ppc64_cpu --cores-on=44) shows a huge speedup:

Benchmark                                     Duration (deviation over 3 runs)   Speedup
Baseline (GPU blend file)                     1509.97s (0.30%)                   n/a
Single 22-core POWER9 chip (CPU blend file)   458.64s (0.19%)                    3.29x
Two 22-core POWER9 chips (CPU blend file)     241.33s (0.25%)                    6.25x

tl;dr

Some of the benchmarks where we don't perform as well as Intel are where the benchmark has inline assembly for x86 but uses generic C compiler generated assembly for POWER9. We could probably benefit from some more powerpc optimised functions.

We also found a couple of things that should result in better performance for all three architectures, not just POWER.

A summary of the performance improvements we found:

Benchmark      Approximate Improvement
Parboil        3x
x264           1.1x
Primesieve     1.1x
LAME           5x
FLAC           3x
OpenSSL        2x
SciKit-Learn   7-15x
Blender        3x

There is obviously room for more improvements, especially with the Primesieve and x264 benchmarks, but it would be interesting to see a re-run of the Phoronix benchmarks with these changes.

Thanks to Anton, Daniel, Joel and Nick for the analysis of the above benchmarks.

Michael Still: city2surf 2018 wrap up

Mon, 2018-08-13 16:00

city2surf 2018 was yesterday, so how did the race go? First off, thanks to everyone who helped out with my fund raising for the Black Dog Institute — you raised nearly $2,000 AUD for this important charity, which is very impressive. Thanks for everyone’s support!

city2surf is 14kms, with 166 meters of vertical elevation gain. For the second year running I was in the green start group, which is for people who have previously finished the event in less than 90 minutes. There is one start group before this, red, which is for people who can finish in less than 70 minutes. In reality I think it's unlikely that I'll ever make it to red — it would require me to shave about 30 seconds per kilometre off my time to just scrape in, and I think that would be hard to do.

Training for city2surf last year I tore my right Achilles, so I was pretty much starting from scratch for this year's event — at the start of the year I could run about 50 meters before I had issues. Luckily I was referred to an excellent physiotherapist who has helped me build back up safely — I highly recommend Cameron at Southside Physio Therapy if you live in Canberra.

Overall I ran a lot in training for this year — a total of 540 kilometres. I was also a lot more consistent than in previous years, which is something I’m pretty proud of given how cold winters are in Canberra. Cold weather, short days, and getting sick seem to always get in the way of winter training for me.

On the day I was worried about being cold while running, but that wasn’t an issue. It was about 10 degrees when we started and maybe a couple of degrees warmer than that at the end. The maximum for the day was only 16, which is cold for Sydney at this time of year. There was a tiny bit of spitting rain, but nothing serious. Wind was the real issue — it was very windy at the finish, and I think if it had been like that for the entire race it would have been much less fun.

That said, I finished in 76:32, which is about three minutes faster than last year and a personal best. Overall, an excellent experience and I’ll be back again.

Matthew Oliver: Keystone Federated Swift – Final post coming

Fri, 2018-08-10 12:05

This is a quick post to say the final topology post is coming. It's currently in draft form and I hope to post it soon. I just realised it's been a while, so I thought I'd better give an update.

The last post goes into what auth does, what is happening in keystone, and what needs to happen to really make this topology work, and then talks about the brittle POC I created to have something to demo. I'll be discussing other, better options and alternatives. But all this means it's become much more detailed than I originally expected. I hope to get it up by mid next week.

Thanks for waiting.

David Rowe: Testing a RTL-SDR with FSK on HF

Tue, 2018-08-07 16:04

There’s a lot of discussion about ADC resolution and SDRs. I’m trying to develop a simple HF data system that uses RTL-SDRs in “Direct Sample” mode. This blog post describes how I measured the Minimum Detectable Signal (MDS) of my 100 bit/s 2FSK receiver, and a spreadsheet model of the receiver that explains my results.

Noise in a receiver comes from all sorts of places. There are two sources of concern for this project – HF band noise and ADC quantisation noise. On lower HF frequencies (7MHz and below) I’m guess-timating signals weaker than -100dBm will be swamped by HF band noise. So I’d like a receiver that has a MDS anywhere under that. The big question is, can we build such a receiver using a low cost SDR?

Experimental Set Up

So I hooked up the experimental setup in the figure below:

The photo shows the actual hardware. It was spaced apart a bit further for the actual test:

Rpitx is sending 2FSK at 100 bit/s and about 14dBm Tx power. It then gets attenuated by some fixed and variable attenuators to beneath -100dBm. I fed the signal into a RTL-SDR plugged into my laptop, demodulated the 2FSK signal, and measured the Bit Error Rate (BER).

I tried a command line receiver:

rtl_sdr -s 1200000 -f 7000000 -D 2 - | csdr convert_u8_f | csdr shift_addition_cc `python -c "print float(7000000-7177000)/1200000"` | csdr fir_decimate_cc 25 0.05 HAMMING | csdr bandpass_fir_fft_cc 0 0.1 0.05 | csdr realpart_cf | csdr convert_f_s16 | ~/codec2-dev/build_linux/src/fsk_demod 2 48000 100 - - | ~/codec2-dev/build_linux/src/fsk_put_test_bits -

and also gqrx, using this configuration:

with the very handy UDP output option sending samples to the FSK demodulator:

$ nc -ul 7355 | ./fsk_demod 2 48000 100 - - | ./fsk_put_test_bits -

Both versions demodulate the FSK signal and print the bit error rate in real time. I really love the csdr tools, and gqrx is also great for a more visual look at the signal and the ability to monitor the audio.

For these tests the gqrx receiver worked best. It attenuated nearby interferers better (i.e. better sideband rejection) and worked at lower Rx signal levels. It also has a "hardware AGC" option that I haven't worked out how to enable in the command line tools. However for my target use case I'll eventually need a command line version, so I'll have to improve it some time.

The RF Gods are smiling on me today. This experimental set up actually works better than previous bench tests where we needed to put the Tx in another room to get enough isolation. I can still get 10dB steps from the attenuator at -120dBm (ish) with the Tx a few feet from the Rx. It might be the ferrites on the cable to the switched attenuator.

I tested the ability to get solid 10dB steps using a CW (continuous sine wave) signal using the "test" utility in rpitx. FSK bounces about too much, especially with the narrow spectrum analyser settings I need to measure weak signals. The configuration of the Rigol DSA815 I used to see the weak signals is described at the end of this post on the SM2000.

The switched attenuator just has 10dB steps. I am getting zero bit errors at -115dBm, and the modem fell over on the next step (-125dBm). So the MDS is somewhere in between.

Model

This spreadsheet (click for the file) models the receiver:

By poking the RTL-SDR with my signal generator, and plotting the output waveforms, I worked out that it clips at around -30dBm (a respectable S9+40dB). So that’s the strongest signal it can handle, at least using the rtl_sdr command line options I can find. Even though it’s an 8 bit ADC I figure there are 7 magnitude bits (the samples are unsigned chars). So we get 6dB per bit or 42dB dynamic range.

This lets us work out the power of the quantisation noise (42dB beneath -30dBm). This noise power is effectively spread across the entire bandwidth of the ADC, a little bit of noise power for each Hz of bandwidth. The bandwidth is set by the sample rate of the RTL-SDR's internal ADC (28.8 MHz). So now we can work out No (N-nought), the power/unit Hz of bandwidth. It's like a normalised version of the receiver "noise floor". An ADC with more bits would have less quantisation noise.

There follows some modem working which gives us an estimate of the MDS for the modem. The MDS of -117.6dBm is between my two measurements above, so we have a good agreement between this model and the experimental results. Cool!
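
Here is a minimal Python re-creation of that working — a sketch of the same chain of numbers, not the spreadsheet itself. The 9dB Eb/No figure for 100 bit/s 2FSK at around 1% BER is assumed here:

# Rough re-creation of the spreadsheet working (a sketch, not the exact model).
import math

clip_dbm = -30.0                 # measured clipping level of the RTL-SDR
dynamic_range_db = 7 * 6.0       # 7 magnitude bits at ~6 dB/bit = 42 dB
adc_bw_hz = 28.8e6               # RTL-SDR internal ADC sample rate

noise_dbm = clip_dbm - dynamic_range_db              # total quantisation noise power, -72 dBm
No_dbm_hz = noise_dbm - 10 * math.log10(adc_bw_hz)   # noise density, about -146.6 dBm/Hz

bit_rate = 100.0                 # bit/s
ebno_req_db = 9.0                # assumed Eb/No needed for ~1% BER with 2FSK
mds_dbm = No_dbm_hz + 10 * math.log10(bit_rate) + ebno_req_db

print(f"No  = {No_dbm_hz:.1f} dBm/Hz")   # ~ -146.6
print(f"MDS = {mds_dbm:.1f} dBm")        # ~ -117.6, between the -115/-125 dBm measurements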

Falling through the Noise Floor

The “noise floor” depends on what you are trying to receive. If you are listening to wide 10kHz wide AM signal, you will be slurping up 10kHz of noise, and get a noise power of:

-146.6+10*log10(10000) = -106.6 dBm

So if you want that AM signal to have a SNR of 20dB, you need a received signal level of -86.6dBm to get over the quantisation noise of the receiver.
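
The same arithmetic in the Python terms of the sketch above (illustration only):

import math

No_dbm_hz = -146.6                                   # quantisation noise density from above
am_noise_dbm = No_dbm_hz + 10 * math.log10(10e3)     # noise in a 10 kHz bandwidth, -106.6 dBm
am_signal_needed_dbm = am_noise_dbm + 20             # for a 20 dB SNR, -86.6 dBm
print(am_noise_dbm, am_signal_needed_dbm)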

I’m trying to receive low bit rate FSK which can handle a lot more noise before it falls over, as indicated in the spreadsheet above. So it’s more robust to the quantisation noise and we can have a much lower MDS.

The “noise floor” is not some impenetrable barrier. It’s just a convention, and needs to be considered relative to the bandwidth and type of the signal you want to receive.

One area I get confused about is noise bandwidth. In the model above I assume the noise bandwidth is the same as the ADC sample rate. Please feel free to correct me if that assumption is wrong! With IQ signals we have to consider negative frequencies, complex to real conversions, which all affects noise power. I muddle through this when I play with modems but if anyone has a straightforward explanation of the noise bandwidth I'd love to hear it!

Blocking Tests

At the suggestion of Mark, I repeated the MDS tests with a strong CW interferer from my signal generator. I adjusted the Sig Gen and Rx levels until I could just detect the FSK signal. Here are the results, all in dBm:

Sig Gen   2FSK Rx MDS   Difference
-51       -116          65
-30       -96           66

The FSK signal was at 7.177MHz. I tried the interferer at 7MHz (177 kHz away) and 7.170MHz (just 7 kHz away) with the same blocking results. I’m pretty impressed that the system can continue to work with a 65dB stronger signal just 7kHz away.

So the interferer desensitises the receiver somewhat. When listening to the signal on gqrx, I can hear the FSK signal get much weaker when I switch the Sig Gen on. However it often keeps demodulating just fine – FSK is not sensitive to amplitude.

I can also hear spurious tones appearing; the quantisation noise isn’t really white noise any more when a strong signal is present. Makes me feel like I still have a lot to learn about this SDR caper, especially direct sampling receivers!

As with the MDS results – my blocking results are likely to depend on the nature of the signal I am trying to receive. For example a SSB signal or higher data rate might have different blocking results.

Still, 65dB rejection on a $27 radio (at least for my test modem signal) is not too shabby. I can get away with a S9+40dB (-30dBm) interferer just 7kHz away with my rx signal near the limits of where I want to detect (-96dBm).

Conclusions

So I figure for the lower HF bands this receiver's performance is OK – the ADC quantisation noise isn't likely to impact performance and the strong signal performance is good enough. An overload of -30dBm (S9+40dB) is also acceptable given the use case is remote communications where there are unlikely to be any nearby transmitters in the input filter passband.

The 100 bit/s signal is just a starting point. I can use that as a reference to help me understand how different modems and bit rates will perform. For example I can increase the bit rate to say 1000 bit/s 2FSK, increasing the MDS by 10dB, and still be well beneath my -100dBm MDS target. Good.
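
Continuing the earlier sketch (illustrative arithmetic only), the MDS moves with 10*log10 of the bit rate:

import math

mds_100 = -117.6                                    # sketch MDS at 100 bit/s from the model above
mds_1000 = mds_100 + 10 * math.log10(1000 / 100)    # +10 dB for 1000 bit/s, about -107.6 dBm
print(mds_1000)                                     # still comfortably under the -100 dBm target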

If it does fall over in the real world due to MDS performance, overload or blocking, I now have a good understanding of how it works so it will be possible to engineer a solution.

For example a pre-amp with X dB gain would lower the quantisation noise power by X dB and allow us to detect weaker signals, but then the Rx would overload at -30-X dBm. If we have strong signal problems but our target signal is also pretty strong we can insert an attenuator. If we drop in another SDR I can recompute the quantisation noise from its specs, and estimate how well it will perform.

Reading Further

Rpitx and 2FSK, first part in this series.
Spreadsheet used to do the working for the quantisation noise.

Gary Pendergast: WordPress’ Gutenberg: The Long View

Mon, 2018-08-06 22:04

WordPress has been around for 15 years. 31.5% of sites use it, and that figure continues to climb. We’re here for the long term, so we need to plan for the long term.

Gutenberg is being built as the base for the next 15 years of WordPress. The first phase, replacing the post editing screen with the new block editor, is getting close to completion. That’s not to say the block editor will stop iterating and improving with WordPress 5.0, rather, this is where we feel confident that we’ve created a foundation that we can build everything else upon.

Let’s chat about the long-term vision and benefit of the Gutenberg project.

Francois Marier: Mercurial commit series in Phabricator using Arcanist

Thu, 2018-08-02 03:18

Phabricator supports multi-commit patch series, but it's not yet obvious how to do it using Mercurial. So this is the "hg" equivalent of this blog post for git users.

Note that other people have written tools and plugins to do the same thing and that an official client is coming soon.

Initial setup

I'm going to assume that you've set up arcanist and gotten an account on the Mozilla Phabricator instance. If you haven't, follow this video introduction or the excellent documentation for it (Bryce also wrote additional instructions for Windows users).

Make a list of commits to submit

First of all, use hg histedit to make a list of the commits that are needed:

pick ee4d9e9fcbad 477986 Bug 1461515 - Split tracking annotations from tracki...
pick 5509b5db01a4 477987 Bug 1461515 - Fix and expand tracking annotation tes...
pick e40312debf76 477988 Bug 1461515 - Make TP test fail if it uses the wrong...

Create Phabricator revisions

Now, create a Phabricator revision for each commit (in order, from earliest to latest):

~/devel/mozilla-unified (annotation-list-1461515)$ hg up ee4d9e9fcbad
5 files updated, 0 files merged, 0 files removed, 0 files unresolved
(leaving bookmark annotation-list-1461515)
~/devel/mozilla-unified (ee4d9e9)$ arc diff --no-amend
Linting...
No lint engine configured for this project.
Running unit tests...
No unit test engine is configured for this project.
SKIP STAGING Phabricator does not support staging areas for this repository.
Created a new Differential revision:
Revision URI: https://phabricator.services.mozilla.com/D2484
Included changes:
M modules/libpref/init/all.js
M netwerk/base/nsChannelClassifier.cpp
M netwerk/base/nsChannelClassifier.h
M toolkit/components/url-classifier/Classifier.cpp
M toolkit/components/url-classifier/SafeBrowsing.jsm
M toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
M toolkit/components/url-classifier/tests/UrlClassifierTestUtils.jsm
M toolkit/components/url-classifier/tests/mochitest/test_trackingprotection_bug1312515.html
M xpcom/base/ErrorList.py
~/devel/mozilla-unified (ee4d9e9)$ hg up 5509b5db01a4
3 files updated, 0 files merged, 0 files removed, 0 files unresolved
~/devel/mozilla-unified (5509b5d)$ arc diff --no-amend
Linting...
No lint engine configured for this project.
Running unit tests...
No unit test engine is configured for this project.
SKIP STAGING Phabricator does not support staging areas for this repository.
Created a new Differential revision:
Revision URI: https://phabricator.services.mozilla.com/D2485
Included changes:
M toolkit/components/url-classifier/tests/UrlClassifierTestUtils.jsm
M toolkit/components/url-classifier/tests/mochitest/test_trackingprotection_bug1312515.html
M toolkit/components/url-classifier/tests/mochitest/trackingRequest.html
~/devel/mozilla-unified (5509b5d)$ hg up e40312debf76
2 files updated, 0 files merged, 0 files removed, 0 files unresolved
~/devel/mozilla-unified (e40312d)$ arc diff --no-amend
Linting...
No lint engine configured for this project.
Running unit tests...
No unit test engine is configured for this project.
SKIP STAGING Phabricator does not support staging areas for this repository.
Created a new Differential revision:
Revision URI: https://phabricator.services.mozilla.com/D2486
Included changes:
M toolkit/components/url-classifier/tests/mochitest/classifiedAnnotatedPBFrame.html
M toolkit/components/url-classifier/tests/mochitest/test_privatebrowsing_trackingprotection.html

Link all revisions together

In order to ensure that these commits depend on one another, click on that last phabricator.services.mozilla.com link, then click "Related Revisions" then "Edit Parent Revisions" in the right-hand side bar and then add the previous commit (D2485 in this example).

Then go to that parent revision and repeat the same steps to set D2484 as its parent.

Amend one of the commits

As it turns out my first patch wasn't perfect and I needed to amend the middle commit to fix some test failures that came up after pushing to Try. I ended up with the following commits (as viewed in hg histedit):

pick ee4d9e9fcbad 477986 Bug 1461515 - Split tracking annotations from tracki...
pick c24f4d9e75b9 477992 Bug 1461515 - Fix and expand tracking annotation tes...
pick 1840f68978a7 477993 Bug 1461515 - Make TP test fail if it uses the wrong...

which highlights that the last two commits changed and that I would have two revisions (D2485 and D2486) to update in Phabricator.

However, since the only reason the third patch has a different commit hash is that its parent changed, there's no need to upload it again to Phabricator. Lando doesn't care about the parent hash and relies instead on the parent revision ID. It essentially applies diffs one at a time.

The trick was to pass the --update DXXXX argument to arc diff:

~/devel/mozilla-unified (annotation-list-1461515)$ hg up c24f4d9e75b9
2 files updated, 0 files merged, 0 files removed, 0 files unresolved
(leaving bookmark annotation-list-1461515)
~/devel/mozilla-unified (c24f4d9)$ arc diff --no-amend --update D2485
Linting...
No lint engine configured for this project.
Running unit tests...
No unit test engine is configured for this project.
SKIP STAGING Phabricator does not support staging areas for this repository.
Updated an existing Differential revision:
Revision URI: https://phabricator.services.mozilla.com/D2485
Included changes:
M browser/base/content/test/general/trackingPage.html
M netwerk/test/unit/test_trackingProtection_annotateChannels.js
M toolkit/components/antitracking/test/browser/browser_imageCache.js
M toolkit/components/antitracking/test/browser/browser_subResources.js
M toolkit/components/antitracking/test/browser/head.js
M toolkit/components/antitracking/test/browser/popup.html
M toolkit/components/antitracking/test/browser/tracker.js
M toolkit/components/url-classifier/tests/UrlClassifierTestUtils.jsm
M toolkit/components/url-classifier/tests/mochitest/test_trackingprotection_bug1312515.html
M toolkit/components/url-classifier/tests/mochitest/trackingRequest.html

Note that changing the commit message will not automatically update the revision details in Phabricator. This has to be done manually in the Web UI if required.

Simon Lyall: Audiobooks – July 2018

Wed, 2018-08-01 20:04

The Return of Sherlock Holmes by Sir Arthur Conan Doyle

I switched to Stephen Fry for this collection. Very happy with his reading of the stories. He does both standard and “character” voices well and is not distracting. 8/10

Roughing It by Mark Twain

A bunch of anecdotes and stories from Twain’s travels in Nevada & other areas in the American West. Quality varies. Much good but some stories fall flat. Verbose writing (as was the style at the time…) 6/10

Asteroids Hunters by Carrie Nugent

Spin off of a Ted talk. Covers hunting for Asteroids (by the author and others) rather than the Asteroids themselves. Nice level of info in a short (2h 14m) book. 7/10

Things You Should Already Know About Dating, You F*cking Idiot by Ben Schwartz & Laura Moses

100 dating tips (roughly in order of use) in 44 minutes. Amusing and useful enough. 7/10

Protector – A Classic of Known Space by Larry Niven

Filling in a spot in Niven’s universe. Better than many of his Known Space stories. Great background on the Pak in Hard Core package. Narrator gave everybody strong Australian accents for some reason. 7/10

The Inevitable: Understanding the 12 Technological Forces That Will Shape Our Future by Kevin Kelly

Good book on 12 long term “deep trends” ( filtering, remixing, tracking, etc ) and how they have worked over the last and next few decades (especially in the context of the Internet). Pretty interesting and mostly plausible. 7/10

Caesar’s Last Breath: Decoding the Secrets of the Air Around Us by Sam Kean

Works its way through the gases in, and evolution of, Earth's atmosphere, their discovery and several interesting asides. Really enjoyed this, would have enjoyed 50% more of it. 9/10

David Rowe: Rpitx and 2FSK

Tue, 2018-07-31 18:04

This post describes tests to evaluate the use of rpitx as a 2FSK transmitter.

About 10 years ago I worked on the Village Telco – a system for community telephone networks based on WiFi. At the time we used WiFi SoCs which were open source at the OS level, but the deeper layers were completely opaque which led (at least for me) to significant frustration.

Since then I’ve done a lot of work on the physical layer, in particular building my skills in modem implementation. Low cost SDR has also become a thing, and CPU power has dropped in price. The physical layer is busy migrating from hardware to software. Software can, and should, be free.

So now we can build open source radios. No more chip sets and closed source.

Sadly, some other aspects haven’t changed. In many parts of the world it’s still very difficult (and expensive) to move an IP packet over the “last 100 miles”. So, armed with some new skills and technology, I feel it’s time for another look at developing world and humanitarian communications.

I'm exploring the use of rpitx as the heart of HF and UHF data terminals. This clever piece of software turns a Raspberry Pi into a radio transmitter. Evariste (F5OEO), the author of rpitx, has recently developed the v2beta branch that has improved performance, and includes some support for FreeDV waveforms.

Running Tests

I have some utilities for the Codec 2 FSK modem that generate frames of test bits. I modified the fsk_mod_ext_vco utility to drive a utility Evariste kindly wrote for FreeDV experiments with rpitx. So here are the command lines that generate 600 seconds (10 minutes) of 100 bit/s 2FSK test frames, and transmit them out of rpitx, using a 7.177 MHz carrier frequency:

$ ./fsk_get_test_bits - 60000 | ./fsk_mod_ext_vco - ~/rpitx/2fsk_100.f 2 --rpitx 800 100
~/rpitx $ sudo ./freedv 2fsk_100.f 7177000

On the receive side I used my FT-817 connected to FreeDV to sample the signal as a wave file, then fed the signal into C and Octave versions of the demodulator. The RPi is top left at rear; the HackRF in the foreground was used initially as a reference transmitter:

Results

It works really well! One of the most important tests for any modem is adding calibrated noise and measuring the Bit Error Rate (BER). I tried Eb/No = 9dB (-5.7dB SNR), and obtained 1% BER, right on theory for a 2FSK modem:

$ ./cohpsk_ch ~/Desktop/2fsk_100_rx_rpi.wav - -26 | ./fsk_demod 2 8000 100 - - | ./fsk_put_test_bits -
FSK BER 0.009076, bits tested 11900, bit errors 108
SNR3k(dB): -5.62

This line takes the sample wave file from the FT-817, adds some noise using the cohpsk_ch utility, then pipes the signal to the FSK demodulator, before counting the bit errors in the test frames. I adjusted the 3rd “No” parameter in cohpsk_ch by hand until I obtained about 1% BER, then checked the SNR against the theoretical SNR for an Eb/No of 9dB.
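
For reference, the Eb/No to SNR conversion can be checked with a few lines of Python (the 3000 Hz noise bandwidth is the SNR3k convention used above):

import math

ebno_db = 9.0        # Eb/No used for the test
bit_rate = 100.0     # bit/s
noise_bw = 3000.0    # Hz, the "SNR3k" measurement bandwidth

snr_db = ebno_db + 10 * math.log10(bit_rate / noise_bw)
print(f"{snr_db:.1f} dB")   # about -5.8 dB, in line with the -5.7 dB quoted above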

Here are some plots from the Octave version of the demodulator, with no channel noise applied. The first plot shows the time and frequency domain signal at the input of the demodulator. I set the shift at 800 Hz, so you can see one tone at 800 Hz, the other at 1600 Hz:

Here is the output of the FSK demodulator filters (red and blue for the two filter outputs). We can see a nice separation, but the red "high" level is a bit lower than blue. Red is probably the 1600 Hz tone; the FT-817 has a gentle low pass filter in its output, reducing higher frequency tones by a few dB.

There is some modulation on the filter outputs, which I think is explained by the timing offset below:

The sharp jump at 160 samples is expected, that’s normal behaviour for modem timing estimators, where a sawtooth shape is expected. However note the undulation of the timing estimate as it ramps up, indicating the modem sample clock has a little jitter. I guess this is an artefact of rpitx. However the BER results are fine and the average sample clock offset (not shown) is about 50ppm which is better than many sound cards I have seen in use with FreeDV. Many of our previous modem transmitters (e.g. the first pass at Wenet) started with much larger sample clock offsets.

A common question about rpitx is “how clean is the spectrum”. Here is a sample from my Rigol DSA815, with a span of 1MHz around the 7.177 MHz tx frequency. The Tx power is actually 11dBm, but the marker was bouncing around due to FSK modulation. On a wider span all I can see are the harmonics one would expect of a square wave signal. Just like any other transmitter, these would be removed by a simple low pass filter.

So at 7.177 MHz it’s clean to the limits of my spec analyser, and exceeds spectral purity requirements (-43dBc + 10log(Pwatts)) for Amateur (and I presume other service) communications.