At Procera, we actively seek out challenges. So when you’ve already got the world’s most sophisticated DPI engine and the largest collection of signatures on the planet, where do you go next? We needed something cool for our analysts and programmers in the signature group, so we thought we’d go after something really hard and really valuable. We didn’t call it Content Intelligence at the time, and from the very start we were slightly less ambitious: we just wanted to add URL classification so that we could do Parental Control and other kinds of simple URL filtering.
But let’s start there, with URL filtering – what is it and how do you implement it at a scale that works for Tier 1 operators? URL filtering is the technology whereby a networking device can block traffic to a particular website (let’s say URL) by looking up the category of each URL it sees in passing traffic. The categories come from a huge database of URLs. You can build such a DB yourself, but it’s a lot of work and frankly, for a company like Procera, simply not worth it. We chose to source our filtering database from a recognized vendor whose data is widely deployed throughout the world.
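To make the idea concrete, here’s a toy sketch in Python of what a filtering lookup boils down to. The database entries, category names, and policy are all invented for illustration – a real vendor database has tens of millions of entries, not three.

```python
# Toy sketch of URL filtering: look up the hostname's category in a
# pre-built database, then apply a block-list policy. All entries and
# category names below are made up for illustration.
from urllib.parse import urlparse

# Hypothetical slice of a vendor-supplied URL category database.
CATEGORY_DB = {
    "example-news.com": "news",
    "example-casino.com": "gambling",
    "example-videos.com": "entertainment",
}

BLOCKED_CATEGORIES = {"gambling"}  # e.g. a Parental Control policy

def classify(url):
    """Look up the category for a URL's hostname; 'unknown' if absent."""
    host = urlparse(url).hostname or ""
    return CATEGORY_DB.get(host, "unknown")

def allowed(url):
    """A flow is allowed unless its category is on the block list."""
    return classify(url) not in BLOCKED_CATEGORIES
```

That’s the entire concept – the hard part, as we’ll see, is doing this at line rate against a database that doesn’t fit anyone’s idea of “small”.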
But while looking at this, the signature team realized that the real value of URL classification for Procera (and network operators) is NOT in URL filtering – it’s in analytics. And no surprise, the databases out there are not built for that; they’re developed just for filtering. And not all categories are really used for filtering – mostly it’s filtering of porn and other high-profile targets. So while the databases are large (huge, long-tailish data), they lack the kind of detail we need for really useful analytics – like multi-dimensional analysis of a URL. For example, not only classify a site as “entertainment”, but as “sport”; don’t stop there but go further to “baseball”; and then go really crazy and add “News articles”, “Includes Ads”, “Frequently updated content”, “Youthful audience”, “Horrible spelling”, etc.
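One way to picture the difference: a filtering database maps a URL to a single flat category, while analytics wants a hierarchical category path plus a set of orthogonal tags. The sketch below is just one possible data model, with invented field names and example values – it’s not how any vendor (or Procera) actually stores this.

```python
# Sketch of multi-dimensional URL metadata: a hierarchical category
# path plus orthogonal tags, versus the single flat category a
# filtering database gives you. All values here are invented.
from dataclasses import dataclass, field

@dataclass
class UrlProfile:
    # Hierarchical path: entertainment -> sport -> baseball
    category_path: tuple
    # Orthogonal attributes useful for analytics, not filtering
    tags: frozenset = field(default_factory=frozenset)

PROFILE_DB = {
    "example-baseball-blog.com": UrlProfile(
        category_path=("entertainment", "sport", "baseball"),
        tags=frozenset({"news articles", "includes ads",
                        "frequently updated content",
                        "youthful audience"}),
    ),
}

def matches(profile, prefix):
    """True if the category path starts with the given prefix, so one
    record answers queries at any depth of the hierarchy."""
    return profile.category_path[:len(prefix)] == tuple(prefix)
```

The nice property of the path-plus-tags shape is that a single record serves both the coarse question (“is this entertainment?”) and the fine-grained one (“is this baseball content with ads aimed at a young audience?”).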
Typically, URL filtering solutions are racks and racks of equipment for even modest amounts of traffic (barely ten gigs). That’s because the typical URL filtering devices out there are developed for the enterprise (think UTM devices), with very different scalability requirements compared to the space Procera plays in. IPE (Intelligent Policy Enforcement) devices like PacketLogic are built for maximum scalability and performance. What does that really mean? Well, there are a lot of different ways to solve a problem in software development. They range from the slowest stupid way (think bogosort), to the quick and dirty (bubble sort), to the smart efficient way (qsort, heapsort), to the mind-boggling (positronic sort – I just made that up, I think). Most of the code we write at Procera is of that mind-boggling kind – the scale is just crazy. We’re handling tens of millions of packets per second per CPU core, while doing a lot more with every packet than pretty much any other networking device. You can only do that by writing the most efficient code possible – and always looking to improve on everything that you did.
So let’s apply that kind of mentality to URL classification. Say that we have a million or so URL lookups to do per core per second. Know of any databases running a single core that can support that kind of scale? Didn’t think so.
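To see why off-the-shelf databases don’t stand a chance, run the back-of-the-envelope arithmetic. The clock speed and packet rate below are assumptions for illustration (a generic 3 GHz core, 20 Mpps standing in for “tens of millions”), not PacketLogic specs – but the orders of magnitude are what matter.

```python
# Back-of-the-envelope cycle budget. The clock speed and packet rate
# are illustrative assumptions, not measured PacketLogic numbers.
clock_hz = 3_000_000_000         # assumed 3 GHz core
lookups_per_sec = 1_000_000      # the lookup rate quoted above
packets_per_sec = 20_000_000     # "tens of millions" of packets/core

cycles_per_lookup = clock_hz // lookups_per_sec   # budget per lookup
cycles_per_packet = clock_hz // packets_per_sec   # budget per packet
```

Under those assumptions you get roughly 3000 cycles per lookup and only around 150 cycles per packet – and the lookup is just one of many things the fast path does with each packet. A round trip to an external database server costs orders of magnitude more than the entire budget.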
Even if they existed, getting the packets from the fast path (PLOS) into userspace to do the lookup would suck. In PLOS we like run-to-completion approaches, where we optimize something until it’s cheap enough to run in the normal packet flow.
As such, we can afford only a very limited cycle budget. There are RX queues that hold a large handful of packets while we process an expensive one, but if we take too long the RX queue fills up and we drop packets. Can’t have that(!), so let’s make sure we’re fast. So we move the huge URL database into PLOS memory, implement the fastest possible lookup algorithm and BAM – content categorization at huge scale.
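For a flavor of what an in-memory lookup of that kind looks like, here’s a minimal fixed-size hash table with open addressing – constant-ish work per lookup, no allocation on the hot path. This is a generic textbook structure in Python, not Procera’s actual algorithm or layout; the real thing lives in C with a cache-friendly, 45-million-entry layout.

```python
# Minimal sketch of a fast in-memory lookup: fixed-size hash table
# with linear probing. Generic illustration, not the PLOS algorithm.

TABLE_SIZE = 1 << 16  # power of two, so we can mask instead of mod
_keys = [None] * TABLE_SIZE
_vals = [None] * TABLE_SIZE

def _slot(key):
    return hash(key) & (TABLE_SIZE - 1)

def insert(url, category):
    i = _slot(url)
    while _keys[i] is not None and _keys[i] != url:
        i = (i + 1) & (TABLE_SIZE - 1)   # linear probing on collision
    _keys[i], _vals[i] = url, category

def lookup(url):
    i = _slot(url)
    while _keys[i] is not None:          # probe until an empty slot
        if _keys[i] == url:
            return _vals[i]
        i = (i + 1) & (TABLE_SIZE - 1)
    return None                          # not in the database
```

The point of the sketch is the shape of the work per lookup: one hash, a handful of memory probes, no locks, no syscalls – the kind of thing that can actually fit inside a fast-path cycle budget.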
“No, you can’t do that, it’s 45 million URL entries and well, nobody does it like that! You’re supposed to run this SDK from the DB vendor that implements fast lookups with trees and callbacks and ….”
Yeah, whatever, honeybadger doesn’t really care, our way is faster.
And we could stop right there, with the most powerful URL classifier in networking, but unfortunately in the world of IPE, ‘It’s never easy’. We also need a super-scalable way of storing analytics on these URL categories, we need hitless updates to our database, we need LiveView extensions to look at URL categories in real time, and so on – at ever-increasing performance and scalability requirements for the largest network operators in the world.
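“Hitless update” deserves a word of explanation. One classic way to get it – sketched below, and not necessarily the way we do it – is the read-copy-update idea: build the complete replacement database off to the side, then publish it with a single reference swap, so no lookup ever sees a half-updated table and no packet waits on a lock.

```python
# Sketch of a hitless database update via a single reference swap
# (the read-copy-update idea). Illustrative only, not our mechanism.

_db = {"example.com": "news"}   # the current live database

def lookup(url):
    snapshot = _db              # grab the reference once, so the whole
    return snapshot.get(url)    # lookup sees one consistent snapshot

def hitless_update(new_entries):
    global _db
    new_db = dict(_db)          # build the replacement off-line
    new_db.update(new_entries)
    _db = new_db                # one swap: readers never block or drop
```

Readers keep running against whatever snapshot they grabbed; the old table is reclaimed once nobody references it. The update cost is paid off the hot path, which is exactly where you want it.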
And that’s just the things we’ve thought of so far.