Reverse Engineering dead protocols – Strand ShowNet

As a kid I did lighting for stage shows.

Turns out it was great training for a career in telecom: I learned basic rigging, working at heights, electrical work, patching rats' nests of cables, and the shared camaraderie that comes from having stayed up all night working on something no one would ever notice – unless you hadn't done it.

At the time the Strand 520i lighting console was the coolest thing ever, it had support for 4 DMX universes (2048 channels – Who could ever need more than that!?) and cost more than a new car.

One late Friday night browsing online I found one for sale by a University in the state I live in, for $100 or best offer. You better believe I smashed the buy now button so hard my mouse almost broke. I was finally going to have my very own.

I spent the weekend reading through the old manuals, remembering how to use all the features, then dragged my partner for the road trip the following Monday morning to pick it up.

But before I could do anything fun I had to find a PS/2 keyboard and a VGA screen, which took me a few more days.

Then I needed something to receive DMX – everything uses ArtNet (DMX over IP) these days, and the visualisers for simulating an arena / stage lighting setup all take ArtNet too, so I ordered a DMX to ArtNet converter.

Inside the unit is pretty much a standard PC (OG Pentium in the Strand 520) with an ISA card for all the lighting control stuff.

The clicking hard drive managed to boot, but I didn’t think it’d last long, having been made more than 20 years ago. So I created a disk image and copied the file system onto a CF card using a CF to IDE adapter. This trick meant it booted faster than ever before.

Clicky HDD before it became a CF

One thing I'd read about online was that the VARTA battery had a tendency to leak battery acid all over the PCB. This one had yet to fully spill its guts, but it was looking a bit bulgy and had already started to leak a little.

The battery (I'd read) was only for storing info in a power-loss scenario, and if the battery isn't present it just slows the boot time, as everything has to be read from disk. So I took the leap of faith and cut the battery out, and lo, it still all boots.

Slightly furry VARTA battery

So now I was OK to get the desk online properly, but I was getting semi-regular lock-ups where DMX would stop outputting and inputs on the console were not read, even though the underlying PC was still working.

I spent a lot of time debugging this. BIOS settings, interrupts, diving into how ISA works, replacing the battery I'd removed with a brand new one; at one point I even broke out the scope, but nothing worked.

Around the same time I noticed the Ethernet port (BNC!) would work if I just ran plain DOS (I could ping from DOS, but when the application started the NIC would go dead), which made me think I might be facing a hardware fault with the "CS" – the processor for the console.

The desk itself with the face off, and the CF adapter being mounted

Alas, being nearly 30 years old (this unit was made in 1996) there aren't a great number of them around to test with, but I found another one on eBay for $50. It was in the UK though, and they weigh a lot.
Shipping this thing was not an option.

But for a bit of extra cash the seller was willing to crack the case open, strip out the two main boards and post me just those. This had the added bonus that the motherboard and CPU sent from the UK came from a 520i, meaning it has the Pentium II processor – this Strand 520 was now going to be a Strand 520i.

A month later a box appeared at my door containing the boards, but the battery on the CS board from the UK had well and truly spilled its guts, leaving toxic sludge around all the nearby components. With a can of PCB cleaner and a toothbrush (which I will not be using to brush teeth with anymore) I cleaned it up as best I could, but the fan output from the board was well and truly dead, with some of the SMD components simply eaten away by the acid.

So I put everything back into the case and wired it up. The mounts for the Motherboard were slightly different, and the software that is used for the 520i is different from the 520 (without the i).

The HDD from the UK was unable to boot, but I was able to get it to spin up enough to copy off the ~5MB of files I needed, then I did a fresh install of MS DOS and copied across the StrandOS installer.

Finally fully functional

Finally I had a stable working console. Not just that but the Strand Networker application was now available to me. So I plugged into the 10Mbps connection and set the console to output to Network as well as DMX.

Enabling the “networker” for network DMX transmission

I cranked open Wireshark and there was a mystery signal sent to the broadcast address on UDP…

I patched a single DMX channel and changed the value and when I viewed the data in Wireshark I could see a hex representation of the DMX 0-255 value.

Easy, I thought to myself: it'll just be a grid of channels, each with its value as hex. Ha! I was wrong.

Turns out Strand ShowNet used a conditional form of "Run Length Encoding" (RLE) compression: if you've got channels 1 through 5 at 50%, rather than encoding this as 5 bytes each showing 0x80, it uses 2 bytes to indicate 5 sequential channels (the run length) and then the value (0x80). Then there's another bit to denote how many places to move forward and whether the next channel is using RLE or not.
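
The actual wire format is what the code linked below handles, but just to illustrate the run-length idea (this is a toy decoder, not the real ShowNet encoding):

def rle_decode(encoded: bytes) -> list:
    """Toy decoder: treat each pair of bytes as (run length, value)."""
    channels = []
    for i in range(0, len(encoded) - 1, 2):
        run_length, value = encoded[i], encoded[i + 1]
        channels.extend([value] * run_length)
    return channels

# Channels 1-5 all at 50% (0x80) collapse to just two bytes:
print(rle_decode(bytes([5, 0x80])))  # [128, 128, 128, 128, 128]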

The code got messy; it’s not the best thing I’ve ever written but it works for 2 full universes of DMX (I need to spend more time to understand where the channel encoding overflow happens as I end up a few channels ahead of where I should be on universe 3 and above).

The code is available on Github and I’d love to know if anyone’s using it with these old dinosaurs!

https://github.com/nickvsnetworking/PyShowNet

Presenting the Caller Name in IMS

SIP has got a multitude of ways of showing Caller ID – PAI, R-PAI, From, even Contact – but the other day I got a tip (Thanks John!) that you can set a name in the "display name" part of the P-Asserted-Identity for the leg from the TAS to the UE, and it'll show up on the phone, and they're right.

For example I put:

P-Asserted-Identity: "Nick Jones" <sip:[email protected]>

And lo and behold when I called a test phone on my desk (A Samsung IMS debug phone) here’s what I saw:

There are no contacts defined in this phone, that name is just coming from the SIP INVITE that goes to the phone.

Support for this feature is hit-and-miss on different IMS stacks on different phones, and of course is Carrier Bundle dependent, but it does work.

One thing that it doesn’t do is show the name in the call history, and if you go to “Add as Contact” it still makes you enter the name, clearly that’s not linked in, but it’s a kinda neat feature.

Dell Server I/O Latency

For the past few years I've run a Dell R630 as one of our lab / testing boxes. It's hosted down the road from me, and with 32 cores and 256 GB of RAM, it's got enough grunt to run what we need for testing stuff on the East coast and messing around. We've got a proper DC with compute in Sydney and Perth, but for breaking stuff, I wanted my own lab.

This box started on VMware, but over a long period of time I'd been seeing really odd disk IO behavior that I couldn't get to the bottom of.

Things would hang: for example, you'd go to edit a file on a VM in vi and have to wait 20 seconds for the file to open, yet I could cat the same file instantly, and other files I could vi instantly.

I initially thought it was that dreaded issue with Ubuntu boxes being unable to resolve their own hostname and waiting for DNS to time out every. single. time. it. did. anything, but I ruled that out when I got the same behavior with live CDs and non-Linux OSes.

In the end I narrowed it down to being related to Disk IO.
I read Matt Liebowitz's book on VMware vSphere Performance, assuming there was a setting somewhere inside VMware I had wrong.

Around the same time all the unpleasantness was going down with VMware and licencing changes, and so I moved to Proxmox (while keeping a virtualized copy of VMware running inside Proxmox).

But switching hypervisors didn’t fix the issue, so I could rule that out.
So I splashed out and swapped the 15k magnetic SAS drives in the RAID for new SSDs, but still the problem persisted – it wasn't the drives, and I wasn't seeing a marked increase in performance.

I did a bunch of tuning on the PERC card with disk caching, write-ahead, etc, but still the problem persisted.

At this stage I was looking at the PERC card or (less likely) the CPU/motherboard/RAM combo.

So over a quiet period, I moved some workloads back onto one of the old 15k magnetic SAS drives that I had pulled out when fitting the SSDs, and benchmarked the disk performance on the standalone SAS drive to compare against the RAID SSD performance.

Here's how the SSDs in the RAID compare to a standalone SAS drive (not in RAID):

Metric | LXC on SSD RAID | Standalone SAS Drive | Difference (Standalone vs. RAID)
Initial Write (Child) | 314,682.91 kB/sec | 382,771.88 kB/sec | +21.6%
Initial Write (Parent) | 177,522.16 kB/sec | 119,112.43 kB/sec | -32.9%
Rewrite (Child) | 428,456.94 kB/sec | 470,486.44 kB/sec | +9.8%
Rewrite (Parent) | 180,007.46 kB/sec | 73,721.11 kB/sec | -59.0%
Random Read (Child) | 404,707.62 kB/sec | 406,057.00 kB/sec | +0.3%
Random Read (Parent) | 404,410.90 kB/sec | 397,718.31 kB/sec | -1.7%
Random Write (Child) | 126,042.59 kB/sec | 355,304.22 kB/sec | +181.9%
Random Write (Parent) | 4,497.75 kB/sec | 68,971.35 kB/sec | +1,434%

That Random Write (Parent) figure at the bottom – yeah, that would explain the "weird" behavior I'd been seeing on guest OSes.
As part of editing a file, vi creates a lock file, which gets written to a random sector, and that's what was taking so long (while cat, which doesn't create a lock file, wouldn't suffer the same).
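
If you want to see this class of problem for yourself without a full benchmark suite, a rough sketch like this (just timing a burst of small synchronous file writes, the sort of thing an editor's lock file triggers) makes it very obvious:

import os
import time

def small_sync_writes_per_sec(path: str, count: int = 100, size: int = 4096) -> float:
    """Create, sync-write and delete a bunch of small files, like an editor's lock/swap files."""
    data = os.urandom(size)
    start = time.monotonic()
    for i in range(count):
        name = f"{path}.{i}"
        fd = os.open(name, os.O_CREAT | os.O_WRONLY | os.O_SYNC)
        os.write(fd, data)
        os.close(fd)
        os.unlink(name)
    return count / (time.monotonic() - start)

print(f"{small_sync_writes_per_sec('/tmp/io_probe'):.1f} small sync writes/sec")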

Okay – So now I know it’s the PERC at fault or the RAID config on it.

Next I put in another SSD, the same type as those in the RAID, but as a standalone drive (not in the RAID), and here are the results:

Metric | SSD in RAID-5 | Standalone SSD | Difference (Standalone vs. RAID-5)
Sequential Writes (Child) | 314,682.91 kB/sec | 511,280.50 kB/sec | +62.5%
Sequential Writes (Parent) | 177,522.16 kB/sec | 128,016.83 kB/sec | -27.9%
Sequential Rewrites (Child) | 428,456.94 kB/sec | 467,547.38 kB/sec | +9.1%
Sequential Rewrites (Parent) | 180,007.46 kB/sec | 79,698.26 kB/sec | -55.7%
Random Reads (Child) | 404,707.62 kB/sec | 439,705.72 kB/sec | +8.6%
Random Reads (Parent) | 404,410.90 kB/sec | 437,549.83 kB/sec | +8.2%
Random Writes (Child) | 126,042.59 kB/sec | 319,127.09 kB/sec | +153.2%
Random Writes (Parent) | 4,497.75 kB/sec | 125,458.00 kB/sec | +2,689.3% (!)

So the parent sequential write and rewrite figures were down on the standalone disk, but the other figures all look way better on the standalone SSD.

So that's my problem. I figure it's something to do with how the RAID is configured, but after messing around for a few hours with all the permutations of settings I could think of, I couldn't get these figures to markedly improve.

As this is a lab box I’ll just dismantle the RAID and run each LXC container / VM on a local (non-RAID) SSD, as data loss from a dying disk is not a concern in my use case, but hopefully this might be of use to someone else seeing the same.

MBR & GBR Values in Bearer Level QoS

The other day I had a query about a roaming network that was sending the Bearer Level QoS parameters in the Create Session Request with 0 Kbps, up and down, rather than populating the MBR values.

I knew that for Guaranteed Bit Rate bearers these are of course set, but for non-GBR bearers (QCI 5 to 9) I figured the MBR would still be populated, but that's not the case.

So what gives?

Well, according to TS 29.274:

For non-GBR bearers, both the UL/DL MBR and GBR should be set to zero.

So there you have it: if it's not a GBR (QCI 1-4) bearer, then these values are always 0.
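
If you're writing code that builds or validates these IEs, the rule boils down to something as simple as this little sketch (based purely on the rule above):

# GBR QCIs per the post above; QCI 5-9 are non-GBR.
GBR_QCIS = {1, 2, 3, 4}

def bearer_qos_is_valid(qci: int, mbr_ul: int, mbr_dl: int, gbr_ul: int, gbr_dl: int) -> bool:
    """Check the Bearer Level QoS bitrates against the TS 29.274 rule quoted above."""
    if qci in GBR_QCIS:
        return True  # GBR bearers carry real MBR/GBR values
    # Non-GBR bearers: UL/DL MBR and GBR should all be zero
    return mbr_ul == mbr_dl == gbr_ul == gbr_dl == 0

print(bearer_qos_is_valid(9, 0, 0, 0, 0))          # True - what the roaming network was sending
print(bearer_qos_is_valid(9, 50000, 50000, 0, 0))  # False - what I had expected to see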

CGrates – Multiple Rates & Derived Charging with ChargerS

I’ve always been kinda intrigued by the idea of parallel universes, the idea that there are infinite copies of the universe, with myself and all the people I care about, but each with slight differences to the universe I inhabit.

The ChargerS module provides the Butterfly Effect needed to create infinite instances of our CGrateS events, each with subtle differences.

Typically if you're charging subscribers for calls, someone else (or multiple someones) may be charging you for those calls. For example, you charge your subscribers for an outbound call, but the carriers you interconnect with will charge you for terminating those calls to their subscribers, and for incoming calls you may want to charge the other carriers that terminate calls into your network.

Defining in CGrateS what your suppliers charge you, what you charge suppliers, reseller rates, commissions, or any other variation of call charge allows us to:

  • See profit on each call
  • Understand supplier costs
  • Enable reselling at different rates
  • Open the door to Least-Cost Routing (without knowing the cost, we can't find the cheapest)
  • Ensure you don’t have calls where you make a loss (Supplier charged you more than you charged the customer)

So how do we do this?

Well, we do this with ChargerS.

When I first looked at CGrateS, the ChargerS module seemed like an extra step that did nothing.

In ngrep you'd see the ChargerSv1.ProcessEvent request and the response, but it didn't seem to really do anything, and it's a PITA when you don't have a Charger defined and everything stops working.

I’ve spoken a lot about SIP on this blog, and I’m going to assume some level of familiarity with telephony since we’re talking about CGrateS (which is mostly used for telephony), but the best concept I can relate ChargerS to is Serial Forking in SIP, but for the CGrateS event.

A single CGrateS “event” (JSON RPC) comes into CGrateS from wherever, but with ChargerS, we can fork that single event into multiple CGrateS events, which are all treated as unique events.

This is where it starts to get interesting. Let's say we want to calculate a supplier cost and a retail cost: with ChargerS we define a rule for supplier and a rule for retail. One single event comes into CGrateS, but with ChargerS set up to create a retail and a supplier event, there are now two events inside CGrateS – one for the supplier and one for the retail.

First we’ll define a default boring charger:

{
    "method": "APIerSv1.SetChargerProfile",
    "params": [
        {
            "ID": "CHARGER_Default",
            "FilterIDs": [],
            "AttributeIDs": ["*none"],
            "RunID": "default",
            "Weight": 0
        }
    ]
}

Alright, so far so good, but now we’ll define a second charger, and this one will be for calculating the retail rate for a call.

{
    "method": "APIerSv1.SetChargerProfile",
    "params": [
        {
            "ID": "CHARGER_Retail",
            "FilterIDs": [],
            "AttributeIDs": ["*constant:*req.Category:RetailCharge"],
            "RunID": "charger_retail",
            "Weight": 0
        }
    ]
}

So what did we just do?

Well, now when the ChargerSv1.ProcessEvent request hits ChargerS, two events will come out and get processed by the rest of CGrateS as if they're unique events / calls to be rated.

We’ve cloned our event, now we’ve got two copies of the same event.

The first copy (the original event) will be treated exactly as it is now; for the other, a new event is generated inside CGrateS, a copy of the original event except for a few minor changes.

Let’s take a look at what happens to our event going through ChargerS when we generate a CDRsV2.ProcessExternalCDR API request:

{
    "method": "ChargerSv1.ProcessEvent",
    "params": [
        {
            "Tenant": "cgrates.org",
            "ID": "2645818",
            "Time": null,
            "Event": {
                "Account": "Nick_Test_123",
                "AnswerTime": "2024-12-26T12:34:44+11:00",
                "CGRID": "18d3e23ac3727474539f29cc11694cac11fb5e32",
                "OriginID": "95fff282-c329-11ef-8e4e-98fa9b127b52",
                "RunID": "*default",
                ...
                "Subject": "Nick_Test_123",
                "Tenant": "cgrates.org",
                "ToR": "*voice",
                "Usage": 150000000000
            }
        }
    ],
    "id": 20
}

But now let’s look at what comes out of this request to ChargerS:

{
    "id": 20,
    "result": [
        {
            "ChargerSProfile": "DEFAULT",
            "AttributeSProfiles": null,
            "AlteredFields": [
                "*req.RunID"
            ],
            "CGREvent": {
                "ID": "5fd2d6a",
                "Event": {
                    "Account": "Nick_Test_123",
                    "AnswerTime": "2024-12-26T12:44:40+11:00",
                    "CGRID": "3c01050a3f49fb215e318523dcd4255797d50145",
                    "Category": "call",
                    "RunID": "default"
                }
            }
        },
        {
            "ChargerSProfile": "CHARGER_Retail",
            "AttributeSProfiles": [
                "*constant:*req.Category:RetailCharge"
            ],
            "AlteredFields": [
                "*req.RunID",
                "*req.Category"
            ],
            "CGREvent": {
                "ID": "5fd2d6a",
                "Event": {
                    "Account": "Nick_Test_123",
                    "AnswerTime": "2024-12-26T12:44:40+11:00",
                    "CGRID": "3c01050a3f49fb215e318523dcd4255797d50145",
                    "Category": "RetailCharge",
                    "RunID": "charger_retail"
                }
            }
        }
    ],
    "error": null
}

I've tried to keep the above example as minimal as possible, but if we have a look we can now see two events. The first is our default charger, where nothing is changed; it's got the same Category as we set on the ProcessExternalCDR request (call) and the RunID is "default", per the default charger.

But look below and we've got another copy. This time the RunID is set to charger_retail, because that's what we set it to inside the RunID parameter for the charger named CHARGER_Retail. This means when filtering CDRs we'll be able to spot these ones really easily, and know each is a fork of another event.

But importantly we’ve changed some of the values in the CGrateS Event, the same way AttributeS changes stuff.

So what have we changed? Well the Category of the new request is now RetailCharge.

Now if we cast our mind back to setting the RatingProfile back in Tutorial 3, you may remember we set the Category on the RatingProfile.

Now is when this matters. By setting different Categories in our Rating Profiles, we can create a new RatingProfile with the Category set to RetailCharge, referencing a whole different RatingPlan with different destinations and rates, and the second event that was forked by ChargerS will match that RatingProfile and the RatingPlans that go with it.
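
For completeness, the retail-facing RatingProfile would look something like this – a sketch only, assuming you've already loaded a RatingPlan called RP_Retail (swap in whatever your plan is actually called):

{
    "method": "APIerSv1.SetRatingProfile",
    "params": [
        {
            "Tenant": "cgrates.org",
            "Category": "RetailCharge",
            "Subject": "*any",
            "RatingPlanActivations": [
                {
                    "ActivationTime": "2024-01-01T00:00:00Z",
                    "RatingPlanId": "RP_Retail"
                }
            ]
        }
    ]
}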

For everything matching we’ll get two CDRs (if we’re calling *cdrs that is) and they’re treated as totally separate records.

Think about it; by defining a new RatingProfile with category Wholesale with your wholesale rate, and then creating a Charger for that category, you’ll have a retail CDR and a wholesale CDR. Same for reseller rates, commissions, anything!

We’re using this in one of our networks to handle rating for all the SMS traffic, we’ve got various suppliers and sources for A2P and P2P traffic, and having additional chargers to calculate different rates in a different currency for billing our suppliers is super useful.

#Second charger used for calculating the A2P charge for SMS in USD
print(CGRateS_Obj_local.SendData({
    "method": "APIerSv1.SetChargerProfile",
    "params": [
        {
            "ID": "CHARGER_SMS_A2P",
            "FilterIDs": ["*string:~*req.Category:sms", "*notstring:~*req.Account:gsm_0340"],
            'AttributeIDs' : ['*constant:*req.RequestType:*rated;*constant:*req.Category:sms_a2p'],
            'RunID' : 'charger_a2p',
            'Weight': 0,
        }
    ]   }   ))

I’ve put the code examples on the Github repo.

Power cables feeding Ericsson RBS rack

RAN Builds – Can we just get the same connectors thanks?

Concrete, steel and labor are some of the biggest costs in building a cell site, and yet all the focus on cost savings for cell sites seems to be on the RAN, even though the actual RAN equipment isn't all that big a cost when you put it into context.

I think this is mostly because there aren’t folks at MWC promoting concrete each year.

But while I can't provide any fancy tricks to make towers stronger or need less concrete for foundations, there's some potential low-hanging fruit in the installation of sites that could save time (and therefore cost) during network refreshes.

I don’t think many folks managing the RAN roll-outs for MNOs have actually spent a week with a tower crew rolling this stuff out. It’s hard work but a lot of it could be done more efficiently if those writing the MOPs and deciding on the processes had more experience in the field.

Disclaimer: I'm primarily a core networks person – that's the job done from a comfy chair. These are just some observations from the bits of work I've done in the field building RAN.

Standardize Power Connectors

Currently radio units from the biggest RAN vendors (Ericsson, Nokia, Huawei, ZTE & Samsung) each use different DC power connectors.

This means if you’re swapping from one of these vendors to another as part of a refresh, you need new power connectors.

If you're lucky you're able to reuse the existing DC power cables on the tower, but that means you're up on a tower trying to re-terminate a cable – a fiddly job on the ground, and far worse in the air. Or if you're unlucky you don't have enough spare length on the DC cables to do the job, and then you're hauling new DC cables up a tower (and using more cable too).

The Nokia and Ericsson connectors are very similar, and with a pair of side cutters you can mangle an Ericsson RRU connector to work on a Nokia RRU and vice versa.

Huawei and ZTE, meanwhile, have opted for push connectors with the raw cables behind a little waterproof door.

If we could just settle on one approach (either is fine) this could save hours of install time on each cell site, extrapolate that across thousands of cell sites for each network, and this is a potentially large saving.

Standardize Fiber Cables

The same goes for waterproofing fibre: Ericsson has a boot kit that gets assembled inline over the connectors; Nokia has this too, as well as a rubber slide-over cover boot on pre-term cables.

Again, the cost is fairly minimal, but the time to swap is not. If we could standardize a breakout box format at the top of the tower and an LC waterproofing standard, we could save significant time during installs, and as long as you over-provision the breakout (the cost difference between a 6 core fibre and a 48 core fibre is a few dollars), you save significant time by not having to rerun cables.

Yes, we've all got horror stories about someone over-bending fiber, and reusing fibre between hardware refresh cycles carries that risk, but modern fiber is crazy tough, so the chances of damaging the reused fiber are pretty slim, and spare pairs are always a good thing.

Preterm DC Cables

Every cell site install features some poor person squatting on the floor (if they’re savvy they’ve got a camping stool or gardening kneeling mat) with a “gut buster” crimping tool swaging on connectors for the DC lugs.

If we just used the same lugs / connectors for all the DC kit inside the cell sites, we could have premade DC cables in various lengths (like everyone does with Ethernet cables now), rather than making each and every cable off a spool (even if it is a good ab workout).

I dunno, I’m just some Core network person who looks at how long all this takes and wonders if there’s a way it could be done better, am I crazy?

What’s the point of Subscribe in IMS – Does it do anything useful?

Nope – it doesn’t do anything useful. So why is it there?

The SUBSCRIBE method in SIP allows a SIP UAC to subscribe to events, and then get NOTIFY messages when that event happens.

In a plain SIP scenario (RFC 3261), we can imagine an IP Phone and a PBX scenario. I might have “Busy Lamp Field” aka BLF buttons on the screen of my phone, that change colour when the people I call often are themselves on calls or on DND, so I know not to transfer calls to them – This is often called the “presence” scenario as it allows us to monitor the presence of another user.

At a SIP level, this is done by sending a SUBSCRIBE to the PBX with the information about what I’m interested in being told about (State changes for specific users) and then the PBX will send NOTIFY messages when the state changes.

But in IMS you’ll see SUBSCRIBE messages every time the subscriber registers, so what are they subscribing for?

Well, you're just subscribing to your own registration status – but your phone already knows your own registration status, because it's, well, the registration status of the phone.

So what does it achieve? Nothing.

The idea was that in a fixed-mobile convergence scenario (keeping in mind that was one of the key goals of the 2008 IMS spec) you could have the BLF / presence functionality for fixed subscribers, but this rarely happens.

For the past few years we've just been sending a 200 OK to SUBSCRIBE messages in the IMS, with a super long expiry, just to avoid wasting clock cycles.
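
For the curious, the subscription in question looks roughly like this on the wire (an illustrative, trimmed-down example – the identities are placeholders and the exact headers vary by UE and carrier bundle):

SUBSCRIBE sip:+15550100@ims.example.net SIP/2.0
From: <sip:+15550100@ims.example.net>;tag=4a8b
To: <sip:+15550100@ims.example.net>
Event: reg
Accept: application/reginfo+xml
Expires: 600000
Content-Length: 0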

Before there was Grafana – How Telcos did metrics & observability at scale before computers

I love Grafana. I love metrics and observability. Nothing is more powerful than being able to see what’s going on inside your network/application/solar setup/weather station – you name it.

It’s never been easier to see what’s going on.

If I wanted to monitor my web app as I onboard more customers, Grafana is the go-to tool, but how was it done before the computer age? Let’s go back to the 1940s and look at how the telephone network handled observability and metrics…

This starts with introducing the “Call Meter”, “Subscriber Meter” or “Subs Meter” for short.

Detail of mechanical call meter Strathfield South Exchange
Source – field, field, field and chang's brilliantly beautiful "That Exchange Project"

The concept is pretty simple. Each telephone service (“subscriber” in telecom parlance) provided by the local telephone exchange gets a subscriber meter or “subs meter”.

When the subscriber (customer) makes a call, and the call is answered, a reverse of polarity on the line ticks the subscriber meter over by one digit.

Each of the meters on the left is a single telephone subscriber; each time they make a call, the meter ticks up by one position. Source – field, field, field and chang's brilliantly beautiful "That Exchange Project"

As you can imagine if you’ve got a telephone exchange that serves 10,000 customers, well you need 10,000 subscriber meters…

You need a lot of meters… Source – field, field, field and chang's brilliantly beautiful "That Exchange Project"

At the end of the month, someone takes a photo of all the meters on a film camera, sends it off to a billing center where they develop the photo, then calculate the difference in values from last month’s meter reading photo and this month’s meter reading photo, and bingo – there’s the number of calls the person made. You tabulate the cost on an adding machine and send off the invoice.

Each of the little blocks is a single subscriber meter, and the weird cone thing being held is a hood for the camera to photograph the values – Source: The Communications Museum Trust

Today we’d just use PromQL:

delta(subscriber_meter{phone_number="123456"}[30d])

Optional Sidebar for those asking “but what about Long Distance calls where you pay per minute?” – In a world where you pay per local call, regardless of length, this works just fine, but as more complicated scenarios like long distance calling were introduced, this presented a challenge, but this could be solved by reversing the line polarity at predefined intervals, to keep ticking up the subscriber meters during the call. Exchange Clocks provided a number of pulse outputs, like 1 pulse per second, 1 pulse per minute, etc, this 1 pulse per minute signal could be hooked up to the line reversal circuit for long distance calls, to trigger the line reversal every minute. This means if a local call was $0.40 untimed, if you made long distance calls at $0.40 per minute, then you just needed the exchange to reverse the line every minute to pulse the meter. 10 increments on the meter could mean 10 x $0.40 local calls or 10 minutes of $0.40 per minute long distance.

These meters were originally just for metering traffic, but engineers in the telephone network realised they could be used as generic “counters” for just about anything in the telephone network.

Let’s imagine you want to know how often a trunk line to another exchange runs out of capacity, well, you simply wire a meter to get triggered each time that condition happens, now you’ve got a counter for each time that event occurs.

Now let's say you want to know how often you run out of final selectors – well, throw another counter on it.

These same meters can be wired to count fault conditions.

Mechanical fault meters on old step-by-step test desk, Queanbeyan Exchange
Source – field, field, field and chang's brilliantly beautiful "That Exchange Project"

A pencil and a logbook is how you keep track of frequency of the event being triggered, and if you want to graph it out, graph paper, not Grafana.

As telephone systems increased in complexity more and more meters were used to track what’s going on, up until the time that computers could start to handle that process, when “Electronic Customer Metering” came into play with the early Stored Program Control exchanges.

Metering and charging equipment in Blakehurst Exchange
Source – field, field, field and chang's brilliantly beautiful "That Exchange Project"

Observability and metrics are so important for making software, but every time I define a "counter" in software for an event, I'm reminded of the clicking meters in a telephone exchange, knowing this is how it used to be done.

GTPv2 Instance IDs

I was diffing two PCAPs the other day trying to work out what’s up, and noticed the Instance ID on a GTPv2 IE was different between the working and failing examples.

Instance set inside the IE in an F-TEID

So what does it denote? Well, from TS 129.274:

If more than one grouped information elements of the same type, but for a different purpose are sent with a message,
these IEs shall have different Instance values.

So if we've got two IEs of the same IE type (as we often do – F-TEIDs with IE Type 87 may appear multiple times in the same message, each with a different F-TEID interface type), then we differentiate between them by Instance ID.

The only exception to this rule is where we've got the same data – that is, an IE with the exact same values and purpose that exists twice inside the message.
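
If you're poking at this in code rather than Wireshark, the Instance value is just the low nibble of the fourth octet of each IE header. A minimal sketch (not a full GTPv2 parser) for walking the IEs in a message body:

import struct

# Each GTPv2 IE starts with: Type (1 octet), Length (2 octets, value only),
# then a spare/Instance octet where the low 4 bits are the Instance value.
def walk_gtpv2_ies(payload: bytes):
    offset = 0
    while offset + 4 <= len(payload):
        ie_type, length, spare_instance = struct.unpack_from("!BHB", payload, offset)
        instance = spare_instance & 0x0F  # low nibble = Instance
        yield ie_type, instance, payload[offset + 4 : offset + 4 + length]
        offset += 4 + length

# Two F-TEID IEs (type 87) in the same message would come out of this as
# (87, 0, ...) and (87, 1, ...) - differentiated only by their Instance.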

It’s not Rocket Science – Tracking performance of OneWeb terminals

Last year we deployed some Hughes HL1120W OneWeb terminals in one of the remote cellular networks we support.

Unfortunately it was failing to meet our expectations in terms of performance and reliability – We were seeing multiple dropouts every few hours, for between 30 seconds and ~3 minutes at a time, and while our reseller was great, we weren’t really getting anywhere with Eutelsat in terms of understanding why it wasn’t working.

Luckily for us, Hughes (who manufacture the OneWeb terminals) have an unprotected API (*facepalm*) from which we can scrape all the information about what the terminal sees.

As that data is only available via an API we have to query, I knocked up a quick Python script to poll the API and expose the data as Prometheus metrics, so we could put it into Grafana and visualise what's going on with the terminals and the constellation.
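
The real exporter is linked at the end of this post, but the guts of it are no more complicated than this sketch (the API path and JSON field names here are made up for illustration – check the repo for the real ones):

import time
import requests
from prometheus_client import Gauge, start_http_server

# Hypothetical endpoint and field names - the real ones are in the repo linked below.
TERMINAL_API = "http://192.168.100.1/api/status"

signal_quality = Gauge("oneweb_signal_quality", "Signal quality reported by the terminal")
connected = Gauge("oneweb_connected", "1 if the terminal reports an active connection")

def poll_once():
    data = requests.get(TERMINAL_API, timeout=5).json()
    signal_quality.set(data.get("signalQuality", 0))
    connected.set(1 if data.get("connected") else 0)

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes this port
    while True:
        try:
            poll_once()
        except requests.RequestException:
            connected.set(0)  # treat an unreachable terminal as down
        time.sleep(15)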

After getting all this into Grafana and combining it with the ICMP Blackbox exporter (we configured Blackbox to send HTTP requests and ICMP pings out via each of the different satellite terminals we had – a mix of OneWeb and others), we could see a pattern emerging where certain "birds" (satellites) passing overhead would come with packet loss and dropouts.

It was the same satellites each time that led to the drops, which allowed us to pinpoint it: when we see this satellite coming over the horizon, we know there's going to be some packet loss.

In the end Eutelsat acknowledged they had two faulty satellites in the orbit we are using, hence seeing the dropouts, and they are currently working on resolving this (but that actually does require rockets, so we’re left without a usable service for the time being) but it was a fun problem to diagnose and a good chance to learn more about space.

Packet loss on the two OneWeb terminals (Not seen on other constellation) correlated with a given satellite pass

I’ve put the source code for the Hughes terminal Prometheus Exporter onto Github for anyone to use.

The repo has instructions for use and the Grafana templates we used.

At one point I started playing with the OneWeb ephemeris data so I could calculate the azimuth and elevation of each of the birds from our position, and work out distances and angles from the terminal. The maths was kinda fun, but oddly the datetimes in the OneWeb ephemeris data set seem to be about 10 years and 10 days behind the current datetime – possibly this gives an insight into OneWeb's two-day outage at the start of the year due to their software not handling leap years.

Despite all these teething issues I'm still optimistic about OneWeb, Kuiper and Qianfan (Thousand Sails) opening up the LEO market and covering more people in more places.

Update: Thanks to Scott via email who sent this:
One note, there’s a difference between GPS time and Unix time of about 10 years 5 days. This is due to a) the Unix epoch starting 1970-01-01 and the gps epoch starting 1980-01-05 and b) gps time is not adjusted for leap seconds, and ends up being offset by an integer number of seconds. 

Update: clarkzjw has published an open source tool for visualizing the pass data https://github.com/clarkzjw/LEOViz

Demystifying SS7 & Sigtran – Part 8 – M3UA

This is part of a series of posts looking into SS7 and Sigtran networks. We cover some basic theory and then get into the weeds with GNS3 based labs where we will build real SS7/Sigtran based networks and use them to carry traffic.

In our last post we talked about moving MTP2 onto IP and the options available.

When we split the SS7 stack onto IP we don’t need to do this at the Data Link Layer, we can instead do it higher up the stack. This is where we introduce M3UA.

MTP Level 3 User Adaptation Layer (M3UA) replaces MTP3 with an IP based equivalent.

This is different to how we'd handle it with M2UA or M2PA, where MTP3 remained unchanged. When you deploy M3UA links there is no MTP3 anymore – it's replaced with an IP based protocol, transported via SCTP, designed to do the same role as MTP3 but over IP. That protocol is M3UA.

This means the roles handled in MTP3, such as managing which point codes are reachable over which linksets, failover, load sharing and reporting, are all now handled by the M3UA protocol, because we lose the ability to just rely on MTP3 to do those things like we did when using lower layer protocols like M2PA or MTP2.

So what do you need to know to use M3UA?

Well, the first concept we need to wrap our heads around is that we no longer have linksets or point code routes (we do, but they're different), but instead have Application Servers, Application Server Processes and Routing Contexts.

If you’re following along at home and you want to hook your M3UA compatible AS into the Cisco ITP STP, I’ll be including the commands as we go along. The first step on the Cisco (assuming you’ve already defined the basic SS7 config) is to create a local M3UA instance:

cs7 m3ua 2905
 local-ip 10.179.2.154

With that out of the way, let’s cover ASPs & ASs (hehe – Ass).

You can think of the Application Server Process (ASP) as the client end of the "link set" of our virtual SS7 stack: it handles getting the SCTP association up – what IPs, ports and SCTP parameters are needed – and listens and communicates based on that. Here's an example on the Cisco ITP:

cs7 asp NickLab_ASP 2905 2905 m3ua
remote-ip 10.0.1.252
remote-ip 172.30.1.12

The ASP connects to a Signaling Gateway (In practical terms this is an STP).

That's simple enough, and now we can do our SCTP handshake, but nothing is going to get routed without introducing the Application Server (AS) itself, which is where we configure the routing, link to one or more ASPs, and define how we want to share traffic among them.

Point codes are still used in M3UA for sending traffic from an M3UA AS but it’s not what controls the routing to an AS.

That probably sounds confusing: I send traffic based on point code, but the traffic doesn't get to the M3UA AS via point code? What gives?

Well, first we’ve got to introduce the Routing Context in M3UA.

Routing Contexts define what destinations are served by this AS.
As an example, we'll define a Routing Context inside the AS section of the ITP config. In this example we're creating Routing Key 1, which will handle traffic to the point code 5.123.2, but we could equally define a routing-key for a given Global Title address too.

cs7 instance 0 as NickPC m3ua
routing-key 1 5.123.2
asp NickLab_ASP
traffic-mode broadcast

Notice we didn’t define Routing Key X -> Point Code Y -> ASP Z ? That’s because we may have one or more ASPs associated with this (remember ASPs are kinda like Linksets).

For example the Point Code for an HLR might have multiple ASPs behind it, with traffic-mode loadshare to load balance the requests among all the HLRs.

So what does it look like to bring this up? Let’s take a look at a link coming up.

Under the hood we’ve got the SCTP connection / handshake like normal, then our ASP sends an ASPUP (ASP is in state “up”) message to the Signaling Gateway (STP).

ASP Up Message from ASP to SG (STP)

Now our ASP has told the Signaling Gateway it’s there, so our Signaling Gateway returns an ASPUP_ACK to confirm it’s got the message and the current AS state is inactive.

ASP Up Ack Message from SG (STP) to ASP

And with that our ASP is in an "up" but "inactive" state; it's connected to the STP, but without any ASes associated with our ASP, it's akin to having the link layer up but nothing else.

State in the STP showing an ASP without an active AS

So next our ASP will send an ASPAC (ASP Active) message for the given routing contexts the AS serves, in this case, Routing Context 1.

ASP Active Message from ASP to SG (STP)

And with that, the Signaling Gateway (STP) sends back an ASPAC_ACK (ASP Active Ack) to confirm it's got it, and the state changes.

ASP Active Ack Message from SG (STP) to ASP

Because of how MTP3 worked – advertising available point codes – the SG (STP) needs to tell the AS/ASP how it sees the world and the state of the connection.

This is done with a NTFY (Notify) message from the STP/SG to indicate the state has changed to active, and what destinations are reachable, and at this point, we’re good to start handling traffic for that Routing Context.

And with that, we can start handling M3UA traffic.
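
If you want to poke at this exchange yourself, the messages are simple enough to build by hand. Here's a rough sketch of packing the ASPUP and ASPAC messages we just walked through, based on my reading of RFC 4666 (double check the values before relying on it):

import struct

# M3UA common header (RFC 4666): Version, Reserved, Message Class, Message Type,
# then a 4 byte length that includes the 8 byte header itself.
def m3ua_message(msg_class: int, msg_type: int, parameters: bytes = b"") -> bytes:
    return struct.pack("!BBBBI", 1, 0, msg_class, msg_type, 8 + len(parameters)) + parameters

def routing_context_param(rc: int) -> bytes:
    # Parameter tag 0x0006 (Routing Context); length covers tag + length + value.
    return struct.pack("!HHI", 0x0006, 8, rc)

aspup = m3ua_message(3, 1)                            # Class 3 (ASPSM), Type 1 = ASP Up
aspac = m3ua_message(4, 1, routing_context_param(1))  # Class 4 (ASPTM), Type 1 = ASP Active, RC 1

print(aspup.hex())
print(aspac.hex())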

There's only one more key dialog to wrap your head around: the DAVA and DUNA messages.

DAVA is Destination Available, and DUNA is Destination Unavailable. The SG (STP) will send these messages to ASP/AS every time the reachability of a neighboring point code changes.

That’s the basics covered, I’m in the process of developing an HLR (Running with MAP/TCAP/SCCP/M3UA) extension for PyHSS, which in the future will allow us to experiment with more M3UA endpoints.

Automatic Cell Planning with Atoll: Site Selection

One of the really neat features of automated RF planning tools like Forsk Atoll is that you can get them to automatically try out tweaks and look at how each one impacts performance.

In the past you'd adjust something, run the simulation again, look at the results and compare them to what you had before.

Atoll’s ACP (Automatic Cell Planning) module allows you to automate this, and in most cases, it does a better job than I would!

Today we’ll look at Cell Site Selection in Atoll.

To begin with we'll limit the computation area down to a polygon we draw around the area in question.

In the Geo tab we’ll select Zones -> Computation Zone and select Edit

We’ll create a new Polygon and draw around the area we are going to analyze. You can automate this step based on population levels, etc, if you’ve got that data present.

So now we’ve set our computation area to the selection, but if we didn’t do this, we’d be computing for the whole world, and that might take a while…

Generating Candidate Sites

Atoll sucks at this – I've found that if your computation zone is set and it's not a rectangle, bad things happen, so I've written a little script to generate candidates for me.
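
Something along these lines does the job – a rough sketch where the bounding box and spacing are placeholders, and the CSV columns just need to line up with however you map them in Atoll's import dialog:

import csv
import math

# Bounding box (placeholder values) and candidate spacing in metres.
MIN_LAT, MAX_LAT = -37.90, -37.80
MIN_LON, MAX_LON = 145.00, 145.10
SPACING_M = 500

lat_step = SPACING_M / 111_320  # roughly metres per degree of latitude
with open("candidates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Latitude", "Longitude"])
    i = 0
    lat = MIN_LAT
    while lat <= MAX_LAT:
        # Longitude degrees shrink with latitude, so scale the step accordingly.
        lon_step = SPACING_M / (111_320 * math.cos(math.radians(lat)))
        lon = MIN_LON
        while lon <= MAX_LON:
            writer.writerow([f"Candidate_{i}", round(lat, 6), round(lon, 6)])
            i += 1
            lon += lon_step
        lat += lat_step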

Creating a new ACP Job

From the Network tab, right click on ACP Automatic Cell Planning and select New

Optimization Tab

Before we can define all the specifics of what we're looking to plan / improve, we need to set some limits on the software itself and tell it what we're optimizing for.

The resolution defines how precise the results should be, and the iterations define how many changes the software should run through.

The higher the number of iterations, the better the results, but it's not linear – the improvement between 1,000 iterations and 1,000,000,000 iterations is typically pretty minor. This is because ACP works on kind of a "getting warmer" philosophy, where it changes a value up or down, looks at the overall result, and if the result was better, changes the value again until it stops getting better.

As I’m working in a fairly small area I’m going to set 100 iterations and a 50m resolution.

In the optimization tab we can also set constraints. For example, we're looking at where to place cell sites in an area, and as far as Atoll is concerned, if we just throw hundreds of sites at the area we'll have pretty good results – but the economics of that don't work, so we can set constraints such as the max number of cell sites. As we are importing ~5k candidate locations, we probably don't want to build 5k cell sites 20m apart, so set this to a reasonable number for your geography.

When using ACP for optimization, as we'll see later on, we can also set cost constraints regarding the cost to make changes, but for now this is just going to pick the best cell site locations for us.

Objectives Tab

Next up we'll need to set up Automatic Cell Planning's objectives.

For ACP to be an effective tool we need to define what we’re looking for in terms of success, you can’t just throw it some values and say “Make it better” – we need to define what parameters we’re looking to improve. We do this by setting Objectives.

Your objectives are going to be based on your needs and wants, but for this example we're building a greenfield network, so we want to offer coverage over an area, as well as good RSRP and RSRQ. We'll set the objective to coverage of 95% of the Computation Zone for this post, with secondary objectives of improving RSRP and RSRQ.

But today I’m modeling for coverage, so let’s set that:

As we’re planning for LTE we need to set the UE parameters, as I’m planning for a mobile network, I’ll need to set the service type and terminal.

Reconfiguration

Now we've defined the Objectives, it's time to define what values ACP can mess with to try and achieve them. For some ACP runs you may be adjusting tilts or azimuths, swapping out antennas, etc, but today we're looking for where we can put cell sites to most effectively serve our target area.

Now we import our candidate list. This might be a list of potential towers you can use, or in my case, for something greenfield, I’m just importing a list of points on a map every X meters to find the best locations to place towers.

From the "Reconfiguration" tab, we'll select "Setup" to add the sites we want to evaluate.

Atoll has "Automatic Candidate Positioning", which allows it to generate pins on the map, but I've not had any luck with it; instead I'm importing the list of candidates I generated with the little Python script, so I'll select "Import from File".

Pick my file and set the parameters for importing the data like so.

Now we’ve got candidates for cell sites defined, we set the station template to populate and then we’re good to go.

Running ACP

Once you've tweaked all your ACP values as required, we can run the ACP job.

As ACP runs you'll see a graph showing the objectives and the levels it needs to reach to satisfy them. This step can take a super dooper long time – especially if your computation zone is large or your number of candidates is high.

But eventually (when we're all a lot older and wearier) ACP will have completed, and we can check out the Optimization it's created.

In my case the objectives failed to be met, but that's OK for me.

Once it's completed, the Changes tab outlines the recommended changes, and the Objectives tab outlines how this has performed against the criteria we outlined at the start. If we're happy with the result, we can commit the changes to put them on the map from the Commit tab.

With that done I weed out the sites in impractical locations, like the ones in the sea…

Now we’ve got the sites plugged in, the next thing we’ll start doing is optimizing them.

When we're dealing with greenfield builds like we are today, the "Move to highest location within X meters" function is super useful. If there's a high point on a property, we want to build our tower there, so the candidate gets moved to the highest point.

One thing to note is this just plans our grid. It won’t adjust azimuths, downtilts, etc, in one operation. We need to use another ACP operation to achieve that, and that’s the content of a different post!

Rerating CDRs in CGrateS

There’s a bunch of reasons you might want to re-rate CDRs in CGrateS.

For me, recently, I wanted to introduce StatS to process historical CDR data, and I'd messed up some rates and wanted to correct them without deleting the existing data.

We can re-rate CDRs with the *rerate flag like so:

{
  "method": "CDRsV1.RateCDRs",
  "params": [
    {
        "Flags": ["*rerate", "*cdrs"],
        "SetupTimeStart": "2024-01-01 00:00:00",
        "SetupTimeEnd": "2024-01-05 00:00:00",
        "Tenants": ["cgrates.org"],
        "Categories": ["call"]
    }
  ],
  "id": 0
}

Something to be aware of that's tripped me up: if any of the CDRs fails while re-rating, CGrateS will stop rating the CDRs after it. For example if you get something like this:

{'method': 'CDRsV1.RateCDRs', 'params': [{'Flags': ['*rerate', '*stats'], 'SetupTimeStart': '2025-01-01 00:00:00', 'SetupTimeEnd': '2025-01-28 23:59:59', 'Limit': 10}]}


{'error': 'SERVER_ERROR: PARTIALLY_EXECUTED', 'id': None, 'result': None}

Then the full list of CDRs you’ve requested to be re-rated won’t have been re-rated, only the CDRs up to the error, then CGrateS will stop processing the records after it.

So keep an eye on ngrep and make sure you've got all your rates and destinations defined correctly. I found putting:

{
"method": "CDRsV1.RateCDRs",
"params": [
{
"Flags": ["*rerate", "*cdrs"],
"SetupTimeStart": "2024-01-01 00:00:00",
"SetupTimeEnd": "2024-01-05 00:00:00",
"Tenants": ["cgrates.org"],
"Categories": ["call"],
"NotCosts" : [-1, 0]
}
],
"id": 0
}

into the CDR filters means I filter out any CDRs with a -1 cost, i.e. the ones that had failed to rate last time (assuming you don't want to fix CDRs that have failed to get rated).

Call forwarding in SS7/ISUP

Had an interesting fault come across my desk the other day; calls were failing when the called party (an SSP we talk to via SS7/ISUP) had an exchange based call forward in place.

In SIP, we can do call forwarding in one of two ways: we can send a 302 Redirect, or we can generate a new SIP INVITE.

But in ISUP how is it done?

We’re a SIP based network, but we do talk some SS7/ISUP on the edges, and it was important that we handled this correctly.

I could see in the Address Complete Message (ACM) sent back to our network that there was redirection information here:

We would see the B party SSP release the call as soon as it sent this.

This made me wonder if we, as the originating network, were supposed to redirect to the new B party and send a new Initial Address Message?

After a lot of digging in the ITU Q.7xx docs (I'm nowhere near as fast at finding information in specs written prior to my birth as I am with the 3GPP specs) I found my answer – these headers are informational only; the B party SSP is meant to re-target the call itself, and send us an Alerting or Answer message when it's done so.

StatS in CGrateS

The StatS subsystem allows us to calculate statistics based on CGrateS events.

Each StatS object contains one or more “metrics” which are things like Average call duration, Total call duration, Average call cost or totals and average of other fields.

The first thing we’ll need to do is enable stats in our JSON config file:

"stats": {
"enabled": true,
"string_indexed_fields": ["*req.Account","*req.RunID","*req.Destination"],
},

With that done we’re ready to create our first StatS entry, this one is pretty much a burger-with-the-lot, so let’s take a look:

{
    "method": "APIerSv1.SetStatQueueProfile",
    "params": [
        {
            "ID" : "StatQueueProfile_VoiceStats",
            "QueueLength": 10000000,
            "TTL": -1,
            "MinItems": 0,
            "FilterIDs": [],
            "Metrics": [
                {"FilterIDs": [],"MetricID": "*tcd"},
                {"FilterIDs": [],"MetricID": "*tcc"},
                {"FilterIDs": [],"MetricID": "*asr"},
                {"FilterIDs": [],"MetricID": "*acd"},
                {"FilterIDs": [],"MetricID": "*ddc"}
            ],
            "Stored": True,
        }
    ]
}

So what have we just done?

Well, we've created a StatQueueProfile named StatQueueProfile_VoiceStats, in which we'll store a maximum of 10,000,000 datapoints (this is important because to calculate an average we need to know all the previous datapoints), for a maximum of forever (because TTL is -1; if we wanted to store for 1 hour we'd set TTL to 1h).

We're not matching any FilterIDs, but based on what we covered in the post on FilterS, you can imagine using this to match calls from a given Account / customer, or to a specific group of destinations, or maybe from a given supplier, etc.

What we do have that’s interesting is we have defined a series of metrics.

The docs page of CGrateS explains all the available metrics and what they mean (we’ve also mapped them in the CGrateS UI), but the ones I’ve included above are Total Call Duration (*tcd), Total Call Cost (*tcc), Answer Seizure Ratio (*asr), Average Call Duration (*acd) and Distinct Destination Count (*ddc).

So what happens if we now generate a bunch of calls? Well, for starters as we’ve got no FilterS defined here, every call will match this StatQueueProfile, and so we’ll collect data for each.

The example code I've provided in the repo for this post generates a bunch of calls, and we can check the values for all our metrics with GetQueueStringMetrics for our StatQueue:

{'method': 'StatSv1.GetQueueStringMetrics', 'params': [{'Tenant': '', 'ID': 'StatQueueProfile_TalkTime', 'APIOpts': {}}], 'id': 11}
{'error': None,
'id': 11,
'result': {'*acd': '8m4.4s',
'*asr': '100%',
'*ddc': '50',
'*tcc': '5396',
'*tcd': '6h43m40s'}}

We can now see the values of each metric.

If we’ve got a TTL set, old values that have existed in the QueueProfile longer than the TTL are removed, but we can also manually clear the values by using the ResetStatQueue endpoint:

{"method":"StatSv1.ResetStatQueue","params":[{"Tenant":"cgrates.org","ID":"StatQueueProfile_TalkTime"}],"id":4}

Which resets all the values back to zero / null.

One thing to keep in mind is you can’t modify a StatQueue object via the API without resetting the values.

string_indexed_fields in the config file

Sidebar on this – specifying string_indexed_fields means that CGrateS will not evaluate every field against Filter rules, only those defined here. If you've got an event with, say, 20 fields (AnswerTime, Account, Subject, Destination, RunID, SetupTime, extra fields, etc, etc), each of these gets evaluated against a filter, which is pretty processor intensive if your FilterS only ever look at Account and Destination. So by limiting the indexed fields to just the fields you use in your filters, you can boost your performance. On the flip side, you can leave this blank to evaluate all fields, but you'll take a performance hit by doing so.

Western Electric 1013 Test Set / Butt Set

I recently picked up a Western Electric 1013 Test set (Aka Buttinski) rotary test phone.

These are about $10 a piece on eBay in the US, and when having a pile of other stuff sent over (*cough* Nortel Millennium *cough*) I figured I’d add one of these.

I imagine these were produced in massive numbers, they’re electrically very simple, hardened and feature a rubber strip for a “more secure hands-free operation” – Luxury.

Electrically these are very simple, and it’s 3 screws to open the whole unit up, and the top and bottom half separate with spring loaded contacts for the dial so you don’t need to unplug anything (I imagine because they had a habit of getting broken dials when being smashed around).

The amazing Telephone Collectors International library has the overview of these phones, complete with the great images.

Oddly there is no ringer circuit, bell or lamp, so although you could answer an incoming call with one of these, there’s no way you’d know it was ringing, which reminded me of this sketch from “Not the Nine O’Clock News”.

According to the datasheet these phones feature a "type 11C dial".

TCI library also has the docs for the Type 10 and Type 11 dial, which, according to Ma Bell, is not field serviceable, and should be swapped out rather than attempting a repair.

Rotary dial in action

Alas the rotary dial on mine was running slightly slow, I’ve a feeling Western Electric doesn’t manufacture these any more, so I decided I’d have to fix it myself.

So I stripped down the dial and gave it a good clean.

The dial has a neat little rubber boot on the inside to protect it from gunk, and came apart and went back together easily enough, even if I did inadvertently let out all the spring tension and have to wind it back in, and put the dial on offset by 90 degrees.

Oddly the finger stop moves when you dial – I thought this was an issue with a loose part, but it’s by design, to allow the dial to be more compact, which makes total sense as if I had stopped it from moving I wouldn’t be able to dial higher numbers – Glad I worked that one out eventually.

With the dial cleaned up and adjusted, she’s dialing within spec.

I’ll give the rest of the orange plastic body a polish and it’s off to join the other butt sets.

Setting up TR-069 to manage Calix Endpoints

Recently one of our customers who’s got a large number of Calix E7 ONTs needed some help to automate some of the network management tasks to do with the CPEs.

We’d setup an TR-069 Auto Configuration Server (ACS) for the Calix RGs (The modems) so that we could manage the config parameters on the devices.

Setup was surprisingly easy: after installing some god-awful 90's Java stuff to access Calix's "CMS", we pointed everything at our ACS (per the screenshot below) and presto, a few thousand CPEs were there ready to be centrally managed.

FilterS in CGrateS

FilterS do what it says in the name, they are a generic way to define filter rules that an event may or may not match.

Think of them like a WHERE statement in SQL: they allow us to match on conditions.

So what would we use FilterS for? Well, let’s first checkout some example use cases:

We might want to provide 100 free minutes on Tuesdays. We know from this post on creating Balances in CGrateS how to create the balance, but we'd use FilterS to make sure the balance is only used on Tuesdays, by adding a filter that checks the day of the week and only matches on a Tuesday.

We might define an Attribute to rewrite the Destination number into E.164, but we only want to apply that transformation if the number is in 0NSN format, we apply the translation with AttributeS but we would create a filter to match Destinations that match the given prefixes.

We might want to trigger a counter for calls where the duration of the call (Usage) is greater than 1 hour, we can do this with Thresholds to handle the counting and FilterS to only match if the call duration is greater than 1 hour.

A customer may have multiple DIDs / phone numbers they present in the From header, and we may need to map phone numbers "99990001" through to "99990099" to a single Account by changing the Account to "Customer X". We can do that with AttributeS to update the Account value in the request, and FilterS to control whether that AttributeS rule is matched or not.

FilterS are used all over CGrateS; if you’ve been following along, you’ve already come across them in the FilterIDs fields in the API, which we’re going to look at using today.

There are two ways to handle Filters inside CGrateS; they both act the same way, but each has some pros and cons.

Inline Filters

The first option is an “inline” filter. Take for example this AttributeS rule using an inline Filter.

{
    "method": "APIerSv2.SetAttributeProfile",
    "params": [{
        "ID": "ATTR_Blog_Example_Inline",
        "Contexts": ["*any"],
        "FilterIDs": ["*string:~*req.Account:Nick"],
        "Attributes": [
            {
            "FilterIDs": [],
            "Path": "*req.ExamplePath",
            "Type": "*constant",
            "Value": "ExampleValue"
            }
        ],
        "Blocker": False,
        "Weight": 10
    }],
}
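
Before we break the filter down, a quick aside on how these API snippets actually get to CGrateS: they’re just JSON-RPC calls over HTTP. A minimal sketch in Python, assuming CGrateS is listening on its default HTTP JSON-RPC endpoint (port 2080, path /jsonrpc) – adjust to suit your setup:

import json
import requests

# Sketch only – POST the SetAttributeProfile request above to CGrateS' JSON-RPC API.
# Assumes the default HTTP listener (":2080") and path ("/jsonrpc").
CGRATES_URL = "http://localhost:2080/jsonrpc"

payload = {
    "method": "APIerSv2.SetAttributeProfile",
    "params": [{
        "ID": "ATTR_Blog_Example_Inline",
        "Contexts": ["*any"],
        "FilterIDs": ["*string:~*req.Account:Nick"],
        "Attributes": [{
            "FilterIDs": [],
            "Path": "*req.ExamplePath",
            "Type": "*constant",
            "Value": "ExampleValue"
        }],
        "Blocker": False,
        "Weight": 10
    }],
    "id": 1,
}

response = requests.post(CGRATES_URL, json=payload)
print(json.dumps(response.json(), indent=2))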

Let’s break down this filter:

"*string:~*req.Account:Nick"

A filter is made up of three components: the match “type”, the element to compare using that match, and the value(s) to match against.

Match Types: The above example is matching on an exact string (match type *string), but we can also match on prefixes, suffixes, destinations, empty, not equal to something, greater than, less than, timings and more.

Match Elements: Next up we’ve got the element; this is which part of a CGrateS event we’re matching against with the Match Type we’ve selected. In the above example the match type is *string and the Element is ~*req.Account. If you look at the requests (~*req.) in CGrateS, you can see the events contain all the standard fields like Account, Subject, Category, Tenant, Destination, etc, plus any custom ones you’re using, and any of these can be used as the element to compare with our match type.

Match Values: Lastly we’ve got the values we’ll match on; in the example above it’s the string “Nick”. So what we’re checking is: the match type is *string, the element we’re getting the string from is ~*req.Account, and if that element matches the value “Nick” then ding-ding-ding – we’ve matched.

Obviously the values change based on what we’re doing; if we were prefix matching, we’d put the prefix to match in the value.

Value can also be a list, separated by the pipe (|) symbol for inline filters, so for example we could match “Nick” and also “Nicholas” (if I’m in trouble) with this inline filter:

"*string:~*req.Account:Nick|Nicholas"

Let’s look at a few more inline filters.

This filter will match any event where the Destination is one of ACMA’s fake phone number ranges:

"*prefix:~*req.Destination:6125550|6127010|6135550|6137010|6175550|6187010|6185550|6187010|61491570"

Each match type also has an inverse; for example, *prefix has *notprefix for matching the reverse:

"*notprefix:~*req.Destination:6125550|6127010|6135550|6137010|6175550|6187010|6185550|6187010|61491570"

Let’s look at one more example, if the Usage is greater than 1 hour:

*gt:~*req.Usage:1h

Inline filter for any Australian E164 prefixes

FilterProfiles

Now that we’ve covered the basics of creating Filters with the “Inline” method, let’s consider the limits of this.

If I had defined objects in AttributeS, ThresholdS, ResourceS, Balances and StatS to match when ~*req.Account is “Nick” using an inline filter, and then I changed my name, I’d have to go to each of those elements and update them, and that’d be a pain (especially because I’d also need to change my domain name).

Instead I can create a “Filter Profile” – a named filter that I can reference from AttributeS, ThresholdS, ResourceS, Balances and StatS, and then if anything changes I only need to update the Filter in one place.

Let’s look at how that would look; first we’d create a new Filter Profile object using the API:

{
    "method": "ApierV1.SetFilter",
    "params": [
        {
            "ID": "Filter_ACCOUNT_Nick",
            "Rules": [
                {
                    "Type": "*string",
                    "Element": "~*req.Account",
                    "Values": [
                        "Nick",
                        "Nicholas",
                    ]
                }
            ],
            "ActivationInterval": {}
        }
    ]
}

This is equivalent to the inline Filter below; like the inline filter, it’ll match any time ~*req.Account is a string matching “Nick” or “Nicholas”:

"*string:~*req.Account:Nick|Nicholas"

And then to update our AttributeS example from earlier, rather than defining the inline filter in the FilterIDs section, we just put the ID of the filter we created above:

{
    "method": "APIerSv2.SetAttributeProfile",
    "params": [{
        "ID": "ATTR_Blog_Example_Inline",
        "Contexts": ["*any"],
        "FilterIDs": ["Filter_ACCOUNT_Nick"],
        "Attributes": [
            {
            "FilterIDs": [],
            "Path": "*req.ExamplePath",
            "Type": "*constant",
            "Value": "ExampleValue"
            }
        ],
        "Blocker": False,
        "Weight": 10
    }],
}

Easy!

We saw in the example above that we could do logical OR operations: match if the Account is equal to “Nick” or “Nicholas”. But one neat thing we can do with Filter Profiles is logical AND operations.

Let’s create a new FilterProfile called Filter_Sunday to match when the AnswerTime matches Timing named “Timing_Sunday”:

{
    "method": "ApierV1.SetFilter",
    "params": [
        {
            "ID": "Filter_Sunday",
            "Rules": [
                {
                    "Type": "*timings",
                    "Element": "~*req.AnswerTime",
                    "Values": ["Timing_Sunday"]
                }
            ],
        }
    ]
}

Now we can define an Attribute that will only match if the Account is equal to “Nick” or “Nicholas” AND the AnswerTime matches our “Timing_Sunday” timing profile:

{
    "method": "APIerSv2.SetAttributeProfile",
    "params": [{
        "ID": "ATTR_Blog_Example_Inline",
        "Contexts": ["*any"],
        "FilterIDs": [
             "Filter_ACCOUNT_Nick",
             "Filter_Sunday",
        ],
        "Attributes": [
            {
            "FilterIDs": [],
            "Path": "*req.FYI",
            "Type": "*constant",
            "Value": "Sunday_and_Nick_or_Nicholas"
            }
        ],
        "Blocker": False,
        "Weight": 10
    }],
}

So we can evaluate as AND by just putting both FilterProfiles in the FilterIDs field:

"FilterIDs": ["FLTR_X", "FLTR_Y"],

It’s up to you where you use Inline Filters vs Filter Profiles. As a general rule, if you don’t mind setting it on every object you’re touching, or you don’t reuse the Filter much, inline Filters are probably the way to go.
But if you use multiple subsystems and want to keep your logic more readable, perhaps use Filter Profiles – but again, there are no hard rules.

Filter Profiles are something we’ve got fairly good coverage of in the CGrateS UI, but as far as I’m aware there’s not a simple “Test Filter” API endpoint, so generally I test them out with AttributeS.
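
My usual trick is to hang the filter I want to test off a throwaway AttributeProfile and then fire a fake event at AttributeS to see if it matches. Something like the sketch below – the exact argument fields vary a little between CGrateS versions, so treat this as a rough guide:

import requests

# Sketch only – ask AttributeS to process a fake event; if the AttributeProfile
# referencing our filter matches, the reply lists the altered fields, otherwise
# we get a not-found style error back.
CGRATES_URL = "http://localhost:2080/jsonrpc"  # default JSON-RPC endpoint, adjust to suit

test_event = {
    "method": "AttributeSv1.ProcessEvent",
    "params": [{
        "Tenant": "cgrates.org",
        "ID": "filter_test_1",
        "Context": "*any",
        "Event": {
            "Account": "Nick"   # the value we expect Filter_ACCOUNT_Nick to match on
        }
    }],
    "id": 1,
}

print(requests.post(CGRATES_URL, json=test_event).json())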

Basic CAMEL Charging Flow

CAMEL handles charging in 2G and 3G networks, much like Diameter handles charging in LTE.

CAMEL runs on top of SS7, specifically it sits on top of TCAP, which sits on top of SCCP, which can ride on M3UA or MTP3 (so it sits at the same layer as MAP).

CAMEL is primarily focused on charging for Voice & SMS services, as data generally uses Diameter, so it’s voice and SMS we’ll focus on.

CAMEL is spoken between the MSC (gsmSSF) and the OCS (gsmSCF).

Basic Call State Model

CAMEL is closely related to the Intelligent Network stuff of the 1980s, and steals a lot of its ideas from there. Unfortunately, the CAMEL standard also rather assumes you were involved in IN stuff, and had been born at that point; alas, I was neither.

So the key to understanding CAMEL is the Basic Call State Model (BCSM) which is a model of all the different states a call can be in, such as ringing, answered, abandoned, call failed, etc, etc.

Over CAMEL, the MSC can tell our OCS when the state of the call changes; for example, a BCSM event might indicate the call is ringing, has been answered, hung up, been cancelled, etc.

Below is the list of all the valid BCSM states:

List of BCSM states for events
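
For reference, the eventTypeBCSM values you’ll most commonly come across are below. These are listed from memory, so double-check them against 3GPP TS 29.078 before relying on the numeric codes:

# eventTypeBCSM values commonly seen in CAP / CAMEL – from memory, verify against TS 29.078.
EVENT_TYPE_BCSM = {
    2: "collectedInfo",
    3: "analyzedInformation",
    4: "routeSelectFailure",
    5: "oCalledPartyBusy",
    6: "oNoAnswer",
    7: "oAnswer",
    8: "oMidCall",
    9: "oDisconnect",
    10: "oAbandon",
    13: "tBusy",
    14: "tNoAnswer",
    15: "tAnswer",
    16: "tMidCall",
    17: "tDisconnect",
    18: "tAbandon",
}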

Basic MO Call with CAMEL

Our subscriber makes an outbound call.

Based on the subscriber data the MSC has from the HLR, it knows that we should use CAMEL for this call, and it has the SCCP address of the OCS (gsmSCF) it needs to send the CAMEL messages to.

So the MSC sends an InitialDP message to the OCS (via its Global Title address) to authorize the call that the user is trying to make.

This is like any other authorization step for an OCS: it allows the OCS to authorize the call by checking the subscriber is valid, that they’re allowed to call that destination, that they’ve got the balance to do so, etc.

initialDP message from an MSC to an OCS

The initialDP (Initial Detection Point) is telling our OCS all about the call that’s being attempted: who’s calling, what number they’ve dialed, where they are in the network (especially of note if they’re roaming), etc, etc.

The OCS runs through its own checks to see if it wants to allow the call to proceed (has the subscriber got enough balance, unit reservation, etc, etc), and if it does, the OCS sends back a Continue message to the MSC to allow the call to continue.

Generally the OCS also uses this message as a chance to subscribe to BCSM events using RequestReportBCSMEventArg, so the MSC will notify the OCS when the state of the call changes – events like the call getting answered, disconnected, etc. This is critical so we know when the call gets answered and hung up, and can charge correctly.

In the below example, as well as sending the Continue and RequestReportBCSMEventArg, the OCS is also setting the ChargingArgs for this call, so the MSC knows who to charge (the caller, set via sendingSide) and that the MSC must send an Apply Charging Report (ACR) message every 300 units (1 unit = 100 ms, so a value of 300 = 300 x 100 milliseconds = 30 seconds), so the OCS can keep track of what’s going on.

continue sent by the OCS to the MSC, also including reportBCSMEvent and applyCharging messages
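
Those x100 ms units trip people up, so here’s a tiny conversion sketch (the helper names are mine, not from the spec):

# Sketch: converting between seconds and the 100 ms units used by
# maxCallPeriodDuration in applyCharging.
UNIT_MS = 100  # 1 CAMEL charging unit = 100 milliseconds

def seconds_to_charging_units(seconds: float) -> int:
    """30 seconds -> 300 units"""
    return int(seconds * 1000 / UNIT_MS)

def charging_units_to_seconds(units: int) -> float:
    """300 units -> 30.0 seconds"""
    return units * UNIT_MS / 1000

assert seconds_to_charging_units(30) == 300
assert charging_units_to_seconds(300) == 30.0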

At this point the call can start to proceed – in ISUP terms, the InitialDP is triggered after the Initial Address Message comes in, and the Address Complete Message only goes out after the continue is sent back.

Or, in a slightly less accurate analogy that’s easier to understand for SIP folks: the InitialDP is sent when the INVITE comes in, and the 180 RINGING goes out once the continue message is received.

Call is Answered

So at this stage our call can start to ring.

As we’ve subscribed to BCSM events in our last message, the MSC is going to tell us when the call gets answered, times out, is abandoned, or the sun burns out.

The MSC provides this info in an eventReportBCSM, which is very simple and just tells us the event that’s been triggered; in the example below, the call was answered.

eventReportBCSM from MSC to OCS

These eventReportBCSM messages are informational from the MSC to the OCS, so the OCS doesn’t need to send anything back, but the OCS does need to mark the call as answered so it can start timing the call.

At this stage, the call is connected and our two parties are talking, but our MSC has been told it needs to send us applyChargingReports every 30 seconds (due to the value of 300 in maxCallPeriodDuration) after the call was connected, so the MSC sends the OCS its first applyChargingReport 30 seconds after the call was answered:

applyChargingReport sent by the MSC to the OCS every reporting period

We can calculate the duration of the call so far based on the time of the eventReportBCSM; the OCS must then decide whether it should allow the call to continue or not.

For simplicity’s sake, let’s imagine we’ve still got balance in the OCS and the OCS wants the call to continue. The OCS sends back an applyCharging message to the MSC in response, and includes the next allowed maxCallPeriodDuration, keeping in mind the value is in units of 100 milliseconds (so 300 is again 30 seconds).

applyCharging from the OCS back to the MSC

Perfect, our call is good to go for another 30 seconds, so in 30 seconds we’ll get another ACR message from the MSC to the OCS to keep it abreast of what’s going on.
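
If it helps to see the shape of that loop in code, here’s a very rough sketch of what an OCS might do each time an applyChargingReport arrives. This is purely illustrative – the names are mine, not from CAMEL or any particular OCS:

from dataclasses import dataclass

# Very rough sketch of the per-ACR loop on the OCS side.
UNIT_MS = 100        # 1 CAMEL charging unit = 100 ms
GRANT_SECONDS = 30   # how much talk time we grant per applyCharging

@dataclass
class Call:
    balance_seconds: float   # remaining prepaid talk time for this subscriber

def on_apply_charging_report(call: Call, elapsed_units: int) -> dict:
    """Handle one applyChargingReport from the MSC and decide what to send back."""
    call.balance_seconds -= elapsed_units * UNIT_MS / 1000   # burn down the balance

    if call.balance_seconds <= 0:
        return {"message": "releaseCall"}                    # out of credit: tear the call down

    # Otherwise grant another slice, capped at whatever balance is left
    next_grant_s = min(GRANT_SECONDS, call.balance_seconds)
    return {
        "message": "applyCharging",
        "maxCallPeriodDuration": int(next_grant_s * 1000 / UNIT_MS),  # back to 100 ms units
    }

# e.g. a subscriber with 45 seconds left, reporting a full 30 second period:
print(on_apply_charging_report(Call(balance_seconds=45), elapsed_units=300))
# -> {'message': 'applyCharging', 'maxCallPeriodDuration': 150}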

Now one of two things is going to happen: either the subscriber is going to burn through all of their minutes and get their call cut off, or the call will end while they’ve still got balance. Let’s look at the normal hangup scenario.

Normal Hangup Scenario

When the call ends, we get an applyChargingReport from the MSC to the OCS.

As we’ve subscribed to reportBCSMEvent, we get both the applyChargingReport with legActive: False, so we know the call has hung up, and an event report to tell us more about what happened – in this case a hangup from the Originating Side.

reportBCSMEvent and applyChargingReport Sent by the MSC to the OCS to indicate the call has ended, note the legActive flag is now false

Lastly the OCS confirms by sending a releaseCall to the MSC, to indicate all legs should now terminate.

releaseCall Sent by OCS to MSC at the very end

So that’s it!

Obviously there are other flows, such as running out of balance mid-call, rejecting a call, SMS, and PBX / VPN services that rely on CAMEL, but hopefully you now understand the basics of how CAMEL-based charging looks and works.

If you’re looking for a CAMEL capable OCS or a CAMEL to Diameter or API gateway, get in touch!

CGrateS time Metas

There are so many ways you can format time for things like Expiry or ActionPlans in CGrateS; this is mostly just a quick reference for me:

  • *asap (Now)
  • *now
  • *every_minute
  • *hourly
  • *monthly
  • *monthly_estimated
  • *yearly
  • *daily
  • *weekly
  • mo+1h2m
  • *always (What?)
  • *month_end
  • *month_end+1h2m
  • +20s
  • 1375212790
  • +24h
  • 2016-09-14T19:37:43.665+0000
  • 20160419210007.037
  • 31/05/2015 14:46:00
  • 08.04.2014 22:14:29
  • 20131023215149
  • “2013-12-30 15:00:01 +0430 +0430”
  • “2013-12-30 15:00:01 +0000 UTC”
  • “2013-12-30 15:00:01”

Stolen from: https://github.com/cgrates/cgrates/blob/8fec8dbca1f28436f8658dbcb8be9d03ee7ab9ee/utils/coreutils_test.go#L242