Andy Hedges' Blog

Service Decomposition, Cohesion & Coupling

Service Oriented Architecture is about making IT look like your organisation — your business. In many companies IT systems are broken down into lumps that aren’t the same lumps that the business understands. You may have systems with mysterious names like Pluto or Genie or, worse still, impenetrable 3-letter acronyms (Steve Jones speaks to this in his book Enterprise SOA Adoption Strategies — chapter 12). How do the business and 3rd parties make sense of these meaningless and cryptic monikers? Usually they don’t. They only serve to isolate IT.

With this in mind, IT and the rest of the business have the daunting task of breaking down the organisation, its data and its functions into services. One approach to this is to look for things that are highly cohesive, that is, things that naturally belong together because of their very nature.

However, deciding what belongs together can be like the proverbial pulling of a thread from a sweater: you pull at one thread and the rest comes along with it. You end up with long dependency chains; everything directly or transitively refers to everything else. It’s a similar problem to the one ORM toolkits have, but this isn’t just about data.

Every data entity, every small bit of function or process in your organisation is related to another somehow, either directly or transitively. It’s six degrees of separation applied to your software estate and there’s no getting away from that fact. The extremely hard question is where one service stops and another begins: where do I draw the lines between services?

I used the term “cohesion” earlier; its counterpart, its nemesis, is coupling. Coupling is where something has been put together or joined with something it doesn’t strongly belong with. To use a banking example, you don’t expect your staff payroll system to need modification when you change the way deposits to customer accounts work.

In short, good dependencies represent high cohesion and bad dependencies represent tight coupling. The opposites of these are low cohesion (which is bad, boo) and loose coupling (which is good, yay).

The question remains, though: why is cohesion good and coupling bad? The advantages of cohesion are as follows:

  • Your brain naturally groups cohesive concepts; things that are like each other follow naturally, and cognitively it makes sense to work on related concepts at the same time.
  • Changes to one service are less likely to require modifications or have side effects on other services.
  • Services need to interact with each other much less, because in the majority of cases the functionality or behaviour a service relies on lives within that service. This means that invocations and data access can occur within the process space of the service, with no need for network calls and data marshalling.
  • It makes it easier to reason about where functionality or data might exist in your services. For example, if I need to change how salaries are calculated, that’ll be in the payroll service; it becomes obvious.

The disadvantages of coupling, somewhat the corollary of the above, are:

  • The service is harder to understand because you have to hold more concepts than necessary in your head when reasoning about the service.
  • Unexpected consequences: you change one piece of functionality and an unrelated one breaks.
  • Can lead to fragmentation of cohesive functions and therefore higher communication overhead.
  • You have to make changes to more code than necessary when adding functionality.

Types of Cohesion and Coupling

There is much written on this and so I’ll try not to rehash it too much; unfortunately I haven’t found anything that unifies cohesion and coupling well. What I’m going to do is refer to the good types as cohesion and the bad types as coupling.

Data cohesion (good)

Where data is often used together.

Example: a dating website would put a customer’s name and email address together because they are used together often. However it would not put a customer’s suitable partners together with their credit card details.

Functional cohesion (good)

Where functions are related and act upon the same data.

Example: registering for a website and modifying your username might be two functions on the same service, whereas paying an employee’s salary wouldn’t belong there because functionally it makes no sense.

Categorisation Coupling (bad)

Things are put in the same service because they belong to a certain category of data or function.

Example: Used cars have a location and so do used-car salesmen, so I’ll create a location service for them both.

Process Coupling (bad)

This is where services are created around long chain business processes. The problem with this is that business processes tend to go across many concepts within an organisation and so pull a lot of stuff with them — forcing you toward a monolith.

Example: A company has a process for the selection, purchase and installation of equipment. This process includes gathering requirements for the equipment, knowing how to contact suppliers, how to receive invoices, how to make payments, how to engage with legal, the specification of their property portfolio and so forth; before you know it you’ve got a monolith.

Arbitrary Coupling (bad)

This is where unrelated concepts exist in the same service. Who knows why, perhaps the systems designers had two projects and couldn’t be bothered to have two separate modules in their IDE.

Example: most enterprise off-the-shelf business software, although things are slowly getting better.

Data Type Coupling (bad)

This is where services use the same definition of a type and when one needs to modify that definition it breaks the other.

Example: A company has a single definition of its customer type, and the properties of that type are defined centrally (think global XSD for customer data). Each service that deals with customers has to be capable of understanding this customer type. One day marketing decide they want to add Twitter username as a new field. This means that all services now need to be updated to include this field when talking to the marketing service, as it’s now fully expecting it.
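
To make that concrete, here is a minimal sketch in Python; the field names and the strict validation rule are hypothetical, purely for illustration. A consumer that validates against the full shared customer definition breaks as soon as the new field arrives, whereas one that reads only the fields it needs carries on working.

# Hypothetical shared "customer" message, after marketing's change.
customer = {
    "id": 42,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "twitter": "@janedoe",   # the new field marketing added
}

KNOWN_FIELDS = {"id", "name", "email"}

def strict_consumer(message):
    # Validates against the full shared definition, so any field it
    # doesn't know about is treated as an error.
    unknown = set(message) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {unknown}")
    return message["email"]

def tolerant_consumer(message):
    # Reads only the fields it needs and ignores everything else.
    return message["email"]

print(tolerant_consumer(customer))   # jane@example.com
try:
    strict_consumer(customer)
except ValueError as error:
    print(error)                     # unexpected fields: {'twitter'}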

Temporal Coupling (bad)

When two concepts happen at the same time but are otherwise unrelated. This is similar to process coupling and is often a symptom of it.

Example: Every month accounting complete their books and make sure they balance; they also run payroll; so the accounting service is created to do both of these things.

Andy Hedges

Huffman Coding

I do a lot of programming in the large and spend far too much time, according to some, thinking about programming-in-the-large concepts such as SOA and CEP, but from time to time I like to keep it real and code something detailed and low(er) level. Recently I’ve been doing a little investigation around compression and so I ended up reading up on Huffman Coding. Huffman coding is a way of taking a fixed length encoding and turning it into a shorter variable length encoding.

Back to basics

That’s a little vague and academic so I’m going to assume little knowledge of how computers store data and work my way up and see if I can explain it properly.

Figure 1. On/off switch

Computer storage is made of switches, lots and lots of on/off switches; a switch in storage terms is called a bit and is the smallest possible unit of storage. Switches can be on or off and computers represent that on or off as 1 or 0 respectively. That’s why the classic power switch on most devices shows a zero with a one overlaid, as in figure 1: 0 and 1 mean off and on to engineers. If I only have 0s and 1s then I’m going to have to find a way of representing numbers bigger than 1 using them.

Getting to second base

Classically we use a numbering system with ten digits, zero through to nine, which mathematicians would call base 10 or decimal. However, as computers only have two digits, they use a base 2 counting system, which is referred to as binary. To count in binary we simply increment the digit on the right until it reaches its limit (1), then reset it to zero and increment the digit to its left; if that digit is also at its limit we move left again, resetting as we go, until we find one we can increment. This is exactly the algorithm we use for normal counting. Therefore we have:

binary  decimal
     0        0
     1        1
    10        2
    11        3
   100        4
   101        5
   110        6
   111        7
  1000        8
  1001        9
  1010       10
  1011       11
  1100       12 

Interestingly, and logically, where in decimal the first digit (from the right) represents units, the second tens and the third hundreds, in binary they represent 1, 2, 4, 8 and so on, doubling for each digit to the left. Thus 101 (binary) is

(1 × 1) + (2 × 0) + (4 × 1) = 5 (decimal)
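
If it helps to see that as running code, here’s a small Python sketch of the same place-value calculation, walking the digits from right to left and doubling the place value at each step:

def binary_to_decimal(bits):
    # The rightmost digit is worth 1, and each digit to the left
    # is worth double the one before it.
    total, place = 0, 1
    for digit in reversed(bits):
        total += place * int(digit)
        place *= 2
    return total

print(binary_to_decimal("101"))   # 5
print(binary_to_decimal("1100"))  # 12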

Anyway, enough about bases, I think we have that covered off. What we need to understand is that you can represent numbers using sequences of zeros and ones on computers.

Encoding

Computers encode many things to files such as text, images, video, CAD models and many more. Let’s choose the simplest thing possible, though, for our example: text. To have a text file we must have some way of mapping numbers to/from letters, what computer scientists call character encodings. Whilst conceptually simple, these things strike fear into programmers who are smart enough to understand what they are and what a mess we’ve created for ourselves. I’m going to dodge the bullet of fully explaining why they are a pain and use the simplest one as an example: US-ASCII, the seemingly tautological United States-American Standard Code for Information Interchange.

US-ASCII defines a mapping of numbers to characters. Each character is represented by an 8 binary digit number known as a byte; remember a binary digit is called a bit, so a byte is an 8-bit number. The reason numbers are stored as 8 bits is so that they can be read off storage easily. Remember there are only zeros and ones on computers, therefore to know where a number starts and ends there has to be some way of splitting them back apart. A file might contain, for example:

0100111101001011

Which is two bytes of data namely 01001111 and 01001011 or in decimal 79 and 75 or in US-ASCII encoding “OK”.
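
Here’s that split as a small Python sketch: take the stream 8 bits at a time, turn each byte into a number, and look the number up in the character encoding (Python’s chr uses the same numbering as ASCII for these values):

stream = "0100111101001011"

chars = []
for i in range(0, len(stream), 8):
    byte = stream[i:i + 8]       # e.g. "01001111"
    number = int(byte, 2)        # e.g. 79
    chars.append(chr(number))    # e.g. "O"

print("".join(chars))  # OK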

ASCII defines which numbers represent which characters, for example ‘A’ is 65 (decimal), that is 01000001 (binary). As it just uses 8-bit numbers there are only 256 characters possible (i.e. 00000000 (0) through to 11111111 (255)); US-ASCII itself actually only assigns the first 128 of them. Full mappings of numbers to characters for ASCII are available all over the web.

Let’s split

You might be thinking: why bother with the leading 0s on each byte in the “OK” example above? Well, if I dropped them I would have

10011111001011

and how on earth do I know which digits belong to which number? It could be 10011111, 001011 or it could be 100111, 11001011; indeed there are 13 possibilities assuming I know there are two numbers represented, otherwise it could be anywhere from 1 to 14 numbers. Basically I’m screwed if I don’t know the length of the numbers in bits. I can’t add anything to separate them because all I have is zeros and ones, and I can’t use a zero or one as a separator because then I won’t have anything left to count with; you can’t count in base 1. Therefore in order to put numbers in storage I need to predetermine a length for those numbers in bits and stick to it. The general building block of information in computer science is the 8-bit byte, and that is why.

To summarize, we have numbers represented by 8-bit bytes and we have a mapping of those numbers to characters, assuming we are using US-ASCII. However, forcing all numbers to be 8 bits seems awfully wasteful. Let’s suppose I want to store the following sequence of 20 characters:

abababababababababab

Well, in this case I only need two numbers to do so, so I could just store them using a basic character binary encoding of 0=a and 1=b. Therefore rather than using 8 x 20 = 160 bits of storage I could just use 20 bits. To spell it out I could use

01010101010101010101

Rather than the US-ASCII equivalent (where a is 01100001 and b is 01100010):

0110000101100010
0110000101100010
0110000101100010
0110000101100010
0110000101100010
0110000101100010
0110000101100010
0110000101100010
0110000101100010
0110000101100010

This is a huge storage saving in percentage terms. It’s important to point out that an encoding maps one set of symbols to another: in the case of ASCII it is 8-bit bytes to common western characters, for Huffman it is variable length binary sequences to 8-bit bytes. Therefore what’s actually happening here is:

a (ASCII) -> 01100001 (8-bit byte) -> 0 (Huffman)

Now what if I have three distinct characters in my sequence:

ababababac

This presents a problem for the encoding technique above, because I have no obvious way to encode the ‘c’. If I use the next number in the binary sequence, namely 10, for c, then a sequence such as cba encodes as:

1010

which has ambiguous meaning: it could be baba, cc, bac or cba, so it is not much use. However, and here’s the key insight, suppose I only use zero/one sequences that don’t appear as the beginning of any other sequence to represent the characters. I pick 0=a and, because I now can’t use any code starting with 0, I must use something like 10=b, leaving 11=c. I can therefore unambiguously encode cba as:

11100

There is nothing else it can be but cba, and I don’t need to know the bit length of each number to decode it. To work the example: the first digit is 1; nothing is encoded as just 1, so I add the next bit to it and get 11; that’s c, and no other code starts with 11, so I can unambiguously decode it. Similarly with 10 and 0.
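
As a small Python sketch of that decoding loop, reading bits until what has been read so far matches exactly one code (which the prefix property makes unambiguous):

codes = {"0": "a", "10": "b", "11": "c"}

def decode(bits):
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        # Because no code is the start of another, the first time the
        # buffer matches a code we can emit it and start again.
        if buffer in codes:
            out.append(codes[buffer])
            buffer = ""
    return "".join(out)

print(decode("11100"))  # cba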

The algorithm

The aim of Huffman Coding is to create a shorter encoding than the original fixed width encoding. Indeed it is a little more than that: it is to get the shortest possible unambiguous encoding. How do we find this? Well, first a few things about Huffman Coding. The genius of the algorithm is that it is simple, it will always find the optimum (shortest possible) such encoding, and this encoding will always be less than or equal to the length of the equivalent 8-bit encoding.

Tree hugging

Firstly I’ll describe the algorithm, then I’ll do an example. As we know, a US-ASCII file is simply a sequence of 8-bit bytes, as is every other file on your computer (on the vast majority of modern computers). Therefore we have a sequence of bytes.

Huffman encodings use trees, Huffman trees, to describe their encoding. A Huffman tree is a binary tree, in that each node branches into at most two children.

So the algorithm:

  1. Count the number of occurrences of each byte in the sequence and put them in a list
  2. Sort that list in ascending order of frequency
  3. Pick the two lowest frequency entries off the top of the list and add them to the tree as two branches on the trunk
  4. Add the frequencies of those two nodes together and add that part of the tree back to the list, then sort the list again in ascending order of frequency
  5. If there is more than one item left in the list then go to step 3, otherwise you are done; the last item in the list is the completed tree (there’s a sketch of this in code just below)
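
Here’s a sketch of those five steps in Python, using a list that is re-sorted after every merge, just as described. Note that ties in frequency are broken arbitrarily here, so the exact 0/1 assignments can differ from the worked example below, but the code lengths (and therefore the compressed size) come out the same.

from collections import Counter

def build_huffman_tree(data):
    # Steps 1 and 2: count each symbol and sort ascending by frequency.
    # Each entry is (frequency, tree); a leaf is just the symbol and an
    # internal node is a (left, right) pair.
    nodes = [(freq, symbol) for symbol, freq in Counter(data).items()]
    nodes.sort(key=lambda entry: entry[0])

    # Steps 3 to 5: join the two lowest-frequency entries under a new
    # node whose frequency is their sum, put it back and re-sort,
    # until only the finished tree is left.
    while len(nodes) > 1:
        (f1, left), (f2, right) = nodes[0], nodes[1]
        nodes = nodes[2:] + [(f1 + f2, (left, right))]
        nodes.sort(key=lambda entry: entry[0])

    return nodes[0][1]

tree = build_huffman_tree("Mississippi hippies")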

Example

In this example we are going to encode the string: “Mississippi hippies”.

Here’s the frequency table:

[_,1]
[M,1]
[e,1]
[h,1]
[p,4]
[s,5]
[i,6]

Note I’ve substituted the underscore character “_” for [space].

So we take the two smallest values off the top and create the first part of our tree. Hopefully the notation is self-explanatory: the tilde (~) means that it is just a node and the letter before it is simply an identifier; the leaves show the character they represent and its frequency:

  z[~,2]
    / \
   /   \
[_,1] [M,1]

We add this back to the list

 [e,1]
 [h,1]
z[~,2]
 [p,4]
 [s,5]
 [i,6]

Then the next two lowest frequency items

  y[~,2]
    / \
   /   \
[e,1] [h,1]

Adding it back in gives

 y[~,2]
 z[~,2]
  [p,4]
  [s,5]
  [i,6]

Then the next two (which are both sub-trees) gives

        x[~,4]
          / \
       ---   ---
      /         \
   y[~,2]     z[~,2]
    / \         / \
   /   \       /   \
[e,1] [h,1] [_,1] [M,1]

Adding it back in gives

 x[~,4]
  [p,4]
  [s,5]
  [i,6]

Next two:

           w[~,8]
             / \
            /   \
         x[~,4][p,4]
          / \
       ---   ---
      /         \
   y[~,2]     z[~,2]
    / \         / \
   /   \       /   \
[e,1] [h,1] [_,1] [M,1]

Back to the list:

  [s,5]
  [i,6]
 w[~,8]

Almost there:

  v[~,11]
    / \
   /   \
[s,5] [i,6]

Back:

w[~,8]
v[~,11]

And the final tree:

                   u[~,19]
                   / \
                ---   ---
               /         \
            w[~,8]     v[~,11]
             / \         / \
            /   \       /   \
         x[~,4][p,4] [s,5] [i,6]
          / \
       ---   ---
      /         \
   y[~,2]     z[~,2]
    / \         / \
   /   \       /   \
[e,1] [h,1] [_,1] [M,1]

There we have it, our final Huffman tree. How do we use it? Well, in order to find the encoding for each letter we travel down the tree until we get to it. When we go left we add a 0, when we go right we add a 1. For example, to encode ‘e’, which is a rare character in our input string, we get the following:

0000

Or for ‘i’ which is very common in the input sequence we get

11

a much shorter encoding. From this tree we can build an encoding table as follows:

_  0010
M  0011
e  0000
h  0001
p  01
s  10
i  11
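
As a sanity check, here’s the final tree written down as nested (left, right) pairs and walked in Python, adding 0 for each left branch and 1 for each right branch; it reproduces the table above (the underscore again standing in for the space):

# The final tree from above: u = (w, v), w = (x, p), x = (y, z), etc.
tree = (((("e", "h"), ("_", "M")), "p"), ("s", "i"))

def code_table(node, prefix=""):
    # A leaf is a single character; an internal node is a pair.
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table

print(code_table(tree))
# {'e': '0000', 'h': '0001', '_': '0010', 'M': '0011',
#  'p': '01', 's': '10', 'i': '11'}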

Thus we can encode the original string “Mississippi hippies” into:

00111110
10111010
11010111
00100001
11010111
000010

Which is 46 bits rather than the (19 x 8) 152 of the original. We’ve compressed 19 bytes into 6 bytes (the last byte is zero padded).
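
A quick check of those numbers in Python, using the table above (with the space character written literally rather than as an underscore):

codes = {" ": "0010", "M": "0011", "e": "0000", "h": "0001",
         "p": "01", "s": "10", "i": "11"}

text = "Mississippi hippies"
encoded = "".join(codes[character] for character in text)

print(encoded)        # the 46 bits shown above
print(len(encoded))   # 46
print(len(text) * 8)  # 152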

A couple of notes:

  1. For input that uses almost every possible byte value with roughly equal frequency, compression will be very limited (large movie files, for instance).
  2. For some inputs certain bytes will get codes longer than 8 bits, but the shorter codes for the more frequent bytes will at least offset those
  3. The Huffman tree itself can be stored very efficiently using a fixed-length format: walk the tree depth-first, taking the left branch then the right, and repeat until the whole tree is traversed. When you hit a node that is a leaf, output 1 followed by its value (not the frequency); when it is a node that is simply a parent of others, output 0 followed by nothing (sketched in code below)
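
One way to read that third note, as a Python sketch over the same nested-pair tree: an internal node is written as a 0 followed by its two children in turn, and a leaf is written as a 1 followed by the character’s 8-bit value.

def serialise(node):
    # Internal node: 0, then the left child, then the right child.
    if isinstance(node, tuple):
        left, right = node
        return "0" + serialise(left) + serialise(right)
    # Leaf: 1 followed by the symbol's 8-bit value.
    return "1" + format(ord(node), "08b")

tree = (((("e", "h"), (" ", "M")), "p"), ("s", "i"))
print(serialise(tree))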

This was a pretty fiddly blog post, so please let me know in the comments or by email if I’ve made mistakes.

That’s it, I think.

Andy Hedges