Sudoku Solving in R

I have decided to try and wrap my arms around R, the statistical programming environment that everyone seems to be talking about these days.  It’s been around since the ’70s, so probably people have been talking about it for decades, but it just hasn’t made its way into my daily life until now.

I am not a statistician, nor do I have any background whatsoever in statistics.  As a philosophy major, my foray into business classes was limited – in fact, I can’t think of a single class I took in the college of business.  I have faked it somewhat in my career to date, forging formulas in Excel that would probably have been more elegant, efficient, and easier with one introductory class to statistics.  So, I recognize a need to become more versed in the art.  What better way than to dive into such a vital programming environment, especially considering my love of programming itself?

One of my favorite ways to learn a new language is to program a Sudoku solver using it.  I have used Excel VBA, PHP, and C++ to do this (those scripts have long been lost to time), and now have done it with R (which script I will post here so that it is never lost).  When writing a Sudoku solver, you are forced to understand the syntax for variables, arrays, functions, loops, and conditionals, so once done, you have a firm footing to begin learning the more complex aspects of a language.

Please note, I wrote this while learning R.  Probably, I will look upon this script in a year’s time and wonder why I did some of these things, noting inefficiencies and poor constructs.  But, for now, it works, and that is all that matters to me.

To start, R speaks mainly in terms of “vectors”.  This is foreign to me, as I always thought of a vector in the mathematical term, something with both direction and magnitude.  In my experience, what R calls “vectors”, I would call “arrays”.

sud <- c(3, 7, 0, 0, 5, 8, 0, 0, 0, 0, 0, 0, 3, 0, 0, 7, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 6, 0, 3, 0, 0, 0, 7, 0, 2, 0, 1, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 7, 6, 0, 0, 0, 0, 4, 0, 4, 0, 0, 0, 2, 0, 8, 0, 0, 0, 0, 5, 6, 0, 0, 0, 0, 0, 0, 1, 7, 0, 0, 9, 0, 0)

This is how you define vectors, and means the variable sud contains a vector of these 81 numbers.  These 81 numbers are the Sudoku puzzle itself, where blanks are represented as 0s and the numbers are transposed by column, starting at the top left cell, moving down to the bottom left cell, then moving across the puzzle.  I went by column instead of by row because R handles matrices these way – columns, then rows.  This will come in handy later.

locked_cells <- sud > 0

When I realized R was capable of this kind of simplicity, I started to fall in love with it. This single line of code creates a new vector of name locked_cells and populates it with a boolean for each value in sud based on whether it is greater than 0 or not. Since I replaced blanks with 0s when transposing the puzzle, this indicates that a value is a given, hence it is “locked”.

sqrs <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 4, 4, 4, 5, 5, 5, 6, 6, 6, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7, 7, 7, 8, 8, 8, 9, 9, 9)

rows <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9)

cols <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9)

This is where I wish I had a firmer base in mathematics.  It is probably a simple matter to, using the index of the cell of the puzzle we are solving for, calculate the square, row, and column we are in using some kind of algorithm.  Since I am a just a lowly liberal arts major, I have to identify these vectors manually.  So, as we are solving for various cells – say, the 15th cell, which would be the second column, sixth row – we can use the 15th value in sqrs (3) to identify which square we are in (squares, like cells, are also ordered top-to-bottom then left-to-right), the 15th value in rows (6) to identify which row we are in, and the 15th value in cols (2) to identify which column we are in.

Since we will know which square, row, and column a given cell is in, we can validate its value based on whether or not it is repeated by other members of that square, row, or column using this function:

test_val <- function(inp, cells) {
  result <- c(0)
  for(i in 1:9) {
    # iterate through each cell in the list
    # identify if the value of the current cell is already used
    if(inp != cells[i]) {
      if(!is.na(sud[cells[i]])) {
        if(sud[inp] == sud[cells[i]]) {
          result <- c(1)
          break
        }
      }
    }
  }
  result
}

Given an index within the Sudoku puzzle (between 1 and 81) and the indices of the cells we need to check for validation, this function will simply see if the value of the index cell is repeated within any of the other cells.

Note, quickly, the difference between [] and ().  When referencing a value within a vector, we use [].  When passing variables to a function, we use ().  I had many bugs while writing this where I used one in place of the other accidentally.

Since we need to check for duplication with rows, columns, and squares, we need a wrapper function as so:

check_cell <- function(inp) {
  result <- c(0) # default to no errors

  # if the value is 0, it is invalid automatically
  if(sud[inp]==0) {return(1)}

  # if the value is 10+, it is invalid automatically
  if (sud[inp] >= 10) {return(1)}

  # first, get square
  q <- sqrs[inp] # this is the square we are in
  cells <- which(sqrs %in% c(q)) # this is all the cells of the sqaure
  result <- test_val(inp,cells)
  if(result == 1) {return(1)}

  # next, get row
  q <- rows[inp]
  cells <- which(rows %in% c(q))
  result <- test_val(inp, cells)
  if(result == 1) {return(1)}

  # next, get column
  q <- cols[inp]
  cells <- which(cols %in% c(q))
  result <- test_val(inp, cells)
  if(result == 1) {return(1)}

  return(0)
}

The line cells <- which(sqrs %in% c(q)) (and the subsequent lines referencing rows and columns) is another one of those beautifully simple commands that made me start to fall in love with this language.  R has a built-in search function called which() that returns a vector of indices matching the condition in a given vector.  So, this line returns the indicies of sqrs that are of the same value as the vector we previously identified as holding the value of the current square.  Thus, we can pass this new vector (cells) to the test_val function.

Then we come to the main part of the script, which iterates through each of the 81 cells, adding 1 to the current cell (if it isn’t locked) and checking for validity before moving on to the next cell.  If the current cell is invalid, it adds one again and checks for validity (and so on).  If the value it reaches is 10 (invalid because we know solutions can only be between 1 and 9), it sets the value back to 0 and moves backwards one step to the most recently-used unlocked cell and adds one to it (then checks for validity, and so on).  This is brute force at its best, and is wildly inefficient, but always gets the job done.  I like brute force methods for those very reasons.  They may not be pretty, but they are infallible (as long as the puzzle given is possible to solve).

s <- 1

repeat {
  if(!locked_cells[s]) {
    sud[s] <- sud[s] + 1
    while(check_cell(s) > 0) {
      sud[s] <- sud[s] + 1
      if(sud[s] >= 10) {
        sud[s] <- 0
        s <- s - 1
        if(s==0) {
          break
        } else {
          while(locked_cells[s]) {
            s <- s - 1
            if(s==0) {break}
          }
        }
        sud[s] <- sud[s] + 1
      }
    }
  }
  if(s == 81 || s < 0) {
  break
}

  s <- s + 1
}

At this point, the puzzle is solved.  The vector sud has been modified and all 0s have been replaced with values that fit the puzzle.  In order to display it for humans to read easily, we need to translate it into a matrix.  Since I started with a matrix transposed to a vector, all I need to do is tell R to transpose it to a matrix:

sud_matrix <- matrix(sud,nrow=9,ncol=9)

Then, R allows a variable to be printed to the console just by mentioning it.

sud_matrix

If you have any questions, please feel free to leave a comment below.  Like 90% of my posts, this was left here mainly so I wouldn’t lose it.  But, I couldn’t help myself from waxing poetic about the language itself.  It truly is a lovely language, and I am excited to learn more about it – and statistics – as my self-led education continues.

Postscript

I modified the code slightly to understand how long a given puzzle would take to solve. The easy puzzle used in the above post took my relatively modern (though 32-bit) laptop 10 seconds to solve. The extremely difficult puzzle from this link took 13 minutes and 28 seconds.  I never said this code was fast, just that it worked!

Advertisements

SSH Tunnels, SOCKS proxies, and Sticking It to IT

Today I achieved perhaps my greatest triumph in my eternal struggle against the shackles my IT department places on my laptop. Aside from the inability to install programs needing admin privileges, the greatest strife I have with IT is the restrictions in places on certain websites. To name the most basic example, I cannot access Facebook. Not that I want to, I just want the ability to. I also can’t access various hacking blogs, anything having to do with brewing beer – anything to do with alcohol at all, for that matter – even some personal blogs are restricted as they are not necessary for my job function. Nuts to that!

This post is the result of many steps that didn’t really have the ultimate goal of sticking it to IT, so it may be a little tough to get through. There may be other ways of achieving the same things, for instance. This is just how I did it, and I’m posting it here in case I need to do it again – for instance, if IT replaces this laptop, or if I have to recreate my server at home.

Software that will be mentioned in this post:
CopSSH (https://www.itefix.no/i2/copssh)
PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html)
MyEnTunnel (http://nemesis2.qx.net/pages/MyEnTunnel)
Google Chrome (http://www.google.com/chrome)

Get SSH on to that Windows server

I used Linux on my home server for a long time. Then I got an AMD-based GPU and realized that Linux drivers currently suck and things would just be a helluvalot easier for me to install Windows 7. So I did. Then I went to connect to the terminal via SSH and realized there is no SSH server on Windows as there is on every default installation of Linux out there right now. I used CopSSH to recreate the experience of logging into a terminal remotely for server maintenance.

I can’t remember if there were any problems with this install. As I remember, everything went very smoothly. A couple of caveats:

1) Forward port 22 on your router in order to access your SSH server from outside your network. Port forwarding is beyond the scope of this post.

2) CopSSH starts an SSH service that allows you to connect to a Cygwin interface. In order to access your standard windows drives from the terminal once you log in, navigate to /cygdrive. There you will see folders named after your drive letters (c, d, e, etc.), and you can access your files there.

The IT department at my office blocks a lot of ports, but not 22 for some reason. I suppose it might be a necessary port for them to use for pushing program installs or something. For whatever the reason, with CopSSH installed at home, I can navigate my hard drive from work and stop/start services, create/edit/remove files, copy files to/from (with WinSCP), and, most importantly for this post, perform some SSH tunneling.

Tunnels and the Men Who Love Them

SSH tunneling is something I discovered (and fell in love with) only recently. Therefore, it is a bit tough to describe since I know so little about it. Essentially, you attach a rider to your SSH connection that says for x port on the remote server, treat it as y port locally. For instance, let’s say (since I do) that you have SABnzbd+ installed and running at home, and that the web frontend is running on port 8080. With an SSH tunnel, I tell my work computer to bridge port 9080 to port 8080 on my home computer. Once that bridge is made (bridges, tunnels, what’s the difference), I can open a web browser to localhost:9080 and receive the content that I would normally see at home using localhost:8080. This is particularly important to me, because the only external port I am able to reach at work through a web browser is port 80. Since tunneling treats it as a local port, I am free to access my SAB queue from work.

So how do you do it? Well, first you need to install PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). I’m sure there are other programs out there that do this, but just use PuTTY. It’s free, awesome, and ALWAYS works. You just need to download the Putty.exe file and place it in your path somewhere. While you’re there, download Plink.exe as well, since we’ll be using that later.

Here are the important pieces to configure when setting up a server in PuTTY:
1) host name is your external IP address at home (or your DNS name if you have one registered)
2) port is (almost) always 22
3) Tunneling is done under Connection->SSH->Tunnels
a) Source Port is what you want your local computer to use (9080 in the SAB example)
b) Destination is IP:RemotePort (xxx.xxx.xxx.xxx:8080 in the SAB example)
c) Click “Add” and this rule will be placed in the listbox under “Forwarded ports”
4) After you set up tunneling, go back to Session (at the top of the list box on the left) and save this connection as something memorable. “Tunnels” is usually a pretty good name.

Tunneling the Internet

So now you know how to tunnel a specific port from one computer to another. But what about using your home computer to do your web browsing and forwarding that content back to your work computer? For instance, what if I want to enter Facebook.com into my browser at work and have my home computer be the one looking up that web site? What if I want to do this for ALL internet content so that the IT department can never go snooping through logs of what I’ve been browsing on the web?

Simple. Go back into PuTTY and load up the Tunnels saved session you just created. Go back to Tunnels and add this:

Source Port: 9870
Destination: (blank)
Dynamic checked (not Local or Remote)

This will forward all dynamic traffic from the home to port 9870 at work. You can use a different port if you want, just make sure you remember what it is. Now, if you have access to your proxy settings or use Firefox (which ignores Windows proxy settings), you can go ahead and set those up as you normally would. You’re using a SOCKS connection, and the address is 127.0.0.1:9870. But if you don’t have access to your proxy settings (like me), and you like to use Chrome (like I do) instead of FireFox, you have to do a little extra work.

First, install Chrome. Most of you already have, I’m sure. Find your Chrome installation folder (might have to do a little googling to discover where Chrome installs). Create a new shortcut for that executable and add the following parameters to the end of it:

–proxy-server=”socks5://127.0.0.1:9870″

For me, the whole shortcut looks like this (in Windows XP)

“C:\Documents and Settings\(USER NAME)\Local Settings\Application Data\Google\Chrome\Application\chrome.exe” –proxy-server=”socks5://127.0.0.1:9870″

Add that to your desktop, start menu, whatever, and viola, you have proxy-fied internet every time you open Chrome. Of course, you will first have to open a session with PuTTY in order to open that tunnel to port 9870. That means that you will have a PuTTY window open all day, and your boss might start to get suspicious. What then?

Why I Love Hackers

Since PuTTY and all of its ancillary software is free and open, hackers are able to create really cool things like MyEnTunnel (http://nemesis2.qx.net/pages/MyEnTunnel), which stands for My Encrypted Tunnel. This program runs as a system tray icon that opens a saved PuTTY session and monitors it to make sure it stays connected. When the connection drops, it will retry as long as you tell it to. Your boss will be none the wiser.

1) Install and start the program.
2) SSH Server: Tunnels (or whatever you called your PuTTY session, above)
3) SSH Port: 22 (almost always)
4) username and password are obvious
5) Reconnect on Failure is a good idea, and Infinite Retry Attempts is preferable as well
6) IMPORTANT – I found that the Plink.exe file that came with MyEnTunnel simply did not work. I overwrote it with a current download from the PuTTY site. This is why I suggested you download it above.

Then connect, and you are in the mighty perpetual hands of an SSH tunnel. As long as the little icon in the system tray glows green, you will be able to open your Chrome shortcut and browse freely.

Final Thoughts

Yes, this does lead to a slower internet connection. You are browsing through a remote computer – what did you expect? However, I have thus far found this to be worth the trouble. You can add a –proxy-bypass-list=”aaa;bbb;ccc” parameter to your chrome shortcut if you visit sites like Pandora and need high bandwith available.

Also, I realized that with all internet traffic tunneled, I no longer needed the individual ports tunneled. For instance, rather than opening localhost:9080 for my sab queue, I can just open 192.168.1.100:8080 (i.e. the internal IP of the server at home). This is because, effectively, I am browsing from my home network.

I wrote this post somewhat haphazardly, just trying to set down my thoughts before they vanish and I have to figure this out all over again. If you would like clarification, leave me an email, and I will update the post or answer your question(s).

Using a Named Range to Fill a Data Validation List

I have to admit, this one took a little researching on my part. My boss stormed into my office with a quizzical look on her face (a usual occurrence) and said she was having difficulty doing something in Excel. She’s normally pretty good with Excel, so I know when that is her problem, it’s not going to be an easy fix. She proceeded to go to my white board in order to sketch out her problem (another indication that this was going to be a doosie) and then stumbled over her own words and drawings (the final nail in the coffin…this was going to be a fun one).

In a nutshell, her problem was this: she’s creating an interactive spreadsheet where the user selects an item from Dropdown A and then picks a different item from Dropdown B. Dropdown B is populated with items based on what was selected in Dropdown A. A simple example would be when you go to one of those websites that sells car parts or quotes car values, they first ask you for the make of the car, and then ask you for the model. The “model” dropdown only includes cars of the given make.

Shouldn’t be a problem, thought I. I’ll just write a macro that fires when the first dropdown is changed and populates the second dropdown accordingly. “And I can’t use a macro,” she said, “because it’s going to executives.” Ah, executives. The struggle to explain how to run a macro when Excel defaults to disabling them is an eternally frustrating one. And macros always tend to fail at the most embarrassing times.

So, what to do now. I knew that I could name a range, and I knew that I could use a named range for the data validation. What I did not know was this little gem of a function:

=INDIRECT()

This function looks at the value of a cell and considers the text of it, not the reference to it. I sincerely doubt that I’m explaining that correctly. Essentially, we are going to use the value of the first drop down to define the named range that is to be used in the second drop down. Simple.

These are the lists we will be using for this example. The first column is car manufacturers (makes) and the second list is models. Note that the data does not define which models are made by which manufacturers – that is going to be defined by named ranges.

Named Ranges

First, split up the models column into the named ranges (makes) the models represent. For instance, cells I2:I5 are Fords, and cells I6:I8 are Toyotas, so the named ranges should be “Ford” and “Toyota”.

Named Ranges Defined

Next, use the first list of makes as a List Data Validation for your first dropdown. Do this in cell A3.

Data Validation

Finally, and here comes the magic, use an INDIRECT formula for the List Data Validation of the second dropdown. This will use the first dropdown’s value as the name of the Named Range used to populate the second dropdown.

Equals Indirect

That’s it! When a user selects an item from the first dropdown, let’s say “Ford”, the second dropdown is populated with items from the named range called “Ford”. Note that List Data Validation needs to be a one dimensional array. If you require a second dimension to your data, consider a VLOOKUP to be used in conjunction with the above method.

As always, I hope I was explicit enough with my post, but if you have any questions or problems with this method, please do not hesitate to post comments below.

Drive Time Calculations in Excel

Building on my most popular post, Getting Information From the Web Using Excel VBA, I had the need in a recent project to calculate drive times on many rows of data, and decided to build a function in Excel to handle the dirty work for me using Google Maps.

Essentially, we had 200 or so latitude/longitude points and needed to see which of six addresses were closer from a drive time perspective. Luckily, Google came to the rescue once again, because one can use latitude/longitude coordinates in lieu of an address in order to get directions (which includes drive time) to a physical address (or, I suppose, a second set of latitude/longitude coordinates). What’s more, the URL for google maps to give you this information is fairly simple:

http://maps.google.com/maps?q=from: [PointA] to: [PointB]

So, once I dug through the HTML code behind the google maps directions results to find the div element containing the drive time of the quickest route, it was easy enough to create the following function. The two parameters are the addresses, zip codes, coordinates, or whatever else Google will allow you to use to approximate the starting and ending points.

Function DriveTime(PointA As String, PointB As String)

  Dim myURL As String
  myURL = _
    "http://maps.google.com/maps?" & _
    "&q=from: " & PointA & " to: " & PointB

  Dim inet1 As Inet
  Dim mypage As Variant

  Set inet1 = New Inet
  With inet1
    .Protocol = icHTTP
    .URL = myURL
    mypage = .OpenURL(.URL, icString)
  End With
  Set inet1 = Nothing

  Dim intStart As Double, intEnd As Double
  intStart = InStr(mypage, "<div class=""altroute-rcol altroute-info"">") + 41
  intEnd = InStr(intStart, mypage, "</div>") - intStart
  DriveTime = Mid(mypage, intStart, intEnd)

End Function

It took about 45 seconds to calculate the drive time for appx. 6 * 200 or 1,200 routes. My only other option was to copy and paste those 1,200 coordinates one-by-one into Google maps and retype the drive time. Which probably would have taken at least half a day. I call that a win.

Feel free to use this function for any number of drive time calculations, but I would assume that at some point Google will pitch a fit that your IP is sending so many requests to its map server. However, it didn’t bat an eye at my 1,200 queries, so who knows? Enjoy!

Next Unique and Previous Unique

Why this isn’t already a function of Excel baffles me.  Perhaps my work is somewhat unique.  I generally work with very large datasets (over 100,000 rows, often times nearing the 1.08 million row limit of Excel 2007), and I often find myself needing to scroll through the data in order to find the next value in a series.

Consider a spreadsheet with three columns: Fruit, Name, and Score.  Imagine this is populated with the data of a survey of 100,000 people and their opinions of Apples, Bananas, and Oranges.  You can assume that with three fruit types, there would be 300,000 total rows.  Now imagine opening this file in Excel, and imagine it is already sorted by Fruit.  How would you go about locating the first entry of Bananas?

The obvious method is to simply use the scroll bar and drag down until Bananas appear.  A second method would be to press and hold Page Down until this section appears.  In either case, it is far too easy to surpass the intended row, requiring a similar method in the opposite direction.  This can sometimes force you to go back and forth a few times until homing in on that row.

Now imagine that you just want to access the last row in the data, regardless of the contents of the cells.  Just type Ctrl + Down Arrow, and you’re there.  Wouldn’t it be nice if there were a similar function for skipping down to the next unique value?  Well, there can be!

Consider the following VBA functions:

Sub findFirst()
   Dim targetString as String
   targetString = ActiveCell.Text

   If ActiveCell.row = 1 Then
      Exit Sub    
   ElseIf Cells(ActiveCell.row - 1, ActiveCell.Column).Text <> targetString Then
      Cells(ActiveCell.row - 1, ActiveCell.Column).Select
   Else
      ActiveSheet.Columns(ActiveCell.Column).Find(targetString, 
        LookIn:=xlValues).Select
   End If
End Sub
Sub findLast()
   Dim targetString As String
   targetString = ActiveCell.Text

   If Cells(ActiveCell.row + 1, ActiveCell.Column).Text <> targetString Then
      Cells(ActiveCell.row + 1, ActiveCell.Column).Select
   Else
      ActiveSheet.Columns(ActiveCell.Column).Find(targetString,
        SearchDirection:=xlPrevious, LookIn:=xlValues).Select
   End If
End Sub

In both instances, we’re using the built-in function Excel uses to find values, similar to using Find or Find/Replace.  In the first function, we first check to make sure we’re not already at the first row, and if not, we use the find function to locate the first instance of the current value in the current column.  So, using our original example, if we were midway through the entries of Apples and wanted to return to the first Apple entry, this function would look at the Fruit column and search for the first instance of “Apple”.

The second function does a similar act, but uses the xlPrevious search direction.  Therefore, it starts at the first cell in the current column, and searches backwards for the current value.  This requires Excel to start from the bottom of the spreadsheet and search upwards.  So, if we are halfway through the “Apple” entries and use this function, Excel will start at cell A1048576 and search upward until reaching cell A200001, which would be the last hypothetical instance of “Apple”.

In either function, if the cell directly above or below (respectively) the active cell has a different value than the current value, that cell is simply selected.

HAVING Keyword in SQL (or How to Calculate Loyalty with One Sentence)

This was a lot easier than I thought it would be.  For some back story – but without revealing too much about my industry – in our company, we have individuals who provide us with a great deal of business.  However, they provide our competitors with a great deal of business as well.  They are not contracted to give us all of their business, and there’s really nothing we can do aside from creating solid business relationships in order to assure that they do give us at least most of their business.

So in order to easily determine which individuals are giving us most of their business (and we define that as over 70% of their business), I wrote something very similar to the following simple SQL statement (for MySQL):

1. SELECT individual_id, count(1) AS total_volume,
2.  sum(case business when 'Us' then 1 else 0 end) as our_volume
3. FROM main_data_table
4. GROUP BY individual_id
5. HAVING our_volume / total_volume >= 0.7;

Note: the line items are not part of the SQL code – they were added to make explaining it easier.

I don’t normally do a lot with the HAVING keyword, but I suppose I ought to.  It comes in really handy in this example.  Here’s the breakdown of what’s going on:

1 : This is a stupid way of doing it (not the actual way I chose) but it essentially gives you a total number of line items for the individual_id.

2 : Two things are going on here.  First, we tell MySQL that when the line item represents our business (i.e. the center where the transaction happened was one of our centers), count it as 1.  Otherwise, count it as 0.  That way, when they are aggregated (using the sum() function), the result is the total number of line items for the individual_id that were completed at one of our centers.

5: This limits the results to those individual_id’s that produced at least 70% of their business at one of our centers.  You must use the HAVING keyword when you want to limit the results to a calculated aggregate qualifier.