Tuesday, April 30, 2013

GeoIP and SQL (Oracle)

I wrote a blog in CERT/CC on GeoIP in your SOC (Security Operations Center).  Here I am exploring a little bit more nuts and bolts (technical) details on optimal ways to use GeoIP data in SQL.  The idea is really optimize bulk lookup of IP addresses for geolocation with very quick response times for say thousands or millions of IP addresses.

The major providers of GeoIP data (MaxMind, Ipligence, Quova/Neustar, Ip2Location) all provide you with IP address data (IPv4 at-least) in integer format with a Start_ip and End_ip range followed by a related location where the IP address was seen.  Typically you load this data into SQL, and you can search using SQL "BETWEEN" clause to find if an IP address is present in this range.
The table schema (pseudo schema) look like 
LONGFROM => Start ip address (Long integer)
LONGTO =>   end ip address (Long integer)
ID => Location ID Index of some sort (integer)
METADATA [Latitude, Longitude,City,Country code etc..] => Either in another table OR in the same table (floating points and VARCHAR's).

The providers also recommend a 
UNIQUE INDEX(LONGFROM,LONGTO) => Unique index of longfrom and longto to efficiently query the table (Long integer x 2)

This works pretty well when finding a single IP address.  However if you want quick queries and say lookup a 100,000 or a million IP addresses, BETWEEN query lookup is not efficient.  Jeremy Cole's blog shows one way of using Geometrical approach of representing this data and achieving some fast query results.  Another option is some way of using a classB (/16) range as an index to look for an IP address (blogged in SQLite example).  Basically you can index a ClassB portion of IP address LONGTO and plan to use this as a searchable index.  In this writeup, I am going to show how to use SQL triggers while loading data and making an efficient searchable GeoIP SQL data. 


I have basically extended the above schema to create an index of classB portion of the (say) LONGTO column

IDX => LONGTO - MOD(LONGTO,65536)  (Long integer).

Now when you query for an IP address, you can basically look for ClassB(IPAddress) == ClassB(LONGTO), will bring you a quicker candidate results of IP blocks and you can refine the returned candidate list of rows using a BETWEEN clause.  However there is a problem, GeoIP is not so cleanly assigned in ClassB blocks (May be IPv6 will solve this with large /32 provided to organizations).  There are a number of IP addresses that fall outside this normal boundary.  Just analyzing Maxmind GeoLiteCity data,  there are a number of IP address that don't fall within one class B boundary.  Here is one example:
LONGFROM,LONGTO,ID
709566464,709885951,21248
(42.75.32.0),(42.79.255.255),[25.03,121.52,Taipei,Taiwan,TW]

This basically tells me that IP address starting from 42.75.32.0 and ending with 42.79.255.255 are in one geolocation (Taipei Taiwan in this example according to Maxmind GeoLiteCity data).  This spans multiple class B networks.  An index of Class B network of LONGTO field will not help me find all the IP addresses within this location block. If I were to use the index IDX concept above and say search for 42.79.*.*, I will get the ClassB network ID to be 42.79.0.0 or in long integer form 709820416 - there will be a match and search will return the correct location.  However using this index IDX when I search for 42.75.34.*, the class B index for this network is 42.75.0.0 (or 709558272 in long integer) and it will not match the IDX index field I have created for this block.  

One way to solve this would be to break up the above row in SQL into multiple Class B sections that are tied back to the same location ID.  So the single row will become multiple rows (5 rows in this case) and our SQL will work nicely to search for index a field called IDX -representing the class B networks to search by.

R. LONGFROM,LONGTO,ID
1. 709566464,709623807,21248
...
 5. 709820416,709885951,21248

1. (42.75.32.0),(42.75.255.255),[25.03,121.52,Taipei,Taiwan,TW]
....
5. (42.79.0.0),(42.79.255.255),[25.03,121.52,Taipei,Taiwan,TW]

All these rows can share a common location ID and unique IDX field representing each class B network (42.75.0.0/16, 42.76.0.0/16....42.79.0.0/16).  Thus we have an indexing scheme that looks like the scribble below in my office white board.

 

In the picture above blue boundaries show location boundaries as specified by Maxmind, the black lines represent class B network boundaries.  The red IDX{n}'s represents the class B indexes.  A search of "BETWEEN" clause now can be "primed" with an index of candidate locations.  In the example above, if you search for an IP address that is in Location 2 will get a candidate list of locations Loc1,Loc2 and Loc3.  However when you complete the search with an additional BETWEEN clause, you are narrowing down your search to Location 2.  The SQL might look something like

-- IPLONG is your IP ADDRESS in integer long format.
SELECT A.LATITUDE,A.LONGITUDE,A.CITYNAME,A.COUNTRYCODE FROM GEOIPCITY A INNER JOIN GEOIP B on B.ID=A.ID AND B.IDX=CLASSB(IPLONG) AND IPLONG BETWEEN B.LONGFROM AND B.LONGTO;

The trick is on ingest of the GeoIP data you pseudo code that processes the data like below

IF CLASSB(LONGFROM) != CLASSB(LONGTO):
   WHILE CLASSB(LONGFROM)!= CLASSB(LONGTO) LOOP :
      INSERT NEW ROW (GEOIP) LONGFROM,LONGTO=LONGFROM+65535,CLASSB(LONGFROM)
      LONGFROM=LONGFROM+65536
    END LOOP;
ELSE:
   IDX = CLASSB(LONGFROM)
END IF;

This can be done using database triggers.  Powerful database triggers are possible with Oracle, Postgress or MySQL.  If you plan to use SQLite, you may need to write python or perl code to breakup CSV from GeoIP providers into segments like mentioned above before your ingest. 

Let me walk you through how to set this up with an example of Oracle database and MaxMind free GeoLiteCity data.  
[I am going to leave the tablespace and other DBA details for you to figure it out in Oracle]

CREATE TABLE GEOIP
  (
  ID          NUMBER(10) NOT NULL,
  LONGFROM    NUMBER(38),
  LONGTO      NUMBER(38),
  IDX NUMBER(38)
  );
CREATE OR REPLACE INDEX FT_GEOIP ON GEOIP(LONGFROM,LONGTO);
CREATE OR REPLACE INDEX IDX_GEOIP ON GEOIP(IDX);
CREATE TABLE GEOIPCITY
  (
  ID          NUMBER(10) NOT NULL,
  COUNTRYCODE CHAR(2),
  REGIONCODE  CHAR(2),
  CITYNAME    VARCHAR2(255),
  POSTALCODE  VARCHAR(6),
  LATITUDE    FLOAT,
  LONGITUDE   FLOAT,
  METROCODE   NUMBER,
  AREACODE    CHAR(3)
  );

ALTER TABLE GEOIPCITY ADD (CONSTRAINT LOC_PK PRIMARY KEY (ID));

Now you are all set to get data in there?  Not yet, you need create a trigger that will add appropriate indexing from Maxmind's data.  The trigger below looks complicated but is not that bad.  You could optimize it further for sure.   As this happens only during the ingest it is not a big deal.  I wrote this down in a readable way so all the operations are clear to a DBA.

-- Lets build this trigger.
CREATE OR REPLACE TRIGGER geoipTrigger BEFORE INSERT OR UPDATE ON GEOIP
FOR EACH ROW
DECLARE 
tempIDX NUMBER(38);
tempN NUMBER(24);
nLONGTO NUMBER(38);
nLONGFROM NUMBER(38);
BEGIN
IF :new.IDX IS NULL THEN
SELECT FLOOR((:new.LONGTO-MOD(:new.LONGTO,65536)-:new.LONGFROM+MOD(:new.LONGFROM,65536))/65535) INTO tempN FROM DUAL;
IF tempN > 1 THEN
nLONGTO := :new.LONGTO;
nLONGFROM := :new.LONGFROM;
SELECT nLONGFROM-MOD(nLONGFROM,65536) INTO tempIDX FROM DUAL;
:new.LONGTO := tempIDX+65535;
:new.IDX := tempIDX;
WHILE tempN > 1 LOOP
nLONGFROM := tempIDX+65536;
SELECT nLONGFROM-MOD(nLONGFROM,65536) INTO tempIDX FROM DUAL;
INSERT INTO GEOIP (ID,LONGFROM,LONGTO,IDX) VALUES 
(:new.ID,nLONGFROM,nLONGFROM+65535,tempIDX);
tempN := tempN-1;
END LOOP;
nLONGFROM := nLONGFROM+65536;
SELECT nLONGFROM-MOD(nLONGFROM,65536) INTO tempIDX FROM DUAL;
INSERT INTO GEOIP (ID,LONGFROM,LONGTO,IDX) VALUES (:new.ID,nLONGFROM,nLONGTO,tempIDX);
ELSE 
SELECT :new.LONGTO-MOD(:new.LONGTO,65536) INTO tempIDX FROM DUAL;
SELECT :new.LONGFROM-MOD(:new.LONGFROM,65536) INTO tempN FROM DUAL;
IF tempN != tempIDX THEN
DBMS_OUTPUT.PUT_LINE('Data to be entered is '|| tempN || ',' || tempIDX || ',');
nLONGTO := :new.LONGTO;
:new.LONGTO := tempN+65535;
:new.IDX := tempN;
INSERT INTO GEOIP (ID,LONGFROM,LONGTO,IDX) VALUES
        (:new.ID,tempN+65536,nLONGTO,tempIDX);
ELSE
:new.IDX := tempIDX;
END IF;
END IF;
END IF;
END;
/

Now when you load Maxmind CSV data into SQL, there is something to watch out for.   Maxmind GeoLiteCity data comes with two CSV files.  Lets say for example you download GeoLiteCity_20130101.zip  file from Maxmind which contains two files like below
 70707361  01-02-2013 18:46   GeoLiteCity_20130101/GeoLiteCity-Blocks.csv
 20796540  01-02-2013 18:46   GeoLiteCity_20130101/GeoLiteCity-Location.csv

You can load GoeLitCity-Locations with a simple sqlldr script: 
cmd>sqlldr USERID=scooter CONTROL=GeoLiteCity-Locations.ctl 
-- File GeoLiteCity-Locations.ctl below
Load DATA CHARACTERSET WE8ISO8859P1
INFILE 'GeoLiteCity_latest/GeoLiteCity-Location.csv'
REPLACE
INTO TABLE GEOIPCITY
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  ID          INTEGER EXTERNAL,
  COUNTRYCODE CHAR,
  REGIONCODE  CHAR,
  CITYNAME    CHAR,
  POSTALCODE  CHAR,
  LATITUDE    DECIMAL EXTERNAL,
  LONGITUDE   DECIMAL EXTERNAL,
  METROCODE   INTEGER EXTERNAL,
  AREACODE    CHAR
)

However to load GeoLiteCity-Blocks there is some trick to this(because we have a trigger that needs to inspect each row): 
cmd>sqlldr USERID=scooter CONTROL=GeoLiteCity-Blocks.ctl rows=1
-- File GeoLiteCity-Blocks.ctl below
Load DATA CHARACTERSET WE8ISO8859P1
INFILE 'GeoLiteCity_latest/GeoLiteCity-Blocks.csv'
REPLACE
INTO TABLE GEOIP
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
LONGFROM    INTEGER EXTERNAL,
LONGTO      INTEGER EXTERNAL,
ID    INTEGER EXTERNAL
)

As Oracle does not have INET_ATON function like MySQL, here is a little package called UTLS with function that can be also added to make IP string conversions to long integer easy and calculating class B index of an IP address easy; 



CREATE OR REPLACE PACKAGE UTLS AS
FUNCTION INET_ATON(IP_STRING VARCHAR2) RETURN NUMBER;
FUNCTION INET_NTOA(IP_NUMBER NUMBER) RETURN VARCHAR2;
FUNCTION CLASSB(IPLONG NUMBER) RETURN  NUMBER;
--Below function is overloaded version of IPSTRING 
FUNCTION CLASSB(IP_STRING VARCHAR2) RETURN NUMBER;
FUNCTION SEARCHID(IP_STRING VARCHAR2) RETURN NUMBER;
END UTLS;
/
CREATE OR REPLACE PACKAGE BODY UTLS AS
INET_A NUMBER;
FUNCTION INET_ATON(IP_STRING VARCHAR2) RETURN NUMBER
IS INET_ATON NUMBER;
BEGIN
RETURN
--Basically break IP string into 4 numbers multiple by 2^24 2^16, 2^8,2^0 respectively
TO_NUMBER(REGEXP_SUBSTR(IP_STRING, '\d+', 1, 1)) * 16777216 + 
TO_NUMBER(REGEXP_SUBSTR(IP_STRING, '\d+', 1, 2)) * 65536 +
TO_NUMBER(REGEXP_SUBSTR(IP_STRING, '\d+', 1, 3)) * 256 + 
TO_NUMBER(REGEXP_SUBSTR(IP_STRING, '\d+', 1, 4)); 
END INET_ATON;
FUNCTION INET_NTOA(IP_NUMBER NUMBER) RETURN VARCHAR2
IS INET_NTOA NUMBER;
BEGIN
RETURN 
TO_CHAR(BITAND(FLOOR(IP_NUMBER/16777216),255))||'.'||
TO_CHAR(BITAND(FLOOR(IP_NUMBER/65536),255))||'.'||
TO_CHAR(BITAND(FLOOR(IP_NUMBER/256),255))||'.'||
TO_CHAR(BITAND(IP_NUMBER,255));
END INET_NTOA;
FUNCTION CLASSB(IPLONG NUMBER) RETURN NUMBER 
IS CLASSB NUMBER;
BEGIN
RETURN 
IPLONG-MOD(IPLONG,65536);
END CLASSB;
FUNCTION CLASSB(IP_STRING VARCHAR2) RETURN NUMBER 
IS 
BEGIN
   RETURN(CLASSB(INET_ATON(IP_STRING)));
END CLASSB;
FUNCTION SEARCHID(IP_STRING VARCHAR2) RETURN NUMBER 
IS ID NUMBER;
BEGIN
SELECT ID INTO ID FROM GEOIP WHERE IDX=CLASSB(IP_STRING) AND INET_ATON(IP_STRING) BETWEEN LONGFROM 
AND LONGTO;
 RETURN(ID);
EXCEPTION WHEN NO_DATA_FOUND THEN RETURN(NULL);
END SEARCHID;
END UTLS;
/

This is the end of the packages functions and utilities. Now lets try if we really made a good impact by searching my old ISP's IP address:

SQL> SET TIMING ON;
SQL> SELECT LATITUDE,LONGITUDE,COUNTRYCODE FROM GEOIPCITY WHERE ID=(SELECT UTLS.SEARCHID('204.11.35.65') FROM DUAL);

  LATITUDE  LONGITUDE CO
---------- ---------- --
    42.556   -83.1178 US
Elapsed: 00:00:00.04

SQL> SELECT LATITUDE,LONGITUDE,COUNTRYCODE FROM GEOIPCITY WHERE ID=(SELECT ID FROM GEOIP WHERE (SELECT UTLS.INET_NTOA('204.11.35.65') FROM DUAL) BETWEEN LONGFROM AND LONGTO);

  LATITUDE  LONGITUDE CO
---------- ---------- --
    42.556   -83.1178 US
Elapsed: 00:00:00.14



This does not seem impressive till you start searching 1000 or a million IP addresses.  I have also tested with MySQL and still havent finished my trigger code for Postgress.  One trick with MySQL is you cannot use a database trigger to insert a new row into the same table.  So... well you just create another temporary table where you can throw the new rows and do INSERT INTO GEOIP SELECT * FROM TEMPGEOIP; - you get the idea! 

Viola! Enjoy searching for your visitors or attackers using Geoip in SQL RDBMS store.








No comments:

Post a Comment