Tuesday, June 3, 2014

Integrating Google's BigQuery into your Security Operations Center (SOC).

In security operations it is not uncommon to need access to large datasets and to analyze them.  The obvious answer is to store them in a big data solution.  If you get into the business of building your own big data platform, however, you will find the many technical details a distraction from running your core service - security analysis in this case.

You could consider using Amazon's or Google's cloud as your computing platform, so you can scale easily by storing data in the large infrastructure the cloud provider has already built and keep your focus on analysis. If you are uncomfortable putting something sensitive into the cloud, such as security or system logs or network flows that could contain sensitive information, you can start by pushing public data that is useful to your SOC into a managed cloud system.  There are several such datasets that are valuable for security operations and analysis, for example the DNS Census 2013, the Internet Census 2012 and the EFF SSL Observatory. This is a way to use the public cloud and familiarize yourself with it before you are comfortable putting any sensitive data into the cloud for analysis.

I did this experiment by taking the DNS Census 2013 data, about 2 billion rows and roughly 150 GBytes, and loading it into Google's BigQuery system.  My idea was to store the data in BQ and then share the search/query results via a widget/gadget in a portal such as Shindig, Kibana or another suitable dashboard framework, for consumption by a security analyst.  Let me walk you through the experience.

Google's BigQuery was simply the system I chose to bring this together, as it behaves a lot like a columnar database such as ICE (Infobright), ParAccel or Pivotal Greenplum - which are ideal for text data with lots of repetition, like DNS. Once you have walked through BigQuery's basic setup (billing, acknowledgements, licensing and so on), you will find simple tools for getting data into and out of BigQuery, both command-line tools and a web portal.  One key thing to remember is that all of your actions are "billable" in Google's model, and therefore must be tied to a user: either an API user (a service account) or an interactive user authenticated through the OAuth framework.

Once I had the basic Google BigQuery setup done, I created a project called "Census", gave it an "id" and enabled billing for the BigQuery component of the project (and for Cloud Storage if you plan to use it for temporary file storage).  I also had to install Google's Cloud SDK on my Linux system along with the appropriate Python libraries.
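For reference, on a Linux system the basic setup boils down to something like the following.  This is only a rough sketch; the exact commands can differ between Cloud SDK versions, and the only value taken from this write-up is the project id census2013.

[user1@homepc] curl https://sdk.cloud.google.com | bash     # installs the SDK under ~/google-cloud-sdk
[user1@homepc] gcloud auth login                            # interactive OAuth login for your user
[user1@homepc] gcloud config set project census2013         # make the project the default for later commands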

Simple enough; then came the part of uploading data into Google.  I found that Google's BigQuery is not really that big on the ingest side, as it has several limitations in accepting data (currently a file has to be less than 4 GBytes if the data has new-lines embedded in it) and data cannot be pushed into the BigQuery engine via HTTPS POST using their tools if it is really big (over 5 GBytes).   You can always get the data into Google Cloud Storage (another service from Google) and use the import interface in the web console to load it from there.  Assuming the data can be split into 4 GByte chunks, I will show you the command-line way of getting data into Google's BigQuery with a predefined schema, and then ways to search and report on the data.
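If you go the Cloud Storage route instead, the staging flow is roughly as follows, using the same mx.txt file and schema as in the split example further below.  The bucket name census-staging is a hypothetical placeholder; note also that the dataset has to exist before any load, which bq mk takes care of.

[user1@homepc] ~/google-cloud-sdk/bin/bq --project_id=census2013 mk dnscensus2013   # create the dataset once
[user1@homepc] gsutil mb gs://census-staging                                        # hypothetical staging bucket
[user1@homepc] gsutil cp mx.txt gs://census-staging/mx.txt
[user1@homepc] ~/google-cloud-sdk/bin/bq --project_id=census2013 load --allow_quoted_newlines=1 \
    dnscensus2013.mx gs://census-staging/mx.txt \
    name:STRING,isotime:TIMESTAMP,priority:INTEGER,exchange:STRING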

Let me walk through an example using Google's bq tool from the Cloud SDK. Suppose you have the file called mx.txt, about 23 GBytes of this survey data; you can split it into six files of roughly 4 GBytes each and then upload the data with the schema as shown below.  (I presume you have read the help page for the "bq" tool, so I will skip detailed explanations of the options.)
[user1@homepc] split -l 60962616 mx.txt
[user1@homepc] for j in xa*; do
    ~/google-cloud-sdk/bin/bq --project_id=census2013 load --allow_quoted_newlines=1 \
        dnscensus2013.mx ./$j \
        name:STRING,isotime:TIMESTAMP,priority:INTEGER,exchange:STRING
  done

The first command splits the file into chunks of roughly 61 million rows each, and the second command loads each chunk into the table "mx" in the dataset "dnscensus2013" within the project "census2013".  The schema is simply name, timestamp, priority and exchange, much like a DNS MX record.
After you have uploaded the data, you can verify that it loaded correctly with a simple query (here run against the txt table, which was loaded the same way).

[user1@homepc] ~/google-cloud-sdk/bin/bq --project_id=census2013 query "SELECT * FROM dnscensus2013.txt limit 3"
Waiting on bqjob_r16c69efa3a5ffec4_000001466371f11a_1 ... (0s) Current status: DONE    
+---------------------+---------------------+-------------+
|        name         |       isotime       |     txt     |
+---------------------+---------------------+-------------+
| fedexairfreight.com | 2013-08-15 00:00:00 | v=spf1 -all |
| fedexalaska.com     | 2013-01-31 00:00:00 | v=spf1 -all |
| fedexalaska.com     | 2013-08-15 00:00:00 | v=spf1 -all |
+---------------------+---------------------+-------------+

Now the data has been uploaded.  You can follow the same upload pattern for all the other tables from DNS Census 2013.  Storing the whole of DNS Census 2013 in Google's BigQuery will cost you roughly $4 a month (about 150 GBytes at the current storage rate).  Queries are billed separately, by the amount of data scanned, so query costs stay very low unless you run large scans against this data regularly.
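To give a sense of the searching and reporting an analyst can do once the tables are loaded, here is a sketch of a query against the mx table.  The mail exchange bad-mailer.example.com is a hypothetical indicator; substitute whatever your own intelligence feeds give you.

[user1@homepc] ~/google-cloud-sdk/bin/bq --project_id=census2013 query \
    "SELECT name, exchange FROM dnscensus2013.mx WHERE exchange CONTAINS 'bad-mailer.example.com' LIMIT 20"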

As far as building a front end for this goes, there are good Java-based application samples, but to keep things simple I wrote a quick PHP page that converts query results into JSON or JSONP for access from JavaScript, just to get the concept going.  Once a full application is required, it is probably worth building a small Java application. The PHP script is shown below:

[user1@homepc]more index.php
<?php
set_include_path ('../libs/google-api-php-client/src/');
require_once '../libs/google-api-php-client/src/Google/Client.php';
require_once '../libs/google-api-php-client/src/Google/Service/Bigquery.php';
$client = new Google_Client();
$client->setApplicationName("Vijays_example");
$client_id='XXXssag.apps.googleusercontent.com';
$service_account='4XXmkssag@developer.gserviceaccount.com';
$project_id='census2013';
$dataset_id='dnscensus2013';
$table_id='cname';

$key = file_get_contents("./file.p12");
$client->setAssertionCredentials(new Google_Auth_AssertionCredentials(
        $service_account,
    array('https://www.googleapis.com/auth/bigquery', 'https://www.googleapis.com/auth/devstorage.read_write'),
    $key)
);

$client->setClientId($client_id);
$service = new Google_Service_Bigquery($client);
$service->tables->listTables($project_id, $dataset_id);
$service->tables->get($project_id, $dataset_id, $table_id);
$query = new Google_Service_Bigquery_QueryRequest();
$sql="SELECT * from $dataset_id.$table_id limit 10";
if (isset($_GET['sql'])) {
    // Allow the caller to supply the query. BigQuery queries are read-only,
    // but in a real deployment you would still restrict what SQL is accepted.
    $sql = $_GET['sql'];
}
$query->setQuery($sql);
$jobs = $service->jobs;
$response = $jobs->query($project_id, $query);
// The response comes back as somewhat bizarre nested Google objects that we
// need to dig into to recover the column names and then the row values.
$mob = get_object_vars($response->schema->toSimpleObject());
$names = array();
foreach ($mob['fields'] as $v) {
    array_push($names, $v['name']);
}
$json = array();
foreach ($response->rows as $v) {
    // Re-key each row's values by column name so the JSON is self-describing.
    $vals = array();
    $i = 0;
    foreach ($v['f'] as $m) {
        $s = get_object_vars($m);
        $vals[$names[$i]] = $s['v'];
        $i++;
    }
    array_push($json, $vals);
}
// Support JSONP if requested
if (isset($_GET['callback'])) {
    header("Content-Type: application/json");
    $cb = preg_replace('/[^0-9a-z\.\_]/i', '_', $_GET['callback']);
    echo "$cb(" . json_encode($json) . ");\n";
} else {
    print json_encode($json);
}
?>
# Now run the code 
[user1@homepc] php < index.php
[{"name":"www.consol.uk.com","isotime":"1.351017471E9","cname":"consolweb.thorhost.co.uk"},
{"name":"www.consol.uk.com","isotime":"1.351729352E9","cname":"consolweb.thorhost.co.uk"},
{"name":"www.consol.uk.com","isotime":"1.37297756E9","cname":"consolweb.thorhost.co.uk"},
{"name":"www.consol.us","isotime":"1.377155233E9","cname":"free.park-your-domain.com"},
{"name":"www.consol2d.com","isotime":"1.377005673E9","cname":"www.geo-slope.com"},
{"name":"www.consol3d.com","isotime":"1.377089055E9","cname":"www.geo-slope.com"},
{"name":"www.consola-ouya.com","isotime":"1.365095343E9","cname":"ghs.google.com"},
{"name":"www.consola-ouya.com","isotime":"1.373153883E9","cname":"ghs.google.com"},
{"name":"www.consola-ouya.com","isotime":"1.374698337E9","cname":"ghs.google.com"},
{"name":"www.consola-ouya.com","isotime":"1.377195701E9","cname":"ghs.google.com"}]



The sample above shows the JSON output with values mapped to column names, returning a result of 10 rows.  The PHP library has very little documentation, but with the sample code in hand you can infer that results come back as complex objects carrying plenty of other properties and metadata you can use in your dashboard.  Below are sample gadgets from the dashboard to illustrate the usage: the first screenshot is a simple search for google.com, the next is an IP search for the subnet 74.125.159.0/24.
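For reference, a dashboard gadget only has to call this page with a sql parameter and, optionally, a callback parameter when it wants JSONP.  A quick check from the command line could look like this; the host soc.example.com is a hypothetical placeholder for wherever you host the PHP page.

[user1@homepc] curl 'https://soc.example.com/index.php?sql=SELECT+name,cname+FROM+dnscensus2013.cname+LIMIT+5'
[user1@homepc] curl 'https://soc.example.com/index.php?sql=SELECT+name,cname+FROM+dnscensus2013.cname+LIMIT+5&callback=renderRows'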

I hope this demonstrates the power of a public cloud service like Google's BigQuery and how it can be integrated into your Security Operations Center.

