Graph relations in Neo4j – simple load example

In preparation for a post about doing graph analytics in Neo4j (paralleling SPARQLverse from this earlier post), I had to learn to load text/CSV data into Neo.  This post just shows the steps I took to load nodes and then establish edges/relationships in the database.

My head hurt trying to find a simple example of loading the data I had used in my earlier example, largely because I was new to the Cypher language.  I also got hung up previewing the data in the Neo4j visualiser: all my nodes showed only ID numbers, which had me convinced my name properties weren’t loading, when it was really just a visualisation setting (more on that another time).  Anyway, enough distractions… Continue reading Graph relations in Neo4j – simple load example

Geospatial Power Tools Reviews [Book]

Thinking of buying my latest book?  We’ve finally got a few reviews on Amazon that might help you decide.  See my other post for more about the book.  Buy the PDF at LocatePress.com.

Reader Reviews

Geospatial Power Tools book cover
From Amazon.com

5.0 out of 5 stars This book makes a great reference manual for using GDAL/OGR suite of command line …,
January 24, 2015 By Leo Hsu
“The GDAL Toolkit is chuckful of ETL commandline tools for working with 100s of spatial (and not so spatial data sources). Sadly the GDAL website only provides the basic API command switches with very few examples to get a user going with. I was really excited when this book was announced and purchased as soon as it came out. This book makes a great reference manual for using GDAL/OGR suite of command line utilities.
Several chapters are devoted to each commandline tool, explaining what its for, the switches it has, and several examples of how to use each one. You’ll learn how to work with both vector/(basic data no vector) data sources and how to convert from one vector format to another. You’ll also learn how to work with raster data and how to transform from one raster data source to another as well as various operations you can perform on these.”


Continue reading Geospatial Power Tools Reviews [Book]

Graph analytics – the new super power

Graph analytics – is it just hype or is it technology that has come of age?  Mike Hoskins, CTO of Actian, sums it up well in this article from InfoWorld:
Mike Hoskins writes about graph analytics and how it is a game changer for finding unknown unknowns
“One area where graph analytics particularly earns its stripes is in data discovery. While most of the discussion around big data has centered on how to answer a particular question or achieve a specific outcome, graph analytics enables us, in many cases, to discover the “unknown unknowns” — to see patterns in the data when we don’t know the right question to ask in the first place.”

Read Mike’s full article at:
InfoWorld

In the remainder of this post I outline a few more of my thoughts on this topic and give you pointers to some more resources to help you understand what to do next.

Continue reading Graph analytics – the new super power

HP Scanning on OSX Yosemite [SOLVED]

Vuescan website
It was time to solve my HP scanning problem on OSX Yosemite.  My HP LaserJet CM1312 MFP (multifunction printer) is getting a bit old, or at least HP’s support for it is falling by the wayside. Don’t bother with the HP software as it’s built for 10.7. It did work at one point on this laptop but then gave up entirely.

I spent too long trying to find new drivers that enable scanning, with no luck, until I found Vuescan.  There’s a free trial (with watermark).  I’m sure it supports lots of scanners, so just grab it and test before paying.

After a $29 PayPal payment and a 7MB download I was scanning instantly – I didn’t even have to point it at my printer.

SPARQL Query for Graph Density Analysis

SPARQLcity Graph Analytics Engine
SPARQLverse is the graph analytics engine produced by SPARQLcity. Standards compliant and super fast!

I’ve been spending a lot of time this past year running queries against the open source SPARQLverse graph analytics engine.  It’s amazing how simple some queries can look and yet how much work is being done behind the scenes.

My current project requires building up a set of query examples that allow typical kinds of graph/network analytics – starting with the kinds of queries needed for Social Network Analysis (SNA), i.e. find friends of friends, graph density and more.

In this post I go through computing graph density in detail. Continue reading SPARQL Query for Graph Density Analysis

Kafka Topic Clearing after Producing Messages

[UPDATE: Check out the Kafka Web Console to more easily administer your Kafka topics]

This week I’ve been working with the Kafka messaging system in a project.

Basic C# Methods for Kafka Producer

To publish to Kafka I built a C# app that uses the Kafka4n libraries – it doesn’t get much simpler than this:

using Kafka.Client;

// Connect to the Kafka broker, then publish a single message to the given topic/partition
Connector connector = new Connector(serverAddress, serverPort);
connector.Produce(correlationId, hostName, timeOut, topicName, partitionId, message);

I was reading from various event and performance monitoring logs and pushing them through just fine. Continue reading Kafka Topic Clearing after Producing Messages

Twitter cards made easy in WordPress

Just a note to self and others using WordPress – the Yoast WordPress SEO plugin did the trick to get Twitter cards working well, with only a couple of clicks.   If you are tempted to hack your own HTML headers, you are going in the wrong direction – follow this tutorial instead: http://www.wpbeginner.com/

Thanks WPBeginner and Yoast!

Social Graph – Year in Review (Preparation)

Graph developed from Tyler Mitchell's LinkedIn connections.

I pulled this visualization of my LinkedIn social graph together in just a few minutes while working through a tutorial.  What about other social networks?  Give me your input… Continue reading Social Graph – Year in Review (Preparation)

Data Sharing Saved My Life – or How an Insurer Reduced My Healthcare Claim Costs

It’s not every day that you receive snail mail with life-changing information in it, but when it does come, it can come from the unlikeliest sources.

Healthcare data shown in a list of bio sample test results
My initial test results showing problems with the liver

A year ago, when doing a simple change of health insurance vendors, I had to give the requisite blood sample.  I knew the drill… nurse comes to the house, takes blood, a month later I get new insurance documents in the mail.

But this time the package included something new: the results of my tests.

The report was a list of 13 metrics and their values, including a brief description of what each meant and what my scores should be.  One in particular was out of the norm.  My ALT score, which helps indicate liver malfunction, was about 50% higher than the expected range.

Simple Data Can Be Valuable

Here is the key point: I then followed up with my family doctor, with data in hand.  I did not have to wait to see symptoms of a systemic issue and get him to figure it out. We had a number, right there, in black and white. Something was wrong.

Naturally, I had a follow-up test to see if it was just a blip.  However, my second test showed even worse results, twice as high in fact!  This led to an ultrasound and more follow-up tests.

In the end, I had (non-alcoholic) Fatty Liver Disease.  Fatty liver is most commonly seen in alcoholics, so it was a surprise given that I don’t drink.  It was due solely to my diet and the weight I had put on over several years.

It was a breaking point for my system and the data was a big red flag calling me to change before it was too late.

Wellness FX chart of ALT
A chart showing some of my earlier tests – loaded into WellnessFX.com for visualisation.

Not impressed with my weight or the rest of my scores, I made simple but dramatic changes to improve my health.*  The changes were so dramatic that my healthcare provider was very curious about my methods.

By making changes to my diet alone I was able to get my numbers to a healthy level in just a few months.  In the process I lost 46 pounds in 8 months and recovered from various other symptoms.  The pending train wreck was averted.

Long Term Value in Sharing Healthcare Data

It’s been one year this week, so I’m celebrating – and it’s thanks to Manulife, or whoever does their lab tests, taking the initiative to send me my lab results.

It doesn’t take long to see the business value in doing so, does it?   I took action on the information and now I’m healthier than I have been in almost 20 years.  I have fewer health issues, will use their systems less, will cost them less money, etc.

Ideally it benefits the group plan I’m in too, since I’m now a lower-cost user of the system.  I hope both insurers and employers take this to heart and follow suit, giving their people the data they need to make life-changing and cost-reducing decisions like this.

One final thought… how many people are taking these tests right now?  Just imagine what you could do with a bit of data analysis of their results.  With these kinds of test results, companies could be making health predictions for their customers and health professionals to review.  That’s why I’m jumping onto “biohacking” sites like WellnessFX.com these days to track all my scores and to get expert advice on next steps or access to additional services.

I’m happy with any data sharing, but why give me just the raw data when I still have to interpret it myself?  I took the initiative to act on the results, but what if I had needed more incentive?  If I had been told “Lower your ALT or your premiums will be 5% higher”, I would have appreciated that.

What’s your price?  If your doctor or insurer said “do this and save $100” – would you do it?  What if they laid the data out before you and showed you where your quality of life was headed, would it make a difference to you?

I’m glad I had this opportunity to improve my health, but at this point I just say thanks for the data … and pass the salad please!

Tyler


* I transitioned to a Whole Food – Plant Based diet (read Eat to Live and The China Study).  You can read more about the massive amounts of nutrition science coming out every year at NutritionFacts.org or read research papers yourself.

HBase queries from Bash – a couple simple REST examples

Learn how to do some simple queries to extract data from the Hadoop/HDFS based HBase database using its REST API.

Are you getting stuck trying to figure out how to query HBase via the REST API?  Me too.  The main HBase docs are pretty limited in terms of examples, but I guess it’s all there – just not that easy for new users to understand.

As an aside, during my searches for help I also wanted to apply filters – if you’re interested in HBase filters, you’ll want to check out Marc’s examples here.

What docs do you find most useful?  Leave a comment.  Should someone write more books or something else?

My Use Cases

There were two things I wanted to do: query HBase via REST to see if a table exists (before running the rest of my script, for example), and then grab the latest timestamp from that table.  Here we go…

Does a specific table exist in HBase?

First, checking whether a table exists can be done in a couple of ways.  The simplest is to request the table name with the “exists” path appended and see what result you get back.

$ curl -i http://localhost/existing_table/exists

HTTP/1.1 200 OK
Cache-Control: no-cache
Content-Type: text/plain
Content-Length: 0

$ curl -i http://localhost/bad_table_name/exists
HTTP/1.1 404 Not Found
Content-Length: 11
Content-Type: text/plain

Not found

Here I use the curl “-i” option to return the detailed info/headers so I can see the HTTP responses (200 vs 404).  The plain-text result of the command is either blank (if the table exists) or “Not found” (if it does not).

Let’s roll it into a simple Bash script and use a wildcard match to see whether the negative response comes back:

CMD=$(curl -s http://localhost/table_name/exists)

if [[ $CMD = *"Not found"* ]]; then
  echo "Not found"
else
  echo "Found"
fi
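
If you’d rather branch on the HTTP status code itself, curl can print it for you via its -w option.  Here’s a minimal variation on the same check (same hypothetical endpoint and table name assumed):

STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/table_name/exists)

# 200 means the table exists; 404 means it does not
if [[ $STATUS = "200" ]]; then
  echo "Found"
else
  echo "Not found"
fi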

Extract a timestamp from an HBase scanner/query

Now that I know the table exists, I want to get the latest timestamp value from it.  I thought I’d need to use some filter attributes like I do in HBase shell:

scan 'existing_table', {LIMIT=>1}

To do this with curl, you need to use HBase scanner techniques (seemingly the shortest section in the official docs).

It’s a two-stage operation – first you initialise a scanner session, then you request the results.  Bash can obviously help pull the results together easily for you, but let’s go step by step:

curl -vi -H "Content-Type: text/xml" -d '<Scanner batch="1"/>' "http://localhost/existing_table/scanner"

...
Location: http://localhost/existing_table/scanner/12120861925604d3b6cf3
...

Note the XML chunk in the statement that tells it how many records to return in the batch.  That’s as simple as it gets here!

Amongst the results of this command you’ll see a Location value returned; this is the URL to use to access the results of the query.  The results below are truncated and line-wrapped so you can see the meaningful bits:

$ curl http://localhost/existing_table/scanner/12120861925604d3b6cf3

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <CellSet>
  <Row key="AfrV65UNXD0AAAAs">
   <Cell column="ZXZ...mNl" 
    timestamp="1218065508852">
    L2hv...Y3N2
   </Cell>
  </Row>
 </CellSet>

Ugh, XML… if you want JSON instead, just add an Accept header to the request:

$ curl -H "Accept: application/json" http://localhost/existing_table/scanner/12120861925604d3b6cf3

{"Row":[{"key":"AfrV65UNXD0AAAAs","Cell":[{"column":"ZXZ...mnl","timestamp":1218065508852,"$":"L2hv...Y3N2"}]}
...

For now we’ll hack some sed to get to the value we want – first for the JSON response, then for the XML response.  Just pipe the curl command into this sed:

# For application/json
$ curl ... | sed 's/.*\"timestamp\"\:\([0-9]\{13\}\).*/\1/'

# For default XML results
$ curl ... | sed 's/.*timestamp=\"\([0-9]\{13\}\).*/\1/'

Now you can create a basic script that grabs the latest timestamp from the HBase query and decides what to do with it.  Here we just assign it to a variable and let you go back and implement as needed; a fuller end-to-end sketch follows below.

$ LAST_TIME=$(curl ... | sed 's/.*\"timestamp\"\:\([0-9]\{13\}\).*/\1/')
$ echo $LAST_TIME

1218065508858
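
Putting it all together, here’s a minimal end-to-end sketch, assuming the same localhost REST endpoint and example table name used above – adjust the host, port and table for your own setup:

TABLE="existing_table"
BASE="http://localhost"

# Bail out early if the table does not exist
if [[ $(curl -s "$BASE/$TABLE/exists") = *"Not found"* ]]; then
  echo "Table $TABLE not found" >&2
  exit 1
fi

# Initialise a one-row scanner and capture the Location URL returned in the headers
SCANNER=$(curl -si -H "Content-Type: text/xml" -d '<Scanner batch="1"/>' \
  "$BASE/$TABLE/scanner" | grep -i '^Location:' | awk '{print $2}' | tr -d '\r')

# Fetch the row as JSON and pull out the 13-digit timestamp
LAST_TIME=$(curl -s -H "Accept: application/json" "$SCANNER" \
  | sed 's/.*\"timestamp\"\:\([0-9]\{13\}\).*/\1/')

echo "Latest timestamp: $LAST_TIME"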

Thanks for reading!

If you like this, follow me on Twitter at http://twitter.com/1tylermitchell or with any of the other methods below.