HBase queries from Bash – a couple simple REST examples

Learn how to do some simple queries to extract data from the Hadoop/HDFS based HBase database using its REST API.

Are you getting stuck trying to figure out HBase query via the REST API?  Me too.  The main HBase docs are pretty limited in terms of examples but I guess it’s all there, just not that easy for new users to understand.

As an aside, during my searches for help I also wanted to apply filters – if you’re interested in HBase filters, you’ll want to check out Marc’s examples here.

What docs do you find most useful?  Leave a comment.  Should someone write more books or something else?

My Use Cases

There were two things I wanted to do – query HBase via REST to see if a table exists (before running the rest of my script, for example).  Then I wanted to grab the latest timestamp from that table.  Here we go…

Does a specific table exist in HBase?

First, checking if a table exists can be done in a couple ways.  The simplest is to simply request the table name with the “exists” path after it and see what result you get back.

$ curl -i http://localhost/existing_table/exists

HTTP/1.1 200 OK
Cache-Control: no-cache
Content-Type: text/plain
Content-Length: 0

$ curl -i http://localhost/bad_table_name/exists
HTTP/1.1 404 Not Found
Content-Length: 11
Content-Type: text/plain

Not found

Here I use the curl “-i” option to return the detailed info/headers so I can see the HTTP responses (200 vs 404).  The plain text results from the command are either blank (if exists) or “Not found” if it does not.

Let’s roll it into a simple Bash script and use a wildcard search to see if the negative status is found:

CMD=$(curl http://localhost/table_name/exists)

if [[ $CMD = *"Not found"* ]]
 then echo "Not found"
 echo "Found"

Extract a timestamp from an HBase scanner/query

Now that I know the table exists, I want to get the latest timestamp value from it.  I thought I’d need to use some filter attributes like I do in HBase shell:

scan 'existing_table', {LIMIT=>1}

To do this with curl, you want to use HBase scanner techniques to accomplish this (the shortest section in the official docs it seems).

It’s a two stage operation – first you initialise a scanner session, then you request the results.  Bash can obviously help pull the results together easily for you, but let’s so go step by step:

curl -vi -H "Content-Type: text/xml" -d '<Scanner batch="1"/>' "http://localhost/existing_table/scanner"

Location: http://localhost/existing_table/scanner/12120861925604d3b6cf3

Note the XML chunk in the statement that tells it how many records to return in the batch.  That’s as simple as it gets here!

Amongst the results of this command you’ll see the Location value returned, this is the URL to use to access the results of the query.  Results are truncated and line breaked so you can see the meaningful bits:

$ curl http://localhost/existing_table/scanner/12120861925604d3b6cf3

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <Row key="AfrV65UNXD0AAAAs">
   <Cell column="ZXZ...mNl" 

Ugh, XML.. if you want JSON instead just add an ACCEPT property to the header:

$ curl -H "Accept: application/json" http://localhost/existing_table/scanner/12120861925604d3b6cf3


For now we’ll hack some sed to get the to the value we want, first for the JSON response, second for the XML response.  Just pipe the curl command into this sed:

# For application/json
$ curl ... | sed 's/.*\"timestamp\"\:\([0-9]\{13\}\).*/\1/'

# For default XML results
$ curl ... | sed 's/.*timestamp=\"\([0-9]\{13\}\).*/\1/'

Now you can create a basic script the grabs the latest timestamp from the HBase query and decides what to do with it.  Here we just assign it to a variable and let you go back to implement as needed.

$ LAST_TIME=$(curl ... | sed 's/.*\"timestamp\"\:\([0-9]\{13\}\).*/\1/')
$ echo $LAST_TIME


Thanks for reading!

If you like this, follow me on Twitter at http://twitter.com/1tylermitchell or with any of the other methods below.


Supertunnels with SSH – multi-hop proxies

I never know what to call this process, so I’m inventing the term supertunnels via SSH for now. A lot of my work these days involves using clusters built on Amazon EC2 cloud environment. There, I have some servers that are externally accessible, i.e. web servers. Then there are support servers that are only accessible “internally” to those web servers and not accessible from the outward facing public side of the network, i.e. Hadoop clusters, databases, etc.

To help log into the “internal” machines, I have pretty much one choice – using SSH through the public machine first. No problem here, any server admin knows how to use SSH – I’ve been using it forever. However, I didn’t really use some of the more advanced features that are very helpful. Here are two…

Remote command chaining

Most of my SSH usage is for running long sessions on a remote machine. But you can also pass a command as an argument and the results come directly back to your current terminal:

$ ssh [email protected] "ls /usr/lib"

Take this example one step further and you can actually inject another SSH command that gets into the “internal” side of the network.

This is starting to really sound like tunneling, though it’s somewhat manual and doesn’t redirect traffic from your client side, we’ll get to that later.

As an aside, in EC2-land you often use certificate files during SSH login, so you don’t need to have an interactive password exchange. You specify the certificate with another argument. If that’s how you run your servers (or with authorized_keys files) then you can push in multiple levels of additional SSH commands easily.

For example, here I log into ext-host1, then from there log into int-host2 and run a command:

$ ssh -i ~/mycert.pem u[email protected] "ssh -i ~/mycert.pem [email protected] 'ls /usr/lib'"

That is a bit of a long line for just getting a file listing, but it’s easy to understand and gets the job done quickly. It also works great in shell scripts, in fact you could wrap it up with a simple script to make it shorter.

Proxy config

Another way to make your command shorter and simpler is to add some proxy rules to the ~/.ssh/config file. I didn’t even know this file existed, so was thrilled to find out how it can be used.

To talk about this, let’s use the external and internal hosts as examples. And let’s assume that the internal host is Obviously these don’t need to be specifically public or private SSH endpoints, but it serves its purpose for this discussion.

If we are typically accessing int-host2 via ext-host1 then we can setup a Proxy rule in the config file:

Host 10.0.*.*
ProxyCommand ssh -i ~/mycert.pem [email protected] -W %h:%p

This rule watches for any requests on the 10.0… network and automatically pushes the requests through the ext-host1 as specified above. Furthermore, the -W option tells it to stream all output back to the same terminal you are using. (Minor point, but if you miss it you may go crazy trying to find out where your responses go.)

Now I can do a simple login request on the internal host and not even have to think about how to get there.

ssh -i ~/mycert.pem [email protected]

I think that’s a really beautiful thing – hope it helps!

Another time I’ll have to write more about port forwarding…

%d bloggers like this: