Partitioned Data & Why It Matters

Would love to hear how you partition your spatial data.

As data volumes grow, so does the need to understand how to partition your data.  Until you understand this distributed storage concept, you can’t choose the best approach for the job.  This post gives an introductory explanation of partitioning and shows why it is integral to the Hadoop Distributed File System (HDFS) increasingly used in modern big data architectures…
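To make the idea concrete before you click through, here’s a toy Python sketch (my own illustration, not from the article) of hash partitioning – one common way distributed systems decide which node a record belongs to:

from hashlib import md5

# Toy sketch: assign each record to one of 4 partitions by hashing its key.
# A distributed store can then place each partition on a different node.
NUM_PARTITIONS = 4

def partition_for(key):
    # Use a stable hash (unlike Python's built-in hash()) so a key always
    # lands in the same partition across runs and machines
    return int(md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

for key in ["user_1001", "user_1002", "user_1003", "user_1004"]:
    print(key, "-> partition", partition_for(key))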

Read the full article on Make Data Useful >>

Web console for Kafka messaging system

Running Kafka for a streaming collection service can feel somewhat opaque at times, which is why I was thrilled to find the Kafka Web Console project on GitHub yesterday.  This Scala application can be downloaded and installed in a couple of steps, and its included web server can then be launched to serve it up quickly.  Here’s how to do all that.

For a quick intro to what the web console does, watch the short video first; instructions for getting started follow below.

Continue reading Web console for Kafka messaging system

Drinking from the (data) Firehose of Terror

Between classic business transactions and social interactions and machine-generated observations, the digital data tap has been turned on and it will never be turned off. The flow of data is everlasting. Which is why you see a lot of things in the loop around real time frameworks and streaming frameworks. – Mike Hoskins, CTO Actian

From Mike Hoskins to Mike Richards (yes, we can make that kind of leap in logic – it’s the weekend)…

Oh, Joel Miller, you just found the marble in the oatmeal!   You’re a lucky, lucky, lucky little boy – because you know why?  You get to drink from… the firehose!  Okay, ready?  Open wide! – Stanley Spadowski, UHF

Firehose of Terror

I think you get the picture – a potentially frightening one for those unprepared to handle the torrent of data coming down the pipe.  Unfortunately, for those who are unprepared, the disaster will not merely overwhelm them – I believe they will be consumed by irrelevancy.

If you’re still with me, let me explain. Continue reading Drinking from the (data) Firehose of Terror

Kafka Consumer – Simple Python Script and Tips

[UPDATE: Check out the Kafka Web Console that allows you to manage topics and see traffic going through your topics – all in a browser!]

[Screenshot: Kafka on Hadoop, taken from the Hortonworks overview page linked at the end of this post]

When you’re pushing data into a Kafka topic, it’s always helpful to monitor the traffic using a simple Kafka consumer script.  Here’s a simple script I’ve been using: it subscribes to a given topic and prints out each message it receives.  It depends on the kafka-python module and takes a single argument, the topic name.  Modify the script to point to your server’s IP.

from kafka import KafkaClient, SimpleConsumer
from sys import argv

# Point this at your own Kafka broker (host:port)
kafka = KafkaClient("10.0.1.100:6667")

# Subscribe to the topic passed as the first command-line argument
consumer = SimpleConsumer(kafka, "my-group", argv[1])
consumer.max_buffer_size = 0  # no cap on the fetch buffer (see below)
consumer.seek(0, 2)           # start at the tail of the topic (see below)

for message in consumer:
    # message is an (offset, message) pair; [1][3] is the payload value
    print("OFFSET: " + str(message[0]) + "\t MSG: " + str(message[1][3]))

Max Buffer Size

There are two lines I wanted to focus on in particular.  The first is the “max_buffer_size” setting:

consumer.max_buffer_size = 0

When subscribing to a topic with a large backlog of messages that have not yet been consumed, the consumer/client can exceed its buffer limit and fail.  Setting an infinite buffer size (zero) allows it to take everything that is available.

If you kill and restart the script, it will continue where it left off – at the last offset that was received.  This is pretty cool, but in some environments it runs into trouble, so I changed the default behavior by adding another line.

Offset Out of Range Error

As I regularly kill the servers running Kafka and the producers feeding it (yes, just for fun), things sometimes go a bit crazy.  I’m not entirely sure why, but I got this error:

kafka.common.OffsetOutOfRangeError: FetchResponse(topic='my_messages', partition=0, error=1, highwaterMark=-1, messages=)

To fix it I added the “seek” setting:

consumer.seek(0, 2)

If you set it to (0,0), it will restart scanning from the first message.  Setting it to (0,2) starts it from the most recent offset – letting you tap back into the stream at the latest moment.
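For reference, seek() here works much like a file seek – an offset plus a “whence” argument.  A quick sketch of the two cases just mentioned, using the consumer from the script above:

consumer.seek(0, 0)  # whence=0: replay from the first available message
consumer.seek(0, 2)  # whence=2: jump to the tail and read only new messages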

Removing this line restores the behavior mentioned earlier, where the script picks up from the last message it previously received.  But if/when that breaks, you’ll want a line like this to save the day.
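If you’d rather have the script recover on its own, one option (my addition, not part of the original script) is to catch the exception named in the traceback and reset to the tail:

from kafka.common import OffsetOutOfRangeError

try:
    for message in consumer:
        print("OFFSET: " + str(message[0]) + "\t MSG: " + str(message[1][3]))
except OffsetOutOfRangeError:
    # The stored offset no longer exists on the broker; jump to the latest
    consumer.seek(0, 2)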


For more about Kafka on Hadoop, see Hortonworks’ excellent overview page, from which the screenshot above is taken.

SPARQL Query for Graph Density Analysis

[Image: SPARQLcity Graph Analytics Engine]
SPARQLverse is the graph analytics engine produced by SPARQLcity – standards compliant and super fast!

I’ve been spending a lot of time this past year running queries against the open source SPARQLverse graph analytics engine.  It’s amazing how simple some queries can look and yet how much work is being done behind the scenes.

My current project requires building up a set of query examples covering typical kinds of graph/network analytics – starting with the queries needed for Social Network Analysis (SNA): finding friends of friends, computing graph density, and more.
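As a taste of where this is headed, here’s a rough sketch (mine, not the worked example from the post) of what a graph density query can look like when driven from Python with the SPARQLWrapper module.  It treats every triple as an edge and every subject or object as a vertex – a simplification – and the endpoint URL is hypothetical:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint URL - point this at your own SPARQLverse server
endpoint = SPARQLWrapper("http://localhost:8080/sparql")

# Directed graph density = edges / (vertices * (vertices - 1))
endpoint.setQuery("""
SELECT ((?edges / (?vertices * (?vertices - 1))) AS ?density)
WHERE {
  { SELECT (COUNT(*) AS ?edges) WHERE { ?s ?p ?o } }
  { SELECT (COUNT(DISTINCT ?v) AS ?vertices)
    WHERE { { ?v ?p1 ?o1 } UNION { ?s1 ?p2 ?v } } }
}
""")
endpoint.setReturnFormat(JSON)
print(endpoint.query().convert())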

In this post I go through computing graph density in detail. Continue reading SPARQL Query for Graph Density Analysis

Kafka Topic Clearing after Producing Messages

[UPDATE: Check out the Kafka Web Console to more easily administer your Kafka topics]

This week I’ve been working with the Kafka messaging system on a project.

Basic C# Methods for Kafka Producer

To publish to Kafka I built a C# app that uses the Kafka4n libraries – it doesn’t get much simpler than this:

using Kafka.Client;

// Connect to the Kafka broker (serverAddress and serverPort are your broker's host and port)
Connector connector = new Connector(serverAddress, serverPort);

// Publish a single message to the given topic and partition
connector.Produce(correlationId, hostName, timeOut, topicName, partitionId, message);
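If you’d rather stay in Python, the kafka-python module from the consumer script in the earlier post has a producer too.  A minimal sketch, assuming the same broker address and topic name used elsewhere on this blog:

from kafka import KafkaClient, SimpleProducer

# Same broker address as the consumer script; adjust to your environment
kafka = KafkaClient("10.0.1.100:6667")
producer = SimpleProducer(kafka)
producer.send_messages("my_messages", b"hello from python")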

I was reading from various event and performance monitoring logs and pushing them through just fine. Continue reading Kafka Topic Clearing after Producing Messages

From zero to HDFS in 60 min.

(Okay, so you can be up and running quicker if you have a better internet connection than I do.)

Want to get your hands dirty with Hadoop-related technologies but don’t have time to waste?  I’ve spent way too much time trying to get HBase, for example, running on my MacBook with Homebrew, and wish I had just tried this VirtualBox approach first.
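The payoff looks roughly like this – once the share is mounted, copying a file into HDFS is an ordinary filesystem operation.  A sketch, assuming the NFS share is mounted at /mnt/hdfs (a path I made up for illustration):

import shutil

# With the HDFS NFS share mounted at /mnt/hdfs (hypothetical mount point),
# copying a file into HDFS is just a plain filesystem copy
shutil.copy("/tmp/report.csv", "/mnt/hdfs/user/demo/report.csv")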

In this short post I show how easy it was for me to get an NFS share mounted on OS X – so I could transparently and simply copy files onto HDFS without needing any special tools.   Here are the details… Continue reading From zero to HDFS in 60 min.
