
As I regularly do, I was looking around for new datasets to explore and process with Surfiki when I came across the following:

The Open Code - Digital Copy of DC's Laws

As the author Tom MacWright mentions on his site:

"I couldn’t be happier to write that the project to bring DC’s laws into the digital era and put them in everyone’s hands made a big breakthrough: today you can download an unofficial copy of the Code (current through December 11, 2012) from the DC Council’s website. Not only that, but the licensing for this copy is officially CC0, a Creative Commons license that aims to be a globally-effective Public Domain designation."

That sounds like a GREAT invitation. From my reading of his post, it seems that this data was difficult to acquire; he credits many people, a lot of communication and a lot of time, all working together to make it available to the public. He goes on to mention:

"What else is there to build? A great smartphone interface. Topic-specific bookmarks. Text analysis. Great, instant search. Mirrored archives to everywhere. Printable copies for DIY and for print-on-demand services. And lots more.

We’re not waiting till then to start though: dc-decoded is a project I started today to finish the long-awaited task of bringing DC’s laws into The State Decoded. The openlawdc organization on GitHub is dedicated to working on these problems, and it’s open: you can and should get access if you want to contribute."

As Intridea is a DC-based firm, it made perfect sense for us to run this data through our own data intelligence processing engine, Surfiki. It is also a perfect opportunity to introduce the Surfiki Developers API, which we are making publicly available as of RIGHT NOW. However, we are manually approving users and apps as we move forward; this helps us with future scaling and gives us better insight into the bandwidth required for concurrent and frequent developer operations. I encourage anyone and everyone to create an account. Eventually, all requests will be granted, on a first come, first served basis.

I think it is best that I first explain how we processed the data from The Open Code - Digital Copy of DC's Laws, followed by a general introduction to Surfiki and the Surfiki Developers API.

The initial distribution of The Open Code - Digital Copy of DC's Laws was a set of Microsoft Word documents. This may have been updated to a more ingestion-friendly format by now, although I am not sure. The total number of documents was 51, ranging in size from 317K to 140MB. You may think, "Hey, that's NOT a lot of data"... Sure, that's true, but I don't think it matters much for this project. From what I gather, it was important to just get the data out there and available, regardless of size. Besides, Surfiki doesn't throw fits over small data or big data anyway.

First order of business was converting these Microsoft Word documents to text. While Surfiki can indeed read through Microsoft Word documents, it generally takes a little longer, so any preprocessing is a good thing to do. Here is a small Python script that will convert the documents (it shells out to catdoc):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Convert every Word document in docs/ to plain text in textdocs/ using catdoc.
import glob, re, os

f = glob.glob('docs/*.doc') + glob.glob('docs/*.DOC')

outDir = 'textdocs'
if not os.path.exists(outDir):
    os.makedirs(outDir)

for i in f:
    # Build the output filename by swapping the .doc extension for .txt.
    os.system("catdoc -w '%s' > '%s'" %
              (i, outDir + '/' + re.sub(r'.*/([^.]+)\.doc', r'\1.txt', i,
                                        flags=re.IGNORECASE)))

With that completed, we now have them as text files. I decided to take a peek into the text files and noticed that there are a lot of "END OF DOCUMENT" lines, which I assume mark the individual documents within the larger contextual document. (I know, I know… genius assumption.)

This "END OF DOCUMENT" looks like the following:

For legislative history of D.C. Law 4-171, see Historical and Statutory Notes following § 3-501.

DC CODE § 3-502
Current through December 11, 2012
END OF DOCUMENT
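
The counting script itself isn't included here; the following is only a minimal sketch of how those markers might be tallied, assuming the converted files live in the textdocs directory created by the conversion script above:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch only: count "END OF DOCUMENT" markers across the converted text files.
import glob

total = 0
for path in glob.glob('textdocs/*.txt'):
    with open(path) as f:
        # The marker is often indented, so compare against the stripped line.
        total += sum(1 for line in f if line.strip() == 'END OF DOCUMENT')
print(total)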

And from my initial count script, there are about 19K lines that read "END OF DOCUMENT". Therefore, I want to split these up into individual files so that Surfiki processes them as distinct documents for search and trending purposes. With the following Python script, I split them into individual documents and also cleaned out the '§' character. Note: Surfiki uses both structured and unstructured storage for all data, for business purposes as well as redundancy. On the business side, structured storage allows us to connect with common enterprise offerings, such as SQL Server, Oracle, etc., for data consumption and propagation. As for redundancy, for a temporal period we persist all data concurrently in both mediums, so if a process fails we can recover within seconds and the workflow can resume unimpeded.

Note: docnames.txt is just a static list of the initial text documents converted from Microsoft Word documents. I chose that method rather than walking the path.

#!/usr/bin/env python
# -*- coding: utf8 -*-
# Split each converted text file into individual documents at every
# "END OF DOCUMENT" marker, stripping the '§' character along the way.

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

with open('docnames.txt') as f:
    set_count = 0
    for lines in f:
        filename = str(lines.rstrip())
        with open(filename, mode="r") as docfile:
            file_count = 0
            smallfile_prefix = "File_"
            smallfile = open(smallfile_prefix + str(file_count) + "_" + str(set_count) + '.txt', 'w')
            for line in docfile:
                reps = {'§': ''}
                line = replace_all(line, reps)
                if line.startswith("END OF DOCUMENT"):
                    # Close the current piece and start the next one.
                    smallfile.close()
                    file_count += 1
                    smallfile = open(smallfile_prefix + str(file_count) + "_" + str(set_count) + '.txt', 'w')
                else:
                    smallfile.write(line)
            smallfile.close()
            set_count += 1

After the above processing (a few seconds of time), I now have over 19K files, ranging from 600B to 600K. PERFECT! I am ready to push these to Surfiki.

It's important to understand that Surfiki works with all types of data in all sorts of locations: web data (pages, feeds, posts, comments, Facebook, Twitter, etc.) as well as static data locations such as file systems, buckets, etc., plus streams and databases. In this case, we are working with static data: text documents in a cloud storage bucket. Without getting too detailed about the mechanism by which I push these files, on a basic level they are pushed to the bucket, where an agent is watching; once they start arriving, the processing begins.
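
The push mechanism itself isn't described in detail, so treat the following as a rough sketch only: it assumes an S3-style bucket (the bucket name is hypothetical) and uses boto3 to upload the split files, after which a watching agent would pick them up.

#!/usr/bin/env python
# Sketch only: push the split text files to an S3-style bucket.
# The bucket name is hypothetical; Surfiki's actual ingest mechanism is not shown here.
import glob
import os
import boto3

s3 = boto3.client('s3')
BUCKET = 'surfiki-ingest-example'  # hypothetical bucket

for path in glob.glob('File_*.txt'):
    # Use the local filename as the object key; the watching agent
    # processes new objects as they arrive.
    s3.upload_file(path, BUCKET, os.path.basename(path))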

Since this is textual information, the type of processing is important. In this case I want to use the standard set of NLP text processing within Surfiki, rather than any customized algorithms such as topic-specific classifiers, statistical classifiers, etc. The following is what will be processed within Surfiki for this data (a rough sketch of two of the readability metrics follows the list).

  • Sentiment - Positive, Negative and Neutral
  • Perspective - Objective or Subjective
  • Gunning Fog Index
  • Reading Ease
  • Lexical Density
  • Counts including: words/sentence, syllables/sentence, syllables/word
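
The readability metrics above are standard formulas. As a rough illustration only (not Surfiki's implementation), here is how the Gunning Fog Index and a Flesch-style Reading Ease score can be approximated, using a crude vowel-group syllable counter:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch only: approximate Gunning Fog Index and Flesch Reading Ease for a piece of text.
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def readability(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wps = float(len(words)) / max(1, len(sentences))   # words per sentence
    spw = float(syllables) / max(1, len(words))        # syllables per word
    gunning_fog = 0.4 * (wps + 100.0 * complex_words / max(1, len(words)))
    flesch_ease = 206.835 - 1.015 * wps - 84.6 * spw
    return gunning_fog, flesch_ease

print(readability("The Council of the District of Columbia enacts this measure to amend the Code."))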

As well, we provide the following for all data (a simple n-gram counting sketch follows this list).

  • Keywords (Literal) - Literal extraction of keywords
  • Keywords (Semantic) - Semantic conceptual generation of keywords
  • Trends - n-grams - Uni, Bi and Tri
  • Trends Aggregate - n-gram weighted distribution
  • Graph - n-gram relationships/time (Available likely on Monday)
  • Time - Insert and document time extraction
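
To give a rough idea of the trend items above, here is a simple uni/bi/tri-gram counting sketch. It is only an illustration, not Surfiki's trending pipeline; the input file is one of the split documents produced by the script earlier in this post.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch only: count the most common uni-, bi- and tri-grams in one split document.
import re
from collections import Counter

def ngrams(tokens, n):
    # Slide an n-wide window across the token list.
    return zip(*[tokens[i:] for i in range(n)])

def top_trends(text, n, k=5):
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(k)

text = open('File_0_0.txt').read()  # one of the files produced by the splitting script
for n in (1, 2, 3):
    print(n, top_trends(text, n))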

These will all be available in the Surfiki Developers API.

Once you go over to Surfiki Developers API and read through the documentation, you will find how simple it is to use. We are adding datasets on a regular basis so please check back often. As well, our near real-time Surface Web data is always available, as are previously processed data sets. If you have ideas or even data sets we should look at, please just let us know by submitting a comment on this post.

If you want to contact us about using Surfiki within your organization, that would be great as well. We can put it behind your firewall, or operate it in the cloud. It is available for most architectural implementations.


I always love traveling to DC. Meeting up with fellow Intrideans is incredibly motivating and satisfying. I had this chance yet again over the last couple of days for an event, the Open Analytics Summit, which Intridea sponsored; we also had a speaking slot. A few other companies were sponsoring and speaking as well, such as Basho, Elasticsearch and 10Gen.

The morning started pretty slowly; I am certain people were around the corner, just not sure which corner that was. It was an intimate setting, with only a few tables for vendors/sponsors and a select group of practitioners.

Presentations were longer than what I consider normal, 45-50 minutes. Topics focused for the most part on Open Source software, covering the applications, architecture, engineering, methods and systems used within data analytics. As a sponsor it was difficult to get away and listen to the presentations, though a few of them looked as though they would have been quite interesting to attend.

My presentation, titled "Data Science in the NOW - It Takes an ARMY of Tools!", focused mainly on the vast array of available Open Source databases, indexing engines, file systems, query engines and streaming engines. I also spent a little time on the definition of "NOW" (within the context of data analytics), latency, and our own human (physiological) limits of perception. I made it a point to mention most of what is available, along with its history, general feature set, strengths and weaknesses, and I selected a few out of the myriad for special attention; examples included Storm, Cassandra, HBase, the xQLs, Hadoop and a few others. The presentation is available for anyone to view on SlideShare. Unfortunately, without the notes attached, a lot of the detail may seem to be missing. If you want to read through the notes that apply to each slide, please just let me know and I will send them to you.

My only gripe is the venue itself. While quaint, it had some real drawbacks. For example, the power outlets were in front of the vendor tables rather than behind them, and a lounging area sat directly in front of the vendor tables with attendees' backs to us, making it rather difficult to engage in useful discussion. Finally, the main presentation room is built on tiers, much like a large collegiate classroom, but each attendee sat behind a small barrier that hid their hands. With long presentations I noticed a lot of arm/elbow movement indicating QWERTY abuse or thumb wrestling rather than focus on the presenter.

We met a few really cool people that we are already following up with. All in all it was a good event, glad we were part of it.


It's 2012 and we're talking about Data. But this isn't your grandmother's data. This is Data with a big 'D'. As dull as the concept might sound, there are some amazing things happening in the realm of Big Data right now. Patti Chan spent a day last week at the Data 2.0 Summit in San Francisco and she has reported back with some of the highlights!

If you're not already familiar with what big data is and why it's a popular topic, read through our recent posts on the topic!

« Democratized Data and the Missing Interface

« Simplified Relational Hierarchy Visualization

Monetization

In the "Monetizing the Data Revolution" panel speakers from Microsoft, DataSift and other leading companies discussed the ways in which people are trying to offer "data as a service" and why lingering confusion over standards and protocols are preventing DaaS offerings from being viable at this point.

Additionally, it was pointed out that however valuable data may be, we cannot sell just "data" alone. Creating a business model based on the "data revolution" requires three things: data, analysis and workflow (i.e., hardware, processes).

And finally, it is difficult to monetize the data revolution because we do not yet have a well-built portfolio of tools to expose data so that value-added services can be offered as part of the package.

After the session, Patti pointed out (via Twitter) that there are three components to data accessibility:

  1. Data - the actual raw data.
  2. Tools to leverage that data.
  3. Well designed user interfaces to gain insights from data.

Google

Due to the nature of the work that Google does they naturally had a strong presence at the summit. Navneet Joneja, Product Manager at Google, talked in depth about the useful and interesting things Google is doing with data.

  • Google BigQuery: enables you to perform SQL-like queries over large datasets. We're talking billions of rows of data. It calculates meaningful insights in just seconds. It's useful for things like creating interactive tools, spam filtering, detecting trends, making web dashboards and network optimization tools. (A brief query sketch follows this list.)
  • Google App Engine: App Engine can be coupled with a NoSQL datastore for supreme awesomeness. This is what Google Spreadsheet runs on. App Engine data logs can be easily exported to Cloud Storage and then analyzed with BigQuery.
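
To illustrate the kind of SQL-like query mentioned in the BigQuery bullet above, here is a minimal sketch using the google-cloud-bigquery Python client against one of Google's public sample datasets. The client library, project setup and dataset choice are assumptions made for illustration, not something presented at the summit.

#!/usr/bin/env python
# Sketch only: run a SQL-like query with the google-cloud-bigquery client.
# Assumes a Google Cloud project with credentials available in the environment.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

# Print the ten most frequent words in the public Shakespeare sample dataset.
for row in client.query(query):
    print(row.word, row.total)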

During the panel Patti asked Navneet:

App Engine obfuscates the underlying hardware / VMs / stack (this is their value proposition). In enterprise software we often dip down into that layer in order to optimize our app. Does App Engine have documentation or open source that exposes those layers, in case we need it?

To which Navneet responded:

There is a good amount of documentation, but the goal and point of App Engine is to make that need obsolete, so that you can concentrate on the development and features and not worry about the hardware.

Other Interesting Tidbits

  • SalesForce had a team at the Summit talking about Data.com and layering social graphs on top of the cloud contacts and data you already use, in order to achieve a more complete profile.
  • DataSift, a powerful social media data platform, did a great mashup of data sets to show how combining related sets of data can help users derive real meaning.
  • CrowdFlower CTO Chris Van Pelt was present to show off the work they are doing in distributed human computing.
  • Wishery is a new application which adds full customer profiles into existing point-of-contact apps like Gmail. It uses the customer's email address as the canonical identifier. This presentation received the most questions from the VC panel and is definitely one to watch out for in the coming months.

Final Thoughts

People and organizations are making huge leaps of progress in the field of Big Data. While we still need more hardware, software, and UI tools built to make big data more accessible, it's clear that there is important work being done in this area. It will be exciting to watch developers and designers collaborate in the coming months and years to help unleash the inherent power of Data with a big 'D'.


This month we're sponsoring MoDevUX and joining pioneers in mobile design and development at an event created to focus specifically on user experience and design for mobile.

Anthony Nystrom, our Director of Mobile & Emerging Technologies, will be joined by Jurgen Altziebler, our Managing Director of UX, to talk about accessibility to big data through enhanced design and interface layers on mobile devices.

MoDevUX will feature keynotes from visionaries at Frog, XOBNI, and Capital One. The three-day event kicks off with a day of workshops, continues with a full day of presentations from some of the brightest people in the industry, and ends with a hackathon, complete with demos and awards.

We're looking forward to this Homeric meeting of the minds. You can join us by registering today!


When looking at any complex relational system (especially in software, where our diagrams are limited by object scope), how do we see all the connections? How do we see and understand all objects, cases, states and methods (actions) regardless of the entire mutually inclusive and exclusive scope?

The standard method for showing hierarchical relationships, both inclusive and exclusive, is to use multiple diagrams for modeling those relationships between domain objects and related phenomena. The problem with this is that using multiple diagrams to portray a top-level view can lead to confusion and redundancy. Often, it would be extremely helpful to be able to have a single macro view of the data/objects and all relationships, rather than relying on piecing together multiple micro diagrams to achieve the same effect. It requires a higher level of abstraction to view complex data/objects on a macro level, and unfortunately, as we abstract we also lose detail.

What if there was a way to reduce all relational hierarchy to a single diagram without a significant loss in detail?

I have always been fascinated with data/object visualization and reducing the complexity typically present in that field. I find that there are better ways to represent data/objects and the relationships between them. And so I set out to design a method for a single visualization format; the result is an application I call NuGenObjective OCIDML. That's a mouthful, but I'll explain.

Introducing NuGenObjective OCIDML

OCIDML stands for Objective Case Interaction Diagram Markup Language. NuGenObjective OCIDML is a domain simulation method I created that drastically reduces the redundancy we commonly see in visualizing data/object relationships. This method provides for a single and simple view of all objects within our domain and their subsequent states, actions and interactions.

Take, for example, this diagram, which uses OCIDML to show the objects, cases, states and roles and their dependencies on each other for an entire system:


Now examine this diagram, which uses OCIDML to show the (hypothetical) structure of three higher educational universities:

The structures of these organizations are not all the same. For example, MPIM is a research institute and has no graduate program, so the part of the diagram corresponding to MPIM has no interactions dedicated to graduate students.

If we were to represent this same relational system with the usual 2D table (Excel or otherwise), we would draw a large 3x3 table with each cell being a 9x9 sub-table (including headlines). That gives us a 736-cell table, and if every small cell were only 1.5 inches long and 0.5 inches wide (to keep the inner text readable), we would have a 41x14-inch table, which would only display 54 of the 736 cells. The total space required to show these relationships in the traditional 2D method would exceed the usable space more than 14 times.

Using the OCIDML method we are able to display all necessary objects (with their states, relationships and actions) in a single visualization, giving us a macro view of each object, its states, actions and interactions with other objects. It gives us the advantage of seeing all dimensions displayed in a single visual vector, in which the relationships and their multiple and singular entities are visualized.

Below is an example of building a diagram in the application. While only a glimpse of the actual tool, it should give you an idea of how it comes together. A great feature of the tool is the ability to double-click on any interaction point, which will then display the specific object, role, state and action.

This is the interaction creation/definition dialog where we define our case for a specific object, role, action and state.

OCIDML for Software Development

When we design software we limit our cases by scope. But OCIDML makes it possible to diagram the entire network of relationships, paths, states and components of an object. This can be extremely useful in the software architecture process, allowing the architect to have a strong sense of the "whole" and all the inter-relational dependencies that need to be accounted for. In turn, software engineers will receive a more solid blueprint, resulting in better software.

Coming Up

In the next post in the series I'll dig into the mathematical theory NuGenObjective OCIDML was built upon and share some of the markup I used to create the diagrams. The application code is open source and I'll be getting it ready to share with you for the next post.

In the third and final post in the series I'll focus on specific use cases and talk about how you can use OCIDML in your software projects. I'm happy to answer any of your questions about this method - feel free to leave questions in the comments below.


Inspired by their recent trip to the Wolfram Data Summit, Marc Garrett and Jurgen Altziebler share their thoughts on big data and the missing component.

The Wolfram Data Summit is an invitation-only gathering in Washington, DC which brings together the leaders of the world's largest data repositories. Professors, Chief Privacy Officers, Research Scientists, Chief Technology Officers, Data Architects, and Directors from leading organizations like UCSD, the U.S. Census Bureau, Thomson Reuters, Cornell, and Orbitz (among many others) come to present on the challenges and opportunities they face in the data community and to discuss their work.

The Summit reaches a broad range of innovators from virtually every discipline. The format of the summit promotes collaboration among participants across multiple domains, including medical, scientific, financial, academic and government. Presenters integrate topics with discussion on open government data, data in the media, and linguistics.

Our Motivation

We frequently work with clients that own or manage large data repositories; through our work with them we build applications that allow their users to easily access and learn from the data. Through continued exposure to the world of big data, we've realized that although a few large firms utilize tools like data mining and data analyses to make better business decisions, the information is generally under-used and often not used at all by smaller firms.

Data is Gold

One of the most strikingly apparent details that Marc and Jurgen gleaned from the Summit was that data and content owners truly care about the accuracy of their data. All of the presenters conveyed a sense of sanctity toward cultivating quality data.

What results from the work of these scrupulous and discerning leaders is a vast collection of high-quality, accurate data that can be used by anyone to make more strategic decisions involving their health, finances, or education, and by business owners to learn more about their niche markets and identify trends and potential solutions to common problems. Data repositories are used by groups to predict and release information about everything from natural disasters and disease outbreaks to commute patterns and high-crime neighborhoods. This begs the question, "If data can be so useful to us, why are large organizations cutting funding to data projects such as Census.gov and Data.gov?" (Read this article from WhiteHouse.gov for a look at some of the ways Data.gov has been used in the last three years.)

The Experience Layer of Big Data

Jurgen and Marc identified that one of the solutions to the diminishing use of these repositories lies in the user experience layer of the data. In most cases, data repositories offer large data sets as Excel or CSV files, and while this format is appropriate for their expert audience, average users don't know how to get valuable information and stories out of plain data sets. On the bright side, this is a problem that's easy enough to fix.

Tell the story, guide the user to discover insights with a user friendly web layer.

Jurgen Altziebler

Data must be easily and intuitively accessible; otherwise, it goes unutilized. There is no question that aggregation and maintenance of data is beneficial for everyone from the CEO of a mutual funds company to the admissions office of a University, to the entrepreneur of a tech startup, to the person choosing between treatment options for an ill loved one.

In the age of Web 2.0 there is no reason for big or little data to be siloed behind unusable interfaces. Owners of data repositories can work alongside UX/UI experts to launch a new wave of data accessibility. At Intridea, we are obsessed with the user experience, but we also see the whole picture - we build applications that allow users to seamlessly access the information they need. Jurgen notes, "A good user experience begins and ends with usable data."

As designers, our job is as much about the aesthetics as it is about the functionality and accessibility of the product or data in question. WolframAlpha.com is a good case study of what's possible when centralized data is made available to the average user through the power of a knowledge engine and an intuitive interface. A simple query for "speed of light" or "heart disease risk" returns computational details on a macro and micro level.

What We've Learned

Data truly is gold. But it will waste away in mines if we do not create the appropriate tools for people to harvest and utilize it. If data owners can be encouraged to work with design experts, and if designers can be inspired to assist on these valuable data projects we can bridge the gap between the data and the user and unleash the inherent value in democratized data.
