
Welcome to part 2 of our exploration of the Nutch API!

In our last post, we created infrastructure for injecting custom configurations into Nutch via nutchserver. In this post, we will be creating the script that controls crawling those configurations. If you haven’t done so yet, make sure you start the nutchserver:

$ nutch nutchserver

Dynamic Crawling

We’re going to break this up into two files again, one for cron to run and the other that holds a class that does the actual interaction with nutchserver. The class file will be Nutch.py and the executor file will be Crawler.py. We’ll start by setting up the structure of our class in Nutch.py:

import time
import requests
from random import randint

class Nutch(object):
    def __init__(self, configId, batchId=None):
        pass
    def runCrawlJob(self, jobType):
        pass

We’ll need the requests module again ($ pip install requests on the command line) to post to and get from nutchserver. We’ll use time and randint to generate a batch ID later. The runCrawlJob method is what we’ll call to kick off each step of the crawl.

Next, we’ll get Crawler.py set up.

We’re going to use argparse again to give Crawler.py some options. The file should start like this:

# Import contrib
import requests
import argparse
import random

# Import custom
import Nutch as nutch

parser = argparse.ArgumentParser(description="Runs nutch crawls.")
parser.add_argument("--configId", help="Define a config ID if you just want to run one specific crawl.")
parser.add_argument("--batchId", help="Define a batch ID if you want to keep track of a particular crawl. Only works in conjunction with --configId, since batches are configuration specific.")
args = parser.parse_args()

We’re offering two optional arguments for this script. We can set --configId to run a specific configuration, and setting --batchId allows us to track a specific crawl for testing or otherwise. Note: with our setup, you must set --configId if you set --batchId.
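
If you want the script to fail fast when that constraint is violated, you could add a quick guard right after parsing (an optional sketch, not part of the original script):

# Optional guard: batches are configuration-specific, so --batchId only makes
# sense together with --configId.
if args.batchId and not args.configId:
    parser.error("--batchId can only be used together with --configId")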

We’ll need two more things: a function to make calling the crawler easy and logic for calling the function.

We’ll tackle the logic first:

if args.configId:
    if args.batchId:
        nutchJob = nutch.Nutch(args.configId, args.batchId)
        crawler(nutchJob)
    else:
        nutchJob = nutch.Nutch(args.configId)
        crawler(nutchJob)
else:
    configIds = requests.get("http://localhost:8081/config")
    cids = configIds.json()
    random.shuffle(cids)
    for configId in cids:
        if configId != "default":
            nutchJob = nutch.Nutch(configId)
            crawler(nutchJob)

If a configId is given, we capture it and initialize our Nutch class (from Nutch.py) with that id. If a batchId is also specified, we’ll initialize the class with both. In both cases, we run our crawler function (shown below).

If neither configId nor batchId is specified, we will crawl all of the injected configurations. First, we get all of the config IDs that we injected earlier (see Part 1!). Then, we randomize them. This step is optional, but we found that we tend to get more diverse results when initially running crawls if Nutch is not running them in a static order. Last, for each config ID, we run our crawler function:

def crawler(nutch):
    inject = nutch.runCrawlJob("INJECT")
    generate = nutch.runCrawlJob("GENERATE")
    fetch = nutch.runCrawlJob("FETCH")
    parse = nutch.runCrawlJob("PARSE")
    updatedb = nutch.runCrawlJob("UPDATEDB")
    index = nutch.runCrawlJob("INDEX")

You might wonder why we’ve split up the crawl process here. This is because later, if we wish, we can use the response from the Nutch job to keep track of metadata about crawl jobs. We will also be splitting up the crawl process in Nutch.py.
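
For example (purely an optional sketch of how crawler() could be extended later, not part of the script above), you could collect each step’s return value and log it:

def crawler(nutch):
    # Gather each step's job ID (or failure message) so it can be logged or stored.
    results = {}
    for step in ("INJECT", "GENERATE", "FETCH", "PARSE", "UPDATEDB", "INDEX"):
        results[step] = nutch.runCrawlJob(step)
    print(results)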

That takes care of Crawler.py. Let’s now fill out our class that actually controls Nutch, Nutch.py. We’ll start by filling out our __init__ constructor:

def __init__(self, configId, batchId=None):
    # Take in arguments
    self.configId = configId
    if batchId:
        self.batchId = batchId
    else:
        randomInt = randint(0, 9999)
        self.currentTime = time.time()
        self.batchId = str(self.currentTime) + "-" + str(randomInt)

    # Job metadata
    config = self._getCrawlConfiguration()
    self.crawlId = "Nutch-Crawl-" + self.configId
    self.seedFile = config["meta.config.seedFile"]

We first take in the arguments and create a batch ID if there is not one.

The batch ID is essential as it links the various steps of the process together. URLs generated under one batch ID must be fetched under the same ID or they will get lost, for example. The syntax is simple: [Current Unix time]-[Random 4-digit integer].

We next get some of the important parts of the current configuration that we are crawling and set them for future use.

We’ll query the nutchserver for the current config and extract the seed file name. We also generate a crawlId for the various jobs we’ll run.

Next, we’ll need a series of functions for interacting with nutchserver.

Specifically, we’ll need one to get the crawl configurations, one to create jobs, and one to check the status of a job. The basics of how to interact with the Job API can be found at https://wiki.apache.org/nutch/NutchRESTAPI, though be aware that its documentation is incomplete. Since we referenced it above, we’ll start with getting crawl configurations:

def _getCrawlConfiguration(self):
    r = requests.get('http://localhost:8081/config/' + self.configId)
    return r.json()

This is pretty simple: we make a request to the server at /config/[configID] and it returns all of the config options.

Next, we’ll get the job status:

def _getJobStatus(self, jobId):
    job = requests.get('http://localhost:8081/job/' + jobId)
    return job.json()

This one is also simple: we make a request to the server at /job/[jobId] and it returns all the info on the job. We’ll need this later to poll the server for the status of a job. We’ll pass it the job ID we get from our create request, shown below:

def _createJob(self, jobType, args):
    job = {'crawlId': self.crawlId, 'type': jobType, 'confId': self.configId, 'args': args}
    r = requests.post('http://localhost:8081/job/create', json=job)
    return r

Same deal as above, the main thing we are doing is making a request to /job/create, passing it some JSON as the body. The requests module has a nice built-in feature that allows you to pass a python dictionary to a json= parameter and it will convert it to a JSON string for you and pass it to the body of the request.
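
In other words, json=job is roughly shorthand for serializing the dictionary yourself and setting the content type. A quick illustration of the equivalent long-hand form (not code you need to add):

import json
import requests

# A placeholder job dict just for illustration.
job = {'crawlId': 'Nutch-Crawl-example', 'type': 'INJECT', 'confId': 'example-id', 'args': {}}

# Roughly what requests.post(url, json=job) does for you:
requests.post('http://localhost:8081/job/create',
              data=json.dumps(job),
              headers={'Content-Type': 'application/json'})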

The dict we are passing has a standard set of parameters for all jobs. We need the crawlId set above; the jobType, which is the crawl step we will pass into this function when we call it; the configId, which is the UUID we made earlier; last, any job-specific arguments--we’ll pass these in when we call the function.

The last thing we need is the logic for setting up, keeping track of, and resolving job creation:

def runCrawlJob(self, jobType): 
    args = {}
    if jobType == "INJECT":
        args = {"seedDir": self.seedFile}
    elif jobType == "GENERATE":
        args = {"normalize": True,
                "filter": True,
                "crawlId": self.crawlId,
                "batch": self.batchId
                }
    elif jobType in ("FETCH", "PARSE", "UPDATEDB", "INDEX"):
        args = {"crawlId": self.crawlId,
                "batch": self.batchId
                }
    r = self._createJob(jobType, args)
    time.sleep(1)
    job = self._getJobStatus(r.text)
    if job["state"] == "FAILED":
        return job["msg"]
    else:
        while job["state"] == "RUNNING":
            time.sleep(5)
            job = self._getJobStatus(r.text)
            if job["state"] == "FAILED":
                return job["msg"]
    return r.text

First, we’ll create the arguments we’ll pass to job creation.

All of the job types except Inject require a crawlId and batchId. Inject is special in that the only argument it needs is the path to the seed file. Generate has two special options that let you enable or disable URL normalization and the regex URL filters; we enable both by default.

After we build args, we’ll fire off the create job.

Before we begin checking the status of the job, we’ll sleep the script to give the asynchronous call a second to come back. Then we make a while loop to continuously check the job state. When it finishes without failure, we end by returning the ID.

And we’re finished! There are a few more things worth noting here. An important aspect of Nutch’s design is that it is impossible to know in advance how long a given crawl will take: your scripts could run for several hours at a time, or finish in a few minutes. I mention this because both when you first start crawling and after you have been crawling for a long time, you might see Nutch fetch very few links. In the first case, this is because, as I mentioned earlier, Nutch only crawls the links in the seed file at first, and if there are not many hyperlinks on those first pages, it might take two or three crawl cycles before you start seeing a lot of links being fetched. In the latter case, after Nutch finishes crawling all the pages that match your configuration, it will only recrawl those pages after a set interval. You can modify how this process works, but it will mean that after a while you will see crawls that only fetch a handful of links.
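
As an example of that tuning, the re-fetch interval is controlled by Nutch’s db.fetch.interval.default property. Assuming you are using the injection setup from Part 1 and want pages to become eligible for recrawl sooner, you could override it per configuration in _prepInjection (a hypothetical tweak, not something our scripts do by default):

# Hypothetical addition to ConfigInjector._prepInjection: shorten the re-fetch
# interval (Nutch's default is 2592000 seconds, i.e. 30 days) to one day.
config["db.fetch.interval.default"] = "86400"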

Another helpful note: the Nutch log at /path/to/nutch/runtime/local/logs/hadoop.log is great for following the progress of crawling. You can set the logging verbosity for most parts of the Nutch process in /path/to/nutch/conf/log4j.properties (if you change this, you will have to rebuild Nutch by running ant runtime from the Nutch root).


Now that we know the basics of Nutch, we can dive into our use case. We write scripts that do two things:

  • Ingest the various configurations
  • Execute and control crawls

This post will tackle ingesting the configs. I will specifically be using Python for the examples in this post, but the principles should apply to any language.

In our project, we had 50+ sites we wanted to crawl, all with different configuration needs. We organized these configurations into a JSON API that we ingest. In our examples, we will be using Python’s requests library to get the JSON. We’ll also need a way to create a unique UUID for each configuration, so we’ll use Python’s uuid module (part of the standard library on any recent Python, so the second install below is usually unnecessary). You can use the package installer pip to get them:

$ pip install requests
$ pip install uuid
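
To make the rest of the post concrete, here is roughly what a single entry from such a JSON endpoint might look like, keyed by its configId (a hypothetical example; the field names match the ones we reference below, but the exact shape of your own API will differ):

{
    "siteconfig-001": {
        "configTitle": "Example Site",
        "allowExternalDomains": false,
        "seedUrls": ["http://www.example.com/"],
        "matchPatterns": ["https?://www\\.example\\.com/"],
        "notMatchPatterns": ["https?://www\\.example\\.com/private/"]
    }
}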

We’re going to use a class to handle all of the processing for injection.  We’ll create a file for this, call it configInjector.py.  The beginning of the file should look something like this:

import os
import uuid
import requests
from shutil import copy2

class ConfigInjector(object):
    def __init__(self):
        pass

We’re importing os and copy2 so we can create, edit, and copy the files that we need.  Next, we’re going to want to get the config itself, as well as the ID of the configuration node it came from.  We’ll make a new file for this, call it inject.py. This will be the script we actually run from cron for injection.  It begins something like this:

import urllib2
import json
import argparse
import configInjector

parser = argparse.ArgumentParser(description="Ingests configs.")
parser.add_argument("configUrl", help="URL of the JSON config endpoint.")
args = parser.parse_args()

For our imports, we’ll use urllib2 to download our remote JSON, json to parse it, and argparse to give our script an argument for where to download the JSON (requests and uuid are used in the class file, as before).  We’re also importing our own configInjector class file.

The argparse module allows us to pass command line arguments to the Python script.  In the code above, we instantiate the argument parser, add our argument (configUrl), and set the results of the argument to args.  This allows us to pass in a url for the location of our JSON endpoint.

Now that we have the foundation set up, let’s get the data. We’ll use urllib2 to grab the JSON and json.load() to load it into a variable:

response = urllib2.urlopen(args.configUrl)
configs = json.load(response)

We’ll then loop through it and call our class for each config in the JSON:

for configId in configs:
    configInjector.ConfigInjector(configId, configs[configId])

Now that we are getting the configs, let’s fill out our class and process them.  We’ll use the __init__ constructor to do the majority of our data transformations.  The two major things we want to do are to process and inject the Nutch config settings and to create a regex-urlfilter file for each config.

First, we’ll do our transformations.  We want to get our config options in order to plug into Nutch, so we’ll just set them as variables in the class:

class ConfigInjector(object):
    def __init__(self, configId, config):
        self.config = config
        self.configId = configId

        # Config transformations
        self.configTitle = self.config["configTitle"]
        self.allowExternalDomains = self.config["allowExternalDomains"]
        self.uuid = str(uuid.uuid3(uuid.NAMESPACE_DNS, str(self.configId)))

We’re setting three things in this example: a config title and UUID for reference, and a value for the Nutch setting db.ignore.external.links.  We’re using the static configId to generate the UUID so that the same UUID is always used for each individual configuration.
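
As a quick illustration of that determinism (uuid3 hashes the name, so the same configId always yields the same UUID):

import uuid

first = uuid.uuid3(uuid.NAMESPACE_DNS, "42")
second = uuid.uuid3(uuid.NAMESPACE_DNS, "42")
assert first == second  # re-running the injector reuses the same UUID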

Next, we’ll need to create some files for our seed URLs and match patterns.  We’re going to create two files, seed-XXXXXX.txt and regex-urlfilter-XXXXXX.txt, where XXXXXX is the configId.  For the seed files, we’ll create our own directory (called seeds), but for the regex files, we must store them in $NUTCH_HOME/runtime/local/conf in order for Nutch to find them (this is due to Nutch’s configuration of the Java CLASSPATH).  First, we’ll set the filenames based upon configId (this goes in the __init__ function):

self.regexFileName = 'regex-urlfilter-' + self.configId + '.txt'
self.seedFileName = 'seed-' + self.configId + '.txt'

We also want to call the functions we are about to write here, so that when we call the class, we immediately run all the necessary functions to inject the config (again, in the __init__ function):

# Run processes
self._makeConfigDirectories()
self._configureSeedUrlFile()
self._copyRegexUrlfilter()
self._configureRegexUrlfilter()
self._prepInjection()

Next, we’ll set up the directories (the underscore at the beginning of the method name is just a Python convention indicating that it is for internal use only):

def _makeConfigDirectories(self):
    if not os.path.exists('/path/to/nutch/runtime/local/conf/'):
        os.makedirs('/path/to/nutch/runtime/local/conf/')
    if not os.path.exists('/path/to/nutch/seeds/'):
        os.makedirs('/path/to/nutch/seeds/')

This simply checks to make sure the directories are there and makes them if they aren’t.  Next, we’ll create the seed files:

def _configureSeedUrlFile(self):
    furl = open('/path/to/nutch/seeds/' + self.seedFileName, "w")
    for url in self.config["seedUrls"]:
        furl.write(url + "\n")
    furl.close()

Basically, we are opening a file (or creating one if it doesn’t exist--this is how “w” works), writing each URL from the JSON config on its own line, and then closing the file.  We must end each URL with a newline (\n) for Nutch to understand the file.

Now we’ll make the regex file.  We’ll do it in two steps so that we can take advantage of what Nutch has pre-built.  We’re going to copy Nutch’s built-in regex-urlfilter.txt so that we can use all of its defaults and add any defaults we would like to all configs.  Before we do that, we have an important edit to make to regex-urlfilter.txt: remove the +. line from the end of the file in both /path/to/nutch/conf and /path/to/nutch/runtime/local/conf. We’ll add it back to the file ourselves, but if we leave it where it is, the filters won’t work at all because Nutch uses the first rule that matches when determining whether to fetch a URL, and +. means “accept anything”.  For our use, we’re going to add this back at the end of the file after we write our own regexes to it.

We’ll copy regex-urlfilter.txt in this function:

def _copyRegexUrlfilter(self):
    frurl = '/path/to/nutch/conf/regex-urlfilter.txt'
    fwurl = '/path/to/nutch/runtime/local/conf/' + self.regexFileName
    copy2(frurl, fwurl)

Then, we write our filters from the config to it:

def _configureRegexUrlfilter(self):
    notMatchPatterns = self.config["notMatchPatterns"]
    matchPatterns = self.config["matchPatterns"]
    regexUrlfilter = open('/path/to/nutch/runtime/local/conf/' + self.regexFileName, "a")
    if notMatchPatterns:
        for url in notMatchPatterns:
            regexUrlfilter.write("-^" + url + "\n")
    if matchPatterns:
        for url in matchPatterns:
            regexUrlfilter.write("+^" + url + "\n")
    regexUrlfilter.write("+.\n")
    regexUrlfilter.close()

A few things are going on here: we are opening the file we just copied and appending to it (that’s how “a” works), then writing each “do not match” pattern to the file, followed by the match patterns.  This is because, as we said before, Nutch will use the first regex rule that matches, so exclusions need to go first to avoid conflicts.  We then write +. so that Nutch accepts anything else--you can leave it off if you would prefer Nutch to exclude anything not matched, which is its default behavior when no rule matches.
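
For instance, a generated regex-urlfilter-XXXXXX.txt for a config with one exclusion and one inclusion (hypothetical patterns, appended after whatever defaults were copied from Nutch’s own file) would end with something like:

-^https?://www\.example\.com/private/
+^https?://www\.example\.com/
+.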

As a quick side note, it is important to mention that designing it this way means that each time we inject our configuration into Nutch, we wipe out and recreate these files.  This is the easiest path we found for implementation, and its only real disadvantage is that you cannot manually edit these files in any permanent way.  Just be aware.

Now that we have our files in place, the last thing we have to do is inject the configuration into Nutch itself. This will be our first use of the Nutchserver API.  If you have not already, open a console on the server that hosts Nutch and run:

$ nutch nutchserver

Optionally, you can add a --port argument to specify the port, but we’ll use the default: 8081.  Then we’ll prep the data for injection into the API:

def _prepInjection(self):
    config = {}

    # Custom config values
    config["meta.config.configId"] = self.configId
    config["meta.config.configTitle"] = self.configTitle
    config["meta.config.seedFile"] = '/path/to/nutch/seeds/' + self.seedFileName

    # Crawl metadata
    config["nutch.conf.uuid"] = self.uuid

    # Crawl config
    config["urlfilter.regex.file"] = self.regexFileName
    config["db.ignore.external.links"] = self.allowExternalDomains

    self._injectConfig(config)

Note that we are both creating our own custom variables for later use (we named them meta.config.X) and setting actual Nutch configuration settings.  Another note: urlfilter.regex.file takes a filename only.  You CANNOT specify a path for this setting, which is why we store the regex files in /path/to/nutch/runtime/local/conf, where the CLASSPATH already points.

Lastly, we’ll do the actual injection. The self._injectConfig(config) at the end of the _prepInjection function starts injection:

def _injectConfig(self, config):
    job = {"configId": self.uuid, "force": "true", "params": config}
    r = requests.post('http://localhost:8081/config/' + self.uuid, json=job)
    return r

All we do here is set up the JSON to push to the API and then inject.  Every configuration we send to the API must have a UUID as its configId (which we will reference later when creating crawl jobs).  We set force to true so that configurations get overwritten when they change upstream, and then we pass in our configuration parameters.

We then use the requests Python module to make the actual injection. This is significantly easier than using something like curl.  We post to a URL containing the UUID and send the JSON as the body (requests has a handy json argument that converts Python dictionaries to JSON before adding them to the body).  Lastly, we return the post response for later use if needed.

And that’s it!  We have successfully posted our dynamic custom configuration to nutchserver and created the relevant files.  In the next post, we’ll show you how to crawl a site using these configurations.


We've all come across API documentation that is out of date, lacking details, or just plain wrong! Often times an API’s documentation is entirely separate from the code itself, and there is no verification of the documentation's accuracy.

So how do we overcome this? Well, one solution for this problem is to generate the documentation as a result of passing acceptance tests.

The rspec_api_documentation gem allows you to write specs that generate HTML-formatted documentation. The rake task included for running your doc specs is rake docs:generate, which puts the HTML files in the project's ./doc directory. If any specs fail, the documentation doesn't generate.

Once the documentation is generated, it’s time to add a documentation viewer to your API server.

There are two gems for serving API documentation:

Mounting the API documentation server as part of your API server and running rake docs:generate as part of the deployment process ensures your documentation is up to date and available.

So, if you're going to create an API, use grape, rspec_api_documentation, and raddocs to ensure testable, accurate, and up-to-date documentation. Your API users will thank you.

Included is an example repository using these tools here: api_documentation_template.

Feel free to comment, ask questions, or share! We'd love to hear from you.


Everyone says they have a REST (or RESTful or REST-like) API. Twitter does, Facebook does, as does Twilio and Gowalla and even Google. However, by the actual, original definition, none of them are truly RESTful. But that’s OK, because your API shouldn’t be either.

The Common Definition

The misconception lies in the fact that, as tends to happen, the popular definition of a technical term has come to mean something entirely different from its original meaning. To most people, being RESTful means a few things:

  1. Well-defined URIs that “represent” some kind of resource, such as “/posts” on a blog representing the blog posts.
  2. HTTP methods being used as verbs to perform actions on that resource (i.e. GET for read operations and POST for write operations).
  3. The ability to access multiple format representations of the same data (i.e. both a JSON and an XML representation of a blog post).

There are some other parts of the common vocabulary of REST (for example, for some developers being RESTful would also imply a URI hierarchy such that /posts/{uniqueid} would be seen to be a member of the /posts collection), but these are what most people think of when they hear “RESTful web service.” So how is this different from the “actual” definition of REST?

Diverging From Canon

By the common definition of REST, a service defines a set of resources and actions that can be accessed via URI endpoints. However, the “true” definition of REST demands that resources be self-describing, providing all of the control context in-band of the provided representation. No out-of-band knowledge should, therefore, be required beyond understanding a media type that the resource can provide. From there, it should be possible to follow relations provided in “hypertext” context of the representation to “transfer state”, follow relations, or perform any necessary actions.

Another common divergence comes through the practice of using HTTP POST (or PUT) bodies with key-value pairs to create and update documents. In a canonically RESTful service clients should be posting an actual representation of the document in an accepted media type that is then parsed and translated by the service provider to create or update the resource.

Still more divergence comes in the common practice of denoting collections and elements. A truly RESTful web service has no concept of a “collection” of resources. There are only resources. As such, the proper way to implement a collection would be to define a separate resource that represents a collection of other resources.

Is anything truly RESTful?

Pretty much everyone who claims to have a REST API, in fact, does not. The closest I’ve found is the Sun Cloud API which actually defines a number of custom media types for resources and is discoverable based on a single known end-point. Everyone else, thanks for playing.

There is, however, one public and extremely widely used system that is entirely RESTful. It’s called the world wide web. Yes, as you’re browsing the internet you’re engaging in a REST service by the true definition of the name. Does your browser (the client) know whether it’s displaying a banking website or a casual game? Nope, it just utilizes standard media types (HTML, CSS, Javascript) to compose and represent the data. You don’t have to know the specific URL you’re looking for on a website so long as you know the “starting place” (usually the domain name) and can navigate there.

So REST by its original definition is far from useless. In fact, it’s an ingenious and flexible way to allow for the consumption and traversal of network-available information. What it’s not, however, is a very good roadmap toward building APIs for web applications.

Real REST is too hard.

Truly RESTful services simply require too much work to be practical for most applications. Too much work from the provider in defining and supporting custom media types with complex modeled relationships transmitted in-band. Too much work for clients and library authors to perform complex aggregation and re-formulation of data to make it conform to the real REST style. Real REST is great for generic, broad-encompassing multi-provider architectures that need the flexibility and discoverability it provides. For most application developers it’s simply overkill and a real implementation headache.

There’s nothing wrong with the common definition of REST. It’s leaps and bounds better than some of the methods that came before it and pretty much everyone is already on board and familiar with how it works. It’s a pragmatic solution that really works pretty well for everyone. As they say, if it ain’t broke, don’t fix it.

What’s in a name?

The only problem is that now we have lots of things that we’re calling REST that aren’t. Roy T. Fielding, primary architect of HTTP 1.1 and the author of the dissertation that originally defines REST, hasn’t always been happy with people calling things REST that aren’t. And maybe he has a point: these services certainly aren’t REST by his definition and because of the wide propagation of this incorrect definition of REST most people now don’t really understand the true definition. In fact, I don’t claim to have a great understanding of REST as Dr. Fielding defines it.

The problem is that the ship has sailed, and whether it’s true or not, REST now also means any simple, URL-accessible resource-based service. Perception is reality, and perception has changed about the definition of REST and RESTful. While the true definition is interesting for academic purposes and certainly lies behind the technologies upon which we build every day, it simply doesn’t have a whole lot of use to web application developers. The fact that (nearly) zero services exist that implement true REST for their API serves as testament to that.

What can we learn from REST?

Just because we don’t use true REST doesn’t mean there aren’t a few things we can learn from it. There are a few aspects that I’d love to see come into favor in the common definition. The idea of clients needing to know a few media types instead of specific protocols for each service is one that breaks down in practice for APIs due to the overwhelming number of web services with different needs in terms of domain-specific resource definition. However, wouldn’t it be great if there were an accepted application/x-person+json format that provided a standardized batch of user information (such as name, e-mail address, location, profile image URL) that you could request from Facebook, Twitter, Google or any OpenID provider and expect conforming data? Just because there are lots of domain-specific resources doesn’t mean that it isn’t worthwhile to try to come up with some standards for common information.
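
Such a representation might look something like this (a purely hypothetical sketch, since no such media type standard actually exists):

{
  "name": "Jane Doe",
  "email": "jane@example.com",
  "location": "Washington, DC",
  "profile_image_url": "http://example.com/jane.jpg"
}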

REST-like discoverability could also be a boon for some services. What if Twitter provided something like this along with a tweet’s JSON?

{
  "actions": {
    "Retweet": { "method": "POST", "url": "/1/statuses/retweet/12345.json" },
    "Delete": { "method": "DELETE", "url": "/1/statuses/destroy/12345.json" },
    "Report Spam": { "method": "POST", "url": "/1/statuses/retweet/12345.json", "params": { "id": 12345 } }
  }
}

So while REST as originally intended may not be a great fit for web applications, there are still patterns and practices to be gleaned from a better understanding of how such a service could work. For web applications, the case may be that REST is dead, long live REST!


While I’d been tracking with great interest the progress of OAuth 2.0, Facebook lit off the powderkeg yesterday by announcing that their entire API was moving to the protocol (as well as to RESTful JSON). As a developer who had been constantly confounded by the relentlessly hostile environment that Facebook seemed to present to developers, yesterday was a sudden and welcome about-face. The acquisition of FriendFeed, it seems, gave Facebook the talent they needed to do it right this time.

But anyway, on to the news! We have just released a gem for OAuth 2.0 to work with the new Facebook API. You can get it right now:

gem install oauth2

We wanted to get this into the hands of developers ASAP so for now the functionality is pretty much limited to the “web server” type of authentication (the protocol includes many different strategies, all of which will be implemented on the gem over time) and has been tested to work with Facebook’s new API.

So how do you use it? Here is an example Sinatra application containing all of the code necessary to authenticate and then perform requests against the Facebook API.

require 'rubygems'
require 'sinatra'
require 'oauth2'
require 'json'

def client
  OAuth2::Client.new('api_key', 'api_secret', :site => 'https://graph.facebook.com')
end

get '/auth/facebook' do
  redirect client.web_server.authorize_url(
    :redirect_uri => redirect_uri,
    :scope => 'email,offline_access'
  )
end

get '/auth/facebook/callback' do
  access_token = client.web_server.get_access_token(params[:code], :redirect_uri => redirect_uri)
  user = JSON.parse(access_token.get('/me'))
  user.inspect
end

def redirect_uri
  uri = URI.parse(request.url)
  uri.path = '/auth/facebook/callback'
  uri.query = nil
  uri.to_s
end

So now you’re ready to get started with the new Facebook API! This is still an early release, but I’ll be working on it a lot in the coming months, partially as preparation for my talk at RailsConf in which I’ll be delving into the OAuth 2.0 specification and what it means for Rails developers in-depth. The code is, of course, available on GitHub where you can report any problems you run into. Enjoy!

Update: Those who aren’t terribly familiar with the protocol may wonder why OAuth 2.0 isn’t just rolled into support of the OAuth gem (or why I didn’t fork it and do it that way). Honestly, I would have liked to, but OAuth 2.0 is an almost entirely different beast than 1.0a and they share so little functionality that it would basically be two projects living under the same gem name. So that’s why!


There are a number of times when I need something like an OpenStruct with a little more power. Often times this is for API-esque calls that don’t merit a full on ActiveResource. I wrote a small class for use with my ruby-github library and wanted to make it a separate gem because I think it’s pretty useful to have around.

Usage

Basically a Mash is a Hash that acts a little more like a full-fledged object when it comes to the keyed values. Using Ruby’s method punctuation idioms, you can easily create pseudo-objects that store information in a clean, easy way. At a basic level this just means writing and reading arbitrary attributes, like so:

author = Mash.new
author.name # => nil
author.name = "Michael Bleigh"
author.name # => "Michael Bleigh"
author.email = "michael@intridea.com"
author.inspect # => <Mash name="Michael Bleigh" email="michael@intridea.com">

So far that’s pretty much how an OpenStruct behaves. And, like an OpenStruct, you can pass in a hash and it will convert it. Unlike an OpenStruct, however, Mash will recursively descend, converting Hashes into Mashes so you can assign multiple levels from a single source hash. Take this as an example:

hash = { :author => { :name => "Michael Bleigh", :email => "michael@intridea.com" },
         :gems => [{ :name => "ruby-github", :id => 1 }, { :name => "mash", :id => 2 }] }

mash = Mash.new(hash)
mash.author.name # => "Michael Bleigh"
mash.gems.first.name # => "ruby-github"

This can be really useful if you have just parsed out XML or JSON into a hash and just want to dump it into a richer format. It’s just that easy! You can use the ? operator at the end to check for whether or not an attribute has already been assigned:

mash = Mash.new
mash.name? # => false
mash.name = "Michael Bleigh"
mash.name? # => true

A final, and a little more difficult to understand, method modifier is a bang (!) at the end of the method. This essentially forces the Mash to initialize that value as a Mash if it isn’t already initialized (it will return the existing value if one does exist). Using this method, you can set ‘deep’ values without the hassle of going through many lines of code. Example:

mash = Mash.new
mash.author!.name = "Michael Bleigh"
mash.author.info!.url = "http://www.mbleigh.com/"
mash.inspect # => <Mash author=<Mash name="Michael Bleigh" info=<Mash url="http://www.mbleigh.com/">>>
mash.author.info.url # => "http://www.mbleigh.com/"

One final useful way to use the Mash library is by extending it! Subclassing Mash can give you some nice easy ways to create simple record-like objects:

class Person < Mash
  def full_name
    "#{first_name}#{" " if first_name? && last_name?}#{last_name}"
  end
end

bob = Person.new(:first_name => "Bob", :last_name => "Bobson")
bob.full_name # => "Bob Bobson"

For advanced usage that I’m not quite ready to tackle in a blog post, you can override assignment methods (such as name=), and this behavior will be picked up even when the Mash is being initialized by cloning a Hash.

Installation

It’s available as a gem on Rubyforge, so your easiest method will be:

sudo gem install mash

If you prefer to clone the GitHub source directly:

git clone git://github.com/mbleigh/mash.git

This is all very simple but also very powerful. I have a number of projects that will be getting some Mashes now that I’ve written the library, and maybe you’ll find a use for it as well.


While the GitHub folks have produced their own github-gem that provides some useful command-line tools for GitHub users, the library they have written isn’t your traditional API wrapper since it’s focused around using GitHub rather than getting information from GitHub.

I’ve thrown together a small library called ruby-github that provides that kind of functionality. It’s extremely simple and works with all of the currently available API but that only comes down to three read-only calls at this point. Use like so:

user = GitHub::API.user('mbleigh')
user.name # => "Michael Bleigh"
user.repositories # => array of repositories
user.repositories.last.name # => "ruby-github"
user.repositories.last.url # => "http://github.com/mbleigh/ruby-github"
user.repositories.last.commits # => array of commits (see below)

commits = GitHub::API.commits('mbleigh', 'ruby-github')
commits.first.message # => "Moved github.rb to ruby-github.rb..."
commits.first.id # => "1d8c21062e11bb1ecd51ab840aa13d906993f3f7"

commit = GitHub::API.commit('mbleigh', 'ruby-github', '1d8c21062e11bb1ecd51ab840aa13d906993f3f7')
commit.message # => "Moved github.rb to ruby-github.rb..."
commit.added.collect { |c| c.filename } # => ["init.rb", "lib/ruby-github.rb"]

Installation

The easiest way to install ruby-github is as a gem:

gem install ruby-github

You can also install it as a Rails plugin if that’s your thing:

git clone git://github.com/mbleigh/ruby-github.git  vendor/plugins/ruby-github

Update 4/12/2008: Version 0.0.2 of the gem has been released and I have revised this post to adhere to the new gem’s requirements.


UPDATE

Click here for the latest on Beboist


The Beboist plugin provides a Rails interface to the Bebo Social Networking API.

The plugin was designed from the ground up to be flexible enough to accommodate any changes to the API, while at the same time providing a clean interface that will be familiar to most Rails developers.

Setup

Ensure that the json gem is installed on your system and the Beboist plugin is installed in your vendor/plugins folder:

gem install json
script/plugin install http://svn.intridea.com/svn/public/beboist

Generate your config/bebo.yml file using

script/generate beboist_settings

Fill in your appropriate app settings in config/bebo.yml. Ensure that your app name is right.

Generate the first migration for your users table using:

script/generate beboist_user_migration

Migrate your database using

rake db:migrate

In your application.rb, insert the following filters:

before_filter :reject_unadded_users
before_filter :find_bebo_user

Write your app, and keep an eye on your logs to catch any possible error messages.

API Reference

The methods listed in the Bebo API Documentation are mapped to Ruby classes in the following manner:

users.get_info(uids => "1,2,3", fields => "first_name, last_name")
# BECOMES
BeboUsers.get_info :uids => [1, 2, 3], :fields => ["first_name", "last_name"]

Notes

The Beboist plugin uses Bebo’s JSON API, and the ‘json’ gem to directly convert JSON objects to Ruby. It works with Rails 2.0+, but has not been tested on Rails 1.2. Check the README for more details, and file tickets at Intridea’s Public Trac
