Cool: Google Releases Protocol Buffers Into the Wild

I love that Google just open-sourced Protocol Buffers. Think of Protocol Buffers as a very compact way of encoding data in a binary format. A programmer can write a simple description of a protocol or structured data and Google’s code will autogenerate a class in C++, Java, or Python to read, write, and parse the protocol. Given a protocol buffer, you can write it to disk, send it over the network wire, and do any number of interesting tricks. Any medium-sized company (and quite a few startups!) should find Protocol Buffers very handy.
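To make that concrete, here is what a small, hypothetical .proto file might look like (the message and field names below are invented for illustration, not taken from Google's examples):

```proto
// person.proto: a made-up structured-data description.
// Each field has a unique tag number; the tag (not the name)
// is what actually gets encoded on the wire.
message Person {
  required string name  = 1;
  required int32  id    = 2;
  optional string email = 3;
}
```

Running Google's protoc compiler over a file like this (for example, `protoc --python_out=. person.proto`) generates a class with typed accessors plus methods to serialize to and parse from the compact binary format.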

You may want to read this paper about the Google cluster architecture if you haven’t already, because I’m going to remind you of two things about Google that are pretty obvious in retrospect. You can think of the Google cluster architecture as a bunch of moderately powerful personal computers connected by Ethernet. That’s not quite correct, but it’s a pretty good abstraction. In that model, you have pretty good disk/RAM/computational throughput, but network communication is much more limited. That leads to the first nice thing about Protocol Buffers: they’re very compact going over the wire.

To understand the other nice thing about Protocol Buffers, bear in mind that in the Google cluster architecture, there are many different types of servers that talk to each other. Question: how do you upgrade servers when you need to pass new information between them? It’s a fool’s game to try to upgrade both servers at the same time. So you need a communication protocol that is not only backward compatible (a new server can speak the old protocol) but also forward compatible (an old server can speak the new protocol). Protocol Buffers provide that because new additions to the protocol can be ignored by the old server. That lets you upgrade different servers at different times (check out the “A bit of history” section in that overview). Protocol Buffers are especially appropriate to represent requests and replies between a client and a server.
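The mechanics behind that compatibility are worth a quick sketch. On the wire, every field is encoded as a numeric tag plus a payload, so a parser that hits a tag number it does not recognize can skip the payload instead of failing. The toy Python decoder below illustrates the idea; it is not the real protobuf library (it handles only length-delimited string fields, and the field numbers are invented):

```python
# Toy sketch of the Protocol Buffers wire-format idea: fields are
# tagged with numbers, so unknown tags can be skipped, not rejected.

def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(buf, i):
    """Decode a varint starting at buf[i]; return (value, next_index)."""
    shift = value = 0
    while True:
        b = buf[i]
        i += 1
        value |= (b & 0x7F) << shift
        if not (b & 0x80):
            return value, i
        shift += 7

def encode_string_field(field_no, text):
    """Wire type 2 (length-delimited): key varint, length varint, bytes."""
    data = text.encode("utf-8")
    key = (field_no << 3) | 2
    return encode_varint(key) + encode_varint(len(data)) + data

def decode_known_fields(buf, known):
    """Parse only field numbers in `known`; silently skip the rest,
    just as an old server skips fields added by a newer protocol."""
    fields, i = {}, 0
    while i < len(buf):
        key, i = decode_varint(buf, i)
        field_no, wire_type = key >> 3, key & 0x7
        if wire_type != 2:
            raise ValueError("toy decoder only handles wire type 2")
        length, i = decode_varint(buf, i)
        if field_no in known:
            fields[field_no] = buf[i:i + length].decode("utf-8")
        i += length
    return fields

# A "V2" sender includes field 3; a "V1" reader only knows fields 1 and 2.
msg = (encode_string_field(1, "query") +
       encode_string_field(2, "hello") +
       encode_string_field(3, "new-in-v2"))
print(decode_known_fields(msg, known={1, 2}))   # -> {1: 'query', 2: 'hello'}
```

In the real implementation a parser also holds on to the bytes of unknown fields and writes them back out when the message is reserialized, which is what lets an old server in the middle pass brand-new fields through untouched.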

(By the way, congrats also to the folks that worked to release this code outside of Google. Making open-source code available to the outside world is a great way to build goodwill with developers.)

There are over 10,000 .proto files in use at Google, and Protocol Buffers are a vital part of Google. If you’re a programmer, why not try Protocol Buffers out for yourself?

23 Responses to Cool: Google Releases Protocol Buffers Into the Wild

  1. Interesting

    The last time I did something like this was an XML parser to take input from an SMS gateway on SMS delivery/non-delivery – which was fun, but just getting the XML parsers for the language was such a pain, and in the end I ended up not being able to validate the XML and just went with checking its well-formedness. This would be much simpler, especially on the parsing side.

    Sometimes I miss programming.

  2. “Think of Protocol Buffers as a very compact way of encoding data in a binary format. A programmer can write a simple description of a protocol or structured data and Google’s code will autogenerate a class in C++, Java, or Python to read, write, and parse the protocol. Given a protocol buffer, you can write it to disk, send it over the network wire, and do any number of interesting tricks. Any medium-sized company (and quite a few startups!) should find Protocol Buffers very handy.”

    Can anybody translate what the above means for me? 🙂

    Matt,

    Maybe we need to create another category: “Google Programming” in addition to Google/SEO 🙂

  3. I must say I’m a little surprised that Google’s releasing this at this time; it’s very similar to Facebook’s Thrift thing.

  4. All these concepts come from previous technologies like CORBA/IDL and ASN.1, but PBs sound simpler and lighter than those earlier counterparts. You get serialization/deserialization without the overhead of some big object broker or whatnot.

    What would be cool is if PBs *did* have a standard mapping to XML (even for diagnostic purposes), even though the standard serialization behavior is binary.

    Looks cool, I think I’ll check it out. I haven’t hacked on C++ in a while. 🙂

  5. Harith, imagine that I need to talk to a server. I need to send a request code and a string. I can make a small .proto file (<10 lines), then magically get a class that creates such a request, complete with the ability to read/set various fields in my request.

    Robert Synnott, I can virtually guarantee that Protocol Buffers predate Thrift, because we’ve been using them since before Facebook was around. 🙂

  6. Matt,

    Thanks for the explanation. I’m gonna redirect our young programmers and server folks to your current post on Monday (still on vacation). That must be something they need to know 🙂

  7. Very cool. Does this support routing a newer version of a message through an older server? If a V2 server sends a V2 object to a V1 server, which routes the object on to another V2 server, does the final server get a V2 object or a V1 object?

  8. “Protocol Buffers” is a mouthful…I think I’ll start calling it –> “Peanut Butters”.

  9. Hi Matt,

    Will there be a recap for the Google Trifecta seminar? I kept on getting a socket error when I tried to connect (yes, I downloaded real player.) Thanks

  10. “There are over 10,000 .proto files in use at Google”

    10,000? You mean you have a protocol for every unique situation? Then why call them a protocol?

    from: http://research.google.com/archive/googlecluster.html

    Amenable to extensive parallelization, Google’s Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google’s architecture features clusters of more than 15,000 commodity class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.

    ClusterS of more than 15,000 PC’s? I guess that “more than 10,000 PC’s at Google” you mentioned a while back was related to the number of employee PC’s 🙂

    Funny guy… 🙂

  11. @Frederick Polgardy

    ASN.1? Argh!!!! I’m having an OSI flashback to the days of debugging X.400 email problems using X.409 decoded into ASN.1.

    (My boss was a flash git and could just look at the hex dump as it scrolled.)

  12. Matt,

    PB sounds really cool. I’m sure that Google wouldn’t have created this in house if there was something more efficient to use such as what some of the others suggested (e.g., CORBA/IDL, ASN.1). I’m wondering a few things:

    + Does Google have any in-house tools which export data from an LDAP repository (say, using the inetOrgPerson schema) into a PB structure?

    + Besides server-to-server communications, could PBs be used with web client-based applications such as mashups? If so, how would this work? Wouldn’t Javascript (the main lingo for web user agents, which allows you to access the DOM) need to be able to decode PB streams?

    + I understand that Guido works at Google so I can understand the support already for Python. I have nothing against Python, but I’ve found Ruby to be very enjoyable to work with. Is Google leaving it up to the open source communities to take PB and bring it into the realm of other languages such as Ruby?

    Thanks,

    Eddie

  13. Most techy stuff is over my head, appreciate the breakdown….

  14. Sorry to post this here Matt but I wanted to bring it to your attention because it annoys the heck out of me. The google index is getting terribly polluted by newsgroup messages found on groups.google.com (and the other groups.google.* tlds). I’ve run several searches today for ebooks and found sizeable chunks of the search results being spam messages from the various newsgroups. The worst thing is that there are multiple instances in the search results because each groups.google.*.* server has its own copy that’s been indexed.

  15. Hey Matt,

    This is completely unrelated, but I just thought I’d mention your blog fails both the XHTML and CSS validation tests you have linked at the bottom of every page.

  16. Could this be used to communicate with Amazon S3 in a very variable manner?
    Serving data from S3 without having to define the class to S3?

    This sounds like it would make that kind of distributed serving of data more efficient.

    afewtips.com

  17. This is great but why not just use Facebook’s Thrift? Apart from C++, Java and Python, it additionally supports Perl, PHP, XSD, C#, Ruby, Objective C, Smalltalk, Erlang, OCaml, and Haskell. Also, it has more features than PB and it provides a full client/server RPC implementation, whereas PBs just create stubs to use in your own RPC system.

  18. It looks very similar to the JSON data format to me. Javascript can natively process that data format, and there is a library in Java to process it too. So is the extra value only in data type validation?
    Can it encapsulate binary/blob data?

  19. By the way, fellow Googler Mark Pilgrim added some nice perspective here:
    http://diveintomark.org/archives/2008/07/12/protobuf

    Some pointers from that post include progress on protocol buffers (sometimes called pbuffers) in Haskell and Perl:
    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/protocol-buffers-0.0.5
    http://groups.google.com/group/protobuf-perl

    Other fellow Googlers have weighed in here:
    http://scottkirkwood.blogspot.com/2008/07/google-opensources-protocol-buffers.html

    http://zunger.livejournal.com/164024.html

  20. “Does this support routing a newer version of a message through an older server? If a V2 server sends a V2 object to a V1 server, which routes the object on to another V2 server, does the final server get a V2 object or a V1 object?”

    Steve Brewer, great question. Yonatan Zunger answered exactly this question in his post. I’ll quote the relevant bit:

    If the protocol deserializer comes across a tag number which isn’t in its copy of the protocol definition, it will just keep it as uninterpreted data and pass it along when it reserializes the proto. So if you have three servers, and A sends a message to B which processes it and then sends it on to C, and you want to add a new field which A uses to communicate something to C, you don’t need to update the B server; it will just pass the updated protocol message along to C. I can’t even begin to tell you how much more pleasant this can make your life.

  21. Harsha, sometimes Googlers refer to them as pbuffers to save time. Often we don’t capitalize the spelling either. 🙂

    Peter (IMC), I think the official word is that we have “over 10,000 PCs” at Google. That stat is a few years old, but it’s the official word, so that’s what I’ll stick with.

    Eddie, normally we use PBs for talking between servers, so I’m not aware of any code to do LDAP with pbuffers. Google tends to use C++ for lots of production machines because we want to wring the most performance from things. There might be JavaScript-to-pbuffer code, but I’m not aware of any. My guess is that we might look to the open-source community for Ruby support; there are people that use Ruby at Google, but it’s not as common as Python/C++/Java.

  22. I tried downloading those buffers and could not get it to install. I am running a quad four with terabytes and gigs. I am running Vista but can vary back to Linux, and that did not work either.

