Friday, December 20, 2013

What am I missing out on if I'm not using Distributed Version Control?


As requested, I'm posting my thoughts about Centralized version control versus Distributed version control.  Git and Mercurial are the two tools I'm grouping together in this post as Distributed.

Centralized version control systems like Subversion, Perforce, and Team Foundation Server were, for their day, wonderful things.  They are dated technology now, and for many teams, the benefits of moving to a distributed version control system could be huge.

But I Want One Repo To Rule Them All!

Fine. You can still have that.  You can even keep changes you don't want out of that One Repo with greater efficiency than before.  Distributed version control systems actually provide greater centralized control than centralized version control systems.  If you don't know DVCS, that should sound like heresy at first; once you've truly grasped DVCS fundamentals, it should be obvious.

Even though a dedicated central host server is technically optional, you can still build your workflow around centralized practices when they suit you.

Disclaimer: There Are No Silver Bullets.  DVCS may not be right for you.

Note that I am not claiming a Silver Bullet role for this type of tool.  For some specific teams or situations, centralized version control may still be the best thing.  There is no one-size-fits-all solution out there, and neither Mercurial nor Git is a panacea.  They certainly won't fix your broken processes for you.  Expectations often swirl around version control systems as if they could fix a weak development methodology.  They can't.  Nor can any tool.

However, having a more powerful set of tools expands your capabilities.  Here are some details on how a better, more powerful set of tools can do that.

1.  Centralized version control makes it the default behavior of your system to inflict your changes on others as soon as you commit them yourself.  This is the single greatest flaw in centralized version control.  It leads to people committing less often, because they have to either (a) not commit, (b) go ahead and commit and inflict their work on others, or (c) create a branch (extra work), commit to it, and then merge it later (yet more extra work).  Fear of branching and merging is still widespread, even today, even with Subversion and Perforce, especially when changes get too large and too numerous to ever fall back on manual merging.  I recently had a horrible experience with Subversion breaking on a merge of over 1800 modified files, and I still have no idea why it broke.  I suspect it was something about "--ignore-ancestry" and the fact that Subversion servers permit multiple client versions to commit inconsistent metadata into your repository, because Subversion servers are not smart middle-tier servers; they're basically just dumb HTTP-DAV stores.  I fell back to manually redoing my work, moving changes by hand using KDiff3.  With distributed version control, you can take control and direct when changes land in trunk, without blocking people's work, preventing them from committing, or forcing them to create a branch on the server and switch their working copy onto it, which in some tools, like Subversion, is a painful, slow, and needlessly stupid process.
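For the skeptical, here is roughly what that looks like in Mercurial; a minimal sketch, where the commit messages and the central repository URL are made up:

    hg commit -m "WIP: refactor invoice totals"           # local only; nobody else sees this yet
    hg commit -m "fix rounding in invoice totals"         # still local
    hg push https://server.example.com/central-repo       # changes reach the team only when you decide

Until that push, you can commit as often as you like without inflicting anything on anyone.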

2.  Centralized version control limits the number of potential workflows you could use, in a way that may prevent you from giving your customers, and the business, the best that your team can give.  Distributed version control systems encourage lightweight working copy practices.  Making a new local clone is easy, very fast, and requires no network traffic.  Every single new working copy I have to create from Subversion uses compute and network resources that are shared with the whole team.  That creates a single point of failure, and a productivity bottleneck.
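As a rough illustration (the paths here are hypothetical), a throwaway working copy is just a local clone, with no server round trip at all:

    hg clone C:\work\BigApp C:\work\BigApp-experiment    # pure local copy; no network traffic
    cd C:\work\BigApp-experiment
    hg pull ..\BigApp                                    # later, bring in whatever landed in the original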


3.  Centralized version control systems generally lack the kind of merging and branching capabilities that distributed version control systems provide.  For example, Mercurial has both "clones as branches" and "branching inside a single repo".  I tend to use clones and merge among them, because that way I have a live working copy for each, and don't need to use any version control GUI, shell, or command line commands to switch which branch I'm on.  These terms won't make sense to you until you try them, but you'll find that having more options opens up creative use of the tools.  Once you get it, you'll have a hard time going back to the bad old days.
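A quick sketch of the two styles, with branch and directory names that are only examples:

    # "clones as branches": one directory per line of development
    hg clone C:\work\BigApp C:\work\BigApp-7.x

    # "branching inside a single repo": named branches, switched in place
    hg branch feature-new-report
    hg commit -m "start the new report feature"
    hg update default                   # switch the working copy back to the main branch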

4.  For geographically distributed teams, Distributed Version control systems have even greater advantages.  A centralized version control system is a real pain over a VPN.  Every commit is a network operation, like it or not.  DVCS users can sync using a public or private BitBucket repository at very low cost, and don't even have to host their own central servers.
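Here is a hedged sketch of that model in practice (the Bitbucket URL is a placeholder): every commit happens locally, with no network at all, and synchronization is an explicit, occasional step.

    hg commit -m "offline work on the train"             # no VPN, no network needed
    hg pull -u https://bitbucket.org/yourteam/bigapp     # fetch teammates' work when it suits you (merge first if histories diverged)
    hg push https://bitbucket.org/yourteam/bigapp        # publish yours when you're ready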


5. For open source projects, Distributed Version control permits the "pull request" working model.  This model can be a beneficial model for commercial closed source projects too.   Instead of a code review before each commit, or a code review before merging from a heavyweight branch, you could make ten commits, and then decide to sync to your own central repository.   Once the new code is sitting there, it's still not in the trunk until it is reviewed and accepted by the Code-Czar.

6.  For working on your own local machine, if you like to develop in virtual machines, having the ability to "clone" a working copy quickly from your main physical machine down into a VM, or from VM to VM, using a simple HTTP-based clone operation, can really accelerate your use of VMs.  For example, my main home Delphi PC is a Dell workstation that runs Windows 8.1 on the real machine, and Hyper-V with a whole bunch of VMs inside it.  I have most of the versions of Windows that I need in there.  If I need to reproduce a TWAIN DLL bug that only occurs on Terminal Server-equipped Windows Server 2008 R2 boxes, I can do it.  I can have my repos cloned and moved over in a minute or two.  And I'm off.
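For the curious, that HTTP-based clone is just Mercurial's built-in, throwaway web server.  A sketch, with a made-up host name and paths:

    # on the physical machine, in the repo directory:
    hg serve -p 8000                                     # serves this repo read-only over HTTP

    # inside the VM:
    hg clone http://my-workstation:8000/ C:\test\BigApp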

7.  Rebasing is a miracle.  I'll leave the details for later, but imagine this:  I want to commit every five minutes, and leave a really detailed history while I work on some low level work.  I want to be able to see the blow-by-blow history, and commit things in gory detail.  When I push those changes to the master repo, I want to aggregate them before they leave my computer, and give them their final form.  Instead of having 80 commits to merge, I can turn them into one commit before I send that up to the server.
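One way to do that collapse in Mercurial, sketched with the bundled rebase extension enabled; the revision number and message are made up, and histedit's "fold" is another route to the same place:

    # squash a long run of local commits into a single changeset, then publish it
    hg rebase --source 123 --dest default --collapse -m "storage layer rework, squashed from 80 local commits"
    hg push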

8.  Maintaining the ability to sync fixes between multiple branches that have substantial code churn over time is possible, although really difficult.  By churn, I mean that there are both changes that need merging and changes that don't.  This is perhaps the biggest source of pain for me with version control, with or without distributed version control systems.  Imagine I'm developing Version 7 and Version 8 of AmazingBigDelphiApp.  Version 7 is running in Delphi 2007, and Version 8 is running in Delphi XE5, let's say.  Even with distributed version control (Git or Mercurial), this is still really, really hard.  So hard that many people find it isn't worth doing.  Sure, it's easy to write tiny bug fixes in Version 7 and merge them up to Version 8, unless the code for Version 8 has changed too radically.  But what happens when both Version 7 and Version 8 have heavy code churn?  No version control system on earth does this well.  But I will claim that Mercurial (and Git) do it better than anybody else.  I have done fearsome merge-ups and merge-downs between wildly disparate systems, and I will take Mercurial any day, and Git if you force me, but I will NOT attempt to sync two sets of churning code in anything else.  I can't put this in scientific terms.  I could sit down with you and show you what a really wicked merge session looks like, and you would see that although Git and Mercurial will do some of the work for you, you have to make some really hard decisions.  You have to determine what some random change is that landed in your code, how it got there, whether it was intentional or a side effect, and if it was intentional, whether it's necessary in the new place where it's landing.  If it all compiles, you could go ahead and commit it.  If you have good unit test coverage, you might even keep your sanity, your remaining hair, and your customers.
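When I do have to move a single small fix between churning branches, grafting one changeset at a time is the least painful route I know.  A sketch, with a hypothetical clone path and a made-up revision number:

    # working in the Version 8 clone: pull the Version 7 history, then cherry-pick just the fix
    hg pull C:\work\AmazingBigDelphiApp-7.x
    hg graft -r 4567                    # the single changeset containing only the bug fix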

9.  Mercurial and Git have "shelves" or "the stash".  This is worth the price of admission all by itself. Think of it as a way of "cleaning up my working copy without creating a branch or losing anything permanently". It's like that Memory button in your calculator, but it can hold various sets of working changes that are NOT ready to commit yet, without throwing them away either.
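A minimal sketch of the idea in both tools; Mercurial's shelve is a bundled extension you switch on in your configuration, and the shelf name is made up:

    hg shelve --name half-done-idea     # set aside uncommitted changes without losing them
    # ... work on something else, commit it ...
    hg unshelve                         # bring the set-aside changes back

    # the Git equivalent:
    git stash
    git stash pop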

10.  Mercurial and Git are perfect for creating that tiny repo that is just for that tiny tool you just built.  Sometimes you want to do version control (at work) for your own temporary or just-started-and-not-sure-if-the-team-needs-this utility projects, without inflicting them on anybody else.  Maybe you can create your own side folder somewhere on your Subversion server where it won't bother anybody, or maybe you can't.  Should you be forced to put everything you want to commit and "bookmark" up on the server as a permanent thing, for everybody to wonder "why is this in our Subversion server?"  I don't think so.
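Starting that tiny private repo is about ten seconds of work; the paths and file names here are just placeholders:

    cd C:\dev\TinyTool
    hg init                                  # this folder is now a repository; no server involved
    hg add TinyTool.dpr TinyTool.dproj
    hg commit -m "first cut of my log-scraper utility"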

11.  Mercurial and Git are perfect for sending a tiny test project to your buddy at work in a controlled way.  You can basically run a tiny local server, send a URL to your co-worker, and they can pull your code sample across the network.  They can make a change, and then they can push their change back.  This kind of collaboration can be useful for training, for validation of a concept you are considering trying in the code, or for any of dozens of other reasons.  When your buddy makes a change and sends it back, you don't even have to ask "what did you change?", because the tool tells you.
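And "the tool tells you" looks roughly like this; a sketch with a made-up host name, assuming your co-worker is running Mercurial's built-in web server on their side:

    hg incoming -p http://coworker-pc:8000/     # list their new changesets, with full diffs, before pulling
    hg pull http://coworker-pc:8000/            # bring the changes in once you've looked them over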


Bonus: Some Reasons to Use Mercurial Rather than Anything Else

TortoiseHG in particular is the most powerful version control GUI I have ever used, and it's so good that I would recommend switching just so you get to use TortoiseHG.  Compared to TortoiseSVN, it's not even close.  For example, TortoiseSVN's commit dialog lacks filter capabilities.  SVN lacks any multi-direction synchronization capabilities, and cannot simplify your life in any way when you routinely need to merge changes up from 5.1-stable to 6.0-trunk; it's the same old "find everything manually, do it all by hand, and hope you do it right" thing every time.

Secondly, the command line. The Mercurial (HG) command line kicks every other version control's command line's butt.  It's easy to use, it's safe, and it's sane.

SVN's command line is pathetic; it lacks even a proper "clean my working copy" command.  I need a little Perl script to do what the SVN 1.8 command line still can't do.  (TortoiseSVN's GUI has a reasonable clean feature, but not svn.exe.)  Git's command-line features are truly impressive.  That's great if you're a rocket scientist and a mathematician with a PhD in graph and set theory, and less great if you're a mere human being.  The HG (Mercurial) command line is clean, simple, easy to learn, and even pretty easy to master.  It does not leak the implementation details out.  All you should need to evaluate this yourself is to read the "man pages" (command line help) for Git and Mercurial.  Which one descends into internal implementation jargon at every possible turn?  Git.
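For comparison, the "clean my working copy" operation mentioned above is a one-liner in both DVCS tools; Mercurial's purge is a bundled extension you enable in your hgrc:

    hg purge --all      # remove untracked and ignored files from the working copy
    git clean -fdx      # the Git equivalent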
I've already said why I prefer HG to Git.  I could write more about that, but I must say I really respect almost everything about Git.  Everything except the fact that it does allow you to totally destroy your repository if you make a mistake.  That seems so wrong to me that I absolutely refuse to forgive Git for making it not only possible but pretty easy to destroy history.  That's just broken.  (Edit:  I think this 2013 opinion was based on inaccurate information.  Rewriting history is acceptable, and in fact permitted, in both Mercurial and Git, and the chance of a well-meaning developer accidentally erasing his Git repository remains very small.)

Side Warning


Let me point out that you need a backup system that makes periodic full backups of your version control server, whether it is centralized or distributed.  Let me further point out that a version control system is not a backup system.  You have been warned. If you use Git, you may find that all your distributed working copies have been destroyed by pernicious Git misfeatures.  If you choose to use Git, be aware of its dark corners, and avoid the combinations of commands that cause catastrophic permanent data loss. Know what those commands are, and don't do those things.

 Get Started: Learn Mercurial

If you want to learn Mercurial,  I recommend this tutorial: http://hginit.com/

I really, really recommend you learn the command line first, whether you choose to learn the Git or the Mercurial one.  Make your first few commits with the command line.  Do a few clone commands, do a few "show my history" commands (hg log), and a few other things.  If you don't know these, you will never master your chosen version control system.  GUIs are great for merges, but for just getting started, learn the command line first.  You're a programmer, darn it.  This is a little "language"; it will take you a day to learn it.  You will reap benefits from that learning time, forever.
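If it helps, here is the kind of first session I mean; one possible sequence, not a prescription:

    hg init sandbox
    cd sandbox
    echo hello > readme.txt
    hg add readme.txt
    hg commit -m "my first commit"
    hg log                          # show my history
    hg clone . ..\sandbox-copy      # clone your own repo, just to see how cheap it is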

And The Flip Side:  Reasons to stay with Subversion



(2016 update: Almost every team I know that hasn't moved to Git, has at least some people on the team who wish they could, and at least one stick in the mud in a position of power, overriding that team instinct.   I've changed my mind about Mercurial versus Git as well, and recognize that the programming world has chosen Git.  So it's time to learn how to fit in, instead of being the guy that sticks out because of his outlier opinion.   The tools and the community around Git are superior to the tools and community for Mercurial. )

15 comments:

  1. Thank you very much indeed Warren - I wasn't expecting quite such a speedy (and detailed) response.

    I will spend at least some of Christmas playing with Mercurial again, and will report back.

    I note your love of TortoiseHG. Do you also use the IDE integration - Version Insight plus?

  2. It's funny. All that is written here against centralized VCS could be written in a similar way about Delphi, and why you should switch to a more modern language and framework instead of Pascal and Delphi. Anyway, there are several flaws:

    1) "Inflicting" changes may be a good thing if your workflow is set up properly. You want to spot issues early in the development process, not when it is too late and changes are then huge and costly. Moreover, if you have a centralized continuous integration and test system. you need changes reach it.
    2) Network resources are never an issue today, with 1Gb networks going to 10/40Gb ones. If you still have a 10Mb network, it could be an issue. Do you have one?
    3) I agree more modern tools have better merging capabilities of older ones. It's not because they are distributed, it's because they are able to use data about changes better. The same way Delphi compiler is worse than more modern compiler just because it wasn't updated by skilled developers.
    4) A centralized VCS is a pain over VPN only if your connection is slow. I routinely use mine over a VPN and it's not slow at all - after all we stream movies and other heavy media over our Internet connection, don't we? If it works how could an update/commit of a few megabytes be slow? Hosting a central server is a requirement of any organization that cares about its IP. I would never allow my company main asset, our code, to be hosted outside our servers.
    5) Exactly. DVCS are built around the open source distributed development teams. Their model is often not the right one for commercial applications.
    6) We do use remote debugging on test machines. We never copy source code and development tools to test machines. Besides licensing issues, code must be tested in the deployment configuration, not a development one. VMs are also usually restored to a clean configuration before any test,
    7) Meanwhile the code exists only on your machine and may not be backed up. If something goes wrong, you lost your 80 commits.
    8) It looks you have a big process issues, and you're managing your development and branches the wrong way.
    10) I've a subversion repository exactly for this kind of projects. So I can also control how people spend their paid time in the company... and because code produces in the company is a company property, it has to be somewhere on a company server, not or your machine only.
    11) No, I do not really want *everybody* sets up his or her tiny server! That's just a mess in any company. No desktop must act like a server, for several security reasons.

    I won't comment on the usual DVCS worshipper claims about why it is better and why you must use it and nothing else. DVCS has started a religious war, like the one about languages that is still so strong only among Delphi worshippers.


  3. @kmorwath: You *think* you can *control* what people do using subversion? I'd really like to hear more about how that works, exactly. I really doubt you are controlling anything. I think you perceive what happens in subversion differently than what actually happens, so I'd like you to clarify so I can understand you. Also I wonder what you mean by "no desktop may act like a server". What do you mean exactly by that? Do you mean that you do not allow anyone to make any socket connections from one workstation to another? Several security reasons? I think you should be a bit more forthcoming.


    Replies
    1. @kmorwath: part 2 : I *like* your idea of setting up a "side server" for tiny projects and code, whether you use Subversion or a DVCS. I think maybe you didn't understand me when I said that if you want a central workflow, you can still practice such workflows. Using a tool with fewer features to prevent people from doing what you don't like is like trying to prevent kids from skipping school by nailing their feet to the floor. It's barbaric.

    2. @kmorwath: part3: I think you make a good point about remote debug. Sometimes remote debug is the right solution. In your organization, maybe it's ALWAYS the right solution. I think you've made a valid point, and being able to move code around may not always be the ONLY consideration. Did you notice that at the top of this post, I said that I am not claiming that all these reasons or ideas are equally applicable to everyone else? I would appreciate a similar modicum of reserve from you as you make large sweeping claims with abandon.

  4. We use the practice you mention in #3. That alone makes everything else worthwhile.

  5. I'd have liked to use Mercurial, but it did not really work on one of our Windows servers. There are a few pages about setting it up, but none really worked.
    So we decided to use SVN, though it is worse. Installer, a few clicks, done. Repository, users, everything done within a few minutes.

    As soon as there is a similar possibility for Mercurial, we will use it instead of SVN...

  6. @jaenicke : I have never heard of there being a problem with getting it working on Windows. If you would like a tutorial on setting it up, let me know and I'll blog how it's done.

    W

  7. I will probably never understand the fuss about data loss in git. Why is it easier to _unintentionally_ lose data in git than it is in mercurial?
    Even with an unintentional git reset --hard, your data is not lost at all. This makes me feel really safe when using git's history-changing features.
    You were talking about rebase in hg. I wonder, because I never really used hg, is it as easy as in git to undo the rebase or compare the result with the state before the rebase? Or even to keep both versions for a while, for that matter?

    Also, you mention git's internals leaking through the command line as though that were a bad thing. I personally think it is a good thing. The git data model is intentionally very simple and hence easy to understand. I want to know what I'm doing, or what I did, when I execute a command. Knowing what happens inside makes the tool more powerful, I think.

    Replies
    1. Hi. In Mercurial, the rebased commits are currently taken out of the repository and stored in a "bundle" in the .hg directory. If you want, you can "hg pull" from such a bundle to get the original back, or you can run "hg rebase --keep" to avoid stripping them. Aborting a rebase works exactly like in Git: "hg rebase --abort".

      In the future, Mercurial will default to keep the commits around in the repo. They will be marked "obsolete", which means that Mercurial knows that there is a better version of the commits. These "obsolete markers" are even exchanged between repositories: that way I can "hg push" after rebasing and the rebase will then flow to the remote repo. This feature is currently being developed and is known as "changeset evolution": http://mercurial.selenic.com/wiki/ChangesetEvolution

    2. This bundle, can you compare the contents of it with your current state?
      I mean in git you do for example:
      git branch temp_backup # create ref to current state for later reference
      git rebase -i HEAD~4 # modify your 4 last commits
      git diff temp_backup # check what has changed with before rebase
      git branch -D temp_backup # remove backup ref

      When you forget to make a backup ref, one can also use the reflog and simply run
      git diff "HEAD@{1 minute ago}" # or something
      Is something similar possible in hg?

      Mercurial obsolete markers might be a step in the right direction. Let's see how that turns out.

    3. Some years later, I changed my mind. I've learned to stop worrying and love the git reflog, and its model, which once appeared insane to me, now looks more like a "table saw" power tool in my view. Dangerous but useful.

  8. In short, the divide comes down to those developers who wish to perceive any action as "undo-able", even when allowing such an "undo" could create a situation where undoing the undo (redo) becomes impossible. Mercurial makes a sensible choice: let the user change the head state of the repo, but do not lose the prior states, and do not utterly delete them. Stripping is possible but is rarely used, and I never ever strip unless I clone first.



    Replies
    1. So how does this 'divide' differ between the mercurial and the git users? Do you suggest that in git an undo makes a redo impossible?
      I think git makes a sensible choice as well. Users are able to change the HEAD state, like in Mercurial. Users are not able to remove previous HEAD states; all previous HEAD states are stored in the reflog. They are simply hidden. When an old state is not referenced for a long time (usually 30+ days), the user is obviously not interested in it anymore and git garbage-collects it.
      This way, an undo is possible, a redo is possible, an undo of a redo is possible. Heck, even comparing a redo with the state before the redo is possible. Keeping a state from before an undo forever is possible simply by referencing it somewhere.

    2. The difference is that the Mercurial workflow does not permit you to store and then synchronize a data-losing operation, turning a local data loss into a remotely synchronizable, pushable or pullable data loss.
