Thu, 01 Jan 2015

CPAN Pull Request Challenge: A call to the CPAN authors

Permanent link

The 2015 CPAN Pull Request Challenge is ramping up, and so far nearly two hundred volunteers have signed up, pledging to make one pull request for a CPAN distribution for each month of the year.

So here's a call to all the CPAN authors: please be supportive, and if you don't want your CPAN distributions to be part of the challenge, please send an email to neil at bowers dot com, stating your PAUSE ID and the fact that you want to be excluded.

How to be supportive? The first step is to act on pull requests. If you don't have time for a review, please say so; getting some response, even if it's "it'll be some time until I get around to reviewing this", is much better than none.

The volunteers have varied backgrounds; some are seasoned veterans, others are beginners who will make their first contribution to Open Source. So please be patient and encouraging.

If you have specific requirements for contributions, add a file called CONTRIBUTING to your GitHub repositories where you state those requirements.

And of course, be civil. But that goes without saying, right? :-)

(And to those CPAN authors who haven't done it yet: put your distributions on GitHub, so that you're not left out of the fun!)

Happy New Year everybody, and have a great deal of fun!

See also: Resources for the CPAN Pull Request Challenge.

[/misc] Permanent link

comments / trackbacks

Sat, 15 Feb 2014

The Fun of Running a Public Web Service, and Session Storage


One of my websites, Sudokugarden, recently surged in traffic, from about 30k visitors per month to more than 100k visitors per month. Here's the tale of what that meant for the server side.

As a bit of background, I built the website in 2007, when I knew a lot less about the web and programming. It runs on a host that I share with a few friends; I don't have root access on that machine, though when the admin is available, I can generally ask him to install stuff for me.

Most parts of the website are built as static HTML files, with Server Side Includes. Parts of those SSIs are Perl CGI scripts. The most popular part though, which allows you to solve Sudoku in the browser and keeps high scores, is written as a collection of Perl scripts, backed by a MySQL database.

When at peak times the site had more than 10k visitors a day, lots of visitors would get a nasty mysql: Cannot connect: Too many open connections error. The admin wasn't available for bumping the connection limit, so I looked for other solutions.

My first action was to check the logs for spammers and crawlers that might have hammered the page, and I found and banned some; but the bulk of the traffic looked completely legitimate, and the problem persisted.

Looking at the seven-year-old code, I realized that most pages didn't actually need a database connection, if only I could remove the session storage from the database. And, in fact, I could. I used CGI::Session, which has pluggable backends. Switching to a file-based session backend was just a matter of changing the connection string and adding a directory for session storage. Luckily the code was clean enough that this only affected a single subroutine. Everything was fine.

For a while.

Then, about a month later, the host ran out of free disk space. Since it is used for other stuff too (like email, and web hosting for other users) it took me a while to make the connection to the file-based session storage. What had happened was 3 million session files on an ext3 file system with a block size of 4 kilobytes. A session is only about 400 bytes, but since a file uses up a multiple of the block size, the session storage amounted to 12 gigabytes of used-up disk space, which was all that was left on that machine.

Deleting those sessions turned out to be a problem; I could only log in as my own user, which doesn't have write access to the session files (which are owned by www-data, the Apache user). The solution was to upload a CGI script that deleted the sessions, but of course that wasn't possible at first, because the disk was full. In the end I had to delete several gigabytes of data from my home directory before I could upload anything again. (Processes running as root were still writing to reserved-to-root portions of the file system, which is why I had to delete so much data before I was able to write again.)
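The arithmetic behind that number can be sketched in a few lines of Python. This is only a rough estimate, assuming every session file occupies whole 4096-byte blocks and ignoring inode and directory overhead; the function name is made up for illustration.

```python
import math


def disk_usage_bytes(file_count, file_size, block_size=4096):
    """Estimate on-disk usage when every file occupies whole blocks."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return file_count * blocks_per_file * block_size


on_disk = disk_usage_bytes(3_000_000, 400)
payload = 3_000_000 * 400
print(f"on disk: {on_disk / 1e9:.1f} GB, payload: {payload / 1e9:.1f} GB")
# prints "on disk: 12.3 GB, payload: 1.2 GB"
```

So roughly nine tenths of the space went to block padding rather than to session data.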

Even when I was able to upload the deletion script, it took quite some time to actually delete the session files; mostly because the directory was too large, and deleting files on ext3 is slow. When the files were gone, the empty session directory still used up 200MB of disk space, because the directory index doesn't shrink on file deletion.

Clearly a better solution to session storage was needed. But first I investigated where all those sessions came from, and banned a few spamming IPs. I also changed the code to only create sessions when somebody logs in, not give every visitor a session from the start.

My next attempt was to write the sessions to an SQLite database. It uses about 400 bytes per session (plus a fixed overhead for the db file itself), so it uses only about a tenth of the storage space that the file-based storage used. The SQLite database has no connection limit, though the old-ish version that was installed on the server doesn't seem to have very fine-grained locking either; within a few days I could see errors that the session database was locked.

So I added another layer of workaround: creating a separate session database per leading IP octet. So now there are up to 255 separate session databases (plus a 256th for all IPv6 addresses; a decision that will have to be revised when IPv6 usage rises). After a few days of operation, it seems that this setup works well enough. But suspicious as I am, I'll continue monitoring both disk usage and errors from Apache.
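The site itself is written in Perl, but the idea of picking a database by leading octet can be sketched in Python. The directory, the file naming scheme and the function name below are assumptions for illustration, not the names used on the real site.

```python
import ipaddress
import os

SESSION_DIR = "/var/tmp/sessions"  # hypothetical location


def session_db_path(client_ip, session_dir=SESSION_DIR):
    """Pick the SQLite file for a client based on its leading IPv4 octet."""
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 4:
        leading_octet = addr.packed[0]  # first byte of the address
        name = f"sessions-{leading_octet}.db"
    else:
        # All IPv6 clients share a single database for now.
        name = "sessions-v6.db"
    return os.path.join(session_dir, name)


print(session_db_path("203.0.113.7"))
print(session_db_path("2001:db8::1"))
```

Spreading the sessions over many files means that a write lock on one database only blocks the clients that happen to share that leading octet.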

So, what happens if this solution fails to work out? I can see basically two approaches: move the site to a server that's fully under my control, and use redis or memcached for session storage; or implement sessions with signed cookies that are stored purely on the client side.
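The signed-cookie variant could look roughly like the following Python sketch, which uses an HMAC over the serialized session data. The secret key, the cookie layout and the function names are placeholders; a real implementation would also want an expiry time, and note that the data is only signed, not encrypted, so the client can read it.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-long-random-secret"  # placeholder value


def _b64(data):
    return base64.urlsafe_b64encode(data).decode("ascii")


def encode_session(data, key=SECRET_KEY):
    """Serialize the session and append an HMAC-SHA256 signature."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    mac = hmac.new(key, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(mac)


def decode_session(cookie, key=SECRET_KEY):
    """Return the session data, or None if the cookie is malformed or forged."""
    try:
        payload_b64, mac_b64 = cookie.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        mac = base64.urlsafe_b64decode(mac_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None
    return json.loads(payload)


cookie = encode_session({"user": "alice"})
print(decode_session(cookie))
```

With this approach the server stores nothing per session, so neither the connection limit nor the disk space problem can come back.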


Mon, 31 Dec 2012

iPod nano 5g on linux -- works!


For Christmas I got an iPod nano (5th generation). Since I use only Linux on my home computers, I searched the Internet for how well it is supported by Linux-based tools. The results looked bleak, but they were mostly from 2009.

Now (December 2012) on my Debian/Wheezy system, it just worked.

The iPod nano 5g presents itself as an ordinary USB storage device, which you can mount without problems. However, simply copying files onto it won't make the iPod show those files in the play lists, because there is some meta data stored on the device that must be updated too.

There are several user-space programs that allow you to import and export music from and to the iPod, and update those meta data files as necessary. The first one I tried, gtkpod 2.1.2, worked fine.

Other user-space programs reputed to work with the iPod are rhythmbox and amarok (both of which not only organize but also play music).

Although I don't think anything really depends on some particular versions here (except that you need a new enough version of gtkpod), here is what I used:

  • Architecture: amd64
  • Linux: 3.2.0-4-amd64 #1 SMP Debian 3.2.35-2
  • Userland: Debian GNU/Linux "Wheezy" (currently "testing")
  • gtkpod: 2.1.2-1


Thu, 23 Aug 2012

Correctness in Computer Programs and Mathematical Proofs


While reading On Proof and Progress in Mathematics by Fields Medal winner Bill Thurston (recently deceased, I was sorry to hear), I came across this gem:

The standard of correctness and completeness necessary to get a computer program to work at all is a couple of orders of magnitude higher than the mathematical community’s standard of valid proofs. Nonetheless, large computer programs, even when they have been very carefully written and very carefully tested, always seem to have bugs.

I noticed that mathematicians are often sloppy about the scope of their symbols. Sometimes they use the same symbol for two different meanings, and you have to guess from context which one is meant.

This kind of sloppiness generally doesn't have an impact on the validity of the ideas that are communicated, as long as it's still understandable to the reader.

I guess one reason is that most mathematical publications still stick to one-letter symbol names, and there aren't that many letters in the alphabets that are generally accepted for usage (Latin, Greek, a few letters from Hebrew). And in the programming world we snort derisively at FORTRAN 77 that limited variable names to a length of 6 characters.


Sun, 26 Jun 2011

Introducing my new project: Quelology organizes books


For about half a year I've been working on a website called quelology, which collects book series and translations.

It is intended to answer questions of the form: I've now read "Harry Potter and the Order of the Phoenix", which is the next book in that series? or What's the name of the French translation of that book?

The website and data mining behind it are written in Perl, and it is based on book meta data by ISFDB, Amazon and WorldCat.

I'm working on importing data from more sources; next up will be the Swedish National Library.

After completing the data mining stage, I'll add an interface that allows the visitor to edit the book, series and translations data, so that users can extend the data body.


Tue, 14 Jun 2011

Why is my /tmp/ directory suddenly only 1MB big?


Today I got a really weird error on my Debian "Squeeze" Linux box -- a process tried to write a temp file, and it complained that there was No space left on device.

The weird thing is, just yesterday my root partition was full, and I had made about 7GB free space in it.

I checked, there was still plenty of room today. But behold:

$ df -h /tmp/
Filesystem            Size  Used Avail Use% Mounted on
overflow              1.0M  632K  392K  62% /tmp

So, suddenly my /tmp/ directory was a ram disc with just 1MB of space. And it didn't show up in /etc/fstab, so I had no idea what caused it.

After googling around a bit, I found the likely reason: as a protection against low disc space, some daemon automatically "shadows" the current /tmp/ dir with a ram disc if the root partition runs out of disc space. Sadly there's no automatic reversion of that process once enough disc space is free again.

To remove the mount, you can say (as root)

umount -l /tmp/

And to permanently disable this feature, use

echo 'MINTMPKB=0' > /etc/default/mountoverflowtmp
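To notice this situation without running df by hand, one could scan the mount table. The following Python sketch assumes a Linux system where /proc/mounts lists the device name in the first column and the mount point in the second, and that the shadowing mount shows up with the device name "overflow" as in the df output above; the function name is made up.

```python
def tmp_is_overflow(mounts_text):
    """Return True if the filesystem currently visible at /tmp is 'overflow'."""
    device_on_tmp = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "/tmp":
            # Later entries shadow earlier ones, so keep the last match.
            device_on_tmp = fields[0]
    return device_on_tmp == "overflow"


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as mounts:
            print("overflow /tmp active:", tmp_is_overflow(mounts.read()))
    except OSError:
        print("could not read /proc/mounts")
```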


Mon, 22 Nov 2010

Harry Potter and the Methods of Rationality


What if Harry Potter had been raised by a loving stepmother? What if his stepfather was a scientist? What happens when somebody tries to analyze magic with scientific methods? What happens if an eleven-year-old boy is too smart for his own good?

A piece of fan fiction, Harry Potter and the Methods of Rationality by "Less Wrong" answers those questions - and makes quite a good read. If you are into fantasy books and science, you might really love it. I did.

But be warned: only read this if you've read all seven Harry Potter books by J. K. Rowling, because the fan fiction piece contains lots of spoilers.

So far 60 chapters of varying length have been published, and just a few more are to be written before the first year ends. I look forward to the final chapters.


Tue, 08 Dec 2009

Keep it stupid, stupid!


How hard is it to build a good search engine? Very hard. Until now I thought that only one company had managed to build a search engine that's not only decent, but good.

Sadly, they seem to have overdone it. Today I searched for tagged dfa. I was looking for a technique used in regex engines. On the front page three out of ten results actually dealt with the subject; the other uses of dfa meant dog friendly area, department of foreign affairs or other unrelated things.

That's neither bad nor unexpected. But I wanted more specific results, so I decided against using the abbreviation, and searched for the full form: tagged deterministic finite automaton. You'd think that would give better results, no?

No. It gave worse. On the first result page only one of the hits actually dealt with the DFAs I was looking for. Actually the first hit contained none of my search terms. None. It just contained a phrase, which is also sometimes abbreviated dfa.

WTF? Google seemed to have internally converted my query into an ambiguous, abbreviated form, and then used that to find matches, without filtering. So it attempted to be very smart, and came out very stupid.

I doubt that any Google engineer is ever going to read this rant. But if one is: Please, Google, keep it stupid, stupid.

I'm fine with getting automatic suggestions on how to improve my search query; but please don't automatically "improve" it for me. I want to find what I search for. I'm not interested in dog friendly areas.


Sat, 05 Dec 2009

Doubt and Confidence


<meta>From my useless musings series.</meta>

As a programmer you have to have confidence in your skills, to some extent, and at the same time you have to constantly doubt them. Weird, eh?


You need some level of confidence to do anything efficiently. Planning ahead requires confidence that you can achieve the steps on your way.

As a programmer you also need some confidence with the language, libraries and other tools you're using.

If you program for money, you also have to assess what kind of programs you can write, and where you might have problems.


In the process of programming you make a lot of assumptions, some of them explicit, some of them implicit. If you want to write a good program, it's essential that you are aware of as many assumptions as possible.

When you find a bug in your program, you have to challenge previous assumptions, and that's where doubt comes in. You not only suspect, but you know that at least one of the assumptions was false (or maybe just a bit too specific), and you know that you did something wrong.

Sometimes programmers make really stupid mistakes which are rather tricky to track down. That's when you have to question your own sanity.

One example (that luckily doesn't happen all that often to me) is when I edit my program, and nothing seems to change. Nothing at all. Depending on the setup it might be some cache, but sometimes it is even more devious - for example I didn't notice that the console where I edit and the console where I test are on different hosts - and thus the edits actually have no effect at all.

After having done such a thing once or twice I adopted the habit of just adding a die('BOOM'); instruction to my code, to verify that the part I'm looking at is actually run.

These are moments when I question my own sanity, thinking "how could I have possibly done such a stupid thing?". Doubt.

The same phenomenon applies when doing scientific research: since you usually do things that nobody has done before (or at least nobody has published about them yet), you can't know the results beforehand -- if you could, your research would be rather boring. So you have no external reference for verification, only your intuition and discussion with peers.


Sat, 10 Oct 2009

Fun and No-Fun with SVG


Lately I've been playing a lot with SVG, and all in all I greatly enjoyed it. I wrote some Perl 6 programs that generate graphical output, and since Perl 6 is a new programming language, it doesn't have many bindings to graphic libraries. Simply emitting a text description of a graphic and then viewing it in the browser is a nice and simple way out.

I also enjoy getting visual feedback from my programs. I'd even more enjoy it if the feedback was more consistent.

I generally test my svg images with three different viewers: Firefox 3.0.6-3, inkscape (or inkview) 0.46-2.lenny2 and Opera 10.00.4585.gcc4.qt3. Often they produce two or more different renderings of the same SVG file.

Consider this seemingly simple SVG file:

    <svg xmlns="http://www.w3.org/2000/svg"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         style="background-color: white">
        <defs>
            <path id="curve"
                  d="M 20 100
                     C 60 30
                     320 30
                     380 100"
                  style="fill:none; stroke:black; stroke-width: 2" />
        </defs>
        <text font-size="40" textLength="390">
            <use xlink:href="#curve" />
            <textPath xlink:href="#curve">SPRIXEL</textPath>
        </text>
    </svg>

If your browser supports SVG, you can view it directly here.

This SVG file first defines a path, and then references it twice: once a text is placed on the path, the second time it is simply referenced and given some styling information.

Rendered by Firefox:

rendered by firefox

Rendered by Inkview:

rendered by inkview

Rendered by Opera:

rendered by opera

Three renderers, three outputs. Neither Firefox nor Inkview supports the textLength attribute, which is a real pity, because it's the only way you can make a program emit SVG files where text is guaranteed not to overlap.

If you scale text in Inkscape and then put it onto a path, the scaling is lost. I found no way to reproduce Opera's output with Inkscape without resorting to really evil trickery (like decomposing the text into paths, and then cutting the letters apart and placing them manually). (Equally useful is the dominant-baseline attribute, which Inkscape doesn't support either.)

The second difference is that only Firefox shows the shape of the path. Firefox is correct here. The SVG specification clearly states about the use element:

For user agents that support Styling with CSS, the conceptual deep cloning of the referenced element into a non-exposed DOM tree also copies any property values resulting from the CSS cascade [CSS2-CASCADE] on the referenced element and its contents. CSS2 selectors can be applied to the original (i.e., referenced) elements because they are part of the formal document structure. CSS2 selectors cannot be applied to the (conceptually) cloned DOM tree because its contents are not part of the formal document structure.

Sadly it seems to be a coincidence that Firefox works correctly here. If the styling information is moved from the path to the use element the curve is still displayed - even though it should not be.

Using SVG feels like writing HTML and CSS for 15-year-old browsers, which had their very own, idiosyncratic idea of how to render what, and what to support and what not.

Just like with HTML I have high hopes that the overall state will improve; indeed I've been told that Firefox 3.5 now supports the textLength attribute. I'd also love to see wide-spread support for SVG animations, which could replace some inaccessible Flash applications.


Tue, 04 Aug 2009

Goodbye Iron Man


<update> (from 2009-08-23) It turned out that my disappearance from the ironman blog feed was due to a broken RSS feed. Matt S. Trout tried to inform me by blog comment, but my blog marked it as spam and swallowed it.

So now we talked on IRC, clarified things, and I'm back in the game. </update>

So I accepted the Iron Man blogging challenge a few months ago. And last week I discovered that my blog was gone from their feed. For the second time. Without any notification.

rusty iron man

Image: rusty iron man, by courtesy of artvixn, available under a Creative Commons non-commercial by-attribution license.

The first time they had a good reason: the date tags in my RSS feed were goofed; still I'd have thought it would be nice to at least notify me of such a removal. After some mails back and forth I was able to fix it; after the second removal without any notification I'm simply fed up and don't want to invest any more energy into this.

Still I'll continue to follow the collected RSS feed; there are still many interesting blogs to be read there.


Tue, 23 Jun 2009

Iron Man Challenge - Am I a Stone Man?


Gabor asked what I'm missing from the Iron Man blogging challenge. Gabor focused on the contents of the blog posts; I'll talk about the challenge itself.

I'm missing the things announced on their website: a way to find out to which level you made it, a monthly selection of best blog posts, and all these other things that were designed to create some competition, and more fun.

Don't get me wrong, I like to read the blogs of my fellow Perl programmers, and it motivates me to write more often myself. But that's not all that was promised to us.

One thing I'd like to add about the content, though: So far most of what I read was very good and informative, but it was all text. I know it's not easy to find nice on-topic programming pictures, and doesn't even allow the inclusion of pictures in posts, and I don't do it often myself, but having more pictures or charts would be nice.


Mon, 01 Jun 2009

Why Design By Contract Does Not Replace a Test Suite


"Design By Contract" (DBC) usually refers both to very sophisticated assertion systems (for example in which assertions are inherited along with the methods to which they belong), and to the practice of using such assertions extensively, not only for quality assurance but also as a form of documentation.

When I was mostly programming in Eiffel some years ago, I liked DBC very much, and I still think that it's a very good idea, and that more programming languages should offer good support for it.

However there's one comment that I've seen frequently on the web, in blogs and on IRC. Often DBC evangelists say something along these lines: "We have DBC, we don't need a test suite". I find such comments incredibly stupid, and here I want to write down why.

Code needs to run

If you want to verify that the code does what you want, you have to actually run it - otherwise the assertions won't be triggered, and are worthless as a verification tool.

It isn't enough to just run it; you should, when possible, cover every code path - just as you would when writing tests. Doing that manually requires much work, so you still need a test suite that you can run to verify that some changes didn't break anything.

Examples are easy, general rules are hard

Test cases are just example input, paired with the expected output. Usually it's pretty easy to come up with examples, so writing tests is also easy, even for corner cases.

On the other hand assertions are rules that have to hold for all possible input data, so to formulate them, you have to consider the general case - that's usually rather hard, so the lazy programmer leaves out the hard cases.

A simple example: suppose you've written a subroutine that adds two numbers (for example for a bignum library). Writing assertions for the general case of addition is quite hard if you can't trust your subtraction routine; so the only thing you can really do is to check the signs (positive number plus positive number is positive etc.), but that won't catch any off-by-one errors.

So you should also write tests; tests like add(3, 4) == 7 are trivial to come up with, and catch potential errors.
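To make this concrete, here is a small Python sketch with a deliberately planted off-by-one bug. The plain assert statement merely stands in for a real DBC postcondition (Python has no built-in contract system), and the buggy add function is invented for illustration.

```python
def add(a, b):
    """Deliberately buggy addition with a sign-based postcondition."""
    result = a + b + 1  # off-by-one bug, planted for illustration
    # Contract-style check: two positive inputs must give a positive result.
    if a > 0 and b > 0:
        assert result > 0, "postcondition violated"
    return result


# The postcondition holds even though the answer is wrong ...
value = add(3, 4)
# ... while the example-based test catches the bug immediately.
print("test passes:", value == 7)  # prints "test passes: False"
```

The sign check is a true statement about addition, but it is far too weak to pin down the result; the single example is much stricter for that one input.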


Design by Contract and testing should go hand in hand so that the tests exercise as many code paths as possible, and should cover those areas that are hard to validate with assertions.

DBC should not be viewed as a replacement for tests.


Thu, 18 Dec 2008

My Diploma Thesis: Spin Transport in Mesoscopic Systems


Sometimes people ask me what I'm doing right now, and I tell them "I'm writing my diploma thesis on mesoscopic spin transport", and they know just as much as before. So here I want to explain what that means.

Mesoscopic systems

A mesoscopic system is one that is larger than a few nanometers, but still small enough that you have to care about quantum effects.

That's not a very precise definition, so I'll try again: Consider a metallic wire. For macroscopic systems (i.e. the ones that we are used to in day-to-day life) you might know that the electrical resistance of such a wire increases linearly as you increase its length, and decreases linearly if you increase its cross section.

This is very intuitive, because electrical resistance describes how hard it is for an electron to travel through our wire. If the wire is longer, the electron encounters more obstacles, so the resistance is higher. If the wire has a larger cross section, it's easier for the electron to find a way that's not blocked, so the resistance is smaller. That's called Ohm's law.

These relations aren't true anymore for rather small systems. If you have a very thin wire, say 20 nanometers, and increase its diameter by another nanometer, the resistance might not change at all. Then you increase its diameter by another nanometer, and the resistance suddenly jumps down by a few percent.
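The macroscopic scaling described above is usually written as R = rho * L / A, where rho is the resistivity of the material. A minimal Python sketch, assuming a copper wire with a room-temperature resistivity of about 1.68e-8 ohm metres (the function name and the wire dimensions are chosen for illustration):

```python
def wire_resistance(resistivity, length, cross_section):
    """Resistance in ohms of a uniform wire: R = rho * L / A (SI units)."""
    return resistivity * length / cross_section


COPPER = 1.68e-8  # ohm metre, approximate value at room temperature

base = wire_resistance(COPPER, 1.0, 1e-6)      # 1 m long, 1 mm^2 cross section
longer = wire_resistance(COPPER, 2.0, 1e-6)    # twice the length
thicker = wire_resistance(COPPER, 1.0, 2e-6)   # twice the cross section

print(f"base: {base:.4f} ohm")
print(f"twice as long: factor {longer / base:.1f}")
print(f"twice as thick: factor {thicker / base:.1f}")
```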

All these systems that are too small for Ohm's law to apply are called mesoscopic. All mesoscopic effects have to be explained with quantum physics, at least at some point.

Electron Spin

Electrons have something called spin. Everybody knows that an electron has a charge, and it acts as if it rotated around its own axis very fast. So it looks like a current which runs in a circle, and that creates a small magnetic field.

If you try to measure the magnetic field of one electron, you will only ever get two possible values, which we call spin up and spin down.

Spin Optics

In a semiconductor, one can split up a beam of electrons into two beams of spin-up and spin-down electrons, just like in optics with polarized light. That splitting can be influenced by an external voltage, like a classical transistor.

The topic of my diploma thesis is to figure out how such spin polarized electron beams behave in certain semiconductor systems.
