I'm Lex Spoon, a software engineer working in Atlanta, Georgia. I'm a coauthor of <a href="http://www.artima.com/shop/programming_in_scala">Programming in Scala</a>.

<h1>The case for IPv4.1 (2024-03-02)</h1>
In 2003, Dan Bernstein wrote about <a href="https://cr.yp.to/djbdns/ipv6mess.html">the IPv6 mess</a>,
focusing primarily on the lack of a viable migration plan.
Avery Pennarun <a href="https://apenwarr.ca/log/20110328">wrote again about the problem</a> in 2011,
also with a focus on the lack of a viable migration plan.
Now it's been another decade, for a total of around 30 years on this project.
<p>With the benefit of 20/20 hindsight, we now know that the IPv6 plan was never a
very good one. If IPv6 had been accurately estimated to take 30+ years of migration,
it would not have achieved the buy-in that it did, and most likely the designers would
have kept working toward a better plan. The individuals involved are all top of class from what I can tell,
but sometimes good people produce bad results, especially in a group context.
I'm reminded, in all of this, of the Ada programming language, with the unusual way it
was designed. I'm even more reminded of a certain saying about committees:
"none of us is as dumb as all of us".
<p>Big migrations do not always go this way. The World Wide Web only needed nine years, if we count
from <a href="https://www.w3.org/History/1989/proposal.html">Tim Berners-Lee's memo</a>
to <a href="https://en.wikipedia.org/wiki/History_of_Google">the launch of the Google search engine</a>.
Under the right conditions, the whole Internet can upgrade in less than ten years.
<p>The difference in these projects is easy to understand. Avery describes it as follows:
<blockquote>
In short, any IPv6 transition plan involves everyone having an IPv4 address, right up until everyone has an IPv6 address,
at which point we can start dropping IPv4, which means IPv6 will start being useful. This is a classic chicken-and-egg problem,
and it's unsolvable by brute force; it needs some kind of non-obvious insight. djb apparently hadn't seen any such insight by 2002,
and I haven't seen much new since then.
</blockquote>
<p>Here in 2024, it's not really too late to start on a backup plan. It's anyone's
guess whether a more modest IPv4.1 proposal would finish faster than IPv6, but
we can all be sure of one thing. Our chances are better if we have two dogs
in the race.
<h1>How to do it</h1>
<p>Here's the general idea of IPv4.1, just to show what I mean.
It's been posted about before, so consider this to be my explanation
rather than anything fundamentally new.
<p>The core change for IPv4.1 is that the IP packet header needs room for a larger address. The packet format can be <a href="https://datatracker.ietf.org/doc/html/rfc1710#section-4.1">the one from the SIPP proposal</a>, but with version "7" instead of "6", since 6 is now taken. When an address is written down, for example in a router's configuration UI, these 8-byte addresses can be written the same way we are used to, just with eight dot-separated fields instead of four. Existing addresses can be zero-padded on the left, e.g. 0.0.0.0.127.0.0.1.
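<p>To make the written form concrete, here is a minimal sketch in Scala of rendering these addresses. The object and method names are mine, purely for illustration; nothing here is part of any actual proposal.
<pre>
// Sketch only: hypothetical helpers for IPv4.1-style 8-byte addresses.
object Addr41 {
  // Zero-pad an existing 4-byte IPv4 address on the left.
  def fromIPv4(bytes: Array[Byte]): Array[Byte] = {
    require(bytes.length == 4, "IPv4 addresses are 4 bytes")
    Array[Byte](0, 0, 0, 0) ++ bytes
  }

  // Render an 8-byte address with eight dot-separated decimal fields.
  def render(bytes: Array[Byte]): String = {
    require(bytes.length == 8, "IPv4.1 addresses are 8 bytes")
    bytes.map(b => (b & 0xff).toString).mkString(".")
  }
}

// Addr41.render(Addr41.fromIPv4(Array[Byte](127, 0, 0, 1)))
// yields "0.0.0.0.127.0.0.1"
</pre>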
<p>Whenever two machines talk to each other, they use the existing IPv4 packet format if they
both have 4-byte addresses. If either of them has an 8-byte address, then they switch to the IPv4.1
format. In this case, though, the packets still mean the same thing they always did!
Part of the key to achieving a large migration is not to saddle it with additional migrations.
<p>8-byte addresses should be sufficient for this problem. The reason that IPv6 addresses
use 16 bytes is that IPv6 plans for the machines on people's private internal networks to
avoid reusing an address from anyone else's private network. This property
will not be pursued for IPv4.1, and it would not do anything useful if it were, because
per the previous paragraph, all the packets in IPv4.1 are supposed to mean the same thing
they did in IPv4.
<h1>Rollout</h1>
<p>During the initial rollout of IPv4.1, many machines will not yet be able to talk to each other
with the extended addresses. This will be much like the case with IPv6 right now.
<p>From there, it will be a race.
No one can say which contender will finish faster, but IPv4.1 has some big advantages.
It doesn't take any configuration file changes, and it doesn't require designing IP blocks.
It definitely doesn't require thinking about link-local addresses, updates to how DHCP works,
or any other fundamental aspects of networking.
All that it takes for IPv4.1 to win is for software providers to add it into their latest version,
and for the major nodes on the Internet to upgrade their software some time over the next 5-10 years.
Major Internet nodes already have to upgrade their software once in a while,
due to security and compliance requirements.
<p>We don't have to know for sure that IPv4.1 will win the race
before we can make a decision about trying. We can simply observe
that the problem is important,
and we can observe that our chances are better with two ways to win.

<h1>Task-based build systems (2023-09-01)</h1>
Today I ran into a great book chapter by Erik Kuefler and Lisa Carey about <a href="https://abseil.io/resources/swe-book/html/ch18.html">Google's build system philosophy</a>. It's all very good, but one part that stood out to me is the distinction between a <em>task-based build system</em> and an <em>artifact-based build system</em>. Ant is a task-based build system and lets you define a task in terms of other tasks. Google's build system, partially open sourced as the "Bazel" project, is based on artifacts.
An important part of an artifact-based system, they explain, is that it honors <a href="https://blog.lexspoon.org/2012/12/recursive-maven-considered-harmful.html">build cache semantics</a>. Specifically, they write:
<blockquote>
For a remote caching system to work, the build system must guarantee that builds are completely reproducible. That is, for any build target, it must be possible to determine the set of inputs to that target such that the same set of inputs will produce exactly the same output on any machine. This is the only way to ensure that the results of downloading an artifact are the same as the results of building it oneself. Fortunately, Bazel provides this guarantee and so supports remote caching. Note that this requires that each artifact in the cache be keyed on both its target and a hash of its inputs—that way, different engineers could make different modifications to the same target at the same time, and the remote cache would store all of the resulting artifacts and serve them appropriately without conflict.
</blockquote>
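<p>To make that concrete, here is a minimal sketch in Scala of how such a cache key could be formed. The names and the exact hashing scheme are my own illustration, not Bazel's actual implementation.
<pre>
import java.security.MessageDigest

// Sketch only: a cache key derived from a build target plus a hash
// of that target's inputs, as the quote above describes.
object CacheKey {
  private def sha256(data: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(data)
      .map("%02x".format(_)).mkString

  // `inputs` maps each input path to a hash of that input's content.
  // Sorting makes the key independent of map iteration order.
  def forTarget(target: String, inputs: Map[String, String]): String = {
    val combined = inputs.toSeq.sorted
      .map { case (path, hash) => s"$path=$hash" }
      .mkString("\n")
    sha256((target + "\n" + combined).getBytes("UTF-8"))
  }
}
</pre>
<p>Two engineers who modify the same target differently produce different input hashes, so their artifacts land under different keys and never conflict.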
<h1>Why the M4 macro syntax goes so wrong (2023-06-04)</h1>
I've been hacking in M4 recently, and I ran across <a href="https://mbreen.com/m4.htm">a great article about M4</a> by Michael Breen. I highly recommend it for anyone either using M4 (e.g. for Autotools or for Bison), or considering M4 for a new project (probably a bad idea). This observation caught my eye:
<blockquote>
While m4's textual rescanning approach is conceptually elegant, it can be confusing in practice and demands careful attention to layers of nested quotes.
</blockquote>
Christman Brown writes something similar, in <a href="https://chrisman.github.io/9.html">his review of M4</a>:
<blockquote>
It is difficult to debug. I quickly found my default behavior when encountering something unexpected is to just increase the escape quotes around something. That often fixed it.
</blockquote>
<p>What's going on here? A macro language is supposed to let you write normal text and then sprinkle some macro expansions here and there. It's supposed to save you from the tedium of dealing with a full-fledged general purpose programming language. Also, sometimes this strategy works out. With the C preprocessor, you write ordinary C code most of the time, and it works totally fine to occasionally call a macro. Why does this approach work in C but not in M4?</p>
<p>I think Michael Breen is onto something. Macros are a form of subroutine, and with a well designed syntax for subroutine call, you want it to be straightforward and thoughtless to invoke a subroutine and pass it some arguments. You want the arguments to feel just like text in any other part of your system. Think how it feels to write HTML and to put a div around part of your page. You don't have to do any special encoding to the stuff you put inside the div; you can, any time you want, take any chunk of your code with balanced tags and put that inside another pair of tags. With M4, the basic facility of a subroutine call, which you use all over the place, is somehow tricky to use.</p>
<p>M4 is not wrong to have a quoting mechanism, but where it goes wrong is to require quoting on the majority of subroutine calls. Here's what that looks like in practice. M4 uses function-call syntax to invoke macros, so a call looks like <code>foo(a, b, c)</code>. That's a great syntax to try, because function calls are a ubiquitous syntax that users will recognize, but it has a problem for a text macro language in that the comma is already a common character to use in the a, b, and c arguments. Having observed that, M4 could and should have moved away from the function call notation and looked for something else. Instead, the designers stuck with the function calls and augmented them with an additional kind of syntax, quotation. In practice, you usually quote all of your arguments when you pass them to an M4 macro, like this: <code>foo([a], [b], [c])</code>. Only usually, however, and you have to think about it every time. The quotation and the macro call are two different kinds of syntax, and the user has to control them individually.</p>
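<p>To see the problem in miniature, here is a short GNU M4 example, using M4's default quote characters (backquote and apostrophe) rather than the square brackets above. The expansion is shown after each call. An unquoted comma inside an argument silently shifts all the following arguments over by one.
<pre>
define(`greet', `Hello $1 and $2')

greet(Alice, Bob)            expands to: Hello Alice and Bob
greet(Smith, Alice, Bob)     expands to: Hello Smith and Alice
greet(`Smith, Alice', Bob)   expands to: Hello Smith, Alice and Bob
</pre>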
<p>The reason it works out better for C's preprocessor is that the C language already has a way to quote any commas that occur in the arguments, and the preprocessor understands those quoting mechanisms. For example, with <code>sqrt(foo(x, y))</code>, the preprocessor understands that the comma inside the <code>(x, y)</code> part should not count as separating the parameters to <code>sqrt</code>. Programmers already write those parentheses without thinking about it, because the function-call notation for <code>foo(x, y)</code> already requires parentheses. Unfortunately, C does not do the right thing for an example like <code>swap(a[i,j], a[i,j+1])</code>, because it does not treat square brackets the way it treats parentheses. It could and it should, however. None of this maps over to M4 very well, because usually the arguments to an M4 macro are not code, and so the author isn't going to naturally escape any commas that occur. The function-call syntax just doesn't work well for the situation M4 is intended to be used in.</p>
<h2>Fixing the local problem</h2>
<p>If we wanted to write a next-generation M4, here are some observations to start from:</p>
<ul>
<li>It is better if the quoting syntax is built into the subroutine call syntax. That way, users don't have to independently reason about both calls and quotes, and instead can just think about call-and-quote as a single thing that they do.</li>
<li>Look to markup languages for inspiration, for example XML or Latex. The general feel we are going for is that you mostly write text, and then occasionally you sprinkle in some macro calls. That's a markup language!</li>
</ul>
<p>Based on these, a good thing to try for an M4-like system would be to use XML tags for macro invocation. XML is a culmination of a line of tag-based markup languages starting with SGML and HTML, and it is generally state of the art for that kind of language. Among other advantages, XML is a small, minimal language that you can learn quickly, and it has explicit syntax for self-closing tags rather than some tags being self-closing and others not, in a context-dependent way depending on the schema that is currently in effect for a given file.</p>
<p>Latex's macro syntax is also very interesting, and it has a big advantage in usually saying each tag name just once (<code>\section{foo}</code>) rather than twice (<code><section>foo</section></code>). However, my experience with Latex is that I am in constant doubt how much lookahead text a macro invocation will really look at; the braces-based syntax is just a convention, and you never know for sure which macros really look at those conventions or not. That said, the general syntax looks like a promising idea to me if it were locked down a little more rather than being based on the Tex macro language. A similar approach was used in <a href="http://www.bitsavers.org/pdf/cmu/scribe/Scribe_Introductory_Users_Manual_Jul78.pdf">Scribe</a>, a markup language designed by Brian Reid in the 70s.</p>
<h2>What to use for now</h2>
<p>As things stand, I don't think M4 really has a sweet spot. Old projects that want to have an ongoing roadmap should probably move away from M4. New projects should never use it to begin with. What are the options right now, without having to build a new textual macro language?</p>
<p>It's not a bad option to use a general-purpose language like Python or Java. If you follow the links from <a href="https://github.com/CDSoft/pp">the PP generic preprocessor</a> that is used in Pandoc, they tell you they are replacing their templating by more and more usage of Lua, a general purpose language. When you use a general-purpose language to generate text, you can use the normal library routines your language already supports, plus a mature library of lists, maps, structs, and iteration routines on top of them. An example of this direction is <a href="https://www.jooq.org/">the Jooq library</a> for generating SQL code.</p>
<p>Another strong approach is to use XML, possibly augmented by XSLT. An example would be <a href="https://codeql.github.com/docs/writing-codeql-queries/query-help-files/">the query help format of the GitHub Code Scanner</a>, a format that I designed many years ago at a startup called Semmle. We had an existing syntax based on HTML with regex-based rewrites applied to the HTML file, and we had a problem that people were typo-ing the syntax without realizing it, resulting in help files that were sometimes unreadable and were often missing a large portion of the text. I explored a few options for getting us a more rigorous format with tool support for common errors, and I landed on XML, which I feel like worked out pretty well. In addition to the format itself being nice to work with, we got to tap into the existing XML ecosystem, for example to use Eclipse's excellent XML editor.</p>
<p>I briefly explored JSON as well, which is another minimal syntax that is easy to learn, but I quickly realized why they call XML a markup language. Unlike with JSON, XML lets you mostly write normal text, and then as a secondary thing, add special syntax--hence, "marking up" your text. XML is also a very mature system in general, so for example we could configure Eclipse (which was a viable tool back then!) to auto-complete tags and give you errors within the editor if you used tags that aren't allowed. If I were to rewrite how Bison's skeletons work, I think something based on XML would be tempting to try. Bison already uses this approach <a href="http://git.savannah.gnu.org/cgit/bison.git/tree/data/xslt">for its debug output</a>. I'm not sure, though; XSLT looks pretty voluminous in practice.</p>
<p>Some of the best options are embedded in an existing general-purpose language. <a href="https://react.dev/learn/writing-markup-with-jsx">JSX</a> is embedded in JavaScript, and <a href="https://www.ccs.neu.edu/home/shivers/papers/scheme02/article/scribe.html">Scribe</a> is embedded in Scheme. I'm not sure how practical these are if you aren't already working in those environments, but if you are, look for one that works with your current source language.</p>
<h2>The larger lesson</h2>
<p>An effective tool has to be evaluated in the context it will be used in. Both C and M4 use function-call notation for macro invocation, but in C it works well, while with M4 it becomes a nightmare. Achieving an effective tool, therefore, requires design thinking. You need to learn about the situation you are targeting, you need to adjust your solution based on that, and above all you need to be ready to critique your solutions and iterate on improvements. The critique can take many forms, and a really important one is to watch how your users are doing and to really reflect on why it's happening and how it could be different.</p>
<p>Surprises can happen anywhere, and you'll support your users more if you can act on those surprises and try something different.</p>
<h1>A Sampling Trick (2023-05-09)</h1>
Here's a sampling trick that comes up all the time in industry,
but that took me a while to work out. The problem goes like this:
<blockquote>
Suppose a server is receiving a sequence of events, for any kind of event you might be
interested in, and you want to save off a sample of the events for later analysis.
How do you select which events to sample?
</blockquote>
<p>Let me show some solutions that I see people try, and then describe
a way that seems to beat all those out.
<h1>Solutions that don't work well</h1>
<p><strong>Choose a fixed percentage.</strong>
One approach is to choose a percentage, say 5%, and to log
that percentage of incoming events. For each event that arrives, you check a
random number generator against your percentage, and independently decide for
each event whether to log a sample or not.
This approach is pretty good except that it's often hard to
choose a percentage ahead of time. If you get thousands of events per second, then 5%
is going to be way too many. If you get one event per minute, then 5% is going to be way
too few. In many cases, you are going to use the same sampling code to run against multiple
streams of events, for example to sample access to multiple databases, or to sample requests
to multiple RPCs on the same server.
As another example, you may run the code in both staging and production environments,
and you'll need a different percentage for each environment.
<p><strong>Look at historical data rates.</strong>
Another way is to use a percentage, but to dynamically adapt the percentage
based on historical traffic rates.
This way works in theory, but in my experience, systems like this tend to be unstable.
For example, the system will dial in a low percentage based on overnight traffic, or because
the system paused due to a garbage collection or a spate of disk I/O on the underlying machine.
Then a burst of activity will happen, and the system will keep using its stale percentage
for a while, thus sampling way too much data.
By the time the percentage updates, the burst of activity has ended, and the server is left
sampling at a very low percentage and throwing nearly everything away.
In general, I can't say that this method can't work, but in a world where software
always has bugs and surprises, it's much better to use algorithms that are more clearly
stable and predictable.
<p><strong>Use time-based sampling.</strong>
Instead of sampling a percentage of traffic, the system can select one event for each
window of time, for example one event every 10 seconds.
This method adapts to different rates of incoming events, including
bursty traffic where the rate isn't steady.
Now there's a new problem, though: the events are not randomly selected any more!
This method will work fine if the server is receiving a homogeneous series of events
that are all evenly spaced, but that may or may not be the case.
A lot of times, there is some external process, perhaps generated by an end user,
where one external event will create a clump of events that arrive at your server at
almost the same time as each other.
If you select one event every 10 seconds, you will tend to sample events that are
at the beginning of the clumps, thus making some of the events more likely to
be sampled than others.
<h1>A dumb way that works</h1>
<p>Here's an approach you could follow that will reliably work, but at the expense of
a lot of memory usage.
What you could do is collect all of the events for a 10-second window
into an in-memory buffer.
Each time a new event arrives, you don't think twice, you simply
add it to the buffer to be looked at later on.
Once the ten-second window expires, you look through
the buffer and randomly select one of the events in the buffer.
Then you reset the buffer and start again for the next 10-second window.
<p>This approach meets all of the constraints, but at the cost of a lot of memory.
It's close to what we want, except for the memory problem.
Is there a way to do this, but with a smaller buffer?
It would need to be some kind of algorithm that discards events as it goes,
rather than waiting to the end of the 10-second window to do a whole lot
of discarding all at once.
<h1>The sampling trick</h1>
<p>Here's a way that works. Keep the following two pieces of information in memory.
<ul>
<li>The number of events that have arrived, in the current 10-second window.
<li>One single event from the window. This is the event that will be sampled
for sure if no other events arrive.
</ul>
<p>Given this setup, here's what you do when a new event arrives.
<ol>
<li>Increase the count of events by one.
<li>Select a random number from 1 to the new, updated count of events.
<li>If the random number is 1, then replace the event in the buffer
by the new event; the new event is the one you need, and you know
for sure you don't need the old event.
<li>If the random number is anything other than 1, then discard the new event;
you're done with that event for good and aren't going to use it.
</ol>
<p>When the 10-second window elapses, emit your current sample and then
reset the buffer to its original state: 0 events that have arrived, and
null as the currently stored event.
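<p>Here is what the trick can look like in code, as a minimal sketch in Scala. The class and method names are mine; a production version would also need thread safety and a timer to close out each window. (Readers who know the literature will recognize this as reservoir sampling with a reservoir of size one.)
<pre>
import scala.util.Random

// Sketch only: keep one uniformly selected event per window.
class WindowSampler[E] {
  private var count = 0                  // events seen in this window
  private var sample: Option[E] = None   // the event currently held

  def onEvent(event: E): Unit = {
    count += 1
    // nextInt(count) is uniform over 0 until count, so this branch is
    // taken with probability exactly 1/count. The first event of a
    // window is therefore always kept.
    if (Random.nextInt(count) == 0)
      sample = Some(event)
  }

  // Call when the 10-second window elapses: emit the sample and reset.
  def endWindow(): Option[E] = {
    val emitted = sample
    count = 0
    sample = None
    emitted
  }
}
</pre>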
<h1>An example walk-through</h1>
<p>To see why this works, let's start with a scenario in which exactly three events arrive
during the window. Here's what happens.
<p><strong>Event 1 arrives.</strong>
When the first event arrives, increase the count to 1.
Roll a number from 1 to 1, which will always result in 1, so
store the first event into the buffer.
Here's what is in the buffer now:
<ul>
<li>Event 1, for sure.
</ul>
<p><strong>Event 2 arrives.</strong>
When the second event arrives, increase the count to 2.
Roll a number from 1 to 2. This roll has a probability of 1/2
of being a 1, in which case event 2 will replace the first one.
At this point, there is a 1/2 probability that event 2 took over the buffer,
and a 1/2 probability that event 1 is still in there.
<ul>
<li>Event 1, at probability 1/2.
<li>Event 2, at probability 1/2.
</ul>
<p><strong>Event 3 arrives.</strong>
Increase the count to 3, and roll a number from 1 to 3.
This roll has a probability of 1/3 of being a 1 and causing
event 3 to take over the buffer.
At probability 2/3, some other number is rolled, and the
existing event is brought forward.
Therefore, the probability for event 1
to still be in the buffer is the 1/2 chance that the
event already had, times the 2/3 chance that it survives this
latest roll of the die.
For event 2, the same reasoning applies as for event 1.
<ul>
<li>Event 1, at probability 1/2 * 2/3 = 1/3.
<li>Event 2, at probability 1/2 * 2/3 = 1/3.
<li>Event 3, at probability 1/3.
</ul>
<h1>Proof of the general case</h1>
<b>Property.</b> When using The Sampling Trick described above,
after the arrival of N events,
each event has 1/N probability of being the one in the sample buffer.
<p><b>The world's smallest proof by induction.</b>
In the base case, suppose N=1. Then there is just one event,
and it is always selected.
The one event has probability 1/1, which is what we wanted
to prove.
<p>For the inductive case, assume the property is correct for N-1, and
prove it for N.
<p>For the Nth event, it was the last event that arrived, and the rules
of the algorithm are that it replaces the buffered event at probability 1/N.
So the property holds for the Nth event.
<p>For any other event, we know by the inductive assumption that it has a 1/(N-1)
chance of being in the buffer when the Nth event arrives.
The Nth event will replace the buffered event at probability 1/N, so the probability of
the existing event being left alone is (N-1)/N.
Therefore, the probability that the prior event is both selected from the first N-1 rounds, and also
stays in the buffer after the Nth round, is 1/(N-1) * (N-1)/N, which is 1/N.
<p>Q.E.D.
<h1>How do you use this in practice?</h1>
<p>Sampling is helpful for always-on computer systems that can never have any
down time. For a system like that, you need to not only make the server internally
robust to every kind of input it might receive, but also keep it externally compatible
with all the other software around it in the current production ecosystem, and a sample
of real production events is one of the best ways to see what that ecosystem is actually sending you.
<p>There are a lot of tricks and know-how for how to use the sample data once you have
it, but the first step is to capture the data at all.
<h1>Supporting C in a build system (2016-05-04)</h1>
In a previous post, I showed that the <a href="http://blog.lexspoon.org/2016/05/standard-build-rules-for-c-are.html">standard build rules for C code are unreliable</a>. Let me describe two ways to do better.
<p>In the interest of brevity, I will describe the build rules using a toy build engine called Blcache (short for "build cache"). I initially tried to write this post using standard tools like Make, Ninja, Bazel, and Nix, but they all have one limitation or another that distracts from the main point of this post.
<p><strong>Example problem</strong>
<p>Here is an example problem this post will work against. There is a <code>test.c</code> file, and it includes two separate header files:
<pre>
<b>// File test.c</b>
#include <stdio.h>
#include "syslimits.h"
#include "messages.h"
int main() {
printf("%s: %d\n", MSG_LIMIT_THREADS, LIMIT_THREADS);
}
<b>// File localhdrs/messages.h</b>
#define MSG_LIMIT_THREADS "Limit on threads:"
<b>// File hdrs/syslimits.h</b>
#define LIMIT_THREADS 10
</pre>
<p>After compiling this code, a third header file is added as follows:
<pre>
<b>// File localhdrs/syslimits.h</b>
#define LIMIT_THREADS 500
</pre>
<p>The challenge is for the addition of this header file to trigger a rebuild, while still making the build as incremental as possible.
<p><strong>Method 1: Compile entire components</strong>
<p>The simplest way to get correct C compiles is to compile entire components at a time, rather than to set up build rules to compile individual C files. <a href="http://blog.lexspoon.org/2011/09/teach-your-build-tool-jars-not-classes.html">I've posted before on this strategy for Java</a>, and it applies equally well to C.
<p>This approach seems to be unusual, but my current feel is that it should work well in practice. It seems to me that when you are actively working on a given component, you should almost always use an IDE or other specialized tool for compiling that given component. The build system should therefore not concern itself with fine-grained incremental rebuilds of individual C files. Rebuilds of whole components--executables, libraries, and shared libraries--should be plenty, and even when using a system like <a href="https://gyp.gsrc.io/">Gyp</a>, there are advantages to having the low-level build graph be simple enough to read through and debug by hand.
<p>Using such an approach, you would set up a single build rule that goes all the way from C and H files to the output. Here it is in JSON syntax, using Blcache:
<pre>
{
"environ": [
"PATH"
],
"rules": [
{
"commands": [
"gcc -Ilocalhdrs -Ihdrs -o test test.c"
],
"inputs": [
"test.c",
"localhdrs",
"hdrs"
],
"name": "c/test",
"outputs": [
"test"
]
}
]
}
</pre>
<p>The "environ" part of this build file declares which environment variables are passed through to the underlying commands. In this case, only PATH is passed through.
<p>There is just one build rule, and it's named <code>c/test</code> in this example. The inputs include the one C file (<code>test.c</code>), as well as two entire directories of header files (<code>localhdrs</code> and <code>hdrs</code>). The build command for this rule is very simple: it invokes <code>gcc</code> with all of the supplied input files, and has it build the final executable directly.
<p>With the build rules set up like this, any change to any of the declared inputs will cause a rebuild to happen. For example, here is what happens in an initial build of the tool:
<pre>
$ blcache c/test
Started building c/test.
Output from building c/test:
gcc -Ilocalhdrs -Ihdrs -o test test.c
$ ./target/c/test
Limit on threads: 10
</pre>
<p>After adding <code>syslimits.h</code> to the <code>localhdrs</code> directory, the entire component gets rebuilt, because the <code>localhdrs</code> input is considered to have changed:
<pre>
$ blcache c/test
Started building c/test.
Output from building c/test:
gcc -Ilocalhdrs -Ihdrs -o test test.c
$ ./target/c/test
Limit on threads: 500
</pre>
<p>As a weakness of this approach, though, any change to any C file or any header file will trigger a rebuild of the entire component.
<p><strong>Method 2: Preprocess as a separate build step</strong>
<p>Reasonable people disagree about how fine-grained build rules for C should be, so let me describe the fine-grained version as well. This version can rebuild more incrementally in certain scenarios, but that benefit comes at the expense of a substantially more complicated build graph. Then again, most developers will never look at the build graph directly, so there is some argument for increasing the complexity here to improve overall productivity.</p>
<p>The key idea with the finer-grained dependencies is to include a separate build step for preprocessing. Here's a build file to show how it can be done:
<pre>
{
"environ": [
"PATH"
],
"rules": [
{
"commands": [
"gcc -c -o test.o target/c/preproc/test.i"
],
"inputs": [
"c/preproc/test:test.i"
],
"name": "c/object/test",
"outputs": [
"test.o"
]
},
{
"commands": [
"gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c"
],
"inputs": [
"test.c",
"localhdrs",
"hdrs"
],
"name": "c/preproc/test",
"outputs": [
"test.i"
]
},
{
"commands": [
"gcc -o test target/c/object/test.o"
],
"inputs": [
"c/object/test:test.o"
],
"name": "c/test",
"outputs": [
"test"
]
}
]
}
</pre>
<p>This file has three rules in it that chain together to produce the final output file. When you build the test executable for the first time, all three rules will be executed:
<pre>
$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c
Started building c/object/test.
Output from building c/object/test:
gcc -c -o test.o target/c/preproc/test.i
Started building c/test.
Output from building c/test:
gcc -o test target/c/object/test.o
$ ./target/c/test
Limit on threads: 10
</pre>
<p>First, the <code>test.c</code> file is preprocessed, yielding <code>test.i</code>. Second, the <code>test.i</code> file is compiled to <code>test.o</code>. Finally, <code>test.o</code> is linked into the final <code>test</code> executable.
<p>Adding the new <code>syslimits.h</code> file behaves as expected, causing the full chain of recompiles.
<pre>
$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c
Started building c/object/test.
Output from building c/object/test:
gcc -c -o test.o target/c/preproc/test.i
Started building c/test.
Output from building c/test:
gcc -o test target/c/object/test.o
$ target/c/test
Limit on threads: 500
</pre>
<p>Modifying an irrelevant header file, on the other hand, only causes the preprocessing step to run. Since the preprocessing yields the same result as before, rebuilding stops at that point.
<pre>
$ touch localhdrs/irrelevant.h
$ blcache c/test
Started building c/preproc/test.
Output from building c/preproc/test:
gcc -Ilocalhdrs -Ihdrs -E -o test.i test.c
Using cached results for c/object/test.
Using cached results for c/test.
</pre>
<p>It's not shown in this example, but since each C file is compiled individually, a change to a C file will only trigger a rebuild of that one file. Thus, the technique here is fine-grained in two different ways. First, changes to one C file only trigger a recompile of that one file. Second, changes to the H files only trigger preprocessing of all C files, and then only compilation of those C files that turn out to be affected by the H files that were changed.
<p>By the way, there's a trick here that generalizes to a variety of cached computations. If you want to add a cache for a complicated operation like a C compile, then don't try to have the operation itself be directly incremental. It's too error prone. Instead, add a fast pre-processing step that accumulates all of the relevant inputs, and introduce the caching after that pre-processing step. In the case of this example, the fast pre-processing step is, well, the actual C preprocessor.
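<p>In code, the pattern looks something like the following sketch in Scala, with made-up names: run the cheap canonicalizing step unconditionally every time, and let the cache key be the canonicalized result.
<pre>
// Sketch only: cache an expensive operation behind a cheap
// canonicalizing step, in the spirit of preprocessing before compiling.
class PreprocessedCache[K, V](
    preprocess: K => String,   // cheap; runs on every request
    expensive: String => V) {  // slow; runs only on cache misses

  private val cache = scala.collection.mutable.Map[String, V]()

  def apply(input: K): V = {
    val canonical = preprocess(input)  // like running the C preprocessor
    cache.getOrElseUpdate(canonical, expensive(canonical))
  }
}
</pre>
<p>Because the canonical form already folds in everything the expensive step will look at, a cache hit is safe by construction, which is exactly the property the automatic-dependency approach fails to guarantee.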
<p><strong>Coda</strong>
<p>Before realizing the problem with C compilation, I used C as an example of why you might want to bend the rules a little bit on the strictest and most simplistic version of a build cache. However, now it seems to me that you find the best set of build rules if you strictly adhere to a build-cache discipline. I'm sorry, build cache. I should never have doubted you.
<h1>Standard build rules for C are unreliable (2016-05-04)</h1>
The standard way of integrating C into a build system is to use <a href="http://make.mad-scientist.net/papers/advanced-auto-dependency-generation/">automatic dependencies generated from the compiler</a>. Gcc and Clang can emit a list of the header files they read if you run them with the <code>-M</code> option. Visual Studio can do it as well, using the <a href="https://msdn.microsoft.com/en-us/library/hdkef6tk.aspx">/showIncludes option</a>. What I will call the "standard approach" in this post is to use the dependencies the user explicitly declared, and then to augment them with automatic dependencies generated by options like <code>-M</code> or <code>/showIncludes</code>.
<p>Until a few years ago, I just took this approach as received wisdom and didn't think further about it. It's a neat trick, and it works correctly in the most obvious scenarios. Unfortunately, I have learned that the technique is not completely reliable. Let me share the problem, because I figure that other people will be interested as well, especially anyone else who ends up responsible for setting up a build system.
<p>The root problem with the standard approach is that sometimes a C compile depends on the <em>absence</em> of a file. Such a dependency cannot be represented and indeed goes unnoticed in the standard approach to automatic dependencies. The standard approach involves an "automatic dependency list", which is a file listing out the automatically determined dependencies for a given C file. By its nature, a list of files only includes files that exist. If you change the status of a given file from not existing, to existing, then the standard approach will overlook the change and skip a rebuild that depends on it.
<p>To look at it another way, the job of an incremental build system is to skip a compile if running it again would produce the same results. Take a moment to consider what a compiler does as it runs. It does a number of in-memory operations such as AST walks, and it does a number of IO operations including reading files into memory. Among those IO operations are things like "list a directory" and "check if a file exists". If you want to prove that a compiler is going to do the same thing on a second run as it did on the first, then you want to prove that those IO operations are going to do the same thing on a second run. That means all of the IO operations, though, not just the ones that read a file into memory.
<p>Such a situation may seem exotic. <a href="http://www.evanjones.ca/makefile-dependencies.html">At least one prominent source</a> has declared that the standard approach is "correct" up to changes in the build command, which suggests to me that the author did not consider this scenario at all. It's not just a theoretical problem, though. Let me show a concrete example of how it can arise in practice.
<p>Suppose you are compiling the following collection of files, including a single C file and two H files:
<pre>
<b>// File test.c</b>
#include <stdio.h>
#include "syslimits.h"
#include "messages.h"
int main() {
printf("%s: %d\n", MSG_LIMIT_THREADS, LIMIT_THREADS);
}
<b>// File localhdrs/messages.h</b>
#define MSG_LIMIT_THREADS "Limit on threads:"
<b>// File hdrs/syslimits.h</b>
#define LIMIT_THREADS 10
</pre>
Using automatic dependencies, you set up a Makefile that looks like this:
<pre>
CFLAGS=-Ilocalhdrs -Ihdrs
test.o test.d : test.c
	gcc $(CFLAGS) -M test.c > test.d
	gcc $(CFLAGS) -c test.c
test: test.o
	gcc -o test test.o
-include test.d
</pre>
<p>You compile it and everything looks good:
<pre>
$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
$ ./test
Limit on threads: 10
</pre>
Moreover, if you change any of the input files, including either of the H files, then invoking <code>make test</code> will trigger a rebuild as desired.
<pre>
$ touch localhdrs/messages.h
$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
</pre>
<p>What doesn't work so well is creating a new version of <code>syslimits.h</code> that shadows the existing one:
<pre>
<b>// File localhdrs/syslimits.h</b>
#define LIMIT_THREADS 500
</pre>
<p>Make should now recompile the executable, but it doesn't:
<pre>
$ make test
make: 'test' is up to date.
$ ./test
Limit on threads: 10
</pre>
<p>If you force a recompile, you can see that the behavior changed, so Make really should have recompiled it:
<pre>
$ rm test.o
$ make test
gcc -Ilocalhdrs -Ihdrs -M test.c > test.d
gcc -Ilocalhdrs -Ihdrs -c test.c
gcc -o test test.o
$ ./test
Limit on threads: 500
</pre>
<p>It may seem picky to discuss such a tricky scenario as this one, with header files shadowing other header files. Imagine a developer in the above scenario, though. They are doing something tricky, yes, but it's a tricky thing that is fully supported by the C language. If this test executable is part of a larger build, the developer can be in for a really difficult debugging exercise, trying to understand why their built executable does not behave consistently with the source code. I dare say, it is precisely such tricky situations where people rely the most on their tools behaving in an intuitive way.
<p>I will describe how to set up better build rules for this scenario in a followup post.
<h1>Two little things I wish Java would add (2016-04-12)</h1>
When geeking out about language design, it's tempting to focus on the things that require learning something new to even understand how it works. SAM types require understanding target typing, and type members require understanding path-dependent types. Fun stuff.
<p>
Aside from these things that are fun to talk about over beers, I really wish Java would pick up a few things from Scala that are just plain more convenient.
<h2>
Multi-line string literals</h2>
A great way to structure a unit test is to feed in a chunk of text, run some processing that you want to verify, convert the actual output to text, and then compare it against another chunk of text that's included in the test case. Compared to a dense string of <code>assertEquals</code> calls, this testing pattern tends to be much easier to read and understand at a glance. When such a test fails, you can read a text diff at a glance and possibly see multiple different kinds of failure that happened with the test, rather than stare into the thicket of <code>assertEquals</code> calls and try to deduce what is being tested by the particular one that failed.
<p>
The biggest weakness of this style is very mundane: it's hard to encode a multi-line chunk of text in Java. You have to choose between putting the text in an external file, or suffering through strings that have a lot of "\n" escapes in them. Both choices have problems, although the latter option could be mitigated with a little bit of IDE support.
<p>
In Scala, Python, and many other languages, you can write a multi-line string by opening it with triple quotes (""") rather than a single quote mark ("). It's a trivial feature that adds a lot to the day to day convenience of using the language.
<p>
As one trick to be aware of, it's important to help people out with indentation when using triple quotes. In Scala, I lobbied for the <code>stripMargin</code> approach to dealing with indentation, where you put a pipe on each continuation line, and anything up to the pipe is considered leading indentation and removed. In retrospect, I wish I had pushed for that to simply be the default behavior. If you need a literal pipe at the start of a line, you can always write it twice. Making people write <code>stripMargin</code> on almost every multi-line string is a form of boilerplate.
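<p>
For concreteness, here is roughly what that looks like in Scala today:
<pre>
object StripMarginExample extends App {
  // The pipe marks where each line really begins; stripMargin removes
  // everything up to and including it, so the literal can stay indented
  // along with the surrounding code.
  val expected =
    """|name: widget
       |count: 42
       |status: ok
       |""".stripMargin

  print(expected)
}
</pre>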
<h2>
Case classes</h2>
There are philosophers who disagree, but I find them a little too philosophical for my taste. Sometimes you really want to write a class that has no hidden internal state. Sometimes it would be a breach of the API to retain any internal state, or to implement the public API as anything other than plain old final fields. Some motivating examples are: <a href="http://www.markphelps.me/2014/12/09/tiny-types.html">tiny types</a>, data structure nodes such as links in a linked list, and <a href="https://msdn.microsoft.com/en-us/library/ff649585.aspx">data-transfer objects</a>.
<br />
In such a case, it takes a tremendous amount of code in Java to implement all the odds and ends you want for such a class. You would like all of the following, and they are all completely mechanical:
<br />
<ul>
<li>Constructors that copy their parameters to a series of final fields.
</li>
<li>A toString() implementation.
</li>
<li>Comparison operations: equals(), hashCode(), and compareTo(). Ideally also helpers such as isLessThan().
</li>
<li>Copy constructors that make a new version by replacing just one of the fields with a new value.
</li>
</ul>
The equals() method is particularly painful in Java because there is a lot of inconsistent advice going around about how to write it. I've been dragged into multi-day debates on equals() methods where people cite things I published in the past to try and use against me; I'm pretty sure I meant what I said then and mean what I say now. Above all, though, I'd rather just have a reasonable equals() method and not spend time talking about it.
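<p>For comparison, here is the Scala version of such a class. This isn't quite everything on the wish list above (compareTo and isLessThan still take a little extra work), but the bulk of it comes for free:
<pre>
// One line buys the constructor, final fields, toString, equals,
// hashCode, and a copy method for making modified versions.
case class Link(label: String, target: String)

object CaseClassExample extends App {
  val a = Link("home", "/index.html")
  val b = a.copy(target = "/home.html")   // replace just one field
  println(a)        // Link(home,/index.html)
  println(a == b)   // false: structural equality, not reference equality
}
</pre>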
<h1>The mystics are coming out (2015-11-05)</h1>
<a href="https://github.com/lampepfl/dotty/issues/818">Discussion on the Scala collections revamp</a> is starting to get mystical. It really bugs me: good language design makes a huge difference, but it's hard to see unless you actually deal with thousands or more developers on millions or more lines of code. Casual observers of language design discussions don't see it themselves, so they don't realize what these problems look like to the other people in the discussion. So they think everyone is just goofing around and throwing out random ideas just because they are possible.
<p>I started to follow up on the issue itself, but I fear making the problem worse. So I'll post a couple of rebuttals here. I don't really think the people actually involved in the redesign will be distracted by the mystics, anyway. They are like gold-sellers in MMOs or beggars in San Francisco; unless you are specifically trying to engage with them, you just learn to tune them out. Well, okay, they are like really nerdy gold sellers who like to talk about higher math. Okay, I'd better just drop this analogy.
<p>First, Python's indexing operator has a lot of practical problems. Here are a few concrete examples: <a href="https://plus.google.com/+MattMight/posts/BVSmNadKni4">https://plus.google.com/+MattMight/posts/BVSmNadKni4</a> . Thinking backward from those puzzlers, I think the root of the problem is that the [a:b] slicing operator means something different depending on whether a and b are positive, and whether a is smaller or larger than b. This is foundational syntax used everywhere, and if you want code to be readable, people need to know whether it's doing a forward or reverse slice without having to do mental data flow analysis on the parameters. Java avoids this trap, and Scala should, too. The sublist operation should only work on non-negative arguments, and only when the first argument is smaller than the second.
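<p>As a sketch of the rule I mean, in Scala terms and with a name I made up, allowing the empty slice when the endpoints are equal:
<pre>
// Sketch only: a sublist operation that rejects reversed or negative
// ranges instead of silently reinterpreting them.
def sublist[A](xs: Vector[A], from: Int, until: Int): Vector[A] = {
  require(0 <= from && from <= until && until <= xs.length,
    s"bad slice [$from, $until) of a sequence of length ${xs.length}")
  xs.slice(from, until)
}
</pre>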
<p>The other thing I will say is that exceptions, null, and -1 are also all fine, when used by a tasteful library designer. We tried at Semmle to get people using more Options, and we found that it harmed our reputation as a quality lint tool. I can only speak publicly about open-source code, but to give an example, Apache Spark has over a thousand occurrences where they use null but, with only local changes, they could use Option instead. It's too many. It means that the basic premise of the programming advice has some sort of problem.
<p>As one stab at it--though it's really a huge topic--you have to think about what you want the callers to do to defend against a missing value. If Scala's basic get/apply methods start returning an Option, then people will just litter their code with calls to .get, so the net result is that the code is more bloated but otherwise behaves exactly the same. Even in pure math, people will write notation like <code>f'(x)</code> as the derivative of <code>f</code>, but you know, the derivative isn't always defined. So should smart mathematicians instead write <code>get(f')(x)</code>? Or <code>(f') match { ... }</code>?
<p>That's my try, but you don't even have to understand this if you are willing to ape mature libraries in areas where they are working okay. It's not a practical problem in Java that the various .get() methods return null or throw exceptions; even if you say it's not perfect, it's certainly not <em>bad</em>. It is, however, a very big problem that Java collections are overly mutable. For example, see this Stack Overflow question: <a href="https://stackoverflow.com/questions/2842169/why-are-public-static-final-array-a-security-hole">https://stackoverflow.com/questions/2842169/why-are-public-static-final-array-a-security-hole</a>. Scala will be more attractive to more developers if it focuses on these widely recognized pain points. Which is why the mystics drive me crazy--if they get their way, they will find that their playground is gradually becoming a ghost town, but only after it's too late.

<h1>Initial input on the Scala collections redesign (2015-10-12)</h1>
It looks like Martin Odersky is <a href="https://groups.google.com/forum/#!topic/scala-language/EIaKWWu93OY">considering an overhaul of Scala's collections</a>:
<blockquote>
A redesign of the standard library is on the roadmap for one of the
next Scala versions (could be as early as 2.13). So I think now is the
time to start thinking about what we want to change, in particular in
what concerns collections!
</blockquote>
<p>Some more details are available <a href="https://github.com/lampepfl/dotty/issues/818">on the Dotty issue tracker</a>. <a href="https://newcircle.com/s/post/1568/scala_collections_why_not_paul_phillips_video">Paul Phillips has weighed in on this issue</a> with a very interesting slide deck.
<p>I think it will be a very positive change if done carefully. It's painful to change such a fundamental library, but the current state is a real pain point for practical Scala code. If no other contender can get full support, I would even go so far as to suggest going back to Matthias Zenger's original version and working forward from there more conservatively this time. It was really a gem of useful functionality, in particular the "persistent" collections, which were controversial back then. Since that initial version, there have been several waves of changes that added complexity while only supporting minority use cases: lazily evaluated views, concurrent collections, and the CanBuildFrom magic. These are all valuable but should all be done in a separate library that is opted into when you need it.
<p>I cannot put together a complete straw man implementation given my professional duties right now. I can make a few high-level suggestions, though, based on my experience in developer tools at Google and LogicBlox. Mind you, these are just initial reactions. I may well be overlooking important implementation details. Also, I haven't had the pleasure of using Scala for the new "big data" growth area, so I may be overlooking some concerns compared to the kinds of code I am used to.
<p>At a high level, I'd want to see the following in an updated collection library:
<ul>
<li>A focus on just the core collection types: maps, sets, linked lists, and some form of array-like sequences (maybe Array itself). Other collection types do come up in practice, but in a minority of contexts, and for those cases it should be fine to use a concrete collection type that does not necessarily inherit from any standard-library collection traits.
<li>Simple old-fashioned classes and methods, without any implicit parameters or type members. The status quo is bad in numerous ways, including problems with all of the following: IDE auto-completion, single-step debugging, and compiler error messages.
</ul>
<p>Non-goals include the following:
<ul>
<li>User extension of the library with new collection types. This is a small enough use case that it doesn't merit giving up on any other design goals of the library, e.g. performance.
<li>Supporting the majority of the collection types that are available now. I wrote a 1-2 page description of each of them for the latest <a href="http://www.artima.com/shop/programming_in_scala_2ed">Programming in Scala</a>, and I found it quite the slog. The library has simply grown very large over time. It would be better if the less common collection types were more obviously in a side library. They're all useful sometimes, but the grand total is just overwhelming.
<li>Infinite collection types, in particular the lazy Stream type.
</ul>
<p>Some other things are desirable but might be unreasonable:
<ul>
<li>High performance in common use cases involving a for() loop. This is a complicated problem, but it is a persistent problem area that drives people away from writing idiomatic Scala code. If nothing else, perhaps for() should translate to iterators instead of to higher-order functions. Another approach would be to have the compiler recognize and optimize standard types like List, Set, and Map; this will work better if those types are sealed rather than user extensible.
<li>Rename the three versions of Set and Map to have different names. For example, maybe AbstractSet, HashSet, and Set. There are numerous reasons for this; for example, automatic IDE import doesn't work as well if you have multiple classes with the same name. I put this in the "maybe" list, though, because it would require a lot of code to be rewritten. Even taking care to <a href="http://www.lexspoon.org/blog/interfevol.html">phase the change in over a period of months</a>, it's a lot of code that would need to be turned over.
</ul>
<p>On a minor note, type Seq seems pretty useless. I know that Java has an interface for this, but maybe take a stand on this. For well-written code, it doesn't help anything to know that a type is a Seq and not just an Iterable. So delete the useless intermediate trait.
<h1>Ken Clark on shame mobs (2015-08-05)</h1>
<p>Ken Clark has posted the <a href="http://popehat.com/2015/07/29/top-six-things-i-like-about-internet-shame-mobs">top six things he likes about shame mobs</a>. Here's a taste:
<blockquote>
5) Internet shame mobs weigh the evidence carefully and deliberately before attacking, so they only happen to people who deserve them.
[...]
3) Internet shame mobs always make sure that the punishment is proportional to the crime.
</blockquote>
<p>There's a larger phenomenon here where problematic information spreads faster than the correction to it. If it spreads fast enough, then it can even pass a tipping point where it becomes very hard to get a hearing to object to any part of it. Everyone has already heard the idea from, well, everyone else, so they quickly dismiss anyone who objects without even really considering it.
<p>The key to stopping such memetic chain reactions is to apply some filtering before propagating information that you read. It's still early days for the Internet, though, and we are all still learning to inoculate ourselves against being the wrong kind of carrier.
<p>There is some reason to have hope. Chain emails used to flourish, but are now mostly stamped out. In their heyday, 15-20 years ago, it was fairly common to open your mail program and see numerous messages that said something very exciting, and furthermore that the message should be passed on to everyone you know as fast as possible. Nowadays, the people I interact with just delete such emails. If an email explicitly says that it should be forwarded to everyone you know, then it triggers something like an antibody response. Such an email starts looking very bogus, and it gets deleted quickly and possibly even flagged for followup by the email provider.
<p>Intriguingly, people would likely not have developed that response had they not gone through the misery of dealing with chain emails earlier on. There are clear parallels to viruses and to building up antibodies!
<p>Shame mobs are a place where it still goes on, though. I'm not entirely sure why it happens. In part, people just want to defend an idea, and they are happy to use real people as an example no matter the consequences. In part, people just enjoy being part of a mob. I hope that shame mobs go the same way as the chain email. We shall see.
<h1>Finding and fixing complicated code (2015-07-29)</h1>
<p>It's hard to beat lines of code as a complexity metric: the subjective complexity of a piece of code seems to correlate very well with its length. Over the years, many other metrics have been devised and explored, but they tend to be more complicated yet less effective than the simple old lines-of-code.
<p>One exception to the rule is <a href="https://en.wikipedia.org/wiki/Cyclomatic_complexity">McCabe Cyclomatic Complexity</a>, a metric almost exactly as old as I am. McCabe can often find methods that aren't that many lines long, but that nonetheless a typical maintainer would identify as complicated and difficult to maintain. Small wonder that <a href="http://www.gnu.org/software/complexity/manual/complexity.html#Introduction">GNU Complexity puts so much emphasis on it</a>.
<p>What McCabe does is count how many independent paths are needed to cover a method from beginning to end. The way McCabe described it, in his delightfully readable paper on the subject, is that it's the number of test cases you need to get statement coverage for a given method. It sounds like something that would be impractically hard to calculate--you are minimizing over arbitrary test cases!--but it's not complicated at all. He proved that, with a few extra assumptions, all you really have to do is count the number of branching options in a given body of code.
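<p>To show how mechanical the count is, here is a rough sketch in Scala. It counts branch tokens in raw source text; real tools such as GNU Complexity work on parsed code, so treat this as an illustration only:
<pre>
// Approximate McCabe complexity: one plus the number of branch
// points. A token-level count like this is only a heuristic.
val branchTokens = Set("if", "while", "for", "case", "&&", "||", "catch")

def approximateComplexity(source: String): Int =
  1 + source.split("[^A-Za-z&|]+").count(branchTokens.contains)

// approximateComplexity("if (x > 0) a() else if (y > 0) b() else c()")
// == 3: two ifs, plus one for the straight-line path
</pre>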
<p>McCabe complexity is very effective in my experience. A method with complexity 10 is usually a dense, gnarly tangle of code, even if it's only 20 lines long. Yet 20 lines of code is not enough, by length alone, to mark a method as complicated. If it's a simple linear sequence of non-branching code, it's usually reasonable to leave it alone. McCabe complexity is therefore a real coup.
<p>Even better, there's usually a straightforward way to act on code that has a poor McCabe complexity measure. If you see a method with a lot of branches in it, you can usually make the code more maintainable by factoring out some of the clusters of branches to their own smaller methods. The main method gets easier to understand, and if you do it tastefully, the smaller factored-out methods will also be easier to understand.
<p>One place McCabe complexity falls down, though, is large switch statements. If you write network code that receives a message and dispatches it, it's perfectly fine style to write a switch statement for the dispatch with one case for each kind of message. This is especially true if each case simply calls a method that does the real work. Such a switch statement will easily have a dozen cases, one for each message type, but what alternative would be any more maintainable? Factoring out methods for subsets of the cases doesn't seem obviously better, and reducing the number of cases is often impossible.
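<p>Here is the shape of the thing, sketched in Scala, where a match expression plays the role of the switch; the message types are hypothetical:
<pre>
sealed trait Message
case class Connect(host: String)  extends Message
case class Send(payload: String)  extends Message
case object Disconnect            extends Message

// Stub handlers, so the sketch is self-contained.
def handleConnect(host: String): Unit = ()
def handleSend(payload: String): Unit = ()
def handleDisconnect(): Unit = ()

// One case per message type, each delegating to a method that does
// the real work. Every case bumps the McCabe count by one, yet it is
// hard to imagine a more maintainable way to write the dispatch.
def dispatch(msg: Message): Unit = msg match {
  case Connect(host) => handleConnect(host)
  case Send(payload) => handleSend(payload)
  case Disconnect    => handleDisconnect()
}
</pre>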
<p>McCabe is even worse for switch statements that consider pairs of types that are each drawn from some enum. To see some typical examples, look at <a href="https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala">Cast.scala</a> in the Apache Spark source code. All the big switch statements in there are enumerating pairs of types, and they strike me as a very reasonable way to manage a problem that is just fundamentally complicated. This organizational style reminds me of how people organize definitions on paper when they are studying and explaining language semantics. Such people will typically define their relations and functions using a dozen or so cases, each of which has a number of assumptions (x is a class type, y is a class type, and x inherits from y), and a conclusion you can draw if all of the assumptions hold (x is a subtype of y). If this style is reasonable for theory papers, where people are expending enormous time and energy to bring a complicated subject within the reach of the human mind, it seems only reasonable to write such definitions in code when possible.
<p>Overall, McCabe complexity is excellent and has stood the test of time. I wonder, though, if "number of branching statements" might be even more useful for anyone trying to simplify their code. "Number of branching statements" is the same as McCabe for if statements and while loops, but for switch statements, it only adds 1. Such a measure seems like it would better focus on code that is not merely complicated, but also possible to simplify.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-81973629872295856402015-03-18T16:11:00.001-07:002015-03-18T16:11:10.625-07:00Every build is specialSince Semmle's business involves integrating with our customers' builds, I'm very sympathetic to <a href="http://timboudreau.com/blog/maven/read">Tim Boudreau's perspective on simple plain-jane builds that just work</a>:
<blockquote>
But with time and enough consulting customers, you grow a real appreciation for projects you can simply check out from version control and build.
</blockquote>
<p>It's always a relief when someone wants to, say, build a stock iPhone app using xcodebuild. It's often fine when they want to build a C program using msbuild. Third place is Maven, for a simple Java program; I say third place, because hardly anyone uses it without loading it with a bunch of plugins that fundamentally change how it works. But still, at least the overall project structure is somewhat standardized.</p>
<p>In my modest experience, I only really see the above for relatively small and simple projects. As "Scott" puts it in the first comment at the above link:
<blockquote>
The idea that your build is "not as special as you think" is more often false than it is true.
</blockquote>
<p>I just skimmed a code base I have handy, and here are some of the things I see in there that break the mold of any off-the-shelf build tool:
<ul>
<li>It uses parser generators, such as ANTLR.
<li>It's mainly a Java code base, but there is a little bit of code in all of: C, Objective C, JavaScript, Scala, C#, and Python.
<li>It uses a home-grown stack trace obfuscator and de-obfuscator that involves bytecode rewrites on the deployed jars.
<li>It includes its own version of zipmerge, with options not available in the open-source version.
<li>It uses the JIT support in mini-Lua to generate machine code that is then compiled into string literals in a C program.
<li>It includes a proprietary query language and numerous queries in that language. Those queries get compiled during the build, using a query engine that is also compiled during the same build.
</ul>
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-76054486386188637462015-01-14T09:53:00.001-08:002015-01-14T09:53:28.180-08:00Surveillance states are possible<p>While clamping down on private encryption is bad policy, both for the economy and for privacy, I don't think it's technically impossible to implement. Let me draw a couple of comparisons to show why.</p>
<hr>
<p>As background, here is Cory Doctorow explaining, like many other commenters, that <a href="http://boingboing.net/2015/01/13/what-david-cameron-just-propos.html">the Internet is too wild and wooly</a> for major governments to possibly implement widespread surveillance:</p>
<blockquote><p>
For David Cameron's proposal to work, he will need to stop Britons from installing software that comes from software creators who are out of his jurisdiction. The very best in secure communications are already free/open source projects, maintained by thousands of independent programmers around the world. They are widely available, and thanks to things like cryptographic signing, it is possible to download these packages from any server in the world (not just big ones like Github) and verify, with a very high degree of confidence, that the software you've downloaded hasn't been tampered with.</p></blockquote>
<p>With cellular phones, any phone that uses the relevant chunks of bandwidth is legally required to use certain protocols that are registered with the government. This has been bad economically, in that the telephone network has developed much more slowly than the relatively unregulated Internet. However, being bad economically has never exactly stopped rules from being put into place.</p>
<p>Yes, you can rig up a wireless network in your garage that breaks the rules. However, as soon as you try to use it over a wide geographic region, you're going to be relatively easy to catch. You will have to either broadcast a strong signal, or make use of the existing telephone backbone, or both.</p>
<p>To draw another comparison, consider the income tax. Income tax is easy to avoid with small operations, because you can just pay cash under the table. However, larger operations have to file a variety of paperwork, and the interlocking paperwork is what will get you. The more you take part in the above-ground economy, the harder it is to spin a big enough web of lies to get out of your taxes.</p>
<p>To get back to Internet protocols, it will certainly always be possible to break the rules on an isolated darknet you assemble in your garage. However, as soon as you send packets across the Internet backbone, any use of unregistered protocols is going to be very easy to detect.</p>
<p>To rub the point in further, don't forget that the authorities have no requirement to go after everyone who they detect doing something fishy. If they are anything like the American tax service, they'll randomly (or politically....) select people to target, and those people will then be required to undergo an audit at their own expense. If they survive the audit, the tax service just says "I'm sorry" and moves on to the next victim. Because of selective enforcement, law enforcement has no requirement to go after <em>everyone</em> using illegal encryption.</p>
<p>Of course all this is bad for the economy and for humanity's development at large. Don't oppose a cryptography clampdown because it's technically impossible, or you will look <a href="http://blog.lexspoon.org/2012/01/dns-takedowns-alive-and-well.html">just as silly</a> as the people that say DNS takedowns are technically impossible. Rather, oppose a cryptography clampdown because we don't want to live like that. We want to have private communications, and we want to allow innovation on the Internet. It's brand new, and if we clamp down on it, it will ossify in its current state the same way that the telephone network did.</p>
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-20015805246442614092014-12-05T14:18:00.000-08:002014-12-05T14:18:45.202-08:00Goodbye, Spark Plug<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhchaxN90UXs609iK2fT7krlYt4l0SLJ-dRpxg9e7y-iBJVRgkcs1WcH706rpupsxarh2XFHcStp6W7BAzcO0tTG3GI-xO7AuMyj-KgeMy-xmlojDDQuIXn3cJu0fitecUM7GQWc_CRqzV/s1600/happiness-is-a-warm-yard-cropped.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhchaxN90UXs609iK2fT7krlYt4l0SLJ-dRpxg9e7y-iBJVRgkcs1WcH706rpupsxarh2XFHcStp6W7BAzcO0tTG3GI-xO7AuMyj-KgeMy-xmlojDDQuIXn3cJu0fitecUM7GQWc_CRqzV/s320/happiness-is-a-warm-yard-cropped.jpg" /></a></div>
<p>Don't take me wrong. I know a dog is just a dog, and a pet is just a
pet. There are people reading this who have cancer, and there are some
who have outlived their human children. On the scale of life
challenges, I've just had maybe a 3/10.</p>
<p>Still, I would like to write a few words. It's a way to organize my
thoughts, and a way to say goodbye. I promise the next post will be
about programming or law or identity or the web, but that all seems rather dry to me
today.</p>
<p>As all you pet owners know, you get a little Pavlovian jolt each time
you help out your little ward and they reward you for it. For example, when they
greet you at the door and run in circles. Or, when they learn your gait
well enough to jump on each leg in time and then jump to the other one
before you pick that foot up.
When they're cold, and you blanket them up, and they let out
a deep sigh of contentment. When there's a burr in their foot, and
they plaintively hold it out so that someone with thumbs can do
something about it.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjXqTxKe29-_w9myormfzqY-xnBGiq0pGxpiV8e_q4xTSyd1mnMhkc8tFUQdkAkWmX5IdOgoAxQp1_CcfoHdlUjbDph_8NsobTWb2avvkp7aV8J67bpD1CT7J6FzTcf3EJxf_Tt8pT7Oyo/s1600/blankets-cropped.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjXqTxKe29-_w9myormfzqY-xnBGiq0pGxpiV8e_q4xTSyd1mnMhkc8tFUQdkAkWmX5IdOgoAxQp1_CcfoHdlUjbDph_8NsobTWb2avvkp7aV8J67bpD1CT7J6FzTcf3EJxf_Tt8pT7Oyo/s320/blankets-cropped.jpg" /></a></div>
<p>Over time it really adds up. You become attuned to where you expect
them to be when you walk into a room. You gain a second sense about
which ruffles in a couch blanket have a dog under them. You expect that if you plink through a
few measures of a waltz, you'll see two brown eyes peek around the corner
to see what you're doing. After 18 years of that and then nothing, you
are left with a lot of little voids that add up to one great big void.</p>
<p>Some animals go and hide when they become deathly sick, but this one did not.
In his last hours he came to me to fix it. Dog or no, it was
crushing to see such hope and confusion, yet so little ability to do
anything about it.</p>
<p>To anyone out there facing this kind of question, let me tell you that
I feel no unease at all about the decision to eschew blood
samples, fluid IVs, antibiotics, and whatever else I didn't even ask
about, to try to give him a little more time. I keep thinking, he was 18, and kidneys don't get
better, and he had multiple other problems, anyway. Indeed, what I
really wish I could go back in time on is delaying euthanasia for
too long. I even had my mind made up, and I went to a 24-hour vet to do it,
but I backed down when confronted with a thorough vet
who wanted to rehash the dog's entire medical
history. I thought I could just take him to our regular vet
the next day, but the sun never rose for him again. Yes, hindsight is 20/20,
but I wish I had insisted.</p>
<p>Goodbye, Spark Plug. I hope we did well for you.</p>
<hr>
<p>P.S. -- Mom, you are very optimistic to think we can get this
plant to bloom every December. We'll give it a try!</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOR-ryzjP4kI6cUhbD4irpGfXs34pzMfqupQmUYHfuG_Y_Ml68Wzxn5tdnipJBDypFEwMOorfMGJ7ZT9nJWNG_BpOI_CJbuKD5e3U_gjPGCZtHg2-M95bMbVKkXJNHM-_D6z7xbdG0dMq2/s1600/goodbye-flowers-scaled.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOR-ryzjP4kI6cUhbD4irpGfXs34pzMfqupQmUYHfuG_Y_Ml68Wzxn5tdnipJBDypFEwMOorfMGJ7ZT9nJWNG_BpOI_CJbuKD5e3U_gjPGCZtHg2-M95bMbVKkXJNHM-_D6z7xbdG0dMq2/s320/goodbye-flowers-scaled.jpg" /></a></div>
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-38347469870170039102014-11-23T07:49:00.001-08:002014-11-23T07:49:54.128-08:00Is this the right server?<p>It's nice to see <a href="https://www.owasp.org/index.php/Certificate_and_Public_Key_Pinning">someone else reach the following conclusion</a>:
<blockquote>
"For those familiar with SSH, you should realize that public key pinning is nearly identical to SSH's StrictHostKeyChecking option. SSH had it right the entire time, and the rest of the world is beginning to realize the virtues of directly identifying a host or service by its public key."
</blockquote>
<p>Verifying a TLS certificate via the CA hierarchy is better than nothing, but it's not really all that reassuring. Approximately, what it tells you is that there is a chain of certification leading back to one or more root authorities, which, for some reason we all try not to think about too much, are granted the ultimate authority on the legitimacy of web sites. I say "approximately", because fancier TLS verifiers can and do <a href="http://www.csoonline.com/article/2452222/data-protection/google-blocks-bogus-digital-certificates-issued-in-india.html">incorporate additional information</a>.
<p>The root authorities are too numerous to really have faith in, and <a href="https://www.eff.org/deeplinks/2011/03/iranian-hackers-obtain-fraudulent-https">they have been compromised in the past</a>. In general, they and their delegates have little incentive to be careful about what they are certifying, because the entities they certify are also their source of income.
<p>You can get better reliability in key verification if you use information that is based on the interactions of the actual participants, rather than on any form of third-party security databases. Let me describe three examples of that.
<hr>
<h4>Pin the key</h4>
<p>For many applications, a remotely installed application need only communicate with a handful of servers back at a central site you control. In such a case, it works well to <a href="https://www.owasp.org/index.php/Certificate_and_Public_Key_Pinning">pin the public keys</a> of those servers.
<p>The page advocates embedding the public key directly in the application. This is an extremely reliable way of obtaining the correct key. You can embed the key in the app's binary as part of your build system, and then ship the whole bundle over the web, the app store, or however else you are transmitting it to the platform it will run on. Given such a high level of reliability, there is little benefit from pulling in the CA hierarchy.
<p>As linked above, you can implement pinning today. It appears to be tricky manual work, though, rather than something that is built into the framework. As well, you don't get to ignore the CA hierarchy by doing this sort of thing. So long as you use standard SSL libraries, you still have to make sure that your key validates in the standard ways required by SSL.
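<p>To make the idea concrete, here is a rough Scala sketch of a pin check on the JVM. The pinned value is a placeholder that a build system would fill in, and real code would keep ordinary hostname verification and carry backup pins; this only shows the shape:
<pre>
import java.net.URL
import java.security.MessageDigest
import java.util.Base64
import javax.net.ssl.{HttpsURLConnection, SSLSession}

// Placeholder: base64 SHA-256 of the server's public key, embedded
// into the binary at build time.
val pinnedKeyHash = "REPLACE-WITH-YOUR-SERVER-KEY-HASH"

def publicKeyHash(session: SSLSession): String = {
  val serverCert = session.getPeerCertificates()(0)
  val sha256 = MessageDigest.getInstance("SHA-256")
  Base64.getEncoder.encodeToString(
    sha256.digest(serverCert.getPublicKey.getEncoded))
}

def connectPinned(url: String): Unit = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpsURLConnection]
  // Standard CA validation still runs first, as noted above; the
  // pin is an extra check layered on top of it.
  conn.setHostnameVerifier((_, session) => publicKeyHash(session) == pinnedKeyHash)
  conn.connect()
}
</pre>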
<hr>
<h4>Associate keys with links</h4>
<p><a href="http://www.waterken.com/dev/YURL/">The Y property</a> deserves wider recognition, given how important hyperlinks are in today's world. Put simply, if someone gives you a hyperlink, and you follow that hyperlink, you want to reliably arrive at the same destination that the sender wanted you to get to. That is not what today's URLs give you.
<p>The key to achieving this property is that whenever you transmit a URL, you also transmit a hash of the expected host key. There are many ways to do this, including the ones described at the above hyperlink (assuming you see the same site I am looking at as I write this!). Just to give a very simple example, it could be as simple as using URLs of the following form:
<pre>
https://hash-ABC123.foo.bar/sub/dir/foo.html
</pre>
<p>This particular example is interesting for being backward compatible with software that doesn't know what the hash means.
<p>I don't fully know why this problem is left languishing. Part of it is probably that people are resting too easy on the bad assumption that the CA hierarchy has us covered. There's a funny mental bias where if we know nothing about a subject, and we see smart people working on it, the more optimistic of us just assume that it works well. Another part of the answer is that the core protocols of the world-wide web are implemented in many disparate code bases; SSH, by contrast, benefited from having an authoritative version of both the client and the server, especially in its early days.
<p>As things stand, you can implement "YURLs" for your own software, but they won't work as desired in standard web browsers. Even with custom software, they will only work among organizations that use the same YURL scheme. This approach looks workable to me, but it requires growing the protocols and adopting them in the major browsers.
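<p>To pin down the earlier example, here is a sketch of extracting and checking such a hash in Scala; the hash-ABC123 label format is my own invention for illustration:
<pre>
// Hypothetical "YURL" host form: hash-ABC123.foo.bar, where the
// first label carries a hash of the expected server key.
val HashedHost = """hash-([A-Za-z0-9]+)\.(.+)""".r

def expectedKeyHash(host: String): Option[String] = host match {
  case HashedHost(hash, _) => Some(hash)
  case _                   => None   // plain host, no key expectation
}

// Passes vacuously when the URL carries no hash; otherwise the
// server's actual key hash must match the one in the URL.
def keyMatches(host: String, actualKeyHash: String): Boolean =
  expectedKeyHash(host).forall(_ == actualKeyHash)

// keyMatches("hash-ABC123.foo.bar", "ABC123") == true
// keyMatches("hash-ABC123.foo.bar", "XYZ789") == false
</pre>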
<hr>
<h4>Repeat visits</h4>
<p>One last source of useful information is the user's own previous interactions with a given site. Whenever you visit a site, it's worth caching the key for future reference. If you visit the "same" site again but the key has changed, then you should be extremely suspicious. Either the previous site was wrong, or the new one is. You don't know which one is which, but you know something is wrong.
<p>Think how nice it would be if you tried to log into your bank account, and the browser said, "This is a site you've never seen before. Proceed?"
<p>You can get that already if you use <a href="http://www.skyhunter.com/marcs/petnames/IntroPetNames.html">pet names, which have been implemented as an experimental browser extension</a>. It would be great if web browsers incorporated functionality like this, for example turning the URL bar and browser frame yellow if they see that a site is new, judging by its certificate. Each browser can add this sort of functionality independently, as an issue of quality of implementation.
<p>In your own software, you can implement key memory using the same techniques as for key pinning, as described above.
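<p>Here is a small Scala sketch of that key memory; a real client would persist the map to disk rather than hold it in memory:
<pre>
import scala.collection.mutable

sealed trait Verdict
case object FirstVisit extends Verdict  // nothing known; remember the key
case object SameKey    extends Verdict  // matches what we saw before
case object KeyChanged extends Verdict  // be extremely suspicious

val seenKeys = mutable.Map.empty[String, String]

def check(host: String, keyHash: String): Verdict =
  seenKeys.get(host) match {
    case None                    => seenKeys(host) = keyHash; FirstVisit
    case Some(h) if h == keyHash => SameKey
    case Some(_)                 => KeyChanged
  }
</pre>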
<hr>
<h4>Key rotation</h4>
<p>Any real cryptography system needs to deal with key revocation and with upgrading to new keys. I have intentionally left those topics out to keep the discussion simple, but I do believe these things can be worked into the above systems. It's important to have a way to sign an official certificate upgrade, so that browsers can correlate new certificates with old ones during a graceful phase-in period. It's also important to have some kind of channel for revoking a certificate, in the case that one has been compromised.
<p>For web applications and for mobile phone applications, you can implement key rotation by forcing the application to upgrade itself. Include the new keys in the newly upgraded version.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-87784646714228454672014-11-20T12:08:00.000-08:002014-11-20T12:08:39.413-08:00FCC inches away from neutrality<blockquote>
The FCC’s latest proposal for network neutrality rules creates space for broadband carriers to offer “paid prioritization” services.[11] While the sale of such prioritization has been characterized as a stark and simple sorting into “fast” and “slow” traffic lanes,[12] the offering is somewhat more subtle: a paid prioritization service allows broadband carriers to charge content providers for priority when allocating the network’s shared resources, including the potentially scarce bandwidth over the last-mile connection between the Internet and an individual broadband subscriber. Such allocation has historically been determined by detached—or “neutral”—algorithms. The Commission’s newly proposed rules, however, would allow carriers to subject this allocation to a content provider’s ability and willingness to pay.
</blockquote>
<p>That's from a <a href="http://www.stanfordlawreview.org/online/network-nepotism-and-the-market-for-content-delivery">review in the Stanford Law Review</a> a few months ago. I think this evolution in the FCC's approach will benefit the public.</p>
<p>It seems important to consider realistic developments of the Internet. Here's a thought experiment I've used for a long time, and that seems to be happening in practice. Try to imagine what goes wrong if a site like YouTube or Netflix pays--with its own money--to install some extra network infrastructure in your neighborhood, but only allows its own packets to go across that infrastructure. Doing so is a flagrant violation of network neutrality, because packets from one site will get to you faster than packets from another site. Yet, I can't see the harm. It seems like a helpful development, and just the sort of thing that might get squashed by an overly idealistic commitment to neutrality.</p>
<p>As a follow-on question, what changes if instead of Netflix building the infrastructure itself, it pays Comcast to do it? It's the same from a consumer's view as before, only now the companies in question are probably saving money. Thus, it's even better for the general public, yet it's an even more flagrant violation of network neutrality. In this scenario, Netflix is straight-up paying for better access.</p>
<p>It seems that the FCC now agrees with that general reasoning. They not only support content delivery networks in general, but now they are going to allow generic ISPs to provide their own prioritized access to sites that pay a higher price for it.</p>
<p>I believe "neutrality" is not the best precise goal to go for. Rather, it's better to think about a more general notion of anti-trust.</p>
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-32108554266590569862014-10-14T07:09:00.000-07:002014-10-14T07:09:49.113-07:00Three tiers of classroomsVia <a href="http://www.arnoldkling.com/blog/undebunkable/">Arnold Kling</a>, I see Jesse Rothstein trying to prove that you can't measure teaching ability, or perhaps even that teaching ability doesn't matter:
<blockquote>
Like all quasi-experiments, this one relies on an assumption that the treatment – here, teacher switching – is as good as random. I find that it is not: Teacher switching is correlated with changes in students’ prior-year scores.
</blockquote>
<p>It's important to figure out which kind of classroom we are talking about. There are at least three tiers of classroom styles. If you measure only in the middle tier, then I can believe that teacher skill would have only a small effect. However, it's really easy to tell the difference between the tiers if you look, especially for the bottom-most tier compared to the other ones.
<p>At the bottom tier, some classes are just zoos. The teacher is ignored, and the students talk to each other. At best, they work on material for another class. Teacher skill doesn't matter within this tier, from an academic perspective; one zoo teaches students just as much as another zoo. I am sad to say that classrooms like this do exist. It's a potential bright note that such teachers are very easy to identify in an objective way; their students have absolutely terrible results on standardized tests such as Advanced Placement (AP). There's no need for sophisticated statistics if all the students are scoring 1-2 out of 5.
<p>At the middle tier, some classes involve the teacher walking the students through standardized textbooks and other material. Basically, the textbooks are software and the teachers are the hardware that runs it. It's not an inspiring kind of classroom, but at least it is inexpensive. Within this tier, I could see teacher skill not mattering much, because the students spend all their time glued to the course materials. However, you'd certainly like to find out who is in this tier versus in the zoo tier.
<p>Worth a brief mention is that there's an upper tier as well. Maybe "style" is a better word in this case. Sometimes the teacher actually understands the course material, and so is able to respond to the questions with anecdotes and exercises that are tailored for that particular student. For this tier, teacher evaluation is especially important. Among other things, some teachers are fooling themselves, and would be better off staying closer to the book.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-41317546778583947252014-06-27T05:11:00.000-07:002014-06-27T10:04:36.865-07:00Edoardo Vacchi on attribute grammarsI previously wrote that <a href="http://blog.lexspoon.org/2011/04/practical-challenges-for-attribute.html">predictable performance is a practical challenge for using attribute grammars on real work</a>. It does little good to quickly write the first version of a compiler pass if you then spend hours debugging oddball performance problems.
<p>Edoardo Vacchi wrote me the following in response. I agree with him: having an explicit evaluation construct, rather than triggering attribute contributions automatically, is likely to make performance more predictable.
UPDATED: edited the first paragraph as suggested by Edoardo.
<blockquote>
Hi,
<p>This is Edoardo Vacchi from Università degli Studi di Milano (Italy). For my PhD thesis I’m working on a language development framework called “Neverlang”[1,2]. Neverlang is an ongoing project of Walter Cazzola's ADAPT Lab; I am involved with its latest incarnation "Neverlang 2".
<p>I stumbled upon an (old) blog post of yours about Attribute Grammars [3] and I would be interested to know if you knew some “authoritative” references that I could cite with respect to the points that you raise, with particular attention to point (3) “unpredictable performances” and, in part, to (2) caching.
<p>The Neverlang model resembles that of simple “compiler-compilers” like Yacc, where attributes behave more like variables than functions; thus they are generally computed only once; in Neverlang attributes can also be re-computed using the `eval` construct, which descends into a child and re-evaluates the corresponding semantic action.
<p>On the one hand, the need for an explicit `eval` makes it less “convenient” than regular AG-based frameworks; on the other hand, I believe this gives better predictability, and, although the focus of the framework is not performance but rather modularity, I think that “predictability” would better motivate the reasons for this choice.
<p>Thanks in advance,
<p>
[1] <a href="http://link.springer.com/chapter/10.1007%2F978-3-642-39614-4_2#page-1">http://link.springer.com/chapter/10.1007%2F978-3-642-39614-4_2#page-1</a><br>
[2] <a href="http://dl.acm.org/citation.cfm?id=2584478">http://dl.acm.org/citation.cfm?id=2584478</a><br>
[3] <a href="http://blog.lexspoon.org/2011/04/practical-challenges-for-attribute.html">http://blog.lexspoon.org/2011/04/practical-challenges-for-attribute.html</a><br>
</blockquote>
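<p>To illustrate the contrast Edoardo draws, here is a small Scala sketch of the two styles; the names are mine, not Neverlang's:
<pre>
case class Node(children: List[Node]) {
  // Classic attribute-grammar style: an attribute is a function,
  // re-evaluated (or transparently cached) on every reference, so
  // the evaluation cost shows up at hard-to-predict times.
  def size: Int = 1 + children.map(_.size).sum

  // Yacc-like style: computed once into a variable, so the cost is
  // paid at one predictable point; re-evaluation must be explicit.
  lazy val cachedSize: Int = 1 + children.map(_.cachedSize).sum
}
</pre>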
Edoardo Vacchi is a PhD student at Walter Cazzola's ADAPT-Lab, a research lab at Università degli Studi di Milano that investigates methods and techniques for programming language development and software adaptation and evolution. Walter Cazzola is associate professor at UniMi and his research is concerned with software and language engineering. More info about Neverlang can be found at the website <a href="http://neverlang.di.unimi.it">http://neverlang.di.unimi.it</a>.Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-19732007765666250692014-06-03T07:17:00.000-07:002014-06-03T07:17:33.664-07:00My analysis of the Swift languageApple has put out <a href="https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html#//apple_ref/doc/uid/TP40014097-CH5-XID_399">Swift</a>, which sounds like a nice language overall.
Here is my flagrantly non-humble opinion about how its features line
up with what I consider modern, well-established aspects of
programming language design.
<h2>The good</h2>
<p>First off, Swift includes range-checked integer arithmetic! Unless you
explicitly ask for wraparound, any overflow will cause an exception. I
was <a href="https://plus.google.com/111102660583646544610/posts/QCCdxEwagpZ">just commenting yesterday</a> on what a tough problem this is for
current programming languages.
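<p>For comparison, here is what checked arithmetic looks like on the JVM, sketched in Scala. Math.addExact throws on overflow rather than silently wrapping:
<pre>
// Plain + silently wraps around: this yields -294967296.
val wrapped = 2000000000 + 2000000000

// Checked addition, in the spirit of Swift's default arithmetic:
// this line throws java.lang.ArithmeticException("integer overflow").
val sum = Math.addExact(2000000000, 2000000000)
</pre>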
<p>It has function types, nested functions, and closures, and it has
numerous forms of optimized syntax for closures. This is all
heartening, and I hope it will stick going forward, much the way
lexical variable binding has stuck. Closures are one of those things
that are very helpful and have little downside once your language has
garbage collection.
<p>Swift's closures can assign to variables in an outer scope. That's the
right way to do things, and I find it painful how much Java's
designers struggle with this issue. As a technical detail, I am
unclear what happens if a closure captures a variable but does not
modify it. What ought to happen is that any read from it will see the
latest value, not the value at the time the capture happened. However,
the Closures section of the Language Guide suggests that the compiler
will capture just the initial value in this case. I believe this is
misguided and will cause as many traps as it fixes; for example,
suppose the programmer captures a counter, but the closure itself never
increments it? The motto here should be: you don't know what the programmer
<em>meant</em>, but you know what they <em>wrote</em>.
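<p>Scala behaves the way I am arguing for. A quick sketch:
<pre>
// The closure captures the variable itself, not a snapshot of its
// value, so every read sees the latest value.
var counter = 0
val report = () => "counter is now " + counter  // captures, never increments

counter += 1
counter += 1
println(report())  // prints "counter is now 2", not "counter is now 0"
</pre>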
<p>Type inference is quite welcome. I don't know what more to say than
that developers will take advantage of it all the time, especially for
local variables.
<p>Tuple types are a small touch that comes up in practical programming
all the time. How many times have you wanted to return two values from
a function, and had to design a class for it or otherwise to pervert
your design?
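<p>For a concrete picture, here is the tuple version of such a function in Scala; the function itself is made up for the example:
<pre>
// Return two values without designing a one-off class for them.
def minMax(xs: List[Int]): (Int, Int) = (xs.min, xs.max)

val (lo, hi) = minMax(List(3, 1, 4, 1, 5))  // lo = 1, hi = 5
</pre>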
<p>Enumerations seem good to include in the language. Language designers
often seem to think that enums are already handled by other language
features, and therefore should not be included. I respect that, but in
this case, it's a simple feature that programmers really like to
use. Java's enums are baroque, and none of the several Datalog
dialects I have worked on include enums at all. I miss having language
support for a closed set of named integers. It's easy to support and
will be extremely popular.
<p>As an interesting trick, keyword arguments to functions are supported,
but you have to opt in. That's probably a good combination. Keyword
arguments are quite useful in cases where you have a lot of
parameters, and sometimes this legitimately happens. However, it's
unfortunate if you afflict all functions with keyword arguments,
because the keyword arguments become part of the API. By making it opt
in, the feature is there for the functions which can use it.
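<p>Scala's named arguments have a similar spirit, though they are always available rather than opt-in. A sketch, with a made-up function:
<pre>
// Named arguments shine when a function takes many parameters.
def connect(host: String, port: Int = 80, useTls: Boolean = false,
            timeoutMillis: Int = 5000): Unit = ()

// The call site documents itself:
connect(host = "example.com", useTls = true)
</pre>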
<p>Including both structs and classes looks initially redundant, but it's
quite helpful to have a value type that encompasses multiple other
values. As an example, the boxed Integer type on Java would be much
better as a struct than as a class.
<p>Extensions look valuable for programming in the large. They allow you
to make an existing class fit into a new framework, and they let you
add convenience methods to an existing class. Scala uses its implicit
conversions for extensions, but direct support for extensions also
makes a lot of sense.
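<p>For reference, here is the implicit-conversion version of an extension in Scala, using a made-up method:
<pre>
object Extensions {
  // An implicit class adds convenience methods to an existing type
  // without modifying it.
  implicit class IntOps(val n: Int) extends AnyVal {
    def squared: Int = n * n
  }
}

// import Extensions._   // then 3.squared == 9
</pre>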
<p>The way option chaining works is a nice improvement on Objective C. In
Objective C, any access to nil returns nil. In practice, programmers
are likely better off getting an error when they access nil, as a
form of design by contract: when something goes wrong, you want the
program to stop at that point, not some indefinite time later. Still,
sometimes you want nil propagation, and when you do, Swift lets you
just put a "?" after the access.
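<p>Scala expresses the same opt-in with Option, which gives a rough feel for the style; the types here are invented for the example:
<pre>
case class Address(city: String)
case class Person(address: Option[Address])

val p = Person(None)

// Absence propagates only where you thread it through explicitly.
val city: Option[String] = p.address.map(_.city)  // None, no crash

// When the value must be present, demand it and fail fast:
// p.address.get would throw here, stopping at the fault.
</pre>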
<p>Weak references are helpful for any language with automatic memory
management, but they look especially helpful in a language with
reference-counting memory management. I don't follow why there are
also the "unowned" references, except that the designers didn't want
your code to get polluted with ! dereferences. Even so, I would think
this is a case of do or do not do. If you are worried about !
pollution, which is a legitimate concern, then simply don't require
the !.
<p>As an aside, this is the reason I am not sure pervasive null is as bad
as often claimed. In practical code, there are a lot of cases where a
value is sometimes optional but, in a specific context, is known to be
present. In such a case, you are just going to deference it, and
possibly suffer a null-pointer check if you were wrong. As such,
programmers are guided into a style where they just insert dereferences
until the compiler shuts up, which makes the code noisy without increasing
practical reliability.
<h2>The dubious</h2>
<p>Swift looks very practical and workable, but there are some issues
I think could have been done better.
<p>Single inheritance seems like a step backward. The linearization style
of multiple inheritance has proven helpful in practice, and it
eliminates the need for a separate "interface" or "protocol"
feature. Perhaps designers feel like C++'s multiple inheritance went
badly, and are avoiding multiple inheritance like the plague? I used
to think that way, but it's been multiple decades since C++'s core
design. There are better designs for multiple inheritance nowadays.
<p>Swift doesn't appear to include persistent data structures. This is
the one feature I miss the most when I don't get to program in Scala,
and I don't know why it isn't catching on more widely in other
languages. Developers can add their own collection types, but since
the new types aren't standard, you end up having to convert to
standard types whenever you call into another library.
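<p>For anyone who hasn't used them, here is the kind of thing Scala's persistent collections make cheap:
<pre>
// Immutable lists share structure: base is neither copied nor changed.
val base   = List(2, 3)
val longer = 1 :: base   // List(1, 2, 3)
val other  = 0 :: base   // List(0, 2, 3); both reuse base

// base is still List(2, 3)
</pre>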
<p>The automatic immutability of collections assigned to constants looks
driven by the lack of persistent collections. It's better to support
both features independently: let variables be either mutable or not,
and let collections be mutable or not. All four combinations are very
useful.
<p>Deinitialization, also known as finalization, looks like a throwback
to me. In a system with automatic memory management, you don't want to
know precisely when your memory is going to get freed. As such, you
can't count on deinitializers running soon enough to be useful. Thus,
you always need a fallback plan of deallocating things manually. Once
you deallocate manually, though, deinitializers become just a
debugging technique. It's better to debug leaks using a tool than with
a language feature.
<p>In-out parameters seem like a step backwards. The trouble is that
<em>most</em> functions use only in parameters, so when you see a
function call, a programmer's default assumption is that the callee
will not modify the argument. It can lead to bad surprises if the
parameter gets modified at all. Out parameters are so unusual that
it's better to be explicit about them, for example by taking a mutable
collection as an argument.
<p>Custom precedence (and associativity) is likely to go badly. We discussed
this in detail, over the course of days, for X10, because X10 is a
scientific language that really benefits from a rich set of operators.
One problem with user-defined precedence is that it's hard to scope:
you want to scope the operators themselves, not their implementations,
because parsing happens before method lookup. It's also tough on
programmers if they have to learn a new precedence table for every file
of code they read. All in all, we concluded that Scala had a reasonable
set of trade-offs here: have a built-in precedence table with a huge number
of available operators, and make library designers simply choose from the
existing operators.
<p>I see no exceptions, which is likely to be a nuisance to programmers
if they are truly missing. Sometimes you want to tear down a whole
chunk of computation without exiting the whole program. In such cases,
exceptions work very well. Maybe I just missed it.
<p>Integer types are hard to get right, and I am not sure Swift has
chosen a great approach. It's best to avoid unsigned types, and
instead to have untyped operations that can apply to typed
integers. It's also best to avoid having low-precision operations,
even if you have low-precision storage. Given all of the above, you
don't really need explicit conversions any more. Java's integer design
is quite good, with the exception of the unnecessary char type that is
not even good for holding characters. I suspect many people overlook
this about Java, because it's a case where programmers are better off
with a language with fewer features.Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com3tag:blogger.com,1999:blog-5479191305093780981.post-52741468552621670802014-01-18T14:06:00.000-08:002014-01-18T14:06:41.076-08:00Is Internet access a utility?<p><a href="https://plus.google.com/111102660583646544610/posts/hWRrQy5ha47">I forwarded a link</a> about Network Neutrality to Google Plus, and it got a lot of comments about how Internet access should be treated like a utility. I think that's a reasonable perspective to start with. What we all want, I think, is to have Internet access itself be a baseline service, and that Internet services on top of it have fierce competition.
<p>In addition to considering the commonalities with Internet access and utilities, we should also note the differences.
<p>One difference is that a utility is for a monopoly, but Internet access is not monopolized. You can only put one road in any physical location, and I will presume for the sake of argument that you don't want to have multiple power grids in the same locale. Internet access is not a monopoly, though! At least in Atlanta, we have cable, DSL, WiMax, and several cellular providers. We have more high-speed Internet providers than supermarket chains.
<p>Another difference is that utilities lock down technology change to a snail's pace. With roads and power grids, the technology already provides near-maximum service for what is possible, so this doesn't matter. With telephony, progress has been locked down for decades, and I think we all lost out because of that; the telephone network could have been providing Skype-like services a long time ago, but as a utility they kept doing things the same way as always. Meanwhile, the Internet is changing rapidly. It would be really bad to stop progress on Internet access right now, the way we did with telephony several decades ago.
<p>I believe a better model than utilities would be supermarkets. Like Internet providers, supermarkets carry a number of products that are mostly produced by some other company. I think it has gone well for everyone that supermarkets have tremendous freedom in their content selection, pricing, promotional activities, hours, floor layout, buggies, and checkout technology.
<p>In contrast to what some commenters ask, I do not have any strong expectation about what Comcast will or won't try. I would, however, like them to be free to experiment. I've already switched from Comcast and don't even use them right now. If Comcast is locked into their current behavior, then that does nothing for me good or bad. If they can experiment, maybe they will come up with something better.
<p>In principle, I know that smart people disagree on this, but I currently don't see anything fundamentally wrong with traffic shaping. If my neighbor is downloading erotica 24/7, then I think it is reasonable that Comcast give my Game of Thrones episode higher priority. The fact that Comcast has implemented this badly in the past is troubling, but that doesn't mean the next attempt won't work better. I'd like them to be free to try.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-9217341084175021942013-11-11T05:44:00.000-08:002013-11-11T05:44:02.239-08:00It's ad targeting, isn't it?I see <a href="https://plus.google.com/u/0/photos/+LauraManach/albums/5944586953469222913/5944586954248454306?pid=5944586954248454306&oid=114096099683887759200">continued assumptions</a> by people that the real names policies of Facebook and Google Plus have actual teeth.
<p>I've posted before on whether <a href="http://blog.lexspoon.org/2011/11/dana-boyd-on-pseudonyms.html">real names are truly enforced on Facebook</a>, and it looks like the answer there is no. My impression is that it's not working great on Plus, either, although there have <a href="https://twitter.com/WilliamShatner/status/92839457866264576">been some famous botched efforts</a>.
<p>The rationale that it improves the level of discussion seems thin and inaccurate. There are too many legitimate reasons to participate in a forum while not wanting that participation to pop up when your boss does a Google search on your name.
<p>As far as I can tell, the main purpose of a real names policy is to appease advertisers. Advertisers feel, probably correctly, that more information about users will improve the accuracy of ad targeting. It's weird, though, because nobody seems to talk about it that way. It's analogous to the exhortations in a hotel room that it's good for the environment to avoid washing so many towels. Ummm, I'm pretty sure it's more about the money.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com2tag:blogger.com,1999:blog-5479191305093780981.post-7757932673204108632013-06-16T08:37:00.000-07:002013-06-16T08:37:22.449-07:00When to best use type inferenceType inference can make code much better. It can save you from
writing down something that is completely obvious, and thus
a total waste of space. For
example, type inference is helpful in the following code:
<pre>
// Type inference
val date = new Date
// No type inference
val date: Date = new Date
</pre>
It's even better for generics, where the version without type
inference is often absurd:
<pre>
// Type inference
val lengths: List[Int] =
names.map(n => n.length).filter(l => l >= 0)
// No type inference
val lengths: List[Int] =
names.map[Int, List[Int]]((n: String) => n.length).
filter((l: Int) => l >= 0)
</pre>
When would a type not be "obvious"? Let me describe two scenarios.
<p>First, there is obvious to the reader. If the reader cannot tell what
a type is, then help them out and write it down. Good code is not
an exercise in swapping puzzles with your coworkers.
<pre>
// Is it a string or a file name?
val logFile = settings.logFile
// Better
val logFile: File = settings.logFile
</pre>
Second, there is obvious to the writer. Consider the following
example:
<pre>
val output =
if (writable(settings.out))
settings.out
else
"/dev/null"
</pre>
To a reader, this code is obviously producing a string. How about to
the writer? If you wrote this code, would you be sure that you wrote
it correctly? I claim no. If you are honest, you aren't sure what
settings.out is unless you go look it up. As such, you should write it
this way, in which case you might discover an error in your code:
<pre>
val output: String =
if (writable(settings.out))
settings.out // ERROR: expected String, got a File
else
"/dev/null"
</pre>
Languages with subtyping all have this limitation. The compiler can
tell you when an actual type fails to satisfy the requirements
of an expected type. However, if you ask it whether two types can
ever be used in the same context as each other, it will always say yes,
they could be used as type Any. ML and Haskell programmers are cackling
as they read this.
<p>It's not just if expressions, either. Another place this issue crops
up is in collection literals. Unless you tell the compiler what kind
of collection you are trying to make, it will never fail to find a type for
it. Consider this example:
<pre>
val path = List(
"/etc/scpaths",
"/usr/local/sc/etc/paths",
settings.paths)
</pre>
Are you sure that settings.paths is a string and not a file? Are you
sure nobody will change that type in the future, and have you considered
what type check errors they would get if they did? If you aren't sure, you should write down the
type you are trying for:
<pre>
val path = List[String](
"/etc/scpaths",
"/usr/local/sc/etc/paths",
settings.paths) // ERROR: expected String, got a File
</pre>
Type inference is a wonderful thing, but it shouldn't be used to
create mysteries and puzzles. In code, just like in prose, strive to
say the interesting and to elide the obvious.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com3tag:blogger.com,1999:blog-5479191305093780981.post-74851914227622103352013-04-21T08:34:00.000-07:002013-04-21T09:12:19.892-07:00Google Voice after several months<p>I've been exclusively using Google Voice for months now, and just for voice mail for more than a year. I feel like the plain-old telephone system (POTS) is an unreasonably high toll to pay given how technology has improved. There is no reason to have non-trivial long-distance rates between Europe and the U.S. in a day when Skype does it for around a penny a minute. Google is doing a wonderful thing by promoting an Internet-based phone number.
<p>Rather, Google is <em>starting</em> a wonderful thing. In the time I have used it, many of the most obvious problems haven't improved in the slightest.
<p>Here's a quick run down of the good and the bad as I see it. Overall, I see it as comparable to my two-year stint <a href="http://blog.lexspoon.org/2010/11/mac-computers-for-programmers.html">using a Mac for software development</a>. The promise is there, but when you actually try it, you realize why it's not yet the norm.
<h2>The Good</h2>
<p>I love receiving calls and having all my devices ring. In 2013, it's the way things ought to work. If I'm in the car, my car stereo should ring. If I'm at my desk, I should get a notification on my desktop. If I'm watching TV, my physical phone should ring. Google Voice gets this just right.
<p>I love the option to take calls at my desk. I already do a lot of voice chat sessions with coworkers around the world, and it just seems right that I should do the same thing with gmail addresses and numeric phone numbers.
<p>I love the text transcription of voice mails, for those times I can't take a call immediately. The quality is iffy but is usually high enough that I can understand the gist of what the person was telling me.
<p>Phone number porting works just fine, so you can keep your pre-existing number and not even tell people you are using Google Voice. Well, you have to tell them for a different reason: there is so much bad with Google Voice that you need to warn your potential calling partners about how gimped your phone service is.
<h2>The Bad</h2>
There's a lot of bad.
<p>It doesn't work over data connections. I really don't understand why it is missing. Because of this problem, I have to buy minutes on the POTS to use it on my cell phone, and minutes are far more expensive than the associated data cost. More pragmatically, if I am travelling and don't yet have a local SIM card, it means I cannot use my phone to call over a wifi network.
<p>You can't make or take calls directly from the Voice web page. You have to log into both Google Talk and Google Voice, configure Voice to talk to Talk, and then make your call from Voice. Yes, you can also make a call from Talk directly, but that's a separate feature of the Google suite, thus confusing matters even further. Google is normally excellent at building web user interfaces, but that seems to go down the tubes when an issue crosses multiple teams.
<p>When you make a call at your desk, using Talk, the volume is extremely low. I originally thought that was just my configuration, but some web searching indicates that this has been a widespread problem for several years. I have to turn up my system volume to the max just to barely hear the other person, at which point every random system notification is an ear splitter.
<p>It doesn't support phone usage from the UK. This is a very surprising restriction, because Talk can make calls to the UK just fine. Part of the benefit of Voice for me is the promise that I can travel around and call POTS numbers from wherever I am. However, even if I get a UK SIM card, it's just not supported by Voice.
<p>There is no MMS, and there is no warning on either side when an attempted MMS does not go through. I have to tell people to use email, or to use my physical cell phone number rather than my Google Voice number. If Mom texts me a photo of one of my nieces, it quietly disappears. I am oblivious, and she is wondering what planet I am on that I didn't write back.
<h2>The Ugly</h2>
<p>The ugly part is that Google is not doing anything to fix all of this. I'm willing to be a beta tester in this case. It's not beta testing, though, if they never fix the problems.
<p>At this point, the POTS tax is substantially higher than the Microsoft tax of yore. It costs tens of dollars a month to participate, and you can't live without it.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0tag:blogger.com,1999:blog-5479191305093780981.post-15621244832826912912013-03-23T08:42:00.000-07:002013-03-23T13:08:06.998-07:00C compilers exploiting undefined behaviorIt's getting out of hand the way C compilers exploit undefined behavior. I see via John Regehr's blog that there is a SPEC benchmark <a href="http://blog.regehr.org/archives/918">being turned into a noop via an undefined-behavior argument</a>.
<p>This isn't what the spec writers had in mind when they added undefined behavior. To fix it, Regehr's idea of <a href="http://blog.regehr.org/archives/748">having extra checkers</a> to find such problems is a plausible one, though it will take a dedicated effort to get there.
<p>An easier thing to do would be for gcc and Clang to stop the madness! If they see an undefined behavior bullet in their control-flow graphs, then they should leave it there, rather than assuming it won't happen and reasoning backward. This will cause some optimizations to stop working, but really, C compilers were already plenty good 10 years ago. The extra level of optimizations is not a net win for developers. Developers want speed, sure, but above all they want their programs to do what they look like they do.
<p>It should also be possible to improve the spec around this, to pin down what undefined behavior means a little more specifically. For example, left-shifting into the sign bit of a signed integer is undefined behavior. That's way underspecified. The only real options are: shift into the sign bit as expected, turn the integer into unpredictable garbage, or throw an exception. As things stand, a C compiler is allowed to observe a bad left shift and then turn your whole program into a noop.
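<p>For contrast, consider how the JVM handles the same operation: it picks one behavior and defines it, so compilers cannot reason backward from it. A one-line Scala illustration:
<pre>
// Shifting into the sign bit is defined two's-complement behavior
// on the JVM, not undefined behavior as in C.
val x = 1 << 31   // always Int.MinValue (-2147483648)
</pre>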
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com3tag:blogger.com,1999:blog-5479191305093780981.post-8785410352660414532013-01-31T17:33:00.000-08:002013-01-31T17:33:19.248-08:00The "magic moment" for IPv6The Internet has undergone many large changes in the protocols it uses. A few examples are: the use of MIME email, the replacement of Gopher by HTTP, and the use of gzip compression within HTTP. In all three of these examples, the designers of the protocol upgrades were careful to provide a transition plan. In two out of the three examples (sorry, Gopher), the old protocol is still practical to use today, if you can live with its limitations.
<p>Things are going differently for IPv6. In thinking about why, I like Dan Bernstein's description of <a href="http://cr.yp.to/djbdns/ipv6mess.html">a "magic moment" for IPv6</a>. It goes like this:
<blockquote>
The magic moment for IPv6 will be the moment when people can start relying on public IPv6 addresses as replacements for public IPv4 addresses. That's the moment when the Internet will no longer be threatened by the IPv4 address crunch.
</blockquote>
<p>Note that Dan focuses on the address crunch. Despite claims to the contrary, I believe most people are interested in IPv6 for its very large address space. While there are other cool things in IPv6, such as built-in encryption and simplified fragmentation, they are not enough that people would continue to lobby for IPv6 after all these years. The address crunch is where it's at.
<p>While I like Dan's concept of a magic moment, I think the above quote asks for too much. There are some easier magic moments for individual kinds of nodes on the computer, and some might well happen before others. Let me focus on two particular kinds of Internet nodes: public web sites and home Internet users.
<p>How close is the magic moment for web sites? Well, web servers can discard their IPv4 addresses just as soon as the bulk of the people connecting to them all have IPv6 connectivity. I do not know how to gather data on that, but as a spot point, I have good networking hardware but cannot personally connect to IPv6 sites. My reason is both mundane and common: I am behind a Linksys NATing router, and that router does not support IPv6. Even if it did, it does not support any sort of tunneling that would allow my local computer to connect to an IPv6-only web server. To the extent people are using plain old Linksys routers, we are a long way away from the magic moment for web servers.
<p>How about for home users? Well, it's the other way around for home users: home users can switch once the majority of public web sites have an IPv6 address. This status is easier to gather data on. I just looked up the top ten web sites (according to <a href="http://www.alexa.com/topsites">Alexa's Top 500 Web Sites</a>) and checked them with a publicly available IPv6 validation site (<a href="http://ipv6-test.com/validate.php">http://ipv6-test.com/validate.php</a>). Of the top ten web sites, only four can be reached from an IPv6-only client: Google, Facebook, YouTube, and Wikipedia. The other six still require IPv4: Yahoo, Baidu, Live.com, Amazon, QQ.com, and Twitter. As things stand, we are also a long way from when home users can switch to IPv6-only.
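<p>For anyone who wants to repeat the experiment, here is a rough Scala sketch of the per-site check, using the JVM resolver. Whether AAAA records come back can depend on the local resolver configuration, so treat the result as approximate:
<pre>
import java.net.{Inet6Address, InetAddress}

// Does this host publish any IPv6 address?
def hasIPv6(host: String): Boolean =
  InetAddress.getAllByName(host).exists(_.isInstanceOf[Inet6Address])

// Example: hasIPv6("www.google.com")
</pre>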
<p>Overall, this was a cursory analysis, but I think these "magic moments" are a helpful framework for thinking about the IPv6 changeover. Unfortunately, this framework currently indicates that we are nowhere close.
Lex Spoonhttp://www.blogger.com/profile/13859632965228608649noreply@blogger.com0