#275 Optimization advice

cbeust Sat 5 Jul 2008

Hi everyone,

I recently posted a coding challenge and I just ported the fastest solution to date to Fan.

The Java code runs in 0.4 ms while Fan takes about 2.4 seconds to complete (for a max of 10000000000).

I don't expect Fan to be as fast as Java, but I was wondering if there were any obvious optimizations that could be done on this code:

class Digit {

  Int value
  Digit previous
  Digit next

  /** Constructs digit with given value as well as all subsequent digits. */
  static Digit makeDigit(Digit previous, Int value) {
    result := Digit.make()
    result.value = value
    result.previous = previous

    // Recursively construct subsequent digits.
    if (value < 9) result.next = makeDigit(result, value + 1)
    return result
  }

  /** Removes digit from the set. */
  Void use() {
    if (previous != null) previous.next = next
    if (next != null) next.previous = previous
  }

  /** Puts digit back into the set. */
  Void yield() {
    if (previous != null) previous.next = this
    if (next != null) next.previous = this
  }
}

class FastBeustSequenceFan
{
  Int mCounter := 0
  Int mPrevious := 0
  Int mMaxSkip := 0

  Void findAll(Int max) {
    zero := Digit.makeDigit(null, 0)
    Digit one := zero.next
    for (length := 1; length <= 10; length++) {
      if (find(one, zero, length, 0, max)) return
    }
  }

  Bool find(Digit start, Digit head, Int remaining,
      Int value, Int max)
  {
    for (current := start; current != null; current = current.next)
    {
      newValue := value + current.value
      if (remaining == 1) {
        if (newValue > max) return true
        emit(newValue)
      } else {
        current.use()
        newHead := (current == head) ? head.next : head
        if (find(newHead, newHead, remaining - 1, newValue * 10, max))
          return true
        current.yield()
      }
    }
    return false
  }

  Void emit(Int value)
  {
    Int d := value - mPrevious
    if (d > mMaxSkip) mMaxSkip = d
    mPrevious = value
    mCounter++
  }

  Void main() {
    max := Sys.args.size > 0 ? Sys.args[0].toInt() : 10000

    start := DateTime.now().ticks
    findAll(max)
    end := DateTime.now().ticks
    echo("Total: ${mCounter} maxSkip:${mMaxSkip} time:${(end-start)/1000000} ms")
  }

}

tompalmer Sat 5 Jul 2008

I don't have time to dig into the bytecode at the moment, but my guess is the arithmetic here. I quickly hacked up some matrix multiplication a couple of weeks ago, and Java was much faster than Fan there too. I did find for loops much faster than each (and I see you've got for here), but it was still way slower than Java.

I think the easiest way to get good arithmetic speed would be to support non-null types and define all Floats and Ints as interned. Then any non-null Int or Float could really be a primitive (in Java, a long or double) behind the scenes without changing its external appearance as an object.
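Something along the lines of the small-value cache that java.lang.Long.valueOf already maintains could be a starting point. A minimal Java sketch of the interning idea (class and method names are hypothetical):

// Cache one boxed object per small value, so repeated boxing of the same
// number returns the identical instance, as Long.valueOf does for -128..127.
final class IntCache {
  private static final Long[] CACHE = new Long[256];
  static {
    for (int i = 0; i < 256; i++) CACHE[i] = Long.valueOf(i - 128);
  }
  static Long intern(long v) {
    if (v >= -128 && v <= 127) return CACHE[(int) v + 128];
    return Long.valueOf(v);
  }
}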

Failing that, you might still be able to do clever optimizations when generating local loops: the compiler could notice that a value can really be a primitive because it never turns null and no one cares about its identity. But I don't think that would be so easy when the logic spans multiple methods, classes, or files. Maybe JVM and CLR implementations will get better with time, but I have my concerns about expecting the underlying platform to fix it.

And I personally think that arithmetic performance is vital. If I have to code in Java and C# anytime I want to write my own fancy algorithm, that will define some limits on what the Fan platform can be used for. (Understood that multiple external implementations will still be needed anyway to get at native features. Hard to get around that all the way. I'm just talking algorithms here.)

brian Sun 6 Jul 2008

Cedric,

I don't see any obvious optimizations - Int.times and Range.each will perform slightly better than a for loop - but I doubt it's enough to make a big difference.

In general Fan will run at the same speed as Java, except in numeric performance. No question about it - using boxed numbers is a major trade-off. Numeric performance will be slower; that's the price of keeping the type system purely OO. In general this seems like a good compromise - for example, a typical application will probably spend far more time in a single database query than crunching numbers. But Fan isn't going to perform well on numeric micro-benchmarks compared to languages with primitive support like Java.
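To make the trade-off concrete, here is an illustrative Java-level comparison (not Fan's actual emitted code) of the same loop with boxed and with primitive arithmetic:

// Illustrative only: what boxed arithmetic costs compared to primitives.
static Long sumBoxed(Long n) {
  Long sum = 0L;
  for (Long i = 0L; i < n; i++)  // unbox to compare, re-box on ++
    sum += i;                    // unbox both sides, re-box the result
  return sum;
}

static long sumPrimitive(long n) {
  long sum = 0;
  for (long i = 0; i < n; i++) sum += i;  // stays in registers, no allocation
  return sum;
}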

What Fan does right now for numeric computation is pretty primitive - the simplest thing that works. So there is a lot of opportunity to make things faster. The JVM complicates our lives a bit because 64-bit numbers require special double-wide stack manipulation, and the verifier prevents tricks like passing 31-bit integers in place. But there is still a lot we can do at the emit phase to generate more efficient bytecode. I also think this will turn out to be an area where we get some JVM support in the coming years. The obvious first optimization is to perform escape analysis and do everything with raw primitives on the local stack (if anyone wants to contribute, that would be a good project).

Another thing to consider is that the raw speed of a single thread is going to matter less and less in the future. Rather, overall performance is going to be based on how well you do multi-threading across multiple cores. This is really where I'm putting a lot of emphasis for Fan. WideFinder was a good example of this - overall speed was almost completely determined by how well the multiple cores were utilized.
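As a rough Java illustration of that point (not Fan code, and the names are hypothetical), the same numeric work split across worker threads:

import java.util.*;
import java.util.concurrent.*;

class ParallelSum {
  // Sum 0..n-1 by partitioning the range across a fixed thread pool.
  static long sum(long n, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<Long>> parts = new ArrayList<>();
    long chunk = n / threads;
    for (int t = 0; t < threads; t++) {
      long lo = t * chunk;
      long hi = (t == threads - 1) ? n : lo + chunk;
      parts.add(pool.submit(() -> {
        long s = 0;
        for (long i = lo; i < hi; i++) s += i;
        return s;
      }));
    }
    long total = 0;
    for (Future<Long> f : parts) total += f.get();  // combine partial sums
    pool.shutdown();
    return total;
  }
}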

tompalmer Mon 7 Jul 2008

Makes sense that each would be faster than for if it affects the amount of arithmetic. I thought I saw for being faster, but I was doing quick experiments, so I probably just made a mistake.

Thanks for the reminder on the double-wide stack issue.

alexlamsl Tue 8 Jul 2008

If we use the JSR166y approach of setting things up first and evaluating them later, operations could reach native (Java / .NET) speed, since we could then implement the API in native code.

brian Tue 8 Jul 2008

If we use the JSR166y approach of setting things up first and evaluating them later

Can you explain that further? Not sure I get it.

alexlamsl Tue 8 Jul 2008

I admit that was a bit random, but basically I'm referring to delayed execution:

mixin Expression {
  // ...
  Sum plus(Expression addend) {
    return Sum.make(this, addend)
  }
  abstract Int value()
}

class Number : Expression {
  Int num
  new make(Int num) { this.num = num }
  override Int value() { return num }
  override Str toStr() { return num.toStr }
}

class Sum : Expression {
  Expression addend1
  Expression addend2
  new make(Expression a, Expression b) { addend1 = a; addend2 = b }
  override Int value() { return addend1.value + addend2.value }
  override Str toStr() { return "$addend1 + $addend2" }
}

b := Number.make(10)
c := Number.make(5)
a := b + c
a.toStr         // "10 + 5"
a.value.toStr   // "15"

Now if we implement Sum natively, we should get good performance once the JVM / CLR inlines it.
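For illustration, a plain-Java sketch of what such a native backing might look like (all type names hypothetical); small final value() methods over primitives like these are exactly what the JIT is good at inlining:

interface Expr { long value(); }

final class Num implements Expr {
  final long v;
  Num(long v) { this.v = v; }
  public long value() { return v; }
}

final class Sum implements Expr {
  final Expr a, b;
  Sum(Expr a, Expr b) { this.a = a; this.b = b; }
  // Monomorphic calls over final classes: easy inlining targets for the JIT.
  public long value() { return a.value() + b.value(); }
}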

brian Tue 8 Jul 2008

Now if we implement Sum natively, we should get good performance once the JVM / CLR inlines it.

I think I see where you are coming from, although I'm not sure how much that would really buy us. Lazy evaluation buys us performance when we don't actually need the result some of the time, but in Fan's case the performance issue is that we do need the value, yet we are dereferencing the boxed values and creating new Int/Float instances all the time.

It would be really hard to pass non-boxed numbers between methods due to how the Java verifier works and how double-wide stack items work. But within a method we could definitely optimize the local variables. Consider something like this:

Int f(Int a, Int b) { return a*b + a/b }

This translates into something like:

return Int.make(a.val*b.val) + Int.make(a.val/b.val)

The two intermediate Ints aren't really needed and we could actually optimize this in bytecode to this:

return Int.make(a.val*b.val + a.val/b.val)
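In Java terms, the optimized form would look roughly like this (illustrative only, not the actual emit):

static Long f(Long a, Long b) {
  long av = a.longValue();             // unbox each argument once
  long bv = b.longValue();
  return Long.valueOf(av*bv + av/bv);  // box only the final result
}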

tompalmer Wed 9 Jul 2008

I think both lines might be fruitful. Come to think of it, it might even be possible to push the expression tree concept down to the GPU level (just like LINQ can be pushed to SQL). By the way, is there an API for access to the fcode for an expression, closure, and/or class? And how stable is fcode?

brian Wed 9 Jul 2008

By the way, is there an API for access to the fcode for an expression, closure, and/or class? And how stable is fcode?

You can load fcode via the compiler::FPod API. Look at the code in Fanp to see how it is used in the disassembler.

Fcode is quite stable - keeping it stable is the only way to continue bootstrap compiling.

Fcode is basically stack-based bytecode, though, so it won't help much with runtime evaluation of expressions. However, I do have some LINQ-like ideas for querying the namespace that would provide more runtime access to the AST.

tompalmer Wed 9 Jul 2008

I agree that an AST API would be easier than parsing and making sense of fcode for LINQ-style things. I just wasn't sure if that was in your plans. (And I understand from alexlamsl's code that neither is absolutely necessary; it can still be more convenient than writing extra classes designed to capture such things.)
