GNU bug report logs - #14158
performance: Base64 -d is slow

Previous Next

Package: coreutils;

Reported by: Ole Tange <ole <at> tange.dk>

Date: Mon, 8 Apr 2013 07:39:02 UTC

Severity: wishlist

Tags: confirmed

To reply to this bug, email your comments to 14158 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-coreutils <at> gnu.org:
bug#14158; Package coreutils. (Mon, 08 Apr 2013 07:39:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ole Tange <ole <at> tange.dk>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 08 Apr 2013 07:39:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ole Tange <ole <at> tange.dk>
To: bug-coreutils <at> gnu.org
Subject: Base64 -d is slow
Date: Mon, 8 Apr 2013 09:26:04 +0200
I was astonished to learn that:

  perl -MMIME::Base64 -e 'while(read(STDIN,$buf,770*50)){print
decode_base64($buf)}'

is faster than:

  base64 -d

My C-skills are quite limited, but it seems the current implementation
does a lot of computation and not table lookups.

Below are my thoughts on how this could be improved.


/Ole


= Designing a fast Base64 decoder =

By: Ole Tange <tange <at> gnu.org>

== Goal ==

We want something that is parallelizable since that can take advantage of:

* Multiple cores/CPUs
* Vector processing (MMX/SSE)
* Processors with pipelines

== Observations ==

In Base64 every 4 bytes represent 3 bytes of decoded information.

Binary input: aaaaaaaa bbbbbbbb cccccccc dddddddd
=>
Base64 index: 111111 222222 333333 444444
=>
Decoded: 11111122 22223333 33444444

Given input byte 1 and 2 we can determine output byte 1.

Given input byte 2 and 3 we can determine output byte 2.

Given input byte 3 and 4 we can determine output byte 3.

A way to make an algorithm parallelizable is by not making
computations based on previous results, but only base the computation
on the input.

== Idea ==

Make a lookup table for the output byte.

By making a 16-bit lookup table for the first 2 bytes, the middle 2
bytes and the last 2 bytes, conversion of 4 byte to 3 bytes is fast
and independent of previous computations.

The lookup table can also deal with the conversion to Base64 index -
thus skipping this step. It can even deal with all the non-conflicting
variations of which char codes for index 62 and 63.

  out_byte1 = first[in_byte1][in_byte2];
  out_byte2 = middle[in_byte2][in_byte3];
  out_byte3 = last[in_byte3][in_byte4];

Each lookup table will at most be 64kbytes = 192kbytes total.

== Implementation ==

Here is the implementation in pseudo code:

byte compute_first_byte(int i, int j) {
  /* Return the decoded value of the first byte, given the first 2
bytes are i and j */
}

byte compute_middle_byte(int i, int j) {
  /* Return the decoded value of the middle byte, given the middle 2
bytes are i and j */
}

byte compute_last_byte(int i, int j) {
  /* Return the decoded value of the last byte, given the last 2 bytes
are i and j */
}

/* Lookup tables */
static byte first[256][256];
static byte middle[256][256];
static byte last[256][256];
static bool already_computed = 0;

void decode(byte* in, int len, byte* out) {
  /* in = pointer to input bytes with no garbage (\n and similar)
     len = length of input (must in 4 byte blocks)
     out = pointer to output buffer of length 3/4*len
  */

  /* Initialize lookup tables */
  if(! already_computed) {
    parallel_for(int i = 0; i < 256; i++) {
      parallel_for(int j = 0; j < 256; j++) {
        first[i][j] = compute_first_byte(i,j);
        middle[i][j] = compute_middle_byte(i,j);
        last[i][j] = compute_last_byte(i,j);
      }
    }
    already_computed = 1;
  }

  /* Lookup byte1+2, byte2+3, byte3+4 */
  parallel_for(out_idx = 0, in_idx = 0; in_idx < len; in_idx += 4,
out_idx += 3) {
    out[out_idx] = first[ in[in_idx] ][ in[in_idx+1] ];
    out[out_idx+1] = middle[ in[in_idx+1] ][ in[in_idx+2] ];
    out[out_idx+2] = last[ in[in_idx+2] ][ in[in_idx+3] ];
  }
}

== Improvement ideas ==

Use SIMD to do multiple 4-byte blocks in parallel - maybe a line at a
time?




Information forwarded to bug-coreutils <at> gnu.org:
bug#14158; Package coreutils. (Mon, 08 Apr 2013 20:02:02 GMT) Full text and rfc822 format available.

Message #8 received at 14158 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: 14158 <at> debbugs.gnu.org
Subject: tag and classify
Date: Mon, 8 Apr 2013 13:58:08 -0600
severity 14158 wishlist
tag 14158 + confirmed
thanks




Severity set to 'wishlist' from 'normal' Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Mon, 08 Apr 2013 22:01:01 GMT) Full text and rfc822 format available.

Added tag(s) confirmed. Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Mon, 08 Apr 2013 22:01:01 GMT) Full text and rfc822 format available.

Changed bug title to 'performance: Base64 -d is slow' from 'Base64 -d is slow' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 19 Oct 2018 01:47:02 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 184 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.