Bitwise Operators

Apr 6, 2019, 4:11 PM

Tags: javaee
  1. Java Hiccups
  2. Bitwise Operators
  3. Java Grab Bag 2
  4. Java Travelogue: The Care and Feeding of Locales
  5. More Notes on Filesystem and Charset Portability

In the "Java Hiccups" post, I briefly mentioned this little tidbit:

(short)Integer.MAX_VALUE == -1   // true

This fact betrays a lot about how numbers are stored internally in most languages. Some of those details, like the fact that the specific method is called two's complement, are fully in the range of "computer science" stuff - but it can be handy to know about a couple ways you can put this sort of thing to use.

For our purposes, let's work with the short data type in Java, because it's painfully important to Notes. "Short" is called such because it's smaller than a standard integer type in a given language. In some languages, like C, the length of core data types like this is sometimes tied to the bitness of the processor, but in Java they're consistent across everything. Specifically, short in Java is 16 bits long. Let's take the number 21,038, which is represented in binary as:

0101 0010 0010 1110

That's two bytes, and each byte is broken up into two groups of four bits each because it turns out to be helpful visually and because each "nibble" there can be matched to a single hexadecimal character (522E in this case). This is the most common way you'll see binary written when you start getting into this sort of thing.

So I mentioned casting in the previous post, and what casting does with primitive integer types like this is to chop off the bits that don't fit. If you cast the above value into a byte, it'll chop off all but the ending 8 bits, giving you:

0010 1110

Which is 46. If you cast it to a larger type, such as int (32 bits in Java), it just adds some zeros to the front:

0000 0000 0000 0000 0101 0010 0010 1110

Which is still 21,038, but now it takes up twice as much memory.

For our uses, this stuff matters immediately in two ways: it represents the limits of some data storage and it allows us to manipulate numbers with bitwise operators.

Storage Limits

If you max out the bits in a 16-bit integer, you get 1111 1111 1111 1111, and this value is either -1 or 65,535 (64k) depending on whether or not the value is considered "signed" by the code using it. When it's "unsigned", a 16-bit integer can represent between 0 and 65,535; when it's "signed", its range is -32,768 to 32,767 (32k). Java only has "signed" types, which is why Short.MAX_VALUE is 32k and not 64k. C, on the other hand, lets you choose in your code which type you want to use.

The Notes C API uses a couple type names to represent consistent sizes across different platforms, and the one that's important here is WORD: a 16-bit unsigned integer. This is used in, for example, the function used to set the value of a text item in a note:

NSFItemSetText(
  NOTEHANDLE  hNote,
  const char far *ItemName,
  const char far *ItemText,
  WORD  TextLength)

That WORD at the end there is where you tell the API how much text you're providing, and is thus capped at 64k - and is why you can store 60k of text in a non-summary text field but not 70k. As to why summary data is limited to 32k and not 64k, I'm guessing that either there's a signed value in there somewhere or that last bit was needed for something else.

If you look over the list of Notes limits, you can see this reflected all over the place, along with a few 8-bit (0-255) limits for good measure.

Bitwise Operators (for real this time)

The term "bitwise" refers to a handful of operators that deal directly on the bits of a number. Java takes a cue from C here and provides &, |, ^, ~, <<, >>, and >>>. These all have their uses, primarily for really-low-level stuff and efficiency, but we mainly care about & and |. These two may have bitten you in the past because of their similarity to && and || - and in particular because Formula Language uses the single-character versions to mean what Java means by the double-character ones. They also kind of work the same way, but aren't really the same.

To see how they work, let's take our original starting number, 21,038, and pair it up with another value, 5,000:

0101 0010 0010 1110
0001 0011 1000 1000

Visualizing the numbers in binary and stacked like this comes in extremely handy when it comes to dealing with bitwise operators. We'll start with &, or "bitwise AND". What this operator does is take two numbers and return a number where the matching slots in each one both contain a 1. So, in this case, it results in this internal math:

0101 0010 0010 1110
0001 0011 1000 1000
===================
0001 0010 0000 1000

Which corresponds to 4,616.

The | operator, or "bitwise OR", will provide a result where either of the original numbers has a 1 in the slot. In this case:

0101 0010 0010 1110
0001 0011 1000 1000
===================
0101 0011 1010 1110

Or 21,422.

It's pretty rare that you want these operators for the actual numerical values, though. What they're pretty much always used for are "bit fields", an extremely-efficient way to store a set of related flags. Going back to the Notes API, the NSFItemAppend function (a generic way to add an item value) has a parameter WORD Flags, that matches up to a number of properties that you can assign to an item. In the C source, they're written as hexadecimal values, but I've added in the binary versions here:

#define	ITEM_SIGN          0x0001   // 0000 0000 0000 0001
#define	ITEM_SEAL          0x0002   // 0000 0000 0000 0010
#define	ITEM_SUMMARY       0x0004   // 0000 0000 0000 0100
#define	ITEM_READWRITERS   0x0020   // 0000 0000 0010 0000
#define	ITEM_NAMES         0x0040   // 0000 0000 0100 0000
#define	ITEM_PLACEHOLDER   0x0100   // 0000 0001 0000 0000
#define	ITEM_PROTECTED     0x0200   // 0000 0010 0000 0000
#define	ITEM_READERS       0x0400   // 0000 0100 0000 0000
#define ITEM_UNCHANGED     0x1000   // 0001 0000 0000 0000

So you can see that each flag there has a distinct slot filled in that each other doesn't (you can also see, like missing teeth, the bits that are probably obsolete or undocumented features). If you want to create an item that is a signed, sealed, summary, readers item, you'd do this:

WORD flags = ITEM_SIGN | ITEM_SEAL | ITEM_SUMMARY | ITEM_READERS

…which will result in this bit of internal math:

0000 0000 0000 0001
0000 0000 0000 0010
0000 0000 0000 0100
0000 0100 0000 0000
===================
0000 0100 0000 0111

This is extremely efficient, both because you can store a bunch of information in just 16 bits, but also because working with these bit fields is baked into processors at the lowest level. There's not much you can do on a computer that's faster, in fact.

When you have a value like this, you can query it for whether or not a given value is set by using the AND operator:

flags & ITEM_SUMMARY

This will result in:

0000 0000 0000 0100

…which is the same as the value of ITEM_SUMMARY itself, since that one slot is the only bit in common. Conversely, if you check:

flags & ITEM_PROTECTED

…you'll end up with zero, since the flag doesn't match.

Use in Java

Most of the time, you don't have to think too much about the bits that make up numbers in Java, but they creep up from time to time. Generally, the main situations where they arise are when you're dealing with a low-level API or with code written by someone who spent a lot of time with C and deeply internalized its efficiencies.

For an example of the former, take a look at the com.ibm.designer.domino.napi.NotesNoteItem class in IBM's NAPI. It contained a method called int getFlags(), which returns exactly the value we were talking about above. If you get your hands on a summary item this way, you can do this to tell if it's a summary item:

(item.getFlags() & 0x0004) != 0

Note the extra != 0. While the operation involved is the same as in C, C lets you treat any non-zero integer value as a boolean true, while Java forces you to perform an actual boolean operation.

For an example of the latter, turn your eyes to the DefaultColumnDef class used in xe:dynamicViewPanel generation and customization, a snippet of which is:

public static class DefaultColumnDef implements ColumnDef {
      public static final int FLAG_HIDDEN         = 0x000001;
      public static final int FLAG_LINK           = 0x000002;
      public static final int FLAG_ONCLICK        = 0x000004;

      public int flags;
  
      public boolean isHidden() {
          return (flags&FLAG_HIDDEN)!=0;
      }
      public boolean isLink() {
          return (flags&FLAG_LINK)!=0;
      }
      public boolean isOnClick() {
          return (flags&FLAG_ONCLICK)!=0;
      }
}

This could also be represented as a bunch of boolean properties on the object or, most idiomatically, an EnumSet, but using a bitmask is slightly more space- and CPU-cycle-efficient. Slightly. The difference isn't enough to even think about normally, but can become worth it in cases where the code is called so often that tiny efficiencies can make a difference.

This whole topic is the sort of thing that is normally so many layers of abstraction below where we work that it's basically invisible, but it's definitely good to at least know the basics for the times when it does come up.

Commenter Photo

Ben Langhinrichs - Apr 27, 2019, 7:53 PM

Well explained. I don't have to deal with this in Java often, because I'm that bozo who works in C all the time and has "deeply internalized its efficiencies". I love it when you explain all this, though, as it is seldom laid out in a clear way.

New Comment