Sunday, December 18, 2016

Opening cash drawer without printing

Programming an accounting software was probably the most boring thing I have ever done. There was however one interesting problem. The software was recording transactions from customers and printing bills, very simple. The printer was a small cashier style printer connected to cash drawer with RJ11 connector.


Now everytime the bill was printed the drawer was opened automatically. It is a standard behavior of a cashier printer and can be enabled in driver settings. One of the program features was to make a function that opens the cash drawer without printing any bill.

The tricky part is that the cash drawer is connected only to the printer and not the computer itself. So the opening command must come from the printer. By looking into printer's documentation I found series of commands required for generating pulse to open the drawer.


As you can see the command contains five ASCII characters:

16 + 20 + 1 + 0 + 3


In .NET there is a helper object called RawPrinterHelper provided by Microsoft which can be used for sending raw commands to the printer. The code can be found on this website. I added the code into the project and ended up making this function:


private void OpenDrawer()
{
   var command = 
   Char.ConvertFromUtf32(16) +
   Char.ConvertFromUtf32(20) +
   Char.ConvertFromUtf32(1) +
   Char.ConvertFromUtf32(0) +
   Char.ConvertFromUtf32(3);

   RawPrinterHelper.SendStringToPrinter(printer.PrinterSettings.PrinterName, command);
}


Calling OpenDrawer() does really open the drawer without printing any paper.

Friday, March 18, 2016

Playing with lambdas

With the new version of C++ we got a chance to start using lambda functions. It's a great way to simplify the code. Imagine this traditional sorting scenario:

bool comparator(int a, int b)
{
    return a < b;
}

void sort(std::vector<int>& array)
{
    std::sort(array.begin(), array.end(), comparator);
}


With lambda we can simplify it into one function:

void sortLambda(std::vector<int>& array)
{
    std::sort(array.begin(), array.end(), [](int a, int b) -> bool
    {
        return a < b;
    });
}


Pretty cool, now let's try something more interesting. Lambda function is just a function without name, so it can't be declared in a traditional way. But it can be used as a variable:

auto lambda = [](std::string param) { std::cout << param; };

lambda("Hi!");


Lambda function can't be used as a type directly because the type is unknown, but it can be used in a combination with std::function<> template. This way we can declare a type and use it as a class member:

struct Object
{
    void registerCallback(const std::function<void(std::string)>& callback)
    {
        m_callback = callback;
    }

    std::function<void(std::string)> m_callback;
};


Now we use the Object to register lambda as a callback:

Object object1;
object1.registerCallback(lambda);
object1.m_callback("Hi!");


Or we can declare and register the lambda at the same time:

Object object2;
object2.registerCallback([](std::string param)
{
    std::cout << "Object 2: " << param;
});

object2.m_callback("Hi!");


Another nice example of using lambda is when creating a thread. Now it can be done in one line call!

std::thread thread([]()
{
    std::cout << "Hi from the other thread!";
});


Now when you want to use an external object inside the lambda scope, you can capture it inside the brackets. Lambda capture is a mechanism for passing an external variable to the lambda scope. A typical example is to capture this as a class instance pointer:

auto lambda = [this]()
{
    // now I can access any member of this class
};


Variables can be captured by pointer, reference or name. However we must be careful about validity of captured objects, because lambda does not extend their lifetime. For example this will crash:

auto object = new Object();
auto lambda = [object]()
{
    object->sayHi();
};

// ....

delete object;
object = nullptr;

lambda(); // crash! captured object was deleted


Similar problem will happen when passing local variable by reference. When the lambda is called after exiting the variable scope the capture becomes a dangling reference:

std::function<void()> lambda;

{
    auto number = 4;

    lambda = [&number]()
    {
        std::cout << number;
    };
}

//! now the captured "number" became a dangling reference
lambda();


But after understanding these simple rules there are no other drawbacks of using lambda. In fact after some time I don't see any reason why NOT to use lambdas. The way how it simplifies the code is just amazing. I can say that it completely changed the way how I code and think about program flow and architecture. It's a very fresh wind to C++ language.

I have to recommend one excellent book from Scott Meyers Effective Modern C++ for more in depth examples and explanations behind lambdas and other techniques.

Monday, January 4, 2016

Smoother camera spring effect

Recently I came across an interesting problem where I had to follow a moving game object with the camera. The object was physical based with rigid body and the camera movement had to be delayed with a spring effect. I am using Unity Engine. The implementation was pretty straightforward, the camera position is continuously interpolated to the object position. The object is moving, the camera is following, simple. 

Now the problem was that there was a visible object jittering. When the spring effect was turned off and the camera was in fixed distance from the object the jittering disappeared. How the spring effect on camera is related to object jittering?

The answer is frequency. The physics subsystem use a different updating frequency. Game physics is usually updated with a fixed time step and does not depend of the frame rate. Camera on the other hand is a part of the main game loop with variable speed.


Frequency graph


The graph shows the difference between physical subsystem and the game loop. Physics is updated constantly regardless of the frame rate, game loop frequency is only limited by the performance of the game (or the limit is set by design).

But how does this help with jittering? To understand the problem let's look at the next two examples. On the first image there is a moving capsule that represents a physical based object followed by a camera with fixed distance (no spring effect). The capsule is moving in constant time steps and because the camera is in fixed distance it always points on the object correctly without jittering.


Fixed camera distance


Second image shows the camera on the spring. The capsule is being moved with fixed time step same as before but the camera distance is now more dynamic. The spring makes the camera delayed while the capsule is moving, when the capsule is stopped the camera is still moving towards the original distance. Now because of the camera position is updated on a game frequency and the capsule position is updated on a frequency of physics the capsule object start being jittery.

Camera with variable distance (spring effect)


The jittering is more obvious when the the strength of the spring is smaller. In the world space everything seems to look correct, the capsule is moving forward and the camera is moving towards the capsule. The jittering is a side effect of the fact that we are seeing the world with a camera eye object and everything outside of its frequency becomes jittery.

Fixing the issue wasn't so obvious, I tried to update the spring with fixed time step same as the capsule but it didn't make anything better, the jittering became even worse. After some research I came to these conclusions:

  • Camera must be updated on the game update for the smoothest possible feeling of the game
  • Spring effect must be calculated on the game update for the same reason
  • Capsule must be updated on the physics update in order to work properly

After going back and forward I realized that the capsule (or any moving object) can be split into 2 parts - physical body and visual geometry. Physical body can have all the physical properties like rigid body, collision geometry etc; and the visual geometry object is the actual capsule that you see on the screen. Now the physical body is updated with fixed time step and the visual body is interpolated to physical body within the game loop. The camera is now moving towards the the visual body without any jittering side effects.
Smooth camera with spring effect.


This solution has proved to be robust in variable framerate due to the fact that the camera and the capsule geometry are updated with the same time step.

Thursday, September 10, 2015

Hashing 3D space

One of the challenges when dealing with 3D objects is a spatial partitioning. A common problem is to find an object in a specified area or a nearest object. Typically we divide the space into cells and reduce the problem to searching for a cell containing a point or an object.


Example of Octree cells

But what if we are looking for objects that are close together? Can we still use cells effectively? The answer is no. We still can enumerate all objects within one cell however we won't get easily closest objects from surrounding cells.

Consider this example with two cells:

cell0 = {a, b} // cell0 contains 2 points a and b
cell1 = {c}     // cell1 contains 1 point c
r is a proximity radius

All points {a, b, c} are in proximity radius r but belong to two different cells.

2 cells with points within radius


In general we are looking for objects having similar properties like the distance. This problem is more common in large databases when searching for similar items and is known as Nearest Neighbour Search. One of the possible solution is called Locality Sensitive Hashing.

Locality Sensitive Hashing


Locality Sensitive Hashing (LSH) introduces a hash function that maps similar items to the same bucket with high probability. In contrast to the traditional hash function the number of buckets is smaller than the number of input items and the goal is to maximize the probability of a collision for similar items.

LSH hash function is defined as:

lsh : M B , every point from metric space m M maps to bucket b B
let R > 0 be a threshold,
two points m0 and m1 belongs to the same bucket b if d(m0, m1 R, 
where d is a distance function

Hamming or Euclidean distance is typically used as a distance function. Finding a good threshold R depends on application and distributions of elements. High threshold could lead to fewer buckets and high collisions but also low quality similarities.

Let's get back to the example. LSH-hashed objects in the space can look like in the next picture. Points {a, b, c} belong to the bucket b0, points {d, e} to b1 and the point {f} to bucket b2.

LSH-hashed objects


Application


LSH was successfully used in a real-time mesh explosion system Exploder for Unity Game Engine which I made around 2 years ago. This system is capable of cutting 3D mesh into hundreds of pieces in the matter of milliseconds. The biggest challenge here was to find a contour of cut object as quickly as possible and pass it to triangulation phase. The traditional approach is to prepare a 3D mesh in some heavy data structure like half-edge however it is not quite suitable for real-time processing for it's memory footprint.

All points on cut plane were LSH-hashed into buckets {b0...bn} and it was possible to do very fast look-ups of 3D points and search for contours. Another benefit was that the points belonging to the same bucket (within small proximity) were actually filtered out and treated as one point. That made things faster on complex meshes.

From a performance stand point LSH has very small memory requirements, there is no need for space subdivision, every bucket is encoded as 32-bit number and the hash function requires only a comparison of points distances, which can be done on the fly so no preprocessing is needed.



Demonstration of Exploder system.

Conclusion


LSH is a good example of a system that can be used for applications outside of its domain. Hashing of objects based on probability of collision is more common for searching of similar items like images, web or other media files. It shows that the statistical approach can be used with a great success in areas where the precision of geometry algorithms is preferred.



References




Thursday, May 28, 2015

Fastest image processor

Recently I wanted to resize bunch of pictures from wedding, it was around 1000 images in JPEG format in high resolution over 4000 pixels in width. In past I was using ACDSee software on my windows 7 machine and its batch image resizer. It has a lots of settings and the results are pretty good. However it's a commercial software and runs only on Windows.

For that reason I wrote a simple script in Python for batch resizing of images, it uses PILLOW imaging library and works good enough.


from PIL import Image

image = Image.open(input)
resized = image.resize(new_size, Image.ANTIALIAS)
resized.save(output)




However waiting for 1000 images can be annoying when the average speed of a single resize operation is over 1 second. I started wondering if there is something faster and after some digging on Internet I decided to implement the fastest batch resizer on my own.

First I split the problem into 4 main parts:

  1. Serialization and deserialization.
  2. Decoding and Encoding of JPEG image.
  3. Resizing.
  4. Comparison, is it really the fastest solution?

Serialization/deserialization


Speed of reading/writing files from the disk depends on many variables, but mostly on operating system.  For Windows it's best to use win32 API call:


DWORD dwBytesRead = 0;
HANDLE hFile = CreateFile(fileName.c_str(), GENERIC_READ, 
                          FILE_SHARE_READ, NULL, OPEN_EXISTING, 
                          FILE_ATTRIBUTE_NORMAL, NULL);
unsigned long size = GetFileSize(hFile, NULL);
data.resize(size);
ReadFile(hFile, (char*)(&(data)[0]), size, &dwBytesRead, NULL);


Based on my tests it is at least 2x faster than fread. For OSX the fastest way is to use C-style fread as it showed to be approximately 4x faster than using c++ fstream. I tried to use fixed buffer and read the file in the loop but it didn't make any significant difference, reading the whole file at once is just fine.


FILE* file = std::fopen(fileName.c_str(), "rb");

if (file) {
  fseek(file, 0, SEEK_END);
  long size = ftell(file);
  rewind(file);
  data.resize(size);
  fread((char*)(&(data)[0]), 1, size, file);
  fclose(file);
}


Decoding and Encoding of JPEG image


Now we have the binary data but without decoding to RGB it's kind of useless. Standard library for decoding JPEG images is libjpeg, it is used by majority of commercial and open source software solutions for image processing. Fortunately there exists faster version libjpeg-turbo, heavily optimized for SIMD instruction set (MMX, SSE2, NEON). It should give us decoding speed up to 400% of standard libjpeg library.

I used TURBO-jpeg framework with simple interface for encoding images, example of decoding an image:


bool JPEGImage::decode()
{
    int jpegSubsamp, ret;

    tjhandle jpegDecompressor = tjInitDecompress();
    ret = tjDecompressHeader2(jpegDecompressor, m_data->data(), 
                              m_data->size(), &m_width, &m_height, 
                              &jpegSubsamp);

    if (!ret)
    {
        unsigned long size = static_cast<unsigned long>(m_width*
                                                        m_height*3);
        m_buffer->resize(size);
        ret = tjDecompress2(jpegDecompressor, m_data->data(), 
                            m_data->size(), m_buffer->data(), 
                            m_width, 0, m_height, 3
                            TJFLAG_FASTDCT | TJFLAG_NOREALLOC);
        tjDestroy(jpegDecompressor);
    }

    return ret == 0;
}


Resizing


Now comes the main part. Resizing is nothing else than re-sampling of the image pixels. There are several methods but all of them have the same concept, value of every pixel must be read, weighted by neighboring pixels and written to output image.





Naive solution would be to go through every pixel one by one, read the color value, make some averaging with neighboring pixels and write it to output image. But we won't do that because this problem is perfect for highly parallel algorithm. And I am not talking about CPU parallelism, in time when my graphic card has more than 500 processing units GPU calculations seem like an ultimate solution to our problem.

There has been a wide research of general purpose processing on GPU in past years. Today there exist three major frameworks, OpenCL, CUDA and DirectCompute. CUDA is the oldest one, released in 2007 by NVIDIA and still actively developed. However the biggest disadvantage is that it runs only on NVIDIA graphic cards. DirectCompute was made by Microsoft as a part of DirectX11 and has its own limitations, it runs only on Windows 7 and higher on DirectX 10+ powered apps. Lastly OpenCL, developed by Khronos Group, can execute programs across different platforms including CPUs, GPUs, signal processors and other units. The API is highly mature and has the same or even better power than CUDA, perfect for this application.




It takes a while to get in OpenCL but after you get over the hard part, it will reveal you possibilities you could only dream about. Hundreds of processors with thousands of running threads can give you a fantastic performance boost. 

Unlike the CPU the OpenCL is suitable for applications consisting of lots of small tasks that can run in parallel. For us the small task is a single operation over one pixel. In case of an image with resolution 4.000 x 4.000 pixels there are together 16 million tasks. OpenCL will do all the heavy work for us, we just need to write a program for one pixel, in OpenCL terminology this program is called kernel. Graphic programmers will find lots of similarities with fragment shaders.



__kernel void resize_nn(__read_only image2d_t source, 
                        __write_only image2d_t dest, 
                        struct SImage src_img, 
                        struct SImage dst_img, 
                        float ratioX, float ratioY)
{
   const int gx = get_global_id(0);
   const int gy = get_global_id(1);
   const int2 pos = { gx, gy };

   if (pos.x >= dst_img.Width || 
       pos.y >= dst_img.Height)
      return;

   float2 srcpos = {(pos.x + 0.4995f) / ratioX, 
                    (pos.y + 0.4995f) / ratioY};
   int2 SrcSize = (int2)(src_img.Width, src_img.Height);

   float4 value;

   int2 ipos = convert_int2(srcpos);
   if (ipos.x < 0 || ipos.x >= SrcSize.x || 
       ipos.y < 0 || ipos.y >= SrcSize.y)
      value = 0;
   else
      value = read_imagef(source, sampler, ipos);

   write_imagef(dest, pos, value);
}


This small program is an example of OpenCL kernel responsible for resizing image data, it is a very simple algorithm (nearest neighbor) and is shown here just for reference. In the real application I am using more complicated algorithm for best image quality. The kernel program is compiled and uploaded to GPU together with image data and when the calculation is finished the image data are downloaded back to system memory. This transfer can cost some time and it depends on the image size and speed of your memory bandwidth.


Comparison


I have created series of tests comparing time of various batch resizers on Windows 7, iMac and MacBook Air. Machine with Windows 7 has ATI Radeon graphic card, iMac has GeForce from NVIDIA and lastly MacBook Air comes with an integrated video card Intel HD. I tried to find the fastest image processor available for a good comparison and I picked OpenCV framework which is powered by CUDA and should run extremely well on NVIDIA graphic cards. It has also some sort of OpenCL support but based on latest code, it is quite limited.

Tested resizers:

I have prepared 100 high resolution images in JPEG format. During the tests all images were resized to 50% of the original size with cubic interpolation algorithm. Every test is run multiple-times and averaged for the most accurate results.

In the first test on Win7 machine the OpenCL resizer is the clear winner of the test. Second OpenCV is much slower because it has been optimised for NVIDIA graphic cards, but this might change in the future versions of OpenCV. Other resizers are far behind because of using CPU only.





Second test was run on iMac with NVIDIA graphic card. OpenCL is still a little bit faster than OpenCV, probably because of the optimised JPEG decoding library. Both resizers are using different frameworks, it's CUDA vs OpenCL and it's clear they are equally powerful.




Third test was run on MacBook Air with an integrated graphic card Intel HD. Results are similar to iMac test. Here is is interesting that OpenCV resizer is probably using OpenCL under the hood because CUDA is not possible here.







Conclusion


The OpenCL resizer is the fastest one with noticeable difference on Windows platform. However on mac it is performing almost the same as the OpenCV framework. We can expect improvements of OpenCV in the future versions.


References

  1. "OpenCL Integrated Performance Primitives - A library of optimized OpenCL image processing functions", https://github.com/CRVI/OpenCLIPP
  2. "The big blob", http://www.thebigblob.com/category/opencl/
  3. "Dr. Dobb's", http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854