Understanding Synchronization
in Real-Time 3D Graphics
Why synchronization? And, what do we mean by synchronization?
Application synchronization is primarily a concern of developers writing real-time 3D graphics applications. Real-time 3D graphics applications are those that strive to maintain a constant, high frame rate. They can be distinguished from interactive 3D graphics programs, which are typically driven by user input and only update the scene when user interaction requires it. The latter need only be responsive to user input.
Not long ago, makers of computer video game cards were in a chest-thumping contest to see who could run the popular Quest game at the highest frame rate. It was interesting when the frame rates fell below the rate at which the monitor could refresh its screen, as this actually provided a somewhat well-defined benchmark for measuring overall graphics performance. But soon the frame rates reached 100 frames per second, then 120, then 150. At this point it becomes a ludicrous pie-throwing contest with regard to human perception factors.
To achieve these frame rates, the game card developers had removed a feature that higher end developers had considered absolutely essential to quality rendering: BufferSwap synchronization with vertical retrace. More on this in the next section.
Let's define a bit better what we are talking about. In a nutshell, there is a repeating frequency called Vertical Retrace that dictates the rate at which a monitor refreshes its screen. The scan line begins at the top left corner of the screen and scans, line by line, descending down the screen until it gets to the bottom right pixel (or somewhat beyond, actually), then blanks and returns to the top of the screen, waiting for a signal to begin the next scan. That signal is called the Vertical Retrace signal and typically occurs at 60, 72, or 85 Hz on modern CRT monitors, or 60 Hz on LCD panels.
When we have a graphics program that has no synchronization whatsoever, this vertical retrace occurs asynchronously to the program's main loop. We assume, of course, that a 3D graphics program will be double buffered and at the end of each frame of the program's main loop, we will swap draw buffers, displaying the buffer we just drew to and preparing the opposite buffer for drawing.
So, to put it in chart form:

There are a couple of things to note in this diagram. Counting program frames, and starting from the left, we can see that buffers are swapped twice during the first vertical retrace frame and once in the middle of the vertical retrace frame. This asynchronous behavior between the two frequency rates results in a visual artifact called "tearing" (like ripping), caused by the displaying of more than one draw frame during one vertical retrace frame. So, frame n will be displayed at the top portion of the screen and frame n+1 will be displayed in the lower portion of the screen. In the case of the above chart, we will actually see part of frame n+2 at the very bottom of the screen.
This is not a big deal if the graphics program's eye point is still. However, if it is panning, for example, we will see one position at the top of the screen and another position at the bottom of the screen, and the moment when the buffers swapped will appear as a tear in the image.
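The tearing described above can be illustrated with a small simulation. This is only a sketch: time is measured in arbitrary integer ticks, and the function name and the swap times are hypothetical values chosen for illustration, not measurements from any real system.

```python
def frames_per_refresh(swap_times, period, n_refreshes):
    """List, for each refresh interval, the draw frames scanned out during it."""
    result = []
    for k in range(n_refreshes):
        start, end = k * period, (k + 1) * period
        # Frame on screen when the scan starts: one per swap completed so far.
        current = sum(1 for t in swap_times if t <= start)
        visible = [current]
        # Each swap that lands mid-scan switches the rest of the screen
        # to a newer frame, which is what shows up as a tear.
        for t in sorted(swap_times):
            if start < t < end:
                current += 1
                visible.append(current)
        result.append(visible)
    return result


# Unsynchronized: swaps land wherever the program happens to finish a frame.
print(frames_per_refresh([4, 8, 13, 27], period=10, n_refreshes=3))
# [[0, 1, 2], [2, 3], [3, 4]]

# Swaps that coincide with the retrace boundary leave one frame per refresh.
print(frames_per_refresh([10, 20, 30], period=10, n_refreshes=3))
# [[0], [1], [2]]
```

Any refresh interval that lists more than one frame is a torn one. In the first call, the first refresh shows parts of three different frames, much like the n, n+1, n+2 case described above.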
The fix for the scenario described in the previous section is to synchronize BufferSwap to the vertical retrace signal, such that the BufferSwap occurs during the blanking moment of the vertical retrace scan. This leaves a single image intact on the screen for the duration of the vertical retrace and avoids the tearing artifacts previously described.
Again in chart form:

Ok... so the marketing people are having conniptions, because we are now only rendering at 60 fps. However, the visual quality is much improved.
Note that the graphics program has made a request to swap buffers at the end of the frame. It is in the graphics subsystem that this call waits (yellow dotted line) until the next vertical retrace signal, accepting no further graphics calls until it completes the buffer swap.
Note, also, that whether an API request to swap buffers is synchronized to vertical retrace is implementation-specific. Some graphics hardware vendors enable this by default and others do not. All provide (or should provide) a method for synchronizing swap buffers to vertical retrace if it is not enabled by default.
Now, it is interesting to note that it is the graphics processor that waits on vertical retrace when a buffer swap request is issued. So, what is happening in the CPU at this time?
Now would be a good time to introduce the distinction between processing units for what we have, until now, called "DRAW". Any graphics program running on modern graphics hardware is actually two processes running in parallel on separate processors. The program itself is running on the CPU. However, it is not actually drawing anything. It is, instead, issuing commands (tokens and data) that are consumed by the graphics processor (sometimes called the GPU). It is the graphics processor that actually does the rendering. So, to distinguish, we call the CPU-bound process DISPATCH and the GPU-bound process DRAW.
To adequately describe this in a diagram, we need to understand carefully what is happening in the CPU. We can see that the GPU bound DRAW process is being synchronized nicely with the vertical retrace signal. What throttles DISPATCH?
There is, in fact, a FIFO into which DISPATCH writes the graphics draw commands, and from which DRAW consumes them. When the FIFO is full, DISPATCH must block. This blocking of DISPATCH may occur at any time during the graphics frame. In fact, DISPATCH may get several frames ahead of DRAW if the FIFO is deep enough and the command set per frame is small enough. Where it blocks during the frame depends on too many variables to be predictable.
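The throttling effect of the FIFO can be sketched as a tick-by-tick simulation. This is an illustrative model only: the one-command-per-tick DISPATCH rate, the fixed DRAW cost per command, and all names are assumptions, not a description of any particular driver.

```python
from collections import deque


def simulate_fifo(n_commands, fifo_depth, draw_ticks_per_command):
    """Return the ticks at which DISPATCH was blocked by a full FIFO."""
    fifo = deque()
    issued = 0          # commands written by DISPATCH so far
    consumed = 0        # commands fully executed by DRAW so far
    draw_busy = 0       # ticks left on the command DRAW is executing
    blocked_ticks = []
    tick = 0
    while consumed < n_commands:
        # DRAW: pick up the next command when idle, then work for one tick.
        if draw_busy == 0 and fifo:
            fifo.popleft()
            draw_busy = draw_ticks_per_command
        if draw_busy > 0:
            draw_busy -= 1
            if draw_busy == 0:
                consumed += 1
        # DISPATCH: try to write one command per tick; block if the FIFO is full.
        if issued < n_commands:
            if len(fifo) < fifo_depth:
                fifo.append(issued)
                issued += 1
            else:
                blocked_ticks.append(tick)
        tick += 1
    return blocked_ticks


shallow = simulate_fifo(n_commands=6, fifo_depth=2, draw_ticks_per_command=3)
deep = simulate_fifo(n_commands=6, fifo_depth=6, draw_ticks_per_command=3)
print(len(shallow), len(deep))
```

With a FIFO deep enough to hold the whole command set, DISPATCH never blocks and runs ahead of DRAW. With a shallow FIFO, it blocks at ticks that depend on the depth and on the cost of each command.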
An OpenGL GLX extension call named WaitVideoSyncSGI was devised to solve this problem. This call blocks the DISPATCH process running on the CPU until the next vertical retrace (or some modulo boundary given as a function parameter). This allows DISPATCH to begin on a known, predictable frame boundary. DISPATCH may still block on the FIFO, but at least the beginning of the frame is a known point in time.
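The "modulo boundary" can be modeled as a retrace counter that must satisfy count % divisor == remainder before DISPATCH is released. The sketch below is a pure model, not the extension itself: the function name is hypothetical, and it assumes the wait always ends on a retrace strictly after the current one.

```python
def next_sync_count(current_count, divisor, remainder):
    """Smallest retrace count after current_count with count % divisor == remainder."""
    if divisor <= 0 or not 0 <= remainder < divisor:
        raise ValueError("need divisor > 0 and 0 <= remainder < divisor")
    count = current_count + 1
    while count % divisor != remainder:
        count += 1
    return count


# divisor=1 releases DISPATCH on every retrace.
print(next_sync_count(7, divisor=1, remainder=0))   # 8
# divisor=4 releases DISPATCH on every fourth retrace (15 Hz on a 60 Hz display).
print(next_sync_count(8, divisor=4, remainder=0))   # 12
```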

In the previous section we discussed both BufferSwap synchronization to the Vertical Retrace signal, and synchronization of DISPATCH to the vertical retrace signal. Both of these elements can be applied to a system with a single display. But what about a system with multiple displays? Systems with multiple displays can either be single system image (SSI) systems in which the system may contain multiple processors all sharing the same memory and I/O system, or a cluster of loosely coupled systems bound only by an external network interface. In either case, the complete system has multiple graphics subsystems, and/or multiple displays.
Each graphics subsystem usually contains a clock, which runs autonomously from the rest of the system. This clock controls the rate at which vertical retrace occurs, as well as line and pixel rates. Understanding these and how to synchronize them is discussed here.
Essential to understanding synchronization is, first, understanding the function of this clock.
There are three frequencies that make up the refresh of the display. 1) Vertical retrace - the frequency at which the pixel scan begins (at the top left corner of the screen) 2) Line rate - the frequency at which each line of the display is scanned and 3) Pixel rate - the frequency at which each pixel in a line is scanned.
Typical rates for vertical retrace are 60, 72, or 85 Hz (sometimes higher for active stereo), with line rates and pixel rates being some product of the vertical retrace rate and the resolution of the screen. A (somewhat simplistic) example of a display configuration would be 1280x1024 at 60 Hz, where 60 Hz is the vertical retrace rate, 1024 * 60 is the line rate and 1280 * 1024 * 60 is the pixel rate. (Actual configurations have somewhat higher rates, but use the above for example purposes.)
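The simplified arithmetic above works out as follows. The function name is hypothetical, and blanking intervals are ignored, as they are in the example.

```python
def display_rates(width, height, refresh_hz):
    """Return (line_rate_hz, pixel_rate_hz) for the simplified model above."""
    line_rate_hz = height * refresh_hz
    pixel_rate_hz = width * height * refresh_hz
    return line_rate_hz, pixel_rate_hz


line_rate, pixel_rate = display_rates(1280, 1024, 60)
print(line_rate)    # 61440     (about 61.4 kHz)
print(pixel_rate)   # 78643200  (about 78.6 MHz)
```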
The terms Genlock, Framelock and "Pixel Lock" are often misused in the graphics industry when marketing systems that tout synchronization. Each of them, however, refers to a distinct method for synchronizing multiple displays.
Framelock refers to a synchronization of vertical retrace. An external clock can be used to issue a frequency that each of the graphics subsystems uses to begin vertical retrace. In the example above, this clock would issue a signal at 60 Hz, and each of the graphics subsystems would use that signal to begin vertical retrace. After the vertical retrace, however, clocks are allowed to diverge from each other, such that they might not all finish at the same time. However, they are synchronized once again at the start of the next clock signal.
This method is essential for applications wanting to synchronize multiple displays, and is often "good enough". It has shortcomings, however, that may manifest themselves in applications that require pixel-level synchronization, where multiple displays may need to be painting the same pixel at the same moment in time. Examples include applications in the broadcast or movie industry where signals are composited together, or applications that may use a target projector.
Pixel lock refers to a method for driving the pixel rate on all displays. The normal approach is to bypass the graphics subsystem's own pixel clock and drive all subsystems with a single pixel clock. This method is not hard to implement with most hardware, but also has some shortcomings. Although the pixel rate is synchronous on all displays, there is no assurance that all displays are painting the same corresponding pixel at the same moment. In other words, each scan may be at a different position on the screen.
Genlock brings together both frame lock and pixel lock, such that all frames begin at the same time and all pixels are painted at the same time. This is the best form of display synchronization, but a capability that is not always present in graphics subsystems.
Swap Ready is another form of synchronization that ensures that all DRAW (GPU-bound) processes swap buffers at the same time. Swap Ready is especially important for applications that run at frame rates lower than the vertical retrace rate (see Why 60 Hz?). For example, if an application is running at 15 Hz (four times slower than a vertical retrace rate of 60 Hz), and adjacent displays are alternating when the buffer is swapped (the left display swaps on the 1st and 5th vertical retrace, and the right display swaps on the 3rd and 7th vertical retrace), the eye will detect disturbing visual anomalies such as an "accordion effect" of objects in motion seeming to grow and shrink as they move across from one screen to the next.
Swap Ready has often been used alone, without vertical retrace synchronization, to keep adjacent displays "synchronized". There is some argument that justifies this for applications that run at update rates lower than vertical retrace, but for applications aiming for 60 Hz, a Swap Ready-only synchronization is virtually pointless.
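The 15 Hz example can be tabulated to show how often the two displays disagree. This is a minimal sketch with a hypothetical helper: retraces are numbered from 1 as in the text, and frame 0 stands for whatever was on screen before the first swap.

```python
def frame_on_display(retrace, first_swap, swap_interval):
    """Frame number shown during a retrace (0 before the display's first swap)."""
    if retrace < first_swap:
        return 0
    return 1 + (retrace - first_swap) // swap_interval


retraces = range(1, 9)
left = [frame_on_display(r, first_swap=1, swap_interval=4) for r in retraces]
right = [frame_on_display(r, first_swap=3, swap_interval=4) for r in retraces]
mismatched = [r for r, a, b in zip(retraces, left, right) if a != b]

print(left)        # [1, 1, 1, 1, 2, 2, 2, 2]
print(right)       # [0, 0, 1, 1, 1, 1, 2, 2]
print(mismatched)  # [1, 2, 5, 6]
```

In this model the two displays show different frames during half of all retraces, which is the source of the "accordion effect".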
The naive approach to implementing Swap Ready is to simply place some barrier, or synchronization point, for all DISPATCH processes to rendezvous at before issuing a SwapBuffers. This is highly non-deterministic for two reasons. First, while the rendezvous mechanism may perform well on an SSI system, where interprocess communication may be accomplished in a single kernel, on a cluster the latency of two-way communication over the network is large enough to be a detrimental factor. The second reason, however, is the most important. There is no way of determining conclusively how much time can elapse between the issuing of the SwapBuffers command in DISPATCH and the carrying out of the command in the GPU-bound DRAW. The FIFO may block the command, or DRAW may be in the middle of an expensive draw and miss vertical retrace, causing one display to lag behind the others.
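For reference, the naive rendezvous looks roughly like the sketch below on a single system image. Threads stand in for the DISPATCH processes, and the "swap_issued" event is only a placeholder for the real SwapBuffers call; all names are hypothetical.

```python
import threading


def run_naive_swap_ready(n_displays):
    """Run one frame of n DISPATCH threads that rendezvous before swapping."""
    barrier = threading.Barrier(n_displays)
    lock = threading.Lock()
    events = []

    def dispatch(display_id):
        with lock:
            events.append(("frame_done", display_id))
        barrier.wait()  # rendezvous: nobody proceeds until all have arrived
        with lock:
            # Placeholder for issuing SwapBuffers into the FIFO.
            events.append(("swap_issued", display_id))

    threads = [threading.Thread(target=dispatch, args=(i,))
               for i in range(n_displays)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


for kind, display_id in run_naive_swap_ready(3):
    print(kind, display_id)
```

Note what the barrier does and does not guarantee: every swap is issued after every DISPATCH has finished its frame, but nothing is said about when each GPU actually carries out the swap. That gap is the second, more important problem described above.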
The correct approach for SwapReady is an implementation in both the graphics subsystem driver and hardware. In the driver, when a SwapBuffers token is consumed, the GPU bound DRAW must wait for an acknowledgement from the other graphics subsystems before continuing on to then wait for vertical retrace. The acknowledgement can be as simple as a single physical line where the voltage can be pulled low, but won't go low until all subsystems connected attempt to pull it low (or something along those lines).
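The behavior of such a shared line can be modeled in a few lines. This is a logical sketch only, with hypothetical names and timings: it assumes framelock, so that all subsystems share the same retrace boundaries, and it ignores electrical detail.

```python
import math


def swap_ready_line_is_low(pulling_low):
    """The line reads low only when every connected subsystem pulls it low."""
    return all(pulling_low)


def swap_retrace(ready_times_ms, retrace_period_ms):
    """Index of the retrace on which all subsystems swap together.

    Each subsystem pulls the line low when it consumes its SwapBuffers
    token, so the line goes low when the slowest one is ready. All of
    them then wait for the next (framelocked) vertical retrace.
    """
    all_ready_ms = max(ready_times_ms)
    return math.ceil(all_ready_ms / retrace_period_ms)


print(swap_ready_line_is_low([True, True, False]))   # False
print(swap_retrace([5.0, 12.0, 21.0], 16.667))       # 2
```

In the second call, two subsystems are ready before the first retrace, but all three swap on the second retrace because the slowest one holds the line.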
For applications that run at the vertical retrace rate (60 Hz), it is essential that framelock or genlock be in place as well. If all graphics subsystems rendezvous correctly at the SwapReady point, but then go on to wait on a Vertical Retrace that is staggered between them, the effect of Swap Ready is lost.
For applications that run at lower rates and use hardware SwapReady without frame lock or genlock, at least the buffer swaps occur within 16.667 milliseconds (one vertical retrace period at 60 Hz) of each other, which would improve on the "accordion effect" described above.
An interesting case can be made for applications that are targeted for frame rates at vertical retrace rates to not require Swap Ready.
Primarily, if the application can run at 60 Hz without dropping frames, and assuming frame lock or genlock then there is no need for SwapReady. If each DRAW finishes before vertical retrace, then all buffers will swap on the next vertical retrace. If all vertical retraces are synchronized then it follows that all swaps are synchronized.
But what happens if one of the graphics subsystems begins running at a slower frame rate than the adjacent displays? For example, a ground vehicle simulator with three out-the-window displays may encounter a place in the visual database, where the view out one of the side displays involves a complex scene, dropping frame rate from 60 Hz to 30 Hz on that display. Conventional wisdom dictates that the use of SwapReady eliminates anomalies between displays. Thus, when the display with the complex scene begins to run at 30 Hz, all displays run at 30 Hz.
However, a good case can be made for allowing one display to render at 30 Hz and others left to render at 60 Hz. If FrameLock is in use and data synchronization is done properly (see next section), the only anomaly present is the double-vision view of the slower display. Practical experience proves that there is no discernable tearing between screens, even when these are projected and the projections are overlapped.
This differs, however, from the case where two adjacent displays may both be running at half frame rate (30 Hz), but on alternating frames, such that the display on the left is rendering frames 0, 2, 4, 6, 8, etc. and the display on the right is rendering frames 1, 3, 5, 7, 9, etc. This creates the visual anomaly described as the "accordion" effect above.
However, this should not occur if the data is synchronized properly.
Dynamic data is that part of a visual simulation that controls the eye point, moving models, special effects, etc. It is the output of frame-based motion control or animation, and is represented as a set of 4x4 matrices, quaternions, time stamps, frame stamps, or other values that control change in a virtual scene.
It becomes necessary, then, to synchronize changes in dynamic data with graphics rendering. On a single system image, this requires synchronizing the CPU-based frame rate with the graphics frame rate. Note that this is not the same as synchronizing the call to swap buffers with the vertical retrace signal, which occurs on the graphics subsystem (GPU) only. It requires placing a wait within the CPU-bound main loop, which is then triggered by the graphics vertical retrace signal. glXWaitVideoSyncSGI(), an extension to GLX, is an example of such a call.
One might argue that the use of a swap buffers call that synchronizes to vertical retrace is sufficient to make the application run in lock-step with the vertical retrace. While this is somewhat true, the absence of a CPU-based block results in haphazard control of where in the main loop the application blocks.
Consider an architectural view of CPU/GPU communications. When an application issues OpenGL calls, they are queued and sent to the GPU. The CPU is the producer and the GPU is the consumer. Generally speaking, it takes more effort to execute the OpenGL calls on the GPU than it does to issue the OpenGL calls on the CPU. Therefore, when the GPU is busy and cannot consume the messages being queued by the CPU fast enough, the queue fills and causes the application running on the CPU to block. This is, in fact, the only place within the context of CPU/GPU rendering when the application is blocked. The queue between the CPU based producer and the GPU based consumer then acts as a type of "throttle" to the producer.
Placing a wait point in the main loop of the CPU-bound application then allows for a start-of-frame control point. Using a call such as glXWaitVideoSyncSGI triggers start-of-frame with the vertical retrace signal.
When the system is a graphics cluster, the issue of synchronizing dynamic data is somewhat more complex. The start-of-frame for all nodes in the cluster must be based on a copy of the same frame of dynamic data, or discrepancies between the images will occur, regardless of how well synchronized the video signals are.
Several approaches can be taken to synchronize dynamic data across a cluster, but the best-suited approach involves a master/slave architecture. Data is updated per frame on the master, then broadcast to the slave nodes. Slave nodes must block waiting for updated data. Using a broadcast approach assures that all slave nodes receive the data at the same time. Two-way network communications involve longer latencies and increase the time needed to ensure that all slave nodes have the same data. If a single point of dispatch is not used, then it is possible to span graphics frame boundaries and be out of sync. The result is slower performance.
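As an illustration of what the master might broadcast each frame, the sketch below packs a frame stamp and a single 4x4 matrix into a fixed-size payload that every slave can decode identically. The wire format and all names are assumptions made for this example; a real system would carry more dynamic data and would send the payload over its own broadcast transport.

```python
import struct

# Hypothetical wire format: little-endian uint32 frame stamp + 16 doubles.
PACKET_FORMAT = "<I16d"
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)  # 4 + 16 * 8 = 132 bytes


def pack_dynamic_data(frame_number, matrix):
    """Master side: serialize one frame of dynamic data."""
    flat = [value for row in matrix for value in row]
    return struct.pack(PACKET_FORMAT, frame_number, *flat)


def unpack_dynamic_data(payload):
    """Slave side: recover the frame stamp and the 4x4 matrix."""
    values = struct.unpack(PACKET_FORMAT, payload)
    frame_number, flat = values[0], values[1:]
    matrix = [list(flat[i:i + 4]) for i in range(0, 16, 4)]
    return frame_number, matrix


eye_point = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0],
             [2.5, 0.0, -10.0, 1.0]]
payload = pack_dynamic_data(42, eye_point)
frame_number, received = unpack_dynamic_data(payload)
print(frame_number, received == eye_point)
```

Carrying the frame stamp in every packet lets a slave confirm that the data it is about to draw belongs to the frame it expects, rather than to an earlier or later one.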
On the master node, accurate frame control that is synchronized with the graphics frame rate of the slave nodes is necessary to keep data being sent at the appropriate time during the overall frame. Ideally, frame control on the master is controlled by the frame rate of the video signal. This can be done by using frame lock and involving the master as well. Although the master may not display real-time graphics, the frame lock signal can be used to control start-of-frame on the master node.
Alternatively, a good real-time frame scheduler may be used on the master node. Inevitably, if the frame scheduler is not synchronized via a hardware link, it will drift from the vertical retrace signal. The rate of drift depends highly on the quality of the frame scheduler, but the resulting phase shift may remain undetectable for long enough to make the results quite acceptable.
There is a benefit to using a cluster with a master/slave architecture with regard to system latency. In a single system image, start-of-frame begins at vertical retrace. Since it is beneficial to begin a graphics frame at vertical retrace, data update must occur in parallel, but updated data from the current frame cannot be drawn until the next frame. This means there is a full frame of latency between when the data is updated and when it is drawn.
On a master/slave architecture, start-of-frame of data update can be phase shifted slightly from start-of-frame on the slave node, to allow just-in-time delivery of updated dynamic data to the slave nodes. This results in a lower latency. The phase shift could be accomplished by determining an offset from the start-of-frame signal, whether it be generated from a frame-lock signal or an adjustable frame scheduler.
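The offset can be worked out with simple arithmetic. The sketch below assumes the master knows roughly how long its data update and the broadcast take; the function name and the timing figures are hypothetical, chosen only to show the calculation.

```python
def master_phase_offset_ms(frame_period_ms, update_ms, transmit_ms):
    """Delay after start-of-frame before the master begins its data update.

    The update is started as late as possible so that the data reaches the
    slave nodes just in time for their next start-of-frame.
    """
    lead_ms = update_ms + transmit_ms
    if lead_ms > frame_period_ms:
        raise ValueError("update and transmit do not fit in one frame")
    return frame_period_ms - lead_ms


frame_period_ms = 16.667  # 60 Hz
offset_ms = master_phase_offset_ms(frame_period_ms, update_ms=3.0, transmit_ms=1.0)
latency_ms = frame_period_ms - offset_ms

print(round(offset_ms, 3))   # 12.667
print(round(latency_ms, 3))  # 4.0
```

With these example figures, the data is about 4 ms old when the slaves begin drawing it, compared with a full 16.667 ms frame on a single system image.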
Real-time computer graphics synchronization must be evaluated at four levels:
1) Individual display synchronization of BufferSwap to Vertical Retrace.
2) Video synchronization as Framelock, Genlock, or PixelLock with a software "start-of-frame".
3) Swap Ready synchronization and
4) Dynamic Data synchronization
In the pursuit of high-quality multi-display systems, each of these levels must be considered separately to provide the best possible result.