Comparing hardware and software graphics (legacy)

I mentioned on the other page that rendering graphics using the RDP is faster than using the CPU, so now is the time to really check how they measure up against each other.

Note that for all of these examples, I’m only able to compare like-for-like. That means we can’t compare text drawing, line drawing or pixel drawing, since those functions only exist in software graphics.

Also, these tests were made using the legacy RDP functionality, before RDPQ was introduced. Now you can do everything pretty quickly using just the RDP, so this is mostly moot.

Rectangles

The functions used for drawing rectangles are:

  • Software: graphics_draw_box()
  • Hardware: rdp_draw_filled_rectangle()

Many little rectangles

For this first example, we’ll be filling up the screen with a bunch of 2×2 rectangles on a 640×480 display, which is 76,800 squares.

Source code

Software method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	int start, end;

	while(!(disp = display_lock()));
	start = TICKS_READ();
	for (int i=0; i<display_get_width()/2; i++) {
		for (int j=0; j<display_get_height()/2; j++) {
			graphics_draw_box(disp, i<<1, j<<1, 2, 2, graphics_make_color(i*j%256,i*j%256,i*j%256,255));
		}
	}
	end = TICKS_READ();

	char* my_text = malloc(1024);
	sprintf(my_text, "Software method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	printf("Hello world!\n");

	while(1) {}
}

Hardware method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	rdp_init();
	rdp_set_default_clipping();
	rdp_enable_primitive_fill();
	int start, end;

	while(!(disp = display_lock()));
	rdp_attach(disp);
	rdp_sync(SYNC_PIPE);

	start = TICKS_READ();
	for (int i=0; i<display_get_width()/2; i++) {
		for (int j=0; j<display_get_height()/2; j++) {
			rdp_set_primitive_color(graphics_make_color(i*j%256,i*j%256,i*j%256,255));
			rdp_draw_filled_rectangle(i<<1, j<<1, (i<<1)+2, (j<<1)+2);
		}
	}
	rdp_sync(SYNC_PIPE);
	end = TICKS_READ();
	rdp_detach();

	char* my_text = malloc(1024);
	sprintf(my_text, "Hardware method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {

	}
}

Software method: 211,585μs, or about 5 fps. That’s 2.75μs per square.

Hardware method: 773,486μs, or about 1.3 fps. That’s 10μs per square.

The hardware method takes almost four times as long, which is likely because of all the different commands that need to be sent to the RDP.
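
The per-square and fps figures above are plain arithmetic on the measured time. Here is a minimal sketch of that conversion. The helper names (micros_per_item, frames_per_second, rects_to_fill) are made up for this illustration and are not part of libdragon.

```c
#include <assert.h>

/* Average cost of one primitive, in microseconds. */
double micros_per_item(double total_micros, int item_count)
{
	assert(item_count > 0);
	return total_micros / item_count;
}

/* Frame rate if every frame took total_micros to draw. */
double frames_per_second(double total_micros)
{
	assert(total_micros > 0.0);
	return 1000000.0 / total_micros;
}

/* Number of w-by-h rectangles needed to cover the display exactly. */
int rects_to_fill(int display_w, int display_h, int w, int h)
{
	assert(w > 0 && h > 0);
	return (display_w / w) * (display_h / h);
}
```

With the numbers from this test, rects_to_fill(640, 480, 2, 2) gives 76,800, and micros_per_item(211585, 76800) comes out at roughly 2.75.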

A few rectangles

This test will draw only 25 rectangles on a 640×480 display.

Source code

Software method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	int start, end;

	while(!(disp = display_lock()));
	start = TICKS_READ();
	for (int i=0; i<5; i++) {
		for (int j=0; j<5; j++) {
			graphics_draw_box(
				disp,
				display_get_width()/5*i,
				display_get_height()/5*j,
				display_get_width()/5,
				display_get_height()/5,
				graphics_make_color(i*j*16%256,i*j*16%256,i*j*16%256,255)
				);
		}
	}
	end = TICKS_READ();

	char* my_text = malloc(1024);
	sprintf(my_text, "Software method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {}
}

Hardware method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	rdp_init();
	rdp_set_default_clipping();
	rdp_enable_primitive_fill();
	int start, end;

	while(!(disp = display_lock()));
	rdp_attach(disp);
	rdp_sync(SYNC_PIPE);

	start = TICKS_READ();
	for (int i=0; i<5; i++) {
		for (int j=0; j<5; j++) {
			rdp_set_primitive_color(graphics_make_color(i*j*16%256,i*j*16%256,i*j*16%256,255));
			rdp_draw_filled_rectangle(
				(display_get_width()/5)*i,
				(display_get_height()/5)*j,
				(display_get_width()/5)*i+display_get_width()/5,
				(display_get_height()/5)*j+display_get_height()/5
				);
		}
	}
	rdp_sync(SYNC_PIPE);
	end = TICKS_READ();
	rdp_detach();

	char* my_text = malloc(1024);
	sprintf(my_text, "Hardware method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {

	}
}

Software method: 36,565μs, or about 27 fps. That’s 1,462μs per square.

Hardware method: 333μs, or about 3,000 fps. That’s 13μs per square.

This is where the differences really start to show. The RDP is very good at performing tasks at scale when it doesn’t have to receive new orders from the CPU as often.
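
One way to put numbers on this is to assume the hardware time is a fixed cost per rectangle plus a cost per pixel filled. That linear model is my own simplification, not something measured directly, and the names below are hypothetical. Both hardware tests fill the same 640×480 pixels, so the two results are enough to solve for both costs.

```c
#include <assert.h>

/* Result of fitting: time = per_cmd_us * commands + per_pixel_us * pixels. */
typedef struct {
	double per_cmd_us;   /* fixed cost of issuing one rectangle */
	double per_pixel_us; /* cost of filling one pixel */
} cost_model_t;

/*
 * Fit the model from two measurements that cover the same number of
 * pixels but use a different number of rectangles.
 */
cost_model_t fit_cost_model(double t1_us, int cmds1, double t2_us, int cmds2, int pixels)
{
	assert(cmds1 != cmds2 && pixels > 0);
	cost_model_t m;
	/* The pixel cost cancels out when the two measurements are subtracted. */
	m.per_cmd_us = (t1_us - t2_us) / (double)(cmds1 - cmds2);
	m.per_pixel_us = (t2_us - m.per_cmd_us * cmds2) / (double)pixels;
	return m;
}
```

Feeding in the two hardware results (773,486μs for 76,800 rectangles and 333μs for 25) gives about 10μs of fixed cost per rectangle, which here includes setting the primitive color, and well under 0.001μs per pixel. Under this model almost all of the time goes into issuing commands rather than filling pixels.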

Sprites

While rectangles can be used mostly for UI elements, the majority of in-game graphics are going to be sprites.

Full sprite background

This is an example where we take one 16×16 sprite and plaster the whole display with it. That’s a total of 1,200 sprites.

Source code

Software method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	int start, end;

	int fp = dfs_open("/mario-bros-tiles.sprite");
	sprite_t *tiles = malloc( dfs_size(fp));
	dfs_read(tiles, 1, dfs_size(fp),fp);
	dfs_close(fp);

	while(!(disp = display_lock()));
	
	start = TICKS_READ();
	for (int i=0; i<display_get_width()/16; i++) {
		for (int j=0; j<display_get_height()/16; j++) {
			graphics_draw_sprite(disp, i*16, j*16, tiles);
		}
	}
	end = TICKS_READ();

	char* my_text = malloc(1024);
	sprintf(my_text, "Software method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {

	}
}

Hardware method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	rdp_init();
	rdp_set_default_clipping();
	rdp_enable_texture_copy();
	int start, end;

	int fp = dfs_open("/mario-bros-tiles.sprite");
	sprite_t *tiles = malloc( dfs_size(fp));
	dfs_read(tiles, 1, dfs_size(fp),fp);
	dfs_close(fp);

	while(!(disp = display_lock()));
	rdp_sync(SYNC_PIPE);
	rdp_attach(disp);
	rdp_sync(SYNC_PIPE);

	start = TICKS_READ();
	rdp_load_texture_stride(0, 0, MIRROR_DISABLED, tiles, 0);
	
	for (int i=0; i<display_get_width()/16; i++) {
		for (int j=0; j<display_get_height()/16; j++) {
			rdp_draw_sprite(0, i*16, j*16, MIRROR_DISABLED);
		}
	}
	rdp_sync(SYNC_PIPE);
	end = TICKS_READ();
	
	rdp_detach();

	char* my_text = malloc(1024);
	sprintf(my_text, "Hardware method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {

	}
}

Software method: 1,731,922μs, or about 0.57 fps. That’s 1,443μs per sprite.

Hardware method: 8,361μs, or about 119 fps. That’s 7μs per sprite.

Hardware method (textured rectangle): 159μs, or about 6,289 fps. That’s <1μs per sprite.

As you can see, there is a massive difference between the two. Once the texture gets loaded into TMEM, it can draw them blazing fast. If we further reduce the number of calls to the RDP by using a textured rectangle instead of a loop with single sprites, it gets even faster than that.

Single large sprite

Now we’ll try to draw a large sprite to the screen. I’ll use this 640×480 image of the map from Ocarina of Time.

This time the two use slightly different methods. Software uses just one draw command, while hardware splits the image into a 20×15 grid of 32×32px squares and draws them one at a time.
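
The split into 32×32 squares follows from the size of the RDP’s texture memory (TMEM), which is 4 KB. The sketch below shows the grid arithmetic. It assumes the sprite sheet is sliced row by row, 20 squares wide, and the names chunk_t, chunk_at and chunk_fits_tmem are hypothetical helpers, not libdragon functions.

```c
#include <assert.h>

#define TMEM_BYTES 4096 /* size of the RDP's texture memory */

/* Position and sprite-sheet index of one square of the large image. */
typedef struct {
	int x, y;   /* top-left corner on screen, in pixels */
	int offset; /* index of the square within the sprite sheet */
} chunk_t;

/* Describe the square at column i, row j of a grid that is `columns` wide. */
chunk_t chunk_at(int i, int j, int columns, int chunk_size)
{
	assert(i >= 0 && i < columns && j >= 0);
	chunk_t c = { i * chunk_size, j * chunk_size, j * columns + i };
	return c;
}

/* Check whether a square chunk fits into TMEM at the given bytes per pixel. */
int chunk_fits_tmem(int chunk_size, int bytes_per_pixel)
{
	return chunk_size * chunk_size * bytes_per_pixel <= TMEM_BYTES;
}
```

A 32×32 square at 16 bits per pixel is 2,048 bytes, so it fits. A 64×64 square would need 8,192 bytes and would not.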

Source code

Software method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	int start, end;

	int fp = dfs_open("/hyrule-map.sprite");
	sprite_t *tiles = malloc( dfs_size(fp));
	dfs_read(tiles, 1, dfs_size(fp),fp);
	dfs_close(fp);

	while(!(disp = display_lock()));
	
	start = TICKS_READ();
	/* The software path draws the whole 640x480 image with a single call. */
	graphics_draw_sprite(disp, 0, 0, tiles);
	end = TICKS_READ();

	char* my_text = malloc(1024);
	sprintf(my_text, "Software method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {

	}
}

Hardware method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	rdp_init();
	rdp_set_default_clipping();
	rdp_enable_texture_copy();
	int start, end;

	int fp = dfs_open("/hyrule-map.sprite");
	sprite_t *tiles = malloc( dfs_size(fp));
	dfs_read(tiles, 1, dfs_size(fp),fp);
	dfs_close(fp);

	while(!(disp = display_lock()));
	rdp_sync(SYNC_PIPE);
	rdp_attach(disp);
	rdp_sync(SYNC_PIPE);

	start = TICKS_READ();
	/* 640/32 = 20 columns and 480/32 = 15 rows of 32x32 squares */
	for (int i=0; i<display_get_width()/32; i++) {
		for (int j=0; j<display_get_height()/32; j++) {
			rdp_load_texture_stride(0, 0, MIRROR_DISABLED, tiles, j*20+i);
			rdp_draw_sprite(0, i*32, j*32, MIRROR_DISABLED);
		}
	}
	rdp_sync(SYNC_PIPE);
	end = TICKS_READ();
	
	rdp_detach();

	char* my_text = malloc(1024);
	sprintf(my_text, "Hardware method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {}
}

Software method: 55,695μs, or about 18 fps.

Hardware method: 3,385,232μs, or about 0.3 fps.

Here we have another case where the RDP is overwhelmed with a lot of calls to update its TMEM, causing it to under-perform.

Tiled background

This example will take the tilesheet from Super Mario Bros and place random tiles on the screen. It’s a bit more complicated than the other examples. The first step is to take the display and split it into 8×8 tiles (4,800 total tiles). Then these tiles are assigned (non-seeded) random values from 0-35 to represent which tile to place.

Once the tile positions have been determined, the software and hardware algorithms can start working to render the screen. The software method will draw them as normal, but the hardware method will have three variants:

  • Dumb: Load the tile into TMEM before every draw
  • On-the-fly: Load into TMEM only when a tile that isn’t already loaded is found
  • Smart: Load each tile into TMEM only once, but do various passes
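
To get a feel for how the three hardware variants differ, the sketch below counts how many texture loads each would issue for a given tile map. It never talks to the RDP. It assumes only one tile can be resident in TMEM at a time, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE_KINDS 36 /* tile values 0-35 */

/* Dumb: one texture load per tile drawn, whatever the map contains. */
size_t loads_dumb(const uint8_t *map, size_t count)
{
	(void)map;
	return count;
}

/* On-the-fly: reload only when the tile differs from the one already loaded. */
size_t loads_on_the_fly(const uint8_t *map, size_t count)
{
	size_t loads = 0;
	int loaded = -1; /* nothing loaded yet */
	for (size_t k = 0; k < count; k++) {
		if (map[k] != loaded) {
			loaded = map[k];
			loads++;
		}
	}
	return loads;
}

/* Smart: one pass per tile kind, so each kind that appears is loaded once. */
size_t loads_smart(const uint8_t *map, size_t count)
{
	uint8_t seen[TILE_KINDS] = {0};
	size_t loads = 0;
	for (size_t k = 0; k < count; k++) {
		assert(map[k] < TILE_KINDS);
		if (!seen[map[k]]) {
			seen[map[k]] = 1;
			loads++;
		}
	}
	return loads;
}
```

For the map {3, 3, 7, 3, 7, 7} the counts are 6, 4 and 2 respectively. The smart variant pays for its low load count with extra passes over the map, one per tile kind.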

Source code

Software method:

#include <stdio.h>
#include <stdlib.h>
#include <libdragon.h>

int main(void)
{
	display_context_t disp;
	display_init(RESOLUTION_640x480, DEPTH_16_BPP, 2, GAMMA_NONE, FILTERS_RESAMPLE);
	dfs_init(DFS_DEFAULT_LOCATION);
	int start, end;

	int fp = dfs_open("/mario-bros-tiles.sprite");
	sprite_t *tiles = malloc( dfs_size(fp));
	dfs_read(tiles, 1, dfs_size(fp),fp);
	dfs_close(fp);

	const uint16_t tiles_total = (display_get_width()/16)*(display_get_height()/16);
	uint8_t* tile_map = malloc(tiles_total);
	for(int k=0; k<tiles_total; k++) {
		tile_map[k] = rand()%36; /* tile indices 0-35 */
	}
	while(!(disp = display_lock()));
	
	start = TICKS_READ();
	for (int i=0; i<display_get_width()/16; i++) {
		for (int j=0; j<display_get_height()/16; j++) {
			graphics_draw_sprite_stride(disp, i*16, j*16, tiles, tile_map[j*display_get_width()/16+i]);
		}
	}
	end = TICKS_READ();

	char* my_text = malloc(1024);
	sprintf(my_text, "Software method\n\nTime taken: %i", TIMER_MICROS(end-start));
	graphics_set_color(graphics_make_color(255,255,255,255), graphics_make_color(0,0,0,255));
	graphics_draw_text(disp, 40, 30, my_text);
	display_show(disp);

	while(1) {

	}
}

Hardware method:

Software method: 62,206μs, or about 16 fps. That’s 13μs per sprite.

Hardware method:

  • Dumb: 150,118μs, or about 6 fps. That’s 31μs per sprite.
  • On-the-fly: (Test incomplete)
  • Smart: (Test incomplete)

This test could not be completed because the RDP texture loading functionality only allows one TILE to be loaded at a time. These RDP functions are now deprecated, so it’s not worth trying again.
