Beating the C compiler...

rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Beating the C compiler...

Post by rothers »

Hi everyone, you've probably seen my mario/sonic game: https://toastyfox.com/zx/sonic.html

I'm trying to get the attribute copy code running as fast as possible. In that demo I just used:

address=0x5840-31;

for( p=scx; p<21+scx; p++ )
{
    memcpy( address, bgi4+q+timestable[p], 30);
    address+=32;
}

And yesterday returned to write it in ASM, but I can't seem to beat the compiler, I was sure:
ld (0x8002),sp ; store stack pointer.

ld sp, (0x8004) ; start of buffer line.
pop af
pop bc
pop de
pop hl
exx
pop bc
pop de
pop hl
ld sp,(0x8006) ; end of screen line.
push hl
push de
push bc
exx
push hl
push de
push bc
push af
ld sp,(0x8002)

ld hl,(0x8006) ; inc memory locations
add hl, 14
ld (0x8006) ,hl
ld hl,(0x8004)
add hl, 14
ld (0x8004) ,hl

repeat again 39 times...

Would beat it, but I don't know if I'm just doing it wrong, but it isn't faster... even when I unrolled it.

Any help how you guys would write the attribute space as fast as possible?
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

LD SP, (NN) is pretty slow (20T), I always use a macro to do a memcpy and only use a fixed address.

You can probably get the C compiler to produce the ASM source as well so you can compare. (Looks like it is option -l if you use this compiler https://github.com/z88dk/z88dk/wiki/Too ... mmand-line)

My memcpy routines, for 30 bytes you can just do both of these after each other. Interrupts must be disabled of course.


	MACRO MEMCPY16 dest, src
	ld sp, src
	pop af
	pop bc
	pop de
	pop hl
	exx
	ex af, af'
	pop af
	pop bc
	pop de
	pop hl
	ld sp, dest+16
	push hl
	push de
	push bc
	push af
	exx
	ex   af, af'
	push hl
	push de
	push bc
	push af	
	ENDM

	MACRO MEMCPY14 dest, src
	ld sp, src
	pop af
	pop bc
	pop de
	pop hl
	exx
	pop bc
	pop de
	pop hl
	ld sp, dest+14
	push hl
	push de
	push bc
	exx
	push hl
	push de
	push bc
	push af	
	ENDM
If you need a non-constant address you have to self-modify the code though (but you can do that at the end of the previous frame and not at the beginning of the frame i.e. start drawing immediately after the HALT assuming you have one).
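
As a rough sanity check on why the original block struggles, the published Z80 T-state counts can be tallied up. The sketch below does that in C. It assumes `add hl, 14` ends up as `ld de, 14` / `add hl, de`, that the compiler's memcpy boils down to an LDIR, and it ignores memory contention and call overhead, so treat the totals as ballpark figures only. The function names are made up for illustration.

```c
/* Published Z80 T-state counts for the instructions involved. */
enum {
    T_LD_SP_NN  = 10,  /* ld sp, nn (immediate)                */
    T_LD_SP_MEM = 20,  /* ld sp, (nn) and ld (nn), sp          */
    T_POP       = 10,
    T_PUSH      = 11,
    T_EXX       = 4,   /* same cost for ex af, af'             */
    T_LD_HL_MEM = 16,  /* ld hl, (nn) and ld (nn), hl          */
    T_LD_DE_NN  = 10,
    T_ADD_HL_DE = 11,
    T_LDIR_LOOP = 21,  /* per byte while BC is still non-zero  */
    T_LDIR_LAST = 16   /* final byte                           */
};

/* MEMCPY14 as posted: 2 immediate SP loads, 7 pops, 7 pushes, 2 exx. */
int tstates_memcpy14(void)
{
    return 2 * T_LD_SP_NN + 7 * T_POP + 7 * T_PUSH + 2 * T_EXX;
}

/* MEMCPY16 as posted: 8 pops, 8 pushes, 2 exx and 2 ex af, af'. */
int tstates_memcpy16(void)
{
    return 2 * T_LD_SP_NN + 8 * T_POP + 8 * T_PUSH + 4 * T_EXX;
}

/* The original 14-byte block: SP is saved, loaded twice and restored
   through memory (4 x 20T), then both pointers are bumped in RAM.
   ASSUMPTION: add hl, 14 costs the same as ld de, 14 / add hl, de. */
int tstates_original14(void)
{
    int copy = 4 * T_LD_SP_MEM + 7 * T_POP + 7 * T_PUSH + 2 * T_EXX;
    int bump = 2 * (T_LD_HL_MEM + T_LD_DE_NN + T_ADD_HL_DE + T_LD_HL_MEM);
    return copy + bump;
}

/* A plain LDIR copying n bytes (n >= 1). */
int tstates_ldir(int n)
{
    return (n - 1) * T_LDIR_LOOP + T_LDIR_LAST;
}
```

Under those assumptions the original block costs 341T per 14 bytes, which is more than the 289T an LDIR needs for the same 14 bytes. MEMCPY16 followed by MEMCPY14 comes to 379T for 30 bytes against 625T for an LDIR of 30 bytes.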

EDIT: So my game loop goes

mainloop:
HALT
; disable interrupts - although if you can guarantee you will finish drawing within a frame you don't need to do that
; draw everything
; enable interrupts
; read keyboard/joystick
; update game logic
; prepare graphics for next frame
; goto mainloop

For the longest time I used to read the keyboard and do the game logic immediately after the HALT, and then draw everything, that is not a good idea if you draw directly to the screen. For the very first frame you can jump into the mainloop after the draw if you don't know what to draw frame 1.

In my draw everything bit (also does erase everything first if you are not drawing the entire screen) I only stash the SP once at the beginning and restore after drawing. So my draw routines don't use the stack for subroutine calls etc. since SP is not available for normal use.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Also make sure what you are copying is in non-contended RAM (it's fine to store level data in contended RAM but then copy the data for the current level into a current level buffer in non-contended RAM at level init time - you can compress the level data as well if you want and decompress into non-contended RAM).

Also set SP to non-contended RAM (I do LD SP, 0 so stack starts at FFFF).
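
If it helps to check buffers and stack placement from the C side, a tiny helper can flag contended addresses. This is only a sketch for the 48K machine, where the 16K of RAM at 0x4000-0x7FFF is shared with the ULA; the function name is made up, and 128K models have a different layout that is not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* 48K Spectrum only: RAM at 0x4000-0x7FFF is contended (shared with the
   ULA); ROM below 0x4000 and RAM from 0x8000 upwards are not. */
bool is_contended_48k(uint16_t addr)
{
    return addr >= 0x4000 && addr <= 0x7FFF;
}
```

For example, the attribute area at 0x5800 is contended (unavoidable for the destination), while a source buffer or stack at 0x8000 and above is not.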
rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Re: Beating the C compiler...

Post by rothers »

Ah I see, I didn't know a 16byte copy was possible! Full screen :D

I'm doing a 'wait vbl' by writing bright black to the bottom of the screen and watching for it, then making that my 'halt'. It gives me more time to do things (I'm told this is a better technique than halt, but maybe I'm wrong?).

I do need to change the shadow screen address every frame, and maybe that's what's been causing issues, because I have no register left to use.

Are you doing this all unrolled? If not could I see how you loop it?
rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Re: Beating the C compiler...

Post by rothers »

ParadigmShifter wrote: Mon Feb 12, 2024 10:34 am Also make sure what you are copying is in non-contended RAM (it's fine to store level data in contended RAM but then copy the data for the current level into a current level buffer in non-contended RAM at level init time - you can compress the level data as well if you want and decompress into non-contended RAM).

Also set SP to non-contended RAM (I do LD SP, 0 so stack starts at FFFF).
Ah yeah don't worry I'm not using that for code, I inject the compressed levels in to it after the compile and just use it for that.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

rothers wrote: Mon Feb 12, 2024 10:37 am Ah I see, I didn't know a 16byte copy was possible! Full screen :D

I'm doing a 'wait vbl' by writing bright black to the bottom of the screen and watching for it, then making that my 'halt'. It gives me more time to do things (I'm told this is a better technique than halt, but maybe I'm wrong?).

I do need to change the shadow screen address every frame, and maybe that's what's been causing issues, because I have no register left to use.

Are you doing this all unrolled? If not could I see how you loop it?
Yeah I unroll everything. I'm not using the floating bus since it's more complicated but if you do that you have even more time available.

I'm not copying a full screen of attribs or anything I am only copying some graphic data from the screen to a buffer (so I can do a fast 8 pixel scroll downwards).

I'd unroll everything if I were you and self-modify the src, dst+N after drawing and game-logic update for the next frame. You need to modify 4 bytes per MEMCPY16/MEMCPY14.
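
To make the "4 bytes per macro" point concrete, here is a C model of one assembled MEMCPY16 and the patch applied to it. The opcode bytes are the standard Z80 encodings; the function and constant names are made up for illustration, and on the real machine the patching would of course be done in ASM on the live code rather than on a C array.

```c
#include <stdint.h>
#include <string.h>

enum {
    MEMCPY16_LEN = 26,  /* bytes of code per MEMCPY16 expansion        */
    SRC_OPERAND  = 1,   /* offset of the 16-bit operand of ld sp, src  */
    DEST_OPERAND = 14   /* offset of the operand of ld sp, dest+16     */
};

/* Machine code for one MEMCPY16 expansion with both addresses zeroed. */
const uint8_t memcpy16_template[MEMCPY16_LEN] = {
    0x31, 0x00, 0x00,        /* ld sp, src                       */
    0xF1, 0xC1, 0xD1, 0xE1,  /* pop af / pop bc / pop de / pop hl */
    0xD9, 0x08,              /* exx / ex af, af'                  */
    0xF1, 0xC1, 0xD1, 0xE1,  /* pop af / pop bc / pop de / pop hl */
    0x31, 0x00, 0x00,        /* ld sp, dest+16                    */
    0xE5, 0xD5, 0xC5, 0xF5,  /* push hl / de / bc / af            */
    0xD9, 0x08,              /* exx / ex af, af'                  */
    0xE5, 0xD5, 0xC5, 0xF5   /* push hl / de / bc / af            */
};

/* Store a 16-bit value little-endian, as the Z80 expects. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

/* Patch the four operand bytes of one expansion. */
void patch_memcpy16(uint8_t *code, uint16_t src, uint16_t dest)
{
    put16(code + SRC_OPERAND, src);
    put16(code + DEST_OPERAND, (uint16_t)(dest + 16));
}
```

At 26 bytes per expansion, 48 of them (a full 768-byte attribute screen) come to 1248 bytes of code.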
rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Re: Beating the C compiler...

Post by rothers »

Ah right, I suspected people were doing this unrolled. It just eats in to the RAM, but I think I've got enough space left, plus I can move some of the non-in game graphics in to low RAM. I can probably free up 2kb or so, and I'll give this a go... here is where I find out this is exactly how the compiler is doing it :lol:
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Have you tried the -l option to produce the assembler output from the C compiler (you are using z88dk right?).

I don't know anything about z88dk compiler but I have used the "output asm source" option many times on other platforms and all C compilers usually produce the assembly source if you ask for it.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Since you are scrolling 4 pixels at a time do you have 2 copies of the map attributes per level (one set for offset by 0 pixels, another for 4 pixels)?

That would definitely help if you aren't already doing that and can spare the memory.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

There is a way to loop the memcpy though, you just need a draw list which has src, dest+16 pointers per 16 bytes, and end it with a single byte of zero (I've just realised you are always copying to the same place as well, so you only need to store the src pointers in fact). So you would have a list like this

drawlist dw srcline0left, line0right, line1left, line1right, ... , line20left, line20right,
db 0
drawlistptr dw drawlist

(I think you need to store the words in opposite endian order to make the 0 terminator be the high byte rather than the low byte since it's a lot faster to check for a single byte of 0 rather than 2 bytes of 0)

You would modify the drawlist per frame depending on the scroll position obvs

Then for the drawloop (pseudocode)


short* hl = drawlistptr

while (*(char*)drawlistptr != 0) { ; only need to check 1 byte for 0 not both.
    drawlistptr++
    ; modify the src address for the following expanded macro using value in drawlist, update dest address for the current line 
    memcpy16 ; self modify the addresses as mentioned above
}

; done, reset drawlistptr
drawlistptr = drawlist
EDIT: Although that's very similar to what you were doing originally of course (may even be slower?). You can of course unroll more times if you like (since it's a line of contiguous attribs per 32 bytes you can do 2 easily and halve the size of the draw list). If you only do 30 bytes instead of 32 it's easy to change to do that as well.

EDIT2: And you don't need to terminate the list either since you are always drawing the same amount of attribs, which is even more similar to what you were originally doing.

I'd be interested to see what the C compiler is doing though if it was faster than your original approach! (I'm guessing it unrolled the loop and may use an inline stack based memcpy as well?). EDIT: If it did not unroll the loop, you want to rewrite your loops to count down from loop_max to 0 as well if possible since it's much faster to check != 0 than < loopend

EDIT3: Full unroll and self-modify the addresses outside of draw time is going to be the fastest thing you can do of course, worth doing that if you have the memory to spare.

EDIT4: If you don't want to fully unroll I'd at least unroll 8 full rows and draw 1/3 of the screen at a time - so 3 loops total (then you only need to modify the high byte of dest instead of modifying both bytes outside the loop, low bytes are constant just increasing by 32 each line), you can also do an inc (highbyte) instead of a read/modify/store
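
The reason only the high byte changes between screen thirds can be checked with a few lines of C. This is just the address arithmetic for the attribute area (0x5800, 32 bytes per row); the helper names are made up for illustration.

```c
#include <stdint.h>

/* Address of the first attribute byte of character row 0..23. */
uint16_t attr_row_addr(uint8_t row)
{
    return (uint16_t)(0x5800u + 32u * row);
}

/* High and low bytes of that address. */
uint8_t attr_row_high(uint8_t row)
{
    return (uint8_t)(attr_row_addr(row) >> 8);
}

uint8_t attr_row_low(uint8_t row)
{
    return (uint8_t)(attr_row_addr(row) & 0xFF);
}
```

Eight rows of 32 bytes are exactly 256 bytes, so rows 0, 8 and 16 share the same low byte and differ only in the high byte (0x58, 0x59, 0x5A). That is what lets an 8-row unrolled block be reused for each third with a single `inc` on the high byte of each patched address.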

EDIT5: If your assembler supports it (sjasmplus does) you can also add labels to the macro to make it easier to self modify the addresses


	MACRO MEMCPY16 dest, src, destmodaddr, srcmodaddr
srcmodaddr:
	ld sp, src
	pop af
	pop bc
	pop de
	pop hl
	exx
	ex af, af'
	pop af
	pop bc
	pop de
	pop hl
destmodaddr:
	ld sp, dest+16
	push hl
	push de
	push bc
	push af
	exx
	ex   af, af'
	push hl
	push de
	push bc
	push af	
	ENDM
and do

MEMCPY16 dest, src, destlabelline0lhs, srclabelline0lhs
MEMCPY16 dest, src, destlabelline0rhs, srclabelline0rhs

MEMCPY16 dest, src, destlabelline1lhs, srclabelline1lhs
MEMCPY16 dest, src, destlabelline1rhs, srclabelline1rhs

;etc.

which will give you the addresses you need to modify (label+1) as convenient labels (obvs need to be unique)
rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Re: Beating the C compiler...

Post by rothers »

ParadigmShifter wrote: Mon Feb 12, 2024 11:13 am Since you are scrolling 4 pixels at a time do you have 2 copies of the map attributes per level (one set for offset by 0 pixels, another for 4 pixels)?

That would definitely help if you aren't already doing that and can spare the memory.
Yes there are 2 copies which are generated on level boot, it's a little more complex than that as I have to account for brightness changes and mask them with blacks depending on the way they are 'facing'. But it works well.

It can also scroll upwards in 4px if I have room in memory to do that.
dfzx
Manic Miner
Posts: 704
Joined: Mon Nov 13, 2017 6:55 pm
Location: New Forest, UK
Contact:

Re: Beating the C compiler...

Post by dfzx »

rothers wrote: Mon Feb 12, 2024 10:51 am Ah right, I suspected people were doing this unrolled. It just eats in to the RAM, but I think I've got enough space left, plus I can move some of the non-in game graphics in to low RAM. I can probably free up 2kb or so, and I'll give this a go... here is where I find out this is exactly how the compiler is doing it :lol:
By default the z88dk memcpy function is the compiler's builtin one. i.e. it uses a simple ldir.

You can rebuild the library with flags which switch in loop unrolling for memcpy, memset and others, but obviously you didn't do that.
Derek Fountain, author of the ZX Spectrum C Programmer's Getting Started Guide and various open source games, hardware and other projects, including an IF1 and ZX Microdrive emulator.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

rothers wrote: Mon Feb 12, 2024 1:23 pm Yes there are 2 copies which are generated on level boot, it's a little more complex than that as I have to account for brightness changes and mask them with blacks depending on the way they are 'facing'. But it works well.

It can also scroll upwards in 4px if I have room in memory to do that.
Do you have 2 copies of the tiles you need to draw as well then (I'm guessing so).

How many different combinations are there for 2 tiles next to each other per frame? If there are only 8 possibilities you could do a Joffa Cobra style thing of emitting a series of push and exx instructions per row using all 8 available register pairs (BC, DE, HL, B'C', D'E', H'L', IX, IY) - assuming you are drawing all cells on the screen per frame. That might be slower than a series of LDI though if you have to use IX, IY or keep swapping between registers & shadow regs too often (but you may be able to preprocess the map somehow to give an optimal register allocation for different parts of the level minimising use of EXX and IX or IY? Complicated stuff that though lol).

Or since it seems you are only drawing vertical oblongs (4x8 pixel width, height) you could scroll 1 pixel row of each line into a buffer (so a buffer of 32 bytes is needed for a full screen scroll per line), and blit the same data 8 times?

You may even be able to use the mythical RLD or RRD instructions if you do that (although 4 rlca is faster than a single RRD or RLD, those seem to have the advantage of writing the result back to memory)? I tried once to use RRD or RLD but my code was bugged and I was drunk when I was debugging it so I went back to using 4xRLCA :)

Your approach seems fast enough already though so don't over-think things like I have just done lol if speed is good enough already.

; W,X,Y,Z are the nybbles of register contents
ld A,$WX
ld (HL),$YZ
RLD
; A = $WY
; (HL) = $ZX

I think RLD and RRD are the only opcodes I haven't yet used (apart from the apparently bugged OUTI, INI instructions). Oh and IM 0 (also unusable on the speccy IIRC).
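
For anyone wanting to check the nybble shuffle above without an emulator, here is a small C model of what RLD does to A and (HL). It is only a sketch of the data movement; the flags are not modelled and the function name is made up.

```c
#include <stdint.h>

/* Model of Z80 RLD: the low nybble of (HL) moves to its high nybble,
   the old high nybble of (HL) moves to the low nybble of A, and the
   old low nybble of A moves to the low nybble of (HL).
   The high nybble of A is unchanged. Flags are not modelled. */
void rld(uint8_t *a, uint8_t *mem)
{
    uint8_t new_mem = (uint8_t)((*mem << 4) | (*a & 0x0F));
    *a   = (uint8_t)((*a & 0xF0) | (*mem >> 4));
    *mem = new_mem;
}
```

With A = 0x12 and (HL) = 0x34, this leaves A = 0x13 and (HL) = 0x42, matching the $WY / $ZX pattern above.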
rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Re: Beating the C compiler...

Post by rothers »

ParadigmShifter wrote: Mon Feb 12, 2024 2:59 pm Do you have 2 copies of the tiles you need to draw as well then (I'm guessing so).

How many different combinations are there for 2 tiles next to each other per frame? If there are only 8 possibilities you could do a Joffa Cobra style thing of emitting a series of push and exx instructions per row using all 8 available register pairs (BC, DE, HL, B'C', D'E', H'L', IX, IY) - assuming you are drawing all cells on the screen per frame. That might be slower than a series of LDI though if you have to use IX, IY or keep swapping between registers & shadow regs too often (but you may be able to preprocess the map somehow to give an optimal register allocation for different parts of the level minimising use of EXX and IX or IY? Complicated stuff that though lol).

Or since it seems you are only drawing vertical oblongs (4x8 pixel width, height) you could scroll 1 pixel row of each line into a buffer (so a buffer of 32 bytes is needed for a full screen scroll per line), and blit the same data 8 times?

You may even be able to use the mythical RLD or RRD instructions if you do that (although 4 rlca is faster than a single RRD or RLD, those seem to have the advantage of writing the result back to memory)? I tried once to use RRD or RLD but my code was bugged and I was drunk when I was debugging it so I went back to using 4xRLCA :)

Your approach seems fast enough already though so don't over-think things like I have just done lol if speed is good enough already.

; W,X,Y,Z are the nybbles of register contents
ld A,$WX
ld (HL),$YZ
RLD
; A = $WY
; (HL) = $ZX

I think RLD and RRD are the only opcodes I haven't yet used (apart from the apparently bugged OUTI, INI instructions). Oh and IM 0 (also unusable on the speccy IIRC).
I did try unrolling a very long list of LDI but it takes up a lot of memory to do that, I think it was the best performing one, I'm going to write some basic benchmark routines tonight to test all this.

The reason I'm trying to push it to the speed limit is so I can have as many enemies in the game as I can, ideally at least 3 16x16 (24x16 with shifting) pixel based enemies plus the player character, plus the 4*8 based baddies while maintaining 50fps.

It is ALMOST there. It's just one good optimisation jump away from it.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Scrolling what is on the screen already might be best if that would work? (You'd have to erase sprites then of course).

By which I mean (if your tiles are always 4x8 pixels) you can copy the first line of each character row into a 32 byte buffer, scroll that right or left 4 pixels (bring in new pixels at the edge that becomes visible) then blit that 8 times using those memcpy routines from the buffer.

Maybe that doesn't work with how you are doing the different bright levels though.

EDIT: Even if you don't scroll what is currently on screen if all your tiles are 4x8 pixels solid or blank that is probably the way to go - maybe that is what you are doing already though? That way you only need to have a screen buffer of 32 bytes per 8 pixels, so you could store the entire screen as 32x24 bytes of pixel data (you could reduce mem usage to just 32 bytes if you build the data before you blit it though -- that might be too slow to beat the raster of course). I think that would be fine on 128K where you can swap between 2 screen buffers very quickly (I've never done any 128K programming myself but that is the best way to do it IIRC).

That's similar to how I am doing my scroll down 8 pixels directly to the screen in my latest project though... I copy character row 7 to a buffer then copy row 6 to row 7, row 5 to row 6, ..., row 0 to row 1 then bring in what is needed at the top. The next screen third I use the buffer I copied into as the pixels to copy to row 8. That's what I am using my memcpy macros for anyway (copying "critical strips" i.e. row 7 and row 15 of the screen to a buffer).

I'm also only scrolling 16 pixel wide columns down as well (and there's an optimisation to only scroll what I need).

It's not quite finished yet what I am doing though (attribute scrolling is needed as well) so I haven't managed to see how well it performs (and I can think of some optimisations scrolling multiple columns at a time if they are next to each other - but that is more complicated still). Aim is to be less flickery than SJOE anyway, we will see how it works out. I got sidetracked by another good optimisation which is not erasing any pixel data just set the ink and paper to the background colour like what TLL and Cyclone does.

Scrolling up 8 pixels would be a great deal easier than scrolling down 8 pixels anyway, gravity is in the wrong direction lol.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Hmm that's given me an idea now I wonder how fast I could draw an entire screen at half resolution. Would need 32x96 bytes = 3K of buffers for that.

Can compress the image data to 4 bits per 2x2 pixel as well then.

Might have a try of that later on (post beer o'clock that will be though lol).
rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Re: Beating the C compiler...

Post by rothers »

It only has to copy the attributes, there is no pixel data, that's all saved for the sprites. The screen is set up with a chessboard like image which is manipulated to get the scrolling.

So a copy of 570 bytes.

I've written my own sprite routines, and I'll come on to those later, as I know there is some black magic for speeding them up out there. The whole engine uses the attribute layer like a tile set for all functions, so the code for the game is pretty tiny.

I absolutely want this to run on a 48k machine as that was my machine (handed down to me as a kid) and I want to show what it can do! The engine actually fits in to the 16k spectrum, but there is no room for levels. The 16k could run levels at 8x8.

I know it would be far easier on the 128k, and I'll probably add 128k music if I can find a fast player (I did try one from GitHub, but it caused random performance drops all over the place), but 48k is my target and no multi-load. The whole game has to fit in to 48k. Each level compresses down to about 1-2k, so I can get 10 big levels and far more once I code in some decent tile sets.

This game has been in my head since I was a little kid, and I'm just getting it out of my system :lol: I'm also doing some work on the GBC and SMS.

Here is the screen copy routine in C:

address=0x5840-31;

if (i==0){
    for( p=scx; p<21+scx; p++ )
    {
        memcpy( address, bgi4+q+timestable[p], 30);
        address+=32;
    }
}
else{
    for( p=scx; p<21+scx; p++ )
    {
        memcpy( address, bgi5+q+timestable[p], 30);
        address+=32;
    }
}

That's it, it alternates depending on which 4x screen is showing.
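
Staying in C for a moment, the two loops can be folded into one helper that takes the source base as a parameter and counts down to zero, as suggested earlier in the thread. This is only a sketch: the helper name and the `uint16_t` type for the offsets table are assumptions, and whether zcc actually generates faster code for it has not been measured here.

```c
#include <stdint.h>
#include <string.h>

enum { ROWS = 21, ROW_BYTES = 30, STRIDE = 32 };

/* Copy ROWS rows of ROW_BYTES attributes. Destination rows are STRIDE
   bytes apart; row_offset[] plays the role of timestable[] starting at
   index scx. ASSUMPTION: the offsets fit in 16 bits. */
void copy_attrs(uint8_t *dst, const uint8_t *src_base,
                const uint16_t *row_offset)
{
    uint8_t rows = ROWS;
    do {
        memcpy(dst, src_base + *row_offset++, ROW_BYTES);
        dst += STRIDE;
    } while (--rows != 0);  /* count down so the test is against zero */
}
```

With the names from the post, the call would look something like `copy_attrs(address, (i == 0 ? bgi4 : bgi5) + q, &timestable[scx]);`, which picks the source buffer once instead of duplicating the loop.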

I've tried all the ASM routines and none of them seem to beat it, bizarrely. It could be because I keep having to adjust the memory address and I don't know many of the Z80 tricks others do.

I call them with:
if (i==0) qpop(bgi4);
else qpop(bgi5);

As zcc allows you to insert data in to HL.

I have the memory address in RAM which I read out each write as there are no registers left, I've not coded in Z80 since about 2000 when I was a kid.

LD DE, (0x8000)
ADD HL, DE //inc HL to correct point in the shadow screen

ld (0x8004),hl //put bg location in ram

LD DE, 0x5830 //screen location
ld (0x8006),DE //screen location to ram

ld (0x8002),sp ; store stack pointer.


ld sp, (0x8004) ; start of buffer line.
pop af
pop bc
pop de
pop hl
exx
pop bc
pop de
pop hl

ld sp,(0x8006) ; load screen location from 8006
push hl
push de
push bc
exx
push hl
push de
push bc
push af

ld sp,(0x8002) //put stack back

ld hl,(0x8006) //load back in the settings and increase them
add hl, 14
ld (0x8006) ,hl
ld hl,(0x8004)
add hl, 14
ld (0x8004) ,hl

( repeat dozens of times etc etc)
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Oh right I hadn't thought about doing it like that (surely stripes would be better than a checkerboard pattern though?), that's quite clever.

You should definitely be able to write 768 bytes to attribs very fast using the stack... probably before the scanline hits the top of the screen visible area if you start drawing immediately I would have thought.

I might have a go and see how fast I can do an unrolled attrib copy routine in ASM later on where each line of attribs can point at an arbitrary address.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

This code (uses funky sjasmplus REPT with index arguments though, I was just writing it as fast as possible) blits 768 bytes to attribs when the border is red (border is white while waiting for the vblank), this is the fastest you can possibly do a blit immediately following a vblank.

It unrolls MEMCPY16 48 times, each unroll draws 16 bytes to the attribs area


	; TEST
.testagain
	ld a, 2
	out (#fe), a

	REPT 48, idx
	MEMCPY16 ATTRIBS_ADDR + idx*16, idx*16
	ENDR

	ld a, 7
	out (#fe), a

	halt

	jp .testagain
I'll try it in a loop next with 48 pointers to the source addresses... I'm using the ROM for the attrib data ;)

Image

EDIT: Using an indirect src address pointer, amazing that this works with sjasmplus, the REPT with an index and macro expansion is quite powerful (not as good as C's #define though unfortunately), was surprised ;)

Image


	; TEST
.testagain
	ld a, 2
	out (#fe), a

	REPT 48, idx
	MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
	ENDR

	ld a, 7
	out (#fe), a

	halt

	jp .testagain

attribptrs
	REPT 48, idx
	dw idx*16
	ENDR
so that is expanding this line of the macro

ld sp, src

with

ld sp, (attribptrs + 0)

next expansion it becomes

ld sp, (attribptrs + 2) ; and it is evaluating attribptrs + idx*2 at compile time


etc.

So expanding the code like that means you can blit all the attribs from a table of 48 pointers to half-rows of attrib data before the raster reaches the top of the drawable area (update the pointers after drawing to change the addresses it draws from each frame)
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Super quick and dirty animation of the ROM with scrolling

Image

Code, I expect only sjasmplus will be able to compile this code



ATTRIBS_ADDR	EQU	#5800
ORGADDR	EQU #8000

; sjasmplus.exe --sym=out.sym --syntax=f --raw=out.bin attribtest.asm

	ORG ORGADDR
	
	MACRO MEMCPY16 dest, src
	ld sp, src
	pop af
	pop bc
	pop de
	pop hl
	exx
	ex af, af'
	pop af
	pop bc
	pop de
	pop hl
	ld sp, dest+16
	push hl
	push de
	push bc
	push af
	exx
	ex   af, af'
	push hl
	push de
	push bc
	push af	
	ENDM
	

main:
        ; code which draws the ruler on the right hand side removed ;)
	; TEST
.testagain
	ld a, 2
	out (#fe), a

	REPT 48, idx
	MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
	ENDR

	ld a, 7
	out (#fe), a

	; update lower bytes of all 48 src addresses for next frame
	ld b, 48
	ld hl, attribptrs
.updateptrs
	inc (hl)
	inc hl ; can save a jiffy by aligning attribptrs table so it does not cross a 256 byte boundary, then you can use inc l here
	inc hl
	djnz .updateptrs ; you can unroll this loop 48 times as well to save another microjiffy

	halt


	jp .testagain

attribptrs
	REPT 24, idx
	dw idx*256, idx*256 + 16 
	ENDR
Each frame it just increments the low byte of each of the 48 pointers in the table (which is why there's a glitch when they wrap around back to 0)
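
The wrap-around glitch is easy to reproduce in C: `inc (hl)` on the low byte of a little-endian pointer never carries into the high byte. The sketch below models that next to a full 16-bit increment; both function names are made up for illustration.

```c
#include <stdint.h>

/* Mirrors inc (hl) applied to the low byte of a stored pointer:
   the low byte wraps 0xFF -> 0x00 and the high byte is untouched. */
uint16_t inc_low_byte_only(uint16_t ptr)
{
    uint8_t lo = (uint8_t)(ptr & 0xFF);
    lo++;
    return (uint16_t)((ptr & 0xFF00) | lo);
}

/* A full 16-bit increment, for comparison. */
uint16_t inc_full(uint16_t ptr)
{
    return (uint16_t)(ptr + 1);
}
```

So a pointer at 0x02FF steps back to 0x0200 instead of moving on to 0x0300, which is the jump visible in the animation.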

EDIT: Updating the pointers is quite slow lol (cyan border)

Image


	; TEST
.testagain
	ld a, 2
	out (#fe), a

	REPT 48, idx
	MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
	ENDR

	ld a, 5
	out (#fe), a

	; update lower bytes of all 48 src addresses for next frame
	ld b, 48
	ld hl, attribptrs
.updateptrs
	inc (hl)
	inc hl
	inc hl
	djnz .updateptrs

	ld a, 7
	out (#fe), a

	halt


	jp .testagain
Last edited by ParadigmShifter on Mon Feb 12, 2024 10:51 pm, edited 1 time in total.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

Anyway I hope you get the idea.

For sprites you may have to draw them at just the right time (i.e. draw them ordered according to where they are on the y axis, intermingled with drawing the attribute rows) to beat the raster which will complicate the code quite a bit I expect.

If you want to reduce the code size you can do it in a loop and self modify the code based on the src and dest pointers as I hinted at earlier... all of those methods will be slower than the code I posted though which is max unrolled.

EDIT: There should really be a DI and an EI around my code which abuses the stack (and a save/restore SP as well).

EDIT2: So this is safer. Stack would be pointing at attribute memory when the interrupt goes off (since I have not changed from the ROM interrupt). I guess that worked out ok since it was pointing at RAM not ROM and even if the attribs got corrupted I redraw them all before the raster reached any corruption ;)

In my actual code I do more stuff with the SP (I erase and draw all my sprites without using SP for anything other than data transfer, i.e. I can't call any functions).


	; TEST
.testagain
	ld a, 2
	out (#fe), a

	di ; about to mess with SP, best disable interrupts
	ld (.restoresp+1), sp

	REPT 48, idx
	MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
	ENDR

.restoresp
	ld sp, 0 ; restore SP before turning on interrupts
	ei ; we're done abusing the stack now. Safe to call subroutines and for interrupts to go off

	ld a, 5
	out (#fe), a

	; update lower bytes of all 48 src addresses for next frame
	ld b, 48
	ld hl, attribptrs
.updateptrs
	inc (hl)
	inc hl
	inc hl
	djnz .updateptrs

	ld a, 7
	out (#fe), a

	halt


	jp .testagain
rothers
Drutt
Posts: 35
Joined: Sat Dec 30, 2023 2:50 pm

Re: Beating the C compiler...

Post by rothers »

Thank you! I've finished hard coding the sprites now with ASM lookup tables and I think they are as fast as they can be, so now it's just this attribute copy to get working as fast as possible.

The rest of the code is really fast, using the attributes as a tile map is speedy!

I'm also coding a super compressed way to store levels, and that should be it, ready to release. You can actually load super mario bros levels in to it, but I'm only using that to test speed vs the NES.

I'll probably then look at that C64 Sonic port and see what I can do on the 48k. Really enjoying this, it's like my morning crossword puzzle every day.
ParadigmShifter
Manic Miner
Posts: 924
Joined: Sat Sep 09, 2023 4:55 am

Re: Beating the C compiler...

Post by ParadigmShifter »

You probably want to use Einar's ZX0 for compression unless your level format is really simple to run-length encode.

I run length encoded the levels for my Manic Miner remake, here is an example so you get the idea of what I did. This was my first attempt at ASM programming so probably could do a lot better. I could compress a lot more by packing the row/column into 9 bits and the repeat count into the high bits of a 16 bit number rather than using a byte for each.



; some macros to pack pointers to graphics into 8 bits rather than 16
	MACRO celltype gfx
	db (gfx - gfx_platform0) / 8
	ENDM

	MACRO keytype gfx
	db (gfx - gfx_key0) / 8
	ENDM
	
	; macro for ink, paper, bright
	MACRO IPB ink, paper, bright
	db ink|(paper<<3)|(bright<<6)
	ENDM


; Central Cavern (Spectrum Version)
Central_Cavern:
	dc "Central Cavern"

	; border/paper
	db 2

	; cell graphics
	celltype gfx_platform0
	celltype gfx_wall0
	celltype gfx_spiky0
	celltype gfx_crumbly0
	celltype gfx_platform0
	celltype gfx_platform0
	celltype gfx_spiky1
	celltype gfx_conveyor0
	celltype gfx_conveyor0

	; cell attribs
	IPB 2, 0, 1
	IPB 6, 2, 0
	IPB 4, 0, 1
	IPB 2, 0, 0
	IPB 0, 0, 0 
	IPB 0, 0, 0 
	IPB 5, 0, 0
	IPB 4, 0, 0
	IPB 4, 0, 0

	; willy start position x, y. bit 4 of y is set if facing left
	db 2, 13

	; exit position
	db 29, 13
	; exit colour
	IPB 6, 1, 0

	; keytype
	keytype gfx_key0

	db 5	; number of keys
	; position of keys
	db 9, 0
	db 29, 0
	db 16, 1
	db 24, 4
	db 30, 6

	; guardians
	db 1		; number of guardians
	; something like start position, end position of patrol path and some other stuff I can't remember ;) Obvs the attribs (64+6) here too.
	; seems to be terminated with a 0 since some enemies need extra data
	db gfx_robot0/256, 0, 8, 7, 8, 15, 64+6, 0

	; single blocks
	db SPIKY_A, 23, 4
	db SPIKY_A, 27, 4
	db SPIKY_A, 21, 8
	db SPIKY_A, 12, 12
	db SPIKY_B, 11, 0
	db SPIKY_B, 16, 0

	; repeat blocks: platforms
	db HORZ_REPEAT|PLATFORM_A, 1, 5, 30
	db HORZ_REPEAT|PLATFORM_A, 1, 7, 3
	db HORZ_REPEAT|PLATFORM_A, 1, 9, 4
	db HORZ_REPEAT|PLATFORM_A, 29, 10, 2
	db HORZ_REPEAT|PLATFORM_A, 28, 12, 3
	db HORZ_REPEAT|PLATFORM_A, 5, 13, 15
	db HORZ_REPEAT|PLATFORM_A, 1, 15, 30
	; crumbly platforms
	db HORZ_REPEAT|CRUMBLY, 14, 5, 4
	db HORZ_REPEAT|CRUMBLY, 19, 5, 4
	db HORZ_REPEAT|CRUMBLY, 23, 12, 5
	; walls
	db HORZ_REPEAT|WALL_A, 17, 8, 3
	db HORZ_REPEAT|WALL_A, 20, 12, 3
	; conveyor
	db HORZ_REPEAT|CONVEYOR_L, 8, 9, 20
	db #ff ; terminator

	ENDIF

; The Cold Room
The_Cold_Room:
	dc "The Cold Room"

	; border/paper
	IPB 2, 1, 0

	; cell graphics
	celltype gfx_platform0
	celltype gfx_wall0
	celltype gfx_spiky0
	celltype gfx_crumbly0
	celltype gfx_platform0
	celltype gfx_platform0
	celltype gfx_spiky4
	celltype gfx_conveyor0
	celltype gfx_conveyor0

	; cell attribs
	IPB 3, 1, 1
	IPB 6, 2, 0
	IPB 0, 0, 0
	IPB 3, 1, 0
	IPB 0, 0, 0 
	IPB 0, 0, 0 
	IPB 5, 1, 0
	IPB 6, 1, 0
	IPB 6, 1, 0

	; willy start position x, y. bit 4 of y is set if facing left
	db 2, 13

	; exit position
	db 29, 13
	; exit colour
	IPB 3, 2, 1

	; keytype
	keytype gfx_key2

	db 5|(1<<4)	; number of keys/paper colour
	db 7, 1
	db 25, 1
	db 26, 7
	db 3, 9
	db 19, 12

	; guardians
	db 2		; number of guardians
	db gfx_penguin0/256, 7, 18, 3, 1, 18
	IPB 6, 1, 0
	db 0
	db gfx_penguin0/256, 7, 29, 13, 12, 29
	IPB 5, 1, 0
	db 0
	
	; single blocks
	db SPIKY_B, 30, 1
	db PLATFORM_A, 25, 3
	db PLATFORM_A, 1, 7
	
	; repeat blocks
	db HORZ_REPEAT|WALL_A, 19, 0, 12
	db HORZ_REPEAT|PLATFORM_A, 1, 5, 19
	db HORZ_REPEAT|CRUMBLY, 21, 3, 4
	db HORZ_REPEAT|PLATFORM_A, 21, 6, 4
	db HORZ_REPEAT|CRUMBLY, 26, 6, 2
	db HORZ_REPEAT|CRUMBLY, 2, 7, 5
	db HORZ_REPEAT|PLATFORM_A, 9, 9, 7
	db HORZ_REPEAT|CRUMBLY, 19, 10, 4
	db HORZ_REPEAT|CONVEYOR_R, 3, 11, 4
	db HORZ_REPEAT|PLATFORM_A, 14, 12, 4
	db HORZ_REPEAT|CRUMBLY, 8, 13, 4
	db HORZ_REPEAT|PLATFORM_A, 1, 15, 30
	db VERT_REPEAT|WALL_A, 25, 6, 7
	db VERT_REPEAT|WALL_A, 28, 5, 8
	db VERT_REPEAT|CRUMBLY, 26, 8, 5
	db VERT_REPEAT|CRUMBLY, 27, 8, 5
	db #ff
So I had horizontal and vertical repeating cells in the level. Could also add rectangles instead of just horizontal/vertical repeat.
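
A decoder for records in that style could look roughly like the C sketch below. The flag values (0x80 and 0x40), the 6-bit type mask and the 32x16 grid are assumptions made for the example; the real values of HORZ_REPEAT, VERT_REPEAT and the cell type constants are not shown in the listing above. It only handles the block records (single blocks, repeats and the #ff terminator), not the header fields before them.

```c
#include <stdint.h>

enum {
    GRID_W = 32, GRID_H = 16,
    HORZ_REPEAT  = 0x80,  /* ASSUMED flag value */
    VERT_REPEAT  = 0x40,  /* ASSUMED flag value */
    TYPE_MASK    = 0x3F,  /* ASSUMED: low 6 bits hold the cell type */
    END_OF_LEVEL = 0xFF
};

/* Expand block records into grid[y][x].
   Single block:  type, x, y
   Repeat block:  flag|type, x, y, count
   Returns the number of bytes consumed, including the terminator. */
unsigned decode_blocks(const uint8_t *rec, uint8_t grid[GRID_H][GRID_W])
{
    const uint8_t *p = rec;
    while (*p != END_OF_LEVEL) {
        uint8_t head  = *p++;
        uint8_t x     = *p++;
        uint8_t y     = *p++;
        uint8_t count = (head & (HORZ_REPEAT | VERT_REPEAT)) ? *p++ : 1;
        uint8_t type  = head & TYPE_MASK;
        /* Stop early rather than write outside the grid. */
        while (count-- != 0 && x < GRID_W && y < GRID_H) {
            grid[y][x] = type;
            if (head & VERT_REPEAT) y++; else x++;
        }
    }
    return (unsigned)(p - rec) + 1;
}
```

A rectangle record would be a natural extension: one more flag and a second count byte, with the inner loop running over both axes.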

Guardian sprite data starts on a 256 byte aligned boundary so I can just use 8 bits for those as well.
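
The tighter packing mentioned before the listing (position in 9 bits, repeat count in the high bits of a 16-bit word) could be sketched in C like this. The exact bit layout is an assumption: 5 bits of x and 4 bits of y cover a 32x16 play area, which leaves 7 bits for a count of up to 127. The function names are made up for illustration.

```c
#include <stdint.h>

/* ASSUMED layout: bits 0-4 = x (0..31), bits 5-8 = y (0..15),
   bits 9-15 = repeat count (0..127). */
uint16_t pack_run(uint8_t x, uint8_t y, uint8_t count)
{
    return (uint16_t)(((count & 0x7Fu) << 9) |
                      ((y & 0x0Fu) << 5) |
                      (x & 0x1Fu));
}

uint8_t run_x(uint16_t w)     { return (uint8_t)(w & 0x1F); }
uint8_t run_y(uint16_t w)     { return (uint8_t)((w >> 5) & 0x0F); }
uint8_t run_count(uint16_t w) { return (uint8_t)(w >> 9); }
```

That turns the three bytes of x, y and count in each repeat record into two, at the cost of a few shifts and masks when the level is unpacked.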