# Reverse engineering Microsoft BASIC

## Metadata

- **Channel:** Ben Eater
- **YouTube:** https://www.youtube.com/watch?v=aVVKgwr_SfQ
- **Views:** 140,967
- **Source:** https://ekstraktznaniy.ru/video/20725

## Description

Code: https://github.com/beneater/msbasic
More 6502: https://eater.net/6502

Support these videos on Patreon: https://www.patreon.com/beneater or https://eater.net/support for other ways to support.

0:00 - Previous functionality
1:28 - Analyzing PRINT code
6:24 - Looking at what's in the zero page
16:38 - Copying logic from PRINT
19:30 - String descriptors and memory management
25:23 - Testing it out

------------------

Social media:
Website: https://www.eater.net
Twitter: https://x.com/beneater
Patreon: https://patreon.com/beneater
Reddit: https://www.reddit.com/r/beneater

Special thanks to these supporters for making this video possible:
Adam Bursey, Adrien Friggeri, Aleksey Smolenchuk, aliceitc, Anthony Weems, anula, Ben, Ben Cochran, Benjamin D. Williams, Benjamin Elder, Benjamin Keil, Benji Bromberg, Bill Cooksey, Binh Tran, Borko Rajkovic, Bradley Stach, Brian Haug, Carl Fooks, Carsten Schwender, Chad Fertig, Chai, Chris Anders, Chris Lajoie, Craig Hawco, criis, Cristi

## Transcript

### Previous functionality [0:00]

In my previous video I modified the version of Microsoft BASIC running on my breadboard 6502 computer to add a couple of new instructions: LCDPRINT, which sends a character code to the LCD to display text, and LCDCMD, which sends commands to the LCD, like clearing the screen. But to display a string I needed to write a program like this, with a loop that sends each character code separately. I mentioned in the last video that it might be nice if you could just say LCDPRINT with a string like this. It might even be noticeably faster. Well, that's what I'm going to do in this video.

Just a quick reminder that I still sell the kits to build your own 6502 computer, including the serial interface and programmer that I use. Everything's available on my website, eater.net/6502, so check that out if you're interested.

But anyway, to modify the LCDPRINT instruction to handle a string instead of a single character code, let's take a look at the code for LCDPRINT. You can see the first thing it does is jump to a subroutine, GETBYT, which I discovered by looking through the code for POKE in the previous video. I want to be clear: I'm by no means an expert when it comes to this Microsoft BASIC code, or, to be honest, serious assembly programming. I'm just going to walk you through my guesswork and thought process in this video in case you might find bits of it interesting. I'm sure there are people out there who know this stuff way better than I do and probably think I'm going about this all wrong, but I'm just trying to figure this out and I thought I'd bring you along for the journey. So instead of

### Analyzing PRINT code [1:28]

GETBYT, I know we're going to need some other code here to parse a string. The closest command that does something similar to what I want is of course the PRINT instruction, and the code for that is over here in the print.s source file, so that seems like a reasonable place to start looking.

Here's where the code starts for the PRINT statement. PRINT, of course, if I just say PRINT "HELLO", prints HELLO. As I started reading through this, it was a little confusing because of all these conditional defines: this particular codebase can build different versions of Microsoft BASIC for different platforms. One thing that simplifies this is to temporarily delete the conditional compilation stuff that doesn't apply, which makes it a little easier to follow.

The first thing that happens when we come into PRINT is a branch-if-equal that seems to handle some special case, maybe to do with carriage returns, I'm not really sure. If that doesn't branch, then this also isn't going to branch. Then it looks like we're comparing the A register with a couple of different tokens, the token for the TAB instruction and the token for the SPC instruction, and jumping to different things. I think this has to do with formatting. Then we compare with a comma and branch somewhere else if it's a comma, then we check for 3B, which is a semicolon, and branch somewhere else. Otherwise we drop down to here. So all of this code seems to be handling special cases, which might be relevant and might not be. For now I'm looking for the most common path through this code, which seems to put us here, jumping to this subroutine, FRMEVL.

It turns out this subroutine is pretty interesting. It's defined over in eval.s, so let's take a look at that. Here it is, and there's a comment that in theory says what it does: evaluate the expression at TXTPTR, leaving the result in FAC; works for both string and numeric expressions. When I first started poking through this code I didn't really know what any of this meant. What's TXTPTR? What's FAC? What even is FRMEVL? Is it "form evaluate", "formula evaluate", "from evaluated"? I'm still not entirely sure. I think "formula evaluate" is probably the closest to what this does, so that's what I'm going to go with.

TXTPTR and FAC are both defined over in zeropage.s, and that does mean something. The zero page for the 6502 processor refers to the first 256 bytes of address space. Here's the address layout for the computer: half of it is ROM, this first section here is RAM, and then we've got some addresses here reserved for I/O. The first 256 bytes, from 0000 to 00FF, are in RAM and are used to store frequently used variables, because the 6502 executes instructions that read or modify data in the zero page faster than the same instruction accessing data elsewhere in memory. So in a program for the 6502 you want to put all your most frequently accessed data into the zero page if possible.

If we look at zeropage.s from Microsoft BASIC, we can see all these variables that are defined and how many bytes are reserved for each. That includes this FAC variable as well as, towards the end down here, TXTPTR. But with all of the conditional compilation in this file, it's a little difficult to figure out exactly where in the zero page each of these variables is going to end up on our particular computer. We could figure it out by interpreting all of the conditional compilation, but an easier way for me to be sure is just to have the computer print it out.

So if we go back to the code for my custom LCDPRINT instruction, I'm going to modify it to print out the addresses of a couple of zero page variables. Load A with the value of the FAC symbol itself, that is, its address, and then jump to a print-byte subroutine, and that'll print out which byte in the zero page that FAC variable is. Then I'll do the same for TXTPTR. In the process of figuring all of this out there were a couple of other zero page variables that were interesting as well, so I'm going to print out a couple more: INPUTBUFFER and VALTYP. I'll talk more about those in a little bit. After printing out all those variable addresses I'm going to jump to address FF00, which is the address of Wozmon. So for now, to help understand how this works, I'm just going to drop into Wozmon and then we can poke around from there. This is my new LCDPRINT instruction for now.

I'll save that, build it, and what did I do wrong? Did I mistype that? I think what I wanted was VALTYP, without the E. Yep, that's the one. So with that new code we'll put the ROM back in and reset. So now run
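The address layout described above can be sketched as a small classifier. This is only an illustration in Python, not part of the BASIC source: it assumes ROM occupies the upper half of the address space starting at 8000 and RAM runs from 0000 to 3FFF, lumps everything in between together as "I/O or unmapped", and the function name `classify` is made up for this example. The sample addresses are ones that come up in this walkthrough.

```python
# Assumed memory map: RAM at $0000-$3FFF (its first 256 bytes are the
# zero page) and ROM in the upper half of the 64 KiB address space.
# The exact I/O ranges are not modelled here.
ZERO_PAGE_END = 0x00FF
RAM_END = 0x3FFF
ROM_START = 0x8000


def classify(addr):
    """Return a rough label for a 16-bit address on this memory map."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("6502 addresses are 16 bits wide")
    if addr <= ZERO_PAGE_END:
        return "zero page"
    if addr <= RAM_END:
        return "RAM"
    if addr >= ROM_START:
        return "ROM"
    return "I/O or unmapped"


for addr in (0x00B8, 0x0405, 0x3FF3, 0x8276, 0xFF00):
    print(f"${addr:04X}: {classify(addr)}")
```

A zero page address fits in a single byte, which is why printing one byte per variable is enough to locate FAC, TXTPTR, INPUTBUFFER and VALTYP.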

### Looking at what's in the zero page [6:24]

BASIC at address 8000, and now if I do LCDPRINT "Hello, world!" it prints out those addresses and then drops to Wozmon. So I'm actually back in Wozmon; that's what the backslash indicates there. B8 is the address of that FAC variable, D1 is TXTPTR, 12 is INPUTBUFFER, and 66 is VALTYP. That's what we've learned.

Now that we're in Wozmon, I can print out the current state of the zero page so we can see what it actually looks like when our LCDPRINT subroutine is called. If I type 0.FF, that prints that range of addresses and we can see what's in memory. What I noticed is that the value of TXTPTR, which is at address D1, is 14, and I think it's a two-byte value in little-endian format, so the first byte is the little end and 0014 is the address. If we look at what's at address 0014, we see a 22, which is the ASCII value for a quote, followed by our string: 48 is a capital H, 65 is a lowercase e, and so on until we get to another 22, which is the close quote, followed by a null. So we found the string.

I guess we could just print that to the LCD, but I think we can do better, because suppose it's not a string literal. Let me restart BASIC, and let's say we have a variable X$ equal to some string. We should be able to do LCDPRINT X$. If we look at 00.FF to see our zero page again, TXTPTR is still at D1 and it still holds address 0014. Going back to 0014, we see a capital X and a dollar sign, so it's literally X$. Ideally we'd want to evaluate that and resolve it to the value of that string, so instead of the literal X$ we'd see HELLO.

It turns out that's exactly what this FRMEVL subroutine does. The comment says it evaluates the expression at TXTPTR, which we now know is just whatever was typed after the LCDPRINT instruction, and it leaves the result in FAC, which we also now know the address of. So let's add a call to FRMEVL to our LCDPRINT code before dropping into Wozmon to see what changes. I'll add a jump to subroutine FRMEVL, save that, build it, write that to the ROM, put the ROM back in, and reset.

If I go back into BASIC and do LCDPRINT "Hello, world!" again, it drops to Wozmon again and we can look at the zero page to see what's different after calling FRMEVL. TXTPTR is at D1 again, and if we look at D1 now it's 0023. What's at 0023? It's a null, and that's because our string still starts at 0014 up here and ends just before 0023. By calling that FRMEVL subroutine, it's parsed the string and advanced TXTPTR. In fact, one of the other zero page addresses I was printing out is INPUTBUFFER, at address 12, and that's where the full command starts. If we look at address 12, it's 9D followed by a space, followed by the open quote and then our string. So it translated the LCDPRINT command to a token, 9D, but otherwise this is the full input buffer, and TXTPTR is just a pointer that moves along within this buffer as we parse each piece of it.

Now the result is supposed to end up in that FAC variable, which is at address B8. If we look at what's at B8, it's not exactly obvious what's going on. From reading the code, it seemed to me like this FAC variable is supposed to be 5 bytes long, so this whole thing is FAC, but it's still not obvious what it means. By trying some different things and seeing how this changed, I was able to more or less work out what format it's in.

One thing I noticed is that the first byte here, 0D, is 13 in decimal, and indeed this "Hello, world!" string is 13 bytes long, so this does seem to indicate the length of the string. The next two bytes are a little-endian address, 3FF3. I found that interesting for a couple of reasons. One, it's at the very end of RAM, because RAM runs from 0000 to 3FFF. And two, if we look at what's actually there, from 3FF0 to 3FFF, we find that starting at 3FF3 we have H, e, l, l, o, comma, space, w, o, r, l, d, exclamation point. So it's parsed the string and put it in memory, and the result here is the length of the string and then the address of the string. That seems pretty cool.

The other thing I'll note, which I didn't notice initially but which becomes relevant, is that the next two bytes, 0070, are also an address. If we look at what's there, in the zero page at 0070, it's that same thing again: 0D tells us the string is 13 bytes long, and then 3FF3 is the address of the string.

The way I was able to unravel this and understand better what's going on is just by trying different experiments and seeing what the result is. From here I can hop back into BASIC and, for example, try assigning a string value to a variable, say VAR$ = "HELLO". Then if I do LCDPRINT VAR$, let's take a look at the parsed version of that at address B8 to BF. Here something seems kind of weird. If this is the length, 6, that's not right, because HELLO is five bytes. And this address, if that's what it is, reads as 8276, which is in ROM. That also doesn't make sense, because we typed this, so it should be in RAM, not in ROM. So this can't possibly be our variable. But let's look at that address. 8276 to 827B would be those six bytes, and what we have is a carriage return, a line feed, a capital O, a capital K, another carriage return, and another line feed. This just looks like the BASIC OK prompt, which doesn't seem to have anything to do with our string. I'm really not sure what to make of this.

But if we keep going and ignore all this, these last two bytes, 0405, do point to an address in RAM. If we look at 0400 to 040F, 0405 starts here, and it's a 5, and then the next two bytes are 3FFB. That actually is the string: it's five characters long and it has a pointer to 3FFB. If we look at 3FFB to 3FFF, here it is: H, E, L, L, O.

So in the case of the string literal, where we print "Hello, world!", we can look at these two bytes, 0070, and if we go there we get the length of the string and an address. If it's a variable pointing to a string, we can do the same thing: we look at the fourth and fifth bytes of FAC and interpret them as an address, 0405, and at 0405 we get a string descriptor with the length and then the address of the actual string. Either way, whether it's a string literal or a string variable, that seems to work.

The other case is when the expression is numeric instead of a string, and it seems to work a little differently. If I hop back into BASIC and do LCDPRINT 1 + 2, this should give us a 3 somewhere. But if we look at FAC, B8 to BF, we've got a completely different format. This 82 is no longer a length, of course, and this is certainly not a pointer to a string or anything. After some trial and error with different numbers, I concluded that this is a representation of the number three in floating point format. Take the C0 and all these zeros: if you convert that to binary you end up with 11 followed by a bunch of zeros. The 2 in the 82, it turns out, tells you how many of those bits are on the left side of the decimal point, or radix point I guess. Two bits are on the left side, so if you take the first two bits you get 11, and binary 11 represents three. So this is a representation of the number three. If that doesn't make sense, maybe I could do a separate video on floating point numbers; let me know if that interests you.

My guess is that FAC, which is what this variable is called, probably stands for floating point accumulator. Again, I'm just trying to figure this out by reading the code, so that's just a guess. But when the argument evaluates to a number, it seems to wind up here in this sort of floating point format. So this can take on two completely different formats depending on whether you have a string or a number, and that's why I was also printing out the location of this VALTYP variable. VALTYP is at address 66. If we look at address 66 in this case, it's a zero, and I think that indicates that this is a number. If we look back at when we were doing our strings, address 66 is FF for a string. So I think this VALTYP variable tells us which type of value we have in the floating point accumulator: a floating point number if it's zero, or a pointer to a string descriptor if it's FF. So with that in mind, let's go back to
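The two FAC layouts worked out above can be replayed in a short Python sketch. It models memory as a plain dictionary holding only the bytes read out in the video. The helper names are invented for this example, and the float decoding is an interpretation rather than anything taken from the BASIC source: it assumes the exponent byte is biased by 80 hex (so 82 means "two bits left of the radix point"), treats the next four bytes as a binary fraction, and ignores the sign.

```python
FAC = 0xB8      # address of FAC reported in the video
VALTYP = 0x66   # address of VALTYP reported in the video


def read_u16_le(mem, addr):
    """Read a 16-bit little-endian value: low byte first, then high byte."""
    return mem[addr] | (mem[addr + 1] << 8)


def read_string(mem, descriptor_addr):
    """Follow a 3-byte descriptor: length, then a little-endian address."""
    length = mem[descriptor_addr]
    start = read_u16_le(mem, descriptor_addr + 1)
    return bytes(mem[start + i] for i in range(length)).decode("ascii")


def read_float(mem):
    """Decode exponent + 4 mantissa bytes; the sign is ignored in this sketch."""
    exponent = mem[FAC]
    if exponent == 0:
        return 0.0  # assumption: a zero exponent byte means the value zero
    mantissa = bytes(mem[FAC + 1 + i] for i in range(4))
    fraction = int.from_bytes(mantissa, "big") / 2**32
    return fraction * 2.0 ** (exponent - 0x80)


def decode_fac(mem):
    """Bit 7 of VALTYP set means FAC+3/FAC+4 point at a string descriptor."""
    if mem[VALTYP] & 0x80:
        return read_string(mem, read_u16_le(mem, FAC + 3))
    return read_float(mem)


# String variable case from the video: descriptor at 0405, text at 3FFB.
string_case = {VALTYP: 0xFF, FAC + 3: 0x05, FAC + 4: 0x04,
               0x0405: 5, 0x0406: 0xFB, 0x0407: 0x3F}
string_case.update({0x3FFB + i: b for i, b in enumerate(b"HELLO")})

# Numeric case from the video: 82 C0 00 00 00 is the result of 1 + 2.
number_case = {VALTYP: 0x00, FAC: 0x82, FAC + 1: 0xC0,
               FAC + 2: 0x00, FAC + 3: 0x00, FAC + 4: 0x00}

print(decode_fac(string_case))   # the text the descriptor points at
print(decode_fac(number_case))   # the decoded floating point value
```

Keeping the pointer read (`read_u16_le`) separate from the descriptor read mirrors the two hops seen in Wozmon: first from FAC to the descriptor, then from the descriptor to the characters.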

### Copying logic from PRINT [16:38]

the code for the PRINT statement. Here's where it calls FRMEVL, which does the formula evaluation that populates FAC, the floating point accumulator or whatever you want to call it. That's followed immediately by a bit test of VALTYP and then a branch-if-minus to print the string. So basically, if this evaluated to a string, it goes off and prints the string. Otherwise it means we have a floating point number, in which case it calls FOUT and then STRLIT, and then, if we ignore this conditional compilation stuff here, it does a string print.

Let's copy that same logic to our LCD code. After we evaluate the expression, we'll bit test VALTYP and then branch if minus, which is the case where it's actually a string, to LCD print, which will just be right here. Otherwise, if we don't follow this branch to LCD print, we'll end up in here and call those same two subroutines: first FOUT and then STRLIT. My guess is that these two calls are somehow going to convert our floating point number into a string, but let's take a look at them just to see.

FOUT, I found, is defined in float.s, and the comment says it converts the floating point accumulator to a string, returning with the Y and A registers pointing at that string. So it makes sense that FOUT is for outputting a floating point number in string format. It does that conversion, and then after FOUT we call STRLIT, which is defined in string.s. Here it is, and the comment says that it builds a descriptor for a string starting at Y,A, which is where FOUT puts its result. So it takes what FOUT returns and creates a temporary string descriptor pointed to by FAC+3 and FAC+4. Those are the fourth and fifth bytes of FAC, which, as we saw, is the same as for any other string.

So it seems like the combination of FOUT and STRLIT will take a floating point number, convert it to a string, and then put a pointer to that string's descriptor into the same position where we would have a pointer to the string descriptor for any other string. Either way, when we end up down here at LCD print, we should end up with a string.

Let's see how this works. Right now when we get to LCD print we're just going to drop out to Wozmon so we can see where we stand. Let me save this, rebuild, and write it to the ROM again. Reinstall the ROM, reset, jump into BASIC, and see where we stand. If we do LCDPRINT "HELLO" we can
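The branch structure copied from PRINT can be summarized as a small Python sketch. The functions `fout` and `strlit` here are simplified stand-ins named after the BASIC routines, not reimplementations of them: the stand-in `fout` does not attempt to reproduce BASIC's exact number formatting, and the "descriptor" is just a `(length, text)` tuple instead of a pointer into memory. `lcd_print_operand` is a hypothetical name for this example.

```python
def fout(value):
    """Stand-in for FOUT: turn a number into text (not BASIC's real format)."""
    return f"{value:g}"


def strlit(text):
    """Stand-in for STRLIT: wrap text in a (length, text) descriptor."""
    return (len(text), text)


def lcd_print_operand(valtyp, fac):
    """Return a string descriptor for whatever the evaluator left behind."""
    # BIT VALTYP copies bit 7 into the N flag; BMI branches when it is set.
    if valtyp & 0x80:
        return fac                # already a string descriptor
    return strlit(fout(fac))      # number: convert, then build a descriptor


print(lcd_print_operand(0xFF, (5, "HELLO")))
print(lcd_print_operand(0x00, 42.0))
```

The point of the sketch is that both paths end in the same shape of result, so everything after the branch target only has to deal with strings.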

### String descriptors and memory management [19:30]

look at the FAC at B8 to BF, and these two bytes here, 0070, are the address of the string descriptor. If we look at 0070 to 0072, we see we have a 5-byte string and it's at 3FFB. If we look at 3FFB to 3FFF, there's our string: H, E, L, L, O.

Now if we try again with a string variable, I'll set S$ equal to "TEST 2" and then LCDPRINT S$. Same as before, if we look at B8 to BF, these two bytes here, 0405, should give us the string descriptor. If we look at 0405 to 0407, this is our string descriptor, so now we have a 6-byte string starting at 3FFA. If we look at 3FFA to 3FFF, here's our string: T, E, S, T, space, 2.

Now let's try a numeric expression. We'll do LCDPRINT 6 * 7, and same as before we'll start at B8 through BF. This is telling us the string descriptor is at 0070, so let's look at 0070 to 0072. Here's our string descriptor: it says the string is three bytes long and it's at 0100. If we look at 0100 to 0102, we have a space and a 42. So it formatted 6 * 7 as 42 with a space in front. This looks great. We've got a consistent way of taking a variety of different expressions with our LCDPRINT instruction and getting a well-formatted string.

If we go back to the PRINT code, not our LCD code, one last time: at this point it just needs to print out the string, and it does that by calling STRPRT. Something I noticed is that if we jump down to where STRPRT is defined, the first thing it does is call this FREFAC subroutine. The rest of it looks like a fairly straightforward loop to print out one character at a time by calling OUTDO for each character pointed to by INDEX at offset Y, and then Y is incremented here. But INDEX is kind of a weird thing, because we haven't seen that one before; this is the first time we're coming across it. And what is this call to FREFAC? What does that do?

If we look at FREFAC, it's defined in string.s. The comment says that if the string descriptor pointed to by FAC+3 and FAC+4, which is what we're working with, is a temporary string, release it. This seems important, because if in the course of evaluating the expression it creates a string in memory somewhere, it's presumably, or hopefully, going to reserve that memory and not use it for anything else until it knows that we're done with that string. If we don't release the memory, we may end up with a memory leak and eventually run out of available memory even if we're not really using any of it. So not calling this when we need to could introduce a pretty bad memory leak bug.

The other thing, just skimming down here, is that eventually it returns from subroutine somewhere down here, and it seems like it returns with INDEX and INDEX+1 pointing at that string. The A register starts out up here, before it's pushed, as the length of the string. So it seems that by calling FREFAC we get the length of the string in the A register and we get INDEX pointing to the string. And while it's presumably releasing the memory that INDEX points to, the data is still there, so we can access it at least for a moment, even if we can't rely on it being there later.

With that, again we can basically copy what the PRINT routine was doing. If I go back to our LCD code, I'll get rid of the call to Wozmon and add that call to FREFAC. That's going to return with INDEX pointing to the string and the length of the string in the A register, so then we can just loop through the string and print each character. Start with Y equal to zero and then load A with INDEX offset by Y, which grabs the next character from the string into the A register. Then we call a subroutine to print that character, which we'll call LCD print char, and increment Y each time through the loop.

To print the character I'm calling this LCD print char, which is just going to be the rest of the code here that we had before for printing a character. We already have that code, except that we don't need to call GETBYT, because we're going to start with the byte we want to print already in the A register. So this will be our LCD print char routine: we call it with the byte we want to print in the A register and it prints that byte.

Then to make this a loop: FREFAC returns with the length of the string in the A register, so let's transfer that to the X register, and at the end here we'll decrement X and then branch, if that's not yet zero, to LCD print loop, which will just be a label up here where we load and print the next character. Finally, at the end here, we'll return from subroutine. So this is our new LCDPRINT, and we can clean up all this: we don't need to print out all of these zero page variables anymore. Let's see if this works. I'll save that, rebuild, and write that to the ROM. Once
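The character loop described above (length into X, Y starting at zero, load through INDEX offset by Y, print, increment Y, decrement X, branch if not zero) can be modelled in Python with explicit 8-bit registers. This is a sketch of the loop as narrated in the transcript, not the author's actual assembly: memory is a 64 KiB `bytearray`, the function name is made up, and "printing" just collects characters into a string.

```python
def lcd_print_loop(mem, index, length):
    """Model of the narrated loop; returns the characters it would print."""
    out = []
    x = length & 0xFF                    # transfer the length from A to X
    y = 0                                # start with Y equal to zero
    while True:
        a = mem[(index + y) & 0xFFFF]    # load A from INDEX offset by Y
        out.append(chr(a))               # call the print-character routine
        y = (y + 1) & 0xFF               # increment Y (8-bit register)
        x = (x - 1) & 0xFF               # decrement X (8-bit register)
        if x == 0:                       # branch back only while X is not zero
            break
    return "".join(out)


mem = bytearray(0x10000)
mem[0x3FFB:0x4000] = b"HELLO"            # string location seen in the video

print(lcd_print_loop(mem, 0x3FFB, 5))
```

One thing this model makes visible: because the count is tested only after the body runs, a length of zero would wrap X from 0 to 255 and run the body 256 times. Whether that can happen on the real machine depends on whether FREFAC can return a zero length, which the walkthrough does not cover.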

### Testing it out [25:23]

that's done, we'll get the ROM back in the computer, reset it, and launch into BASIC here.

OK, so now if we do LCDPRINT "Hello, world!" (I'll use the underscore trick for backspacing, which is how that works), look at that, we got "Hello, world!". We can clear that, and let's try S$ = "HELLO" and then LCDPRINT S$. That works. I'll clear that again, and let's try LCDPRINT 6 * 7, and it prints 42 as you'd expect.

I think the space in front is actually intentional for positive numbers. If I do LCDCMD 168, that'll move the cursor to the second line, and now if I do LCDPRINT -6 * 7, that prints -42. So for positive numbers it leads with a space, and for negative numbers it prints the minus sign, I guess so the numbers line up or something.

So everything seems to work as I'd expect, and I'm pretty happy with that. Hopefully you found that interesting.
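The sign convention observed in the test, a leading space for positive results and a minus sign for negative ones, can be written down as a one-line rule. This Python sketch only covers whole numbers and is an illustration of the observed output, not a reimplementation of BASIC's number formatting; the function name is hypothetical.

```python
def format_like_print(n):
    """Leading space for non-negative integers, minus sign for negative ones."""
    return ("-" if n < 0 else " ") + str(abs(n))


print(repr(format_like_print(6 * 7)))
print(repr(format_like_print(-6 * 7)))
```

Both results are three characters long, which matches the three-byte string descriptor seen earlier for 6 * 7 and is consistent with the guess that the leading space keeps columns of numbers aligned.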
