Parsing XML files and strings

HenryEx
Posts: 27
Joined: Wed Aug 13, 2014 6:43 pm

Parsing XML files and strings

Post by HenryEx »

I seem to encounter this from time to time and iirc have never found a good answer to this.

I'm trying to parse an XML file with QBMS (i know... not really the tool of choice) and having trouble with the numbers in the file. Luckily, the XML i'm trying to parse has only one tag per line. The tag looks something like this:

Code: Select all

<Tag id="4" idx="35" variable1="CanBeNumbersOrString" text="Lorem Ipsum">TAG_VALUE</Tag>

Let's say i want to read out the idx in a tag. So i do this:

Code: Select all

  get DATA line 0
    # get index
    string TEMP = DATA
    string TEMP 0| "idx=\""
    string TEMP 0% "\""
    string INDEX = TEMP

So, how do i now change the string INDEX = "35" into long INDEX = 0x23?
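For cross-checking the expected result outside QuickBMS, here is a minimal Python sketch of the same idea: cut everything up to idx=", cut at the next quote, then convert the remaining text to a number. The helper name attribute_value is made up for illustration, and the sketch assumes double-quoted attributes on a single line, as in the sample tag.

```python
def attribute_value(tag, name):
    """Return the text between name=" and the next double quote, or None if absent."""
    marker = name + '="'
    start = tag.find(marker)
    if start < 0:
        return None  # attribute not present in this tag
    start += len(marker)
    end = tag.find('"', start)
    if end < 0:
        return None  # unterminated value
    return tag[start:end]


line = '<Tag id="4" idx="35" variable1="CanBeNumbersOrString" text="Lorem Ipsum">TAG_VALUE</Tag>'

idx_text = attribute_value(line, "idx")  # the two characters '3' and '5'
idx_number = int(idx_text)               # the numeric value 35, i.e. 0x23

print(idx_text.encode("ascii").hex())    # bytes of the text form: 33 35
print(hex(idx_number))                   # numeric form
```

The text "35" and the number 0x23 are different things in a file: written as text it is the two bytes 0x33 0x35, written as a single byte it is 0x23.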


edit:
Actually, i tried outputting what i get to a text file to check it (with "putct INDEX string -1 MEMORY_FILE" and logging the memfile at the end) and i don't even get any output there, not even string numbers, it seems. I must be doing something wrong.

edit2:
Okay no, it's not that i get nothing: i actually get the byte value (it outputs 0x23) instead of 0x33 0x35, which is ASCII for '35', written to the text file if i do the above. That's despite working entirely in strings in QBMS. Now i'm completely confused.
Last edited by HenryEx on Tue Sep 05, 2017 7:01 pm, edited 1 time in total.
HenryEx
Posts: 27
Joined: Wed Aug 13, 2014 6:43 pm

Re: Convert string number to number value

Post by HenryEx »

Okay so i think i worked it out. It's two factors messing me up and giving me wildly inconsistent results when i try to parse the tag.
One: When you cut down a string to only numbers, it seems to automatically become a number instead of a string. Good to know.
Two: Knowing the above, using string VAR = VAR is a very bad idea.
I'm leaving the following here for future reference.


So, when trying to parse this example tag and output values from it to text, for example:

Code: Select all

<Tag id="4" idx="35" variable1="CanBeNumbersOrString" text="Lorem Ipsum">TAG_VALUE</Tag>

You can do the following for pure string fields:

Code: Select all

  get DATA line 0
    # get text
    string TEMP = DATA
    string TEMP 0| "text=\""
    string TEMP 0% "\""
    string VNTEXT = TEMP
    putct VNTEXT string -1 MEMORY_FILE

You can't do that for number fields though, since your "string" is a number as soon as you cut it down to the value, and then the string operation takes it as an ASCII value and you'll put down null bytes or other funky control characters. So for number fields you do:

Code: Select all

  get DATA line 0
    # get index
    string TEMP = DATA
    string TEMP 0| "idx=\""
    string TEMP 0% "\""
    math INDEX = TEMP
    putct INDEX string -1 MEMORY_FILE

(Putting a number down as a string with putct properly converts it to a string.)

If you have fields that can hold text OR numbers, like the variable1 field, you have to do YET another thing:

Code: Select all

  get DATA line 0
    # get var1
    string TEMP = DATA
    string TEMP 0| "variable1=\""
    string TEMP 0% "\""
    set VAR1 string TEMP
    putct VAR1 string -1 MEMORY_FILE



That was a pain to figure out, since i wanted to use a value from the tags in an array look-up and it didn't really work.
aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

Re: Convert string number to number value

Post by aluigi »

Parsing strings with quickbms is a real challenge; luckily it's something that rarely comes up.
There are some scripts I made in which I "parsed" xml data:
http://aluigi.org/bms/ghost_recon_xml_bank.bms
http://aluigi.org/bms/mcf.bms
http://aluigi.org/bms/sherlock_holmes_jack_ripper.bms

As you can see, they all use different solutions, and the xml format of the input files was different.
Probably the first one is more similar to your situation.

There is the "S" option of the String command that does a very good job separating the elements of a string, but it can't interpret 'text="Lorem Ipsum"' as one element because the " char is not at the beginning (and this is the correct behaviour).
Using the sscanf option "s" will give even more problems.

Long story short, what do you think about the following?

Code: Select all

get MYLINE line

string ID = MYLINE
string ID | "id=\""
string ID % "\""

string IDX = MYLINE
string IDX | "idx=\""
string IDX % "\""

string VARIABLE1 = MYLINE
string VARIABLE1 | "variable1=\""
string VARIABLE1 % "\""

string TEXT = MYLINE
string TEXT | "text=\""
string TEXT % "\""

print "ID %ID%"
print "IDX %IDX%"
print "VARIABLE1 %VARIABLE1%"
print "TEXT %TEXT%"
HenryEx
Posts: 27
Joined: Wed Aug 13, 2014 6:43 pm

Re: Convert string number to number value

Post by HenryEx »

Yea, that's basically what i use now, except i use the detour over a TEMP variable because not every tag i read always has all the variables present and i read multiple tags, so at the start of each loop i assign all vars a default value and only update the value if the searched string isn't empty.
The examples with FindLoc are very useful, in case i encounter some files where multiple tags aren't separated by a line break. I can search for the opening tag and the closing tag and read between these offsets instead. Granted, that also only works if i know the order in which the tags appear, if there's various ones.


But since you mentioned the split command: I noticed that some of the variables can have multiple text values in a row, separated by a certain sign, like an underscore. Like the text for one of them might be "String1_String2_String3" or something. Is there some way to split a string at a certain delimiter? The problem here is that i don't know how many delimiters are present, if any at all. Looking at the documentation of the S command though, i don't really get how it works, or if it even does the thing i want here?
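As a reference for the behaviour being asked for (not for how QuickBMS does it), here is a small Python sketch of splitting at a delimiter when the number of delimiters is unknown. The function name split_values is hypothetical; the underscore delimiter comes from the example above.

```python
def split_values(text, delimiter="_"):
    """Split text at every delimiter; text without a delimiter yields one element."""
    return text.split(delimiter)


for sample in ("String1_String2_String3", "OnlyOne"):
    parts = split_values(sample)
    # The element count is only known after splitting, so loop over the result
    # instead of assuming a fixed number of variables.
    print(len(parts), parts)
```

The point of the sketch is that the result is a list of unknown length, so whatever replaces it in a script needs a loop rather than a fixed set of output variables.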


And since i'm sometimes parsing XML text, is there an easy way to convert the XML escape characters like &amp; or &quot; ? Or is my best choice to run this on every single string:

Code: Select all

string XMLTEXT replace "&lt;" "<"
string XMLTEXT replace "&gt;" ">"
string XMLTEXT replace "&amp;" "&"
string XMLTEXT replace "&quot;" "\""
string XMLTEXT replace "&apos;" "'"

Does the String Replace function replace every instance in the string or just the first one? I have no way of knowing if the string even has any escape characters in the first place, but i assume if none are found the string is left unaltered.
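One thing worth checking with any replace-based approach is the order of the replacements. The Python sketch below (an illustration of the general technique, not of QuickBMS behaviour; unescape_xml is a made-up name) replaces every occurrence of the five predefined XML entities and handles &amp; last, so that text like &amp;quot; is unescaped only once.

```python
# The five predefined XML entities; "&amp;" is deliberately last.
XML_ENTITIES = (
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&amp;", "&"),
)


def unescape_xml(text):
    """Replace every occurrence of each entity; text without entities is returned unchanged."""
    for entity, char in XML_ENTITIES:
        text = text.replace(entity, char)
    return text


print(unescape_xml("Tom &amp; Jerry &lt;3 &quot;cheese&quot;"))
print(unescape_xml("&amp;quot;"))  # stays an escaped-looking "&quot;", not a bare quote
print(unescape_xml("no entities here"))
```

If &amp; were replaced before &quot;, the input &amp;quot; would first become &quot; and then a plain quote, which is one unescape step too many.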
HenryEx
Posts: 27
Joined: Wed Aug 13, 2014 6:43 pm

Re: Parsing XML files and strings

Post by HenryEx »

I've written up an example of searching for an XML tag across multiple lines (i came across at least one tag that had a line break in it after all) that includes the tags themselves.

Code: Select all

findloc TAG_OFF string "<Tag" 0 ""
if TAG_OFF != ""
  findloc TAG_END string "</Tag>"
  xmath TAG_SZ "TAG_END - TAG_OFF + 6"  # include end tag in size
  goto TAG_OFF 0
  getdstring XMLSTRING TAG_SZ 0
else
  break  # no more tags
endif

Assuming i want to remove any possible line breaks in my string: can i do this directly via string remove like string XMLSTRING - "\x0D\x0A" or do i have to make a variable first like this:

Code: Select all

string CR = "0x0D"
string LF = "0x0A"
string XMLSTRING - CR
string XMLSTRING - LF
aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

Re: Parsing XML files and strings

Post by aluigi »

string XMLSTRING _ XMLSTRING

The "_" command removes space-like chars from the beginning and end of the string.
aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

Re: Parsing XML files and strings

Post by aluigi »

quickbms 0.8.1 will be released this week-end so I have decided to check if there is any possibility of adding a sort of universal parser for strings and formats like XML and JSON.
Obviously it's impossible to parse XML and JSON with a tool like quickbms because they are nested structures while quickbms works step-by-step and is designed only for binary data.
Anyway having "something" able to easily handle a file/string like the one you provided is for sure better than nothing and better than using work-arounds in bms language :)
I will let you know if such (experimental!) feature will be available or not in the upcoming new version.
aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

Re: Parsing XML files and strings

Post by aluigi »

The new feature works very well in my tests.
I had to use a work-around to retrieve the value of the tags, but it works well for the general cases in which it will be used.
I leave an example script based on your sample that demonstrates how it works:

Code: Select all

get SIZE asize
getdstring TMP SIZE

string RET X TMP
print "Tags and parameters found: %RET%"

if RET & ",Tag,"
    print "The content of the html/xml tag is %Tag%"
endif

if RET & ",variable1,"
    print "The content of variable1 is %variable1%"
endif

Basically the code considers the input as a sequence of parameters and values (par=val) and every parameter will be a new variable.
RET will contain the list of parameters that have been found in the input, they are all separated by a comma and there is a comma at both beginning and end of the variable to allow easy searching of desired parameters like in the example (",variable1," or ",variable" or "1," and so on).
If the parameter exists in the list then you can read its content like a bms variable.
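To show why the comma at both ends of the list matters, here is a Python sketch that imitates the described RET format. The names make_ret and has_parameter are hypothetical, and the exact order and content of the list are assumptions based on the sample tag, not output captured from quickbms.

```python
def make_ret(names):
    """Join names with commas and add a comma at both ends, like the RET format described above."""
    return "," + ",".join(names) + ","


def has_parameter(ret, name):
    """Exact-name check: the name must be enclosed by commas on both sides."""
    return "," + name + "," in ret


# Assumed list for the sample tag; the real order may differ.
ret = make_ret(["Tag", "id", "idx", "variable1", "text"])
print(ret)

print(has_parameter(ret, "variable1"))  # exact match on the full name
print(has_parameter(ret, "variable"))   # no parameter has exactly this name
print("variable" in ret)                # a bare substring search still matches
```

With the enclosing commas an exact-name search cannot accidentally match a longer name, while a partial search such as ",variable" remains possible when that is what you want.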

It works recursively but can't create "levels" of variables so it's up to you to provide a valid input.

There are no plans yet to implement this feature in the Get command to read the xml fields directly from the input file, mainly because this is just a generic experimental feature to make life easier in those rare cases in which it's necessary to parse some xml data.
aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

Re: Parsing XML files and strings

Post by aluigi »

Quickbms 0.8.1 is out and the following is an additional example script:

Code: Select all

for
    get INPUT line
    string RET X INPUT
    if RET & ",Tag,"
        print "\n%Tag%: %id% %idx%"
        if RET & ",variable1,"
            print "variable1: %variable1%"
        endif
        if RET & ",text,"
            print "text: %text%"
        endif
    endif
next
HenryEx
Posts: 27
Joined: Wed Aug 13, 2014 6:43 pm

Re: Parsing XML files and strings

Post by HenryEx »

Wow, that simplifies tag input reading by a lot, especially if there's like 10 different possible variables to check. I'll update right away and rewrite my script soon, to see if there's any problems.

Could you give an example on how the String J command would be used? The readme doesn't go into detail on that one. If VAR2 is a variable, does it just output the string "{ "variablename": "value" }" to VAR1 or what happens?
aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

Re: Parsing XML files and strings

Post by aluigi »

Well, J is more like a useless thing I added just because there is already a similar feature for html/xml ('T'), it's just a sort of formatter/beautifier.
Take this input example:

Code: Select all

{"var":"hello","test":[{"blah":"blah value","num":1234},{"bool":false,"myfloat":123.456}],"a b c d":[1,2,3,4,9999]}

And this is the example script:

Code: Select all

get SIZE asize
getdstring VAR SIZE
string RET J VAR
print "%RET%"

Output:

Code: Select all

{
  var   hello
  test   [
    {
      blah   blah value
      num   1234
    }
    {
      bool   false
      myfloat   123.456
    }
  ]
  a b c d   [
    1
    2
    3
    4
    9999
  ]
}

If you have an xml/html page you can use the same script, replacing 'J' with 'T'; you can also try using 't' for html-only data, which will show a sort of text-only page with all the tags filtered out.
z4ruz
Posts: 75
Joined: Sun Jan 10, 2021 2:23 pm

Re: Parsing XML files and strings

Post by z4ruz »

Trying to parse 1.xml from this topic.

sample:
<Resources id="BG_LanternPlantsWorld_Common" parent="BG_LanternPlantsWorld" pool="Pool_Backgrounds" >
<PopAnim id="POPANIM_BACKGROUNDS_LANTERN_PLANTS_WORLD" path="images\768\backgrounds\lantern_plants_world\lantern_plants_world.pam" exts="@bin/BG_LanternPlantsWorld_Common.bin:0-433" time="1321571078" />
</Resources>

script:
for i < 10
print line_%i%
get INPUT line
string RET X INPUT
print %RET%
next i

output:
line_0
0
line_1
Error: cstring() failure, your input string has some wrong escape sequences or it's not a valid escaped string

Last script line before the error or that produced the error:
7 string RET X INPUT
coverage file 0 1% 307 18302 . offset 00000133

same if i remove line breaks

also, could I use this for parsing?

String VAR OP VAR
S split
it's like a sscanf for strings, both ' and "
are handled as quotes:
string ELEMENTS S "string1 \"string 2\" 'string3'" VAR1 VAR2 VAR3
aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

Re: Parsing XML files and strings

Post by aluigi »

That cstring problem is caused by the unescaped backslashes used in the strings. Unfortunately some formats like json escape them while others may not, so "fonts\768\humanst19.txt" is like "how should I parse it now?".
No way to guess it.

A work-around can be to escape all the backslashes by replacing all the occurrences of \ with \\.
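The effect of that work-around can be reproduced in Python, with the caveat that Python's "unicode_escape" rules are only a stand-in here and are not identical to the cstring() handling in quickbms. The function names and the shortened sample path are chosen for illustration.

```python
def decode_escapes(text):
    """Interpret backslash escapes using Python's 'unicode_escape' codec (a stand-in parser)."""
    return text.encode("ascii").decode("unicode_escape")


def escape_backslashes(text):
    """Double every backslash so that an escape-aware parser yields the original text."""
    return text.replace("\\", "\\\\")


raw = r"images\768\backgrounds"  # a Windows-style path with unescaped backslashes

mangled = decode_escapes(raw)                        # "\76" and "\b" are taken as escapes
restored = decode_escapes(escape_backslashes(raw))   # every "\\" decodes back to "\"

print(repr(mangled))
print(repr(restored))
print(restored == raw)
```

Without the doubling, parts of the path are consumed as escape sequences; with it, the parser gives back the path exactly as it appeared in the file.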

The following script does everything:

Code: Select all

get SIZE asize
getdstring INPUT SIZE
string INPUT R \ \\
string RET X INPUT
for i = 0 < id[]
    print "%id[i]%: %path[i]% %exts[i]% %time[i]%"
next i

String X can be used line-by-line or on the whole file.