NextPVR Forums
Trying to suppress Chinese Characters in TVxB

 
jksmurf
Offline

Posting Freak

HK (DMBTH)
Posts: 3,590
Threads: 410
Joined: Jul 2005
#1
2010-09-03, 03:42 PM (This post was last modified: 2010-09-03, 03:52 PM by jksmurf.)
Hi,

I use TVxB to generate my XML EPG, and it uses wget to download html files which it then parses and mines for XML EPG schedules.

After many hours I have TVxB set up how I like it, but I'm trying to suppress Chinese characters. One channel does not comma-delimit the Chinese characters from the English ones in its schedule web pages, nor does it supply HTML tags (e.g. <br>) between the Chinese and English text, so I have found no way to parse them out.

http://programme.tvb.com/print/pearl/[day=yyyy-mm-dd]/

At present I use the TVxB functions rclip and lclip (right and left clip) and just copy and paste lots of Chinese characters into TVXB.ini, in a vain effort to keep them from popping up. Of course there are literally thousands of Chinese characters, so it's a hard slog keeping up.

Is there some switch in wget which would ONLY download English text and save me from continually adding to the growing rclip and lclip terms?

Yep, tried the manual and the author...

Ta

k.
ASUS STRIX X470-F AMD 2700x 4GHz | Win10Prox64 | 32GB | NVIDIA GEforce GT1030 Fanless | WinTV DMB-TH | WinTV HVR-1280 | Hauppauge Colossus | AC86U/AC68U | USB-UIRT | RPi4 Libreelec | Sony Bravia LCD X9000F Android TV |
johnsonx42
Offline

Posting Freak

Posts: 7,298
Threads: 189
Joined: Sep 2008
#2
2010-09-03, 05:14 PM
You'd need to run the final XML through some string processing that only lets English characters through. You can do it in a batch file. Here is a snippet of ASP.NET code I used a while back to block SQL injection attacks on a client's web site:
Code:
For ii = 1 To Len(str)
    char = Mid(str, ii, 1)
    Select Case char
        ' Whitelist: only these characters are copied into newstr
        Case " ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", _
             "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", _
             "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", _
             "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "0", "1", "2", _
             "3", "4", "5", "6", "7", "8", "9", "@", ".", "-", "_", "/", "&"
            newstr = newstr & char
    End Select
Next
Basically we iterate through the string 'str', pull out each character 'char', use Case to let the character through only if it is in the defined list of allowed characters, then append it to the new string 'newstr', and repeat.

No, I didn't write that code, nor is it complete. I'm not an asp.net programmer (nor a programmer of any sort any more). I'm posting it just to give you an idea of what you want to do. This can all be done in a batch file, but advanced batch file string processing isn't exactly my bag.
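For anyone who would rather not do this in a batch file, here is a rough Python sketch of the same whitelist idea. The function name keep_allowed is made up for illustration, and the allowed set simply mirrors the Case list above. Note that this set has no <, >, = or quote characters, so it would have to be applied to the title text only, not to the whole XML file.

```python
import string

# Characters allowed through the filter: the same set as the Case list
# (space, ASCII letters, digits, and a few punctuation marks).
ALLOWED = set(string.ascii_letters + string.digits + " @.-_/&")


def keep_allowed(text: str) -> str:
    """Return text with every character outside ALLOWED dropped."""
    return "".join(ch for ch in text if ch in ALLOWED)


if __name__ == "__main__":
    # The Chinese run disappears; the spaces on either side of it remain.
    print(keep_allowed("NBC 世界新聞 NBC Nightly News"))
```

Dropping characters this way leaves the spaces that surrounded the Chinese text, so a title can end up with a double space in the middle.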
server: NextPVR 5.0.7/Win10 2004/64-bit/AMD A6-7400k/hvr-2250 & hvr-1250/Winegard Flatwave antenna/Schedules Direct
main client: NextPVR 5.0.7 Desktop Client; LG 50UH5500 WebOS 3.0 TV
markbb1
Offline

Member

Posts: 155
Threads: 7
Joined: Jul 2006
#3
2010-09-03, 10:02 PM
jksmurf Wrote:Hi,

I use TVxB to generate my XML EPG, and it uses wget to download html files which it then parses and mines for XML EPG schedules.

After many hours I have TVxB set up how I like it, but I'm trying to suppress Chinese characters. One channel does not comma-delimit the Chinese characters from the English ones in its schedule web pages, nor does it supply HTML tags (e.g. <br>) between the Chinese and English text, so I have found no way to parse them out.

http://programme.tvb.com/print/pearl/[day=yyyy-mm-dd]/

At present I use the TVxB functions rclip and lclip (right and left clip) and just copy and paste lots of Chinese characters into TVXB.ini, in a vain effort to keep them from popping up. Of course there are literally thousands of Chinese characters, so it's a hard slog keeping up.

Is there some switch in wget which would ONLY download English text and save me from continually adding to the growing rclip and lclip terms?

Yep, tried the manual and the author...

Ta

k.
You might take a look at SED for Windows ( http://gnuwin32.sourceforge.net/packages/sed.htm ) and see if you can write a script to preprocess the final xmltv output file from TVXB before it gets imported into GBPVR/NPVR. It might be possible to write a program to throw out every character that is not English text, punctuation, or xml tag characters ( < , / and > ). You could also transform ampersands into &amp; and so on.
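If sed turns out to be awkward, the same preprocessing could be sketched in Python with the standard library XML parser. This is only an illustration under assumptions: the function name clean_titles is hypothetical, it only touches <title> elements, and it treats anything outside printable ASCII as unwanted. One side benefit is that the parser handles the ampersand escaping when it writes the XML back out.

```python
import re
import xml.etree.ElementTree as ET

# Anything that is not printable ASCII (space through tilde).
NON_ASCII = re.compile(r"[^\x20-\x7E]+")


def clean_titles(xml_text: str) -> str:
    """Strip non-ASCII runs from every <title> and return the XML as text."""
    root = ET.fromstring(xml_text)
    for title in root.iter("title"):
        if title.text:
            stripped = NON_ASCII.sub(" ", title.text)
            # Collapse the whitespace left behind and trim both ends.
            title.text = " ".join(stripped.split())
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<tv><programme channel="tvbpearl.hk">'
        '<title lang="en">全美票房速遞 Box Office America</title>'
        "</programme></tv>"
    )
    print(clean_titles(sample))
```

For a real listings file, ET.parse() to read and tree.write() to save would be the natural calls instead of working on a string.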
jksmurf
Offline

Posting Freak

HK (DMBTH)
Posts: 3,590
Threads: 410
Joined: Jul 2005
#4
2010-09-03, 11:19 PM
Thanks gentlemen, I'll have a little play with those.

I'm not a programmer either, far from it (just BASIC and some Visual Basic in Excel), so it will need to be something simple that can recognise Chinese characters and cut them out. Not sure if there is a 2-byte recognise-and-discard program out there. Will keep looking!

k.
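On recognising Chinese characters specifically: once the text is decoded (in UTF-8 these characters usually take three bytes each rather than two), they can be matched by Unicode range instead of by byte count. The sketch below is a guess at which ranges matter for this guide, and strip_cjk is a made-up name. Unlike an ASCII-only filter, it leaves accented Latin letters alone.

```python
import re

# Assumption: the unwanted text falls inside these Unicode blocks.
CJK = re.compile(
    "["
    "\u3000-\u303f"  # CJK symbols and punctuation
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uf900-\ufaff"  # CJK Compatibility Ideographs
    "\uff00-\uffef"  # Halfwidth and fullwidth forms
    "]+"
)


def strip_cjk(text: str) -> str:
    """Remove CJK runs, then collapse the leftover whitespace."""
    return " ".join(CJK.sub(" ", text).split())


if __name__ == "__main__":
    print(strip_cjk("Bloomberg財經第一線 Bloomberg Rewind"))
```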
mvallevand
Online

Posting Freak

Ontario Canada
Posts: 52,837
Threads: 954
Joined: May 2006
#5
2010-09-03, 11:25 PM (This post was last modified: 2010-09-04, 04:18 AM by mvallevand.)
K, this isn't perfect but you could try

wget http://programme.tvb.com/print/pearl/yyyy-mm-dd/ -O tvb1.html

sed -e "s/[\x00-\x1F\x7F-\xFF]/ /" tvb1.html > tvb.html

Martin
jksmurf
Offline

Posting Freak

HK (DMBTH)
Posts: 3,590
Threads: 410
Joined: Jul 2005
#6
2010-09-03, 11:56 PM
Thanks Martin,

I'll need to first figure out how to add "-O tvb1" suffix to TVxB's wget command.
From the log it currently uses

"wget -E -t 5 http://programme.tvb.com/print/pearl/2010-09-02/"
without a wgetrc file.

The TVxB manual says

wgetarguments:
e.g. wgetarguments=-E -t 5 --proxy=on http_proxy=yourproxy.server.net:8080
Modify the wget arguments. Refer to "wget Information.html" for more information. Note: "For expert use" (lol ...)

k.
jksmurf
Offline

Posting Freak

HK (DMBTH)
Posts: 3,590
Threads: 410
Joined: Jul 2005
#7
2010-09-04, 12:24 AM
Hmmm... tried the command directly on the XML,

I need to dig deeper ...!

Code:
<programme start="20100904060000 +0800" stop="20100904073000 +0800" channel="tvbpearl.hk">
<title lang="en">Bloomberg財經第一線 Bloomberg Rewind</title>
</programme>
<programme start="20100904073000 +0800" stop="20100904080000 +0800" channel="tvbpearl.hk">
<title lang="en">NBC 世界新聞 NBC Nightly News</title>
</programme>
<programme start="20100904080000 +0800" stop="20100904083000 +0800" channel="tvbpearl.hk">
<title lang="en">普通話娛樂新聞報道 Putonghua E-News</title>
</programme>
<programme start="20100904083000 +0800" stop="20100904085500 +0800" channel="tvbpearl.hk">
<title lang="en">全美票房速遞 Box Office America</title>

Code:
</programme>
<programme start="20100904060000 +0800" stop="20100904073000 +0800" channel="tvbpearl.hk">
<title lang="en">Bloomberg²¡ç¶“第一線 Bloomberg Rewind</title>
</programme>
<programme start="20100904073000 +0800" stop="20100904080000 +0800" channel="tvbpearl.hk">
<title lang="en">NBC ¸–界新聞 NBC Nightly News</title>
</programme>
<programme start="20100904080000 +0800" stop="20100904083000 +0800" channel="tvbpearl.hk">
<title lang="en">™®é€šè©±å¨›æ¨‚æ–°èžå ±é“ Putonghua E-News</title>
</programme>
<programme start="20100904083000 +0800" stop="20100904085500 +0800" channel="tvbpearl.hk">
<title lang="en">…¨ç¾Žç¥¨æˆ¿é€Ÿéž Box Office America</title>
</programme>
<programme start="20100904085500 +0800" stop="20100904092000 +0800" channel="tvbpearl.hk">
<title lang="en">·é‡Œæ´»å½±å¾Œ Hollywood Highlights USA</title>
markbb1
Offline

Member

Posts: 155
Threads: 7
Joined: Jul 2006
#8
2010-09-04, 03:06 AM
mvallevand Wrote:K, this isn't perfect but you could try

wget http://programme.tvb.com/print/pearl/yyyy-mm-dd/ -O tvb1

sed -e "s/[\x00-\x1F\x7F-\xFF]/ /" tvb1.html > tvb.html

Martin
I would strongly recommend running the sed command on the completed XML file after TVxb creates it, rather than trying to get it to work on the HTML files that TVxb retrieves. I just don't see how it could ever work within TVxb: TVxb is not flexible enough to let you stream the data out of TVxb, through sed, and back into TVxb.

You could get the "filtered" data into TVxb's cache directory by using some external scripting to fetch, filter, and then write out the cached files with the correct TVxb cache file naming convention (I have done it using Perl), but that would require additional scripting (and maybe Perl) that I don't think jksmurf needs or wants to do.

If your sed command works, I honestly think the following would be the simplest way for jksmurf to get rid of the unwanted characters.

Suppose GBPVR (or NPVR) is configured to process a file named "mylistings.xmltv".

1. Get SED for Windows working. Make sure the above sed command syntax does what is desired.

2. Configure TVxb to produce an xmltv file named "temp.xmltv".

3. Put a file named "prepost" in the \TVxb\bin directory because, per section 2.3.4 of the TVxb manual, "For security reasons a flag file called prepost (with no extension) must be created in the \TVxb\bin folder before the pre or post commands will work. (The file can be empty.) If this file does not exist, then the pre- and post-commands are deactivated."

4. Put a postcommand in the TVxb ini file like this

postcommand="sed -e s/[\x00-\x1F\x7F-\xFF]/ / temp.xmltv > mylistings.xmltv"

(I am not sure of how to correctly put the quotes in that line, or if/where they are necessary.)

5. Let GBPVR (or NPVR) process the "mylistings.xmltv" file as it would normally.

Put a line in your epg update batch file to delete the temp.xmltv file if you want to save some hard drive space.

As an aside, I think the "/ /" would put one or two spaces everywhere there is a non-English character. I would try "//" (no space between the slashes) to keep from ending up with long strings of spaces in the EPG.
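If SED for Windows proves fiddly to quote inside the ini file, the filtering step in the list above could also be done by a small Python script called from the postcommand. This is only a sketch: the script name strip_high_bytes.py is invented, the file names are the ones from the steps above, and I have not tested how TVxb passes arguments to a post-command. It deletes the same byte ranges as the sed expression, except that tab, CR and LF are kept so the line structure survives.

```python
import re
import sys

# Same byte ranges as the sed expression: control bytes and everything
# from 0x7F up. Tab (0x09), LF (0x0A) and CR (0x0D) are left alone.
UNWANTED = re.compile(rb"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\xFF]")


def filter_bytes(data: bytes) -> bytes:
    """Delete every unwanted byte (no space is put in its place)."""
    return UNWANTED.sub(b"", data)


def filter_file(src: str, dst: str) -> None:
    """Copy src to dst line by line, filtering each line."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            fout.write(filter_bytes(line))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical use: python strip_high_bytes.py temp.xmltv mylistings.xmltv
    filter_file(sys.argv[1], sys.argv[2])
```

Because every byte of a multi-byte character is removed, what is left is plain ASCII, which stays valid whatever encoding the XML header declares.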
mvallevand
Online

Posting Freak

Ontario Canada
Posts: 52,837
Threads: 954
Joined: May 2006
#9
2010-09-04, 03:13 AM (This post was last modified: 2010-09-04, 04:19 AM by mvallevand.)
Sounds reasonable. I wasn't actually trying to write the instructions; I just showed what worked for me.

The space between the slashes was added by vBulletin.

Edit: Looking at it again, my -O line was wrong originally, but the concept is the same.

Martin
markbb1
Offline

Member

Posts: 155
Threads: 7
Joined: Jul 2006
#10
2010-09-04, 04:20 AM
jksmurf Wrote:Hmmm... tried the command directly on the XML,

I need to dig deeper ...!

Code:
<programme start="20100904060000 +0800" stop="20100904073000 +0800" channel="tvbpearl.hk">
<title lang="en">Bloomberg財經第一線 Bloomberg Rewind</title>
</programme>
<programme start="20100904073000 +0800" stop="20100904080000 +0800" channel="tvbpearl.hk">
<title lang="en">NBC 世界新聞 NBC Nightly News</title>
</programme>
<programme start="20100904080000 +0800" stop="20100904083000 +0800" channel="tvbpearl.hk">
<title lang="en">普通話娛樂新聞報道 Putonghua E-News</title>
</programme>
<programme start="20100904083000 +0800" stop="20100904085500 +0800" channel="tvbpearl.hk">
<title lang="en">全美票房速遞 Box Office America</title>

Code:
</programme>
<programme start="20100904060000 +0800" stop="20100904073000 +0800" channel="tvbpearl.hk">
<title lang="en">Bloomberg²¡ç¶“第一線 Bloomberg Rewind</title>
</programme>
<programme start="20100904073000 +0800" stop="20100904080000 +0800" channel="tvbpearl.hk">
<title lang="en">NBC ¸–界新聞 NBC Nightly News</title>
</programme>
<programme start="20100904080000 +0800" stop="20100904083000 +0800" channel="tvbpearl.hk">
<title lang="en">™®é€šè©±å¨›æ¨‚æ–°èžå ±é“ Putonghua E-News</title>
</programme>
<programme start="20100904083000 +0800" stop="20100904085500 +0800" channel="tvbpearl.hk">
<title lang="en">…¨ç¾Žç¥¨æˆ¿é€Ÿéž Box Office America</title>
</programme>
<programme start="20100904085500 +0800" stop="20100904092000 +0800" channel="tvbpearl.hk">
<title lang="en">·é‡Œæ´»å½±å¾Œ Hollywood Highlights USA</title>
Try this command

sed -e "s/[\x00-\x1F\x7F-\xFF]//g" tvb1.html > tvb.html

Getting rid of the space between the two slashes deletes the character instead of replacing it with a space character. The g means globally replace all instances and not just the first one on each line.

Looking up the first string of extended ascii characters from the first line of your final xml file that contains extended ascii characters, the sequence ²¡ç¶“第一線 is

B2 A1 E7 B6 93 E7 AC AC E4 B8 80 E7 B7 9A

in hex (I think). It might be that when the sed script takes off the first instance on the line, the remaining characters then appear to our text renderers as extended ascii instead of Chinese. Removing all instances might be the magic required.
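The first-match-versus-global difference can be checked with a few lines of Python. This is a sketch that assumes the listings file is UTF-8; the sample line is taken from the XML posted earlier, and the variable names are only for illustration. Replacing just the first high byte on a line leaves the rest of each multi-byte character behind as orphaned bytes, which is consistent with the garbage in the second code block.

```python
import re

# Same byte class as the sed expression.
HIGH = re.compile(rb"[\x00-\x1F\x7F-\xFF]")

line = "<title>NBC 世界新聞 NBC Nightly News</title>".encode("utf-8")

# Like s/.../ / without the g flag: only the first match is replaced.
first_only = HIGH.sub(b" ", line, count=1)

# Like s/...//g: every match on the line is deleted.
all_bytes = HIGH.sub(b"", line)


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_utf8(first_only))  # the orphaned bytes break decoding
    print(all_bytes.decode("ascii"))
```

In this sample the first Chinese character is the bytes E4 B8 96. With only E4 replaced, the leftover B8 96 shows up as "¸–" when viewed as Windows-1252, which matches the NBC line in the output above.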