NextPVR Forums
HTML parser and/or regular expressions resources

 
McBainUK
Offline

Posting Freak

Posts: 4,711
Threads: 429
Joined: Sep 2005
#1
2007-08-24, 10:41 AM
Title says it all. Links and tips appreciated :)
Wiki profile
My Projects
Programs Plugin (retired) | Volume OSD Plugin (retired) | Documentation Wiki (retired)
Spartan
Offline

Senior Member

Posts: 457
Threads: 28
Joined: Mar 2005
#2
2007-08-24, 12:22 PM
I use something in my sports scores plugin which allows you to use XPath expressions on html documents -- very handy.

See here: http://www.codeplex.com/htmlagilitypack
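
To give a feel for what an XPath query over a page looks like, here is a minimal sketch. It deliberately does not use HtmlAgilityPack (a separate download); it swaps in the built-in System.Xml classes, which only accept well-formed XHTML and will throw on the sloppy markup most real pages contain. HtmlAgilityPack offers a similar SelectNodes call on its own node type, and that is what you would use for real pages. The class name, method name and sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathLinkDemo
{
    // Returns the href value of every <a> element in a well-formed XHTML fragment.
    public static List<string> ExtractLinks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well formed

        var links = new List<string>();
        // "//a/@href" selects the href attribute of every <a> element at any depth.
        foreach (XmlNode attr in doc.SelectNodes("//a/@href"))
        {
            links.Add(attr.Value);
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample page with two links.
        string page = "<html><body><a href=\"http://www.gbpvr.com\">GB-PVR</a>" +
                      "<p><a href=\"/wiki\">Wiki</a></p></body></html>";

        foreach (string link in ExtractLinks(page))
        {
            Console.WriteLine(link);
        }
    }
}
```

The point of XPath here is that the query describes where the data sits in the document tree, so it keeps working when attribute order or whitespace changes, which tends to break a regex.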
GBPVR v1.0.16 | Comskip | SportsScores | Weather | I-XmlTV

Server: Tyan Thunder h1000E | 2 x Opteron 2210 | 2GB PC2-5300 DDR2 ECC
LSI MegaRAID 300-8X SATA RAID
1x 73GB SCSI @ 10K RPM (OS)
3x 500GB SATA @ 7.2K RPM (RAID 5) (4 Partitions: Docs, Still Pics, Home Movies, Music)
2x 160GB IDE @ 7.2K RPM (RAID 0) (Recordings)
Hauppauge HVR-1600

Client: Gigabyte GA-MA69GM-S2H | Athlon x2 5000+ BE | 2GB PC-6400 DDR2
1x 320GB SATA @ 7.2K RPM
Antec NSX2480 Case
MCE Remote
McBainUK
Offline

Posting Freak

Posts: 4,711
Threads: 429
Joined: Sep 2005
#3
2007-08-24, 01:19 PM
Got any code examples? Be great to have something to start me off...
Spartan
Offline

Senior Member

Posts: 457
Threads: 28
Joined: Mar 2005
#4
2007-08-24, 02:18 PM
You can download the sports scores plugin source code from the wiki...
HTPCGB
Offline

Member

Posts: 215
Threads: 15
Joined: Jun 2006
#5
2007-08-24, 02:25 PM
When I wrote the cnnDynSource, I used a free program called Expresso to help me develop and test regexes fairly quickly. It avoids the whole compile, run and test cycle, which slows you down significantly.

Here's an example of a method that downloads a web page, scrapes it for hyperlinks and prints them in a textbox control:

Code:
// Requires: using System.IO; using System.Net; using System.Text.RegularExpressions;
string url = "http://www.gbpvr.com";

HttpWebRequest theRequest = (HttpWebRequest)WebRequest.Create(url);
string htmlCode;

// Dispose the response and the reader once the page has been read
using (HttpWebResponse theResponse = (HttpWebResponse)theRequest.GetResponse())
using (StreamReader reader = new StreamReader(theResponse.GetResponseStream()))
{
    htmlCode = reader.ReadToEnd(); // converts the stream to a string
}

// In a verbatim (@"...") string, a literal double quote is written as ""
Regex theRegex = new Regex(@"href=""(?<hyperlink>.*?)""");

MatchCollection mc = theRegex.Matches(htmlCode); // runs the regex on the string and collects all the matches

for (int i = 0; i < mc.Count; i++)
{
    // "\r\n" starts a new line in the textbox; the named group holds just the URL
    textBox1.AppendText("\r\n" + mc[i].Groups["hyperlink"].Value);
}

This requires System.Net, System.IO, System.Text.RegularExpressions and a textbox named textBox1.

The basic form of the regex that I used is (href=".*?"), minus the brackets.
It looks for (href=") followed by a run of characters (matched by ".*?"), which is then followed by a (").

A "." matches any single character except a newline. It is followed by two modifiers ("*" and "?"). "*" tells it to match zero or more of the preceding item. "?" after the "*" makes the match lazy: it takes as few characters as possible, so it stops at the first closing quote instead of running on to the last quote on the line.
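
The effect of the trailing "?" is easiest to see on a line that holds two links. The sketch below runs the pattern with and without it; the class name, helper method and sample line are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVersusGreedy
{
    // Returns the text of the first match of pattern in input, or "" if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        // Made-up sample line with two links on it.
        string html = "<a href=\"one.html\">1</a> <a href=\"two.html\">2</a>";

        // Greedy: .* grabs as much as it can, so the match ends at the LAST quote.
        Console.WriteLine(FirstMatch(html, "href=\".*\""));

        // Lazy: .*? grabs as little as it can, so the match ends at the FIRST closing quote.
        Console.WriteLine(FirstMatch(html, "href=\".*?\""));
    }
}
```

The greedy version swallows both links and everything between them in a single match, while the lazy version returns only the first link.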

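"(?<hyperlink>" and ")" mark the two ends of what's called a named capture group (the "?" just before the closing bracket belongs to ".*?", not to the group). Everything in between the two ends is captured into a group with the name "hyperlink", which is later read from each Match in the MatchCollection.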
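
As a small self-contained sketch of the named group on its own, without the download step: the whole match still includes href="...", while the "hyperlink" group holds only the URL. The class name, method name and sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // In a verbatim (@"...") string, "" stands for one literal double quote.
    private static readonly Regex LinkRegex = new Regex(@"href=""(?<hyperlink>.*?)""");

    // Returns the contents of the "hyperlink" group for every match in html.
    public static List<string> ExtractHyperlinks(string html)
    {
        var result = new List<string>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            // m.Value is the whole match (href="..."); the group is just the part inside the quotes.
            result.Add(m.Groups["hyperlink"].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample markup.
        string html = "<a href=\"http://www.gbpvr.com\">home</a> <a href=\"/wiki\">wiki</a>";

        foreach (string link in ExtractHyperlinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Note that this pattern only catches double-quoted href attributes written in lower case; single-quoted or unquoted attributes would need a broader pattern.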

I apologize for any mistakes in the code and hope that my alternating between brackets and quotation marks in the explanation wasn't too annoying.
Server:
Intel E2180 @ 2GHz | 4 GB RAM | PVR150 Retail | Vista Home Premium | GBPVR 1.3.7
McBainUK
Offline

Posting Freak

Posts: 4,711
Threads: 429
Joined: Sep 2005
#6
2007-08-24, 03:35 PM
Many thanks to you both :)
Quote:The basic form of the regex that I used is (href=".*?"), minus the brackets.
It looks for (href=") followed by a run of characters (matched by ".*?"), which is then followed by a (").

A "." matches any single character except a newline. It is followed by two modifiers ("*" and "?"). "*" tells it to match zero or more of the preceding item. "?" after the "*" makes the match lazy: it takes as few characters as possible, so it stops at the first closing quote instead of running on to the last quote on the line.

"(?<hyperlink>" and ")" mark the two ends of what's called a named capture group (the "?" just before the closing bracket belongs to ".*?", not to the group). Everything in between the two ends is captured into a group with the name "hyperlink", which is later read from each Match in the MatchCollection.
This is what I needed: a dummies' guide. I think it will be very useful for the Cinema listings plugin's web scraper.
Ted the Penguin
Offline

Posting Freak

Posts: 1,590
Threads: 64
Joined: Aug 2006
#7
2007-08-24, 08:05 PM
Let me find you a regex site :) It has much more info.

Also, if you just want to try out some regex stuff, do it in Perl, since all you have to do is run it (with Perl installed, of course).

http://www.regular-expressions.info/
http://www.regular-expressions.info/reference.html
sub Wrote:Are you trying to make sure I get nothing done today?