
Stripping C/C++ Comments

Here’s some code to strip comments from a C/C++ file, adapted from a posting at http://stackoverflow.com/questions/241327/python-snippet-to-remove-c-and-c-comments

import re

# adapted from: http://stackoverflow.com/questions/241327/python-snippet-to-remove-c-and-c-comments
# strips c/c++ comments

def strip_comment(text):
    # matches, in order: // line comments, /* ... */ block comments,
    # and single- or double-quoted string literals (so comment markers
    # inside strings are left alone)
    rep = r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"'
    pattern = re.compile(rep, re.DOTALL | re.MULTILINE)
    # comments start with '/', so drop those; keep string literals as-is
    return pattern.sub(
        lambda match: "" if match.group(0).startswith('/') else match.group(0),
        text)
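
For example, a quick sanity check (the sample snippet is my own, just to show the behavior):

code = '''int x = 0; // counter
/* block
   comment */
s = "// not a comment";'''
print strip_comment(code)
# the // and /* */ comments are removed; the string literal is untouched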



Named groups in Python regular expressions

Regular expressions are very powerful, and named groups in Python’s re module make both your patterns and the surrounding code much more readable. Here is a simple example, but imagine how useful this would be for parsing a dataset such as a CSV file or other structured datafile.

import re

def main():
    s = "james:12;google."
    match = re.match(r"(?P<name>[^:]*):(?P<age>[^;]*);(?P<company>[^.]*)\.",
                     s)
    print "name", match.group("name")
    print "age", match.group("age")
    print "company", match.group("company")

if __name__ == "__main__":
    main()
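
As a quick sketch of the CSV idea (the field layout below is made up for illustration):

import re

row = "james,12,google"   # hypothetical CSV-style row: name,age,company
m = re.match(r"(?P<name>[^,]+),(?P<age>\d+),(?P<company>.+)", row)
if m:
    print m.group("name"), m.group("age"), m.group("company")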



Python Talk

I gave a talk on Python for CPLUG yesterday.

Check out the slides here.

CPLUG: http://www.cplug.org/
SLIDES: http://prenticew.com/talks/pytalk09

During the talk, I wrote a simple Python script to pull content off the web and parse the data… here’s the gist of what I did:

First, let’s pull a webpage off the internet:

import httplib

def get_webpage():
    conn = httplib.HTTPConnection('en.wikipedia.org')
    conn.request("GET", "/wiki/Python_(programming_language)")
    rd = conn.getresponse()
    print rd.status, rd.reason
    return rd.read()

This function creates an HTTPConnection object for en.wikipedia.org and stores it in conn. We then issue a GET request for the Python wiki page. Calling getresponse() on the connection returns an HTTPResponse object: the status is available through .status and .reason, and the page data through .read().
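
As an aside, the same response object also exposes the reply headers (shown here against the rd object created inside get_webpage):

# rd is the HTTPResponse object from conn.getresponse()
print rd.getheader('content-type')   # e.g. "text/html; charset=UTF-8"
print rd.getheaders()                # the full list of (header, value) pairs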

This yields the plain-text HTML of the wiki page. On its own that’s not very interesting, so let’s do something with the data… let’s write a frequency counter:

def get_freqct(data):
    wordlist = data.split(' ')
    freqct = {}
    for s in wordlist:
        if s not in freqct:
            freqct[s] = 1
        else:
            freqct[s] += 1
    return freqct

We can pass the data (a string) returned by the first function to our get_freqct function. It first uses the built-in split() method to break the string on a white-space delimiter, returning a list of words. We then iterate through the wordlist and build the frequency count in a dictionary. At this point we have something fairly interesting, but simply printing the dictionary is fairly cluttered… let’s sort it!
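
For instance, on a toy string (remember that a dictionary prints in arbitrary order):

print get_freqct("the quick the lazy the")
# {'the': 3, 'quick': 1, 'lazy': 1}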

You can quickly sort the contents of this dictionary with the sorted function:

from operator import itemgetter

sol = sorted(d.items(), key=itemgetter(1))

This statement takes the items in d (the dictionary) as a list of (key, count) tuples and sorts them by the count field, using itemgetter(1) to select the second element of each tuple. So you’ll end up with a sorted list of tuples ordered by count.
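
On a made-up dictionary, the result looks like this:

from operator import itemgetter
d = {'spam': 3, 'eggs': 1, 'ham': 2}
print sorted(d.items(), key=itemgetter(1))
# [('eggs', 1), ('ham', 2), ('spam', 3)]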

Then we can print the list with the following for loop:

for word, count in sol:
    print word, ":", count

This for loop unpacks each tuple in the sorted list (sol) into the variables word and count, which are then printed with the print statement.

If you run this code… you’ll notice that a lot of HTML tags (or parts of HTML tags) get counted. This is not very desirable, so let’s filter them out using a regular expression!

data = re.sub(r'<[^>]+>','',data)

This regular expression takes the raw data (string) returned by the get_webpage function and replaces each occurrence of an HTML tag with an empty string.

Deconstructing the regular expression:
< – matches the ‘<’ symbol
[^>]+ – matches one or more of anything except the ‘>’ symbol (where + means one or more)
> – matches the ‘>’ symbol
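
A quick check on a made-up fragment:

import re
print re.sub(r'<[^>]+>', '', '<p>Hello <b>world</b>!</p>')
# Hello world!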

…and putting it all together:

#!/usr/bin/python
import httplib
from operator import itemgetter
import re

def get_webpage(site, page):
    conn = httplib.HTTPConnection(site)
    conn.request("GET", page)
    rd = conn.getresponse()
    print rd.status, rd.reason
    return rd.read()

def get_freqct(wordlist):
    freqct = {}
    for s in wordlist:
        if s not in freqct:
            freqct[s] = 1
        else:
            freqct[s] += 1
    return freqct

def main():
    data = get_webpage('en.wikipedia.org', "/wiki/Python_(programming_language)")
    data = re.sub(r'<[^>]+>', '', data)
    d = get_freqct(data.split(' '))
    sol = sorted(d.items(), key=itemgetter(1))
    for word, count in sol:
        print word, ":", count

if __name__ == "__main__":
    main()

The following is a snippet of what the script would yield:

language : 24
code : 24
which : 24
by : 27
Retrieved : 32
with : 32
are : 33
as : 38
on : 50
for : 51
in : 64
is : 80
to : 92
a : 98
Python : 103
and : 122
of : 125
the : 144
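
Incidentally, if you’d rather see the most frequent words first, sorted accepts a reverse flag (a small tweak, not part of the original talk):

sol = sorted(d.items(), key=itemgetter(1), reverse=True)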

