TABLE OF CONTENTS


Generated from ./../ with ROBODoc v4.0.25 on Mon May 24 04:13:06 2004

1. Introduction/class_clean.php

NAME

      class_clean.php 

USAGE

      This file holds the CleanXML class, which is used
      to clean up dirty ODP XML.       

CREATION DATE

      23 nov. 2003

HISTORY

      6 dec.  2003: Fixed some things. Added more complex regexes.
      12 feb. 2004: Fixed some "" things - changed them to ''
      19 may  2004: Rewrote all comments so they support ROBODoc - kinky stuff :-D

1.1. class_clean.php/CleanXML

FUNCTION

       This class's main purpose is to clean dirty XML.

PROPERTIES

       _dirty_data:
           @string - holds the "dirty" data
       _clean_data:
           @string - holds the clean data
       write_fp:
           @file_pointer - Used to open a fp to the write file

METHODS

       cleanFile, _correction, _writeTheCleanData

EXTENDS

       Dot

1.1.1. CleanXML/_correction

FUNCTION

       Internal function that cleans the XML data.
       It removes formatting tags, escapes bare ampersands, and strips
       unsupported control characters.
       RegExp patterns:
           '=<[biu]>(.*)</[biu]>=i', '=&(?!amp;)=i', '=<strong>(.*)</strong>=i',
           '=[\x00-\x08\x0b-\x0c\x0e-\x1f]=
       Corrections:
           "$1", "&amp;", "$1", '' 

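       The patterns and corrections listed above can be tried out with PHP's
       preg_replace. This is only a sketch: the function name is hypothetical,
       and the fourth pattern is assumed to end with a closing quote and no
       modifiers, which the listing above does not show.

```php
<?php
// Sketch only: patterns and replacements are copied from the list above.
// The name correction_sketch and the closing quote on the fourth pattern
// are assumptions, not taken from the CleanXML source.
function correction_sketch(string $dirty): string
{
    $patterns = array(
        '=<[biu]>(.*)</[biu]>=i',          // <b>, <i>, <u> pairs: keep the inner text
        '=&(?!amp;)=i',                    // any & not already followed by "amp;"
        '=<strong>(.*)</strong>=i',        // <strong> pairs: keep the inner text
        '=[\x00-\x08\x0b-\x0c\x0e-\x1f]=', // control characters
    );
    $replacements = array('$1', '&amp;', '$1', '');

    // With array arguments, preg_replace applies each pair in order.
    return preg_replace($patterns, $replacements, $dirty);
}

echo correction_sketch("<b>Tom & Jerry</b>\x07"), "\n"; // Tom &amp; Jerry
```

       Note that the second pattern also rewrites the ampersand of entities
       such as &lt;, since it only exempts &amp;.
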
ACCESS

       Private

USED BY

       cleanFile

1.1.2. CleanXML/_writeTheCleanData

FUNCTION

       This function writes our clean data to our write file pointer.
       Finally, it resets _clean_data.

ACCESS

       Private

USED BY

       _correction

1.1.3. CleanXML/cleanFile

FUNCTION

       This is the function that cleans the file. First it creates
       two file pointers: one to the read file and one
       to the write file. Then it reads the read file until
       EOF. While reading, it passes the dirty data to the _correction
       function, where the data is cleaned. When this process is finished
       we have a new "clean" file called clean_$__filename.
       Then the script deletes the dirty file and renames clean_$__filename
       to just $__filename.
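
       The read, clean, write, delete and rename steps described above could
       look roughly like this. It is a sketch under assumptions: the function
       name, the callable parameter and the line-by-line reading are all
       hypothetical, and the real method calls _correction and
       _writeTheCleanData instead.

```php
<?php
// Hypothetical stand-in for the documented flow; not the CleanXML source.
// $correct is any callable that cleans one piece of dirty data.
function clean_file_sketch(string $filename, callable $correct): void
{
    $cleanName = dirname($filename) . '/clean_' . basename($filename);

    $read  = fopen($filename, 'rb');   // file pointer to the read file
    $write = fopen($cleanName, 'wb');  // file pointer to the write file

    // Read until EOF (line by line is an assumption) and write cleaned data.
    while (($line = fgets($read)) !== false) {
        fwrite($write, $correct($line));
    }
    fclose($read);
    fclose($write);

    unlink($filename);             // delete the dirty file
    rename($cleanName, $filename); // clean_<name> becomes <name>
}
```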

INPUT

       $__filename (@string): 
           The filename of the file we are about to clean  

ACCESS

       Public

USES

       _correction, _writeTheCleanData

2. Introduction/class_command.php

NAME

      class_command.php   

USAGE

      This file holds some important classes.
      It includes the following classes:
      Database:
          A class that handles database-related tasks
      CheckURL:
          A class that checks a given URL
      Dot:
          A class that handles printing dots in the command prompt

CREATION DATE

      23 nov. 2003

HISTORY

      6 dec.  2003: Just added the change log!
      7 dec.  2003: Inserted the Dot class
      12 feb. 2004: Fixed some "" things in variables.. with ''
      19 may  2004: Rewrote all comments so they support ROBODoc - kinky stuff :-D

2.1. class_command.php/Basic

FUNCTION

      This class holds some basic methods, such as an error handler.

METHODS

      error, printToConsole

2.1.1. Basic/error

FUNCTION

       This is our error handler. All errors should be passed to this error handler!

INPUT

       $__error_type (@string):
           Holds what kind of error it is. Can be Fatal error or just a Warning
       $__error_message (@string):
           Message that should be displayed.

ACCESS

       Public

2.1.2. Basic/printToConsole

FUNCTION

       This method prints to our console. All informational output
       should be passed to this method.

INPUT

       __text (@string):
           Holds the text that is going to be displayed.
       __line_wrap (@bool):
           Whether the text should be wrapped in \n. Defaults to true.

ACCESS

       Public

2.2. class_command.php/CheckURL

FUNCTION

      This class retrieves information about a URL. It has methods
      that download the headers of a given URL. Those headers are then parsed for
      the different pieces of information that we need.

PROPERTIES

      data (@string):
          Header data that we get back from a server when we give it a URL
      last_modified (@date):
          The date when the URL was last modified
      current_version (@date):
          The current version of the ODP dump
      content_lenght (@int):
          The content length of a URL, i.e. the file size

METHODS

      downloadHeaders, _urlParse, lastModified, contentLenght, lastModifiedCompare,
      writeLastUpdate

2.2.1. CheckURL/_urlParse

FUNCTION

       Method that splits a URL into its parts

INPUT

       $__url (@string): 
           The URL that you want to split
       $__request (@string):
           The part of the URL to return, e.g. host or path

OUTPUT

       @string - host or path
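
       Returning the host or the path of a URL can be sketched with PHP's
       built-in parse_url. Whether _urlParse itself uses parse_url is not
       stated here; the wrapper name and the example URL are hypothetical.

```php
<?php
// Sketch: parse_url() is PHP's built-in URL splitter. The wrapper name
// url_part_sketch and the example.org URL are made up for illustration.
function url_part_sketch(string $url, string $request): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts[$request])) {
        return ''; // malformed URL or the requested part is absent
    }
    return (string) $parts[$request];
}

echo url_part_sketch('http://example.org/rdf/structure.rdf.u8.gz', 'host'), "\n";
echo url_part_sketch('http://example.org/rdf/structure.rdf.u8.gz', 'path'), "\n";
```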

ACCESS

       Private 

2.2.2. CheckURL/contentLenght

FUNCTION

       Method that extracts the file size from our header data (the data property)
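
       A minimal sketch of pulling the file size out of raw header text,
       assuming the data property holds standard HTTP response headers. The
       function name and the sample headers are hypothetical.

```php
<?php
// Sketch: find the Content-Length header in a raw header string.
// The name content_length_sketch and the sample headers are made up.
function content_length_sketch(string $headers): ?int
{
    // m: ^ matches at each line start, i: header names are case-insensitive
    if (preg_match('/^Content-Length:\s*(\d+)/mi', $headers, $m)) {
        return (int) $m[1];
    }
    return null; // header not present
}

$headers = "HTTP/1.1 200 OK\r\n"
         . "Content-Length: 1048576\r\n"
         . "Last-Modified: Mon, 24 May 2004 04:13:06 GMT\r\n\r\n";
var_dump(content_length_sketch($headers)); // int(1048576)
```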

ACCESS

       Public

2.2.3. CheckURL/downloadHeaders

FUNCTION

       Method that gets the headers from the URL passed as input.

INPUT

       $__url (@string): 
           The URL which we want to get headers from

ACCESS

       Public

2.2.4. CheckURL/lastModified

FUNCTION

       Method that extracts the last-modified date from our header data (the data property)

ACCESS

       Public

2.2.5. CheckURL/lastModifiedCompare

FUNCTION

       A method that checks whether the ODP data's last-modified date matches the date
       found inside lastupdate.data. If it is the same, the script stops. If the ODP data
       dump is fresh, we proceed with our download, and the current last-modified
       date is stored in the file lastupdate.data.
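
       The comparison against the stored date can be sketched as follows.
       This is an assumption-laden illustration: the function name is
       hypothetical, the date is compared as a plain string, and the sketch
       returns false instead of stopping the script.

```php
<?php
// Sketch: compare the dump's last-modified date with the stored one.
// is_dump_fresh_sketch is a made-up name; $stateFile plays the role of
// lastupdate.data.
function is_dump_fresh_sketch(string $lastModified, string $stateFile): bool
{
    $stored = is_file($stateFile) ? trim(file_get_contents($stateFile)) : '';
    if ($stored === $lastModified) {
        return false; // same date: nothing new, the caller should stop
    }
    file_put_contents($stateFile, $lastModified); // remember the new date
    return true;
}
```

       Called twice with the same date, the sketch reports a fresh dump the
       first time and an unchanged dump the second time.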

ACCESS

       Public

2.2.6. CheckURL/writeLastUpdate

FUNCTION

       This method just writes the last update date to a file.

ACCESS

       Public

2.3. class_command.php/Database

FUNCTION

      This class holds methods that handle database-related
      tasks, i.e. connecting to the database, closing the connection,
      and running queries.

METHODS

      connect, close, sqlWithoutAnswer

2.3.1. Database/close

FUNCTION

       A method that closes the connection to the database.

ACCESS

       Public

2.3.2. Database/connect

FUNCTION

       A method that connects to the database. This method
       depends on config.php and the database globals.

ACCESS

       Public

2.3.3. Database/sqlWithoutAnser

FUNCTION

       A method that runs a MySQL query.

INPUT

       $__query (@string):
           A SQL query.

ACCESS

       Public

2.4. class_command.php/Dot

FUNCTION

      This class can be used to control when a dot (.) is going to be printed.
      For example, it is not smart to print a dot for every row we insert into our db:
      you would end up with about 2 million dots :)

USAGE

      Start by setting the frequency (e.g. print the dot on every 50000th call). Then just
      call printDot, and the method works out whether it should print the dot or not.
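
      A small sketch of the counter-and-frequency idea described above. The
      class name is hypothetical, and resetting the counter after printing is
      an assumption about how "every 50000th call" is achieved.

```php
<?php
// Sketch of the counter/frequency idea; not the original Dot class.
class DotSketch
{
    private $count = 0;     // calls since the last printed dot
    private $frequency = 1; // print a dot every $frequency calls

    public function setFrequency(int $freq): void
    {
        $this->frequency = max(1, $freq);
    }

    public function printDot(): void
    {
        $this->count++;
        if ($this->count === $this->frequency) {
            echo '.';
            $this->count = 0; // assumed: start counting again
        }
    }
}

$dot = new DotSketch();
$dot->setFrequency(5);
for ($row = 0; $row < 10; $row++) {
    $dot->printDot(); // prints a dot on the 5th and 10th call
}
echo "\n";
```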

PROPERTIES

      count (@int):
          Variable that keeps the dot from being printed every time.
          It is just the counter that we use to check whether we have reached
          the frequency.
      frequency (@int):
          How often the dot should be displayed

METHODS

      printDot, setFrequency  

2.4.1. Dot/printDot

FUNCTION

       A method that prints the dot if the counter has the same value
       as the frequency. Every time this function is called the counter is incremented.

ACCESS

       Public

2.4.2. Dot/setFrequency

FUNCTION

       A method that sets our frequency.

INPUT

       $__freq (@int):
           A number, e.g. 50000: print the dot on every 50000th call.

ACCESS

       Public

3. Introduction/class_download.php

NAME

      class_download.php

USAGE

      Start by setting the download_speed.
      Next set the filename to download.
      Next delete the old file by calling the method delete().
      Next call setPath (the path of the file we wish to download).
      Next call the method download() to download it.
      Next call the method extract() to extract it.

FUNCTION

      This file holds one class (DownloadExtractFile). This class is used to
      download and extract the DMOZ data dumps.
      Luckily for us, they are packed with gzip, and PHP supports gzip.. yay ;)

CREATION DATE

      23 nov. 2003

HISTORY

      23 nov. 2003: Created the class
      6  dec. 2003: Just added the change log!
      7 dec.  2003: Added support for the class CheckURL
      12 feb  2004: Remade the gzip extractor. Now it rocks!
                    Fixed some "" to ''.
      19 may  2004: Rewrote all comments so they support ROBODoc - kinky stuff :-D

USES

      CheckURL, Dot       

3.1. class_download.php/DownloadExtractFile

FUNCTION

      This class's main purpose is to download and extract the DMOZ data dumps.

PROPERTIES

      filename (@string):
          Filename of the file that we want to work our magic on.
      path (@string):
          The path of the file we are about to download
      download_speed (@int):
          The download speed (in KB)

METHODS

      setPath, setDownloadSpeed, setFilename, delete, download, extract.

EXTENDS

      Dot

3.1.1. DownloadExtractFile/delete

FUNCTION

       Delete the old file, if it's there.

ACCESS

       Public

3.1.2. DownloadExtractFile/download

FUNCTION

       Download our file.

USES

       CheckURL, Dot   

ACCESS

       Public

3.1.3. DownloadExtractFile/extract

FUNCTION

       Extract the downloaded gzip file.
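
       Extracting a gzip file in PHP can be sketched with the zlib functions
       gzopen, gzread and gzclose. The function name and the 8192-byte chunk
       size are assumptions; the original extractor may work differently.

```php
<?php
// Sketch: stream-decompress a .gz file with PHP's zlib extension.
// gunzip_sketch and the 8192-byte chunk size are made up for illustration.
function gunzip_sketch(string $gzFile, string $outFile): void
{
    $in  = gzopen($gzFile, 'rb');
    $out = fopen($outFile, 'wb');

    // Read in chunks so a large dump never has to fit in memory at once.
    while (!gzeof($in)) {
        $chunk = gzread($in, 8192);
        if ($chunk === false) {
            break; // read error: stop copying
        }
        fwrite($out, $chunk);
    }
    gzclose($in);
    fclose($out);
}
```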

ACCESS

       Public

3.1.4. DownloadExtractFile/setDownloadSpeed

FUNCTION

       Set the download speed

INPUT

       __speed (@int):
           Download speed in KB/s, e.g. 25 for 25 KB/s
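
       One way to think about a speed cap is the time each downloaded chunk
       should take. The helper below is purely illustrative: its name is
       hypothetical, it assumes 1 KB means 1024 bytes, and it is not taken
       from the download method.

```php
<?php
// Hypothetical helper: seconds one chunk should take at a given speed cap.
// Assumes 1 KB = 1024 bytes.
function chunk_duration_sketch(int $chunkBytes, int $speedKb): float
{
    return $chunkBytes / ($speedKb * 1024);
}

// A 51200-byte chunk at 25 KB/s should take 2 seconds.
echo chunk_duration_sketch(51200, 25), "\n"; // 2
```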

ACCESS

       Public

3.1.5. DownloadExtractFile/setFilename

FUNCTION

       Set the filename

INPUT

       __filename (@string):
           What should our file be named :)

ACCESS

       Public

3.1.6. DownloadExtractFile/setPath

FUNCTION

       Set path of the file we are about to download.  

INPUT

       __path (@string):
           A URL.

ACCESS

       Public

4. Introduction/class_parse.php

NAME

      class_parse.php

FUNCTION

      This file contains all the main classes that are used to parse the XML
      (RDF) files.
      The following classes are included:
          XMLGlobal:
              A parent class with some methods that can be used by both
              structure and content parsing.
          ParseXMLStructure:
              A class that is used to parse the structure RDF file.
          ParseXMLContent:
              A class that is used to parse the content RDF file.

CREATION DATE

      23 nov. 2003

HISTORY

      06 dec. 2003: Just added the change log! :)
      07 dec. 2003: Added content parser
      07 dec. 2003: Added a new class XMLGlobal!
      07 dec. 2003: Fixed some bugs :D
      12 feb. 2004: Fixed loads of "" quoting errors in the code.
                    Fixed a HUGE bug (which took a long time to find).
                    The bug set catids to 0 - but it's fixed now!
      13 feb. 2004: To work properly the class needs to load whole files
                    into memory. To help your computer I have now created
                    a class that splits the big file into smaller parts.
      18 feb. 2004: Fixed a bug in the split routine (it didn't split the
                    content file :()
      13 may  2004: Rewrote all comments so they support ROBODoc - kinky stuff :-D

4.1. class_parse.php/ParseXMLContent

FUNCTION

      A class that is used to parse the XML content file.

USAGE

      Call setStartTime - sets the start time
      Call setXMLFile(filename) - sets the filename of our XML file
      Call startParse - starts parsing the document and inserting the data into our
      MySQL database.

PROPERTIES

      The basic properties to get this class going:
      count_rows (@int):
          How many rows we have done so far
      count_rows_temp (@int):
          A temporary counter (reset after ECHO_STATS_FREQUNCY rows)
      XML tags and their contents:
      current_tag (@string):
          Holds the tag we are currently in
      permitted_tags (@array):
          An array that holds the permitted tags
      
      Properties for the XML structure:
      -------
      CONTENT LINKS (content_links):
      topic (@string)
      type (@string)
      resource (@string)
      catid (@int)
      
      CONTENT DESCRIPTION (content_description):
      external_page (@string)
      title (@string)
      description (@string)
      ages (@string)
      mediadate (@date)
      priority (@int)

METHODS

      startParse, _startTagProcessor, _endTagProcessor, _charDataProcessor

4.1.1. ParseXMLContent/_charDataProcessor

USAGE

       This is our content processor

INPUT

       __parser (@obj):
           The XML parser that calls this handler
       __data (@string):
           The character data

ACCESS

       Private

4.1.2. ParseXMLContent/_endTagProcessor

USAGE

       This is our end tag processor. When a tag ends, it is handled in here.

INPUT

       __parser (@obj):
           The XML parser that calls this handler
       __tag_name (@string):
           The name of the current tag

ACCESS

       Private

4.1.3. ParseXMLContent/_startTagProcessor

USAGE

       Function that processes the start tags.

INPUT

       __parser (@obj):
           The XML parser that calls this handler
       __tag_name (@string):
           The name of the current tag
       __attributes (@array):
           Attributes of the tag

ACCESS

       Private

4.1.4. ParseXMLContent/startParse

USAGE

       Used to start parsing of our file.

ACCESS

       Public

4.2. class_parse.php/ParseXMLStructure

FUNCTION

      A class that is used to parse the XML structure file.

USAGE

      Call setStartTime - sets the start time
      Call setXMLFile(filename) - sets the filename of our XML file
      Call startParse - starts parsing the document and inserting the data into our
      MySQL database.

PROPERTIES

      The basic properties to get this class going:
      count_rows (@int):
          How many rows we have done so far
      count_rows_temp (@int):
          A temporary counter (reset after ECHO_STATS_FREQUNCY rows)
      XML tags and their contents:
      current_tag (@string):
          Holds the tag we are currently in
      permitted_tags (@array):
          An array that holds the permitted tags
      
      Properties for the XML structure:
      topic (@string)
      catid (@int)
      title (@string)
      description (@string)
      last_update (@date)
      
      Variables for the XML data type
      type
      resource

METHODS

      startParse, _startTagProcessor, _endTagProcessor, _charDataProcessor

4.2.1. ParseXMLStructure/_charDataProcessor

USAGE

       This is our content processor

INPUT

       __parser (@obj):
           The XML parser that calls this handler
       __data (@string):
           The character data

ACCESS

       Private

4.2.2. ParseXMLStructure/_endTagProcessor

USAGE

       This is our end tag processor. When a tag ends, it is handled in here.

INPUT

       __parser (@obj):
           The XML parser that calls this handler
       __tag_name (@string):
           The name of the current tag

ACCESS

       Private

4.2.3. ParseXMLStructure/_startTagProcessor

USAGE

       Function that processes the start tags.

INPUT

       __parser (@obj):
           The XML parser that calls this handler
       __tag_name (@string):
           The name of the current tag
       __attributes (@array):
           Attributes of the tag

ACCESS

       Private

4.2.4. ParseXMLStructure/startParse

USAGE

       Used to start parsing of our file.

ACCESS

       Public

4.3. class_parse.php/XMLGlobal

FUNCTION

      This class holds global methods that can be used by the structure and content
      parsers.

USAGE

      This class is used as a parent for ParseXMLStructure and ParseXMLContent

PROPERTIES

      Status:
      h (@int):
          Hours
      m (@int):
          Minutes
      s (@int):
          Seconds
      
      Everything else:
      xml_file (@string):
          The filename of XML file we are parsing
      start_time (@int):
          The start time

METHODS

      setXMLFile, setStartTime, _getMicroTime, _echoStatus, _splitTime, _startToParse

USED BY

      ParseXMLStructure, ParseXMLContent

4.3.1. XMLGlobal/_echoStatus

FUNCTION

       A method that prints out the status

INPUT

       __start_time (@int):
           When was the script started
       __count_rows (@int):
           How many rows we have inserted so far
       __milestone (@string):
           A text that tells a little about our milestone

USED BY

       _endTagProcessor

ACCESS

       Private

4.3.2. XMLGlobal/_getMicroTime

FUNCTION

       A method that gets the microtime

USED BY

       setStartTime, _echoStatus

ACCESS

       Private

4.3.3. XMLGlobal/_splitTime

FUNCTION

       A method that splits the script's current run time into hours, minutes and seconds
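
       Splitting a run time into the h, m and s properties can be sketched
       like this, assuming the run time is available as whole seconds. The
       function name and the array return value are hypothetical.

```php
<?php
// Sketch: split elapsed seconds into hours, minutes and seconds.
// split_time_sketch is a made-up name; the keys mirror h, m and s.
function split_time_sketch(int $seconds): array
{
    return array(
        'h' => intdiv($seconds, 3600),
        'm' => intdiv($seconds % 3600, 60),
        's' => $seconds % 60,
    );
}

// 3725 seconds is 1 hour, 2 minutes and 5 seconds.
print_r(split_time_sketch(3725));
```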

USED BY

       _echoStatus 

ACCESS

       Private

4.3.4. XMLGlobal/_startToParse

FUNCTION

       A method that creates PHP's XML parser and starts parsing.
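
       Creating a parser and wiring up start tag, end tag and character data
       handlers can be sketched with PHP's XML Parser extension. The function
       name, the closures and the sample tags are assumptions; the original
       classes use their own _startTagProcessor, _endTagProcessor and
       _charDataProcessor methods.

```php
<?php
// Sketch: event-based parsing with PHP's XML Parser extension.
// parse_sketch and the collected row format are made up for illustration.
function parse_sketch(string $xml): array
{
    $rows = array();
    $currentTag = '';

    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $tagName, $attributes) use (&$currentTag) {
            $currentTag = $tagName; // start tag: remember where we are
        },
        function ($p, $tagName) use (&$currentTag) {
            $currentTag = '';       // end tag: we left the element
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$rows, &$currentTag) {
            if ($currentTag !== '' && trim($data) !== '') {
                $rows[] = array($currentTag, $data);
            }
        }
    );

    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $rows;
}

print_r(parse_sketch('<Topic><catid>1</catid><Title>Top</Title></Topic>'));
```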

USED BY

       startParse (ParseXMLContent and ParseXMLStructure)

ACCESS

       Private

4.3.5. XMLGlobal/setStartTime

FUNCTION

       A method that you need to call to set the start time        

ACCESS

       Public

4.3.6. XMLGlobal/setXMLFile

FUNCTION

       Just a method that sets the filename

INPUT

       __filename (@string):
           The filename of our XML file

ACCESS

       Public

5. Introduction/config.php

NAME

      config.php  

USAGE

      This file offers several options for
      customizing the script for your use.

TYPE

      Just a file that contains some important information

CREATION DATE

      3 dec. 2003

HISTORY

      12 feb. 2004: Added some features
      19 may  2004: Rewrote all comments so they support ROBODoc - kinky stuff :-D
      24 may  2004: Added console color (CONSOLE_COLOR) option

5.1. config.php/Common_global_values

FUNCTION

       Here you specify the common global values:
           ECHO_STATS (@bool):
               If ECHO_STATS is true, statistics will be displayed
               (stats are used when parsing the RDF documents).
           ECHO_STATS_FREQUNCY (@int):
               How frequently the stats should be displayed.
           DOWNLOAD_SPEED (@int): 
               Download speed in kilobytes.
           CONSOLE_COLOR (@string):
               The color you want to use. You can set it to
               black, red, green or blue.
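
       As an illustration, such values might be set as below, assuming they
       are PHP constants created with define(). Every value shown is a
       placeholder, not a default taken from config.php.

```php
<?php
// Sketch with placeholder values; the real config.php may differ.
define('ECHO_STATS', true);           // show statistics while parsing
define('ECHO_STATS_FREQUNCY', 50000); // placeholder frequency
define('DOWNLOAD_SPEED', 25);         // placeholder speed in kilobytes
define('CONSOLE_COLOR', 'green');     // black, red, green or blue
```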

5.2. config.php/Database_information

FUNCTION

       Here you specify your database information:
           DB_SERVER (@string):
               The server address - could be localhost or a URL
           DB_USER (@string):
               Database username
           DB_PASSWORD (@string): 
               Password for your database
           DB_Database (@string):
               The database in which create_tables.php creates its tables

5.3. config.php/Filenames

FUNCTION

       Here you specify the filenames:
           $rdffile_structure (@string):
               Structure RDF filename 
           $rdffile_content (@string):
               Content RDF filename

WARNING

       There is no need to edit these filenames!

5.4. config.php/Script_properties

FUNCTION

       Here you specify what the script shall do:
           Check_for_updates (@bool):
               If it's set to true, then the script will check for updated DMOZ dumps.
           STRUCTURE_DOWNLOAD_AND_extract (@bool):
               If it's set to true, then the script will download and extract
               the structure DMOZ dump.
           STRUCTURE_CLEAN (@bool): 
               If it's set to true, then the script will clean the structure file.
           STRUCTURE_PARSE_N_INSERT (@bool):
               If it's set to true, then the script will parse the structure rdf file 
               and insert data into the MySQL db.
           CONTENT_DOWNLOAD_AND_extract (@bool):
               If it's set to true, then the script will download 
               and extract the content DMOZ dump.
           CONTENT_CLEAN (@bool): 
               If it's set to true, then the script will clean the content file.
           CONTENT_PARSE_N_INSERT (@bool):
               If it's set to true, then the script will parse the content rdf file and
               insert data into the MySQL db.

6. Introduction/create_tables.php

NAME

      create_tables.php

USAGE

      Just call this script to create the needed tables in your database

CREATION DATE

      23 nov. 2003

HISTORY

      6 dec.  2003: Fixed some things. Added more complex regexes.
      12 feb. 2004: Fixed some "" things - changed them to ''
      19 may  2004: Rewrote all comments so they support ROBODoc - kinky stuff :-D

6.1. create_tables.php/Create_tables_in_db

FUNCTION

      Creates the tables in our database.

6.2. create_tables.php/Include_stuff

FUNCTION

      Include classes and the config file.

7. Introduction/drop_tables.php

NAME

      drop_tables.php

USAGE

      Just call this script to delete the created tables.

CREATION DATE

      23 nov. 2003

HISTORY

      12 feb. 2004: Fixed some "" things - changed them to ''

7.1. drop_tables.php/Drop_tables

FUNCTION

       Drop the tables in the database

7.2. drop_tables.php/Include_stuff

FUNCTION

       Include classes and the config file.

8. Introduction/start_script.php

NAME

      start_script.php

USAGE

      First make sure you have configured this script (see config.php).
      Then run this script from the command prompt. Use cd to change to the
      directory where your start_script.php is located.
      Then type this:
          UNIX:
              You probably have a symbolic link; if not, search Google:
              php start_script.php
          Windows:
              Locate your php.exe. If it is C:\php\php.exe,
              then do the following:
              C:\php\php.exe start_script.php

FUNCTION

      This script initializes all the classes and runs the import.
      The best way to configure this script is through config.php, but you may
      also wish to edit it in here. This script is well documented,
      so it should not be that hard.

TYPE

      A script used to create classes.

CREATION DATE

      8 dec. 2003

HISTORY

      12 feb. 2004: Remade it. You now control this script from config.php!
      13 feb. 2004: Added class_split.php
      15 may  2004: Removed class_split.php - the new XML parser has no need for it :-)
      19 may  2004: Rewrote all comments so they support ROBODoc - kinky stuff :-D
      24 may  2004: Well, lots of improvements.. it's sick ;)

USES

      All classes

8.1. start_script.php/Check_for_updates

FUNCTION

      This section downloads the headers for the structure file.
      Then it checks when the file was last modified.
      Finally, it compares that date with the user's last update.

8.2. start_script.php/Common_calls

FUNCTION

      Common calls: connect to the database
      Create the objects:
          check_url (checks a specific URL)
          downloadfile (downloads files)
          Clean_xml (cleans the XML files)
          parse_xml_structure
          parse_xml_content

8.3. start_script.php/Content_file

SECTION

      Calls that handle the DMOZ content file 

8.3.1. Content_file/Clean_xml

FUNCTION

       Clean the content file! (dirty xml - we don't like it :D)  

8.3.2. Content_file/Download

FUNCTION

       Download the content file

8.3.3. Content_file/Parse_and_insert

FUNCTION

       Parse the content RDF file and insert it into a database

8.4. start_script.php/Include_stuff

FUNCTION

      Include classes and the config file.

8.5. start_script.php/set_time_limit

FUNCTION

      Set the maximum execution time to unlimited

8.6. start_script.php/Structure_file

SECTION

      Calls that handle the DMOZ structure file   

8.6.1. Structure_file/Clean_xml

FUNCTION

      Clean the structure file! (dirty xml - we don't like it :D) 

8.6.2. Structure_file/Download

FUNCTION

      Download the structure file

8.6.3. Structure_file/Parse_and_insert

FUNCTION

      Parse the structure RDF file and insert it into a database