NAME
class_clean.php
USAGE
This file holds the CleanXML class, which is used to clean up dirty ODP XML.
CREATION DATE
23 nov. 2003
HISTORY
6 dec. 2003: Fixed some things. Added more complex regexes.
12 feb. 2004: Fixed some "" things - changed them to ''
19 may 2004: Rewrote all comments so that they support ROBODOC - kinky stuff :-D
FUNCTION
This class's main purpose is to clean dirty XML.
PROPERTIES
_dirty_data: @string - holds the "dirty" data
_clean_data: @string - holds the clean data
write_fp: @file_pointer - used to open a file pointer to the write file
METHODS
cleanFile, _correction, _writeTheCleanData
EXTENDS
Dot
FUNCTION
Internal function that cleans the XML data. It strips formatting tags and unsupported characters from the data.
RegExp patterns and their corrections:
'=<[biu]>(.*)</[biu]>=i' -> "$1"
'=&(?!amp;)=i' -> "&amp;"
'=<strong>(.*)</strong>=i' -> "$1"
'=[\x00-\x08\x0b-\x0c\x0e-\x1f]=' -> ''
ACCESS
Private
USED BY
cleanFile
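The pattern and correction lists above map naturally onto PHP's preg_replace with two parallel arrays. The following is only a sketch of that idea: the function name correctDirtyXml is hypothetical, the body of the real _correction method is not shown in this documentation, and the "&amp;" replacement for bare ampersands is an assumption.

```php
<?php
// Hypothetical stand-in for the private _correction method.
function correctDirtyXml(string $dirty): string
{
    // Patterns as listed in the documentation; "=" is the regex delimiter.
    $patterns = [
        '=<[biu]>(.*)</[biu]>=i',          // <b>, <i>, <u> wrappers
        '=&(?!amp;)=i',                    // ampersands not followed by "amp;"
        '=<strong>(.*)</strong>=i',        // <strong> wrappers
        '=[\x00-\x08\x0b-\x0c\x0e-\x1f]=', // control characters
    ];
    // Assumed corrections, applied in the same order as the patterns.
    $replacements = ['$1', '&amp;', '$1', ''];

    // preg_replace returns null on a regex error; fall back to the input then.
    return preg_replace($patterns, $replacements, $dirty) ?? $dirty;
}
```

Note that the second pattern also rewrites other entities (for example &lt;), because it only exempts "amp;".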
FUNCTION
This function writes our clean data to our write file pointer. Finally, it resets _clean_data.
ACCESS
Private
USED BY
_correction
FUNCTION
This is the function that cleans the file. First it creates two file pointers: one to the read file and one to the write file. It then reads the read file until EOF. While reading, it passes the dirty data to the _correction function, where the data is cleaned. When this process is finished we have a new "clean" file called clean_$__filename. The script then deletes the dirty file and renames clean_$__filename to just $__filename.
INPUT
$__filename (@string): The filename of the file we are about to clean
ACCESS
Public
USES
_correction, _writeTheCleanData
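The read, clean, write, delete and rename flow described for cleanFile could look roughly like the sketch below. The names cleanChunk and cleanFileSketch are hypothetical, cleanChunk only stands in for _correction, and reading line by line is an assumption (the documentation only says the file is read until EOF).

```php
<?php
// Hypothetical stand-in for _correction: strips control characters only.
function cleanChunk(string $chunk): string
{
    return preg_replace('=[\x00-\x08\x0b-\x0c\x0e-\x1f]=', '', $chunk) ?? $chunk;
}

// Hypothetical stand-alone version of the cleanFile flow.
function cleanFileSketch(string $filename): void
{
    $cleanName = dirname($filename) . '/clean_' . basename($filename);

    // One file pointer for reading, one for writing.
    $readFp  = fopen($filename, 'rb');
    $writeFp = fopen($cleanName, 'wb');
    if ($readFp === false || $writeFp === false) {
        throw new RuntimeException("Cannot open $filename or $cleanName");
    }

    // Read until EOF, cleaning each line before writing it out.
    while (($line = fgets($readFp)) !== false) {
        fwrite($writeFp, cleanChunk($line));
    }

    fclose($readFp);
    fclose($writeFp);

    // Replace the dirty file with the clean one.
    unlink($filename);
    rename($cleanName, $filename);
}
```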
NAME
class_command.php
USAGE
This file holds some important classes. It includes the following classes:
Database: A class that is used to handle Database related stuff
CheckURL: A class that checks a given URL
Dot: A class that handles the Dot printing in the command prompt
CREATION DATE
23 nov. 2003
HISTORY
6 dec. 2003: Just added the change log!
7 dec. 2003: Inserted the Dot class
12 feb. 2004: Fixed some "" things in variables, replacing them with ''
19 may 2004: Rewrote all comments so that they support ROBODOC - kinky stuff :-D
FUNCTION
This holds some basic methods, such as an error handler.
METHODS
error
FUNCTION
This is our error handler. All errors should be passed to this error handler!
INPUT
$__error_type (@string): Holds what kind of error it is. Can be a Fatal error or just a Warning.
$__error_message (@string): Message that should be displayed.
ACCESS
Public
FUNCTION
This is a method to print to our console. All informational output should be passed to this method.
INPUT
__text (@string): Holds the text that is going to be displayed.
__line_wrap (@bool): Whether the text should be wrapped in \n. Defaults to true.
ACCESS
Public
FUNCTION
This class can get information about a URL. It has methods that download the headers of a given URL. Those headers are then parsed for the different pieces of information that we need.
PROPERTIES
data (@string): Header data that we get back from a server when we give it a URL
last_modified (@date): A property that holds the date on which the URL was last modified
current_version (@date): The current version of the ODP dump
content_lenght (@int): The content length of a URL, i.e. the file size
METHODS
downloadHeaders, _urlParse, lastModified, contentLenght, lastModifiedCompare
FUNCTION
Method that splits a URL into its parts.
INPUT
$__url (@string): The URL that you want to split
$__request (@string): Which part to return, i.e. host or path
OUTPUT
@string - host or path
ACCESS
Private
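Returning either the host or the path of a URL is something PHP's built-in parse_url already supports. A minimal sketch, assuming that is roughly what the private method does; the function name urlPart and the example URL are made up for illustration.

```php
<?php
// Hypothetical helper; the real _urlParse body is not documented here.
function urlPart(string $url, string $request): string
{
    // parse_url returns an array with keys such as 'host' and 'path'.
    $parts = parse_url($url);
    if ($parts === false || !isset($parts[$request])) {
        return ''; // malformed URL or the requested part is missing
    }
    return (string) $parts[$request];
}
```

For example, urlPart('http://example.org/dumps/structure.rdf.u8.gz', 'host') yields the host name and the same call with 'path' yields the path on that host.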
FUNCTION
Method that extracts the file size from our header data (data property)
ACCESS
Public
FUNCTION
Method that gets the headers from a URL, which is the input of this method.
INPUT
$__url (@string): The URL which we want to get headers from
ACCESS
Public
FUNCTION
Method that extracts the last-modified date from our header data (data property)
ACCESS
Public
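Both extraction methods (file size and last-modified date) boil down to pulling one header value out of the raw header text held in the data property. The sketch below shows one way to do that with a regular expression; the function name headerValue and the sample headers are hypothetical, and the actual methods may parse the data differently.

```php
<?php
// Hypothetical helper: returns the value of one header from raw header
// text, or null when the header is not present.
function headerValue(string $rawHeaders, string $name): ?string
{
    // Match "Name: value" on its own line, case-insensitively, and
    // tolerate both "\n" and "\r\n" line endings.
    $pattern = '/^' . preg_quote($name, '/') . ':[ \t]*(.*?)[ \t]*\r?$/mi';
    return preg_match($pattern, $rawHeaders, $m) === 1 ? $m[1] : null;
}

// Made-up response headers, for illustration only.
$raw = "HTTP/1.1 200 OK\r\n"
     . "Last-Modified: Sun, 23 Nov 2003 10:00:00 GMT\r\n"
     . "Content-Length: 12345\r\n\r\n";

$size     = (int) headerValue($raw, 'Content-Length');
$modified = headerValue($raw, 'Last-Modified');
```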
FUNCTION
A method that checks whether the last-modified date of the ODP data matches the date found inside lastupdate.data. If it is the same, the script stops. If the ODP data dump is fresh, we proceed with our download, and our current last-modified date is stored in a file: lastupdate.data
ACCESS
Public
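The comparison against lastupdate.data can be sketched as follows. This is an assumption-laden illustration: the function name dumpIsFresh is hypothetical, it returns a boolean instead of stopping the script, and it treats any difference between the stored and the remote date as "fresh".

```php
<?php
// Hypothetical helper: returns true when the remote dump differs from the
// date stored in the stamp file, and records the new date in that case.
function dumpIsFresh(string $remoteLastModified, string $stampFile): bool
{
    $stored = is_file($stampFile) ? trim((string) file_get_contents($stampFile)) : '';

    if ($stored === $remoteLastModified) {
        return false; // same date: nothing new to download
    }

    // Fresh dump: remember its date for the next run.
    file_put_contents($stampFile, $remoteLastModified);
    return true;
}
```

A caller would pass the last-modified value taken from the headers and the path to lastupdate.data, and stop the run when the function returns false.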
FUNCTION
This just writes the last update date to a file
ACCESS
Public
FUNCTION
This class holds methods that handle Database related stuff, i.e. connecting to the Database, closing the connection and running queries.
METHODS
connect, close, sqlWithoutAnswer
FUNCTION
A method that closes the connection to the Database.
ACCESS
Public
FUNCTION
A method that connects to the Database. This method depends on config.php and the Database globals.
ACCESS
Public
FUNCTION
A method that runs a MySQL query.
INPUT
$__query (@string): A SQL query.
ACCESS
Public
FUNCTION
This class can be used to control when a Dot (.) is going to be printed. For example, it's not smart to print a Dot for every row we insert into our db; then you would have about 2 million dots :)
USAGE
Start by setting the frequency (e.g. print the Dot on every 50000th call). Then just call printDot, and the method will work out whether it should print the Dot or not.
PROPERTIES
count (@int): The counter used to check whether we have reached the frequency, so that the Dot is not printed every time.
frequency (@int): How often the Dot should be displayed
METHODS
printDot, setFrequency
FUNCTION
A method that prints out the Dot if the counter has the same value as the frequency. Every time this function is called the counter is incremented.
ACCESS
Public
FUNCTION
A method that sets our frequency.
INPUT
$__freq (@int): A number, e.g. 50000 to print the Dot on every 50000th call.
ACCESS
Public
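The counter-and-frequency behaviour described for the Dot class can be sketched as below. The class name DotSketch is hypothetical, and resetting the counter after each printed Dot is an assumption (the documentation only says the counter is incremented on every call). Typed properties require PHP 7.4 or later.

```php
<?php
// Hypothetical re-creation of the Dot behaviour.
class DotSketch
{
    private int $count = 0;     // calls since the last printed Dot
    private int $frequency = 1; // print a Dot every N calls

    public function setFrequency(int $freq): void
    {
        $this->frequency = max(1, $freq); // guard against zero or negatives
    }

    // Increments the counter; prints a Dot once the frequency is reached.
    public function printDot(): bool
    {
        $this->count++;
        if ($this->count >= $this->frequency) {
            echo '.';
            $this->count = 0; // assumed reset, not stated in the docs
            return true;
        }
        return false;
    }
}
```

With a frequency of 3, seven calls to printDot print two Dots (on the third and sixth call).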
NAME
class_download.php
USAGE
Start by setting the download_speed.
Next set the filename to download.
Next delete the old file by calling the method delete().
Next call setPath (the path of the file we wish to download).
Next call the method download() to download it.
Next call the method extract() to extract it.
FUNCTION
This file holds one class (DownloadFile). This class is used to download and extract the DMOZ data dumps. Lucky for us, they are packed with gzip, and PHP supports gunzip.. yay ;)
CREATION DATE
23 nov. 2003
HISTORY
23 nov. 2003: Created the class
6 dec. 2003: Just added the change log!
7 dec. 2003: Added support for the class CheckURL
12 feb 2004: Remade the gunzip extractor. Now it rocks! Fixed some "" to ''.
19 may 2004: Rewrote all comments so that they support ROBODOC - kinky stuff :-D
USES
CheckURL, Dot
FUNCTION
This class's main purpose is to download and extract the DMOZ data dumps.
PROPERTIES
filename (@string): Filename of the file that we want to work our magic on.
path (@string): The path of the file we are about to download
download_speed (@int): The download speed (in KB)
METHODS
setPath, setDownloadSpeed, delete, download, extract.
EXTENDS
Dot
FUNCTION
Deletes the old file, if it exists.
ACCESS
Public
FUNCTION
Downloads our file.
USES
CheckURL, Dot
ACCESS
Public
FUNCTION
Extracts the downloaded gzip file.
ACCESS
Public
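Extracting a gzip-compressed dump can be done with PHP's zlib functions by reading the archive in chunks, which keeps memory use low for large files. This is only a sketch: the function name gunzipFile and the chunk size are assumptions, and the zlib extension must be available.

```php
<?php
// Hypothetical helper; the real extract() method is not documented in detail.
function gunzipFile(string $gzPath, string $outPath, int $chunkBytes = 8192): void
{
    $in  = gzopen($gzPath, 'rb');
    $out = fopen($outPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException("Cannot open $gzPath or $outPath");
    }

    // Decompress chunk by chunk until the end of the archive.
    while (!gzeof($in)) {
        $chunk = gzread($in, $chunkBytes);
        if ($chunk === false) {
            break; // read error: stop copying
        }
        fwrite($out, $chunk);
    }

    gzclose($in);
    fclose($out);
}
```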
FUNCTION
Set the download speed
INPUT
__speed (@int): Download speed in KB/s, e.g. 25 for 25 KB/s
ACCESS
Public
FUNCTION
Set the filename of the file to download
INPUT
__filename (@string): What should our file be named :)
ACCESS
Public
FUNCTION
Set path of the file we are about to download.
INPUT
__path (@string): A URL
ACCESS
Public
NAME
class_parse.php
FUNCTION
This file contains all the main classes that are used to parse the XML (RDF) files. The following classes are included:
ParseXMLGlobal: A parent class with some methods that can be used by both structure and content parsing.
ParseXMLStructure: A class that is used to parse the structure RDF file.
CREATION DATE
23 nov. 2003
HISTORY
06 dec. 2003: Just added the change log! :)
07 dec. 2003: Added content parser
07 dec. 2003: Added a new class XMLGlobal!
07 dec. 2003: Fixed some bugs :D
12 feb. 2004: Fixed loads of "" (not-so-good code) errors. Fixed a HUGE bug (which took a long time to find): the bug set catids to 0, but it's fixed now!
13 feb. 2004: To work properly the class needs to load whole files into memory. To help your computer I have now created a class that splits the big file into some smaller parts.
18 feb. 2004: Fixed a bug in the split routine (it didn't split the content file :()
13 may 2004: Rewrote all comments so that they support ROBODOC - kinky stuff :-D
FUNCTION
A class that is used to parse the XML content file.
USAGE
Call setStartTime - sets the start time
Call setXMLFile(filename) - sets the filename of our XML file
Call startParse - starts parsing the document and inserting the data into our MySQL Database.
PROPERTIES
The basic properties to get this class going:
  count_rows (@int): How many rows we have done so far
  count_rows_temp (@int): A temporary counter (reset after ECHO_STATS_FREQUNCY rows)
XML tags and their contents:
  current_tag (@string): Holds which tag we are currently in
  permitted_tags (@array): An array that holds the permitted tags
Properties for the XML structure:
CONTENT LINKS (content_links):
  topic (@string)
  type (@string)
  resource (@string)
  catid (@int)
CONTENT DESCRIPTION (content_description):
  external_page (@string)
  title (@string)
  description (@string)
  ages (@string)
  mediadate (@date)
  priority (@int)
METHODS
startParse, _startTagProcessor, _endTagProcessor, _charDataProcessor
USAGE
This is our content processor
INPUT
__parser (@obj): The parser that is calling this handler
__data (@string): The character data
ACCESS
Private
USAGE
This is our end tag processor. When a tag ends, it is handled here.
INPUT
__parser (@obj): The parser that is calling this handler
__tag_name (@string): The name of the current tag
ACCESS
Private
USAGE
Function that processes the start tags.
INPUT
__parser (@obj): The parser that is calling this handler
__tag_name (@string): The name of the current tag
__attributes (@array): Attributes of the tag
ACCESS
Private
USAGE
Used to start parsing of our file.
ACCESS
Public
FUNCTION
A class that is used to parse the XML structure file.
USAGE
Call setStartTime - sets the start time
Call setXMLFile(filename) - sets the filename of our XML file
Call startParse - starts parsing the document and inserting the data into our MySQL Database.
PROPERTIES
The basic properties to get this class going:
  count_rows (@int): How many rows we have done so far
  count_rows_temp (@int): A temporary counter (reset after ECHO_STATS_FREQUNCY rows)
XML tags and their contents:
  current_tag (@string): Holds which tag we are currently in
  permitted_tags (@array): An array that holds the permitted tags
Properties for the XML structure:
  topic (@string)
  catid (@int)
  title (@string)
  description (@string)
  last_update (@date)
Variables for the XML data type:
  type
  resource
METHODS
startParse, _startTagProcessor, _endTagProcessor, _charDataProcessor
USAGE
This is our content processor
INPUT
__parser (@obj): The parser that is calling this handler
__data (@string): The character data
ACCESS
Private
USAGE
This is our end tag processor. When a tag ends, it is handled here.
INPUT
__parser (@obj): The parser that is calling this handler
__tag_name (@string): The name of the current tag
ACCESS
Private
USAGE
Function that processes the start tags.
INPUT
__parser (@obj): The parser that is calling this handler
__tag_name (@string): The name of the current tag
__attributes (@array): Attributes of the tag
ACCESS
Private
USAGE
Used to start parsing of our file.
ACCESS
Public
FUNCTION
This class holds global methods that can be used by the structure and content parsers.
USAGE
This class is used as a parent for XMLParseStructure and XMLParseContent
PROPERTIES
Status:
  h (@int): Hours
  m (@int): Minutes
  s (@int): Seconds
Everything else:
  xml_file (@string): The filename of the XML file we are parsing
  start_time (@int): The start time
METHODS
setXMLFile, setStartTime, _getMicroTime, _echoStatus, _splitTime, _startToParse
USED BY
XMLParseStructure, XMLParseContent
FUNCTION
A method that prints out the status
INPUT
__start_time (@int): When the script was started
__count_rows (@int): How many rows we have inserted so far
__milestone (@string): A text that tells a little about our milestone
USED BY
_endTagProcessor
ACCESS
Private
FUNCTION
A method that gets the microtime
USED BY
setStartTime, _echoStatus
ACCESS
Private
FUNCTION
A method that splits the script's current run time into hours, minutes and seconds
USED BY
_echoStatus
ACCESS
Private
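Splitting an elapsed run time into the h, m and s status values is plain integer arithmetic. A minimal sketch, assuming the elapsed time is available in seconds (for instance as the difference of two microtime(true) readings); the function name splitRunTime is hypothetical.

```php
<?php
// Hypothetical helper; the real _splitTime body is not documented here.
function splitRunTime(float $elapsedSeconds): array
{
    $total = (int) floor($elapsedSeconds); // drop the fractional part

    return [
        'h' => intdiv($total, 3600),      // whole hours
        'm' => intdiv($total % 3600, 60), // remaining whole minutes
        's' => $total % 60,               // remaining seconds
    ];
}
```

For an elapsed time of 3725.9 seconds this gives 1 hour, 2 minutes and 5 seconds.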
FUNCTION
A method that creates PHP's XML parser and starts parsing.
USED BY
startParse (XMLContentParser and XMLStructureParser)
ACCESS
Private
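Creating the parser and wiring up start tag, end tag and character data handlers is how PHP's event-based XML extension is normally used. The sketch below illustrates that wiring only; the class name TitleCollectorSketch, its properties and the sample XML are all hypothetical, and the real classes insert rows into MySQL instead of collecting titles.

```php
<?php
// Hypothetical minimal parser class showing the three handler types.
class TitleCollectorSketch
{
    public array $titles = [];
    private string $currentTag = '';

    public function startTag($parser, string $tagName, array $attributes): void
    {
        $this->currentTag = $tagName; // remember which tag we are in
    }

    public function endTag($parser, string $tagName): void
    {
        $this->currentTag = '';
    }

    public function charData($parser, string $data): void
    {
        // Tag names arrive upper-cased because case folding is on by default.
        if ($this->currentTag === 'TITLE' && trim($data) !== '') {
            $this->titles[] = $data;
        }
    }

    public function parse(string $xml): void
    {
        $parser = xml_parser_create();
        xml_set_element_handler($parser, [$this, 'startTag'], [$this, 'endTag']);
        xml_set_character_data_handler($parser, [$this, 'charData']);

        // xml_parse returns 1 on success and 0 on failure.
        if (xml_parse($parser, $xml, true) !== 1) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
        // On PHP versions before 8, also call xml_parser_free($parser) here.
    }
}
```

For large dumps, xml_parse can be called repeatedly with successive chunks of the file, passing true as the last argument only for the final chunk.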
FUNCTION
A method that you need to call to set the start time
ACCESS
Public
FUNCTION
Just a method that sets the filename
INPUT
__filename (@string): The filename of our XML file
ACCESS
Public
NAME
config.php
USAGE
In this file you have several options available to customize the script for your use.
TYPE
Just a file that contains some important information
CREATION DATE
3 dec. 2003
HISTORY
12 feb. 2004: Added some features
19 may 2004: Rewrote all comments so that they support ROBODOC - kinky stuff :-D
24 may 2004: Added console color (CONSOLE_COLOR) option
FUNCTION
Here you specify common global values:
ECHO_STATS (@bool): If ECHO_STATS is true, statistics will be displayed (stats are used when parsing the RDF documents).
ECHO_STATS_FREQUNCY (@int): How frequently the stats should be displayed.
DOWNLOAD_SPEED (@int): Download speed in kilobytes.
CONSOLE_COLOR (@string): Defines what color you want to use. You can set it to black, red, green or blue.
FUNCTION
Here you specify your Database information:
DB_SERVER (@string): The server address - could be localhost or a URL
DB_USER (@string): Database username
DB_PASSWORD (@string): Password for your Database
DB_Database (@string): The Database you create with create_tables.php
FUNCTION
Here you specify common global values:
$rdffile_structure (@string): Structure RDF filename
$rdffile_content (@string): Content RDF filename
WARNING
No need to edit those filenames!
FUNCTION
Here you specify what the script shall do:
Check_for_updates (@bool): If it's set to true, then the script will check for updated DMOZ dumps.
STRUCTURE_DOWNLOAD_AND_extract (@bool): If it's set to true, then the script will download and extract the structure DMOZ dump.
STRUCTURE_CLEAN (@bool): If it's set to true, then the script will clean the structure file.
STRUCTURE_PARSE_N_INSERT (@bool): If it's set to true, then the script will parse the structure RDF file and insert data into the MySQL db.
CONTENT_DOWNLOAD_AND_extract (@bool): If it's set to true, then the script will download and extract the content DMOZ dump.
CONTENT_CLEAN (@bool): If it's set to true, then the script will clean the content file.
CONTENT_PARSE_N_INSERT (@bool): If it's set to true, then the script will parse the content RDF file and insert data into the MySQL db.
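As an illustration of how such options might be set, the fragment below defines a few of the documented names as PHP constants. Whether config.php actually uses define() is an assumption based on the upper-case names, and every value shown is a placeholder, not a recommended or default setting.

```php
<?php
// Illustrative values only; adjust to your own setup.
define('ECHO_STATS', true);            // show statistics while parsing
define('ECHO_STATS_FREQUNCY', 50000);  // rows between two status lines
define('DOWNLOAD_SPEED', 25);          // kilobytes
define('CONSOLE_COLOR', 'green');      // black, red, green or blue

define('DB_SERVER', 'localhost');      // placeholder connection data
define('DB_USER', 'odp');
define('DB_PASSWORD', 'secret');

define('STRUCTURE_CLEAN', true);           // clean the structure file
define('STRUCTURE_PARSE_N_INSERT', true);  // parse it and fill the db
define('CONTENT_CLEAN', false);            // skip the content file here
define('CONTENT_PARSE_N_INSERT', false);
```

Switching the CONTENT_ flags off, as above, would limit a run to the structure dump only.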
NAME
create_tables.php
USAGE
Just call this script to create the needed tables in your Database
CREATION DATE
23 nov. 2003
HISTORY
6 dec. 2003: Fixed some things. Added more complex regexes.
12 feb. 2004: Fixed some "" things - changed them to ''
19 may 2004: Rewrote all comments so that they support ROBODOC - kinky stuff :-D
FUNCTION
Creates the tables in our Database.
FUNCTION
Include classes and the config file.
NAME
drop_tables.php
USAGE
Just call this script to delete the created tables.
CREATION DATE
23 nov. 2003
HISTORY
12 feb. 2004: Fixed some "" things - changed them to ''
FUNCTION
Drop the tables in the Database
FUNCTION
Include classes and the config file.
NAME
start_script.php
USAGE
First make sure you have configured this script (see config.php). Then run this script from the command prompt. Use cd to change to the directory where your start_script.php is located, and then type this:
UNIX (you probably have a symbolic link to php; if not, search Google):
  php start_script.php
Windows (locate your php.exe; if it is C:\php\php.exe, then do the following):
  C:\php\php.exe start_script.php
FUNCTION
This script initializes all the classes and runs the script. The best way to configure this script is through config.php, but you may also wish to edit it in here. This script is well documented, so it should not be that hard.
TYPE
A script used to create classes.
CREATION DATE
8 dec. 2003
HISTORY
12 feb. 2004: Remade it. You now control this script from config.php!
13 feb. 2004: Added class_split.php
15 may 2004: Removed class_split.php - the new XML parser has no need for it :-)
19 may 2004: Rewrote all comments so that they support ROBODOC - kinky stuff :-D
24 may 2004: Well, lots of improvements.. it's sick ;)
USES
All classes
FUNCTION
This section downloads the headers for the structure file. Then it checks when the file was last modified. Finally, it compares that date with the user's last update.
FUNCTION
Common calls:
Connect to the Database
Create the objects:
  check_url (checks a specific URL)
  downloadfile (downloads files)
  Clean_xml (cleans the XML files)
  parse_xml_structure
  parse_xml_content
SECTION
Calls that handle the DMOZ content file
FUNCTION
Clean the content file! (dirty xml - we don't like it :D)
FUNCTION
Download the content file
FUNCTION
Parse and insert the content RDF file into a Database
FUNCTION
Include classes and the config file.
FUNCTION
Set maximum execution time to none
SECTION
Calls that handle the DMOZ structure file
FUNCTION
Clean the structure file! (dirty xml - we don't like it :D)
FUNCTION
Download the structure file
FUNCTION
Parse and insert the structure RDF file into a Database