Parser

From Promoserv

Data acquisition - introduction

Terminology:

  • Listener is a program which waits for a message and delivers it to a parser-module.
  • Parser here means the framework of message decoding modules and/or programs which handle the data acquisition. This document describes the parser-modules used at NHMS with real-time data flow.

Although the terms parser and listener are often used interchangeably, and the two are often combined into the same program, they are different objects. A parser does not need a listener for its work, for example when parsing messages in files which are received via FTP. Likewise, a listener does not have to use any parser-modules, although listeners very often do so.

Parsers are developed in the Perl programming language using a modularised approach. The programming style is mostly object oriented, but in some places there are traditional procedural parts. In the Perl world it is quite easy to mix both approaches and use whichever is more convenient for the current task. Parsers, as well as other modules built in-house, are located in the non-standard name space FMI. In Lithuania the name space was LHMS. For the Vietnam project, and perhaps other projects, the name space NHMS will be used.


Parser-modules at NHMS-namespace

[Figure: Nhms modules.png]

Note! Although structurally similar, FMI and NHMS modules are not the same, and they can't be used together.

Parser-modules are located in the NHMS::Parser name space. The base class for these modules is Parser.pm (NHMS::Parser). It provides common methods for parser-modules (or classes), e.g. creating database connections, storing decoded values to the database and checking possible checksums of messages. For a more detailed description, see the documentation of NHMS::Parser.

Actual parsing is done in the NHMS::Parser::XXX modules. Each of these modules is a specialised library which can decode a certain type of message.

Methods provided by NHMS::Parser

  • connect_to_db: Connects to the database. In future this will probably be a wrapper method for the connect_to_pg and connect_to_oracle methods.
  • store: Stores decoded observations to a RAWDATA-style table

Helper methods by NHMS::Parser

  • is_obs_in_future: Checks whether the given obstime is in the future
  • calculate_client_msg: Selects the return code to be returned when there are several possibilities (used only internally unless the message includes a checksum)
  • calculate_md5: Calculates the md5-sum of a message
  • check_digest: Checks whether a message has a checksum
  • extract_digest: Extracts the message body, checksum type and checksum from a received message
  • print_values: Prints the values of the multidimensional hash in human readable form.
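
As an illustration of the checksum helpers listed above, the following is a minimal sketch and not the NHMS::Parser implementation. It assumes that the digest is a hex-encoded MD5 of the message body; the function names example_md5 and example_digest_ok are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: hex MD5 of a message body (the encoding is an assumption).
sub example_md5 {
    my ($body) = @_;
    return md5_hex($body);
}

# Hypothetical helper: true if the digest sent with the message matches the body.
sub example_digest_ok {
    my ($body, $sent_digest) = @_;
    return lc($sent_digest) eq example_md5($body) ? 1 : 0;
}

my $body = 'hello';
my $ok   = example_digest_ok($body, '5D41402ABC4B2A76B9719D911017C592');
print $ok ? "digest matches\n" : "digest mismatch\n";
```

Comparing in lower case makes the check independent of how the sender writes the hex digits.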

Required methods for NHMS::Parser::XXX

  • parse: Each module must provide a parse-method which accepts the message to be parsed as a string. This method is compulsory.
  • is_valid_message: Checks the syntactic validity of the message and returns true if the message is valid. All modules must provide this method, although it can return a hardcoded true (not recommended; even a small check is better than no check).

NHMS::Parser::XXX modules can provide any number of private methods they need to parse the message. These private methods should not be used outside the parser. It is highly recommended that each message format has its own parsing module; parsing of different formats should not be combined into a single module. Message formats should be defined so that they can be parsed without knowledge external to the parser. For example, the Synop-message is problematic because its date field is not fully qualified. The parser can, however, read the needed information from the database.
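
The two required methods can be sketched as follows. This is only an illustration under stated assumptions: the module name NHMS::Parser::Example, the semicolon-separated message format and the error text are hypothetical, and a real module would inherit from NHMS::Parser instead of defining its own constructor.

```perl
use strict;
use warnings;

package NHMS::Parser::Example;   # hypothetical module name

# One "code=value" pair of the assumed message format.
my $PAIR = qr/\w+=-?\d+(?:\.\d+)?/;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Small syntax check for "station;obstime;code=value[;code=value...]".
sub is_valid_message {
    my ($self, $msg) = @_;
    return 0 unless defined $msg;
    return $msg =~ /^\w+;\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(?:;$PAIR)+$/ ? 1 : 0;
}

# Accepts the message as a string and returns a hash reference.
sub parse {
    my ($self, $msg) = @_;
    return { PARSER_ERROR => '400 Bad request: malformed message' }
        unless $self->is_valid_message($msg);
    my ($station, $obstime, @pairs) = split /;/, $msg;
    my %parsed;
    for my $pair (@pairs) {
        my ($code, $value) = split /=/, $pair, 2;
        $parsed{$station}->{$obstime}->{$code}->{1} = [ $value, 0 ];  # sensor_id 1, qc0_flag 0
    }
    return \%parsed;
}

package main;

my $parser = NHMS::Parser::Example->new(DATA_ORIGIN => 10);
my $result = $parser->parse('1001;2012-05-01T10:00;TA=21.5;RH=80');
my $ta     = $result->{1001}{'2012-05-01T10:00'}{TA}{1}[0];
print "TA = $ta\n";
```

Calling is_valid_message at the start of parse keeps the syntax check in one place, so a listener can also run it on its own before parsing.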

Internal communication between parser-modules

Parser-modules communicate with the base-module using a multidimensional hash. The current implementation in NHMS::Parser is

$parsed{STATION_ID}->{OBSTIME}->{VARIABLE_CODE}->{SENSOR_ID}=[$value,$qc0_flag];

At normal stations sensor_id is 1. Meta-information about sensors whose sensor_id is not 1 must be looked up from the database. Qc0_flag is usually 0, unless some kind of quality control has been done at the station (or in the parser).
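
The structure above can be filled and walked as in the following sketch. The station id, obstime format, variable codes and the helper name flatten_parsed are hypothetical values chosen for illustration only.

```perl
use strict;
use warnings;

# Example data in the STATION_ID -> OBSTIME -> VARIABLE_CODE -> SENSOR_ID layout.
my %parsed;
$parsed{1001}->{'2012-05-01 10:00'}->{TA}->{1} = [ 21.5, 0 ];
$parsed{1001}->{'2012-05-01 10:00'}->{TA}->{2} = [ 21.7, 0 ];   # a second sensor
$parsed{1001}->{'2012-05-01 10:00'}->{RH}->{1} = [ 80,   0 ];

# Hypothetical helper: flatten the hash into one row per stored value.
sub flatten_parsed {
    my ($data) = @_;
    my @rows;
    for my $station (sort keys %$data) {
        for my $obstime (sort keys %{ $data->{$station} }) {
            for my $code (sort keys %{ $data->{$station}{$obstime} }) {
                my $sensors = $data->{$station}{$obstime}{$code};
                for my $sensor_id (sort { $a <=> $b } keys %$sensors) {
                    my ($value, $qc0_flag) = @{ $sensors->{$sensor_id} };
                    push @rows, [ $station, $obstime, $code, $sensor_id, $value, $qc0_flag ];
                }
            }
        }
    }
    return @rows;
}

my @rows = flatten_parsed(\%parsed);
print join(' | ', @$_), "\n" for @rows;
```

Each row corresponds to one value that a store-like method would write to a RAWDATA-style table.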

Other requirements for parsers

Parsers require the following variable to be set

  • $parser->{DATA_ORIGIN}

Required by the store-method. This is an identifier set by the listener to indicate which data-transfer route was used. It should be unique for each combination of message type and communication method.

Listeners

Listeners are programs which wait for TCP/IP socket connections from a station or whatever the data source (data_origin) is. They are located at NHMS::DA::TCP::XXX. It would be possible to use other protocols such as UDP with the same framework, but so far only TCP sockets are implemented. Internally most listeners use the Net::Server module to create listener processes. Net::Server is a stable Perl module which supports several different server models. NHMS-listeners usually use the pre-forking approach.

Child classes of NHMS::Socket are data acquisition specific modules, i.e. for each route and message format (from the listener's perspective) there should be a separate module. Sometimes the same module is used for technically different routes, which are then identified by certain parameters. For example, in FMI it is the same from the listener's point of view whether a message from M500-AWS-stations is sent over a GPRS connection or a LAN connection. The listener sees only an incoming TCP connection in both cases.

NHMS::DA::TCP::XXX classes must provide a process_connection method, which overrides the default method provided by Net::Server and handles everything needed for receiving the message, including possible digest and syntax checks (provided by the NHMS::Parser class). One instance of a listener should handle only a certain type of message, so that it knows to which parser the message should be directed.

Listeners must also define these variables, which are needed by parser-modules:

  • DATA_ORIGIN (INT): Id of the technical route the observations came from (in FMI known as DATA_SOURCE)
  • OPER (BOOLEAN): Defines whether the listener is running operationally or for testing purposes

Listeners can also define this variable:

  • TARGET_TABLE (STRING): Name of the database table to be used; otherwise the parser's default is used.

Configuration files

Listeners usually use 3 different configuration files. It is recommended that information contained in configuration files is not hardcoded into scripts.

Listener.conf

Listener.conf is the single most important configuration file. It uses the ini-file format with blocks and key=value pairs. It is located in the same directory as the scripts which start the listeners. Some fields are required and some are optional (with decent default values). Individual listeners can add the configuration values they need to their own configuration blocks. For example, the listener.conf section for fmiaws-listeners looks like:

[10min_fmiaws_lan_9760]
program=vietnam/scripts/listener_fmiaws_9760.pl
pidfile=/tmp/listener_fmiaws_9760.pid
logfile=/var/log/nhms/listener_fmiaws_9760.log
log4perl_conf=vietnam/scripts/listener_log.conf
target_table=rawdata
oper=0
data_origin=10
port=9760

The naming convention for block names is usually: interval of the observations (10min), type of station (fmiaws), primary data communication method (lan; could also be gprs etc.) and the port number on which the listener is listening (9760).

  • program: Relative path and name of the script
  • pidfile: File where the pid of the listener is stored
  • logfile: Logfile of the listener
  • log4perl_conf: Configuration for logging. If the listener is not using Log4perl, this is optional (use of Log4perl is recommended)
  • target_table: Table where parsed observations should be stored. It is recommended to always define the table. Useful options are rawdata (for operative listeners) and test_rawdata (for testing purposes)
  • oper: Declares whether the listener is in real use or in test use. There are some safety checks to prevent a test-listener from inserting data into the operative rawdata-table
  • data_origin: Identifier of the data transfer method and listener. For example, data_origin 10 means that the observation has arrived from an fmiaws-station, using the 10min_fmiaws_lan_9760-listener.
  • port: Port to which the listener is bound (listening)
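
To make the block layout and naming convention concrete, here is a minimal sketch which reads such a block and splits its name into parts. The reader function read_ini_text is a hypothetical, hand-written stand-in used only to keep the example self-contained; it is not the configuration module used by the real scripts.

```perl
use strict;
use warnings;

# Hypothetical minimal ini reader: returns { block => { key => value } }.
sub read_ini_text {
    my ($text) = @_;
    my (%conf, $block);
    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;                      # trim whitespace
        next if $line eq '' or $line =~ /^[#;]/;      # skip blanks and comments
        if ($line =~ /^\[(.+)\]$/) {
            $block = $1;                              # start of a new block
        }
        elsif (defined $block and $line =~ /^([^=]+?)\s*=\s*(.*)$/) {
            $conf{$block}{$1} = $2;                   # key=value inside a block
        }
    }
    return \%conf;
}

my $text = <<'END_CONF';
[10min_fmiaws_lan_9760]
target_table=rawdata
oper=0
data_origin=10
port=9760
END_CONF

my $conf  = read_ini_text($text);
my $name  = '10min_fmiaws_lan_9760';
my $block = $conf->{$name};

# Block name convention: interval, station type, communication method, port.
my ($interval, $station_type, $method, $name_port) = split /_/, $name;
print "$station_type listener via $method on port $block->{port}\n";
```

Keeping the port in both the block name and the port field means the two can be compared as a cheap consistency check when a listener starts.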

db_user.conf

This configuration file is also important because it defines the database users and passwords used by listener/parser-modules. It is also in ini-format. The usual location for this file is the $HOME/etc directory.

There are two possibilities for naming the blocks: either by database user account (for example: qc, collector) or by purpose (oper_listener, test_listener). Required fields may differ depending on the database system used; the following must be defined for the NHMS system with a PostgreSQL database.

  • user: username used for connecting to the database
  • password: password of user
  • host: Hostname or IP of the server running the database
  • db: name of database
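
The four fields map naturally onto a PostgreSQL connection string. The sketch below only builds and prints the DSN so that it runs without a database; the account values and the helper name pg_dsn are hypothetical, and how connect_to_db actually uses the fields is not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical values, as they would come from one block of db_user.conf.
my %db_conf = (
    user     => 'collector',
    password => 'secret',
    host     => 'dbhost.example',
    db       => 'nhms',
);

# Hypothetical helper: check required fields and build a DBD::Pg style DSN.
sub pg_dsn {
    my ($c) = @_;
    for my $key (qw(user password host db)) {
        die "db_user.conf: missing required field '$key'\n"
            unless defined $c->{$key} and length $c->{$key};
    }
    return "dbi:Pg:dbname=$c->{db};host=$c->{host}";
}

my $dsn = pg_dsn(\%db_conf);
print "$dsn\n";

# With DBI and DBD::Pg installed, a connection could then be opened with:
#   my $dbh = DBI->connect($dsn, $db_conf{user}, $db_conf{password}, { RaiseError => 1 });
```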

log4perl.conf

This file is used for configuring logging from the listener. It uses its own format and allows a variety of logging targets (screen, STDOUT, file, database, TCP/IP socket, syslog...). See the example in the same directory where the starting scripts are located.


Error handling

Error handling differs a bit from the usual Perl style. While most methods return true or false, there are a few exceptions which must be checked. The parse method always returns a hash reference. If parsing succeeds, the hashref looks like the example defined above. If parsing fails, parsing of the message is stopped and a hash reference containing only the key PARSER_ERROR is returned. The value of PARSER_ERROR is a string which includes the return code and the reason. This string can be sent back to the client if ERRORS_TO_CLIENT is true.
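
Because parse returns a hash reference in both cases, the caller has to look for the PARSER_ERROR key instead of testing for a false value. A minimal sketch of such a check, with a hypothetical helper name and a hypothetical error string:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (1, undef) on success, (0, $error_string) on failure.
sub check_parse_result {
    my ($result) = @_;
    if (ref($result) eq 'HASH' and exists $result->{PARSER_ERROR}) {
        return (0, $result->{PARSER_ERROR});
    }
    return (1, undef);
}

# A failed parse, as described above: only the PARSER_ERROR key is present.
my $failed = { PARSER_ERROR => '400 Bad request: unexpected field count' };

my ($ok, $error)     = check_parse_result($failed);
my $errors_to_client = 1;   # stands in for the ERRORS_TO_CLIENT setting
my $reply            = (!$ok and $errors_to_client) ? $error : undef;

print "reply to client: $reply\n" if defined $reply;
```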

The store method always returns client_msg as a string. It contains a return code and an explanation. If there have been several problems in storing the values to the database, client_msg is determined by the calculate_client_msg method (usually the most common client_msg is selected). In case of success, client_msg contains the string '200 OK'.
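
The "most common client_msg" rule can be sketched as below. This is an assumption about how such a selection might work, not the actual calculate_client_msg code; in particular the alphabetical tie-break and the function name most_common_msg are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the message that occurs most often.
# Ties are broken alphabetically (an assumption made for determinism).
sub most_common_msg {
    my (@msgs) = @_;
    return undef unless @msgs;
    my %count;
    $count{$_}++ for @msgs;
    my ($winner) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return $winner;
}

my $client_msg = most_common_msg(
    '200 OK',
    '406 Not acceptable: duplicate',
    '406 Not acceptable: duplicate',
);
print "$client_msg\n";
```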

Logging

Logging is done using the Log::Log4perl framework (a Perl port of the Java logging framework Log4j). Log4perl gives a consistent way to log at several levels and to change the verbosity of the log at runtime or through a configuration file.

External communication

Parser and listener-modules can inform clients about message handling results. Communication is done with so-called 'return codes'. These codes are very similar to those used in the HTTP protocol.

A parser always returns a return code, but whether it goes to the client is up to the listener. The listener's behaviour is determined by the ERRORS_TO_CLIENT variable, which can be set either at initialisation or at runtime. For example, some FMI-listeners behave differently depending on the incoming message: if the message contains a checksum (digest), the return code is sent to the client; if there is no checksum, no return code is sent.
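
The decision described above can be sketched as a small function. The name should_reply is hypothetical, and treating an undefined ERRORS_TO_CLIENT as "not set" is an assumption made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether the return code is sent to the client.
# An explicit ERRORS_TO_CLIENT value wins; otherwise reply only when the
# incoming message carried a checksum.
sub should_reply {
    my ($errors_to_client, $has_digest) = @_;
    return ($errors_to_client ? 1 : 0) if defined $errors_to_client;
    return $has_digest ? 1 : 0;
}

print "with digest, no override:    ", should_reply(undef, 1), "\n";
print "without digest, no override: ", should_reply(undef, 0), "\n";
print "without digest, forced on:   ", should_reply(1, 0), "\n";
```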

Code | Explanation | Possible reasons | Resend
200 | OK | Everything OK | No
400 | Bad request | Problem in communication or message malformed | Yes
406 | Not acceptable | Message can't be accepted although it might be formally correct. Sender should not retry but move to the next message. The most usual reason is that the message is a duplicate. | No
408 | Request timeout | Connection has timed out for some reason. | Yes
500 | Internal server error | There is some internal error, see logfiles for details. | Yes

When a client receives a return code from a listener, it should check only the numerical code part. The answer usually also contains a string which gives more detailed information about the problem, but this string is meant for debugging and clients should not try to decode it automatically.

The usual behaviour of a client when receiving a return code other than 200 OK is to resend the message. Resending should be retried periodically until it succeeds. However, if the client receives code 406, the message should not be resent.
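
On the client side this rule could look like the following sketch. Only the numeric part of the reply is examined. The function name should_resend is hypothetical, and retrying after an unreadable reply is an assumption not stated in this document.

```perl
use strict;
use warnings;

# Resend column of the return code table.
my %RESEND = (200 => 0, 400 => 1, 406 => 0, 408 => 1, 500 => 1);

# Hypothetical helper: decide from the listener's reply whether to resend.
sub should_resend {
    my ($reply) = @_;
    # Assumption: a missing or unreadable reply is treated as a failure.
    return 1 unless defined $reply and $reply =~ /^\s*(\d{3})\b/;
    my $code = $1;
    # Codes other than 200 and 406 lead to a resend, as described above.
    return exists $RESEND{$code} ? $RESEND{$code} : 1;
}

for my $reply ('200 OK', '406 Not acceptable: duplicate', '408 Request timeout') {
    printf "%-32s resend=%d\n", $reply, should_resend($reply);
}
```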

Starting listeners, scripts

check_listener.pl

Individual listeners are started by short scripts. However, it is recommended to use the check_listener.pl script located in the same directory as the other scripts. Check_listener.pl is able to start, stop, restart and check the status of a listener. See check_listener.pl --help for more information.

starting scripts

Listeners are started by short scripts which load the listener-module and/or parser-modules. Parser-modules also need calling scripts if they are used without listener-modules. These scripts must provide some subroutines and define some variables, which are mostly the same as the variables described earlier.

Subroutines provided by scripts

  • getLogfileName: Used by the Log4perl configuration for normal logging. It should return the absolute or relative name of the file where logs should be saved.
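
One way a Log4perl configuration can call getLogfileName is through a Perl hook in the appender's filename setting. The sketch below is an assumption about how this might be wired up, not a copy of the project's listener_log.conf; the appender name LOGFILE, the log path and the pattern are hypothetical. The configuration is kept in a string so that the example runs without Log::Log4perl installed.

```perl
use strict;
use warnings;

# Provided by the starting script; the path here is a hypothetical example.
sub getLogfileName {
    return 'listener_example.log';
}

# Log4perl configuration text; the filename is resolved by calling the sub above.
my $log_conf = <<'END_LOG_CONF';
log4perl.rootLogger = INFO, LOGFILE
log4perl.appender.LOGFILE = Log::Log4perl::Appender::File
log4perl.appender.LOGFILE.filename = sub { return getLogfileName(); }
log4perl.appender.LOGFILE.mode = append
log4perl.appender.LOGFILE.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.LOGFILE.layout.ConversionPattern = %d %p %m%n
END_LOG_CONF

print "logging to: ", getLogfileName(), "\n";

# With Log::Log4perl installed, the configuration could be loaded with:
#   Log::Log4perl->init(\$log_conf);
```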

Variables set by scripts

The following variables need to be set for a listener to start properly. For some variables there are decent default values. Listener-namespace:

  • OPER: Needed by parser-module
  • DATA_ORIGIN (DATA_SOURCE in FMI): Needed by parser-module
  • TARGET_TABLE: Needed by parser-module
  • ERRORS_TO_CLIENT: Defines whether return codes are sent to the client or not. The listener can detect the wanted behaviour from the message format (default: send if the message has a checksum, otherwise don't send), but the default can be overridden by this variable.
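
Example of reading these values in a starting script ($cfg and $user_cfg are the configuration objects for listener.conf and db_user.conf):

# Names of the blocks in db_user.conf and listener.conf
my $user_prefix     = 'havkeruu';
my $listener_prefix = '10min_naws_gprs';

# Listener settings from listener.conf
my $logfile = $cfg->param("$listener_prefix.logfile");
my $pidfile = $cfg->param("$listener_prefix.pidfile");
my $port    = $cfg->param("$listener_prefix.port");

# Database account from db_user.conf
my $db_user     = $user_cfg->param("$user_prefix.user");
my $db_password = $user_cfg->param("$user_prefix.password");
my $db_db       = $user_cfg->param("$user_prefix.db");

$socket->{VERBOSE}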