Reporting ETL and OLAP Project


Background

Goal

The goal of this project is to build ETL functionality into OpenMRS in order to support OLAP analysis.

Use Cases

  • TBD

Design

Definitions

OLAP
OLAP stands for Online Analytical Processing. OLAP generally refers to an approach of finding answers to questions that deal with multiple dimensions. For instance, how many patients were enrolled at a particular clinic in a particular year. Answering that question with a conventional, normalized database system would be complex and slow. Using OLAP, the answer would usually be pre-generated and waiting to be retrieved.
OLAP System
An OLAP system is a database that stores read-only, aggregated data. An OLAP system usually consists of a denormalized schema (star schema) to improve the speed of reporting. Any database can be used to create an OLAP system.
Star Schema
A star schema refers to the de-normalized data model used to store the aggregated data for an OLAP system. A star schema has a central fact table and outer dimension tables to help make it easier (and faster) to retrieve answers to complex questions.
Data Warehouse
A data warehouse is a database that stores an organization's historical, aggregated data. A data warehouse is basically an implementation of an OLAP system, usually with an added layer of cube structures that helps dimensionalize the data into "cubes". Mondrian is an example of a data warehouse tool.
ETL
ETL stands for Extraction, Transformation, and Load. This term refers to the manner in which a system would move data from a conventional, normalized schema to a de-normalized star schema. However, this term is loosely used for functionality that takes data from one data store, transforms it to create a new data structure, and stores the transformed data into a schema based on the new data structure.


Requirements

  1. Define a star schema to answer a set of general questions
    • Analyze reporting requirements
    • Find general pattern of reporting questions
  2. Implement star schema
    • Install MySQL
    • Configure MySQL as a decision support / OLAP system
    • Design star schema as a denormalized SQL schema
  3. Implement ETL functionality
    • Install Pentaho's KETTLE
    • Work with KETTLE Spoon to determine the types of ETL functionality available
    • Implement functionality in OpenMRS module using KETTLE API
  4. Implement ETL scripts to move data from the OpenMRS to OLAP system
    • Write scripts that will perform ETL functionality for
  5. Implement core reporting code within OpenMRS that can use KETTLE module
    • Define a new interface or class that will use ETL functionality (IDataSet <-- TransformDataSet?)
  6. Implement data warehouse / OLAP system
    • Install Mondrian
    • Design cubes (Pentaho has a cube designer) based on reporting and analysis needs
    • Define MDX queries to access data and answer questions

Installation Guide

  1. Install OpenMRS
    You can find instructions for this at Step-by-Step_Installation_for_Implementers and Step-by-Step_Installation_for_Developers.
  2. Install Ant
    Download and install Ant from http://ant.apache.org/. You can find installation instructions at http://ant.apache.org/manual/install.html.
  3. Install MySQL databases (schemas)
    • Download pentaho-openmrs.zip and unzip it. In the pentaho-openmrs directory that is created, you will find scripts and configuration files to customize Pentaho for MySQL and OpenMRS.
    • To create the schema for the reporting database and others needed by Pentaho components execute the following command, from the pentaho-openmrs directory. Before executing, replace "password" with your own MySQL root user password.
      > mysql -u root -ppassword mysql < openmrs_pentaho_create_db.sql
    • The above script will also create a user called "test", that is used to connect to the schemas.
    • To create the database objects (tables, indexes, e.t.c.) and load test data execute the following commands:
      For the target database:
      > mysql -u test -ptest openmrs_dw < openmrs_dw.sql
      > mysql -u test -ptest openmrs_dw < etl_control.sql
      > mysql -u test -ptest openmrs_dw < etl_extract.sql
      For the ETL tool (Kettle) design repository
      > mysql -u test -ptest OPENMRS_REP
      mysql> source openmrs_rep.sql
      mysql> call p_repository_table_names_to_upper;
      For the other databases
      > mysql -u test -ptest hibernate < pentaho_hibernate.sql
      > mysql -u test -ptest quartz < pentaho_quartz.sql
      > mysql -u test -ptest shark < pentaho_shark.sql
  4. Install Pentaho BI Server
    • Download the latest stable release pentaho_j2ee_deployments-1.2.1.625-GA.zip from sourceforge (http://downloads.sourceforge.net/pentaho/pentaho_j2ee_deployments-1.2.1.625-GA.zip?modtime=1178540158&big_mirror=0) under Business Intelligence Suite, 1.2.1 (GA).
    • Unzip pentaho_j2ee_deployments-1.2.1.625-GA.zip to a directory of your choice, say [pentaho_home].
    • Copy from the pentaho-openmrs directory:
      images directory to [pentaho_home]/pentaho-webapp
      META-INF directory to [pentaho_home]/pentaho-webapp
      If the MySQL port is not the default (3306) or MySQL is installed in a different machine make the corresponding changes to the [pentaho_home]/pentaho-webapp/META-INF/context.xml file for your data sources.
    • Replace the [pentaho_home]/pentaho-webapp/jsp/Home.jsp file with the pentaho-openmrs/Home.jsp.
    • Replace the [pentaho_home]/pentaho-res/hibernate/mysql5/hibernate.cfg.xml file with the pentaho-openmrs/hibernate.cfg.xml.
    • From [pentaho_home] execute the following command:
      > ant war-pentaho-tomcat-mysql
    • Using the Tomcat Manager web application go to the "Select WAR file to upload" field.
    • Browse to [pentaho_home]/build/pentaho-wars/tomcat/mysql5/ and deploy the pentaho.war file.
    • Browse to [pentaho_home]/build/pentaho-wars/ and deploy the pentaho-style.war file.
    • Modify the following entries of the web.xml file of the pentaho web application, for your environment. The web.xml file is located under [Tomcat_Home]/webapps/pentaho/WEB-INF/.
      #*:<context-param>
      <param-name>base-url</param-name>
      <param-value>http://localhost:8080/pentaho/</param-value>
      </context-param>
      The above url will be used as the base url for the Pentaho web application.
    • Download and unzip the pentaho-solutions.zip.
    • Copy the pentaho-solutions directory next to the [Tomcat_Home] directory (they should be siblings).
      You may copy it to a directory of your choice, but change the following entry in the web.xml file:
      #*:<context-param>
      <param-name>solution-path</param-name>
      <param-value>pentaho-solutions</param-value>
      </context-param>
    • Set up your data source references. In the web.xml add the following entry
      #*:<resource-ref>
      <description>OpenMRS DW Connection</description>
      <res-ref-name>jdbc/openmrs_dw</res-ref-name>
      <res-type>javax.sql.DataSource</res-type>
      <res-auth>Container</res-auth>
      </resource-ref>
  5. Install Kettle
    • Download Kettle-2.5.0.zip under Pentaho Data Integration from sourceforge (http://downloads.sourceforge.net/pentaho/Kettle-2.5.0.zip?modtime=1178288438&big_mirror=0).
    • Unzip it to a directory of your choice.
    • Set an environment variable called KETTLE_HOME, which points at the installation directory.
    • Copy the following directories and files from pentaho-openmrs to the [KETTLE_HOME] directory:
      .kettle
      etl_init.bat, etl_init.sh
      etl_daily.bat, etl_daily.sh
    • Modify the kettle.properties file located at .kettle for your setup.
    • To run the job that loads data for dimensions that are loaded only once (e.g. date_dim), execute the following scripts in the [KETTLE_HOME] directory:
      In Windows, at a command prompt run: etl_init.bat
      In UNIX: sh etl_init.sh
    • This should be done only once. If you try to run it again the job will abort.
    • To run the jobs that load historical data from the OpenMRS database as well as changed and new data from the last successful execution to refresh the data warehouse database:
      In Windows, at a command prompt run: etl_daily.bat
      In UNIX: sh etl_daily.sh
    • The above scripts can be scheduled to run at specific time intervals using a scheduler, like Windows scheduler.
    • To monitor the execution of jobs and transformations use your favorite SQL client to connect to the DW database and access these tables: audit and job_audit.
    • Rejected records (records that cause errors) are stored in the error_event table.

Using your browser point to http://localhost:8080/pentaho to go to the Pentaho BI server homepage, which looks like the following image. Image: Pentaho_openmrs.png From there, use the menu Go->Solutions to browse the pages with the reports.

Image: Solutions.png

Developer`s Guide

TO DO: Describe the development process.

Schedule

Week 0 (May 28 - June 2)

  • Install / deploy OpenMRS.
  • Work with OpenMRS mentor to write specifications for the project
  • Sent example reports to Gjergji to go along with requirements


Week 1 (June 3 - 9)


Week 2 (June 10 - 16)

  • Create KETTLE integration module branch in Subversion

Week 3 (June 17 - 23)

Week 4 (June 24 - 30)

Week 5 (July 1 - 7)

Week 6 (July 8 - 14)

Resources

  • None