Quick Guide to Auditing a (PostgreSQL) Database

According to Wikipedia, ‘database auditing’ involves observing a database so as to be aware of the actions of its users; a bit like ‘spying’ on user activity.
This can of course be useful if you have a database with multiple users that stores some sort of ‘confidential’ information.


Previously I implemented some code to do this in SQL Server, which basically involved:

  • creating a table to store the audit information;
  • creating an ‘encoding scheme’ for this table (i.e.: how to distinguish ‘edits’ from ‘removes’, etc.);
  • creating triggers for every table that I wanted to audit, to do the ‘housework’ (these were, of course, generated from a script).

Moreover, these ‘encoded changes’ were exported to JSON, as they were the basis for a synchronization system that I had implemented.

This was working fine, until I had to port the database to another RDBMS, which gave me the opportunity to rethink the structure I had in place.
Instead of porting the code directly from T-SQL to PL/pgSQL, I did a little research and (gladly!) found out that the PostgreSQL wiki documents a ready-made, trigger-based recipe for auditing. It took me about 10 minutes to get to grips with it, which is what I describe next.

The first thing is to create a table to store the ‘changes’. The PostgreSQL wiki actually advises using a separate schema for this, which I agree is a good idea.

-- create a schema named "audit"
CREATE schema audit;
REVOKE CREATE ON schema audit FROM public;

CREATE TABLE audit.logged_actions (
    schema_name text NOT NULL,       -- schema of the audited table
    table_name text NOT NULL,        -- name of the audited table
    user_name text,                  -- session user that made the change
    action_tstamp timestamp with time zone NOT NULL DEFAULT current_timestamp,
    action text NOT NULL CHECK (action IN ('I','D','U')),  -- Insert, Delete or Update
    original_data text,              -- row value before the change
    new_data text,                   -- row value after the change
    query text                       -- the query that triggered the change
) WITH (fillfactor=100);

REVOKE ALL ON audit.logged_actions FROM public;
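
Since all privileges on the table have just been revoked from ‘public’, regular users cannot read or tamper with the audit trail (the trigger function shown further down still can, as it runs with SECURITY DEFINER). If some users should nevertheless be able to inspect it, access can be granted back selectively; a minimal sketch, assuming a hypothetical ‘auditor’ role:

-- Hypothetical 'auditor' role that may read, but never modify, the audit trail
GRANT USAGE ON SCHEMA audit TO auditor;
GRANT SELECT ON audit.logged_actions TO auditor;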

The stored information is: the relevant schema and table names, the user name (so it is important to enforce a user policy here) and a timestamp. Then we have the ‘type’ of action, which is one of the following: insert, delete or update (neither TRUNCATE nor SELECT statements are covered, which I guess is ok). Then we have the previous and the current value. For inserts, we have a change from nothing to something; for deletes, a change from something to nothing; and for updates, a change from something to something. In a way, you could figure out the change type just by looking at these values, so the ‘action’ field is perhaps a bit redundant; but then you would have to represent ‘nothing’ as some sort of special value (or keyword) rather than an empty space, which could be mistaken for an empty string.
Finally we have the ‘query’, which is the exact query that triggered the audit record. Although this is also somewhat redundant, since it could be reconstructed from the other values, it is not a bad idea, as it allows us to quickly see and reproduce exactly what happened.
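
Just to illustrate that redundancy: assuming the missing side of a change is stored as NULL (as the trigger function further down does), the change type can be derived from the data columns alone. A minimal sketch:

-- Sketch: deriving the change type from the data columns alone.
-- If 'nothing' were encoded as an empty string instead of NULL,
-- this would be ambiguous, which is why the explicit field is handy.
SELECT CASE
           WHEN original_data IS NULL THEN 'I'  -- nothing -> something
           WHEN new_data IS NULL THEN 'D'       -- something -> nothing
           ELSE 'U'                             -- something -> something
       END AS derived_action,
       action                                   -- should match the derived value
FROM audit.logged_actions;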

Then we can create some indexes:

CREATE INDEX logged_actions_schema_table_idx
ON audit.logged_actions(((schema_name||'.'||table_name)::TEXT));

CREATE INDEX logged_actions_action_tstamp_idx
ON audit.logged_actions(action_tstamp);

CREATE INDEX logged_actions_action_idx
ON audit.logged_actions(action);
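
By the way, for the first (expression) index to be usable, queries must filter on exactly the same concatenated expression. A small example, listing the most recent changes to a given table (using the ‘fr_frame’ table from the example further down):

-- Audit trail of one table, most recent changes first;
-- the WHERE clause matches the expression index created above.
SELECT action_tstamp, user_name, action, query
FROM audit.logged_actions
WHERE (schema_name||'.'||table_name)::TEXT = 'public.fr_frame'
ORDER BY action_tstamp DESC;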

The next step is to create the trigger function that will ‘fill’ this table on the relevant actions:

CREATE OR REPLACE FUNCTION audit.if_modified_func() RETURNS TRIGGER AS $body$
DECLARE
    v_old_data TEXT;
    v_new_data TEXT;
BEGIN
    /* If this were actually for real auditing (where you need to log EVERY action),
    then you would need to use something like dblink or plperl that could log outside the transaction,
    regardless of whether the transaction committed or rolled back.
    */

    /* This dance with casting the NEW and OLD values to a ROW is not necessary in pg 9.0+ */

    IF (TG_OP = 'UPDATE') THEN
        v_old_data := ROW(OLD.*);
        v_new_data := ROW(NEW.*);
        INSERT INTO audit.logged_actions (schema_name, table_name, user_name, action, original_data, new_data, query)
        VALUES (TG_TABLE_SCHEMA::TEXT, TG_TABLE_NAME::TEXT, session_user::TEXT, substring(TG_OP,1,1), v_old_data, v_new_data, current_query());
        RETURN NEW;
    ELSIF (TG_OP = 'DELETE') THEN
        v_old_data := ROW(OLD.*);
        INSERT INTO audit.logged_actions (schema_name, table_name, user_name, action, original_data, query)
        VALUES (TG_TABLE_SCHEMA::TEXT, TG_TABLE_NAME::TEXT, session_user::TEXT, substring(TG_OP,1,1), v_old_data, current_query());
        RETURN OLD;
    ELSIF (TG_OP = 'INSERT') THEN
        v_new_data := ROW(NEW.*);
        INSERT INTO audit.logged_actions (schema_name, table_name, user_name, action, new_data, query)
        VALUES (TG_TABLE_SCHEMA::TEXT, TG_TABLE_NAME::TEXT, session_user::TEXT, substring(TG_OP,1,1), v_new_data, current_query());
        RETURN NEW;
    ELSE
        RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - Other action occurred: %, at %', TG_OP, now();
        RETURN NULL;
    END IF;

EXCEPTION
    WHEN data_exception THEN
        RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [DATA EXCEPTION] - SQLSTATE: %, SQLERRM: %', SQLSTATE, SQLERRM;
        RETURN NULL;
    WHEN unique_violation THEN
        RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [UNIQUE] - SQLSTATE: %, SQLERRM: %', SQLSTATE, SQLERRM;
        RETURN NULL;
    WHEN OTHERS THEN
        RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [OTHER] - SQLSTATE: %, SQLERRM: %', SQLSTATE, SQLERRM;
        RETURN NULL;
END;
$body$
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = pg_catalog, audit;

The ‘TG_OP’ variable inside a trigger is a string that tells us which operation fired the trigger (INSERT, UPDATE, or DELETE). The rest is fiddling around with the ‘old’ and ‘new’ row values.

Actually, this is all there is to it regarding having the audit ‘structure’ in place for PostgreSQL. To put it ‘in action’, auditing a table is as simple as this:

CREATE TRIGGER fr_frame_audit
AFTER INSERT OR UPDATE OR DELETE ON fr_frame
FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();

In the example above, I created a trigger for auditing table ‘fr_frame’ (of course, you can generate these statements from a script or stored procedure if you want to audit every table in the database, as sketched below).
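
A minimal sketch of that idea, assuming all the tables of interest live in the ‘public’ schema (and a PostgreSQL version, 9.1+, that has DO blocks and format()):

-- Sketch: attach the audit trigger to every ordinary table in 'public'.
-- Assumes no table already has a trigger with the generated name.
DO $$
DECLARE
    t record;
BEGIN
    FOR t IN SELECT tablename FROM pg_tables WHERE schemaname = 'public'
    LOOP
        EXECUTE format(
            'CREATE TRIGGER %I
             AFTER INSERT OR UPDATE OR DELETE ON public.%I
             FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();',
            t.tablename || '_audit', t.tablename);
    END LOOP;
END;
$$;
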
Then I went to table ‘fr_frame’ and deleted a row. It got stored like this:

"public";"fr_frame";"postgres";"2014-02-11 12:36:52.063113+00";"D";"(67,"bin frame",missing,missing,1,1,missing)";"";"DELETE FROM public.fr_frame WHERE id = '67'::integer"

Then I modified and added a row; it got stored like this:

"public";"fr_frame";"postgres";"2014-02-11 12:38:54.793902+00";"U";"(59,"Sampling Frame",missing,"Initial Sampling frame",1,2,missing)";"(59,"Sampling Frame","Sampling Frame","Initial Sampling frame",1,2,missing)";"UPDATE public.fr_frame SET nameeng='Sampling Frame'::text WHERE id = '59'::integer"
"public";"fr_frame";"postgres";"2014-02-11 12:39:14.290898+00";"I";"";"(68,test,test,test,1,2,test)";"INSERT INTO public.fr_frame(name, nameeng, description, id_cloned_previous_frame, id_source, comments) VALUES ('test'::text, 'test'::text, 'test'::text, '1'::integer, '2'::smallint, 'test'::text)"

As I said before, encoding non-existing values (‘nothing’) as an empty string is not the most accurate approach, but since we have more information to complement it, it works. Also, the fact that the audit is row-based rather than field-based (as in my previous implementation) means that whole rows get serialized as text, which is not exactly normalized… on the other hand, I can accept it from the point of view of storage.

Overall, weighing all the pros and cons, I think this is a good solution, and it gets high fives for being so simple to implement.

You can find this information (and more) in the PostgreSQL wiki.
