Thursday, January 27, 2005

Memory mapped I/O

Other links
http://en.wikipedia.org/wiki/Memory-mapped_I/O
http://www.esacademy.com/automation/docs/c51primer/c07.htm - XWord etc.
http://www.xml.com/ldd/chapter/book/ch13.html

Memory-mapped I/O (MMIO) and port I/O (also called port-mapped I/O or PMIO) are two complementary methods of performing input/output between the CPU and I/O devices in a computer.

Memory-mapped I/O uses the same bus to address both memory and I/O devices, and the CPU instructions used to read and write memory are also used to access I/O devices. To accommodate the I/O devices, areas of the CPU's addressable space must be reserved for I/O rather than memory. The reservation does not have to be permanent; for example, the Commodore 64 could bank-switch between its I/O devices and regular memory. The I/O devices monitor the CPU's address bus and respond to any CPU access of their assigned address space, mapping the address to their hardware registers.

Port-mapped I/O uses a special class of CPU instructions specifically for performing I/O. It is generally found on Intel microprocessors, whose IN and OUT instructions read or write a byte (or, on later processors, a wider value) at an I/O device port. I/O devices have a separate address space from general memory, accomplished either by an extra "I/O" pin on the CPU's physical interface or by an entire bus dedicated to I/O.

Relative merits of the two I/O methods

The main advantage of using port-mapped I/O is on CPUs with a limited addressing capability. Because port-mapped I/O separates I/O access from memory access, the full address space can be used for memory. It is also obvious to a person reading an assembly language program listing when I/O is being performed, due to the special instructions that can only be used for that purpose.

The advantage of using memory-mapped I/O is that, by discarding the extra complexity that port I/O brings, a CPU requires less internal logic and is thus cheaper, faster and easier to build; this follows the basic tenets of reduced instruction set computing. As 16-bit CPU architectures have become obsolete and been replaced by 32-bit and 64-bit architectures in general use, reserving space on the memory map for I/O devices is no longer a problem. The fact that regular memory instructions are used to address devices also means that all of the CPU's addressing modes are available for I/O as well as memory.

With the popularisation of higher-level programming languages such as C and Lisp, which do not support generation of the special port-mapped I/O instructions without incompatible and proprietary extensions, port-mapped I/O has become remarkably cumbersome to use. Contrast this situation with when assembly language was dominant and port-mapped I/O instructions simplified the code.





Memory-mapped I/O is something you can do reasonably well in standard C and C++.

Device drivers communicate with peripheral devices through device registers. A driver sends commands or data to a device by storing into its device register, or retrieves status or data from a device by reading from its device register.

Many processors use memory-mapped I/O, which maps device registers to fixed addresses in the conventional memory space. To a C or C++ programmer, a memory-mapped device register looks very much like an ordinary data object. Programs can use ordinary assignment operators to move values to or from memory-mapped device registers.

Some processors use port-mapped I/O, which maps device registers to locations in a separate address space, typically smaller than the conventional memory space. On these processors, programs must use special machine instructions, such as the in and out instructions of the Intel x86 processors, to move data to or from device registers. To a C programmer, port-mapped device registers don't look quite like ordinary data.

The C and C++ standards are silent about port-mapped I/O. Programs that perform port-mapped I/O must use some nonstandard, platform-specific language or library extensions, or worse, assembly code. On the other hand, memory-mapped I/O is something you can do reasonably well within the standard language dialects.

This month, I'll look at different approaches you can use to refer to memory-mapped device registers.

Device register types
Some device registers might occupy just a byte; others may occupy a word or more. In C or C++, the simplest representation for a single device register is as an object of an appropriately sized and signed integer type. For example, you might declare a one-byte register as a char or a two-byte register as an unsigned short.

For example, the ARM Evaluator-7T is a single-board computer with a small assortment of memory-mapped peripheral devices. The board's documentation refers to the device registers as special registers. The special registers span 64KB starting at address 0x03FF0000. The memory is byte-addressable, but each register is a four-byte word aligned to an address that's a multiple of four. You could manipulate each special register as if it were an int or unsigned int. Some programmers prefer to use a type that specifies the physical size of the register more overtly, such as int32_t or uint32_t. (Types such as int32_t and uint32_t are defined in the C99 header <stdint.h>.)1

I prefer to use a symbolic type whose name conveys the meaning of the type rather than its physical extent, such as:

typedef unsigned int special_register;

Special registers are actually volatile entities — they may change state in ways that the compiler can't detect. Therefore, the typedef should be an alias for a volatile-qualified type, as in:

typedef unsigned int volatile special_register;
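To see why the volatile qualifier matters, consider a polling loop. The following is a minimal sketch, not code from the Evaluator-7T documentation: the READY bit and the wait_until_ready function are hypothetical names chosen for illustration. Because the register type is volatile, the compiler has to reload *status on every pass; without the qualifier it would be free to read the register once and spin on a stale copy.

```c
#include <assert.h>

typedef unsigned int volatile special_register;

/* Hypothetical status bit, used only for this illustration. */
enum { READY = 0x01 };

/* Poll a status register until READY is set.
   Returns the number of reads performed, or -1 if the bit was
   still clear after max_polls reads. */
int wait_until_ready(special_register *status, int max_polls)
{
    assert(status != 0 && max_polls > 0);
    for (int i = 1; i <= max_polls; ++i) {
        if (*status & READY)    /* volatile forces a fresh read each time */
            return i;
    }
    return -1;
}
```

The bounded loop is a design choice for the sketch: a driver that spins forever on a device that never responds is hard to debug, so the caller gets a timeout to act on.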

Many devices interact through a small collection of device registers, rather than just one. For example, the Evaluator-7T uses five special registers to control the two integrated timers:

  • TMOD: timer mode register
  • TDATA0: timer 0 data register
  • TDATA1: timer 1 data register
  • TCNT0: timer 0 count register
  • TCNT1: timer 1 count register

You can represent the timer registers as a struct defined as:

typedef struct dual_timers dual_timers;
struct dual_timers
{
    special_register TMOD;
    special_register TDATA0;
    special_register TDATA1;
    special_register TCNT0;
    special_register TCNT1;
};

The typedef before the struct definition elevates the name dual_timers from a mere tag to a full-fledged type name.2 I'd rather spell TCNT0 as count0, but TCNT0 is the name used throughout the product documentation, so it's probably best not to change it.

In C++, I'd define this struct as a class with appropriate member functions. Whether dual_timers is a C struct or a C++ class doesn't affect the following discussion.
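A struct only models the hardware correctly if the compiler lays the members out exactly where the registers are. The sketch below checks that, under two assumptions that go beyond the text: that unsigned int is four bytes on the target, and that the five registers occupy consecutive words in the order the struct lists them. The function names dual_timers_layout_ok and check_dual_timers_layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int volatile special_register;

typedef struct dual_timers dual_timers;
struct dual_timers
{
    special_register TMOD;
    special_register TDATA0;
    special_register TDATA1;
    special_register TCNT0;
    special_register TCNT1;
};

/* Returns 1 when each register lies one 4-byte word after the one
   before it and the struct contains no padding; 0 otherwise.
   Assumes unsigned int is 4 bytes wide on the target. */
int dual_timers_layout_ok(void)
{
    return sizeof(special_register) == 4
        && offsetof(dual_timers, TMOD)   == 0
        && offsetof(dual_timers, TDATA0) == 4
        && offsetof(dual_timers, TDATA1) == 8
        && offsetof(dual_timers, TCNT0)  == 12
        && offsetof(dual_timers, TCNT1)  == 16
        && sizeof(dual_timers) == 20;
}

/* Intended to be called once at startup, before touching the device. */
void check_dual_timers_layout(void)
{
    assert(dual_timers_layout_ok());
}
```

If the real register map has gaps between registers, the struct would need explicit filler members, and the expected offsets here would change accordingly.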

Positioning device registers
Some compilers provide language extensions that will let you position an object at a specified memory address. For example, using the TASKING C166/ST10 C Cross-Compiler's _at attribute you can write a global declaration such as:

unsigned short count _at(0xFF08);

to declare count as a memory-mapped device register residing at address 0xFF08. Other compilers offer #pragma directives to do something similar. However, the _at attribute and #pragma directives are nonstandard. Each compiler with such extensions is likely to support something different.

Standard C and C++ don't let you declare a variable so that it resides at a specified address. The common idiom for accessing a device register is to use a pointer whose value contains the register's address. For example, the timer registers on the Evaluator-7T reside at address 0x03FF6000. A program can access these registers via a pointer that points to that address. You can define that pointer as a macro, as in:

#define timers ((dual_timers *)0x03FF6000)

or as a constant pointer, as in:

dual_timers *const timers
= (dual_timers *)0x03FF6000;

Either way you define timers, you can use it to reach the timer registers. For example, the TMOD register contains bits that you can set to enable a timer and clear to disable a timer. You can define the masks for those bits as enumeration constants:

enum { TE0 = 0x01, TE1 = 0x08 };

Then you can disable both timers using:

timers->TMOD &= ~(TE0 | TE1);
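Because timers is just a pointer, the same register-manipulation code can be exercised away from the hardware by pointing it at an ordinary struct in memory. The sketch below takes that approach; disable_timers and enable_timer0 are hypothetical helpers that receive the register block as a parameter instead of using the fixed address, which is an assumption made so that the example can run on a desktop machine.

```c
#include <assert.h>

typedef unsigned int volatile special_register;

typedef struct dual_timers dual_timers;
struct dual_timers
{
    special_register TMOD;
    special_register TDATA0;
    special_register TDATA1;
    special_register TCNT0;
    special_register TCNT1;
};

enum { TE0 = 0x01, TE1 = 0x08 };

/* Clear both timer-enable bits and leave every other TMOD bit alone. */
void disable_timers(dual_timers *timers)
{
    assert(timers != 0);
    timers->TMOD &= ~(TE0 | TE1);
}

/* Set the enable bit for timer 0 only. */
void enable_timer0(dual_timers *timers)
{
    assert(timers != 0);
    timers->TMOD |= TE0;
}
```

On the board you would pass (dual_timers *)0x03FF6000; in a test you pass the address of a local dual_timers object and inspect its TMOD member afterwards.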

Weighing the alternatives
These two pointer definitions—the macro and the constant object—are largely interchangeable. However, they produce slightly different behavior and, on some platforms, generate slightly different machine code.

As I explained in an earlier column, the macro preprocessor is a distinct compilation phase.3 The preprocessor does macro substitution before the compiler does any other symbol processing. For example, given the macro definition for timers, the preprocessor transforms:

timers->TMOD &= ~(TE0 | TE1);

into:

((dual_timers *)0x03FF6000)->TMOD
&= ~(TE0 | TE1);

Later compilation phases never see the macro symbol timers; they see only the source text after macro substitution. Many compilers don't pass macro names on to their debuggers, in which case macro names are invisible to the debugger.

Macros have an even more serious problem: macro names don't observe the scope rules that apply to other names. For example, you can't restrict a macro to a local scope. Defining a macro within a function, as in:

void timer_handler()
{
#define timers ((dual_timers *)0x03FF6000)
...
}

doesn't make the macro local to the function. The macro is still effectively global. Similarly, you can't declare a macro as a member of a C++ class or namespace.

Actually, macro names are worse than global names. Names declared in inner scopes can temporarily hide names from outer scopes, but they can't hide macro names. Consequently, macros might substitute in places where you don't expect them to.

Declaring timers as a constant pointer avoids both of these problems. The name should be visible in your debugger, and if you declare it in a nonglobal scope, it should stay there.

On the other hand, with some compilers on some platforms, declaring timers as a constant pointer might—I emphasize might—produce slightly slower and larger code. The compiler might produce different code if you define the pointer globally or locally. It might produce different code if you compile the definition in C as opposed to C++. I'll explain what the differences are and why they occur in my next column.
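The scoping behavior of macros described above is easy to demonstrate. In this small sketch, LIMIT and the three functions are hypothetical names; the point is only that a #define written inside one function is still in force in the next, and that an ordinary declaration cannot reuse the name until the macro is removed with #undef.

```c
#include <assert.h>

int first(void)
{
#define LIMIT 10            /* written inside first(), but not local to it */
    return LIMIT;
}

int second(void)
{
    /* LIMIT is still defined here: macros ignore block scope. */
    assert(LIMIT == 10);
    return LIMIT + 1;
}

#undef LIMIT                /* ends the macro's lifetime explicitly */

int third(void)
{
    /* Before the #undef this line would have expanded to
       "int const 10 = 3;" and failed to compile. */
    int const LIMIT = 3;
    return LIMIT;
}
```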

Memory consumption in Unix

Q: How can I measure the resource consumption and activity of the processes on my Unix computer?
--Probing in Peshtigo

A: Most people know about the ps command. It gives a high level summary of the status of processes on your system. It has many options, but it is a relatively crude way to figure out what your processes are doing.

Of all the available options, the best performance-related summary comes from the BSD version /usr/ucb/ps uax, which collects all the process data in one go, sorts the output by CPU usage, then displays the result. The unsorted versions of ps loop through the processes, printing as they go. This spreads out the time at which the processes are measured, and the last process measured is significantly later than the first. By collecting in one go, the sorted versions give a more consistent view of the system. A quick look at the most active processes can be obtained easily using this command:

% /usr/ucb/ps uax | head
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
adrianc 333 1.0 8.2 8372 5120 console S 09:28:38 0:29 /usr/openwin/bin/X
root 483 0.4 1.4 1016 872 pts/1 O 09:56:36 0:00 /usr/ucb/ps uax
adrianc 433 0.3 15.8 12492 9832 ? S 09:31:47 0:26 /export/framemaker
root 240 0.3 5.3 3688 3260 ? S 09:27:22 0:07 //opt/RICHPse/bin/
adrianc 367 0.2 4.2 3472 2620 ?? S 09:28:56 0:00 cmdtool -Wp 603 49
adrianc 484 0.1 0.9 724 540 pts/1 S 09:56:36 0:00 head
root 3 0.1 0.0 0 0 ? S 09:25:17 0:02 fsflush
adrianc 370 0.1 1.4 980 824 pts/1 S 09:28:57 0:00 /bin/csh
adrianc 358 0.1 2.6 2088 1616 console S 09:28:54 0:00 olwm -syncpid 357

This summary immediately tells you who is running the most active processes. The %CPU measure is a time-decayed average of recent CPU usage. %MEM tells you the proportion of the total RAM in your system that is in use by each process (it won't add up to 100 percent, as some RAM is shared by several processes). SZ is the size of the process address space. It's a good indicator of how much swap space the process needs. In some cases it includes memory-mapped devices, so don't be surprised if the X process appears to be huge on an Ultra 1 with Creator framebuffer. RSS is the basis for %MEM; it's the amount of RAM in use by the process. TT shows you which "teletype" the user is logged in on. S shows the status: "S" means sleeping, "O" means on-cpu or running, and "R" means runnable and waiting for a CPU to become free. START is the time the process started up, and TIME is the total amount of CPU time it has used so far. COMMAND shows the command being run. The same basic information is displayed by the well-known freeware utilities top and proctool. Solstice Symon and most commercial performance tools also display this data.

Where does the data come from?
I'm sure you have noticed the strange entry that pops up in df when you are checking how much disk space is left.

% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t2d0s0 963662 782001 85301 91% /
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
/dev/dsk/c0t3d0s0 406854 290066 96448 76% /export/home

It seems that there is a filesystem called /proc. If you look at it, you find a list of numbers that correspond to the processes on your system.

% ls /proc
00000 00140 00197 00237 00309 00333 00358 00379 00586
00001 00149 00207 00239 00312 00334 00359 00382
00002 00152 00216 00240 00313 00342 00367 00385
00003 00154 00217 00258 00318 00349 00370 00388

Using ls -l you see the owner and size.

% /usr/ucb/ps uax | head
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
adrianc 333 0.6 8.0 8380 4984 console S 09:28:38 1:16 /usr/openwin/bin/X
% ls -l /proc/333
-rw------- 1 adrianc staff 8581120 Jul 26 09:28 /proc/333

The units in ps are KB, and 8581120/1024 = 8380 as expected. The owner and permissions are used to control access to the processes. The ps command has to be setuid root to be able to show anyone the status of all processes. You can only debug or trace processes that you have permissions for.
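The unit conversion that links the two listings can be written as a one-line helper. This is only a sketch of the arithmetic; bytes_to_kb is a hypothetical name, not part of ps or /proc.

```c
#include <assert.h>

/* Convert a size in bytes, as ls -l reports it for a /proc entry,
   to the KB units that ps uses for its SZ column. */
unsigned long bytes_to_kb(unsigned long bytes)
{
    return bytes / 1024UL;
}
```

Applied to the 8581120 bytes shown for /proc/333, it gives the 8380 that ps reported as SZ for the same process.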

As you might expect, there is a manual page for /proc. If you are running Solaris 2.5, you will discover another manual page as well. There are a whole bunch of commands that use /proc, documented under the proc(1) manual page. The /proc programming interface is described in proc(4). Let's take a look at these commands, which live in /usr/proc/bin. Here's an excerpt from the manual page, to save you looking it up.

SunOS 5.5 Last change: 9 Nov 1994 1

proc(1) User Commands proc(1)

NAME
proc, pflags, pcred, pmap, pldd, psig, pstack, pfiles, pwdx,
pstop, prun, pwait, ptree, ptime - proc tools

DESCRIPTION
The proc tools are utilities which exercise features of
/proc (see proc(4)). Most of them take a list of process-
ids (pid); those that do also accept /proc/nnn as a
process-id, so the shell expansion /proc/* can be used to
specify all processes in the system.

pflags   print the /proc tracing flags, the pending and held signals,
         and other /proc status information for each lwp in each
         process.

pcred    print the credentials (effective, real and saved UID's and
         GID's) of each process.

pmap     print the address space map of each process.

pldd     list the dynamic libraries linked into each process,
         including shared objects explicitly attached using
         dlopen(3X). (See also ldd(1).)

psig     list the signal actions of each process (See signal(5).)

pstack   print a hex+symbolic stack trace for each lwp in each
         process.

pfiles   report fstat(2) and fcntl(2) information for all open files
         in each process.

pwdx     print the current working directory of each process.

pstop    stop each process (PR_REQUESTED stop).

prun     set each process running (inverse of pstop).

pwait    wait for all of the specified processes to terminate.

ptree    print the process trees containing the specified pid's or
         users, with child processes indented from their respective
         parent processes. An argument of all digits is taken to be
         a process-id, otherwise it is assumed to be a user login
         name. Default is all processes.

ptime    time a command, such as the time(1) command, but using
         microstate accounting for reproducible precision.

That's already opened up a lot more possibilities. The /proc interface is designed to support the process debugging and analyzing tools in Sun's Workshop development tools. There are a few tantalizing hints here. Look at the description of pflags: it mentions tracing. And ptime: what is microstate accounting? A way to get higher precision measurements? We need to dig further into the programming interface to find out, but first we'll take a look at a bundled tool that uses /proc to trace the system calls made by a process.

Tracing in Solaris 2
The /usr/bin/truss command has many useful features not found in the SunOS 4 trace command. It can trace child processes, and it can count and time system calls and signals. Other options allow named system calls to be excluded or focused on, and data structures can be printed out in full. Here is an excerpt showing a fragment of truss output with the -v option to set verbose mode for data structures, and an example of truss -c showing the system call counts.

% truss -v all cp NewDocument Tuning
execve("/usr/bin/cp", 0xEFFFFB28, 0xEFFFFB38) argc = 3
open("/usr/lib/libintl.so.1", O_RDONLY, 035737561304) = 3
mmap(0x00000000, 4096, PROT_READ, MAP_SHARED, 3, 0) = 0xEF7B0000
fstat(3, 0xEFFFF768) = 0
d=0x0080001E i=29585 m=0100755 l=1 u=2 g=2 sz=14512
at = Apr 27 11:30:14 PDT 1993 [ 735935414 ]
mt = Mar 12 18:35:36 PST 1993 [ 731990136 ]
ct = Mar 29 11:49:11 PST 1993 [ 733434551 ]
bsz=8192 blks=30 fs=ufs
....

% truss -c cp NewDocument Tuning
syscall seconds calls errors
_exit .00 1
write .00 1
open .00 10 4
close .01 7
creat .01 1
chmod .01 1
stat .02 2 1
lseek .00 1
fstat .00 4
execve .00 1
mmap .01 18
munmap .00 9
memcntl .01 1
---- --- ---
sys totals: .07 57 5
usr time: .02
elapsed: .43

I use truss a great deal to find out what a process is doing and which files are being read and written. With truss -c you can see how long system calls take to execute on average, and where your system CPU time is coming from.

Who, what, when, how much?
Many processes live very short lives. You cannot see them with ps, but they may be so frequent that they dominate the load on your system. The only way to catch them is to ask the system to keep a record of every process that has run, who ran it, what was it, when did it start and end, and how much resource did it use. This is done by the system accounting subsystem. For some reason many administrators seem to have hang-ups about accounting. Perhaps it has connotations of "big brother is watching you," or they fear additional overhead. In truth, if Fred complains that his system is too slow, and the accounting records show that he spends all his time playing Doom, you should not be too sympathetic! The overhead of collecting accounting data is always present. When you turn on accounting, you are just enabling storage of a few bytes of useful data when a process exits.

Accounting data is most useful when measured over a long period of time. This can be useful on a network of workstations as well as on a single time-shared server. From this you can identify how often programs run, how much CPU time, I/O, and memory each program uses, and what work patterns throughout the week look like. To enable accounting to start immediately, enter the three commands shown below. Check out the section "Administering Security, Performance, and Accounting in Solaris 2" in the Solaris System Administration Answerbook and see the acctcom command. Some crontab entries must also be added to summarize and checkpoint the accounting logs. Collecting and checkpointing the accounting data itself puts a negligible additional load onto the system, but the summary scripts that run once a day or once a week can have a noticeable effect, so they should be scheduled to run out of hours.


# ln /etc/init.d/acct /etc/rc0.d/K22acct
# ln /etc/init.d/acct /etc/rc2.d/S22acct
# /etc/init.d/acct start
Starting process accounting

This is what your crontab file for the adm user should contain.


# crontab -l adm
#ident "@(#)adm 1.5 92/07/14 SMI" /* SVr4.0 1.2 */
#min hour day month weekday
0 * * * * /usr/lib/acct/ckpacct
30 2 * * * /usr/lib/acct/runacct 2> /var/adm/acct/nite/fd2log
30 9 * * 5 /usr/lib/acct/monacct

You get a daily accounting summary, but the one I like to keep track of is the monthly one stored in /var/adm/acct/fiscal. Here is an excerpt from fiscrpt07 on my home system.

Jul 26 09:30 1996 TOTAL COMMAND SUMMARY FOR FISCAL 07 Page 1


TOTAL COMMAND SUMMARY
COMMAND NUMBER TOTAL TOTAL TOTAL MEAN MEAN HOG CHARS BLOCKS
NAME CMDS KCOREMIN CPU-MIN REAL-MIN SIZE-K CPU-MIN FACTOR TRNSFD READ

TOTALS 26488 16062007.75 3960.11 494612.41 4055.95 0.15 0.01 17427899648 39944

mae 36 7142887.25 1501.73 2128.50 4756.45 41.71 0.71 2059814144 1653
sundgado 16 3668645.19 964.83 1074.34 3802.36 60.30 0.90 139549181 76
Xsun 29 1342108.55 251.32 9991.62 5340.18 8.67 0.03 2784769024 1295
xlock 32 1027099.38 726.87 4253.34 1413.04 22.71 0.17 4009349888 15
fountain 2 803036.25 165.11 333.65 4863.71 82.55 0.49 378388 1
netscape 22 489512.97 72.39 3647.61 6762.19 3.29 0.02 887353080 2649
maker4X. 10 426182.31 43.77 5004.30 9736.27 4.38 0.01 803267592 3434
wabiprog 53 355574.99 44.32 972.44 8022.87 0.84 0.05 355871360 570
imagetoo 21 257617.08 15.65 688.46 16456.60 0.75 0.02 64291840 387
java 235 203963.64 37.96 346.35 5373.76 0.16 0.11 155950720 240
aviator 2 101012.82 22.93 29.26 4406.20 11.46 0.78 2335744 40
se.sparc 18 46793.09 19.30 6535.43 2424.47 1.07 0.00 631756294 20
xv 3 40930.98 5.58 46.37 7337.93 1.86 0.12 109690880 28

It looks as if my kids have been using it to play games during the day! The commands reported are sorted by KCOREMIN, which is the product of the amount of CPU time used and the amount of RAM used while the command was active. CPU-MIN is the number of minutes of CPU time. REAL-MIN is the elapsed time for the commands. SIZE-K is an average value for the RSS over the active lifetime of the process. It does not include times when the process was not actually running. In Solaris 2.4 and earlier releases a bug causes this measure to be garbage. HOG FACTOR is the ratio of CPU-MIN to REAL-MIN. A high factor means that this command hogs the CPU whenever it is running. CHARS TRNSFD counts the number of characters read and written. BLOCKS READ counts data read from block devices (basically local disk filesystem reads and writes). The underlying data that is collected can be seen in the acct(4) manual page. The data structure looks like this; it's very compact, around 40 bytes.

DESCRIPTION
Files produced as a result of calling acct(2) have records
in the form defined by <sys/acct.h>, whose contents are:

typedef ushort comp_t; /* pseudo "floating point" representation */
/* 3 bit base-8 exponent in the high */
/* order bits, and a 13-bit fraction */
/* in the low order bits. */

struct acct
{
char ac_flag; /* Accounting flag */
char ac_stat; /* Exit status */
uid_t ac_uid; /* Accounting user ID */
gid_t ac_gid; /* Accounting group ID */
dev_t ac_tty; /* control tty */
time_t ac_btime; /* Beginning time */
comp_t ac_utime; /* accounting user time in clock */
/* ticks */
comp_t ac_stime; /* accounting system time in clock */
/* ticks */
comp_t ac_etime; /* accounting total elapsed time in clock */
/* ticks */
comp_t ac_mem; /* memory usage in clicks (pages) */
comp_t ac_io; /* chars transferred by read/write */
comp_t ac_rw; /* number of block reads/writes */
char ac_comm[8]; /* command name */
};
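The comp_t comment describes a compact encoding that has to be expanded before the times can be used. The sketch below follows that description literally: the low 13 bits are taken as the fraction and the high 3 bits as a base-8 exponent, so each exponent step is a 3-bit shift. The function names are hypothetical, and record_hog_factor assumes that CPU time means user plus system time, which the text does not state explicitly.

```c
#include <assert.h>

typedef unsigned short comp_t;

/* Expand a comp_t: low 13 bits are the fraction, high 3 bits are a
   base-8 exponent. The widest result needs more than 32 bits. */
unsigned long long comp_expand(comp_t c)
{
    unsigned long long fraction = c & 0x1FFFu;
    unsigned int exponent = (c >> 13) & 0x7u;
    return fraction << (3 * exponent);  /* multiplying by 8 is a 3-bit shift */
}

/* CPU time divided by elapsed time, in any common unit.
   Returns 0.0 when no elapsed time was recorded. */
double hog_factor(double cpu, double elapsed)
{
    return elapsed > 0.0 ? cpu / elapsed : 0.0;
}

/* Hog factor for a single accounting record, taking CPU time to be
   user ticks plus system ticks (an assumption of this sketch). */
double record_hog_factor(comp_t utime, comp_t stime, comp_t etime)
{
    double cpu = (double)(comp_expand(utime) + comp_expand(stime));
    return hog_factor(cpu, (double)comp_expand(etime));
}
```

As a check against the monthly report, hog_factor(1501.73, 2128.50) comes out at roughly 0.71, which is the HOG FACTOR printed on the mae line.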

Process data structures
There isn't a great deal of data in the accounting record. Let's see what is available from a process that is still running. Actually, a neat trick is to open an entry in /proc and wait until the process has exited. The fact that the /proc entry is still open means that you can still get the data described below. When you close the /proc "file" the zombie process will disappear.

These data structures are described in full in the proc(4) manual page. They are also available in the SE toolkit, so if you want to obtain the data and play around with it you should look at the code for ps-ax.se and msacct.se. The interface to /proc involves sending ioctl commands. The one that ps uses is called PIOCPSINFO and this is what you get back.

SunOS 5.5 Last change: 28 Mar 1995 2

proc(4) File Formats proc(4)

PIOCPSINFO
This returns miscellaneous process information such as that
reported by ps(1). p is a pointer to a prpsinfo structure
containing at least the following fields:

typedef struct prpsinfo {
char pr_state; /* numeric process state (see pr_sname) */
char pr_sname; /* printable character representing pr_state */
char pr_zomb; /* !=0: process terminated but not waited for */
char pr_nice; /* nice for cpu usage */
u_long pr_flag; /* process flags */
int pr_wstat; /* if zombie, the wait() status */
uid_t pr_uid; /* real user id */
uid_t pr_euid; /* effective user id */
gid_t pr_gid; /* real group id */
gid_t pr_egid; /* effective group id */
pid_t pr_pid; /* process id */
pid_t pr_ppid; /* process id of parent */
pid_t pr_pgrp; /* pid of process group leader */
pid_t pr_sid; /* session id */
caddr_t pr_addr; /* physical address of process */
long pr_size; /* size of process image in pages */
long pr_rssize; /* resident set size in pages */
u_long pr_bysize; /* size of process image in bytes */
u_long pr_byrssize; /* resident set size in bytes */
caddr_t pr_wchan; /* wait addr for sleeping process */
short pr_syscall; /* system call number (if in syscall) */
id_t pr_aslwpid; /* lwp id of the aslwp; zero if no aslwp */
timestruc_t pr_start; /* process start time, sec+nsec since epoch */
timestruc_t pr_time; /* usr+sys cpu time for this process */
timestruc_t pr_ctime; /* usr+sys cpu time for reaped children */
long pr_pri; /* priority, high value is high priority */
char pr_oldpri; /* pre-SVR4, low value is high priority */
char pr_cpu; /* pre-SVR4, cpu usage for scheduling */
u_short pr_pctcpu; /* % of recent cpu time, one or all lwps */
u_short pr_pctmem; /* % of system memory used by the process */
dev_t pr_ttydev; /* controlling tty device (PRNODEV if none) */
char pr_clname[PRCLSZ]; /* scheduling class name */
char pr_fname[PRFNSZ]; /* last component of exec()ed pathname */
char pr_psargs[PRARGSZ];/* initial characters of arg list */
int pr_argc; /* initial argument count */
char **pr_argv; /* initial argument vector */
char **pr_envp; /* initial environment vector */
} prpsinfo_t;

For a multithreaded process it is possible to get the data for each lightweight process separately. There's a lot more useful looking information there, but no sign of the high resolution microstate accounting that /usr/proc/bin/ptime (and msacct.se) display. They use a separate ioctl, PIOCUSAGE.

SunOS 5.5 Last change: 28 Mar 1995 18

proc(4) File Formats proc(4)

PIOCUSAGE
When applied to the process file descriptor, PIOCUSAGE
returns the process usage information; when applied to an
lwp file descriptor, it returns usage information for the
specific lwp. p points to a prusage structure which is
filled by the operation. The prusage structure contains at
least the following fields:

typedef struct prusage {
id_t pr_lwpid; /* lwp id. 0: process or defunct */
u_long pr_count; /* number of contributing lwps */
timestruc_t pr_tstamp; /* current time stamp */
timestruc_t pr_create; /* process/lwp creation time stamp */
timestruc_t pr_term; /* process/lwp termination time stamp */
timestruc_t pr_rtime; /* total lwp real (elapsed) time */
timestruc_t pr_utime; /* user level CPU time */
timestruc_t pr_stime; /* system call CPU time */
timestruc_t pr_ttime; /* other system trap CPU time */
timestruc_t pr_tftime; /* text page fault sleep time */
timestruc_t pr_dftime; /* data page fault sleep time */
timestruc_t pr_kftime; /* kernel page fault sleep time */
timestruc_t pr_ltime; /* user lock wait sleep time */
timestruc_t pr_slptime; /* all other sleep time */
timestruc_t pr_wtime; /* wait-cpu (latency) time */
timestruc_t pr_stoptime; /* stopped time */
u_long pr_minf; /* minor page faults */
u_long pr_majf; /* major page faults */
u_long pr_nswap; /* swaps */
u_long pr_inblk; /* input blocks */
u_long pr_oublk; /* output blocks */
u_long pr_msnd; /* messages sent */
u_long pr_mrcv; /* messages received */
u_long pr_sigs; /* signals received */
u_long pr_vctx; /* voluntary context switches */
u_long pr_ictx; /* involuntary context switches */
u_long pr_sysc; /* system calls */
u_long pr_ioch; /* chars read and written */
} prusage_t;

PIOCUSAGE can be applied to a zombie process (see
PIOCPSINFO).

Applying PIOCUSAGE to a process that does not have micro-
state accounting enabled will enable microstate accounting
and return an estimate of times spent in the various states
up to this point. Further invocations of PIOCUSAGE will
yield accurate microstate time accounting from this point.
To disable microstate accounting, use PIOCRESET with the
PR_MSACCT flag.

There is a lot of useful data here. The time spent waiting for various events is a key measure. I summarize it in msacct.se like this:

Elapsed time 3:20:50.049 Current time Fri Jul 26 12:49:28 1996
User CPU time 2:11.723 System call time 1:54.890
System trap time 0.006 Text pfault sleep 0.000
Data pfault sleep 0.023 Kernel pfault sleep 0.000
User lock sleep 0.000 Other sleep time 3:16:43.022
Wait for CPU time 0.382 Stopped time 0.000

The other thing to notice is that microstate accounting is not turned on by default. It slows the system down slightly, and while it was on by default up to Solaris 2.3, from Solaris 2.4 onwards it is only collected if you use it. The way CPU time is normally measured is to sample the state of all the CPUs from the clock interrupt 100 times a second. The way microstate accounting works, a high resolution timestamp is taken on every state change, every system call, every page fault, every scheduler change. It doesn't miss anything.
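The difference between the two measurement methods can be illustrated with a toy model. This is not how Solaris implements either scheme; it is a sketch under simplifying assumptions: the clock interrupt fires exactly every 10 ms, and the process runs one burst per tick interval at a fixed offset from the interrupt. All names are hypothetical.

```c
#include <assert.h>

enum { TICK_US = 10000 };   /* 100 clock interrupts per second */

/* CPU time charged by sampling. The process runs for burst_us
   microseconds, starting offset_us after each clock interrupt, and
   is charged a whole tick whenever the interrupt finds it running. */
long sampled_cpu_us(long offset_us, long burst_us, long ticks)
{
    assert(offset_us >= 0 && burst_us >= 0);
    assert(offset_us + burst_us <= TICK_US);
    /* The interrupt fires at offset 0, so it only sees the process
       if the burst covers that instant. */
    int running_at_tick = (offset_us == 0 && burst_us > 0);
    return running_at_tick ? ticks * TICK_US : 0;
}

/* CPU time from timestamps taken at every state change: the sum of
   the burst lengths, regardless of where they fall. */
long microstate_cpu_us(long burst_us, long ticks)
{
    return burst_us * ticks;
}
```

In this model a process that runs for 2 ms in the middle of every interval is charged nothing by sampling over 100 ticks, while the timestamp method records 200 ms. The same process aligned with the interrupt is charged a full second by sampling, five times what it used.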

Wrap up
There are a lot more things that can be done with /proc, but I hope I have opened your eyes to some of the most useful performance information available in Solaris 2.

Monday, January 17, 2005

Soft updates: how to maintain data consistency

http://www.usenix.org/publications/library/proceedings/usenix99/full_papers/mckusick/mckusick.pdf

linux log structured file system: disk writes ordering semantics

Thursday, January 13, 2005

page cache vs buffer cache