Memory consumption in Unix
Q: How can I measure the resource consumption and activity of the processes on my Unix computer?
--Probing in Peshtigo
A: Most people know about the ps command. It gives a high level summary of the status of processes on your system. It has many options, but it is a relatively crude way to figure out what your processes are doing.
Of all the available options, the best performance related summary comes from the BSD version /usr/ucb/ps uax, which collects all the process data in one go, sorts the output by CPU usage, then displays the result. The unsorted versions of ps loop through the processes, printing as they go. This spreads out the time at which the processes are measured, and the last line process measured is significantly later than the first. By collecting in one go, the sorted versions give a more consistent view of the system. A quick look at the most active processes can be obtained easily using this command:
% /usr/ucb/ps uax | head
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
adrianc 333 1.0 8.2 8372 5120 console S 09:28:38 0:29 /usr/openwin/bin/X
root 483 0.4 1.4 1016 872 pts/1 O 09:56:36 0:00 /usr/ucb/ps uax
adrianc 433 0.3 15.812492 9832 ? S 09:31:47 0:26 /export/framemaker
root 240 0.3 5.3 3688 3260 ? S 09:27:22 0:07 //opt/RICHPse/bin/
adrianc 367 0.2 4.2 3472 2620 ?? S 09:28:56 0:00 cmdtool -Wp 603 49
adrianc 484 0.1 0.9 724 540 pts/1 S 09:56:36 0:00 head
root 3 0.1 0.0 0 0 ? S 09:25:17 0:02 fsflush
adrianc 370 0.1 1.4 980 824 pts/1 S 09:28:57 0:00 /bin/csh
adrianc 358 0.1 2.6 2088 1616 console S 09:28:54 0:00 olwm -syncpid 357
This summary immediately tells you who is running the most active processes. The %CPU measure is a time-decayed average of recent CPU usage. %MEM tells you the proportion of the total RAM in your system is in use by each process (it won't add up to 100 percent as some RAM is shared by several processes). SZ is the size of the process address space. It's a good indicator of how much swap space the process needs. In some cases it includes memory mapped devices, so don't be surprised if the X process appears to be huge on an Ultra 1 with Creator framebuffer. RSS is the basis for %MEM, its the amount of RAM in use by the process. TT shows you which "teletype" the user is logged in on. S shows the status. "S" means sleeping, "O" means on-cpu or running, "R" means runnable and waiting for a CPU to become free. START is the time the process started up, and TIME is the total amount of CPU time it has used so far. COMMAND The same basic information is displayed by the well known freeware utilities top and proctool. Solstice Symon and most commercial performance tools also display this data.
Where does the data come from?
I'm sure you have noticed the strange entry that pops up in df when you are checking how much disk space is left.
% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t2d0s0 963662 782001 85301 91% /
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
/dev/dsk/c0t3d0s0 406854 290066 96448 76% /export/home
It seems that there is a filesystem called /proc, if you look at it, you find a list of numbers that correspond to the processes on your system.
% ls /proc
00000 00140 00197 00237 00309 00333 00358 00379 00586
00001 00149 00207 00239 00312 00334 00359 00382
00002 00152 00216 00240 00313 00342 00367 00385
00003 00154 00217 00258 00318 00349 00370 00388
Using ls -l you see the owner and size.
% /usr/ucb/ps uax | head
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
adrianc 333 0.6 8.0 8380 4984 console S 09:28:38 1:16 /usr/openwin/bin/X
% ls -l /proc/333
-rw------- 1 adrianc staff 8581120 Jul 26 09:28 /proc/333
The units in ps are KB, and 8581120/1024 = 8380 as expected. The owner and permissions are used to control access to the processes. The ps command has to be setuid root so be able to show anyone status of all processes, You can only debug or trace processes that you have permissions for.
As you might expect there is a manual page for /proc, if you are running Solaris 2.5 you will discover another manual page. There are a whole bunch of commands that use /proc, documented under the proc(1) manual page. The /proc programming interface is described in proc(4). Let's take a look at these commands, which live in /usr/proc/bin. Here's an excerpt from the manual page, to save you looking it up.
SunOS 5.5 Last change: 9 Nov 1994 1
proc(1) User Commands proc(1)
NAME
proc, pflags, pcred, pmap, pldd, psig, pstack, pfiles, pwdx,
pstop, prun, pwait, ptree, ptime - proc tools
DESCRIPTION
The proc tools are utilities which exercise features of
/proc (see proc(4)). Most of them take a list of process-
ids (pid); those that do also accept /proc/nnn as a
process-id, so the shell expansion /proc/* can be used to
specify all processes in the system.
pflags print the /proc tracing flags, the pend-
ing and held signals, and other /proc
status information for each lwp in each
process.
pcred print the credentials (effective, real
and saved UID's and GID's) of each pro-
cess.
pmap print the address space map of each pro-
cess.
pldd list the dynamic libraries linked into
each process, including shared objects
explicitly attached using dlopen(3X).
(See also ldd(1).)
psig list the signal actions of each process
(See signal(5).)
pstack print a hex+symbolic stack trace for
each lwp in each process.
pfiles report fstat(2) and fcntl(2) information
for all open files in each process.
pwdx print the current working directory of
each process.
pstop stop each process (PR_REQUESTED stop).
prun set each process running (inverse of
pstop).
pwait wait for all of the specified processes
to terminate.
ptree print the process trees containing the
specified pid's or users, with child
processes indented from their respective
parent processes. An argument of all
digits is taken to be a process-id, oth-
erwise it is assumed to be a user login
name. Default is all processes.
ptime time a command, such as the time(1) com-
mand, but using microstate accounting
for reproducible precision.
That's already opened up a lot more possibilities. The /proc interface is designed to support the process debugging and analyzing tools in Sun's Workshop development tools. There are a few tantalizing hints here. Look at the description of pflags, it mentions tracing. And ptime, what is microstate accounting? A way to get higher precision measurements? We need to dig further into the programming interface to find out, but first we'll take a look at a bundled tool that uses /proc to trace the system calls made by a process.
Tracing in Solaris 2
The /usr/bin/truss command has many useful features not found in the SunOS 4 trace command. It can trace child processes, and it can count and time system calls and signals. Other options allow named system calls to be excluded or focused on, and data structures can be printed out in full. Here is an excerpt showing a fragment of truss output with the -v option to set verbose mode for data structures, and an example of truss -c showing the system call counts.
% truss -v all cp NewDocument Tuning
execve("/usr/bin/cp", 0xEFFFFB28, 0xEFFFFB38) argc = 3
open("/usr/lib/libintl.so.1", O_RDONLY, 035737561304) = 3
mmap(0x00000000, 4096, PROT_READ, MAP_SHARED, 3, 0) = 0xEF7B0000
fstat(3, 0xEFFFF768) = 0
d=0x0080001E i=29585 m=0100755 l=1 u=2 g=2 sz=14512
at = Apr 27 11:30:14 PDT 1993 [ 735935414 ]
mt = Mar 12 18:35:36 PST 1993 [ 731990136 ]
ct = Mar 29 11:49:11 PST 1993 [ 733434551 ]
bsz=8192 blks=30 fs=ufs
....
% truss -c cp NewDocument Tuning
syscall seconds calls errors
_exit .00 1
write .00 1
open .00 10 4
close .01 7
creat .01 1
chmod .01 1
stat .02 2 1
lseek .00 1
fstat .00 4
execve .00 1
mmap .01 18
munmap .00 9
memcntl .01 1
---- --- ---
sys totals: .07 57 5
usr time: .02
elapsed: .43
I use truss a great deal to find out what a process is doing, which files are being read and written, and with truss -c you can see how long system calls take to execute on average, and where your system CPU time is coming from.
Who, what, when, how much?
Many processes live very short lives. You cannot see them with ps, but they may be so frequent that they dominate the load on your system. The only way to catch them is to ask the system to keep a record of every process that has run, who ran it, what was it, when did it start and end, and how much resource did it use. This is done by the system accounting subsystem. For some reason many administrators seem to have hang-ups about accounting. Perhaps it has connotations of "big brother is watching you," or they fear additional overhead. In truth, if Fred complains that his system is too slow, and the accounting records show that he spends all his time playing Doom, you should not be too sympathetic! The overhead of collecting accounting data is always present. When you turn on accounting, you are just enabling storage of a few bytes of useful data when a process exits.
Accounting data is most useful when measured over a long period of time. This can be useful on a network of workstations as well as on a single time-shared server. From this you can identify how often programs run, how much CPU time, I/O, and memory each program uses, and what work patterns throughout the week look like. To enable accounting to start immediately, enter the three commands shown below. Check out the section "Administering Security, Performance, and Accounting in Solaris 2" in the Solaris System Administration Answerbook and see the acctcom command. Some crontab entries must also be added to summarize and checkpoint the accounting logs. Collecting and checkpointing the accounting data itself puts a negligible additional load onto the system, but the summary scripts that run once a day or once a week can have a noticeable effect, so they should be scheduled to run out of hours.
# ln /etc/init.d/acct /etc/rc0.d/K22acct
# ln /etc/init.d/acct /etc/rc2.d/S22acct
# /etc/init.d/acct start
Starting process accounting
This is what your crontab file for the adm user should contain.
# crontab -l adm
#ident "@(#)adm 1.5 92/07/14 SMI" /* SVr4.0 1.2 */
#min hour day month weekday
0 * * * * /usr/lib/acct/ckpacct
30 2 * * * /usr/lib/acct/runacct 2gt; /var/adm/acct/nite/fd2log
30 9 * * 5 /usr/lib/acct/monacct
You get a daily accounting summary, but the one I like to keep track of is the monthly one stored in /var/adm/acct/fiscal. Here is an excerpt from fiscrpt07 on my home system.
Jul 26 09:30 1996 TOTAL COMMAND SUMMARY FOR FISCAL 07 Page 1
TOTAL COMMAND SUMMARY
COMMAND NUMBER TOTAL TOTAL TOTAL MEAN MEAN HOG CHARS BLOCKS
NAME CMDS KCOREMIN CPU-MIN REAL-MIN SIZE-K CPU-MIN FACTOR TRNSFD READ
TOTALS 26488 16062007.75 3960.11 494612.41 4055.95 0.15 0.01 17427899648 39944
mae 36 7142887.25 1501.73 2128.50 4756.45 41.71 0.71 2059814144 1653
sundgado 16 3668645.19 964.83 1074.34 3802.36 60.30 0.90 139549181 76
Xsun 29 1342108.55 251.32 9991.62 5340.18 8.67 0.03 2784769024 1295
xlock 32 1027099.38 726.87 4253.34 1413.04 22.71 0.17 4009349888 15
fountain 2 803036.25 165.11 333.65 4863.71 82.55 0.49 378388 1
netscape 22 489512.97 72.39 3647.61 6762.19 3.29 0.02 887353080 2649
maker4X. 10 426182.31 43.77 5004.30 9736.27 4.38 0.01 803267592 3434
wabiprog 53 355574.99 44.32 972.44 8022.87 0.84 0.05 355871360 570
imagetoo 21 257617.08 15.65 688.46 16456.60 0.75 0.02 64291840 387
java 235 203963.64 37.96 346.35 5373.76 0.16 0.11 155950720 240
aviator 2 101012.82 22.93 29.26 4406.20 11.46 0.78 2335744 40
se.sparc 18 46793.09 19.30 6535.43 2424.47 1.07 0.00 631756294 20
xv 3 40930.98 5.58 46.37 7337.93 1.86 0.12 109690880 28
It looks as if my kids have been using it to play games during the day! The commands reported are sorted by KCOREMIN, which is the product of the amount of CPU time used and the amount of RAM used while the command was active. CPU-MIN is the number of minutes of CPU time. REAL_MIN is the elapsed time for the commands. SIZE-K is an average value for the RSS over the active lifetime of the process. It does not include times when the process was not actually running. In Solaris 2.4 and earlier releases a bug causes this measure to be garbage. HOG FACTOR is the ratio of CPU-MIN to REAL-MIN. A high factor means that this command hogs the CPU whenever it is running. CHARS TRNSFD counts the number of characters read and written. BLOCKS READ counts data read from block devices (basically local disk filesystem reads and writes). The underlying data that is collected can be seen in the acct(4) manual page. The data structure looks like this, its very compact, around 40 bytes.
DESCRIPTION
Files produced as a result of calling acct(2) have records
in the form defined by lt;sys/acct.hgt;, whose contents are:
typedef ushort comp_t; /* pseudo "floating point" representation */
/* 3 bit base-8 exponent in the high */
/* order bits, and a 13-bit fraction */
/* in the low order bits. */
struct acct
{
char ac_flag; /* Accounting flag */
char ac_stat; /* Exit status */
uid_t ac_uid; /* Accounting user ID */
gid_t ac_gid; /* Accounting group ID */
dev_t ac_tty; /* control tty */
time_t ac_btime; /* Beginning time */
comp_t ac_utime; /* accounting user time in clock */
/* ticks */
comp_t ac_stime; /* accounting system time in clock */
/* ticks */
comp_t ac_etime; /* accounting total elapsed time in clock */
/* ticks */
comp_t ac_mem; /* memory usage in clicks (pages) */
comp_t ac_io; /* chars transferred by read/write */
comp_t ac_rw; /* number of block reads/writes */
char ac_comm[8]; /* command name */
};
Process data structures
There isn't a great deal of data in the accounting record. Let's see what is available from a process that is still running. Actually, a neat trick is to open an entry in /proc and wait until the process has exited. The fact that the /proc entry is still open means that you can still get the data described below. When you close the /proc "file" the zombie process will disappear.
These data structures are described in full in the proc(4) manual page. They are also available in the SE toolkit, so if you want to obtain the data and play around with it you should look at the code for ps-ax.se and msacct.se. The interface to /proc involves sending ioctl commands. The one that ps uses is called PIOCPSINFO and this is what you get back.
SunOS 5.5 Last change: 28 Mar 1995 2
proc(4) File Formats proc(4)
PIOCPSINFO
This returns miscellaneous process information such as that
reported by ps(1). p is a pointer to a prpsinfo structure
containing at least the following fields:
typedef struct prpsinfo {
char pr_state; /* numeric process state (see pr_sname) */
char pr_sname; /* printable character representing pr_state */
char pr_zomb; /* !=0: process terminated but not waited for */
char pr_nice; /* nice for cpu usage */
u_long pr_flag; /* process flags */
int pr_wstat; /* if zombie, the wait() status */
uid_t pr_uid; /* real user id */
uid_t pr_euid; /* effective user id */
gid_t pr_gid; /* real group id */
gid_t pr_egid; /* effective group id */
pid_t pr_pid; /* process id */
pid_t pr_ppid; /* process id of parent */
pid_t pr_pgrp; /* pid of process group leader */
pid_t pr_sid; /* session id */
caddr_t pr_addr; /* physical address of process */
long pr_size; /* size of process image in pages */
long pr_rssize; /* resident set size in pages */
u_long pr_bysize; /* size of process image in bytes */
u_long pr_byrssize; /* resident set size in bytes */
caddr_t pr_wchan; /* wait addr for sleeping process */
short pr_syscall; /* system call number (if in syscall) */ id_t pr_aslwpid; /* lwp id of the aslwp; zero if no aslwp */
timestruc_t pr_start; /* process start time, sec+nsec since epoch */
timestruc_t pr_time; /* usr+sys cpu time for this process */
timestruc_t pr_ctime; /* usr+sys cpu time for reaped children */
long pr_pri; /* priority, high value is high priority */
char pr_oldpri; /* pre-SVR4, low value is high priority */
char pr_cpu; /* pre-SVR4, cpu usage for scheduling */
u_short pr_pctcpu; /* % of recent cpu time, one or all lwps */
u_short pr_pctmem; /* % of system memory used by the process */
dev_t pr_ttydev; /* controlling tty device (PRNODEV if none) */
char pr_clname[PRCLSZ]; /* scheduling class name */
char pr_fname[PRFNSZ]; /* last component of exec()ed pathname */
char pr_psargs[PRARGSZ];/* initial characters of arg list */
int pr_argc; /* initial argument count */
char **pr_argv; /* initial argument vector */
char **pr_envp; /* initial environment vector */
} prpsinfo_t;
For a multithreaded process it is possible to get the data for each lightweight process separately. There's a lot more useful looking information there, but no sign of the high resolution microstate accounting that /usr/proc/bin/ptime (and msacct.se) display. They use a separate ioctl, PIOCUSAGE.
SunOS 5.5 Last change: 28 Mar 1995 18
proc(4) File Formats proc(4)
PIOCUSAGE
When applied to the process file descriptor, PIOCUSAGE
returns the process usage information; when applied to an
lwp file descriptor, it returns usage information for the
specific lwp. p points to a prusage structure which is
filled by the operation. The prusage structure contains at
least the following fields:
typedef struct prusage {
id_t pr_lwpid; /* lwp id. 0: process or defunct */
u_long pr_count; /* number of contributing lwps */
timestruc_t pr_tstamp; /* current time stamp */
timestruc_t pr_create; /* process/lwp creation time stamp */
timestruc_t pr_term; /* process/lwp termination time stamp */
timestruc_t pr_rtime; /* total lwp real (elapsed) time */
timestruc_t pr_utime; /* user level CPU time */
timestruc_t pr_stime; /* system call CPU time */
timestruc_t pr_ttime; /* other system trap CPU time */
timestruc_t pr_tftime; /* text page fault sleep time */
timestruc_t pr_dftime; /* data page fault sleep time */
timestruc_t pr_kftime; /* kernel page fault sleep time */
timestruc_t pr_ltime; /* user lock wait sleep time */
timestruc_t pr_slptime; /* all other sleep time */
timestruc_t pr_wtime; /* wait-cpu (latency) time */
timestruc_t pr_stoptime; /* stopped time */
u_long pr_minf; /* minor page faults */
u_long pr_majf; /* major page faults */
u_long pr_nswap; /* swaps */
u_long pr_inblk; /* input blocks */
u_long pr_oublk; /* output blocks */
u_long pr_msnd; /* messages sent */
u_long pr_mrcv; /* messages received */
u_long pr_sigs; /* signals received */
u_long pr_vctx; /* voluntary context switches */
u_long pr_ictx; /* involuntary context switches */
u_long pr_sysc; /* system calls */
u_long pr_ioch; /* chars read and written */
} prusage_t;
PIOCUSAGE can be applied to a zombie process (see
PIOCPSINFO).
Applying PIOCUSAGE to a process that does not have micro-
state accounting enabled will enable microstate accounting
and return an estimate of times spent in the various states
up to this point. Further invocations of PIOCUSAGE will
yield accurate microstate time accounting from this point.
To disable microstate accounting, use PIOCRESET with the
PR_MSACCT flag.
There is a lot of useful data here. The time spent waiting for various events is a key measure. I summarize it in msacct.se like this:
Elapsed time 3:20:50.049 Current time Fri Jul 26 12:49:28 1996
User CPU time 2:11.723 System call time 1:54.890
System trap time 0.006 Text pfault sleep 0.000
Data pfault sleep 0.023 Kernel pfault sleep 0.000
User lock sleep 0.000 Other sleep time 3:16:43.022
Wait for CPU time 0.382 Stopped time 0.000
The other thing to notice is that microstate accounting is not turned on by default. It slows the system down slightly, and while it was on by default up to Solaris 2.3, from Solaris 2.4 onwards it is only collected if you use it. The way CPU time is normally measured is to sample the state of all the CPUs from the clock interrupt 100 times a second. The way microstate accounting works, a high resolution timestamp is taken on every state change, every system call, every page fault, every scheduler change. It doesn't miss anything.
Wrap up
There are a lot more things that can be done with /proc, but I hope I have opened your eyes to some of the most useful performance information available in Solaris 2.
0 Comments:
Post a Comment
<< Home