The COMPUSTAT database from Standard & Poor's is one of the most extensive databases of business financial data available. It provides annual and quarterly income statement and balance sheet data on over 10,000 firms dating back to the 1950s.
The files are all stored on Unix in the /usr/finance/compustat directory.
This page documents a variety of things, from installing COMPUSTAT onto Unix from tape, to running jobs and accessing the data. The file cutoff date for the 1999 files is 7/20/1999.
The annual and quarterly files are each split into 3 categories:
If you want all firms that meet certain criteria without survival bias, you need to access all three files for a given time period.
The annual files are also each split into 3 20-year time periods. For the 1998 files these years are:
Notice that the backdata and wayback data overlap. This is because COMPUSTAT wants to send 20 years' worth of data on every annual tape.
The quarterly files are each split into 4 12-year time periods for a total of 48 quarters of data for each tape. For the 1999 files these years are:
Again, notice that for the quarterly data the way wayback data and the wayback data overlap. I know the year overlaps don't look right for the current versus backdata, so check the dates on the quarterly files as you extract data from them.
I am assuming that the COMPUSTAT data is loaded onto a hard drive.
The COMPUSTAT tapes come zipped up via ftp. The format is really the same as the ASCII files that were once shipped on 2.5 GB 8mm tapes. These files have no end-of-block or end-of-record delimiters, so you have to insert them after the files are unzipped. The easiest way is to copy the unzipped files to ASCII files with the Unix dd command. Below are examples of the dd command for copying the annual, quarterly, pst, and bif files.
For the PST Annual, FCOTC Annual, Annual Research, Annual Backdata files use the following dd statement. The blocksize and record sizes are both 8332.
dd if=name_of_flat_input_file of=name_of_output_file ibs=8332 cbs=8332 conv=unblock
Be sure to name the output file something easily identifiable. We put these in /usr/finance/compustat/ and give them names such as:
fcotc_way_backdata_1950-1969.dat
mrged_annual_research_backdata_1959-1978.dat
pst_annual_current.dat
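To see what conv=unblock does before running it on a full file, it can help to try it on a toy input. The sketch below is illustrative only: the 8-byte record length and the flat_demo file names are made up and are not COMPUSTAT values.

```shell
# Toy demonstration of conv=unblock. The 8-byte record length and the
# flat_demo.* file names are hypothetical, chosen only for illustration.

# Build a 24-byte flat file: three 8-byte, space-padded records, no newlines.
printf 'AAAA    BBBBBB  CC      ' > flat_demo.bin

# cbs=8 cuts the stream into 8-byte records; unblock drops each record's
# trailing spaces and appends a newline.
dd if=flat_demo.bin of=flat_demo.dat ibs=24 cbs=8 conv=unblock 2>/dev/null

cat flat_demo.dat
```

Because unblock strips trailing spaces, a line in the output file can be shorter than cbs. A program that expects fixed-width records should be prepared to pad short lines.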
For the PST Quarterly, FCOTC Quarterly, Quarterly Research, and Quarterly Backdata files use the following dd statement. The blocksize is 27552 and the record size is 9184.
dd if=name_of_flat_input_file of=name_of_output_file ibs=27552 cbs=9184 conv=unblock
(I think the quarterly blocksize is 27552, which is 3 records of 9184 bytes. They changed it for the 1997 files.)
We also put these in the /usr/finance/compustat directory and call them
pst_qtrly_current.dat
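When the sizes are in doubt, one cheap sanity check is to confirm that the unzipped flat file is a whole number of records long before running dd. The following is a minimal sketch: check_reclen and demo_qtr.bin are hypothetical names, and the check assumes the flat file carries no extra padding. It says nothing about the blocksize, only the record length.

```shell
# check_reclen FILE RECLEN
# Prints the record count if FILE's size is a whole number of RECLEN-byte
# records; otherwise prints a warning to stderr and returns 1.
# (check_reclen is a hypothetical helper, not part of any COMPUSTAT tooling.)
check_reclen() {
  file=$1
  reclen=$2
  size=$(wc -c < "$file" | tr -d ' ')
  if [ $((size % reclen)) -ne 0 ]; then
    echo "$file: $size bytes is not a multiple of $reclen" >&2
    return 1
  fi
  echo $((size / reclen))
}

# Stand-in for a real quarterly flat file: three 9184-byte records.
dd if=/dev/zero of=demo_qtr.bin bs=9184 count=3 2>/dev/null
check_reclen demo_qtr.bin 9184
```

If the check fails for a real file, the record length passed as cbs is probably wrong for that year's files.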
US PDE file
dd if=name_of_flat_input_file of=name_of_output_file ibs=19632 cbs=3272 conv=unblock
We call it ......
Canadian PDE file
dd if=name_of_flat_input_file of=name_of_output_file ibs=20928 cbs=3488 conv=unblock
We call it ......
Reference File of SIC Codes
dd if=name_of_flat_input_file of=name_of_output_file ibs=7200 cbs=80 conv=unblock
we call it.......
SIC File
dd if=name_of_flat_input_file of=name_of_output_file ibs=4800 cbs=240 conv=unblock
We call it bif_sic.dat
Industry Segment File
dd if=name_of_flat_input_file of=name_of_output_file ibs=7740 cbs=774 conv=unblock
We call it bif_industry_segment.dat
Geographic Segment File
dd if=name_of_flat_input_file of=name_of_output_file ibs=8040 cbs=804 conv=unblock
We call it bif_geographic_segment.dat
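The dd statements above differ only in their ibs and cbs values, so the conversions can be driven from a small table. This is a sketch under assumptions: unblock_file is a hypothetical helper and the .flat input names are placeholders for whatever the unzipped files are called. The output names and sizes are the ones given above.

```shell
# unblock_file INPUT OUTPUT IBS CBS
# Thin wrapper around the dd statement used throughout this page.
# (unblock_file is a hypothetical helper name.)
unblock_file() {
  dd if="$1" of="$2" ibs="$3" cbs="$4" conv=unblock 2>/dev/null
}

# Each table row: input output ibs cbs.
# The *.flat input names are placeholders; substitute the real unzipped names.
while read -r src dst ibs cbs; do
  if [ ! -f "$src" ]; then
    echo "skipping $src (not found)" >&2
    continue
  fi
  unblock_file "$src" "$dst" "$ibs" "$cbs"
done <<'EOF'
sic.flat bif_sic.dat 4800 240
industry_segment.flat bif_industry_segment.dat 7740 774
geographic_segment.flat bif_geographic_segment.dat 8040 804
EOF
```

Keeping the sizes in one table makes it easier to update them when the file layout changes from one year to the next.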
Fortran access to the COMPUSTAT data is pretty straightforward after it is loaded onto disk. Note, however, that there is a BUG IN THE SUN FORTRAN F90 VERSION 1.2 COMPILER that makes things interesting if you don't know about it!!! The insidious little rascal will not allow you to read a record that is longer than 267 characters, and SUN has done very little to document its existence and its fix. You will know that you have the bug if you get a lib-1217 runtime error when you try to access any file other than the SIC file or the Reference File of SIC Codes. There are two fixes.
1. Don't use f90 to compile. Use f77 to compile instead. This works fine, except that you are probably using f90 to access the CRSP files, and those bozos at University of Chicago will only support f90 version 1.1 or later (and begrudgingly at that). They won't provide the source code for the runtime libraries so you can use whichever compiler you want. This leaves you having to use one compiler for COMPUSTAT programs and another compiler for CRSP programs. Kind of a pain in the pattootie.
2. This is lots better. Put a recl=nnnn specifier in the open statement, where nnnn is the correct record length for the file. Even though the compiler is supposed to ignore this specifier in a sequential access open, it needs it.
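To find out which converted files have lines longer than the 267-character limit described above, the longest line in each file can be measured from the shell. This is a rough sketch: max_line_len and demo_len.dat are hypothetical names, and it assumes an awk that handles lines several thousand characters long. Since dd's unblock strips trailing spaces, the result is a lower bound on the record length, not a substitute for the documented cbs value.

```shell
# max_line_len FILE
# Prints the length of the longest line in FILE (0 for an empty file).
# (max_line_len is a hypothetical helper name.)
max_line_len() {
  awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

# Small stand-in file; a real check would point at one of the .dat files.
printf '%s\n' abc abcdefgh ab > demo_len.dat

len=$(max_line_len demo_len.dat)
if [ "$len" -gt 267 ]; then
  echo "demo_len.dat: longest line is $len, over the 267-character limit"
else
  echo "demo_len.dat: longest line is $len, within the 267-character limit"
fi
```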
Tape Name | filename on Unix (Note: I will rewrite these to binary unformatted files. Then the file suffix will be .bin instead of .ascii) | dcb information |
PST ANNUAL Current | pst_ann.ascii | ibs=8332 cbs=8332 conv=unblock |
FCOTC ANNUAL Current | fcotc_ann.ascii | ibs=8332 cbs=8332 conv=unblock |
MRGED (PST&FCOTC) ANNUAL RESEARCH Current | mrged_ann_res.ascii | ibs=8332 cbs=8332 conv=unblock |
CDN CDN$ ANNUAL Current | cdn_ann.ascii | ibs=8332 cbs=8332 conv=unblock |
PST ANNUAL Backdata 1959-1978 | pst_ann_back.ascii | ibs=8332 cbs=8332 conv=unblock |
FCOTC ANNUAL Backdata 1959-1978 | fcotc_ann_back.ascii | ibs=8332 cbs=8332 conv=unblock |
MRGED (PST&FCOTC) ANNUAL RESEARCH Backdata 1959-1978 | mrged_ann_res_back.ascii | ibs=8332 cbs=8332 conv=unblock |
PST ANNUAL Wayback 1950-1969 | pst_ann_wback.ascii | ibs=8332 cbs=8332 conv=unblock |
FCOTC ANNUAL Wayback 1950-1969 | fcotc_ann_wback.ascii | ibs=8332 cbs=8332 conv=unblock |
MRGED (PST&FCOTC) ANNUAL RESEARCH Wayback 1950-1969 | mrged_ann_res_wback.ascii | ibs=8332 cbs=8332 conv=unblock |
Tape Name | filename on Unix (Note: I will rewrite these to binary unformatted files. Then the file suffix will be .bin instead of .ascii) | dcb information |
PST QTRLY CURRENT | pst_qtr.ascii | ibs=27552 cbs=9184 conv=unblock |
FCOTC QTRLY CURRENT | fcotc_qtr.ascii | ibs=27552 cbs=9184 conv=unblock |
MRGED (PST&FCOTC) QTRLY RESEARCH CURRENT | mrged_qtr_res.ascii | ibs=27552 cbs=9184 conv=unblock |
CDN CDN$ QTRLY Current | cdn_qtr.ascii | ibs=27552 cbs=9184 conv=unblock |
PST QTRLY Backdata 1977-1988 | pst_qtr_back.ascii | ibs=27552 cbs=9184 conv=unblock |
FCOTC QTRLY Backdata 1977-1988 | fcotc_qtr_back.ascii | ibs=27552 cbs=9184 conv=unblock |
MRGED (PST&FCOTC) QTRLY RESEARCH Backdata 1977-1988 | mrged_qtr_res_back.ascii | ibs=27552 cbs=9184 conv=unblock |
PST QTRLY Wayback 1966-1977 | pst_qtr_wback.ascii | ibs=27552 cbs=9184 conv=unblock |
FCOTC QTRLY Wayback 1966-1977 | fcotc_qtr_wback.ascii | ibs=27552 cbs=9184 conv=unblock |
MRGED (PST&FCOTC) QTRLY RESEARCH Wayback 1966-1977 | mrged_qtr_res_wback.ascii | ibs=27552 cbs=9184 conv=unblock |
PST QTRLY Way Wayback 1962-1973 | pst_qtr_wwback.ascii | ibs=27552 cbs=9184 conv=unblock |
FCOTC QTRLY Way Wayback 1962-1973 | fcotc_qtr_wwback.ascii | ibs=27552 cbs=9184 conv=unblock |
MRGED (PST&FCOTC) QTRLY RESEARCH Way Wayback 1962-1973 | mrged_qtr_res_wwback.ascii | ibs=27552 cbs=9184 conv=unblock |
Tape Name | filename on Unix (Note: I will rewrite these to binary unformatted files. Then the file suffix will be .bin instead of .ascii) | dcb information |
Reference file of SIC Codes | (missing)reference_sic.ascii | ibs=7200 cbs=80 conv=unblock |
SIC File | (missing)sic.ascii | ibs=4800 cbs=240 conv=unblock |
S&P Index Fundamentals -- Annual | sandp_fundamentals.ascii (Compustat needs to send me documentation on this one) | ?what is this file? |
BIF Industry Segment Current | industry_segment.ascii | ibs=7740 cbs=774 conv=unblock |
BIF Geographic Segment Current | geographic_segment.ascii | ibs=8040 cbs=804 conv=unblock |
When data is missing from the Compustat Database, Compustat assigns a missing observation code. You have to be careful when reading the data, because the missing observations are just coded as special numbers, and you might process the data as if it were actual accounting data rather than a missing observation code. Following are the codes that Compustat uses, and what they mean:
Code | Meaning |
-0.001 | Data not available. |
-0.007 | Not meaningful. |
-0.004 | Combined figure. This item is combined into another data item. |
-0.008 | Insignificant figure. The company has reported this item as insignificant. |
-0.002 | Semi-annual figure. If the data is only available on a semi-annual basis, then this code appears in the first and third quarter. Actual data appear in the second and fourth quarters. |
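One way to avoid treating these codes as accounting data is to label each value before doing any arithmetic on it. The sketch below assumes the values have already been extracted from the records, one number per line; label_missing is a hypothetical helper, and the record layout needed to do that extraction is not covered here.

```shell
# label_missing
# Reads one numeric value per line on stdin and prints "value<TAB>label",
# where label names the Compustat missing code from the table above,
# or "data" for anything else. (label_missing is a hypothetical helper.)
label_missing() {
  awk '{
    # Normalize to three decimals so -.001 and -0.0010 both match.
    code = sprintf("%.3f", $1)
    if      (code == "-0.001") label = "not available"
    else if (code == "-0.002") label = "semi-annual"
    else if (code == "-0.004") label = "combined"
    else if (code == "-0.007") label = "not meaningful"
    else if (code == "-0.008") label = "insignificant"
    else                       label = "data"
    printf "%s\t%s\n", $1, label
  }'
}

printf '%s\n' 12.5 -0.001 -0.004 0.000 | label_missing
```

Note that the match is on the value rounded to three decimals, so a genuine figure that happens to round to one of the codes would be labeled as missing. Whether that can occur depends on the data item.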