STATA offers high profile data management capabilities which can be usefull for anay data analyst.In this tutorial we are going to handle various ways we can use to solve problems relating data.
For a faster view into how to ingest data into stata you can see this post.As for this one I will quickly ingest system data here quickly.
. sysuse auto.dta,clear
(1978 Automobile Data)
In STATA once you have loaded your dataset you can get a view of your
dataset by running the list
comand.Further you can specify
how many rows by adding in 1/number of rows
comand where
the ��number of rows us an integer representing the number of
observations.��
. list in 1/5 //lists the first five rows of data +------------------------------------------------------------------------------------------------------------------+ | make price mpg rep78 headroom trunk weight length turn displa~t gear_r~o foreign | |------------------------------------------------------------------------------------------------------------------| 1. | AMC Concord 4,099 22 3 2.5 11 2,930 186 40 121 3.58 Domestic | 2. | AMC Pacer 4,749 17 3 3.0 11 3,350 173 40 258 2.53 Domestic | 3. | AMC Spirit 3,799 22 . 3.0 12 2,640 168 35 121 3.08 Domestic | 4. | Buick Century 4,816 20 3 4.5 16 3,250 196 40 196 2.93 Domestic | 5. | Buick Electra 7,827 15 4 4.0 20 4,080 222 43 350 2.41 Domestic | +------------------------------------------------------------------------------------------------------------------+
The browse
and edit
commmands can also be used
to to open an interactive data viewing and editing modes respectively in
stata pop up windows.
browse //opens stata data browser window edit // opens stata data editor window
Also note that in the data editor window you can rename your variable.
For a quick glimpse into the dataset structure you can run the
describe
command.This gives you information on how
variables are stored in STATA
. describe Contains data from C:\Program Files (x86)\Stata13\ado\base/a/auto.dta obs: 74 1978 Automobile Data vars: 12 13 Apr 2013 17:45 size: 3,182 (_dta has notes) ----------------------------------------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label ----------------------------------------------------------------------------------------------------------------------------------------------------------- make str18 %-18s Make and Model price int %8.0gc Price mpg int %8.0g Mileage (mpg) rep78 int %8.0g Repair Record 1978 headroom float %6.1f Headroom (in.) trunk int %8.0g Trunk space (cu. ft.) weight int %8.0gc Weight (lbs.) length int %8.0g Length (in.) turn int %8.0g Turn Circle (ft.) displacement int %8.0g Displacement (cu. in.) gear_ratio float %6.2f Gear Ratio foreign byte %8.0g origin Car type ----------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by: foreign
We can see that the data has 74 observations and 12 features.The variable names ate also given and their labels alongside storage types.
For a more refined detail about the dataset you are working with in
stata ,codebook
can be of great help in stata.When followed
with a specific variable name the codebook returns infomation only
concerning the variable else all
. codebook price ----------------------------------------------------------------------------------------------------------------------------------------------------------- price Price ----------------------------------------------------------------------------------------------------------------------------------------------------------- type: numeric (int) range: [3291,15906] units: 1 unique values: 74 missing .: 0/74 mean: 6165.26 std. dev: 2949.5 percentiles: 10% 25% 50% 75% 90% 3895 4195 5006.5 6342 11385
In stata we are able to select observations using the in
method which specifies the number of observations we need in a given
range.
For example.
. list price in 40/45 +--------+ | price | |--------| 40. | 4,195 | 41. | 10,371 | 42. | 4,647 | 43. | 4,425 | 44. | 4,482 | |--------| 45. | 6,486 | +--------+
Lists 40th to 45th observations in the price variable in our dataset.In case of negative numbers STATA will return observations from the end.
The in
keyword can also accept leters for example if you
are intrested of viewing the last 5 (tail) observations of your dataset
you can just tipe the code below.
. list in -5/L +----------------------------------------------------------------------------------------------------------------+ | make price mpg rep78 headroom trunk weight length turn displa~t gear_r~o foreign | |----------------------------------------------------------------------------------------------------------------| 70. | VW Dasher 7,140 23 4 2.5 12 2,160 172 36 97 3.74 Foreign | 71. | VW Diesel 5,397 41 5 3.0 15 2,040 155 35 90 3.78 Foreign | 72. | VW Rabbit 4,697 25 4 3.0 15 1,930 155 35 89 3.78 Foreign | 73. | VW Scirocco 6,850 25 4 2.0 16 1,990 156 36 97 3.78 Foreign | 74. | Volvo 260 11,995 17 5 2.5 14 3,170 193 37 163 2.98 Foreign | +----------------------------------------------------------------------------------------------------------------+
This is used in cases where a dataset needs to be split into multiple files.Lets say you want to devide the dataset into two groups namely foreign or domestic or if its demographic data and you want males and females separated.
Lets see miles per galon of the domestic cars that have a price greater than 15000 .
. list mpg if foreign == 0 & price > 12000 +-----+ | mpg | |-----| 12. | 14 | 13. | 21 | 27. | 12 | 28. | 14 | +-----+
Only four cars meet the condition set above.The & opperator makes stata to check if both conditions are met then scrapes off the data that satisfies it.Its important to note that in this dataset ,the label domestic was coded 0 and 1 for foreign
To check any missing data in stata ,we can run the if
missing
comand as shown below.
. list if missing()
We can see that there is no missing information from the data above.
Within the missing parentheses ,a particular variable can be added to specify where the missing data is being sought.
Also before the missing if
command,a list of variable can
be specified.
. list mpg rep78 trunk if missing(gear_ratio)
The code above tells stata to return observations in the mpg,rep and trunk variables where missing values are found.
So what information is comtained in that dataset you are using?To get a
quick glimpse of the summary statistics,You can run the
summary
command in the console.
The summary command can be specified with a variable/variables infront of it,to summarize all the variables,it should be left empty.
Note also that the summarize command can be shortened as sum and still the summary is gotten.
. summarize rep78 foreign Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- rep78 | 69 3.405797 .9899323 1 5 foreign | 74 .2972973 .4601885 0 1
To merge datsets ,we call the merge
command and specify
which columns to use for merging.
Merge has so many options to choose from.
-
One-to-one merging;
-
Many-to-one merging
-
One-to-many merging
-
Many-to-many merging
-
One-to-one merging by specified observation
Merge produces a new variable ,_merge,containing ,numeric codes concerning the source and the cotents of each observation in the merged dataset.
. use "C:\Users\RuralNet011\Desktop\datascience\analysis\StataTest\Question 1 Stata Files\11.dta", clear . . // 1:1 merging . . merge 1:1 partid using "C:\Users\RuralNet011\Desktop\datascience\analysis\StataTest\Question 1 Stata Files\12.dta" (note: variable TimeStamp was str15, now str16 to accommodate using data's values) Result # of obs. ----------------------------------------- not matched 0 matched 395 (_merge==3) -----------------------------------------
At sompoint we may need to drop some varibles or observations in a data set.This is majorly caused by overprescence of missing vaulues or maybe the ate not just needed for analysis.Lets see how to work out this.
To drop a variable we use drop
command followed with the
variable names list.Below i am dropning the price variable.
. sysuse auto.dta ,clear (1978 Automobile Data) . drop price
To drop an observation,we use an if
and/or in
qualifier after the drop command.
By using logical operators you can specify more than one condition for dropping observations.
Below i am going to drop observations where the variable foreign column is domestic and price is greater than $ 6000.
. sysuse auto.dta ,clear (1978 Automobile Data) . drop if foreign == 1 & price > 6000 (9 observations deleted)
And we can see that 9 observations are deleted.As mentioned above whe can use in to specify the number of observations to delete.
Below am dropping 5 observations from the dataset.
. drop in 1/5
(5 observations deleted)
By using keep,you are telling stata to drop all variables except those that are specified or by using an if and/or in expression.
. // keeping only price,mpg and foreign variables . . keep price mpg foreign . . list in 1/5 +-------------------------+ | price mpg foreign | |-------------------------| 1. | 5,788 18 Domestic | 2. | 4,453 26 Domestic | 3. | 5,189 20 Domestic | 4. | 10,372 16 Domestic | 5. | 4,082 19 Domestic | +-------------------------+ . . // Deleting 19 observations . . keep in 10 (59 observations deleted) . . list +------------------------+ | price mpg foreign | |------------------------| 1. | 5,705 16 Domestic | +------------------------+ . . // keeping observation by condition . . keep if price > 2000 (0 observations deleted) . . list +------------------------+ | price mpg foreign | |------------------------| 1. | 5,705 16 Domestic | +------------------------+